I was invited to give a talk for the Centre for the History of the Book at the University of Edinburgh. I took the opportunity to talk through some of the methodological challenges facing researchers of ebooks.
I’ve been seriously working on research for my history of the Kindle for a couple of years now and I’m still figuring out how to capture the impact of the Kindle on the scale of both the publishing/technology industry and the individual reader.
This tension is clearest when looking at the available data on reading and the shared highlights. There are a large number of individuals making personal choices behind the 500,000 shared highlights of a single edition of Wuthering Heights. If we scale this to over 4 million ebooks and 40 million Kindle users, it becomes extremely difficult to focus on both the local and global trends (and doubly so when access to the data is obfuscated or entirely unavailable): What counts as an appropriate sample? To what degree can individual highlights link to the mass of activity? How much data can I even get hold of?
While I ponder these questions, there’s still the problem of method. In order to figure this out, here’s a pilot study of the Harry Potter series as a complete unit that is manageable yet has received a fair amount of attention.
On the global level, shared highlights might not be able to tell us much about readership because an unknown number of readers choose not to highlight or share their efforts. The benefit of using Harry Potter, however, comes from the fact it is possible to gauge popularity across the series.
In recent versions of the Kindle software, a helpful pop-up box appears “About This Book” when opening a title for the first time. Luckily, this pop-up contains the total number of shared highlights and how many unique sections of the title have been highlighted. (These may not necessarily be up-to-date, but all the data here comes from 20 October 2015)
The data from the Harry Potter series reveals some interesting patterns. Figure 1 shows the total volume of shared highlights for each title, while figure 2 looks at the number of unique highlights per title. The most striking part of figure 1 is that the visible highlights (the top 10 most shared highlights) barely represent 10% of all shared highlights for any individual title.
Figure 1. Total highlights for each Harry Potter title and the visible top 10 highlights (click for full size)
Figure 2. Unique highlights for each Harry Potter title (click for full size)
While the two graphs appear to show that the popularity of the series plummets after the first novel, picks up towards the middle and drops again at the end, there is a far simpler explanation: the longer books receive more highlights as there is more text to highlight.
The only notable exception is Harry Potter and the Philosopher’s Stone, where more readers are focusing on particular passages. The large increase in total highlights without a similar increase in unique highlights likely indicates that more people are reading the first book than the rest of the series, or at the very least, they lose enthusiasm after the first book.
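The two ratios behind this reading can be made explicit. The following is a minimal sketch, not the method used to produce the figures: the function names are my own, and the numbers in the usage lines are round placeholders rather than the counts recorded for any Harry Potter title.

```python
def highlights_per_passage(total_highlights: int, unique_passages: int) -> float:
    """Average number of times each highlighted passage was shared."""
    if unique_passages <= 0:
        raise ValueError("unique_passages must be positive")
    return total_highlights / unique_passages


def top_ten_share(top_ten_counts: list[int], total_highlights: int) -> float:
    """Fraction of all shared highlights accounted for by the visible top 10."""
    if total_highlights <= 0:
        raise ValueError("total_highlights must be positive")
    return sum(top_ten_counts) / total_highlights


# Placeholder values for illustration only (not data from the post).
concentration = highlights_per_passage(1000, 100)  # 10.0 highlights per passage
visible = top_ten_share([10] * 10, 1000)           # 0.1, i.e. 10% of the total
```

A title whose total rises while its unique count stays flat would show a higher `highlights_per_passage` value, which is the pattern described for the first book.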
The second macroscopic view we can get from the Popular Highlights is the location of the shared highlights. Jordan Ellenberg has coined the Piketty Index as a way of using popular highlight locations to see how far through a book a reader got before quitting. From the evidence I’m gathering, it looks like the top 10 shared highlights are more likely to appear at the beginning of a book than the end, but what about the Harry Potter series?
Figure 3. Top 10 Shared Highlights for each Harry Potter title (click image for full size)
As a series, readers are more likely to highlight passages at the end of the book than the beginning. Not only does this suggest that readers are likely to finish the books, but through looking at the content of the highlights from the end of the book, it is clear that some of the most popular parts of the titles are Dumbledore’s speeches to Harry and the denouement of the narrative. Given the make-up of Rowling’s series and the slow start of most of the books, this inversion makes sense.
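One way to quantify where highlights fall is to express each highlight's position as a fraction of the book's length and then average. This is a rough sketch under my own assumptions: it treats positions as Kindle-style location numbers, the function names are hypothetical, and it is not presented as Ellenberg's exact formula. The sample locations are placeholders.

```python
def mean_relative_position(locations: list[int], book_length: int) -> float:
    """Mean position of the highlights as a fraction of the book (0 = start, 1 = end)."""
    if not locations or book_length <= 0:
        raise ValueError("need at least one location and a positive book length")
    return sum(locations) / len(locations) / book_length


def share_in_second_half(locations: list[int], book_length: int) -> float:
    """Fraction of highlights that fall after the midpoint of the book."""
    if not locations or book_length <= 0:
        raise ValueError("need at least one location and a positive book length")
    late = [loc for loc in locations if loc > book_length / 2]
    return len(late) / len(locations)


# Placeholder locations for illustration only (not data from the post).
early_heavy = mean_relative_position([100, 200, 300], 1000)  # 0.2
late_heavy = share_in_second_half([600, 700, 900, 100], 1000)  # 0.75
```

A value well above 0.5 on either measure would match the inversion described here, with readers highlighting towards the end of each book.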
And that’s about as much as you can deduce from looking at the global level as far as I can tell. Once I’ve dug into the more traditional annotations and highlights of individual readers, I’ll compare the results with the broad patterns identified here.
One of the problems with studying digital texts is coming up with a bibliographic description that captures enough information for others to identify the object (and often to replicate the conditions in which it was read). Unsurprisingly, ebooks have thrown up some interesting challenges for budding digital bibliographers.
Alan Galey has explored this issue across formats in The Enkindling Reciter. From this analysis, it is clear that the format of the ebook is important to record. For example, when talking about Walter Isaacson’s biography of Steve Jobs, the bibliographic record should indicate that the text was the ‘[Kindle edition]’ or ‘[EPUB]’. This is becoming standard practice in several venues, but is this sufficient to identify an edition?
Unfortunately, ebooks are likely to automatically update. Luckily, Amazon have several ways of identifying versions of a text:
Even this information is not sufficient for an accurate bibliographic description since, as I have argued elsewhere, the ebook must be considered as a platform of at least four different layers: hardware, software, format and content. Without mapping all of these elements, it is impossible to accurately describe an ebook.
Just five words from Isaacson’s biography (“KOBUN CHINO. A Sōtō Zen…”) are sufficient to demonstrate why we need to pay closer attention to more than just the format of an ebook.
In the paperback edition of the text, the text is formatted with small caps and macrons on both the ‘o’s in Sōtō:
Walter Isaacson (2013) Steve Jobs. New York: Simon & Schuster, xiii.
The second generation Kindle renders this in a slightly different manner:
This in turn is slightly different from the Kindle for Android, iPad, Mac & Cloud Reader edition:
Android 4.4.2 (Sony Xperia D2005 | Kindle for Android 18.104.22.168)
iOS 8.4 (iPad MD522B/A | Kindle for iPad 4.10)
Mac OS X 10.10.4 (Kindle for Mac 1.11.2)
Kindle Cloud Reader (Chrome 44.0.2403.125 | Mac OS X 10.10.4)
Variation in font and reading preferences aside, there are clear differences between versions that are of interest for the descriptive bibliographer. There are two major differences I want to highlight:
The first is a clear limitation of the Kindle platform and its design. Rather than using the rich and varied palette of a Unicode encoding such as UTF-8 (allowing users to include a wide range of alphabets, and more importantly, emoji!), Amazon chose the much more restrictive Latin-1 encoding, which covers the diacritics and punctuation common to Western European languages but not a lot else.
Unfortunately, this did not include the ‘o’ with macron, which just so happens to appear twice in a single word. Luckily, rather than simply removing the macrons, the producers have used a workaround by including an image of the character. Unfortunately, the image does not scale properly with the text and it only works with black text on a white background.
This has a couple of consequences for the ebook itself too, since it makes it impossible to search for ‘Sōtō’, as the text is either rendered into two single character words, or worse, turned into ‘St’. Not only does this make the word difficult to search for, but it also affects the quality of the Kindle’s text-to-speech facilities.
Sōtō rendered as “saint”
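The encoding problem is easy to reproduce outside the Kindle. The sketch below uses Python's standard codecs to show that ‘ō’ has no Latin-1 representation, and that silently dropping unencodable characters leaves exactly the ‘St’ described above. It illustrates the character-set limitation only; it makes no claim about how Amazon's own conversion tools process text, and the helper name is my own.

```python
def fits_latin1(text: str) -> bool:
    """Return True if every character in text can be encoded as Latin-1."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


word = "Sōtō"

# The macron 'o' (U+014D) lies outside Latin-1, so the word cannot be stored as-is.
print(fits_latin1(word))    # False
print(fits_latin1("café"))  # True: 'é' is part of Latin-1

# Dropping the unencodable characters leaves only the consonants.
stripped = word.encode("latin-1", errors="ignore").decode("latin-1")
print(stripped)             # St

# UTF-8 has no such gap: each 'ō' takes two bytes, the other letters one each.
print(len(word.encode("utf-8")))  # 6
```

A text-to-speech engine handed the stripped string has nothing left to pronounce but ‘St’.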
While the first bibliographic glitch was readily visible, the second would be difficult to spot without comparing different versions of the same edition. Formatting standards such as HTML, which ebooks use as their basic logic, are not hard laws, but recommendations for how to display text which can vary between different interpreters. Small caps is one of those features which is not universally supported by different instances of the Kindle application.
This may appear to be a minor aesthetic variation, but once again, it has an effect on the functionality of the ebook. Due to the variation in parsing the ‘small caps’ formatting tag, different versions of the Kindle software do not agree on whether the start of the ‘small caps’ formatting represents the start of a new word.
For example, Kobun Chino’s second name is rendered as ‘C hino’ on the iPad version, but remains ‘Chino’ on the Kindle for Mac version. This is a problem for readers who try to look up the name through the dictionary, Wikipedia or X-Ray, as the surname may be rendered as two separate words. Again, the text-to-speech functions of the Kindle stumble on this split word too, rendering some of the accessibility functions difficult to navigate.
CHINO VS. C HINO
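The split can be imitated with any HTML parser. In this sketch the markup is hypothetical (the actual Kindle source for this passage is not available to me), and the two joining rules simply stand in for two interpreters that disagree about whether a formatting tag ends a word.

```python
from html.parser import HTMLParser


class TextChunks(HTMLParser):
    """Collect the runs of text found between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def words(markup: str, tag_breaks_word: bool) -> list[str]:
    """Extract words, optionally treating every tag boundary as a word boundary."""
    parser = TextChunks()
    parser.feed(markup)
    parser.close()
    joiner = " " if tag_breaks_word else ""
    return joiner.join(parser.chunks).split()


# Hypothetical small-caps markup: a full-size initial followed by a smaller run.
SAMPLE = "K<small>OBUN</small> C<small>HINO</small>"

print(words(SAMPLE, tag_breaks_word=False))  # ['KOBUN', 'CHINO']
print(words(SAMPLE, tag_breaks_word=True))   # ['K', 'OBUN', 'C', 'HINO']
```

Dictionary lookup, X-Ray and text-to-speech all work from a word list of this kind, so the second interpretation breaks each of them in the same way.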
It is clear that identifying the brand and associated file format alone will not suffice, and even the file format may not be enough due to variation among platforms. Hardware and software configurations make a real difference to the version and behaviour of the file. Since Amazon’s file formats (AZW, PRC, KF8 and so forth) are not openly documented, it is insufficient to look at the source code. Noting the software and OS may therefore be a necessary step in ensuring the replicability and accurate documentation of Kindle ebooks. Even this may not be enough to keep pace with the constantly updating Kindle infrastructure, but at least it’s a start towards documenting a specific moment in time.
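As a sketch of what such a record might look like, the following data structure keeps the hardware, software, format and content layers together. The class, its field names and the citation layout are my own suggestions rather than an established bibliographic standard; the sample values are taken from the iPad configuration listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EbookDescription:
    """One observed instance of an ebook, recorded layer by layer."""

    author: str            # content layer
    title: str             # content layer
    file_format: str       # format layer, as far as it can be identified
    application: str       # software layer: reading application and version
    operating_system: str  # software layer: OS and version
    hardware: str          # hardware layer: device model

    def describe(self) -> str:
        """Render the record as a single citation-style line."""
        return (
            f"{self.author}, {self.title} [{self.file_format}]. "
            f"{self.application} on {self.operating_system} ({self.hardware})."
        )


ipad_copy = EbookDescription(
    author="Walter Isaacson",
    title="Steve Jobs",
    file_format="Kindle edition",
    application="Kindle for iPad 4.10",
    operating_system="iOS 8.4",
    hardware="iPad MD522B/A",
)

print(ipad_copy.describe())
# Walter Isaacson, Steve Jobs [Kindle edition]. Kindle for iPad 4.10 on iOS 8.4 (iPad MD522B/A).
```

Because the record is frozen and compared field by field, two observations of the same title on different devices remain distinct entries, which is the point of recording every layer.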
I was asked to write about the importance of Amazon for publishers for the company’s 20th anniversary in The Conversation. It was originally published here: theconversation.com/amazon-is-20-years-old-and-far-from-bad-news-for-publishers-43863
It has now been 20 years since Amazon sold its first book: the titillating-sounding Fluid Concepts and Creative Analogies, by Douglas Hofstadter. Since then publishers have often expressed concern over Amazon. Recent public spats with Hachette and Penguin Random House have heightened the public’s awareness of this fraught relationship.
It has been presented as a David and Goliath battle. This is despite the underdogs’ status as the largest publishing houses in the world. As Amazon has become the primary destination for books online, it has been able to lower book prices through its influence over the book trade. Many have argued that this has reduced the book to “a thing of minimal value”.
Despite this pervasive narrative of the evil overlord milking its underlings for all they’re worth, Amazon has actually offered some positive changes in the publishing industry over the last 20 years. Most notably, the website has increased the visibility of books as a form of entertainment in a competitive media environment. This is an achievement that should not be diminished in our increasingly digital world.
In Amazon’s early years, Jeff Bezos, the company’s CEO, was keen to avoid stocking books. He wanted Amazon to work as a go-between for customers and wholesalers: rather than building costly warehouses, it would buy books as customers ordered them and pass the savings on to the customers. (It wasn’t long, however, until Amazon started building large warehouses to ensure faster delivery times.)
This promise of a large selection of books required a large database of available books for customers to search. Prior to Amazon’s launch, this data was available to those who needed it from Bowker’s Books in Print, an expensive data source run by the people who controlled the International Standard Book Number (ISBN) standard in the USA.
ISBN was the principal way in which people discovered books, and Bowker controlled this by documenting the availability of published and forthcoming titles. This made them one of the most powerful companies in the publishing industry and also created a division between traditional and self-published books.
Bowker allowed third parties to re-use their information, so Amazon linked this data to their website. Users could now see any book Bowker reported as available. This led to Amazon’s boasts that they had the largest bookstore in the world, despite their lack of inventory in their early years. But many other book retailers had exactly the same potential inventory through access to the same suppliers and Bowker’s Books in Print.
Amazon’s decision to open up the data in Bowker’s Books in Print to customers democratised the discovery of books, which had previously been locked into the sales system of physical book stores. And as Amazon’s reputation improved, they soon collected more data than Bowker.
For the first time, users could access data about what publishers had recently released and basic information about forthcoming titles. Even if customers did not buy books from Amazon, they could still access the information. This change benefited publishers as readers who can quickly find information about new books are more likely to buy new books.
As Amazon expanded beyond books, ISBN was no longer the most useful form for recalling information about items they sold. So the company came up with a new version: Amazon Standard Identification Numbers (ASINs), Amazon’s equivalent of ISBNs. This allowed customers to shop for books, toys and electronics in one place.
The ASIN is central to any Amazon catalogue record and with Amazon’s expansion into selling eBooks and second hand books, it connects various editions of books. ASINs are the glue that connect eBooks on the Kindle to shared highlights, associated reviews, and second hand print copies on sale. Publishers, and their supporters, can use ASINs as a way of directing customers to relevant titles in new ways.
Will Cookson’s Bookindy is an example of this. The mobile app allows readers to find out if a particular book is available for sale cheaper than Amazon in an independent bookstore nearby. So Amazon’s advantage of being the largest source of book-related information is transformed into a way to build the local economy.
ASINs are primarily useful for finding and purchasing books from within the Amazon bookstore, but this is changing. For example, many self-published eBooks don’t have ISBNs, so Amazon’s data structure can be used to discover current trends in the publishing industry. Amazon’s data allows publishers to track the popularity of books in all forms and shape their future catalogues based on their findings.
While ISBNs will remain the standard for print books, ASINs and Amazon’s large amount of data clearly benefit publishers by increasing their visibility. Amazon has forever altered bookselling and the publishing industry, but this does not mean that its large database cannot be an invaluable resource for publishers who wish to direct customers to new books outside of Amazon.
Abstract: The Kindle’s launch in 2007 is considered pivotal in the transition of the eBook from marginal interest to mainstream phenomenon. This narrative marginalizes the pre-history of the eBook stemming from Bob Brown’s manifesto, The Readies, in 1929 through to Sony’s big push for public eBook acceptance with the Sony Librie in 2006. Traditional accounts of the eBook recall early failures to monetize the eBook through expensive hardware experiments from 1999 to 2006, but this ignores a wider range of precedents apparent from a media archaeological excavation of the eBook before the Kindle.
The current project traces the development of the eBook from the Kindle to its precursors outside of the dedicated hardware that typically characterizes the eBook’s incunabular period. It is clear that dedicated devices did not catch on prior to the Kindle, but this does not mean that a samizdat eBook culture did not exist. eBook reading prior to the launch of the Kindle was facilitated by applications for portable devices such as PalmPilots and Game Boys. This media archaeological approach reveals the birth of the modern standards for eBook formats and how users, frustrated with the lack of available eBooks, often went to great lengths to create their own. This reaches its apex in the development of an eBook application for the Game Boy, where readers built a programme to read a range of titles from Robinson Crusoe to Lolita on the games console.
It is possible to see the foundations of the modern eBook from such activity, as the necessity for reflowable text when reading on a Personal Digital Assistant (PDA) led to the formation of the Open eBook Publication Structure (a precursor to the EPUB format) in 1999. Several portable devices such as the Game Boy Advance, PalmPilot and SoftBook also had facilities for modems, allowing readers to receive books without using a computer, often seen as one of the core selling points of the original Kindle. Amazon regenerated the eBook marketplace by amalgamating these elements into a single package while leveraging its total dominance over online bookselling to transform the commercial eBook marketplace. Through reconstructing this forgotten and often unauthorized history, it is possible to find a richer pre-history of the eBook than the generally established historical narrative of public hardware failures.
Abstract: Since the mid-2000s, the ebook has stabilized into an ontologically distinct form, separate from PDFs and other representations of the book on the screen. The current article delineates the ebook from other emerging digital genres with recourse to the methodologies of platform studies and book history. The ebook is modelled as three concentric circles representing its technological, textual and service infrastructure innovations. This analysis reveals two distinct properties of the ebook: a simulation of the services of the book trade and an emphasis on user textual manipulation. The proposed model is tested with reference to comparative studies of several ebooks published since 2007 and defended against common claims of ebookness about other digital textual genres.
Abstract: It is difficult to talk about the digitalization of the book trade without mentioning Amazon, but the constituency and scale of the retailer have not undergone large-scale critical scrutiny. Amazon’s infrastructure, including the integration of ISBNs into Amazon Standard Identification Numbers (ASINs), has shaped the book trade over the last two decades, and in places, has replaced traditional sources of information such as Bowker’s Books in Print and Nielsen BookScan. Amazon thus presents a large cache of data for publishing studies, although Amazon is notoriously secretive.
The current project maps Amazon UK’s online bookselling infrastructure and offers an initial foray into how this data can be analysed to present a survey of the contemporary publishing landscape. While Amazon’s websites are a living resource that is difficult to map, there is an impetus to archive and analyse data immediately, as Amazon is not an archival resource, aptly demonstrated by their purge of pre-Kindle ebook data in 2007 and their recent closure of the public popular highlights function. To this end, the current project will provide an overview of Amazon’s digital infrastructure, followed by two practical applications: (1) tracking the used book marketplace with a focus on Vladimir Nabokov; and (2) analysing Amazon’s use as a cataloguing tool for books not on sale through Amazon or third-party sellers. Through these case studies, the paper aims to open conversations about how to use Amazon as a research tool as well as a research object.
Abstract: Mass digitization of text has resulted in the development of textual generators that are much more capable of writing through reading pre-existing chunks of text. While they do not understand the semantics of the text, many of these machines are capable of creating reasonably intelligible discourse through their reading and reassembly of pre-existing texts. Through targeting specific corpora (including Moby Dick and live data from a remote buoy; instructions from WikiHow; and a database of time zones), text generators and Twitterbots are creating engaging literary works. In this paper, I will theorise and historicise the development of reading automata within the wider context of the recent textual return in digital media facilitated by the development of ebooks and Twitter.
June 1st, 2015 § PUBLICATION: Indexes as Hypertext
Abstract: Digital media presents several challenges to the index, but this ignores the fact that the index has played an important role in the development of the computer. Hypertext, or links between chunks of text, is a vital concept in computation, and one which can be traced back to the index. The author explores the link between indexes and hypertext through three case studies of novels with indexes: Vladimir Nabokov’s Pale fire, Mark Z. Danielewski’s House of leaves and Steven Hall’s The raw shark texts. This analysis reveals how indexes can be used as a subversive part of experimental fiction that authors employ to encourage the reader to move beyond superficial forms of reading.
Simon Rowberry, “Indexes as Hypertext.” The Indexer. June 2015, pp. 50-56
May 22nd, 2015 § PRESENTATION: 1984 Redux: The long term materiality of the Kindle infrastructure
Abstract: The launch of the Kindle in 2007 marked the arrival of the eBook as a marketable phenomenon and in the following years, the eBook marketplace has gone from strength to strength. Amazon has consolidated its position as the market leader by creating a complex proprietary infrastructure that has locked users into the Kindle system. This spans the ubiquitous hardware, software, large store and range of services which constitute the Kindle brand.
This has a caveat, as it means that all the data and infrastructure are reliant on Amazon’s continual investment in the Kindle brand. Due to the cloud-based storage of the Kindle’s data and the limited lifespan of the hardware, users are reliant on Amazon’s continual support. This is a transition from book-as-object to book-as-service, which offers some exciting opportunities but leaves consumers, and book historians, vulnerable to losing important historical data. The removal of data is not without precedent, as copies of George Orwell’s Nineteen Eighty-Four were removed directly from users’ Kindles once it was discovered the publisher did not own the rights to the novel. More recently, Amazon discontinued their Kindle Popular Highlights website, which offered an annotation corpus of over one million individual highlights and is now no longer available.
In order to understand the complex materiality of the Kindle’s infrastructure, it is important to understand how it has left us precariously reliant on Amazon to preserve that infrastructure. The current project explores the precarious materiality of the Kindle infrastructure and the difficulties it presents for contemporary and future book historians who wish to delineate a comprehensive account of digital book culture in the early twenty-first century. As a corollary, the paper will suggest some solutions to the problem that can be undertaken now, including the urgent need to preserve the evidence that is proliferating on the Kindle infrastructure.