PRESENTATION: Fragmentation and Discontinuity in Access to the USPTO Full-text and Image Database

October 27th, 2019

Abstract: Historians of technology use patent databases including the United States Patent and Trademark Office (USPTO) Patent Full-Text and Image Database (PatFT) as evidence for technological development. While academics acknowledge the limitations of the content such as citations as a proxy for influence, less attention has been paid to how digital methods for accessing patent data – search engines, PDFs and webpages – alter our perception of PatFT as a primary source.

The USPTO provides access via PatFT to all granted patents in the United States, from Samuel Hopkins's 1790 filing for the manufacture of potash onwards. There is a clear separation between patents filed before and after 1976, with the latter available as fully searchable text and the former available only as photographic facsimile copies of the original documents. Plans to automate patent processing and discovery date back to the 1950s with the "Bush Report", but it was not until the 1990s that the US government began to invest in the digitisation of patent submission and publishing workflows.

The forty-year lag between exploring and adopting electronic workflows has led to a fragmented discovery system. PDF documents differ from the searchable HTML full text in structure and metadata. Classification numbers are updated for historical filings, but neither the HTML nor the PDF version records the update. Some files were scanned from earlier efforts at microfilm preservation. Original copies of up to 5 million patent documents stored in Franconia, Virginia were destroyed in 2018, while the National Archives primarily holds material filed prior to 1978. As a result, many patents granted between 1978 and 1990, if not later, are available only as digital surrogates without a print copy of the original. Even born-digital records filed as late as 2000 started as semi-structured text records, passing through a brief transition to SGML (Standard Generalized Markup Language) in 2001 before settling on the XML (Extensible Markup Language) standard in early 2002.
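The format shift described above can be made concrete with a minimal sketch. The fragment below is a hypothetical, simplified patent record in the post-2002 XML style, parsed with Python's standard library; the element names are illustrative stand-ins, not the USPTO's actual DTD or schema:

```python
import xml.etree.ElementTree as ET

# A hypothetical, minimal record in the post-2002 XML style; the element
# names are illustrative only, not the USPTO's actual DTD or schema.
record = """\
<patent-grant>
  <doc-number>06000000</doc-number>
  <title>Illustrative widget</title>
  <classification>123/456</classification>
</patent-grant>"""

root = ET.fromstring(record)
# Flatten the child elements into a simple field dictionary.
fields = {child.tag: child.text for child in root}
print(fields["doc-number"], fields["title"])
```

A pre-2001 semi-structured text record of the same grant would carry equivalent fields only as loosely tagged lines, so any pipeline spanning the transition needs separate parsers and a reconciliation step, which is one concrete source of the fragmentation this paper describes.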

This paper examines how digital technologies and processes including data storage, digitisation, and Optical Character Recognition (OCR) shape our knowledge of the history of innovation recorded in patent databases. Through questioning the relationship between patents as data and as digital objects, I demonstrate how what Ian Milligan has termed the “illusionary order” of digitized archives restricts the USPTO PatFT’s use as a data source.

PRESENTATION: The Legacy of E-Readers Beyond Reading

October 25th, 2019

While reports of the ebook's demise have been greatly exaggerated, dedicated e-readers have diminished in popularity as readers elect to use smartphones and tablets in their place. In this paper, I argue that beyond the e-reader's continued role as a niche consumer electronics device, it also played an important part in broader technological research and development that will continue to influence future trends in mobile computing and beyond.

Through case studies of the e-reader's early adoption of lithium-ion batteries and its continual investment in bendable, low-power screens, I argue that the e-reader's long-term legacy will be secured by its proofs of concept rather than by reading on-screen. For example, NuvoMedia, developers of the Rocket eBook, designed the product to test the use of lithium-ion batteries, with the inventors later going on to found Tesla. Likewise, research into next-generation electronic paper by Amazon subsidiary Liquavista was a precursor to foldable phones. Through a historical analysis of the cutting-edge elements of e-reader design, I demonstrate the importance of the device within wider digital culture.

PRESENTATION: An archaeology of patent databases as material objects

September 9th, 2019

Patent databases including Espacenet and the USPTO Patent Full-Text and Image Database (PatFT) offer rich sources for big data analysis to document the evolution of a technology or to demonstrate the value of a filing through citation networks. While scholars such as Manuel Trajtenberg, Douglas O’Reagan and Lee Fleming have complicated the relationship between citations and chains of influence between inventors, less attention has been paid to the affordances and limitations of the databases storing the underlying data. Researchers can access patent databases as searchable, online databases or via third-party sources such as the National Bureau of Economic Research (NBER) US Patent citation data files, but how does the structure, type, and availability of data shape our understanding of patents as historical evidence?
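As a rough illustration of why citation counts are a crude proxy for influence, the sketch below builds a citation network from a hypothetical list of (citing, cited) patent pairs; the pairs mirror the shape of the NBER citation files in spirit only, and the patent numbers are invented. Raw in-degree flattens any distinction between substantive and pro-forma citations:

```python
from collections import Counter

# Hypothetical (citing_patent, cited_patent) pairs. The real NBER files
# are far larger and carry extra metadata (grant year, class, and so on).
citations = [
    ("5000001", "4000001"),
    ("5000002", "4000001"),
    ("5000003", "4000001"),
    ("5000003", "4000002"),
]

# In-degree (times cited) is the usual naive proxy for a filing's value.
in_degree = Counter(cited for _, cited in citations)
print(in_degree.most_common(1))  # [('4000001', 3)]
```

Every edge here counts equally, whether it marks genuine intellectual debt or an examiner-added formality, which is precisely the complication that Trajtenberg, O'Reagan and Fleming raise about citation chains.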

In this paper, I offer a case study of the layers of digitisation embedded within the USPTO PatFT and the Patent Application Full-Text and Image Database (AppFT) to analyse what Ian Milligan terms the "illusionary order" of patent databases. Combined, PatFT and AppFT provide access to all granted patents in the United States, from Samuel Hopkins's filing for the manufacture of potash in 1790 to weekly updates of new patents in early 2019. There is a clear separation between patents filed before and after 1976, with the latter available as fully searchable text and the former available only as photographic facsimile copies of the original documents. While the USPTO began transitioning to digital workflows in the 1970s, the full text from this period may derive from automatic Optical Character Recognition (OCR) processes, leading to differences between the facsimile PDF copy of a patent and the semantically rich full text.

Discrepancies proliferate beyond simple historical distortions. PDF documents can differ from the HTML full text, and updates to patent classifications can create different versions of the same patent without acknowledgement of an update. Some files were scanned from earlier efforts at microfilm preservation. Original copies of up to 5 million patent documents stored in Franconia, Virginia were destroyed in 2018, while the National Archives primarily holds material filed prior to 1978. As a result, many patents filed prior to the complete adoption of digital workflows are only available as digital surrogates without a print copy of the original.

The provenance of these patents as digital objects is therefore uneven and can create an incomplete image of patent filings, which is only exacerbated in other national databases where even photo-facsimiles of all patents remain unavailable. This paper offers a bibliographic and media archaeological excavation of the database documenting the digitalisation of the USPTO’s workflows to contextualise the types of error that may affect our understanding of patents as documents.

PRESENTATION: Reconsidering Project Gutenberg’s Significance as an Early Digitization Project

July 9th, 2019

Michael Hart’s Project Gutenberg is often regarded as the first ebook publisher, with Hart typing up the Declaration of Independence on a Xerox Sigma V at the University of Illinois at Urbana-Champaign (UIUC) in 1971. This mythology has persisted despite contradictory evidence: Hart did not coin the name ‘Project Gutenberg’ until the late 1980s; an early version of Milton’s Paradise Lost was an updated copy of a 1965 digitization by Joseph Raben at CUNY; and the Project’s first full book, the King James Bible, was not released until 1989.

In this paper, I challenge hagiographic accounts of Michael Hart’s early work within the broader context of early collaborative digitization work and innovations with the computer facilities at UIUC in the early 1970s. Computers and the Humanities, a prominent early digital humanities journal, notes a range of digitization projects during the 1960s, and the Oxford Text Archive, a digital publication interchange network, formed in 1976. These early projects were more active than Hart, who only began work in earnest in the 1990s with the benefit of Usenet, FTP, and Gopher. Furthermore, Hart acknowledged but never used UIUC’s PLATO (Programmed Logic for Automatic Teaching Operations), an early computer network with a larger audience than ARPANET in the early 1970s, to disseminate texts. Through re-appraising Hart’s work within its historical and geographical context, the paper challenges the concept of a lone genius inventor of ebooks and proposes a more inclusive history of digital publishing.

PRESENTATION: The End of Ebooks

July 15th, 2018

Abstract: Amazon has dominated the ebook market since the launch of the Kindle in 2007, but the next decade may be defined by the merger of the International Digital Publishing Forum (IDPF) with the World Wide Web Consortium (W3C) in January 2017. The merger resulted in the formation of the W3C Publishing Working Group, with the remit to maintain the EPUB standard while working to future-proof digital publications as “first-class entities on the Web” in the form of Packaged Web Publications (PWP). The proposed PWP specification would mark a paradigm shift for the book trade, with ebooks gaining all the features of the modern Web rather than those of the more conservative EPUB specification.

The PWP specification is yet to be finalized, but during its development Working Group participants have extensively debated the limits of the book and its digital representation. The new standard must satisfy a broad range of use cases including trade publishing, scholarly communication, journalism, and grey literature. In this presentation, I analyse, from a Science and Technology Studies perspective, the consensuses and fractures that will shape the presentation of books in browsers. The W3C offers an unprecedented level of transparency in decision-making compared to prior ebook standards such as EPUB, revealing the human decisions behind algorithmic interventions by mark-up validators, InDesign export wizards, and web browsers. These ongoing discussions will not only shape the future of digital publishing but also return to the question of “what is a book?” in the context of the early twenty-first century.

PRESENTATION: Resurrecting the Ebook: A media archaeological excavation of the Kindle’s development, 1930-2007.

May 25th, 2017

I recently gave a talk for the Media History Seminar at the Institute of English Studies. I took the opportunity to link earlier ebook developments to the success of the Kindle.

Abstract: Amazon’s launch of the Kindle in 2007 was lauded as the moment when ebooks finally became economically viable for publishers. This success was facilitated by Amazon’s careful analysis of previous failed attempts to commercialize ebooks since the early 1990s, and of earlier theoretical models developed since the 1930s. This presentation explores how the Kindle’s reputation stems from a mixture of adapting pre-existing technology and the right socio-technological context, rather than from a complete revolution in ebook design.


PRESENTATION: A historiography of the ebook

October 30th, 2015

I was invited to give a talk for the Centre for the History of the Book at the University of Edinburgh. I took the opportunity to talk through some of the methodological challenges facing researchers of ebooks.


PRESENTATION: The Lost Generation?: A Media Archaeology of the E-Book, 1929–2006

July 8th, 2015

Abstract: The Kindle’s launch in 2007 is considered pivotal in the transition of the eBook from marginal interest to mainstream phenomenon. This narrative marginalizes the pre-history of the eBook stemming from Bob Brown’s manifesto, The Readies, in 1929 through to Sony’s big push for public eBook acceptance with the Sony Librie in 2006. Traditional accounts of the eBook recall early failures to monetize the eBook through expensive hardware experiments from 1999 to 2006, but this ignores a wider range of precedents apparent from a media archaeological excavation of the eBook before the Kindle.

The current project traces the development of the eBook from the Kindle back to its precursors outside of the dedicated hardware that typically characterizes the eBook’s incunabular period. It is clear that dedicated devices did not catch on prior to the Kindle, but this does not mean that a samizdat eBook culture did not exist. eBook reading prior to the launch of the Kindle was facilitated by applications for portable devices such as PalmPilots and Game Boys. This media archaeological approach reveals the birth of the modern standards for eBook formats and shows how users, frustrated with the lack of available eBooks, often went to great lengths to create their own. This reaches its apex in the development of an eBook application for the Game Boy, where readers built a program to read a range of titles from Robinson Crusoe to Lolita on the games console.

It is possible to see the foundations of the modern eBook in such activity. The necessity for reflowable text when reading on a Personal Digital Assistant (PDA) led to the formation of the Open eBook Publication Structure (a precursor to the EPUB format) in 1999, and several portable devices such as the Game Boy Advance, PalmPilot and SoftBook had facilities for modems, allowing readers to receive books without using a computer, often seen as one of the core selling points of the original Kindle. Amazon regenerated the eBook marketplace by amalgamating these elements into a single package while leveraging its total dominance of online bookselling. Through reconstructing this forgotten, and often unauthorized, history, it is possible to find a richer pre-history of the eBook than the generally established historical narrative of public hardware failures.

PRESENTATION: Mapping Amazon’s Digital Infrastructure

June 19th, 2015

Abstract: It is difficult to talk about the digitalization of the book trade without mentioning Amazon, but the composition and scale of the retailer have not undergone large-scale critical scrutiny. Amazon’s infrastructure, including the integration of ISBNs into Amazon Standard Identification Numbers (ASINs), has shaped the book trade over the last two decades and, in places, has replaced traditional sources of information such as Bowker’s Books in Print and Nielsen BookScan. Amazon thus presents a large cache of data for publishing studies, although the company is notoriously secretive.

The current project maps Amazon UK’s online bookselling infrastructure and offers an initial foray into how this data can be analysed to present a survey of the contemporary publishing landscape. While Amazon’s websites are a living resource that is difficult to map, there is an impetus to archive and analyse data immediately, as Amazon is not an archival resource, aptly demonstrated by its purge of pre-Kindle ebook data in 2007 and its recent closure of the public popular highlights function. To this end, the current project will provide an overview of Amazon’s digital infrastructure, followed by two practical applications: (1) tracking the used book marketplace with a focus on Vladimir Nabokov; and (2) analysing Amazon’s use as a cataloguing tool for books not on sale through Amazon or third-party sellers. Through these case studies, the paper aims to open conversations about how to use Amazon as a research tool as well as a research object.

PRESENTATION: Reading Automata

June 12th, 2015

Abstract: Mass digitization of text has resulted in the development of textual generators that are far more capable of writing through reading pre-existing chunks of text. While they do not understand the semantics of the text, many of these machines are capable of creating reasonably intelligible discourse through their reading and reassembly of pre-existing texts. Through targeting specific corpora (including Moby Dick and live data from a remote buoy; instructions from WikiHow; and a database of time zones), text generators and Twitterbots are creating engaging literary works. In this paper, I will theorise and historicise the development of reading automata within the wider context of the recent textual return in digital media facilitated by the development of ebooks and Twitter.
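The reassembly such generators perform can be sketched as a word-level Markov chain, a deliberately minimal version of the technique underlying many Twitterbots; the toy corpus below stands in for Moby Dick or WikiHow:

```python
import random
from collections import defaultdict

# A toy corpus standing in for a digitized text such as Moby Dick.
corpus = ("call me ishmael some years ago never mind how long precisely "
          "having little or no money in my purse").split()

# Build a first-order Markov model: each word maps to its observed successors.
model = defaultdict(list)
for current_word, next_word in zip(corpus, corpus[1:]):
    model[current_word].append(next_word)

def generate(seed, length, rng=random.Random(0)):
    """Reassemble text by repeatedly sampling an observed successor."""
    words = [seed]
    for _ in range(length - 1):
        successors = model.get(words[-1])
        if not successors:
            break  # dead end: the last word never appears mid-corpus
        words.append(rng.choice(successors))
    return " ".join(words)

print(generate("call", 8))
```

The generator never understands its corpus; it only recombines observed word transitions, which is precisely the reading-without-comprehension this paper theorises.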
