PRESENTATION: Fragmentation and Discontinuity in Access to the USPTO Full-text and Image Database

October 27th, 2019

Abstract: Historians of technology use patent databases, including the United States Patent and Trademark Office (USPTO) Patent Full-Text and Image Database (PatFT), as evidence for technological development. While academics acknowledge the limitations of the content, such as the use of citations as a proxy for influence, less attention has been paid to how digital methods for accessing patent data – search engines, PDFs and webpages – alter our perception of PatFT as a primary source.

The USPTO provides access via PatFT to all patents granted in the United States, from Samuel Hopkins’s filing for the manufacture of potash in 1790 onwards. There is a clear separation between patents filed before and after 1976, with the latter available as full searchable text, and the former available only as photographic facsimiles of the original document. Plans to automate patent processing and discovery date back to the 1950s with the “Bush Report”, but it was not until the 1990s that the US government began to invest in the digitisation of patent submission and publishing workflows.

The forty-year lag between exploring and adopting electronic workflows has led to a fragmented discovery system. PDF documents differ from the searchable HTML full text in structure and metadata. Classification numbers are updated for historical filings, but there is no record of this update in either the HTML or PDF version. Some files were scanned from earlier efforts at microfilm preservation. Original copies of up to 5 million patent documents stored in Franconia, Virginia, were destroyed in 2018, while the National Archives primarily holds material filed prior to 1978. As a result, many patents granted between 1978 and 1990, if not later, are only available as digital surrogates without a print copy of the original. Even born-digital records filed as late as 2000 started as semi-structured text records, enduring a brief transition to SGML (Standard Generalized Markup Language) in 2001, before settling on the XML (Extensible Markup Language) standard in early 2002.

This paper examines how digital technologies and processes, including data storage, digitisation, and Optical Character Recognition (OCR), shape our knowledge of the history of innovation recorded in patent databases. By questioning the relationship between patents as data and as digital objects, I demonstrate how what Ian Milligan has termed the “illusionary order” of digitised archives restricts the USPTO PatFT’s use as a data source.
