Patent databases including Espacenet and the USPTO Patent Full-Text and Image Database (PatFT) offer rich sources for big data analysis to document the evolution of a technology or to demonstrate the value of a filing through citation networks. While scholars such as Manuel Trajtenberg, Douglas O’Reagan and Lee Fleming have complicated the relationship between citations and chains of influence between inventors, less attention has been paid to the affordances and limitations of the databases storing the underlying data. Researchers can access patent databases as searchable online databases or via third-party sources such as the National Bureau of Economic Research (NBER) US Patent citation data files, but how do the structure, type, and availability of data shape our understanding of patents as historical evidence?
In this paper, I offer a case study of the layers of digitisation embedded within USPTO PatFT and Patent Application Full-Text and Image Database (AppFT) to analyse what Ian Milligan terms the “illusionary order” of patent databases. PatFT and AppFT combined provide access to all granted patents in the United States from Samuel Hopkins’s filing for the manufacture of potash in 1790 to weekly updates of new patents in early 2019. There is a clear separation between patents filed before and after 1976, with the latter available as full searchable text, and the former available only as photographic facsimile copies of the original document. While the USPTO began transitioning to digital workflows in the 1970s, the full text from this time can rely on automatic Optical Character Recognition (OCR) processes, leading to a difference between the facsimile PDF copy of the patent and the semantically-rich full text.
Discrepancies proliferate beyond simple historical distortions. PDF documents can differ from the HTML full text, and updates to patent classifications can create different versions of the same patent without acknowledgement of an update. Some files were scanned from earlier efforts at microfilm preservation. Original copies of up to 5 million patent documents stored in Franconia, Virginia, were destroyed in 2018, while the National Archives primarily holds material filed prior to 1978. As a result, many patents filed prior to the complete adoption of digital workflows are only available as digital surrogates without a print copy of the original.
The provenance of these patents as digital objects is therefore uneven and can create an incomplete image of patent filings, which is only exacerbated in other national databases where even photo-facsimiles of all patents remain unavailable. This paper offers a bibliographic and media archaeological excavation of the database documenting the digitalisation of the USPTO’s workflows to contextualise the types of error that may affect our understanding of patents as documents.