“Mass Digitization” in “Textual Criticism as Language Modeling”
On or about December 14, 2004, Google announced its library scanning program. While not the start of historical digitization, which had been accelerating since the punch-card era, it marked an inflection point in the growth of online text and the start of the era of mass digitization. Millions of books, periodicals, manuscripts, images, and more are now accessible due to large-scale scanning and archiving projects by commercial (Google Books, Gale Cengage), government (Library of Congress, Bibliothèque nationale de France), and nonprofit (Internet Archive, HathiTrust) entities.
Among the consequences of this great transformation—perhaps the most widespread—is the ability to use full-text search engines on digitally transcribed materials. But search remains a mostly “undertheorized” practice in the humanities, as Underwood (2014a) observes. The beneficiaries of the revolution know that they can search the new digital abundance but do not necessarily know what resources were expropriated or how documents are represented when reading them or ranking likely hits.
Our aim here is to theorize not only search in mass-digitized collections but also the analysis and editing of “literary systems” (Bode, 2017). This chapter is both descriptive—detailing the methods we used in the Viral Texts project to analyze the circulation of texts in the print culture of the nineteenth century—and theoretical—arguing for an iterative interplay among text search, text mining, and textual editing in the construction and criticism of statistical models of texts, which are usually called language models in the field of natural language processing. In other words, we will describe both the research methods we have used while working on this book and also our approach to publishing the results of this research.
Although the remaining sections in this chapter unfold (we hope) in a logical sequence, they may be sampled somewhat independently depending on the reader’s interest:
- We first motivate our theoretical approach by tying together quantitative modeling methods for interpreting texts that fall under the rubrics of “cultural analytics” and (the current meaning of) “distant reading” with strands of work on editing and interpretation that have emerged from the study of oral traditions, periodical literature, and algorithmically composed electronic texts.
- We then provide a brief overview of language models and their applications, focusing on methods of use to textual criticism and text-reuse analysis. This section and the next are the most mathematical; they discuss specifying and estimating the parameters of probability distributions.
- We work through a case study of newspapers carrying over material from one issue to the next by constructing a simple but non-trivial language model with a small number of parameters. This section introduces some concepts from regular languages by analogy to the regular expressions commonly used in search applications.
- We then describe how we scaled up these language-modeling methods for detecting reprints in large archives of digitized newspapers.
- Finally, we discuss how the language-modeling approach informs producing an edition of the results of this algorithmic investigation.
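Before the formal treatment in the sections that follow, the core idea of a statistical language model can be made concrete with a minimal sketch. The example below (an illustration, not the Viral Texts project's own method) estimates a bigram model by maximum likelihood: the probability of each word given the preceding word is its relative frequency in a toy corpus.

```python
from collections import Counter

def train_bigram_model(tokens):
    """Estimate bigram probabilities P(w_i | w_{i-1}) by maximum likelihood:
    count each adjacent word pair and divide by the count of its first word."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    # Count first-word occurrences over the same positions the bigrams cover.
    unigrams = Counter(tokens[:-1])
    return {(a, b): count / unigrams[a] for (a, b), count in bigrams.items()}

tokens = "the cat saw the dog".split()
model = train_bigram_model(tokens)
# "the" is followed by "cat" once out of its two occurrences as a first word,
# so model[("the", "cat")] == 0.5
```

Even this tiny model exhibits the two steps the later sections dwell on: specifying a family of distributions (here, conditional distributions over next words) and estimating its parameters from observed text.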