In her essay on the use of mass-digitized archives in history, Putnam (2016) notes the decline in historians' connection to the local knowledge of archive curators and patrons that offsets an increase in the “transnational” scope of text search in digitized collections. Mass digitization has obscured our view not only of the local knowledge centered in archives but also of some of the common practices that the computational humanities had evolved until 2004 (the year in which both “digital humanities” and Google Books emerged). Most obviously, many early digital projects involved teams of scholars, librarians, and computational specialists undertaking the physical conversion of analog texts, images, and other sources to digital form; the scholarly editing, interpretation, and contextualization of those materials; and the development of exploratory and analytical tools adapted to these online editions. To name only a few examples, the Rossetti Archive, the Whitman Archive, the Cather Archive, the Orlando Project, the Women Writers Project, the Perseus Digital Library Project, and the Valley of the Shadow were produced by teams of researchers who self-consciously linked digital production with interpretation. Practitioners unlocking the mysteries of text encoding, optical character recognition, and information retrieval might find culture heroes in the early humanist-printers such as Aldus or Froben, who amalgamated textual criticism with founding new fonts of type.
This strand of scholarship, like its pre-computational forebears, is a philological one (McGann 2014), engaged in a hermeneutic loop of editing, interpretation, and commentary. This loop applies not only to archival projects that rely on scholarly transcription and editing, but also to collections such as Google Books or HathiTrust whose contents depend on the acquisitions policies of the libraries scanned and on automatic procedures such as optical character recognition (OCR). Cordell (2017) has argued that OCR-transcribed texts should be taken seriously in their own right as sources, rather than treated as transparent surrogates for the print editions on which they are based. In the rest of this section, we consider some consequences of scaling notions of scholarly editing to such collections.
Texts without Authors, or Titles, or Genres
At the lowest level of character transcription, OCR creates sources that are liable to global revisions. If a library upgrades its OCR systems, for example, the transcription changes. If a library re-images the microfilm underlying its OCR, the transcription changes.
Even if we pass over the characters OCR produces (no minor point!), what searchers and readers usually encounter are “editions” corresponding to the physical volumes, microfilm rolls, or newspaper issues already in library catalogs. Readers interested in a novel, such as Moby Dick, can most likely find some volume roughly corresponding to that work, even if other manifestations remain bundled inside volumes of Melville’s collected works with uncertain table-of-contents metadata. On the other hand, readers interested in a lyric poem, such as Charles Mackay’s “The Inquiry”, discussed in an earlier chapter, are likely to find it not in a book’s or a newspaper’s metadata but by searching: for example, on page 322 of a copy of the 1859 Routledge The Collected Works of Charges [sic] Mackay, so titled in WorldCat and in copies digitized by the Internet Archive (at the Library of Congress) or available in HathiTrust (from the New York Public Library). It appears in the Middlebury [Vermont] People’s Press for March 22, 1843, under the title “The Resting Place” (see figure 1), but the Chronicling America newspaper database simply presents the text of the entire newspaper page. The National Library of Australia’s Trove database does segment newspaper pages, and so we can find an article, titled “Poetry”, in the Southern Argus (Port Elliot, South Australia) for February 9, 1870, that contains “The Enquiry” by “Mrs. Hemans” (see figure 1). But this “article” also contains three brief items that follow the poem, since they were not typographically delimited by a headline: “Proverbs by Josh Billings”, “The Swallow”, and an untitled anecdote about Thackeray. Poetic fragments could blend into the background even further. A stanza of a parody of “The Inquiry” appears, set as prose, in a paragraph between anecdotes about rural slovenliness and a fashion for small bonnets in the Gawler, South Australia, Bunyip for August 20, 1870.
If a named work by a known poet is hard to discover in mass-digitized “editions”, the situation is much worse for the kinds of unattributed genres abounding in nineteenth-century newspapers: jokes, sketches, vignettes, household hints, and recipes. A recipe for “Potato Puff” appears on a page of the Bel Air (Md.) Ægis & Intelligencer for Feb. 26, 1866 between a note about growing plants by artificial light and a recipe for pudding; the same recipe appears in the Adelaide South Australian Chronicle and Weekly Mail for Aug. 15, 1868 between a description of an experiment for preserved meat and a recipe for preserved rhubarb. These are only a few examples, which would be much expanded if we discussed all 83 witnesses of this text we know of to date.
Even if we return to the question of novels, where author and title information are usually available, problems recur at other scales. For one thing, as Underwood (2014b) notes, most fiction is not noted as such in library catalogs where, by his calculations, searching for the genre “fiction” achieves 37% recall. And even if we could accurately identify volumes of fiction—again leaving aside the lumping and splitting of omnibus collections and multi-volume novels—not everything in a printed edition of a novel is fiction: critical introductions and dedications and ten pages of publisher’s catalogs luxuriate around the edges. Although some metadata (author, title, date) was available for OCR'd editions, Underwood and his colleagues needed to build automatic genre classifiers for each page in their digitized collections to study diachronic questions of genre formation.
The condition of the mass-digitized text is thus closer to the manuscript sources of an edition than to a scholarly publication. But like manuscripts, these digital artifacts have their own material and conceptual constraints (Trettien 2013). The montage of texts on a newspaper page or of genres even within a single-volume novel makes it inadequate for a researcher to select whole periodical runs or physical volumes, which are what appear in catalogs. Most importantly, the instability of features of the image and textual representation (for example, due to otherwise useful improvements in imaging and OCR) makes stable citations of many digital artifacts impractical without resorting to extensive version control. Spiro and Segal (2010) analyzed citations to digital scholarly editions of Walt Whitman, Emily Dickinson, and Uncle Tom’s Cabin from 2000–2008 and found that a small but growing proportion of indexed scholarship on those works cited these editions: 21%, 12%, and 10%, respectively. In a survey of literary scholars in these fields, 58% consulted these digital editions frequently, but only 26% cited them. These scholarly editions of well-known authors, however, have been structured more carefully than the output of a mass digitization campaign. Blaney and Siefring (2017), for instance, found that users of British History Online and the Text Creation Partnership transcriptions from Early English Books Online still recommended citing only the print resource 25% of the time. They also document comments complaining about a lack of page numbers in some electronic resources.
The importance of page numbers could result from a desire for a more stable citation scheme, or it could result from the view that these digital collections are simply convenient surrogates for physically scattered print sources. The works of Plato and Aristotle, for example, are still cited according to the page numbers in early printed editions by Stephanus and Bekker, respectively. Few scholars would view later editions of Plato as a surrogate for Stephanus. It is instructive to look at examples where citation and repurposing of mass-digitized archives seem most successful. JSTOR splits periodical volumes into individual articles and enjoys circulation along with similar repositories of scholarly publications. Some of the most popular digitization projects involve photography collections where enough metadata attaches to each image to make them sharable independent of their archival context (Springer et al. 2008).
The work of Underwood and colleagues in producing genre tagging for pages of scanned books exemplifies the work of “corpus editors” needed to turn mass-digitized archives into research corpora and editions (Crane and Rydberg-Cox 2000). Earlier work by Cordell (2013) searching for and collating multiple reprints and paratexts of Hawthorne’s “Celestial Railroad” provided some of the inspiration for our Viral Texts project. The work of Katherine Bode and colleagues compiling and cataloging a corpus of fiction in Australian newspapers provides another important case study (Bode and Hetherington 2014). By extracting and linking the text of chapters serialized across multiple periodical issues and by normalizing or supplying missing author names and titles, they aim, as Bode (2017, p. 98) says, to provide “a stable and accessible representation of a historical literary system for others to investigate, for either traditional literary-historical or data-rich research.”
Literary Systems as Specialized Languages
Generalizing from these cases, what principles might we need for representing a “literary system” or producing a stable, citable edition drawn from mass-digitized sources? Consider the case of documenting and archiving the search results Google presents in response to a user’s query or a user’s Facebook news feed at a particular time. As Lynch (2017) points out, documents in the “age of algorithms” are composed not only from a stable set of inputs (e.g., a set of crawled HTML pages) but from users’ past interactions, the interactions of other users, proprietary tracking and demographic data, and unknown machine learning systems and parameters. As Drucker (2014) says, an archive of algorithmic feeds would “have not only to deal with fragmentary evidence, but with fleeting, fugitive compositions.” Even if we could emulate Google’s or Facebook’s algorithms, we wouldn't necessarily have the computational or legal means to reproduce all of the inputs or the trace of any given user’s interactions with the system. Alternatives to full emulation could involve saving the outputs presented to simulated user populations or recruiting “Nielsen families” of internet users. All of these alternatives go beyond current standards in documentary archiving.
Instead, as Lynch notes (with a tip of the hat to Don Waters), if we view algorithmically composed documents as generating an unbounded number of personalized “performances”, we can then “[r]ecognize that the documentation of performances and events of various kinds—dance, ritual, theatre, musical performance, coronations and inaugurations, lectures, public addresses, riots and wars, etc.—is very old” and that such documentary practices are “deeply rooted in historical methods of anthropology, sociology, political science, ethnography and related humanistic and social science disciplines that seek to document behaviors that are essentially not captured in artifacts, and indeed to create such documentary artifacts.” Since, as Drucker puts it pithily, “reception is production”, we might, rather than short-circuiting philological and editorial practices by keeping them within the purview of archives, envision an interplay between creating editions and preserving and presenting them.
One of the oldest efforts to document performances, which Lynch mentions in passing, was the committing to writing of the oral Homeric tradition, traditionally under the auspices of Peisistratos of Athens in the sixth century BCE. As Milman Parry and his student Albert Lord showed by analysis of the Homeric poems and by comparative fieldwork among South Slavic “singers of tales” in the first half of the twentieth century, oral poems are not memorized but composed anew during each performance in front of a particular audience. Just as we are unlikely to observe all of the inputs to an algorithmic news feed, so we cannot tell from a mere textual transcript about the audience for the song of a South Slavic bard, who might sing for Serbs that the Christians won a battle and for Turks that the Muslims did (Lord 1960, p. 19). And just as there is no “true” page of Google search results for a given query, so is there no “true” or “original” text underlying oral performances of a given story.
How, then, do scholars edit and interpret the textual records of the Homeric tradition? Dué and Ebbott (2010), in their edition of book 10 of the Iliad, provide an interesting review of textual criticism of oral poetry and argue for their own approach of presenting each of the main witnesses—one codex and three papyri—in its own transcription and commentary. Although they note that some variation results from written transmission, they argue that other variants are the result of the underlying nature of non-repeatable oral performances. “These kinds of variations are of a kind different from those that are more clearly scribal errors. Instead of ‘mistakes’ to be corrected or choices that must be weighed and evaluated, as an editor would do in the case of a text composed in writing, we assert that these variations are testaments to the system of language that underlies the composition-in-performance of the oral tradition.” (p. 155)
The case of oral traditions, furthermore, allows us to imagine further possibilities for constructing scholarly editions that can serve as “a model, a theoretical instantiation, of the vast and distributed … network in which we have come to embody our knowledge” (McGann 2014, p. 26). Even for written traditions, Bordalejo (2013) locates the answers to editorial questions “not in the documents that preserve versions of the texts, but in the minds of the scholars who have carefully studied the physical documents, their texts and the variant states of the text they represent.” Different oral performances of, e.g., the Odyssey are not merely copies of each other; instead, these performances are each utterances in the specialized language that the oral poet learns during his training. Just as the object of study for a linguist is not only the set of utterances of a language’s speakers but the grammars they follow, so the object of study for a scholar of oral traditions is not (only) the set of written records or audio recordings that we happen to have but the properties of the infinite set of potential performances. As we discuss in the next section, given some sample utterances from a language, along with adequate if necessarily inaccurate knowledge of the structure of the language, we can estimate a probability distribution over other, possible, utterances. In other words, we can learn a language model from our data to approximate the extent of a speculative bibliography of possible texts, contexts, and transmissions.
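To make the idea of estimating a distribution over unobserved utterances concrete, the following is a minimal sketch of a bigram language model with add-one smoothing. It is an illustration only, not the model used in our project: the function names and the three toy "formulaic" lines are invented for this example, and a real study would need far more data and a better-motivated model of the language's structure.

```python
import math
from collections import Counter


def train_bigram_model(utterances):
    """Count bigrams over whitespace-tokenized utterances with boundary markers."""
    context_counts = Counter()  # how often each token appears as a left context
    bigram_counts = Counter()   # how often each (previous, current) pair appears
    vocab = set()
    for utterance in utterances:
        tokens = ["<s>"] + utterance.lower().split() + ["</s>"]
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts, vocab


def log_prob(utterance, model):
    """Log-probability of an utterance under the add-one-smoothed bigram model."""
    context_counts, bigram_counts, vocab = model
    tokens = ["<s>"] + utterance.lower().split() + ["</s>"]
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        numerator = bigram_counts[(prev, cur)] + 1
        denominator = context_counts[prev] + len(vocab)
        total += math.log(numerator / denominator)
    return total


# Invented toy "performances" built from recombinable formulas (an assumption
# for illustration, not transcribed from any witness).
observed = [
    "swift-footed achilles answered him",
    "godlike odysseus answered him",
    "swift-footed achilles rose up",
]
model = train_bigram_model(observed)

# Neither line below was observed, but the first recombines attested formulas.
recombined = log_prob("godlike odysseus rose up", model)
scrambled = log_prob("up rose odysseus godlike", model)
print(recombined > scrambled)
```

The model assigns more probability to an unattested line that recombines attested formulas than to a scrambled one, which is the sense in which a language model sketches the space of possible, not only recorded, performances.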
We have seen examples of editions as a single text, or as multiple variant texts, or even (if only speculatively) as a version of an algorithmic process running in emulation. Language models, and other statistical approaches, provide an additional possibility. They are partial, abstract models of systems too complex for us to represent directly as a static document or a full simulation. When we model the historical process of reprinting by clustering strings of text transformed by simple edit operations, we generate speculative editions of the selections that circulated through the nineteenth-century newspaper. When we adjust our model, those editions shift accordingly, perhaps foregrounding different possible texts. Just as scientists might build useful models of natural systems from falling stones to cloud formations without representing each quantum and momentum, so might a “science of the artificial” (Simon 1996) sufficiently approximate a Facebook feed or the socio-technical system of editors and compositors, telegraphs and steamships, authors and scissors-wielding readers, that produced a nineteenth-century newspaper.
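The dependence of such speculative editions on model settings can be illustrated with a small sketch. It groups short passages whose character-level edit similarity exceeds a threshold, using single-link clustering. This is a simplified stand-in under stated assumptions, not the actual reprint-detection pipeline: the function names, the threshold values, and the four toy passages (including the simulated OCR errors) are all hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, and substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        cur = [i]
        for j, char_b in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                        # deletion
                           cur[j - 1] + 1,                     # insertion
                           prev[j - 1] + (char_a != char_b)))  # substitution
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Edit distance normalized to a 0..1 similarity score."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def cluster(texts, threshold):
    """Single-link clustering: link any pair at or above the threshold."""
    parent = list(range(len(texts)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if similarity(texts[i], texts[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(texts)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Invented passages: three noisy "witnesses" of one text and one unrelated item.
passages = [
    "boil six potatoes and mash them with butter",
    "boil six potatocs and mash thern with butter",   # simulated OCR errors
    "boil six potatoes, and mash them with butter",   # editorial comma
    "the swallow returns to the same nest each spring",
]

loose = cluster(passages, threshold=0.8)    # three witnesses form one cluster
strict = cluster(passages, threshold=0.99)  # every passage stands alone
print(loose)
print(strict)
```

With the looser threshold the three recipe variants form one cluster of witnesses; with the stricter one each passage is its own "text". The pairwise comparison here is quadratic in the number of passages, so it would not scale to a newspaper archive without an indexing step to propose candidate pairs.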