From Mass Digitization to Micro-Editions
The methods we described in the previous section for detecting reprinting produce both pairwise alignments between passages and larger clusters of two or more passages. These clusters, such as those depicted above, include the dates and publications in which individual witnesses appear, and serve as a primary source of evidence for our discussions in much of this book. Concretely, we distribute the output of our clustering algorithms as a flat file containing one record for every witness of every cluster, with a unique identifier for each cluster linking these records together (see the figure below). We distribute these files in both JavaScript Object Notation (JSON) and comma-separated values (CSV) formats. If these clusters are to serve as evidence, however, we would like to be able to cite them while writing this book, and to allow others to cite them in the future.
{
"cluster": 523986312413,
"series": "/lccn/sn85042527",
"size": 215,
"begin": 14276,
"end": 16589,
"id": "/lccn/sn85042527/1868-10-09/ed-2/seq-1",
"issue": "/lccn/sn85042527/1868-10-09/ed-2",
"date": "1868-10-09",
"text": "Over Tho Rivor.\nTliis beautiful poem was published...
"url": "https://chroniclingamerica.loc.gov/lccn/sn85042527/...
"seq": 1,
"ed": "2",
"batch": "scu_idacox_ver02",
"open": "true",
"corpus": "ca",
"source": "Abbeville press.",
"publisher": "W.A. Lee and Hugh Wilson, Jr.",
"p1x": 9051,
"p1y": 3509,
"p1w": 3238,
"p1h": 8797,
"p1seq": 1,
"p1width": 23760,
"p1height": 26550,
"p1dpi": 0,
"p1id": "/lccn/sn85042527/1868-10-09/ed-2/seq-1",
"lang": "en",
"placeOfPublication": "Abbeville, S.C.",
}
{
"cluster": 523986312413,
"series": "/lccn/sn90061771",
"size": 215,
"begin": 4072,
"end": 6040,
"id": "/lccn/sn90061771/1868-11-12/ed-1/seq-1",
"issue": "/lccn/sn90061771/1868-11-12/ed-1",
"date": "1868-11-12",
Figure 13. Clustering output in JSON format, showing one record and part of the next. Both of these records are from cluster 523986312413, which contains 215 witnesses according to the size field.
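To give a sense of how these flat files can be consumed, the following is a minimal sketch, assuming a JSON-lines serialization of the records shown in Figure 13 and a hypothetical file name, that groups witness records by their cluster identifier and orders each cluster's witnesses by date.

import json
from collections import defaultdict

# Group witness records by cluster identifier, assuming one JSON record
# per line; "clusters.json" is a hypothetical file name.
clusters = defaultdict(list)
with open("clusters.json", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        clusters[record["cluster"]].append(record)

# Order each cluster's witnesses chronologically to trace a text's spread.
for cluster_id, witnesses in clusters.items():
    witnesses.sort(key=lambda w: w["date"])
    print(cluster_id, len(witnesses), witnesses[0]["date"], witnesses[-1]["date"])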
As we discussed above, the variations and lacunae in authors, titles, and text boundaries make it difficult to find and cite individual exemplars below the level of the issue or page. How then should we represent, and cite, the network of texts and contexts for “The Inquiry” or “Beautiful Snow” or a recipe for reusing your leftovers with potato puffs?
We hypothesize that stable citations are linked to editions of texts. To put it another way, the very act of citation calls into existence a description of a text, with certain properties that we describe below. We approach this through our earlier discussion on archiving in the “age of algorithms” before turning, by way of work on fragmentary texts recovered from the partial evidence of manuscripts and quotations, to a philological resolution for the overwhelming abundance of mass print culture (McGann 2014).
Archiving Algorithmic Output?
Researchers using digital methods, in the humanities and beyond, have long considered distributing resources for the replication of their work (ZZZ). In text-based disciplines, such as natural language processing, replication usually involves the distribution of a textual corpus, with additional annotations and code necessary to repeat the reported analyses. Sometimes, when the base text is under copyright, the annotations are distributed as standoff markup, so that the receiver can re-apply them to their own copy of the copyrighted material.
The clusters of reprints we want to cite, however, are the result of a computational model applied to an outside corpus. It is inconvenient, or in some cases illegal, to distribute the input corpus. But what if we distributed the output of the model’s analysis of this corpus? We could declare a certain version of the reprint clustering output, produced using a certain input corpus and parameters, to be archival. Concretely, we could choose a particular version of our output data—e.g., one set of files output as CSV—and designate it as an archival version of the data, restricting our references to those particular clusters from one moment in our project’s history. This data could become an archive in the mode of many prior digital humanities projects, and our arguments could reference particular records in that database. Each cluster would get a unique identifier, and we could cite those identifiers much as scholars working with texts known only from quotations or manuscript fragments would.
The strongest argument for archiving entire cluster outputs is that every statistic computed from the data—e.g., facts about the distribution of cluster sizes, or the sizes of individual clusters, or the prevalence of reprinted texts on particular dates, or whether two passages belonged to the same cluster—would be preserved. It would not only be clear which cluster of texts some unique identifier was referring to, but readers would be able to check all of the statistics we derived from the algorithmic output, run additional classifiers on the algorithmic output, and so on. If in making an argument we said that “Beautiful Snow” was reprinted more often than Charles M. Dickinson’s “The Children”—despite the latter’s misattribution to the author of The Old Curiosity Shop—then the reader could check the evidence for that claim. The reader might even be able to estimate, with some work looking at the sampling practices of newspaper digitization campaigns (Cordell 2017), how likely it is that the addition of new newspapers to the input corpus would overturn a particular statistical claim.
If, however, we did add more newspapers to the input corpus, how should we reconcile the new results to the old algorithmic output? Which clusters remain “the same” across these two datasets? Single-link clustering, for example, can be particularly sensitive to new “bridge” documents that induce two heretofore separate clusters to merge by their similarity to documents in both. What if we wanted to re-OCR (some of) the documents in our corpus, or apply automatic or manual corrections to them, in order to improve our results? Even if we froze the input corpus, which parameter values should we choose for the canonical output? Those that give us the best values on some global evaluation metric? Those that optimize the output’s match to some hand-annotated sample?
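The sensitivity of single-link clustering to bridge documents can be made concrete with a small sketch (the document identifiers and links are invented, not our actual alignment data): clusters are the connected components of the pairwise-alignment graph, so a single new document aligned to members of two separate clusters merges them.

# Single-link clusters as connected components of an alignment graph,
# computed with a simple union-find structure.
def single_link_clusters(docs, links):
    parent = {d: d for d in docs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:          # each alignment joins two components
        parent[find(a)] = find(b)

    groups = {}
    for d in docs:
        groups.setdefault(find(d), []).append(d)
    return list(groups.values())

docs = ["w1", "w2", "w3", "w4"]
links = [("w1", "w2"), ("w3", "w4")]
print(single_link_clusters(docs, links))    # two clusters: [w1, w2] and [w3, w4]

# One new "bridge" document aligned to both clusters collapses them into one.
docs.append("bridge")
links += [("w2", "bridge"), ("bridge", "w3")]
print(single_link_clusters(docs, links))    # a single merged cluster of five documents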
Scholars working with fragmentary texts have, on a much smaller scale, mitigated these problems in two stages (Berti et al. 2009). First of all, when collecting what is known of the poetry of Sappho, or the papers of Emily Dickinson, they do not simply publish citations to existing editions of, e.g., Longinus, who quotes one of Sappho’s poems, or links to archival records for particular manuscript pamphlets. Scholars of fragments produce new editions of those texts, even though they are building on top of existing editions of Longinus and other sources. Furthermore, when publishing a new edition of a fragmentary author, they often provide a concordance that links each fragment to its appearance, where available, in earlier editions. But it would be impractical to create concordances among all pairs of cluster outputs that might arise from different input and parameter configurations of a program that could be run many times a day.
When we began work on the Viral Texts project, we imagined that we would archive all cluster outputs in just this way: assigning titles, authors, and even subject tags to each one so that other scholars could browse them in a manner familiar from other projects. Soon, however, the sheer scale of reprinting that we uncovered, paired with the rapid iteration of our methods, made it clear that such individual annotation would not be possible. The theories we outline here represent more recent thinking that seeks to marry an iterative, algorithmic model of reprinting to stable bibliographic objects that can be used to compose what Bode (2017) calls a “scholarly edition of a literary system.”
As digital research continues over time, our situation approaches the examples adduced above by Drucker (2014) and Lynch (2017) of the intractability of recreating the “fleeting fugitive compositions” of the Google search results or Facebook feed for a given user at a given time. For convenience, we can save individual outputs. Although we are making results of our clustering methods available for others to download as large CSV or JSON files (with the full text of certain restricted collections removed), such data can be difficult to parse, given their sheer size, and difficult to access for scholars not comfortable with computational text analysis. These restrictions are not supportable if we want our arguments to resonate with Americanist scholars, or if we hope our larger data might inform broader studies into the nineteenth century. If we want our scholarly argument to point to more than these data dumps, we need a slight change of focus.
Copy on Cite
One step beyond distributing our algorithmic output is the work of “corpus editing” we discussed above. The work of harmonizing discordant metadata and rejoining disconnected chapters and fragments is important for making, to quote Bode (2017) again, “a stable and accessible representation of a historical literary system for others to investigate”. We want, however, to be able to cite events in this “historical literary system” even while investigating and probing it. We hope, moreover, that even once this book is complete, we or others might add to or reconfigure the representation of this network of viral texts. The representation of a literary system should still be accessible—in this case, among other things, separable from the complete archive of OCR'd periodicals—but in place of “stable” we propose “citable”.
As we have seen, the very multivalence of mass-digitized archives we are investigating makes citation difficult enough that many scholars refer to print editions even when working with well-curated digital collections (Spiro and Segal 2010; Blaney and Siefring 2017). Fitzpatrick (2016) summarized the dilemma of abundance for “academic style”:
So when a reader searches for a quotation, she is likely to turn up not just the original source of that quotation but also a host of copies, borrowings, and reuses, texts in which that quotation appears but from which it did not originate. Even when the search turns up the proper source, it might not turn up the proper edition of the source, and for scholars, that level of distinction very often matters. In order to ensure that Reader B has every possibility of seeing the same thing in a source text that Reader A saw, B needs to know whether A read the edition of a book published in 1819 or the revised edition published in 1831, or whether A read an article as originally printed in the journal or as it was repackaged for inclusion in a later edited volume. Much like the situation in a laboratory, these variables matter, and so this level of precision in their citation matters.
But how do we construct citations to preserve what matters? If two editions are the same in the passage quoted, is there a difference? If some wider context changed the interpretation, why not include that context in the quotation?
We hypothesize that stable citations are linked to editions of texts. Rather than linking citations to the results of mass digitization (where texts can be re-OCR'd or corrected by distributed proofreaders) or to the output of algorithmic systems (where changes in parameters or input data reconfigure what we read), each act of citation should call into existence an edition of a text that documents the current view. These editions can be very lightweight. The editorial “work” need not extend to collating all the witnesses or comparable passages; one could transcribe a single witness or even copy and paste from OCR transcripts. One could of course add metadata about authors, titles, dates, and sources or correct the OCR transcript or the boundaries of automatic text segmentation. Such a casual edition could comprise a selection of several witnesses, variants, and parodies, the better to delimit the range of what text is being cited.
These micro-editions are most useful when paired with the textual critical apparatus of a language model. As described above, we use these models to infer a distribution over possible readings given the evidence from some mixture of human and computational sources. More simply, in the language of information retrieval, these micro-editions are queries that a model uses to call up a wider network of results.[14] The possibility for inference beyond a single edition makes the requirements of editing less onerous: the micro-edition is just one more piece of evidence, and not a single optimal inference about some Urtext that is the only thing the reader sees. One can even, as we said, simply copy some noisy OCR output—or even an image region if we include OCR in our model’s inference procedure—to use as a query, although it will generally be more effective to make clean transcriptions. By standing at the beginning of the editing process, rather than at a culmination, they reflect an “anomalous state of knowledge” that precedes information seeking (Belkin 1980).
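To illustrate the query role of a micro-edition in the simplest possible terms, the following sketch (not our actual retrieval model; the documents and scores are invented, apart from the OCR line quoted in Figure 13) matches character five-gram shingles from a clean transcription against candidate passages, so that even noisy OCR witnesses can be called up by a rough overlap score.

# Treat a micro-edition's transcription as a retrieval query: shingle it
# into character 5-grams and rank candidate passages by shingle overlap.
def shingles(text, n=5):
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(max(0, len(text) - n + 1))}

def overlap(query, passage):
    q, p = shingles(query), shingles(passage)
    return len(q & p) / max(1, len(q))

reference = "Over the River. This beautiful poem was published"     # hypothetical clean transcription
candidates = {
    "w1": "Over Tho Rivor. Tliis beautiful poem was published",     # noisy OCR, as in Figure 13
    "w2": "An unrelated item of local news about the harvest",      # invented non-match
}
ranked = sorted(candidates, key=lambda d: overlap(reference, candidates[d]), reverse=True)
print(ranked)    # the noisy OCR witness ranks first despite its errors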
While writing this book, therefore, we have created micro-editions of the reference texts that we would like to cite. The reference texts consist of a text transcription and a few metadata fields such as author, title, publication information for the source of the transcription, and so on. We then include these reference texts in the input to our text-reuse analysis software passim, so that we can extract the other witness passages that cluster with the reference texts. The editions of the reference texts act as stable targets for citation, even as we experiment with new text-reuse models, augment our input corpus, or correct the OCR. The reference texts thus serve a role analogous to the “type specimens” collected by biologists—the single pressed plant in the herbarium that, although its genome differs at many sites from other individuals, serves as evidence for a species across revolutions in taxonomy. There also remains, as in biology, a host of uncatalogued species.
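In practice, adding a reference text to the analysis can be as simple as appending one more record to the corpus that passim reads. The sketch below assumes a JSON-lines input whose field names mirror those in Figure 13; the identifiers, file name, and placeholder transcription are hypothetical, and the exact fields passim requires should be checked against its documentation.

import json

# A hand-transcribed reference text as one more document in the corpus.
reference_text = {
    "id": "/ref/beautiful-snow",        # hypothetical stable identifier for citation
    "series": "/ref",                   # keep reference texts in their own series
    "text": "<manual transcription of the chosen witness>",
    # further metadata (title, author, date, source) could be added here
}

# Append the record to a JSON-lines file included in passim's input.
with open("reference-texts.json", "a", encoding="utf-8") as out:
    out.write(json.dumps(reference_text) + "\n")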
When we transcribe a reference text, its source usually remains in our research corpus. Because reference texts serve as seeds for clusters, there is less concern about producing one perfect edition. In general, we prefer to transcribe earlier witnesses, from open-access collections (such as Chronicling America). We also prefer longer witnesses for texts that are regularly excerpted. We do of course still exercise some editorial discretion about major branches in a given stemma. We display reference texts, where available, at the beginning of each cluster and do not count them in a cluster’s size. Our implementation treats these reference texts like other documents in our corpus, with one exception: all documents that align with the reference text are clustered together, not just those alignments that overlap sufficiently. This equality constraint, combined with single-link clustering, means that separate reference texts—such as different parodies of the same poem—might end up in the same cluster if they align closely enough.
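A rough sketch of that exception (not our actual implementation; the threshold, identifiers, and scores are invented): alignments between ordinary witnesses are kept only when their overlap is sufficient, while any alignment involving a reference text is kept, and clusters are then formed from the resulting links as in the single-link example above.

MIN_OVERLAP = 0.5    # hypothetical threshold on the aligned portion of a passage

def keep_alignment(doc_a, doc_b, overlap, reference_ids):
    # Reference texts attract every document that aligns with them at all;
    # ordinary pairs must overlap sufficiently to be linked.
    if doc_a in reference_ids or doc_b in reference_ids:
        return True
    return overlap >= MIN_OVERLAP

alignments = [("/ref/beautiful-snow", "w7", 0.2),    # kept: involves a reference text
              ("w7", "w9", 0.2)]                     # dropped: below the threshold
links = [(a, b) for a, b, ov in alignments
         if keep_alignment(a, b, ov, {"/ref/beautiful-snow"})]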
In the future, it would be interesting to explore other semantics for these query editions and other interfaces to clusters besides a list of witnesses. For example, we might produce reference editions of a poem and its various parodies and stipulate that they should be clustered separately or together. We might display a tree or graph structure to clarify closer and more distant relationships within a cluster. For works like “Beautiful Snow” that inspired a wealth of frame stories and other paratextual material, we could transcribe a range of versions. Paratexts might grow into commentaries, as in a sermon on a given text. With such cases, we reach the boundaries of “text reuse” and recirculation, as we have delineated them in this book, and cross over into other intertextualities.
Comparanda
If it seems unsurprising to note the instability of texts and “philological facts”, we concur with van Zundert and Andrews (2017) that the “sheer audacity” of establishing a stable text is merely clearer to anyone trying to transcribe bad digital images, or get the same results from a search engine on two different days, than it might have been to a reader of a printed critical edition. They trace back to Briet (1951) the view of scholarly and official activity as accumulation of “documentation”—what we call above the append-only semantics of editing. In the tradition of reading as abductive inference, we hope in this chapter to have presented some useful tools for reasoning from texts to other representations of texts.
We see a similar lightweight approach to editing the contents of mass-digitized archives or the open web in other projects. Schwebel (2015) collates different accounts in periodicals and monographs about the “Lone Woman” of San Nicolas Island. Each record contains an image of the original source, a manual transcription of the text, markup about themes and tropes in the account, and other metadata. In addition to browsing by source newspaper, date, location, title, and trope, the user can organize the accounts by “document group”: manual groupings of similar texts analogous to the clusters created in the Viral Texts project. The Freedom on the Move database organizes the crowdsourced transcription of advertisements for fugitive slaves, with each record containing an image snippet, text transcription, source information, and other metadata. Again this results in editions of texts not enumerated in the digitized source. Projects like SourceLab (Randolph 2017) aim to apply the principles of documentary editing to documents found on the web. It would be a natural next step to connect these editions to wider networks of text using retrieval and language models.
As mentioned above, scholars working on fragmentary texts provide another interesting comparable case. In documenting the evidence for fragmentary Greek and Latin texts, Berti et al. (2009) and colleagues on the Leipzig Open Fragmentary Texts Series (LOFTS) project articulate five fundamental functions that editions of fragments should provide:
- quotation as machine-actionable link;
- alignment of citation schemes;
- fragment as search query;
- dynamic collation of editions of fragments and their witnesses; and
- links to fragment quotations in secondary and tertiary sources.
These operations attempt to ensure that the evidence for and discussions of fragmentary authors—whether quotations in ancient texts or journal articles, whether on papyrus scraps or in standalone editions—will be grouped together and collated. Since the evidence for ancient texts often consists of testimonia—i.e., mentions and discussions of the text rather than direct quotations—these methods again go beyond our examination of reprinting. The methodological tradeoff implied by these operations is that, while Berti’s primitives could apply to editions of all texts, they confine themselves to those fragmentary texts where sparse evidence makes them loath to discard any of it.
Deformance and Model Checking
Scholars usually undertake a study of textual variation in order to establish a text. They marshal evidence from various witnesses to promote one reading over others. In this chapter, we have presented language models both for collation—reasoning from witnesses to underlying readings—and text-reuse analysis—exploring how texts replicate over networks. Our goal, however, has not been to fix one text in each case but rather to show how computational models can help our readings “dwell in possibility”.
In “Deformance and Interpretation”, Samuels and McGann (1999), taking off from Emily Dickinson’s suggestion to read poems backwards, explore modes for performative reworking of literature. In moves that seem related to the Oulipo movement, where, e.g., Georges Perec might write a 300-page novel without the letter e, Samuels and McGann “deform” poems by Wallace Stevens and Samuel Taylor Coleridge by reading them backwards, reading only their nouns or verbs, or replacing (parts of) words with phonetically similar ones. Their interpretation of these deformations leads them to treat the poem as a goal rather than a point of departure: “Take this concatenated text of nouns and verbs and reconstruct it in reverse. You will see it revealed again, in a further range of its visible intelligibility.” Just as Ramsay (2011) reads a frequency-sorted list of the words in a text to see how much one can still interpret, or Nelson (2017) repeats her topic-model analysis on texts with and without OCR errors, Samuels and McGann draw out their interpretations through an information bottleneck (Tishby et al. 1999).
These deformances, like many Oulipo constructions, are deterministic or require some human intervention. As Cordell (2017) has suggested, OCR itself is a deformance of its input. Although its output is generally deterministic given the same images, it is subject to unpredictable changes if the input images are modified even slightly.
Using computational and probabilistic methods to explore the possibilities of our readings has a close connection to the statistical practice of model checking. We are not, in most cases, trying to prove that one reading is absolutely correct; experience tells us that new editions will come. Whenever we have recourse to statistical reasoning, “we know that virtually all models are wrong, and thus a more relevant focus is how the model fits in aspects that are important for our problems at hand” (Gelman et al. 1996). If we try Samuels and McGann’s noun-only reading on the Mackay poem, we see only one or two nouns per line, but the adjectival variations of ‘mighty/misty’ or ‘whispering/winding’ fall away. The parodies of caricatured men escaping women and women escaping men, however, become more obvious in the noun-only readings: ‘women / dell / holler [noun] / ground / babies / cradles’ and ‘girls / rest / dough-faces / woman / graces’. Any visualization of the spread and variation of texts will necessarily be compressed, so what we see is inevitably the output of some explicit or implicit model.
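A noun-only deformance of this kind is easy to sketch with an off-the-shelf part-of-speech tagger; the example below uses NLTK (which needs its tokenizer and tagger models downloaded once), and the sample sentence is a placeholder rather than a line from our corpus.

import nltk
# One-time model downloads, if not already present:
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

def nouns_only(text):
    # Keep only tokens tagged as nouns (Penn Treebank tags beginning with "NN").
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    return [word for word, tag in tagged if tag.startswith("NN")]

# Prints the nouns of the placeholder sentence, discarding everything else.
print(nouns_only("This beautiful poem was published in the morning paper."))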
We stand at the beginning of editing the works of the network author from the evidence of mass-digitized collections. Our models are certainly wrong, but by capturing even a part of a networked textual system, we can rerun the process and think how it might have been otherwise.[15]