Chapter 5
Computational Literary Studies and Scholarly Editing
Fotis Jannidis
Introduction
In the last ten years or so, trends that in some cases had been decades in the making have combined into a perfect storm. The digitization efforts of libraries, archives, and other institutions, which have been ongoing since the 1980s, have made it very easy today to access large quantities of literary texts. The development of methods for the quantitative analysis of literary texts, which in the case of stylometry goes back to the 1960s, has opened up more fully to computational linguistics and natural language processing (NLP) and can now integrate new methods from those fields into its research programs. And lastly, after decades of preliminary work with machine learning, computational linguistics and NLP have reached a whole new quality of semantic and structural text processing, especially through the “deep learning” revolution.
Computational Literary Studies (CLS) came into being in this constellation. CLS methods are something quite old in the sense that they would be inconceivable without the attempts to build literary research on clearly conceived corpora and to clearly define and survey the features of interest, such as those developed in formalism and structuralism. Nor are they conceivable without the relatively broad tradition within what later became known as the digital humanities of extracting textual, stylistic, metrical, rhetorical, and other textual features and using them for questions of authorship attribution or period or genre description. Nevertheless, they are at the same time something entirely new, because many of the working tools that have been in use for about ten years require an extensive collection of texts, as in topic modeling, and this requirement is increasing, for example in the pretraining of language models for domain adaptation. Typical work processes in the field of CLS start from a literary research question, develop a conceptual model of the relevant literary phenomenon, and derive one of several possible operationalizations from it. In most cases, this involves the quantitative examination of a textual property. To do this, the phenomenon in question is annotated in a corpus, and a model, such as a neural network, is then trained on these data, which allows the text property to be extracted automatically from a large collection of texts. Now one can count the frequencies of the textual property and, if one is lucky, answer the question one started from.
Today, CLS has established itself as a new, small research field. There are now two journals that publish relevant results and two conferences at which exclusively or predominantly CLS work is presented. There are professors who have made CLS their central research topic, as well as monographs and anthologies: in short, everything that exists when a new subject establishes itself.
But what about the relationship to the other major pillar of the digital humanities (DH), digital editions? At the DH conference, the two fields run side by side in independent tracks. There doesn’t seem to be much exchange, if only because the intricacies of a digital edition are a closed book for many in CLS, while the sophisticated techniques of CLS, whether involving statistics or NLP, are not accessible to most editors. One can say that, on the one hand, we have elaborately annotated data and, on the other hand, the methodological knowledge of how to analyze complex literary data. What is missing is a model of how to bring these two sides together.1 The aim of this chapter is to discuss such a model—others are certainly equally plausible—and to demonstrate its feasibility with a practical example. The actual results of this exemplary study are of less interest here, since the main aim is to present a model of textual analysis that relates the two fields to each other.2
In order to be able to discuss modeling questions in concrete terms, I will use the example of the Faust edition, for two reasons: because I was involved in the creation of the edition and because it is a verse drama, which, as we will see, makes many things easier.3 The Faust edition is a hybrid edition, available both in print and digitally. In fact, only a small part has been printed, while the digital edition is much larger and includes everything that appears in print. This is the historical-critical edition of the drama Faust by Johann Wolfgang Goethe, probably the most famous German-language drama. The edition comprises three parts: the constituted text, an archive, and a part on the genesis of the work. The constituted text contains the 12,111 verses and the variants for each of these verses. The archive contains all handwritten and printed witnesses in facsimile images and transcriptions. One of the special features of this edition is that, following the demands of the German editorial-theory tradition, it attempts to distinguish between findings and interpretation, and hence there are two transcriptions: a documentary transcription and a text-genetic one. From the display of variants in the constituted text, one can jump to the corresponding manuscripts or prints. Under “genesis” there are visualizations that break down the process of the text’s creation, which took over fifty years, and also the distribution of witnesses across the verses of the drama, which is quite uneven, as Goethe repeatedly destroyed manuscripts in the first decades of his work. The edition lives as a website where the texts can be read, browsed, and searched. It is also accessible on GitHub, where one can download not only the software used to create the edition, but also the XML-encoded texts (github.com/faustedition). In principle, anyone can build their own Faust edition in this way, but above all, interested parties can re-use the edition for their own purposes. An open license (CC BY-NC-SA) is intended to promote this re-use.
Computational Literary Studies
CLS has developed rapidly in the last ten years or so, and this is reflected not least in the extensive inventory of methods that can now be drawn on. The most established methods are probably topic modeling, sentiment analysis, stylometry, and the analysis of co-occurring and semantically related words.4 For the first time, digital content-based text analysis has been given a real working basis through the possibilities of topic modeling and other semantically rich representations. Topic modeling can be applied to any text collection and determines the distributions of topics in documents and the distributions of words in the topics (across all documents). These “topics” are in fact not necessarily topics. Rather, words that frequently occur together within a given passage length are identified as belonging to a topic. Often this co-occurrence is caused by the words being thematically related, but other factors can produce it as well: the words may be part of a particular rhetorical strategy, misspelled in a similar way (OCR errors), or drawn from the same foreign language. Three other methods have also established themselves as fruitful approaches to “distant reading” in recent years: sentiment analysis, stylometry, and word-field analysis.
Sentiment analysis, in its simplest version, examines the polarity of texts—whether they tend to feature positive or negative emotions overall. Developed to capture the polarity of shorter texts, sentiment analysis was quickly picked up in the CLS field as an indicator of the positivity or negativity of a given text passage. This resulted in attempts to describe plot as a polarity curve over the course of a text. In recent years, this method has been supplemented by a more complex representation of the text: today not only polarity but also emotions are usually described. Such analyses are often based on models of five or eight basic emotions and describe the dominant emotions in shorter texts or their trajectories in longer ones. We will discuss an example of this approach in more detail below.
Stylometry, which has a long history in the form of authorship attribution, has established itself as a standard application in CLS through some important algorithmic developments and their implementation in the form of easily accessible software. For many practitioners, this is their first contact with the research field.
The analysis of co-occurring and semantically related terms, which can be understood as a kind of word-field analysis, is behind one of the more impressive results in CLS. Ted Underwood shows in his book that an early finding of Heuser and Le-Khac, that a specific group of highly correlated words linked to the seed word hard shows a steady increase between 1800 and 1880, is really indicative of a long-term trend in English and American literary history starting around 1750 and ending around 1950.5 This new field is developing methods quickly. The sophistication of its methods has increased dramatically in the last decade, quite in parallel to the explosion of methods in computational linguistics. On the other hand, many aspects have yet to mature, such as standards for metadata and data exchange, or standards for the evaluation of new methods using reference tasks. Not least, the multilingualism of the field poses a challenge to the development of shared methods.
Modeling the Edition
The basic idea of this chapter is rather simple: to model one of the main features of digital editions in such a way that it can be related to the results of the computational analysis of its literary text. These results can include anything from one number to complex information structures like graphs containing relations, weights, and attributes. For the sake of simplicity, we will look mainly at one type of result: number per textual unit. Because our example is the edition of Goethe’s Faust, which is written in verse, the verse is our textual unit. We will apply some sort of analysis to the text and the result will be a number (or some other numerical representation) for each verse. Then we will relate this information structure, the roughly twelve thousand verses of the play each annotated with a number, to our model of the same text as an edition. In the first section, we will describe the basic model, and in the second section we will talk about methods of selecting and filtering our data.
Digital editions fulfill a number of tasks in literary studies: they document which relevant witnesses exist for a particular work; they document the variants of origin or transmission, mostly relating to the first or last version of a work; they bring together all documents reflecting the origin or transmission, such as letters of authors from the time of creation; they comment on passages worth explaining; and more. In printed editions, these different tasks are signaled by typographic markings. The same is true of digital editions, but only a few are designed to make these different editorial features accessible in a machine-readable way. Today, when it is obvious that the dual use of editions—read by humans and processed by machines—is their future, this duality should be reflected in their public presentation. But it is seldom done. This is not due to the digital tools; the TEI (Text Encoding Initiative) guidelines provide for the corresponding information, but either they are not used in this way or the texts are not made available to readers, who are also potential re-users, in their TEI encoding. Only when this changes will the division of labor that has otherwise been established in the field of CLS also be realized in the field of editions: some annotate corpora for their specific questions, while others use them to develop new algorithms to answer those questions.
We focus here on the documentation of variants, which plays a role in almost all critical editions. We can think of a text as a sequence of text units. For each of these text units there may be no variants, one variant, or several. In the simplest case, then, we represent the data of the edition as a sequence of the number of variants per unit. This is just one way to proceed. Many other approaches are feasible, depending on the text and the features of the edition. But this approach should be viable for most.
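To make this concrete, here is a minimal sketch in Python of the one-dimensional model just described. The data structure is a stand-in, not the actual TEI apparatus of the Faust edition; in practice the readings would be extracted from the encoded witnesses.

```python
# Hypothetical input: for each verse number, the list of variant readings.
# In the real edition these would be extracted from the TEI-encoded apparatus.
variants_per_verse = {
    1: [],                          # no variants transmitted for this verse
    2: ["reading A"],               # one variant
    3: ["reading A", "reading B"],  # two variants
    # ... one entry for each of the roughly 12,000 verses
}

# The edition model: a sequence of variant counts, one number per verse.
verse_numbers = sorted(variants_per_verse)
variant_counts = [len(variants_per_verse[n]) for n in verse_numbers]
print(variant_counts)  # [0, 1, 2]
```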
So far, we have treated all variants in the same way; we have made no distinction between the variants. Depending on the question, this can be the appropriate procedure, as when one does not have a clear idea of whether there is a connection between the type of variant and the phenomenon one is analyzing. Sometimes, on the other hand, one will want to exclude certain types of variants. How one classifies the variants also depends on the tradition in which one stands and the research question at hand. One could now proceed by taking a philological concept of variant type and operationalizing it for automatic recognition of these variants. For example, one could take the concept of orthographic variant and model it in such a way that it could be operationalized into an algorithm for identifying spelling variations automatically. For the sake of simplicity, I will present below two strategies that allow a rather vague concept of semantic similarity between variants to be implemented.
In both cases, I will simplify the research design considerably. We can think of the verse-drama Faust as a sequence of verses (between which there is some additional text in the form of act and scene details and stage directions, but we will ignore them in this paper). This is a one-dimensional model that contains a node for each verse, to which the verse number and the text of the verse are assigned. The variants now extend this model into the second dimension, as we can have zero, one, or more variants for each verse. Obviously, to each variant paradigm, we could apply subtle analyses, such as locating the biggest semantic differences. For our purposes, I simplify this by saying that, if there is at least one variant, I will look at only the first and last stages and ignore all the others. So what we are interested in below is the relationship of the first version of a verse to the last.
The first strategy is based on using the Levenshtein distance to calculate the distance between the two versions. “[The] Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.”6 From “house” to “mouse,” it is one edit, the replacement of h by m; similarly, it is one edit from “Faust” to “Faust!,” the insertion of the exclamation mark. The Levenshtein distance makes it easy to count the changes in the text. However, it has the obvious disadvantage that the procedure is semantically blind. The distance from “Faust” to “Faust.” is exactly the same as that from “bird” to “bard.”
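Because this definition maps directly onto a short dynamic-programming routine, a sketch in Python may make the measure concrete; libraries such as python-Levenshtein compute the same value, but the hand-rolled version shows what is being counted.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions required to turn string a into string b."""
    # Dynamic programming, keeping only the previous row of the edit matrix.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + (char_a != char_b)  # substitution or match
            ))
        previous = current
    return previous[-1]

print(levenshtein("house", "mouse"))   # 1 (replace h by m)
print(levenshtein("Faust", "Faust!"))  # 1 (insert the exclamation mark)
```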
Now we can determine for each verse of a scene how great the edit distance is between the first and the last version. If the distance is very small, then we are probably looking at one of the many orthographic variants in Faust. This allows us to distinguish relevant from nonrelevant variants, if we are primarily interested in how the meaning of the text changed during its genesis.
As already mentioned, however, this approach is somewhat crude, since the Levenshtein distance counts only letter changes. With today’s tools of artificial intelligence (AI) and NLP, we can implement a richer concept of semantic distance. In the following, I use “Sentence-BERT” to obtain a vector representation for each verse.7 We can think of vectors as lines in a multidimensional space, starting from the origin and reaching a point defined by the numbers of the vector. Two different vectors are then two such lines, and we can represent the distance between these vectors as the cosine of the angle between them, as has long been common in computational text analysis (see Figure 5.1). The mathematical and algorithmic details, however, are not so important for our purposes. What is important is that, in this way, we have an instrument that allows us to convert the difference in meaning between two versions into a number that expresses the distance between the variants. Both methods allow us to identify the verses that have been particularly intensively edited, or to aggregate this verse-related information across the scenes so that we can see which scenes have been particularly heavily edited. Figure 5.2 shows those scenes. The basis here is the semantic distance between the variants as measured using Sentence-BERT. In fact, the results of the two methods correlate to a high degree.
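A sketch of this measurement with the sentence-transformers library is shown below. The model name is an assumption (any multilingual sentence-embedding model suitable for German verse would serve), the two verse versions are placeholders, and the project's own setup may differ in detail.

```python
from sentence_transformers import SentenceTransformer, util

# Assumed model choice; the chapter does not specify which Sentence-BERT
# model was used, only that verse versions are embedded and compared.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

first_version = "placeholder text of the earliest version of a verse"
last_version = "placeholder text of the final version of the same verse"

embeddings = model.encode([first_version, last_version])
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
semantic_distance = 1.0 - similarity  # one common way to turn similarity into a distance
print(round(semantic_distance, 3))
```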
Let us summarize: We can introduce the variants of an edition into textual analysis as a new and essential information structure. In the simplest case, this can be done on the basis of the (standardized) number of variants per textual unit, which, in the case of the Faust drama, is the verse. However, we can also filter the variants by type so that we look at only the variants relevant to our question, for example, by using the semantic distance between the first and the last variant as a weight. And of course we can also combine the number of variants and the semantic distance to create a relevant representation of the genesis of the text. This representation of the genesis is the starting point for further consideration.
Figure 5.1. The geometry of meaning: this figure shows how text segments can be represented as vectors and the distance between vectors can be understood as semantic similarity.
Figure Description
This graph shows how text segments can be represented as vectors and the distance between vectors can be understood as semantic similarity. Represented on a graph, the semantic difference from a common original is represented by two vectors, and the distance between the variants calculated as the cosine of the angle, here labeled with a Greek alpha, between the vectors.
Figure 5.2. A bar chart showing scenes of Faust weighted based on semantic distance. Higher bars indicate more semantically diverse variants.
Figure Description
This graph shows the standardized number of variants per scene additionally weighted by the semantic distance between the variants. Higher bars indicate more variants which are more semantically diverse, thus highlighting those scenes where the changes in the text changed the meaning (in contrast, for example, to mere variations of spelling). Most of the bars average between 0.1 and 0.2 with the exception of scene 1.1.13, which is around 0.8.
Modeling the Literary Text
The simplest way to model the literary text is to use its structure, something we have already done above, when we used the scenes to aggregate the frequency of variants per line. For a literary scholar, this is a natural way to think about a text, and usually it is quite useful. Nevertheless, it is a design or modeling decision. The structuring elements of a text, its chapters, scenes, stanzas, and the like, are important units during the production of texts and may therefore yield interesting results as models. But this is a choice on the part of the modeler, and as a choice in the context of research, it should be well founded in our research interests.
Other models arise when we do not start from the given text structure but use another feature of the text, such as the character who speaks the text in a drama. Behind this could be the thesis that there are characters who have given the author cause for particularly numerous revisions. As usual, we use standardized values: raw counts alone cannot be used here either, since the characters in Faust speak different numbers of verses. We therefore divide the number of variants of each character by the number of verses the character speaks and thus obtain a value for the number of variants per verse (see Figure 5.3).
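A minimal sketch of this standardization, with invented toy data: the mapping from verses to speakers and the variant counts are placeholders for what would be read from the edition.

```python
from collections import defaultdict

# Toy stand-ins: speaker of each verse and number of variants per verse.
verse_speaker = {1: "Faust", 2: "Faust", 3: "Mephistopheles", 4: "Helena"}
variant_counts = {1: 2, 2: 0, 3: 1, 4: 5}

total_variants = defaultdict(int)
verses_spoken = defaultdict(int)
for verse, speaker in verse_speaker.items():
    total_variants[speaker] += variant_counts[verse]
    verses_spoken[speaker] += 1

# Standardized value: variants per verse spoken by each character.
variants_per_verse_spoken = {
    speaker: total_variants[speaker] / verses_spoken[speaker]
    for speaker in verses_spoken
}
print(variants_per_verse_spoken)
# {'Faust': 1.0, 'Mephistopheles': 1.0, 'Helena': 5.0}
```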
Figure 5.3. A bar chart showing textual variants per speaker in Goethe’s Faust, standardized. Noticeable outliers with values above 2 include Helena and the choruses.
Figure Description
This bar chart shows the frequency of variants in Goethe’s drama Faust per speaker. Speakers represented are Mephistopheles, Faust, the Chorus, Margarete, Helena, Herold, Kaiser, Lynceus, Marthe, and Valentin. The Chorus features the highest rate of variation, at almost 2.5, and Kaiser the lowest, at less than 1.0.
We can see that a value slightly above 1 is typical for most of the more important characters in the drama. There are only two very noticeable outliers upward with values above 2: Helena and (surprisingly) the choruses. The reason for this is the use of ancient verse measures, which Goethe was more uncertain about and had the classical philologist Friedrich Wilhelm Riemer check. This philological knowledge thus confirms the insight of the data analysis: these characters and their text have special properties.
How we model the text is, as I said, not predetermined, but entirely dependent on our research interest. If we want to test the thesis that the texts of female characters differ from those of male characters, we can structure our data accordingly. The same applies to other character features like age or social rank, or features of the fictional world (e.g., is the space in which the scene is set relevant?), or textual features, such as verses with a certain rhyme scheme as opposed to those with a different one or none at all. The main effort here is to assign each text unit, in the case of Faust each of the approximately twelve thousand verses, to exactly one of the groups (e.g., “male”) that constitute the structure (for example, “gender”).
As described above, the model of our edition is a sequence of verse numbers, and for each verse we have the number of variants. With the model of the text structure, we can create a second information structure: this also consists of the sequence of verse numbers, but now we assign information to each verse, such as the name or gender of the speaker or the number of stresses in the verse. This is the formal representation of our text structure, and it is the second pillar in our three-pillar model.
CLS Data Structure
The third information structure, the third pillar of our analytical model, consists of the data we create using a CLS-analysis procedure. Typically, such procedures make information explicit (and countable) that a human reader might also collect while reading, but mostly implicitly. In what follows, we present two such procedures, first the automatic analysis of emotions and then the analysis of rhyme. As indicated above, there are now a great many such analytical procedures available, as a result of the leaps and bounds in the development of NLP culminating in the recent pretrained large language models that power tools like ChatGPT.
The two information structures, the values collected via CLS procedures (such as the frequency of certain emotions) and what I called above a text model (such as the gender of characters), are put in relation in many CLS studies—this is not limited to editions. On the contrary, when one uses editions for this purpose, one mostly uses the reading text and ignores precisely the information that is special to an edition, like the variants, document descriptions, commentary, and so on. This may not be obvious because, in most CLS studies, many texts are analyzed and the text model is not so clearly visible as a model of a text.
So we have three information structures: (1) the variants of the edition, (2) the text model, and (3) the data collected via CLS. Above we saw that it can be fruitful to relate (1) and (2), for instance when we studied the frequency of variants aggregated by characters. As just mentioned, the relation between (2), the textual model, and (3), the CLS data, is not particularly relevant to editions, so I will not discuss this relation further here. What interests us in the following is the relation between (1), the variants, and (3), the CLS data.
Emotions
As part of a longer-running project, my research group, together with the group of Simone Winko in Göttingen, has developed a method for the automatic detection of emotions in nineteenth-century poetry.8 The project focuses on the transition from the poetry of realism to that of early modernism. As is common in supervised machine learning, we annotated the phenomenon under study in numerous texts and then trained a model to detect it automatically. For the annotation, we used an iterative process to define guidelines outlining which emotions to recognize, how to handle borderline cases and problems, and so on. By the time the model was trained and applied to the Faust text, we had annotated 1,278 poems in total, most of them multiple times. The hierarchical annotation scheme consisted of forty emotions arranged in six groups: Love, Joy, Surprise/Agitation, Anger, Sadness, Fear. (The inter-annotator agreement, i.e., the agreement between different annotators of the same poem, is approximately 0.75 γ9 at the level of the six emotion groups). The quality of the model, in this case its capability to automatically recognize emotions, varies somewhat, depending on the number of relevant annotations (Table 5.1). This means that we had a not inconsiderable number of errors in the automatic assignment of emotional categories to a poem, but in most cases it worked.
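A deliberately simplified stand-in for this supervised setup is sketched below: TF-IDF features with a one-vs-rest logistic regression instead of the transformer-based model actually used in the project, and invented toy examples instead of the annotated poems. It only illustrates the shape of the workflow (annotate, train, apply), not the quality reported in Table 5.1.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

GROUPS = ["Love", "Joy", "Agitation", "Anger", "Sadness", "Fear"]

# Invented toy training data standing in for the annotated poems.
texts = [
    "toy verse expressing love and joy",
    "toy verse expressing sadness",
    "toy verse expressing anger and fear",
    "toy verse expressing agitation",
    "toy verse with no marked emotion",
]
labels = [["Love", "Joy"], ["Sadness"], ["Anger", "Fear"], ["Agitation"], []]

# Turn the label lists into a binary indicator matrix for multi-label training.
binarizer = MultiLabelBinarizer(classes=GROUPS)
y = binarizer.fit_transform(labels)

classifier = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
classifier.fit(texts, y)

# Apply the trained model to unseen verses (again a toy example).
predictions = classifier.predict(["another toy verse about love"])
print(binarizer.inverse_transform(predictions))
```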
If we now apply this model to the text of Faust, we find that for about two thirds of the verses there is no emotion; where emotions do occur, love and joy dominate, opposed by almost as many verses containing (in order of frequency) sadness, anger, and fear. If we compare these results with the distribution of emotions in poetry, we notice that love is noticeably more frequent in Faust, while sadness is clearly less frequent.
Table 5.1. F1 scores (macro) of the automatic emotion recognition per emotion group

| | Love | Joy | Agitation | Anger | Sadness | Fear |
| --- | --- | --- | --- | --- | --- | --- |
| F1 (macro) | 0.77 | 0.73 | 0.62 | 0.71 | 0.74 | 0.79 |
In CLS, it is common to relate CLS values (3) to those of a textual model (2), such as in analyzing the gender of the characters speaking a particular part of the text. This would allow us to answer questions such as whether certain emotions are more common in texts that have female speakers. However, we want to focus here on the relationship between the variants and the CLS data. Since we now have two sets of numbers of the same length, and our research interest is to see whether there is a relationship between them—that is, whether the number of variants increases when a particular emotion or emotions as a whole increase—we can simply calculate a correlation between them. However, the result is negative: there is no significant correlation.
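Such a test can be written in a few lines; the sketch below uses Pearson's r via SciPy on toy data (the chapter does not name the correlation coefficient used, so that choice is an assumption; Spearman's rank correlation would be an obvious alternative).

```python
from scipy.stats import pearsonr

# Toy stand-ins for the two verse-level sequences of equal length:
# number of variants per verse and an emotion indicator per verse.
variant_counts = [0, 2, 1, 0, 3, 1, 0, 0, 2, 1]
emotion_scores = [0, 1, 0, 0, 1, 0, 1, 0, 0, 1]

r, p_value = pearsonr(variant_counts, emotion_scores)
print(f"r = {r:.2f}, p = {p_value:.3f}")
```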
Rhyme
Our second experiment looks at a different text feature: rhyme. For each verse, we automatically determine whether or not it is part of a rhyme. For this, we use a tool, the Metricalizer, which, as tests have shown, is quite reliable at detecting rhyme.10 In this way we generate a list in which each verse number is assigned either a 1 or a 0, depending on whether the verse rhymes. For simplicity, we aggregate this information to scenes and standardize it. Strictly speaking, we then have a constellation in which we are using all three pillars of our model. Figure 5.4 shows the result.
Figure 5.4. A line chart showing rhyme and variants aggregated to scenes in Goethe’s Faust. In many cases, an increase in variants per scene is accompanied by a decrease in rhymes.
Figure Description
This line graph shows the frequency of use of rhyme on one line, and of variants on another, across the fifty-four scenes in Goethe’s drama Faust. Notable differentials occur in scenes 1.1.7 through 1.1.13, scenes 1.1.15 through 1.1.20, 1.1.23, and scenes 2.3.1 through 2.4.2, with a peak differential in 2.3.2 of almost 3.0.
It is not easy to see, but in quite a number of cases an increase in variants per scene is accompanied by a decrease in rhymes, for example from scene 1.1.11 to 1.1.12, from 1.1.23 to 1.1.24, and especially from 2.3.1 to 2.3.3. If an increase in variants is more often than not accompanied by a decrease in rhymes, we can expect a negative correlation. This is indeed the case, with a value of -0.56 (p-value < 0.0001). Regarded philologically, there are probably two independent factors at work here: in the first part, the specific history of the transmission of these short scenes, and in the second part, the fact that, in the encounter with Helena in the third act, the Greek verse measures were revised relatively frequently.
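A sketch of how such a scene-level comparison might be assembled with pandas, using invented toy data; the exact standardization applied in the chapter is not spelled out, so the z-scores below are an assumption.

```python
import pandas as pd

# Toy verse-level data: scene membership, rhyme flag, and variant count.
df = pd.DataFrame({
    "scene":    ["1.1.1", "1.1.1", "1.1.2", "1.1.2", "1.1.3", "1.1.3"],
    "rhymed":   [1, 1, 1, 0, 0, 0],   # 1 if the verse is part of a rhyme
    "variants": [0, 1, 1, 2, 3, 4],   # number of variants per verse
})

# Aggregate to scenes and standardize each column (z-scores).
per_scene = df.groupby("scene").mean()
standardized = (per_scene - per_scene.mean()) / per_scene.std()
print(standardized)

# Correlation between rhyme frequency and variant frequency across scenes.
print(per_scene["rhymed"].corr(per_scene["variants"]))
```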
Two Cultures
The type of research proposed here poses particular challenges for those who would undertake it, as very different competencies must be present. To describe it ideally: on the scholarly editing side, deep historical knowledge is required (including the ability to read difficult hands and decipher complex manuscripts). On the side of CLS, it requires knowledge of methods and approaches with an emphasis on programming, machine learning, and statistics. The state of methodological development and digitization skill varies widely. In the field of scholarly editing there are a number of well-known and robust methods; but then, the digital turn is still happening for many practitioners. In the field of CLS, there are only a few methods that are really robust under all circumstances and that are always well understood in their dependencies; new methods are being developed in rapid succession and the new possibilities of digital tools are being tested very close to the cutting edge of development.
These differences are reinforced by different forms of knowledge and frames of thinking. Editors tend to think in terms of literary or historical studies, interpretively or, in a broad sense of the concept, hermeneutically, whereas the representatives of CLS think in terms of applying their methods in the contexts of empirical work as it has developed in the social sciences, for example. This concerns not least the handling of errors, or what we might call the “error culture” of the two fields. In the field of scholarly editing, the ideal of the perfect text applies: errors are a stigma. At the same time, everyone knows that there are always mistakes. In the field of CLS, on the other hand, it is assumed that all methods are flawed; consequently, it is necessary to make empirically based statements about their error rates in order to judge the usefulness of tools, approaches, and algorithms. As errors cannot be avoided, they have to be measured and analyzed.
These differences make cooperation more difficult in practice. Above all, they mean that such cooperation takes time, so that those involved can get used to the respective customs and practices of the other side and adjust to them. In my experience, it is only when this happens, when deviation from the rules of one’s own field is accepted not as a mistake but as a legitimate characteristic of the other field, that productive cooperation can begin. Through this cooperation, however, completely new insights can be gained, especially when the procedures presented here are applied across several texts, indeed across many editions.
Notes
My thanks to two members of my working group: Thorsten Vitt, who provided me with a version of the text of Faust formatted to make the analysis easy, and Leonard Konle, who applied the emotion detection.
1. Modeling has been recognized as one of the main activities in the digital humanities (DH); see for example Willard McCarty, Humanities Computing (London: Palgrave Macmillan, 2005). My understanding owes a lot to my conversations with Julia Flanders and is documented in our introductions to The Shape of Data in Digital Humanities: Modeling Texts and Text-Based Resources, ed. Julia Flanders and Fotis Jannidis (London: Routledge, 2018).
2. This essay owes much to a paper by Gerrit Brüning, which he presented in 2017 at a conference I organized and which has only now appeared. Brüning outlines, first theoretically and then, in his 2021 supplement, also practically, how one can classify variants and thereby possibly assign some of the variants to an intention of the author. To my knowledge, his contribution is the first to deal systematically with the question of how to utilize variants in editions quantitatively in order to answer questions in literary studies. See Gerrit Brüning, “Modellierung von Textgeschichte. Bedingungen digitaler Analyse und Schlussfolgerungen für die Editorik,” in Digitale Literaturwissenschaft. DFG-Symposium 2017, ed. Fotis Jannidis (Berlin: Metzler and Springer, 2022): 307–37. The classification of variants is also discussed in Erik Ketzan and Christof Schöch, “Classifying and Contextualizing Edits in Variants with Coleto: Three Versions of Andy Weir’s The Martian,” Digital Humanities Quarterly 15, no. 4 (2021), digitalhumanities.org/dhq/vol/15/4/000579/000579.html.
3. Johann Wolfgang Goethe, Faust, Historisch-kritische Edition, ed. Anne Bohnenkamp, Silke Henke, and Fotis Jannidis, in collaboration with Gerrit Brüning, Katrin Henzel, Christoph Leijser, Gregor Middell, Dietmar Pravida, Thorsten Vitt, and Moritz Wissenbach, release candidate 1.2 (Frankfurt am Main / Weimar / Würzburg, 2023), faustedition.net/.
4. Most of these methods are explained in detail in one of the monographs that have helped to establish the field: Matthew L. Jockers, Macroanalysis: Digital Methods and Literary History (Urbana: University of Illinois Press, 2013); Andrew Piper, Enumerations: Data and Literary Study (Chicago: University of Chicago Press, 2018); Ted Underwood, Distant Horizons: Digital Evidence and Literary Change (Chicago: University of Chicago Press, 2019).
5. Ryan Heuser and Long Le-Khac, A Quantitative Literary History of 2,958 Nineteenth-Century British Novels: The Semantic Cohort Method, Stanford Literary Lab Pamphlet 4 (Stanford, Calif.: Stanford Literary Lab, 2012); Underwood, Distant Horizons.
6. See Vladimir I. Levenshtein, “Binary Codes Capable of Correcting Deletions, Insertions, and Reversals,” Soviet Physics Doklady 10, no. 8 (1966): 707–10. This English translation and its Russian original are cited in the Wikipedia page for “Levenshtein Distance” (last modified 8/3/2023).
7. See Nils Reimers and Iryna Gurevych, “Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics (2019), arxiv.org/abs/1908.10084.
8. See, e.g., Anton Ehrmanntraut, Thora Hagen, Fotis Jannidis, Leonard Konle, Merten Kröncke, and Simone Winko, “Modeling and Measuring Short Text Similarities. On the Multi-Dimensional Differences between German Poetry of Realism and Modernism,” Journal of Computational Literary Studies 1, no. 1 (2022), doi.org/10.48694/jcls.116.
9. See Yann Mathet et al., “The Unified and Holistic Method Gamma (γ) for Inter-Annotator Agreement Measure and Alignment,” Computational Linguistics 41, no. 3 (2015): 437–79, doi.org/10.1162/COLI_a_00227.
10. For the Metricalizer, see web.archive.org/web/20230308082231/https://metricalizer.de/en/. See also Klemens Bobenhausen, “The Metricalizer–Automated Metrical Markup of German Poetry,” in Current Trends in Metrical Analysis, ed. Christoph Küper (Frankfurt am Main: Peter Lang, 2011): 119–31.