Notes
1. Charles Mackay is best known today for Memoirs of Extraordinary Popular Delusions and the Madness of Crowds.
2. Lord (1960, p. 101): “But if we are pursuing a will-o'-the-wisp when we seek an original, we are deluded by a mirage when we try to construct an ideal form of any given song. If we take all the extant texts of the song of Smailagić Meho and from them extract all the common elements, we have constructed something that never existed in reality or even in the mind of any of the singers of that song. We have simply then the common elements in this restricted number of texts, nothing more, nothing less.”
3. Lord (1960, pp. 22, 36) makes the equivalence explicit: “When we speak a language, our native language, we do not repeat words and phrases that we have memorized consciously, but the words and sentences emerge from habitual usage. This is true of the singer of tales working in his specialized grammar. He does not ‘memorize’ formulas, any more than we as children ‘memorize’ language. He learns them by hearing them in other singers' songs, and by habitual usage they become part of his singing as well. Memorization is a conscious act of making one’s own, and repeating, something that one regards as fixed and not one’s own. The learning of an oral poetic language follows the same principles as the learning of language itself, not by the conscious schematization of elementary grammars but by the natural oral method.”
4. We realize that viewing a text as a one-dimensional sequence of tokens is a simplification. A long tradition in the digital humanities treats text modeling as a process of text encoding in markup languages such as SGML and XML. DeRose et al. (1990), for example, in “What is Text, Really?” enunciated the “Ordered Hierarchy of Content Objects” model of text. Renear et al. (1996) soon after refined the scope of individual hierarchies to “analytical perspectives” that, like sentences and verses in enjambment, might overlap. Researchers have also described natural and artificial languages with hierarchical models under the name of (probabilistic) context-free grammars, formalized in the 1950s. We might also model the actual layout of graphical signs on a page or the succession of images and sounds in time. For our present purposes, however, an intuitive notion of “plain text” will suffice.
5. Chomsky (1956) notoriously observed of language models that “[w]hatever the other interest of statistical approximation in this sense may be, it is clear that [they] can shed no light on the problems of grammar.” In this chapter we will elide such questions about the human language faculty and simply exploit probabilistic models to express our uncertainty about how to describe the contents and transmission of texts. It is worth noting, however, that psycholinguistic experiments by Frazier, Hale, Levy, and others (ZZZ) have shown a quantitative relationship between the uncertainty about what words a reader is likely to see next and the amount of time spent reading.
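To make the quantity concrete: a word's surprisal is its negative log probability given the preceding context. A minimal sketch under a toy bigram model (our own illustration, with a made-up corpus, not the experimental setup of those studies):

    import math
    from collections import Counter

    # Per-word surprisal, -log2 p(word | previous word), under a bigram model
    # with add-one smoothing; the corpus is made up for illustration.
    corpus = "the singer sings the song and the singer sings".split()
    bigrams = Counter(zip(corpus, corpus[1:]))
    unigrams = Counter(corpus)
    vocab = len(unigrams)

    def surprisal(prev, word):
        p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab)
        return -math.log2(p)

    for prev, word in zip(corpus, corpus[1:]):
        print(f"{word}: {surprisal(prev, word):.2f} bits")

Higher surprisal, on this account, predicts longer reading times.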
6. OCR correction models perform inference over the distribution p(T | o), which is the inverse of the conditional probability p(o | T) of the noisy channel model discussed above; by Bayes' rule, the two are related as p(T | o) ∝ p(o | T) p(T). Such models can still, however, be used as components in a generative model similar to the HMM we discussed (cf. Wolf-Sonkin et al. 2018).
7. The careful reader will notice that for “classic” Levenshtein distance, we added the costs of the different edit operations, whereas here it makes sense to multiply the probabilities of successive edits to the input. In practice, we usually add the negative log probabilities of the edits, so the algorithms are the same. In general, we can show that finite-state algorithms work with any set of weights, and operations on them, that forms a mathematical semiring (Mohri et al. 2002); additive costs, log probabilities, and probabilities do in fact obey the semiring constraints.
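To illustrate the point (a sketch of our own, not code from Mohri et al.): the same dynamic program computes classic Levenshtein distance under the tropical semiring, where min plays the role of addition and + that of multiplication, and total path probability under the probability semiring; only the two operations and the multiplicative identity change.

    # Edit "distance" parameterized by a semiring (oplus, otimes, one).
    # Tropical semiring (min, +, 0): classic Levenshtein distance.
    # Probability semiring (+, *, 1): total probability over all edit paths.
    def edit_score(s, t, oplus, otimes, one, cost):
        # cost(a, b) weights substituting a -> b, inserting (a is None),
        # or deleting (b is None).
        m, n = len(s), len(t)
        d = [[one] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            d[i][0] = otimes(d[i - 1][0], cost(s[i - 1], None))
        for j in range(1, n + 1):
            d[0][j] = otimes(d[0][j - 1], cost(None, t[j - 1]))
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                d[i][j] = oplus(
                    oplus(otimes(d[i - 1][j], cost(s[i - 1], None)),    # delete
                          otimes(d[i][j - 1], cost(None, t[j - 1]))),   # insert
                    otimes(d[i - 1][j - 1], cost(s[i - 1], t[j - 1])))  # substitute
        return d[m][n]

    # The tropical semiring recovers classic Levenshtein distance:
    print(edit_score("kitten", "sitting", min, lambda a, b: a + b, 0,
                     lambda a, b: 0 if a == b else 1))  # 3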
8. Unlike what compositors do, this sense of compose describes a generalization of function composition to relations. Note the order convention: for functions f and g, (f ∘ g)(x) = g(f(x)), so that f applies first, the reverse of the usual function-composition notation.
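As a toy illustration of relational composition, with relations represented as sets of pairs rather than as transducers (our own sketch):

    # (R o S) = {(x, z) : (x, y) in R and (y, z) in S}; R applies first,
    # matching the transducer-composition convention above.
    def compose(R, S):
        return {(x, z) for (x, y) in R for (y2, z) in S if y == y2}

    R = {("a", "b"), ("a", "c")}   # a relation may map one input to many outputs
    S = {("b", "B"), ("c", "C")}
    print(compose(R, S))           # {('a', 'B'), ('a', 'C')}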
9. If we constrain the stemma so that each witness only has a single parent, the message-passing inference algorithms in this section will provide exact results. Language-model features of nodes could be used in unsupervised learning for tree structures (e.g., Xu and Smith 2018). Stemmata where witnesses have multiple influences and “contamination” require approximate inference.
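For concreteness, a minimal sketch of the upward (leaves-to-root) pass of such a message-passing algorithm on a single-parent stemma; the two-state edge potentials and witness names are invented for illustration, not taken from any of the systems cited.

    import numpy as np

    # children[v] lists v's children; edge_pot[(u, v)][i, j] weights parent
    # state i against child state j; evidence[v] scores the observed leaves.
    def upward_pass(root, children, edge_pot, evidence, n_states):
        def message(v):
            # Belief at v: local evidence times messages from all children.
            belief = evidence.get(v, np.ones(n_states)).copy()
            for c in children.get(v, []):
                belief *= edge_pot[(v, c)] @ message(c)
            return belief
        return message(root)   # unnormalized posterior over the root's states

    children = {"archetype": ["w1", "w2"]}
    edge_pot = {("archetype", "w1"): np.array([[0.9, 0.1], [0.2, 0.8]]),
                ("archetype", "w2"): np.array([[0.7, 0.3], [0.4, 0.6]])}
    evidence = {"w1": np.array([1.0, 0.0]), "w2": np.array([0.0, 1.0])}
    print(upward_pass("archetype", children, edge_pot, evidence, 2))

Because each witness has a single parent, this pass touches every edge exactly once, which is why the results are exact.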
10. Paul and Eisner (2012) proposed using n-gram models as variational approximations for solving the more constrained Steiner consensus string problem. This is an NP-hard problem (Gusfield 1997), but in many cases exact solutions can be found with dual decomposition. The Steiner consensus string problem differs from our HMM model of serial reprinting because, in our notation, the values of all the Ti must be identical. Xu and Smith (2017) proposed a different solution to the consensus problem that approximated the construction of a stemma (or guide tree) before rescoring the resulting multiple sequence alignment with a character language model. This approach also incorporated the assumption that the underlying strings were identical.
11. Since these are character-level n-gram models, we can practically estimate up to n = 20 on each set of witnesses. In our implementation, the unigram model, and hence all higher orders, determines the alphabet of the underlying texts Ti to be the union of all possible OCR output characters. This assumption is clearly violated when, e.g., the true text employs long s or other characters not handled by the OCR system. Although simply placing a flat prior distribution over, say, all Unicode glyphs would be impractical, some side information about the likely language and character distribution of a given text would often be obtainable.
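A minimal sketch of such an estimate (our own illustration; the witness strings and the choice n = 3 are hypothetical):

    from collections import Counter

    # Character n-gram counts over a set of OCR'd witnesses.
    witnesses = ["the courfe of true loue", "the course of true love"]
    n = 3

    # The alphabet is the union of all characters in the OCR outputs, so
    # characters absent from every witness (e.g., long s) get no mass.
    alphabet = set("".join(witnesses))

    ngrams, contexts = Counter(), Counter()
    for w in witnesses:
        padded = "#" * (n - 1) + w          # '#' marks the start of a string
        for i in range(len(w)):
            ngrams[padded[i:i + n]] += 1
            contexts[padded[i:i + n - 1]] += 1

    def prob(history, c):
        # Add-one smoothed p(c | history) over the witness-derived alphabet.
        return (ngrams[history + c] + 1) / (contexts[history] + len(alphabet))

    print(prob("th", "e"))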
12. One advantage of the neural network approaches is that, despite expensive training times, inference is predictably fast at test time, without the risk of a blowup in the number of states that inference with finite-state methods can incur.
13. To extract local alignments using Smith-Waterman, we do not use simple finite-state composition followed by a shortest-path algorithm as described above. We can reduce the amount of memory needed by ignoring the unaligned prefixes and suffixes in the strings to be aligned when outputting the best path. We describe this dynamic programming algorithm in detail in the appendix to Wilkerson et al. (2015).
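For reference, a minimal sketch of the basic Smith-Waterman recurrence with a linear gap penalty (our own illustration, not the memory-reduced algorithm of that appendix):

    # Smith-Waterman local alignment score; parameters are illustrative.
    def smith_waterman(s, t, match=2, mismatch=-1, gap=-1):
        m, n = len(s), len(t)
        H = [[0] * (n + 1) for _ in range(m + 1)]
        best = 0
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                sub = match if s[i - 1] == t[j - 1] else mismatch
                # The 0 floor lets an alignment begin anywhere, which is
                # what makes the algorithm local rather than global.
                H[i][j] = max(0,
                              H[i - 1][j - 1] + sub,   # match / mismatch
                              H[i - 1][j] + gap,       # gap in t
                              H[i][j - 1] + gap)       # gap in s
                best = max(best, H[i][j])
        return best

    print(smith_waterman("reprinting", "printers"))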
14. In this way, to appropriate a term from programming languages, citations and the editions they point to are not homoiconic, standing for themselves, but homoioiconic, standing for a distribution over more or less similar expressions. Such ontological controversies can feed on themselves, as Jones (1964, p. 964, original in PG XLVI. 557) notes: “Gregory of Nyssa gives an amusing picture of Constantinople in the final stage of the Arian controversy: ‘If you ask about your change, the shopkeeper philosophizes about the Begotten and the Unbegotten; if you enquire about the price of a loaf, the reply is: “The Father is greater and the Son inferior”; and if you say, “Is the bath ready?” the attendant affirms that the Son is of nothing.’”
15. In an earlier era, for ‘model’ read ‘philosophy’: Edward Gibbon, Essai sur l'étude de la littérature, p. 65: “L'histoire est pour un esprit philosophique, ce qu'étoit le jeu pour le Marquis de Dangeau. Il voyoit un systême, des rapports, une suite, là, où les autres ne discernoient que les caprices de la fortune.” (History, for a philosophical mind, is what gambling was for the Marquis de Dangeau. He saw a system, connections, a sequence, where others discerned only the whims of fortune.)