# Lazy Compositors

As noted above, we aim to describe the texts that circulated in a culture of reprinting as language models, or samples from language models. Beyond this issue of representation, we aim to use language models to discover the existence and map the extent of reprinting. As we saw in the previous section, commonly used language models for text classification and information retrieval, which assume a known set of classes or a known query, are not good fits for our task.

In this section, we discuss a worked example of building a language model of a much more restricted reprinting phenomenon: the persistence of advertisements and other “boilerplate”, such as mastheads and lists of advertising rates, from one issue of a newspaper to the next. We will model this restricted reprinting problem as a **hidden Markov model** (HMM) over successive newspaper issues. Classic applications of HMMs to part-of-speech tagging and other sequence-labeling problems model a series of individual tokens, e.g., words and their parts of speech; here, we perform inference on *sequences of language models*, one for each successive newspaper issue, in order to motivate the alignment algorithms introduced below.

By illustrating each step in building this model, we hope to explain some general principles in treating textual criticism as language modeling. This necessitates some excurses into the machinery of probabilistic formal languages. Readers eager to find out *what methods we actually used for the Viral Texts project* can skip ahead.

## Straight-Line Stemmata

Unlike the examples of reprinting spanning continents and oceans that ran through our earlier discussion, the processes that result in reprints from one issue of a publication to the next are local—confined within a single newspaper office. The fundamental assumption we make about this local text reuse is that it was *lazy* or, more charitably, resulted from the economy of effort necessary to compose four large sheets or more every day. Having taken the order to print an advertisement for shoes or stoves or runaway slaves, the editor and compositor could simply keep pieces of today’s paper set up in type in preparation for tomorrow. Tomorrow’s advertisement did not necessarily end up on the same place on the page as today’s, but there was no need to break up the type (see figure). This assumption of laziness means that any change in the advertisement from one printing to the next will take the earlier version as a starting point and not start afresh from some other (unobserved) manuscript or clipped-out copy.

These assumptions thus fully specify the transmission of these boilerplate texts as a linear chain from one issue to the next. In the language of textual criticism, this allows us to define a very simple *stemma* for each of these texts, without any branching or contamination.

Besides the work of editors and compositors, we need to consider some additional phenomena to work with the evidence from local, boilerplate reprinting. Our evidence of the text of newspapers results, as we have seen, from a complex process of archival preservation of newspapers, microfilming, digital photography of the pages or microfilm, image processing, and optical character recognition of the images (Mak 2014; Cordell 2017). In other words, we observe any text in a mass-digitized newspaper mediated by a noisy channel—more specifically, a cascade of different noisy channels, representing the particular errors that arise from the preservation of newsprint (e.g., the crumbling of pages at the outside edge or along folds), from photography, or from OCR (e.g., mismatch between the training set’s observations of ‘e’ and the shape of that letter in a particular printer’s typeface). The output from one phase, such as an image file resulting from photographing a library’s collections, serves as the input to the next phase, such as OCR.

As a consequence, we find it useful to speak not of *the* text of a particular newspaper article but of a distribution over likely readings of that article—in other words, a language model. The evidence for these likely readings might come from a single OCR output, from a distribution over transcripts from a single OCR system, from a set of transcripts from an ensemble of OCR systems, from a single OCR system transcribing multiple images of the same page, or from human transcripts. All of the mass-digitized newspaper archives that we have consulted present the reader with a single OCR transcript for each page. Most library digitization projects, one assumes, cannot afford, or have not benefited from, the storage and processing of multiple images or OCR outputs for each page.

## Hidden Markov Models

Consider Chronicling America’s OCR transcripts for the same line of an advertisement from seven consecutive issues of the Richmond Dispatch:

`1 the Trader. Bank of the city of Richmoud, to be`

`2 tbe Traders' Bank or the city of Biebmond, to bo`

`3 tbe Traders' Bank of the city of Klchmoud, to be,`

`4 the Traders' Hank of the city of Richmoud, lo be j`

`5 the Trader*' Bsnk of the city of Richmond, to be`

`6 the Traders' Hank of the city of Richmond, to he`

`7 tha Traders' Bank of the cltv of Richmond to be`

Label these OCR observations *o*_{1}, … , *o*_{7}. Let us represent the distribution over the text in each of these issues by the random variables *T*_{1}, … , *T*_{7}. As we suggested above, when dealing with texts left set up in type from one issue to the next, the text of one issue depends only on the previous one. We thus have a conditional language model of *T*_{i} depending on the previous *T*_{i − 1}, i.e., *p*(*T*_{i} | *T*_{i − 1}). In other words, in the terminology introduced above, this sequence of boilerplate texts has the Markov property. Note this difference, however: Markov’s idea for n-gram language models made the fairly *unrealistic*, but computationally useful, assumption that a word *w*_{i} might be conditionally independent of most previous words given the immediate history of *n* words *w*_{i − n}, … , *w*_{i − 1}. Here, the dependence of one reprint on the previous one is more grounded in the physical process of composing a newspaper. As in the noisy channel model introduced above, we assume that an observed, noisy OCR transcript *o*_{i} depends on the underlying reading of the text, so we also have conditional models of the OCR output *p*(*o*_{i} | *T*_{i}).

Because the text of each issue’s reprint of the ad is related, we can pool the evidence from all of the issues and reason about all of the texts *T*_{i} together. In the language of probability, we can ask about the **joint distribution** over all, and the **marginal distribution** over each, of the observed and unobserved variables. Because of the (first-order) Markov property among successive issues and the independence of the OCR outputs for each issue, the joint distribution can be factored as

*p*(*T*_{1}, … , *T*_{7}, *o*_{1}, … , *o*_{7}) = *p*(*o*_{1} | *T*_{1}) *p*(*T*_{1}) ∏_{i = 2}^{7} *p*(*o*_{i} | *T*_{i}) *p*(*T*_{i} | *T*_{i − 1})

This particular model structure has been widely studied and applied in NLP and other fields under the name of a **hidden Markov model** (HMM). It comprises a Markov model (cf. the discussion of n-gram language models) over the latent (i.e., hidden) variables *T*_{i}, which we can only observe through the *o*_{i} variables, noisy “emissions” from, or “measurements” of, the latent *T*_{i}. While HMMs have been applied in speech recognition since the 1970s, Ken Church and Steve DeRose simultaneously published the first applications of HMMs to structured prediction in NLP, with their 1988 work on part-of-speech tagging (Church 1988; DeRose 1988).

The figure above depicts an HMM for the first three texts using the notation of directed graphical models or Bayes' nets. The *o*_{i} variables representing the OCR output for each issue are shaded because we actually observe them; the *T*_{i} variables representing probability distributions over the true text of each issue are unshaded because we do not observe them. Each directed edge expresses the conditional probability of a child variable given its parent(s). In this model, each variable only has a single parent. Since *T*_{2} stands athwart the only path from *T*_{1} to *T*_{3}, we conclude that *T*_{3} is conditionally independent of *T*_{1} given *T*_{2}. This is simply another way of saying that this sequence has a first-order Markov property.
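This conditional-independence property can be checked numerically on a toy chain. The sketch below (with made-up transition probabilities, not anything estimated from newspaper data) builds a joint distribution over three binary variables from the factorization *p*(*T*_{1}) *p*(*T*_{2} | *T*_{1}) *p*(*T*_{3} | *T*_{2}) and confirms that *p*(*T*_{3} | *T*_{1}, *T*_{2}) = *p*(*T*_{3} | *T*_{2}):

```python
import itertools

# Toy numeric check (made-up probabilities, not estimated from any corpus)
# that T3 is independent of T1 given T2 under a first-order Markov chain.
p1 = {0: 0.6, 1: 0.4}                              # p(T1)
pt = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}    # p(T_next | T_prev)

joint = {(a, b, c): p1[a] * pt[a][b] * pt[b][c]
         for a, b, c in itertools.product((0, 1), repeat=3)}

max_diff = 0.0
for a, b, c in joint:
    # p(T3 = c | T1 = a, T2 = b), read off the joint distribution
    cond_ab = joint[(a, b, c)] / sum(joint[(a, b, x)] for x in (0, 1))
    # p(T3 = c | T2 = b), marginalizing out T1
    cond_b = (sum(joint[(x, b, c)] for x in (0, 1)) /
              sum(joint[(x, b, y)] for x in (0, 1) for y in (0, 1)))
    max_diff = max(max_diff, abs(cond_ab - cond_b))

print(max_diff)  # zero, up to floating-point rounding
```

Whatever numbers we choose for the conditional tables, the two quantities agree: knowing *T*_{1} adds nothing once *T*_{2} is known.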

To specify this HMM, we need to define the conditional distributions, depicted on the edges of the figure, of the form *p*(*T*_{i} | *T*_{i − 1}) and *p*(*o*_{i} | *T*_{i}). As noted above, the values that the random variables *T*_{i} take on are strings, and a language model just is a probability distribution over strings. Note the difference in levels of representation: while n-gram Markov models are usually employed for sequences of symbols such as words or characters, we are modeling a sequence of strings, which themselves contain characters or words. This use/mention problem sometimes arises in structured prediction in machine learning, where we might use graphical models to model graphs or neural networks to model networks.

Probability models over output strings given an input string have recently become popular under the name of **sequence-to-sequence** or **encoder-decoder** models. Implemented using recurrent neural network architectures, they produce state-of-the-art results in machine translation (ZZZ), summarization (ZZZ), morphological generation (Wolf-Sonkin et al. 2018), and many other tasks. Neural encoder-decoder models have also been applied to the task of OCR correction (Schnober et al. 2016; Dong and Smith 2018), reading observed OCR output *o* and outputting hypothesized underlying text *T*.[6] These neural models, however, have tens of thousands of parameters. These parameters can be adjusted with sufficient supervised input-output pairs—i.e., from a corpus of OCR outputs with known ground-truth transcriptions—or with more time-consuming unsupervised training schemes. Even a trigram language model over 60 upper- and lower-case characters plus space and a few punctuation marks might have tens of thousands of parameters, up to a maximum of 60^{3} = 216,000.

In this section, however, we will describe a model with only four parameters to provide an example of reasoning about textual variation using language models. This is a “null model” that does not capture facts such as a particular compositor’s or OCR system’s propensity to make particular errors; it merely captures the Markov relation among successive issues. When faced with complex phenomena, simplicity is not necessarily a virtue in itself (Breiman 2001), but simple models can be useful as explanatory devices and as benchmarks for more complex models to prove themselves against.

## Regular Expressions and Finite-State Automata

To represent language models, we will use **weighted finite-state automata**, a generalization of the Markov, or n-gram, models described above. Finite-state automata (FSAs, also known as finite-state machines) describe a set of formal languages called **regular languages** and are often written down using **regular expressions**. Regular expressions are probably most familiar as a notation for “fuzzy” searching, such as the early Unix tool `grep` or global regular expression print. Say that we have the OCR output (from *o*_{2} above)

`the city of Biebmond, to bo`

and want to find similar strings, taking into account that lower-case `e` and `o` are often confused in this font. We could then search for the regular expression

`th[eo] city [eo]f Bi[eo]bm[eo]nd, t[eo] b[eo]`

where the brackets are the common shorthand for a set of alternative characters. This regular expression represents the set or union of 2^{6} = 64 different strings, i.e., it would match any of the 64. The figure shows the beginning of the representation of this regular expression as an FSA. Each node in the graph represents a state. The labels on the directed arcs leaving each state show the characters that the automaton will accept when in that state. After accepting a character, the automaton traverses the corresponding arc to a new state.

Markov (n-gram) models are a special case of finite-state models where the states correspond to the possible histories and the arcs are labeled with items in the vocabulary *V*. For a first-order Markov model, there will therefore be |*V*| states, for a second-order Markov model there will be |*V*|^{2} states, and so on.

Just as a single string *s* is a degenerate language model that assigns probability 1 to *s* and probability 0 to any other string, this regular expression is also a language model that implicitly assigns probability 1/64 to each of these 64 alternatives and 0 to anything else. But of course, these 64 strings do not seem equally likely to us. For example, as we said above, *of* seems more likely than *ef* in most English contexts. We therefore wish to use *weighted* or *probabilistic* finite-state automata as a language model, to assign different probabilities to different strings. In this section, we will be interested in locally normalized probabilistic models, where all of the arcs departing from a state have probabilities that sum to 1.
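The arithmetic here is easy to verify directly. The following sketch (using Python’s `re` and `itertools`; the template string is our own illustrative scaffolding, not project code) expands the bracket expression into its 64 member strings and assigns each the uniform probability 1/64:

```python
import itertools
import re

# The bracket regular expression from the text, viewed as a uniform
# language model over its 2^6 = 64 member strings.
pattern = re.compile(r"th[eo] city [eo]f Bi[eo]bm[eo]nd, t[eo] b[eo]")

# Enumerate every accepted string by expanding each [eo] choice point.
template = "th{} city {}f Bi{}bm{}nd, t{} b{}"
variants = [template.format(*choice)
            for choice in itertools.product("eo", repeat=6)]

assert len(variants) == 64
assert all(pattern.fullmatch(v) for v in variants)

# The implicit language model: 1/64 for each accepted string, 0 otherwise.
uniform = {v: 1 / 64 for v in variants}
print(uniform["the city of Biebmond, to bo"])  # 0.015625
```

Any string outside the 64—`the city of Richmond`, for instance—fails to match and implicitly receives probability 0.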

We define the values of the *T*_{i} variables to be probabilistic finite-state automata. The *o*_{i} variables that represent the OCR output have, in our example, only a single value. They are thus degenerate, or “straight-line”, FSAs that accept only a single string.

## Transducers and Edit Distances

How then should we define the conditional distributions *p*(*T*_{i} | *T*_{i − 1}) and *p*(*o*_{i} | *T*_{i})? A convenient way of describing a relation between two regular expressions is a **finite-state transducer** (FST). Formally, an FST describes a *pair* of strings—conventionally called the input and output strings—and thus looks like an FSA with a pair of symbols on each arc. To take an example, say that we want to take the string *S*, `the city of Biebmond, to bo`, and turn it into the regular expression *E* where `e` and `o` are interchangeable but the input is otherwise unchanged. We would use the FST *A* in the figure, where each arc has an input symbol before the colon and a corresponding output symbol after the colon. We use the standard *ϱ* shorthand to match any character not otherwise matched. We can visualize this transducer “reading” through the input *S* character by character and outputting an arc for each match of the input. If the input were to present a character not matched by the transducer, the entire output would be null.

The practical problem with the finite-state automata and transducers we have seen so far is, as we said above, that all of the outputs are equally weighted. Clearly some possible rewritings of the input are more likely than others. A simple but general way of scoring the set of possible rewritings of a string is through **edit distance**. Traditionally, we define edit distance as a function of two strings *s*_{1} and *s*_{2} that outputs the minimum cost of the edit operations needed to transform one string into the other. Different edit-distance functions define different edit operations. The most widely used is **Levenshtein distance**, which allows the insertion, deletion, or substitution of a single character (or token). Levenshtein distance assesses a cost of 1 for each such edit operation and a cost of 0 for leaving a character unchanged. The series of edit operations that transforms *s*_{1} into *s*_{2} (or vice versa without loss of generality) with the least cost is the Levenshtein distance between *s*_{1} and *s*_{2}. Although the series of edit operations with minimum cost need not be unique, there is a finite maximum cost for finite inputs of |*s*_{1}| + |*s*_{2}|: simply delete all the characters in *s*_{1} and insert all the characters in *s*_{2} (or vice versa).
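The definition of Levenshtein distance above can be made concrete with the textbook dynamic program, to which we return below. This is a generic sketch, not the Viral Texts implementation:

```python
def levenshtein(s1: str, s2: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    turning s1 into s2 (the classic dynamic program, one row at a time)."""
    prev = list(range(len(s2) + 1))       # distance from "" to each prefix of s2
    for i, c1 in enumerate(s1, start=1):
        curr = [i]                        # distance from s1[:i] to ""
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + 1,              # delete c1
                curr[j - 1] + 1,          # insert c2
                prev[j - 1] + (c1 != c2),  # substitute (cost 0 if characters match)
            ))
        prev = curr
    return prev[-1]

# Two noisy readings from the running example differ by four substitutions.
print(levenshtein("the city of Richmond", "tbe city of Biebmond"))  # 4
```

Note that the maximum-cost bound from the text holds trivially: the distance can never exceed deleting all of *s*_{1} and inserting all of *s*_{2}.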

We can express Levenshtein distance as an FST with a single state. The left-hand figure below shows an example over just the characters `a` and `b`, for simplicity. When the empty string *ε* appears in the input (before the colon), the machine performs the insertion operation, adding the character after the colon to the output. Similarly, when *ε* appears in the output, it has the effect of deleting the corresponding input character. Both insertion and deletion show a numerical cost, conventionally written after a slash. For the rest, matching input and output characters incur 0 cost, and mismatched input and output characters incur a cost of 1. Adding up the costs on every arc traversed in the FST when composed with an input automaton on one side and an output automaton on the other results in the total cost of those operations. The cost of the shortest such complete path will be the Levenshtein distance.

As another example, Damerau-Levenshtein edit distance (ZZZ) allows the transposition of two adjacent tokens as well as the insertion, deletion, and substitution of basic Levenshtein distance. For correcting spelling in text typed on a particular keyboard, one could make edits of adjacent characters less costly than less plausible errors. For correcting OCR, one could give substituting characters with a similar appearance (e.g., `e` and `o` above) a lower cost. OCR often splits or merges characters, e.g. `the` to `tlie`. Edit distance functions such as Levenshtein that score at most a single character in each string at a time are less suited to modeling such errors.

These examples of edit distance add the (non-negative) cost of edit operations necessary to produce one string from another. To get the conditional probability models we need for an HMM—namely, *p*(*o*_{i} | *T*_{i}) and *p*(*T*_{i} | *T*_{i − 1}) in the equation above—we need to make these transducers produce probabilities for each input *s*_{1} that sum to 1 for all possible outputs *s*_{2}. We can prove, using the construction of Cotterell et al. (2014), that the transducer in the right-hand figure defines a proper conditional probability distribution in this sense for Levenshtein distance. Again, we show the transducer for the restricted input and output alphabet *Σ* = {*a*, *b*}. There is a single free parameter *α*, defined as the probability that a symbol will be copied unchanged from input to output. For each input character, the remaining 1 − *α* probability is divided evenly among the possible 2|*Σ*| edit operations: one deletion, |*Σ*| − 1 substitutions, and |*Σ*| insertions. The stopping probability is not a free parameter but is completely defined as the residual after summing the insertion probabilities, i.e., 1 − |*Σ*| (1 − *α*) / 2|*Σ*| = (1 + *α*) / 2. In other words, the only thing you can do after consuming all the input is to insert more symbols or stop. In the example transducer in the figure, *α* = 0.8.[7]
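These probabilities can be checked with simple arithmetic. The sketch below verifies, for *Σ* = {*a*, *b*} and *α* = 0.8 as in the figure, that the arcs leaving the transducer’s single state are locally normalized both mid-input and at the end of the input:

```python
# Local-normalization check for the probabilistic Levenshtein transducer,
# using the parameterization stated in the text.
alpha = 0.8            # probability of copying a symbol unchanged
sigma = ["a", "b"]     # restricted alphabet from the figure
k = len(sigma)

edit_prob = (1 - alpha) / (2 * k)  # shared by each of the 2|Sigma| edit operations

# Mid-input: copy, (|Sigma| - 1) substitutions, 1 deletion, |Sigma| insertions.
mid_total = alpha + (k - 1) * edit_prob + edit_prob + k * edit_prob
# End of input: only |Sigma| insertions or stopping remain.
stop_prob = (1 + alpha) / 2
end_total = k * edit_prob + stop_prob

print(mid_total, end_total)  # 1.0 1.0
```

Because the identity holds symbolically, the check succeeds for any alphabet size and any 0 < *α* < 1, not just the values in the figure.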

The standard algorithm to compute the Levenshtein edit distance between two strings *s*_{1} and *s*_{2} involves a dynamic program that considers all pairs of positions in the two inputs. See for example textbooks by Jurafsky and Martin (2018, §2.5) or Gusfield (1997, pp. 217ff.). Its runtime and memory use thus scale as the product of the lengths of the two inputs. This becomes impractical for long strings without techniques to prune the search space, a consideration to which we return below. One advantage of using more general FSTs is that we can compute the edit distance between two *regular languages*, i.e., potentially infinite sets of strings. For two language models represented as finite-state automata *L*_{1} and *L*_{2} and an edit FST *E*, we **compose** these machines together (Mohri et al. 2002) into a new FST *D* = *L*_{1} ∘ *E* ∘ *L*_{2}, which tells us the edit distance between every pair of strings in the cross-product of the two languages *L*_{1} × *L*_{2}.[8] To describe this procedurally, all strings in *L*_{1} will be tried as inputs to *E*, and all outputs of *E* will be checked for membership in *L*_{2}. The *shortest* path through *D* will tell us the minimum edit distance between any pair of strings in *L*_{1} and *L*_{2}. If, furthermore, *L*_{1} defines a probability distribution over strings, i.e., it is a language model, and *E* is a probabilistic edit model, then *L*_{1} ∘ *E* will result in the joint probability distribution *p*(*s*_{2} | *s*_{1}) *p*(*s*_{1}) = *p*(*s*_{1}, *s*_{2}), ∀ *s*_{1} ∈ *L*_{1}, *s*_{2} ∈ *L*_{2}. We can then use the range operator range(*L*_{1} ∘ *E*) to turn an FST into an FSA that recognizes strings in the output language; in other words, to give us a language model over strings according to their (marginal) probability of having been generated by *L*_{1} and edited by *E*. We can also use FSTs as noisy channel models to compute marginal probabilities over likely inputs to the channel given the outputs with the domain operator: domain(*E* ∘ *L*_{2}).
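For tiny finite languages, the effect of composing *L*_{1} ∘ *E* ∘ *L*_{2} and taking the shortest path can be simulated by brute force over the cross-product. The sketch below is our own illustration with made-up candidate sets; real FST composition avoids this enumeration and works even for infinite regular languages:

```python
from itertools import product

def lev(a: str, b: str) -> int:
    # standard Levenshtein dynamic program
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

# Two tiny "languages": hypothetical candidate readings from two witnesses.
L1 = ["the city of Biebmond", "the city of Richmond"]
L2 = ["the city of Richmoud", "tha city of Richmond"]

# Brute-force stand-in for the shortest path through L1 ∘ E ∘ L2:
# the minimum edit distance over the cross-product L1 × L2.
dist, s1, s2 = min((lev(a, b), a, b) for a, b in product(L1, L2))
print(dist)  # 1
```

The closest pair differs by a single substitution; the shortest path through the composed machine would report the same cost without enumerating every pair.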

Probabilistic Levenshtein distance is simple but captures many of the phenomena of imaging and OCR errors or compositorial accidents from one issue to the next. In the next section, we describe the somewhat more complicated edit distance (Smith-Waterman with an affine gap penalty) that we use for finding local regions of text reuse embedded in larger newspaper articles, pages, or issues. We now put together the machinery of finite-state language models and finite-state-transducer edit models to build a joint model of reprinted texts in successive newspaper issues.

## Message Passing with Language Models

The HMM in the figure, as we have seen, can be used to model a very simple stemma: the text *T*_{i − 1} on each day is transformed into the text *T*_{i} on the next by a simple process of moving the physical type to set the new issue. This process is modeled by *p*(*T*_{i} | *T*_{i − 1}), which we will define as a probabilistic Levenshtein distance FST with probability *μ* of correctly copying a character. We could model other processes, more complicated than simple movement of type—such as copying, the memorization of a teacher’s text in an oral academic culture, and so on—although something more complex than Levenshtein distance would probably work better in those cases. We do not observe the texts *T*_{i} directly, but only indirectly through OCR outputs *o*_{i}. We will also model the relationship between these variables *p*(*o*_{i} | *T*_{i}) with a probabilistic Levenshtein distance FST with probability *ν* of correctly copying a character. The edit model from “true” text to observed OCR output could, again, be summarizing some complicated chain of transmission, from archived document, to microfilming and digital photography, to image processing and OCR.

Having specified this model, we can now perform **inference** from observed data to hidden variables to answer various questions in textual criticism:

- Computing the **most likely reading**, according to the model, of the earliest version of the text: this is usually the most visible output of textual criticism, i.e., *the text* of an edition. The analogous task here would be to compute the most likely value of *T*_{1}.
- Computing the **most likely variant readings** of the text. Here, we would compute the *k* best paths in the *T*_{1} FSA or in all the states of the texts *T*_{1}, … , *T*_{N}.
- Computing the **expected value** and other moments of model parameters. Here, we could estimate the average OCR error rate or the expected number of changes that compositors made from one day to the next.
- More generally, **drawing samples** from a language model to examine its divergence from the textual evidence we have and to estimate parameters.

Another common task in textual criticism is to **infer the stemma** relating different witnesses. Here, we fix on a straight-line stemma based on assumptions about the process of newspaper production in the nineteenth century. We could, however, alternate inference about the stemma with inference about textual variation among witnesses (Bouchard-Côté et al., ZZZ).[9]

The standard inference methods for hidden Markov models are the Viterbi algorithm, for computing the most likely series of values for all the latent state variables (here, *T*_{i}), and the forward-backward or Baum-Welch algorithm, for computing the posterior distributions over the latent state variables and using those posteriors to compute parameter expectations. While standard presentations of these algorithms simply enumerate possible settings of the latent states at each step, we follow Dreyer and Eisner (2009) in generalizing them to allow the distributions over the latent variables' values to be finite-state language models, i.e., distributions over possibly infinite sets of strings. This means that the posterior distribution over the *T*_{i} variables will be a probabilistic finite-state automaton. The **messages** that adjacent variables send to each other during Viterbi or forward-backward inference are also probabilistic finite-state automata. For simplicity, we will focus here on the forward-backward algorithm.

The forward-backward algorithm proceeds by sending messages from each latent state variable to the next in order. Let *C* be the probabilistic Levenshtein transducer that models the distribution over compositor changes from *T*_{i} to *T*_{i + 1}, with parameter *c* being the probability of correctly transferring a character. Let *R* be the probabilistic Levenshtein transducer that models the imaging and OCR process from *T*_{i} to *o*_{i}, with parameter *r* being the probability of correctly recognizing a character. Define the distribution over possible readings of text *T*_{i} using only the evidence of *o*_{i}, and not the evidence of other witnesses, as:

*W*_{i} ≐ domain(*R* ∘ *o*_{i})

The forward messages can then be defined recursively as:

*m*(*T*_{i} → *T*_{i + 1}) ≐ range( [ *m*(*T*_{i − 1} → *T*_{i}) ∩ *W*_{i} ] ∘ *C* )

where the base-case message *m*(*T*_{0} → *T*_{1}) is simply defined as the uniform distribution, which can be ignored. We then also need to define **backward** messages as:

*m*(*T*_{i} → *T*_{i − 1}) ≐ domain( *C* ∘ [ *W*_{i} ∩ *m*(*T*_{i + 1} → *T*_{i}) ] )

The posterior distributions over each *T*_{i} can then be defined as:

*T*_{i} ≐ *m*(*T*_{i − 1} → *T*_{i}) ∩ *W*_{i} ∩ *m*(*T*_{i + 1} → *T*_{i})

The maximum a posteriori value for each text will thus be the path with the lowest cost, or highest probability, in each posterior FSA *T*_{i}.
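To make the message passing concrete, the following sketch (our own toy approximation, not the project’s FSA implementation) restricts each *T*_{i} to a small finite candidate set—the observed strings themselves—and runs ordinary forward-backward with crude exponentiated-edit-distance potentials standing in for the probabilistic Levenshtein transducers *C* and *R*:

```python
def lev(a: str, b: str) -> int:
    # standard Levenshtein dynamic program
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

# Shortened observations in the spirit of the running example.
obs = ["the city of Richmoud",
       "tbe city of Biebmond",
       "the city of Richmond"]
cands = obs                       # each T_i ranges over the observed strings
ER, EC = 0.1, 0.001               # per-edit weights: OCR noisy, compositor careful
emis = lambda t, o: ER ** lev(t, o)     # crude stand-in for R
trans = lambda a, b: EC ** lev(a, b)    # crude stand-in for C

# Forward messages.
f = [[emis(t, obs[0]) for t in cands]]
for o in obs[1:]:
    f.append([emis(t, o) * sum(f[-1][j] * trans(cands[j], t)
                               for j in range(len(cands)))
              for t in cands])

# Backward messages.
b = [[1.0] * len(cands)]
for o in reversed(obs[1:]):
    b.insert(0, [sum(trans(t, cands[j]) * emis(cands[j], o) * b[0][j]
                     for j in range(len(cands)))
                 for t in cands])

# Maximum a posteriori reading for each issue.
map_readings = []
for i in range(len(obs)):
    post = [f[i][j] * b[i][j] for j in range(len(cands))]
    map_readings.append(cands[post.index(max(post))])
print(map_readings)
```

Because compositor errors are modeled as far rarer than OCR errors, the pooled evidence pulls every witness—including the two noisy ones—toward the consensus reading `the city of Richmond`, miniature analogues of the collation results shown below.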

Dreyer and Eisner (2009) note that, unlike HMMs whose hidden state variables take on a fixed number of values, the size of the FSA messages in FSA-valued HMMs can grow exponentially as messages move along the chain. They use simple threshold pruning, where all arcs that belong to paths within a multiplicative factor *δ* of the best path in the FSA are kept, and the rest removed. This is insufficient for the simple HMM we have defined here. Since all edits are weighted equally, pruning the evidence *W*_{i} from each OCR output *o*_{i} simply results in putting an upper bound on the *number* of edits distant from *o*_{i} that we will search. We would like to keep only those candidate edits of *o*_{i} that could plausibly have arisen from the *other* OCR evidence. As we noted above, an n-gram model is a special case of a finite-state language model. We can thus use an n-gram model as an approximation for a finite-state, or even more complex, language model. We estimate a character n-gram model *L*_{n} from the set of OCR output for all witnesses {*o*_{1}, … , *o*_{N}}. We then score the possible edits of *o*_{i} with this language model by intersecting them, prune the results that are more than a factor *δ* away from the probability of the best path (here, the hypothesis that *o*_{i} is in fact the true value of *T*_{i}), and then remove the weights to give a **filter machine** *F*_{i}.

*F*_{i} ≐ unweight(prune_{δ}[domain(*R* ∘ *o*_{i}) ∩ *L*_{n}])

We can then intersect this unweighted filter with the set of edits to get a refined evidence distribution *W*_{i}.

*W*_{i} ≐ domain(*R* ∘ *o*_{i}) ∩ *F*_{i}

By using an unweighted filter, all paths in *W*_{i} retain the same weights they had before pruning, while edits of *o*_{i} that are unlikely under the n-gram model *L*_{n} are removed. By using an n-gram model estimated on *noisy* OCR output, we can speed up the inference process without either having to clean up the text first in order to estimate a language model for cleaning up the text, or having to employ a language model estimated on outside clean data that matches the current text less well.[10] We will return to this use of n-gram models on noisy text when we discuss efficient reprint detection.
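The value of scoring candidate readings with an n-gram model estimated on the noisy witnesses themselves can be seen even in miniature. The sketch below is our own illustration (the project uses the OpenGrm package, not this ad hoc estimator): it counts character trigrams over the seven OCR lines and scores two candidate readings with add-one smoothing:

```python
import math
from collections import Counter

# The seven noisy witnesses from the running example.
ocr = ["the Trader. Bank of the city of Richmoud, to be",
       "tbe Traders' Bank or the city of Biebmond, to bo",
       "tbe Traders' Bank of the city of Klchmoud, to be,",
       "the Traders' Hank of the city of Richmoud, lo be j",
       "the Trader*' Bsnk of the city of Richmond, to be",
       "the Traders' Hank of the city of Richmond, to he",
       "tha Traders' Bank of the cltv of Richmond to be"]

# Character trigram and bigram counts pooled over all witnesses.
tri = Counter(line[i:i + 3] for line in ocr for i in range(len(line) - 2))
bi = Counter(line[i:i + 2] for line in ocr for i in range(len(line) - 1))

def logp(s: str, vocab: int = 60) -> float:
    """Add-one smoothed trigram log-probability of a candidate string.
    The vocabulary size is an assumed constant, not estimated."""
    return sum(math.log((tri[s[i - 2:i + 1]] + 1) / (bi[s[i - 2:i]] + vocab))
               for i in range(2, len(s)))

# The majority reading outscores a minority edit, even though the model
# was estimated on uncorrected OCR.
print(logp("the Traders' Bank") > logp("tbe Traders' Hank"))  # True
```

Even this crude estimator prefers readings whose trigrams recur across witnesses, which is exactly the property the filter machine *F*_{i} exploits.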

## Collation and Denoising

We now return to the OCR output for seven successive issues of the Richmond Dispatch, which we use as evidence variables *o*_{1}, … , *o*_{7}:

`1 the Trader. Bank of the city of Richmoud, to be`

`2 tbe Traders' Bank or the city of Biebmond, to bo`

`3 tbe Traders' Bank of the city of Klchmoud, to be,`

`4 the Traders' Hank of the city of Richmoud, lo be j`

`5 the Trader*' Bsnk of the city of Richmond, to be`

`6 the Traders' Hank of the city of Richmond, to he`

`7 tha Traders' Bank of the cltv of Richmond to be`

We have defined an HMM for a sequence of observations of noisy OCR with two model parameters and two pruning parameters:

- *r*, the probability that a given character will be correctly recognized by OCR;
- *c*, the probability that a compositor will correctly transfer a given character from one issue to the next;
- *δ*, the threshold for pruning messages during inference; and
- *n*, the order of the character n-gram model used to filter messages from *o*_{i}.

We first need to set values for these four parameters. We could set them empirically—either unsupervisedly, by maximizing the likelihood of a given corpus of OCR outputs, or supervisedly, by maximizing over a set of paired OCR outputs and manual transcriptions. For our present purposes, we will choose some reasonable values a priori and then explore the effect of manipulating some of them. The easiest conceptually is *r*, the probability of the OCR correctly recognizing a given character, which we set to 0.85, the average accuracy of the Richmond Dispatch newspapers in Chronicling America. We start by setting *c* to 1 − 10^{-4}, so that errors by compositors are three orders of magnitude less likely than OCR errors. We set *n* = 3, which is the default order for n-gram models in the OpenGrm package and hopefully a good tradeoff between an overly smooth unigram or bigram model and an overly confident, e.g., 15-gram model.[11] Finally, we wish to set the message pruning parameter *δ* high enough (on a negative log scale) so that the distributions the *T*_{i} variables send to each other have a non-empty intersection but not so high that the messages are impractically large. We set *δ* = 35 since for *r* = 0.85, *e*^{−35} is approximately equal to an edit distance of 5 before applying the language model filter.
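The correspondence between *δ* = 35 and roughly five edits can be checked with a quick computation. In this sketch we assume an alphabet of about 60 characters, as in the trigram estimate earlier; that size is our assumption, not a figure from the text:

```python
import math

# Back-of-envelope check: with r = 0.85 and an assumed ~60-character alphabet,
# each edit in the probabilistic Levenshtein model has probability
# (1 - r) / (2 * |Sigma|), so each edit costs about 6.7 nats, and a
# pruning threshold of 35 nats admits roughly five edits.
r = 0.85
alphabet_size = 60
per_edit_cost = -math.log((1 - r) / (2 * alphabet_size))
print(per_edit_cost, 35 / per_edit_cost)
```

The exact number of admitted edits shifts with the assumed alphabet size, but the order of magnitude is stable.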

After estimating the filter n-gram model and computing the evidence distribution for each witness *W*_{i}, which we only have to do once, we start with the first witness and compute the forward messages to each witness in ascending order. While each message is an FSA, and thus may contain multiple candidate strings, as in the example pictured, we can get some idea of the process by looking at the *most probable* string in each forward message:

```
1 the Trader. Bank of the city of Richmoud, to be
2 the Trader. Bank or the city of Richmoud, to be
3 tbe Traders' Bank of the city of Bichmoud, to be
4 the Traders' Bank of the city of Richmoud, to be
5 the Traders' Bank of the city of Richmoud, to be
6 the Traders' Bank of the city of Richmoud, to be
```

For a fixed number of witnesses, the forward message from the last witness is not needed. While these strings are not all correct, we can already see some evidence being incorporated to denoise the OCR: e.g., `Trader*'` to `Traders'` in #5. Since information is only flowing forward, we can also see some deleterious consensus effects, such as `Traders'` to `Trader.` in #2.

Similarly, we compute the backward messages recursively starting with the *last* witness. The most probable string in each backward message is:

```
7 tha Traders' Bank of the cltv of Richmond to be
6 the Traders' Bank of the cltv of Richmond to he
5 the Traders' Bank of the city of Richmond, to be
4 the Traders' Bank of the city of Richmond, to be
3 the Traders' Bank of the city of Richmond, to be
2 the Traders' Bank of the city of Richmond, to be
```

We can then compute the posterior distributions for each latent text variable *T _{i}*. As with the messages, these distributions are FSAs, as in the example pictured. The maximum a posteriori (MAP) string for each issue is:

```
1 the Traders' Bank of the city of Richmond, to be
2 the Traders' Bank of the city of Richmond, to be
3 the Traders' Bank of the city of Richmond, to be
4 the Traders' Bank of the city of Richmond, to be
5 the Traders' Bank of the city of Richmond, to be
6 the Traders' Bank of the city of Richmond, to be
7 the Traders' Bank of the city of Richmond, to be
```

We see that for this set of witnesses, with these parameter settings, the hypothesized readings are now in agreement. Other than through the four parameters above, no evidence from other OCR outputs or language models has been used.
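The message-passing computation can be imitated in miniature. The sketch below makes two simplifying assumptions not in the model above: each latent text ranges over a small explicit candidate set rather than a pruned FSA, and every compositor or OCR edit incurs a fixed cost in nats (the costs are invented for illustration, not derived from *r* and *c*). It then runs max-product forward and backward passes and reads off a Viterbi-style consensus:

```python
# Miniature max-product forward-backward over strings. The candidate set
# and the per-edit costs are illustrative assumptions; the actual model
# works with weighted finite-state automata.

def edit_distance(a: str, b: str) -> int:
    # Standard Levenshtein dynamic program.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

OCR_EDIT = 2.0   # nats per OCR edit (assumed)
COMP_EDIT = 9.0  # nats per compositor edit (assumed rarer than OCR errors)

def emit(t, o): return OCR_EDIT * edit_distance(t, o)
def trans(t1, t2): return COMP_EDIT * edit_distance(t1, t2)

def map_readings(obs, cands):
    n = len(obs)
    # Forward messages: best cost of explaining obs[0..i] with T_i = t.
    alpha = [{t: emit(t, obs[0]) for t in cands}]
    for i in range(1, n):
        alpha.append({t: min(alpha[-1][s] + trans(s, t) for s in cands)
                         + emit(t, obs[i]) for t in cands})
    # Backward messages: best cost of explaining obs[i+1..] given T_i = t.
    beta = [{t: 0.0 for t in cands}]
    for i in range(n - 2, -1, -1):
        beta.insert(0, {t: min(trans(t, s) + emit(s, obs[i + 1]) + beta[0][s]
                               for s in cands) for t in cands})
    # Max-marginal (Viterbi-style) reading for each issue.
    return [min(cands, key=lambda t: alpha[i][t] + beta[i][t])
            for i in range(n)]

obs = ["the Trader. Bank", "tbe Traders' Bank", "the Traders' Bank"]
cands = sorted(set(obs))
print(map_readings(obs, cands))  # all three agree on "the Traders' Bank"
```

Because the transition cost is much higher than the emission cost, pooled evidence outvotes any single witness's quirk, mirroring the consensus behavior seen above.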

How sensitive are these inferences to the parameter values we chose? If instead of *c* = 1 − 10^{−4}, we use values of *c* between approximately 1 − 10^{−3} and 1 − 10^{−1}, we see that the relatively balanced global evidence between `Richmond` and `Richmoud` is resolved differently by the local evidence. Note also that #2, which did have an `n` originally, conforms to its neighbors in reading `u`:

```
1 the Traders' Bank of the city of Richmoud, to be
2 the Traders' Bank of the city of Richmoud, to be
3 the Traders' Bank of the city of Richmoud, to be
4 the Traders' Bank of the city of Richmoud, to be
5 the Traders' Bank of the city of Richmond, to be
6 the Traders' Bank of the city of Richmond, to be
7 the Traders' Bank of the city of Richmond, to be
```

When *c* is even lower, especially lower than *r*, the model is more likely to discount the evidence of other strings in reasoning about individual OCR outputs. If the compositors are more likely to make errors than OCR, there’s not much point in consulting their other work. For *c* = 0.8, for example, there is still some pooling of evidence so that there is consensus on, e.g., `Bank of the`, but more local quirks slip through:

```
1 the Trader. Bank of the city of Richmoud, to be
2 tbe Traders' Bank of the city of Richmoud, to be
3 tbe Traders' Bank of the city of Richmoud, to be
4 the Traders' Bank of the city of Richmoud, to be
5 the Traders' Bank of the city of Richmond, to be
6 the Traders' Bank of the city of Richmond, to be
7 tha Traders' Bank of the cltv of Richmond to be
```

We do observe that *n*, the order of the n-gram filtering model, has a significant effect on inference speed at the lower bound. Inference is reasonably accurate and efficient with 3 ≤ *n* ≤ 15. With *n* ≤ 2, however, inference becomes very slow. Beyond 15, estimating a backoff n-gram model becomes more expensive than the subsequent collation process.
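To give a concrete, if simplified, sense of what the filter model contributes, here is a toy add-one-smoothed character n-gram scorer standing in for the OpenGrm backoff models; its training text, smoothing, and padding are all assumptions made for illustration:

```python
import math
from collections import Counter

def ngram_scorer(train: str, n: int = 3):
    """Toy stand-in for a backoff character n-gram model: add-one-smoothed
    counts over a single training string. Returns a cost function in nats."""
    pad = " " * (n - 1)
    s = pad + train
    grams = Counter(s[i:i + n] for i in range(len(s) - n + 1))
    ctxs = Counter(s[i:i + n - 1] for i in range(len(s) - n + 1))
    vocab = len(set(train)) + 1  # crude vocabulary size for smoothing
    def cost(text: str) -> float:
        t = pad + text
        return sum(-math.log((grams[t[i:i + n]] + 1) /
                             (ctxs[t[i:i + n - 1]] + vocab))
                   for i in range(len(t) - n + 1))
    return cost

# A model trained on the corrected line prefers the true reading:
lm = ngram_scorer("the Traders' Bank of the city of Richmond, to be")
```

Even this toy trigram assigns a lower cost to `Richmond` than to `Richmoud`; a unigram or bigram model tends to make much weaker distinctions, which is one reason a low *n* lets too many candidate strings through the filter.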

For this example input, where the underlying true texts are in fact all identical, we are actually solving the more restricted **Steiner consensus string** problem (Gusfield 1997), which is NP-hard to solve exactly. This means that we are not, in general, guaranteed to find good solutions, but the current pruning parameters seem to be effective. (We will discuss this more below.)
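A standard tractable surrogate for the Steiner string, discussed by Gusfield (1997), is the center-star heuristic: restrict the consensus to one of the witnesses themselves and pick the witness with the least total edit distance to the others, which by the triangle inequality comes within roughly a factor of two of the optimal consensus cost. A minimal sketch:

```python
def edit_distance(a: str, b: str) -> int:
    # Standard Levenshtein dynamic program.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def center_star(witnesses):
    """Witness with the least total edit distance to all the others; its
    total cost is within about a factor of two of the optimal (NP-hard)
    Steiner consensus string."""
    return min(witnesses,
               key=lambda w: sum(edit_distance(w, v) for v in witnesses))
```

Unlike the HMM above, this heuristic can never produce a reading absent from every witness, so it cannot correct an error that all witnesses share.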

For now, however, we ask what will happen if we are solving a problem less constrained than simply finding a single consensus string underlying all of the observations *o _{i}*. Say that we modify the last OCR output so that it might have resulted from a shortened form of the advertisement without the words `the city of`:

```
1 the Trader. Bank of the city of Richmoud, to be
2 tbe Traders' Bank or the city of Biebmond, to bo
3 tbe Traders' Bank of the city of Klchmoud, to be,
4 the Traders' Hank of the city of Richmoud, lo be j
5 the Trader*' Bsnk of the city of Richmond, to be
6 the Traders' Hank of the city of Richmond, to he
7 tha Traders' Bank of Richmond to be
```

With *c* at the original value of 1 − 10^{-4}, we get a correct MAP output from collation:

```
1 the Traders' Bank of the city of Richmond, to be
2 the Traders' Bank of the city of Richmond, to be
3 the Traders' Bank of the city of Richmond, to be
4 the Traders' Bank of the city of Richmond, to be
5 the Traders' Bank of the city of Richmond, to be
6 the Traders' Bank of the city of Richmond, to be
7 the Traders' Bank of Richmond, to be
```

The evidence of the first six witnesses is enough to correct the initial `tha` and to reinsert the comma after `Richmond`. The evidence of the shorter string does not, however, cause any erroneous shortening in the first six.

Say that the shorter hypotheses outnumber the longer by 5 to 4:

```
1 the Trader. Bank of the city of Richmoud, to be
2 tbe Traders' Bank or the city of Biebmond, to bo
3 tbe Traders' Bank of the city of Klchmoud, to be,
4 the Traders' Hank of the city of Richmoud, lo be j
5 the Trader*' Bsnk of Richmond, to be
6 the Traders' Hank of Richmond to he
7 tha Traders' Bank of Richmond to be
8 tha Traders Bank of Richmond, to be
9 the Trader. Bank of Richmoud, to be
```

Even then, the MAP output is still correct:

```
1 the Traders' Bank of the city of Richmond, to be
2 the Traders' Bank of the city of Richmond, to be
3 the Traders' Bank of the city of Richmond, to be
4 the Traders' Bank of the city of Richmond, to be
5 the Traders' Bank of Richmond, to be
6 the Traders' Bank of Richmond, to be
7 the Traders' Bank of Richmond, to be
8 the Traders' Bank of Richmond, to be
9 the Traders' Bank of Richmond, to be
```

If, however, we *interleave* these five short and four long hypotheses, the HMM fits poorly, giving the output:

```
1 the Traders' Bank of Richmond, to be
2 the Traders' Bank of he city of Richmond, to be
3 the Traders' Bank of Richmond, to be
4 the Traders' Bank or to city of Richmond, to be
5 the Traders' Bank of Richmond, to be
6 the Traders' Bank of he city of Richmond, to be
7 the Traders' Bank of Richmond, to be
8 the Traders' Bank of he city of Richmond, to be
9 the Traders' Bank of Richmond, to be
```

Indeed, the interleaved data were *not* produced by a compositor with the Markov property of looking only at the previous issue, but by an adversarial human globally rearranging the full list.
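This failure mode can be checked directly: if each issue copies its predecessor, adjacent witnesses should lie a small edit distance apart, while interleaving two variant chains inflates every adjacent distance. The diagnostic below is our own illustration (not part of the model), using the long and short readings from the example above:

```python
def edit_distance(a: str, b: str) -> int:
    # Standard Levenshtein dynamic program.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def adjacent_cost(witnesses):
    """Total edit distance between consecutive witnesses: low when issues
    copy their predecessors, high when variant chains are interleaved."""
    return sum(edit_distance(a, b) for a, b in zip(witnesses, witnesses[1:]))

full_line = "the Traders' Bank of the city of Richmond, to be"
short_line = "the Traders' Bank of Richmond, to be"
in_order = [full_line] * 4 + [short_line] * 5   # one transition between variants
shuffled = [full_line, short_line] * 4 + [short_line]  # adversarial interleaving
```

In the chronologically ordered list only one adjacent pair differs, so the compositor's Markov assumption is cheap to satisfy; in the interleaved list nearly every transition pays the full cost of the variant.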

The simple HMM is very brittle, however, when presented with legitimate variants a small edit distance away from each other. For example, among sixteen variants of the Mackay poem mentioned above, the reading `misty` replaces `mighty` in #5 and #6 below:

```
1 Tell in \ thou mi&htv deep,/VVh.we billows roiiml me p'?>'i
2 Ttll me, thou mighty deep,/Whose billows round mo play,
3 Tell mo thou mighty deep,/Whose billows round mo play,
4 Tell me thou mighty deep,/Where billows roand may play
5 Tell me, thou misty deep,/Whose billows round roe play,
6 Tell me. thou misty deep./Whose billows round me olay,
7 Tell me, thou mighty deep/Whose billows round me play,
8 Tell me, thou mighty deep,/Where billows round me play,
9 Tell me, thou mighty deep,/Whose billows round me play,
10 " Tell me, thou mighty deep,/Whose billows round me play,
11 Tell me, thoa j?gpty deep,/Whoa.e billows round mo play,
12 Tell me, thou mighty deep./Whose billows round tue piny,
13 Tell me. thou miffhtv deen./Whose billows round me play,
14 Tell me. Ihml ml.tr tleeit./Whose billow, round iuc play,
15 -ell me, thou mighty deep./Whose billows rouid rae play,
16 T'ell me, tl on mighty d, ep,/Whose billuws ro':nd me play,
```

With *r* = 0.85 as above, only a few values of *c*, between 0.85 and 0.86, preserve `misty` without allowing too many other errors. With *c* > 0.86, we see over-smoothing:

```
1 Tell me thou mighty deep,/Whose billows round me play,
2 Tell me thou mighty deep,/Whose billows round me play,
3 Tell me thou mighty deep,/Whose billows round me play,
4 Tell me thou mighty deep,/Whose billows round me play,
5 Tell me, thou mighty deep,/Whose billows round me play,
6 Tell me, thou mighty deep,/Whose billows round me play,
7 Tell me, thou mighty deep,/Whose billows round me play,
8 Tell me, thou mighty deep,/Whose billows round me play,
9 Tell me, thou mighty deep,/Whose billows round me play,
10 Tell me, thou mighty deep,/Whose billows round me play,
11 Tell me, thou mighty deep,/Whose billows round me play,
12 Tell me, thou mighty deep./Whose billows round me play,
13 Tell me. thou mighty deep./Whose billows round me play,
14 Tell me. Ihml ml.tr tleeit./Whose billows round me play,
15 Tell me, thou mighty deep./Whose billows round me play,
16 Tell me, thou mighty deep./Whose billows round me play,
```

If, instead of showing the MAP string from each text, we sample paths from the finite-state automaton representing the posterior distribution for *T*_{5}, where we would like to read `misty`, we generate `misty` only 10% of the time when *c* = 0.9 but 70% of the time when *c* = 0.85.

How well does this HMM perform in general? We construct a test set of 1000 lines from the Richmond Dispatch, for which we have manual double-keyed transcriptions, with between 2 and 8 additional witnesses each. The character error rate of this test set is 14.6%. With the four parameter values above, the HMM reduces the error to 9.9%. An LSTM encoder-decoder with attention (Dong and Smith 2018), trained on supervised input-output pairs with tens of thousands of parameters, reduces CER to 7.7% when it only uses one witness and 4.2% when it uses all witnesses in each set by averaging the attention distributions.[12] It is worth emphasizing that these evaluations are measuring only the edit distance between the inferred text of the target line and its manual transcriptions—and *not* the conformance of all the other witnesses to our expectations. A different evaluation would be needed, for example, to assess whether the model reflected the underlying variation in the witnesses of the Mackay poem above.
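The character error rate used in this evaluation is simply edit distance normalized by the length of the reference transcription; a minimal sketch:

```python
def edit_distance(a: str, b: str) -> int:
    # Standard Levenshtein dynamic program.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def cer(hypothesis: str, reference: str) -> float:
    """Character error rate: edits needed to turn the hypothesis into the
    reference, divided by the reference length."""
    return edit_distance(hypothesis, reference) / len(reference)
```

For instance, `cer("Richmoud", "Richmond")` is 1/8 = 0.125: one substitution over an eight-character reference.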

For the restricted reprinting case of compositors' copying one issue's content to the next, this section presented a simple four-parameter hidden Markov model that performs inference over language models. It might be desirable to model how OCR confuses different pairs of characters or larger combinations of characters; how compositors might drop type at the end of a line or miss whole words; or how language models estimated from additional text, clean or noisy, might improve our inferences on this task. Despite lacking all these things, this HMM over language models provides a reasonably accurate baseline for this collation task—and perhaps for other tasks where textual critics might need a simple, language-independent null model. In the next section, we discuss how we employ these language-modeling techniques at the scale of millions of issues of poorly OCR'd periodicals.