What is Text, Probably?

Before we discuss methods for modeling systems of textual sharing, we take a step back and describe briefly some more general techniques from language modeling that we will employ. This section and the next are the most mathematical in this chapter, and the reader most interested in a compact description of the methods we have used for tracking and publishing reprinted texts may wish to skip ahead to the sections on text-reuse detection and digital editions. By presenting this technical background, however, we hope not only that readers will be better able to assess the choices we have made but also that they might enter into their own “conversation with the data” (Gibbs and Cohen 2011) by imagining different questions, different models, and different possibilities for computational approaches to textual systems.

Researchers in natural language processing and related fields use the term language model to refer to a mathematical model that assigns a probability to a string of text. This section briefly surveys some language models and their applications, without meaning to replace a dedicated introduction to the subject. Rather, we aim, first, to point out that many computational methods in the digital humanities, from search engines to topic models to pre-trained word embeddings, are applications of language modeling and, second, to develop notation for the remainder of this chapter.

By a string of text, we mean a sequence of tokens from some alphabet or vocabulary, which could be: the characters of the Roman alphabet; the 1945 kanji plus two syllabaries of Japanese; the distinct words in some lexicon of English; the symbols of American Sign Language; or some superset of all of these. Formally, for a given set of vocabulary items V, a language model is a function assigning a probability to any finite sequence of length N p(w₁, w₂, … , w_N) where w_i ∈ V. For brevity, we will write such sequences as w₁^N.[4]

Why might we want to talk about the probability of a string of symbols?[5] A simple but clear justification arises with texts that are not fully visible to the reader. When looking at an image of a page from the Richmond [Virginia] Dispatch from April 9, 1862, someone who knows English will almost surely read the highlighted words as “terms of”, even though the m, s, and o are successively obscured, either by defects in the type or by later dust or imaging errors. It is even possible that the error is the compositor’s, substituting an e for o. Our knowledge of English, however, allows us to easily read those words. Informally, most speakers of English will judge the comparative probabilities of these two possible words as

p(ef) < p(of)

The context for a word is usually helpful for making these judgments, so we will often use a conditional probability. Looking again at the figure, a speaker of English would probably decide that:

p(ef | the plainest terms) < p(of | the plainest terms)

We could somewhat whimsically contrast this with the case where ef is a spelled-out letter of the alphabet:

p(ef | ay bee cee dee ee) > p(of | ay bee cee dee ee)

A good language model for English, therefore, would make similar choices to those made, on average, by speakers of English (Shannon 1951). When working with historical newspapers, it is especially apparent that no two speakers or time periods of what we call English follow exactly the same grammar. In language modeling, as in other quantitative approaches to text analysis, therefore, precise results will depend on how we define such an “average” base on necessarily finite samples.

Noisy Channels

To continue the example of reading the obscured newspaper image, we need not only a language model of what a text is likely to have said but also some model of what the evidence for alternative readings might look like. The round blot of ink in the figure is much more likely to be a lower-case e or o than a capital M. To combine these two sources of information, one common approach is a noisy channel model, first introduced by Warren Weaver as a thought experiment in a letter to Norbert Wiener from 4 March 1947:

When I look at an article in Russian, I say: “This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.”

In other words, Weaver imagined a world where Russian speakers think in English, just like English speakers, only their utterances come out as Russian. This model implies that we can recover the “intended” English utterance E from Russian R by computing

argmax_E p(E | R) = argmax_E p(E, R) = argmax_E p(R | E) · p(E)

with the help of a language model p(E) of likely English utterances, a translation or “channel” model p(R | E) of the garbling process, and Bayes' rule. This model thus searches for possible English translations that are both “fluent” (i.e., with a high probability under the language model) and “adequate” (i.e., with a high probability of having converted that candidate English utterance into the Russian we were given). Similarly, inf optical character recognition (OCR), we can try to recover the likely reading(s) of a noisy image of English text with the help of a language model of English and a channel model of likely errors based on letter shape or position (e.g., more errors at the ends of lines).

As we shall see when we discuss models of text reuse below, the channel model can be quite simple if the language model is good enough. This stronger reliance on a language model often results from the different kinds of data needed to train them. A channel model for OCR correction might need to have examples of noisy OCR output and the corresponding corrected transcripts. A translation model from English to Russian would usually need to be trained on Russian-English bilingual text. In contrast, a language model could be trained on the much larger amount of clean text available for many languages.

Factoring Sequence Probabilities

So far, we have treated the probability of a string of tokens very abstractly. What precise form should this model take? The most common approach is to use the chain rule of probability to incorporate the conditional probability of each token in the sequence in turn.

p(w₁^N) = p(w₁) · p(w₂ | w₁) · p(w₃ | w₁²) · p(w₄ | w₁³) ⋯ p(w^N | w₁^{N − 1})

This sequential factorization is especially useful for applications where predictions must be made “on line” before the whole sequence is observed, e.g., during the middle of an utterance in a speech-recognition or text-entry system. Each token in the sequence is revealed one at a time. After reading t − 1 tokens, the model uses them to calculate the probability of the tth token. By itself, however, this rewriting does not much simplify the problem. As the model sees more and more of the sequence, it ends up working with the probability of the last token p(w_N | w₁^{N − 1}), conditioned on the preceding N − 1 tokens. This is barely better than working with p(w₁^N), a joint probability over all N tokens on the text at once.

Markov (n-gram) Models

When predicting the next token in sequence w_i, therefore, we seek to limit the amount of history w₁, … , w_{i − 1} that the model needs to remember. The earliest and probably still most widely used approach is a Markov model, often known in language modeling as an n-gram model. For a certain number of tokens of history h, also called the “order” of the Markov model, the following approximation is used:

p(w_i | w₁^{i − 1}) ≈ p(w_i | w_{i − h}^{i − 1})

with the proviso that when h = 0, we use probabilities p(w_i) independent of any context. Although the terminology is unfortunately a bit confusing, it is conventional to call a first-order Markov model (where h = 1) a bigram model, a third-order Markov model (h = 3) a 4-gram model, and so on. In other words, a bigram model has only a limited “field of view” of two words: a one-word history and the current word. This two-word window slides across the text one word at a time. Similarly, a trigram model pays attention to three words at a time: a two-word history plus the current word.

Given a history of limited size, we can more easily estimate the conditional probabilities of each word in sequence. For example, for a bigram model to calculate the probability of of in the article from the Richmond Dispatch, we could use a simple relative-frequency estimate:

p(of | terms) = count(terms of) / ∑_{x ∈ V} count(terms x) = count(terms of) / count(terms)

The count function returns n-gram frequencies in a training corpus of text. Note that for a bigram model over a vocabulary V, there will be |V| possible histories, i.e., distinct conditioning events in the denominator of the calculation above; for a trigram model, there will be |V|² possible histories of ordered pairs of tokens; and so on, i.e., |V|^h.

Because we often need to make predictions about single tokens (“unigrams”), bigrams, trigrams, etc., that did not appear in the training set—or that appeared too infrequently to give reliable estimates—practical n-gram models usually smooth these relative-frequency estimates. The intuition of smoothing is that heretofore unobserved outcomes (e.g., new words) which would have received zero probability will now have slightly positive probabilities and, in compensation, observed outcomes will have slightly lower probabilities.

More generally, we can view this as a Bayesian approach of assigning a prior probability to possible language model parameters—prior, that is, to seeing any particular corpus for training purposes—using Dirichlet, Dirichlet process, or other distributions. A simple but often effective Dirichlet prior (also known as Lidstone smoothing) assumes that each distinct n-gram, whether or not it was seen in the training set, has a baseline positive “pseudo-count” of α, to which the actual counts collected as above are added. The smoothed estimate above would thus be:

p_α(of | terms) = [count(terms of) + α] / [count(terms) + |V|α]

where α is added |V| times to the denominator for normalization purposes. Since α > 0, both the numerator and denominator are also prevented from being zero, even if the n-grams in question have zero count. If terms had never been observed at all in the training set, then we are left with a uniform distribution where p_α(of | terms) = |V|⁻¹. Smoothing is especially important when applying language models to texts with poor OCR transcription, historical spelling, and other sources of variation. We would not want to assign zero probability to a text just because it contained a heretofore unobserved OCR error. When we use n-gram models as a first pass for finding candidate pairs of texts for reprinting, we do not discard pairs simply due to a single mismatched n-gram, as an unsmoothed model would.

Neural Network Language Models

Markov models make a simple assumption about the amount of information from the past that we need to remember in order to predict future tokens, to wit, a fixed number h of preceding tokens. But the precise amount of history that we will find useful in making predictions will often vary with context, and certain words in the history will surely be more informative than others. These observations have motivated work on adaptive histories going back at least thirty years (Bahl et al. 1989). In the last decade, however, neural network language models have exploited alternate encodings of the history to decisively surpass n-gram models at many predictive tasks. Most of these approaches treat predicting each succeeding word as a discriminative learning problem by directly learning the conditional probability distribution p(w | C), where C is a summary of the context that depends on previous tokens. For the usual prediction task for Markov models, for example, C = w_{i − h}^{i − 1}. As with many other predictive models, an encoding of the context as a K-dimensional vector of real numbers u_C is multiplied by a vector of weights corresponding to the word to be predicted β_w and then normalized into a probability distribution:

p(w | C) = exp(β_w · v_C) / ∑_{w' ∈ V} exp(β_w' · v_C)

In the customary terminology of neural networks, this normalization is equivalent to passing the dot-product through the softmax transformation.

The context encoding v_C is computed by a recursive computation over the previous words in the sentence using an Elman recurrent neural network (RNN), or using other architectures that exhibit better computational properties on long sequences, such as long short-term memory (LSTM) or gated recurrent unit (GRU) networks. All of these architectures begin their computations by encoding the words in the history using a “lookup layer”, mapping discrete tokens to numerical vectors. These vector encodings might either be pre-trained on another corpus—e.g., using popular techniques such as word2vec or GLoVe—or learned jointly with the other language model parameters.

Applications of Language Models

At the beginning of this section, we provided motivation for language models by their use in noisy-channel inference problems such as OCR correction and language translation. We close by mentioning a few other tasks common in humanities applications that can be usefully cast as inference with language models: classification, information retrieval, and topic modeling.

Text Classification

To perform a common text classification task, such as sentiment analysis of movie reviews, we could collect a corpus of positive reviews and a corpus of negative reviews and construct a language model for each of them. To the sequence of N words in a given review w₁, w₂, … , w_N, each language model would assign a probability, i.e., p(w₁^N | Rating = positive) and p(w₁^N | Rating = negative). To classify a new review, we can again exploit Bayes' rule to compute the probability for each value of Rating:

p(Rating | w₁^N) ∝ p(w₁^N | Rating) · p(Rating)

This method of text classification has sometimes been called “Bayesian” since it uses Bayes' rule—although nowadays, “Bayesian” usually refers to inference that accounts for uncertainty about the parameters of a probability model. It is much more common to employ a particular instantiation of this classification approach: the naïve Bayes model, which uses a language model where each word in a document is assumed to be independent of the others given a particular value of the output class, i.e., a zero-order Markov, or unigram, model. To compute the probability of a movie review’s rating with a naïve Bayes model, we can multiply the probabilities of each word separately:

p(Rating | w₁^N) ∝ p(Rating) · ∏_{i = 1}^N p(w_i | Rating)

This language model is thus said to treat a document as a “bag of words”. With sufficient training data, bag-of-words naïve Bayes can usually be surpassed by discriminative training and using other representations of word tokens, such as word embeddings trained during language modeling.

When constructing a training set of text classification, we need to specify the set of output classes ahead of time. In a later chapter (ZZZ), we consider specifically the problem of classifying the genre of texts reprinted in nineteenth-century newspapers. Even though we ended up with a relatively small number of classes, we added unexpected genres such as “vignette” and “listicle” as we realized their importance in newspapers of the period. The difficulty of coming upf with a set of classes a priori is even more obvious in the case of individual stories, poems, recipes, and so on.

Information Retrieval

A similar bag-of-words approach is common in many information retrieval models. The usual “ad hoc” retrieval task is to wait for a user to input a query and then rank documents (or passages, posts, etc.) in a collection in descending order of “relevance” to the query. This ranking approach was first widely applied by Gerald Salton and his colleagues in the 1960s, who developed vector-space models. Each document and query containing terms from vocabulary V was embedded in a |V|-dimensional vector space. Candidate documents were then ranked by their increasing distance in this space from the user’s query. Ponte and Croft (1998) proposed a probabilistic approach to the ranking problem formally similar to naïve Bayes classification:

p(D | Q) ∝ p(Q | D) p(D) = p(D) ∏_{i = 1}^|Q| p(q_i | D)

In this “query likelihood” paradigm, a separate language model is trained for each document D in the corpus and then used to compute the probability of the user’s query. The prior probability of a candidate document p(D) might be uniform, or some function of a document’s recency, PageRank position in the hypertext of the World Wide Web, or other features. This approach can be understood as a reduced-form noisy-channel model, which asks, “If I wanted to describe document D, what is the likelihood that I would use query Q?” Note that if a query term q_i is not in D, p(Q | D) would be zero unless the probability distribution is smoothed (see above). With simple maximum likelihood estimation of the language model parameters, we might say that the query likelihood model only contains information about term frequencies. Zhai and Lafferty (2001) pointed out that if these estimates are smoothed with corpus-wide term frequencies, we recover a term proportional to inverse document frequency. A less constrained noisy-channel model might also learn explicit parameters for vocabulary mismatch between query and document (ZZZ), although this is not always helpful on average in ad-hoc retrieval tasks.

Information retrieval models that effectively posit one class per document are a better fit for the task of reprint detection than more traditional classification approaches that require specifying classes a priori. Several problems remain in treating text-reuse analysis as a retrieval problem. First of all, as we noted earlier, mass-digitized archives of newspapers and books do not necessarily divide the physical items into “documents” appropriate for the task at hand. Several poems or recipes may be concatenated into miscellaneous sections titled “Poetry” or “The Harp” or “Household Hints”. A language model, especially a bag-of-words model, constructed from these documents would mix together very loosely coupled texts. Furthermore, the likelihood computation only counts matches between query and document, without gradated degrees of mismatch, whether due to editorial changes or OCR errors. Finally, note that these models were developed for the ad-hoc retrieval task of ranking all documents relative to a single query. This approach has poor scaling properties: simply treating each document in a corpus of millions of newspaper pages as a query has unworkable computational cost. But it also avoids the issue of how many families of reprinted texts we are interested in finding. We discuss our current solution to these issues below.

Topic Modeling

For naïve Bayes classification, we learned a language model for each class (e.g., positive and negative reviews). For information retrieval with query-likelihood models, we learned a language model for each document in the corpus. Topic models are language models that aim to explain the distribution of words across documents as a function of the distribution of T different topics. Each topic just is a (unigram) language model φ_t over the vocabulary V. In contrast to probabilistic information retrieval, the number of topics T is usually chosen to be much smaller than the number of documents but larger than the small number of classes typically used in text classification. Also unlike these two other tasks, in topic modeling, we do not know at training time which documents discuss which topics, and one document can draw words from multiple topics. Most topic models, however, still treat each document as a bag of words—more precisely, as a bag of topic indicators, each of which determines the generation of one word. In other words, topic models are bag-of-words language models with latent variables for each word’s topic and each document’s topic distribution. The order of tokens is obviously important in segmenting the text on a newspaper page, or in an agglomerated article, into individually reprinted passages. Most practical applications of topic models to longer documents, such as novels (e.g., Jockers 2011 ZZZ) thus need to pre-process the corpus into shorter chunks of a few pages in order to make the bag of words assumption reasonable in practice. While some topic models (e.g., Wallach et al., 2008 ZZZ) do consider sequence information, their inference complexity is not attractive for large corpora.

Lazy Compositors

Draft Chapters