In this section, we describe our current methods for detecting pairs of newspaper issues with significant textual overlap, for aligning these issue pairs, and for clustering these pairwise alignments into larger families of reprints. These methods are implemented in the text-reuse analysis software
passim, which is freely available under an open-source license. We have previously discussed and evaluated detecting pairs of documents with overlapping content in Smith et al. (2013) and Smith et al. (2014) and described our approach to alignment in Wilkerson et al. (2015) in the context of modeling both newspaper reprinting and current legislative process in the U.S. Congress. Here we will detail the precise methods that contributed to the arguments we make about reprinting in nineteenth-century newspapers in the rest of this book and that led us to create a corpus of reprinted articles.
As noted above (ZZZ), many of the texts most widely reprinted during our period are not widely known to scholars today. We cannot, therefore, compile an exhaustive list of “query” texts and find other versions of them in a collection using an information retrieval model, as one might when trying to detect plagiarism in a given set of students' essays. Rather than relying on individual “query” documents—e.g., “Find me all (partial) reprints of Nathaniel Hawthorne’s ‘The Celestial Railroad’” as Cordell (2013) did in a precursor to this project—we want to compare each newspaper issue to all other issues—an “all-to-all” comparison—to find both extended quotations and partially or fully reprinted texts. This process cannot, however, be simply one of running millions of queries against a search engine. If done exhaustively, for example, our analyses on N = 3,270,459 issues of nineteenth-century newspaper and magazine issues would have required comparing ½ N (N − 1) = 5,347,949,400,110 (i.e., over 5 trillion) pairs of issues. This means that if, somewhat optimistically, we could align 10 pairs of issues per second, it would take more than 16,946 years to examine the whole collection for reprints. We therefore describe methods for pruning the search space. While they might lead us to miss some reprints, they are necessary due to the large number of potential all-to-all comparisons.
Moreover, the tools for finding clusters of reprinted texts are designed to work with the error-full output of optical character recognition (OCR) systems and to find reprints embedded within much longer documents such as newspaper issues. The newspapers in the Chronicling America collection divide each issue into pages, not logical articles, and the Making of America magazine collection specifies only the page on which each article begins. In contrast, the National Library of Australia’s Trove newspaper collection specifies the zone on each page for each substantial article, although it groups advertisements, obituaries, and other short notices into larger chunks. Note, however, that many of the shorter reprinted texts we have found, such as jokes or personal anecdotes, are included without any distinguishing titles or separators in such “filler” articles of short extracts (e.g., the second version of the Mackay poem above). This lack of consistent document boundaries therefore requires somewhat different techniques compared to some previous work on duplicate detection for web documents and plagiarized student papers, or on finding repeated “memes” of only a few words (Leskovec 2009; Suen et al. 2013). A flexible approach to textual boundaries provides further advantages when searching for newspaper reprints: even if we knew the boundaries of articles, we still aim to find—and do find, in many cases—reprints of parts of speeches or individual stanzas of poems or of popular texts embedded, without attribution, in longer articles, advertisements, and obituaries.
In summary, we aim to detect, without knowing the boundaries of articles, substantial repeated passages of up to tens to thousands of words embedded within unrelated newspaper issues of tens of thousands of words. Additionally, we seek to find these repeated passages among millions of distinct newspaper issues and in the presence of significant errors in optical character recognition of the page images. In the following subsections, we explain the approach in more detail and comment on some of the design choices in our system for searching OCR'd periodicals for clusters of reprinted texts. Our current system compares both whole newspaper issues and segmented articles (where available) to find matching texts contained within them. In what follows, we talk more generically of “documents” in order to stress the generality of the approach.
Search for candidate document pairs
We begin with a scan for document pairs likely to contain reprinted passages. We use “shingling,” which represents documents by an unordered collection of its n-grams, to provide features for document similarity computations. We balance recall and precision by adjusting the size of the shingles and the proportion of the shingles that are used to calculate document similarity. We build an index of the repeated n-grams and then extract candidate document pairs from this index.
N-gram document representations
The choice of document representation affects the precision and recall, as well as the time and space complexity of a text reuse detection system. We adopt the popular choice of “shingled” n-grams. These are the same contiguous subsequences of n words we discussed for Markov language models. Very similar passages of sufficient length will share many long n-grams, but longer n-grams are less robust to textual variation and OCR errors. Our use of n-grams to detect document pairs worth aligning more intensively is similar to the use above of n-gram models to prune the search space for an HMM.
In previous work on detecting short quotes or “memes,” short contiguous n-grams have been proven to be quite effective. Suen et al. (2013), for instance, use 4-grams to identify similar quotes in newspaper articles and blogs. For the newspaper texts, we found word 5-grams achieved an acceptable trade-off between detecting a significant number of pairs of issues with aligned text without retrieving too many pairs of issues with spurious similarity.
We illustrate the algorithm with the following three document fragments drawn from articles about the completion of the transatlantic telegraph cable:
her majesty deares to congratulate the president upon the successful completion of this great intern 1 lions work
the ueen desires to congratulate the p esident upon the successful completion of the gre it internaliooal work
the queen deiirea to congratulate the president upon the euccetwfal completion of thia great inter tatioral work
The first four 5-grams in document #1, for example, are:
her majesty deares to congratulate
majesty deares to congratulate the
deares to congratulate the president
to congratulate the president upon
Efficient n-gram indexing
The next step is to build for each n-gram feature an inverted index of the documents where it appears. As in other duplicate detection and text reuse applications, we are only interested in the n-grams shared by two or more documents. The index, therefore, does not need to contain entries for the n-grams that occur only once, which allows us to reduce the size of our index by more than 50%, a consequence of Zipf’s Law (Huston et al. 2011). Returning to the three example documents, an index of the repeated 5-grams would contain the following three entries with the documents in which they appear and the positions in those documents:
to congratulate the president upon: (1,4), (3,4)
congratulate the president upon the: (1,5), (3,5)
upon the successful completion of: (1,8), (2,9)
Singleton 5-grams such as
her majesty deares to congratulate would not appear.
Even though the final index can omit singleton n-grams, the most straightforward approach to counting n-grams requires that we record all the singletons in intermediate storage until we have processed the entire collection. Consider a simple algorithm where we divide the collection into n-gram shingles and then sort the shingles in lexicographical order. A linear scan through the sorted file would easily identify repeated n-grams, but this sorted file would have to contain all the n-grams in order to avoid missing any repeated ones. For instance, if the first and last 5-gram in the file were identical, we would not be able to discard the first 5-gram while any amount of text shorter than the whole collection intervened.
To reduce the intermediate storage requirements, which can be significant for a large corpus, we adopt the space-efficient two-pass algorithm described in Huston et al. (2011). In particular, the variant of the algorithm they name Location-Based Two-Pass works as follows. In a first pass over the collection, each n-gram is mapped to a b-bit numeric value. This hashing operation randomly maps arbitrary strings to a number (hash codes) between 0 and 2b − 1. At this stage, we are willing to accept some moderate number of collisions, i.e., different n-grams with the same hash code. The idea is that if b is small enough, we can detect repeated hash codes in much less space than repeated n-grams, which may have words of arbitrary length. If a hash code appears only once, then the n-gram that produced it must necessarily appear only once. After producing this index of hash codes, we then make a second pass over the collection and look up the actual n-grams corresponding to repeated hash codes. If there has been a collision, it is possible that a singleton n-gram mapped to the same code as other n-grams, and we can drop the singletons at this stage. This means that we will not have any false negatives in detecting repeated n-grams and will pay a time cost for false positives as a function of b. If b is too small, for instance, most n-grams' hash codes will collide, and we will not save anything with the first pass; if b is too large, we will have almost no collisions, but the hash codes will be so big that they will take just as much space as the original n-grams.
Extracting and filtering candidate pairs
Once we have an inverted index of the documents that contain each n-gram, we use it to generate and rank document pairs that are candidates for containing reprinted texts. As shown above, each entry, or posting list, in the index contains a set of pairs (di, pi) that record the document identifier and position in that document of that n-gram. Once we have a posting list of documents containing each n-gram, we output all pairs of documents in each list. By default, we suppress repeated n-grams appearing in different issues of the same newspaper. Such repetitions often occur in editorial boilerplate or advertisements, which, while interesting, are outside the scope of this project. We also suppress n-grams that generate more than some fixed number of pairs (5000 in our experiments). These frequent n-grams are likely to be common fixed phrases. Filtering terms with high document frequency has led to significant speed increases with small loss in accuracy in other document similarity work (Elsayed et al. 2008). We then sort the list of repeated n-grams by document pair, which allows us to assign a score to each pair based on the number of overlapping n-grams and the distinctiveness of those n-grams. This score computation for a document pair is similar to that of bag-of-words text classification or information retrieval but with a product of overlapping n-gram features. For the three example documents above, the resulting list of document pairs, along with their matching 5-grams, would contain:
(1,2) upon the successful completion of
(1,3) congratulate the president upon
upon the successful completion of
As a last step, we filter out document pairs with fewer than m matching n-grams. In our experiments, we found that m = 5 provided a reasonable tradeoff between precision and recall. In other words, two texts would need to contain at least five matching phrases to be considered a matching text under this model. Overall, for our collection of 69,572,770 pages and articles (depending on the original collections' segmentation) from nineteenth-century periodicals, this technique reduced the number of pairwise comparisons from over 2.4 quadrillion, if performed exhaustively, to 767 million—less than a millionth of a percent of the total.
Local document alignment
The initial n-gram filtering pass returns a large ranked list of candidate document pairs, but it ignores the order of the n-grams as they occur in each document. We therefore employ local alignment techniques to find compact passages within each document in a pair with the highest probability of matching. Due to the high rate of OCR errors and editorial changes during the antebellum period, many n-grams in matching articles will contain slight differences. For example, the figure below shows a character-by-character alignment between two mistranscribed words from the example texts. A simple Levenshtein edit distance model, which we discussed above, aligns these two strings with three character substitutions (in red) and one character insertion. Unlike some partial duplicate detection techniques based on global alignment (Yalniz et al. 2011), we cannot expect all or even most of the articles in two newspaper issues, or the text in two books with a shared quotation, to align. Rather, as in some work on biological subsequence alignment, we look in each document for a region of high overlap embedded within sequences that are otherwise unrelated. Diagrammatically, if we lay the entire content of one text along the horizontal cells of a table and the entire content of the other text along the vertical to form the “chart” data structure conventionally used for edit distance, we do not necessarily want to find an complete alignment from the upper left to the lower right, as in the figure with the alignment of two words, but a partial alignment surrounded by unrelated material.
To find such a local alignment, we employ the Smith-Waterman dynamic programming algorithm (Gusfield 1997). Compare the finite-state transducer representation of Levenshtein distance to the FST for Smith-Waterman. In the Smith-Waterman transducer, state 1 is identical to the single state for Levenshtein distance, except that matching characters have a cost less than zero. It also has a prefix state 0 and a suffix state 2 that consume arbitrary characters in the input and output with cost 0. A similar device is often found in regular expressions, where one wants to match some pattern surrounded by irrelevant context: e.g.,
^.*([Mm]atch [Tt]his!).*$. This transudcer is more general than this unweighted regular expression, since the “pattern” in the center will be one that matches any substring of the input or output string so that we can choose the one with lowest cost. This operation is called “local” alignment since the aligned portions of the input and output strings, when the transducer is in state 1, can be surrounded by arbitrary amounts of unaligned material, when the transducer is in states 0 or 2. In a “global” alignment such as Levenshtein distance, the entirety of the input and output strings must be accounted for with either a match or edit operation.
We add one more refinement to this simple Smith-Waterman alignment. When aligning strings at the character level, two fundamentally different substrings may still share a few characters in common. We would prefer more compact alignments with contiguous insertions and deletions, rather than having them interrupted with a single matching
t or even a single common
the. We therefore make insertions and deletions depend on context by employing an affine gap penalty (see the figure). There is a high cost (here, 5) for starting a gap; once started, however, there is a low cost (here, 0.5) for continuing it.
The time complexity for both Levenshtein and Smith-Waterman alignment is proportional to the product of the lengths of the two texts to be aligned. Smith-Waterman is more expensive in memory and time by a constant factor proportional to the extra states and arcs in its transducer. The product of the lengths of two substantial texts can become quite large, however. We achieve significant speed improvements by forcing the alignment to contain the 5-gram matches already detected in the previous filtering pass. Since the matching n-grams are by construction hapax legomena in their documents, we align them in O(n log n) instead of O(n2) time, as suggested by not implemented by Yalniz et al. (2011). If two anchor n-grams are separated from each other—or from the beginning or end of the document—by more than a gap parameter of 100 words, we allow the Smith-Waterman alignment algorithm to determine the boundary of the aligned passages. This use of model-based alignment distinguishes this approach for other work, for detecting shorter quotations, that starts with areas of n-gram overlap (Kolak and Schilit 2008; Horton et al. 2010) and then expands them greedily, one word at a time, until a stopping condition is reached.
We now have a set of aligned passage pairs. We can view this as a graph, with nodes consisting of text spans in documents and edges connecting them based on alignment score. Because the boundaries of these spans are not given to us, but determined by the Smith-Waterman alignment algorithm, they will often not match exactly. If two passages in the same document overlap by some proportion (80% by default), they are taken to be “the same” for purposes of clustering; otherwise, they are treated as different. In other words, we merge the two nodes in the graph representing those overlapping passages.
We now cluster passages by finding the connected components in this graph of spans linked by alignment. This is equivalent to a “single-link” clustering, since it only requires a single alignment between passages to join the clusters containing them into a larger group. The figure below shows the beginnings of three sample texts, out of 49, in the current clustering of texts on the completion of the transatlantic cable.
As has been observed in other uses of single-link clustering, this procedure can cause a cluster to “drift”: we sometimes see links from passage A to B to C … to Z where A does not align well with Z. Inference algorithms that consider multiple alignments at once, or make multiple passes through the data, will be necessary to solve the problem of jointly inferring a model of a mutating text and the nodes where it appears. We leave those questions for future work.
Table 1. The beginning of three sample texts, out of 49, of the cluster of reprinted texts on the completion of the transatlantic cable
Cleveland Morning Leader
Marshall County Republican