Computational Genre Classification
The question of computational classification has arisen in the humanities alongside the growth of large-scale, digital corpora, such as the newspaper corpora we study in Viral Texts, necessarily overlapping the emergence of rhetoric and research around “big data.” We discuss the relationship of the larger Viral Texts project to the question of big data in chapter one, but here we delve into our classification experiments as another exploratory method for “stirring the archive” and uncovering meaningful patterns (Klein 2015). In chapter one, we posited that data can be understood as “big data” when it cannot be comprehensively read or meaningfully browsed. In such textual environments, scholars require speculative bibliographies that identify, group, sort, or surface more granular items of potential interest from a broader textual field. Computational classification can be considered a method for producing speculative bibliographies—probabilistic reorganizations of the archive that manifest a theory of generic or formal relationship among disparate texts, which scholars then examine and question.
Our reprint-detection algorithm in Viral Texts is designed to identify one type of pattern—textual repetition, or reprinting—across millions of digitized newspaper pages. Even that subset of the archive, however, remains too voluminous to meaningfully read or browse. Our derived data of duplicated texts comprises many millions of individual reprinted texts, each of which was reprinted anywhere from twice to many hundreds of times. In other words, our derived data is itself “big data.” In addition, the mere fact that a text was reprinted is often not practically or theoretically useful. In our day-to-day research tasks, we might want to study widely-reprinted poems, or compare the spread of news to that of trivia. Computational classification models allow us to parse our data about nineteenth-century reprinting writ large for evidence about narrower topics or genres. Beyond such practical parsing, however, computational classification has proven useful as a tool to think about genre, as it requires vacillation between algorithmic and humanistic modes of understanding textual categories. The experiments we outline here pressure both modes, as the computational models challenge our default assumptions about generic difference in the nineteenth century and our experience as literary historians reveals generic features that remain opaque to current models of computational text classification. Importantly, our findings emphasize how computational models of genre can express ambiguities and overlaps similar to those emphasized in established humanistic accounts.
Scholars in the humanities often imagine that computational models require absolute consensus or binary genre designations, neither of which is true. If boundaries between genres are at root fuzzy, computational methods can model that uncertainty by assigning probabilities to multiple genres at once. Rather than determine whether a given text is or is not a news article, a statistical model can instead indicate that a text has a 75% probability of being news and also a 50% probability of being poetry. Consider, for example, a rather lengthy article that appeared in the September 19, 1889 edition of the Jamestown Weekly Alert, describing an upcoming meeting of the Knights Templar. The article reports that 63,000 Knights will be in attendance, and then goes on to offer a history of the Order, a list of the principal officers, a biographical sketch of the Grand Master, as well as “Other Matters,” according to the article’s sub-headline. Our classifier identified this as a news piece, with 100% probability.
Indeed this article is a news item. It begins, “The twenty-fourth triennial conclave of the grand encampment of Knights Templar in the United States is to be held in the city of Washington during the second week of October,” telling readers who, what, when, and where about this specific event. As its headline indicates, however, this piece is much more than only news. The classifier also guessed that the article is poetry, with around 60% probability. Looking closely at the piece, we note that elements such as the list of officer names repeat many phrases, such as “Very Eminent Sir,” in the way poetry might repeat key lines. It is likewise easy to imagine how the elevated diction inherent in the titles, rules, and rituals of the Knights Templar might be mistaken for poetry by a model trained on nineteenth-century poems. Moreover, the language used by and about the Knights Templar is rife with religious words and phrases, which are also a strong indicator of poetry in the context of nineteenth-century newspapers, where many of the most frequently-reprinted poems expound on religious themes.
That our poetry classifier is 60% sure this text is a poem does not, we would posit, mean it has failed, but instead directs our attention to the linguistic and stylistic similarities between the kinds of poems that appeared in nineteenth-century newspapers and the language of and about a group like the Knights Templar in the same period. In their work to classify haiku in the modernist period, Hoyt Long and Richard Jean So find a wider penumbra of poems that literary scholars would not identify as haiku, but which their classifier identified as haiku with a high probability. For Long and So, these “misclassified texts” nevertheless point to a significant relationship between these poems and the haiku:
The machine has discovered a definite empirical relation between these haiku and the misclassified texts, even though that relation is ontologically distinct from the kinds of relations that we tend to focus on as literary critics. At the level of individual poems this relation may seem incidental, but at the level of hundreds of poems scattered across dozens of journals, what emerges is a collection of texts that share specific elements of the haiku style. The textual patterns set down in translated and adapted haiku appear to saturate a much broader array of poems, adding up to a kind of Orientalist milieu that is related to the haiku style but also part of something larger. (Long and So 2016, 266)
For us, as for Long and So, seeming misclassifications can be among the most generative outcomes of computational classification research, pointing to unexpected alignments across texts. In the context of the hybrid nineteenth-century newspaper, it is suggestive that “Knights Templar” draws on language that resembles the poetic or, more broadly, the literary. In the Jamestown Weekly Alert, “Knights Templar” was printed on a page directly opposite chapters 3–5 of a serialized novel, “Country Luck” by John Habberton. We might imagine the newspaper reader, then, shuttling both linguistically and materially between these texts.
We began our classification process by creating supervised models, meaning that our models of genre are first trained on data hand-coded by human domain experts. We use these hand-coded texts to train models of particular genres: e.g. poetry, news, fiction. These models operationalize our understanding of a particular generic boundary, allowing us to test to what degree other texts across a larger, unread corpus are like or unlike those we identified as belonging to a given group. Importantly, these measures are only relational to our training corpus, which is to say that the output of the machine’s reading can only be as good as the training data created by our reading and interpretations. In the case of the vignette, it was the very thing that perplexed us as human readers—the hybrid genres that never quite disappeared but rather were pulled apart on either side of the gap between fact and fiction in the early twentieth century—that proved fertile ground for testing both what we could know and what we could be taught by seeing our own observations reflected back to us—modeled—at a larger scale.
Human readers often make decisions about genre by looking at higher order aspects of a text. Apart from the feelings a particular work may inspire, we say a Shakespearean play is a comedy if it ends in a wedding or a tragedy if it ends in death. We determine a novel is gothic because it is set in a crumbling castle, or because its characters are beset by supernatural horrors. We can often identify a poem–particularly if it was written before the twentieth century–just by looking at it, without even reading.1 By contrast, computers read by counting what linguists call “tokens”: individual characters, words, phrases, or perhaps sentences. Words are the most typical tokens used for computational classification experiments because they are both easy to isolate within texts and legible to human readers in meaningful ways.
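To make the notion of tokens concrete, the following minimal sketch (in R, in keeping with the Shiny-based tool described later in this chapter) counts the word tokens in the opening sentence of the Knights Templar article; tables of counts like the one it produces are the kind of low-level features a classifier “reads,” in place of plot, setting, or page layout.

```r
# A minimal sketch of word-level tokenization: the opening sentence of the
# Knights Templar article is reduced to a table of word counts.
text <- "The twenty-fourth triennial conclave of the grand encampment of
         Knights Templar in the United States is to be held in the city of
         Washington during the second week of October"

# Lowercase, strip punctuation, and split on whitespace to isolate word tokens
tokens <- strsplit(tolower(gsub("[[:punct:]]", "", text)), "\\s+")[[1]]

# Count how often each token appears
sort(table(tokens), decreasing = TRUE)
```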
Hope and Witmore offer a helpful analogy, the English pudding, to explain the way computers parse words to make inferences about genre: “Many English puddings feature a gloopy matrix in which something more substantial is intermixed, for example, a piece of fruit such as a plum. In our case, gloop is a useful substance to think with because it is analogous to the linguistic gloop that binds together the more spectacular items that literary critics are likely to seek out and savor” (Hope and Witmore 2010, 361). By training a computer to classify Shakespeare’s plays, Hope and Witmore find that generic signals are “legible at the level of the sentence … genre goes all the way down to where an author plants his or her feet in the ground” (375). The point of such work is not to reduce genre to word frequencies, but instead to make use of the fact that higher-level features of genre lead, whether consciously or not, to distinct choices in vocabulary and textual structure which can, in turn, prove useful signals for discerning the operations of genre that caused them.
Scholars use a variety of methods to model genre computationally, ranging from stylometry—similar to methods used for author identification—to unsupervised methods like principal component analysis, to network studies, and to topic modeling.2 The classification method we describe in this chapter combines unsupervised and supervised methods; we use topics, derived from a topic model of our reprinted texts, to train a classifier. We initially developed this process in collaboration with Benjamin Schmidt; in a blog post detailing his experiments with this method on a corpus of 44,000 television episodes, Schmidt (2015) explains his reason for using topics, as opposed to words, as features for classification: “To reduce dimensionality into the model, we have been thinking of using a topic model as the classifiers instead of the tokens. The idea is that classifiers with more than several dozen variables tend to get finicky and hard to interpret, and with more than a few hundred become completely unmanageable.” The goal of all models is of course to reduce dimensionality, or to take something large and make it smaller so it can be better examined in full. When we make a model of our entire corpus, the question is which features are necessary to ensure that it accurately represents the whole. And, in order to make building the model more manageable, which parts could we leave out without losing the larger picture? In the case of our corpus, it is not necessary to include every token, or word. This is particularly true because of a distinct—though by no means unique—challenge of our corpus. As we described in Chapter 1, our textual analysis efforts depend on words produced by Optical Character Recognition (OCR) systems. Using topics has the benefit of not relying so heavily on any individual word. That is, our texts are often more readable at the phrase level than at the word level.
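To illustrate the dimensionality problem Schmidt describes, the following sketch (using the tm package, with three short passages standing in for our corpus) builds the kind of document-term matrix a token-based classifier would otherwise consume. Even this toy example produces a column for every distinct word, and a corpus of thousands of OCR’d texts produces tens of thousands of such columns, many of them one-off OCR errors; a topic model collapses that table into roughly fifteen columns of topic proportions.

```r
library(tm)

# Three short passages standing in for OCR'd newspaper texts (illustrative only)
texts <- c("The twenty-fourth triennial conclave of the grand encampment of Knights Templar",
           "If you have coughed and coughed until the lining membrane of your throat is inflamed",
           "The condition of the wages-class of that day may well be examined")

# One row per text, one column per distinct word: already a couple dozen
# columns for three sentences, and tens of thousands for a full corpus
dtm <- DocumentTermMatrix(VCorpus(VectorSource(texts)))
dim(dtm)
inspect(dtm[, 1:6])   # a peek at the first few word columns
```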
Training the Classifier
The first step in creating a classifier to identify genres is to manually assign genres to a selection of texts. Where a scholar interested in novelistic genres might rely on classifications assigned by previous generations of scholars and librarians—those found in metadata or generic bibliographies, for instance—the genres of individual newspaper texts are both indistinct and untagged. Thus hand-coding is partly a practical necessity, as our data includes no tags or other metadata for genre that could be used to automate the process. Further, unlike contemporary newspapers, the nineteenth-century newspaper was often not divided neatly into sections—e.g. News, Sports, or Op-Ed—that might help us determine overall categories. Some newspapers were so divided, but those divisions are not consistent across papers, nor are they indicated in the metadata in ways an automatic process could make use of. In addition, we hand-coded our data because many of the genres we are interested in, such as the vignette, have not been studied as distinct genres before, and so bibliographies, tags, or other groupings of such genres do not exist.
Initially we identified four top-level genres to classify—poetry, news, advertisements, and prose—which we assigned to texts after reading. For each of the clusters we tagged as prose, we also assigned a secondary genre related to both content and form: e.g. vignette, sketch, opinion, or advice. We expected these secondary classifications to contribute to a second iteration of the classification experiment: after successfully identifying top-level genres, we would then use the same method to further parse the broad prose category toward greater specificity. As we will describe, however, this secondary classification was stymied.
After we hand-tagged 1,000 newspaper texts, we reintegrated them into a sample of around 4,000 “unknown” texts drawn from the larger Viral Texts corpus. While these represent a small fraction of our entire corpus, they proved to be a robust sample for testing our classifier without overwhelming the capacities of a personal computer. We created a topic model of this corpus. Each time we ran the topic model, a different—though not entirely dissimilar—list of topics resulted. We experimented with the various options involved in topic modeling, such as the number of topics the model would output, and, for our classification efforts, we found that a lower number—between 12 and 15—led to more accurate results from the classifier. The following is a representative list of the topics generated from our sub-corpus:
- god life world heart love
- court sir called judge de
- tbe jones _ trumble noble
- tile thle tie tihe thie
- years hundred year paper twenty
- tho bo ho nnd aro
- people great country public con
- states united president state mr
- man time good men makes
- states government united congress made
- 000 10 cent year 30
- water feet cold hot put
- gen men enemy general left
- ot ii ol la aud
- dr blood cure stomach health
This list of 15 topics actually tells us a lot about our overall corpus. Perhaps the most obvious element is the number of misspellings or otherwise garbled words that group together into topics of their own. It may at first seem that these topics are unhelpful abnormalities that can be thrown out, and indeed to the human reader they might be, but one of the benefits of computational methods is that even here there are patterns that the algorithm can recognize. If we turn then to the topics we can read, some sense of genre already begins to emerge. Consider “god life world heart love,” for example, a topic of religiously-inflected language of the kind that, in the nineteenth-century newspaper, often points to poetry. Even a reader who is completely unfamiliar with nineteenth-century newspapers might discern that another topic, “states government united congress made,” closely correlates with news items. Another prominent topic, “dr blood cure stomach health,” might seem to point to health or science reporting, though it actually correlates with the advertisements for patent medicines that dominated the nineteenth-century newspaper ad market.
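A minimal sketch of how such a topic model might be fit, using the topicmodels package on a toy corpus (the real model was trained on the several thousand reprinted texts described above, with the number of topics set between 12 and 15):

```r
library(tm)
library(topicmodels)

# A toy corpus standing in for the several thousand reprinted texts we modeled
texts <- c("god life world heart love divine soul heaven grace",
           "states government united congress made law session senate",
           "dr blood cure stomach health bottle remedy druggist",
           "court sir called judge jury witness trial verdict",
           "water feet cold hot put boil minutes kettle",
           "man time good men makes work day world")
dtm <- DocumentTermMatrix(VCorpus(VectorSource(texts)))

# Fit the topic model; with the real corpus we found that between 12 and 15
# topics worked best, but a corpus this small only supports a handful
lda_model <- LDA(dtm, k = 3, control = list(seed = 1889))

# The top five terms in each topic, analogous to the list printed above
terms(lda_model, 5)

# Each text's distribution over the topics: these proportions, rather than
# raw word counts, become the features we hand to the classifier
doc_topics <- posterior(lda_model)$topics
round(doc_topics, 2)
```

The doc_topics matrix of topic proportions is the low-dimensional representation of the corpus that the classifier described next takes as its input.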
While these topics—derived from an unsupervised model—could offer some sense of genre in our corpus, their true effectiveness became clear when we used them to seed our supervised model. We trained a model using logistic regression (see Appendix B of Ted Underwood’s (2019) Distant Horizons for a legible defense of the method) and then applied the model to each of the four “top-level” genres we initially identified—poetry, news, advertisements, and prose. Logistic regression models a binary outcome—either 0 or 1—and expresses the degrees between as a probability. In our experiments, the output of the classifier indicated the probability that a text belongs to a given genre. We selected the genre with the highest probability as a kind of best guess, but it was also useful to see the other probabilities, as in the example of “Knights Templar” above. By selecting the best guess we could visualize how the model performed for each genre in the test set.
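A hedged sketch of this step, with randomly generated values standing in for the document-topic proportions and hand-assigned labels (in our workflow these come from the topic model and from our hand-tagging), shows how fitting one logistic regression per genre yields overlapping probabilities and a single “best guess”:

```r
# Toy stand-ins for the real inputs: topic features for 1,000 hand-tagged texts
# and their hand-assigned genres. (In our workflow the features come from
# posterior(lda_model)$topics and the labels from our hand-coding.)
set.seed(1889)
genres   <- c("poetry", "news", "advertisements", "prose")
features <- as.data.frame(matrix(runif(1000 * 15), nrow = 1000,
                                 dimnames = list(NULL, paste0("topic", 1:15))))
tagged   <- sample(genres, 1000, replace = TRUE)

# One binary logistic regression per genre: each model estimates the
# probability that a text belongs to that genre rather than to any other
models <- lapply(genres, function(g) {
  d <- features
  d$is_genre <- as.numeric(tagged == g)
  glm(is_genre ~ ., data = d, family = binomial)
})
names(models) <- genres

# Per-genre probabilities for a single text
probs <- sapply(models, function(m)
  unname(predict(m, newdata = features[1, ], type = "response")))
probs

# The classifier's "best guess" is simply the genre with the highest probability
names(which.max(probs))
```

Because each genre is modeled separately, a text’s probabilities are not forced to sum to one, which is precisely what allows “Knights Templar” to register as both news and poetry.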
In the resulting visualization, the actual, hand-tagged genre appears on the Y-axis and the number of texts assigned to each genre by the model is on the X-axis. The bars are colored to indicate the genre that each text was assigned by the classifier. This was a particularly good result; the classifier guessed correctly 86% of the time. The results differ slightly each time the classifier is run and a new model is created, so to get a sense of the overall accuracy we repeated the experiment and averaged the results. These initial experiments produced results that were accurate 80% of the time on average.
This visualization also illustrates the persistent problem with the “prose” genre that we identified in our initial hand-coding of the texts. The prose genre, colored purple here, has the lowest accuracy at around 63% and shows up as miscategorized within each of the other genres. Advertisements and news items were both mistaken for prose, some poems were misclassified as advertisements and prose, and prose texts were sometimes mistaken for poetry and advertisements. Ultimately, the classifier accurately identified advertisements 93% of the time, news items 87% of the time, and poetry 84% of the time, while prose texts, again, were accurately identified only 67% of the time.
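These per-genre figures can be read off a confusion matrix that cross-tabulates the hand-assigned and predicted labels. The sketch below uses simulated labels (illustrative only; the real tallies come from our hand-tagged test set) to show the arithmetic:

```r
# Simulated hand-tagged genres and classifier guesses, standing in for the
# real test-set results (illustrative only)
set.seed(1889)
genres    <- c("poetry", "news", "advertisements", "prose")
actual    <- sample(genres, 1000, replace = TRUE)
predicted <- ifelse(runif(1000) < 0.8, actual, sample(genres, 1000, replace = TRUE))

# Rows are hand-tagged genres, columns are the classifier's best guesses
confusion <- table(actual, predicted)
confusion

# Overall accuracy: the share of texts whose best guess matches the hand tag
sum(diag(confusion)) / sum(confusion)

# Per-genre accuracy: e.g., how often hand-tagged prose was identified as prose
round(diag(confusion) / rowSums(confusion), 2)
```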
To this point, however, we had only been testing our model on texts that we had hand-tagged, in order to determine the model’s accuracy. Though we still had concerns about the prose genre, our next step was to turn the classifier loose on the great unknown: texts from our corpus that we had not previously read and tagged. At this stage we were particularly interested not just in the classifier’s “best guess” but in the distribution of probabilities for each text. In this way, our classification experiments point to the multiplicity and overlap of genre as it functions in the newspapers and the wider textual field.
The following visualization shows 25 formerly unknown texts, the genres they were assigned, and the probability that they belong to each genre. Each text is representative of a larger cluster—often the longest witness of that cluster. In most cases, the genre decision is split between two genres, and in several cases only one genre is shown, meaning that the probability that the cluster belonged to any other genre was too low to be relevant. In this graph we see the multiple genre signals at play in individual texts, legible to both human and machine readers.
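A chart of this kind can be sketched with ggplot2 as a bar plot of per-genre probabilities. The values below are hypothetical stand-ins for the classifier’s output, keyed to three of the texts discussed in this section (the figures for 1010332 and 1012062 echo the discussion; the rest are invented for illustration):

```r
library(ggplot2)

# Hypothetical per-genre probabilities for three clusters discussed in this section
probs_long <- data.frame(
  cluster = rep(c("1010332", "1009939", "1012062"), each = 4),
  genre   = rep(c("poetry", "news", "advertisements", "prose"), times = 3),
  prob    = c(0.05, 0.45, 0.02, 0.50,   # split roughly evenly between news and prose
              0.02, 0.05, 0.95, 0.03,   # almost certainly an advertisement
              0.60, 1.00, 0.02, 0.10)   # news, with a strong poetry signal
)

# One bar per cluster, segmented by genre and sized by probability
ggplot(probs_long, aes(x = cluster, y = prob, fill = genre)) +
  geom_col() +
  coord_flip() +
  labs(x = "Cluster", y = "Genre probability", fill = "Genre")
```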
Consider, for example, the text labeled 1010332, which is split just about evenly between prose and news. The text reads:
WAGES IN 1800. The condition of the wages-class of that day may well be examined; it is full of instruction for social agitators. In the great cities unskilled workmen were hired by the day, bought their own food and fond their own lodgings. But in the country, on the farms, or wherever a band was employed on some public work, they were fed and lodged by the employer and given a few dollars a month. On the Pennsylvania canals the diggers ate the coarsest diet, were housed in the rudest sheds, and paid $6 a month from May to November and $5 a month from November to May.
This is not news, but instead a historical piece about wages in the 1800s, which appeared in 1885. But it is easy to see why the classifier might have mistaken it for news. It also helpfully illustrates the way in which the amorphous prose category consistently directed our attention away from the genres we knew existed and toward the unknown and hybrid.
Advertising, by contrast, was typically much easier for the classifier to determine concretely. The text labeled 1009939 in the graph above reads, “If you have coughed and coughed until the lining membrane of your throat and lungs is inflamed, Scott’s Emulsion of Cod-liver Oil will soothe, strengthen and probably cure.” The promise of a probable cure, as well as the anatomical language, makes this one an easy guess for the classifier. The text we considered above, the article about the Knights Templar, also appears in these results as number 1012062. As we noted, the classifier identified it as news with 100% probability but also poetry at about 60% probability.
Principal Component Analysis
The results of these experiments were promising, but the nagging issue of the amorphous “prose” category made it clear that further experimentation was necessary. That is, it revealed that our initial genre labels—poetry, news, advertisements, and prose—did not accurately capture the distinctions in our corpus. Indeed, the corpus contains a great deal of what could be called “prose,” but within that broad category other kinds of distinction are at play. In an effort to get at what those differences might be, we returned to unsupervised modeling in the form of Principal Component Analysis (PCA).
While the kind of classification we’ve been describing, probabilistic modeling, relies on a human researcher to train the classifier, PCA is an unsupervised method and is well suited to exploring data. That is, the computer “reads” the corpus without a priori knowledge of what it contains (or what the human researcher thinks it contains). In An Introduction to Statistical Learning, Gareth James (2017) notes “Unsupervised learning is often performed as part of an exploratory data analysis” (374). PCA, writes James, “is a popular approach for deriving a low-dimensional set of features from a large set of variables” (230). To put it another way, “The primary goal of PCA is to explain as much of the total variation in the data as possible with as few variables as possible” (Binongo and Smith 1999, 447). In the context of textual analysis, this means finding correlations within a text and summarizing those correlations in meaningful ways.3 PCA is a useful exploratory method in part because it lends itself so readily to visualization. That is, principal components can be graphed and typically cluster according to their correlations.
In many PCA experiments in computational text analysis, words make up the large set of variables to be converted into principal components, but because topic models proved so effective in our classification efforts, we used topics instead. In effect, we reduced the dimensionality of our corpus first by topic modeling and then again using PCA. When these new principal components were graphed in three-dimensional space, an evocative shape emerged: a pyramid with four distinct points. Each point of the pyramid is composed of a set of texts, and when read, these texts seemed to group into four distinct genres. In two cases, these genres correlated with those we hypothesized at the beginning of our supervised classification efforts, as advertisements and news remain distinct categories within the corpus. The remaining two points proved slightly more complicated, but ultimately much more revealing. On first pass, both seemed to be composed primarily of prose pieces, though one of the groups included a substantial amount of poetry as well. After reading the texts in each set, we determined that one set included texts we might name informational prose, while the other included texts we would name literary.
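A hedged sketch of this second reduction, running PCA over a stand-in document-topic matrix (randomly generated here; in our workflow the matrix comes from the topic model):

```r
# A randomly generated document-topic matrix standing in for the real one,
# which in our workflow comes from posterior(lda_model)$topics
set.seed(1889)
doc_topics <- matrix(runif(500 * 15), nrow = 500,
                     dimnames = list(NULL, paste0("topic", 1:15)))
doc_topics <- doc_topics / rowSums(doc_topics)   # each text's topics sum to one

# PCA over the topic proportions; scaling puts all topics on an equal footing
pca <- prcomp(doc_topics, scale. = TRUE)

# How much of the total variation the first few components capture
summary(pca)$importance[, 1:4]

# The first three components are what we plotted in three-dimensional space
# (for instance with the rgl package); with the real corpus, the points
# arranged themselves into the four-cornered pyramid described above
scores <- as.data.frame(pca$x[, 1:3])
head(scores)
```

Reading the texts gathered at each corner of the pyramid gave us the two new groupings described next.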
The informational prose genre contained the kind of texts we describe in detail in Chapter 3 of this book: the “information literature” that suffused period papers, such as the piece about wages in 1800 described above, another on the origin of postage stamps, and not a few meta-articles about the process of printing a newspaper. This set also contained lists of information and an interesting overlap with the news category in the form of what we might call commerce news, distinguished by a heavy use of numbers, which recalls the numeric topic we discussed earlier in this chapter. The literary group, on the other hand, contained a fair amount of poetry in addition to more narrative prose pieces. Crucially, the prose pieces in the literary category brought us nearer to the genre we explore more closely in the next section, the vignette.
Ultimately, unsupervised data exploration in the form of Principal Component Analysis proved to be a corrective to our initial hypothesis about the kinds of genres we might expect to find in nineteenth-century newspapers. The distinction between poetry and prose, for example, proved to be less meaningful than that between literary and informational texts. Additionally, the visual nature of principal components enabled us to look more readily at the overlaps as well as the spaces between genres. In the end, we used these new generic distinctions to re-train our classifier, with a clearer sense of the ways in which the texts in our corpus more naturally disambiguate themselves from one another. This process is, of course, ongoing. The more we learn about the genres that actually exist in our corpus, the better suited we are to retrain the classifier. That is to say, the relationship between what we know and what our classifier can teach us is fully reciprocal: we teach the classifier, and then it teaches us. There is no magic here, no silver bullet. This is a wholly humanistic form of inquiry—it is just that the computer can read at a much greater scale than its human counterparts.
The Genre Automaton
What if humans could read as much as the machine? This question, paired with the desire to improve our own knowledge of genre in our corpus as well as that of our classifier, led us to create what we’ve been calling the “Viral Texts Genre Automaton,” in a not-so-subtle nod to the kind of automata one might find described in sensational terms on the pages of a nineteenth-century newspaper. Our Automaton works by presenting the results of our classification experiments to interested readers via a web application created using RStudio’s Shiny framework. The app presents a newspaper text—or rather its OCR—alongside a question about the classifier’s accuracy. For example, for a text classified as “literary,” the Automaton states, “This text is classified as LITERARY. Does that seem right?” Below, a description of what we mean by “literary” is given: “In our corpus, literary texts can be poetry or prose such as sermons, sketches, vignettes, or essays.” The user can choose “Yes” to agree with the classifier, “No” to disagree, or “Not sure.” If the user disagrees with the genre classification, they are asked to choose a label from among our other classified genres: news, advertisement, or informational. There is also an option for “other” that allows the user to answer the question “How would you classify this text?” If the user selects “Not sure,” the Automaton asks “What makes this difficult to classify?,” and users can choose from the following options: “The OCR is bad; None of the genres apply; There’s a mix of genres.” The Genre Automaton thus provides a means of checking our classifier and of introducing our corpus to interested readers, and the responses it gathers become training data for future classification efforts.
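A much-reduced sketch of the Automaton’s interface logic, written with Shiny (illustrative only; the production application draws each text and its classification from the classifier’s output and stores responses for retraining):

```r
library(shiny)

# A much-reduced, illustrative sketch of the Automaton's interface:
# show one classified text and ask the reader to confirm or correct the label
ui <- fluidPage(
  h3("Viral Texts Genre Automaton"),
  verbatimTextOutput("ocr_text"),
  p("This text is classified as LITERARY. Does that seem right?"),
  helpText("In our corpus, literary texts can be poetry or prose such as sermons, sketches, vignettes, or essays."),
  radioButtons("agree", label = NULL, choices = c("Yes", "No", "Not sure")),
  conditionalPanel("input.agree == 'No'",
    selectInput("relabel", "How would you classify this text?",
                choices = c("news", "advertisement", "informational", "other"))),
  conditionalPanel("input.agree == 'Not sure'",
    checkboxGroupInput("why", "What makes this difficult to classify?",
                       choices = c("The OCR is bad", "None of the genres apply",
                                   "There's a mix of genres"))),
  actionButton("submit", "Submit")
)

server <- function(input, output, session) {
  # In the production app, the text and its classification come from the
  # classifier's output; a placeholder string stands in here
  output$ocr_text <- renderText("WAGES IN 1800. The condition of the wages-class of that day ...")

  # Responses would be written to a file or database and folded back in as
  # training data for future runs of the classifier
  observeEvent(input$submit, {
    message("Recorded response: ", input$agree)
  })
}

shinyApp(ui, server)
```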
And in fact, researchers have explored the visual qualities of nineteenth-century poems computationally. The University of Nebraska’s Image Analysis for Archival Discovery (Aida) project, led by Elizabeth Lorang and Leen-Kiat Soh, has identified poems in Chronicling America entirely through visual analysis of the shape of text in the newspaper images. ↩
See Lena Hettinger et al. (2015) for a survey of several of these methods. They conclude, “topic features alone are the best discriminative factors for educational and social novels. This indicates that despite orthogonality of topic and genre, topics may still be useful for genre classification” (252). Additionally, see Ted Underwood’s work with the HathiTrust Digital Library. ↩
Christof Schöch (2013) has done some interesting work using PCA to cluster French plays. ↩