Editing a Paper

In order to take newspaper reprinting seriously as a creative, social, and political practice, we must first understand what the newspaper was during the nineteenth century—or better, what the newspaper became over the course of the century. While newspapers were not a new medium, they were radically reimagined several times during this period. A newspaper in 1850 was distinct in form, function, and production from either a newspaper in 1750 or a newspaper in 1950. To undertake a “media-specific analysis” of the nineteenth-century newspaper—to borrow N. Katherine Hayles’s framework—and, in fact, to understand the medium’s parallels to twenty-first-century media, requires us to jettison ideas about journalism tied to twentieth-century frameworks and ideals (Hayles 2002, 29). We must reckon with the specific technologies, political situations, and social constructions of newspapers in the nineteenth century, all distinct from the medium’s contexts in other periods.

The story of American newspapers in the nineteenth century has largely been written as follows: in the first half of the nineteenth century, newspapers in America changed from a relatively expensive, largely urban medium, intended primarily for merchants and political actors, to a relatively cheap, geographically dispersed medium intended for anyone, adults or children, who could read (or who aspired to do so). In larger towns and cities, newspapers were issued daily, perhaps with a larger digest edition issued on the weekend, and in some cases multiple times a day through morning and evening editions. In smaller, rural areas, most papers were issued weekly or bi-weekly. Most American papers were explicitly partisan and nakedly political, but amid these one could also find special interest publications targeted to particular professions, such as farming or commerce; illustrated papers, focused more heavily on sensational stories of crime or intrigue; and family-oriented papers, which published more literature, religion, advice, and other broadly educational material. As this last entry signals, the line between newspaper and magazine was fuzzy and fungible. Finally, the growth of penny papers—which, as their name signals, cost a penny per issue—expanded readership to the middle and lower classes while stoking fears about their sensationalistic content, particularly among the affluent readers who were the primary audience for earlier mercantile and political papers, which cost six cents an issue.

Despite these shifts toward greater access, we would not argue the medium became entirely egalitarian during this time. Instead, the types of newspapers available to readers proliferated, so that the word “newspaper” described a much wider spectrum of publications with distinct aims, perspectives, and readerships. While the meaning of novel newspaper forms was debated then and now—a discussion this book picks up throughout its chapters—the raw expansion of the medium can be quantified. In 1810 there were fewer than 400 newspapers in the US; by 1825 there were more than 800; and by “the time of Tocqueville’s visit in the early 1830s, the United States had some nine hundred newspapers, about twice as many as Great Britain, its nearest rival. Aggregate newspaper circulation in America was significantly higher as well” (Nord 2006, 88, 94).

Some of this rapid increase can be attributed to territorial expansion and population growth, but there were many interlinked technological, social, and political factors driving the medium’s expansion. Like other historical media shifts, changes to newspapers during the nineteenth century were never immediate, uniform, or evenly distributed. Instead, newspaper texts circulated simultaneously along multiple, overlapping vectors of transmission and reception.

In this chapter we demonstrate how computational methods, drawing on large-scale textual data and library metadata, enable quantitative, media-specific analyses of historical newspapers that sometimes bolster and sometimes challenge existing accounts of the medium’s evolution in the nineteenth century. The chapter draws on historiography, original research, and a series of data analyses to examine the editorial and compositional practices of newspaper production through the nineteenth century, demonstrating how data can help illuminate not just the content of historical media—our primary focus in most of this book’s chapters—but also its material features. Digitized newspaper corpora can offer insight not only into what nineteenth-century readers were reading, but how that material was organized on the page, how the length and structure of newspapers varied in different locations and changed over the decades, and how expectations of currency accelerated during the century. Our analysis of both newspaper metadata and textual content in this chapter will in some cases confirm, at scale, the claims made by prior historians, showing for example precisely how much of the page was composed of reused text. In these cases, our work offers quantitative and comparative perspective that complements the archival insights of previous scholars. In other cases, however, our findings argue for greater nuance in existing assumptions, such as the idea that newspapers’ publication frequency—daily, weekly, etc.—correlates strongly with urban or rural geography.

Our exploratory analyses in this chapter also serve as a means to reflect on our sources for computational research in the twenty-first century, giving us a critical fulcrum for evaluating how the construction of digitized newspaper corpora can result in unevenness of coverage, exclusions of historical regions or communities, and variations in data quality that determine what can and cannot be learned about history from the resulting data. These critical readings of our data provide vital context and qualification for our analyses of reprinted material in subsequent chapters. Our analyses in this chapter are based in part on the Viral Texts project’s derived data on newspaper reprinting, which we will detail further in Chapter 3. However, as we will describe, our data about reprinted texts reflects what we can learn from digitized newspapers, which are a small subset of the newspapers that were extant in the nineteenth century. To complement those insights, we draw also on newspaper metadata from the Library of Congress’s U.S. Newspaper Directory, 1690–Present (USND), which attempts to publish descriptive data about all known newspapers in the present territory of the United States. In this chapter we demonstrate how simple metadata such as title, location, frequency of publication, and dates of publication can offer broad insight into the networks of nineteenth-century U.S. newspapers and contextualize computational findings drawn from the narrower band of fully digitized newspapers. The U.S. Newspaper Directory’s metadata helps us better understand the limits of our analysis and quantify what we do not know, and thus cannot claim, about historical newspaper exchanges. For the benefit of other researchers, we publish the metadata we derived from the USND, enriched with geographic information about place of publication, in a public GitHub repository. This data will allow outside researchers to check our findings, as well as to investigate new questions about the geography of newspaper publication during the nineteenth century.

Our broadest aim in this chapter is to ask, and venture preliminary answers to, fundamental questions about how we contextualize the findings of computational methods in literary history. What does it mean for a text, a set of related texts, or a particular newspaper to be typical or representative when drawn from a partial—even if “large-scale”—dataset that will never be fully described by metadata?

Consider the approximately 92 reprints we have identified of Arabella Eugenia Smith’s poem “If I Should Die To-night.” Does 92 reprints constitute a popular selection, or argue for more scholarly attention to Smith’s work?

Were a researcher in a physical archive able to locate 92 reprints of a single poem, they would likely assume that number pointed to wide distribution and popularity—it is certainly many more reprints than we know for many widely-anthologized poems, according to official bibliographies of more canonized poets. How do the 92 reprints of “If I Should Die To-night” compare with identified reprints of other poems—and can we assume these comparative numbers, drawn from our corpus, can be taken as representative of how these same texts’ proliferation would compare in a theoretical, complete corpus?

For example, we have identified more than three times as many reprints of “If I Should Die To-night” as we have of “The Dude” (~26 reprints), a charming, satirical pattern poem that at least one editor claimed was popular, writing in their introduction “much has been said in the papers of late about the ‘Dude’ I have procured.” But the raw numbers for both “If I Should Die To-night” and “The Dude” pale in comparison with Elizabeth Akers Allen’s “Rock Me to Sleep” (~268 reprints) or the anonymous “Beautiful Snow” (~306 reprints).

To really know whether any of these speculative bibliographies—a phrase we will define more fully in the next chapter—constitute a “popular” reprint would require us to know:

How many distinct newspapers are in our dataset? While that might seem a simple matter of counting lines in a spreadsheet, it is complicated by factors such as the fact that any change in newspaper title, however slight, constitutes a new record in the Library of Congress’s metadata. Given the rapidity with which nineteenth-century newspapers changed titles, merged, split, and otherwise mutated, this means that many papers scholars might identify as closely related titles, or even “the same” newspaper save a single new word in the title, nonetheless can be recorded as distinct in the metadata alone.
Next, how many distinct titles have been digitized? While most digitization efforts have taken pains to not duplicate efforts, looking across public databases such as Chronicling America, corporate databases from companies like ProQuest, and non-profit efforts such as the Internet Archive, our metadata cleaning efforts uncover enough duplication that repeated titles could skew results, showing more seeming reprinting than actually occurred.
What can conclusions drawn from the digitized subset of papers, such as our analyses of reprinting networks, tell us about the larger population of nineteenth-century newspapers? As scholars such as Ian Milligan have shown, the simple fact of digitization significantly increases the scholarly attention paid—in both computational and archival analyses—to particular newspapers over others, regardless of whether those titles were historically the most important (Milligan 2013). Although the editorial and curatorial decisions that produced these digital collections have not resulted in a representative sample, we show how to draw some inferences despite these biases.
How well do habits of reprinting, and other forms of textual duplication, speak to readerly desires and values and how much do they simply reflect the technical and economic operations of historical newspaper production? Can we separate reprinting as circulation, such as a poem “going the rounds,” from other forms of internal duplication, such as ad copy or editorial boilerplate? Can we qualify our claims based on reprint quantity with models that characterize usual versus unusual patterns of duplication? Can we demarcate different forms of duplication by kind—perhaps we would even say genre—rapidity, or even technology?

We cannot fully answer these questions in this chapter, or indeed in this book, but our experiments here will trace the outlines of what we can know about the composition of our historical newspaper corpora, and also illustrate how corpus-level analyses can provide deeper understanding of trends in historical editorship beyond the habits of specific historical editors. As eager as we, and we imagine our readers, are to get to the specific texts that “went the rounds” in the nineteenth century, this chapter will argue that understanding the material basis of both nineteenth-century newspapers and twenty-first-century newspaper databases must precede those readings.

Counting Newspapers

Before we can begin a computational study of what nineteenth-century editors put in their newspapers, we must account for the editorial and curatorial decisions that shaped the collections of digitized newspapers we use today. We observe the activities of writers, editors, and compositors through a long chain of choices made long after a newspaper was set in type. In this section we build on data archeologies of digitized newspaper corpora by authors such as Paul Fyfe (2016), Ryan Cordell (2017), and Benjamin Lee (2021), as well as the Atlas of Digitised Newspapers and Metadata written by M. H. Beals et al., to “advocate for the importance of autoethnographic approaches to documenting a cultural heritage dataset’s construction from a humanistic perspective” (Lee 2021). Rather than repeat the investigative work into collections like Chronicling America from this existing literature, we triangulate between multiple historiographies and datasets to complement and enrich those archeologies.

We begin with a set of seemingly simple baseline questions about how we might even count the newspapers we study, which requires us to trace the concentric circles the Stanford Literary Lab names “the published, the archive, and the corpus” (Algee-Hewitt et al. 2016, 2). How many copies of newspapers did printers produce? How many survive to be collected by libraries and archives? Which newspapers are microfilmed? Which microfilms are digitally photographed and transcribed with OCR? Because of these later steps in the data-generating process, we try to be careful about what we can and cannot conclude from our observations.

To start this inference process, we describe how we interpret the data in the Library of Congress’s U.S. Newspaper Directory (USND). Since it covers only the current territory of the United States, the USND does not help us answer similar questions for British, Australian, or other newspapers we consider elsewhere in this book, but the analyses in this chapter provide models for analyzing newspaper metadata that could be applied to those other collections in the future.

A Note on Reading This Chapter

From this point, this chapter will model literate programming, weaving code and prose together to show how we explore historical newspaper text and metadata and develop historical arguments from that data analysis. We do not attempt to explain precisely what each line of code or function does, which would be cumbersome for both code-proficient readers—for whom such explanation would be extraneous—and for readers new to code—for whom such explanation would be insufficient. Instead, we follow Benjamin M. Schmidt’s argument in “Do Digital Humanists Need to Understand Algorithms?” and attempt to account for the transformations that each section of code exacts on our data (Schmidt 2016). The goal is not to create a code tutorial, but to responsibly account for the scholarly decisions expressed in our code.

To that end, we encourage our readers across the disciplinary spectrum to engage with this chapter. If the code is unfamiliar, skim and find our prose that seeks to explain the effects of that code on the data. For those with more technical expertise, the code can provide a concrete sense of the decisions we have made.

Organizing the Newspaper Catalog

To prepare for working with the newspaper catalog data, we load some libraries and define some helper functions. A newspaper record in the Library of Congress catalog records its earliest and latest year. In order to track changing newspaper density over time, we turn these ranges into a series of records, one for each year. This approach allows us to observe trends over the course of years and decades, but it is worth noting what this approach obscures. Recording the first year in which a newspaper appears does not distinguish between a newspaper that starts publishing on January 1 and one that starts on December 31. Furthermore, a newspaper might have gaps or irregularities in its run during the year that our approach would not note. A small proportion of catalog records do contain more precise dates.
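As an illustration only, the expansion of a cataloged year range into one record per year might be sketched in Python as follows. The function name `expand_years` and the choice to skip records with unknown dates are assumptions made for this sketch, not a description of the project's own code.

```python
from typing import Iterator, Optional, Tuple


def expand_years(
    series: str, start: Optional[int], end: Optional[int]
) -> Iterator[Tuple[str, int]]:
    """Yield one (series, year) pair for every year in an inclusive range."""
    if start is None or end is None or end < start:
        return  # assumption: unknown or inconsistent dates yield no records
    for year in range(start, end + 1):
        yield (series, year)


# The daily dispatch (Richmond), cataloged as running from 1850 to 1884.
yearly = list(expand_years("/lccn/sn84024738", 1850, 1884))
print(len(yearly), yearly[0], yearly[-1])
```

Because the range is inclusive at both ends, a paper that began on December 31, 1850 is counted for all of 1850, which is exactly the loss of precision the paragraph above describes.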

Each time a newspaper changes its name, libraries typically create a new entry in the catalog. While this approach is precise and avoids confusion when papers split or merge, nineteenth-century newspapers changed names frequently, while historians often wish to study papers as single entities over time and through name changes. In chapter four, for example, we use social network analysis to trace the changing influence of newspapers over time. In those network graphs, we would not want for each version of, for example, the Nashville Union and American to show up as a separate node, but instead to treat many of those minor variations as the same entity, at least for the purposes of that specific analysis. In order to facilitate those kinds of analyses, we use catalog links among “preceding entry” and “succeeding entry” to join these records into continuous series groups. This function creates a group field with a shared identifier.
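One common way to join linked records into groups is to treat each “preceding entry” or “succeeding entry” link as an edge and find connected components. The sketch below uses a small union-find structure for that purpose. The function name, the example links, and the choice of the smallest control number as the group label are all assumptions for illustration.

```python
def group_series(links):
    """Map each series id to a group id shared by all linked series.

    `links` is an iterable of (series_a, series_b) pairs. The group id is
    the lexicographically smallest series id in each connected component
    (an assumption made for this sketch).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller id as root so the group label is deterministic.
            if root_b < root_a:
                root_a, root_b = root_b, root_a
            parent[root_b] = root_a

    return {series: find(series) for series in list(parent)}


# Hypothetical succession links among three catalog records.
links = [
    ("/lccn/00221518", "/lccn/00221519"),
    ("/lccn/00221519", "/lccn/00221520"),
]
groups = group_series(links)
print(groups)
```

With this structure, a chain of title changes of any length collapses into a single group, regardless of the order in which the links are read.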

The MARC standards for library records use single-letter codes to represent publication frequency, so we next expand these into words more easily understood by human readers.
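A lookup table is enough to sketch this expansion. The mapping below covers only some of the frequency codes, as we understand them from the MARC 21 fixed-field definitions for continuing resources, and should be checked against the standard before reuse. The names `FREQUENCY_LABELS` and `expand_frequency`, and the fallback label “Other”, are assumptions for this example.

```python
# Partial mapping of MARC frequency codes to readable labels.
FREQUENCY_LABELS = {
    "d": "Daily",
    "i": "Three times a week",
    "c": "Semiweekly",
    "w": "Weekly",
    "e": "Biweekly",
    "s": "Semimonthly",
    "m": "Monthly",
    "u": "Unknown",
}


def expand_frequency(code: str) -> str:
    """Return a readable label for a single-letter frequency code."""
    # Assumption: codes outside the partial table fall back to "Other".
    return FREQUENCY_LABELS.get(code.strip().lower(), "Other")


print([expand_frequency(c) for c in ["d", "m", "u", "w"]])
```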

Now we’re ready to load comma-separated value (CSV) data derived from the Library of Congress MARC records. We only extract those fields we will use for the analyses in this chapter.

Here we show the first six records to provide context for the full dataset.

	series	title	carrier	div1	div2	city	startDate	endDate	frequency	lang
1	/lccn/00062183	Polak amerykański = American Pole.		New York	Erie	Buffalo	190u	19uu	d	pol
2	/lccn/00064000	The apostolic bulletin.		Texas	Bell	Temple	1918	uuuu	m	eng
3	/lccn/00064001	Central Texas oil journal.		Texas	Bell	Temple	1919	uuuu	u	eng
4	/lccn/00064002	Bell County socialist.		Texas	Bell	Temple	1913	uuuu	w	eng
5	/lccn/00064003	Central Texas forum.		Texas	Bell	Temple	1892	uuuu	u	eng
6	/lccn/00064004	The Central Texas weekly news.		Texas	Bell	Temple	193u	uuuu	w	eng

We load a table linking cataloged series to connected series groups. Without loss of generality, we use the control number for the first series in a group to identify the group.

	group	series
1	/lccn/00221339	/lccn/00221340
2	/lccn/00221500	/lccn/00221501
3	/lccn/00221502	/lccn/00221503
4	/lccn/00221518	/lccn/00221519
5	/lccn/00221518	/lccn/00221520
6	/lccn/00221521	/lccn/00221522

Not all available newspaper metadata is contained in the MARC records. The USND’s search interface, for example, allows users to narrow results by the ethnicity of a given paper’s audience or the labor sector it targeted. We have used these values to further enrich our data so we can investigate such community dynamics.

Finally, we load a table of the dates of the first and last digitized issue of each newspaper series in the corpora we work with. In its raw form, this table contains duplicates if a particular series has been digitized in two different collections, such as Chronicling America and Gale’s US Newspapers collection. This table also contains newspapers from outside the US, which don’t appear in the Library of Congress catalog. When comparing what the catalog tells us about US newspapers with the digitized record, we only consider the subset of digitized newspapers that are in the catalog.

	series	corpus	startDig	endDig
1	/lccn/00221512	gale-us	1872-04-04	1872-04-18
2	/lccn/00225879	ca	1915-07-03	1928-11-30
3	/lccn/00229120	gale-us	1898-07-01	1899-12-30
4	/lccn/02004276	aps	1850-05-01	1852-12-01
5	/lccn/02004276	ia	1850-05-01	1852-12-01
6	/lccn/02004276	moa	1850-05-01	1852-11-01

In the corpora.csv file below, we provide more granular information about the collections of digitized newspapers we use in the Viral Texts project.

We start processing the catalog data by removing non-US newspapers (perhaps erroneously pulled in by a search for “Georgia”), expanding the frequency codes, and converting series start- and end-years to integers. Those years that are blank or that contain the u character to mark uncertainty will be NA for now, though below we will attempt to infer their dates more precisely. We merge series digitized in different corpora and take the union of their date ranges. We then convert their date years to integers.
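The two conversions described here can be sketched as follows. This is a hedged illustration rather than the project's code: `parse_year` and `merge_digitized` are hypothetical names, `None` stands in for NA, and the “union” of date ranges is simplified to the single span from the earliest start to the latest end.

```python
import re
from datetime import date


def parse_year(value):
    """Return an int for a fully known four-digit year, otherwise None."""
    value = (value or "").strip()
    return int(value) if re.fullmatch(r"[0-9]{4}", value) else None


def merge_digitized(rows):
    """Collapse (series, corpus, start, end) rows to one year span per series."""
    spans = {}
    for series, _corpus, start, end in rows:
        lo = date.fromisoformat(start).year
        hi = date.fromisoformat(end).year
        if series in spans:
            # Assumption: overlapping or adjacent runs are merged into one span.
            lo, hi = min(lo, spans[series][0]), max(hi, spans[series][1])
        spans[series] = (lo, hi)
    return spans


# Rows taken from the digitized-issues table shown above.
rows = [
    ("/lccn/00221512", "gale-us", "1872-04-04", "1872-04-18"),
    ("/lccn/02004276", "aps", "1850-05-01", "1852-12-01"),
    ("/lccn/02004276", "ia", "1850-05-01", "1852-12-01"),
    ("/lccn/02004276", "moa", "1850-05-01", "1852-11-01"),
]
digitized_spans = merge_digitized(rows)
print(parse_year("1918"), parse_year("190u"), digitized_spans)
```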

Due to uncertainty and error, some catalog records have a narrower date range than the dates of digitized newspapers. We extend these date ranges so that the set of years a newspaper is digitized is guaranteed to be a subset of its date range in the catalog.
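A minimal sketch of this widening step, assuming year-level integers and using `None` for an unknown catalog year. The function name and the decision to fill an unknown catalog endpoint with the digitized one are assumptions for illustration.

```python
def extend_range(cat_start, cat_end, dig_start, dig_end):
    """Widen a catalog year range so that it contains the digitized range."""
    # Assumption: an unknown catalog endpoint is replaced by the digitized one.
    start = dig_start if cat_start is None else min(cat_start, dig_start)
    end = dig_end if cat_end is None else max(cat_end, dig_end)
    return start, end


# Hypothetical record: the catalog says 1852-1860, digitized issues span 1850-1855.
print(extend_range(1852, 1860, 1850, 1855))
```

After this step, every digitized year falls inside the catalog range, so later comparisons between the two sources never count a digitized paper in a year when the catalog says it did not exist.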

The Library of Congress catalog often contains multiple records attesting to the same historical newspaper. For instance, there might be separate records for the print run of a newspaper, a microfilm, and an entry in an electronic database, including records for the digitized newspapers in Chronicling America.

If a newspaper published in Richmond, Virginia, changes its name from The daily dispatch to the Richmond dispatch in 1884, then there is a new catalog entry. Related titles, such as The weekly dispatch and the Semi-weekly dispatch get their own catalog entries. Much of the time, though not always, catalogs record these sorts of successor/predecessor or daily/weekly relationships in other fields (not shown here). The Library of Congress catalog also contains separate records for different formats, such as the second record for The Daily dispatch on microfilm.

series	title	carrier	div1	city	startDate	endDate	frequency	lang
/lccn/sn84024738	The daily dispatch.	volume	Virginia	Richmond	1850	1884	Daily	eng
/lccn/sn85026363	The weekly dispatch.	volume	Virginia	Richmond	1850	1903	Semiweekly	eng
/lccn/sn94043545	The Daily dispatch.	microfilm reel	Virginia	Richmond	1850	1884	Daily	eng
/lccn/sn85026059	Semi-weekly dispatch.	volume	Virginia	Richmond	1857	1903	Semiweekly	eng
/lccn/sn85038614	Richmond dispatch.	volume	Virginia	Richmond	1884	1903	Daily	eng

While these distinctions are useful for serving library users’ needs, this chapter is more focused on inference about properties of newspapers in the nineteenth century. To put that another way, in a library catalog it can be useful to differentiate each time a newspaper changed names, which historically often reflects a change in editorship, or possibly a merger or schism. For analyses seeking to understand broader trends in the medium over time, however, separating a newspaper that was functionally contiguous into distinct publications, with distinct founding and ending dates, can obscure the very trends we seek to better understand. We therefore remove catalog records that share a title, edition, frequency, place of publication, and date range with another newspaper. Where duplicate records exist, we prioritize the digitized one—which we can study directly—and, following the convention on the Chronicling America website, we prioritize the record for the print over the microfilm or online version.
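The deduplication rule can be illustrated with a short sketch. The field names follow the tables in this chapter; the function name, the title normalization (lowercasing and stripping the trailing period), the omission of the edition field, and the set of digitized identifiers are assumptions made for the example.

```python
def dedupe(records, digitized):
    """Keep one record per (title, frequency, place, date range) key.

    Digitized records win over undigitized ones, and print volumes win
    over microfilm or other carriers.
    """
    carrier_rank = {"volume": 0, "microfilm reel": 1}

    def key(r):
        title = r["title"].lower().rstrip(". ")  # assumed normalization
        return (title, r["frequency"], r["div1"], r["city"],
                r["startDate"], r["endDate"])

    def priority(r):
        # False sorts before True, so digitized records come first.
        return (r["series"] not in digitized, carrier_rank.get(r["carrier"], 2))

    best = {}
    for r in records:
        k = key(r)
        if k not in best or priority(r) < priority(best[k]):
            best[k] = r
    return list(best.values())


records = [
    {"series": "/lccn/sn94043545", "title": "The Daily dispatch.",
     "carrier": "microfilm reel", "div1": "Virginia", "city": "Richmond",
     "startDate": 1850, "endDate": 1884, "frequency": "Daily"},
    {"series": "/lccn/sn84024738", "title": "The daily dispatch.",
     "carrier": "volume", "div1": "Virginia", "city": "Richmond",
     "startDate": 1850, "endDate": 1884, "frequency": "Daily"},
]
# Assumed for illustration: only the print record is marked as digitized.
kept = dedupe(records, digitized={"/lccn/sn84024738"})
print([r["series"] for r in kept])
```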

Because catalogs focus on titles, a raw count would treat newspapers differently depending on how often they changed their titles. Unlike the Richmond Dispatch, the Washington, D.C., Evening star started publishing in the 1850s but did not change its name until 1972. We would not want our analyses to give too much weight either to newspapers that changed names frequently or those that did not—to argue, for instance, that more newspapers were founded in Richmond simply because one newspaper was renamed multiple times.

series	title	carrier	div1	div2	city	startDate	endDate	frequency	lang
/lccn/sn83045462	Evening star.	volume	District of Columbia		Washington	1854	1972	Daily	eng

In a final processing of the catalog data, we use the groups table loaded above to link together newspapers and then select one representative for each group. This means that when a newspaper changes its name in a given year, we will only count one newspaper for that year instead of two. If a paper changed from title A, to B, and then back to A in the same year, the Library of Congress would catalog it as three series. Among digitized papers, this only shows up in The tri-weekly journal (Camden SC: 1865). Among un-digitized papers, we are thus undercounting the number of papers that reverted their names within the span of a calendar year, but this allows us to make a conservative estimate of the total number of papers.
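To make the effect of grouping on yearly counts concrete, the sketch below counts distinct groups, rather than distinct catalog records, active in each year. The function name and the group mapping are hypothetical; the two Richmond records and their dates come from the table above.

```python
from collections import defaultdict


def papers_per_year(series_years, group_of):
    """Count distinct series groups active in each year."""
    active = defaultdict(set)
    for series, start, end in series_years:
        group = group_of.get(series, series)  # ungrouped series stand alone
        for year in range(start, end + 1):
            active[year].add(group)
    return {year: len(groups) for year, groups in sorted(active.items())}


series_years = [
    ("/lccn/sn84024738", 1850, 1884),  # The daily dispatch
    ("/lccn/sn85038614", 1884, 1903),  # Richmond dispatch
]
# Hypothetical group table linking the successor to its predecessor.
group_of = {
    "/lccn/sn84024738": "/lccn/sn84024738",
    "/lccn/sn85038614": "/lccn/sn84024738",
}
grouped = papers_per_year(series_years, group_of)
ungrouped = papers_per_year(series_years, {})
print(grouped[1884], ungrouped[1884])
```

In the year of the name change, the grouped count registers one newspaper while the ungrouped count registers two, which is the inflation the grouping step is meant to remove.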

Here we plot the overall effect of these transformations of the catalog over the course of the nineteenth century. Removing duplicates reduces the estimate by a small but significant amount after about 1840. Although one effect of grouping related series is to reduce the estimate for some years, a countervailing effect is to increase the number of newspapers for which we have known years. Because the date at which one paper gave way to its successor might be uncertain, we can increase the number of known dates by merging these adjacent series.

Overall Counts

After cleaning data from the Library of Congress’s U.S. Newspaper Directory to estimate the number of distinct newspapers published in each year, we can see tremendous growth, from a few hundred papers in 1800 to over 12,000 by the end of the century, with only a slight reversal during the Civil War.

These figures hew closely to Nord’s estimates, quoted above, for 1810 (“fewer than 400 newspapers”) and 1825 (“more than 800”), although our count is significantly higher than the 900 he estimated for the year of Tocqueville’s visit. As scholars such as Louisa Trott have shown, the U.S. Newspaper Program’s catalog undercounts the newspapers that existed, particularly those published by minority communities, so we would expect that even our numbers undercount the historical reality, though they offer a more precise accounting than was previously available (Trott 2022). Our data also provide a more precise account of newspaper proliferation in the latter half of the nineteenth century, when the rapid growth of the medium becomes harder to track by hand.

year	n
1810	416
1825	825
1831	1177

If we looked instead at the original catalog data, our estimates would be over 10% higher still.

year	n
1810	493
1825	978
1831	1350
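The size of the gap between the two tables can be checked with a short calculation. The counts are copied from the tables above; the variable names are ours.

```python
# Newspaper counts after cleaning and in the original catalog data.
cleaned = {1810: 416, 1825: 825, 1831: 1177}
original = {1810: 493, 1825: 978, 1831: 1350}

# Percentage by which the original catalog count exceeds the cleaned count.
excess = {
    year: round(100 * (original[year] - cleaned[year]) / cleaned[year], 1)
    for year in cleaned
}
print(excess)
```

For these three years the uncleaned catalog overstates the count by roughly 15% to 19%.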

While these graphs do not rewrite existing scholarly accounts of newspaper proliferation, they do offer quantitative and precise evidence of that spread that particularly expands our picture of the mid and latter half of the century.

Digitized Newspapers

Much of what we want to know, however, requires us to look beyond the catalog to the contents of published papers. For the purposes of this book, we examine digital photographs of newspaper pages and read—both computationally and closely—the digitized text. Elsewhere in this book, we consider newspapers from beyond the United States; for most of this chapter, we focus on four digital collections with digitized papers that overlap with the USND.

corpus	n	title	description
ca	19863	Chronicling America	https://chroniclingamerica.loc.gov/
gale-us	5303	Gale Nineteenth Century U.S. Newspapers	https://www.gale.com/c/nineteenth-century-us-newspapers
aps	942	ProQuest American Periodicals Series	https://about.proquest.com/en/products-services/aps/
ia	923	Internet Archive	https://archive.org/details/texts

Given what we know about the concentric circles of the published, the archive, and the corpus, we might wonder how likely it is that any given paper will be digitized. This probability varies by year, place of publication, language, and other features. To assess it, we compare the proportion of papers in each year from the catalog with the proportion of various digitized collections that come from each year.

To make the different curves comparable, each one is normalized over the time interval, so that the area under each curve equals 1. We see that both the Chronicling America and the Gale US collections oversample newspapers, relative to the catalog, from before 1870 and undersample them after 1870. This is also true of the whole set of digitized newspapers, which includes data from ProQuest’s American Periodicals Series and the Internet Archive collections. Chronicling America, at least, still accurately reflects the fact that there are more papers published in years after 1870 than before; for Gale US, there is an absolute decline in the number of digitized papers after the Civil War, even though more papers existed after that date historically.
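The normalization can be sketched as dividing each yearly count by the total across the interval, so that each curve sums to 1 and collections of very different sizes can be compared directly. The counts below are invented for illustration and do not come from the catalog or from any digitized collection.

```python
def normalize(counts):
    """Scale yearly counts so that they sum to 1 across the interval."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot normalize an empty or all-zero series")
    return {year: n / total for year, n in counts.items()}


# Hypothetical yearly counts for a catalog and a digitized collection.
catalog = {1850: 200, 1870: 300, 1890: 500}
digitized = {1850: 40, 1870: 35, 1890: 25}

catalog_share = normalize(catalog)
digitized_share = normalize(digitized)

# Years in which the collection holds a larger share than the catalog does.
oversampled = [y for y in catalog if digitized_share[y] > catalog_share[y]]
print(oversampled)
```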

When drawing inferences from the digitized sample, then, we would not want to overgeneralize from the years before 1870. One solution might be to reweight the digitized sample to follow the distribution of years in the catalog. Unfortunately, the decision to digitize depends not only on the year but also on other factors, such as state, language, and frequency of publication.

The National Digital Newspaper Program (NDNP) provided funding and infrastructure support from the National Endowment for the Humanities and the Library of Congress to state-level digitization efforts. States have applied to the NDNP at different times, and some states have received more rounds of funding than others, such that their newspapers are better represented in digital collections. For example, we can see that the proportion of newspapers in the catalog from each state or territory does not match the proportion of newspapers in the digitized sample.

Displaying the same data in another way, we can see that the proportion of papers digitized varies greatly among states, and is often quite far from the overall rate of digitization of 5% indicated by the vertical line.

Other features of newspapers are distributed unevenly in space and time. As we will see below, the frequency with which a newspaper is published leads to different patterns of reprinting. Overall, the digitized collections overrepresent daily papers, and different states had different numbers of daily papers compared to other publication frequencies.

A newspaper’s language has an even bigger effect on what it prints and reprints. Although the overall digitized sample is fairly representative of the population, with 4.5% non-English papers, the proportion of non-English papers both in the catalog and among digitized papers varies by state. In Minnesota, the digitized sample overrepresents the proportion of non-English papers; in Hawaii, the digitized sample contains only 25% non-English, while the USND catalog contains 50% non-English.

Digitized data on the Black press provide an even starker example of the limits of a “representative sample.” Due to legal, economic, and social barriers to the emergence of the Black press, less than 1% of nineteenth-century newspapers, according to the USND’s coding, are African American, which aligns with Benjamin Fagan’s description of the “whiteness of these public archives” which can skew computational analyses and keyword search research alike. As Fagan argues,

we must be careful to acknowledge at the outset that any conclusions drawn from such archives apply specifically, and exclusively, to white periodicals, which cannot be made to stand in for all of early American periodical production. Otherwise, we risk conflating Americanness with whiteness. This is a slippage that the racial politics of digitization encourages us to repeat, which makes it all the more important that we reveal it in our teaching and scholarship (Fagan 2016, 12).

If we sought to understand this subset of nineteenth-century newspapers by digitizing even a small number of them, the sample becomes unbalanced in another way: digitizing only a single paper each from Oklahoma or West Virginia, or two papers each from Minnesota and Wisconsin, overrepresents the Black press in those states, at least as a proportion of digitized newspapers. Our approach, as we discuss below, is to explicitly model the process of selecting papers for digitization, rather than relying on the representativeness of any particular sample we might construct. This approach need not be one of resignation toward inequities, however. Following Rediet Abebe et al.’s argument that too much technical work “treats problematic features of the status quo as fixed,” we instead suggest that modeling how newspapers are selected for digitization can “serve as a diagnostic, helping us to understand and measure social problems with precision and clarity” (Abebe et al. 2020). Our goal is not to portray the current state of newspaper digitization as fixed and immoveable, but instead to offer a detailed account of what current collections include and do not, in hopes these conclusions might influence future digitization efforts.

Many features of newspapers vary across both space and time. In general, the proportion of daily papers increased during the nineteenth century, but at different times in different states. The small sample of digitized papers in many states leads to larger swings in the proportion of dailies among digitized papers.

How then should we generalize to the whole population from a non-representative digitized sample of ~5%? One approach from survey research is poststratification. Say that we conduct an opinion poll of residents of the United States. In addition to asking the respondents about the topic of interest—which candidate they intend to vote for or how many pets they have—we also ask for some demographic data. If we receive fewer responses from women—or from residents of Pennsylvania—than we would expect based on the proportion of women or Pennsylvanians in the population, we could estimate poll responses for those subpopulations and then combine them according to those known population proportions. Using (for better or worse) 2020 U.S. Census categories of males and females, we could collect poll responses for men and multiply them by 49% and poll responses for women and multiply them by 51%.
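The arithmetic of poststratification can be sketched in a few lines of Python. The group means and sample sizes here are invented for illustration; only the 49% and 51% population shares come from the discussion above. The function names are likewise hypothetical.

```python
# Hypothetical poll: share of respondents in each group answering "yes".
# These group means are invented for illustration only.
group_means = {"men": 0.40, "women": 0.60}

# Hypothetical sample sizes: women are under-represented among respondents.
sample_sizes = {"men": 700, "women": 300}

# Known population proportions (the 49% / 51% split used in the text).
population_share = {"men": 0.49, "women": 0.51}


def raw_estimate(means, sizes):
    """Unweighted sample mean: every respondent counts equally."""
    total = sum(sizes.values())
    return sum(means[g] * sizes[g] for g in means) / total


def poststratified_estimate(means, shares):
    """Weight each group's mean by its known share of the population."""
    return sum(means[g] * shares[g] for g in means)


raw = raw_estimate(group_means, sample_sizes)
post = poststratified_estimate(group_means, population_share)
print(f"raw sample estimate:     {raw:.3f}")
print(f"poststratified estimate: {post:.3f}")
```

Because the invented sample over-represents men, the raw estimate is pulled toward the men's mean, while the poststratified estimate reweights the two groups to their population shares.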

We could also stratify men and women into further subpopulations by age and state of residence. (Stratification by age might also lead us to limit the population of interest: most polls don’t attempt to estimate the opinions of children; election polls try to concentrate on the opinions of voters.) If we add education, party affiliation, race, and religion, the number of different bins is multiplied considerably. Many bins may contain no respondents to our poll. If we get a thousand responses to a nationwide poll, we might have no responses from male Democrats from Wyoming who are 40–65 years old. Responses from other men, Democrats, Wyoming residents, and 40–65-year-olds, however, might help us estimate how those categories would interact. More specifically, we can go beyond counting responses in a fixed number of bins to estimating a statistical model of the poll response given observed categories like state, age, etc. One widely-used version of this technique is multilevel regression with poststratification (Mr. P, i.e., “Mister P”, Gelman and Little, 1997).

Consider, for the sake of concreteness, the variables in the USND indicating the year (100 values) and state (52 values) of publication, as well as Boolean (two-valued) variables for “African American”, “daily”, and “non-English”. If we could set each of these variables independently, we could potentially sort newspapers into one of 41,600 possible bins.

Distinct values for each variable

year	state	african.american	daily	non.english
100	52	2	2	2

Possible bins: 41,600

In reality, these variables are correlated. To take a simple example, different states acquired printing presses, and thus newspapers, in different years. The USND catalog, therefore, shows that only about 10,000 of these bins, or 25%, are occupied. Of those bins in which the catalog records newspapers, we only have digitized data for about 6000 or 59%.

catalog	digitized
10601	6327
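As a quick check on these figures, the following Python sketch recomputes the number of possible bins and the two occupancy shares from the counts reported above. The variable names are our own; the numbers are those given in the text.

```python
from math import prod

# Distinct values per variable, as reported in the text.
distinct_values = {
    "year": 100,
    "state": 52,
    "african.american": 2,
    "daily": 2,
    "non.english": 2,
}

# If every variable could be set independently, the number of bins is the
# product of the number of values each variable can take.
possible_bins = prod(distinct_values.values())

# Occupied-bin counts reported in the text.
catalog_bins = 10601
digitized_bins = 6327

occupied_share = catalog_bins / possible_bins
digitized_share = digitized_bins / catalog_bins

print(f"possible bins: {possible_bins}")
print(f"share of possible bins occupied in the catalog: {occupied_share:.1%}")
print(f"share of occupied bins with digitized data: {digitized_share:.1%}")
```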

On the other hand, 59% is greater than 5%. That is to say, if the phenomena we’re interested in—such as the density of reprints or advertisements, or number of pages or images or recipes published in a newspaper—can be explained by the five variables listed above, then we have observations for more than half of the actually possible settings of those variables, which is greater than the overall 5% sample at the newspaper level. Of course, we might decide that other variables such as city (not just state) or political affiliation might be useful, thereby increasing the number of bins.

The histogram shows not only a large number of bins with no digitized newspapers but also many bins with only one or two observations. In the rest of this chapter, we will describe methods for using evidence from more densely populated bins to help us draw inferences about sparsely populated or empty bins.

Predicting Digitization

Before we try to correct for selection bias in preservation and digitization, we can ask more directly which features of a newspaper predict its presence in our digitized sample. We separate out the number of series-years from different source collections.

group	gale-us	ca	aps	ia
/lccn/00221512	1	NA	NA	NA
/lccn/00225457	1	NA	NA	NA
/lccn/00225879	NA	15	NA	NA
/lccn/00229120	2	NA	NA	NA
/lccn/05014021	NA	NA	27	27
/lccn/10021978	NA	14	NA	NA

We then estimate a logistic regression model to predict whether a particular newspaper series will be digitized in a particular collection. In this case, for example, we use the state and frequency (daily, weekly, etc.) of the newspaper and whether it is in a non-English language or on the USND’s list of African-American newspapers as features to predict a binary outcome, whether that newspaper was digitized by the US National Digital Newspaper Program and included in Chronicling America.
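To make the mechanics concrete, here is a minimal sketch of logistic regression in plain Python, fit by batch gradient descent. It is not the model estimated for this chapter: the rows are invented, only two Boolean features are used, and a real analysis would rely on a statistics library with regularization. It shows only how binary features can be mapped to a probability of digitization.

```python
import math

# Invented toy rows: (is_african_american, is_daily) -> digitized (1) or not (0).
# These are NOT drawn from the USND; they only illustrate the mechanics.
rows = [
    ((1, 0), 1), ((1, 0), 1), ((1, 1), 1), ((1, 0), 0),
    ((0, 0), 0), ((0, 0), 0), ((0, 1), 0), ((0, 1), 1),
    ((0, 0), 0), ((0, 1), 0), ((0, 0), 1), ((0, 0), 0),
]


def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(data, steps=5000, lr=0.1):
    """Fit feature weights and an intercept by batch gradient descent."""
    n_features = len(data[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(steps):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in data:
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = p - y  # gradient of the log loss with respect to the score
            grad_b += err
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
        bias -= lr * grad_b / len(data)
        weights = [w - lr * g / len(data) for w, g in zip(weights, grad_w)]
    return weights, bias


weights, bias = fit_logistic(rows)
print("coefficient for african.american:", round(weights[0], 2))
print("coefficient for daily:", round(weights[1], 2))
print("P(digitized | non-daily, not on list):", round(sigmoid(bias), 2))
print("P(digitized | non-daily, on list):", round(sigmoid(bias + weights[0]), 2))
```

In this invented sample, papers on the African-American list are digitized more often than others, so the fitted coefficient for that feature is positive.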

We can see that some features of a newspaper, such as being from Delaware or Arizona, are highly positively correlated with being digitized in the NDNP. There are no papers from New Hampshire and only five from Massachusetts in Chronicling America. Although there are a few more newspapers from New York, that state remains very underrepresented. Even the data from Pennsylvania, which digitized 71 newspapers, substantially underrepresents the large number of newspapers published there in the nineteenth century.

state	digitized	n
Massachusetts	5	2002
New Hampshire	0	644
New York	12	4575
Pennsylvania	71	3527
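The scale of this underrepresentation is easier to see as a rate. The short Python sketch below divides the digitized count by the catalog count for each state in the table above; the variable names are our own.

```python
# Counts from the table above: (digitized in Chronicling America, total in catalog).
state_counts = {
    "Massachusetts": (5, 2002),
    "New Hampshire": (0, 644),
    "New York": (12, 4575),
    "Pennsylvania": (71, 3527),
}

# Share of each state's cataloged papers that appear in Chronicling America.
rates = {state: d / n for state, (d, n) in state_counts.items()}

for state, rate in sorted(rates.items(), key=lambda item: item[1]):
    print(f"{state:15s} {rate:.2%}")
```

Even Pennsylvania, the best represented of these four states, has only about 2% of its cataloged papers in Chronicling America.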

In the regression coefficients predicting inclusion in Chronicling America, we can also see the efforts of the program to digitize the Black press, with a high positive coefficient for african.american. Also notable are the larger negative coefficients for monthly and semimonthly publications. Although there are many monthly magazines not included in the USND catalog, some must have slipped in, but even those are comparatively unlikely to have been digitized by the NDNP.

For comparison, we can fit a logistic regression model using the same features to predict inclusion in Gale’s US newspaper collection.

Besides focusing on a different set of states, these regression coefficients for Gale US suggest a weaker preference for Black papers than in Chronicling America. They also show a preference, all else being equal, for dailies over weeklies.

Scholars interested in “archaeologies of digitization”, as Bonnie Mak usefully termed these investigations of our datasets, could extend these predictive models in several ways (Mak 2014). As Cordell pointed out in his analysis of Pennsylvania’s NDNP digitization, a preference for geographic coverage—seeking to digitize at least one paper for every county in Pennsylvania—will lead to undersampling denser urban publishing centers. We could simplify the model predicting inclusion in Chronicling America with a single variable indicating whether a state had received an NDNP grant, but it would be hard to capture these kinds of constraints on geographic coverage, or budget constraints on how many papers a project can digitize, with the kind of simple model we use here where every paper’s digitization is predicted independently. To capture the kinds of tradeoffs in coverage and budget that digitization projects need to make, we would need to learn the parameters of a more complex optimization problem.

Predicting Frequencies

In the rest of this chapter and book, we focus less on predicting whether a paper will be digitized and more on correcting our inferences about other properties of newspapers given their propensity to be present in our digitized samples.

As a first step toward drawing inferences about categories of newspapers that are poorly represented in our digitized sample, we can build a model to predict a feature that we can observe in the USND catalog: whether a paper is published daily. As we saw above, dailies are not distributed uniformly among states, and their prevalence changes over the nineteenth century.

We base this model on metadata describing city and state of publication and regularity of publication. The USND catalog only has frequency information at the series level, not the year level, so in this experiment, we only classify each paper once for its entire run.2 We do, however, also include a feature for the first year of a paper’s run, to capture the changing prevalence of daily newspapers overall.

Since daily papers often—though far from always—have particular words like “daily” in their titles, we tokenize the newspaper titles, which simply means we convert full titles into individual words, so we can trace key words related to frequency, like “Daily,” cumulatively. Adding a boolean feature for a title word doubles the number of possible bins. This would not be at all practical if we were performing simple poststratification. The key observation is that these features are inputs to a statistical model that then predicts the variable of interest—here, whether it is a daily—for that newspaper.
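A minimal Python sketch of this tokenization step follows. The titles and series identifiers are invented, and the `title.` prefix on feature names is our own convention for illustration rather than a description of the actual feature matrix.

```python
import re

# Hypothetical titles keyed by invented series identifiers.
titles = {
    "series-a": "The Daily Evening Bulletin",
    "series-b": "The Weekly Herald",
    "series-c": "Evening Star",
}


def tokenize(title):
    """Lowercase a title and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", title.lower())


# Vocabulary of all title words, in a stable (sorted) order.
vocabulary = sorted({tok for t in titles.values() for tok in tokenize(t)})

# One Boolean feature per vocabulary word for each series.
title_features = {
    series: {f"title.{word}": word in tokenize(title) for word in vocabulary}
    for series, title in titles.items()
}

for series, features in title_features.items():
    present = [name for name, value in features.items() if value]
    print(series, present)
```

Each word in the vocabulary becomes one more Boolean input to the model, which is why the approach scales where simple poststratification over bins would not.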

Since the metadata features, such as place of publication, and title features are now both sorted by series group identifier, we can join their feature matrices.

For training the predictive model, we use the digitized set of newspapers. Since we do know the daily feature for the undigitized papers as well, we can use them to evaluate predictive accuracy.

We now train a logistic regression model to predict whether a paper is a daily, using cross-validation to set hyperparameters.

One application for regression models is to examine which features are most predictive of some phenomenon. While social science applications are often interested in causal inference, here we merely observe that papers with daily or evening in their title are more likely to be dailies. Keep in mind, however, that the weight of each individual feature is set given all the other features. That papers from Springfield, Ohio, get a high weight in the model, positively correlated with being a daily, does not mean that that city has more dailies overall; instead, it means that papers from Springfield, Ohio, are more likely to be dailies than we would expect given their other features.

In this chapter, however, we are primarily interested in using regression models to estimate properties of the population of all newspapers given the biased sample of digitized newspapers. The model’s predictions are not simply a binary value—daily or not—but a probability of being a daily. For papers in the digitized collection we used as our training set, we use the curated value from the library catalog, not the model’s prediction.

We compute the prevalence of daily papers in the entire USND as our target, “ground truth” value and label it catalog in the tables below. We will not delve into the assumptions of “ground truth” as a category in computational research—though many scholars have done so productively—but will simply note that the choices we describe above should indicate its complexity. The curated data of the USND, however flawed, does offer useful context for assessing our predictions. If we relied only on the proportion of daily papers in the digitized sample, this estimate, labeled digitized, would be too high. If we used the binary predictions from the regression model, our estimate would be too low, but closer to the population average. Averaging the model’s estimated probability of each paper being a daily gives the result closest to the population average. This probabilistic estimate is labeled model.

catalog	digitized	prediction	model
0.1491956	0.1880406	0.1338593	0.1569666

Differences between these estimates can be even greater when we look at subpopulations of newspapers, such as those published in individual states. The proportion of daily papers in Tennessee and Virginia, for example, is much higher in the digitized sample, but the regression model’s estimate is much closer to the average in the catalog.

state	catalog	digitized	prediction	model
Michigan	0.1377343	0.1428571	0.1238794	0.1535382
New York	0.1761749	0.1475410	0.1451366	0.1664219
Tennessee	0.1578947	0.2682927	0.1390268	0.1672177
Virginia	0.1689320	0.3052632	0.1223301	0.1472514

We can also redraw the graph from the previous section, showing that the model’s prediction for each state is generally closer to the catalog.

Newspapers and Political Partisanship

First, during the first half of the nineteenth century the tie between newspapers and political partisanship in the US solidified, codifying editors’ roles as local political advocates and agitators. In the United States, these political roles were filled locally. Most towns large enough for one newspaper in fact had (at least) two, one for each of the major political parties, and possibly others advocating for particular political goals, such as abolition. As Pettegree notes, from the 1830s US papers “mixed trenchant political commentary with reports of lurid crimes and local scandal” with “little pretense of objectivity” (Pettegree 2021, 634). For a vivid example of these political affiliations, we might look to Brownlow’s Knoxville Whig and Independent Journal, which was named—at various times and locations during its long history under one editor, Parson William Gannaway Brownlow—the Tennessee Whig, the Whig, the Jonesborough Whig, the Jonesborough Whig, and Independent Journal, the Knoxville Whig and Independent Journal, and Brownlow’s Knoxville Whig, and Rebel Ventilator. As noted in the National Digital Newspaper Program’s biography of Brownlow’s papers, “Brownlow’s speeches and publications drew both attention and anger” throughout his career, such that he was sued and shot multiple times during his life; was arrested and charged with high treason against the Confederacy; was provided by “the federal government…with a press, some type, $1,500, and a government printing contract” when he returned to Knoxville in 1863; and “was a force in the convention that abolished slavery in Tennessee and that led to the creation of a new state government,” to which he was elected as governor immediately after the war (University of Tennessee and National Endowment for the Humanities n.d.).

The motto on the masthead of Brownlow’s Knoxville Whig sums up this political fervor neatly, declaring the paper “Independent in All Things—Neutral in Nothing.” This was a relatively common motto for newspapers in the period, and its oft-repeated notion of “independent” is complicated by the fact that many papers that used this motto, Brownlow’s included, were directly propped up by political parties and interests.3 Editors’ political allegiances could be complex and shifting, and the volatile business of newspaper production saw papers founded, disbanded, sold, and merged in ways that affected their political voice. The Elk County Advocate (Ridgway, PA), for example, was founded as an independent paper in the 1850s but was by the mid-1860s clearly Democratic in its endorsements and advocacy. When the paper was sold to a Republican editor in 1868, the title stayed the same but the numbering started over, so the November 20, 1868 issue is identified as Volume 1, Number 1. The nearby Clearfield Republican (Clearfield, PA) “had nineteen owners, five titles, and…changed its politics four times” through the nineteenth and early twentieth century (Swoope 1911), and “Despite its name, the Republican for most of its life was an organ of the county’s Democratic (majority) party and was a robust example of Copperhead politics during the Civil War” (Libraries and National Endowment for the Humanities n.d.). Perhaps because of these complexities, the partisanship of the nineteenth-century press contributed substantially to its expansion, as political operatives, editors, and readers alike sought to ensure their views were represented to the public.

That the newspaper, as a medium, was partisan does not mean it was only, or even primarily, political. Among other insights, Viral Texts’ analyses demonstrate the radical generic diversity of most newspapers, even those fervently devoted to political advocacy. As Andie Tucher notes, “Many papers came (or were brought) to town expressly to counter the opposition parties or factions or splinters that were already there, but in the intervals between elections, even the most partisan papers often had a haphazard, milk-mild, ‘something-for-everyone’ air, closer in spirit to a magazine than a newspaper” (Tucher 2010, 396). The reality, we would suggest, is that separating newspapers from magazines on the basis of tone, content, or even format alone can be nearly impossible in this period, as the two genres overlapped and bled into each other.

As editors complained through the widespread reprinting of the listicle “Editing a Paper” (~158 reprints), trying to meet so many readerly needs through a single medium often proved impossible:

If it contains much political matter people won’t have it.

If the type is large it don’t contain much reading matter.

If we publish telegraph reports, folks say they are nothing but lies.

If we omit them they say we have no enterprise, or suppress them for political effect.

If we have a few jokes, folks say we are nothing but rattle heads.

If we omit jokes, folks say we are nothing but old fossils. […]

If we insert an article which pleases the ladies, the men become jealous, and vice versa.

Many newspapers, in other words, aimed to be simultaneously partisan and playful, educational and entertaining. So while newspapers typically advocated explicitly for particular political parties and causes, in ways that can shock modern readers expecting journalistic impartiality, this did not prevent them also publishing recipes, fiction, poetry, and any manner of other texts of more casual interest to their readers.

The Technologies of Newspaper Production

Regardless of readerly and political interests, the dramatic expansion of the newspaper during the nineteenth century could not have happened without concurrent technological changes that made it possible to produce such a large quantity of print at prices affordable to many readers. First among these was the development of reliable methods for making paper from wood pulp rather than rags. For most of print history, procuring sufficient paper constituted the highest expenditure for the printer on any given job, and thus largely determined the price for consumers. Rags of sufficient quality to make good paper were ever in short supply, and the search for a cheaper alternative was continuous throughout the print era. This need can be illustrated by the Mountaineer, from the new and relatively-isolated Salt Lake City. On 20 July 1861, the paper wrote,

We are pleased to learn from our friend, A. C. Pyper, that there is a good prospect for a fresh supply of paper. Much means and labor have been expended in endeavoring to manufacture and introduce into the market this much needed article. We are sorry to say that we will be compelled to sustend (sic) operations in our office for a week or two for want of the material. It is to be hoped, however, that before long we will renew our publication without interruption.

Despite their hopes, however, the Mountaineer never resumed operations. We cannot know for certain, of course, that lack of paper caused its ultimate demise, but this blurb demonstrates how central this resource was to newspaper production, and how valuable it was, as late as 1861, in areas without established paper manufacturing facilities. Similarly, the Opelousas Courier in Louisiana printed multilingual issues (in English and French) on wallpaper at times during 1863 and 1864, when broken supply chains during the Civil War made more conventional papers impossible to obtain.

Printers’ need for more and cheaper paper, in part to print more and larger newspapers, drove paper manufacturers towards industrialization. From the early decades of the nineteenth century, paper mills developed ever bigger and faster steam-driven machinery for creating paper from rags, and in larger rolls or sheets than had been possible when paper was made using hand-held forms. Through the same decades, nineteenth-century paper makers experimented with methods for converting plant fiber, such as wood pulp, into paper. These methods were not perfected until the 1860s, when wood-pulp paper helped further drive down the cost of newspaper production (Weeks 1969, 234–36; Pettegree 2021, 634). The introduction of cheaper resources, then, combined with new industrial methods for processing them, made a material substrate for the newspaper capable of supporting the medium’s rapid expansion throughout the nineteenth century.4

Industrialization transformed paper-making and then printing itself, as iron and then steam-powered rotary presses made it possible to print sheets at speeds unheard of during the hand-press period (which spanned from Gutenberg’s press, developed around 1439, until the mid-1840s). Inventors had experimented with rotary presses since the late eighteenth century, working to develop a press in which paper feeds into a cylindrical printing mechanism that rolls continuously, rather than a platen that must be pressed and lifted for each new impression. The first successful rotary press was patented by Richard March Hoe in New York in 1843, and quickly spread around the world. As described in R. Hoe & Company’s 1873 catalogue, the precise speed of printing depended on the number of cylinders in a given machine, from two up to ten. Their “Rotary Perfecting Newspaper Press” was “designed to work exclusively from stereotype plates” and could “print…both sides of the sheet at one operation.” This machine was designed specifically for working newspaper offices and could be ordered in a “Two-Feeder” model, which required two operators and could “print, per hour, from 4000 to 5000 perfected eight-page sheets of ordinary size, or will print and cut from 8000 to 10,000 perfected four-page sheets of ordinary size.” The “Four-Feeder Machine” could double that output, but required more operators (R. Hoe & Company 1873, 13). We should not imagine, however, that every small-town newspaper was printed on a rotary press, at least not immediately. The majority of US papers did not require the speed or scale of such machines, nor could they support the creation of the cylindrical stereotypes required to print on them. The introduction of rotary steam presses in large urban markets reshaped the overall press economy, making hand presses more readily available and affordable to printers in smaller markets. Taken together, these advances in paper production and printing aided the newspaper’s rapid proliferation during the middle decades of the nineteenth century.

Even with these technological advances, the newspaper was a massive commitment of time and labor in small print shops. Many small-town and rural newspapers were produced by a very small team comprising an editor (who might also assume some production duties) and a few compositors or press workers, who likely produced other jobs alongside, or even ahead of, the newspaper. As a popular selection from the 1870s claimed, “a newspaper the size of the [Poughkeepsie] Eagle,” a typical 4-page paper, used 600,000 pieces of type, “the actual number of bits of metal arranged and rearranged every day in preparing a newspaper.” Of course, a good deal of material stayed in type from day to day, including advertisements, mastheads, subscription notices, and editorial statements, so the 600,000 figure might be slightly exaggerated. The core sentiment of this selection, however, resonates: the paper was a huge and unending undertaking simply to compose. As with rotary presses, the invention of Linotype and other hot metal typesetting technologies in the late nineteenth century would, eventually, transform newspaper composition, but neither universally nor immediately.

As this book will show, a similar mix of older and newer technologies facilitated the sharing of texts across the newspaper network. We might imagine, for instance, that the invention of the electric telegraph would completely overhaul information sharing during the nineteenth century. The Associated Press, for example, was founded as a consortium of New York newspapers in 1846 and used telegraph lines to become the dominant conduit for news sharing in the United States, “quicken[ing] the pace of the news cycle while intervening in the already intricate geographies of newsprint” (Gitelman and Mullaney 2021, 177). As cities connect through the expansion of telegraph lines, we do find in the Viral Texts data that the exchange of some kinds of stories—i.e. “hard news”—shifts, among some partners—i.e. mostly urban editors—to the telegraph. However, due to the expense of telegraph transmission, the vast majority of newspaper texts continued to circulate primarily through the mail until century’s end, and even those texts that first circulated by telegraph were then picked up and transmitted through other means to newspapers that lacked access to telegraphic news. Through the end of the nineteenth century, in fact, the transmission of most newspaper texts—besides reports of things such as battles or elections—seems to remain primarily postal, rather than telegraphic, even as telegraph lines become a literal network across the US landscape. In other words, no single technology reinvented the newspaper, nor were any technological changes instant or equally distributed.

Newspaper Formats

One way of modeling these technological changes computationally is to focus on newspapers’ format: their size and the number of pages they included. As printing technology became more accessible and faster, for instance, we would expect to see a growth of bigger newspapers with more pages. We cannot rely on the catalog information in the USND to determine these properties, because it does not record page counts or sizes. That data alone, in other words, cannot help us examine changing newspaper formats and sizes or track how long older formats persisted while newer ones grew. Instead, we derive this data from collections of digitized newspapers. With a few exceptions, these collections indicate the pagination of their paper sources. Chronicling America also contains some information about the collation of different newspaper sections in each issue.

We start with a simple analysis of pages per issue, loading statistics about the numbers of issues in each year of each series with a particular number of pages.

	series	corpus	year	pp	sections	collation	count
1	/lccn/00221512	gale-us	1872	8	NA		3
2	/lccn/00225879	ca	1915	8	1	8	25
3	/lccn/00225879	ca	1915	16	1	16	1
4	/lccn/00225879	ca	1916	8	1	8	40
5	/lccn/00225879	ca	1916	16	1	16	13
6	/lccn/00225879	ca	1917	4	1	4	24

We then combine series into series groups and aggregate the number of issues with 4, 6, 8, etc., pages.
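The aggregation step can be sketched in Python using the six sample rows shown above. This sketch sums issue counts by year and page count; it omits the mapping from series to series groups, which is not shown in the sample, and the variable names are our own.

```python
from collections import defaultdict

# Rows from the sample table above: (series, corpus, year, pp, count).
rows = [
    ("/lccn/00221512", "gale-us", 1872, 8, 3),
    ("/lccn/00225879", "ca", 1915, 8, 25),
    ("/lccn/00225879", "ca", 1915, 16, 1),
    ("/lccn/00225879", "ca", 1916, 8, 40),
    ("/lccn/00225879", "ca", 1916, 16, 13),
    ("/lccn/00225879", "ca", 1917, 4, 24),
]

# Total issues per (year, page count), summed over series.
issues_by_year_pp = defaultdict(int)
for series, corpus, year, pp, count in rows:
    issues_by_year_pp[(year, pp)] += count

# Total issues per page count across all years.
issues_by_pp = defaultdict(int)
for (year, pp), count in issues_by_year_pp.items():
    issues_by_pp[pp] += count

for pp in sorted(issues_by_pp):
    print(f"{pp:2d} pages: {issues_by_pp[pp]} issues")
```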

Looking at discrete numbers of pages in U.S. papers, we see the dominance of the four-page format—usually produced as one sheet of paper, printed on both sides and folded in the middle—across the nineteenth century. The four-page paper was quick to produce and required no additional collation or folding, which led to its prominence in the period. As we discussed above, our sample is thin before 1840. We extend the time window until 1910 to see the point, shortly after 1900, when the eight-page format becomes the most common. One interesting exception to the dominance of four-page papers is the Civil War, when two-page papers seem to have been substituted for some four-page issues, which aligns with historical accounts of limited supplies—particularly paper—and labor.

Once issues run over sixteen pages, however, the number of distinct page counts increases. This may be related to more complex collations in multiple newspaper sections. Data on multiple sections in newspaper issues is only found in our data in Chronicling America. Even there, many of the multiple sections are extras appended to the main issues in the archive, rather than sections folded and distributed together. In any case, multiple sections almost never exceed 2% of all issues during the nineteenth century.

To simplify analysis of these different formats, we categorize issues into those printable on a single uncut sheet, i.e., one to four pages, and those requiring more sheets. We then compute the proportion of each series-year printed in the single- or multi-sheet format. For example, a weekly published 50 times a year, of which 45 were four pages and five tipped in one more double-sided sheet for a total of six pages, would thus have a single-sheet probability (singlep) of 0.9.
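The singlep calculation for one series-year can be written as a small Python function. The function name and its handling of empty series-years are our own choices for illustration; the four-page threshold and the worked example come from the paragraph above.

```python
def single_sheet_proportion(issue_counts_by_pages, max_single_sheet_pages=4):
    """Share of a series-year's issues that fit on one uncut sheet."""
    total = sum(issue_counts_by_pages.values())
    if total == 0:
        return None  # no observed issues for this series-year
    single = sum(
        count
        for pages, count in issue_counts_by_pages.items()
        if pages <= max_single_sheet_pages
    )
    return single / total


# The weekly from the text: 45 four-page issues and 5 six-page issues.
weekly = {4: 45, 6: 5}
singlep = single_sheet_proportion(weekly)
print(singlep)
```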

The probability that the editor will print a single-sheet issue varies substantially across newspaper collections. The U.S. papers in Chronicling America are printed almost exclusively on single sheets until 1850. The Gale U.S. collection only drops substantially below 100% single-sheet papers after 1860, but then drops much faster than Chronicling America, reflecting Gale’s focus on urban papers. Gale’s British collection contains less than 50% single-sheet papers after 1850, but the Dutch papers in DDD move away from the single-sheet format more gradually. The Internet Archive’s focus on magazines results in a much lower proportion of single-sheet issues.

This variation in different collections’ sampling strategies, as well as different countries’ publishing practices, makes us wary of reading this raw data too closely. For the rest of this section, therefore, we focus on the subset of papers from the U.S. for which the U.S. Newspaper Directory gives us some idea of the total population over time.

Although the probability that an editor in the U.S. would publish a single-sheet issue decreased after 1850, the absolute number of distinct newspaper series, as we saw above, was sharply increasing. A plot of the counts of single-sheet and multiple-sheet papers in the digitized sample thus shows a further rise in the absolute number of single-sheet papers after the Civil War and then a plateau until the century’s end. Multiple-sheet papers, meanwhile, grew much faster from a lower baseline and, in aggregate, overtook single-sheet papers around 1898. In other words, it seems, from our digitized sample, that single-sheet papers were not losing popularity so much as ceasing to grow. While this data-driven account aligns in some ways with standard historiography, which traces a movement toward longer and longer papers, it nuances that account by showing that the overall growth of the medium perhaps obscures the continued prevalence of the four-page paper through the century.

Note that this graph counts newspaper series, not newspaper issues, in the two formats, so daily papers do not count seven times as much as weeklies. In the next section, we will address the question of total newspaper output in pages and characters.

This trend was not evenly distributed, at least in our digitized data. Selecting some individual states and separating dailies from other papers, we see that, after 1850, more multiple-sheet than single-sheet papers from New York have been digitized. But is this just a bias in digitization (e.g., in the subset of U.S. papers from Gale) towards more urban papers? How can we estimate changing newspaper formats across all papers?

To estimate the prevalence of single-sheet papers in the whole USND catalog, we train a classifier to predict whether a given series in a given year will be printed on one or more sheets. To help the classifier learn differences in newspaper formats in cities with larger or smaller publishing industries, we count the number of papers and the number of dailies in each city in each year. These counts are then appended to the catalog record for each paper, whether it is digitized (with observed page counts) or not.
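A minimal Python sketch of this counting step follows. The catalog records are invented, and the field names city.papers and city.dailies mirror the predictor names used in this section; keying the counts on city, state, and year together is our assumption, made so that same-named cities in different states are not merged.

```python
from collections import Counter

# Invented catalog records, one per series per year of publication.
catalog = [
    {"series": "a", "city": "Springfield", "state": "Ohio", "year": 1870, "daily": True},
    {"series": "b", "city": "Springfield", "state": "Ohio", "year": 1870, "daily": False},
    {"series": "c", "city": "Ridgway", "state": "Pennsylvania", "year": 1870, "daily": False},
    {"series": "a", "city": "Springfield", "state": "Ohio", "year": 1871, "daily": True},
]

# Count all papers and all dailies in each city in each year.
papers_per_city_year = Counter()
dailies_per_city_year = Counter()
for rec in catalog:
    key = (rec["city"], rec["state"], rec["year"])
    papers_per_city_year[key] += 1
    if rec["daily"]:
        dailies_per_city_year[key] += 1

# Append the counts to every record, whether or not the paper is digitized.
for rec in catalog:
    key = (rec["city"], rec["state"], rec["year"])
    rec["city.papers"] = papers_per_city_year[key]
    rec["city.dailies"] = dailies_per_city_year[key]

for rec in catalog:
    print(rec["series"], rec["year"], rec["city.papers"], rec["city.dailies"])
```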

In addition to these city-level predictors (city.papers and city.dailies) and their logs, we add variables for frequency, for state and year of publication, and for whether a paper is published in a non-English language or on the USND’s list of African-American papers.
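The feature construction described above can be sketched in Python with the standard library alone. This is an illustration under stated assumptions, not the project's code: the record keys (`city`, `year`, `frequency`), the `log.city.*` names, and the use of log(1 + n) to keep the logarithm defined for cities without dailies are all hypothetical. Only `city.papers` and `city.dailies` are named in the text.

```python
import math
from collections import Counter

def add_city_features(records):
    """Append city-level counts (and their logs) to each catalog record."""
    # Count all papers, and separately all dailies, per (city, year).
    papers = Counter((r["city"], r["year"]) for r in records)
    dailies = Counter(
        (r["city"], r["year"]) for r in records if r["frequency"] == "daily"
    )
    enriched = []
    for r in records:
        key = (r["city"], r["year"])
        n_papers, n_dailies = papers[key], dailies[key]
        enriched.append({
            **r,
            "city.papers": n_papers,
            "city.dailies": n_dailies,
            # Assumption: log1p keeps the log defined when a count is zero.
            "log.city.papers": math.log1p(n_papers),
            "log.city.dailies": math.log1p(n_dailies),
        })
    return enriched

# Toy records for illustration only.
catalog = [
    {"title": "A", "city": "New York", "year": 1860, "frequency": "daily"},
    {"title": "B", "city": "New York", "year": 1860, "frequency": "weekly"},
    {"title": "C", "city": "Cadiz", "year": 1860, "frequency": "weekly"},
]
features = add_city_features(catalog)
```

Because the counts are attached to every catalog record, digitized or not, the same features are available both when fitting the model and when predicting for undigitized papers.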

As in the example above of predicting a paper’s frequency from words in its title and other features, we use the digitized subset of papers to estimate the model’s parameters. Unlike in the frequency-prediction example, we have a separate training or test data point for each year of each series, since we observe that papers may publish a different mix of page counts over their runs.

Although the primary purpose of this predictive model is to correct for sampling bias, we can still observe some patterns in model coefficients. The coefficients for the African-American and non-English features are positive, suggesting that these papers are more likely, all else being equal, to be published on single sheets. Not surprisingly, papers with a quarterly, monthly, semimonthly, or other less frequent publication schedule are less likely to be single-sheet issues, as suggested by the negative regression coefficients. Publication in New York is negatively correlated with single-sheet publication, while publication in Nevada is positively correlated.

We use the trained logistic regression model to estimate the probability that an issue from a given series in a given year will be published on a single sheet. For papers in the digitized sample, we use their observed proportions; we use the model’s predicted probabilities for the undigitized papers. These results are merged in the probability field.
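The merging rule can be sketched as follows. This is a minimal illustration, not the authors' implementation: the coefficient values, the `observed_single_sheet_share` key, and the function names are hypothetical, and only the `probability` field name comes from the text.

```python
import math

def logistic(z):
    """Map a linear score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_single_sheet(features, coefficients, intercept):
    """Logistic-regression prediction from a dict of feature values."""
    score = intercept + sum(
        coefficients.get(name, 0.0) * value for name, value in features.items()
    )
    return logistic(score)

def fill_probability(records, coefficients, intercept):
    """Use observed proportions where available, model predictions otherwise."""
    merged = []
    for r in records:
        observed = r.get("observed_single_sheet_share")  # hypothetical key
        if observed is None:
            p = predict_single_sheet(r["features"], coefficients, intercept)
        else:
            p = observed
        merged.append({**r, "probability": p})
    return merged

# Hypothetical coefficients, for illustration only.
coefficients = {"is_monthly": -2.0, "log.city.papers": -0.5}
records = [
    {"title": "digitized", "observed_single_sheet_share": 0.75, "features": {}},
    {"title": "undigitized", "features": {"is_monthly": 0, "log.city.papers": 0.0}},
]
merged = fill_probability(records, coefficients, intercept=0.0)
```

Keeping the observed proportions for digitized papers means the model only fills gaps in the catalog and never overrides what has actually been counted.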

We now plot the expected counts of single-sheet and multiple-sheet papers in the USND. Unlike in the graph of the digitized sample, here we see substantial growth in single-sheet papers after the dip during the Civil War and a later plateau. Again, this subtly counters received wisdom by showing that shorter papers did not decline in the latter half of the century, and in fact they grew. The perception of their decline is tied more to the exponential growth of the newspaper category, such that single-sheet papers’ prominence relative to other formats dipped, than it is to any raw decline in the single-sheet format’s prevalence.
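An expected count of this kind is a sum of probabilities: each series-year contributes its probability p of single-sheet publication to the single-sheet total and 1 - p to the multiple-sheet total. A short sketch, assuming each row is a hypothetical (year, probability) pair:

```python
from collections import defaultdict

def expected_counts(rows):
    """Sum single-sheet probabilities and their complements by year."""
    totals = defaultdict(lambda: {"single": 0.0, "multiple": 0.0})
    for year, p in rows:
        totals[year]["single"] += p
        totals[year]["multiple"] += 1.0 - p
    return dict(totals)

# Illustrative probabilities only.
rows = [(1860, 1.0), (1860, 0.5), (1861, 0.25)]
counts = expected_counts(rows)
```

In this toy input, the two papers from 1860 contribute an expected 1.5 single-sheet papers and 0.5 multiple-sheet papers.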

To compare more easily the digitized and catalog estimates, we divide the elements in these time series by the number of digitized and cataloged papers, respectively. As we observed in our earlier analysis, the digitized sample overrepresents papers from 1838 to 1870, and papers after 1870 are underrepresented. Since multiple-sheet papers became more common at the end of the nineteenth century, correcting for this sampling bias leads to a larger estimate of the growth of both formats.
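The normalization is a simple division, sketched here on the assumption that each series is divided by a single total for its collection. The text does not spell out whether the divisor is an overall or a per-year total, and the counts below are illustrative, not values from the USND or the digitized sample.

```python
def normalize(series, n_papers):
    """Divide each yearly count by the number of papers in the collection."""
    return {year: count / n_papers for year, count in series.items()}

# Illustrative counts only.
digitized_single = {1850: 50, 1890: 25}
catalog_single = {1850: 100, 1890: 300}

digitized_share = normalize(digitized_single, n_papers=200)
catalog_share = normalize(catalog_single, n_papers=800)
```

Once both series are expressed as shares, a year in which the digitized share exceeds the catalog share is one the sample overrepresents, and the reverse indicates underrepresentation.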

To make an example comparison, the model estimates that the peak of single-sheet issues is five years later than in the digitized sample.

Max. estimated single-sheet issues: 1890

Max. digitized single-sheet issues: 1885

The relative differences between the digitized sample and model estimate are often larger when examining subpopulations of the data, such as the Black press or sparsely represented states. For example, the model estimates that the digitized sample is undercounting continued growth in single-sheet African-American papers through the end of the century.

We can also compare the change away from single-sheet formats in different states. Here, the states with the lowest minimum single-sheet proportion are displayed first. The estimates from the full catalog are smoother than the averages from the digitized sample.

Reading Matter

The nineteenth century saw not only a rapid increase in the number of papers and a rise in the number of daily papers, but also changes in newspaper format. The velocity of information thus increased not only by more frequent and widely distributed publication but also because a single issue might contain more pages, more sections, and more words. As with other features of newspapers, these changes in newspaper layout were not evenly distributed.

We augment the dataset of page counts used in the previous section with information on the number of characters printed in each year of a series.

Since some papers only publish for part of a year, we normalize the rate of publication by the number of days over which a series’ issues appear. We then join this output data to the metadata from the USND.
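A sketch of this normalization for one series in one year. The inclusive day span between the first and last observed issue is an assumption about how "the number of days over which a series’ issues appear" is measured, and the function name and input layout are hypothetical.

```python
from datetime import date

def characters_per_day(issues):
    """issues: list of (date, n_characters) pairs for one series in one year."""
    dates = [d for d, _ in issues]
    # Assumption: count the span inclusively, so a single issue spans one day.
    span_days = (max(dates) - min(dates)).days + 1
    total_chars = sum(n for _, n in issues)
    return total_chars / span_days

# Illustrative weekly with three issues of 70,000 characters each.
weekly = [
    (date(1855, 1, 6), 70_000),
    (date(1855, 1, 13), 70_000),
    (date(1855, 1, 20), 70_000),
]
rate = characters_per_day(weekly)
```

Dividing by the observed span rather than by 365 keeps a paper that ran for only a few months from appearing less productive than it was while it lasted.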

Plotting data on all digitized periodicals—not just US newspapers—we can see a clear pattern where monthly and less frequent magazines have fewer characters per page. Weekly magazines and newspapers overlap somewhat, as shown by the wider distribution of characters per page for series with seven days between issues.

Looking at the data another way, weeklies also have a wider distribution of pages per issue: some weeklies have similar page counts to dailies; other weeklies have more—though not, in general, as many pages as monthlies.

In general, there is a clear (negative) correlation between the number of days between issues and the number of characters published per day. On both dimensions, these distributions have low overlap.

The number of characters published per day gives us an upper bound on the amount of editorial and compositorial labor needed to produce a newspaper. Without circulation information, we are not able to include paper costs, although changes in pages per issue can give us short-term estimates of it, nor are we able to estimate the total amount of text in circulation. Finally, working with a rate helps us adjust for newspapers that started or stopped publication in a given year.

Characters per day is, as we said, an upper bound on the amount of content needed to publish a paper. As we shall see, editors might lay down the quill and take up the scissors—using exchanges and other reprints to fill newspaper columns. For mastheads, boilerplate on subscription and advertising rates, and advertisements appearing in several running issues, compositors could leave matter in type from one issue to the next to reduce their labor. Later in the nineteenth century, they could join syndicates that sent them plates. In later chapters, we will thus be able to contextualize our analyses of reprinting against this background of editorial and compositorial labor.

We incorporate the same features we used to predict whether a newspaper would be printed on a single sheet.

As we saw above, different types of newspapers are clearly separable on the log scale, so we train a linear regression model to predict the log of the daily output of a given series in a given year.

Besides the intercept of the linear model, which captures the average log characters per day, we see, unsurprisingly, that dailies have relatively higher output and weeklies and less frequent papers have lower output when normalized by number of days. The Black and non-English press have relatively lower rates of output. Newspapers also tend to have higher output when published in the same city as many newspapers or many dailies.

To use the linear regression model to give us an aggregate picture of total newspaper output, we convert the model’s predictions out of log space and back into counts of characters per day.
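The round trip through log space can be illustrated with a one-predictor least-squares fit. This is a toy sketch, not the chapter's model: the single `is_daily` indicator stands in for the full feature set, and the output values are invented for illustration.

```python
import math

def fit_simple_ols(xs, ys):
    """Closed-form least squares for one predictor: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Illustrative values only: a daily/non-daily flag and characters per day.
is_daily = [0, 0, 1, 1]
chars_per_day = [10_000, 20_000, 100_000, 200_000]

# Fit on the log of daily output, as described in the text.
log_output = [math.log(c) for c in chars_per_day]
intercept, slope = fit_simple_ols(is_daily, log_output)

def predict_chars_per_day(daily_flag):
    """Convert the log-space prediction back into characters per day."""
    return math.exp(intercept + slope * daily_flag)
```

One caveat: exponentiating a predicted mean of logs yields a geometric rather than an arithmetic mean, which is never larger and is typically smaller when outputs vary. Whether the chapter's totals apply any correction for this is not stated in the text.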

As with other variables we have analyzed in this chapter, the model estimates higher relative output at the end of the nineteenth century than the totals from the digitized sample alone and lower relative output between 1840 and 1870.

We estimate that some states in particular had much more newspaper output towards the end of the nineteenth century than we see in the digitized sample: e.g., Indiana, Iowa, and (despite more robust digitization earlier in the century) Ohio and Pennsylvania.

Even more strikingly, a higher percentage of the newspaper output per day in the digitized sample comes from daily papers, while the output of weekly papers is significantly underrepresented in that sample compared to the model.

Looking at the same data another way, we see that the digitized sample attributes a greater rate of output to daily papers than to weekly papers for the entire nineteenth century. As shown by the dashed lines, however, the model estimates that the total rate of output for daily papers only surpasses that of weeklies in the 1890s.

Newspaper Exchange Networks

We cannot account for the proliferation of newspapers and newspaper content through the nineteenth century based only on political need or the capacity of printing technology. Neither of these helps us understand precisely which texts appeared in the newspaper or, entirely, how they found their way into print. Small newspapers—often staffed by just an editor and perhaps a few press workers—assembled dense, five- or six-column newspapers on a weekly or even daily basis. Such small teams could publish regularly because any given newspaper was only one node within a larger, networked “exchange system” through which newspaper material was shared in common, published and circulated for reprinting, commentary, and other forms of reuse elsewhere. Most individual papers, in fact, produced very little original content, but instead acted as aggregators, collecting interesting selections from their regional, political, or national exchanges and republishing them for local readers. Like modern blogs or news aggregators, such as Buzzfeed or Huffington Post, which use aggregation to produce regular content despite relatively small editorial staffs, nineteenth-century newspaper editors exploited systemic reprinting, which they called “selection,” to fill columns.

In exchanges, editors subscribed to each other’s publications and borrowed content promiscuously from them. Ellen Gruber Garvey describes this labor: “On large papers, a special ‘exchange editor’ went through other papers for material,” a process that helped local newspapers expand “by yoking together scattered producers who shared labor and resources by sending their products to one another for free use” (Garvey 2012, 30–31). While copyright law protected domestic books from reproduction, the content shared through the newspaper exchanges was not protected under intellectual property law. Instead, periodical texts were considered common property for reprinting, with or without modification, much as articles, music videos, and other content are shared online today among blogs and social media sites. Indeed, US postal laws explicitly supported these practices. From the late eighteenth century until the 1870s, editors paid no postage to mail papers to each other. The postal system was intended as an information network, and distributing newspapers cheaply helped advance an idea of democratic access to knowledge.[5] When texts were reprinted, they were sometimes marked as such, perhaps with a citation to the source paper, but just as often selections were republished without any attribution at all. When looking at a nineteenth-century newspaper page, a lack of citation on any particular article or squib certainly cannot be taken as evidence that the piece was original to that paper.

The data we have collected through the Viral Texts project can help clarify just how much of a typical nineteenth-century newspaper page was composed of reprints. Our database of reprint clusters not only indicates the texts that circulated widely, but can be used to estimate how much text, on average, was reprinted on the typical newspaper page. The graph below is based on the data from 1,949 distinct Chronicling America newspaper titles and their issues published between 1840 and 1899.[6] The text on each page was compared against identified reprints from our project to see what percentage of the total text in each issue seems to have been printed in other newspapers prior to appearing on that page. In other words, if a segment of text belongs to a reprinted cluster, and that witness is not the first observed witness, it is considered reprinted material. In addition to tallying external reprints, this same method compares segments of text on each page against previous issues of the same title so that we can identify how much material, on average, newspapers recycled as boilerplate, advertisements, editorial statements, and other text that would have been kept in standing type for multiple issues.
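The rule for counting a segment as reprinted can be sketched as below. The segment layout (`chars`, `cluster`, `witness_id`), the `first_witness` lookup, and the choice to weight by character count are assumptions for illustration, not the Viral Texts project's actual data model.

```python
def reprinted_share(page_segments, first_witness):
    """Fraction of a page's characters that appeared elsewhere first.

    page_segments: dicts with 'chars', 'cluster' (or None), and 'witness_id'.
    first_witness: maps a cluster id to the id of its earliest observed witness.
    """
    total = sum(s["chars"] for s in page_segments)
    if total == 0:
        return 0.0
    reprinted = sum(
        s["chars"]
        for s in page_segments
        # Reprinted: in a cluster, and not that cluster's first observed witness.
        if s["cluster"] is not None and first_witness[s["cluster"]] != s["witness_id"]
    )
    return reprinted / total

# Illustrative page: one unclustered segment, one reprint, one first printing.
first_witness = {"c1": "w1", "c2": "w9"}
page = [
    {"chars": 600, "cluster": None, "witness_id": "w5"},
    {"chars": 250, "cluster": "c1", "witness_id": "w2"},
    {"chars": 150, "cluster": "c2", "witness_id": "w9"},
]
share = reprinted_share(page, first_witness)
```

Because only observed witnesses can be compared, a measure like this is bounded by the digitized record: a text whose earlier printings were never digitized would be counted as original.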

Attitudes towards this ubiquitous scissors and paste journalism varied, with some commentators decrying editors’ lack of originality, or even theft, while others valued the range of information and entertainment selection practices brought to local or even isolated readers. Many editors insisted that their primary role was selection and that the primary value of their papers was the way they aggregated information from around the country. As the Fremont Journal (Ohio) claimed on December 29, 1854, “[d]eprive an editor of his exchanges, shut off his mails for a week, and you take away from him his very sustenance, and withdraw from him all that makes his paper interesting.” Newspapers, the Journal asserted, operate under a “general ‘reciprocity treaty’…so that each may assist the other in collecting intelligence and together circulate the vast amount of news, of politics and literature, that circulates through the thousand columns of the periodical press over the land.” Other commentators decried the lack of originality they saw evidenced by the exchange system.

The widely reprinted listicle “Editing a Paper” laid out the dilemmas facing editors seeking to please readers, including those related to selection:

If we publish original matters, they damn us for not giving selections. If we give selections, people say we are lazy for not writing more and giving them what they have not read in some other paper.

The first reprinting of “Editing a Paper” identified by the Viral Texts project appears in the Big Blue Union of Marysville, Kansas, but even here an editorial preface begs ignorance about the list’s origins, claiming to only know that it has been “going the rounds of the papers. If we knew in what paper it first appeared,” the editor continues, “it would afford us pleasure to give the writer due credit” (11 July 1863). This piece and its preface illustrate much about editors’ and, presumably, readers’ attitudes toward reprinting and the texts which circulated through the newspaper. While they express some concern about authorship and proper credit, circulation and selection ultimately matter more. Being able to credit the author would “afford us pleasure,” but not knowing the author or origin will not prevent an editor from republishing the piece.

“Within the last few days we have added to our Exchange list the following Periodicals,” the Edgefield Advertiser (16 September 1841) wrote to introduce a series of paragraphs outlining the virtues of four new additions to its exchanges. It was not unusual for papers to publish a brief account of why another publication had been added to, or in some cases removed from, their exchange lists, though this example is unusually thorough. The publication of this article signals that the Advertiser’s readers cared about their local paper’s sources and might even seek to personally subscribe. The Advertiser’s laudations can help clarify what editors valued in each other’s work, and in particular these paragraphs demonstrate that selection and editing were marked as skills equal to editorializing. Of the Southern Botanical Medical Journal (Forsyth, Georgia), the Advertiser writes “The editorial and selections, as far as we are able to judge from a hasty perusal, are good,” praising equally both the original and reprinted elements of the publication. Similarly, the Western Farmer (Detroit, Michigan) is praised for “its editorial and selections…from which our Agricultural friends might gain knowledge,” while the Southern Planter (Richmond, Virginia) is summed up as “well edited, and bids fair to be useful to Planters, Farmers, &c.” This article treats selection and editing not as secondary or subordinate to original writing, but as an equal guarantor of a publication’s quality.

While “Editing a Paper” sums up editing as a thankless task, other examples of meta-reflection signal that participants in the exchange system valued discerning eyes, deft scissors, and careful composition, and even pushed back against contemporary rhetoric that sought to promote hierarchies within the system. Consider this brief article about the New York Evening Mirror from the Cadiz Sentinel (19 February 1845):

There is no paper on our exchange list we read with more exquisite pleasure than the N. Y. Evening Mirror. Willis is one of the few original and intellectual writers of the age, and wields a bold and vigorous pen.

In many ways, this article’s laudatory opening seems to contradict what we have thus far written about originality and reprinting, praising the “original and intellectual” writer Nathaniel Parker Willis (the brother of Fanny Fern, whom we will discuss in Chapter 5), who was hired in 1844 by the Mirror’s editor George Pope Morris to help reimagine and revitalize that paper. We might contrast this description of Willis’ “bold and vigorous pen” with the Houma Ceres editor, described earlier in this chapter, who apologized for being “not…very distinguished as a ‘knight of the gray goose quill.’” Here an urban paper and literary writer seem to be held up for special esteem, distinguished in a way a rural newspaper editor cannot be. However, the Sentinel immediately pivots from its praise of Willis to describe the Mirror as “the representative of the ‘upper ten thousand,’” a phrase coined by Willis himself to describe the 10,000 wealthiest residents of New York City, and which came to be used as a rhetorical placeholder for the American gentry.[7] Because the Mirror represents the elite, the Sentinel allows that it “of course contains much reading that would be of little interest to the individual who penetrates the forest with his axe, his musket and his dog. But the class of readers for whom it is intended,” the Sentinel continues, “may certainly be proud of the brilliancy of their organ.” By identifying “the class of readers for whom” the Mirror is intended as only the “upper ten thousand,” the Sentinel implicitly claims Willis’ “bold and vigorous pen” has a limited reach. If it cannot address the individual in the forest with his axe, musket, and dog, then it cannot, perhaps, address most of the nation.

From here, the Sentinel notes,

“The Mirror of the 10th inst. we perceive has republished an article from the Sentinel on social parties, and introduces it with the following remarks:[8]
NOTIONS ON SOCIETY.—The Cadiz Sentinel is enlivened with a vigorous argument as to the character of the evening parties in the society of that Ohio village. The following is the deprecatory plea for gayety of the sinner-combattant. [sic]

After praising the “original and intellectual” qualities of writing in the Mirror, the Sentinel actually quotes only a single sentence of the Mirror’s original prose, and one that introduced a reprinted text. What prompts the Sentinel to comment on the Mirror, in other words, are the operations of the exchange network that brought their writing from Ohio to New York City readers. For all the literary ambitions of the Mirror and Nathaniel Parker Willis, this reprint evidences that print culture does not simply emanate from the metropolis to benighted rural villages, but that through the exchange system urban papers draw on rural writing as well. By explicitly describing the Mirror’s “class of readers” and then calling attention specifically to the way it introduced a reprint from the Sentinel, this article prompts its readers (and us) to consider the urban paper’s subtle condescension, which is belied by its very need for exchange material from papers like the Sentinel.

Approaching the newspaper computationally, focused on reprinting practices, Going the Rounds offers one answer to the “difficulties of tracing editorial practice” Jim Casey and Sarah H. Salter identify in a recent edition of American Periodicals. We concur that “Editors were not merely stand-ins for authors, printers, or publishers” and that understanding editorship requires “the analysis of historically situated editorial practices” (Casey and Salter 2020, 102). It is perhaps strange, then, that we approach editors’ work—and, we would argue, the entangled work of compositors and press workers—through the ahistorical phrase “virality.” As we argue in the next section, however, we employ that frame to destabilize common literary-historical tendencies—our own as well as other scholars’—to center authorship and originality, even when reckoning with a medium that deemphasized both qualities.

Notes

1. For more on the NDNP and funding, see Ryan Cordell, “‘Q i-Jtb the Raven’: Taking Dirty OCR Seriously,” Book History 20 (2017): 188–225.
2. This is a simplification by the USND: newspapers did change their frequency over time, so while catalogs often assign one frequency category to particular titles, the historical realities of production were often less regular. Ostensibly weekly papers would sometimes issue more frequently, such as during election season when they were actively stumping for their party’s candidates, while ostensibly daily papers would sometimes issue more sporadically, as happened to many papers during the Civil War when supplies ran low or armies interfered with routine. As long as a paper did not change its title, however, the USND catalog does not reflect changes in frequency, place of publication, or other features. In the next section, we will directly model changing properties of newspapers, such as their length in pages or words.
3. Brownlow’s Knoxville Whig continues after Brownlow himself leaves in 1866 to become governor, when his son takes over the paper. They sell the paper in 1869 to others who remove Brownlow from its name, but Brownlow returns to the paper in the mid-1870s, when it is called the Knoxville Whig and Chronicle. In other words, even a paper like this, so strongly tied to one person, persists after that person leaves it, albeit under a new name, and the paper retains “Whig” in its title long after the Whig party faded in the U.S. The actual politics of the paper continue in their fervency, backing new parties and politicians, even as the official name of the paper is highly personal and idiosyncratic.
4. For the most comprehensive account of paper production and use in nineteenth-century America, see the work of Jonathan Senchyne, in particular (2017) and (2019). We thank Professor Senchyne for his advice about this section of this book.
5. For more on the postal system as an information network, see Henkin (2006).
6. The Viral Texts project corpora do include some titles published earlier than 1840, but the coverage in the corpus becomes much more substantial around the middle of the century. This is due to the expansion of newspapers during the period, as we will describe below, but also because most mass digitization efforts started with papers from around 1900 and have gradually worked further into the past. So few newspapers from the 1800s through the 1830s have been digitized that, while we will sometimes reference particular texts reprinted during those earlier periods, it is difficult to meaningfully compare trends at scale between those decades and later ones. Thus this chart starts in 1840.
7. Thinking about historical parallels, we cannot help but link “the upper ten thousand” or “upper ten” from the nineteenth century with “the one percent” from the twenty-first. In both cases, the descriptive phrase about an economic subset of the population becomes a shorthand for critiques of wealth inequality and exploitation. That shorthand is doing significant work, we would argue, in the Sentinel’s sly critique of the Mirror in this selection.
8. The “inst.” in this article is a common newspaper abbreviation for “instant,” which designates the current month. Since this article was printed in the Cadiz Sentinel on February 19, 1845, we would assume the reprinted article appeared in the New York Evening Mirror on February 10, 1845.

References

Abebe, Rediet, Solon Barocas, Jon Kleinberg, Karen Levy, Manish Raghavan, and David G. Robinson. 2020. “Roles for Computing in Social Change.” arXiv:1912.04883 [Cs], January. https://doi.org/10.1145/3351095.3372871.
Algee-Hewitt, Mark, Sarah Allison, Marissa Gemma, Ryan Heuser, Franco Moretti, and Hannah Walser. 2016. Canon/Archive. Large-scale Dynamics in the Literary Field. Stanford Literary Lab, Pamphlet 11.
Casey, Jim, and Sarah H. Salter. 2020. “Challenges and Opportunities in Editorship Studies.” American Periodicals 30 (2): 5.
Cordell, Ryan. 2017. “‘Q i-Jtb the Raven’: Taking Dirty OCR Seriously.” Book History 20: 188–225. https://doi.org/10.1353/bh.2017.0006.
Fagan, Benjamin. 2016. “Chronicling White America.” American Periodicals: A Journal of History & Criticism 26 (1): 10–13.
Fyfe, Paul. 2016. “An Archaeology of Victorian Newspapers.” Victorian Periodicals Review 49 (4): 546–77. https://doi.org/10.1353/vpr.2016.0039.
Garvey, Ellen Gruber. 2012. Writing with Scissors: American Scrapbooks from the Civil War to the Harlem Renaissance. Oxford University Press.
Gitelman, Lisa, and Thomas S. Mullaney. 2021. “Nineteenth-Century Media Technologies.” In Information: A Historical Companion, edited by Ann Blair, Paul Duguid, Anja-Silvia Goeing, and Anthony Grafton. Princeton: Princeton University Press.
Hayles, N. Katherine. 2002. Writing Machines. Cambridge, Massachusetts: MIT Press.
Henkin, David M. 2006. The Postal Age: The Emergence of Modern Communications in Nineteenth-Century America. University of Chicago Press.
Lee, Benjamin. 2021. “Compounded Mediation: A Data Archaeology of the Newspaper Navigator Dataset.” Digital Humanities Quarterly 15 (4).
Libraries, Penn State University, and National Endowment for the Humanities. n.d. “Clearfield Republican. [Volume].” Accessed May 11, 2021.
Mak, Bonnie. 2014. “Archaeology of a Digitization.” Journal of the Association for Information Science and Technology 65 (8): 1515–26. https://doi.org/10.1002/asi.23061.
Milligan, Ian. 2013. “Illusionary Order: Online Databases, Optical Character Recognition, and Canadian History, 1997.” The Canadian Historical Review 94 (4): 540–69.
Nord, David Paul. 2006. Communities of Journalism: A History of American Newspapers and Their Readers. Urbana: University of Illinois Press.
Pettegree, Andrew. 2021. “Newspapers.” In Information: A Historical Companion, edited by Ann Blair, Paul Duguid, Anja-Silvia Goeing, and Anthony Grafton. Princeton: Princeton University Press.
R. Hoe & Company. 1873. [Catalogue of] Printing Machines . . . Hand-Presses, Self-Inking Machines, Etc. New York.
Schmidt, Benjamin M. 2016. “Do Digital Humanists Need to Understand Algorithms?” In Debates in the Digital Humanities.
Senchyne, Jonathan. 2017. “Paper Nationalism: Material Textuality and Communal Affiliation in Early America.” Book History 19 (1): 66–85.
Senchyne, Jonathan. 2019. The Intimacy of Paper in Early and Nineteenth-Century American Literature. Amherst: University of Massachusetts Press.
Swoope, Roland D. (Roland Davis). 1911. Twentieth Century History of Clearfield County, Pennsylvania, and Representative Citizens. Chicago, Ill., Richmond-Arnold publishing co.
Trott, Louisa. 2022. “Documenting Lost African-American Newspapers.” The National Endowment for the Humanities.
Tucher, Andie. 2010. “Newspapers and Periodicals.” In An Extensive Republic: Print, Culture, and Society in the New Nation, 1790–1840, edited by Robert A. Gross and Mary Kelley. UNC Press Books.
University of Tennessee, and National Endowment for the Humanities. n.d. “Brownlow’s Knoxville Whig. [Volume].” Accessed May 11, 2021.
Weeks, Lyman Horace. 1969. A History of Paper-Manufacturing in the United States, 1690–1916. [New York] B. Franklin.
