Bahl, Lalit R., Peter F. Brown, Peter V. de Souza, and Robert L. Mercer (1989). A tree-based statistical language model for natural language speech recognition. IEEE Transactions on Acoustics, Speech and Signal Processing, 37(7):1001–1008.
Belkin, Nicholas J. (1980). Anomalous states of knowledge as a basis for information retrieval. Canadian Journal of Information Science, 5:133–143.
Berti, Monica, Matteo Romanello, Alison Babeu, and Gregory Crane (2009). Collecting fragmentary authors in a digital library. In ACM+IEEE Joint Conf. on Digital Libraries (JCDL), pages 259–262.
Blaney, Jonathan, and Judith Siefring (2017). A culture of non-citation: Assessing the digital impact of British History Online and Early English Books Online Text Creation Partnership. Digital Humanities Quarterly, 11(1).
Bode, Katherine (2017). The equivalence of “close” and “distant” reading; or, toward a new object for data-rich literary history. Modern Language Quarterly, 78(1):77–106.
Bode, Katherine, and Carol Hetherington (2014). Retrieving a world of fiction: Building an index—and an archive—of serialized novels in Australian newspapers, 1850–1914. Script and Print, 38(3):197–211.
Bordalejo, Bárbara (2013). The texts we see and the works we imagine. Ecdotica, 10:64–76.
Breiman, Leo (2001). Statistical modeling: The two cultures. Statistical Science, 16(3):199–215, August.
Briet, Suzanne (1951). Qu’est-ce que la documentation? Éditions documentaires, industrielles et techniques.
Chomsky, Noam (1956). Three models for the description of language. IRE Transactions on Information Theory, 2(3):113–124.
Church, Kenneth Ward (1988). A stochastic parts program and noun phrase parser for unrestricted text. In Proceedings of the Second Conference on Applied Natural Language Processing, pages 136–143.
Cordell, Ryan (2013). “Taken Possession of”: The reprinting and reauthorship of Hawthorne’s “Celestial Railroad” in the antebellum religious press. Digital Humanities Quarterly, 7(1).
Cordell, Ryan (2017). “Q i-jtb the Raven”: Taking dirty OCR seriously. Book History, 20:188–225.
Cotterell, Ryan, Nanyun Peng, and Jason Eisner (2014). Stochastic contextual edit distance and probabilistic FSTs. In Association for Computational Linguistics (ACL), pages 625–630.
Crane, Gregory R., and Jeffrey A. Rydberg-Cox (2000). New technology and new roles: The need for “corpus editors”. In Digital Libraries, pages 252–253.
DeRose, Steven J. (1988). Grammatical category disambiguation by statistical optimization. Computational Linguistics, 14(1):31–39.
DeRose, Steven J., David G. Durand, Elli Mylonas, and Allen H. Renear (1990). What is text, really? Journal of Computing in Higher Education, 1(2):3–26.
Dong, Rui, and David A. Smith (2018). Multi-input attention for unsupervised OCR correction. In Association for Computational Linguistics (ACL).
Dreyer, Markus, and Jason Eisner (2009). Graphical models over multiple strings. In Empirical Methods in Natural Language Processing (EMNLP).
Drucker, Johanna (2014). Distributed and conditional documents: Conceptualizing bibliographical alterities. Materialidades da literatura/Materialities of Literature, 2(1):11–29.
Dué, Casey, and Mary Ebbott (2010). Iliad 10 and the Poetics of Ambush: A Multitext Edition with Essays and Commentary. Center for Hellenic Studies.
Elsayed, Tamer, Jimmy Lin, and Douglas W. Oard (2008). Pairwise document similarity in large collections with MapReduce. In ACL Short Papers, pages 265–268.
Fitzpatrick, Kathleen (2016). The future of academic style: Why citations still matter in the age of Google. Los Angeles Review of Books, March.
Gelman, Andrew, Xiao-Li Meng, and Hal Stern (1996). Posterior predictive assessment of model fitness via realized discrepancies. Statistica Sinica, 6(4):733–760.
Gibbs, Frederick W., and Daniel J. Cohen (2011). A conversation with data: Prospecting Victorian words and ideas. Victorian Studies, 54(1):69–77.
Gusfield, Dan (1997). Algorithms on Strings, Trees, and Sequences. Cambridge University Press.
Horton, Russell, Mark Olsen, and Glenn Roe (2010). Something borrowed: Sequence alignment and the identification of similar passages in large text collections. Digital Studies / Le champ numérique, 2(1).
Huston, Samuel, Alistair Moffat, and W. Bruce Croft (2011). Efficient indexing of repeated n-grams. In ACM Web Search and Data Mining Conf. (WSDM), pages 127–136.
Jones, A. H. M. (1964). The Later Roman Empire, 284–602: A Social, Economic, and Administrative Survey. Basil Blackwell.
Jurafsky, Dan, and James Martin (2018). Speech and Language Processing. Draft, 3rd edition.
Leskovec, Jure, Lars Backstrom, and Jon Kleinberg (2009). Meme-tracking and the dynamics of the news cycle. In ACM SIGKDD Conf. on Knowledge Discovery and Data Mining, pages 497–506.
Lord, Albert B. (1960). The Singer of Tales. Harvard University Press.
Lynch, Clifford (2017). Stewardship in the “age of algorithms”. First Monday, 22(12), 4 December.
Mak, Bonnie (2014). Archaeology of a digitization. Journal of the Association for Information Science and Technology, 65(8).
McGann, Jerome (2014). A New Republic of Letters: Memory and Scholarship in the Age of Digital Reproduction. Harvard University Press.
Mohri, Mehryar, Fernando C. N. Pereira, and Michael Riley (2002). Weighted finite-state transducers in speech recognition. Computer Speech and Language, 16(1):69–88.
Nelson, Laura K. (2017). Computational grounded theory. Sociological Methods & Research, 19(3).
Paul, Michael J., and Jason Eisner (2012). Implicitly intersecting weighted automata using dual decomposition. In North American Chapter of the Association for Computational Linguistics (NAACL).
Ponte, Jay M., and W. Bruce Croft (1998). A language modeling approach to information retrieval. In ACM SIGIR Int. Conf. on Research and Development in Information Retrieval, pages 275–281.
Putnam, Lara (2016). The transnational and the text-searchable: Digitized sources and the shadows they cast. The American Historical Review, 121(2):377–402.
Ramsay, Stephen (2011). Reading Machines. University of Illinois Press.
Renear, Allen, Elli Mylonas, and David Durand (1996). Refining our notion of what text really is: The problem of overlapping hierarchies. In Nancy Ide and Susan Hockey, editors, Research in Humanities Computing. Oxford University Press.
Samuels, Lisa, and Jerome McGann (1999). Deformance and interpretation. New Literary History, 30(1):25–56.
Schnober, Carsten, Steffen Eger, Erik-Lân Do Dinh, and Iryna Gurevych (2016). Still not there? Comparing traditional sequence-to-sequence models to encoder-decoder neural networks on monotone string translation tasks. In COLING.
Schwebel, Sara L. (2015). The Lone Woman and Last Indians Digital Archive.
Shannon, Claude E. (1951). Prediction and entropy of printed English. Bell System Technical Journal, 30(1):50–64.
Simon, Herbert A. (1996). The Sciences of the Artificial. MIT Press, 3rd edition.
Smith, David A., Ryan Cordell, and Elizabeth Maddock Dillon (2013). Infectious texts: Modeling text reuse in nineteenth-century newspapers. In IEEE Workshop on Big Data and the Humanities.
Smith, David A., Ryan Cordell, Elizabeth Maddock Dillon, Nick Stramp, and John Wilkerson (2014). Detecting and modeling local text reuse. In ACM+IEEE Joint Conf. on Digital Libraries (JCDL).
Spiro, Lisa, and Jane Segal (2010). Scholars’ use of archives in American literature. In Amy E. Earhart and Andrew Jewell, editors, The American Literature Scholar in the Digital Age, pages 101–122. U. Michigan Press.
Springer, Michelle, Beth Dulabahn, Phil Michel, Barbara Natanson, David Reser, David Woodward, and Helena Zinkham (2008). For the common good: The Library of Congress Flickr pilot project. Technical report, Library of Congress.
Suen, Caroline, Sandy Huang, Chantat Eksombatchai, Rok Sosič, and Jure Leskovec (2013). NIFTY: A system for large scale information flow tracking and clustering. In Int. World Wide Web Conf. (WWW), pages 1237–1248.
Tishby, Naftali, Fernando C. Pereira, and William Bialek (1999). The information bottleneck method. In Proceedings of the Allerton Conference on Communication, Control and Computing, pages 368–377.
Trettien, Whitney Anne (2013). A deep history of electronic textuality: The case of English Reprints Jhon Milton Areopagitica. Digital Humanities Quarterly, 7(1).
Underwood, Ted (2014a). Theorizing research practices we forgot to theorize twenty years ago. Representations, 127(1):64–72, August.
Underwood, Ted (2014b). Understanding genre in a collection of a million volumes. Technical report, University of Illinois, Urbana-Champaign.
van Zundert, Joris J., and Tara L. Andrews (2017). Qu’est-ce qu’un texte numérique?–A new rationale for the digital representation of text. Digital Scholarship in the Humanities, 32(Supplement 2):ii78–ii88.
Wilkerson, John, David A. Smith, and Nick Stramp (2015). Tracing the flow of policy ideas in legislatures: A text reuse approach. American Journal of Political Science.
Wolf-Sonkin, Lawrence, Jason Naradowsky, Sebastian J. Mielke, and Ryan Cotterell (2018). A structured variational autoencoder for contextual morphological inflection. In ACL.
Xu, Shaobin, and David A. Smith (2017). Retrieving and combining repeated passages to improve OCR. In ACM+IEEE Joint Conf. on Digital Libraries (JCDL).
Xu, Shaobin, and David A. Smith (2018). Contrastive training for models of information cascades. In Proceedings of the AAAI Conference on Artificial Intelligence.
Yalniz, Ismet Zeki, Ethem F. Can, and R. Manmatha (2011). Partial duplicate detection for large book collections. In ACM Int. Conf. on Information and Knowledge Management (CIKM), pages 469–474.
Zhai, Chengxiang, and John Lafferty (2001). A study of smoothing methods for language models applied to ad hoc information retrieval. In ACM SIGIR Int. Conf. on Research and Development in Information Retrieval, pages 334–342.