Chapter 1
Distant Editing
The Challenges of Computational Methods to the Theory and Practice of Textual Scholarship
Elena Pierazzo
Back in 2015, I published a couple of books about digital scholarly editing, in which I reflected long and hard on editing.1 Now, only a few years later, I have the impression that I wrote these books in a completely different environment. What has changed in the meantime? For a start, I did not mention artificial intelligence (AI), deep learning, or computer vision. At the time, discussions of digital scholarly editing centered on the representation of sources and how computers could help us better understand a variety of text-bearing objects. Computational approaches were scarce and, for the most part, marginal to the core activity of editing documents and texts. The main exceptions were automatic collation and phylogenetic methods for stemmatology, and even these struggled to be accepted by digital editors: no matter how revolutionary we thought them to be at the time, these approaches, most of us concluded, yielded relatively few benefits for the initiated and were challenging for the average researcher.2 In discussing these approaches, we mostly focused on the “black-box” problem, the need to respond to communities of practice and their expectations, and the need to propose tools and methods that are easily accessible, understandable, and verifiable by researchers in the humanities.
The use of automatic collation, in particular, was opposed because the hand transcription of all extant witnesses seemed an unnecessary addition to an already heavy list of editorial duties. At the time, the idea that we could transcribe manuscripts with optical character recognition (OCR) was almost sci-fi, and even though the Transkribus project started in 2013, its results were limited and kept within the inner circle of the project members. Phylogenetic approaches, on the other hand, were viewed with suspicion and stuck in a methodological loop: if we already know what the stemma looks like, how can we be sure that we are not influenced by what we know? And if we do not know what the stemma looks like, how can we trust algorithms and procedures that we cannot control?
The use of computational analysis was not unknown to digital humanities (DH) and particularly to literary scholars, of course: the work of Franco Moretti, Matthew Jockers, and Patrick Juola, to mention only a few, was already making front-page news. Distant reading and authorship attribution were producing very significant results, such as the unmasking of Robert Galbraith, which also opened up an allegedly new preoccupation for literary scholars: ethics. And yet these approaches seemed far removed from the digital editor’s preoccupations, particularly because they almost completely ignored (when they did not openly oppose) the very idea of adding markup to texts, which represents the methodological core of digital editing; I will return to this point later. For many of us, the main preoccupations of digital editors were (and still are) text representation and presentation, the workflow of a digital edition, the taxonomy and recognition of the phenomenology of the written page, objectivity and interpretation, and the sustainability of our editions. Nevertheless, thanks to some pioneering researchers, we are now starting to see the prolegomena of a computational revolution in editing.
I will start this analysis by tackling the so-called handwritten text recognition (HTR) revolution and, more generally, computer vision and deep-learning algorithms applied to images of ancient books. The development of the Transkribus platform and the first results produced by the Venice Time Machine, as well as the research done by Lambert Schomaker in artificial intelligence and pattern recognition, highlighted the enormous potential of computer vision applied to manuscripts;3 a pioneering work produced by Mike Kestemont and published in Speculum wowed first the medieval studies community, and then everybody else.4 In his work, Kestemont showed how a neural network could recognize and classify manuscripts by type of script and, within these scripts, group them by area of production. The impact of this research has been profound, and the community of medievalists has shown a new openness to this kind of work. Kestemont is now working on the manuscript production of Carthusian monks, showing how AI can “see” things that were overlooked by editors, such as a change of hands, thus unveiling writing practices that are likely to transform our understanding of how medieval scriptoria within monasteries worked.5 Scholars of the Middle Ages estimate that we know the content of only about 9 percent of all the medieval codices still scattered in the world’s libraries, and computational approaches promise to help us “read” them for the first time.6 Libraries are making millions and millions of high-quality images of manuscripts available online. To paraphrase Greg Crane: what shall we do with millions of manuscript images?7 This availability, the development of interoperable standards for harvesting data across countries and libraries (the development of the International Image Interoperability Framework [IIIF] has been enormously important), and the standardization of data and metadata thanks to the Text Encoding Initiative (TEI) and the FAIR (Findable, Accessible, Interoperable, Reusable) principles—all are contributing factors in the uptake of computational methods in manuscript studies. The transformative potential of this research can be seen in some examples based on medieval material, where these developments have been most impactful.
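To make the harvesting scenario concrete, here is a minimal sketch, in Python, of how page images might be gathered from a IIIF Presentation (version 2) manifest. The manifest URL and shelfmark are placeholders rather than references to any real library endpoint, and the code is an illustration of the general pattern, not part of any of the projects discussed above.

```python
# Minimal sketch: listing page images from a IIIF Presentation (v2) manifest.
# The manifest URL below is a hypothetical placeholder, not a real endpoint.
import json
import urllib.request

MANIFEST_URL = "https://example.org/iiif/ms-latin-1234/manifest.json"  # hypothetical

def list_canvas_images(manifest_url: str) -> list[tuple[str, str]]:
    """Return (canvas label, image URL) pairs from a IIIF v2 manifest."""
    with urllib.request.urlopen(manifest_url) as response:
        manifest = json.load(response)
    pairs = []
    for sequence in manifest.get("sequences", []):
        for canvas in sequence.get("canvases", []):
            label = canvas.get("label", "unlabelled")
            for image in canvas.get("images", []):
                url = image.get("resource", {}).get("@id")
                if url:
                    pairs.append((label, url))
    return pairs

if __name__ == "__main__":
    for label, url in list_canvas_images(MANIFEST_URL):
        print(f"{label}\t{url}")
```

Because every IIIF-compliant library exposes the same manifest structure, a script of this kind can be pointed at collections in different countries without modification, which is precisely what makes large-scale, cross-institutional harvesting feasible.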
The work of Dominique Stutzmann has been groundbreaking. Stutzmann was, in fact, the person who launched the competition/hackathon that prompted Kestemont’s research, and he has been one of the first paleographers to engage with computational methods for transcription.8 His most recent work focuses on quantitative codicology, and in particular on the use of AI and natural language processing for studying books of hours.9 These prayer books for the use of the laity were the “bestsellers” of the Middle Ages. Produced in large numbers throughout Western Europe from the thirteenth century onward, and more numerous than biblical manuscripts, they are a primary source for the history of religious sentiment and social practices, but also for the circulation and transmission of texts between regions and even for the processes of industrialization of book production, because they are largely standardized. Before the invention of printing in the West, they were the first so-called off-the-shelf books produced by booksellers outside of any religious order, shifting the book from an economy of demand to an economy of supply. These luxury books were often richly decorated and gave rise to the formalization of stable iconographic cycles, such as that of the Childhood of Christ.
Although they are known to all medievalists, and especially to art historians, books of hours are often overlooked or despised as standardized mass production: they are so abundant that nobody has ever dared to study them. From a textual perspective, they are impossible to manage: they are compilations of compilations, they were produced in series, and they transcend the very idea of contamination. Using a combination of traditional techniques and computational approaches, the first results are emerging, and they are mind-blowing: we can now see how distinctive groups of production, different offices, and different liturgical emphases serve different communities. Books of hours feature huge variation, for instance, in their handling of the psalms: some have one, some have 130; some psalms were very popular, others barely known.
Another strand of research using computational analysis is being produced at the École Nationale des Chartes in Paris. The team is led by Jean-Baptiste Camps and features other young scholars such as Ariane Pinche, Thibault Clérice, Elena Spadini, and Simon Gabay.10 They have successfully applied stylometry and automatic collation to “noisy” medieval data, research that earned them the Fortier Prize for the most innovative research at the DH conference in Utrecht in 2019.11 What does “noisy” mean here? It means, for instance, that in the tradition of the Chanson de Roland or of the Roman de la Rose we count twenty-seven different spellings of the word “horse,” or that in any given line of a medieval text we can see spelling variation mixed up with substantive variation, making the use of automatic collation almost impossible. Here AI approaches are combined with linguistic models, which in turn are based on databases of dictionaries. The work championed by this team shows that, contrary to some earlier criticisms, one can in fact usefully apply stylometric analysis to text marked up in TEI (or another markup system): the TEI annotation allows one, for instance, to normalize words, align versions, or establish equivalences, as sketched below. Camps is also responsible for elaborating the model of computational philology, bringing together in a coherent workflow all the “pieces” that we have seen so far: from digitizing and distributing images using IIIF, to transcribing them automatically, collating them, annotating them, re-collating them, proposing stemmas, and finally editing. At the end of his talks, he always asks the same question: are we really there yet? The answer is no, not quite, but we are getting there, one algorithm at a time.12
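To illustrate what “noisy” means in practice, here is a minimal sketch assuming an invented pair of Old French witnesses and a hypothetical normalization table of the kind a TEI annotation layer could supply; it uses only the Python standard library and is not the team’s actual pipeline.

```python
# Minimal sketch: why orthographic noise defeats naive collation, and how a
# normalization layer (e.g. one derived from TEI <choice>/<reg> annotations)
# helps. The Old French spellings and witnesses are invented for illustration.
from difflib import SequenceMatcher

# Hypothetical normalization table: variant spelling -> regularized form.
NORMALIZE = {"cheval": "cheval", "ceval": "cheval", "chival": "cheval",
             "bons": "bon", "buens": "bon"}

def tokens(witness: str, normalized: bool) -> list[str]:
    words = witness.lower().split()
    return [NORMALIZE.get(w, w) for w in words] if normalized else words

def collate(a: str, b: str, normalized: bool) -> list[tuple[str, str, str]]:
    """Align two witnesses word by word and report agreements and variants."""
    ta, tb = tokens(a, normalized), tokens(b, normalized)
    ops = SequenceMatcher(a=ta, b=tb).get_opcodes()
    return [(tag, " ".join(ta[i1:i2]), " ".join(tb[j1:j2]))
            for tag, i1, i2, j1, j2 in ops]

witness_A = "li bons ceval le porte"
witness_B = "li buens chival le porte"

print(collate(witness_A, witness_B, normalized=False))  # spelling noise shows up as variants
print(collate(witness_A, witness_B, normalized=True))   # only substantive readings remain
```

Run without normalization, every spelling difference surfaces as a variant; run with it, the two witnesses align perfectly, which is exactly why combining markup-based regularization with automatic collation is so attractive for noisy medieval traditions.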
Peter Stokes and his teams at the École Pratique des Hautes Études, in collaboration with the Institut national de recherche en informatique et en automatique (INRIA), are working toward the democratization of these tools: the development of eScriptorium,13 a free, open-source platform for the automatic transcription of manuscripts, is progressing, with efforts under way to allow TEI downloads of transcribed texts that can be published via TEI Publisher.14 These tools have been developed particularly to handle non-Latin and non-left-to-right scripts, but they have of course been used for all kinds of writing, including early modern and modern manuscripts.
With all this progress, what are the implications for textual scholarship, and more specifically, what will scholarly editing look like in ten years? More pointedly, will we be put out of the editing business by computers? Not yet, anyway. Of course, I do not have a crystal ball, but I can look at what happened a few years ago when computational methods arrived in the germane field of linguistics, and use that as a way to outline some scenarios. To do so, I will use the words of Karen Spärck Jones, one of the leading researchers responsible for that uptake. In an article from 2007, published posthumously by her friends and colleagues, she reflects on the fact that mainstream linguistics journals show no trace of computational linguistics topics, and vice versa, and then candidly asks, “Does it matter?”15 Reflecting on the early stages of the evolution of computational linguistics, she notes how, at first, the use of computational methods was hailed as an amazing innovation by many, but only a few lucky ones were able to use them, since computers were scarce and their use was limited. From compartmentalization came suspicion and reciprocal misunderstandings. She argues that
since then [the 1960s] there has been a divergence. On the computational side, . . . research continued and expanded in the 1970s without much input from mainstream linguistics. It had to model process. . . . Thus, by the 1980s it was already clear that computational linguistics and natural language processing were advancing without referring significantly to mainstream linguistics or being significantly inadequate thereby.
She then comes to a realization: “As this historical summary implies, computational linguistics does not need mainstream, non-computational linguistics, whether to supply intellectual credibility or to ensure progress. Computational linguistics is not just linguistics with some practically useful but theoretically irrelevant and obfuscating nerdie add-ons.” She concedes that this is a “comforting conclusion” if “perhaps more than a little arrogant.” In her analysis of the field, Spärck Jones states: “The growth of computational linguistics or, more specifically, natural language information processing is increasingly being done by people with a computational rather than linguistic background; machine learning work needs a mathematical, not a linguistic, training.” What Spärck Jones is clearly implying is not only that computational linguistics and mainstream linguistics have grown apart, but that there is basically no interchange between the two anymore. Computational linguistics is not linguistics done digitally, but the result of interdisciplinary research in which computer scientists are not at the service of linguists, but have “merged” with them or even taken their places. Finally, Spärck Jones concedes that “we should not forget that mainstream linguistics may have some things to offer us, even if not as many as linguists themselves may suppose.”
Let us stop here for a moment. I think that the risk of disenfranchisement suggested here is real. Let us go back a few years and remind ourselves of what happened when Google Books was the fancy new toy that everybody wanted to play with. The development of the Ngram Viewer, and particularly the closing keynote given at DH2011 in Stanford by Jean-Baptiste Michel and Erez Lieberman,16 was received very skeptically by our community, and rightly so: the Ngram Viewer is trivial and does nothing new, but those guys were able to publish their research results in all the right places (their core article was published on the cover of Science)17; they got all the media attention, including a TED talk,18 and we did not; they were seen and depicted as the ones who brought the humanities into the twenty-first century, and they were doing it under our noses. We may not crave that kind of attention, but there is a problem here.
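To give a sense of just how little machinery sits behind an n-gram viewer, here is a minimal sketch over an invented two-“year” toy corpus; it illustrates the kind of relative-frequency counting involved, under my own simplifying assumptions, and is in no way Google’s implementation.

```python
# Minimal sketch of the counting behind an n-gram viewer: relative frequency
# of a phrase per "year". The corpus below is invented for illustration.
from collections import Counter

def ngrams(text: str, n: int) -> list[tuple[str, ...]]:
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

# Hypothetical corpus: year -> concatenated text of all books from that year.
corpus = {
    1800: "the art of letter writing was the art of the age",
    1900: "the telephone replaced the art of letter writing",
}

def relative_frequency(phrase: str, year: int, n: int = 2) -> float:
    counts = Counter(ngrams(corpus[year], n))
    total = sum(counts.values())
    return counts[tuple(phrase.lower().split())] / total if total else 0.0

for year in corpus:
    print(year, relative_frequency("letter writing", year))
```

The point is not that such counting is useless, but that its conceptual simplicity stands in stark contrast to the attention it attracted.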
These algorithms and techniques are hard to master, and while the average researcher in the humanities can learn XML and TEI quite easily with a few hours of training, the same cannot be said here. As Spärck Jones said, “machine learning work needs a mathematical, not a linguistic [and editorial] training,” and these algorithms require a commitment and a drive that I can only try to imagine, or, better, a partnership and a division of labor in which a humanist teams up with a computer scientist. We are all for collaboration, but this could also be seen as an intellectual surrender, since it implies giving up control of one’s research and putting it straight into the black box. Once we give someone the power to decide what can and cannot be done, we give them power over our research, and we are back where we did not want to be when we chose to get our hands dirty with code. Of course, we can work with trusted colleagues if we are lucky enough to find them, but what if things go wrong? Who ends up “owning” the research?
Another factor to consider is that computational approaches work only with large datasets and therefore may not be applicable to most editorial projects, so we might think we are safe and that this revolution does not concern us. However, being left out, so to speak, of the latest cutting-edge developments may mean that we are no longer the coolest kids on the block and that research funding proves harder to get; the significance of our research and of our scholarship may become less certain. We know that digital editing has been used as a way for textual scholarship to regain some academic momentum after many decades during which editing was seen as not so cool, or as a mere “service” to other scholars. The risk now is that the significance and standing of our research might decline. And if the texts we care about are not amenable to computational approaches, they might end up not being edited at all.
I do not mean here to be alarmist about computational methods and editing. On the contrary: I think the potential of this research is inspiring, and as I said after hearing presentations by Kestemont, Stutzmann, and Camps in October 2021 at a conference in Nijmegen,19 my purpose in life may have been to attend that session! I can’t wait to see what distant editing will look like: maybe we will be able to discover when we started to hyphenate words, which kind of Dante was read in the eighteenth century, or which author most influenced the development of the Renaissance in Europe, or even be able to edit texts such as the Bible or Dante’s Commedia taking into account their entire tradition, and not just a handful of witnesses. In short, the digital editing community should embrace these methods and see what we can do with them. The greatest risk, in my opinion, is being left behind instead of being at the forefront, as we used to be. This “embracing” must take many forms, and at the least these:
- We need training in and at least a basic understanding of AI concepts, methods, and techniques.
- We need ideas, we need creativity: what can be done is what we will dream of doing, and then the technology will follow—we cannot be the ones to follow the technology.
- We need to care about ethics.
As mentioned before, the “unmasking” of Robert Galbraith, to which we can add that of Elena Ferrante, carried out by members of our community, has raised a series of questions aired in major newspapers. The discussion has also taken on a gendered flavor,20 with men seen as unable to bear the success of women writers and as treating their anonymity as a crime. But beyond these considerations, the question that our community has not yet asked enough is the following: does the fact that we can do something mean that we always should? Does the fact that we can tell who is behind the name Robert Galbraith mean that we are allowed to expose J. K. Rowling? Michele Cortelazzo, who led the team researching Ferrante, claims that using a pseudonym is an indirect call for publicity, and that he therefore felt completely free to undertake this research.21 Whether or not we agree with this statement, questions of ethics and deontology should be taken seriously, as they have been in other disciplines.
And here I am not talking only about authorship attribution. Methods like those discussed here are used worldwide for less than ethical purposes: computer vision is used for facial profiling, and textual analysis is used to influence elections and other social and antisocial behaviors. Being at the forefront of research can also mean using and taking advantage of research that perpetuates human inequality and discrimination. Methods and tools are not neutral; they have agency and implications. And since we recognize the fallacy of the slogan “guns don’t kill people, people kill people,” we cannot accept comparable arguments in connection with computational methodology. Ethics, deontology, self-limitation, self-regulation—these are things we associate perhaps with our work as teachers but not necessarily with our research. Since our authors are, for the most part, dead, we are less inclined to think that these things concern us.
So, we have some homework to do, I think. What will digital scholarly editing look like in ten years? Will editors do it, or will it be done by computer scientists and computer engineers? Will it still be called digital scholarly editing, or computational philology, or distant editing? It is too early to say, and maybe it will become something different altogether, but I think we will have to decide these things ourselves, or they will be decided in spite of us.
Notes
1. See Elena Pierazzo, Digital Scholarly Editing: Theories, Models and Methods (Aldershot, UK: Ashgate, 2015); Matthew James Driscoll and Elena Pierazzo, eds., Digital Scholarly Editing: Theories and Practices (Cambridge: Open Book Publishers, 2016), doi.org/10.11647/OBP.0095.
2. See Heather F. Windram, Prue Shaw, Peter M. W. Robinson, and Christopher J. Howe, “Dante’s Monarchia as a Test Case for the Use of Phylogenetic Methods in Stemmatic Analysis,” Literary and Linguistic Computing 23, no. 4 (2008): 443–63, doi.org/10.1093/llc/fqn023. See also Christopher J. Howe, Ruth Connolly, and Heather F. Windram, “Responding to Criticisms of Phylogenetic Methods in Stemmatology,” Studies in English Literature 1500–1900 52, no. 1 (2012): 51–67.
3. Mladen Popović, Maruf A. Dhali, and Lambert Schomaker, “Artificial Intelligence Based Writer Identification Generates New Evidence for the Unknown Scribes of the Dead Sea Scrolls Exemplified by the Great Isaiah Scroll (1QIsaa),” PLOS ONE 16, no. 4 (2021), doi.org/10.1371/journal.pone.0249769.
4. Mike Kestemont, Vincent Christlein, and Dominique Stutzmann, “Artificial Paleography: Computational Approaches to Identifying Script Types in Medieval Manuscripts,” Speculum 92 (2017): 86–109, doi.org/10.1086/694112.
5. Wouter Haverals and Mike Kestemont, “Silent Voices: A Digital Study of the Herne Charterhouse Scribal Community (ca. 1350–1400),” Queeste 27, no. 2 (2020): 186–95, doi.org/10.5117/QUE2020.2.006.HAVE.
6. Mike Kestemont, Folgert Karsdorp, Elisabeth de Bruijn, Matthew Driscoll, Katarzyna A. Kapitan, Pádraig Ó Macháin, Daniel Sawyer, Remco Sleiderink, and Anne Chao, “Forgotten Books: The Application of Unseen Species Models to the Survival of Culture,” Science 375, no. 6582 (2022): 765–69, doi.org/10.1126/science.abl7655.
7. Gregory Crane, “What Do You Do with a Million Books?,” D-Lib Magazine 12, no. 3 (2006), doi.org/10.1045/march2006-crane.
8. See Kestemont, Christlein, and Stutzmann, “Artificial Paleography.”
9. Amir Hazem, Béatrice Daille, Marie-Laurence Bonhomme, Martin Maarand, Mélodie Boillet, Christopher Kermorvant, and Dominique Stutzmann, “Books of Hours: The First Liturgical Corpus for Text Segmentation,” Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), 2020, 776–84, aclanthology.org/2020.lrec-1.97.pdf.
10. Jean-Baptiste Camps, “La Philologie Computationnelle à l’École des Chartes,” Bibliothèque de l’École des Chartes 176 (2021): 193–216.
11. Jean-Baptiste Camps, Thibault Clérice, and Ariane Pinche, “Noisy Medieval Data: From Digitized Manuscript to Stylometric Analysis, Evaluating Paul Meyer’s Hagiographic Hypothesis,” Digital Scholarship in the Humanities 36, Supplement 2 (2021): 49–71, doi.org/10.1093/llc/fqab033.
12. Camps, “La Philologie Computationnelle,” 193–216.
13. See Peter A. Stokes, Benjamin Kiessling, Daniel Stökl Ben Ezra, Robin Tissot, and Hassane Gargem, “The eScriptorium VRE for Manuscript Cultures,” Classics@ 18 (2021), https://classics-at.chs.harvard.edu/classics18-stokes-kiessling-stokl-ben-ezra-tissot-gargem/.
14. See “LECTAUREP—L’intelligence artificielle appliquée aux archives notariales,” lectaurep.hypotheses.org/.
15. Karen Spärck Jones, “Computational Linguistics: What About the Linguistics?,” Computational Linguistics 33, no. 3 (2007): 437–41, doi.org/10.1162/coli.2007.33.3.437.
16. See “Conference Video: Closing Keynote by JB Michel and Erez Lieberman,” Digital Humanities, June 19–22, 2011, dh2011.stanford.edu/?p=1385.
17. Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, et al., “Quantitative Analysis of Culture Using Millions of Digitized Books,” Science 331, no. 6014 (2011): 176–82, doi.org/10.1126/science.1199644.
18. See Jean-Baptiste Michel and Erez Lieberman Aiden, “What We Learned from 5 Million Books,” ted.com/talks/jean_baptiste_michel_erez_lieberman_aiden_what_we_learned_from_5_million_books.
19. The proceedings are being published now in a special issue of the Journal of Data Mining and Digital Humanities; see https://jdmdh.episciences.org/page/on-the-way-to-the-future-of-digital-manuscript-studies.
20. Patrick Juola, “The Rowling Case: A Proposed Standard Protocol for Authorship Questions,” Digital Scholarship in the Humanities 30, no. 1 (2015): 100–113.
21. Arjuna Tuzzi and Michele Cortelazzo, eds., Drawing Elena Ferrante’s Profile: Workshop Proceedings (Padova: Padova University Press, 2018).