“12. Corpus Expansions” in “Humanities in the Time of AI”
12. Corpus Expansions
The growing immensity of the digitized corpus is no excuse for ceasing to interpret: our libraries have been too vast to be thoroughly exploited by one individual for quite some time. On the contrary, computerized means are not a new standard we should measure our aims against. Digitization is a matter of convenience: we can easily travel with hundreds of downloaded books in our bag, we can instantly foray into a new direction through an internet search. Entire series of texts are now readily available in a way they previously were not. Fifteen years ago, for instance, working on nineteenth-century Haitian thought and literature absolutely required traveling to certain places. That a large part of this corpus can now be found online is allowing for an important reassessment of the writings of this era; one could deplore the loss of experience and reflection that will come from the ease of staying at home to read Demesvar Delorme, compared with enriching journeys and encounters; but the beneficial alteration of the modes of transmission, including the possibility for someone else to return to the sources and draw other conclusions, is undeniable. The same goes for other rare books, manuscripts, archives, paintings, sculptures, objects, movies, some of them being, in fact, never shown to patrons of libraries, museums, or cinemas. We know that developing a “taste for the archive” or perceiving the “aura” of an artwork or a literary manuscript when using electronic files is, at best, a daunting task.1 Even in the case of multiples, such as lithographs or print books, a contact with original physicality is sometimes revelatory. One could wonder if seeing Stanley Kubrick’s 2001: A Space Odyssey on a smartphone is the same as watching it on a big screen, although the information content is identical. This is incontrovertible.
Still, something from a movie is able to reach us, even in a degraded form, and it is also a scholarly duty to, at least partly, remedy these constraints. I may be inspired by an unrestored building, although it is present in an incomplete, or dilapidated, way. All the same, digitized archives might not respect the granularity of the original, while accompanying us in intellectual exploration. Positing that we must no longer encounter the dimensionality of Michelangelo’s sculptures since we have excellent reproductions is naturally absurd. But it is unclear that, with the exception of some specialists, seeing the papyrus with Sappho’s recently rediscovered poem to her brother Charaxus (and not its image and not its edition) would be in any way essential.2 In general, humanistic literacy has long relied on partial renditions and snapshots of works—think of the rhapsodic recitations of Homer or the printing of originally calligraphed poems in Chinese or Arabic.
Transcription (copying, editing, annotating, translating) is part of textual scholarship. The transition from the scriptorium to the press and from the press to electronic publication as dominant practices of transmission entailed changes in speed and modes of relations to verbal artifacts. As long as we do not nullify this heterogeneity or associate it with the march of progress, we should be fine with these shifts. Now, and tomorrow, computer-assisted manners of transcription are feasible. While AI techniques for character recognition might and will serve us well, for instance, it is unclear that other tasks would benefit as much. Think of probabilistic translations. Using them as a rough draft will be a perilous temptation because of cognitive path dependence. If I know both Italian and English, do I gain anything by asking DeepL to translate the first stanza of Petrarch’s Canzoniere? The machine is giving me:
Ye who hear in scattered rhymes the sound
Of those sighs whereat I nourished my heart
Upon my first youthful error
When I was in part another man from what I am.3
This version is so close to the English rendition by Robert Durling that I do not know at all if this miraculous transposition is anything but plagiarism on the basis of line recognition in a large corpus.4 Now, let us imagine this is not so and that both the individual and the AI translators independently chose corresponding words on a basis of frequency distribution. This convergence, if it is not an effect of data treachery, would illustrate the formal features of language (under what I call its “operativity”) and provide to our minds an English text that makes sense and is not a pointless distortion of the source.5 Some AI selections might prompt us to reflect on our decisions. For example, should we use ye to keep archaisms, thereby mirroring in English the historicity of Petrarch’s style vis-à-vis contemporary Italian? Or should a translator make ellipses more explicit, here adding a pronoun and giving “my heart” whereas Petrarch writes “the heart” (“il cuore”)? All such remarks could emanate from the work of human translators or during our own process of transposition. However, I am not convinced that the automated iteration given by DeepL would prepare us to decide whether, in the first line, we should use to hear for ascoltare rather than to listen and a simple present instead of the progressive form. Interestingly, and because AI is us, of the dozens of human-made English translations I once compiled with a collaborator, none is opting for the present progressive, and only one is choosing “to listen.” If Petrarch were using standard parlance, this would be understandable. But I propose that the liminal inscription of the performance of reading is precisely happening now and asking us to listen to the sound of worded sighs and not only to hear it from afar (“you who are listening . . . to the sound”). As for the epithet scattered, it is admissible, but what about disparate or dispersed for sparse? 
Are these terms made more, or less, convincing by the first impression their probabilistic transposition by DeepL could communicate? Is the computer facilitating the rhythmic rendition of the flowing fourth line (“quand’era in parte altr’uom da quel ch’i’ sono”) in submitting to us the syncopated “When I was in part another man from what I am”? Is this a path to a fruitful collaboration or a waste of time and energy? In any case, we are again dealing with the interpretive part of scholarly modes of translation that AI algorithms either eschew or seek to annul (through the implied affirmation that linguistic utterances are first and foremost probability based). Invoking “the untranslatable” might not be particularly relevant. First of all, the circulation of the term, under the influence of what has become in English the Dictionary of Untranslatables, originally edited by Barbara Cassin, has led to the unsound representation of specific concepts that would be too linguistically and culturally situated to be easily adapted across languages. But this is just one half of a truism, and such a semantic resistance does not prevent actual transcultural and cross-linguistic journeys (which was the sense of Derrida’s phrase “nothing is translatable; nothing is untranslatable”).6 Then, a translation is a transient reading of the untranslatable, which entails it should be justified, performed, unfolded, then done and undone again. The issue with machine translation is not that it cannot capture or echo the mythical essence of the untranslatable but that the built-in interpretive approach mainly accounts for a position (usually the dominant one) in a series.
As regards corpus expansion, in addition to the digitization of the extant and to the assistance in transcribing neglected or rare documents and works, AI is helping us recover some losses. Besides the computerized methods for deciphering diverse and complicated types of writing, as exemplified in systems now used in Assyriology, for instance, automated learning might throw a new light on ancient systems of notations that are not understood, or only partially.7 The widespread metaphor of “cracking the code” is inaccurate if the locution refers to the comprehension of a natural language (behind the modality of its symbolic notation). Some advance in the understanding of extinct idioms is conceivable, although, in the absence of a Rosetta stone, it will require a heavy engagement on the part of (human) scholars. If we set aside the promise of resurrecting the dead (languages), AI is currently being used to recover what was lost where the naked eye and the individual lifespan of researchers impede the exploration of the invisible. From Amazonia to Italy, archaeological sites of importance have been recently identified by working on aerial views of the earth.8 Thanks to computerized tools, ancient treatises have been unearthed in palimpsests or margins when magnifying glasses and solvents would be of little help.9 We are in an epistemic situation that is not without correspondence with the eras of the telescope and the microscope. This is not to say that Galileo or Louis Pasteur were simple observers or that we should be happy with rediscoveries and show eternal gratitude to the god AI.
This movement of expansion and recovery demands from us not only a commitment to interpretation (and, thus, selection, critique, alteration) but also a renewed dedication to the plurality of our objects, questions, and references. The globalized world the digital techniques of communication have implemented is marred by standardization and schematism. It is up to us to do more with knowledge accumulation than to produce extensive repertoires (descriptive mapping) or strengthen hyperspecialization (the wealth of accessible information on a topic being used as a pretext for more epistemic fragmentation). If I can exchange with another scholar living on another continent, if I can find all the occurrences of a word in a database, if I can have on my tablet the works of an author whose name I did not know an hour ago, if I may receive the transcription of a text nobody had established before, if I am reading a document undecipherable for millennia, if I am contemplating the historically fluctuating form of a city with a digital “time machine,” I must seize the benefit and, in my turn, expand my own work.10 Time, culture, language distances remain, as we abandon the globalized horizon of the homogeneous present. In this regard, the overall convergence we are witnessing may make it even more difficult to stay attuned to the intricacies of the heterogeneous and not only take it as a local flourish or impenetrable otherness. The challenge of the humanities, in the differential production that is theirs, is the redesign of expanded plurality through the advent of horizons of intelligibility that are more than global and of transcultural zones for exchanges with the previously unknown, far outside our world or its digitized representation on the planetary network.11
Notes
1. A more literal rendition of Arlette Farge’s original phrase on Le goût de l’archive (Paris: Seuil, 1989), translated into English by Thomas Scott-Railton as The Allure of the Archives (New Haven, Conn.: Yale University Press, 2013).
2. Dirk Obbink, “Two New Poems by Sappho,” Zeitschrift für Papyrologie und Epigraphik 189 (2014): 32–49.
3. July 2023 translation by DeepL of Petrarch’s first stanza of Canzoniere, I: “Voi ch’ascoltate in rime sparse il suono / Di quei sospiri ond’io nudriva ’l core, / In sul mio primo giovenile errore, / Quand’era in parte altr’uom da quel ch’i’ sono.” See complementary remarks in my Poetry and Mind: Tractatus Poetico-Philosophicus (New York: Fordham University Press, 2018), 74–75, insert 38.
4. Francesco Petrarca, Petrarch’s Lyric Poems: The “Rime Sparse” and Other Lyrics (Cambridge, Mass.: Harvard University Press, 1976), 36: “You who hear in scattered rhymes the sound of those sighs with which I nourished my heart during my first youthful error, when I was in part another man from what I am now.” Besides minor typographical variants, the only differences between the two translations are the words I just underlined.
5. Laurent Dubreuil, The Intellective Space: Thinking beyond Cognition (Minneapolis: University of Minnesota Press, 2015), § 27, 30.
6. Jacques Derrida, “What Is a ‘Relevant’ Translation?,” trans. Lawrence Venuti, Critical Inquiry 27, no. 2 (2001): 178.
7. This is currently the core work of the Digital Pasts Lab, directed by Shai Gordin, at Ariel University, in Israel (https://digitalpasts.github.io/). See Gai Gutherz et al., “Translating Akkadian into English with Neural Machine Translation,” PNAS Nexus 2, no. 5 (2023): https://doi.org/10.1093/pnasnexus/pgad096.
8. For an overview of the techniques involved in such efforts, see Luca Casini et al., “A Human–AI Collaboration Workflow for Archaeological Sites Detection,” Scientific Reports 13 (2023): https://doi.org/10.1038/s41598-023-36015-5.
9. Many research teams are at work in such areas. One could name, among others, the Lazarus Project at the University of Rochester (https://lazarusprojectimaging.com/). These initiatives are supported by governments and private entities and even include competitions such as the “Vesuvius Challenge,” focusing on the use of machine learning to decipher the charred scrolls from the Herculaneum library (https://scrollprize.org).
10. Originally proposed for the sole city of Venice, the “Time Machine” project now tends to extend to Europe (see https://timemachine.eu).
11. I defend these ideas in More than Global (Beijing: Commercial Press, forthcoming).