Open Greek and Latin Project
of the Open Philology Project
The ultimate goal is to represent every source text produced in Classical Greek or Latin from antiquity through the present, including texts preserved in the manuscript tradition as well as on inscriptions, papyri, ostraca and other written artifacts. Over the next five years, we will focus on converting as much as possible of the Greek and Latin available as scanned printed books into an open, dynamic corpus, continuously augmented and improved by a combination of automated processes and human contributions of many kinds. The focus on Greek and Latin reflects both the belief that we have an obligation to disseminate European cultural heritage and the observation that recent advances in OCR technology for Greek and Latin make these intertwined languages ready for large-scale work.
The Open Greek and Latin Project aims to provide at least one version of every Greek and Latin source produced during antiquity (through c. 600 CE) and a growing collection from the vast body of post-classical Greek and Latin that still survives. Perhaps 150 million words of Greek and Latin, preserved in manuscripts, on stone, on papyrus or on other writing surfaces, survive from antiquity. Analysis of 10,000 books in Latin, downloaded from Archive.org, identified more than 200 million words of post-classical Latin. With 70,000 public domain books listed in the Hathi Trust as being in Ancient Greek or Latin, the amount of Greek and Latin already available will almost certainly exceed 1 billion words.
Where existing corpora of Greek and Latin have generally included one edition of a work, the Open Greek and Latin Corpus is designed to manage multiple versions of a work and to represent its complete textual history: every manuscript, every papyrus fragment, and every printed edition is a version within the history of a text. In the short run, this involves using OCR technology optimized for Classical Greek and Latin to create an open corpus that is reasonably comprehensive for the c. 100 million words produced through c. 600 CE and that begins to make available the billions of words produced after 600 CE in Greek and Latin that survive.
The Open Greek and Latin Project assumes the following modules:
- Philological Workflow Module
- Distributed Review Module
- Philological Repository Module
- e-Portfolio Module
The Philological Workflow Module enables a digital representation of a written source, available in a 2D or 3D form, to be converted into machine-actionable text, corrected, and annotated with an increasing range of information: named entities; morphology, syntax, and other linguistic features; alignments between different versions of the same text, whether in the same language or translated across multiple languages; and text re-use detection, including quotation, paraphrase and citation. Automated methods include Optical Character Recognition (OCR), Text Alignment, Syntactic Parsing, etc. In each case, human annotation can augment automated annotations or substitute for them altogether where automated methods are not yet able to produce adequate initial results (e.g., manual transcription of inscriptions and medieval manuscripts).
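The workflow described above can be pictured as an ordered list of steps, each marked as automated or manual. The following is only a minimal sketch under stated assumptions: the `Source` structure, the step functions, and `run_workflow` are hypothetical names invented for illustration, and the two steps are stand-ins rather than real OCR or parsing.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Source:
    """A written source moving through the workflow (hypothetical structure)."""
    scan_id: str
    text: str = ""
    layers: Dict[str, list] = field(default_factory=dict)        # annotation layers
    history: List[Tuple[str, str]] = field(default_factory=list)  # (step, method)


def transcribe(source: Source) -> None:
    # Stand-in for OCR or manual transcription; a real step would read the scan.
    source.text = "arma virumque cano"


def tokenize(source: Source) -> None:
    # Stand-in for a linguistic annotation step.
    source.layers["tokens"] = source.text.split()


Step = Tuple[str, str, Callable[[Source], None]]


def run_workflow(source: Source, steps: List[Step]) -> Source:
    """Apply each (name, method, function) step in order and log how it was done."""
    for name, method, step in steps:
        step(source)
        source.history.append((name, method))
    return source


result = run_workflow(
    Source(scan_id="example-scan"),
    [("transcription", "manual", transcribe), ("tokens", "automated", tokenize)],
)
```

Recording the method next to each step keeps it visible whether a layer came from an automated process or from a human contributor, which later review can take into account.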
The Distributed Review Module provides a range of options for assessing and representing the reliability of the annotations produced, whether by automated systems or by human contributors, as part of the philological workflow. In many cases annotations can be released even when their reliability is not necessarily high (e.g., noisy OCR-generated text). The point is to identify the annotations that most require subsequent attention, whether manual correction or action of some other kind (e.g., poor OCR data may reflect the need to create a new scan of a printed book). The Distributed Review Module assumes that multiple annotations may be equally trustworthy (i.e., experts back different interpretations) and can track inter-annotator disagreement among experts. It provides default values but also allows different weights to be placed upon different validations (e.g., include all readings in a particular version of a text, whether these are the readings in a particular manuscript or the readings chosen and emendations proposed by a particular editor; or include all prosopographical identifications proposed by one particular scholar). The module should support searching by text characteristics (specific passages, authors), annotator characteristics (expert, novice, native language, etc.), and annotation characteristics (emendations, grammatical or interpretive comments, degree of inter-annotator disagreement, etc.). It should also permit browsing the history of annotation by passage, annotator, magnitude of disagreement, etc.
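As an illustration of tracking disagreement and weighting validations, here is a minimal sketch. The `Reading` record, the simple disagreement measure (share of annotations that differ from the most common value), and the per-annotator weights are all assumptions made for this example, not the project's actual metrics.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Reading:
    passage: str    # hypothetical passage identifier
    annotator: str  # hypothetical annotator identifier
    value: str      # the proposed reading or tag


def disagreement(readings: List[Reading]) -> float:
    """Share of readings that differ from the most common value (0.0 = full agreement)."""
    if not readings:
        return 0.0
    counts = Counter(r.value for r in readings)
    most_common_count = counts.most_common(1)[0][1]
    return 1 - most_common_count / len(readings)


def weighted_support(readings: List[Reading], weights: Dict[str, float],
                     default_weight: float = 1.0) -> Dict[str, float]:
    """Sum the weight behind each competing value; no value is discarded."""
    totals: Dict[str, float] = {}
    for r in readings:
        totals[r.value] = totals.get(r.value, 0.0) + weights.get(r.annotator, default_weight)
    return totals


readings = [
    Reading("example-passage", "editor_a", "arma"),
    Reading("example-passage", "editor_b", "arma"),
    Reading("example-passage", "editor_c", "arva"),
]
score = disagreement(readings)                               # one of three differs
support = weighted_support(readings, {"editor_c": 3.0})      # trust editor_c more
```

Returning the full table of weighted support, rather than a single winner, reflects the assumption in the text that several annotations may be equally trustworthy.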
The Philological Repository Module can preserve all published philological data, including the transcriptions and all subsequent annotations (e.g., identifying a transcribed word as being in Latin, a place name, in the accusative case, etc.) as well as the provenance of each annotation (e.g., the annotation is born-digital and was published by a particular individual at a given time, or the annotation was extracted from a print book by a particular author and published at a given time, with or without human verification, and with an estimated accuracy). The repository is based upon the Canonical Text Services/CITE Architecture for textual sources developed by researchers at the Center for Hellenic Studies within the larger framework developed by DataConservancy.org.
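The provenance fields listed above could be captured in a record along the following lines. This is a sketch only: the class names, field names, and allowed `origin` values are hypothetical and are not taken from the CTS/CITE specifications or from the project's own data model.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class Provenance:
    origin: str                                # "born-digital" or "print-extraction"
    published_by: str                          # individual or print author
    published_at: str                          # publication date, ISO 8601
    human_verified: bool = False
    estimated_accuracy: Optional[float] = None  # between 0.0 and 1.0 when known

    def __post_init__(self) -> None:
        if self.origin not in ("born-digital", "print-extraction"):
            raise ValueError(f"unknown origin: {self.origin!r}")
        if self.estimated_accuracy is not None and not 0.0 <= self.estimated_accuracy <= 1.0:
            raise ValueError("estimated_accuracy must lie between 0.0 and 1.0")


@dataclass(frozen=True)
class StoredAnnotation:
    target: str      # identifier of the annotated word or passage (hypothetical)
    kind: str        # e.g. "language", "named-entity", "morphology"
    value: str
    provenance: Provenance


record = StoredAnnotation(
    target="example-passage-id",
    kind="morphology",
    value="accusative",
    provenance=Provenance("print-extraction", "example-contributor",
                          "2015-03-04", human_verified=False,
                          estimated_accuracy=0.9),
)
as_dict = asdict(record)  # nested dict, ready to serialize
```

Keeping provenance as a separate immutable record means the same annotation value can be stored several times with different sources without the entries overwriting one another.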
The e-Portfolio Module aggregates and distributes particular subsets of user contributions for particular audiences. The e-Portfolio Module can identify any published contributions according to type, date, and author (e.g., all syntactic analyses published by a particular person during a particular time interval). The e-Portfolio Module can also make selected materials that are not yet published available to selected audiences (e.g., an editorial board or the admission committee for a degree program). The Perseids Project from Tufts University provides a starting point for this work.
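Selecting contributions by type, date, and author amounts to a simple filter. The sketch below assumes a hypothetical `Contribution` record in which an unpublished item has no publication date; it does not describe the Perseids data model.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Contribution:
    author: str
    kind: str                  # e.g. "syntax", "translation", "alignment"
    published: Optional[date]  # None means not yet published


def select_published(items: Iterable[Contribution], author: str, kind: str,
                     start: date, end: date) -> List[Contribution]:
    """Return published contributions of one kind by one author within [start, end]."""
    return [
        c for c in items
        if c.published is not None
        and c.author == author
        and c.kind == kind
        and start <= c.published <= end
    ]


portfolio = [
    Contribution("student_a", "syntax", date(2015, 2, 1)),
    Contribution("student_a", "syntax", None),              # unpublished draft
    Contribution("student_a", "translation", date(2015, 3, 1)),
    Contribution("student_b", "syntax", date(2015, 2, 15)),
]
selected = select_published(portfolio, "student_a", "syntax",
                            date(2015, 1, 1), date(2015, 12, 31))
```

Unpublished items are excluded here; sharing them with a selected audience would need a separate access rule, which this sketch does not attempt.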
Open Greek & Latin Texts
A collection of TEI XML versions of classical authors and works, freely available to download and reuse. Texts are published on GitHub on an ongoing basis. Watch this space for updates.
Corrected and Fully CTS-Compliant Texts
These repositories contain texts for which the OCR has been corrected and XML markup has been added to make the texts fully citable using the Canonical Text Services.
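Citation through the Canonical Text Services relies on CTS URNs, which take the general form `urn:cts:namespace:textgroup.work.version:passage`. The following is a simplified sketch of splitting such a URN into its parts. The function and class names are hypothetical, the example URN is used only for illustration, and the sketch ignores exemplar identifiers, subreferences, and passage ranges that a full implementation would need to handle.

```python
from typing import NamedTuple, Optional


class CtsUrn(NamedTuple):
    namespace: str
    textgroup: str
    work: Optional[str]
    version: Optional[str]
    passage: Optional[str]


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split a CTS URN into its colon- and dot-separated components (simplified)."""
    parts = urn.split(":")
    if len(parts) not in (4, 5) or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS URN: {urn!r}")
    namespace, work_component = parts[2], parts[3]
    passage = parts[4] if len(parts) == 5 and parts[4] else None
    hierarchy = work_component.split(".")
    if not namespace or not all(hierarchy) or len(hierarchy) > 3:
        raise ValueError(f"unsupported work component: {work_component!r}")
    textgroup = hierarchy[0]
    work = hierarchy[1] if len(hierarchy) > 1 else None
    version = hierarchy[2] if len(hierarchy) > 2 else None
    return CtsUrn(namespace, textgroup, work, version, passage)


# Illustrative URN: a specific version of a work, cited at book 1, line 1.
parsed = parse_cts_urn("urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:1.1")
```

Because the version is part of the identifier, two editions of the same work can be cited at the same passage without ambiguity, which fits the goal of managing multiple versions of a text.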
The Corpus Scriptorum Ecclesiasticorum Latinorum.
A collection of Greek works from Homer to 250 CE that do not already appear in the Perseus Digital Library.
The following collections of texts have undergone automatic OCR correction.
The works of Athenaeus of Naucratis, Greek rhetorician and grammarian.
A selection of Church Fathers.
English translations of classical works.
A collection of classical fragmentary authors and works.
Italian translations of classical works.
The Patrologia Latina.
The Open Greek and Latin Project Team
Gregory Ralph Crane
Gregory Crane Professor of Classics and Winnick Family Chair of Technology and Entrepreneurship Tufts University Alexander von Humboldt Professor of Digital Humanities Open Access Officer University of Leipzig April 28, 2015 Draft white paper available at http://goo.gl/V9Ddjq This paper describes two issues, the need for an independent Digital Humanities and the opportunity to rethink within a […]
Getting to open data for Classical Greek and Latin: breaking old habits and undoing the damage — a call for comment!
Gregory Crane Professor of Classics and Winnick Family Chair of Technology and Entrepreneurship Tufts University Alexander von Humboldt Professor of Digital Humanities Open Access Officer University of Leipzig March 4, 2015 Philologists must for at least two reasons open up the textual data upon which they base their work. First, researchers need to be able […]
Sunoikisis is a successful national consortium of Classics programs developed by Harvard’s Center for Hellenic Studies. The goal is to extend Sunoikisis to a global audience and contribute to it with an international consortium of Digital Classics programs (Sunoikisis DC). Sunoikisis DC is based at the Alexander von Humboldt Chair of Digital Humanities at the University […]