Open Greek and Latin Project
of the Open Philology Project
The ultimate goal is to represent every source text produced in Classical Greek or Latin from antiquity through the present, including texts preserved in manuscript tradition as well as on inscriptions, papyri, ostraca and other written artifacts. Over the course of the next five years, we will focus upon converting as much as possible of the Greek and Latin available as scanned printed books into an open, dynamic corpus, continuously augmented and improved by a combination of automated processes and human contributions of many kinds. The focus upon Greek and Latin reflects both the belief that we have an obligation to disseminate European cultural heritage and the observation that recent advances in OCR technology for Greek and Latin make these intertwined languages ready for large-scale work.
The Open Greek and Latin Project aims to provide at least one version of every Greek and Latin source produced during antiquity (through c. 600 CE) and a growing collection from the vast body of post-classical Greek and Latin that still survives. Perhaps 150 million words of Greek and Latin, preserved in manuscripts, on stone, on papyrus or on other writing surfaces, survive from antiquity. Analysis of 10,000 books in Latin, downloaded from Archive.org, identified more than 200 million words of post-classical Latin. With 70,000 public domain books listed in the Hathi Trust as being in Ancient Greek or Latin, the amount of Greek and Latin already available will almost certainly exceed 1 billion words.
Where existing corpora of Greek and Latin have generally included one edition of a work, the Open Greek and Latin Corpus is designed to manage multiple versions of, and to represent the complete textual history of, a work: every manuscript, every papyrus fragment, and every printed edition is a version within the history of a text. In the short run, this involves using OCR technology optimized for Classical Greek and Latin to create an open corpus that is reasonably comprehensive for the c. 100 million words produced through c. 600 CE and that begins to make available the billions of words produced after 600 CE in Greek and Latin that survive.
The Open Greek and Latin Project assumes the following modules:
- Philological Workflow Module
- Distributed Review Module
- Philological Repository Module
- e-Portfolio Module
The Philological Workflow Module enables a digital representation of a written source, available in a 2D or 3D form, to be converted into machine-actionable text, corrected, and annotated with an increasing range of information: named entities; morphology, syntax, and other linguistic features; alignments between different versions of the same text, whether in the same language or translated across multiple languages; and text re-use detection, including quotation, paraphrase and citation. Automated methods include Optical Character Recognition (OCR), Text Alignment, Syntactic Parsing, etc. In each case, human annotation can augment automated annotations or substitute for them altogether where automated methods are not yet able to produce adequate initial results (e.g., manual transcription of inscriptions and medieval manuscripts).
The Distributed Review Module provides a range of options by which to assess and represent the reliability of the data produced, whether by automated systems or by human contributors, as part of the philological workflow. In many cases annotations can be released even when their reliability is not necessarily high (e.g., noisy OCR-generated text). The point is to identify annotations that most require subsequent attention, whether manual correction or action of some other kind (e.g., poor OCR data may reflect the need to create a new scan of a printed book). The Distributed Review Module assumes that multiple annotations may be equally trustworthy (i.e., experts back different interpretations) and can track inter-annotator disagreement among experts. The Distributed Review Module provides default values but also allows for different weights to be placed upon different validations (e.g., include all readings in a particular version of a text, whether these are readings in a particular manuscript or the readings chosen and emendations proposed by a particular editor, or include all prosopographical identifications proposed by one particular scholar). The Distributed Review Module should support searching by text characteristics (specific passages, authors), annotator characteristics (expert, novice, native language, etc.), and annotation characteristics (emendations, grammatical or interpretive comments, degree of inter-annotator disagreement, etc.). It should also permit browsing the history of annotation by passage, annotator, magnitude of disagreement, etc.
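The two ideas in this paragraph, tracking inter-annotator disagreement and placing different weights on different validations, can be illustrated with a minimal Python sketch. It is not the project's implementation: the `Annotation` class, the `disagreement` and `weighted_choice` functions, the annotator names and the placeholder readings are all hypothetical, and the disagreement measure (one minus the share of the most common value) is only one simple choice among many.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One proposed reading or label for a passage (hypothetical structure)."""
    passage: str    # passage identifier, format assumed for illustration
    annotator: str  # who proposed the value
    value: str      # the proposed reading or label


def disagreement(annotations):
    """Return 1 - (share of the most common value); 0.0 means full agreement."""
    counts = defaultdict(int)
    for a in annotations:
        counts[a.value] += 1
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1.0 - max(counts.values()) / total


def weighted_choice(annotations, weights, default_weight=1.0):
    """Pick the value with the highest summed annotator weight.

    Annotators missing from `weights` get `default_weight`, which stands in
    for the module's default values.
    """
    scores = defaultdict(float)
    for a in annotations:
        scores[a.value] += weights.get(a.annotator, default_weight)
    # Sort by score (descending), then by value, so ties resolve deterministically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[0][0]


# Invented example data: three annotators, two competing readings.
readings = [
    Annotation("1.1", "editor_a", "reading A"),
    Annotation("1.1", "editor_b", "reading A"),
    Annotation("1.1", "editor_c", "reading B"),
]

print(disagreement(readings))                          # share-based disagreement
print(weighted_choice(readings, {}))                   # default weights
print(weighted_choice(readings, {"editor_c": 5.0}))    # favour one editor
```

With default weights the majority reading is selected; raising the weight of a single editor reproduces the "include all readings chosen by a particular editor" case described above.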
The Philological Repository Module can preserve all published philological data, including the transcriptions and all subsequent annotations (e.g., identifying a transcribed word as being in Latin, a place name, in the accusative case, etc.) as well as the provenance of each annotation (e.g., the annotation is born-digital and was published by a particular individual at a given time, or the annotation was extracted from a print book by a particular author and published at a given time, with or without human verification, and with an estimated accuracy). The repository is based upon the Canonical Text Services/CITE Architecture for textual sources developed by researchers at the Center for Hellenic Studies within the larger framework developed by DataConservancy.org.
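As a rough illustration of what storing an annotation together with its provenance could look like, here is a minimal Python sketch. The class names, field names, target identifier and all values are assumptions made for this example; they are not taken from the CITE Architecture or from the project's actual schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class Provenance:
    """Where an annotation came from (hypothetical fields)."""
    source: str                # "born-digital" or "extracted-from-print"
    published_by: str          # person or process that published it
    published_at: str          # ISO 8601 date string
    human_verified: bool       # whether a person has checked it
    estimated_accuracy: Optional[float] = None  # 0.0-1.0, if an estimate exists


@dataclass(frozen=True)
class StoredAnnotation:
    """An annotation plus its provenance, as it might be archived."""
    target: str                # identifier of the annotated word (format assumed)
    kind: str                  # e.g. "language", "named-entity", "case"
    value: str                 # e.g. "Latin", "place", "accusative"
    provenance: Provenance


# Invented record: a case annotation extracted from a print book, unverified.
record = StoredAnnotation(
    target="example-text:1.1@Romam",
    kind="case",
    value="accusative",
    provenance=Provenance(
        source="extracted-from-print",
        published_by="example-contributor",
        published_at="2015-01-01",
        human_verified=False,
        estimated_accuracy=0.9,
    ),
)

# asdict() converts nested dataclasses recursively, so the record serializes as-is.
serialized = json.dumps(asdict(record), ensure_ascii=False, sort_keys=True)
print(serialized)
```

Keeping provenance as a separate nested record means the same fields can describe a transcription, a morphological tag or a named-entity identification.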
The e-Portfolio Module aggregates and distributes particular subsets of user contributions for particular audiences. The e-Portfolio Module can identify any published contributions according to type, date, and author (e.g., all syntactic analyses published by a particular person during a particular time interval). The e-Portfolio Module can also make selected materials that are not yet published available to selected audiences (e.g., an editorial board or the admission committee for a degree program). The Perseids Project from Tufts University provides a starting point for this work.
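Selecting contributions by type, date and author amounts to a filter over contribution records. The Python sketch below shows one way this might look; the `Contribution` class, the `select` function and the sample data are hypothetical and do not reflect the Perseids data model.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Contribution:
    """A single user contribution (hypothetical structure)."""
    author: str
    kind: str            # e.g. "syntax", "translation"
    created_on: date
    published: bool = True


def select(contributions, *, author=None, kind=None, start=None, end=None,
           include_unpublished=False):
    """Return contributions matching every filter that is not None.

    Unpublished items are returned only when `include_unpublished` is True,
    which models a view prepared for a selected audience.
    """
    result = []
    for c in contributions:
        if not c.published and not include_unpublished:
            continue
        if author is not None and c.author != author:
            continue
        if kind is not None and c.kind != kind:
            continue
        if start is not None and c.created_on < start:
            continue
        if end is not None and c.created_on > end:
            continue
        result.append(c)
    return result


# Invented sample data.
items = [
    Contribution("contributor_a", "syntax", date(2014, 3, 1)),
    Contribution("contributor_a", "syntax", date(2015, 6, 1)),
    Contribution("contributor_a", "translation", date(2014, 5, 1)),
    Contribution("contributor_b", "syntax", date(2014, 4, 1)),
    Contribution("contributor_a", "syntax", date(2014, 7, 1), published=False),
]

# All published syntactic analyses by one person during 2014.
public_view = select(items, author="contributor_a", kind="syntax",
                     start=date(2014, 1, 1), end=date(2014, 12, 31))

# The same query for a selected audience that may also see unpublished work.
board_view = select(items, author="contributor_a", kind="syntax",
                    start=date(2014, 1, 1), end=date(2014, 12, 31),
                    include_unpublished=True)

print(len(public_view), len(board_view))
```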
Open Greek & Latin Texts
A collection of TEI XML versions of classical authors and works, freely available to download and reuse. Texts are published in GitHub on an ongoing basis. Watch this space for updates.
Corrected and Fully CTS-Compliant Texts
These repositories contain texts for which the OCR has been corrected and XML markup has been added to make the texts fully citable using the Canonical Text Services.
The Corpus Scriptorum Ecclesiasticorum Latinorum.
A collection of Greek works from Homer to 250 CE that do not already appear in the Perseus Digital Library.
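Citation through the Canonical Text Services relies on CTS URNs of the general form `urn:cts:namespace:textgroup.work.version:passage`. The Python sketch below splits such a URN into its components. The `CtsUrn` class and `parse_cts_urn` function are hypothetical helpers written for this illustration, the sample URN is used only as an example, and the parser deliberately ignores finer points of the syntax such as subreferences and percent-encoding.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CtsUrn:
    """Components of a CTS URN (simplified, hypothetical helper)."""
    namespace: str
    textgroup: str
    work: Optional[str]
    version: Optional[str]
    exemplar: Optional[str]
    passage: Optional[str]


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split a CTS URN into namespace, work hierarchy and passage reference."""
    parts = urn.split(":")
    if len(parts) not in (4, 5) or parts[:2] != ["urn", "cts"]:
        raise ValueError(f"not a CTS URN: {urn!r}")
    namespace, work_component = parts[2], parts[3]
    levels = work_component.split(".")
    if not namespace or len(levels) > 4 or not all(levels):
        raise ValueError(f"malformed CTS URN: {urn!r}")
    # Pad missing levels so textgroup-only or work-level URNs also parse.
    levels += [None] * (4 - len(levels))
    # An absent or empty fifth component means the URN cites the whole text.
    passage = parts[4] if len(parts) == 5 and parts[4] else None
    return CtsUrn(namespace, levels[0], levels[1], levels[2], levels[3], passage)


# Example URN, used here only for illustration.
urn = parse_cts_urn("urn:cts:greekLit:tlg0012.tlg001.perseus-grc1:1.1-1.10")
print(urn.textgroup, urn.work, urn.version, urn.passage)
```

Because the version is part of the identifier, the same passage reference can be resolved against several editions of one work, which fits the multi-version design described earlier.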
The following collections of texts have undergone automatic OCR correction.
The works of Athenaeus of Naucratis, Greek rhetorician and grammarian.
A selection of Church Fathers.
English translations of classical works.
A collection of classical fragmentary authors and works.
Italian translations of classical works.
The Patrologia Latina.
The Open Greek and Latin Project Team
Gregory Ralph Crane
Authored and posted by Greta Franzini. The Open Greek and Latin project has released a new version of the TEI XML versions of public domain volumes from the Corpus Scriptorum Ecclesiasticorum Latinorum (CSEL). The new versions include the following: The reconstructed texts are now within div tags that contain the subtype “work”. The goal is […]
Research post authored and posted by Emily Franzini. School and university curricula love Homer. This is a fact. You don’t need to be a student of Classics to know who Homer was and what he wrote. Even Hollywood is familiar with his Iliad and Odyssey. What we’re interested in finding out, however, is who else and what else […]
Authored and posted by Greta Franzini. The Open Greek and Latin Project has recently signed a contract with the Saxon State and University Library Dresden (SLUB) to scan books dating between 1922 and 1984*. In particular, we are digitising editions that are in the public domain under European law. The EU allows its member states to assert copyright protection […]