CSEL XML 2.0

Authored and posted by Greta Franzini.

The Open Greek and Latin project has released new TEI XML versions of public domain volumes from the Corpus Scriptorum Ecclesiasticorum Latinorum (CSEL).

The new versions include the following:

  • The reconstructed texts are now enclosed in div tags with the subtype “work”, so that they can be separated automatically from the introductions, textual notes, indices etc.
  • Where we have canonical identifiers for an individual work, we also include them in the n attribute of its div tag: e.g.,

XML_screenshot

  • The citations have been extracted and tagged as a step towards making these texts more deeply compatible with the Canonical Text Services Protocol Architecture. This involves choosing one citation scheme to provide the dominant hierarchy as div tags, with the other schemes recorded as milestone markers.
  • The current texts have been compared against new OCR runs conducted with ABBYY FineReader. The results were compared with what we received from the Data Entry Contractor, who was required to provide texts in which at least 99% of characters in the OCR output for the reconstructed texts were correct. (The introductions, notes, indices etc. were encoded in TEI XML, but their OCR-generated text was not corrected.) Many of the remaining errors are now marked with sic tags, and possible corrections from the alternate OCR are marked with corr tags. Some errors remain (particularly in small words), but this is a first step.
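As an illustration of how the subtype “work” markup allows the reconstructed texts to be separated from the surrounding material, here is a minimal Python sketch. It is not the project's own tooling. The sample document, the `type="edition"` attribute, the identifier `urn:cts:example:author.work1`, and the function names are all hypothetical; only the use of div tags with the subtype “work” and an optional n attribute comes from the description above. The code matches on local element names so that it works whether or not the files declare the TEI namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the released files may be structured differently.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <front>
      <div type="edition" subtype="introduction"><p>Praefatio editoris.</p></div>
    </front>
    <body>
      <div type="edition" subtype="work" n="urn:cts:example:author.work1">
        <p>Textus operis primi.</p>
      </div>
      <div type="edition" subtype="work">
        <p>Textus operis secundi.</p>
      </div>
    </body>
  </text>
</TEI>"""


def local_name(tag):
    """Return the element name without any '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]


def extract_works(xml_text):
    """Return (identifier, text) pairs for every div with subtype="work".

    The identifier is the value of the n attribute, or None when the
    div carries no canonical identifier.
    """
    root = ET.fromstring(xml_text)
    works = []
    for elem in root.iter():
        if local_name(elem.tag) == "div" and elem.get("subtype") == "work":
            # Collapse runs of whitespace left over from the XML layout.
            text = " ".join("".join(elem.itertext()).split())
            works.append((elem.get("n"), text))
    return works


if __name__ == "__main__":
    for identifier, text in extract_works(SAMPLE):
        print(identifier, "|", text)
```

To run this against a downloaded volume, read the file into a string and pass it to `extract_works`. Note that a work div nested inside another work div would be reported twice by this sketch.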

Before deciding whether to solicit corrections from the community or to pay a data entry firm to correct the identified errors, we would like to evaluate other methods of error detection and correction. We encourage members of the community to download these texts and to see what they can do to improve them. We will make further decisions about how to remove the remaining errors in January 2015.
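For anyone who wants a starting point for such experiments, the following Python sketch collects the marked errors and their proposed corrections so that recurring OCR confusions can be counted. It assumes that each sic and corr pair is wrapped in a TEI choice element, which is the usual TEI practice but is not confirmed for these files. The sample text and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample; assumes <choice> wraps each <sic>/<corr> pair.
SAMPLE = (
    '<div subtype="work"><p>In principio '
    "<choice><sic>crat</sic><corr>erat</corr></choice> uerbum, et uerbum erat "
    "<choice><sic>apnd</sic><corr>apud</corr></choice> deum.</p></div>"
)


def local_name(tag):
    """Return the element name without any '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]


def sic_corr_pairs(xml_text):
    """Return a list of (sic, corr) string pairs found in choice elements."""
    root = ET.fromstring(xml_text)
    pairs = []
    for elem in root.iter():
        if local_name(elem.tag) != "choice":
            continue
        sic = corr = None
        for child in elem:
            name = local_name(child.tag)
            if name == "sic":
                sic = "".join(child.itertext()).strip()
            elif name == "corr":
                corr = "".join(child.itertext()).strip()
        # Keep only complete pairs; a lone sic or corr is skipped.
        if sic is not None and corr is not None:
            pairs.append((sic, corr))
    return pairs


if __name__ == "__main__":
    # Count how often each correction is proposed across the text.
    for (sic, corr), count in Counter(sic_corr_pairs(SAMPLE)).most_common():
        print(f"{sic} -> {corr} ({count})")
```

A frequency list of this kind could help decide which corrections are safe to apply automatically and which need a human reader.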


1 Comment

  1. I would be very interested to know more about how you did this; in general, actually a discussion of best practices for OCRing Latin could be quite useful to others. (There are a couple of nineteenth-century Latin editions, for example, that I am planning to convert into EpiDoc.)
