Authored and posted by Greta Franzini.
The new versions include the following:
- The reconstructed texts are now within div tags that contain the subtype “work”. The goal is that the reconstructed texts can thus be automatically separated from the introductions, textual notes, indices, etc.
div tags containing individual works are marked and contain the subtype “work”. Where we have canonical identifiers, we also include those identifiers in the
- The citations have been extracted and tagged as a step towards making these texts more deeply compatible with the Canonical Text Services Protocol Architecture. This involves choosing one citation scheme to provide the dominant hierarchy as div tags, with other schemes as
- The current texts have been compared against new OCR runs conducted with ABBYY FineReader. The results were compared with what we received from the Data Entry Contractor. The Data Entry Contractor was required to provide texts where at least 99% of characters in the OCR output for the reconstructed texts were correct. (The introductions, notes, indices etc. received TEI XML markup, but their OCR-generated text was not corrected.) Many of the remaining errors are now marked with sic tags, and possible corrections from the alternate OCR are marked with corr tags. Some errors remain (particularly in small words), but this is a first step.
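The comparison described in the last item can be approximated with a word-level alignment of two transcriptions. The following is only a minimal sketch, not the procedure actually used for these texts: the function name `mark_disagreements` is hypothetical, and wrapping each sic/corr pair in a TEI `choice` element is an assumption about how the markup is arranged.

```python
import difflib
from xml.sax.saxutils import escape


def mark_disagreements(vendor_text: str, ocr_text: str) -> str:
    """Align two transcriptions word by word and mark substitutions.

    Hypothetical helper: words that differ between the two inputs are
    emitted as <choice><sic>...</sic><corr>...</corr></choice>, which is
    an assumed arrangement of the sic and corr tags.
    """
    vendor = vendor_text.split()
    ocr = ocr_text.split()
    matcher = difflib.SequenceMatcher(a=vendor, b=ocr, autojunk=False)

    out = []
    for op, a0, a1, b0, b1 in matcher.get_opcodes():
        if op == "replace":
            # The two sources disagree: keep the vendor reading as <sic>
            # and offer the alternate OCR reading as <corr>.
            sic = escape(" ".join(vendor[a0:a1]))
            corr = escape(" ".join(ocr[b0:b1]))
            out.append(f"<choice><sic>{sic}</sic><corr>{corr}</corr></choice>")
        else:
            # "equal" and "delete": keep the vendor words unchanged.
            # "insert": the vendor slice is empty, so OCR-only words are dropped.
            out.extend(escape(word) for word in vendor[a0:a1])
    return " ".join(out)


print(mark_disagreements("arma virumque cano", "arma uirumque cano"))
```

A word-level alignment like this only flags places where the two sources disagree. Where both contain the same error, nothing is marked, which is one reason some errors can survive such a pass.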
Before deciding whether to solicit corrections from the community or to pay a Data Entry firm to correct the identified errors, we would like to evaluate other methods of error detection and correction. We encourage members of the community to download these texts and to see what they can do to improve them. In January 2015 we will make further decisions about how to remove the remaining errors.
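As a possible starting point for working with the downloaded files, the sketch below pulls out the text of every div whose subtype is “work”, leaving introductions, notes, and indices aside. It is an illustration under stated assumptions: the function name `extract_works` and the sample document are hypothetical, and the real files may need different handling of namespaces or nested divs.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real files are assumed to be TEI XML in which
# reconstructed texts sit inside <div subtype="work"> elements.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<div><p>Editor's preface.</p></div>
<div subtype="work"><p>Reconstructed line one.</p><p>Line two.</p></div>
</body></text></TEI>"""


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def extract_works(xml_string: str) -> list[str]:
    """Return the plain text of each div marked with subtype="work"."""
    root = ET.fromstring(xml_string)
    works = []
    for element in root.iter():
        if local_name(element.tag) == "div" and element.get("subtype") == "work":
            # Joining text nodes with a space is a simplification: it can
            # split a word that is interrupted by inline markup.
            text = " ".join(" ".join(element.itertext()).split())
            works.append(text)
    return works


print(extract_works(SAMPLE))
```

If a work div contains another work div, this sketch returns the inner text twice (once on its own and once as part of its parent), so nested structures would need an extra check.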