Important New Developments in Arabographic Optical Character Recognition (OCR)

By: Maxim Romanov, Matthew Thomas Miller, Sarah Bowen Savant, and Benjamin Kiessling

The OpenITI team, building on the foundational open-source OCR work of Leipzig University’s (LU) Alexander von Humboldt Chair for Digital Humanities, has achieved Optical Character Recognition (OCR) accuracy rates for classical Arabic-script texts in the high nineties. These numbers are based on our tests of seven different Arabic-script texts of varying quality and typefaces, totaling over 7,000 lines (~400 pages, 87,000 words). These accuracy rates not only represent a distinct improvement over the actual accuracy rates of the various proprietary OCR options for classical Arabic-script texts but, equally important, they are produced using open-source OCR software called Kraken (developed by Benjamin Kiessling, LU), which enables us to make this Arabic-script OCR technology freely available to the broader Islamic, Persian, and Arabic Studies communities in the near future. Unlike more traditional OCR approaches, Kraken relies on a neural network, which mimics the way we learn, to recognize letters in the images of entire lines of text without first trying to segment lines into words and then words into letters. This segmentation step, part of the mainstream OCR approach that persistently fails on connected scripts, is thus completely removed from the process, making Kraken uniquely powerful for dealing with a diverse variety of ligatures in connected Arabic script (see section 3.1 for more technical details). In the process we also generated over 7,000 lines of “gold standard” (double-checked) data that can be used by others for Arabic-script OCR training and testing purposes. For a full report see our working paper on...
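
To make the idea of a line-level accuracy rate concrete, here is a minimal sketch of how a character accuracy figure could be computed by comparing OCR output against gold-standard lines. It assumes accuracy is defined as one minus the total character edit distance divided by the total number of gold-standard characters; the excerpt does not state which formula was used, and the function names `levenshtein` and `character_accuracy` are hypothetical helpers, not part of Kraken.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Return the character-level edit distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_accuracy(gold_lines: Sequence[str], ocr_lines: Sequence[str]) -> float:
    """Assumed definition: 1 - (total edit distance / total gold characters)."""
    if len(gold_lines) != len(ocr_lines):
        raise ValueError("gold and OCR line lists must have the same length")
    total_chars = sum(len(g) for g in gold_lines)
    if total_chars == 0:
        raise ValueError("gold standard contains no characters")
    errors = sum(levenshtein(g, o) for g, o in zip(gold_lines, ocr_lines))
    return 1.0 - errors / total_chars


if __name__ == "__main__":
    # Illustrative strings only, not data from the tests described above.
    gold = ["abcd", "efgh"]
    ocr = ["abcd", "efgx"]
    print(character_accuracy(gold, ocr))
```

Because the comparison is made line by line, a sketch like this fits the workflow described above, in which whole lines of text are paired with double-checked transcriptions.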

Digital Hill Project

The Digital Hill Project by Marcel Mernitz

Reference: M. Mernitz. “The Digital Hill Project: Sources on the Revolt of Samos”. Digital Classics Online Bd. 2,3 (2016)

This is a quick overview of the Digital Hill project, which is part of the Open Greek and Latin project at the Humboldt Chair of Digital Humanities at the University of Leipzig. When I started working on the project, the first step was to create a spreadsheet that gathered all sources mentioned by G.F. Hill (Sources for Greek History between the Persian and Peloponnesian Wars, Oxford 1897) in his third chapter about the “Revolt of Samos”. The spreadsheet contains further information about each source, e.g. whether an XML file already exists in one of our repositories, together with a link to it or to the newly created XML file. Furthermore, any text left out by Hill has been stored in a separate column, and the spreadsheet provides links to the treebanking and text alignment files I created for the project. The spreadsheet can be accessed at https://goo.gl/zEcevt and a legend in column M explains the coloured cells. As part of the project we have created a new repository on GitHub where all the XML and EpiDoc files of the project are stored. The GitHub repo also contains the treebank and text alignment data and the data for the web page. The repository is at https://github.com/DigitalHill and the web page is accessible online at http://digitalhill.github.io/ where the results can be found under “Chapter III” –> “Revolt of Samos”. There are two subchapters...

Topic Modelling of Historical Languages in R

By Thomas Köntges

This is a quick note and introduction to topic-modelling historical languages in R and is intended to supplement three publications forthcoming in 2016: one for the AMPHORAE issue of the Melbourne Historical Journal; one for Alexandria: The Journal of National and International Library and Information Issues (currently under review); and one for DH2016. This blog entry also summarises some points I have made in several talks in the past few months about topic-modelling historical languages (including in my talk at the Analyzing Text Reuse at Scale / Working with Big Humanities Data workshop during the DH Leipzig Workshop Week in December 2015). This blog entry is therefore intended as a short summary of some of the more important points previously made and, in contrast to the specific applications covered in each of the articles, it provides an overview of the subject. My work on topic-modelling did not start in Leipzig; rather, it was part of a project I worked on during my time as a research associate at the Victoria University of Wellington (VUW), New Zealand, in 2013: the Digital Colenso Project. Back then I thought that there was only one ideal number of topics for each corpus and I used Martin Ponweiser’s harmonic-mean method (see chapters 3.7.1 and 4.3.3 in his master’s thesis) to attempt to determine this ideal. Although this approach was useful, albeit slow, for the Digital Colenso Project, I now think that this assumption was wrong, because the ideal topic granularity depends more on the research question and use-case of the application of topic modelling to a certain corpus than on...

Research Data, the Humanities, and the First Four Centuries of Print

Gregory Crane (gcrane2008@gmail.com) (Alexander von Humboldt Professor of Digital Humanities at Universität Leipzig & Professor of Classics and Winnick Family Chair of Technology and Entrepreneurship at Tufts University)

November 2014

I am writing about the critical importance of research data as a topic for humanists: we cannot flourish in a digital age unless we are able to understand and to manage the data that we need for our research, our teaching and our overall contributions to the intellectual life of society as a whole. My ultimate goal is to analyze, as precisely as I can, what infrastructure has been developed in Europe and North America, especially from the large European projects Clarin.eu and Dariah.eu, upon which humanists can actually build. When projects set out to produce infrastructure, it can be difficult to distinguish the language of the proposed infrastructure from the infrastructure that has actually been produced. The use case for this exploration will be the challenge of moving not only the Perseus Digital Library but also more than a dozen other established projects on Greco-Roman culture, from both Europe and North America, into a shared, computational space that can support hundreds of thousands of users and analysis of Greco-Roman cultural influence in millions of digitized sources. I have chosen, however, to publish this essay first because, before getting into the details of particular infrastructure projects on both sides of the Atlantic, I would like to consider the potential benefits that the transnational Research Data Alliance (RDA) offers humanists and to suggest a concrete,...

So you want to become a professor of Greek and/or Latin? Think hard about a PhD in Digital Humanities

Gregory Crane (Alexander von Humboldt Professor of Digital Humanities at Universität Leipzig & Professor of Classics and Winnick Family Chair of Technology and Entrepreneurship at Tufts University)

Leipzig, November 20, 2014

I decided to write this piece because this is the time of year when those who wish to become professional students of Greek and Latin are deciding where they should apply for graduate school. I am now starting to see that the most interesting PhD projects on Greek and Latin are taking place in PhD programs for the Digital Humanities, and I think that anyone who wishes to develop a career of sustained satisfaction needs to think carefully about how they move forward. At the present time, I am not aware of any traditional program in Greek and Latin that prepares students for satisfying and sustainable careers. This essay falls into three parts. First I suggest some words of caution, including the well-known challenges of actually landing a permanent faculty position, the amount of work that you will need to commit if you want to maximize your chances for success and then, more substantively, something about the actual work that supports Greek and Latin faculty positions in the United States and (much of) Europe. The second section briefly touches upon some fundamental topics that we must resolve if we are to rethink the study of Greek and Latin (as I think we must if we are to survive, or perhaps even flourish): the information that we produce, the...

Creative Commons translations from Greek and Latin into modern languages?

Posted by Greta Franzini (not authored).

The Open Philology project is looking for ways to encourage the distribution of translations from Greek and Latin into modern languages. Many authors are simply happy to put their materials on their own websites. Our goal at this point is to elucidate some issues for those who want their materials to be more widely used and to gauge how many people might produce new materials if they had some sort of support. We see (at least) three topics for producers to consider:

Making your materials available under a Creative Commons license: Many producers make their materials freely available on a website but do not include an explicit rights statement enabling third parties to make use of their work. This omission prevents their work from having the impact that they often actually wish. There are various CC licenses to choose from. A CC BY-NC-ND license prevents anyone from modifying your work or using it for commercial purposes. Such a license is conservative and relatively easy to accept, but a bolder approach, using a CC BY-SA license, which allows for derivative works and for third parties to include your work in a commercial service, makes it easier for your work to reach more people. The BY feature requires that you receive credit, while your original version can remain available as a point of reference. The SA feature means that anyone who modifies your work has to share the results, thus preventing a commercial enterprise from making a new version of what you have done and hiding it behind an exclusive subscription wall.

Making your materials available in TEI XML:...