Digital Infrastructure Projects and What They Already Offer Historical Languages

Workshop Location:

Göttingen Centre for Digital Humanities (GCDH)
Papendiek 16
37073 Göttingen
Seminarraum 1

In this workshop, which is held within the framework of the Global Philology Project, we compare the capabilities of existing infrastructure projects such as CLARIN, DARIAH, the German Digital Library, the Berlin-Potsdam Laudatio repository, and Europeana, as well as projects and working groups within the Research Data Alliance and library/publishing infrastructure projects now being funded by the US Mellon Foundation. This workshop lays the foundations for further discussions and provides an initial assessment of what services exist upon which historical languages can build and what additional services need to be developed or to be adapted. The goal is not to settle upon any one infrastructure (e.g., CLARIN vs. DARIAH) but to develop, insofar as possible, an agile strategy such that those working with historical languages can exploit the strengths of multiple environments, with their data flowing from one system to another as smoothly as possible.

Tuesday, May 9

9-9:30 – Gregory Crane


9:30-10 – Marco Büchler

A Ten-Year Summary of a SOA-based Micro-services Infrastructure for Linguistic Services

From 2004 to 2016 the Leipzig Linguistic Services (LLS) existed as a SOAP-based cyberinfrastructure of atomic micro-services for the Wortschatz project, which covered different-sized textual corpora in more than 230 languages. The LLS were developed in 2004 and went live in 2005 in order to provide a webservice-based API to these corpus databases. In 2006, the LLS infrastructure began to systematically log and store requests made to the text collection, and in August 2016 the LLS were shut down. This article summarises the experience of the past ten years of running such a cyberinfrastructure with a total of nearly one billion requests. It includes an explanation of the technical decisions and limitations but also provides an overview of how the services were used.

10-10:30 – Pietro Liuzzo

EAGLE and IDEA, the International Digital Epigraphy Association: tasks and activities

After the end of the EAGLE project (Europeana Network for Ancient Greek and Latin Epigraphy), the association IDEA (International Digital Epigraphy Association) was founded to maintain the services developed and to offer continuity to the networking activities. IDEA aims to continue the mission of the EAGLE network by supporting small and medium-sized projects with advice and services and by keeping the aggregated epigraphic data to the best possible standard. The portal and services which provide search across multiple aggregated databases, vocabularies for authority files, and the Story Telling application continue to live and new perspectives open up for a future project based on the model of

Beta maṣāḥǝft and the Ethiopian literary tradition in the digital age

The project Beta maṣāḥǝft, funded by the Academy of Sciences, aims to take philological and codicological studies of the Ethiopian literary tradition into the digital age, starting with the encoding of primary sources in TEI. The project will build a catalogue of manuscripts, the first Clavis of Ethiopian literature, a gazetteer of ancient places in Ethiopia and a prosopography of Ethiopian people. This is not an easy task when dealing with a still-living tradition and with scarce access to the sources.

10:30-11 – Coffee break

11-11:30 – Carolin Odebrecht

Corpus metadata for the reusability of historical corpora

This talk will address the question of how we can support the reusability of historical corpora with the help of corpus metadata. Historical corpora vary considerably in their annotations and formats. Reusing a historical corpus is therefore a challenging task which requires a deep understanding of the corpus architecture and its content. In my talk, I will present our approach to solving this issue with the help of a meta model for corpus metadata. This meta model will provide both the necessary abstraction from the data and the relevant information needed to enable reusability scenarios.

11:30-12 – J. K. Tauber

Greek Linguistic Databases for Better Learning Tools

Language learning tools, from vocabulary drills to adaptive reading environments, need both models of student knowledge and rich linguistic databases tied to texts. In this talk I will discuss how richer linguistic databases enable better learning tools and how learning tools can motivate better linguistic databases. The examples will focus on my own work on the Greek New Testament, but the ideas and infrastructure discussed will be applicable to the much broader Greek corpus as well as to other languages.

12-12:30 – Tariq Yousef

Ugarit Translation Alignment Editor and Dynamic Lexicon

Ugarit is an online tool for manual text alignment. Users can import texts from the Perseus CTS repository or use their own texts, and the editor enables them to align two or three parallel translations in a very simple way. Ugarit also serves as a reading environment for parallel texts: it visualises the aligned texts in a simple and meaningful way, showing parallel translation pairs and their frequencies, and can export the alignment as XML files or the translation pairs as CSV files. The translation pairs obtained from the manual alignment are used to build a dynamic lexicon.
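As a rough illustration of how exported translation pairs could feed a frequency-based lexicon, here is a minimal Python sketch. It is not Ugarit's implementation: the CSV column names `source` and `target`, the sample rows, and the function `build_lexicon` are all assumptions made for this example, since the abstract does not describe the export format.

```python
import csv
import io
from collections import Counter, defaultdict


def build_lexicon(csv_text):
    """Aggregate translation pairs into {source word: Counter of target words}.

    Assumes (hypothetically) a CSV with 'source' and 'target' columns,
    one aligned translation pair per row.
    """
    lexicon = defaultdict(Counter)
    for row in csv.DictReader(io.StringIO(csv_text)):
        source = row["source"].strip().lower()
        target = row["target"].strip().lower()
        if source and target:  # skip rows with an empty side
            lexicon[source][target] += 1
    return lexicon


# Invented sample data standing in for an exported alignment.
SAMPLE = """source,target
λόγος,word
λόγος,speech
λόγος,word
θεός,god
"""

lexicon = build_lexicon(SAMPLE)
for source, targets in lexicon.items():
    # most_common() lists candidate translations by descending frequency
    print(source, targets.most_common())
```

Because the lexicon is only a set of counters, it can be updated incrementally as new alignments arrive, which is one plausible reading of "dynamic" here.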

12:30-1 – Christopher H. Johnson & Jörg Wettlaufer

The PANDORA Linked Open Data Presentation Framework

The interconnection of data in the humanities is increasingly a focus of research projects. Christopher Johnson has therefore developed a Linked Open Data framework that combines a Fedora 4 repository with IIIF APIs and triple stores to provide a SPARQL-query-driven solution for the Presentation (of) Annotations (in a) Digital Object Repository Architecture (PANDORA). The concrete implementation of PANDORA is a group of distributed web applications that depend on a specification document called a “Manifest” for how they present the data to the client. In PANDORA, the Manifest is a JSON-LD document constructed dynamically from Digital Object Repository (FEDORA) resources using SPARQL. The semantics and conceptualization of the Manifest fall within the scope of the IIIF Presentation API, which defines how the structure and layout of a complex image-based object can be made available in a standard manner. In this short presentation we will present the architecture and function of the system and would like to discuss its possible use in philological research.
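To make the idea of a Manifest more concrete, the following Python sketch assembles a minimal JSON-LD manifest from rows shaped like simplified SPARQL result bindings. It is not PANDORA's code: it assumes version 2 of the IIIF Presentation API (the abstract does not name a version), and the function `build_manifest`, the row keys, and all `example.org` URLs are hypothetical.

```python
import json

# JSON-LD context of the IIIF Presentation API, version 2 (assumed version).
IIIF_CONTEXT = "http://iiif.io/api/presentation/2/context.json"


def build_manifest(manifest_id, label, rows):
    """Build a minimal manifest with one canvas per row.

    Each row is a dict with the hypothetical keys
    'canvas', 'label', 'image', 'height' and 'width'.
    """
    canvases = []
    for row in rows:
        canvases.append({
            "@id": row["canvas"],
            "@type": "sc:Canvas",
            "label": row["label"],
            "height": row["height"],
            "width": row["width"],
            "images": [{
                "@type": "oa:Annotation",
                "motivation": "sc:painting",
                "resource": {
                    "@id": row["image"],
                    "@type": "dctypes:Image",
                    "height": row["height"],
                    "width": row["width"],
                },
                "on": row["canvas"],  # the image is painted onto this canvas
            }],
        })
    return {
        "@context": IIIF_CONTEXT,
        "@id": manifest_id,
        "@type": "sc:Manifest",
        "label": label,
        "sequences": [{"@type": "sc:Sequence", "canvases": canvases}],
    }


# Invented rows standing in for the results of a SPARQL query.
rows = [
    {"canvas": "https://example.org/iiif/ms1/canvas/1", "label": "f. 1r",
     "image": "https://example.org/images/ms1-001.jpg",
     "height": 3000, "width": 2000},
    {"canvas": "https://example.org/iiif/ms1/canvas/2", "label": "f. 1v",
     "image": "https://example.org/images/ms1-002.jpg",
     "height": 3000, "width": 2000},
]

manifest = build_manifest("https://example.org/iiif/ms1/manifest",
                          "Example manuscript", rows)
print(json.dumps(manifest, indent=2, ensure_ascii=False))
```

In a setup like the one described, the rows would come from a query against the triple store rather than from a literal list, so the manifest always reflects the current state of the repository.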

1-2 – Lunch

2-2:30 – Stephan Bartholmei

German Digital Library and Europeana

2:30-3 – Thorsten Trippel


CLARIN-D is a research infrastructure for the humanities and social sciences. The infrastructure provides scholars with easy and sustainable access to digital language data (in written, spoken, video or multimodal form) and to advanced tools to discover, explore, exploit, annotate, analyze or combine them. For projects and scholars creating and using data, CLARIN supports data management with services for preparing and depositing data. In this presentation, we will show selected services and resources provided by CLARIN that can be utilized for historical languages and that scholars can integrate into their own projects and software.

3-3:30 – Susanne Haaf

Historical German Data in CLARIN’s user involvement phase: status and perspectives

The Language Center at the Berlin-Brandenburg Academy of Sciences and Humanities (BBAW) constitutes the CLARIN-D data center for historical German text. It is home to the German Text Archive (Deutsches Textarchiv, DTA), a platform for the publication and exploitation of historical corpora for the German language. The talk will focus on the components offered by CLARIN-D via the BBAW data center that enable and support work with historical data and corpora, including standard formats, documentation, research and archiving platforms, as well as analysis tools. In addition, the talk will outline perspectives for further developments with regard to historical data during CLARIN’s user involvement phase, which started in September 2016 and follows CLARIN’s implementation phase.

3:30-4 – Coffee

4-4:30 – Mike Mertens

What makes DARIAH-EU, DARIAH-EU? (Or, how I stopped worrying about definitions and learned to love Research Infrastructures)

When one first hears the word “infrastructure” in a research or academic context, one perhaps immediately visualises something akin to CERN: a large, multinational complex that manages a purely physical and often unique asset. DARIAH-EU has long insisted, however, that the knowledge inherent in Arts and Humanities endeavours requires three distinct elements to flourish: people, skills and hardware. Each of these requires a clear framework and shared resources.
In the presentation I will give a brief history of DARIAH-EU and focus on the importance of international cooperation; the necessary interrelationship between research, archives, museums and other memory institutions, and the broader public; the link with the creative industries; what DARIAH offers in terms of skills and training; and how we see the future of sustainable, digitally enabled Arts and Humanities.

4:30-5 – Stefan Schmunk

DARIAH-DE – Generic usable components for disciplinary requirements

DARIAH-DE offers a variety of IT components and tools that research projects and institutions can use and integrate into their own developments. These include a number of basic components as well as generically usable special tools and services, for example services from the fields of annotations, big data and research data. In the lecture, some of these components will be presented and, at the same time, the requirements of the project are to be determined.

5-5:30 – Jan Brase

The Research Data Alliance (RDA)

The RDA was founded in 2013 as a research community through a joint effort of the European Commission, the US National Science Foundation and National Institute of Standards and Technology, and the Australian Department of Innovation. The RDA defines itself not as a digital infrastructure project, but as a global platform to bring together specialists for research data issues. RDA’s main vehicles for outputs are 18-month working groups that generate recommendations aimed at the RDA community. In addition to working groups, interest groups with no fixed lifetime can produce either informal or “supported” outputs, which carry some degree of RDA endorsement.
In this overview we will look at those working groups that are of interest for the field of historical languages and discuss how to cooperate with them.

6:30 – Dinner

Kartoffelhaus Restaurant
Goethe-Allee 8

Wednesday, May 10


Wednesday will be given completely over to discussions, which can include whole-group discussions as well as breakout sessions for participants wanting to focus on specific issues or technical solutions.