Topic Modelling of Historical Languages in R

By Thomas Köntges

This is a quick note and introduction to topic-modelling historical languages in R, intended to supplement three publications forthcoming in 2016: one for the AMPHORAE issue of the Melbourne Historical Journal, one for Alexandria: The Journal of National and International Library and Information Issues (currently under review), and one for DH2016. This blog entry also summarises some points I have made in several talks over the past few months about topic-modelling historical languages (including my talk at the Analyzing Text Reuse at Scale / Working with Big Humanities Data workshop during the DH Leipzig Workshop Week in December 2015). It is therefore a short summary of some of the more important points previously made; in contrast to the specific applications covered in each of the articles, it provides an overview of the subject.

My work on topic-modelling did not start in Leipzig; rather, it was part of a project I worked on during my time as a research associate at the Victoria University of Wellington (VUW), New Zealand, in 2013: the Digital Colenso Project. Back then I thought that there was only one ideal number of topics for each corpus, and I used Martin Ponweiser’s harmonic-mean method (see chapters 3.7.1 and 4.3.3 in his master’s thesis) to attempt to determine it. Although this approach was useful, albeit slow, for the Digital Colenso Project, I now think the assumption was wrong: the ideal topic granularity depends more on the research question and the use-case to which topic modelling is applied than on the data itself. To clarify this I will showcase the results of my topic-modelling research during my 2015 visiting fellowship at VUW. This research was undertaken in collaboration with staff at the Alexander Turnbull Library, National Library of New Zealand (ATL), and was subsequently applied to the Open Philology Project (OPP) of the Department of Digital Humanities at the University of Leipzig, Germany.

After a brief introduction to the research projects in Wellington and Leipzig and to topic-modelling itself, this blog entry will summarize the limitations of topic-modelling, with special emphasis on how to determine an ideal number of topics, and briefly discuss morpho-syntactic normalization and the use of stop words. It will then suggest a researcher-focused method of addressing these limitations and challenges. I will briefly demonstrate its applicability to the different use-cases at ATL and OPP, which deal with very different fields and languages, including English, Latin, Ancient Greek, Classical Arabic, and Classical Persian. I will finish by stressing how digital humanities research results and practices can be improved by enabling humanities researchers, who focus on more traditional and qualitative analyses of corpora, to use the quantitative method of topic-modelling as a macroscope and faceting tool.

During my research stay at VUW I worked with the Research Librarian for cartoons at ATL, Dr Melinda Johnston, on a mixed-methods analysis of the reactions of cartoonists and New Zealand print publications to the Canterbury earthquakes of 2010 and 2011. ATL is part of the National Library of New Zealand, an institution interested in making the country’s cultural heritage more accessible to a digital audience and to researchers. Within this short project I attempted to automate the detection and analysis of earthquake-related content in cartoon descriptions created by ATL and in over 100,000 abstracts produced by Index New Zealand (INNZ); all items were published between September 2010 and January 2014. The INNZ data could be retrieved as a double-zipped XML file from INNZ’s webpage, and ATL’s item descriptions could be queried using the Digital New Zealand (DNZ) API. During the project it became apparent how a topic-modelling approach could considerably speed up the finding and faceting of earthquake-related descriptions and abstracts.

The results were so impressive that I decided to apply the approach to Latin and Greek literature in Leipzig’s OPP. OPP has a text collection of over 60 million Greek and Latin words and has recently begun to add Classical Persian and Arabic texts. One of OPP’s core interests is to produce methods that can swiftly generate results on big data and that can compete with more traditional approaches. OPP is maintained and organized using eXist-db, the CTS/CITE architecture developed by the Homer Multitext Project, and additional web-based tools and services (e.g. Morpheus, a Greek and Latin morpho-syntactic analyser, and GitHub). This structure enables researchers to use a CTS API to retrieve their desired text corpora or specific texts. In a first evaluation run of the topic-modeller, 30,000 Classical Arabic biographies were used. At both research institutions, OPP and ATL, researchers applied more qualitative methods to complement the process and evaluate results. Because those evaluations were promising, the topic-modelling process is showcased in what follows.
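To give a concrete impression of the CTS retrieval step, the following minimal sketch issues a GetPassage request from R. The endpoint URL and URN are placeholders rather than the actual OPP configuration, and the XPath simply collects all text nodes of the reply, so a real script would target the TEI passage element instead.

library(RCurl)   # getURLContent()
library(XML)     # xmlParse(), xpathSApply()

## Hypothetical CTS endpoint and URN -- replace with the values of the
## installation actually being queried.
cts_endpoint <- "http://example.org/api/cts"
passage_urn  <- "urn:cts:greekLit:tlg0003.tlg001.perseus-grc1:1.1"

request <- paste(cts_endpoint, "?request=GetPassage&urn=", passage_urn, sep = "")
reply   <- xmlParse(getURLContent(request), asText = TRUE)

## Crudely collect the text content of the reply.
passage_text <- paste(xpathSApply(reply, "//text()", xmlValue), collapse = " ")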

Topic modelling is “a method for finding and tracing clusters of words (called “topics” in shorthand) in large bodies of texts”. A topic can be described as a recurring pattern of co-occurring words. Topic models are probabilistic models that usually assume a fixed, predetermined number of topics in the corpus. The simplest, and probably one of the most frequently applied, topic models is latent Dirichlet allocation (LDA). The success and results of LDA rely on a number of a priori-set variables: for instance, the number of topics assumed in the corpus, the number of iterations of the modelling process, the decision for or against morpho-syntactic normalisation of the research corpus, and how stop words are implemented in the process. Furthermore, its interpretation is often influenced by how the topics are graphically represented and how the words of each topic are displayed.
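To make these a priori variables concrete, here is a minimal sketch of an LDA run with J. Chang’s lda package; the input vector, the number of topics, the number of iterations, and the hyperparameters alpha and eta are all placeholder values that the researcher would have to set for a real corpus.

library(lda)   # J. Chang's collapsed Gibbs sampler

## `unit_texts` is assumed to be a character vector with one (normalised,
## stop-word-filtered) citable unit per element.
lex <- lexicalize(unit_texts, lower = TRUE)

K          <- 25     # a priori number of topics
iterations <- 500    # a priori number of Gibbs iterations
alpha      <- 0.02   # document-topic hyperparameter
eta        <- 0.02   # topic-term hyperparameter

fit <- lda.collapsed.gibbs.sampler(lex$documents, K, lex$vocab,
                                   iterations, alpha, eta)

## Document-topic proportions ("theta"), used for the similarity
## comparisons shown further below.
theta <- t(apply(fit$document_sums + alpha, 2, function(x) x / sum(x)))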

While C. Sievert has already found a very convincing solution to the visualisation problem, the a priori variables listed above are often out of the hands of the qualitative, traditional researcher, and any bigger modification has to be implemented by a computer-savvy researcher. Yet topic-modelling is often not an end in itself; rather, it is a tool used to help answer a specific humanities research question or to facet large text corpora, so that further methods can be applied to a much smaller selection of texts. For example, one could use the theta-values (the per-document topic proportions) to determine the topic similarity of two paragraphs in R, in order to detect text reuse or to find similar sentences for teaching purposes:

### Find the topic similarity of two citable units by comparing the
### mean deviation of their theta-values for each topic.
### Assumes first_element()/last_element() return the first/last entry of a
### vector, e.g.:
first_element <- function(v) v[1]
last_element  <- function(v) v[length(v)]

is_similar <- function(x) {
  # all.equal() reports the mean relative difference per theta column
  check <- all.equal(theta.frame[which(theta.frame[, 1] == first_element(unlist(x))), ],
                     theta.frame[which(theta.frame[, 1] == last_element(unlist(x))), ])
  # extract the numeric differences from all.equal()'s messages and average them
  result <- mean(as.numeric(sub(".*?difference: (.*?)", "\\1", check)[3:length(check)]))
  return(result)
} # produces NA if a unit is compared with itself

Or find example paragraphs in the corpus that belong to a certain topic:

## Find an exemplar sentence for a given topic
topic_number   <- 1  # which topic to inspect
exemplar_level <- 1  # 1 = the citable unit with the highest theta for that topic
# Order the citable units by their theta-value for the chosen topic and
# return the text (column 2) of the n-th best exemplar.
corpus[which(corpus[, 1] == theta.frame[order(theta.frame[, topic_number + 1],
                                              decreasing = TRUE), ][exemplar_level, 1]), 2]

Traditional researchers often have to continue working with topic-modelling results, but may not always be aware of the bias that the a priori-set variables have introduced into the selection process. One possible way to bridge this gap between researcher and method is to involve the qualitative researcher earlier in the process by giving them agency in the topic-modelling itself. To do this, I have used R and the web-application framework Shiny to combine J. Chang’s LDA and C. Sievert’s LDAvis libraries with DNZ/CTS API requests and language-specific handling of the text data, creating a graphical user interface (GUI) that enables researchers to find, topic-model, and browse texts in the collections of ATL, OPP, and INNZ. They can then export the corpus and model they have produced, so that they can apply qualitative methods to a precise facet of a large text corpus, rather than to the whole corpus itself, which contains texts that are irrelevant for answering their specific research question.

On the left side of the GUI, the researcher can set the following variables: a) search term(s) or CTS-URN(s); b) the source collection; c) certain stop-word lists or processes; d) additional stop words; e) the number of topics; and f) the number of terms shown for each topic in the visualization. The application then generates the necessary API requests, normalizes the text as desired by the researcher, applies J. Chang’s LDA library, and finally presents a D3 visualisation of the topics, their relationship to each other, and their terms using C. Sievert’s LDAvis, with dimension reduction via Jensen-Shannon divergence and principal components as implemented in that library. The researcher can then directly and visually evaluate the success of their topic-modelling and use the settings on the left as if they were adjusting a microscope, focusing on certain significant relationships of word co-occurrences within the corpus. Once they have focused their research tool, they can export visualisations, topics, and their corpus for further research. For further clarification, a few example use-cases are described in what follows.
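As a rough illustration of that control panel, the following sketch uses the shiny and LDAvis packages; the collection names, default values, and the empty server function are placeholders, not the actual application, whose server side would build the API requests, normalise the text, fit the model, and hand the result to LDAvis::createJSON().

library(shiny)
library(LDAvis)

ui <- fluidPage(
  titlePanel("Topic-modelling historical languages"),
  sidebarLayout(
    sidebarPanel(
      textInput("query", "a) Search term(s) or CTS-URN(s)"),
      selectInput("collection", "b) Source collection",
                  choices = c("ATL", "INNZ", "OPP")),          # placeholder names
      selectInput("stoplist", "c) Stop-word list / normalisation",
                  choices = c("none", "language-specific")),   # placeholder options
      textInput("extra_stops", "d) Additional stop words"),
      numericInput("k", "e) Number of topics", value = 25, min = 2),
      numericInput("terms", "f) Terms shown per topic", value = 30, min = 5)
    ),
    mainPanel(visOutput("topics"))   # LDAvis D3 visualisation
  )
)

server <- function(input, output) {
  output$topics <- renderVis({
    ## Placeholder: the real application would (1) build the DNZ/CTS API
    ## requests, (2) normalise the retrieved text, (3) fit the LDA model and
    ## (4) return LDAvis::createJSON(phi, theta, doc.length, vocab, term.frequency).
    NULL
  })
}

shinyApp(ui = ui, server = server)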

As stated above, one application was building a finding and analytical aid for cartoon descriptions and INNZ abstracts related to the Canterbury earthquakes of 2010 and 2011. One of the original research questions was to what extent cartoonists reacted differently to the Canterbury earthquakes than print journalists. To answer this, over 100,000 abstracts and over 30,000 descriptions of born-digital cartoons had to be evaluated for their potential to answer the research question. ATL uses Library of Congress (LOC) based subject headings for its collection items. At first we thought these could be used to generate ideal selections of descriptions fitting the research question, but we found that LDA topic-modelling was not only quicker (among other things because it needs less human intervention), but also better able to identify trends over time: for example, it showed a much more realistic development of topics dealing with the initial destruction and topics dealing with the rebuild, while a LOC subject-heading based analysis required much manual labour and yielded a different (unexpectedly wrong) result. A more thorough discussion can be found in the upcoming Alexandria article.

The other applications of the topic-modelling tools were further use-cases in which OPP corpora were processed: one Latin, one Greek, and one Arabic (please find the most recent, better documented approach for Greek here and for Latin here). In these more morphologically complex languages, special emphasis had to be placed on the influence of morpho-syntactic normalisation, that is, the reduction of different inflected instances of the same word to the same morphological base (the so-called “dictionary form” of a word). This normalisation by reduction can contribute immensely to the success of topic modelling. The degree of its impact, however, is language-dependent: it depends on the kind of information that is lost during the normalisation process in a specific language, and on which tools are available to reduce morphological complexity in that language. For instance, while it is useful for Ancient Greek and Latin to normalise by reducing morphological complexity, because the frequency of a term then becomes more detectable and usable (see the Greek topic-modelling results for Thucydides and also for Caesar’s De Bello Gallico), there are reasons why it might be better not to normalise Classical Arabic text in the same way. Classical Arabic verb forms mark the gender (sexus) of the subject of the sentence, making it possible to easily detect, for example, biographies of women using topic-modelling (a link to Dr Maxim Romanov’s blog will be posted here once it is published, but also see figure 1).

Figure 1. LDA topic-model generated from around 30,000 Classical Arabic biographies. Topic 20 “Biographies of women” is selected.

Currently, the R script uses the Perseids Morphological Service API to find a vector of possible lemmata for each word:

# Assumes: library(RCurl) for getURLContent(); morpheusURL, corpus_words and
# the XML helper XMLpassage2() are defined earlier in the script.
parsing <- function(x) {
  word_form <- x
  URL <- paste(morpheusURL, word_form, "&lang=grc&engine=morpheusgrc", sep = "")
  message(round((match(word_form, corpus_words) - 1) / length(corpus_words) * 100,
                digits = 2), "% processed. Checking ", x, " now.")

  # Request the morphological analysis; retry once after a short pause,
  # and fall back to the original word form if the service cannot be reached.
  URLcontent <- tryCatch({
    getURLContent(URL)
  }, error = function(err) {
    tryCatch({
      Sys.sleep(0.1)
      message("Try once more")
      getURLContent(URL)
    }, error = function(err) {
      message("Return original value: ", word_form)
      return(word_form)
    })
  })

  if (URLcontent == "ServerError") {
    lemma <- x
    message(x, " is ", lemma)
    return(lemma)
  } else {
    if (is.null(XMLpassage2(URLcontent))) {
      # No lemma found in the XML reply: keep the original word form.
      lemma <- x
      message(x, " is ", lemma)
      return(lemma)
    } else {
      lemma <- tryCatch({
        XMLpassage2(URLcontent)
      }, error = function(err) {
        message(x, " not found. Return original value.")
        lemma <- "NotFound1"
        message(x, " is ", lemma)
        return(lemma)
      })

      # Clean up the candidate lemmata: strip homograph numbers, lower-case,
      # and remove duplicates.
      lemma <- gsub("[0-9]", "", lemma)
      lemma <- tolower(lemma)
      lemma <- unique(lemma)
      if (all(nchar(lemma) == 0)) {
        lemma <- x
        message(x, " is ", lemma)
        return(lemma)
      } else {
        message(x, " is ", lemma)
        return(lemma)
      }
    }
  }
}

The script then chooses, for each word, the candidate lemma that is most frequent in the corpus. Because vocabulary repeats within the same corpus and because different morphological forms of the same word produce different vectors of potential lemmata, this solution works well for many classical authors. That said, it is very slow! The slowest parts of the R script are the retrieval of the text itself and the retrieval of the lemma vectors for each word. The former could be addressed either by a local installation of eXist-db and the Perseus/OPP texts, or by modifying the API to allow requesting the whole text of an author in one call. The latter could be addressed either by a local installation of Morpheus or, even better, by new parsers that are finite-state transducers, like Harry Schmidt’s Parsley written in Go or Neel Smith’s adaptation for Greek. Currently, however, a researcher depends on morphological services that are delivered via individual API requests. Obviously, the success of sending thousands of requests to the same API service depends not only on the topic-modelling script itself, but also on the stability of the connection to the service: connections can time out, so error handling and testing of the corrected corpus are needed to ensure the success of the morphological normalisation and the data quality of the newly generated parsed corpus.
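A minimal sketch of that frequency-based guess (not the original implementation) might look as follows, assuming candidate_lemmata is a named list that maps every word form in the corpus to its vector of possible lemmata, e.g. as collected by applying parsing() to corpus_words:

## Count how often each candidate lemma is proposed anywhere in the corpus.
lemma_counts <- table(unlist(candidate_lemmata))

## For an ambiguous word form, keep the candidate that is most frequent overall.
pick_lemma <- function(candidates) {
  if (length(candidates) == 1) return(candidates)
  names(which.max(lemma_counts[candidates]))
}

lemmatised <- vapply(candidate_lemmata, pick_lemma, character(1))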

Here is a possible way of comparing the data-quality of the morphologically normalised corpus with the original corpus (see the parsing function above for handling possible connection problems):

### Compare the token counts of corpus and corpus_parsed row by row:
### every citable unit should keep the same number of tokens after parsing.

length_corpus <- nrow(corpus)   # corpus is a two-column table (identifier, text),
                                # so this equals length(corpus)/2 for a matrix
test_corpus_length <- vector()
for (i in 1:length_corpus) {
  tokens_parsed   <- unlist(strsplit(as.character(unlist(corpus_parsed[i, 2])), "[[:space:]]+")[1])
  tokens_original <- unlist(strsplit(as.character(unlist(corpus[i, 2])), "[[:space:]]+")[1])
  test_corpus_length[i] <- length(tokens_parsed) == length(tokens_original)
}
table_corpus_length <- table(test_corpus_length)   # TRUE/FALSE summary
bug_report <- which(test_corpus_length == FALSE)   # rows whose token count changed

For both tasks, error handling and the testing of data quality, most researchers would welcome help or a solution where this is already done for them: this applies to those who are simply interested in creating a topic model of a corpus, those who see topic-modelling only as a step towards further research (e.g. social network analysis, text reuse), and those who use it as a tool (e.g. finding sentences similar to passages students have already read, in order to test their language comprehension). Offering those kinds of services has frequently been discussed in our team and in the extended Digital Classics universe, and it is not unlikely that researchers will have such resources in the future, especially because this is, at least in the research of historical languages, a finite problem.

In any case, as the reader can see, it is an exciting time to work on topic-modelling historical languages, given that almost every month better tools and APIs appear that can be used to speed up the code, and given the push from traditional researchers to better understand topic-modelling and to use its results in their research and teaching. I hope that this short blog post has given a few examples and contributed to a better understanding of how topic-modelling, in all its complexity, can be opened up to, and its development positively influenced by, more traditional researchers with little computing experience, in turn enabling them to answer specific research questions based on large text corpora.

The Big Humanities, National Identity and the Digital Humanities in Germany

Gregory Crane
June 8, 2015

Alexander von Humboldt Professor of Digital Humanities
Universität Leipzig (Germany)

Professor of Classics
Winnick Family Chair of Technology and Entrepreneurship
Tufts University (USA)

Summary

Alexander von Humboldt Professors are formally and explicitly “expected to contribute to enhancing Germany’s sustained international competitiveness as a research location”. And it is as an Alexander von Humboldt Professor of Digital Humanities that I am writing this essay. Two busy years of residence in Germany have allowed me to make at least some preliminary observations, but most of my colleagues in Germany have spent their entire careers here, often in fields where they have grown up with their colleagues around the country. I offer initial reflections rather than conclusions and write in order to initiate, rather than to finish, discussions about how the Digital Humanities in Germany can be as attractive outside of Germany as possible. The big problem that I see is the tension between the aspiration to attract more international research talent to Germany and the necessary and proper task of educating the students of any given nation in at least one of their national languages, as well as in their national literatures and histories. The Big Humanities — German language, literature and history — drive Digital Humanities in Germany (as they do in the US and every other country with which I am familiar).

In my experience, however, the best way to draw new talent into Germany is to develop research teams that run in English and capitalize on a global investment in the use of English as an academic language — our short term experience bears out the larger pattern, in which a large percentage of the students who come to study in Germany enjoy their stay, develop competence in the language and stay in Germany. Big Humanities in Germany, however, bring with them the assumption that work will be done in German and have a corresponding — and entirely appropriate — national and hence inwardly directed focus.

But if it makes sense to have a German Digital Humanities, that also means that Germany may have its own national infrastructure to which only German speaking developers may contribute — 77% of the Arts and Humanities publications in Elsevier’s Scopus Publication database are in English, very few developers outside of the German speaking world learn German and the Big Humanities in the English speaking world tend to cite French as their second language (only 0.3% of the citations of the US Proceedings of the Modern Language Association pointed to German, while the Transactions of the American Philological Association, with 10% of its citations pointing to German, made the most use of German scholarship).

The best way to have a sustainable digital infrastructure is to have as many stakeholders as possible and, ideally, to be agile enough to draw on funding support from different sources, including (especially) international sources of funding. We also need to decide what intellectual impact we wish German investments in Digital Humanities to have outside of the German-speaking world, and to address the related question of how the Digital Humanities can expand the role that German language, literature and culture play beyond the German-speaking world.

Details and the full text are available here.

Seven reasons why we need an independent Digital Humanities

Gregory Crane
Professor of Classics and Winnick Family Chair of Technology and Entrepreneurship
Tufts University

Alexander von Humboldt Professor of Digital Humanities
Open Access Officer
University of Leipzig

April 28, 2015

Draft white paper available at http://goo.gl/V9Ddjq

This paper describes two issues: the need for an independent Digital Humanities, and the opportunity to rethink within a digital space the ways in which Humanists can contribute to society and redefine the social contract upon which they depend.

The paper opens by articulating seven cognitive challenges that the Humanities can combat, in some cases only, and in other cases much more effectively, insofar as we have an independent Digital Humanities: (1) we assume that new research will look like research that we would like to do ourselves; (2) we assume that we should be able to exploit the results of new methods without having to learn much and without rethinking the skills that at least some senior members of our field must have; (3) we focus on the perceived quality of Digital Humanities work rather than on the larger forces and processes now in play (which would only demand more and better Digital Humanities work if we do not like what we see); (4) we assume that we have already adapted new digital methods to existing departmental and disciplinary structures and that the rate of change over the next thirty years will be similar to, or even slower than, what we experienced in the past thirty years, rather than recognizing that the next step will be for us to adapt ourselves to exploit the digital space of which we are a part; (5) we may support interdisciplinarity, but the Digital Humanities provide a dynamic and critically needed space of encounter, not only between established humanistic fields but also between the humanities and a new range of fields including, but not limited to, the computer and information sciences (and thus I use the Digital Humanities as a plural noun, rather than a collective singular); (6) we lack the cultures of collaboration and of openness that are increasingly essential for the work of the humanities and that the Digital Humanities have proven much better at fostering; (7) we assert all too often that a handful of specialists alone define what is and is not important rather than understanding that our fields depend upon support from society as a whole and that academic communities operate in a Darwinian space.

The Digital Humanities offer a marginal advantage in this seventh and most critical point because the Digital Humanities (and the funders which support them) have a motivation to think about and articulate what they contribute to society. The question is not whether the professors in the Digital Humanities or traditional departments of Literature and History do scholarship of higher quality. The question is why society supports the study of the Humanities at all and, if so, at what level and in what form. The Digital Humanities are important because new digital media and automated methods enable all of us in the Humanities to reestablish the social contracts upon which we always must depend for our existence.

The Digital Humanities provide a space in which we can attack the three fundamental constraints that have limited our ability to contribute to the public good: (1) the distribution problem, (2) the library problem, and (3) the comprehension problem. First, all Humanities have the power to solve the distribution problem by insisting upon Open Access (and Open Data) as essential elements of modern publication. Here the Digital Humanities arguably provide a short-term example of leadership because of the greater prevalence of open publication. The second challenge has two components. On the one hand, we need to rethink how we document our publications with the assumption that our readers will, sooner or later, have access to digital libraries of the primary and secondary sources upon which we base our conclusions. At the same time, developing comprehensive digital libraries requires a tremendous amount of work, including fundamental research on document analysis, optical character recognition, and text mining, as well as analysis of the economics and sociology of the Humanities. Third, the comprehension problem challenges us to think about how we can make the sources upon which we base our conclusions intellectually accessible — what happens when people in Indonesia confront a text in Greek, or viewers in America view a Farsi sermon from Tehran, artifacts of high art from Europe or of religious significance from Sri Lanka, a cantata of Bach or music played on an Armenian duduk?

The basic questions that we ask in the Humanities will not change. We will still, as Livy pointed out in the opening to his History of Rome, confront the human record in all its forms, ask how we got from there to where we are now and then where we want to go. And we may still, like Goethe, decide that the best thing about the past is simply how much enthusiasm it can kindle within us. But the speed and creativity with which we answer the distribution, library and comprehension problems determines the degree to which our specialist research can feed outwards into society and serve the public good.
The more we labor to open up our work — even the most specialized work — and to articulate its importance, the better we ourselves understand what we are doing and why. Non-specialists include other professional researchers as well as the general public. We may think that we are giving up, in practice if not in law, something of our perceived (and always only conditional and always short-term) disciplinary autonomy but, in so doing, we win the freedom to serve, each of us according to the possibilities of our individual small subfields within the humanities, the intellectual life of society.


Getting to open data for Classical Greek and Latin: breaking old habits and undoing the damage — a call for comment!

Gregory Crane
Professor of Classics and Winnick Family Chair of Technology and Entrepreneurship
Tufts University

Alexander von Humboldt Professor of Digital Humanities
Open Access Officer
University of Leipzig

March 4, 2015

Philologists must for at least two reasons open up the textual data upon which they base their work. First, researchers need to be able to download, modify and redistribute their textual data if they are to fully exploit both new methods that center around algorithmic analysis (e.g., corpus linguistics, computational linguistics, text mining, and various applications of machine learning) and new scholarly products and practices that computational methods enable (e.g., on-going and decentralized production of micro-publications by scholars from around the world, as well as scalable evaluation systems to facilitate contributions from, and learning by, citizen scientists). In some cases, issues of privacy may come into play (e.g., where we study Greek and Latin data produced by our students), but our textual editions of, and associated annotations on, long-dead authors do not fall into this category. Second, open data is essential if researchers working with historical languages such as Classical Greek and Latin are to realize either their obligation to conduct the most effective (as well as transparent) research or their obligation to advance the role that those languages can play in the intellectual life of society as a whole. It is not enough to make our 100 EUR monographs available under an Open Access license. We must also make as accessible as possible the primary sources upon which those monographs depend.

This blog post addresses two barriers that prevent students of historical languages such as Classical Greek and Latin from shifting to a fully open intellectual ecosystem: (1) the practice of giving control of scholarly work to commercial entities that then use their monopoly rights to generate revenue and (2) the legacy rights over critical editions that scholars have already handed over to commercial entities. The field has the rights, the skills, and the labor to address the first challenge immediately and permanently. The second challenge is much less tractable. We may never be able to place recent work in a form where it can fully support new scholarship. The obstacle is not only the rights that restrict its distribution but also, often, the digital format in which textual editions have been produced (e.g., where editors used word processing files rather than best practices such as well-implemented Text Encoding Initiative XML markup). Both the rights and the format together make it unlikely that we will be able in the immediate future (if ever) to make recent critical editions fully available (under a CC-BY-SA license, with TEI XML markup representing the logical structure of both the reconstructed text and the textual notes). The question before us is to determine how much we can in the immediate future recover for the full range of scholarly use and public discourse.

First, the decision to stop handing over ownership of new textual data (and especially any textual data produced with any significant measure of public funding) is, in 2015, a purely political one. There is no practical reason not to make this change immediately. If it takes editors an extra six months or a year (and it should not) because they need to learn how to produce a digital edition, the delay is insignificant in comparison to the damage that scholars suffer when they hand over control of the reconstructed text for 25 years and of the textual notes, introduction and other materials for 70 years after their death.

The Text Encoding Initiative began publishing interoperable methods for machine-actionable digital editions in the late 1980s. Students of Classical Greek and Latin, the largest community of historical philologists, already have all the resources in expertise and infrastructure with which to make this shift immediately. The second problem is recovering, insofar as possible, textual data that researchers have already given over to commercial interests which, in turn, exploit monopoly ownership to generate revenue. How many textual decisions in this commercial zone do we need to reference within the open data upon which we base our analysis of Greek and Latin and the cultures that these languages directly influenced? This blog post proposes a two-fold strategy: (1) beginning a series of openly licensed (CC-BY-SA) textual commentaries that are aligned to openly licensed editions and to which members of the community can suggest the inclusion of important new editorial choices or conjectures only available in editions controlled by commercial interests; (2) identifying, if absolutely necessary, a small list of editions that commercial entities control but that are of such compelling importance that funding should be solicited to buy the rights to digitize, mark up with TEI XML, and distribute their contents.

Many traditional scholars may argue that we should preserve the present system (1) because only specialists in Greek and Latin philology need access to new editions and (2) because students of Greek and Latin have no need of the computational methods that require open data for their full expression as instruments of scholarship. Scholars are free to argue that the primary goal of humanities research is to enable specialist publication along small, effectively closed networks of intellectual exchange, that the results of our work on Greek and Latin do not really have enough broader impact to warrant worrying about open access and open data, that the study of historical languages does not require that researchers have the ability to download, analyze, modify and redistribute textual data, and that publicly funded scholarship is not ultimately answerable to the public which provides that funding.

From a pragmatic point of view, such arguments would be problematic for anyone who wishes to replace retiring faculty in Greek or Latin, to attract the most ambitious minds to the study of these languages or to justify research support for the study of Greek and Latin from any private foundation or governmental agency that could invest its research support elsewhere. There is never enough money to support all the research that would advance human understanding, much less so-called STEM disciplines (science, technology, engineering and mathematics — the corresponding German acronym is MINT) that materially advance the economic prosperity and biological health of society. But the privilege of academic freedom and the right of free expression that we enjoy in nations such as Germany and the United States exist so that we can follow our principles and add our opinions to public debate.

There are two fundamental reasons for scholars to make openly useful both their conclusions (open access publications) and the data upon which those conclusions depend.

The first bears most directly upon those of us who receive most, if not all, of our salary and research support either from public money or from private foundations that require us to make our results available under an open license: there is our obligation as humanists to advance the intellectual life of humanity. Of course, in 2015, this point of view is finding its way into regulations of government research funding in various countries, while private foundations increasingly insist that the results from work that they fund be published under an open license. Ironically, the smallest and the largest disciplines seem to have adapted most rapidly to this much more open model of research. Students of Greek papyrology, for example, have already made the transition to open data and on-going, decentralized editing — those who feel that commercial entities provide the only channel by which to publish Greek and Latin textual editions need first to understand fully the infrastructure to which the papyrologists already have access (http://papyri.info/). In fact, the services at http://papyri.info/ go beyond what editors need if they wish to create individual, single-authored, static editions. For editors of Latin editions, help is on the way from the Digital Latin Library Project. If editors wish to work on their own to create editions of Greek and Latin texts, they should buy a TEI-aware XML editor and learn how to produce a modern edition. Anyone smart enough to edit an edition of Greek and Latin is smart enough to understand the necessary TEI XML (or the EpiDoc subset of TEI XML: epidoc.sourceforge.net/). My colleagues at the Humboldt Chair of Digital Humanities are also there to do what we can to help.

Second, there is the scholarly need for open data. This need is not new. More than a decade ago, pioneering philologists badgered me to release the textual data that we had accumulated at Perseus. Licenses for private use were not enough. They argued tirelessly that they needed, as part of their fundamental research, the right to analyze, modify, and then redistribute some or all of those texts in their altered form. After dragging my feet for years, I finally began to open up the TEI XML source for Perseus texts. The initial release of the TEI XML Greek and Latin texts under a CC-BY-SA-NC license (now simplified to a CC-BY-SA license) took place in March 2006, almost a decade ago. The Classicists who demanded that open data — Chris Blackwell, Gabby Bodard, Helma Dik, Tom Elliott, Ross Scaife, and Neel Smith, among others — were pioneers and earned for themselves by their visionary work a permanent place in the history of Greco-Roman studies. In 2015, we are beyond the vision thing. We Greek and Latin Philologists are playing catch-up as a field as we struggle to integrate into our work the best methods available for analyzing textual data.

We have gone beyond the point where we can any longer reasonably argue that computational methods are unimportant, or even optional, instruments within Greek and Latin philology as a whole. Not every professional student of Greek and Latin will master the foundational new methods already available to us from fields such as corpus linguistics, computational linguistics, text mining and various applications of machine learning. But those who do master the results of such new fields will play a crucial role in determining what all students of Greek and Latin at all levels will be able to do in their personal learning and published research. Open textual data is a foundational need for modern scholarship. The question before us is how to free ourselves from our dependence upon closed data and to establish a comprehensive, open, extensible textual space for the study of Greek and Latin. It is time to return, yet again, ad fontes — back to the sources.

It is not difficult to see how the field of Greek and Latin can, and will, shift so that new textual editions appear in proper TEI XML under an open license (ideally CC-BY-SA). For commercial — and especially for for-profit — companies, the shift to an open publication model simply reflects a shift in business models, and the most profitable presses have already begun to build new (and reportedly quite profitable) open access tracks. Of course, the editors of Greek and Latin as a whole are perfectly capable of providing editorial support for each other — the ability to write is a selling point of liberal arts degrees, and professors of Greek and Latin would be ill-advised to argue that they need professional editors in the same way as their colleagues in Computer Science or Physics do. We can also build publishing workflows that simplify the use of TEI XML (such as the Leiden plus front end that papyrologists have been using for years). But such a streamlined system is a convenience, not a necessity.

The real problem is, of course, one of academic politics. Many faculty believe that they need to publish their work under an established corporate brand name if they are to receive formal academic credit. In some institutions, this belief may even be true, but I think that many faculty would find that their administrations were not only supportive but relieved to see their humanities faculty taking a stand on behalf of open access and open data, especially where faculty are public servants and/or their universities have strong policies in support of Open Access and open data.

I am confident that the administrations at Tufts University (where I am in the department of Classics) and at Leipzig, for example, (where I am the Open Access officer) would enthusiastically work with any department that wanted to establish a framework for fairly assessing an edition that was published under a CC-BY-SA license. If anything, editors at these institutions would have a chance to earn even more prestige by taking an (apparent) risk to advance the role of Greek and Latin in the intellectual life of society beyond specialist researchers and to enable Greek and Latin philology to exploit evolving new forms of research based on progress in various computational fields. When senior faculty with permanent positions hand over their work to corporate entities, the situation is much more problematic. Certainly, as a senior professor who is not subject to existential pressures that junior scholars may feel, I don’t see how I can justify handing my work over to commercial entities. I feel that I have an obligation to help the next generation have the freedom to keep the results of their work open and available both to the intellectual life of society as a whole and to the most advanced analytical methods available to researchers.

But even when our field does the right thing for scholarship and society (and I would be disingenuous if I put it any other way), we face the consequences of our past actions. Commercial interests now control a substantial amount of the work that we have done, whether or not we did that work with public money or even if we may have ignored clear conditions on research funding that the results needed to be available under an open access license. (A review of funding decisions at various agencies may reveal a systematic pattern where domain experts voted to fund research projects that they knew would be handed over to commercial interests even when the regulations governing that funding prioritized, even where they did not explicitly mandate, publishing research results under an open license).

I was fortunate in that I began my own work developing corpora after legal issues began to emerge from the first efforts at sharing digital corpora. When humanists first began developing textual databases in the 1970s and 1980s, scholars had little understanding of copyright law (which, one could argue, really means that copyright law often does not reflect scholarly standards). Many assumed that the reconstructed texts in Classical Greek and Latin critical editions are in the public domain. The fact that a preponderance of experts in the field made this decision — in fact, operated under this assumption — provides evidence about what copyright law should dictate. In fact, explicit legislation does enable editors in some countries to exercise monopoly control over reconstructed texts for a period of time. I don’t know any editors who personally use that right to restrict access to their work — all the editors I know want their work to circulate as widely as possible. But editors sign contracts that give commercial publishers exclusive rights to their work. These publishers have lawyers and, if the perceived loss justifies the investment in legal fees, they can sue individual scholars. Even when textual data is in the public domain, commercial vendors (whether belonging to a for-profit corporation or a non-profit university) can (and often will) sue those who redistribute that public domain data on the basis of contract law. We work hard to make sure that we respect both copyright and contract law.

Given sufficient funding, the following categories of data can be digitized and made available as open data under the kind of CC license upon which modern philology must depend:

Reconstructed texts: Reconstructed texts constitute the running text as reconstructed in an edition, without the accompanying textual notes, modern-language translations, introduction, etc. We can use scientific editions from Germany that were published 25 or more years ago (thus, in early 2015 we can use scientific editions published through the beginning of 1990). The EU has passed a regulation allowing its member nations to exert such copyright for up to 30 years, but Germany has not taken advantage of this opportunity, nor has any other major producer of Greek and Latin texts. For pragmatic purposes, we will initially assume that every other nation but Germany (where open access and open data have strong public and political support) is liable to enact such a law. We will thus focus in 2015 on digitizing European editions outside of Germany published through 1985, in 2016 through 1986, etc. Here the goal is to have as many TEI XML transcriptions as possible and to help researchers visualize the degree to which editions differ and to compare them.
Textual notes: The argument has been made that the textual notes are not part of the reconstructed text and constitute a separate copyrightable work. Insofar as textual notes are a scholarly activity, they should aspire to be an annotated database and thus should receive only 15 years of protection under EU database regulations (http://ec.europa.eu/internal_market/copyright/prot-databases/). The argument has also been made that the textual notes not only do not belong to a scientific edition but also constitute another form of creative expression, and that commercial publishers should be able to monopolize them for the life of the editor plus 70 years. We will, for now, focus on mining textual notes from editions whose editors died 70 or more years ago. In practice, that means that we are working with the apparatus criticus of editions published in the 1920s and 1930s. Here our goal is to have a maximally clean searchable text, but not to add substantive TEI XML markup that captures the structure of the textual notes — the structure of these notes tends to be complicated and inconsistent. Our pragmatic goal is to support “image-front searching”, so that scholars can find words in the textual notes and then see the original page images.

Given the legal constraints outlined above, and assuming that we had the resources to create machine-actionable versions of all publicly accessible textual data, what is the best way of representing the data that commercial licenses restrict?

Strategy one: Support advanced graduate students and a handful of supervisory faculty to go through reviews of recent editions, identifying those editorial decisions that were deemed most significant. The output of this work would be an initial CC-BY-SA series of machine-actionable commentaries that could automatically flag all passages in the CC-BY-SA editions where copyrighted editions made significant decisions. In effect, we would be creating a new textual review series. Because the textual commentaries would be open and available under a CC-BY-SA, members of the community could suggest additions to them or create new expanded versions or create completely new, but interoperable, textual commentaries that could be linked to the CC-BY-SA texts.

Here the goal is to create an initial set of data about textual decisions in copyrighted editions and a framework that members of the community can extend. If members of the community feel that important textual data should be made available, they can make it available. If no one feels that it is important to make the data available, then the data is, by definition, not that important. The plan is to create a self-regulating environment. An open framework can evolve as members of the community wish. In this plan, we start a light-weight, easily expanded and duplicated process that others can copy.

We can summarize this as a Darwinian strategy. We may have to take a step back and lose some more recent textual data in order to open up the overall corpus, but the lost textual data is not, itself, subject to copyright (copyright protects original expression). The hypothesis is that an open field will outperform a closed field, that the open field will replace what it considers to be lost textual data, and that it will ultimately (perhaps very quickly) outperform the closed system.

This strategy has at least two advantages. First, if funding were secured, that funding could help rising Greek and Latin philologists perform the task of creating the initial textual commentaries, thus immersing a new generation in the basic methods of representing textual data in a machine actionable form (and giving them a position where they have an opportunity to learn quite a bit of Greek and/or Latin). Second, we do not need to create a comprehensive set of textual commentaries. We need to create a critical mass that demonstrates the utility of such commentaries.

Strategy two: How many editions that are owned by commercial entities are so crucial to the mainstream study of Greek and Latin that it is worth trying to negotiate the rights and expend the time and money to produce CC-licensed TEI XML versions? The upper bound for such a purchase might be the cost of paying for the production of a new open access book (up to 10,000 British pounds). Since commercial publishers have published several hundred editions in the last 25 or 30 years, paying for the rights to all recent editions would cost millions of euros and is clearly not a reasonable option. If publishers do not offer reasonable terms and the new editions are of critical importance, then members of the community will simply have to create new editions that integrate the most valuable findings from the restricted editions — that is, after all, the sort of thing that we are paid to do. But it might be possible to justify purchasing the rights to a few.

What editions might warrant such special treatment and why?

Conversely, how worthwhile is it for us to worry about editions published after c. 1985? Would it be better to focus on providing comprehensive coverage of editions through 1985 with the assumption that if the recent data is sufficiently important, then we can let members of the community fill in the gaps?

Ironically, I think that the best way to liberate textual data from corporate control is to demonstrate that life will go on without it and thus to destroy its value as a revenue-generating asset. We can use the reconstructed texts from Germany through 1990 and from the rest of Europe at least through 1985. While much has been done since then, and it would be a shame if we could not immediately use it in our analysis of the ancient world, I became a professor in 1985 and I do not think that the quality of the textual editions available to us was a major limiting factor on the quality of our research at the time. We can start the process of identifying significant textual decisions in copyrighted editions. Where editors have produced radically new editions, we can try to secure the rights, but the best way to free commercially controlled texts is to move forward with what we have.

Members of the community are, of course, free to make a case that research funding from private and public sources should be used to subsidize commercial services or even websites that provide free services but do not make their data available. Those who feel this way should make the case as fully as possible. I have heard the argument that we must under no circumstances go backwards and lose access to the most up-to-date texts but, unfortunately, we have already lost control over that access, and we lost it years after it was possible for us to do otherwise (the Text Encoding Initiative was documenting methods for machine-actionable editions in the late 1980s) and after generalized models for open licenses had appeared (CreativeCommons.org released its first licenses in 2002). We could have acted differently a decade ago and we have, for the most part, not chosen to produce editions that are modern in format and accessible to a global audience. If we think that specialists at well-funded academic institutions alone need access to the best textual data, we should express that position clearly, so that the federal funding agencies and private foundations know where we stand.

I don’t see an easy solution for rescuing data that we have given to commercial organizations but we should hear the arguments and proposals — and then act. Business as usual simply digs us into a deeper hole. Even if some of us may disagree with the case as a whole, a well-articulated case for sticking with privatized textual data may more clearly articulate issues that we need to address in shifting to an open philology.

Please send your suggestions to crane@informatik.uni-leipzig.de — or, better still, send a link to a public version of your thoughts. I will summarize initial suggestions in a subsequent blog post in May 2015.

Join us for Sunoikisis DC 2015

Sunoikisis is a successful national consortium of Classics programs developed by Harvard’s Center for Hellenic Studies. The goal is to extend Sunoikisis to a global audience and to contribute to it with an international consortium of Digital Classics programs (Sunoikisis DC). Sunoikisis DC is based at the Alexander von Humboldt Chair of Digital Humanities at the University of Leipzig. The aim is to offer collaborative courses that foster interdisciplinary paradigms of learning. Master’s students of both the humanities and computer science are welcome to join the courses and work together by contributing to digital classics projects in a collaborative environment.

Sunoikisis DC will start in the summer semester of 2015 with a Digital Classics course at the University of Leipzig. Faculty members of participating institutions will gather at the University of Leipzig on February 16-18 for a planning seminar in order to discuss course topics, schedule the academic calendar, and construct the course syllabus. The seminar is organized by the Alexander von Humboldt Chair of Digital Humanities at the University of Leipzig in collaboration with the Center for Hellenic Studies and Perseids.

Sunoikisis DC Planning Seminar 2015
February 16-18, 2015
(full program)

Felix-Klein-Hörsaal (5th floor)
Paulinum, Main Building
Universität Leipzig
Augustusplatz 10-11 – 04109 Leipzig

A Tenure Track Job in the US, Anti-Islamification Demonstrations in Germany, and the Redefinition of Classics

Gregory Crane
Perseus Project and the Open Philology Project
The University of Leipzig and Tufts University

The Department of Classics at Tufts University is looking at candidates for a tenure-track assistant professorship focused on Greco-Roman and Islamic cultures. Since the demonstrations against Islamification in Germany became prominent in Dresden and have now cropped up in Leipzig, my German home, I have been thinking about the connection between the two. This position can do a lot more now if it properly exploits digital media and helps to change the public understanding of what we in Europe and North America already owe to the achievements of Islamic culture.

A draft of a blog on this topic is available at http://tinyurl.com/p63sdm5