Topic Modelling of Historical Languages in R

By Thomas Köntges

This is a quick note and introduction to topic-modelling historical languages in R, intended to supplement three publications forthcoming in 2016: one for the AMPHORAE issue of the Melbourne Historical Journal, one for Alexandria: The Journal of National and International Library and Information Issues (currently under review), and one for DH2016. This blog entry also summarises some points I have made in several talks over the past few months about topic-modelling historical languages (including my talk at the Analyzing Text Reuse at Scale / Working with Big Humanities Data workshop during the DH Leipzig Workshop Week in December 2015). It is therefore intended as a short summary of some of the more important points previously made; in contrast to the specific applications covered in each of the articles, it provides an overview of the subject.

My work on topic-modelling did not start in Leipzig; rather, it was part of a project I worked on during my time as a research associate at the Victoria University of Wellington (VUW), New Zealand, in 2013: the Digital Colenso Project. Back then I thought that there was only one ideal number of topics for each corpus, and I used Martin Ponweiser’s harmonic-mean method (see chapters 3.7.1 and 4.3.3 of his master’s thesis) to attempt to determine this ideal. Although this approach was useful, albeit slow, for the Digital Colenso Project, I now think that the underlying assumption was wrong, because the ideal topic granularity depends more on the research question and the use-case to which topic modelling is applied than on the data itself. To illustrate this, I will showcase the results of my topic-modelling research during my 2015 visiting fellowship at VUW. This research was undertaken in collaboration with staff at the Alexander Turnbull Library, National Library of New Zealand (ATL), and was subsequently applied to the Open Philology Project (OPP) of the Department of Digital Humanities at the University of Leipzig, Germany.

After a brief introduction to the research projects in Wellington and Leipzig and to topic-modelling itself, this blog entry will summarize the limitations of topic-modelling, with special emphasis on how to determine an ideal number of topics, and briefly discuss morphosyntactic normalization and the use of stop words. It will then suggest a researcher-focused method of addressing these limitations and challenges. I will then briefly demonstrate the method’s applicability to the different use-cases at ATL and OPP, which deal with very different fields and languages, including English, Latin, Ancient Greek, Classical Arabic, and Classical Persian. I will finish by stressing how digital humanities research results and practices can be improved by enabling humanities researchers who focus on more traditional and qualitative analyses of corpora to use the quantitative method of topic-modelling as a macroscope and faceting tool.

During my research stay at VUW I worked with the Research Librarian for cartoons at ATL, Dr Melinda Johnston, on a mixed-methods analysis of the reactions of cartoonists and New Zealand print publications to the Canterbury earthquakes of 2010 and 2011. ATL is part of the National Library of New Zealand, an institution that is interested in making the country’s cultural heritage more accessible to a digital audience and to researchers. Within this short project I attempted to automate the detection and analysis of earthquake-related content in cartoon descriptions created by ATL and in over 100,000 abstracts produced by Index New Zealand (INNZ); all items were published between September 2010 and January 2014. The INNZ data could be retrieved as a double-zipped XML file from INNZ’s webpage, and ATL’s item descriptions could be queried using the Digital New Zealand (DNZ) API. During the project it became apparent how a topic-modelling approach could considerably speed up the finding and faceting of earthquake-related descriptions and abstracts.

The results were so impressive that I decided to apply the method to Latin and Greek literature in Leipzig’s OPP. OPP has a text collection of over 60 million Greek and Latin words and has recently begun to add Classical Persian and Arabic texts. It is one of the core interests of OPP to produce methods that can swiftly generate results on big data and that can compete with more traditional approaches. OPP is maintained and organized using eXist-db, the CTS/CITE architecture developed by the Homer Multitext Project, and additional web-based tools and services (e.g. Morpheus, a Greek and Latin morpho-syntactic analyser, and GitHub). This structure enables researchers to use a CTS API to retrieve their desired text corpora or specific texts. In a first evaluation run of the topic-modeller, 30,000 Classical Arabic biographies were used. At both research institutions, OPP and ATL, researchers applied more qualitative methods to complement the process and evaluate the results. Because those evaluations were promising, the topic-modelling process will be showcased in what follows.
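To give an idea of what such a CTS retrieval looks like from R, here is a minimal sketch using RCurl and the XML package; the endpoint URL is only a placeholder and the URN is merely an example (the first chapter of Caesar’s De Bello Gallico):

## Minimal sketch of a CTS GetPassage request (the endpoint URL is a placeholder)
library(RCurl)
library(XML)

cts_endpoint <- "http://example.org/exist/rest/db/xq/CTS.xq"        # hypothetical endpoint
urn          <- "urn:cts:latinLit:phi0448.phi001.perseus-lat2:1.1"  # example passage URN

request  <- paste0(cts_endpoint, "?request=GetPassage&urn=", urn)
response <- getURLContent(request)   # raw XML reply from the CTS service
passage  <- xmlParse(response)       # parse it; the TEI text can then be extracted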

Topic modelling is “a method for finding and tracing clusters of words (called “topics” in shorthand) in large bodies of texts”. A topic can be described as a recurring pattern of co-occurring words. Topic models are probabilistic models that typically assume that the number of topics in the corpus is fixed in advance. The simplest, and probably one of the most frequently applied, topic models is latent Dirichlet allocation (LDA). The success and results of LDA rely on a number of variables set a priori: for instance, the number of topics assumed in the corpus, the number of iterations of the modelling process, the decision for or against morpho-syntactic normalisation of the research corpus, and how stop words are implemented in the process. Furthermore, its interpretation is often influenced by how the topics are graphically represented and how the words of each topic are displayed.
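To make these a priori variables concrete, here is a minimal sketch of how they appear when fitting a model with J. Chang’s lda package; the concrete values and the doc_vector object (one character string per citable unit) are only placeholders:

## Minimal sketch: the a priori variables of an LDA run (all values are placeholders)
library(lda)

lex <- lexicalize(doc_vector, lower = TRUE)   # doc_vector: one string per citable unit

K          <- 25     # assumed number of topics in the corpus
iterations <- 500    # number of Gibbs-sampling iterations
alpha      <- 0.02   # document-topic prior
eta        <- 0.02   # topic-term prior

fit <- lda.collapsed.gibbs.sampler(lex$documents, K, lex$vocab,
                                   iterations, alpha, eta,
                                   compute.log.likelihood = TRUE)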

While C. Sievert has already found a very convincing solution for the latter, the former are often out of the hands of the traditional, qualitatively working researcher, and any larger modifications have to be implemented by a computer-savvy researcher. Yet topic-modelling is often not an end in itself; rather, it is a tool used to help answer a specific humanities research question or to facet large text corpora, so that further methods can be applied to a much smaller selection of texts. For example, one could use the theta values (the per-document topic proportions) to measure the topic similarity of two paragraphs in R, in order to detect text reuse or to find similar sentences for teaching purposes:

### Compute the topic similarity of two citable units by
### comparing the mean difference of their theta values for each topic

is_similar <- function(x) {
  # first_element()/last_element() are small helpers defined elsewhere in the script
  # (equivalent to head(x, 1) and tail(x, 1))
  check <- all.equal(theta.frame[which(theta.frame[, 1] == first_element(unlist(x))), ],
                     theta.frame[which(theta.frame[, 1] == last_element(unlist(x))), ])
  # all.equal() reports strings such as "... Mean relative difference: 0.123";
  # extract the numeric differences and average them
  result <- mean(as.numeric(sub(".*?difference: (.*?)", "\\1", check)[3:length(check)]))
  return(result)
} # produces NA if a citable unit is compared with itself
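
Assuming theta.frame holds one row per citable unit, with the identifier in column 1 and the theta values in the remaining columns, the function can be called on a pair of identifiers; the URNs below are only examples:

## Example call with two (hypothetical) citable units from Thucydides
is_similar(c("urn:cts:greekLit:tlg0003.tlg001.perseus-grc1:1.1",
             "urn:cts:greekLit:tlg0003.tlg001.perseus-grc1:1.2"))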

Or find example paragraphs in the corpus that belong to a certain topic:

## Find an exemplar sentence for a given topic
topic_number   <- 1   # topic to illustrate
exemplar_level <- 1   # 1 = the citable unit with the highest theta value for that topic
# order the citable units by their theta value for the chosen topic
# and return the text (column 2) of the unit at the requested rank
corpus[which(corpus[,1] == theta.frame[order(theta.frame[,topic_number+1], decreasing=TRUE),][exemplar_level,1]), 2]

Traditional researchers often have to continue to work with topic-modelling results, but they may not always be aware of the bias that the variables set a priori have introduced into the selection process. One possible way to bridge this gap between researcher and method is to involve the qualitative researcher earlier in the process by giving them agency in the topic-modelling itself. To do this, I have used R and the web-application framework Shiny to combine J. Chang’s LDA and C. Sievert’s LDAvis libraries with DNZ/CTS API requests and language-specific handling of the text data, creating a graphical user interface (GUI) that enables researchers to find, topic-model, and browse texts in the collections of ATL, OPP, and INNZ. They can then export the produced corpus and model, so that they can apply qualitative methods to a precise facet of a large text corpus, rather than to the whole corpus, which contains texts that are irrelevant for answering their specific research question.

On the left side of the GUI, the researcher can set the following variables: a) search term(s) or CTS-URN(s); b) the source collection; c) certain stop-word lists or processes; d) additional stop words; e) the number of topics; and f) the number of terms shown for each topic in the visualization. The application then generates the necessary API requests, normalizes the text as desired by the researcher, applies J. Chang’s LDA library, and finally presents a D3 visualisation of the topics, their relationship to each other, and their terms, using C. Sievert’s LDAvis and the dimension reduction via Jensen-Shannon Divergence and Principal Components implemented in LDAvis. The researcher can then directly and visually evaluate the success of their topic-modelling and use the settings on the left as if they were adjusting a microscope, focusing on certain significant relationships of word co-occurrences within the corpus. Once they have focused their research tool, they can export visualisations, topics, and their corpus for further research. A few example use-cases will be described below, after a short sketch of the visualisation step.
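
The step from a fitted model to the interactive visualisation is essentially one call to LDAvis. A minimal sketch, assuming the fit, lex, alpha, and eta objects from the sketch above:

## Minimal sketch: turning the fitted model into the interactive LDAvis view
## (Jensen-Shannon Divergence + PCA is LDAvis's default scaling method)
library(LDAvis)

theta <- t(apply(fit$document_sums + alpha, 2, function(x) x / sum(x)))  # document-topic proportions
phi   <- t(apply(t(fit$topics) + eta, 2, function(x) x / sum(x)))        # topic-term probabilities

json <- createJSON(phi            = phi,
                   theta          = theta,
                   doc.length     = sapply(lex$documents, function(d) sum(d[2, ])),  # tokens per unit
                   vocab          = lex$vocab,
                   term.frequency = colSums(fit$topics))  # term counts from the final topic assignments
serVis(json)   # opens the D3 visualisation in the browser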

As stated above, one application was building a finding and analytical aid for cartoon descriptions and INNZ abstracts related to the Canterbury earthquakes of 2010 and 2011. One of the original research questions was to what extent cartoonists reacted differently to the Canterbury earthquakes than print journalists. To answer this, over 100,000 abstracts and over 30,000 descriptions of born-digital cartoons had to be evaluated regarding their potential to answer the research question. ATL uses Library of Congress (LOC) based subject headings for its collection items. At first, we thought these could be used to efficiently generate ideal selections of descriptions that fit the research question. We found, however, that LDA topic-modelling was not only quicker (among other things because it needs less human intervention), but also better able to identify trends over time: for example, it showed a much more realistic topic development between texts that dealt with the initial destruction and texts about the rebuild, while an LOC subject-heading-based analysis required much manual labour and yielded a different (unexpectedly wrong) result. A more thorough discussion can be found in the forthcoming Alexandria article.

The other application of the topic-modelling tools was a set of further use-cases in which OPP corpora were processed: one Latin, one Greek, and one Arabic (please find the most recent, better documented approach for Greek here and for Latin here). In these morphologically more complex languages, special emphasis had to be placed on the influence of morpho-syntactic normalisation, that is, the reduction of different inflected forms of the same word to the same morphological base (the so-called “dictionary form” of a word). This normalisation by reduction can contribute immensely to the success of topic modelling. The degree of its impact, however, is language-dependent; more specifically, it depends on the kind of information that is lost during the normalisation process in a specific language and on which tools are available to reduce the morphological complexity of that language. For instance, while it is useful for Ancient Greek and Latin to normalise by reducing morphological complexity, because the frequency of a term then becomes more detectable and usable (see the Greek topic-modelling results for Thucydides and also for Caesar’s De Bello Gallico), there are reasons why it might be better not to normalise Classical Arabic texts in the same way: Classical Arabic verb forms express the gender of the subject of the sentence, making it possible to easily detect, for example, biographies of women using topic-modelling (a link to Dr Maxim Romanov’s blog will be posted here once it is published, but also see figure 1).

Figure 1. LDA topic-model generated from around 30,000 Classical Arabic biographies. Topic 20 “Biographies of women” is selected.

Currently, the R script uses the Perseids Morphological Service API to find a vector of possible lemmata for each word:

parsing <- function(x) {
  # getURLContent() comes from the RCurl package; morpheusURL, corpus_words,
  # and the XML helper XMLpassage2() are defined earlier in the script
  word_form <- x
  URL <- paste(morpheusURL, word_form, "&lang=grc&engine=morpheusgrc", sep = "")
  message(round((match(word_form, corpus_words) - 1) / length(corpus_words) * 100, digits = 2),
          "% processed. Checking ", x, " now.")

  # request the morphological analysis; retry once after a short pause,
  # and fall back to the original word form if the service cannot be reached
  URLcontent <- tryCatch({
    getURLContent(URL)
  }, error = function(err) {
    tryCatch({
      Sys.sleep(0.1)
      message("Try once more")
      getURLContent(URL)
    }, error = function(err) {
      message("Return original value: ", word_form)
      return(word_form)
    })
  })

  if (URLcontent == "ServerError") {
    lemma <- x
    message(x, " is ", lemma)
    return(lemma)
  } else {
    if (is.null(XMLpassage2(URLcontent))) {
      # no lemma in the reply: keep the original word form
      lemma <- x
      message(x, " is ", lemma)
      return(lemma)
    } else {
      lemma <- tryCatch({
        XMLpassage2(URLcontent)
      }, error = function(err) {
        message(x, " not found. Return original value.")
        lemma <- "NotFound1"
        message(x, " is ", lemma)
        return(lemma)
      })

      # clean the candidate lemmata: drop homograph numbers, lower-case, de-duplicate
      lemma <- gsub("[0-9]", "", lemma)
      lemma <- tolower(lemma)
      lemma <- unique(lemma)
      # lemma may be a vector of candidates; keep the word form if the first one is empty
      if (nchar(lemma)[1] == 0) {
        lemma <- x
        message(x, " is ", lemma)
        return(lemma)
      } else {
        message(x, " is ", lemma)
        return(lemma)
      }
    }
  }
}

The script then chooses the candidate lemma that is most frequent across the corpus. Because vocabulary repeats within the same corpus and because different morphological forms of the same word yield different vectors of potential lemmata, this solution works well for many classical authors. That said, it is very slow! The slowest parts of the R script are the retrieval of the text itself and the retrieval of the lemma vectors for each word. The former could be addressed either by a local installation of eXist-db and the Perseus/OPP texts or by modifying the API to allow requesting the whole text of an author in one call. The latter could be addressed either by a local installation of Morpheus or, even better, by new parsers that are finite-state transducers, like Harry Schmidt’s Parsley written in Go or Neel Smith’s adaptation for Greek. Currently, however, a researcher depends on morphological services that are delivered via individual API requests. The success of sending thousands of requests to the same API service obviously depends not only on the topic-modelling script itself, but also on the stability of the connection to that service: connections can time out, and error handling as well as testing of the corrected corpus are needed to ensure the success of the morphological normalisation and the data quality of the newly generated parsed corpus.
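
One way to read this disambiguation step is the following minimal sketch; parsed_list is a hypothetical name for the list that holds, for every word form in the corpus, the vector of candidate lemmata returned by the parsing function:

## Minimal sketch of the "most frequent candidate" disambiguation
## parsed_list (hypothetical): one character vector of candidate lemmata per word form
lemma_counts <- table(unlist(parsed_list))   # corpus-wide frequency of every candidate lemma

disambiguate <- function(candidates) {
  # choose the candidate that occurs most often across the whole corpus
  candidates[which.max(lemma_counts[candidates])]
}

lemmata <- vapply(parsed_list, disambiguate, character(1))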

Here is a possible way of comparing the data-quality of the morphologically normalised corpus with the original corpus (see the parsing function above for handling possible connection problems):

### Compare the length of corpus and corpus_parsed
### (check that normalisation did not change the number of words per citable unit)

length_corpus <- length(corpus) / 2   # corpus has two columns: identifier and text
test_corpus_length <- vector()
for (i in 1:length_corpus) {
  # TRUE if the i-th citable unit has the same number of words before and after parsing
  test_corpus_length[i] <- length(unlist(strsplit(as.character(unlist(corpus_parsed[i, 2])), "[[:space:]]+")[1])) ==
    length(unlist(strsplit(as.character(unlist(corpus[i, 2])), "[[:space:]]+")[1]))
}
table_corpus_length <- table(test_corpus_length)     # counts of TRUE/FALSE
bug_report <- which(test_corpus_length == FALSE)     # rows that need re-parsing
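
The two resulting objects can then be inspected directly; table_corpus_length shows how many citable units kept their word count, and bug_report points to the rows that need re-parsing:

## Inspect the outcome of the comparison
print(table_corpus_length)   # counts of TRUE (unchanged length) and FALSE (changed length)
corpus[bug_report, 1]        # identifiers of the citable units that should be re-parsed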

For both tasks, error handling and the testing of data quality, most researchers would welcome help, or a solution where this is already done for them, whether they are just interested in creating a topic model of a corpus, see topic-modelling only as a step towards further research (e.g. social network analysis or text reuse), or use it as a tool (e.g. to find sentences similar to already-read passages in order to test students’ language comprehension). Offering these kinds of services has frequently been discussed in our team and in the extended Digital Classics universe, and it is not unlikely that researchers will have those resources in the future, especially because this is, at least in the research of historical languages, a finite problem.

In any case, as the reader can see, it is an exciting time to work on topic-modelling historical languages, given that almost every month better tools and APIs appear that can be used to speed up the code, and given the push from traditional researchers to better understand topic-modelling and to use its results in their research and teaching. I hope that this short blog entry has shown a few examples and contributed to a better understanding of how topic-modelling, in all its complexity, can be opened up to, and its development positively influenced by, more traditional researchers with little computing experience, in turn enabling them to answer specific research questions based on large text corpora.
