Topic Modelling of Historical Languages in R

By Thomas Köntges

This is a quick note on, and introduction to, topic-modelling historical languages in R. It is intended to supplement three publications forthcoming in 2016: one for the AMPHORAE issue of the Melbourne Historical Journal; one for Alexandria: The Journal of National and International Library and Information Issues (currently under review); and one for DH2016. This blog entry also summarises some points I have made in several talks over the past few months about topic-modelling historical languages (including my talk at the Analyzing Text Reuse at Scale / Working with Big Humanities Data workshop during the DH Leipzig Workshop Week in December 2015). It is therefore a short summary of the more important points made previously; in contrast to the specific applications covered in each of the articles, it provides an overview of the subject.

My work on topic-modelling did not start in Leipzig. Rather, it was part of a project I worked on during my time as a research associate at the Victoria University of Wellington (VUW), New Zealand, in 2013: the Digital Colenso Project. Back then I thought that there was only one ideal number of topics for each corpus, and I used Martin Ponweiser’s harmonic-mean method (see chapters 3.7.1 and 4.3.3 in his master’s thesis) to attempt to determine this ideal. Although this approach was useful, albeit slow, for the Digital Colenso Project, I now think that this assumption was wrong, because the ideal topic granularity depends more on the research question and use-case of the application of topic modelling to a certain corpus than on...