Latin Literature and the Leipzig Corpus Miner: an example of NLP techniques applied to a historical language

Celano, Giuseppe G. A.
Leipzig University, Germany
giuseppegacelano@gmail.com

Niekler, Andreas
Leipzig University, Germany
aniekler@informatik.uni-leipzig.de

This contribution presents the early phase of ongoing work to integrate the Perseus Digital Latin Library into the Leipzig Corpus Miner (LCM). The Perseus collection is an open-source XML corpus of more than 10 million words attesting Latin literature from the early writers (c. 200 BC) to those of the Imperial Period (c. 200 AD). We aim to employ the LCM as a heuristic tool to study language variation and discover (new) topics in Latin literature.

The LCM is a text mining infrastructure providing access to a variety of data analyses via a GUI. It stores data that can be inspected for information retrieval or lexicometrics. The former is based on full-text indexes and customizable dictionaries, while the latter consists of frequency and co-occurrence analyses, as well as the extraction of key terms. More powerfully, the LCM can perform topic modeling with different statistical methods, which detect latent variables in both single texts and whole collections. It is also possible to add new annotations to the data, which provide the basis for various classifications. Notably, the LCM also supports a variety of visualizations.
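To illustrate the statistic underlying co-occurrence analysis in general, the following minimal Java sketch counts how often two (lemmatized) tokens occur in the same sentence. It is a stand-alone illustration of the technique, not LCM code, and all names in it are ours.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class CooccurrenceSketch {

        // Counts how often each unordered pair of distinct tokens
        // appears within the same sentence across the corpus.
        public static Map<String, Integer> count(List<String[]> sentences) {
            Map<String, Integer> counts = new HashMap<>();
            for (String[] tokens : sentences) {
                for (int i = 0; i < tokens.length; i++) {
                    for (int j = i + 1; j < tokens.length; j++) {
                        // Order the pair alphabetically so "a|b" and "b|a" coincide
                        boolean ordered = tokens[i].compareTo(tokens[j]) <= 0;
                        String a = ordered ? tokens[i] : tokens[j];
                        String b = ordered ? tokens[j] : tokens[i];
                        counts.merge(a + "|" + b, 1, Integer::sum);
                    }
                }
            }
            return counts;
        }

        public static void main(String[] args) {
            List<String[]> corpus = new ArrayList<>();
            corpus.add(new String[] {"arma", "vir", "cano"});
            corpus.add(new String[] {"arma", "cano"});
            // Prints, among others, "arma|cano=2"
            count(corpus).forEach((pair, n) -> System.out.println(pair + "=" + n));
        }
    }

Co-occurrence tools typically derive association measures (e.g., Dice or log-likelihood) from such raw counts rather than reporting them directly.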

We assume that applying these NLP technologies to Latin literature can help us identify new directions of research, as well as shed new light on a large number of old questions concerning linguistic change, literary genre classification, authorship attribution, and text reuse.

Since part-of-speech tagging is necessary for the LCM to perform more accurate analyses, we have so far focused on porting the subset of the Latin collection that has been semi-automatically annotated for morphosyntax, the Latin Dependency Treebank. The morphological annotation provides the fine-grained part of speech and the lemma for each word, while the syntactic annotation consists of labeled dependency relations (i.e., which word depends on which word, and with which function).
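To make this concrete, below is a simplified sketch of a sentence ("Gallia est omnis divisa", Caes. B.G. 1.1) in the treebank's XML format; the attribute values are illustrative rather than quoted from the corpus. Each word element records the surface form, the lemma, a nine-position morphological tag (postag, covering part of speech, person, number, tense, mood, voice, gender, case, and degree), the id of its syntactic head (0 for the root), and the dependency label.

    <sentence id="1" subdoc="1.1">
      <word id="1" form="Gallia" lemma="Gallia" postag="n-s---fn-" head="4" relation="SBJ"/>
      <word id="2" form="est"    lemma="sum"    postag="v3spia---" head="4" relation="AuxV"/>
      <word id="3" form="omnis"  lemma="omnis"  postag="a-s---fn-" head="1" relation="ATR"/>
      <word id="4" form="divisa" lemma="divido" postag="v-srppfn-" head="0" relation="PRED"/>
    </sentence>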

These annotated data have allowed us to train a tokenizer, a sentence splitter, and a PoS tagger via the Apache OpenNLP library, a machine-learning toolkit on which the LCM relies to ingest data before analysis. The models we have built will be used to automatically annotate more texts. We document the issues encountered in this preliminary phase and show some examples of automatic analyses that can be performed with the LCM.
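As a sketch of how such a model can be trained with OpenNLP's standard API, the following example builds a Latin PoS tagger with POSTaggerME. The training file name is hypothetical, and the treebank XML must first be converted to OpenNLP's one-sentence-per-line word_tag format.

    import java.io.File;
    import opennlp.tools.postag.POSModel;
    import opennlp.tools.postag.POSSample;
    import opennlp.tools.postag.POSTaggerFactory;
    import opennlp.tools.postag.POSTaggerME;
    import opennlp.tools.postag.WordTagSampleStream;
    import opennlp.tools.util.MarkableFileInputStreamFactory;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;
    import opennlp.tools.util.TrainingParameters;

    public class TrainLatinPosTagger {
        public static void main(String[] args) throws Exception {
            // Training data in OpenNLP's word_tag format, one sentence per line,
            // e.g. "Gallia_n-s---fn- est_v3spia--- omnis_a-s---fn- ...";
            // the file name is hypothetical.
            ObjectStream<String> lines = new PlainTextByLineStream(
                    new MarkableFileInputStreamFactory(new File("ldt-pos.train")), "UTF-8");
            ObjectStream<POSSample> samples = new WordTagSampleStream(lines);

            // Train a tagger model for Latin ("la") with default parameters
            POSModel model = POSTaggerME.train("la", samples,
                    TrainingParameters.defaultParams(), new POSTaggerFactory());

            // Tag an unseen sentence with the trained model
            POSTaggerME tagger = new POSTaggerME(model);
            String[] tokens = {"arma", "virumque", "cano"};
            String[] tags = tagger.tag(tokens);
            for (int i = 0; i < tokens.length; i++) {
                System.out.println(tokens[i] + "/" + tags[i]);
            }
            // TokenizerME and SentenceDetectorME offer analogous train()
            // methods for the tokenizer and the sentence splitter.
        }
    }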