Deriving Players & Themes in the Regesta Imperii using SVMs and Neural Networks

The Regesta Imperii (RI) are an important source for research in European medieval history. Sources spread over many centuries of medieval history – mainly charters of the German-Roman Emperors – are summarized as "Regests" and pooled in the RI. Interesting medieval demographic groups and players are, i.a., cities, citizens or spiritual institutions (e.g. bishops or monasteries). Themes of historical interest are, i.a., peace and war or the endowment of new privileges. We investigate the RI for important players and themes, applying state-of-the-art text classification methods from computational linguistics. We examine the performance of different classification methods in view of the linguistically very heterogeneous RI, including a Neural Network approach that is designed to capture complex interactions between players and themes.


Introduction
The Regesta Imperii (RI)¹ are considered a fundamental, autonomous source for German and European history. They extend over many centuries, from the Karolinger dynasty to Maximilian I, from around 800 to 1500 AD. The RI have their roots in the 19th century, when the German librarian Johann Friedrich Böhmer started to collect and document the charters (including known and possibly unknown fakes) of the German-Roman emperors, in terms of so-called Regests. The Regests contain the relevant judicial content of the referenced charters (cf. Zimmermann (2000), Niederkorn (2005), Rübsamen and Kuczera (2006)). A royal charter was created, for example, when an emperor decided to give a land grant, or privileges such as new rights to one of his landlords or cities.

Figure 1: Regests across time. Also shown: ratios of Regests in which the terms "Friedrich II." (triangles) and "Friedrich III." (circles) occur. The names of these German-Roman kings are examples of concepts which are rather confined in time in the RI.

¹ http://www.regesta-imperii.de/cei
Covering about 13 million tokens, the RI constitute a large-scale resource that is still growing today². The 129,504 Regests we have access to can be treated as a collection of corpora (e.g., one corpus for each Roman-German emperor dynasty), or as a single corpus covering all collected materials. Our work treats the RI as a single corpus. The RI comprise texts written in different German varieties, as well as Latin. Often we find up to three different languages or varieties within a single Regest.
As seen in Figure 1, the Regests are not evenly distributed over time but have their greatest mass from about 1200 to 1500 AD. Many terms and concepts only occur in certain times. An overview of corpus statistics is given in Table 1.

  # Regests                  129,504
  # types                    ≈ 407,000
  # tokens                   ≈ 13,000,000
  mean length (in tokens)    ≈ 85
  median length (in tokens)  ≈ 52
  ttr_log                    ≈ 0.79
  ttr_log SDeWaC             ≈ 0.68

Table 1: Corpus statistics for the RI at the time we used it. ttr_log = log(#types) / log(#tokens) is the logarithmic type-token ratio; taking the logarithm allows better comparison between corpora of different sizes. SDeWaC is a German corpus comprising 44 million sentences crawled from the internet.

The high logarithmic type-token ratio (ttr_log) supports the observation that the language of the RI is highly heterogeneous: although the domain of the RI is rather focused (abstracts of medieval charters), the ratio is notably higher than what we find in the contemporary German SDeWaC corpus³.
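The logarithmic type-token ratio from Table 1 is straightforward to compute; a minimal sketch using the approximate figures reported above:

```python
import math

def ttr_log(n_types: int, n_tokens: int) -> float:
    """Logarithmic type-token ratio: log(#types) / log(#tokens)."""
    return math.log(n_types) / math.log(n_tokens)

# Approximate figures for the RI from Table 1:
print(round(ttr_log(407_000, 13_000_000), 2))  # ≈ 0.79
```

Because both counts enter through their logarithms, the ratio changes only slowly with corpus size, which is what makes it comparable across corpora as different in size as the RI and SDeWaC.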
A Regest itself is a very unique form of document, and some Regests are not easy to comprehend even for humans. Consider the following example. The Regest describes an action of King Karl IV. in 1332, in Parma, Italy: Karl IV. acknowledges that he owes "Johann de Landulphis" "achtzig goldgulden" (eighty gold coins) for wages and "sechzig goldgulden" (sixty gold coins) for reasons which are rather difficult to interpret: "(...) wegen versendungen desselben schuldig zu sein" (interpretable as wages and travel expenses). Beyond that, the Regest contains information in Latin ("iudici et auditori curie paterne et sue"), plus references and meta information (in its last sentence).
  abbreviation   groups and themes traced in the RI
  b0             nobility, nobles
  b1             spiritual institutions
  b2             lesser nobles
  b3             city, citizens
  b4             Jews
  b5             women
  b6             new privileges
  b7             confirmation of privileges
  b8             land grants, land bestowal
  b9             finances
  b10            justice
  b11            war and peace

Table 2: Traced demographic groups and themes.
A further hint is given by the "de" in the name "Johann de Landulphis", who is promised money by the king. The Latin "de" in the middle of a name generally suggests that the person belongs to the class of nobles, as in "Elizabeth of (= de) England". So, one may conclude that in the above Regest, the players are nobles, acting under the theme finances.
Our aim is to trace within the RI interesting demographic groups joint with the themes of their interactions. We aim to identify which Regest is about which theme(s) and group(s), to perform interesting data analysis, e.g. visualizing the importance of different groups and themes not only in relation to time but also in relation to other factors such as issuer, location, and possibly more.
With the support of a domain expert we determined interesting demographic groups (players) and themes which play a role in the Regests. All players and themes can be treated as individual binary classification problems. An overview is given in Table 2. It can be interesting, e.g., to relate occurrences of city or citizens to occurrences of privileges with respect to time, thus approximately tracing the development of privileges for cities. A Regest can be labeled with zero up to all of the 12 selected labels; thus, there exist many possible label combinations. We cast the labeling problem as a multi-label document classification task, allowing several labels (i.e. groups and themes) to be assigned to a single document (i.e. Regest).
For automatic pattern recognition on this historic data, we deploy four state-of-the-art text classification methods: (i) Support Vector Machines (SVMs) for binary classification; (ii) Semi-Supervised SVMs (S³VMs), to exploit the large amount of unlabeled data; (iii) a Neural Network as a meta-learner applied to the SVM outputs (do the groups and themes influence each other?); and (iv) a Convolutional Neural Network (CNN) classifier with pre-trained word vectors as input, which operates directly on the input documents.
We evaluate all methods on a manually labeled test set and perform data analyses on the full RI to illustrate its usage in Digital Humanities research.

Related Work
To the best of our knowledge, no (published) research has yet been conducted in the Digital Humanities community on NLP for the RI. Kuczera (2015) experimentally transfers attributes and relations between entities from the times of Friedrich III. (i.e. a subset of the RI) into a graph database and shows how historians could profit from the possibilities offered by such structured data repositories. Ruotsalo et al. (2009) suggest that knowledge- and machine-learning-based NLP methods can help with complex annotation tasks in the cultural heritage domain. Their experiments demonstrate that automatic annotation of certain roles in artwork descriptions closely matches the performance of human annotators.

Piotrowski (2012) gives an overview of the manifold challenges in applying NLP to historical documents. He reports that the effectiveness of normalization strongly depends on text type and language, and that satisfying results are achieved mainly on more recent texts. Piotrowski concludes that "the highly variable spelling found in many historical texts has remained one of the most troublesome issues for NLP". We therefore chose a procedure that does not depend on normalized texts.

Massad et al. (2013) give an overview of the processing of recorded-history texts. They examined a graph-based approach and an approach based on NLP. In their NLP experiments they analyzed the Wikipedia corpus with respect to time, relating specific strings and n-grams to time and page edits. The authors suggest that future research should focus on data analysis, trends and, most importantly, access to historic corpora spanning a larger time span than the corpus employed in their experiments. We think that our research covers these aspects. Meroño-Peñuela et al. (2014), in their survey on History and Computing, propose NLP methods for dealing with raw corpora, yet do not propose specific tools due to the manifold decisions to be taken, which strongly depend on the nature of the data.

Approach
The aims of our work are two-fold. On the application side, we aim to discover structures involving players and themes over time in the RI. On the methods side, we investigate to what extent Neural Networks (NNs) are capable of learning complex relationships between players and themes, beyond the capacity of ordinary SVM classifiers that treat each classification label independently. E.g., if nobles play a role in a given Regest, it seems more likely that it is about bestowal of land rather than, e.g., justice, which presumably concerns other groups equally. We compare two architectures: a NN that builds on the output of independent binary SVM classifiers, in addition to other information such as document vectors, in contrast to a full-fledged Convolutional Neural Network (CNN).

Preprocessing
Given the heterogeneous nature of the RI, we do not perform major pre-processing of the data. The Regests are only tokenized and converted to lower case. Thereafter they are mapped to boolean and tf-idf vectors of dimension 2,000 and 10,000. The value at index i of a boolean vector representing document d encodes whether the term represented by i appears in d (1) or not (0). Tf-idf is similar but assumes that words which appear in many documents are less informative, and hence their respective vector value is decreased.
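The two document representations can be sketched in pure Python; this is a minimal illustration with a tiny toy corpus and vocabulary, not the real pipeline (which used vocabularies of 2,000 and 10,000 terms):

```python
import math
from collections import Counter

def tokenize(text: str) -> list[str]:
    # Regests are only lower-cased and tokenized; no normalization.
    return text.lower().split()

def build_vocab(docs, size):
    # Rank terms by document frequency, keep the `size` most frequent.
    counts = Counter(t for d in docs for t in set(tokenize(d)))
    return [t for t, _ in counts.most_common(size)]

def boolean_vector(doc, vocab):
    # 1 if the term occurs in the document, 0 otherwise.
    toks = set(tokenize(doc))
    return [1 if t in toks else 0 for t in vocab]

def tfidf_vector(doc, docs, vocab):
    # Term frequency, down-weighted by how many documents contain the term.
    tf = Counter(tokenize(doc))
    n = len(docs)
    vec = []
    for t in vocab:
        df = sum(1 for d in docs if t in set(tokenize(d)))
        idf = math.log(n / df) if df else 0.0
        vec.append(tf[t] * idf)
    return vec

docs = ["der king gibt privileg", "der king gibt land", "iudici et auditori"]
vocab = build_vocab(docs, 10)
print(boolean_vector(docs[0], vocab))
```

Terms that occur in every document receive idf = log(1) = 0 and are thus effectively ignored by the tf-idf representation.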

Using SVMs and S³VMs
SVMs are binary maximum-margin classifiers that can be extended to the multi-label case by training one SVM per label. Semi-supervised SVMs (S³VMs) work by forcing the hyperplane that separates the labeled data with a margin also through low-density regions of the feature space, making use of the cluster hypothesis (Chapelle et al., 2008). S³VMs have been shown to be very successful especially when only few labeled training examples are available (Sindhwani and Keerthi, 2006). The downside is that the S³VM optimization problem loses the global optimality of the standard SVM problem.
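The one-SVM-per-label scheme can be sketched as follows, assuming scikit-learn is available; the data here is synthetic (three toy labels derived from three feature columns), not the RI features:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy multi-label setup: one independent binary SVM per label,
# mirroring the independent-classifier idea (here just 3 labels).
rng = np.random.default_rng(0)
X = rng.random((40, 6))
Y = (X[:, :3] > 0.5).astype(int)  # 3 binary labels, one per column

clfs = [LinearSVC(C=1.0, max_iter=5000).fit(X, Y[:, j]) for j in range(Y.shape[1])]

def predict_labels(x):
    """Run every per-label SVM on a single document vector."""
    return [int(c.predict(x.reshape(1, -1))[0]) for c in clfs]

print(predict_labels(X[0]))
```

Each classifier is trained and applied in complete isolation, which is exactly the label-independence assumption discussed next.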
In both approaches – SVM and S³VM – the assumption is that labels do not influence each other: e.g., knowing that women play a role in a given Regest does not make it more or less likely that the Regest is also about war and peace.

Combining SVMs and NNs
To enable our classifiers to capture possible dependencies between players and themes, we extend the SVM classifiers with a Neural Network, realizing a meta-learning architecture. The NN may learn that if groups x and y participate in a Regest, some theme z is unlikely to occur, even if predicted by an independent binary classifier. After choosing the "best" SVMs for each label, the outputs of the SVMs are fed into a Deep Neural Network (cf. Figure 2). We employ three input settings: (a) using the SVM output labels only, (b) using the SVM output labels and the document vectors (the boolean variant), and (c) the SVM output labels jointly with Paragraph Vectors. Paragraph Vectors are learned similarly to word embeddings but represent sentences or documents. They have been shown to yield strong performance in classifying sentences and IMDB opinions, and also in Information Retrieval. As the Regests are short documents, they are suitable for being represented by these dense vectors, which are learned in an unsupervised manner (Le and Mikolov, 2014).
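A sketch of the meta-learning setup, roughly corresponding to input setting (b), assuming scikit-learn is available; the stage-1 outputs, document vectors and the label dependency (label 6 mirrors label 0) are all invented for illustration, and MLPClassifier stands in for the paper's deep NN:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
# Hypothetical stage-1 outputs: 12 binary SVM decisions per Regest,
# concatenated with a small document vector (8 random features).
svm_out = rng.integers(0, 2, size=(60, 12))
doc_vec = rng.random((60, 8))
X_meta = np.hstack([svm_out, doc_vec])

# Toy dependency the meta-learner should pick up: label 6 ("new
# privileges") always co-occurs with label 0 ("nobles") here.
Y = svm_out.copy()
Y[:, 6] = Y[:, 0]

nn = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
nn.fit(X_meta, Y)          # 2D binary Y => multi-label training
pred = nn.predict(X_meta)  # indicator matrix, one column per label
print(pred.shape)
```

The meta-learner sees all 12 stage-1 decisions at once, so correlated labels can reinforce or suppress each other, which a bank of independent SVMs cannot do.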

Using Convolutional NNs
Recently, CNNs have been successfully applied to various text and semantic sentence classification tasks, often achieving very good performance (Kim, 2014; Zhang et al., 2015). Since CNNs usually require large numbers (thousands or more) of training samples to achieve very good performance, it would come as a surprise if, trained on a few hundred samples, they generalized better on unseen data than a set of binary maximum-margin classifiers. We included this setting to serve as a baseline on the neural NLP side and generated pre-trained word embeddings of two sizes using all Regests.
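The core operations of such a text CNN (convolution over word windows, ReLU, max-over-time pooling, in the style of Kim (2014)) can be sketched in plain NumPy; the embedding matrix and filters below are random stand-ins for the pre-trained vectors, and all dimensions are toy values:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, emb_dim, n_filters, width = 100, 16, 8, 3

E = rng.normal(size=(vocab_size, emb_dim))        # embedding lookup table (toy)
W = rng.normal(size=(n_filters, width, emb_dim))  # convolution filters
b = np.zeros(n_filters)

def cnn_features(token_ids):
    """Convolve filters over word windows, then max-pool over time."""
    x = E[token_ids]                                 # (len, emb_dim)
    windows = np.stack([x[i:i + width] for i in range(len(x) - width + 1)])
    conv = np.einsum("twd,fwd->tf", windows, W) + b  # (positions, filters)
    return np.maximum(conv, 0).max(axis=0)           # ReLU + max-over-time

feats = cnn_features(rng.integers(0, vocab_size, size=20))
print(feats.shape)  # (8,)
```

The pooled feature vector would then feed a small fully connected layer with one sigmoid output per label; training that layer is omitted here.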

Experimental Setup
Training and Test Data. We manually labeled 500 Regests, randomly drawn from the corpus to prevent bias. The data was split into a training and a test section of 400 and 100 Regests, respectively. The first two lines in Table 4 display the distribution of players and themes in the annotated data. Some of them occur rarely in both training and test data (e.g. Jewish people (b4), with only 3% and 2% of the respective data sets). On the other hand, nobles play a role in over 70% of the annotated Regests. For the estimation of model parameters we apply cross-validation (CV) on the training set. We proceed as follows: (i) Parameter tuning of SVMs: for each vector size and representation scheme, we tune the inner parameters of an SVM with CV on the training data. (ii) Testing of SVMs: we retrain each SVM on the full training data using the chosen hyperparameters, and evaluate the model on the test data set. (iii) Determining an independent multi-label system: as input to the NN models acting as meta-learners over SVM outputs, we determine the IMC ("Independent Margin Classifiers"), a set of independent margin classifiers consisting of the 12 SVMs that achieve the maximum training CV score for their respective problem. (iv) Training NN models: for the different NN models we again determine hyperparameters with CV on the IMC outputs for the training section, and retrain the final NN models on the full training data, before (v) testing of the NNs is again done on the final test set.
Evaluation Metrics. Our evaluation needs to take into account that many labels follow a skewed distribution (cf. Table 4). For example, consider that only one label is positive among 100 test samples. A classifier that labels all instances as negative yields a deceptively high accuracy of 0.99. Hence we employ Balanced Accuracy, the mean of Recall (Sensitivity) and inverse Recall (Specificity = TN / (TN + FP)), defined as Acc_bal = (Sensitivity + Specificity) / 2. In the above example, where accuracy yields a biased score of almost one, Balanced Accuracy yields a more realistic value of 0.5. Given the unbalanced distribution of our test data set, we report Balanced Accuracy for each of the 12 binary problems. We also report their arithmetic mean Acc_bal to provide a global measure of performance.
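The metric and the skewed-distribution example above can be sketched directly from the confusion-matrix counts:

```python
def balanced_accuracy(tp, fp, tn, fn):
    """Mean of Sensitivity (Recall) and Specificity (inverse Recall)."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return (sensitivity + specificity) / 2

# The example from the text: 1 positive among 100 samples,
# and a classifier that predicts all instances as negative.
print(balanced_accuracy(tp=0, fp=0, tn=99, fn=1))  # 0.5 (plain accuracy: 0.99)
```

The all-negative classifier gets Specificity 1.0 but Sensitivity 0.0, so Balanced Accuracy collapses to chance level, which is exactly the behavior the metric is chosen for.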
Baselines. As baselines we choose, besides a simple majority voter, a Multinomial Naive Bayes classifier, which is commonly used in text classification tasks (both applied in an independent binary manner for each label). Table 3 shows that Naive Bayes improves over the majority baseline for all problems and yields a solid 0.67 Acc_bal, 0.17 pp. above the majority voter.
IMC achieves 0.795 Acc_bal and significantly outperforms both the majority baseline (+0.3 Acc_bal) and Naive Bayes (+0.13). For every problem the score improves, by up to +0.47 Acc_bal for recognizing women (b5) in a Regest. For lesser nobles (b2) and war and peace (b11), the combination of independent classifiers yields the overall best results (0.62 and 0.79 Acc_bal).

Evaluation Results: In Depth Analysis
SVMs/S³VMs combined into the multi-labeler ("IMC", Table 3) achieve good performance (0.795 Acc_bal). Based on the training CV scores, IMC consists of eight supervised SVMs and four S³VMs; the S³VMs were chosen for problems b0, b1, b8 and b10. With respect to b2 and b11, IMC outperforms all NN approaches (b2: +0.04, b11: +0.01). The Naive Bayes baseline is outperformed by +0.128 Acc_bal. This strong improvement could be due to the generalization capacity of the maximum margin, which might be especially useful with small training set sizes.
With regard to representation schemes (boolean vs. tf-idf) and vocabulary sizes (2,000 vs. 10,000 words), we observe no clear pattern as to whether one generally works better than the other on the RI. 5 classifiers of the IMC are trained on 10,000 words, and 10 classifiers use boolean word features.

CNNs fed with 128-dimensional embeddings outperform the majority vote (+0.06 Acc_bal) but not Naive Bayes (-0.11), most likely due to the low amount of training data. Another explanation is that the 129,504 Regests were not sufficient to pre-train useful word vectors (possibly also negatively influenced by the word variety). As the vector size increases (512 dimensions), the performance drops further (+0.01 over the majority voter).
The remaining classifier models are intended to detect dependencies between players and themes and had access to the outputs of IMC. Specifically, the question is whether NNs are suitable for detecting such dependencies. As baselines we considered SVM and Decision Tree models trained on the outputs of the independent learners (in Table 3: +Decision Tree, +SVMs). Neither copes particularly well with this input information (-0.045 Acc_bal for +Decision Tree and -0.007 for +SVMs). Even when supplied with additional information in the form of Paragraph Vectors of various sizes (omitted in Table 3), neither system improves on its previous scores.
Neural Networks employed as meta-learners, by contrast, are able to improve results for specific problems, especially when supplied with Paragraph Vectors, resulting in the overall best system on test: a NN with 2048 hidden nodes and Paragraph Vectors of dimension 512 (+NN_2048+PV_512, Table 3). Still, the overall performance gain is small, at only +0.004 Acc_bal. When omitting b2 (lesser nobles) from the result calculation (it was the most controversial class in the annotation), the gain over IMC increases to +0.006. Notable individual performance gains are achieved for b0 (nobles, +0.02), b6 (new privileges, +0.05) and b8 (bestowal of land, +0.02). We conclude that there are dependencies between nobles, bestowal of land and privileges which cannot be captured by considering these classes independently.
To analyze on which groups and themes the neural network meta-learner offers significantly differing predictions ("it disagrees with its input"), we calculate mid-p-values with McNemar's test (Fagerland et al., 2013) between the outputs of the different systems.

Why is the resulting performance increase in Acc_bal only 0.2%? This is due to the fact that the NN is more restrictive in assigning labels than the independent learner model: over all 129,504 Regests, it predicts 50,968 fewer positive labels than IMC. As positive labels are strongly under-represented in the manually labeled data, the (non-weighted) Acc_bal measure is much more influenced by an additional True Positive than by a True Negative for a rare group or theme.

Paragraph Vectors (PVs) used as input to the NNs apparently contain more information than standard (boolean) bag-of-words (BoW) vectors: when the best NN is fed with BoW vectors instead of PVs, it achieves lower performance (-0.07 Acc_bal). To test whether Paragraph Vectors simply work better in general, we trained 12 independent SVM classifiers on PVs only to predict players and themes. The result, for several dimensions of Paragraph Vectors (between 64 and 2048) fed into an SVM (best result: SVMs+PV_128 in Table 3), did not exceed the Naive Bayes baseline, strongly indicating that PVs alone are inferior to BoW vectors for standard textual classification of the RI. Our explanation is as follows: while Le and Mikolov (2014) achieved good results in classifying the sentiment of movie reviews with Paragraph Vectors, they hypothesize that movie reviews are tailor-cut for learning the vectors for this problem, because compositionality plays an important role in deciding whether a review is positive or negative. The RI are a more complex source, and it is debatable whether compositionality plays a role with regard to co-occurring groups and themes. Also, while movie reviews often contain similar (sentiment) vocabulary, each Regest presents its content in rather unique ways.
The NN that learns Paragraph Vectors is thus presented with very diverse information, most likely generating vectors containing everything and thus little specific information. We conclude that using standard BoW vectors as first-order information was the correct choice, while PVs prove more suitable as higher-order information for the NN acting as a meta-classifier (as they add little but additional information).¹¹

Players and themes that can be predicted with great success by many systems on the test set are confirmation of privileges (b7: 0.94), Jews (b4: 1.00) and women (b5: 0.98). By contrast, all systems fail to reliably predict class b2 (lesser nobles), which yields a maximum of 0.12 points beyond majority and no gains beyond Naive Bayes. One explanation for this low performance is that it was really hard (if not sometimes impossible) to distinguish between non-nobles and nobles in the annotation process. All other groups and themes can be predicted with solid accuracy scores (≥ 0.20 above majority, ≥ 0.02 above Naive Bayes, and ≥ 0.62 Acc_bal per category in general).
The system +NN_2048+PV_512 performs best in Acc_bal. We also analyze two additional criteria of performance: (i) the Kullback-Leibler (KL) divergence between the distribution of labels in the manually annotated data and the distribution of labels automatically assigned to the full RI, and (ii) the KL divergence between the distributions of the number of labels per Regest (0 to 12 labels can be assigned). For (i), the KL divergences are KL(train, test) = 0.033, KL(train, RI_NN) = 0.036 and KL(train, RI_IMC) = 0.058, indicating only a small divergence between human and automatic labeling by the NN w.r.t. the distributions of the twelve groups and themes (cf. Table 4). In fact, all of the best three NNs have smaller KL divergences than IMC. Criterion (ii), the number of group and theme labels assigned by human vs. automatic labeling, shows similar tendencies: KL(train, test) = 0.02, KL(train, RI_NN) = 0.01, KL(train, RI_IMC) = 0.07. On average, two labels were assigned to a Regest by all labeling systems. The human annotators assigned two labels to 43% of the Regests, IMC to 27% and the NN to 34%.

¹¹ Note that this applies only to NNs as meta-learners: the SVM-based meta-learner baseline performed slightly worse when supplied also with Paragraph Vectors (Acc_bal with additional Paragraph Vectors: 0.786, without: 0.788).
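The KL divergence between two label distributions can be sketched as follows; the distributions below are hypothetical toy values, not the paper's real figures:

```python
from math import log

def kl(p, q):
    """KL divergence sum_i p_i * log(p_i / q_i) (natural log); assumes
    q_i > 0 wherever p_i > 0, which holds for label distributions here."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy label distributions (hypothetical): human training labels vs.
# labels automatically assigned by a system.
train = [0.70, 0.20, 0.10]
auto  = [0.65, 0.25, 0.10]
print(round(kl(train, auto), 4))
```

A value near 0 means the automatic labeling reproduces the human label proportions closely, which is how the KL(train, RI_NN) and KL(train, RI_IMC) figures above are to be read.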
In sum, our results indicate that NNs can learn dependencies of labels from independent classifier predictions. NNs are thus suitable to detect structures in the data that are intuitive for humans.
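The mid-p variant of McNemar's test used in the analysis above can be computed from the two discordant counts alone; a sketch with hypothetical counts:

```python
from math import comb

def mcnemar_midp(n01, n10):
    """Mid-p McNemar test (Fagerland et al., 2013).
    n01 = system A wrong & B right, n10 = A right & B wrong."""
    n = n01 + n10
    k = min(n01, n10)
    # exact two-sided binomial p-value under H0 (p = 0.5) ...
    p = 2 * sum(comb(n, i) * 0.5 ** n for i in range(k + 1))
    # ... minus half of twice the point probability ("mid-p")
    p -= comb(n, k) * 0.5 ** n
    return min(p, 1.0)

# Hypothetical disagreement counts between two labeling systems:
print(round(mcnemar_midp(3, 12), 4))  # ≈ 0.0213
```

Small mid-p-values indicate that the two systems' error patterns differ significantly, i.e. the meta-learner genuinely disagrees with its input rather than merely copying it.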

Deriving Structures of European Medieval Times
We labeled all Regests with +NN_2048+PV_512. We eye-balled several annotations and found many of the predicted classes to be correctly inferred.

Feature Analysis
The learned weight vectors of the SVMs offer an interpretation of the terms w.r.t. the classified groups and themes. Table 5 displays, for selected classes, the terms which were assigned the highest weights. Many of them intuitively make sense. Indicator terms for war and peace are "truppen" (troops), "friedensverhandlungen" (peace negotiations) and the preposition "gegen" (against); other terms point geographically to the East: "türken" (Turks) or "konstantinopel" (Constantinople, today Istanbul). From the analysis we conclude that the decision not to normalize the texts was reasonable, given that we find many high-weighted terms that are abbreviations, e.g. "urkk" (charter), "kgin" (queen), or Latin expressions: "ecclesia" (Christian community), "abbati" (abbot), "monasterii" (of the monastery) are indicators for spiritual institutions.
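Extracting such indicator terms amounts to ranking the vocabulary by the weight a linear SVM assigned to each feature; a minimal sketch with hypothetical weights for the war and peace classifier:

```python
# Toy vocabulary with hypothetical linear-SVM weights for one class.
vocab = ["truppen", "friedensverhandlungen", "gegen", "urkk", "ecclesia"]
weights = [1.9, 1.7, 1.2, -0.3, -0.8]  # invented values, for illustration only

def top_terms(vocab, weights, k=3):
    """Return the k terms with the highest positive weight."""
    ranked = sorted(zip(vocab, weights), key=lambda tw: tw[1], reverse=True)
    return [t for t, _ in ranked[:k]]

print(top_terms(vocab, weights))  # ['truppen', 'friedensverhandlungen', 'gegen']
```

Negative weights are informative too: they mark terms whose presence argues against the class, e.g. spiritual vocabulary for war and peace in this toy setup.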
Figure caption (fragment): the plotted value for a time bin b and a set of groups and themes gt is the number of Regests from time bin b which are about all groups and themes contained in gt.
Not only can groups and themes be traced with regard to time, but also with regard to locations and/or certain emperors. This is exemplified in Fig. 6 and Fig. 4, where we count the occurrences of all 12 themes and groups with respect to these parameters and normalize by the sum of all 12 occurrence counts.
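The normalization used for these plots can be sketched as follows; the labeled Regests below are hypothetical examples, not real RI data:

```python
from collections import Counter, defaultdict

# Hypothetical labeled Regests: (issuer, list of predicted labels).
regests = [
    ("Maximilian I.", ["b9", "b11"]),
    ("Maximilian I.", ["b9"]),
    ("Friedrich III.", ["b6"]),
]

def label_shares(regests):
    """Per issuer: count each label's occurrences, then normalize by the
    sum of all label occurrence counts for that issuer."""
    counts = defaultdict(Counter)
    for issuer, labels in regests:
        counts[issuer].update(labels)
    return {
        issuer: {lab: n / sum(c.values()) for lab, n in c.items()}
        for issuer, c in counts.items()
    }

shares = label_shares(regests)
print(shares["Maximilian I."])  # b9 -> 2/3, b11 -> 1/3
```

The same function works for any grouping key (time bin, location, issuer), since only the first element of each tuple changes.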

Conclusions
We solved a multi-label text classification problem to derive interesting demographic groups (e.g. citizens) and themes of interactions (i.a. bestowal of privileges or justice) in the Regesta Imperii. Evaluation on a held-out test set suggests that most groups and themes can be predicted with good reliability: 9 out of 12 classes can be predicted with a Balanced Accuracy ≥ 0.75. The arithmetic mean of all 12 scores – our global performance measure – is 0.797 for the system that was finally chosen to label the entire RI.
A Neural Network acting as a meta-learner over the outputs of independent maximum-margin classifiers and Paragraph Vectors (document embeddings learned by neural networks) led to a minor improvement of 0.2% in mean score. However, for the group nobles and the themes bestowal of land and new privileges, the scores were improved by up to 3%, 4% and 5%, indicating dependencies between these classes that cannot be captured by classifiers working under the label-independence assumption. We conclude that NNs can give additional information on possible dependencies between classes in a multi-label classification task.

Figure caption (fragment): "... 1486). The impact of spiritual institutions (from Ruprecht III onwards) and of women and land bestowals (both from Friedrich I. onwards) seems to decrease. Finances seem to play a more important role in the later Middle Ages."

Figure 5: Logistic Regression weights when we force themes and groups to predict the issuing emperor. Negative weights suggest negative correlation, positive weights suggest positive correlation. Observably, finances and war and peace are associated with Maximilian I. He was notorious for his flamboyant lifestyle and led many wars, two factors leading to great debts, which he mostly owed to Jakob Fugger, banker from the famous Fugger family.
Conceptually the approach is straightforward, but a complicating factor is the exploding parameter space: besides the "inner parameters" of the learners, such as regularization strength or the number of neurons in the Neural Network, there are numerous "outer parameters", e.g. the possible ways of document representation or pre-processing.
As the best-performing system we determined a NN model with additional Paragraph Vector information. It obtained the best results on the test set and also yields the minimum KL divergence between the label distribution over the manually labeled training data and the system predictions. This model was chosen to label all 129,504 Regests.
For the project Regesta Imperii and Digital Humanities in general, our work offers the possibility to trace demographic groups (players) and themes through almost one thousand years of medieval history across different European locations. We showcased data analyses and visualizations. Manifold other possibilities may be explored in future work.
The Regesta Imperii are, in our opinion, a most challenging and linguistically interesting corpus. For historians, the RI are important as a fundamental source for medieval European studies. For linguists, the RI may be very interesting due to their linguistic "uniqueness": syntactic constructions range from simple to most complex, and the languages range from more modern German over different forms of medieval German to Latin. Great variety in word forms exists. Semantically, the referenced objects and concepts are often confined to short periods of time. Thus, the RI present challenges for researchers from many research fields. The challenging language, the considerable amount of data and the many interesting humanities questions regarding the European Middle Ages make the RI a great corpus for NLP researchers with a special interest in the humanities.