Detecting de minimis Code-Switching in Historical German Books

Code-switching has long interested linguists, with computational work in particular focusing on speech and social media data (Sitaram et al., 2019). This paper contrasts these informal instances of code-switching with its appearance in more formal registers by examining the mixture of languages in the Deutsches Textarchiv (DTA), a corpus of 1,406 primarily German books from the 17th to 19th centuries. We automatically annotate and manually inspect spans of six embedded languages (Latin, French, English, Italian, Spanish, and Greek) in the corpus. We quantitatively analyze the differences between code-switching patterns in these books and those in more typically studied speech and social media corpora. Furthermore, we address the practical task of predicting code-switching from features of the matrix language alone in the DTA corpus. Such classifiers can help reduce errors when optical character recognition or speech transcription is applied to a large corpus with rare embedded languages.


Introduction
Code-switching, the linguistic phenomenon where speakers or writers alternate between different languages in a single utterance or statement, is commonly seen in bilingual or multilingual communities. In the last few decades, code-switching has drawn scholarly attention in computational linguistics and natural language processing from many different perspectives (Sitaram et al., 2019). Researchers from formal linguistics, psycholinguistics, sociolinguistics, philosophy, anthropology, and elsewhere have considered the phenomenon (Nilep, 2006). In formal linguistics, interest has risen in studying syntactic and morphosyntactic constraints on language alternation (Nilep, 2006). In sociolinguistics, there has been a focus on the social context in conversations where code-switching happens. For example, Blom and Gumperz (1972) propose the dichotomy of situational and metaphorical code-switching, which distinguishes whether different languages or language varieties are used in different social situations.
In natural language processing, there is extensive work studying code-switching from an engineering perspective. Different NLP tasks have been proposed on code-switching corpora, such as language ID (Solorio et al., 2014; Sequiera et al., 2015), named entity recognition (Aguilar et al., 2018; Singh et al., 2018), POS tagging (Solorio and Liu, 2008; Vyas et al., 2014), sentiment analysis (Vilares et al., 2015), automatic speech recognition (Chan et al., 2014; Weiner et al., 2012), question answering (Chandu et al., 2018), etc. Indeed, with the abundance of code-switched speech and (informal) text data, there are many choices of NLP directions that one can pursue.
Code-switching, as a language phenomenon, is usually considered informal. It is often found in speech and in casual text, such as social media (Sitaram et al., 2019); however, code-switching also appears in formal settings, such as newspaper reports, or, as in this paper, books. Table 1 shows examples from the Deutsches Textarchiv (DTA) corpus of code-switching from German into Latin, English, French, and Greek. Based on such observations, we ask: from a quantitative perspective, precisely how different is formal, "scholarly" code-switching from its informal usage? Furthermore, can we predict which books will host code-switching, so as to reduce language-coding and transcription errors in mass digitization? If we can, how would these prediction tasks differ from common NLP tasks in informal code-switching settings? To measure the difference between formal and informal code-switching, we evaluate metrics for characterizing code-switching across corpora. Specifically, we are interested in: 1) how "unequal" the distribution of different languages is in a corpus; 2) how frequently the switching occurs; and 3) whether the switching happens in a periodic or aperiodic manner. We employ these metrics to answer the questions above quantitatively and to get a sense of how code-switching characteristics differ among corpora.
Besides this descriptive task, we are interested in practical tasks for predicting code-switching. There has been previous work formalizing code-switching detection in historical texts as a language ID task (Schulz and Keller, 2016; Sprugnoli et al., 2017), and models such as Conditional Random Fields (CRF) have been deployed to classify words as belonging to one language or another. However, such approaches fail to work in the following scenario: when large collections of page images are transcribed with optical character recognition (OCR) or when large audio collections are transcribed by speech recognition, we do not always know a priori which languages will be included. Including hypotheses from multiple languages in a transcription model can reduce accuracy in the majority matrix language. Using a model trained only on the matrix language results in near-zero accuracy in transcribing embedded languages. When the page containing the Greek example from Table 1 is run through a German-only OCR model, the output contains BxIos in place of βάθος. We therefore experiment with two predictive tasks. First, at the level of the book, can we predict the presence of code-switching using only features of the matrix language text? Second, working sequentially through a text, can we predict when code-switching will occur?
The rest of the paper is organized as follows: §2 introduces the metrics we use to characterize code-switching. §3 describes the datasets we analyze and presents the results of the metrics defined in §2. §4 shows the results of the two predictive tasks described above applied to historical books from the DTA. Finally, §5 summarizes our conclusions and outlines possible future directions.

Metrics for Characterizing Code-switched Corpora
In this section we briefly introduce three metrics for measuring characteristics of code-switched corpora. The metrics, as defined by Guzmán et al. (2017) on the basis of previous work, include: M-index, measuring the inequality of the distribution of languages; (normalized) I-index, measuring the frequency of switching in a code-switched corpus; and burstiness, measuring the degree of periodicity in code-switching patterns in the corpus.
The Multilingual Index (M-index), developed by Barnett et al. (2000), is a token-count-based measure that "quantifies the inequality of the distribution of language tags in a corpus of at least two languages" (Guzmán et al., 2017). It is defined as follows:

M\text{-index} = \frac{1 - \sum_{j=1}^{k} p_j^2}{(k-1) \sum_{j=1}^{k} p_j^2},

where k denotes the total number of languages in the corpus, and p_j is the proportion of words in language j. The M-index ranges from 0 (indicating a completely monolingual corpus) to 1 (denoting that each language in the corpus has the same number of words). The Integration Index (I-index) measures the frequency of code-switching. It "describes the probability of switching within a text" (Guzmán et al., 2017). The unnormalized I-index is calculated as:

I = \frac{1}{n-1} \sum_{i=1}^{n-1} S(l_i, l_{i+1}),

where n is the total number of tokens in the text, l_i denotes the language of token i, and S(l_i, l_{i+1}) = 1 if l_i \neq l_{i+1} and 0 otherwise. As we can see, this quantity measures the proportion of switch points relative to the total number of adjacent token pairs in the corpus. The unnormalized I-index, however, does not consider the underlying language distribution in the corpus. For example, the unnormalized I-index cannot be close to 1 unless the M-index of the corpus is also close to 1, i.e., unless each language in the corpus is equally distributed. In order to decouple this metric from the underlying language distribution of the corpus, a normalized version of the I-index was developed by Bullock et al. (2019) and is computed as follows:

I_{\text{normalized}} = \frac{I - L}{H - L},

where H and L are the upper and lower bounds of the unnormalized I-index, respectively. Let n be the total number of tokens in the text, k be the total number of languages, and n_i be the number of tokens in language i. We can then define:

L = \frac{k-1}{n-1}, \qquad H = \frac{\min\left(n-1,\; 2\,(n - \max_i n_i)\right)}{n-1}.

Note that I_{\text{normalized}} ranges from 0 to 1, representing the absolute minimum and maximum numbers of possible switches within the corpus, regardless of the underlying language distribution. Therefore, one can directly compare this metric across different corpora.
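For concreteness, the M-index and both I-index variants can be computed directly from a flat sequence of token-level language tags. The following sketch implements the definitions above; the function names and the pure-Python setting are illustrative choices of ours:

```python
from collections import Counter

def m_index(tags):
    """M-index (Barnett et al., 2000): 0 = monolingual, 1 = uniform mix."""
    counts = Counter(tags)
    k, n = len(counts), len(tags)
    if k < 2:
        return 0.0
    sq = sum((c / n) ** 2 for c in counts.values())
    return (1.0 - sq) / ((k - 1) * sq)

def i_index(tags):
    """Unnormalized I-index: proportion of adjacent token pairs that switch."""
    return sum(a != b for a, b in zip(tags, tags[1:])) / (len(tags) - 1)

def i_index_normalized(tags):
    """I-index rescaled into [0, 1] by its bounds given the language counts
    (after Bullock et al., 2019)."""
    counts = Counter(tags)
    n, k = len(tags), len(counts)
    lo = (k - 1) / (n - 1)  # minimum: k contiguous monolingual blocks
    hi = min(n - 1, 2 * (n - max(counts.values()))) / (n - 1)  # maximum
    return (i_index(tags) - lo) / (hi - lo) if hi > lo else 0.0
```

On a toy sequence of eight German tokens followed by two Latin tokens, the normalized I-index is 0, since a single switch point is the minimum possible for two languages; a strictly alternating sequence yields 1 for both the M-index and the normalized I-index.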
Burstiness (Goh and Barabasi, 2008) "quantifies whether switching occurs in bursts or has a more periodical manner" (Guzmán et al., 2017). It is defined as:

B = \frac{\sigma_\tau - m_\tau}{\sigma_\tau + m_\tau},

where σ_τ and m_τ denote the standard deviation and the mean of the language span lengths, respectively. Burstiness ranges from −1 (periodic code-switching in corpora) to 1 (aperiodic, less predictable code-switching in corpora).
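Burstiness follows the same pattern: collapse the tag sequence into the lengths of its monolingual spans, then compare their spread to their mean. The sketch below uses the population standard deviation, which is an assumption on our part:

```python
import statistics

def burstiness(tags):
    """Burstiness (Goh and Barabasi, 2008): -1 = perfectly periodic spans,
    +1 = highly aperiodic ("bursty") spans."""
    # Collapse the token-level tags into the lengths of monolingual runs.
    spans, run = [], 1
    for prev, cur in zip(tags, tags[1:]):
        if cur == prev:
            run += 1
        else:
            spans.append(run)
            run = 1
    spans.append(run)
    sigma = statistics.pstdev(spans)  # population std-dev of span lengths
    mean = statistics.fmean(spans)
    return (sigma - mean) / (sigma + mean)
```

Strictly alternating tags give spans of identical length and hence a burstiness of −1, while a long matrix-language run punctuated by short embedded spans pushes the value towards 1.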

Datasets and Analysis
We now describe our corpora and analyze them for their patterns of code-switching.

Corpus Descriptions
In this paper, we focus on the Deutsches Textarchiv (DTA) 1 corpus, which contains manual transcriptions of 1,406 historical German books from the 17th to the 19th centuries. The corpus contains 131,679,459 tokens in total. Until about the 1930s, German was usually written in a "blackletter" font named "Fraktur", while other languages in the Roman alphabet were written in a Roman font, called "Antiqua" in German. Since the DTA encodes this font information, we are able to identify text written in Roman-script languages other than German. We then use an off-the-shelf language identification API 2 to label the "Antiqua" text spans with their corresponding languages. To eliminate errors made by the API, we then perform manual correction on all the labeled spans. We easily identify spans of embedded Greek by locating Greek UTF-8 characters.
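Since Greek uses its own script, embedded Greek spans can be located with a simple character-class scan. The sketch below relies on Unicode character names; the function name and the character-offset span granularity are illustrative simplifications (the actual annotation works on transcribed tokens):

```python
import unicodedata

def greek_spans(text):
    """Return (start, end) index pairs of maximal runs of Greek-script
    letters in a string, classifying characters by their Unicode names."""
    def is_greek(ch):
        return ch.isalpha() and unicodedata.name(ch, "").startswith("GREEK")

    spans, start = [], None
    for i, ch in enumerate(text):
        if is_greek(ch):
            if start is None:
                start = i
        elif start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(text)))
    return spans
```

Applied to a line such as "die Tiefe (βάθος) des Meeres", this recovers exactly the span covering βάθος.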
For comparison, we also use the LinCE corpora (Aguilar et al., 2020) to characterize differences between formal and informal code-switched text. LinCE combines Twitter and Facebook data from ten corpora, and the language in these corpora is more informal. The corpora cover four different codeswitched language pairs: Spanish-English, Nepali-English, Hindi-English, and Modern Standard Arabic-Egyptian Arabic. Overall, LinCE contains 64,326 posts with 953,813 tokens. Although we realize that the two corpora differ in many other dimensions, such as cultural context, topic, text length, language pairs and so on, we believe that formality is a crucial factor that needs to be taken into account for our comparison.

Corpus Comparison
Results for the code-switching metrics introduced in §2 are shown in Table 2. We can see that the M-index for the DTA corpus is very close to zero, while the numbers for the LinCE corpora are significantly greater than zero, suggesting that the language distribution is much more skewed in the DTA corpus than in the LinCE corpora. We also observe that both the normalized I-index and the burstiness of the DTA corpus are much greater than those of the LinCE corpora, indicating that code-switching is more frequent (regardless of the underlying language distribution) and less periodic in the DTA corpus than in the LinCE corpora. Furthermore, the high normalized I-index of the DTA corpus implies that the non-German blocks in the corpus are quite short, so that the probability of switching back to German after a non-German token is high, while in the LinCE corpora passages in a particular language are usually longer.

Predictive Tasks for Studying Code-Switching
Results in §3.2 demonstrate that the code-switching patterns in historical books are significantly different from those in commonly studied domains, such as social media. In this section, we consider two tasks that are uniquely suited to investigating code-switching in historical books. As discussed in §1, both tasks aim to improve OCR performance on the books. For both tasks, we use an 80%/10%/10% train/dev/test book-level split of the DTA corpus.
For the first task, we predict whether any non-German languages are present in a book from the DTA corpus. For our book-level baseline, we pick the top 1,000 words in each book and use their counts as features: a vector containing the count of each word in the vocabulary serves as the feature vector X. We then train a logistic regression model for the prediction task; the logit function is β0 + β1X, where β0 and β1 are model parameters. We choose logistic regression because we want a simple baseline model well suited to binary classification. We use the sklearn implementation (Pedregosa et al., 2011) (v0.22.2). The model outputs log probabilities of the book containing only German or German plus foreign languages.
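The feature extraction for this baseline can be sketched as follows. We show a shared vocabulary built from the most frequent words across the training books, with each book mapped to a vector of counts over that vocabulary; the shared-vocabulary reading of the setup and the function name are our own simplifications:

```python
from collections import Counter

def count_features(book_tokens, vocab_size=1000):
    """Bag-of-words features: a shared top-vocab_size vocabulary over all
    books, and one count vector per book over that vocabulary."""
    totals = Counter()
    for tokens in book_tokens:
        totals.update(tokens)
    vocab = [w for w, _ in totals.most_common(vocab_size)]
    index = {w: i for i, w in enumerate(vocab)}
    X = []
    for tokens in book_tokens:
        vec = [0] * len(vocab)
        for w in tokens:
            if w in index:
                vec[index[w]] += 1
        X.append(vec)
    return vocab, X
```

These count vectors are the X that would be fed to sklearn's LogisticRegression.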
For the second task, our goal is to predict which language the next character (for the DTA corpus) or the next word (for the LinCE corpora) will be in, given the sequence of characters/words read so far. We treat this as a sequence prediction problem (Sutskever et al., 2014) and use a character/word LSTM model as our baseline, chosen for illustration purposes. We choose a character-level model for the DTA corpus to simulate the OCR setting, where each character is scanned sequentially and a prediction for the next character is made as we proceed. We feed a text chunk (for the DTA corpus, the chunk contains 20, 50, or 100 characters, or the characters from an entire page) to a character/word embedding layer with output dimension 16. We vary the chunk size to see whether it influences overall task performance; the conclusion could be helpful in determining the optimal input size for OCR. For predictions on the DTA corpus, we concatenate the embedding layer output with the log probabilities output by the book-level predictions of the first task. We then send the resulting vectors to a single-layer LSTM with 16 hidden units, pass all hidden states output by the LSTM through a linear layer with output dimension 2, and apply a softmax function to obtain probability distributions over the two output classes. We train this model for 5 epochs. We report precision/recall/F1 for the DTA corpus with different input chunk sizes, and for the LinCE corpora with different language pairs. Table 3 and Table 4 provide baseline results for the two tasks. As we can see, the baseline model gives decent performance on the first task. For the second task, with the DTA corpus as input, increasing the chunk size increases precision but decreases recall.
The F1 score first increases, then decreases as the chunk size grows. With the LinCE corpora as input, there is a general improvement in both precision and recall. Furthermore, the F1 scores for the DTA corpus are significantly lower than for the LinCE corpora, probably because the language distribution in the DTA corpus is much more heavily skewed towards the matrix language, making the prediction inherently more difficult: like finding a needle in a haystack. This leaves room for improvement over the baseline model in future work.
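The input/output pairs for this sequential task can be constructed as follows. The sketch assumes a parallel boolean array marking German characters (in the DTA, derivable from the Fraktur/Antiqua distinction); the function name and the stride of one character are our own choices:

```python
def next_char_examples(text, is_german, chunk_size=50):
    """Build (chunk, label) training pairs: each example pairs the
    chunk_size characters read so far with the language class (1 = German,
    0 = other) of the character that immediately follows."""
    examples = []
    for i in range(chunk_size, len(text)):
        examples.append((text[i - chunk_size:i], int(is_german[i])))
    return examples
```

Each pair then feeds the embedding-plus-LSTM stack described above, with the label as the two-class target.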

Conclusion
In this paper, we study code-switching patterns in 1,406 historical German books. We automatically annotate and manually inspect code-switched text spans. We then quantitatively show that the code-switching patterns in these books differ from those in typically studied informal code-switching domains, such as social media. We propose two tasks that, if handled well, would help improve OCR performance on code-switched historical books. Finally, we provide baseline results for the two tasks. The first task achieves decent baseline performance, and we observe a clear precision-recall tradeoff across input chunk sizes on the second task with the DTA corpus as input. We also see that locating code-switching points is more difficult in historical text than in informal domains such as social media, since switching occurs rarely in old books. The F1 score for the sequential prediction task could be further improved with different model architectures; we leave this for future work. Additionally, our study opens an avenue for further analysis of code-switching patterns in historical texts.