Are All Languages Equally Hard to Language-Model?

For general modeling methods applied to diverse languages, a natural question is: how well should we expect our models to work on languages with differing typological profiles? In this work, we develop an evaluation framework for fair cross-linguistic comparison of language models, using translated text so that all models are asked to predict approximately the same information. We then conduct a study on 21 languages, demonstrating that in some languages, the textual expression of the information is harder to predict with both n-gram and LSTM language models. We show complex inflectional morphology to be a cause of performance differences among languages.


Introduction
Modern natural language processing practitioners strive to create modeling techniques that work well on all of the world's languages. Indeed, most methods are portable in the following sense: Given appropriately annotated data, they should, in principle, be trainable on any language. However, despite this crude cross-linguistic compatibility, it is unlikely that all languages are equally easy, or that our methods are equally good at all languages.
In this work, we probe the issue, focusing on language modeling. A fair comparison is tricky. Training corpora in different languages have different sizes, and reflect the disparate topics of discussion in different linguistic communities, some of which may be harder to predict than others. Moreover, bits per character, a standard metric for language modeling, depends on the vagaries of a given orthographic system. We argue for a fairer metric: bits per utterance, computed over utterance-aligned multi-text. That is, we train and test on "the same" set of utterances in each language, modulo translation. To avoid discrepancies in out-of-vocabulary handling, we evaluate open-vocabulary models.
We find that under standard approaches, text tends to be harder to predict in languages with fine-grained inflectional morphology. Specifically, language models perform worse on these languages in our controlled comparison. Furthermore, this performance difference essentially vanishes when we remove the inflectional markings. Thus, in highly inflected languages, either the utterances have more content or the models are worse. Two explanations present themselves:
(1) Text in highly inflected languages may be inherently harder to predict (higher entropy per utterance) if its extra morphemes carry additional, unpredictable information.
(2) Alternatively, perhaps the extra morphemes are predictable in principle (for example, redundant marking of grammatical number on both subjects and verbs, or marking of object case even when it is predictable from semantics or word order), and yet our current language modeling technology fails to predict them. This might happen because (2a) the technology is biased toward modeling words or characters and fails to discover intermediate morphemes, or because (2b) it fails to capture the syntactic and semantic predictors that govern the appearance of the extra morphemes. We leave it to future work to tease apart these hypotheses.

Language Modeling
A traditional closed-vocabulary, word-level language model operates as follows: Given a fixed set of words V, the model provides a probability distribution over sequences of words with parameters to be estimated from data. Most fixed-vocabulary language models employ a distinguished symbol UNK that represents all words not present in V; these words are termed out-of-vocabulary (OOV).
Choosing the set V is something of a black art: some practitioners choose the k most common words (e.g., Mikolov et al. (2010) choose k = 10000), while others use all words that appear at least twice in the training corpus. In general, replacing more words with UNK artificially improves the perplexity measure but produces a less useful model. OOVs present a particular challenge for the cross-linguistic comparison of language models, especially in morphologically rich languages, which simply have more word forms.
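For concreteness, here is a minimal sketch of this preprocessing step (the helper names are ours, not from the paper): keep the k most common word types and map everything else to UNK.

```python
from collections import Counter

UNK = "<unk>"

def build_vocab(corpus, k=10000):
    """Keep the k most common word types; everything else is OOV."""
    counts = Counter(w for sent in corpus for w in sent)
    return {w for w, _ in counts.most_common(k)}

def unk_replace(sent, vocab):
    """Map out-of-vocabulary words to the distinguished UNK symbol."""
    return [w if w in vocab else UNK for w in sent]

corpus = [["the", "book", "is", "red"], ["the", "books", "are", "red"]]
vocab = build_vocab(corpus, k=3)
print(unk_replace(["the", "book", "is", "blue"], vocab))
# ['the', 'book', '<unk>', '<unk>'] -- smaller k means more UNKs
```

Note how shrinking k trades model usefulness for artificially better perplexity: the UNK token absorbs probability mass that would otherwise be spread over many rare words.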

The Role of Inflectional Morphology
Inflectional morphology can explode the base vocabulary of a language. Compare, for instance, English and Turkish. The nominal inflectional system of English distinguishes two forms: a singular and a plural. The English lexeme BOOK has the singular form book and the plural form books. In contrast, Turkish distinguishes at least 12 forms: kitap, kitaplar, kitabı, kitabın, etc.
To compare the degree of morphological inflection in our evaluation languages, we use counting complexity (Sagot, 2013). This crude metric counts the number of inflectional categories distinguished by a language (e.g., English includes a category of 3rd-person singular present-tense verbs). We count the categories annotated in the language's UniMorph (Kirov et al., 2018) lexicon. See Table 1 for the counting complexity of each evaluated language.
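As an illustration, such a count can be approximated directly from a UniMorph lexicon, whose lines are tab-separated triples of lemma, inflected form, and feature bundle. The sketch below counts distinct feature bundles; this captures the spirit of the metric, though Sagot's exact procedure may differ in its details, and the file paths are hypothetical.

```python
def counting_complexity(unimorph_path):
    """Count the distinct inflectional feature bundles attested in a
    UniMorph lexicon (one 'lemma TAB form TAB features' entry per line)."""
    bundles = set()
    with open(unimorph_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) == 3:           # skip blank or malformed lines
                bundles.add(parts[2])     # e.g. "V;3;SG;PRS"
    return len(bundles)

# Hypothetical usage, comparing English against Turkish:
# print(counting_complexity("eng"), counting_complexity("tur"))
```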

Open-Vocabulary Language Models
To ensure comparability across languages, we require our language models to predict every character in an utterance, rather than skipping some characters because they appear in words that were (arbitrarily) designated as OOV in that language. Such models are known as "open-vocabulary" LMs.
Notation. Let ⊎ denote disjoint union, i.e., A ⊎ B = C iff A ∪ B = C and A ∩ B = ∅. Let Σ be a discrete alphabet of characters, including a distinguished unknown-character symbol UNK. (The set of graphemes in these languages can be assumed to be closed, but external graphemes may on rare occasion appear in random text samples; these are rare enough not to materially affect the metrics.) A character LM then defines

p(c) = ∏_{i=1}^{|c|+1} p(c_i | c_{<i}),

where we take c_{|c|+1} to be a distinguished end-of-string symbol EOS. In this work, we consider two open-vocabulary LMs, as follows.
Baseline n-gram LM. We train "flat" hybrid word/character open-vocabulary n-gram models (Bisani and Ney, 2005), defined over strings in Σ+ from a vocabulary Σ with mutually disjoint subsets: Σ = W ⊎ C ⊎ S, where single characters c ∈ C are distinguished in the model from single-character full words w ∈ W, e.g., the character a versus the word a. The special symbols S = {EOW, EOS} denote end-of-word and end-of-string, respectively. N-gram histories in H are either word-boundary or word-internal (corresponding to a whitespace tokenization), i.e., H = H_b ⊎ H_i. String-internal word boundaries are always separated by a single whitespace character. For example, if foo, baz ∈ W but bar ∉ W, then the string foo bar baz would be generated as: foo b a r EOW baz EOS. Possible 3-gram histories in this string include the word-internal history (foo, b) ∈ H_i, mid-way through spelling out bar, and the word-boundary history (r, EOW) ∈ H_b. Symbols are generated from a multinomial given the history h, leading to a new history h′ that includes the symbol and is truncated to the Markov order. Histories h ∈ H_b can generate symbols s ∈ W ∪ C ∪ {EOS}. If s = EOS, the string is ended. If s ∈ W, it has an implicit EOW and the model transitions to a history h′ ∈ H_b; if s ∈ C, the model transitions to a word-internal history h′ ∈ H_i.

We use standard Kneser and Ney (1995) model training, with distributions at word-internal histories h ∈ H_i constrained so as to only provide probability mass for symbols s ∈ C ∪ {EOW}. We train 7-gram models, but prune n-grams hs where the history h ∈ C^k for k > 4, i.e., 6- and 7-gram histories must include at least one symbol s ∉ C. To establish the vocabularies W and C, we replace exactly one instance of each word type with its spelled-out version. Singleton words are thus excluded from W, and character-sequence observations from all types are included in training. Note that any word w ∈ W can also be generated as a character sequence; for perplexity calculation, we sum the probabilities of each way of generating the word.
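To make the representation concrete, here is a small sketch (our own helper, not the paper's implementation) that maps a tokenized string to the hybrid symbol sequence described above:

```python
EOW, EOS = "<eow>", "</s>"

def hybrid_symbols(tokens, word_vocab):
    """Spell out OOV words character by character, ending each with EOW,
    mirroring the flat hybrid word/character representation."""
    symbols = []
    for tok in tokens:
        if tok in word_vocab:
            symbols.append(tok)    # full-word symbol w in W
        else:
            symbols.extend(tok)    # character symbols c in C
            symbols.append(EOW)
    symbols.append(EOS)
    return symbols

print(hybrid_symbols(["foo", "bar", "baz"], {"foo", "baz"}))
# ['foo', 'b', 'a', 'r', '<eow>', 'baz', '</s>']
```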
[Figure 1: The primary findings of our paper are evinced in these plots. Each point is a language. While the LSTM outperforms the hybrid n-gram model, the relative performance on the highly inflected languages compared to the more modestly inflected languages is almost constant; to see this point, note that the regression lines in Fig. 1c are almost identical. Also, comparing Fig. 1a and Fig. 1b shows that the correlation between LM performance and morphological richness disappears after lemmatization of the corpus, indicating that inflectional morphology is the origin of the lower BPEC. Panel (c) shows the difference in BPEC performance of n-gram (blue) and LSTM (green) LMs between words and lemmata.]

LSTM LM. While neural language models can also take a hybrid approach (Hwang and Sung, 2017; Kawakami et al., 2017), recent advances indicate that full character-level modeling is now competitive with word-level modeling. A large part of this is due to the use of recurrent neural networks (Mikolov et al., 2010), which can generalize about how the distribution p(c_i | c_{<i}) depends on c_{<i}. We use a long short-term memory (LSTM) LM (Sundermeyer et al., 2012), identical to that of Zaremba et al. (2014), but at the character level. To obtain the hidden state h_i ∈ R^d at time step i, one feeds the left context to the LSTM: h_i = LSTM(c_1, …, c_{i−1}), where the model uses a learned vector to represent each character type; this involves the recursive procedure described in Hochreiter and Schmidhuber (1997). The probability distribution over the i-th character is then p(c_i | c_{<i}) = softmax(W h_i + b). Parameters for all models are estimated on the training portion and model selection is performed on the development portion. The neural models are trained with SGD (Robbins and Monro, 1951) with gradient clipping, such that each component has a maximum absolute value of 5. We optimize for 100 iterations and perform early stopping on the development portion. We employ a character embedding of size 1024 and 2 hidden layers of size 1024. (As Zaremba et al. (2014) indicate, increasing the number of parameters may allow us to achieve better performance.) The implementation is in PyTorch.
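A minimal PyTorch sketch of such a character-level LSTM LM follows. The embedding and hidden sizes match the text; the vocabulary size, batch shape, and learning rate are illustrative placeholders, and this is our reconstruction, not the authors' released code.

```python
import torch
import torch.nn as nn

class CharLSTMLM(nn.Module):
    """Character-level LSTM language model: embed each character,
    run a 2-layer LSTM, and score the next character at each step."""
    def __init__(self, n_chars, emb=1024, hidden=1024, layers=2):
        super().__init__()
        self.embed = nn.Embedding(n_chars, emb)
        self.lstm = nn.LSTM(emb, hidden, num_layers=layers, batch_first=True)
        self.out = nn.Linear(hidden, n_chars)

    def forward(self, chars):                 # chars: (batch, time) ids
        h, _ = self.lstm(self.embed(chars))   # hidden state h_i per position
        return self.out(h)                    # scores over the next character

# One SGD step with component-wise gradient clipping at 5, as in the text.
model = CharLSTMLM(n_chars=100)
opt = torch.optim.SGD(model.parameters(), lr=1.0)
batch = torch.randint(0, 100, (8, 51))        # dummy character ids
logits = model(batch[:, :-1])                 # predict each next character
loss = nn.functional.cross_entropy(
    logits.reshape(-1, 100), batch[:, 1:].reshape(-1))
loss.backward()
nn.utils.clip_grad_value_(model.parameters(), 5)
opt.step()
```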

A Fairer Evaluation: Multi-Text
Effecting a cross-linguistic study on LMs is complicated because different models could be trained and tested on incomparable corpora. To avoid this problem, we use multi-text: k-way translations of the same semantic content.
What's wrong with bits per character? Open-vocabulary language modeling is most commonly evaluated under bits per character (BPC):

BPC(c) = −(1 / (|c|+1)) ∑_{i=1}^{|c|+1} log₂ p(c_i | c_{<i}).

Even with multi-text, comparing BPC is not straightforward, as it relies on the vagaries of individual writing systems. Consider, for example, the difference in how Czech and German express the phoneme /tʃ/: Czech uses č, whereas German uses tsch. Now consider the Czech word puč and its German equivalent Putsch. Even if both words are predicted with the same probability in a given context, German will end up with a lower BPC, since the same number of bits is spread over more characters.

Bits per English Character. Multi-text allows us to compute a fair metric that is invariant to the orthographic (or phonological) differences discussed above: bits per English character (BPEC),

BPEC(c) = −(1 / (|c^English|+1)) ∑_{i=1}^{|c|+1} log₂ p(c_i | c_{<i}),

where c^English is the English character sequence of the utterance aligned to c. The choice of English is arbitrary; any other choice of language would simply scale the values by a constant factor.
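In code, the two metrics differ only in the normalizer. A minimal sketch, assuming we are given the per-character probabilities of an utterance (including EOS) and the length of its aligned English utterance:

```python
import math

def total_bits(p_chars):
    """-log2 probability of the whole utterance; p_chars holds
    p(c_i | c_<i) for every character, including the final EOS."""
    return -sum(math.log2(p) for p in p_chars)

def bpc(p_chars):
    """Normalize by the utterance's own length (|c| + 1, with EOS)."""
    return total_bits(p_chars) / len(p_chars)

def bpec(p_chars, n_english_chars):
    """Normalize instead by the aligned English utterance's length,
    so that translations of the same content are directly comparable."""
    return total_bits(p_chars) / (n_english_chars + 1)
```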
Note that this metric essentially captures the overall bits per utterance; normalizing by the number of English characters only makes the numbers independent of overall utterance length and is not critical to the analysis we perform in this paper.

A Potential Confound: Translationese. Working with multi-text, however, does introduce a new bias: each utterance in the corpus has a source language, along with 20 translations of that source utterance into the target languages. The characteristics of translated language have been widely studied and exploited, one prominent characteristic of translations being simplification (Baker, 1993). Note that a significant fraction of the original utterances in the corpus are English. Our analysis may therefore have underestimated the BPEC of the other languages, to the extent that their sentences consist of simplified "translationese." Even so, English had the lowest BPEC among the languages studied.

Experiments and Results
Our experiments are conducted on the 21 languages of the Europarl corpus (Koehn, 2005). The corpus consists of utterances made in the European Parliament, aligned cross-linguistically by unique utterance IDs. With the exception of Finnish, Hungarian, and Estonian, which are Uralic (noted in Table 1), the languages are Indo-European.
While Europarl does not contain quite our desired breadth of typological diversity, it serves our purpose by providing large collections of aligned data across many languages. To create our experimental data, we extract all utterances and randomly sort them into train-development-test splits such that roughly 80% of the data are in train and 10% each in development and test. (Characters appearing fewer than 100 times in train are replaced with the unknown-character symbol.) We also perform experiments on lemmatized text, where we replace every word with its lemma using the UDPipe toolkit (Straka et al., 2016), stripping away its inflectional morphology. We report two evaluation metrics: BPC and BPEC (see §3). Our BPEC measure always normalizes by the length of the original, not lemmatized, English.
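One way to realize such a split is sketched below; the 80/10/10 proportions come from the text, while the seeding and helper name are our own. Because the split is over utterance IDs, the same assignment can be applied in every language, keeping the multi-text aligned across splits.

```python
import random

def split_utterance_ids(ids, seed=0):
    """Shuffle aligned utterance ids and cut them 80/10/10 into
    train, development, and test portions."""
    ids = sorted(ids)                       # deterministic starting order
    random.Random(seed).shuffle(ids)
    n8, n9 = int(0.8 * len(ids)), int(0.9 * len(ids))
    return ids[:n8], ids[n8:n9], ids[n9:]   # train, dev, test
```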
Experimentally, we want to show: (i) when evaluating models in a controlled environment (multi-text under BPEC), the models achieve lower performance on certain languages, and (ii) inflectional morphology is the primary culprit for the performance differences. However, we repeat that we do not in this paper tease apart whether the models are at fault or whether certain languages inherently encode more information.

Discussion and Analysis
We display the performance of the n-gram LM and the LSTM LM under BPC and BPEC for each of the 21 languages in Fig. 1 with full numbers listed in Table 1. There are several main take-aways.
The Effect of BPEC. The first major take-away is that BPEC offers a cleaner cross-linguistic comparison than BPC. Were we to rank the languages by BPC (lowest to highest), we would find English in the middle of the pack, which is surprising, as new language models are often tuned only on English itself. For example, BPC suggests, surprisingly, that French is easier to model than English. Ranking under BPEC, however, shows that the LSTM has the easiest time modeling English itself. The Scandinavian languages Danish and Swedish have the BPEC values closest to English's; these languages are typologically and genetically similar to English.
n-gram versus LSTM. As expected, the LSTM outperforms the baseline n-gram models across the board. In addition, however, n-gram modeling yields relatively poor performance even on languages such as Dutch, whose inflectional morphology is only modestly more complex than English's.
The Impact of Inflectional Morphology. Another major take-away is that rich inflectional morphology is a difficulty for both n-gram and LSTM LMs; in this section we give numbers for the LSTMs. Studying Fig. 1a, we find that Spearman's rank correlation between a language's BPEC and its counting complexity (§2.1) is quite high (ρ = 0.59, significant at p < 0.005). This clear correlation between the level of inflectional morphology and LSTM performance indicates that character-level models do not automatically fix the problem of morphological richness. If we lemmatize the words, however (Fig. 1b), the correlation becomes insignificant and in fact slightly negative (ρ = −0.13, p ≈ 0.56). The difference between the two previous graphs (Fig. 1c) shows more clearly that the LM penalty for modeling inflectional endings is greater for languages with higher counting complexity. Indeed, this penalty is arguably a more appropriate measure of the complexity of the inflectional system. See also Fig. 2.
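The correlation itself is straightforward to compute. A sketch using scipy, assuming per-language dictionaries of BPEC values and counting complexities (the actual per-language values appear in Table 1):

```python
from scipy.stats import spearmanr

def morphology_correlation(bpec_by_lang, complexity_by_lang):
    """Spearman's rho between per-language BPEC and counting complexity."""
    langs = sorted(bpec_by_lang)
    rho, p = spearmanr([bpec_by_lang[l] for l in langs],
                       [complexity_by_lang[l] for l in langs])
    return rho, p

# On words this yields rho = 0.59 (p < 0.005) in the paper;
# on lemmata the correlation vanishes (rho = -0.13, p = 0.56).
```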
The differences in BPEC among languages are reduced when we lemmatize, with standard deviation dropping from 0.065 bits to 0.039 bits. Zooming in on Finnish (see Table 1), we see that Finnish forms are harder to model than English forms, but Finnish lemmata are easier to model than English ones. This is strong evidence that it was primarily the inflectional morphology, which lemmatization strips, that caused the differences in the model's performance on these two languages.

Related Work
Recurrent neural language models can effectively learn complex dependencies, even in open-vocabulary settings (Hwang and Sung, 2017; Kawakami et al., 2017). Whether the models are able to learn particular syntactic interactions is an intriguing question, and methodologies have been presented to tease apart under what circumstances variously-trained models encode attested interactions (Linzen et al., 2016; Enguehard et al., 2017). While the detailed, construction-specific analyses in these papers are surely informative, our evaluation is language-wide.
MT researchers have investigated whether an English sentence contains enough information to predict the fine-grained inflections used in its foreign-language translations (see Kirov et al., 2017). Sproat et al. (2014) present a corpus of close translations of sentences in typologically diverse languages, along with detailed morphosyntactic and morphosemantic annotations, as a means of assessing linguistic complexity for comparable messages, though they expressly do not take an information-theoretic approach to measuring complexity. In the linguistics literature, McWhorter (2001) argues that certain languages are less complex than others: he claims that Creoles are simpler. Müller et al. (2012) compare LMs on Europarl, but do not compare performance across languages.

[Figure 2: Each dot is a language, and its coordinates are the BPEC values for the LSTM LMs over words and lemmata. The top and right margins show kernel density estimates of these two sets of BPEC values. All dots follow the blue regression but stay below the green line (y = x), and the darker dots, which represent languages with higher counting complexity, tend to fall toward the right but not toward the top, since counting complexity is correlated only with the BPEC over words.]

Conclusion
We have presented a clean method for the cross-linguistic comparison of language modeling: we assess whether a language modeling technique can compress a sentence and its translations equally well. We show an interesting correlation between the morphological richness of a language and the performance of the model. In an attempt to establish causation, we also run our models on lemmatized versions of the corpora, showing that, upon the removal of inflection, no such correlation between morphological richness and LM performance exists. It remains unclear, however, whether the performance difference originates from the inherent difficulty of the languages or lies with the models.