Predicting Declension Class from Form and Meaning

The noun lexica of many natural languages are divided into several declension classes with characteristic morphological properties. Class membership is far from deterministic, but the phonological form of a noun and/or its meaning can often provide imperfect clues. Here, we investigate the strength of those clues. More specifically, we operationalize this by measuring how much information, in bits, we can glean about declension class from knowing the form and/or meaning of nouns. We know that form and meaning are often also indicative of grammatical gender—which, as we quantitatively verify, can itself share information with declension class—so we also control for gender. We find for two Indo-European languages (Czech and German) that form and meaning respectively share significant amounts of information with class (and contribute additional information above and beyond gender). The three-way interaction between class, form, and meaning (given gender) is also significant. Our study is important for two reasons: First, we introduce a new method that provides additional quantitative support for a classic linguistic finding that form and meaning are relevant for the classification of nouns into declensions. Second, we show not only that individual declension classes vary in the strength of their clues within a language, but also that these variations themselves vary across languages.


Introduction
To an English speaker learning German, it may come as a surprise that one cannot necessarily predict the plural form of a noun from its singular. This is because pluralizing nouns in English is relatively simple: Usually we merely add an -s to the end (e.g., cat → cats). Of course, not all English nouns follow such a simple rule (e.g., child → children, sheep → sheep, etc.), but those that do not are few in number. Compared to English, German has many common morphological rules for inflecting nouns. For example, some plurals are formed by adding a suffix to the singular: Insekt 'insect' → Insekt-en, Hund 'dog' → Hund-e, Radio 'radio' → Radio-s. For others, the plural is formed by changing a stem vowel: Mutter 'mother' → Mütter, or Nagel 'nail' → Nägel. Some others form plurals with both suffixation and vowel change: Haus 'house' → Häus-er and Koch 'chef' → Köch-e. Still others, like Esel 'donkey', have the same form in plural and singular. The problem only worsens when we consider other inflectional morphology, such as case.
Disparate plural formation and case rules of the kind described above split nouns into declension classes. To know a noun's declension class is to know which morphological form it takes in which context (e.g., Benveniste 1935; Wurzel 1989; Nübling 2008; Ackerman et al. 2009; Ackerman and Malouf 2013). But this raises the question: What clues can we use to predict the class for a noun? In some languages, predicting declension class is argued to be easier if we know the noun's phonological form (Aronoff, 1992; Dressler and Thornton, 1996) or lexical semantics (Carstairs-McCarthy, 1994; Corbett and Fraser, 2000). However, semantic and phonological clues are, at best, only very imperfect hints as to class (Wurzel, 1989; Harris, 1991, 1992; Aronoff, 1992; Halle and Marantz, 1994; Corbett and Fraser, 2000; Aronoff, 2007). Given this, we quantify how much information a noun's form and meaning share with its class, and determine whether that amount of information is uniform across classes.
To do this, we measure the mutual information (Cover and Thomas, 2012) both between declension class and meaning (i.e., distributional semantic vector) and between declension class and form (i.e., orthographic form), as in Figure 1. We select two Indo-European languages (Czech and German) that have declension classes. We find that form and meaning both share significant amounts of information, in bits, with declension class in both languages. We further find that form clues are stronger than meaning clues; for form, we uncover a relatively large effect of 0.5-0.8 bits, while, for lexical semantics, a moderate one of 0.3-0.5 bits. We also measure the three-way interaction between form, meaning, and class, finding that phonology and semantics contribute overlapping information about class. Finally, we analyze individual inflection classes and uncover that the amount of information they share with form and meaning is not uniform across classes or languages.

Declension Classes in Language
The morphological behavior of declension classes is quite complex. Although various factors are undoubtedly relevant, we focus on phonological and lexical semantic ones here. We have ample reason to suspect that phonological factors might affect class predictability. In the most basic sense, the form of an inflectional suffix is often altered based on the identity of the final segment of the stem. For example, the English plural suffix is spelled as -s after most consonants, as in cats, but as -es if it appears after an s, sh, z, ch, etc., as in 'mosses', 'rushes', 'quizzes', and 'beaches'. Often, differences such as these in the spelling of plural affixes or declension class affixes are due to phonological rules that are noisily realized in orthography; there could also be regularities between form and class that do not correspond to phonological rules but still have an effect. For example, statistical regularities over phonological segments in continuous speech guide first-language acquisition (Maye et al., 2002), even over non-adjacent segments (Newport and Aslin, 2004). Statistical relationships have also been uncovered between the sounds in a word and the word's syntactic category (Farmer et al., 2006; Monaghan et al., 2007; Sharpe and Marantz, 2017), and between the orthographic form of a word and its argument structure valence (Williams, 2018). Thus, we expect the form of a noun to provide clues to declension class.
Semantic factors, too, are often relevant for determining certain types of morphologically relevant classes, such as grammatical gender, which is known to be related to declension class. It has been claimed that there are only two types of gender systems: semantic systems (where only semantic information is required) and formal systems (where morphological and phonological factors are relevant in addition to semantic information) (Corbett and Fraser, 2000, 294). Moreover, a large typological survey (Qian et al., 2016) finds that meaning-sensitive grammatical properties, such as gender and animacy, can be decoded well from distributional word representations for some languages, but less well for others. These examples suggest that it is worth investigating whether noun semantics provides clues about declension class.

Orthography as a proxy for phonology?
We motivate an investigation into the relationship between the form of a word and its declension class by appealing, at least partly, to phonological motivations. However, we make the simplifying assumption that phonological information is adequately captured by orthographic word forms, i.e., strings of written symbols, which are also known as graphemes. In general, one should question this assumption (Vachek, 1945; Luelsdorff, 1987; Sproat, 2000, 2012; Neef et al., 2012). For the particular languages we investigate here, Czech and German, it is less problematic, as they have fairly "transparent" mappings between spelling and pronunciation (Matějček, 1998; Miles, 2000; Caravolas and Volín, 2001), which enables them to achieve higher performance on grapheme-to-phoneme conversion than English and other "opaque" orthographic systems (Schlippe et al., 2012). These studies suggest that we are justified in taking orthography as a proxy for phonological form. Nonetheless, to guard against any phonological information being inaccurately represented in the orthographic form (e.g., vowel lengthening in German), several of our authors, who are fluent reader-annotators of our languages, checked our classes for any unexpected phonological variations.
We exhibit examples in §3.

Distributional Lexical Semantics
We adopt a distributional approach to lexical semantics (Harris 1954;Mitchell and Lapata 2010;Turney and Pantel 2010;Bernardi et al. 2015;Clark 2015; inter alia) that relies on pretrained word embeddings for this paper. We do this for multiple reasons: First, distributional semantic approaches to create word vectors, such as WORD2VEC (Mikolov et al., 2013), have been shown to do well at extracting lexical features such as animacy and taxonomic information (Rubinstein et al., 2015) and can also recognize semantic anomaly (Vecchi et al., 2011). Second, the distributional approach to lexical meaning yields a straightforward procedure for extracting "meaning" from text corpora at scale.

Controlling for grammatical gender?
Grammatical gender has been found to interact with lexical semantics (Schwichtenberg and Schiller, 2004; Williams et al., 2019, 2020), and often can be determined from form (Brooks et al., 1993; Dobrin, 1998; Frigo and McDonald, 1998; Starreveld and La Heij, 2004). This means that it cannot be ignored in the present study. While the precise nature of the relationship between declension class and gender is far from clear, it is well established that the two should be distinguished (Aronoff 1992; Wiese 2000; Kürschner and Nübling 2011; inter alia). We first measure the amount of information shared between gender and class, according to the methods described in §4, to verify that the predicted relationship exists. We then verify that gender and class overlap in information in German and Czech to a high degree, but that we cannot reduce one to the other (see Table 3 and §6). We proceed to control for gender, and subsequently measure how much additional information form and meaning provide about declension class.

Data
For our study, we need orthographic forms of nouns, their associated word vectors, and their declension classes. Orthographic forms can be found in any large text corpus or dictionary. We isolate noun lexemes (i.e., syntactic category-specific representations of words) by language. We select Czech nouns from UniMorph (Kirov et al., 2018) and German nouns from CELEX2 (Baayen et al., 1995). For lexical semantics, we trained 300-dimensional WORD2VEC vectors on language-specific Wikipedia. 2 We select the nominative singular form as the donor for both orthographic and lexical semantic representations because it is the lemma in Czech and German. It is also usually the stem for the rest of the morphological paradigm. We restrict our investigation to monomorphemic lexemes because: (i) one stem can take several affixes, which would multiply its contribution to the results, and (ii) certain affixes come with their own class. 3 Compared to form and meaning, declension class is harder to come by, because it requires linguistic annotation. We associated lexemes with their classes on a by-language basis by relying on annotations from fluent speaker-linguists, either for class determination (for Czech) or for verifying existing dictionary information (for German). For Czech, declension classes were derived by an edit distance heuristic over affix forms, which grouped lemmata into subclasses if they received the same inflectional affixes (i.e., they constituted a morphological paradigm).
If orthographic differences between two sets of suffixes in the lemma form could be accounted for by positing a phonological rule, then the two sets were collapsed into a single set; for example, in the "feminine -a" declension class, we collapsed forms for which the dative singular suffix surfaces as -e following a coronal continuant consonant (figurka:figurce 'figurine.DAT.SG'), as -i following a palatal nasal (pirana:pirani 'piranha.DAT.SG'), and as -ě following all other consonants (kráva:krávě 'cow.DAT.SG'). As for meaning, descriptively, gender is roughly a superset of declension classes in Czech; among the masculine classes, animacy is a critical semantic feature, whereas form seems to matter more for the feminine and neuter classes. For German, nouns came morphologically parsed and lemmatized, as well as coded for class, in CELEX2. We also use CELEX2 to isolate monomorphemic noun lexemes and bin them into classes; however, CELEX2 declension classes are more fine-grained than traditional descriptions of declension class. Mappings between CELEX2 classes and traditional linguistic descriptions of declension class (Alexiadou and Müller, 2008) are provided in Table 4 in the Appendix. The CELEX2 declension class identifier scheme has multiple subparts. Each declension class identifier includes: (i) a number prefix ('S' for singular, 'P' for plural); (ii) a morphological form identifier, where zero refers to paradigmatically missing forms (e.g., the plural is zero for singularia tantum nouns) and other numbers refer to particular morphological processes (e.g., the genitive applies an additional suffix for singular masculine nouns, but never for feminines); and (iii) an optional 'u' identifier, which refers to vowel umlaut, if present. More details of the German preprocessing steps are in the Appendix.
After associating nouns with forms, meanings, and classes, we perform exclusions: Because frequency affects class entropy (Parker and Sims, 2015), we removed all classes with fewer than 20 lexemes. 4 We subsequently removed all lexemes which did not appear in our WORD2VEC models trained on Wikipedia dumps. The final tally of Czech yields 2672 nouns in 13 declension classes, and the final tally of German yields 3684 nouns in 16 declension classes, which can be broken into 3 types of singular and 7 types of plural. Table 5 in the Appendix provides final lexeme counts by declension class.
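The two exclusion steps above can be sketched in a few lines; the function and argument names here are illustrative, not the code used in our experiments:

```python
from collections import Counter

def filter_lexemes(lexemes, embedding_vocab, min_class_size=20):
    """Apply the two exclusion steps: drop declension classes with fewer
    than `min_class_size` lexemes, then drop lexemes that lack a trained
    word2vec vector. `lexemes` is a list of (form, declension_class) pairs."""
    class_sizes = Counter(c for _, c in lexemes)
    kept = [(w, c) for w, c in lexemes if class_sizes[c] >= min_class_size]
    return [(w, c) for w, c in kept if w in embedding_vocab]
```

Note that the order matters: class sizes are computed before the vocabulary filter, matching the description above.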
The remaining lexemes were split into 10 folds: one for testing, another for validation, and the remaining eight for training. Table 1 shows train-validation-test splits, average length of nouns, and number of declension classes, by language.
4 We ran another version of our models that included all the original classes and observed no notable differences.
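A minimal sketch of this 10-fold split (the seed and function name are illustrative):

```python
import random

def ten_fold_split(lexemes, seed=0):
    """Shuffle and partition into 10 folds: one fold for testing, one for
    validation, and the remaining eight for training."""
    items = list(lexemes)
    random.Random(seed).shuffle(items)
    folds = [items[i::10] for i in range(10)]
    test, valid = folds[0], folds[1]
    train = [x for fold in folds[2:] for x in fold]
    return train, valid, test
```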

Methods
Notation. We define each lexeme in a language as a triple. Specifically, the i-th triple consists of an orthographic word form w_i, a distributional semantic vector v_i that encodes the lexeme's semantics, and a declension class c_i. We assume these triples follow an (unknown) probability distribution p(w, v, c), which can be marginalized to obtain p(c), for example. We take the space of word forms to be the Kleene closure over a language's alphabet Σ; thus, we have w_i ∈ Σ*. The space of declension classes is language-specific and contains as many elements as the language has classes, i.e., C = {1, . . . , K}, where c_i ∈ C. For each noun, a gender g_i from a language-specific space of genders G is associated with the lexeme. In both Czech and German, G contains three genders: feminine, masculine, and neuter. We also consider four random variables: a Σ*-valued random variable W, an R^d-valued random variable V, a C-valued random variable C, and a G-valued random variable G.
Bipartite Mutual Information. Bipartite MI (or, simply, MI) is a symmetric quantity that measures how much information (in bits) two random variables share. In the case of C (declension class) and W (orthographic form), we have

MI(C; W) = H(C) − H(C | W)  (1)

As can be seen, MI is the difference between an unconditional and a conditional entropy. The unconditional entropy is defined as

H(C) = −Σ_{c ∈ C} p(c) log₂ p(c)  (2)

and the conditional entropy is defined as

H(C | W) = −Σ_{c ∈ C} Σ_{w ∈ Σ*} p(c, w) log₂ p(c | w)  (3)

The mutual information MI(C; W) naturally encodes how much the orthographic word form tells us about its corresponding lexeme's declension class. Likewise, to measure the interaction between declension class and lexical semantics, we also consider the bipartite mutual information MI(C; V).
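For random variables over small discrete spaces (such as C and G), bipartite MI can be estimated directly from counts. A minimal sketch, using the equivalent identity MI(X; Y) = H(X) + H(Y) − H(X, Y) (function names are illustrative):

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy in bits of the empirical distribution given by counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def bipartite_mi(pairs):
    """Plug-in MI(X; Y) in bits from a list of (x, y) observations."""
    h_x = entropy(Counter(x for x, _ in pairs).values())
    h_y = entropy(Counter(y for _, y in pairs).values())
    h_xy = entropy(Counter(pairs).values())
    return h_x + h_y - h_xy
```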
Tripartite Mutual Information. To consider the interaction between three random variables at once, we need to generalize MI to three variables. One can calculate tripartite MI as follows:

MI(C; W; V) = MI(C; W) − MI(C; W | V)  (4)

As can be seen, tripartite MI is the difference between a bipartite MI and a conditional bipartite MI. The conditional bipartite MI is defined as

MI(C; W | V) = H(C | V) − H(C | W, V)  (5)

Essentially, Equation 4 is the difference between how much C and W interact and how much they interact after "controlling" for the meaning V. 5

Controlling for Gender. Working with mutual information also gives us a natural way to control for quantities that we know influence meaning and form. We do this by considering conditional MI. We consider both bipartite and tripartite conditional mutual information. These are defined as follows:

MI(C; W | G) = H(C | G) − H(C | W, G)  (6)

MI(C; W; V | G) = MI(C; W | G) − MI(C; W | V, G)  (7)

Estimating these quantities tells us how much C and W (and, in the case of tripartite MI, V also) interact after we take G (the grammatical gender) out of the picture. Figure 1 provides a graphical summary of this section up to this point.
Normalization. To further contextualize our results, we consider two normalization schemes for MI. Normalizing renders MI estimates across languages more directly comparable (Gates et al., 2019). We consider the normalized mutual information, i.e., the fraction of the unconditional entropy that the mutual information accounts for:

NMI(C; W) = MI(C; W) / min{H(C), H(W)}

This yields a percentage of the entropy that the mutual information accounts for, a more interpretable notion of the predictability between class and form or meaning. In practice, H(C) ≤ H(W) in most cases, and our normalized mutual information is termed the uncertainty coefficient (Theil, 1970):

U(C | W) = MI(C; W) / H(C)

5 We emphasize here the subtle, but important, typographic distinction between MI(C; W; V) and MI(C; W, V). (The difference in notation lies in the comma replacing the semicolon.) While the first (tripartite MI) measures the amount of (redundant) information shared by the three variables, the second (bipartite) measures the (total) information that class shares with either the form or the lexical semantics.
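As an arithmetic sanity check, the uncertainty coefficient can be recovered from the class-gender estimates reported in §6 (conditional entropies of 1.35 and 2.17 bits, and MI of roughly 1.4 and 0.75 bits, for Czech and German respectively); a sketch with those approximate figures:

```python
def uncertainty_coefficient(mi, h_c):
    """U(C | G) = MI(C; G) / H(C), where H(C) = H(C | G) + MI(C; G)."""
    return mi / h_c

# Approximate estimates reported in Section 6 (Table 3) of this paper:
czech_u = uncertainty_coefficient(1.4, 1.35 + 1.4)     # roughly 51%
german_u = uncertainty_coefficient(0.75, 2.17 + 0.75)  # roughly 26%
```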

Computation and Approximation
In order to estimate the mutual information quantities of interest per §4, we need to estimate a variety of entropies. We derive our mutual information estimates from a corpus D of lexeme triples.

Plug-in Estimation of Entropy
The most straightforward quantity to estimate is H(C). Given a corpus, we may use plug-in estimation: We compute the empirical distribution over declension classes from D. Then, we plug that empirical distribution over declension classes C into the formula for entropy in Equation 2. This estimator is biased (Paninski, 2003), but is a suitable choice given that we have only a few declension classes and a large amount of data. Future work will explore whether choice of estimator (Miller, 1955; Hutter, 2001; Archer et al., 2013, 2014) could affect the conclusions of studies such as this one.
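The downward bias of the plug-in estimator (Paninski, 2003) is easy to see empirically; a small simulation sketch, drawing small samples from a uniform distribution over four classes:

```python
import math
import random
from collections import Counter

def plugin_entropy(labels):
    """Plug-in estimate of entropy (bits) from observed class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

rng = random.Random(0)
true_entropy = math.log2(4)  # 2 bits for a uniform 4-class distribution
estimates = [plugin_entropy([rng.randrange(4) for _ in range(10)])
             for _ in range(2000)]
mean_estimate = sum(estimates) / len(estimates)
# On average, the plug-in estimate falls below the true entropy for small
# samples; with many lexemes per class, the bias shrinks.
```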

Model-based Estimation of Entropy
In contrast, estimating H(C | W) is non-trivial. We cannot simply apply plug-in estimation because we cannot compute the infinite sum over Σ* that is required. Instead, we follow previous work (Brown et al., 1992; Pimentel et al., 2019) in using the cross-entropy upper bound to approximate H(C | W) with a model. More formally, for any probability distribution q(c | w), we have

H(C | W) ≤ H_q(C | W) = −Σ_{c ∈ C} Σ_{w ∈ Σ*} p(c, w) log₂ q(c | w)

To circumvent the need for infinite sums, we use a held-out sample D̃ from D to approximate the true cross-entropy H_q(C | W) with the following quantity:

Ĥ_q(C | W) = −(1 / |D̃|) Σ_{(c,w) ∈ D̃} log₂ q(c | w)

where we assume the held-out data is distributed according to the true distribution p; under that assumption, Ĥ_q(C | W) is an unbiased estimate of H_q(C | W). While the exposition above focuses on learning a distribution q(c | w) for classes and forms to approximate H(C | W), the same methodology can be used to estimate all necessary conditional entropies.
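The held-out estimate Ĥ_q takes only a few lines to compute; in this sketch, `model_q` is a hypothetical callable returning the model probability q(c | w):

```python
import math

def heldout_cross_entropy(model_q, heldout_pairs):
    """Estimate the cross-entropy H_q(C | W) in bits as the average negative
    log-probability the model assigns to held-out (class, form) pairs."""
    total = -sum(math.log2(model_q(c, w)) for c, w in heldout_pairs)
    return total / len(heldout_pairs)
```

If q matches the true conditional distribution, this converges to H(C | W); any other q yields a larger value, which is what makes it an upper bound.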
Form and gender: q(c | w, g). We train one LSTM classifier (Hochreiter and Schmidhuber, 1996) for each language. The last hidden state of the LSTM models is fed into a linear layer and then a softmax non-linearity to obtain probability distributions over declension classes. To condition our model on gender, we embed each gender and feed it into each LSTM's initial hidden state.
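A minimal PyTorch sketch of this architecture; layer sizes and names are illustrative, not the tuned configuration from our experiments:

```python
import torch
import torch.nn as nn

class FormClassifier(nn.Module):
    """Character-level LSTM over the word form; a gender embedding supplies
    the initial hidden state; a linear layer plus softmax yields a
    distribution over declension classes."""
    def __init__(self, n_chars, n_genders, n_classes, char_dim=64, hidden=128):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.gender_emb = nn.Embedding(n_genders, hidden)
        self.lstm = nn.LSTM(char_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, chars, gender):
        # chars: (batch, seq_len) character ids; gender: (batch,) gender ids
        h0 = self.gender_emb(gender).unsqueeze(0)  # (1, batch, hidden)
        c0 = torch.zeros_like(h0)
        _, (h_n, _) = self.lstm(self.char_emb(chars), (h0, c0))
        return torch.log_softmax(self.out(h_n[-1]), dim=-1)
```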
Meaning and gender: q(c | v, g). We trained a simple multilayer perceptron (MLP) classifier to predict the declension class from the WORD2VEC representation. When conditioning on gender, we again embed each gender class, concatenating these embeddings with the WORD2VEC ones before feeding the result into the MLP.
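A corresponding sketch of the MLP over the concatenation of a word vector and a gender embedding (again, sizes and names are illustrative):

```python
import torch
import torch.nn as nn

class MeaningClassifier(nn.Module):
    """MLP from a word2vec vector, concatenated with a gender embedding,
    to a distribution over declension classes."""
    def __init__(self, vec_dim, n_genders, n_classes, gender_dim=16, hidden=128):
        super().__init__()
        self.gender_emb = nn.Embedding(n_genders, gender_dim)
        self.mlp = nn.Sequential(
            nn.Linear(vec_dim + gender_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, vectors, gender):
        # vectors: (batch, vec_dim) word2vec rows; gender: (batch,) gender ids
        x = torch.cat([vectors, self.gender_emb(gender)], dim=-1)
        return torch.log_softmax(self.mlp(x), dim=-1)
```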
Form, meaning, and gender: q(c | w, v, g). We again trained two LSTM classifiers, but this time, also conditioned on meaning (i.e., WORD2VEC). Before training, we reduce the dimensionality of the WORD2VEC embeddings from 300 to k dimensions by running PCA on each language's embeddings. We then linearly transformed them to match the hidden size of the LSTMs, and fed them in. To also condition on gender, we followed the same procedures, but used half of each LSTM's initial hidden state for each vector (i.e., WORD2VEC and one-hot gender embeddings).
Optimization. We trained all classifiers using Adam (Kingma and Ba, 2015); the code was implemented in PyTorch. Hyperparameters (number of training epochs, hidden sizes, PCA compression dimension k, and number of layers) were optimized using Bayesian optimization with a Gaussian process prior (Snoek et al., 2012). We explore a maximum of 50 models for each experiment, maximizing the expected improvement on the validation set.

An Empirical Lower Bound on MI
With our empirical approximations of the desired entropy measures, we can calculate the desired approximated MI values, e.g.,

MI(C; W | G) ≈ Ĥ(C | G) − Ĥ_q(C | W, G)

where Ĥ(C | G) is the plug-in estimate of the entropy. Such an approximation, though, is not ideal, since we do not know whether it approximates the true MI from above or from below. Since we use a plug-in estimator for Ĥ(C | G), which underestimates entropy, and since Ĥ_q(C | W, G) is estimated with a cross-entropy upper bound, we have

MI(C; W | G) ≥ Ĥ(C | G) − Ĥ_q(C | W, G)

We note that these are expected lower bounds, i.e., they are exact when taking an expectation under the true distribution p. We cannot make a similar statement about tripartite MI, though, since it is computed as the difference of two lower-bound approximations of true mutual information quantities.

Results
Our main experimental results are presented in Table 2. We find that both form and lexical semantics significantly interact with declension class in both Czech and German (each p < 0.01). 6 We observe that our estimates of MI(C; W | G) are larger (0.5-0.8 bits) than our estimates of MI(C; V | G) (0.3-0.5 bits). We also observe that the MI estimates in Czech are higher than in German. However, we caution that the unnormalized estimates for the two languages are not fully comparable because they hail from models trained on different amounts of data. The tripartite MI estimates between class, form, and meaning were relatively small (0.2-0.35 bits) for both languages. We interpret this finding as showing that much of the information contributed by form is not redundant with information contributed by meaning, although a substantial amount is.
6 All results in this section were significant for both languages, according to a Welch (1947) t-test, which yielded p < 0.01 after Benjamini and Hochberg's correction. Welch's (1947) t-test differs from Student's (1908) t-test in that the latter assumes equal variances and the former does not, making it preferable (see Delacre et al. 2017).

[Table 2, reporting MI estimates for Form & Declension Class (LSTM) and Meaning & Declension Class (MLP), appears here.]

Table 3: MI between class and gender, MI(C; G): H(C) is class entropy, H(C | G) is class entropy given gender, U(C | G) is the uncertainty coefficient.
As a final sanity check, we measure the mutual information between class and gender, MI(C; G) (see Table 3). For both languages, the mutual information between declension class and gender is significant. Our MI estimates range from approximately 3/4 of a bit in German up to 1.4 bits in Czech, which respectively amount to nearly 25% and nearly 51% of the unconditional entropy. Like the quantities discussed in §4, this MI was estimated using simple plug-in estimation. Remember, if class were entirely reducible to gender, the conditional entropy of class given gender would be zero. This is not the case: Although the conditional entropy of class given gender is lower for Czech (1.35 bits) than for German (2.17 bits), in neither case is declension class informationally equivalent to the language's grammatical gender system.

Discussion and Analysis
Next, we ask whether individual declension classes differ in how idiosyncratic they are, e.g., does any one German declension class share less information with form than the others? To address this, we qualitatively inspect per-class half-pointwise mutual information in Figure 2a-2b. See Table 5 in the Appendix for the five highest and lowest surprisal examples per model. Several qualitative trends were observed: (i) classes show a decent amount of variability, (ii) unconditional entropy for each class is inversely proportional to the class' size, (iii) half-pointwise MI is higher on average for Czech than German, and (iv) classes that have high MI(C = c; V | G) usually have high MI(C = c; W | G) (with a few notable exceptions we discuss below).
Czech. In general, declension classes associated with masculine nouns (g = MSC) have smaller MI(C = c; W | G) than classes associated with feminine (g = FEM) and neuter (g = NEU) ones of a comparable size, the exception being 'special, masculine, plural -ata'. This class ends exclusively in -e or -ě, which might contribute to that class's higher MI(C = c; W | G). That MI(C = c; W | G) is high for feminine and neuter classes suggests that the overall MI(C; W | G) results might be largely driven by these classes, which predominantly end in vowels. We also note that the high MI(C = c; W | G) for feminine 'plural -e' might be driven by the many Latin or Greek loanwords present in this class.
German. Tripartite MI is fairly idiosyncratic in German: The lowest quantity comes from the smallest class, S1/P2u. S1/P3, a class with low MI(C = c; V | G) from above, also has low tripartite MI. We speculate that S1/P3 could be a sort of "catch-all" class with no clear regularities. The highest tripartite MI comes from S1/P4, which also had high MI(C = c; V | G). The existence of significant tripartite MI results suggests that submorphemic meaning-bearing units, or phonaesthemes, might be present. Taking inspiration from Pimentel et al. (2019), who aim to automatically discover such units, we observe that many words in S1/P4 contain the letters {d, e, g, i, l}, often in identically ordered orthographic sequences, such as Bild, Biest, Feld, Geld, Glied, Kind, Leib, Lied, Schild, Viech, Weib, etc. While these letters are common in German orthography, their noticeable presence suggests that further elucidation of declension classes in the context of phonaesthemes could be warranted.

Conclusion
We adduce new evidence that declension class membership is neither wholly idiosyncratic nor fully deterministic given form or meaning in Czech and German. We quantify mutual information and find estimates which range from 0.2 bits to nearly one bit. Despite their relatively small magnitudes, our estimates of mutual information between class and form accounted for between 25% and 60% of the class entropy, even after relevant controls, and MI between class and meaning accounted for between 13% and nearly 40%. We analyze results per class, and find that classes vary in how much information they share with meaning and form. We also observe that classes that have high MI(C = c; V | G) often have high MI(C = c; W | G), with a few noted exceptions that have specific orthographic (e.g., German umlauted plurals) or semantic (e.g., Czech masculine animacy) properties. In sum, this paper has proposed a new information-theoretic method for quantifying the strength of morphological relationships, and applied it to declension class. We verify and build on existing linguistic findings by showing that the mutual information quantities between declension class, orthographic form, and lexical semantics are statistically significant.

A Further Notes on Preprocessing
The breakdown of our declension classes is given in Table 4. We will first discuss more details about our preprocessing and linguistic analysis for Czech, and then for German.
Czech. The Czech classes were initially derived from an edit-distance heuristic between nouns. A fluent speaker-linguist then identified major noun classes by grouping together nouns with shared suffixes in the surface (orthographic) form. If the differences between two sets of suffixes in the surface form could then be accounted for by positing a basic phonological rule-for example, vowel shortening in monosyllabic words-then the two sets were collapsed.
Among masculine nouns, four large classes were identified that seemed to range from "very animate" to "very inanimate." The morphological divisions between these classes were very systematic, but there was substantial overlap: dat.sg and loc.sg differentiated 'animate1' from 'animate2', 'inanimate1', and 'inanimate2'; acc.sg, nom.pl, and voc.pl differentiated 'animate2' from 'inanimate1' and 'inanimate2'; and gen.sg differentiated 'inanimate1' from 'inanimate2' (see Figure 3). Further subdivisions were made within the two animate classes for the apparent idiosyncratic nominative plural suffix, and within the 'inanimate2' class, where nouns took either -u or -e as the genitive singular suffix. This division may have once reflected a final palatal on nouns taking -e in the genitive singular case, but this distinction has since been lost. All nouns in the 'inanimate2' "soft" class end in coronal consonants, whereas nouns in the 'inanimate1' "hard" class have a variety of final consonants.
Among feminine nouns, the 'feminine -a' class contained all feminine words that ended in -a in the nominative singular form. (Note that there exist masculine nouns ending in -a, but these did not pattern with the 'feminine -a' class). The 'feminine pl -e' class contained feminine nouns ending in -e, -ě, or a consonant, and as the name suggests, had the suffix -e in the nominative plural form. The 'feminine pl -i' class contained feminine nouns ending in a consonant and had the suffix -i in the nominative plural form. No feminine nouns ended in a dorsal consonant.
Among neuter nouns, all words ended in a vowel.
German. After extracting declension classes from CELEX2, we made some additional preprocessing decisions for German, usually based on orthographic or other considerations. For example, we combined classes S1 with S4, P1 with P7, and P6 with P3, because the difference between the members of each pair lies solely in spelling (a final <s> is doubled in the spelling when GEN.SG -(e)s or PL -(e)n is attached). Whether a given singular, say S1, becomes inflected as P1 or P2 (or, for that matter, the corresponding umlauted versions of these plural classes) is phonologically conditioned (Alexiadou and Müller, 2008). If the stem ends in a trochee whose second syllable consists of schwa plus /n/, /l/, or /r/, the schwa is not realized, i.e., it gets P2; otherwise it gets P1. For this phonological reason, we also chose to collapse P1 and P2.
We also collapsed all loan classes (i.e., those with P8-P10) under one plural class 'Loan'. This choice resulted in us merging loans with Greek plurals (like P9, Myth-os / Myth-en) with those with Latin plurals (like P8, Maxim-um / Maxim-a and P10, Trauma / Trauma-ta). This choice might have unintended consequences on the results, as the orthography of Latin and Greek differ substantially from each other, as well as from the native German orthography, and might be affecting our measure of higher form-based MI for S1/Loan and S3/Loan classes in Table 3 of the main text. One could reasonably make a different choice, and instead remove these examples from consideration, as we did for classes with fewer than 20 lemmata.

B Some prototypical examples
To explore which examples across classes might be most prototypical, we sampled the five highest and five lowest surprisal examples per model, shown in Table 5. We observe that the lowest surprisal forms for each language generally come from a single class: 'feminine -a' for Czech and S3/P3 for German. These two classes were among the largest, having lower class entropy, and both contained feminine nouns. Forms with higher surprisal generally came from several smaller classes, and were predominately masculine. This sample size is small, however, so it remains to be investigated whether this tendency in our data reflects a genuine statistically significant relationship between gender, class size, and surprisal.