Conditioning, but on Which Distribution? Grammatical Gender in German Plural Inflection

Grammatical gender is a consistent and informative cue to the plural class of German nouns. We find that neural encoder-decoder models learn to rely on this cue to predict plural class, but adult speakers are relatively insensitive to it. This suggests that the neural models are not an effective cognitive model of German plural formation.


Introduction
In recent years, neural models of natural language have proven to be powerful statistical learners, capable of representing linguistic patterns and the conditions under which they generalize to new forms (e.g. Kirov and Cotterell, 2018). Artificial language learning experiments show that humans are also statistical learners: when patterns appear consistently with certain cues in the input, speakers consistently rely on those cues to generalize patterns to new forms (Newport, 2016).
Our research examines how two different statistical learners, neural encoder-decoder (ED) models and adult German speakers, use the cue of grammatical gender in plural inflection of novel words. Gender has a high statistical association with plural suffix: the feminine noun Wahl ("vote") is Wahlen in the plural, but the rhyming neuter noun Mal ("time") has the plural form Male. We expect that both speakers and the ED model will produce distributions over plural forms which are heavily conditioned on the gender of the input word. We find that the neural model is highly sensitive to grammatical gender; however, speaker productions, while slightly influenced by gender, appear more consistent with a distribution over plural suffixes which is unconditioned on gender. This surprising result suggests that, even though gender is a very informative cue to plural class, speakers may preferentially attend to different cues.

Background
German plural inflection is realized by five major suffixes, none of which commands a majority in either type or token frequency. The two most frequent suffixes, -e and -(e)n, each apply to roughly 35-40% of German nouns (Figure 1, upper). With no majority class, how can a learner determine which pattern to generalize to new words? One major cue to plural class comes from a noun's grammatical gender. The strong statistical association between grammatical gender and plural inflection class is widely recognized in the literature (Zaretsky et al., 2013; Yang, 2016; Williams et al., 2020, to cite some recent examples), and readily apparent in Figure 1 (lower), which shows the distribution of plural suffixes by noun gender in the UniMorph corpus (Kirov et al., 2016). Of the two most frequent plural suffixes, -(e)n is highly associated with feminine nouns, and -e with nonfeminine (masculine and neuter); the tendency is so strong that some researchers have analyzed these suffixes as gender-conditioned "defaults" (Indefrey, 1999; Laaha et al., 2006). From this perspective, grammatical gender provides a highly consistent cue to plural class membership, which ought to inform a statistical learner's generalizations. Furthermore, grammatical gender is expressed on the article preceding a noun, and this initial position is perceptually salient to speakers (Frigo and McDonald, 1998). These properties suggest that grammatical gender should influence how speakers inflect novel nouns, and indeed, several wug tests 2 on German-speaking adults have found significant effects of gender (Köpcke 1988; Zaretsky and Lange 2016; though cf. Marcus et al. 1995). Based on the artificial language learning literature (e.g. Newport, 2016), we might expect speakers to display conditional probability matching on novel German nouns, such that the probability of a noun taking a certain plural inflection (in particular, the two highly frequent classes -e and -(e)n) depends upon its grammatical gender.
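The contrast between an overall (unconditioned) suffix distribution and a gender-conditioned one can be made concrete with a small sketch. The lexicon below is a hypothetical toy list invented for illustration, not UniMorph data, and the function names are likewise assumptions; a probability-matching learner would be expected to reproduce one of these two distributions in its productions.

```python
from collections import Counter, defaultdict

# Hypothetical toy lexicon of (gender, plural suffix) pairs.
# These counts are invented for illustration and are NOT UniMorph statistics.
toy_lexicon = [
    ("f", "-(e)n"), ("f", "-(e)n"), ("f", "-(e)n"), ("f", "-e"),
    ("m", "-e"), ("m", "-e"), ("m", "-(e)n"), ("m", "-er"),
    ("n", "-e"), ("n", "-e"), ("n", "-er"), ("n", "-s"),
]


def overall_distribution(lexicon):
    """Return P(suffix), ignoring gender."""
    counts = Counter(suffix for _, suffix in lexicon)
    total = sum(counts.values())
    return {suffix: count / total for suffix, count in counts.items()}


def gender_conditioned_distribution(lexicon):
    """Return P(suffix | gender) as a nested dict keyed by gender."""
    by_gender = defaultdict(Counter)
    for gender, suffix in lexicon:
        by_gender[gender][suffix] += 1
    return {
        gender: {s: c / sum(counts.values()) for s, c in counts.items()}
        for gender, counts in by_gender.items()
    }


overall = overall_distribution(toy_lexicon)
conditioned = gender_conditioned_distribution(toy_lexicon)
```

In this toy example the two views diverge: -(e)n covers a third of all nouns overall but three quarters of the feminine nouns, which is the kind of gap that separates an unconditioned learner from a gender-conditioned one.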
Neural encoder-decoder (ED) models have recently been proposed for consideration as models of speaker cognition (Kirov and Cotterell, 2018). This has prompted investigation into the extent to which these models capture speaker behavior (Corkery et al., 2019;King et al., 2020;McCurdy et al., 2020). Earlier work suggests that neural models of German plural inflection are sensitive to grammatical gender: Goebel and Indefrey (2000) found that a simple recurrent network learned to favor -e plurals for masculine nouns, and -(e)n when the same nouns were presented as feminine gender. We hypothesize that neural models and adult speakers are equally capable of using the information available from grammatical gender to predict number inflection. We expect both to demonstrate similar gender-conditioned probability matching to the distribution shown in Figure 1 (lower), resulting in a majority use of -(e)n for feminine nouns, and -e for masculine and neuter nouns.

Method
To compare how grammatical gender influences plural inflection for German speakers and neural models, we use a parallel production task on nonce words (a wug test) for both speakers and model. Our study largely follows the data collection and modeling procedures of McCurdy et al. (2020).
Stimuli We use the 24 made-up nouns developed by Marcus et al. (1995), listed in Appendix A. By design, these nouns lack strong phonological cues to plural class. In their original study, Marcus et al. did not find a significant effect of grammatical gender; however, Zaretsky and Lange (2016) used the same stimuli and reported gender effects in the expected direction: participants used -(e)n more on feminine nouns, and -e more for nonfeminine nouns. Zaretsky and Lange speculate that these discrepant findings stem from differences in the two study designs: scale (the earlier study had 48 participants, the later one 585) and task (acceptability ratings vs. elicited productions). A third differentiating factor is the presence of semantic cues in the Marcus et al. study, which provided sentence contexts around the nonce words; for example, a sentence like Die grünen BRALS sind billiger ("The green brals are cheaper") would imply that the nonce word Bral referred to an object, whereas Die BRALS sind ein bißchen komisch ("The Brals are a bit weird") would imply that Bral was a family name. As adult learners can attend to formal and semantic cues under different conditions (Culbertson et al., 2017), it's possible that this manipulation directed participant focus toward semantic cues rather than grammatical gender. Zaretsky and Lange provided no semantic context in their experiment, only presenting the indefinite article and word form to participants (e.g. Ein Bral, "a [masculine/neuter] bral"). Our experimental design for both speakers and the neural model is closer to that of Zaretsky and Lange (2016): we elicit plural form productions and provide no semantic cues. This suggests we might also expect to find a robust effect of grammatical gender for these stimuli.
2 A wug test is the task of inflecting a novel word (e.g. "wug" in English). This task lets researchers observe which inflectional variants speakers use (e.g. plural "wugs"; Berko, 1958).
Human data collection We collected production data from 92 native German speakers through an online survey. Participants saw each noun in the singular with a definite article indicating grammatical gender (e.g. Der Bral for masculine, Das Bral neuter, Die Bral feminine), and typed a plural-inflected form. Participants were randomly assigned to one of three lists. Grammatical gender was counterbalanced within lists (each participant saw 8 feminine, 8 masculine, and 8 neuter nouns) and across lists (each noun appeared with a different gender in each list).
Encoder-decoder model A neural encoder-decoder (ED) model encodes an input sequence into a fixed vector representation and then incrementally decodes it into a corresponding output sequence (Sutskever et al., 2014). We follow other recent work in using the architecture of Kann and Schütze (2016), which has been proposed for cognitive modeling (Kirov and Cotterell, 2018).
For the task of German number inflection, the ED takes as input a character sequence representing the singular nominative form of a noun, preceded by a special character for grammatical gender (e.g. f W A H L; f indicates feminine, m masculine, and n neuter). The model is trained to produce the noun's corresponding nominative plural form as output (e.g. W A H L E N). We used the 11,243 German nouns in UniMorph (Kirov et al., 2016) as our corpus, and added noun gender by merging the dataset with another Wiktionary scrape. We follow the modeling procedure of McCurdy et al. (2020), who found that neural ED models correctly learned the most frequent plural suffix for neuter stimuli, but did not evaluate sensitivity to grammatical gender; please see their paper for further implementation details.
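The input and output formats described above can be illustrated with a minimal sketch. The function names are hypothetical, and the upper-casing simply mirrors the f W A H L example; the actual preprocessing used for the model may differ (see McCurdy et al., 2020).

```python
VALID_GENDERS = {"f", "m", "n"}  # feminine, masculine, neuter


def encode_input(singular, gender):
    """Build the source sequence: gender tag followed by spaced characters.

    Hypothetical helper; upper-casing is an assumption based on the example
    in the text. Note that Python upper-cases "ß" to "SS".
    """
    if gender not in VALID_GENDERS:
        raise ValueError(f"unknown gender tag: {gender!r}")
    return " ".join([gender] + list(singular.upper()))


def encode_target(plural):
    """Build the target sequence: spaced characters of the plural form."""
    return " ".join(plural.upper())


source = encode_input("Wahl", "f")   # "f W A H L"
target = encode_target("Wahlen")     # "W A H L E N"
```

Presenting the same noun with a different tag (for example `encode_input("Bral", "m")` versus `encode_input("Bral", "f")`) changes only the first symbol, so any difference in the predicted plural can be attributed to the gender cue.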
Following Corkery et al. (2019), we trained 25 separate random initializations of the same model architecture. This allows separate model instances to be treated as simulated "speakers", letting us aggregate productions and compare more directly to human speaker data. For evaluation, we combined each of the 24 noun stimuli with each of the three grammatical genders, and provided the resulting 72 items as input to each model instance.
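As a rough sketch of this evaluation setup, the snippet below crosses a list of nouns with the three genders and turns the suffix choices of several simulated speakers into item-level production percentages. The helper names and the placeholder noun list are assumptions for illustration; the real stimuli are listed in Appendix A.

```python
from collections import Counter
from itertools import product

GENDERS = ("f", "m", "n")


def build_eval_items(nouns):
    """Cross every noun with every gender: 24 nouns yield 72 items."""
    return [(noun, gender) for noun, gender in product(nouns, GENDERS)]


def production_percentages(suffixes):
    """Convert one item's suffix choices (one per simulated speaker,
    i.e. per random seed) into percentages per suffix."""
    if not suffixes:
        raise ValueError("need at least one production")
    counts = Counter(suffixes)
    return {s: 100.0 * c / len(suffixes) for s, c in counts.items()}


# Placeholder noun names; NOT the actual stimuli.
placeholder_nouns = [f"noun{i}" for i in range(24)]
items = build_eval_items(placeholder_nouns)

# Hypothetical outcome for one item across 25 model instances.
example = production_percentages(["-e"] * 15 + ["-(e)n"] * 10)
```

The same aggregation can be applied to human responses, with participants in place of random seeds, which is what makes the two data sources directly comparable.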

Results
Our results (Figure 2) show that both speakers and the ED model are sensitive to grammatical gender, but the model relies on this cue considerably more than speakers. Statistical analysis confirms that a) both speakers and the model show reliable effects of grammatical gender on their plural form productions, and b) gender effects are substantially greater for model productions. We fit two logistic mixed-effects models to separately analyze production of -e and -(e)n. Details of our analysis can be found in Appendix B. For both suffixes, we found a significant main effect of gender, and a significant interaction with data source, indicating that gender effects were amplified by the model.
Intriguingly, the speaker productions are not only less sensitive to grammatical gender, they also appear very consistent with the overall type frequency distribution of the plural suffixes, unconditioned on gender. To quantify this intuition, we looked at how the distribution of plural suffixes produced over each of the 72 noun-gender item combinations correlated to various other metrics. We asked three questions: 1) How well do item-level speaker and ED model productions correlate with each other? 2) How well do both sets of item-level productions correlate with the gender-conditioned distribution of plural suffix types observed in the German lexicon? 3) How well do both sets correlate with the unconditioned overall distribution of types? Table 1 shows the results: while item-level ED outputs are most correlated with the gender-conditioned distribution, item-level speaker data is most correlated with the overall (unconditioned) type frequency. Even though the speaker and ED data are matched by item, their productions have a lower correlation with each other than with the general type-frequency distributions.
Table 1: Correlations (Pearson's r, 95% confidence intervals in parentheses below) between item-level production percentages for speakers and ED model with 1) overall type frequency (Overall-TF), 2) gender-conditioned type frequency (Gender-TF), 3) each other.
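The correlation measure used in this comparison can be sketched as follows. This is a plain implementation of Pearson's r; how the production percentages and type frequencies are paired into vectors (for example, one value per item and suffix) is an assumption here, and the confidence intervals reported in Table 1 are not computed.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical vectors: production percentages for a few item-suffix pairs
# and the corresponding lexicon type frequencies (invented values).
productions = [45.0, 30.0, 10.0, 15.0]
type_frequencies = [38.0, 36.0, 12.0, 14.0]
r = pearson_r(productions, type_frequencies)
```

Running the same function once against the overall type frequencies and once against the gender-conditioned frequencies gives the two comparisons that the analysis contrasts.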

Discussion
We hypothesized that adult speakers and neural encoder-decoder models would make similar use of grammatical gender when inflecting novel words in the plural, as gender is a salient and consistent cue to plural inflection class, especially in an experimental setup where semantic cues are absent. Contrary to expectations, our results indicate that both learners attend to grammatical gender, but to different degrees: the neural model is much more sensitive to grammatical gender than adult speakers, whose productions are closer to the overall type frequency of plural suffixes in German.
The neural model's use of grammatical gender is not surprising, as it aligns with earlier findings (cf. Goebel and Indefrey, 2000); however, the speakers' lack of attention to gender is unexpected. In their large-scale production study with the same noun stimuli, Zaretsky and Lange (2016) found reliable effects of grammatical gender: their participants used -(e)n for 33% of feminine nouns, versus 19% of nonfeminine nouns (compare to our study: 33% vs. 26%). -e also appeared more with nonfeminine nouns (49% vs. 41%), although the effect was not statistically significant. Nonetheless, they note that -e was most frequently produced for feminine nouns as well as nonfeminine nouns, consistent with our results, and their data shows a similarly broad distribution over types. Despite other differences between our study design and theirs (e.g. online vs. in-person data collection, typed vs. written modality, German speakers from various backgrounds vs. one region), we consider our results fundamentally aligned: speakers show a slight but statistically reliable effect of gender on -(e)n and -e production, in both cases much less than the effect shown by the ED model.
One possibility is that the phonological forms of our noun stimuli provide their own statistical conditioning, to a stronger degree than anticipated. This is illustrated in Figure 3, which plots the distribution of nouns in UniMorph sharing two key properties with our stimuli: they are monosyllabic and end in a consonant. On the one hand, nouns with this type of form clearly also show gender conditioning, with -(e)n much more prevalent among feminine nouns. On the other hand, nouns with this general form are predominantly masculine gender, and the numerical prevalence of nonfeminine forms may diminish speakers' sensitivity to a rare feminine gender cue, such as they encounter in our experiment. Under this account, adult speakers condition their plural productions upon phonological form to a greater extent than grammatical gender. The results in Table 2 further support this interpretation. Looking only at the consonant-final monosyllabic words plotted in Figure 3, ED model productions show a higher correlation to the gender-conditioned distribution over plural suffixes, while the highest correlation generally (.78) appears between speaker productions and the overall distribution of plural classes for these phonologically similar words. The potential shortcoming of the ED as a cognitive model, then, is that it assigns too much weight to the cue of grammatical gender, even though it is statistically reasonable to do so.
Table 2: Correlations between item-level production percentages for speakers and ED model with 1) overall type frequency (Overall-TF), 2) gender-conditioned type frequency (Gender-TF), only considering consonant-final monosyllabic nouns in UniMorph (shown in Figure 3).
In conclusion, our comparison of neural encoder-decoder models and adult German speakers found a significant difference in their use of grammatical gender as a cue to plural inflection. Although this cue is highly informative, speakers, unlike neural models, appear relatively insensitive to gender in our task. This finding suggests that speakers may attend more readily to other cues such as phonology, and therefore match productions to a different distribution which shows less gender conditioning.

A Stimuli
Table 4 shows the 24 noun stimuli used in our experiment. In their original study, Marcus et al. distinguished between Rhymes, which rhyme with existing German words, and Non-Rhymes, which don't. As this distinction is not relevant to our study, we omit it from our analysis.

B Statistical analysis
Here we report the results of our statistical model of the production of -e and -(e)n. We fit two separate mixed-effects binomial logistic models using the lme4 package (Bates et al., 2015) in R (R Core Team, 2019). Item (i.e. stimulus word) and subject (participant for human study, random seed for ED model) were included as random effects. Both models were fit using a stepwise procedure. We started with a baseline model of intercept plus random effects and incrementally added the following fixed effects (with sum-coded contrasts): grammatical gender (masculine coded as 1, neuter as 2, feminine not contrasted), data source (ED model coded as 1, speakers not contrasted), gender by source interaction. Each additional fixed effect produced a significantly improved fit as measured by a chi-squared test. The final model for both -e and -(e)n production includes all fixed and random effects described above. For both plural suffixes, model results indicate a significant main effect of gender for both speakers and the ED model, and a significant interaction with data source, corresponding to a stronger effect of gender in ED model productions. For -e productions, there is also a main effect of data source: the ED model reliably produces -e more than speakers do overall. The -(e)n model shows no significant main effect for source. When model predictions are transformed to responses and fit to the original data, the binomial model of -e production achieves an overall predictive accuracy of 75% (precision 0.77, recall 0.79, F1 0.78), while the -(e)n model has 82% predictive accuracy (precision 0.71, recall 0.59, F1 0.65).
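To make the sum-coded contrasts above concrete, the following sketch builds the contrast matrix for the three-level gender factor. It assumes the convention of R's contr.sum, where the last level (here feminine, the level "not contrasted") receives -1 in every column; the analysis itself was run in R with lme4, so this Python helper is only an illustration.

```python
# Level order matters: the last level is the one that is not contrasted.
GENDER_LEVELS = ("masculine", "neuter", "feminine")


def sum_contrasts(levels):
    """Return a sum-coding (deviation) contrast row for each level.

    Each of the first k-1 levels gets an indicator column; the last level
    is coded -1 in all columns, so every column sums to zero and the
    coefficients are deviations from the grand mean.
    """
    k = len(levels)
    if k < 2:
        raise ValueError("need at least two levels")
    matrix = {}
    for i, level in enumerate(levels):
        if i < k - 1:
            matrix[level] = [1 if j == i else 0 for j in range(k - 1)]
        else:
            matrix[level] = [-1] * (k - 1)
    return matrix


gender_coding = sum_contrasts(GENDER_LEVELS)
source_coding = sum_contrasts(("ED model", "speakers"))
```

Under this coding, the effect for the uncontrasted level is not estimated directly but equals the negative sum of the other coefficients, which is why refitting with a different reference level can be a useful check.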
Sanity checks As human speakers show high inter-participant variability on this task (Fig. 4), we performed an additional separate analysis on the speaker data. We fit the same model as previously described, with the exception that the data source factor was omitted, as all data came from speakers. We also fit models using Masculine and Neuter as the reference gender in the sum contrast coding scheme, to see whether they yielded different results from the original model's Feminine reference level.
Figure 4: Individual speaker variation in plural suffix production by gender. Each speaker saw 8 words from each gender, shown on the y-axis. For each gender and plural suffix, the boxes indicate the median and interquartile range of individual speaker productions for that combination. For all gender categories, the median number of -e productions is 4, while the median number of -(e)n productions is 3.

The speaker-only model shows a reduced but consistent effect of gender (Tab. 5). Speakers reliably produce -(e)n more for feminine nouns, and less for neuter nouns, relative to the grand mean. Speakers also reliably produce -e more for masculine nouns. These differences are statistically significant even though, for all three genders, speakers produce -e more than -(e)n (Fig. 4).