Inverting and Modeling Morphological Inflection

Previous “wug” tests (Berko, 1958) on Japanese verbal inflection have demonstrated that Japanese speakers, both adults and children, cannot inflect novel present tense forms to “correct” past tense forms predicted by rules of existent verbs (de Chene, 1982; Vance, 1987, 1991; Klafehn, 2003, 2013), indicating that Japanese verbs are merely stored in the mental lexicon. However, the implicit assumption that present tense forms are bases for verbal inflection should not be blindly extended to morphologically rich languages like Japanese in which both present and past tense forms are morphologically complex without inherent direction (Albright, 2002). Interestingly, there are also independent observations in the acquisition literature to suggest that past tense forms may be bases for verbal inflection in Japanese (Klafehn, 2003; Murasugi et al., 2010; Hirose, 2017; Tatsumi et al., 2018). In this paper, we computationally simulate two directions of verbal inflection in Japanese, Present → Past and Past → Present, with the rule-based computational model called Minimal Generalization Learner (MGL; Albright and Hayes, 2003) and experimentally evaluate the model with the bidirectional “wug” test where humans inflect novel verbs in two opposite directions. We conclude that Japanese verbs can be computed online via some generalizations and those generalizations do depend on the direction of morphological inflection.


Introduction
In her seminal "wug" test, Berko (1958) demonstrated that English speakers, both adults and children, can inflect novel nouns (e.g. wug) and novel verbs (e.g. rick) to "correct" plural forms (e.g. wugs) and "correct" past tense forms (e.g. ricked), respectively. Since the experimental participants cannot have encountered the novel words before and thus cannot have retrieved the inflected forms from the mental lexicon, this demonstration strongly suggests that the inflected forms must have been produced online via some productive generalizations, whose nature has been actively debated in the literature (O'Donnell, 2015; Yang, 2016).
Nevertheless, "wug" tests might be too easy to "pass" in morphologically sparse languages like English, in which present tense forms are unsuffixed and homonymous (except 3rd person singular) with infinitival forms, and past tense forms are generated via simple affixation. In fact, previous "wug" tests on Japanese verbal inflection have demonstrated that Japanese speakers, both adults and children, cannot inflect novel present tense forms to "correct" past tense forms predicted by rules of existent verbs (de Chene, 1982; Vance, 1987, 1991; Klafehn, 2003, 2013). The results of these previous "wug" tests have been taken to indicate that Japanese verbs are merely stored in the mental lexicon, not produced online via productive generalizations. However, the implicit assumption that present tense forms are bases for verbal inflection should not be blindly extended to morphologically rich languages like Japanese. As pointed out by Albright (2002), various factors conspire to determine which cell of the paradigm should be identified as the base and, consequently, in which direction morphological inflection should be implemented. In particular, unlike English, whose present and past tense forms are asymmetrically complex, both present and past tense forms are suffixed in Japanese and thus morphologically complex without inherent direction, as in Table 1:
Table 1: Japanese verbal inflection (McCawley, 1968)
Interestingly, there are independent observations in the acquisition literature to suggest that past tense forms may be bases for verbal inflection in Japanese. First, Japanese children acquire past tense forms around age 2, before present tense forms (Clancy, 1985; Klafehn, 2003), unlike English children, who acquire present tense or infinitival forms first (Brown, 1973). Second, Japanese children exclusively produce past tense forms as Root Infinitive analogues before age 2 (Murasugi et al., 2010), unlike bona fide Root Infinitives in English (Wexler, 1994). Third, Japanese children overregularize present tense forms (e.g. simu 'die'; Hirose, 2017) but not past tense forms, unlike English children, who overregularize past tense forms (e.g. go-ed; Klafehn, 2003). Finally, Japanese children seem to have an inductive bias to prefer past tense forms to present tense forms (Tatsumi et al., 2018). Those observations converge on the hitherto unexplored hypothesis that past tense forms are bases in Japanese.
In this paper, inspired by the acquisition literature (Klafehn, 2003; Murasugi et al., 2010; Hirose, 2017; Tatsumi et al., 2018), we computationally simulate two directions of verbal inflection in Japanese, Present → Past and Past → Present, with the rule-based computational model called the Minimal Generalization Learner (MGL; Albright and Hayes, 2003) and experimentally evaluate the model with the bidirectional "wug" test, where humans inflect novel verbs in two opposite directions. The following questions will be addressed:
1. Which direction is computationally less complex for the model?
2. Which direction is experimentally more accurate for humans?
3. In which direction do the model and humans correlate more strongly?
Model complexity will be measured via three evaluation metrics derived from the confidence scores of rules induced based on the lexicon of existent verb pairs (Albright, 2002). In addition, human accuracy and the correlation between the model and humans are explicitly evaluated against the model, avoiding impressionistic interpretations as in previous "wug" tests, where "the meaning of the word 'pass' is a 60% or better score" (Klafehn, 2013, p.182).
The organization of this paper is as follows. Section 2 describes the methodological details of the Minimal Generalization Learner, the bidirectional "wug" test, and the statistical analyses to compare the two. Section 3 presents the results of model complexity, human accuracy, and the correlation between the model and humans, corresponding to the three questions above. Section 4 summarizes the results and discusses theoretical implications. Section 5 concludes the paper.

Minimal Generalization Learner
Training: The rule-based computational model called the Minimal Generalization Learner (MGL; Albright and Hayes, 2002, 2003) was employed. The MGL was trained on the lexicon of 1269 existent verb pairs (Suski, 1942) in two directions (Present → Past and Past → Present), with V-V compounds and light verb constructions removed in order to avoid inflating the counts of particular inflections. Then, rules were induced through minimal generalization for each direction. See Albright and Hayes (2002, 2003) for the rule induction algorithm.
Testing: Novel verbs were then fed into the trained MGL as input, and the inflected forms of those verbs were produced as output with the reliability and confidence scores defined below.
Reliability score: The reliability score of a rule, p, is defined as Equation 1:
$$p = \frac{\text{hits}}{\text{scope}} \tag{1}$$
where scope is the number of forms in the training data to which the rule can apply and hits is the number of those forms for which the rule derives the correct output.
Confidence score: Since weak rules supported by smaller data should be penalized (Mikheev, 1997), the reliability score of a rule is transformed into the confidence score, π, defined as Equation 2:
$$\pi = p^{*} - z_{(1-\alpha)/2} \sqrt{\frac{p^{*}(1 - p^{*})}{\text{scope}}} \tag{2}$$
where $p^{*} = \frac{\text{hits} + 0.5}{\text{scope} + 1}$ is the smoothed reliability, $\sqrt{p^{*}(1 - p^{*})/\text{scope}}$ is the estimated variance, $z_{(1-\alpha)/2}$ is the corresponding critical value, and α is the free parameter called confidence value (the higher α, the greater the penalty for weak rules), assumed here as α = .75.
Mean confidence: Three evaluation metrics can be derived from the confidence score (cf. Albright, 2002). First, the mean confidence score of winning outputs is defined as Equation 3:
$$\bar{\pi} = \frac{1}{|I|} \sum_{i \in I} \max_{o \in O_i} \pi_o \tag{3}$$
where $I$ is the set of all inputs, $O_i$ is the set of all outputs derived from the input $i$, and $\pi_o$ is the confidence score of the output $o$. The higher the mean confidence, the more efficient the grammar.
Mean margin: Second, the mean confidence margin of winning outputs is defined as Equation 4:
$$\bar{\Delta} = \frac{1}{|I|} \sum_{i \in I} \left( \max_{o \in O_i} \pi_o - \max_{s \in S_i} \pi_s \right) \tag{4}$$
where $S_i$ is the subset of all outputs derived from the input $i$ without the winning output. The higher the mean margin, the more efficient the grammar.
Mean entropy: Finally, the mean confidence entropy of possible outputs is defined as Equation 5:
$$\bar{H} = -\frac{1}{|I|} \sum_{i \in I} \sum_{o \in O_i} \tilde{\pi}_o \log \tilde{\pi}_o, \qquad \tilde{\pi}_o = \frac{\pi_o}{\sum_{o' \in O_i} \pi_{o'}} \tag{5}$$
where the confidence scores of all outputs derived from the input $i$, $O_i$, are normalized such that the scores sum to 1. Unlike the metrics above, the lower the mean entropy, the more efficient the grammar.
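To make the scores and the three complexity metrics concrete, the following is a minimal Python sketch rather than the MGL implementation itself. It assumes that reliability is hits divided by scope, that the confidence score is a normal-approximation lower confidence limit on the smoothed reliability, that a lone output has a runner-up score of 0, and that entropy is measured in bits; none of these choices is fixed by the description above. The function names and the toy scores are hypothetical.

```python
import math
from statistics import NormalDist


def reliability(hits: int, scope: int) -> float:
    """Raw reliability: share of in-scope forms the rule derives correctly."""
    return hits / scope


def confidence(hits: int, scope: int, alpha: float = 0.75) -> float:
    """Lower confidence limit on the smoothed reliability (assumed form)."""
    p_star = (hits + 0.5) / (scope + 1)             # smoothed reliability
    spread = math.sqrt(p_star * (1 - p_star) / scope)
    z = NormalDist().inv_cdf(1 - (1 - alpha) / 2)   # assumed critical value
    return p_star - z * spread


def mean_confidence(outputs: dict) -> float:
    """Average confidence of the winning output of each input."""
    return sum(max(c.values()) for c in outputs.values()) / len(outputs)


def mean_margin(outputs: dict) -> float:
    """Average gap between the winner and the best competing output."""
    total = 0.0
    for candidates in outputs.values():
        ranked = sorted(candidates.values(), reverse=True)
        runner_up = ranked[1] if len(ranked) > 1 else 0.0  # assumption
        total += ranked[0] - runner_up
    return total / len(outputs)


def mean_entropy(outputs: dict) -> float:
    """Average entropy (in bits, an assumption) of normalized confidences."""
    total = 0.0
    for candidates in outputs.values():
        norm = sum(candidates.values())
        probs = [v / norm for v in candidates.values() if v > 0]
        total -= sum(p * math.log2(p) for p in probs)
    return total / len(outputs)


# Toy confidence scores for two hypothetical inputs (not data from the paper).
toy = {"input_a": {"x": 0.9, "y": 0.3}, "input_b": {"z": 0.8}}
print(mean_confidence(toy), mean_margin(toy), mean_entropy(toy))
```

With equal raw reliability, a rule supported by more data receives a higher confidence score under this sketch, which is the penalty for weak rules described above.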

Bidirectional "Wug" Test
Participants: The bidirectional "wug" test was conducted with a within-participants design. The experimental participants were 45 undergraduate students at Waseda University in Japan. They were given a ¥500 book coupon for their participation. The 6 participants who were non-native speakers of Japanese or misinterpreted the instructions were excluded from the statistical analyses, resulting in 39 participants in total.
Stimuli: Bisyllabic novel roots ("wug" roots) were created by randomly combining two open syllables (CV). The open syllables were extracted from the lexicon of 1269 existent verb stems (Suski, 1942), with V-V compounds and light verb constructions removed, and only those with token frequency ≥ 20 were included. The "wug" roots that resemble actual Japanese words or contain repetitions of the same segments were excluded. Since native Japanese words cannot generally start with voiced obstruents (Tanaka and Yashima, 2013) or /r/ (Labrune, 2014), the "wug" roots starting with those segments were also excluded. Then, 32 past tense forms of the "wug" roots were created as target stimuli in the Past → Present direction, by attaching (i) the V-final past tense ending /ta/ to 16 "wug" roots (i.e. 8 ending with /e/ and 8 ending with /i/) and (ii) the 4 C-final past tense endings (i.e. /tta/, /nda/, /ita/, and /ida/) to 4 "wug" roots each, hence 16 V-final and 16 C-final past tense forms. In the same vein, 32 present tense forms of the "wug" roots were created as target stimuli in the Present → Past direction, by attaching (i) the V-final present tense ending /ru/ to 16 "wug" roots (i.e. 8 ending with /e/ and 8 ending with /i/) and (ii) the 8 C-final present tense endings (i.e. /tu/, /u/, /mu/, /bu/, /nu/, /ku/, /su/, and /gu/) to 2 "wug" roots each, hence 16 V-final and 16 C-final present tense forms. Note that the 4 V-final past tense forms whose roots end with /hi/ and the 2 C-final present tense forms ending with /nu/ turned out to be unattested in the training data and were thus excluded from the statistical analyses.
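The root-creation procedure can be sketched in Python as follows. This is an illustration under stated assumptions, not the script used for the experiment: syllables are romanized as two-character CV strings, "repetition of the same segments" is read as any segment occurring twice in a root, resemblance to actual words is reduced to exact membership in a word set, and the syllable inventory and all function names are hypothetical.

```python
import itertools
import random

# Voiced obstruents and /r/ in a simplified romanization (an assumption).
BANNED_ONSETS = set("bdgzr")


def frequent_syllables(token_counts: dict, threshold: int = 20) -> list:
    """Keep only syllables whose token frequency reaches the threshold."""
    return [s for s, count in token_counts.items() if count >= threshold]


def candidate_roots(syllables, real_words=frozenset()) -> list:
    """Enumerate CVCV roots that survive the exclusion criteria."""
    roots = []
    for first, second in itertools.permutations(syllables, 2):
        root = first + second
        if root[0] in BANNED_ONSETS:
            continue  # starts with a voiced obstruent or /r/
        if len(set(root)) < len(root):
            continue  # some segment occurs twice
        if root in real_words:
            continue  # crude stand-in for "resembles an actual word"
        roots.append(root)
    return roots


def sample_wug_roots(syllables, n, seed=0, real_words=frozenset()) -> list:
    """Draw n distinct roots at random from the surviving candidates."""
    pool = candidate_roots(syllables, real_words)
    if n > len(pool):
        raise ValueError("not enough candidate roots")
    return random.Random(seed).sample(pool, n)


# Hypothetical syllable inventory, for illustration only.
toy_syllables = ["ka", "mi", "su", "te", "no", "ri", "bo", "he"]
wug_roots = sample_wug_roots(toy_syllables, 10)
print(wug_roots)
```

Enumerating all admissible combinations and then sampling is one way to obtain a random draw that already respects the exclusions; rejection sampling over random syllable pairs would be an equally reasonable design.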
Four frames were also created in which the target stimuli were presented. Each frame consisted of two sentences, A and B. In the Present → Past direction, A sentences include present tense forms, while B sentences contain a blank and elicit past tense forms. In the Past → Present direction, A sentences include past tense forms, while B sentences contain a blank and elicit present tense forms. In order to make sure that the participants produce target forms in B sentences, temporal adverbs are placed at the sentence-initial position to maximally contrast sentences A and B. Specifically, A sentences take the form "Temporal Adverb + Proper Noun + Verb + Evidential", whereas B sentences take the form "Temporal Adverb + Proper Noun + ____ + Sentence Final Particle", where the participants are asked to fill in the blank with the inflected Verb.
Procedure: The task was a written production "wug" test in the form of a questionnaire. At the top of the questionnaire were some biographical questions such as (i) birthplaces of participants and their parents, (ii) whether participants were born and grew up in Japan, and (iii) whether parents spoke Japanese to participants at home.
The English translation of the original Japanese instructions is reproduced below:
"This experiment examines your intuition about Japanese. There are 2 blocks and 48 questions in each block, where both actual and novel verbs in Japanese appear. For each question, please change bold and underlined verbs of A sentences to appropriate forms and complete the underlined portion of B sentences. Since there are no correct or incorrect answers, please answer the questions based on your intuition without too much reflection."
The experiment was divided into two blocks, corresponding to the Present → Past and Past → Present directions, and the order of the two directions was counterbalanced across participants. At the beginning of each block were two example questions with answers in B sentences completed, one example with an actual verb and another with a novel verb. The order of stimulus presentation was randomized across participants by creating 2 random orders for each direction, hence 4 different versions of randomization. The 16 actual verbs, 8 V-final and 8 C-final, were interspersed as fillers in each block, on the condition that no more than 3 target stimuli were presented in sequence.

Statistical Analyses
Human accuracy: In contrast with the previous "wug" tests, the human accuracies were explicitly evaluated against the MGL. Among the possible outputs derived from each input by the MGL, the winning output with the highest confidence score was defined as "correct". For example, suppose that the MGL produced three possible outputs X, Y, and Z for an input, among which X was the winning output with the highest confidence score. If the output X was produced by 30 participants, the human accuracy would be 30/39 = .769.
Correlation between model and humans: The correlation between model confidence scores and human production probabilities was also analyzed (Albright and Hayes, 2003). The human production probabilities can be computed simply by dividing the frequencies of produced outputs by the number of participants. Given that the distribution of human production probabilities was bimodal, nonparametric Kendall's rank correlation analyses were performed in R between model confidence scores and human production probabilities. Outputs that were not produced by both the model and the humans were excluded from the correlation analyses.
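The two analyses can be illustrated with a short Python sketch. It is not the R analysis reported here: the rank correlation is a hand-written Kendall tau-b (a tie-corrected variant, chosen as an assumption) without any significance test, and the function names and the model scores for X, Y, and Z are hypothetical.

```python
import math
from itertools import combinations


def human_accuracy(model_scores: dict, human_counts: dict, n_participants: int) -> float:
    """Share of participants who produced the model's winning output."""
    winner = max(model_scores, key=model_scores.get)
    return human_counts.get(winner, 0) / n_participants


def kendall_tau_b(xs, ys) -> float:
    """Kendall's tau-b over paired observations, with a correction for ties."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue            # tied on both variables: ignored
        if dx == 0:
            ties_x += 1         # tied on x only
        elif dy == 0:
            ties_y += 1         # tied on y only
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + ties_x) * (untied + ties_y))
    return (concordant - discordant) / denom if denom else float("nan")


# Worked example from the text: X wins and 30 of 39 participants produce it.
scores = {"X": 0.9, "Y": 0.4, "Z": 0.1}   # hypothetical confidence scores
counts = {"X": 30, "Y": 9}                # hypothetical split of the rest
print(round(human_accuracy(scores, counts, 39), 3))
```

In practice, the model confidence scores and human production probabilities of all outputs shared by the model and the participants would be passed to `kendall_tau_b` as two parallel lists.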

Model Complexity
The result of model complexity is summarized in Table 2, where three model complexity metrics, mean confidence score (π), mean confidence margin (∆), and mean confidence entropy (H), are shown for each direction of verbal inflection:
Table 2: Model complexity for each direction of verbal inflection
Metric                 Present → Past   Past → Present
Mean confidence (π)    .904             .959
Mean margin (∆)        .724             .849
Mean entropy (H)       .387             .294
The three evaluation metrics all converge on the conclusion that the Past → Present direction is computationally less complex than the Present → Past direction: the mean confidence and margin were higher, while the mean entropy was lower. On closer inspection, the confidence scores of possible outputs for /ru/-final present tense forms were nearly tied in the Present → Past direction, which increased the mean entropy.

Human Accuracy
The result of human accuracy is summarized in Table 3, where the accuracies of the current experiment are shown for each direction of verbal inflection and compared with five previous "wug" tests (de Chene, 1982; Vance, 1987, 1991; Klafehn, 2003, 2013):
Table 3: Human accuracy in the current experiment and in previous "wug" tests
Reference          Modality   Task         Accuracy
de Chene (1982)    oral       production   46%
Vance (1987)       written    choice       51%
Vance (1991)       written    choice       63%
Klafehn (2003)     written    choice       53%
Klafehn (2013)     oral       production   32%
Present → Past     written    production   48%
Past → Present     written    production   72%
First, the accuracy of the Present → Past direction (48%) is generally comparable to the literature, especially the oral production experiment by de Chene (1982; 46%), despite different participants and stimuli between the experiments. Second, and more importantly, the accuracy of the Past → Present direction (72%) was significantly higher than the literature, even the forced choice experiments (Vance, 1987, 1991; Klafehn, 2003), which would be regarded as a "pass" on the assumption that "the meaning of the word 'pass' is a 60% or better score" (Klafehn, 2013).

Correlation Between Model and Humans
The result of correlation between the model and humans is shown in Figure 1, where the x-axis is the model confidence score computed by the MGL ("Model confidence score"), whereas the y-axis is the human production probability of the forms produced by the experimental participants ("Human production probability"). Color represents the Past → Present (red) and Present → Past (blue) directions. Shape of the data points represents C-final (circles) and V-final (triangles) roots. The lines are the fitted linear models.
There were 1248 forms in total produced for each direction (39 participants × 32 stimuli = 1248 forms). In the Past → Present direction, 1225 forms were present tense forms, while 23 forms were errors (i.e. non-present tense forms). Out of the 1225 present tense forms, 956 forms were also produced by the MGL. In the Present → Past direction, 1225 forms were past tense forms, while 23 forms were errors (i.e. non-past tense forms). Out of the 1225 past tense forms, 821 forms were also produced by the MGL. Kendall's rank correlation analyses revealed that the model confidence scores and human production probabilities were strongly correlated in the Past → Present direction (z = 5.0618, τ = 0.534, p < 0.001), but only weakly in the Present → Past direction (z = 2.3058, τ = 0.2448, p < 0.05).

Discussion
In summary, the results demonstrated that (i) the Past → Present direction was computationally less complex than the Present → Past direction, (ii) the Past → Present direction was experimentally more accurate than the Present → Past direction, and (iii) the model and humans were correlated strongly in the Past → Present direction, but only weakly in the Present → Past direction. The present work should be regarded as a computational psycholinguistic approach to the Paradigm Cell Filling Problem (Ackerman et al., 2009) and, importantly, the result of model complexity harmonizes well with the Low Entropy Conjecture (Ackerman and Malouf, 2013), which may in turn provide an insight into the SIGMORPHON Shared Task on morphological reinflection (Cotterell et al., 2018). In the following, two theoretical implications will be discussed: the past tense debate and language learning.

The Past Tense Debate
In the context of the past tense debate on rule vs. analogy (Pinker and Ullman, 2002), three logically possible models have been proposed in the literature: the single route rule-based model (Yang, 2002), the single route analogy-based model (Rumelhart and McClelland, 1986), and the dual route model (Pinker and Prince, 1988). The results above at least indicate that Japanese verbs can be computed online via some generalizations and that those generalizations do depend on the direction of morphological inflection, contrary to the conclusion of previous "wug" tests that Japanese verbs are merely stored in the mental lexicon (de Chene, 1982; Vance, 1987, 1991; Klafehn, 2003, 2013). However, although the MGL is "rule-based", the nature of those generalizations is still an open question to be addressed via systematic comparison with contemporary analogy-based models such as Recurrent Neural Networks (RNN: Kirov and Cotterell, 2018) and Naive Discriminative Learning (NDL: Baayen et al., 2011) couched in Word and Paradigm models of morphology (Stump, 2001; Blevins, 2006). In addition, given the different strength of correlation with the rule-based computational model in the two opposite directions, we can hypothesize that the Past → Present direction is rule-based, while other directions including the Present → Past direction are analogy-based. Then, following the electroencephalography (EEG) experiment by Kobayashi et al. (2012), who demonstrated that rule and analogy are indexed by event-related potential (ERP) components called LAN and N400, respectively, we predict that the Past → Present direction is reflected by LAN, whereas the Present → Past direction is reflected by N400. This prediction is summarized in Table 4.

Human and Machine Language Learning
The results above further suggest that human language learning can provide insights into machine language learning. Interestingly, the Past → Present direction was motivated not only by the language acquisition literature (Klafehn, 2003; Murasugi et al., 2010; Hirose, 2017; Tatsumi et al., 2018) but also by the artificial language learning experiments of Yin and White (2018), who show that humans have an inductive bias against neutralization and homophony. In other words, the Past → Present direction is preferred to the Present → Past direction because the present tense forms of /r/, /t/, /w/-final "wug" roots and /m/, /b/, /n/-final "wug" roots are neutralized into the homonymous past tense forms ending with /tta/ and /nda/, respectively.
Nevertheless, there were several limitations with the current experiment. First, the assumption that the training data is fed to the model in pairs organized as paradigms is unrealistic due to Zipf's law (Zipf, 1949), where paradigms are almost always incomplete in human language learning (Yang, 2017; Blevins et al., 2017). In order to simulate realistic language learning, child-directed speech (CDS) should be employed as the training data, especially given that relative frequencies of present and past tense forms seem to be diverse (Tatsumi et al., 2018). Second, the MGL is "rule-based" but learns only product-oriented generalizations (Becker and Gouskova, 2016) in that present and past tense forms are mapped to each other without underlying roots. In order to test source-oriented generalizations over underlying roots, morphological decomposition must be implemented to retrieve the roots from which present and past tense forms are derived (Taft, 1979, 2004). Finally, different "wug" roots were employed in the two directions of verbal inflection (though created similarly), so that the different results might be attributed to the different roots, which remains to be controlled in future work.

Conclusion
In this paper, inspired by the acquisition literature (Klafehn, 2003; Murasugi et al., 2010; Hirose, 2017; Tatsumi et al., 2018), we computationally simulated two directions of verbal inflection, Present → Past and Past → Present, with the rule-based computational model called the Minimal Generalization Learner (MGL; Albright and Hayes, 2003) and experimentally evaluated the model with the bidirectional "wug" test where humans inflected novel verbs in two opposite directions, addressing the following questions: