Are we there yet? Encoder-decoder neural networks as cognitive models of English past tense inflection

The cognitive mechanisms needed to account for the English past tense have long been a subject of debate in linguistics and cognitive science. Neural network models were proposed early on, but were shown to have clear flaws. Recently, however, Kirov and Cotterell (2018) showed that modern encoder-decoder (ED) models overcome many of these flaws. They also presented evidence that ED models demonstrate humanlike performance in a nonce-word task. Here, we look more closely at the behaviour of their model in this task. We find that (1) the model exhibits instability across multiple simulations in terms of its correlation with human data, and (2) even when results are aggregated across simulations (treating each simulation as an individual human participant), the fit to the human data is not strong—worse than an older rule-based model. These findings hold up through several alternative training regimes and evaluation measures. Although other neural architectures might do better, we conclude that there is still insufficient evidence to claim that neural nets are a good cognitive model for this task.


Introduction
For over 30 years, the English past tense has served as both inspiration and testbed for models of language acquisition and processing (Rumelhart and McClelland, 1986; Pinker and Prince, 1988; Marcus, 1995; Plunkett and Juola, 1999; Pinker and Ullman, 2002; Albright and Hayes, 2003; Seidenberg and Plaut, 2014; Kirov and Cotterell, 2018; Blything et al., 2018, etc.). One of the most well-known debates centres on whether the apparently rule-governed regular past tense is indeed represented cognitively using explicit rules. Rumelhart and McClelland (1986) famously argued against this hypothesis, presenting a neural network model intended to capture both regular and irregular verbs with no explicit rules. However, Pinker and Prince (1988) presented a scathing rebuttal, pointing out both theoretical and empirical failures of the model. In their alternative (dual-route) view, the regular past tense is categorical and captured via explicit rules, while irregular past tenses are memorized and can (occasionally) generalize via gradient analogical processes (Pinker and Prince, 1988; Prasada and Pinker, 1993). Their arguments were so influential that although neural networks gained considerable traction in cognitive science more generally (Bechtel and Abrahamsen, 1991; McCloskey, 1991; Elman et al., 1996), many linguists dismissed the whole approach. With the recent success of deep learning in NLP, however, there has been renewed interest in exploring the extent to which neural networks capture human behaviour in psycholinguistic tasks (e.g., Linzen and Leonard, 2018; Linzen, 2019). In particular, Kirov and Cotterell (2018; henceforth K&C) revisited the past tense debate and showed that modern sequence-based encoder-decoder (ED) models overcome many of the criticisms levelled at Rumelhart and McClelland's model.
Specifically, these models permit variable-length input and output that represent sequential ordering; can reach near-perfect accuracy on both regular and irregular verbs seen in training; and (using multi-task learning) can effectively generalize phonological rules across different inflections.
These primary claims are undoubtedly correct (and indeed, we replicate the accuracy results below). However, we take issue with another part of K&C's work, in which they claim that their ED model also effectively models human behaviour in a nonce-word experiment (i.e., wug test, described below). We explore the model's behaviour on this task in detail, and conclude that its ability to model humans is considerably weaker than K&C suggest.
In particular, we begin by showing that multiple simulations of the same model (with different random initializations) result in very different correlations with the human data. To ensure that this instability is not just due to the evaluation measure, we introduce an alternative measure, but still find unstable results. We then consider whether treating individual simulations as individual participants (rather than as a model of the average participant) captures the human data better. This aggregate model does show some high-level similarities to the human participants: both model and humans tend to produce irregulars more frequently for nonce words that are similar to many real irregular verbs. However, the model is still poor at capturing fine-grained distinctions at the level of individual verbs. We conclude that, although deep learning approaches overcome many of the problems of earlier neural network models, there is still insufficient evidence to claim that they are good models of human morphological processing.
Background

Nonce word experimental data
Like K&C, we use data from two experiments run by Albright and Hayes (2003; henceforth A&H). In Experiment 1, using a dialogue-based prompt, A&H presented participants auditorily with nonce "verbs" that are phonotactically legal in English (e.g., spling, dize), and prompted participants to produce past tense forms of these verbs, resulting in a data set of production probabilities of various past tense forms. In Experiment 2, participants first produced each past tense form (as in Experiment 1) and were then asked to rate the acceptability of either two or three possible past tense forms for that verb: one regular, and one or two potential irregulars. For example, for scride /skrˈaɪd/, participants rated scrided /skrˈaɪdəd/ (regular), scrode /skrˈoʊd/ and scrid /skrˈɪd/ (irregular). This gives a data set of past tense form ratings.
Most of A&H's own analyses rely on the ratings data, but the ED model is a model of production, so we follow K&C and use the data from Experiment 1. The data is coded using the same set of suggested forms that were rated in Experiment 2: for each nonce word, A&H counted how many participants produced the regular form, the irregular form (or each of the two forms, if there are two), and "other" (any other past tense form that was not among those rated in Experiment 2). The counts are normalized to compute production probabilities for each output form.
The nonce words used by A&H were carefully chosen according to several criteria. First, they are phonologically "bland": i.e., not unusual-sounding as English words (as confirmed by a pre-test with participants). Second, as explained in the following section, they fall into several categories designed to test A&H's hypothesis that (contra Prasada and Pinker, 1993), both regular and irregular past tense forms exhibit gradient (and not categorical) effects.

A&H's model and islands of reliability
To explain the categories of nonce words (which we will refer to in our analyses), we briefly describe A&H's theory of past tense formation, which they implement as a computational model. The model postulates that speakers maintain a set of explicit structured rules that capture inflectional changes at different levels of generality. For example, a speaker might have rules such as: ] based on, e.g., want, need, start.
where X represents arbitrary phonological material and ___ is the location of the changing material. Each rule is given a confidence score based on its precision and statistical strength (the number of cases to which it could potentially apply). When a nonce word is presented, several rules may apply (e.g., the two rules above for gleed), and the goodness of each possible past tense is determined by the confidence score of the corresponding rule.
Crucially, A&H's model can learn multiple rules that all produce regular past tense forms, but with phonological contexts of different specificity, hence different confidence scores. Therefore, some nonce words may reside in so-called "islands of reliability" (IOR) for regular verbs: that is, there is an applicable regular rule that has a very high confidence score. Meanwhile other nonce words might also be considered regular, but with lower confidence. Thus, the model predicts gradient effects even for regular inflection. It also predicts gradient effects for irregular inflection, since there can be IORs for irregular rules as well.
To test these predictions, A&H chose four types of nonce words: those residing in an IOR for regulars, for both regulars and irregulars, for irregulars only, or for neither. They also included several nonce verbs similar to burn-burnt, spell-spelt, and some that might potentially elicit single-form analogies. Their results (discussed further in Section 4) showed that the different IOR categories were indeed treated differently by participants.

Evaluating models
To go beyond coarse-grained analysis based on the IOR categories, both A&H and K&C evaluate their models by correlating model output with the human data at the level of individual past tense forms. Correlations are computed between the human data (either production probabilities or ratings) and the model scores for each form. The regulars and irregulars are treated separately. That is, the irregular correlation value is computed by considering the average human production probability (or rating) for each suggested irregular past tense, and comparing these with the model scores for those same forms. The correlation for regulars is computed analogously. Regulars and irregulars are treated separately because the scores for regulars are nearly always larger, so if all forms were considered at once, a baseline that simply assigned (say) 1 to regulars and 0 to irregulars would already achieve a high correlation with humans.
We initially follow K&C in computing the Spearman (rank) correlation against the production probabilities, and later also examine Pearson (linear) correlations and ratings data.
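The reasoning for correlating regulars and irregulars separately can be illustrated with a small sketch. The numbers below are invented toy values, not A&H's data or any model's output, and the helper functions are plain-Python stand-ins written for this illustration. The sketch shows that a trivial baseline assigning 1 to regulars and 0 to irregulars already correlates strongly with the human data when all forms are pooled, whereas within a single class it carries no information at all.

```python
from math import sqrt

def pearson(x, y):
    """Pearson's r; returns NaN when either input has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Invented toy production probabilities (NOT the experimental data).
human_reg = [0.90, 0.75, 0.95, 0.60]   # regular forms of four nonce verbs
human_irr = [0.05, 0.20, 0.02, 0.30]   # irregular forms of the same verbs
model_reg = [0.99, 0.97, 0.98, 0.90]   # hypothetical model scores, regulars

# Per-class correlation, as used in the evaluation.
rho_reg = spearman(model_reg, human_reg)

# Trivial baseline: 1 for every regular, 0 for every irregular.
baseline = [1.0] * 4 + [0.0] * 4
rho_pooled_baseline = spearman(baseline, human_reg + human_irr)

# Within one class the baseline is constant, so the correlation is undefined.
rho_reg_baseline = spearman([1.0] * 4, human_reg)

print(rho_reg, rho_pooled_baseline, rho_reg_baseline)
```

With these toy values the pooled baseline scores higher than the hypothetical model does on regulars alone, which is why a pooled correlation would be uninformative.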

Model and hyperparameters
We adopt the encoder-decoder architecture used by K&C, as well as their implementation framework and hyperparameters. Encoder-decoder models are a type of recurrent neural network (RNN) introduced for machine translation (Sutskever et al., 2014) but also often used for other sequence-to-sequence transductions, such as morphological inflection and lemmatization (Kann and Schütze, 2016; Bergmanis and Goldwater, 2018). The encoder is an RNN that reads in the input sequence (here, a sequence of characters representing the phonemes in the present tense verb form) and creates a fixed-size vector representation of it. The decoder is another RNN that takes this vector as input and decodes it sequentially, outputting one symbol at each timestep (here, the phonemes of the past tense form). The ED model with attention (Bahdanau et al., 2015) is implemented in OpenNMT (Klein et al., 2017). It has two bidirectional LSTM encoder layers and two LSTM decoder layers, 300-dimensional character embeddings in the encoder, and 100-dimensional hidden layers in the encoder and decoder. The Adadelta optimizer (Zeiler, 2012) is used for training, with the default beam size of 12 for decoding. The batch size is 20, and dropout is applied between layers with a probability of 0.3. Except where otherwise noted below, all models were trained for 100 epochs.
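For quick reference, the hyperparameters listed above can be collected in one place. The sketch below is only a summary container: the class name and field names are hypothetical labels chosen for this illustration and are not OpenNMT option names; only the values come from the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EDConfig:
    """Hyperparameter values from the text; field names are illustrative."""
    encoder_layers: int = 2
    encoder_bidirectional: bool = True
    decoder_layers: int = 2
    attention: bool = True          # Bahdanau-style attention
    embedding_dim: int = 300        # character embeddings in the encoder
    hidden_dim: int = 100           # hidden layers in encoder and decoder
    optimizer: str = "adadelta"
    beam_size: int = 12             # used at decoding time
    batch_size: int = 20
    dropout: float = 0.3            # applied between layers
    epochs: int = 100               # unless otherwise noted

config = EDConfig()
print(config)
```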

Training data
To compare our results to both A&H and K&C, we use their corresponding training sets, both based on data from CELEX (Baayen et al., 1995). A&H's training data contains all verbs listed in CELEX with a lemma frequency of 10 or more (4253 verbs, 218 of which are irregular). We use A&H's American English IPA phonemic transcriptions, to match the nonce word experiment (which was carried out with American English speakers), and also follow them in using the nonce words as the unseen test set rather than creating dev/test splits from the CELEX data. As argued by A&H, adult English speakers will have been exposed to all of the real verbs many times and would be able to correctly produce the past tense of all of them. Adults' generalization to nonce words is therefore predicated on their knowledge of this entire training set (including, crucially, all of the irregular forms).
For our second training set, we obtained the data from K&C, which is a subset of A&H's: it contains 4039 verbs, 168 of which are irregular; that is, 50 real irregular verbs are missing. Examples of verbs that are missing from the K&C data include do-did and use-used. K&C also randomly divided their data into training, development, and test sets, but we weren't able to obtain these splits, so (since we are using the nonce words for test data) we simply use all 4039 verbs as training data. We include results using K&C's data mainly to allow closer (though still not exact) comparison with their work, but we feel that A&H's training data, which includes all the irregulars, more accurately reflects adult linguistic exposure.

Evaluation
We report three different evaluation measures. First, we compute training set accuracy: the percentage of verbs in the training data for which the model's top-ranked output is the correct past tense form. This is largely a sanity check and test of convergence: a fully-trained model of adult performance should have near-perfect training set accuracy. Next, as described in Section 2.3, we report Spearman's rank correlation (ρ) of the model's probabilities for the various nonce past tense forms with the human production probabilities. The probability for each suggested past tense form was obtained by forcing the model to output that form (e.g., providing scride as input and forcing it to output scrid). This made it possible to get probabilities for forms that did not occur in the beam (the list of most likely forms output by the model).
Finally, we introduce a third measure, motivated further in Section 4.1, complete recall@5 (CR@5):

$$\mathrm{CR@5} = \frac{1}{n} \sum_{i=1}^{n} [S_i \subseteq B_i]$$

where $n$ is the total number of nonce verbs, $S_i$ is the set of A&H's suggested past tense forms for verb $i$, $B_i$ is the set of the top five verbs in the model's beam for $i$, and $[S_i \subseteq B_i] = 1$ if all verbs from $S_i$ appear in $B_i$, and 0 otherwise. For example, a model which only processed the two verbs in Table 1 would have a CR@5 of 0.5, since the beam includes all suggested past tenses for murn (murned, murnt), but not for nold (nolded, nold, neld).³

Table 2: Mean training set accuracy (in %, with standard deviations in brackets), averaged over 10 runs for each training set with different random seeds. Oracle accuracy is 99.85% on the K&C data and 99.55% on the A&H data, due to homophones and forms with multiple past tenses. In order to do better on irregulars, the model would have to get more of the regulars wrong.

Experiments

Experiment 1: Model variability
Our first experiment aims to replicate K&C's results showing that (a) the model is able to produce the past tense forms of training verbs with near-perfect accuracy, and (b) its correlation with human data on the nonce verb test set is higher than that of A&H's model. In K&C's paper, these results were based on a single trained model. Here we trained 20 models (10 on each training set) initialized with different random seeds.
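As a concrete illustration of the complete recall@5 measure defined in the Evaluation section, the following minimal sketch computes it for the murn/nold example. The suggested forms are those named in the text; the beam contents are hypothetical stand-ins (written orthographically), chosen only so that murn's beam contains all suggested forms and nold's does not.

```python
def complete_recall_at_k(suggested, beams, k=5):
    """Fraction of verbs whose suggested forms ALL appear in the top-k beam."""
    hits = 0
    for verb, forms in suggested.items():
        top_k = set(beams[verb][:k])
        if set(forms) <= top_k:      # S_i is a subset of B_i
            hits += 1
    return hits / len(suggested)

# Suggested past tense forms, as named in the text.
suggested = {
    "murn": ["murned", "murnt"],
    "nold": ["nolded", "nold", "neld"],
}

# Hypothetical beams (ranked best-first); not actual model output.
beams = {
    "murn": ["murned", "murnt", "murn", "murnd", "morn"],
    "nold": ["nolded", "neld", "neelded", "nolt", "noldid"],  # "nold" missing
}

cr_at_5 = complete_recall_at_k(suggested, beams, k=5)
print(cr_at_5)  # 0.5: murn counts as a hit, nold does not
```

Because the measure looks only at which forms fall inside each verb's own top five, it is insensitive to how probabilities compare across different verbs.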
Accuracy. Table 2 lists the mean and standard deviation of training set accuracy for each of the two training sets. It is not possible to get 100% accuracy because the training sets contain some homophones with different past tenses (e.g., write-wrote and right-righted), and some verbs which have two possible past tenses (e.g., spring-sprung and spring-sprang). Nevertheless, the models get very close to the best possible accuracy, confirming K&C's finding that they learn both regular and irregular past tenses of previously seen words within 100 epochs. Example convergence plots are shown in Figure 1, illustrating that the models learn regular verbs very quickly, and irregular verbs more slowly, but both are learned well after 60-80 epochs.

³ Not all of A&H's suggested forms were actually produced by participants, but all of them seem plausible and we felt that a good model should rank them higher than most other potential past tenses, i.e., they should be included within a small beam size. Indeed, in cases where they are not (e.g., nold in Table 1) we do typically see much less plausible forms (such as neelded) included in the beam.
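The ceiling on training accuracy can be made concrete with a small sketch. It assumes (this is our reading, not a procedure stated in the text) that oracle accuracy is what a deterministic model achieves when it outputs, for every input string, the past tense that occurs most often with that input in the training data. The toy pairs below are illustrative transcriptions, not entries copied from CELEX.

```python
from collections import Counter, defaultdict

def oracle_accuracy(pairs):
    """Best accuracy reachable when each input maps to exactly one output."""
    outputs_by_input = defaultdict(Counter)
    for present, past in pairs:
        outputs_by_input[present][past] += 1
    # For each input, only its most frequent output can be counted correct.
    correct = sum(c.most_common(1)[0][1] for c in outputs_by_input.values())
    return correct / len(pairs)

# Illustrative toy training set (phonemic strings are approximations).
toy_pairs = [
    ("raɪt", "roʊt"),      # write-wrote
    ("raɪt", "raɪtɪd"),    # right-righted (homophone of write)
    ("sprɪŋ", "spræŋ"),    # spring-sprang
    ("sprɪŋ", "sprʌŋ"),    # spring-sprung
    ("wɔk", "wɔkt"),       # walk-walked
    ("nid", "nidɪd"),      # need-needed
]

oracle = oracle_accuracy(toy_pairs)
print(round(oracle, 3))  # 4 of 6 pairs can be right at best
```

In the real training sets such conflicts are rare, which is why the reported oracle accuracies are only slightly below 100%.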
Correlation. Despite having consistently high accuracy on real words, Figure 2 shows that models with different random initializations vary considerably in their correlation with human speakers' production probabilities on nonce words, from 0.15 to 0.56 for regulars, and from 0.23 to 0.41 for irregulars. K&C's reported results are at the high end of what we obtained, suggesting that they are likely not representative.
On the other hand, we were concerned that the variability in the correlation measure might be due to an artefact: the vast majority of the beams returned by the model assign very high probability (> 98%) to the top item and little mass to anything else (as in the first example in Table 1).⁴ Since the correlation measure is computed across different nonce forms, tiny changes in the beam probabilities of one nonce verb could change the ranking of (say) its regular past with respect to the regular past of another nonce word, even if the relative ranking of forms within each nonce's beam stayed the same.

⁴ The skewness of the beams is likely because of the training/testing scenario, where the model is effectively asked to do different tasks: at training time, it is trained to produce one correct past tense, while at test time, it's expected to produce a probability distribution over potential nonce past tenses. We could surely produce better matches to the human probability distributions by training directly to do so, but that wouldn't make sense as a cognitive model, since human learners are exposed only to correct past tenses, not to distributions.

Figure 3: Complete recall@5 for 20 models with different random seeds (10 with each training dataset). Horizontal jitter is added for readability.

CR@5 and second best forms. The above observation motivated the CR@5 measure (Section 3.3). Rather than measuring the relative probabilities of past forms across different verbs, CR@5 considers the relative rankings of different past forms for each verb. However, CR@5 also yielded unstable results: 39-47% on A&H's data, and 29-44% on K&C's data, as shown in Figure 3. As a final exploration of the models' instability across different simulations, we looked at how often the models agree with each other on the verb occupying the first and the second position in the beam. While there is very high agreement on the most likely form (top of the beam) across the simulations (usually a regular past tense), very few forms in the second position are the same across simulations (see Figure 4).

Figure 5: Percentage of regular, irregular, and "other" responses produced by humans (top) and the model (bottom). Each of the six blocks corresponds to a different category of nonce words (see Section 2.2).
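One way to quantify the agreement between simulations on a given beam position is sketched below. The definition used here is an assumption made for illustration (the text does not specify how agreement was measured): for each verb, take the share of simulations whose form at that position matches the most common form there, then average over verbs. The beams are hypothetical.

```python
from collections import Counter

def position_agreement(beams_per_run, position):
    """Mean share of runs that output the modal form at a beam position."""
    verbs = list(beams_per_run[0])
    shares = []
    for verb in verbs:
        forms = [run[verb][position] for run in beams_per_run]
        modal_count = Counter(forms).most_common(1)[0][1]
        shares.append(modal_count / len(forms))
    return sum(shares) / len(shares)

# Hypothetical top-2 beams from three simulations (not actual model output).
runs = [
    {"murn": ["murned", "murnt"], "nold": ["nolded", "neld"]},
    {"murn": ["murned", "murn"],  "nold": ["nolded", "nold"]},
    {"murn": ["murned", "murnt"], "nold": ["nolded", "neelded"]},
]

top_agreement = position_agreement(runs, position=0)
second_agreement = position_agreement(runs, position=1)
print(top_agreement, second_agreement)
```

In this toy case all runs agree on the top form, while agreement on the second form is much lower, mirroring the qualitative pattern described above.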
Summary. To recap, we find similar training set accuracy to what K&C reported, but the correlation scores between the model and the human data are generally lower, and the model exhibits unstable behaviour across different simulations. However, the unstable behaviour can potentially be accounted for, if each simulation is interpreted as an individual participant rather than as a model of the average behaviour of all participants. In that case, we should aggregate results from multiple simulations in order to compare them to the human results, since production probabilities from A&H's experiment were obtained by aggregating data over multiple participants. The next experiment examines this alternative interpretation.

Experiment 2: Aggregate model
To simulate A&H's production experiment with each simulation as one participant, we trained 50 individual models on the A&H training data using the same architecture and hyperparameters as before. We then sampled 100 past tense forms for each verb from each model's output probability distribution. Each of the 5000 output forms (100 each from 50 simulated participants) was categorized either as (a) the verb's regular past tense form, (b-c) the first or second irregular past tense form suggested by A&H, or (d) any other possible form. For the aggregate model, the correlation measure is the only evaluation that makes sense. For regulars, correlation with the human production probabilities was higher than in the previous experiment (0.45 vs. an average of 0.28 in Experiment 1), but for irregulars it was lower (0.19 vs. 0.22 in Experiment 1). The differences between the humans and aggregate model are clear from Figure 5, which shows the distribution of various past tense forms for both model and humans. For example, in only one case did the humans produce an irregular more frequently than the regular (no-change past chind for present chind), whereas there are several cases where the aggregated model does so. Moreover, for the word chind itself, the model prefers chound rather than chind.
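The aggregation procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: each simulated participant is represented simply as a dictionary mapping output forms to probabilities, only three hypothetical participants are used instead of 50, and all probabilities and the function names are invented for the example.

```python
import random
from collections import Counter

def categorize(form, regular, irregulars):
    """Code a sampled form as regular, first/second irregular, or other."""
    if form == regular:
        return "regular"
    if form in irregulars:
        return "irregular_%d" % (irregulars.index(form) + 1)
    return "other"

def aggregate_production(distributions, regular, irregulars,
                         samples_per_model=100, seed=0):
    """Sample forms from every model and return pooled category proportions."""
    rng = random.Random(seed)
    counts = Counter()
    for dist in distributions:
        forms = list(dist)
        weights = [dist[f] for f in forms]
        for form in rng.choices(forms, weights=weights, k=samples_per_model):
            counts[categorize(form, regular, irregulars)] += 1
    total = sum(counts.values())
    categories = ("regular", "irregular_1", "irregular_2", "other")
    return {cat: counts[cat] / total for cat in categories}

# Hypothetical output distributions for the nonce verb "scride".
distributions = [
    {"scrided": 0.90, "scrode": 0.06, "scrid": 0.03, "scroded": 0.01},
    {"scrided": 0.70, "scrode": 0.25, "scrid": 0.04, "scroded": 0.01},
    {"scrided": 0.80, "scrode": 0.10, "scrid": 0.05, "scrude": 0.05},
]

probs = aggregate_production(distributions, "scrided", ["scrode", "scrid"])
print(probs)
```

The resulting proportions play the same role as the human production probabilities, so they can be correlated with them per category.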
In the previous experiment, we saw that individual models often rank implausible past tenses higher than plausible ones. However, we see here that on aggregate nearly all the model's proposed past tenses are those suggested by A&H. Apparently, the unstable beam rankings wash out the implausible forms, i.e., the plausible forms on average occur nearer the top of the beam than any particular implausible form. In fact, the model actually produces fewer "other" forms than the humans.
We also looked at the model's average production of regular and suggested irregular forms for each of the six categories in Figure 5. The results, shown in Figure 6, indicate that the model does capture the main trends seen in humans across these categories, but overall it is more likely to produce irregular forms. Together with the low overall correlation to human results and obvious differences at the fine-grained level, these results suggest that there are serious weaknesses in the ED model, even when results are aggregated across simulations.

We began by assuming that models should be trained at least until they achieve perfect performance on the training set, but perhaps 100 epochs is too much, and the model is just overfitting. Training for less time might produce less skewed beam probabilities, more stable beam rankings, and perhaps better correlations with the human data.
To investigate this possibility, we took the 10 models originally trained on the A&H dataset and computed the correlation with human data for regulars and irregulars after every 10 epochs of training. The highest correlation is achieved after only 10 epochs (0.47 for regulars and 0.50 for irregulars) and the beam probabilities are indeed less skewed: the average probability of the top ranked output is 0.92 after 10 epochs, vs. 0.97 after 100 epochs. However, the models average only 6.5% accuracy on the real irregular words after 10 epochs, so it is difficult to argue that these are good models of human behaviour. It seems that the ED model displays a fundamental tension between correctly modelling humans on real words and nonce words.

Rating data and correlations
We have so far evaluated all models against human production data. However, the A&H model outputs unnormalized scores, so arguably it makes more sense as a model of ratings. A&H also originally evaluated it using Pearson correlation. For completeness we report in Table 3 the correlations for all models on both ratings data and production data, using both Spearman and Pearson coefficients. We find that the A&H model does score better against ratings data, although surprisingly the ED models do too. More importantly, though, the A&H model fits the human data best on 6 out of 8 measures.

Table 3: Correlations (using Spearman's ρ and Pearson's r) between the models' output probabilities vs. human production probabilities and rating data. The data for the individual model is an average over 10 simulations (standard deviation shown in brackets). Highest correlation in each line is shown in bold.

What is the model learning?
To examine the representations acquired by the model, we extract vectors from the encoder's hidden state. As the encoder is a bidirectional LSTM, we concatenate the two states at the last time step (after training on the A&H data). Figure 7a shows a t-SNE visualization of hidden state vectors for both real and nonce verbs in one of our simulations. The model clearly groups the verbs into small clusters, and Figures 7b-c show that this clustering is based on the verbs' trailing phonemes, including some structure within the clusters: e.g., strip /strˈɪp/, grip /grˈɪp/, and trip /trˈɪp/ are next to each other in Figure 7b, and so are clip /klˈɪp/, flip /flˈɪp/, and glip /glˈɪp/. It is not so clear, however, how the model decides on whether to produce a regular or an irregular form for nonce verbs. We do see some evidence in Figure 7c that nonce verbs similar to regular English verbs yield a regular form (note the regular neighbours of nung /nˈʌŋ/), and the same holds for irregulars (note the irregular forms around spling /splˈɪŋ/, for which the model produced splung). However, the model also produces an irregular form (stup /stˈʌp/) for stip /stˈɪp/, which is clearly surrounded by regular English verbs in Figure 7b. We also tested whether the clustering by trailing phonemes is simply an artefact, by training another model on data where we reversed the order of the input phonemes in all cases (e.g., /wˈɪʃ/-/wˈɪʃt/ [wish-wished] becomes /ʃɪˈw/-/tʃɪˈw/). This time, verbs were grouped based on their leading phonemes (that is, the endings of the original verbs), suggesting that the model finds the regularities in the data regardless of the order of phonemes.
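The reversal used in this control can be sketched in a few lines. The sketch assumes, as in the data described earlier, that each character of the transcription stands for one symbol, so that reversing the string reverses the symbol order; the function name is a hypothetical label, and the example uses Unicode IPA for the wish-wished pair.

```python
def reverse_pair(present, past):
    """Reverse the symbol order of both forms of a training pair."""
    return present[::-1], past[::-1]

# wish-wished, with the stress mark treated as an ordinary symbol.
reversed_present, reversed_past = reverse_pair("wˈɪʃ", "wˈɪʃt")
print(reversed_present, reversed_past)  # ʃɪˈw tʃɪˈw
```

After reversal, the past tense suffix appears at the start of the output string, so a model that merely favoured string-final material would no longer find the relevant regularity there.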
Finally, we investigated the model's phoneme representations, expecting a clustering corresponding to the three types of phonemes that trigger different endings in regular past tense forms: /-ɪd/ after coronal stops /t/ and /d/, /-d/ after voiced consonants and vowels, and /-t/ after voiceless consonants. We extract character-level vectors from the decoder hidden state, apply PCA (which worked better than t-SNE in this case) and visualize the resulting vectors. Figure 8 shows that the expected pattern has emerged (except for /h/ in the 'voiced' cluster, but this phoneme never appears at the end of English words).
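The three-way pattern that the phoneme representations are expected to reflect can be written out as an explicit rule. The sketch below is a textbook statement of regular past tense allomorphy, not anything extracted from the model; the phoneme inventory is a simplified assumption, and stems are given as lists of phoneme symbols so that affricates count as single units.

```python
# Voiceless consonants that can end an English verb stem (simplified set;
# /t/ is handled separately because it triggers the syllabic suffix).
VOICELESS_FINALS = {"p", "k", "f", "θ", "s", "ʃ", "tʃ"}

def regular_suffix(final_phoneme):
    """Choose the regular past tense suffix from the stem-final phoneme."""
    if final_phoneme in {"t", "d"}:
        return "ɪd"                  # want -> wanted, need -> needed
    if final_phoneme in VOICELESS_FINALS:
        return "t"                   # wish -> wished
    return "d"                       # voiced consonants and vowels

def regular_past(stem_phonemes):
    """Append the regular suffix to a stem given as a list of phonemes."""
    return "".join(stem_phonemes) + regular_suffix(stem_phonemes[-1])

print(regular_past(["w", "ɪ", "ʃ"]))   # wɪʃt
print(regular_past(["n", "i", "d"]))   # nidɪd
print(regular_past(["m", "ɜ", "n"]))   # mɜnd
```

A model that has learned the regular pattern needs, in effect, to recover these three classes from raw symbols, which is what the clustering in Figure 8 is checking for.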

General discussion and conclusions
Our results confirm that, unlike earlier neural net models, the ED model has no trouble learning the past tense forms of verbs it is trained on. We found, however, that its behaviour on nonce verbs does not correlate with human experimental data as well as K&C's results implied, and indeed not as well as that of A&H's much earlier rule-based model.
One issue in particular seems to be overproduction of irregulars, which the model consistently prefers to regulars for four verbs (7% of considered nonce verbs), while humans nearly always prefer the regular form. This was an issue with earlier neural net models as well (Plunkett and Juola, 1999). On the other hand, when the model outputs something other than the regular form, its choices are plausible. This was not true for earlier models: Plunkett and Juola's model often chose the wrong regular suffix (with incorrect voicing in the final phoneme), and Rumelhart and McClelland's (1986) model failed to produce regular endings for nonce verbs (Prasada and Pinker, 1993; Marcus, 1998). Here, we see from both our model's output and its internal representations that it has correctly identified the necessary voicing distinctions and that nonce words trigger similar representations and behaviour to real words. In future, a stricter test might use nonce words that are intentionally less similar to real words (e.g., the example from Prasada and Pinker (1993): to out-Gorbachev).
It is also worth pointing out that the ED model, unlike A&H's model and many earlier connectionist models, is fed raw phonemes (rather than the phonemes' distinctive features) as input. Although it learns some of the relevant features anyway, it would be interesting to see whether its behaviour becomes more human-like if the correct features are provided in the input.
Although our paper has revealed a number of weaknesses of the ED model, we do agree with K&C that neural network-based cognitive models of inflection deserve re-evaluation in light of recent technical advances. There are many other potential architectures and modelling decisions that could be explored, as well as other behavioural data such as developmental patterns (Blything et al., 2018; Ambridge, 2010) and inflection in other languages (e.g., Clahsen et al., 1992; Ernestus and Baayen, 2004). As noted by Seidenberg and Plaut (2014), models' failures as well as successes can be informative, and we hope that our detailed exploration of the ED model's behaviour will inspire future developments in these models, both for cognitive modelling and NLP.