Seq2Seq Models with Dropout can Learn Generalizable Reduplication

Natural language reduplication can pose a challenge to neural models of language, and has been argued to require variables (Marcus et al., 1999). Sequence-to-sequence neural networks have been shown to perform well at a number of other morphological tasks (Cotterell et al., 2016), and produce results that highly correlate with human behavior (Kirov, 2017; Kirov & Cotterell, 2018) but do not include any explicit variables in their architecture. We find that they can learn a reduplicative pattern that generalizes to novel segments if they are trained with dropout (Srivastava et al., 2014). We argue that this matches the scope of generalization observed in human reduplication.


Introduction
Reduplication is a common morphological process in which all or part of a word is copied and added to one side of the word's stem. An example of reduplication occurring in the language Karao is given in (1): (1) Reduplication in Karao (from Ŝtekaurer et al. 2012): 'fight each other' (2 people) (>2 people) In the example above, the stem ba is reduplicated to create the affixed form baba. Berent (2013) discusses four different possibilities for how speakers could represent reduplication in their minds: (i) memorization of which reduplicated forms go with which stems, (ii) learning a function that copies all segments that undergo reduplication, (iii) learning a function that copies all feature values that undergo reduplication, or (iv) learning a function that uses algebraic symbols to copy the appropriate material, regardless of its segmental or featural content. She concludes that reduplication and similar processes in language involve the fourth possibility, which she labels an identity function. An identity function for reduplication is illustrated in (2), with α acting as a variable that represents the reduplicated sequence.
(2) Reduplication as an algebraic rule α → αα Marcus et al. (1999) came to a similar conclusion regarding reduplication and identity functions, after showing that infants could learn a reduplication-like pattern and generalize that pattern to novel segments. They used this as evidence against connectionist models of grammar, which do not typically include explicit variables 1 (see, for example, Elman, 1990;Rumelhart & McClelland, 1986). Both feed-forward and simple recurrent neural networks fail at learning generalizable identity functions (Berent, 2013;Marcus, 2001;Marcus et al., 1999;Tupper & Shahriari, 2016).
In this paper, we revisit these arguments against variable-free connectionist models in light of recent developments in neural network architecture and training techniques. Specifically, we test Sequence-to-Sequence models  with LTSM (Long Short-Term Memory; Hochreiter & Schmidhuber, 1997) and dropout (Srivastava et al., 2014). We find that the scope of generalization for the models is increased from copying segments to copying feature values when dropout is added. Additionally, we argue that variable-free feature copying is sufficient to model human generalization, contrary to Berent's (2013) claim that an algebraic identity function is necessary.

Background
The debate between connectionist and symbolic theories of grammar has largely been focused on the domain of morphology (for a review, see Pater, 2018). Reduplication was no exception, with standard connectionist models failing to learn the pattern (Gasser, 1993). Standard models also failed to generalize a reduplicative pattern in a way that mimicked human behavior (Marcus et al., 1999). Marcus (2001) argued that this was evidence of the need for variables in models of cognition. While supporters of connectionism pointed out issues with some of Marcus et al.'s (1999) conclusions (e.g. Seidenberg & Elman, 1999), they failed to show that a connectionist network with no variables could learn reduplication without being previously trained on a similar identity function (see Endress, Dehaene-Lambertz, & Mehler, 2007 for an overview of these studies).
Research in phonotactics has also supported the need for variables in models of language. Berent (2013) showed that Hebrew speakers generalized a phonotactic identity-based restriction in a way that she argued required variables. She presented various experimental results demonstrating that speakers would generalize the restriction to novel words, novel segments, and what she claimed to be novel feature values (for more on our interpretation of these findings, see §5.2). This ran contrary to the predictions of phonotactic learning models that did not include variables (Berent et al., 2012).
However, the models tested by Marcus et al. (1999) and Berent et al. (2012) were relatively simple compared to many modern neural network architectures. The modern model that we will examine is the Seq2Seq neural network , originally designed for machine translation. These models have been shown to perform well at learning a variety of morphological tasks (Cotterell et al., 2016), and produce results that highly correlate with human behavior (Kirov, 2017;Kirov & Cotterell, 2018).
Since these models include a number of recently-invented mechanisms, such as an encoder-decoder structure , Long Short-Term Memory layers instead of simple, recurrent ones (Hochreiter & Schmidhuber, 1997), and the possibility of dropout during training (Srivastava et al., 2014), it's unclear whether they will be limited in the same ways as their predecessors.

The Model
In this section, we will give a brief introduction to each of the mechanisms in our model that we consider to be relevant to the simulations presented in §4. For the documentation on the Python packages used to implement the model, see Chollet et al. (2015) and Rahman (2016). We chose to focus on Seq2Seq models because of their recent success in a number of linguistic tasks (summarized in §3.1). We leave exploring the differences between this architecture and its alternatives (such as simple recurrent networks) to future work.

The Seq2Seq Architecture
Seq2Seq neural networks have the ability to map from one string to another, without requiring a one-on-one mapping between the strings' elements . The model achieves this by using an architecture made up of an encoder and decoder pair. Each member in the pair is its own recurrent network, with the encoder processing the input string and the decoder transforming that processed data into an output string. The ability of these models' inputs and outputs to have independent lengths is useful for morphology, which usually involves adding or copying segments in a stem. An example of this for reduplication is shown in Figure 1. 3 In Figure 1, the encoder passes through the entire input string (i.e. the stem [ba]) before transferring information to the decoder. The decoder then unpacks this information, and gives a reduplicated form (i.e. [baba]) as output. In all of the simulations discussed in this paper, the encoder is bidirectional, meaning that it passes through the input string starting from both the left and right edges.

Long Short-Term Memory (LSTM)
LSTM (Hochreiter & Schmidhuber, 1997) is a kind of recurrent neural network layer which allows the model to store certain information in memory more easily than a simple recurrent layer could. While this architectural innovation was originally designed to address the problem of vanishing gradients (Bengio et al., 1994), it has been demonstrated that LSTM can also provide models with added representational power (Levy et al., 2018). The way the model performs both of these tasks is by using cell states, bundles of interacting layers that can learn which features are important for the model to keep track of in a long-term way. During training, the network is not only keeping track of which information will allow it to predict the output from the input, but also which information at a given time step (i.e. at a given segment in the simulations presented here) will help it to predict the output at future time steps.
Vanishing gradients are not much of a concern in morphological learning, since input and output strings are relatively short in this domain of language. However, the effects of LSTM's added representational power on learning of reduplication have not yet been explored.

Dropout
Dropout is a method used in neural network training that helps models generalize correctly to items outside of their training data (Srivastava et al., 2014). It achieves this by having some units in the network "drop out" in each forward pass. This prevents the network from finding solutions that are too dependent on a small number of units. Practically speaking, this is implemented by setting a probability with which each unit will drop out (a hyper parameter set by the analyst) and then multiplying every unit's output by either a 0 or 1, depending on whether it has been randomly chosen to be dropped out or not. Which units are dropped out is resampled each forward pass, causing the network's solution to be more general than it might have been otherwise. This is illustrated for a single forward pass in a simple, feed-forward network on the right side of Figure 2. In this illustration, dropout causes the output units to have an activation of 2, instead of 4, because a unit in the middle layer is being dropped out and cannot contribute to the activations in the layer above it. For the simulations presented here that use dropout, it was applied with equal probability to all layers of the network.

Experiments
To test whether reduplication can be modeled by a neural network without explicit variables, we ran a number of simulations in which the model was trained on a reduplication pattern in a toy language and tested on how it generalized that pattern to novel data. To test what kind of generalization the model was performing, we set up different scenarios: one in which the model was tested on a novel syllable made up of segments it had seen reduplicating in its training data ( §4.1), one in which the model was tested on a syllable made with a segment that it hadn't received in training ( §4.2), and one in which the model was tested on a syllable with a novel segment containing a feature value that hadn't been presented in the training data ( §4.3).
In the experiments presented here, a language's segments were each represented by a unique, randomly-produced vector of 6 features (excluding , i.e. any of the feature vectors that began with -1 were considered a consonant and any that began with 1 were considered a vowel. If an inventory had no vowels, one of its consonants was randomly chosen and its value for the feature [syllabic] was changed to 1. The toy language for any given simulation consisted of all the possible CV syllables that could be made with that simulation's randomly created segment inventory. Crucially, before the data was given to the model, some portion of it was withheld for testing (see the subsections below for more information on what was withheld in each testing condition). The mapping that the model was trained on treated each stem (e.g.
[ba]) as input and each reduplicated form (e.g. [baba]) as output. The model's input and output lengths were fixed to 2 and 4 segments, respectively (reflecting the fact that all the toy languages only had stems that were 2 segments long).
The models were trained for 1000 epochs, with training batches that included all of the learning data (i.e. learning was done in batch). The loss function that was being minimized was mean-squared error, and the minimization algorithm was RMSprop (Tieleman & Hinton, 2012). The models had 2 layers each in the encoder and decoder, with 18 units in each of these layers. All other parameters were the default values in the deep-learning Python package, Keras (Chollet et al. 2015).
To test whether the model generalized to withheld data, a relatively strict definition of success was used in testing. The model was given a withheld stem as input, and the output it predicted was compared to the correct output (i.e. the reduplicated form of the stem it was given). If every feature value in the predicted output had the same sign (positive/negative) as its counterpart in the correct output, the model was considered to be successfully generalizing the reduplication pattern. However, if any of the feature values did not have the same sign, that model was considered to be non-generalizing. While we only report the results from 25 runs in each condition, we ran many more while investigating various hyperparameter settings and possibilities about the construction of the training data. The results presented here are representative of the general pattern of results.

Generalizing to Novel Syllables
Our first set of simulations tested whether the model could generalize to novel syllables. If the model failed at this task, then it would mean that it was memorizing whole words in the training data, rather than learning an actual pattern. Figure 3 illustrates this.
In Figure 3 For this condition, toy languages always contained 40 segments in their inventory, and the probability of a unit being dropped out was 0%.
The model successfully reduplicated syllables from training in all runs for this condition. Additionally, it generalized to novel syllables in 22 of the 25 simulations (88%). These results are summarized in Figure 6. This shows that a standard Seq2Seq model, with LSTM but no dropout, can perform generalization to novel syllables, and does so a majority of the time.

Generalizing to Novel Segments
The next scope of generalization described by Berent (2013) is generalization to novel segments. To test whether our model could achieve this, we created languages with inventories of 40 segments (as described above), but randomly chose a single consonant in each run to be withheld for testing. This is illustrated for a simplified example in Figure 4. For the results reported in this section, toy languages always had inventories of 40 segments. Two conditions were tested in regards to dropout: one in which dropout never happened (0%) and one which it happened to the majority of units in any given forward pass (75%).
When no dropout was applied, the model was unable to reliably generalize-with only 6 of the 25 runs achieving success on novel segments (24 out of 25 for trained segments). However, when dropout was applied with a probability of 75%, the model successfully generalized to novel segments in 15 of the 25 runs (25 out of 25 for trained segments). These results are summarized in Figure  6. This demonstrates that without dropout, our model does not reliably generalize to novel syllables, but that with dropout it does.

Generalizing to Novel Feature Values
The most powerful form of generalization Berent (2013) discusses is generalization to novel feature values, which would signify the acquisition of a proper identity function. In the context of reduplication, this would involve correctly applying the process to a stem that includes feature values never seen in training. For example, if all of the consonants in training were oral, but the process generalized to nasal consonants, this would demonstrate generalization to the novel feature value [+nasal]. This is shown in Figure 5.
In the example above, the syllable [na] represents a novel feature value. If a model only learned a function that learned to copy individual feature values from the stem into the reduplicant, it wouldn't generalize to this kind of novel feature value correctly. This generalization can only occur if the model learns to copy the reduplicant irrespective of individual features.
For the results reported here, toy languages always contained 43 segments in their inventory, and these were not produced randomly (this made it easier to ensure that a particular feature value could be withheld). A variety of other segment inventories were tested, with no changes in the model's performance. Results are presented here for simulations using 0% and 75% dropout probability, although numerous other values for this were also tested.
Regardless of whether dropout occurred, the model never generalized to novel feature values. These results are summarized in Figure 6. This shows that a standard Seq2Seq model, regardless of whether it has dropout, cannot generalize to novel feature values. We discuss in §5.2 why we do not see this limitation as a flaw in terms of modeling human language learning.

Summary of Results
The results for each simulation can be viewed side-by-side in Figure 6. The findings from this series of experiments showed that even without dropout, a Seq2Seq model is not simply learning to Figure 4: Illustration of generalization to a novel segments in a language with only eight sounds. Specific IPA labels are hypothetical. Syllables surrounded by the black box were presented in training, while the circled syllable was withheld for testing. Figure 5: Illustration of generalization to a novel feature values in a language with only nine sounds.
Specific IPA labels are hypothetical. Syllables surrounded by the black box were presented in training, while the circled syllable was withheld for testing.

98
6 memorize <stem, reduplicant> pairs, since it was able to generalize reduplication to novel stems (i.e. novel syllables). We also showed that the model, when using dropout in training, can reliably generalize reduplication to novel segments.

Can humans generalize to novel feature values?
When discussing generalization, Berent (2013) used evidence from Hebrew speakers' phonotactic judgments to support the idea that they learn a true identity function when acquiring their native phonology. The judgments centered around a phonotactic restriction in Hebrew that prohibits the first two consonants in a stem from being identical. For example, the word simem 'he intoxicated' is grammatical, while the nonce word *sisem is not. Berent (2013) showed that speakers generalized this pattern by having them rate the acceptability of various kinds of nonce forms.
The first results she discussed demonstrated Hebrew speakers generalizing to novel words. These words were made up of segments that were attested in Hebrew. Speakers in this experiment rated words with s-s-m stems (like *sisem above) as significantly less acceptable than words with s-m-m and p-s-m stems. This demonstrated that Hebrew speakers were doing more than just memorizing the lexicon of their language (i.e. that they could extract phonotactic patterns).
The next results that Berent (2013) presented involved Hebrew speakers generalizing the pattern to novel segments. The segments of interest were /tʃ/, /dʒ/ and /w/, all of which are not present in native Hebrew stems. Even when these non-native phonemes were used, Hebrew speakers rated words whose first two consonants were identical (e.g. dʒ-dʒ-r) as worse than those that did not violate the phonotactic restriction (e.g. r-dʒ-dʒ). This demonstrated that speakers had not just memorized a list of consonants that cannot cooccur (e.g. *pp, *ss, *mm, etc.) while acquiring their phonological grammar.
Finally, Berent (2013) showed that speakers can generalize the pattern to the segment [θ], which she claimed represented generalization to a novel feature value (which she referred to as across the board generalization). While we agree with her conclusions regarding the first two sets of results, we find her interpretation of this final result to be less convincing. The novel feature value that she claimed was represented by [θ]  Returning to reduplication, a relevant generalization experiment is Marcus et al. (1999). In that experiment, infants generalized a reduplication-like pattern to novel words with novel segments, but not with novel feature values. All of the words in Marcus et al.'s testing phase used feature values that would have been familiar to the infants from their native language of English.
To our knowledge, no experiment has shown humans generalizing to truly novel feature values. Because of this, we cannot conclude whether our model's results from §4.3 are human-like or not.

Why did dropout allow the model to generalize to novel segments?
While we've shown that dropout increases the Seq2Seq model's scope of generalization, it still remains unclear why this form of regularization This explanation could suggest that other forms of regularization, such as an L2 prior, that don't create a similar kind of ambiguity, may not have the same effect on reduplication learning.

Future Work
A number of avenues exist for furthering this line of research. First of all, as mentioned in §5.2, the full picture of how humans behave in regards to generalizing reduplication-like patterns is still an open question. Additionally, more direct modeling of some of the experimental data that does exist on this subject (e.g. Marcus et al., 1999) could help to shed more light on how well a Seq2Seq network with no explicit variables can model such results.
The simulations outlined here could also be made to more realistically mimic natural language. Stems of varying lengths, and with various syllabic structures could pose an interesting challenge to any model of reduplication. A reduplication process that copies only part of the stem could also test whether the model is capable of learning a more complex identity function. Additionally, presenting the data in a way that mimics the exposure children receive could be useful, since infants are not directly presented with stemreduplicant pairs in isolation.
Future research could also address the effects of dropout presented here. First of all, since dropout is one of many different regularization methods (Wager et al., 2013), testing its alternatives could be useful. And if it's the case that dropout allows a model to learn in a more human-like way, then adding dropout to models of other domains of language (such as phonotactics) should also be explored.

Conclusion
In summary, we found that a Seq2Seq model could learn a simple reduplication pattern and generalize that pattern to novel items. Dropout increased the model's scope of generalization from novel syllables to novel segments, demonstrating a human-like behavior that has been previously used as an argument against connectionist models with no explicit variables (for other alternatives to explicit variables in neural networks, see Garrido Alhama, 2017; Gu et al., 2016). These results weaken this line of argument against connectionism. (Štekauer, Valera, & Körtvélyessy, 2012)