The Fine Line between Linguistic Generalization and Failure in Seq2Seq-Attention Models

Seq2seq-based neural architectures have become the go-to choice for sequence-to-sequence language tasks. Despite their excellent performance on these tasks, recent work has noted that these models typically do not fully capture the linguistic structure required to generalize beyond the dense sections of the data distribution (Ettinger et al., 2017), and as such are likely to fail on examples from the tail of the distribution, such as noisy inputs (Belinkov and Bisk, 2018) or inputs of atypical length (Bentivogli et al., 2016). In this paper we look at a model's ability to generalize on a simple symbol rewriting task with a clearly defined structure. We find that the model's ability to generalize this structure beyond the training distribution depends greatly on the chosen random seed, even when performance on the standard test set remains the same. This finding suggests that a model's ability to capture generalizable structure is highly sensitive, and moreover, that this sensitivity may not be apparent when evaluating the model on standard test sets.


Introduction
It is well known that language has certain structural properties which allow natural language speakers to make "infinite use of finite means" (Chomsky, 1965). This structure allows us to generalize beyond the typical machine learning definition of generalization (Valiant, 1984), which considers performance on the distribution that generated the training set, permitting the understanding of any utterance sharing the same structure, regardless of probability. For example, sentences of length 100 typically do not appear in natural text or speech (our personal 'training set'), but can be understood regardless due to their structure. We refer to this notion as linguistic generalization.
Many problems in NLP are treated as sequence-to-sequence tasks with solutions built on seq2seq-attention based models. While these models perform very well on standard datasets and also appear to capture some linguistic structure (McCoy et al., 2018; Belinkov et al., 2017; Linzen et al., 2016), they can also be quite brittle, typically breaking on uncharacteristic inputs (Lake and Baroni, 2018; Belinkov and Bisk, 2018), indicating that the extent of linguistic generalization these models achieve is still somewhat lacking.
Due to the high capacity of these models, it is not unreasonable to expect them to learn some structure from the data. However, learning structure is not a sufficient condition for achieving linguistic generalization. If this structure is to be usable on data outside the training distribution, the model must learn the structure without additionally learning (overfitting on) patterns specific to the training data. One may hope that, given the right hyperparameter configuration and regularization, a model converges to a solution that captures the reusable structure without overfitting too much on the training set. While such a solution exists in theory, in practice it may be difficult to find.
In this work, we look at the feasibility of training and tuning seq2seq-attention models towards a solution that generalizes in this linguistic sense. In particular, we train models on a symbol replacement task with a well defined generalizable structure. The task is simple enough that all models achieve near perfect accuracy on the standard test set, i.e., where the inputs are drawn from the same distribution as that of the training set. We then test these models for linguistic generalization by creating test sets of uncharacteristic inputs, i.e., inputs that are not typical in the training distribution but still solvable given that the generalizable structure was learned. Our results indicate that generalization is highly sensitive (a sensitivity also hinted at by McCoy et al., 2018): even changes in the random seed can drastically affect the ability to generalize. This dependence on an element that is not (or ideally should not be) a hyperparameter suggests that the line between generalization and failure is quite fine, and may not be feasible to reach by hyperparameter tuning alone.

Generalization in a Symbol Rewriting Task
Real world NLP tasks are complex, and as such, it can be difficult to precisely define what a model should and should not learn during training. As done in previous work (Lake and Baroni, 2018; Rodriguez and Wiles, 1998), we ease analysis by looking at a simple formal task. The task is set up to mimic (albeit in an oversimplified manner) the input-output symbol alignments and local syntactic properties that models must learn in many natural language tasks, such as translation, tagging, and summarization. The task is defined over sequences of symbols drawn from an alphabet X. Each symbol x in X is uniquely associated with its own output alphabet Y_x. The output is created by taking each individual symbol x_i in the input sequence and rewriting it as a sequence of k symbols from Y_{x_i}. To do the task, the model must learn alignments between the input and output symbols, and preserve the simple local syntactic condition that every group of k symbols must come from the same output alphabet Y_{x_i}.
As an example, let X = {A, B} and k = 3, with output alphabets Y_A = {a_1, a_2, a_3} and Y_B = {b_1, b_2, b_3}. Each a_i and b_i has two possible surface forms, a_i1 or a_i2 and b_i1 or b_i2 respectively, so each input symbol maps to 48 (8 × 3!) possible output sequences. A possible valid output for the input AB is a_21 a_32 a_11 b_32 b_11 b_22. Note that such valid strings are selected at random when generating the dataset; we allow this stochasticity in the outputs in order to prevent the model from resorting to pure memorization. For our task, |X| = 40 and each x_i has a corresponding output alphabet Y_{x_i} of size 16.
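The generation process for the toy example above can be sketched as follows. This is our own minimal illustration, not the authors' code: the helper names and the permute-then-pick-a-variant structure are assumptions, chosen so that each input symbol yields the 48 possible outputs mentioned in the text.

```python
import random

# Sketch of data generation for the toy example (X = {A, B}, k = 3).
# Each base symbol has two surface variants; an input symbol is rewritten
# as a random permutation of its k base symbols, each in a random variant,
# giving 3! * 2^3 = 48 possible outputs per input symbol.
K = 3
OUTPUT_ALPHABETS = {
    "A": [("a11", "a12"), ("a21", "a22"), ("a31", "a32")],
    "B": [("b11", "b12"), ("b21", "b22"), ("b31", "b32")],
}

def rewrite(symbol):
    bases = random.sample(OUTPUT_ALPHABETS[symbol], K)  # random ordering
    return [random.choice(variants) for variants in bases]

def generate_sample(input_seq):
    return [tok for sym in input_seq for tok in rewrite(sym)]

print(generate_sample("AB"))  # e.g. ['a21', 'a32', 'a11', 'b32', 'b11', 'b22']
```

Because the variant and ordering choices are random, repeated calls on the same input produce different (but equally valid) outputs, which is exactly the stochasticity described above.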

Model and Training Details
The models we use are single layer, unidirectional, seq2seq LSTMs (Hochreiter and Schmidhuber, 1997) with bilinear attention (Luong et al., 2015), trained with vanilla SGD. To determine the epoch to stop training at, we create a validation set of 2000 samples with the same characteristics as the training set, i.e., of length 5-10 with no repeated symbols. Training is stopped once accuracy (the fraction of inputs for which the model produces a valid output) on the validation set either decreases or remains unchanged. The size of the hidden state and embeddings was chosen to be as small as possible without reducing validation accuracy, giving a size of 32.
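The stopping criterion can be sketched as a simple loop. This is our own illustration of the rule as stated above; `train_epoch` and `validation_accuracy` are hypothetical callbacks standing in for the actual training and evaluation code.

```python
def train_with_early_stopping(model, train_epoch, validation_accuracy):
    """Train until validation accuracy decreases or remains unchanged.
    train_epoch and validation_accuracy are hypothetical callbacks."""
    best = float("-inf")
    epochs_improved = 0
    while True:
        train_epoch(model)
        acc = validation_accuracy(model)
        if acc <= best:  # decreased or unchanged: stop
            break
        best = acc
        epochs_improved += 1
    return epochs_improved
```

Note that this stops at the first epoch that fails to improve, rather than using a patience window as is common in other early-stopping schemes.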
Hyperparameters are often tuned on a validation set drawn from the same distribution as the training set (as opposed to validating on data from a different distribution), which motivated our initial decision to use a validation set of characteristic inputs to decide the epoch to stop at. However, we noticed only small variation in validation performance across different learning rates and dropout probabilities (with dropout applied to the input and output layers). In order to fine-tune these parameters to avoid extreme overfitting, we created another validation set consisting of 5000 samples of "uncharacteristic" inputs, i.e., inputs with repeated symbols and lengths varying from 3 to 12. Based on performance on this validation set, averaged across a set of randomly chosen seeds, the learning rate and dropout probability were set to 0.125 and 0.1, respectively. All other hyperparameters were also decided using this validation set. Further training details are listed in Table 1.

Experiments
To generalize to any input sequence, a model must: (1) learn the generalizable structure, i.e., the alignments between input and output alphabets, and (2) not learn any dependencies among input symbols or on sequence length. To test the extent to which (2) is met, we train seq2seq-attention models on 100,000 randomly generated samples, with inputs uniformly generated with lengths 5-10 and no input symbol appearing more than once in a single sample. If the model learns the alignments without picking up other dependencies among input symbols or input lengths, then it should have little problem handling inputs with repeated symbols or different lengths, despite never seeing such strings.
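Since accuracy counts an output as correct only if it satisfies the task's structural conditions, validity can be checked mechanically. The sketch below is our own code, assuming an `alphabets` dict mapping each input symbol to the set of its allowed output symbols.

```python
def is_valid_output(input_seq, output_seq, alphabets, k=3):
    """Check the two structural conditions: the output contains exactly k
    symbols per input symbol, and each group of k symbols comes from the
    corresponding input symbol's output alphabet."""
    if len(output_seq) != k * len(input_seq):
        return False
    for i, sym in enumerate(input_seq):
        group = output_seq[i * k:(i + 1) * k]
        if any(tok not in alphabets[sym] for tok in group):
            return False
    return True
```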
For evaluation, we trained 50 models with the same configuration (chosen using the validation sets) but with different random seeds. We created four test sets, each with 2000 randomly generated samples. The first test set consists of samples characteristic of the training set, with lengths 5-10 and no repeated symbols (Standard). The second set tests the model's ability to generalize to repeated symbols in the input (Repeat). The third and fourth sets test its ability to generalize to different input lengths: strings of length 1-4 (Short) and 11-15 (Long), respectively.
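Generating the inputs for the four test sets can be sketched as follows. This is our own illustration: `SYMBOLS` stands in for the 40-symbol input alphabet, and we assume the Repeat set keeps the training lengths of 5-10.

```python
import random

SYMBOLS = [f"x{i}" for i in range(40)]  # stand-in for the input alphabet X

def sample_input(min_len, max_len, allow_repeats):
    """Draw one input sequence of a uniformly random length."""
    n = random.randint(min_len, max_len)
    if allow_repeats:
        return [random.choice(SYMBOLS) for _ in range(n)]
    return random.sample(SYMBOLS, n)  # length n, no repeated symbols

test_sets = {
    "standard": [sample_input(5, 10, False) for _ in range(2000)],
    "repeat":   [sample_input(5, 10, True) for _ in range(2000)],
    "short":    [sample_input(1, 4, False) for _ in range(2000)],
    "long":     [sample_input(11, 15, False) for _ in range(2000)],
}
```

Only the Standard set matches the training distribution; the other three probe generalization along the repeat and length axes independently.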

Results
The distribution of model accuracy measured at instance level on the four test sets across all 50 seeds is given in Figure 1. All models perform above 99% on the Standard set, with a deviation well below 0.1. However, the deviation on the other sets is much larger, ranging from 13.39 on the Repeat set to 20.63 on the Long set. In general, the models perform better on the Repeat set than on the Short and Long sets. Performance on the Short and Long sets is not always bad, with some seeds giving performance above 95% on one or the other. Ideally, we would like a seed that performs well on all the test sets; however, this seems hard to obtain. The highest average performance across the non-standard test sets for any seed was 79.52%. Learning to generalize to both repeated and longer inputs seems even harder, with the Pearson correlation between performance on the Repeat and Long sets being -0.71. We provide the summary statistics across all 50 runs (50 different random seeds) in Table 4, which gives the mean, standard deviation, minimum, and maximum accuracies across all random seeds. We additionally provide a sample of performances for some individual random seeds in Table 3, with the highest and lowest accuracies in each column highlighted.
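The per-seed summary statistics and the Repeat-Long correlation reported above can be computed as below. The accuracy arrays here are hypothetical placeholders for illustration, not the paper's actual per-seed numbers.

```python
import statistics

# Hypothetical per-seed accuracies (placeholders, not the paper's data).
repeat_acc = [0.95, 0.60, 0.80, 0.30, 0.90]
long_acc = [0.20, 0.85, 0.40, 0.95, 0.30]

def summarize(xs):
    """Mean, sample standard deviation, min, and max across seeds."""
    return {"mean": statistics.mean(xs), "std": statistics.stdev(xs),
            "min": min(xs), "max": max(xs)}

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

print(summarize(repeat_acc))
print(pearson(repeat_acc, long_acc))  # negative for this toy data
```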

Related Work
Elman (1991) provides one of the earliest works investigating the ability of RNNs to capture properties of language necessary for generalization. Further work explored how RNNs can learn context-free languages (Wiles and Elman, 1995; Rodriguez and Wiles, 1998) as well as some context-sensitive languages (Gers and Schmidhuber, 2001). Recent work has moved outside the formal language space, with experiments that indicate RNNs may capture the hierarchical structure of natural language syntax, despite not having any hierarchical bias built into the model or the data (McCoy et al., 2018; Gulordava et al., 2018).
Though these models appear to capture the systematic structure necessary for generalization, how they do so is often extremely counterintuitive (Williams et al., 2018; Bernardy and Lappin, 2017). Lake and Baroni (2018) have questioned the systematicity with which these models learn, and make similar observations about the difficulty of generalizing to longer strings. Despite this difficulty, they note that a model only needs to see a relatively small number of longer strings during training in order to generalize (up to the length of the longest string seen in training).
Our experiments indicate that the final model's ability to generalize is highly dependent on the random seed. It should be noted, however, that the random seed affects several components of the training process, most notably the exact initialization of the network and the order in which training samples are presented. Though the exact initialization likely plays a role, recent work by Liska et al. (2018) gives evidence that the order in which training data is presented may play an equally large part in determining whether a model achieves generalization.

Conclusions
The variability in generalization on uncharacteristic inputs (and thus in the extent of linguistic generalization) across random seeds is alarming, particularly given that standard test set performance remains largely the same regardless. The task presented here was simple and easy to analyze; future work may extend this analysis to natural language tasks. If these findings hold there as well, they may motivate a new evaluation paradigm for NLP: one that emphasizes performance on uncharacteristic (but structurally sound) inputs in addition to the data typically seen in training.