Jump to better conclusions: SCAN both left and right

Lake and Baroni (2018) recently introduced the SCAN data set, which consists of simple commands paired with action sequences and is intended to test the strong generalization abilities of recurrent sequence-to-sequence models. Their initial experiments suggested that such models may fail because they lack the ability to extract systematic rules. Here, we take a closer look at SCAN and show that it does not always capture the kind of generalization that it was designed for. To mitigate this we propose a complementary dataset, which requires mapping actions back to the original commands, called NACS. We show that models that do well on SCAN do not necessarily do well on NACS, and that NACS exhibits properties more closely aligned with realistic use-cases for sequence-to-sequence models.


Introduction
In a recent paper, Lake and Baroni (2018) (L&B) investigate if recurrent sequence-to-sequence models can exhibit the same strong generalization that humans are capable of, by virtue of our capacity to infer the meaning of a phrase from its constituent parts (i.e., compositionality), providing empirical tests for this long-standing goal (Fodor and Pylyshyn, 1988). Compositional generalization might be a fundamental component in making models drastically less sample-thirsty than they currently are. L&B introduce the SCAN data set (§2), meant to study such generalization to novel examples. It consists of simple command-action pairs, in which more complex commands are composed of simpler ones (see Figure 1 for examples).
SCAN comprises several tests of generalization, namely with respect to (1) a random subset of the data ('simple'), (2) commands with action sequences longer than those seen during training ('length'), and (3) commands that use, in novel ways, a primitive that was only seen in isolation during training ('primitive'). In the latter case, the training set would for example only include the command 'jump', after which the test set includes all other commands containing 'jump', e.g. 'jump opposite left after walk twice'.
In this paper we take a closer look at SCAN. We start with the observation (§3) that there are few target-side dependencies in the data, a consequence of SCAN being generated from a phrase-structure grammar. We show (§6) that this allows simple sequence-to-sequence models (§5) to obtain good accuracies e.g. on tasks involving a new primitive, even without access to previous outputs. However, these simple models do not use composition in any interesting way, and their performance is therefore not a realistic indicator of their generalization capability. We hence propose NACS (§4) as a more realistic alternative: SCAN with commands and actions flipped, i.e., mapping actions back to their original commands. This is harder, because different commands may map to the same action sequence, and it introduces target-side dependencies, so that previous outputs need to be remembered. We show in particular that well-tuned attention-based models do achieve a certain degree of generalization on SCAN, and, as predicted, simpler models do better there. However, the models still struggle in the more demanding NACS setup, which we offer as a challenge for future work.
Our contributions can be summarized as follows: 1. we provide an analysis of SCAN and make the important observation that it does not test for target-side dependencies, allowing too simple models to do well; 2. we propose NACS to introduce target-side dependencies and remedy the problem; 3. we repeat all experiments in Lake and Baroni (2018) using early-stopping on validation sets created from the training data.

SCAN
SCAN stands for Simplified version of the CommAI Navigation tasks (Mikolov et al., 2016). Each example in SCAN is constructed by first sampling a command X = (x_1, ..., x_T) from a finite phrase-structure grammar with start symbol C. For each command, the corresponding target action sequence Y = (y_1, ..., y_T) then follows by applying a set of interpretation functions, of which only the last function requires global reordering, which occurs at most once per command. See the supplementary materials for the full set. Figure 1 shows examples of commands and their action sequences as obtained by the interpretation functions. The commands can be decoded compositionally by a learner by discovering the interpretation functions, enabling generalization to unseen commands. The total data set is finite but large (20910 unambiguous commands).
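To make the interpretation step concrete, the following is a minimal sketch of a command interpreter. It is not the authors' code: the rules are an assumption based on the SCAN semantics as described by L&B (the full set of interpretation functions is in the supplementary materials), the short action names (JUMP, LTURN, ...) follow the examples in this paper, and all function names are hypothetical.

```python
from typing import List

# Assumed vocabulary; action names follow the examples used in this paper.
ACTIONS = {"walk": "WALK", "look": "LOOK", "run": "RUN", "jump": "JUMP"}
TURNS = {"left": "LTURN", "right": "RTURN"}
REPEATS = {"twice": 2, "thrice": 3}


def interpret_verb_phrase(words: List[str]) -> List[str]:
    """Interpret e.g. ['jump'], ['turn', 'left'] or ['jump', 'around', 'left']."""
    verb = words[0]
    # 'turn' contributes no action of its own, only the turning itself.
    act = [] if verb == "turn" else [ACTIONS[verb]]
    if len(words) == 1 and verb != "turn":
        return act
    if len(words) == 2 and words[1] in TURNS:
        return [TURNS[words[1]]] + act
    if len(words) == 3 and words[2] in TURNS:
        turn = TURNS[words[2]]
        if words[1] == "opposite":
            return [turn, turn] + act
        if words[1] == "around":
            return ([turn] + act) * 4
    raise ValueError(f"cannot interpret phrase: {' '.join(words)}")


def interpret_clause(words: List[str]) -> List[str]:
    """Handle an optional trailing 'twice' / 'thrice' by plain repetition."""
    if words and words[-1] in REPEATS:
        return interpret_verb_phrase(words[:-1]) * REPEATS[words[-1]]
    return interpret_verb_phrase(words)


def interpret(command: str) -> List[str]:
    """Map a command to its action sequence (at most one 'and' / 'after')."""
    words = command.split()
    for conjunction in ("and", "after"):
        if conjunction in words:
            k = words.index(conjunction)
            first = interpret_clause(words[:k])
            second = interpret_clause(words[k + 1:])
            # 'after' is the only rule that reorders the two clauses.
            return first + second if conjunction == "and" else second + first
    return interpret_clause(words)
```

Under these assumed rules, `interpret("jump opposite left after walk twice")` executes the second clause first, and `interpret("jump around left thrice")` yields a sequence of 24 actions.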

SCAN prefers simple models
We observe an important property of the data set generation process for SCAN: temporal dependencies of the action sequence are limited to the phrasal boundaries of each sub-phrase, which span at most 24 actions (e.g. jump around left thrice). Crucially, even rules that require repetition (such as 'thrice') as well as global reordering, can be resolved by simple counting and without remembering previously generated outputs, due to the limited depth of the phrase-structure grammar (see e.g. Rodriguez and Wiles (1998)).
This observation has two important implications. First, because SCAN is largely a phrase-to-phrase conversion problem, any machine learning method that aims at solving SCAN needs to have an alignment mechanism between the source and target sequences. Such an alignment mechanism could work fairly accurately by simply advancing a pointer. Somewhat contrary to the observation by Lake and Baroni (L&B), we therefore hypothesize that an attention mechanism (Bahdanau et al., 2015) always helps when a neural conditional sequence model (Sutskever et al., 2014; Cho et al., 2014) is used to tackle any variant of SCAN. Second, we speculate that any algorithm with strong long-term dependency modeling capabilities can be detrimental in terms of generalization, because such an approach might inappropriately capture spurious target-side regularities in the training data. We thus hypothesize that less powerful decoders generalize better to unseen action combinations on SCAN when equipped with an attention mechanism.
To summarize: good performance on SCAN does not necessarily indicate the capability of a model to strongly generalize. SCAN favors simpler models that need not capture target-side dependencies, which might not work well on more realistic sequence-to-sequence problems, such as machine translation, where strong auto-regressive models are needed for good results (Bahdanau et al., 2015;Kaiser and Bengio, 2016).

NACS: actions to commands
By simply flipping the source X and target Y of each example, we obtain a data set that suddenly features strong target-side dependencies. Even when the mapping p(Y|X) from the source to target is simple, the opposite p(X|Y) ∝ p(Y|X)p(X) is non-trivial due to the complexity of the prior p(X). The inclusion of p(X) naturally induces strong dependencies among the output tokens, while maintaining the original properties of SCAN that were intended to test various aspects of systematic generalization.
NACS naturally makes the mapping that needs to be learned stochastic and multi-modal (sensitive to both commands and actions). For instance, an action sequence of the form x_1 x_2 could be mapped to either x_1 and x_2 or x_2 after x_1, both of which are correct. In order for a model to decide whether to output "and" or "after", it is necessary for it to remember what has already been generated (i.e., x_1 or x_2).
Another example is LTURN LTURN LTURN LTURN, which can be translated into either "turn around left" or two repetitions of "turn opposite left". Deciding whether to output "and" after the first phrase requires the model to remember whether "around" was generated previously.
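The two ambiguities above can be illustrated with a small worked example of p(X|Y) ∝ p(Y|X)p(X). This is only a sketch under stated assumptions: the toy command table, the uniform prior and the function name `posterior` are illustrative and not part of the data set or of our models.

```python
from typing import Dict

# Toy command -> action-sequence table built from the examples in the text.
COMMAND_TO_ACTIONS: Dict[str, str] = {
    "walk and jump": "WALK JUMP",
    "jump after walk": "WALK JUMP",
    "jump and walk": "JUMP WALK",
    "turn around left": "LTURN LTURN LTURN LTURN",
    "turn opposite left twice": "LTURN LTURN LTURN LTURN",
    "turn opposite left and turn opposite left": "LTURN LTURN LTURN LTURN",
}


def posterior(actions: str, prior: Dict[str, float]) -> Dict[str, float]:
    """Compute p(X|Y) when p(Y|X) is deterministic, i.e. either 0 or 1."""
    # Only commands whose interpretation equals `actions` have p(Y|X) = 1.
    weights = {x: prior[x] for x, y in COMMAND_TO_ACTIONS.items() if y == actions}
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}


# Assumed uniform prior over the toy commands.
UNIFORM = {x: 1.0 / len(COMMAND_TO_ACTIONS) for x in COMMAND_TO_ACTIONS}

for y in ("WALK JUMP", "LTURN LTURN LTURN LTURN"):
    print(y, "->", posterior(y, UNIFORM))
```

Because several commands share one action sequence, the posterior spreads its mass over more than one command, and which of them a decoder completes depends on the tokens it has already emitted.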
In §6 we experimentally evaluate the proposed NACS task using the same scenarios as SCAN (simple, length and primitive). We observe that NACS prefers more advanced models that could capture long-term dependencies in the output (now a command sequence) better. However, we notice that even these powerful models, equipped with GRUs and attention, cannot systematically generalize to this task, as was also observed by Lake and Baroni (2018). Based on this observation, we believe that NACS (or perhaps a combination of SCAN and NACS) is better suited for evaluating any future progress in this direction.

Sequence-to-sequence models
In this section, we describe the sequence-to-sequence models we use for evaluating on SCAN and its proposed sibling NACS.
We directly model the probability of a target sequence given a source sequence p(Y |X).
Our encoder-decoder is modeled after Cho et al. (2014) and our attention-based encoder-decoder after Bahdanau et al. (2015). The attention-based decoder is a function that takes as input the previous target word embedding e_{y_{i-1}}, the context vector c_i, and the previous hidden state s_{i-1}, and produces the current hidden state s_i (see also Figure 2). The prediction for the current time step is then made from a pre-output layer t_i. We do not apply a max-out layer and directly obtain the output by o_i = W_o t_i. For the encoder-decoder without attention, the prediction is made directly from decoder state s_i. We vary the recurrent cell, experimenting with simple RNN (Elman, 1990), GRU (Cho et al., 2014), and LSTM cells (Hochreiter and Schmidhuber, 1997). For conciseness we only report results with RNN and GRU cells in the main text, and LSTM results in the appendix.
In this paper, we investigate the properties of both SCAN and NACS using RNN-based sequence-to-sequence models for evaluation. We leave further investigation of alternative architectures (see, e.g., Vaswani et al., 2017;Gehring et al., 2017;Chen et al., 2018) for the future.

Settings
Our models are implemented in PyTorch and trained using mini-batch SGD with an initial learning rate of 0.2, decayed by 0.96 each epoch. We use a batch size of 32, 256 hidden units (64 for embeddings), and a dropout rate of 0.2. We test on all SCAN/NACS tasks, as well as on the Fr-En Machine Translation (MT) task that L&B used. The reported results are averaged over three runs for each experiment. Models with attention are marked as such with +Attn, e.g. 'GRU +Attn'.
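As a small illustration of the learning rate schedule, the sketch below computes the rate used in a given epoch. It assumes that epochs are counted from zero and that the decay is applied once at the end of every epoch; the function name and the settings dictionary are hypothetical.

```python
# Hyperparameters as stated in the text; the dictionary itself is illustrative.
SETTINGS = {
    "initial_lr": 0.2,
    "lr_decay": 0.96,
    "batch_size": 32,
    "hidden_size": 256,
    "embedding_size": 64,
    "dropout": 0.2,
}


def learning_rate(epoch: int,
                  initial_lr: float = SETTINGS["initial_lr"],
                  decay: float = SETTINGS["lr_decay"]) -> float:
    """Learning rate used during `epoch` (0-based) under per-epoch decay."""
    return initial_lr * decay ** epoch


for epoch in (0, 1, 10):
    print(epoch, learning_rate(epoch))
```

In PyTorch, an exponential schedule of this kind could be obtained with `torch.optim.lr_scheduler.ExponentialLR` and `gamma=0.96`, stepped once per epoch.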
Table caption (partially recovered): L&B best is the best score reported by L&B for each task, and L&B best overall is the score of their best-scoring model when all tasks are considered.
Validation Set. L&B split each SCAN subtask into a training set (80%) and a test set (20%). They train for a fixed number of updates (100k) and evaluate on the test set. Because any training run without early stopping may have missed the optimal solution (Caruana et al., 2001), we believe their results may not reflect the reality as closely as they could. We thus augment each of the SCAN variants with a validation set that follows the training distribution but contains examples that are not contained in the corresponding training set. This allows us to incorporate early stopping in our experiments so that they are better benchmarks for evaluating future progress. For each experiment we remove 10% of the training examples to be used as a validation set.
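The validation split and the model selection it enables can be sketched as follows. This is an illustration only: the random seed, the rounding of the validation size and the function names are assumptions, and early stopping is shown in its simplest form, as picking the epoch with the best validation accuracy.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train_valid(examples: Sequence[T],
                      valid_fraction: float = 0.1,
                      seed: int = 0) -> Tuple[List[T], List[T]]:
    """Hold out a random fraction of the training examples for validation."""
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable
    n_valid = int(round(len(examples) * valid_fraction))
    valid_indices = set(indices[:n_valid])
    train = [ex for i, ex in enumerate(examples) if i not in valid_indices]
    valid = [ex for i, ex in enumerate(examples) if i in valid_indices]
    return train, valid


def best_epoch(valid_accuracies: Sequence[float]) -> int:
    """Index of the first epoch that reaches the best validation accuracy."""
    return max(range(len(valid_accuracies)), key=lambda i: valid_accuracies[i])


train_set, valid_set = split_train_valid(list(range(100)))
print(len(train_set), len(valid_set))
```

The held-out examples come from the training distribution, so the test set of each task is never used for model selection.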
Accuracy. Following L&B we measure performance according to sequence-level accuracy, i.e., whether the generated sequence entirely matches the reference. This metric is also used for early stopping. For NACS, an output (command) is considered correct if its interpretation ('backmapping') produces the input action sequence.
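Both accuracy measures can be sketched in a few lines. The code below is an assumption-laden illustration rather than our evaluation script: `interpret` stands for any function that maps a command to its action sequence (a hypothetical parameter), and predicted commands that cannot be interpreted are counted as incorrect.

```python
from typing import Callable, List, Optional, Sequence


def sequence_accuracy(predictions: Sequence[Sequence[str]],
                      references: Sequence[Sequence[str]]) -> float:
    """Fraction of predictions that match their reference token for token."""
    correct = sum(list(p) == list(r) for p, r in zip(predictions, references))
    return correct / len(references)


def nacs_accuracy(predicted_commands: Sequence[str],
                  input_actions: Sequence[Sequence[str]],
                  interpret: Callable[[str], List[str]]) -> float:
    """Fraction of predicted commands whose interpretation is the input."""
    correct = 0
    for command, actions in zip(predicted_commands, input_actions):
        back: Optional[List[str]]
        try:
            back = list(interpret(command))  # back-mapping to actions
        except (KeyError, ValueError, IndexError):
            back = None  # assumption: uninterpretable commands are wrong
        correct += back == list(actions)
    return correct / len(input_actions)
```

With back-mapping, "walk and jump" and "jump after walk" are both accepted for the input WALK JUMP, whereas a token-level comparison against a single reference command would reject one of them.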
Ablations. To validate our analysis, we remove the connections from the previous target word embedding e_{y_{i-1}} to the decoder state and the pre-output layer (es and et in Figure 2), so that the current prediction is not informed by previous outputs. If our analysis in §3 is correct, then these simpler models should still be able to make the correct predictions on SCAN, but not on NACS.

Results and Analysis
Results on the three SCAN and NACS tasks are listed in Table 1. The full results including models with LSTM cells and MT experiments may be found in the supplementary materials. We will now discuss our observations.
SCAN is not enough. NACS is harder. NACS is a harder problem to solve compared to SCAN, as evidenced by consistently lower accuracies in Table 1 for all tasks. The discrepancy between SCAN and NACS performance is the most extreme when we look at the primitive tasks (turn left and jump). For turn left, the severed models (+Attn -Dep) obtain the highest scores on SCAN, but are the worst on NACS. The 'turn left' task benefits from LTURN occurring on the target-side in other contexts during training, which is not the case for 'jump'. Since there is no evidence in the training data that 'jump' is a verb, Table 2 shows results where additional (composed) 'jump' commands were added for training. We see that performance quickly goes up when adding more commands. Again here the simpler models (+Attn -Dep) perform better.
Machine Translation. We repeat L&B's English-French MT experiment for both directions. Table 3 shows that models that perform well on NACS also perform well here, with the GRU outperforming the other cells (see appendix for other cell types). In a setting similar to the jump task, the sentence pair 'I am daxy' ('je suis daxiste') was added to the training set. The goal is now to test if eight novel sentences that contain 'daxy' are correctly translated.
In our setting with mini-batching and early stopping, the GRU gets 70.8% (En-Fr) and 54.2% (Fr-En) of the daxy-sentences right, which is surprisingly good compared to L&B (12.5%).
Other observations. As expected, Table 1 shows that attention always helps. Generalizing to longer sequences is generally hard, and this remains an open problem.

Related Work
Ever since Fodor and Pylyshyn (1988) conjectured that neural networks are unable to show strong generalization, many attempts were made to show that the opposite is true, leading to inconclusive evidence. For example, Phillips (1998) found that feed-forward nets and RNNs do not always generalize to novel 2-tuples on an auto-association task, while Wong and Wang (2007) and Brakel and Frank (2009) found that RNNs can show systematic behavior in a language modeling task.
In the context of analyzing RNNs, Rodriguez and Wiles (1998) found that simple RNNs can develop a symbol-sensitive counting strategy for accepting a simple (palindrome) context-free language. Weiss et al. (2018) show that LSTMs and simple RNNs with ReLU-activation can learn to count unboundedly, in contrast to GRUs. Linzen et al. (2016) probed the sensitivity of LSTMs to hierarchical structure (not necessarily in novel constructions). Instead of a binary choice, with SCAN a sequence-to-sequence model productively generates an output string. Liska et al. (2018) found that a small number of identical RNNs trained with different initializations show compositional behavior on a function composition task, suggesting that more specific architectures may not be necessary.
Finally, Lake and Baroni (2018) introduced the SCAN data set to study systematic compositionality in recurrent sequence-to-sequence models, including gating mechanisms and attention. This work is a direct response to that and aims to facilitate future progress by showing that SCAN does not necessarily test for strong generalization.

Conclusion
In the quest for strong generalization, benchmarks measuring progress are an important component. The existing SCAN benchmark allows too simple models to shine, without the need for compositional generalization. We proposed NACS to remedy this. NACS still requires systematicity, while introducing stochasticity and strong dependencies on the target side. We argue that a good benchmark needs at least those properties, in order not to fall prey to trivial solutions, which do not work on more realistic use-cases for sequence-to-sequence models such as machine translation.

Table 7: NACS test scores for jump with additional composed commands.
[row label missing]: 0.0 ±0.0, 2.1 ±0.2, 3.7 ±0.9, 6.6 ±0.5, 12.5 ±2.5, 21.8 ±2.6, 34.2 ±1.7
LSTM +Attn-Dep: 0.0 ±0.0, 0.4 ±0.2, 0.9 ±0.1, 1.5 ±0.2, 1.9 ±0.3, 3.2 ±0.6, 7.4 ±0.9