Compositional Generalization by Factorizing Alignment and Translation

Standard methods in deep learning for natural language processing fail to capture the compositional structure of human language that allows for systematic generalization outside of the training distribution. However, human learners readily generalize in this way, e.g. by applying known grammatical rules to novel words. Inspired by work in cognitive science suggesting a functional distinction between systems for syntactic and semantic processing, we implement a modification to an existing approach in neural machine translation, imposing an analogous separation between alignment and translation. The resulting architecture substantially outperforms standard recurrent networks on the SCAN dataset, a compositional generalization task, without any additional supervision. Our work suggests that learning to align and to translate in separate modules may be a useful heuristic for capturing compositional structure.


Introduction
A crucial property underlying the expressive power of human language is its systematicity (Lake et al., 2017; Fodor and Pylyshyn, 1988): syntactic or grammatical rules allow arbitrary elements to be combined in novel ways, making the number of possible sentences in a language exponential in the number of its basic elements. Recent work has shown that standard deep learning methods in natural language processing fail to capture this important property: when tested on unseen combinations of known elements, standard models fail to generalize (Lake and Baroni, 2018; Loula et al., 2018; Bastings et al., 2018). It has been suggested that this failure represents a major deficiency of current deep learning models, especially when they are compared to human learners (Marcus, 2018; Lake et al., 2017, 2019). From a statistical-learning perspective, this failure is quite natural. The neural networks trained on compositional generalization tasks fail to generalize because they have memorized biases that do indeed exist in the training set. These tasks require networks to make an out-of-domain (o.o.d.) extrapolation (Marcus, 2018), rather than merely interpolate according to the assumption that training and testing data are independent and identically distributed (i.i.d.). To the extent that humans can perform well on certain kinds of o.o.d. tests, they must be utilizing inductive biases that are lacking in current deep learning models (Battaglia et al., 2018).
It has long been suggested that the human capacity for systematic generalization is linked to mechanisms for processing syntax, and their functional separation from the meanings of individual words (Chomsky, 1957;Fodor and Pylyshyn, 1988). In this work, we take inspiration from this idea and explore operationalizing it as an inductive bias in an existing neural network architecture.
First, we notice a connection between syntactic structure and the correct alignment of words in the source sequence to meanings in the target. In our model, alignment is accomplished with an attention mechanism (Bahdanau et al., 2015) that determines the relevance of each word in the source to the translation of the next word in the target. This process must take into account the syntactic structure of both sequences (e.g. if a verb was just translated, it would be important to know whether there is in the source sequence an adverb that modifies it). We reasoned that if alignment was separated from direct translation (analogous to a separation of syntax and the meanings of individual words), the difficult o.o.d. problem of composing known elements into a novel combination would be reduced to two easier i.i.d. problems, because the distributions of correct alignments and translations would be similar in training and testing data (see Figure 1).
We implemented this intuition by modifying an existing attention mechanism (Bahdanau et al., 2015), and call the resultant architecture Syntactic Attention to reflect the intuition that the attention mechanism used for alignment should operate primarily on syntactic information, which should be separated from the information relevant to translating individual words. We show that this modification achieves substantially improved compositional generalization performance over the original architecture on the SCAN dataset.

Syntactic Attention
The Syntactic Attention model improves the compositional generalization capability of an existing attention mechanism (Bahdanau et al., 2015) by separating two streams of information processing for alignment and translation (see Figure 2). We describe the mechanisms of this separation and the other details of the model below.

Factorizing alignment and translation
In the seq2seq problem, models must learn a mapping from arbitrary-length sequences of inputs $x = \{x_1, x_2, \ldots, x_{T_x}\}$ to arbitrary-length sequences of outputs $y = \{y_1, y_2, \ldots, y_{T_y}\}$: $p(y \mid x)$. The underlying assumption made by the Syntactic Attention architecture is that the dependence of target words on the input sequence can be separated into two independent factors. One factor, $p(y_i \mid x_j)$, models the conditional distribution from individual words in the input to individual words in the target. Note that, unlike in the model of Bahdanau et al. (2015), these $x_j$ do not contain any information about the other words in the input sequence because they are not processed with an RNN. The other factor, $p(j \to i \mid x, y_{1:i-1})$, models the conditional probability that word $j$ in the input is relevant to word $i$ in the target sequence, given the entire input sequence. This alignment is accomplished from encodings of the inputs produced by an RNN. The crucial architectural assumption, then, is that any temporal dependency between individual words in the input that can be captured by an RNN should only be relevant to their alignment to words in the target sequence, and not to the translation of individual words. This assumption will be made clearer in the model description below.

Encoder
The encoder produces two separate vector representations for each word in the input sequence. Unlike the previous attention model (Bahdanau et al., 2015), we separately extract the information that will be used for direct translation with a linear transformation: $m_j = W_m x_j$, where $W_m$ is a learned weight matrix that multiplies the one-hot encodings $\{x_1, \ldots, x_{T_x}\}$. Note that these representations do not contain any information about the other words in the sentence. As in the previous attention mechanism (Bahdanau et al., 2015), we use a bidirectional RNN (biRNN) to extract the information that will be used for alignment. The biRNN produces a vector for each word on the forward pass, $(\overrightarrow{h}_1, \ldots, \overrightarrow{h}_{T_x})$, and a vector for each word on the backward pass, $(\overleftarrow{h}_1, \ldots, \overleftarrow{h}_{T_x})$, which together yield an alignment representation $h_j$ for each word $x_j$. In all experiments, we used a bidirectional LSTM for this purpose. Note that $h_j$ encodes the context of the surrounding words in the sentence. Our motivation for doing this was to force the RNN in the encoder to rely on the "role" the word is playing in the sentence. Note also that because there is no sequence information in the $m_j$, all of the information required to align the input sequence correctly (e.g. phrase structure, modifying relationships, etc.) must be encoded by the biRNN.
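The context-independence of the translation stream can be illustrated with a minimal Python sketch that applies a weight matrix to one-hot word encodings. The toy vocabulary, the matrix values, and the helper names (`one_hot`, `matvec`, `translation_stream`) are assumptions made for illustration only; the model described here learned its weights in PyTorch and used much larger vectors.

```python
# Toy vocabulary and weight matrix; both are hypothetical values.
VOCAB = {"jump": 0, "walk": 1, "twice": 2}
W_M = [
    [0.1, 0.4, 0.7],
    [0.2, 0.5, 0.8],
]  # shape: (dimension of m_j) x (vocabulary size)


def one_hot(index, size):
    """Return a one-hot list with 1.0 at position `index`."""
    return [1.0 if k == index else 0.0 for k in range(size)]


def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def translation_stream(words):
    """Compute m_j = W_m x_j independently for every word."""
    return [matvec(W_M, one_hot(VOCAB[w], len(VOCAB))) for w in words]


first = translation_stream(["jump", "twice"])
second = translation_stream(["walk", "jump"])
# "jump" receives the same vector regardless of position or neighbours.
print(first[0] == second[1])
```

Because each $m_j$ is just a column of the weight matrix selected by the one-hot vector, no information about word order can reach the translation stream in this sketch.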

Decoder
The decoder models the conditional probability of each target word given the input and the previous targets: $p(y_i \mid y_1, y_2, \ldots, y_{i-1}, x)$, where $y_i$ is a target and $x$ is the whole input sequence. As in the previous model, we use an RNN to determine an attention distribution over the inputs at each time step (i.e. to align words in the input to the current target). However, our decoder diverges from this model in that the mapping from inputs to outputs is performed from a weighted average of the $m_j$:
$$p(y_i \mid y_1, y_2, \ldots, y_{i-1}, x) = f(d_i), \qquad d_i = \sum_{j=1}^{T_x} \alpha_{ij} m_j,$$
where $f$ is parameterized by a linear function with a softmax, and the $\alpha_{ij}$ are the weights determined by the attention model. The attention weights are computed by a function measuring how well the input representations $h_j$ align with the current hidden state of the decoder RNN, $s_i$:
$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}, \qquad e_{ij} = a(s_i, h_j),$$
where $e_{ij}$ can be thought of as measuring the importance of a given input word $x_j$ to the current target word $y_i$, and $s_i$ is the current hidden state of the decoder RNN. Bahdanau et al. (2015) model the function $a$ with a feedforward network, but we choose to use a simple dot product: $a(s_i, h_j) = s_i \cdot h_j$. Finally, the hidden state of the RNN is updated with the same weighted combination of the $h_j$:
$$s_{i+1} = g(s_i, c_i), \qquad c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j,$$
where $g$ is the decoder RNN, $s_i$ is the current hidden state, and $c_i$ can be thought of as the information in the attended words that can be used to determine what to attend to on the next time step. Again, in all experiments an LSTM was used.
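A single decoding step of this factorized attention can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the vectors are toy values, the function names are hypothetical, and the symbol `d_i` is used here for the attention-weighted average of the $m_j$. The decoder RNN $g$ and the output function $f$ are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product, used as the alignment function a(s_i, h_j)."""
    return sum(a * b for a, b in zip(u, v))


def weighted_sum(weights, vectors):
    """Return sum_j weights[j] * vectors[j]."""
    dim = len(vectors[0])
    return [sum(w * vec[k] for w, vec in zip(weights, vectors)) for k in range(dim)]


def attention_step(s_i, h, m):
    """Compute attention weights from h, then average m and h with them."""
    alpha = softmax([dot(s_i, h_j) for h_j in h])
    d_i = weighted_sum(alpha, m)  # input to the output function f (translation)
    c_i = weighted_sum(alpha, h)  # input to the decoder RNN (alignment)
    return alpha, d_i, c_i


# Toy alignment vectors h_j and translation vectors m_j (hypothetical).
h = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
m = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
alpha, d_i, c_i = attention_step([4.0, 0.0], h, m)
print(alpha, d_i, c_i)
```

In this sketch the $m_j$ never influence the attention weights, and the $h_j$ reach the output only through those weights, which is the separation the architecture is meant to enforce.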

SCAN dataset
The SCAN dataset was specifically designed to test compositional generalization (details can be found in the appendix, or in Lake and Baroni, 2018). It is composed of 20,910 sequences of commands that must be mapped to sequences of actions, and is generated from a simple finite phrase-structure grammar that includes elements such as adverbs and conjunctions. The splits of the dataset include: 1) Simple split, where training and testing data are split randomly, 2) Length split, where training includes only shorter sequences, and 3) Add primitive split, where a primitive command (e.g. "turn left" or "jump") is held out of the training set, except in its most basic form (e.g. "jump" → JUMP). Here we focus on the most difficult problem in the SCAN dataset, the add-jump split, where "jump" is held out of the training set.
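The logic of an add-primitive split can be sketched as a simple filter over command/action pairs. The sketch below assumes a single-word primitive and uses a handful of hypothetical toy pairs written in the style of SCAN; it is not the procedure used to generate the published splits, and the function name `add_primitive_split` is an illustrative choice.

```python
def add_primitive_split(pairs, primitive="jump"):
    """Keep `primitive` in training only as a bare command.

    Any command that combines the primitive with other words goes to test.
    """
    train, test = [], []
    for command, actions in pairs:
        words = command.split()
        composed = primitive in words and words != [primitive]
        (test if composed else train).append((command, actions))
    return train, test


# Hypothetical toy pairs in the style of SCAN.
pairs = [
    ("jump", "JUMP"),
    ("walk", "WALK"),
    ("walk twice", "WALK WALK"),
    ("jump twice", "JUMP JUMP"),
    ("walk and jump", "WALK JUMP"),
]
train, test = add_primitive_split(pairs)
print(train)
print(test)
```

A two-word primitive such as "turn left" would need phrase matching rather than the single-word membership test used here.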

Implementation details
Experimental procedure is described in detail in the appendix. Training and testing sets were kept as they were in the original dataset, but following Bastings et al. (2018), we used early stopping by validating on a 20% held-out sample of the training set. All reported results are from runs of 200,000 iterations with a batch size of 1. Unless stated otherwise, each architecture was trained 5 times with different random seeds for initialization, to measure variability in results. All experiments were implemented in PyTorch. Details of the hyperparameter search are given in the appendix. Our best model used LSTMs, with 2 layers and 200 hidden units in the encoder, and 1 layer and 400 hidden units in the decoder, and 120-dimensional vectors for the $m_j$. The model included a dropout rate of 0.5, and was optimized using an Adam optimizer (Kingma and Ba, 2015) with a learning rate of 0.001.
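The validation procedure can be sketched as follows: hold out a fraction of the training examples, then keep the checkpoint with the best validation accuracy. This is a minimal sketch under assumptions; the seed, the rounding rule, the tie-breaking rule, and the function names are illustrative choices that are not specified in the text.

```python
import random


def hold_out_validation(train_examples, fraction=0.2, seed=0):
    """Randomly move `fraction` of the training examples into a validation set."""
    rng = random.Random(seed)
    shuffled = list(train_examples)
    rng.shuffle(shuffled)
    n_val = int(round(fraction * len(shuffled)))
    return shuffled[n_val:], shuffled[:n_val]


def best_checkpoint(val_accuracies):
    """Index of the checkpoint with the highest validation accuracy.

    Ties are broken in favour of the earliest checkpoint (an assumption).
    """
    return max(range(len(val_accuracies)), key=lambda i: (val_accuracies[i], -i))


train_part, val_part = hold_out_validation(range(100))
print(len(train_part), len(val_part))
print(best_checkpoint([0.5, 0.9, 0.9, 0.7]))
```

Note that the held-out sample is drawn from the training distribution, so it contains no composed "jump" commands in the add-jump split; early stopping therefore does not use any information from the test set.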

Compositional generalization results
The Syntactic Attention model achieves high compositional generalization performance on the standard seq2seq SCAN dataset (see Table 1). The table shows results (mean test accuracy (%) ± standard deviation) on the test splits of the dataset. Syntactic Attention is compared to the previous models, which were a CNN (Dessì and Baroni, 2019), GRUs augmented with an attention mechanism ("+ attn"), which either included or did not include a dependency ("-dep") in the decoder on the previous action (Bastings et al., 2018), and the recent model of Li et al. (2019).
Lake (2019) showed that a meta-learning architecture using an external memory achieves 99.95% accuracy on a meta-seq2seq version of the SCAN task. In this version, models are trained to learn how to generalize compositionally across a number of variants of a compositional seq2seq problem. Here, we focus on the standard seq2seq version, which limits the model to one training episode.
The best model from the hyperparameter search showed strong compositional generalization performance, attaining a mean accuracy of 91.1% (median = 98.5%) on the test set of the add-jump split. However, as in Dessì and Baroni (2019), we found that our model showed variance across initialization seeds (see appendix for details). For this reason, we ran the best model 25 times on the add-jump split to get a more accurate assessment of performance. These results were highly skewed, with a mean accuracy of 78.4% but a median of 91.0% (see appendix for detailed results). Overall, this represents an improvement in compositional generalization performance compared to the original attention mechanism (Bahdanau et al., 2015; Bastings et al., 2018), and rivals the recent results from Li et al. (2019).

Additional SCAN experiments
We hypothesized that a key feature of our architecture was that an RNN was used to encode the information in the input sequence relevant to alignment, while one was not used to encode the information relevant to translation. To test this hypothesis, we conducted two more experiments:
1. RNN for translation-encoding. An additional biLSTM was used to process the input sequence: $m_j = [\overrightarrow{m}_j; \overleftarrow{m}_j]$, where $\overrightarrow{m}_j$ and $\overleftarrow{m}_j$ are the vectors produced for the source word $x_j$ by a biLSTM on the forward and backward passes, respectively. These $m_j$ replace those generated by the simple linear layer in the Syntactic Attention model.
2. $c_i$ used for translation. Sequential information from the encoder RNN (i.e. the $c_i$) was allowed to directly influence the output at each time step in the decoder: $p(y_i \mid y_1, y_2, \ldots, y_{i-1}, x) = f([d_i; c_i])$, where $d_i$ is the attention-weighted average of the $m_j$ and again $f$ is parameterized with a linear function and a softmax output nonlinearity.
The results of the additional experiments (mean test accuracy (%) ± standard deviations) are shown in Table 2. These results partially confirmed our hypothesis: performance on the add-jump test set was worse when encodings from an RNN were directly used for translation. However, when sequential information from the biLSTM encoder was used as an additional input in the final production of actions, the model maintained good compositional generalization performance. We hypothesize that this was because in this setup, it was easier for the model to learn to use the $m_j$ to directly translate actions, so it largely ignored the sequential information. This experiment suggests that the factorization between alignment and translation does not have to be perfectly strict, as long as nonsequential representations are available for direct translation.
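The second variant can be sketched by concatenating the attention-weighted translation vector (called `d_i` here) with the context vector `c_i` before the linear-softmax output layer. The weights below are hypothetical toy values chosen to illustrate one way a model could ignore the sequential input: if the weights on the `c_i` entries are zero, the output distribution is identical to that of the baseline. This is an illustration of the hypothesis stated above, not a claim about the weights that were actually learned.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, vector):
    """Affine map: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(weights, bias)]


def output_distribution(weights, bias, d_i, c_i=None):
    """f(d_i) for the baseline, or f([d_i; c_i]) when c_i is given."""
    features = d_i if c_i is None else d_i + c_i  # list concatenation
    return softmax(linear(weights, bias, features))


d_i = [0.5, -1.0]  # hypothetical translation vector
c_i = [2.0, 3.0]   # hypothetical context vector
bias = [0.0, 0.0, 0.0]
w_base = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Same weights on d_i, zero weights on c_i.
w_concat = [row + [0.0, 0.0] for row in w_base]

p_base = output_distribution(w_base, bias, d_i)
p_concat = output_distribution(w_concat, bias, d_i, c_i)
print(p_base)
print(p_concat)
```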
Additional results, including on other SCAN splits and analyses of the attention distributions, can be found in the appendix.

Machine translation experiments
Although the purpose of this work was to study the inductive biases that might encourage compositional generalization, we also validated our architecture on a small machine translation dataset to obtain a basic measure of its efficacy in a more naturalistic setting. The dataset (Lake and Baroni, 2018; Bastings et al., 2018) is composed of 10,000 English/French sentence pairs in the training set and 1,190 pairs in the test set. We trained and tested our existing model without making any changes, except for adjusting the learning rate. We also ran the same experiment with the architecture described above that used $c_i$ for translation, as this architecture also showed strong compositional generalization performance on SCAN. BLEU scores on the test set for the best learning rate (0.00015 for both models) are shown in the table below, with comparison to previously reported results using basic recurrent architectures. Our model performs comparably in neural MT, validating it in a more naturalistic setting.

Related work
The principle of compositionality has recently regained the attention of deep learning researchers (Bahdanau et al., 2019b,a; Lake et al., 2017; Lake and Baroni, 2018; Battaglia et al., 2018; Johnson et al., 2017; Keysers et al., 2020). In particular, the issue has been explored in the visual-question answering (VQA) setting (Andreas et al., 2016; Hudson and Manning, 2018; Johnson et al., 2017; Perez et al., 2018; Hu et al., 2017). Many of the successful models in this setting rely on hand-coded operations (Andreas et al., 2016; Hu et al., 2017), use highly specialized components (Hudson and Manning, 2018), or use additional supervision (Hu et al., 2017). In contrast, our model uses standard recurrent networks and simply imposes the additional constraint that mechanisms for alignment and translation are separated. In the Compositional Attention Network, built for VQA, the representations used to encode images and questions are restricted to interact only through attention distributions (Hudson and Manning, 2018). Our model utilizes a similar restriction, reinforcing the idea that compositionality is enhanced when information from different modules is only allowed to interact through discrete probability distributions. Li et al. (2019) recently showed good performance on the SCAN tasks using a very similar approach. Our results lend additional support to the idea that separating alignment and translation can facilitate compositional generalization. The results from the meta-seq2seq version of the SCAN task (Lake, 2019) suggest that meta-learning may also be a viable approach to inducing compositionality in neural networks.

Table 1: Mean test accuracy (%) ± standard deviation on the SCAN test splits.
Model                                     Simple        Length       Add turn left   Add jump
GRU + attn (Bastings et al., 2018)        100.0 ± 0.0   18.1 ± 1.1   59.1 ± 16.8     12.5 ± 6.6
GRU + attn -dep (Bastings et al., 2018)   100.0 ± 0.0   17.8 ± 1.7   90.8 ± 3.6      0.7 ± 0.4
CNN (Dessì and Baroni, 2019)              100.0 ± 0.0   -            -               69.

We were inspired by work in cognitive science emphasizing the relationship between systematicity and syntax (Chomsky, 1957; Fodor and Pylyshyn, 1988). Others have explored similar ideas in different natural language tasks (Bastings et al., 2017, 2019; Chen et al., 2018; Havrylov et al., 2019; Strubell et al., 2018). This work supports the suggestion that intuitions from cognitive science can aid architecture design in deep learning.

Conclusion
In this work we attempted to operationalize an intuition from cognitive science, implementing it as an inductive bias in the form of a factorization between alignment and translation in the seq2seq setting. We showed that this can improve compositional generalization performance on the SCAN task, and that it does not degrade performance on a small MT task. We believe this factorization prevents the model from memorizing spurious correlations in the data, and note that similar ideas may be useful in other natural language tasks.

A SCAN dataset details
The SCAN dataset (Lake and Baroni, 2018) is composed of sequences of instructions that must be mapped to sequences of actions (see Figure 3). The instruction sequences are generated using the phrase-structure grammar described in Figure 4. This simple grammar is not recursive, and so can generate only a finite number of command sequences (20,910 total).
These commands are interpreted according to the rules shown in Figure 5. Although the grammar used to generate and interpret the commands is simple compared to any natural language, it captures the basic properties that are important for testing compositionality (e.g. modifying relationships, discrete grammatical roles, etc.). The add-primitive splits (described in main text) are meant to be analogous to the capacity of humans to generalize the usage of a novel verb (e.g. "dax") to many constructions (Lake and Baroni, 2018).

B Experimental procedure details
The cluster used for all experiments consists of 3 nodes, with 68 cores in total (48 times Intel(R) Xeon(R) CPU E5-2650 v4 at 2.20GHz, 20 times Intel(R) Xeon(R) CPU E5-2650 v3 at 2.30GHz), with 128GB of RAM each, connected through a 56Gbit InfiniBand network. It has 8 Pascal Titan X GPUs and runs Ubuntu 16.04.
All experiments were conducted with the SCAN dataset as it was originally published (Lake and Baroni, 2018). No data were excluded, and no preprocessing was done except to encode words in the input and action sequences into one-hot vectors, and to add special start-of-sequence and end-of-sequence tokens. Train and test sets were kept as they were in the original dataset, but following Bastings et al. (2018), we used early stopping by validating on a 20% held-out sample of the training set. All reported results are from runs of 200,000 iterations with a batch size of 1. Except for the additional batch of 25 runs for the add-jump split, each architecture was trained 5 times with different random seeds for initialization, to measure variability in results. All experiments were implemented in PyTorch.
Initial experimentation included different implementations of the assumption that syntactic information be separated from semantic information. After the architecture described in the main text showed promising results, a hyperparameter search was conducted to determine optimization (stochastic gradient descent vs. Adam), RNN-type (GRU vs. LSTM), regularizers (dropout, weight decay), and number of layers (1 vs. 2 layers for encoder and decoder RNNs). We found that the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 0.001, two layers in the encoder RNN and 1 layer in the decoder RNN, and dropout worked the best, so all further experiments used these specifications. Then, a grid-search was conducted to find the number of hidden units and dropout rate. We tried hidden dimensions ranging from 50 to 400, and dropout rates ranging from 0.0 to 0.5.
The best model used an LSTM with 2 layers and 200 hidden units in the encoder, an LSTM with 1 layer and 400 hidden units in the decoder, 120-dimensional $m_j$ vectors, and a dropout rate of 0.5. The results for this model are reported in the main text. All additional experiments were done with models derived from this one, with the same hyperparameter settings.
All evaluation runs are reported in the main text: for each evaluation except for the add-jump split, models were trained 5 times with different random seeds, and performance was measured with means and standard deviations of accuracy. For the add-jump split, we included 25 runs to get a more accurate assessment of performance. This revealed a strong skew in the distribution of results, so we included the median as the main measure of performance. Occasionally, the model did not train at all due to an unknown error (possibly very poor random initialization, high learning rate or numerical error). For this reason, we excluded runs in which training accuracy did not get above 10%. No other runs were excluded.
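The reporting procedure described above (exclude runs whose training accuracy never exceeds 10%, then report mean, standard deviation, and median) can be sketched as follows. The run records are hypothetical numbers invented to show how a single poor run pulls the mean well below the median; they are not results from the experiments. The use of the sample standard deviation is also an assumption.

```python
import statistics


def summarize_runs(runs, min_train_acc=10.0):
    """Drop runs that failed to train, then summarize test accuracy (%)."""
    kept = [r for r in runs if r["train_acc"] > min_train_acc]
    scores = [r["test_acc"] for r in kept]
    return {
        "n_kept": len(kept),
        "n_excluded": len(runs) - len(kept),
        "mean": statistics.mean(scores),
        "std": statistics.stdev(scores),  # sample standard deviation (assumed)
        "median": statistics.median(scores),
    }


# Hypothetical run records, for illustration only.
runs = [
    {"train_acc": 99.9, "test_acc": 99.0},
    {"train_acc": 99.8, "test_acc": 98.0},
    {"train_acc": 99.9, "test_acc": 97.0},
    {"train_acc": 99.7, "test_acc": 20.0},
    {"train_acc": 4.0, "test_acc": 0.0},  # did not train; excluded
]
summary = summarize_runs(runs)
print(summary)
```

With a distribution this skewed, the median describes the typical run better than the mean, which is the reason given above for reporting it.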

C Skew of add-jump results
As mentioned in the results section of the main text, we found that test accuracy on the add-jump split was variable and highly skewed. Figure 6 shows a histogram of these results (proportion correct). The model performs near-perfectly most of the time, but is also prone to catastrophic failures. This may be because, at least for our model, the add-jump split represents a highly nonlinear problem in the sense that slight differences in the way the primitive verb "jump" is encoded during training can lead to large differences in how the model performs on more complicated constructions. We recommend that future experiments with this kind of compositional generalization problem take note of this phenomenon, and conduct especially comprehensive analyses of variability in results. Future research will also be needed to better understand the factors that determine this variability, and whether it can be overcome with other priors or regularization techniques.

D Supplementary experiments

D.1 Testing nonlinear translation
Our main hypothesis is that the separation between sequential information used for alignment and information about the meanings of individual words encourages systematicity. The results reported in the main text are largely consistent with this hypothesis, as shown by the performance of the Syntactic Attention model on the compositional generalization tests of the SCAN dataset. However, it is possible that the simplicity of the translation stream in the model is also important for improving compositional generalization. To test this, we replaced the linear layer in this stream with a nonlinear neural network. From the model description in the main text:
$$p(y_i \mid y_1, y_2, \ldots, y_{i-1}, x) = f(d_i), \qquad d_i = \sum_{j=1}^{T_x} \alpha_{ij} m_j.$$
In the original model, $f$ was parameterized with a simple linear layer, but here we use a two-layer feedforward network with a ReLU nonlinearity, before a softmax is applied to generate a distribution over the possible actions. We tested this model on the add-primitive splits of the SCAN dataset. The results (mean (%) with standard deviations) are shown in Table 4, with comparison to the baseline Syntactic Attention model. The results show that this modification did not substantially degrade compositional generalization performance, suggesting that the success of the Syntactic Attention model does not depend on the parameterization of the translation stream with a simple linear function.
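The difference between the two parameterizations of the output function can be sketched as follows. The input vector (called `d_i` here, standing for the attention-weighted average of the translation vectors), the parameter values, the hidden size, and the function names are hypothetical; the sketch only shows where the ReLU layer is inserted before the softmax.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, vector):
    """Affine map: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(weights, bias)]


def relu(vector):
    """Elementwise rectified linear unit."""
    return [max(0.0, v) for v in vector]


def linear_f(params, d_i):
    """Baseline: a single linear layer followed by a softmax."""
    return softmax(linear(params["W"], params["b"], d_i))


def mlp_f(params, d_i):
    """Variant: two-layer feedforward network with a ReLU, then a softmax."""
    hidden = relu(linear(params["W1"], params["b1"], d_i))
    return softmax(linear(params["W2"], params["b2"], hidden))


d_i = [0.5, -1.0]  # hypothetical input vector
baseline = {"W": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], "b": [0.0, 0.0, 0.0]}
nonlinear = {
    "W1": [[1.0, -1.0], [-1.0, 1.0]],
    "b1": [0.0, 0.0],
    "W2": [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    "b2": [0.0, 0.0, 0.0],
}
p_linear = linear_f(baseline, d_i)
p_mlp = mlp_f(nonlinear, d_i)
print(p_linear)
print(p_mlp)
```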

D.2 Add-jump split with additional examples
The original SCAN dataset was published with compositional generalization splits that have more than one example of the held-out primitive verb (Lake and Baroni, 2018). The training sets in these splits of the dataset include 1, 2, 4, 8, 16, or 32 random samples of command sequences with the "jump" command, allowing for a more fine-grained measurement of the ability to generalize the usage of a primitive verb from few examples. For each number of "jump" commands included in the training set, five different random samples were taken to capture any variance in results due to the selection of particular commands to train on.
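The construction of these training sets can be sketched as repeated random sampling of composed "jump" commands. The published splits were provided with the dataset, so the code below is only an illustration of the sampling scheme: the seeding rule, the placeholder commands, and the function name are assumptions, and the sketch does not say how the corresponding test sets were formed.

```python
import random


def training_sets_with_jump_examples(base_train, jump_pairs,
                                     counts=(1, 2, 4, 8, 16, 32), n_samples=5):
    """Build one training set per (number of jump examples, sample id)."""
    sets = {}
    for k in counts:
        for sample_id in range(n_samples):
            # Arbitrary but reproducible seed for each sample (an assumption).
            rng = random.Random(1000 * k + sample_id)
            extra = rng.sample(jump_pairs, k)
            sets[(k, sample_id)] = list(base_train) + extra
    return sets


# Hypothetical placeholders standing in for real command/action pairs.
base_train = [("walk", "WALK"), ("jump", "JUMP")]
jump_pairs = [(f"composed jump command {n}", f"actions {n}") for n in range(40)]
sets = training_sets_with_jump_examples(base_train, jump_pairs)
print(len(sets), len(sets[(8, 0)]))
```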
Lake and Baroni (2018) found that their best model (an LSTM without an attention mechanism) did not generalize well (below 39%), even when it was trained on 8 random examples that included the "jump" command, but that the addition of further examples to the training set improved performance. Subsequent work showed better performance at lower numbers of "jump" examples, with GRUs augmented with an attention mechanism ("+ attn"), and either with or without a dependence in the decoder on the previous target ("-dep") (Bastings et al., 2018). Here, we compare the Syntactic Attention model to these results. The Syntactic Attention model shows a substantial improvement over these previous approaches (see Figure 7 and Table 5). Compositional generalization performance is already quite high at 1 example, and at 2 examples is almost perfect (99.997% correct).

D.3 Template splits
The compositional generalization splits of the SCAN dataset were originally designed to test for the ability to generalize known primitive verbs to valid unseen constructions (Lake and Baroni, 2018). Further work with SCAN augmented this set of tests to include compositional generalization based not on known verbs but on known templates (Loula et al., 2018). These template splits included the following (see Figure 8 for examples):
• Jump around right: All command sequences with the phrase "jump around right" are held out of the training set and subsequently tested.
• Primitive right: All command sequences containing primitive verbs modified by "right" are held out of the training set and subsequently tested.
• Primitive opposite right: All command sequences containing primitive verbs modified by "opposite right" are held out of the training set and subsequently tested.
• Primitive around right: All command sequences containing primitive verbs modified by "around right" are held out of the training set and subsequently tested.
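The four hold-out rules listed above can be sketched as predicates over command strings. This is a minimal illustration, assuming that the primitive verbs are "walk", "look", "run", and "jump" and that "modified by" means the modifier immediately follows the verb; the names `PRIMITIVES`, `HELD_OUT`, and `template_split` are hypothetical, and the published splits should be used for any actual comparison.

```python
PRIMITIVES = ("walk", "look", "run", "jump")  # assumed set of primitive verbs


def contains_phrase(command, phrase):
    """True if `phrase` occurs as a contiguous word sequence in `command`."""
    words, target = command.split(), phrase.split()
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))


def primitive_with(modifier):
    """Build a predicate matching any primitive verb followed by `modifier`."""
    def predicate(command):
        return any(contains_phrase(command, f"{verb} {modifier}") for verb in PRIMITIVES)
    return predicate


HELD_OUT = {
    "jump around right": lambda c: contains_phrase(c, "jump around right"),
    "primitive right": primitive_with("right"),
    "primitive opposite right": primitive_with("opposite right"),
    "primitive around right": primitive_with("around right"),
}


def template_split(commands, split_name):
    """Return (train, test) command lists for the named template split."""
    held_out = HELD_OUT[split_name]
    train = [c for c in commands if not held_out(c)]
    test = [c for c in commands if held_out(c)]
    return train, test


commands = [
    "walk right",
    "walk left",
    "jump around right",
    "run opposite right twice",
    "look around left",
]
for name in HELD_OUT:
    print(name, template_split(commands, name)[1])
```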
Results of the Syntactic Attention model on these template splits are compared to those originally published (Loula et al., 2018) in Table 6. The model, like the one reported by Loula et al. (2018), performs well on the jump around right split, consistent with the idea that this task does not present a problem for neural networks. The rest of the results are mixed: Syntactic Attention shows good compositional generalization performance on the Primitive right split, but fails on the Primitive opposite right and Primitive around right splits. All of the template tasks require models to generalize based on the symmetry between "left" and "right" in the dataset. However, in the opposite right and around right splits, this symmetry is substantially violated, as one of the two prepositional phrases in which they can occur is never seen with "right."

Figure 7: Compositional generalization performance on add-jump split with additional examples. Syntactic Attention model is compared to previously reported models (Bastings et al., 2018) on test accuracy as command sequences with "jump" are added to the training set. Mean accuracy (proportion correct) was computed with 5 different random samples of "jump" commands. Error bars represent standard deviations.

E Visualizing attention
Here, we visualize the attention distributions over the words in the command sequence at each step during the decoding process. In the following figures (Figures 9 to 14), the attention weights on each command (in the columns of the image) are shown for each of the model's outputs (in the rows of the image) for some illustrative examples. Darker blue indicates a higher weight. The examples are shown in pairs for a model trained and tested on the add-jump split, with one example drawn from the training set and a corresponding example drawn from the test set. Examples are shown in increasing complexity, with a failure mode depicted in Figure 14.
In general, it can be seen that although the attention distributions on the test examples are not exactly the same as those from the corresponding training examples, they are usually good enough for the model to produce the correct action sequence. This shows the model's ability to apply the same syntactic rules it learned on the other verbs to the novel verb "jump." In the example shown in Figure 14, the model fails to attend to the correct sequence of commands, resulting in an error.
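A layout of this kind (output actions in rows, command words in columns, darker cells for higher weights) can be approximated in plain text. The sketch below is an assumed stand-in for the image-based figures: the shading characters, the function name, and the attention weights are all hypothetical and are not taken from the trained model.

```python
SHADES = " .:*#"  # from lowest to highest attention weight


def render_attention(commands, actions, weights):
    """Return a text heat map: one row per action, one column per command word."""
    width = max(len(word) for word in commands)
    label = max(len(action) for action in actions)
    header = " " * label + " " + " ".join(word.rjust(width) for word in commands)
    lines = [header]
    for action, row in zip(actions, weights):
        cells = []
        for weight in row:
            # Map a weight in [0, 1] to one of the shading characters.
            level = min(int(weight * len(SHADES)), len(SHADES) - 1)
            cells.append(SHADES[level] * width)
        lines.append(action.ljust(label) + " " + " ".join(cells))
    return "\n".join(lines)


commands = ["jump", "twice"]
actions = ["JUMP", "JUMP"]
weights = [[0.9, 0.1], [0.7, 0.3]]  # hypothetical attention rows, each summing to 1
heatmap = render_attention(commands, actions, weights)
print(heatmap)
```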