Sequence-Level Mixed Sample Data Augmentation

Despite their empirical success, neural networks still have difficulty capturing compositional aspects of natural language. This work proposes a simple data augmentation approach to encourage compositional behavior in neural models for sequence-to-sequence problems. Our approach, SeqMix, creates new synthetic examples by softly combining input/output sequences from the training set. We connect this approach to existing techniques such as SwitchOut and word dropout, and show that these techniques are all approximating variants of a single objective. SeqMix consistently yields approximately 1.0 BLEU improvement on five different translation datasets over strong Transformer baselines. On tasks that require strong compositional generalization such as SCAN and semantic parsing, SeqMix also offers further improvements.


Introduction
Natural language is thought to be characterized by systematic compositionality (Fodor and Pylyshyn, 1988). A computational model that is able to exploit such systematic compositionality should understand sentences by appropriately recombining subparts that have not been seen together during training. Consider the following example from Andreas (2020):

(1a) She picks the wug up in Fresno.
(1b) He puts the cup down in Tempe.

Given the above sentences, a model which has learned compositional structure should be able to generalize and understand sentences such as:

(2a) She puts the wug down in Fresno.
(2b) She picks the wug up in Tempe.
In practice, neural models often overfit to long segments of text and fail to generalize compositionally.
This work proposes a simple data augmentation strategy for sequence-to-sequence learning, SeqMix, which creates soft synthetic examples by randomly combining parts of two sentences. This prevents models from memorizing long segments and encourages them to rely on compositions of subparts to predict the output. To motivate our approach, note that sentences such as (2a) and (2b) can be created by recombining parts of (1a) and (1b). Instead of enumerating all possible combinations of two sentences, SeqMix crafts a new example by softly mixing the two via a convex combination of the original examples. This approach can be seen as a sequence-level variant of a broader family of techniques called mixed sample data augmentation (MSDA), originally proposed as MixUp (Zhang et al., 2018) and shown to be particularly effective for classification tasks (DeVries and Taylor, 2017; Yun et al., 2019; Verma et al., 2019). While SeqMix shares similarities with word replacement/dropout strategies in machine translation (Sennrich et al., 2016; Wang et al., 2018; Gao et al., 2019), it targets a crude but simple approach to data augmentation for language applications. We apply SeqMix to a variety of sequence-to-sequence tasks including neural machine translation, semantic parsing, and SCAN (a dataset designed to test for the compositionality of data-driven models), and find that SeqMix improves results on top of (and when combined with) existing data augmentation methods.

Motivation and Related Work
While neural networks trained on large datasets have led to significant improvements across a wide range of NLP tasks, training them to generalize by learning the compositional structure of language remains a challenging open problem. Notably, Lake and Baroni (2018) propose an influential dataset (SCAN) to evaluate the systematic compositionality of neural models and find that they often fail to generalize compositionally.
One approach to encouraging compositional behavior in neural models is by incorporating compositional structures such as parse trees or programs directly into a network's computational graph (Socher et al., 2013;Dyer et al., 2016;Bowman et al., 2016;Andreas et al., 2016;Johnson et al., 2017). While effective on certain domains such as visual question answering, these approaches usually rely on intermediate structures predicted from pipelined models, which limits their applicability in general. Further, it is an open question as to whether such putatively compositional models result in significant empirical improvements on many NLP tasks (Shi et al., 2018).
Expressive parameterizations over high-dimensional input afforded by neural networks contribute to their excellent performance in high-resource settings; however, such flexible parameterizations can easily lead to a model memorizing (i.e., overfitting to) long segments of text, instead of relying on the appropriate subparts of segments. Another approach to encouraging compositionality in richly-parameterized neural models, then, is to augment the training data with more examples. Existing work in this vein includes SwitchOut (Wang et al., 2018), which replaces a word in a sentence with a random word from the vocabulary, GECA (Andreas, 2020), which creates new examples by switching subparts that occur in similar contexts, and TMix (Chen et al., 2020), which interpolates between hidden states of neural models for text classification. We compare our proposed approach to these methods in this paper.
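SwitchOut-style replacement and word dropout (a related baseline used later in this paper) admit compact sketches. The following is our own simplification for illustration: the function names and the i.i.d. per-token replacement probability p are our choices, whereas SwitchOut as proposed samples the number of swapped positions from a temperature-controlled distribution rather than independently per token.

```python
import random

def word_dropout(tokens, unk="<unk>", p=0.1, rng=random):
    # Replace each token with <unk> independently with probability p.
    return [unk if rng.random() < p else tok for tok in tokens]

def switchout_style(tokens, vocab, p=0.1, rng=random):
    # Replace each token with a uniformly sampled vocabulary word with
    # probability p (a simplification of SwitchOut's swap distribution).
    return [rng.choice(vocab) if rng.random() < p else tok for tok in tokens]
```

Both operate purely on the token sequence, which is what makes them cheap to apply on the fly during training.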

Method
Our proposed approach, SeqMix, is simple: it is essentially a sequence-level variant of MixUp (Zhang et al., 2018), which has primarily been used for image classification tasks (DeVries and Taylor, 2017; Yun et al., 2019). We first describe the generative data augmentation process behind this model for text generation, and then show how SeqMix approximates the resulting latent variable objective with a relaxed version.
Let X ∈ R^{s×V} represent a source sequence of length s with vocabulary size V (i.e., a sequence of one-hot vectors) and let Y ∈ R^{t×V} represent the target sequence of length t. Assume that we sample a pair of training examples (X, Y) and (X′, Y′) from the training set, ensuring that both have the same length (s = s′, t = t′) by padding or truncation. We then sample a binary mask m ∈ {0, 1}^{s+t} to decide which token to use at each position. Each element m_i is sampled i.i.d. from Bernoulli(λ), where the parameter λ is itself sampled from Beta(α, α) and α is a hyperparameter. This gives a mixed synthetic example:

    X̂_i = m_i X_i + (1 − m_i) X′_i,  i = 1, …, s,
    Ŷ_j = m_{s+j} Y_j + (1 − m_{s+j}) Y′_j,  j = 1, …, t.    (1)

The new example pair (X̂, Ŷ) will not correspond to natural sentences in general, but may contain valid subparts (phrases) that bias the model towards learning the compositional structure (as in the examples discussed in the introduction). Marginalizing over m gives the following log marginal likelihood,

    L(θ) = E_{(X,Y)∼D, (X′,Y′)∼D′} [ log Σ_m p_λ(m) p_θ(Ŷ | X̂) ],    (2)

where p_λ(m) = Π_{i=1}^{s+t} p_λ(m_i) and D, D′ are the example distributions. As exact marginalization in Eq. 2 is intractable, we could target a lower bound resulting from Jensen's inequality, estimated with Monte Carlo samples from p_λ(m),

    L(θ) ≥ E_{(X,Y)∼D, (X′,Y′)∼D′} E_{m∼p_λ(m)} [ log p_θ(Ŷ | X̂) ].    (3)

An alternative, which we refer to as SeqMix, is to consider a soft variant of the original objective by training on the expected samples,

    X̄ = E_m[X̂] = λX + (1 − λ)X′,    Ȳ = E_m[Ŷ] = λY + (1 − λ)Y′.    (4)

Letting f_θ(X̄, Ȳ_{<t}) be the output of the log-softmax layer at step t, SeqMix then trains on the objective

    L_SeqMix(θ) = E_{(X,Y)∼D, (X′,Y′)∼D′} E_{λ∼Beta(α,α)} [ Σ_t Ȳ_t^⊤ f_θ(X̄, Ȳ_{<t}) ],    (5)

i.e., a cross-entropy against the mixed target distribution at each position. To summarize, this results in a simple algorithm where we sample λ ∼ Beta(α, α) and train on these expected samples. 1

Table 1 shows that we can recover existing data augmentation methods such as SwitchOut and word dropout under the above framework. In particular, these methods approximate a version of the "hard" latent variable objective in Eq. 2 by considering different swap distributions p(m) and sampling distributions D′. 2 Compared to these approaches, SeqMix is essentially a relaxed variant of the same objective, similar to the difference between soft vs. hard attention (Xu et al., 2015; Deng et al., 2018; Wu et al., 2018; Shankar et al., 2018). SeqMix is also more efficient than more sophisticated augmentation strategies such as GECA, which requires a computationally expensive validation check for swaps.
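Both the hard (mask-sampling) and soft (expected-sample) variants can be sketched in a few lines of NumPy. This is our own illustrative implementation, not the released code: in real training the mixing happens over batched one-hot (or embedded) tensors inside the model, and the loss below is the mixed-target cross-entropy applied to the decoder's log-softmax output.

```python
import numpy as np

def sample_lambda(alpha, rng):
    # λ ~ Beta(α, α); small α pushes λ toward 0 or 1 (near-hard mixing).
    return rng.beta(alpha, alpha)

def hard_mix(X, Xp, lam, rng):
    # X, Xp: (length, vocab) one-hot arrays of equal shape.
    # A per-position Bernoulli(λ) mask picks each token from X or X'.
    m = rng.binomial(1, lam, size=(X.shape[0], 1))
    return m * X + (1 - m) * Xp

def soft_mix(X, Xp, lam):
    # Expected sample: a convex combination of the two one-hot sequences.
    return lam * X + (1 - lam) * Xp

def seqmix_loss(log_probs, Y, Yp, lam):
    # Cross-entropy against the mixed target λY + (1 − λ)Y', i.e.
    # -Σ_t mixed_target_t · log p_θ(·). log_probs: (t, vocab) log-softmax output.
    Y_bar = lam * Y + (1 - lam) * Yp
    return -(Y_bar * log_probs).sum()
```

Note that the soft loss decomposes as λ times the cross-entropy on Y plus (1 − λ) times the cross-entropy on Y′, which is how mixed-target objectives are commonly implemented.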

1 Our implementation can be found at https://github.com/dguo98/seqmix, and pseudocode can be found in the supplementary materials.
2 Wang et al. (2018) also offer an alternative formulation which unifies various data augmentation strategies as training on a distribution that better approximates the underlying data distribution. While the hard version of SeqMix can also be unified under SwitchOut's resulting objective, we chose our alternative formulation given its natural extension to the relaxed version.

Experimental Setup
We test our approach against existing baselines across a variety of sequence-to-sequence tasks: machine translation, SCAN, and semantic parsing. For all datasets, we tune the α hyperparameter in the range of [0.1, 1.5] on the validation set. 3 Exact details regarding the training setup (including descriptions of the various datasets) can be found in the supplementary materials.
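For intuition about this hyperparameter: Beta(α, α) is symmetric around 0.5 with variance 1/(4(2α + 1)), so small α concentrates λ near 0 or 1 (the mixed example mostly copies one parent), while larger α mixes the two examples more evenly. A quick empirical check (the specific α values here are illustrative, drawn from the tuning range above):

```python
import numpy as np

rng = np.random.default_rng(0)
for alpha in (0.1, 0.5, 1.5):
    lam = rng.beta(alpha, alpha, size=100_000)
    # Fraction of draws within 0.05 of an endpoint, i.e., near-hard mixes.
    extreme = (np.minimum(lam, 1 - lam) < 0.05).mean()
    print(f"alpha={alpha}: mean={lam.mean():.3f}, "
          f"var={lam.var():.4f}, near-hard={extreme:.2f}")
```

The sample variance closely tracks the closed form 1/(4(2α + 1)), so tuning α directly controls how aggressively examples are blended.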
SCAN SCAN is a command-execution dataset designed to test for the systematic compositionality of data-driven models. SCAN consists of simple English commands and corresponding action sequences. We consider three splits that have been widely used in the existing literature: jump, around-right, and turn-left. For the primitive splits (jump, turn-left), the primitive commands (i.e., "jump", "turn left") are only seen in isolation during training, and the test set consists of commands that compose the isolated primitive with the other commands seen during training. For the template split (around-right), training examples contain the commands "around" and "right" but never in combination. Following previous work (Andreas, 2020), we use a one-layer LSTM encoder-decoder model with a hidden size of 512 and an embedding size of 64.

Table 2: Experimental results on machine translation (BLEU), SCAN (accuracy), and semantic parsing on the GeoQuery SQL queries subset (accuracy). Note that we were unable to apply GECA to the translation datasets as it was too computationally expensive.
Semantic Parsing For semantic parsing, we consider the SQL queries subset of GeoQuery (Finegan-Dollak et al., 2018), which consists of 880 English questions paired with SQL commands. The standard question split ensures that no questions are repeated between the train and test sets, while the more challenging query split ensures that neither questions nor (anonymized) logical forms are repeated. Following Andreas (2020), we use the same model as for SCAN but additionally introduce a copy mechanism.

Results

Table 2 shows the results from SeqMix and the relevant baselines. On all datasets, SeqMix consistently improves over SwitchOut and word dropout (WordDrop). For machine translation, SeqMix achieves a gain of around 1 BLEU on IWSLT over strong baselines, and these gains persist on WMT, which is an order of magnitude larger. On SCAN and semantic parsing, SeqMix does not perform as well as GECA on its own but does well when combined with GECA.

Analysis on SCAN
We perform further analysis on the SCAN dataset, which is explicitly designed to test for compositional generalization. SeqMix can boost the performance on the turn-left split from 49% to 99%, in contrast to SwitchOut and WordDrop.

[Table 3 caption, partially recoverable: model-predicted outputs, shown with symbols for "turn right", "jump", "walk", and "look". To "jump right", one needs to first turn to the right and then jump.]
The fact that SeqMix can improve over simple regularization methods (such as WordDrop) even without GECA indicates that, despite its crudity, SeqMix is somewhat effective at biasing models to learn the appropriate compositional structure. However, the results on SCAN also highlight its limitations: SeqMix fails on the difficult around-right split, where the model has to learn to combine "around" with "right" even though they are never encountered together in training, and it does not outperform more sophisticated data augmentation strategies such as GECA (Andreas, 2020).
In Table 3, we show a qualitative example from the jump split of the SCAN dataset. Recall that the jump split is constructed to test the generalization of the primitive "jump" to novel contexts. Given training examples such as jump; walk; walk left; look after walk twice, the model demonstrates compositionality if it can correctly process test examples such as jump left; look after jump twice, i.e., generalize its understanding of the isolated "jump" to unseen combinations. As shown in Table 3, only SeqMix successfully exhibits this compositional generalization.
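The compositional behavior that the jump split probes can be made concrete with a toy interpreter for a simplified fragment of SCAN's grammar (the function and action-token names here are ours, not SCAN's exact output vocabulary). A model with the right inductive bias should behave like this interpreter on commands such as jump left, even though "jump" only ever appears alone during training.

```python
PRIMITIVES = {"jump": "JUMP", "walk": "WALK", "look": "LOOK", "run": "RUN"}

def interpret(cmd):
    # Toy semantics for a fragment of SCAN:
    #   "x after y" -> execute y, then x
    #   "x twice"   -> execute x two times
    #   "x left"    -> turn left, then x (likewise "right")
    words = cmd.split()
    if "after" in words:
        i = words.index("after")
        return interpret(" ".join(words[i + 1:])) + interpret(" ".join(words[:i]))
    if words[-1] == "twice":
        return interpret(" ".join(words[:-1])) * 2
    if words[-1] in ("left", "right"):
        turn = ["LTURN" if words[-1] == "left" else "RTURN"]
        return turn + interpret(" ".join(words[:-1]))
    return [PRIMITIVES[words[0]]]
```

For example, interpret("jump left") yields ["LTURN", "JUMP"], and interpret("look after jump twice") yields ["JUMP", "JUMP", "LOOK"]: the unseen combinations are fully determined by the rules for the parts.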

Conclusion
This paper presents SeqMix, a simple data augmentation strategy for sequence-to-sequence applications. Despite being a crude approximation to compositional phenomena in language, SeqMix proves effective on three different sequence-to-sequence tasks, including the challenging SCAN dataset, which is designed to test for compositional generalization. SeqMix is efficient and easy to implement, and as a secondary contribution, we provide a framework that unifies several data augmentation strategies for compositionality, which naturally suggests avenues for future research (e.g., a relaxed variant of GECA).