Generalization in Instruction Following Systems

Understanding and executing natural language instructions in a grounded domain is one of the hallmarks of artificial intelligence. In this paper, we focus on instruction understanding in the blocks world domain and investigate the language understanding abilities of two top-performing systems for the task. We aim to understand if the test performance of these models indicates an understanding of the spatial domain and of the natural language instructions relative to it, or whether they merely over-fit spurious signals in the dataset. We formulate a set of expectations one might have from an instruction following model and concretely characterize the different dimensions of robustness such a model should possess. Despite decent test performance, we find that state-of-the-art models fall short of these expectations and are extremely brittle. We then propose a learning strategy that involves data augmentation and show through extensive experiments that the proposed learning strategy yields models that are competitive on the original test set while satisfying our expectations much better.


Introduction
Building agents that can understand and execute natural language instructions in a grounded environment is a hallmark of artificial intelligence (Winograd, 1972). This technology has wide applicability in navigation (Chen et al., 2019; Tellex et al., 2011; Chen and Mooney, 2011), collaborative building (Narayan-Chen et al., 2019), and several other areas (Li et al., 2020b; Branavan et al., 2009). The key challenge underlying these and many other applications is the need to understand the natural language instruction (to the extent that it is possible) and ground relevant parts of it in the environment. While the use of deep networks has led to significant progress on several benchmarks (Abiodun et al., 2018), an investigation into the instruction understanding capabilities of such systems remains lacking. (Our code is publicly available at: http://cogcomp.org/page/publication_view/936.) We do not know the extent to which these models truly understand the spatial relations in the environment, nor their robustness to variability in the environment or in the instructions. This understanding is also important from the viewpoint of safety-critical applications, where robustness to variability is essential. While robustness to input perturbations at test time has been studied in computer vision (Goodfellow et al., 2014) and in certain natural language tasks (Alzantot et al., 2018; Wallace et al., 2019; Shah et al., 2020), it remains relatively elusive for instruction following in a grounded environment. This can be attributed to the difficulty of characterizing the different expectations of robustness in this setting, due to the multiple channels of input involved, which semantically constrain each other.
The Blocks World domain is an ideal platform to study the abilities of a system to understand instructions (Winograd, 1972; Bisk et al., 2016; Narayan-Chen et al., 2019; Misra et al., 2017; Bisk et al., 2018; Mehta and Goldwasser, 2019; Tan and Bansal, 2018). Despite being seemingly simple, it presents key reasoning challenges, including compositional language understanding and spatial understanding, that need to be addressed in any instructional domain. In Bisk et al. (2016), the environment consists of a number of blocks placed on a flat board. The model is provided with the current configuration of blocks in the environment along with an instruction, and is tasked with executing the instruction by manipulating appropriate blocks. In this work, we follow the more challenging setting in Bisk et al. (2016) where the blocks are unlabeled, necessitating the use of involved referential expressions in the instructions. Fig. 1 shows that the instruction and block configuration are semantically dependent, jointly determining the outcome.
Despite the success of top-performing models (Tan and Bansal, 2018; Bisk et al., 2016) on the test set for this task, we question whether the models are able to reason about the complex language and spatial concepts of this task and generalize, or are merely overfitting to the test set. To investigate these questions, we formulate the following expectations one should have from an instruction following model: (1) Identity Invariance Expectation: The performance of the model on an input should not degrade when the input is slightly perturbed.
(2) Symmetry Equivariance Expectation: A symmetric transformation of an input should cause an equivalent transformation of model prediction and performance should not degrade.
(3) Length Invariance Expectation: The performance of a model should not depend on the length of the input, as long as the semantics is unchanged.
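These expectations lend themselves to simple programmatic checks. The sketch below is a toy illustration only: the centroid "model", the mirror transform, and the tolerances are our own assumptions, not the systems studied in this paper.

```python
import numpy as np

def mirror(blocks):
    """Reflect block coordinates across the y-axis (negate x)."""
    out = blocks.copy()
    out[:, 0] *= -1
    return out

# Toy stand-in "model": predicts the centroid of the blocks.
# A real instruction-following model also conditions on the instruction;
# this stand-in only illustrates the checks themselves.
def toy_model(blocks):
    return blocks.mean(axis=0)

blocks = np.array([[1.0, 0.0, 2.0], [3.0, 0.0, 4.0]])

# (1) Identity invariance: a tiny jitter should barely move the prediction.
jitter = 1e-4 * np.ones_like(blocks)
assert np.allclose(toy_model(blocks), toy_model(blocks + jitter), atol=1e-3)

# (2) Symmetry equivariance: mirroring the board should mirror the prediction.
assert np.allclose(mirror(toy_model(blocks)[None])[0],
                   toy_model(mirror(blocks)))

# (3) Length invariance: adding a distant distractor block should not change
# the prediction much. The centroid stand-in fails this check, much as the
# real models studied here do.
far = np.vstack([blocks, [[100.0, 0.0, 100.0]]])
assert not np.allclose(toy_model(blocks), toy_model(far), atol=1.0)
```

The point of the sketch is that each expectation is testable with a few lines of code, independent of the model's internals.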
Our expectations complement existing work in three dimensions: (1) is related to adversarial perturbations (Goodfellow et al., 2014), and (2) is related to the equivariance of CNNs explored in computer vision (Cohen and Welling, 2016); it is also related to contrast sets (Gardner et al., 2020; Li et al., 2020a) and counterfactual data augmentation (Kaushik et al., 2019). Here, we extend the investigation to the new task of instruction following, involving both natural language and an environment, discrete and continuous perturbations, and both regression and classification tasks. Expectation (3) is related to Lake and Baroni (2018), where vulnerability to input length in a toy sequence-to-sequence task was demonstrated. Here we show that length-based vulnerability exists in another modality, the number of blocks present on the board, for this much more complicated task.
While these form only a subset of the expectations one might have from an instruction following model, they already allow us to formally characterize some of the dimensions of robustness an instruction following agent must have. As an example, a tiny shift in the location of each block should not affect the model prediction (identity invariance). In Sec. 2, we formulate concrete perturbations to test whether a given model satisfies these expectations. The space of perturbations that we consider has the following attributes: (a) Semantic Preserving or Semantic Altering, (b) Linguistic or Geometric, (c) Discrete or Continuous. We find that both models studied suffer a large performance drop under each of the perturbations, and fall short of satisfying our expectations. We then present a data augmentation scheme designed to better address our expectations from such models. Our extensive experiments in Sec. 2.3 indicate that our learning strategy results in more robust models that perform much better on the perturbed test set while maintaining similar performance on the original test set.

[Figure 3: Relative Performance Degradation for the source (classification and regression) and target (regression) sub-tasks. ↑ (↓) denotes higher (lower) is better respectively. Here, SPP uses only one permutation; the degradation becomes more severe when consistency across a larger set of permutations is considered (Appendix A).]

Given an instruction I and world state W, the model f must predict which block to move (the source) and the target location to move it to. While the target output is always a location y ∈ R^3, for the source task the model can either predict a particular block y ∈ {1, 2, ..., 20} (Tan and Bansal, 2018) or a particular source location y ∈ R^3 (Bisk et al., 2016). Let P denote a perturbation space and (I′, W′) be the perturbed version of (I, W) under P. Note that (I′, W′) can be chosen randomly, or adversarially as the perturbation which maximizes the loss:

(I′, W′) = argmax_{(I′, W′) ∈ P(I, W)} ℓ(f(I′, W′), O)

Here ℓ denotes a loss function and O denotes the gold source/target location.
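When the perturbation space is finite, this adversarial selection reduces to a brute-force search over candidates. A minimal sketch, in which the candidate generator and the loss are placeholders we invent for illustration:

```python
def worst_case_perturbation(candidates, loss):
    """Return the perturbed pair (I', W') in the finite perturbation
    space P(I, W) that maximizes the loss."""
    return max(candidates, key=loss)

# Toy illustration: the world state is a scalar, the gold output is 1.0,
# and the loss is the absolute error of an identity "model".
candidates = [("move it left", 0.5),
              ("move it left", -0.5),
              ("move it left", 0.9)]
adv = worst_case_perturbation(candidates, lambda iw: abs(iw[1] - 1.0))
assert adv == ("move it left", -0.5)  # loss 1.5, the largest of the three
```

Any concrete perturbation type only needs to supply the candidate list; the selection logic stays the same.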
If the perturbation space is discrete and finite, we can simply search over all candidate (I′, W′) to find the one with the maximum loss. If it is continuous and infinite, we can use a first-order method (e.g., the Fast Gradient Sign Method, FGSM (Goodfellow et al., 2014)) to find the adversarial (I′, W′). We now characterize P. Broadly, we have the following two types of perturbations: (i) Semantics Preserving (SP): perturbations that, when applied to either I or W, do not change the meaning of either. Since the modified instruction I′ or world state W′ is semantically unchanged, the model should perform similarly on the perturbed input. Informally, we want f(I, W) ≈ f(I′, W′) since I ≈ I′ and W ≈ W′. SP perturbations can be of the following types:
• Linguistic (SPL): We substitute words in I with synonyms from a curated set C of equivalent concepts, and adversarially pick the substitution with the highest loss over all combinations of substitutions from the synonyms in C. Since the synonym sets are small, an explicit search over all candidate substitutions is possible, although the search space grows combinatorially with the number of elements of C in I.
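For the continuous case, one FGSM step perturbs the block coordinates in the direction of the sign of the loss gradient. A NumPy sketch follows; the caller supplies the gradient function (a real system would obtain it from the network via autodiff), and the quadratic toy loss is our own assumption:

```python
import numpy as np

def fgsm_step(blocks, grad_fn, eps=0.01):
    """One FGSM step (Goodfellow et al., 2014): move each coordinate
    by eps in the direction that increases the loss."""
    return blocks + eps * np.sign(grad_fn(blocks))

# Toy loss: (sum of coordinates - 4)^2, with its analytic gradient.
grad = lambda b: 2.0 * (b.sum() - 4.0) * np.ones_like(b)

blocks = np.zeros((2, 3))
adv = fgsm_step(blocks, grad, eps=0.01)
# Each coordinate moves by -eps, pushing the sum further from 4
# and hence increasing the loss.
assert np.allclose(adv, -0.01 * np.ones((2, 3)))
```

The single-step form keeps the perturbation tiny (bounded by eps per coordinate), which is exactly the "tiny shift" regime that identity invariance is meant to cover.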
• Geometric (SPG): These perturbations do not change the semantics of the board. Tiny changes in the block locations which preserve the overall semantics of W should not affect model predictions. We perturb each block location slightly in an adversarial direction w.r.t. W.
• Count (SPC): We identify distractor blocks which do not affect the meaning of the instruction (Fig. 2(b)); large distance from the source and target locations acts as a proxy for this. P comprises deleting k blocks, where k ∈ {0, 1, 2, ..., N} is chosen adversarially to generate W′. We set N = 3.
• Permutation (SPP): These are perturbations where the order in which the block locations are fed to the model is permuted: Π(B) = {b_Π(1), ..., b_Π(20)}. While semantically nothing changes in the input ((I′, W′) ≡ (I, Π(W)), where ≡ denotes semantic equivalence), we see that models still suffer a large performance drop, even for a random permutation Π.
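The SPP check can be scripted directly: permute the rows of the block matrix and compare predictions. In the sketch below, a set-level toy predictor (the centroid, our own stand-in for illustration) passes trivially; the sequence encoders studied in this paper do not, which is exactly the failure SPP exposes.

```python
import numpy as np

rng = np.random.default_rng(0)

def permute_blocks(blocks, perm):
    """Reorder the block slots fed to the model; the board itself is
    unchanged, so a robust model's prediction should be unchanged too."""
    return blocks[perm]

blocks = rng.normal(size=(20, 3))   # 20 blocks, (x, y, z) each
perm = rng.permutation(20)

# A set-level prediction (here, the centroid) is permutation invariant.
centroid = blocks.mean(axis=0)
assert np.allclose(centroid, permute_blocks(blocks, perm).mean(axis=0))
```

A model whose prediction changes under `permute_blocks` is reading the slot order as signal, not the geometry.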
(ii) Semantic Altering (SA): perturbations that change the meaning of the instruction or the world state, and hence the gold output, as in Fig. 2(a). A robust model should satisfy: if the error on f(I, W) is small, the error on f(I′, W′) with respect to the altered gold output should also be small.

Model Performance vs Our Expectations
We use the dataset from Bisk et al. (2016) and study the models from Bisk et al. (2016) and Tan and Bansal (2018). One important difference between the two models is that while both treat the target subtask T as a regression task (trained and evaluated using a normalized mean squared error called block distance, BD), Tan and Bansal (2018) treats the source subtask as a classification task S_cls (trained using cross-entropy loss and evaluated using classification accuracy, Acc.), while Bisk et al. (2016) treats it as a regression task S_reg (trained and evaluated using BD). We evaluate both models on the source-classification, source-regression, and target subtasks over the different perturbation spaces P, and observe large relative performance degradation in each case (Fig. 3).

Adversarial Data Augmentation
In this section we show that a simple data augmentation strategy improves model performance under robust evaluation on the perturbed test set. For each input (I, W) in the training data, we add another example which is adversarial:

(I′, W′) = argmax_{(I′, W′) ∈ P(I, W)} ℓ(f(I′, W′), O)

The perturbation set P used in training is the same one used for robust test evaluation. When P is continuous (e.g., SPG), we use the FGSM attack to solve this maximization and obtain (I′, W′). When P is discrete (e.g., SPL, SPC), we search over the perturbation space to find the perturbation with the highest loss. We train the model on a combined dataset consisting of both the original training set and the adversarially augmented data. This is an extension of Adversarial Training (Madry et al., 2017) to the instruction following task for (i) both discrete and continuous perturbations and (ii) both regression and classification tasks.
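For a discrete P, the augmentation step can be sketched as follows. The perturbation generator and loss below are placeholders of our own; for a continuous P, an FGSM step would replace the inner max-search.

```python
def augment_with_adversarial(train_set, perturb_space, loss):
    """For each (I, W, O), keep the original example and add the
    worst-case perturbation from the finite space P(I, W), paired
    with the same gold output O."""
    out = []
    for inst, world, gold in train_set:
        out.append((inst, world, gold))
        adv_inst, adv_world = max(perturb_space(inst, world),
                                  key=lambda iw: loss(iw, gold))
        out.append((adv_inst, adv_world, gold))
    return out

# Toy illustration: scalar world states and an absolute-error loss.
train = [("place it there", 0.0, 1.0)]
space = lambda i, w: [(i, w + 0.5), (i, w - 0.5)]
combined = augment_with_adversarial(train, space,
                                    lambda iw, g: abs(iw[1] - g))
assert combined == [("place it there", 0.0, 1.0),
                    ("place it there", -0.5, 1.0)]
```

Training then proceeds on `combined` exactly as on the original training set, with the adversarial copies doubling its size.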

Results
In this section we show the benefits of adversarially augmented robust training. Consider the models M_std from Bisk et al. (2016) and Tan and Bansal (2018), which were shown to perform poorly under robust evaluation in Sec. 2.1. Here we compare their performance with their robustly trained variants M_rob. For all models we perform standard evaluation and robust evaluation for each perturbation type. This is done for the source (classification and regression) and target sub-tasks. In Table 1 we show the results under the different settings, averaged over 5 runs. For every perturbation category and all sub-tasks, the robust models (i) outperform their standard counterparts in terms of the robust evaluation metric and (ii) in some cases even under standard evaluation. Thus, a knowledge-free robust training framework can produce models which are less brittle to perturbations while maintaining competitive standard performance on the original test set.

Conclusion
In this paper we formulated the performance expectations for an instruction following system. Based on these expectations, we created several categories of perturbations and showed that existing models fail spectacularly on them. We then demonstrated the benefits of adversarial data augmentation on each perturbation category.

A Appendix A: Additional Experiments
In this appendix, we present additional experiments to investigate the following claims:
• For the SPP perturbation, an even stricter evaluation that requires consistent predictions across a larger set of permutations of the block indices further degrades the performance of the existing models. Table 2 shows this for the case of two permutations per instance. In all cases, adversarial data augmentation helps improve performance under the robust evaluation metric.
• For the SPC perturbation, model performance degrades gradually as the number of removed distractor blocks (whose presence or absence does not affect the semantics of the instruction) is increased. Further, the addition of distractor blocks also leads to significant performance degradation, as shown in Table 3. In all cases, adversarial data augmentation helps improve performance under the robust evaluation metric and sometimes even under the standard evaluation metric.

[Table 3: Performance of the Tan and Bansal (2018) model (M_std) and its robust counterpart (M_rob) for the different perturbations (P): A(i) and R(i) denote the addition and removal of i blocks respectively. ↑ denotes higher is better.]