Adversarial Over-Sensitivity and Over-Stability Strategies for Dialogue Models

We present two categories of model-agnostic adversarial strategies that reveal the weaknesses of several generative, task-oriented dialogue models: Should-Not-Change strategies that evaluate over-sensitivity to small and semantics-preserving edits, as well as Should-Change strategies that test if a model is over-stable against subtle yet semantics-changing modifications. We next perform adversarial training with each strategy, employing a max-margin approach for negative generative examples. This not only makes the target dialogue model more robust to the adversarial inputs, but also helps it perform significantly better on the original inputs. Moreover, training on all strategies combined achieves further improvements, achieving a new state-of-the-art performance on the original task (also verified via human evaluation). In addition to adversarial training, we also address the robustness task at the model-level, by feeding it subword units as both inputs and outputs, and show that the resulting model is equally competitive, requires only 1/4 of the original vocabulary size, and is robust to one of the adversarial strategies (to which the original model is vulnerable) even without adversarial training.


Introduction
Adversarial evaluation aims at filling the gap caused by potential train/test distribution mismatch and revealing how models will perform on real-world inputs containing natural or malicious noise. Recently, there has been substantial work on adversarial attacks in computer vision and NLP. Unlike vision, where one can simply add imperceptible perturbations without changing an image's meaning, carrying out such subtle changes in text is harder since text is discrete in nature. We publicly release all our code and data at https://github.com/WolfNiu/AdversarialDialogue. Thus, some previous works have either avoided modifying original source inputs and only resorted to inserting distractive sentences (Jia and Liang, 2017), or have restricted themselves to introducing spelling errors (Belinkov and Bisk, 2018) and adding non-functioning tokens. Furthermore, there has been limited adversarial work on generative NLP tasks, e.g., dialogue generation, which is especially important because dialogue is a crucial component of real-world virtual assistants such as Alexa, Siri, and Google Home. It is also a challenging and worthwhile task to keep the output quality of a dialogue system stable, because a conversation usually involves multiple turns, and a small mistake in an early turn can cascade into bigger misunderstandings later on.
Motivated by this, we present a comprehensive adversarial study on dialogue models: we not only simulate imperfect inputs in the real world, but also launch intentionally malicious attacks on the models in order to assess them on both over-sensitivity and over-stability. Unlike most previous works, which exclusively focus on Should-Not-Change adversarial strategies (i.e., non-semantics-changing perturbations to the source sequence that should not change the response), we demonstrate that it is equally valuable to consider Should-Change strategies (i.e., semantics-changing, intentional perturbations to the source sequence that should change the response).
We investigate three state-of-the-art models on two task-oriented dialogue datasets. Concretely, we propose and evaluate five naturally motivated and increasingly complex Should-Not-Change and five Should-Change adversarial strategies on the VHRED (Variational Hierarchical Encoder-Decoder) model (Serban et al., 2017b) and the RL (Reinforcement Learning) model (Li et al., 2016) with the Ubuntu Dialogue Corpus (Lowe et al., 2015), and on the Dynamic Knowledge Graph Network with the Collaborative Communicating Agents (CoCoA) dataset (He et al., 2017).
On the Should-Not-Change side for the Ubuntu task, we introduce adversarial strategies of increasing linguistic-unit complexity: from shallow word-level errors, to phrase-level paraphrastic changes, and finally to syntactic perturbations. We first propose two rule-based perturbations to the source dialogue context, namely Random Swap (randomly transposing neighboring tokens) and Stopword Dropout (randomly removing stopwords). Next, we propose two data-level strategies that leverage existing parallel datasets in order to simulate more realistic, diverse noise: namely, Data-Level Paraphrasing (replacing words with their paraphrases) and Grammar Errors (e.g., changing a verb to the wrong tense). Finally, we employ Generative-Level Paraphrasing, where we adopt a neural model to automatically generate paraphrases of the source inputs. On the Should-Change side for the Ubuntu task, we propose the Add Negation strategy, which negates the root verb of the source input, and the Antonym strategy, which changes verbs, adjectives, or adverbs to their antonyms. As will be shown in Section 6, the above strategies are effective on the Ubuntu task, but not on the collaborative-style, database-dependent CoCoA task. Thus for the latter, we investigate additional Should-Change strategies, including Random Inputs (changing each word in the utterance to random ones), Random Inputs with Entities (like Random Inputs but leaving mentioned entities untouched), and Normal Inputs with Confusing Entities (replacing entities in an agent's utterance with distractive ones), to analyze where the model's robustness stems from.
To evaluate these strategies, we first show that (1) both VHRED and the RL model are vulnerable to most Should-Not-Change and all Should-Change strategies, and (2) DynoNet's robustness to Should-Change inputs shows that it does not pay any attention to natural language inputs other than the entities contained in them. Next, observing how our adversarial strategies 'successfully' fool the target models, we try to expose these models to such perturbation patterns early on, during training itself, where we feed adversarial input contexts paired with ground-truth targets as training data. Importantly, we realize this adversarial training via a maximum-likelihood loss for Should-Not-Change strategies, and via a max-margin loss for Should-Change strategies. We show that this adversarial training not only makes both VHRED and RL more robust to the adversarial data, but also improves their performance when evaluated on the original test set (verified via human evaluation). In addition, when we train VHRED on all of the perturbed data from each adversarial strategy together, the performance on the original task improves even further, achieving the state-of-the-art result by a significant margin (also verified via human evaluation).
Finally, we attempt to resolve the robustness issue directly at the model level (instead of via adversarial training) by feeding subword units derived from the Byte Pair Encoding (BPE) algorithm (Sennrich et al., 2016) to the VHRED model. We show that the resulting model not only reduces the vocabulary size by around 75% (and thus trains much faster) and obtains results comparable to the original VHRED, but is also naturally (i.e., without requiring adversarial training) robust to the Grammar Errors adversarial strategy.

Tasks and Models
For a comprehensive study of dialogue model robustness, we investigate both semi-task-based troubleshooting dialogue (the Ubuntu task) and the new, important paradigm of collaborative two-bot dialogue (the CoCoA task). The former focuses more on natural conversations, while the latter focuses more on the knowledge base. Consequently, the model trained on the latter tends to ignore the natural language context (as will be shown in Section 6.2) and hence requires a different set of adversarial strategies that can directly reveal this weakness (e.g., Random Inputs with Entities). Overall, adversarial strategies on Ubuntu and CoCoA reveal very different types of weaknesses of a dialogue model. We implement two models on the Ubuntu task and one on the CoCoA task, each achieving state-of-the-art results on its respective task. Note that although we employ these strong models as our testbeds for the proposed adversarial strategies, the strategies themselves are not specific to these two models.

Ubuntu Dialogue
Dataset and Task: The Ubuntu Dialogue Corpus (Lowe et al., 2015) contains 1 million 2-person, multi-turn dialogues extracted from Ubuntu chat logs, used to provide and receive technical support. We focus on the task of generating fluent, relevant, and goal-oriented responses. Evaluation Method: The model is evaluated on F1's for both activities (technical verbs, e.g., "download", "install") and entities (technical nouns, e.g., "root", "web"). These metrics are computed by mapping the ground-truth and model responses to their corresponding activity-entity representations using the automatic procedure described in Serban et al. (2017a), who found that F1 is "particularly suited for the goal-oriented Ubuntu Dialogue Corpus" based on manual inspection of the extracted activities and entities. We also conducted human studies on the dialogue quality of generated responses (see Section 5 for setup and Section 6.1 for results). Models: We reproduce the state-of-the-art Latent Variable Hierarchical Recurrent Encoder-Decoder (VHRED) model (Serban et al., 2017b), and a Deep Reinforcement Learning based generative model (Li et al., 2016). For the VHRED model, we apply an additive attention mechanism (Bahdanau et al., 2015) to the source sequence while keeping the remaining architecture unchanged. For the RL-based model, we adopt the mixed objective function (Paulus et al., 2018) and employ a novel reward: during training, for each source sequence S, we sample a response G on the decoder side, feed the encoder a random source sequence S_R drawn from the train set, and use −log P(G|S_R) as the reward. Intuitively, if S_R stands a high chance of generating G (which corresponds to a small reward), it is very likely that G is dull and generic.
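The reward just described can be sketched as follows. This is a hedged illustration, not the authors' implementation: the function names and the `logprob_fn(response, source)` interface (returning the sequence-level log-probability log P(response | source) under the current model) are our assumptions.

```python
import random

def dull_response_reward(logprob_fn, sampled_response, train_sources, rng=random):
    """Sketch of the novelty reward: score a sampled response G against a
    random, unrelated source S_R. If a random context still assigns G a high
    probability, G is likely dull/generic, so -log P(G|S_R) is small."""
    s_r = rng.choice(train_sources)
    # logprob_fn is assumed to return log P(response | source), summed over tokens
    return -logprob_fn(sampled_response, s_r)
```

A generic response like "i don't know" is probable under almost any context and thus earns a small reward, while a context-specific response is improbable under a random context and earns a large one.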

Collaborative Communicating Agents
Dataset and Task: The collaborative CoCoA dialogue task involves two agents that are asymmetrically primed with private Knowledge Bases (KBs), and that engage in a natural language conversation to find the unique entry shared by their two KBs. For a bot-bot chat on the CoCoA task, a bot is allowed one of two actions each turn: an UTTERANCE action, where it generates an utterance, or a SELECT action, where it chooses an entry from its KB. Note that each bot's SELECT action is visible to the other bot, and each bot is allowed to make multiple SELECT actions if its previous guess is wrong. Evaluation Method: One of the major metrics is Completion Rate, the percentage of dialogues in which the two bots successfully finish the task. Models: We focus on DynoNet, the best-performing model on the CoCoA task (He et al., 2017). It consists of a dynamic knowledge graph, a graph embedding over the entity nodes, and a Seq2seq-based utterance generator.

Adversarial Strategies on Ubuntu
For Ubuntu, we introduce adversarial strategies of increasing linguistic-unit complexity: from shallow word-level errors such as Random Swap and Stopword Dropout, to phrase-level paraphrastic changes, and finally to syntactic Grammar Errors.

Should-Not-Change Strategies
(1) Random Swap: Swapping adjacent words occurs often in the real world, e.g., transposition of words is one of the most frequent errors in manuscripts (Headlam, 1902; Marqués-Aguado, 2014); it is also frequently seen in blog posts. Thus, being robust to swapped adjacent words is useful for chatbots that take typed/written text as inputs (e.g., virtual customer support on an airline/bank website). Even in speech-based conversations, non-native speakers can accidentally swap words due to habits formed in their native language (e.g., SVO in English vs. SOV in Hindi, Japanese, and Korean). Inspired by this, we generate globally contiguous but locally "time-reversed" text, where positions of neighboring words are swapped (e.g., "I don't want you to go" to "I don't want to you go").
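A minimal sketch of this perturbation follows. The swap probability and the choice to never re-swap a token that was just moved are our assumptions; the paper does not specify these details.

```python
import random

def random_swap(tokens, swap_prob=0.2, seed=0):
    """Randomly transpose neighboring tokens (Should-Not-Change sketch)."""
    rng = random.Random(seed)
    tokens = list(tokens)
    i = 0
    while i < len(tokens) - 1:
        if rng.random() < swap_prob:
            tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
            i += 2  # skip ahead so a token is moved at most once
        else:
            i += 1
    return tokens
```

Note that the perturbation only reorders tokens; the multiset of words (and hence shallow lexical content) is preserved, which is why the response should not change.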
(2) Stopword Dropout: Stopwords are the most frequent words in a language. The 25 most commonly used words in the Oxford English Corpus make up one-third of all printed material in English, and these words consequently carry less information than other words in a sentence do. Inspired by this observation, we propose randomly dropping stopwords from the inputs (e.g., "Ben ate the carrot" to "Ben ate carrot").
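A sketch of this strategy, with a tiny illustrative stopword list (the paper does not specify its exact list, and the drop probability is our assumption):

```python
import random

# Tiny illustrative stopword list; a real list would be much larger.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "that"}

def stopword_dropout(tokens, drop_prob=0.5, seed=0):
    """Randomly remove stopwords from the input (Should-Not-Change sketch)."""
    rng = random.Random(seed)
    return [t for t in tokens
            if t.lower() not in STOPWORDS or rng.random() >= drop_prob]
```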
(3) Data-Level Paraphrasing: We repurpose PPDB 2.0 (Pavlick et al., 2015) and replace words and phrases in the original inputs with their paraphrases (e.g., "She bought a bike" to "She purchased a bicycle").
(4) Generative-Level Paraphrasing: Although Data-Level Paraphrasing provides us with semantics-preserving inputs most of the time, it still suffers from the fact that the validity of a paraphrase depends on the context, especially for words with multiple meanings. In addition, simply replacing word-by-word does not lead to new compositional sentence-level paraphrases, e.g., "How old are you" to "What's your age". We thus also experiment with generative-level paraphrasing, where we employ the Pointer-Generator Network (See et al., 2017) and train it on the recently published paraphrase dataset ParaNMT-5M (Wieting and Gimpel, 2017), which contains 5 million paraphrase pairs.
(5) Grammar Errors: We repurpose the AESW dataset (Daudaravicius, 2015), text extracted from 9,919 published journal articles with data before/after language editing. This dataset was originally used for training models that identify and correct grammar errors. Based on the corrections in the edits, we build a look-up table to replace each correct word/phrase with a wrong one (e.g., "He doesn't like cakes" to "He don't like cake").
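Mechanically, Data-Level Paraphrasing and Grammar Errors both reduce to table-driven substitution: one uses a paraphrase table (PPDB), the other a correct-to-erroneous table built from AESW edits. A hedged, word-level sketch (real phrase-level matching is more involved; the vocabulary filter mirrors the constraint, stated in Section 5, that adversarial tokens must stay within the original vocabulary):

```python
def table_substitute(tokens, lookup, vocab=None):
    """Replace each token via a substitution table (sketch).

    `lookup` maps a word to its replacement (paraphrase or grammatical error);
    a replacement is kept only if it is inside the original vocabulary."""
    out = []
    for t in tokens:
        r = lookup.get(t, t)
        if vocab is not None and r not in vocab:
            r = t  # discard out-of-vocabulary replacements
        out.append(r)
    return out
```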

Should-Change Strategies
(1) Add Negation: Suppose we add negation to the source sequence of some task-oriented model -from "I want some coffee" to "I don't want some coffee". A proper response to the first utterance could be "Sure, I will bring you some coffee", but for the second one, the model should do anything but bring some coffee. We thus assume that if we add negation to the root verb of each source sequence and the response is unchanged, the model must be ignoring important linguistic cues like negation. Hence this qualifies as a Should-Change strategy, i.e., if the model is robust, it should change the response.
(2) Antonym: We change words in utterances to their antonyms to apply more subtle meaning changes (e.g., "You need to install Ubuntu" to "You need to uninstall Ubuntu"). Note that Should-Change strategies may lead to contexts that do not correspond to any legitimate task completion action, but the purpose of such a strategy is to make sure that the model at least does not respond the same way as it responded to the original context; i.e., even for the no-action state, the model should respond with something different, like "Sorry, I cannot help with that." Our semantic similarity results in Table 4 capture this intuition directly.
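A toy sketch of Add Negation follows. In practice the paper negates the root verb, which requires a parser or POS tagger; the word lists and insertion rules below are purely illustrative assumptions.

```python
# Toy POS resources for illustration only; a real system would use a parser.
AUX = {"can", "will", "should", "do", "does", "did", "is", "are", "was", "were"}
VERBS = {"want", "need", "install", "like", "go", "have"}

def add_negation(tokens):
    """Insert a negation at the first auxiliary or verb (Should-Change sketch)."""
    for i, t in enumerate(tokens):
        if t.lower() in AUX:
            return tokens[:i + 1] + ["not"] + tokens[i + 1:]  # "can" -> "can not"
        if t.lower() in VERBS:
            return tokens[:i] + ["don't"] + tokens[i:]        # "want" -> "don't want"
    return tokens
```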

Adversarial Strategies on CoCoA
We applied all the above successful strategies used for the Ubuntu task to the UTTERANCE actions in a bot-bot-chat setting for the CoCoA task, but found that none of them was effective on DynoNet. This is surprising considering that the model's language generation module is a traditional Seq2seq model. This observation motivated us to perform the following analysis. The high performance of bot-bot chat may have stemmed from two sources: information revealed in an utterance, or entries directly disclosed by a SELECT action.
To investigate which part the model relies on more, we experiment with different Should-Change strategies which introduce obvious perturbations that have minimal word or semantic meaning overlap with the original source inputs: (1) Random Inputs: Turn both bots' utterances into random inputs. This aims at investigating how much the model depends on the SELECT action.
(2) Random Inputs with Entities: Replace each bot's utterance with random inputs, but keep the contained entities untouched. This further investigates how much the entities alone contribute to the final performance.
(3) Normal Inputs with Confusing Entities: Replace entities mentioned in bot A's utterances with entities that are present in bot B's KB but not in their shared entry (and vice versa). This aims at coaxing bot B into believing that the mentioned entities come from their shared entry. By intentionally making the utterances misleading, we expect DynoNet's performance to be lower; hence this qualifies as a Should-Change strategy.

Adversarial Training
To make a model robust to an adversarial strategy, a natural approach is to expose it to the same pattern of perturbation during training (i.e., adversarial training). This is achieved by feeding adversarial inputs as training data. For each strategy, we report results under three train/test combinations: (1) trained with normal inputs, tested on adversarial inputs (N-train + A-test), which evaluates whether the adversarial strategy is effective at fooling the model and exposing its robustness issues; (2) trained with adversarial inputs, tested on adversarial inputs (A-train + A-test), which next evaluates whether adversarial training made the model more robust to that adversarial attack; and (3) trained with adversarial inputs, tested on normal inputs (A-train + N-test), which finally evaluates whether the adversarial training also makes the model perform equally well or better on the original normal inputs. Note that (3) is important, because one should not make the model more robust to a strategy at the cost of lower performance on the original data; moreover, when (3) improves the performance on the original inputs, it means adversarial training successfully teaches the model to recognize and be robust to a certain type of noise, so that the model performs better when encountering similar patterns during inference. Also note that we use the perturbed train set for adversarial training and the perturbed test set for adversarial testing; there is thus no overlap between the two sets.

Adversarial Training for Should-Not-Change Strategies
For each Should-Not-Change strategy, we take an already trained model from a certain checkpoint (we do not train from scratch because each model, for each strategy, takes several days to converge), and train it on the adversarial inputs with maximum likelihood loss for K epochs (Belinkov and Bisk, 2018; Jia and Liang, 2017). By feeding "adversarial source sequence + ground-truth response" pairs as regular positive data, we teach the model that these pairs are also valid examples despite the added perturbations.

Adversarial Training for Should-Change Strategies
For Should-Change strategies, we want the F1's to be lower with adversarial inputs after adversarial training, since this shows that the model becomes sensitive to subtle yet semantics-changing perturbations. This cannot be achieved by naively training on the perturbed inputs with maximum likelihood loss, because the "perturbed source sequence + ground-truth response" pairs for Should-Change strategies are negative examples which we need to train the model to avoid generating. Inspired by Mao et al. (2016) and Yu et al. (2017), we instead use a linear combination of maximum likelihood loss and max-margin loss:

L = L_ML + α L_MM, with L_MM = Σ_i max(0, M − log P(t_i|s_i) + log P(t_i|a_i)),

where L_ML is the maximum likelihood loss, L_MM is the max-margin loss, α is the weight of the max-margin loss (set to 1.0 following Yu et al. (2017)), M is the margin (tuned to be 0.1), and t_i, s_i, and a_i are the target sequence, normal input, and adversarial input, respectively.
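As a concrete sketch, a combined max-margin objective of this kind can be computed from per-example sequence log-likelihoods. This is a simplification of the actual batched, token-level training code, and the function and argument names are ours:

```python
def should_change_loss(logp_t_given_s, logp_t_given_a, alpha=1.0, margin=0.1):
    """Combined maximum-likelihood + max-margin loss (sketch).

    For each example i:
      L_ML = -log P(t_i | s_i)            keep fitting the normal pair
      L_MM = max(0, margin
                    - log P(t_i | s_i)
                    + log P(t_i | a_i))   push the adversarial pair's likelihood
                                          below the normal pair's by `margin`
    Total loss is sum(L_ML) + alpha * sum(L_MM)."""
    l_ml = -sum(logp_t_given_s)
    l_mm = sum(max(0.0, margin - ls + la)
               for ls, la in zip(logp_t_given_s, logp_t_given_a))
    return l_ml + alpha * l_mm
```

The hinge term is zero once the adversarial input is already much less likely to generate the ground-truth response than the normal input is, so training pressure applies only where the model still treats the two inputs alike.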

Experimental Setup
In addition to the datasets, tasks, models, and evaluation methods introduced in Section 2, we present training details in this section (see Appendix for a comprehensive version). Models on Ubuntu: We implemented VHRED and Reranking-RL in TensorFlow (Abadi et al., 2016) and employed greedy search for inference. For the Grammar Errors strategy, we also add a heuristic where an inflected verb is replaced with its respective infinitive form, and a plural noun with its singular form. Note that for all strategies we only keep an adversarial token if it is within the original vocabulary set. Should-Change Strategies on Ubuntu: For Add Negation, we negate the first verb in each utterance. For Antonym, we modify the first verb, adjective, or adverb that has an antonym.
Human Evaluation: We also conducted human studies on MTurk to evaluate adversarial training (pairwise comparison for dialogue quality) and Generative-Level Paraphrasing (five-point Likert scale). The utterances were randomly shuffled to anonymize model identity, and we used US-located human evaluators with approval rate > 98% and at least 10,000 approved HITs. Results are presented in Section 6.1. Note that the human studies and automatic evaluation are complementary: while MTurk annotators are good at judging how natural and coherent a response is, they are usually not experts in the Ubuntu operating system's technical details, whereas the automatic evaluation focuses more on the technical side (i.e., whether key activities or entities are present in the response).
Model on CoCoA: We adopted the publicly available code from He et al. (2017) (https://worksheets.codalab.org/worksheets/0xc757f29f5c794e5eb7bfa8ca9c945573/), and used their already trained DynoNet model.

Adversarial Results on Ubuntu
Result Interpretation: For Tables 2 and 3 with Should-Not-Change strategies, lower is better in the first column (since a successful adversarial testing strategy will be effective at fooling the model), while higher is better in the second column (since successful adversarial training should bring the performance back up). For Should-Change strategies, the reverse holds: higher is better in the first column, because this shows that the model is not paying attention to important semantic changes in the source inputs (and is maintaining its original performance), while lower is better in the second column, since we want the model to be more sensitive to such changes after adversarial training. Lastly, in the third column, higher is better, since we want the adversarially trained model to perform better on the original source inputs.
Results on Should-Not-Change Strategies
Tables 2 and 3 present the adversarial results on F1 scores of all our strategies for VHRED and Reranking-RL, respectively. Table 2 shows that VHRED is robust to none of the Should-Not-Change strategies other than Random Swap, while Table 3 shows that Reranking-RL is robust to none of the Should-Not-Change strategies other than Stopword Dropout. For each effective strategy, at least one of the F1's decreases statistically significantly (we measure statistical significance via the bootstrap test (Noreen, 1989; Efron and Tibshirani, 1994) with 100K samples, and consider p < 0.05 significant) as compared to the same model fed with normal inputs. Next, all adversarial trainings on Should-Not-Change strategies not only make the model more robust to adversarial inputs (each A-train + A-test F1 is statistically significantly higher than that of N-train + A-test), but also make it perform better on normal inputs (each A-train + N-test F1 is statistically significantly higher than that of N-train + N-test, except for Grammar Errors' Activity F1). Motivated by the success of adversarial training on each strategy alone, we also experimented with training on all Should-Not-Change strategies combined, and obtained F1's statistically significantly higher than with any single strategy (the All Should-Not-Change row in Table 2), except that All Should-Not-Change's Entity F1 is statistically equal to that of Data-Level Paraphrasing, showing that these strategies are able to compensate for each other to further improve performance. An interesting strategy to note is Random Swap: although it is not itself effective as an adversarial strategy for VHRED, training on it does make the model perform better on normal inputs.

Table 4: Textual similarity of adversarial strategies on the VHRED and Reranking-RL models ("Cont." stands for "Context", and "Resp." stands for "Response").

Results on Should-Change Strategies
Tables 2 and 3 show that Add Negation and Antonym are both successful Should-Change strategies, because no change in N-train + A-test F1 is statistically significant compared to that of N-train + N-test, which shows that both models are ignoring the semantics-changing perturbations to the inputs. From the last two rows of the A-train + A-test column in each table, we also see that adversarial training successfully brings down both F1's (statistically significantly) for each model, showing that each model becomes more sensitive to the context change.
Semantic Similarity
In addition to F1, we also follow Serban et al. (2017a) and employ the cosine similarity between average embeddings of normal and adversarial inputs/responses to evaluate how much the inputs/responses change in semantic meaning (Table 4). This metric is useful in three ways. Firstly, by comparing the two columns of context similarity, we can get a general idea of how much change is perceived by each model. For example, we can see that Stopword Dropout leads to more evident changes from VHRED's perspective than from Reranking-RL's. This also agrees with the F1 results in Table 2, which show that Reranking-RL is much more robust to this strategy than VHRED is. The high context similarity of Should-Change strategies shows that although we have added "not" or replaced antonyms in every utterance of the source inputs, from the model's point of view the context has not changed much in meaning. Secondly, for each Should-Not-Change strategy, the cosine similarity of the context is much higher than that of the response, indicating that responses change more significantly in meaning than their corresponding contexts. Lastly, the high semantic similarity for Generative-Level Paraphrasing also partly shows that the Pointer-Generator model in general produces faithful paraphrases.

Human Evaluation
As introduced in Section 5, we performed two human studies, on adversarial training and Generative-Level Paraphrasing. For the first study, Table 5 indicates that models trained on each adversarial strategy (as well as on all Should-Not-Change strategies combined) indeed on average produced better responses, and mostly agrees with the adversarial training results in Table 2. Overall, we provide both human and F1 evaluations because they are complementary at judging naturalness/coherence vs. key Ubuntu technical activities/entities.

Table 7: VHRED output example before and after adversarial training on the Random Swap strategy.
N-Context: ... you could save your ubuntu files and reinstall Windows, then install ubuntu as a dual boot option eou eot aight buddy, so how do i get that **unknown** space back eou
A-Context (Random Swap): ... you could your save ubuntu and files Windows reinstall, then install ubuntu as dual a option boot eou eot aight buddy, so do how i that get space **unknown** back eou
NN-Response: you can use the Live CD, you can install Ubuntu on the same partition as the Windows partition eou
NA-Response: I am using ubuntu. eou
AA-Response: you can use Windows XP on the Windows partition, and then install Ubuntu on the same drive eou
For the second study, Table 6 shows that on average the generated paraphrase has roughly the same semantic meaning as the original utterance, but may sometimes miss some information. Its quality is also close to that of the ground-truth in the ParaNMT-5M dataset.
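The embedding-average similarity metric used in Table 4 can be sketched as below. The OOV-skipping and zero-vector fallback are our assumptions; `emb` stands in for whatever pretrained word-embedding table is used.

```python
import math

def avg_embedding_similarity(tokens_a, tokens_b, emb):
    """Cosine similarity between the average word embeddings of two token
    sequences (sketch). `emb` maps token -> vector; OOV tokens are skipped."""
    def avg(tokens):
        dim = len(next(iter(emb.values())))
        vecs = [emb[t] for t in tokens if t in emb]
        if not vecs:
            return [0.0] * dim
        return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]

    a, b = avg(tokens_a), avg(tokens_b)
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0
```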

Output Examples of Generated Responses
We present a selected example of generated responses before and after adversarial training on the Random Swap strategy with the VHRED model in Table 7 (more examples on all strategies with both models are in the Appendix). First of all, we can see that it is hard to differentiate between the original and the perturbed context (N-Context and A-Context) if one does not look very closely. For this reason, the model gets fooled by the adversarial strategy, i.e., after adversarial perturbation, the N-train + A-test response (NA-Response) is worse than that of N-train + N-test (NN-Response). However, after our adversarial training phase, the A-train + A-test response (AA-Response) becomes better again.

Adversarial Results on CoCoA
Table 8 shows the results of Should-Change strategies on DynoNet with the CoCoA task. The Random Inputs strategy shows that even without communication, the two bots are able to locate their shared entry 82% of the time by revealing their own KBs through the SELECT action. When we keep the mentioned entities untouched but randomize all other tokens, DynoNet actually achieves a state-of-the-art Completion Rate, indicating that the two agents pay zero attention to each other's utterances other than the entities contained in them. This is also why we did not apply Add Negation and Antonym to DynoNet: if Random Inputs does not work, these two strategies will also make no difference to the performance (in other words, Random Inputs subsumes the other two Should-Change strategies). We can also see that even with the Normal Inputs with Confusing Entities strategy, DynoNet is still able to finish the task 77% of the time, and with only slightly more turns. This again shows that the model mainly relies on the SELECT action to guess the shared entry.

Byte-Pair-Encoding VHRED
Although we have shown that adversarial training on most strategies makes the dialogue model more robust, generating such perturbed data is not always straightforward for diverse, complex strategies. For example, our data-level and generative-level strategies all leverage datasets that are not available for every language. We are thus motivated to also address the robustness task at the model level, and explore an extension to the VHRED model that makes it robust to Grammar Errors even without adversarial training. Model Description: We performed Byte Pair Encoding (BPE) (Sennrich et al., 2016) on the Ubuntu dataset. This algorithm encodes rare/unknown words as sequences of subword units, which helps segment words with the same lemma but different inflections (e.g., "showing" to "show + ing", and "cakes" to "cake + s"), making the model more likely to be robust to grammar errors such as verb tense or plural/singular noun confusion. We experimented with BPE using 5K merge operations, and obtained a vocabulary size of 5,121.
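To illustrate how BPE yields the inflection splits described above, here is a minimal, toy sketch of the algorithm. Real implementations (e.g., the subword-nmt toolkit used with Sennrich et al. (2016)) operate on frequency-weighted corpora with end-of-word markers, which we omit.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Merge every occurrence of an adjacent symbol pair into one symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def learn_bpe(words, num_merges):
    """Repeatedly merge the most frequent adjacent symbol pair (toy BPE)."""
    vocab = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for sym in vocab:
            for i in range(len(sym) - 1):
                pairs[(sym[i], sym[i + 1])] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = [merge_pair(sym, best) for sym in vocab]
    return merges

def apply_bpe(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols
```

On a toy corpus {"show", "showing", "shows"}, the shared stem "show" is merged into a single unit while the inflections stay as separate subwords, which is exactly the property that helps with verb-tense and plural/singular noise.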
Results: As shown in Table 9, BPE-VHRED achieved F1's of (5.99, 3.66), which is statistically equal to the (5.94, 3.52) obtained without BPE. To the best of our knowledge, we are the first to apply BPE to a generative dialogue task. Moreover, BPE-VHRED achieved (5.86, 3.54) on the Grammar Errors adversarial test set, which is statistically equal to its F1's when tested on normal data, indicating that BPE-VHRED is more robust to this adversarial strategy than VHRED is: the latter had (5.60, 3.09) when tested on perturbed data, where both F1's are statistically significantly lower than when fed normal inputs. In addition, BPE-VHRED reduces the vocabulary size by 15K, corresponding to 4.5M fewer parameters, which makes it train much faster. Note that BPE only makes the model robust to one type of noise (i.e., Grammar Errors), and hence adversarial training on other strategies is still necessary (but we hope that this encourages future work to build other advanced models that are naturally robust to diverse adversaries).

Conclusion
We first revealed both the over-sensitivity and over-stability of state-of-the-art models on the Ubuntu and CoCoA dialogue tasks, via Should-Not-Change and Should-Change adversarial strategies. We then showed that training on adversarial inputs not only made the models more robust to the perturbations, but also helped them achieve new state-of-the-art performance on the original data (with further improvements when we combined strategies). Lastly, we proposed a BPE-enhanced VHRED model that not only trains faster with comparable performance, but is also robust to Grammar Errors even without adversarial training, suggesting that when no strong adversary-generation tools (e.g., a paraphraser) are available (especially in low-resource domains/languages), alternative model-level architectural changes for robustness are worth exploring.