Do Neural Dialog Systems Use the Conversation History Effectively? An Empirical Study

Neural generative models have become increasingly popular for building conversational agents. They offer flexibility, can be easily adapted to new domains, and require minimal domain engineering. A common criticism of these systems is that they seldom understand or use the available dialog history effectively. In this paper, we take an empirical approach to understanding how these models use the available dialog history by studying their sensitivity to artificially introduced, unnatural changes (perturbations) to their context at test time. We experiment with 10 different types of perturbations on 4 multi-turn dialog datasets and find that commonly used neural dialog architectures, such as recurrent and transformer-based seq2seq models, are rarely sensitive to most perturbations, such as missing or reordered utterances, shuffled words, etc. We also open-source our code, which we believe will serve as a useful diagnostic tool for evaluating dialog systems in the future.


Introduction
With recent advancements in generative models of text (Wu et al., 2016; Vaswani et al., 2017; Radford et al., 2018), neural approaches to building chit-chat and goal-oriented conversational agents (Sordoni et al., 2015; Vinyals and Le, 2015; Serban et al., 2016; Bordes and Weston, 2016; Serban et al., 2017b) have gained popularity, with the hope that advancements in tasks like machine translation (Bahdanau et al., 2015) and abstractive summarization (See et al., 2017) should translate to dialog systems as well. While these models have demonstrated the ability to generate fluent responses, they still lack the ability to "understand" and process the dialog history to produce coherent and interesting responses. They often produce boring and repetitive responses like "Thank you." (Li et al., 2015; Serban et al., 2017a) or meander away from the topic of conversation. This has often been attributed to the manner and extent to which these models use the dialog history when generating responses. However, there has been little empirical investigation to validate these speculations.
In this work, we take a step in that direction and confirm some of these speculations, showing that models do not make use of much of the information available to them, by subjecting the dialog history to a variety of synthetic perturbations. We then empirically observe how recurrent (Sutskever et al., 2014) and transformer-based (Vaswani et al., 2017) sequence-to-sequence (seq2seq) models respond to these changes. The central premise of this work is that models make minimal use of certain types of information if they are insensitive to perturbations that destroy that information. Worryingly, we find that 1) both recurrent and transformer-based seq2seq models are insensitive to most kinds of perturbations considered in this work, 2) both are particularly insensitive even to extreme perturbations such as randomly shuffling or reversing words within every utterance in the conversation history (see Table 1), and 3) recurrent models are more sensitive to the ordering of utterances within the dialog history, suggesting that they could be modeling conversation dynamics better than transformers.

Related Work
Since this work aims at investigating and gaining an understanding of the kinds of information a generative neural response model learns to use, the most relevant pieces of work are those in which similar analyses have been carried out to understand the behavior of neural models in other settings. An investigation into how LSTM-based unconditional language models use available context was carried out by Khandelwal et al. (2018). They empirically demonstrate that models are sensitive to perturbations only in the nearby context and typically use only about 150 words of context. On the other hand, in conditional language modeling tasks like machine translation, models are adversely affected by both synthetic and natural noise introduced anywhere in the input (Belinkov and Bisk, 2017). Understanding what information is learned or contained in the representations of neural networks has also been studied by "probing" them with linear or deep models (Adi et al., 2016; Subramanian et al., 2018; Conneau et al., 2018). Several works have recently pointed out the presence of annotation artifacts in common text and multi-modal benchmarks. For example, Gururangan et al. (2018) demonstrate that hypothesis-only baselines for natural language inference obtain results significantly better than random guessing. Kaushik and Lipton (2018) report that reading comprehension systems can often ignore the entire question or use only the last sentence of a document to answer questions. Anand et al. (2018) show that an agent that does not navigate or even see the world around it can answer questions about it as well as one that does. These pieces of work suggest that while neural methods have the potential to learn the task specified, the design of the task could lead them to do so in a manner that doesn't use all of the information available within the task.
Recent work has also investigated the inductive biases that different sequence models learn. For example, Tran et al. (2018) find that recurrent models are better at modeling hierarchical structure, while Tang et al. (2018) find that feedforward architectures like the transformer and convolutional models are not better than RNNs at modeling long-distance agreement. Transformers, however, excel at word-sense disambiguation. We analyze whether the choice of architecture and the use of an attention mechanism affect the way in which dialog systems use the information available to them.

Experimental Setup
Following the recent line of work on generative dialog systems, we treat the problem of generating an appropriate response given a conversation history as a conditional language modeling problem. Specifically, we want to learn a conditional probability distribution $P_\theta(y \mid x)$, where $y$ is a reasonable response given the conversation history $x$. The conversation history is typically represented as a sequence of utterances $x_1, x_2, \ldots, x_n$, each of which is itself a sequence of words. The response $y$ is a single utterance, also composed of a sequence of words $y_1, y_2, \ldots, y_m$. The overall conditional probability is factorized autoregressively as

$$P_\theta(y \mid x) = \prod_{i=1}^{m} P_\theta(y_i \mid y_{<i}, x_1, \ldots, x_n).$$

$P_\theta$, in this work, is parameterized by a recurrent or transformer-based seq2seq model. The crux of this work is to study how the learned probability distribution behaves as we artificially perturb the conversation history $x_1, \ldots, x_n$. We measure behavior by looking at how much the per-token perplexity increases under these changes. For example, one could shuffle the order in which $x_1, \ldots, x_n$ is presented to the model and observe how much the perplexity of $y$ under the model increases. If the increase is only minimal, we can conclude that the ordering of $x_1, \ldots, x_n$ isn't informative to the model. For a complete list of perturbations considered in this work, please refer to Section 3.2. All models are trained without any perturbations, and sensitivity is studied only at test time.

Figure 1: The increase in perplexity for different models when only presented with the k most recent utterances from the dialog history for Dailydialog (left) and bAbI dialog (right) datasets. Recurrent models with attention fare better than transformers, since they use more of the conversation history.
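The per-token perplexity measurement described in this section can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names are hypothetical, and it assumes the model exposes a natural-log probability for each response token, once conditioned on the original history and once on the perturbed history.

```python
import math
from typing import Sequence


def per_token_perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity = exp(-(mean natural-log probability of the response tokens))."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def perplexity_increase(
    original_log_probs: Sequence[float],
    perturbed_log_probs: Sequence[float],
) -> float:
    """Increase in perplexity of the same response when the history is perturbed.

    Both inputs score the same response tokens y_1..y_m; only the conditioning
    history differs (assumption: the caller obtains these from the model).
    """
    return per_token_perplexity(perturbed_log_probs) - per_token_perplexity(
        original_log_probs
    )
```

A value near zero from `perplexity_increase` would indicate that the model is insensitive to the perturbation, which is the signal this study looks for.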

Datasets
We experiment with four multi-turn dialog datasets.
bAbI dialog is a synthetic goal-oriented multi-turn dataset (Bordes and Weston, 2016) consisting of 5 different tasks for restaurant booking with increasing levels of complexity. We consider Task 5 in our experiments since it is the hardest and is a union of the other four tasks. It contains 1k dialogs with an average of 13 user utterances per dialog.

Persona Chat is an open domain dataset (Zhang et al., 2018) with multi-turn chit-chat conversations between Turkers who are each assigned a "persona" at random. It comprises 10.9k dialogs with an average of 14.8 turns per dialog.
Dailydialog is an open domain dataset (Li et al., 2017) which consists of dialogs that resemble day-to-day conversations across multiple topics. It comprises 13k dialogs with an average of 7.9 turns per dialog.
MutualFriends is a multi-turn goal-oriented dataset (He et al., 2017) where two agents must discover which friend of theirs is mutual based on the friends' attributes. It contains 11k dialogs with an average of 11.41 utterances per dialog.

Types of Perturbations
We experimented with several types of perturbation operations at the utterance and word (token) levels. All perturbations are applied in isolation.
Utterance-level perturbations. We consider the following operations: 1) Shuf, which shuffles the sequence of utterances in the dialog history; 2) Rev, which reverses the order of utterances in the history (but maintains word order within each utterance); 3) Drop, which completely drops certain utterances; and 4) Truncate, which truncates the dialog history to contain only the k most recent utterances, where k ≤ n and n is the length of the dialog history.
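The utterance-level operations above can be sketched as small pure functions over a list of utterance strings. This is a minimal illustration under stated assumptions, not the released code: the function names are hypothetical, and since the text does not say which utterances Drop removes, the positions to drop are left as a caller-supplied parameter.

```python
import random
from typing import Iterable, List

History = List[str]  # one string per utterance, oldest first


def shuffle_utterances(history: History, rng: random.Random) -> History:
    """Shuf: randomly reorder the utterances (word order inside each is kept)."""
    shuffled = list(history)
    rng.shuffle(shuffled)
    return shuffled


def reverse_utterances(history: History) -> History:
    """Rev: reverse the order of utterances, keeping each utterance intact."""
    return list(reversed(history))


def drop_utterances(history: History, indices: Iterable[int]) -> History:
    """Drop: remove the utterances at the given positions (assumed parameter)."""
    to_drop = set(indices)
    return [utt for i, utt in enumerate(history) if i not in to_drop]


def truncate_history(history: History, k: int) -> History:
    """Truncate: keep only the k most recent utterances, with 1 <= k <= n."""
    if not 1 <= k <= len(history):
        raise ValueError("k must satisfy 1 <= k <= len(history)")
    return history[-k:]
```

Each function returns a new list and leaves its input unchanged, so a single test-set history can be reused across perturbations applied in isolation.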
Word-level perturbations. We consider similar operations but at the word level within every utterance: 1) word-shuffle, which randomly shuffles the words within an utterance; 2) reverse, which reverses the ordering of words; 3) word-drop, which drops 30% of the words uniformly at random; 4) noun-drop, which drops all nouns; and 5) verb-drop, which drops all verbs.
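The word-level operations can be sketched in the same style. Several details here are assumptions rather than facts from the paper: utterances are tokenized by whitespace, word-drop removes each word independently with probability 0.3 (one reading of "30% uniformly"), and noun-drop and verb-drop receive part-of-speech tags from the caller, since the tagger used is not stated. All function names and the tag labels are hypothetical.

```python
import random
from typing import Sequence, Set, Tuple


def word_shuffle(utterance: str, rng: random.Random) -> str:
    """Randomly reorder the whitespace-separated words of one utterance."""
    words = utterance.split()
    rng.shuffle(words)
    return " ".join(words)


def word_reverse(utterance: str) -> str:
    """Reverse the order of words within one utterance."""
    return " ".join(reversed(utterance.split()))


def word_drop(utterance: str, rng: random.Random, rate: float = 0.3) -> str:
    """Drop each word independently with probability `rate` (assumed reading)."""
    return " ".join(w for w in utterance.split() if rng.random() >= rate)


def drop_by_tag(tagged: Sequence[Tuple[str, str]], tags_to_drop: Set[str]) -> str:
    """Drop all words whose POS tag is in `tags_to_drop`.

    `tagged` is a sequence of (word, tag) pairs produced by any tagger;
    e.g. tags_to_drop={"NOUN"} for noun-drop, {"VERB"} for verb-drop.
    """
    return " ".join(word for word, tag in tagged if tag not in tags_to_drop)
```

To perturb a whole history, one would apply the chosen function to every utterance, for example `[word_reverse(u) for u in history]`.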

Models
We experimented with two different classes of models: recurrent and transformer-based sequence-to-sequence generative models. All data loading, model implementations and evaluations were done using the ParlAI framework. We used the default hyper-parameters for all the models as specified in ParlAI.

Recurrent Models
We trained a seq2seq (seq2seq lstm) model where the encoder and decoder are parameterized as LSTMs (Hochreiter and Schmidhuber, 1997). We also experiment with using decoders that use an attention mechanism (seq2seq lstm att) (Bahdanau et al., 2015). The encoder and decoder LSTMs have 2 layers with 128 dimensional hidden states with a dropout rate of 0.1.
Transformer. Our transformer (Vaswani et al., 2017) model uses 300 dimensional embeddings and hidden states, 2 layers and 2 attention heads with no dropout. This model is significantly smaller than the ones typically used in machine translation, since we found that a model resembling that of Vaswani et al. (2017) significantly overfit on all our datasets. While the models considered in this work might not be state-of-the-art on the datasets considered, we believe they are still competitive and used commonly enough, at least as baselines, that the community will benefit from understanding their behavior. We use early stopping with a patience of 10 on the validation set to save our best model. All models achieve close to the perplexity numbers reported for generative seq2seq models in their respective papers.
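The early-stopping rule mentioned above can be illustrated with a short sketch. It is not ParlAI's implementation: the text states only a patience of 10 on the validation set, so the monitored quantity (a lower-is-better validation score such as perplexity) and the function name are assumptions.

```python
from typing import Iterable, Tuple


def best_checkpoint_with_patience(
    val_scores: Iterable[float], patience: int = 10
) -> Tuple[int, float]:
    """Return (index, score) of the best validation round under early stopping.

    Training stops once `patience` consecutive validation rounds have passed
    without improving on the best (lowest) score seen so far.
    """
    best_round, best_score = -1, float("inf")
    for round_idx, score in enumerate(val_scores):
        if score < best_score:
            # New best: this is the checkpoint that would be saved.
            best_round, best_score = round_idx, score
        elif round_idx - best_round >= patience:
            # No improvement for `patience` rounds: stop training.
            break
    if best_round < 0:
        raise ValueError("no validation scores were provided")
    return best_round, best_score
```

With `patience=2`, the sequence `[5.0, 4.0, 4.5, 4.2, 3.9]` stops after the fourth round and keeps the checkpoint from round index 1; a larger patience would have reached the later, better score.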

Results & Discussion
Our results are presented in Table 2 and Figure 1. Table 2 reports the perplexities of different models on the test set in the second column, followed by the increase in perplexity when the dialog history is perturbed using the method specified in the column header. Rows correspond to models trained on different datasets. Figure 1 presents the change in perplexity for models when presented only with the k most recent utterances from the dialog history.
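One row of a table laid out this way (test perplexity first, then the increase under each perturbation) could be assembled as in the sketch below. The names are hypothetical, and `perplexity_fn` stands in for whatever routine evaluates a trained model on the test set after applying a given transformation to every dialog history.

```python
from typing import Callable, Dict, List

History = List[str]
Perturbation = Callable[[History], History]


def sensitivity_row(
    perplexity_fn: Callable[[Perturbation], float],
    perturbations: Dict[str, Perturbation],
) -> Dict[str, float]:
    """Build one table row: unperturbed perplexity, then increase per perturbation.

    `perplexity_fn(p)` is assumed to return the test perplexity of a fixed,
    already trained model when every history h is replaced by p(h).
    """
    def identity(history: History) -> History:
        return history

    base = perplexity_fn(identity)
    row = {"test_ppl": base}
    for name, perturb in perturbations.items():
        # Positive values mean the model got worse, i.e. it was using
        # the information that this perturbation destroyed.
        row[name] = perplexity_fn(perturb) - base
    return row
```

Because the model is trained once without perturbations, only `perplexity_fn` is re-run for each column; no retraining is involved.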
We make the following observations:
1. Models tend to show only tiny changes in perplexity in most cases, even under extreme changes to the dialog history, suggesting that they use far from all the information that is available to them.
2. Transformers are insensitive to word reordering, indicating that they could be learning bag-of-words-like representations.
3. The use of an attention mechanism in seq2seq lstm att and transformers makes these models use more information from earlier parts of the conversation than vanilla seq2seq models, as seen from the increases in perplexity when using only the last utterance.
4. While transformers converge faster and to lower test perplexities, they don't seem to capture the conversational dynamics across utterances in the dialog history and are less sensitive to perturbations that scramble this structure than recurrent models.

Conclusion
This work studies the behavior of generative neural dialog systems in the presence of synthetically introduced perturbations to the dialog history that they condition on. We find that both recurrent and transformer-based seq2seq models are not significantly affected even by drastic and unnatural modifications to the dialog history. We also find subtle differences between the ways in which recurrent and transformer-based models use available context. We open-source our code and believe that this paradigm of studying model behavior, by introducing perturbations that destroy different kinds of structure present within the dialog history, can be a useful diagnostic tool. We also foresee this paradigm being useful when building new dialog datasets, to understand the kinds of information models use to solve them.