Probing Neural Dialog Models for Conversational Understanding

The predominant approach to open-domain dialog generation relies on end-to-end training of neural models on chat datasets. However, this approach provides little insight as to what these models learn (or do not learn) about engaging in dialog. In this study, we analyze the internal representations learned by neural open-domain dialog systems and evaluate the quality of these representations for learning basic conversational skills. Our results suggest that standard open-domain dialog systems struggle with answering questions, inferring contradiction, and determining the topic of conversation, among other tasks. We also find that the dyadic, turn-taking nature of dialog is not fully leveraged by these models. By exploring these limitations, we highlight the need for additional research into architectures and training methods that can better capture high-level information about dialog.


Introduction
Open-domain dialog systems often rely on neural models for language generation that are trained end-to-end on chat datasets. End-to-end training eliminates the need for hand-crafted features and task-specific modules (for example, for question answering or intent detection), while delivering promising results on a variety of language generation tasks including machine translation (Bahdanau et al., 2014), abstractive summarization (Rush et al., 2015), and text simplification (Wang et al., 2016).
However, current generative models for dialog suffer from several shortcomings that limit their usefulness in the real world. Neural models can be opaque and difficult to interpret, posing barriers to their deployment in safety-critical applications such as mental health or customer service (Belinkov and Glass, 2019). End-to-end training provides little insight as to what these models learn about engaging in dialog. Open-domain dialog systems also struggle to maintain basic conversations, frequently ignoring user input (Sankar et al., 2019) while generating irrelevant, repetitive, and contradictory responses (Saleh et al., 2019; Li et al., 2016, 2017a; Welleck et al., 2018). Table 1 shows examples from standard dialog models that fail at basic interactions, struggling to answer questions, detect intent, and understand conversational context.
In light of these limitations, we aim to answer the following questions: (i) Do neural dialog models effectively encode information about the conversation history? (ii) Do neural dialog models learn basic conversational skills through end-to-end training? (iii) And to what extent do neural dialog models leverage the dyadic, turn-taking structure of dialog to learn these skills?
To answer these questions, we propose a set of eight probing tasks to measure the conversational understanding of neural dialog models. Our tasks include question classification, intent detection, natural language inference, and commonsense reasoning, which all require high-level understanding of language. We also carry out perturbation experiments designed to test if these models fully exploit dialog structure during training. These experiments entail breaking the dialog structure by training on shuffled conversations and measuring the effects on probing performance and perplexity.
We experiment with both recurrent (Sutskever et al., 2014) and transformer-based (Vaswani et al., 2017) open-domain dialog models. We also analyze models with different sizes and initialization strategies, training small models from scratch and fine-tuning large pre-trained models on dialog data. Thus, our study covers a variety of standard models and approaches for open-domain dialog generation.
Our analysis reveals three main insights: 1. Dialog models trained from scratch on chat datasets perform poorly on the probing tasks, suggesting that they struggle with basic conversational skills. Large, pre-trained models achieve much better probing performance but are still only on par with simple baselines.
2. Neural dialog models fail to effectively encode information about the conversation history and the current utterance. In most cases, simply averaging the word embeddings is superior to using the learned encoder representations. This performance gap is smaller for large, pre-trained models.
3. Neural dialog models do not leverage the dyadic, turn-taking nature of conversation. Shuffling conversations in the training data had little impact on perplexity and probing performance. This suggests that breaking the dialog structure did not significantly affect the quality of learned representations.
Our code integrates with and extends ParlAI (Miller et al., 2017), a popular open-source platform for building dialog systems. We also publicly release all our code at https://github.com/AbdulSaleh/dialog-probing, hoping that probing will become a standard method for interpreting and analyzing open-domain dialog systems.

Related Work
Evaluating and interpreting open-domain dialog models is notoriously challenging. Multiple studies have shown that standard evaluation metrics such as perplexity and BLEU scores (Papineni et al., 2002) correlate very weakly with human judgments of conversation quality (Liu et al., 2016; Dziri et al., 2019). This has inspired multiple new approaches for evaluating dialog systems. One popular evaluation metric involves calculating the semantic similarity between the user input and generated response in high-dimensional embedding space (Liu et al., 2016; Dziri et al., 2019; Park et al., 2018; Zhao et al., 2017; Xu et al., 2018). Other work proposed calculating conversation metrics such as sentiment and coherence on self-play conversations generated by trained models. Similarly, Dziri et al. (2019) use neural classifiers to identify whether model-generated responses entail or contradict user input in a natural language inference setting.
To the best of our knowledge, all existing approaches for evaluating the performance of opendomain dialog systems only consider external model behavior in the sense that they analyze properties of the generated text. In this study, we explore internal representations instead, motivated by the fact that reasonable internal behavior is crucial for interpretability and is often a prerequisite for effective external behavior.
Outside of open-domain dialog, probing has been applied for analyzing natural language processing models in machine translation and visual question answering. Probing is also commonly used for evaluating the quality of "universal" sentence representations which are trained once and used for a variety of applications (Conneau et al., 2018; Adi et al., 2016) (for example, InferSent (Conneau et al., 2017), SkipThought (Kiros et al., 2015), USE (Cer et al., 2018)). Along the same lines, natural language understanding benchmarks such as GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019) propose a set of diverse tasks for evaluating general linguistic knowledge. Our analysis differs from previous work since it is focused on probing for conversational skills that are particularly relevant to dialog generation.
With regard to perturbation experiments, Sankar et al. (2019) found that standard dialog models are largely insensitive to perturbations of the input text. Here we introduce an alternative set of perturbation experiments to similarly explore the extent to which dialog structure is being leveraged by these models.

Models and Data
In this study, we focus on the three most widespread dialog architectures: recurrent neural networks (RNNs) (Sutskever et al., 2014), RNNs with attention (Bahdanau et al., 2014), and Transformers (Vaswani et al., 2017). We use the ParlAI platform (Miller et al., 2017) for building and training the models. We train models of two different sizes and initialization strategies. Small models (≈ 14M parameters) are initialized randomly and trained from scratch on DailyDialog (Li et al., 2017b). Large models (≈ 70M parameters) are pre-trained on WikiText-103 (Merity, 2016) and then fine-tuned on DailyDialog (see the supplemental material for further training details). DailyDialog is a dataset of 14K train, 1K validation, and 1K test multi-turn dialogs collected from an English learning website. The dialogs are of much higher quality than datasets scraped from Twitter or Reddit. WikiText-103 is a dataset of 29K Wikipedia articles. For pre-training the large models, we format WikiText-103 as a dialog dataset by treating each paragraph as a conversation and each sentence as an utterance.

Probing experiments
In open-domain dialog generation, the goal is to generate the next utterance, or response, u t+1, given the conversation history, [u 1 , . . . , u t ]. First, we train our models on dialog generation using a maximum-likelihood objective (Sutskever et al., 2014). We then freeze these trained models and use them as feature extractors. We run the dialog models on text from the probing tasks and use the internal representations as features for a two-layer multilayer perceptron (MLP) classifier trained on the probing tasks, as in figure 1. This follows the same methodology outlined in previous probing studies (Conneau et al., 2018; Adi et al., 2016).
The assumption here is that if a model learns certain conversational skills, then knowledge of these skills should be reflected in its internal representations. For example, a model that excels at answering questions would be expected to learn useful internal representations for question answering. Thus, the performance of the probing classifier on question answering can be used as a proxy for learning this skill. We extend this reasoning to eight probing tasks designed to measure a model's conversational understanding.
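The probing pipeline described above can be sketched as follows. The feature matrices here are random stand-ins for representations extracted from a frozen dialog model, and the probe is scikit-learn's MLP classifier; all sizes and hyperparameters are illustrative, not the paper's exact configuration:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Stand-ins for frozen-model features: in the real pipeline these would be
# representations extracted from a trained dialog model (encoder states,
# averaged word embeddings, or their concatenation).
n_train, n_test, dim, n_classes = 500, 100, 64, 6  # e.g. 6 TREC answer types
X_train = rng.normal(size=(n_train, dim))
y_train = rng.integers(0, n_classes, size=n_train)
X_test = rng.normal(size=(n_test, dim))
y_test = rng.integers(0, n_classes, size=n_test)

# MLP probe trained on top of the frozen features: the dialog model's
# parameters never change, only the probe is fit to the probing task.
probe = MLPClassifier(hidden_layer_sizes=(128,), max_iter=200, random_state=0)
probe.fit(X_train, y_train)
probing_accuracy = accuracy_score(y_test, probe.predict(X_test))
print(f"probing accuracy: {probing_accuracy:.3f}")
```

Because the features here are pure noise, the probe's accuracy hovers near chance; with real model representations, the accuracy serves as the proxy for conversational skill described above.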
The probing tasks require high-level reasoning, sometimes across multiple utterances; therefore, we aggregate utterance-level representations for probing. Our probing experiments consider three types of internal representations:

Word Embeddings: To get the word embedding representations, we first averaged the word embeddings of all words in the previous utterances, [u 1 , . . . , u t−1 ], then separately averaged the word embeddings of all words in the current utterance, u t , and concatenated the two resulting, equal-length vectors. Encoding the past utterances and the current utterance separately is important since it provides some temporal information about utterance order. We used the dialog model's encoder word embedding matrix.
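This aggregation can be sketched in a few lines; `word_embedding_features` is a hypothetical helper (not the paper's code) and the toy 4-dimensional embeddings are purely illustrative:

```python
import numpy as np

def embed(tokens, emb):
    """Average the embedding vectors of the given tokens."""
    return np.mean([emb[t] for t in tokens], axis=0)

def word_embedding_features(history, emb):
    """history: list of tokenized utterances [u_1, ..., u_t].
    Returns concat(mean over past utterances, mean over current utterance),
    so the feature vector is twice the embedding dimension."""
    past = [tok for utt in history[:-1] for tok in utt]
    current = history[-1]
    return np.concatenate([embed(past, emb), embed(current, emb)])

# Toy 4-dimensional embeddings (illustrative only).
emb = {w: np.full(4, float(i)) for i, w in enumerate(["hi", "how", "are", "you"])}
feats = word_embedding_features([["hi"], ["how", "are", "you"]], emb)
print(feats.shape)  # (8,)
```

Concatenating the two averages, rather than pooling everything together, is what preserves the coarse past-versus-current temporal signal mentioned above.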
Encoder State: For the encoder state, we extracted the encoder outputs after running it on the entire probing task input (i.e., the full conversation history, [u 1 , . . . , u t ]). Crucially, encoder states are the representations passed to the decoder for generation and are thus different for each architecture. For RNNs, we used the last encoder hidden and cell states. For RNNs with attention, the decoder has access to all the encoder hidden states (not just the final ones) through the attention mechanism. Thus, for RNNs with attention, we first averaged the encoder hidden states corresponding to the previous utterances, [u 1 , . . . , u t−1 ], then separately averaged the encoder hidden states corresponding to the current utterance, u t , and concatenated the two resulting, equal-length vectors. We also concatenated the last cell state. Similarly, for Transformers, we averaged the encoder outputs corresponding to the previous utterances, separately averaged the encoder outputs corresponding to the current utterance, and concatenated the two.
Combined: The combined representations are the concatenation of the word embeddings and encoder state representations.
We also use GloVe (Pennington et al., 2014) word embeddings as a simple baseline. We encode the probing task inputs using the word embeddings approach described above. We ensure that GloVe and all models of a certain size (small vs large) share the same vocabulary for comparability.

Perturbation Experiments
We also propose a set of perturbation experiments designed to measure whether dialog models fully leverage dialog structure for learning conversational skills. We create a new training dataset by shuffling the order of utterances within each conversation in DailyDialog. This completely breaks the dialog structure, and utterances no longer naturally follow one another. We train (or fine-tune) separate models on the shuffled dataset and evaluate their probing performance relative to models trained on data as originally ordered.
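The shuffling step itself is simple; `shuffle_conversations` is a hypothetical helper illustrating the perturbation, not the paper's code:

```python
import random

def shuffle_conversations(conversations, seed=0):
    """Return a copy of the training data with utterance order shuffled
    within each conversation, breaking the turn-taking structure while
    leaving the set of utterances in each conversation unchanged."""
    rng = random.Random(seed)
    shuffled = []
    for conv in conversations:
        conv = list(conv)  # copy so the original ordering is preserved
        rng.shuffle(conv)
        shuffled.append(conv)
    return shuffled

data = [["Hi!", "Hello, how can I help?", "I'd like to book a room."]]
print(shuffle_conversations(data))
```

Note that shuffling happens within conversations only, so utterance-level content is intact; only the dyadic ordering that gives dialog its structure is destroyed.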

Probing Tasks
The probing tasks selected for this study measure conversational understanding and skills relevant to dialog generation. Some tasks are inspired by previous benchmarks (Wang et al., 2018), while others have not been explored before for probing. Examples are listed in the supplemental material.
TREC: Question answering is a key skill for effective dialog systems. A system that deflects user questions could seem inattentive or indifferent. In order to correctly respond to questions, a model needs to determine what type of information the question is requesting. We probe for question answering using the TREC question classification dataset (Li and Roth, 2002), which consists of questions labeled with their associated answer types.
DialogueNLI: Any two turns in a conversation could entail each other (speakers agreeing, for example), or contradict each other (speakers disagreeing), or be unrelated (speakers changing topic of conversation). A dialog system should be sensitive to contradictions to avoid miscommunication and stay aligned with human preferences. We use the Dialogue NLI dataset (Welleck et al., 2018), which consists of pairs of dialog turns with entailment, contradiction, and neutral labels to probe for natural language inference. The original dataset examines two utterances from the same speaker ("I go to college", "I am a student"), so we modify the second utterance to simulate a second speaker ("I go to college", "You are a student").
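The paper does not specify the exact modification procedure; a naive first-person-to-second-person pronoun swap, sketched below with an illustrative mapping, conveys the idea:

```python
# Naive first- to second-person pronoun swap to simulate a second speaker.
# This mapping is illustrative, not the paper's actual procedure, and it
# ignores capitalization and harder cases such as possessive ambiguity.
SWAP = {"i": "you", "me": "you", "my": "your", "mine": "yours", "am": "are"}

def to_second_person(utterance):
    return " ".join(SWAP.get(tok.lower(), tok) for tok in utterance.split())

print(to_second_person("I go to college"))  # you go to college
print(to_second_person("I am a student"))   # you are a student
```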
MultiWOZ: Every utterance in a conversation can be considered as an action or a dialog act performed by the speaker. A speaker could be making a request, providing information, or simply greeting the system. MultiWOZ 2.1 (Eric et al., 2019) is a dataset of multi-domain, goal-oriented conversations. Human turns are labeled with dialog acts and the associated domains (hotel, restaurant, etc.), which we use to probe for natural language understanding.

SGD:
Tracking user intent is also important for generating appropriate responses. The same intent is often active across multiple dialog turns since it takes more than one turn to book a hotel, for example. Determining user intent requires reasoning over multiple turns, in contrast to dialog acts, which are turn-specific. To probe for this task, we use intent labels from the multi-domain, goal-oriented Schema-Guided Dialog dataset (Rastogi et al., 2019).
WNLI: Endowing neural models with commonsense reasoning is an ongoing challenge in machine learning (Storks et al., 2019). We use the Winograd NLI dataset, a variant of the Winograd Schema Challenge (Levesque et al., 2012), provided in the GLUE benchmark (Wang et al., 2018) to probe for commonsense reasoning. WNLI is a sentence pair classification task where the goal is to identify whether the hypothesis correctly resolves the referent of an ambiguous pronoun in the premise.

SNIPS:
The Snips NLU benchmark (Coucke et al., 2018) is a dataset of crowdsourced, single-turn queries labeled for intent. We use this dataset to probe for intent classification.
ScenarioSA: An understanding of sentiment and emotions is crucial for building social, human-centered conversational agents. We use ScenarioSA (Zhang et al., 2019) as a sentiment classification probing task. The dataset is composed of natural, multi-turn, open-ended dialogs with turn-level sentiment labels.
DailyDialog Topic: The DailyDialog dataset comes with conversation-level annotations for ten diverse topics, such as ordinary life, school life, relationships, and health. Inferring the topic of conversation is an important skill that could help dialog systems stay consistent and on topic. We use dialogs from the DailyDialog test set to create a probing task where the goal is to classify a dialog into the appropriate topic.

Quality of Encoder Representations
Results from our probing experiments are presented in tables 2 and 3. We calculate an average score to summarize the overall accuracy on all tasks. Here we explore whether the encoder learns high-quality representations of the conversation history. We focus on encoder states because these representations are passed to the decoder and used for generation (figure 1). Thus, effectively encoding information in the encoder states is crucial for dialog generation. Figure 2 shows the difference in average probing accuracy between the word embeddings and the encoder state for each model. The word embeddings outperform the encoder state for all the small models. This performance gap is most pronounced for the Transformer but is non-existent for the large recurrent models. One possible explanation is that the encoder highlights information relevant to generating dialog at the cost of obfuscating or losing information relevant to the probing tasks, given that the goals of certain probing tasks do not perfectly align with natural dialog generation. For example, the DailyDialog dataset contains examples where a question is answered with another question (perhaps for clarification). The TREC question classification task does not account for such cases and expects each question to have a specific answer type. This explanation is supported by the observation that the information in the word embeddings and the encoder state is not necessarily redundant: the combined representations often outperform using either one separately (albeit by a minute amount).
Regardless of the reason behind this gap in performance, multiple models still fail to effectively encode information about the conversation history that is already present in the word embeddings.

Probing for Conversational Understanding
In this section, we compare the probing performance of the ordered dialog models to the simple baseline of averaging GloVe word embeddings.
Here we consider the combined representations since they achieve the best performance overall and can act as a proxy for all the information captured by the encoder about the conversation history. Since our probing tasks test for conversational skills important for dialog generation, we would expect the dialog models to outperform GloVe word embeddings. However, this is generally not the case. As figure 3 shows, the GloVe baseline outperforms the small recurrent models while being on par with the large pre-trained models in terms of average score. Tables 2 and 3 show that this pattern also generally applies at the task level, not just in terms of average score. Closer inspection, however, reveals one exception: combined representations from both the small and large models consistently outperform GloVe on the DailyDialog Topic task. This is the only task derived from the DailyDialog test data, which follows the same distribution as the dialogs used for training the models. This suggests that a lack of generalization can partly explain the weak performance on the other tasks. It is also worth noting that DailyDialog Topic is labeled at the conversation level rather than the turn level. Thus, identifying the correct label does not necessarily require reasoning about turn-level interactions (unlike DialogueNLI, for example).
The poor performance on the majority of tasks, relative to the simple GloVe baseline, leads us to conclude that standard dialog models trained from scratch struggle to learn the basic conversational skills examined here. Large, pre-trained models do not seem to master these skills either, with performance on par with the baselines.

Effect of Dialog Structure
Tables 4 and 5 summarize the results of the perturbation experiments. Figure 4 shows the difference in average performance between the ordered and shuffled models. We show results for the encoder states since these representations are important for encoding the conversation history, as discussed in section 5.1. The encoder states are also sensitive to word and utterance order, unlike averaged word embeddings. So if a model fully exploits the dyadic, turn-taking structure of dialog, this is likely to be reflected in the encoder state representations.
In most of our experiments, models trained on ordered data outperformed models trained on shuffled data, as expected. We can see in figure 4, that average scores for ordered models were often higher than for shuffled models. However, the absolute gap in performance was at most 2%, which is a minute difference in practice. And even though ordered models achieved higher accuracy on average, if we examine individual tasks in tables 4 and 5, we can find instances where the shuffled models outperformed the ordered ones for each of the tested architectures, sizes, and initialization strategies.
We evaluated all the models on the ordered DailyDialog test set to calculate perplexity. The average difference in test perplexity between the ordered and shuffled models was less than 2 points. This is a minor difference in practice, suggesting that model fit and predictions are not substantially different when training on shuffled data. The minimal impact of shuffling the training data suggests that dialog models do not adequately leverage dialog structure during training. Our results show that essentially all of the information captured when training on ordered dialogs is also learned when training on shuffled dialogs.
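To see why a 2-point perplexity gap is small, recall that corpus perplexity is the exponential of the mean per-token negative log-likelihood, so a gap of 2 points around perplexity 20 corresponds to under 0.1 nats per token (the numbers below are illustrative, not the paper's measured values):

```python
import math

def perplexity(token_nlls):
    """Corpus perplexity: exp of the mean per-token negative log-likelihood (nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# A ~2-point perplexity gap corresponds to a tiny shift in per-token NLL:
ordered = perplexity([3.0] * 100)     # exp(3.0)   ~ 20.09
shuffled = perplexity([3.095] * 100)  # exp(3.095) ~ 22.09
print(round(shuffled - ordered, 2))
```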

Limitations
Some of our conclusions assume that probing performance is indicative of performance on the end-task of dialog generation. Yet it could be the case that certain models learn high-quality representations for probing but cannot effectively use them for generation, due to a weakness in the decoder, for example. To address this limitation, future work could examine the relationship between probing performance and human judgments of conversation quality. Belinkov (2018) argues that more research on the causal relation between probing and end-task performance is required to address this limitation.
However, it is reasonable to assume that capturing information about a certain probing task is a prerequisite to utilizing information relevant to that task for generation. For example, a model that cannot identify user sentiment is unlikely to use information about user sentiment for generation. We also find that probing performance of the encoder state negatively correlates with test perplexity (Table 6); that is, models with better data fit (lower perplexity) achieve better probing performance, although this correlation is insufficient to establish a causal relationship.

Conclusion
We use probing to shed light on the conversational understanding of neural dialog models. Our findings suggest that standard neural dialog models suffer from many limitations. They do not effectively encode information about the conversation history, struggle to learn basic conversational skills, and fail to leverage the dyadic, turn-taking structure of dialog. These limitations are particularly severe for small models trained from scratch on dialog data but occasionally also affect large pre-trained models. Addressing these limitations is an interesting direction of future work. Models could be augmented with specific components or multi-task loss functions to support learning certain skills. Future work can also explore the relationship between probing performance and human evaluation.

A.1 Training Details
For the small RNN trained from scratch, we used a 2-layer encoder, 2-layer decoder network with bidirectional LSTM units, a hidden size of 256, and a word embedding size of 128. For the small RNN with attention, we used the same architecture but also added multiplicative attention (Luong et al., 2015). We set dropout to 0.3 and used a batch size of 64. We used an Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.005, inverse square root decay, and 4000 warm-up updates.
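The warmup-then-decay schedule used throughout these configurations can be sketched as below; this is one common formulation of inverse square root decay with linear warmup, and ParlAI's exact implementation may differ in detail:

```python
import math

def invsqrt_lr(step, base_lr=0.005, warmup=4000):
    """Linear warmup to base_lr, then inverse square root decay.
    A common formulation; ParlAI's exact schedule may differ in detail."""
    if step < warmup:
        return base_lr * step / warmup           # linear warmup
    return base_lr * math.sqrt(warmup / step)    # invsqrt decay

print(invsqrt_lr(2000))   # halfway through warmup: 0.0025
print(invsqrt_lr(4000))   # peak learning rate: 0.005
print(invsqrt_lr(16000))  # decayed by a factor of sqrt(1/4): 0.0025
```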
For the small Transformer, we used a 2-layer encoder, 2-layer decoder network with an embedding size of 400, 8 attention heads, and a feedforward network size of 300. We set dropout to 0.3 and used a batch size of 64. We used an Adam optimizer with a learning rate of 0.001, inverse square root decay, and 6000 warm-up updates.
For the large RNN pretrained on WikiText-103 (Merity, 2016), we used a 2-layer encoder, 2-layer decoder network with bidirectional LSTM units, a hidden size of 1024, and a word embedding size of 300. For the large RNN with attention, we used the same architecture but also included multiplicative attention. We set dropout to 0.3 and used a batch size of 40. We used an Adam optimizer with a learning rate of 0.005, inverse square root decay, and 4000 warm-up updates.
For the large Transformer we used a 2-layer encoder, 2-layer decoder network with an embedding size of 768, 12 attention heads, and a feedforward network size of 2048. We set dropout to 0.1 and used a batch size of 32. We used an Adam optimizer with a learning rate of 0.001, inverse square root decay, and 4000 warm-up updates.