A Co-Attentive Cross-Lingual Neural Model for Dialogue Breakdown Detection

Ensuring smooth communication is essential in a chat-oriented dialogue system, so that a user can obtain meaningful responses through interactions with the system. However, most prior work on dialogue research does not focus on preventing dialogue breakdowns. One of the major challenges is that a dialogue system may generate an undesired utterance that leads to a dialogue breakdown, which degrades the overall interaction quality. Hence, it is crucial for a machine to detect dialogue breakdowns in an ongoing conversation. In this paper, we propose a novel dialogue breakdown detection model that jointly incorporates a pretrained cross-lingual language model and a co-attention network. Our proposed model leverages effective word embeddings trained on one hundred different languages to generate contextualized representations. Co-attention aims to capture the interaction between the latest utterance and the conversation history, and thereby determine whether the latest utterance causes a dialogue breakdown. Experimental results show that our proposed model outperforms all previous approaches on all evaluation metrics in both the Japanese and English tracks of Dialogue Breakdown Detection Challenge 4 (DBDC4 at IWSDS 2019).


Introduction
In today's world, we often use interactive virtual conversational agents in our daily lives, such as Siri, Google Assistant, and Alexa. Such chat-oriented dialogue systems assist users in their daily lives through interaction in the form of conversations. Through information obtained via a conversation, a system can understand a user's need and generate the corresponding response, such as reporting weather forecasts, booking tickets, or entertaining the user by chit-chatting. To facilitate hassle-free communication, the system needs to generate proper and rational responses without any ambiguity. Furthermore, while generating a response, the system should consider not only the last turn in the conversation but also all previous utterances, so that the generated response is coherent with the entire conversation history. The user experience degrades when a dialogue breakdown occurs, i.e., when the system generates an inappropriate response that breaks the conversation flow. Dialogue breakdown detection is one of the major challenges faced by current chat-oriented dialogue systems, but it has still not been carefully studied by the research community.

Utterance                                                            Breakdown?
U1: What do you like to do?
S1: I like to eat out and eat.                                       NO
U2: I like to eat too. Hamburgers and french fries are my favorite.
S2: I love fries. I love cheese.                                     NO
U3: Where do you work?
S3: I work for a company.                                            NO
U4: As a what?
S4: I have a lot of friends.                                         YES

Table 1: An example dialogue where the last system-generated response causes a dialogue breakdown. Ui and Sj denote the i-th user utterance and the j-th system utterance in the dialogue, respectively. The second column indicates whether each system utterance causes a breakdown.

The dialogue breakdown detection task is designed to test a system's capability of identifying the undesired utterance causing a dialogue breakdown, which is expected to further help us build more fluent dialogue systems. The task requires a participating system to determine whether an utterance generated by a system causes a dialogue breakdown (Higashinaka et al., 2016). Table 1 shows an example of a dialogue breakdown in an ongoing conversation. In this example, the last system response causes a dialogue breakdown, since it does not answer the question "As a what?" in the last user utterance but instead gives a completely irrelevant response, "I have a lot of friends."

Dialogue Breakdown Detection Challenge 4 (DBDC4) is a shared task dedicated to dialogue breakdown detection. In this paper, we use the dataset released in DBDC4 for our experiments. Since whether a system utterance causes a dialogue breakdown is somewhat subjective, the task is modeled as a classification task with three classes: Breakdown (B), Possible Breakdown (PB), and Not a Breakdown (NB). Every instance in the dataset is annotated by multiple annotators. Each instance is assigned a gold-standard (majority) class and a probability distribution based on all annotators' labels. The task has two tracks involving two different languages: Japanese and English. The evaluation metrics consist of both classification-related metrics and distribution-related metrics, which we introduce in detail in Section 4.3.
Prior work has exploited feature-engineered machine learning approaches such as random forests, neural network architectures such as LSTM (Hendriksen et al., 2019; Shin et al., 2019), and the monolingual pretrained language model BERT (Devlin et al., 2019; Sugiyama, 2019). Most prior work treats the dialogue history and the last system response in the same manner. In feature-based approaches, the features are extracted from the concatenation of both the dialogue history and the last system response, while in other models using pretrained word embeddings (Hendriksen et al., 2019; Shin et al., 2019; Sugiyama, 2019), the interaction between the dialogue history and the last system response is also not explicitly captured.
Recently, cross-lingual language models such as XLM (Conneau and Lample, 2019) and XLM-R (Conneau et al., 2020) have demonstrated strong performance on many cross-lingual natural language processing (NLP) tasks (Wang et al., 2018; Conneau et al., 2018). They also outperform pretrained monolingual language models on tasks with low-resource languages. By utilizing shared word embeddings over different languages and multilingual parallel texts, cross-lingual language models encode input texts into one single representation space shared by all languages. This removes the cost of language-specific training. In this paper, we utilize pretrained cross-lingual language models to benefit from multilingual training data. To the best of our knowledge, ours is the first work to incorporate pretrained cross-lingual language models in dialogue breakdown detection.
Typically, pretrained language models are not trained in a task-specific setting. When we utilize one for a particular task, it might not perform well if the task-specific training data is limited. To better capture the interaction between the previous dialogue history and the last utterance, we propose to integrate a co-attention network with the cross-lingual language model. Experimental results show that our co-attention network significantly improves the performance of our model on the DBDC4 dataset. The source code and trained models of this paper are available at https://github.com/nusnlp/CXM.

Task Overview
A dialogue history H consists of a sequence of alternating user and system utterances. The target utterance T for dialogue breakdown detection is the succeeding system utterance. Each instance (H, T ) is assigned one of three candidate classes: Breakdown (B), Possible Breakdown (PB), and Not a Breakdown (NB). The output of a model includes two components: a predicted class from one of the three candidates {B, PB, NB}, and a probability distribution over the three classes. DBDC4 includes two tracks in two different languages (Japanese and English) with the same task setting.
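As an illustration of the required output format, the sketch below (our own example, not code from the paper) maps a predicted probability distribution over the three classes to a predicted label by taking the argmax:

```python
# Illustrative sketch of the task's output format: a model emits both a
# distribution over {B, PB, NB} and a single predicted class.
LABELS = ("B", "PB", "NB")

def predict_label(distribution):
    """Return (predicted_class, distribution) for one (H, T) instance."""
    assert abs(sum(distribution) - 1.0) < 1e-6, "must be a probability distribution"
    best = max(range(len(LABELS)), key=lambda i: distribution[i])
    return LABELS[best], tuple(distribution)

label, dist = predict_label([0.6, 0.3, 0.1])  # -> ("B", (0.6, 0.3, 0.1))
```

Both components are evaluated: the class against the majority label, and the distribution against the annotators' label distribution.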

Model Description
We propose a Co-attentive Cross-lingual Neural Model (CXM), which combines a pretrained cross-lingual language model with a co-attention network to tackle the DBDC4 task. We give a detailed description of CXM in this section and present the overall architecture of the proposed model in Figure 1.

Pretrained Cross-lingual Embeddings
For the embedding layer, we adopt a state-of-the-art cross-lingual pretrained language model named XLM-R (Conneau et al., 2020), which is pretrained on large-scale multilingual corpora. Compared with other cross-lingual language models such as mBERT (Devlin et al., 2019) and XLM (Conneau and Lample, 2019), the data used to pretrain XLM-R is larger by orders of magnitude, especially for low-resource languages. The CC-100 corpus used for pretraining XLM-R includes significantly more Japanese and English text than the Wiki-100 corpus on which mBERT and XLM are pretrained (Conneau et al., 2020). In this paper, we choose the largest XLM-R model, with hidden dimension size d = 1024.
Assume that the dialogue history H consists of user and system utterances with h tokens after tokenization, and the target system utterance T consists of t tokens. We first concatenate these two parts into a combined input to XLM-R, and then obtain the last-layer output, denoted as G ∈ R^((h+t)×d):

G = XLM-R([H; T])

where [·; ·] denotes concatenation, d is the hidden dimension size of XLM-R, H ∈ R^(h×d) is the last-layer output corresponding to the words in the dialogue history, and T ∈ R^(t×d) is the last-layer output corresponding to the words in the target utterance. With XLM-R embeddings, we can utilize data from other languages to enrich the training dataset, so as to transfer the knowledge learned from other languages to the target language.
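The concatenation and row-wise split described above can be sketched as follows. The random `encode` function here is a stand-in for the actual XLM-R encoder (which would be loaded from a pretrained checkpoint via a library such as HuggingFace transformers), so only the tensor shapes are meaningful:

```python
import numpy as np

d = 1024       # hidden size of XLM-R large
h, t = 12, 5   # history tokens and target-utterance tokens (hypothetical lengths)

def encode(history_tokens, target_tokens):
    """Stand-in for XLM-R: returns a random last-layer output of shape (h+t, d)."""
    rng = np.random.default_rng(0)
    return rng.standard_normal((len(history_tokens) + len(target_tokens), d))

# Concatenate history and target into one input, then split the output G
# back into the history block H and the target block T.
G = encode(list(range(h)), list(range(t)))
H, T = G[:h], G[h:]
assert H.shape == (h, d) and T.shape == (t, d)
```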

Co-Attentive Encoding
To further capture the interaction, we propose to utilize a co-attentive encoder to compute a combined representation of the dialogue history and the target utterance. The co-attention network (Lu et al., 2016) was initially proposed for visual question answering (VQA). In VQA, it was used to jointly reason over image and question attention. Xiong et al. (2017) adapted the co-attention network for machine reading comprehension (e.g., SQuAD). They used a co-attention network to capture the interaction between the question and passage for answer span extraction. While Xiong et al. (2017) used the co-attentive encoder for single-turn machine reading comprehension, we utilize the co-attentive encoder in a more complex dialogue breakdown detection task in a multi-turn conversation setting. The proposed approach of utilizing the co-attentive encoder is described as follows.
To capture the interaction between the dialogue history H and the target utterance T, we use a co-attentive encoder to compute a combined representation of H and T. We first calculate the token-wise contextual similarity A ∈ R^(t×h) between the target utterance and the dialogue history:

A = T H^⊤

where A_{i,j} is the similarity of the i-th token in the target utterance (i.e., the i-th row of T) and the j-th token in the previous dialogue utterances (i.e., the j-th row of H).

We apply a row-wise softmax function for normalization to produce the attention scores over the dialogue history for each word in the target utterance, resulting in Ã. Next, we compute a summary, or weighted representation, of the dialogue history corresponding to each word of the target utterance:

Ã = softmax_row(A),  T^x = Ã H

Next, the initial target utterance encoding vectors in T (i.e., the rows of T) and the vectors in T^x are concatenated and passed through a bidirectional LSTM, which results in T̃ ∈ R^(t×d). T̃ essentially captures the interaction between the dialogue history and the target utterance by jointly encoding them:

T̃ = BiLSTM([T; T^x])

We use a self-attention layer on top of T̃, which can effectively aggregate evidence from the joint encoding vectors to infer the output. First, we compute the self-attention matrix:

A^s = T̃ W T̃^⊤

where W ∈ R^(d×d) is a trainable bilinear matrix. Next, we apply a row-wise softmax function for normalization, resulting in Ã^s. The self-attentive encoding vectors can then be aggregated as:

T^s = Ã^s T̃

Then, we concatenate the joint encoding vectors in T̃ with the self-attentive encoding vectors in T^s, followed by a feed-forward layer, which results in Y ∈ R^(t×d). We aggregate the vectors in Y for the output layer. If the i-th row of Y is denoted y_i ∈ R^d, the aggregated vector v ∈ R^d can be written as:

α = softmax(Y w),  v = Σ_i α_i y_i

where α ∈ R^t and w ∈ R^d is a trainable vector.

Output layer: In the output layer, we use a simple feed-forward layer on top of v. The number of output units is three, for the three different classes. For training, we minimize the mean squared error (MSE) loss summed over all the training instances.
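The co-attentive encoding pipeline can be sketched in NumPy as below. This is our own illustrative reconstruction: the bidirectional LSTM and feed-forward layers are replaced by single random projections (`W_lstm`, `W_ff`) for brevity, and all weights are random rather than trained, so only the tensor shapes and the attention computations mirror the model:

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, t = 8, 6, 4  # toy sizes (the paper uses d = 1024)

H = rng.standard_normal((h, d))   # history encoding from XLM-R
T = rng.standard_normal((t, d))   # target-utterance encoding from XLM-R

def row_softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Token-wise similarity, then attention over the history per target token
A = T @ H.T                        # (t, h)
A_tilde = row_softmax(A)
T_x = A_tilde @ H                  # history summary per target token, (t, d)

# Stand-in for the bidirectional LSTM over [T; T_x] (an assumption for brevity)
W_lstm = rng.standard_normal((2 * d, d)) / np.sqrt(2 * d)
T_joint = np.tanh(np.concatenate([T, T_x], axis=-1) @ W_lstm)   # (t, d)

# Self-attention with a trainable bilinear matrix W
W = rng.standard_normal((d, d)) / np.sqrt(d)
A_s = row_softmax(T_joint @ W @ T_joint.T)
T_s = A_s @ T_joint                # (t, d)

# Feed-forward over [T_joint; T_s], then attention pooling into v
W_ff = rng.standard_normal((2 * d, d)) / np.sqrt(2 * d)
Y = np.tanh(np.concatenate([T_joint, T_s], axis=-1) @ W_ff)     # (t, d)
w = rng.standard_normal(d)
alpha = row_softmax(Y @ w)         # (t,)
v = alpha @ Y                      # aggregated vector, (d,)
assert v.shape == (d,)
```

In the full model, `v` is passed through a feed-forward output layer with three units to produce the class distribution.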

Datasets
In our experiments, we focus on the official annotated dataset of DBDC4, including both the Japanese and English versions. Following the practice of the DBDC4 participating systems (Hendriksen et al., 2019; Sugiyama, 2019; Shin et al., 2019), we also utilize the official annotated data released in the previous Dialogue Breakdown Detection Challenges: DBDC1, DBDC2, and DBDC3.
We present the dataset statistics in Table 2. We adopt the split into training set and development set as described in (Sugiyama, 2019). Each dialogue session consists of multiple turns of user-system utterances.

Training
We prepare the training data for our proposed CXM model in two ways: we denote the model trained on single-language data as CXM-S, and the model trained on data in both languages as CXM-D. The parameters in the XLM-R embedding layer are pretrained on corpora of one hundred languages (Conneau et al., 2020) and updated during training. We use mean squared error as the loss function during training.
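As a small illustration of the training objective, the sketch below computes the MSE loss between predicted and gold label distributions; whether the per-instance error is averaged or summed over the three classes is our assumption, as the paper does not specify:

```python
import numpy as np

def mse_loss(pred, gold):
    """MSE between predicted and gold (B, PB, NB) distributions,
    averaged over classes per instance and summed over instances."""
    pred, gold = np.asarray(pred, float), np.asarray(gold, float)
    return float(((pred - gold) ** 2).mean(axis=-1).sum())

# Two hypothetical training instances
pred = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
gold = [[1.0, 0.0, 0.0], [0.0, 0.5, 0.5]]
loss = mse_loss(pred, gold)
```

Training against the annotator distribution, rather than only the majority class, lets one model serve both the classification-based and the distribution-based metrics.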

Evaluation Metrics
The official evaluation metrics of DBDC4 include classification-based and distribution-based metrics.
Classification: The classification-based metrics consist of accuracy and F1 scores.
• Accuracy: the number of correctly classified instances divided by the total number of instances.
• F1 (B): F1 score corresponding to the classification label B.
• F1 (PB+B): F1 score corresponding to the classification labels PB and B grouped together.
For classification-related metrics, a higher score indicates better performance.
Distribution: The distribution-based metrics utilize Jensen-Shannon divergence (JSD) and mean squared error (MSE).
• JSD (NB, PB, B): Jensen-Shannon divergence calculated over the predicted distribution of all three labels.
• JSD (NB, PB+B): JSD calculated over the predicted distribution of two labels (PB and B are combined and considered as one single label).
• JSD (NB+PB, B): JSD calculated over the predicted distribution of two labels (NB and PB are combined and considered as one single label).
• MSE (NB, PB, B), MSE (NB, PB+B), MSE (NB+PB, B): mean squared error between the predicted distribution and the gold-standard distribution, over the same three label groupings as the JSD metrics.
Distribution-based metrics measure the difference between the predicted distribution and the gold-standard distribution, so a lower score indicates better performance.
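A minimal sketch of the JSD computation between a predicted and a gold distribution is given below; the base-2 logarithm is our assumption, and the official evaluation script may differ in detail:

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence (base 2), skipping zero-probability terms in p."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def jsd(p, q):
    """Jensen-Shannon divergence between two label distributions."""
    m = (np.asarray(p, float) + np.asarray(q, float)) / 2
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

pred = [0.6, 0.3, 0.1]   # predicted (NB, PB, B) distribution
gold = [0.5, 0.3, 0.2]   # gold-standard annotator distribution
score = jsd(pred, gold)  # lower is better; 0 for identical distributions
```

JSD is symmetric and bounded, which makes it a convenient distance between the predicted and annotator distributions.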

Participating DBDC4 Models
We compare our proposed CXM model with all previous participating models in DBDC4. We give brief descriptions of these models as follows.

Results
In this section, we present the comparison results and an ablation study to better understand our model.

Model Comparison
We present the comparison results on the DBDC4 Japanese track and English track in Table 3 and Table 4, respectively. We denote the NTTCS19 models as NTT, the RSL19BD models as RSL, and the organizer's baseline as BL. The number following a model name is the index of a run. The scores of previous participating models are retrieved from (Higashinaka et al., 2019). The reported scores of our models are calculated by the official evaluation script provided by the organizers. The results show that our best-performing model CXM-D outperforms all previous models significantly on every evaluation metric in both the Japanese and English tracks.
In the Japanese track, when evaluated on classification-based metrics, CXM-D outperforms the best previous models by 13.13%, 13.25%, and 5.57% on accuracy, F1 (B), and F1 (PB+B), respectively. CXM-D improves over the previous best models by 34.7%, 32.0%, and 45.6% when evaluated on the three JSD-based metrics. When evaluated on the MSE-based metrics, the improvements are 36.9%, 39.7%, and 46.0%. For each of the JSD and MSE metrics, the percentage of improvement is computed as 100 − (ours / previous best) × 100.
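The improvement computation for the lower-is-better metrics can be written directly from the formula above; the scores used here are hypothetical, for illustration only:

```python
def improvement_pct(ours, previous_best):
    """Relative improvement for lower-is-better metrics (JSD, MSE),
    following the text's formula: 100 - (ours / previous_best) * 100."""
    return 100.0 - (ours / previous_best) * 100.0

# Hypothetical JSD scores: 0.06 (ours) vs. 0.10 (previous best)
imp = improvement_pct(0.06, 0.10)  # -> 40.0, i.e. a 40% relative reduction
```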
In the English track, CXM-D outperforms the previous best models by 7.60%, 11.35%, and 0.92% on accuracy, F1 (B), and F1 (PB+B), respectively. CXM-D improves over the previous best models by 14.8%, 13.6%, and 22.4% on the three JSD-based metrics. On the three MSE-based metrics, the improvements are 10.7%, 14.4%, and 14.6%. More importantly, CXM-D gives the best scores on all metrics in both the DBDC4 Japanese track and English track. It is the first model to achieve such dominance on this task. Additionally, we show that when trained on the same monolingual training data as previous models, CXM-S still achieves excellent performance: it outperforms all previous models on all metrics in the Japanese track, and on all metrics except F1 (PB+B) in the English track.

Ablation Study
We conduct an ablation study on our proposed model CXM-D on the development set, in order to better understand the effectiveness of our proposed model. We present the results of our ablation study in Table 5. First, we experiment with removing the co-attention component. In this case, we use the output from the first position ([CLS]) of the XLM-R last layer output as the contextualized representation. We then use a feed-forward layer followed by a softmax function to output the probability distribution over three labels. Next, we experiment with training the model with single-language data, similar to prior work (Sugiyama, 2019).
By removing the co-attention component, the accuracy drops by 1.77% and 1.65% on the Japanese track and English track, respectively. This indicates that co-attention better captures the interaction between the dialogue history and the target system utterance. If we use only single-language data during training, the accuracy drops by 2.01% and 2.85% on the Japanese track and English track, respectively. This indicates that transfer learning from other languages to the target language improves performance on this task. We observe a further decrease in accuracy on both tracks if we remove both the co-attention network and the dual-language training data.
We analyze 100 randomly selected examples from the development set where CXM-D predicts correctly but the model without co-attention fails. Two examples are given in Table 6. It is evident that co-attention helps to identify the undesired utterance with respect to the topic of the dialogue history. It also makes better predictions in cases where the target system response is completely irrelevant to the dialogue history.

Error Analysis
We also perform an error analysis on 100 randomly selected examples from the development set where CXM-D fails to make correct predictions. We identify the following three primary cases where our model tends to make incorrect predictions.

Continuous questions
The target utterance is a question and the dialogue history involves continuous question turns, where the user and system take turns asking questions.

Sarcasm
We observe that it is challenging for the model to distinguish an undesired response from a sarcastic but appropriate response. In these cases, our model is most likely to classify the response as Possible Breakdown (PB).

Overly long responses
When the target utterance consists of multiple sentences, it is also challenging to capture how they interact with the dialogue history.

Related Work
Several prior works on the dialogue breakdown detection task are based on long short-term memory (LSTM). Hendriksen et al. (2019) incorporated LSTM with pretrained word embeddings such as word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014).  followed a similar idea but added a convolutional neural network (CNN) for textual feature extraction. Shin et al. (2019) utilized bidirectional LSTM and self-attention layers to incorporate the dialogue in representation learning. Prior works also include feature-engineered models.  proposed to utilize a curated set of features, including keyword counts and TF-IDF scores, to build a regression model based on random forests. Pretrained language models achieve state-of-the-art performance on many NLP tasks. Sugiyama (2019) developed a model based on BERT (Devlin et al., 2019). Instead of directly using the [CLS] token, the model took the concatenation of the entire dialogue including the target system utterance, dialogue acts, and textual features as input to a BERT encoder, and utilized the entire last layer for representation learning.
Among the participating models in DBDC4, none achieves the best scores on both classification-based and distribution-based metrics in either the Japanese or the English track. This indicates that it is challenging for a single model to perform well on all metrics. A desirable model should be able to reconcile the mismatch between optimizing for classification-based metrics and optimizing for distribution-based metrics.
Pretrained language models have been employed in several dialogue tasks and show good performance. BERT (Devlin et al., 2019) and XLNet (Yang et al., 2019) have been utilized to encode system and user utterances in dialogue state tracking (Gao et al., 2019; Ma et al., 2020). Liu et al. (2020) also incorporated BERT-based contextualized word embeddings for dialogue generation in chat-oriented dialogue systems.
Recently, cross-lingual transfer learning has gained much popularity in the NLP research community (Chen et al., 2019). Cross-lingual pretrained language models are able to map words from different languages into one single shared embedding space. Empirical results show that cross-lingual language models trained on one hundred languages outperform language-specific models on several standard NLP benchmarks (Conneau and Lample, 2019; Conneau et al., 2020). However, cross-lingual transfer learning has not been explored in dialogue breakdown detection, and our work is the first to incorporate a pretrained cross-lingual language model for this task. On the one hand, we use cross-lingual word embeddings to transfer the knowledge learned from multiple languages. On the other hand, as different languages can be mapped to a single shared space in XLM-R (Conneau et al., 2020), we further enrich the training data by adding available data in other languages.
Co-attention networks have been used before to tackle single-turn QA (Lu et al., 2016; Xiong et al., 2017; Yu et al., 2019). They show a good capability of capturing the interaction between the context passage and the question in reading comprehension-based QA. In visual QA, they capture the interaction between the compressed image representation and the question. In contrast, we treat the dialogue history and the target system utterance as the two interacting components. The co-attention encoder attends to the two components simultaneously and finally combines both attention contexts.

Conclusion
In this paper, we have proposed a novel model based on a cross-lingual language model and a co-attention network for dialogue breakdown detection. Our model achieves new state-of-the-art scores in Dialogue Breakdown Detection Challenge 4, and is the first model to achieve the best scores on all evaluation metrics, significantly outperforming all previous models. We have also observed that our model outperforms previous monolingual models on this task. We exploit transfer learning with a cross-lingual language model to utilize training data from other languages and further improve performance on this task. The co-attention network built on top of the cross-lingual language model better captures the relation between the current utterance and the dialogue history. This helps to reduce the probability of a system generating an undesired response, so that communication and the user experience are further improved.