Improving Open-Domain Dialogue Systems via Multi-Turn Incomplete Utterance Restoration

In multi-turn dialogue, utterances do not always take the full form of sentences. Such incomplete utterances can greatly reduce the performance of open-domain dialogue systems, and restoring them from context could help the systems generate more relevant responses. To facilitate the study of incomplete utterance restoration for open-domain dialogue systems, a large-scale multi-turn dataset, Restoration-200K, is collected and manually labeled with the explicit relation between an utterance and its context. We also propose a "pick-and-combine" model to restore an incomplete utterance from its context. Experimental results demonstrate that the annotated dataset and the proposed approach significantly boost the response quality of both single-turn and multi-turn dialogue systems.


Introduction
Dialogue systems have attracted increasing attention due to their promising potential in applications such as virtual assistants and customer support systems (Hauswald et al., 2015; Poulami Debnath, 2018). However, studies (Carbonell, 1983) show that users of dialogue systems tend to use succinct language that often omits entities or concepts mentioned in previous utterances, producing so-called non-sentential utterances (Fernández et al., 2005, 2007). To make appropriate responses, dialogue systems must be equipped with the ability to understand these incomplete utterances.
Take Example 1 in Table 1 for instance, where the contents in parentheses are the information omitted from the utterance. Humans are capable of comprehending such incomplete utterances based on the previous utterances. For example, A3 means what kind of dessert matches B's taste, not what kind of shop B likes. Failing to understand this utterance would be disastrous for a dialogue system trying to generate a relevant and coherent response. According to our survey (details in Table 2), in about 60% of conversations, fully comprehending the current utterance depends on the previous context. We refer to the conversation history (A1 to B2 in the above example) as the previous utterances, to the utterance to be restored (A3) as the original utterance, and to the complete form of A3 as the restored utterance.
Studies show that restoring incomplete questions helps question-answering systems better understand users' intentions (Raghu et al., 2015; Kumar and Joshi, 2017). This inspires us to improve the performance of open-domain dialogue systems via incomplete utterance restoration. However, most existing multi-turn dialogue datasets only provide sets of utterances, without any information about the relations between them. In other words, they lack the supervision necessary to restore incomplete utterances.
To make dialogue systems better understand incomplete utterances, we collect a multi-turn conversation dataset from internet communities, in which each conversation contains at least six utterances. We then hire an annotation team to (1) label whether an utterance is related to its context, and (2) restore each incomplete utterance to a complete and context-free form based on its context. The result is a high-quality, large-scale dataset of 200K annotated conversations. Such a dataset offers a new way of modelling utterance relations and improving the context-understanding ability of dialogue systems, and we hope it will benefit future research on context understanding for multi-turn dialogue systems.
With the annotated dataset, we first attempt to restore incomplete utterances using two vanilla models: a Sequence-to-Sequence model (Seq2Seq) and a pointer generative network. We then propose a cascaded pick-and-combine model that first "picks" omitted words from the context and then "combines" them with the incomplete utterance. To better evaluate restoration performance, we also design a dedicated evaluation metric. In our experiments, both automatic metrics and human evaluation show that the proposed approach achieves promising results and significantly improves the response relevance of both single-turn and multi-turn dialogue systems.
The proposed approach equips single-turn dialogue systems with the capability to comprehend the dialogue context. It also enables multi-turn dialogue systems to model the relation between the query (original utterance) and the context (previous utterances) explicitly in a supervised manner, in contrast to modelling the relation implicitly without extra guidance (Serban et al., 2016; Wu et al., 2017).
Our contributions are summarized as follows: 1) A large-scale Chinese dataset with 200K multi-turn conversations is collected and manually labeled with the explicit relations between an utterance and its context.
2) A cascaded pick-and-combine model is proposed, which achieves promising results on both automatic metrics and human evaluation.
3) Experimental results demonstrate that the incomplete utterance restoration model is complementary to existing dialogue systems and conducive to improving response quality.
In the remainder of the paper, we first describe the collected dataset in Section 2. In Section 3, several models and a new metric are presented for incomplete utterance restoration. Section 4 reports experimental results on both automatic metrics and human evaluation for the proposed method. In Section 5, the proposed model is applied to dialogue systems to evaluate its effectiveness. Section 6 introduces related work, and Section 7 concludes and discusses future directions.

Data Collection
We collect open-domain dialogues from Douban Group, a well-known Chinese online community that is a common data source for dialogue systems (Wu et al., 2017). To determine whether an utterance is complete, we take the four utterances before it as the conversation context. Crawled conversations with fewer than six utterances (the extra one assists annotation) are filtered out, and for conversations with more than six utterances, only the first six are kept. We construct a conversation by following the reply tags in the comments under each post.
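The filtering step above can be sketched as follows. This is a minimal illustration; the function and variable names are ours, not from any released code:

```python
MIN_UTTERANCES = 6  # five utterances plus one extra to assist annotation

def filter_conversations(conversations):
    """Keep conversations with at least six utterances, truncated to the first six."""
    kept = []
    for conv in conversations:
        if len(conv) < MIN_UTTERANCES:
            continue  # fewer than six utterances: discard
        kept.append(conv[:MIN_UTTERANCES])  # keep only the first six
    return kept
```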

Data Annotation
To ensure annotation quality, five professional data annotators are hired to annotate the dataset instead of using crowd-sourcing platforms like MTurk; the annotation took six months to finish. As shown in Figure 1, annotators first discern whether the original utterance (the fifth utterance in a conversation) omits concepts or entities mentioned in the previous utterances. Since the boundary between yes and no is vague under certain circumstances, annotators are allowed to skip and discard an instance when it is hard to make a decision. The sixth utterance is also provided to help annotators comprehend the conversation. When restoration is needed, they rewrite the original utterance into a context-free restored utterance that contains all information necessary for utterance understanding, as shown in the last row of Table 1.

Table 2: Statistics of Restoration-200K. The incomplete ratio refers to the ratio of conversations that contain an incomplete utterance. Vocabulary size is counted after Chinese word segmentation, and average length is counted at the character level.
To reduce the diversity of rewritten sentences, annotators are instructed to use words from the previous utterances wherever possible. A small list of commonly used words is also provided, which annotators can use to keep the rewritten sentences fluent. In other words, all words in the restored utterance come from the previous and original utterances or the extra word list. Our survey shows that only about 4.8% of the conversations needing restoration fail to satisfy this condition; all of them are discarded as well.
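The closed-vocabulary condition above can be checked mechanically. Below is a character-level sketch (an assumption we make because the average length in Table 2 is counted at the character level; the helper name is ours):

```python
def satisfies_constraint(restored, previous_utts, original, extra_words):
    """True iff every character of the restored utterance appears in the
    previous utterances, the original utterance, or the extra word list."""
    allowed = set("".join(previous_utts)) | set(original) | set("".join(extra_words))
    return all(ch in allowed for ch in restored)
```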

Data Statistics
The statistics of the train, validation, and test sets are shown in Table 2. Kumar and Joshi (2016) collect 6K incomplete questions for QA systems, where each utterance consists of only 3.52 words on average (estimated from their released test set). In comparison, Restoration-200K has a longer average utterance length, which indicates a broader range of topics and more informative content.

Methodology
In this section, we tackle the incomplete utterance restoration problem using the vanilla Seq2Seq model with attention and the pointer generative network. Since these two models are easily dominated by a simple copy mechanism that directly regenerates the original utterance as the restored utterance, we further propose a cascaded pick-and-combine (PAC) model to restore the original utterance.

Vanilla Models
Seq2Seq Model with Attention: the Seq2Seq framework has been widely used in sequence generation tasks. The encoder encodes the input sequence into a vector representation c, and the decoder generates the target sequence based on c and all previously generated words. As shown in Figure 1, to restore the incomplete utterance, we concatenate the previous utterances C and the original utterance X into a single input sequence, inserting a special token [sep] between C and X.

Pointer Generative Network: as analyzed in Section 2.2, the incomplete utterance restoration problem has a unique characteristic: most generated words come from the previous utterances or the original utterance. To exploit this constraint, we use the pointer generative network, which can copy words from the source sequence by directly taking the attention scores a_k as prediction probabilities, as shown in the green part of Figure 1. The final distribution P_vocab mixes the generation distribution P_gen and the copy distribution P_copy with a generation probability p_gen:

P_vocab = p_gen · P_gen + (1 − p_gen) · P_copy

Figure 2: The pick stage predicts which words in previous utterances are omitted by the original utterance. P represents the positive label, namely the omitted words. In the combine stage, the picking result is appended to the original conversation as extra guidance for generating the complete utterance sequence.
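The mixture distribution above can be illustrated numerically. In this pure-Python sketch (names are ours), the copy distribution is built by scattering the attention scores a_k of the source positions onto their vocabulary ids:

```python
def final_distribution(p_gen, gen_dist, attn_scores, src_ids):
    """Compute P_vocab = p_gen * P_gen + (1 - p_gen) * P_copy.

    gen_dist: generation distribution over the vocabulary.
    attn_scores/src_ids: attention score and vocabulary id of each source token.
    """
    copy_dist = [0.0] * len(gen_dist)
    for a_k, idx in zip(attn_scores, src_ids):
        copy_dist[idx] += a_k  # accumulate attention mass on each source token id
    return [p_gen * g + (1.0 - p_gen) * c for g, c in zip(gen_dist, copy_dist)]
```

Since P_gen and the attention scores each sum to one, the mixture is again a valid distribution over the vocabulary.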

Pick-and-Combine Model
For most utterances to be restored in the corpus, the reference differs from the original utterance in only a few words. Our study shows that on average only 17.7% of the words in previous utterances overlap with the restored utterance, while 100% of the words in the original utterance are included in it. This imbalance makes the models mentioned above tend to simply regenerate the original utterance to maximize the conditional probability of the generated sentence during beam search. In other words, the Seq2Seq model and the pointer generative network tend to regenerate the original utterance and cannot effectively restore it. The third conversation in Table 5 is a typical example.
To mitigate this problem, we decompose the incomplete utterance restoration task into a cascaded process: a Pick stage that identifies omitted words in the previous utterances, followed by a Combine stage that restores the original utterance based on the identified words.

Pick: inspired by recent advances in transfer learning for language representation, we fine-tune the pre-trained deep bidirectional transformer model BERT (Devlin et al., 2019) to select omitted words from the previous utterances. Instead of a discrete classification over candidate words, we formulate this selection as a sequence tagging problem: each word in the previous utterances is labeled positive P (an omitted word) or negative N (not an omitted word).

Combine: the combine stage is straightforward. The selected omitted words are appended to the input sequence as extra guidance, and the resulting sequence is fed to the pointer generative network described in Section 3.1. We also tried fine-tuning BERT to generate the restored utterance directly, but the results were far from satisfactory; therefore, we simply append the selected words to the original input sequence as extra guidance for generation.
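Schematically, the two stages compose as follows. This is a sketch, not the authors' implementation: `tag_fn` stands in for the fine-tuned BERT tagger, and all names are ours:

```python
SEP = "[sep]"

def pick(context_tokens, tag_fn):
    """Pick stage: return the context words the tagger labels P (omitted)."""
    labels = tag_fn(context_tokens)
    return [tok for tok, lab in zip(context_tokens, labels) if lab == "P"]

def combine(context_tokens, original_tokens, picked):
    """Combine stage: append the picked words to the input sequence as
    extra guidance for the pointer generative network."""
    return context_tokens + [SEP] + original_tokens + [SEP] + picked
```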

Task Evaluation
BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004) are commonly used metrics in machine translation and summarization. However, for incomplete utterance restoration, these metrics do not differentiate words in the original utterance from those restored from previous utterances: every n-gram is treated as equally important. The statistics in Table 2 show that most words in the reference overlap with words in the original utterance, so a simple copy mechanism that regenerates the original utterance as the restored utterance would achieve a high score on these metrics.
To alleviate this issue, we propose the restoration score, which focuses on n-grams that contain at least one restored word and excludes all other n-grams. Specifically, the n-gram restoration precision, recall, and F-score are calculated as:

p_n = |{restored n-grams} ∩ {n-grams in ref}| / |{restored n-grams}|

r_n = |{restored n-grams} ∩ {n-grams in ref}| / |{n-grams in ref}|

f_n = 2 · p_n · r_n / (p_n + r_n)

Table 3: p, r, f represent the restoration precision, recall and F-score proposed in Section 3.3. B_n represents the n-gram BLEU score and R_n the n-gram ROUGE score.
where "restored n-grams" refers to the n-grams in the restored utterance that contain at least one restored word, "n-grams in ref" refers to the n-grams in the reference that contain at least one restored word, and |X| denotes the number of elements in set X.
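A sketch implementation of the restoration score follows. We assume here that a word counts as "restored" if it does not occur in the original utterance, and use BLEU-style clipped counts for the intersection; both are our reading of the definition, not the authors' released code:

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def restoration_score(restored, reference, original, n=1):
    """Return (p_n, r_n, f_n) over n-grams containing at least one restored word."""
    orig_words = set(original)
    is_restored = lambda gram: any(w not in orig_words for w in gram)
    r_grams = Counter(g for g in ngrams(restored, n) if is_restored(g))
    ref_grams = Counter(g for g in ngrams(reference, n) if is_restored(g))
    overlap = sum((r_grams & ref_grams).values())  # clipped multiset intersection
    p = overlap / max(sum(r_grams.values()), 1)
    r = overlap / max(sum(ref_grams.values()), 1)
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f
```

Note that a model which merely copies the original utterance contributes no restored n-grams and thus scores zero, which is exactly the behavior the metric is designed to penalize.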

Compared methods
We compare the performance of the following methods on the collected dataset:
• Syntactic: the model proposed by Kumar and Joshi (2016). They replace each out-of-vocabulary word in the corpus with a numbered unknown token, where the number is based on the word's relative position in the conversation. The model itself is a Seq2Seq model with attention.
• Seq2Seq: the Seq2Seq model with attention introduced in Section 3.1.
• Pointer: the pointer generative network introduced in Section 3.1.
• PAC: the pick-and-combine model introduced in Section 3.2.
We adopt a one-layer unidirectional LSTM as the encoder and decoder of each model. During training, the vocabulary size is set to 10K and the mini-batch size to 64. Parameters are updated with the Adam algorithm (Kingma and Ba, 2014), with betas set to (0.9, 0.999) and eps set to 1e-8. The learning rate is 0.25 and the gradient clipping threshold is 0.1. The dropout rate is set to 0.5. The word embedding size, encoder hidden size, and decoder hidden size are all set to 512. During inference, the checkpoint with the smallest validation loss is chosen, and the beam size is set to 5 for all methods. For the pick stage of the proposed approach, we use a BERT model trained on 200G of high-quality news data from Tencent AI Lab.
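For reference, the hyperparameters listed above can be gathered in one place (a plain summary, with key names of our own choosing; this is not the authors' configuration file):

```python
TRAIN_CONFIG = {
    "vocab_size": 10_000,
    "batch_size": 64,
    "optimizer": "Adam",
    "betas": (0.9, 0.999),
    "eps": 1e-8,
    "learning_rate": 0.25,
    "grad_clip": 0.1,
    "dropout": 0.5,
    "embed_size": 512,
    "encoder_hidden": 512,
    "decoder_hidden": 512,
    "beam_size": 5,  # used at inference time
}
```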

Table 4: Human evaluation on the restoration quality and language fluency. The quality score adopts a 4-point scale, and the fluency score adopts a 3-point scale. A higher score is better for both.

Evaluation metrics
The n-gram restoration score proposed in Section 3.3 is adopted as the automatic evaluation metric. Human evaluation is also conducted: annotators are instructed to rate the quality and fluency of the restored utterances. The quality score evaluates whether the restored utterance can be better understood without context than the original utterance: a restored utterance is scored 4 if it can be perfectly understood without the previous utterances, 3 if it is not perfect but still better than the original utterance, 2 if it has not been restored or the restoration does not aid understanding, and 1 if it is worse than the original utterance. The fluency score rates the fluency of a restored utterance on a 3-point scale, where higher is better. We randomly select three hundred examples from the test set for human evaluation; five professional annotators with at least one year of experience in text annotation participate in the experiment.

Automatic Metrics
Scores of each method on the automatic metrics are shown in Table 3. The PAC model achieves the best restoration precision, recall, and F-score from 1-grams to 3-grams. Restoration recall reflects how often the model chooses to restore the original utterance. As discussed in Section 3.2, the Seq2Seq model and the pointer generative network tend to copy the original utterance directly instead of taking the risk of choosing words from previous utterances; their restoration recall scores support this claim. The PAC model also surpasses the other models in restoration precision, demonstrating that the words it selects are more accurate.
We also report the BLEU and ROUGE scores of each model. The PAC model again achieves the highest scores. However, the scores of the other models are comparatively high because these two metrics treat every n-gram in the reference as equally important, which illustrates the necessity of the proposed evaluation metric.

Human Evaluation
Results of the human evaluation are shown in Table 4. The proposed method has the highest quality score among all methods. Models with higher human-evaluation scores also tend to have higher restoration scores; this tendency indicates that the proposed metric is sensible and practical for this task. As for fluency, all models except the Seq2Seq model achieve similar scores. The fluency score of the PAC model is slightly lower than that of the syntactic model because fluency follows the principle that the more you change, the more mistakes you can make: original utterances are already fluent, so models with lower recall may get higher fluency. Table 5 shows examples of incomplete utterance restoration across models. A closer look at the restored utterances shows that the PAC model is capable of understanding the conversation context and correctly identifying omitted concepts among multiple possible choices. In Example 1, only the PAC model successfully restores the word "cooking" from the previous utterances. In Example 2, most models only extract the phrase "stay up" from the previous utterances and fail to generate a fluent utterance, while the PAC model picks the correct words and generates a more fluent one. In Example 3, only the PAC model correctly restores the original utterance, supporting our analysis at the beginning of Section 3.2 that vanilla models tend to regenerate the original utterance.

Case study
An interesting phenomenon emerges from this case study: only the PAC model is able to extract omitted words from very early context. The other models merely choose words from the utterance immediately preceding the original one, namely B2 in Table 5. This context comprehension ability benefits from the pick process, which identifies omitted words across all previous utterances.

Improving Existing Dialogue Systems
The main motivation for incomplete utterance restoration is to help dialogue systems better understand the conversation context and generate better responses. Thus, we conduct another experiment to evaluate whether the response quality of single-turn and multi-turn dialogue systems improves after incomplete utterance restoration.

Settings
We apply the proposed PAC model to a single-turn and a multi-turn response generation model to evaluate its effectiveness on dialogue systems. Details are as follows:
• MMI: a state-of-the-art single-turn response generation model (Li et al., 2016). It adopts the maximum mutual information (MMI) objective to avoid generating generic responses. Note that the MMI model does not have access to the dialogue context; it only takes the original utterance as input.
• SMN: the sequential matching network (SMN), a state-of-the-art multi-turn reranking model proposed by Wu et al. (2017). The query, expanded with keywords, is used to retrieve the top 100 response candidates. When reranking, SMN first matches a response with each utterance and distills the matching information into a vector. The vectors are then accumulated through a recurrent neural network (RNN), whose hidden states are used to calculate the final matching score. In our experiments, SMN reranks the candidates by taking the original utterance and the conversation context as inputs, and the candidate with the highest matching score is chosen as the response.
• MMI after Restoration: The MMI model takes the restored utterance as the input.
• SMN after Restoration: The SMN model takes the restored utterance and conversation context as inputs.
Three hundred conversations are randomly selected from the test set and restored by all models. Five annotators are asked to evaluate whether responses to the restored utterances are more relevant and appropriate than those to the original utterances. There are three choices for the annotators: better, similar, or worse. If the response becomes more appropriate or more relevant to the context, it is judged as "better"; if the response quality does not change much after restoration, it is regarded as "similar"; otherwise, it is judged as "worse". Utterances that are not restored are marked "Not Restored" (NR).

Table 7: Human evaluation on response quality from the multi-turn dialogue system after restoration.
Note that we also train a generative multi-turn response generation model (Serban et al., 2016) on the collected dataset, but the generated responses are not clear and coherent enough to understand; as a result, no human study is conducted for this approach.

Results
Evaluation results for the single-turn and multi-turn dialogue systems are shown in Table 6 and Table 7, respectively. Both systems produce better responses after the restoration process, and the PAC model is the most effective. The PAC model achieves the lowest NR, meaning it restores more utterances than the other models. For single-turn dialogue systems, more than 50% of queries get better responses when restored, and only 10% get worse responses. For multi-turn dialogue systems, although SMN already has some ability to understand conversation context, restoration still improves the response quality of nearly half of the queries, and only about 20% of responses get worse. These results show that the proposed dataset and approach significantly improve response quality.

Some examples are shown in Table 8. Examples 1 and 2 are cases where the response quality improves after incomplete utterance restoration. One interesting response is that of SMN+PAC in Example 2: the Park of Xingqing Palace is a famous historic relic in Xi'an, and the dialogue system generates a much more concrete and relevant response after the city name Xi'an is restored. There are also cases where responses become worse even though the model correctly restores omitted words, as in Example 3. One possible explanation is that in this case the restored part is less important than the original utterance. This motivates us to explore strategies to alleviate the issue in future work, such as training the restoration model and the response generation model end-to-end, or deploying a ranking model to choose the better response.
Related Work

Non-sentential Utterance Resolution
In addition to a further corpus and taxonomy study, Fernández et al. (2007) design a series of linguistic features to determine the NSU class. Dragone (2015) and Dragone and Lison (2016) extend these features for NSU classification. Raghu et al. (2015) propose generating NSU resolutions from templates. Poulami Debnath (2018) design a set of rules to classify and restore NSUs for customer support systems. All of the methods above rely heavily on syntactic structure or empirically observed patterns, which may not scale well to unseen domains. Kumar and Joshi (2016) instead approach NSU resolution as a Seq2Seq learning problem. One type of NSU that draws particular attention is ellipsis (Merchant, 2016; Kenyon-Dean et al., 2016; McShane and Babkin, 2016).

Multi-turn Dialogue Systems
Many effective approaches have been proposed for developing intelligent dialogue systems (He et al., 2017; Shang et al., 2018; Tian et al., 2019; Cai et al., 2019). For multi-turn dialogue systems, Serban et al. (2016) and Xing et al. (2018) adopt hierarchical neural networks to model the context. Tian et al. (2017) conduct an extensive comparison of existing methods and find that context information helps neural networks generate longer, more meaningful, and more diverse replies. Wu et al. (2017) and Zhang et al. (2018) utilize context information via a matching network to rerank retrieved responses. Other studies investigate matching a response with its multi-turn context using dependency information based entirely on attention. Most of these works model the dialogue context as an extra input in the response generation or response matching process. Our method, in contrast, tries to understand the context by restoring the incomplete utterance to a context-free, complete form, and can be complementary to the studies above.

Conclusion
In this paper, we propose to facilitate the comprehension of conversation context by restoring incomplete utterances. A large-scale dataset Restoration-200K, which consists of multi-turn conversations for open-domain dialogue systems, is collected and manually annotated. Based on this dataset, the proposed pick-and-combine method achieves promising results on the incomplete utterance restoration task. Experimental results also demonstrate that the annotated dataset and the proposed approach significantly improve response quality for existing dialogue systems.