GECOR: An End-to-End Generative Ellipsis and Co-reference Resolution Model for Task-Oriented Dialogue

Ellipsis and co-reference are ubiquitous, especially in multi-turn dialogues. In this paper, we treat the resolution of ellipsis and co-reference in dialogue as a problem of generating omitted or referred expressions from the dialogue context. We therefore propose a unified end-to-end Generative Ellipsis and CO-reference Resolution model (GECOR) in the context of dialogue. The model can generate a new pragmatically complete user utterance by alternating between the generation and copy modes for each user utterance. A multi-task learning framework is further proposed to integrate the GECOR into an end-to-end task-oriented dialogue system. In order to train both the GECOR and the multi-task learning framework, we manually construct a new dataset on the basis of the public dataset CamRest676 with both ellipsis and co-reference annotation. On this dataset, intrinsic evaluations on the resolution of ellipsis and co-reference show that the GECOR model significantly outperforms the sequence-to-sequence (seq2seq) baseline model in terms of EM, BLEU and F1, while extrinsic evaluations on the downstream dialogue task demonstrate that our multi-task learning framework with GECOR achieves a higher success rate of task completion than TSCP, a state-of-the-art end-to-end task-oriented dialogue model.


Introduction
Due to the rhetorical principle of saving words and avoiding repetitions, ellipsis and co-reference occur frequently in multi-turn dialogues, leaving utterances pragmatically incomplete when they are separated from their context. Humans can easily understand utterances with anaphorically referenced or absent information (e.g., Q2 and Q3 in Table 1) based on the dialogue context, while dialogue systems often fail to understand such utterances correctly, which may result in false or incoherent responses. If user utterances can be automatically supplemented with information that is left out or substituted by anaphora according to the dialogue context, as humans do (e.g., Q2: I want cheap Italian restaurants. Q3: Yes, I would like the phone number please.), dialogue models may understand user requests correctly and avoid generating wrong responses caused by ellipsis and co-reference phenomena. Especially in task-oriented dialogue systems, explicitly providing such information to the models can effectively improve the success rate of task completion.
* Work performed during an internship at Lenovo Research AI Lab.
† Corresponding author
In order to achieve this goal, we propose an end-to-end generative ellipsis and co-reference resolution model (GECOR) for task-oriented dialogue in this paper. The essential idea behind GECOR is that we treat the resolution of ellipsis and co-reference in user utterances as a generation task: transforming a user utterance with ellipsis or anaphora into a new utterance where the left-out or referred expressions are automatically generated from the dialogue context. We refer to the new utterance as the complete version of the original utterance. We use an end-to-end sequence-to-sequence model with two encoders for this transformation task, where one encoder reads the user utterance, the other reads the dialogue context, and the decoder generates the complete utterance. Since most omitted expressions or antecedents can be found in the dialogue context, we resort to the attention and copy mechanism to detect such fragments in the previous context and copy them into the generated complete utterance.
We then incorporate GECOR into an end-to-end task-oriented dialogue system in a multi-task learning framework. The entire model contains two encoders (one for the user utterance and the other for the dialogue context) and three decoders: one decoder for predicting dialogue states, the second decoder for generating complete user utterances and the third decoder for generating system responses. The three decoders are jointly trained.
In order to train GECOR with the task-oriented dialogue model, we manually annotate the public task-oriented dialogue dataset CamRest676 with omitted expressions and substitute anaphora in the dataset with corresponding antecedents. The new dataset can be used either to train a standalone ellipsis or co-reference resolution model or to jointly train a task-oriented dialogue model equipped with the ellipsis / co-reference resolution model.
We conduct a series of experiments and analyses, demonstrating that the proposed method can significantly outperform a strong baseline model. Our contributions are threefold:
• We propose an end-to-end generative resolution model that attempts to solve ellipsis and co-reference resolution in a single unified framework, significantly different from previous end-to-end co-reference resolution networks with two phases of detection and candidate ranking.
• To the best of our knowledge, this is the first attempt to combine the task of ellipsis and co-reference resolution with multi-turn task-oriented dialogue. The success rate of task completion is significantly improved with the assistance of ellipsis and co-reference resolution.
• We construct a new dataset based on CamRest676 for ellipsis and co-reference resolution in the context of task-oriented dialogue.

Related Work
Ellipsis recovery: The earliest work on ellipsis as far as we know is the PUNDIT system (Palmer et al., 1986) which discusses the communication between the syntactic, semantic and pragmatic modules that is necessary for making implicit linguistic information explicit. Dalrymple et al. (1991) and Shieber et al. (1996) establish a set of linguistic theories in the ellipsis recovery of English verb phrases. Nielsen (2003) first proposes an end-to-end computable system to perform English verb phrase ellipsis recovery on the original input text. Liu et al. (2016) propose to decompose the resolution of the verb phrase ellipsis into three sub-tasks: target detection, antecedent head resolution, and antecedent boundary detection.
Co-reference resolution: Co-reference resolution is mainly concerned with two sub-tasks: detection of referring expressions (i.e., mentions) and entity candidate ranking. Uryupina and Moschitti (2013) propose a rule-based approach for co-reference detection which employs parse tree features with an SVM model. Peng et al. (2015) improve the performance of mention detection by applying a binary classifier on their feature set. In recent years, applying deep neural networks to co-reference resolution has achieved great success. Clark and Manning (2016) apply reinforcement learning to mention-ranking co-reference resolution. Lee et al. (2017) introduce the first end-to-end co-reference resolution model. Lee et al. (2018) present a high-order co-reference resolution model with coarse-to-fine inference.
Ellipsis and co-reference resolution in QA and Dialogue: The methods mentioned above do not generalize well to dialogues because they normally require a large amount of well-annotated contextual data with syntactic norms and candidate antecedents. In recent years, a few studies have tried to solve ellipsis / co-reference resolution tailored for dialogue or QA tasks. Kumar and Joshi (2016) train a semantic sequence model to learn semantic patterns and a syntactic sequence model to learn linguistic patterns in order to tackle non-sentential (incomplete) questions in a question answering system. Zheng et al. (2018) build a seq2seq neural network model for short texts to identify and recover ellipsis. However, these methods are still limited to short texts or one-shot dialogues. Our work is the first attempt to provide both a solution and a dataset for ellipsis and co-reference resolution in multi-turn dialogues.
End-to-end task-oriented dialogue: Task-oriented dialogue systems have evolved from traditional modularized pipeline architectures (Rudnicky et al., 1999) to recent end-to-end neural frameworks (Eric and Manning, 2017a,b). Our work is an innovative combination of ellipsis and co-reference resolution with end-to-end task-oriented dialogue.

The GECOR Model
In this section, we reformulate the ellipsis and co-reference resolution task in the context of multi-turn dialogue and detail the proposed GECOR model.

Ellipsis and Co-Reference Resolution Reformulation
Our task is to reconstruct a pragmatically complete utterance from a user utterance where the ellipsis and/or co-reference phenomena are present according to the dialogue context. Table 1 provides examples of reconstructed utterances in which the omitted information is recovered or the anaphor is substituted with referred expressions. We attempt to solve the resolution of ellipsis and co-reference in a unified framework because in essence both ellipsis and co-reference can be understood from contextual clues. We consider these two problems in multi-turn dialogue and reformulate the resolution of them as a generation problem: generating the omitted or referred expressions. In this way, the modeling of ellipsis and co-reference is in line with response generation in dialogue modeling.
Unlike previous methods that combine detection and ranking models, our generation-based formulation is not constrained by the syntactic forms of ellipsis or co-reference in sentences. They can be either words (e.g., noun, verb) or phrases or even clauses. Furthermore, the formulation does not need to provide a set of candidate antecedents to be resolved. Previous studies usually need to traverse the text when there are multiple ellipsis or anaphora to be resolved, which leads to a high computational complexity.
In this reformulation, we assume that the dialogue context is composed of all utterances from the beginning of the dialogue to the current user utterance. Both the context and the user utterance in question are input to the GECOR model to generate the complete version of the user utterance.
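As a minimal sketch of this input/output formulation, each instance can be viewed as a triple of dialogue context, current user utterance, and target complete utterance. The `ResolutionExample` class, the `build_example` helper and the first two dialogue turns below are hypothetical illustrations rather than material from the dataset or the released code, and the sketch assumes that the context holds the turns preceding the current user utterance.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ResolutionExample:
    context: List[str]  # turns preceding the current user utterance (assumption)
    utterance: str      # current user utterance, possibly incomplete
    complete: str       # target: pragmatically complete version


def build_example(turns: List[str], index: int, complete: str) -> ResolutionExample:
    """Pair the user utterance at `index` with all earlier turns as its context."""
    return ResolutionExample(
        context=turns[:index],
        utterance=turns[index],
        complete=complete,
    )


# Hypothetical dialogue, made up for illustration only.
turns = [
    "I am looking for an Italian restaurant.",
    "There are several Italian restaurants. What price range do you prefer?",
    "I want cheap ones.",
]
example = build_example(turns, 2, "I want cheap Italian restaurants.")
print(example.context)
print(example.utterance, "->", example.complete)
```

Both `context` and `utterance` would be fed to the model, with `complete` serving as the generation target.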

Model Structure
The GECOR model is shown in Figure 1. The model essentially contains an embedding module, a user utterance encoder, a dialogue context encoder and a decoder with either copy (Gu et al., 2016) or gated copy mechanism (modified from See et al. (2017)). Both the generation probability over the entire vocabulary and the copy probability over all words from the dialogue context are taken into account for predicting the complete user utterance.
Embedding Layer In GECOR, we first tokenize the input user utterance and the dialogue context. We then use pre-trained 50-dimensional GloVe word vectors (Pennington et al., 2014) in the embedding layer to obtain word embeddings. Let $U = \{u_1, ..., u_m\}$ and $C = \{c_1, ..., c_n\}$ be the representations of the tokenized utterance and context sequences.
Utterance and Context Encoder We use a single-layer bidirectional GRU to construct both encoders. The forward and backward hidden states over the input embeddings from the embedding layer are concatenated to form the hidden states of the two encoders.
Decoder The decoder is a single-layer unidirectional GRU. In the decoder, the attention distribution $a^t$ is calculated as in Bahdanau et al. (2015):

$$e^t_i = v^\top \tanh(W_h h_i + W_s s_{t-1} + b_{attn}) \quad (1)$$
$$a^t = \mathrm{softmax}(e^t) \quad (2)$$

where $v$, $W_h$, $W_s$ and $b_{attn}$ are learnable parameters and $h_i$ is the hidden state for word $u_i$ from the sequence produced by the utterance encoder. The attention distribution $a^t$ is used to produce a weighted sum of the encoder hidden states, known as the context vector $h^*_t$:

$$h^*_t = \sum_i a^t_i h_i \quad (3)$$

It is fed into the single-layer unidirectional GRU together with the previous decoder state $s_{t-1}$ and the word embedding $y_{t-1}$ of the previously generated word to obtain the decoder state $s_t$. The updated $s_t$ is then concatenated with the context vector $h^*_t$ to produce the generation probability distribution over the vocabulary $V$ as follows:

$$P^g(y_t) = \frac{1}{Z} e^{\psi_g(y_t)}, \quad y_t \in V \quad (4)$$
$$\psi_g(y_t = v_i) = v_i^\top (W^h_g h^*_t + W^s_g s_t + b_g) \quad (5)$$

where $W^h_g$, $W^s_g$ and $b_g$ are learnable parameters and $v_i$ is the one-hot indicator vector for word $v_i \in V$. $\psi_g$ is the score function for the generation mode and $Z$ is the normalization term shared by the generation mode and the copy mode.
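The attention and context-vector computation described above can be sketched in plain Python. This is an illustrative sketch of additive attention for a single decoding step, not the authors' implementation: the function name, the list-based matrix representation and the toy parameter values are all assumptions.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def additive_attention(
    encoder_states: List[Vector],
    decoder_state: Vector,
    W_h: Matrix,
    W_s: Matrix,
    b_attn: Vector,
    v: Vector,
) -> Tuple[Vector, Vector]:
    """Return (attention weights, context vector) for one decoding step."""
    projected_state = matvec(W_s, decoder_state)
    scores = []
    for h in encoder_states:
        projected_h = matvec(W_h, h)
        hidden = [
            math.tanh(a + b + c)
            for a, b, c in zip(projected_h, projected_state, b_attn)
        ]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    # Softmax over encoder positions (shifted by the max for numerical stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted sum of the encoder hidden states.
    dim = len(encoder_states[0])
    context = [
        sum(w * h[d] for w, h in zip(weights, encoder_states)) for d in range(dim)
    ]
    return weights, context


# Toy values (hypothetical): two encoder states of size 2, zero projections.
zeros = [[0.0, 0.0], [0.0, 0.0]]
weights, context = additive_attention(
    [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], zeros, zeros, [0.0, 0.0], [1.0, 1.0]
)
print(weights, context)
```

With all-zero projections every position receives the same score, so the weights are uniform and the context vector is the mean of the encoder states.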
Copy Network The copy network (Gu et al., 2016) is used to calculate the probabilities for words copied from the dialogue context. These words are parts of the omitted or referred expressions to be predicted. We build the copy network on top of the context encoder. The probability for copying each word from the dialogue context is computed as follows:

$$\psi_c(y_t = c_i) = \sigma(W_c h^c_i + b_c)^\top s_t \quad (6)$$
$$P^c(y_t) = \frac{1}{Z} \sum_{i: c_i = y_t} e^{\psi_c(c_i)}, \quad y_t \in C \quad (7)$$

where $W_c$ and $b_c$ are learnable parameters, $h^c_i$ is the output for word $c_i$ from the context encoder, and $\sigma$ is a non-linear activation function. $\psi_c$ is the score function for the copy mode and $Z$ is the normalization term shared by the generation mode (Equation 4) and the copy mode (Equation 7).
Both probabilities from the two modes contribute to the final probability distribution over the extended vocabulary (the vocabulary plus the words from the dialogue context), which is calculated as follows:

$$P(y_t) = P^g(y_t) + P^c(y_t) \quad (8)$$

and is used to predict the final output word.
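To illustrate how a shared normalization term lets the two modes form one distribution over the extended vocabulary, the following sketch combines given generation-mode and copy-mode scores. It is a simplified illustration under stated assumptions, not the authors' code: the function name and the toy scores are hypothetical, and the scores are taken as inputs rather than computed by a network.

```python
import math
from collections import defaultdict
from typing import Dict, List


def combine_generate_and_copy(
    gen_scores: Dict[str, float],
    copy_scores: List[float],
    context_tokens: List[str],
) -> Dict[str, float]:
    """Merge generation and copy scores into one distribution.

    gen_scores maps each vocabulary word to its generation-mode score;
    copy_scores[i] is the copy-mode score of context_tokens[i].
    """
    # Shift by the maximum score for numerical stability.
    top = max(list(gen_scores.values()) + list(copy_scores))
    exp_gen = {word: math.exp(score - top) for word, score in gen_scores.items()}
    exp_copy = [math.exp(score - top) for score in copy_scores]
    # Normalization term shared by both modes.
    z = sum(exp_gen.values()) + sum(exp_copy)

    probs: Dict[str, float] = defaultdict(float)
    for word, value in exp_gen.items():
        probs[word] += value / z
    # A word occurring several times in the context accumulates copy mass,
    # and a word that is also in the vocabulary adds to its generation mass.
    for token, value in zip(context_tokens, exp_copy):
        probs[token] += value / z
    return dict(probs)


# Toy example (hypothetical scores): "italian" is out of vocabulary but copyable.
probs = combine_generate_and_copy(
    gen_scores={"i": 0.0, "want": 0.0, "<unk>": 0.0},
    copy_scores=[0.0, 0.0],
    context_tokens=["italian", "want"],
)
print(probs)
```

Because the normalization is shared, the probabilities over the extended vocabulary sum to one without any extra mixing weight, and a word outside the vocabulary can still be produced when it occurs in the dialogue context.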
Gated Copy An alternative to the copy network is the gated copy mechanism, which uses a gate to regulate the contributions of the generation and copy modes to the final prediction. The gate $p_{gen}$ is calculated as follows:

$$p_{gen} = \sigma(W_h h^*_t + W_s s_t + W_y y_{t-1} + b_t) \quad (9)$$

where $W_h$, $W_s$, $W_y$ and $b_t$ are learnable parameters and $\sigma$ is the sigmoid function. The gate then weights the two modes in the final distribution:

$$P(y_t) = p_{gen} P^g(y_t) + (1 - p_{gen}) P^c(y_t) \quad (10)$$

Training The standard cross-entropy loss is adopted as the loss function to train the GECOR model.

The GECOR model is integrated into the end-to-end task-oriented dialogue model TSCP in a multi-task learning framework, which is shown in Figure 2. The GECOR-equipped TSCP model contains the embedding layer, the utterance and context encoders, and three decoders: decoder 1 for generating belief spans (BSpan), which are text spans for tracking dialogue states (e.g., <inf> Italian, cheap </inf>; <req> phone </req>), decoder 2 for complete user utterances and decoder 3 for machine responses. The embedding layer and encoders are the same as those described for the GECOR model above.
BSpan Decoder Unlike Lei et al. (2018), we do not concatenate the current user utterance with the previously generated machine response. At each turn of dialogue, the user utterance and the previous BSpan (the dialogue states updated to the previous turn) are used as the inputs to the user utterance encoder. The outputs of this encoder are then fed into the BSpan decoder for predicting the new BSpan for the current turn, and a cross-entropy loss $L_1$ is calculated. The user utterance encoder hidden states, together with the last hidden state and the output of the BSpan decoder, are input into the other two decoders.
Complete User Utterance Decoder The basic structure of this decoder is the same as that of the GECOR decoder described above. We pass the last hidden state of the BSpan decoder to the initial state of this decoder. In addition to the inputs from the user utterance encoder and the dialogue context encoder, we also input the output of the BSpan decoder into this decoder. The generation probability $P^g_t$, the copy probability $P^{c1}_t$ for copying tokens in the BSpan, and the copy probability $P^{c2}_t$ for copying words in the dialogue context are calculated with a shared normalization term and combined for the final probability computation:

$$P_t = P^g_t + P^{c1}_t + P^{c2}_t$$

$P_t$ is then used to decode the words in the complete user utterance. For this decoder, the second cross-entropy loss $L_2$ is calculated.
Machine Response Decoder Similar to the previous two decoders, the machine response decoder is also a single-layer unidirectional GRU, the initial state of which is set to the last hidden state of the complete user utterance decoder. In this decoder, we compute three context vectors for each decoder state $s_t$. The first context vector $h^*_{t1}$ is calculated over the user utterance encoder hidden states, while the other two context vectors $h^*_{t2}$ and $h^*_{t3}$ are calculated over the BSpan decoder hidden states and the complete user utterance decoder hidden states, respectively. The concatenation of $s_t$, $h^*_{t1}$, $h^*_{t2}$, $h^*_{t3}$ and the Knowledge Base (KB) matching vector $K_t$ (a one-hot representation of the retrieval results in the KB according to the constraints in the corresponding BSpan) is used to generate the output and update the decoder state. The generated output is then concatenated with the three context vectors and fed into a layer to produce the generation probability distribution over the vocabulary. Similar to the complete user utterance decoder, we also use the copy mechanism in the machine response decoder. The third cross-entropy loss $L_3$ is then calculated.
Table 2: An example of the ellipsis / co-reference annotation.
Training The final loss for the multi-task learning framework combines the three cross-entropy losses $L_1$, $L_2$ and $L_3$, and we learn the parameters that minimize it.

Data Annotation for Ellipsis and Co-Reference Resolution in Dialogue
Since there are no publicly available labeled data for the resolution of ellipsis and co-reference in dialogue, we manually annotate a new dataset based on the public dataset CamRest676 (Wen et al., 2016a,b) from the restaurant domain.
Annotation Specification Annotation cases for user utterances can be summarized into the following three conventions:
• As shown in Table 2, if a user utterance contains an ellipsis or anaphor, we manually resolve the ambiguity of the ellipsis or anaphor and supplement the user utterance with a correct expression by checking the dialogue context. In doing so, we create a pragmatically complete version of the utterance. If the utterance only contains an ellipsis and the ellipsis can be replaced with an anaphor, we create a co-reference version for it. Similarly, if the utterance only contains an anaphor and the anaphor can be omitted, we create an ellipsis version for the utterance.
• If the user utterance itself is pragmatically complete, without any ellipsis or anaphora, we create an anaphor version and an ellipsis version for it if such a creation is appropriate.
• If the utterance itself is complete and it is not suitable to create an ellipsis or anaphor version, we leave it unchanged.
With the annotation conventions described above, we can label each user utterance in the dataset as l ∈ {ellipsis, co-reference, complete}, or create two other versions for it if appropriate. Note that these labels are used only for dataset statistics and for designing experiments, not for training our models.
Dataset statistics The CamRest676 dataset contains 676 dialogues, with 2,744 user utterances. After annotation, 1,174 ellipsis versions and 1,209 co-reference versions are created from the 2,744 user utterances. In total, 1,331 user utterances have an incomplete form that is either an ellipsis or a co-reference version, while the remaining 1,413 of the 2,744 user utterances are complete and not amenable to change (1,331 + 1,413 = 2,744). No new versions are created from these 1,413 utterances.
Dataset Split for Experiments We split the new dataset into a training set (80%) and a validation set (20%), which can be used for the stand-alone ellipsis/co-reference resolution task and for the multi-task learning of both ellipsis/co-reference resolution and end-to-end task-oriented dialogue.

Experiments
In this section, we conduct experiments on the new dataset to examine the generative ellipsis/co-reference resolution model and its integration into the end-to-end task-oriented dialogue.

Evaluation Metrics
As far as we know, there is no end-to-end generative ellipsis and co-reference resolution model applied to multi-turn dialogues, and therefore there are no off-the-shelf metrics tailored to this evaluation. Since we deal with two tasks, the task of ellipsis/co-reference resolution (resolution task for short) and the task-oriented dialogue with integrated ellipsis/co-reference resolution (hereafter dialogue task), we use two sets of evaluation metrics. For the resolution task, we use the exact match rate (EM), which measures whether the generated utterances exactly match the gold utterances. BLEU (Papineni et al., 2002) and the F1 score (a balance between word-level precision and recall) are also used for the resolution task to evaluate the quality of generated utterances at the n-gram and word level. For the dialogue task, we use the success F1, defined as the F1 score of requested slots correctly answered in dialogues, to evaluate the task completion rate.
Table 3: Results of the resolution task on the dataset. GECOR 1/2: the GECOR model with the copy/gated copy mechanism. EM 1 and EM 2 respectively indicate the situation that the input utterance is complete or incomplete, while EM is the comprehensive evaluation of the two situations. Reso. F1: Resolution F1.
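The two simplest resolution metrics can be sketched as follows. This is a minimal per-utterance illustration, assuming lowercased whitespace tokenization and bag-of-words overlap for the word-level F1; the function names are hypothetical and the evaluation script used for the reported numbers may aggregate differently.

```python
from collections import Counter
from typing import List


def exact_match(prediction: List[str], reference: List[str]) -> bool:
    """True only if the generated utterance equals the gold utterance token by token."""
    return prediction == reference


def word_f1(prediction: List[str], reference: List[str]) -> float:
    """Harmonic mean of word-level precision and recall (bag-of-words overlap)."""
    overlap = sum((Counter(prediction) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(prediction)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


# A paraphrase pair of the kind discussed in the error analysis.
pred = "any food type will be fine .".split()
ref = "any type of restaurant will be fine .".split()
print(exact_match(pred, ref), round(word_f1(pred, ref), 2))
```

For this pair the exact match fails while the word-level F1 stays high (6 shared tokens out of 7 predicted and 8 reference tokens), which shows why the two metrics are reported side by side.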

Parameter Settings
For all our models, both the size of hidden states and word embeddings were set to 50. The vocabulary size |V| was set to 800 and the batch size was set to 32. We trained our models via the Adam optimizer (Kingma and Ba, 2015), with a learning rate of 0.003 and a decay parameter of 0.5. Early stopping and dropout were used to prevent overfitting, and the dropout rate was set to 0.5.

Baselines and Comparisons
For the resolution task, we compared our GECOR model with the baseline model proposed by Zheng et al. (2018) which is a seq2seq neural network model that identifies and recovers ellipsis for short texts.
For the dialogue task, we compared our multi-task learning framework with the baseline model TSCP, which is a seq2seq model enhanced with reinforcement learning. We ran the source code (see footnote 2) on our dataset to get the baseline results for comparison.
For the resolution task, we also performed a comparison study to examine the impacts of the gate mechanism incorporated into the copy network on the GECOR model and on the multi-task learning dialogue model.

The GECOR Model
Our generative resolution model was trained on three types of data: the ellipsis data, where only ellipsis version utterances from the annotated dataset were used; the co-reference data, where only co-reference version utterances from the annotated dataset were used; and the mixed data, where we randomly selected a version for each user utterance from {ellipsis, co-reference, complete}. In the mixed data, 633 turns are with ellipsis user utterances, 698 turns are with co-reference user utterances, and the rest are with complete user utterances. The experimental results of the GECOR and baseline model (Zheng et al., 2018) on the three different datasets are shown in Table 3.
2 https://github.com/WING-NUS/sequicity
Overall results From the third column of the table, we find that the GECOR model with the copy mechanism (GECOR 1) improves the exact match rate (EM) over the baseline by more than 17 points on the ellipsis version data, more than 15 points on the co-reference data, and more than 18 points on the mixed data. We further define a metric, which we term Resolution F1, that is an F1 score calculated by comparing machine-generated words with ground truth words for only the ellipsis / co-reference part of user utterances. The GECOR model achieves consistent and significant improvements over the baseline in terms of BLEU, F1 and Resolution F1 in addition to the EM metric. The major difference between the GECOR and the baseline is that the former tries to copy words from the dialogue context. The improvements, especially those on ellipsis resolution (which are higher than those on co-reference resolution), indicate that the copy mechanism is crucial for the recovery of ellipsis and co-reference.
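One way to make the Resolution F1 concrete is sketched below. How the "ellipsis / co-reference part" of an utterance is delimited is not spelled out above, so this sketch makes a loud assumption: the resolved part is taken to be the tokens that appear in the complete utterance but not in the original user utterance (a multiset difference). The function names are hypothetical, and the example utterances come from the partial-resolution case in the error analysis, lowercased and whitespace-tokenized.

```python
from collections import Counter
from typing import List


def resolved_part(utterance: List[str], completed: List[str]) -> Counter:
    """Tokens added by resolution: in the completed utterance but not in the input.

    Assumption: a multiset difference approximates the resolved span.
    """
    return Counter(completed) - Counter(utterance)


def resolution_f1(
    utterance: List[str], prediction: List[str], reference: List[str]
) -> float:
    """F1 between the predicted and gold resolved parts only."""
    pred_part = resolved_part(utterance, prediction)
    gold_part = resolved_part(utterance, reference)
    overlap = sum((pred_part & gold_part).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_part.values())
    recall = overlap / sum(gold_part.values())
    return 2 * precision * recall / (precision + recall)


user = "i do not care about them .".split()
pred = "i do not care about the price range .".split()
gold = "i do not care about the price range or location .".split()
print(resolution_f1(user, pred, gold))
```

Under this assumption the predicted part ("the price range") is fully correct but covers only three of the five gold tokens, so precision is 1.0 and recall is 0.6.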
Effect of the two copy mechanisms Comparing the GECOR 1 against the GECOR 2 (with the gated copy mechanism), we find that the gating between copy and generation is helpful in terms of word-level quality (F1 and Resolution F1 score) but not in terms of the fragment or sequence-based metrics (i.e., BLEU and EM). Therefore, we only integrate the GECOR model with the copy mechanism into the dialogue system.
Table 4: Results of the multi-task learning model. This table is split into two parts: performance of resolution for the integrated GECOR on the left side and performance of the dialogue task on the right side.
Incomplete vs. complete utterances In multi-turn dialogues, user utterances may be incomplete or complete. A robust resolution model needs to be able to accurately identify whether the input utterance is complete or not. The model needs to keep the utterance unchanged when it is complete and to predict the corresponding complete version when it is incomplete. For these cases, we tested our models and performed a statistical analysis on the three versions of data, as shown in columns 3, 4 and 5 of Table 3 (EM, EM 1, EM 2). We find that the GECOR model beats the baseline model in all respects. However, the GECOR model needs further improvement when the input utterances are incomplete, compared with its good performance on complete utterances.
Analysis on GECOR results for complete utterances We then analyzed the experimental results of the GECOR 1 on the mixed data in detail. When the input user utterances are complete, the GECOR model generates utterances that exactly match the input utterances in 92.03% of cases; only 7.97% do not match perfectly. Most unmatched cases, as we found, involve: (1) Missed words (e.g., User: Can I get a Korean restaurant in the town centre? GECOR: Can I get a Korean restaurant in the town?). (2) Repetition (e.g., User: OK, thank you. That is all for today then. GECOR: OK, thank you. That is all for today for today then.).
Analysis on GECOR results for incomplete utterances For incomplete input user utterances, GECOR generates exactly matched utterances in 42.04% of cases. Among the 57.96% of cases that do not exactly match the ground truth utterances, only 6.3% are not complete, i.e., they still contain unresolved ellipsis or co-reference, while 93.7% of these cases are complete, with GECOR-generated words that do not match the ground truth words. An in-depth analysis of these cases shows that they can be clustered into 4 classes.
(1) Paraphrases. We found that the majority of the unmatched complete utterances generated by GECOR are actually paraphrases of the ground truth complete utterances (e.g., User: Any will be fine. GECOR: Any food type will be fine. Reference: Any type of restaurant will be fine.). This is also confirmed by the high scores of the word-level evaluation metrics in Table 3.
(2) Partial resolution. When a pronoun refers to more than one item, GECOR sometimes generates a partial resolution for the pronoun (e.g., User: I do not care about them. GECOR: I do not care about the price range. Reference: I do not care about the price range or location.).
(3) Minor errors. In a few cases, the resolution part is correct while there are some errors elsewhere (e.g., User: How about Chinese food? Prediction: How about international food on the south side of town? Reference: How about Chinese food on the south side of town?).
(4) Repetition. Some cases contain repeatedly generated words.
We think that although not exactly matched, paraphrased complete utterances generated by GECOR are acceptable. These utterances are helpful for the downstream dialogue task. For other errors, such as partial resolution or repetition, it may be necessary to enhance the attention or copy mechanism further in GECOR.
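Combining the percentages reported above gives the share of all incomplete input utterances that falls into each outcome. This is a worked calculation from the stated figures, rounded to two decimals:

```latex
\begin{align*}
\text{exact match} &= 42.04\% \\
\text{complete, but different from the reference} &= 57.96\% \times 93.7\% \approx 54.31\% \\
\text{still incomplete} &= 57.96\% \times 6.3\% \approx 3.65\%
\end{align*}
```

Under this reading, about 96% of incomplete inputs (42.04% + 54.31%) are rewritten into a pragmatically complete utterance, even though fewer than half match the reference exactly.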

The Multi-Task Learning Model
We further conducted experiments to extrinsically evaluate the GECOR model in task-oriented dialogue with the success F1 metric. This also evaluates our multi-task learning framework in integrating the GECOR model into the end-to-end dialogue model. In addition to training the baseline TSCP model on the ellipsis, co-reference and mixed datasets, we also trained it on the dataset with only complete user utterances. This is to examine the ability of the baseline model to use correct contextual information presented in user utterances. The experimental results are shown in Table 4.
Overall results In comparison to the baseline, we can see that our model improves the success F 1 score by nearly 4 points on the co-reference dataset, which is close to the score obtained by the baseline trained with the complete user utterances. On the mixed and ellipsis dataset, our model also achieves 2.7 points and 0.8 points of success F 1 score improvements, respectively.
Resolution performance of the integrated GECOR We also provide the performance of the integrated GECOR on the resolution task in Table 4. The performance is only slightly lower than when the GECOR is trained independently as a stand-alone system. This suggests that the GECOR is able to perform well when integrated into a dialogue system. The overall results demonstrate that the proposed multi-task learning framework for end-to-end dialogue is able to improve the task completion rate by incorporating an auxiliary ellipsis/co-reference resolution task.
Since the BSpan decoder is also used in the baseline system to capture contextual information and track dialogue states, we believe that our multi-task learning model with the integrated GECOR will play a more important role in end-to-end dialogue models that do not use state tracking modules, e.g., neural open-domain conversation models (Vinyals and Le, 2015).

Conclusion and Future Work
In this paper, we have extensively investigated ellipsis and co-reference resolution in the context of multi-turn task-oriented dialogues. We have presented the GECOR, a unified end-to-end generative model for both ellipsis and co-reference resolution in multi-turn dialogues. A multi-task learning framework is further proposed to integrate the GECOR into the end-to-end task-oriented dialogue. In order to train and test the proposed model and framework, we manually created a new dataset with annotated ellipsis and co-reference information based on the publicly available CamRest676 dataset. Experiments on the resolution task show that the GECOR is able to significantly improve the performance in terms of the exact match rate, BLEU and word-level F1 score. Experiments on the dialogue task demonstrate that the task completion rate of the task-oriented dialogue system is significantly improved with the aid of ellipsis and co-reference resolution.
Our work could be extended to end-to-end open-domain multi-turn dialogue. We will further improve our model by incorporating syntactic and location information. We would also like to adapt the proposed methods to document-level neural machine translation in the future.