CREAD: Combined Resolution of Ellipses and Anaphora in Dialogues

Anaphora and ellipsis are two common phenomena in dialogues. Without resolving referring expressions and information omission, dialogue systems may fail to generate consistent and coherent responses. Traditionally, anaphora is resolved by coreference resolution and ellipsis by query rewrite. In this work, we propose a novel joint learning framework that models coreference resolution and query rewriting for complex, multi-turn dialogue understanding. Given an ongoing dialogue between a user and a dialogue assistant, our joint learning model first predicts coreference links between the user query and the dialogue context, and then generates a self-contained rewritten user query. To evaluate our model, we annotate a dialogue-based coreference resolution dataset, MuDoCo, with rewritten queries. Results show that the performance of query rewrite can be substantially boosted (+2.3% F1) with the aid of coreference modeling. Furthermore, our joint model outperforms the state-of-the-art coreference resolution model (+2% F1) on this dataset.


Introduction
In recent years, dialogue systems have attracted growing interest and have been applied to various scenarios, ranging from chatbots to task-oriented dialogues to question answering. Despite rapid progress, several difficulties remain in the understanding of complex, multi-turn dialogues. Two major problems are anaphora resolution (Clark and Manning, 2016a,b) and ellipsis (Kumar and Joshi, 2016) in follow-up turns. Take the dialogue in Figure 1 as an example: ellipsis occurs in user turn 2, where the user asks for the capital of "Costa Rica" without explicitly mentioning the country again; coreference occurs in user turn 3, where "the capital" refers to "San Jose". Without resolving the anaphoric reference and the ellipsis, dialogue systems may fail to generate coherent responses.

Figure 1: An example of a question-answering dialogue where coreference and ellipsis occur in the user query, along with the corresponding query rewrite annotation. References to the same entity are highlighted in the same color and can be resolved by coreference resolution modeling. The two system responses in Turn 3 indicate two possible interpretations of the city San Jose by the system.

(Work done while the first author was an intern at Apple.)
Query rewrite (Quan et al., 2019) is an approach that converts a context-dependent user query into a self-contained utterance, so that it can be understood and executed independently of the previous dialogue context. This technique can solve many cases where coreference or ellipsis occurs. For instance, "the capital" in user turn 3 is changed to "San Jose" in the rewrite. Likewise, the ellipsis of the country name "Costa Rica" in user turn 2 can be recovered through rewriting. The rewritten utterance improves multi-turn dialogue understanding (Yang et al., 2019) by reducing the dependency on previous turns.
Although query rewrite implicitly resolves coreference, some information is not contained in a rewrite. First, it does not provide a distinct coreference link between mentions across dialogue turns as in the classic coreference resolution task. This is particularly disadvantageous when there is entity ambiguity in the rewritten sentence. For example, in Figure 1, since "San Jose" in Rewrite turn 3 can be either San Jose in Costa Rica or San Jose in California, the system may end up with an incorrect response by generating System (i) instead of System (ii) due to the wrong interpretation of San Jose. Second, mention detection, an essential step in coreference resolution (Peng et al., 2015), is not involved in query rewrite. By knowing which span in an utterance is a mention, downstream systems such as named entity recognition and intent understanding can perform better (Bikel et al., 2009). Third, if coreference links to the dialogue context are available, downstream systems can skip entity linking, which is time-consuming and may introduce noise.
To resolve the above issues, we propose a novel joint learning framework that incorporates the benefits of reference resolution into the query rewrite task. To the best of our knowledge, at the time of writing there exists no English conversational dataset that couples annotations of both query rewrite and coreference resolution (as links or clusters). This motivates us to collect query rewrite annotations on a recent dialogue dataset, MuDoCo (Martin et al., 2020), which already has coreference links between the user query and the dialogue context. Compared to existing query rewrite datasets (Quan et al., 2019; Anantha et al., 2020), rewriting in MuDoCo is much more challenging since it involves reasoning over multiple turns and spans multiple domains.
We design a joint learning model adopting the GPT-2 (Radford et al., 2019) architecture that learns both query rewrite and coreference resolution. Given an ongoing dialogue, our model first predicts the coreference links, if any, between the latest user query and the dialogue context. It then generates the rewritten query by drawing upon the coreference results. Our experiments show that query rewrite performance can be substantially boosted with the aid of coreference training. In addition, our model outperforms strong baselines on the two individual tasks. Since both tasks fundamentally address reference resolution, joint training facilitates knowledge sharing.
Our contributions can be summarized as follows:
• We present a novel joint learning framework for modeling coreference resolution and query rewrite in multi-turn dialogues.
• Our annotations augment the MuDoCo dataset with query rewrite labels. To the best of our knowledge, the augmented MuDoCo is the first English dialogue dataset with both coreference resolution and query rewrite annotations.
• We propose a novel GPT-2 based model to tackle the two target tasks, and show that joint training with coreference resolution helps improve the quality of the query rewrites.
The augmented dataset with our annotations, along with the modeling source code, is available at https://github.com/apple/ml-cread.


Related Work

Query Rewrite In earlier work on query rewrite (2019), two separate attention distributions are learned for the dialogue context and the user query respectively, combined with a control gate. This modified copy mechanism shows improvements over the standard pointer-generator on both LSTM-based and transformer-based models (Vaswani et al., 2017). Note that in the dataset used in that work, the dialogue context has only 2 utterances; MuDoCo, in contrast, has up to 8 utterances, making it much more challenging for query rewrite.
Coreference Resolution Research on document-based coreference resolution has a long history (a detailed survey can be found in Ng (2010)). Various approaches have been proposed, ranging from mention-pair classifiers (Ng and Cardie, 2002; Bengtson and Roth, 2008) and latent structure-based models (Fernandes et al., 2012; Björkelund and Kuhn, 2014; Martschat and Strube, 2015) to the more recent neural pipeline systems that rely on syntactic parsers (Raghunathan et al., 2010) and clustering algorithms (Clark and Manning, 2016a,b). The first neural end-to-end coreference resolution model was proposed by Lee et al. (2017) and achieved better results without external resources. An improved version was proposed in Lee et al. (2018), which considers higher-order structures by iteratively refining span representations. More recently, powerful pre-trained models such as BERT (Joshi et al., 2019) and SpanBERT (Joshi et al., 2020) have been used to extract representations for these end-to-end models. Wu et al. (2020) approach the problem in a question answering framework: for each detected mention candidate, the sentence it resides in serves as the query and is used to predict the referent in the passage. Different from these works, we focus on coreference resolution in dialogues, with the following main distinctions: 1) the speaker information in dialogues is explicit; 2) less descriptive content may make pronoun mentions more ambiguous; and 3) coreference resolution is conducted only between the latest user query and the previous dialogue context; unlike in document-based coreference resolution, where a model can look ahead for the resolution, future turns are not available to a dialogue agent. We encourage the reader to refer to Martin et al. (2020) for more details.
Joint Learning In contrast to prior works that focus solely on either query rewrite or coreference resolution, we present a novel joint learning approach that tackles both tasks with a single model. We hope that this work serves as a first step towards this new, challenging and practical problem for dialogue understanding.

Dataset and Task
The MuDoCo dataset contains 7.5k task-oriented multi-turn dialogues across 6 domains. A dialogue has an average of 2.6 turns and a maximum of 5 (a turn includes a user query and a system response). Figure 2 shows an example. For each partial dialogue, the coreference links, if any exist, are annotated between the latest user query and its dialogue context. For example, in the partial dialogue up to user turn 2, there is a coreference link between the anaphora "this" in user turn 2 and the antecedent "song" in user turn 1. When an anaphora has multiple antecedents in the context, e.g., "song" in user turn 1 and "Yellow Submarine" in system turn 1, only one of them is annotated as its referent in the coreference link.

On top of the existing coreference labels, we annotate the rewrite for each utterance. The goal is to rewrite the query into a self-contained query that is independent of the dialogue context. 30 annotators are recruited for the data collection. Each of them is shown a partial dialogue and is asked (1) to decide whether the query needs to be rewritten due to coreference or ellipsis; and (2) to provide the rewritten query when rewriting is required.

We notice that there can be various ways of rewriting an utterance. For example, some annotators might include every detail of the rewritten entity, while others might choose a precise term; some might paraphrase the rewritten utterance, while others keep the same expression. To ensure data consistency and high annotation quality, we designed a comprehensive guideline for the annotators to follow and undertook a two-stage collection process: 1) we organized two training sessions with annotators, in each of which 50 representative examples were selected and assigned to each annotator, and an author inspected these training results individually and provided feedback to the annotators; 2) 5% of the grading results were manually evaluated by an author for quality assurance. Detailed annotation guidelines can be found in the Appendix.
The joint learning task requires the machine to predict both the coreference links and the rewritten query for the latest user query given an ongoing dialogue. The outputs of the two individual tasks complement each other and provide more comprehensive information for dialogue understanding. For instance, "Yellow Submarine" in Figure 2 can be either a song name or an album name. Explicit coreference resolution helps to disambiguate between the various possibilities by linking entities to previously resolved ones. More importantly, the supervision of coreference resolution can be beneficial to rewriting an anaphora to its antecedent.

Figure 3: The proposed model for joint learning of coreference resolution and query rewrite, designed using the GPT-2 architecture. Given a dialogue context and a user query, the model first detects the mentions in the query (Step 1); resolves the corresponding reference spans (Step 2); predicts whether the query needs a rewrite or not (Step 3); and, if the model decides to rewrite, generates the rewritten query (Step 4). In this example dialogue, there is a coreference link between the mention "this" and its referent "Yellow Submarine".

Modeling
Our proposed model for jointly learning coreference resolution and query rewrite is based on the GPT-2 architecture and is presented in Figure 3. The input to the model is the concatenation of the dialogue context and the latest user query, where special tokens are used to separate utterances and indicate speaker information. Passing through the standard decoder layers, the hidden state $h^l_t \in \mathbb{R}^d$ and attention scores $a^{l,j}_t \in \mathbb{R}^T$ at each position of the input sequence are calculated, where $l$, $j$ and $t$ denote the index of the decoder layer, the attention head, and the input token position respectively; $d$ and $T$ denote the embedding size and the length of the input sequence. Inspired by the end-to-end coreference resolution model (Lee et al., 2017), our model first predicts mentions in the user query and grounds them to their corresponding referents in the dialogue context using attention heads. The model then generates the rewritten query conditioned on the resolved coreference links. The prediction process has four main steps, described in detail below.

Step 1: Mention Detection First, the model detects any possible referring expressions in the user query. Here we use the term mention to include all expressions that require reference resolution (e.g., pronouns or partial entity names). We formulate mention detection as a sequence labeling problem: each token in a query is labelled with one of three classes $\{S, E, N\}$, referring to Start of mention, End of mention and None respectively. This sequence tagger in the mention detector, parameterized by a feed-forward network, takes the hidden states of the query from the last decoder layer as input and predicts the sequence of class labels. The mention spans in the query can then be determined by a pair of mention start $S$ and end $E$ tags. For instance, in Figure 3 the label at the position of "this" is class $S$ and that of "from" is class $E$, while the rest of the positions in the query are labelled as class $N$. We use $m_S$ and $m_E$ to denote the start and end position index of a predicted mention $m$ respectively.
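To make the tagging scheme concrete, the following is a minimal PyTorch sketch of such a mention detector, not the released implementation: a feed-forward head over the last-layer query hidden states, plus a decoder that pairs $S$ and $E$ tags into spans. The class names, the label-to-index mapping, and the hidden layer width are our own illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical label-to-index mapping for the {S, E, N} tagging scheme.
S, E, N = 0, 1, 2

class MentionTagger(nn.Module):
    """Feed-forward {S, E, N} tagger over the last decoder layer's query states."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 3),   # logits for S / E / N per token
        )

    def forward(self, query_hidden: torch.Tensor) -> torch.Tensor:
        # query_hidden: (query_len, hidden_size) -> (query_len, 3)
        return self.ffn(query_hidden)

def decode_mentions(labels):
    """Pair each Start tag with the next End tag to form (m_S, m_E) spans."""
    spans, start = [], None
    for i, lab in enumerate(labels):
        if lab == S:
            start = i
        elif lab == E and start is not None:
            spans.append((start, i))     # inclusive span boundaries
            start = None
    return spans
```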
Step 2: Reference Resolution For each detected mention $m$, the model resolves it to its antecedent (or referent) in the dialogue context by predicting the span boundaries: the position indices of the referent start $r_S$ and end $r_E$. Essentially, the distributions of the boundaries ($r_S$ and $r_E$) are learned by supervising multiple attention heads associated with the target mention $m$. In other words, the attention distribution $a_{m_S}$ (the attention score of each position associated with the mention start $m_S$) is supervised to focus on the referent start $r_S$. Similarly, the attention scores $a_{m_E}$ associated with the mention end $m_E$ are used to learn the boundary of the referent end $r_E$. Concretely:

$$q_{r_S} = \frac{1}{L'J'} \sum_{l=L-L'+1}^{L} \sum_{j=1}^{J'} a^{l,j}_{m_S}, \qquad q_{r_E} = \frac{1}{L'J'} \sum_{l=L-L'+1}^{L} \sum_{j=1}^{J'} a^{l,j}_{m_E} \quad (1)$$

where $q_{r_S}$ and $q_{r_E}$ are the probability distributions that a given token represents the referent start $r_S$ and end $r_E$ respectively; $L'$ and $J'$ are the specified numbers of involved decoder layers and attention heads. We then take the argmax of these boundary distributions to resolve the referent $r$.
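The sketch below illustrates one plausible realization of Eq. (1) over the per-layer attention tensors that GPT-2 implementations typically expose (e.g., `output_attentions=True` in Hugging Face Transformers). Which heads are selected and that the pooled rows are averaged are assumptions on our part; the paper specifies only that $L'$ layers and $J'$ heads per layer are supervised.

```python
import torch

def referent_distributions(attentions, m_start, m_end, last_l=2, heads=3):
    """
    attentions: list over the L decoder layers; each entry is the GPT-2
                attention tensor (num_heads, T, T) for one example.
    Pools the selected heads' attention rows at the mention boundaries to
    obtain q_{r_S} and q_{r_E} over the T input positions (cf. Eq. (1)).
    """
    rows_s, rows_e = [], []
    for layer in attentions[-last_l:]:          # last L' decoder layers
        rows_s.append(layer[:heads, m_start])   # (J', T) rows for m_S
        rows_e.append(layer[:heads, m_end])     # (J', T) rows for m_E
    # Averaging softmax-normalized rows keeps a valid distribution.
    q_rs = torch.cat(rows_s).mean(dim=0)        # (T,) referent-start dist.
    q_re = torch.cat(rows_e).mean(dim=0)        # (T,) referent-end dist.
    return q_rs, q_re

# Referent span via argmax of the two boundary distributions:
# r_s, r_e = q_rs.argmax().item(), q_re.argmax().item()
```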
Our design of reference resolution effectively leverages the powerful attention mechanism in GPT-2 without adding any extra components for reference resolution.
Step 3: Binary Rewriting Classification The model completes coreference resolution in Steps 1 and 2, after which it starts producing the rewritten query. Unlike existing query rewrite systems that directly generate the rewrite given the input, our model first predicts whether the incoming query needs to be rewritten, using a binary classifier. As shown in Figure 3, the classifier, a two-layer feed-forward network followed by a softmax layer, takes as input the hidden state of the first decoding step and predicts a vector with two entries representing the rewrite and no-rewrite classes. Only when the binary prediction is true, i.e., the classifier predicts the class indicating that a rewrite is required, does the model enter Step 4 to generate the rewritten query; otherwise, the input query is directly copied as the output. We show that a well-learned binary classifier with 93% accuracy functions as a filter that not only minimizes the risk of incorrectly rewriting already self-contained queries, but also allows the rest of the generation process to focus solely on how to rewrite incomplete queries during training.
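A minimal sketch of such a binary head, assuming a plain two-layer feed-forward network; the layer widths, names, and which softmax index denotes "rewrite" are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class RewriteClassifier(nn.Module):
    """Two-layer feed-forward net + softmax over {rewrite, no-rewrite}."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 2),
        )

    def forward(self, h_first: torch.Tensor) -> torch.Tensor:
        # h_first: hidden state of the first decoding step, (hidden_size,)
        return torch.softmax(self.net(h_first), dim=-1)

# Usage sketch (index 0 = "rewrite" is an assumed convention):
# probs = classifier(h_first)
# output = generate_rewrite(...) if probs.argmax() == 0 else input_query
```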
Step 4: Query Rewrite Generation In this final step, the model runs generation based on its binary decision of whether or not to rewrite. Unlike the standard language modeling setup in GPT-2, where the output sequence is generated directly from the last hidden states, we design a Coref2QR attention layer that allows information gained during coreference resolution to effectively assist query rewrite generation. First, all relevant hidden states of the mentions and referents predicted in Steps 1 and 2 are assembled to form a memory pool $M$. Note that an example may have more than one coreference link. At each time step $t'$ during rewrite generation, the Coref2QR attention layer, operating as a standard multi-head attention mechanism, takes $h_{t'}$ as the query to attend over the coreference-related states $M$, treating them as keys and values. The resulting attention output $c_{t'}$ is summed with $h_{t'}$ to obtain the feature $f_{t'}$ before the final output token classifier. This design improves information flow between the two tasks, enabling the model to directly utilize previously resolved coreferents during rewrite generation.
The Coref2QR attention can be applied to any decoder layer to facilitate deeper interaction between rewrite and coreference resolution in the model. Formally, at each decoder layer $l$, the memory pool $M^l$ stores the coreference-related states produced at layer $l$. At generation step $t'$, the Coref2QR layer takes $h^l_{t'}$ as the query to attend over $M^l$ and obtain $c^l_{t'}$. The final feature $f_{t'}$ before the output token classifier is then obtained by summing the hidden state with the per-layer attention outputs $c^l_{t'}$. For simplicity, Figure 3 only illustrates the Coref2QR attention for the last decoder layer. Our results and analysis show that this Coref2QR attention design benefits the quality of query rewrite, especially in rewriting an anaphora into its antecedent.
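The sketch below approximates a single-layer Coref2QR attention using PyTorch's stock multi-head attention. The residual-sum formulation of $f_{t'}$ follows the description above, while everything else (names, shapes, head count) is our own assumption.

```python
import torch
import torch.nn as nn

class Coref2QRAttention(nn.Module):
    """Multi-head attention from the decoder state over the memory pool M of
    mention/referent hidden states, followed by a residual sum."""
    def __init__(self, hidden_size: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)

    def forward(self, h_t: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # h_t:    (batch, 1, hidden) decoder state at generation step t'
        # memory: (batch, m, hidden) pooled coreference-related states M
        c_t, _ = self.attn(query=h_t, key=memory, value=memory)
        return h_t + c_t   # feature f_t' fed to the output token classifier
```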
Optimization During training, an input sequence of length $T$ is formed by concatenating the dialogue context, the user query and the target query rewrite. Four objectives, corresponding to the four steps in the model, are used for training. For mention detection, the objective is the cross-entropy between the predicted sequence of mention classes $p^M$ and its ground-truth sequence $y^M$:

$$\mathcal{L}_M = -\sum_{t=q_S}^{q_E} (y^M_t)^\top \log p^M_t \quad (2)$$

where $q_S$ and $q_E$ denote the start and end index of the query respectively, and $\top$ is the transpose operation.
For each coreference link $n$, the loss is the cross-entropy between the predicted distributions of the antecedent boundaries $q^n$ and the corresponding ground truth $y^{R_n}$. The final loss for reference resolution is the sum of the losses over the existing coreference links:

$$\mathcal{L}_R = -\sum_{n=1}^{N} \left[ (y^{R_n}_{r_S})^\top \log q^n_{r_S} + (y^{R_n}_{r_E})^\top \log q^n_{r_E} \right] \quad (3)$$

where $N$ is the number of coreference links in an example, and $q^n_{r_S}$ and $q^n_{r_E}$ are the predicted distributions of the referent start $r_S$ and end $r_E$ respectively. When an example does not contain a coreference link, $\mathcal{L}_R$ is 0.
For query rewrite, the binary classification loss is the two-class cross-entropy between the prediction $p^B$ and the binary rewriting label $y^B$:

$$\mathcal{L}_B = -(y^B)^\top \log p^B \quad (4)$$

For generation, as in the standard language modeling task, we use the cross-entropy between the predicted sequence $p^Q$ and its ground-truth sequence $y^Q$:

$$\mathcal{L}_Q = -\sum_{t'} (y^Q_{t'})^\top \log p^Q_{t'} \quad (5)$$

where $t'$ is the time step in the word sequence of the query rewrite. Note that $\mathcal{L}_Q$ is 0 for examples that do not need a rewrite. The final loss is the sum of all these losses:

$$\mathcal{L} = \mathcal{L}_M + \mathcal{L}_R + \mathcal{L}_B + \mathcal{L}_Q \quad (6)$$

(Note that in the MuDoCo dataset there are coreference annotations where the mention has exactly the same word span as its referent.)
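Putting the four objectives together, a schematic loss computation might look as follows; the tensor shapes and the flat per-boundary representation of coreference links are our own assumptions.

```python
import torch
import torch.nn.functional as F

def joint_loss(mention_logits, y_mention,      # (T_q, 3) logits, (T_q,) labels
               boundary_dists, boundary_gold,  # flat lists of q_{r_S}/q_{r_E}
               binary_logits, y_binary,        # (2,) logits, (1,) label
               rewrite_logits=None, y_rewrite=None):
    L_M = F.cross_entropy(mention_logits, y_mention)                 # Eq. (2)
    L_R = sum(-torch.log(q[g] + 1e-12)                               # Eq. (3)
              for q, g in zip(boundary_dists, boundary_gold))        # 0 if no link
    L_B = F.cross_entropy(binary_logits.view(1, -1), y_binary)       # Eq. (4)
    L_Q = (F.cross_entropy(rewrite_logits, y_rewrite)                # Eq. (5)
           if rewrite_logits is not None else 0.0)                   # 0 if no rewrite
    return L_M + L_R + L_B + L_Q                                     # Eq. (6)
```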
Setup The GPT-2 decoder layers and the word classification layer in our model are initialized with the pre-trained weights of the GPT-2 small model. We fine-tune the model using the Adam optimizer (Kingma and Ba, 2014) with learning rate 5e-05 and batch size 15. The criterion for early stopping is the averaged performance of coreference resolution and query rewrite on the development set. Results are reported as the average of 5 runs.

Query Rewrite
Evaluation Metrics The standard BLEU-4 score (Papineni et al., 2002) between the generated and target sentences is reported. In addition, to highlight the quality of the rewritten parts of generated sentences, following the post-processing in Quan et al. (2019), we measure an F1 score that compares machine-generated words with ground-truth words for only the ellipsis/coreference part of user utterances. We also report the percentage of all referents in ground-truth coreference links that are successfully generated in the query rewrite, denoted as reference match (RM). The RM ratio explicitly reflects the quality of coreference resolution in the generated rewritten query.
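As a rough illustration of these two metrics (the exact post-processing in Quan et al. (2019) may differ in detail), the following sketch filters out tokens that already appear in the source query before computing token-level F1, and counts gold referent strings that surface verbatim in the rewrite.

```python
from collections import Counter

def rewrite_f1(generated: str, reference: str, source_query: str) -> float:
    """Token F1 over only the changed content: tokens already present in the
    source query are filtered out before comparison."""
    src = set(source_query.split())
    gen = [w for w in generated.split() if w not in src]
    ref = [w for w in reference.split() if w not in src]
    if not gen or not ref:
        return float(gen == ref)     # both empty -> 1.0, one empty -> 0.0
    overlap = sum((Counter(gen) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    p, r = overlap / len(gen), overlap / len(ref)
    return 2 * p * r / (p + r)

def reference_match(generated: str, gold_referents: list) -> float:
    """Fraction of gold referent strings appearing verbatim in the rewrite."""
    if not gold_referents:
        return 1.0
    return sum(ref in generated for ref in gold_referents) / len(gold_referents)
```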
Baselines Our baselines are the standard seq-to-seq model with attention and its variant with a pointer-generator (seq2seq+pg), implemented with LSTMs.

Results Table 2 shows the query rewrite results. The F1 scores are low relative to the BLEU scores because the non-rewritten repeated tokens are filtered out in post-processing when calculating F1; this allows us to better evaluate the quality of the rewritten parts and to better differentiate between good and bad generations in our task. We find that our joint model substantially outperforms all LSTM-based seq-to-seq models on all metrics. Although the pointer-generator in the LSTMs can effectively copy words from the input into its generation, the powerful transformer architecture with pre-trained weights allows better learning of rewriting patterns. To fairly investigate the impact of coreference modeling on query rewrite generation, we train a variant of our model using only the query rewrite objectives (Eqns. (4) and (5)), denoted as the QR-only model. Without coreference resolution, the F1 score drops from 60.2 to 57.9 and the reference match drops from 82.0 to 78.7. This illustrates the improved ability of the joint model to rewrite anaphoric expressions, since it can leverage its coreference resolution predictions to generate more accurate query rewrites. We present a detailed case study with model predictions in Sec. 5.5.

Coreference Resolution
Evaluation Metrics The MUC, $B^3$, and CEAF$_{\phi_4}$ metrics that are widely used in the coreference resolution task are reported. Note that these metrics are calculated over coreference clusters, whereas we only have ground-truth annotations for coreference links between mentions and referents. To align links and clusters, during evaluation we post-process both the ground truth and the model predictions: all word spans in the dialogue context that are identical to the referent are combined into a cluster, so that a link between a mention and a referent can be transformed into a cluster for the standard coreference resolution evaluation.
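A sketch of this link-to-cluster conversion under the stated rule, merging every context span whose surface form matches the referent; the tuple-based span representation is an assumption.

```python
def links_to_clusters(links, context_spans):
    """
    links:         (mention_span, referent_span) pairs for one example
    context_spans: candidate spans in the dialogue context, each a
                   (start, end, text) tuple
    Merges every context span whose surface text equals the referent's into
    the same cluster, turning links into clusters for MUC / B^3 / CEAF scoring.
    """
    clusters = []
    for mention, referent in links:
        cluster = {mention, referent}
        cluster |= {s for s in context_spans if s[2] == referent[2]}
        clusters.append(sorted(cluster))
    return clusters
```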
Baselines To the best of our knowledge, there is no suitable coreference resolution model proposed in the same setup for dialogues.² We therefore experiment with state-of-the-art document-based coreference resolution models: the end-to-end model (Lee et al., 2017, 2018) using BERT (Joshi et al., 2019) or SpanBERT (Joshi et al., 2020). Note that these models can only serve as a reference since they are not specifically designed for dialogue-based tasks. Since they require coreference clusters for training, clusters are built from the annotated links as in the post-processing step done for evaluation.

Results As shown in Table 3, SpanBERT obtains better results than BERT, which is consistent with the findings in Joshi et al. (2020). This is mainly because SpanBERT is better at capturing span information, which facilitates tasks such as coreference resolution where reasoning about relationships between spans is required. In comparison, our joint learning model achieves competitive and even slightly better results. This indicates that our design, which leverages the attention heads inside GPT-2, is effective at predicting coreference links in dialogues. To test whether the supervision of query rewrite affects the optimization of coreference resolution in joint learning, we train a model variant using only the objectives for coreference resolution (Eqns. (2) and (3)), denoted as the coref-only model. We observe that the results of the coref-only model are very close to those of the joint model, showing that the addition of coreference resolution in joint learning is beneficial to query rewrite without sacrificing coreference resolution performance.

² The baseline in Martin et al. (2020) is not compared for two reasons: 1) their training/evaluation setup differs from ours in many ways, e.g., they only consider finished dialogues; 2) their source code is not released.

Ablation Study
Here, we investigate how the different components of our joint model contribute to query rewrite performance by removing one component at a time. As shown in Table 4, without the Coref2QR attention layer, performance degrades by 2.9% F1 and 1.4% RM rate. Further removing the supervision of coreference modeling leaves the model optimized solely towards the query rewrite objectives, and produces worse results than the complete model. These results indicate that through joint learning, the model's ability to generate the rewritten query improves, including its ability to rewrite an anaphora with its antecedent, by leveraging information from coreference resolution modeling. In addition, the binary head plays an essential role in our model. The accuracy of this binary classifier is 93.9%. Without the binary head, the performance drop can be up to 5.9% F1 (60.2 → 54.3). This shows that with the binary classification, the model is able to focus on rewriting the input query without worrying about whether to rewrite or not.

Analysis
In this section we analyze query rewrite performance on two different types of rewriting: coreference (coref.) and ellipsis (elp.). F1 scores over three main domains and the full test set are reported in Table 5. The seq2seq+pg model is the baseline seq2seq model with a pointer-generator; the QR-only model is our model variant trained without coreference modeling. The overall trend shows that 1) when the dialogue contains coreferences, the joint learning model is more capable of rewriting the query by leveraging its coreference predictions; and 2) when coreferences are not present but the query still needs rewriting on account of information omission, the joint model still performs competitively with the QR-only model.

Case Study
We demonstrate several examples of query rewrites generated by different models to provide more insight into the task and the benefits of joint learning. The coreference links predicted by the joint learning model are appended after its generated rewrite. Two examples that require coreference resolution in query rewrite are shown in Table 6. In the left example, the two baseline models fail to produce the correct rewrite given the complexity of a long dialogue. The joint learning model not only correctly predicts the coreference link pointing from the mention to its referent in the first turn, but also generates a rewrite perfectly consistent with its coreference prediction. A similar trend can be observed in the right example: the first two models cannot identify which "Ariana" to generate, while our model rewrites with the correct one with the aid of correct coreference resolution. While our model does well on most test cases, there are situations where the joint model fails to predict correctly. A representative failure example is provided in Appendix A.2.

Table 7 shows an ellipsis example. The implicit location in the user query can be recovered through rewriting by both GPT-2 based models, while the LSTM-based model tends to keep the query unchanged. This indicates that 1) even with the pointer-generator's ability to copy source text, the seq2seq model is not capable enough of handling the difficult information-omission rewrites; and 2) the joint learning model still performs well on ellipses, while benefiting substantially in coreference cases.

Conclusion and Future Work
We propose a novel joint learning framework for coreference resolution and query rewrite in dialogues. Modeling coreference resolution not only complements the missing information in query rewrite, but is also beneficial to rewriting anaphoric expressions. Our joint learning model can predict coreference links between the user query and dialogue context, and generate the rewritten query. We show that with the aid of coreference resolution, the performance of query rewrite can be substantially boosted. Furthermore, our model produces competitive results in coreference resolution when compared to state-of-the-art BERT-based systems. We hope that the presented joint learning task with the release of our query rewrite annotations on the MuDoCo dataset provides a promising research direction in multi-turn dialogue understanding.
One restriction of our model is that, because it is designed to predict the boundaries of a reference span, it can only handle cases involving contiguous spans of words. In addition, the influence of query rewrite on coreference resolution is limited due to the nature of the information flow in our current model design. Future work will address these limitations.

A Appendices
A.1 Training details The average run time for training our joint learning model is 6 hours on a GTX 1080 Ti. Based on the GPT-2 architecture, our model has 148M parameters. For the attention heads used to predict the referent in Equation (1), the hyper-parameter search ranges for $L'$ and $J'$ are $1 \le L' \le 12$ and $1 \le J' \le 12$. The best performance is obtained when using only the last two decoder layers ($L' = 2$) with 3 attention heads in each layer ($J' = 3$).
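For reference, the reported training settings gathered in one place; the dict layout is ours, the values come from the text above and Section 4.

```python
# Reported training configuration (values from the paper; layout illustrative).
config = {
    "base_model": "gpt2",            # GPT-2 small; 148M parameters in our model
    "optimizer": "adam",
    "learning_rate": 5e-5,
    "batch_size": 15,
    "coref_layers_L_prime": 2,       # last two decoder layers
    "coref_heads_J_prime": 3,        # three attention heads per layer
    "early_stopping": "avg dev performance of coref + rewrite",
    "num_runs": 5,
}
```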
Hyper-parameters are tuned based on the averaged performance of query rewrite and coreference resolution on the development set.

A.2 Sample Model-Generated Failure Cases
We find that our joint model makes mistakes when the coreference signal is ambiguous and the dialogue context is complex (e.g., multiple person names in one utterance). In a representative example (Table 8), even though the joint model predicts one of the coreference links correctly (one → call) and generates the corresponding rewritten span (call from Sana and Erica), it fails to infer that the pronoun her refers to Deirdre, and simply ignores the corresponding rewrite. This is likely because there are many female names that the pronoun her can refer to in this utterance, and such complex cases are too infrequent in the training corpus for the model to learn well.

A.3 Query Rewrite Annotation Guideline
The annotation guidelines for collecting query rewrites on the MuDoCo dataset are provided in the following pages. Note that we annotate the rewrite label for every utterance, including the system responses, even though the system-side rewrites are not used in our experiments.

MuDoCo Grading Guideline

Overview
In this project you will be given a conversation between a user and a virtual assistant. Certain parts of the utterances, both user and assistant, might require previous context to be fully understood. The goal of the project is to make minimal changes to the current turn so that it can be understood independently, without needing access to the prior context. This may mean one or more of several things.