Incomplete Utterance Rewriting as Semantic Segmentation

In recent years, the task of incomplete utterance rewriting has attracted increasing attention. Previous works usually cast it as a machine translation task and employ sequence-to-sequence architectures with a copy mechanism. In this paper, we present a novel and extensible approach which formulates it as a semantic segmentation task. Instead of generating from scratch, this formulation introduces edit operations and casts the problem as the prediction of a word-level edit matrix. Benefiting from its ability to capture both local and global information, our approach achieves state-of-the-art performance on several public datasets. Furthermore, our approach is four times faster than the standard approach at inference.


Introduction
Dramatic progress has been achieved in single-turn dialogue modeling such as open-domain response generation (Shang et al., 2015) and question answering (Rajpurkar et al., 2016). By contrast, multi-turn dialogue modeling is still in its infancy, as users tend to use incomplete utterances which omit or refer back to entities or concepts that appeared in the dialogue context, namely ellipsis and coreference. According to previous studies, ellipsis and coreference exist in more than 70% of utterances (Su et al., 2019), so a dialogue system must be equipped with the ability to understand them. To tackle the problem, early works include learning a hierarchical representation (Serban et al., 2017; Zhang et al., 2018) and concatenating the dialogue utterances selectively (Yan et al., 2016). Recently, researchers have focused on a more explicit and explainable solution: the task of Incomplete Utterance Rewriting (IUR, also known as context rewriting) (Kumar and Joshi, 2016; Su et al., 2019;

Table 1: An example dialogue between user A and B, including the context utterances (x_1, x_2), the incomplete utterance (x_3) and the rewritten utterance (x*_3).

Turn    | Utterance (Translation)
x_1 (A) | 北京今天天气如何 (How is the weather in Beijing today)
x_2 (B) | 北京今天是阴天 (Beijing is cloudy today)
x_3 (A) | 为什么总是这样 (Why is it always like this)
x*_3    | 北京为什么总是阴天 (Why is Beijing always cloudy)
Pan et al., 2019; Elgohary et al., 2019). IUR aims to rewrite an incomplete utterance into one that is semantically equivalent yet self-contained, so that it can be understood without the context. As shown in Table 1, the incomplete utterance x_3 not only omits the subject "北京" (Beijing), but also refers to the semantics of "阴天" (cloudy) via "这样" (this). By explicitly recovering the hidden semantics behind x_3 into x*_3, IUR makes downstream dialogue modeling more precise.
To deal with IUR, a natural idea is to transfer models from coreference resolution (Clark and Manning, 2016). However, this idea is not easy to realize, as ellipsis also accounts for a large proportion of cases. Despite their differences, coreference and ellipsis can both be resolved without introducing out-of-dialogue words in most cases; that is, the words of the rewritten utterance come almost entirely from either the context utterances or the incomplete utterance. Based on this observation, most previous works employ the pointer network (Vinyals et al., 2015) or sequence-to-sequence models with a copy mechanism (Gu et al., 2016; See et al., 2017). However, they generate the rewritten utterance from scratch, neglecting a key trait: the main structure of a rewritten utterance is always the same as that of the incomplete utterance. To exploit this, we view the rewritten utterance as the outcome of a series of edit operations (i.e. substitute and insert) on the incomplete utterance. Taking the example from Table 1, x*_3 can be obtained by substituting "这样" (this) in x_3 with "阴天" (cloudy) from x_2 and inserting "北京" (Beijing) before "为什么" (Why), which is much easier than producing x*_3 by decoding word by word. These edit operations are carried out between word pairs of the context utterances and the incomplete utterance, analogous to semantic segmentation (a well-known task in computer vision): given relevance features between word pairs as an image, the model predicts the edit type for each word pair as a pixel-level mask (elaborated in Section 3). Inspired by the above, in this paper we propose a novel and extensible approach which formulates IUR as semantic segmentation. Our contributions are as follows: • As far as we know, we are the first to present such a highly extensible approach which formulates incomplete utterance rewriting as a semantic segmentation task.
• Benefiting from being able to capture both local and global information, our approach achieves state-of-the-art performance on several datasets across different domains and languages.
• Furthermore, our model predicts the edit operations in parallel, and thus obtains a much faster inference speed than traditional methods.

Related Work
The most related work to ours is the line of incomplete utterance rewriting. Recently, it has attracted considerable attention in several domains. In question answering, previous works include non-sentential utterance resolution using a sequence-to-sequence architecture (Kumar and Joshi, 2016), incomplete follow-up question resolution via a retrieval sequence-to-sequence model (Kumar and Joshi, 2017), and sequence-to-sequence models with a copy mechanism (Elgohary et al., 2019; Quan et al., 2019). In conversational semantic parsing, Liu et al. (2019b) proposed a novel approach which considers the structures of questions. Different from all of them, we formulate the task as a semantic segmentation task. Our work is also closely related to coreference resolution, an active task that has been studied for years, where deep-learning-based methods have achieved state-of-the-art performance via the paradigm of scoring span or mention pairs (Clark and Manning, 2015, 2016; Lee et al., 2017, 2018). Researchers have also explored unsupervised contextualized representations to enhance coreference resolution: Joshi et al. (2019) applied SpanBERT (Joshi et al., 2020) to enhance span representations in coreference resolution, and Wu et al. (2020) formulated coreference resolution as query-based span prediction, employing SpanBERT to solve it as a machine reading task. The above works focus only on coreference resolution, while our work deals with both coreference and ellipsis under a unified approach.
In terms of methodology, our work is related to the directions of edit-based text generation and semantic similarity measurement. One line of work proposed a prototype-then-edit paradigm for open-domain response generation, while Malmi et al. (2019) cast text generation as a text editing task and tackled it with a sequence tagging approach. Our work differs from theirs in that we model the editing process between two sentences as a semantic segmentation task. As for semantic similarity measurement, similar to us, both He and Lin (2016) and Pang et al. (2016) used convolutional neural networks to capture similarities between sentences.

Incomplete Utterance Rewriting as Semantic Segmentation
In this section, we outline the fundamental idea behind our approach: incomplete utterance rewriting as semantic segmentation. In a multi-turn dialogue, given the context utterances (x_1, ..., x_{t-1}) and the incomplete utterance x_t, IUR is to rewrite x_t into x*_t using contextual information. The rewritten utterance x*_t carries the same meaning as x_t, but its semantics are self-contained, so it can be understood on its own. To produce x*_t, our approach formulates the problem as a semantic segmentation task. Concretely, we concatenate all the context utterances to produce an M-length word sequence c = (c_1, c_2, ..., c_M). To separate context utterances from different turns, we insert a special word [S] between them. Meanwhile, the incomplete utterance is denoted by x = (x_1, x_2, ..., x_N). As mentioned, the rewritten utterance x* can be obtained by editing the incomplete utterance x with in-dialogue words (i.e. words in c). To model edit operations between x and c, we define an M × N matrix Y, where the entry Y_{m,n} indicates the edit type between c_m and x_n. There are three edit types: Substitute means replacing a span in x with the corresponding context span in c; Insert means inserting the context span before a certain token in x; and None represents no operation. For example, as shown schematically in Figure 1, we can edit x by replacing (x_2, x_3) with (c_2, c_3, c_4) and inserting (c_6, c_7, c_9, c_10) before x_7. Note that we append a special word [E] to x to enable Insert to take place after the last word of x. More concrete examples can be found in Section 5.3.
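As a toy illustration, the edit matrix for a shortened version of this schematic example could be laid out as follows (the 0/1/2 label encoding for None/Substitute/Insert and the matrix sizes are our assumptions for the sketch, not details from the paper):

```python
# Toy word-level edit matrix Y (assumed labels: 0 = None, 1 = Substitute,
# 2 = Insert). Indices follow a shortened version of the schematic example:
# replace (x2, x3) with (c2, c3, c4) and insert (c6, c7) before x7.
M, N = 7, 7  # |c| and |x| (x includes the special [E] token); sizes are illustrative
Y = [[0] * N for _ in range(M)]

# Substitute: rows are context words c2..c4, columns are x2..x3 (0-indexed)
for m in range(1, 4):        # c2, c3, c4
    for n in range(1, 3):    # x2, x3
        Y[m][n] = 1

# Insert: context words c6, c7 are inserted before x7
for m in range(5, 7):        # c6, c7
    Y[m][6] = 2              # column of x7

substitute_cells = sum(row.count(1) for row in Y)
insert_cells = sum(row.count(2) for row in Y)
```

Each rectangular block of identical labels corresponds to one edit operation on a (context span, utterance span) pair.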
Then, we propose to emit such a matrix Y in a way analogous to semantic segmentation. Specifically, we build an M × N feature map by capturing the word-to-word relevance between c and x. Taking the feature map as an image, the output word-level edit matrix Y is parallel to the pixel-level mask in semantic segmentation, which bridges IUR and semantic segmentation. Such a formulation comes with several key advantages: (i) Easy: compared with traditional methods that generate the rewritten utterance from scratch, it introduces edit operations to lower the difficulty of generation; (ii) Fast: the edits are predicted concurrently, so our model naturally enjoys faster inference than conventional models that decode word by word; (iii) Transferable: taking the formulation as a bridge between IUR and semantic segmentation, one can easily transfer well-established models from the semantic segmentation community.
As shown in Figure 2, our approach first obtains the word-level edit matrix through three neural layers. Then, based on the word-level edit matrix, it applies a generation algorithm to produce the rewritten utterance. Since the model has a U-shaped architecture (illustrated later), we name our approach Rewritten U-shaped Network (RUN).

Word-level Edit Matrix Construction
To construct a word-level edit matrix, our model passes through three neural layers: a context layer, an encoding layer and a subsequent segmentation layer. The context layer produces a context-aware representation for each word in both c and x, based on which the encoding layer forms a feature map matrix F to capture word-to-word relevance. Finally a segmentation layer is applied to emit the word-level edit matrix.
Context Layer As shown on the left of Figure 2, the concatenation of c and x first passes through the word embedding φ to obtain a representation for each word in both utterances. The embedding is initialized with GloVe (Pennington et al., 2014) and then updated along with the other parameters. On top of the joint word embedding sequence (φ(c_1), ..., φ(c_M), φ(x_1), ..., φ(x_N)), a Bidirectional Long Short-Term Memory network (BiLSTM) (Hochreiter and Schmidhuber, 1997; Schuster and Paliwal, 1997) is applied to capture contextual information both within and across utterances. Although c and x are jointly encoded by the BiLSTM (see the left of Figure 2), below we distinguish their hidden states for clarity: the hidden state of a word c_m (m = 1, ..., M) in c is denoted by u_m, while h_n denotes that of a word x_n (n = 1, ..., N) in the incomplete utterance.
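A minimal PyTorch sketch of this layer follows; the joint encoding and the 100/200 embedding/hidden sizes come from the paper, while the vocabulary size and the random inputs are placeholders:

```python
import torch
import torch.nn as nn

# Sketch of the context layer: a shared embedding followed by a BiLSTM that
# jointly encodes the concatenated context c and incomplete utterance x.
vocab_size, emb_dim, hidden = 1000, 100, 100  # 2 * hidden = 200 (bidirectional)
embedding = nn.Embedding(vocab_size, emb_dim)
bilstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)

c_ids = torch.randint(0, vocab_size, (1, 5))   # M = 5 context words (toy input)
x_ids = torch.randint(0, vocab_size, (1, 3))   # N = 3 utterance words
joint = torch.cat([c_ids, x_ids], dim=1)       # joint encoding, as described

states, _ = bilstm(embedding(joint))           # (1, M + N, 2 * hidden)
u = states[:, :5, :]   # hidden states u_m for the context words
h = states[:, 5:, :]   # hidden states h_n for the utterance words
```

The split into `u` and `h` mirrors the notational distinction in the text; the underlying states come from one shared BiLSTM pass.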
Encoding Layer On top of the context-aware hidden states, we consider several similarity functions to encode the word-to-word relevance. Concretely, for each word x_n in the incomplete utterance and c_m in the context utterances, their relevance is captured by a D-dimensional feature vector F(x_n, c_m), produced by concatenating the element-wise similarity (Ele Sim.), cosine similarity (Cos Sim.) and learned bi-linear similarity (Bi-Linear Sim.) between them:

F(x_n, c_m) = [u_m ⊙ h_n ; cos(u_m, h_n) ; u_m^T W h_n]

where W is a learned parameter and ⊙ denotes the element-wise product. These similarity functions are expected to model word-to-word relevance from different perspectives, which is important for the subsequent edit type classification. However, they concentrate on local rather than global information (see the discussion in Section 5.3). To capture global information, a segmentation layer is proposed.
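The three similarities could be computed per word pair as in the following sketch (written with explicit loops for clarity; the stacking order of the features and the resulting feature dimension D = d + 2 are our assumptions):

```python
import numpy as np

def feature_map(U, H, W):
    """Sketch of the encoding layer.
    U: (M, d) context hidden states, H: (N, d) utterance hidden states,
    W: (d, d) learned bi-linear parameter. Returns F of shape (M, N, d + 2)."""
    M, d = U.shape
    N = H.shape[0]
    F = np.zeros((M, N, d + 2))
    for m in range(M):
        for n in range(N):
            ele = U[m] * H[n]                                   # Ele Sim.
            cos = U[m] @ H[n] / (np.linalg.norm(U[m])            # Cos Sim.
                                 * np.linalg.norm(H[n]) + 1e-8)
            bil = U[m] @ W @ H[n]                                # Bi-Linear Sim.
            F[m, n] = np.concatenate([ele, [cos], [bil]])
    return F
```

In a real implementation the loops would be replaced by batched tensor operations, but the per-cell feature content is the same.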
Segmentation Layer Taking the feature map matrix F ∈ R^{M×N×D} as a D-channel image, the segmentation layer predicts the word-level edit matrix Y ∈ R^{M×N}, analogous to a pixel-level mask. Inspired by UNet (Ronneberger et al., 2015), the layer has a U-shaped structure: two down-sampling blocks and two up-sampling blocks with skip connections. A down-sampling block contains two separate "Conv" modules and a subsequent max pooling; each down-sampling block doubles the number of channels. Intuitively, the down-sampling blocks expand the receptive field of each cell, providing rich global information for the final decision. An up-sampling block contains two separate "Conv" modules and a subsequent deconvolution; each up-sampling block halves the number of channels and concatenates the correspondingly cropped feature map from down-sampling to its output (the skip connection in Figure 2). Finally, a feed-forward neural network maps each feature vector to one of the three edit types, yielding the word-level edit matrix Y. By incorporating both an encoding layer and a segmentation layer, our model is able to capture both local and global information.
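The structure can be sketched in PyTorch as below, using a single down/up stage rather than the paper's two, and made-up channel sizes; it is a schematic of the UNet-style idea, not the authors' exact network:

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # two separate "Conv" modules, as in each (up/down-)sampling block
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(),
    )

class TinySegNet(nn.Module):
    """UNet-style segmentation layer: down-sample, up-sample, skip-connect."""
    def __init__(self, d_in, base=8, n_types=3):
        super().__init__()
        self.down1 = conv_block(d_in, base)
        self.down2 = conv_block(base, base * 2)   # doubles the channels
        self.pool = nn.MaxPool2d(2)               # expands receptive fields
        self.up = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.fuse = conv_block(base * 2, base)    # after skip concatenation
        self.head = nn.Conv2d(base, n_types, 1)   # per-cell edit-type logits

    def forward(self, f):                          # f: (B, D, M, N)
        d1 = self.down1(f)
        d2 = self.down2(self.pool(d1))
        u1 = self.fuse(torch.cat([self.up(d2), d1], dim=1))  # skip connection
        return self.head(u1)                       # (B, 3, M, N)

logits = TinySegNet(d_in=6)(torch.randn(1, 6, 8, 8))
```

Taking the arg-max over the 3 logit channels per cell would yield the predicted edit matrix Y.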
BERT Enhanced Embedding Since pretrained language models have proven effective on several tasks, we also experiment with employing BERT (Devlin et al., 2019) to augment our model via a BERT-enhanced embedding.

Rewritten Utterance Generation
Once a word-level edit matrix is emitted, a subsequent generation algorithm is applied to produce the rewritten utterance. As indicated in Figure 1, to apply edit operations without ambiguity, we assume each edit region in Y is a rectangle. However, the predicted Y is not guaranteed to meet this requirement, indicating the need for a standardization step. Therefore, the overall generation procedure is divided into two stages: first, the algorithm delimits standard edit regions by searching for the minimal covering rectangle of each connected region; then it manipulates the incomplete utterance based on these standard edit regions to produce the rewritten utterance. Since the second step has been illustrated in Section 3, in the following we concentrate on the first, standardization step.
In the standardization step, we employ the two-pass algorithm (also known as the Hoshen-Kopelman algorithm) to find connected regions (Hoshen and Kopelman, 1976). In a nutshell, the algorithm makes two passes over the word-level edit matrix. The first pass assigns temporary cluster labels and records equivalences between clusters, scanning left to right and top to bottom. Concretely, for each cell, if any of its neighbors (i.e. the left or top cell with the same edit type) has been assigned a temporary cluster label, the cell receives the smallest neighboring label, and its neighboring clusters are recorded as equivalent. Otherwise, a new temporary cluster label is created for the cell. The second pass merges the temporary cluster labels recorded as equivalent. Finally, cells with the same label form a connected region. For each connected region, we use its minimal covering rectangle as the output of our model.
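A compact sketch of this standardization step follows, using a union-find in place of the explicit equivalence table; the rectangle output format (row range, column range, edit type) is our own choice for the example:

```python
# Two-pass connected-component labeling over the predicted edit matrix Y,
# followed by the minimal covering rectangle of each component. Cells connect
# to their left/top neighbors of the same (non-None) edit type.
def find(parent, a):
    while parent[a] != a:
        parent[a] = parent[parent[a]]   # path halving
        a = parent[a]
    return a

def covering_rectangles(Y, none_label=0):
    M, N = len(Y), len(Y[0])
    labels, parent, next_label = {}, {}, 0
    for m in range(M):                    # first pass: temporary labels
        for n in range(N):
            if Y[m][n] == none_label:
                continue
            neigh = [labels[p] for p in ((m, n - 1), (m - 1, n))
                     if p in labels and Y[p[0]][p[1]] == Y[m][n]]
            if neigh:
                labels[(m, n)] = min(neigh)
                for a in neigh:           # record equivalences via union-find
                    parent[find(parent, a)] = find(parent, min(neigh))
            else:
                labels[(m, n)] = parent[next_label] = next_label
                next_label += 1
    rects = {}                            # second pass: merge + bounding boxes
    for (m, n), lab in labels.items():
        root = find(parent, lab)
        r = rects.setdefault(root, [m, m, n, n, Y[m][n]])
        r[0], r[1] = min(r[0], m), max(r[1], m)
        r[2], r[3] = min(r[2], n), max(r[3], n)
    return [tuple(r) for r in rects.values()]
```

Each returned tuple (m_min, m_max, n_min, n_max, type) describes one standardized edit region ready to be applied to the incomplete utterance.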

Distant Supervision
As mentioned in Section 3, the expected supervision for our model is the word-level edit matrix, but existing datasets only contain rewritten utterances. Therefore, we use a procedure to automatically derive (noisy) word-level edit matrices, i.e. distant supervision, and use these examples to train our model. We build our training set as follows. First, we find a Longest Common Subsequence (LCS) between x and x*. Then, each word in x* that is not in the LCS is marked as ADD; conversely, each word in x that is not in the LCS is marked as DEL. Contiguous words with the same mark are merged into one span. By a span-level comparison, any ADD span in x* with a DEL counterpart (i.e. under the same context) corresponds to Substitute. Otherwise, the ADD span is inserted into x, corresponding to Insert.
Taking the example from Table 1, given x as "为什么总是这样" (Why is it always like this) and x* as "北京为什么总是阴天" (Why is Beijing always cloudy), their longest common subsequence is "为什么总是" (Why is it always). Therefore, with "这样" (this) in x marked as DEL and "阴天" (cloudy) in x* marked as ADD, the two spans correspond to the edit type Substitute. In contrast, since "北京" (Beijing) has no counterpart, it is related to the edit type Insert.
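The labeling procedure can be approximated with the standard library's difflib, whose matching blocks play the role of the LCS (an approximation: difflib's heuristics do not always return a true longest common subsequence, and this sketch ignores pure deletions):

```python
from difflib import SequenceMatcher

def derive_edits(x, x_star):
    """Derive noisy edit labels from an (x, x*) pair, roughly as described."""
    sm = SequenceMatcher(a=x, b=x_star, autojunk=False)
    edits = []
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "replace":       # ADD span with a DEL counterpart
            edits.append(("Substitute", x[i1:i2], x_star[j1:j2]))
        elif op == "insert":      # ADD span with no counterpart
            edits.append(("Insert", [], x_star[j1:j2]))
        # "equal" spans (and, in this sketch, "delete" spans) get no label
    return edits

edits = derive_edits(["为什么", "总是", "这样"],
                     ["北京", "为什么", "总是", "阴天"])
```

On the Table 1 example this recovers exactly the Substitute and Insert operations described above.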

Experiments
In this section, we conduct thorough experiments to demonstrate the superiority of our approach. We use the same data split for each dataset as its original paper, and some statistics are shown in Table 3.

Experimental Setup
Baselines We consider a range of baselines, including LSTM-based models, Transformer-based models and the state-of-the-art models on each dataset. (i) LSTM-based models consist of the vanilla sequence-to-sequence model with attention (L-Gen) (Bahdanau et al., 2015) and the pointer network architecture (L-Ptr) (Vinyals et al., 2015), among others. We refer readers to the respective papers for more details. Notably, the above methods all generate rewritten utterances from scratch.
Table 3: Statistics of the different datasets. NA means the development set is also the test set. "Ques" is short for questions, "Avg" for average, "len" for length, "Con" for context utterance, "Cur" for current utterance, and "Rew" for rewritten utterance.

Evaluation We employ both automatic metrics and human evaluations to evaluate our approach. Following the literature (Pan et al., 2019), we examine RUN using the widely used automatic metrics BLEU, ROUGE, EM and Rewriting F-score. (i) BLEU_n (B_n) evaluates how similar the rewritten utterances are to the golden ones via the cumulative n-gram BLEU score (Papineni et al., 2002). (ii) ROUGE_n (R_n) measures the n-gram overlap between the rewritten utterances and the golden ones, while ROUGE_L (R_L) measures the longest matching sequence between them (Lin, 2004). (iii) EM stands for the exact match accuracy, which is the strictest evaluation metric.
(iv) Rewriting Precision_n, Recall_n and F-score_n (P_n, R_n, F_n) place more emphasis on words from c, which are argued to be harder to copy (Pan et al., 2019). Therefore, they are calculated on the collection of n-grams that contain at least one word from c. As validated by Pan et al. (2019), the above automatic metrics are credible indicators of rewrite quality. However, none of the automatic metrics reflects utterance fluency or the improvement on downstream tasks. Therefore, human evaluations are included to evaluate the fluency of rewritten utterances and their boost to the downstream task.
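Our reading of this restricted n-gram metric can be sketched as follows (the exact clipping and tokenization in the official evaluation script may differ):

```python
from collections import Counter

def rewriting_f(pred, gold, context_words, n=1):
    """Precision/recall/F over n-grams containing at least one context word.
    pred, gold: token lists; context_words: set of words from c."""
    def grams(seq):
        return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)
                       if any(w in context_words for w in seq[i:i + n]))
    p, g = grams(pred), grams(gold)
    overlap = sum((p & g).values())          # clipped n-gram matches
    prec = overlap / max(sum(p.values()), 1)
    rec = overlap / max(sum(g.values()), 1)
    f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f
```

Restricting to context-containing n-grams is what makes the metric sensitive to whether the model actually copied the right material from c.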
Implementation Details Our implementation is based on PyTorch (Paszke et al., 2019), AllenNLP (Gardner et al., 2018) and HuggingFace's transformers library (Wolf et al., 2019). Since the distribution of edit types is severely unbalanced (e.g. None accounts for nearly 90%), we employed a weighted cross-entropy loss and tuned the weights on the development sets. We used Adam (Kingma and Ba, 2015) to optimize our model, with a learning rate of 1e-3 (1e-5 for BERT). The embedding size and the BiLSTM hidden size are 100 and 200, respectively. BERT mentioned above refers to BERT_base. All baseline results without specific marks were reproduced by us using OpenNMT with a beam size of 4 (Klein et al., 2017).
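The class-weighted loss can be illustrated as follows (the weight values here are placeholders; as stated above, the paper tunes them on the development sets):

```python
import numpy as np

def weighted_ce(logits, targets, weights):
    """Class-weighted cross-entropy over edit-matrix cells.
    logits: (n_cells, 3), targets: (n_cells,), weights: (3,) per-class weights."""
    z = logits - logits.max(axis=1, keepdims=True)            # stable softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    w = weights[targets]                                      # per-cell weight
    return -(w * log_probs[np.arange(len(targets)), targets]).sum() / w.sum()
```

Up-weighting the rare Substitute/Insert classes (and down-weighting None) counteracts the roughly 90% dominance of None cells.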
Connection Words Similar to the pointer network (Vinyals et al., 2015), RUN is restricted to predicting words which have appeared in the dialogue.
Although most examples work well under this restriction, there remain a few cases that rely on certain extra words to form fluent utterances. For example, when rewriting possessive pronouns such as "their", we usually need an extra word "of" to keep the utterance fluent. Such common words, which we call connection words, improve the fluency of the rewritten utterances. In practice, we append a small list of connection words to the tail of c, enabling our model to pick connection words as well. For each dataset, the connection word list is automatically derived from the training data.

Table 7: Human rating evaluations of response quality on 300 sampled dialogues from the development set of MULTI. Scores range from 0 to 2. "NR" denotes the proportion of rewritten utterances that are identical to the current utterances.

For example, our approach exceeds the best baseline L-Ptr-Gen by a large margin, reaching new state-of-the-art performance on almost all automatic metrics. To illustrate, our approach improves on the previous best model by 6.4 points and 10.0 points on B_1 and F_1, respectively. Furthermore, our approach leaves a striking impression when augmented with BERT. It not only fully surpasses the best sequence generation baseline with BERT (i.e. T-Ptr-λ + BERT on REWRITE), but also obtains a considerable boost over a cascade model designed to unlock the potential of BERT (i.e. PAC on MULTI). Even on the most challenging metric, EM on REWRITE, RUN with BERT improves by 8.9 points, demonstrating the superiority of our model. Besides, our approach also achieves comparable or better results than all baselines on TASK and CANARD, as shown in Table 5.

Model Comparison
Besides the automatic results, we perform two groups of human evaluation to answer (i) how fluent the rewritten utterances are and (ii) how much IUR can contribute to downstream tasks. For the evaluation of fluency, we randomly sampled 500 dialogues from the development set of REWRITE, fed them to representative IUR models, and presented the generated rewritten utterances to 10 judges, who were asked to decide which rewritten utterance is more fluent in pairwise comparisons. Ties are allowed. Table 6 shows the evaluation results. In comparison to the best baseline T-Ptr-λ, our model only loses in 20.4% of cases, which is highly competitive.
To assess the influence of IUR on downstream tasks, we choose multi-turn response selection as a representative task, which aims to retrieve suitable responses from a candidate pool given the context. Concretely, an SMN model trained on the Douban Conversation Corpus is selected as the backbone for multi-turn response selection (Wu et al., 2017). First, we sampled 300 dialogues from the development set of MULTI as input to the IUR models. Then their predicted rewritten utterances and the context utterances were fed into the SMN model to help it select suitable responses. The response candidate pool was formed by all utterances in MULTI. Finally, 5 workers were asked to rate the responses on a scale from 0 to 2: 0 means the response is not related to the dialogue; 1 means the response is related but not interesting enough; and 2 means the response is satisfying. For a clearer comparison, we also conduct the human rating evaluation on responses under the settings of the original dialogue (i.e. without rewriting, relying on the SMN model itself to understand the context) and the gold dialogue (i.e. human rewriting). As shown in Table 7, our model achieves the highest response quality score among IUR models, improving over the original setting by 19% relatively. Considering that the SMN model is itself capable of aggregating implicit context information, it is non-trivial for our model to further improve the response quality.

Table 9: Ablation results on the development set of MULTI. "w/o Edit" means directly using the current utterance as the rewritten utterance. "w/o U-shape Seg." means our segmentation layer is replaced by a feed-forward neural network with comparable parameters. The remaining variants ablate different similarity functions in the encoding layer.

Closer Analysis
We conduct a series of experiments to analyze our model more deeply. First, we compare the inference speed of our model and representative baselines under the same run-time environment. Then, we verify the effectiveness of the components of our model through a thorough ablation study. Meanwhile, we examine how the number of connection words affects performance. Finally, we present two real cases to illustrate our model concretely.

Inference Speed Table 8 compares the inference speed of our model and the baselines. Since L-Ptr-λ and T-Ptr-λ are not implemented in PyTorch, we do not report their inference time, for fairness. Noting that the beam size affects the inference time of the baselines, we also report results with a beam size of 1. Using the simplest L-Gen as a standard, one can find that our model is nearly four times faster, with the highest improvement ΔB_4. Meanwhile, our model is the only one that improves both performance and inference speed, significantly surpassing all baselines.

Ablation Study To verify the effectiveness of the different components of our model, we present a thorough ablation study in Table 9. As expected, "w/o Edit" causes a huge drop on all evaluation metrics. Notably, the extreme drop on F_n indicates that this metric is better suited to IUR than the common metrics. "w/o U-shape Seg.", which ablates the segmentation layer, also brings a large performance drop. Without our segmentation layer capturing global information, the encoding layer alone only achieves performance comparable to L-Gen, suggesting there are considerable benefits in bridging IUR and semantic segmentation. We also ablate the different feature similarity functions (i.e. "w/o Ele Sim.", "w/o Cos Sim." and "w/o Bi-Linear Sim.") for an in-depth analysis. As shown in Table 9, ablating any similarity function hurts most metrics. Meanwhile, our model does not depend heavily on any single similarity function, showing its robustness. Furthermore, we explore how the number of connection words affects performance in Figure 3. As indicated, except on TASK, the number of connection words has only a slight effect. Nevertheless, it shows a positive effect overall, providing a way to generate out-of-dialogue words. Finally, we present two real cases from REWRITE in Figure 4 to illustrate the rewriting process of our model concretely. For both (a) coreference and (b) ellipsis, our model handles the case flexibly.

Discussion
While our approach makes notable progress, it still has several limitations. First, our model relies heavily on the word order implied by the dialogue, which makes it vulnerable to some complex cases (e.g. multiple Insert operations at the same position). Second, we predict the edit type of each cell independently, ignoring the relationship between neighboring edit types; this could potentially be addressed with a conditional random field (Arnab et al., 2018). These limitations may raise concerns about the performance upper bound of our approach. In practice, this is not a major issue: on three of the four datasets used in the experiments, more than 85% of examples can be handled perfectly by our approach (87.6% on TASK, 91.0% on REWRITE, 95.3% on MULTI). The number on CANARD is relatively low (42.5%) because human annotators introduce many new words when rewriting. Nevertheless, the BLEU upper bound on CANARD can be as high as 72.5% with our approach, which is acceptable.
The last point we discuss is why similarities are good features for determining edits. We believe this can be explained from two aspects. For coreference, the similarity function is well suited to identifying whether two spans refer to the same entity. For ellipsis, the similarity function is an effective indicator for finding matching anchors, which suggest the possible insertion positions.

Conclusion & Future Work
In this paper, we present a novel and extensible approach which formulates incomplete utterance rewriting as a semantic segmentation task. On top of this formulation, we carefully design a U-shaped rewriting network, which significantly outperforms existing baselines on several datasets. In the future, we will investigate extending our approach to more areas.