Redundancy Localization for the Conversationalization of Unstructured Responses

Conversational agents offer users a natural-language interface to accomplish tasks, entertain themselves, or access information. Informational dialogue is particularly challenging in that the agent has to hold a conversation on an open topic, and to achieve reasonable coverage it generally needs to digest and present unstructured information from textual sources. Making responses based on such sources sound natural and fit appropriately into the conversation context is a topic of ongoing research, a key issue being to prevent the agent's responses from sounding repetitive. Targeting this issue, we propose a new task, which we call redundancy localization, aiming to pinpoint semantic overlap between text passages. To help address it systematically, we formalize the task, prepare a public dataset with fine-grained redundancy labels, and propose a model utilizing a weak training signal defined over the results of a passage-retrieval system on web texts. The proposed model demonstrates superior performance compared to a state-of-the-art entailment model and yields encouraging results when applied to a real-world dialogue.


Introduction
Recent years have seen a growing interest in research on conversational agents. Several strands of dialogue systems have emerged which differ in underlying goals and methods. Some systems focus on data-driven learning of models which can autonomously hold conversations with humans or one another, potentially even on open domains (Vinyals and Le, 2015; Sordoni et al., 2015).

* Work performed during an internship at Google.

User: What is Malaria?
Agent: A disease caused by a plasmodium parasite, transmitted by the bite of infected mosquitoes.

User: Is it a virus?
Agent: Malaria is a parasitic infection spread by Anopheles mosquitoes. The Plasmodium parasite that causes Malaria is neither a virus nor a bacterium - it is a single-celled parasite that multiplies in red blood cells of humans as well as in the mosquito intestine.

Figure 1: Example informational dialogue; the underlined part of the second agent reply reiterates information from the first reply.

Other works deal with task-oriented dialogues, which offer natural-language interfaces to real-world services like restaurant booking (Dhingra et al., 2016). We focus in this paper on a third dialogue setting where the goal is to have a natural conversation with a user, during which the user's information needs are satisfied in an iterative manner. Such a setting is common in question-answering experiences implemented in personal digital assistants. We call this setting informational dialogues. They start with the user posing a fact-seeking question, e.g., to learn about current events or to explore unknown terms and concepts. Consider the example dialogue in Fig. 1, which is initiated by the user requesting a definition of a specific disease and which also features a subsequent question on the same topic. Many approaches have been proposed which can produce suitable replies to such questions. Examples include techniques which find pertinent passages or short text chunks in collections of documents (Miller et al., 2016; Trischler et al., 2016) or find relevant entries in structured knowledge bases (Bordes et al., 2014, 2015; Yin et al., 2016a,b). Generation techniques can then be employed to produce well-formed natural-language utterances from the candidate replies (Wen et al., 2015, 2016a; Zhou et al., 2016; Dušek and Jurcicek, 2016). In the dialogue in Fig. 1, both agent replies are coherent with respect to the questions. However, they sound strange when occurring together in a single dialogue context because information is partially reiterated (see the underlined part in the second agent reply).
It is this very problem that we focus on in this work, i.e., the localization of redundancy in conversation. Information on the location of non-novel portions of a passage could either be fed back to the retrieval model, so that only text passages with new information would be selected, or alternatively this localized redundancy might be used as input to a summarization model (Rush et al., 2015).
The specific contributions of this work are as follows:
• We propose a new task, motivated by practical issues that dialogue applications face (Sec. 3).
• We release a new dataset with manual annotations for this task, which allows competing approaches to be evaluated and compared (Sec. 4).
• Since the amount of annotated data is insufficient for supervised training, we derive a weak supervision signal over a large collection of passages with partially redundant content (Sec. 5).
• We augment a recently introduced entailment model (Parikh et al., 2016) with means for representing local similarities in passages in a unidirectional way (Sec. 6) and find that this extension outperforms the original model (Sec. 8).
• Furthermore, we briefly discuss an experiment on real-world dialogue data (Sec. 9), which gives insights on the application-relevance of the proposed task and model.

Related Work
Much work has addressed reasoning with short texts for similarity and entailment tasks. Knowledge-rich approaches define lexical and syntactic inference rules over phrase pairs and employ decision algorithms that rely on matches of these rules in input texts (Magnini et al., 2014). Other approaches generate structured representations of the input to enable sophisticated alignment of the texts, drawing on rich lexical, syntactic, and semantic information. The use of kernel methods for similarity tasks has also been reported (Filice et al., 2015). In contrast to these approaches, we neither use external knowledge nor build explicit syntactic representations of input texts. Sentence fusion (Barzilay and McKeown, 2005; Filippova and Strube, 2008) is a technique that is related to the overall problem setting of this paper. It is used in the context of abstractive multi-document summarization, where a particular challenge is to identify shared content in a cluster of sentences and to subsequently produce a single sentence that covers all information fragments. We focus on a related but distinct problem formulation, in which we fix one text fragment and want to find reiterations of its content in other texts. Furthermore, we focus on identifying and localizing redundancy and leave the generation of low-redundancy text mostly as future work.
Neural approaches are common for bi-sequence classification problems (Laha and Raykar, 2016). Yin and Schütze (2015), He et al. (2015), and He and Lin (2016) use convolutional networks to represent input texts at multiple granularity levels and model their interactions. We also aim to find fine-granular interactions in texts, but, in contrast to their models, we make these interactions explicit rather than leaving them as latent intermediate results. Another line of research has proposed recurrent networks for modeling phrases/sentences, including various forms of neural attention (Bowman et al., 2015; Rocktäschel et al., 2015; Zhao et al., 2016). These approaches come with a high computational cost during training and inference; in contrast, we rely on cheaper feed-forward connections.

Problem Definition
We focus in this work on the problem of redundancy localization in a passage with respect to another text, i.e., we aim to understand when a sub-passage is redundant with what is mentioned in the context.[1] Consider the following example with a context passage c and a follow-up passage p with sub-sequences s0-s3, which need to be ranked according to the extent to which their semantics are covered by c. In this case, one may expect the order to be (s1, s2, s3, s0):

c : The Allianz Arena is a football stadium in Munich, Bavaria, Germany, with a seating capacity of more than 70,000.
s0 : Bayern to increase stadium capacity.
s1 : Bayern Munich have revealed plans to increase the capacity of Allianz Arena to 75,000,
s2 : which would make it the second largest stadium in Germany.
s3 : The Allianz Arena is currently the third largest stadium in Germany.
More formally, let p be a sequence of n tokens. Let S = {s_k}, k = 0, ..., m−1, be a set of m sub-sequences of p such that, for integer boundary positions s_0 = 0 < s_1 < ... < s_{m−1} < s_m = n, each sub-sequence s_k ∈ S ranges from token s_k to token (s_{k+1} − 1), inclusive. Given a context sequence c, the task of redundancy localization is to produce a ranking function rank(s_k) ∈ {1, ..., m} that induces an ordering of the sub-sequences s_k ∈ S of p corresponding to the degree of information in s_k that is semantically covered by c. Here, a low rank corresponds to a high semantic overlap of a sub-sequence with c, and segments are allowed to have equal ranks.
We formulate this task as a ranking problem instead of a more expressive yet also more complex regression setting in order to impose fewer restrictions on the collection of data for training and evaluation. The design decision to rank sub-sequences rather than individual tokens is intended to keep manual annotation feasible and cost-effective.
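To make the task definition concrete, the ranking induced by any scoring function (with equal ranks allowed for ties, as in the definition above) can be sketched as follows; the scores in the example are hypothetical, not model outputs:

```python
# Rank sub-sequences by semantic overlap with the context: rank 1 = most
# redundant. Equal scores receive equal ranks ("competition" ranking).
def rank_subsequences(scores):
    # scores[k] = degree to which sub-sequence s_k is covered by the context c
    return [1 + sum(1 for other in scores if other > s) for s in scores]

# Hypothetical overlap scores for the Allianz Arena example (s0..s3):
scores = [0.1, 0.9, 0.6, 0.4]
print(rank_subsequences(scores))  # -> [4, 1, 2, 3], i.e. the order (s1, s2, s3, s0)
```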
Relation to Other Tasks The problem we pose here is related to bi-sequence problems like semantic textual similarity (STS) (Agirre et al., 2016a) and recognizing textual entailment (RTE) (Bowman et al., 2015). In contrast to these tasks, we are not interested in determining the overall relation between sequences, but aim to generate more fine-grained sub-passage-level information. The task of interpretable semantic textual similarity (Agirre et al., 2016b) requires systems to provide human-understandable explanations for STS ratings of sentence pairs. Chunks from both sentences need to be paired, and for each such pairing, similarity and relation type need to be assessed. While this type of annotation is richer than what we propose, it is also harder to produce, likely requiring specially-trained raters, and would likely be impossible to predict accurately using a surrogate supervision signal like the one we rely on. Besides, it does not scale well beyond single sentences, since the number of ratings per sequence pair grows proportionally to the product of their lengths, while the model we present can handle longer, multi-sentence passages. The setting proposed in the next section is more restricted, but easier to learn and directly applicable in downstream applications.

A Testbed for Redundancy Localization
The evaluation dataset (EVAL) is constructed from pairs of potentially redundant passages from Wikipedia, which were segmented into subpassages and presented to human raters for manual redundancy assessment. The collection of passages was guided by a need for text pairs with various degrees of semantic overlap; we employed a passage-retrieval system for the purpose of text selection. Passage retrieval (Khalid and Verberne, 2008;Aktolga et al., 2011;Xu et al., 2011) is a common intermediate step in information-retrieval and question-answering settings, the goal of which is to return a passage containing the answer to a given query. Most systems generate a list of candidate passages, rank them by relevance and return the top one. We picked a random set of 1200 fact-seeking questions and retrieved corresponding passages from Wikipedia. The questions were then discarded, as they are not relevant to our task. We selected the top-scoring passage as the context c and paired it with a low-scoring one from further down the result list (p). p was then heuristically split into chunks s k , corresponding to verb-governed phrases. The example shown in the last section is an instance of such a pair (c, p).
We asked three raters per item to select for each segment s_k of p one out of three labels: NOTREDUNDANT, PARTIALLYREDUNDANT, and FULLYREDUNDANT, depending on the degree to which the content of a sub-passage is covered by the context c. The annotators fully/partially agreed[2] on 64%/96% of examples; their annotations have an intra-class correlation of .55. We aggregated the ratings by mapping the categorical labels to a numeric scale (0, 1, 2) and averaging the scores. We used 200 examples as a development dataset for the experiments in this paper (DEV), and the remaining 1000 items as a test dataset (TEST). Tab. 2 reports the label distribution in both parts of the dataset. We make the dataset publicly available at https://github.com/kraseb/redundancy-localization.

Tab. 1 (excerpt, example passage triple; underlining in the original marks content shared with c):
c : Brewer's yeast is made from a one-celled fungus called Saccharomyces cerevisiae.
p+ : Brewer's yeast is named so because it comes from the same fungus that's used to ferment and make beer - Saccharomyces cerevisiae.
p− : Because brewer's yeast is a rich source of chromium, scientists think it may help treat high blood sugar.
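The label aggregation described above can be sketched as follows (a minimal sketch; the mapping to (0, 1, 2) follows the text, while the function and label names are illustrative):

```python
# Map the categorical redundancy labels to the numeric scale from the text.
LABELS = {"NOTREDUNDANT": 0, "PARTIALLYREDUNDANT": 1, "FULLYREDUNDANT": 2}

def gold_score(ratings):
    """Aggregate the raters' categorical labels into one numeric gold score."""
    return sum(LABELS[r] for r in ratings) / len(ratings)

print(gold_score(["FULLYREDUNDANT", "FULLYREDUNDANT", "PARTIALLYREDUNDANT"]))
# -> roughly 1.67
```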

Training with a Proxy Signal
While the annotation required for our task is comparatively simple and can be performed by raters without special training, a workable fully-supervised model would require a very considerable amount of data and is likely to prove costly.[3] Suppose, however, we were supplied with a large number of short texts with varying degrees of similarity and relatedness to one another, and we had a means of assessing at the coarse level of text pairs whether or not they were similar. Our hypothesis is that, given appropriate model capacity and structure, a model trained to predict the passage-level similarity would learn to compare smaller units of text to make an appropriate high-level decision. We derive a proxy signal from passage-level retrieval scores which allows us to bootstrap the redundancy-localization model described in Sec. 6.
The model is presented with passage triples, where two passages are very closely related and the third one is on the same general topic, but less similar to the other two and hence likely contains less redundancy. The model is then trained to rank the more closely related passage pairs above the less closely related ones.
We retrieve lists of relevant passages from the web using the same passage-retrieval system that we utilized to collect data for manual annotation. Through manual inspection of a small subset of candidate passage lists, we identified a range of passage scores where candidate passages are topically close to the top-scoring one, but sufficiently different in factual content. To ensure that the top-scoring passage and the lower-scoring one are on the same topic, we further require that they be extracted from the same webpage.
From each query's passage list we extract three passages: the top-scoring passage c, the second-highest-ranking passage p+, and a lower-scoring passage p− from the score corridor described above. The stream of passage triples (c, p+, p−) generated in this way allows training a model with a margin-based ranking objective. This objective enforces that the similarity score of the two high-scoring passages c, p+ is greater than the similarity of the low-scoring passage p− and the top-scoring one, plus a margin; see Sec. 6.3. This pushes a model to find what differentiates two given text sequences, so that it can assign a higher similarity to the near-paraphrases.
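The triple-construction heuristic can be sketched as follows; the corridor bounds and tuple fields are illustrative assumptions (the paper determines the score corridor by manual inspection):

```python
# Build a training triple (c, p_plus, p_minus) from a scored passage list.
# Each candidate is (text, retrieval_score, source_url); the corridor bounds
# below are hypothetical stand-ins for the manually identified score range.
def build_triple(candidates, corridor=(0.3, 0.6)):
    ranked = sorted(candidates, key=lambda x: x[1], reverse=True)
    c, p_plus = ranked[0], ranked[1]
    lo, hi = corridor
    for cand in ranked[2:]:
        # p_minus: moderately similar AND from the same webpage as c,
        # so that it stays on-topic but differs in factual content.
        if lo <= cand[1] <= hi and cand[2] == c[2]:
            return c[0], p_plus[0], cand[0]
    return None  # no suitable low-scoring passage found

cands = [("top passage", 0.95, "page-a"),
         ("near paraphrase", 0.90, "page-a"),
         ("off-topic text", 0.80, "page-b"),
         ("related passage", 0.45, "page-a")]
print(build_triple(cands))
# -> ('top passage', 'near paraphrase', 'related passage')
```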
Tab. 1 shows three example passage triples constructed with this signal; there, underlining visualizes the overlapping/disjoint content between triple elements. Note that we do not make this information available to a model during training. In the interest of brevity, we selected short, single-sentence passages for this example.

[Figure 2: The model's processing steps (a)-(d), from the input passage with redundancy through segmentation of the passage to local and passage-level scores.]

Model Design
This section first gives a brief overview of the proposed model, before going into details of its architecture and use during training and inference time.
Architecture Overview Existing models for bi-sequence tasks (Bahdanau et al., 2014; Rush et al., 2015; He and Lin, 2016) often learn to align texts as an intermediate step, i.e., reasoning is done with pairs of short text units, which makes it possible to build a task-specific output for whole sequences on top of local decisions. A particular example for RTE is the three-layer model of Parikh et al. (2016). The first layer produces a bi-directional alignment between the input sentences, which is utilized by the second component to perform local comparisons, which in turn are fed to the top layer to make the final entailment decision. We follow the same pattern in the design of our model. We implement a multi-component neural network that takes two passages as input. It first (a) learns a uni-directional alignment between the passages, which is utilized to produce a customized representation of the context passage, specific to each token of the potentially redundant passage.
Next, (b) token-level redundancy scores are produced via local comparison operations. During training, (c) an additional layer aggregates the local scores and produces a passage-level similarity score on top of which a ranking objective is applied. At inference time, (d) the local scores from (b) serve as the basis for the ranking of the sub-passage elements as described in Sec. 3. Fig. 2 outlines steps (a)-(d).

Step (a): Alignment
The input to the model consists of two sequences of n tokens each, p = (p_0, ..., p_{n−1}) and c = (c_0, ..., c_{n−1}), with shorter sequences being padded to this length. The goal of this step is to generate for each p_i ∈ p a fixed-length representation c_i^aligned of c, which captures the meaning aspects of c specifically relevant for p_i.
The tokens p_i, c_j are represented via word embeddings of size d_w, which are updated during model training and are stored in a matrix W_w ∈ R^{d_w × |V|}, with V being the vocabulary. For ease of notation, we use p, p_i, c, c_j to refer to both the original tokens and their embedding representations.
We create a soft alignment of c to the tokens of p via the decomposed attention mechanism described by Parikh et al. (2016). At its core is the application of the attention function f1 to each token of the input sequences; f1 is implemented as a feed-forward neural network with h_f1 layers of d_f1 rectified linear units (ReLU; Glorot et al., 2011) each. Using this function, unnormalized attention weights are produced:

    e_ij = f1(p_i)^T f1(c_j),    (1)

then normalized per token in p via

    β_ij = exp(e_ij) / Σ_{k=0}^{n−1} exp(e_ik).    (2)

The customized (aligned) representation of c is then calculated as

    c_i^aligned = Σ_{j=0}^{n−1} β_ij c_j.    (3)
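The alignment step can be sketched in NumPy as follows; this is a minimal illustration in which the attention network f1 is stubbed as the identity (in the model it is a learned multi-layer ReLU network):

```python
import numpy as np

def align(p, c):
    """Uni-directional soft alignment of context c to the tokens of p.

    p: (n, d) embeddings of the potentially redundant passage
    c: (n, d) embeddings of the context passage
    Returns c_aligned: (n, d), one context summary per token p_i.
    For brevity, f1 is the identity here rather than a learned network.
    """
    e = p @ c.T                                    # unnormalized weights e_ij
    e = e - e.max(axis=1, keepdims=True)           # numerical stability
    beta = np.exp(e) / np.exp(e).sum(axis=1, keepdims=True)  # softmax over c
    return beta @ c                                # weighted sum of c's tokens

rng = np.random.default_rng(0)
p, c = rng.normal(size=(5, 8)), rng.normal(size=(5, 8))
c_aligned = align(p, c)
print(c_aligned.shape)  # -> (5, 8)
```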

Step (b): Learning Local Redundancy
Each token p_i from p is compared to the corresponding representation c_i^aligned of the context sequence via a single-layer feed-forward network f2 with a ReLU:

    lsim_i(p, c) = f2([p_i ; c_i^aligned]),    (4)

with [ ; ] being the concatenation operator, and

    lsim(p, c) = (lsim_0(p, c), ..., lsim_{n−1}(p, c)) ∈ R^n.    (5)

This local similarity score measures for each token the degree to which its meaning is covered by c.
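The local comparison can be sketched as follows (a minimal sketch; f2's weights are random here, whereas in the model they are learned):

```python
import numpy as np

def local_redundancy(p, c_aligned, W, b):
    """Per-token redundancy scores lsim(p, c) in R^n.

    Each token embedding p_i is concatenated with its aligned context
    representation and passed through a single ReLU layer (f2) that
    projects to a scalar score per token.
    """
    x = np.concatenate([p, c_aligned], axis=1)   # (n, 2d): [p_i ; c_i^aligned]
    return np.maximum(0.0, x @ W + b).ravel()    # (n,) non-negative scores

rng = np.random.default_rng(1)
n, d = 5, 8
p, c_aligned = rng.normal(size=(n, d)), rng.normal(size=(n, d))
W, b = rng.normal(size=(2 * d, 1)), np.zeros(1)
lsim = local_redundancy(p, c_aligned, W, b)
print(lsim.shape)  # -> (5,)
```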

Step (c): Learning to Aggregate Local Redundancy Scores
As described in Sec. 5, supervised training with local redundancy labels is costly, which is why we add another layer on top which learns to calculate a coarse passage-level similarity score csim(p, c) from the local redundancy information. Given a passage triple (c, p+, p−) (Sec. 5), two such coarse scores are calculated and used to determine a loss which allows us to train steps (a-c) of the network in Fig. 2 in a weakly supervised way. The passage-level score is computed by another feed-forward network f3 with h_f3 layers of d_f3 ReLUs, followed by another hidden layer with a logistic activation function that projects to a scalar value in (0, 1):

    csim(p, c) := f3(lsim(p, c)).    (6)
Then, for a given passage triple (c, p+, p−), the loss is defined as:

    L(c, p+, p−) = max(0, λ − csim(p+, c) + csim(p−, c)),    (7)

with margin hyperparameter λ. This ranking criterion is similar to what has been used by Collobert et al. (2011) and Bordes et al. (2013). It is intended to push the model to assign a higher coarse similarity score to the more similar sequences in the triple and, in doing so, ideally forces the model to learn to detect local redundancies.
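The margin-based ranking objective can be sketched as follows; csim is stubbed with plain numbers, and the default margin value is an illustrative choice in the style of Collobert et al. (2011):

```python
def ranking_loss(csim_pos, csim_neg, margin=1.0):
    """Hinge loss enforcing csim(p+, c) > csim(p-, c) + margin.

    csim_pos = csim(p+, c), csim_neg = csim(p-, c), both in (0, 1).
    The loss is zero once the positive pair outscores the negative
    pair by at least the margin.
    """
    return max(0.0, margin - csim_pos + csim_neg)

print(ranking_loss(0.9, 0.2))              # -> ~0.3 (gap 0.7 < margin 1.0)
print(ranking_loss(0.9, 0.2, margin=0.5))  # -> 0.0  (constraint satisfied)
```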

Step (d): Generation of Sub-sequence Redundancy Scores
During inference time, the goal of the model is to rank a given set of sub-sequences S of p with respect to their redundancy with c; note that during inference the model is presented with pairs of passages, in contrast to the triples it sees in the training phase. We calculate a redundancy score for a sub-sequence s_k ∈ S as follows:

    rsim(s_k) = 1/(s_{k+1} − s_k) · Σ_{i=s_k}^{s_{k+1}−1} lsim_i(p, c),    (8)

where s_k is the sub-sequence running from positions s_k to s_{k+1} − 1 (see Sec. 3). A ranking of the sub-sequences is then given by:

    rank(s_k) = |{s_l ∈ S : rsim(s_l) > rsim(s_k)}| + 1.    (9)

In other words, sub-passages are ranked by comparing the means of their local redundancy scores. In the evaluation of Sec. 8, we refer to the model that uses this way of ranking sub-passages as UA (short for uni-directional alignment). We compare this against a number of other variants of processing the internal activations of the model to extract information about local redundancy; see Sec. 8.
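The mean-based sub-sequence scoring can be sketched as follows (the token scores and boundaries are illustrative):

```python
def subsequence_scores(lsim, boundaries):
    """Mean local redundancy per sub-sequence (the basis of UA's ranking).

    lsim: per-token redundancy scores for passage p
    boundaries: positions s_0 = 0 < s_1 < ... < s_m = len(lsim)
    Averaging (rather than summing) corrects for sub-sequence length.
    """
    return [sum(lsim[a:b]) / (b - a)
            for a, b in zip(boundaries, boundaries[1:])]

# Token scores for a 6-token passage split into segments [0,2), [2,5), [5,6):
lsim = [0.9, 0.7, 0.1, 0.2, 0.3, 0.8]
print(subsequence_scores(lsim, [0, 2, 5, 6]))  # -> roughly [0.8, 0.2, 0.8]
```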

Baseline Ranking Method
The bi-directional alignment model (BA) of Parikh et al. (2016) can be trained in a similar fashion as our proposed model, i.e., with triples of passages and the loss from Eq. (7). Although it was not developed with the localization of redundancy in mind, its native problem formulation (RTE) is structurally related to the problem at hand in that it requires models to assess to what degree the semantic content of one passage is embedded in a second one. We believe BA constitutes a strong baseline because it has been shown to achieve state-of-the-art performance on RTE and because it has the means to decompose coarse inference decisions on two text sequences into local comparison operations, a key prerequisite for successfully utilizing the training signal from Sec. 5. However, in contrast to our model, the results of comparing the aligned sequences c_i^aligned with individual tokens from p are not directly interpretable as redundancy scores, and the architecture is designed for a bi-directional alignment of the input sequences. In order to produce lsim values for the tokens of p, we use the alignment matrices as a basis for a max-based aggregation, i.e., we take the row-wise maximum value and use it as the localized redundancy value for the corresponding token. Sub-sequence similarity is then determined either via Eq. (8) or, alternatively, via summation.
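This max-based extraction of local scores from an alignment matrix can be sketched as follows (a NumPy sketch with a toy attention matrix; the values are illustrative):

```python
import numpy as np

def lsim_from_alignment(attn, use_reverse=False):
    """Extract per-token redundancy scores from BA's alignment matrices.

    attn: (n_p, n_c) normalized attention weights between p and c.
    The row-wise maximum serves as the localized redundancy value for
    each token of p; use_reverse selects the transposed matrix instead
    (corresponding to the two BA variants compared in the evaluation).
    """
    m = attn.T if use_reverse else attn
    return m.max(axis=1)

attn = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.8, 0.1],
                 [0.3, 0.3, 0.4]])
print(lsim_from_alignment(attn))  # -> [0.7 0.8 0.4]
```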

Experimental Setting, Model Training
We implemented both UA and BA in the TensorFlow framework (Abadi et al., 2015) and trained them with the signal from Sec. 5. As input to the passage-retrieval system we used a set of 1.5 million queries, resulting in the same number of passage triples; 80% were used for training, 10% were used as a separate validation set for hyperparameter optimization, and the final 10% were held out and served as the basis for the smaller dataset with manually annotated labels (EVAL, Sec. 4).[4]
The hyperparameters of UA (h_f1, d_f1, h_f3, d_f3) and BA (like our model, plus a few additional ones) were optimized separately. We also experimented with Dropout (Srivastava et al., 2014) for the feed-forward networks in steps (a-c) (p_f1, p_f2, p_f3), with different initial learning rates (η) for Adagrad (Duchi et al., 2011), with different batch sizes, and with different vocabulary sizes (|V|). The final settings for UA used in the reported experiments are shown in Tab. 3. Word embeddings were initialized with pre-trained embeddings (Mikolov et al., 2013); the other model parameters were randomly initialized; out-of-vocabulary words were hashed into 100 buckets. The models were trained for 1 million steps.

[4] We only annotated a subset of the passages in this part of the data.

Evaluation on EVAL
We first compare the performance of different variants of generating the redundancy scores for sub-passage ranking, for both UA and BA, on DEV. We then pick the respective best-performing model variant and compare the systems on TEST. The model variants we test are the following:
• UA: The uni-directional alignment model described in Sec. 6.
• UA Σ : Summation instead of averaging in Eq. (8), which gives higher weight to long subsequences with redundancy.
• UA : Calculation of lsim in analogous fashion as BA (see below).
• UA Σ : Combination of two variants above.
• BA /BA : Models with bi-directional alignment of input texts. lsim values for tokens of p are produced by using the first/second one of the two alignment matrices as a basis for the max-based aggregation of the normalized attention weights described in Sec. 6.5.
• BA Σ / BA Σ : Like above, but sub-sequence similarity is determined via summation rather than calculating the mean in Eq. (8).
We measure performance by calculating the Spearman correlation of the raw redundancy scores with the gold redundancy for all segments in the respective partition of the dataset. The top of the results table shows the scores on DEV; for UA, lsim is computed either natively (step (b)) or, analogously to BA, from the normalized attention weights of step (a) of the model. The best overall results for UA are achieved when the native lsim computation is combined with the strategy that represents sub-sequence redundancy as the arithmetic mean of the contained tokens' local scores, meaning sub-sequence length needs to be taken into account.
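The evaluation metric can be sketched as follows (a minimal Spearman implementation without tie correction, for brevity; the gold and predicted scores shown are hypothetical):

```python
def spearman(xs, ys):
    """Spearman rank correlation (no tie correction, for brevity)."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical model scores vs. aggregated gold redundancy labels:
gold = [0.0, 0.5, 2.0, 1.0, 1.5]
pred = [0.1, 0.4, 0.9, 0.5, 0.7]
print(spearman(gold, pred))  # -> 1.0 (identical rankings)
```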
For the baseline BA, exploiting the reverse alignment matrix and summing over the alignment scores without correction for sub-sequence length gives the best results. The bottom of the table reports the results of applying both models with the respective best strategy on the test partition of the dataset. The proposed uni-directional model clearly outperforms the bi-directional baseline. This indicates that the direct modeling of uni-directional redundancy during both training and inference time allows a model to better learn to compare a subsequence to another full passage, in comparison to the case where both passages are analyzed in a fine-granular way. Fig. 3 depicts a scatter plot of the segments in TEST, with the x-axis corresponding to the gold redundancy scores (Sec. 4) and the y-axis showing the redundancy assessment by UA. While actually redundant segments tend to be handled correctly by the model, a certain amount of non-redundant segments get assigned a relatively high absolute redundancy value, which is not problematic as long as the actually redundant segments of the same passage are rated even higher. The next section elaborates on an experiment that looks into the quality of this internal ranking of segments for given passages, and how this ranking could potentially be utilized in an application.

Redundancy Localization for Passage Compression
This section briefly discusses an experiment in a dialogue setting, in which redundancy information is used for the compression of passages. Consider again the example from Fig. 1, where a conversational agent engages a human user in an informational dialogue whose quality suffers from repetition of information on the agent side. In this experiment, we asked human raters to assess whether the removal of redundancy improves the dialogue flow. Note, however, that given the small scale of the experiment, results are only indicative and not conclusive.
We selected 50 passage pairs from the held-out portion of the training data where the second passage consisted of at least three sentences. We then fed the passages to UA and removed the sentence from the second passage which had the largest semantic overlap with the context (the first passage). We asked three human raters, (a) whether the two original passages are coherent at all (as the following questions assume this), (b) whether the compressed passage sounds more or less natural (due to the dropped redundant sentence), and (c) whether the modified passage is equally informative as the original passage.
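The compression step can be sketched as follows; the sentence-level scores are hypothetical stand-ins for, e.g., the mean of UA's token-level redundancy values per sentence:

```python
def compress(redundancy_scores, sentences):
    """Drop the sentence with the largest semantic overlap with the context.

    redundancy_scores[i] is a redundancy score for sentences[i] with respect
    to the context passage; the scores used below are illustrative.
    """
    drop = max(range(len(sentences)), key=lambda i: redundancy_scores[i])
    return [s for i, s in enumerate(sentences) if i != drop]

sents = ["Malaria is a parasitic infection spread by mosquitoes.",
         "The parasite is neither a virus nor a bacterium.",
         "It multiplies in red blood cells of humans."]
print(compress([0.9, 0.3, 0.2], sents))  # drops the first, most redundant sentence
```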
For comparison, we implemented a baseline which always dropped the first sentence of a passage, as well as one that removed the sentence with the highest term overlap. In one example from our data, dropping a single redundant sentence from the passage results in a more natural and equally informative text. Among the 50 uncompressed passage pairs, only one third was rated as being coherent (question a; independent of the model). For these pairs, UA tended to produce more natural compressions (question b) than the baselines. This might be explained by the term-overlap baseline's restriction to the level of individual words, which results in erroneously removing sentences that are essential for discourse coherence but do not repeat facts. Similarly, always dropping the first sentence can leave a passage with dangling backward references, e.g., in the case of anaphors. In terms of the informativeness dimension (question c), all approaches resulted in slightly less informative compressed passages, which is expected. However, UA's score on this metric is slightly worse than that of the baselines.

Contributions and Outlook
In this paper, we described the problem of localizing redundancy in pairs of passages. We proposed a model based on a uni-directional alignment from one passage to the context passage, which can be efficiently trained using a novel weak supervision signal defined over the output of common passage-retrieval systems. We applied this signal in a one-off process to train our model and a reasonable baseline; from a held-out part of the retrieved passages we created a publicly available dataset which allows models on this task to be evaluated and compared, and which enables other researchers to reproduce the evaluation setting of this work. The conducted evaluation showed that the proposed uni-directional alignment model is indeed capable of finding the redundant sub-segments in texts.
In future work, we would like to represent and model more facets of the naturalness and coherence of dialogues. For instance, in dialogue settings, a certain amount of redundancy between the utterances of participants may actually tie dialogue turns together, i.e., it may be beneficial in terms of discourse coherence and naturalness. Incorporating this consideration into the structure of a model can potentially improve the results of passage-compression techniques in settings similar to Sec. 9.