The World Is Not Binary: Learning to Rank with Grayscale Data for Dialogue Response Selection

Response selection plays a vital role in building retrieval-based conversation systems. Although response selection is naturally a learning-to-rank problem, most prior works take a point-wise view and train binary classifiers for this task: each response candidate is labeled either relevant (one) or irrelevant (zero). On the one hand, this formalization can be sub-optimal because it ignores the diversity of response quality. On the other hand, annotating grayscale data for learning-to-rank can be prohibitively expensive and challenging. In this work, we show that grayscale data can be automatically constructed without human effort. Our method employs off-the-shelf response retrieval models and response generation models as automatic grayscale data generators. With the constructed grayscale data, we propose multi-level ranking objectives for training, which can (1) teach a matching model to capture more fine-grained context-response relevance differences and (2) reduce the train-test discrepancy in terms of distractor strength. Our method is simple, effective, and universal. Experiments on three benchmark datasets and four state-of-the-art matching models show that the proposed approach brings significant and consistent performance improvements.


Introduction
Building intelligent conversation systems (Shum et al., 2018; Kollar et al., 2018) has been gaining more and more attention in recent years. A core module in such conversation systems is response selection (Ritter et al., 2011; Hu et al., 2014; Wu et al., 2017): identifying the best response from a set of possible candidates given a dialogue context, i.e., the conversation history. For the response selection problem, the prevailing practice is to build neural matching models (Ji et al., 2014; Wang et al., 2015; Wu et al., 2017; Lu et al., 2019) for scoring the adequacy of individual response candidates in the dialogue context. Most prior works on this topic focus on fine-grained text encoding and better interactions between dialogue context and response candidates, typically via sophisticated and powerful matching networks (Wu et al., 2017; Lu et al., 2019; Gu et al., 2019). Despite their differences, in almost all these previous works, the matching models are trained with a binary classification objective. Each response in the training data is labeled either positive (i.e., a correct response to the dialogue context) or negative (i.e., an incorrect response). Often, the negative responses are automatically constructed by random sampling.
Table 1: Dialogue context (conversation history) between Speakers A and B. R1 is a random sample used as a negative instance during training. R2 and R3 are real distractors during testing.
One limitation of the above training strategy is that this formalization downplays the nuance of fine-grained response quality; the matching model is only asked to predict a binary label, either correct or incorrect. However, the quality of possible response candidates may be quite diverse, so making the matching model aware of which response candidates are more or less incorrect than others may increase the model capacity more effectively. Another limitation is that in real-world scenarios the matching models are often confronted with a more difficult task: selecting the best response from a set of strong response candidates instead of random ones. An example is given in Table 1. During training, the matching models are trained to distinguish the ground truth G from the randomly sampled response R1, where R1 shows little relevance to the dialogue context. Matching models trained on such data have little experience in identifying the ground-truth response G among a set of strong distractor responses such as R2 and R3. Intuitively, a good matching model should be able not only to distinguish good responses from random ones (usually totally irrelevant), as conveyed by the binary classification objective, but also to capture the more subtle differences among competitive candidates.
One natural solution to the above problems is to collect grayscale data for training. If we consider the quality of all possible response candidates to fall in the interval [0, 1], the ground-truth and random responses usually cover only the two endpoints, and our goal is to obtain a list of grayscale responses located between 0 and 1. However, grayscale data are hard to obtain in reality owing to the expense of human annotation and the subjectivity of individual human annotators.
In this work, we propose to automatically construct grayscale data from standard dialogue datasets, where only golden dialogue context and response pairs are provided. To meet this goal, we resort to off-the-shelf retrieval algorithms and generation models. Our idea is inspired by the observation that, in most cases, the responses from retrieval models or generation models are better than randomly sampled ones but worse than the ground-truth response. We believe that this progressive relationship, such as "ground truth > retrieval > random", can be utilized for training a better matching model. Concretely, we propose a multi-level ranking objective to make full use of such relationships. Our multi-level ranking objective jointly combines multiple binary contrastive estimations. In addition, the grayscale data partly simulates the real-world response distractors and thus reduces the gap between training and testing, leading to a better distinguishing ability for strong response distractors.
Our method is simple, effective, and orthogonal to prior efforts for modeling designs. It can be conveniently implemented with most existing matching models. Experimental results on four state-of-the-art matching models and three benchmark datasets demonstrate that our new training approach leads to remarkable performance improvement consistently.

Background
Early research on response selection was devoted to single-turn conversations (Wang et al., 2013; Tan et al., 2015). Recently, researchers have started to study multi-turn conversations (Lowe et al., 2015; Wu et al., 2017; Zhang et al., 2018). In the current literature, the task of response selection is formulated as follows. We are given a dialogue dataset $D = \{(c_i, r_i)\}$, where $c_i$ represents a dialogue context and $r_i$ is the human-written ground-truth response. The goal is to build a matching model $s(\cdot, \cdot)$ from $D$ so that $s(c, r)$ accurately measures the adequacy of a response candidate $r$ for a dialogue context $c$.
Rapid progress has been made in building such matching models in recent years. Concretely, various neural architectures (Zhou et al., 2016; Wu et al., 2017; Gu et al., 2019; Yuan et al., 2019) have been proposed for fine-grained text encoding and better modeling of the interactions between dialogue context and response. To train such matching models, binary-labeled training sets are constructed (Lowe et al., 2015; Wu et al., 2017; Zhang et al., 2018): the human-written ground-truth response is designated as the positive instance (labeled 1), and a set of randomly sampled responses $N_i$ are treated as negative ones (labeled 0). The matching model $s(\cdot, \cdot)$ is then trained with a binary classification objective over these labels.
Different from previous works, our study questions the effectiveness of the binary-labeled training data and the corresponding binary classification objective. We argue that the binary classification paradigm is sub-optimal because most of the randomly sampled negative responses are distant from the corresponding positive responses in terms of matching degree, which could lead to serious drawbacks when strong distractors are present during testing (Zhang et al., 2018).
Our work starts with enriching the range of the negative sample set $N_i$ in terms of response quality and leads to a simple but new learning strategy that aims at capturing more fine-grained response quality differences. First, different responses are acquired from various sources, such as retrieval models, generation models, and random sampling. Then, the collected responses are sorted by estimated quality to form progressive relationships. Lastly, a multi-level ranking objective is designed to learn such relationships. We first present our methods for automatically constructing grayscale data in Section 3.2, followed by the multi-level ranking objective introduced in Section 3.3.
Figure 1: The illustration of our training approach. For each dialogue, we first extract a number of grayscale data from heterogeneous sources. Then, the multi-level ranking objective is applied to learn the progressive relationships between different responses.
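For reference, a binary classification objective of the kind used in prior work is commonly written as a log-likelihood over the positive response and the sampled negatives. The form below is a sketch under that assumption rather than the exact equation of any cited paper: it treats $s(c, r)$ as a probability in (0, 1), and the symbols $\mathcal{L}_{bin}$ and $\bar{r}$ are introduced here purely for illustration.

```latex
\mathcal{L}_{bin} \;=\; \sum_{i} \Big[ \log s(c_i, r_i) \;+\; \sum_{\bar{r} \in N_i} \log\big(1 - s(c_i, \bar{r})\big) \Big]
```

Maximizing this quantity pushes the score of the ground-truth response toward 1 and the scores of all sampled negatives toward 0, regardless of how close each negative is to the context.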

Grayscale Data Acquisition
Our goal is to construct a set of responses with diverse quality. Specifically, we construct three types of responses for each dialogue context and rank them in three tiers. It should be noted that our data acquisition only relies on standard dialogue datasets, which only provide human-to-human dialogue context and response pairs.
Zero & One. First of all, the corresponding responses for the dialogue contexts in the standard dialogue dataset are considered as our ground-truth responses. These human-written responses are often informative and relevant. As a result, the ground-truth samples are ranked as tier-1. Similar to previous work, we also utilize randomly sampled responses for contrastive estimation. The random responses are sampled from the responses of other dialogue contexts in the training data. We rank random responses as tier-3 because they often show little relevance to the dialogue context. The ground-truth responses and random responses constitute the "zero & one" binary training data used in prior work.
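The random (tier-3) sampling step described above can be sketched as follows. This is a minimal illustration, assuming the dataset is a list of (context, response) pairs; the function name `sample_random_negatives` and the toy data are hypothetical and not taken from the paper.

```python
import random


def sample_random_negatives(dataset, index, num_negatives, rng):
    """Draw tier-3 responses for dataset[index] from the responses of other contexts."""
    # Exclude the dialogue itself so its own ground truth is never used as a negative.
    candidates = [i for i in range(len(dataset)) if i != index]
    chosen = rng.sample(candidates, num_negatives)
    return [dataset[i][1] for i in chosen]


# Toy (context, response) pairs, for illustration only.
dataset = [
    ("how do I install vim?", "run sudo apt-get install vim"),
    ("any good movie lately?", "I liked the new one by Nolan"),
    ("my wifi keeps dropping", "try updating the driver"),
    ("what should I cook tonight?", "maybe noodles"),
]
negatives = sample_random_negatives(dataset, index=0, num_negatives=2, rng=random.Random(0))
```

Passing an explicit `random.Random` instance keeps the sampling reproducible, which is convenient when comparing training runs.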
Grayscale We now delve into describing the grayscale data construction procedures. We consider two types of frequently-used toolkits for automatic response generation to produce grayscale data, namely, the retrieval-based models and the generation-based models.
The retrieval-based models (Ji et al., 2014; Hu et al., 2014) directly copy an existing response from the training corpus when receiving a response request. Since the returned responses are always human utterances from real-world conversations, they are informative and grammatical. However, the response quality of such systems varies, as it depends on the lexical similarity between the given dialogue context and those in the training corpus. Typically, the retrieval results are better than random responses because they are more or less relevant to the dialogue context. However, most retrieval results are worse than the ground truth. The retrieval results are ranked tier-2.
Specifically, we split the multi-turn dialogue into a series of single-turn input-response pairs. Then we index the input-response pairs with the BM25 algorithm (Robertson and Zaragoza, 2009). We retrieve response candidates using the last utterance of the dialogue context.
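The retrieval procedure above (split into single-turn pairs, index with BM25, query with the last utterance) can be sketched in plain Python. This is an illustrative sketch, not the authors' implementation: the whitespace tokenizer, the parameter values `k1=1.5` and `b=0.75`, the particular smoothed IDF variant, and all names (`BM25Index`, `split_into_pairs`) are assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization (a real system would use a proper tokenizer).
    return text.lower().split()


def split_into_pairs(dialogue):
    """Split a multi-turn dialogue into single-turn (input, response) pairs."""
    return [(dialogue[t], dialogue[t + 1]) for t in range(len(dialogue) - 1)]


class BM25Index:
    """Tiny Okapi BM25 index over the input side of input-response pairs."""

    def __init__(self, pairs, k1=1.5, b=0.75):
        self.pairs = pairs
        self.k1, self.b = k1, b
        self.docs = [Counter(tokenize(inp)) for inp, _ in pairs]
        self.lengths = [sum(d.values()) for d in self.docs]
        self.avg_len = sum(self.lengths) / len(self.docs)
        # Document frequency: number of indexed inputs containing each term.
        df = Counter(term for d in self.docs for term in d)
        n = len(self.docs)
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query_terms, doc_id):
        doc, length = self.docs[doc_id], self.lengths[doc_id]
        total = 0.0
        for term in query_terms:
            tf = doc.get(term, 0)
            if tf == 0:
                continue
            norm = tf + self.k1 * (1 - self.b + self.b * length / self.avg_len)
            total += self.idf[term] * tf * (self.k1 + 1) / norm
        return total

    def retrieve(self, context, top_k=5):
        # Only the last utterance of the dialogue context is used as the query.
        query_terms = tokenize(context[-1])
        scored = [(self.score(query_terms, i), i) for i in range(len(self.docs))]
        scored = [(s, i) for s, i in scored if s > 0]
        scored.sort(key=lambda item: (-item[0], item[1]))
        return [self.pairs[i][1] for _, i in scored[:top_k]]


corpus = [
    ["my wifi is broken", "which driver do you use", "the broadcom one",
     "try installing bcmwl-kernel-source"],
    ["how do i install vim", "run sudo apt-get install vim"],
]
pairs = [pair for dialogue in corpus for pair in split_into_pairs(dialogue)]
index = BM25Index(pairs)
candidates = index.retrieve(["hello", "how can i install vim on ubuntu"], top_k=2)
```

Because matching is done on the input side and the paired response is returned, a candidate can look topically related to the context while still being a wrong answer, which is what makes it a useful mid-tier distractor.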
The generation-based models (Shang et al., 2015; Li et al., 2016) generate a new utterance from scratch after training. While those models generalize better to rare dialogue contexts, the generated responses tend to be universal and non-informative (e.g., "I don't know", "I think so") (Li et al., 2016). Similar to the retrieval responses, the generated responses are usually better than the random responses but worse than the ground-truth responses. However, compared to retrieval models that merely rely on lexical overlap, generation results can capture deeper semantic interactions. The different characteristics of retrieval and generation models make their results complement each other in terms of response quality, which we consider beneficial for training.
Specifically, we train a Seq2Seq model with the attention mechanism (Bahdanau et al., 2015) for response generation. We adopt the same corpus used in the retrieval model to train the generation model. The generation response is produced by feeding the dialogue context to a trained model.

Discussion on Extendibility
Note that there can be many more sophisticated ways to construct the grayscale data. For example, one may employ the results from different retrieval models and/or generation models. Responses from different models can be further divided into sub-groups according to the relative strengths of the corresponding models. For instance, responses that are generated from more advanced and competent generation models (e.g., a model based on GPT2 (Radford et al., 2019)) can be considered better than those from less competent models (e.g., a vanilla seq2seq model). However, in this paper, we only showcase the results with basic retrieval and generation models for keeping our idea simple and neat. Nevertheless, this simple setting, as we will demonstrate, already leads to remarkable performance improvements.

Multi-Level Ranking Objectives
Our grayscale data acquisition provides the ground for more principled and sufficient training paradigms. To make full use of the grayscale data, we propose multi-level ranking objectives. Unlike prior work that minimizes binary classification errors, our training objective better fits the learning-to-rank nature of response selection, that is, it minimizes the ranking errors of possible responses (Cao et al., 2007). Also, as the grayscale data exhibit varied response quality, training with such data rather than random negatives better simulates testing environments.
We start the formal description with some notation. The training set can be re-organized as $D = \{(c_i, R_i)\}$, where $c_i$ denotes the dialogue context and $R_i = \{r_i, e_i, g_i, \bar{r}_i\}$ is the response set enhanced by grayscale data. Concretely, $r_i$, $e_i$, $g_i$, and $\bar{r}_i$ refer to ground-truth responses, retrieval responses, generation responses, and random responses, respectively. We consider three ordered lists as follows.
• ground truth > retrieval > random. This ordered list considers the progressive relationships between ground-truth responses, retrieval responses, and random responses. We implement it with margin ranking losses, in which $\mu$ is a hyperparameter representing the minimum acceptable score margin between two tiers and $s(\cdot, \cdot)$ is the matching score given by a matching model.
• ground truth > generation > random. This ordered list considers the progressive relationships between ground-truth responses, generation responses, and random responses. It is likewise implemented with margin ranking losses.
• ground truth > random. This loss, $L_{ran}$, directly models the relationship between the ground-truth samples $r_i$ and the random samples $\bar{r}_i$.
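The three ordered lists above can be turned into margin ranking (hinge) terms between adjacent tiers. The sketch below is one plausible reading under stated assumptions, not the paper's exact equations: it assumes one hinge term per adjacent pair in each list, a single shared margin, and a plain sum when the three losses are combined. The function names and the example scores are hypothetical.

```python
def hinge(margin, higher, lower):
    """Penalty incurred when `higher` does not beat `lower` by at least `margin`."""
    return max(0.0, margin - higher + lower)


def ordered_list_loss(scores, margin):
    """Sum of hinge terms over adjacent tiers of one ordered list of scores."""
    return sum(hinge(margin, hi, lo) for hi, lo in zip(scores, scores[1:]))


def multi_level_losses(s_truth, s_retrieval, s_generation, s_random, margin):
    # ground truth > retrieval > random
    l_ret = ordered_list_loss([s_truth, s_retrieval, s_random], margin)
    # ground truth > generation > random
    l_gen = ordered_list_loss([s_truth, s_generation, s_random], margin)
    # ground truth > random
    l_ran = ordered_list_loss([s_truth, s_random], margin)
    # Assumption: the three losses are combined by simple addition.
    return {"ret": l_ret, "gen": l_gen, "ran": l_ran, "uni": l_ret + l_gen + l_ran}


# Hypothetical matching scores for one dialogue context.
losses = multi_level_losses(
    s_truth=0.9, s_retrieval=0.6, s_generation=0.75, s_random=0.1, margin=0.2
)
```

In this example only the generation response violates the margin (0.9 versus 0.75 with a margin of 0.2), so it is the only term that contributes a non-zero penalty; a random response that is already far below the ground truth contributes nothing.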
Our final training objective, $L_{Uni}$, is the union of all of the above. It models the integrated relationship between the tiers "ground truth > retrieval & generation > random" and "ground truth > random" simultaneously.

Experimental Setup

Datasets and Evaluation Metrics
We test on three benchmark datasets for multi-turn response selection.
Ubuntu Dialogue Corpus. It consists of English multi-turn dialogues about technical support collected from the Ubuntu Forum (Lowe et al., 2015). The dataset contains 500K, 50K and 50K chat logs for training, validation, and test, respectively. Each test dialogue is paired with 9 distractor responses. Following convention, the response selection performance is evaluated by $R_n@k$ scores, where $R_n@k$ is the recall at position k among n candidates.

Douban Conversation Corpus. It consists of Chinese multi-turn daily conversations crawled from Douban group (Wu et al., 2017). The dataset contains 500K, 25K and 1K chat logs for training, validation, and test, respectively. Each test dialogue is paired with 10 candidate responses. Following prior work, besides $R_n@k$ scores, we also report Mean Average Precision (MAP), Mean Reciprocal Rank (MRR) and the precision at position 1 (P@1).
E-commerce. It consists of Chinese conversations between customers and customer service staff from Taobao (Zhang et al., 2018). The dataset sizes and settings are the same as for the Douban corpus. $R_n@k$ scores are commonly employed for evaluation.
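The evaluation metrics named above can be illustrated for a single test context with n scored candidates. This is a sketch under common definitions, not the official evaluation scripts of these benchmarks: recall is computed over all correct candidates (which matters when a context has more than one), and MAP and MRR are obtained by averaging the per-context values over the test set. All function names are illustrative.

```python
def rank_labels(scores, labels):
    """Return the 0/1 relevance labels sorted by descending model score."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return [labels[i] for i in order]


def recall_at_k(ranked, k):
    """R_n@k: share of the correct responses found in the top k of n candidates."""
    total = sum(ranked)
    return sum(ranked[:k]) / total if total else 0.0


def precision_at_1(ranked):
    """P@1: whether the top-ranked candidate is correct."""
    return float(ranked[0])


def reciprocal_rank(ranked):
    """Per-context term of MRR: inverse rank of the first correct candidate."""
    for pos, label in enumerate(ranked, start=1):
        if label:
            return 1.0 / pos
    return 0.0


def average_precision(ranked):
    """Per-context term of MAP: mean precision at the ranks of correct candidates."""
    hits, total = 0, 0.0
    for pos, label in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / pos
    return total / hits if hits else 0.0


# Hypothetical scores for four candidates, two of which are correct.
scores = [0.2, 0.9, 0.4, 0.7]
labels = [1, 0, 0, 1]
ranked = rank_labels(scores, labels)
```

Here the top-scored candidate is a distractor, so P@1 and recall at 1 are zero even though a correct response sits at rank 2; this is the kind of error that strong distractors cause.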

Baseline Models
We compare with the following baseline models.

Implementation Details
For grayscale data construction, we train a seq2seq generation model and build a BM25 retrieval system using the training set for each dataset. We consider the top 100 responses from BM25 retrieval and the top 5 responses from seq2seq generation (via beam search) as the grayscale responses. To facilitate further research, we have made our collected grayscale data publicly available. During training, we use these grayscale responses in a way that adapts to the matching model being trained. At each training epoch, ten different grayscale responses are used: the top 5 retrieval responses as ranked by the current matching model and all 5 seq2seq generation responses. We apply our new training approach to four recent state-of-the-art models:
• SMN (Wu et al., 2017) lets each utterance of a dialogue context interact with a response and then transforms the interaction matrices into matching vectors with a CNN. The matching vectors are finally mapped into a matching score with an RNN.
• DAM obtains matching vectors of text segments at different granularities with stacked self-attention. The matching vectors are then distilled with cross-attention and finally fused into a matching score via a single-layer perceptron.
• IOI pairs each utterance of a context with a response via stacking multiple interaction blocks and then aggregates matching information from all the pairs as a matching score in an iterative fashion.
• MSN (Yuan et al., 2019) utilizes a multi-hop selector to select the relevant utterances as context and then matches the filtered context with the given response candidate to obtain a matching score.
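The adaptive use of grayscale responses described in the implementation details (re-ranking the BM25 candidates with the current matching model at each epoch and keeping the top ones, plus all generated responses) can be sketched as follows. This is an illustrative sketch: `select_grayscale` is a hypothetical helper, and `toy_score` is a word-overlap stand-in for the matching model, used only to make the example runnable.

```python
def select_grayscale(context, retrieved, generated, score_fn, top_k=5):
    """Pick this epoch's grayscale responses for one dialogue context."""
    # Re-rank the retrieval pool with the current matching model and keep the best top_k.
    ranked = sorted(retrieved, key=lambda resp: score_fn(context, resp), reverse=True)
    # All generated responses are kept as they are.
    return ranked[:top_k], list(generated)


def toy_score(context, response):
    # Hypothetical stand-in for the current matching model: word overlap ratio.
    ctx = set(" ".join(context).lower().split())
    words = response.lower().split()
    return sum(w in ctx for w in words) / max(len(words), 1)


context = ["my wifi keeps dropping", "which driver are you using"]
retrieved = [
    "the broadcom driver",
    "i like pizza",
    "which wifi card",
    "reboot",
    "using the open driver",
    "hello there",
]
generated = ["i do not know", "try another driver"]
top_retrieval, all_generation = select_grayscale(
    context, retrieved, generated, toy_score, top_k=3
)
```

Since the ranking is recomputed with the current model, the selected retrieval responses change as training progresses and tend to be the candidates that the model currently finds most confusable with a good response.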

Specifically, we first pre-train a model with the objective $L_{ran}$ only and then switch to $L_{Uni}$. We find that this treatment makes the training process more stable.
[Results table; column groups: Douban (MAP, MRR, P@1, $R_{10}@1$, $R_{10}@2$, $R_{10}@5$), Ubuntu ($R_2@1$, $R_{10}@1$, $R_{10}@2$, $R_{10}@5$), E-commerce ($R_{10}@1$, $R_{10}@2$, $R_{10}@5$); one row per model.]

Experimental Results
The experimental results are listed in Table 2, where G-X indicates X with our grayscale-enhanced training approach. We can see that our training approach significantly improves the performance of all four matching models in terms of various metrics. The improvements are consistent across different datasets and different models, indicating the universality of our approach. Moreover, one interesting observation is that a less accurate matching architecture with the proposed training approach can outperform a stronger matching architecture with the traditional training paradigm, e.g., G-IOI vs. MSN. This suggests that while the choice of learning objective is often overlooked, it could be decisive for building a competitive response selection model.

Effect of Different Grayscale Data
We then conduct an ablation study to understand the roles of different grayscale data in the performance gains. We choose SMN and DAM as the baseline models. We train the models in three additional settings: without retrieval responses, without generation responses, and without both.
The results are shown in Table 3. Both retrieval data and generation data make irreplaceable contributions to the overall performance, and combining the two gives the best results, which confirms our hypothesis that responses from heterogeneous sources complement each other. We can also see that retrieval data helps more than generation data when each is used alone. This can be attributed to the tendency of the seq2seq-based generation model to output general and dull responses. Such general responses are less informative than the retrieval data and thus provide limited help for distinguishing fine-grained differences in response quality.

Effect of Multi-level Ranking Objectives
Next, we study the effect of the multi-level ranking objective (MRO). Recall that we adopt the MRO in order to make use of the progressive relationship between different tiers. However, a simpler alternative is to treat all grayscale data as negative samples and use the learning objective in Eq. 2. It can be regarded as a simple data augmentation technique, enlarging the set of negative examples with retrieval and generation results. We implement this idea to test whether the proposed MRO is necessary and to quantify its benefit. As shown in Table 4, the performance of models trained without MRO falls behind that of models trained with MRO. Besides, the improvements from grayscale data without MRO are quite limited compared to the original counterparts without grayscale data. This indicates that the proposed multi-level ranking objective is essential for performance improvement.

Effect of Margin Size
The margin hyperparameter ($\mu$) denotes the minimum distance between two tiers in matching scores, which may affect the performance of a matching model. We conduct a series of sensitivity analysis experiments to study how the margin affects the performance of our training approach. (We also tried using different margins for different pairs, but the improvements were limited.) All models are evaluated in terms of $R_{10}@1$.
Referring to Figure 2, we can see that both SMN and DAM show a similar trend on Douban: the curves first rise and then drop as the margin increases. This is mainly because response candidates on Douban are highly relevant. When the margin is too large, matching models struggle to handle strongly relevant distractors. However, when the margin is too small, matching models become too sensitive and sometimes mistakenly give high scores to responses with less relevance to the dialogue context. Results on Ubuntu show a completely different behavior: performance grows in step with the margin. The reason may be that the response distractors of Ubuntu are relatively distant from the ground truth semantically, so matching models need to discriminate strongly between the ground truth and other grayscale samples. As a result, models learned with a large margin can fit such a data distribution.

Compatibility with Co-teaching
Prior work adopts the co-teaching framework to train a robust matching model. In those experiments, the co-teaching framework with dynamic margins is shown to effectively reduce the effect of noisy, randomly sampled responses. We believe that our approach and the co-teaching framework can benefit each other. Therefore, we combine our training approach with the co-teaching framework, using its margin strategy as an instance, to train the matching models. From the results in Table 5, we can see that models trained with our approach outperform those trained with the co-teaching framework. More importantly, SMN+CoT and DAM+CoT obtain further improvements after adding our multi-level ranking objectives. This demonstrates that our approach is compatible with the co-teaching framework and is portable and practical enough to act as a general approach.

Case Study
As shown in case 1 of Table 6, response 2 contains some irrelevant content about the comic "One Piece", but it is still selected by DAM as the best response. In case 2, SMN selects the totally irrelevant response 2 as the best response, which may be because this response shares some words with the dialogue. These are consistent with the problem introduced in Section 2 that these models may mistake a fuzzy candidate with a few improper details for the best response due to the gap between training and testing. In contrast, after adopting our training approach, G-SMN and G-DAM correctly identify the improper content in the negative responses and successfully select response 1 as the best response.
Table 5: Experimental results of matching models trained with our approach and the co-teaching framework. X+CoT indicates models trained with the co-teaching framework. The results of SMN+CoT and DAM+CoT on Douban are copied from prior work, and we supplement the results of the two models trained with the co-teaching framework on Ubuntu.
Table 6: Two cases from the test set of Douban, both of which have Response 1 as the ground-truth response. Though each dialogue has ten candidates, we show only two of them due to space limitations. The dialogues are in Chinese (left), and we also provide their English translations (right).

Related Work
Some researchers have also studied how to improve the performance of existing matching models with a better learning method. One line of work proposed to leverage a Seq2Seq model as a weak annotator that assigns a score to each response candidate of the dialogue, and to learn matching models from these scores. Another introduced the co-teaching framework (Han et al., 2018) for eliminating the effect of training noise; this learning approach maintains two matching models and makes them teach each other. A third attempted to reduce the effect of false negatives and trivial true responses by adopting four negative sampling strategies to choose negative samples dynamically during training. Different from those previous works, our approach makes use of grayscale data from heterogeneous sources and learns progressive quality relationships. In addition, our work enhances retrieval models with generation models, which mirrors recent attempts (Cai et al., 2019a,b) to strengthen generation models via retrieval models.

Conclusions
We presented a novel approach for training response selection models for multi-turn conversations. It automatically constructs different types of grayscale data and uses a multi-level ranking objective. The proposed approach can teach a matching model to capture fine-grained quality differences better and reduce the train-test discrepancy in distractor strength. Experimental results on three benchmark datasets and four state-of-the-art models demonstrated the effectiveness of the proposed training approach.