Learning Matching Models with Weak Supervision for Response Selection in Retrieval-based Chatbots

We propose a method that can leverage unlabeled data to learn a matching model for response selection in retrieval-based chatbots. The method employs a sequence-to-sequence architecture (Seq2Seq) model as a weak annotator to judge the matching degree of unlabeled pairs, and then performs learning with both the weak signals and the unlabeled data. Experimental results on two public data sets indicate that matching models get significant improvements when they are learned with the proposed method.


Introduction
Recently, building non-task-oriented chatbots that can converse naturally with humans on open-domain topics has attracted increasing attention from both academia and industry. Existing approaches can be categorized into generation-based methods (Shang et al., 2015; Vinyals and Le, 2015; Serban et al., 2016; Sordoni et al., 2015; Serban et al., 2017; Xing et al., 2018), which synthesize a response with natural language generation techniques, and retrieval-based methods (Hu et al., 2014; Lowe et al., 2015; Zhou et al., 2016), which select a response from a pre-built index. In this work, we study response selection for retrieval-based chatbots, not only because retrieval-based methods can return fluent and informative responses, but also because they have been successfully applied to many real products such as the social-bot XiaoIce from Microsoft (Shum et al., 2018) and the E-commerce assistant AliMe Assist from Alibaba Group.
A key step in response selection is measuring the matching degree between a response candidate and an input, which is either a single message (Hu et al., 2014) or a conversational context consisting of multiple utterances. While existing research focuses on how to define a matching model with neural networks, little attention has been paid to how to learn such a model when little labeled data is available. In practice, because human labeling is expensive and exhausting, large-scale labeled data are rarely available for model training. Thus, a common practice is to transform the matching problem into a classification problem with human responses as positive examples and randomly sampled ones as negative examples. This strategy, however, oversimplifies the learning problem, as most of the randomly sampled responses are either far from the semantics of the messages or the contexts, or false negatives which pollute the training data as noise. As a result, there often exists a significant gap between the performance of a model in training and the same model in practice (Wang et al., 2015).
We propose a new method that can effectively leverage unlabeled data for learning matching models. To simulate the real scenario of a retrieval-based chatbot, we construct an unlabeled data set by retrieving response candidates from an index. Then, we employ a weak annotator to provide matching signals for the unlabeled input-response pairs, and leverage the signals to supervise the learning of matching models. The weak annotator is pre-trained on large-scale human-human conversations without any annotations, and thus a Seq2Seq model becomes a natural choice. Our approach is compatible with any matching model, and falls in a teacher-student framework (Hinton et al., 2015) where the Seq2Seq model transfers the knowledge from human-human conversations to the learning process of the matching models.
Broadly speaking, both Hinton et al. (2015) and our work let one neural network supervise the learning of another. An advantage of our method is that it turns the hard zero-one labels of the existing learning paradigm into soft (weak) matching scores. Hence, the model can learn a large margin between a true response and a true negative example, while keeping the semantic distance between a true response and a false negative example short. Furthermore, because the training data simulate the real scenario, harder examples are seen in the training phase, which makes the model more robust at test time.
We conduct experiments on two public data sets, and experimental results on both data sets indicate that models learned with our method can significantly outperform their counterparts learned with the random sampling strategy.
Our contributions include: (1) proposal of a new method that can leverage unlabeled data to learn matching models for retrieval-based chatbots; and (2) empirical verification of the effectiveness of the method on public data sets.

The Existing Learning Approach
Given a data set $D = \{x_i, (y_{i,1}, \ldots, y_{i,n})\}_{i=1}^{N}$ with $x_i$ a message or a conversational context and $y_{i,j}$ a response candidate of $x_i$, we aim to learn a matching model M(·, ·) from $D$. Thus, for any new pair $(x, y)$, $M(x, y)$ measures the matching degree between $x$ and $y$.
To obtain a matching model, one has to deal with two problems: (1) how to define M(·, ·); and (2) how to perform learning.
Existing work focuses on Problem (1), where state-of-the-art methods include dual LSTM (Lowe et al., 2015), Multi-View LSTM (Zhou et al., 2016), CNN, and Sequential Matching Network, but adopts a simple strategy for Problem (2): $\forall x_i$, a human response is designated as $y_{i,1}$ with label 1, and some randomly sampled responses are treated as $(y_{i,2}, \ldots, y_{i,n})$ with label 0. M(·, ·) is then learned by maximizing the following objective:
$$\sum_{i=1}^{N} \sum_{j=1}^{n} \left[ r_{i,j} \log M(x_i, y_{i,j}) + (1 - r_{i,j}) \log\left(1 - M(x_i, y_{i,j})\right) \right], \qquad (1)$$
where $r_{i,j} \in \{0, 1\}$ is a label. While matching accuracy can be improved by carefully designing M(·, ·), the bottleneck becomes the learning approach, which suffers from two obvious problems. First, most of the randomly sampled $y_{i,j}$ are semantically far from $x_i$, which may cause an undesired decision boundary at the end of optimization. Second, some $y_{i,j}$ are false negatives. As hard zero-one labels are adopted in Equation (1), these false negatives may mislead the learning algorithm. These problems remind us that besides good architectures of matching models, we also need a good approach to learn such models from data.
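The existing learning approach can be sketched in a few lines. The block below is only an illustration under stated assumptions: it takes Equation (1) to be the standard binary log-likelihood, and the helper names `sample_random_negatives` and `objective_1` are hypothetical, not part of any released implementation.

```python
import math
import random

def sample_random_negatives(all_responses, human_response, n_neg, rng):
    """Draw n_neg responses uniformly at random, excluding the human response.

    Hypothetical helper: stands in for the random sampling strategy.
    """
    pool = [r for r in all_responses if r != human_response]
    return rng.sample(pool, n_neg)

def objective_1(scores, labels, eps=1e-12):
    """Log-likelihood contribution of one input x_i (assumed form of Eq. 1).

    scores[j] is M(x_i, y_ij) in (0, 1); labels[j] is r_ij in {0, 1}.
    Larger values are better, so training maximizes this quantity.
    """
    total = 0.0
    for m, r in zip(scores, labels):
        m = min(max(m, eps), 1.0 - eps)  # clamp so that log() stays finite
        total += r * math.log(m) + (1 - r) * math.log(1.0 - m)
    return total
```

Note that every sampled negative receives the same hard label 0, whether it is unrelated to the input or happens to be a valid reply, which is exactly the weakness discussed above.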

A New Learning Method
As human labeling is infeasible when training complicated neural networks, we propose a new method that can leverage unlabeled data to learn a matching model. Specifically, instead of random sampling, we construct $D$ by retrieving $(y_{i,2}, \ldots, y_{i,n})$ from an index ($y_{i,1}$ is the human response of $x_i$). By this means, some $y_{i,j}$ are true positives, and some are negatives but semantically close to $x_i$. After that, we employ a weak annotator G(·, ·) to indicate the matching degree of every $(x_i, y_{i,j})$ in $D$ as weak supervision signals. Let $s_{i,j} = G(x_i, y_{i,j})$; then the learning approach can be formulated as:
$$\arg\min \sum_{i=1}^{N} \sum_{j=1}^{n} \max\left(0,\; M(x_i, y_{i,j}) - M(x_i, y_{i,1}) + s'_{i,j}\right), \qquad (2)$$
where $s'_{i,j}$ is a normalized weak signal defined as $\max(0, s_{i,j}/s_{i,1} - 1)$. Objective (2) encourages a large margin between the matching of an input and its human response and the matching of the input and a negative response judged by G(·, ·) (as will be seen later, $s_{i,j}/s_{i,1} > 1$). The learning approach simulates how we build a matching model in a retrieval-based chatbot: given $\{x_i\}$, some response candidates are first retrieved from an index. Then human annotators are hired to judge the matching degree of each pair. Finally, both the data and the human labels are fed to an optimization program for model training. Here, we replace the expensive human labels with cheap judgments from G(·, ·).
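To make the margin-based objective concrete, the following sketch computes the normalized weak signal and the hinge loss for a single input. It assumes the weak scores are negative log-likelihoods-style values (so that the human response has a nonzero score), that the normalization is max(0, s_ij / s_i1 - 1), and that index 0 holds the human response; the function names are illustrative only.

```python
def normalized_weak_signal(s_ij, s_i1):
    """Assumed normalization: s'_ij = max(0, s_ij / s_i1 - 1).

    With negative log-likelihood-style scores, a candidate that is less
    likely than the human response has s_ij / s_i1 > 1 and gets a
    positive margin; s_i1 must be nonzero.
    """
    return max(0.0, s_ij / s_i1 - 1.0)

def objective_2(match_scores, weak_scores):
    """Hinge loss of one input x_i; position 0 is the human response y_i1.

    The j = 1 term of the sum is always zero (margin 0, score difference 0),
    so the loop starts from the retrieved candidates.
    """
    m_pos, s_pos = match_scores[0], weak_scores[0]
    loss = 0.0
    for m_neg, s_neg in zip(match_scores[1:], weak_scores[1:]):
        margin = normalized_weak_signal(s_neg, s_pos)
        loss += max(0.0, m_neg - m_pos + margin)
    return loss
```

For example, with weak scores -10 (human), -12, and -30, the two candidates receive margins of roughly 0.2 and 2.0, so the candidate the annotator considers nearly as plausible as the human response is pushed away far less than the clearly poor one.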
We define G(·, ·) as a sequence-to-sequence architecture (Vinyals and Le, 2015) with an attention mechanism (Bahdanau et al., 2015), and pre-train it with large amounts of human-human conversation data. The Seq2Seq model can capture the semantic correspondence between an input and a response, and then transfer the knowledge to the learning of a matching model in the optimization of (2). $s_{i,j}$ is then defined as the log-likelihood of generating $y_{i,j}$ from $x_i$:
$$s_{i,j} = \sum_{k} \log p\left(w_{y_{i,j},k} \mid x_i, w_{y_{i,j},l<k}\right), \qquad (3)$$
where $w_{y_{i,j},k}$ is the $k$-th word of $y_{i,j}$ and $w_{y_{i,j},l<k}$ is the word sequence before $w_{y_{i,j},k}$.
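The weak score can be illustrated as a sum of per-word log-probabilities. In the sketch below, `toy_step_prob` is a hypothetical stand-in for a trained Seq2Seq decoder (it is not a real model), and summing log-probabilities is an assumption about how the likelihood is accumulated.

```python
import math

def weak_score(x_tokens, y_tokens, step_prob):
    """Sum over k of log p(w_k | x, w_<k) for the response tokens.

    step_prob(x_tokens, prefix, word) must return a probability in (0, 1].
    """
    total = 0.0
    for k, word in enumerate(y_tokens):
        total += math.log(step_prob(x_tokens, y_tokens[:k], word))
    return total

def toy_step_prob(x_tokens, prefix, word):
    """Hypothetical decoder: words that also occur in the input are likelier."""
    return 0.5 if word in x_tokens else 0.1
```

A real annotator would obtain the per-step probabilities from the decoder's softmax output; only the accumulation over words is shown here.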
Since negative examples are retrieved by a search engine, the oversimplification problem of the negative sampling approach can be partially mitigated. We leverage a weak annotator to assign a score to each example in order to distinguish false negatives from true negatives. Equation (2) turns the hard zero-one labels in Equation (1) into soft matching degrees, and thus our method encourages the model to be more confident in classifying a response with a high $s'_{i,j}$ score as a negative one. In this way, we avoid treating false negatives and true negatives equally during training, and update the model in the correct direction.
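The construction of retrieved candidates can be pictured with a toy retriever. The sketch below is a hypothetical stand-in for a real search engine: it ranks indexed (post, response) pairs by word overlap between the query and the indexed post, which is an assumption made purely for illustration and not the ranking function of any particular index.

```python
from collections import Counter

def overlap_score(query_tokens, doc_tokens):
    """Number of shared word occurrences between query and document."""
    q, d = Counter(query_tokens), Counter(doc_tokens)
    return sum(min(q[w], d[w]) for w in q)

def build_candidates(post, human_response, index, n):
    """Return [y_1, ..., y_n]: the human response plus n - 1 retrieved ones.

    index is a list of (post, response) pairs taken from training data.
    """
    query = post.split()
    ranked = sorted(
        (pair for pair in index if pair[1] != human_response),
        key=lambda pair: overlap_score(query, pair[0].split()),
        reverse=True,
    )
    return [human_response] + [resp for _, resp in ranked[: n - 1]]
```

Because the retrieved responses come from posts that resemble the query, they tend to be closer to the input than uniformly sampled ones, and some of them may be acceptable replies.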
It is noteworthy that although our approach also involves an interaction between a generator and a discriminator, it is different from the GANs (Goodfellow et al., 2014) in principle. GANs try to learn a better generator via an adversarial process, while our approach aims to improve the discriminator with supervision from the generator, which also differentiates it from the recent work on transferring knowledge from a discriminator to a generative visual dialog model (Lu et al., 2017). Our approach is also different from those semi-supervised approaches in the teacher-student framework (Dehghani et al., 2017a,b), as there are no labeled data in learning.

Experiment
We conduct experiments on two public data sets: STC data set (Wang et al., 2013) for single-turn response selection and Douban Conversation Corpus  for multi-turn response selection. Note that we do not test the proposed approach on Ubuntu Corpus (Lowe et al., 2015), because both training and test data in the corpus are constructed by random sampling.

Implementation Details
We implement our approach with TensorFlow. In both experiments, we use the same Seq2Seq model, which is trained on 3.3 million input-response pairs extracted from the training set of the Douban data. Each input is a concatenation of consecutive utterances in a context, and the response is the next turn, i.e., $(\{u_{<i}\}, u_i)$. We set the vocabulary size to 30,000, the hidden vector size to 1024, and the embedding size to 620. Optimization is conducted with stochastic gradient descent (Bottou, 2010), and is terminated when perplexity on a validation set (170k pairs) does not decrease in 3 consecutive epochs. In the optimization of Objective (2), we initialize M(·, ·) with a model trained under Objective (1) with the (random) negative sampling strategy, and fix word embeddings throughout training. This stabilizes the learning process. The learning rate is fixed at 0.1.
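The stopping rule for Seq2Seq pre-training can be written as a small predicate. The sketch below reflects one reading of the criterion, namely that training stops once none of the last 3 epochs improves on the best validation perplexity seen before them; the function name and the exact interpretation are assumptions.

```python
def should_stop(val_perplexities, patience=3):
    """True once validation perplexity has failed to improve on the earlier
    best for `patience` consecutive epochs.

    val_perplexities holds one value per finished epoch, oldest first.
    """
    if len(val_perplexities) <= patience:
        return False  # not enough history to judge
    best_before = min(val_perplexities[:-patience])
    return all(p >= best_before for p in val_perplexities[-patience:])
```

The predicate would be evaluated after each epoch on the list of validation perplexities collected so far.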

Single-turn Response Selection
Experiment settings: in the STC (Short Text Conversation) data set, the task is to select a proper response for a post on Weibo. The training set contains 4.8 million post-response (true response) pairs. The test set consists of 422 posts, each associated with around 30 responses labeled by human annotators as "good" or "bad". In total, there are 12,402 labeled pairs in the test data. Following Wang et al. (2013, 2015), we combine the score from a matching model with TF-IDF based cosine similarity using RankSVM, whose parameters are chosen by 5-fold cross validation. Precision at position 1 (P@1) is employed as the evaluation metric. In addition to the models compared on the data in the existing literature, we also implement dual LSTM (Lowe et al., 2015) as a baseline. As case studies, we learn a dual LSTM and a CNN (Hu et al., 2014) with the proposed approach, and denote them as LSTM+WS (Weak Supervision) and CNN+WS, respectively. When constructing $D$, we build an index with the training data using Lucene and retrieve 9 candidates (i.e., $\{y_{i,2}, \ldots, y_{i,n}\}$) for each post with the inline algorithm of the index. We form a validation set by randomly sampling 10 thousand posts together with their responses from $D$ (the human response is positive and the others are treated as negative).
Model                                  P@1
TFIDF (Wang et al., 2013)              0.574
+Translation (Wang et al., 2013)       0.587
+WordEmbedding                         0.579
+DeepMatch_topic                       0.587
+DeepMatch_tree (Wang et al., 2015)    0.608
+LSTM (Lowe et al., 2015)              0.592
+LSTM+WS                               0.616
+CNN (Hu et al., 2014)                 0.585
+CNN+WS                                0.604
Table 1: Results on STC
Results: Table 1 reports the results. We can see that CNN and LSTM consistently improve when learned with the proposed approach, and the improvements over the models learned with random sampling are statistically significant (t-test with p-value < 0.01). LSTM+WS even surpasses the best performing model reported on this data, DeepMatch_tree.
These results indicate the usefulness of the proposed approach in practice. One can expect improvements to models like DeepMatch_tree with the new learning method. We leave the verification as future work.
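For reference, P@1 can be computed as follows. This is a minimal sketch that assumes each test post comes with a list of (score, label) pairs, where label 1 means "good" and 0 means "bad"; the data layout and function name are illustrative.

```python
def precision_at_1(queries):
    """Fraction of posts whose top-scored candidate is labeled good.

    queries: list of non-empty lists of (score, label) pairs, label in {0, 1}.
    """
    hits = 0
    for candidates in queries:
        _, top_label = max(candidates, key=lambda c: c[0])
        hits += top_label
    return hits / len(queries)
```

In the setting above, the score of each pair would be the RankSVM combination of the matching score and the TF-IDF cosine similarity.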

Multi-turn Response Selection
Experiment settings: Douban Conversation Corpus contains 0.5 million context-response (true response) pairs for training and 1000 contexts for testing. In the test set, every context has 10 response candidates, and each response has a label "good" or "bad" judged by human annotators. Mean average precision (MAP) (Baeza-Yates et al., 1999), mean reciprocal rank (MRR) (Voorhees, 1999), and precision at position 1 (P@1) are employed as evaluation metrics. We copy the numbers reported in  for the baseline models, and learn LSTM, Multi-View, and SMN with the proposed approach. We build an index with the training data, and retrieve 9 candidates with the method in  for each context when constructing $D$. 10 thousand pairs are sampled from $D$ as a validation set.
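The two ranking metrics can be computed from the labels of each context's candidates after sorting by model score. The sketch below uses the standard definitions of average precision and reciprocal rank; how contexts without any "good" candidate are handled (scored as 0 here) is an assumption.

```python
def average_precision(ranked_labels):
    """Mean of precision@k over the ranks k that hold a good response."""
    hits, total = 0, 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def reciprocal_rank(ranked_labels):
    """1 / rank of the first good response, or 0 if there is none."""
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            return 1.0 / rank
    return 0.0

def mean_metric(metric, all_ranked_labels):
    """Average a per-context metric over all test contexts (MAP or MRR)."""
    return sum(metric(r) for r in all_ranked_labels) / len(all_ranked_labels)
```

Calling `mean_metric(average_precision, ...)` gives MAP and `mean_metric(reciprocal_rank, ...)` gives MRR over the test contexts.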
Results: Table 2 reports the results. Consistent with the results on the STC data, every model improves with the new learning approach (the +WS variants), and the improvements are statistically significant (t-test with p-value < 0.01).
Model                               MAP    MRR    P@1
                                    0.488  0.527  0.330
LSTM (Lowe et al., 2015)            0.485  0.527  0.320
LSTM+WS                             0.519  0.559  0.359
Multi-View (Zhou et al., 2016)

Discussion
Ablation studies: we first replace the weak supervision $s'_{i,j}$ in Equation (2) with a constant $\epsilon$ selected from {0.1, 0.2, . . . , 0.9} on validation, and denote the models as model+const. Then, we keep everything the same as our approach but replace $D$ with a set constructed by random sampling, denoted as model+WS_rand. Table 3 reports the results. We can conclude that both the weak supervision and the strategy of training data construction are important to the success of the proposed learning approach. Training data construction plays a more crucial role, because it brings more true positives, as well as negatives at different semantic distances from the positives, into learning.
Does updating the Seq2Seq model help? It is well known that Seq2Seq models suffer from the "safe response" problem (Li et al., 2016a), which may bias the weak supervision signals toward high-frequency responses. Therefore, we attempt to iteratively optimize the Seq2Seq model and the matching model, and check whether the matching model can be further improved. Specifically, we update the Seq2Seq model every 20 mini-batches with the policy-based reinforcement learning approach proposed by Li et al. (2016b). The reward is defined as the matching score of a context and a response given by the matching model. Unfortunately, we do not observe significant improvement in the matching model. We attribute the result to two factors: (1) it is difficult to significantly improve the Seq2Seq model with a policy gradient based method; and (2) eliminating "safe responses" from the Seq2Seq model does not help a matching model learn a better decision boundary.
How the number of response candidates affects learning: we vary the number of candidates $\{y_{i,j}\}_{j=1}^{n}$ in $D$ over {2, 5, 10, 20} and study how this hyperparameter influences learning. We study LSTM on the STC data and SMN on the Douban data. Table 4 reports the results. We can see that as the number of candidates increases, the performance of the learned models improves. Even with 2 candidates (one from a human and the other from retrieval), our approach can still improve the performance of matching models.

Conclusion and Future Work
Previous studies focus on architecture design for retrieval-based chatbots, but neglect the problems brought by random negative sampling in the learning process. In this paper, we propose leveraging a Seq2Seq model as a weak annotator on unlabeled data to learn a matching model for response selection. By this means, we can mine hard instances for the matching model and score them with the weak annotator. Experimental results on public data sets verify the effectiveness of the new learning approach. In the future, we will investigate how to remove bias from the weak supervisors, and further improve the matching model performance with a semi-supervised approach.