Sampling Matters! An Empirical Study of Negative Sampling Strategies for Learning of Matching Models in Retrieval-based Dialogue Systems

We study how to sample negative examples to automatically construct a training set for effective model learning in retrieval-based dialogue systems. Following the idea of dynamically adapting negative examples to matching models during learning, we consider four strategies: minimum sampling, maximum sampling, semi-hard sampling, and decay-hard sampling. Empirical studies on two benchmarks with three matching models indicate that, compared with the widely used random sampling strategy, the first two strategies lead to a performance drop, whereas the latter two bring consistent improvement to the performance of all the models on both benchmarks.


Introduction
In this work, we study the problem of response selection as an approach to implementing a retrieval-based dialogue system (Ji et al., 2014; Wang et al., 2013). A key step in response selection is measuring the matching degree between a conversation context and a response candidate. Existing studies focus on constructing a matching model with sophisticated neural architectures (Lowe et al., 2015; Zhou et al., 2016; Wu et al., 2017; Zhang et al., 2018; Tao et al., 2019), but pay little attention to how to effectively learn such architectures from data. On the one hand, it is well known that learning complicated neural architectures requires large-scale, high-quality training data; on the other hand, since human labeling is expensive and exhausting, most existing work simply adopts a heuristic to automatically build a training set in which human responses are treated as positive examples and negative response candidates are randomly sampled.
Such a training set might contain many false negatives, as well as trivial true negatives that are very easy to distinguish from the true positives. As a result, models with advanced architectures can only reach sub-optimal performance after learning.
In this paper, instead of configuring new architectures, we investigate how to improve the performance of existing matching models with a better learning method. A learning method usually involves the choice of a loss function and the construction of training data, and we are particularly interested in automatic training data construction, as data are often more crucial to the performance of models. The key problem in training data construction lies in how to properly choose negative examples, and our idea is that negative examples should adapt to the matching models at different learning stages. Following this idea, we consider four negative sampling strategies, namely minimum sampling, maximum sampling, semi-hard sampling, and decay-hard sampling. In the first two strategies, the response candidate with the minimal or the maximal matching score at the current step is picked from a pool as a negative example for the next step; in the latter two strategies, we select negative examples by considering how hard they are for the current matching model. Semi-hard sampling prefers candidates of moderate difficulty to avoid both false negatives and trivial true negatives, and decay-hard sampling gradually increases the difficulty of negative samples as training proceeds.
We compare different sampling strategies with three matching models at different levels of complexity on two benchmarks. Evaluation results indicate that minimum sampling and maximum sampling are inferior to random sampling, and that both semi-hard sampling and decay-hard sampling bring consistent improvement to the performance of all three models on both data sets.
Our contributions include (1) a systematic comparison of different sampling strategies with two benchmarks; and (2) proposal of semi-hard and decay-hard negative sampling strategies that can generally improve the performance of existing matching models on benchmarks.

Learning a Matching Model for Response Selection
Suppose that $D = \{(c_i, \{r^+_{i,j}\}_{j=1}^{n^+_i}, \{r^-_{i,k}\}_{k=1}^{n^-_i})\}_{i=1}^{N}$ is a training set, where $c_i$ is a conversation context, $\forall j \in \{1, \ldots, n^+_i\}$, $r^+_{i,j}$ is a positive response candidate that properly replies to $c_i$, and $\forall k \in \{1, \ldots, n^-_i\}$, $r^-_{i,k}$ is a negative response candidate that shows the model an erroneous reply to $c_i$. The learning problem of response selection is then to estimate a matching model $g(\cdot, \cdot)$ from $D$, which can be formulated as
$$\arg\min_{\Theta} \sum_{i=1}^{N} \Big[ \sum_{j=1}^{n^+_i} L\big(1, g(c_i, r^+_{i,j})\big) + \sum_{k=1}^{n^-_i} L\big(0, g(c_i, r^-_{i,k})\big) \Big] \quad (1)$$
where $\Theta$ are the parameters of $g(\cdot, \cdot)$, and $L(\cdot, \cdot)$ is a loss function. In practice, $L(\cdot, \cdot)$ is usually set as cross entropy, so the remaining problems are (1) how to define $g(\cdot, \cdot)$; and (2) how to construct $D$ given that large-scale human labeling is infeasible. Existing work has devoted considerable effort to Problem (1), but only adopts a simple heuristic for Problem (2), in which human responses are treated as $\{r^+_{i,j}\}_{j=1}^{n^+_i}$ (a common case is $n^+_i = 1$, since only one response is available for a specific context), and some randomly sampled responses are used as $\{r^-_{i,k}\}_{k=1}^{n^-_i}$. The problem with this heuristic is that there is no guarantee on which responses will be sampled as $\{r^-_{i,k}\}_{k=1}^{n^-_i}$: some of them could be false negatives, and some could be too trivial to recognize. This clear drawback of random sampling motivates us to pursue better negative sampling strategies in training data construction, as elaborated in the next section.
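The random-sampling heuristic described above can be sketched as follows. This is only an illustrative sketch, not the construction code used for the benchmarks: the function name `build_training_set`, its parameters, and the toy dialogues are hypothetical, and negatives are assumed to be drawn uniformly from the responses of other contexts.

```python
import random


def build_training_set(dialogues, num_negatives=1, seed=0):
    """Pair each context with its human response and random negatives.

    `dialogues` is a list of (context, response) pairs. Negatives are drawn
    uniformly from the responses of the other dialogues, so nothing prevents
    a draw from being a false negative or a trivially easy negative.
    """
    rng = random.Random(seed)
    all_responses = [response for _, response in dialogues]
    training_set = []
    for index, (context, positive) in enumerate(dialogues):
        # Candidate negatives: every response except the one at this index.
        candidates = all_responses[:index] + all_responses[index + 1:]
        negatives = rng.sample(candidates, num_negatives)
        training_set.append((context, [positive], negatives))
    return training_set


# Toy data, invented purely for illustration.
toy_dialogues = [
    ("how do I install vim?", "use apt-get install vim"),
    ("my wifi keeps dropping", "which driver are you using?"),
    ("thanks, that worked", "you are welcome"),
    ("what is the latest LTS?", "check the release notes"),
]
toy_training_set = build_training_set(toy_dialogues, num_negatives=1)
```

Each entry has the form (context, positives, negatives), mirroring one element of $D$ with $n^+_i = 1$.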

Model Adaptive Negative Sampling
Our idea is to dynamically adapt negative examples to matching models during learning. The idea is inspired by how humans learn: they adjust their learning materials according to their learning progress. Based on this idea, we consider four strategies to sample negative examples.
Minimum sampling: this strategy has been exploited in answer selection (Rao et al., 2016), and here we apply it to response selection for open-domain dialogue systems. Suppose that $\hat{g}(\cdot, \cdot)$ is the matching model obtained from the $t$-th mini-batch; then in the $(t+1)$-th mini-batch, we select the easiest negative example for a context $c$ according to $\hat{g}(\cdot, \cdot)$, which can be formulated as
$$r^- = \arg\min_{r \in R^-} \hat{g}(c, r) \quad (2)$$
where $R^-$ is a pool of negative examples for $c$.
Maximum sampling: similar to minimum sampling, this strategy is also borrowed from answer selection (Rao et al., 2016), but it attempts to select the hardest negative example by
$$r^- = \arg\max_{r \in R^-} \hat{g}(c, r) \quad (3)$$
Semi-hard sampling: the first two strategies might be too aggressive, as the easiest negative example could bring no new information to $\hat{g}(\cdot, \cdot)$, and the hardest one could be a false negative. To avoid both cases, we propose a semi-hard sampling strategy which selects a negative sample of moderate difficulty. Formally, the strategy is defined as
$$r^- = \arg\min_{r \in R^-} \big| \hat{g}(c, r^+) - \alpha - \hat{g}(c, r) \big| \quad (4)$$
where $r^+$ is the positive response candidate of $c$, and $\alpha$ is a constant. In Equation (4), we exploit $\alpha$ as a margin to control the distance in matching degree between the selected $r^-$ and $r^+$. The candidate whose matching degree is closest to $\hat{g}(c, r^+) - \alpha$ is picked as the negative example.
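As a minimal sketch of the three selection rules above, the functions below pick the index of a negative example from a pool, given the scores that the current model assigns to each candidate. The function names and the example scores are hypothetical; only the selection criteria come from the text.

```python
def minimum_sampling(pool_scores):
    """Index of the easiest negative: the lowest matching score in the pool."""
    return min(range(len(pool_scores)), key=lambda i: pool_scores[i])


def maximum_sampling(pool_scores):
    """Index of the hardest negative: the highest matching score in the pool."""
    return max(range(len(pool_scores)), key=lambda i: pool_scores[i])


def semi_hard_sampling(pool_scores, positive_score, margin):
    """Index of the candidate whose score is closest to positive_score - margin."""
    target = positive_score - margin
    return min(range(len(pool_scores)), key=lambda i: abs(pool_scores[i] - target))


# Hypothetical scores from the current model for a pool of four candidates.
pool_scores = [0.05, 0.30, 0.62, 0.88]
positive_score = 0.70

easiest = minimum_sampling(pool_scores)                                  # index 0
hardest = maximum_sampling(pool_scores)                                  # index 3
moderate = semi_hard_sampling(pool_scores, positive_score, margin=0.10)  # index 2
```

In this toy pool the hardest candidate scores above the positive response, which is exactly the situation in which maximum sampling risks picking a false negative, while the semi-hard rule settles on a candidate just below the positive.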
Decay-hard sampling: to imitate the way humans gradually increase the difficulty of their learning materials at different stages, we propose a decay-hard sampling strategy which decays the margin in Equation (4) as training proceeds. Specifically, we consider two methods, namely exponential decay and linear decay.
In the first method, the margin shrinks at an exponential rate. In the $t$-th mini-batch, the margin $\alpha_t$ is defined by
$$\alpha_t = \phi \, e^{\omega t} \quad (5)$$
where $0 < \phi < 1$ and $-1 < \omega < 0$ are parameters. In the second method, the margin decreases linearly with the training steps, which is given by
$$\alpha_t = \theta + \lambda t \quad (6)$$
where $0 < \theta < 1$ and $-1 < \lambda < 0$ are parameters. We carefully choose $\theta$ and $\lambda$ to make sure that $\alpha_t > 0$, $\forall t \in \{1, \ldots, T\}$, where $T$ refers to the maximum number of iterations.
Note that the descriptions above assume that, for each context, only one negative example is selected from the pool. This is for a fair comparison with random sampling on the benchmarks, as most existing work (Wu et al., 2017) uses one negative example per context in training. It is easy to extend the strategies to sample multiple negative examples (e.g., by picking the top $l$ examples with matching scores closest to $\hat{g}(c, r^+) - \alpha$ in semi-hard sampling).
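The two margin schedules and the top-$l$ extension can be sketched as below. The sketch assumes the schedules take the forms $\alpha_t = \phi e^{\omega t}$ and $\alpha_t = \theta + \lambda t$, which are consistent with the stated parameter ranges but should be treated as assumptions here; the function names and all numeric values are hypothetical and are not the settings used in the experiments.

```python
import math


def exponential_margin(step, phi, omega):
    """Assumed exponential schedule: phi * exp(omega * step), with omega < 0."""
    return phi * math.exp(omega * step)


def linear_margin(step, theta, lam):
    """Assumed linear schedule: theta + lam * step, with lam < 0.

    Raises ValueError if the margin would stop being positive, mirroring the
    requirement that the margin stays above zero for every training step.
    """
    margin = theta + lam * step
    if margin <= 0:
        raise ValueError("margin must stay positive; adjust theta or lam")
    return margin


def top_l_semi_hard(pool_scores, positive_score, margin, l):
    """Indices of the l candidates whose scores are closest to positive_score - margin."""
    target = positive_score - margin
    ranked = sorted(range(len(pool_scores)), key=lambda i: abs(pool_scores[i] - target))
    return ranked[:l]


# Hypothetical values chosen only to show the shape of the schedules.
early = exponential_margin(step=10, phi=0.5, omega=-0.01)
late = exponential_margin(step=100, phi=0.5, omega=-0.01)
selected = top_l_semi_hard([0.05, 0.30, 0.62, 0.88], positive_score=0.70, margin=0.10, l=2)
```

As the margin shrinks, the target score moves toward the score of the positive response, so the selected negatives become harder over time.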

Experiments
We compare different sampling strategies on two benchmarks.

Experimental Setup
The first data set we use is the Ubuntu Dialogue Corpus (Lowe et al., 2015) collected from chat logs of the Ubuntu Forum. We use the version provided by Xu et al. (2017). The data contains 1 million context-response pairs for training, and 0.5 million pairs for validation and test.
Following Lowe et al. (2015), we employ recall at position $k$ in $n$ candidates ($R_n@k$) as the evaluation metric.
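For concreteness, $R_n@k$ for a single context can be computed as in the sketch below. It assumes the common reading of the metric: the $n$ candidates are ranked by matching score, and the metric is the fraction of positive candidates that appear in the top $k$ (with one positive per context this reduces to a 0/1 hit). The function name and the example scores are hypothetical.

```python
def recall_at_k(scores, labels, k):
    """R_n@k for one context, where n = len(scores).

    `scores` are matching scores for the n candidates and `labels` are 1 for
    positive candidates and 0 for negatives. Returns the fraction of positives
    that are ranked within the top k by score.
    """
    num_positive = sum(labels)
    if num_positive == 0:
        raise ValueError("at least one positive candidate is required")
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = sum(labels[i] for i in ranked[:k])
    return hits / num_positive


# Ten hypothetical candidates; the single positive (index 8) is ranked third.
scores = [0.10, 0.80, 0.30, 0.90, 0.20, 0.05, 0.40, 0.60, 0.70, 0.15]
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

r10_at_1 = recall_at_k(scores, labels, k=1)
r10_at_2 = recall_at_k(scores, labels, k=2)
r10_at_5 = recall_at_k(scores, labels, k=5)
```

A corpus-level figure would then be the average of this value over all test contexts.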
Besides the Ubuntu data, we also choose the Douban Conversation Corpus (Wu et al., 2017) as an experimental data set. The data consists of multi-turn Chinese conversations collected from Douban group. There are 1 million context-response pairs for training, 50 thousand pairs for validation, and 6,670 pairs for test.

Matching Models
The following matching models are selected:
Dual-LSTM (Lowe et al., 2015): the model individually encodes a context and a response candidate with LSTMs, and then calculates a matching score based on the final states of the two LSTMs.
SMN (Wu et al., 2017): the model lets each utterance in a context interact with a response, and transforms the interaction matrices into a matching vector with a CNN. The matching vectors are finally accumulated by an RNN into a matching score.
DAM: the model performs matching in a manner similar to SMN, but represents a context and a response candidate with stacked self-attention and cross-attention.
In terms of both complexity and performance under random sampling, Dual-LSTM < SMN < DAM. As baselines, we consider two random sampling strategies. The first one is a static strategy in which negative examples are fixed throughout the entire learning procedure. This is how existing work learns a matching model with the data described in Section 4.1, and we denote the model as Model-Base. The second one is a dynamic strategy in which, in each mini-batch, a negative example is randomly sampled from $R^-$. This is a simplification of the model adaptive sampling strategies, and we denote a model learned with this strategy as Model-Rand. We denote a model trained with minimum sampling, maximum sampling, semi-hard sampling, exponential decay-hard sampling, and linear decay-hard sampling as Model-Min, Model-Max, Model-Semi, Model-EDecay, and Model-LDecay respectively. All models are implemented with TensorFlow and tuned on the validation sets. We make sure that Model-Base matches the previously reported performance on both data sets.

Implementation Details
For static random sampling, we just use the published training sets of both data sets. For the remaining sampling strategies, we randomly sample 10 responses for each context from the training sets as a pool of negative examples at each epoch. Every time, one response is sampled from the pool as a negative example for a context. Models trained with different sampling strategies are ...

Evaluation Results
Table 1 reports evaluation results on the two data sets. We can see that both semi-hard sampling and decay-hard sampling generally improve the three matching models on both data sets. Minimum sampling and maximum sampling are consistently worse than random sampling, which supports the statement we make in Section 3 that the two strategies are too aggressive. Decay-hard sampling is a little worse than semi-hard sampling. The reason might be that decay-hard sampling introduces false negatives into learning at late stages of training. Dynamic random sampling is better than static random sampling, because models can leverage more negative examples from the pool. It is worth noting that the proposed sampling strategies do not change the time needed for prediction, although they require a little more training time. Given the improvement we observe on the two data sets, we believe the small additional training cost is worthwhile.
Besides the strategies themselves, we are also interested in how the hyperparameter $\alpha$ affects the performance of semi-hard sampling. Figure 1 shows how the performance of SMN and DAM changes with respect to different margins (i.e., $\alpha$). We observe a similar trend for both models: performance first increases monotonically until the margin reaches 0.07, and then drops as the margin increases further. In particular, when the margin reaches 0.5, both models perform worse with the semi-hard sampling strategy than with the random sampling strategy. The reason behind this phenomenon is that when the margin is small, semi-hard sampling is similar to maximum sampling and introduces false negatives into learning, while when the margin is large, semi-hard sampling resembles minimum sampling and is prone to providing trivial samples for learning.

Related Work
Negative sampling strategies have been studied in many machine learning tasks. In computer vision, Faghri et al. (2017) study hard negatives and introduce a simple change to common loss functions for image-caption retrieval tasks. Guo et al. (2018) propose a fast negative sampler which chooses negative examples that are most likely to meet the requirements of violation according to the latent factors of images. In natural language processing, Kotnis and Nastase (2017) analyze the impact of negative sampling strategies on the performance of link prediction in knowledge graphs. Saeidi et al. (2017) study the effect of a tailored sampling strategy on the performance of document retrieval. Rao et al. (2016) use three negative sampling strategies to select the most informative negative samples for a pairwise ranking model for answer selection. Xu et al. (2015) introduce a straightforward negative sampling strategy to improve the assignment of subjects and objects in a convolutional neural network. To the best of our knowledge, this is the first empirical study of negative sampling strategies for learning matching models in multi-turn retrieval-based dialogue systems, which may inform future work on learning retrieval-based dialogue systems.

Conclusions
We present minimum sampling, maximum sampling, semi-hard sampling, and decay-hard sampling as four model adaptive negative sampling strategies for learning a matching model for retrieval-based dialogue systems. Evaluation results with three models on two benchmarks indicate that although minimum sampling and maximum sampling are worse than random sampling, both semi-hard sampling and decay-hard sampling generally improve the performance of the models on both data sets. In the future, we would like to extend our negative sampling strategies to other tasks.