Selection Bias Explorations and Debias Methods for Natural Language Sentence Matching Datasets

Natural Language Sentence Matching (NLSM) has gained substantial attention from both academia and industry, and rich public datasets contribute a lot to this process. However, biased datasets can also hurt the generalization performance of trained models and give untrustworthy evaluation results. For many NLSM datasets, the providers select some pairs of sentences into the datasets, and this sampling procedure can easily introduce unintended patterns, i.e., selection bias. One example is the QuoraQP dataset, where some content-independent naive features are unreasonably predictive. Such features reflect the selection bias and are termed "leakage features." In this paper, we investigate the problem of selection bias on six NLSM datasets and find that four of them are significantly biased. We further propose a training and evaluation framework to alleviate the bias. Experimental results on QuoraQP suggest that the proposed framework can improve the generalization ability of trained models, and give more trustworthy evaluation results for real-world adoption.


Introduction
Natural Language Sentence Matching (NLSM) aims at comparing two sentences and identifying the relationship between them (Wang et al., 2017), and serves as the core of many NLP tasks such as question answering and information retrieval (Wang et al., 2016b). Natural Language Inference (NLI) (Bowman et al., 2015) and Semantic Textual Similarity (STS) (Wang et al., 2016b) are both typical NLSM problems. A large number of publicly available datasets have benefited the research to a great extent (Kim et al., 2018; Wang et al., 2017; Tien et al., 2018), including QuoraQP 1, SNLI (Bowman et al., 2015), SICK (Marelli et al., 2014), etc. These datasets provide resources for both training and evaluation of different algorithms (Torralba and Efros, 2011). However, most of the datasets are prepared through procedures that involve a sampling process, which can easily introduce a selection bias (Heckman, 1977; Zadrozny, 2004). The problem gets even worse when the bias reveals label information, resulting in "leakage features", which are irrelevant to the content or semantics of the sentences but are predictive of the label. One example is QuoraQP, a dataset for classifying whether two sentences are duplicated (labeled as 1) or not (labeled as 0), which has been widely used to evaluate STS models (Gong et al., 2017; Kim et al., 2018; Wang et al., 2017; Devlin et al., 2018). In QuoraQP, three leakage features have been identified: S1 freq, the number of occurrences of the first sentence in the dataset; S2 freq, the number of occurrences of the second sentence; and S1S2 inter, the number of sentences that are paired with both the first and the second sentences in the dataset for comparison.
1 https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs
Figure 1 shows the distributions of normalized (negative) Word Mover's Distance (WMD) (Kusner et al., 2015) and normalized leakage features versus the labels in QuoraQP. The features are all normalized to their quantiles. As illustrated, the leakage features are more predictive than the WMD, as the differences between the distributions of positive and negative pairs are more significant. Moreover, combining S1 freq and S2 freq can make even more accurate predictions, as illustrated in Figure 2, where we calculate the averages of the labels under different S1 freq and S2 freq. We find that when both features' values are large, the pairs tend to be duplicated (marked in red), while when one is large and the other is small, the pairs tend to be not duplicated (marked in blue).
These leakage features play a critical role in the QuoraQP competition 2. As the evaluations are conducted with the same biased datasets, models that fit the bias pattern gain an additional advantage over unbiased models, making the benchmark results untrustworthy. On the other hand, the bias pattern does not exist in the real world, so if a model fits the bias pattern (intentionally or unintentionally), its generalization performance will be hurt, limiting the value of these datasets for further applications (Torralba and Efros, 2011).
2 https://www.kaggle.com/c/quora-question-pairs/discussion/34355 and https://www.kaggle.com/c/quora-question-pairs/discussion/33168
In this paper, we study this problem and demonstrate the impact of the selection bias by a series of experiments. We focus on the selection bias embodied in the comparing relationships of sentences, and the main contributions of this paper are the answers to the following questions:
• Does selection bias exist in other NLSM datasets? We identify four out of six publicly available datasets that suffer from the selection bias.
• Would DNN-based methods learn from the bias pattern unintentionally? We find that Siamese-LSTM models trained on QuoraQP do capture the bias pattern.
• Can we help the model learn the useful semantic pattern from the content without fitting the bias pattern? We propose an easy-to-adopt method to mitigate the bias. Experiments show that this method can improve the generalization performance of the trained models.
• Can we build an evaluation framework that gives us more reliable results for real-world adoption? We propose a more trustworthy evaluation method that demonstrates consistent results with unbiased cross-dataset evaluations.
The rest of the paper is organized as follows. Section 2 gives an empirical look at the selection bias on a variety of NLSM datasets and analyzes why the leakage features are effective. Section 3 examines whether DNN-based methods fit the bias pattern unintentionally. Section 4 introduces the training and evaluation framework to alleviate the biasedness. Taking QuoraQP as an example, we report the experimental results in Section 5. Section 6 summarizes related work, and Section 7 draws the conclusion.

Empirical Study of the Selection Bias
In this section, we investigate the problem of selection bias on six NLSM datasets and then analyze why the leakage features are effective.

Quantifying the Biasedness in Datasets
To quantify the severity of the leakage from the selection bias, we formulate a toy problem for NLSM. We predict the semantic relationship of two sentences based on the comparing relationships between sentences. We refer to the semantic relationship of two sentences as their label, for example, duplicated for STS and entailment for NLI, and to the comparing relationship as whether they are paired for comparison in the dataset. Here we only consider the index of each sentence, and the actual content is not used. The formal problem definition is as follows.
Problem 1 (Leveraging the Leakage for NLSM). Given a set of sentence ids $S$ and a set of comparing relationships of the sentences $C = \{\langle s_i, s_j \rangle\}$, $s_i, s_j \in S$, the goal is to infer the semantic relationship between given pairs of sentence ids from $S$.
This toy problem is indeed an edge classification problem (Aggarwal et al., 2016), as we can construct a graph using the comparing relationships as illustrated in Figure 3. In addition, from the graph perspective, S1 freq and S2 freq are the degrees of nodes, and S1S2 inter is the number of 2-hop paths connecting two nodes. Learning on the graph for this toy problem follows a transductive setting (Ji et al., 2010), where the graph is built with the comparing relationships of all the examples.
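The graph view above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' feature extraction code: the function name `leakage_features` and the toy sentence ids are hypothetical, each listed pair is assumed to contain two different sentence ids, and occurrences are counted regardless of whether a sentence appears in the first or second position.

```python
from collections import Counter, defaultdict


def leakage_features(pairs):
    """Return (s1_freq, s2_freq, s1s2_inter) for every (id1, id2) pair.

    `pairs` is the list of comparing relationships C. Only sentence ids
    are used; the sentence contents never enter the computation.
    """
    freq = Counter()              # occurrences of each sentence id
    neighbors = defaultdict(set)  # adjacency sets of the comparison graph
    for a, b in pairs:
        freq[a] += 1
        freq[b] += 1
        neighbors[a].add(b)
        neighbors[b].add(a)

    features = []
    for a, b in pairs:
        # Sentences paired with both a and b, i.e. 2-hop paths a - c - b.
        inter = len(neighbors[a] & neighbors[b])
        features.append((freq[a], freq[b], inter))
    return features


toy_pairs = [("q1", "q2"), ("q1", "q3"), ("q2", "q3"), ("q4", "q1")]
print(leakage_features(toy_pairs))
# [(3, 2, 1), (3, 2, 1), (2, 2, 1), (1, 3, 0)]
```

Because the features are computed over all pairs at once, the sketch mirrors the transductive setting: test pairs contribute to the graph even though their labels are never read.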
Based on the new problem definition, we investigate six NLSM datasets, including SNLI, MultiNLI (Williams et al., 2018), QuoraQP, MSRP (Dolan et al., 2004), SICK and ByteDance 3. We apply two different methods to classify the edges on the graph: Leakage, which uses the three leakage features introduced in Section 1, and Advanced, which uses some more advanced graph-based features (Perozzi et al., 2014; Zhou et al., 2009; Liben-Nowell and Kleinberg, 2007) together with the three leakage features 4. We also report the results of three baselines: Majority, which predicts the most frequent label; Unlexicalized, which uses 15 handcrafted features from the content of sentences (Bowman et al., 2015); and LSTM, which is a DNN-based method using sequences of word embeddings. All classifiers are Random Forests unless otherwise specified. The classifiers are trained with the training set, and we report the results on the testing set. More detailed settings are introduced in Appendix A. The results are reported in Table 1.
Predicting semantic relationships without using sentence contents seems impossible. However, we find that the graph-based features (Leakage and Advanced) make the problem feasible on a wide range of datasets. Specifically, on datasets like QuoraQP and ByteDance, the leakage features are even more effective than the unlexicalized features. One exception is that on MultiNLI, Majority outperforms Leakage and Advanced significantly. Another interesting finding is that on SNLI and ByteDance, advanced graph-based features improve a lot over the leakage features, while on QuoraQP, the difference is very small. Among all the tested datasets, only MSRP and SICK NLI are almost neutral to the leakage features. Note that their sizes are relatively small, with fewer than 10k samples each. Results in Table 1 raise concerns about the impact of selection bias on the models and evaluation results.
3 https://www.kaggle.com/c/fake-news-pairclassification-challenge
4 The features are selected carefully to describe the local structure between two nodes and to prevent the model from remembering the exact ID of sentences to make inferences.

Why are the Leakage Features Effective?
As discussed in Section 1, the leakage features are the reflection of selection bias. Intuitively, if we construct a dataset for NLSM by randomly sampling some pairs of sentences, the resulting dataset would be extremely imbalanced, where most of the pairs are neutral for NLI or not duplicated for STS. Thus, to make the dataset relatively balanced, a sampling strategy is often required. If the strategy is not well-designed, it will introduce a bias pattern into the dataset, which can be revealed by leakage features. Here we try to figure out why the leakage features are effective in the aforementioned datasets. Since we do not have every detail about how they are constructed, we base our analysis only on SNLI and QuoraQP.
During the preparation of SNLI, as introduced in (Bowman et al., 2015), human workers are presented with "premise scene descriptions", and asked to supply "hypotheses" for each of the three labels (i.e., entailment, neutral and contradiction). However, it is found that some workers are "reusing the same sentence for many different prompts", which might cause SNLI to suffer from selection bias. To validate this, we calculate the percentage of each label versus S2 freq, and the results are shown in Figure 4. We see that the percentages of the three labels are similar when S2 freq is small, but as S2 freq increases, the label is more likely to be entailment.
For the QuoraQP dataset, the providers state that "Our original sampling method returned an imbalanced dataset with many more true examples of duplicate pairs than non-duplicates. Therefore, we supplemented the dataset with negative examples. One source of negative examples were pairs of 'related questions' which, although pertaining to similar topics, are not truly semantically equivalent." Our hypothesis is that the way in which negative samples were supplemented is the reason why QuoraQP is so biased. For example, the newly added sentences of "related questions" may appear in the dataset only a limited number of times, which yields the phenomenon in Figure 2: if two sentences both appear many times, the pair is likely to be duplicated, while if one of them appears only a few times, the pair is likely to be not duplicated.
We conduct ablation experiments on the datasets where the leakage features are effective, i.e., SNLI, QuoraQP, SICK STS and ByteDance. The results are reported in Table 2. We can see that S2 freq is more effective in SNLI, and S1 freq plays a more critical role in SICK STS, while in QuoraQP and ByteDance, S1S2 inter is the most predictive. Based on the experiments and observations, we conclude that existing datasets tend to be biased for various reasons. Further study is required to understand the problem and prevent bias from being introduced into future datasets for research.
Do NN Models Fit the Bias Pattern Unintentionally?
In this section, we investigate whether DNN models are unintentionally fitting the bias pattern in addition to the semantic pattern. We train a classical Siamese-LSTM model with the training set of QuoraQP, and make predictions on a synthetic dataset. Interestingly, we find that the results are significantly influenced by the bias pattern.

Figure 5: Visualization of predicted scores versus the leakage feature (axis labels: Leakage Features, Predictions). The boxes span from the lower quartiles to the upper quartiles of predicted scores, and the lower whisker ends at the lowest datum within 1.5 IQR of the lower quartile.

The synthetic dataset is built in the following way. We extract the distinct sentences from the training set of QuoraQP and pair each sentence with itself, obtaining 517,970 pairs in total. Since the two sentences in each pair are identical, the labels are all duplicated. All three leakage features are the same, i.e., the number of occurrences of the sentence in the dataset. If the model could perfectly learn the semantic relationships between sentences, the predictions should be substantially the same for all the pairs. To illustrate the predicted scores of duplication, we visualize them versus the leakage features in Figure 5, where the boxplot follows the Tukey boxplot style (Frigge et al., 1989). Intriguingly, we find that even though the sentences in the pairs are all identical, the model still tends to give lower duplication scores to the pairs with leakage features equal to 1. This result is consistent with the bias pattern shown in Figure 2, i.e., the data points in the bottom left corner tend to be not duplicated, compared with the data points in the top right corner, which represent larger values of S1 freq and S2 freq.
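As a rough illustration of this construction (a sketch only; the helper name `build_identity_pairs` and the toy sentences are hypothetical, and the occurrence count is taken over the training pairs passed in):

```python
from collections import Counter


def build_identity_pairs(train_pairs):
    """Pair every distinct training sentence with itself.

    Returns tuples (sentence, sentence, label, occurrences), where the
    label is always 1 (duplicated) and `occurrences` is the shared value
    of the leakage features for that identity pair.
    """
    occurrences = Counter()
    for s1, s2 in train_pairs:
        occurrences[s1] += 1
        occurrences[s2] += 1

    # Counter preserves first-seen order, so the output order is stable.
    return [(s, s, 1, n) for s, n in occurrences.items()]


train = [("how do i cook rice", "how to cook rice"),
         ("how to cook rice", "what is rice")]
for row in build_identity_pairs(train):
    print(row)
```

Grouping the model's predicted scores on these rows by `occurrences` gives the kind of per-feature-value comparison that Figure 5 visualizes.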
The results indicate that the model is unintentionally capturing the undesired bias pattern that only exists in the particular dataset. This adversely affects the generalization performance of the trained models (to be illustrated in Section 5.4).

Leakage-Neutral Learning and Evaluation Method
Given a biased dataset, can we eliminate the bias to train completely unbiased models? Unfortunately, this is very difficult because the bias is related to the labels, and we do not have access to the labels of unselected samples (Zadrozny, 2004). In this paper, we propose to take a step back and define a leakage-neutral distribution, which is closer to the real world than the biased one. We make a few reasonable assumptions about it and about how the biased dataset is generated from it. We demonstrate that we can train and evaluate models unbiased to the leakage-neutral distribution with only the biased dataset.
Generation of the biased dataset from the leakage-neutral distribution. Assume that there is a leakage-neutral distribution $D$ with domain $\mathcal{X} \times \mathcal{Y} \times \mathcal{L} \times \mathcal{S}$, where $\mathcal{X}$ is the semantic feature space, $\mathcal{Y}$ is the (binary) semantic label space, $\mathcal{L}$ is the sampling strategy feature space and $\mathcal{S}$ is the (binary) sampling intention space. The sampling intentions represent whether dataset providers want to select a positive sample or a negative sample. For example, $S = 1$ means that the providers want to select a positive sample here. We assume that samples $(x, y, l, s)$ are drawn independently from $D$; if $s = y$ (the label matches the sampling intention), the sample is selected into the dataset, otherwise the sample is discarded. This operation results in the biased distribution $\hat{D}$ that is observed from the dataset.
In this section, we use uppercase letters, such as $Y$ and $S$, to represent random variables, and lowercase letters, such as $y$ and $s$, to represent specific values for samples. We use $P_{\hat{D}}(\cdot)$ to represent the probability on $\hat{D}$ and omit the subscript for $D$.
Assumptions about the leakage-neutral distribution. We make the following assumptions about $D$. The first one is the leakage-neutral assumption, defined as
$$P(Y \mid L) = P(Y),$$
which means that the sampling strategy is independent of the labels, making the leakage-neutral distribution closer to the real world.
The second one is that, given $L$, $S$ is independent of $X$ and $Y$, i.e.,
$$P(S \mid X, Y, L) = P(S \mid L),$$
which means that the sampling strategy features completely control the sampling intentions.
Leakage-neutral learning and evaluation method. Based on the assumptions above, given a biased dataset, the proposed method works in the following way. Firstly, we estimate $P_{\hat{D}}(Y = 1 \mid l)$ from the dataset for all samples. In practice, this can be achieved by training classifiers and making cross-predictions. Since we do not have access to the true sampling strategy features, we use the leakage features from the graph instead, as they are the reflection of the biased sampling strategy.
Then we can get $P(S = 1 \mid l)$, the conditional probability of the sampling intention $S$ on $D$ given $l$, using the following equation with $P(Y = 1)$ given:
$$P(S = 1 \mid l) = \frac{P(Y = 0)\, P_{\hat{D}}(Y = 1 \mid l)}{P(Y = 1) + P_{\hat{D}}(Y = 1 \mid l) - 2\, P(Y = 1)\, P_{\hat{D}}(Y = 1 \mid l)}. \tag{1}$$
The derivation of Equation (1) is presented in Appendix B.1. Afterwards, we use $w = \frac{1}{P(S = y \mid l)}$ as the weights for the samples (note that the labels $y$ are needed here). Training and evaluating with the weights gives us results unbiased to the leakage-neutral distribution.
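A minimal sketch of the weight computation follows. It relies on the relation implied by the two assumptions and the selection rule $s = y$, namely $P(S = 1 \mid l) = \frac{P(Y = 0)\, q}{P(Y = 1) + q - 2 P(Y = 1)\, q}$ with $q = P_{\hat{D}}(Y = 1 \mid l)$. The function names and the clipping constant `eps` are assumptions made for illustration and are not taken from the paper.

```python
def sampling_intention_prob(q, prior_pos):
    """P(S=1 | l) from q = P_Dhat(Y=1 | l) and prior_pos = P(Y=1)."""
    denom = prior_pos + q - 2.0 * prior_pos * q
    return (1.0 - prior_pos) * q / denom


def sample_weight(q, y, prior_pos, eps=1e-6):
    """Weight w = 1 / P(S=y | l) for one sample with label y in {0, 1}."""
    # Clipping q away from 0 and 1 is an added safeguard (assumption),
    # so that the weight stays finite for over-confident estimates.
    q = min(max(q, eps), 1.0 - eps)
    p_s1 = sampling_intention_prob(q, prior_pos)
    p_match = p_s1 if y == 1 else 1.0 - p_s1
    return 1.0 / p_match


# A positive pair whose leakage features already predict "duplicate"
# strongly is down-weighted relative to a negative pair with the same l.
print(sample_weight(0.8, 1, prior_pos=0.5))
print(sample_weight(0.8, 0, prior_pos=0.5))
```

With `prior_pos = 0.5` the formula reduces to `P(S=1 | l) = q`, so the weights become plain inverse class probabilities given the leakage features.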
The step-by-step procedure for leakage-neutral learning and evaluation is presented in Algorithm 1. Note that our analyses and the proposed method are general enough for a variety of biases, as long as a sampling strategy feature is given.
Theoretical guarantee of unbiasedness. Assuming that we know $P(S = y \mid l)$ and that it is greater than zero for any $l$, the following theorem shows that we can obtain a loss unbiased to the leakage-neutral distribution after using the sample weights.
Theorem 1 (Unbiased Expectation). For any classifier $f = f(x, l)$, and for any loss function $\Delta = \Delta(f(x, l), y)$, if we use $w = \frac{P(S = Y)}{P(S = y \mid l)}$ as weights, then
$$\mathbb{E}_{x, y, l \sim \hat{D}}\left[ w\, \Delta(f(x, l), y) \right] = \mathbb{E}_{x, y, l \sim D}\left[ \Delta(f(x, l), y) \right].$$
The proof is presented in Appendix B.2. Since $P(S = Y)$ is a constant that does not affect the models, we can concentrate on the denominator, i.e., $P(S = y \mid l)$, and use $w = \frac{1}{P(S = y \mid l)}$ as the weights instead. The loss can be used for both training and evaluation unbiased to the leakage-neutral distribution.
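The statement of Theorem 1 can be checked numerically on a small discrete example. The sketch below is not the proof from Appendix B.2; it enumerates a made-up distribution that satisfies the two assumptions (all probabilities and loss values are invented for illustration) and, for brevity, lets the loss depend only on $(l, y)$ rather than on $x$.

```python
from itertools import product


def check_unbiasedness():
    """Compare expected losses under D, weighted D-hat and raw D-hat."""
    p_l = {0: 0.5, 1: 0.3, 2: 0.2}    # P(L = l)
    prior_pos = 0.3                   # P(Y = 1), independent of L
    p_s1 = {0: 0.2, 1: 0.6, 2: 0.9}   # P(S = 1 | l), independent of Y
    loss = {(0, 0): 0.1, (0, 1): 0.9, (1, 0): 0.4,
            (1, 1): 0.3, (2, 0): 0.8, (2, 1): 0.05}

    def p_y(y):
        return prior_pos if y == 1 else 1.0 - prior_pos

    def p_match(l, y):                # P(S = y | l)
        return p_s1[l] if y == 1 else 1.0 - p_s1[l]

    # Leakage-neutral joint P(l, y) and its expected loss.
    joint = {(l, y): p_l[l] * p_y(y) for l, y in product(p_l, (0, 1))}
    neutral = sum(joint[k] * loss[k] for k in joint)

    # Biased distribution: keep a sample only when s == y, then renormalize.
    p_selected = sum(joint[l, y] * p_match(l, y) for l, y in joint)  # P(S = Y)
    biased = {(l, y): joint[l, y] * p_match(l, y) / p_selected
              for l, y in joint}

    weighted = sum(biased[l, y] * (p_selected / p_match(l, y)) * loss[l, y]
                   for l, y in biased)
    unweighted = sum(biased[k] * loss[k] for k in biased)
    return neutral, weighted, unweighted


neutral, weighted, unweighted = check_unbiasedness()
print(neutral, weighted, unweighted)
```

In this toy setting the weighted expectation under the biased distribution coincides with the leakage-neutral expectation, while the unweighted one does not.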

Experimental Results for the Leakage-neutral Method on QuoraQP
In this section, we present the experimental results for leakage-neutral learning on QuoraQP. We demonstrate that the proposed learning framework can mitigate the bias and improve the generalization performance of trained models. Besides, the corresponding evaluation method can serve as a more reliable in-domain benchmark compared with the biased one.

Dataset Information and Weight Generation
We use QuoraQP as our experimental dataset. We use the same dataset partition as (Wang et al., 2017). We use the three leakage features for generating the weights. We use Random Forest classifiers to estimate $P_{\hat{D}}(Y = 1 \mid l)$, and take the 100-fold cross-predictions as the estimated values. $P(Y = 1)$ is chosen to keep the proportion of the weights of positive and negative samples unchanged, and the mean of the weights is normalized to 1. The minimum weight of all the samples is 0.51, and the maximum weight is 4953.17.
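The choice of $P(Y = 1)$ is described only briefly, so the following sketch is one possible reading rather than the authors' procedure. It assumes that "keeping the proportion unchanged" means the positive samples receive the same share of total weight as they have of the sample count, which holds exactly when the mean weight of positives equals the mean weight of negatives. Writing $r = P(Y = 1) / P(Y = 0)$ and $q = P_{\hat{D}}(Y = 1 \mid l)$, the weights $1 / P(S = y \mid l)$ are $1 + r(1 - q)/q$ for positives and $1 + q / ((1 - q) r)$ for negatives, which gives a closed form for $r$. The function name is hypothetical, and every estimate `q` is assumed to lie strictly between 0 and 1.

```python
import math


def choose_prior_and_weights(q, y):
    """Pick P(Y=1) so both classes have equal mean weight; mean weight is 1.

    q: cross-predicted estimates of P_Dhat(Y=1 | l), each in (0, 1)
    y: binary labels, with at least one sample of each class
    """
    pos_terms = [(1.0 - qi) / qi for qi, yi in zip(q, y) if yi == 1]
    neg_terms = [qi / (1.0 - qi) for qi, yi in zip(q, y) if yi == 0]
    a = sum(pos_terms) / len(pos_terms)
    b = sum(neg_terms) / len(neg_terms)
    r = math.sqrt(b / a)              # prior odds P(Y=1) / P(Y=0)

    weights = [
        1.0 + r * (1.0 - qi) / qi if yi == 1 else 1.0 + qi / ((1.0 - qi) * r)
        for qi, yi in zip(q, y)
    ]
    mean_w = sum(weights) / len(weights)
    weights = [w / mean_w for w in weights]   # normalize the mean to 1
    return r / (1.0 + r), weights


prior, weights = choose_prior_and_weights([0.9, 0.6, 0.7, 0.2], [1, 1, 0, 0])
print(prior, weights)
```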

Experiment Settings
We implement a classical Siamese-LSTM model with Keras and a TensorFlow (Abadi et al., 2016) backend. Sequences of the embeddings of word tokens are fed into the LSTM layer with a hidden size of 128. Then the representations of both sentences, as well as the dot product of the representations, go through a two-layer MLP where Batch Normalization (Ioffe and Szegedy, 2015) is applied after every hidden layer. Dropout (Srivastava et al., 2014) with rate 0.5 is applied after the last hidden layer. We use the RMSProp (Tieleman and Hinton, 2012) optimizer to train all the parameters. The learning rate starts at 1e-3, and decays at a fixed rate of 0.2 when performance does not improve on the validation set. We also use a gradient clipping of 5.0. The batch size is set to 256. All the results reported in this section are the averages of ten runs using the same hyper-parameters with different random initializations. Our implementation achieves slightly better performance compared with the results of the original Siamese-LSTM from Wang et al. (2017) 6.
We initialize our word embeddings with pretrained GloVe 840B 300D vectors (Pennington et al., 2014), and the embeddings are kept fixed during training. All the sentences are truncated to a maximum of 35 word tokens.
Note that the scale of the weights varies greatly across samples. To prevent the model from fluctuating during mini-batch training, we use a sampling strategy for model training, i.e., we sample examples with probabilities proportional to the weights to get the data for every mini-batch.
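A minimal sketch of such weight-proportional mini-batch sampling, using only the Python standard library, is given below. The generator name is hypothetical, and sampling with replacement is an assumption, since the text does not say whether examples may repeat within a batch.

```python
import random


def weighted_batches(weights, batch_size, num_batches, seed=0):
    """Yield lists of example indices drawn with probability proportional
    to `weights` (with replacement)."""
    rng = random.Random(seed)
    indices = range(len(weights))
    for _ in range(num_batches):
        yield rng.choices(indices, weights=weights, k=batch_size)


# Example: the second example carries nine times the weight of the first,
# so it should make up roughly 90% of the sampled indices.
batches = list(weighted_batches([1.0, 9.0], batch_size=256, num_batches=100))
share = sum(batch.count(1) for batch in batches) / (256 * 100)
print(round(share, 2))
```

Each sampled example then enters the loss with unit weight, so a single sample with a very large weight cannot dominate one gradient step the way it could if the weights multiplied the loss directly.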

Evaluation Scheme
To evaluate the effectiveness of leakage-neutral learning, we use the following strategy in our experiments. Firstly, we train and validate a model using the data from QuoraQP without any weights. The model is referred to as Biased Model. Then we train and validate a model using the data from QuoraQP with the weights, and the model is referred to as Debiased Model. These two models are evaluated with the following methods.
• Testing set evaluation. We evaluate the models with the testing set of QuoraQP. Evaluation without the weights is named Biased Eva, and evaluation with the weights is named Debiased Eva. This shows how the leakage-neutral evaluation proposed in Section 4 affects the evaluation results.
• Synthetic dataset evaluation. We evaluate the performance of the models with the synthetic dataset introduced in Section 3. A better model is supposed to give higher accuracy and tends to be less impacted by the bias pattern.
• Cross-dataset evaluation. We evaluate how the models perform on other STS datasets, i.e., MSRP and SICK STS. We use the entire datasets for evaluation. As the preparation strategies of different datasets are different, cross-dataset evaluations will not give additional rewards for the selection bias of QuoraQP. Although different datasets may have different contexts, a better model trained with QuoraQP is still supposed to perform better.
6 The code and the weights will be published upon acceptance of the paper.
Among all the evaluation methods, using the testing set for evaluation without weights (Biased Eva) is biased, and we will show that the Debiased Eva is more consistent with the unbiased synthetic dataset evaluation and cross-dataset evaluations.
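The difference between Biased Eva and Debiased Eva can be illustrated with a weighted accuracy. This is a sketch assuming accuracy is the reported metric and that the sample weights $w$ are applied directly to the correctness indicator; the function name and the toy numbers are hypothetical.

```python
def weighted_accuracy(y_true, y_pred, weights=None):
    """Accuracy where each sample counts in proportion to its weight.

    With weights=None every sample counts equally (the Biased Eva case);
    passing the leakage-neutral weights gives the Debiased Eva case.
    """
    if weights is None:
        weights = [1.0] * len(y_true)
    correct = sum(w for t, p, w in zip(y_true, y_pred, weights) if t == p)
    return correct / sum(weights)


y_true = [1, 0, 1, 0]
y_pred = [1, 0, 0, 0]
print(weighted_accuracy(y_true, y_pred))                        # 0.75
print(weighted_accuracy(y_true, y_pred, [1.0, 1.0, 4.0, 2.0]))  # 0.5
```

In the toy example the single misclassified pair carries a large weight, so the weighted score drops even though the unweighted accuracy looks high.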

Experimental Results
The evaluation results on the testing set of QuoraQP are reported in Table 3. From the accuracy of the method Leakage, we can see that although the influence is not completely eliminated, the evaluation result of Debiased Eva is less impacted by the bias pattern in the original distribution. This makes the results more reliable for evaluations.
As for the Biased Model and the Debiased Model, we find that the Biased Model performs significantly better under the Biased Eva. This is the effect of fitting the bias pattern in addition to the semantic pattern, thus taking some extra advantage that cannot be generalized to real-life cases. On the other hand, under the Debiased Eva, we find that the Debiased Model performs the best.
Table 4 reports the results on the datasets that are not biased to the leakage pattern of QuoraQP. We find that the Debiased Model significantly outperforms the Biased Model on all three datasets. This indicates that the Debiased Model better captures the true semantic similarities of the input sentences. We further visualize the predictions on the synthetic dataset in Figure 6. As illustrated, the predictions of the Debiased Model are more neutral to the leakage feature.
From the experimental results, we can see that the proposed leakage-neutral training method is effective, as the Debiased Model performs significantly better on the synthetic dataset, MSRP and SICK, showing better generalization ability. Moreover, the Debiased Eva gives results that are more consistent with the results on unbiased datasets, so it can serve as a more reliable in-domain way to evaluate models trained with QuoraQP. In conclusion, our constructed leakage-neutral distribution is closer to the real-world one than the biased distribution that is directly observed from the given datasets.

Related Work
In this section, we summarize the related work and distinguish it from our contributions.
Inverse propensity score for debiasing. Usually, the Inverse Propensity Score (IPS) is used to reduce the selection bias (Schonlau et al., 2009; d'Agostino, 1998), where the propensity score (Rosenbaum and Rubin, 1983) is the probability that a sample will be selected into the dataset. Zadrozny (2004) studies the learning and evaluation of classifiers under sample selection bias, although the focus there is the "missing-at-random" (MAR) (Little and Rubin, 2014) problem, where the biasedness only depends on the feature vector x.
For NLSM datasets, the selection bias is "not-missing-at-random" (NMAR) (Little and Rubin, 2014), so we cannot hope to estimate the true propensity scores directly, as that requires the labels of unselected samples (Zadrozny, 2004). In this paper, we propose to fit a constructed leakage-neutral distribution, which can be achieved with only the selected samples that we can access.
Biasedness of datasets. Although dataset bias is often mentioned, the research community has not paid sufficient attention to it compared with models and algorithms. Torralba and Efros (2011) studied the dataset bias for image recognition datasets, and categorized the bias into Selection Bias, Capture Bias and Negative Set Bias. Selection bias is widely studied in the search ranking field as position bias (Wang et al., 2016a, 2018; Joachims et al., 2017). Usually the propensity scores are estimated through online Result Randomization (Joachims et al., 2017).
In the NLP field, Minka and Robertson (2008) studied the selection bias in the LETOR datasets, and found that Reverse BM25 performs unreasonably well due to the selection procedure. Dixon et al. (2018) studied the potential unfairness for toxic comments classification due to unintended bias, and proposed methods to mitigate it by balancing the training dataset with additional data. Gururangan et al. (2018) and Poliak et al. (2018) found that some NLI datasets are biased toward specific linguistic phenomena.
In this paper, we study the selection bias embodied in the comparing relationships in NLSM datasets. To the best of our knowledge, this is the first study on this kind of selection bias.

Conclusion
In this paper, we take a close look at the selection bias of NLSM datasets and focus on the selection bias embodied in the comparing relationships of sentences. To mitigate the bias, we propose an easy-to-adopt method for leakage-neutral learning and evaluation.
However, there is still much to do to form a clearer scope of this problem. For example, we still do not know the details of the preparation of many other NLSM datasets, and we cannot say to what extent the assumptions in Section 4 hold in QuoraQP or what the relationship is between the leakage-neutral distribution and the real-world distribution. We suggest that the providers of future NLSM datasets pay more attention to this problem. Furthermore, they could reveal their sample selection strategy in more detail, and might publish some official weights to eliminate the bias.

Figure 1: Visualization for the distributions of normalized features versus the label in QuoraQP. The right part (in red) represents the distributions of duplicated pairs, and the left part (in blue) represents the distributions of not duplicated pairs. Best viewed in color.

Figure 2: The averages of the labels under different S1 freq and S2 freq. Red squares indicate that the averages are close to 1, and blue squares indicate that the averages are close to 0. Best viewed in color.

Figure 4: The percentage of each label versus S2 freq in SNLI.

Algorithm 1: Leakage-neutral Training and Evaluation
Input: The dataset {x, y}, the number of folds K for cross-prediction, and the prior probability $P(Y = 1)$.
Procedure:
01 Extract the leakage features l from the dataset.
02 Estimate $P_{\hat{D}}(Y = 1 \mid l)$ for all samples by training classifiers and using a K-fold cross-prediction strategy.
03 Calculate $P(S = 1 \mid l)$ for all samples according to Equation (1).
04 Obtain the weights $w = \frac{1}{P(S = y \mid l)}$ for all samples and normalize the mean of the weights.
05 Train and validate models with the training set and validation set respectively, using w as the sample weights.
06 Evaluate the models with the testing set using w as the sample weights.

Table 3: Evaluation results with the testing set of QuoraQP. "%" is omitted.

Table 4: Evaluation results with the synthetic dataset, MSRP and the SICK STS dataset. "%" is omitted.