On the Importance of Adaptive Data Collection for Extremely Imbalanced Pairwise Tasks

Many pairwise classification tasks, such as paraphrase detection and open-domain question answering, naturally have extreme label imbalance (e.g., 99.99% of examples are negatives). In contrast, many recent datasets heuristically choose examples to ensure label balance. We show that these heuristics lead to trained models that generalize poorly: State-of-the art models trained on QQP and WikiQA each have only 2.4% average precision when evaluated on realistically imbalanced test data. We instead collect training data with active learning, using a BERT-based embedding model to efficiently retrieve uncertain points from a very large pool of unlabeled utterance pairs. By creating balanced training data with more informative negative examples, active learning greatly improves average precision to 32.5% on QQP and 20.1% on WikiQA.


Introduction
For most pairwise classification tasks in NLP, the most realistic data distribution has extreme label imbalance (e.g., 99.99% of examples have the same label). In question deduplication (Iyer et al., 2017), the vast majority of pairs of questions from an online forum are not duplicates. In open-domain question answering (Yang et al., 2015;, almost any randomly sampled document will not answer a given question. Random pairs of sentences from a diverse distribution will have no relation between them in natural language inference (Bowman et al., 2015), as opposed to an entailment or contradiction relationship.
While past work has recognized the importance of label imbalance in NLP (Lewis et al., 2004;Chawla et al., 2004), many recently released datasets are heuristically collected to ensure label balance, generally for ease of training. For instance, * Authors contributed equally.  Figure 1: Modern benchmarks often use heuristically balanced data for training and evaluation. We find that models trained on this data perform poorly on the very imbalanced all-pairs distribution and develop adaptive methods to collect training data for this setting.
the Quora Question Pairs (QQP) dataset (Iyer et al., 2017) was generated by mining non-duplicate questions that were heuristically determined to be nearduplicates. The SNLI dataset had crowdworkers generate inputs to match a specified label distribution (Bowman et al., 2015). In this work, we show that models trained on heuristically balanced datasets deal poorly with natural label imbalance at test time: They have very low average precision on realistically imbalanced test data created by taking all pairs of test utterances.
Instead of heuristically producing static training datasets, we study adaptive data collection methods. In particular, we apply the pool-based active learning framework (Settles, 2009) to extremely imbalanced pairwise tasks. At training time, the system has query access (i.e., the ability to collect a limited subset of the labels) to a large unlabeled dataset formed by taking all utterance pairs from a set of training utterances. For example, in the question deduplication setting, we might have a budget to annotate pairs of questions as "duplicate" or "not duplicate," and wish to train a model on this data to find new duplicate pairs with high precision. Data collection for extremely imbalanced pairwise tasks is challenging because the pool of unlabeled examples is very large (as it grows quadratically in the number of utterances) and very few of the examples are positive. We collect balanced training data using uncertainty sampling, an adaptive method that queries labels for examples on which a model trained on previously queried data has high uncertainty (Lewis and Gale, 1994). To lower the computational cost of searching for uncertain points, we propose combining active learning with a BERT embedding model for which uncertain points can be located efficiently using nearest neighbor search.
In this work, we empirically show that our use of adaptive data collection yields significant gains over static heuristics. In order to compare methods without collecting data separately for each method and each run, we perform retrospective data collection with imputed labels to simulate data collection. Uncertainty sampling with our BERT embedding model achieves 32.5% and 20.1% average precision for QQP and WikiQA, respectively. In contrast, state-of-the-art models trained on the original heuristically collected data each have only average precision of 2.4%.

Setting
The pairwise tasks described above fall under a more general category of binary classification tasks, those with an input space X and output space {0, 1}. We assume the label y is a deterministic function of x, which we write y(x). A classification model p θ yields probability estimates p θ (y | x) where x ∈ X .
Our setting has two aspects: The way training data is collected via label queries (Section 2.1) and the way we evaluate the model p θ (y | x) by measuring average precision (Section 2.2). This work focuses on pairwise tasks (Section 2.3), which enables efficient active learning (Section 4).

Data collection
In our setting, a system is given an unlabeled dataset D train all ⊆ X . The system can query an input x ∈ D train all and receive the corresponding label y(x). The system is given a budget of n queries to build a labeled training dataset of size n.

Evaluation
Following standard practice for imbalanced tasks, we evaluate on precision, recall, and average precision (Lewis, 1995;Manning et al., 2008). A scoring function S : X → R (e.g., S(x) = p θ (y = 1 | x)) is used to rank examples x, where an ideal S assigns all positive examples {x : y(x) = 1} a higher score than all negative examples {x : y(x) = 0}. Given a test dataset D test all ⊆ X , define the number of true positives, false positives, and false negatives of a scoring function S at a threshold γ as: For any threshold γ, define the precision P (S, γ) and recall R(S, γ) of a scoring function S as: Let Γ = {S(x) : x ∈ D test all } be the set of scores of the dataset. By sweeping over all distinct values Γ in descending order, we trace out the precisionrecall curve. The area under the precision recall curve or average precision (AP) is defined as: where γ 0 = ∞ and γ 1 > γ 2 > . . . γ |Γ| and γ i ∈ Γ.
Note that high precision requires very high accuracy when the task is extremely imbalanced. For example, if only one in 10, 000 examples in D test all is positive and 50% precision at some recall is achieved, that implies 99.99% accuracy.

Pairwise tasks
In this work, we focus on "pairwise" tasks, meaning that the input space decomposes as X = X 1 × X 2 . For instance, X 1 could be questions and X 2 could be paragraphs for open-domain question answering, and X 1 = X 2 could be questions for question deduplication. We create the unlabeled all-pairs training set D train all by taking the cross product of a subset from X 1 and a subset from X 2 . We follow the same procedure to form the all-pairs test set, D test all . As is standard practice, we ensure that the train and test all-pairs sets are disjoint.
Many pairwise tasks require high average precision on all-pairs test data. A question deduplication system must compare a new question with all previously asked questions to determine if a duplicate exists. An open-domain question-answering system must search through all available documents for one that answers the question. In both cases, the all-pairs distribution is extremely imbalanced, as the vast majority of pairs are negatives, while standard datasets are artificially balanced.

Results training on heuristic datasets
In this section, we show that state-of-the-art models trained on two standard pairwise classification datasets-QQP and WikiQA-do not generalize well to our extremely imbalanced all-pairs test data, which we create from an original dataset by imputing a negative label for all pairs that are not marked as positive. Both QQP and WikiQA were collected using static heuristics that attempt to find points x ∈ X that are more likely to be positive. These heuristics are necessary because uniformly sampling from X is impractical due to the label imbalance: if the proportion of positives is 10 −4 , then random sampling would have to label 10,000 examples on average to find one positive example. Standard models can achieve high test accuracy on test data collected with these heuristics, but fare poorly when evaluated on all-pairs data derived from the same data source (Section 3.2). Manual inspection confirms that these models often make surprising false positive errors (Section 3.3).

Evaluation
We evaluate models on both heuristically balanced test data and our imbalanced all-pairs test data.
Heuristically balanced evaluation. Let D pos denote the set of all positive examples, and D statedneg denote the set of stated negative examples-negative examples in the original heuristically collected dataset. We define the stated test dataset D test heur as the pairs in the original balanced dataset that are also in our defined test dataset: all . This is similar to the original QQP test data, but with a different train/test split. We use task-specific evaluation metrics described in the next section.
All-pairs evaluation. All-pairs evaluation metrics depend on the label of every pair in D test all . We approximate these labels by imputing (possibly noisy) labels on all pairs using the available labeled data, as described in the next section. 1 In Section 3.3, we manually label examples to confirm our results from this automatic evaluation.
Computing the number of false positives FP(S, γ) requires enumerating all negative examples in D test all , which is too computationally expensive with our datasets. To get an unbiased estimate of FP(S, γ), we could randomly subsample D test all , but the resulting estimator has high variance. We instead compute an unbiased estimator that uses importance sampling. In particular, we combine counts of errors on a set of "nearby negative" examples D test near ⊆ D test all , pairs of similar utterances on which we expect more false positives to occur, and random negatives D test rand sampled uniformly from negatives in D test all \ D test near . Details are provided in Appendix A.2.

Datasets
Quora Question Pairs (QQP). The task for QQP (Iyer et al., 2017) is to determine whether two questions are paraphrases. The non-paraphrase pairs in the dataset were chosen heuristically, e.g., by finding questions on similar topics. We impute labels on all question pairs by assuming that two questions are paraphrases if and only if they are equivalent under the transitive closure of the equivalence relation defined by the labeled paraphrase pairs. 2 We randomly partition all unique questions into train, dev, and test splits, ensuring that no two questions that were paired (either in positive or negative examples) in the original dataset end up in different splits. Since every question is a paraphrase of itself, we define D train all as the set of distinct pairs of questions from the training questions, and define D dev all and D test all analogously. For heuristically balanced evaluation, we report accuracy and F1 score  WikiQA. The task for WikiQA (Yang et al., 2015) is to determine whether a question is answered by a given sentence. The dataset only includes examples that pair a question with sentences from a Wikipedia article believed to be relevant based on click logs. We impute labels by assuming that question-sentence pairs not labeled in the dataset are negative. We partition the questions into train, dev, and test, following the original questionbased split of the dataset, and then take the direct product with the set of all sentences in the original dataset to form the train, dev, and test sets. For all WikiQA models, we prepend the title of the source article to the sentence to give the model information about the sentence's origin, as in . For heuristically balanced evaluation, we report two evaluation metrics. Following standard practice, we report clean mean average precision (c-MAP), defined as MAP over "clean" test questions-questions that are involved in both a positive and negative example in D test heur (Garg et al., 2020). We also report F1 score across all examples in D test heur (a-F1), which unlike c-MAP considers the more realistic setting where questions may not be answerable given the available article. This introduces more label imbalance, as positives make up 6% of D test heur but 12% of clean examples. The original WikiQA paper advocated a-F1 (Yang et al., 2015), but most subsequent papers do not report it (Shen et al., 2017;Yoon et al., 2019;Garg et al., 2020).
Data statistics.

Models
We train four state-of-the-art models that use BERTbase (Devlin et al., 2019), XLNet-base (Yang et al., 2019), RoBERTa-base (Liu et al., 2019), and ALBERT-base-v2 (Lan et al., 2020), respectively. As is standard, all models receive as input the concatenation of x 1 and x 2 separated by a special token; we refer to these as concatenationbased (CONCAT) models. We train on binary crossentropy loss for 2 epochs on QQP and 3 epochs on WikiQA, chosen to maximize dev all-pairs AP for RoBERTa. We report the average over three random seeds for training.

Evaluation results
As shown in Table 2, state-of-the-art models trained only on stated training data do well on heuristically balanced test data but poorly on the extremely imbalanced all-pairs test data. On QQP, the best model gets 80.2% F1 on heuristically balanced test examples. 3 However, on all-pairs test data, the best model can only reach 3.5% precision at a modest 20% recall. On WikiQA, our best c-MAP of 84.6% is higher than the best previously reported c-MAP without using additional question-answering data, 83.6% (Garg et al., 2020). However, on all-pairs test data, the best model gets 6.5% precision at 20% recall. All-questions F1 on heuristically balanced data is also quite low, with the best model only achieving 53.6%. Since a-F1 evaluates on a QQP, CONCATBERT trained on D train heur x1: "How do I overcome seasonal affective disorder?" x2 :"How do I solve puberty problem?" x1: "What will 10000 A.D be like?" x2 :"Does not introduction of new Rs.2000 notes ease carrying black money in future?" x1: "Can a person with no Coding knowledge learn Machine learning?" x2: "How do I learn Natural Language Processing?" WikiQA, CONCATBERT trained on D train heur x1: "where does limestone form?" x2: "Glacier cave . A glacier cave is a cave formed within the ice of a glacier ." x1: "what is gravy made of?" x2: "Amaretto. It is made from a base of apricot pits or almonds, sometimes both." more imbalanced distribution than c-MAP, this further demonstrates that state-of-the-art models deal poorly with test-time label imbalance. Compared with a-F1, all-pairs evaluation additionally shows that models make many mistakes when evaluated on questions paired with less related sentences; these examples should be easier to identify as negatives, but are missing from D train heur .

Manual verification of imputed negatives
Our all-pairs evaluation results are based on automatically imputed negative labels, rather than the gold label evaluation metrics. To check the validity of our results, we manually labeled putative false positive errors-examples that our model labeled positively but for which the imputed label was negative-to more accurately estimate precision. We focused on the best QQP model and random seed combination on the development set, which got 8.2% precision at 20% recall. 4 For this recall threshold, we manually labeled 50 randomly chosen putative false positives from D dev near , and 50 more from D dev rand . In 72% and 92% of cases, respectively, the imputed label was correct and the model was wrong. Extrapolating from these results, we estimate the true precision of the model to be 9.5%, still close to our original estimate of 8.2%. See Appendix A.3 for more details. For the remainder of the paper, we simply use the imputed labels, keeping in mind this may underestimate precision. Figure 2 shows real false positive predictions at 20% recall for the best QQP and WikiQA models. For QQP, models often make surprising errors on pairs of unrelated questions (first two examples), as well as questions that are somewhat related but distinct (third example). For WikiQA, models often predict a positive label when something in the sentence has the same type as the answer to the question, even if the sentence and question are unrelated. While these pairs seem easy to classify, the heuristically collected training data lacks coverage of these pairs, leading to poor generalization.

Active learning for pairwise tasks
As shown above, training on heuristically collected balanced data leads to low average precision on all pairs. How can we collect training data that leads to high average precision? We turn to active learning, in which new data is chosen adaptively based on previously collected data. Adaptivity allows us to ignore the vast majority of obvious negatives (unlike random sampling) and iteratively correct the errors of our model (unlike static data collection) by collecting more data around the model's decision boundary.

Active learning
Formally, an active learning method takes in an unlabeled dataset D train all ⊆ X . Data is collected in a series of k > 1 rounds. For the i th round, we choose a batch B i ⊆ D train all of size n i and observe the outcome as the labels {(x, y(x)) : x ∈ B i }. The budget n is the total number of points labeled, i.e., n = k i=1 n i . This process is adaptive because we can choose batch B i based on the labels of the previous i − 1 batches. Static data collection corresponds to setting k = 1.
Uncertainty sampling. The main active learning algorithm we use is uncertainty sampling (Lewis and Gale, 1994), which is simple, effective, and commonly used in practice (Settles, 2009). Uncertainty sampling first uses a static data collection method to select the seed set B 1 . For the next k − 1 rounds, uncertainty sampling trains a model on all collected data and chooses B i to be the n i unlabeled points in D train all on which the model is most uncertain. For binary classification, the most uncertain points are the points where p θ (y = 1 | x) is closest to 1 2 . Note that a brute force approach to finding B i requires evaluating p θ on every example in D train all , which can be prohibitively expensive. In balanced settings, it suffices to choose the most uncertain point from a small random subset of D train all (Ertekin et al., 2007); however, this strategy works poorly in extremely imbalanced settings, as a small random subset of D train all is unlikely to contain any uncertain points. In Section 4.2, we address this computational challenge with a bespoke model architecture.
Adaptive retrieval. We also use a related algorithm we call adaptive retrieval, which is like uncertainty sampling but queries the n i unlabeled points in D train all with highest p θ (y = 1 | x) (i.e., pairs the model is most confident are positive). Adaptive retrieval can be seen as greedily maximizing the number of positive examples collected.

Modeling and implementation
We now fully specify our approach by describing our model, how we find pairs in the unlabeled pool D train all to query, and how we choose the seed set B 1 . In particular, a key technical challenge is that the set of training pairs D train all is too large to enumerate, as it grows quadratically. We therefore require an efficient way to locate the uncertain points in D train all . We solve this problem with a model architecture COSINEBERT that enables efficient nearest neighbor search (Gillick et al., 2019).

Model
Given input x = (x 1 , x 2 ), COSINEBERT embeds x 1 and x 2 independently and predicts p θ (y = 1 | x) based on vector-space similarity. More precisely, where σ is the sigmoid function, w > 0 and b are learnable parameters, and e θ : X 1 ∪ X 2 → R d is a learnable embedding function. In other words, we compute the cosine similarity of the embeddings of x 1 and x 2 , and predict y using a logistic regression model with cosine similarity as its only feature.
We define e θ as the final layer output of a BERT model ( 5 Although WikiQA involves an asymmetric relationship between questions and sentences, we use the same encoder for both. This is still expressive enough for WikiQA, since the set 4.2.2 Finding points to query Next, we show how to choose the batch B i of points to query, given a model p θ (y | x) trained on data from batches B 1 , . . . , B i−1 . Recall that uncertainty sampling chooses the points x for which for which p θ (y = 1 | x) is closest to 1 2 , and adaptive retrieval chooses the points x with largest p θ (y = 1 | x). Since the set of positives is very small compared to the size of D train all , the set of uncertain points can be found by finding points with the largest p θ (y = 1 | x), thus filtering out the confident negatives, and then selecting the most uncertain from those.
To find points with largest p θ (y = 1 | x), we leverage the structure of our model. Since w > 0, p θ (y = 1 | x) is increasing in the cosine similarity of e θ (x 1 ) and e θ (x 2 ). Therefore, it suffices to find pairs (x 1 , x 2 ) that are nearest neighbors in the embedding space defined by e θ . In particular, for each x 1 ∈ X 1 , we use the Faiss library (Johnson et al., 2017) to retrieve a set N (x 1 ) containing the m nearest neighbors in X 2 , and define D train close to be the set of all pairs (x 1 , x 2 ) such that x 2 ∈ N (x 1 ). We then iterate through D train close to find either the most uncertain points (for uncertainty sampling) or points with highest cosine similarity (for adaptive retrieval). Note that this method only requires embedding the number of distinct elements that appear in the training set, rather than the total number of pairs, the requirement for jointly embedding all pairs.

Choosing the seed set
Recall that both of our active learning techniques require a somewhat representative initial seed set B 1 to start the process. We use the pre-trained BERT model as the embedding e θ and select the n 1 pairs with largest p θ (y = 1 | x). Recall that w > 0, so this amounts to choosing the pairs with highest cosine similarity.

Experimental details
We simulate active learning with the imputed labels so that we can compare different algorithms without performing expensive gold label collection for each algorithm. We collect n 1 = 2048 examples in the seed set, and use k = 10 rounds of of questions and set of sentences are disjoint. For asymmetric tasks like NLI where X1 = X2, we would need to use separate encoders for the X1 and X2.   active learning for QQP and k = 4 for WikiQA, as WikiQA is much smaller. At round i, we query n i = n 1 · (3/2) i−1 new labels. The exponentially growing n i helps us avoid wasting queries in early rounds, when the model is worse, and also makes training faster in the early rounds. These choices imply a total labeling budget n of 232,100 for QQP and 16,640 for WikiQA. For both datasets, n is slightly less than |D train heur | (257,421 for QQP and 20,360 for WikiQA), thus ensuring a meaningful comparison with training on heuristic data. We retrieve m = 1000 nearest neighbors per x 1 ∈ X 1 for QQP and m = 100 for WikiQA. We run all active learning experiments with three different random seeds and report the mean. Training details are given in Appendix A.

Main results
We now compare the two active learning methods, adaptive retrieval and uncertainty sampling, with training on D train heur and two other baselines. Random sampling queries n pairs uniformly at random, which creates a very imbalanced dataset. Static retrieval queries the n most similar pairs using the pre-trained BERT embedding, similar to Section 4.2.3. Table 3 shows all-pairs evaluation for COSINEBERT trained on these datasets. The two active learning methods greatly outperform other methods: Uncertainty sampling gets 32.5% AP on QQP and 20.1% on WikiQA, while the best static data collection method, static retrieval, gets only 25.1% AP on QQP and 8.2% AP on WikiQA. Recall from Table 2

Manual verification of imputed negatives
As in Section 3.3, we manually labeled putative QQP false positives at the threshold where recall is 20% for COSINEBERT trained on either stated data or uncertainty sampling data. For each, we labeled 50 putative false positives from D dev near , and all putative false positives from D dev rand (12 for stated data, 0 for uncertainty sampling).
COSINEBERT trained on D train heur . 67% (8 of 12) of the putative false positives on D dev rand were actual errors by the model, but only 36% of putative false positives on D dev near were errors. Extrapolating from these results, we update our estimate of development set precision at 20% recall from 28.4% to 41.4%.
Overall, this model makes some more reasonable mistakes than the CONCATBERT model, though its precision is still not that high.
COSINEBERT model with uncertainty sampling. Only 32% of putative false positives from D dev near were real errors, significantly less than the 72% for CONCATBERT trained on D train heur (p = 7 × 10 −5 , Mann-Whitney U test). Extrapolating from these results, we update our estimate of development set precision at 20% recall from 55.1% to 79.3%, showing that uncertainty sampling yields a more precise model than our imputed labels indicate. In fact, this model provides a high-precision way to identify paraphrase pairs not annotated in the original dataset.

Comparison with stratified sampling
Next, we further confirm that having all the positive examples is not sufficient for high precision.
In Table 5, we compare with two variants of stratified sampling, in which positive and negative examples are independently subsampled at a desired ratio (Attenberg and Provost, 2010). First, we randomly sample positive and negative training examples to match the number of positives and negatives collected by uncertainty sampling, the best active learning method for both datasets ("Strat. match" in Table 5). Second, we trained on all positive examples and added negatives to match the number of positives on QQP or match the active learning total budget on WikiQA ("Strat. all pos."). 6 For QQP, this yielded a slightly larger dataset than the first setting. Note that stratified sampling requires oracle information: It assumes the ability to sample uniformly from all positives, even though this set is not known before data collection begins. Nonetheless, stratified sampling trails uncertainty sampling by more than 10 AP points on both datasets. Since stratified sampling has access to all positives, active learning must be choosing more informative negative examples.

Training other models on collected data
For QQP, data collected with active learning and COSINEBERT is useful for training other models on the same task. Table 6 shows that CON-CATBERT does better on data collected by active learning-using COSINEBERT-compared to the original dataset or static retrieval. CON-CATBERT performs best with stratified sampling; recall that this is not a comparable data collection strategy in our setting, as it requires oracle knowledge. COSINEBERT outperforms CONCATBERT in all training conditions; we hypothesize that the cosine similarity structure helps it generalize more  Table 6: Comparison on QQP of COSINEBERT with CONCATBERT. Data collected by active learning (using COSINEBERT) is more useful for training CON-CATBERT than stated data or static retrieval data. Stratified sampling here matches the label balance of the uncertainty sampling data. robustly to pairs of unrelated questions. However, COSINEBERT trained on stated data does not do as well on WikiQA, as shown in Table 3.

Data efficiency
Adaptivity is crucial for getting high AP with less labeled data. Static retrieval, the best static data collection method, gets 21.9% dev AP on QQP with the full budget of 232,100 examples. Uncertainty sampling achieves a higher dev AP of 22.6% after collecting only 16,640 examples, for a 14× data efficiency improvement. See Appendix B.1 for further analysis.

Effect of seed set
Our method is robust to choice of the initial seed set for uncertainty sampling. We consider using stated data as the seed set, instead of data chosen via static retrieval. As shown in Figure 3, seeding with stated data performs about as well as static retrieval in terms of AP. Since the stated data artificially overrepresents positive examples, the model trained on stated data is initially miscalibrated-the points it is uncertain about are actually almost all negative points. Therefore, uncertainty sampling initially collects very few additional positive examples. Over time, adaptively querying new data helps correct for this bias.
6 Discussion and related work In this paper, we have studied how to collect training data that enables generalization to extremely imbalanced test data in pairwise tasks. State-ofthe-art models trained on standard, heuristically collected datasets have very low average precision when evaluated on imbalanced test data, while active learning leads to much better average precision. Gillick et al. (2019) propose a similar model for entity linking and mine hard negative examples, an approach related to adaptive retrieval. However, they have abundant labeled data, whereas we study data collection with a limited labeling budget.
Work in information retrieval often attempts to maximize precision across all pairs of test objects. Machine learning models are commonly used to re-rank candidate pairs from an upstream retriever (Chen et al., 2017;Nogueira and Cho, 2019), while our method learns embeddings to improve the initial retrieval step. Distant supervision has been used to train end-to-end retrieval models for question answering , but does not extend to other tasks like paraphrase detection. Other work on duplicate question detection on community QA forums trains on labels generated by forum users (dos Santos et al., 2015). Hoogeveen et al. (2016) show that these datasets tend to have many false negatives and suggests additional labeling to correct this problem; active learning provides one way to choose informative pairs to label.
Extreme label imbalance is an important challenge in many non-pairwise NLP tasks, including document classification (Lewis et al., 2004) and relation extraction (Zhang et al., 2017). Most prior work focuses on sampling a fixed training dataset (Chawla et al., 2004;Sun et al., 2009;Dendamrongvit and Kubat, 2009), whereas our work explores data collection. Attenberg and Provost (2010) find stratified sampling outperforms active learning in non-pairwise imbalanced tasks, primarily due to the difficulty of finding a useful seed set. We find pre-trained embeddings effective for seed set collection in pairwise tasks. Zhang et al. (2019) found that the frequency of questions in QQP leaks information about the label. Evaluating on all pairs avoids such artifacts, as every test utterance appears in the same number of examples. Zhang et al. (2019) re-weight the original dataset to avoid these biases, but re-weighting cannot compensate for the absence of some types of negative examples, unlike active learning.
Many pairwise datasets are generated by asking crowdworkers to generate part or all of the input x (Bowman et al., 2015;Mostafazadeh et al., 2016). Having crowdworkers generate text increases the risk of introducing artifacts (Schwartz et al., 2017;Poliak et al., 2018), while our pool-based approach considers the entire distribution of utterance pairs.
We use active learning, specifically uncertainty sampling (Lewis and Gale, 1994), to create a balanced training set that leads to models that generalize to the full imbalanced distribution. Ertekin et al. (2007) argues that active learning is capable of providing balanced classes to the learning algorithm by selecting examples close to the decision boundary. Furthermore, active learning can generalize to the full distribution, both empirically (Settles, 2009;Yang and Loog, 2018) and theoretically (Balcan et al., 2007;Balcan and Long, 2013;Mussmann and Liang, 2018).
Finally, this paper addresses two central concerns in NLP today: How to construct fair but challenging tests of generalization (Geiger et al., 2019), and how to collect training data in a way that improves generalization. Evaluating on extremely imbalanced all-pairs data has several advantages over other tests of generalization. Our examples are realistic and natural, unlike adversarial perturbations (Ebrahimi et al., 2018;Alzantot et al., 2018), and diverse, unlike hand-crafted tests of specific phenomena (Glockner et al., 2018;Naik et al., 2018;McCoy et al., 2019). Since we allow querying the label of any training example, generalization to our test data is achievable, while out-of-domain generalization (Levy et al., 2017;Yogatama et al., 2019;Talmor and Berant, 2019) may be statistically impossible. Our work thus offers a natural, challenging, and practically relevant testbed to study both generalization and data collection.
Reproducibility. Code and data needed to reproduce all results can be found on the CodaLab platform at https://bit.ly/2GzJAgM. Y. Zhang, V. Zhong, D. Chen, G. Angeli, and C. D. Manning. 2017. Position-aware attention and supervised data improve slot filling. In Empirical Methods in Natural Language Processing (EMNLP).

A Experimental details
A.1 Training details At each round of active learning, we train for 2 epochs. We train without dropout, as dropout artificially lowers cosine similarities at training time.
We apply batch normalization (Ioffe and Szegedy, 2015) to the cosine similarity layer to rescale the cosine similarities, as they often are very close to 1. We initialize w and b so that high cosine similarities correspond to the positive label, and constrain w to be nonnegative during training. We use a maximum sequence length of 128 word piece tokens.
To compensate for BERT's low learning rate, we increased the learning rate on the w and b parameters by a factor of 10 4 . Below in Table 7, we show hyperparameters for training. Hyperparameters were tuned on the development set of QQP; we found these same hyperparameters also worked well for WikiQA, and so we did not tune them separately for WikiQA. In most cases, we used the default hyperparameters for BERT.

Hyperparameter Value
Learning rate 2 × 10 −5 Training epochs 2 Weight decay 0 Optimizer AdamW AdamW Epsilon 1 × 10 −6 Batch size 16 At the end of training, we freeze the embeddings e θ and train the output layer parameters w and b to convergence, to improve uncertainty estimates for uncertainty sampling. This process amounts to training a two-parameter logistic regression model. We optimize this using (batch) gradient descent with learning rate 1 and 10, 000 iterations. When training this model, we normalize the cosine similarity feature to have zero mean and unit variance across the training dataset. Training this was very fast compared to running the embedding model.
Each experiment was conducted with a single GPU, most commonly a TITAN V or TITAN Xp.
Running one complete uncertainty sampling experiment (i.e., 10 rounds of data collection and model training for QQP, 4 for WikiQA) on a machine with one TITAN V GPU takes about 9 hours for QQP and about 30 minutes for WikiQA. Recall that COSINEBERT only adds two additional parameters, w and b, on top of a BERT model; we use the uncased BERT-base pre-trained model which has 110M parameters.

A.2 Evaluation details
To evaluate a given scoring function S at threshold γ on a test set D test all , we must compute the number of true positives TP(S, γ), false positives FP(S, γ), and false negatives FN(S, γ). True positives and false negatives are computationally easy to compute, as they only require evaluating S(x) on all the positive inputs x in D test all . However, without any structural assumptions on S, it is computationally infeasible to exactly compute the number of false positives, as that would require evaluating S on every negative example in D test all , which is too large to enumerate. Therefore, we devise an approach to compute an unbiased, low-variance estimate of FP(S, γ). Recall that this term is defined as where D test neg denotes the set of all negative examples in D test all . One approach to estimating FP(S, γ) would be simply to randomly subsample D test neg to some smaller set R, count the number of false positives in R, and then multiply the count by |D test neg |/|R|. This would be an unbiased estimate of D test neg , but has high variance when the rate of false positive errors is low. For example, if |D test neg | = 10 10 , |R| = 10 6 , and the model makes a false positive error on 1 in 10 6 examples in D test neg , then FP(S, γ) = 10 4 . However, with probability 1 − 1 10 6 10 6 ≈ 1/e ≈ 0.368, R will contain no false positives, so we will estimate FP(S, γ) as 0. A similar calculation shows that the probability of having exactly one false positive in R is also roughly 1/e, which means that with probability roughly 1 − 2/e ≈ 0.264, we will have at least two false positives in R, and therefore overestimate FP(S, γ) by at least a factor of two.
To get a lower variance estimateFP(S, γ), we preferentially sample from likely false positives and use importance weighting to get an unbiased estimate of FP(S, γ). In particular, we construct D test near to be the pairs in D test all with nearby pretrained BERT embeddings, analogously to how we create the seed set in Section 4.2.3. Points with nearby BERT embeddings are likely to look similar are therefore more likely to be false positives. Note that FP(S, γ) = where we define w rand = |D test neg | − |D test near |. We can compute the first term exactly, since D test near is small enough to enumerate, and approximate the second term as where D test rand is a uniformly random subset of D test neg \ D test near . Therefore, our final estimate iŝ FP(S, γ) =

A.3 Incorporating manual labels
In Section 3.3, we manually label examples that were automatically labeled as false positives, and use this to improve our estimates of the true model precision. We manually label randomly chosen putative false positives from both D dev near and D dev rand , and use this to estimate the proportion of putative false positives in each set that are real false positives. Letp near denote the estimated fraction of putative false positives in D dev near that are real false positives, andp rand be the analogous quantity for We then compute precision usingFP manual (S, γ) in place ofFP(S, γ).

A.4 Comparison with GLUE QQP data
For a few reasons, our QQP in-domain accuracy numbers are lower than those on the GLUE leaderboard, which has accuracies in the low 90's. First, our training set is smaller (257K examples versus 364K). Second, our split is more challenging because the model does not see the same questions or even same paraphrase clusters at training time and test time. Finally, our test set is more balanced (58% negative) than the GLUE QQP dev set (63% negative; test set balance is unknown). As a sanity check, we confirmed that our RoBERTa implementation can achieve 91.5% dev accuracy when trained and tested on the GLUE train/dev split, in line with previously reported results (Liu et al., 2019).  Figure 4: Uncertainty sampling compared with matching amounts of static retrieval data on QQP. (a) Average precision is higher for uncertainty sampling. (b) Percent of all collected data that is positive. Adaptivity helps uncertainty sampling collect more positives.
In Figure 4a, we plot average precision on the QQP dev set for our model after each round of uncertainty sampling. For comparison, we show a model trained on the same amount of data collected via static retrieval, the best-performing static data collection method. Uncertainty sampling leads to higher AP with much less data. For example, uncertainty sampling only needs to collect 16,640 examples to surpass the average precision of static retrieval collecting all 232,100 examples, for a 14× data efficiency improvement. A big factor for the success of uncertainty sampling is its ability to collect many more positive examples than static retrieval, as shown in Figure 4b. Static retrieval collects fewer positives over time, as it exhausts the set of positives that are easy to identify. However, uncertainty sampling collects many more positives, especially after the first round of training, because it improves its embeddings over time.