Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks

There are two approaches for pairwise sentence scoring: Cross-encoders, which perform full-attention over the input pair, and Bi-encoders, which map each input independently to a dense vector space. While cross-encoders often achieve higher performance, they are too slow for many practical use cases. Bi-encoders, on the other hand, require substantial training data and fine-tuning over the target task to achieve competitive performance. We present a simple yet efficient data augmentation strategy called Augmented SBERT, where we use the cross-encoder to label a larger set of input pairs to augment the training data for the bi-encoder. We show that, in this process, selecting the sentence pairs is non-trivial and crucial for the success of the method. We evaluate our approach on multiple tasks (in-domain) as well as on a domain adaptation task. Augmented SBERT achieves an improvement of up to 6 points for in-domain and of up to 37 points for domain adaptation tasks compared to the original bi-encoder performance.


Introduction
Pairwise sentence scoring tasks have wide applications in NLP. They can be used in information retrieval, question answering, duplicate question detection, or clustering. An approach that sets new state-of-the-art performance for many tasks including pairwise sentence scoring is BERT (Devlin et al., 2018). Both sentences are passed to the network and attention is applied across all tokens of the inputs. This approach, where both sentences are simultaneously passed to the network, is called cross-encoder (Humeau et al., 2020).
A downside of cross-encoders is the extreme computational overhead for many tasks. For example, clustering 10,000 sentences has quadratic complexity with a cross-encoder and would require about 65 hours with BERT (Reimers and Gurevych, 2019). End-to-end information retrieval is also not possible with cross-encoders, as they do not yield independent representations for the inputs that could be indexed. In contrast, bi-encoders such as Sentence-BERT (SBERT) (Reimers and Gurevych, 2019) encode each sentence independently and map it to a dense vector space, which allows efficient indexing and comparison. For example, the complexity of clustering 10,000 sentences is reduced from 65 hours to about 5 seconds (Reimers and Gurevych, 2019). Many real-world applications hence depend on the quality of bi-encoders. A drawback of the SBERT bi-encoder is usually a lower performance in comparison with the BERT cross-encoder. We depict this in Figure 1, where we compare a fine-tuned cross-encoder (BERT) and a fine-tuned bi-encoder (SBERT) on the popular English STS Benchmark dataset (Cer et al., 2017) for different training sizes, reporting the Spearman rank correlation (ρ) on the test split. (Code available: www.sbert.net)
This performance gap is largest when little training data is available. The BERT cross-encoder can compare both inputs simultaneously, while the SBERT bi-encoder has to solve the much more challenging task of mapping inputs independently to a meaningful vector space, which requires a sufficient amount of training examples for fine-tuning.
In this work, we present a data augmentation method, which we call Augmented SBERT (AugSBERT), that uses a BERT cross-encoder to improve the performance of the SBERT bi-encoder. We use the cross-encoder to label new input pairs, which are added to the training set for the bi-encoder. The SBERT bi-encoder is then fine-tuned on this larger, augmented training set, which yields a significant performance increase. As we show, selecting the input pairs for soft-labeling with the cross-encoder is non-trivial and crucial for improving performance. Our method is easy to apply to many pair classification and regression problems, as we show in the exhaustive evaluation of our approach.
First, we evaluate the proposed AugSBERT method on four diverse tasks: argument similarity, semantic textual similarity, duplicate question detection, and news paraphrase identification. We observe consistent performance increases of 1 to 6 percentage points over the state-of-the-art SBERT bi-encoder's performance. Next, we demonstrate the strength of AugSBERT in a domain adaptation scenario. Since the bi-encoder is not able to map the new domain to a sensible vector space, the performance drop on the target domain for SBERT bi-encoders is much higher than for BERT cross-encoders. In this scenario, AugSBERT achieves a performance increase of up to 37 percentage points.

Related Work
Sentence embeddings are a well-studied area in recent literature. Earlier techniques included unsupervised methods such as Skip-thought vectors (Kiros et al., 2015) and supervised methods such as InferSent (Conneau et al., 2017) or USE (Cer et al., 2018). For pairwise scoring tasks, more recent sentence embedding techniques are also able to encode a pair of sentences jointly. Among these, BERT (Devlin et al., 2018) can be used as a cross-encoder: both inputs are separated by a special SEP token, and multi-head attention is applied over all input tokens. While the BERT cross-encoder achieves high performance for many sentence-pair tasks, a drawback is that no independent sentence representations are generated. This drawback was addressed by SBERT (Reimers and Gurevych, 2019), which applies BERT independently to each input, followed by mean pooling on the output to create fixed-sized sentence embeddings.
Humeau et al. (2020) showed that cross-encoders typically outperform bi-encoders on sentence scoring tasks. They proposed a third strategy (poly-encoders) that lies in-between cross- and bi-encoders. Poly-encoders utilize two separate transformers, one for the candidate and one for the context. A given candidate is represented by one vector, while the context is jointly encoded with the candidates (similar to cross-encoders). Unlike the full self-attention of cross-encoders, poly-encoders apply attention between the two inputs only at the top layer. Poly-encoders have the drawback that they are only practical for certain applications: the score function is not symmetric, i.e., they cannot be applied to tasks with a symmetric similarity relation. Further, poly-encoder representations cannot be efficiently indexed, causing issues for retrieval tasks with large corpora.
Chen et al. (2020) propose the DiPair architecture which, similar to our work, also uses a cross-encoder model to annotate unlabeled pairs for fine-tuning a bi-encoder model.
DiPair focuses on inference speed and provides a detailed ablation of optimal bi-encoder architectures for performance versus speed trade-offs. The focus of our work is on sampling techniques, which we find crucial for performance gains in the bi-encoder model while keeping its architecture constant.
Our proposed data augmentation approach is based on semi-supervision (Blum and Mitchell, 1998) for in-domain tasks, which has been applied successfully to a wide range of tasks. Uva et al. train an SVM model with few gold samples and apply semi-supervision for pre-training neural networks. Another common strategy is to generate paraphrases of existing sentences, for example, by replacing words with synonyms (Wei and Zou, 2019), by using round-trip translation (Yu et al., 2018; Xie et al., 2020), or with seq2seq models (Kumar et al., 2019). Other approaches generate synthetic data by using generative adversarial networks (Tanaka and Aranha, 2019), by using a language model to replace certain words (Wu et al., 2019), or to generate complete sentences (Anaby-Tavor et al., 2019). These data augmentation approaches have in common that they were applied to single-sentence classification tasks. In our work, we focus on sentence-pair tasks, for which we need to generate suitable sentence pairs. As we show, randomly combining sentences is insufficient; sampling appropriate pairs has a decisive impact on performance, which corresponds to recent findings on similar datasets (Peinelt et al., 2019).

Methods
In this section we present Augmented SBERT for diverse sentence pair in-domain tasks. We also evaluate our method for domain adaptation tasks.

Augmented SBERT
Given a pre-trained, well-performing cross-encoder, we sample sentence pairs according to a certain sampling strategy (discussed below) and label them using the cross-encoder. We call these weakly labeled examples the silver dataset; they are merged with the gold training dataset. We then train the bi-encoder on this extended training dataset. We refer to this model as Augmented SBERT (AugSBERT). The process is illustrated in Figure 2.

Pair Sampling Strategies: The novel sentence pairs that are to be labeled with the cross-encoder can either be new data, or we can re-use individual sentences from the gold training set and recombine them into new pairs. In our in-domain experiments, we re-use the sentences from the gold training set. This is of course only possible if not all combinations have been annotated; however, this is seldom the case, as there are n × (n − 1)/2 possible combinations for n sentences. Weakly labeling all possible combinations would create an extreme computational overhead and, as our experiments show, would likely not lead to a performance improvement. Instead, using the right sampling strategy is crucial to achieve a performance improvement.
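The overall recombine-and-weakly-label procedure described above can be sketched as follows. The function and parameter names (`build_augmented_training_set`, `cross_encoder_score`, `n_silver`) are our own; in practice, the scoring callable would wrap a fine-tuned cross-encoder, e.g. `CrossEncoder.predict` from the sentence-transformers library.

```python
import random

def build_augmented_training_set(gold_pairs, sentences, cross_encoder_score,
                                 n_silver=1000, seed=42):
    """Recombine sentences from the gold set into new pairs, weakly label
    them with the cross-encoder (-> silver data), and merge with gold.

    gold_pairs: list of ((sent_a, sent_b), label) human-annotated pairs.
    sentences: pool of individual sentences to recombine.
    cross_encoder_score: callable (sent_a, sent_b) -> float.
    """
    rng = random.Random(seed)
    seen = {frozenset(p) for p, _ in gold_pairs}
    # never try to sample more pairs than actually exist
    max_new = len(sentences) * (len(sentences) - 1) // 2 - len(seen)
    silver = []
    while len(silver) < min(n_silver, max_new):
        a, b = rng.sample(sentences, 2)
        key = frozenset((a, b))
        if key in seen:
            continue  # pair already annotated or already sampled
        seen.add(key)
        silver.append(((a, b), cross_encoder_score(a, b)))
    return gold_pairs + silver
```

The bi-encoder is then fine-tuned on the returned gold-plus-silver set exactly as it would be on gold data alone.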
Random Sampling (RS): We randomly sample a sentence pair and weakly label it with the crossencoder. Randomly selecting two sentences usually leads to a dissimilar (negative) pair; positive pairs are extremely rare. This skews the label distribution of the silver dataset heavily towards negative pairs.
Kernel Density Estimation (KDE): We aim to get a similar label distribution for the silver dataset as for the gold training set. To do so, we weakly label a large set of randomly sampled pairs and then keep only certain pairs. For classification tasks, we keep all positive pairs and randomly sample negative pairs from the remaining, dominant negative silver pairs, in a ratio identical to the positive/negative distribution of the gold training set. For regression tasks, we use kernel density estimation (KDE) to estimate the continuous density functions F_gold(s) and F_silver(s) over scores s. We minimize the KL divergence (Kullback and Leibler, 1951) between the two distributions with a sampling function that retains a sample with score s with probability Q(s):

Q(s) = 1                          if F_gold(s) ≥ F_silver(s)
Q(s) = F_gold(s) / F_silver(s)    otherwise

Note that the KDE sampling strategy is computationally inefficient, as it requires labeling many randomly drawn samples that are later discarded.
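A minimal sketch of this retention scheme for the regression case. We hand-roll a 1-D Gaussian KDE with a fixed bandwidth (both simplifications; `kde_subsample` and the bandwidth value are our own choices, not from the paper):

```python
import numpy as np

def gaussian_kde_1d(data, bandwidth=0.05):
    """Return a 1-D Gaussian kernel density estimate with fixed bandwidth."""
    data = np.asarray(data, dtype=float)
    def density(x):
        x = np.atleast_1d(np.asarray(x, dtype=float))[:, None]
        z = (x - data) / bandwidth
        return np.exp(-0.5 * z ** 2).mean(axis=1) / (bandwidth * np.sqrt(2 * np.pi))
    return density

def kde_subsample(gold_scores, silver_pairs, seed=0):
    """Keep each weakly labeled pair with probability
    Q(s) = min(1, F_gold(s) / F_silver(s)) so that the retained silver
    score distribution approaches the gold one."""
    rng = np.random.default_rng(seed)
    scores = np.array([s for _, s in silver_pairs])
    f_gold = gaussian_kde_1d(gold_scores)
    f_silver = gaussian_kde_1d(scores)
    q = np.minimum(1.0, f_gold(scores) / f_silver(scores))
    keep = rng.random(len(scores)) < q
    return [pair for pair, k in zip(silver_pairs, keep) if k]
```

Because randomly sampled pairs are mostly dissimilar, most low-score pairs receive Q(s) close to 0 and are discarded, while the rare high-score pairs are retained.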
BM25 Sampling (BM25): In information retrieval, the Okapi BM25 (Amati, 2009) algorithm is based on lexical overlap and is commonly used as a scoring function by many search engines. We utilize Elasticsearch for the creation of indices, which enables fast retrieval of search-query results. For our experiments, we index every unique sentence, query with each sentence, and retrieve the top k most similar sentences. These pairs are then weakly labeled using the cross-encoder. Indexing and retrieving similar sentences is efficient, and all weakly labeled pairs are used in the silver dataset.
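A toy in-memory version of this sampling step (a production setup would use Elasticsearch, as above; the Okapi parameters k1 and b are set to common defaults, and `bm25_top_k` is our own name):

```python
import math
from collections import Counter

def bm25_top_k(sentences, k=3, k1=1.5, b=0.75):
    """For every sentence, return (i, j) index pairs for its top-k most
    lexically similar other sentences, scored with Okapi BM25."""
    docs = [s.lower().split() for s in sentences]
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))      # document frequencies
    idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}
    tfs = [Counter(d) for d in docs]

    def score(query_tokens, j):
        s, dl = 0.0, len(docs[j])
        for t in query_tokens:
            f = tfs[j].get(t, 0)
            if f:
                s += idf[t] * f * (k1 + 1) / (f + k1 * (1 - b + b * dl / avgdl))
        return s

    pairs = []
    for i, q in enumerate(docs):
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: score(q, j), reverse=True)
        pairs.extend((i, j) for j in ranked[:k])
    return pairs
```

The returned index pairs would then be passed to the cross-encoder for weak labeling.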
Semantic Search Sampling (SS): A drawback of BM25 is that only sentences with lexical overlap can be found. Synonymous sentences with no or little lexical overlap will not be returned and hence will not be part of the silver dataset. We therefore train a bi-encoder (SBERT) on the gold training set as described in section 5 and use it to sample further similar sentence pairs. We use cosine similarity and retrieve for every sentence the top k most similar sentences in our collection. For large collections, approximate nearest-neighbor search such as Faiss could be used to quickly retrieve the k most similar sentences.
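For collections that fit in memory, this retrieval step is a plain exact nearest-neighbor search over the bi-encoder embeddings (the function name is ours; for large collections, Faiss would replace the full similarity matrix):

```python
import numpy as np

def semantic_search_top_k(embeddings, k=3):
    """embeddings: (n, d) array of sentence vectors, e.g. produced by a
    bi-encoder trained on the gold data. Returns (i, j) index pairs for
    each sentence's k nearest neighbours by cosine similarity."""
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = x @ x.T                    # pairwise cosine similarities
    np.fill_diagonal(sim, -np.inf)   # exclude self-matches
    top = np.argsort(-sim, axis=1)[:, :k]
    return [(i, int(j)) for i in range(len(x)) for j in top[i]]
```

As with BM25, the resulting pairs are weakly labeled with the cross-encoder before being added to the silver dataset.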
BM25 + Semantic Search Sampling (BM25-S.S.): We apply both the BM25 and Semantic Search (S.S.) sampling techniques simultaneously. Aggregating the two strategies helps capture both lexically and semantically similar sentences, but skews the label distribution towards negative pairs.

Seed Optimization: Dodge et al. (2020) show a high dependence on the random seed for transformer-based models like BERT, as training converges to different minima that generalize differently to unseen data (LeCun et al., 1998; Erhan et al., 2010; Reimers and Gurevych, 2017). This is especially the case for small training datasets. In our experiments, we therefore apply seed optimization: we train with 5 random seeds and select the model that performs best on the development set. To speed this up, we apply early stopping at 20% of the training steps and only continue training the best-performing model until the end. We empirically found that we can predict the final score with high confidence at 20% of the training steps (Appendix D).
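The seed optimization loop can be sketched as below. Both `train_fn` and `dev_score_fn` are placeholders for the actual fine-tuning and development-set evaluation routines, and for simplicity this sketch retrains the winner from scratch rather than resuming from the early checkpoint:

```python
def seed_optimization(train_fn, dev_score_fn, seeds=(0, 1, 2, 3, 4),
                      total_steps=1000, early_frac=0.2):
    """Train one candidate per seed for early_frac of the steps, keep the
    one that scores best on the development set, and train it to the end.

    train_fn(seed, steps) -> model; dev_score_fn(model) -> float.
    """
    early_steps = int(total_steps * early_frac)
    candidates = [(seed, train_fn(seed, early_steps)) for seed in seeds]
    best_seed, _ = max(candidates, key=lambda sc: dev_score_fn(sc[1]))
    return best_seed, train_fn(best_seed, total_steps)
```

With 5 seeds and early stopping at 20% of the steps, the total training cost is roughly 2x a single run instead of 5x.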

Domain Adaptation with AugSBERT
Until now, we discussed Augmented SBERT for in-domain setups, i.e., when the training and test data are from the same domain. However, we expect an even larger performance gap for SBERT on out-of-domain data, since SBERT fails to map sentences with unseen terminology to a sensible vector space. Unfortunately, annotated data for new domains is rarely available.
Hence, we evaluate the proposed data augmentation strategy for domain adaptation: we first fine-tune a cross-encoder (BERT) on the source domain, which contains pairwise annotations. After fine-tuning, we use this cross-encoder to label the target domain. Once labeling is complete, we train the bi-encoder (SBERT) on the labeled target-domain sentence pairs (Figure 3).
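Stripped to its essentials, the domain adaptation recipe is a two-step pipeline. In this sketch, all three arguments are placeholders for the actual models and training routines:

```python
def domain_adapt(source_cross_encoder_score, target_sentence_pairs,
                 train_bi_encoder):
    """Label unannotated target-domain pairs with the cross-encoder that
    was fine-tuned on the source domain, then train the bi-encoder on the
    resulting weakly labeled target-domain data."""
    labeled = [((a, b), source_cross_encoder_score(a, b))
               for a, b in target_sentence_pairs]
    return train_bi_encoder(labeled)
```

No gold target-domain labels are needed at any point; the cross-encoder's robustness to domain shift is what the bi-encoder inherits.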

Datasets
Sentence pair scoring can be differentiated in regression and classification tasks. Regression tasks assign a score to indicate the similarity between the inputs. For classification tasks, we have distinct labels, for example, paraphrase vs. non-paraphrase.

Single-Domain Datasets
In our single-domain (i.e. in-domain) experiments, we use two sentence pair regression tasks: semantic textual similarity and argument similarity. Furthermore, we use two binary sentence pair classification tasks: Duplicate question detection and news paraphrase identification. Examples for all datasets are given in Table 2.
SemEval Spanish STS: Semantic Textual Similarity (STS) is the task of assessing the degree of similarity between two sentences on a scale from 0 (no semantic overlap) to 5 (identical content) (Agirre et al., 2016). We choose Spanish STS data to test our methods on a language other than English. For our training and development datasets, we use the data provided by SemEval STS 2014 (Agirre et al., 2014) and SemEval STS 2015 (Agirre et al., 2015), which consist of annotated sentence pairs from news articles and from Wikipedia. As test set, we use SemEval STS 2017 (Cer et al., 2017), which annotated image-caption pairs from SNLI (Bowman et al., 2015). For all our experiments, we normalise the original similarity scores to [0, 1] by dividing by 5.

BWS Argument Similarity Dataset (BWS):
Existing similarity datasets have the disadvantage that the sentence-pair selection/sampling process is not always comprehensible. To overcome this limitation, we create and publicly release a novel dataset for argument similarity.
Previous work addressing argument similarity (Misra et al., 2016;Reimers et al., 2019) used discrete scales. However, expressing an inherently continuous property in this way is counter-intuitive and potentially unreliable due to different assumptions made when binning a range of values into a discrete class (Kingsley and Brown, 2010).
Collecting continuous annotations is complex due to selection bias and a lack of consistency for a single annotator (Kendall, 1948). To address the consistency problem, we apply a comparative approach, which converts the annotation into a preference problem: annotators state their preference on pairs of sentential arguments. We utilized the Best-Worst Scaling (BWS) method (Kiritchenko and Mohammad, 2016) to reduce the number of required annotations. For each topic, regardless of stance, all arguments were randomly paired. To ensure a certain proportion of similar arguments within the pairings, we implemented a distant-supervision filtering strategy, labeling pairs with scores between 0 and 1 using the system proposed by Misra et al. (2016). Next, all argument pairs were sampled to a desired similarity distribution by creating argument-pair bins across three categories: top 1%, top 2-50%, and the remaining pairs. As the final step, we randomly drew pairs from the top 1% with 50% probability, and with 25% probability each from the two other bins.
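The binned draw described above can be sketched as follows (the function name and tie-handling are our own; the distant-supervision scores come from the Misra et al. system):

```python
import random

def sample_argument_pairs(scored_pairs, n, seed=0):
    """Draw n pairs: with probability 0.5 from the top 1% by
    distant-supervision score, and with probability 0.25 each from the
    top 2-50% bin and the remaining pairs."""
    rng = random.Random(seed)
    ranked = sorted(scored_pairs, key=lambda ps: ps[1], reverse=True)
    cut1 = max(1, len(ranked) // 100)   # top 1%
    cut2 = len(ranked) // 2             # top 50%
    bins = [ranked[:cut1], ranked[cut1:cut2], ranked[cut2:]]
    weights = [0.5, 0.25, 0.25]
    return [rng.choice(rng.choices(bins, weights=weights)[0])
            for _ in range(n)]
```

This over-samples likely-similar pairs, so the annotated dataset is not dominated by trivially dissimilar argument pairs.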
The resulting argument pairs were annotated via crowdsourcing on the Amazon Mechanical Turk platform. For each annotation task, workers were shown four argument pairs and had to select the most and least similar pair among them. Each task was assigned to four different workers. To assess the quality of the resulting annotations, we used the split-half reliability measure (Callender and Osburn, 1979): workers' votes were split in half and used to independently rank all argument pairs with the BWS method for each half of each task. Finally, the Spearman rank correlation between the resulting rankings is calculated as a proxy for consistency. The resulting average correlation across all topics in our dataset is 0.66 (random splits are repeated 25 times and the final scores averaged), which, given the small number of votes per half (two), is in an acceptable range and reflects the difficulty of this task (Kiritchenko and Mohammad, 2016). Table 3 lists the mean split-half reliability estimates for all topics (averaged over 25 random splits) in the dataset.

Table 3: Mean split-half reliability estimate, calculated using the Spearman rank correlation ρ per topic T and over the whole BWS Argument Similarity dataset.
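A simplified version of this reliability computation, using mean scores as the aggregate per half rather than the full BWS counting scheme (a simplification; all names are ours):

```python
import random

def pearson(a, b):
    """Pearson correlation of two equal-length numeric sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a) ** 0.5
    vb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (va * vb)

def spearman(a, b):
    """Spearman correlation = Pearson on ranks (ties not averaged here)."""
    rank = lambda xs: [sorted(xs).index(x) for x in xs]
    return pearson(rank(a), rank(b))

def split_half_reliability(worker_scores, n_splits=25, seed=0):
    """worker_scores: one score list per worker over the same items.
    Repeatedly split the workers in half, aggregate each half by
    averaging, and report the mean Spearman correlation between the two
    halves' item rankings."""
    rng = random.Random(seed)
    n_items = len(worker_scores[0])
    corrs = []
    for _ in range(n_splits):
        workers = list(range(len(worker_scores)))
        rng.shuffle(workers)
        half = len(workers) // 2
        agg = lambda ws: [sum(worker_scores[w][i] for w in ws) / len(ws)
                          for i in range(n_items)]
        corrs.append(spearman(agg(workers[:half]), agg(workers[half:])))
    return sum(corrs) / len(corrs)
```

Perfectly agreeing annotators yield a reliability of 1.0; the 0.66 reported above reflects substantial but imperfect agreement on a genuinely hard task.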
We use the resulting BWS Argument Similarity Dataset with different splitting strategies in our paper. In cross-topic tasks, we fix topics T1-T5 for training, T6 for development, and T7 and T8 for the test set. This is a difficult task, as models are evaluated on completely unseen topics.
Note that the cross-topic experiments on this dataset are quite different from cross-domain tasks (subsection 3.2): the model is fine-tuned in-domain on fixed topics (T1-T5 in our case) and evaluated on unseen topics, whereas in the domain adaptation experiments we fine-tune on target-domain data. For the in-topic setup, we randomly sample fixed and disjoint pairs from every topic (T1-T8) and create our train, development, and test splits with an approximately equal number of pairs from each topic.
Quora Question Pairs (Quora-QP): Duplicate question classification identifies whether two questions are duplicates. Quora released a dataset containing 404,290 question pairs. We start with the dataset partitions from Wang et al. (2017). We remove all overlaps and ensure that a question in one split of the dataset does not appear in any other split, to mitigate the transductive classification problem (Ji et al., 2010). As we observe performance differences between cross- and bi-encoders mainly for small datasets, we randomly downsample the training set to 10,000 pairs while preserving the original balance of non-duplicate to duplicate question pairs.
Microsoft Research Paraphrase Corpus (MRPC): Dolan et al. (2004) presented a paraphrase identification dataset consisting of sentence pairs automatically extracted from online news sources. Each pair was manually annotated by two human judges as to whether the sentences describe the same news event. We use the originally provided train-test splits and ensured that all splits have disjoint sentences.

Multi-Domain Datasets
One of the most prominent sentence-pair classification tasks with datasets from multiple domains is duplicate question detection. Since our focus is on pairwise sentence scoring, we model this task as a question vs. question (title/headline) binary classification task.

AskUbuntu, Quora, Sprint, and SuperUser: We replicate the setup of Shah et al. (2018) for the domain adaptation experiments. The AskUbuntu and SuperUser data come from Stack Exchange, a family of technical community support forums. Sprint FAQ is a dataset crawled from the Sprint technical forum website. We exclude the Apple and Android datasets due to the unavailability of labeled question pairs. The Quora dataset (originally derived from the Quora website) is artificially balanced by removing negative question pairs. Statistics for the datasets can be found in Table 4. Since negative question pairs are not explicitly labeled, Shah et al. (2018) add 100 randomly sampled (presumably negative) question pairs per duplicate question for all datasets except Quora, which has explicit negatives.

Experimental Setup
We conduct our experiments using PyTorch, Hugging Face's transformers (Wolf et al., 2019), and the sentence-transformers framework (Reimers and Gurevych, 2019). The latter showed that BERT outperforms other transformer-like networks when used as a bi-encoder. For the English datasets, we use bert-base-uncased; for the Spanish dataset, we use bert-base-multilingual-cased.

Cross-encoders
We fine-tune the uncased BERT model, optimizing a variety of hyperparameters: hidden-layer sizes, learning rates, and batch sizes. We add a linear layer with sigmoid activation on top of the [CLS] token to output scores from 0 to 1. We achieve optimal results with a learning rate of 1 × 10−5, hidden-layer sizes in {200, 400}, and a batch size of 16. See Table 7 in Appendix C.
Bi-encoders: We fine-tune SBERT with a batch size of 16, a fixed learning rate of 2 × 10−5, and the AdamW optimizer. Table 8 in Appendix C lists the hyperparameters we initially evaluated.
BM25 and Semantic Search: We evaluate various top-k values in {3, ..., 18}. We find the impact of k to be small and achieve the best results overall with k = 3 or k = 5. More details are given in Appendix E.
Evaluation: If not otherwise stated, we repeat our in-domain experiments with 10 different random seeds and report mean scores along with the standard deviation. For the in-domain regression tasks (STS and BWS), we report the Spearman rank correlation (ρ × 100) between predicted and gold similarity scores. For the in-domain classification tasks (Quora-QP, MRPC), we determine the optimal threshold on the development set, use it for the test set, and report the F1 score of the positive label. For all domain adaptation tasks, we weakly label the target-domain training dataset and measure AUC(0.05), since this metric is more robust against false negatives (Shah et al., 2018). AUC(0.05) is the area under the curve of the true positive rate as a function of the false positive rate (fpr), from fpr = 0 to fpr = 0.05.
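AUC(0.05) as defined above can be computed directly from the ranked scores. The sketch below is our own implementation (ties in the scores are not grouped, and it assumes both classes are present):

```python
import numpy as np

def auc_at_fpr(y_true, y_score, max_fpr=0.05):
    """Unnormalized area under the ROC curve from fpr = 0 to fpr = max_fpr."""
    y_true = np.asarray(y_true, dtype=float)
    order = np.argsort(-np.asarray(y_score, dtype=float))
    tps = np.cumsum(y_true[order])          # true positives at each cut
    fps = np.cumsum(1.0 - y_true[order])    # false positives at each cut
    tpr = np.concatenate([[0.0], tps / tps[-1]])
    fpr = np.concatenate([[0.0], fps / fps[-1]])
    # truncate the curve at max_fpr, interpolating the tpr there
    stop = np.searchsorted(fpr, max_fpr, side="right")
    x = np.concatenate([fpr[:stop], [max_fpr]])
    y = np.concatenate([tpr[:stop], [np.interp(max_fpr, fpr, tpr)]])
    return float(np.sum((y[1:] + y[:-1]) / 2 * np.diff(x)))  # trapezoid rule
```

A perfect ranker attains the maximum value of 0.05 (tpr = 1 over the whole interval), so reported scores are often rescaled by 1/0.05 for readability.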
Baselines: For the in-domain regression tasks, we use the Jaccard similarity, which measures the word overlap of the two input sentences. For the in-domain classification tasks, we use a majority-label baseline. Further, we compare our results against the Universal Sentence Encoder (USE) (Yang et al., 2019), a popular state-of-the-art sentence embedding model trained on a wide range of training data; we utilise the multilingual model. Fine-tuning code for USE is not available, hence we use USE as a comparison to a large-scale, pre-trained sentence embedding method. Finally, we compare our data augmentation strategy AugSBERT against a straightforward data augmentation strategy provided by NLPAug, which implements 15 methods for text data augmentation. We include synonym replacement, which replaces words in sentences with synonyms using a BERT language model; we empirically found synonym replacement to work best among the methods provided in NLPAug.

The SBERT bi-encoder trained without seed optimization consistently underperforms the cross-encoder (by 4.5-9.1 points) across all in-domain tasks. Optimizing the seed helps SBERT more than BERT; however, the performance gap remains (2.8-8.2 points). Training with multiple random seeds and selecting the best-performing model on the development set can significantly improve performance. For the smallest dataset (STS), we observe large performance differences between random seeds: the best and worst seed for SBERT differ by more than 21 points. For larger datasets, the dependence on the random seed decreases. We observe that bad training runs can often be identified and stopped early using the early-stopping algorithm (Dodge et al., 2020). Detailed results with seed optimization can be found in Appendix D.
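The Jaccard similarity baseline mentioned above reduces to a few lines (tokenization by whitespace is a simplification of ours):

```python
def jaccard_similarity(sent_a, sent_b):
    """Word-overlap baseline: |A ∩ B| / |A ∪ B| over lowercased token sets."""
    a, b = set(sent_a.lower().split()), set(sent_b.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 0.0
```

Being purely lexical, this baseline cannot credit paraphrases without shared words, which is exactly the gap that the learned encoders are expected to close.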

In-Domain Experiments for AugSBERT
Our proposed AugSBERT approach improves the performance for all tasks by 1 up to 6 points, significantly outperforming the existing bi-encoder SBERT and reducing the performance difference to the cross-encoder BERT. It outperforms the synonym-replacement data augmentation technique (NLPAug) on all tasks; simple word-replacement strategies are thus not helpful for data augmentation in sentence-pair tasks, even leading to worse performance than models without augmentation on BWS and Quora-QP. Compared to the off-the-shelf USE model, we see a significant improvement with AugSBERT for all tasks except Spanish-STS. This is presumably due to the fact that USE was trained on the SNLI corpus (Bowman et al., 2015), which was used as the basis for the Spanish STS test set, i.e., USE has seen the test sentence pairs during training.
For the novel BWS argument similarity dataset, we observe that AugSBERT only gives a minor improvement on the cross-topic split. We assume this is because the cross-topic setting is a challenging task, requiring sentences from an unseen topic to be mapped to a vector space such that similar arguments are close. On known topics (in-topic), however, AugSBERT shows its full capabilities and even outperforms the cross-encoder. We attribute this to better generalization of the SBERT bi-encoder compared to the BERT cross-encoder: sentences from known topics are mapped well within the vector space by a bi-encoder.

Pairwise Sampling: We observe that the sampling strategy is critical for achieving an improvement with AugSBERT. Random sampling (R.S.) decreases performance compared to training SBERT without any additional silver data in most cases. BM25 sampling and KDE produce the best AugSBERT results, followed by Semantic Search (S.S.). Figure 4, which shows the score distributions of the gold and silver datasets for Spanish-STS, visualizes the reason: with random sampling, we observe an extremely high number of low-similarity pairs. This is expected, as randomly sampling two sentences yields a dissimilar pair in nearly all cases. In contrast, BM25 generates a silver dataset with a score distribution similar to the gold training set. It is still skewed towards low-similarity pairs, but has the highest percentage of high-similarity pairs. BM25+S.S., except on Spanish-STS, overall performs worse than the individual methods; it even underperforms random sampling on the BWS and Quora-QP datasets. We believe this is due to the aggregation of a high number of dissimilar pairs from the combined sampling strategies. KDE shows the highest performance on three tasks, but only marginally outperforms BM25 on two of these.
Given that BM25 is the most computationally efficient sampling strategy and also creates smaller silver datasets (numbers are given in Appendix F, Table 11), it is likely the best choice for practical applications.

Domain Adaptation with AugSBERT
We evaluate the suitability of AugSBERT for domain adaptation, using duplicate question detection data from different (specialized) online communities. Results are shown in Table 6. In almost all combinations, AugSBERT outperforms SBERT trained on out-of-domain data (cross-domain). On the Sprint dataset (target), the improvement can be as large as 37 points. In a few cases, AugSBERT even outperforms SBERT trained on gold in-domain target data.
We observe that AugSBERT benefits greatly when the source domain is rather generic (e.g., Quora) and the target domain is rather specific (e.g., Sprint). We assume this is because the Quora forum covers many different topics, including both technical and non-technical questions, which a cross-encoder can transfer well to label the specific target domain (thus benefiting AugSBERT). Vice versa, when we go from a specific source domain (Sprint) to a generic target domain (Quora), only a slight performance increase is noted.
For comparison, Table 6 also shows the state-of-the-art results of Shah et al. (2018), who applied direct and adversarial domain adaptation with a Bi-LSTM bi-encoder. With the exception of the Sprint dataset (target), we outperform that system, with substantial improvements for many combinations.

Conclusion
We presented a simple yet effective data augmentation approach called AugSBERT to improve bi-encoders for pairwise sentence scoring tasks. The idea is to use a more powerful cross-encoder to soft-label new sentence pairs and to include these in the training set.
We saw a performance improvement of up to 6 points for in-domain experiments. However, selecting the right sentence pairs for soft-labeling is crucial and the naive approach of randomly selecting pairs fails to achieve a performance gain. We compared several sampling strategies and found that BM25 sampling provides the best trade-off between performance gain and computational complexity.
The presented AugSBERT approach can also be used for domain adaptation, by soft-labeling data on the target domain. In that case, we observe an improvement of up to 37 points compared to an SBERT model purely trained on the source domain.

A Appendices
In this appendix, we describe the following in detail: MTurk guidelines and a density distribution analysis for the BWS argument similarity dataset (B); hyperparameter tuning (C) and seed optimization (D); an analysis of the top-k parameter (E) and of computational efficiency (F) for our in-domain sampling strategies; and development set performance for all our tasks (G).

B BWS Argument Similarity Dataset

B.1 Amazon Mechanical Turk Guidelines
The annotations required for the BWS Argument Similarity Corpus were acquired via crowdsourcing on the Amazon Mechanical Turk platform. Workers participating in the study had to be located in the US, with more than 100 approved HITs and an overall acceptance rate of 90% or higher. We paid them at the US federal minimum wage of $7.25/hour. Workers also had to qualify for the study by passing a qualification test consisting of four test questions with argument pairs. Figure 7 exemplifies the instructions given to workers.

Figure 5 compares the density distributions of BWS and Spanish-STS. For the Spanish-STS dataset, the pre-sampling process results in a high number of pairs towards either end of the similarity scale, leading to selection bias. The pre-sampling in the creation process of the BWS dataset, in turn, is less biased: there is a much lower number of pairs towards either end of the scale, which is in accordance with data from the wild, i.e., randomly paired arguments.

C Hyperparameter Tuning
We implement a coarse-to-fine random search to find the optimal combination of hyperparameters for both cross-encoders (BERT) and bi-encoders (SBERT). We choose the optimal combination based on development set performance, keeping the random seed fixed.

Cross-Encoder (BERT): For all fine-tuning experiments, we optimize a variety of hyperparameters: hidden-layer sizes, learning rates, and batch sizes. We first evaluate over a wide range of parameters and later conduct a finer search around the optimal values. The experimental setup can be found in Table 7.

Bi-Encoder (SBERT): For all fine-tuning experiments, we utilize bert-base models and implement a coarse-to-fine random search over various learning rates and batch sizes. Since changing the learning rate scheduler did not contribute a significant improvement, we kept it constant for all our experiments. The experimental setup can be found in Table 8.

D Seed Optimization
For our in-domain tasks, we apply seed optimization, i.e. we train our models with 5 random seeds and select the model that performs best on the development set; we repeat this complete setup 10 times. Testing various seeds can be computationally expensive. To reduce this overhead, we evaluate whether bad runs can be identified and stopped early: at x% of the overall training steps, we evaluate the model on the development set and compare its rank with the final ranking of the models on the development set.
The results are depicted in Figure 6. We observe a Spearman's rank correlation of over 0.8 at about 20% of the training steps. We conclude that bad training runs can often be identified and stopped early.
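A minimal sketch of this check, using hypothetical dev-set scores in place of real training runs:

```python
def spearman_rho(x, y):
    """Spearman's rank correlation for score lists without ties."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical dev-set scores of 5 seeds at 20% of training and at the end
partial = [0.78, 0.71, 0.80, 0.65, 0.74]
final = [0.83, 0.79, 0.85, 0.72, 0.81]

rho = spearman_rho(partial, final)

# If rho is high, the intermediate ranking predicts the final one,
# so the low-ranked seeds can be stopped early (here: keep the top 2)
survivors = [i for i, s in enumerate(partial) if s >= sorted(partial)[-2]]
```

In this toy example the intermediate and final rankings agree perfectly (rho = 1.0), so stopping the three lowest-ranked seeds at 20% of training would lose nothing.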

E Impact of Top K in Sampling Strategy
In sampling strategies such as BM25 and Semantic Search, we need to pick the top k values returned by the retrieval engine. Typically, for small k values positive pairs are dominant, and as k increases, negative pairs start becoming dominant.
We chose the top k value from {3, 5, 7, 9, 12, 18} and evaluated the final scores of our experiments to measure the impact of k. Overall, we find the impact of k to be rather small, with k = 3 or k = 5 producing optimal scores for most of the experiments. Top-k mean test scores for our in-domain datasets are reported in Table 9 for the BM25 and in Table 10 for the Semantic Search sampling strategy, respectively.
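The role of k can be illustrated with a toy lexical scorer standing in for a real BM25 index (the scorer and the example sentences below are illustrative assumptions, not the paper's setup):

```python
def overlap_score(a, b):
    """Toy lexical similarity standing in for a BM25 score."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def sample_pairs(sentences, k):
    """Pair each sentence with its top-k highest-scoring candidates."""
    pairs = []
    for i, query in enumerate(sentences):
        scored = [(overlap_score(query, t), j)
                  for j, t in enumerate(sentences) if j != i]
        for _, j in sorted(scored, reverse=True)[:k]:
            pairs.append((query, sentences[j]))
    return pairs

sentences = [
    "how do I reset my password",
    "how can I reset my password",
    "what is the capital of france",
    "paris is the capital of france",
]

# Small k keeps mostly near-duplicates; larger k mixes in dissimilar pairs
pairs_small = sample_pairs(sentences, k=1)
pairs_large = sample_pairs(sentences, k=3)
```

With k = 1 each sentence is paired only with its nearest neighbour (likely positive pairs); with k = 3 every other sentence is included, so dissimilar (negative) pairs dominate.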

F Computational Efficiency vs. Size of Silver Datasets
The Augmented SBERT strategy requires weakly labeling a large set of sentence pairs with the cross-encoder. The larger the set of silver pairs, the bigger the overhead for labeling with the cross-encoder and subsequently training the bi-encoder. Hence, for reasons of efficiency, smaller silver dataset sizes are preferable. Table 11 summarizes the performance of each sampling technique versus the size of the sampled silver pairs. Different sampling strategies create vastly different amounts of sentence pairs. Randomly Sampling (R.S.) a large number of sentence pairs is not efficient and often leads to worse performance. KDE with large silver datasets produces optimal scores, but is less computationally efficient. Semantic Search (S.S.) requires the bi-encoder to be additionally trained, which causes computational overhead. Finally, BM25 on average performs best across all tasks given its computational efficiency, as it samples the smallest silver dataset sizes for all tasks in Table 11.
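The silver-labeling step can be sketched end to end as follows; the `cross_encoder` function here is a toy word-overlap stand-in for a fine-tuned cross-encoder, and all sentence pairs are made up for illustration:

```python
def cross_encoder(a, b):
    """Hypothetical stand-in: a real cross-encoder would run full attention
    over the concatenated pair and output a similarity score."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

# Gold pairs with human labels
gold = [
    ("a cat sits outside", "a feline rests outdoors", 0.9),
    ("a cat sits outside", "the new movie is great", 0.05),
]

# Sampled, unlabeled pairs (e.g. from BM25 or Semantic Search sampling)
unlabeled = [
    ("the dog runs fast", "a dog runs quickly"),
    ("the dog runs fast", "the new movie is great"),
]

# Weakly label the sampled pairs with the cross-encoder -> silver data
silver = [(a, b, cross_encoder(a, b)) for a, b in unlabeled]

# The bi-encoder (SBERT) is then fine-tuned on gold + silver
train_data = gold + silver
```

The labeling cost grows linearly with the number of silver pairs, which is why sampling strategies that reach good scores with small silver sets are preferable.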

G Development Set Performances
The development set performances for all in-domain and domain adaptation sentence-pair tasks can be found in Table 12 and Table 13, respectively.

Table 11: Summary of (#silver dataset samples, mean score) for each sampling technique across all in-domain datasets. For the STS and BWS datasets, we report the Spearman's rank correlation ρ × 100; for the Quora-QP and MRPC datasets, the F1 score. None represents the plain bi-encoder, i.e. SBERT. Scores with the best sampling strategy and smallest silver dataset size across each dataset are highlighted.

Arguments are similar if:
• They say exactly the same thing in different words. Example for topic "Fracking":
Argument A: "And the toxic chemicals associated with fracking operations can contaminate the soil, air and water, and leach into crops".
Argument B: "The chemicals used in fracking are toxic and threaten to poison and pollute our air, ground, water and food supplies - basic necessities for life".
• They cover the same aspect and only differ in minor details. Example for topic "Electric Cars":
Argument A: "With literally hundreds of moving parts, a petro-fired automobile requires considerably more maintenance than an electric car".
Argument B: "Electric cars are much more reliable and require less maintenance than gas-powered cars".
• They talk about the same general aspect but differ in important details. Example for topic "Electric Cars":
Argument A: "Electric cars are environmentally friendly as it reduces air pollution".
Argument B: "Many people think that electric cars are better than gasoline models, not only because of lower operating costs, but because of quicker acceleration and cleaner air".

Arguments are not similar if:
• They have the same topic but do not cover the same aspect. Example for topic "Electric Cars":
Argument A: "Electric cars are environmentally friendly as it reduces air pollution".
Argument B: "Generally electric motors for automobiles are much easier to maintain".
• They have different topics. Example for topic "Robotic Surgery":
Argument A: "Opponents argue that more drilling offshore could damage sensitive ecosystems".
Argument B: "Robotic surgery offers patients less pain, fewer complications, and a faster return to normal daily activities".