QuadrupletBERT: An Efficient Model For Embedding-Based Large-Scale Retrieval

The embedding-based large-scale query-document retrieval problem is a hot topic in the information retrieval (IR) field. Considering that pre-trained language models like BERT have achieved great success in a wide variety of NLP tasks, we present a QuadrupletBERT model for effective and efficient retrieval in this paper. Unlike most existing BERT-style retrieval models, which only focus on the ranking phase of retrieval systems, our model makes considerable improvements to the retrieval phase and leverages the distances between simple negative and hard negative instances to obtain better embeddings. Experimental results demonstrate that our QuadrupletBERT achieves state-of-the-art results on embedding-based large-scale retrieval tasks.


Introduction
Large-scale retrieval systems such as search engines have become vital tools for helping people access the massive amount of information online. Various techniques have been developed over the past decades to improve retrieval quality.
Because it is difficult to infer search intent from the query text and to accurately represent the semantic meaning of documents, most previous studies are based on classic term-weighting methods such as BM-25 (Robertson and Zaragoza, 2009) or TF-IDF (Spärck Jones, 1972, 2004), or on simple context-free word embeddings (Mikolov et al., 2013), which perform well for cases that keyword matching can address. However, these models only accept sparse handcrafted features and cannot capture complex semantic features.
Considering that pre-trained language models like BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) have achieved great success in a wide variety of NLP tasks, more and more researchers have proposed BERT-style models to solve large-scale retrieval problems. Previous work has confirmed the effectiveness of BERT for enhancing retrieval systems. For example, Yilmaz et al. (2019) apply a BERT-style model to document retrieval by integrating it with the open-source Anserini information retrieval toolkit to demonstrate end-to-end search over large document collections. Other work builds a BERT-based reader to identify answers from a large corpus of Wikipedia articles in an end-to-end fashion. Padaki et al. (2020) use query expansion to generate better queries for a BERT-based Ranker in retrieval. Mass and Roitman (2020) describe a weakly-supervised method for training BERT-style models for ad hoc document retrieval.
In BERT, the prediction function f(query, doc) is a pre-trained deep bidirectional Transformer model (Vaswani et al., 2017). While the above BERT-style models are very successful, this approach cannot be directly applied to large-scale retrieval problems because evaluating f for every possible document is prohibitively expensive. Thus, the methods mentioned above first use a less powerful but more efficient retrieval algorithm (Retriever), such as an inverted index, to reduce the solution space, and then use the BERT-style model to re-rank the retrieved documents. As shown in Figure 1, we refer to all such BERT-style retrieval models as Rankers. Unlike these Rankers, which have recently seen significant advances, constructing a BERT-style Retriever is a new topic in the large-scale retrieval field, on which few studies have focused so far. For example, Reimers and Gurevych (2019) present a modification of the pre-trained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity. Chang et al. (2020) build a two-tower Transformer model with more pre-training data, which significantly outperforms the widely used BM-25 algorithm. Lu et al. (2020) distill knowledge from BERT into a two-tower architecture network for efficient retrieval.
As shown in Figure 2 (a) and (b), the existing BERT-style Retrievers mentioned above simply build a two- or three-tower network structure to compute distances between positive and negative instances, which ignores the fact that the dataset does not contain only simple negative instances: some instances are seemingly positive but actually negative, which we call hard negative instances. A Retriever should have high recall; otherwise, many positive instances will not even be considered in the ranking phase. However, because hard negative instances are superficially related to the query, treating them the same as simple negative instances may harm the embedding of positive instances and lead the model to mistakenly identify positive instances as negative ones.
The key to solving this problem is to incorporate the distances between hard negative and simple negative instances into the training step. Our intuition is that hard negative instances are negative compared to positive instances but should be considered positive compared to simple negative instances. Therefore, we explore a new way to incorporate the distances between hard negative and simple negative instances into the training process and build a four-tower BERT-style model named QuadrupletBERT. We evaluate our model on two Retrieval Question-Answering (ReQA) benchmarks. Experimental results show that our model achieves substantial improvements over existing BERT-style Retriever models and attains state-of-the-art results.
Our main contributions are as follows: 1. We propose a new four-tower BERT-style model named QuadrupletBERT, which is easy to use and substantially improves over existing BERT-style Retriever models.
2. We find that leveraging distances between hard negative and simple negative instances in the training process helps improve the Retriever model.

Task Description
Large-scale retrieval problems can be defined as follows: given a query, return the most relevant documents from a large corpus, where the corpus can contain hundreds of thousands of documents or more. An embedding-based retrieval model jointly embeds queries and documents in the same embedding space and uses an inner product or cosine distance to measure the similarity between queries and documents. Since the embeddings of all candidate documents can be pre-computed and indexed, inference can be performed efficiently with approximate nearest neighbor search algorithms in the embedding space (Shrivastava and Li, 2014; Guo et al., 2016). Let the query embedding model be $\phi(\cdot)$ and the document embedding model be $\psi(\cdot)$. The distance function can then be defined as
$$f(q, d) = \mathrm{dist}\big(\phi(q), \psi(d)\big).$$
In this paper, we are interested in parameterizing the encoders $\phi$ and $\psi$ as a four-tower BERT that incorporates the distances between hard negative and simple negative instances into the training step.
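As a purely illustrative sketch of this setup, the snippet below ranks pre-computed document embeddings against a query embedding by cosine distance; random vectors stand in for the outputs of $\phi$ and $\psi$, so none of the names below refer to the paper's actual implementation.

```python
import numpy as np

def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine distance 1 - cos(u, v); smaller means more similar."""
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy stand-ins for psi(d) and phi(q): in practice these would come from the
# BERT towers, but random vectors suffice to illustrate the retrieval mechanics.
rng = np.random.default_rng(0)
doc_embeddings = rng.normal(size=(1000, 768))   # pre-computed psi(d) for the corpus
query_embedding = rng.normal(size=768)          # phi(q) for one query

# Rank all candidate documents by ascending cosine distance to the query.
distances = np.array([cosine_distance(query_embedding, d) for d in doc_embeddings])
top_k = np.argsort(distances)[:10]              # indices of the 10 nearest documents
```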

QuadrupletBERT
As shown in Figure 2 (c), the core of our model is a four-tower sentence-level BERT relevance encoder. Each tower of our retrieval model follows the architecture and hyper-parameters of the 12-layer BERT model (https://github.com/google-research/bert). Note that for all BERT baselines, we pre-train them on the specific downstream datasets with the Masked LM and Next Sentence Prediction tasks (Devlin et al., 2019). The embedding dimension is 768, and the sequence length for the encoder is set to 64. For each tower, we take the average of the encoding layer's hidden states along the time axis as the final embedding.
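This pooling step can be sketched as follows (a minimal PyTorch sketch, not the authors' code; masking out padded positions is our own assumption, since the paper only states that hidden states are averaged over the time axis).

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average one tower's final hidden states over the time axis.

    last_hidden_state: (batch, seq_len, 768) hidden states from a BERT tower.
    attention_mask:    (batch, seq_len), 1 for real tokens and 0 for padding.
    Returns a (batch, 768) sentence embedding for each input.
    """
    mask = attention_mask.unsqueeze(-1).float()       # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)    # ignore padded positions
    counts = mask.sum(dim=1).clamp(min=1e-9)          # number of real tokens
    return summed / counts

# Example with the paper's dimensions: sequence length 64, hidden size 768.
hidden = torch.randn(2, 64, 768)
mask = torch.ones(2, 64, dtype=torch.long)
embeddings = mean_pool(hidden, mask)                  # (2, 768)
```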

Training
One unique advantage of the multi-tower retrieval model compared with classic IR algorithms is the ability to train it for specific tasks. In this paper, our training data can be defined as quadruplet query-document pairs
$$x_i = (q_i, p_i, n_i, hn_i),$$
where $q$, $p$, $n$, and $hn$ represent the query, the positive document, the negative document, and the hard negative document, respectively. We estimate the model parameters by minimizing the following loss function:
$$\mathcal{L} = \sum_i \left( loss^h_i + loss^n_i \right), \qquad loss^h_i = \max\!\big(0,\, d^p_i - d^h_i + m\big), \qquad loss^n_i = \max\!\big(0,\, d^p_i - d^n_i + m\big),$$
where $d^p_i$ is the distance between $q_i$ and $p_i$, $d^n_i$ is the distance between $q_i$ and $n_i$, and $d^h_i$ is the distance between $hn_i$ and $n_i$. The loss function consists of two parts: both $loss^h_i$ and $loss^n_i$ aim to minimize $d^p_i$; in addition, $loss^h_i$ aims to maximize $d^h_i$, and $loss^n_i$ aims to maximize $d^n_i$. $m$ is the margin enforced between positive, negative, and hard negative documents. The intuition behind this loss function is to cluster the query and positive documents and to separate both the positive and the hard negative documents from the negative documents by a distance margin. The distance function $f$ we select is cosine distance, defined as
$$f(u, v) = 1 - \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}.$$
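A possible PyTorch rendering of this objective is sketched below; the hinge formulation follows the margin description above, but the exact form is our reconstruction rather than the authors' released code, and the batch of random embeddings is purely illustrative.

```python
import torch
import torch.nn.functional as F

def cosine_distance(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Row-wise cosine distance 1 - cos(a, b) for (batch, dim) tensors."""
    return 1.0 - F.cosine_similarity(a, b, dim=-1)

def quadruplet_loss(q, p, n, hn, margin: float = 0.2) -> torch.Tensor:
    """Margin loss over (query, positive, negative, hard negative) embeddings.

    d_p = dist(q, p), d_n = dist(q, n), d_h = dist(hn, n); both hinge terms
    pull the positive toward the query, while pushing the simple negative
    (relative to the query) and the hard negative (relative to the simple
    negative) at least `margin` further away.
    """
    d_p = cosine_distance(q, p)
    d_n = cosine_distance(q, n)
    d_h = cosine_distance(hn, n)
    loss_n = F.relu(d_p - d_n + margin)   # separate q/p from simple negatives
    loss_h = F.relu(d_p - d_h + margin)   # separate hard negatives from simple negatives
    return (loss_n + loss_h).mean()

# Example: a batch of 8 quadruplets with 768-dimensional embeddings.
q, p, n, hn = (torch.randn(8, 768) for _ in range(4))
loss = quadruplet_loss(q, p, n, hn, margin=0.2)
```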

Inference
First, we pre-compute all the document embeddings. Then, given an unseen query q, we only need to rank the documents by their cosine distance to the query embedding. To make our QuadrupletBERT applicable in resource-restricted and time-sensitive systems, such as query understanding in search engines (Nakamura et al., 2019), we deploy an inverted-index-based ANN (approximate nearest neighbor) search algorithm in our model. We employ the Faiss library (Johnson et al., 2017) to quantize the vectors and then implement efficient embedding search.
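A sketch of this inference pipeline with Faiss is given below; the index type and parameters (nlist, m, nprobe) are illustrative choices on our part, not the settings reported in the paper.

```python
import faiss
import numpy as np

d = 768                                                        # embedding dimension of the towers
doc_vectors = np.random.rand(100_000, d).astype("float32")     # pre-computed psi(d)
query_vectors = np.random.rand(5, d).astype("float32")         # phi(q) for 5 queries

# L2-normalise so that L2 ranking is equivalent to cosine-distance ranking.
faiss.normalize_L2(doc_vectors)
faiss.normalize_L2(query_vectors)

# Inverted-file index with product quantization: vectors are assigned to nlist
# coarse cells (the inverted index) and stored in compressed (quantized) form.
nlist, m, nbits = 1024, 64, 8
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)
index.train(doc_vectors)              # learn coarse and PQ codebooks
index.add(doc_vectors)                # index the pre-computed document embeddings

index.nprobe = 16                     # number of inverted lists to visit per query
distances, ids = index.search(query_vectors, 10)   # top-10 documents per query
```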

Datasets and Baselines
We consider the Retrieval Question-Answering (ReQA) benchmark proposed by Ahmad et al. (2019). The two QA datasets we consider are SQuAD and Natural Questions. Note that each entry of the QA datasets is a tuple (q, a, e), where q is the question, a is the answer span, and e is the evidence passage containing a. Following Ahmad et al. (2019), we split a passage into sentences e = s_1 s_2 ... s_n and transform the original entry into new tuples (q, s_i). Different from the ranking phase of large-scale retrieval, the retrieval phase is as follows: given a question q, retrieve the correct sentence s from all candidates. For each evidence passage e, we create a set of candidate sentences s_i, and the retrieval candidate set is built by combining such sentences over all passages.
To construct our training quadruplet pairs (q_i, p_i, n_i, hn_i), for a specific question q_i we define the gold sentence containing a_i as p_i and randomly select a sentence not containing a_i as n_i. We first train our model with $loss^h_i = 0$ until the loss converges. Then we use the trained model to retrieve a candidate set C_i for q_i and randomly select a sentence in C_i as hn_i. For each dataset, we consider different training/test splits of the data (5%/95% and 80%/20%) in the fine-tuning stage, and 10% of the training set is held out as the validation set for hyperparameter tuning. The split is created assuming a cold-start retrieval scenario in which the queries in the test (query, document) pairs are not seen in training.

Table 1: Recall@k on the two datasets, where Three-T Emb denotes the three-tower word embedding retrieval method (Huang et al., 2020) and Three-T BERT denotes the three-tower Sentence-BERT (Reimers and Gurevych, 2019).
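The hard negative mining step described above could look roughly like the following sketch; the candidate-set size k and the exclusion of the gold sentence are our own assumptions for illustration.

```python
import numpy as np

def mine_hard_negative(query_emb, sent_embs, gold_idx, k=50, rng=None):
    """Pick hn_i for one question from the top-k sentences retrieved by the
    stage-one model (the model trained with loss^h_i = 0).

    query_emb: (dim,) embedding of q_i.
    sent_embs: (num_sents, dim) pre-computed candidate sentence embeddings.
    gold_idx:  index of the gold sentence p_i (excluded here by assumption).
    Returns the index of a randomly chosen sentence from the candidate set C_i.
    """
    rng = rng or np.random.default_rng()
    q = query_emb / np.linalg.norm(query_emb)
    s = sent_embs / np.linalg.norm(sent_embs, axis=1, keepdims=True)
    dist = 1.0 - s @ q                                  # cosine distance to the query
    candidates = np.argsort(dist)[:k]                   # top-k retrieved sentences = C_i
    candidates = candidates[candidates != gold_idx]     # drop the gold sentence
    return int(rng.choice(candidates))
```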
We compare our method against two well-known embedding-based large-scale retrieval baselines: (1) the recent three-tower word embedding retrieval method proposed by Facebook Search (Huang et al., 2020), and (2) the state-of-the-art three-tower Sentence-BERT proposed by Reimers and Gurevych (2019).

Evaluation Metric
Since the goal of the Retriever is to capture the positives in the top-k results, we select Recall@k as the evaluation metric, computed as
$$\mathrm{Recall@}k = \frac{1}{|D|} \sum_{(x_i, y_i) \in D} \mathbb{1}\big[\, y_i \in R_k(x_i) \,\big],$$
where $R_k(x_i)$ is the top-k results recalled by our model for $x_i$, $D$ is the dataset, and $x_i$ and $y_i$ are the $i$-th question and the $i$-th answer, respectively.
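For clarity, the metric can be computed as in the following short sketch (an illustrative helper, not part of the benchmark code).

```python
def recall_at_k(retrieved, gold, k=10):
    """Recall@k over a dataset of (question, answer-sentence) pairs.

    retrieved: retrieved[i] is the ranked list of sentence ids returned for
               the i-th question (R_k is its first k entries).
    gold:      gold[i] is the id of the correct sentence y_i.
    """
    hits = sum(1 for ranked, y in zip(retrieved, gold) if y in ranked[:k])
    return hits / len(gold)

# Example: two questions whose gold sentences are 3 and 7.
print(recall_at_k([[3, 12, 5], [9, 1, 4]], [3, 7], k=3))   # 0.5
```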

Overall Results
The experimental results are shown in Table 1. We can see that: 1. The results of both Sentence-BERT and our QuadrupletBERT surpass those of the three-tower word embedding method, which confirms the effectiveness of BERT-style retrieval models.
2. Our four-tower QuadrupletBERT gains further improvements over the three-tower BERT. It is worth noting that the only difference between the two is that our model leverages the distances between hard negative and simple negative instances in the training process via an extra tower, which verifies our assumption.

Hyper-Parameter Finetuning
As a key hyper-parameter of our QuadrupletBERT model, m denotes the margin enforced between positive, hard negative, and negative instances. We further investigated the influence of m on our model. On the SQuAD and Natural Questions datasets, we train models with m set to 0, 0.1, 0.2, 1, 1.5, and 2, respectively. The experimental results are shown in Table 2. We found that tuning the margin value is important: the optimal margin varies considerably across training tasks, and different margin values result in a 5-10% variance in recall.

Related Work
We covered research on embedding-based large-scale retrieval in Section 1; the related work that inspires our technical design is introduced below. Reimers and Gurevych (2019) present a modification of the pre-trained BERT network that uses multi-tower network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity. Huang et al. (2020) present a multi-tower word embedding retrieval method successfully applied in Facebook's online search. They also mention that shuffling hard negative and simple negative instances in the training set may help model learning, which inspired us to further investigate the effectiveness of hard negative instances.

Conclusion
We have presented our four-tower QuadrupletBERT model and demonstrated its usage and effect on large-scale retrieval. Unlike many widely used BERT-style Ranker models for large-scale retrieval tasks, our model focuses on the retrieval phase. The multi-tower architecture makes it extremely easy to apply in retrieval systems. Moreover, incorporating the distances between hard negative and simple negative instances into the training process shows significant superiority in improving Retriever performance.
We hope our work can inspire more sophisticated techniques for leveraging BERT-style models in large-scale retrieval. Leveraging hard negative instances for other natural language processing tasks, such as text generation and information extraction, is also worth investigating.