SPARTA: Efficient Open-Domain Question Answering via Sparse Transformer Matching Retrieval

We introduce SPARTA, a novel neural retrieval method that shows great promise in performance, generalization, and interpretability for open-domain question answering. Unlike many neural ranking methods that use dense vector nearest neighbor search, SPARTA learns a sparse representation that can be efficiently implemented as an inverted index. The resulting representation enables scalable neural retrieval that does not require expensive approximate vector search and leads to better performance than its dense counterpart. We validate our approach on 4 open-domain question answering (OpenQA) tasks and 11 retrieval question answering (ReQA) tasks. SPARTA achieves new state-of-the-art results across a variety of open-domain question answering tasks on both English and Chinese datasets, including open SQuAD and CMRC, among others. Analysis also confirms that the proposed method creates human-interpretable representations and allows flexible control over the trade-off between performance and efficiency.


Introduction
Open-domain Question Answering (OpenQA) is the task of answering a question based on a knowledge source. One promising approach to solve OpenQA is Machine Reading at Scale (MRS) (Chen et al., 2017). MRS leverages an information retrieval (IR) system to narrow down to a list of relevant passages and then uses a machine reading comprehension reader to extract the final answer span. This approach, however, is bounded by its pipeline nature since the first stage retriever is not trainable and may return no passage that contains the correct answer.
To address this problem, prior work has focused on replacing the first-stage retriever with a trainable ranker (Chidambaram et al., 2018). End-to-end systems have also been proposed to combine passage retrieval and machine reading by directly retrieving answer spans. Despite their differences, the above approaches are all built on top of the dual-encoder architecture, where the query and the answer are encoded into fixed-size dense vectors and their relevance score is computed via a dot product. Approximate nearest neighbor (ANN) search is then used to enable real-time retrieval for large datasets (Shrivastava and Li, 2014).
* This work was done during an internship at SOCO.
In this paper, we argue that the dual-encoder structure is far from ideal for open-domain QA retrieval. Recent research shows its limitations and suggests the importance of modeling complex query-to-answer interactions for strong QA performance. Prior work on dual-encoder retrieval reports that its best-performing system underperforms the state of the art due to query-agnostic answer encoding and an over-simplified matching function. Humeau et al. (2019) show the trade-off between performance and speed when moving from expressive cross-attention in BERT (Devlin et al., 2018) to simple inner-product interaction for dialog response retrieval. Therefore, our key research goal is to develop a new method that can simultaneously achieve expressive query-to-answer interaction and fast inference for ranking.
We introduce SPARTA (Sparse Transformer Matching), a novel neural ranking model. Unlike existing work that relies on a sequence-level inner product, SPARTA uses token-level interaction between every query and answer token pair, leading to superior retrieval performance. Concretely, SPARTA learns sparse answer representations that model the potential interaction between each query term and the answer. The learned sparse answer representation can be efficiently saved in an inverted index, e.g., Lucene (McCandless et al., 2010), so that one can query a SPARTA index at almost the same speed as a standard search engine and obtain more reliable ranking performance without depending on a GPU or ANN search.
Experiments are conducted in two settings: OpenQA (Chen et al., 2017), which requires phrase-level answers, and retrieval QA (ReQA), which requires sentence-level answers (Ahmad et al., 2019). Our proposed SpartaQA system achieves new state-of-the-art results across 15 different domains and 2 languages with significant performance gains, including on OpenSQuAD and OpenCMRC.
Moreover, model analysis shows that SPARTA exhibits several desirable properties. First, SPARTA shows strong domain generalization ability and achieves the best performance compared to both classic IR methods and other learning methods in low-resource domains. Second, SPARTA is simple and efficient and achieves better performance than many more sophisticated methods. Lastly, it provides a human-readable representation that is easy to interpret. In short, the contributions of this work include:
• A novel ranking model, SPARTA, that offers token-level query-to-answer interaction and enables efficient large-scale ranking.
• New state-of-the-art experiment results on 11 ReQA tasks and 4 OpenQA tasks in 2 languages.
• Detailed analyses that reveal insights about the proposed methods, including generalization and computation efficiency.

Related Work
The classical approach to OpenQA depends on knowledge bases (KBs) that are manually or automatically curated, e.g., Freebase (Bollacker et al., 2008) and NELL (Fader et al., 2014). Semantic parsing is used to understand the query and compute the final answer (Berant et al., 2013; Berant and Liang, 2014). However, KB-based systems are often limited by incompleteness in the KB and inflexibility to changes in schema (Ferrucci et al., 2010). A more recent approach is to use text data directly as a knowledge base. DrQA uses a search engine to filter to relevant documents and then applies machine readers to extract the final answer (Chen et al., 2017). It needs two stages because all existing machine readers, for example BERT-based models (Devlin et al., 2018), are prohibitively slow (BERT only processes a few thousand words per second with GPU acceleration). Many attempts have been made to improve the first-stage retrieval performance (Chidambaram et al., 2018; Henderson et al., 2019; Karpukhin et al., 2020; Chang et al., 2020). The information retrieval community has shown that word embedding matching does not perform well for ad-hoc document search compared to classic methods (Guo et al., 2016; Xiong et al., 2017; Hui et al., 2017).
To increase the expressiveness of dual encoders, Xiong et al. (2017) develop a kernel function to learn soft matching scores at the token level instead of the sequence level. Humeau et al. (2019) propose Poly-Encoders to enable more complex interactions between the query and the answer by letting one encoder output multiple vectors instead of one vector. Dhingra et al. (2020) incorporate entity vectors and multi-hop reasoning to teach systems to answer more complex questions. Lee et al. (2020) augment the dense answer representation with learned n-gram sparse features from contextualized word embeddings, achieving significant improvement over the dense-only baseline. Chang et al. (2020) explore various unsupervised pretraining objectives to improve dual encoders' QA performance in the low-resource setting.
Unlike existing work based on dual encoders, we focus on learning sparse representations and emphasize token-level interaction. Our work is perhaps most closely related to the sparse index from DenSPI (Lee et al., 2020) and to DeepCT (Dai and Callan, 2020). Our approach is different because our proposed model is architecturally simpler and is generative, so that it can handle words that do not appear in the answer document, whereas the model of Lee et al. (2020) only models n-grams that appear in the document. MacAvaney et al. (2020) also explore retrieval with sparse representations. Our work differs from theirs in that we decide not to model query word-order information, which enables the model to do full ranking. Section 3.4 shows that our system can be easily deployed via an inverted index under modern search engines, such as Lucene (McCandless et al., 2010).

Problem Formulation
First, we formally define the problem of answer ranking for question answering. Let q be the input question, and A = {(a, c)} be a set of candidate answers. Each candidate answer is a tuple (a, c) where a is the answer text and c is context information about a. The objective is to find model parameters θ that rank the correct answer as high as possible. This formulation is general and can cover many tasks. For example, typical passage-level retrieval systems set a to be the passage and leave c empty (Chen et al., 2017; Yang et al., 2019a). The sentence-level retrieval task sets a to be each sentence in a text knowledge base and c to be the surrounding text (Ahmad et al., 2019). Lastly, the phrase-level QA system sets a to be all valid phrases from a corpus and c to be the surrounding text. This work focuses on the same sentence-level retrieval task (Ahmad et al., 2019) since it provides a good balance between precision and memory footprint. Note, however, that our methods can be easily applied to the other two settings.
Figure 1: The SPARTA neural ranker computes token-level matching scores via dot product. Each query term's contribution is first obtained via max-pooling and then passed through ReLU and log. The final score is the summation of the query term contributions.

SPARTA Neural Ranker
In order to achieve both high accuracy and efficiency (scale to millions of candidate answers with real-time response), the proposed SPARTA index is built on top of two high-level intuitions.
• Accuracy: retrieve answer with expressive embedding interaction between the query and answer, i.e., token-level contextual interaction.
• Efficiency: create query-agnostic answer representations so that they can be pre-computed at indexing time. Since indexing is an offline operation, we can use the most powerful model for indexing and simplify the computation needed at inference.
As shown in Figure 1, a query is represented as a sequence of tokens $q = [t_1, \ldots, t_{|q|}]$ and each answer is also a sequence of tokens $(a, c) = [c_1, \ldots, a_1, \ldots, a_{|a|}, c_{a+1}, \ldots, c_{|c|}]$. We use a non-contextualized embedding to encode the query tokens to $e_i$, and a contextualized transformer model to encode the answer and obtain contextualized token-level embeddings $s_j$:
$$E(q) = [e_1, \ldots, e_{|q|}] \quad \text{Query Embedding} \tag{2}$$
$$H(a, c) = [s_1, \ldots, s_{|c|}] \quad \text{Answer Embedding} \tag{3}$$
Then the matching score f between a query and an answer is computed by:
$$y_i = \max_{j \in [1, |c|]} e_i^\top s_j \tag{4}$$
$$\phi(y_i) = \mathrm{ReLU}(y_i + b) \tag{5}$$
$$f(q, (a, c)) = \sum_{i=1}^{|q|} \log(\phi(y_i) + 1) \tag{6}$$
where b is a trainable bias. The final score between the query and the answer is the summation of all individual scores between each query token and the answer. The logarithm normalizes each individual score and dampens overwhelmingly large term scores. Additionally, two key design choices are worth elaborating.
Token-level Interaction: SPARTA scoring uses token-level interaction between the query and the answer. Motivated by bidirectional attention flow (Seo et al., 2016), relevance between every query and answer token pair is computed via dot product and max-pooling in Eq. 4. In a typical dual-encoder approach, by contrast, only sequence-level interaction is computed via dot product. Results in our experiment section show that fine-grained interaction is crucial for obtaining significant accuracy improvements. Additionally, $s_j$ is obtained from powerful bidirectional transformer encoders, e.g., BERT, and only needs to be computed at indexing time. On the other hand, the query embedding is non-contextual, a trade-off needed to enable real-time inference, which is explained in Section 3.4.
Sparsity Control: Another key feature that enables efficient inference and a small memory footprint is sparsity. This is achieved via the combination of log, ReLU and b in Eq. 5. The bias term is used as a threshold for $y_i$. The ReLU layer ensures that only query terms whose thresholded score $y_i + b$ is positive contribute to the final score, achieving sparse activation. The log operation was found in our experiments to be useful for regularizing individual term scores, leading to better performance and a more generalizable representation.
Implementation: We use a pretrained 12-layer, 768-hidden-size bert-base-uncased as the answer encoder to encode the answer and its context (Devlin et al., 2018). To encode the difference between the answer sequence and its surrounding context, we use the segment embedding from BERT, i.e., the answer tokens have segment_id = 1 and the context tokens have segment_id = 0. Moreover, the query tokens are embedded via the word embedding from bert-base-uncased with dimension 768.
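The scoring procedure described above (dot product with every answer token, max-pooling, a bias threshold with ReLU, a log, and a sum over query tokens) can be illustrated with a minimal sketch. The function names and the toy 2-dimensional embeddings are hypothetical and chosen only for illustration; the `+ 1` inside the log is an assumption that keeps inactive terms at exactly zero.

```python
import math

def sparta_term_score(e_i, answer_embs, bias):
    """Contribution of one query token: log(ReLU(max_j e_i . s_j + b) + 1)."""
    # Dot product with every answer token embedding, then max-pooling.
    y_i = max(sum(x * y for x, y in zip(e_i, s_j)) for s_j in answer_embs)
    # The bias acts as a threshold, ReLU zeroes out weak matches,
    # and the log dampens very large scores.
    return math.log(max(y_i + bias, 0.0) + 1.0)

def sparta_score(query_embs, answer_embs, bias):
    """Final score: sum of the per-token contributions over the query."""
    return sum(sparta_term_score(e_i, answer_embs, bias) for e_i in query_embs)

# Toy embeddings (made-up values, not taken from a trained model).
query_embs = [[1.0, 0.0], [0.0, -1.0]]
answer_embs = [[2.0, 0.5], [0.5, 1.0]]
score = sparta_score(query_embs, answer_embs, bias=-0.5)
```

In this toy case the second query token falls below the threshold and contributes nothing, which is the sparse activation the text describes.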

Learning to Rank
The training of SPARTA uses a cross-entropy learning-to-rank loss and maximizes the objective in Eq. 7. The objective tries to distinguish between the true relevant answer (a+, c+) and irrelevant or random answers K− for each training query q. The choice of negative samples K− is crucial for effective learning. Our study uses two types of negative samples: 50% of the negative samples are randomly chosen from the entire answer candidate set, and the remaining 50% are chosen from sentences near the ground-truth answer a. The second case requires the model to learn the fine-grained differences between sentence candidates instead of relying only on the context information. The parameters to learn include both the query encoder E and the answer encoder H. Parameters are optimized using back propagation (BP) through the neural network.
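As a rough illustration of the training setup, the sketch below pairs a softmax cross-entropy ranking loss with the 50/50 negative sampling scheme. It assumes the common formulation in which the positive is included in the softmax normalizer, which may differ in detail from the paper's objective; `ranking_loss`, `sample_negatives`, and the integer sentence ids are hypothetical names introduced here.

```python
import math
import random

def ranking_loss(pos_score, neg_scores):
    """Cross-entropy of picking the positive among positive + negatives (assumed form)."""
    scores = [pos_score] + list(neg_scores)
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - pos_score

def sample_negatives(all_ids, nearby_ids, positive_id, k, rng):
    """Draw k negatives: half from sentences near the positive, half from the whole pool."""
    nearby = [i for i in nearby_ids if i != positive_id]
    n_near = min(k // 2, len(nearby))
    near = rng.sample(nearby, n_near)
    pool = [i for i in all_ids if i != positive_id and i not in near]
    return near + rng.sample(pool, k - n_near)

rng = random.Random(0)
all_ids = list(range(100))      # every candidate sentence in the corpus
nearby_ids = [8, 9, 11, 12]     # sentences around the ground-truth sentence 10
negatives = sample_negatives(all_ids, nearby_ids, positive_id=10, k=4, rng=rng)
loss = ranking_loss(2.0, [0.0, 0.0])
```

Minimizing this loss is equivalent to maximizing the log-likelihood of the positive answer under the assumed softmax.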

Indexing and Inference
One major novelty of SPARTA is how it can be used for real-time inference. For a test query $q = [t_1, \ldots, t_{|q|}]$, the ranking score between q and an answer is:
$$\mathrm{LOOKUP}(t, (a, c)) = \log(\phi(t, (a, c)) + 1), \quad t \in V \tag{8}$$
$$f(q, (a, c)) = \sum_{i=1}^{|q|} \mathrm{LOOKUP}(t_i, (a, c)) \tag{9}$$
Since the query term embedding is non-contextual, we can compute the rank feature φ(t, (a, c)) for every possible term t in the vocabulary V with every answer candidate. The resulting score is cached at indexing time as shown in Eq. 8. At inference time, the final ranking score can be computed via an O(1) lookup per query term plus a simple summation, as shown in Eq. 9. More importantly, the above computation can be efficiently implemented via an inverted index (Manning et al., 2008), which is the underlying data structure for modern search engines, e.g., Lucene (McCandless et al., 2010), as shown in Figure 1(b). This property makes it easy to apply SPARTA to real-world applications.
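The indexing and lookup scheme can be sketched with a plain in-memory dictionary standing in for a search-engine inverted index. This is an illustrative sketch only: `build_index`, `search`, the three-word vocabulary, and the toy embeddings are assumptions made for the example, and a real deployment would write the postings to an engine such as Lucene instead.

```python
import math
from collections import defaultdict

def term_score(term_emb, answer_embs, bias):
    """Cached score of one vocabulary term for one answer (log, ReLU and bias as in the text)."""
    y = max(sum(a * b for a, b in zip(term_emb, s)) for s in answer_embs)
    return math.log(max(y + bias, 0.0) + 1.0)

def build_index(vocab_embs, answers, bias):
    """Offline: score every vocabulary term against every answer and keep non-zero entries."""
    index = defaultdict(dict)
    for answer_id, answer_embs in answers.items():
        for term, emb in vocab_embs.items():
            score = term_score(emb, answer_embs, bias)
            if score > 0.0:  # zero scores are not stored, which keeps the index sparse
                index[term][answer_id] = score
    return index

def search(index, query_terms, top_k=10):
    """Online: look up each query term's postings and sum the cached scores per answer."""
    totals = defaultdict(float)
    for term in query_terms:
        for answer_id, score in index.get(term, {}).items():
            totals[answer_id] += score
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Toy vocabulary and answer token embeddings (made-up values).
vocab_embs = {"who": [1.0, 0.0], "founded": [0.0, 1.0], "river": [-1.0, -1.0]}
answers = {"a1": [[2.0, 1.0], [0.5, 3.0]], "a2": [[-1.0, 0.2]]}
index = build_index(vocab_embs, answers, bias=-0.5)
results = search(index, ["who", "founded"])
```

No neural network is evaluated in `search`; the query is treated as a bag of terms and all model computation happens in `build_index`.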

Relation to Classic IR and Generative Models
It is not hard to see the relationship between SPARTA and classic BM25-based methods. In the classic IR method, only the tokens that appear in the answer are saved to the inverted index. Each term's score is a heuristic combination of Term Frequency and Inverse Document Frequency (Manning et al., 2008). SPARTA, on the other hand, learns which terms in the vocabulary should be inserted into the index, and predicts the ranking score directly rather than through heuristic calculation. This enables the system to find relevant answers even when none of the query words appear in the answer text. For example, if the answer sentence is "Bill Gates founded Microsoft", a SPARTA index will contain not only the tokens in the answer, but also relevant terms, e.g., who, founder, and entrepreneur.
SPARTA is also related to generative QA. The scoring between (a, c) and every word in the vocabulary V can be understood as the unnormalized log-probability $\log p(q \mid a) = \sum_{i=1}^{|q|} \log p(t_i \mid a)$ under a term independence assumption. Past work such as Lewis and Fan (2018) and Nogueira et al. (2019) trains a question generator to score the answer via likelihood. However, both approaches focus on auto-regressive models and the quality of question generation, and do not provide an end-to-end solution that enables stand-alone answer retrieval.

OpenQA Experiments
We consider an open-domain question answering (OpenQA) task to evaluate the performance of the SPARTA ranker. Following previous work on OpenQA (Chen et al., 2017; Xie et al., 2020), we experiment with two English datasets, SQuAD (Rajpurkar et al., 2016) and Natural Questions (NQ), and two Chinese datasets, CMRC (Cui et al., 2018) and DRCD (Shao et al., 2018). For each dataset, we use the version of Wikipedia from which the data was collected. Preliminary results show that it is crucial to use the right version of Wikipedia to reproduce the results of the baselines. We compare the results with the previous best models.
System-wise we follow the 2-stage ranker-reader structure used in (Chen et al., 2017).
Ranker: We split all documents into sentences. Each sentence is treated as a candidate answer a. We keep the surrounding context words of each candidate answer as its context c. We encode at most 512 word-piece tokens and truncate the context surrounding the answer sentence with an equal window size on each side. For model training, bert-base-uncased is used as the answer encoder for English, and chinese-bert-wwm is used for Chinese. We reuse the word embedding from the corresponding BERT model as the term embedding. Adam (Kingma and Ba, 2014) is used as the optimizer for fine-tuning with a learning rate of 3e-5. The model is fine-tuned for at most 10K steps and the best model is picked based on validation performance.
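The context truncation step can be sketched as follows. This is a simplified illustration under stated assumptions: `window_context` is a hypothetical helper, it works on an already tokenized document, it ignores special tokens such as [CLS] and [SEP], and it assumes the answer sentence itself is shorter than the token budget. The segment ids follow the convention given earlier (1 for answer tokens, 0 for context tokens).

```python
def window_context(tokens, ans_start, ans_end, max_len=512):
    """Return a window around tokens[ans_start:ans_end] with equal context on both sides."""
    budget = max(max_len - (ans_end - ans_start), 0)
    half = budget // 2  # same number of context tokens on each side
    left = max(ans_start - half, 0)
    right = min(ans_end + half, len(tokens))
    window = tokens[left:right]
    # Segment ids: 1 for answer tokens, 0 for surrounding context tokens.
    segment_ids = [1 if ans_start <= i < ans_end else 0 for i in range(left, right)]
    return window, segment_ids

# Toy document of 100 tokens with the answer sentence at positions 40..49.
tokens = [f"w{i}" for i in range(100)]
window, segment_ids = window_context(tokens, ans_start=40, ans_end=50, max_len=20)
```

When the answer sits near the start or end of a document, this sketch simply returns a shorter window instead of shifting the unused budget to the other side.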
Reader: We deploy a machine reading comprehension (MRC) reader to extract phrase-level answers from the top-K retrieved contexts. For English tasks, we fine-tune span-bert (Joshi et al., 2020). For Chinese tasks, we fine-tune chinese-bert-wwm (Cui et al., 2020). Two additional proven techniques are used to improve performance. First, we use global normalization (Clark and Gardner, 2017) to normalize span scores across multiple passages and make them comparable with each other. Second, distant supervision is used. Concretely, we first use the ranker to find the top-10 passages from the Wikipedia corpus for all training data. Then every mention of the oracle answer in these contexts is treated as a training example. This lets the MRC reader adapt to the ranker and makes the training distribution closer to the test distribution (Xie et al., 2020).
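The idea behind global normalization can be sketched as a single softmax taken over the candidate spans of all retrieved passages, instead of one softmax per passage. The sketch below is a simplification: it assumes each passage already comes with one raw score per candidate span, whereas an actual reader normalizes start and end logits. `globally_normalize` and the example spans and scores are hypothetical.

```python
import math

def globally_normalize(span_scores_per_passage):
    """Softmax over every candidate span in every passage, so probabilities are comparable."""
    all_scores = [s for spans in span_scores_per_passage for s in spans.values()]
    m = max(all_scores)  # subtract the max for numerical stability
    z = sum(math.exp(s - m) for s in all_scores)
    return [{span: math.exp(s - m) / z for span, s in spans.items()}
            for spans in span_scores_per_passage]

# Raw span scores from two retrieved passages (made-up values).
passages = [{"Bill Gates": 2.0, "Microsoft": 0.0}, {"Paul Allen": 1.0}]
probs = globally_normalize(passages)
best = max(((span, p) for d in probs for span, p in d.items()), key=lambda kv: kv[1])
```

With per-passage normalization, the only span in the second passage would receive probability 1.0 regardless of its raw score; the shared normalizer avoids that.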
Lastly, evaluation metrics include the standard MRC metric: EM and F1-score.
• Exact Match (EM): if the top-1 answer span matches with the ground truth exactly.
• F1 Score: we compute the token-level word overlap between the returned span and the ground-truth answer.
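The two metrics above can be sketched as follows. This is a simplified version that assumes lowercasing and whitespace tokenization; standard evaluation scripts typically also strip punctuation and articles, and Chinese text would need a different tokenization. The function names are chosen for illustration.

```python
from collections import Counter

def exact_match(prediction, truth):
    """1.0 if the predicted span equals the ground truth after light normalization."""
    return float(prediction.strip().lower() == truth.strip().lower())

def f1_score(prediction, truth):
    """Harmonic mean of token-level precision and recall between the two spans."""
    pred_tokens = prediction.lower().split()
    true_tokens = truth.lower().split()
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)

em = exact_match("bill gates ", "Bill Gates")
f1 = f1_score("Bill Gates founded", "Bill Gates")
```

Here the prediction contains one extra token, so precision is 2/3, recall is 1, and the F1 score is 0.8, while a fully matching span gives 1.0 on both metrics.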

Retrieval QA Experiments
We also consider Retrieval QA (ReQA), a sentence-level question answering task (Ahmad et al., 2019). The candidate answer set contains every possible sentence from a text corpus and the system is expected to return a ranking of sentences given a query. The original ReQA only contains SQuAD and NQ. In this study, we extend ReQA to 11 different domains adapted from Fisch et al. (2019) to evaluate both in-domain performance and out-of-domain generalization. The details of the 11 ReQA domains are in Table 3 and the Appendix.
The in-domain scenarios look at domains that have enough training data (see Table 3). The models are trained on the training data and evaluated on the test data. The out-of-domain scenarios, on the other hand, evaluate systems' performance on test data from domains not included in the training, making it a zero-shot learning problem. There are two out-of-domain settings: (1) the training data contain only SQuAD, and (2) the training data contain only SQuAD and NQ. Evaluation is carried out on all the domains to test the systems' ability to generalize to unseen data distributions.
USE-QA: a universal sentence encoder trained for the QA task by Google (Yang et al., 2019b). USE-QA uses the dual-encoder architecture and is trained on more than 900 million mined question-answer pairs in 16 different languages.
Poly-Encoder (Poly-Enc): Poly-Encoders improve the expressiveness of dual encoders with two-level interaction (Humeau et al., 2019). We adapted the original dialog model for QA retrieval: two bert-base-uncased models are used as the question and answer encoders. The answer encoder has 4 vector outputs.

In-domain Performance
Table 4 shows the MRR results on the five datasets with in-domain training. SPARTA achieves the best performance across all domains by a large margin. In terms of average MRR across the five domains, SPARTA is 114.3% better than BM25, 50.6% better than USE-QA and 26.5% better than Poly-Encoders.
Two additional insights can be drawn from the results. First, BM25 is a strong baseline and does not require training. It performs particularly well in domains that have a high rate of word overlap between the answer and the questions. For example, SQuAD's questions are generated by crowd workers who look at the ground-truth answer, while question data from NQ/News are generated by question makers who do not see the correct answer. BM25 works particularly well on SQuAD while performing the poorest on the other datasets. Similar observations are also found in prior research (Ahmad et al., 2019). Second, the results in Table 4 confirm our hypothesis on the importance of rich interaction between the answer and the questions. Both USE-QA and Poly-Encoder use powerful transformers to encode the whole question and model word-order information in the queries. However, their performance is bounded by the simple dot-product interaction between the query and the answer. In contrast, even though SPARTA does not model word-order information in the query, it achieves a big performance gain compared to the baselines, confirming the effectiveness of the proposed token-level interaction method in Eq. 4.

Model Analysis

Interpreting Sparse Representations
One common limitation of deep neural network models is poor interpretability. Take dense distributed vector representations for example: one cannot directly make sense of each dimension and has to use dimension reduction and visualization methods, e.g., t-SNE (Maaten and Hinton, 2008). In contrast, the resulting SPARTA index is straightforward to interpret due to its sparse nature. Specifically, we can understand a SPARTA vector by reading the top K words with non-zero f(t, (a, c)), since these terms have the greatest impact on the final ranking score. Table 6 shows some example outputs. It is not hard to see that the generated terms for each answer sentence are highly relevant to both a and c, and contain not only keywords that appear in the answer, but also terms that are potentially in the query but never appear in the answer itself. Two experts manually inspected the outputs for 500 (a, c) data points from Wikipedia, and we summarize the following four major categories of terms that are predicted by SPARTA.
Conversational search understanding: the third row is an example. "Who" appears as the top term, showing that the model learns Bill Gates is a person and is therefore likely to match "Who" questions.
Keyword identification: terms such as "gates, google, magnate, yellowstone" have high scores in the generated vector, showing that SPARTA learns which words are important in the answer.
Synonyms and Common Sense: "benefactor, investors" are examples of synonyms. Also, even though "Utah" does not appear in the answer, it is predicted as an important term, showing that SPARTA leverages the world knowledge from a pretrained language model and knows Yellowstone is related to Utah.
Table 6: Top-k terms predicted by SPARTA. The text in bold is the answer sentence and the text surrounding it is encoded as its context. Each answer sentence has around 1600 terms with non-zero scores.
Answer (a, c): Question answering (QA) is a computer science discipline within the fields of information retrieval and natural language processing (NLP).
Top terms: answering, question, q, computer, information, retrieval, language, natural, human, nl, science, ...

Sparsity vs. Performance
Sparsity not only provides interpretability, but also offers the flexibility to trade off memory footprint against performance. When there are memory constraints on the vector size, the SPARTA vector can be easily reduced by keeping only the top-K most important terms. Table 7 shows performance on SQuAD and NQ with varying K. The resulting sparse vector representation is very robust to smaller K. When keeping only the top 50 terms in each answer vector, SPARTA achieves 69.5 MRR, a better score than all baselines, with only 1.6% of the memory footprint of Poly-Encoders (768 x 4 dimensions). The NQ dataset is more challenging and requires more terms. SPARTA achieves close to its best performance with the top 500 terms.
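The pruning step and the quoted footprint ratio can be illustrated with a short sketch. `prune_top_k` and the example term scores are hypothetical; the ratio simply compares 50 stored entries against 768 x 4 dense values and ignores the per-entry cost of storing term ids.

```python
def prune_top_k(term_scores, k):
    """Keep only the k highest-scoring terms of a sparse answer vector."""
    top = sorted(term_scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return dict(top)

# Made-up term scores for one answer sentence.
term_scores = {"who": 2.1, "gates": 3.0, "founder": 1.2, "the": 0.1}
pruned = prune_top_k(term_scores, k=2)

# 50 kept terms versus a Poly-Encoder answer with 4 vectors of 768 dimensions.
footprint_ratio = 50 / (768 * 4)
```

The ratio evaluates to roughly 0.016, which matches the 1.6% figure quoted in the text.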

Conclusion
In short, we propose SPARTA, a novel ranking method that learns sparse representations for better open-domain QA. Experiments show that the proposed framework achieves state-of-the-art performance on 4 different open-domain QA tasks in 2 languages and on 11 retrieval QA tasks. This confirms our hypothesis that token-level interaction is superior to sequence-level interaction for evidence ranking. Analyses also show the advantages of sparse representations, including interpretability, generalization and efficiency. Our findings also suggest promising future research directions. The proposed method does not support multi-hop reasoning, an important attribute that enables QA systems to answer more complex questions that require collecting multiple evidence passages. Also, the current method only uses bag-of-words features for the query. We expect further gains from incorporating word-order information.