Contextualized Sparse Representations for Real-Time Open-Domain Question Answering

Open-domain question answering can be formulated as a phrase retrieval problem, which offers large gains in scalability and speed but often suffers from low accuracy due to the limitations of existing phrase representation models. In this paper, we aim to improve the quality of each phrase embedding by augmenting it with a contextualized sparse representation (Sparc). Unlike previous sparse vectors, which are term-frequency-based (e.g., tf-idf) or directly learned (and thus limited to a few thousand dimensions), we leverage rectified self-attention to indirectly learn sparse vectors in an n-gram vocabulary space. By augmenting the previous phrase retrieval model (Seo et al., 2019) with Sparc, we show more than 4% improvement on CuratedTREC and SQuAD-Open. Our CuratedTREC score even surpasses that of the best known retrieve & read model, with at least 45x faster inference.


Introduction
Open-domain question answering (QA) is the task of answering generic factoid questions by looking up a large knowledge source, typically unstructured text corpora such as Wikipedia, and finding the answer text segment (Chen et al., 2017). One widely adopted strategy for handling such a large corpus is to use an efficient document (or paragraph) retrieval technique to obtain a few relevant documents, and then use an accurate (yet expensive) QA model to read the retrieved documents and find the answer (Chen et al., 2017; Wang et al., 2018; Das et al., 2019; Yang et al., 2019).
More recently, an alternative approach formulates the task as an end-to-end phrase retrieval problem by encoding and indexing every possible text span in a dense vector offline (Seo et al., 2018). The approach promises a massive speed advantage, since answering reduces to similarity search over the pre-computed phrase index. Code is available at https://github.com/jhyuklee/sparc.

[Figure 1: An example passage and question. Passage: "Between 1991 and 2000, the total area of forest lost in the Amazon rose from 415,000 to 587,000 square kilometres." Question: "How many square kilometres of the Amazon forest was lost by 1991?"]
In this paper, we introduce a method to learn a Contextualized Sparse Representation (SPARC) for each phrase and show its effectiveness in open-domain QA under the phrase retrieval setup. Related previous work (for a different task) often directly maps dense vectors to a sparse vector space (Faruqui et al., 2015; Subramanian et al., 2018), which can reach at most a few thousand dimensions due to computational cost and small gradients. We instead leverage rectified self-attention weights on the neighboring n-grams to scale the cardinality up to the n-gram vocabulary space (billions), allowing us to encode rich lexical information in each sparse vector. We kernelize the inner product space during training to avoid the explicit mapping and obtain memory and computational efficiency.
SPARC improves the previous phrase retrieval model, DenSPI (Seo et al., 2019), by augmenting its phrase embeddings, yielding more than 4% improvement on both CuratedTREC and SQuAD-Open. In fact, our CuratedTREC result achieves a new state of the art even when compared to previous retrieve & read approaches, with at least 45x faster inference.

Background
We focus on open-domain QA on unstructured text, where the answer is a text span in a textual corpus (e.g., Wikipedia). Formally, given a set of $K$ documents $x^1, \dots, x^K$ and a question $q$, the task is to design a model that obtains the answer $\hat{a}$ by $\hat{a} = \operatorname{argmax}_{x^k_{i:j}} F(x^k_{i:j}, q)$, where $F$ is the scoring model to learn and $x^k_{i:j}$ is the phrase spanning the $i$-th to the $j$-th word of the $k$-th document. Pipeline-based methods (Chen et al., 2017; Lin et al., 2018; Wang et al., 2019) typically leverage a document retriever to reduce the number of documents to read, but they suffer from error propagation when wrong documents are retrieved and can still be slow due to the heavy reader model.
Phrase-Indexed Open-domain QA As an alternative, Seo et al. (2018, 2019) introduce an end-to-end, real-time open-domain QA approach that directly encodes all phrases in the documents agnostic of the question, and then performs similarity search over the encoded phrases. This is feasible by decomposing the scoring function $F$ into two functions, $F(x^k_{i:j}, q) = H_x(x^k_{i:j}) \cdot H_q(q)$, where $H_x$ is the query-agnostic phrase encoding, $H_q$ is the question encoding, and $\cdot$ denotes a fast inner product operation. Seo et al. (2019) propose to encode each phrase (and question) as the concatenation of a dense vector obtained from a deep contextualized word representation model (Devlin et al., 2019) and a sparse vector obtained by computing the tf-idf of the document (paragraph) the phrase belongs to. We argue that the inherent characteristics of tf-idf, which is not learned and is identical across the same document, limit its representational power.
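As a concrete toy sketch of this decomposition, the following uses random vectors in place of the BERT-based encoders $H_x$ and $H_q$ (an assumption for illustration only); at inference, answering reduces to a maximum inner product search over the pre-computed phrase index:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the encoders: in the paper H_x and H_q are BERT-based;
# random projections here just illustrate F(x_{i:j}, q) = H_x(x_{i:j}) . H_q(q).
d = 16
phrase_vecs = rng.standard_normal((1000, d))  # precomputed offline, query-agnostic
question_vec = rng.standard_normal(d)         # encoded at query time

# Answering reduces to maximum inner product search (MIPS) over the phrase
# index -- no reader model runs over the documents at inference time.
scores = phrase_vecs @ question_vec
best_phrase_id = int(np.argmax(scores))
```

In the real system the argmax is replaced by approximate MIPS over billions of indexed phrase vectors.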
Our goal in this paper is to propose a better and learned sparse representation model that can further improve the QA accuracy in the phrase retrieval setup.

Sparse Encoding of Phrases
Our sparse model, unlike pre-computed sparse embeddings such as tf-idf, dynamically computes the weight of each n-gram depending on the context.

Why do we need sparse representations?
To answer the question in Figure 1, the model should know that the target answer (415,000) corresponds to the year 1991 while the (confusing) phrase 587,000 corresponds to the year 2000. A dense phrase encoding is likely to have difficulty precisely differentiating between 1991 and 2000, since it must also encode several other kinds of information. Window-based tf-idf would not help because the year 2000 is closer (in word distance) to 415,000. This example illustrates the strong need for an n-gram-based sparse encoding that is highly syntax- and context-aware.

Contextualized Sparse Representations
The sparse representation of each phrase is obtained as the concatenation of its start word's and end word's sparse embeddings, i.e., $\mathbf{s}_{i:j} = [\mathbf{s}^{\text{start}}_i; \mathbf{s}^{\text{end}}_j]$. This way, similarly to how the dense phrase embedding is obtained in Seo et al. (2019), we can compute them efficiently without explicitly enumerating all possible phrases.
We obtain each (start/end) sparse embedding in the same way (with unshared parameters), so we describe only the start sparse embedding here and omit the superscript 'start'. Given the contextualized encoding $H \in \mathbb{R}^{N \times d}$ of a document with $N$ words, we compute the sparse embedding matrix $S \in \mathbb{R}^{N \times F}$ as
$$S = \mathrm{ReLU}\left(\frac{QK^\top}{\sqrt{d}}\right)\mathbf{F},$$
where $Q, K \in \mathbb{R}^{N \times d}$ are query and key matrices obtained by applying (different) linear transformations $W_Q, W_K$ to $H$, and $\mathbf{F} \in \mathbb{R}^{N \times F}$ is a one-hot n-gram feature representation of the input document $x$. For instance, to encode unigram (1-gram) features, $\mathbf{F}_i$ is a one-hot representation of the word $x_i$, and $F$ equals the vocabulary size. Intuitively, $\mathbf{s}_i$ is a weighted bag-of-n-grams representation in which each n-gram is weighted by its relative importance to the start (or end) word of a phrase. Note that $\mathbf{F}$ is very large, so it should always be kept in an efficient sparse matrix format (e.g., CSC) and never materialized in dense form. Since we want to handle several sizes of n-grams, we create a sparse encoding $S$ for each n-gram size and concatenate the resulting sparse encodings; in practice, we experimentally find that unigrams and bigrams are sufficient for most use cases. We compute the question-side sparse encoding ($\mathbf{s}' \in \mathbb{R}^{F}$) in the same way as the document side, with the only difference that we use the [CLS] token instead of the start and end words to represent the entire question. We share the same BERT and linear transformation weights used for the phrase encoding.
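A minimal sketch of this construction with toy sizes, assuming unigram features, random stand-ins for the BERT encodings, and the standard scaled-attention normalization; in practice the one-hot matrix would be a sparse CSC matrix, never a dense array:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, V = 6, 8, 20            # toy sizes: N tokens, hidden dim d, unigram vocab V
                              # (the real vocab is ~30K, ~1B with bigrams)

H = rng.standard_normal((N, d))       # contextualized token encodings (BERT in the paper)
W_Q = rng.standard_normal((d, d))     # stand-in linear maps; the paper learns separate
W_K = rng.standard_normal((d, d))     # parameters for start and end embeddings
Q, K = H @ W_Q, H @ W_K

token_ids = rng.integers(0, V, size=N)
F_onehot = np.eye(V)[token_ids]       # one-hot unigram features (sparse CSC in practice)

# Rectified self-attention weights over neighboring tokens, then a weighted
# bag-of-ngrams: S[i] is the contextualized sparse embedding of token i.
A = np.maximum(Q @ K.T / np.sqrt(d), 0.0)   # ReLU(Q K^T / sqrt(d))
S = A @ F_onehot                            # (N, V), nonnegative, sparse over V
```

Each row of `S` lives in vocabulary space, so two phrases get high sparse similarity only when their contexts attend to matching n-grams.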

Training
As training phrase encoders on the whole Wikipedia is computationally prohibitive, we use training examples from an extractive question answering dataset (SQuAD) to train our encoders. We also use an improved negative sampling method which makes both dense and sparse representations more robust to noisy texts.
Kernel Function Given a pair of a question $q$ and a golden document $x$ (a paragraph in the case of SQuAD), we first compute the dense logit of each phrase $x_{i:j}$ as $l_{i,j} = \mathbf{h}_{i:j} \cdot \mathbf{h}'$, where $\mathbf{h}'$ is the dense question encoding. Each phrase's sparse embedding is also trained, so it needs to be considered in the loss function. We define the sparse logit of phrase $x_{i:j}$ as $l^{\text{sparse}}_{i,j} = \mathbf{s}^{\text{start}}_i \cdot \mathbf{s}'^{\text{start}}_{[\text{CLS}]} + \mathbf{s}^{\text{end}}_j \cdot \mathbf{s}'^{\text{end}}_{[\text{CLS}]}$. For brevity, we describe how we compute the first term $\mathbf{s}^{\text{start}}_i \cdot \mathbf{s}'^{\text{start}}_{[\text{CLS}]}$ corresponding to the start word (dropping the superscript 'start'); the second term can be computed in the same way.
The start-word sparse logit expands as
$$\mathbf{s}_i \cdot \mathbf{s}'_{[\text{CLS}]} = \left[\mathrm{ReLU}\left(\frac{QK^\top}{\sqrt{d}}\right)\mathbf{F}\,\mathbf{F}'^\top\,\mathrm{ReLU}\left(\frac{Q'K'^\top}{\sqrt{d}}\right)^{\!\top}\right]_{i,[\text{CLS}]},$$
where $Q', K' \in \mathbb{R}^{M \times d}$ and $\mathbf{F}' \in \mathbb{R}^{M \times F}$ denote the question-side query, key, and n-gram feature matrices, respectively. The dimension $F$ is prohibitively large, but we can compute the loss efficiently by precomputing $\mathbf{F}\mathbf{F}'^\top \in \mathbb{R}^{N \times M}$. Note that $\mathbf{F}\mathbf{F}'^\top$ can be viewed as applying a kernel function $\mathcal{K}(\mathbf{F}, \mathbf{F}') = \mathbf{F}\mathbf{F}'^\top$ whose $(i, j)$-th entry is 1 if and only if the n-gram at the $i$-th position of the context equals the $j$-th n-gram of the question, which can also be computed efficiently. One can think of this as the kernel trick (from the SVM literature (Cortes and Vapnik, 1995)), which lets us compute the loss without the explicit mapping.
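The kernel trick above can be sketched numerically: the toy example below computes the n-gram match matrix $\mathbf{F}\mathbf{F}'^\top$ directly from token ids, never materializing the one-hot feature matrices; random nonnegative matrices stand in for the rectified attention weights, and the question's first row stands in for the [CLS] position:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 5, 4                    # context length, question length

# Stand-ins for ReLU(QK^T / sqrt(d)) on the document and question sides.
A_doc = np.maximum(rng.standard_normal((N, N)), 0.0)
A_q = np.maximum(rng.standard_normal((M, M)), 0.0)

doc_ids = np.array([3, 7, 7, 2, 9])   # toy unigram ids in the document
q_ids = np.array([7, 1, 3, 7])        # toy unigram ids in the question

# Kernel matrix K(F, F')[i, j] = 1 iff the i-th document n-gram equals the
# j-th question n-gram -- computed directly from ids, no one-hot F, F' built.
FFt = (doc_ids[:, None] == q_ids[None, :]).astype(float)   # shape (N, M)

# Sparse logit s_i . s'_[CLS] for every document position i; row 0 of A_q
# stands in for the [CLS] position of the question.
sparse_logits = A_doc @ FFt @ A_q[0]
```

The same quantity computed with explicit one-hot matrices agrees exactly, which is what makes the precomputed kernel safe to substitute during training.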
The final loss to minimize is the negative log likelihood over the sum of the dense and sparse logits:
$$\mathcal{L} = -\log \frac{\exp(l_{i^*,j^*} + l^{\text{sparse}}_{i^*,j^*})}{\sum_{i,j} \exp(l_{i,j} + l^{\text{sparse}}_{i,j})},$$
where $i^*, j^*$ denote the true start and end positions of the answer phrase. Since we do not want to sacrifice the quality of the dense representations, which are also critical for the dense-first search explained in Section 4.1, we add a dense-only loss that omits the sparse logits (i.e., the original loss in Seo et al. (2019)) to the final loss; we find this yields higher-quality dense phrase representations.
Negative Sampling To learn robust phrase representations, we concatenate negative paragraphs to the original SQuAD paragraphs. To each paragraph $x$, we concatenate the paragraph $x_{\text{neg}}$ paired with the question whose dense representation $\mathbf{h}'_{\text{neg}}$ is most similar to that of the original question $\mathbf{h}'$, following Seo et al. (2019). We find that adding tf-idf matching scores to the word-level logits of the negative paragraphs further improves the quality of the sparse representations.
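This selection of a hard negative can be sketched as follows, assuming random stand-in question vectors; `hardest_negative` is a hypothetical helper name, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(3)
n_questions, d = 10, 16
q_vecs = rng.standard_normal((n_questions, d))  # dense question representations h'

def hardest_negative(q_idx: int) -> int:
    # Find the training question most similar to question q_idx (excluding
    # itself); the paragraph paired with it is concatenated as x_neg.
    sims = q_vecs @ q_vecs[q_idx]
    sims[q_idx] = -np.inf      # never pick the question's own paragraph
    return int(np.argmax(sims))

neg_idx = hardest_negative(0)
```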

Experimental Setup
Datasets SQuAD-Open is the open-domain version of SQuAD (Rajpurkar et al., 2016). We use 87,599 examples with the golden evidence paragraph to train our encoders, and 10,570 examples from the dev set to test our model, as suggested by Chen et al. (2017). CuratedTREC consists of question-answer pairs from TREC QA (Voorhees et al., 1999) curated by Baudiš and Šedivý (2015). We use its 694 test QA pairs for evaluation. We train only on SQuAD and test on both SQuAD-Open and CuratedTREC, relying on the zero-shot generalization ability of our model for CuratedTREC.

Implementation Details
We use and finetune BERT-Large for our encoders, with the BERT vocabulary of 30,522 unique tokens based on byte pair encoding. As a result, $F \approx 1$ billion when using both unigram and bigram features. We do not finetune the word embeddings during training. We pre-compute and store the encoded representations of all phrases in all Wikipedia documents (more than 5 million documents), which takes 600 GPU hours. We use the same storage reduction and search techniques as Seo et al. (2019). For search, we either perform dense search first and rerank with sparse scores (DFS), perform sparse search first and rerank with dense scores (SFS), or use a combination of both (Hybrid).
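A toy sketch of the dense-first search (DFS) strategy, with random stand-ins for the phrase index and the sparse match scores; the real system runs approximate MIPS over billions of phrase vectors rather than exact scoring:

```python
import numpy as np

rng = np.random.default_rng(4)
n_phrases, d = 1000, 16
dense_index = rng.standard_normal((n_phrases, d))  # precomputed phrase vectors
sparse_match = rng.random(n_phrases)               # stand-in sparse (Sparc/tf-idf) scores

def dense_first_search(q_dense, top_k=100):
    # DFS: shortlist top-k phrases by dense inner product, then rerank the
    # shortlist by adding the sparse match scores.
    dense_scores = dense_index @ q_dense
    cand = np.argpartition(-dense_scores, top_k)[:top_k]
    rerank = dense_scores[cand] + sparse_match[cand]
    return cand[np.argsort(-rerank)]

q = rng.standard_normal(d)
ranked = dense_first_search(q)
```

SFS swaps the two stages (sparse shortlist, dense rerank), and Hybrid merges the two candidate sets before reranking.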

Results
Open-Domain QA Experiments Table 1 shows experimental results on two open-domain question answering datasets, comparing our method with previous pipeline and end-to-end approaches. On both datasets, our model with contextualized sparse representations (DENSPI + SPARC) improves the phrase-indexing baseline (DENSPI) by more than 4%. Our method also runs significantly faster than models that must run heavy QA models at inference time. On CuratedTREC, which is constructed from real user queries, our model achieves state-of-the-art performance at the time of submission. Even though our model is trained only on SQuAD (i.e., zero-shot), it outperforms all other models, which are either distantly or semi-supervised, with at least 45x faster inference. On SQuAD-Open, our model outperforms BERT-based pipeline approaches such as BERTserini (Yang et al., 2019) while being more than two orders of magnitude faster. Multi-passage BERT, which uses a dedicated document retriever, outperforms all end-to-end models by a large margin on SQuAD-Open. While our main contribution is the improvement in the end-to-end setting, we also note that retrieving correct documents in SQuAD-Open is known to be often easily exploitable, so more open-domain-appropriate test datasets (such as CuratedTREC) should be used for a fairer comparison.

Table 2 shows the effect of contextualized sparse representations by comparing different variants of our method on SQuAD-Open, using subsets of the Wikipedia dump (1/100 and 1/10). Interestingly, adding trigram features to SPARC is worse than using uni-/bigram representations only, calling for stronger regularization of high-order n-gram features. See Appendix B for how SPARC performs under different search strategies.

Table 3 shows the performance of DENSPI + SPARC on the SQuAD v1.1 development set, where each sample provides a single paragraph containing the answer.
While BERT-Large, which jointly encodes the passage and the question, still outperforms our model, we have closed the gap to 6.1 F1 in the query-agnostic setting. Table 4 shows the outputs of three open-domain QA models: DrQA (Chen et al., 2017), DENSPI (Seo et al., 2019), and DENSPI + SPARC (ours). Our model retrieves various correct answers from different documents, and it more often answers questions with specific dates or numbers correctly than DENSPI does, showing the effectiveness of learned sparse representations.

Conclusion
In this paper, we demonstrate the effectiveness of contextualized sparse representations, SPARC, for encoding phrases with rich lexical information in open-domain question answering. We efficiently train our sparse representations by kernelizing the sparse inner product space. Experimental results show that our fast open-domain QA model, which augments DENSPI with SPARC, outperforms previous open-domain QA models, including recent BERT-based pipeline models, with inference up to two orders of magnitude faster.

A Inference Speed Benchmark of Open-Domain QA Models

We note that our reported inference speed for DenSPI (Seo et al., 2019) is slightly faster than that reported in the original paper. This is mostly because we use a PCIe-based SSD (NVMe) instead of a SATA-based one. We also expect that the speed-up can be greater with Intel Optane, which has faster random access times.

B Model Performances in Different Search Strategies

In Table 5, we show that SPARC consistently improves over DENSPI across different search strategies. Note that on CuratedTREC, where the questions more closely resemble real user queries, DFS outperforms SFS, showing the effectiveness of dense search when it is unclear which documents to read.