Dense Passage Retrieval for Open-Domain Question Answering

Open-domain question answering relies on efficient passage retrieval to select candidate contexts, where traditional sparse vector space models, such as TF-IDF or BM25, are the de facto method. In this work, we show that retrieval can be practically implemented using dense representations alone, where embeddings are learned from a small number of questions and passages by a simple dual-encoder framework. When evaluated on a wide range of open-domain QA datasets, our dense retriever outperforms a strong Lucene-BM25 system by 9%-19% absolute in terms of top-20 passage retrieval accuracy, and helps our end-to-end QA system establish new state-of-the-art results on multiple open-domain QA benchmarks.


Introduction
Open-domain textual question answering is a task that requires a system to answer factoid questions using a large collection of documents as the information source, without pre-specifying topics or domains. This task not only has clear value in information-seeking applications such as handling informational queries in natural language (Voorhees et al., 1999), but also poses a fundamental technical challenge in "machine reading at scale" (Chen et al., 2017).
Early QA systems are often complicated and consist of multiple components, covering sub-tasks like question analysis, document retrieval, passage retrieval, answer extraction and verification (Ferrucci (2012); Moldovan et al. (2003), inter alia). In contrast, the rise of deep learning and specifically the advances in reading comprehension (Rajpurkar et al., 2016) suggest a much simplified two-stage framework: (1) a context retriever first selects a small subset of passages, some of which contain the answer to the question, and then (2) a machine reader thoroughly examines the retrieved contexts and identifies the correct answer. While reducing open-domain QA to machine reading is a very reasonable strategy, a huge performance degradation is often observed in practice (Chen et al., 2017; Yang et al., 2019), which suggests that the retriever needs to be enhanced and the reader also needs to be adapted for multi-context input.
Mainstream approaches for building the retriever for open-domain QA use traditional information retrieval methods, including TF-IDF unigram/bigram matching (Chen et al., 2017) or BM25 (Robertson and Zaragoza, 2009) term weighting supported by tools like Lucene and Elasticsearch (Wang et al., 2019; Yang et al., 2019; Nie et al., 2019). TF-IDF or BM25 can be viewed as representing the question and text as high-dimensional, sparse vectors (with weighting). These sparse vectors can be searched efficiently with an inverted index, and are effective in factoid question answering, for which the retriever usually needs to narrow down the search space dramatically based on keywords. For instance, to find relevant passages for a question "How many provinces did the Ottoman empire contain in the 17th century?" or another question "What part of the atom did Chadwick discover?", it is clear that looking up texts which contain the keyword Ottoman or Chadwick is crucial.

Conversely, the dense, latent semantic encoding of contexts and questions is complementary to the sparse vector representation by design. For example, synonyms or paraphrases that consist of completely different tokens may still be mapped to vectors close to each other. Consider the question "Who is the bad guy in lord of the rings?", which can be answered from the context "Sala Baker is an actor and stuntman from New Zealand. He is best known for portraying the villain Sauron in the Lord of the Rings trilogy ...". A term-based system would have difficulty retrieving such a context, while a dense retrieval system would be able to better match "bad guy" with "villain" and fetch the correct context. Dense encodings are also learnable. For instance, with pairs of questions and contexts that contain the answers, the distance between the question and context in the latent semantic space can effectively be optimized by learning the embedding function. This provides additional flexibility to have a task-specific representation. With special in-memory data structures and indexing schemes, search in the dense vector space can also be done efficiently, thanks to tools like FAISS (Johnson et al., 2017).
In this paper, we focus on improving open-domain question answering by replacing traditional information retrieval (IR) methods with dense representations for retrieval. Specifically, we learn an effective dense vector representation of questions and passages using a simple dual-encoder framework (Bromley et al., 1994), such that similarity in the corresponding latent semantic space yields a better retrieval function, compared to traditional sparse vector representations. Recent work made significant efforts in this direction by either treating the retrieval module as a latent variable (Lee et al., 2019; Guu et al., 2020) or applying similarity search directly at the phrase level (Seo et al., 2019). However, the former approach needs to adopt extra pre-training tasks to make the system perform well, while the latter approach demonstrates the importance of sparse vectors, as removing them entirely made the system much worse. Instead, we demonstrate the feasibility of training a dense retrieval model by proper fine-tuning on a small number of question-passage pairs from existing datasets.
Our contributions are twofold. First, we demonstrate that with the proper training setup, simply fine-tuning the question and passage encoders on existing question-passage pairs is sufficient to greatly outperform a strong Lucene-BM25 system in terms of top-k retrieval accuracy. The dense representation is complementary to the sparse vector representation, and combining them can further improve the performance. Second, we verify that, in the context of open-domain question answering, a higher retrieval accuracy indeed translates to a higher end-to-end QA accuracy. As a result, we achieve new state-of-the-art results on multiple QA datasets in the open-retrieval setting.

Background
The problem of open-domain textual question answering studied in this paper can be described as follows. Given a factoid question, such as "Who first voiced Meg on Family Guy?" or "Where was the 8th Dalai Lama born?", a system is required to answer it using a large corpus covering diversified topics. More specifically, we assume the extractive QA setting, in which the answer is restricted to a span appearing in one or more documents in the corpus. Assume the corpus consists of $D$ documents, $d_1, d_2, \ldots, d_D$. We first split each document into text passages of equal length as the basic retrieval units, obtaining $M$ passages in total, $p_1, p_2, \ldots, p_M$, where each passage $p_i$ can be viewed as a sequence of tokens $w_1^{(i)}, w_2^{(i)}, \ldots, w_{|p_i|}^{(i)}$. Given a question $q$, the task is to find a span $w_s^{(i)}, w_{s+1}^{(i)}, \ldots, w_e^{(i)}$ from one of the passages $p_i$ that can answer the question. Notice that to cover a wide variety of domains, the corpus size can easily range from millions of documents (e.g., Wikipedia) to billions (e.g., the Web).
This form of question answering is also known as machine reading at scale, coined by Chen et al. (2017), due to the popular format of extractive reading comprehension. However, by scaling up from one document to a large collection of documents, open-domain question answering faces an additional technical challenge: relevant context retrieval. Although impressive progress has been made on reading comprehension, thanks mainly to the advance of deep neural networks, naively applying an existing reader model to all documents in the corpus is not practical, due to the expensive inference-time computation. As a result, any open-domain QA system needs to include an efficient retriever component that can select a small set of relevant texts, before applying the reader to extract the answer. Formally speaking, a retriever $R : (q, \mathcal{C}) \rightarrow \mathcal{C}_F$ is a function that takes as input a question $q$ and a corpus $\mathcal{C}$ and returns a much smaller filtered set of texts $\mathcal{C}_F \subset \mathcal{C}$, where $|\mathcal{C}_F| = k \ll |\mathcal{C}|$. For a fixed $k$, a retriever can be evaluated in isolation on top-k retrieval accuracy, which is the fraction of questions for which $\mathcal{C}_F$ contains a span that answers the question.
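To make the top-k accuracy metric concrete, the following is a minimal sketch (not the paper's released evaluation code) that counts a question as answered if any of its top-k retrieved passages contains one of the reference answer strings; the plain substring match is a simplifying assumption, as the actual evaluation normalizes answers before matching.

```python
from typing import List


def has_answer(passage: str, answers: List[str]) -> bool:
    """Return True if any reference answer string appears in the passage.

    A more careful normalized-token match can be used in practice; plain
    substring matching is a simplification for illustration.
    """
    text = passage.lower()
    return any(ans.lower() in text for ans in answers)


def top_k_accuracy(retrieved: List[List[str]], answers: List[List[str]], k: int) -> float:
    """Fraction of questions whose top-k retrieved passages contain an answer.

    retrieved[i] is the ranked list of passages returned for question i;
    answers[i] is the list of acceptable answer strings for question i.
    """
    hits = 0
    for passages, refs in zip(retrieved, answers):
        if any(has_answer(p, refs) for p in passages[:k]):
            hits += 1
    return hits / len(retrieved)
```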

Dense Passage Retriever (DPR)
We focus our research in this work on improving the retrieval component in open-domain question answering. Given a collection of $M$ text passages, the goal of our dense passage retriever (DPR) is to index all the passages in a low-dimensional and continuous space, such that it can efficiently retrieve the top-k passages relevant to the input question for the reader at run-time. Note that $M$ can be very large (the corpus used in our experiments contains 21 million passages; see Section 4.1) and $k$ is usually small, such as 20 to 100. Below we give the design details of DPR.

Overview
Our dense passage retriever (DPR) uses a dense encoder $E_P(\cdot)$ which maps any text passage to a $d$-dimensional real-valued vector and builds an index for all the $M$ passages that we will use for retrieval. At run-time, DPR applies a different encoder $E_Q(\cdot)$ that maps the input question to a $d$-dimensional vector, and retrieves the $k$ passages whose vectors are closest to the question vector. We define the similarity between the question and the passage using the dot product of their vectors:

$$\mathrm{sim}(q, p) = E_Q(q)^\top E_P(p). \quad (1)$$

Although more expressive models for measuring the similarity between a question and a passage do exist, such as networks consisting of multiple layers of cross attention, the similarity function in Eq. (1) is required to be decomposable so that the representations of the collection of passages can be pre-computed.
Most decomposable similarity functions are some transformations of Euclidean distance (L2). For instance, cosine similarity is equivalent to inner product for unit vectors and the Mahalanobis distance is equivalent to L2 distance in a transformed space. Inner product search has been widely used and studied, as well as its connection to cosine similarity and L2 distance (Mussmann and Ermon, 2016; Ram and Gray, 2012). In our preliminary experiments, we also tested cosine similarity but found no significant improvement. We thus choose the simpler inner product function and improve the dense passage retriever by learning better encoders.
Encoders Although in principle the question and passage encoders can be implemented by any neural network, in this work we use two independent BERT (Devlin et al., 2019) networks (base, uncased) and take the representation at the [CLS] token as the output, so d = 768.
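As an illustration, the two encoders described above can be instantiated with the Hugging Face transformers library roughly as follows. This is a minimal sketch assuming BERT-base (uncased) and [CLS] pooling as stated in the text; it is not the authors' released implementation, and the example strings are placeholders.

```python
import torch
from transformers import BertModel, BertTokenizerFast

# Two independent BERT-base (uncased) encoders for questions and passages.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
question_encoder = BertModel.from_pretrained("bert-base-uncased")
passage_encoder = BertModel.from_pretrained("bert-base-uncased")


@torch.no_grad()
def encode(texts, encoder, max_length=256):
    """Map a list of texts to their d=768-dimensional [CLS] vectors."""
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=max_length, return_tensors="pt")
    outputs = encoder(**batch)
    return outputs.last_hidden_state[:, 0]   # [CLS] token representation


q_vec = encode(["who is the bad guy in lord of the rings?"], question_encoder)
p_vec = encode(["Sala Baker is best known for portraying the villain Sauron ..."],
               passage_encoder)
score = q_vec @ p_vec.T   # dot-product similarity, as in Eq. (1)
```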
Inference During inference time, we apply the passage encoder E P to all the passages and index them using FAISS (Johnson et al., 2017) offline.
FAISS is an extremely efficient, open-source library for similarity search and clustering of dense vectors, which can easily be applied to billions of vectors. Given a question $q$ at run-time, we derive its embedding $v_q = E_Q(q)$ and retrieve the top $k$ passages with embeddings closest to $v_q$.
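A minimal sketch of this offline indexing and run-time search with FAISS might look as follows; an exact inner-product index (IndexFlatIP) and random vectors are used purely for illustration.

```python
import numpy as np
import faiss

d = 768                                                       # embedding dimension (BERT [CLS])
passage_vectors = np.random.rand(1000, d).astype("float32")   # stand-in for E_P(p) over all passages

# Offline: build an exact inner-product index over all passage vectors.
index = faiss.IndexFlatIP(d)
index.add(passage_vectors)

# Run-time: embed the question and retrieve the top-k closest passages.
question_vector = np.random.rand(1, d).astype("float32")      # stand-in for E_Q(q)
k = 20
scores, passage_ids = index.search(question_vector, k)
print(passage_ids[0][:5], scores[0][:5])
```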

Training
Training the encoders so that the dot-product similarity (Eq. (1)) becomes a good ranking function for retrieval is essentially a metric learning problem (Kulis, 2013). The goal is to create a vector space such that relevant pairs of questions and passages will have smaller distance (i.e., higher similarity) than the irrelevant pairs, by learning a better embedding.
Let $\mathcal{D} = \{\langle q_i, p_i^+, p_{i,1}^-, \ldots, p_{i,n}^-\rangle\}_{i=1}^{m}$ be the training data that consists of $m$ instances. Each instance contains one question $q_i$ and one relevant (positive) passage $p_i^+$, along with $n$ irrelevant (negative) passages $p_{i,j}^-$. The loss function is the negative log likelihood of the correct passage:

$$L(q_i, p_i^+, p_{i,1}^-, \ldots, p_{i,n}^-) = -\log \frac{e^{\mathrm{sim}(q_i, p_i^+)}}{e^{\mathrm{sim}(q_i, p_i^+)} + \sum_{j=1}^{n} e^{\mathrm{sim}(q_i, p_{i,j}^-)}}. \quad (2)$$

Positive and negative passages For retrieval problems, it is often the case that positive examples are available explicitly, while negative examples need to be selected from an extremely large pool.
For instance, relevant passages to a question may be given in a QA dataset, or can be found using the answer. All other passages in the collection, while not specified explicitly, can be viewed as irrelevant by default. In practice, how to select negative examples is often overlooked but could be decisive for learning a high-quality encoder. We generally consider three different types of negatives: (1) Random: any random passage from the corpus; (2) BM25: top passages returned by BM25 which don't contain the answer but match question tokens heavily; (3) Gold: positive passages paired with other questions which appear in the training set. We will discuss the impact of different types of negative passages and training schemes in Section 5.4. Our best model uses gold passages from the same mini-batch and one BM25 negative passage. In particular, re-using gold passages from the same batch as negatives can make the computation efficient while achieving great performance. We will discuss this approach below.
In-batch negatives Assume that we have $B$ questions in a mini-batch and each one is associated with a relevant passage. Let $Q$ and $P$ be the $(B \times d)$ matrices of question and passage embeddings in a batch of size $B$. $S = QP^\top$ is a $(B \times B)$ matrix of similarity scores, where each row corresponds to a question, paired with all $B$ passages.
In this way, we reuse computation and effectively train on $B^2$ question/passage pairs $(q_i, p_j)$ in each batch. Any $(q_i, p_j)$ pair is a positive example when $i = j$, and negative otherwise. This creates $B$ training instances in each batch, with $B - 1$ negative passages for each question. The trick of in-batch negatives has been used in the full-batch setting (Yih et al., 2011) and more recently for mini-batches (Gillick et al., 2019; Henderson et al., 2017). It has been shown to be an effective strategy for learning a Siamese neural network (a.k.a. dual-encoder) model that boosts the number of training examples. As will be seen in Section 5, it is important in creating a high-quality passage retriever that helps advance the state of the art of open-domain question answering.
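In code, the in-batch negative objective amounts to a cross-entropy loss over the rows of the $B \times B$ score matrix, with the diagonal entries as the positive pairs. Below is a minimal PyTorch sketch, assuming the question and passage embeddings Q and P have already been produced by the two encoders (random tensors are used as stand-ins).

```python
import torch
import torch.nn.functional as F

B, d = 128, 768
Q = torch.randn(B, d, requires_grad=True)   # question embeddings E_Q(q_i)
P = torch.randn(B, d, requires_grad=True)   # positive passage embeddings E_P(p_i+)

S = Q @ P.T                                 # (B x B) similarity matrix
targets = torch.arange(B)                   # the i-th passage is the positive for the i-th question
loss = F.cross_entropy(S, targets)          # negative log likelihood of the correct passage
loss.backward()
```

Additional hard negatives (e.g., the BM25 negatives mentioned above) can be handled by appending their embeddings as extra columns of S while keeping the same targets.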

Experimental Setup
In this section, we describe the data we used for experiments and the basic setup.

Wikipedia Data Pre-processing
Following Lee et al. (2019), we use the English Wikipedia dump from Dec. 20, 2018 as the source documents for answering questions. We first apply the pre-processing code released in DrQA (Chen et al., 2017) to extract the clean, text portion of articles from the Wikipedia dump. This step removes semi-structured data, such as tables, infoboxes and lists, as well as the disambiguation pages. We then split each article into multiple, disjoint text blocks of 100 words as passages, serving as our basic retrieval units, following Wang et al. (2019), which results in 21,015,324 passages in the end. Each passage is also prepended with the title of the Wikipedia article where the passage is from, along with an [SEP] token.
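The splitting and title-prepending step can be sketched as follows; whitespace tokenization and the exact formatting are assumptions for illustration rather than the released pre-processing code.

```python
def split_into_passages(title: str, article_text: str, words_per_passage: int = 100):
    """Split an article into disjoint 100-word blocks, each prefixed with its title."""
    words = article_text.split()
    passages = []
    for start in range(0, len(words), words_per_passage):
        block = " ".join(words[start:start + words_per_passage])
        # Prepend the article title, separated by a [SEP] token, as described above.
        passages.append(f"{title} [SEP] {block}")
    return passages
```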

Question Answering Datasets
We use the same five QA datasets and training/dev/testing splitting method as in previous work (Lee et al., 2019). Below we briefly describe each dataset and refer readers to their paper for the details of data preparation.

Natural Questions (NQ) (Kwiatkowski et al., 2019) was designed for end-to-end question answering. The questions were mined from real Google search queries and the answers were spans in Wikipedia articles identified by annotators.

TriviaQA (Trivia) (Joshi et al., 2017) contains a set of trivia questions with answers that were originally scraped from the Web.

WebQuestions (WQ) (Berant et al., 2013) consists of questions selected using the Google Suggest API, where the answers are entities in Freebase.

CuratedTREC (TREC) (Baudiš and Šedivý, 2015) sources questions from TREC QA tracks as well as various Web sources and is intended for open-domain QA from unstructured corpora.

SQuAD v1.1 (Rajpurkar et al., 2016) is a popular benchmark dataset for reading comprehension. Annotators were presented with a Wikipedia paragraph and asked to write questions that can be answered from the given text. Although SQuAD has been used previously for open-domain QA research, it is not ideal because many questions lack context in the absence of the provided paragraph. We still include it in our experiments to provide a fair comparison to previous work, and we discuss this more in Section 5.1.
Selecting positive passages Because only pairs of questions and answers are provided in TREC, WebQuestions and TriviaQA, we use the highest-ranked passage from BM25 that contains the answer as the positive passage. If none of the top 100 retrieved passages has the answer, the question is discarded. For SQuAD and Natural Questions, since the original passages have been split and processed differently than our pool of candidate passages, we match and replace each gold passage with the corresponding passage in the candidate pool. We discard questions for which the matching fails due to different Wikipedia versions or pre-processing. Table 1 shows the number of questions in the training/dev/test sets for all datasets, as well as the actual number of questions used for training the retriever.
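A sketch of this positive-passage selection heuristic for the datasets that provide only question-answer pairs is shown below; `bm25_search` is a hypothetical stand-in for a BM25/Lucene retriever, and the substring answer match is a simplification.

```python
def select_positive_passage(question, answers, bm25_search, top_n=100):
    """Pick the highest-ranked BM25 passage containing an answer, or None.

    `bm25_search(question, top_n)` is a hypothetical helper returning passages
    ranked by BM25 score; questions with no match are discarded (return None).
    """
    for passage in bm25_search(question, top_n):
        if any(ans.lower() in passage.lower() for ans in answers):
            return passage
    return None
```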

Experiments: Passage Retrieval
In this section, we evaluate the retrieval performance of our Dense Passage Retriever (DPR), along with an analysis of how its output differs from that of traditional retrieval methods, the effects of different training schemes, and run-time efficiency.
The DPR model used in our main experiments is trained using the in-batch negative setting (Section 3.2) with a batch size of 128 and one additional BM25 negative passage per question. We trained the question and passage encoders for up to 30 epochs for large datasets (NQ, Trivia, SQuAD) and 100 epochs for small datasets (TREC, WQ), with a learning rate of $10^{-5}$ using Adam, linear scheduling with warm-up, and a dropout rate of 0.1.
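This optimization setup corresponds roughly to the sketch below, using PyTorch's Adam and the linear warm-up scheduler from the transformers library; the stand-in model and the step counts are placeholders rather than values from the paper.

```python
import torch
from transformers import get_linear_schedule_with_warmup

# Stand-in module; in practice these are the parameters of both BERT encoders.
dual_encoder = torch.nn.Linear(768, 768)
optimizer = torch.optim.Adam(dual_encoder.parameters(), lr=1e-5)

num_training_steps = 10_000   # placeholder: depends on dataset size, batch size and epochs
num_warmup_steps = 100        # placeholder: the warm-up length is an assumption
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps)

# Typical loop order after loss.backward():
#   optimizer.step(); scheduler.step(); optimizer.zero_grad()
```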
While it is good to have the flexibility to adapt the retriever to each dataset, it would also be desirable to obtain a single retriever that works well across the board. To this end, we train a multi-dataset encoder by combining training data from all datasets excluding SQuAD.

In addition to DPR, we also present the results of BM25, the traditional retrieval method, and of BM25+DPR, using a linear combination of their scores as the new ranking function. Specifically, we re-rank the top-2000 BM25 retrieved passages using BM25(q,p) + λ · sim(q, p) as the ranking function, as sketched in the code below. We used λ = 1.1 based on the development set retrieval accuracy. Results can be improved further in some cases by combining DPR with BM25 in both single- and multi-dataset settings.

We conjecture that the lower performance on SQuAD is due to two reasons. First, the annotators wrote questions after seeing the passage. As a result, there is a high lexical overlap between passages and questions, which gives BM25 a clear advantage. Second, the data was collected from only 500+ Wikipedia articles, and thus the distribution of training examples is extremely biased, as argued previously by Lee et al. (2019).
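A minimal sketch of the hybrid re-ranking described above follows; `bm25_score` and `dpr_sim` are hypothetical stand-ins for the BM25 scoring function and the dot-product similarity of Eq. (1).

```python
def hybrid_rerank(question, bm25_candidates, bm25_score, dpr_sim, lam=1.1, k=100):
    """Re-rank BM25 candidate passages using BM25(q,p) + lambda * sim(q,p)."""
    scored = [(p, bm25_score(question, p) + lam * dpr_sim(question, p))
              for p in bm25_candidates]
    scored.sort(key=lambda x: x[1], reverse=True)
    return [p for p, _ in scored[:k]]
```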

Sample Efficiency
We also explore how many training examples are needed to achieve good passage retrieval performance. Figure 1 illustrates the top-k retrieval accuracy with respect to different numbers of training examples, measured on the development set of Natural Questions. As is shown, a dense passage retriever trained using only 1,000 examples already outperforms BM25, indicating that using a dense retriever to replace a term-matching retrieval system is feasible. Adding more training examples (from 1k to 59k) further improves the retrieval accuracy consistently.

Qualitative Analysis
While DPR performs better than BM25 in general, the retrieved passages of these two retrievers actually differ qualitatively. Methods like BM25 are sensitive to highly selective keywords and phrases, but cannot capture lexical variations or semantic relationships well. In contrast, DPR excels at semantic representation, but might lack sufficient capacity to represent salient phrases which appear rarely. Table 3 illustrates this phenomenon with two examples. In the first example, the top scoring passage from BM25 is irrelevant, even though keywords such as England and Ireland appear multiple times. In comparison, DPR is able to return the correct answer, presumably by matching "body of water" with semantic neighbors such as sea and channel, even though no lexical overlap exists. The second example is one where BM25 does better.
The salient phrase "Thoros of Myr" is critical, and DPR is unable to capture it.

Comparisons of Training Schemes
We experiment with different schemes of training DPR. Results on the development set of Natural Questions are summarized in Table 4. We first test the standard 1-of-N training setting (top block), where each question in the batch is paired with a positive passage and its own set of n negative passages (Eq. (2)). We found that using random, BM25 or gold passages (positive passages from other questions) as negatives performs similarly in this setting.
Next, we introduce in-batch negative training (Section 3.2). We find that, with a similar configuration (7 gold negative passages), in-batch negative training improves the results substantially. The key difference between the two is whether the gold negative passages come from the same batch or from the whole training set. Moreover, because in-batch negative training reuses the other positive passages from the same batch as negatives, it is more memory-efficient and allows us to test bigger batch sizes. As a result, accuracy consistently increases as the batch size grows (middle block). Finally, we explore in-batch negative training with additional BM25 negative passages. We find that adding a single BM25 negative passage improves the result significantly, while adding two does not help further.

Run-time Efficiency
The main reason that we require a retrieval component for open-domain QA is to reduce the number of candidate passages that the reader needs to consider, which is crucial for answering users' questions in real time. We profiled the passage retrieval speed on a server with an Intel Xeon CPU E5-2698 v4 @ 2.20GHz and 512GB of memory. With the help of the FAISS in-memory index for real-valued vectors, DPR can be made incredibly efficient, processing 995.0 questions per second while returning the top 100 passages per question. In contrast, BM25/Lucene (implemented in Java, using a file index) processes 23.7 questions per second per CPU thread.
On the other hand, the time required for building an index of dense vectors is much longer. Computing dense embeddings for 21 million passages is resource intensive, but can be easily parallelized, taking roughly 8.8 hours on 8 GPUs. However, building the FAISS index over the 21 million vectors on a single server takes 8.5 hours. In comparison, building an inverted index using Lucene is much cheaper and takes only about 30 minutes in total.
Experiments: Question Answering

In this section, we experiment with how different passage retrievers affect the final QA accuracy.

End-to-end QA system
We implement an end-to-end question answering system in which we can plug in our different retriever systems directly. Besides the retriever, our QA system consists of a neural reader that outputs the answer to the question. Given the top k retrieved passages (up to 100 in our experiments), the reader assigns a passage selection score to each passage. In addition, it extracts an answer span from each passage and assigns a span score. The best span from the passage with the highest passage selection score is chosen as the final answer. This approach has been shown to work well previously (Wang et al., 2018, 2019; Lin et al., 2018).
Specifically, let $\mathbf{P}_i \in \mathbb{R}^{L \times h}$ ($1 \le i \le k$) be a BERT (base, uncased in our experiments) representation of the $i$-th passage, where $L$ is the maximum length of the passage and $h$ the hidden dimension. The probabilities of a token being the starting/ending position of an answer span and of a passage being selected are defined as:

$$P_{\mathrm{start},i}(s) = \mathrm{softmax}\big(\mathbf{P}_i \mathbf{w}_{\mathrm{start}}\big)_s,$$
$$P_{\mathrm{end},i}(t) = \mathrm{softmax}\big(\mathbf{P}_i \mathbf{w}_{\mathrm{end}}\big)_t,$$
$$P_{\mathrm{selected}}(i) = \mathrm{softmax}\big(\hat{\mathbf{P}}^\top \mathbf{w}_{\mathrm{selected}}\big)_i,$$

where $\hat{\mathbf{P}} = [\mathbf{P}_1^{[\mathrm{CLS}]}, \ldots, \mathbf{P}_k^{[\mathrm{CLS}]}] \in \mathbb{R}^{h \times k}$ and $\mathbf{w}_{\mathrm{start}}, \mathbf{w}_{\mathrm{end}}, \mathbf{w}_{\mathrm{selected}} \in \mathbb{R}^{h}$ are learnable vectors. We compute a span score of the $s$-th to $t$-th words from the $i$-th passage as $P_{\mathrm{start},i}(s) \times P_{\mathrm{end},i}(t)$, and a passage selection score of the $i$-th passage as $P_{\mathrm{selected}}(i)$.
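A minimal PyTorch sketch of this span and passage scoring follows, with random tensors standing in for the BERT representations and the learnable vectors.

```python
import torch
import torch.nn.functional as F

k, L, h = 100, 256, 768                   # passages, max passage length, hidden size
P_tokens = torch.randn(k, L, h)           # stand-in BERT token representations for k passages
P_cls = P_tokens[:, 0, :]                 # [CLS] vectors, one per passage

w_start = torch.randn(h)                  # learnable vectors (randomly initialized here)
w_end = torch.randn(h)
w_selected = torch.randn(h)

P_start = F.softmax(P_tokens @ w_start, dim=1)      # (k, L): start-position probabilities
P_end = F.softmax(P_tokens @ w_end, dim=1)          # (k, L): end-position probabilities
P_selected = F.softmax(P_cls @ w_selected, dim=0)   # (k,): passage selection probabilities

# Span score for tokens s..t in passage i, and that passage's selection score:
i, s, t = 0, 10, 12
span_score = P_start[i, s] * P_end[i, t]
passage_score = P_selected[i]
```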
For training, we sample one positive and m − 1 negative passages for each question at each iteration, where m is a hyperparameter (we use m = 24). The model training objective is the maximum marginal likelihood over all correct answer spans, as a passage could contain multiple correct answer spans. We use batch sizes of 16 and 4 for large (NQ, Trivia, SQuAD) and small (TREC, WQ) datasets, respectively, and tune k on the development set. For experiments on small datasets under the Multi setting, in which using other datasets is allowed, we fine-tune the reader trained on Natural Questions on the target dataset. All experiments were done on eight 32GB GPUs.

Results
Table 5 summarizes our final end-to-end QA results, measured by exact match with the reference answer after minor normalization as in (Chen et al., 2017; Lee et al., 2019). From the table, we can see that higher retriever accuracy typically leads to better final QA results: in all cases except SQuAD, answers extracted from the passages retrieved by DPR are more likely to be correct, compared to those from BM25. For large datasets like NQ and TriviaQA, models trained using multiple datasets (Multi) perform comparably to those trained using the individual training set (Single). Conversely, on smaller datasets like WQ and TREC, the multi-dataset setting has a clear advantage. Overall, our DPR-based models outperform the previous state-of-the-art results on four out of the five datasets, with 1% to 12% absolute differences in exact match accuracy. It is interesting to contrast our results to those of ORQA (Lee et al., 2019) and also the concurrently developed approach, REALM (Guu et al., 2020). While both methods include additional pre-training tasks and employ an expensive end-to-end training regime, DPR manages to outperform them on both NQ and TriviaQA, simply by focusing on learning a strong passage retrieval model using pairs of questions and answers. The additional pre-training tasks are likely more useful only when the target training sets are small. Although the results of DPR on WQ and TREC in the single-dataset setting are less competitive, adding more question-answer pairs helps boost the performance, achieving the new state of the art, as observed in the multi-dataset setting.
To compare our pipeline training approach with joint learning, we run an ablation on Natural Questions where the retriever and reader are jointly trained. Following Lee et al. (2019), the training procedure removes the passage scorer in the reader and does not back-propagate to the passage encoder ($E_P$ in Eq. (1)). This approach obtains a score of 36.8 EM, which suggests that our strategy of training a strong retriever and reader in isolation can effectively leverage the available supervision, while outperforming a comparable joint training approach with a simpler design.

Related Work
Open-domain question answering, which requires a system to answer any factoid question using a large collection of documents, without pre-specifying a particular domain, has been popularized by the TREC QA tracks starting in the late '90s (Voorhees et al., 1999). Most QA systems in this era implemented a sophisticated pipeline approach or leveraged Web redundancy to extract frequently occurring phrases as answers (Brill et al., 2002; Kwok et al., 2001). For open-domain QA, retrieval has always been an important component. Not only does it effectively reduce the search space for answer extraction, but it also identifies the supporting context for the user to verify the answer. In practice, sparse vector space models like TF-IDF or BM25 are a very strong baseline for retrieval and have been viewed as the standard method applied broadly to various QA tasks (e.g., Chen et al., 2017; Nie et al., 2019; Min et al., 2019a; Wolfson et al., 2020). Augmenting text-based retrieval with external structured information, such as knowledge graphs and Wikipedia hyperlinks, has also been explored recently (Min et al., 2019b; Asai et al., 2020).
While the sparse vector space model has been the cornerstone of information retrieval, using continuous, dense vector representations of text has been proposed dating back to Latent Semantic Analysis (LSA) (Deerwester et al., 1990). Such approaches complement the sparse vector methods as they can potentially give high similarity scores to text pairs that are semantically relevant, even without exact token matching. The dense representation alone, however, is typically inferior to the sparse representation. In contrast to unsupervised dimensionality reduction methods (e.g., LSA), leveraging labeled data to train dense encoders for embedding both documents and queries has become popular recently (Yih et al., 2011; Huang et al., 2013; Gillick et al., 2019; Humeau et al., 2019). To ensure that the embeddings can be pre-computed and indexed offline, the basic architecture follows the Siamese neural network design (Bromley et al., 1994; Chopra et al., 2005). The interaction between document and query occurs only at the final, top level, through standard distance functions such as inner product or cosine similarity. When the training method can use the labeled pairs to effectively generate a large number of examples, such as the in-batch random negative pairs (Yih et al., 2011; Henderson et al., 2017), or when there exists a large collection of positive and negative pairs (Huang et al., 2013), dense representations can outperform sparse representations. Enabled by modern neural network toolkits, researchers have started to explore end-to-end training schemes that include pre-training, retrieval and answer extraction jointly (Lee et al., 2019; Guu et al., 2020). While the initial results are positive, the overall framework inevitably becomes more complex. In contrast, our model provides a simple and yet effective solution that shows strong empirical performance.

Conclusion
In this work, we demonstrated that dense embedding-based retrieval can outperform and potentially replace the traditional sparse retrieval component in open-domain question answering. We showed that while a simple approach can be made to work surprisingly well, there are some critical ingredients to training a dense retriever successfully. Dense and sparse retrieval are found to be largely complementary and can be combined for best accuracy. As a result of improved retrieval performance, we obtained new state-of-the-art results on multiple open-domain question answering benchmarks.

Figure 1 :
Figure 1: Retriever top-k accuracy with different numbers of training examples used in our dense passage retriever vs. BM25. The results are measured on the development set of Natural Questions. Our DPR trained using 1,000 examples already outperforms BM25.

Table 1 :
Number of questions in each QA dataset. The second column of Train shows the number of questions after pruning. See text for details.

Table 2 compares different passage retrieval systems on five QA datasets, using the top-k accuracy (k is 20 or 100). With the exception of SQuAD, DPR performs consistently better than BM25 on all datasets. The gap is especially large when k is small (e.g., 78.4% vs. 59.1% for top-20 accuracy on Natural Questions). When training with multiple datasets, TREC, the smallest dataset of the five, benefits greatly from more training examples. In contrast, Natural Questions and WebQuestions improve modestly and TriviaQA degrades slightly.

Table 2 :
Top-20 and top-100 retrieval accuracy on test sets, measured as the percentage of top 20/100 retrieved passages that contain the answer. Single and Multi denote that our Dense Passage Retriever (DPR) was trained using individual or combined training datasets (all the datasets excluding SQuAD). See text for more details.

Table 3 :
Examples of passages returned from BM25 and DPR. Correct answers are written in blue and the content words in the question are written in bold.
Example 1 (BM25): "England is not recognised as a region by the UCI, and there is no English cycling team outside the Commonwealth Games. For those occasions, British Cycling selects and supports the England team. Cycling is represented on the Isle of Man by the Isle of Man Cycling Association. Cycling in Northern Ireland is organised under Cycling Ulster, part of the all-Ireland governing body Cycling Ireland. Until 2006, a rival governing body existed, . . ."
Example 1 (DPR): ". . . Annual traffic between Great Britain and Ireland amounts to over 12 million passengers and of traded goods. The Irish Sea is connected to the North Atlantic at both its northern and southern ends. To the north, the connection is through the North Channel between Scotland and Northern Ireland and the Malin Sea. The southern end is linked to the Atlantic through the St George's Channel between Ireland and Pembrokeshire, and the Celtic Sea. . . ."
Example 2 (BM25): "He may be "no one," but there's still enough of a person left in him to respect, and admire who this girl is and what she's become. Arya finally tells us something that we've kind of known all along, that she's not no one, she's Arya Stark of Winterfell." "No One" saw the reintroduction of Richard Dormer and Paul Kaye, who portrayed Beric Dondarrion and Thoros of Myr, respectively, in the third season, . . ."
Example 2 (DPR): "Pål Sverre Valheim Hagen (born 6 November 1980) is a Norwegian stage and screen actor. He appeared in the Norwegian film "Max Manus" and played Thor Heyerdahl in the Oscar-nominated 2012 film "Kon-Tiki". Pål Hagen was born in Stavanger, Norway, the son of Roar Hagen, a Norwegian cartoonist who has long been associated with Norway's largest daily, "VG". He lived in Jåtten, a neighborhood in the city of Stavanger in south-western Norway. . . ."

Table 5 :
End-to-end QA (Exact Match) Accuracy. REALM Wiki and REALM News are the same model but pretrained on Wikipedia and CC-News, respectively. Single and Multi denote that our Dense Passage Retriever (DPR) was trained using individual or combined training datasets (all except SQuAD). For WQ and TREC in the Multi setting, we fine-tune the reader trained on NQ.