RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering

In open-domain question answering, dense passage retrieval has become a new paradigm to retrieve relevant passages for finding answers. Typically, the dual-encoder architecture is adopted to learn dense representations of questions and passages for semantic matching. However, it is difficult to effectively train a dual-encoder due to the challenges including the discrepancy between training and inference, the existence of unlabeled positives and limited training data. To address these challenges, we propose an optimized training approach, called RocketQA, to improving dense passage retrieval. We make three major technical contributions in RocketQA, namely cross-batch negatives, denoised hard negatives and data augmentation. The experiment results show that RocketQA significantly outperforms previous state-of-the-art models on both MSMARCO and Natural Questions. We also conduct extensive experiments to examine the effectiveness of the three strategies in RocketQA. Besides, we demonstrate that the performance of end-to-end QA can be improved based on our RocketQA retriever.


Introduction
Open-domain question answering (QA) aims to find the answers to natural language questions from a large collection of documents. Early QA systems (Brill et al., 2002; Dang et al., 2007; Ferrucci et al., 2010) constructed complicated pipelines consisting of multiple components, including question understanding, document retrieval, passage ranking and answer extraction. Recently, inspired by the advancements of machine reading comprehension (MRC), Chen et al. (2017) proposed a simplified two-stage approach, where a traditional IR retriever (e.g., TF-IDF or BM25) first selects a few relevant passages as contexts, and then a neural reader reads the contexts and extracts the answers. As the recall component, the first-stage retriever significantly affects the final QA performance. Though efficient with an inverted index, traditional IR retrievers with term-based sparse representations have limited capabilities in matching questions and passages, e.g., term mismatch.
To deal with the issue of term mismatch, the dual-encoder architecture (as shown in Figure 1a) has been widely explored (Guu et al., 2020; Luan et al., 2020) to learn dense representations of questions and passages in an end-to-end manner, which provides better representations for semantic matching. These studies first separately encode questions and passages to obtain their dense representations, and then compute the similarity between the dense representations using similarity functions such as cosine or dot product. Typically, the dual-encoder is trained by using in-batch random negatives: for each question-positive passage pair in a training batch, the positive passages for the other questions in the batch are used as negatives. However, it is still difficult to effectively train a dual-encoder for dense passage retrieval due to the following three major challenges.
First, there is a discrepancy between training and inference for the dual-encoder retriever. During inference, the retriever needs to identify positive (or relevant) passages for each question from a large collection containing millions of candidates. However, during training, the model learns to estimate the probabilities of positive passages in a small candidate set for each question, due to the limited memory of a single GPU (or other device). To reduce such a discrepancy, previous work tried to design specific mechanisms for selecting a few hard negatives from the top-k retrieved candidates (Gillick et al., 2019; Luan et al., 2020). However, this strategy suffers from the false negative issue due to the following challenge.
Second, there might be a large number of unlabeled positives. Usually, it is infeasible to completely annotate all the candidate passages for one question. By only examining the top-k passages retrieved by a specific retrieval approach (e.g., BM25), the annotators are likely to miss relevant passages to a question. Taking the MSMARCO dataset (Nguyen et al., 2016) as an example, each question has only 1.1 annotated positive passages on average, while there are 8.8M passages in the whole collection. As will be shown in our experiments, we manually examine the top-retrieved passages that were not labeled as positives in the original MSMARCO dataset, and we find that 70% of them are actually positives. Hence, sampling hard negatives from the top-k retrieved passages is likely to bring false negatives.
Third, it is expensive to acquire large-scale training data for open-domain QA. MSMARCO and Natural Questions (Kwiatkowski et al., 2019) are the two largest datasets for open-domain QA. They are created from commercial search engines, and have 516K and 300K annotated questions, respectively. However, this is still insufficient to cover all the topics of questions issued by users to search engines.
In this paper, we focus on addressing these challenges so as to effectively train a dual-encoder retriever for open-domain QA. We propose an optimized training approach, called RocketQA, to improve dense passage retrieval. Considering the above challenges, we make three major technical contributions in RocketQA. First, RocketQA introduces cross-batch negatives. Compared to in-batch negatives, it increases the number of available negatives for each question during training, and alleviates the discrepancy between training and inference. Second, RocketQA introduces denoised hard negatives. It aims to remove false negatives from the top-ranked results retrieved by a retriever, and derive more reliable hard negatives. Third, RocketQA leverages large-scale unsupervised data "labeled" by a cross-encoder (as shown in Figure 1b) for data augmentation. Though inefficient, the cross-encoder architecture has been found to be more capable than the dual-encoder architecture in both theory and practice (Luan et al., 2020). Therefore, we utilize a cross-encoder to generate high-quality pseudo labels for unlabeled data, which are used to train the dual-encoder retriever.
The contributions of this paper are as follows:
• The proposed RocketQA introduces three novel training strategies to improve dense passage retrieval for open-domain QA, namely cross-batch negatives, denoised hard negatives, and data augmentation.
• The overall experiments show that our proposed RocketQA significantly outperforms previous state-of-the-art models on both MSMARCO and Natural Questions datasets.
• We conduct extensive experiments to examine the effectiveness of the above three strategies in RocketQA. Experimental results show that the three strategies are effective to improve the performance of dense passage retrieval.
• We also demonstrate that the performance of end-to-end QA can be improved based on our RocketQA retriever.

Related Work
Passage retrieval for open-domain QA For open-domain QA, a passage retriever is an important component to identify relevant passages for answer extraction. Traditional approaches (Chen et al., 2017) implemented term-based passage retrievers (e.g., TF-IDF and BM25), which have limited representation capabilities. Recently, researchers have utilized deep learning to improve traditional passage retrievers, including document expansions (Nogueira et al., 2019c), question expansions (Mao et al., 2020) and term weight estimation (Dai and Callan, 2019). Different from the above term-based approaches, dense passage retrieval has been proposed to represent both questions and documents as dense vectors (i.e., embeddings), typically in a dual-encoder architecture (as shown in Figure 1a). Existing approaches can be divided into two categories: (1) self-supervised pre-training for retrieval (Guu et al., 2020) and (2) fine-tuning pre-trained language models on labeled data. Our work follows the second class of approaches, which show better performance at lower cost. Although the dual-encoder architecture enables the appealing paradigm of dense retrieval, it is difficult to effectively train a retriever with such an architecture. As discussed in Section 1, it suffers from a number of challenges, including the training and inference discrepancy, a large number of unlabeled positives and limited training data. Several recent studies (Luan et al., 2020; Henderson et al., 2017) tried to address the first challenge by designing complicated sampling mechanisms to generate hard negatives. However, these approaches still suffer from the issue of false negatives. The latter two challenges have seldom been considered for open-domain QA.
Passage re-ranking for open-domain QA Based on the retrieved passages from a first-stage retriever, BERT-based re-rankers have recently been applied to retrieval-based question answering and search-related tasks (Nogueira and Cho, 2019; Nogueira et al., 2019b; Yan et al., 2019), and yield substantial improvements over the traditional methods. Although effective to some extent, these re-rankers employ the cross-encoder architecture (as shown in Figure 1b), which is impractical to apply to all passages in a corpus with respect to a question. Re-rankers (Khattab and Zaharia, 2020) with lightweight interaction based on the representations of dense retrievers have also been studied. However, these techniques still rely on a separate retriever which provides candidates and representations. In contrast, we focus on developing dual-encoder based retrievers.

Approach
In this section, we propose an optimized training approach to dense passage retrieval for open-domain QA, namely RocketQA. We first introduce the background of the dual-encoder architecture, and then describe the three novel training strategies in RocketQA. Lastly, we present the whole training procedure of RocketQA.

Task Description
The task of open-domain QA is described as follows. Given a natural language question, a system is required to answer it based on a large collection of documents. Let $C$ denote the corpus, consisting of $N$ documents. We split the $N$ documents into $M$ passages, denoted by $p_1, p_2, \ldots, p_M$, where each passage $p_i$ can be viewed as an $l$-length sequence of tokens $p_i^{(1)}, p_i^{(2)}, \ldots, p_i^{(l)}$. Given a question $q$, the task is to find a passage $p_i$ among the $M$ candidates, and extract a span $p_i^{(s)}, p_i^{(s+1)}, \ldots, p_i^{(e)}$ from $p_i$ that can answer the question. In this paper, we mainly focus on developing a dense retriever to retrieve the passages that contain the answer.

The Dual-Encoder Architecture
We develop our passage retriever based on the typical dual-encoder architecture, as illustrated in Figure 1a. First, a dense passage retriever uses an encoder $E_p(\cdot)$ to obtain the $d$-dimensional real-valued vectors (a.k.a. embeddings) of passages. Then, an index of passage embeddings is built for retrieval. At query time, another encoder $E_q(\cdot)$ is applied to embed the input question into a $d$-dimensional real-valued vector, and the $k$ passages whose embeddings are the closest to the question's will be retrieved. The similarity between the question $q$ and a candidate passage $p$ can be computed as the dot product of their vectors:
$$\mathrm{sim}(q, p) = E_q(q)^\top \cdot E_p(p). \tag{1}$$
In practice, the separation of question encoding and passage encoding is desirable, so that the dense representations of all passages can be precomputed for efficient retrieval. Here, we adopt two independent neural networks initialized from pre-trained LMs for the two encoders $E_q(\cdot)$ and $E_p(\cdot)$ separately, and take the representations at the first token (e.g., the [CLS] symbol in BERT) as the output for encoding.
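As a minimal sketch of the retrieval step described above, the following pure-Python example scores precomputed passage embeddings against a question embedding with the dot product and returns the k best matches. The function names and the toy two-dimensional vectors are illustrative assumptions, not part of the original implementation; a real system would obtain the vectors from the two encoders and use an index rather than a linear scan.

```python
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot-product similarity between two embeddings of equal length."""
    return sum(a * b for a, b in zip(u, v))


def retrieve_top_k(
    question_vec: Sequence[float],
    passage_vecs: List[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return (passage index, score) pairs for the k highest-scoring passages."""
    scored = [(idx, dot(question_vec, vec)) for idx, vec in enumerate(passage_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data (hypothetical): passage embeddings are precomputed once, offline.
passage_index = [[1.0, 0.0], [0.6, 0.8], [-1.0, 0.0]]
question = [0.8, 0.6]  # would come from the question encoder at query time
top2 = retrieve_top_k(question, passage_index, k=2)
```

Because passages are encoded independently of the question, `passage_index` can be built once and reused for every query, which is the efficiency argument made in the text.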
Training The training objective is to learn dense representations of questions and passages so that question-positive passage pairs have higher similarity than the question-negative passage pairs in training data. Formally, given a question $q_i$ together with its positive passage $p_i^+$ and $m$ negative passages $\{p_{i,j}^-\}_{j=1}^{m}$, we minimize the loss function:
$$L(q_i, p_i^+, \{p_{i,j}^-\}_{j=1}^{m}) = -\log \frac{e^{\mathrm{sim}(q_i, p_i^+)}}{e^{\mathrm{sim}(q_i, p_i^+)} + \sum_{j=1}^{m} e^{\mathrm{sim}(q_i, p_{i,j}^-)}}, \tag{2}$$
where we aim to optimize the negative log likelihood of the positive passage against a set of $m$ negative passages. Ideally, we should take all the negative passages in the whole collection into consideration in Equation 2. However, it is computationally infeasible to consider a large number of negative samples for a question, and hence $m$ is practically set to a small number that is far less than $M$. As will be discussed later, both the number and the quality of negatives affect the final performance of passage retrieval.
Inference In our implementation, we use FAISS (Johnson et al., 2019) to index the dense representations of all passages. Specifically, we use IndexFlatIP for indexing and the exact maximum inner product search for querying.
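The training objective referred to as Equation 2 can be illustrated with a small numeric sketch. The function below computes the negative log likelihood of the positive passage from precomputed similarity scores; the name `nll_loss` and the example scores are assumptions made for illustration, and the log-sum-exp shift is a standard numerical-stability device rather than something stated in the text.

```python
import math
from typing import Sequence


def nll_loss(pos_score: float, neg_scores: Sequence[float]) -> float:
    """Negative log likelihood of the positive passage against m negatives.

    pos_score is sim(q, p+); neg_scores holds sim(q, p-) for each negative.
    """
    scores = [pos_score] + list(neg_scores)
    max_score = max(scores)  # shift by the max so exp() cannot overflow
    log_denominator = max_score + math.log(
        sum(math.exp(s - max_score) for s in scores)
    )
    return log_denominator - pos_score


# Hypothetical scores: one positive and three negatives.
easy = nll_loss(8.0, [1.0, 0.5, -2.0])   # positive clearly ahead, small loss
hard = nll_loss(8.0, [7.9, 7.5, -2.0])   # hard negatives close to the positive
```

With the same positive score, negatives that score close to the positive produce a larger loss, which is one way to see why the quality of negatives matters as well as their number.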

Optimized Training Approach
In Section 1, we have discussed three major challenges in training the dual-encoder based retriever, including the training and inference discrepancy, the existence of unlabeled positives, and limited training data. Next, we propose three improved training strategies to address the three challenges.
Cross-batch Negatives When training the dual-encoder, the trick of in-batch negatives has been widely used in previous work (Henderson et al., 2017; Gillick et al., 2019; Luan et al., 2020). Assume that there are B questions in a mini-batch on a single GPU, and each question has one positive passage. With the in-batch negative trick, each question can be further paired with B − 1 negatives (i.e., the positive passages of the other questions) without sampling additional negatives. In-batch negative training is a memory-efficient way to reuse the examples already loaded in a mini-batch rather than sampling new negatives, which increases the number of negatives for each question. The top of Figure 2 shows an example of in-batch negatives when training on A GPUs in a data-parallel way. To further optimize the training with more negatives, we propose to use cross-batch negatives when training on multiple GPUs, as illustrated at the bottom of Figure 2. Specifically, we first compute the passage embeddings within each single GPU, and then share these passage embeddings among all the GPUs. Besides the in-batch negatives, we collect all passages (i.e., their dense representations) from other GPUs as the additional negatives for each question. Hence, with A GPUs (or mini-batches), we can obtain A × B − 1 negatives for a given question, which is approximately A times as many as the original number of in-batch negatives. In this way, we can use more negatives in the training objective of Equation 2, so that the results are expected to be improved.
Denoised Hard Negatives Although the above strategy can increase the number of negatives, most of the negatives are easy ones, which can be easily discriminated. Hard negatives, in contrast, have been shown to be important for training a dual-encoder (Gillick et al., 2019; Luan et al., 2020).
To obtain hard negatives, a straightforward method is to select the top-ranked passages (excluding the labeled positive passages) as negative samples. However, this is likely to bring false negatives (i.e., unlabeled positives), since the annotators can only annotate a few top-retrieved passages (as discussed in Section 1). Another note is that previous work mainly focuses on factoid questions, to which the answers are short and concise. Hence, it is not challenging to filter false negatives by using the short answers. However, this filtering cannot be applied to non-factoid questions. In this paper, we aim to learn dense passage retrieval for both factoid questions and non-factoid questions, which needs a more effective way for denoising hard negatives.
Here, our idea is to utilize a well-trained cross-encoder to remove top-retrieved passages that are likely to be false negatives, because the cross-encoder architecture is more powerful for capturing semantic similarity via deep interaction and shows much better performance than the dual-encoder architecture (Luan et al., 2020). The cross-encoder is more effective and robust, while it is inefficient over a large number of candidates in inference. Hence, we first train a cross-encoder (following the architecture shown in Figure 1b). Then, when sampling hard negatives from the top-ranked passages retrieved by a dense retriever, we select only the passages that are predicted as negatives by the cross-encoder with high confidence scores. The selected top-retrieved passages can be considered as denoised samples that are more reliable to be used as hard negatives.
Data Augmentation The third strategy aims to alleviate the issue of limited training data. Since the cross-encoder is more powerful in measuring the similarity between questions and passages, we utilize it to annotate unlabeled questions for data augmentation. Specifically, we incorporate a new collection of unlabeled questions, while reusing the passage collection. Then, we use the learned cross-encoder to predict the passage labels for the new questions. To ensure the quality of the automatically labeled data, we only select the predicted positive and negative passages with high confidence scores estimated by the cross-encoder. Finally, the automatically labeled data is used as augmented training data to learn the dual-encoder. Another view of the data augmentation is knowledge distillation (Hinton et al., 2015), where the cross-encoder is the teacher and the dual-encoder is the student.

The Training Procedure
As shown in Figure 3, we organize the above three training strategies into an effective training pipeline for the dual-encoder. It makes an analogy to a multi-stage rocket, where the performance of the dual-encoder is consecutively improved at three steps (STEP 1, 3 and 4). That is why we call our approach RocketQA. Next, we will describe the details of the whole training procedure of RocketQA.
• REQUIRE: Let $C$ denote a collection of passages. $Q_L$ is a set of questions that have corresponding labeled passages in $C$, and $Q_U$ is a set of questions that have no corresponding labeled passages. $D_L$ is a dataset consisting of $C$ and $Q_L$, and $D_U$ is a dataset consisting of $C$ and $Q_U$.
• STEP 1: Train a dual-encoder $M_D^{(0)}$ on $D_L$ by using cross-batch negatives.
• STEP 2: Train a cross-encoder $M_C$ on $D_L$, where the negatives are randomly sampled from the top-k passages retrieved by $M_D^{(0)}$ from $C$ for each question $q \in Q_L$. This design is to let the cross-encoder adjust to the distribution of the results retrieved by the dual-encoder, since the cross-encoder will be used in the following two steps for optimizing the dual-encoder. This design is important, and there is a similar observation in Facebook Search (Huang et al., 2020).
• STEP 3: Train a dual-encoder $M_D^{(1)}$ by further introducing denoised hard negative sampling on $D_L$. For each question $q \in Q_L$, the hard negatives are sampled from the top passages retrieved by $M_D^{(0)}$ from $C$, and only the passages that are predicted as negatives by the cross-encoder $M_C$ with high confidence scores will be selected.
• STEP 4: Construct pseudo training data $D_U$ by using $M_C$ to label the top-k passages retrieved by $M_D^{(1)}$ from $C$ for each question $q \in Q_U$, and then train a dual-encoder $M_D^{(2)}$ on both the manually labeled training data $D_L$ and the automatically augmented training data $D_U$.
Note that the cross-batch negative strategy is applied through all the steps for training the dual-encoder.
For NQ, around 62,000 factoid questions are selected, and all the Wikipedia articles are processed as the collection of passages. There are more than 21 million passages in the corpus. In our experiments, we reuse the version of NQ used in DPR. Note that the dataset used in DPR contains empty negatives, and we discarded the empty ones.

Evaluation Metrics
Following previous work, we use MRR and Recall at top k ranks to evaluate the performance of passage retrieval, and exact match (EM) to measure the performance of answer extraction.
MRR The Reciprocal Rank (RR) calculates the reciprocal of the rank at which the first relevant passage was retrieved. When averaged across questions, it is called Mean Reciprocal Rank (MRR).
Recall at top k ranks The top-k recall of a retriever is defined as the proportion of questions to which the top k retrieved passages contain answers.
Exact match This metric measures the percentage of questions whose predicted answers match any one of the reference answers exactly, after string normalization.
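The three metrics above can be sketched in a few lines of Python. This is an illustrative sketch rather than the evaluation script used in the experiments: the function names are hypothetical, each ranked list is represented as booleans marking whether the passage at that rank is relevant, and the string normalization (lowercasing, removing punctuation and English articles, collapsing whitespace) is an assumption, since the text only says "string normalization".

```python
import re
import string
from typing import List, Sequence


def mean_reciprocal_rank(runs: List[Sequence[bool]]) -> float:
    """Average over questions of 1 / rank of the first relevant passage (0 if none)."""
    total = 0.0
    for ranked in runs:
        for rank, is_relevant in enumerate(ranked, start=1):
            if is_relevant:
                total += 1.0 / rank
                break
    return total / len(runs)


def recall_at_k(runs: List[Sequence[bool]], k: int) -> float:
    """Proportion of questions with at least one relevant passage in the top k."""
    hits = sum(1 for ranked in runs if any(ranked[:k]))
    return hits / len(runs)


def normalize_answer(text: str) -> str:
    """Assumed normalization: lowercase, drop punctuation and articles, squeeze spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, references: Sequence[str]) -> bool:
    """True if the normalized prediction equals any normalized reference answer."""
    normalized = normalize_answer(prediction)
    return any(normalized == normalize_answer(ref) for ref in references)


# Three hypothetical questions with three retrieved passages each.
runs = [[False, True, False], [True, False, False], [False, False, False]]
mrr = mean_reciprocal_rank(runs)   # (1/2 + 1 + 0) / 3
r_at_1 = recall_at_k(runs, 1)      # only the second question hits at rank 1
```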

Implementation Details
We conduct all experiments with the deep learning framework PaddlePaddle on up to eight NVIDIA Tesla V100 GPUs (with 32G RAM).
Pre-trained LMs The dual-encoder is initialized with the parameters of ERNIE 2.0 base, and the cross-encoder is initialized with ERNIE 2.0 large. ERNIE 2.0 has the same network architecture as BERT, and it introduces a continual pre-training framework on multiple pre-training tasks. We notice that previous work uses different pre-trained LMs, and we examine the effects of pre-trained LMs in Section A.1 in the Appendix. Our approach is effective when using different pre-trained LMs.
Cross-batch negatives The cross-batch negative sampling is implemented with the differentiable all-gather operation provided in FleetX (Dong, 2020), which is a highly scalable distributed training engine of PaddlePaddle. The all-gather operator makes the representations of passages across all GPUs visible on each GPU, and thus the cross-batch negative sampling approach can be applied globally.
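To make the effect of the all-gather step concrete, the following single-process sketch simulates A GPUs that each hold B positive passages and compares the negatives available to one question with and without sharing. It is only a simulation under stated assumptions: `all_gather` here is a plain Python stand-in written for this example, not the FleetX operator, and passages are represented by string identifiers instead of embeddings.

```python
from typing import List


def all_gather(per_gpu_items: List[List[str]]) -> List[List[str]]:
    """Stand-in for a collective all-gather: every GPU ends up with all items."""
    gathered = [item for gpu_items in per_gpu_items for item in gpu_items]
    return [list(gathered) for _ in per_gpu_items]


def negatives_for_question(
    gpu_id: int,
    local_idx: int,
    per_gpu_passages: List[List[str]],
    cross_batch: bool,
) -> List[str]:
    """Negatives for the question at (gpu_id, local_idx).

    Passage local_idx on each GPU is assumed to be the positive for
    question local_idx on that GPU.
    """
    if cross_batch:
        pool = all_gather(per_gpu_passages)[gpu_id]
    else:
        pool = per_gpu_passages[gpu_id]
    positive = per_gpu_passages[gpu_id][local_idx]
    return [p for p in pool if p != positive]


A, B = 4, 3  # hypothetical: 4 GPUs, 3 questions per GPU
per_gpu = [[f"gpu{g}_p{b}" for b in range(B)] for g in range(A)]
in_batch = negatives_for_question(0, 1, per_gpu, cross_batch=False)
cross = negatives_for_question(0, 1, per_gpu, cross_batch=True)
```

In this toy setting the question sees B − 1 = 2 negatives in-batch and A × B − 1 = 11 negatives cross-batch, matching the counts given in Section 3.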
Denoised hard negatives and data augmentation We use the cross-encoder for both denoising hard negatives and data augmentation. Specifically, we select the top retrieved passages with scores less than 0.1 as negatives and those with scores higher than 0.9 as positives. We manually evaluated the selected data, and the accuracy was higher than 90%.
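The thresholding rule above can be sketched as a simple filter over cross-encoder scores. The function name, the data layout (a list of passage identifier and score pairs for one question), and the example scores are assumptions made for illustration; only the two thresholds, 0.1 and 0.9, come from the text.

```python
from typing import List, Tuple


def label_by_confidence(
    scored_passages: List[Tuple[str, float]],
    neg_threshold: float = 0.1,
    pos_threshold: float = 0.9,
) -> Tuple[List[str], List[str]]:
    """Split top-retrieved passages into confident positives and negatives.

    Passages whose score falls between the two thresholds are discarded,
    since the cross-encoder is not confident about them.
    """
    positives = [pid for pid, score in scored_passages if score > pos_threshold]
    negatives = [pid for pid, score in scored_passages if score < neg_threshold]
    return positives, negatives


# Hypothetical cross-encoder scores for passages retrieved for one question.
retrieved = [("p1", 0.97), ("p2", 0.55), ("p3", 0.03), ("p4", 0.08)]
pseudo_positives, hard_negatives = label_by_confidence(retrieved)
```

For denoising hard negatives on labeled questions only the second list would be used, whereas for data augmentation on unlabeled questions both lists would serve as pseudo labels.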
The number of positives and negatives When training the cross-encoders, the ratios of the number of positives to the number of negatives are 1:4 and 1:1 on MSMARCO and NQ, respectively. The negatives used for training cross-encoders are randomly sampled from the top-1000 and top-100 passages retrieved by the dual-encoder $M_D^{(0)}$ on MSMARCO and NQ, respectively.
Batch sizes The dual-encoders are trained with the batch sizes of 512 × 8 and 512 × 2 on MSMARCO and NQ, respectively. The batch size used on MSMARCO is larger, since MSMARCO is larger than NQ. The cross-encoders are trained with the batch sizes of 64 × 4 and 64 on MSMARCO and NQ, respectively. We use the automatic mixed precision and gradient checkpoint functionality (see footnote 4) in FleetX, so that we can train the models using large batch sizes with limited resources.
Training epochs The dual-encoders are trained on MSMARCO for 40, 10 and 10 epochs in the three steps of RocketQA, respectively. The dual-encoders are trained on NQ for 30 epochs in all steps of RocketQA. The cross-encoders are trained for 2 epochs on both MSMARCO and NQ.
Optimizers We use the Adam optimizer.
Warmup and learning rate The learning rate of the dual-encoder is set to 3e-5 and the rate of linear scheduling warm-up is set to 0.1, while the learning rate of the cross-encoder is set to 1e-5.
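As a hedged illustration of such a schedule, the sketch below interprets the warm-up rate of 0.1 as the fraction of total training steps spent increasing the learning rate linearly to its peak, followed by a linear decay to zero. Both the interpretation and the decay phase are assumptions, since the text does not spell them out, and the function name and step counts are hypothetical.

```python
def learning_rate(
    step: int,
    total_steps: int,
    peak_lr: float = 3e-5,
    warmup_proportion: float = 0.1,
) -> float:
    """Linear warm-up to peak_lr, then (assumed) linear decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_proportion))
    if step < warmup_steps:
        # Ramp up: reaches peak_lr on the last warm-up step.
        return peak_lr * (step + 1) / warmup_steps
    remaining = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / remaining)


# Hypothetical run of 1000 steps with the dual-encoder peak rate of 3e-5.
schedule = [learning_rate(s, 1000) for s in (0, 50, 99, 100, 550, 1000)]
```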
Maximal length We set the maximal length of questions and passages as 32 and 128, respectively.
Unlabeled questions We collect 1.7 million unlabeled questions from Yahoo! Answers (http://answers.yahoo.com/), ORCAS (Craswell et al., 2020) and MRQA (Fisch et al., 2019). We use the questions from Yahoo! Answers, ORCAS and NQ as new questions in the experiments of MSMARCO. We only use the questions from MRQA as the new questions in the experiments of NQ, since both NQ and MRQA mainly contain factoid questions, while the other datasets contain both factoid and non-factoid questions.
Footnote 4: The gradient checkpoint (Chen et al., 2016) enables trading off computation against memory, resulting in sublinear memory cost, so bigger or deeper networks can be trained with limited resources.

Experimental Results
In our experiments, we first examine the effectiveness of our retriever on the MSMARCO and NQ datasets. Then, we conduct extensive experiments to examine the effects of the three proposed training strategies. We also show the performance of end-to-end QA based on our retriever on the NQ dataset.

Dense Passage Retrieval
We first compare RocketQA with the previous state-of-the-art approaches on passage retrieval. We consider both sparse and dense passage retriever baselines. The sparse retrievers include the traditional retriever BM25 (Yang et al., 2017), and four traditional retrievers enhanced by neural networks, including doc2query (Nogueira et al., 2019c), DeepCT (Dai and Callan, 2019), docTTTTTquery (Nogueira et al., 2019a) and GAR (Mao et al., 2020). Both doc2query and docTTTTTquery employ neural question generation to expand documents. In contrast, GAR employs neural generation models to expand questions. Different from them, DeepCT utilizes BERT to learn the term weights. The dense passage retrievers include DPR, ME-BERT (Luan et al., 2020) and ANCE. Both DPR and ME-BERT use in-batch random sampling and hard negative sampling from the results retrieved by BM25, while ANCE enhances the hard negative sampling by using the dense retriever.
RocketQA significantly outperforms the previous state-of-the-art models on both the MSMARCO and NQ datasets. Another observation is that the dense retrievers are overall better than the sparse retrievers. Such a finding has also been reported in previous studies (Luan et al., 2020), which indicates the effectiveness of the dense retrieval approach.

The Effectiveness of The Three Training Strategies in RocketQA
In this part, we conduct extensive experiments on the MSMARCO dataset to examine the effectiveness of the three strategies in RocketQA. Results on the NQ dataset show similar findings (see Section A.2 in the Appendix).
First, we compare cross-batch negatives with in-batch negatives by using the same experimental setting (i.e., the number of epochs is 40 and the batch size is 512 on each single GPU). From the first two rows in Table 3, we can see that the performance of the dense retriever can be improved with more negatives by cross-batch negatives. It is expected that increasing the number of random negatives reduces the discrepancy between training and inference. Furthermore, we investigate the effect of the number of random negatives. Specifically, we examine the performance of dual-encoders trained by using different numbers of random negatives with a fixed number of steps. From Figure 4, we can see that the model performance increases when the number of random negatives becomes larger. After a certain point, the model performance starts to drop, since a large batch size may bring difficulty for optimization on training data with limited size. Hence, there should be a balance between the batch size and the number of negatives: increasing the batch size provides more negatives for each question, but when the size of the training data is limited, a large batch size makes optimization harder.
Second, we examine the effect of denoised hard negatives from the top-k passages retrieved by the dense retriever. As shown in the third row in Table 3, the performance of the retriever significantly decreases by introducing hard negatives without denoising. We speculate that it is caused by the fact that there are a large number of unlabeled positives. Specifically, we manually examine the top-retrieved passages of 100 questions that were not labeled as true positives. We find that about 70% of them are actually positives or highly relevant. Hence, it is likely to bring noise if we simply sample hard negatives from the top-retrieved passages by the dense retriever, which is a widely adopted strategy to sample hard negatives in previous studies (Gillick et al., 2019). As a comparison, we propose denoised hard negatives by a powerful cross-encoder. From the fourth row in Table 3, we can see that denoised negatives improve the performance of the dense retriever. To obtain more insights about denoised hard negatives, Table 4 gives the sampled hard negatives for two questions before and after denoising. Figure 5 further illustrates the ratio of filtered passages at different ranks. We can see that there are more passages filtered (i.e., denoised) at lower ranks, since it is likely to have more false negatives at lower ranks.
Finally, when integrated with the data augmentation strategy (see the fifth row in Table 3), the performance is further improved. A major merit of data augmentation is that it does not explicitly rely on manually labeled data. Instead, it utilizes the cross-encoder (which is more capable than the dual-encoder) to generate pseudo training data for improving the dual-encoder. We further examine the effect of the size of the augmented data. As shown in Figure 6, the performance increases as the size of the augmented data increases.

Passage Reading with RocketQA
Previous experiments have shown the effectiveness of RocketQA on passage retrieval. Next, we verify whether the retrieval results of RocketQA can improve the performance of passage reading for extracting correct answers. We implement an end-to-end QA system in which we have an extractive reader stacked on our RocketQA retriever. For a fair comparison, we first re-use the released model (https://github.com/facebookresearch/DPR) of the extractive reader in DPR, and take 100 retrieved passages during inference (the same setting used in DPR). Besides, we use the same setting to train a new extractive reader based on the retrieval results of RocketQA (except that we choose the top 50 passages for training instead of 100). The motivation is that the reader should be adapted to the retrieval distribution of RocketQA.

Model                                  EM
BM25+BERT                              26.5
HardEM (Min et al., 2019a)             28.1
GraphRetriever (Min et al., 2019b)     34.5
PathRetriever (Asai et al., 2020)      32.6
ORQA                                   33.3
REALM (Guu et al., 2020)               40.4
DPR                                    41.5
GAR (Mao et al., 2020)                 41.6
RocketQA + DPR reader                  42.0
RocketQA + re-trained DPR reader       42.8

Table 5: The experimental results of passage reading on NQ dataset. In this paper, we focus on extractive reader, while the recent generative readers (Izacard and Grave, 2020) can also be applied here and may lead to better results.

Table 5 summarizes the end-to-end QA performance of our approach and a number of competitive methods. From Table 5, we can see that our retriever leads to better QA performance. Compared with prior solutions, our novelty mainly lies in the passage retrieval component, i.e., the RocketQA approach. The results show that our approach provides better passage retrieval results, which in turn improve the final QA performance.

Conclusions
In this paper, we have presented an optimized training approach to improving dense passage retrieval. We have made three major technical contributions in RocketQA, namely cross-batch negatives, denoised hard negatives and data augmentation. Extensive experiments have shown the effectiveness of the proposed approach by incorporating the three optimization strategies. We also demonstrate that the performance of end-to-end QA can be improved based on our RocketQA retriever.

Ethical Considerations
The technique of dense passage retrieval is effective for question answering, where the majority of questions are informational queries. Different from the traditional search, there is usually term mismatch between questions and answers. The term mismatch makes it harder for machines to accurately find the information people need. Hence, we need dense passage retrieval for semantic matching in the scenario of question answering. Dense passage retrieval has the potential to empower people to find accurate information more quickly and achieve more in their daily life and work. Our technique contributes toward the goal of asking machines to find the answers to natural language questions from a large collection of documents. However, the goal is still far from being achieved, and more efforts from the community are needed for us to get there.