Multi-Passage Machine Reading Comprehension with Cross-Passage Answer Verification

Machine reading comprehension (MRC) on real web data usually requires the machine to answer a question by analyzing multiple passages retrieved by a search engine. Compared with MRC on a single passage, multi-passage MRC is more challenging, since we are likely to get multiple confusing answer candidates from different passages. To address this problem, we propose an end-to-end neural model that enables those answer candidates from different passages to verify each other based on their content representations. Specifically, we jointly train three modules that can predict the final answer based on three factors: the answer boundary, the answer content and the cross-passage answer verification. The experimental results show that our method outperforms the baseline by a large margin and achieves state-of-the-art performance on the English MS-MARCO dataset and the Chinese DuReader dataset, both of which are designed for MRC in real-world settings.


Introduction
Machine reading comprehension (MRC), empowering computers with the ability to acquire knowledge and answer questions from textual data, is believed to be a crucial step in building a general intelligent agent (Chen et al., 2016). Recent years have seen rapid growth in the MRC community. With the release of various datasets, the MRC task has evolved from the early cloze-style test (Hermann et al., 2015; Hill et al., 2015) to answer extraction from a single passage (Rajpurkar et al., 2016) and to the latest more complex question answering on web data (Nguyen et al., 2016; Dunn et al., 2017; He et al., 2017).
* This work was done while the first author was doing internship at Baidu Inc.
Great efforts have also been made to develop models for these MRC tasks, especially for answer extraction on a single passage (Wang and Jiang, 2016; Seo et al., 2016; Pan et al., 2017). A significant milestone is that several MRC models have exceeded the performance of human annotators on the SQuAD dataset (Rajpurkar et al., 2016). However, this success on a single Wikipedia passage is still not adequate, considering the ultimate goal of reading the whole web. Therefore, several recent datasets (Nguyen et al., 2016; He et al., 2017; Dunn et al., 2017) attempt to design the MRC tasks in more realistic settings by involving search engines. For each question, they use the search engine to retrieve multiple passages, and the MRC models are required to read these passages in order to give the final answer.
One of the intrinsic challenges for such multi-passage MRC is that since all the passages are question-related but usually independently written, it's probable that multiple confusing answer candidates (correct or incorrect) exist. Table 1 shows an example from MS-MARCO. We can see that all the answer candidates semantically match the question, while they are literally different and some of them are even incorrect. As shown by Jia and Liang (2017), these confusing answer candidates could be quite difficult for MRC models to distinguish. Therefore, special consideration is required for the multi-passage MRC problem.
In this paper, we propose to leverage the answer candidates from different passages to verify the final correct answer and rule out the noisy incorrect answers. Our hypothesis is that the correct answers could occur more frequently in those passages and usually share some commonalities, while incorrect answers are usually different from one another. The example in Table 1 demonstrates this phenomenon. We can see that the answer candidates extracted from the last four passages are all valid answers to the question and they are semantically similar to each other, while the answer candidates from the other two passages are incorrect and there is no supportive information from other passages. As human beings usually compare the answer candidates from different sources to deduce the final answer, we hope that MRC model can also benefit from the cross-passage answer verification process.
Table 1: An example from MS-MARCO.
Question: What is the difference between a mixed and pure culture?
Passages:
[1] A culture is a society's total way of living and a society is a group that live in a defined territory and participate in common culture. While the answer given is in essence true, societies originally form for the express purpose to enhance . . .
[2] . . . There has been resurgence in the economic system known as capitalism during the past two decades. 4. The mixed economy is a balance between socialism and capitalism. As a result, some institutions are owned and maintained by . . .
[3] A pure culture is one in which only one kind of microbial species is found whereas in mixed culture two or more microbial species formed colonies. Culture on the other hand, is the lifestyle that the people in the country . . .
[4] Best Answer: A pure culture comprises a single species or strains. A mixed culture is taken from a source and may contain multiple strains or species. A contaminated culture contains organisms that derived from some place . . .
[5] . . . It will be at that time when we can truly obtain a pure culture. A pure culture is a culture consisting of only one strain. You can obtain a pure culture by picking out a small portion of the mixed culture . . .
[6] A pure culture is one in which only one kind of microbial species is found whereas in mixed culture two or more microbial species formed colonies. A pure culture is a culture consisting of only one strain. . . .
· · · · · ·
Reference Answer: A pure culture is one in which only one kind of microbial species is found whereas in mixed culture two or more microbial species formed colonies.
The overall framework of our model, which consists of three modules, is shown in Figure 1. First, we follow the boundary-based MRC models (Seo et al., 2016; Wang and Jiang, 2016) to find an answer candidate for each passage by identifying the start and end positions of the answer (Figure 2). Second, we model the meanings of the answer candidates extracted from those passages and use the content scores to measure the quality of the candidates from a second perspective. Third, we conduct the answer verification by enabling each answer candidate to attend to the other candidates based on their representations. We hope that the answer candidates can collect supportive information from each other according to their semantic similarities and further decide whether each candidate is correct or not. Therefore, the final answer is determined by three factors: the boundary, the content and the cross-passage answer verification. The three steps are modeled using different modules, which can be jointly trained in our end-to-end framework.
We conduct extensive experiments on the MS-MARCO (Nguyen et al., 2016) and DuReader (He et al., 2017) datasets. The results show that our answer verification MRC model outperforms the baseline models by a large margin and achieves the state-of-the-art performance on both datasets.
Figure 1 gives an overview of our multi-passage MRC model which is mainly composed of three modules including answer boundary prediction, answer content modeling and answer verification. First of all, we need to model the question and passages. Following Seo et al. (2016), we compute the question-aware representation for each passage (Section 2.1). Based on this representation, we employ a Pointer Network (Vinyals et al., 2015) to predict the start and end position of the answer in the module of answer boundary prediction (Section 2.2). At the same time, with the answer content model (Section 2.3), we estimate whether each word should be included in the answer and thus obtain the answer representations. Next, in the answer verification module (Section 2.4), each answer candidate can attend to the other answer candidates to collect supportive information and we compute one score for each candidate to indicate whether it is correct or not according to the verification. The final answer is determined by not only the boundary but also the answer content and its verification score (Section 2.5).

Question and Passage Modeling
Given a question $Q$ and a set of passages $\{P_i\}$ retrieved by search engines, our task is to find the best concise answer to the question. First, we formally present the details of modeling the question and passages.
Encoding. We first map each word into the vector space by concatenating its word embedding and the sum of its character embeddings. Then we employ bi-directional LSTMs (BiLSTM) to encode the question $Q$ and the passages $\{P_i\}$.
Q-P Matching. One essential step in MRC is to match the question with the passages so that important information can be highlighted. We use the Attention Flow Layer (Seo et al., 2016) to conduct the Q-P matching in two directions. The similarity matrix $S \in \mathbb{R}^{|Q| \times |P_i|}$ between the question and passage $i$ is changed to a simpler version, in which each entry is the similarity between the $t$-th word in the question and the $k$-th word in passage $i$. Then the context-to-question attention and question-to-context attention are applied strictly following Seo et al. (2016) to obtain the question-aware passage representation $\{\tilde{u}^{P_i}_t\}$. We do not give the details here due to space limitations. Next, another BiLSTM is applied in order to fuse the contextual information and get the new representation for each word in the passage, which is regarded as the match output.
Based on the passage representations, we introduce the three main modules of our model.
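As a rough illustration of the matching step, the sketch below computes a similarity matrix and the two attention directions on plain Python lists. It assumes that the simplified similarity is a dot product between the encoded question and passage words, which the text above does not state. The function names (`similarity_matrix`, `context_to_question`, `question_to_context`) are hypothetical, and the BiLSTM encoders are left out.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def similarity_matrix(question, passage):
    """S[t][k]: similarity of question word t and passage word k.

    Assumption: a plain dot product stands in for the "simpler version".
    """
    return [[dot(q, p) for p in passage] for q in question]


def context_to_question(question, passage):
    """For each passage word, an attention-weighted sum of question vectors."""
    S = similarity_matrix(question, passage)
    dim = len(question[0])
    attended = []
    for k in range(len(passage)):
        weights = softmax([S[t][k] for t in range(len(question))])
        attended.append(
            [sum(w * q[d] for w, q in zip(weights, question)) for d in range(dim)]
        )
    return attended


def question_to_context(question, passage):
    """One vector per passage: attention over passage words, scored by
    each word's best match with any question word."""
    S = similarity_matrix(question, passage)
    best = [max(S[t][k] for t in range(len(question))) for k in range(len(passage))]
    weights = softmax(best)
    dim = len(passage[0])
    return [sum(w * p[d] for w, p in zip(weights, passage)) for d in range(dim)]
```

In a full model these attended vectors would be combined with the passage encodings and passed to the second BiLSTM; that combination is not shown here.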

Answer Boundary Prediction
To extract the answer span from passages, mainstream studies try to locate the boundary of the answer, which is called the boundary model. Following Wang and Jiang (2016), we employ a Pointer Network (Vinyals et al., 2015) to compute the probability of each word being the start or end position of the span. By utilizing the attention weights, the probabilities of the $k$-th word in the passage being the start and end position of the answer are obtained as $\alpha^1_k$ and $\alpha^2_k$. It should be noted that the pointer network is applied to the concatenation of all passages, which is denoted as $P$, so that the probabilities are comparable across passages. This boundary model can be trained by minimizing the negative log probabilities of the true start and end indices, averaged over the $N$ samples in the dataset, where $y^1_i$ and $y^2_i$ are the gold start and end positions of the $i$-th sample.
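The boundary objective described above can be sketched as follows. This is a minimal illustration that takes the start and end distributions as given (the pointer network itself is not implemented), and the function name `boundary_loss` and the tuple layout of a sample are assumptions made for the example.

```python
import math


def boundary_loss(samples):
    """Average negative log probability of the gold start and end positions.

    samples: list of (start_probs, end_probs, gold_start, gold_end), where
    start_probs[k] and end_probs[k] play the role of alpha^1_k and alpha^2_k
    over the concatenation of all passages of one sample.
    """
    total = 0.0
    for start_probs, end_probs, gold_start, gold_end in samples:
        total += -(math.log(start_probs[gold_start]) + math.log(end_probs[gold_end]))
    return total / len(samples)
```

Because the probabilities are defined over the concatenated passages, a gold position is an index into that concatenation rather than into a single passage.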

Answer Content Modeling
Previous work employs the boundary model to find the text span with the maximum boundary score as the final answer. However, in our context, besides locating the answer candidates, we also need to model their meanings in order to conduct the verification. An intuitive method is to compute the representation of the answer candidates separately after extracting them, but it could be hard to train such a model end-to-end. Here, we propose a novel method that can obtain the representation of the answer candidates based on probabilities. Specifically, we change the output layer of the classic MRC model. Besides predicting the boundary probabilities for the words in the passages, we also predict a content probability for each word, which indicates whether the $k$-th word should be included in the content of the answer.
Training this content model is also quite intuitive. We transform the boundary labels into a continuous segment, which means the words within the answer span will be labeled as 1 and other words will be labeled as 0. In this way, we define the loss function as the averaged cross entropy between the predicted content probabilities and these labels.
The content probabilities provide another view to measure the quality of the answer in addition to the boundary. Moreover, with these probabilities, we can represent the answer from passage $i$ as a weighted sum of all the word embeddings in this passage, with the content probabilities as the weights.
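To make the three pieces of the content model concrete, the sketch below builds the 0/1 labels from a gold span, computes an averaged binary cross entropy, and forms the answer representation as a probability-weighted sum of word vectors. The function names are hypothetical, and dividing the weighted sum by the passage length is an assumption made here for scale, not something the text specifies.

```python
import math


def content_labels(passage_len, gold_start, gold_end):
    """Label words inside the gold span (inclusive) as 1 and all others as 0."""
    return [1 if gold_start <= k <= gold_end else 0 for k in range(passage_len)]


def content_loss(content_probs, labels, eps=1e-12):
    """Binary cross entropy averaged over the words of the passage."""
    total = 0.0
    for p, y in zip(content_probs, labels):
        total += -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
    return total / len(labels)


def answer_representation(content_probs, word_vectors):
    """Weighted sum of word vectors, with content probabilities as weights.

    Assumption: the sum is divided by the passage length.
    """
    n = len(word_vectors)
    dim = len(word_vectors[0])
    return [
        sum(p * v[d] for p, v in zip(content_probs, word_vectors)) / n
        for d in range(dim)
    ]
```

Since the representation is a soft, differentiable function of the content probabilities, no hard span extraction is needed before verification, which is what keeps the model trainable end-to-end.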

Cross-Passage Answer Verification
The boundary model and the content model focus on extracting and modeling the answer within a single passage respectively, with little consideration of the cross-passage information. However, as discussed in Section 1, there could be multiple answer candidates from different passages and some of them may mislead the MRC model to make an incorrect prediction. It is necessary to aggregate the information from different passages and choose the best one from those candidates. Therefore, we propose a method to enable the answer candidates to exchange information and verify each other through the cross-passage answer verification process.
Given the representations of the answer candidates from all passages $\{r^{A_i}\}$, each answer candidate attends to the other candidates to collect supportive information via an attention mechanism. Let $\tilde{r}^{A_i}$ denote the verification information collected from the other passages based on the attention weights. We pass it together with the original representation $r^{A_i}$ to a fully connected layer to obtain a score for each candidate. We further normalize these scores over all passages to get the verification score for answer candidate $A_i$. In order to train this verification model, we take the answer from the gold passage as the gold answer, and the loss function is formulated as the negative log probability of the correct answer, where $y^v_i$ is the index of the correct answer among all the answer candidates of the $i$-th instance.
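The verification step can be sketched on plain Python lists as below. Several details are assumptions made for illustration, because the text does not fix them: dot-product attention scores, excluding a candidate from its own attention, the feature vector [r, r~, r * r~] fed to a single linear layer, and a softmax for the normalization. The names `verification_scores` and `verification_loss` are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def verification_scores(reps, weights):
    """Score each answer candidate by the support it collects from the others.

    reps: one vector (list of floats) per candidate; at least two candidates.
    weights: parameters of the assumed linear layer, of length 3 * dim.
    """
    n, dim = len(reps), len(reps[0])
    logits = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # Attention of candidate i over the other candidates (assumed dot product).
        attn = softmax([dot(reps[i], reps[j]) for j in others])
        collected = [
            sum(a * reps[j][d] for a, j in zip(attn, others)) for d in range(dim)
        ]
        # Assumed features: original, collected, and their element-wise product.
        features = reps[i] + collected + [x * y for x, y in zip(reps[i], collected)]
        logits.append(dot(weights, features))
    # Normalize over all candidates to obtain verification scores.
    return softmax(logits)


def verification_loss(scores, gold_index):
    """Negative log probability of the candidate from the gold passage."""
    return -math.log(scores[gold_index])
```

With weights that only read the element-wise product, a candidate that resembles the others receives a higher score than an isolated one, which is the behaviour the hypothesis in Section 1 relies on.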

Joint Training and Prediction
As described above, we define three objectives for the reading comprehension model over multiple passages: 1. finding the boundary of the answer; 2. predicting whether each word should be included in the content; 3. selecting the best answer via cross-passage answer verification. According to our design, these three tasks can share the same embedding, encoding and matching layers. Therefore, we propose to train them together as multi-task learning (Ruder, 2017). The joint objective function is a weighted sum of the three losses, where $\beta_1$ and $\beta_2$ are two hyper-parameters that control the weights of those tasks.
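One form the joint objective could take, consistent with having exactly two weighting hyper-parameters for three tasks, is written out below. The loss names and the choice to leave the boundary term unweighted are assumptions made for illustration.

```latex
\mathcal{L} = \mathcal{L}_{\text{boundary}}
            + \beta_1 \, \mathcal{L}_{\text{content}}
            + \beta_2 \, \mathcal{L}_{\text{verify}}
```

Under this reading, setting $\beta_1 = \beta_2 = 0$ would reduce training to the boundary model alone.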
When predicting the final answer, we take the boundary score, content score and verification score into consideration. We first extract the answer candidate $A_i$ that has the maximum boundary score from each passage $i$. This boundary score is computed as the product of the start and end probabilities of the answer span. Then, for each answer candidate $A_i$, we average the content probabilities of all its words as the content score of $A_i$. We also predict the verification score for $A_i$ using the verification model. The final answer is then selected from all the answer candidates according to the product of these three scores.
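The prediction procedure above can be sketched as follows. The dictionary keys, the function names `best_span` and `select_answer`, and the optional `max_len` cap on span length are hypothetical; the three scores and their product follow the description in the text.

```python
def best_span(start_probs, end_probs, max_len=None):
    """Return (boundary_score, start, end) for the span maximizing start * end."""
    best = (0.0, 0, 0)
    n = len(start_probs)
    for s in range(n):
        last = n if max_len is None else min(n, s + max_len)
        for e in range(s, last):
            score = start_probs[s] * end_probs[e]
            if score > best[0]:
                best = (score, s, e)
    return best


def select_answer(passages):
    """Pick the candidate with the largest boundary * content * verification.

    passages: list of dicts with keys "start_probs", "end_probs",
    "content_probs" (per word) and "verification" (one score per passage).
    Returns (passage_index, (start, end), final_score).
    """
    best_index, best_range, best_score = None, None, -1.0
    for i, p in enumerate(passages):
        boundary, s, e = best_span(p["start_probs"], p["end_probs"])
        # Content score: mean content probability of the words in the span.
        content = sum(p["content_probs"][s:e + 1]) / (e - s + 1)
        final = boundary * content * p["verification"]
        if final > best_score:
            best_index, best_range, best_score = i, (s, e), final
    return best_index, best_range, best_score
```

Note that the per-passage probabilities need not sum to one, since the boundary distributions are defined over the concatenation of all passages.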

Experiments
To verify the effectiveness of our model on multi-passage machine reading comprehension, we conduct experiments on the MS-MARCO (Nguyen et al., 2016) and DuReader (He et al., 2017) datasets. Our method achieves the state-of-the-art performance on both datasets.

Datasets
We choose the MS-MARCO and DuReader datasets to test our method, since both of them are designed for MRC in real-world settings.
One prerequisite for answer verification is that there should be multiple correct answers so that they can verify each other. Both the MS-MARCO and DuReader datasets require the human annotators to generate multiple answers if possible. Table 2 shows the proportion of questions that have multiple answers. However, the same answer that occurs many times is treated as one single answer here. Therefore, we also report the proportion of questions that have multiple answer spans matching the human-generated answers. A span is taken as valid if it achieves an F1 score larger than 0.7 compared with any reference answer. From these statistics, we can see that the phenomenon of multiple answers is quite common for both MS-MARCO and DuReader. These answers will provide strong signals for answer verification if we can leverage them properly.
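The span validity criterion can be illustrated with a token-level F1. This is a sketch under the assumption that F1 is computed over bags of tokens, which is the common convention but is not spelled out in the text; `token_f1` and `is_valid_span` are hypothetical names.

```python
from collections import Counter


def token_f1(candidate_tokens, reference_tokens):
    """Bag-of-tokens F1 between a candidate span and a reference answer."""
    common = Counter(candidate_tokens) & Counter(reference_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)


def is_valid_span(span_tokens, references, threshold=0.7):
    """A span is valid if its F1 with any reference exceeds the threshold."""
    return any(token_f1(span_tokens, ref) > threshold for ref in references)
```

Counting the valid spans per question then gives the kind of statistic reported for questions with multiple answer spans.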

Implementation Details
For MS-MARCO, we preprocess the corpus with the reversible tokenizer from Stanford CoreNLP, and we choose the span that achieves the highest ROUGE-L score with the reference answers as the gold span for training. We employ the 300-D pre-trained GloVe embeddings (Pennington et al., 2014) and keep them fixed during training. The character embeddings are randomly initialized with a dimension of 30. For DuReader, we follow the preprocessing described in He et al. (2017).
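Selecting the gold span by ROUGE-L can be sketched with a brute-force search over spans. The `beta` value, the `max_len` cap and the function names are assumptions made for the example; the text only states that the span with the highest ROUGE-L against the reference answers is chosen.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.2):
    """LCS-based F-measure; beta weights recall and is an assumed setting."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)


def gold_span(passage_tokens, references, max_len=30):
    """Return (score, (start, end)) of the span with the highest ROUGE-L
    against any reference; (0.0, None) if no span overlaps a reference."""
    best = (0.0, None)
    n = len(passage_tokens)
    for start in range(n):
        for end in range(start, min(n, start + max_len)):
            span = passage_tokens[start:end + 1]
            score = max(rouge_l(span, ref) for ref in references)
            if score > best[0]:
                best = (score, (start, end))
    return best
```

The search is quadratic in the passage length, which is acceptable for preprocessing but would be too slow inside a training loop.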
We tune the hyper-parameters according to the . . .
Two simple yet effective techniques are employed to improve the final performance on these two datasets respectively. For MS-MARCO, approximately 8% of the questions have Yes or No as the answer, which usually cannot be solved by an extractive approach (Tan et al., 2017). We address this problem by training a simple Yes/No classifier for those questions with certain patterns (e.g., starting with "is"). Concretely, we simply change the output layer of the basic boundary model so that it can predict whether the answer is "Yes" or "No". For DuReader, the retrieved document usually contains a large number of paragraphs that cannot be fed into MRC models directly (He et al., 2017). The original paper employs a simple heuristic strategy to select a representative paragraph for each document, while we train a paragraph ranking model for this. We will demonstrate the effects of these two techniques later.

Results on MS-MARCO
Table 3 shows the results of our system and other state-of-the-art models on the MS-MARCO test set. We adopt the official evaluation metrics, including ROUGE-L (Lin, 2004) and BLEU-1 (Papineni et al., 2002). As we can see, for both metrics, our single model outperforms all the other competing models with an evident margin, which is extremely hard considering the near-human performance. If we ensemble the models trained with different random seeds and hyper-parameters, the results can be further improved and outperform the ensemble model in Tan et al. (2017), especially in terms of BLEU-1.
Table 3: Results on the MS-MARCO test set.
Model | ROUGE-L | BLEU-1
FastQA Ext (Weissenborn et al., 2017) | 33.67 | 33.93
Prediction (Wang and Jiang, 2016) | 37.33 | 40.72
ReasoNet (Shen et al., 2017) | 38.81 | 39.86
R-Net (Wang et al., 2017c) | 42.89 | 42.22
S-Net (Tan et al., 2017) | 45 . . . |

Results on DuReader
The results of our model and several baseline systems on the test set of DuReader are shown in Table 4. The BiDAF and Match-LSTM models are provided as two baseline systems (He et al., 2017). Based on BiDAF, as is described in Section 3.2, we tried a new paragraph selection strategy by employing a paragraph ranking (PR) model. We can see that this paragraph ranking can boost the BiDAF baseline significantly. Finally, we implement our system based on this new strategy, and our system (single model) achieves further improvement by a large margin.
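Table 6: Boundary, content and verification scores of the answer candidates for the example in Table 1.
Question: What is the difference between a mixed and pure culture
Answer Candidates | Boundary | Content | Verification
[1] A culture is a society's total way of living and a society is a group . . .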

Ablation Study
To get better insight into our system, we conduct an in-depth ablation study on the development set of MS-MARCO, which is shown in Table 5. Following Tan et al. (2017), we mainly focus on the ROUGE-L score that is averaged case by case. We first evaluate the answer verification by ablating the cross-passage verification model so that the verification loss and verification score will not be used during training and testing. Then we remove the content model in order to test the necessity of modeling the content of the answer. Since we don't have the content scores, we use the boundary probabilities instead to compute the answer representation for verification. Next, to show the benefits of joint training, we train the boundary model separately from the other two models. Finally, we remove the yes/no classification in order to show the real improvement of our end-to-end model compared with the baseline method that predicts the answer with only the boundary model.
From Table 5, we can see that the answer verification makes a great contribution to the overall improvement, which confirms our hypothesis that cross-passage answer verification is useful for multi-passage MRC. For the ablation of the content model, we find that it not only affects the content score itself but also harms the verification model, since the content probabilities are necessary for the answer representation; this is further analyzed in Section 4.3. Another finding is that jointly training the three models provides great benefits, which shows that the three tasks are closely related and can boost each other with shared representations at the bottom layers. Finally, comparing our method with the baseline, we achieve an improvement of nearly 3 points without the yes/no classification. This significant improvement demonstrates the effectiveness of our approach.

Case Study
To demonstrate how each module of our model takes effect when predicting the final answer, we conduct a case study in Table 6 with the same example that we discussed in Section 1. For each answer candidate, we list three scores predicted by the boundary model, content model and verification model respectively.
On the one hand, we can see that these three scores are generally correlated. For example, the second candidate is given the lowest scores by all three models. We believe this is because the models share the same encoding and matching layers at the bottom level, and this correlation guarantees that the content and verification models will not deviate from the boundary model too much. On the other hand, we also see that the verification score can really make a difference here when the boundary model makes an incorrect decision among the confusing answer candidates.

Necessity of the Content Model
In our model, we compute the answer representation based on the content probabilities predicted by a separate content model instead of directly using the boundary probabilities. We argue that this content model is necessary for our answer verification process. Figure 2 plots the predicted content probabilities as well as the boundary probabilities for a passage. We can see that the boundary and content probabilities capture different aspects of the answer. Since answer candidates usually have similar boundary words, if we compute the answer representation based on the boundary probabilities, it's difficult to model the real difference among different answer candidates. On the contrary, with the content probabilities, we pay more attention to the content part of the answer, which can provide more distinguishable information for verifying the correct answer. Furthermore, the content probabilities can also adjust the weights of the words within the answer span so that unimportant words (e.g. "and" and ".") get lower weights in the final answer representation. We believe that this refined representation is also good for the answer verification process.
Figure 2: The boundary probabilities and content probabilities for the words in a passage. Plotted series: start probability, end probability, content probability. Passage shown on the horizontal axis: "The noun charge unit has 1 sense: 1. a measure of the quantity of electricity (determined by the amount of an electric current and the time for which it flows). familiarity info: charge unit used as a noun is very rare."

Related Work
Machine reading comprehension has made rapid progress in recent years, especially for the single-passage MRC task, such as SQuAD (Rajpurkar et al., 2016). Mainstream studies (Seo et al., 2016; Wang and Jiang, 2016; Xiong et al., 2016) treat reading comprehension as extracting an answer span from the given passage, which is usually achieved by predicting the start and end position of the answer. We implement our boundary model similarly by employing the boundary-based pointer network (Wang and Jiang, 2016). Another inspiring work is from Wang et al. (2017c), where the authors propose to match the passage against itself so that the representation can aggregate evidence from the whole passage. Our verification model adopts a similar idea. However, we collect information across passages and our attention is based on the answer representation, which is much more efficient than attention over all passages. For the model training, Xiong et al. (2017) argue that the boundary loss encourages exact answers at the cost of penalizing overlapping answers. Therefore they propose a mixed objective that incorporates rewards derived from word overlap. Our joint training approach has a similar function. By taking the content and verification loss into consideration, our model gives a lower loss to overlapping answers than to unmatched answers, and our loss function is fully differentiable.
Recently, we also see emerging interest in multi-passage MRC from both the academic (Dunn et al., 2017; Joshi et al., 2017) and industrial communities (Nguyen et al., 2016; He et al., 2017). Early studies (Shen et al., 2017; Wang et al., 2017c) usually concatenate those passages and employ the same models designed for single-passage MRC. However, more recent studies have started to design specific methods that can read multiple passages more effectively. In the aspect of passage selection, Wang et al. (2017a) introduced a pipelined approach that ranks the passages first and then reads the selected passages for answering questions. Tan et al. (2017) treat the passage ranking as an auxiliary task that can be trained jointly with the reading comprehension model. Actually, the target of our answer verification is very similar to that of the passage selection, while we pay more attention to the answer content and the answer verification process.
Regarding answer verification, Wang et al. (2017b) have a similar motivation to ours. They attempt to aggregate the evidence from different passages and choose the final answer from n-best candidates. However, they implement their idea as a separate reranking step after reading comprehension, while our answer verification is a component of the whole model that can be trained end-to-end.

Conclusion
In this paper, we propose an end-to-end framework to tackle the multi-passage MRC task. We creatively design three different modules in our model, which can find the answer boundary, model the answer content and conduct cross-passage answer verification respectively. All three modules can be trained with different forms of the answer labels, and training them jointly can provide further improvement. The experimental results demonstrate that our model outperforms the baseline models by a large margin and achieves the state-of-the-art performance on two challenging datasets, both of which are designed for MRC on real web data.