Joint Training of Candidate Extraction and Answer Selection for Reading Comprehension

While sophisticated neural-based techniques have been developed in reading comprehension, most approaches model the answer in an independent manner, ignoring its relations with other answer candidates. This problem can be even worse in open-domain scenarios, where candidates from multiple passages should be combined to answer a single question. In this paper, we formulate reading comprehension as an extract-then-select two-stage procedure. We first extract answer candidates from passages, then select the final answer by combining information from all the candidates. Furthermore, we regard candidate extraction as a latent variable and train the two-stage process jointly with reinforcement learning. As a result, our approach has improved the state-of-the-art performance significantly on two challenging open-domain reading comprehension datasets. Further analysis demonstrates the effectiveness of our model components, especially the information fusion of all the candidates and the joint training of the extract-then-select procedure.


Introduction
Teaching machines to read and comprehend human languages is a long-standing objective in natural language processing. In order to evaluate this ability, reading comprehension (RC) is designed to answer questions through reading relevant passages. In recent years, RC has attracted intense interest. Various advanced neural models have been proposed along with newly released datasets (Hermann et al., 2015; Rajpurkar et al., 2016; Dunn et al., 2017; Dhingra et al., 2017b; He et al., 2017).

Table 1: The key information is marked in italic, which should be combined from different text pieces to select the correct answer "Cuba Libre".
Q: Cocktails: Rum, lime, and cola drink make a .
A: Cuba Libre
P1: Daiquiri, the custom of mixing lime with rum for a cooling drink on a hot Cuban day, has been around a long time.
P2: Cocktail recipe for a Daiquiri, a classic rum and lime drink that every bartender should know.
P3: Hemingway Special Daiquiri: Daiquiris are a family of cocktails whose main ingredients are rum and lime juice.
P4: A homemade Cuba Libre Preparation To make a Cuba Libre properly, fill a highball glass with ice and half fill with cola.
P5: The difference between the Cuba Libre and Rum is a lime wedge at the end.
Most existing approaches mainly focus on modeling the interactions between questions and passages (Dhingra et al., 2017a; Seo et al., 2017), paying less attention to information concerning answer candidates. However, when humans solve this problem, we often first read each piece of text, collect some answer candidates, then focus on these candidates and combine their information to select the final answer. This collect-then-select process can be even more important in open-domain scenarios, which require combining candidates from multiple passages to answer a single question. This phenomenon is illustrated by the example in Table 1.
With this motivation, we formulate an extract-then-select two-stage architecture to simulate the above procedure. The architecture contains two components: (1) an extraction model, which generates answer candidates, and (2) a selection model, which combines all these candidates and finds the final answer. However, the answer candidates to focus on are often unobservable, as most RC datasets only provide golden answers. Therefore, we treat candidate extraction as a latent variable and train these two stages jointly with reinforcement learning (RL).
In summary, our work makes the following contributions:
1. We formulate open-domain reading comprehension as a two-stage procedure, which first extracts answer candidates and then selects the final answer. With joint training, we optimize these two correlated stages as a whole.
2. We propose a novel answer selection model, which combines the information from all the extracted candidates using an attention-based correlation matrix. As shown in experiments, the information fusion is greatly helpful for answer selection.
3. With the two-stage framework and the joint training strategy, our method significantly surpasses the state-of-the-art performance on two challenging public RC datasets, Quasar-T (Dhingra et al., 2017b) and SearchQA (Dunn et al., 2017).

Related Work
In recent years, reading comprehension has made remarkable progress in methodology and dataset construction. Most existing approaches mainly focus on modeling sophisticated interactions between questions and passages, then use pointer networks (Vinyals et al., 2015) to directly model the answers (Dhingra et al., 2017a; Wang and Jiang, 2017; Seo et al., 2017). These methods prove to be effective on existing closed-domain datasets (Hermann et al., 2015; Hill et al., 2015; Rajpurkar et al., 2016).
More recently, open-domain RC has attracted increasing attention (Nguyen et al., 2016; Dunn et al., 2017; Dhingra et al., 2017b; He et al., 2017) and raised new challenges for question answering techniques. In these scenarios, a question is paired with multiple passages, which are often collected by exploiting unstructured documents or web data. The aforementioned approaches often rely on recurrent neural networks and sophisticated attention mechanisms, which are prohibitively time-consuming if all passages are concatenated. Therefore, some work has tried to alleviate this problem with a coarse-to-fine scheme. Wang et al. (2018a) combined a ranker for selecting the relevant passage and a reader for producing the answer from it. However, this approach depended on only one passage when producing the answer, and hence placed great demands on the precision of both components. Worse still, this framework cannot handle situations where multiple passages are needed to answer correctly. To aggregate evidence, Wang et al. (2018b) proposed a re-ranking method to resolve the above issue. However, their re-ranking stage was totally isolated from the candidate extraction procedure. In contrast to the re-ranking perspective, we propose a novel selection model to combine the information from all the extracted candidates. Moreover, with reinforcement learning, our candidate extraction and answer selection models can be learned jointly. Trischler et al. (2016) also proposed a two-step extractor-reasoner model, which first extracted the K most probable single-token answer candidates and then compared the hypotheses with all the sentences in the passage. However, in their work, each candidate was considered in isolation, and their objective only took the ground truths into account, in contrast to our RL treatment.
The training strategy employed in our paper is reinforcement learning, inspired by recent work applying it to question answering problems. The above-mentioned coarse-to-fine framework (Choi et al., 2017; Wang et al., 2018a) treated sentence selection as a latent variable and jointly trained the sentence selection module with the answer generation module via RL. Shen et al. (2017) modeled the multi-hop reasoning procedure with a termination state to decide when it is adequate to produce an answer; RL is suitable for capturing this stochastic behavior. Hu et al. (2018) modeled only the extraction process, using F1 as rewards in addition to maximum likelihood estimation. RL was utilized in their training process, as the F1 measure is not differentiable.

Two-stage RC Framework
In this work, we mainly consider open-domain extractive reading comprehension. In this scenario, a given question Q is paired with multiple passages $\mathcal{P} = \{P_1, P_2, \ldots, P_N\}$, based on which we aim to find the answer A. Moreover, the golden answers are almost always subspans appearing in some passages in $\mathcal{P}$. Our main framework consists of two parts: (1) extracting answer candidates $\mathcal{C} = \{C_1, C_2, \ldots, C_M\}$ from passages $\mathcal{P}$ and (2) selecting the final answer A from candidates $\mathcal{C}$. This process is illustrated in Figure 1. We design different models for each part and optimize them as a whole with joint reinforcement learning.

Candidate Extraction
We build the candidate set $\mathcal{C}$ by independently extracting K candidates from each passage $P_i$ according to the following distribution:

$$P(\mathcal{C} \mid Q, \mathcal{P}) = \prod_{i=1}^{N} P(\{C_{ij}\}_{j=1}^{K} \mid Q, P_i) \tag{1}$$

where $C_{ij}$ denotes the jth candidate extracted from the ith passage. K is set as a constant number in our formulation. Taking K = 2 as an example, we express each probability on the right side of Equation 1 through sampling without replacement:

$$P(\{C_{i1}, C_{i2}\}) = P(C_{i1}) \frac{P(C_{i2})}{1 - P(C_{i1})} + P(C_{i2}) \frac{P(C_{i1})}{1 - P(C_{i2})} \tag{2}$$

where we omit $Q, P_i$ to abbreviate the conditional distributions in Equation 1.
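As an illustration of the K = 2 case, the following minimal sketch computes the probability of an unordered candidate pair under two draws without replacement, assuming the second draw comes from the renormalized remaining distribution. The function name `pair_probability` and the toy probabilities are hypothetical and not taken from the paper.

```python
from itertools import combinations

def pair_probability(p, a, b):
    """Probability of drawing the unordered pair {a, b} in two draws without replacement."""
    # Draw a first and b second from the renormalized remainder, plus the reverse order.
    return p[a] * p[b] / (1.0 - p[a]) + p[b] * p[a] / (1.0 - p[b])

# Toy distribution over three candidate spans of one passage (made-up values).
span_probs = [0.5, 0.3, 0.2]
pair_probs = {
    (a, b): pair_probability(span_probs, a, b)
    for a, b in combinations(range(len(span_probs)), 2)
}
# The probabilities of all unordered pairs form a distribution.
total = sum(pair_probs.values())
```

Because every ordered draw is counted exactly once, the pair probabilities sum to one, so the set-level distribution over a passage stays properly normalized.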
Consequently, the basic block of our candidate extraction stage turns out to be the distribution of each candidate $P(C_{ij} \mid Q, P_i)$. In the rest of this subsection, we will elaborate on the model architecture concerning candidate extraction, which is displayed in Figure 2.

[Figure 2: Architecture of the candidate extraction model. Labeled components: Question and Passage inputs; Question & Passage Representation; Question & Passage Interaction; Candidate Scoring (Start, End).]
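Question & Passage Representation We embed the question $Q = \{x_Q^t\}_{t=1}^{l_Q}$ and its relevant passage $P = \{x_P^t\}_{t=1}^{l_P} \in \mathcal{P}$ with word vectors to form $\mathbf{Q} \in \mathbb{R}^{d_w \times l_Q}$ and $\mathbf{P} \in \mathbb{R}^{d_w \times l_P}$ respectively, where $d_w$ is the dimension of word embeddings, and $l_Q$ and $l_P$ are the lengths of Q and P.
We then feed $\mathbf{Q}$ and $\mathbf{P}$ to a bidirectional LSTM to form their contextual representations $H_Q \in \mathbb{R}^{d_h \times l_Q}$ and $H_P \in \mathbb{R}^{d_h \times l_P}$:

$$H_Q = \mathrm{BiLSTM}(\mathbf{Q}), \quad H_P = \mathrm{BiLSTM}(\mathbf{P}) \tag{3}$$

Question & Passage Interaction Modeling the interactions between questions and passages is a critical step in reading comprehension. Here, we adopt an attention mechanism similar to Lee et al. (2016) to generate the question-dependent passage representation $\tilde{H}_P$ (Equation 4). After concatenating the two kinds of passage representations $H_P$ and $\tilde{H}_P$, we use another bidirectional LSTM to get the final representation of every position in passage P as $G_P \in \mathbb{R}^{d_g \times l_P}$:

$$G_P = \mathrm{BiLSTM}([H_P; \tilde{H}_P]) \tag{5}$$

Candidate Scoring Then we use two linear transformations $w_b \in \mathbb{R}^{1 \times d_g}$ and $w_e \in \mathbb{R}^{1 \times d_g}$ to calculate the begin and end scores for each position:

$$b_P = w_b G_P, \quad e_P = w_e G_P \tag{6}$$

At last, we model the probability of every subspan in passage P as a candidate $C = \{x_P^t\}_{t=C_b}^{C_e}$ according to its begin and end positions:

$$P(C \mid Q, P) = \frac{\exp(b_P^{C_b} + e_P^{C_e})}{\sum_{C'} \exp(b_P^{C'_b} + e_P^{C'_e})} \tag{7}$$

In this definition, the probabilities of all the valid answer candidates are already normalized.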
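The span-level normalization described above can be sketched as follows. This is a minimal illustration, assuming each span is scored by the sum of its begin and end scores and normalized with a softmax over all valid spans (begin position not after end position). The function name and the optional `max_len` cap are hypothetical additions, and the scores are made-up values.

```python
import math

def span_distribution(begin_scores, end_scores, max_len=None):
    """Softmax over all valid spans (begin <= end) of begin score + end score."""
    n = len(begin_scores)
    logits = {}
    for b in range(n):
        for e in range(b, n):
            # max_len is an assumed convenience option, not part of the paper's description.
            if max_len is not None and e - b + 1 > max_len:
                continue
            logits[(b, e)] = begin_scores[b] + end_scores[e]
    # Subtract the largest logit before exponentiating for numerical stability.
    top = max(logits.values())
    exp = {span: math.exp(v - top) for span, v in logits.items()}
    z = sum(exp.values())
    return {span: v / z for span, v in exp.items()}

# Toy begin/end scores for a three-token passage.
probs = span_distribution([2.0, 0.1, -1.0], [0.0, 1.5, 0.3])
best_span = max(probs, key=probs.get)
```

Normalizing over spans rather than over begin and end positions separately means that a single probability is assigned to each candidate, which is what the candidate sampling step needs.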

Answer Selection
As the second part of our framework, the answer selection model finds the most probable answer by calculating $P(C \mid Q, \mathcal{P}, \mathcal{C})$ for each candidate $C \in \mathcal{C}$. The model architecture is illustrated in Figure 3. Notably, the selection model receives the candidate set $\mathcal{C}$ as additional information. This more focused information allows the model to combine evidence from all the candidates, which is useful for selecting the best answer.
For ease of understanding, we briefly describe the selection stage as follows. After being extracted from a single passage, a candidate borrows information from other candidates across different passages. With this global information, the passage is reread to confirm the correctness of the candidate further. The following are details about the selection model.

Question Representation
Questions are fundamental for finding the correct answer. As in the extraction model, we embed the question Q with word vectors to form $\mathbf{Q} \in \mathbb{R}^{d_w \times l_Q}$. Then we use a bidirectional LSTM to establish its contextual representation:

$$S_Q = \mathrm{BiLSTM}(\mathbf{Q}) \tag{8}$$

A max-pooling operation across all the positions follows to get the condensed vector representation:

$$r_q = \mathrm{MaxPooling}(S_Q) \tag{9}$$

Passage Representation Assume the candidate C is extracted from the passage $P \in \mathcal{P}$. To be informed of C, we first build the representation of P. For every word in P, three kinds of features are utilized:
• Word embedding: each word expresses its basic feature with the word vector.
• Common word: the feature has value 1 when the word occurs in the question, otherwise 0.
• Question independent representation: the condensed representation $r_q$.
With these features, information not only in Q but also in P is considered. By concatenating them, we get $r_P^t$ corresponding to every position t in passage P. Then with another bidirectional LSTM, we fuse these features to form the contextual representation of P as $S_P \in \mathbb{R}^{d_s \times l_P}$:

$$S_P = \mathrm{BiLSTM}([r_P^1, \ldots, r_P^{l_P}]) \tag{10}$$

Candidate Representation Candidates provide more focused information for answer selection. Therefore, for each candidate, we first build its independent representation according to its position in the passage, then construct the candidates fused representation by combining other correlated candidates.
Given the candidate $C = \{x_P^t\}_{t=C_b}^{C_e}$ in the passage P, we extract its corresponding span from $S_P = \{s_P^t\}_{t=1}^{l_P}$ to form $S_C = \{s_P^t\}_{t=C_b}^{C_e}$ as its contextual encoding. Moreover, we calculate its condensed vector representation $r_C$ from its begin and end positions (Equation 11), where $W_b \in \mathbb{R}^{d_c \times d_s}$ and $W_e \in \mathbb{R}^{d_c \times d_s}$. To model the interactions among all the answer candidates, we calculate the correlations of the candidate C, which is assumed to be indexed by j in $\mathcal{C}$, with the others $\{C_m\}_{m=1, m \neq j}^{M}$ via an attention mechanism (Equation 12), where $W_c \in \mathbb{R}^{d_c \times d_c}$, $W_o \in \mathbb{R}^{d_c \times d_c}$ and $w_v \in \mathbb{R}^{1 \times d_c}$ are linear transformations to capture the intensity of each interaction.
In this way, we form a correlation matrix $V \in \mathbb{R}^{M \times M}$, where M is the total number of candidates. With the correlation matrix, for the candidate C, we normalize its interactions via a softmax operation, which emphasizes the influence of stronger interactions:

$$\alpha_j^m = \frac{\exp(V_{jm})}{\sum_{m' \neq j} \exp(V_{jm'})} \tag{13}$$

To take into account the different influences of all the other candidates, it is sensible to generate a candidates fused representation according to the above normalized interactions:

$$\tilde{r}_C = \sum_{m \neq j} \alpha_j^m \, r_{C_m} \tag{14}$$

In this formulation, all the other candidates contribute their influences to the fused representation through their interactions with C, so information from different passages is gathered together. In our experiments, this kind of information fusion is the key to the performance improvements.
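The normalization and fusion steps described above can be sketched in a few lines. This is only an illustration under stated assumptions: the correlation matrix `V` and the candidate vectors `reps` are made-up toy values, the function name is hypothetical, and each candidate is assumed to attend only to the other candidates (the diagonal of `V` is ignored).

```python
import math

def fuse_candidates(V, reps):
    """For each candidate j, softmax V[j][m] over m != j and take the weighted sum of the others."""
    M = len(reps)
    fused = []
    for j in range(M):
        others = [m for m in range(M) if m != j]
        # Softmax over the correlations with the other candidates (stabilized by the max).
        top = max(V[j][m] for m in others)
        weights = [math.exp(V[j][m] - top) for m in others]
        z = sum(weights)
        vec = [0.0] * len(reps[j])
        for w, m in zip(weights, others):
            for d in range(len(vec)):
                vec[d] += (w / z) * reps[m][d]
        fused.append(vec)
    return fused

# Toy example: three candidates with 2-dimensional representations.
V = [[0.0, 2.0, 0.0],
     [2.0, 0.0, 0.0],
     [0.0, 0.0, 0.0]]
reps = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = fuse_candidates(V, reps)
```

In this toy setting, candidates 0 and 1 correlate strongly, so the fused vector of candidate 0 is dominated by the representation of candidate 1, while candidate 2 receives a plain average of the other two.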
Passage Advanced Representation As more focused information about the candidate C is available, we have a better way to confirm its correctness by rereading its corresponding passage P. Specifically, we equip each position t in P with the following advanced features:
• Passage contextual representation: the former passage representation $s_P^t$.
• Candidate-dependent passage representation: replace $H_Q$ with $S_C$ and $H_P$ with $S_P$ in Equation 4 to model the interactions between candidates and passages to form $\tilde{s}_P^t$.
• Candidate related distance feature: the relative distance to the candidate C can be a reference of the importance of each position.
• Candidate independent representation: use $r_C$ to consider the concerned candidate C.
• Candidates fused representation: use $\tilde{r}_C$ to consider all the other candidates interacting with the concerned candidate C.
With these features, we capture the information from the question, the passages and all the candidates. By concatenating them, we get $u_P^t$ at every position in the passage P. Combining these features with a bidirectional LSTM, we get:

$$F_P = \mathrm{BiLSTM}([u_P^1, \ldots, u_P^{l_P}]) \tag{15}$$

Answer Scoring At last, max pooling over each dimension of $F_P$ is performed, resulting in a condensed vector representation, which contains all the concerned information about a candidate:

$$z_C = \mathrm{MaxPooling}(F_P) \tag{16}$$

The final score of this candidate as the answer is calculated via a linear transformation, which is then normalized across all the candidates:

$$P(C \mid Q, \mathcal{P}, \mathcal{C}) = \frac{\exp(w_z z_C)}{\sum_{C' \in \mathcal{C}} \exp(w_z z_{C'})} \tag{17}$$
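The scoring step can be illustrated with a small sketch: max pooling over positions, a linear score, and a softmax across candidates. The encodings, the weight vector `w`, and the function name are hypothetical toy values chosen for illustration; in the model they would come from the bidirectional LSTM and learned parameters.

```python
import math

def score_candidates(passage_encodings, w):
    """Max-pool each candidate's passage encoding over positions, score it linearly,
    then normalize the scores across all candidates with a softmax."""
    scores = []
    for F in passage_encodings:  # F: list of per-position feature vectors
        pooled = [max(column) for column in zip(*F)]  # max over positions, per dimension
        scores.append(sum(wi * zi for wi, zi in zip(w, pooled)))
    top = max(scores)
    exp = [math.exp(s - top) for s in scores]
    z = sum(exp)
    return [e / z for e in exp]

# Two candidates, each with a 2-position, 2-dimensional toy encoding.
F_a = [[0.2, 0.1], [0.9, 0.3]]
F_b = [[0.1, 0.0], [0.2, 0.4]]
probs = score_candidates([F_a, F_b], w=[1.0, 1.0])
```

Since the softmax runs over the whole candidate set, the score of one candidate is always interpreted relative to the others, regardless of which passage each candidate came from.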

Joint Training with RL
In our formulation, the answer candidate set influences the result of answer selection to a large extent. However, with only golden answers provided in the training data, it is not apparent which candidates should be considered further. To alleviate this problem, we treat candidate extraction as a latent variable and jointly train the extraction model and the selection model with reinforcement learning. Formally, two kinds of actions are modeled in the extraction and selection stages. The action space for the extraction model is to select from different candidate sets, which is formulated by Equation 1. The action space for the selection model is to select from all extracted candidates, which is formulated by Equation 17. Our goal is to select the final answer that leads to a high reward. Inspired by Wang et al. (2018a), we define the reward of a candidate to reflect its accordance with the golden answer, where $f1(\cdot, \cdot) \in [0, 1]$ is the function that measures the word-level F1 score between two sequences. Incorporating this reward relaxes the overly strict requirements set by traditional maximum likelihood estimation and keeps training consistent with the evaluation methods used in our experiments.
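For concreteness, a word-level F1 score between a candidate and a golden answer can be computed as in the sketch below. It follows the common bag-of-words definition; the lowercasing and whitespace tokenization are assumptions made for this illustration, and the sketch covers only the F1 function, not the full reward definition.

```python
from collections import Counter

def word_f1(prediction, gold):
    """Word-level F1 between two strings, using bag-of-words overlap."""
    # Lowercasing and whitespace splitting are simplifying assumptions.
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # Multiset intersection counts each shared word at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

exact = word_f1("Cuba Libre", "cuba libre")
partial = word_f1("cuba", "cuba libre")
miss = word_f1("daiquiri", "cuba libre")
```

A partially overlapping candidate receives a score strictly between 0 and 1, which gives the training signal a graded notion of correctness that an exact-match criterion lacks.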
The learning objective is to maximize the expected reward modeled by our framework, where θ stands for all the parameters involved. Following the REINFORCE algorithm, we approximate the gradient of this objective with a sampled candidate set, $\mathcal{C} \sim P(\mathcal{C} \mid Q, \mathcal{P})$.
In Quasar-T (Dhingra et al., 2017b), each question is paired with 100 sentence-level passages retrieved from ClueWeb09 (Callan et al., 2009) based on Lucene.
SearchQA (Dunn et al., 2017) starts from existing question-answer pairs, which are crawled from J! Archive, and is augmented with text snippets retrieved by Google, resulting in more than 140,000 question-answer pairs, each with 49.6 snippets on average. The detailed statistics of these two datasets are shown in Table 2.

Model Settings
We initialize word embeddings with the 300-dimensional GloVe vectors. All the bidirectional LSTMs have 1 layer and 100 hidden units. All the linear transformations use 100 as the output dimension. The common word feature and the candidate related distance feature are embedded with vectors of dimension 4 and 50 respectively. By default, we set K to 2 in Equation 1, which means each passage generates two candidates based on the extraction model.
For ease of training, we first initialize our models by maximum likelihood estimation and then fine-tune them with RL. A similar training strategy is commonly employed when an RL process is involved (Ranzato et al., 2015; Li et al., 2016a; Hu et al., 2018). To pre-train the extraction model, we only use passages containing ground truths as training data. The log likelihood of Equation 7 is taken as the training objective for each question and passage pair. After pre-training the extraction model, we use it to generate the two top-scoring candidates from each passage, forming the training data to pre-train our selection model, and maximize the log likelihood of Equation 17 as our second objective. In pre-training, we use a batch size of 30 for the extraction model and 20 for the selection model, and RMSProp (Tieleman and Hinton, 2012) with an initial learning rate of 2e-3. In fine-tuning with RL, we use a batch size of 5 and RMSProp with an initial learning rate of 1e-4. Also, we use a dropout rate of 0.1 in each training procedure.
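The settings reported in this subsection can be collected in one place, for example as a plain configuration dictionary. The values below are those stated in the text; the key names and the nesting are hypothetical and chosen only for readability.

```python
# Hyperparameters as reported in the text; key names are illustrative assumptions.
CONFIG = {
    "embedding_dim": 300,            # GloVe word vectors
    "lstm_layers": 1,
    "lstm_hidden_units": 100,
    "linear_output_dim": 100,
    "common_word_feature_dim": 4,
    "distance_feature_dim": 50,
    "candidates_per_passage": 2,     # K in Equation 1
    "dropout": 0.1,
    "pretrain": {
        "optimizer": "RMSProp",
        "learning_rate": 2e-3,
        "batch_size": {"extraction": 30, "selection": 20},
    },
    "rl_finetune": {
        "optimizer": "RMSProp",
        "learning_rate": 1e-4,
        "batch_size": 5,
    },
}
```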

Experimental Results
In addition to results of previous work, we add two baselines to demonstrate the effectiveness of our framework. The first baseline only applies the extraction model to score the answers, which is aimed at explaining the importance of the selection model. The second one only uses the pre-trained extraction model and selection model to illustrate the benefits from our joint training schema.
As seen from the results on Quasar-T, our quite simple extraction model alone almost reaches the state-of-the-art result among methods without re-rankers. The combination of the extraction and selection models exceeds our extraction baseline by a great margin, and also surpasses the best single re-ranker of Wang et al. (2018b). This result illustrates the necessity of introducing the selection model, which incorporates information from all the candidates. Finally, by joint training with RL, our method performs better even than the ensemble of three different re-rankers.
On SearchQA, we find that our extraction model alone does not perform as well as the state-of-the-art model without re-rankers. However, the improvement brought by our selection model, whether trained in isolation or jointly, still demonstrates the importance of our two-stage framework. Not surprisingly, our isolated training strategy still lags behind the single re-ranker proposed by Wang et al. (2018b), partly because of the deficiency of our extraction model. However, uniting our extraction and selection models with RL makes up the disparity, and the performance surpasses the ensemble of three different re-rankers, let alone the result of any single re-ranker.

Further Analysis
Effect of Features in Selection Model As the incorporation of the selection model improves the overall performance significantly, we conduct an ablation analysis on Quasar-T to demonstrate the effectiveness of its major components. As shown in Table 4, all the components modeling the selection procedure play important roles in our final architecture. Specifically, introducing the independent representation of the question and its common words with the passage seems an efficient way to consider the information of questions, which is consistent with previous work (Li et al., 2016b).
As for features related to candidates, incorporating the candidate independent information also contributes to the final result. These features include the candidate-dependent passage representation, the candidate independent representation and the candidate related distance feature.
Most importantly, the candidates fused representation, which combines the information from all the candidates, demonstrates its indispensable role in candidate modeling, with a performance drop of nearly 8% when discarded. This phenomenon also verifies the necessity of our extract-then-select procedure, showing the importance of combining information scattered in different text pieces when picking out the final answer.

Table 5:
Q: Cocktails : Rum , lime , and cola drink make a .
A: Cuba Libre
P1: In Nicaragua , when it is mixed using Flor de Caña -LRB- the national brand of rum -RRB- and cola , it is called a Nica Libre .
P2: The drink ... Daiquiri The custom of mixing lime with rum for a cooling drink on a hot Cuban day has been around a long time .
P3: If you only learn to make two cocktails , the Manhattan should be one of them .
P4: Daiquiri Cocktail recipe for a Daiquiri , a classic rum and lime drink that every bartender should know .
P5: Hemingway Special Daiquiri : Daiquiris are a family of cocktails whose main ingredients are rum and lime juice .
P6: In the Netherlands the drink is commonly called Baco , from the two ingredients of Bacardi rum and cola .
P7: A homemade Cuba Libre Preparation To make a Cuba Libre properly , fill a highball glass with ice and half fill with cola .
P8: Bacardi Cocktail Cocktail recipe for a Bacardi Cocktail , a classic cocktail of Bacardi rum , lemon or lime juice and grenadine Roy Rogers -LRB- non-alcoholic -RRB- Cocktail recipe for a Roy Rogers ,
P9: Margarita Cocktail recipe for a Margarita , a popular refreshing tequila and lime drink for summer .
P10: The difference between the Cuba Libre and Rum is a lime wedge at the end .

Example for Candidates Fused Representation
We conduct a case study to further demonstrate the importance of the candidates fused information. In Table 5, each candidate only partly matches the description of the question in its independent context. To correctly answer the question, information in P7 and P10 should be combined. In experiments, our selection model provides the correct answer, while the wrong candidate "Daiquiri", a different kind of cocktail, is selected if the candidates fused representation is discarded. The attention map established when modeling the fusion of candidates (corresponding to Equation 13) in this example is illustrated in Figure 4, in which we can see the interactions among all the candidates from different passages. In this figure, it is obvious that the interaction of "Cuba Libre" in P7 and P10 is the key to answering the question correctly.

Effect of Candidate Number
The candidate extraction stage plays an important role in deciding what information should be focused on further. Therefore, we also test the influence of different K when extracting candidates from each passage. The results are shown in Table 6. Taking K = 1 degrades the performance, which conforms to expectation, as fewer correct candidates are included in this stricter situation. However, taking K = 3 does not improve the performance further. Although a larger K means a higher probability of including good answers, it makes it more challenging for the selection model to pick out the correct one from a more varied candidate set.

Conclusion
In this paper, we formulate the problem of RC as a two-stage process, which first generates candidates with an extraction model, then selects the final answer by combining the information from all the candidates. Furthermore, we treat candidate extraction as a latent variable and jointly train these two stages with RL. Experiments on the public open-domain RC datasets Quasar-T and SearchQA show the necessity of introducing the selection model and the effectiveness of fusing candidate information. Moreover, our joint training strategy leads to significant improvements in performance.