Ranking and Sampling in Open-Domain Question Answering

Open-domain question answering (OpenQA) aims to answer questions based on a number of unlabeled paragraphs. Existing approaches typically follow the distantly supervised setup, in which some of the paragraphs are mislabeled (noisy), and mainly rely on the paragraph-question relevance to denoise. However, the paragraph-paragraph relevance, which can aggregate evidence among relevant paragraphs, can also be used to discover more useful paragraphs. Moreover, current approaches mainly focus during training on the positive paragraphs, which are known to contain the answer. This hurts the generalization ability of the model and leaves it easily misled by similar but irrelevant (distracting) paragraphs during testing. In this paper, we first introduce a ranking model that leverages both the paragraph-question and the paragraph-paragraph relevance to compute a confidence score for each paragraph. Furthermore, based on the scores, we design a modified weighted sampling strategy for training to mitigate the influence of the noisy and distracting paragraphs. Experiments on three public datasets (Quasar-T, SearchQA and TriviaQA) show that our model advances the state of the art.


Introduction
Different from the traditional reading comprehension (RC) task, which aims to answer a question based on a given paragraph, the reading paragraphs in OpenQA datasets are collected via an information retrieval (IR) system. For example, given a question as shown in Figure 1, an OpenQA system usually coarsely retrieves paragraphs that are similar to the question. Therefore, the OpenQA model has to answer a question based on numerous paragraphs. As can be seen in Figure 1 (where the key information, the answers in correctly labeled contexts and the answers in wrongly labeled contexts are marked in blue, green and red respectively), some of the negative paragraphs are similar to the question but lack the answer string, e.g. Paragraph4. We consider these paragraphs as distracting ones. Besides, in the distantly supervised setup (Mintz et al., 2009), it is postulated that the paragraphs that contain the answer string are ground truths, yet some of these positive paragraphs are wrongly labeled because they do not concern the question, e.g. Paragraph3. We consider these paragraphs as noisy ones. As a result, only Paragraph1 and Paragraph2 provide a relevant answer to the question, and the distracting and noisy paragraphs prevent the model from identifying the correct answer.
Recent OpenQA approaches typically follow the retrieve-then-read pipeline and can be roughly divided into two categories: single-paragraph approaches (Joshi et al., 2017; Wang et al., 2018a) and multi-paragraph approaches (Chen et al., 2017; Clark and Gardner, 2018; Pang et al., 2019). Single-paragraph approaches mainly focus on the most relevant paragraph during reading. Multi-paragraph approaches apply an RC model to multiple paragraphs to extract the final answer. However, these approaches still face two major issues.
First, as described above, recent methods only consider the paragraph-question relevance when selecting paragraphs and neglect the paragraph-paragraph relevance, which can be exploited to associate relevant paragraphs that are useful to the downstream RC task. Second, some previous approaches only take into account the positive paragraphs, which are known to contain the answer string, at the training stage. This makes the model too confident in heuristics or patterns that are only effective in positive paragraphs. As a result, the model generalizes poorly and is easily misled by the distracting paragraphs during testing. This phenomenon has also been observed by Clark and Gardner (2018). Therefore, a more carefully designed training strategy is needed.
To address the two issues, we first propose an attention-based ranking model which utilizes both the question-paragraph and the paragraph-paragraph relevance to compute a more accurate confidence score for each paragraph. Through the multi-level attention mechanism, the model can utilize the word-level and sentence-level information between the paragraph and the question. Through the sentence-level self-attention mechanism, the model can aggregate the effective evidence among relevant paragraphs and thus increase their confidence scores.
Second, based on the confidence scores, we design a modified weighted sampling strategy to select training paragraphs, which simultaneously ameliorates the influence of the distracting and the noisy paragraphs. Through "sampling", the model is prevented from being too confident in heuristics that are only effective in positive paragraphs. Through "weighted", the model is less affected by the noisy paragraphs. After that, we concatenate the selected paragraphs and feed them to an RC model to predict the final answer.
We evaluate our work on three distantly supervised OpenQA datasets: Quasar-T (Dhingra et al., 2017b), SearchQA (Dunn et al., 2017) and TriviaQA (Joshi et al., 2017). Empirical results show that our model is effective in improving the performance of OpenQA and advances the state of the art on all three datasets. We additionally perform an ablation study on our model to give insights into the contribution of each submodule. We will release our code on GitHub for further research explorations (https://github.com/xuyanfu/RASAOpenQA).

Methodology

Task Definition
OpenQA aims to answer a question based on a number of paragraphs retrieved by IR systems. Formally, given a question containing $m$ words, $Q = \{q_1, q_2, \ldots, q_m\}$, and a set of $n$ retrieved paragraphs $D = \{P_1, P_2, \ldots, P_n\}$, where $P_i = \{p_i^1, p_i^2, \ldots, p_i^{|p_i|}\}$ is the $i$-th paragraph and $|p_i|$ is the number of words in the $i$-th paragraph, our model is supposed to extract the answer from the paragraph collection $D$. Figure 2 gives an overview of our OpenQA model, which is composed of two modules: a ranker and a reader. The ranker is responsible for computing a confidence score for each paragraph. Based on the confidence scores, we design five strategies to select $k$ paragraphs, which are then concatenated and passed to the reader. Note that $k$ is a hyperparameter. Finally, a sub-phrase of the concatenated paragraph is predicted as the answer according to the reader's output, which is the probability distribution of the start and end positions.

Paragraph Ranker
We adopt a ranker to produce a confidence score for each paragraph. The ranker is comprised of an encoding layer, a word-level matching layer, a sentence-level matching layer and a sentence-level decoding layer.

Encoding Layer
Given a paragraph $P_i = \{p_i^1, p_i^2, \ldots, p_i^{|p_i|}\}$ and a question $Q = \{q_1, q_2, \ldots, q_m\}$, we map each word into a vector by combining the following features:
Word Embeddings: We use pre-trained word vectors, GloVe (Pennington et al., 2014), to obtain the fixed embedding of each word.
Char Embeddings: We map the characters in a word into 20-dimensional vectors, which are then passed to a convolutional layer and a max pooling layer to obtain a fixed-size vector of the word.
Common Word: This feature is set to 1 if the word appears in both the question and the paragraph, and to 0 otherwise. The feature is then mapped into a vector of fixed size.
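The common-word feature is simple enough to illustrate directly. The sketch below is only an illustration under stated assumptions: the function name `common_word_feature` is hypothetical, tokens are assumed to be whitespace-split, and matching is assumed to be case-insensitive, none of which is specified in the text.

```python
def common_word_feature(paragraph_tokens, question_tokens):
    """Return 1 for each paragraph token that also occurs in the question, else 0."""
    # Assumption: comparison is case-insensitive on pre-tokenized words.
    question_vocab = {tok.lower() for tok in question_tokens}
    return [1 if tok.lower() in question_vocab else 0 for tok in paragraph_tokens]


question = "Who was born on Krypton ?".split()
paragraph = "Superman was born on the planet Krypton .".split()
flags = common_word_feature(paragraph, question)
print(list(zip(paragraph, flags)))
```

In the model, each binary flag would then be looked up in a small trainable embedding table before being concatenated with the word and character features.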
By concatenating the above features, we obtain the encoded sequence of the paragraph $P^{emb}$ ... of words in the $i$-th paragraph. After that, a bidirectional LSTM is used to calculate a contextual encoding for each word in the paragraph and the question, yielding $P^{enc}_i$ and $Q^{enc}$ respectively.

Figure 2: The framework of our model, which consists of a paragraph ranker (left) and a paragraph reader (right). The ranker first computes a confidence score for each paragraph, and k paragraphs are selected through the ranking and sampling strategies. Then, the reader concatenates the selected paragraphs and computes the start score and the end score for each word in the concatenated paragraph. (Panel labels: Encoding, Word-Level Matching, Sen-Level Matching, Word-Level Self-Attention, Sen-Level Self-Attention.)

Word-level Matching Layer
The word-level matching layer takes $P^{enc}_i$ and $Q^{enc}$ as inputs and produces a new question-aware representation $C_i$ for each paragraph. Following Clark and Gardner (2018), we first perform bidirectional attention between the question $Q^{enc}$ and the paragraph $P^{enc}_i$, which gives rise to $H_i \in \mathbb{R}^{d_{hid} \times |p_i|}$ for $P^{enc}_i$. Next, we feed $H_i$ to the self-attention layer and obtain $E_i \in \mathbb{R}^{d_{hid} \times |p_i|}$. Both steps are implemented exactly as described by Clark and Gardner (2018). After that, we use a bidirectional LSTM to obtain the question-aware paragraph embedding $C_i$ for the paragraph $P_i$.

Sentence-level Matching Layer
Given $D^c = \{C_1, C_2, \ldots, C_n\}$ and $Q^{enc}$, the sentence-level matching layer produces a sentence-level representation $Z_i$ for each paragraph by exploiting both the paragraph-question and the paragraph-paragraph relevance.
First, we use self-attention pooling to obtain a fixed-size vector for the encoded question representation. Each position in the encoded question representation is assigned a score computed by a two-layer multi-layer perceptron (MLP). This score is normalized and used to compute a weighted sum over the columns of the question representation, which gives the pooled question vector $Q^{pool}$. Here, $\alpha_t$ is the normalized score for the $t$-th word in the question, and $W_1 \in \mathbb{R}^{d_{hid} \times d_{hid}}$ and $W_2 \in \mathbb{R}^{d_{hid}}$ are trainable parameters of the MLP.
Next, we use co-attention pooling (CAP) to get a fixed-length summary vector for each paragraph. The score for each position in $C_i$ is calculated by measuring the similarity between $C_i$ and $Q^{pool}$; the normalized score $\bar{\alpha}_t$ for the $t$-th position in $C_i$ is then used to compute a weighted sum over the columns of the question-aware paragraph embedding $C_i$. Let $C^{pool}$ represent the sequence of summary vectors for all paragraphs.
Afterwards, in order to make full use of the paragraph-paragraph relevance, we use sentence-level self-attention (SSA) and a bidirectional LSTM to produce the sentence-level paragraph representation $Z$. In the SSA, $A$ is the similarity matrix, and $A_{ij}$ indicates the similarity between $C^{pool}_{:i}$ and $C^{pool}_{:j}$. $U_{:i}$, the $i$-th column of $U$, is the representation of the $i$-th paragraph in which information from all the retrieved paragraphs is encompassed. $U$ and $C^{pool}$ are then combined to yield $G$ using element-wise multiplication ($\circ$), element-wise subtraction ($-$) and vector concatenation across rows ($[;]$). Finally, a bidirectional LSTM is used to obtain the final sentence-level paragraph representation $Z$.
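The pooling and self-attention steps of this layer can be sketched in plain Python. This is a minimal illustration rather than the authors' implementation: the function names (`co_attention_pool`, `sentence_self_attention`, `fuse`) are hypothetical, similarity is assumed to be a dot product, normalization is assumed to be a softmax, each paragraph is assumed to attend to itself as well as to the others, and the order of the four blocks inside each fused vector is also an assumption. The trainable MLP for question pooling and the bidirectional LSTM are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def weighted_sum(weights, vectors):
    """Sum of vectors, each scaled by its weight."""
    dim = len(vectors[0])
    return [sum(w * vec[d] for w, vec in zip(weights, vectors)) for d in range(dim)]


def co_attention_pool(positions, q_pool):
    """CAP: pool one paragraph's position vectors, weighted by similarity to q_pool."""
    weights = softmax([dot(c, q_pool) for c in positions])  # assumed dot-product similarity
    return weighted_sum(weights, positions)


def sentence_self_attention(c_pool):
    """SSA: for each paragraph, aggregate the summary vectors of all paragraphs."""
    u = []
    for ci in c_pool:
        row = softmax([dot(ci, cj) for cj in c_pool])  # normalized row i of A
        u.append(weighted_sum(row, c_pool))
    return u


def fuse(c_pool, u):
    """G_i = [c_i; u_i; c_i * u_i; c_i - u_i] (block order is an assumption)."""
    return [
        c + v + [a * b for a, b in zip(c, v)] + [a - b for a, b in zip(c, v)]
        for c, v in zip(c_pool, u)
    ]


q_pool = [1.0, 0.0]
paragraphs = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5]], [[0.0, 2.0], [2.0, 0.0]]]
c_pool = [co_attention_pool(p, q_pool) for p in paragraphs]
g = fuse(c_pool, sentence_self_attention(c_pool))
print(len(g), len(g[0]))
```

With 2-dimensional summary vectors, each fused vector has 8 entries, since four blocks of the original size are concatenated.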

Sentence-level Decoding Layer
The sentence-level decoding layer, which consists of a linear layer and a softmax function, produces a normalized score for each paragraph, where $W_3 \in \mathbb{R}^{d_{hid}}$ is a trainable weight vector. $Pr^{sen}_i$, the $i$-th element of $Pr^{sen}$, represents the probability that the $i$-th paragraph contains the answer.
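A minimal sketch of this decoding step is given below, assuming that the linear layer is a plain dot product with the weight vector and has no bias term (the text does not say whether a bias is used). The function name `paragraph_confidences` and the toy values are hypothetical.

```python
import math


def paragraph_confidences(z, w3):
    """Linear layer followed by a softmax over the n paragraph representations."""
    logits = [sum(w * x for w, x in zip(w3, z_i)) for z_i in z]  # one logit per paragraph
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


z = [[0.2, 1.0], [1.5, 0.3], [0.1, 0.1]]  # toy sentence-level representations
w3 = [1.0, -0.5]                          # toy weight vector
pr_sen = paragraph_confidences(z, w3)
print(pr_sen)
```

Because the softmax runs over paragraphs rather than over words, the scores of all retrieved paragraphs for a question sum to one, which is what allows them to be used directly as a sampling distribution.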

Ranking and Sampling Strategies
Based on the retrieved paragraphs and their confidence scores, we design five strategies to select paragraphs for the reader.
Ground Truth (GT) strategy simply chooses all positive paragraphs, i.e. those that contain the answer string, as the reading data and filters out the negative ones. Since we have to know which paragraphs contain the answer string, this strategy can only be used when constructing the training set.
Random Sampling (RS) strategy samples k different paragraphs from all the retrieved paragraphs according to the uniform probability distribution. Different from the GT strategy, this strategy includes paragraphs that do not contain an answer in the training set. As a result, the RC model is prevented from becoming too confident in heuristics or patterns that are only effective when an answer string is guaranteed to exist.
Ranking (RK) strategy selects the k top-ranked paragraphs based on the confidence scores produced by the ranker. In this strategy, the ranker is expected to select more relevant paragraphs, providing a better starting point for the reader to predict the final answer.
Weighted Sampling (WS) strategy samples k different paragraphs from all retrieved paragraphs according to the probability distribution generated by the ranker. This strategy is an improved version of RK. It aims to account for the effect of the distracting paragraphs while filtering out the noisy ones.
RS→WS strategy is a combination of RS and WS. In the training process, in order to make full use of the dataset and mitigate the impact of the distracting paragraphs, we generate the initial training set for the reader via RS. After that, WS is applied to select paragraphs with higher accuracy, which reduces the influence of the distracting and noisy paragraphs.
During training, any of the five strategies can be used to generate reading data. During testing, since the paragraphs should be as clean as possible, we only use RK for paragraph selection. After that, we feed the selected paragraphs to the reader to generate the final answer.
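The five strategies above can be sketched with the Python standard library. This is an illustrative sketch under stated assumptions, not the released implementation: all function names are hypothetical, GT is assumed to use case-insensitive substring matching, and WS is assumed to draw one paragraph at a time while renormalizing over the paragraphs that remain (one common way to sample without replacement). The confidence scores are assumed to be strictly positive, as softmax outputs are.

```python
import random


def select_gt(paragraphs, answer):
    """GT: indices of all paragraphs containing the answer string (training only)."""
    return [i for i, text in enumerate(paragraphs) if answer.lower() in text.lower()]


def select_rs(n, k, rng):
    """RS: k distinct paragraph indices drawn uniformly at random."""
    return rng.sample(range(n), min(k, n))


def select_rk(scores, k):
    """RK: indices of the k highest confidence scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def select_ws(scores, k, rng):
    """WS: k distinct indices drawn in proportion to the confidence scores."""
    remaining = list(range(len(scores)))
    chosen = []
    for _ in range(min(k, len(remaining))):
        weights = [scores[i] for i in remaining]  # renormalized implicitly by choices()
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen


def select_rs_then_ws(scores, k, rng, reader_converged):
    """RS→WS: uniform sampling first, weighted sampling once the reader has converged."""
    if reader_converged:
        return select_ws(scores, k, rng)
    return select_rs(len(scores), k, rng)


paragraphs = [
    "Superman was born on Krypton.",
    "Krypton is a noble gas.",
    "Clark Kent works at the Daily Planet.",
    "Kal-El, later known as Superman, left Krypton as a baby.",
]
scores = [0.5, 0.1, 0.1, 0.3]
rng = random.Random(0)
print(select_gt(paragraphs, "Superman"), select_rk(scores, 2), select_ws(scores, 2, rng))
```

Unlike RK, WS can still pick a low-scoring paragraph occasionally, so the reader keeps seeing some distracting paragraphs during training while noisy ones become less likely.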

Paragraph Reader
The reader is a traditional RC model whose inputs are the $k$ selected paragraphs $D^{top} = \{P_1, P_2, \ldots, P_k\}$, where $k$ is a hyperparameter. We concatenate these paragraphs to obtain $\hat{P} = [P_1; P_2; \ldots; P_k]$, which aggregates the information from the selected paragraphs at the word level. Given $\hat{P}$ and $Q$, the outputs of the reader, i.e., $Pr^s, Pr^e \in \mathbb{R}^{|\hat{P}|}$, are the probability distributions of the start index and the end index over the concatenated paragraph, where $|\hat{P}|$ is the number of words in the concatenated paragraph. Our reader is the same as in Clark and Gardner (2018), and we omit the details due to space limitations. After that, we split $Pr^s$ and $Pr^e$ into $k$ vectors according to the length of each paragraph and obtain $\{Pr^s_1, Pr^s_2, \ldots, Pr^s_k\}$ and $\{Pr^e_1, Pr^e_2, \ldots, Pr^e_k\}$, where $Pr^s_i$ / $Pr^e_i$ represent the start/end probabilities of the $i$-th paragraph.
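The splitting step at the end of the reader can be sketched as follows. The helper name `split_by_paragraph` and the toy numbers are hypothetical; the only assumption is that the paragraph lengths are given in the same order in which the paragraphs were concatenated.

```python
def split_by_paragraph(scores, lengths):
    """Cut a flat per-word score vector into one slice per paragraph."""
    if sum(lengths) != len(scores):
        raise ValueError("paragraph lengths must add up to the number of scores")
    pieces, start = [], 0
    for length in lengths:
        pieces.append(scores[start:start + length])
        start += length
    return pieces


pr_s = [0.1, 0.4, 0.05, 0.05, 0.3, 0.1]  # start probabilities over 6 concatenated words
print(split_by_paragraph(pr_s, [2, 1, 3]))
```

The same helper applies unchanged to the end probabilities.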

Training and Testing
Our model is trained in two stages. We follow the distantly supervised setup that all the paragraphs that contain the answer span are ground truths.
For the ranker, each paragraph in $\{P_i\}_{i=1}^{n}$ is associated with a label $y_i \in \{0, 1\}$, where $y_i$ equals 1 if the paragraph contains the answer string. We follow S-Net (Tan et al., 2018) to design the loss function of the ranker. Given the paragraph probabilities $Pr^{sen} \in \mathbb{R}^n$, the ranker is trained by minimizing the corresponding ranking loss (Eq. 14).
For the reader, we first select $k$ paragraphs based on the RS→WS strategy and then obtain the final scores $Pr^s_i$ and $Pr^e_i$ for each paragraph, where $Pr^s_i(t)$ / $Pr^e_i(t)$ represent the start/end scores of the $t$-th word in the $i$-th paragraph. As described above, the answer string can appear multiple times in a paragraph. Therefore, the reader is trained using a summed objective function that optimizes the negative log probability of selecting any correct answer span, computed over the set of start and end positions of the answer strings that appear in each paragraph $P_i$.
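Both objectives can be illustrated with a short sketch. The exact formulas are not reproduced here, so the code below rests on two explicit assumptions: the ranker loss is taken to be a cross-entropy summed over paragraphs, in the style of the passage-ranking loss of S-Net, and the reader loss is taken to sum the probability mass of all correct start positions and of all correct end positions separately, in the style of the summed objective of Clark and Gardner (2018). The function names and the small constant `eps` are hypothetical.

```python
import math


def ranker_loss(pr_sen, labels, eps=1e-12):
    """Assumed form: cross-entropy over paragraph labels, summed over paragraphs."""
    return -sum(
        y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps)
        for p, y in zip(pr_sen, labels)
    )


def reader_loss(pr_start, pr_end, answer_spans, eps=1e-12):
    """Assumed form: negative log of the total probability of any correct start/end.

    pr_start, pr_end: per-paragraph lists of per-word probabilities.
    answer_spans: per-paragraph lists of (start, end) word indices of answer strings.
    """
    starts = {(i, s) for i, spans in enumerate(answer_spans) for s, _ in spans}
    ends = {(i, e) for i, spans in enumerate(answer_spans) for _, e in spans}
    start_mass = sum(pr_start[i][s] for i, s in starts)
    end_mass = sum(pr_end[i][e] for i, e in ends)
    return -math.log(start_mass + eps) - math.log(end_mass + eps)


print(ranker_loss([0.7, 0.2, 0.1], [1, 0, 0]))
print(reader_loss([[0.1, 0.4], [0.5]], [[0.2, 0.3], [0.5]], [[(1, 1)], [(0, 0)]]))
```

Summing the probability mass before taking the logarithm means the reader is not penalized for preferring one occurrence of the answer over another, which suits distant supervision where it is unknown which occurrence is the relevant one.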
During testing, we first use the ranker to predict a confidence score for each retrieved paragraph and apply the RK strategy to select $k$ paragraphs. Next, the reader calculates the start and end scores of each word for all the selected paragraphs. Based on the reader-produced scores, there are two methods for choosing the answer:
MAX method is usually adopted in the traditional RC task. It simply chooses the answer span with the maximum span score among all selected paragraphs. The span score is the product of the start score of the first word and the end score of the last word of the answer span.
SUM method first extracts the answer candidate $A_i$ with the maximum span score from each paragraph. Next, if answer candidates from different paragraphs refer to the same answer, we sum up their scores, and the answer candidate with the maximum total score is chosen as the final prediction. A similar method has also been adopted by recent approaches (Wang et al., 2018b; Lin et al., 2018; Pang et al., 2019).
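The difference between the two methods can be seen in a small sketch. The names `best_span`, `answer_max` and `answer_sum` are hypothetical, the maximum span length `max_len` is an assumed practical limit that the text does not mention, and two candidates are assumed to refer to the same answer when their lowercased, stripped strings are equal.

```python
from collections import defaultdict


def best_span(pr_start, pr_end, max_len=8):
    """Return (score, start, end) of the highest-scoring span in one paragraph."""
    best = (0.0, 0, 0)
    for s, ps in enumerate(pr_start):
        for e in range(s, min(s + max_len, len(pr_end))):
            score = ps * pr_end[e]  # span score = start score * end score
            if score > best[0]:
                best = (score, s, e)
    return best


def answer_max(candidates):
    """MAX: the single highest-scoring candidate across all paragraphs."""
    return max(candidates, key=lambda c: c[1])[0]


def answer_sum(candidates):
    """SUM: add up the scores of candidates that refer to the same answer."""
    totals = defaultdict(float)
    for text, score in candidates:
        totals[text.strip().lower()] += score  # assumed notion of "same answer"
    return max(totals, key=totals.get)


# One (answer text, span score) candidate per selected paragraph.
candidates = [("Superman", 0.30), ("Krypto", 0.45), ("superman", 0.25)]
print(answer_max(candidates), answer_sum(candidates))
```

In this toy case MAX returns the isolated high-scoring candidate, while SUM prefers the answer supported by two paragraphs, whose scores add up to 0.55.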

Datasets
Quasar-T. It consists of 43K open-domain trivia question-answer pairs, and about 100 paragraphs are provided for each question-answer pair by using the Solr search engine.
SearchQA. It has a total of more than 140K question-answer pairs, and roughly 50 webpage snippets are provided by the Google search engine as background paragraphs for each question.
TriviaQA. It contains approximately 95K open-domain question-answer pairs. The unfiltered version of TriviaQA is used in our experiment.

Implementation Details
In the experiments, we adopt the same data preprocessing scheme as Clark and Gardner (2018). For the ranker, we use the 300-dimensional word embeddings pre-trained by GloVe (Pennington et al., 2014), which are fixed during training, and treat the 20-dimensional character embeddings as learnable parameters. The common word feature is mapped into a 4-dimensional vector that is updated during training. We set the hidden size of the LSTM to 150 and the number of LSTM layers to 1. We use Adam with a learning rate of 5e-4 to optimize the model. The batch size is set to 8, and dropout is applied to the outputs of all LSTM layers at a rate of 0.2. As for RS→WS, we first train the reader with RS until convergence. After that, we use WS to select paragraphs for training in order to further improve the performance of the reader. Our reader adopts the same hyperparameters as S-Norm (Clark and Gardner, 2018). While training the reader, the parameters of the ranker are fixed. The average length of paragraphs in TriviaQA is greater than that in Quasar-T and SearchQA, so we select 30, 30 and 15 paragraphs for Quasar-T, SearchQA and TriviaQA respectively.

Overall Results
In this section, we focus on the performance of the whole model. Table 1 presents the F1 and Exact Match (EM) scores of our model and the baseline models. These evaluation metrics are widely adopted for OpenQA. Our (RS, RK, MAX) model uses RS and RK to select paragraphs for training and testing respectively and uses MAX to generate the answer. Our (RS, RK, MAX) model achieves better results than the baselines on most of the datasets. The main reason is that RS can select distracting paragraphs to train the reader, thereby preventing the reader from being too confident in patterns that are only effective in positive paragraphs. Besides, compared with the paragraph-question relevance alone, the paragraph-paragraph relevance can further enhance the evidence among the relevant paragraphs. By jointly considering both kinds of relevance, RK can select more relevant paragraphs during testing to provide a better starting point for the reader.
Our (RS→WS, RK, MAX) model consistently outperforms Our (RS, RK, MAX) model across the three datasets. The improvement is attributed to the adoption of the weighted sampling strategy. In the first stage of training, RS prevents excessive focus on the positive paragraphs. On this basis, WS further alleviates the influence of noisy paragraphs in the second stage without hurting the model's generalization ability.
We can observe that the SUM method (Wang et al., 2018b) helps our model better extract the correct answer. As we discussed, different from the traditional RC task, the answer string will appear multiple times in different paragraphs in OpenQA. By summing up the answer span scores which refer to the same answer candidate, we can extract a more accurate answer.
On the SearchQA dataset, Our (RS→WS, RK, MAX) model performs close to the HAS-QA model. 35% of the paragraphs in the development set of SearchQA contain the answer string, whereas for Quasar-T and TriviaQA the percentages are 13% and 29% respectively. This indicates that SearchQA contains more positive paragraphs than the other datasets. Since the strength of our model lies in filtering out irrelevant data, the improvement on SearchQA is not as significant as on the other datasets. However, Our (RS→WS, RK, SUM) model still achieves the SOTA result by adopting the SUM method; HAS-QA, which is a strong baseline, also adopts a similar method.
In the following sections, we give further analyses about the ranker, the selection strategies, the hyperparameter, the case study and the ablation study of our model.

Performance of the Ranker
We separately evaluate our ranker by comparing it with a typical IR model and with SOTA models that also have a ranking submodule to select paragraphs, i.e. DS-QA (Lin et al., 2018), R3 (Wang et al., 2018a), FSE (Zhang et al., 2018) and MSR (Das et al., 2019). This evaluation simply focuses on whether a ground-truth paragraph appears among the top 1, 3 and 5 paragraphs. We follow previous works and report the results on Quasar-T and SearchQA. As shown in Table 2, our ranker surpasses the IR model by a large margin. The result implies that the question-paragraph relevance is effective in selecting the relevant paragraphs. Compared with R3, DS-QA, FSE and MSR, which also adopt a neural network to select paragraphs, our model still works better. This suggests that exploiting the paragraph-question and paragraph-paragraph relevance with multi-level attention and self-attention mechanisms can further improve the performance of the ranker.

Performance of Selection Strategies
In this section, we compare the performance of different selection strategies for training. For comparison, a common scheme for testing is applied, in which we use RK to select testing paragraphs and MAX to extract the answer. As shown in Table 3, the performance of RS is much better than that of GT and RK. The reason is that both GT and RK tend to train the reader on the positive paragraphs, which inevitably impairs the generalization ability of the reader. This demonstrates that mainly choosing the positive paragraphs for training hurts the performance of the model. Compared with RS, WS leads to a slight improvement thanks to the denoising effect. However, WS is still biased toward the positive paragraphs during training, which may explain why the improvement is small. RS→WS achieves the best score on all three datasets, because it not only ameliorates the influence of the distracting paragraphs but also filters out the noisy paragraphs, which further strengthens our model.

Impact of the Number of Paragraphs
In order to further investigate the influence of different rankers and of the hyperparameter k during testing, we compare the final performance of RK (Our) with that of RK (IR). We use RS→WS to select paragraphs for training and extract answers through the MAX method. RK (Our) and RK (IR) denote using our ranker and a typical IR model, respectively, to select the k top-ranked paragraphs for testing. As shown in Figure 3, the performance of the two models first increases and then remains stable or decreases as k rises on all three datasets. As the number of paragraphs increases, it becomes more likely that the answer is included in these paragraphs. However, the difficulty and running time for the reader also increase. As can be seen in Figure 3, the peak value of our ranker exceeds that of the IR model on all three datasets, suggesting that our ranker can discover more useful and relevant paragraphs during testing, providing a better starting point for the reader to predict the final answer.

Case Study
Table 4 shows an example from Quasar-T. The second row contains the 5 top-ranked paragraphs produced by our ranker, and the paragraphs in the third row are selected by a naive version of our ranker (RKN), which replaces the co-attention pooling (CAP) with max pooling for C i and removes the sentence-level self-attention (SSA). For the question "Who was born on Krypton?", our ranker selects more related paragraphs, and all of them support "Superman" as the answer. These paragraphs are inherently relevant to each other and provide enhanced evidence for the correct answer. However, most of the paragraphs in the third row are noisy or distracting, even though some of them are related to the question to some extent. Compared with RKN, which only uses the paragraph-question relevance, our complete ranker can further utilize the paragraph-paragraph relevance to aggregate the evidence among relevant paragraphs, thereby increasing their confidence scores. As a result, the reader is provided with more useful paragraphs, which enhances the performance of our model.

Ablation Study
Finally, we conduct an ablation study to analyze the effect of all proposed methods and report the results on Quasar-T. As shown in Table 5, we evaluate the performance of the ranker according to the precision at 1, 3 and 5. Compared with the evaluation that assesses whether the ground truth appears among the top-ranked paragraphs, this metric better reflects the performance of the ranker, as we expect to find as many relevant paragraphs as possible among the k top-ranked paragraphs. We start with our complete model, which uses RS→WS to select 30 paragraphs for training and RK to select the 30 top-ranked paragraphs for testing. In Table 5, '-SUM' denotes using MAX to choose the answer, and the performance drop indicates that SUM further enhances our model. Replacing RS→WS with RK for training accounts for a 2.1% performance drop, which demonstrates the effectiveness of our proposed selection strategy. '-CAP' means replacing the co-attention pooling with max pooling for C i and '-SSA' means removing the sentence-level self-attention; the results show that both the paragraph-question and the paragraph-paragraph relevance are integral to our model. '-ranker' denotes replacing our ranker with the IR model in both the training and testing stages. The replacement accounts for a performance drop of 5.2%, which illustrates the indispensable role of our ranker in the final architecture.
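The two ranker metrics contrasted above can be made concrete with a short sketch. The function names are hypothetical, and each ranked list is assumed to hold the distant-supervision label (1 if the paragraph contains the answer string, 0 otherwise) of the paragraphs in ranked order.

```python
def hits_at_k(ranked_labels, k):
    """1 if at least one of the top-k paragraphs contains the answer, else 0."""
    return int(any(ranked_labels[:k]))


def precision_at_k(ranked_labels, k):
    """Fraction of the top-k paragraphs that contain the answer."""
    top = ranked_labels[:k]
    return sum(top) / len(top) if top else 0.0


ranked_a = [1, 0, 0, 0, 0]  # one positive paragraph, ranked first
ranked_b = [1, 1, 0, 1, 1]  # several positive paragraphs near the top
for k in (1, 3, 5):
    print(k, hits_at_k(ranked_a, k), hits_at_k(ranked_b, k),
          precision_at_k(ranked_a, k), precision_at_k(ranked_b, k))
```

Both rankings score identically on the hit-based metric, whereas precision at k separates them, which matches the motivation given above for reporting precision in the ablation study.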

Related Work
Most recent OpenQA approaches extract answers from distantly supervised paragraphs which are retrieved from the web (Chen et al., 2017; Dhingra et al., 2017a; Buck et al., 2018). Research roughly falls into two categories, namely single-paragraph approaches and multi-paragraph approaches, both of which follow the retrieve-then-read pipeline. Single-paragraph approaches (Joshi et al., 2017; Wang et al., 2018a) select the most relevant paragraph from all the retrieved paragraphs and then feed it to an RC model to extract the answer. The dependence on a single paragraph renders the RC model vulnerable to selection errors, and a large amount of information in the remaining paragraphs is consequently neglected.
Multi-paragraph approaches can be further divided into answer-rerank methods and confidence-based methods. For answer-rerank methods, Wang et al. (2018b) adopted a single-paragraph approach to extract an answer candidate from each retrieved paragraph and then used an extra neural network to rerank the candidates. Wang et al. (2018c) jointly trained the candidate extraction and reranking steps with reinforcement learning to further improve the performance. But these methods still rely heavily on the candidate extraction results. Chen et al. (2017) proposed the first confidence-based method, which retrieved multiple paragraphs with an IR system and applied an RC model to each paragraph to extract the answer with the highest confidence. Das et al. (2019) introduced a new framework that makes use of the result of the reader for improved paragraph retrieval. However, these models suffer from impaired generalization ability as a result of too much focus on the positive paragraphs in training. Clark and Gardner (2018) presented a new method, S-Norm. They used a sampling strategy and the shared-norm objective function to teach the model to ignore paragraphs that do not contain the answer. However, the retrieved paragraphs are always noisy, which hurts the performance of the model. Zhang et al. (2018) and Lee et al. (2018) adopted a neural network to rank the retrieved paragraphs. Lin et al. (2018) and Pang et al. (2019) also ranked the paragraphs and took into account both the ranking results and the scores produced by the reader to predict the final answer. However, they mainly utilize the paragraph-question relevance to rank the paragraphs and train the reader mainly on the positive paragraphs.

Conclusion
In this paper, we tackle the issues caused by the noisy and distracting paragraphs in OpenQA via ranking and sampling. By jointly considering the question-paragraph and the paragraph-paragraph relevance, our ranking model can calculate a more accurate confidence score for each paragraph. Through the modified weighted sampling strategy, our model can make full use of the dataset and mitigate the influence of the distracting and noisy paragraphs. Experimental results on three challenging public OpenQA datasets (Quasar-T, SearchQA and TriviaQA) show that our model advances the state of the art.