Denoising Distantly Supervised Open-Domain Question Answering

Distantly supervised open-domain question answering (DS-QA) aims to find answers in collections of unlabeled text. Existing DS-QA models usually retrieve related paragraphs from a large-scale corpus and apply reading comprehension techniques to extract answers from the most relevant paragraph, ignoring the rich information contained in the other paragraphs. Moreover, distant supervision data inevitably suffers from the wrong labeling problem, and the resulting noisy data substantially degrades the performance of DS-QA. To address these issues, we propose a novel DS-QA model which employs a paragraph selector to filter out noisy paragraphs and a paragraph reader to extract the correct answer from the denoised paragraphs. Experimental results on real-world datasets show that our model can capture useful information from noisy data and achieve significant improvements on DS-QA compared to all baselines.


Introduction
Reading comprehension, which aims to answer questions about a document, has recently become a major focus of NLP research. Many reading comprehension systems (Chen et al., 2016; Dhingra et al., 2017a; Cui et al., 2017; Shen et al., 2017) have been proposed and have achieved promising results, since their multi-layer architectures and attention mechanisms allow them to reason about the question. To some extent, reading comprehension has demonstrated the ability of recent neural models to read, process, and comprehend natural language text. Despite their success, existing reading comprehension systems rely on pre-identified relevant texts, which do not always exist in real-world question answering (QA) scenarios. Hence, reading comprehension techniques cannot be directly applied to the task of open-domain QA. In recent years, researchers have attempted to answer open-domain questions with a large-scale unlabeled corpus. One such approach is a distantly supervised open-domain question answering (DS-QA) system, which uses information retrieval techniques to obtain relevant text from Wikipedia and then applies reading comprehension techniques to extract the answer.
* Corresponding author: Zhiyuan Liu
Although DS-QA offers an effective strategy to collect relevant texts automatically, it always suffers from noise. For example, for the question "Which country's capital is Dublin?", we may encounter the following: (1) the retrieved paragraph "Dublin is the largest city of Ireland ..." does not actually answer the question; (2) the second "Dublin" in the retrieved paragraph "Dublin is the capital of Ireland. Besides, Dublin is one of the famous tourist cities in Ireland and ..." is not the correct token of the answer. These noisy paragraphs and tokens are nevertheless regarded as valid instances in DS-QA. To address this issue, one line of work separates answer generation in DS-QA into two modules: selecting a target paragraph in the document, and extracting the correct answer from the target paragraph by reading comprehension. Further, Wang et al. (2018a) use reinforcement learning to train target paragraph selection and answer extraction jointly.
These methods extract the answer only from the most related paragraph, which loses a large amount of rich information contained in the neglected paragraphs. In fact, the correct answer is often mentioned in multiple paragraphs, and different aspects of the question may be answered in several paragraphs. Therefore, Wang et al. (2018b) propose to explicitly aggregate evidence across different paragraphs to re-rank extracted answers. However, the re-ranking approach still relies on the answers obtained by existing DS-QA systems, and fails to substantially solve the noise problem of DS-QA.
Figure 1: An overview of our model. For the question "What's the capital of Ireland?", our paragraph selector selects two paragraphs $p_1$ and $p_3$ which actually correspond to the question from all retrieved paragraphs. Then our paragraph reader extracts the correct answer "Dublin" (in red color) from all selected paragraphs. Finally, our system aggregates the extracted results and obtains the final answer.
To address these issues, we propose a coarse-to-fine denoising model for DS-QA. As illustrated in Fig. 1, our system first retrieves paragraphs for the question from a large-scale corpus via information retrieval. After that, to utilize all informative paragraphs, we adopt a fast paragraph selector to skim all retrieved paragraphs and filter out the noisy ones. We then apply a precise paragraph reader to read each selected paragraph carefully and extract the answer. Finally, we aggregate the results derived from all chosen paragraphs to obtain the final answer. The fast skimming of our paragraph selector and the intensive reading of our paragraph reader enable DS-QA to filter out noisy paragraphs while maintaining efficiency.
The experimental results on real-world datasets including Quasar-T, SearchQA and TriviaQA show that our system achieves significant and consistent improvements over all baseline methods by aggregating the extracted answers of all informative paragraphs. In particular, we show that our model can achieve comparable performance by selecting only a few informative paragraphs, which greatly speeds up the whole DS-QA system. We will publish all source code and datasets of this work on GitHub for further research.

Related Work
Question answering is one of the most important tasks in NLP. Many efforts have been invested in QA, especially in open-domain QA. Open-domain QA was first proposed by Green Jr et al. (1961). The task aims to answer open-domain questions using external resources such as collections of documents (Voorhees et al., 1999), webpages (Kwok et al., 2001; Chen and Van Durme, 2017), structured knowledge graphs (Berant et al., 2013a; Bordes et al., 2015) or automatically extracted relational triples (Fader et al., 2014).
Recently, with the development of machine reading comprehension techniques (Chen et al., 2016; Dhingra et al., 2017a; Cui et al., 2017; Shen et al., 2017), researchers have attempted to answer open-domain questions by performing reading comprehension on plain texts. A DS-QA system of this kind retrieves texts relevant to the question from a large-scale corpus and then extracts answers from these texts using reading comprehension models. However, the retrieved texts in DS-QA are always noisy, which may hurt the performance of DS-QA. Hence, Wang et al. (2018a), among others, attempt to solve the noise problem in DS-QA by separating question answering into paragraph selection and answer extraction, but they select only the most relevant paragraph among all retrieved paragraphs to extract answers. They thus lose a large amount of rich information contained in the neglected paragraphs. To remedy this, Wang et al. (2018b) propose strength-based and coverage-based re-ranking approaches, which aggregate the results extracted from each paragraph by an existing DS-QA system to better determine the answer. However, this method relies on the pre-extracted answers of existing DS-QA models and still suffers from the noise in distant supervision data because it considers all retrieved paragraphs indiscriminately. Different from these methods, our model employs a paragraph selector to filter out noisy paragraphs and keep informative ones, which makes full use of the noisy DS-QA data.
Our work is also inspired by the idea of coarse-to-fine models in NLP. Coarse-to-fine models have been proposed that first select essential sentences and then perform text summarization (Cheng and Lapata, 2016) or reading comprehension on the chosen sentences. Lin et al. (2016) utilize selective attention to aggregate the information of all sentences to extract relational facts. Yang et al. (2016) propose a hierarchical attention network which applies two levels of attention, at the word level and the sentence level, for document classification. Our model also employs a coarse-to-fine approach to handle the noise in DS-QA: it first selects informative retrieved paragraphs and then extracts answers from the selected paragraphs.

Methodology
In this section, we introduce our model in detail. Our model aims to extract the answer to a given question from a large-scale unlabeled corpus. We first retrieve paragraphs corresponding to the question from the open-domain corpus using information retrieval techniques, and then extract the answer from these retrieved paragraphs.
Formally, given a question $q = (q_1, q_2, \cdots, q_{|q|})$, we retrieve $m$ paragraphs, defined as $P = \{p_1, p_2, \cdots, p_m\}$, where $p_i = (p_i^1, p_i^2, \cdots, p_i^{|p_i|})$ is the $i$-th retrieved paragraph. Our model measures the probability of extracting answer $a$ given question $q$ and the corresponding paragraph set $P$. As illustrated in Fig. 1, our model contains two parts:
1. Paragraph Selector. Given the question $q$ and the retrieved paragraphs $P$, the paragraph selector measures the probability distribution $\Pr(p_i|q, P)$ over all retrieved paragraphs, which is used to select the paragraphs that really contain the answer to question $q$.
2. Paragraph Reader. Given the question $q$ and a paragraph $p_i$, the paragraph reader calculates the probability $\Pr(a|q, p_i)$ of extracting answer $a$ through a multi-layer long short-term memory network.
Overall, the probability $\Pr(a|q, P)$ of extracting answer $a$ given question $q$ can be calculated as:
$$\Pr(a|q, P) = \sum_{p_i \in P} \Pr(a|q, p_i) \Pr(p_i|q, P). \tag{1}$$
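The aggregation over paragraphs described above can be illustrated with a minimal Python sketch. The function name `aggregate_answers` and the toy probabilities are hypothetical and chosen only for illustration; in the actual model both distributions come from trained neural networks.

```python
from collections import defaultdict

def aggregate_answers(selector_probs, reader_probs):
    """Combine per-paragraph reader outputs, weighted by the selector.

    selector_probs[i] plays the role of Pr(p_i | q, P).
    reader_probs[i] maps each candidate answer a to Pr(a | q, p_i).
    Returns a dict mapping each answer a to Pr(a | q, P).
    """
    scores = defaultdict(float)
    for p_sel, answers in zip(selector_probs, reader_probs):
        for answer, p_read in answers.items():
            scores[answer] += p_sel * p_read
    return dict(scores)

# Toy numbers (assumed, not taken from the paper's experiments).
selector_probs = [0.5, 0.1, 0.4]
reader_probs = [
    {"Dublin": 0.9, "Ireland": 0.1},
    {"Ireland": 0.7, "North Atlantic": 0.3},
    {"Dublin": 0.8, "Ottawa": 0.2},
]

scores = aggregate_answers(selector_probs, reader_probs)
best = max(scores, key=scores.get)
print(best, round(scores[best], 2))
```

Because the noisy second paragraph receives a low selector probability, its candidate answers contribute little to the final score.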

Paragraph Selector
Since the wrong labeling problem inevitably occurs in DS-QA data, we need to filter out those noisy paragraphs when exploiting the information of all retrieved paragraphs. It is straightforward that we need to estimate the confidence of each paragraph. Hence, we employ a paragraph selector to measure the probability of each paragraph containing the answer among all paragraphs.
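Paragraph Encoding. We first represent each word $p_i^j$ in the paragraph $p_i$ as a word vector $\mathbf{p}_i^j$, and then feed each word vector into a neural network to obtain the hidden representation $\hat{\mathbf{p}}_i^j$. Here, we adopt two types of neural networks:
1. Multi-Layer Perceptron (MLP)
$$\hat{\mathbf{p}}_i^j = \mathrm{MLP}(\mathbf{p}_i^j), \tag{2}$$
2. Recurrent Neural Network (RNN)
$$\{\hat{\mathbf{p}}_i^1, \hat{\mathbf{p}}_i^2, \cdots, \hat{\mathbf{p}}_i^{|p_i|}\} = \mathrm{RNN}(\{\mathbf{p}_i^1, \mathbf{p}_i^2, \cdots, \mathbf{p}_i^{|p_i|}\}), \tag{3}$$
where $\hat{\mathbf{p}}_i^j$ is expected to encode the semantic information of word $p_i^j$ and its surrounding words. For the RNN, we select a single-layer bidirectional long short-term memory network (LSTM) as our RNN unit, and concatenate the hidden states of all layers to obtain $\hat{\mathbf{p}}_i^j$.
Question Encoding. Similar to paragraph encoding, we also represent each word $q_j$ in the question as its word vector $\mathbf{q}_j$, and then feed the word vectors into an MLP:
$$\hat{\mathbf{q}}_j = \mathrm{MLP}(\mathbf{q}_j), \tag{4}$$
or an RNN:
$$\{\hat{\mathbf{q}}_1, \hat{\mathbf{q}}_2, \cdots, \hat{\mathbf{q}}_{|q|}\} = \mathrm{RNN}(\{\mathbf{q}_1, \mathbf{q}_2, \cdots, \mathbf{q}_{|q|}\}), \tag{5}$$
where $\hat{\mathbf{q}}_j$ is the hidden representation of the word $q_j$ and is expected to encode its context information. After that, we apply a self-attention operation on the hidden representations to obtain the final representation $\hat{\mathbf{q}}$ of the question $q$:
$$\hat{\mathbf{q}} = \sum_j \alpha_j \hat{\mathbf{q}}_j, \tag{6}$$
where $\alpha_j$ encodes the importance of each question word and is calculated as:
$$\alpha_j = \frac{\exp(\mathbf{w} \hat{\mathbf{q}}_j)}{\sum_{j'} \exp(\mathbf{w} \hat{\mathbf{q}}_{j'})}, \tag{7}$$
where $\mathbf{w}$ is a learned weight vector. Next, we calculate the probability of each paragraph via a max-pooling layer and a softmax layer:
$$\Pr(p_i|q, P) = \mathrm{softmax}\left(\max_j \left(\hat{\mathbf{p}}_i^j \mathbf{W} \hat{\mathbf{q}}\right)\right), \tag{8}$$
where $\mathbf{W}$ is a weight matrix to be learned.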
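To make the scoring step of the paragraph selector concrete, the following pure-Python sketch shows self-attention pooling over question hidden states followed by a bilinear score, max-pooling over paragraph words, and a softmax over paragraphs. It assumes the hidden representations have already been produced by an encoder; the helper names (`attend_question`, `paragraph_scores`) and all toy vectors are hypothetical, and a real implementation would use a tensor library rather than Python lists.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend_question(q_hidden, w):
    """Self-attention pooling: weights alpha_j = softmax(w . q_j),
    then the attention-weighted sum of the question hidden states."""
    alphas = softmax([dot(w, h) for h in q_hidden])
    dim = len(q_hidden[0])
    return [sum(a * h[d] for a, h in zip(alphas, q_hidden)) for d in range(dim)]

def paragraph_scores(paragraphs, W, q_vec):
    """Bilinear score p^T W q for every word, max-pooled per paragraph,
    then normalized with a softmax across paragraphs."""
    Wq = [dot(row, q_vec) for row in W]  # matrix-vector product W q
    raw = [max(dot(word, Wq) for word in para) for para in paragraphs]
    return softmax(raw)

# Toy hidden states (assumed values, 2-dimensional for readability).
q_hidden = [[1.0, 0.0], [0.0, 1.0]]
w = [0.0, 0.0]                      # zero weights give uniform attention
W = [[1.0, 0.0], [0.0, 1.0]]        # identity matrix as a stand-in for the learned W
paragraphs = [
    [[2.0, 2.0], [0.0, 0.0]],       # paragraph with one word aligned to the question
    [[0.0, 0.0]],                   # paragraph with no aligned word
]

q_vec = attend_question(q_hidden, w)
probs = paragraph_scores(paragraphs, W, q_vec)
print(probs)
```

Max-pooling means a paragraph is scored by its single best-matching word, so one strongly relevant word is enough to rank the paragraph highly.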

Paragraph Reader
The paragraph reader aims to extract answers from a paragraph $p_i$. Similar to the paragraph selector, we first encode each paragraph $p_i$ as $\{\bar{\mathbf{p}}_i^1, \bar{\mathbf{p}}_i^2, \cdots, \bar{\mathbf{p}}_i^{|p_i|}\}$ through a multi-layer bidirectional LSTM. We also obtain the question embedding $\bar{\mathbf{q}}$ via a self-attention multi-layer bidirectional LSTM.
The paragraph reader aims to extract the span of tokens that is most likely the correct answer. We divide this into predicting the start and end positions of the answer span. Hence, the probability of extracting answer $a$ to the question $q$ from the given paragraph $p_i$ can be calculated as:
$$\Pr(a|q, p_i) = P_s(a_s) P_e(a_e), \tag{9}$$
where $a_s$ and $a_e$ indicate the start and end positions of answer $a$ in the paragraph, and $P_s(a_s)$ and $P_e(a_e)$ are the probabilities of $a_s$ and $a_e$ being the start and end words respectively, which are calculated by:
$$P_s(j) = \mathrm{softmax}_j\left(\bar{\mathbf{p}}_i^j \mathbf{W}_s \bar{\mathbf{q}}\right), \quad P_e(j) = \mathrm{softmax}_j\left(\bar{\mathbf{p}}_i^j \mathbf{W}_e \bar{\mathbf{q}}\right), \tag{10}$$
where $\mathbf{W}_s$ and $\mathbf{W}_e$ are two weight matrices to be learned. In DS-QA, since the position of the answer is not labeled manually, several tokens in a paragraph may match the correct answer. Let $\{(a_s^1, a_e^1), (a_s^2, a_e^2), \cdots, (a_s^{|a|}, a_e^{|a|})\}$ be the set of the start and end positions of the tokens matched to answer $a$ in the paragraph $p_i$. Equation (9) is further defined in two ways:
(1) Max. We assume that only one token in the paragraph indicates the correct answer. In this way, the probability of extracting the answer $a$ is defined by maximizing over all candidate tokens:
$$\Pr(a|q, p_i) = \max_j P_s(a_s^j) P_e(a_e^j). \tag{11}$$
(2) Sum. We regard all tokens matched to the correct answer equally, and define the answer extraction probability as:
$$\Pr(a|q, p_i) = \sum_j P_s(a_s^j) P_e(a_e^j). \tag{12}$$
Our paragraph reader model is inspired by a previous machine reading comprehension model, the Attentive Reader described in (Chen et al., 2016). In fact, other reading comprehension models can also be easily adopted as our paragraph reader. Due to the space limit, in this paper we only explore the effectiveness of the Attentive Reader.
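The difference between the Max and Sum readers can be seen in a small Python sketch. It assumes the start and end distributions over token positions are already available; the function `answer_prob` and the toy distributions are hypothetical and serve only to contrast the two aggregation rules.

```python
def span_prob(p_start, p_end, span):
    """Probability of one candidate span: P_s(a_s) * P_e(a_e)."""
    start, end = span
    return p_start[start] * p_end[end]

def answer_prob(p_start, p_end, spans, mode="max"):
    """Aggregate over every span that string-matches the answer.

    mode="max": assume exactly one matched span is the true answer.
    mode="sum": treat all matched spans as equally valid.
    """
    probs = [span_prob(p_start, p_end, span) for span in spans]
    if mode == "max":
        return max(probs)
    if mode == "sum":
        return sum(probs)
    raise ValueError("mode must be 'max' or 'sum'")

# Toy paragraph of 10 tokens in which "Dublin" appears at positions 0 and 9.
# The distributions below are assumed values that each sum to 1.
p_start = [0.6, 0.0, 0.0, 0.0, 0.0, 0.2, 0.0, 0.0, 0.0, 0.2]
p_end   = [0.5, 0.0, 0.0, 0.0, 0.0, 0.2, 0.0, 0.0, 0.0, 0.3]
dublin_spans = [(0, 0), (9, 9)]

max_score = answer_prob(p_start, p_end, dublin_spans, mode="max")
sum_score = answer_prob(p_start, p_end, dublin_spans, mode="sum")
print(max_score, sum_score)
```

Under Sum, the second (possibly spurious) mention adds to the answer probability and therefore receives training signal, whereas Max credits only the most confident mention.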

Learning and Prediction
For the learning objective, we define a loss function $L$ using maximum likelihood estimation:
$$L(\theta) = -\sum_{(a, q, P) \in T} \log \Pr(a|q, P) + \alpha R(P), \tag{13}$$
where $\theta$ indicates the parameters of our model, $a$ indicates the correct answer, $T$ is the whole training set, $\alpha$ is the regularization weight, and $R(P)$ is a regularization term over the paragraph selector to avoid overfitting. Here, $R(P)$ is defined as the KL divergence between $\Pr(p_i|q, P)$ and a probability distribution $X$, where $X_i = \frac{1}{c_P}$ ($c_P$ is the number of paragraphs containing the correct answer in $P$) if the paragraph contains the correct answer, and $X_i = 0$ otherwise. Specifically, $R(P)$ is defined as:
$$R(P) = \sum_{p_i \in P} X_i \log \frac{X_i}{\Pr(p_i|q, P)}. \tag{14}$$
To solve the optimization problem, we adopt Adamax to minimize the objective function as described in (Kingma and Ba, 2015). During testing, we extract the answer $\hat{a}$ with the highest probability:
$$\hat{a} = \arg\max_a \Pr(a|q, P) = \arg\max_a \sum_{p_i \in P} \Pr(a|q, p_i) \Pr(p_i|q, P). \tag{15}$$
Here, the paragraph selector can be viewed as a fast skimming over all paragraphs, which determines the probability of each paragraph containing the answer. Hence, for acceleration, we can simply aggregate the predictions from the paragraphs with higher probabilities.

Datasets and Evaluation Metrics
WebQuestions (Berant et al., 2013b) is designed for answering questions from the Freebase knowledge base. It is built by crawling questions through the Google Suggest API, and its paragraphs are retrieved from the English Wikipedia.
The statistics of these datasets are shown in the accompanying table. We adopt two metrics, ExactMatch (EM) and F1, to evaluate our model. EM measures the percentage of predictions that match one of the ground-truth answers exactly, and F1 loosely measures the average overlap between the prediction and the ground-truth answer.
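As an illustration of the two metrics, the sketch below computes EM and a token-level F1 in Python. The normalization steps (lowercasing, stripping punctuation and English articles) follow common reading comprehension evaluation practice and are an assumption here; the paper does not spell out its exact normalization.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace.
    (Assumed normalization; not specified in the paper.)"""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    """1.0 if the prediction equals any gold answer after normalization."""
    pred = normalize(prediction)
    return float(any(pred == normalize(gold) for gold in gold_answers))

def f1(prediction, gold):
    """Token-level F1 between one prediction and one gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def best_f1(prediction, gold_answers):
    """Take the best F1 over all acceptable gold answers."""
    return max(f1(prediction, gold) for gold in gold_answers)

print(exact_match("The Dublin", ["Dublin"]))
print(best_f1("Frank Capra", ["Capra"]))
```

Dataset-level scores would then be the averages of these per-question values over the evaluation set.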

Baselines
For comparison, we select several public models as baselines. We also compare our model with its naive version, which treats all paragraphs equally by using a uniform distribution for paragraph selection. We name our model "Our+FULL" and its naive version "Our+AVG".

Experimental Settings
In this paper, we tune our model on the development set and use a grid search to determine the optimal parameters. We select the hidden size of the LSTM n ∈ {32, 64, 128, · · · , 512}, the number of LSTM layers for the document and question encoders among {1, 2, 3, 4}, the regularization weight α among {0.1, 0.5, 1.0, 2.0} and the batch size among {4, 8, 16, 32, 64, 128}. The optimal parameters are highlighted in bold. Since the other parameters have little effect on the results, we simply follow the settings used in prior work.
For training, the Our+FULL model is first initialized by pre-training with the Our+AVG model, and we set the number of iterations over the training data to 10. For pre-trained word embeddings, we use the 300-dimensional GloVe (Pennington et al., 2014) word embeddings learned from 840B Web crawl data.

Effect of Different Paragraph Selectors
As our model can use different types of neural networks, including MLP and RNN, as its paragraph selector, we investigate the effect of different paragraph selectors on the Quasar-T and SearchQA development sets.
As shown in Table 3, our RNN paragraph selector leads to statistically significant improvements on both Quasar-T and SearchQA. Note that Our+FULL with the MLP paragraph selector even performs worse than Our+AVG on the Quasar-T dataset. This indicates that the MLP paragraph selector is insufficient to distinguish whether a paragraph answers the question. As the RNN paragraph selector consistently improves all evaluation metrics, we use it as the default paragraph selector in the following experiments.

Effect of Different Paragraph Readers
Here, we compare the performance of different types of paragraph readers and the results are shown in Table 4.
From the table, we can see that all models with Sum or Max paragraph readers have comparable performance in most cases, but Our+AVG with the Max reader improves by about 3% over the one with the Sum reader on the SearchQA dataset. This indicates that the Sum reader is more susceptible to noisy data, since it regards all tokens matching the answer as ground truth. In the following experiments, we select the Max reader as our paragraph reader since it is more stable.

Overall Results
In this part, we show the performance of different models on five DS-QA datasets and offer some further analysis. The performance of our models is shown in Table 2. From the results, we can observe that: (1) Both of our models, Our+AVG and Our+FULL, achieve better results than the other baselines on most of the datasets. The reason is that our models can make full use of the information in all retrieved paragraphs to answer the question, while the baseline models only consider the most relevant paragraph. This verifies our claim that incorporating the rich information of all retrieved paragraphs helps to better extract the answer to the question.
(2) On all datasets, Our+FULL model outperforms Our+AVG model significantly and consistently. It indicates that our paragraph selector could effectively filter out those meaningless retrieved paragraphs and alleviate the wrong labeling problem in DS-QA.
(3) On the TriviaQA dataset, the Our+AVG model performs worse than the $R^3$ model. After inspecting the TriviaQA dataset, we find that only one or two retrieved paragraphs actually contain the correct answer. Therefore, simply using all retrieved paragraphs equally to extract the answer may bring in much noise. In contrast, the Our+FULL model still achieves a slight improvement by considering the confidence of each retrieved paragraph.
(4) On the CuratedTREC and WebQuestions datasets, our model achieves only a slight improvement over the $R^3$ model. The reason is that these two datasets are tiny, and the performance of DS-QA systems on them is heavily influenced by the gap with the dataset used for pre-training.

Paragraph Selector Performance Analysis
To demonstrate the effectiveness of our paragraph selector in filtering out noisy retrieved paragraphs, we compare our paragraph selector with traditional information retrieval (IR) in this part.
We also compare our model with a new baseline named Our+INDEP which trains the paragraph reader and the paragraph selector independently.
To train the paragraph selector, we regard all the paragraphs containing the correct answer as ground truth and learn it with Eq. 14. First, we show the performance in selecting informative paragraphs. Since distantly supervised data does not have labeled ground truth indicating which paragraphs actually answer the question, we adopt a held-out evaluation instead. It evaluates our model by comparing the selected paragraphs with pseudo labels: we regard a paragraph as ground truth if it contains a token matched to the correct answer. We use Hit@N, which indicates the proportion of proper paragraphs being ranked in the top N, as the evaluation metric. The result is shown in Table 5. From the table, we can observe that: (1) Both Our+INDEP and Our+FULL significantly outperform the traditional IR model in selecting informative paragraphs. This indicates that our proposed paragraph selector is capable of capturing the semantic correlation between the question and the paragraphs.
(2) Our+FULL performs similarly to Our+INDEP from Hit@1 to Hit@5 in selecting valid paragraphs. The reason is that our evaluation of paragraph selection is consistent with the training objective of the ranker in Our+INDEP.
In fact, this way of evaluation may not be enough to distinguish the performance of different paragraph selectors. Therefore, we further report the overall answer extraction performance of Our+FULL and Our+INDEP. From the table, we can see that Our+FULL performs better in answer extraction than Our+INDEP although they have similar performance in paragraph selection. This demonstrates that, through joint training with the paragraph reader, our paragraph selector can better determine which tokens matched to the answer actually answer the question.
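The held-out Hit@N evaluation described above can be sketched in Python as follows. The sketch assumes one plausible reading of the metric, namely the fraction of questions for which at least one pseudo-positive paragraph appears among the top N ranked paragraphs; the function name `hits_at_n` and the toy rankings are hypothetical.

```python
def hits_at_n(ranked_labels_per_question, n):
    """Fraction of questions with a pseudo-positive paragraph in the top n.

    ranked_labels_per_question[k] is a list of booleans for question k,
    ordered by the selector's ranking (best first). A label is True when
    the paragraph contains a token matched to the correct answer.
    """
    hits = sum(1 for labels in ranked_labels_per_question if any(labels[:n]))
    return hits / len(ranked_labels_per_question)

# Toy rankings for four questions (assumed values).
rankings = [
    [False, True, False],
    [True, False, False],
    [False, False, True],
    [False, False, False],  # no retrieved paragraph contains the answer
]

hit1 = hits_at_n(rankings, 1)
hit2 = hits_at_n(rankings, 2)
hit3 = hits_at_n(rankings, 3)
print(hit1, hit2, hit3)
```

Since the labels are obtained by string matching, a paragraph that merely mentions the answer counts as a hit, which is why this metric alone cannot fully separate the selectors.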

Performance with different numbers of paragraphs
Our paragraph selector can be viewed as a fast skimming step before carefully reading the paragraphs. To show how much our paragraph selector can accelerate the DS-QA system, we compare the performance of our model with the top paragraphs selected by our paragraph selector (Our+FULL) or by the traditional IR model (Our+IR). The results are shown in Fig. 2. As expected, the performance of both Our+IR and Our+FULL increases significantly as the number of paragraphs grows. From the figure, we find that on both the Quasar-T and SearchQA datasets, Our+FULL can use only half of the retrieved paragraphs for answer extraction without performance deterioration, while Our+IR suffers significant performance degradation when the number of paragraphs decreases. This demonstrates that our model can extract the answer from a few informative paragraphs chosen by the paragraph selector, which speeds up the whole DS-QA system.
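The acceleration idea can be sketched as follows: rank paragraphs by selector probability, run the expensive reader only on the top k, and aggregate as before. This is a minimal illustration under assumed inputs; `answer_with_top_k`, `toy_reader`, and the probabilities are hypothetical stand-ins for the trained selector and reader, and k is assumed to be at least 1.

```python
def answer_with_top_k(paragraphs, selector_probs, read_fn, k):
    """Read only the k paragraphs the selector ranks highest.

    read_fn(paragraph) returns a dict mapping answers to Pr(a | q, p_i).
    Selector probabilities are used as-is (not renormalized over the
    top k), which does not change the arg max.
    """
    order = sorted(range(len(paragraphs)),
                   key=lambda i: selector_probs[i], reverse=True)
    scores = {}
    for i in order[:k]:
        for answer, prob in read_fn(paragraphs[i]).items():
            scores[answer] = scores.get(answer, 0.0) + selector_probs[i] * prob
    best = max(scores, key=scores.get)
    return best, scores

# Toy reader: a lookup table standing in for the neural paragraph reader.
toy_reader_output = {
    "p1": {"Dublin": 0.9, "Ireland": 0.1},
    "p2": {"Ireland": 0.7, "North Atlantic": 0.3},
    "p3": {"Dublin": 0.8, "Ottawa": 0.2},
    "p4": {"Cork": 1.0},
}
calls = []  # records which paragraphs were actually read

def toy_reader(paragraph):
    calls.append(paragraph)
    return toy_reader_output[paragraph]

paragraphs = ["p1", "p2", "p3", "p4"]
selector_probs = [0.45, 0.05, 0.40, 0.10]

best, scores = answer_with_top_k(paragraphs, selector_probs, toy_reader, k=2)
print(best, calls)
```

Here the reader is invoked on only two of the four paragraphs, which is where the speed-up comes from when the reader is much slower than the selector.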

Potential improvement
To show the potential improvement from aggregating extracted answers with answer re-ranking models in our DS-QA system, we provide a statistical analysis.
Table 7: Potential improvement on DS-QA performance by answer re-ranking. The performance is based on the Quasar-T and SearchQA development datasets.
From Table 7, we can see that: (1) There is a clear gap (10-20%) between the top-3/5 and top-1 DS-QA performance. This indicates that our DS-QA model is still far from its upper bound and is likely to be improved by answer re-ranking.
(2) The Our+FULL model outperforms the $R^3$ model in top-1, top-3 and top-5 on both the Quasar-T and SearchQA datasets by 5% to 7%. This indicates that aggregating the information from all informative paragraphs can effectively enhance our model in DS-QA, and that our model has more potential when combined with answer re-ranking.

Case Study
Table 6 shows two examples from our models, which illustrate that our model can make full use of informative paragraphs. From the table we find that:
(1) For the question "Who directed the 1946 'It's A Wonderful Life'?", our model extracts the answer "Frank Capra" from both of the top-2 paragraphs ranked by our paragraph selector.
(2) For the question "What famous artist could write with both his left and right hand at the same time?", our model identifies from the first paragraph that "Leonardo Da Vinci" is an artist, and from the second paragraph that he could write with both his left and right hand at the same time.

Conclusion and Future Work
In this paper, we propose a denoising distantly supervised open-domain question answering system which contains a paragraph selector to skim over paragraphs and a paragraph reader to perform intensive reading on the selected paragraphs. Our model can make full use of all informative paragraphs and alleviate the wrong labeling problem in DS-QA. In the experiments, we show that our models significantly and consistently outperform state-of-the-art DS-QA models. In particular, we demonstrate that the performance of our model is hardly compromised when using only a few top-selected paragraphs.
In the future, we will explore the following directions: (1) An additional answer re-ranking step can further improve our model. We will explore how to effectively re-rank our extracted answers to further enhance the performance.
(2) Background knowledge, such as factual knowledge and commonsense knowledge, can effectively help in paragraph selection and answer extraction. We will incorporate external knowledge bases into our DS-QA model to improve its performance.