Ranking Paragraphs for Improving Answer Recall in Open-Domain Question Answering

Recently, open-domain question answering (QA) has been combined with machine comprehension models to find answers in a large knowledge source. Because open-domain QA requires retrieving relevant documents from text corpora, its performance depends largely on the document retriever. Traditional information retrieval systems, however, are not effective at retrieving documents with a high probability of containing answers, which lowers the performance of QA systems. Simply extracting more documents increases the number of irrelevant documents, which also degrades performance. In this paper, we introduce Paragraph Ranker, which ranks paragraphs of retrieved documents for a higher answer recall with less noise. We show that ranking paragraphs and aggregating answers using Paragraph Ranker improves the performance of an open-domain QA pipeline on four open-domain QA datasets by 7.8% on average.


Introduction
With the introduction of large-scale machine comprehension datasets, machine comprehension models that are highly accurate and efficient in answering questions given raw texts have been proposed recently (Seo et al., 2016; Xiong et al., 2016; Wang et al., 2017c). While conventional machine comprehension models were given a paragraph that always contains an answer to a question, some researchers have extended the models to an open-domain setting where relevant documents have to be searched from an extremely large knowledge source such as Wikipedia (Chen et al., 2017; Wang et al., 2017a). However, most open-domain QA pipelines depend on traditional information retrieval systems which use TF-IDF rankings (Chen et al., 2017; Wang et al., 2017b). Despite the efficiency of traditional retrieval systems, the documents they retrieve and rank at the top often do not contain answers to the questions. Simply increasing the number of top-ranked documents to read, however, also increases the number of irrelevant documents. This tradeoff between reading more documents and minimizing noise is frequently observed in previous works that treated the number N of top documents as a hyperparameter to tune (Wang et al., 2017a).
In this paper, we tackle the problem of ranking the paragraphs of retrieved documents for improving the answer recall of the paragraphs while filtering irrelevant paragraphs. By using our simple but efficient Paragraph Ranker, our QA pipeline considers more documents for a high answer recall, and ranks paragraphs to read only the most relevant ones. The work closest to ours is that of Wang et al. (2017a). However, whereas their main focus is on re-ranking retrieved sentences to maximize the rewards of correctly answering the questions, our focus is to increase the answer recall of paragraphs with less noise. Thus, our work is complementary to the work of Wang et al. (2017a).
Our work is largely inspired by Learning to Rank, a subfield of information retrieval (Liu et al., 2009; Severyn and Moschitti, 2015). Most learning to rank models consist of two parts: encoding networks and ranking functions. We use bidirectional long short-term memory (Bi-LSTM) as our encoding network, and apply various ranking functions proposed by previous works (Severyn and Moschitti, 2015; Tu et al., 2017). Also, as the time and space complexities of ranking paragraphs are much larger than those of ranking sentences (Severyn and Moschitti, 2015), we resort to negative sampling (Mikolov et al., 2013) for efficient training of our Paragraph Ranker. Our pipeline with Paragraph Ranker improves the exact match scores on the four open-domain QA datasets by 7.8% on average. Even though we did not further customize Document Reader of DrQA (Chen et al., 2017), the large improvement in the exact match scores shows that future research would benefit from ranking and reading the more relevant paragraphs. Through a qualitative analysis of ranked paragraphs, we provide additional evidence supporting our findings.

Open-Domain QA Pipeline
Most open-domain QA systems are constructed as pipelines that include a retrieval system and a reader model. We additionally build Paragraph Ranker, which assists our QA pipeline in selecting better paragraphs. For the retrieval system and the reader model, we use Document Retriever and Document Reader of Chen et al. (2017). An overview of our pipeline is illustrated in Figure 1.

Paragraph Ranker
Given N documents retrieved by Document Retriever, we assume that each document contains K paragraphs on average. Instead of feeding all N × K paragraphs to Document Reader, we select only M paragraphs using Paragraph Ranker. With Paragraph Ranker, we can safely increase N for a higher answer recall while reducing the number of paragraphs to read by selecting only the top-ranked paragraphs.
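To make the notion of answer recall over the M selected paragraphs concrete, the following is a minimal sketch. It assumes that a question counts as a hit when at least one selected paragraph contains one of its gold answer strings (case-insensitive substring match); this matching rule and the function name `answer_recall` are illustrative assumptions, not a definition taken from the evaluation in this paper.

```python
from typing import List


def answer_recall(selected: List[List[str]], gold_answers: List[List[str]]) -> float:
    """Fraction of questions whose selected paragraphs contain a gold answer.

    selected[j] holds the paragraphs chosen for question j, and
    gold_answers[j] holds the acceptable answer strings for question j.
    Assumption: a hit is a case-insensitive substring match.
    """
    if not selected or len(selected) != len(gold_answers):
        raise ValueError("need one non-empty entry per question")
    hits = 0
    for paragraphs, golds in zip(selected, gold_answers):
        if any(g.lower() in p.lower() for p in paragraphs for g in golds):
            hits += 1
    return hits / len(selected)


# Two toy questions: the first is covered by its paragraph, the second is not.
selected = [["The Eiffel Tower is in Paris."], ["Bananas are yellow."]]
gold_answers = [["Paris"], ["Berlin"]]
recall = answer_recall(selected, gold_answers)
print(recall)
```

Under this reading, increasing N can only add candidate paragraphs, so the recall of the full candidate set never decreases; the ranking step decides how much of it survives the cut to M paragraphs.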
Given the retrieved paragraphs $P_i$, where i ranges from 1 to N × K, and a question Q, we encode each paragraph and the question using two separate RNNs such as Bi-LSTMs. Representations of each paragraph and the question are calculated as follows:

$$p_i^h = \mathrm{BiLSTM}_p(E(P_i)), \qquad q^h = \mathrm{BiLSTM}_q(E(Q))$$

where BiLSTM(·) returns the concatenation of the last hidden state of the forward LSTM and the first hidden state of the backward LSTM, and E(·) converts tokens in a paragraph or a question into pretrained word embeddings. We use GloVe (Pennington et al., 2014) for the pretrained word embeddings.
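The pooling step described above can be illustrated in isolation. The sketch below assumes that the per-token hidden states of both directions are already available as plain lists indexed by token position; the function name `final_representation` and the toy values are hypothetical, and no actual LSTM is run.

```python
from typing import List

Vector = List[float]


def final_representation(forward_states: List[Vector],
                         backward_states: List[Vector]) -> Vector:
    """Concatenate the last forward state with the first backward state.

    Both lists are indexed by token position. The backward LSTM reads the
    sequence right to left, so its state at position 0 is the one that has
    seen every token.
    """
    if not forward_states or len(forward_states) != len(backward_states):
        raise ValueError("expected one state per token in each direction")
    return forward_states[-1] + backward_states[0]


# Toy hidden states for a 3-token paragraph with hidden size 2 per direction.
forward = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
backward = [[0.9, 0.8], [0.7, 0.6], [0.5, 0.4]]
p_h = final_representation(forward, backward)
print(p_h)  # [0.5, 0.6, 0.9, 0.8]
```

The resulting vector has twice the per-direction hidden size, and each half summarizes the full token sequence from one reading direction.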
Once each paragraph and the question are represented as $p_i^h$ and $q^h$, we calculate the probability that each paragraph contains an answer to the question as follows:

$$p(P_i|Q) = \frac{1}{1 + e^{-s(p_i^h, q^h)}}$$

where the similarity function s(·, ·) measures how likely the paragraph $P_i$ is to contain an answer to the question Q. While Wang and Jiang (2015) adopted high-capacity models such as Match-LSTM for measuring the similarity between paragraphs and questions, we use much simpler scoring functions to calculate the similarity more efficiently. We tested three different scoring functions: 1) the dot product of $p_i^h$ and $q^h$, 2) the bilinear form $(p_i^h)^\top W q^h$, and 3) a multilayer perceptron (MLP) (Severyn and Moschitti, 2015). While the MLP takes much more time than the other two functions, its recall was similar to that of the dot product. Also, as the recall of the bilinear form was worse than that of the dot product, we use the dot product as our scoring function.
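The three scoring functions can be compared side by side in a small sketch. It makes two assumptions: that a sigmoid maps the raw score to a probability, and that the MLP is a generic one-hidden-layer network over the concatenated vectors rather than the exact architecture of Severyn and Moschitti (2015). All function and parameter names are illustrative.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def matvec(W: Matrix, v: Vector) -> Vector:
    return [dot(row, v) for row in W]


def score_dot(p_h: Vector, q_h: Vector) -> float:
    """1) Dot product of the paragraph and question representations."""
    return dot(p_h, q_h)


def score_bilinear(p_h: Vector, W: Matrix, q_h: Vector) -> float:
    """2) Bilinear form p^T W q with a trainable matrix W."""
    return dot(p_h, matvec(W, q_h))


def score_mlp(p_h: Vector, q_h: Vector, W1: Matrix, b1: Vector,
              w2: Vector, b2: float) -> float:
    """3) Assumed MLP: one tanh hidden layer over the concatenation [p; q]."""
    x = p_h + q_h
    hidden = [math.tanh(h + b) for h, b in zip(matvec(W1, x), b1)]
    return dot(w2, hidden) + b2


def sigmoid(x: float) -> float:
    """Numerically stable logistic function (assumed score-to-probability map)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


p_h = [1.0, 2.0]
q_h = [0.5, -0.25]
identity = [[1.0, 0.0], [0.0, 1.0]]

s = score_dot(p_h, q_h)            # 1.0 * 0.5 + 2.0 * (-0.25) = 0.0
print(s, sigmoid(s))               # 0.0 0.5
print(score_bilinear(p_h, identity, q_h))  # equals the dot product when W = I
```

The dot product has no parameters of its own, the bilinear form adds a d × d matrix, and the MLP adds a hidden layer, which is consistent with the cost ordering reported above.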
Due to the large size of N × K, it is difficult to train Paragraph Ranker on all the retrieved paragraphs.² To train our model efficiently, we use negative sampling of irrelevant paragraphs (Mikolov et al., 2013). Hence, the loss function of our model is as follows:

$$J(\Theta) = -\log p(P_i|Q) - \mathbb{E}_{k \sim p_n}\left[\log\left(1 - p(P_k|Q)\right)\right]$$

where k indexes negative samples that do not contain the answer, and Θ denotes the trainable parameters of Paragraph Ranker. The distribution of negative samples is denoted by $p_n$. We use the distribution of all the Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016) training paragraphs as $p_n$.
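A negative-sampling objective of this kind can be sketched as follows. The sketch assumes a sigmoid over the raw score, estimates the expectation over negatives by a simple average of the sampled terms, and filters negatives by case-insensitive substring match against the answer; these choices, along with the names `negative_sampling_loss` and `sample_negatives`, are illustrative assumptions rather than the training code used in this work.

```python
import math
import random
from typing import List


def log_sigmoid(x: float) -> float:
    """Stable log(sigmoid(x)) = -log(1 + exp(-x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def negative_sampling_loss(pos_score: float, neg_scores: List[float]) -> float:
    """-log p(P_i|Q) minus the average of log(1 - p(P_k|Q)) over negatives.

    Uses the identity 1 - sigmoid(x) = sigmoid(-x). Averaging over the
    sampled negatives is an assumed Monte Carlo estimate of the expectation.
    """
    if not neg_scores:
        raise ValueError("at least one negative sample is required")
    neg_term = sum(log_sigmoid(-s) for s in neg_scores) / len(neg_scores)
    return -log_sigmoid(pos_score) - neg_term


def sample_negatives(pool: List[str], answer: str, k: int,
                     rng: random.Random) -> List[str]:
    """Draw up to k paragraphs from the pool that do not contain the answer."""
    candidates = [p for p in pool if answer.lower() not in p.lower()]
    return rng.sample(candidates, min(k, len(candidates)))


pool = [
    "Paris is the capital of France.",
    "Lyon is a city in France.",
    "Berlin is the capital of Germany.",
]
negatives = sample_negatives(pool, "Paris", 2, random.Random(0))
print(negatives)

# With all scores at zero every probability is 0.5, so the loss is 2 * log(2).
print(negative_sampling_loss(0.0, [0.0]))
```

The loss falls as the positive paragraph's score rises and as the negatives' scores fall, so only the sampled paragraphs, not all N × K candidates, need to be encoded per training step.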
Based on the rank of each paragraph from Paragraph Ranker and the rank of its source document from Document Retriever, we collect the top M paragraphs to read. We combine the ranks by multiplying the probabilities $p(P_i|Q)$ and $\hat{p}(D_i|Q)$ to find the most relevant paragraphs, where $\hat{p}(D_i|Q)$ denotes the TF-IDF score of the source document $D_i$.
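The selection step can be sketched as a sort on the product of the two scores. The `Candidate` type, its field names, and the toy scores below are hypothetical; the sketch only illustrates the multiplication and the cut to M paragraphs.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    text: str
    paragraph_prob: float   # p(P_i|Q) from the paragraph ranker
    document_score: float   # TF-IDF score of the source document D_i


def top_m(candidates: List[Candidate], m: int) -> List[Candidate]:
    """Return the m candidates with the highest product of the two scores."""
    ranked = sorted(
        candidates,
        key=lambda c: c.paragraph_prob * c.document_score,
        reverse=True,
    )
    return ranked[:m]


candidates = [
    Candidate("A", paragraph_prob=0.9, document_score=0.2),  # product 0.18
    Candidate("B", paragraph_prob=0.6, document_score=0.5),  # product 0.30
    Candidate("C", paragraph_prob=0.3, document_score=0.9),  # product 0.27
]
selected = top_m(candidates, 2)
print([c.text for c in selected])  # ['B', 'C']
```

In this toy case paragraph A has the highest ranker probability but comes from a weakly matching document, so the combined score places it last.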

Answer Aggregation
We feed M paragraphs to Document Reader to extract M answers. While Paragraph Ranker increases the probability of including answers in the top M ranked paragraphs, the aggregation step should determine the most probable answer among the M extracted answers. Chen et al. (2017) and Clark et al. (2017) used the unnormalized answer probability from the reader. However, as the unnormalized answer probability is very sensitive to noisy answers, Wang et al. (2017b) proposed more sophisticated aggregation methods such as coverage-based and strength-based re-ranking.
In our QA pipeline, we incorporate the coverage-based method by Wang et al. (2017b) with paragraph scores from Paragraph Ranker. Although strength-based answer re-ranking showed good performance on some datasets, it is too complex to efficiently re-rank M answers. Given the M candidate answers $[A_1, ..., A_M]$, one from each paragraph, we aggregate answers as follows:

$$\hat{A} = \arg\max_{A_i} \hat{p}(A_i|P_i, Q)^{\alpha}\, p(P_i|Q)^{\beta}\, \hat{p}(D_i|Q)^{\gamma} \qquad (1)$$

where $\hat{p}(A_i|P_i, Q)$ denotes the unnormalized answer probability from a reader given the paragraph $P_i$ and the question Q. The importance of each score is determined by the hyperparameters α, β, and γ. Also, we add up all the probabilities of duplicate candidate answers for the coverage-based aggregation.

² N × K ≈ 350 when N = 5 for SQuAD QA pairs.
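The aggregation with duplicate summing can be sketched as below. The sketch assumes that α, β, and γ act as exponents on the three scores before they are multiplied, and that duplicates are detected by exact string equality; both are assumptions made for illustration, and the function name `aggregate` and the toy scores are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (answer text, reader answer score, paragraph probability, document score)
Scored = Tuple[str, float, float, float]


def aggregate(candidates: List[Scored], alpha: float = 1.0, beta: float = 1.0,
              gamma: float = 1.0) -> Tuple[str, Dict[str, float]]:
    """Sum the weighted scores of identical answers and return the best one.

    Assumption: alpha, beta, and gamma are exponents on the three scores.
    """
    if not candidates:
        raise ValueError("no candidate answers to aggregate")
    totals: Dict[str, float] = defaultdict(float)
    for answer, p_answer, p_paragraph, p_document in candidates:
        totals[answer] += (p_answer ** alpha) * (p_paragraph ** beta) * (p_document ** gamma)
    best = max(totals, key=lambda a: totals[a])
    return best, dict(totals)


candidates = [
    ("Paris", 0.4, 0.5, 0.5),  # 0.10
    ("Lyon", 0.6, 0.5, 0.5),   # 0.15, the best single candidate
    ("Paris", 0.5, 0.4, 0.5),  # 0.10, a duplicate from another paragraph
]
best, totals = aggregate(candidates)
print(best)  # Paris
```

Here "Lyon" has the highest single score, but "Paris" is extracted from two paragraphs and wins once its scores are summed, which is the effect the coverage-based aggregation is meant to capture.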

Datasets
We evaluate our pipeline with Paragraph Ranker on the four open-domain QA datasets. Wang et al. (2017a)

Implementation Details
Paragraph Ranker uses 3-layer Bi-LSTM networks with 128 hidden units. On SQuAD OPEN and CuratedTrec, we set α, β, and γ to 1. Due to the different characteristics of questions in WebQuestions and WikiMovies, we tune α, β, and γ on the validation QA pairs of these two datasets. We use N = 20 for the number of documents to retrieve and M = 200 for the number of paragraphs to read for all four datasets. We use Adamax (Kingma and Ba, 2014) as the optimization algorithm. Dropout is applied to LSTMs and embeddings with p = 0.4.

Analysis
In Table 2, we show 3 random paragraphs of the top document returned by Document Retriever, and the top 3 paragraphs ranked by Paragraph Ranker from the top 40 documents. As Document Retriever largely depends on matching query tokens with document tokens, the top-ranked document is usually the document with the most tokens matching the query. However, Question 1 contains the polysemous word "play", which makes it more difficult for Document Retriever to perform effectively. Our Paragraph Ranker correctly recognizes that the question is about a sports player, not a musician. The top 1-3 paragraphs for the second question came from the 30th, 7th, and 6th documents, respectively, as ranked by Document Retriever. This shows that increasing the number of documents to rank helps Paragraph Ranker find more relevant paragraphs.

Conclusion
In this paper, we presented an open-domain question answering pipeline and proposed Paragraph Ranker. By using Paragraph Ranker, the QA pipeline benefits from increased answer recall in the paragraphs it reads, and filters out irrelevant documents and paragraphs. With our simple Paragraph Ranker, we achieve state-of-the-art performance on the four open-domain QA datasets by large margins. In future work, we plan to further improve Paragraph Ranker based on research on learning to rank.