Frustratingly Hard Evidence Retrieval for QA Over Books

Much progress has been made on question answering (QA) in recent years, but the specific problem of QA over narrative books has not been explored in depth. We formulate BookQA as an open-domain QA task, given its similar dependency on evidence retrieval, and investigate how state-of-the-art open-domain QA approaches can help BookQA. Besides achieving a new state of the art on the NarrativeQA benchmark, our study also reveals, through extensive experiments and analysis, the difficulty of evidence retrieval in books, which calls for future work on novel retrieval solutions for BookQA.


Introduction
The task of question answering has benefited greatly from advances in deep learning, especially from pre-trained language models (LMs) (Radford et al., 2019; Devlin et al., 2019). While QA over single passages (reading comprehension) and over large-scale open-domain corpora (open-domain QA) has improved substantially, QA over book stories (BookQA) lags behind. For example, on the most representative benchmark in this direction, NarrativeQA (Kočiskỳ et al., 2018), released three years ago, the current state-of-the-art methods show only marginal improvement over the first baselines.
Several challenges in NarrativeQA slow down research progress. First, narrative stories are written in a style that differs from the formal texts, such as Wikipedia, studied in previous work. Second, the long input of a book is beyond the processing capacity of neural models, so identifying evidence within a whole book is critical. Third, NarrativeQA is a generative task, and many answers cannot be matched exactly in the original books; hence generative QA models are required. Finally, and most importantly, the dataset does not provide annotations of supporting evidence. While this makes the setting realistic, as in open-domain QA, it also, together with the generative nature of the answers, makes it difficult to infer the supporting evidence as is done in most extractive open-domain QA tasks.
The requirement of evidence identification and the missing supporting-evidence annotation make the BookQA task similar to open-domain QA. In this paper, we first study whether the ideas used in state-of-the-art open-domain QA systems can be extended to improve BookQA, including: (1) the neural ranker-reader pipeline (Wang et al., 2018), where a neural ranker selects related passages (evidence) for a question from a large candidate set; (2) the use of pre-trained LMs, such as GPT (Radford et al., 2019), BERT (Devlin et al., 2019), and their follow-ups, as reader and ranker; (3) distantly supervised and unsupervised training techniques (Wang et al., 2018; Min et al., 2019; Guu et al., 2020; Karpukhin et al., 2020) that help rankers learn despite noisy or missing gold evidence labels.
By training a ranker-reader framework on BookQA, we achieve a new state-of-the-art on NarrativeQA with both generative and extractive readers. Based on these results and our analysis, we observe the following:
• Using pre-trained LMs such as BERT and GPT as the reader model improves NarrativeQA performance. With the same BM25 IR baseline, they give a 5-6% improvement in Rouge-L over their non-pre-trained counterparts.
• Our specifically designed distant supervision signals improve the neural ranker significantly, but the improvement is small compared to the upper bound. Further analysis of the ranker confirms the difficulty of its training: the improvement from the pre-trained BERT is marginal there.

Task Definition
Following (Kočiskỳ et al., 2018), we define the task of BookQA as finding the answer A to a question Q from a book B,1 where each book consists of a number of consecutive paragraphs C (usually hundreds or more). A is a free-form answer that can be inferred from the book but may not appear in it in exact form.
In this paper we propose an open-domain QA formulation of, and solution to, the task of BookQA. Specifically, the task consists of (1) an evidence retrieval step that selects evidence for Q from B, which in our case is a collection of paragraphs C_Q = {C_i} ⊂ B; and (2) a question-answering step that predicts an answer given Q and C_Q.
In state-of-the-art open-domain QA systems, the two steps above are modeled by two learnable models (usually based on pre-trained LMs): the ranker and the reader. The ranker predicts the relevance of each paragraph C ∈ B to the question, and the top-ranked paragraphs form C_Q; the reader then predicts the answer according to P(A|Q, C_Q).
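The two-step formulation can be sketched as a minimal pipeline. This is only an illustration: the function names are ours, and `overlap_score` and `first_evidence_reader` are toy stand-ins for the BERT-based ranker and reader described below.

```python
from typing import Callable, List

def answer_question(
    question: str,
    book_paragraphs: List[str],
    score_fn: Callable[[str, str], float],       # ranker: relevance of C to Q
    reader_fn: Callable[[str, List[str]], str],  # reader: models P(A | Q, C_Q)
    top_k: int = 5,
) -> str:
    """Two-step BookQA: rank paragraphs of B, then read from the top-k (C_Q)."""
    # Step 1: score every paragraph C in B against the question.
    ranked = sorted(book_paragraphs,
                    key=lambda c: score_fn(question, c), reverse=True)
    evidence = ranked[:top_k]  # C_Q: top-ranked paragraphs
    # Step 2: the reader predicts the answer from Q and C_Q.
    return reader_fn(question, evidence)

# Toy stand-ins: a word-overlap ranker and a reader returning the best paragraph.
def overlap_score(q: str, c: str) -> float:
    return len(set(q.lower().split()) & set(c.lower().split()))

def first_evidence_reader(q: str, evidence: List[str]) -> str:
    return evidence[0]
```

In the actual system both stand-ins are replaced by fine-tuned pre-trained LMs; the sketch only shows how the two modules compose.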
In the following subsections, we describe how we make the training of the pre-trained-LM-based ranker and reader work for the BookQA task.

Reader (QA Model)
Extractive Reader We use a pre-trained BERT model (Devlin et al., 2019; Wolf et al., 2019) to predict the answer span given the query and the context. One challenge of training an extractive model in BookQA is that, because of the task's generative nature, there is no annotation of true spans. Our solution is to find the most likely span to serve as answer supervision. Specifically, we compute the Rouge-L score (Lin, 2004) between the true answer and each candidate span of the same length, and take the span with the maximum Rouge-L score as our weak label. We initially tried exact-match answer spans but found too few of them due to their low coverage in BookQA.
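The weak-label construction above can be sketched as follows. Rouge-L is computed here as the LCS-based F-measure over tokens, a self-contained simplification of the library score used in the paper; the function names are ours.

```python
from typing import List, Tuple

def lcs_len(a: List[str], b: List[str]) -> int:
    """Classic dynamic-programming longest common subsequence length."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[-1][-1]

def rouge_l(span: List[str], answer: List[str]) -> float:
    """LCS-based F-measure between a candidate span and the true answer."""
    if not span or not answer:
        return 0.0
    lcs = lcs_len(span, answer)
    p, r = lcs / len(span), lcs / len(answer)
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

def best_weak_span(context: List[str], answer: List[str]) -> Tuple[int, int, float]:
    """Slide a window of the answer's length over the context and return
    (start, end, score) of the span maximizing Rouge-L: the weak label."""
    n = len(answer)
    best = (0, n, 0.0)
    for s in range(len(context) - n + 1):
        score = rouge_l(context[s:s + n], answer)
        if score > best[2]:
            best = (s, s + n, score)
    return best
```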
Generative Reader Considering GPT's memory limitations, we use the GPT-2-medium model as our pre-trained generative model and fine-tune it on BookQA with the default training parameters.2

Book Paragraph Ranker
We fine-tune another BERT binary classifier for paragraph retrieval, following the standard use of BERT for text-similarity tasks. In BookQA, training such a classifier is challenging because of the lack of evidence-level supervision. We address this problem with an ensemble method for distant supervision. We build two weak BM25 retrievers, one using only Q and the other using both Q and the true answer A. Denoting the corresponding coarse-grained retrievals as C_Q and C_{Q+A}, we then train a model to select their intersection C_Q ∩ C_{Q+A}, by drawing positive samples from C_Q ∩ C_{Q+A} and negative ones from its complement (C_Q ∩ C_{Q+A})^c. To encourage the ranker to select passages with better coverage of the answers, we further apply a Rouge-L filter to these sampling results, keeping only positive samples whose answer-related Rouge-L score is above an upper threshold and negative samples whose score is below a lower threshold.3
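The sampling scheme can be sketched as below. To stay self-contained, a term-overlap retriever stands in for BM25 and token recall of the answer stands in for the Rouge-L filter; the thresholds `hi` and `lo`, like all names here, are hypothetical.

```python
from typing import List, Set, Tuple

def retrieve_topk(query: List[str], paragraphs: List[List[str]], k: int) -> Set[int]:
    """Stand-in for a BM25 retriever: rank paragraphs by term overlap with the query."""
    qset = set(query)
    ranked = sorted(range(len(paragraphs)),
                    key=lambda i: len(qset & set(paragraphs[i])), reverse=True)
    return set(ranked[:k])

def distant_supervision(paragraphs: List[List[str]],
                        q: List[str], a: List[str],
                        k: int = 2, hi: float = 0.5, lo: float = 0.1
                        ) -> Tuple[List[int], List[int]]:
    """Positives from C_Q ∩ C_{Q+A}, negatives from its complement,
    then filtered by an answer-overlap score (Rouge-L in the paper)."""
    c_q = retrieve_topk(q, paragraphs, k)          # retrieval with Q only
    c_qa = retrieve_topk(q + a, paragraphs, k)     # retrieval with Q and true A
    inter = c_q & c_qa

    def answer_recall(i: int) -> float:
        return len(set(a) & set(paragraphs[i])) / max(len(set(a)), 1)

    positives = [i for i in inter if answer_recall(i) >= hi]
    negatives = [i for i in set(range(len(paragraphs))) - inter
                 if answer_recall(i) <= lo]
    return positives, negatives
```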

Settings
Dataset We conduct experiments on the NarrativeQA dataset (Kočiskỳ et al., 2018), which contains 783 books and 789 movie scripts together with their summaries, each with on average 30 question-answer pairs. Each book or movie script contains an average of 62k words. NarrativeQA provides two settings, the summary setting and the full-story setting. Our BookQA task corresponds to the full-story setting, which finds answers in books or movie scripts. Note that NarrativeQA is a generative QA task: the answers are not guaranteed to appear in the books. Among prior work in this setting, (Wang et al., 2017) use reinforcement learning to train the ranker, and (Tay et al., 2019) use a curriculum to train the reader to overcome the divergence of evidence-retrieval quality between training and testing.
We preprocess the raw data with SpaCy4 tokenization. Then, following (Kočiskỳ et al., 2018), we cut the books into non-overlapping paragraphs of 200 tokens each for the full-story setting.
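The chunking step amounts to slicing the token stream into consecutive windows; a minimal sketch, with a whitespace split standing in for SpaCy tokenization:

```python
from typing import List

def chunk_book(tokens: List[str], size: int = 200) -> List[List[str]]:
    """Cut a tokenized book into consecutive, non-overlapping paragraphs.
    The last chunk may be shorter than `size`."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]
```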
Baseline We conduct experiments with both generative and extractive readers, and compare with the competitive baseline models from (Kočiskỳ et al., 2018; Tay et al., 2019; Frermann, 2019) in the full-story setting. Meanwhile, we take BM25 retrieval as the baseline ranker and evaluate our distantly supervised BERT rankers against it. We also compare to the strong results of (Frermann, 2019), which constructs evidence-level supervision from the book summaries. However, the summary is by design not considered available (Kočiskỳ et al., 2018) in the general full-story scenario, where questions should be answered solely from the books.5 Although not the focus of this paper, we also report our readers' performance in the summary setting (Section 3.2) to illustrate their properties.
Metrics Because of the generative nature of the task, following previous work (Kočiskỳ et al., 2018; Tay et al., 2019; Frermann, 2019), we evaluate QA performance with Bleu-1, Bleu-4 (Papineni et al., 2002), Meteor (Banerjee and Lavie, 2005), and Rouge-L (Lin, 2004).6 We also report the Exact Match (EM) and F1 scores7 commonly used in open-domain QA evaluation. We convert both hypothesis and reference to lowercase and remove punctuation before evaluation.
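The normalization and the EM/F1 scores can be sketched as follows, in the spirit of the SQuAD evaluation script the paper uses; the function names are ours.

```python
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace before scoring."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())

def exact_match(pred: str, gold: str) -> bool:
    return normalize(pred) == normalize(gold)

def f1_score(pred: str, gold: str) -> float:
    """Token-level F1 between normalized prediction and reference."""
    p_toks, g_toks = normalize(pred).split(), normalize(gold).split()
    common = sum((Counter(p_toks) & Counter(g_toks)).values())
    if common == 0:
        return 0.0
    prec, rec = common / len(p_toks), common / len(g_toks)
    return 2 * prec * rec / (prec + rec)
```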

Model Selection
We select the best model on the development set according to the average of its Rouge-L and EM scores. For ranker model selection, we use the average of the upper-bound EM and Rouge-L scores of the top-5 ranked paragraphs.

4 https://spacy.io/
5 In NarrativeQA, the summary has good coverage of the answers due to the data collection procedure; also, summaries can be viewed as humans' comprehension of the books.
6 We used an open-source evaluation library (Sharma et al., 2017): https://github.com/Maluuba/nlg-eval.
7 The squad/evaluate-v1.1.py script is used.

Reader Model Validation (the QA-over-Summary Setting)
First, we compare our readers in the summary setting to verify their correctness. Our BERT reader achieves performance close to the published state-of-the-art in this setting. Our GPT-2 reader outperforms the existing systems that do not use pointer generators (PG), but falls behind the state-of-the-art with PG. Despite the large gap between systems with and without PG in this setting, the ablation study of (Tay et al., 2019) shows that PG does not contribute much in the full-story setting. Nonetheless, we will investigate the use of PG with pre-trained LMs in future work.

Main Results (the QA-over-Book Setting)
We then evaluate our whole QA pipelines in the full-story setting. Table 3 and Table 4 compare our results with the published state-of-the-art generative and extractive QA systems.
Our pipeline system with the baseline BM25 ranker already outperforms the existing state-of-the-art, confirming the advantage of pre-trained LMs observed in most QA tasks. Our distantly supervised ranker adds another 1-2% of improvement on all metrics, giving both our generative and extractive models the best performance. It also helps us outperform (Frermann, 2019) on multiple metrics without the strong extra supervision from the summaries.

Ablation of Ranker Performance
To take a deeper look at the challenges in ranker training, we conduct an ablation study of the ranker in isolation. The quality of a ranker is measured by the answer coverage of its top-5 selections, on the basis of the top-32 candidates from the baseline. Answer coverage is estimated in two ways: the maximum Rouge-L score between the answer and any subsequence of the selected paragraphs of the same length as the answer; and whether the answer is exactly covered by any of the selected paragraphs (EM). (Table captions: we compare with (Min et al., 2019) and its BERT-only ablation, where the latter corresponds to the same setting as ours; for the generative model, we compare with the best public models with and without pointer generators.)
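The EM variant of answer coverage can be sketched as below; the Rouge-L variant is analogous, with the exact-match test replaced by the maximum span-level Rouge-L score. The function name is ours.

```python
from typing import List

def em_coverage(answer_tokens: List[str],
                selected_paragraphs: List[List[str]]) -> bool:
    """True if the answer appears contiguously in any of the selected paragraphs."""
    n = len(answer_tokens)
    if n == 0:
        return False
    for para in selected_paragraphs:
        # slide a window of the answer's length over the paragraph
        for s in range(len(para) - n + 1):
            if para[s:s + n] == answer_tokens:
                return True
    return False
```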
Our BERT ranker, together with the supervision filtering strategy, improves significantly over the BM25 baseline. The BERT ranker itself, however, improves by only 0.7% over MatchLSTM (Wang and Jiang, 2016) or an improved BiDAF architecture (Clark and Gardner, 2018). Compared with the benefits BERT brings to open-domain QA tasks, this relatively small improvement demonstrates the difficulty of evidence retrieval in BookQA. It also indicates room for future improvements, as further exhibited by the large gap between our best rankers and either the upper bound or the oracle.

Discussion of Future Improvement
We can see a considerable gap between our best models (ranker and readers) and their corresponding oracles in Tables 3, 4, and 6. (Table 5: Generative result examples. The model tends to generate shorter answers in general; the longer the generated answer, the less likely it is to be correct. The grammatical correctness and fluency of long generated answers approach human level, despite the often problematic logic connecting the generated answer to the question. The majority of the generative results do not make sense logically, which leads to low scores on the different metrics.) One difficulty that limits the effectiveness of ranker training is the noisy annotation resulting from the free-form nature of the answers. Our filtering technique helps significantly but is still not sufficient. One way we believe the distant supervision signals can be improved is by iteratively updating the ranker and the reader, as in hard-EM approaches (Min et al., 2019; Guu et al., 2020). Another possible direction is to extend the idea of inferring evidence on the training data with game-theoretic approaches (Perez et al., 2019; Feng et al., 2020), and then use the inferred evidence paragraphs as labels to train the ranker.

Conclusion
We explored the BookQA task and systematically tested different types of models and techniques from open-domain QA on the NarrativeQA dataset. Our proposed approaches bring significant improvements to the state-of-the-art across different metrics. Our insights and analysis lay the path for exciting future work in this domain.