Coarse-to-Fine Question Answering for Long Documents

We present a framework for question answering that can efficiently scale to longer documents while maintaining or even improving performance of state-of-the-art models. While most successful approaches for reading comprehension rely on recurrent neural networks (RNNs), running them over long documents is prohibitively slow because it is difficult to parallelize over sequences. Inspired by how people first skim the document, identify relevant parts, and carefully read these parts to produce an answer, we combine a coarse, fast model for selecting relevant sentences and a more expensive RNN for producing the answer from those sentences. We treat sentence selection as a latent variable trained jointly from the answer only using reinforcement learning. Experiments demonstrate state-of-the-art performance on a challenging subset of the WikiReading dataset and on a new dataset, while speeding up the model by 3.5x-6.7x.


Introduction
Reading a document and answering questions about its content are among the hallmarks of natural language understanding. Recently, interest in question answering (QA) from unstructured documents has increased along with the availability of large scale datasets for reading comprehension (Hermann et al., 2015; Hill et al., 2015; Rajpurkar et al., 2016; Onishi et al., 2016; Nguyen et al., 2016; Trischler et al., 2016a).
Current state-of-the-art approaches for QA over documents are based on recurrent neural networks (RNNs) that encode the document and the question to determine the answer (Hermann et al., 2015; Chen et al., 2016; Kumar et al., 2016; Kadlec et al., 2016; Xiong et al., 2016). While such models have access to all the relevant information, they are slow because the model needs to be run sequentially over possibly thousands of tokens, and the computation is not parallelizable.
In fact, such models usually truncate the documents and consider only a limited number of tokens (Miller et al., 2016; Hewlett et al., 2016).
Inspired by studies on how people answer questions by first skimming the document, identifying relevant parts, and carefully reading these parts to produce an answer (Masson, 1983), we propose a coarse-to-fine model for question answering.
Our model takes a hierarchical approach (see Figure 1), where first a fast model is used to select a few sentences from the document that are relevant for answering the question (Yu et al., 2014; Yang et al., 2016a). Then, a slow RNN is employed to produce the final answer from the selected sentences. The RNN is run over a fixed number of tokens, regardless of the length of the document. Empirically, our model encodes the text up to 6.7 times faster than the base model, which reads the first few paragraphs, while having access to four times more tokens.

d:
s1: The 2011 Joplin tornado was a catastrophic EF5-rated multiple-vortex tornado that struck Joplin, Missouri . . .
s4: It was the third tornado to strike Joplin since May 1971.
s5: Overall, the tornado killed 158 people . . ., injured some 1,150 others, and caused damages . . .
x: how many people died in joplin mo tornado
y: 158 people
Figure 2: A training example containing a document d, a question x and an answer y in the WIKISUGGEST dataset. In this example, the sentence s5 is necessary to answer the question.
A defining characteristic of our setup is that an answer does not necessarily appear verbatim in the input (the genre of a movie can be determined even if not mentioned explicitly). Furthermore, the answer often appears multiple times in the document in spurious contexts (the year '2012' can appear many times while only once in relation to the question). Thus, we treat sentence selection as a latent variable that is trained jointly with the answer generation model from the answer only using reinforcement learning. Treating sentence selection as a latent variable has been explored in classification (Yessenalina et al., 2010; Lei et al., 2016) but, to our knowledge, has not been applied to question answering.
We find that jointly training sentence selection and answer generation is especially helpful when locating the sentence containing the answer is hard. We evaluate our model on the WIKIREADING dataset (Hewlett et al., 2016), focusing on examples where the document is long and sentence selection is challenging, and on a new dataset called WIKISUGGEST that contains more natural questions gathered from a search engine.
To conclude, we present a modular framework and learning procedure for QA over long text. It captures a limited form of document structure such as sentence boundaries and deals with long documents or potentially multiple documents. Experiments show improved performance compared to the state of the art on the subset of WIKIREADING, comparable performance on other datasets, and a 3.5x-6.7x speedup in document encoding, while allowing access to much longer documents.

Problem Setting
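Given a training set of question-document-answer triples $\{x^{(i)}, d^{(i)}, y^{(i)}\}_{i=1}^{N}$, our goal is to learn a model that produces an answer y for a question-document pair (x, d). A document d is a list of sentences $s_1, s_2, \ldots, s_{|d|}$, and we assume that the answer can be produced from a small latent subset of the sentences. Figure 2 illustrates a training example in which sentence $s_5$ is in this subset.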

Data
We evaluate on WIKIREADING, WIKIREADING LONG, and a new dataset, WIKISUGGEST. WIKIREADING (Hewlett et al., 2016) is a QA dataset automatically generated from Wikipedia and Wikidata: given a Wikipedia page about an entity and a Wikidata property, such as PROFESSION, or GENDER, the goal is to infer the target value based on the document. Unlike other recently released large-scale datasets (Rajpurkar et al., 2016; Trischler et al., 2016a), WIKIREADING does not annotate answer spans, making sentence selection more challenging.
Due to the structure and short length of most Wikipedia documents (median number of sentences: 9), the answer can usually be inferred from the first few sentences. Thus, the data is not ideal for testing a sentence selection model compared to a model that uses the first few sentences. Table 1 quantifies this intuition: we consider sentences containing the answer y* as a proxy for sentences that should be selected, and report how often y* appears in the document. Additionally, we report how frequently this proxy oracle sentence is the first sentence. We observe that in WIKIREADING, the answer appears verbatim in 47.1% of the examples, and in 75% of them the match is in the first sentence. Thus, the importance of modeling sentence selection is limited.
To remedy that, we filter WIKIREADING and ensure a more even distribution of answers throughout the document. We prune short documents with fewer than 10 sentences, and only consider Wikidata properties for which the best model of Hewlett et al. (2016) obtains an accuracy of less than 60%. This prunes out properties such as GENDER, GIVEN NAME, and INSTANCE OF.¹ The resulting WIKIREADING LONG dataset contains 1.97M examples, where the answer appears in 50.4% of the examples, and appears in the first sentence only 31% of the time. On average, the documents in WIKIREADING LONG contain 1.2k tokens, more than those of the SQuAD (average 122 tokens) or CNN (average 763 tokens) datasets (see Table 2). Table 1 shows that the exact answer string is often missing from the document in WIKIREADING. This is because Wikidata statements include properties such as NATIONALITY, which are not explicitly mentioned, but can still be inferred. A drawback of this dataset is that the queries, Wikidata properties, are not natural language questions and are limited to 858 properties.
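The filtering rule above can be sketched as a simple predicate over examples. This is only an illustration: the dictionary keys ("sentences", "property") and the `property_accuracy` lookup are hypothetical names, and the thresholds are the two stated in the text (at least 10 sentences, baseline accuracy below 60%).

```python
def filter_long_examples(examples, property_accuracy,
                         min_sentences=10, max_accuracy=0.60):
    """Keep examples with long documents and hard properties.

    examples: iterable of dicts with hypothetical keys
        "sentences" (list of str) and "property" (str).
    property_accuracy: dict mapping a property name to the accuracy
        of a baseline model on that property, in [0, 1].
    """
    kept = []
    for ex in examples:
        # Prune short documents (fewer than min_sentences sentences).
        long_enough = len(ex["sentences"]) >= min_sentences
        # Keep only properties the baseline finds hard.
        hard_property = property_accuracy[ex["property"]] < max_accuracy
        if long_enough and hard_property:
            kept.append(ex)
    return kept
```

An example is kept only when both conditions hold, so a long document about an easy property is pruned just like a short document.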
To model more realistic language queries, we collect the WIKISUGGEST dataset as follows. We use the Google Suggest API to harvest natural language questions and submit them to Google Search. Whenever Google Search returns a box with a short answer from Wikipedia (Figure 3), we create an example from the question, answer, and the Wikipedia document. If the answer string is missing from the document this often implies a spurious question-answer pair, such as ('what time is half time in rugby', '80 minutes, 40 minutes'). Thus, we pruned question-answer pairs without the exact answer string. We examined fifty examples after filtering and found that 54% were well-formed question-answer pairs where we can ground answers in the document, 20% contained answers without textual evidence in the document (the answer string exists in an irrelevant context), and 26% contained incorrect QA pairs such as the last two examples in Figure 3.

Model
Our model has two parts (Figure 1): a fast sentence selection model (Section 4.1) that defines a distribution p(s | x, d) over sentences given the input question (x) and the document (d), and a more costly answer generation model (Section 4.3) that generates an answer y given the question and a document summary, $\hat{d}$ (Section 4.2), that focuses on the relevant parts of the document.
¹ These three relations alone account for 33% of the data.

Sentence Selection Model
Following recent work on sentence selection (Yu et al., 2014; Yang et al., 2016b), we build a feed-forward network to define a distribution over the sentences $s_1, s_2, \ldots, s_{|d|}$. We consider three simple sentence representations: a bag-of-words (BoW) model, a chunking model, and a (parallelizable) convolutional model. These models are efficient at dealing with long documents, but do not fully capture the sequential nature of text.
BoW Model Given a sentence s, we denote by BoW(s) the bag-of-words representation that averages the embeddings of the tokens in s. To define a distribution over the document sentences, we employ a standard attention model (e.g., Hermann et al., 2015), where the BoW representation of the query is concatenated to the BoW representation of each sentence $s_l$, and then passed through a single layer feed-forward network:
$$h_l = [\mathrm{BoW}(x); \mathrm{BoW}(s_l)], \qquad v_l = v^\top \mathrm{ReLU}(W h_l), \qquad p(s = s_l \mid x, d) = \mathrm{softmax}(v_l),$$
where [;] indicates row-wise concatenation, and the matrix W, the vector v, and the word embeddings are learned parameters.
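To make the BoW scorer concrete, the following is a minimal pure-Python sketch, not the authors' implementation. Embeddings are passed in as plain lists of floats, the helper names (`bow`, `softmax`, `sentence_distribution`) are illustrative, and the ReLU nonlinearity inside the single feed-forward layer is an assumption of this sketch.

```python
import math

def bow(token_vectors):
    """Average a non-empty list of equal-length embedding vectors."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sentence_distribution(query_vecs, sentences_vecs, W, v):
    """Return p(s = s_l | x, d) for every sentence s_l.

    W is a list of rows (hidden x 2*dim) and v a list of length hidden.
    The ReLU activation is an assumption of this sketch.
    """
    q = bow(query_vecs)
    scores = []
    for sent in sentences_vecs:
        h = q + bow(sent)  # concatenation [BoW(x); BoW(s_l)]
        hidden = [max(0.0, sum(w * x for w, x in zip(row, h))) for row in W]
        scores.append(sum(vk * zk for vk, zk in zip(v, hidden)))
    return softmax(scores)
```

Because each sentence is scored independently of the others before the softmax, the per-sentence work can be parallelized, which is what makes this component cheap relative to an RNN.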
Chunked BoW Model To obtain finer granularity, we split sentences into fixed-size smaller chunks (seven tokens per chunk) and score each chunk separately (Miller et al., 2016). This is beneficial if questions are answered with subsentential units, as it allows the model to learn attention over different chunks. We split a sentence $s_l$ into a fixed number of chunks ($c_{l,1}, c_{l,2}, \ldots, c_{l,J}$), generate a BoW representation for each chunk, and score it exactly as in the BoW model. We obtain a distribution over chunks, and compute sentence probabilities by marginalizing over chunks from the same sentence. Let $p(c = c_{l,j} \mid x, d)$ be the distribution over chunks from all sentences, then:
$$p(s = s_l \mid x, d) = \sum_{j=1}^{J} p(c = c_{l,j} \mid x, d),$$
with the same parameters as in the BoW model.
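The chunking and marginalization steps can be illustrated as follows. This is a sketch under simple assumptions: chunks are contiguous, the last chunk of a sentence may be shorter than seven tokens, and `chunk_to_sentence` is a hypothetical list recording which sentence each chunk came from.

```python
def chunk(tokens, size=7):
    """Split a token list into contiguous chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def sentence_probs_from_chunks(chunk_probs, chunk_to_sentence, num_sentences):
    """Marginalize a distribution over chunks into one over sentences.

    chunk_probs[i] is p(c = c_i | x, d); chunk_to_sentence[i] is the
    index l of the sentence that chunk i belongs to.
    """
    probs = [0.0] * num_sentences
    for p, l in zip(chunk_probs, chunk_to_sentence):
        probs[l] += p  # sum over chunks of the same sentence
    return probs
```

Since the chunk probabilities sum to one over all chunks in the document, the resulting sentence probabilities also sum to one.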
Convolutional Neural Network Model While our sentence selection model is designed to be fast, we explore a convolutional neural network (CNN) that can compose the meaning of nearby words. A CNN is still efficient, since all filters can be computed in parallel. Following previous work (Kim, 2014; Kalchbrenner et al., 2014), we concatenate the embeddings of tokens in the query x and the sentence $s_l$, and run a convolutional layer with F filters and width w over the concatenated embeddings. This results in F features for every span of length w, and we employ max-over-time-pooling (Collobert et al., 2011) to get a final representation $h_l \in \mathbb{R}^F$. We then compute $p(s = s_l \mid x, d)$ by passing $h_l$ through a single layer feed-forward network as in the BoW model.
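As an illustration of the convolution followed by max-over-time pooling, here is a small pure-Python sketch. It assumes each filter is stored as a flat list of width × dim weights and omits bias terms and nonlinearities for brevity; the function name is illustrative.

```python
def conv_max_pool(embeddings, filters, width):
    """Apply each filter to every window of `width` tokens and max-pool.

    embeddings: list of T token vectors (query and sentence tokens
        concatenated), each of dimension D, with T >= width.
    filters: list of F filters, each a flat list of width * D weights.
    Returns a list of F features (one per filter).
    """
    features = []
    for filt in filters:
        best = float("-inf")
        for start in range(len(embeddings) - width + 1):
            # Flatten the window of `width` consecutive token vectors.
            window = [x for vec in embeddings[start:start + width] for x in vec]
            activation = sum(w * x for w, x in zip(filt, window))
            best = max(best, activation)  # max over time
        features.append(best)
    return features
```

The output has a fixed size F regardless of the sentence length, so it can be fed to the same kind of single-layer scorer used by the BoW model.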

Document Summary
After computing attention over sentences, we create a summary that focuses on the document parts related to the question using deterministic soft attention or stochastic hard attention. Hard attention is more flexible, as it can focus on multiple sentences, while soft attention is easier to optimize and retains information from multiple sentences.
Hard Attention We sample a sentence ŝ ∼ p(s | x, d) and fix the document summary $\hat{d} = \hat{s}$ to be that sentence during training. At test time, we choose the most probable sentence. To extend the document summary to contain more information, we can sample without replacement K sentences from the document and define the summary to be the concatenation of the sampled sentences $\hat{d} = [\hat{s}_1; \hat{s}_2; \ldots; \hat{s}_K]$.
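The hard attention summary can be sketched as below. Sampling without replacement is implemented here by drawing one sentence at a time and removing it from the pool, which is one reasonable reading of the description rather than a confirmed detail; the function names are illustrative, and the selection probabilities are assumed to be strictly positive (as they are after a softmax).

```python
import random

def sample_summary(sentences, probs, k=1, rng=random):
    """Sample k distinct sentences according to probs (training time).

    Returns the chosen sentence indices and the concatenated tokens.
    """
    remaining = list(range(len(sentences)))
    weights = list(probs)
    chosen = []
    for _ in range(min(k, len(sentences))):
        # Draw a position among the remaining sentences, then remove it.
        pos = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        chosen.append(remaining.pop(pos))
        weights.pop(pos)
    summary = [tok for i in chosen for tok in sentences[i]]
    return chosen, summary

def argmax_summary(sentences, probs):
    """Pick the most probable sentence (test time)."""
    best = max(range(len(sentences)), key=lambda i: probs[i])
    return sentences[best]
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible during debugging.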

Soft Attention
In the soft attention model (Bahdanau et al., 2015) we compute a weighted average of the tokens in the sentences according to p(s | x, d). More explicitly, let $\hat{d}_m$ be the mth token of the document summary. Then, by fixing the length of every sentence to M tokens (long sentences are truncated and short ones are padded), the blended tokens are computed as follows:
$$\hat{d}_m = \sum_{l=1}^{|d|} p(s = s_l \mid x, d) \cdot s_{l,m},$$
where $s_{l,m}$ is the mth word in the lth sentence (m ∈ {1, . . ., M}).
As the answer generation models (Section 4.3) take a sequence of vectors as input, we average the tokens at the word level. This gives the hard attention an advantage since it samples a "real" sentence without mixing words from different sentences. Conversely, soft attention is trained more easily, and has the capacity to learn a low-entropy distribution that is similar to hard attention.
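The word-level blending used by soft attention can be sketched as follows: every sentence is first padded or truncated to M token vectors, and the mth summary token is the probability-weighted average of the mth tokens of all sentences. The helper names and the explicit `pad_vector` argument are illustrative assumptions.

```python
def pad_or_truncate(vectors, length, pad_vector):
    """Force a list of token vectors to exactly `length` entries."""
    return (vectors + [pad_vector] * length)[:length]

def soft_summary(sentence_token_vecs, probs):
    """Blend sentences token by token.

    sentence_token_vecs[l][m] is the embedding of the m-th token of
    sentence l; all sentences must already have the same length M.
    probs[l] is p(s = s_l | x, d).
    """
    M = len(sentence_token_vecs[0])
    D = len(sentence_token_vecs[0][0])
    summary = [[0.0] * D for _ in range(M)]
    for p, sent in zip(probs, sentence_token_vecs):
        for m in range(M):
            for i in range(D):
                summary[m][i] += p * sent[m][i]
    return summary
```

When the distribution puts all its mass on one sentence, the blended summary reduces to that sentence, which is the sense in which a low-entropy soft attention resembles hard attention.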

Answer Generation Model
State-of-the-art question answering models (Chen et al., 2016; Seo et al., 2016) use RNN models to encode the document and question and select the answer. We focus on a hierarchical model with fast sentence selection, and do not subscribe to a particular answer generation architecture.
Here we implemented the state-of-the-art word-level sequence-to-sequence model with placeholders, described by Hewlett et al. (2016). This model can produce answers that do not appear in the sentence verbatim. It takes the query tokens and the document (or document summary) tokens as input and encodes them with a Gated Recurrent Unit (GRU; Cho et al. (2014)). Then, the answer is decoded with another GRU model, defining a distribution over answers p(y | x, d).
In this work, we modified the original RNN: the word embeddings for the RNN decoder input, output and original word embeddings are shared.

Learning
We consider three approaches for learning the model parameters (denoted by θ): (1) We present a pipeline model, where we use distant supervision to train a sentence selection model independently from an answer generation model. (2) The hard attention model is optimized with the REINFORCE algorithm (Williams, 1992). (3) The soft attention model is fully differentiable and is optimized end-to-end with backpropagation.
Distant Supervision While we do not have explicit supervision for sentence selection, we can define a simple heuristic for labeling sentences. We define the gold sentence to be the first sentence that has a full match of the answer string, or the first sentence in the document if no full match exists. By labeling gold sentences, we can train sentence selection and answer generation independently with standard supervised learning, maximizing the log-likelihood of the gold sentence and answer, given the document and query. Let y* and s* be the target answer and sentence, where s* also serves as the document summary. The objective is to maximize:
$$\log p_\theta(y^*, s^* \mid x, d) = \log p_\theta(s^* \mid x, d) + \log p_\theta(y^* \mid s^*, x).$$
Since at test time we do not have access to the target sentence s* needed for answer generation, we replace it by the model prediction $\arg\max_{s_l \in d} p_\theta(s = s_l \mid d, x)$.
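The labeling heuristic for distant supervision is simple enough to state in a few lines. In this sketch a "full match" is taken to mean plain substring containment on the raw sentence text, which is an assumption; the original may match on tokens or normalize case.

```python
def gold_sentence_index(sentences, answer):
    """Return the index of the proxy gold sentence.

    The gold sentence is the first sentence containing the full answer
    string, or the first sentence of the document if none contains it.
    """
    for i, sentence in enumerate(sentences):
        if answer in sentence:
            return i
    return 0  # fall back to the first sentence
```

For the document in Figure 2, with the answer "158 people", this heuristic would label the sentence that reports the death toll as gold, since no earlier sentence contains that string.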
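Reinforcement Learning Because the target sentence is missing, we use reinforcement learning where our action is sentence selection, and our goal is to select sentences that lead to a high reward. We define the reward for selecting a sentence as the log probability of the correct answer given that sentence, that is, $R_\theta(s_l) = \log p_\theta(y = y^* \mid s_l, x)$. Then the learning objective is to maximize the expected reward:
$$\mathbb{E}_{s \sim p_\theta(s \mid x, d)}[R_\theta(s)] = \sum_{s_l \in d} p_\theta(s = s_l \mid x, d) \cdot R_\theta(s_l) = \sum_{s_l \in d} p_\theta(s = s_l \mid x, d) \cdot \log p_\theta(y = y^* \mid s_l, x).$$
Following REINFORCE (Williams, 1992), we approximate the gradient of the objective with a sample, ŝ ∼ p_θ(s | x, d):
$$\nabla \mathbb{E}[R_\theta(s)] \approx \nabla \log p_\theta(y = y^* \mid \hat{s}, x) + \log p_\theta(y = y^* \mid \hat{s}, x) \cdot \nabla \log p_\theta(\hat{s} \mid x, d).$$
Sampling K sentences is similar and omitted for brevity.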
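The sentence selection part of the REINFORCE estimate can be checked numerically on a toy problem. The sketch below assumes the selection distribution is a softmax over per-sentence logits and treats the rewards as fixed numbers, so it covers only the term that multiplies the reward by the gradient of the log selection probability; all function names are illustrative.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_logit_grad(logits, rewards, sampled):
    """Single-sample estimate R(s_hat) * d log p(s_hat) / d logits.

    For a softmax, d log p(s_hat) / d z_k = 1[k == s_hat] - p_k.
    """
    p = softmax(logits)
    return [rewards[sampled] * ((1.0 if k == sampled else 0.0) - p[k])
            for k in range(len(p))]

def exact_logit_grad(logits, rewards):
    """Exact gradient of sum_l p_l * R_l with respect to the logits."""
    p = softmax(logits)
    expected = sum(pk * rk for pk, rk in zip(p, rewards))
    return [pk * (rk - expected) for pk, rk in zip(p, rewards)]
```

Averaging `reinforce_logit_grad` over sentences drawn from the selection distribution recovers `exact_logit_grad`, which is why a single sample gives an unbiased, if high-variance, estimate.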
Training with REINFORCE is known to be unstable due to the high variance induced by sampling. To reduce variance, we use curriculum learning: we start training with distant supervision and gently transition to reinforcement learning, similar to DAGGER (Ross et al., 2011). Given an example, we define the probability of using the distant supervision objective at each step as $r^e$, where r is the decay rate and e is the index of the current training epoch.
Soft Attention We train the soft attention model by maximizing the log likelihood of the correct answer y* given the input question and document, log p_θ(y* | d, x). Recall that the answer generation model takes as input the query x and document summary $\hat{d}$, and since $\hat{d}$ is an average of sentences weighted by sentence selection, the objective is differentiable and is trained end-to-end.
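The curriculum used to stabilize REINFORCE training amounts to a per-example coin flip whose bias decays with the epoch. A minimal sketch, assuming epochs are indexed from 0 (so the first epoch uses distant supervision only) and with an illustrative function name:

```python
import random

def use_distant_supervision(epoch, decay_rate, rng=random):
    """Decide which objective to use for the current example.

    Returns True (distant supervision) with probability
    decay_rate ** epoch, otherwise False (reinforcement learning).
    """
    return rng.random() < decay_rate ** epoch
```

With a decay rate of 0.5, for example, roughly half of the examples in epoch 1 and a quarter of the examples in epoch 2 would still be trained with the distant supervision objective.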

Experiments
Experimental Setup We used 70% of the data for training, 10% for development, and 20% for testing in all datasets. We used the first 35 sentences in each document as input to the hierarchical models, where each sentence has a maximum length of 35 tokens. Similar to Miller et al. (2016), we add the first five words in the document (typically the title) at the end of each sentence sequence for WIKISUGGEST. We add the sentence index as a one-hot vector to the sentence representation. We coarsely tuned and fixed most hyperparameters for all models, and separately tuned the learning rate and gradient clipping coefficients for each model on the development set. The details are reported in the supplementary material.
Evaluation Metrics Our main evaluation metric is answer accuracy, the proportion of questions answered correctly. For sentence selection, since we do not know which sentence contains the answer, we report approximate sentence selection accuracy by matching sentences that contain the answer string (y*). For the soft attention model, we treat the sentence with the highest probability as the predicted sentence.
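Both metrics can be sketched in a few lines. Several details here are assumptions rather than facts from the text: answers are compared by exact string equality, a selected sentence counts as correct when it contains the answer string, and examples whose document never contains the answer string are skipped when computing selection accuracy.

```python
def answer_accuracy(predictions, gold_answers):
    """Proportion of questions whose predicted answer equals the gold one."""
    correct = sum(p == g for p, g in zip(predictions, gold_answers))
    return correct / len(gold_answers)

def approx_selection_accuracy(examples):
    """Approximate sentence selection accuracy.

    examples: iterable of (sentences, answer, predicted_index) tuples.
    Examples with no sentence containing the answer are skipped
    (an assumption of this sketch).
    """
    hits = 0
    total = 0
    for sentences, answer, predicted in examples:
        if not any(answer in s for s in sentences):
            continue
        total += 1
        if answer in sentences[predicted]:
            hits += 1
    return hits / total if total else 0.0
```

Because string containment is only a proxy for relevance, a model can score well on this metric while picking sentences that mention the answer in an unrelated context.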

Models and Baselines
The models PIPELINE, REINFORCE, and SOFTATTEND correspond to the learning objectives in Section 5. We compare these models against the following baselines: FIRST always selects the first sentence of the document. The answer appears in the first sentence in 33% and 15% of documents in WIKISUGGEST and WIKIREADING LONG, respectively. BASE is the re-implementation of the best model by Hewlett et al. (2016), consuming the first 300 tokens. We experimented with providing additional tokens to match the length of document available to hierarchical models, but this performed poorly. ORACLE selects the first sentence with the answer string if it exists, or otherwise the first sentence in the document.

Answer Accuracy Results
Table 3 summarizes answer accuracy on all datasets. We use the BOW encoder for sentence selection as it is the fastest. The proposed hierarchical models match or exceed the performance of BASE, while reducing the number of RNN steps significantly, from 300 to 35 (or 70 for K=2), and allowing access to later parts of the document. Figure 4 reports the speed gain of our system. While throughput at training time can be improved by increasing the batch size, at test time real-life QA systems use batch size 1, where REINFORCE obtains a 3.5x-6.7x speedup (for K=2 or K=1). In all settings, REINFORCE was at least three times faster than the BASE model. All models outperform the FIRST baseline, and utilizing the proxy oracle sentence (ORACLE) improves performance on WIKISUGGEST and WIKIREADING LONG. In WIKIREADING, where the proxy oracle sentence is often missing and documents are short, BASE outperforms ORACLE.
. . . reported numbers due to modifications in implementation and better optimization.
Jointly learning answer generation and sentence selection, REINFORCE outperforms PIPELINE, which relies on a noisy supervision signal for sentence selection. The improvement is larger in WIKIREADING LONG, where the approximate supervision for sentence selection is missing for 51% of examples compared to 22% of examples in WIKISUGGEST. In WIKIREADING LONG, REINFORCE outperforms all other models (excluding ORACLE, which has access to gold labels at test time). In other datasets, BASE performs slightly better than the proposed models, at the cost of speed. In these datasets, the answers are concentrated in the first few sentences. BASE is advantageous in categorical questions (such as GENDER), gathering bits of evidence from the whole document, at the cost of speed. Encouragingly, our system almost reaches the performance of ORACLE in WIKIREADING, showing strong results in a limited token setting.
Sampling an additional sentence into the document summary increased performance in all datasets, illustrating the flexibility of hard attention compared to soft attention. Additional sampling allows recovery from mistakes in WIKIREADING LONG, where sentence selection is challenging. Comparing hard attention to soft attention, we observe that REINFORCE performed better than SOFTATTEND. The attention distribution learned by the soft attention model was often less peaked, generating noisier summaries.

Sentence Selection Results
. . . where the answer is in the document. In WIKISUGGEST performance is at 67.5%, mostly due to noise in the data. PIPELINE performs slightly better as it is directly trained towards our noisy evaluation. However, not all sentences that contain the answer are useful to answer the question (first example in Table 5). REINFORCE learned to choose sentences that are likely to generate a correct answer rather than proxy gold sentences, improving the final answer accuracy. On WIKIREADING LONG, complex models (CNN and CHUNKBOW) outperform the simple BOW, while on WIKISUGGEST BOW performed best.
Qualitative Analysis We categorized the primary reasons for the errors in Table 6 and present an example for each error type in [. . .]. Interestingly, the answer string can still appear in the document as in the first example in Table 5: 'Saint Petersburg' appears in the document (4th sentence). Answer generation at times failed to generate the answer even when the correct sentence was selected. This was pronounced especially in long answers. For the automatically collected WIKISUGGEST dataset, noisy question-answer pairs were problematic, as discussed in Section 3. However, the models frequently guessed the spurious answer. We attribute higher proxy performance in sentence selection for WIKISUGGEST to noise. In manual analysis, sentence selection was harder in WIKIREADING LONG, explaining why sampling two sentences improved performance.
In the first correct prediction (Table 5), the model generates the answer, even when it is not in the document. The second example shows when our model spots the relevant sentence without obvious clues. In the last example the model spots a sentence far from the head of the document.

Related Work
There has been substantial interest in datasets for reading comprehension. MCTest (Richardson et al., 2013) is a smaller-scale dataset focusing on common sense reasoning; bAbI (Weston et al., 2015) is a synthetic dataset that captures various aspects of reasoning; and SQuAD (Rajpurkar et al., 2016; Wang et al., 2016; Xiong et al., 2016) and NewsQA (Trischler et al., 2016a) are QA datasets where the answer is a span in the document. Compared to WIKIREADING, some datasets cover shorter passages (average 122 words for SQuAD). Cloze-style question answering datasets (Hermann et al., 2015; Onishi et al., 2016; Hill et al., 2015) assess machine comprehension but do not form questions. The recently released MS MARCO dataset (Nguyen et al., 2016) consists of query logs, web documents and crowd-sourced answers.
Hierarchical models which treat sentence selection as a latent variable have been applied to text categorization (Yang et al., 2016b), extractive summarization (Cheng and Lapata, 2016), machine translation (Ba et al., 2014) and sentiment analysis (Yessenalina et al., 2010; Lei et al., 2016). To the best of our knowledge, we are the first to use the hierarchical nature of a document for QA.
Finally, our work is related to the reinforcement learning literature. Hard and soft attention were examined in the context of caption generation (Xu et al., 2015). Curriculum learning was investigated in Sachan and Xing (2016), but they focused on the ordering of training examples while we combine supervision signals. Reinforcement learning recently gained popularity in tasks such as coreference resolution (Clark and Manning, 2016), information extraction (Narasimhan et al., 2016), semantic parsing (Andreas et al., 2016) and textual games (Narasimhan et al., 2015; He et al., 2016).

Conclusion
We presented a coarse-to-fine framework for QA over long documents that quickly focuses on the relevant portions of a document. In future work we would like to deepen the use of structural clues and answer questions over multiple documents, using paragraph structure, titles, sections and more. We argue that this is necessary for developing systems that can efficiently answer the information needs of users over large quantities of text.

Figure 1: Hierarchical question answering: the model first selects relevant sentences that produce a document summary ($\hat{d}$) for the given query (x), and then generates an answer (y) based on the summary ($\hat{d}$) and the query x.

Figure 3: Example queries and answers of WIKISUGGEST.

Figure 4: Runtime for document encoding on an Intel Xeon CPU E5-1650 @3.20GHz on WIKIREADING at test time. The boxplot represents the throughput of BASE and each line plot shows the proposed models' speed gain over BASE. Exact numbers are reported in the supplementary material.

Table 1: Statistics on string matches of the answer y* in the document. The third column only considers examples with answer match. Often the answer string is missing or appears many times while it is relevant to query only once.

Table 2: Data statistics.

Table 3: Answer prediction accuracy on the test set. K is the number of sentences in the document summary.

Table 4: Approximate sentence selection accuracy on the development set for all models. We use ORACLE to find a proxy gold sentence and report the proportion of times each model selects the proxy sentence.

Table 6: Manual error analysis on 50 errors from the development set for REINFORCE (K=1).

Table 5: Example outputs from REINFORCE (K=1) with BOW sentence selection model. First column: sentence index (l). Second column: attention distribution $p_\theta(s_l \mid d, x)$. Last column: text $s_l$.
Friedmann died on September 16, 1925, at the age of 37, from typhoid fever that he contracted while returning from a vacation in Crimean Peninsula.
Blaine was born and raised in, Brooklyn, New York the son of Patrice Maureen White . . .
Bourgogne or vin de Bourgogne ) is wine made in the . . .
2  90.8  The most famous wines produced here . . . are dry red wines made from Pinot noir grapes . . .
Anchen Margaretha Dreyer (born 27 March 1952) is a South African politician, a Member of Parliament for the opposition Democratic Alliance, and currently . . .
LaSer UK is a provider of credit and loyalty programmes, operating in the UK and Republic . . .
4  82.3  The company's operations are in Solihull and Belfast where it employs 800 people.
Lavigne married Nickelback frontman, Chad Kroeger, in 2013.
Avril Ramona Lavigne was . . .