Simple and Effective Curriculum Pointer-Generator Networks for Reading Comprehension over Long Narratives

This paper tackles the problem of reading comprehension over long narratives, where documents easily span thousands of tokens. We propose a curriculum learning (CL) based Pointer-Generator framework for reading/sampling over large documents, enabling diverse training of the neural model based on the notion of alternating contextual difficulty. This can be interpreted as a form of domain randomization and/or generative pretraining during training. To this end, the usage of the Pointer-Generator softens the requirement of having the answer within the context, enabling us to construct diverse training samples for learning. Additionally, we propose a new Introspective Alignment Layer (IAL), which reasons over decomposed alignments using block-based self-attention. We evaluate our proposed method on the NarrativeQA reading comprehension benchmark, achieving state-of-the-art performance and improving over existing baselines by 51% (relative) on BLEU-4 and 17% (relative) on Rouge-L. Extensive ablations confirm the effectiveness of our proposed IAL and CL components.


Introduction
Teaching machines to read and comprehend is a fundamentally interesting and challenging problem in AI research (Hermann et al., 2015; Trischler et al., 2016; Rajpurkar et al., 2016). While there have been considerable and broad improvements in reading and understanding textual snippets, the ability of machines to read and understand complete stories and novels is still in its infancy (Kočiskỳ et al., 2018). The challenge becomes insurmountable in light of not only the large context but also the intrinsic difficulty of narrative text, which arguably requires a larger extent of reasoning. This motivates the inception of relevant, interesting benchmarks such as the NarrativeQA Reading Comprehension challenge (Kočiskỳ et al., 2018).
The challenges of having a long context have traditionally been mitigated by a two-step approach: retrieval first and then reading second (Chen et al., 2017; Wang et al., 2018; Lin et al., 2018). This difficulty mirrors the challenges of open-domain question answering, albeit with additional difficulties due to the nature of narrative text (stories and retrieved excerpts need to be coherent). While some recent works have proposed going around this by training retrieval and reading components end-to-end, this paper follows the traditional paradigm with a slight twist. We train our models to be robust regardless of whatever is retrieved. This is similar in spirit to domain randomization (Tobin et al., 2017).
In order to do so, we propose a diverse curriculum learning scheme (Bengio et al., 2009) based on two concepts of difficulty. The first depends on whether the answer exists in the context (answerability) and aims to bridge the gap between training-time and inference-time retrieval. The second depends on the size of the retrieved documents (coherence and understandability). While conceptually simple, we found that these heuristics help improve the performance of the QA model. To the best of our knowledge, we are the first to incorporate these notions of difficulty in QA reading models.
All in all, our model tries to learn to generate the answer even if the correct answer does not appear as evidence, which acts as a form of generative pretraining during training. This is akin to learning to guess, largely motivated by how humans are able to extrapolate/guess even when given access to only a small fragment of a film/story. In this case, we train our model to generate answers, making do with whatever context it is given. To this end, a curriculum learning scheme controls the extent of difficulty of the context given to the model.
At this juncture, it is easy to see that standard pointer-based reading comprehension models would not adapt well to this scheme, as they fundamentally require the golden label to exist within the context (Wang and Jiang, 2016b; Seo et al., 2016). As such, our overall framework adopts a pointer-generator framework (See et al., 2017) that learns to point and generate, conditioned on not only the context but also the question. This relaxes the condition, enabling us to train our models with diverse views of the same story, which is inspired by domain randomization (Tobin et al., 2017). For our particular task at hand, the key idea is that, even if the answer is not found in the context, we learn to generate the answer despite the noisy context.
Finally, our method also incorporates a novel Introspective Alignment Layer (IAL). The key idea of the IAL mechanism is to introspect over decomposed alignments using block-style local self-attention. This not only imbues our model with additional reasoning capabilities but also enables a finer-grained (and local-globally aware) comparison between soft-aligned representations. All in all, our IAL mechanism can be interpreted as learning a matching over matches.
Our Contributions. The prime contributions of this work are summarized as follows:
• We propose a curriculum learning based Pointer-Generator model for reading comprehension over narratives (long stories). For the first time, we propose two different notions of difficulty for constructing diverse views of long stories for training. We show that this approach achieves better results than existing models adapted for open-domain question answering.
• Our proposed model incorporates an Introspective Alignment Layer (IAL), which uses block-based self-attentive reasoning over decomposed alignments. Ablative experiments show improvements of our IAL layer over the standard usage of vanilla self-attention.
• Our proposed framework (IAL-CPG) achieves state-of-the-art performance on the NarrativeQA reading comprehension challenge. On metrics such as BLEU-4 and Rouge-L, we achieve a 17% relative improvement over prior state-of-the-art and a 10 times improvement in terms of BLEU-4 score over BiDAF, a strong span prediction based model.
• We share two additional contributions. Firstly, we share negative results on using Reinforcement Learning to improve the quality of generated answers (Paulus et al., 2017; Bahdanau et al., 2016). Secondly, we show that the evaluation scheme in NarrativeQA is flawed: models can occasionally generate satisfactory (correct) answers but score zero points during evaluation.

Our Proposed Framework
This section outlines the components of our proposed architecture. Since our problem mainly deals with extremely long sequences, we employ an initial retrieval phase that uses either the answer or the question as a cue (the query for retrieving relevant chunks/excerpts). The retrieval stage is controlled by our curriculum learning process, whose details are deferred to subsequent sections. The overall framework is illustrated in Figure 1.

Introspective Alignment Reader
This section introduces our proposed Introspective Alignment Reader (IAL-Reader).
Input and Context Encoding. Our model accepts two inputs, a context $C$ and a question $Q$. Each input is a sequence of words. We pass each sequence into a shared Bidirectional LSTM layer:

$$H^c = \text{BiLSTM}(C), \qquad H^q = \text{BiLSTM}(Q)$$

where $H^c \in \mathbb{R}^{\ell_c \times d}$ and $H^q \in \mathbb{R}^{\ell_q \times d}$ are the hidden representations for $C$ and $Q$ respectively.
Introspective Alignment. Next, we pass $H^c$, $H^q$ into an alignment layer. Firstly, we compute a soft attention affinity matrix between $H^c$ and $H^q$ as follows:

$$E_{ij} = F(h^c_i)^\top F(h^q_j) \qquad (1)$$

where $h^c_i$ is the $i$-th word in the context and $h^q_j$ is the $j$-th word in the question. $F(\cdot)$ is a standard nonlinear transformation function (i.e., $F(x) = \sigma(Wx + b)$, where $\sigma$ indicates a non-linearity function), and is shared between context and question. $E \in \mathbb{R}^{\ell_c \times \ell_q}$ is the soft matching matrix. To learn alignments between context and question, we compute:

$$A = \text{Softmax}(E)\, H^q$$

where $A \in \mathbb{R}^{\ell_c \times d}$ is the aligned representation of $H^c$.

Reasoning over Alignments. Next, to reason over alignments, we compute a self-attentive reasoning over decomposed alignments:

$$G_{ij} = F_s([A_i; H^c_i; A_i - H^c_i; A_i \odot H^c_i])^\top F_s([A_j; H^c_j; A_j - H^c_j; A_j \odot H^c_j]) \qquad (2)$$

where square brackets $[\cdot\,;\cdot]$ denote vector concatenation and $F_s(\cdot)$ is another nonlinear transformation layer.

Local Block-based Self-Attention. Since $\ell_c$ is large in our case (easily 2000), computing Equation (2) may become computationally prohibitive. As such, we compute the scoring function only for cases where $|i - j| \leq b$, in which $b$ is a predefined hyperparameter and also the block size. Intuitively, the initial alignment layer (i.e., Equation 1) already considers a global view. As such, this self-attention layer can be considered as a local-view perspective, confining the affinity matrix computation to a local window of $b$. Finally, to compute the introspective alignment representation, we compute:

$$B = \text{Softmax}(G)\, [A; H^c; A - H^c; A \odot H^c] \qquad (3)$$

where $B \in \mathbb{R}^{\ell_c \times 4d}$ is the introspective aligned representation of $A$. Finally, we use another $d$-dimensional BiLSTM layer to aggregate the aligned representations:

$$Y = \text{BiLSTM}(B) \qquad (4)$$

where $Y \in \mathbb{R}^{\ell_c \times 2d}$ is the final contextual representation of context $C$.
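To make the local window concrete, the following is a minimal pure-Python sketch of block-restricted self-attention over decomposed alignments. It is an illustration under stated assumptions, not the authors' implementation: the function names (`decompose`, `block_self_attention`, `scored_pairs`) are hypothetical, the learned nonlinear projection applied before scoring is replaced by the identity, and the 4d feature layout (aligned vector, context vector, their difference and their elementwise product) is assumed from the subtractive and multiplicative features discussed in the ablation study.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def decompose(a_i, h_i):
    """Assumed 4d feature layout for one position: [a; h; a - h; a * h]."""
    return (list(a_i) + list(h_i)
            + [a - h for a, h in zip(a_i, h_i)]
            + [a * h for a, h in zip(a_i, h_i)])


def block_self_attention(A, H, b):
    """Attend each position i only to positions j with |i - j| <= b.

    A, H: lists of l_c vectors of size d (aligned and context representations).
    Returns l_c vectors of size 4d. The learned projection that would be
    applied before scoring is omitted here (identity), as an assumption.
    """
    M = [decompose(a, h) for a, h in zip(A, H)]
    out = []
    for i in range(len(M)):
        lo, hi = max(0, i - b), min(len(M), i + b + 1)
        weights = softmax([dot(M[i], M[j]) for j in range(lo, hi)])
        row = [0.0] * len(M[i])
        for w, j in zip(weights, range(lo, hi)):
            for k, x in enumerate(M[j]):
                row[k] += w * x
        out.append(row)
    return out


def scored_pairs(length, b):
    """Number of (i, j) pairs that are scored under the |i - j| <= b window."""
    return sum(min(length, i + b + 1) - max(0, i - b) for i in range(length))
```

Under these assumptions, `scored_pairs(2000, 200)` counts 761,800 scored position pairs, against 4,000,000 for unrestricted self-attention over 2000 context tokens, i.e. roughly a fifth of the affinity entries.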

Pointer-Generator Decoder
Motivated by recent, seminal work in neural summarization, our model adopts a pointer-generator architecture (See et al., 2017). Given $Y$ (the question-infused contextual representation), we learn to either generate a word from the vocabulary or point to a word from the context. The decision to generate or point is controlled by an additive blend of several components such as the previous decoder state and/or question representation.
The pointer-generator decoder in our framework uses an LSTM decoder with a cell state $c_t \in \mathbb{R}^n$ and hidden state vector $h_t \in \mathbb{R}^n$. At each decoding time step $t$, we compute an attention over $Y$ (Equation 5), where $F_a(\cdot)$ and $F_h(\cdot)$ are nonlinear transformations projecting to $n$ dimensions, $i$ is the position index of the input sequence, and $F_q(\cdot)$ is an additional attentive pooling operator over the question representation $H^q$ (after the context encoding layer).
The semantics of the question may be lost after the alignment-based encoding. As such, this enables us to revisit the question representation to control the decoder. $y_t \in \mathbb{R}^n$ is the context representation at decoding time step $t$ and $a_t \in \mathbb{R}^{\ell_c}$ is an attention distribution over the context words, which is analogous to the final probability distributions that exist in typical span prediction models. Next, we compute the next hidden state of the LSTM decoder, where $w_{t-1}$ is the $(t-1)$-th token in the ground truth answer (teacher forcing). To learn to generate, we compute a distribution $v_t \in \mathbb{R}^{|V_g|}$ over the vocabulary, where $|V_g|$ is the global vocabulary size. The goal of the pointer-generator decoder is to choose between the abstractive distribution $v_t$ over the vocabulary (see Equation 6) and the extractive distribution $a_t$ (see Equation 5) over the context text tokens. To this end, we learn a scalar switch $p_t \in \mathbb{R}$ computed from $F_{pc}(c_t)$, $F_{ph}(h_t)$ and $F_{py}(y_t)$, where $F_{pc}(\cdot)$, $F_{ph}(\cdot)$, $F_{py}(\cdot)$ are linear transformation layers (without bias) which project $c_t$, $h_t$ and $y_t$ into scalar values. To control the blend between the attention context and the generated words, we use a linear interpolation between $a_t$ and $v_t$, and the predicted word $w_t$ at time step $t$ is taken from this blended distribution. Note that we scale (append and prepend) $a_t$ and $v_t$ with zeros to make them the same length (i.e., $\ell_c + |V_g|$). The LSTM decoder runs for a predefined, fixed answer length. During inference, we simply use greedy decoding to generate the output answer.
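The zero-padding and interpolation step can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names are hypothetical, the choice that $p_t$ weights the pointing (extractive) distribution and $1 - p_t$ the generating one is an assumption, and so is the slot layout (context positions first, vocabulary entries after).

```python
def blend_distributions(a_t, v_t, p_t):
    """Mix a pointer distribution with a vocabulary distribution.

    a_t: attention over the l_c context positions (sums to 1).
    v_t: distribution over the |V_g| vocabulary entries (sums to 1).
    p_t: scalar switch in [0, 1]. ASSUMPTION: p_t weights pointing.
    ASSUMPTION: the first l_c slots are context positions, the rest vocabulary.
    """
    if not 0.0 <= p_t <= 1.0:
        raise ValueError("p_t must lie in [0, 1]")
    pointer = list(a_t) + [0.0] * len(v_t)      # append zeros to a_t
    generator = [0.0] * len(a_t) + list(v_t)    # prepend zeros to v_t
    return [p_t * p + (1.0 - p_t) * g for p, g in zip(pointer, generator)]


def greedy_pick(mixed, context_tokens, vocabulary):
    """Greedy decoding step: return the token with the highest blended mass."""
    best = max(range(len(mixed)), key=mixed.__getitem__)
    if best < len(context_tokens):
        return context_tokens[best]             # copied from the context
    return vocabulary[best - len(context_tokens)]  # generated from the vocabulary
```

With `a_t = [0.7, 0.3]`, `v_t = [0.1, 0.9]` and `p_t = 0.5`, the blended vector has length 4 and the second vocabulary entry wins; raising `p_t` to 0.9 makes the first context token win instead. Because both inputs are padded to length $\ell_c + |V_g|$, the blend remains a valid distribution.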

Curriculum Reading
A key advantage of the pointer-generator is that it allows us to generate answers even if the answers do not exist in the context. This also enables us to explore multiple (diverse) views of contexts to train our model. To this end, however, we must be able to effectively identify the most useful retrieved context evidence for training. For that purpose, we propose a diverse curriculum learning scheme based on two intuitive notions of difficulty:
Answerability. It is common practice to retrieve excerpts by using the correct answer as a cue (during training). This establishes an additional gap between training and inference, since correct answers are not available during inference. This measure aims to bridge the gap between question and answer (as a query prompt for passage retrieval). In this case, we consider the set of documents retrieved based on questions as the hard setting, H. Conversely, the set of documents retrieved using answers is regarded as the easy setting, E.
Understandability. This aspect controls how understandable the retrieved documents are as a whole. The key idea of this setting is to control the paragraph/chunk size. Intuitively, a small paragraph/chunk size would enable more relevant components to be retrieved from the document. However, understandability might suffer if the paragraph/chunk size is too small. Conversely, a larger chunk size would be easier to understand. To control the level of understandability, we pre-define several options of chunk sizes (e.g., {50, 100, 200, 500}) which will be swapped and determined during training.
To combine the two measures described above, we construct an easy-hard set pair for each chunk size, i.e., $\{E_k, H_k\}$, where $E_k$ and $H_k$ are retrieved with the answer and the question as the query respectively, $F(\cdot)$ is an arbitrary ranking function which may or may not be parameterized, and $n$ is the size of each retrieved chunk.
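The construction of easy and hard sets per chunk size can be sketched as below. This is an illustrative sketch only: the word-overlap ranker `overlap_rank` is a hypothetical stand-in for the arbitrary ranking function, and the helper names, the whitespace tokenization and the `top_k` parameter are assumptions introduced for the example.

```python
def split_into_chunks(tokens, size):
    """Split a token list into consecutive chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def overlap_rank(chunks, query_tokens, top_k):
    """Hypothetical ranking function: order chunks by word overlap with the query.

    sorted() is stable, so ties keep their original document order.
    """
    query = set(query_tokens)
    ranked = sorted(chunks, key=lambda c: len(query & set(c)), reverse=True)
    return ranked[:top_k]


def build_curriculum_sets(story, question, answer, chunk_sizes, top_k=1):
    """Build an {easy, hard} pair of retrieved chunks for every chunk size.

    easy: retrieved with the answer as the query (only possible in training).
    hard: retrieved with the question as the query (matches inference).
    """
    story_tokens = story.lower().split()
    sets = {}
    for k in chunk_sizes:
        chunks = split_into_chunks(story_tokens, k)
        sets[k] = {
            "easy": overlap_rank(chunks, answer.lower().split(), top_k),
            "hard": overlap_rank(chunks, question.lower().split(), top_k),
        }
    return sets
```

For a toy story such as "John lives in Moscow with his family . Mary asked where John works every day", the question "Where does John live" and the answer "Moscow", a chunk size of 4 retrieves the chunk containing the answer in the easy set, while the hard set retrieves a chunk that overlaps more with the question but does not contain the answer. This is the situation in which a pure span-prediction model would have no valid label.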
Two-layer Curriculum Reading Algorithm.
As our model utilizes two above measures of difficulty, there lies a question on which whether we

Experiments
We conduct our experiments on the NarrativeQA reading comprehension challenge.

Experimental Setup
This section introduces our experimental setups. In Table 1, the numbers beside the model name denote the total context size. Rel. Gain reports the relative improvement of our model over the best baseline reported in (Kočiskỳ et al., 2018) on a specific context size setting.

Model Hyperparameters
The compared baselines are listed below:
• Attention Sum Reader (ASR) (Kadlec et al., 2016) is a simple baseline for reading comprehension. Aside from the results reported in (Kočiskỳ et al., 2018), we report our own implementation of the ASR model. Our implementation follows (Kočiskỳ et al., 2018) closely.
• Reinforced Reader Ranker ($R^3$) (Wang et al., 2018).
• RNET + PG / CPG (Wang et al., 2017b) is a strong, competitive model for paragraph-level reading comprehension. We replace the span prediction layer in RNET with a pointer generator (PG) model with the exact setup as our model. (The performance of the RNET + span predictor is similar to the BiDAF model reported in (Kočiskỳ et al., 2018).) We also investigate equipping RNET + PG with our curriculum learning mechanism (curriculum pointer generator).

Experimental Results
Table 1 reports the results of our approach on the NarrativeQA benchmark. Our approach achieves state-of-the-art results compared to prior work (Kočiskỳ et al., 2018). When compared to the best ASR model in (Kočiskỳ et al., 2018), the relative improvements across all metrics are generally high, ranging from +17% to +51%. The absolute improvements range from approximately +1% to +3%.
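The distinction between relative and absolute improvement can be made explicit with a small helper. This is a sketch of the usual definitions, assumed here to be what Rel. Gain denotes; the function names are hypothetical and the numbers in the usage note are purely illustrative, not taken from Table 1.

```python
def absolute_gain(ours, baseline):
    """Absolute improvement, in the same units as the metric (e.g. points)."""
    return ours - baseline


def relative_gain(ours, baseline):
    """Relative improvement over the baseline, expressed in percent."""
    if baseline == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return 100.0 * (ours - baseline) / baseline
```

For instance, an illustrative score of 5.0 against a baseline of 4.0 is an absolute gain of 1.0 point but a relative gain of 25%. This is why small absolute differences on low-scoring metrics such as BLEU-4 can correspond to large relative gains.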
Pertaining to the models benchmarked by us, we found that our re-implementation of ASR (Ours) leaves a lot to be desired. Consequently, our proposed IAL-CPG model almost doubles the score on all metrics compared to ASR (Ours). The $R^3$ model, which was proposed primarily for open-domain question answering, does better than ASR (Ours) but still falls short. Our RNET-PG model performs slightly better than $R^3$ but fails to get a score on BLEU-4. Finally, RNET-CPG matches the state-of-the-art performance of (Kočiskỳ et al., 2018). However, we note that there might be distinct implementation differences in the primary retrieval mechanism and the environment/preprocessing setup. A fair comparison for observing the effect of our curriculum reading is the improvement between RNET-PG and RNET-CPG.

Ablation Study
In this section, we provide an extensive ablation study on all the major components and features of our proposed model. Table 2 reports the results of our ablation study.

Attention ablation
In ablations (1-3), we investigate the effectiveness of the self-attention layer. In (1), we remove the entire IAL layer, piping the context-query layer directly to the subsequent layer. In (2), we replace block-based self-attention with regular self-attention. Note that the batch size is kept extremely small (e.g., 2) to cope with the memory requirements. In (3), we remove the multiplicative and subtractive features in the IAL layer. Results show that replacing the block-based self-attention with regular self-attention hurts performance the most. However, this may be due to the requirement of reducing the batch size significantly. Removing the IAL layer entirely leads to a considerable drop, and removing the enhancement features also reduces performance considerably.

Curriculum ablation
In ablations (4-8), we investigate various settings pertaining to curriculum learning. In (4), we remove the pointer generator (PG) completely. Consequently, there is also no curriculum reading in this setting. Performance drops significantly in this setting, demonstrating that the pointer generator is essential to good performance. In (5-6), we remove one component from our curriculum reading mechanism. Results show that the answerability heuristic is more important than the understandability heuristic. In (7-8), we focus on non-curriculum approaches, training on the easy or hard set only. It is surprising that training on the hard set alone gives reasonably decent performance, comparable to the easy set. However, varying them in a curriculum setting has significant benefits.

RL ablation
In ablation (9), we investigated techniques that pass the BLEU score back as a reward for the model and train the model jointly using reinforcement learning. We follow the setting of (Paulus et al., 2017), using the mixed training objective and setting its mixing weight to 0.05. We investigated using BLEU-1, BLEU-4 and Rouge-L (and combinations of these) as a reward for our model along with varying rates. Table 2 reports the best result we obtained. We found that while RL does not significantly harm the performance of the model, there seems to be no significant benefit in using RL for generating answers, as opposed to other sequence transduction problems (Bahdanau et al., 2016; Paulus et al., 2017).
(Regarding ASR (Ours): our re-implementation performs much worse than (Kočiskỳ et al., 2018). We spent a good amount of time trying to reproduce the results of ASR from the original paper.)

Understandability ablation
From ablations (10-16), we study the effect of understandability and alternating paragraph sizes. We find that starting from a smaller paragraph size and moving upwards generally performs better, while moving in the reverse direction may have adverse effects on performance. This is made evident by ablations (10-11). We also note that a curriculum approach often beats a static approach.

Qualitative Error Analysis
Table 3 provides some examples of the output of our best model. First, we discuss some unfortunate problems with the evaluation in generation-based QA. In example (1), the model predicts a semantically correct answer but gets no credit due to a different form. In (2), no credit is given for word-level evaluation. In (3), the annotators provide a more general answer and therefore a highly specific answer (e.g., moscow) does not get any credit.
Second, we observe that our model is occasionally able to get the correct (exact match) answer. This is shown in examples (4) and (7). However, it is frequently unable to generate phrases that make sense, even though it seems like the model is trudging along in the right direction (e.g., "to wants to be a love of john" versus "because he wants her to have the baby" and "in the york school" versus "east harlem in new york"). In (9), we also note a partially correct answer, even though the model fails to realize that the question is about a male and generates "she is a naval".
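The word-overlap behaviour described above can be reproduced with a simplified Rouge-L computation based on the longest common subsequence (LCS). This is a sketch only: it uses a plain F1 combination and whitespace tokenization, which are assumptions and may differ from the official NarrativeQA evaluation script, and the reference answer "russia" in the usage note is a hypothetical example of a more general annotation.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        curr = [0]
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Simplified Rouge-L: F1 of LCS-based precision and recall."""
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```

Under this simplified metric, `rouge_l_f1("moscow", "russia")` is 0.0 even though the specific answer is compatible with the general one, while `rouge_l_f1("in the york school", "east harlem in new york")` is 4/9 (about 0.44), because only "in" and "york" are shared in order.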

Related Work
The existing work on open-domain QA (Chen et al., 2017) has distinct similarities with our problem, largely owing to the overwhelmingly large corpus that a machine reader has to reason over. In recent years, a multitude of techniques have been developed. (Wang et al., 2018) proposed reinforcement learning to select passages using the reader as the reward. (Min et al., 2018) proposed ranking the minimal context required to answer the question. (Clark and Gardner, 2017) proposed a shared-norm method for predicting spans in the multi-paragraph reading comprehension setting. (Lin et al., 2018) proposed ranking and de-noising techniques. (Wang et al., 2017a) proposed evidence aggregation based answer re-ranking. Most techniques focused on constructing a conducive and less noisy context for the neural reader. Our work provides the first evidence of diverse sampling for training neural reading comprehension models.
Our work draws inspiration from curriculum learning (CL) (Bengio et al., 2009). One key difficulty in CL is to determine which samples are easy or hard. Self-paced learning (Jiang et al., 2015) is a recently popular form of curriculum learning that treats this issue as an optimization problem. To this end, (Sachan and Xing, 2016) applies self-paced learning to neural question answering. Automatic curriculum learning (Graves et al., 2017), similarly, extracts signals from the learning process to infer progress.
State-of-the-art neural question answering models are mainly based on cross-sentence attention (Seo et al., 2016; Wang and Jiang, 2016b; Xiong et al., 2016; Tay et al., 2018c). Self-attention (Vaswani et al., 2017; Wang et al., 2017b) has also been popular for reading comprehension (Wang et al., 2018; Clark and Gardner, 2017). However, its memory complexity makes it a challenge for reading long contexts. Notably, the truncated/summary setting of the NarrativeQA benchmark has been attempted recently (Tay et al., 2018c,b; Hu et al., 2018; Tay et al., 2018a). However, this summary setting bypasses the difficulties of long-context reading comprehension, reverting to the more familiar RC setup.
While most of the prior work in this area has mainly focused on span prediction models (Wang and Jiang, 2016b) and/or multiple choice QA models (Wang and Jiang, 2016a), there has been recent interest in generation-based QA (Tan et al., 2017). S-NET (Tan et al., 2017) proposed a two-stage retrieve-then-generate framework.
Flexible neural mechanisms that learn to point and/or generate have also been popular across many NLP tasks. Our model incorporates Pointer-Generator networks (See et al., 2017), which learn to copy or generate new words within the context of neural summarization. Prior to Pointer-Generators, CopyNet (Gu et al., 2016) incorporated a copy mechanism for sequence-to-sequence learning. Pointer-Generators have also been recently adopted for learning a universal multi-task architecture for NLP (McCann et al., 2018).

Conclusion
We proposed curriculum learning based Pointer-Generator networks for reading long narratives. Our proposed IAL-CPG model achieves state-of-the-art performance on the challenging NarrativeQA benchmark. We show that sub-sampling diverse views of a story and training on them with a curriculum scheme is potentially more effective than techniques designed for open-domain question answering. We conduct extensive ablation studies and qualitative analysis, shedding light on the task at hand.

The size of the encoder layer is set to 128 and the decoder size is set to 256. The block size $b$ for the Introspective Alignment Layer is set to 200. We initialize our word embeddings with pretrained GloVe vectors (Pennington et al., 2014), which are not updated during training.

Table 1: Results on NarrativeQA reading comprehension dataset (Full story setting). Results are reported from (Kočiskỳ et al., 2018).

Table 2: Ablation results on NarrativeQA development set. (1-3) are architectural ablations. (4-8) are curriculum reading based ablations. (9) investigates RL-based generation. (10-16) explore the understandability/paragraph size heuristic. Note that (10) was the optimal scheme reported in the original setting. Moreover, more permutations were tested but only representative examples are reported due to lack of space.

Table 3: Qualitative analysis on NarrativeQA development set.