Cut to the Chase: A Context Zoom-in Network for Reading Comprehension

In recent years many deep neural networks have been proposed to solve Reading Comprehension (RC) tasks. Most of these models suffer from reasoning over long documents and do not trivially generalize to cases where the answer is not present as a span in a given document. We present a novel neural-based architecture that is capable of extracting relevant regions based on a given question-document pair and generating a well-formed answer. To show the effectiveness of our architecture, we conducted several experiments on the recently proposed and challenging RC dataset ‘NarrativeQA’. The proposed architecture outperforms state-of-the-art results by 12.62% (ROUGE-L) relative improvement.


Introduction
Building Artificial Intelligence (AI) algorithms to teach machines to read and to comprehend text is a long-standing challenge in Natural Language Processing (NLP). A common strategy for assessing these AI algorithms is by treating them as RC tasks. This can be formulated as finding an answer to a question given the document(s) as evidence. Recently, many deep-learning based models (Seo et al., 2017;Xiong et al., 2017;Wang et al., 2017;Shen et al., 2017;Clark and Gardner, 2017) have been proposed to solve RC tasks based on the SQuAD (Rajpurkar et al., 2016) and Trivi-aQA (Joshi et al., 2017) datasets, reaching human level performance. A common approach in these models is to score and/or extract candidate spans conditioned on a given question-document pair.
Most of these models have limited applicability to real problems for the following reasons. They do not generalize well to scenarios where the answer is not present as a span, or where several discontinuous parts of the document are required to * To whom correspondence should be addressed. form the answer. In addition, unlike humans, they can not easily skip through irrelevant parts to comprehend long documents (Masson, 1983).
To address the issues above we develop a novel context zoom-in network (ConZNet) for RC tasks, which can skip through irrelevant parts of a document and generate an answer using only the relevant regions of text. The ConZNet architecture consists of two phases. In the first phase we identify the relevant regions of text by employing a reinforcement learning algorithm. These relevant regions are not only useful to generate the answer, but can also be presented to the user as supporting information along with the answer. The second phase is based on an encoder-decoder architecture, which comprehends the identified regions of text and generates the answer by using a residual self-attention network as encoder and a RNNbased sequence generator along with a pointer network (Vinyals et al., 2015) as the decoder. It has the ability to generate better well-formed answers not verbatim present in the document than span prediction models.
Recently, there have been several attempts to adopt condensing documents in RC tasks. Wang et al. (2018) retrieve a relevant paragraph based on the question and predict the answer span.  select sentence(s) to make a summary of the entire document with a feed-forward network and generate an answer based on the summary. Unlike existing approaches, our method has the ability to select relevant regions of text not just based on the question but also on how well regions are related to each other. Moreover, our decoder combines span prediction and sequence generation. This allows the decoder to copy words from the relevant regions of text as well as to generate words from a fixed vocabulary. We evaluate our model using one of the challenging RC datasets, called 'NarrativeQA', which  was released recently by Kočiskỳ et al. (2017). Experimental results show the usefulness of our framework for RC tasks and we outperform stateof-the-art results on this dataset.

Proposed Architecture
An overview of our architecture is shown in Figure  1, which consists of two phases. First, the identification of relevant regions of text is computed by the Co-attention and Context Zoom layers as explained in Sections 2.1 and 2.2. Second, the comprehension of identified regions of text and output generation is computed by Answer Generation block as explained in Section 2.3.

Co-attention layer
The words in the document, question and answer are represented using pre-trained word embeddings (Pennington et al., 2014). These wordbased embeddings are concatenated with their corresponding char embeddings. The char embeddings are learned by feeding all the characters of a word into a Convolutional Neural Network (CNN) (Kim, 2014). We further encode the document and question embeddings using a shared bi-directional GRU (Cho et al., 2014) to get context-aware representations.
We compute the co-attention between document and question to get question-aware representations for the document by using tri-linear attention as proposed by Seo et al. (2017). Let d i be the vector representation for the document word i, q j be the vector for the question word j, and l d and l q be the lengths of the document and question respectively. The tri-linear attention is calculated as where w d , w q , and w dq are learnable parameters and denotes the element-wise multiplication. We compute the attended document wordd i by first computing λ i = sof tmax(a i: ) and followed byd i = lq j=1 λ ij q j . Similarly, we compute a question to document attention vectorq by first computing b = sof tmax(max(a i: )) and followed byq to yield a query-aware contextual representation for each word in the document.

Context Zoom Layer
This layer finds relevant regions of text. We use reinforcement learning to do that, with the goal of improving answer generation accuracy -see Section 2.4.
The Split Context operation splits the attended document vectors into sentences or fixed size chunks (useful when sentence tokenization is not available for a particular language). This results in n text regions with each having length l k , where l d = n k=1 l k . We then get the representations, denoted as z k , for each text region by running a BiGRU and concatenating the last states of the forward and backward GRUs.
The text region representations, z k , encode how well they are related to the question, and their surrounding context. Generating an answer may depend on multiple regions, and it is important for each text region to collect cues from other regions which are outside of their surroundings. We can compute this by using a Self-Attention layer. It is a special case of co-attention where both operands (d i and q j ) are the text fragment itself, computed by setting a ij = −∞ when i = j in Eq. 1.
These further self-attended text region representations,z k , are passed through a linear layer with tanh activation and softmax layer as follows: where ψ is the probability distribution of text regions, which is the evidence used to generate the answer. The policy of the reinforcement learner is defined as π(r|u; θ z ) = ψ r , where ψ r is the probability of a text region r (agent's action) being selected, u is the environment state as defined in Eq. 2, and θ z are the learnable parameters. During the training time we sample text regions using ψ, in inference time we follow greedy evaluation by selecting most probable region(s).

Answer Generation
This component is implemented based on the encoder-decoder architecture of . The selected text regions from the Context Zoom layer are given as input to the encoder, where its output is given to the decoder in order to generate the answer. The encoder block uses residual connected selfattention layer followed by a BiGRU. The selected relevant text regions (∈ ψ r ) are first passed through a separate BiGRU, then we apply a selfattention mechanism similar to the Context Zoom layer followed by a linear layer with ReLU activations. The encoder's output consists of representations of the relevant text regions, denoted by e i .
The decoder block is based on an attention mechanism (Bahdanau et al., 2015) and a copy mechanism by using a pointer network similar to (See et al., 2017). This allows the decoder to predict words from the relevant regions as well as from the fixed vocabulary. At time step t, the decoder predicts the next word in the answer using the attention distribution, context vector and current word embedding. The attention distribution and context vector are obtained as follows: where h t is hidden state of the decoder, v, W e , W h , b o are learnable parameters. The γ t represents a probability distribution over words of relevant regions e i . The context vector is given by c t = i γ t i e i . The probability distribution to predict word w t from the fixed vocabulary (P f v ) is computed by passing state h t and context vector c t to a linear layer followed by a softmax function denoted as To allow decoder to copy words from the encoder sequence, we compute a soft gate (P copy ), which helps the decoder to generate a word by sampling from the fixed vocabulary or by copying from a selected text regions (ψ r ). The soft gate is calculated as where x t is current word embedding, h t is hidden state of the decoder, c t is the context vector, and w p , v h , w x , and b c are learnable parameters. We maintain a list of out-of-vocabulary (OOV) words for each document. The fixed vocabulary along with this OOV list acts as an extended vocabulary for each document. The final probability distribution (unnormalized) over this extended vocabulary (P ev ) is given by

Training
We jointly estimate the parameters of our model coming from the Co-attention, Context Zoom, and Answer Generation layers, which are denoted as θ a , θ z , and θ g respectively. Estimating θ a and θ g is straight-forward by using the cross-entropy objective J 1 ({θ a , θ g }) and the backpropagation algorithm. However, selecting text regions in the Context Zoom layer makes it difficult to estimate θ z given their discrete nature. We therefore formulate the estimation of θ z as a reinforcement learning problem via a policy gradient method. Specifically, we design a reward function over θ z . We use mean F-score of ROUGE-1, ROUGE-2, and ROUGE-L (Lin and Hovy, 2003) as our reward function R. The objective function to maximize is the expected reward under the probability distribution of current text regions ψ r , i.e., J 2 (θ z ) = E p(r|θz) [R]. We approximate the gradient ∇ θz J 2 (θ z ) by following the REINFORCE (Williams, 1992) algorithm. To reduce the high variance in estimating ∇ θz J 2 (θ z ) one widely used mechanism is to subtract a baseline value from the reward. It is shown that any number will reduce the variance (Williams, 1992;Zaremba and Sutskever, 2015), here we used the mean of the mini-batch reward b as our baseline. The final objective is to minimize the following equation: where, B is the size of mini-batch, and R i is the reward of example i ∈ B. J(θ) is now fully differentiable and we use backpropagation to estimate θ.

Dataset
The NarrativeQA dataset (Kočiskỳ et al., 2017) consists of fictional stories gathered from books and movie scripts, where corresponding summaries and question-answer pairs are generated with the help of human experts and Wikipedia articles. The summaries in NarrativeQA are 4-5 times longer than documents in the SQuAD dataset. Moreover, answers are well-formed by human experts and are not verbatim in the story, thus making this dataset ideal for testing our model. The statistics of NarrativeQA are available in Table 1 1 .

Baselines
We compare our model against reported models in Kočiskỳ et al. (2017) (Seq2Seq, ASR, BiDAF) and the Multi-range Reasoning Unit (MRU) in Tay et al. (2018). We implemented two baseline models (Baseline 1, Baseline 2) with Context Zoom layer similar to Wang et al. (2018). In both baselines we replace the span prediction layer with an answer generation layer. In Baseline 1 we use an 1 please refer Kočiskỳ et al. (2017) for more details attention based seq2seq layer without using copy mechanism in the answer generation unit similar to . In Baseline 2 the answer generation unit is similar to our ConZNet architecture.

Implementation Details
We split each document into sentences using the sentence tokenizer of the NLTK toolkit (Bird and Loper, 2004). Similarly, we further tokenize each sentence, corresponding question and answer using the word tokenizer of NLTK. The model is implemented using Python and Tensorflow (Abadi et al., 2015). All the weights of the model are initialized by Glorot Initialization (Glorot et al., 2011) and biases are initialized with zeros. We use a 300 dimensional word vectors from GloVe (Pennington et al., 2014) (with 840 billion pre-trained vectors) to initialize the word embeddings, which we kept constant during training. All the words that do not appear in Glove are initialized by sampling from a uniform random distribution between [-0.05, 0.05]. We apply dropout (Srivastava et al., 2014) between the layers with keep probability of 0.8 (i.e dropout=0.2). The number of hidden units are set to 100. We trained our model with the AdaDelta (Zeiler, 2012) optimizer for 50 epochs, an initial learning rate of 0.1, and a minibatch size of 32. The hyperparameter 'sample size' (number of relevant sentences) is chosen based on the model performance on the devset. Table 2 shows the performance of various models on NarrativeQA. It can be noted that our model with sample size 5 (choosing 5 relevant sentences) outperforms the best ROUGE-L score available so far by 12.62% compared to Tay et al. (2018). The low performance of Baseline 1 shows that the hybrid approach (ConZNet) for generating words from a fixed vocabulary as well as copying words from the document is better suited than span prediction models (Seq2Seq, ASR, BiDAF, MRU).

Results
To validate the importance of finding relevant sentences in contrast to using an entire document for answer generation, we experimented with sample sizes beyond 5. The performance of our model gradually dropped from sample size 7 onwards. This result shows evidence that only a few relevant sentences are sufficient to answer a question.
We also experimented with various sample sizes to see the effect of intra sentence relations for an-  swer generation. The performance of the model improved dramatically with sample sizes 3 and 5 compared to the sample size of 1. These results show that the importance of selecting multiple relevant sentences for generating an answer. In addition, the low performance of Baseline 2 indicates that just selecting multiple sentences is not enough, they should also be related to each other. This result points out that the self-attention mechanism in the Context zoom layer is an important component to identify related relevant sentences.

Conclusion
We have proposed a new neural-based architecture which condenses an original document to facilitate fast comprehension in order to generate better well-formed answers than span based prediction models. Our model achieved the best performance on the challenging NarrativeQA dataset. Future work can focus for example on designing an inexpensive preprocess layer, and other strategies for improved performance on answer generation.