Ruminating Reader: Reasoning with Gated Multi-Hop Attention

To answer a question in a machine comprehension (MC) task, a model needs to establish the interaction between the question and the context. To tackle the problem that a single-pass model cannot reflect on and correct its answer, we present the Ruminating Reader. The Ruminating Reader adds a second pass of attention and a novel information fusion component to the Bi-Directional Attention Flow model (BiDAF). We propose novel layer structures that construct a query-aware context vector representation and fuse the encoding representation with the intermediate representation on top of the BiDAF model. We show that a multi-hop attention mechanism can be applied to a bi-directional attention structure. In experiments on SQuAD, we find that the Ruminating Reader outperforms the BiDAF baseline by a substantial margin, and matches or surpasses the performance of all other published systems.


Introduction
The majority of recorded human knowledge is circulated in unstructured natural language, and it would be tremendously valuable to enable machines to read and comprehend this textual knowledge. Machine comprehension (MC), especially in the form of question answering (QA), is therefore attracting a significant amount of attention from the machine learning community. Recently introduced large-scale datasets like CNN/Daily Mail (Hermann et al., 2015) and the Stanford Question Answering Dataset (SQuAD; Rajpurkar et al., 2016) have made data-driven methods, including deep learning, viable.
Recent approaches to machine comprehension using neural networks can be viewed as falling into two broad categories: single-pass reasoners and multi-pass reasoners. Single-pass models read a question and a source text once, and often adopt a differentiable attention mechanism that emphasizes the parts of the context related to the question.
BiDAF (Seo et al., 2017) represents one of the state-of-the-art single-pass models in machine comprehension. BiDAF uses a bi-directional attention matrix which calculates the correlations between each word pair in the context and query to build a query-aware context representation. However, BiDAF and similar models miss some questions because they lack the capacity to reflect on problematic candidate answers and revise their decisions.
When humans are reading a text with the goal of answering a question, they tend to read it multiple times to get a better understanding of the context and question, and to give a better response.
With this intuition, recent multi-pass models revisit the question and the context passage (or ruminate) to infer the relations between the context, the question and the answer.
We propose an extension of BiDAF, called the Ruminating Reader, which uses a second pass of reading and reasoning to allow it to learn to avoid mistakes and to ensure that it is able to effectively use the full context when selecting an answer. In addition to adding a second pass, we also introduce two novel layer types, the ruminate layers, which use gating mechanisms to fuse the representations obtained from the first and second passes. We observe a surprising phenomenon: when an LSTM layer in the context ruminate layer takes the same input at each timestep, it can still produce a useful representation for the gates. In addition, we introduce an answer-question similarity loss to penalize overlap between the question and the predicted answer, a common feature in the errors of our base model. This allows us to achieve an F1 score of 79.5 and an Exact Match (EM) score of 70.6 on the hidden test set, an improvement of 2.2 F1 and 2.9 EM over BiDAF. Figure 1 shows a high-level comparison between BiDAF and the Ruminating Reader.

This paper is organized as follows: In Section 2 we define the problem to be solved and introduce the SQuAD task. In Section 3 we introduce the Ruminating Reader, focusing on the information-extracting and information-digesting components and how they integrate. Section 4 discusses related work. Section 5 presents the experimental setting, results and analysis. Section 6 concludes.

Question Answering
The task of the Ruminating Reader is to answer a question by reading and understanding a paragraph of text and selecting a span of words within the context. Formally, the training and development data consist of tuples (Q, C, A), where Q = (q_1, ..., q_i, ..., q_|Q|) is the question, a sequence of words with length |Q|, C = (c_1, ..., c_j, ..., c_|C|) is the context, a sequence of words with length |C|, and A = (a_b, a_e) is the answer span marking the beginning and end indices of the answer in the context (1 ≤ a_b ≤ a_e ≤ |C|).
SQuAD The SQuAD corpus is built using 536 articles randomly selected from English Wikipedia. Images, figures and tables are stripped, and any paragraphs shorter than 500 characters are discarded. Unlike datasets such as CNN/Daily Mail, whose questions are synthesized, Rajpurkar et al. (2016) use a crowdsourcing platform to generate realistic question and answer pairs. SQuAD contains 107,785 question-answer pairs. The typical context length spans from 50 to 250 tokens, and the typical question length is around 10 tokens. The answer can be any span of words from the context, resulting in O(|C|^2) possible outputs.

Ruminating Reader
In this section, we review the BiDAF model (Seo et al., 2017) and introduce our extension, the Ruminating Reader.
Our additions to the base model are motivated by the intuition that adding an additional pass of reading will allow the model to better integrate information from the question and answer and to better weigh possible answers, and that by interpolating the results of the second pass with those of the first pass through gating, we can prevent the additional complexity that we add to the model from substantially increasing the difficulty of training. The structure of our model is shown in Figure 2 and explained in the following sections.
Character Embedding Layer Just as in the base BiDAF model, the character embedding layer maps each word to a high dimensional vector using character features. It does so using a convolutional neural network with max pooling over learned character vectors (Lee et al., 2017; Kim et al., 2016). Thus we have a context character representation M ∈ R^{f×C} and a query representation N ∈ R^{f×Q}, where C is the sequence length of the context, Q is the sequence length of the query, and f is the number of 1D convolutional neural network filters.
Word Embedding Layer Again as in the base model, the word embedding layer uses pretrained word vectors (the 6B GloVe vectors of Pennington et al., 2014) to map each word into a high dimensional vector space. We do not update the word embeddings during training. The character embedding and the word embedding are concatenated and passed into a two-layer highway network (Srivastava et al., 2015) to obtain a d-dimensional vector representation of each word. Hence, we have a context representation H ∈ R^{d×C} and a query representation U ∈ R^{d×Q}.
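The highway combination can be sketched in numpy as follows; the ReLU nonlinearity and the parameter names here are illustrative assumptions (the highway formulation allows an arbitrary nonlinearity), not details fixed by the model:

```python
import numpy as np

def highway_layer(x, Wh, bh, Wt, bt):
    """One highway layer: y = t * H(x) + (1 - t) * x, with transform
    gate t = sigmoid(Wt @ x + bt) and H(x) = relu(Wh @ x + bh)."""
    h = np.maximum(0.0, Wh @ x + bh)            # nonlinear transform
    t = 1.0 / (1.0 + np.exp(-(Wt @ x + bt)))    # transform gate in (0, 1)
    return t * h + (1.0 - t) * x                # carry the rest of x through

def highway_network(x, layers):
    """Stack highway layers; a two-layer stack matches the model above."""
    for (Wh, bh, Wt, bt) in layers:
        x = highway_layer(x, Wh, bh, Wt, bt)
    return x
```

With the transform gate near zero the layer passes its input through unchanged, which is what makes deep stacks of such layers easy to train.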
Sequence Encoding Layers As in BiDAF, we use two LSTM RNNs (Hochreiter and Schmidhuber, 1997) with d-dimensional outputs to encode the context and query representations in both directions. Therefore, we obtain a context encoding matrix C ∈ R^{2d×C} and a query encoding matrix Q ∈ R^{2d×Q}.
Attention Flow Layer As in BiDAF, the attention flow layer constructs a query-aware context representation G from inputs C and Q. This layer takes two steps. In the first step, an interaction matrix I ∈ R^{C×Q} is computed, which indicates the affinities between each context word encoding and each query word encoding. I_{cq} indicates the correlation between the c-th word in the context and the q-th word in the query. The interaction matrix is computed by

I_{cq} = w_I^T [C_c; Q_q; C_c • Q_q]

where w_I ∈ R^{6d} is a trainable parameter, C_c is the c-th column of the context encoding, Q_q is the q-th column of the query encoding, • is elementwise multiplication, and [;] is vector concatenation.
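A minimal numpy sketch of the interaction matrix, assuming the trilinear form above (the explicit double loop is for clarity, not efficiency):

```python
import numpy as np

def interaction_matrix(C, Q, w_I):
    """Trilinear similarity I[c, q] = w_I^T [C_c; Q_q; C_c * Q_q].

    C: (2d, Tc) context encoding, Q: (2d, Tq) query encoding,
    w_I: (6d,) weight vector (trainable in the model, fixed here).
    """
    _, Tc = C.shape
    _, Tq = Q.shape
    I = np.empty((Tc, Tq))
    for c in range(Tc):
        for q in range(Tq):
            # Concatenate the two encodings and their elementwise product.
            feats = np.concatenate([C[:, c], Q[:, q], C[:, c] * Q[:, q]])
            I[c, q] = w_I @ feats
    return I
```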
Context-to-query Attention As in BiDAF, the context-to-query attention component generates, for each context word, an attention-weighted sum of query word encodings. Let Q̃ ∈ R^{2d×C} represent the context-to-query attention matrix. Column c of Q̃ is defined by Q̃_c = Σ_q a_{cq} Q_q, where a is the attention weight, computed by a_c = softmax(I_{c:}) ∈ R^Q.

Query-to-context Attention Query-to-context attention indicates the context words most relevant to the query. The most relevant word vector representation is an attention-weighted sum defined by c̃ = Σ_c b_c C_c, where b is an attention weight calculated by b = softmax(max_col(I)) ∈ R^C. c̃ is replicated C times across the columns, giving C̃ ∈ R^{2d×C}.
We then obtain the final query-aware context representation by

G_c = [C_c; Q̃_c; C_c • Q̃_c; C_c • C̃_c]

where G_c ∈ R^{8d}, giving G ∈ R^{8d×C}.
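Putting the two attention directions together, the whole attention flow step can be sketched in numpy, assuming the softmax normalizations described above:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_flow(C, Q, I):
    """Build the query-aware context representation G (8d x Tc).

    C: (2d, Tc) context encoding, Q: (2d, Tq) query encoding,
    I: (Tc, Tq) interaction matrix.
    """
    # Context-to-query: each context word attends over the query words.
    a = softmax(I, axis=1)                   # (Tc, Tq), rows sum to 1
    Q_tilde = Q @ a.T                        # (2d, Tc)
    # Query-to-context: attend over context words by max interaction.
    b = softmax(I.max(axis=1))               # (Tc,)
    c_tilde = C @ b                          # (2d,)
    C_tilde = np.tile(c_tilde[:, None], (1, C.shape[1]))  # (2d, Tc)
    # G_c = [C_c; Q~_c; C_c * Q~_c; C_c * C~_c]
    return np.concatenate([C, Q_tilde, C * Q_tilde, C * C_tilde], axis=0)
```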

Summarization Layer
We propose a summarization layer, which produces a vector representation summarizing the information in the query-aware context representation. The input to the summarization layer is G. We use one bi-directional LSTM network to model the learned information.
We select the final states from both directions and concatenate them as s = [s_f; s_b], where s ∈ R^{2d} represents the summary of the reading of the context and query, s_f is the final state of the forward LSTM, and s_b is the final state of the backward LSTM.
Query Ruminate Layer The query ruminate layer fuses the summarization vector representation with the query encoding Q, helping reformulate the query representation in order to maximize the chance of retrieving the correct answer. The input to this layer is s tiled Q times (S^Q ∈ R^{2d×Q}). A gating function then fuses this with the existing query encoding:

z_i = tanh(W_z^Q S_i^Q + U_z^Q Q_i + b_z^Q)
f_i = σ(W_f^Q S_i^Q + U_f^Q Q_i + b_f^Q)
Q̃_i = f_i • z_i + (1 − f_i) • Q_i

where W_z^Q, U_z^Q, W_f^Q, U_f^Q ∈ R^{2d×2d} and b_z^Q, b_f^Q ∈ R^{2d} are trainable parameters.

Context Ruminate Layer The context ruminate layer digests the summarization and integrates it with the context encoding C to facilitate answer extraction. In this layer, we tile s C times, giving S^C ∈ R^{2d×C}. To incorporate positional information into this relatively long tiled sequence, we feed it into an additional bidirectional LSTM with output size d in each direction. This approach, while somewhat inefficient, proves to be a valuable addition to the model and allows it to better track position information, loosely following the positional encoding strategy of Sukhbaatar et al. (2015). Hence we obtain S̃^C ∈ R^{2d×C}, which is fused with the context encoding C via a gate:

z_i = tanh(W_z^C S̃_i^C + U_z^C C_i + b_z^C)
f_i = σ(W_f^C S̃_i^C + U_f^C C_i + b_f^C)
C̃_i = f_i • z_i + (1 − f_i) • C_i

Context: The Broncos took an early lead in Super Bowl 50 and never trailed. Newton was limited by Denver's defense, which sacked him seven times and forced him into three turnovers, including a fumble which they recovered for a touchdown. Denver linebacker Von Miller was named Super Bowl MVP, recording five solo tackles, 2 sacks, and two forced fumbles.
Question: Which Newton turnover resulted in seven points for Denver?
Ground Truth: {a fumble, a fumble, Fumble}
Prediction: three turnovers

Table 1: An error of the type that motivated the answer-question similarity loss.
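The gating fusion used by both ruminate layers can be sketched as follows; the tanh candidate plus sigmoid interpolation gate shown here is one standard parameterization, and the variable names are ours:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ruminate_gate(S, E, Wz, Uz, bz, Wf, Uf, bf):
    """Fuse a tiled summary S with an encoding E (both 2d x T) via a gate.

    Z is a candidate update built from the summary and the encoding;
    F interpolates between the candidate and the original encoding.
    """
    Z = np.tanh(Wz @ S + Uz @ E + bz[:, None])   # candidate update
    F = sigmoid(Wf @ S + Uf @ E + bf[:, None])   # gate values in (0, 1)
    return F * Z + (1.0 - F) * E                 # gated interpolation
```

When the gate F is near zero, the layer simply preserves the original encoding, which matches the analysis in Section 5 showing that the gates mostly lean on the encoding.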

Second Hop Attention Flow Layer
We take Q̃ ∈ R^{2d×Q} and C̃ ∈ R^{2d×C} as the input to another attention flow layer with the same structure as described above, yielding G^{(2)} ∈ R^{8d×C}.
Modeling Layer We use two layers of bidirectional LSTM with output size d in each direction to aggregate the information in G^{(2)}, yielding a pre-output matrix M^s ∈ R^{2d×C}.
Output Layer As in BiDAF, our output layer independently models the probability of each word being selected as the start or end of an answer span. We calculate the probability distribution of the start index of the answer span by

p^s = softmax(w_{p^s}^T [G^{(2)}; M^s])

where w_{p^s} ∈ R^{10d} is a trainable parameter. We pass the matrix M^s to another bi-directional LSTM with output size d in each direction, yielding M^e ∈ R^{2d×C}. We obtain the probability distribution of the end index of the answer span by

p^e = softmax(w_{p^e}^T [G^{(2)}; M^e])

Training Loss We define the training loss as the sum of three components: negative log likelihood loss, L2 regularization loss, and a novel answer-question similarity loss.
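A numpy sketch of one span-boundary distribution, assuming the softmax-over-positions form above (the end distribution is analogous, with M^e in place of M^s):

```python
import numpy as np

def span_distribution(G2, M, w):
    """p = softmax(w^T [G2; M]) over context positions.

    G2: (8d, Tc) second-hop representation, M: (2d, Tc) modeling
    output, w: (10d,) weight vector (trainable in the model).
    """
    feats = np.concatenate([G2, M], axis=0)   # (10d, Tc)
    logits = w @ feats                        # one score per position
    e = np.exp(logits - logits.max())         # stable softmax
    return e / e.sum()
```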

Answer-Question Similarity Loss
We observe that a version of our model trained only on the two standard loss terms often selects answers that overlap substantially in content with their corresponding questions, and that this nearly always results in an error. A sample error of this kind is shown in Table 1. This motivates an additional loss term at training time: we penalize the similarity between the question and the selected answer. Formally, the answer-question similarity loss is defined as

AQSL = cos(C_s, q_BoW) + cos(C_e, q_BoW)

where s refers to the start index of the answer span, e refers to the end index of the answer span, q_BoW is the bag-of-words representation of the query encoding, cos(a, b) is the cosine similarity between a and b, and C_s and C_e are the s-th and e-th vector representations of the context encoding.
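A sketch of this loss term, assuming the two cosine terms are simply summed:

```python
import numpy as np

def cos(a, b):
    # Cosine similarity with a small epsilon against zero vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def answer_question_similarity_loss(C, q_bow, s, e):
    """Penalize span endpoints whose encodings resemble the query.

    C: (2d, Tc) context encoding, q_bow: (2d,) bag-of-words query
    representation, s/e: predicted start and end indices.
    """
    return cos(C[:, s], q_bow) + cos(C[:, e], q_bow)
```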
Prediction During prediction, we use a local search strategy: for token indices a and a′, we maximize p^s_a × p^e_{a′}, where 0 ≤ a′ − a ≤ 15. Dynamic programming is applied during the search, resulting in O(C) time complexity.
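The constrained span search can be sketched as follows; because the window is capped at 15 tokens, the simple per-position scan shown here is effectively linear in the context length (this sketch omits the dynamic-programming bookkeeping):

```python
import numpy as np

def best_span(p_start, p_end, max_len=15):
    """Maximize p_start[a] * p_end[a2] subject to 0 <= a2 - a <= max_len.

    Returns ((a, a2), score) for the best-scoring admissible span.
    """
    best, best_score = (0, 0), -1.0
    for a2 in range(len(p_end)):
        lo = max(0, a2 - max_len)                 # earliest admissible start
        a = lo + int(np.argmax(p_start[lo:a2 + 1]))
        score = p_start[a] * p_end[a2]
        if score > best_score:
            best_score, best = score, (a, a2)
    return best, best_score
```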

Related Work
Recently, both QA and Cloze-style machine comprehension tasks like CNN/Daily Mail have seen fast progress. Much of this recent work has been based on end-to-end trained neural network models, and within that, most have used recurrent neural networks with soft attention (Bahdanau et al., 2015), which emphasizes one part of the data over the others. These models can be coarsely divided into two categories: single-pass and multi-pass reasoners.
Most papers on single-pass reasoning systems propose novel ways to use the attention mechanism. Wang and Jiang (2016) propose match-LSTM to model the interaction between context and query, and introduce the use of a pointer network (Vinyals et al., 2015) to extract the answer span from the context. Xiong et al. (2017) propose the Dynamic Coattention Network, which uses co-dependent representations of the question and the context, and iteratively updates the start and end indices to recover from local maxima and to find the optimal answer span. The Multi-Perspective Context Matching model matches the encoded context with the query by combining various matching strategies, aggregates the matching vectors with a bidirectional LSTM, and predicts start and end positions. In order to merge an entity's scores across its multiple appearances, Kadlec et al. (2016) propose the attention-sum reader, which computes the dot product between the context and query encodings, applies a softmax over the context, and sums the probabilities over the same entity to favor frequent entities over rare ones. Chen et al. (2016) propose using a bilinear term to calculate the attentional alignment between context and query.
Among multi-hop reasoning systems: Hill et al. (2015) apply attention over window-based memory by extending the multi-hop end-to-end memory network (Sukhbaatar et al., 2015). Dhingra et al. (2016) extend the attention-sum reader to multi-turn reasoning with an added gating mechanism. The Iterative Alternating (IA) reader (Sordoni et al., 2016) produces a query glimpse and a document glimpse at each iteration and uses both glimpses to update the recurrent state. Shen et al. (2017) propose a multi-hop attention model that uses reinforcement learning to dynamically determine when to stop digesting intermediate information and produce an answer.

Implementation details
Our model configuration closely follows that of Seo et al. (2017): in the character encoding layer, we use 100 filters of width 5. In the remainder of the model, we set the hidden layer dimension (d) to 100. We use pretrained 100D GloVe vectors (the 6B-token version) as word embeddings. Out-of-vocabulary tokens are represented by an UNK symbol in the word embedding layer, but treated normally by the character embedding layer. The BiLSTMs in the context and query encoding layers share the same weights. We use the AdaDelta optimizer (Zeiler, 2012) for optimization.
We selected hyperparameter values through random search (Bergstra and Bengio, 2012). The batch size is 30. The learning rate starts at 0.5 and decreases to 0.2 once the model stops improving. The L2-regularization weight is 1e-4, the AQSL weight is 1, and dropout with a drop rate of 0.2 is applied. A typical model run converges in about 40k steps. This takes two days using TensorFlow (Abadi et al., 2015) and a single NVIDIA K80 GPU.

Rajpurkar et al. (2016) provide an official evaluation script that measures F1 and EM scores by comparing predictions with the ground truth answers. Three answers are provided for each question. The prediction is compared to each of the answers and the best score is selected. The F1 score is defined by the recall and precision of words, and the EM (Exact Match) score gives credit only to predictions that match a reference answer exactly. We do not use any kind of ensembling, and compare our results primarily with other single-model (non-ensemble) results. The test set performance is evaluated on CodaLab by the administrators.
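The two metrics can be sketched as below; note that the official script also normalizes answers (lowercasing, stripping punctuation and articles), which this sketch omits:

```python
from collections import Counter

def f1_score(prediction, ground_truth):
    """Token-level F1 between a predicted and a reference answer string."""
    pred_tokens = prediction.split()
    gt_tokens = ground_truth.split()
    common = Counter(pred_tokens) & Counter(gt_tokens)  # multiset overlap
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)

def exact_match(prediction, ground_truth):
    return float(prediction == ground_truth)

def metric_over_answers(metric, prediction, answers):
    """Score against each reference answer and keep the best."""
    return max(metric(prediction, a) for a in answers)
```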

Results
At the time of submission, our model is tied in accuracy on the hidden test set with the best-performing published single model (Zhang et al., 2017). We achieve an F1 score of 79.5 and an EM score of 70.6. The current leaderboard is displayed in Table 2. The leaderboard is listed in descending order of F1 score, but if an entry's F1 score is better than the adjacent entry's while its EM score is worse, the two entries are considered tied.

Analysis
Layer Ablation Analysis To analyze how each component contributes to the model, we run a layer ablation experiment. We present results for twelve versions of the model on the development set, each missing some or all of the major components of the full Ruminating Reader. The precise definition of each of the twelve ablated models can be found in Appendix A.1.

The results of the ablation experiment are shown in Table 3. Experiments 3 and 4 show that the two ruminate layers are both important and contribute to performance. It is worth noting that the BiLSTM in the context ruminate layer contributes substantially to model performance. We find this somewhat surprising, since it takes the same input at each timestep, but it nonetheless successfully digests the summarization representation and produces a useful input for the gating component. Experiments 7 and 8 show that the modeled summarization vector representation can provide information to the gates reasonably well. The drop in performance in experiments 9 and 10 shows that the key information for the new query and context representations is the first-stage query and context encodings. Experiments 11 and 12 show that the summarization vector representation does help the later stage of reasoning.

Figure 3: The visualization of the first hop (top) and second hop (bottom) attention interaction matrix. We use the coolwarm colormap, where red is close to 1 and blue is close to 0. In the question "What is the name of the trophy given to anyone who plays on the winning team in a super Bowl?", the key words name, trophy, given, who are strongly attended to in the first hop.

Figure 4: The order of words is the same as in Figure 3. We use the coolwarm colormap, where red means the gate uses more information from the intermediate representation, and blue from the encoding representation.
Visualization Figure 3 provides a visualization of the first hop and second hop attention interaction matrix I. We also provide a sample visualization of the L2 sum of the gate values in the context and query ruminate layers in Figure 4.
From Figure 3 we see that though the structures of the two hops of the attention flow layer are the same, they function quite differently in typical cases. The first hop attention appears to be primarily concerned with identifying the key informative word (or words, as here) in the query. Though four key words are signified in Figure 3, in the common case only one or two words are attended to in the first hop. The second hop is then responsible for finding candidate answers that are relevant to those key words and generating a query-aware context representation. We observe that the first hop attention shows a consistent attention pattern across context words, suggesting that there may be room to make the first hop component more efficient in future work.
From Figure 4, we see that the gate values in both the query ruminate layer and the context ruminate layer show the gates working to fuse information into the original query and context encodings. We observe that in most cases the gates in the ruminate layers use more information from the encoding than from the summarization representation. This observation matches our expectation that the gates modify and improve on the encoding representation.
We also provide a comparison of F1 scores between BiDAF and the Ruminating Reader on questions with different ground truth answer lengths and different question types in Figure 5. Exact match score is highly correlated with F1 score, so we omit it for clarity. We observe that the Ruminating Reader outperforms BiDAF on most questions across different answer lengths. On questions with long answers (lengths 5, 8 and 9), the Ruminating Reader outperforms BiDAF by a great margin. Questions with longer reference answers appear to be more difficult to answer. In addition, the Ruminating Reader does better on each type of question. Both models work best on when questions; these questions are answerable by temporal expressions, which are relatively easy to recognize. The why questions are the hardest to answer: they tend to have long answers with no purely lexical cues marking their beginnings or ends. The Ruminating Reader outperforms BiDAF on why questions by a substantial margin.

Performance Breakdown Following Zhang et al. (2017), we break down the Ruminating Reader's 79.5% F1 score on the development set into three sub-scores, representing failures, partial successes, and successes. On 13.5% of development set examples, the Ruminating Reader fails, yielding 0% F1. On 70.6% of examples, it achieves a perfect F1 score. On the remaining 15.9%, it obtains only partial matches (i.e., answers that partially overlap with reference answers), with an average F1 score of 56.0%. Compared to jNet (Zhang et al., 2017), whose successes make up 69.1% of all answers, failures 14.9%, and partial successes 16.01% with an average F1 score of 58.0%, our model does better at increasing successes and reducing failures.

Conclusion
We propose the Ruminating Reader, an extension of the BiDAF model with two-hop attention. The model surpasses the original BiDAF model's performance on the Stanford Question Answering Dataset (SQuAD) by a large margin, and ties with the best published system. These results and our qualitative analysis both suggest that the model successfully fuses the information from two passes of reading using gating, and uses the result to identify appropriate answers to Wikipedia questions. An ablation experiment shows that each component of this complex model contributes substantially. In future work, we aim to find ways to simplify this model without impacting performance, to explore the possibility of yet deeper models, and to expand our study to machine comprehension tasks more broadly.