Multi-Granularity Hierarchical Attention Fusion Networks for Reading Comprehension and Question Answering

This paper describes a novel hierarchical attention network for reading comprehension style question answering, which aims to answer questions about a given narrative paragraph. In the proposed method, attention and fusion are conducted horizontally and vertically across layers at different levels of granularity between question and paragraph. Specifically, it first encodes the question and paragraph with fine-grained language embeddings, to better capture their respective representations at the semantic level. It then proposes a multi-granularity fusion approach to fully fuse information from both global and attended representations. Finally, it introduces a hierarchical attention network that focuses on the answer span progressively with multi-level soft alignment. Extensive experiments on the large-scale SQuAD and TriviaQA datasets validate the effectiveness of the proposed method. At the time of writing, our model achieves state-of-the-art results on both the SQuAD and TriviaQA Wiki leaderboards as well as on two adversarial SQuAD datasets.


Introduction
As a brand new field in the question answering community, reading comprehension is one of the key problems in artificial intelligence: it aims to read and comprehend a given text and then answer questions based on it. The task is challenging because it requires a comprehensive understanding of natural language and the ability to do further inference and reasoning. Restricted by the limited volume of annotated data, early studies mainly relied on a pipeline of NLP models, such as semantic parsing and linguistic annotation (Das et al., 2014). Only after the release of large-scale cloze-style datasets, such as Children's Book Test (Hill et al., 2015) and CNN/Daily Mail (Hermann et al., 2015), did preliminary end-to-end deep learning methods begin to bloom and achieve superior results on the reading comprehension task (Hermann et al., 2015; Cui et al., 2016).
However, these cloze-style datasets still have their limitations: the goal is only to predict a single missing word (often a named entity) in a passage. This requires less reasoning than previously thought and does not require comprehending the whole passage. Therefore, Stanford published a new large-scale dataset, SQuAD (Rajpurkar et al., 2016), in which all the questions and answers are manually created through crowdsourcing. Different from cloze-style reading comprehension datasets, SQuAD allows the answer to be any text span within the reference passage, which requires more logical reasoning and content understanding.
Benefiting from the availability of the SQuAD benchmark dataset, rapid progress has been made in recent years. The works of (Wang and Jiang, 2016) and (Seo et al., 2016) are among the first to investigate this dataset: Wang and Jiang propose an end-to-end architecture based on match-LSTM and pointer networks (Wang and Jiang, 2016), and Seo et al. introduce the bi-directional attention flow network, which captures the question-document context at different levels of granularity (Seo et al., 2016). Chen et al. devise a simple and effective document reader by introducing a bilinear match function and a few manual features (Chen et al., 2017a). Wang et al. propose a gated attention-based recurrent network in which a self-match attention mechanism is first incorporated (Wang et al., 2017). In (Liu et al., 2017b) and , the multi-turn memory networks are designed to simulate multi-step reasoning in machine reading comprehension.
The idea of our approach derives from the normal human reading pattern. First, people scan through the whole passage to catch a glimpse of its main body. Then, with the question in mind, they connect the passage with the question and understand the main intent of the question in relation to the passage theme. A rough answer span is then located in the passage, and attention is focused on the located context. Finally, to avoid forgetting the question, people come back to the question and select the best answer according to the previously located answer span.
Inspired by this, we propose a hierarchical attention network which can gradually focus the attention on the right part of the answer boundary, while capturing the relation between the question and passage at different levels of granularity, as illustrated in Figure 1. Our model mainly consists of three joint layers: 1) encoder layer where pretrained language models and recurrent neural networks are used to build representation for questions and passages separately; 2) attention layer in which hierarchical attention networks are designed to capture the relation between question and passage at different levels of granularity; 3) match layer where refined question and passage are matched under a pointer-network (Vinyals et al., 2015) answer boundary predictor.
In the encoder layer, to better represent the questions and passages in multiple aspects, we combine two different embeddings to give the fundamental word representations. In addition to the typical GloVe word embeddings, we also utilize the ELMo embeddings (Peters et al., 2018) derived from a pre-trained language model, which show superior performance in a wide range of NLP problems. Different from the original way of fusing intermediate layer representations, we design a representation-aware fusion method to compute the output ELMo embeddings; context information is also incorporated by further passing the embeddings through a bi-directional LSTM network.
The key to a machine reading comprehension solution lies in how to incorporate the question context into the paragraph, for which the attention mechanism is most widely used. Recently, many different attention functions and types have been designed (Xiong et al., 2016; Seo et al., 2016; Wang et al., 2017), which aim at properly aligning the question and passage. In our attention layer, we propose a hierarchical attention network that leverages both the co-attention and self-attention mechanisms to gradually focus our attention on the best answer span. Different from previous attention-based methods, we constantly complement the aligned representations with global information from the previous layer, and an additional fusion layer is used to further refine the representations. In this way, our model can make minor adjustments so that the attention stays on the right place.
Based on the refined question and passage representation, a bilinear match layer is finally used to identify the best answer span with respect to the question. Following the work of (Wang and Jiang, 2016), we predict the start and end boundary within a pointer-network output layer.
The proposed method achieves state-of-the-art results against strong baselines. Our single model achieves 79.2% EM and 86.6% F1 score on the hidden test set, while the ensemble model further boosts the performance to 82.4% EM and 88.6% F1 score. At the time of writing the paper (Jan. 12th 2018), our model SLQA+ (Semantic Learning for Question Answering) achieves the first position on the SQuAD leaderboard for both single and ensemble models. Besides, we are also among the first to surpass human EM performance on this golden benchmark dataset.

Machine Reading Comprehension
Traditional reading comprehension style question answering systems rely on a pipeline of NLP models, which make heavy use of linguistic annotation, structured world knowledge, semantic parsing and similar NLP pipeline outputs (Hermann et al., 2015). Recently, the rapid progress of machine reading comprehension has largely benefited from the availability of large-scale benchmark datasets, which make it possible to train large end-to-end neural network models. Among them, CNN/Daily Mail (Hermann et al., 2015) and Children's Book Test (Hill et al., 2015) are the first large-scale datasets for the reading comprehension task. However, these datasets are cloze-style, in which the goal is to predict the missing word (often a named entity) in a passage. Moreover, Chen et al. have also shown that these cloze-style datasets require less reasoning than previously thought. Different from the previous datasets, SQuAD provides a more challenging benchmark, where the goal is to extract an arbitrary answer span from the original passage.

Attention-based Neural Networks
The key to the MRC task lies in how to incorporate the question context into the paragraph, for which the attention mechanism is most widely used. In spite of a variety of model structures and attention types (Cui et al., 2016; Xiong et al., 2016; Seo et al., 2016; Wang et al., 2017; Clark and Gardner, 2017), a typical attention-based neural network model for MRC first encodes the symbolic representation of the question and passage in an embedding space, then identifies answers with particular attention functions in that space. In terms of the question and passage attention or matching strategy, we roughly categorize these attention-based models into two large groups: one-way attention and two-way attention.
In the one-way attention model, the question is first summarized into a single vector and then directly matched with the passage. Most of the end-to-end neural network methods on the cloze-style datasets are based on this model (Hermann et al., 2015; Kadlec et al., 2016; Dhingra et al., 2016). Hermann et al. are the first to apply attention-based neural network methods to the MRC task and introduce an attentive reader and an impatient reader (Hermann et al., 2015), by leveraging a two-layer LSTM network. Chen et al. further design a bilinear attention function based on the attentive reader, which shows superior performance on the CNN/Daily Mail dataset. However, part of the information may be lost when summarizing the question, and a fine-grained attention over both the question and passage words should be more reasonable.
Therefore, the two-way attention model unfolds both the question and passage into their respective word embeddings, and computes the attention in a two-dimensional matrix. Most of the top-ranking methods on the SQuAD leaderboard are based on this attention mechanism (Wang et al., 2017; Xiong et al., 2017; Liu et al., 2017b,a). (Cui et al., 2016) and (Xiong et al., 2016) introduce the co-attention mechanism to better couple the representations of the question and document. Seo et al. propose a bi-directional attention flow network to capture the relevance at different levels of granularity (Seo et al., 2016). (Wang et al., 2017) further introduce the self-attention mechanism to refine the representation by matching the passage against itself, to better capture the global passage information. Huang et al. introduce a fully-aware attention mechanism with a novel history-of-word concept.
We propose a hierarchical attention network by leveraging both co-attention and self-attention mechanisms in different layers, which can capture the relevance between the question and passage at different levels of granularity. Different from the above methods, we further devise a fusion function to combine both the aligned representation and the original representation from the previous layer within each attention. In this way, the model can always focus on the right part of the passage, while keeping the global passage topic in mind.

Task Description
Typical machine comprehension systems take an evidence text and a question as input, and predict a span within the evidence that answers the question. Based on this definition, given a passage and a question, the machine needs to first read and understand the passage, and then find the answer to the question. The passage is described as a sequence of word tokens $P = \{w_t^P\}_{t=1}^{n}$ and the question is described as $Q = \{w_t^Q\}_{t=1}^{m}$, where $n$ is the number of words in the passage and $m$ is the number of words in the question. In general, $n \gg m$. The answer can have different types depending on the task. In the SQuAD dataset (Rajpurkar et al., 2016), the answer $A$ is guaranteed to be a continuous span in the passage $P$. The objective for machine reading comprehension is to learn a function $f(q, p) = \arg\max_{a \in A(p)} P(a \mid q, p)$. The training data is a set of question, passage and answer tuples $\langle Q, P, A \rangle$.

Encode-Interaction-Pointer Framework
We will now describe our framework from the bottom up. As shown in Figure 1, the proposed framework consists of four typical layers to learn different concepts of semantic representations:
• Encoder Layer, acting as a language model, utilizes contextual cues from surrounding words to refine the embedding of the words. It converts the passage and question from tokens to semantic representations;
• Attention Layer attempts to capture relations between question and passage. Besides the aligned context, the contextual embeddings are also merged by a fusion function. Moreover, applying this operation at multiple levels forms a "working memory";
• Match Layer employs a bilinear match function to compute the relevance between the question and passage representations on a span level;
• Output Layer uses a pointer network to search for the answer span of the question.
The main contribution of this work is the attention layer. In order to capture the relationship between question and passage, a hierarchical strategy is used to progressively make the answer boundary clear with the refined attention mechanism. A fine-grained fusion function is also introduced to better align the contextual representations from different levels. The detailed description of the model is provided as follows.

Hierarchical Attention Fusion Network
Our design is based on a simple but natural intuition: applying a fine-grained mechanism requires first roughly seeing the potential answer domain and then progressively locating the most discriminative parts of that domain.
The overall framework of our Hierarchical Attention Fusion Network is shown in Figure 1. It consists of several parts: a basic co-attention layer with shallow semantic fusion, a self-attention layer with deep semantic fusion and a memory-wise bilinear alignment function. The proposed network has two distinctive characteristics: (i) a fine-grained fusion approach to blend attention vectors for a better understanding of the relationship between question and passage; (ii) a multi-granularity attention mechanism applied at the word and sentence level, enabling it to properly attend to the most important content when constructing the question and passage representations. Experiments conducted on SQuAD and adversarial example datasets (Jia and Liang, 2017) demonstrate that the proposed framework outperforms previous methods by a large margin. Details of the different components are described in the following sections.

Language Model & Encoder Layer
The encoder layer of the model transforms the discrete word tokens of the question and passage into a sequence of continuous vector representations. We use a pre-trained word embedding model and a char embedding model to lay the foundation for our model.
To further utilize contextual cues from surrounding words to refine the embedding of the words, we then put a shared Bi-LSTM network on top of the embeddings provided by the previous layers to model the temporal interactions between words. Before feeding into the Bi-LSTM contextual network, we concatenate the word embeddings and char embeddings for a full understanding of each word. For the final output of our encoder layer, we further concatenate the output of the contextual Bi-LSTM network with the pre-trained char embeddings, which perform well (Peters et al., 2018). This can be regarded as a residual connection between word representations at different levels.

Hierarchical Attention & Fusion Layer
The attention layer is responsible for linking and fusing information from the question and passage representations, which is the most critical step in most MRC tasks. It aims to align the question and passage so that we can better locate the passage span most relevant to the question. We propose a hierarchical attention structure that combines the co-attention and self-attention mechanisms in a multi-hop style. Besides, we think that the original representation and the aligned representation obtained via attention reflect the content semantics at different granularities. Therefore, we also apply a particular fusion function after each attention function, so that different levels of semantics can be better incorporated towards a better understanding.

Co-attention & Fusion
Given the question and passage representations $u_t^Q$ and $u_t^P$, a soft-alignment matrix $S$ is built to calculate the shallow semantic similarity between question and passage as follows:

$$S_{ij} = \mathrm{Att}(u_i^Q, u_j^P) = \mathrm{ReLU}(W_{lin}^\top u_i^Q)^\top \cdot \mathrm{ReLU}(W_{lin}^\top u_j^P)$$

where $W_{lin}$ is a trainable weight matrix. This decomposition transforms each word once rather than every word pair, which avoids the quadratic complexity and is trivially parallelizable (Parikh et al., 2016). We then use the unnormalized attention weights $S_{ij}$ to compute the attentions between question and passage, which are further used to obtain the attended vectors in the passage-to-question and question-to-passage directions, respectively.
P2Q Attention signifies which question words are most relevant to each passage word, given as below:

$$\alpha_j = \mathrm{softmax}(S_{:j})$$

where $\alpha_j$ represents the attention weights on the question words. The aligned passage representation from question $Q = \{u_t^Q\}_{t=1}^{m}$ can thus be derived as

$$\tilde{Q}_j = \sum_{i=1}^{m} \alpha_{ij} \cdot u_i^Q, \quad \forall j \in [1, \ldots, n]$$

Q2P Attention signifies which passage words have the closest similarity to one of the question words and are hence critical for answering the question. We calculate this attention in the same way as the passage-to-question attention (P2Q), except in the opposite direction:

$$\beta_i = \mathrm{softmax}(S_{i:}), \qquad \tilde{P}_i = \sum_{j=1}^{n} \beta_{ij} \cdot u_j^P, \quad \forall i \in [1, \ldots, m]$$

where $\tilde{P}$ indicates the weighted sum of the most important words in the passage with respect to the question.
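The two attention directions can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes the similarity takes the form ReLU(W_lin u^Q) · ReLU(W_lin u^P) and that the weights are normalized with a softmax; the function names `softmax` and `co_attention` and the array shapes are hypothetical.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(U_q, U_p, W_lin):
    """U_q: (m, d) question states, U_p: (n, d) passage states, W_lin: (d, k).

    Assumed similarity: S[i, j] = ReLU(W_lin u_i^Q) . ReLU(W_lin u_j^P).
    """
    H_q = np.maximum(U_q @ W_lin, 0.0)   # (m, k), each question word transformed once
    H_p = np.maximum(U_p @ W_lin, 0.0)   # (n, k), each passage word transformed once
    S = H_q @ H_p.T                      # (m, n) unnormalized alignment matrix

    alpha = softmax(S, axis=0)           # P2Q: weights over question words, per passage word
    Q_tilde = alpha.T @ U_q              # (n, d) question summary aligned to each passage word

    beta = softmax(S, axis=1)            # Q2P: weights over passage words, per question word
    P_tilde = beta @ U_p                 # (m, d) passage summary aligned to each question word
    return S, Q_tilde, P_tilde
```

Because the linear map and ReLU are applied to each word independently, only the final matrix product touches all m × n pairs.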
With the aligned passage and question representations $\tilde{Q}$ and $\tilde{P}$ derived, a particular fusion unit is designed to combine the original contextual representations and the corresponding attention vectors for question and passage separately:

$$P' = \mathrm{Fuse}(P, \tilde{Q}), \qquad Q' = \mathrm{Fuse}(Q, \tilde{P})$$

where Fuse(·, ·) is a typical fusion kernel. The simplest way of fusion is a concatenation or addition of the two representations, followed by some linear or non-linear transformation. Recently, a heuristic matching trick with difference and element-wise product has been found effective in combining different representations (Mou et al., 2016; Chen et al., 2017b):

$$m(P, \tilde{Q}) = \tanh(W_f [P; \tilde{Q}; P \circ \tilde{Q}; P - \tilde{Q}] + b_f)$$

where $\circ$ denotes the element-wise product, and $W_f$, $b_f$ are trainable parameters. The output dimension is projected back to the same size as the original representation $P$ or $Q$ via the projection matrix $W_f$.
Since we find that the original contextual representations are important in reflecting the semantics at a more global level, we also introduce different levels of gating mechanism to incorporate the projected representations $m(\cdot, \cdot)$ with the original contextual representations. As a result, the final fused representations of passage and question can be formulated as:

$$P' = g(P, \tilde{Q}) \cdot m(P, \tilde{Q}) + (1 - g(P, \tilde{Q})) \cdot P \quad (11)$$

$$Q' = g(Q, \tilde{P}) \cdot m(Q, \tilde{P}) + (1 - g(Q, \tilde{P})) \cdot Q \quad (12)$$

where $g(\cdot, \cdot)$ is a gating function. To capture the relation between the representations at different granularities, we also design a scalar-based, a vector-based and a matrix-based sigmoid gating function, which are compared in Section 4.5.
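The gated fusion of Equations (11) and (12) can be sketched as follows. This is an illustrative sketch under stated assumptions: the projection m uses a tanh non-linearity over the concatenation [X; Y; X∘Y; X−Y], the gate is the vector-based variant (one gate value per position), and the function name `fuse` and all shapes are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fuse(X, Y, W_f, b_f, w_g, b_g):
    """Gated fusion of an original representation X with its aligned counterpart Y.

    X, Y: (T, d); W_f: (4d, d); b_f: (d,); w_g: (4d,); b_g: scalar.
    """
    # Heuristic matching features: both inputs, their product and their difference.
    Z = np.concatenate([X, Y, X * Y, X - Y], axis=-1)   # (T, 4d)
    m = np.tanh(Z @ W_f + b_f)                          # (T, d) projected back to size d (tanh assumed)
    g = sigmoid(Z @ w_g + b_g)[:, None]                 # (T, 1) one gate value per position
    # The gate interpolates between the fused features and the original input.
    return g * m + (1.0 - g) * X
```

When the gate saturates near 0 the layer passes the original representation through unchanged, which is how the global context from the previous layer is preserved.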

Self-attention & Fusion
Borrowing the idea from the wide and deep network (Cheng et al., 2016), manual features are also combined with the outputs of the previous layer for a more comprehensive representation. In our model, these features are concatenated with the refined question-aware passage representation, where $feat_{man}$ denotes the word-level manual passage features.
In this layer, we separately consider the semantic representations of question and passage, and further refine the information obtained from the co-attention layer. Since fusing information among context words allows contextual information to flow close to the correct answer, the self-attention layer is used to further align the question and passage representations against themselves, so as to keep the global sequence information in memory. Benefiting from the advantage of self-alignment attention in addressing long-distance dependence (Wang et al., 2017), we adopt a self-alignment fusion process at this level. To allow more freedom in the aligning process, we introduce a bilinear self-alignment attention function on the passage representation. Another fusion function Fuse(·, ·) is again adopted to combine the question-aware passage representation $D$ and the self-aware representation $\tilde{D}$. Finally, a bidirectional LSTM is used to get the final contextual passage representation. As for the question side, since it is generally shorter in length and can be adequately represented with less information, we follow the question encoding method used in (Chen et al., 2017a) and adopt a linear transformation to encode the question representation into a single vector.
First, another contextual bidirectional LSTM network is applied on top of the fused question representation: $Q'' = \mathrm{BiLSTM}(Q')$. Then we aggregate the resulting hidden units into one single question vector with a linear self-alignment, where $w_q$ is a weight vector to learn. That is, we self-align the refined question representation to a single vector according to the question self-attention weights, which can then be used to compute the matching with the passage words.
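The two self-alignment steps of this layer can be sketched as below. This is a hedged illustration rather than the paper's exact formulation: the bilinear form D W_l Dᵀ with a row-wise softmax, the absence of diagonal masking, and the names `self_align`, `pool_question` and `W_l` are all assumptions; only the weight vector `w_q` is named in the text.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_align(D, W_l):
    """Bilinear self-alignment of the passage against itself.

    D: (n, d) question-aware passage representation; W_l: (d, d) assumed bilinear weight.
    Returns the self-aware representation, shape (n, d).
    """
    L = softmax(D @ W_l @ D.T, axis=1)   # (n, n) each row attends over all passage words
    return L @ D

def pool_question(Q_states, w_q):
    """Linear self-alignment that collapses question states (m, d) into one vector (d,)."""
    gamma = softmax(Q_states @ w_q, axis=0)   # (m,) one weight per question word
    return gamma @ Q_states
```

The output of `self_align` would then be fused with `D` by the same kind of fusion kernel used after co-attention, and the pooled question vector feeds the match layer.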

Model & Output Layer
Instead of predicting the start and end positions based only on D, a top-level bilinear match function is used to capture the semantic relation between question q and paragraph D in a matching style, which effectively works as a multi-hop matching mechanism. Different from the co-attention layer, which generates coarse candidate answers, and the self-attention layer, which focuses the relevant passage context on a certain intent of the question, the top model layer uses a bilinear matching function to capture the interaction between the outputs of the previous layers and finally locate the right answer span.
The start and end distributions over the passage words are calculated in a bilinear matching way, where $W_s$ and $W_e$ are trainable matrices of the bilinear match function. The output layer is application-specific. In the MRC task, we use pointer networks to predict the start and end positions of the answer, since the task requires the model to find the sub-phrase of the passage that answers the question.
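A bilinear match between the pooled question vector and each passage position can be sketched as follows. The exact form is an assumption here (a softmax over the scores D W q); `span_distributions` is a hypothetical name, while `W_s` and `W_e` correspond to the trainable matrices named in the text.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def span_distributions(q, D, W_s, W_e):
    """q: (d_q,) question vector; D: (n, d_p) passage states; W_s, W_e: (d_p, d_q).

    Returns start and end probability distributions over the n passage positions.
    """
    p_start = softmax(D @ W_s @ q)   # bilinear score of every position as the span start
    p_end = softmax(D @ W_e @ q)     # separate weights score every position as the span end
    return p_start, p_end
```

Using separate matrices for the two boundaries lets the same question vector pick out different passage positions for the start and the end.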
In the training process, with cross entropy as the metric, the loss for the start and end positions is the sum of the negative log probabilities of the true start and end indices under the predicted distributions, averaged over all examples:

$$L(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \left[ \log p^s(y_i^s) + \log p^e(y_i^e) \right]$$

where $\theta$ is the set of all trainable weights in the model, $N$ is the number of examples, $p^s$ is the probability of the start index and $p^e$ is the probability of the end index. $y_i^s$ and $y_i^e$ are the true start and end indices. During prediction, we choose the answer span with the maximum value of $p^s \cdot p^e$ under the constraint $s \le e \le s + 15$, which is selected via a dynamic programming algorithm in linear time.
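The training loss and the constrained span search can be illustrated with plain Python. This sketch is not the authors' code: `span_loss` and `best_span` are hypothetical names, and the search below is a simple windowed scan (linear in the passage length for a fixed window) standing in for the dynamic program mentioned in the text.

```python
import math

def span_loss(p_starts, p_ends, y_starts, y_ends):
    """Mean negative log-likelihood of the true start and end indices.

    p_starts, p_ends: per-example probability lists; y_starts, y_ends: true indices.
    """
    total = 0.0
    for p_s, p_e, y_s, y_e in zip(p_starts, p_ends, y_starts, y_ends):
        total -= math.log(p_s[y_s]) + math.log(p_e[y_e])
    return total / len(y_starts)

def best_span(p_start, p_end, max_len=15):
    """Return ((s, e), score) maximizing p_start[s] * p_end[e] with s <= e <= s + max_len."""
    best, best_score = (0, 0), -1.0
    n = len(p_start)
    for s in range(n):
        # Only end positions inside the allowed window are considered.
        for e in range(s, min(n, s + max_len + 1)):
            score = p_start[s] * p_end[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best, best_score
```

For example, `best_span([0.1, 0.7, 0.2], [0.6, 0.1, 0.3])` selects the span (1, 2): the end position with the highest probability, index 0, is ruled out because it precedes the most likely start.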

Experiments
In this section, we first present the datasets used for evaluation. Then we compare our end-to-end Hierarchical Attention Fusion Networks with existing machine reading models. Finally, we conduct experiments to validate the effectiveness of our proposed components. We evaluate our model on the task of question answering using the recently released SQuAD and TriviaQA Wikipedia (Joshi et al., 2017) datasets, which have gained huge attention over the past year. An adversarial evaluation for the Stanford Question Answering Dataset (SQuAD) is also used to demonstrate the robustness of our model under adversarial attacks (Jia and Liang, 2017).

Training Details
We use the AdaMax optimizer, with a mini-batch size of 32 and an initial learning rate of 0.002. A dropout rate of 0.4 is used for all LSTM layers. To directly optimize our target against the evaluation metrics, we further fine-tune the model with some well-defined strategy. During fine-tuning, Focal Loss (Lin et al., 2017) and Reinforce Loss, which takes the F1 score as reward, are incorporated with the Cross Entropy Loss. The training process takes roughly 20 hours on a single Nvidia Tesla M40 GPU. We also train an ensemble model consisting of 15 training runs with the identical framework and hyper-parameters. At test time, we choose the answer with the highest sum of confidence scores amongst the 15 runs for each question.
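The ensemble rule can be sketched as a simple vote weighted by confidence. This is a minimal illustration that assumes each run emits one (answer text, confidence) pair per question and that answers are grouped by exact string match; `ensemble_answer` is a hypothetical name.

```python
from collections import defaultdict

def ensemble_answer(predictions):
    """Pick the answer whose confidence scores, summed over all runs, are highest.

    predictions: iterable of (answer_text, confidence) pairs, one per training run.
    """
    totals = defaultdict(float)
    for answer, confidence in predictions:
        totals[answer] += confidence   # identical answer strings pool their confidence
    return max(totals, key=totals.get)
```

With this rule, an answer proposed by several runs at moderate confidence can outrank one proposed by a single run at higher confidence.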

Main Results
The results of our model and competing approaches on the hidden test set are summarized in Table 1. The proposed SLQA+ ensemble model achieves an EM score of 82.4 and F1 score of 88.6, outperforming all previous approaches, which validates the effectiveness of our hierarchical attention and fusion network structure.
We also conduct experiments on the adversarial SQuAD dataset (Jia and Liang, 2017) to study the robustness of the proposed model. In this dataset, one or more sentences are appended to the original SQuAD context, aiming to mislead the trained models. We use exactly the same model as for the SQuAD dataset; the performance comparison is shown in Table 2. It can be seen that the proposed model still obtains better results than all the other competing approaches.

Table 2: The F1 scores of different models on AddSent and AddOneSent datasets (S: Single Model, E: Ensemble).

Model                               AddSent   AddOneSent
Logistic (Rajpurkar et al., 2016)   23.2      30.4
Match-S (Wang and Jiang, 2016)      27.3      39.0
Match-E (Wang and Jiang, 2016)      29.4      41.8
BiDAF-S (Seo et al., 2016)          34.3      45.7
BiDAF-E (Seo et al., 2016)          34.2      46.9
ReasoNet-S                          39.4      50.3
ReasoNet-E                          39.4      49.8
Mnemonic-S (Hu et al., 2017)        46.6      56.0
Mnemonic-E (Hu et al., 2017)        46.2      55.3
QANet-S (Yu et al., 2018)           45

Ablations
In order to evaluate the individual contribution of each model component, we run an ablation study. Table 3 shows the performance of our model and its ablations on the SQuAD dev set. The bilinear alignment plus fusion between passage and question is the most critical component on both metrics: removing it results in a drop of nearly 15%. The reason may be that, in the top-level attention layer, the similar semantics between question and passage are strong evidence for locating the correct answer span. Removing ELMo accounts for about 5% of performance degradation, which clearly shows the effectiveness of the language model. We conjecture that the language model layer efficiently encodes different types of syntactic and semantic information about words in context, and thus improves task performance. To evaluate the hierarchical architecture, we replace the multi-hop fusion with a standard LSTM network. The result shows that multi-hop fusion outperforms the standard LSTM by nearly 5% on both metrics.

Fusion Functions
In this section, we experimentally demonstrate how different choices of the fusion kernel impact the performance of our model. The compared fusion kernels are described as follows:
Simple Concat: a simple concatenation of the two representations.
Scalar-based Fusion: the gating function is a trainable scalar parameter (a coarse fusion level), $g(P, \tilde{Q}) = g_p$, where $g_p$ is a trainable scalar parameter.
Vector-based Fusion: the gating function contains a weight vector to learn, which acts as a one-dimensional sigmoid gating,

$$g(P, \tilde{Q}) = \sigma(w_g \cdot [P; \tilde{Q}; P \circ \tilde{Q}; P - \tilde{Q}] + b_g) \quad (24)$$

where $w_g$ is a trainable weight vector, $b_g$ is a trainable bias, and $\sigma$ is the sigmoid function.
Matrix-based Fusion: the gating function contains a weight matrix to learn, which acts as a two-dimensional sigmoid gating,

$$g(P, \tilde{Q}) = \sigma(W_g \cdot [P; \tilde{Q}; P \circ \tilde{Q}; P - \tilde{Q}] + b_g) \quad (25)$$

where $W_g$ is a trainable weight matrix.
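The difference between the three gates is mainly one of granularity, which the shapes in the following sketch make explicit. The function names, the argument layout and the choice to treat the scalar gate as a raw value in [0, 1] are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def match_features(X, Y):
    """Concatenate [X; Y; X*Y; X-Y] along the feature axis: (T, d) -> (T, 4d)."""
    return np.concatenate([X, Y, X * Y, X - Y], axis=-1)

def scalar_gate(g_p):
    """One trainable scalar shared by every position and dimension (assumed to lie in [0, 1])."""
    return float(g_p)

def vector_gate(X, Y, w_g, b_g):
    """Weight vector w_g: (4d,). Yields one gate value per position, shape (T, 1)."""
    return sigmoid(match_features(X, Y) @ w_g + b_g)[:, None]

def matrix_gate(X, Y, W_g, b_g):
    """Weight matrix W_g: (4d, d). Yields one gate value per position and dimension, shape (T, d)."""
    return sigmoid(match_features(X, Y) @ W_g + b_g)
```

All three results broadcast against a (T, d) representation, so the same interpolation `g * m + (1 - g) * X` applies regardless of which gate is chosen.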
The comparison results of different fusion kernels can be found in Table 4. We can see that different fusion methods contribute differently to the final performances, and the vector-based fusion method performs best, with a moderate parameter size.

Attention Hierarchy and Function
In the proposed model, the attention layer is the most important part of the framework. At the bottom of Table 5 we show the performance on SQuAD for four common attention functions. Empirically, we find that bilinear attention, which adds a ReLU after the linear transformation, does significantly better than the others. At the top of Table 5 we show the effect of varying the number of attention layers on the final performance. We see a steep and steady rise in accuracy as the number of layers is increased from N = 1 to 3.

Experiments on TriviaQA
To further examine the robustness of the proposed model, we also test the model performance on the TriviaQA dataset. The test performance of different methods on the leaderboard (on Jan. 12th 2018) is shown in Table 6. From the results, we can see that the proposed model also obtains state-of-the-art performance on the more complex TriviaQA dataset.

Conclusions
We introduce a novel hierarchical attention network, a state-of-the-art reading comprehension model which conducts attention and fusion horizontally and vertically across layers at different levels of granularity between question and paragraph. We show that our proposed method is powerful and robust, outperforming the previous state-of-the-art methods on various large-scale golden MRC datasets: SQuAD, TriviaQA, AddSent and AddOneSent.