Gated Self-Matching Networks for Reading Comprehension and Question Answering

In this paper, we present gated self-matching networks for reading comprehension style question answering, which aims to answer questions from a given passage. We first match the question and passage with gated attention-based recurrent networks to obtain the question-aware passage representation. Then we propose a self-matching attention mechanism to refine the representation by matching the passage against itself, which effectively encodes information from the whole passage. We finally employ pointer networks to locate the positions of answers in the passage. We conduct extensive experiments on the SQuAD dataset. The single model achieves 71.3% exact match on the hidden test set, while the ensemble model further boosts the result to 75.9%. At the time of submission of the paper, our model holds first place on the SQuAD leaderboard for both single and ensemble models.


Introduction
In this paper, we focus on reading comprehension style question answering which aims to answer questions given a passage or document. We specifically focus on the Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016), a large-scale dataset for reading comprehension and question answering which is manually created through crowdsourcing. SQuAD constrains answers to the space of all possible spans within the reference passage, which is different from cloze-style reading comprehension datasets (Hermann et al., 2015; Hill et al., 2016) in which answers are single words or entities. Moreover, SQuAD requires different forms of logical reasoning to infer the answer (Rajpurkar et al., 2016). Rapid progress has been made since the release of the SQuAD dataset. Wang and Jiang (2016b) build question-aware passage representation with match-LSTM (Wang and Jiang, 2016a), and predict answer boundaries in the passage with pointer networks (Vinyals et al., 2015). Seo et al. (2016) introduce bi-directional attention flow networks to model question-passage pairs at multiple levels of granularity. Xiong et al. (2016) propose dynamic co-attention networks which attend the question and passage simultaneously and iteratively refine answer predictions. Lee et al. (2016) and Yu et al. (2016) predict answers by ranking continuous text spans within passages.
* Contribution during internship at Microsoft Research. § Equal contribution.
Inspired by Wang and Jiang (2016b), we introduce a gated self-matching network, illustrated in Figure 1, an end-to-end neural network model for reading comprehension and question answering. Our model consists of four parts: 1) the recurrent network encoder to build representations for questions and passages separately, 2) the gated matching layer to match the question and passage, 3) the self-matching layer to aggregate information from the whole passage, and 4) the pointer-network-based answer boundary prediction layer.
The key contributions of this work are three-fold.
First, we propose a gated attention-based recurrent network, which adds an additional gate to attention-based recurrent networks (Rocktäschel et al., 2015; Wang and Jiang, 2016a), to account for the fact that words in the passage are of different importance for answering a particular question in reading comprehension and question answering. In Wang and Jiang (2016a), words in a passage and their corresponding attention-weighted question context are encoded together to produce a question-aware passage representation. By introducing a gating mechanism, our gated attention-based recurrent network assigns different levels of importance to passage parts depending on their relevance to the question, masking out irrelevant passage parts and emphasizing the important ones.
Second, we introduce a self-matching mechanism, which can effectively aggregate evidence from the whole passage to infer the answer. Through a gated matching layer, the resulting question-aware passage representation effectively encodes question information for each passage word. However, recurrent networks can only memorize limited passage context in practice despite their theoretical capability. One answer candidate is often unaware of the clues in other parts of the passage. To address this problem, we propose a self-matching layer to dynamically refine the passage representation with information from the whole passage. Based on the question-aware passage representation, we apply gated attention-based recurrent networks to the passage against itself, aggregating evidence relevant to the current passage word from every word in the passage. Together, the gated attention-based recurrent network layer and the self-matching layer dynamically enrich each passage representation with information aggregated from both the question and the passage, enabling the subsequent network to better predict answers.
Lastly, the proposed method yields state-of-the-art results against strong baselines. Our single model achieves 71.3% exact match accuracy on the hidden SQuAD test set, while the ensemble model further boosts the result to 75.9%. At the time of submission of this paper¹, our model holds first place on the SQuAD leaderboard.

Task Description
For reading comprehension style question answering, a passage P and a question Q are given, and our task is to predict an answer A to question Q based on information found in P. The SQuAD dataset further constrains answer A to be a continuous sub-span of passage P. Answer A often includes non-entities and can be a much longer phrase. This setup challenges us to understand and reason about both the question and passage in order to infer the answer. Table 1 shows a simple example from the SQuAD dataset.

Table 1: A simple example from the SQuAD dataset.
Passage: Tesla later approached Morgan to ask for more funds to build a more powerful transmitter. When asked where all the money had gone, Tesla responded by saying that he was affected by the Panic of 1901, which he (Morgan) had caused. Morgan was shocked by the reminder of his part in the stock market crash and by Tesla's breach of contract by asking for more funds. Tesla wrote another plea to Morgan, but it was also fruitless. Morgan still owed Tesla money on the original agreement, and Tesla had been facing foreclosure even before construction of the tower began.
Question: On what did Tesla blame for the loss of the initial money?
Answer: Panic of 1901

¹ On Feb. 6, 2017.

Gated Self-Matching Networks
Figure 1 gives an overview of the gated self-matching networks. First, the question and passage are processed by a bi-directional recurrent network (Mikolov et al., 2010) separately. We then match the question and passage with gated attention-based recurrent networks, obtaining a question-aware representation for the passage. On top of that, we apply self-matching attention to aggregate evidence from the whole passage and refine the passage representation, which is then fed into the output layer to predict the boundary of the answer span.

Question and Passage Encoder
Consider a question $Q = \{w^Q_t\}_{t=1}^{m}$ and a passage $P = \{w^P_t\}_{t=1}^{n}$. We first convert the words to their respective word-level embeddings ($\{e^Q_t\}_{t=1}^{m}$ and $\{e^P_t\}_{t=1}^{n}$) and character-level embeddings ($\{c^Q_t\}_{t=1}^{m}$ and $\{c^P_t\}_{t=1}^{n}$). The character-level embeddings are generated by taking the final hidden states of a bi-directional recurrent neural network (RNN) applied to embeddings of characters in the token. Such character-level embeddings have been shown to be helpful in dealing with out-of-vocabulary (OOV) tokens. We then use a bi-directional RNN over these embeddings to produce new representations $u^Q_1, \dots, u^Q_m$ and $u^P_1, \dots, u^P_n$ of all words in the question and passage respectively.
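The encoding step can be written compactly as below. This is a sketch consistent with the definitions in this section rather than a definitive statement of the model: feeding the concatenation $[e_t, c_t]$ of word-level and character-level embeddings as the RNN input, and the names $\mathrm{BiRNN}_Q$ and $\mathrm{BiRNN}_P$, are assumptions made for illustration.

```latex
\begin{aligned}
u^Q_t &= \mathrm{BiRNN}_Q\!\left(u^Q_{t-1}, [e^Q_t, c^Q_t]\right), \quad t = 1, \dots, m \\
u^P_t &= \mathrm{BiRNN}_P\!\left(u^P_{t-1}, [e^P_t, c^P_t]\right), \quad t = 1, \dots, n
\end{aligned}
```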

Gated Attention-based Recurrent Networks
We propose a gated attention-based recurrent network to incorporate question information into the passage representation. It is a variant of attention-based recurrent networks, with an additional gate to determine the importance of information in the passage regarding a question. Given question and passage representations $\{u^Q_t\}_{t=1}^{m}$ and $\{u^P_t\}_{t=1}^{n}$, Rocktäschel et al. (2015) propose generating a sentence-pair representation $\{v^P_t\}_{t=1}^{n}$ via soft-alignment of words in the question and passage: each $v^P_t$ is produced by a recurrent network whose input $c_t$ is an attention-pooling vector of the whole question ($u^Q$). Each passage representation $v^P_t$ thus dynamically incorporates aggregated matching information from the whole question.
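For concreteness, the attention-based recurrent network described above can be sketched as follows. Only $u^Q$, $u^P_t$, $v^P_t$ and the attention-pooling vector (written $c_t$ here) come from the text. The additive (tanh) scoring function, the parameter names $v$, $W^Q_u$, $W^P_u$, $W^P_v$, and the conditioning of the attention on the previous state $v^P_{t-1}$ are assumptions made for illustration.

```latex
\begin{aligned}
s^t_j &= v^{\top} \tanh\!\left(W^Q_u u^Q_j + W^P_u u^P_t + W^P_v v^P_{t-1}\right) \\
a^t_i &= \frac{\exp(s^t_i)}{\sum_{j=1}^{m} \exp(s^t_j)} \\
c_t &= \sum_{i=1}^{m} a^t_i \, u^Q_i \\
v^P_t &= \mathrm{RNN}\!\left(v^P_{t-1}, c_t\right)
\end{aligned}
```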
Wang and Jiang (2016a) introduce match-LSTM, which takes $u^P_t$ as an additional input into the recurrent network, so that the input becomes $[u^P_t, c_t]$. To determine the importance of passage parts and attend to the ones relevant to the question, we add another gate to the input ($[u^P_t, c_t]$) of the RNN. Different from the gates in LSTM or GRU, the additional gate is based on the current passage word and its attention-pooling vector of the question, and so focuses on the relation between the question and the current passage word. The gate effectively models the phenomenon that only parts of the passage are relevant to the question in reading comprehension and question answering.
We call this gated attention-based recurrent networks. It can be applied to variants of RNN, such as GRU and LSTM. We also conduct experiments to show the effectiveness of the additional gate on both GRU and LSTM.
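A single step of the gated input computation can be sketched in NumPy as below. This is a minimal illustration under stated assumptions, not the exact model: the additive scoring function, the sigmoid form of the gate, and all parameter names (`W_q`, `W_p`, `W_v`, `v`, `W_g`) and toy dimensions are hypothetical. The recurrent cell itself (GRU or LSTM) is omitted; the sketch stops at the gated input that would be fed to it.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    z = np.exp(x - np.max(x))
    return z / z.sum()

def attention_pool(u_Q, u_P_t, v_prev, W_q, W_p, W_v, v):
    """Attention-pooling vector c_t of the whole question (additive scoring assumed)."""
    # scores[j] = v^T tanh(W_q u_Q[j] + W_p u_P_t + W_v v_prev)
    scores = np.tanh(u_Q @ W_q.T + u_P_t @ W_p.T + v_prev @ W_v.T) @ v
    weights = softmax(scores)          # one weight per question word
    return weights @ u_Q, weights      # c_t is a weighted sum of question states

def gated_input(u_P_t, c_t, W_g):
    """Apply the additional gate to the RNN input [u_P_t, c_t]."""
    x = np.concatenate([u_P_t, c_t])
    g = 1.0 / (1.0 + np.exp(-(W_g @ x)))  # sigmoid gate (assumed form)
    return g * x                           # element-wise scaling of the input

# Toy data: every size and value below is illustrative only.
rng = np.random.default_rng(0)
m, d, k = 5, 4, 3                      # question length, state size, attention size
u_Q = rng.normal(size=(m, d))
u_P_t, v_prev = rng.normal(size=d), rng.normal(size=d)
W_q, W_p, W_v = (rng.normal(size=(k, d)) for _ in range(3))
v = rng.normal(size=k)
W_g = rng.normal(size=(2 * d, 2 * d))

c_t, weights = attention_pool(u_Q, u_P_t, v_prev, W_q, W_p, W_v, v)
x_gated = gated_input(u_P_t, c_t, W_g)
```

Because the gate values lie in (0, 1), each component of `x_gated` is a scaled-down copy of the corresponding component of $[u^P_t, c_t]$, which is how passage parts judged irrelevant to the question can be suppressed before they reach the recurrent cell.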

Self-Matching Attention
Through gated attention-based recurrent networks, the question-aware passage representation $\{v^P_t\}_{t=1}^{n}$ is generated to pinpoint important parts in the passage. One problem with such a representation is that it has very limited knowledge of context. One answer candidate is often oblivious to important cues in the passage outside its surrounding window. Moreover, there exists some sort of lexical or syntactic divergence between the question and passage in the majority of the SQuAD dataset (Rajpurkar et al., 2016). Passage context is therefore necessary to infer the answer. To address this problem, we propose directly matching the question-aware passage representation against itself. It dynamically collects evidence from the whole passage for words in the passage and encodes the evidence relevant to the current passage word and its matching question information into the passage representation $h^P_t$, which is produced by a recurrent network over the input $[v^P_t, c_t]$, where $c_t = \mathrm{att}(v^P, v^P_t)$ is an attention-pooling vector of the whole passage ($v^P$). An additional gate as in gated attention-based recurrent networks is applied to $[v^P_t, c_t]$ to adaptively control the input of the RNN.
Self-matching extracts evidence from the whole passage according to the current passage word and question information.
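The attention-pooling part of self-matching can be sketched in NumPy as below. Since $c_t = \mathrm{att}(v^P, v^P_t)$ depends only on the question-aware representations, the contexts for all positions can be computed at once. This is an illustrative sketch under assumptions: the additive scoring function and the parameter names `W_1`, `W_2`, `v` are hypothetical, and the gate and recurrent network that consume $[v^P_t, c_t]$ are left out.

```python
import numpy as np

def self_matching_context(v_P, W_1, W_2, v):
    """Return c_t = att(v_P, v_P[t]) for every position t, plus the attention matrix."""
    # scores[t, j] = v^T tanh(W_1 v_P[j] + W_2 v_P[t])   (additive scoring assumed)
    keys = v_P @ W_1.T                    # (n, k), one row per attended word j
    queries = v_P @ W_2.T                 # (n, k), one row per current word t
    scores = np.tanh(queries[:, None, :] + keys[None, :, :]) @ v   # (n, n)
    scores -= scores.max(axis=1, keepdims=True)                    # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)   # each row sums to 1
    return attn @ v_P, attn                   # (n, d) contexts, (n, n) weights

# Toy data: sizes and values are illustrative only.
rng = np.random.default_rng(0)
n, d, k = 6, 4, 3                         # passage length, state size, attention size
v_P = rng.normal(size=(n, d))
W_1, W_2 = rng.normal(size=(k, d)), rng.normal(size=(k, d))
v = rng.normal(size=k)
c, attn = self_matching_context(v_P, W_1, W_2, v)
```

Row `t` of `attn` holds the weights of the whole passage for the current passage word `t`, which is the layout of an attention matrix over the passage against itself.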

Output Layer
We follow Wang and Jiang (2016b) and use pointer networks (Vinyals et al., 2015) to predict the start and end positions of the answer. In addition, we use attention-pooling over the question representation to generate the initial hidden vector for the pointer network. Given the passage representation $\{h^P_t\}_{t=1}^{n}$, the attention mechanism is utilized as a pointer to select the start position ($p^1$) and end position ($p^2$) from the passage. At each step, the attention is computed from the passage representation and $h^a_{t-1}$, the last hidden state of the answer recurrent network (pointer network). The input of the answer recurrent network is the attention-pooling vector based on the current predicted probability $a^t$. When predicting the start position, $h^a_{t-1}$ represents the initial hidden state of the answer recurrent network. We utilize the question vector $r^Q$ as the initial state of the answer recurrent network, where $r^Q = \mathrm{att}(u^Q, V^Q_r)$ is an attention-pooling vector of the question based on the parameter $V^Q_r$. To train the network, we minimize the sum of the negative log probabilities of the ground truth start and end positions under the predicted distributions.
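The pointer step and the training objective can be illustrated with the NumPy sketch below. It is a simplified sketch under assumptions: the additive scoring function and the parameter names `W_h`, `W_a`, `v` are hypothetical, `r_Q` is a random stand-in for the question vector $r^Q$, and the answer recurrent network is replaced by a placeholder `tanh` update that is not the cell used in the model.

```python
import numpy as np

def pointer_distribution(h_P, h_a, W_h, W_a, v):
    """Attention over passage positions, used as a pointer (additive scoring assumed)."""
    scores = np.tanh(h_P @ W_h.T + h_a @ W_a.T) @ v
    z = np.exp(scores - scores.max())
    return z / z.sum()

def span_loss(start_probs, end_probs, start, end):
    """Sum of negative log probabilities of the ground-truth start and end positions."""
    return -(np.log(start_probs[start]) + np.log(end_probs[end]))

# Toy data: sizes and values are illustrative only.
rng = np.random.default_rng(0)
n, d, k = 7, 4, 3                      # passage length, state size, attention size
h_P = rng.normal(size=(n, d))
r_Q = rng.normal(size=d)               # stands in for the question vector r^Q
W_h, W_a = rng.normal(size=(k, d)), rng.normal(size=(k, d))
v = rng.normal(size=k)

a1 = pointer_distribution(h_P, r_Q, W_h, W_a, v)   # start distribution
c1 = a1 @ h_P                                      # attention-pooling input to the answer RNN
h_a1 = np.tanh(c1 + r_Q)                           # placeholder update, NOT the model's RNN cell
a2 = pointer_distribution(h_P, h_a1, W_h, W_a, v)  # end distribution
loss = span_loss(a1, a2, start=2, end=4)
p1, p2 = int(np.argmax(a1)), int(np.argmax(a2))    # greedy start/end prediction
```

Greedy selection as in the last line does not guarantee that the end position follows the start position; how such cases are handled at inference time is not specified in the text and is left open here.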

Implementation Details
We specifically focus on the SQuAD dataset, which has garnered considerable attention over the past few months, to train and evaluate our model. SQuAD is composed of 100,000+ questions posed by crowd workers on 536 Wikipedia articles. The dataset is randomly partitioned into a training set (80%), a development set (10%), and a test set (10%). The answer to every question is a segment of the corresponding passage.
We use the tokenizer from Stanford CoreNLP to preprocess each passage and question. The Gated Recurrent Unit variant of LSTM is used throughout our model. For word embeddings, we use pretrained case-sensitive GloVe embeddings (Pennington et al., 2014) for both questions and passages, and they are fixed during training. We use zero vectors to represent all out-of-vocabulary words. We utilize 1 layer of bi-directional GRU to compute character-level embeddings and 3 layers of bi-directional GRU to encode questions and passages. The gated attention-based recurrent network for question and passage matching is also encoded bidirectionally in our experiment. The hidden vector length is set to 75 for all layers. The hidden size used to compute attention scores is also 75. We also apply dropout (Srivastava et al., 2014) between layers with a dropout rate of 0.2. The model is optimized with AdaDelta (Zeiler, 2012) with an initial learning rate of 1. The ρ and ε used in AdaDelta are 0.95 and 1e-6 respectively.
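To make the role of ρ, ε and the learning rate concrete, the sketch below implements the standard AdaDelta update rule of Zeiler (2012) in NumPy with the hyperparameters stated above. It is an illustration only: the function name `adadelta_step`, the state dictionary, and the toy objective are hypothetical, and the implementation in any particular deep learning framework may differ in detail.

```python
import numpy as np

def adadelta_step(x, grad, state, rho=0.95, eps=1e-6, lr=1.0):
    """One AdaDelta update (Zeiler, 2012) on parameter array x; returns the new x."""
    # Running average of squared gradients.
    state["g2"] = rho * state["g2"] + (1 - rho) * grad ** 2
    # Step scaled by the ratio of the RMS of past updates to the RMS of gradients.
    delta = -np.sqrt(state["dx2"] + eps) / np.sqrt(state["g2"] + eps) * grad
    # Running average of squared updates.
    state["dx2"] = rho * state["dx2"] + (1 - rho) * delta ** 2
    return x + lr * delta

# Toy usage: minimise f(x) = x^2, whose gradient is 2x.
x = np.array([3.0])
state = {"g2": np.zeros_like(x), "dx2": np.zeros_like(x)}
for _ in range(200):
    x = adadelta_step(x, 2 * x, state)
```

With both accumulators starting at zero, the first steps are on the order of the square root of ε, so ε controls how quickly the optimizer starts moving.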

Main Results
Two metrics are utilized to evaluate model performance: Exact Match (EM) and F1 score. EM measures the percentage of predictions that match one of the ground truth answers exactly. F1 measures the overlap between the prediction and the ground truth answers, taking the maximum F1 over all of the ground truth answers. The scores on the dev set are evaluated by the official script. Since the test set is hidden, we are required to submit the model to the Stanford NLP group to obtain the test scores.
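The two metrics can be sketched in Python as below. This is a simplified illustration, not the official script: tokens are obtained by whitespace splitting, and the answer normalization that the official script applies before comparison (to our understanding, lowercasing and stripping punctuation and articles) is omitted. The function names are hypothetical.

```python
from collections import Counter

def exact_match(prediction, ground_truths):
    """1.0 if the prediction equals any ground-truth answer exactly, else 0.0."""
    return float(any(prediction == gt for gt in ground_truths))

def f1_single(prediction, ground_truth):
    """Token-level F1 between one prediction and one ground-truth answer."""
    pred_tokens, gt_tokens = prediction.split(), ground_truth.split()
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(gt_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)

def f1_score(prediction, ground_truths):
    """Maximum F1 over all ground-truth answers."""
    return max(f1_single(prediction, gt) for gt in ground_truths)

answers = ["Panic of 1901", "the Panic of 1901"]
em = exact_match("Panic of 1901", answers)   # exact hit on the first answer
f1 = f1_score("Panic of", answers)           # partial overlap with both answers
```

Dataset-level EM and F1 are then the averages of these per-question values over all questions.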

Ablation Study
We do ablation tests on the dev set to analyze the contribution of components of gated self-matching networks. As illustrated in Table 3, the gated attention-based recurrent network (GARNN) and self-matching attention mechanism positively contribute to the final results of gated self-matching networks. Removing self-matching results in a 3.5-point EM drop, which reveals that information in the passage plays an important role. Character-level embeddings contribute towards the model's performance since they can better handle out-of-vocabulary or rare words. To show the effectiveness of GARNN for variant RNNs, we conduct experiments on the base model (Wang and Jiang, 2016b) with different RNN variants. The base model matches the question and passage via a variant of attention-based recurrent network (Wang and Jiang, 2016a), and employs pointer networks to predict the answer. Character-level embeddings are not utilized. As shown in Table 4, the gate introduced in the question and passage matching layer is helpful for both GRU and LSTM on the SQuAD dataset.
Figure 2: Part of the attention matrices for self-matching. Each row is the attention weights of the whole passage for the current passage word. The darker the color is, the higher the weight is. Some key evidence relevant to the question-passage tuple is more encoded into answer candidates.

Encoding Evidence from Passage
To show the ability of the model to encode evidence from the passage, we draw the alignment of the passage against itself in self-matching. The attention weights are shown in Figure 2, in which the darker the color is, the higher the weight is. We can see that key evidence aggregated from the whole passage is more encoded into the answer candidates. For example, the answer "Egg of Columbus" pays more attention to the key information "Tesla", "device" and the lexical variation word "known" that are relevant to the question-passage tuple. The answer "world classic of epoch-making oratory" mainly focuses on the evidence "Michael Mullet", "speech" and the lexical variation word "considers". For other words, the attention weights are more evenly distributed between evidence and some irrelevant parts. Self-matching does adaptively aggregate evidence for words in the passage.

Result Analysis
To further analyse the model's performance, we analyse the F1 score for different question types (Figure 3(a)), different answer lengths (Figure 3(b)), different passage lengths (Figure 3(c)) and different question lengths (Figure 3(d)) of our model and its ablation models. As we can see, all four models show the same trend. The questions are split into different groups based on a set of question words we have defined, including "what", "how", "who", "when", "which", "where", and "why". Our model is better at "when" and "who" questions, but performs poorly on "why" questions. This is mainly because the answers to "why" questions can be very diverse, and they are not restricted to any certain type of phrase. From Figure 3(b), the performance of our model clearly drops as answer length increases: longer answers are harder to predict. From Figures 3(c) and 3(d), we find that the performance remains stable as length increases; the obvious fluctuation for longer passages and questions arises mainly because their proportion in the data is too small. Our model is largely agnostic to long passages and focuses on the important parts of the passage.

Related Work
Reading Comprehension and Question Answering Dataset. Benchmark datasets play an important role in recent progress in reading comprehension and question answering research. Existing datasets can be classified into two categories according to whether they are manually labeled. Those that are labeled by humans are always of high quality (Richardson et al., 2013; Berant et al., 2014; Yang et al., 2015), but are too small for training modern data-intensive models. Those that are automatically generated from naturally occurring data can be very large (Hill et al., 2016), which allows the training of more expressive models. However, they are in cloze style, in which the goal is to predict the missing word (often a named entity) in a passage. Moreover, the CNN / Daily News dataset has been shown to require less reasoning than previously thought, with performance on it almost saturated.
Unlike the above datasets, SQuAD provides a large and high-quality dataset. The answers in SQuAD often include non-entities and can be much longer phrases, which makes the task more challenging than cloze-style datasets. Moreover, Rajpurkar et al. (2016) show that the dataset retains a diverse set of answers and requires different forms of logical reasoning, including multi-sentence reasoning. MS MARCO (Nguyen et al., 2016) is also a large-scale dataset. The questions in the dataset are real anonymized queries issued through Bing or Cortana and the passages are related web pages. For each question in the dataset, several related passages are provided. However, the answers are human generated, which is different from SQuAD where answers must be a span of the passage.
End-to-end Neural Networks for Reading Comprehension. Along with cloze-style datasets, several powerful deep learning models (Hill et al., 2016; Kadlec et al., 2016; Sordoni et al., 2016; Cui et al., 2016; Trischler et al., 2016; Shen et al., 2016) have been introduced to solve this problem. Hermann et al. (2015) first introduce an attention mechanism into reading comprehension. Hill et al. (2016) propose a window-based memory network for the CBT dataset. Kadlec et al. (2016) introduce pointer networks with one attention step to predict the blanked-out entities. Sordoni et al. (2016) propose an iterative alternating attention mechanism to better model the links between question and passage. Trischler et al. (2016) solve the cloze-style question answering task by combining an attentive model with a reranking model. Iteratively selecting important parts of the passage by a multiplicative gating function with the question representation has also been proposed. Cui et al. (2016) propose a two-way attention mechanism to encode the passage and question mutually. Shen et al. (2016) propose iteratively inferring the answer with a dynamic number of reasoning steps, trained with reinforcement learning.
Neural network-based models have demonstrated their effectiveness on the SQuAD dataset. Wang and Jiang (2016b) combine match-LSTM and pointer networks to produce the boundary of the answer. Xiong et al. (2016) and Seo et al. (2016) employ variants of the co-attention mechanism to match the question and passage mutually. Xiong et al. (2016) propose a dynamic pointer network to iteratively infer the answer. Yu et al. (2016) and Lee et al. (2016) solve SQuAD by ranking continuous text spans within the passage. Yang et al. (2016) present a fine-grained gating mechanism to dynamically combine word-level and character-level representations and model the interaction between questions and passages. Matching the context of the passage with the question from multiple perspectives has also been proposed. Different from the above models, we introduce self-matching attention in our model. It dynamically refines the passage representation by looking over the whole passage and aggregating evidence relevant to the current passage word and question, allowing our model to make full use of passage information.
Attending to word context with learned weights has been proposed in several works. Ling et al. (2015) propose considering window-based contextual words differently depending on the word and its relative position. Cheng et al. (2016) propose a novel LSTM network to encode words in a sentence which considers the relation between the current token being processed and its past tokens in the memory. This method has also been applied to encode words in a sentence according to word form and distance. Since passage information relevant to the question is more helpful for inferring the answer in reading comprehension, we apply self-matching based on the question-aware representation and gated attention-based recurrent networks. It helps our model focus mainly on question-relevant evidence in the passage and dynamically look over the whole passage to aggregate evidence.
Another key component of our model is the attention-based recurrent network, which has demonstrated success in a wide range of tasks. Attention-based recurrent networks were first proposed to infer word-level alignment when generating the target word. Hermann et al. (2015) introduce word-level attention into reading comprehension to model the interaction between questions and passages. Rocktäschel et al. (2015) and Wang and Jiang (2016a) propose determining entailment via word-by-word matching. The gated attention-based recurrent network is a variant of the attention-based recurrent network with an additional gate to model the fact that passage parts are of different importance to the particular question in reading comprehension and question answering.

Conclusion
In this paper, we present gated self-matching networks for reading comprehension and question answering. We introduce the gated attention-based recurrent networks and self-matching attention mechanism to obtain representations for the question and passage, and then use pointer networks to locate answer boundaries. Our model achieves state-of-the-art results on the SQuAD dataset, outperforming several strong competing systems. As for future work, we are applying the gated self-matching networks to other reading comprehension and question answering datasets, such as the MS MARCO dataset (Nguyen et al., 2016).