Stochastic Answer Networks for Machine Reading Comprehension

We propose a simple yet robust stochastic answer network (SAN) that simulates multi-step reasoning in machine reading comprehension. Compared to previous work such as ReasoNet, which used reinforcement learning to determine the number of steps, the unique feature of our approach is the use of a kind of stochastic prediction dropout on the answer module (final layer) of the neural network during training. We show that this simple trick improves robustness and achieves results competitive with the state-of-the-art on the Stanford Question Answering Dataset (SQuAD), the Adversarial SQuAD, and the Microsoft MAchine Reading COmprehension Dataset (MS MARCO).


Introduction
Machine reading comprehension (MRC) is a challenging task: the goal is to have machines read a text passage and then answer any question about the passage. This task is a useful benchmark for demonstrating natural language understanding, and also has important applications in, e.g., conversational agents and customer service support. It has been hypothesized that difficult MRC problems require some form of multi-step synthesis and reasoning. For instance, the following example from the MRC dataset SQuAD (Rajpurkar et al., 2016) illustrates the need for synthesis of information across sentences and multiple steps of reasoning:

Q: What does the V&A Theator & Performance galleries hold?
P: The V&A Theator & Performance galleries opened in March 2009. ... They hold the UK's biggest national collection of material about live performance.
To infer the answer (the underlined portion of the passage P), the model needs to first perform coreference resolution so that it knows "They" refers to "V&A Theator", then extract the subspan in the direct object corresponding to the answer.
This kind of iterative process can be viewed as a form of multi-step reasoning.

Figure 1: Illustration of "stochastic prediction dropout" in the answer module during training. At each reasoning step t, the model combines the memory (bottom row) with the hidden state s_{t−1} to generate a prediction (a multinomial distribution). Here, there are three steps and three predictions, but one prediction is dropped and the final result is an average of the remaining distributions.

Several recent MRC models have embraced this kind of multi-step strategy, where predictions are generated after making multiple passes through the same text and integrating intermediate information in the process. The first pioneering models employed a predetermined fixed number of reasoning steps (Hill et al., 2016; Dhingra et al., 2016; Sordoni et al., 2016; Kumar et al., 2015). Later, Shen et al. (2016) showed that dynamically determining the number of steps based on the complexity of the question and answer leads to improvements in MRC; their model uses reinforcement learning to predict the number of reasoning steps for each question-answer pair. Further, Shen et al. (2017) empirically showed that dynamic multi-step reasoning outperforms fixed multi-step reasoning, which in turn outperforms single-step reasoning, on two distinct MRC datasets (SQuAD and MS MARCO).
In this work, we derive an alternative multi-step reasoning neural network for MRC. During training, we fix the number of reasoning steps, but perform stochastic dropout on the answer module (final layer predictions). During decoding, we generate answers based on the average of predictions over all steps, rather than the final step alone. We call this a stochastic answer network (SAN) because the stochastic dropout is applied to the answer module; albeit simple, this technique significantly improves the robustness and overall accuracy of the model. Intuitively this works because, while the model successively refines its prediction over multiple steps, each step is still trained to generate the same answer; we are performing a kind of stochastic ensemble over the model's successive prediction refinements. Stochastic prediction dropout is illustrated in Figure 1. We find that SAN is an effective model for multi-step reasoning in MRC.

Proposed model: SAN
The machine reading comprehension (MRC) task as defined here involves a question Q = {q_0, q_1, ..., q_{m−1}} and a passage P = {p_0, p_1, ..., p_{n−1}}, and aims to find an answer span A = {a_start, a_end} in P. We assume that the answer exists in the passage P as a contiguous text string. Here, m and n denote the number of tokens in Q and P, respectively. The learning algorithm for reading comprehension learns a function f(Q, P) → A. The training data is a set of query, passage, and answer tuples ⟨Q, P, A⟩.
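To make the setup concrete, here is a minimal sketch of one training tuple as a data structure; the class and field names are our own and purely illustrative, not from the original implementation:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class MRCExample:
    """One <Q, P, A> tuple: the answer is a contiguous span of the passage,
    given by inclusive token indices (a_start, a_end)."""
    question: List[str]            # tokens q_0 ... q_{m-1}
    passage: List[str]             # tokens p_0 ... p_{n-1}
    answer_span: Tuple[int, int]   # (a_start, a_end), indices into `passage`

    def answer_text(self) -> str:
        start, end = self.answer_span
        return " ".join(self.passage[start:end + 1])
```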
We will now describe our model from the ground up. The main contribution of this work is the answer module, but in order to understand what goes into this module, we will start by describing how Q and P are processed by the lower layers. Note that the lower layers also include some novel variations not used in previous work.
As shown in Figure 2, our model contains four different layers that capture different concepts of representation: 1) the lexicon encoding layer encodes different types of lexical variants and linguistic features; 2) the contextual encoding layer attempts to capture relations between words and phrases in context; 3) the memory generation layer gathers all the information from the passage and question and forms a "working memory" for the final answer module; 4) the final answer module, a type of multi-step network, searches for the answer span. A detailed description of our model follows.
Lexicon Encoding Layer. The purpose of the first layer is to extract information from Q and P at the word level and normalize for lexical variants. A typical technique to obtain a lexicon embedding is to concatenate a word embedding with other linguistic embeddings such as those derived from Part-Of-Speech (POS) tags. For word embeddings, we use the pre-trained 300-dimensional GloVe vectors (Pennington et al., 2014) for both Q and P. Following Chen et al. (2017), we use the following additional types of linguistic features for each token p_i in the passage P:

• A 9-dimensional POS tag embedding, covering 56 different POS tag types.
• An 8-dimensional named entity recognizer (NER) embedding, covering 18 different NER tag types. We use small embedding sizes for POS and NER to reduce the model size; they mainly serve as coarse-grained word clusters.
• A 3-dimensional binary exact-match feature defined as f_exact_match(p_i) = I(p_i ∈ Q), which checks whether the passage token p_i matches the original, lowercase, or lemma form of any question token.
• Aligned question embeddings: f_align(p_i) = Σ_j γ_{i,j} g(GloVe(q_j)), where g(·) is a 280-dimensional single-layer neural network ReLU(W_0 x) and γ_{i,j} = exp(g(GloVe(p_i)) · g(GloVe(q_j))) / Σ_{j'} exp(g(GloVe(p_i)) · g(GloVe(q_{j'}))) measures the similarity in word embedding space between a token p_i in the passage and a token q_j in the question. Compared to the exact-match features, these embeddings encode soft alignments between similar but not identical words (see the sketch after the next paragraph).
In summary, each token p_i in the passage is represented as a 600-dimensional vector and each token q_j in the question is represented as a 300-dimensional vector.
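As a rough illustration, here is a sketch of the aligned question-embedding feature in PyTorch; the module structure and names are our own, and the actual implementation may differ in details such as bias terms:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignedQuestionEmbedding(nn.Module):
    """f_align: each passage token attends over question tokens in a shared
    ReLU-projected embedding space, yielding a soft alignment feature."""
    def __init__(self, emb_dim: int = 300, proj_dim: int = 280):
        super().__init__()
        self.proj = nn.Linear(emb_dim, proj_dim, bias=False)  # g(x) = ReLU(W_0 x)

    def forward(self, passage_emb, question_emb):
        # passage_emb: (n, emb_dim), question_emb: (m, emb_dim) GloVe vectors
        g_p = F.relu(self.proj(passage_emb))   # (n, proj_dim)
        g_q = F.relu(self.proj(question_emb))  # (m, proj_dim)
        scores = g_p @ g_q.t()                 # (n, m) dot-product similarities
        gamma = F.softmax(scores, dim=1)       # normalize over question tokens
        return gamma @ g_q                     # (n, proj_dim): sum_j gamma_ij g(q_j)
```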
Due to the different dimensions of the passage and question representations, two different bidirectional LSTMs (BiLSTMs) (Hochreiter and Schmidhuber, 1997) might be required in the next layer to encode the contextual information. This, however, introduces a large number of additional parameters. To prevent this, we employ an idea inspired by (Vaswani et al., 2017): use two separate two-layer position-wise Feed-Forward Networks (FFNs) to map both the passage and question lexical encodings into the same number of dimensions. The FFN is defined as:

FFN(x) = W_2 ReLU(W_1 x + b_1) + b_2

Note that this FFN has fewer parameters than a BiLSTM. We thus obtain the final lexicon embeddings for the tokens in Q as a matrix E^q ∈ R^{d×m} and for the tokens in P as E^p ∈ R^{d×n}. The fact that both Q and P have embeddings with the same number of dimensions makes it possible to share parameters in the following contextual layers. In our experiments, d is set to 128.

Figure 2: The first layer is a lexicon encoding layer that maps words to their embeddings independently for the question (left) and the passage (right): this is a concatenation of standard word embeddings, POS embeddings, etc., followed by a position-wise Feed-Forward Network (FFN). The next layer is a contextual encoding layer, where a bidirectional LSTM (BiLSTM) is used on top of the lexicon encoding layer to obtain the context representation for both question and passage; to reduce the number of parameters, a maxout layer is applied to the output of the BiLSTM. The third layer is the working memory: we first compute an alignment matrix between the question and passage using an attention mechanism, and use it to derive a question-aware passage representation; we then concatenate this with the contextual representation of the passage and the word embeddings, and employ a self-attention layer to re-arrange the information gathered; finally, another LSTM is used to generate the working memory for the passage. The fourth layer is the answer module, a GRU recurrent network that outputs predictions at each state s_t.
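A minimal sketch of one such two-layer position-wise FFN is below; the hidden sizes and the ReLU placement follow the formula above, while other details are assumptions:

```python
import torch.nn as nn
import torch.nn.functional as F

class PositionWiseFFN(nn.Module):
    """Maps per-token lexicon encodings of arbitrary input width into a
    shared dimension d, so the two branches can share later layers."""
    def __init__(self, in_dim: int, d: int = 128):
        super().__init__()
        self.w1 = nn.Linear(in_dim, d)
        self.w2 = nn.Linear(d, d)

    def forward(self, x):
        # x: (seq_len, in_dim); the same weights are applied at every position
        return self.w2(F.relu(self.w1(x)))

# Two separate FFNs, one per branch:
ffn_p = PositionWiseFFN(in_dim=600, d=128)  # 600-dim passage lexicon features
ffn_q = PositionWiseFFN(in_dim=300, d=128)  # 300-dim question word embeddings
```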
Contextual Encoding Layer. Both the passage and the question use the same contextual encoding layer, based on BiLSTMs. To avoid over-fitting, we concatenate the output of a pre-trained 600-dimensional contextualized embedding (McCann et al., 2017), trained on an English-German machine translation dataset, with the aforementioned lexicon embeddings as the final input to the contextual encoding layer. Two stacked BiLSTM layers are used to encode the context information for each word. Since the layer is bidirectional, it doubles the hidden size; to reduce the parameter size, a maxout layer (Goodfellow et al., 2013) is applied on top of the BiLSTM to shrink its output back to the original hidden size. Then layer normalization (Ba et al., 2016) is applied. By concatenating the outputs of the two BiLSTM layers, we obtain H^q ∈ R^{2d×m} as the representation of Q and H^p ∈ R^{2d×n} as the representation of P, where d is the hidden size of the BiLSTM.
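As an illustration, the maxout reduction from 2d back to d can be sketched as an element-wise max over groups of channels; the pool size of 2 and the grouping below are assumptions about the exact configuration:

```python
import torch

def maxout(h: torch.Tensor, pool_size: int = 2) -> torch.Tensor:
    """h: (seq_len, 2d) BiLSTM output -> (seq_len, d), taking an element-wise
    max over consecutive groups of `pool_size` channels."""
    seq_len, dim = h.shape
    assert dim % pool_size == 0
    return h.view(seq_len, dim // pool_size, pool_size).max(dim=-1).values
```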
Memory Generation Layer. In this layer we construct the working memory, a summary of the information from both Q and P. This is accomplished via an attention mechanism. First, a dot-product attention is adopted, as in (Vaswani et al., 2017), to measure the similarity between the tokens in Q and P. Instead of using a scalar to normalize the scores as in (Vaswani et al., 2017), we use a flexible learnable layer with a ReLU nonlinearity to transform the contextual information of both questions and passages:

C = dropout(f_attention(Ĥ^q, Ĥ^p)) ∈ R^{m×n} (1)

C is an attention matrix, and dropout is applied for smoothing. Note that Ĥ^q and Ĥ^p are transformed from H^q and H^p by a one-layer neural network ReLU(W_3 x), respectively. Next, we gather all the information on the passage by a simple concatenation of its contextual information H^p and its question-aware representation H^q · C:

U^p = concat(H^p, H^q C) ∈ R^{4d×n} (2)

Typically, a passage may contain hundreds of tokens, making it hard to learn the long-range dependencies within it. Inspired by (Lin et al., 2017), we apply a self-attended layer to rearrange the information in U^p:

Û^p = U^p drop_diag(f_attention(U^p, U^p)) (3)

In other words, we first obtain an n × n attention matrix of U^p onto itself, apply dropout, then multiply this matrix with U^p to obtain an updated Û^p. Instead of using a penalization term as in (Lin et al., 2017), we drop the diagonal of the similarity matrix, forcing each token in the passage to align to tokens other than itself. Finally, the working memory is generated by another BiLSTM over all the information gathered:

M = BiLSTM([U^p; Û^p]) (4)

where the semicolon ; indicates the vector/matrix concatenation operator.
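The diagonal-dropout self-attention of Equation 3 can be sketched as follows; masking the diagonal before the softmax is our interpretation of "dropping the diagonal", and the exact form may differ:

```python
import torch
import torch.nn.functional as F

def self_attend_drop_diag(U_p: torch.Tensor, p_drop: float = 0.4) -> torch.Tensor:
    """U_p: (n, h) passage representation; returns the re-arranged U_p-hat.
    The diagonal of the similarity matrix is masked out so each token must
    align to tokens other than itself."""
    n = U_p.size(0)
    scores = U_p @ U_p.t()                                   # (n, n) similarities
    scores = scores.masked_fill(torch.eye(n, dtype=torch.bool), float("-inf"))
    attn = F.dropout(F.softmax(scores, dim=1), p=p_drop)     # smoothing dropout
    return attn @ U_p                                        # (n, h)
```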
Answer module. There is a Chinese proverb that says "the wisdom of the masses exceeds that of any individual." Unlike other multi-step reasoning models, which use only a single output, either from the last step or from some dynamically determined final step, our answer module employs the outputs of all the reasoning steps. Intuitively, applying dropout here avoids a "step bias problem" (where the model places too much emphasis on one particular step's predictions) and forces the model to produce good predictions at every individual step. Further, during decoding, we use the wisdom of the masses instead of an individual to achieve a better result. We call this method "stochastic prediction dropout" because dropout is applied to the final predictive distributions.

Formally, our answer module computes over T memory steps and outputs the answer span. This module is a memory network and has some similarities to other multi-step reasoning networks: namely, it maintains a state vector, one state per step. At the beginning, the initial state s_0 is a summary of Q: s_0 = Σ_j α_j H^q_j, where α_j = exp(w_4 · H^q_j) / Σ_{j'} exp(w_4 · H^q_{j'}). At time step t ∈ {1, 2, ..., T − 1}, the state is defined by s_t = GRU(s_{t−1}, x_t). Here, x_t is computed from the previous state s_{t−1} and the memory M: x_t = Σ_j β_j M_j, where β = softmax(s_{t−1} W_5 M). Finally, a bilinear function is used to find the begin and end points of the answer span at each reasoning step t ∈ {0, 1, ..., T − 1}:

P^begin_t = softmax(s_t W_6 M) (5)

P^end_t = softmax([s_t; Σ_j P^begin_{t,j} M_j] W_7 M) (6)

where the semicolon ; indicates the concatenation operator, and Σ_j P^begin_{t,j} M_j is a sum of the memory vectors weighted by the current begin distribution P^begin_t.
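A condensed sketch of the answer-module recurrence (Equations 5 and 6) is given below; parameter names follow the text, while the initialization and the exact bilinear forms are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnswerModule(nn.Module):
    """Runs T reasoning steps over the working memory and emits a
    (begin, end) distribution pair at every step."""
    def __init__(self, h: int, T: int = 5):
        super().__init__()
        self.T = T
        self.gru = nn.GRUCell(h, h)
        self.w4 = nn.Parameter(torch.randn(h))          # question summary weights
        self.W5 = nn.Parameter(torch.randn(h, h))       # attention over memory
        self.W6 = nn.Parameter(torch.randn(h, h))       # begin-point bilinear
        self.W7 = nn.Parameter(torch.randn(h, 2 * h))   # end-point bilinear

    def forward(self, H_q, M):
        # H_q: (m, h) question representation; M: (n, h) working memory
        alpha = F.softmax(H_q @ self.w4, dim=0)
        s = alpha @ H_q                                 # initial state s_0
        begins, ends = [], []
        for _ in range(self.T):
            p_begin = F.softmax(M @ (self.W6 @ s), dim=0)               # Eq. (5)
            m_begin = p_begin @ M                       # sum_j P_begin[j] * M_j
            p_end = F.softmax(M @ (self.W7 @ torch.cat([s, m_begin])), dim=0)  # Eq. (6)
            begins.append(p_begin)
            ends.append(p_end)
            beta = F.softmax(M @ (self.W5 @ s), dim=0)  # x_t = sum_j beta_j M_j
            s = self.gru((beta @ M).unsqueeze(0), s.unsqueeze(0)).squeeze(0)
        return begins, ends
```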
From a pair of begin and end points, the answer string can be extracted from the passage. However, rather than output the results (start/end points) from the final step (which is fixed at T − 1 as in Memory Networks, or dynamically determined as in ReasoNet), we utilize all of the T outputs by averaging the scores:

P^begin = avg([P^begin_0, P^begin_1, ..., P^begin_{T−1}]) (7)
P^end = avg([P^end_0, P^end_1, ..., P^end_{T−1}]) (8)

Each P^begin_t or P^end_t is a multinomial distribution over {1, ..., n}, so the average distribution is straightforward to compute.
During training, we apply stochastic dropout before the above averaging operation. For example, as illustrated in Figure 1, we randomly delete several steps' predictions in Equations 7 and 8, so that P^begin might be avg([P^begin_1, P^begin_3]) and P^end might be avg([P^end_0, P^end_3, P^end_4]). During decoding, we average all outputs as shown in Equations 7 and 8. The use of averaged predictions and dropout during training improves robustness.
Our stochastic prediction dropout is similar in motivation to the dropout introduced by Srivastava et al. (2014). The difference is that theirs is dropout at the intermediate node level, whereas ours is dropout at the final layer level. Dropout at the node level prevents correlation between features. Dropout at the final layer level, where randomness is introduced into the averaging of predictions, prevents our model from relying exclusively on a particular step to generate correct output. We used a dropout rate of 0.4 in our experiments.
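A sketch of the stochastic prediction dropout around the averaging of Equations 7 and 8 follows; the independent-Bernoulli sampling is our simplification, and we keep at least one step active, matching the implementation details below:

```python
import random
import torch

def average_predictions(step_preds, dropout_rate=0.4, training=True):
    """step_preds: list of T tensors, each an (n,) multinomial distribution
    (the P_begin_t or P_end_t of one step). During training each step's
    prediction is dropped with probability `dropout_rate`; at decoding time
    all T predictions are averaged."""
    if training:
        kept = [p for p in step_preds if random.random() >= dropout_rate]
        if not kept:                       # ensure at least one active step
            kept = [random.choice(step_preds)]
    else:
        kept = step_preds
    return torch.stack(kept).mean(dim=0)
```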

Experiment Setup
Dataset: We evaluate on the Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016), which contains about 23K passages and 100K questions. The passages come from approximately 500 Wikipedia articles, and the questions and answers are obtained by crowdsourcing. The crowdsourced workers are asked to read a passage (a paragraph), come up with questions, then mark the answer spans. All results are on the official development set, unless otherwise noted.
Two evaluation metrics are used: Exact Match (EM), which measures the percentage of span predictions that match any one of the ground truth answers exactly, and macro-averaged F1 score, which measures the average overlap between the prediction and the ground truth answer. Human performance on the test set is 82.3% EM and 91.2% F1.
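For reference, here is a sketch of the two metrics following the standard SQuAD evaluation conventions, with the answer normalization shown in simplified form:

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, remove punctuation and articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold_answers) -> float:
    return float(any(normalize(prediction) == normalize(g) for g in gold_answers))

def f1(prediction: str, gold: str) -> float:
    pred_toks, gold_toks = normalize(prediction).split(), normalize(gold).split()
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred_toks), overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)
```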
Implementation details: The spaCy tool (https://spacy.io) is used to tokenize both passages and questions and to generate lemma, part-of-speech, and named entity tags. We use a 2-layer BiLSTM with d = 128 hidden units for both passage and question encoding. The mini-batch size is set to 32, and Adamax (Kingma and Ba, 2014) is used as our optimizer. The learning rate is set to 0.2 at first and halved every 10 epochs. We set the dropout rate for the word embeddings, all the hidden units of the LSTMs, and the answer module output layer to 0.4. To prevent degenerate output, we ensure that at least one step in the answer module is active during training.

Results
The main experimental question we would like to answer is whether the stochastic dropout and averaging in the answer module are an effective technique for multi-step reasoning. To do so, we fixed all the lower layers and compared different architectures for the answer module:

1. Standard 1-step: generate the prediction from the initial state s_0.
2. 5-step memory network: a memory network fixed at 5 steps. We try two variants: the standard variant outputs the result from the final step s_{T−1}; the averaged variant outputs the result by averaging across all 5 steps, and is like our proposed SAN but without the stochastic dropout.
3. ReasoNet: this answer module dynamically decides the number of steps and outputs the result conditioned on the final step. (For a fair comparison of answer module designs, this is not an exact re-implementation of Shen et al. (2017): the answer module is the same as in Shen et al. (2017), but the lower layers are set to be the same as those of SAN, the 5-step memory network, and the standard 1-step model, as described in Figure 2.)
4. SAN: proposed answer module that uses stochastic dropout and prediction averaging.
The main results in terms of EM and F1 are shown in Table 1. We observe that SAN achieves the best results in both metrics. We also show the K-best oracle results in Figure 3. The K-best answer spans are computed by ordering the spans according to their probabilities P^begin × P^end. We limit K to 2, 3, or 4 and then pick the span with the best EM or F1 as the oracle. SAN also outperforms the other models in terms of K-best oracle scores by a consistent margin. Impressively, these models achieve or surpass human performance at K = 2 for EM and K = 3 for F1.
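A sketch of the K-best span extraction described above is given below; the maximum answer length is an assumed constraint, and spans are scored by the product of the averaged begin and end probabilities:

```python
import torch

def k_best_spans(p_begin: torch.Tensor, p_end: torch.Tensor,
                 k: int = 4, max_len: int = 30):
    """p_begin, p_end: (n,) averaged distributions over passage positions.
    Returns the k spans (i, j) with j >= i and j - i < max_len that have the
    highest P_begin[i] * P_end[j]; the oracle takes the best EM/F1 among them."""
    n = p_begin.size(0)
    scores = p_begin.unsqueeze(1) * p_end.unsqueeze(0)       # (n, n): score[i, j]
    valid = torch.triu(torch.ones(n, n, dtype=torch.bool))   # j >= i
    valid &= ~torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=max_len)
    scores = scores.masked_fill(~valid, 0.0)
    top = scores.flatten().topk(min(k, n * n))
    return [(idx // n, idx % n) for idx in top.indices.tolist()]
```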
Finally, we report the official SQuAD leaderboard results in Table 2. (For brevity, Table 2 is a partial list consisting of only the top systems; the full list is at https://rajpurkar.github.io/SQuAD-explorer/.) We see that SAN is very competitive: for the single-model case, SAN ranks second in terms of both EM and F1 on the test set; for the ensemble-model case, SAN ranks third. Note that the best-performing model, "BiDAF + Self Attention + ELMo", used large-scale ELMo (Embeddings from Language Models) (Anonymous, 2017), a contextualized language model with two layers of BiLSTM, each with 4096 hidden nodes and 512-dimensional projections, trained on a large-scale textual corpus. This resource gave significant improvements, e.g., +4.3% in terms of dev F1; SAN does not currently employ ELMo, and we expect significant improvements when adopting it in future work. (Also, as pointed out by Furu Wei (personal communication, Nov. 22, 2017), the r-net result dated Nov. 21, 2017 on the SQuAD leaderboard, which is the one we cite in Table 2, was obtained using a new version of r-net with a different network structure and attention strategies from what is documented in Wang et al. (2017).)

Analysis
To understand our proposed model in more detail, we perform several empirical analyses.

How robust are the results?
We are interested in whether the proposed model is sensitive to different random initial conditions. Table 3 shows the development set scores of SAN trained from initialization with different random seeds. We observe that the SAN results are consistent across seeds.

We are also interested in how sensitive the results are to the number of reasoning steps, which is a fixed hyper-parameter. Since we are using dropout, a natural question is whether we can extend the number of steps to an extremely large number. Table 4 shows the development set scores for T = 1 to T = 10. We observe a gradual improvement as we increase T from 1 to 5, but the improvements saturate after 5 steps. In fact, the EM/F1 scores drop slightly, but considering that the random-initialization results in Table 3 show a standard deviation of 0.142 and a spread of 0.426 (for EM), we believe that the T = 10 result does not differ statistically from the T = 5 result. In summary, we think it is useful to perform some approximate hyper-parameter tuning for the number of steps, but it is not necessary to find the exact optimal value.

Is it possible to use different numbers of steps in test vs. train?
For practical deployment scenarios, prediction speed at test time is an important criterion. Therefore, one question is whether SAN can train with, e.g., T = 5 steps but test with T = 1 step. Table 5 shows the results of a SAN trained with T = 5 steps but tested with different numbers of steps. As expected, the results are best when T matches between training and test; however, it is important to note that small numbers of steps (T = 1 and T = 2) nevertheless achieve strong results. For example, prediction at T = 1 achieves 75.582 EM, which outperforms a standard 1-step model (75.139 EM) with approximately equivalent prediction time.

How does the training time compare?
Does the stochastic dropout lead to longer training times? The average training time per epoch is comparable: our implementation running on a GTX Titan X takes 22 minutes per epoch for the 5-step memory network, 30 minutes for ReasoNet, and 24 minutes for SAN. The learning curves are shown in Figure 4. We observe that all systems improve at approximately the same rate up to 10 or 15 epochs; however, SAN continues to improve afterwards, while the other models start to saturate. This observation is consistent with previous work using dropout (Srivastava et al., 2014). We believe that while training time per epoch is comparable across models, SAN benefits from training for more epochs.

How does SAN perform by question type?
We divided the development set by question type, based on the respective Wh-word, such as "who" and "where". We are interested to see whether SAN performs well on particular types of questions; the score breakdown is shown in Figure 5.

Related Work
The recent rapid progress on MRC is largely due to the availability of large-scale datasets (Rajpurkar et al., 2016; Nguyen et al., 2016; Richardson et al., 2013; Hill et al., 2016), which make it possible to train large end-to-end neural network models. In spite of the variety of model structures and attention types (Bahdanau et al., 2015; Chen et al., 2016; Xiong et al., 2016; Seo et al., 2016; Shen et al., 2017; Wang et al., 2017), a typical neural network MRC model first maps the symbolic representations of the documents and questions into a neural space, then searches for answers on top of it. We categorize these models into two large groups based on their answer modules: single-step and multi-step reasoning. The key difference between the two is the strategy applied to search for the final answers in the neural space.
A single-step model matches the question and document only once and produces the final answers. It is simple yet efficient, and can be trained using the classical back-propagation algorithm, so it is adopted by most systems (Chen et al., 2016; Seo et al., 2016; Wang et al., 2017; Liu et al., 2017; Chen et al., 2017; Weissenborn et al., 2017; Hu et al., 2017). However, since humans often solve question answering tasks by re-reading and re-digesting the document multiple times before reaching the final answer (possibly depending on the complexity of the question/document), it is natural to devise an iterative way to find answers through multi-step reasoning.
Pioneered by (Hill et al., 2016; Dhingra et al., 2016; Sordoni et al., 2016; Kumar et al., 2015), which used a predetermined fixed number of reasoning steps, Shen et al. (2016, 2017) showed that multi-step reasoning outperforms single-step reasoning, and that dynamic multi-step reasoning further outperforms fixed multi-step reasoning, on two distinct MRC datasets (SQuAD and MS MARCO). However, these models have to be trained with reinforcement learning methods, e.g., policy gradient, which are tricky to implement due to instability issues. Our model differs in that we fix the number of reasoning steps but perform stochastic dropout to prevent step bias. Moreover, our model can be trained with the back-propagation algorithm, making it as simple and efficient as single-step reasoning.

Conclusion
We introduce the Stochastic Answer Network (SAN), a simple yet robust model for machine reading comprehension. The use of stochastic dropout in training and of prediction averaging at test time in the answer module leads to robust improvements on SQuAD, outperforming both the fixed-step memory network and the dynamic-step ReasoNet. We further analyze the properties of SAN empirically and in detail. At the time of writing, the model outperforms all published methods and achieves results competitive with the state-of-the-art on the SQuAD leaderboard. Given the strong connections between the proposed model, memory networks, and ReasoNet, we would like to investigate the theoretical links between these models and their training algorithms. We would also like to explore SAN on other tasks, such as text classification and natural language inference, to study its generalization.
Figure 3: K-best oracle results. (a) EM comparison of different systems. (b) F1 score comparison of different systems.

Figure 4: Learning curve measured on the Dev set.

Figure 5: Score breakdown by question/query type.

Table 1: Main results, comparing different answer module architectures. Note that SAN performs best in both Exact Match and F1 metrics.

Answer module                                                            EM      F1
Standard 1-step                                                          75.139  n/a
Fixed 5-step with Memory Network (prediction from final step)            75.033  83.327
Fixed 5-step with Memory Network (prediction averaged from all steps)    75.256  83.215
Dynamic steps (max 5) with ReasoNet                                      75.355  83.360
Stochastic Answer Network (SAN), fixed 5-step                            76.235  84.056

Table 2: Official SQuAD leaderboard performance on December 5, 2017. An asterisk (*) denotes unpublished work. Results are sorted by Test F1. Due to the long list, we show only the top systems.

Table 3: Robustness to different random seeds for initialization; best and worst scores are boldfaced.

Table 4: Effect of the number of steps; best and worst results are boldfaced.

Table 5: Prediction with different numbers of steps T. Note that the SAN model is trained using 5 steps.