Globally Normalized Reader

Rapid progress has been made towards question answering (QA) systems that can extract answers from text. Existing neural approaches make use of expensive bi-directional attention mechanisms or score all possible answer spans, limiting scalability. We propose instead to cast extractive QA as an iterative search problem: select the answer's sentence, start word, and end word. This representation reduces the space of each search step and allows computation to be conditionally allocated to promising search paths. We show that globally normalizing the decision process and back-propagating through beam search makes this representation viable and learning efficient. We empirically demonstrate the benefits of this approach using our model, the Globally Normalized Reader (GNR), which achieves the second highest single-model performance on the Stanford Question Answering Dataset (68.4 EM, 76.21 F1 dev) and is 24.7x faster than Bi-Directional Attention Flow. We also introduce a data augmentation method that produces semantically valid examples by aligning named entities to a knowledge base and swapping them with new entities of the same type. This method improves the performance of all models considered in this work and is of independent interest for a variety of NLP tasks.


Introduction
Figure 1: The GNR answering the question "Who was first to recognize that the Analytical Engine had applications beyond pure calculation?" against the passage "Ada Lovelace was known for her work on Charles Babbage's Analytical Engine. She was the first to recognize that the machine had applications beyond calculation." The model first picks a sentence, then a start word, then an end word. Probabilities are global and normalized over the beam. The model initially picks the wrong sentence, but global normalization lets it recover: the final prediction's probability (0.64) exceeds that of the sentence pick (0.49), whereas with local normalization each probability is upper-bounded by the previous step's.

Question answering (QA) and information extraction systems have proven to be invaluable in a wide variety of applications such as medical information collection on drugs and genes (Quirk and Poon, 2016), large scale health impact studies (Althoff et al., 2016), or educational material development (Koedinger et al., 2015). Recent progress in neural-network based extractive question answering models is quickly closing the gap with human performance on several benchmark QA tasks such as SQuAD (Rajpurkar et al., 2016), MS MARCO (Nguyen et al., 2016), or NewsQA (Trischler et al., 2016a). However, current approaches to extractive question answering face several limitations:

1. Computation is allocated equally to the entire document, regardless of answer location, with no ability to ignore or focus computation on specific parts. This limits applicability to longer documents.
2. They rely extensively on expensive bi-directional attention mechanisms (Seo et al., 2016) or must rank all possible answer spans (Lee et al., 2016).
3. While data augmentation for question answering has been proposed, current approaches still do not provide training data that can improve the performance of existing systems.
In this paper we demonstrate a methodology for addressing these three limitations, and make the following claims:

1. Extractive question answering can be cast as a nested search process, where sentences provide a powerful document decomposition and an easy-to-learn search step. This factorization enables conditional computation to be allocated to sentences and spans likely to contain the right answer.
2. When cast as a search process, models without bi-directional attention mechanisms and without ranking all possible answer spans can achieve near state of the art results on extractive question answering.
3. Preserving narrative structure and explicitly incorporating type and question information into synthetic data generation is key to generating examples that actually improve the performance of question answering systems.
Our claims are supported by experiments on the SQuAD dataset, where we show that the Globally Normalized Reader (GNR), a model that performs an iterative search process through a document (shown visually in Figure 1) and conditionally allocates computation based on the search process, achieves near state-of-the-art Exact Match (EM) and F1 scores without resorting to more expensive attention or to ranking all possible spans. Furthermore, we demonstrate that Type Swaps, a type-aware data augmentation strategy that aligns named entities with a knowledge base and swaps them out for new entities that share the same type, improves the performance of all models on extractive question answering.
We structure the paper as follows: in Section 2 we introduce the task and our model. Section 3 describes our data-augmentation strategy. Section 4 introduces our experiments and results. In Section 5 we discuss our findings. In Section 6 we relate our work to existing approaches. Conclusions and directions for future work are given in Section 7.

Model
Given a document d and a question q, we pose extractive question answering as a search problem: first we select the sentence, then the first word of the span, and finally the last word of the span. An example of the model's output is shown in Figure 1, and the network architecture is depicted in Figure 2.
More formally, let d_1, ..., d_n denote the sentences in the document, and for each sentence d_i, let d_{i,1}, ..., d_{i,m_i} denote the word vectors corresponding to the words in the sentence. Similarly, let q_1, ..., q_ℓ denote the word vectors corresponding to the words in the question. An answer is a tuple a = (i*, j*, k*) indicating the correct sentence i*, the start word in the sentence j*, and the end word in the sentence k*. Let A(d) denote the set of valid answer tuples for document d. We now describe each stage of the model in turn.
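For illustration, the answer space A(d) can be enumerated directly from the sentence lengths. The sketch below is not part of the model, which never materializes the full space, but it makes the size of the search space concrete: a sentence of m_i words contributes m_i(m_i + 1)/2 valid spans.

```python
def valid_answers(sentence_lengths):
    """Enumerate every valid answer tuple (i, j, k): sentence index i,
    start word j, and end word k >= j within that sentence."""
    answers = []
    for i, m in enumerate(sentence_lengths):
        for j in range(m):
            for k in range(j, m):
                answers.append((i, j, k))
    return answers

# A document with sentences of 3 and 2 words has 6 + 3 = 9 valid spans.
spans = valid_answers([3, 2])
```

Searching sentence-first, then start, then end explores this set lazily rather than scoring all O(N^2) spans up front.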
Question Encoding

Each question word is mapped to its word vector, and the sequence q_1, ..., q_ℓ is passed through a stack of bidirectional LSTMs (Bi-LSTMs), producing hidden states (h^fwd_1, h^bwd_1), ..., (h^fwd_ℓ, h^bwd_ℓ). Following Lee et al. (2016), these hidden states are used to compute a passage-independent question embedding, q_indep. Formally,

  s_j = w_q · MLP([h^bwd_j; h^fwd_j]),
  α_j = softmax(s)_j,
  q_indep = Σ_j α_j [h^bwd_j; h^fwd_j],

where w_q is a trainable embedding vector, and MLP is a two-layer neural network with a ReLU non-linearity. The question is represented by concatenating the final hidden states of the forward and backward LSTMs and the passage-independent embedding: q = [h^bwd_1; h^fwd_ℓ; q_indep].
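A sketch of this computation in NumPy, with a generic `mlp` callable standing in for the two-layer ReLU network (an assumption for brevity; the exact parameterization follows Lee et al. (2016)):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def question_indep_embedding(H, mlp, w_q):
    """Passage-independent question embedding: score each question hidden
    state (rows of H) by applying w_q to an MLP of the state, then return
    the softmax-weighted sum of the hidden states."""
    s = np.array([w_q @ mlp(h) for h in H])  # unnormalized scores s_j
    alpha = softmax(s)                       # attention over question words
    return alpha @ H                         # q_indep
```

Because the weights are a softmax, q_indep is a convex combination of the question hidden states.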

Question-Aware Document Encoding
Conditioned on the question vector, we compute a representation of each document word that is sensitive to both its surrounding context and the question. Specifically, each word in the document is represented as the concatenation of its word vector d_{i,j}, the question vector q, boolean features indicating whether the word appears in the question or is repeated, and a question-aligned embedding from Lee et al. (2016). The question-aligned embedding q^align_{i,j} is given by

  s_{i,j,k} = MLP(d_{i,j}) · MLP(q_k),
  α_{i,j,k} = softmax(s_{i,j})_k,
  q^align_{i,j} = Σ_k α_{i,j,k} q_k.

The document is encoded by a separate stack of Bi-LSTMs, producing a sequence of hidden states (h^fwd_{1,1}, h^bwd_{1,1}), ..., (h^fwd_{n,m_n}, h^bwd_{n,m_n}).
The search procedure then operates on these hidden states.
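The alignment step above can be sketched as follows (NumPy; `mlp_d` and `mlp_q` are hypothetical callables standing in for the learned transformations):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def question_aligned_embedding(d_ij, Q, mlp_d, mlp_q):
    """Question-aligned embedding (after Lee et al., 2016): score the
    document word vector d_ij against every question word vector (rows
    of Q), then return the softmax-weighted average of the question
    word vectors."""
    s = np.array([mlp_d(d_ij) @ mlp_q(q_k) for q_k in Q])  # s_{i,j,k}
    return softmax(s) @ Q                                  # q_align_{i,j}
```

With identity transformations, a document word that closely matches one question word places almost all of its attention mass on that word.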

Answer Selection
Sentence selection. The first stage of our search process picks the sentence that contains the answer span. Each sentence d_i is represented by the hidden states of its first and last word from the backward and forward LSTM respectively, [h^bwd_{i,1}; h^fwd_{i,m_i}], and is scored by passing this representation through a fully connected layer that outputs the unnormalized sentence score for sentence d_i, denoted φ_sent(d_i).

Span start selection. After selecting a sentence d_i, we pick the start of the answer span within it. Each potential start word d_{i,j} is represented by its document encoding [h^fwd_{i,j}; h^bwd_{i,j}] and is scored by passing this encoding through a fully connected layer that outputs the unnormalized start word score for word j in sentence i, denoted φ_sw(d_{i,j}).

Span end selection. Conditioned on the sentence d_i and start word d_{i,j}, we select the end word from the remaining words in the sentence, d_{i,j}, ..., d_{i,m_i}. To do this, we run a Bi-LSTM over the remaining document hidden states and pass each of its outputs through a fully connected layer that outputs the unnormalized end word score for word k in sentence i with start word j, denoted φ_ew(d_{i,j:k}).
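The three stages compose into a beam search over partial answers. A minimal sketch in plain Python, where the φ arrays are hypothetical stand-ins for the fully connected layer outputs:

```python
def staged_beam_search(phi_sent, phi_sw, phi_ew, beam_size=2):
    """Search sentence -> start word -> end word, keeping the top
    partial paths by summed unnormalized score after each stage.
    phi_sent[i] scores sentence i, phi_sw[i][j] scores start word j,
    and phi_ew[i][j][t] scores end word k = j + t in sentence i."""
    # Stage 1: sentence selection.
    beam = sorted(((s, (i,)) for i, s in enumerate(phi_sent)),
                  reverse=True)[:beam_size]
    # Stage 2: start word selection within each surviving sentence.
    beam = sorted(((s + phi_sw[i][j], (i, j))
                   for s, (i,) in beam
                   for j in range(len(phi_sw[i]))),
                  reverse=True)[:beam_size]
    # Stage 3: end word selection among the remaining words.
    beam = sorted(((s + phi_ew[i][j][t], (i, j, j + t))
                   for s, (i, j) in beam
                   for t in range(len(phi_ew[i][j]))),
                  reverse=True)[:beam_size]
    return beam
```

Because partial paths compete at every stage, computation (such as the end-word Bi-LSTM) is only spent on the sentences and start words that survive the beam.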

Global Normalization
The scores for each stage of our model can be normalized at the local or global level. Previous work demonstrated that locally normalized models have a weak ability to correct mistakes made in previous decisions, while globally normalized models are strictly more expressive than locally normalized models (Andor et al., 2016; Zhou et al., 2015; Collins and Roark, 2004).
In a locally normalized model each decision is made conditional on the previous decisions. The probability of an answer a = (i, j, k) is decomposed as

  p(a | d, q) = p_sent(i | d, q) · p_sw(j | i, d, q) · p_ew(k | i, j, d, q).

Each sub-decision is locally normalized by applying a softmax to the relevant selection scores, e.g.

  p_sent(i | d, q) = exp(φ_sent(d_i)) / Σ_{i'} exp(φ_sent(d_{i'})),

and similarly for p_sw and p_ew.

To allow our model to recover from incorrect sentence or start word selections, we instead globally normalize the scores from each stage of our procedure. In a globally normalized model, we define

  score(a, d, q) = φ_sent(d_i) + φ_sw(d_{i,j}) + φ_ew(d_{i,j:k}).

Then, we model

  p(a | d, q) = exp(score(a, d, q)) / Z,

where Z is the partition function

  Z = Σ_{a' ∈ A(d)} exp(score(a', d, q)).

In contrast to locally normalized models, the model is normalized over all possible search paths instead of normalizing each step of the search procedure. At inference time, the problem is to find

  a* = argmax_{a ∈ A(d)} score(a, d, q),

which can be approximately computed using beam search.
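A toy numerical contrast (all scores are made-up numbers): a locally normalized path probability is bounded by the probability of every prefix, while global normalization applies one softmax over complete candidate paths, which is exactly the recovery behavior illustrated in Figure 1.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Locally normalized: per-step softmaxes multiply, so the probability
# of a path can only shrink as decisions accumulate.
p_sentence = softmax(np.array([0.2, 0.1]))   # sentence choice
p_start = softmax(np.array([1.5, -0.5]))     # start word choice
p_local = p_sentence[0] * p_start[0]
assert p_local <= p_sentence[0]

# Globally normalized: one softmax over complete paths, so a full
# answer's probability is not bounded by any prefix probability.
path_scores = np.array([2.0, 0.3, -0.1])     # score(a, d, q) per path
p_global = softmax(path_scores)
```

Here a single strong complete path can claim most of the global probability mass even if an early decision looked weak in isolation.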

Objective and Training
We minimize the negative log-likelihood on the training set using stochastic gradient descent. For a single example (a, d, q), the negative log-likelihood

  −score(a, d, q) + log Z

requires an expensive summation to compute log Z. Instead, to keep learning efficient, we use beam search during training and early updates (Andor et al., 2016; Zhou et al., 2015; Collins and Roark, 2004). Concretely, we approximate Z by summing only over candidates on the final beam B:

  Z ≈ Σ_{a' ∈ B} exp(score(a', d, q)).

At training time, if the gold sequence falls off the beam at step t during decoding, a stochastic gradient step is performed on the partial objective computed through step t and normalized over the beam at step t.
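Assuming the gold answer is on the final beam, the approximate objective can be sketched as follows; the per-candidate scores are the summed φ terms defined above, and the log-sum-exp trick keeps the computation numerically stable:

```python
import math

def beam_nll(gold_score, beam_scores):
    """Negative log-likelihood with the partition function Z approximated
    by summing exp(score) over the final beam only. Assumes the gold
    answer's score is included in beam_scores."""
    m = max(beam_scores)  # log-sum-exp trick for stability
    log_z = m + math.log(sum(math.exp(s - m) for s in beam_scores))
    return -gold_score + log_z
```

For example, a gold score of 2.0 on a beam with scores [2.0, 1.0] gives a loss of log(1 + e^-1) ≈ 0.313; the early-update rule applies the same loss to the partial scores through the step where the gold answer fell off the beam.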

Implementation
Our best performing model uses a stack of 3 Bi-LSTMs for the question and document encodings, and a single Bi-LSTM for the end of span prediction. The hidden dimension of all recurrent layers is 200.
We use the 300-dimensional 8.4B-token Common Crawl GloVe vectors (Pennington et al., 2014). Words missing from the Common Crawl vocabulary are set to zero. In our experiments, all architectures considered have sufficient capacity to overfit the training set. We regularize the models by fixing the word embeddings throughout training, dropping out the inputs of the Bi-LSTMs with probability 0.3 and the inputs to the fully connected layers with probability 0.4 (Srivastava et al., 2014), and adding Gaussian noise to the recurrent weights with σ = 10^-6. Our models are trained using Adam with a learning rate of 0.0005, β_1 = 0.9, β_2 = 0.999, ε = 10^-8, and a batch size of 32 (Kingma and Ba, 2014).
All our experiments are implemented in Tensorflow (Abadi et al., 2016), and we tokenize using Ciseau (Raiman, 2017). Despite performing beam search during training, our model trains to convergence in under 4 hours through the use of efficient LSTM primitives in CuDNN (Chetlur et al., 2014) and by batching our computation over examples and search beams. We release our code and augmented dataset.

Our implementation of the GNR is 24.7 times faster at inference time than the official Bi-Directional Attention Flow implementation (https://github.com/allenai/bi-att-flow). Specifically, on a machine running Ubuntu 14 with 40 Intel Xeon 2.6GHz processors, 386GB of RAM, and a 12GB TitanX-Maxwell GPU, the GNR with beam size 32 and batch size 32 requires 51.58 ± 0.266 seconds (mean ± std, averaged over 5 runs) to process the SQuAD validation set. By contrast, the Bi-Directional Attention Flow model with batch size 32 requires 1260.23 ± 17.26 seconds. We attribute this speedup to avoiding expensive bi-directional attention mechanisms and to making computation conditional on the search beams.

Type Swaps

In extractive question answering, the set of possible answer spans can be pruned by only keeping answers whose nature (person, object, place, date, etc.) agrees with the question type (Who, What, Where, When, etc.). While this heuristic helps human readers filter out irrelevant parts of a document when searching for information, no explicit supervision of this kind is present in the dataset. Despite this absence, the distribution of question representations learned by our models appears to utilize this heuristic. The final hidden states of the
question-encoding LSTMs naturally cluster based on question type (Table 1). In other words, the task induces a question encoding that superficially respects type information. This property is a double-edged sword: it allows the model to easily weed out answers that are inapplicable, but also leads it astray by selecting a text span that shares the answer's type but has the wrong underlying entity. A similar observation was made in the error analysis of Weissenborn et al. (2017). We propose Type Swaps, an augmentation strategy that leverages this emergent behavior in order to improve the model's ability to prune wrong answers and to make it more robust to surface form variation. This strategy has three steps:

1. Locate named entities in the document and question.

2. Collect surface variants for each entity type.

3. Generate new document-question-answer examples by swapping each named entity in an original triplet with a surface variant that shares the same type from the collection.

Figure 4: Number of surface variations per type (18,735 for the largest type; date: 3,127; number: 2,868; organization: 825; country: 381; monarchy: 21; commercial building: 8; international conference: 1).

Since we work with documents where we can safely assume that the majority of the entities will be contained in a large knowledge base (KB) such as Wikidata (Vrandečić and Krötzsch, 2014), we find that simple string matching techniques are sufficiently accurate. Specifically, we use a part-of-speech tagger (Honnibal, 2017) to extract nominal groups in the training data and string-match them with entities in Wikidata. Using this technique, we are able to extract 47,598 entities in SQuAD that fall under 6,380 Wikidata instance-of types. Additionally, we assign "number types" (e.g. year, day of the week, distance, etc.) to nominal groups that contain dates, numbers, or quantities. These extraction steps produce 84,632 unique surface variants (on average 16.93 per type), with the majority of the variation found in humans, numbers, or organizations, as visible in Figure 4. With this method we can generate 2.92 · 10^369 unique documents (an average of 3.36 · 10^364 new documents for each original document). To ensure there is sufficient variation in the generated documents, we sample from this set and only keep variations where the question or answer is mutated. At each training epoch, we train on T Type Swap examples and the full original training data. An example output of the method is shown in Figure 3.
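The swap step of the strategy can be sketched as follows (all names are hypothetical; a real implementation would draw `variants_by_type` from the Wikidata-derived collection described above):

```python
import random

def type_swap(tokens, entity_spans, variants_by_type, seed=0):
    """Type Swaps sketch: replace each located named entity with a random
    surface variant of the same type, leaving the rest of the narrative
    intact. entity_spans maps (start, end) token spans to entity types;
    variants_by_type maps each type to a list of variant token lists."""
    rng = random.Random(seed)
    out, cursor = [], 0
    for (start, end), etype in sorted(entity_spans.items()):
        out.extend(tokens[cursor:start])                 # copy unchanged text
        out.extend(rng.choice(variants_by_type[etype]))  # swap in a variant
        cursor = end
    out.extend(tokens[cursor:])
    return out
```

Applying the same swap to the document, question, and answer keeps the triplet consistent while mutating the surface form.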

Results
We evaluate our model on the 100,000 example SQuAD dataset (Rajpurkar et al., 2016) and perform several ablations to evaluate the relative importance of the proposed methods.

Learning to Search
In our first experiment, we aim to quantify the importance of global normalization for the learning and search process. We use T = 10^4 Type Swap samples and vary the beam width B between 1 and 32 for locally and globally normalized models, and we summarize the Exact Match and F1 scores between each model's predicted answer and the ground truth, computed using the evaluation scripts from Rajpurkar et al. (2016) (Table 3). We additionally report the Sentence score, which measures how often the predicted answer came from the correct sentence and thus indicates where mistakes are made during prediction.

Type Swaps
In our second experiment, we evaluate the impact of the amount of augmented data on the performance of our model. In this experiment, we use the best beam sizes for each model (B = 10 for local and B = 32 for global) and vary the augmentation from T = 0 (no augmentation) to T = 5 · 10^4. The results of this experiment are summarized in Table 4. We observe that both models improve in performance with T > 0, and that performance degrades past T = 10^4. Moreover, data augmentation and global normalization are complementary: combined, we obtain a 1.6 EM and 2.0 F1 improvement over the locally normalized baseline.

Table 2: Single-model comparison on SQuAD (EM / F1).
Human (Rajpurkar et al., 2016): 80.3 / 90.5
Sliding Window (Rajpurkar et al., 2016): 13.3 / 20.2
Match-LSTM (Wang and Jiang, 2016): 64.1 / 73.9
DCN (Xiong et al., 2016): 65.4 / 75.6
Rasor (Lee et al., 2016): 66.4 / 74.9
Bi-Attention Flow (Seo et al., 2016): 67.7 / 77.3
R-Net (Wang et al., 2017): 72
We also verify that the effects of Type Swaps are not limited to our specific model by observing the impact of augmented data on the DCN+ (Xiong et al., 2016). We find that it strongly reduces generalization error and helps improve F1, with potential further improvements coming from reducing other forms of regularization (Table 5). (The DCN+ is the DCN with additional hyperparameter tuning by the same authors, as submitted on the SQuAD leaderboard: https://rajpurkar.github.io/SQuAD-explorer/.)

Discussion
In this section we will discuss the results presented in Section 4, and explain how they relate to our main claims.

Extractive Question Answering as a Search Problem
Sentences provide a natural and powerful document decomposition for search that can be easily learnt as a search step: for all the models and configurations considered, the Sentence score was above 88% correct (Table 3). Thus, sentence selection is the easy part of the problem, and the model can allocate more computation (such as the end-word selection Bi-LSTM) to spans likely to contain the answer. This approach avoids wasteful work on unpromising spans and is important for further scaling these methods to long documents.

Global Normalization
The Globally Normalized Reader outperforms previous approaches and achieves the second highest EM, behind R-Net (Wang et al., 2017), without using bi-directional attention and while only scoring the spans in its final beam. Increasing the beam width improves the results for both locally and globally normalized models (Table 3), suggesting that search errors account for a significant portion of the performance difference between models. Models such as Lee et al. (2016) and Wang and Jiang (2016) overcome this difficulty by ranking all possible spans and thus never skipping a possible answer. Even with large beam sizes, the locally normalized model underperforms these approaches. However, by increasing model flexibility and performing search during training, the globally normalized model is able to recover from search errors and achieve much of the benefit of scoring all possible spans.

Type-Aware Data Augmentation
Type Swaps, our data augmentation strategy, offers a way to incorporate the nature of the question and the types of named entities in the answers into the learning process of our model, and to reduce its sensitivity to surface variation. Existing neural-network approaches to extractive QA have so far ignored this information. Augmenting the dataset with additional type-sensitive synthetic examples improves performance by providing better coverage of different answer types. Growing the number of augmented samples used improves the performance of all models under study (Table 4), particularly for globally normalized models. Past a certain amount of augmentation, we observe performance degradation. This suggests that despite efforts to closely mimic the original training set, there is a train-test mismatch or excess duplication in the generated examples. Because our experiments are conducted on two vastly different architectures, we expect these benefits to carry over to other models (Weissenborn et al., 2017; Seo et al., 2016; Wang et al., 2017), and perhaps more broadly to other natural language tasks that contain named entities and have limited supervised data.

Related Work
Our work is closely related to existing approaches in learning to search, extractive question answering, and data augmentation for NLP tasks.
Learning to Search. Several approaches to learning to search have been proposed for various NLP tasks and conditional computation. Most recently, Andor et al. (2016) and Zhou et al. (2015) demonstrated the effectiveness of globally normalized networks and training with beam search for part of speech tagging and transition-based dependency parsing, while Wiseman and Rush (2016) showed that these techniques could also be applied to sequence-to-sequence models in several application areas including machine translation. These works focus on parsing and sequence prediction tasks and have a fixed computation regardless of the search path, while we show that the same techniques can also be straightforwardly applied to question answering and extended to allow for conditional computation based on the search path.
Learning to search has also been used in the context of modular neural networks with conditional computation in the work of Andreas et al. (2016) for image captioning. In their work, reinforcement learning was used to learn how to turn computation on and off, while we find that conditional computation can be easily learnt with maximum likelihood and the help of early updates (Andor et al., 2016; Zhou et al., 2015; Collins and Roark, 2004) to guide the training process.
Our framework for conditional computation, whereby the search space is pruned by a sequence of increasingly complex models, is broadly reminiscent of the structured prediction cascades of Weiss and Taskar (2010). Trischler et al. (2016b) also explored this approach in the context of question answering.
Extractive Question Answering. Since the introduction of the SQuAD dataset, numerous systems have achieved strong results. Seo et al. (2016), Wang et al. (2017), and Xiong et al. (2016) make use of bi-directional attention mechanisms, whereas the GNR is more lightweight and achieves similar results without this type of attention mechanism. The document representation used by the GNR is very similar to Lee et al. (2016). However, both Lee et al. (2016) and Wang and Jiang (2016) must score all O(N^2) possible answer spans, making training and inference expensive. The GNR avoids this complexity by learning to search during training and outperforms both systems while scoring only O(B) spans. Weissenborn et al. (2017) propose a locally normalized model that first predicts the start and then the end word of each span. Our experiments lead us to believe that further factorizing the problem and using global normalization along with our data augmentation would yield corresponding improvements.
Data augmentation. Several works use data augmentation to control the generalization error of deep learning models. Zhang and LeCun (2015) use a thesaurus to generate new training examples based on synonyms. Vijayaraghavan et al. (2016) employs a similar method, but uses Word2vec and cosine similarity to find similar words. Jia and Liang (2016) use a high-precision synchronous context-free grammar to generate new semantic parsing examples. Our data augmentation technique, Type Swaps, is unique in that it leverages an external knowledge-base to provide new examples that have more variation and finer-grained changes than methods that use only a thesaurus or Word2Vec, while also keeping the narrative and grammatical structure intact.
More recently, sequence-to-sequence models have been proposed to generate diverse and realistic training question-answer pairs on SQuAD. Similar to that approach, our technique makes use of existing examples to produce new examples that are fluent; however, we are also able to explicitly incorporate entity type information into the generation process and use the generated data to improve the performance of question answering models.

Conclusions and Future Work
In this work, we provide a methodology that overcomes several limitations of existing approaches to extractive question answering. In particular, our proposed model, the Globally Normalized Reader, reduces the computational complexity of previous models by casting question answering as a search problem and allocating more computation to promising answer spans. Empirically, we find that this approach, combined with global normalization and beam search during training, leads to near state-of-the-art results. Furthermore, we find that a type-aware data augmentation strategy improves the performance of all models under study on the SQuAD dataset. The method is general, requiring only that the training data contain named entities from a large KB. We expect it to be applicable to other NLP tasks that would benefit from more training data.
As future work, we plan to apply the GNR to other question answering datasets such as MS MARCO (Nguyen et al., 2016) or NewsQA (Trischler et al., 2016a), and to investigate the applicability and benefits of Type Swaps for other tasks like named entity recognition, entity linking, machine translation, and summarization. Finally, we believe there is a broad range of structured prediction problems (code generation; generative models for images, audio, or video) where the size of the original search space makes current techniques intractable, but which, if cast as learning-to-search problems with conditional computation, might be within reach.