Contextualized Word Representations for Reading Comprehension

Reading a document and extracting an answer to a question about its content has attracted substantial attention recently. While most work has focused on the interaction between the question and the document, in this work we evaluate the importance of context when the question and document are processed independently. We take a standard neural architecture for this task, and show that by providing rich contextualized word representations from a large pre-trained language model as well as allowing the model to choose between context-dependent and context-independent word representations, we can obtain dramatic improvements and reach performance comparable to state-of-the-art on the competitive SQuAD dataset.


Introduction
Reading comprehension (RC) is a high-level task in natural language understanding that requires reading a document and answering questions about its content. RC has attracted substantial attention over the last few years with the advent of large annotated datasets (Hermann et al., 2015; Rajpurkar et al., 2016; Trischler et al., 2016; Nguyen et al., 2016; Joshi et al., 2017), computing resources, and neural network models and optimization procedures (Sukhbaatar et al., 2015).
Reading comprehension models must invariably represent word tokens contextually, as a function of their encompassing sequence (document or question). The vast majority of RC systems encode contextualized representations of words in both the document and question as hidden states of bidirectional RNNs (Hochreiter and Schmidhuber, 1997; Schuster and Paliwal, 1997; Cho et al., 2014), and focus model design and capacity around question-document interaction, carrying out calculations where information from both is available (Seo et al., 2016; Xiong et al., 2017b). Analysis of current RC models has shown that models tend to react to simple word-matching between the question and document (Jia and Liang, 2017), and that they benefit from matching information being provided explicitly in model inputs (Weissenborn et al., 2017). In this work, we hypothesize that the still relatively small size of RC datasets drives this behavior, which leads to models that make limited use of context when representing word tokens.
To illustrate this idea, we take a model that carries out only basic question-document interaction and prepend to it a module that produces token embeddings by explicitly gating between contextual and non-contextual representations (for both the document and question). This simple addition already places the model's performance on par with recent work, and allows us to demonstrate the importance of context. Motivated by these findings, we turn to a semi-supervised setting in which we leverage a language model, pre-trained on large amounts of data, as a sequence encoder which forcibly facilitates context utilization. We find that model performance substantially improves, reaching accuracy comparable to state-of-the-art on the competitive SQuAD dataset, showing that contextual word representations captured by the language model are beneficial for reading comprehension.

Contextualized Word Representations
Problem definition We consider the task of extractive reading comprehension: given a paragraph of text $p = (p_1, \ldots, p_n)$ and a question $q = (q_1, \ldots, q_m)$, an answer span $(p_l, \ldots, p_r)$ is to be extracted, i.e., a pair of indices $1 \le l \le r \le n$ into $p$ is to be predicted.
When encoding a word token in its encompassing sequence (question or passage), we are interested in allowing extra computation over the sequence and evaluating the extent to which context is utilized in the resultant representation.
To that end, we employ a re-embedding component in which a contextual and a non-contextual representation are explicitly combined per token. Specifically, for a sequence of word-embeddings $w_1, \ldots, w_k$ with $w_t \in \mathbb{R}^{d_w}$, the re-embedding $w'_t$ of the t-th token is the result of a Highway layer (Srivastava et al., 2015) and is defined as:

$$
\begin{aligned}
w'_t &= g_t \odot x_t + (1 - g_t) \odot z_t \\
g_t &= \sigma(W_g x_t + U_g u_t) \\
z_t &= \tanh(W_z x_t + U_z u_t)
\end{aligned}
$$

where $x_t$ is a function strictly of the word-type of the t-th token, $u_t$ is a function of the enclosing sequence, $W_g, W_z, U_g, U_z$ are parameter matrices, and $\odot$ is the element-wise product operator. We set $x_t = [w_t; c_t]$, a concatenation of $w_t$ with $c_t \in \mathbb{R}^{d_c}$, where the latter is a character-based representation of the token's word-type produced via a CNN over character embeddings (Kim, 2014). We note that the word-embeddings $w_t$ are pre-trained (Pennington et al., 2014) and are kept fixed during training, as is commonly done in order to reduce model capacity and mitigate overfitting. We next describe different formulations for the contextual term $u_t$.
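The gating described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it assumes the standard Highway form (a sigmoid gate $g_t$ and a tanh candidate $z_t$), and the helper names `matvec`, `sigmoid` and `reembed` are hypothetical.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def reembed(x_t, u_t, W_g, U_g, W_z, U_z):
    """Highway-style re-embedding of one token (assumed form).

    x_t: non-contextual input (word-type only)
    u_t: contextual input (function of the whole sequence)
    Returns the re-embedded vector and the gate g_t.
    """
    g_pre = [a + b for a, b in zip(matvec(W_g, x_t), matvec(U_g, u_t))]
    z_pre = [a + b for a, b in zip(matvec(W_z, x_t), matvec(U_z, u_t))]
    g_t = [sigmoid(a) for a in g_pre]    # gate: weight on the non-contextual x_t
    z_t = [math.tanh(a) for a in z_pre]  # candidate that also sees the context
    w_new = [g * x + (1.0 - g) * z for g, x, z in zip(g_t, x_t, z_t)]
    return w_new, g_t

# Toy usage: with all-zero parameters the gate is 0.5 and the candidate is 0,
# so the re-embedding is simply half of x_t.
zeros_xx = [[0.0, 0.0], [0.0, 0.0]]            # maps x_t (dim 2) to dim 2
zeros_xu = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # maps u_t (dim 3) to dim 2
w_new, g_t = reembed([2.0, -4.0], [1.0, 1.0, 1.0],
                     zeros_xx, zeros_xu, zeros_xx, zeros_xu)
```

A gate entry close to 1 keeps the fixed, non-contextual representation, while an entry close to 0 defers to the context-dependent candidate; this is the quantity inspected later when averaging gate values per word-type.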
RNN-based token re-embedding (TR) Here we set $\{u_1, \ldots, u_k\} = \mathrm{BiLSTM}(x_1, \ldots, x_k)$ as the hidden states of the top layer in a stacked BiLSTM of multiple layers, with each uni-directional LSTM in each layer having $d_h$ cells, so that $u_t \in \mathbb{R}^{2d_h}$.

LM-augmented token re-embedding (TR+LM)
The simple module specified above allows better exploitation of the context that a token appears in, if such exploitation is needed and is not learned by the rest of the network, which operates over the re-embedded tokens $w'_1, \ldots, w'_k$. Our findings in Section 4 indicate that context is crucial but that in our setting it may be utilized to a limited extent.
We hypothesize that the main determining factor in this behavior is the relatively small size of the data and its distribution, which does not require using long-range context in most examples. Therefore, we leverage a strong language model that was pre-trained on large corpora as a fixed encoder which supplies additional contextualized token representations. We denote these representations as $\{o_1, \ldots, o_k\}$. The LM we use is from Józefowicz et al. (2016) (footnote 2), trained on the One Billion Words Benchmark dataset (Chelba et al., 2013). It consists of an initial layer which produces character-based word representations, followed by two stacked LSTM layers and a softmax prediction layer. The hidden state outputs of each LSTM layer are projected down to a lower dimension via a bottleneck layer (Sak et al., 2014). We set $\{o_1, \ldots, o_k\}$ to either the projections of the first layer, referred to as TR + LM(L1), or those of the second one, referred to as TR + LM(L2).
With both re-embedding schemes, we use the resulting re-embedded representations $w'_1, \ldots, w'_k$ as a drop-in replacement for the word-embedding inputs fed to a standard model, described next.

Base model
We build upon Lee et al. (2016), who proposed the RaSoR model. For word-embedding inputs $q_1, \ldots, q_m$ and $p_1, \ldots, p_n$ of dimension $d_w$, RaSoR consists of the following components:
Passage-independent question representation The question is encoded via a BiLSTM, $\{v_1, \ldots, v_m\} = \mathrm{BiLSTM}(q_1, \ldots, q_m)$, and the resulting hidden states are summarized via attention (Bahdanau et al., 2015): $q^{indep} = \sum_{j=1}^{m} \alpha_j v_j$. The coefficients $\alpha$ are produced by normalizing the logits $\{s_1, \ldots, s_m\}$, where $s_j = w_q^\top \cdot \mathrm{FF}(v_j)$ for a parameter vector $w_q \in \mathbb{R}^{d_f}$ and $\mathrm{FF}(\cdot)$ a single-layer feed-forward network.
Passage-aligned question representations For each passage position $i$, the question is encoded via attention operated over its word-embeddings: $q_i^{align} = \sum_{j=1}^{m} \beta_{ij} q_j \in \mathbb{R}^{d_w}$. The coefficients $\beta_i$ are produced by normalizing the logits $\{s_{i1}, \ldots, s_{im}\}$, where $s_{ij} = \mathrm{FF}(q_j)^\top \cdot \mathrm{FF}(p_i)$.
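The passage-aligned attention can be illustrated as follows. This is a hedged sketch in plain Python: the function names are hypothetical, the normalization is assumed to be a softmax, and `ff` stands in for the feed-forward network FF (here passed in as any callable, with the identity used in the toy example).

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(s - m) for s in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def aligned_question(p_i, question, ff):
    """Attention-weighted sum of question embeddings for one passage position.

    Logits are s_ij = FF(q_j) . FF(p_i); beta_i is their softmax;
    the result is sum_j beta_ij * q_j.
    """
    ff_p = ff(p_i)
    logits = [dot(ff(q_j), ff_p) for q_j in question]
    beta = softmax(logits)
    dim = len(question[0])
    q_align = [sum(b * q_j[d] for b, q_j in zip(beta, question))
               for d in range(dim)]
    return q_align, beta

# Toy usage with an identity "FF" (an assumption made only for illustration).
identity = lambda v: v
question = [[1.0, 0.0], [0.0, 1.0]]
q_align, beta = aligned_question([0.0, 0.0], question, identity)
```

Note that the weighted sum is taken over the question word-embeddings themselves, so each aligned vector stays in the same space as the passage word-embeddings it is concatenated with.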

Augmented passage token representations
Each passage word-embedding $p_i$ is concatenated with its corresponding $q_i^{align}$ and with the independent $q^{indep}$ to produce $p^*_i = [p_i; q_i^{align}; q^{indep}]$, and a BiLSTM is operated over the resulting vectors: $\{h_1, \ldots, h_n\} = \mathrm{BiLSTM}(p^*_1, \ldots, p^*_n)$.
Footnote 2: Named BIG LSTM+CNN INPUTS in that work and available at http://github.com/tensorflow/models/tree/master/research/lm_1b.
Span representations A candidate answer span $a = (l, r)$ with $l \le r$ is represented as the concatenation of the corresponding augmented passage representations: $h^*_a = [h_l; h_r]$. In order to avoid quadratic runtime, only spans up to length 30 are considered.
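To make the span construction concrete, the sketch below enumerates candidate spans and builds their endpoint representations. It is illustrative only: the function names are hypothetical, and "length" is assumed to mean the number of tokens in the span, $r - l + 1$.

```python
def candidate_spans(n, max_len=30):
    """All 1-indexed (l, r) with 1 <= l <= r <= n and r - l + 1 <= max_len."""
    return [(l, r)
            for l in range(1, n + 1)
            for r in range(l, min(n, l + max_len - 1) + 1)]

def span_representation(h, span):
    """Concatenate the hidden states at the two span endpoints: [h_l; h_r]."""
    l, r = span
    return h[l - 1] + h[r - 1]  # list concatenation; h is 0-indexed in Python

# Toy usage: a three-token passage with one-dimensional hidden states.
h = [[1.0], [2.0], [3.0]]
spans = candidate_spans(len(h), max_len=2)
reps = [span_representation(h, s) for s in spans]
```

With the length cap, the number of candidates grows linearly in the passage length (at most `max_len` spans per start position) rather than quadratically.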
Prediction layer Finally, each span representation $h^*_a$ is transformed to a logit $s_a = w_c^\top \cdot \mathrm{FF}(h^*_a)$ for a parameter vector $w_c \in \mathbb{R}^{d_f}$, and these logits are normalized to produce a distribution over spans. Learning is performed by maximizing the log-likelihood of the correct answer span.
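The training objective can be sketched as a softmax over span logits followed by the negative log-likelihood of the gold span. This is a minimal illustration under the assumption that the normalization is a softmax over all candidate spans; the function names are hypothetical and the logits are taken as given rather than computed from $w_c$ and FF.

```python
import math

def log_softmax(logits):
    """Log of the softmax distribution, computed stably via log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(s - m) for s in logits))
    return [s - log_z for s in logits]

def span_nll(span_logits, gold_index):
    """Negative log-likelihood of the correct span; minimizing it
    maximizes the log-likelihood of the correct answer span."""
    return -log_softmax(span_logits)[gold_index]

# Toy usage: four candidate spans with equal logits give a uniform
# distribution, so the loss equals log(4).
loss = span_nll([0.0, 0.0, 0.0, 0.0], gold_index=2)
```

At prediction time, the span with the highest logit would be returned; the normalization only matters for the loss.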

Evaluation and Analysis
We evaluate our contextualization scheme on the SQuAD dataset (Rajpurkar et al., 2016), which consists of 100,000+ paragraph-question-answer examples crowdsourced from Wikipedia articles.
Importance of context We are interested in evaluating the effect of our RNN-based re-embedding scheme on the performance of the downstream base model. However, the addition of the re-embedding module incurs additional depth and capacity for the resultant model. We therefore compare this model, termed RaSoR + TR, to a setting in which re-embedding is non-contextual, referred to as RaSoR + TR(MLP). Here we set $u_t = \mathrm{MLP}(x_t)$, a multi-layered perceptron on $x_t$, allowing for the additional computation to be carried out on word-level representations without any context, while matching the model size and hyperparameter search budget of RaSoR + TR. In Table 1 we compare these two variants over the development set and observe superior performance by the contextual one, illustrating the benefit of contextualization, and specifically of per-sequence contextualization, which is done separately for the question and for the passage.
Context complements rare words Our formulation lends itself to an inspection of the different dynamic weightings computed by the model for interpolating between contextual and non-contextual terms. In Figure 1 we plot the average gate value $g_t$ for each word-type, where the average is taken across entries of the gate vector and across all occurrences of the word in both passages and questions. This inspection reveals the following: on average, the less frequent a word-type is, the smaller its gate activations are, i.e., the re-embedded representation of a rare word places less weight on its fixed word-embedding and more on its contextual representation, compared to a common word. This highlights a problem with maintaining fixed word representations: albeit pre-trained on extremely large corpora, the embeddings of rare words need to be complemented with information emanating from their context. Our specific parameterization allows observing this directly, but it may very well be an implicit burden placed on any contextualizing encoder such as a vanilla BiLSTM.
Table 1: Results on SQuAD's development set. The EM metric measures an exact match between a predicted answer and a correct one, and the F1 metric measures the overlap between their bags of words.
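The EM and F1 metrics described in the Table 1 caption can be sketched as below. This is a simplified illustration with hypothetical function names: it tokenizes on whitespace only, whereas the official SQuAD evaluation also normalizes answers (lowercasing, stripping punctuation and articles) and takes the maximum score over several reference answers.

```python
from collections import Counter

def exact_match(prediction, gold):
    """1.0 if the token sequences are identical, else 0.0."""
    return float(prediction.split() == gold.split())

def f1_score(prediction, gold):
    """Harmonic mean of precision and recall over bag-of-words overlap."""
    pred_tokens, gold_tokens = prediction.split(), gold.split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Toy usage: two of three tokens overlap in each direction.
em = exact_match("the cat sat", "cat sat down")
f1 = f1_score("the cat sat", "cat sat down")
```

F1 gives partial credit to predictions whose boundaries are slightly off, which is why it is consistently higher than EM in the reported results.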

Incorporating language model representations
Supplementing the calculation of token re-embeddings with the hidden states of a strong language model proves to be highly effective. In Table 1 we list development set results for using either the LM hidden states of the first stacked LSTM layer or those of the second one. We additionally evaluate the incorporation of that model's word-type representations (referred to as RaSoR + TR + LM(emb)), which are based on character-level embeddings and are naturally unaffected by context around a word-token.
[Table fragment: 73.2 81.9 · · · RaSoR (base model) [10] 70.8 78.7, Lee et al. (2016)]
Overall, we observe a significant improvement with all three configurations, effectively showing the benefit of training a QA model in a semi-supervised fashion (Dai and Le, 2015) with a large language model. Besides an across-the-board boost in results, we note that the performance obtained by utilizing the LM hidden states of the first LSTM layer significantly surpasses that of the other two variants. This may be due to context being most strongly represented in those hidden states, as the representations of LM(emb) are non-contextual by definition and those of LM(L2) were optimized (during LM training) to be similar to parameter vectors that correspond to word-types and not to word-tokens.
In Table 2 we list the top-scoring single-model published results on SQuAD's test set, where we observe that RaSoR + TR + LM(L1) ranks second in EM, despite having only minimal question-passage interaction, which is a core component of other works. We additionally carry out the evaluation of Jia and Liang (2017), who demonstrated the proneness of current QA models to be fooled by distracting sentences added to the paragraph. In Table 3

Related Work
Our use of a Highway layer with RNNs is related to Highway LSTM and Residual LSTM (Kim et al., 2017). The goal in those works is to effectively train many stacked LSTM layers and so highway and residual connections are introduced into the definition of the LSTM function. Our formulation is external to that definition, with the specific goal of gating between LSTM hidden states and fixed word-embeddings.
Multiple works have shown the efficacy of semi-supervision for NLP tasks (Søgaard, 2013).
Pre-training a LM in order to initialize the weights of an encoder has been reported to improve generalization and training stability for sequence classification (Dai and Le, 2015) as well as translation and summarization (Ramachandran et al., 2017). Similar to our work, Peters et al. (2017) utilize the same pre-trained LM from Józefowicz et al. (2016) for sequence tagging tasks, keeping encoder weights fixed during training. Their formulation includes a backward LM and uses the hidden states from the top-most stacked LSTM layer of the LMs, whereas we also consider reading the hidden states of the bottom one, which substantially improves performance. In parallel to our work, Peters et al. (2018) have successfully leveraged pre-trained LMs for several tasks, including RC, by utilizing representations from all layers of the pre-trained LM.
In a transfer-learning setting, McCann et al. (2017) pre-train an attentional encoder-decoder model for machine translation and show improvements across a range of tasks when incorporating the hidden states of the encoder as additional fixed inputs for downstream task training.

Conclusion
In this work we examine the importance of context for the task of reading comprehension. We present a neural module that gates contextual and non-contextual representations and observe gains due to context utilization. Consequently, we inject contextual information into our model by integrating a pre-trained language model through our suggested module and find that it substantially improves results, reaching performance comparable to state-of-the-art on the SQuAD dataset.