Entity Tracking Improves Cloze-style Reading Comprehension

Recent work has improved on modeling for reading comprehension tasks with simple approaches such as the Attention Sum Reader; however, automatic systems still significantly trail human performance. Analysis suggests that many of the remaining hard instances are related to the inability to track entity references throughout documents. This work focuses on these hard entity tracking cases with two extensions: (1) additional entity features, and (2) training with a multi-task tracking objective. We show that these simple modifications improve performance both independently and in combination, and we outperform the previous state of the art on the LAMBADA dataset by 3.5 points, particularly on difficult entity examples. We also effectively match the performance of more complicated models on the named entity portion of the CBT dataset.


Introduction
There has been tremendous interest over the past several years in Cloze-style (Taylor, 1953) reading comprehension tasks, datasets, and models (Hermann et al., 2015; Hill et al., 2016; Kadlec et al., 2016; Dhingra et al., 2016; Cui et al., 2016). Many of these systems apply neural models to learn to predict answers based on contextual matching, and have inspired other work in long-form generation and question answering. The extent and limits of these successes have also been a topic of interest (Chu et al., 2017). Recent analysis by Chu et al. (2017) suggests that a significant portion of the errors made by standard models, especially on the LAMBADA dataset (Paperno et al., 2016), derive from the inability to correctly track entities or speakers, or a failure to handle various forms of reference.
This work targets these shortcomings by designing a model and training scheme targeted towards entity tracking. Specifically, we introduce two simple changes to a stripped-down model: (1) simple, entity-focused features, and (2) two multi-task objectives that target entity tracking. Our ablation analysis shows that both independently improve entity tracking, which is the primary source of the overall model's improvement. Together they lead to state-of-the-art performance on the LAMBADA dataset and near state-of-the-art performance on the CBT dataset (Hill et al., 2016), even with a relatively simple model.

Figure 1: A LAMBADA example where the final word "julie" (with reference chain in brackets) is the answer, y, to be predicted from the preceding context x. A system must know the two speakers and the current dialogue turn; simple context matching is not sufficient. Here, our model's predictions before and after adding the multi-task objective are shown.

Background and Related Work
Cloze-style reading comprehension uses a passage of word tokens $x = x_{1:n}$ (the context), with one token $x_j$ masked; the task is to fill in the masked word $y$, which was originally at position $j$. These datasets aim to present a benchmark challenge requiring some understanding of the context to select the correct word. This task is a prerequisite for problems like long-form generation and document-based question answering.
A number of datasets in this style exist with different focuses. Here we consider the LAMBADA dataset and the named entity portion of the Children's Book Test dataset (CBT-NE). LAMBADA uses novels, where examples consist of 4-5 sentences and the last word, $x_n$, is masked and must be predicted. The dataset is constructed carefully to focus on examples where humans needed the context to predict the masked word. CBT-NE examples, on the other hand, include 21 sentences where the masked word is a named entity extracted from the last sentence, with $j \le n$, and the dataset is constructed in a more automated way. We show an example from LAMBADA in Figure 1. In CBT, as well as the similar CNN/Daily Mail dataset (Hermann et al., 2015), the answer $y$ is always contained in $x$, whereas in LAMBADA it may not be. Chu et al. (2017) showed, however, that training only on examples where $y$ is in $x$ leads to improved overall performance, and we adopt this approach as well.

Related Work
The first popular neural network reading comprehension models were the Attentive Reader and its variant, the Impatient Reader (Hermann et al., 2015). They were the first to use bidirectional LSTMs to encode the context paragraph and the query separately. The Stanford Reader is a simpler version with fewer layers for inference. These models use an encoder to map each context token $x_i$ to a vector $u_i$. In what prior work terms explicit reference models, a similarity measure $s_i = s(u_i, q)$ is calculated between each context vector $u_i$ and a query vector $q$ derived for the masked word. These similarity scores are projected to an attention distribution $\alpha = \mathrm{softmax}(\{s_i\})$ over the context positions $1, \ldots, n$, which are taken to be candidate answers. The Attention Sum Reader (Kadlec et al., 2016) is a further simplified version. It computes $u_i$ and $q$ with separate bidirectional GRU (Chung et al., 2014) networks, and $s_i$ with a dot product. It is trained to minimize:
$$L_0(\theta) = -\ln \sum_{i : x_i = y} \alpha_i$$
where $\theta$ is the set of all parameters associated with the model, and $y$ is the correct answer. At test time, a pointer sum attention mechanism is used to predict the word type with the highest aggregate attention as the answer. The Gated Attention Reader (Dhingra et al., 2016) leverages the same mechanism for prediction and introduces an attention gate to modulate the joint context-query information over multiple hops.
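To make the pointer sum mechanism concrete, the following is a minimal NumPy sketch, not the authors' implementation. The function names, the toy vectors, and the use of the negative log of the summed attention as the training loss are assumptions made for illustration. It shows that a word type can win through aggregated attention even when no single occurrence has the highest score.

```python
import numpy as np

def attention_sum(context_tokens, U, q):
    """Return (predicted word type, aggregate attention per word type).

    context_tokens: list of n word types; U: (n, k) context vectors u_i;
    q: (k,) query vector. All names here are illustrative.
    """
    scores = U @ q                      # dot-product similarity s_i = u_i . q
    scores = scores - scores.max()      # shift for a numerically stable softmax
    alpha = np.exp(scores) / np.exp(scores).sum()
    totals = {}
    for token, a in zip(context_tokens, alpha):
        # Sum attention over every position holding the same word type.
        totals[token] = totals.get(token, 0.0) + float(a)
    prediction = max(totals, key=totals.get)
    return prediction, totals

def attention_sum_loss(context_tokens, U, q, answer):
    """Assumed loss: negative log of the attention mass on the answer.

    Raises KeyError if the answer is absent from the context, which matches
    training only on examples where y appears in x.
    """
    _, totals = attention_sum(context_tokens, U, q)
    return -np.log(totals[answer])

# Toy example: "julie" has the single highest score, but "amy" occurs twice.
tokens = ["amy", "said", "julie", "amy"]
U = np.array([[1.5, 0.0], [0.0, 1.0], [2.0, 0.0], [1.5, 0.0]])
q = np.array([1.0, 0.0])
prediction, totals = attention_sum(tokens, U, q)
```

In this toy input the two occurrences of "amy" together receive more attention than the single, higher-scoring occurrence of "julie", so the summed prediction is "amy".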
Recurrent Entity Networks (Henaff et al., 2016) use a custom gated recurrent module, Dynamic Memory, to learn and update entity representations as new examples are received. Their gate function combines (1) a similarity measure between the input and the hidden states, and (2) a set of trainable "key" vectors which could learn any attribute of an entity, such as its location or the other entities it is interacting with in the current context. Query Reduction Networks (Seo et al., 2016) are also gated recurrent networks; they track state in a paragraph and use a hidden query vector to keep pointing to the answer at each step. The query is successively transformed with each new sentence to a reduced state that is easier to answer given the new information.

Model
In this work, we were particularly interested in the shortcomings of simple models and in exploring whether, and how much, entity tracking could help, since Chu et al. (2017) have pointed out this weakness. As a result, we adapt a simplified Attention Sum (AttSum) reader throughout all experiments. Our version uses only a single bidirectional GRU for both $u_i$ and $q$. This GRU is of size $2d$, using the first $d$ states for the context and the second $d$ for the query. Formally, let $\overrightarrow{h}_i, \overleftarrow{h}_i \in \mathbb{R}^{2d}$ be the forward and backward GRU states at position $i$, and let
$$u_i = [\overrightarrow{h}_{i,1:d}; \overleftarrow{h}_{i,1:d}].$$
For datasets using the last word, the query is constructed as
$$q = [\overrightarrow{h}_{n,d+1:2d}; \overleftarrow{h}_{1,d+1:2d}].$$
When the masked word can be anywhere, the query is constructed as
$$q = [\overrightarrow{h}_{j-1,d+1:2d}; \overleftarrow{h}_{j+1,d+1:2d}].$$
Our main contribution is the extension of this simple model to incorporate entity tracking. Other authors have explored extending neural reading comprehension models with linguistic features, particularly Dhingra et al. (2017), who use a modified GRU with knowledge such as coreference relations and hypernymy. In Dhingra et al. (2018), the most recent coreferent antecedent for each token is incorporated into the update equations of the GRU unit to bias the reader towards coreferent recency. In this work, we instead use a much sim-

Learning to Track Entities
Analysis of reading comprehension has indicated that neural models are strong at matching local context information but weaker at following entities through the discourse (Chu et al., 2017). We consider two straightforward ways of extending the Attention Sum baseline to better track entities.
Method 1: Features We introduce a short list of features in Table 1 to augment the representation of each word in $x$. These features are meant to help the system identify and use the relationships between words in the passage (see footnote 1). Features 2-5 apply only to words tagged PERSON by the NER tagger. Features 6-7 apply only to words between opening and closing quotation marks. Feature 6 indicates the index of the quote in the document, and Feature 7 gives the assumed speaker of the quote using some simple rules; we provide the rules in the Supplementary Material. Though most of these features are novel, they are motivated by recent analysis (Wang et al., 2015).
All features are incorporated into a word's representation by embedding each discrete feature into a vector of the same size as the original word embedding, adding the vectors as well as a bias, and applying a tanh nonlinearity.
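The combination step described above can be sketched as follows. This is an illustrative NumPy example under stated assumptions: the table sizes, the two stand-in features (a tag id and a binary flag), and the function name are hypothetical and are not taken from Table 1 or from the released code.

```python
import numpy as np

def featurized_embedding(word_id, feature_ids, word_table, feature_tables, bias):
    """tanh(word embedding + sum of feature embeddings + bias).

    Each discrete feature has its own embedding table whose vectors have the
    same size as the word embedding, so they can be added elementwise.
    """
    total = word_table[word_id].copy()
    for table, feature_id in zip(feature_tables, feature_ids):
        total = total + table[feature_id]
    return np.tanh(total + bias)

rng = np.random.default_rng(0)
dim = 8                                   # toy embedding size (assumption)
word_table = rng.normal(size=(50, dim))   # toy vocabulary of 50 word types
feature_tables = [
    rng.normal(size=(45, dim)),           # hypothetical tag feature, 45 values
    rng.normal(size=(2, dim)),            # hypothetical binary flag feature
]
bias = np.zeros(dim)

vec = featurized_embedding(3, [7, 1], word_table, feature_tables, bias)
```

Because every feature is embedded at the word-embedding size, adding a feature only adds one lookup table and leaves the input size of the encoder unchanged.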
Footnote 1: POS tags are produced with the NLTK library (Bird et al., 2009), and NER tags with the Stanford NER tagger (Finkel et al., 2005). We additionally found it useful to tag animate words as PERSONs on the CBT-NE data, using the animate word list of Bergsma and Lin (2006).
Method 2: Multitasking We additionally encourage the neural model to keep track of entities by multitasking with simple auxiliary entity-tracking tasks. Examples such as Figure 1 suggest that keeping track of which entities are currently in scope is useful for answering reading comprehension questions. There, amy and julie are conversing, and being able to track that amy is the speaker of the final quote helps to rule her out as a candidate answer. We consider two tasks.
For Task 1 ($L_1$) we train the same model to predict repeated named entities. For all named entities $x_j$ such that there is an $x_i = x_j$ with $i < j$, we attempt to mask and predict the word type $x_j$. This is done by introducing another Cloze prediction, but now setting the target $y = x_j$, reducing the context to the preceding words $x_{1:j-1}$ with $u_i = \overrightarrow{h}_i$, and setting the query $q = \overrightarrow{h}_{j-1}$. (Note that, unlike above, both of these use only the forward states of the GRU.) We use a bilinear similarity score $s_i = q^\top Q u_i$ for this prediction, where $Q$ is a learned transformation in $\mathbb{R}^{2d \times 2d}$. This task is inspired by the antecedent ranking task in coreference (Wiseman et al., 2015, 2016).
For Task 2 ($L_2$) we train to predict the order index in which a named entity has been introduced. For example, in Figure 1, julie would be 1, amy would be 2, marsh would be 3, etc. The hope here is that learning to predict when entities reappear will help the model track their reoccurrences. For the blue-labeled julie, the model would aim to predict 1, even though it appears later in the context. This task is inspired by the One-Hot Pointer Reader on the Who-did-What dataset (Onishi et al., 2016).
Formally, letting $\mathrm{idx}(x_j)$ be the index to be predicted for $x_j$, we minimize:
$$L_2(\theta) = -\sum_{j : x_j \in E} \ln \mathrm{softmax}(W h_j)_{\mathrm{idx}(x_j)}$$
where $h_j \in \mathbb{R}^{2d}$ is the GRU state at position $j$, $W \in \mathbb{R}^{|E| \times 2d}$, and $E$ is the set of entity word types in the document. Note that this is a simpler computation, requiring only $O(|E| \times n)$ predictions per $x$, whereas $L_1$ requires $O(n^2)$.
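As an illustration of Task 2, the sketch below computes the introduction-order targets for a token sequence and a softmax cross-entropy loss over them. It is a minimal example under assumptions: the function names are hypothetical, the loss form (cross-entropy over a linear map of a per-token state) is one natural reading of the description above, and the zero-valued states and weights are placeholders chosen so the result is easy to check by hand.

```python
import numpy as np

def entity_order_targets(tokens, is_entity):
    """Map each entity token to the 1-based order in which its type first appeared."""
    first_seen = {}
    targets = []
    for token, entity in zip(tokens, is_entity):
        if not entity:
            targets.append(None)          # non-entity tokens have no target
            continue
        if token not in first_seen:
            first_seen[token] = len(first_seen) + 1
        targets.append(first_seen[token])
    return targets

def order_index_loss(H, W, targets):
    """Sum of -log softmax(W h_j)[idx(x_j)] over entity positions j (assumed form)."""
    loss = 0.0
    for h, target in zip(H, targets):
        if target is None:
            continue
        logits = W @ h
        logits = logits - logits.max()    # stable log-softmax
        log_probs = logits - np.log(np.exp(logits).sum())
        loss -= log_probs[target - 1]     # targets are 1-based
    return loss

tokens = ["julie", "met", "amy", "and", "marsh", "then", "julie"]
is_entity = [True, False, True, False, True, False, True]
targets = entity_order_targets(tokens, is_entity)

H = np.zeros((len(tokens), 4))            # placeholder per-token states
W = np.zeros((3, 4))                      # |E| = 3 entity types in this toy passage
loss = order_index_loss(H, W, targets)
```

With all-zero weights every entity position gets a uniform distribution over the three indices, so the loss is four times ln 3. The second occurrence of julie keeps the target 1 even though it appears last.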
The full model minimizes a multi-task loss: $L_0(\theta) + \gamma_1 L_1(\theta) + \gamma_2 L_2(\theta)$. Using $L_1$ and $L_2$ simultaneously did not lead to improved performance, however, so one of $\gamma_1$, $\gamma_2$ is always set to 0. We believe that this is because, while the learning objectives for $L_1$ and $L_2$ are mathematically different, they are both designed to track the entities mentioned so far in the document and thus do not provide complementary information to each other.
We found it useful to have two hyperparameters per auxiliary task governing the number of distinct named entity word types and tokens used in defining the losses $L_1$ and $L_2$. In particular, per document these hyperparameters control, in top-to-bottom order, the number of distinct named entity word types we attempt to predict, as well as the number of tokens of each type considered.
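The Task 1 objective and the weighted combination can be sketched in a few lines. This is an illustrative NumPy example, not the released implementation: the function names are hypothetical, the forward states are placeholders, and the per-position loss is assumed to be the same negative log of summed attention used for the main Cloze prediction, restricted to the preceding words.

```python
import numpy as np

def repeated_entity_loss(tokens, is_entity, H_fwd, Q):
    """Task 1 sketch: predict each repeated named entity from the preceding words.

    H_fwd: (n, k) forward states; Q: (k, k) bilinear transformation.
    """
    loss = 0.0
    for j in range(1, len(tokens)):
        # Only named entities with an earlier occurrence are prediction targets.
        if not is_entity[j] or tokens[j] not in tokens[:j]:
            continue
        q = H_fwd[j - 1]                  # query from the state just before x_j
        scores = H_fwd[:j] @ (Q.T @ q)    # s_i = q^T Q u_i for every i < j
        scores = scores - scores.max()
        alpha = np.exp(scores) / np.exp(scores).sum()
        mass = sum(a for tok, a in zip(tokens[:j], alpha) if tok == tokens[j])
        loss -= np.log(mass)
    return loss

def multitask_loss(l0, l1, l2, gamma1, gamma2):
    """L0 + gamma1 * L1 + gamma2 * L2; in the text one of the gammas is 0."""
    return l0 + gamma1 * l1 + gamma2 * l2

tokens = ["amy", "julie", "said", "amy"]
is_entity = [True, True, False, True]
H_fwd = np.zeros((4, 6))                  # placeholder forward states
Q = np.zeros((6, 6))                      # placeholder bilinear weights
l1 = repeated_entity_loss(tokens, is_entity, H_fwd, Q)
total = multitask_loss(1.0, l1, 5.0, gamma1=0.5, gamma2=0.0)
```

Here only the final amy is a target. With zero weights the attention over its three preceding words is uniform, so the loss is ln 3. The double loop over positions and their prefixes is where the quadratic cost of this task comes from.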

Experiments
Methods This section highlights several aspects of our methodology; full hyperparameters are given in the Supplementary Material. For the training sets, we exclude examples where the answer is not in the context. The validation and test sets are not modified, however, and the model with the highest accuracy on the validation set is chosen for testing. For both tasks, the context words are mapped to learned embeddings; importantly, we initialize the first 100 dimensions with the 100-dimensional GloVe embeddings (Pennington et al., 2014). Named entity words are anonymized, as is done in the CNN/Daily Mail corpus (Hermann et al., 2015) and in some experiments in prior work. The model is regularized with dropout (Srivastava et al., 2014) and optimized with ADAM (Kingma and Ba, 2014). For all experiments we performed a random search over hyperparameter values (Bergstra and Bengio, 2012), and report the results of the models that performed best on the validation set. Our implementation is available at https://github.com/harvardnlp/readcomp.
Table 2 shows the full results of our best models on the LAMBADA and CBT-NE datasets, and compares them to recent, best-performing results in the literature.
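Two of the preprocessing steps described in the Methods paragraph, dropping training examples whose answer is absent from the context and initializing part of each embedding from pre-trained vectors, can be sketched as follows. The data layout, function names, total embedding size of 128, and the stand-in pre-trained vector are all assumptions for illustration; they are not taken from the released code.

```python
import numpy as np

def keep_answerable(examples):
    """Keep only training examples whose answer occurs in the context."""
    return [ex for ex in examples if ex["answer"] in ex["context"]]

def init_embeddings(vocab, pretrained, dim, pretrained_dim=100, seed=0):
    """Random table whose first `pretrained_dim` columns are copied from
    pre-trained vectors when available. Requires dim >= pretrained_dim."""
    rng = np.random.default_rng(seed)
    table = rng.normal(scale=0.1, size=(len(vocab), dim))
    for row, word in enumerate(vocab):
        if word in pretrained:
            table[row, :pretrained_dim] = pretrained[word]
    return table

examples = [
    {"context": ["amy", "met", "julie"], "answer": "julie"},
    {"context": ["amy", "met", "julie"], "answer": "marsh"},  # dropped: answer not in context
]
train = keep_answerable(examples)

vocab = ["amy", "julie", "met"]
pretrained = {"amy": np.ones(100)}        # stand-in for a real 100-dimensional vector
table = init_embeddings(vocab, pretrained, dim=128)
```

The remaining columns, and the rows of words without a pre-trained vector, stay randomly initialized and are learned during training.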

Results and Discussion
For both tasks the inclusion of either entity features or multi-task objectives leads to large, statistically significant increases in validation and test score, according to the McNemar test (α = 0.05) with continuity correction (Dietterich, 1998). Without features, AttSum + $L_2$ achieves the best test results, whereas with features AttSum-Feat + $L_1$ performs best on CBT-NE. The results on LAMBADA indicate that entity tracking is a very important overlooked aspect of the task. Interestingly, with features included, AttSum-Feat + $L_2$ appears to hurt test performance on LAMBADA and leaves CBT-NE performance essentially unchanged, amounting to a negative result for $L_2$. On the other hand, the effect of AttSum-Feat + $L_1$ is pronounced on CBT-NE, and while our simple models do not increase the state-of-the-art test performance on CBT-NE, they outperform "attention-over-attention" in addition to reranking (Cui et al., 2016), and are outperformed only by architectures supporting "multiple-hop" inference over the document (Dhingra et al., 2016). Our best model on the CBT-NE test set, AttSum-Feat + $L_1$, is very close to the current state-of-the-art result. On the validation sets for both LAMBADA and CBT-NE, the improvements from adding features to AttSum + $L_i$ are statistically significant (for full results refer to our supplementary material). On LAMBADA, the $L_1$ multi-tasked model is a 3.5-point increase on the state of the art.

Table 2 (partial; only the rows below survive in this copy): validation and test accuracy of prior models.
Model                                      Val      Test
[name missing] (Dhingra et al., 2017)      51.10    51.60
MAGE (64) (Dhingra et al., 2017)           52.10    51.10
GA + C-GRU (Dhingra et al., 2018)          [values missing]
[name missing] (Dhingra et al., 2016)      78.50    74.90
EpiReader (Trischler et al., 2016)         75.30    69.70
DIM Reader (Liu et al., 2017)              77.10    72.20
AoA (Cui et al., 2016)                     77.80    72.0
AoA + Reranker (Cui et al., 2016)          79       [value missing]

Our method also employs fewer parameters than richer models such as the GA Reader of Dhingra et al. (2016). More specifically, in terms of number of parameters, our models are very similar to a 1-hop GA Reader. In contrast, all published experiments of the latter use 3 hops, where each hop requires 2 separate Bi-GRUs, one to model the document and one for the query. This constitutes the largest difference in model size between the two approaches.
Table 3 considers the performance of the different models based on a segmentation of the data. Here we consider three categories of examples: (1) Entity, if the answer is a named entity; (2) Speaker, if the answer is a named entity and the speaker of a quote; (3) Quote, if the answer is found within quoted speech. Note that the Speaker and Quote categories, while mutually exclusive, are subsets of the overall Entity category. We see that both the additional features and the multi-task objectives independently result in a clear improvement in all categories, but that the gains are particularly pronounced for named entities and specifically for Speaker and Quote examples. Here we see sizable increases in performance, particularly in the Speaker category. We see larger increases in the more dialog-heavy LAMBADA task.
As a qualitative example of the improvement afforded by multi-task training, in Figure 1 we show the different predictions made by our model with and without $L_1$ (colored blue and red, respectively). Note that amy and julie are both entities that have been repeated twice in the passage. In addition to the final answer, our model with the $L_1$ loss was also able to predict these entities (at the colored locations) given the preceding words. Further qualitative analysis reveals that these augmentations improved the model's ability to eliminate non-entity choices from predictions. Some examples are shown in Figure 2.
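For readers unfamiliar with the significance test named above, the following is a small sketch of the McNemar test with continuity correction for comparing two classifiers on the same examples. The function name and the toy correctness vectors are hypothetical; the statistic is the standard (|b - c| - 1)^2 / (b + c), compared against a chi-square distribution with one degree of freedom.

```python
import math

def mcnemar_test(correct_a, correct_b):
    """McNemar test with continuity correction on paired correctness flags.

    b counts examples model A gets right and model B gets wrong; c counts the
    reverse. Returns (statistic, p_value) for a chi-square with 1 d.o.f.
    """
    b = sum(1 for a, bb in zip(correct_a, correct_b) if a and not bb)
    c = sum(1 for a, bb in zip(correct_a, correct_b) if not a and bb)
    if b + c == 0:
        return 0.0, 1.0                   # the models never disagree
    statistic = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-square(1): P(X > s) = erfc(sqrt(s / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Toy data: 50 examples both models get right, 5 only A, 25 only B.
correct_a = [True] * 50 + [True] * 5 + [False] * 25
correct_b = [True] * 50 + [False] * 5 + [True] * 25
statistic, p_value = mcnemar_test(correct_a, correct_b)
significant = p_value < 0.05
```

Only the examples on which the two models disagree enter the statistic, which is why the test suits comparisons of systems evaluated on the same validation or test set.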

Conclusion
This work demonstrates that learning to track entities with features and multi-task learning significantly increases the performance of a baseline reading comprehension system, particularly on the difficult LAMBADA dataset. This result indicates that higher-level word relationships may not be modeled by simple neural systems, but can be incorporated with minor additional extensions. This work hints that it is difficult for vanilla models to learn long-distance entity relations, and that these may need to be encoded directly through features or possibly with better pre-trained representations.