MemoReader: Large-Scale Reading Comprehension through Neural Memory Controller

Machine reading comprehension helps machines learn to utilize most of the human knowledge written in the form of text. Existing approaches have made significant progress, reaching performance comparable to the human level, but they remain limited to understanding at most a few paragraphs and fail to properly comprehend lengthy documents. In this paper, we propose a novel deep neural network architecture to handle long-range dependencies in reading comprehension (RC) tasks. In detail, our method has two novel aspects: (1) an advanced memory-augmented architecture and (2) an expanded gated recurrent unit with dense connections that mitigate potential information distortion occurring in the memory. Our proposed architecture is widely applicable to other models. We have performed extensive experiments with well-known benchmark datasets such as TriviaQA, QUASAR-T, and SQuAD. The experimental results demonstrate that the proposed method outperforms existing methods, especially for lengthy documents.


Introduction
Most human knowledge has been stored in the form of text. Reading comprehension (RC) to understand this knowledge is a major challenge that can vastly increase the range of knowledge available to machines. Many neural network-based methods have been proposed, pushing performance close to the human level. Nonetheless, there is still room for improvement, especially in comprehending lengthy documents that involve complicated reasoning processes. We identify the main bottleneck as the lack of a long-term memory and of a proper mechanism for controlling it.
Previously, several memory-augmenting methods have been proposed to solve the long-term dependency problem. For example, in relatively simple tasks such as the bAbI tasks (Weston et al., 2015), Graves et al. (2014, 2016) and Henaff et al. (2017) proposed methods that handle an external memory to address long-term dependency. Inspired by these approaches, we develop a customized memory controller along with an external memory augmentation (Graves et al., 2016) for complicated RC tasks. However, we found that the memory controller is susceptible to information distortion as neural networks become deeper, and this distortion can hinder the performance.
To overcome this issue, we propose two novel strategies that improve the memory-handling capability while mitigating the information distortion. We extend the memory controller with a residual connection to alleviate the information distortion occurring in it. We also expand the gated recurrent unit (GRU) (Cho et al., 2014) with a dense connection that conveys enriched features, containing the original as well as the transformed information, to the next layer. We conducted extensive experiments on several benchmark datasets such as TriviaQA, QUASAR-T, and SQuAD. The results show that the proposed model outperforms all the published results. We also integrated the proposed memory controller and the expanded GRU cell block with other existing methods to verify that our proposed components are widely applicable. The results show that our components consistently bring performance improvement across various state-of-the-art architectures.
The main contributions of this work include the following: (1) We propose an extended memory controller module for RC tasks. (2) We propose a densely connected encoder block with self attention to provide rich representation of given data, reducing information loss due to deep layers of the network. (3) We present the state-of-the-art results in lengthy-document RC tasks such as TriviaQA and QUASAR-T as well as relatively short document RC tasks such as SQuAD.

Proposed Method
This section presents two of our proposed components in detail, as depicted in Figure 1.

Memory Controller
Our first proposed component is an advanced external memory controller module for solving RC tasks. We modified the recently proposed memory controller (Graves et al., 2016) by using our new encoder block and layer-wise residual connections. These modifications enable the memory controller to reason over a lengthy document, leading to the overall performance improvement.
This layer takes as input a sequence of vector representations corresponding to individual tokens, $d_t \in \mathbb{R}^{l}$, where $l$ is the given vector dimension. For example, such input can be the output of the co-attention layer in Section 3. The operation of this layer is defined as $o_t, i_t = \mathrm{Controller}(d_t, M_{t-1})$. That is, at time step $t$, the controller generates an interface vector $i_t$ for read and write operations and an output vector $o_t$ based on the input vector $d_t$ and the external memory content from the previous time step, $M_{t-1} \in \mathbb{R}^{p \times q}$, where $p$ is the memory size and $q$ is the vector dimension of each memory.
Through this controller, we encode an input $D = \{d_t\}_{t=1}^{n}$ to $\{x_t\}_{t=1}^{n}$ by using the encoder block, i.e., $x_t = \mathrm{EncoderBlock}(d_t) \in \mathbb{R}^{k}$, where $k$ is the output dimension of the encoder block. In general, this block is implemented as a recurrent unit, e.g., GRU (Cho et al., 2014). In our model, we replace it with our dense encoder block with self attention (DEBS), as will be discussed in Section 2.2.
To generate a memory-augmented vector $z_t$, we concatenate $x_t$ with the $s$ vectors read from the memory at the previous time step, $M_{t-1}$, where $s$ represents the number of read heads in the memory interface. We then feed the vector $z_t$ to the bi-directional GRU (BiGRU) layer and obtain the output vector $h^m_t$. Afterwards, we generate the output vector $v_t$ as the weighted sum of the BiGRU output and the read vectors from the memory at the current step. Finally, we add a residual connection between the input $d_t$ and the output $v_t$ to mitigate any possible information distortion that can occur while accessing the memory, resulting in an output vector that can handle long-term dependency. For further details on how the interface vector works, we refer the readers to Graves et al. (2016) as well as our supplemental material.
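For reference, the data flow of the memory controller described above can be written compactly as follows. This is a hedged reconstruction from the prose alone, not the authors' notation: the symbol $m^i_t$ (the vector returned by read head $i$ at step $t$), the weight matrices $W_h$ and $W_m$, the assumption that they project onto the dimension $l$ of $d_t$, and the plain additive form of the residual step are all assumptions made for illustration.

```latex
\begin{align}
o_t, i_t &= \mathrm{Controller}(d_t, M_{t-1}), \\
x_t &= \mathrm{EncoderBlock}(d_t) \in \mathbb{R}^{k}, \\
z_t &= [x_t; m^1_{t-1}; \dots; m^s_{t-1}] \in \mathbb{R}^{k + sq}, \\
h^m_t &= \mathrm{BiGRU}(z_t, h^m_{t-1}, h^m_{t+1}), \\
v_t &= W_h h^m_t + W_m [m^1_t; \dots; m^s_t] \in \mathbb{R}^{l}, \\
o_t &= v_t + d_t \in \mathbb{R}^{l}.
\end{align}
```

The last line makes the role of the residual connection explicit: even if the memory path distorts $v_t$, the unmodified input $d_t$ still reaches the output.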

Dense Encoder Block with Self Attention
The second novel component we propose is a dense encoder block with self attention (DEBS), which further improves a GRU cell. Recently, Huang et al. (2017a) proposed that adding a connection between each layer and the other layers in convolutional networks can help to properly convey the information across multiple layers. Inspired by this, we add such dense connections that concatenate the input of a particular layer to its output. We also add a self-attention module to this block to properly address long-term dependency in a lengthy document. In this manner, our encoder block maintains the necessary information not only along the vertical direction (across layers) through dense connections but also along the horizontal direction (across time steps) through self attention.
DEBS takes an input vector sequence of length $n$ and transforms each vector to an $l$-dimensional vector $p_t$ through a fully connected layer with ReLU as the nonlinear unit. A recurrent layer (BiGRU) then generates a contextually encoded vector $r_t$ from the projected sequence. We concatenate each output vector $r_t$ with the projected input $p_t$ to obtain $g_t = [r_t; p_t] \in \mathbb{R}^{3l}$ and pass it to the self-attention layer. The self-attention layer then calculates the similarity map $S_g \in \mathbb{R}^{n \times n}$ by applying the tri-linear function to every pair $(g_i, g_j)$, where $i, j = 1, \dots, n$. Finally, the self-attended representation $Q = \{q_t\}_{t=1}^{n}$ is obtained by performing column-wise softmax on $S_g$ to get the attention matrix $A_g$, which is further multiplied with $G = \{g_t\}_{t=1}^{n}$. The final output is obtained as the concatenation of the outputs from the recurrent layer (BiGRU) and the self-attention layer, i.e., $[r_t; q_t] \in \mathbb{R}^{5l}$.
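The dimension bookkeeping and the self-attention step of DEBS can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the BiGRU output `R` is replaced by stand-in values, the weight vectors `w1`, `w2`, `w3` and all function names are hypothetical, and the softmax is assumed to normalise over the attended index $j$ for each position $i$.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def trilinear(gi, gj, w1, w2, w3):
    # w1 . g_i + w2 . g_j + w3 . (g_i * g_j), with * taken element-wise
    prod = [a * b for a, b in zip(gi, gj)]
    return dot(w1, gi) + dot(w2, gj) + dot(w3, prod)


def self_attend(G, w1, w2, w3):
    """Return (A, Q): attention weights and self-attended vectors."""
    n, dim = len(G), len(G[0])
    S = [[trilinear(G[i], G[j], w1, w2, w3) for j in range(n)] for i in range(n)]
    A = [softmax(row) for row in S]  # assumption: normalise over j for each i
    Q = [[sum(A[i][j] * G[j][d] for j in range(n)) for d in range(dim)]
         for i in range(n)]
    return A, Q


def debs_output(R, P, w1, w2, w3):
    G = [r + p for r, p in zip(R, P)]      # g_t = [r_t; p_t], 3l entries
    _, Q = self_attend(G, w1, w2, w3)      # q_t, 3l entries
    return [r + q for r, q in zip(R, Q)]   # [r_t; q_t], 5l entries


# Toy sizes: l = 2, so p_t has l = 2 entries and r_t has 2l = 4 entries.
l, n = 2, 3
P = [[0.1 * (t + 1)] * l for t in range(n)]        # projected inputs
R = [[0.2 * (t + 1)] * (2 * l) for t in range(n)]  # stand-in for BiGRU outputs
w = [0.5] * (3 * l)
out = debs_output(R, P, w, w, w)
print(len(out), len(out[0]))  # 3 10
```

With $l = 2$ the output vectors have $5l = 10$ entries, matching the $\mathbb{R}^{5l}$ stated in the text.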

Reading Comprehension Model with Proposed Components
We apply the proposed components to our model for RC tasks. As depicted in Figure 1, the model consists of three major layers: the co-attention layer, the memory controller, and the prediction layer. Given the embeddings of a question and a document, the co-attention layer generates query-aware contextual representations. The memory controller further refines these contextual representations using an external memory. Based on such representations, the prediction layer determines the start and the end token indices that form the answer span. In addition, we replace all the encoder blocks with DEBS in the three major layers.
Embedding. We incorporate both word- and character-level embedding methods to obtain the vector representation of each word in the input data. For the word-level embedding $e_w$, we utilize pre-trained, 300-dimensional embedding vectors from GloVe 6B (Pennington et al., 2014). The character-level word embedding $e_c$ is obtained as a 100-dimensional vector by first applying a convolution layer with 100 filters to a sequence of 20-dimensional character embeddings learned during training and then applying global max-pooling over the entire character-level sequence. We then obtain the embedding vector of a given word token, $e$, by concatenating these word- and character-level embeddings, i.e., $e = [e_w; e_c] \in \mathbb{R}^{400}$. Finally, we obtain the two sets of embedding vectors of the question and document token sequences, $E^q$ and $E^d$, where $m$ and $n$ represent the sequence lengths of the question and the document, respectively.
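As a rough illustration of how the 400-dimensional token embedding is assembled, the sketch below applies a 1-D convolution followed by global max-pooling to character vectors and concatenates the result with a word vector. The filter width of 3, the zero-padding of short words, the absence of a bias or nonlinearity, and all function names are assumptions; the text only fixes the sizes 300, 20, and 100.

```python
def char_cnn_embedding(char_vectors, filters):
    """char_vectors: one vector per character.
    filters: list of (width x char_dim) weight matrices.
    Returns one max-pooled response per filter."""
    char_dim = len(char_vectors[0])
    out = []
    for f in filters:
        width = len(f)
        # Zero-pad short words so that at least one window exists (assumption).
        pad = [[0.0] * char_dim] * max(0, width - len(char_vectors))
        padded = char_vectors + pad
        responses = [
            sum(f[k][d] * padded[s + k][d]
                for k in range(width) for d in range(char_dim))
            for s in range(len(padded) - width + 1)
        ]
        out.append(max(responses))  # global max-pooling over positions
    return out


def embed_token(word_vector, char_vectors, filters):
    # e = [e_w; e_c]
    return word_vector + char_cnn_embedding(char_vectors, filters)


# Sizes from the text: 300-d word vector, 20-d characters, 100 filters.
word_vec = [0.0] * 300
chars = [[0.01 * (i + 1)] * 20 for i in range(5)]  # a 5-character word
filters = [[[0.001 * (j + 1)] * 20 for _ in range(3)] for j in range(100)]
e = embed_token(word_vec, chars, filters)
print(len(e))  # 400
```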
Co-attention layer. Given $E^q$ and $E^d$, we feed each of them into the encoder block and obtain their contextual representations. These representations are used to calculate the pairwise similarity matrix $S \in \mathbb{R}^{m \times n}$ between tokens in the question and those in the document by a tri-linear function (Seo et al., 2017), where $i = 1, \dots, m$ and $j = 1, \dots, n$ index the question and document tokens, the function includes an element-wise multiplication of the two representations, and $w_q$, $w_d$, and $w_c$ are trainable vectors. We apply column-wise softmax to $S$ to obtain the document-to-question attention matrix $A$. Afterwards, a question-attended document representation $\tilde{C}^q$ is calculated using $A$. In addition, we obtain a vector $\tilde{a} \in \mathbb{R}^{n}$, corresponding to the attention of the question to document tokens, by applying softmax to the column-wise max values of $S$. The document-attended question vector is then obtained using $\tilde{a}$. The final co-attended representations $\{d_t\}_{t=1}^{n}$ are obtained by a fully connected layer with ReLU as the nonlinear unit, $\phi$.
Memory controller. This layer takes the output of the co-attention layer, $\{d_t\}_{t=1}^{n}$, as input and refines these representations using our proposed memory controller (Section 2.1). Afterwards, the resulting output vectors $\{o_t\}_{t=1}^{n}$ are given as input to the prediction layer.
Prediction layer. We feed the output of the memory controller, $\{o_t\}_{t=1}^{n}$, to the prediction layer to predict the start and the end token indices of the answer span. First, it goes through the encoder block followed by a fully connected layer with softmax over the entire sequence to compute the probability distribution of the start index. The probability distribution of the end index is calculated by concatenating the output of the encoder block for the start index with the output of the memory controller and then feeding them as input to another encoder block. These probability distributions are used as part of the negative log-likelihood objective function.
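The co-attention computation described above can be written out as follows. This is a hedged reconstruction based on the prose and on the tri-linear form commonly used in BiDAF-style models: the symbols $c^q_i$ and $c^d_j$ (the contextual representations of question token $i$ and document token $j$) and $\tilde{c}^d$ (the document-attended question vector) are assumed names, and the final fully connected step that produces $d_t$ is not reconstructed because its exact inputs cannot be recovered from the text.

```latex
\begin{align}
S_{ij} &= w_q^{\top} c^q_i + w_d^{\top} c^d_j + w_c^{\top} (c^q_i \odot c^d_j), \\
A_{ij} &= \frac{\exp(S_{ij})}{\sum_{i'=1}^{m} \exp(S_{i'j})}, \\
\tilde{c}^{q}_j &= \sum_{i=1}^{m} A_{ij}\, c^q_i, \\
\tilde{a}_j &= \frac{\exp(\max_{i} S_{ij})}{\sum_{j'=1}^{n} \exp(\max_{i} S_{ij'})}, \\
\tilde{c}^{d} &= \sum_{j=1}^{n} \tilde{a}_j\, c^d_j .
\end{align}
```

Here the softmax in the second line runs over the question index $i$ within each column $j$ of $S$, which is what "column-wise" means for an $m \times n$ matrix whose columns correspond to document tokens.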
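A minimal sketch of the span objective mentioned above, assuming that the loss for one example is the sum of the negative log-probabilities of the gold start and end indices. The source only states that the two distributions enter a negative log-likelihood objective, so the exact form and the function name are assumptions.

```python
import math


def span_nll(start_probs, end_probs, start_idx, end_idx):
    """Negative log-likelihood of the gold (start, end) indices."""
    return -(math.log(start_probs[start_idx]) + math.log(end_probs[end_idx]))


# Toy distributions over a 3-token document; the gold span is tokens 1..2.
start = [0.1, 0.7, 0.2]
end = [0.05, 0.15, 0.8]
loss = span_nll(start, end, 1, 2)
print(round(loss, 4))  # 0.5798
```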

Experimental Setup
Datasets and preprocessing. We perform extensive experiments with well-known benchmarks such as TriviaQA, QUASAR-T, and SQuAD, as summarized in Table 1. In most of these datasets, a question q and a document d are represented as sequences of words, and the answer span has to be selected from the document words based on the question. SQuAD consists of crowd-sourced questions and paragraphs from Wikipedia articles containing the answers to these questions. QUASAR-T is mostly based on factoid questions with a corresponding large corpus. TriviaQA is composed of question-answer pairs obtained from 14 trivia and quiz-league websites, along with documents collected later, from either web search or Wikipedia, that are likely to contain the answer.
In the TriviaQA dataset, we truncate each document to 1,200 words. Even with such truncation, the average word count per document (AWC) of TriviaQA is approximately four times larger than that of SQuAD. In terms of the AWC, documents in TriviaQA, QUASAR-T, and SQuAD can be viewed as large-, medium-, and small-length documents, respectively. In TriviaQA, because a document is collected separately for an already collected question-answer pair, the document sometimes lacks the information needed to properly infer the answer to the question. Clark and Gardner (2017) addressed this problem by exposing both relevant and irrelevant paragraphs together, separated based on TF-IDF scores. We follow this approach in TriviaQA. In QUASAR-T, we follow the same preprocessing steps as Dhingra et al. (2017).
Table 2 (caption, partial): All the results are gathered from their corresponding publications except for our models and 'BiDAF + DNC,' which we implemented on our own. 'Full' represents a complete dataset not guaranteed to contain relevant information to answer the question, while 'Verified' corresponds to its subset annotated by humans so that the relevant information for the answer is guaranteed to exist. The last column indicates whether a model uses any additional feature augmentation (AF).
Implementation details. We use [...] to build the model and Sonnet to implement the memory interface. NLTK (Bird and Loper, 2004) is used for tokenizing words. In the memory controller, we use four read heads and one write head, and the memory size is set to 100 × 36, with all entries initialized to 0. The hidden vector dimension l is set to 200. We use AdaDelta (Zeiler, 2012) as the optimizer with a learning rate of 0.5. The batch size is set to 20 for TriviaQA (Joshi et al., 2017) and 30 for SQuAD (Rajpurkar et al., 2016) and QUASAR-T (Dhingra et al., 2017). We use an exponential moving average of weights with a decaying factor of 0.001. Our model does require more memory than existing methods, but a single GPU (e.g., M40 with 12GB memory) was enough to train the model within a reasonable amount of time.
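The exponential moving average of weights can be sketched as below. The text gives a "decaying factor of 0.001" without defining it; this sketch assumes that 0.001 is the weight given to the newest parameters at each step (equivalently, a decay of 0.999 on the running average). The function name and the flat-list representation of the weights are illustrative only.

```python
def ema_update(shadow, weights, alpha=0.001):
    """Move each shadow (averaged) weight a fraction alpha toward the
    current weight: shadow <- (1 - alpha) * shadow + alpha * weight."""
    return [s + alpha * (w - s) for s, w in zip(shadow, weights)]


shadow = [1.0, -2.0]
current = [2.0, -2.0]
shadow = ema_update(shadow, current)
print(shadow)
```

Under this reading, the shadow weights change slowly and would typically be the ones used at evaluation time, while the optimizer keeps updating the raw weights.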

Quantitative Results
For our quantitative comparisons, we use BiDAF with self attention (Clark and Gardner, 2017) as a baseline, which holds the best published results on both the TriviaQA and SQuAD datasets. On the TriviaQA and QUASAR-T datasets, we compare our model with BiDAF (Seo et al., 2017) as well as its variant called 'BiDAF + DNC,' which is augmented with an existing external memory architecture (Graves et al., 2016) just before the answer prediction layer of BiDAF.
Overall, in lengthy-document cases such as TriviaQA and QUASAR-T, our model outperforms all the published results, as seen in Tables 2 and 3, while in the short-document case such as SQuAD, we mostly achieve the best results, as seen in Table 4. In the following, we present detailed analyses on each dataset.
TriviaQA. As shown in Table 2, our model, even without DEBS, outperforms the existing state-of-the-art method, 'BiDAF + SA + SN,' by a large margin in all the cases. Our model with DEBS, which replaces BiGRU encoder blocks, performs even better than that without it in all the cases except for the combination of the 'full' and 'Wikipedia' case, which involves documents containing no relevant information for the answer. Among the methods shown in Table 2, Reading Twice for NLU (Weissenborn, 2017) uses background knowledge from ConceptNet, while both M-Reader (Hu et al., 2017) and MEMEN (Pan et al., 2017) use [...] information as additional features. We note that our method achieves these outstanding results without any additional features.
[Table fragment: (Seo et al., 2017) 37.00 42.50 39.50 44.50]
QUASAR-T. As shown in Table 3, our simple baseline 'BiDAF + DNC,' which involves an existing memory architecture, improves performance over BiDAF, indicating the efficacy of incorporating an external memory. Moreover, our model with the proposed memory controller achieves significantly better results compared to other models. Furthermore, another proposed component, DEBS, gives an additional performance boost to our model.
SQuAD. As shown in Table 4, most of the models, if not all, use additional features such as ELMo (Peters et al.) and the self-attention mechanism to further improve the performance. We also adopt these mechanisms one by one to show that our model can benefit from them as well. First, we add ELMo to our model (without DEBS); ELMo provides a word embedding computed as the weighted sum of the hidden layers of a language model with regularization, which we use as an additional feature alongside our word embeddings. This improves the F1 score of our model up to 85.13 and EM to 77.44, the highest performance among all the methods that do not use self attention. Due to the relatively short document length in SQuAD compared to TriviaQA and QUASAR-T, our model without DEBS performs worse than the baseline 'BiDAF + Self Attention + ELMo.' However, after applying DEBS, our model outperforms the baseline, achieving 86.73 F1 and 79.69 EM.
Table 4 (partial): results on SQuAD.
Model                                              EM     F1
[...] (Salant and Berant, 2017)                    77.58  84.16
QANet (Yu et al., 2018a)                           76.24  84.60
SAN (Liu et al., 2017b)                            76.83  84.40
FusionNet (Huang et al., 2017b)                    75.97  83.90
RaSoR + TR (Salant and Berant, 2017)               75.79  83.26
Conducter-net                                      74.41  82.74
Reinforced Mnemonic Reader (Hu et al., 2017)       73.20  81.80
BiDAF + Self Attention (Clark and Gardner, 2017)   72.14  81.05
MEMEN (Pan et al., 2017)                           70.98  80.36
Our model (without DEBS)                           70.99  79.94
r-net (Wang et al., 2017)                          71.30  79.70
Document Reader (Chen et al., 2017)                70.73  79.35
FastQAExt                                          70.85  78.86
Human Performance                                  82.30  91.22
Minimum anchor distance. Rajpurkar et al. (2016) proposed a difficulty measure called syntactic divergence, which is computed as the edit distance between the syntactic parse trees of the question and the sentence containing the answer. However, this measure is limited in that the syntactic parser does not work properly on incomplete sentences, which are common in web text. It also becomes difficult to compute this measure if the answer requires multi-sentence inference. Instead, we develop our own metric called the minimum anchor distance, which is simple and robust to noisy text. To compute this metric, we first identify all the words that co-occur in a document and a question (anchor words) while ignoring stop words. Then, we compute the number of words found between the answer and each possible anchor word and select the minimum of these numbers.
In Figure 2, we show the F1 scores of our model with DEBS and of the baseline with respect to the minimum anchor distance. The scores are obtained from the development sets of TriviaQA (Web) and SQuAD. The heat map at the bottom of the figure indicates the number of samples in each interval of the minimum anchor distance. One can see that our model performs increasingly better than the baseline as the minimum anchor distance gets larger. The examples shown in Table 5 indicate that documents with long dependencies tend to have a large minimum anchor distance. These examples show that our model predicts answers placed far from the anchor word relatively well when anaphora resolution and negation are involved.
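The minimum anchor distance can be sketched in a few lines of Python. This follows the verbal definition above under some assumptions that the text leaves open: whitespace-tokenized, case-insensitive matching; a tiny illustrative stop-word list; tokens inside the answer span are not counted as anchors; and `None` is returned when no anchor exists. The function name and the example sentence are hypothetical.

```python
# Tiny illustrative stop-word list (an assumption; any standard list would do).
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "is", "was",
              "did", "what", "who", "and", "by"}


def minimum_anchor_distance(doc_tokens, question_tokens, answer_start, answer_end):
    """Number of words between the answer span
    doc_tokens[answer_start..answer_end] and its nearest anchor word.
    Returns None if the document and question share no non-stop word."""
    q_words = {w.lower() for w in question_tokens} - STOP_WORDS
    best = None
    for pos, tok in enumerate(doc_tokens):
        if answer_start <= pos <= answer_end:
            continue  # assumption: answer tokens are not anchors
        if tok.lower() not in q_words:
            continue
        if pos < answer_start:
            gap = answer_start - pos - 1  # words strictly between anchor and answer
        else:
            gap = pos - answer_end - 1
        if best is None or gap < best:
            best = gap
    return best


doc = "The bridge was designed by Jane Smith and opened after ten years of work in 1932".split()
question = "When was the bridge opened".split()
# The answer "1932" is token 15; the anchors are "bridge" (1) and "opened" (8).
print(minimum_anchor_distance(doc, question, 15, 15))  # 6
```

In this toy example the nearest anchor is "opened", and the six words "after ten years of work in" lie between it and the answer.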
Ablation study with an encoder block. We assume that the concatenation of the layer outputs in DEBS helps the memory controller store contextual representations clearly. To see how DEBS affects the memory controller depending on different positions in the entire network, we conducted an ablation study by replacing the encoder block with DEBS on SQuAD. As can be seen in Table 6, using DEBS in all the places improves the performance most, and furthermore, the memory controller with DEBS gives the largest performance margin. This implies that DEBS can generally work as a better alternative to a BiGRU module, and DEBS is critical in maintaining the high performance of our proposed memory controller.
Adding our proposed modules to other models. To show the wide effectiveness of our proposed approaches, we choose two well-known baseline models in SQuAD: R-net (Wang et al., 2017) and 'BiDAF + Self Attention' (Clark and Gardner, 2017). These models have similar architectures, in which the model first pairs a given question and document using attention and afterward applies a self-attention mechanism. We use the publicly available implementations of these models. In Table 7, replacing all the recurrent units with DEBS and adding our memory controller between the question-document pairing layer and the self-attention layer increases the F1 score by around 0.5 compared to the baseline.
Table 5 (partial): example documents with a large minimum anchor distance.
TriviaQA (Web)
Context: [...] 1953 London, England ) was an English contralto singer * who achieved an international reputation with a repertoire extending from folksong and popular ballads to the classical works. Her death from cancer, at the height of her fame, was a shock to the musical world and particularly to the general public, which was kept in ignorance of ... (omitted)
SQuAD
Question: What did Mote think the Yuan class system really represented?
Context: The historian Frederick W. Mote wrote that the usage of the term "social classes" for this system was misleading and that the position of people within the four-class system * was not an indication of their actual social power and wealth, but just entailed "degrees of privilege" to which they were entitled institutionally and legally, so a person's standing within the classes was not a guarantee of their standing, ... (omitted)
* A word with an asterisk indicates an anchor word closest to the ground truth answer.
Table 6: Ablation study of replacing an encoder block with DEBS in the co-attention layer (C), the memory controller (M), and the prediction layer (P) in SQuAD. A marked entry means that DEBS is used. Otherwise, BiGRU is used.

Related Work
Numerous neural network-based methods have been proposed, pushing the performance nearly up to the human level. Although slight differences exist, Wang et al. (2017), Seo et al. (2017), and Xiong et al. (2017) mostly leverage question-document co-attention based on the pairwise similarity of word-level vector representations. These models currently work as the backbone architecture for many other models. Furthermore, Wang et al. (2017) suggest utilizing a self-attention mechanism between tokens within a document to refine contextual representations. Salant [...] Enriching the input representation from pretrained external models has been shown to be useful in improving RC task performance. Yu et al. (2018a) have also improved the performance by leveraging self attention for context encoding based on convolutional neural networks. Hu et al. (2017) refine the contextual representation with multiple hops, and Pan et al. (2017) use the encoded query as a memory for refining the answer prediction; both differ from our work in how they handle long-range dependency.

Conclusion
This paper proposed two novel, crucial components for deep neural network-based RC tasks: (1) an advanced memory controller architecture and (2) a densely connected encoder block with self attention. We showed the effectiveness of these approaches in handling long-range dependencies using three benchmark RC datasets: TriviaQA, QUASAR-T, and SQuAD. Our proposed modules are widely applicable to other models to improve their performance. Future work includes developing a scalable read/write access mechanism to handle a large-scale external memory and reason over multiple documents.