Learning and Knowledge Transfer with Memory Networks for Machine Comprehension

Enabling machines to read and comprehend unstructured text remains an unfulfilled goal for NLP research. Recent research efforts on the “machine comprehension” task have managed to achieve close to ideal performance on simulated data. However, achieving similar levels of performance on small real-world datasets has proved difficult; major challenges stem from the large vocabulary size, complex grammar, and the frequent ambiguities in linguistic structure. At the same time, the human-generated annotations required for training, which must cover a sufficiently diverse set of questions, are prohibitively expensive. Motivated by these practical issues, we propose a novel curriculum inspired training procedure for Memory Networks to improve performance on machine comprehension with relatively small volumes of training data. Additionally, we explore various training regimes for Memory Networks to allow knowledge transfer from a closely related domain having larger volumes of labelled data. We also suggest the use of a loss function to incorporate the asymmetric nature of knowledge transfer. Our experiments demonstrate improvements on the Dailymail, CNN, and MCTest datasets.


Introduction
A long-standing goal of NLP is to imbue machines with the ability to comprehend text and answer natural language questions. The goal is still distant and yet generates a tremendous amount of interest due to the large number of potential NLP applications that are currently stymied by their inability to deal with unstructured text. Also, the next generation of search engines is aiming to provide precise and semantically relevant answers in response to questions-as-queries, similar to the functionality of digital assistants like Cortana and Siri. This will require text understanding at a non-superficial level, in addition to reasoning and making complex inferences about the text.
As pointed out by [...], the Question Answering (QA) task on unstructured text is a sound benchmark on which to evaluate machine comprehension. The authors also introduced bAbI: a simulation dataset for QA with multiple toy tasks. These toy tasks require a machine to perform simple induction, deduction, multiple chaining of facts, and complex reasoning, which makes them a sound benchmark to measure progress towards AI-complete QA. The recently proposed Memory Network architecture and its variants have achieved close to ideal performance, i.e., more than 95% accuracy on 16 out of a total of 20 QA tasks (Sukhbaatar et al., 2015). While this performance is impressive, and indicates that the memory network has sufficient capacity for the machine comprehension task, it does not translate to real-world text (Hill et al., 2016). Challenges in real-world datasets stem from the much larger vocabulary, the complex grammar, and the often ambiguous linguistic structure, all of which impede high levels of generalization performance, especially with small datasets. For instance, the empirical results reported by Hill et al. (2016) show that an end-to-end memory network with a single hop surpasses the performance achieved using multiple hops (i.e., higher capacity), when the model is trained with a simple heuristic. Similarly, Tapaswi et al. (2015) show that a memory network heavily overfits on the MovieQA dataset and yields near random performance. These results suggest that achieving good performance may not always be merely a matter of training high-capacity models with large volumes of data. In addition to exploring new models, there is a pressing need for innovative training methods, especially when dealing with real-world, sparsely labelled datasets.
With the advent of deep learning, the state-of-the-art performance for various semantic NLP tasks has seen a significant boost (Collobert and Weston, 2008). However, most of these techniques are data-hungry and require a large number of sufficiently diverse labeled training samples; e.g., for QA, training samples should not only encompass the entire range of possible questions but also be available in sufficient quantity (Bordes et al., 2015). Generating annotations for training deep models requires a tremendous amount of manual effort and is often too expensive. Hence, it is necessary to develop effective techniques to exploit data from a related domain in order to reduce dependence on annotations. Recently, Memory Networks have been successfully applied to QA and dialogue systems to work with a variety of disparate data sources such as movies, images, structured text, and unstructured text (Weston, 2016; Tapaswi et al., 2015; Bordes et al., 2015). Inspired by the recent success of Memory Networks, we study methods to train memory networks with small datasets by allowing for knowledge transfer from related domains where labelled data is more abundantly available.
The focus of this paper is to improve the generalization performance of memory networks via an improved learning procedure for small real-world datasets and knowledge transfer from a related domain. In the process, this paper makes the following major contributions:
(i) A curriculum inspired training procedure for memory networks is introduced, which yields superior performance with smaller datasets.
(ii) Knowledge transfer methods such as pre-training, joint-training, and the proposed curriculum joint-training are explored with a related domain having abundant labeled data.
(iii) A modified loss function for joint-training is proposed to incorporate the asymmetric nature of knowledge transfer, and the application of a pre-trained memory network to very small datasets such as MCTest is investigated.
The remainder of the paper is organized as follows: Firstly, we provide a summary of related work in Section 2. Next in Section 3, we describe the machine comprehension task and the datasets utilized in our experiments. An introduction to memory networks for machine comprehension is presented in Section 4. Section 5 outlines the proposed methods for learning and knowledge transfer. Experimental details are provided in Section 6. We summarize our conclusions in Section 7.

Related Work
Memory Networks have been successfully applied to a broad range of NLP and machine learning tasks. These tasks include but are not limited to: performing reasoning over a simulated environment for QA, factoid and non-factoid based QA using both knowledge bases and unstructured text (Kumar et al., 2015; Hill et al., 2016; Chandar et al., 2016; Bordes et al., 2015), goal-driven dialog (Dodge et al., 2016; Weston, 2016), automatic story comprehension from both video and text (Tapaswi et al., 2015), and transferring knowledge from one knowledge base while learning to answer questions on a different knowledge base (Bordes et al., 2015). Recently, various other attention-based neural models (similar to Memory Networks) have been proposed to tackle the machine comprehension task by QA from unstructured text (Kadlec et al., 2016; Sordoni et al., 2016; Chen et al., 2016). To the best of our knowledge, knowledge transfer from one unstructured text dataset to another for machine comprehension has not yet been explored. Training deep networks is known to be a notoriously hard problem, and the success of these techniques often hinges upon achieving high generalization performance with high-capacity models (Blundell et al., 2015; Larochelle et al., 2009; Glorot and Bengio, 2010). To address this issue, curriculum learning was first introduced by Bengio et al. (2009), who showed that training with gradually increasing difficulty leads to better local minima, especially when working with non-convex loss functions. Although devising a universal curriculum strategy is hard, as even humans do not converge to one particular order in which concepts should be introduced (Rohde and Plaut, 1999), some notion of concept difficulty is normally utilized. With similar motivations, this paper makes an attempt to exploit curriculum learning for machine comprehension with a memory network.
Recently, curriculum learning has also been utilized to avoid negative transfer and make use of task relatedness for multi-task learning (Lee et al., 2016). Concurrently, Sachan and Xing (2016) have also studied curriculum learning for QA; unlike this paper, however, they do not consider learning and knowledge transfer on small real-world machine comprehension datasets in the setting of memory networks.
Pre-training & word2vec: Pre-training can often mitigate the issues that come with random initialization of network weights, by guiding the optimization process towards the basins of better local minima (Mishkin and Matas, 2016; Krahenbuhl et al., 2016; Erhan et al., 2010). Inspired by the success of pre-training as well as word2vec, this paper explores pre-training to utilize data from a related domain and also pre-trained vectors from the word2vec tool (Mikolov et al., 2013). However, finding an optimal dimension for these pre-trained vectors and the other hyper-parameters involved requires computationally expensive experiments.
Joint-training / Co-training / Multi-task learning / Domain adaptation: Previously, the utilization of common structures and similarities across different tasks / domains has been instrumental for various closely related learning tasks referred to as joint-training, co-training, multi-task learning, and domain adaptation (Collobert and Weston, 2008; Liu et al., 2015; Chen et al., 2011; Maurer et al., 2016). To mitigate this ambiguity, in this paper we limit ourselves to using "joint-training" and refrain from "co-training", since, unlike this work, co-training was initially introduced to exploit unlabelled data in the presence of a small amount of labelled data and two different and complementary views of the instances (Blum and Mitchell, 1998).
While this work looks conceptually similar, the proposed method tries to exploit information from a related domain and aims to achieve an asymmetric transfer only towards the specified domain, without any interest in the source domain, and hence should not be confused with the long-standing pioneering work on multi-task learning (Caruana, 1997). Another field of work related to this paper is domain adaptation, which appears to have two major related branches. The first branch is the recent work that has primarily focused on unsupervised domain adaptation (Nguyen and Grishman, 2015; Zhang et al., 2015), and the other is the traditional work on domain adaptation, which has focused on problems like entity recognition and not on machine comprehension and modern neural architectures (Ben-David et al., 2010; Daume III, 2007).

Machine Comprehension: Datasets and Tasks Description
Machine comprehension is the ability to read and comprehend text, i.e., understand its meaning, and can be evaluated by tasks involving the answering of questions posed on a context document. Formally, a set of tuples (q, C, S, s) is provided, where q is the question, C is the context document, S is a list of possible answers, and s indicates the correct answer. Each of q, C, and S is a sequence of words from a vocabulary V. Our aim is to train a memory network model to perform QA with small training datasets. We propose two primary ways to achieve this: 1) improve the learning procedure to obtain better models, and 2) demonstrate knowledge transfer from a related domain.

Data Description
Several corpora have been introduced for the machine comprehension task, such as MCTest-160, MCTest-500, CNN, Dailymail, and the Children's Book Test (CBT) (Richardson et al., 2013; Hermann et al., 2015; Hill et al., 2016). MCTest-160 and MCTest-500 have multiple-choice questions with associated narrative stories. Answers in these datasets can take one of these forms: a word, a phrase, or a full sentence. The remaining datasets are generated using Cloze-style questions, which are created by deleting a word from a sentence and asking the model to predict the deleted word. A place-holder token is substituted in place of the deleted word, which is also the correct answer (Hermann et al., 2015). We have created three subsets of CNN, namely CNN-11K, CNN-22K, and CNN-55K, from the entire CNN dataset, and Dailymail-55K from the Dailymail dataset. Statistics on the number of samples comprising these datasets are presented in Table 1.
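To make the Cloze construction concrete, the following is a minimal sketch of how a single question could be derived from a tokenized sentence. The token name `@placeholder` and the helper `make_cloze` are assumptions made for illustration; they are not taken from the datasets' tooling.

```python
from typing import List, Tuple

# Hypothetical placeholder token; the actual datasets may use a different string.
PLACEHOLDER = "@placeholder"


def make_cloze(sentence: List[str], answer_index: int) -> Tuple[List[str], str]:
    """Replace the token at answer_index with a placeholder.

    Returns the Cloze-style question and the deleted word (the correct answer).
    """
    if not 0 <= answer_index < len(sentence):
        raise IndexError("answer_index is outside the sentence")
    answer = sentence[answer_index]
    question = list(sentence)  # copy so the input sentence is left untouched
    question[answer_index] = PLACEHOLDER
    return question, answer


tokens = "the memory network answers the question".split()
question, answer = make_cloze(tokens, 2)
print(" ".join(question), "->", answer)
```

The deleted word serves as the label s, and the modified sentence serves as the question q in the tuple (q, C, S, s).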

Improve Learning Procedure
It has been shown in the context of language modelling that presenting the training samples in an easy-to-hard ordering shields the model from very hard samples during training, yielding faster convergence and better models. We investigate a curriculum learning inspired training procedure for memory networks to improve performance on the three subsets of the CNN dataset described below.

Demonstrate Knowledge Transfer
We plan to demonstrate knowledge transfer from Dailymail-55K to three subsets of CNN of varying sizes utilizing the proposed joint-training method.
For learning, we make use of smaller subsets of the CNN dataset. The smaller size of these subsets enables us to assess the performance boost due to knowledge transfer: as our aim is to demonstrate transfer when less labelled data is available, choosing the complete dataset would render the gains from knowledge transfer insignificant. We also demonstrate knowledge transfer for the case of the MCTest dataset using embeddings obtained after training the memory network on the CNN datasets.

End-to-end Memory Network for Machine Comprehension
End-to-end Memory Network is a recently introduced neural network model that can be trained in an end-to-end fashion, directly on the tuples (q, C, S, s), using standard back-propagation (Sukhbaatar et al., 2015). The complete training procedure can be described in three steps: i) encoding the training tuples into the contextual memory, ii) attending to context in memory to retrieve relevant information with respect to a question, and iii) predicting the answer using the retrieved information. To accomplish the first step, an embedding matrix $A \in \mathbb{R}^{p \times d}$ is used to map both question and context into a $p$-dimensional embedding space by applying the following transformations: $\vec{q} = A\Phi(q)$ and [...] In the last step, the prediction distribution $\hat{a}_i$ is computed as in Equation 2, where $U \in \mathbb{R}^{p \times d}$ is an embedding matrix similar to $A$ and can potentially be tied with $A$, and $s_i$ is one of the answers in S. Using the prediction step, a probability distribution $\hat{a}_i$ over all $s_i$ can be obtained, and the final answer is selected as the option $s_i$ with the highest probability $\hat{a}_i$.
To train a memory network, the cross-entropy loss function L between the true label distribution $a_i \in \{0, 1\}^s$ (a one-hot vector indicating the correct label s in the training tuples) and the predicted distribution $\hat{a}_i$ is used, as in Equation 3:
$$L(P, D) = -\frac{1}{N_D} \sum_{n=1}^{N_D} \left[ a_n \times \log(\hat{a}_n(P, D)) + (1 - a_n) \times \log(1 - \hat{a}_n(P, D)) \right] \tag{3}$$
where $P$, $D$, and $N_D$ represent the set of model parameters to learn, the training dataset, and the number of tuples in the training set, respectively. Such an objective can be easily optimized using stochastic gradient descent (SGD). A memory network can easily be extended to perform several hops over the memory before predicting the answer. For details, we refer to Hill et al. (2016). However, we constrain this study to a single-hop network in order to reduce the number of parameters to learn and also the chances of overfitting, as we are dealing with small-scale datasets.
Self-Supervision is a heuristic introduced to provide memory supervision; the rationale behind it is that if the memory supporting the correct answer is retrieved, then the model is more likely to predict the correct answer (Hill et al., 2016). More precisely, this is achieved by keeping a hard attention over memory while training, i.e., $m_o = \arg\max \alpha_i$. At each step of SGD, the model computes $m_o$ and updates only using those examples which do not select the memory $m_o$ having the correct answer in the corresponding $c_i$.
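The three steps of a single-hop network can be sketched in plain Python as below. The attention and answer-scoring formulas are assumptions: they follow the common end-to-end memory network formulation (softmax over inner products, candidates scored against the sum of the query and the retrieved memory) and are not copied from Equations 1 and 2 of this paper. All function names are hypothetical, and the identity embeddings are chosen only to keep the toy example readable.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[Vector]


def bag_of_words(tokens: Sequence[str], vocab: Sequence[str]) -> Vector:
    """Phi: map a token sequence to a d-dimensional count vector (d = len(vocab))."""
    index = {word: i for i, word in enumerate(vocab)}
    counts = [0.0] * len(vocab)
    for token in tokens:
        if token in index:
            counts[index[token]] += 1.0
    return counts


def embed(matrix: Matrix, tokens: Sequence[str], vocab: Sequence[str]) -> Vector:
    """Apply a p x d embedding matrix to the bag-of-words vector of the tokens."""
    phi = bag_of_words(tokens, vocab)
    return [sum(w * x for w, x in zip(row, phi)) for row in matrix]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: Sequence[float]) -> Vector:
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def single_hop(A: Matrix, U: Matrix, question, context, candidates, vocab,
               hard_attention: bool = False) -> Tuple[Vector, Vector]:
    # Step i: encode the question and each context item with A.
    q = embed(A, question, vocab)
    memories = [embed(A, c, vocab) for c in context]
    # Step ii: attention over memories (assumed: softmax of inner products).
    alpha = softmax([dot(m, q) for m in memories])
    if hard_attention:
        # Hard attention as used by self-supervision: keep only the top memory.
        best = max(range(len(alpha)), key=lambda i: alpha[i])
        m_o = memories[best]
    else:
        m_o = [sum(a * m[k] for a, m in zip(alpha, memories)) for k in range(len(q))]
    # Step iii: score each candidate answer (assumed form) against q + m_o.
    state = [x + y for x, y in zip(q, m_o)]
    a_hat = softmax([dot(embed(U, s, vocab), state) for s in candidates])
    return alpha, a_hat


vocab = ["mary", "went", "kitchen", "john", "garden", "where"]
identity = [[1.0 if i == j else 0.0 for j in range(len(vocab))] for i in range(len(vocab))]
context = [["mary", "went", "kitchen"], ["john", "went", "garden"]]
question = ["where", "mary"]
candidates = [["kitchen"], ["garden"]]
alpha, a_hat = single_hop(identity, identity, question, context, candidates, vocab)
hard_alpha, hard_a_hat = single_hop(identity, identity, question, context,
                                    candidates, vocab, hard_attention=True)
```

In this toy setting the first context sentence shares a word with the question, so it receives the larger attention weight and the candidate it contains receives the larger probability.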

Proposed Methods
We attempt to improve the training procedure for Memory Networks in order to increase the performance for machine comprehension by QA with small-scale datasets. Firstly, we introduce an improved training procedure for memory networks using curriculum learning, which is termed Curriculum Inspired Training (CIT), and offer details about it in Section 5.1. Thereafter, Section 5.2 explains the joint-training method for knowledge transfer from an abundantly labelled dataset to another dataset with limited label information.

CIT: Curriculum Inspired Training
Curriculum learning makes use of the fact that model performance can be significantly improved if the training samples are not presented randomly but in such a way as to make the learning task gradually more difficult, by presenting examples in an easy-to-hard ordering. Such a training procedure allows the learner to waste less time on noisy or hard-to-predict data when the model is not ready to incorporate such samples. However, what remains unanswered, and is left as a matter of further exploration, is how to devise an effective strategy for a given task.
In this work, we formulate a curriculum strategy to train a memory network for machine comprehension. Formally, we rank training tuples (q, C, S, s) from easy to hard based on the normalized word frequency for passage, question, and context initially, using the score function (SF) mentioned in Equation 4 (i.e., easier passages have more frequent words). The training data is then divided into a fixed number of chapters, with each successive chapter adding more difficult tuples. The model is then trained sequentially on each chapter, with the final chapter containing the complete training data. Having both the number of chapters and the number of epochs per chapter as parameters makes such a strategy flexible and allows it to be tailored to different data by optimizing them like other hyper-parameters.
The loss function used for curriculum inspired training varies with the epoch number, as given in Equation 5:
$$L(P, D, en) = -\frac{1}{N_D} \sum_{n=1}^{N_D} \left[ a_n \times \log(\hat{a}_n(P, D)) + (1 - a_n) \times \log(1 - \hat{a}_n(P, D)) \right] \times \mathbb{1}(en, c(n) \times epc) \tag{5}$$
In Equation 5, $en$ is the current epoch number and $c(n)$ is the chapter number of the $n$-th tuple, assigned using the rank allocated based on the SF mentioned in Equation 4. $epc$, $P$, and $D$ are the number of epochs per chapter, the model parameters, and the training set, respectively, and $\mathbb{1}$ is an indicator function which is one if its first argument is greater than or equal to its second argument and zero otherwise.
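The ranking, chapter assignment, and indicator described above can be sketched as follows. Since the exact score function of Equation 4 is not reproduced in this text, the `easiness` function below (mean normalized word frequency) is a hypothetical stand-in. Zero-based chapter and epoch numbering is likewise an assumption, chosen so that the easiest chapter is active from the first epoch.

```python
from collections import Counter
from typing import Dict, List, Sequence


def word_frequencies(samples: Sequence[Sequence[str]]) -> Dict[str, float]:
    """Normalized frequency of every token over the whole training set."""
    counts = Counter(tok for sample in samples for tok in sample)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def easiness(sample: Sequence[str], freq: Dict[str, float]) -> float:
    """Hypothetical stand-in for SF: mean normalized frequency (higher = easier)."""
    return sum(freq[tok] for tok in sample) / len(sample)


def assign_chapters(samples: Sequence[Sequence[str]], num_chapters: int) -> List[int]:
    """Rank samples from easy to hard and split the ranking into equal chapters."""
    freq = word_frequencies(samples)
    order = sorted(range(len(samples)),
                   key=lambda i: easiness(samples[i], freq), reverse=True)
    chapters = [0] * len(samples)
    for rank, idx in enumerate(order):
        chapters[idx] = rank * num_chapters // len(samples)
    return chapters


def active_mask(chapters: Sequence[int], epoch: int, epochs_per_chapter: int) -> List[int]:
    """Indicator 1(en, c(n) * epc): 1 when epoch >= chapter * epochs_per_chapter."""
    return [1 if epoch >= c * epochs_per_chapter else 0 for c in chapters]


# Each sample stands for the flattened tokens of one training tuple.
samples = [
    ["the", "cat", "sat"],
    ["the", "the", "cat"],
    ["quantum", "chromodynamics", "lattice"],
    ["the", "cat", "lattice"],
]
chapters = assign_chapters(samples, num_chapters=2)
first_epoch = active_mask(chapters, epoch=0, epochs_per_chapter=2)
later_epoch = active_mask(chapters, epoch=2, epochs_per_chapter=2)
```

Multiplying each per-sample loss term by its mask entry reproduces the effect of the indicator: harder tuples contribute nothing until their chapter is reached, and the final chapter covers the complete training set.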

Joint-Training for Knowledge Transfer
While joint-training methods offer knowledge transfer by exploiting similarities and regularities across different tasks or datasets, the asymmetric nature of transfer and the skewed proportion of datasets are usually not handled in a sound way. Here, we devise a training loss function $\hat{L}$ to address both of these issues while doing joint-training with a target dataset (TD) with fewer training samples and a source dataset (SD) having label information for a larger number of examples, as mentioned in Equation 6.
Here, $\hat{L}$ represents the devised loss function for joint-training for transfer, $L$ is the cross-entropy loss function introduced earlier in Equation 3, $\gamma$ is a weighting factor that varies between zero and one, and $F(N_{TD}, N_{SD})$ is another weighting factor that is a function of the number of samples in the target domain $N_{TD}$ and in the source domain $N_{SD}$. The rationale behind the $\gamma$ factor is to control the relative update in the network due to samples from the source and target datasets, which permits biasing the model performance towards one dataset. The $F(N_{TD}, N_{SD})$ factor can be independently utilized to mitigate the effect of the skewed proportion in the number of samples present in the target and source domains. Note that maintaining $\gamma$ and $F(N_{TD}, N_{SD})$ as separate parameters allows for restricting $\gamma$ within (0, 1) without any extra computation, as described below.
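The roles of the two weighting factors can be illustrated with a small sketch. Equation 6 itself is not reproduced in this text, so the weighted combination below is an assumed form that merely matches the verbal description; `joint_loss` and `size_ratio` are hypothetical names, and the ratio $N_{TD} / N_{SD}$ is only one possible choice of $F$.

```python
def size_ratio(n_td: int, n_sd: int) -> float:
    """Hypothetical choice of F(N_TD, N_SD): down-weight the larger source set."""
    return n_td / n_sd


def joint_loss(loss_td: float, loss_sd: float, gamma: float, f: float = 1.0) -> float:
    """Assumed combination: gamma weights the target loss, (1 - gamma) * f the source loss."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie between zero and one")
    return gamma * loss_td + (1.0 - gamma) * f * loss_sd


# Bias the update towards the target dataset while compensating for its smaller size.
combined = joint_loss(loss_td=0.8, loss_sd=0.4, gamma=0.75,
                      f=size_ratio(11000, 55000))
target_only = joint_loss(loss_td=0.8, loss_sd=0.4, gamma=1.0)
```

Under this assumed form, setting gamma to one ignores the source dataset entirely, while values closer to zero let the source dataset dominate the update.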

Improved Loss Functions
This paper explores the following variants of the introduced loss function $\hat{L}$ for knowledge transfer via joint-training: [...] The $F(N_{TD}, N_{SD})$ factor does not increase computation, as it is not optimized in any of the cases. Jo-Train (Liu et al., 2015), SrcOnly, and a method similar to W+Jo-Train (Daume III, 2007) have also been explored previously for other NLP tasks and models.

Experiments
We evaluate the performance on the datasets introduced earlier in Section 3. We first present baseline methods, pre-processing, and training details. In Section 6.3, we present results on CNN-11/22/55K, MCTest-160, and MCTest-500 to validate the claims made in Section 1. All of the methods presented here are implemented in Theano (Bastien et al., 2012) and Lasagne (Dieleman et al., 2015) and are run on a single GPU (Tesla K40c) server with 500GB of memory.

Baseline Methods
We implemented Sliding Window (SW) and Sliding Window + Distance (SW+D) (Richardson et al., 2013) as baselines to compare against our experiments. Further, we augment SW (or SW+D) to incorporate distances between word vectors of the question and the context over the sliding window, in a manner similar to the way SW+D is augmented from SW by Richardson et al. (2013). These approaches are named based upon the source of pre-trained word vectors, e.g., SW+D+CNN-11K+W2V utilizes vectors estimated from both CNN-11K and word2vec pre-trained vectors. In the case of more than one source, individual distances are summed and utilized for final scoring. Results on MCTest for SW, SW+D, and their augmented approaches are reported using the scores available online for all answers.
Meaningful Comparisons: To ascertain that the improvement is due to the proposed training methods, and not merely because of the addition of more data, we built multiple baselines, namely, initialization using word vectors from word2vec, pre-training, Jo-train, and SrcOnly. For pre-training and word2vec, words that appear in the target dataset but not in the source dataset are initialized by uniform random sampling, with the limits set to the extremes spanned by the word vectors in the source domain. It is worth noting that pre-training and Jo-train utilize as much label information and data as the other proposed variants of joint-training. Also, the SrcOnly method is indicative of how much direct knowledge transfer from the source domain to the target domain can be achieved without any learning.
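The initialization of words missing from the source domain can be sketched as below. Whether the extremes are taken globally or per dimension is not stated, so the global minimum and maximum used here are an assumption, and `init_missing_vectors` is a hypothetical helper name.

```python
import random
from typing import Dict, Iterable, List


def init_missing_vectors(source_vectors: Dict[str, List[float]],
                         target_vocab: Iterable[str],
                         seed: int = 0) -> Dict[str, List[float]]:
    """Copy known vectors; sample unknown ones uniformly within the source range."""
    rng = random.Random(seed)
    dim = len(next(iter(source_vectors.values())))
    # Assumption: the limits are the global extremes over all source vectors.
    low = min(min(vec) for vec in source_vectors.values())
    high = max(max(vec) for vec in source_vectors.values())
    vectors: Dict[str, List[float]] = {}
    for word in target_vocab:
        if word in source_vectors:
            vectors[word] = list(source_vectors[word])
        else:
            vectors[word] = [rng.uniform(low, high) for _ in range(dim)]
    return vectors


source = {"news": [0.5, -0.2, 0.1], "story": [-0.4, 0.3, 0.0]}
target = init_missing_vectors(source, ["news", "picnic"])
```

Words shared with the source keep their learned vectors, while new words start at a scale comparable to the rest of the embedding matrix.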

Pre-processing & Training Details
While processing data, we replace words occurring fewer than 5 times with the <unk> token, except for the MCTest datasets. Additionally, all entities are included in the vocabulary. All models are trained using SGD with learning rate in $\{10^{-4}, 10^{-3}\}$, momentum set to 0.9, weight decay in $\{10^{-5}, 10^{-4}\}$, and max norm in {1, 10, 40}. We keep the window length equal to 5 for the CNN / Dailymail datasets (Hill et al., 2016). In the case of curriculum learning, the number of chapters is optimized over {3, 5, 8, 10} and the number of epochs per chapter is set equal to $\frac{2M}{M+1} \times \frac{ed_{ncl}}{ed_{cl}} \times EN$, which is estimated by equating the number of network updates to that found for the optimal case of non-curriculum learning. Here $M$ and $ed_{cl}$ represent the number of chapters and the embedding size for curriculum learning, and $ed_{ncl}$ and $EN$ represent the optimal embedding size and number of epochs found without curriculum learning. We use early stopping with a validation set while training the network.
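The rare-word replacement step can be sketched as follows. The `keep` predicate used to retain entities, and the `@entity` prefix in the example, are assumptions made for illustration; the function name `replace_rare_words` is hypothetical.

```python
from collections import Counter
from typing import Callable, List, Sequence

UNK = "<unk>"


def replace_rare_words(documents: Sequence[Sequence[str]],
                       min_count: int = 5,
                       keep: Callable[[str], bool] = lambda tok: False) -> List[List[str]]:
    """Replace tokens seen fewer than min_count times, unless keep(token) is true."""
    counts = Counter(tok for doc in documents for tok in doc)
    return [
        [tok if counts[tok] >= min_count or keep(tok) else UNK for tok in doc]
        for doc in documents
    ]


docs = [["the"] * 5 + ["rare", "@entity1"]]
# Assumed convention: entity tokens carry an "@entity" prefix and are always kept.
cleaned = replace_rare_words(docs, min_count=5,
                             keep=lambda tok: tok.startswith("@entity"))
```

Counting over the whole corpus before rewriting any document keeps the decision for each word consistent across documents.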

Results & Discussion
In this section, we present results to validate the contributions mentioned in Section 1. Table 2 presents the results of our approaches along with results from the baseline methods SW, SW+D, SW+W2V, and a standard memory network (MemNN). Results for CIT on CNN-11/22/55K show an absolute improvement of 2.96%, 1.31%, and 1.00% respectively, when compared with the memory network (MemNN) (contribution (i)). Figure 1 shows that CIT leads to better convergence on CNN-11K compared with training without CIT.
Results empirically support the major premise of this study, i.e., that CIT and knowledge transfer from a related dataset can significantly improve the performance of memory networks; improvements of 10.94%, 6.28%, and 3.19% are observed on CNN-11/22/55K respectively when compared with the standard memory network. The improvement from knowledge transfer decreases as the amount of data in the target domain increases from 11K to 55K, since the volume of data in the target domain becomes comparable to that of the source domain and is enough to achieve a similar level of performance without knowledge transfer.
Previously, Chen et al. (2016) annotated a sample of 100 questions on CNN stories based on the type of capabilities required to answer the question. We report results for all 6 specific categories in Table 3. Even with CNN-11K and Dailymail-55K, which is roughly 20% of the complete CNN dataset, the proposed methods achieve similar performance on 4 out of 6 categories when compared to the latest models (second- and third-to-last rows of Table 3).
On very small datasets such as MCTest-160 and MCTest-500, it is not feasible to train a memory network (Smith et al., 2015); therefore, we explore the use of word vectors from the embedding matrix of a model pre-trained on the CNN datasets. Here, the embedding matrix refers to the encoding matrix A used in the first step of the memory network, as mentioned in Section 4. SW+D+CNN-11/22/55K denotes the results when the similarity measure comes from SW+D, as mentioned in Section 6.1, together with the word vectors from the encoding matrix A obtained after training on CNN-11/22/55K. From Table 4, it is evident that performance improves as the amount of data in the CNN domain increases (contribution (iii)). Further, on combining with the word2vec distance (SW+D+CNN-11/22/55K+W2V), an improvement is observed.

Conclusion
Looking at the widespread applications of Memory Networks and the prohibitive data requirements for training them, this paper seeks to improve the performance of memory networks on small datasets in two different ways. Firstly, this paper introduces an effective CIT procedure for machine comprehension. Secondly, this paper explores various methods to exploit labelled data from closely related domains; in order to perform knowledge transfer and improve performance. Additionally, this paper suggests the use of a modified loss function to further incorporate the asymmetric nature of knowledge transfer. Beyond machine comprehension, we believe that the proposed methods are likely to achieve higher generalization for other tasks utilizing memory network style architectures, by virtue of the proposed CIT method and joint-training for knowledge transfer.