Meta-Learning Improves Lifelong Relation Extraction

Most existing relation extraction models assume a fixed set of relations and are unable to exploit newly available supervision data to extract new relations. Alleviating this problem requires approaches that make relation extraction models capable of continual adaptation and learning. We investigate and present results for such an approach, based on a combination of ideas from lifelong learning and optimization-based meta-learning. We evaluate the proposed approach on two recent lifelong relation extraction benchmarks, and demonstrate that it markedly outperforms current state-of-the-art approaches.


Introduction
The majority of existing supervised relation extraction models can only extract a fixed set of relations specified at training time. They are unable to detect an evolving set of novel relations observed after training without substantial retraining, which can be computationally expensive and may lead to catastrophic forgetting of previously learned relations. Zero-shot relation extraction approaches (Rocktäschel et al., 2015; Demeester et al., 2016; Levy et al., 2017; Obamuyide and Vlachos, 2018) can extract unseen relations, but at lower performance levels, and are unable to continually exploit newly available supervision to improve performance without considerable retraining. These limitations also extend to approaches for extracting relations in other limited-supervision settings, for instance the one-shot setting (Obamuyide and Vlachos, 2017). It is therefore desirable for relation extraction models to be able to learn continually without catastrophic forgetting of previously learned relations. This would enable them to exploit newly available supervision both to identify novel relations and to improve performance without substantial retraining.
Recently, Wang et al. (2019) introduced an embedding alignment approach to enable continual learning for relation extraction models. They consider a setting with streaming tasks, where each task consists of a number of distinct relations, and propose to align the representations of relation instances in the embedding space to enable continual learning of new relations without forgetting knowledge from past relations. While they obtained promising results, a key weakness of the approach is that the alignment model introduces additional parameters to already overparameterized relation extraction models, which may in turn increase the quantity of supervision required for training. In addition, the approach can only align embeddings between observed relations, and has no explicit objective that encourages the model to transfer and exploit knowledge gathered from previously observed relations to facilitate the efficient learning of relations yet to be observed.
In this work, we extend the work of Wang et al. (2019) by exploiting ideas from both lifelong learning and meta-learning. We propose to consider lifelong relation extraction as a meta-learning challenge, to which the machinery of current optimization-based meta-learning algorithms can be applied. Unlike the use of a separate alignment model as proposed in Wang et al. (2019), the proposed approach does not introduce additional parameters. In addition, the proposed approach is more data efficient since it explicitly optimizes for the transfer of knowledge from past relations, while avoiding the catastrophic forgetting of previously learned relations. Empirically, we evaluate on lifelong versions of the datasets by Bordes et al. (2015) and Han et al. (2018) and demonstrate considerable performance improvements over prior state-of-the-art approaches.

Background
Lifelong Learning In the lifelong learning setting, also referred to as continual learning (Ring, 1994; Thrun, 1996; Zhao and Schmidhuber, 1996), a model f_θ is presented with a sequence of tasks {T_t}, t = 1, 2, ..., T, one task per round, and the goal is to learn model parameters {θ_t} with the best performance on the observed tasks. Each task T can be a conventional supervised task with its own distinct train (T^train), development (T^dev) and test (T^test) splits. At each round t, the model is allowed to exploit knowledge gained from the previous t − 1 tasks to enhance performance on the current task. In addition, the model is also allowed a small buffer memory B, which can be used to store a limited amount of data from previously observed tasks. A prominent line of work in lifelong learning research develops approaches that enable models to learn new tasks without forgetting knowledge from previous tasks, i.e. avoiding catastrophic forgetting of old tasks (McCloskey and Cohen, 1989; Ratcliff, 1990; McClelland et al., 1995; French, 1999). Approaches proposed to address this problem include memory-based approaches (Lopez-Paz and Ranzato, 2017; Rebuffi et al., 2017; Chaudhry et al., 2019); parameter consolidation approaches (Kirkpatrick et al., 2017; Zenke et al., 2017); and dynamic model architecture approaches (Xiao et al., 2014; Rusu et al., 2016; Fernando et al., 2017).
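The protocol above (a stream of tasks plus a bounded buffer memory B replayed alongside each new task) can be sketched as follows. This is an illustrative skeleton, not the paper's implementation; `train_step` and the exemplar choice are placeholders.

```python
import random

def lifelong_train(tasks, train_step, buffer_size=50, replay_k=10):
    """Minimal lifelong-learning loop with a bounded replay buffer.

    `tasks` is a list of training sets (one per round); `train_step`
    updates the model on a batch and is assumed to be supplied by the
    caller. Names here are illustrative placeholders.
    """
    buffer = []  # exemplars from previously observed tasks
    for task_data in tasks:
        # augment the current task's data with replayed exemplars
        replay = random.sample(buffer, min(replay_k, len(buffer)))
        train_step(task_data + replay)
        # store a few exemplars of the current task (naive choice here)
        exemplars = task_data[:buffer_size]
        buffer = (buffer + exemplars)[-10 * buffer_size:]  # keep bounded
    return buffer
```

In the paper's setting the exemplars stored per task are chosen by clustering rather than truncation; the loop structure is the same.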
Meta-Learning Meta-learning, or learning to learn (Schmidhuber, 1987; Naik and Mammone, 1992; Thrun and Pratt, 1998), aims to develop algorithms that learn generic knowledge of how to solve tasks from a given distribution of tasks, by generalizing from solving related tasks from that distribution. Given tasks T sampled from a distribution of tasks p(T), and a learner model f(x; θ) parameterized by θ, gradient-based meta-learning methods, such as MAML (Finn et al., 2017), learn a prior initialization of the parameters of the model which, at meta-test time, can be quickly adapted to achieve good performance on a new task using a few steps of gradient descent. During adaptation to a new task, the model parameters θ are updated to task-specific parameters θ′ with good performance on that task. Formally, these meta-learning algorithms optimize the meta-objective:

    min_θ  E_{T ∼ p(T)} [ L_T( U(θ, D_T) ) ]        (1)

where L_T is the loss and D_T is training data from task T, and U is a fixed gradient descent learning rule, such as vanilla SGD. While these algorithms were proposed and evaluated in the context of few-shot learning, here we demonstrate their effectiveness when utilized in the lifelong learning setting for relation extraction, following similar intuition as recent work by Finn et al. (2019).
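To make the meta-objective in Equation 1 concrete, the following toy sketch (ours, not the paper's code) uses a one-dimensional family of tasks with loss L_T(θ) = (θ − c_T)², and takes U to be a single step of vanilla SGD. The meta-loss evaluates each task's loss at the *adapted* parameters, so minimizing it favors an initialization that adapts well.

```python
import numpy as np

def inner_update(theta, c, alpha=0.1):
    """U(θ, D_T): one SGD step on task T's loss L_T(θ) = (θ - c)^2."""
    grad = 2.0 * (theta - c)
    return theta - alpha * grad

def meta_loss(theta, task_centers, alpha=0.1):
    """E_T [ L_T( U(θ, D_T) ) ], averaged over a sample of tasks."""
    adapted = [inner_update(theta, c, alpha) for c in task_centers]
    return float(np.mean([(a - c) ** 2 for a, c in zip(adapted, task_centers)]))
```

For this symmetric quadratic family the meta-optimal initialization is the mean of the task optima, matching the intuition that the learned prior sits "close" to all tasks in the distribution.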

Meta-Learning for Lifelong Relation Extraction
It can be inferred from the previous section that much lifelong learning research has focused on approaches to avoid catastrophic forgetting (i.e. negative backward transfer of knowledge), while recent meta-learning studies have focused on effective approaches for positive forward transfer of knowledge (for few-shot tasks). Given the complementary strengths of approaches from the two learning settings, we propose to embed meta-learning into the lifelong learning process for relation extraction. While we could utilize the MAML algorithm to directly optimize the meta-objective in Equation 1 for our purpose, doing so requires the computation of second-order derivatives, which can be computationally expensive. Nichol et al. (2018) proposed REPTILE, a first-order alternative to MAML, which uses only first-order derivatives. Similar to MAML, REPTILE works by repeatedly sampling tasks, training on those tasks and moving the initialization towards the adapted weights on those tasks. Here we adopt the REPTILE algorithm for meta-learning. Our algorithm for lifelong relation extraction is illustrated in Algorithm 1.
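The REPTILE meta-update just described can be written in a few lines. This is our sketch of the rule from Nichol et al. (2018), not the paper's implementation; `sgd_steps` stands in for a few inner SGD steps on one task and is assumed to be supplied by the caller.

```python
import numpy as np

def reptile_meta_step(theta, task_batches, sgd_steps, epsilon=0.1):
    """One REPTILE meta-update: adapt a copy of θ on each sampled task
    with a few SGD steps, then move θ toward the average of the
    adapted weights by step size epsilon."""
    adapted = [sgd_steps(theta.copy(), batch) for batch in task_batches]
    theta_tilde = np.mean(adapted, axis=0)
    return theta + epsilon * (theta_tilde - theta)
```

Because the update only needs the adapted weights themselves, no second-order derivatives are required, which is what makes REPTILE cheaper than MAML.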
We start by randomly initializing the parameters of the relation extraction model (the learner) (line 1). Then, as new tasks arrive, we augment their training sets with randomly sampled task exemplars from the buffer memory B (lines 2-9). We then sample a batch of relations from the augmented training set (line 10). For each sampled relation R_i, we sample a batch of supervision instances D^train_{R_i} from its training set (lines 11-12).
We then obtain the adapted model parameters θ^i_t on the relation by first computing the gradient of the training loss on the sampled relation instances (line 13) and updating the parameters with a gradient-based optimization algorithm (such as SGD or Adagrad (Duchi et al., 2011)) (line 14). At the end of the learning iteration, the adapted parameters on all sampled relations in the batch are averaged, and an update is made on the task parameters θ_t (line 16). This is repeated until convergence on the current task, after which exemplars of the current task are added to the buffer memory (line 18). Task exemplars are obtained by first clustering all training instances of the current task into 50 clusters using K-Means, then selecting from each cluster the instance whose representation is closest to the cluster prototype. Finally, the model parameters are updated to the current task's adapted parameters (line 19).
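The exemplar-selection step above (cluster the task's instance representations, then keep the instance nearest each cluster prototype) can be sketched as follows. This is a plain-numpy illustration with a minimal K-Means loop; the paper uses 50 clusters per task, and the instance representations are assumed to be given.

```python
import numpy as np

def select_exemplars(reps, k=50, iters=20, seed=0):
    """Return indices of one exemplar per K-Means cluster: the
    instance whose representation is closest to the cluster
    prototype (centroid)."""
    rng = np.random.default_rng(seed)
    k = min(k, len(reps))
    centroids = reps[rng.choice(len(reps), k, replace=False)]
    for _ in range(iters):
        # assign each instance to its nearest centroid
        d = np.linalg.norm(reps[:, None] - centroids[None], axis=-1)
        labels = d.argmin(axis=1)
        for j in range(k):  # recompute prototypes
            if np.any(labels == j):
                centroids[j] = reps[labels == j].mean(axis=0)
    # nearest instance to each final prototype
    d = np.linalg.norm(reps[:, None] - centroids[None], axis=-1)
    return sorted(set(d.argmin(axis=0)))
```

Choosing real instances near the prototypes (rather than storing the prototypes themselves) keeps the buffer memory populated with genuine, replayable training examples.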

Relation Classification Model
In principle the learner model f_θ could be any gradient-optimized relation extraction model. To use the same number of parameters and ensure a fair comparison to Wang et al. (2019), we adopt as the relation extraction model f_θ the Hierarchical Residual BiLSTM (HR-BiLSTM) model of Yu et al. (2017), the same model used by Wang et al. (2019) in their experiments. The HR-BiLSTM is a relation classifier which takes as input a sentence and candidate relations, uses two Bidirectional Long Short-Term Memory (BiLSTM) units (Hochreiter and Schmidhuber, 1997; Graves and Schmidhuber, 2005) with shared parameters to process the GloVe (Pennington et al., 2014) embeddings of the words in the sentence and in the relation names, and then selects the relation with the maximum cosine similarity to the sentence as its response.
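The final scoring step of such a classifier can be illustrated as below. This is only the cosine-similarity selection over precomputed encodings, not the HR-BiLSTM itself; the sentence and relation encoders are assumed to be given.

```python
import numpy as np

def select_relation(sentence_vec, relation_vecs):
    """Return the index of the candidate relation whose encoding has
    maximum cosine similarity with the sentence encoding."""
    s = sentence_vec / np.linalg.norm(sentence_vec)
    r = relation_vecs / np.linalg.norm(relation_vecs, axis=1, keepdims=True)
    return int(np.argmax(r @ s))
```

Scoring by cosine similarity (rather than a softmax over a fixed label set) is what lets the classifier handle a changing set of candidate relations across tasks.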
Hyperparameters Apart from the hyperparameters specific to meta-learning (such as the step size), all other hyperparameters we use for the learner model are the same as those used by Wang et al. (2019). We also use the same buffer memory size (50) for each task. Note that the meta-learning algorithm uses SGD as the update rule U, and does not add any additional trainable parameters to the learner model.

Setup
We conduct experiments in two settings. In the full supervision setting, we provide all models with all the supervision available in the training set of each task. In the second, we limit the amount of supervision for each task to measure how well the models cope with limited supervision. Each experiment is run five times and we report the average result.

Datasets
We conduct experiments on the Lifelong FewRel and Lifelong SimpleQuestions datasets, both introduced in Wang et al. (2019). Lifelong FewRel is derived from the FewRel (Han et al., 2018) dataset, by partitioning its 80 relations into 10 distinct clusters of 8 relations each, with each cluster serving as a task where a sentence must be labeled with the correct relation. The 8 relations in each cluster were obtained by clustering the averaged GloVe word embeddings of the relation names in the FewRel dataset. Each instance of the dataset contains a sentence, the relation it expresses and a set of randomly sampled negative relations. Lifelong SimpleQuestions was similarly obtained from the SimpleQuestions (Bordes et al., 2015) dataset, and is made up of 20 clusters of relations, with each cluster serving as a task.

Evaluation Metrics
We report two measures, ACC_whole and ACC_avg, both introduced in Wang et al. (2019). ACC_whole measures accuracy on the test sets of all tasks and gives a balanced measure of model performance on both observed (seen) and unobserved (unseen) tasks; it is the primary metric we report for all experiments. We also report ACC_avg, which measures the average accuracy on the test sets of only the observed (seen) tasks.
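Given per-task test accuracies, the two metrics can be computed as below. This sketch assumes equal-sized test sets per task (so that averaging per-task accuracies equals pooled accuracy), which is our simplifying assumption rather than a stated property of the benchmarks.

```python
def acc_metrics(task_accuracies, seen_upto):
    """ACC_whole: mean accuracy over the test sets of ALL tasks
    (seen and unseen). ACC_avg: mean accuracy over only the
    `seen_upto` tasks observed so far."""
    acc_whole = sum(task_accuracies) / len(task_accuracies)
    seen = task_accuracies[:seen_upto]
    acc_avg = sum(seen) / len(seen)
    return acc_whole, acc_avg
```

Reporting both separates forward transfer (performance on unseen tasks, visible in ACC_whole) from forgetting (performance retained on seen tasks, visible in ACC_avg).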

Results and Discussion
Full Supervision Results Table 1 gives both the ACC_whole and ACC_avg results of our approach compared to other approaches, including Episodic Memory Replay (EMR) and its embedding-aligned variants (EA-EMR) proposed in Wang et al. (2019). Across all metrics, our approach outperforms the previous approaches, demonstrating its effectiveness in this setting. This result is likely because our approach is able to efficiently learn new relations by exploiting knowledge from previously observed relations.

Limited Supervision Results
The aim of our limited supervision experiments is to compare the use of an alignment module, as proposed by Wang et al. (2019), to our approach when only limited supervision is available for all tasks. We compare three approaches: Full EA-EMR (which uses their alignment module), its variant without the alignment module (EA-EMR NoAlign) and our approach (MLLRE). Figures 1(a) and 1(b) show results obtained using 100 supervision instances for each task on Lifelong FewRel and Lifelong SimpleQuestions. Figures 2(a) and 2(b) show the corresponding plots using 200 supervision instances for each task. From the figures, we observe that the use of a separate alignment model yields only minor gains when supervision for the tasks is limited, whereas our approach leads to substantial gains on both datasets. In summary, because our approach explicitly encourages the model to learn to share and transfer knowledge between relations (by means of the meta-learning objective), the model learns to exploit common structures across relations in different tasks to efficiently learn new relations over time. This accounts for the performance improvements obtained by our approach.

Conclusion
We investigated the effectiveness of utilizing a gradient-based meta-learning algorithm within a lifelong learning setting to enable relation extraction models to learn continually. We demonstrated the effectiveness of this approach, both when full supervision is provided for new tasks and when supervision is limited, and showed that the proposed approach outperforms current state-of-the-art approaches.