Continual Relation Learning via Episodic Memory Activation and Reconsolidation

Continual relation learning aims to continually train a model on new data to learn incessantly emerging novel relations while avoiding catastrophic forgetting of old relations. Some pioneering work has shown that storing a handful of historical relation examples in episodic memory and replaying them in subsequent training is an effective solution to this challenging problem. However, these memory-based methods usually suffer from overfitting the few memorized examples of old relations, which may gradually cause inevitable confusion among existing relations. Inspired by the mechanism of human long-term memory formation, we introduce episodic memory activation and reconsolidation (EMAR) to continual relation learning. Every time neural models are activated to learn both new and memorized data, EMAR utilizes relation prototypes for a memory reconsolidation exercise to keep a stable understanding of old relations. The experimental results show that EMAR alleviates catastrophic forgetting of old relations and outperforms the state-of-the-art continual learning models.


Introduction
Relation extraction (RE) aims at detecting relations between entities from text, e.g., extracting the relation "the president of" from the given sentence "Newton served as the president of the Royal Society", which can serve as an external resource for various downstream applications (Dong et al., 2015; Xiong et al., 2017; Schlichtkrull et al., 2018). The conventional RE methods (Riedel et al., 2013; Zeng et al., 2014; Lin et al., 2016) mostly focus on recognizing relations in a fixed pre-defined relation set, and cannot handle the rapidly emerging novel relations of the real world. Some researchers therefore explore detecting and learning incessantly emerging relations in an open scenario. As shown in Figure 1, their efforts can be formulated as a two-step pipeline: (1) Open Relation Learning extracts phrases and arguments to construct patterns of specific relations, then discovers unseen relation types by clustering patterns, and finally expands sufficient examples of new relation types from large-scale textual corpora; (2) Continual Relation Learning continually uses those expanded examples of new relations to train an effective classifier. The classifier is trained on a sequence of tasks to handle both existing and novel relations, where each task has its own relation set. Although continual relation learning is vital for learning emerging relations, this field remains underexplored.
A straightforward solution is to store all historical data and re-train models every time new relations and examples come in. Nevertheless, this is computationally expensive since relations are in sustainable growth. Moreover, the huge number of examples per relation makes frequently mixing new and old examples infeasible in the real world. Therefore, storing all data is not practical in continual relation learning. In view of this, recent preliminary work (Wang et al., 2019) indicates that the main challenge of continual relation learning is the catastrophic forgetting problem, i.e., it is hard to learn new relations and meanwhile avoid forgetting old relations, considering that memorizing all the data is almost impossible. Recent work (Shin et al., 2017; Kemker and Kanan, 2018; Chaudhry et al., 2019) has shown that memory-based approaches, which maintain an episodic memory to save a few training examples of old tasks and re-train memorized examples while training new tasks, are among the most effective solutions to the catastrophic forgetting problem, especially for continual learning in NLP scenarios (Wang et al., 2019; d'Autume et al., 2019). However, existing memory-based models still suffer from an overfitting problem: when adapted for continual relation learning, they may frequently change the feature distribution of old relations, gradually overfit the few examples in memory, and finally become confused among old relations after long-term training.
In fact, these memory-based methods are similar to the long-term memory model of mammalian memory in neuroscience (McClelland et al., 1995; Bontempi et al., 1999). Although researchers in neuroscience are not yet clear about the secrets inside the human brain, they have reached a consensus that the formation of long-term memory relies on continually replaying and consolidating information (Tononi and Cirelli, 2006; Boyce et al., 2016; Yang et al., 2014), corresponding to the episodic memory and memory replay in continual learning models. Yet later work (Nader et al., 2000; Lee et al., 2004; Alberini, 2005) in neuroscience indicates that reactivation of consolidated memory triggers a reconsolidation stage to continually maintain memory, and memory can easily be changed or erased in this stage. Applying reconsolidation exercises can help memory go through this stage and keep long-term memory stable. Intuitively, the existing memory-based models resemble continual memory activation without reconsolidation exercises, and thus become sensitive and volatile.
Inspired by the reconsolidation mechanism in human long-term memory formation, we introduce episodic memory activation and reconsolidation (EMAR) to continual relation learning in this paper. More specifically, when training models on new relations and their examples, we first adopt memory replay to activate neural models on examples of both new relations and memory, and then utilize a special reconsolidation module to keep models from excessively changing and erasing the feature distribution of old relations. As the core of relation learning is to grasp relation prototypes rather than rote memorization of relation examples, our reconsolidation module requires models to be able to distinguish old relation prototypes after each time memory is replayed and activated. Compared with pioneering explorations to improve episodic memory replay (Chaudhry et al., 2019; Wang et al., 2019), which rigidly keep the feature distribution of old relations invariant, EMAR is more flexible in feature spaces and more powerful in remembering relation prototypes.
We conduct extensive experiments on several RE datasets, and the results show that EMAR effectively alleviates the catastrophic forgetting problem and significantly outperforms the state-of-the-art continual learning models. Further experiments and analyses indicate the reasons for the effectiveness of EMAR, proving that it can utilize a few examples of old tasks to reconsolidate old relation prototypes and keep a better distinction among old relations after long-term training.

Related Work
The conventional RE work, including both supervised RE models (Zelenko et al., 2003; Zhou et al., 2005; Gormley et al., 2015; Socher et al., 2012; Liu et al., 2013; Zeng et al., 2014; Nguyen and Grishman, 2015; dos Santos et al., 2015; Miwa and Bansal, 2016) and distantly supervised models (Bunescu and Mooney, 2007; Mintz et al., 2009; Riedel et al., 2010; Hoffmann et al., 2011; Zeng et al., 2015; Lin et al., 2016; Han et al., 2018a; Baldini Soares et al., 2019), focuses on extracting pre-defined relations from text. Yet in the real world, new relations are emerging rapidly, and it is impossible to train models once on a fixed dataset to cover all relations. Hence, some researchers turn their attention to relation learning in various open scenarios, in order to detect and learn relations without pre-defined relation sets. As introduced before, learning incessantly emerging relations consists of two important steps: open relation learning and continual relation learning.
Existing continual learning methods (some work names it lifelong or incremental learning) focus on three research directions: (1) consolidation-based methods (Kirkpatrick et al., 2017; Zenke et al., 2017; Li and Hoiem, 2017; Liu et al., 2018; Ritter et al., 2018), which consolidate the model parameters important to previous tasks and reduce their learning weights; (2) dynamic architecture methods (Chen et al., 2016; Rusu et al., 2016; Fernando et al., 2017), which dynamically expand model architectures to learn new tasks and effectively prevent forgetting old tasks; yet the model size growing dramatically with increasing tasks makes these methods unsuitable for NLP applications; (3) memory-based methods (Lopez-Paz and Ranzato, 2017; Rebuffi et al., 2017; Shin et al., 2017; Kemker and Kanan, 2018; Aljundi et al., 2018; Chaudhry et al., 2019), which remember a few examples of old tasks and continually learn them with emerging new tasks to alleviate catastrophic forgetting. Among these methods, the memory-based methods have been proven the most promising for NLP tasks, including both relation learning (Wang et al., 2019) and other NLP tasks (d'Autume et al., 2019). Inspired by reconsolidation in human memory formation, we introduce episodic memory activation and reconsolidation (EMAR) to alleviate the overfitting problem of the existing memory-based methods and better learn relations continually.

Task Definition and Overall Framework
Continual relation learning trains models on a sequence of tasks, where the k-th task has its own training set T_k, validation set V_k, and query set Q_k. Each set of the k-th task, e.g., $T_k = \{(x_1^{T_k}, y_1^{T_k}), \ldots, (x_N^{T_k}, y_N^{T_k})\}$, consists of a series of examples and their corresponding relation labels, where N is the example number of T_k. Each example $x_i^{T_k}$ and its label $y_i^{T_k}$ indicate that $x_i^{T_k}$ can express the relation $y_i^{T_k} \in R_k$, where R_k is the relation set of the k-th task.
More specifically, models will be trained on T_k at the k-th step to learn the new relations in R_k. As relations are emerging and accumulating, continual relation learning requires models to perform well on both the k-th task and the previous k−1 tasks. Hence, after training on T_k, models will be evaluated on $\tilde{Q}_k = \bigcup_{i=1}^{k} Q_i$, and required to classify each query example into the all known relation set $\tilde{R}_k = \bigcup_{i=1}^{k} R_i$. Therefore, the evaluation becomes more and more difficult with the growth of tasks.
To handle catastrophic forgetting in continual relation learning, an episodic memory module $\mathcal{M} = \{M_1, M_2, \ldots\}$ is set up to store a few examples of historical tasks, where each memory module $M_k = \{(x_1^{M_k}, y_1^{M_k}), \ldots, (x_B^{M_k}, y_B^{M_k})\} \subset T_k$ and B is the constrained memory size for each task.
As shown in Figure 2, when models are trained on the k-th task, our framework includes several steps to learn new relations and meanwhile avoid forgetting old relations: (1) First (Section 3.3), we fine-tune the example encoder on the training set T k of the k-th task to let the model be aware of new relation patterns.
(2) Second (Section 3.4), for each relation in the k-th relation set R k , we select its informative examples and store the examples into the episodic memory M k .
(3) Finally (Section 3.5), we iteratively adopt memory replay and activation as well as memory reconsolidation to learn new relation prototypes while strengthening the distinction among old relation prototypes. Besides, we introduce how to train models and predict relations for query examples in Section 3.6. As the example encoder is used in all the other steps, we first introduce it in Section 3.2.

Example Encoder
Given an example x, we adopt an example encoder to encode its semantic features for detecting and learning relations. To be specific, we first tokenize the given example into several tokens, and then feed these tokens into neural networks to compute the corresponding embedding. As extracting relations from sentences is related to the entities mentioned in them, we add special tokens to the token sequence to indicate the beginning and ending positions of those entities. For simplicity, we denote such an example encoding operation as

$$\mathbf{x} = \operatorname{Enc}(x), \qquad (1)$$

where $\mathbf{x} \in \mathbb{R}^d$ is the semantic embedding of x, and d is the embedding dimension. As the encoder is not the focus of this paper, we select bidirectional long short-term memory (BiLSTM) (Bengio et al., 1994) as the representative encoder. In fact, other neural text encoders such as convolutional neural networks (Zeng et al., 2014) and pre-trained language models (Devlin et al., 2019) can also be adopted as example encoders.
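The entity-marking step can be sketched as follows. This is a minimal, dependency-light illustration, not the authors' implementation: the marker names ([E1], [/E1], [E2], [/E2]) are our own convention, and a mean-pooling ToyEncoder stands in for the BiLSTM.

```python
import numpy as np

def add_entity_markers(tokens, head_span, tail_span):
    """Insert special tokens around the head and tail entity mentions.

    head_span / tail_span are (start, end) token indices, end exclusive.
    """
    (hs, he), (ts, te) = head_span, tail_span
    out = []
    for i, tok in enumerate(tokens):
        if i == hs: out.append("[E1]")
        if i == ts: out.append("[E2]")
        out.append(tok)
        if i == he - 1: out.append("[/E1]")
        if i == te - 1: out.append("[/E2]")
    return out

class ToyEncoder:
    """Stand-in for the BiLSTM encoder: embeds tokens, then mean-pools.

    A real implementation would run a BiLSTM over the token embeddings;
    mean pooling keeps the sketch dependency-free.
    """
    def __init__(self, dim=8, seed=0):
        self.dim = dim
        self.rng = np.random.default_rng(seed)
        self.vocab = {}

    def embed(self, tok):
        # Lazily assign a random embedding to each new token.
        if tok not in self.vocab:
            self.vocab[tok] = self.rng.standard_normal(self.dim)
        return self.vocab[tok]

    def encode(self, tokens):
        return np.mean([self.embed(t) for t in tokens], axis=0)  # x in R^d

tokens = "Newton served as the president of the Royal Society".split()
marked = add_entity_markers(tokens, head_span=(0, 1), tail_span=(7, 9))
x = ToyEncoder(dim=8).encode(marked)
```

The marker tokens let the encoder see where the two entities begin and end, so their positions influence the sentence embedding.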

Learning for New Tasks
When the k-th task arrives, the example encoder has not yet touched any examples of the new relations, and thus cannot extract their semantic features. Hence, we first fine-tune the example encoder on $T_k = \{(x_1^{T_k}, y_1^{T_k}), \ldots, (x_N^{T_k}, y_N^{T_k})\}$ to grasp the new relation patterns in $R_k$. The loss function for learning the k-th task is

$$\mathcal{L}(\theta) = -\sum_{i=1}^{N} \sum_{j=1}^{|\tilde{R}_k|} \delta_{y_i^{T_k} = r_j} \log \frac{\exp\big(g(\mathbf{x}_i^{T_k}, \mathbf{r}_j)\big)}{\sum_{l=1}^{|\tilde{R}_k|} \exp\big(g(\mathbf{x}_i^{T_k}, \mathbf{r}_l)\big)}, \qquad (2)$$

where $\mathbf{r}_j$ is the embedding of the j-th relation $r_j \in \tilde{R}_k$ in the all known relation set $\tilde{R}_k$, $g(\cdot, \cdot)$ is a function computing similarities between embeddings (e.g., cosine similarity), and $\theta$ denotes the parameters to be optimized, including the example encoder parameters and relation embeddings. $\delta_{y_i^{T_k} = r_j} = 1$ if $y_i^{T_k}$ equals $r_j$, and $\delta_{y_i^{T_k} = r_j} = 0$ otherwise. For each new relation, we first randomly initialize its embedding and then optimize Eq. (2).
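Eq. (2) is a softmax cross-entropy over similarity scores between example embeddings and relation embeddings. A small NumPy sketch, assuming cosine similarity for g (the function names new_task_loss and cos are ours, not the paper's):

```python
import numpy as np

def cos(a, b):
    """Cosine similarity, one possible choice of g(., .)."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def new_task_loss(example_embs, labels, relation_embs):
    """Mean negative log-likelihood of Eq. (2): each example should be
    most similar to the embedding of its own relation among all known
    relation embeddings."""
    loss = 0.0
    for x, y in zip(example_embs, labels):
        sims = np.array([cos(x, r) for r in relation_embs])
        logp = sims - np.log(np.sum(np.exp(sims)))  # log-softmax over relations
        loss -= logp[y]
    return loss / len(labels)

# Toy check: matching example/relation pairs should score a lower loss
# than deliberately mislabeled ones.
R = np.eye(3)  # three orthogonal relation embeddings
loss_correct = new_task_loss([R[0], R[1]], [0, 1], R)
loss_wrong = new_task_loss([R[0], R[1]], [1, 0], R)
```

In training, the gradient of this loss updates both the encoder parameters and the relation embeddings.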

Selecting Examples for Memory
After several epochs of learning for new tasks with Eq. (2), we store a few examples from T_k into the memory M_k. More specifically, we select informative and diverse examples from T_k to cover new relation patterns as much as possible, which makes the memory effectively approximate the feature distribution of relations.
After encoding all examples of the k-th task T_k into $\{\mathbf{x}_1^{T_k}, \ldots, \mathbf{x}_N^{T_k}\}$, we apply K-Means to cluster these example embeddings, where the number of clusters is the memory size B. Then, for each cluster, we select the example closest to the cluster centroid and record which relation each selected example belongs to. We denote this selected example set as C_k. By counting the number of examples in C_k for each relation, we can estimate the importance of each relation in this task: more selected examples of a relation indicate greater importance. Due to the limited memory size, for those more important relations, we select at least $B/|R_k|$ examples, while for those less important ones, we select at most $B/|R_k|$ examples. If a relation does not have enough examples to fill its allocated memory, this memory will be re-allocated to other relations.
For each relation, we also use K-Means to cluster its own examples, and the number of current clusters is its allocated example number in the memory. For each cluster, we select the example closest to the cluster centroid, and store this example into the memory M k .
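The cluster-then-pick-centroid-neighbor step above can be sketched as follows. This is a hypothetical helper, not the authors' code: kmeans and select_memory are our own names, and the deterministic farthest-point initialization is our simplification of standard K-Means.

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Minimal K-Means with deterministic farthest-point initialization
    (our own simplification; any standard K-Means would do)."""
    centroids = [X[0]]
    for _ in range(k - 1):
        dists = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(X[int(np.argmax(dists))])
    centroids = np.array(centroids, dtype=float)
    for _ in range(iters):
        assign = np.argmin(((X[:, None, :] - centroids[None]) ** 2).sum(axis=2), axis=1)
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = X[assign == j].mean(axis=0)
    return centroids, assign

def select_memory(X, B):
    """Cluster the example embeddings into B groups and keep, for each
    cluster, the index of the example closest to the cluster centroid."""
    centroids, assign = kmeans(X, B)
    picked = []
    for j in range(B):
        members = np.where(assign == j)[0]
        if len(members) == 0:
            continue  # empty cluster: its slot can be re-allocated
        dists = ((X[members] - centroids[j]) ** 2).sum(axis=1)
        picked.append(int(members[np.argmin(dists)]))
    return sorted(set(picked))

# Two well-separated groups of five 2-d embeddings each: selection
# should return one representative per group.
blob_a = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.05, 0.05]])
X = np.vstack([blob_a, blob_a + 10.0])
picked = select_memory(X, B=2)
```

Picking the example nearest each centroid keeps the memory both diverse (one example per cluster) and representative (each kept example is typical of its cluster).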

Replay, Activation and Reconsolidation
After fine-tuning the example encoder on T_k and selecting informative examples for M_k, we iteratively adopt prototype computation, memory replay and activation, and memory reconsolidation to strengthen the identification of new relation patterns and maintain the distinction among old relation patterns.

Computing Prototypes
By combining all examples in the episodic memory, we obtain the whole memory set $\tilde{M}_k = \bigcup_{i=1}^{k} M_i$. As we aim to grasp relation prototypes rather than rote memorization of relation examples, for each known relation $r_i \in \tilde{R}_k$, we sample a prototype set $P_i = \{x_1^{P_i}, \ldots, x_{|P_i|}^{P_i}\}$, where each example $x_j^{P_i}$ comes from $\tilde{M}_k$ and its label equals $r_i$, and compute its prototype embedding

$$\mathbf{p}_i = \frac{1}{|P_i|} \sum_{j=1}^{|P_i|} \mathbf{x}_j^{P_i}, \qquad (3)$$

where $\mathbf{p}_i$ is the relation prototype embedding of $r_i \in \tilde{R}_k$.
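Assuming Eq. (3) averages the sampled example embeddings of each relation (a reconstruction; the authors' exact form may differ), prototype computation is a per-relation mean. The helper name compute_prototypes is ours:

```python
import numpy as np

def compute_prototypes(memory, num_relations, dim):
    """Average the embeddings of the sampled memory examples of each
    relation to obtain its prototype p_i.

    memory: list of (embedding, relation_id) pairs sampled from the
    whole memory set.
    """
    protos = np.zeros((num_relations, dim))
    counts = np.zeros(num_relations)
    for x, r in memory:
        protos[r] += x
        counts[r] += 1
    # Guard against relations with no sampled example.
    return protos / np.maximum(counts, 1)[:, None]

memory = [(np.array([1.0, 0.0]), 0),
          (np.array([3.0, 0.0]), 0),
          (np.array([0.0, 2.0]), 1)]
protos = compute_prototypes(memory, num_relations=2, dim=2)
```

Because prototypes are recomputed from the current encoder's embeddings at every epoch, they track the evolving feature space rather than freezing it.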

Memory Replay and Activation
In memory replay and activation, the whole memory set $\tilde{M}_k$ and the k-th training set $T_k$ are combined into an activation set $A_k = \tilde{M}_k \cup T_k = \{(x_1^{A_k}, y_1^{A_k}), \ldots, (x_M^{A_k}, y_M^{A_k})\}$ to continually activate models to learn new relations and remember old relations, where M is the total example number of both $\tilde{M}_k$ and $T_k$. The loss function is

$$\mathcal{L}_A(\theta) = -\sum_{i=1}^{M} \sum_{j=1}^{|\tilde{R}_k|} \delta_{y_i^{A_k} = r_j} \log \frac{\exp\big(g(\mathbf{x}_i^{A_k}, \mathbf{r}_j)\big)}{\sum_{l=1}^{|\tilde{R}_k|} \exp\big(g(\mathbf{x}_i^{A_k}, \mathbf{r}_l)\big)}. \qquad (4)$$
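Building the activation set amounts to pooling memorized and current examples and interleaving them, after which the same cross-entropy as in Eq. (2) is applied over A_k. A trivial sketch (build_activation_set is our own name):

```python
import random

def build_activation_set(memory, train_set, seed=0):
    """A_k = whole memory set  ∪  current training set, shuffled so that
    old and new examples are interleaved during replay batches."""
    activation = list(memory) + list(train_set)
    random.Random(seed).shuffle(activation)
    return activation

mem = [("old_example", 0)]
train = [("new_example_1", 1), ("new_example_2", 1)]
a_k = build_activation_set(mem, train)
```

Interleaving matters: training on the new task's examples alone would overwrite the features of old relations, while a mixed stream re-activates them on every pass.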

Memory Reconsolidation
As mentioned before, just conducting memory replay and activation will lead to the overfitting problem: in the end, models only remember a handful of memorized examples after long-term training. Meanwhile, the core of learning relations is to grasp relation prototypes rather than rote memorization of relation examples. Hence, every time we conduct memory replay and activation to grasp both new and old relations, we adopt a memory reconsolidation module to strengthen this process, which resembles conducting reconsolidation exercises to keep long-term memory stable in the human brain. For each known relation $r_i \in \tilde{R}_k$, we sample an instance set $I_i = \{x_1^{I_i}, \ldots, x_{|I_i|}^{I_i}\}$ in the same way as $P_i$, where each example $x_j^{I_i} \in I_i$ also comes from $\tilde{M}_k$ and its label equals $r_i$. The loss function of the memory reconsolidation is

$$\mathcal{L}_R(\theta) = -\sum_{i=1}^{|\tilde{R}_k|} \sum_{j=1}^{|I_i|} \log \frac{\exp\big(g(\mathbf{x}_j^{I_i}, \mathbf{p}_i)\big)}{\sum_{l=1}^{|\tilde{R}_k|} \exp\big(g(\mathbf{x}_j^{I_i}, \mathbf{p}_l)\big)}, \qquad (5)$$

where $\mathbf{p}_l$ is the relation prototype embedding of $r_l \in \tilde{R}_k$ computed by Eq. (3).
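The reconsolidation loss scores each sampled memory example against all relation prototypes. A sketch, with a dot product standing in for g and reconsolidation_loss as our own name:

```python
import numpy as np

def reconsolidation_loss(sampled, prototypes):
    """Mean negative log-likelihood in the spirit of Eq. (5): each
    sampled memory example must remain closest, under a softmax over
    prototype similarities, to its own relation prototype.

    sampled: list of (embedding, relation_id) pairs drawn from memory.
    prototypes: array of shape (num_relations, dim) from Eq. (3).
    """
    loss = 0.0
    for x, r in sampled:
        sims = prototypes @ x                     # g(x, p_l) for every l
        logp = sims - np.log(np.sum(np.exp(sims)))  # log-softmax
        loss -= logp[r]
    return loss / len(sampled)

# Toy check with two orthogonal prototypes: a correctly-labeled sample
# should cost less than the same sample with the wrong label.
P = np.eye(2)
loss_match = reconsolidation_loss([(np.array([1.0, 0.0]), 0)], P)
loss_mismatch = reconsolidation_loss([(np.array([1.0, 0.0]), 1)], P)
```

Unlike a hard constraint on the feature space, this loss only requires examples to stay on the correct side of the prototype boundaries, so features remain free to move as new relations arrive.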
Algorithm 1 Train EMAR for the k-th task
Require: The training set T_k of the k-th task
Require: The emerging relation set R_k of the k-th task
Require: The memory module $\tilde{M}_{k-1}$ before learning T_k
Require: The known relation set $\tilde{R}_{k-1}$ before learning T_k
1: Initialize the relation embeddings for R_k
2: $\tilde{R}_k \leftarrow \tilde{R}_{k-1} \cup R_k$
3: for i ← 1 to epoch1 do
4:     Update θ with ∇L on T_k
5: end for
6: Select informative examples from T_k to store into M_k
7: $\tilde{M}_k \leftarrow \tilde{M}_{k-1} \cup M_k$
8: $A_k \leftarrow \tilde{M}_k \cup T_k$
9: for i ← 1 to epoch2 do
10:     for each relation $r_j \in \tilde{R}_k$ do
11:         Sample P_j from $\tilde{M}_k$ and compute its relation prototype embedding $\mathbf{p}_j$
12:     end for
13:     for j ← 1 to iter1 do
14:         Update θ with ∇L_A on A_k
15:     end for
16:     for j ← 1 to iter2 do
17:         Sample I_i from $\tilde{M}_k$ for each known relation r_i
18:         Update θ with ∇L_R on $\{I_1, \ldots, I_{|\tilde{R}_k|}\}$
19:     end for
20: end for

Training and Prediction
For training the k-th task, we first use L(θ) to optimize parameters for several epochs. Then, we select examples for the memory, and iteratively optimize parameters with L A (θ) and L R (θ) until convergence. More details about the training process are shown in Algorithm 1.
After finishing the k-th task, for each known relation $r_i \in \tilde{R}_k$, we collect all its memorized examples $E_i = \{x_1^{E_i}, \ldots, x_S^{E_i}\}$ in the whole memory $\tilde{M}_k$, where S is the example number of $r_i$ in the memory, and compute the final relation prototype for prediction,

$$\hat{\mathbf{p}}_i = \frac{1}{S+1} \Big( \mathbf{r}_i + \sum_{j=1}^{S} \mathbf{x}_j^{E_i} \Big), \qquad (6)$$

where $\mathbf{r}_i$ is the relation embedding of $r_i$ used in Eq. (2) and Eq. (4). For each query example x in $\tilde{Q}_k$, we define its score function for the relation $r_i$ as

$$s_i = g(\mathbf{x}, \hat{\mathbf{p}}_i), \qquad (7)$$

where $\hat{\mathbf{p}}_i$ is the final prototype of the relation $r_i$ computed by Eq. (6). Finally, the prediction y for the query x is calculated by

$$y = \arg\max_{r_i \in \tilde{R}_k} s_i. \qquad (8)$$
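The prediction step can be sketched as follows, under our assumed reconstruction of Eq. (6) that averages the learned relation embedding with its memorized example embeddings (the exact combination in the original may differ; final_prototype and predict are our own names):

```python
import numpy as np

def final_prototype(rel_emb, mem_embs):
    """Assumed form of Eq. (6): average the relation embedding r_i with
    its S memorized example embeddings."""
    return (rel_emb + np.sum(mem_embs, axis=0)) / (1 + len(mem_embs))

def predict(x, final_protos):
    """Eqs. (7)-(8): score each known relation by the similarity of the
    query embedding to its final prototype, then take the argmax.
    A dot product stands in for g(., .)."""
    sims = final_protos @ x
    return int(np.argmax(sims))

p_hat = final_prototype(np.array([2.0, 0.0]), [np.array([4.0, 0.0])])
pred = predict(np.array([0.1, 0.9, 0.0]), np.eye(3))
```

Blending the relation embedding with memorized examples hedges against either source alone: the embedding summarizes training on the full task, while the examples anchor the prototype in the current feature space.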

Datasets
We carry out our experiments on three benchmark datasets: (1) FewRel (Han et al., 2018b). FewRel is an RE dataset that contains 80 relations and 56,000 examples in total. We follow the settings of Wang et al. (2019) to make FewRel a continual learning benchmark: FewRel is split into 10 clusters of relations, leading to 10 tasks, where each relation belongs to only one task. Each example in these tasks is associated with a relation and a candidate set of 10 randomly selected relations for evaluation.
(2) SimpleQuestions (Bordes et al., 2015). SimpleQuestions (SimpleQ) is a knowledge base question answering dataset that contains 108,442 questions, and Yu et al. (2017) construct a relation detection dataset based on it, where questions are linked to relations. Like FewRel, we follow the settings of Wang et al. (2019): SimpleQ is split into 20 clusters of relations to construct 20 tasks. As each question in SimpleQ is already linked to a candidate set for evaluation, we do not randomly sample candidate sets again for SimpleQ.
(3) TACRED (Zhang et al., 2017). TACRED is an RE dataset that contains 42 relations and 21,784 examples. Similar to FewRel, we split TACRED into 10 clusters of relations to construct 10 tasks, and randomly sample a candidate relation set consisting of 10 relations for each example. Considering there is a special relation "n/a" (not available) in TACRED, we filter out the examples labeled "n/a" and use the remaining examples for continual TACRED.

Table 2: Accuracy (%) of models with different memory sizes. All the results come from our implemented models.

Experimental Settings
We use two evaluation settings: whole performance, which calculates the accuracy on the whole test set of all tasks, and average performance, which averages the accuracy over all seen tasks. After all tasks have been seen, we use the final whole performance and average performance to evaluate the overall performance of continual relation learning. As average performance highlights how well models handle catastrophic forgetting, it is the main metric for evaluating models.
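The two metrics can be sketched as follows (the function names are ours). Note that average performance weighs each task equally regardless of its test-set size, which is why it emphasizes forgetting on small old tasks:

```python
def whole_accuracy(preds, golds):
    """Accuracy over the pooled test examples of all seen tasks."""
    correct = sum(p == g for p, g in zip(preds, golds))
    return correct / len(golds)

def average_accuracy(per_task):
    """Mean of the per-task accuracies.

    per_task: list of (preds, golds) pairs, one per seen task. A badly
    forgotten old task drags this metric down even if it is small.
    """
    accs = [whole_accuracy(p, g) for p, g in per_task]
    return sum(accs) / len(accs)

w = whole_accuracy([1, 1, 0], [1, 0, 0])          # 2 of 3 correct
a = average_accuracy([([1], [1]), ([0, 0], [1, 1])])  # tasks at 1.0 and 0.0
```

A model that aces the newest (large) task but fails an old (small) one can still have high whole accuracy while its average accuracy collapses.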
As the task sequence influences final model performance, we implement the baseline models ourselves based on the toolkit released by Wang et al. (2019). For a fair comparison, we use random seeds completely consistent with those in Wang et al. (2019), so that the task sequence is identical to theirs. For other settings, such as the hidden embedding dimension and pre-trained input embeddings, we also follow Wang et al. (2019).

Baselines
We evaluate our model and several baselines on the benchmarks, and select two theoretical models to measure the lower and upper bounds: (1) Lower Bound, which continually fine-tunes models for each new task without memorizing any historical examples; (2) Upper Bound, which remembers all examples in history and continually re-trains models with all the data; this model serves as the ideal upper bound for the performance of continual relation learning; (3) EWC (Kirkpatrick et al., 2017), which adopts elastic weight consolidation to add a special L2 regularization on parameter changes; EWC uses Fisher information to measure the importance of parameters to old tasks, and slows down the updates of those parameters important to old tasks; (6) A-GEM (Chaudhry et al., 2019), the extension of GEM, which takes the gradient on memorized examples sampled from memory as the only constraint on the optimization directions of the current task; (7) EA-EMR (Wang et al., 2019), which introduces memory replay and an embedding alignment mechanism to enhance previous tasks and mitigate embedding distortion when training on new tasks; EA-EMR is also an extension of EMR, and the state of the art in continual relation learning.

Figure 4: Feature spaces at Step-1, Step-4, Step-7, and Step-10. For each image, we use the support vector machine to acquire its best linear boundary and draw it as the blue line.

There is still a huge gap between our model and the upper bound, which indicates that much remains to be explored in continual relation learning.

Overall Results
To further investigate how accuracy changes while learning new tasks, we show the average performance of models at each step in Figure 3. From the figure, we can observe that: (1) With increasing numbers of tasks, the performance of all models decreases to some degree. This indicates that catastrophically forgetting old relations is inevitable, and it is indeed one of the major difficulties of continual relation learning. (2) The memory-based methods significantly outperform the consolidation-based method, which demonstrates that the memory-based methods can alleviate the problem of catastrophic forgetting to some extent. (3) Our proposed EMAR achieves much better results than the state-of-the-art model EA-EMR. This shows the effectiveness of our memory reconsolidation, and further indicates that understanding relation prototypes is more important and reasonable than rote memorization of examples.

Effect of Memory Size
Memory size is the number of remembered examples for each task. In this section, we investigate the effect of memory size on the performance of baselines and our proposed model. We compare three memory sizes: 10, 25, and 50. As existing work does not report results with different memory sizes, we re-implement the baseline models ourselves in this experiment. The results are shown in Table 2. We can find that: (1) With increasing memory size, the performance of all models improves accordingly, which shows that memory size is one of the key factors determining the performance of continual relation learning models.
(2) On both FewRel and TACRED, our EMAR keeps performing the best under different memory sizes, and even achieves comparable results with other models of larger memory sizes. It indicates adopting relation prototypes in EMAR is a more effective way to utilize memory compared with existing memory-based methods.

Effect of Prototypes and Reconsolidation
To show the effectiveness of prototypes and reconsolidation, we give a case study demonstrating how the feature spaces learned by EA-EMR and EMAR (ours) change. We sample two relations from the training set and 40 examples per relation from the test set. Then we train EA-EMR and EMAR with the sampled training data respectively and visualize the changes of the sampled instances in the feature spaces at different steps.
From Figure 4, we can see that EMAR learns better features of instances after multi-step training: the embedding space of EMAR is sparser, and features from the two relations are more distinguishable. In contrast, the features learned by EA-EMR become denser with increasing steps and thus harder to classify.
This phenomenon is mainly due to the different approaches to constraining features used by EA-EMR and EMAR. The L2 regularization used in EA-EMR for keeping the instance distribution of old relations leads to higher density in the feature space and smaller distances between different relations after several training steps. In contrast, EMAR prevents models from forgetting previous relations via relation prototypes. Compared with EA-EMR, using prototypes for reconsolidation is a more flexible constraint, allowing EMAR to utilize larger feature spaces for representing examples and prototypes.
To quantitatively analyze the case, we use the support vector machine to acquire linear boundaries for each image in Figure 4 and list the classification results in Table 3. The quantitative results in the table show that embeddings learnt by EMAR achieve better classification performance, which further supports our above observations.

Conclusion and Future Work
To alleviate catastrophic forgetting of old relations in continual relation learning, we introduce episodic memory activation and reconsolidation (EMAR), inspired by the mechanism of human long-term memory formation. Compared with existing memory-based methods, EMAR requires models to understand the prototypes of old relations rather than to overfit a few specific memorized examples, which keeps a better distinction among relations after long-term training. We conduct experiments on three relation extraction benchmarks and report extensive experimental results as well as empirical analyses, showing the effectiveness of EMAR in utilizing memorized examples. For future work, how to combine open relation learning and continual relation learning to complete the pipeline for emerging relations remains an open problem, and we will continue to work on it.