Continual Learning for Text Classification with Information Disentanglement Based Regularization

Continual learning has become increasingly important as it enables NLP models to constantly learn and gain knowledge over time. Previous continual learning methods are mainly designed to preserve knowledge from previous tasks, without much emphasis on how well models generalize to new tasks. In this work, we propose an information disentanglement based regularization method for continual learning on text classification. Our proposed method first disentangles text hidden spaces into representations that are generic to all tasks and representations specific to each individual task, and further regularizes these representations differently to better constrain the knowledge required to generalize. We also introduce two simple auxiliary tasks, next sentence prediction and task-id prediction, for learning better generic and specific representation spaces. Experiments conducted on large-scale benchmarks demonstrate the effectiveness of our method over state-of-the-art baselines on continual text classification with task sequences of various orders and lengths. We have publicly released our code at https://github.com/GT-SALT/IDBR.


Introduction
Computational systems in real-world scenarios frequently face changing environments, and thus are often required to learn continually from dynamic streams of data, building on what was learned before (Biesialska et al., 2020). For example, a tweet classifier needs to deal with constantly emerging trending topics. While continually acquiring and transferring knowledge throughout a lifespan is an intrinsic human ability, most machine learning models suffer from catastrophic forgetting: when learning new tasks, models dramatically and rapidly forget knowledge from previous tasks (McCloskey and Cohen, 1989). As a result, Continual Learning (CL) (Ring, 1998; Thrun, 1998) has received increasing attention recently, as it can enable models to perform positive transfer (Perkins et al., 1992) as well as remember previously seen tasks.
A growing body of research has been conducted to equip neural networks with the ability to learn continually (Kirkpatrick et al., 2017; Lopez-Paz and Ranzato, 2017; Aljundi et al., 2018). Existing continual learning methods for NLP tasks can be broadly categorized into two classes: replay-based methods (Sun et al., 2019), where examples from previous tasks are stored and re-trained during the learning of a new task to retain old information, and regularization-based methods (Han et al., 2020), where constraints are added to model parameters to prevent them from changing too much while learning new tasks. The former usually stores an extensive amount of data from old tasks, or trains language models conditioned on task identifiers to generate sufficient examples (Sun et al., 2019), which significantly increases memory costs and training time. The latter utilizes previous examples efficiently via constraints added on the text hidden space or model parameters, but generally views all information as equally important and regularizes it to the same extent (Han et al., 2020), making it hard for models to differentiate informative representations that need to be retained from ones that need large updates. However, we argue that when learning new tasks, task generic information and task specific information should be treated differently: generic representations might function consistently across tasks, while task specific representations might need to change significantly.
To this end, we propose an information disentanglement based regularization method for continual learning on text classification. Specifically, we first disentangle the text hidden representation space (e.g., the output representation of BERT (Devlin et al., 2019)) into a task generic space and a task specific space, using two auxiliary tasks: next sentence prediction for learning task generic information and task identifier prediction for learning task specific representations. When training on new tasks, we constrain the task generic representations to be relatively stable and the task specific representations to be more flexible. To further alleviate catastrophic forgetting without much increase in memory or training time, we propose to augment our regularization-based method by storing and replaying only a small number of representative examples (e.g., 1% of samples selected by a memory selection rule such as K-means (MacQueen et al., 1967)). To sum up, our contributions are threefold:
• We propose an information disentanglement based regularization method for continual text classification, to better learn and constrain task generic and task specific knowledge.
• We augment the regularization approach with a memory selection rule that requires only a small amount of replaying examples.
• Extensive experiments conducted on five benchmark datasets demonstrate the effectiveness of our proposed methods compared to state-of-the-art baselines.

Related work
Continual Learning Existing continual learning research can be broadly divided into four categories: (i) replay-based methods, which remind models of information from seen tasks via experience replay, distillation (Rebuffi et al., 2017), representation alignment, or optimization constraints (Lopez-Paz and Ranzato, 2017; Chaudhry et al., 2019), using examples sampled from previous tasks (Rebuffi et al., 2017) or synthesized with generative models (Shin et al., 2017; Sun et al., 2019); (ii) regularization-based methods, which constrain the model's output (Li and Hoiem, 2018), hidden space (Rannen et al., 2017), or parameters (Lopez-Paz and Ranzato, 2017; Zenke et al., 2017; Aljundi et al., 2018) from changing too much in order to retain learned knowledge; (iii) architecture-based methods, where different tasks are associated with different components of the overall model to directly minimize the interference between new and old tasks (Rusu et al., 2016; Mallya and Lazebnik, 2018); (iv) meta-learning-based methods, which directly optimize the knowledge transfer among tasks (Riemer et al., 2019; Obamuyide and Vlachos, 2019), or learn robust data representations (Javed and White, 2019; Holla et al., 2020; Wang et al., 2020) to alleviate forgetting. Among these approaches, replay-based and regularization-based methods have been widely applied to NLP tasks to enable large pre-trained models (Devlin et al., 2019; Radford et al., 2019) to continually acquire novel world knowledge from streams of textual data without forgetting what has already been learned. For instance, replaying examples has shown promising performance for text classification (de Masson d'Autume et al., 2019; Sun et al., 2019; Holla et al., 2020), relation extraction, and question answering (Sun et al., 2019; Wang et al., 2020).
However, these methods often suffer from large memory costs or considerable training time, due to the need to store an extensive amount of texts (de Masson d'Autume et al., 2019) or to train language models to generate a sufficient number of examples (Sun et al., 2019). Recently, regularization-based methods (Han et al., 2020) have also been applied to directly constrain the knowledge deposited in model parameters without abundant rehearsal examples. Despite better efficiency compared to replay-based methods, current regularization-based approaches often fail to generalize well to new tasks, as they treat and constrain all information equally and thus limit the updates needed for parameters that are specific to different tasks. To overcome these limitations, we propose to first distinguish the hidden spaces that need to be retained from those that need substantial updates through information disentanglement, and then to regularize the different spaces separately, so as to better remember previous knowledge as well as transfer to new tasks. In addition, we enhance our regularization method by replaying only a limited number of examples selected with K-means as the memory selection rule.

Textual Information Disentanglement
Our work is related to information disentanglement for text data, which has been extensively explored in generation tasks like style transfer (Fu et al., 2017; Zhao et al., 2018; Romanov et al., 2019), where text hidden representations are often disentangled into sentiment (Fu et al., 2017; John et al., 2019), content (Romanov et al., 2019; Bao et al., 2019), and syntax (Bao et al., 2019) information, through supervised learning from pre-defined labels (John et al., 2019) or unsupervised learning with adversarial training (Fu et al., 2017). Building on these prior works, we differentiate the task generic space from the task specific space via supervision from two simple yet effective auxiliary tasks: next sentence prediction and task identifier prediction.

Related Learning Paradigms There exist other learning paradigms that also deal with multiple tasks, such as multi-task learning (Yu et al., 2020) and transfer learning (Houlsby et al., 2019; Pfeiffer et al., 2021). However, neither fits the scenario of learning multiple tasks sequentially. The former could be adapted to dynamic environments by storing all seen training data and retraining the model after the arrival of each new task, but this greatly decreases efficiency and is impractical in deployment. The latter only focuses on the target tasks and ignores catastrophic forgetting on the source tasks. A more thorough discussion can be found in Biesialska et al. (2020).

Problem Formulation
In this work, we focus on continual learning for a sequence of text classification tasks {T_1, ..., T_n}, where we learn a model f_θ(.) whose parameters θ are shared by all tasks, and each task T_i contains a different set of sentence-label training pairs (x^i_{1:m}, y^i_{1:m}). After learning all tasks in the sequence, we seek to minimize the generalization error on all tasks (Biesialska et al., 2020):

$R(f_\theta) = \sum_{i=1}^{n} \mathbb{E}_{(x^i, y^i) \sim T_i} \left[ \mathcal{L}(f_\theta(x^i), y^i) \right]$

We use two commonly-used techniques for this problem setting in our proposed model:
• Regularization: in order to preserve knowledge stored in the model, regularization adds constraints on the model's output (Li and Hoiem, 2018), hidden space (Zenke et al., 2017), or parameters (Lopez-Paz and Ranzato, 2017; Zenke et al., 2017; Aljundi et al., 2018) to prevent them from changing too much while learning new tasks.
• Replay: when learning new tasks, Experience Replay (Rebuffi et al., 2017) is commonly used to recover knowledge from previous tasks: a memory buffer first stores examples seen in previous tasks, and the stored data is then replayed alongside the training set of the current task. Formally, after training on task t − 1 (t ≥ 2), γ|S_{t−1}| examples are randomly sampled from the (t−1)-th training set S_{t−1} into the memory buffer M, where 0 ≤ γ ≤ 1 is the store ratio. Data from M is then merged with the t-th training set S_t when learning task t.
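The replay mechanism above can be sketched as a minimal buffer class. This is an illustrative sketch, not the paper's released code; the class and method names are our own.

```python
import random

class ReplayMemory:
    """Minimal sketch of the episodic memory buffer used for replay."""
    def __init__(self, store_ratio=0.01):
        self.store_ratio = store_ratio  # gamma: fraction of seen examples kept
        self.buffer = []

    def store(self, dataset):
        # After finishing a task, randomly sample gamma*|S| of its examples.
        k = max(1, int(self.store_ratio * len(dataset)))
        self.buffer.extend(random.sample(dataset, k))

    def sample_batches(self, batch_size, n_batches):
        # During replay steps, draw mini-batches of stored examples.
        return [random.sample(self.buffer, min(batch_size, len(self.buffer)))
                for _ in range(n_batches)]
```

With γ = 0.01, storing a 1000-example task keeps 10 examples, matching the 1% store ratio used later in the experiments.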

Method
In continual learning, the model needs to adapt to new tasks quickly while maintaining the ability to recover information from previous tasks, hence not all information stored in the hidden representation space should be treated equally. In previous work like style transfer (John et al., 2019) and controlled text generation (Hu et al., 2017), certain information (such as content and syntax) is extracted and shared among different categories and other information (such as style and polarity) is manipulated for each specific category. Similarly, in our continual learning scenario, there is shared knowledge among different tasks as well while the model needs to learn and maintain specific knowledge for each individual task in the learning process. This key observation motivates us to propose an information-disentanglement based regularization for continual text classification to retain shared knowledge while adapting specific knowledge to streams of tasks (Section 4.1). We also incorporate a small set of representative replay samples to alleviate catastrophic forgetting (Section 4.3). Our model architecture is shown in Figure 1.

Information Disentanglement (ID)
This section describes how to disentangle sentence representations into task generic space and task specific space, and how separate regularizations are imposed on them for continual text classification. Formally, for a given sentence x, we first use a multi-layer encoder B(.), e.g., BERT (Devlin et al., 2019), to get the hidden representations r which contain both task generic and task specific information. Then we introduce two disentanglement networks G(.) and S(.) to extract the generic representation g and specific representation s from r. For new tasks, we learn the classifiers by utilizing information from both spaces, and we allow different spaces to change to different extents to best retain knowledge from previous tasks.
Task Generic Space Task generic space is the hidden space containing information generic across the different tasks in a task sequence. When switching from one task to another, the generic information should remain roughly the same; e.g., syntactic knowledge should not change much across the learning of a sequence of tasks. To extract task generic information g from the hidden representations r, we leverage the next sentence prediction task (Devlin et al., 2019) to learn the generic information extractor G(.). More specifically, we insert a [SEP] token into each training example during tokenization to form a sequence pair labeled IsNext, and swap the first and second sequences to form a sentence pair labeled NotNext.
In order to distinguish IsNext pairs from NotNext pairs, the extractor G(.) needs to learn the contextual dependencies between the two segments, which is beneficial for understanding every example and generic to any individual task. Denote $\tilde{x}$ as the NotNext example corresponding to x (IsNext), and l ∈ {0, 1} as the label for next sentence prediction. We build a sentence relation predictor f_nsp on top of the generic feature extractor G(.):

$\mathcal{L}_{nsp} = \sum_{x \in S_t \cup M} \mathcal{L}\big(f_{nsp}(G(B(x))), \, l\big)$

where $\mathcal{L}$ is the cross entropy loss, M is the memory buffer, and $S_t$ is the t-th training set.
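The pair construction described above can be sketched as follows. The exact label convention (which integer encodes IsNext) and the string formatting are assumptions of this sketch, not details from the paper.

```python
def make_nsp_pairs(seg_a, seg_b):
    # Each training example is split into two segments at a [SEP] token.
    # The original order forms the IsNext pair (labeled 1 here, by assumption);
    # the swapped order forms the NotNext example x-tilde (labeled 0).
    is_next = ("[CLS] %s [SEP] %s [SEP]" % (seg_a, seg_b), 1)
    not_next = ("[CLS] %s [SEP] %s [SEP]" % (seg_b, seg_a), 0)
    return is_next, not_next
```

Both variants of every example are fed to the model, so the next-sentence signal is available for each training sentence without any extra annotation.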
Task Specific Space Models also need task specific information to perform well on each task. For example, in sentiment classification, words like "good" or "bad" can be very informative, but they might not generalize to tasks like topic classification. Thus we employ a simple task-identifier prediction task on the task specific representation s: for any given example, we want to distinguish which task the example belongs to. This simple auxiliary setup encourages s to embed different information for different tasks. The loss for the task-identifier predictor f_task is:

$\mathcal{L}_{task} = \sum_{x \in S_t \cup M} \mathcal{L}\big(f_{task}(S(B(x))), \, z\big)$

where z is the corresponding task id for x.
Text Classification To adapt to the t-th task, we combine the task generic representation g = G(B(x)) and the task specific representation s = S(B(x)) to perform text classification, minimizing the cross entropy loss:

$\mathcal{L}_{cls} = \sum_{(x, y) \in S_t \cup M} \mathcal{L}\big(f_{cls}(g \bullet s), \, y\big)$

where y is the corresponding class label for x, f_cls(.) is the class predictor, and • denotes the concatenation of the two representations.
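The disentangled classification head can be sketched in PyTorch as below. The 768-dimensional input matches BERT-base and the 128-dimensional extractors match the implementation details given later; the class count and module names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class IDBRHead(nn.Module):
    """Sketch of the disentangled head: generic extractor G, specific
    extractor S, and a classifier over their concatenation [g; s]."""
    def __init__(self, hidden=768, feat=128, n_classes=4):
        super().__init__()
        self.G = nn.Sequential(nn.Linear(hidden, feat), nn.Tanh())  # generic
        self.S = nn.Sequential(nn.Linear(hidden, feat), nn.Tanh())  # specific
        self.f_cls = nn.Linear(2 * feat, n_classes)  # classifier on g • s

    def forward(self, r):
        # r: sentence representation from the encoder B, shape (batch, hidden)
        g, s = self.G(r), self.S(r)
        logits = self.f_cls(torch.cat([g, s], dim=-1))
        return logits, g, s
```

Returning g and s alongside the logits is convenient because the same forward pass also feeds the auxiliary predictors and the regularizers described next.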

ID Based Regularization
To further prevent severe distortion when training on new tasks, we employ regularization on both the generic representations g and the specific representations s. Different from previous approaches (Li and Hoiem, 2018), which treat all spaces equally, we allow regularization to different extents on g and s, since knowledge in the two spaces should be preserved separately to encourage both more positive transfer and less forgetting. Specifically, before training all the modules on task t, we first compute the generic and specific representations of all sentences x from the training set S_t of the current task and the memory buffer M. Using the trained B_{t−1}(.), G_{t−1}(.) and S_{t−1}(.) from task t − 1, for each example x we calculate the generic representation G_{t−1}(B_{t−1}(x)) and the specific representation S_{t−1}(B_{t−1}(x)) to hoard the knowledge from previous models, and save the computed representations. During learning from the training pairs of task t, we impose two regularization losses separately:

$\mathcal{L}^{g}_{reg} = \sum_{x \in S_t \cup M} \big\| G(B(x)) - G_{t-1}(B_{t-1}(x)) \big\|_2^2$

$\mathcal{L}^{s}_{reg} = \sum_{x \in S_t \cup M} \big\| S(B(x)) - S_{t-1}(B_{t-1}(x)) \big\|_2^2$
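A minimal sketch of the two regularizers, assuming an L2 penalty between the current representations and the ones saved before training on the new task (the reduction over the batch is an assumption of this sketch):

```python
import torch

def id_reg_loss(g_new, s_new, g_old, s_old, lambda_g, lambda_s):
    # Penalize drift of the current generic/specific representations away
    # from the saved pre-task ones, with separate coefficients so the
    # generic space can be constrained more strongly than the specific one.
    reg_g = ((g_new - g_old) ** 2).sum(dim=-1).mean()
    reg_s = ((s_new - s_old) ** 2).sum(dim=-1).mean()
    return lambda_g * reg_g + lambda_s * reg_s
```

Setting λ_g larger than λ_s (as in the hyperparameters reported later) makes the generic space stiffer while leaving the specific space freer to adapt.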

Memory Selection Rule
Since we only store a small number of examples to balance the benefits of replay against the extra memory cost and training time, we need to select them carefully to use the memory buffer M efficiently. If two stored examples are very similar, storing only one of them could achieve similar results; the stored examples should therefore be as diverse and representative as possible. To this end, after training on the t-th task, we employ K-means (MacQueen et al., 1967) to cluster all examples from the current training set S_t: for each x ∈ S_t, we use its embedding B(x) as the input feature for K-means. We set the number of clusters to γ|S_t| and select only the example closest to each cluster's centroid, following Wang et al. (2020).
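The selection rule can be sketched with a plain numpy K-means, a stand-in for any K-means implementation; the deterministic farthest-point initialization is an addition of this sketch for reproducibility, not part of the paper's method.

```python
import numpy as np

def select_representatives(features, n_store, n_iter=50):
    """Cluster example embeddings with K-means and keep, for each
    cluster, the index of the example closest to its centroid."""
    feats = np.asarray(features, dtype=float)
    # Farthest-point initialization (deterministic for this sketch).
    init = [0]
    for _ in range(n_store - 1):
        d = np.min([np.linalg.norm(feats - feats[i], axis=1) for i in init],
                   axis=0)
        init.append(int(d.argmax()))
    centers = feats[init].copy()
    for _ in range(n_iter):
        # Assign every example to its nearest centroid, then update centroids.
        labels = np.linalg.norm(feats[:, None] - centers[None],
                                axis=-1).argmin(axis=1)
        for c in range(n_store):
            if (labels == c).any():
                centers[c] = feats[labels == c].mean(axis=0)
    keep = []
    for c in range(n_store):
        idx = np.where(labels == c)[0]
        if len(idx):
            dists = np.linalg.norm(feats[idx] - centers[c], axis=1)
            keep.append(int(idx[np.argmin(dists)]))
    return sorted(keep)
```

In the paper's setting, `features` would be the BERT embeddings B(x) of the current training set and `n_store` would be γ|S_t|.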

Overall Objective
We can write the final objective for continual learning on text classification as:

$\mathcal{L} = \mathcal{L}_{cls} + \mathcal{L}_{nsp} + \mathcal{L}_{task} + \lambda_g \mathcal{L}^{g}_{reg} + \lambda_s \mathcal{L}^{s}_{reg} \quad (1)$

We set the coefficients of the first three loss terms to 1 for simplicity and only introduce two coefficients to tune: λ_g and λ_s. In practice, L_task and L_cls are also computed on each generated NotNext example $\tilde{x}$, and L^g_reg and L^s_reg are only optimized starting from the second task. The full information disentanglement based regularization (IDBR) algorithm is shown in Algorithm 1.
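The overall objective reduces to a simple weighted sum; a one-line helper makes the weighting explicit (the default coefficients mirror the current-set values reported in the implementation details and are otherwise an assumption of this sketch):

```python
def idbr_loss(l_cls, l_nsp, l_task, reg_g, reg_s,
              lambda_g=0.25, lambda_s=0.20):
    # The three prediction losses are weighted 1; only the two
    # representation regularizers carry tunable coefficients.
    return l_cls + l_nsp + l_task + lambda_g * reg_g + lambda_s * reg_s
```

On the first task the two regularizer terms are simply passed as zero, matching the "no regularization on the 1st task" branch of Algorithm 1.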
Algorithm 1 IDBR
Input: training sets {S_1, ..., S_n}, replay frequency β, store ratio γ, coefficients λ_g, λ_s
Output: optimal models B, G, S, f_nsp, f_task, f_cls
  M ← {}                                   ▷ Initialize memory buffer
  Initialize B using pretrained BERT
  Initialize G, S, f_nsp, f_task, f_cls
  for t = 1, ..., n do
      if t ≥ 2 then
          Store G(B(x)), S(B(x)), ∀x ∈ S_t ∪ M
          for batches ∈ S_t do
              Optimize L in Equation 1
              if step mod β = 0 then       ▷ Replay
                  Sample t − 1 batches from M
                  Optimize L in Equation 1
              end if
          end for
      else                                 ▷ No regularization on 1st task
          for batches ∈ S_t do
              Optimize L = L_cls + L_nsp + L_task
          end for
      end if
      C ← K-Means(S_t, n_clusters = γ|S_t|)   ▷ C: centroids
      C' ← {examples closest to the centroids in C}
      M ← M ∪ C'                           ▷ Add to memory
  end for
  return B, G, S, f_nsp, f_task, f_cls

Experiment

Datasets
Following MBPA++ (de Masson d'Autume et al., 2019), we use five text classification datasets (Zhang et al., 2015; Chen et al., 2020) to evaluate our methods, including AG News (news classification), Yelp (sentiment analysis), DBPedia (Wikipedia article classification), Amazon (sentiment analysis), and Yahoo! Answer (Q&A classification). A summary of the datasets is shown in Table 1, following the setup of Kirkpatrick et al. (2017). Our experiments are mainly conducted on the task sequences shown in Table 2. To minimize the effect of task order and task sequence length on the results, we examine both length-3 and length-5 task sequences in various orders. The first 3 task sequences are cyclic shifts of ag yelp yahoo, three classification tasks in different domains (news classification, sentiment analysis, Q&A classification). The last four length-5 task sequences follow de Masson d'Autume et al. (2019).

Baselines
We compare our proposed model with the following baselines in our experiments:
• Finetune: finetune the BERT model sequentially, without the episodic memory module or any other auxiliary loss.
Table 2: Task sequences used in our experiments.

Order | Task Sequence
1     | ag yelp yahoo
2     | yelp yahoo ag
3     | yahoo ag yelp
4     | ag yelp amazon yahoo dbpedia
5     | yelp yahoo amazon dbpedia ag
6     | dbpedia yahoo ag amazon yelp
7     | yelp ag dbpedia amazon yahoo

• LAMOL (Sun et al., 2019): train a language model that simultaneously learns to solve the tasks and to generate training samples; the latter is used to produce pseudo samples for experience replay. Here, text classification is performed in a Q&A format.
• Multi-task Learning (MTL): the model is trained on all tasks simultaneously, which can be considered an upper bound for continual learning methods since it has access to the data of all tasks at the same time.

Implementation Details
We use pretrained BERT-base-uncased from HuggingFace Transformers (Wolf et al., 2020) as our base feature extractor. The task generic encoder and the task specific encoder are each one linear layer followed by a Tanh activation, and both output 128-dimensional representations. The predictors built on top of the encoders are each one linear layer followed by a softmax. All experiments are conducted on an NVIDIA RTX 2080 Ti with 11GB memory, with a batch size of 8 and a maximum sequence length of 256 (we use the first 256 tokens if an example is longer). We use AdamW (Loshchilov and Hutter, 2019) as the optimizer. For all modules except the task id predictor, we set the learning rate to lr = 3e-5; for the task id predictor, we set lr_task = 5e-4. The weight decay for all parameters is 0.01.
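The two-rate optimizer setup can be sketched as follows. The stand-in modules are illustrative; only the learning rates and the weight decay come from the text above.

```python
import torch.nn as nn
from torch.optim import AdamW

# Illustrative modules standing in for the real networks: the 128-dim
# feature size is from the text, the 5-way task head assumes a length-5
# task sequence.
backbone_and_heads = nn.Linear(768, 128)
f_task = nn.Linear(128, 5)

# One parameter group at 3e-5 for everything except the task-id predictor,
# which gets the larger 5e-4 rate; weight decay 0.01 everywhere.
optimizer = AdamW(
    [{"params": backbone_and_heads.parameters(), "lr": 3e-5},
     {"params": f_task.parameters(), "lr": 5e-4}],
    weight_decay=0.01,
)
```

Using parameter groups keeps a single optimizer step while letting the task-id predictor move faster than the shared encoder.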
For experience replay, we set the store ratio γ = 0.01, i.e., we store 1% of seen examples in the episodic memory module. Besides, we set the replay frequency β = 10, i.e., we perform experience replay once every ten steps.
For information disentanglement, we mainly tune the coefficients of the regularization losses. For batches from the memory buffer M, we set λ_g to 2.5 and select the best λ_s from {1.5, 2.0, 2.5}. For batches from the current training set S, we set λ_g to 0.25 and select the best λ_s from {0.15, 0.20, 0.25}.

Results and Discussion
We evaluate models after training on all tasks and report the average accuracy over all test sets as our metric. Table 3 summarizes our results in Setting (Sampled). While continual finetuning suffers from severe forgetting, experience replay with 1% stored examples achieves promising results, which demonstrates the importance of experience replay for continual learning in NLP. Beyond that, simple regularization turns out to be a robust addition on top of experience replay, showing consistent improvements on all 6 orders. Our proposed Information Disentanglement Based Regularization (IDBR) further improves over plain regularization consistently under all circumstances. Table 4 compares IDBR with the previous state of the art, MBPA++ and LAMOL, in Setting (Full). Note that although we use the same training/testing data, there are some inherent differences between our setting and those of the previous methods. Even though MBPA++ applies local adaptation at test time, IDBR still outperforms it by a clear margin. We achieve comparable results with LAMOL, even though LAMOL requires task identifiers during inference, which makes its prediction task easier.

Impact of the Lengths of Task Sequences
Comparing results of length-3 sequences and length-5 sequences in Table 3, we found that the gap between IDBR and multi-task learning becomes bigger when the length of the task sequence grows from 3 to 5. To better understand how IDBR gradually forgets, we follow Chaudhry et al. (2018) and measure the forgetting F_k after training on task k as:

$F_k = \frac{1}{k-1} \sum_{j=1}^{k-1} \max_{l \in \{1, \dots, k-1\}} \left( a_{l,j} - a_{k,j} \right)$

where $a_{l,j}$ is the model's accuracy on task j after training on task l. On orders 4, 5 and 6, we calculate forgetting every time IDBR finishes training on a new task and summarize the results in Table 5. For continual learning, we hypothesize that the model is prone to more severe forgetting as the task sequence becomes longer. We found that although there is a notable drop after training on the 3rd task, IDBR maintains stable performance as the length of the task sequence increases; in particular, after training on the 4th and 5th tasks the increase in forgetting is relatively small, which demonstrates the robustness of IDBR.
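The forgetting measure can be computed directly from the matrix of per-task accuracies; a small sketch (indexing is 0-based, unlike the formula):

```python
def forgetting(acc):
    # acc[l][j]: accuracy on task j after training on task l (0-indexed).
    # F_k averages, over earlier tasks j, the gap between the best accuracy
    # previously reached on j and the accuracy after the latest task k.
    k = len(acc)
    drops = [max(acc[l][j] for l in range(k - 1)) - acc[k - 1][j]
             for j in range(k - 1)]
    return sum(drops) / len(drops)
```

For example, if a model scores 0.90 on task 1, then 0.80 on task 1 after training on task 2, the forgetting after two tasks is 0.10.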

Visualizing Disentangled Spaces
To study whether our task generic encoder G tends to learn more generic information and our task specific encoder S captures more task specific information, we use t-SNE (van der Maaten and Hinton, 2008) to visualize the two hidden spaces of IDBR, using the final model trained on order 2. The results are shown in Figure 2, where Figure 2a visualizes the task generic space and Figure 2b the task specific space. We observe that, compared with the task specific space, generic features from different tasks are more mixed, which demonstrates that next sentence prediction helped the task generic space become more task-agnostic than the task specific space, which was instead induced to learn separated representations for different tasks. Considering we only employ two simple auxiliary tasks, the effect of information disentanglement is noticeable.

Ablation Studies
Effect of Disentanglement In order to demonstrate that each module of our information disentanglement helps the learning process, we performed an ablation study on the two auxiliary tasks using order 5 as a case study. The results are summarized in Table 6. We found that both task-id prediction and next sentence prediction contribute to the final performance.

Effect of Regularization We also compared imposing regularization on the two spaces to different extents. While we may expect to give the specific space more tolerance for change, we found that applying no regularization on it leads to severe forgetting of the previously learned task specific embeddings; hence it is necessary to regularize this space as well. Beyond that, we also observed that under most circumstances, adding regularization on the task generic space g yields a more significant gain than adding it on the task specific space s, consistent with our intuition that the task generic space changes less across tasks, so preserving it better helps more in alleviating catastrophic forgetting.

Impact of K-Means
To demonstrate our hypothesis that when the memory budget is limited, selecting the most representative subset of examples is vital to the success of continual learning, we performed an ablation study on orders 1, 2 and 3 using IDBR with and without K-means. The results are shown in Table 8. We find that using K-means boosts the overall performance. Specifically, the improvement brought by K-means is larger on the more challenging orders, i.e., orders on which IDBR has worse performance. This is because on these challenging orders forgetting is more severe, and the model needs more informative examples from previous tasks to retain previous knowledge. Thus, under the same memory budget, diversity across the saved examples helps the model better recover knowledge learned from previous tasks.

Conclusion
In this work, we introduce an information disentanglement based regularization (IDBR) method for continual text classification, which disentangles the hidden space into a task generic space and a task specific space and regularizes them differently. We also leverage K-means as the memory selection rule to help the model benefit from the augmented episodic memory module. Experiments conducted on five benchmark datasets demonstrate that IDBR achieves better performance than previous state-of-the-art baselines on sequences of text classification tasks with various orders and lengths. We believe the proposed approach can be extended to continual learning for other NLP tasks such as sequence generation and sequence labeling as well, and plan to explore this in the future.