Lifelong Language Knowledge Distillation

It is challenging to perform lifelong language learning (LLL) on a stream of different tasks without any performance degradation compared to the multi-task counterparts. To address this issue, we present Lifelong Language Knowledge Distillation (L2KD), a simple but efficient method that can be easily applied to existing LLL architectures in order to mitigate the degradation. Specifically, when the LLL model is trained on a new task, we assign a teacher model to first learn the new task, and then pass the knowledge to the LLL model via knowledge distillation. Therefore, the LLL model can better adapt to the new task while keeping the previously learned knowledge. Experiments show that the proposed L2KD consistently improves previous state-of-the-art models, and the degradation compared to multi-task models is well mitigated for both sequence generation and text classification tasks.


Introduction
Training a single model to learn a stream of different tasks sequentially usually faces the catastrophic forgetting problem (McCloskey and Cohen, 1989): after learning a new task, the model forgets how to handle samples from previous tasks. Lifelong learning manages to accumulate knowledge and retain the performance of previously learned tasks. It is especially important for real-world natural language processing (NLP) applications, because these applications need to interact with many users from different domains every day, and language usage also evolves over time. Hence, various NLP tasks have been studied for lifelong learning in prior work, including sentiment analysis (Chen et al., 2015; Xia et al., 2017), conversational agents (Lee, 2017), word and sentence representation learning (Xu et al., 2018; Liu et al., 2019), text classification, and question answering. Recently, LAMOL (Sun et al., 2020) improved the performance of LLL with a general framework: 1) it followed the idea of considering many NLP tasks as question answering (QA) (McCann et al., 2018) and adapted all tasks into the language modeling (LM) form; in this unified framework, it can perform LLL on many NLP tasks by generating answers based on the contexts and the questions using a single language model, and 2) it outperformed previous methods by a considerable margin and is only 2%-3% worse than the multi-task upper bound, which jointly learns all tasks on a mixed dataset. The source code and data are available at https://github.com/voidism/L2KD.
This paper further improves LLL by introducing Lifelong Language Knowledge Distillation (L2KD), which can be flexibly applied upon the LAMOL architecture or other LLL methods for sequence generation learning.
The motivation of our work mainly comes from the question of how to efficiently compress knowledge under a lifelong language learning framework. If the model can learn a new task in an efficient way, the previously learned knowledge may be less affected, and thus the problem of catastrophic forgetting can be mitigated.
Inspired by knowledge distillation (Bucila et al., 2006; Hinton et al., 2015; Kim and Rush, 2016), in which a (smaller) student model is trained to imitate the behavior of a (larger) teacher model in order to reach performance closer to that of the teacher, the LLL model in L2KD can be seen as a weak learner that needs to compress knowledge from different tasks into a single compact model. Thus LLL can benefit from a similar knowledge distillation procedure, even though the model size is equal to that of its teacher model. The similar idea of distilling knowledge from equal-size models has also been studied in born-again neural networks (Furlanello et al., 2018), multi-task learning (Clark et al., 2019) and lifelong computer vision learning (Hou et al., 2018), but has never been explored in lifelong language learning research.
In L2KD, we train a new teacher model when facing a new task, and the LLL model imitates the behavior of its teacher at each training stage, as illustrated in Figure 1. This method only needs a little extra time to train a disposable teacher model for each new task, and the teacher model can be discarded when learning the next task; therefore, there is no extra memory or model capacity required for the target LLL model, making the proposed model more memory-efficient for real-world usage.

Proposed Approach
Before describing how L2KD works, in Section 2.1 we briefly introduce the architecture of LAMOL (Sun et al., 2020), which L2KD is built upon. Then we introduce different knowledge distillation strategies in Section 2.2, and how to apply them to L2KD in Section 2.3.

LAMOL: Language Modeling for Lifelong Language Learning
In the setting of LAMOL, all samples in language datasets have three parts: context, question and answer. We can simply concatenate these three parts into a single sentence and train the model to generate the answer based on the context and question prior to it, as illustrated in Figure 2a.  Besides generating answers for the given questions, the model simultaneously learns to model the whole training sample, as illustrated in Figure 2b. By doing that, when training on the next task, the model can generate training samples for the previous tasks and train on both data from the new task and the generated pseudo-data for the prior tasks. Thus the model would forget less when adapting to the new tasks.
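The sample format and pseudo-data replay described above can be sketched in a few lines of Python. This is a toy illustration only: the special tokens (`[TASK]`, `[SEP]`, `[ANS]`, `[EOS]`) and function names are hypothetical placeholders, not the actual tokens or API of the released LAMOL code.

```python
def format_sample(context, question, answer, task_token="[TASK]"):
    """Concatenate the three parts of a sample into the single LM training
    sequence: task-specific BOS token, context, question, then answer."""
    return f"{task_token} {context} [SEP] {question} [ANS] {answer} [EOS]"

def mix_with_pseudo_data(new_task_samples, generate_pseudo, sample_rate=0.2):
    """Before training on a new task, sample pseudo-data for previous tasks
    from the LM itself and mix it with the new task's real samples."""
    n_pseudo = int(len(new_task_samples) * sample_rate)
    pseudo = [generate_pseudo() for _ in range(n_pseudo)]
    return new_task_samples + pseudo
```

With a 0.2 sample rate, a 10,000-sample task would be trained together with 2,000 self-generated pseudo-samples covering the earlier tasks, which is what lets the model rehearse old tasks without storing their data.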
LAMOL outperforms previous regularization-based (Schwarz et al., 2018; Aljundi et al., 2018) and memory-based (Lopez-Paz et al., 2017) LLL methods by a large margin. While most previous methods obtain results only slightly better than the finetuning baseline (doing nothing to prevent forgetting), LAMOL already achieves results that are very close to the multi-task upper bound, only 2%-3% worse (Sun et al., 2020) than it. Thus, in this paper, we focus on how to apply L2KD on top of LAMOL.

Knowledge Distillation
Language Modeling The training objective for normal language modeling is to minimize the negative log-likelihood (NLL) of predicting the next word (hard target):

$\mathcal{L}_{\mathrm{NLL}}(x; \theta) = -\sum_{t=1}^{T} \log P(x_t \mid x_{<t}; \theta),$

where $x_t$ denotes the t-th word in the sentence, $x_{<t}$ denotes all words prior to $x_t$, and $\theta$ is the parameters of the language model.
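As a concrete reference, the NLL objective can be sketched in plain Python. The list-of-distributions interface is purely illustrative (a real model produces logits over a large vocabulary), but the arithmetic is the same:

```python
import math

def nll_loss(probs_per_step, target_ids):
    """Negative log-likelihood of a target sequence under a language model.
    probs_per_step[t] is the model's next-word distribution at step t
    (a plain list of probabilities); target_ids[t] is the index of the
    ground-truth next word x_t."""
    return -sum(math.log(probs_per_step[t][target_ids[t]])
                for t in range(len(target_ids)))
```

For a two-step sequence where the model assigns probabilities 0.9 and 0.8 to the correct words, the loss is -ln 0.9 - ln 0.8.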
In knowledge distillation, instead, we minimize the prediction errors between student and teacher models. The target unit for considering the errors can be done in the word level or the sequence level.
Word-Level (Word-KD) We minimize the cross-entropy between the output distributions of the student and teacher models when predicting the next word:

$\mathcal{L}_{\mathrm{Word\text{-}KD}}(x; \theta_S) = -\sum_{t=1}^{T} \sum_{k=1}^{|V|} P(x_t = V_k \mid x_{<t}; \theta_T)\, \log P(x_t = V_k \mid x_{<t}; \theta_S),$

where the input $x_{<t}$ is from the ground-truth sequence. $V$ denotes the vocabulary set and $V_k$ is the k-th word in $V$. $\theta_S$ and $\theta_T$ are the parameters of the student model and teacher model, respectively.
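Under the same toy interface as above, the word-level distillation loss is a per-step cross-entropy between the teacher's soft distribution and the student's distribution, summed over the sequence (a sketch, not the paper's actual implementation):

```python
import math

def word_kd_loss(teacher_dists, student_dists):
    """Word-level KD: cross-entropy between teacher and student next-word
    distributions at every step, with ground-truth words as inputs x_<t.
    Each element of the two lists is one step's vocabulary distribution."""
    loss = 0.0
    for p_T, p_S in zip(teacher_dists, student_dists):
        loss -= sum(pt * math.log(ps) for pt, ps in zip(p_T, p_S))
    return loss
```

When the teacher is fully confident (a one-hot distribution), this reduces exactly to the NLL loss on the teacher's preferred word, so Word-KD generalizes the hard-target objective.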
Sequence-Level (Seq-KD) Similar to Kim and Rush (2016), we minimize the negative log-likelihood directly on the greedy-decoded or beam-searched output sequence $\hat{x}$ from the teacher model as the hard target, just like normal language modeling:

$\mathcal{L}_{\mathrm{Seq\text{-}KD}}(\hat{x}; \theta_S) = -\sum_{t=1}^{T} \log P(\hat{x}_t \mid \hat{x}_{<t}; \theta_S).$

Seq-KD is usually applied to improve weak non-autoregressive translation (NAT) models (Zhou et al., 2020) by reducing the multi-modality problem in machine translation datasets (Gu et al., 2018).
Soft Sequence-Level (Seq-KD_soft) We further investigate whether the soft target together with the teacher-decoded sequence can help the model more, so we conduct Seq-KD_soft, in which we perform Word-KD on the greedy-decoded or beam-searched outputs from the teacher model. The only difference between Seq-KD_soft and Word-KD is that the input $x_{<t}$ of Word-KD is now replaced with $\hat{x}_{<t}$, the output sequence from the teacher model:

$\mathcal{L}_{\mathrm{Seq\text{-}KD}_{soft}}(\hat{x}; \theta_S) = -\sum_{t=1}^{T} \sum_{k=1}^{|V|} P(\hat{x}_t = V_k \mid \hat{x}_{<t}; \theta_T)\, \log P(\hat{x}_t = V_k \mid \hat{x}_{<t}; \theta_S).$

Note that no matter which distillation loss we use, the teacher model is always fixed. Hence, the optimization process of finding the parameters $\theta_S^*$ of the LLL model can be written as:

$\theta_S^* = \arg\min_{\theta_S} \mathcal{L}_{\mathrm{KD}}(x; \theta_S).$
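The two sequence-level variants can be sketched with the same toy interface. Here `teacher_step` and `student_step` are hypothetical callables mapping a prefix to a next-word distribution, and greedy decoding stands in for beam search; this is an illustration of the losses, not the released implementation:

```python
import math

def greedy_decode(teacher_step, max_len):
    """Greedy-decode a sequence x-hat from the (fixed) teacher model."""
    seq = []
    for _ in range(max_len):
        dist = teacher_step(seq)
        seq.append(max(range(len(dist)), key=dist.__getitem__))  # argmax
    return seq

def seq_kd_losses(teacher_step, student_step, max_len):
    """Return (Seq-KD, Seq-KD_soft) losses on the teacher-decoded sequence.
    Seq-KD treats x-hat as a hard target; Seq-KD_soft distills the teacher's
    full distribution at each step, conditioned on the same x-hat prefix."""
    x_hat = greedy_decode(teacher_step, max_len)
    hard, soft = 0.0, 0.0
    for t in range(max_len):
        p_T = teacher_step(x_hat[:t])
        p_S = student_step(x_hat[:t])
        hard -= math.log(p_S[x_hat[t]])                                 # Seq-KD
        soft -= sum(pt * math.log(ps) for pt, ps in zip(p_T, p_S))      # Seq-KD_soft
    return hard, soft
```

The only moving part between the two variants is whether the target at each step is the teacher's argmax word (hard) or its whole distribution (soft); the input prefix is the teacher's output in both cases.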

L2KD: Lifelong Language Knowledge Distillation
Knowledge distillation can be applied to minimize both the LM and QA losses in LAMOL. Assume that there is a stream of tasks with datasets $\{D_1, D_2, ...\}$; our LLL model has learned from $D_1$ to $D_{m-1}$ and is now being adapted to $D_m$. First we train a teacher model on $D_m$ by minimizing the negative log-likelihood loss for both LM and QA in LAMOL, and obtain the model parameters $\theta_T^m$. Now our LLL model (with parameters $\theta_S$) can be trained on $D_m$ by knowledge distillation from the teacher model. Given a training sample $X_i^m = \{x_1, x_2, ..., x_T\} \in D_m$ (including the context, question and answer), we minimize

$\mathcal{L}_{\mathrm{KD}}(X_i^m; \theta_S) = -\sum_{t=a_1}^{T} \sum_{k=1}^{|V|} P(x_t = V_k \mid x_{<t}; \theta_T^m)\, \log P(x_t = V_k \mid x_{<t}; \theta_S),$

where $a_1$ denotes the start position of the answer.
Here we take Word-KD for illustration, but we can also replace the text in the answer part with the teacher-generated answers, so as to conduct Seq-KD soft or Seq-KD.
Besides training on samples from $D_m$, the LLL model also generates pseudo-data $D_{prev}$ for previous tasks. For samples in $D_{prev}$, however, we cannot perform knowledge distillation, because in our setting the teacher models of previous tasks are discarded after adapting to the next task. Therefore, given the generated data $X_i^{prev} \in D_{prev}$, we only minimize the NLL loss:

$\mathcal{L}_{\mathrm{NLL}}(X_i^{prev}; \theta_S) = -\sum_{t=1}^{T} \log P(x_t \mid x_{<t}; \theta_S).$

Finally, we jointly optimize the two losses and obtain the parameters $\theta_S^*$ for the LLL model:

$\theta_S^* = \arg\min_{\theta_S} \Big[ \sum_{X_i^m \in D_m} \mathcal{L}_{\mathrm{KD}}(X_i^m; \theta_S) + \sum_{X_i^{prev} \in D_{prev}} \mathcal{L}_{\mathrm{NLL}}(X_i^{prev}; \theta_S) \Big].$

The training procedure is detailed in Algorithm 1.
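One joint training iteration can be sketched as follows. The function names and the `update` callback are hypothetical stand-ins for the actual loss and optimizer code; the point is only the structure of the joint objective: distillation on new-task samples, plain NLL on self-generated pseudo-samples.

```python
def l2kd_step(new_batch, pseudo_batch, kd_loss, nll_loss, update):
    """One L2KD iteration.
    new_batch:    samples from the current task D_m, distilled from the
                  disposable teacher via kd_loss (Word-KD / Seq-KD variants).
    pseudo_batch: pseudo-samples generated for previous tasks, trained with
                  the ordinary hard-target nll_loss (no teacher available).
    update:       applies the gradient step on the student parameters."""
    total = sum(kd_loss(x) for x in new_batch) \
          + sum(nll_loss(x) for x in pseudo_batch)
    update(total)
    return total
```

Because the teacher only appears inside `kd_loss` for the current task, it can be thrown away as soon as the next task arrives, which is why L2KD needs no extra long-term memory or model capacity.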

Experimental Setup
To evaluate the proposed method, we conduct a set of experiments detailed below.

Model and Training Details
We build our proposed approach based on the implementation of LAMOL to make the results comparable. We use the same pre-trained small GPT-2 (Radford et al., 2019) for all single-task teacher, multi-task and LLL models, and train the GPT-2 for nine epochs on each dataset. We use the best setting in LAMOL: task-specific tokens as begin-of-sentence tokens, and a pseudo-data sample rate of 0.2. During inference, we use greedy decoding to generate sequences. More details can be found in Appendix A.

Datasets
To evaluate the capability of L2KD on diverse sequence generation tasks, we pick the following three tasks from DecaNLP (McCann et al., 2018): WikiSQL, CNN/DailyMail, and MultiWOZ. For the experiments on the same task in different domains, we use the natural language generation datasets E2ENLG and RNNLG (restaurant, hotel, tv, laptop); we use the full dataset for the first three domains and the reduced set for the laptop domain to keep them balanced. Although our method is mainly designed for sequence generation tasks, we also use five different text classification datasets to evaluate whether the proposed method also benefits text classification tasks. We use the randomly sampled subsets released by Sun et al. (2020), each of which has 115,000 training and 7,600 testing instances.
• AGNews: News articles, including 4 classes for their topics.
• Yelp: Customer reviews on Yelp, including 5 classes for their rating scores.
• Amazon: Customer reviews on Amazon, including 5 classes for their rating scores.
• DBPedia: Articles on Wikipedia, including 14 classes for their categories.
• Yahoo: QA pairs on the Yahoo! platform, including 10 classes for their categories.

Due to the limitation of computational resources and the data imbalance, we reduce the large datasets (WikiSQL, CNN/DailyMail, E2E NLG, RNNLG (laptop)) to a smaller size by random sampling. The reduced data sizes and other data statistics used in the experiments are detailed in Table 1.

Results and Discussion
We discuss the results for three settings: 1) different sequence generation tasks, 2) same tasks in different domains, and 3) different text classification tasks in order to validate the effectiveness of the proposed approach.

Different Sequence Generation Tasks
In the experiments, we perform lifelong learning on the WikiSQL (SQL), CNN/DailyMail (CNN) and MultiWOZ (WOZ) datasets with six different permutation orders, and test the performance at the end of the training streams. The detailed results are shown in Table 2, where the average scores over the three tasks are reported for overall comparison. Note that the evaluation metrics of these three tasks all range from 0 to 100. The overall results of the six orders compared with single-task methods and multi-task upper bounds are shown in Table 3.
In Table 2, the first baseline is (a) Finetune, in which we directly train the three tasks one after another without preventing catastrophic forgetting. It is obvious that the Finetune model forgets one or two of the tasks learned before the final one. (b) LAMOL is the current state-of-the-art approach, which significantly reduces catastrophic forgetting. Rows (c)-(e) show that applying L2KD upon LAMOL significantly outperforms LAMOL in almost all cases, no matter which knowledge distillation strategy is used: (c) Word-KD, (d) Seq-KD_soft, (e) Seq-KD. We also observe that among the three knowledge distillation strategies, (e) Seq-KD consistently improves the most on the CNN/DailyMail dataset, which is probably caused by the noisy nature of this summarization dataset: sequence-level knowledge distillation produces an easy-to-learn answer compared to the original complex answer, so the LLL model can learn it better.
On the other hand, for the other two tasks (MultiWOZ, WikiSQL), (c) Word-KD and (d) Seq-KD_soft improve more in most cases. The target sequences of these two tasks are relatively simple: MultiWOZ focuses on producing semantic state sequences from dialogues, and WikiSQL produces structured query sequences from the given natural language sentences, so the target sequences usually contain patterns less complex than natural language. In these cases, the soft targets may bring more advantages than teacher-decoded sequences for the LLL model to learn from.

In Table 3, the overall performance (averaged over the six permutation orders) is compared with single-task methods and multi-task upper bounds. There are two training methods here: optimizing the QA loss only (rows (1)(3)(5)) or optimizing both QA and LM losses (rows (2)(4)(6)), as illustrated in Figure 2. For multi-task models, we find that the same number of training steps (9 epochs on the mixed set) may not lead the models to converge (rows (3)(4)), so we additionally train multi-task models three times longer (27 epochs on the mixed set) in rows (5)(6).
The second part of Table 3 shows the average performance in lifelong learning over the six permutation orders. It is clear that L2KD significantly improves the average score from 54.9 in (b) LAMOL to 57.1 in (d) Seq-KD_soft. The performance of Seq-KD_soft is only 0.7% worse than the multi-task upper bound, 57.8 in row (6), largely reducing the gap between lifelong learning and multi-task learning. Note that we can also apply a similar distillation strategy to multi-task learning to obtain a stronger upper bound, which might be a fairer comparison. Thus, we add Seq-KD to (6) Multi_long QA+LM by making the model learn from five single-task teachers, and the results are shown in row (7). We observe that the improvement on multi-task learning is only 0.2%, while L2KD improves LAMOL by 2.2%. This result indicates that the benefit brought by knowledge distillation may be saturated for multi-task learning, but not for L2KD. The gap between lifelong learning and multi-task learning is still reduced even if we apply a similar strategy to both models.
The third part of Table 3 shows the standard deviations over the six permutation orders. As mentioned in Sun et al. (2020), if an algorithm has a smaller standard deviation over different training orders, it is more robust and less susceptible to learning orders. We find that the average standard deviation of LAMOL is reduced from 3.3 to 1.9 with Seq-KD_soft. Therefore, both soft-target training and teacher-decoded sequences can stabilize the training process of LLL and make it more order-agnostic.
To further analyze the performance change when training on different tasks, we plot the testing results throughout all lifelong learning stages with the order WOZ (epochs 1-9) → SQL (epochs 10-18) → CNN (epochs 19-27) in Figure 3. Figure 3a illustrates the performance on WOZ for all methods. The finetune baseline (purple line) significantly degrades when moving to the next task (SQL) in the second training stage, while the other methods keep the performance. We observe that applying soft-target Word-KD (blue) or Seq-KD_soft (red) increases the scores faster than hard-target Seq-KD (yellow) and the LAMOL baseline (green) in the initial epochs, indicating the effectiveness of the proposed L2KD. When training on the other two tasks, all distillation methods (Word-KD, Seq-KD_soft, Seq-KD) maintain the performance on WOZ slightly better than LAMOL, and finally converge to better points in the third training stage. A similar trend can be observed in Figure 3b, where soft-target Word-KD and Seq-KD_soft rise faster in the second training stage and finally drop less than the original LAMOL in the third training stage, demonstrating the desirable properties of the proposed methods as LLL models. In Figure 3c, in the third stage, the yellow line (Seq-KD) converges to a better point than all other methods, because it reduces the complexity of the noisy summarization dataset. However, although Seq-KD_soft also reduces the complexity, it does not achieve the same performance as Seq-KD. The probable reason is that the teacher-decoded sentences may be easy enough for the LLL model to learn from, and the soft targets here prevent the model from completely converging on these easy sentences.

Same Task in Different Domains
We perform L2KD on the same NLG task in five different domains: restaurant from E2ENLG, and restaurant/hotel/tv/laptop from RNNLG. Note that although both E2ENLG and RNNLG have a restaurant domain, their input formats and label types are totally different. The results are shown in Table 4, where we only show two orders in the experiments: from the hardest task to the easiest one (left-to-right) and its reverse order (right-to-left). The results show that L2KD outperforms the original LAMOL in most cases and improves the average ROUGE score by nearly 2 points. We find that different training orders bring slightly different results. In the right-to-left order, the baseline LAMOL can easily outperform the multi-task models due to its easiest-to-hardest order, which helps the model to gradually transfer knowledge in these NLG tasks, similar to curriculum learning (Bengio et al., 2009). Therefore, it does not mean that lifelong learning can beat the multi-task model in all experiments.
We plot the learning curves of these five tasks in the left-to-right order in Figure 4 for further analysis. Except for E2ENLG, Seq-KD (yellow lines) usually gains more performance by the end of the training stream. Also, we observe that when forward transfer exists, Seq-KD usually benefits more. For example, in Figure 4c, when training on RNNLG (restaurant) in epochs 10-18, the ROUGE score on RNNLG (hotel) rises even before the model first sees RNNLG (hotel) data at the 19th epoch, indicating that forward transfer exists in this order. The rising trend is more obvious for Seq-KD (yellow), and the same trend can also be observed in Figures 4d and 4e.

Text Classification Tasks
Although our method is mainly designed for sequence generation tasks, we investigate whether this idea also benefits text classification (TC) tasks. Thus we perform L2KD on five TC tasks, where the answers are always very short sequences representing the class labels of the given documents, such as World, Sports, Business, or Sci/Tech in the AGNews dataset. Hence, generating such short answers is not complex for the proposed model, and the performance mainly reflects text understanding rather than generation capability.
We also conduct the experiments from the hardest task to the easiest task and in its reverse order, as shown in Table 5. To our surprise, L2KD also improves LAMOL on TC tasks. The results of these two orders are only 0.1% worse than the multi-task upper bounds. Word-KD improves the most on these TC tasks in most cases, and the improvements are especially obvious for the earlier learned tasks. The learning curves for the TC tasks are shown in Appendix B for reference.
Because the answers in TC tasks are not as complex as in other sequence generation tasks, we investigate where the improvement mainly comes from during the distillation process. Therefore, we split each testing set into two groups: (A) questions correctly answered by the teacher model, and (B) questions incorrectly answered by the teacher model. We suspect that the LLL model trained by L2KD may simply copy the behavior of the teacher models and obtain improvement mainly on group (A), while failing to answer the questions in group (B). To verify this, we compute the accuracy of each LLL model (in the left-to-right experiment) on groups (A) and (B) respectively, and the difference between the original LAMOL and the three distillation strategies on the five tasks. The averaged results are shown in Table 6, and more detailed results for each task can be found in Appendix C. Surprisingly, applying L2KD does not largely degrade the accuracy on group (B) compared to the original LAMOL, and Word-KD even improves it, showing that the LLL model does not fully copy the behavior of its teacher models. On the other hand, the total improvement indeed mainly comes from group (A), where Word-KD also improves the most. The improvement on both groups (A) and (B) for Word-KD indicates that on these TC tasks, the LLL model trained by Word-KD can better balance the teacher knowledge and its transfer ability. Therefore, it can benefit from the teacher knowledge while avoiding some false knowledge taught by its teacher, by integrating knowledge from other tasks.
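The group-wise analysis described above amounts to partitioning the test set by teacher correctness and scoring the LLL model on each partition. A minimal sketch (label encodings and variable names are illustrative):

```python
def group_accuracy(teacher_preds, model_preds, gold):
    """Split test questions by whether the teacher answered correctly
    (group A) or incorrectly (group B), and report the LLL model's
    accuracy on each group separately."""
    hits = {"A": [0, 0], "B": [0, 0]}  # [correct, total] per group
    for t, m, g in zip(teacher_preds, model_preds, gold):
        group = "A" if t == g else "B"
        hits[group][0] += int(m == g)
        hits[group][1] += 1
    return {k: c / n if n else 0.0 for k, (c, n) in hits.items()}
```

If distillation merely copied the teacher, accuracy on group (B) would collapse toward zero; comparing these two numbers against the LAMOL baseline is what separates imitation from genuine knowledge transfer.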
Related Work

Knowledge distillation has been introduced to the field of lifelong learning before, for example in Learning without Forgetting (LwF) (Li and Hoiem, 2017), Generative Replay with Distillation (DGR+distill), Replay-through-Feedback (RtF) (van de Ven and Tolias, 2018), and Lifelong GAN (Zhai et al., 2019), but all of this prior work targets computer vision tasks. Different from the prior work, this paper is the first attempt to adopt knowledge distillation for NLP tasks in a lifelong learning framework. Moreover, the prior work used the old model as a teacher to help the current model retain the knowledge about previous tasks. In contrast, our method trains a new teacher model on the incoming new task. Thus, these two directions of applying knowledge distillation are complementary to each other, showing the potential of applying the proposed method to fields other than NLP.

Conclusion
This paper presents Lifelong Language Knowledge Distillation (L2KD), a simple method that effectively helps lifelong language learning models maintain performance comparable to their multi-task upper bounds. The experiments show consistent improvement achieved by L2KD in three different settings, indicating the effectiveness of the proposed method for training robust LLL models. In addition, the proposed approach only requires a little extra time for training the teacher, without extra memory or model capacity, showing its potential to be applied in practical scenarios.