Continual Learning for Natural Language Generation in Task-oriented Dialog Systems

Natural language generation (NLG) is an essential component of task-oriented dialog systems. Despite the recent success of neural approaches for NLG, they are typically developed in an offline manner for particular domains. To better fit real-life applications where new data come in a stream, we study NLG in a “continual learning” setting to expand its knowledge to new domains or functionalities incrementally. The major challenge towards this goal is catastrophic forgetting, meaning that a continually trained model tends to forget the knowledge it has learned before. To this end, we propose a method called ARPER (Adaptively Regularized Prioritized Exemplar Replay) by replaying prioritized historical exemplars, together with an adaptive regularization technique based on Elastic Weight Consolidation. Extensive experiments to continually learn new domains and intents are conducted on MultiWoZ-2.0 to benchmark ARPER with a wide range of techniques. Empirical results demonstrate that ARPER significantly outperforms other methods by effectively mitigating the detrimental catastrophic forgetting issue.


Introduction
As an essential part of task-oriented dialog systems (Wen et al., 2015b; Bordes et al., 2016), the task of Natural Language Generation (NLG) is to produce a natural language utterance containing the desired information given a semantic representation (so-called dialog act). Existing NLG models (Wen et al., 2015c; Tran and Nguyen, 2017; Tseng et al., 2018) are typically trained offline using annotated data from a single or a fixed set of domains. However, a desirable dialog system in real-life applications often needs to expand its knowledge to new domains and functionalities. Therefore, it is crucial to develop an NLG approach with the capability of continual learning after a dialog system is deployed. Specifically, an NLG model should be able to continually learn new utterance patterns without forgetting the old ones it has already learned.
The major challenge of continual learning lies in catastrophic forgetting (McCloskey and Cohen, 1989; French, 1999). Namely, a neural model trained on new data tends to forget the knowledge it has acquired on previous data. We diagnose in Section 4.4 that neural NLG models suffer from this detrimental catastrophic forgetting issue when continually trained on new domains. A naive solution is to retrain the NLG model on all historical data every time; however, this is not scalable due to severe computation and storage overhead.
To this end, we propose storing a small set of representative utterances from previous data, namely exemplars, and replaying them to the NLG model each time it is trained on new data. Methods using exemplars have shown great success in various continual learning (Rebuffi et al., 2017; Castro et al., 2018; Chaudhry et al., 2019) and reinforcement learning (Schaul et al., 2016; Andrychowicz et al., 2017) tasks. In this paper, we propose a prioritized exemplar selection scheme to choose representative and diverse exemplar utterances for NLG. We empirically demonstrate that prioritized exemplar replay alleviates catastrophic forgetting to a large degree.
In practice, the number of exemplars should be reasonably small to maintain a manageable memory footprint. Therefore, the constraint of not forgetting old utterance patterns is not strong enough. To enforce a stronger constraint, we propose a regularization method based on the well-known technique, Elastic Weight Consolidation (EWC (Kirkpatrick et al., 2017)). The idea is to use a quadratic term to elastically regularize the parameters that are important for previous data. Besides the wide application in computer vision, EWC has been recently applied to the domain adaptation task for Neural Machine Translation (Thompson et al., 2019;Saunders et al., 2019). In this paper, we combine EWC with exemplar replay by approximating the Fisher Information Matrix w.r.t. the carefully chosen exemplars so that not all historical data need to be stored. Furthermore, we propose to adaptively adjust the regularization weight to consider the difference between new and old data to flexibly deal with different new data distributions.
To summarize our contributions: (1) to the best of our knowledge, this is the first attempt to study the practical continual learning configuration for NLG in task-oriented dialog systems; (2) we propose a method called Adaptively Regularized Prioritized Exemplar Replay (ARPER) for this task, and benchmark it against a wide range of state-of-the-art continual learning techniques; (3) extensive experiments are conducted on the MultiWoZ-2.0 (Budzianowski et al., 2018) dataset to continually learn new tasks, including domains and intents, using two base NLG models. Empirical results demonstrate the superior performance of ARPER and its ability to mitigate catastrophic forgetting. Our code is available at https://github.com/MiFei/Continual-Learning-for-NLG

Related Work

Continual Learning. The major challenge for continual learning is catastrophic forgetting (McCloskey and Cohen, 1989; French, 1999), where optimization over new data leads to performance degradation on data learned before. Methods designed to mitigate catastrophic forgetting fall into three categories: regularization, exemplar replay, and dynamic architectures. Methods using dynamic architectures (Rusu et al., 2016; Maltoni and Lomonaco, 2019) increase model parameters throughout the continual learning process, which leads to an unfair comparison with other methods. In this work, we focus on the first two categories.
Regularization methods add specific regularization terms to consolidate knowledge learned before. Li and Hoiem (2017) introduced knowledge distillation (Hinton et al., 2015) to penalize model logit change, and it has been widely employed in subsequent work (Rebuffi et al., 2017; Zhao et al., 2019). Another direction is to regularize parameters crucial to old knowledge according to various importance measures (Kirkpatrick et al., 2017; Zenke et al., 2017; Aljundi et al., 2018). Exemplar replay methods store past samples, a.k.a. exemplars, and replay them periodically. Instead of selecting exemplars at random, Rebuffi et al. (2017) incorporated the Herding technique (Welling, 2009), which was later adopted by Mi et al. (2020a,b). Ramalho and Garnelo (2019) proposed to store samples that the model is least confident about. Chaudhry et al. (2019) demonstrated the effectiveness of exemplars for various continual learning tasks in computer vision.
Catastrophic Forgetting in NLP. The catastrophic forgetting issue in NLP tasks has drawn increasing attention recently (Mou et al., 2016; Chronopoulou et al., 2019). Yogatama et al. (2019); Arora et al. (2019) identified the detrimental catastrophic forgetting issue while fine-tuning ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019). To deal with this issue, He et al. (2019) proposed to heavily replay pre-training data during fine-tuning, and Chen et al. (2020) proposed an improved Adam optimizer to recall knowledge captured during pre-training. The catastrophic forgetting issue has also been observed in domain adaptation setups for neural machine translation (Saunders et al., 2019; Thompson et al., 2019; Varis and Bojar, 2019) and in reading comprehension (Xu et al., 2019).
Lee (2017) was the first to study the continual learning setting for dialog state tracking in task-oriented dialog systems. However, their setting is still a one-time adaptation process, and the adopted dataset is small. Shen et al. (2019) recently applied progressive networks (Rusu et al., 2016) to the semantic slot filling task from a continual learning perspective similar to ours. However, their method is based on a dynamic architecture, which is beyond the scope of this paper. Liu et al. (2019) proposed a Boolean operation of "conceptor" matrices for continually learning sentence representations using linear encoders. Li et al. (2020) combined continual learning and language systematic compositionality for sequence-to-sequence learning tasks.
Recent works have considered the domain adaptation setting. Tseng et al. (2018); Tran and Nguyen (2018b) proposed to learn domain-invariant representations using VAE (Kingma and Welling, 2013). They later designed two domain adaptation critics (Tran and Nguyen, 2018a). Recently, Mi et al. (2019); Qian and Yu (2019); Peng et al. (2020) studied learning new domains with limited training data. However, existing methods only consider a one-time adaptation process. The continual learning setting and the corresponding catastrophic forgetting issue remain to be explored.

Model
In this section, we first introduce the background of neural NLG models in Section 3.1, and the continual learning formulation in Section 3.2. In Section 3.3, we introduce the proposed method ARPER.

Background on Neural NLG Models
The NLG component of task-oriented dialog systems produces natural language utterances conditioned on a semantic representation called dialog act (DA). Specifically, the dialog act d is defined as the combination of an intent I and a set of slot-value pairs:

d = [ I, (s_1 = v_1, ..., s_p = v_p) ],   (1)

where p is the number of slot-value pairs. The intent I controls the utterance functionality, while the slot-value pairs contain the information to express. For example, "There is a restaurant called [La Margherita] that serves [Italian] food." is an utterance corresponding to the DA "[Inform, (name=La Margherita, food=Italian)]".

Neural models have recently shown promising results for NLG tasks. Conditioned on a DA, a neural NLG model generates an utterance containing the desired information word by word. For a DA d with the corresponding ground-truth utterance Y = (y_1, y_2, ..., y_K), the probability of generating Y is factorized as:

P(Y | d) = ∏_{k=1}^{K} p_{y_k},   (2)

where f_θ is the NLG model parameterized by θ, and p_{y_k} is the output probability (i.e., softmax of logits) of the ground-truth token y_k at position k. The typical objective function for an utterance Y with DA d is the average cross-entropy loss w.r.t. all tokens in the utterance (Wen et al., 2015c,b; Tran and Nguyen, 2017; Peng et al., 2020):

L_CE(Y, d) = −(1/K) ∑_{k=1}^{K} log p_{y_k}.   (3)
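In code, this objective reduces to averaging the negative log-probabilities of the ground-truth tokens. A minimal sketch (the `token_probs` argument stands in for the softmax outputs of f_θ, which are not implemented here):

```python
import math

def nlg_cross_entropy(token_probs):
    """Average cross-entropy over the ground-truth tokens of one utterance.

    token_probs: list of p_{y_k}, the model's probability of the ground-truth
    token at each position k (a stand-in for f_theta's softmax outputs).
    """
    k = len(token_probs)
    return -sum(math.log(p) for p in token_probs) / k

# A confident model (probabilities near 1) yields a lower loss.
loss_confident = nlg_cross_entropy([0.9, 0.8, 0.95])
loss_uncertain = nlg_cross_entropy([0.3, 0.2, 0.4])
assert loss_confident < loss_uncertain
```

The same per-utterance loss L_CE is reused below, both as the training objective and as an ingredient of the exemplar priority score.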

Continual Learning of NLG
In practice, an NLG model needs to continually learn new domains or functionalities. Without loss of generality, we assume that new data arrive task by task (Rebuffi et al., 2017;Kirkpatrick et al., 2017). In a new task t, new data D t are used to train the NLG model f θ t−1 obtained till the last task.
The updated model f_θt needs to perform well on all tasks seen so far. An example setting of continually learning new domains is illustrated in Figure 1. A task can be defined at different granularities to reflect diverse real-life applications; in the subsequent experiments, we consider continually learning new domains and new intents in Eq. (1). We emphasize that the continual learning setting is different from that of domain adaptation. The latter is a one-time adaptation process whose focus is to optimize performance on a target domain transferred from source domains, without considering the potential performance drop on the source domains (Mi et al., 2019; Qian and Yu, 2019; Peng et al., 2020). In contrast, continual learning requires an NLG model to learn new tasks over multiple transfers, with the goal of performing well on all tasks learned so far.

Adaptively Regularized Prioritized Exemplar Replay (ARPER)
We introduce the proposed method ARPER, which combines prioritized exemplar replay with an adaptive regularization technique to further alleviate the catastrophic forgetting issue.

Prioritized Exemplar Replay
To prevent the NLG model from catastrophically forgetting utterance patterns of earlier tasks, a small subset of each task's utterances is selected as exemplars, and exemplars from previous tasks are replayed in later tasks. When training the NLG model f_θt for task t, the set of exemplars from previous tasks, denoted as E_{1:t−1} = {E_1, ..., E_{t−1}}, is replayed by joining it with the data D_t of the current task. The training objective with exemplar replay can therefore be written as:

L_ER = ∑_{(Y, d) ∈ D_t ∪ E_{1:t−1}} L_CE(Y, d).   (4)

The set of exemplars of task t, referred to as E_t, is selected after f_θt has been trained, and will be replayed in later tasks.
The quality of exemplars is crucial to preserve the performance on previous tasks. We propose a prioritized exemplar selection method to select representative and diverse utterances as follows.
Representative utterances. The first criterion is that the exemplars E_t of a task t should be representative of D_t. We propose to select E_t as a priority list from D_t that minimizes the priority score:

score(Y, d) = L_CE(Y, d) · |S(d)|^β,   (5)

where S(d) is the set of slots in d, and β is a hyper-parameter. This score ties the representativeness of an utterance to its L_CE. Intuitively, the NLG model f_θt trained on D_t should be confident on representative utterances of D_t, i.e., assign them low L_CE. However, L_CE is agnostic to the number of slots. We found that an utterance with many common slots in a task could also have very low L_CE, yet using such utterances as exemplars may lead to overfitting and thus forgetting of previously acquired general knowledge. The second term, |S(d)|^β, controls how strongly the number of slots in an utterance affects its priority as an exemplar. We empirically found in Appendix A.1 that the best β is greater than 0.
Diverse utterances. The second criterion is that exemplars should contain diverse slots of the task, rather than being similar or repetitive. A drawback of the above priority score is that similar or duplicated utterances containing the same set of frequent slots could be prioritized over utterances with a diverse set of slots. To encourage diversity among the selected exemplars, we adopt an iterative approach that adds data from D_t to the priority list E_t based on the priority score. At each iteration, if the set of slots of the current utterance is already covered by utterances in E_t, we skip it and move on to the data with the next-best priority score.

Algorithm 1 (select_exemplars) summarizes this prioritized exemplar selection procedure of ARPER for task t: given D_t, the trained model f_θt, and an exemplar budget m, utterances are added to E_t in order of priority score, skipping any utterance whose slot set S(d) is already in the set S_seen of covered slot sets, and returning E_t once the budget is reached.
Algorithm 1 shows the procedure to select m exemplars as a priority list E t from D t . The outer loop allows multiple passes through D t to select various utterances for the same set of slots S(d).
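A compact Python sketch of this selection procedure follows. The field names (`utt`, `slots`, `ce_loss`) are illustrative, and `ce_loss` is assumed to be precomputed by the trained model; the priority score takes the form L_CE · |S(d)|^β described above.

```python
def select_exemplars(data, m, beta=1.0):
    """Prioritized, diversity-aware exemplar selection (sketch of Algorithm 1).

    data: list of dicts with keys 'utt' (utterance), 'slots' (set of slot
    names), and 'ce_loss' (the trained model's cross-entropy on the
    utterance). Returns up to m exemplars as a priority list.
    """
    # Lower score = higher priority: confident (low-CE) utterances with few
    # slots are preferred; beta > 0 penalizes slot-heavy utterances.
    ranked = sorted(data, key=lambda x: x['ce_loss'] * len(x['slots']) ** beta)
    exemplars, chosen = [], set()
    while len(exemplars) < m and len(chosen) < len(ranked):
        seen_slot_sets = set()  # reset each pass: later passes may revisit a slot set
        for idx, ex in enumerate(ranked):
            if len(exemplars) == m:
                break
            key = frozenset(ex['slots'])
            if idx in chosen or key in seen_slot_sets:
                continue  # skip duplicates of an already-covered slot set this pass
            seen_slot_sets.add(key)
            chosen.add(idx)
            exemplars.append(ex)
    return exemplars

demo = [
    {'utt': 'a', 'slots': {'name'}, 'ce_loss': 0.1},
    {'utt': 'b', 'slots': {'name'}, 'ce_loss': 0.2},
    {'utt': 'c', 'slots': {'name', 'food'}, 'ce_loss': 0.1},
]
# 'b' repeats the slot set of higher-priority 'a', so 'c' is taken first.
assert [e['utt'] for e in select_exemplars(demo, 2)] == ['a', 'c']
```

The outer `while` loop plays the role of Algorithm 1's multiple passes over D_t, allowing a second utterance with an already-seen slot set to be selected once every distinct slot set is covered.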

Reducing Exemplars in Previous Tasks
Algorithm 1 requires the number of exemplars m to be given. A straightforward choice is to store the same fixed number of exemplars for each task, as in Castro et al. (2018). However, this is undesirable because (1) the total memory footprint keeps growing as more tasks arrive; and (2) it does not discriminate between tasks with different difficulty levels.
To this end, we propose to store a fixed total number of exemplars throughout the entire continual learning process to maintain a bounded memory footprint, as in Rebuffi et al. (2017). As more tasks are continually learned, the exemplars of previous tasks are gradually reduced by keeping only the ones at the front of the priority list, and the exemplar size of a task is set proportional to the training data size of the task to reflect its difficulty. To be specific, suppose M exemplars are kept in total. The number of exemplars for a task j after learning t tasks is:

m_j = M · |D_j| / |D_{1:t}|,   (6)

where |D_{1:t}| = ∑_{j'=1}^{t} |D_{j'}|. We choose M = 250 or 500 in the experiments.
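The bounded-memory allocation can be sketched in a few lines. Note that the handling of rounding leftovers below is our own addition for illustration, not specified in the text:

```python
def exemplar_budgets(task_sizes, M):
    """Split a fixed total budget M across tasks in proportion to their
    training-set sizes (a sketch of the bounded-memory allocation)."""
    total = sum(task_sizes)
    budgets = [M * n // total for n in task_sizes]
    # Hand out any rounding leftovers to the largest tasks first (assumption).
    leftover = M - sum(budgets)
    for i in sorted(range(len(task_sizes)), key=lambda i: -task_sizes[i]):
        if leftover == 0:
            break
        budgets[i] += 1
        leftover -= 1
    return budgets

# Three tasks of decreasing size share M = 250 exemplars in total.
budgets = exemplar_budgets([8000, 3000, 1000], M=250)
assert sum(budgets) == 250
```

Because each task keeps a priority list, shrinking an earlier task's budget when a new task arrives amounts to truncating that list to its new `budgets[j]` entries.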

Constraint with Adaptive Elastic Weight Consolidation
Although exemplars of previous tasks are stored and replayed, the number of exemplars should be reasonably small (M ≪ |D_{1:t}|) to reduce memory overhead. As a consequence, the constraint we have imposed to prevent the NLG model from catastrophically forgetting previous utterance patterns is not strong enough. To enforce a stronger constraint, we propose a regularization method based on the well-known Elastic Weight Consolidation (EWC, Kirkpatrick et al., 2017) technique.
Elastic Weight Consolidation (EWC). EWC utilizes a quadratic term to elastically regularize parameters important for previous tasks. The loss function of using the EWC regularization together with exemplar replay for task t can be written as:

L(θ_t) = L_ER + λ ∑_{i=1}^{N} F_i (θ_{t,i} − θ_{t−1,i})²,   (7)

where N is the number of model parameters; θ_{t−1,i} is the i-th converged parameter of the model trained up to the previous task; and F_i is the i-th diagonal element of the Fisher Information Matrix approximated w.r.t. the set of previous exemplars E_{1:t−1}. F_i measures the importance of θ_{t−1,i} to the previous tasks represented by E_{1:t−1}. Typical usages of EWC compute F_i w.r.t. a uniformly sampled subset of the historical data; instead, we compute F_i w.r.t. the carefully chosen E_{1:t−1}, so that not all historical data need to be stored. The scalar λ controls the contribution of the quadratic regularization term. The idea is to elastically penalize changes to parameters that are important (large F_i) for previous tasks, while assigning more plasticity to parameters with small F_i.
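The Fisher diagonal and the quadratic penalty can be sketched as follows. This is a pure-Python toy in which per-exemplar gradients are supplied directly; in practice they come from back-propagating the log-likelihood of each stored exemplar through f_θ:

```python
def fisher_diagonal(per_example_grads):
    """Diagonal Fisher approximation: mean squared gradient of the
    log-likelihood over the stored exemplars.

    per_example_grads: list of gradient vectors, one per exemplar.
    """
    n = len(per_example_grads)
    dim = len(per_example_grads[0])
    return [sum(g[i] ** 2 for g in per_example_grads) / n for i in range(dim)]

def ewc_penalty(theta, theta_prev, fisher, lam):
    """Quadratic EWC term: lam * sum_i F_i * (theta_i - theta_prev_i)^2."""
    return lam * sum(f * (t - tp) ** 2
                     for f, t, tp in zip(fisher, theta, theta_prev))

# Toy: parameter 0 has large squared gradients on the exemplars (important
# to old tasks), so moving it is penalized more than moving parameter 1.
grads = [[1.0, 0.5], [-1.0, 0.5]]
F = fisher_diagonal(grads)                                  # [1.0, 0.25]
theta_prev = [1.0, 1.0]
move_important = ewc_penalty([2.0, 1.0], theta_prev, F, lam=1.0)
move_plastic = ewc_penalty([1.0, 2.0], theta_prev, F, lam=1.0)
assert move_important > move_plastic
```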
Adaptive regularization. In practice, new tasks differ in difficulty and in similarity to previous tasks, so the degree to which previous knowledge needs to be preserved varies. To this end, we propose an adaptive weight λ for the EWC regularization term:

λ = λ_base · V_{1:t−1} / V_t,   (8)

where V_{1:t−1} is the old word-vocabulary size over the previous tasks, V_t is the new word-vocabulary size of the current task t, and λ_base is a hyper-parameter. In general, λ increases with the ratio of the old vocabulary size to the new one. In other words, the regularization term becomes more important when the new task contains less new vocabulary to learn.
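In code this adaptive weight is a one-liner. The proportional form below is an assumption consistent with the description (λ grows with the old-to-new vocabulary ratio):

```python
def adaptive_lambda(old_vocab, new_vocab, lam_base):
    """Adaptive EWC weight (sketch): scale lam_base by the ratio of the
    old vocabulary size to the current task's vocabulary size. The exact
    functional form is assumed, not taken verbatim from the text."""
    return lam_base * len(old_vocab) / len(new_vocab)

# A new task with little new vocabulary gets a stronger constraint than
# one that introduces many unseen words.
lam_similar = adaptive_lambda(set(range(1000)), set(range(100)), lam_base=0.1)
lam_novel = adaptive_lambda(set(range(1000)), set(range(2000)), lam_base=0.1)
assert lam_similar > lam_novel
```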
Algorithm 2 summarizes the continual learning procedure of ARPER for task t. θ t is initialized with θ t−1 , and it is trained with prioritized exemplar replay and adaptive EWC in Eq. (7). After training θ t , exemplars E t of task t are computed by Algorithm 1, and exemplars in previous tasks are reduced by keeping the most prioritized ones to preserve the total exemplar size.

Dataset
We use the MultiWoZ-2.0 dataset (Budzianowski et al., 2018), which contains six domains (Attraction, Hotel, Restaurant, Booking, Taxi, and Train) and seven DA intents ("Inform, Request, Select, Recommend, Book, Offer-Booked, No-Offer"). The original train/validation/test splits are used. For methods using exemplars, both the training and validation sets are continually expanded with exemplars extracted from previous tasks.
To support experiments on continually learning new domains, we pre-processed the original dataset by segmenting multi-domain utterances into single-domain ones. For instance, the utterance "The ADC Theatre is located on Park Street. Before I find your train, could you tell me where you would like to go?" is split into two utterances with domains "Attraction" and "Train", respectively. If the original utterance contains multiple sentences from the same domain, they are kept in one utterance after pre-processing. In each continual learning task, all training data of one domain are used to train the NLG model, as illustrated in Figure 1. Similar pre-processing is done at the granularity of DA intents for the experiments in Section 4.6. The statistics of the pre-processed MultiWoZ-2.0 dataset are shown in Figure 2. The resulting datasets and the pre-processing scripts are open-sourced.

Evaluation Metrics
Following previous studies, we use the slot error rate (SER) and the BLEU-4 score (Papineni et al., 2002) as evaluation metrics. SER is the ratio of the number of missing and redundant slots in a generated utterance to the total number of ground truth slots in the DA.
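As a concrete reading of the SER definition, here is a minimal sketch in which slot matching is simplified to set membership over slot names (actual evaluation also checks slot values):

```python
def slot_error_rate(generated_slots, reference_slots):
    """Slot error rate for one utterance:
    (missing + redundant) / total ground-truth slots in the DA."""
    missing = len(set(reference_slots) - set(generated_slots))
    redundant = len(set(generated_slots) - set(reference_slots))
    return (missing + redundant) / len(reference_slots)

# The reference DA has 2 slots; the output misses 'food' and
# hallucinates 'area', so SER = (1 + 1) / 2 = 1.0.
assert slot_error_rate({'name', 'area'}, {'name', 'food'}) == 1.0
```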
To better evaluate continual learning ability, we use two additional commonly used metrics (Kemker et al., 2018), each computed for both SER and BLEU-4:

Ω_all = (1/T) ∑_{i=1}^{T} Ω_all,i,    Ω_first = (1/T) ∑_{i=1}^{T} Ω_first,i,

where T is the total number of continual learning tasks; Ω_all,i is the test performance on all tasks seen so far after the i-th task has been learned; and Ω_first,i is the test performance on the first task after the i-th task has been learned.
Since Ω can be either SER or BLEU-4, both Ω_all and Ω_first have two versions. Ω_all evaluates the overall performance, while Ω_first evaluates the ability to alleviate catastrophic forgetting.
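The two summary metrics can be sketched as plain averages over the per-step measurements (simple averaging is assumed here; Kemker et al., 2018 additionally normalize by an offline baseline):

```python
def omega_metrics(perf_all, perf_first):
    """Summarize a continual-learning run.

    perf_all[i]:   performance on all tasks seen so far, measured after
                   learning task i+1 (this is Omega_{all,i}).
    perf_first[i]: performance on the first task at the same point.
    """
    T = len(perf_all)
    return sum(perf_all) / T, sum(perf_first) / T

# Illustrative SER values (in %) after each of three tasks: overall SER
# degrades as tasks accumulate; SER on the first task creeps up too.
omega_all, omega_first = omega_metrics([3.0, 5.0, 7.0], [3.0, 4.0, 5.0])
assert omega_all == 5.0 and omega_first == 4.0
```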

Baselines
Two methods without exemplars are as below:
• Finetune: At each task, the NLG model is initialized with the model obtained till the last task, and then fine-tuned with the data from the current task.
• Full: At each task, the NLG model is trained with the data from the current and all historical tasks. This is the "upper bound" performance for continual learning w.r.t. Ω all .
Several exemplar replay (ER) methods trained with Eq. (4) using different exemplar selection schemes are compared:
• ER_herding (Welling, 2009; Rebuffi et al., 2017): This scheme chooses exemplars that best approximate the mean DA vector over all training examples of the task.
• ER random : This scheme selects exemplars at random. Despite its simplicity, the distribution of the selected exemplars is the same as the distribution of the current task in expectation.
• ER prio : The proposed prioritized scheme (c.f. Algorithm 1) to select representative and diverse exemplars.
Based on ER_prio, four regularization methods (including ours) to further alleviate catastrophic forgetting are compared:
• KD: Knowledge distillation (Hinton et al., 2015), which penalizes prediction changes on the vocabulary specific to earlier tasks (implementation details in Appendix A.1).
• Dropout: Dropout applied to the base NLG model as an implicit regularizer.
• L2: A static L2 regularization obtained by setting F_i = 1 in Eq. (7). It regularizes all parameters equally.
• ARPER (c.f. Algorithm 2): The proposed method using adaptive EWC with ER_prio.
We utilize the well-recognized semantically-conditioned LSTM (SCLSTM; Wen et al., 2015c) as the base NLG model f_θ, with one hidden layer of size 128. Dropout is not used by default, since it is evaluated as a separate regularization technique (c.f. ER_prio+Dropout). For all the above methods, the learning rate of Adam is set to 5e-3, the batch size is 128, and the maximum number of training epochs per task is 100. Early stopping is adopted to avoid over-fitting when the validation loss does not decrease for 10 consecutive epochs. To fairly compare the different methods, they are trained with an identical configuration on the first task so that they share a consistent starting point. Hyper-parameters of the different methods are included in Appendix A.1.

Diagnose Forgetting in NLG
Before proceeding to our main results, we first diagnose whether the catastrophic forgetting issue exists when training an NLG model continually. As an example, a model pre-trained on the "Attraction" domain is continually trained on the "Train" domain. We present test performance on "Attraction" at different epochs in Figure 3 with 250 exemplars.
We can observe: (1) catastrophic forgetting indeed exists as indicated by the sharp performance drop of Finetune; (2) replaying carefully chosen exemplars helps to alleviate catastrophic forgetting by a large degree, and ER prio does a better job than ER random ; and (3) ARPER greatly mitigates catastrophic forgetting by achieving similar or even better performance compared to Full.

Continual Learning New Domains
In this experiment, the data from six domains are presented sequentially. We test 6 runs with different domain-order permutations: each domain is selected once as the first task, and the remaining five domains are randomly ordered. Results averaged over the 6 runs using 250 and 500 total exemplars are presented in Table 1. Several interesting observations can be made:
• All methods except Finetune perform worse on all seen tasks (Ω_all) than on the first task (Ω_first). This is due to the diverse knowledge across tasks, which increases the difficulty of handling all of them. Finetune performs poorly on both metrics because of the detrimental catastrophic forgetting issue.
• Replaying exemplars helps to alleviate the catastrophic forgetting issue: the three ER methods substantially outperform Finetune. Moreover, the proposed prioritized exemplar selection scheme is effective, as indicated by the superior performance of ER_prio over ER_herding and ER_random.
• ARPER significantly outperforms the three ER methods and the other regularization-based baselines. Compared to the three closest competitors, ARPER is significantly better with p-value < 0.05 w.r.t. SER.
• The improvement margin of ARPER is most pronounced w.r.t. SER, which is critical for measuring an output's fidelity to a given dialog act. The different methods show similar performance w.r.t. BLEU-4, where several of them approach Full and are thus very close to the upper-bound performance.
• ARPER achieves performance comparable to the upper bound (Full) on all seen tasks (Ω_all), even with a very limited number of exemplars. Moreover, it outperforms Full on the first task (Ω_first), indicating that ARPER mitigates forgetting of the first task better than Full, which is still interfered with by data from later domains.

Dynamic Results in Continual Learning
In Figure 4, several representative methods are compared as more domains are continually learned. With more tasks continually learned, ARPER performs consistently better than other methods on all seen tasks (solid lines), and it is comparable to Full. On the first task (dashed lines), ARPER outperforms all the methods, including Full, at every continual learning step. These results illustrate the advantage of ARPER through the entire continual learning process.

Continual Learning New DA Intents
It is also essential for a task-oriented dialog system to continually learn new functionalities, namely, supporting new DA intents. To test this ability, the data of seven DA intents are presented sequentially in order of decreasing data size, i.e., "Inform, Request, Book, Recommend, Offer-Booked, No-Offer, Select". Results using 250 exemplars are presented in Table 2. ARPER still largely outperforms the other methods, and observations similar to those above hold, so we conclude that ARPER is able to continually learn new functionalities. Compared to the previous experiments, the performance of ER_prio+KD degrades while that of ER_prio+L2 improves, due to the very large data size of the first task ("Inform"); this indicates that both baselines are sensitive to task order.

Ablation Study
In Table 3, we compare several simplified versions of ARPER to understand the effects of its components. Comparisons are based on continually learning the 6 domains starting with "Attraction". We observe that: (1) L_ER is beneficial, because dropping it ("w/o ER") degrades the performance of ARPER.
(2) Using prioritized exemplars is advantageous, because using random exemplars ("w/o PE") impairs ARPER's performance. (3) The adaptive regularization is also effective, as indicated by the superior performance of ARPER compared to using a fixed regularization weight ("w/o AR").

Table 5: SER in % using SCVAE and GPT-2 as f_θ. Best performance excluding "Full" in bold.

Table 4 shows two examples generated by ARPER and its closest competitor (ER_prio+Dropout) on the first domain ("Attraction") after the NLG model has been continually trained on all 6 domains starting with "Attraction". In both examples, ER_prio+Dropout fails to generate the slot "Fee" or "Type"; instead, it mistakenly generates slots belonging to later domains ("Hotel" or "Restaurant"), with several obvious redundant repetitions colored in purple. This means the NLG model is interfered with by utterance patterns of later domains and forgets some old patterns it has learned before. In contrast, ARPER succeeds in both cases without forgetting previously learned patterns.

Results using Other NLG Models
In this experiment, we change the base NLG model from SCLSTM to SCVAE (Tseng et al., 2018) and GPT-2 (Radford et al., 2019). For GPT-2, we use the pre-trained model with 12 layers and 117M parameters. Following Peng et al. (2020), exact slot values are not replaced by special placeholders during training, unlike in SCLSTM and SCVAE. The dialog act is concatenated with the corresponding utterance before being fed into GPT-2. More details are included in Appendix A.1. Results of using 250 exemplars to continually learn the 6 domains starting with "Attraction" are presented in Table 5. Thanks to large-scale pre-training, GPT-2 suffers less from catastrophic forgetting, as indicated by the better performance of Finetune. In general, the relative performance patterns of the different methods are similar to those observed in Sections 4.5 and 4.6. We therefore conclude that the superior performance of ARPER generalizes to different base NLG models.

Conclusion
In this paper, we study the practical continual learning setting of language generation in task-oriented dialog systems. To alleviate catastrophic forgetting, we present ARPER, which replays representative and diverse exemplars selected in a prioritized manner and employs an adaptive regularization term based on EWC (Elastic Weight Consolidation). Extensive experiments on MultiWoZ-2.0 in different continual learning scenarios reveal the superior performance of ARPER. The realistic continual learning setting and the proposed technique may inspire further studies towards building more scalable task-oriented dialog systems.

A.1 Model Details and Hyper-parameters
We first elaborate the implementation details of the knowledge distillation (KD) baseline compared in our paper. We use the loss term:

L_KD = −(1/K) ∑_{k=1}^{K} ∑_{j=1}^{|L|} p̂_{k,j} log p_{k,j},

where L is the vocabulary that appears in previous tasks but not in task t. At each position k of Y, [p̂_{k,1}, ..., p̂_{k,|L|}] is the predicted distribution over L given by f_{θ_{t−1}}, and [p_{k,1}, ..., p_{k,|L|}] is the distribution given by f_{θ_t}. L_KD penalizes prediction changes on the vocabulary specific to earlier tasks. For all {Y, d} ∈ D_t ∪ E_{1:t−1}, L_KD is linearly interpolated with L_ER as L_ER + η · L_KD, with η tuned as a hyper-parameter.
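A minimal sketch of this distillation term, with the teacher (f_{θ_{t−1}}) and student (f_{θ_t}) distributions over L passed in as plain lists rather than computed by real models:

```python
import math

def kd_loss(teacher_dists, student_dists):
    """Distillation over the old-task-specific vocabulary L: average
    cross-entropy between the previous model's distribution (teacher)
    and the current model's (student) at each of the K positions.

    Each element of teacher_dists / student_dists is a probability
    distribution over L (a list of |L| floats).
    """
    K = len(teacher_dists)
    loss = 0.0
    for t_dist, s_dist in zip(teacher_dists, student_dists):
        loss += -sum(t * math.log(s) for t, s in zip(t_dist, s_dist))
    return loss / K

# By Gibbs' inequality, matching the teacher minimizes the loss; a student
# that drifts away from the teacher on L pays a larger penalty.
teacher = [[0.7, 0.3], [0.5, 0.5]]
assert kd_loss(teacher, teacher) < kd_loss(teacher, [[0.3, 0.7], [0.9, 0.1]])
```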
Hyper-parameters of SCVAE reported in Section 4.9 are set by default according to https://github.com/andy194673/nlg-scvae, except that the learning rate is set to 2e-3. For GPT-2, we use the implementation pipeline from https://github.com/pengbaolin/SC-GPT. We pre-process the dialog act d into the format d = [I (s_1 = v_1, ..., s_p = v_p)], and the corresponding utterance Y is wrapped with a special start token [BOS] and an end token [EOS]. The pre-processed d and Y are concatenated before being fed into GPT-2. The learning rate of the Adam optimizer is set to 5e-5 without weight decay. As GPT-2 converges faster, we train for a maximum of 10 epochs per task, with early stopping after 3 consecutive epochs without validation improvement.
Hyper-parameters of different methods are tuned to maximize SER all using grid search, and the optimal settings of different methods in various experiments are summarized in Table 6.

A.2 Domain Order Permutations
In Table 7, we provide the exact domain order permutations of the 6 runs used in the experiments in Table 1 and Figure 4.

A.3 Computation Resource
All experiments are conducted using a single GPU (GeForce GTX TITAN X). In Table 8, we compare the average training time of one epoch for the different methods. The temperature in the KD baseline (Hinton et al., 2015) is set to 1 due to its minimal impact in our experiments.

B.1 Comparison to Pseudo Exemplar Replay
Instead of storing raw samples as exemplars, Shin et al. (2017); Riemer et al. (2019) generate "pseudo" samples akin to past data; the NLG model itself can generate such pseudo exemplars. In this experiment, we replace the 500 raw exemplars of ER_random, ER_prio, and ARPER with pseudo samples generated by the continually trained NLG model, using the dialog acts of the same raw exemplars as input. Results comparing pseudo and raw exemplars when continually learning the 6 domains starting with "Attraction" are shown in Table 9.

Table 9: Pseudo vs. raw exemplars (SER in %, BLEU-4).

Method              Ω_all SER%   Ω_all BLEU-4   Ω_first SER%   Ω_first BLEU-4
ER_random           9.82         0.495          8.64           0.405
Pseudo-ER_random    9.26         0.551          6.88           0.519
ER_prio             7.84         0.573          6.20           0.523
Pseudo-ER_prio      8.87         0.557          6.37           0.521
ARPER               4.43         0.597          3.40           0.574
Pseudo-ARPER        5.07         0.590          3.51           0.570

We can see that pseudo exemplars perform better for ER_random but worse for ER_prio and ARPER. This suggests that pseudo exemplars help when exemplars are chosen randomly, while carefully chosen exemplars (c.f. Algorithm 1) outperform pseudo ones. Exploring pseudo exemplars for NLG is orthogonal to our work and is left for future research.

Figure 5: A visualization of the change of SCLSTM's hidden-layer weights between consecutive tasks for ARPER (top) and ER_prio+Dropout (bottom). Two sample task transitions (from "Attraction" to "Train", and from "Train" to "Hotel") are shown; high-temperature areas of ARPER are highlighted with red bounding boxes.

B.2 Flow of Parameters Update
To further understand the superior performance of ARPER, we investigate how parameters are updated throughout the continual learning process. Specifically, we compare SCLSTM's hidden-layer weights obtained from consecutive tasks; the pairwise L_1 difference for two sample transitions is shown in Figure 5. We observe that ER_prio+Dropout tends to update almost all parameters, while ARPER updates only a small fraction of them. Furthermore, ARPER has different sets of important parameters for distinct tasks, indicated by the different high-temperature areas in the weight-update heat maps, whereas the parameters of ER_prio+Dropout appear to be updated uniformly across task transitions. These observations verify that ARPER indeed elastically allocates different network parameters to distinct NLG tasks to mitigate catastrophic forgetting.