A Progressive Model to Enable Continual Learning for Semantic Slot Filling

Semantic slot filling is one of the major tasks in spoken language understanding (SLU). After a slot filling model is trained on precollected data, it is crucial to keep improving the model after deployment so that it learns users’ new expressions. As the amount of data grows, it becomes infeasible either to store all of it and repeatedly retrain the model on the full set, or to fine tune the model on new data alone without forgetting old expressions. In this paper, we introduce a novel progressive slot filling model, ProgModel. ProgModel uses a novel context gate that transfers previously learned knowledge to a small expanded component while enabling this new component to be trained quickly on new data. As such, ProgModel learns new knowledge using only the new data each time while preserving the previously learned expressions. Our experiments show that ProgModel needs much less training time and a smaller model size to outperform various model fine tuning competitors by up to 4.24% and 3.03% on two benchmark datasets.


Introduction
Spoken language understanding (SLU) systems play a vital role in ubiquitous artificially intelligent voice-enabled personal assistants. As one of the major tasks in SLU, semantic slot filling is treated as a sequential labeling problem that maps a natural language sequence x to a slot label sequence y of the same length in IOB format (Yao et al., 2014). Typically, a slot filling model is trained offline on large scale corpora of pre-collected utterances. However, such corpora usually cannot exhaustively cover all possible varieties of utterances (e.g., personalized expressions, new vocabulary, utterances for new intents, etc.) from diverse users. Thus, it is highly desirable to develop a slot filling approach capable of continual learning after a personal assistant is deployed.
* This work was done when Xiangyu Zeng was in Samsung Research.
Unfortunately, existing approaches target offline model training on large scale training data. They are designed to train a slot filling model either independently (Yao et al., 2014; Peng et al., 2015; Kurata et al., 2016; Hakkani-Tür et al., 2016; Liu and Lane, 2016; Deng et al., 2019; Ray et al., 2019) or jointly with the other SLU task of intent detection (Guo et al., 2014; Liu and Lane, 2016; Zhang and Wang, 2016; Wang et al., 2018; Goo et al., 2018). Recently, Shen et al. (2018a) developed cold start algorithms that generate training data in the hope of covering more varieties before deployment. On the other hand, Ray et al. (2018) and Shen et al. (2018b) attempt to personalize the slot filling model. However, these approaches are still restricted to offline training and cannot be applied to learn users' new expressions after deployment.
To support continual learning, a naive solution is to retrain the current model each time new data arrives. However, this suffers from several drawbacks. First, to maintain SLU performance on both original and new expressions, it usually requires retraining the model on almost the whole dataset. As the training set grows, it becomes infeasible to repeatedly conduct such time-consuming retraining. More importantly, the old training data typically is not stored permanently because of its huge storage needs and privacy protection. If only fine tuned on new utterances, the new model tends to lose the previously learned knowledge, a.k.a. catastrophic forgetting (French, 1999). This problem has been studied in computer vision (Li and Hoiem, 2016; Lee et al., 2017), yet it remains open in spoken language understanding systems.
Figure 1: At each batch t, the last layers of the base model and previous components (dotted lines) are only used for inference. Only the output of $M_t$ is used to guide the training.
In this paper, we consider a practical setting in which a batch¹ of new training data $U_t$ becomes available at each batch $t$.
Our goal is to enable continual learning in a slot filling model, such that it keeps learning new utterances efficiently while remembering old knowledge without needing access to old training data. To achieve this, we design a novel Progressive Slot Filling Model (ProgModel) that is gradually expanded at each batch, using a novel context gate for knowledge transfer. Unlike the baseline that repeatedly retrains the same model, ProgModel keeps the previously trained components untouched, so that catastrophic forgetting can be largely avoided. Using the transferred knowledge, each newly expanded component in ProgModel is trained in a progressive manner to achieve better performance with faster training than baseline model retraining approaches.

Proposed Approach

Progressive Model (ProgModel)
As the name indicates, the main idea of our proposed ProgModel is to progressively expand the model by transferring existing knowledge from the current model. Thus, ProgModel can continually enhance its capability of understanding users' new expressions without catastrophic forgetting. This is motivated by the recent success of progressive neural networks in various applications (Rusu et al., 2016).
As shown in Figure 1, ProgModel consists of the following components:
(1) Expanded Components: The current model is expanded via context-gated knowledge transfer so that training uses only the new batch training set $U_t$ at each batch $t$.
(2) Inference Decision Engine: When multiple outputs are received from the base model and the expanded components, the decision engine derives the slot filling label output without additional training.
¹ To avoid confusion with the widely used "timestamp" in NLP (meaning each word), we use "batch" in our paper.

Expanded Component $M_t$
At each batch $t$, a new component $M_t$ (Figure 1 (right)) is expanded on the base model $M_0$ and the previously expanded components $M_1, \ldots, M_{t-1}$, denoted as $M_{<t}$. The utterance $\mathrm{BiLSTM}_t$ is learned from scratch at each batch $t$ so that it can learn the new sentence structures via word sequence correlations. Next, we focus on designing two knowledge transfer mechanisms (green parts in Figure 1) to maximally leverage the previously learned knowledge in $M_{<t}$.
Word Embeddings Transfer: Word embeddings in the newly expanded component $M_t$ are initialized using those in $M_0$, based on the assumption that $M_0$ covers most of the vocabulary. For a new word $w$, we initialize using its GloVe embedding. The embeddings are fine tuned while training $M_t$.
[...] projection matrix $V^t$ shared for each word: [...] Then, the context vector $c_i^t$ for the $i$-th word in expanded component $M_t$ is given as: [...] (Liu and Lane, 2016). In ProgModel, each model $M_t$ has an independent $g^t$, which is initialized by $g^0$ in $M_0$. Thus, $\boldsymbol{\alpha}_i^t$ is fine tuned from $\boldsymbol{\alpha}_i^0$ during the training of $M_t$ via the updated $g^t$ and [...].
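The embedding initialization described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name, the dictionary-based vocabulary and GloVe lookups, and the random fallback for words found in neither source are all assumptions introduced here.

```python
import numpy as np

def init_component_embeddings(new_vocab, base_vocab, base_emb, glove, dim, seed=0):
    """Build the initial embedding matrix for a newly expanded component.

    new_vocab : list of words seen in the new batch of training data
    base_vocab: dict word -> row index in base_emb (vocabulary of the base model)
    base_emb  : array of shape (|V_0|, dim), trained base-model embeddings
    glove     : dict word -> array of shape (dim,), a hypothetical GloVe lookup
    """
    rng = np.random.default_rng(seed)
    emb = np.empty((len(new_vocab), dim), dtype=np.float32)
    for row, word in enumerate(new_vocab):
        if word in base_vocab:
            # Known word: copy the embedding already learned by the base model.
            emb[row] = base_emb[base_vocab[word]]
        elif word in glove:
            # New word: fall back to the pretrained GloVe vector.
            emb[row] = glove[word]
        else:
            # Assumption (not from the paper): small random init otherwise.
            emb[row] = rng.normal(0.0, 0.1, size=dim)
    return emb

base_vocab = {"flight": 0, "to": 1}
base_emb = np.array([[1.0, 1.0], [2.0, 2.0]], dtype=np.float32)
glove = {"uber": np.array([3.0, 3.0], dtype=np.float32)}
emb = init_component_embeddings(["to", "uber", "zzz"], base_vocab, base_emb, glove, dim=2)
```

The returned matrix is only a starting point; in the described approach the embeddings are then fine tuned together with the rest of the new component.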

Inference Decision Engine
We design the inference decision engine (IDE) as a separate, non-trainable component to avoid potential catastrophic forgetting. Thanks to the knowledge transfer in ProgModel, $M_t$ already retains much of the previously learned knowledge and gives good label predictions in many cases. Thus, we consider two types of decision engines: (1) t-IDE: uses only the output of $M_t$; (2) c-IDE: for the $i$-th word, combines the outputs from all components as $\sum_{k=0}^{t} P^k(i)\, I^k(i)$, where $I^k(i)$ is an indicator function that is 1 when the $i$-th word is in the vocabulary of $M_k$ and 0 otherwise. The label with maximum probability is selected.
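A small sketch of the two decision engines may help make the difference concrete. It assumes each component's output is available as a NumPy array of per-word label probabilities and that vocabulary membership is given as boolean masks; the function names `t_ide` and `c_ide` and the array layout are illustrative choices, not taken from the paper.

```python
import numpy as np

def t_ide(probs):
    """Use only the newest component's output.

    probs: list of arrays of shape (n_words, n_labels), ordered from the
           base model to the most recently expanded component.
    """
    return probs[-1].argmax(axis=1)

def c_ide(probs, in_vocab):
    """Sum the per-component distributions, masked by vocabulary membership.

    in_vocab: list of boolean arrays of shape (n_words,); in_vocab[k][i] is
              True when word i is in the vocabulary of component k.
    """
    combined = np.zeros_like(probs[0], dtype=float)
    for p_k, i_k in zip(probs, in_vocab):
        combined += p_k * i_k[:, None]  # a component is ignored for unknown words
    return combined.argmax(axis=1)

probs = [
    np.array([[0.9, 0.1], [0.4, 0.6]]),    # base model
    np.array([[0.3, 0.7], [0.55, 0.45]]),  # newest component
]
in_vocab = [np.array([True, True]), np.array([True, False])]
t_labels = t_ide(probs)
c_labels = c_ide(probs, in_vocab)
```

In this toy input the two engines disagree on both words, which shows how a confident older component, or a word missing from the newest component's vocabulary, can change the combined decision. Neither function has trainable parameters, in line with the engine being non-trainable.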

Progressive Training
The training procedure is conducted progressively at each batch $t$. The first step is to train the base model $M_0$ using the loss function $\mathcal{L}_0$: [...] where $\boldsymbol{\theta}_0$ are the parameters in $M_0$; $|S|$ is the number of semantic slots in IOB format; and $n$ is the sequence length. At each batch $t$, we train the expanded component $M_t$ while fixing the parameters $\boldsymbol{\theta}_{<t}$ in the previous components. The loss is backpropagated from the output of $M_t$ using the loss function $\mathcal{L}_t$: [...] where $\boldsymbol{\theta}_t$ and $\boldsymbol{\phi}_{t-1}^{t}$ are the parameters in model $M_t$ and in the context gate between $M_{t-1}$ and $M_t$. In both loss functions, $P_j^t(i)$ is the output probability of slot $j$ for the $i$-th word from $M_t$.
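For concreteness, one common form that such losses take is token-level cross-entropy. The sketch below assumes one-hot gold labels $y_j(i)$, a symbol not defined in the text, and is not necessarily the exact loss used here; it only uses the quantities described above (sequence length $n$, slot count $|S|$, and output probabilities for slot $j$ at word $i$).

```latex
\begin{align}
\mathcal{L}_0(\boldsymbol{\theta}_0)
  &= -\sum_{i=1}^{n} \sum_{j=1}^{|S|} y_j(i)\, \log P_j^{0}(i) \\
\mathcal{L}_t(\boldsymbol{\theta}_t, \boldsymbol{\phi}_{t-1}^{t})
  &= -\sum_{i=1}^{n} \sum_{j=1}^{|S|} y_j(i)\, \log P_j^{t}(i)
\end{align}
```

Under this form, the second loss depends only on the parameters of the new component and its context gate, so the gradient never updates the parameters of earlier components, which matches the description of fixing them during training.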

Datasets & Settings
Dataset: We evaluate ProgModel on the following two benchmark datasets: ATIS (Airline Travel Information Systems) dataset (Hemphill et al., 1990): a widely used dataset in SLU research. The training set contains 4,978 utterances from the ATIS-2 and ATIS-3 corpora, and the testing set contains 893 utterances from the ATIS-3 data sets. There are 127 distinct slot labels. We do not use the intent labels in ATIS.
Snips dataset (Snips, 2017): another NLU dataset (custom-intent-engines) collected by Snips for model evaluation. It contains 7 domains. In each domain, the training set contains 1,800 to 2,000 utterances and the testing set contains around 100 utterances. Since each domain in Snips contains completely different slots and very little vocabulary is shared between the domains, we evaluate on each domain independently.
We use Amazon Mechanical Turk (MTurk) to split the training set of each dataset into non-overlapping groups (each domain in Snips is treated as a separate dataset). Based on the size of each dataset, we consider 5 groups in ATIS and 3 groups in each domain of the Snips dataset. Each turker is given 100 utterances from the training set of one dataset, as well as the number of groups $G$ for this dataset, and is asked to put the utterances into no more than $G$ groups based on their similarities. Finally, we review the grouped utterances from the turkers to further combine similar utterances and derive the final grouping of the whole dataset. For each dataset, we take the largest group of the training set as the base dataset. At each batch $t$, one of the leftover groups in the training set is randomly selected to be $U_t$. Table 1 shows the detailed data statistics.
Competitors: First, for reference only, we consider a performance upper bound baseline, i.e., training AttRNN on all available data at each batch $t$. We compare with the following competitor approaches:
(1) FT-AttRNN: fine tunes the current model using only the new training data $U_t$ at each batch $t$;
(2) FT-Lr-AttRNN: fine tunes the current model on the new training set $U_t$ using a lower learning rate (we use 0.3 times the base model learning rate, which gives the best performance);
(3) FT-Cp-AttRNN: copies the previous model and fine tunes the copy on the new training data $U_t$ at each batch $t$. During inference, FT-Cp-AttRNN uses both the t-IDE and c-IDE decision engines, and we report the one with better performance (F1 score).
We evaluate ProgModel with different inference engines: (1) t-ProgModel: ProgModel using only the output of $M_t$ as the decision engine; (2) c-ProgModel: ProgModel using the combined inference decision engine. All base models $M_0$ are trained using the state-of-the-art AttRNN model (Liu and Lane, 2016). For fair evaluation, we test both ProgModel and the competitors on the full standard testing sets.
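The batch construction described earlier in this section (the largest group becomes the base dataset, and the leftover groups arrive one per batch in random order) can be sketched as below. The function name, the list-of-lists representation of groups, and the fixed seed are assumptions made for illustration.

```python
import random

def make_batch_schedule(groups, seed=0):
    """Split grouped utterances into a base set and a sequence of new batches.

    groups: list of non-overlapping lists of utterances.
    Returns (base, batches): the largest group, and the remaining groups
    in a random order, one per batch.
    """
    largest = max(range(len(groups)), key=lambda g: len(groups[g]))
    base = groups[largest]
    rest = [g for idx, g in enumerate(groups) if idx != largest]
    random.Random(seed).shuffle(rest)  # random draw of leftover groups
    return base, rest

groups = [["a"], ["b", "c", "d"], ["e", "f"]]
base, batches = make_batch_schedule(groups)
```

Shuffling the leftover groups once is equivalent to drawing one remaining group at random at each batch without replacement.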
Training: We implemented ProgModel using TensorFlow 1.4.0 and conducted the experiments on an NVIDIA Tesla M40. At each batch $t$, we train all models until convergence. We observe that ProgModel takes around 10 epochs, owing to fewer parameters and the transferred knowledge in $M_t$, while AttRNN retraining usually needs 100 epochs and the various fine tuning competitors need around 30-50 epochs.
Table 2 and Table 3 show the slot filling F1 scores on the ATIS dataset and on each domain of the Snips dataset. The results show that ProgModel consistently outperforms the AttRNN fine tuning competitors in all domains, with an improvement of up to 4.24% in ATIS and 3.03% in Snips. As expected, ProgModel continuously improves its performance with more and more new batches of training data, even though it is only trained on new data at each batch. Among all competitors, FT-Cp-AttRNN achieves the closest performance to ProgModel, but at the cost of a much larger model size (shown in the Model Size Results section). In comparison, both FT-AttRNN and FT-Lr-AttRNN frequently suffer from catastrophic forgetting. The values in pink show that the performance of FT-AttRNN and FT-Cp-AttRNN drops by up to 3.82% and 5.38% respectively. As a result, their F1 scores are significantly reduced in the end. Finally, we observe that ProgModel is quite close to the upper bound performance (note that this is only for reference rather than comparison, since the upper bound assumes the availability of all training data while ProgModel does not).

Ablation Study
We further look into each competitor to better understand the advantage of our method. Since FT-AttRNN is only trained on new data, it is often overwhelmed by new knowledge and forgets the old knowledge. On the other hand, FT-Lr-AttRNN has difficulty learning new knowledge, since its small learning rate prevents it from escaping the local optimum. As a result, the performance of FT-Lr-AttRNN is even lower than that of FT-AttRNN most of the time. To make matters worse, the learning rate is very hard to tune at each batch. As we can see, it is non-trivial to achieve both goals: learning new knowledge and remembering old knowledge.
FT-Cp-AttRNN performs slightly better than FT-AttRNN and FT-Lr-AttRNN. FT-Cp-AttRNN can be treated as a naive solution that achieves both goals by duplicating almost the whole model again and again. However, in addition to its larger model size and longer training time, it still fails to transfer previous knowledge efficiently, which leads to catastrophic forgetting from time to time.
In comparison, ProgModel outperforms all above competitors since it provides a systematic mechanism to achieve both goals. The training of our designed context gate helps to determine which knowledge to transfer at each batch.
Finally, we observe that c-ProgModel performs better than t-ProgModel in ATIS. There are two reasons for this. First, the utterances in different groups of ATIS are structurally quite similar, so that c-IDE further enhances the correct slot label distribution [...]

Model Size Results
Table 4 reports the model size comparison between FT-AttRNN (FT-AttRNN and FT-Lr-AttRNN), FT-Cp-AttRNN and ProgModel (t-ProgModel and c-ProgModel). With more and more training data at each batch, the size of ProgModel grows significantly more slowly than that of FT-Cp-AttRNN, since the knowledge transfer in ProgModel avoids a full model copy. Thus, our approach also offers a better trade-off between model size and performance. The small fluctuation in the ProgModel size expansion is due to the different vocabulary sizes in each batch of training utterances: each expanded model $M_t$ only keeps the embeddings of the vocabulary in the new training data $U_t$. One may be concerned that ProgModel will become large over time. In practice, it is expanded only when the new data grows too large to be handled by the current model. Moreover, a new base model can be periodically reinitialized to reset the model size.

Conclusion
In this paper, we proposed a novel model, ProgModel, capable of efficient continual learning for semantic slot filling in SLU. ProgModel is designed to expand progressively at each batch of new training data, using a new context gate for knowledge transfer. The model can be trained progressively without needing to store old training data. We showed that ProgModel needs much shorter training time to significantly outperform baseline approaches and comes close to the upper bound performance.