Out-of-Task Training for Dialog State Tracking Models

Dialog state tracking (DST) suffers from severe data sparsity. While many natural language processing (NLP) tasks benefit from transfer learning and multi-task learning, in dialog these methods are limited by the amount of available data and by the specificity of dialog applications. In this work, we successfully utilize non-dialog data from unrelated NLP tasks to train dialog state trackers. This opens the door to the abundance of unrelated NLP corpora to mitigate the data sparsity issue inherent to DST.


Introduction
The role of the dialog state tracker in a task-oriented dialog system is to summarize the history of the conversation so far and extract the user goal. Dialog state tracking (DST) suffers extraordinarily from data sparsity. Collecting data for DST is expensive and time-consuming. Typically, conversations are either staged or collected in a Wizard-of-Oz style setup and annotated by hand, severely inhibiting data collection. The enormous number of possible dialog states further exacerbates the problem. Even if we combined all data that commercial assistants generate, there would still be realistic but unobserved dialog states.
Unsupervised learning, transfer learning and multi-task learning (MTL) (Caruana, 1997) in general help mitigate data sparsity. Unsupervised learning relies on predicting inherent characteristics of the data. Transfer learning exploits knowledge learned on related problems to generalize to new tasks. MTL optimizes towards solving multiple tasks at once for synergy effects. Typically, utilized datasets have related domains and tasks share objectives. In other words, these strategies are typically used to address the problem of adaptation.
Recent approaches to adaptation in NLP tasks rely on contextual models. The methods above have been applied to improve generalization across related tasks and datasets. For instance, Phang et al. (2018), Wang et al. (2019) and Pruksachatkun et al. (2020) facilitate transfer learning by intermediate task fine-tuning (ITFT) on tasks that are related to the target task. Peng et al. (2020) and Liu et al. (2019a) jointly optimize transformer-based models (Vaswani et al., 2017) towards multiple related tasks and/or domains. The latter apply MTL to pre-training rather than fine-tuning. Gururangan et al. (2020) report improvements by continuing unsupervised pre-training for domain/task adaptation. Raffel et al. (2019) and Keskar et al. (2019) propose model architectures that handle diverse tasks with a unified mechanism.
Natural language understanding (NLU) and dialog state tracking (DST) benefit from joint modeling via multi-task learning, as this utilizes dialog data more efficiently. Recent approaches view DST as a generative problem (Wu et al., 2019; Ren et al., 2019) or as a reading comprehension problem (Gao et al., 2019; Chao and Lane, 2019) utilizing contextual models. In the latter, span prediction or sequence tagging extracts relevant information directly from the input. These mechanisms utilize training data more efficiently than early approaches that relied on exhaustive classification given a list of known concepts (Liu and Lane, 2017; Zhong et al., 2018). Pre-training on multiple dialog datasets has been proposed to support subsequent fine-tuning towards specific dialog modeling tasks such as DST (Wu et al., 2020). Synergies between DST subtasks can be exploited via MTL (Rastogi et al., 2019). Better generalization across slots is attempted via knowledge transfer and zero-shot learning (Rastogi et al., 2020). All of the above approaches are suitable to better utilize available dialog data, but the general issue of data sparsity persists. There likely will never be enough task-specific data to train dialog models to their full potential. Instead of resorting to the limited quantities of such data, we propose to utilize non-dialog data from unrelated tasks for the training of DST models. For this we explore two strategies: (1) in a sequential transfer learning approach, we first train a model to solve an unrelated task, followed by training towards solving DST; (2) we use MTL to jointly optimize towards DST and an unrelated task. We call our overall approach out-of-task training for DST.

Figure 1: Schematics of our proposed out-of-task training schemes. Blue dotted components are used for training on the auxiliary tasks only. Green solid components are used for DST training only.
With our methods, we achieve new state-of-the-art performance on all four target datasets. We show that even small amounts of auxiliary task data are beneficial to support model training, especially with MTL, which particularly improves performance on difficult tasks. Our positive experimental results open the door to the abundance of unrelated NLP corpora defined over a wide range of non-dialog tasks to mitigate the issue of data sparsity in DST.
2 Out-of-Task Training for DST

Dialog State Tracking
The task of DST is to extract meaning and intent from the user input, and to keep and update this information over the continuation of a dialog (Young et al., 2010). A restaurant recommender, for example, needs to know user preferences such as price, location, etc. These concepts are defined by an ontology in terms of domains (e.g., restaurant), slots (e.g., price range), and values (e.g., expensive). We utilize TripPy, our publicly available DST model with state-of-the-art performance on a range of datasets. 1 The details of this model are described in Heck et al. (2020). We briefly describe the aspects relevant to this work.
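For illustration, an ontology can be thought of as a mapping from domain-slot pairs to candidate values; the following fragment is a simplified example and not the full ontology of any of the datasets we use.

```python
# Illustrative restaurant-domain ontology fragment: domain-slot pairs mapped to
# example values. Real ontologies contain many more slots and values.
ONTOLOGY = {
    "restaurant-pricerange": ["cheap", "moderate", "expensive"],
    "restaurant-area": ["centre", "north", "south", "east", "west"],
    "restaurant-food": ["italian", "chinese", "indian"],  # open vocabulary in practice
}
```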
TripPy encodes the current dialog context using a transformer model. First, the model determines at each turn whether any of the known domain-slot pairs is present. This is done via slot gates, which either predict that a slot can be filled via a copy mechanism or that it takes a special value (none, dontcare, or true/false). There are three copy mechanisms in TripPy: span prediction and two types of memory lookup. Slot gates and span prediction are realized as classification heads on top of the contextual encoder. The model we use in this work is a modification of the original, as we use RoBERTa (Liu et al., 2019b) as encoder instead of BERT (Devlin et al., 2018). We motivate this by the fact that BERT's distinction of segments has little applicability in dialog. When approaching DST as a reading comprehension task, system and user utterances may take on both the roles of query and response. The overall performance of this DST model depends on the individual performance of the contextual encoder, the slot gates and span prediction, i.e., any of these parts could potentially benefit from out-of-task training.
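The following is a minimal, simplified sketch of such an architecture: per-slot gate heads and span heads on top of a shared RoBERTa encoder. The class name, the number of gate labels, and the omission of the memory-lookup mechanisms are simplifications for illustration, not the actual TripPy implementation (see Heck et al., 2020).

```python
import torch.nn as nn
from transformers import RobertaModel


class TripPyStyleDST(nn.Module):
    """Sketch of a TripPy-style tracker: shared RoBERTa encoder with per-slot
    gate heads (sequence level) and span heads (token level)."""

    def __init__(self, slots, num_gate_labels=7, model_name="roberta-base"):
        super().__init__()
        self.encoder = RobertaModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        # One gate head and one span head per domain-slot pair.
        self.gate_heads = nn.ModuleDict({s: nn.Linear(hidden, num_gate_labels) for s in slots})
        self.span_heads = nn.ModuleDict({s: nn.Linear(hidden, 2) for s in slots})  # start/end logits

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        seq_repr = out.last_hidden_state[:, 0]  # sequence-level representation (<s> token)
        tok_repr = out.last_hidden_state        # token-level representations
        gate_logits = {s: head(seq_repr) for s, head in self.gate_heads.items()}
        span_logits = {s: head(tok_repr) for s, head in self.span_heads.items()}
        return gate_logits, span_logits
```

Slot gates consume the sequence-level representation, while span prediction consumes the token-level representations, which is the distinction that becomes relevant when comparing auxiliary task types later.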

Auxiliary Tasks
We consider two types of auxiliary tasks unrelated to DST. The first category encompasses sentence and sentence-pair level classification tasks that aim at discovering linguistic phenomena. We resort to the datasets used by the GLUE benchmark (Wang et al., 2018), which cover various NLP problems: (1) MNLI, QNLI, RTE and WNLI for natural language inference (e.g., entailment detection); (2) CoLA for linguistic acceptability classification; (3) SST-2 for sentiment classification; (4) MRPC and QQP for paraphrase detection. 2 We use SQuAD2.0 (Rajpurkar et al., 2018), a question-answering dataset, as representative of token-level classification tasks such as span prediction. SQuAD consists of questions whose answers are extractable sequences of text, i.e., spans found in an accompanying paragraph. We refer to the original papers for further details regarding the datasets.
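As a sketch, these auxiliary tasks can be organized by the type of prediction head they require. The label counts below follow the standard GLUE/SQuAD setups; the registry itself is purely illustrative and not part of the original implementation.

```python
# Illustrative registry of auxiliary tasks and the head type each one needs.
AUX_TASKS = {
    "cola":     {"head": "sequence_classification", "num_labels": 2},  # linguistic acceptability
    "sst2":     {"head": "sequence_classification", "num_labels": 2},  # sentiment
    "mrpc":     {"head": "sequence_classification", "num_labels": 2},  # paraphrase (pair)
    "qqp":      {"head": "sequence_classification", "num_labels": 2},  # paraphrase (pair)
    "mnli":     {"head": "sequence_classification", "num_labels": 3},  # NLI (pair)
    "qnli":     {"head": "sequence_classification", "num_labels": 2},  # NLI (pair)
    "rte":      {"head": "sequence_classification", "num_labels": 2},  # NLI (pair)
    "wnli":     {"head": "sequence_classification", "num_labels": 2},  # NLI (pair)
    "squad_v2": {"head": "span_prediction"},                           # token-level QA
}
```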
We employ the following training constraints: (1) The auxiliary task can either be a classification problem or a span prediction problem, and (2) only one auxiliary task at a time can be used. The latter allows us to clearly identify the effect of particular auxiliary tasks.

Intermediate Task Fine-tuning (ITFT)
Our ITFT scheme trains the same model successively on two unrelated tasks, i.e., the auxiliary task, followed by the DST task. Figure 1a is a depiction of the model architecture. Depending on the training phase, the encoder is followed either by the task-specific classification heads for DST or by a single classification head for the auxiliary task. Both phases of fine-tuning follow the procedure described by Devlin et al. (2018). The intention of ITFT is to steer the encoder's parameters into a favorable direction so that subsequent fine-tuning finds a better local optimum.
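A minimal sketch of the two phases follows, assuming a standard Hugging Face setup and reusing the illustrative TripPyStyleDST class from above; the output directory name and the omitted training loops are placeholders.

```python
from transformers import RobertaForSequenceClassification

# Phase 1: fine-tune RoBERTa on the auxiliary task (here a 2-class GLUE-style task).
aux_model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2)
# ... standard fine-tuning loop on the auxiliary data goes here ...
aux_model.save_pretrained("roberta-itft-aux")  # hypothetical output directory

# Phase 2: initialize the DST model's encoder from the intermediately fine-tuned
# checkpoint and fine-tune towards DST as usual. Only the encoder weights are
# reused; the auxiliary classification head is discarded.
dst_model = TripPyStyleDST(slots=["restaurant-pricerange", "restaurant-area"],
                           model_name="roberta-itft-aux")
# ... DST fine-tuning loop goes here ...
```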

Multi-task Learning (MTL)
With MTL, we train the same model simultaneously on two unrelated tasks. Figure 1b is a depiction of the model architecture.
The strategy is formally outlined in Algorithm 1. For each step s that is to be trained on DST, we also train one additional step on the auxiliary task. In other words, the training alternates between auxiliary task and target task on the level of steps. We share one optimizer for both tasks and perform two successive updates (lines 9 and 12 in Algorithm 1), one for each batch b. The number of these double steps is determined by s_max, the maximum number of steps for the target task, so as to not overpower the main task. A hyperparameter e_MTL determines the last epoch for which MTL is applied. If e_MTL < e_max (the maximum number of epochs), we fine-tune only on DST for the remainder.
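The following is a minimal sketch of this alternation based on our description above, not a transcription of Algorithm 1. The dst_step and aux_step functions, which compute the respective losses, and the loader names are placeholders.

```python
import itertools


def mtl_epoch(model, dst_loader, aux_loader, optimizer,
              dst_step, aux_step, s_max, epoch, e_mtl):
    """Alternating MTL sketch: one auxiliary-task update after every DST update,
    sharing a single optimizer across both tasks."""
    aux_iter = itertools.cycle(aux_loader)  # auxiliary data may be smaller or larger than the DST data
    for s, dst_batch in enumerate(dst_loader):
        if s >= s_max:                      # s_max caps the number of double steps
            break
        optimizer.zero_grad()
        dst_step(model, dst_batch).backward()        # loss on the DST heads
        optimizer.step()                             # first update: target task
        if epoch <= e_mtl:                           # after e_MTL, continue with DST only
            optimizer.zero_grad()
            aux_step(model, next(aux_iter)).backward()  # loss on the auxiliary head
            optimizer.step()                         # second update: auxiliary task
```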

Experiments
Datasets We conduct our evaluation on four dialog datasets. MultiWOZ 2.1 (Eric et al., 2019) is the most challenging, containing over 10k dialogs defined over 5 domains with 30 domain-slot pairs. WOZ 2.0 (Wen et al., 2016) is a single-domain benchmark. sim-M and sim-R (Shah et al., 2018) are single-domain datasets that are challenging due to their high out-of-vocabulary (OOV) rate in some slots.
Scoring We use joint goal accuracy (JGA) on the evaluation sets as the primary measure to compare individual models. JGA is the ratio of dialog turns for which all slots have been filled with the correct value according to the ground truth. We report the average JGA over 5 runs, each with a different random seed.
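As a sketch, JGA can be computed as follows, where both arguments are lists of per-turn slot-value dictionaries; the representation is illustrative and not tied to a particular dataset format.

```python
def joint_goal_accuracy(predicted_states, gold_states):
    """Fraction of dialog turns in which every slot matches the ground truth.
    Both arguments are lists of per-turn {slot: value} dictionaries."""
    correct = sum(pred == gold for pred, gold in zip(predicted_states, gold_states))
    return correct / len(gold_states)
```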
Training Details As encoder we use RoBERTa-base. Training for the first phase of ITFT follows Devlin et al. (2018). 3 For target task training and MTL, the maximum input sequence length is 180 tokens after byte-pair encoding (Sennrich et al., 2015). We use the Adam optimizer (Kingma and Ba, 2014) with a joint cross-entropy loss and back-propagate through the entire network. The initial learning rate (LR) is 1e-4. We conduct training with a warm-up proportion of 10% and linear LR decay. Weight decay is set to 0.01.

Table 1: Performance comparison of out-of-task training methods and utilized tasks. Bold indicates best performance per dataset and training method. ** and * indicate statistically significant improvements over the baseline with p < 0.05 and p < 0.1, respectively. 1 The maximum use of auxiliary task samples for MTL is e_max times the number of samples for the target task (see Section 2.4).

During training we use a dropout (Srivastava et al., 2014) rate of 30% on the RoBERTa output and 10% on the classification heads. We use early stopping based on the JGA of the development set. We set e_max = 10 and e_MTL = 7. We do not use slot value dropout (Xu and Sarikaya, 2014) except for sim-M.
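A minimal sketch of this optimization setup is given below, assuming a Hugging Face transformers environment and that model and num_train_steps are defined elsewhere. We use AdamW (decoupled weight decay) here as an assumption about how "Adam with weight decay 0.01" is realized, not as a statement of the exact original implementation.

```python
import torch
from transformers import get_linear_schedule_with_warmup

# Optimizer and schedule following the hyperparameters reported above.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_train_steps),  # 10% warm-up proportion
    num_training_steps=num_train_steps,           # linear LR decay afterwards
)
```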

Results
Table 1 lists our out-of-task training results on all four DST datasets, compared to a baseline that does not use any auxiliary task. Additional training on an unrelated task produces considerably better models in almost every tested combination. With one exception, the average JGA of all models trained with ITFT or MTL is higher than the respective baseline performance.
ITFT vs. MTL MTL shows better performance than ITFT in 3 out of 4 cases. On every single DST dataset, it is preferable to use multi-task learning rather than sequential fine-tuning. On average, MTL improves the performance of DST by almost 1% absolute across all datasets, while ITFT improves the average performance by 0.4%. In only one case (QQP) does ITFT consistently harm performance. In stark contrast, MTL successfully utilizes QQP to consistently improve performance for all DST tasks. Figure 2a shows that both methods benefit early target task training. However, only MTL, which revisits out-of-task data during training, maintains a positive effect throughout all epochs.
Potential impact of target task difficulty It is reasonable to assume that target tasks do not benefit equally from out-of-task training, depending on the baseline model's initial capacities. WOZ 2.0 can be considered the easiest task to solve. sim-R and sim-M both feature slots with high OOV rates, and sim-M contains extremely limited amounts of data. MultiWOZ 2.1 is the most challenging. Table 1 shows that more difficult tasks tend to benefit more from MTL than easier tasks. This is also true for ITFT, except for MultiWOZ.
Potential impact of data amount The amount of target task data and the potential improvement via out-of-task training seem uncorrelated. The amount of available and utilized auxiliary task data likewise does not seem to be decisive. Even the smallest of the datasets (WNLI, RTE, MRPC, CoLA) can be utilized successfully to significantly improve DST. However, we did observe a correlation between the auxiliary task data size and the performance of the training methods. Figure 2b shows that MTL tends to benefit from larger out-of-task datasets, while ITFT performs better on small datasets. MTL only sees a subset of all samples of the auxiliary task if the target task is small, yet clearly outperforms ITFT, which always sees all out-of-task training samples and might therefore suffer from adverse effects of over-training on an unrelated task.

Potential impact of auxiliary task type Table 1 is not indicative of trends regarding the usefulness of particular auxiliary task types. We did observe that span prediction (SQuAD), sentiment classification (SST-2) and linguistic acceptability classification (CoLA) led to more consistent improvements than NLI-type tasks and paraphrase detection (MRPC, QQP). The latter two types lead to significantly lower improvements with ITFT, while MTL benefits from all task types comparably well.

DST training effects SQuAD is the only auxiliary task that utilizes the token-level representations of RoBERTa for prediction, while all other tasks (which are GLUE tasks) rely solely on the sequence-level representation. Table 2 shows that fine-tuning on either task category leads to similar DST performance improvements in terms of average slot gate accuracy. While slot gates expect sequence representations as input, span prediction relies on token representations. As can be seen, out-of-task training with SQuAD leads to larger improvements on span prediction than the GLUE tasks. The "movie" and "restaurant" slots in sim-M and sim-R have very high OOV rates (100% and 40%, respectively). SQuAD proved most helpful for improving the accuracy of these particularly difficult slots. Overall, both task categories proved beneficial for improving DST performance.

Conclusion
We investigated auxiliary out-of-task training for DST and found that model training benefits most from joint optimization compared to sequential training. Even though the auxiliary tasks and the target task are mismatched in domain and task, our training schemes consistently improve target task performance, regardless of task types or data amounts. We reach state-of-the-art results with considerable improvements on all target datasets. We showed that out-of-task training is suitable to overcome data sparsity issues. In future work we will pursue the direction of scaling up to joint out-of-task training on multiple unrelated auxiliary tasks. We would also like to investigate iterative approaches to out-of-task training, where new data is added to the training during the course of a model's lifetime, rather than training from scratch.

Table 3 summarizes experimental results when using BERT instead of RoBERTa as encoder in TripPy. Even though BERT benefits less from out-of-task training, the same tendencies as for RoBERTa are observed. One notable difference is the poor performance of SST-2 for BERT. TripPy with BERT uses segment ID 0 for the current user utterance and segment ID 1 for the system utterance plus dialog history, while TripPy with RoBERTa does not distinguish between multiple segments in the input (Devlin et al., 2018; Liu et al., 2019b). Out-of-task training on SST-2 might negatively affect TripPy with BERT, because this data consists of single segments instead of pairs. However, the nature of the task (see Table 4) seems to be relevant, as CoLA, another single-segment classification problem, does not result in such poor performance with BERT. It is noteworthy that the improvements of RoBERTa over BERT for SST-2 are also above average in the official GLUE benchmark leaderboard 4 (while being about average for CoLA), which might indicate a generally higher aptitude of RoBERTa for learning from SST-2.

Table 3: Performance comparison of out-of-task training methods and utilized tasks when using BERT as encoder. Bold indicates best performance per dataset and training method. ** and * indicate statistically significant improvements over the baseline with p < 0.05 and p < 0.1, respectively. 1 The maximum use of auxiliary task samples for MTL is e_max times the number of samples for the target task (see Section 2.4).

Table 4: Overview of auxiliary tasks that were utilized for out-of-task training for DST. Cl. denotes the number of target classes for a task. Input is either a single sequence or a sequence pair.