TOD-BERT: Pre-trained Natural Language Understanding for Task-Oriented Dialogue

The use of pre-trained language models has emerged as a promising direction for improving dialogue systems. However, the difference in linguistic patterns between conversational data and general text makes existing pre-trained language models less effective in practice. Recently, some pre-training approaches based on open-domain dialogues have leveraged large-scale social media data such as Twitter or Reddit. Pre-training for task-oriented dialogues, on the other hand, is rarely discussed because of the long-standing and crucial data scarcity problem. In this work, we combine nine English, human-human, multi-turn, and publicly available task-oriented dialogue datasets to conduct language model pre-training. The experimental results show that our pre-trained task-oriented dialogue BERT (TOD-BERT) surpasses BERT and other strong baselines on four downstream task-oriented dialogue applications: intent detection, dialogue state tracking, dialogue act prediction, and response selection. Moreover, in simulated limited-data experiments, we show that TOD-BERT has a stronger few-shot ability, which can mitigate the data scarcity problem in task-oriented dialogues.


Introduction
Pre-trained models with self-attention encoder architectures (Devlin et al., 2018) have been commonly used in many NLP applications. Such models are trained with self-supervised objectives on massive general text corpora, such as English Wikipedia or books (Zhu et al., 2015). By further fine-tuning these representations, breakthroughs have been continuously reported for various downstream tasks, especially in natural language understanding.
However, previous work (Rashkin et al., 2018; Wolf et al., 2019) shows that directly fine-tuning such models on conversational corpora has some deficiencies. One possible reason is the intrinsic difference in linguistic patterns between human conversations and written text, which results in a large gap in data distributions (Bao et al., 2019). Therefore, pre-training dialogue language models on chit-chat corpora from social media, such as Twitter or Reddit, has recently been investigated, especially for dialogue response generation (Zhang et al., 2019) and retrieval (Henderson et al., 2019b). Although these open-domain dialogues are diverse and easy to collect, they are usually short, noisy, and without specific chatting goals.
On the other hand, a task-oriented dialogue has explicit goals (e.g., restaurant reservation or ticket booking) and many conversational interactions, but each dataset is usually small and scattered because obtaining and labeling such data is time-consuming. Moreover, a task-oriented dialogue has explicit user and system behaviors: the user has a goal, and the system has its belief state and database information, which makes the language understanding component and dialogue policy learning more important than in chit-chat scenarios.
This paper aims to test the following hypothesis: self-supervised language model pre-training on task-oriented corpora can learn better representations than existing pre-trained models for task-oriented downstream tasks. We emphasize that what we care about most is not whether our pre-trained model achieves state-of-the-art results on each downstream task, since most of the current best models are built on top of pre-trained models and ours can easily replace them. We avoid adding too many additional components on top of the pre-training architecture when fine-tuning in our experiments.
We collect and combine nine human-human, multi-turn task-oriented dialogue corpora to train a task-oriented dialogue BERT (TOD-BERT). In total, there are around 100k dialogues with 1.4M utterances across over 60 different domains. Like BERT (Devlin et al., 2018), TOD-BERT is formulated as a masked language model and uses a deep bidirectional Transformer (Vaswani et al., 2017) encoder as its model architecture. Unlike BERT, TOD-BERT incorporates two special tokens for the user and the system to model the corresponding dialogue behavior, and a contrastive response selection objective is added during the pre-training stage to capture response similarity. We select BERT because it is the most widely used model in recent NLP research, and our unified datasets can easily be applied to pre-train any existing language model.
We test TOD-BERT on four core task-oriented downstream tasks: intent recognition, dialogue state tracking, dialogue act prediction, and response selection. We observe that TOD-BERT outperforms BERT and other strong baselines such as GPT-2 (Radford et al., 2019) and DialoGPT (Zhang et al., 2019) on all the selected downstream tasks, which further confirms its effectiveness for improving dialogue language understanding. We find that response contrastive learning is beneficial but currently overlooked and under-investigated in dialogue pre-training research. More importantly, TOD-BERT has a stronger few-shot ability than BERT on each task, suggesting that it can reduce the need for expensive human-annotated labels. TOD-BERT can easily be leveraged and adapted to a new task-oriented dialogue dataset. Our source code and data processing scripts are released to facilitate future research on pre-training and fine-tuning for task-oriented dialogues (github.com/jasonwu0731/ToD-BERT).

Related Work
General Pre-trained Language Models, which are trained on massive general text such as Wikipedia and BookCorpus, can be roughly divided into two categories: uni-directional or bi-directional attention mechanisms. GPT (Radford et al., 2018) and GPT-2 (Radford et al., 2019) are representatives of uni-directional language models using a Transformer decoder, where the objective is to maximize left-to-right generation likelihood. These models are commonly applied to natural language generation tasks. On the other hand, BERT (Devlin et al., 2018), RoBERTa, and their variants are pre-trained using a Transformer encoder with bi-directional token prediction. These models are usually evaluated on classification tasks such as the GLUE benchmark (Wang et al., 2018) or span-based question answering tasks (Rajpurkar et al., 2016). Some language models, such as UniLM (Dong et al., 2019), support both uni-directional and bi-directional attention. Conditional language model pre-training has also been proposed: CTRL (Keskar et al., 2019) is a conditional Transformer model trained to condition on control codes that govern style, content, and task-specific behavior. More recently, multi-task language model pre-training with unified sequence-to-sequence generation has been proposed; the Text-to-Text Transformer (T5) (Raffel et al., 2019) unifies multiple text modeling tasks and achieves promising results on various NLP benchmarks.
Dialogue Pre-trained Language Models are mostly trained on open-domain conversational data from Reddit or Twitter for dialogue response generation. TransferTransfo (Wolf et al., 2019) achieves good performance in the ConvAI-2 dialogue competition using GPT-2. DialoGPT (Zhang et al., 2019) is an extension of GPT-2 that is pre-trained on Reddit data for open-domain response generation. ConveRT (Henderson et al., 2019a) pre-trains a dual Transformer encoder for the response selection task on large-scale Reddit (input, response) pairs. PLATO (Bao et al., 2019) uses both Twitter and Reddit data to pre-train a dialogue generation model with discrete latent variables. All of them are designed to cope with the response generation task for open-domain chatbots.
Pre-training for task-oriented dialogues, on the other hand, has little related work. Budzianowski and Vulić (2019) first applied the GPT-2 model to the response generation task, taking the system belief, database result, and last dialogue turn as input to predict the next system response. It uses only one dataset to train its model because few public datasets have database information available. Henderson et al. (2019b) pre-trained a response selection model for task-oriented dialogues. They first pre-train on Reddit corpora and then fine-tune on target dialogue domains, but their training and fine-tuning code is not released. Peng et al. (2020) focus on the natural language generation (NLG) task, which assumes dialogue acts and slot-tagging results are given to generate a natural language response; pre-training on a set of annotated NLG corpora can improve conditional generation quality using a GPT-2 model.

Table 1: Dataset statistics.
Name                              # Dialogue  # Utterance  Avg. Turn  # Domain
MetaLWOZ                          37,884      432,036      11.4       47
Schema (Rastogi et al., 2019)     22,825      463,284      20.3       17
Taskmaster (Byrne et al., 2019)   13,215      303,066      22.9       6
MWOZ (Budzianowski et al., 2018)  10,420      71,410       6.9        7
MSR-E2E                           10,087      74,686       7.4        3
SMD (Eric and Manning, 2017)      3,031       15,928       5.3        3
Frames (Asri et al., 2017)        1,369       19,986       14.6       3
WOZ (Mrkšić et al., 2016)         1

Method
This section discusses each dataset used in our taskoriented pre-training and how we process the data. Then we introduce the selected pre-training base model and its objective functions.

Datasets
We collect nine different task-oriented datasets, all of which are English, human-human, and multi-turn. In total, there are 100,707 dialogues containing 1,388,152 utterances over 60 domains. Dataset statistics are shown in Table 1.
• MetaLWOZ: Meta-Learning Wizard-of-Oz is a dataset designed to help develop models capable of predicting user responses in unseen domains. This large dataset was created by crowdsourcing 37,884 goal-oriented dialogues, covering 227 tasks in 47 domains. The MetaLWOZ dataset is used as the fast adaptation task in the DSTC8 dialogue competition.
• Schema (Rastogi et al., 2019): The schema-guided dialogue dataset has 22,825 dialogues and provides a challenging testbed for several tasks, in particular dialogue state tracking. Each schema is a set of tracking slots, and each domain can have multiple possible schemas. This allows a single dialogue system to support many services and facilitates the simple integration of new services without requiring much training data. The Schema dataset is used as the dialogue state tracking task in the DSTC8 dialogue competition.
• Taskmaster (Byrne et al., 2019): This dataset includes 13,215 dialogues across six domains, with 5,507 spoken and 7,708 written dialogues created with two distinct procedures. One is a two-person Wizard-of-Oz approach in which one person acts as the system, and the other is a self-dialogue approach in which crowdsourced workers write the entire dialogue themselves. It has 22.9 conversational turns per dialogue on average, the longest among all the task-oriented datasets listed.
• MWOZ (Budzianowski et al., 2018): Multi-Domain Wizard-of-Oz dataset contains 10,420 dialogues over seven domains, and it has multiple domains in a single dialogue. It has a detailed description of the data collection procedure, user goal, system act, and dialogue state labels. Different from most of the existing corpora, it also provides full database information.
• MSR-E2E: The Microsoft end-to-end dialogue challenge has 10,087 dialogues in three domains: movie-ticket booking, restaurant reservation, and taxi booking. It also includes an experiment platform with built-in simulators for each domain.
• SMD (Eric and Manning, 2017): Stanford multi-domain dialogue is an in-car personal assistant dataset comprising 3,031 dialogues and three domains: calendar scheduling, weather information retrieval, and point-of-interest navigation. It is designed to smoothly interface with knowledge bases: a knowledge snippet is attached to each dialogue as a piece of simplified database information.
• Frames (Asri et al., 2017): This dataset comprises 1,369 human-human dialogues with an average of 14.6 turns per dialogue, in which users are given some constraints to book a trip and assistants search a database to find appropriate trips. Unlike other datasets, it has labels that keep track of different semantic frames, i.e., the decision-making behavior of users throughout each dialogue.
• WOZ (Mrkšić et al., 2016) and CamRest676 (Wen et al., 2016): These two corpora use the same data collection procedure and the same ontology as DSTC2 (Henderson et al., 2014). They are among the first task-oriented dialogue datasets to use the Wizard-of-Oz style with text input instead of speech input, which emphasizes a model's capacity for semantic understanding rather than its robustness to automatic speech recognition errors.

TOD-BERT
We train TOD-BERT based on the BERT architecture using two loss functions: masked language modeling (MLM) loss and response contrastive loss (RCL). Note that our datasets can be used to pre-train any existing language model architecture; we select BERT here because it is the most widely used model in NLP research. We use the BERT-base uncased model, a Transformer self-attention encoder (Vaswani et al., 2017) with 12 layers and 12 attention heads and a hidden size of d_B = 768.
To capture speaker information and the underlying interaction behavior in dialogue, we add two special tokens, [USR] and [SYS], to the byte-pair embeddings (Mrkšić et al., 2016). We prefix the special token to each user utterance and system response, and concatenate all the utterances in the same dialogue into one flat sequence, as shown in Figure 1. For example, for a dialogue D = {S_1, U_1, ..., S_n, U_n}, where n is the number of dialogue turns and each S_i or U_i contains a sequence of words, the input of the pre-training model is processed as "[SYS] S_1 [USR] U_1 ..." with standard positional embeddings and segment embeddings.
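As an illustration, the flattening step above can be sketched in a few lines of Python. The helper `flatten_dialogue` is hypothetical, not part of the released preprocessing code:

```python
# Sketch of the dialogue flattening described above; `flatten_dialogue` is a
# hypothetical helper, not the authors' released code.
def flatten_dialogue(turns):
    """turns: list of (speaker, utterance) pairs, speaker in {"sys", "usr"}."""
    pieces = []
    for speaker, utterance in turns:
        prefix = "[SYS]" if speaker == "sys" else "[USR]"
        pieces.append(f"{prefix} {utterance}")
    return " ".join(pieces)

dialogue = [("sys", "how can i help you ?"),
            ("usr", "book a table for two please .")]
flat = flatten_dialogue(dialogue)
```

The resulting string can then be tokenized as usual, with [USR] and [SYS] registered as special tokens so the tokenizer never splits them.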
Masked language modeling is a common pre-training strategy for BERT-like architectures, in which a random sample of tokens in the input sequence is selected and replaced with the special token [MASK]. The MLM loss is the cross-entropy loss on predicting the masked tokens. In the original implementation, random masking and replacement are performed once at the beginning and saved for the whole training duration; here we mask tokens dynamically during batch training. TOD-BERT is initialized from BERT, a good starting parameter set, and then further pre-trained on the task-oriented corpora. The MLM loss function is

L_mlm = - sum_{m=1}^{M} log P(x_m),

where M is the total number of masked tokens and P(x_m) is the predicted probability of the token x_m over the vocabulary.
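Dynamic masking can be sketched as follows. This is illustrative only: the simplified version always substitutes [MASK] (omitting BERT's 80/10/10 replacement scheme), and the 15% masking rate is the conventional BERT choice rather than a value stated above:

```python
import random

def dynamic_mask(token_ids, mask_id, mask_prob=0.15, rng=None):
    """Redraw the mask every time the sequence is batched (simplified sketch)."""
    rng = rng or random.Random()
    inputs, labels = [], []
    for tid in token_ids:
        if rng.random() < mask_prob:
            inputs.append(mask_id)   # replace the token with [MASK]
            labels.append(tid)       # the MLM loss predicts the original token
        else:
            inputs.append(tid)
            labels.append(-100)      # ignored positions in the cross-entropy
    return inputs, labels
```

Because the mask is redrawn at every call, the same dialogue is masked differently across epochs, unlike the static masking of the original BERT implementation.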
Response contrastive loss can also be used for dialogue language modeling since it does not require any additional human annotation. Pre-training with RCL brings several advantages: 1) we learn a better representation for the [CLS] token, which is essential for all the downstream tasks, and 2) we encourage the model to capture the underlying dialogue sequential order, structure information, and response similarity. Unlike the original next sentence prediction (NSP) objective in BERT pre-training, which concatenates two segments A and B to predict whether they are consecutive text with binary classification, we apply a dual-encoder approach (Henderson et al., 2019a) and simulate multiple negative samples. We first draw a batch of dialogues {D^1, ..., D^b} and split each dialogue at a randomly selected turn t. For example, D^1 is separated into two segments: the context {S_1^1, U_1^1, ..., S_t^1, U_t^1} and the response {S_{t+1}^1}. We use TOD-BERT to encode all the contexts and their corresponding responses separately.
Afterwards, we obtain a context matrix C ∈ R^{b×d_B} and a response matrix R ∈ R^{b×d_B} by taking the output [CLS] representations of the b dialogues. We treat the other responses in the same batch as randomly selected negative samples. The RCL objective function is

L_rcl = - sum_{i=1}^{b} log M_{i,i},  M = Softmax(C R^T),

where the Softmax function normalizes the matrix per row. Increasing the batch size up to a certain amount improves performance on downstream tasks, especially response selection, since in our setting a larger batch also changes the ratio of positive to negative samples in contrastive learning; the batch size is a hyper-parameter that may be limited by hardware. We also tried different negative sampling strategies during pre-training, such as local sampling (Saeidi et al., 2017), but did not observe a significant change compared to random sampling.
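The in-batch contrastive objective can be sketched numerically. This is a NumPy illustration of the formula above, not the training code; a real implementation would backpropagate through a framework such as PyTorch, and here the loss is averaged over the batch rather than summed:

```python
import numpy as np

def response_contrastive_loss(C, R):
    """C, R: (b, d_B) context and response [CLS] matrices. Row i of R is the
    true response for row i of C; other rows act as in-batch negatives."""
    logits = C @ R.T                                   # (b, b) score matrix
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))              # -log M_{i,i}, averaged
```

Aligned (context, response) pairs sit on the diagonal of the score matrix, so a well-trained dual encoder drives the diagonal entries above the off-diagonal ones.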
The overall pre-training loss function is the weighted sum of L_mlm and L_rcl; in our experiments, we simply sum them. We gradually reduce the learning rate without a warm-up period. We train TOD-BERT with the AdamW (Loshchilov and Hutter, 2017) optimizer, with a dropout ratio of 0.1 on all layers and attention weights. GELU activation functions (Hendrycks and Gimpel, 2016) are used. Models are early-stopped using the perplexity score on a held-out development set, with mini-batches containing 32 sequences of maximum length 512 tokens. Experiments are conducted on two NVIDIA Tesla V100 GPUs.

Downstream Tasks
What we care about most in this paper is whether TOD-BERT, a language model pre-trained on aggregated task-oriented corpora, shows any advantage over BERT. Therefore, we avoid adding too many additional components on top of its architecture when fine-tuning on each downstream task, and we use the same architecture with a similar number of parameters for a fair comparison. All model parameters are updated using the same hyper-parameters during fine-tuning, with gradient clipping at 1.0. We select four crucial task-oriented downstream tasks for evaluation: intent recognition, dialogue state tracking, dialogue act prediction, and response selection. All of them are core components in modularized task-oriented systems.
Intent recognition is a multi-class classification problem: given an input sentence U, the model predicts a single intent class over I possible intents,

P_int = Softmax(W_1 F(U)) ∈ R^I,

where F is the pre-trained language model, whose [CLS] embedding we use as the output representation, and W_1 ∈ R^{I×d_B} is a trainable linear mapping. The model is trained with the cross-entropy loss between the predicted distribution P_int and the true intent labels.
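Concretely, the classification head amounts to one matrix multiply and a softmax. The sketch below uses toy dimensions and a random vector standing in for the encoder output F(U):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy sizes for illustration; the paper uses d_B = 768.
d_B, I = 8, 3
rng = np.random.default_rng(0)
W_1 = rng.normal(size=(I, d_B))       # trainable mapping W_1 in R^{I x d_B}
cls_embedding = rng.normal(size=d_B)  # stand-in for the [CLS] output F(U)
P_int = softmax(W_1 @ cls_embedding)  # probability distribution over I intents
```

Training would minimize the cross-entropy between P_int and the gold intent label.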
Dialogue state tracking can be treated as a multi-class classification problem using a predefined ontology. Unlike intent recognition, we use the dialogue history X (a sequence of utterances) as input, and the model predicts slot values for each (domain, slot) pair at each dialogue turn. Each candidate value v_i^j, the i-th value of the j-th (domain, slot) pair, is passed through the pre-trained model, and its representation is kept fixed during training. The score of the i-th value for the j-th pair is

S_i^j = Sim(G_j(F(X)), F(v_i^j)),

where Sim is the cosine similarity function, and S^j ∈ R^{|v^j|}, the probability distribution of the j-th (domain, slot) pair over its possible values, is obtained with a Softmax over these scores. G_j is the slot projection layer of the j-th slot, and the number of projection layers |G| equals the number of (domain, slot) pairs. The model is trained with the cross-entropy loss summed over all the pairs.
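The per-slot scoring step can be sketched as follows, with random vectors standing in for the encoder outputs (illustrative shapes only, not the authors' implementation):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_slot_values(dialogue_repr, G_j, value_embeds):
    """Return S^j: a distribution over the candidate values of slot j."""
    projected = G_j @ dialogue_repr                  # G_j(F(X))
    sims = np.array([cosine(projected, v) for v in value_embeds])
    e = np.exp(sims - sims.max())                    # softmax over the values
    return e / e.sum()

rng = np.random.default_rng(1)
S_j = score_slot_values(rng.normal(size=4),          # stand-in for F(X)
                        rng.normal(size=(4, 4)),     # slot projection G_j
                        rng.normal(size=(5, 4)))     # 5 fixed value embeddings
```

Because the value embeddings are fixed, only the encoder and the projection layers G_j receive gradients.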
Dialogue act prediction is a multi-label classification problem because a system response may contain multiple dialogue acts, e.g., request and inform at the same time. The model takes the dialogue history as input and predicts a binary result for each possible dialogue act,

A = Sigmoid(F(X) W_2) ∈ R^N,

where W_2 ∈ R^{d_B×N} is a trainable linear mapping and N is the number of possible dialogue acts; each value in A lies in [0, 1] after the Sigmoid layer. The model is trained with the binary cross-entropy loss, and the i-th dialogue act is considered triggered if A_i > 0.5.
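The multi-label head can be sketched the same way (a hypothetical minimal version, thresholding at 0.5 as in the text):

```python
import numpy as np

def predict_acts(dialogue_repr, W_2, threshold=0.5):
    """Sigmoid per dialogue act; act i is triggered when A_i > threshold."""
    A = 1.0 / (1.0 + np.exp(-(dialogue_repr @ W_2)))
    return A, [i for i, a in enumerate(A) if a > threshold]

# Toy example: one-dimensional representation, two possible acts.
A, triggered = predict_acts(np.array([1.0]), np.array([[5.0, -5.0]]))
```

Unlike the softmax used for intents, each sigmoid output is independent, so any subset of the N acts can fire at once.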
Response selection is a ranking problem that aims to retrieve the most relevant system response from a candidate pool. We use a dual-encoder strategy (Henderson et al., 2019b) and compute similarity scores between the source X and target Y,

r_i = Sim(F(X), F(Y_i)),

where Y_i is the i-th response candidate and r_i is its cosine similarity score. The source X can be truncated, and we limit the context length to the most recent 256 tokens in our experiments. We randomly sample several system responses from the corpus as negative samples; although they may not be true negatives, this is a common way to train a ranker and evaluate its results (Henderson et al., 2019a).

Evaluation Datasets
We select four datasets, OOS, DSTC2, GSIM, and MWOZ, for downstream evaluation. The first three corpora are not included in the task-oriented pre-training datasets; for MWOZ, to be fair, we do not include its test-set dialogues during the pre-training stage. Details of each evaluation dataset are discussed in the following:
• OOS (Larson et al., 2019): The out-of-scope intent dataset is one of the largest annotated intent datasets, with 15,100/3,100/5,500 samples for the train, validation, and test sets, respectively. It covers 151 intent classes over ten domains: 150 in-scope intents and one out-of-scope intent, where out-of-scope means a user utterance that does not fall into any of the predefined intents. Each in-scope intent has 100 training samples.
• DSTC2 (Henderson et al., 2014): DSTC2 is a human-machine task-oriented dataset that may include some system response noise. It has 1,612/506/1,117 dialogues for the train, validation, and test sets, respectively. Following prior work, we map the original dialogue act labels to universal dialogue acts, which results in 9 different system dialogue acts.
• GSIM (Shah et al., 2018a): GSIM is a human-rewritten machine-machine task-oriented corpus, with 1,500/469/1,039 dialogues for the train, validation, and test sets, respectively. We combine its two domains, movie and restaurant, into one single corpus. It is collected with the Machines Talking To Machines (M2M) approach (Shah et al., 2018b), a functionality-driven process combining a dialogue self-play step and a crowdsourcing step. We map its dialogue act labels to universal dialogue acts, resulting in 6 different system dialogue acts.
• MWOZ (Budzianowski et al., 2018): MWOZ is the most common benchmark for task-oriented dialogues, especially for dialogue state tracking. It has 8,420/1,000/1,000 dialogues for the train, validation, and test sets, respectively. Across its seven domains, it has 30 (domain, slot) pairs that need to be tracked in the test set. We use the revised version MWOZ 2.1, which has the same dialogue transcripts but cleaner state label annotations.

Results
For each downstream task, we first conduct experiments using the whole dataset, and then we simulate a few-shot setting to show the strength of TOD-BERT. We run each few-shot experiment at least three times with different random seeds to reduce the variance from data sampling, and we report the mean and standard deviation for these limited-data scenarios. We investigate two versions of TOD-BERT: TOD-BERT-mlm, which uses only the MLM loss during pre-training, and TOD-BERT-jnt, which is jointly trained with the MLM and RCL objectives. We compare TOD-BERT with BERT and other baselines, including two other strong pre-trained models, GPT-2 (Radford et al., 2019) and DialoGPT (Zhang et al., 2019). For GPT-based models, we use mean pooling of the hidden states as the output representation, which we found works better than using only the last token.

Intent Recognition

We evaluate intent recognition on the OOS dataset, as shown in Table 2. We report accuracy on all the data, on the in-domain intents only, and on the out-of-scope intent only. Note that there are two ways to predict the out-of-scope intent: one is to treat it as an additional class, and the other is to set a threshold on the prediction confidence; here we report results for the first setting. TOD-BERT-jnt achieves the highest in-scope and out-of-scope accuracy. In addition, we conduct 1-shot and 10-shot experiments by randomly sampling one and ten utterances per intent class from the training set. TOD-BERT-jnt improves all-intent accuracy by 13.2% and in-domain accuracy by 16.3% over BERT in the 1-shot setting. The 10-shot results (mean ± standard deviation) are:

10-Shot
BERT          75.5% ± 1.1%   88.6% ± 1.1%   84.7% ± 0.3%   16.5% ± 1.7%
TOD-BERT-mlm  76.6% ± 0.8%   90.5% ± 1.2%   84.3% ± 0.2%   14.0% ± 1.3%
TOD-BERT-jnt  77.3% ± 0.5%   91.0% ± 0.5%   84.5% ± 0.4%   15.3% ± 2.1%

Dialogue State Tracking
Two evaluation metrics are commonly used in the dialogue state tracking task: joint goal accuracy and slot accuracy. The joint goal accuracy compares the predicted dialogue states to the ground truth at each dialogue turn, where the ground truth includes slot values for all the possible (domain, slot) pairs. The output is considered correct if and only if all the predicted values exactly match the ground truth values. The slot accuracy, on the other hand, compares each (domain, slot, value) triplet to its ground truth label individually.
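The two metrics can be sketched as follows. This is a minimal illustration over hypothetical per-turn state dictionaries, not the evaluation script used in the paper:

```python
def dst_metrics(predictions, labels):
    """predictions, labels: per-turn dicts mapping (domain, slot) -> value."""
    joint = slot_correct = slot_total = 0
    for pred, gold in zip(predictions, labels):
        joint += int(pred == gold)              # every pair must match exactly
        for key, value in gold.items():
            slot_correct += int(pred.get(key) == value)
            slot_total += 1
    return joint / len(labels), slot_correct / slot_total

gold = [{("hotel", "area"): "north", ("hotel", "stars"): "4"}]
pred = [{("hotel", "area"): "north", ("hotel", "stars"): "3"}]
jga, slot_acc = dst_metrics(pred, gold)   # one wrong value sinks the joint score
```

This makes the difference concrete: a single wrong slot zeroes out the turn under joint goal accuracy, while slot accuracy still credits the correct pairs.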
In Table 5, we compare BERT to TOD-BERT-jnt on the MWOZ 2.1 dataset and find that the latter has a 2.4% joint goal accuracy improvement. Since the original ontology provided by Budzianowski et al. (2018) is not complete (some labeled values are not included in the ontology), we create a new ontology of all the possible annotated values. We also list several well-known dialogue state trackers as references, including DSTReader, HyST, TRADE, and ZSDST (Rastogi et al., 2019). We also report few-shot experiments using 1%, 5%, 10%, and 25% of the data; note that 1% of the data is around 84 dialogues. TOD-BERT outperforms BERT in all settings, which further shows the strength of task-oriented dialogue pre-training.

Dialogue Act Prediction
We conduct experiments on three different datasets and report micro-F1 and macro-F1 scores for the dialogue act prediction task, a multi-label classification problem. For the MWOZ dataset, we remove the domain information from the original system dialogue act labels; for example, "taxi-inform" is simplified to "inform". This process reduces the number of possible dialogue acts from 31 to 13. For the DSTC2 and GSIM corpora, we apply the universal dialogue act mapping that maps the original dialogue act labels to a general format, resulting in 9 and 6 unique system dialogue acts, respectively. We run two other baselines, MLP and RNN, to further show the strengths of BERT-based models. The MLP model simply takes bag-of-word embeddings to make dialogue act predictions, and the RNN model is a bi-directional GRU network.
In Table 4, one can observe that in the full-data scenario, TOD-BERT consistently works better than BERT and the other baselines, regardless of dataset or evaluation metric. In the few-shot experiments, TOD-BERT-mlm outperforms BERT by 3.5% micro-F1 and 6.6% macro-F1 on the MWOZ corpus in the 1% data scenario. We also find that 10% of the training data achieves performance close to full-data training.
Response Selection

The 1-of-100 accuracy metric for response selection is computed using a random batch of 100 examples, so that responses from the other examples in the same batch can be used as random negative candidates. This allows us to compute the metric efficiently across many examples in batches. While it is not guaranteed that the random negatives will indeed be "true" negatives, the 1-of-100 metric still provides a useful evaluation signal. During inference, we run five different random seeds to sample batches and report the average results.
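The k-of-n computation reduces to ranking within the batch's score matrix; a sketch (not the paper's evaluation code) for arbitrary batch size n:

```python
import numpy as np

def k_of_n_accuracy(C, R, k=1):
    """C, R: (n, d) context/response vectors; the true response of context i
    is row i of R. A hit means it ranks in the top k among all n responses."""
    scores = C @ R.T                         # (n, n) in-batch score matrix
    hits = 0
    for i in range(scores.shape[0]):
        top_k = np.argsort(-scores[i])[:k]   # indices of the k best responses
        hits += int(i in top_k)              # the gold response is index i
    return hits / scores.shape[0]
```

With n = 100 and k = 1 or 3, this yields the 1-of-100 and 3-of-100 accuracies reported below.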
In Table 6, we conduct response selection experiments on three datasets: MWOZ, DSTC2, and GSIM. TOD-BERT-jnt achieves 65.8% 1-of-100 accuracy and 87.0% 3-of-100 accuracy on MWOZ, surpassing BERT by 18.3% and 11.5%, respectively. Similar results are consistently observed on the DSTC2 and GSIM datasets, and the advantage of TOD-BERT-jnt is more evident in the few-shot scenarios. We do not report TOD-BERT-jnt in the MWOZ few-shot setting because the full MWOZ training set is used for response contrastive learning during the pre-training stage, which would make the comparison unfair. The response selection results are sensitive to the training batch size, since the larger the batch size, the harder the prediction task. In our experiments, we set the batch size to 25 for all models.

Visualization
In Figure 2, we visualize the embeddings of BERT, TOD-BERT-mlm, and TOD-BERT-jnt given the same input from the MWOZ test set. Each point is a system response representation, obtained by passing the response through a pre-trained model and reducing its high-dimensional features to two dimensions with t-distributed stochastic neighbor embedding (t-SNE). Since we know the true domain and dialogue act labels of each utterance, we use different colors to represent different domains and dialogue acts. As one can observe, TOD-BERT-jnt has clearer group boundaries than TOD-BERT-mlm, and both are better than BERT.
To analyze the results quantitatively, we run K-means, a common unsupervised clustering algorithm, on top of the output embeddings of BERT and TOD-BERT, setting K to 10 and 20. After clustering, we assign each utterance in the MWOZ test set to a predicted class. We then compute the normalized mutual information (NMI) between the clustering result and the actual domain label of each utterance.
Here is what we observe: TOD-BERT consistently achieves higher NMI scores than BERT. For K=10, TOD-BERT has a 0.143 NMI score, and BERT only has 0.094. For K=20, TOD-BERT achieves a 0.213 NMI score, while BERT has 0.109.
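NMI can be computed directly from the contingency counts of the two labelings; a self-contained sketch (sklearn's `normalized_mutual_info_score` computes the same quantity, here normalized by the geometric mean of the entropies):

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """Normalized mutual information between two labelings of the same items.
    Assumes each labeling has at least two distinct labels (nonzero entropy)."""
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mi = sum(c / n * math.log((c / n) / (count_a[a] / n * count_b[b] / n))
             for (a, b), c in joint.items())
    def entropy(counts):
        return -sum(c / n * math.log(c / n) for c in counts.values())
    return mi / math.sqrt(entropy(count_a) * entropy(count_b))
```

A score of 1.0 means the clustering matches the domain labels up to a relabeling; 0 means the two labelings are independent.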

Conclusion
We propose task-oriented dialogue BERT (TOD-BERT), trained on nine human-human, multi-turn task-oriented datasets across over 60 domains. TOD-BERT outperforms BERT on four dialogue downstream tasks: intent classification, dialogue state tracking, dialogue act prediction, and response selection. It also shows a clear advantage in the few-shot experiments, when only limited labeled data is available. TOD-BERT is easy to deploy and is open-sourced, allowing the NLP research community to apply or fine-tune it on any task-oriented conversational problem.