Simple Data Augmentation with the Mask Token Improves Domain Adaptation for Dialog Act Tagging

The concept of Dialogue Act (DA) is universal across different task-oriented dialogue domains: the act of "request" carries the same speaker intention whether it is for restaurant reservation or flight booking. However, DA taggers trained on one domain do not generalize well to other domains, which leaves us with the expensive need for a large amount of annotated data in the target domain. In this work, we investigate how to better adapt DA taggers to desired target domains with only unlabeled data. We propose MASKAUGMENT, a controllable mechanism that augments text input by leveraging the pre-trained MASK token from the BERT model. Inspired by consistency regularization, we use MASKAUGMENT to introduce an unsupervised teacher-student learning scheme to examine the domain adaptation of DA taggers. Our extensive experiments on the Simulated Dialogue (GSim) and Schema-Guided Dialogue (SGD) datasets show that MASKAUGMENT is useful in improving cross-domain generalization for DA tagging.


Introduction
Dialog act (DA) tagging, one of the important NLU components of modern task-oriented dialog systems, aims to capture the speaker's intention behind the utterances at each dialog turn. Several annotation schemas and taxonomies have been introduced (Core and Allen, 1997; Stolcke et al., 2000; Bunt et al., 2010; Mezza et al., 2018) over the years. However, recent work (Kumar et al., 2018; Chen et al., 2018; Raheja and Tetreault, 2019) on DA tagging has mainly focused on human-human social conversations (Godfrey et al., 1992; Jurafsky et al., 1997), which is less applicable to the task-oriented setting.
Recently, several task-oriented dialogue datasets (Shah et al., 2018; Henderson et al., 2014; Budzianowski et al., 2018) have been released. However, the discrepancy in their annotation schemas hinders progress on building DA taggers that can generalize across domains and possibly datasets. To address this issue, Paul et al. (2019) propose a universal schema for DAs by aligning annotations across multiple existing corpora. In this regard, another useful corpus employed as a testbed in this work is the Schema-Guided Dialogue (SGD) dataset (Rastogi et al., 2020), which covers 20 domains under the same DA annotation schema.
It is often challenging and costly to obtain a large amount of in-domain dialogues with annotations. However, unlabeled dialogue corpora in the target domain can easily be curated from past conversation logs or collected via crowd-sourcing (Byrne et al., 2019; Budzianowski et al., 2018) at a more reasonable cost. The goal of this work is to investigate how to leverage pre-trained masked language models (e.g., BERT) to better adapt DA taggers to unseen domains with available unlabeled dialogues. Pre-trained language models (Devlin et al., 2019) have been successful for several NLP tasks including dialogue systems (Wolf et al., 2019; Zhang et al., 2019; Bao et al., 2020; Henderson et al., 2019; Wu et al., 2020). However, domain adaptation capabilities of these models remain to be further explored for goal-oriented dialogues.

[Figure 2: Given a dialogue turn in the target domain, we obtain teacher and student representations by applying two different maskings on its flattened original representation. We use the output binary probability distributions (per dialog act) of the teacher as soft targets to train the student. Orange and green colored boxes indicate different segment ids.]
In this paper, we use the pre-trained MASK token of the BERT model to define MASKAUGMENT, which stochastically augments text input by randomly replacing its tokens with the MASK token. We adopt the consistency regularization approach (Sajjadi et al., 2016) to introduce an unsupervised teacher-student learning scheme, leveraging MASKAUGMENT to generate teacher and student representations that retain different amounts of the original content of an unlabeled dialogue example. Our extensive experiments on the GSim (Shah et al., 2018) and SGD (Rastogi et al., 2020) datasets suggest: (i) BERT establishes a much stronger baseline compared to previous work (Paul et al., 2019); (ii) the proposed teacher-student learning via MASKAUGMENT is useful in further improving the target domain F1 score over the BERT baseline: up to 3% when the full source domain data is used, and up to 10% in the low-resource setting.

MASKAUGMENT
In this section, we first discuss the task setup, BERT-based DA tagging model, and relevant background. We then define the proposed fine-tuning objectives leveraging MASKAUGMENT.

Task Setup
We start by formalizing the DA tagging task, depicted in Figure 1, as a multi-label classification problem. Let D = [T_1, T_2, ..., T_n] denote a dialogue of n turns as a series of user and system utterances. Let A = {a_j}_{j=1}^{m} be the predefined set of m different DAs in the schema. The objective of dialogue act tagging is to determine a subset A_k ⊆ A of DAs that apply to the current turn T_k given the conversation history D_{:k} = [T_1, T_2, ..., T_k] so far. We formulate this objective simply as a classification problem with binary labels y_j ∈ {0, 1} for each act a_j, where y_j = 1 if a_j ∈ A_k and y_j = 0 otherwise. As defined above, dialogue act tagging is a turn-level classification problem; hence every turn T_k constitutes: (i) a labeled example (D_{:k}, A_k) if we have a set A_k of DA annotations, or (ii) an unlabeled example (D_{:k}, ·) otherwise.
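As a concrete illustration, the turn-level multi-label formulation can be sketched in a few lines of Python (the act names and helper below are illustrative, not taken from the paper's schema or released code):

```python
# Toy subset of a universal DA schema A (illustrative act names only).
ACTS = ["inform", "request", "affirm", "sys-offer"]

def to_binary_labels(turn_acts, acts=ACTS):
    """Map the DA subset A_k of a turn to binary labels y_j in {0, 1}:
    y_j = 1 iff act a_j is triggered at the current turn."""
    return [1 if a in turn_acts else 0 for a in acts]

# Example: a system turn annotated with A_k = {inform, sys-offer}
y = to_binary_labels({"inform", "sys-offer"})
```

Each turn thus yields one independent binary target per act, which is what makes the task multi-label rather than multi-class.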

Model
Given a conversation history D_{:k} as input, we first convert it into a sequence of words by concatenating user and system utterances. Before concatenating each utterance, we prepend it with the corresponding speaker tag, using the [SYS] and [USR] special tokens to indicate the system and user sides, respectively. Finally, we prepend the whole flattened sequence with the [CLS] special token to obtain the final dialogue history representation:

x = [CLS] [USR/SYS] T_1 [USR/SYS] T_2 · · · [USR/SYS] T_k    (1)

The segment ids are set to 0 and 1 for the tokens of past turns and the current turn, respectively.
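A minimal sketch of this flattening step follows. The helper name, whitespace tokenization, and speaker-tag assignment are our illustrative assumptions (the paper's model would use BERT's subword tokenizer):

```python
def flatten_history(turns, k):
    """Build the flattened input for turn T_k: [CLS] followed by the
    speaker-tagged utterances of D_{:k}. Returns tokens and segment ids
    (segment id 1 for the current turn T_k, 0 for past turns)."""
    tokens, segment_ids = ["[CLS]"], [0]
    for i, (speaker, text) in enumerate(turns[:k]):
        tag = "[USR]" if speaker == "user" else "[SYS]"
        piece = [tag] + text.split()  # whitespace split for illustration
        tokens.extend(piece)
        segment_ids.extend([1 if i == k - 1 else 0] * len(piece))
    return tokens, segment_ids

dialog = [("user", "book a table for two"), ("system", "which restaurant ?")]
tokens, segs = flatten_history(dialog, k=2)
```

Note how only the tokens of the current turn (including its speaker tag) receive segment id 1, matching the description above.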
For the DA tagging task, the dialogue history x is used as input to a pre-trained language model M, and the model computes a probability vector p_θ(·|x) = σ(W M(x) + b), where M(x) ∈ R^d is the output contextualized embedding corresponding to the [CLS] token, W ∈ R^{m×d} and b ∈ R^m are trainable weights of a linear projection layer, σ is the sigmoid function, θ denotes the entire set of trainable parameters of model M along with (W, b), and finally p_θ(a_j|x) indicates the probability of tag a_j being triggered. The following objective is used to train the model parameters.

Supervised tagging loss (STL). This objective is used to update the DA tagger via the supervision coming from labeled source data S. We use the binary cross-entropy loss J_STL(θ; x, y) defined as:

J_STL(θ; x, y) = − Σ_{j=1}^{m} [ y_j log p_θ(a_j|x) + (1 − y_j) log(1 − p_θ(a_j|x)) ]    (2)
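The tagging head and the STL objective can be sketched in plain Python, with a toy feature vector h standing in for the [CLS] embedding M(x) (all names and toy dimensions here are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def tag_probs(h, W, b):
    """p_theta(a_j | x) = sigmoid(W h + b): one independent binary
    probability per dialog act; h stands in for the CLS embedding M(x)."""
    return [sigmoid(sum(w_i * h_i for w_i, h_i in zip(row, h)) + b_j)
            for row, b_j in zip(W, b)]

def stl_loss(p, y):
    """Binary cross-entropy summed over the m acts (the J_STL of Eq. 2)."""
    return -sum(yj * math.log(pj) + (1 - yj) * math.log(1 - pj)
                for pj, yj in zip(p, y))

# Toy dimensions: d = 2 features, m = 2 acts.
p = tag_probs([1.0, -1.0], W=[[0.5, 0.5], [1.0, 0.0]], b=[0.0, 0.0])
loss = stl_loss(p, y=[1, 0])
```

Because each act gets its own sigmoid (rather than a softmax over acts), any number of acts can fire at the same turn.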

Learning with MASKAUGMENT
Semi-supervised learning (SSL) (Berthelot et al., 2019) is an effective approach for improving deep learning models by leveraging in-domain unlabeled data. Unlike the traditional SSL setting, our objective is to primarily address the underlying source-to-target domain shift. In prior work (Xie et al., 2019; Wei and Zou, 2019), unsupervised data augmentation methods including word replacement and back-translation have been shown useful for short written text classification. However, such augmentation methods are shown to be less effective (Shleifer, 2019) when used with pre-trained models. Besides, back-translation is less applicable in our scenario, as translating a multi-turn dialogue is itself a rather challenging task compared to short text.
Instead, we propose a simple and controllable data augmentation, MASKAUGMENT, to explore a new unsupervised teacher-student learning scheme for domain adaptation of DA taggers. MASKAUGMENT augments the original text input by randomly replacing its tokens with the MASK token at a specified probability. We follow the masking policy in (Devlin et al., 2019). Formally, let z(x̃|x, ε) denote MASKAUGMENT as a stochastic transformation with masking probability ε for input x. Below we define three fine-tuning objectives leveraging MASKAUGMENT that are used in addition to J_STL.

Masked tagging loss (MTL). We incorporate MASKAUGMENT into the STL objective by perturbing its input sequence x, i.e., J_MTL(θ; x, y, ε) = J_STL(θ; x̃, y) with x̃ ∼ z(x̃|x, ε).

Masked LM loss (MLM). This is the original objective that BERT is pre-trained with. The objective of MLM training is to correctly reconstruct a randomly selected subset (with probability ε) of input tokens by leveraging the unmasked context. We denote this loss by J_MLM(θ; x, ε).

Teacher-Student Learning with Disagreement Loss (DAL). We adopt consistency regularization (Sajjadi et al., 2016; Laine and Aila, 2017), widely used in traditional SSL (Berthelot et al., 2019), and define a disagreement loss, which employs MASKAUGMENT in a novel way to give rise to unsupervised teacher-student training. The core idea is to contrast the amount of controllable perturbation to learn more generalizable representations. We propose a stochastic imputation-based teacher and student selection by leveraging MASKAUGMENT. As in Figure 2, we sample two augmentations x̃^(t) ∼ z(x̃|x, ε_t) and x̃^(s) ∼ z(x̃|x, ε_s) for the teacher and student, respectively. We take ε_t < ε_s to ensure that the teacher augmentation x̃^(t) retains more of the original content x than the student augmentation x̃^(s), and hence is more reliable. The disagreement loss J_DAL(θ; x, ε_t, ε_s) is then computed as the binary cross-entropy loss between the teacher p_θ(·|x̃^(t)) and the student p_θ(·|x̃^(s)) distributions as in Eq. 2, treating the teacher as the soft target (y).
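A minimal sketch of MASKAUGMENT as described above, simplified so that selected tokens are always replaced by [MASK] (the full policy of Devlin et al. (2019) also sometimes keeps or randomly swaps the selected token) and special tokens are never masked:

```python
import random

MASK = "[MASK]"
SPECIAL = {"[CLS]", "[SYS]", "[USR]"}  # never masked in this sketch

def mask_augment(tokens, eps, rng=random):
    """z(x_tilde | x, eps): independently replace each non-special token
    with [MASK] with probability eps."""
    return [t if t in SPECIAL or rng.random() >= eps else MASK
            for t in tokens]

x = "[CLS] [USR] book a table for two".split()
rng = random.Random(0)
teacher = mask_augment(x, eps=0.05, rng=rng)  # eps_t: light masking
student = mask_augment(x, eps=0.30, rng=rng)  # eps_s: heavier masking
```

With ε_t < ε_s, the teacher view retains more of the original turn than the student view, which is what makes the teacher's predictions a sensible soft target.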

Datasets
GSim (Shah et al., 2018) consists of machine-machine task-oriented dialogues for two tasks in two different domains: buying a movie ticket (GMov) and reserving a restaurant table (GRes). It contains 1500/469/1117 dialogues for the train/dev/test sets. Following Paul et al. (2019), its dialogue acts are mapped to 13 tags in the universal schema. SGD (Rastogi et al., 2020) consists of 22,825 schema-guided single/multi-domain dialogues, where domains can have multiple schemas, each defined by a set of tracking slots. We use the smaller single-domain dialogue collections of music (SMusic), media (SMedia), and ride-sharing (SRide) as source domains to study generalization on flights (SFlights), the largest one, as the target domain.

Training and Implementation Details
The final loss function is the sum of the active objectives among J_STL, J_MTL, J_DAL, and J_MLM, except that J_MLM is multiplied by 0.1 when active. DAL is activated after 1 epoch of training with the remaining objectives. We tune ε_t ∈ [0, 0.1] and ε_s ∈ [0.1, 0.5] for the DAL objective. We optimize the loss using AdamW (Loshchilov and Hutter, 2017). The learning rate is tuned in [10^-5, 5 × 10^-5] with no warmup steps. We use batches of 16 examples with a maximum sequence length of 128, which covers around 9.9, 10.3, and 9.9 turns on average for the train, dev, and test splits, respectively. We use the transformers library for our implementation.
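The loss combination described above might be sketched as follows (the exact scheduling beyond "DAL activated after 1 epoch" is our reading of the text, and the function name is illustrative):

```python
def total_loss(j_stl, j_mtl, j_dal, j_mlm, epoch, mlm_weight=0.1):
    """Sum of the active objectives; J_MLM is down-weighted by 0.1 and
    J_DAL joins only after the first epoch, per the training setup."""
    loss = j_stl + j_mtl + mlm_weight * j_mlm
    if epoch >= 1:  # DAL activated after 1 epoch with the other objectives
        loss += j_dal
    return loss
```

In practice each J term would be the scalar loss computed on the current batch; any objective that is switched off for an ablation simply contributes zero.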

Results and Discussion
We begin our discussion with our main findings on domain adaptation as presented in Table 1. We explore the effect of incorporating our proposed MTL and DAL objectives on top of STL (baseline) for both Transformer (Vaswani et al., 2017) and BERT (Devlin et al., 2019) models. Training a Transformer with the STL objective already leads to considerable improvements over the LSTM (Paul et al., 2019). Fine-tuning BERT with the STL objective from scratch provides further improvements over the Transformer, establishing a much stronger baseline in both source and target domain performance. For both Transformer and BERT models, our proposed DAL and MTL objectives are independently useful in further improving cross-domain generalization over the strong baselines trained only with the STL objective, while not hurting source domain performance. Moreover, fine-tuning on the combined unsupervised objective of DAL and MTL leads to the best performance (last row) on target domains across the board, hinting that the two objectives provide orthogonal benefits.
Domain-adaptive pre-training (pre-BERT). Following Gururangan et al. (2020), who show such adaptation to be useful, we explore domain-adaptive pre-training of the BERT model on the combination of source and target domain dialogues with the MLM loss before fine-tuning it on the task. As presented in Table 2, pre-BERT helps improve the F1 score on the target domain (GRes) by up to 2.2% over the strong scratch-BERT model across different training objectives. Incorporating MASKAUGMENT into pre-BERT via our proposed DAL and MTL objectives leads to a 2.1% boost over fine-tuning with only STL, achieving a 4.8% F1 score improvement over the LSTM (Paul et al., 2019) (89.2%) trained on the full labeled data (GRes) itself in a supervised way. This might partly be due to the effect of learning a more domain-aware MASK token, which in return may lead to more informed and useful teacher representations.
The effect of MLM in fine-tuning. We also conduct experiments using MLM as an unsupervised fine-tuning objective on the target domain dialogues. As shown in Table 2, it helps improve the cross-domain generalization performance. Specifically, our ultimate model (last row) achieves 94.1% and 94.4% F1 scores on the target domain for the scratch-BERT and pre-BERT models, respectively.

Consistent gains on precision and recall. In Table 3, we demonstrate that our proposed approach leads to consistent gains on both precision and recall. While the improvement is consistent, we observe that MASKAUGMENT significantly helps close the recall gap between scratch-BERT and pre-BERT (i.e., from 2.5% to 0.3% on the dev set and from 1.3% to 0.6% on the test set).

Low-resource setting for source domain. As shown in Table 4, we observe that the benefit of MASKAUGMENT through the DAL and MTL objectives becomes larger as the number of labeled dialogues in the source domain gets smaller. The effect of domain-adaptive pre-training also becomes stronger, providing a 12% improvement over scratch-BERT when only 10 labeled dialogues are available in the source domain, while achieving an 85.1% F1 score on the target domain with 50 labeled dialogues when combined with MASKAUGMENT.

Adaptation performance across DAs. In Table 5, we present additional analysis on the adaptation performance across the set of all dialog acts in the schema. MASKAUGMENT provides significant improvement across most of the DAs, including frequent ones such as request and sys-offer, while not hurting the performance much (if not improving it) on other frequent acts such as affirm and inform. For the scratch-BERT setting, the baseline (STL) objective obtains superior performance on less frequent DAs including sys-negate, sys-notify-failure, and thank-you, for which the performance drop is mostly bridged in the pre-BERT setting.
On the other hand, pre-BERT provides consistent adaptation improvement over scratch-BERT across all dialog acts except sys-negate and sys-notify-failure.

Qualitative analysis of the approach. In Figures 3a and 3b, we provide examples of improved predictions for the sys-offer and request acts, respectively. These are among the most frequent DAs, for which MASKAUGMENT provides a significant (5-20%) improvement over the baseline approach in both scratch-BERT and pre-BERT settings. In Figure 3c, we include an example where scratch-BERT with MASKAUGMENT fails to predict the sys-notify-failure act correctly, as opposed to the baseline. However, most of such failure cases vanish in the pre-BERT setting, where the gap in F1 score drops from 11.4% in scratch-BERT to only 0.5% in pre-BERT, as shown in Table 5.

Conclusion
We study the cross-domain generalization of pre-trained language models for DA tagging. While the fine-tuned BERT model performs well on in-domain DA tagging, its cross-domain generalization is still not satisfactory. To combat this shortcoming, we investigate domain adaptation through the proposed unsupervised teacher-student training that leverages the MASKAUGMENT method for data augmentation. Our empirical results show that the proposed training scheme leads to significant improvements on domain adaptation for dialog act taggers. In the future, we plan to explore MASKAUGMENT for other NLP tasks.