Effective Unsupervised Domain Adaptation with Adversarially Trained Language Models

Recent work has shown the importance of adapting broad-coverage contextualised embedding models to the domain of the target task of interest. Current self-supervised adaptation methods are simplistic, as the training signal comes from a small percentage of \emph{randomly} masked-out tokens. In this paper, we show that careful masking strategies can bridge the knowledge gap of masked language models (MLMs) about the domains more effectively by allocating self-supervision where it is needed. Furthermore, we propose an effective training strategy by adversarially masking out those tokens which are harder to reconstruct by the underlying MLM. The adversarial objective leads to a challenging combinatorial optimisation problem over \emph{subsets} of tokens, which we tackle efficiently through relaxation to a variational lower bound and dynamic programming. On six unsupervised domain adaptation tasks involving named entity recognition, our method strongly outperforms the random masking strategy and achieves improvements of up to +1.64 F1.


Introduction
Contextualised word embedding models are becoming the foundation of state-of-the-art NLP systems (Peters et al., 2018; Yang et al., 2019; Raffel et al., 2019; Brown et al., 2020; Clark et al., 2020). These models are pretrained on large amounts of raw text using self-supervision to reduce the labeled data requirement of target tasks of interest by providing useful feature representations (Wang et al., 2019a). Recent work has shown the importance of further training pretrained masked language models (MLMs) on the target domain text, as the benefits of their contextualised representations can deteriorate substantially in the presence of domain mismatch (Ma et al., 2019; Wang et al., 2019c; Gururangan et al., 2020). This is particularly crucial in unsupervised domain adaptation (UDA), where there is no labeled data in the target domain (Han and Eisenstein, 2019) and the knowledge from source domain labeled data is transferred to the target domain via a common representation space. However, current self-supervised adaptation methods are simplistic, as the training signal comes from a small percentage of randomly masked-out tokens. It thus remains an open question whether more effective self-supervision strategies exist to bridge the knowledge gap of MLMs about the domains and yield higher-quality adapted models.
A key principle of UDA is to learn a common embedding space for both domains, which enables transferring a model learned on the source task to the target task. This is typically done by further pretraining the MLM on a combination of source and target data. Selecting relevant training examples has been shown to be effective in preventing negative transfer and boosting the performance of adapted models (Moore and Lewis, 2010; Ruder and Plank, 2017). We therefore hypothesise that the computational effort of further pretraining should concentrate on learning words which are specific to the target domain or undergo semantic/syntactic shifts between the domains.
In this paper, we show that the adapted model can benefit from a careful masking strategy, and propose an adversarial objective to select those masking subsets for which the current underlying MLM is least confident. This objective raises a challenging combinatorial optimisation problem, which we tackle by optimising its variational lower bound. We propose a training algorithm which alternates between tightening the variational lower bound and learning the parameters of the underlying MLM. This involves an efficient dynamic programming (DP) algorithm to sample from the distribution over the space of masking subsets, and an effective method based on the Gumbel softmax to differentiate through the subset sampling algorithm.
We evaluate our adversarial strategy against random masking and other heuristic strategies, including POS-based and uncertainty-based selection, on the UDA problem for six NER span prediction tasks. These tasks involve adapting NER systems from the news domain to the financial, Twitter, and biomedical domains. Given the same computational budget for further self-supervising the MLM, the experimental results show that our adversarial approach is more effective than the other approaches, achieving improvements of up to +1.64 points in F1 score and +2.23 in token accuracy compared to the random masking strategy.
Unsupervised DA with Masked LMs

UDA-MLM. This paper focuses on the UDA problem, where we leverage the labeled data of a related source task to learn a model for a target task without access to its labels. We follow the two-step UDA procedure proposed in AdaptaBERT, consisting of a domain-tuning step to learn a common embedding space for both domains and a task-tuning step to learn to predict task labels on source labeled data (Han and Eisenstein, 2019). The model learned on the source task can then be zero-shot transferred to the target task thanks to the assumption that these tasks share the same label distribution.
This domain-then-task-tuning procedure resembles the pretrain-then-finetune paradigm of MLMs, where domain tuning shares the same training objective with pretraining. In the domain-tuning step, an off-the-shelf MLM is further pretrained on an equal mixture of randomly masked-out source and target domain data.
Self-Supervision. The training principle of MLM is based on self-supervised learning where the labels are automatically generated from unlabeled data. The labels are generated by covering some parts of the input, then asking the model to predict them given the rest of the input.
More specifically, a subset of tokens is sampled from the original sequence x and replaced with [MASK] or other random tokens.¹ Without loss of generality, we assume that all sampled tokens are replaced with [MASK]. Let us denote the set of masked-out indices by S, the ground-truth tokens by x_S = {x_i | i ∈ S}, and the resulting puzzle by x̄_S, which is generated by masking out the sentence tokens with indices in S. The training objective is to minimise the negative log likelihood of the ground truth,

min_θ Σ_{x ∈ D} −log Pr(x_S | x̄_S; B_θ),   (1)

where B_θ is the MLM parameterised by θ, and D is the training corpus.
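As a concrete illustration, the puzzle construction can be sketched as follows. This is a minimal sketch with random subset selection, not the paper's implementation; the function name and the 15% masking rate are our assumptions following the standard recipe.

```python
import random

def make_puzzle(tokens, mask_rate=0.15, mask_token="[MASK]"):
    """Sample a masked-out index set S, the ground truth x_S, and the
    corrupted puzzle sentence in which positions in S are hidden."""
    k = max(1, round(mask_rate * len(tokens)))
    S = sorted(random.sample(range(len(tokens)), k))
    x_S = {i: tokens[i] for i in S}                      # ground-truth tokens
    puzzle = [mask_token if i in x_S else t for i, t in enumerate(tokens)]
    return S, x_S, puzzle
```

The MLM is then trained to reconstruct the entries of `x_S` given `puzzle`.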

Adversarially Trained Masked LMs
Given a finite computational budget, we argue that it should be spent wisely on new tokens or those undergoing semantic/syntactic shifts between the two domains. Our observation is that such tokens pose more challenging puzzles to the MLM, i.e. the model is less confident when predicting them. Therefore, we propose to strategically select subsets for which the current underlying MLM B_θ is least confident about its predictions:

min_θ Σ_{x ∈ D} max_{S ∈ S_K} −log Pr(x_S | x̄_S; B_θ).   (2)

Henceforth, we assume that the size K of the masked set for a given sentence x is fixed. For example, in BERT, K is taken to be 15% × |x|, where |x| denotes the length of the sentence. We denote by S_K the set of all possible index subsets of size K for a sentence.

Our Variational Formulation
The masking strategy learning problem in eqn (2) is a minimax game of two players: the puzzle generator, which selects the subset resulting in the most challenging puzzle, and the MLM B_θ, which best solves the puzzle by reconstructing the masked tokens correctly. As optimising over the subsets is a hard combinatorial problem over the discrete space S_K, we convert it to a continuous optimisation problem. We establish a variational lower bound of the objective function over S using the inequality

max_{S ∈ S_K} −log Pr(x_S | x̄_S; B_θ)   (3)
  ≥ Σ_{S ∈ S_K} q(S|x; π_φ) [−log Pr(x_S | x̄_S; B_θ)],   (4)

where q(.) is the variational distribution provided by a neural network π_φ. This variational distribution q(S|x; π_φ) estimates the distribution over all subsets of size K. It is straightforward to see that this weighted sum of negative log likelihoods over all possible subsets is never greater than their maximum. Our minimax training objective is thus

min_θ max_φ Σ_{x ∈ D} Σ_{S ∈ S_K} q(S|x; π_φ) [−log Pr(x_S | x̄_S; B_θ)],   (5)

where we parameterise the variational distribution as

q(S|x; π_φ) = (1/Z) Π_{i ∈ S} π_φ(i|x),   (6)

and Z is the partition function making sure the probability distribution sums to one,

Z = Σ_{S' ∈ S_K} Π_{i ∈ S'} π_φ(i|x).   (7)

The number of possible subsets is |S_K| = (|x| choose K), which grows exponentially with respect to K. In §4, we provide an efficient dynamic programming algorithm for computing the partition function and sampling from this exponentially large combinatorial space. In the following, we present our model architecture and training algorithm for the puzzle generator parameters φ and MLM parameters θ based on the variational training objective in eqn (5).

Model Architecture
We learn the masking strategy through the puzzle generator network shown in Figure 1. It is a feed-forward neural network assigning a selection probability π_φ(i|x) to each index i of the original sentence x, where φ denotes its parameters. The inputs to the puzzle generator are the feature representations of the tokens; more specifically, the outputs of the last hidden layer of the MLM. The probability of masking position i is computed by applying the sigmoid function to the feed-forward net output, π_φ(i|x) = σ(FFNN(h_i)). From these probabilities, we can sample the masked positions in order to further train the underlying MLM B_θ.
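The per-position selection probabilities can be sketched as below. This is our illustrative sketch (function names and the ReLU hidden layer are assumptions); the actual generator is trained jointly with the MLM.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def selection_probs(H, W1, b1, W2, b2):
    """Per-position selection probabilities pi_phi(i|x) from the MLM's last
    hidden states H (shape [seq_len, hidden]): a small FFNN plus a sigmoid."""
    h = np.maximum(0.0, H @ W1 + b1)        # hidden layer with ReLU
    return sigmoid(h @ W2 + b2).ravel()     # one probability per token
```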

Optimising the Variational Bound
We use an alternating optimisation algorithm to train the MLM B θ and the puzzle generator π φ (Algorithm 1). The update frequency for π φ is determined via a mixing hyperparameter β.
Training the MLM. Fixing the puzzle generator, we can train the underlying MLM using gradient descent on the MLM objective in eqn (1), where we approximate the expectation under q by sampling:

min_θ Σ_{x ∈ D} (1/M) Σ_{m=1}^{M} −log Pr(x_{S_m} | x̄_{S_m}; B_θ),   (8)

where S_m ∼ q(S|x; π_φ). In §4.2, we present an efficient sampling algorithm based on a sequential decision-making process involving discrete choices, i.e. whether or not to include an index i in the masked set.
Algorithm 1 Adversarial Training Procedure
Input: data D, update freq. β, masking size K
Output: generator π_φ, MLM B_θ
1: Let φ ← φ_0; θ ← θ_0
2: while stopping condition is not met do
3:   for each minibatch in D do
4:     Update the MLM using eqn (8)
5:     if coinToss(β) == Head then
6:       Update the generator using eqn (10)
7:     end if
8:   end for
9: end while
10: return θ, φ

Training the Puzzle Generator. Fixing the MLM, we can train the puzzle generator by treating −log Pr(x_S | x̄_S; B_θ) as the reward, and aim to maximise the expected reward

R(x; φ) = E_{q(S|x; π_φ)}[−log Pr(x_S | x̄_S; B_θ)].   (9)

We sample multiple index sets {S_1, .., S_M} from q(S|x; π_φ), and then optimise the parameters of the puzzle generator by maximising the Monte Carlo estimate of the expected reward,

max_φ Σ_{x ∈ D} (1/M) Σ_{m=1}^{M} −log Pr(x_{S_m} | x̄_{S_m}; B_θ),  with S_m ∼ q(S|x; π_φ).   (10)

However, as sampling each index set S_m corresponds to a sequential decision-making process involving discrete choices, we cannot directly backpropagate through the sampling process to learn the parameters of the puzzle generator network. We therefore rely on the Gumbel-Softmax trick (Jang et al., 2017) to backpropagate through the parameters of π_φ, which we cover in §4.3.

A DP for the Partition Function
In order to sample from the variational distribution in eqn (6), we need to compute its partition function in eqn (7). Interestingly, the partition function can be computed using dynamic programming (DP). Let us denote by Z(j, k) the partition function over all subsets of size k from the index set {j, .., |x|}; the partition function of the q distribution is then Z = Z(1, K). The DP relationship can be written as

Z(j, k) = π_φ(j|x) Z(j+1, k−1) + Z(j+1, k).   (11)

The initial conditions are Z(j, 0) = 1 and Z(j, k) = Π_{i=j}^{|x|} π_φ(i|x) for k = |x| − j + 1, corresponding to the two special terminal cases of the selection process: either we have already picked all K indices, or we need to select all remaining indices to fulfil K. This amounts to a DP algorithm with time complexity O(K|x|).

Algorithm 2 Sampling Procedure
Function: subsetSampling
Input: datapoint x, probs. π_φ, masking size K
Output: subset S, sample log probability l
1: Let S ← ∅; l ← 0; j ← 0
2: Calculate DP table Z using eqn (11)
3: while |S| < K do
4:   j ← j + 1
5:   Include index j in S (or not) with the probabilities given by the DP table, adding the log probability of the decision to l
6: end while
7: return S, l
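A minimal sketch of this DP (ours, not the authors' implementation; 0-based indices, with `pi[i]` playing the role of π_φ(i|x)):

```python
def partition_table(pi, K):
    """Z[j][k]: total unnormalised weight of all size-k index subsets drawn
    from positions {j, ..., n-1}, with pi[i] the selection weight of index i.
    The full partition function of q is Z[0][K]."""
    n = len(pi)
    Z = [[0.0] * (K + 1) for _ in range(n + 1)]
    for j in range(n + 1):
        Z[j][0] = 1.0                        # terminal case: nothing left to pick
    for j in range(n - 1, -1, -1):           # fill the table right to left
        for k in range(1, K + 1):
            # either include index j (weight pi[j]) or skip it
            Z[j][k] = pi[j] * Z[j + 1][k - 1] + Z[j + 1][k]
    return Z
```

With uniform weights, Z[0][K] recovers the binomial count of subsets, which is a quick sanity check of the recurrence.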

Subset Sampling for MLMs
The DP in the previous section also gives rise to a sampling procedure. Given a partial random subset S_{j−1} with elements chosen from the indices {1, .., j−1}, and letting k denote the number of indices still to be chosen, the probability of including the next index j is

q_j(yes | S_{j−1}, π_φ) = π_φ(j|x) Z(j+1, k−1) / Z(j, k),   (12)

where the Z(j, k) values come from the DP table.
Hence, the probability of not including index j is

q_j(no | S_{j−1}, π_φ) = Z(j+1, k) / Z(j, k).   (13)

In case the next index is chosen to be in the sample, the partial subset is updated as S_j = S_{j−1} ∪ {j}; otherwise S_j = S_{j−1}. The sampling process entails a sequence of binary decisions (Figure 1.b) in an underlying Markov Decision Process (MDP). It is an iterative process which starts at index one. At each decision point j, the sampler's action is whether or not to include index j in the partial sample S_j, based on eqns (12)-(13). We terminate this process when the partially selected subset has K elements.
The sampling procedure is described in Algorithm 2. In our MDP, we actually sample a decision by generating Gumbel noise at each stage, and then selecting the choice (yes/no) with the maximum perturbed log probability. This enables differentiation through the sampled subset, covered in the next section.
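The (non-relaxed) sequential sampler can be sketched as follows, using the suffix table Z(j, k) from §4.1. This is our illustrative sketch: it uses ordinary random draws rather than the Gumbel perturbation, and the function name is ours.

```python
import math
import random

def subset_sampling(pi, K, Z):
    """Sequentially sample a size-K subset from q, where Z[j][k] is the
    suffix partition table; returns the subset and its log probability."""
    S, logp, need = [], 0.0, K
    for j in range(len(pi)):
        if need == 0:
            break
        # probability of including index j given 'need' slots remain;
        # this is forced to 1 when the slots equal the positions left
        p_yes = pi[j] * Z[j + 1][need - 1] / Z[j][need]
        if random.random() < p_yes:
            S.append(j)
            logp += math.log(p_yes)
            need -= 1
        else:
            logp += math.log(1.0 - p_yes)
    return S, logp
```

With uniform weights every size-K subset is equally likely, so the returned log probability is log(1/C(n, K)) regardless of the sampled path.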

Differentiating via Gumbel-Softmax
Once the sampling process is terminated, we then need to backpropagate through the parameters of π φ , when updating the parameters of the puzzle generator according to eqn (10).
More concretely, let us assume that we would like to sample a subset S. As mentioned in the previous section, we need to decide about the inclusion of the next index j given the partial sample so far S_{j−1}, based on eqns (12)-(13). Instead of sampling directly, we can equivalently choose one of the two outcomes as

d_j = argmax_{d ∈ {yes, no}} [ log q_j(d | S_{j−1}, π_φ) + o_j(d) ],

where the random noise o_j(d) is distributed according to the standard Gumbel distribution. Sampling a subset then amounts to a sequence of argmax operations. To backpropagate through the sampling process, we replace the argmax operator with softmax, as argmax is not differentiable. That is,

Pr(d_j = d) = exp((log q_j(d | S_{j−1}, π_φ) + o_j(d)) / τ) / Σ_{d' ∈ {yes, no}} exp((log q_j(d' | S_{j−1}, π_φ) + o_j(d')) / τ),

where τ is the softmax temperature. The log product of the above probabilities for the decisions along a sampling path is returned as l in Algorithm 2, which is then used for backpropagation.

Experiments
We evaluate our proposed masking strategy in UDA for named entity span prediction tasks coming from three different domains.

Unsupervised Domain Adaptation Tasks
Source and Target Domain Tasks. Our evaluation is focused on the problem of identifying named entity spans in domain-specific text without access to labeled data. The evaluation tasks come from several named entity recognition (NER) datasets, including WNUT2016 (Strauss et al., 2016), FIN (Salinas Alvarado et al., 2015), JNLPBA (Collier and Kim, 2004), BC2GM (Smith et al., 2008), BioNLP09 (Kim et al., 2009), and BioNLP11EPI (Kim et al., 2011). Table 1 reports data statistics. These datasets cover three domains: social media (TWEETS), financial (FIN) and biomedical (BIOMED). We utilize the CoNLL-2003 English NER dataset in the news domain (NEWS) as the source task and the others as targets. We perform domain tuning and source task tuning, followed by zero-shot transfer to the target tasks, as described in §2. Crucially, we do not use the labels of the training sets of the target tasks, and only use their sentences for domain adaptation. Since the number of entity types differs across tasks, we convert all labels to entity spans in the IBO scheme. This ensures that all tasks share the same set of labels consisting of three tags: I, B, and O.

Extra Target Domain Unlabeled Corpora. As the domain-tuning step can further benefit from additional unlabeled data, we create target domain unlabeled datasets from available corpora of relevant domains. More specifically, we use the publicly available corpora Sentiment140 (Go et al., 2009), SEC Filing 2019² (DeSola et al., 2019) and PubMed (Lee et al., 2020) for the TWEETS, FIN and BIOMED domains respectively (Table 1). From the unlabeled corpora, the top 500K and 1M sentences most similar to the training set of each target task are extracted based on the average n-gram similarity with 1 ≤ n ≤ 4, resulting in extra target domain unlabeled corpora.
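The sentence-selection step described above can be sketched as follows. This is a simplified sketch under our own assumptions (whitespace tokenisation, function names ours); the actual extraction pipeline may differ in preprocessing details.

```python
def ngram_similarity(sentence, ref_ngrams, n_max=4):
    """Average n-gram overlap (1 <= n <= n_max) between a candidate sentence
    and the n-gram set of the target task's training data."""
    toks = sentence.split()
    scores = []
    for n in range(1, n_max + 1):
        grams = [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
        if grams:
            scores.append(sum(g in ref_ngrams for g in grams) / len(grams))
    return sum(scores) / len(scores) if scores else 0.0

def top_similar(corpus, ref_ngrams, top_k):
    """Rank unlabeled sentences by similarity and keep the top_k."""
    return sorted(corpus, key=lambda s: ngram_similarity(s, ref_ngrams),
                  reverse=True)[:top_k]
```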

Masking Strategies for MLM Training
We compare our adversarially learned masking strategy against random masking and various heuristic masking strategies which we propose:

• Random. Masked tokens are sampled uniformly at random, which is the common strategy in the literature.
• POS-based strategy. Masked tokens are sampled according to a non-uniform distribution, where a token's probability depends on its POS tag. The POS tags are obtained using spaCy.³ Content tokens with verb (VERB), noun (N), adjective (ADJ), pronoun (PRON) and adverb (ADV) tags are assigned higher probability (80%) than content-free tokens such as PREP, DET and PUNC (20%).
• Uncertainty-based strategy. We select those tokens for which the current MLM is most uncertain for the reconstruction, where the uncertainty is measured by the entropy. That is, we aim to select those tokens with high Entropy[P r i (.|x x xS i ; B θ )], where x x xS i is the sentence x x x with the ith token masked out, and P r i (.|x x xS i ; B θ ) is the predictive distribution for the ith position in the sentence.
Calculating the predictive distribution for each masked position requires one pass through the network per position. Hence, using the exact entropy is expensive, as it requires |x| passes per sentence. We mitigate this cost by using Pr_i(.|x; B_θ) instead, which conditions on the original unmasked sentence. This estimate costs only one pass through the MLM.
• Adversarial learned strategy. The masking strategy is learned adversarially as in §3. The puzzle-generator update frequency β (Algorithm 1) is set to 0.3 for all experiments.
These strategies only differ in how we choose the candidate tokens. The number of to-be-masked tokens is the same in all strategies (15%). Among them, 80% are replaced with [MASK], 10% are replaced with random words, and the rest are kept unchanged, following standard BERT pretraining. In our experiments, the masked sentences are generated dynamically on-the-fly.
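The uncertainty-based selection described above can be sketched as follows, with toy predictive distributions standing in for the MLM's output (function names are ours; a real implementation would read these distributions from the model's softmax layer).

```python
import math

def entropy(dist):
    """Shannon entropy of a predictive distribution over the vocabulary."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)

def uncertain_positions(pred_dists, K):
    """Pick the K positions whose (approximate) predictive distributions,
    obtained from one pass over the unmasked sentence, have highest entropy."""
    order = sorted(range(len(pred_dists)),
                   key=lambda i: entropy(pred_dists[i]), reverse=True)
    return sorted(order[:K])
```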
To evaluate the models, we compute precision, recall and F1 scores on a per token basis. We report average performance of five runs.

Implementation Details
Our implementation is based on the TensorFlow library (Abadi et al., 2016).⁴ We use the BERT-Base model architecture, which consists of 12 Transformer layers with 12 attention heads and hidden size 768, in all our experiments. We use the cased WordPiece vocabulary provided with the pretrained English model. We set the learning rate to 5e-5 for both further pretraining and task tuning. The puzzle generator is a two-layer feed-forward network with hidden size 256 and dropout rate 0.1.

Empirical Results
Under the same computational budget to update the MLM, we evaluate the effect of the masking strategy in the domain-tuning step under various sizes of additional target-domain data: none, 500K and 1M sentences. We continue pretraining BERT on a combination of unlabeled source data (CoNLL2003), unlabeled target-task training data, and additional unlabeled target-domain data (if any). If the target task data is smaller, we oversample it to match the size of the source data. The model is trained with batch size 32 and max sequence length 128 for 50K steps in the 1M target-domain setting and 25K steps in the other cases, which corresponds to 3-5 epochs over the training set. After domain tuning, we finetune the adapted MLM on the source-task labeled training data (CoNLL2003) for three epochs with batch size 32. Finally, we evaluate the resulting model on the target task. On the largest dataset, the random and POS strategies took around 4 hours on one NVIDIA V100 GPU, while the entropy and adversarial approaches took 5 and 7 hours respectively. The task tuning took about 30 minutes.

Table 2: F1 score of named entity span prediction tasks in three UDA scenarios which differ in the amount of additional target-domain data. rand, pos, ent and adv denote the random, POS-based, uncertainty-based, and adversarial masking strategies respectively. The ∆ row reports the average improvement over random masking across all tasks. Bold shows the highest score per task in each UDA setting. † indicates a statistically significant difference to the random baseline with p-value ≤ 0.05 using a bootstrap test.

Results are shown in Table 2. Overall, strategic masking consistently outperforms random masking in most adaptation scenarios and target tasks. As expected, expanding the training data with additional target-domain data further improves the performance of all models. Compared to random masking, prioritising content tokens over content-free ones improves F1 by up to 0.7 points on average.
By taking the current MLM into account, uncertainty-based selection and the adversarially learned strategy boost the score by up to 1.64 points. Our proposed adversarial approach yields the highest score in 11 out of 18 cases, and results in the largest improvement over random masking across all tasks in UDA both with and without additional target-domain data. We further explore mixing random masking with the other masking strategies. We hypothesise that combined strategies can balance the learning of challenging and effortless tokens when forming the common semantic space, and hence improve task performance. In a minibatch, 50% of sentences are masked according to the corresponding strategy while the rest are masked randomly. Results are shown in Table 3. We observe an additional performance gain over the corresponding single-strategy model across all tasks.

Analysis
Domain Similarity. We quantify the similarity between source (CoNLL2003) and target domains by vocabulary overlap between the domains (excluding stopwords). Figure 2 shows the vocabulary overlap across tasks. As seen, all the target domains are dissimilar to the source domain, with FIN having the lowest overlap. FIN has gained the largest improvement from the adversarial strategy in the UDA results in Tables 2 and 3. As expected, the biomedical datasets have relatively higher vocabulary overlap with each other.
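The overlap measure can be sketched as below. The exact normalisation is not specified in the text, so this sketch uses a Jaccard-style definition as an assumption (function name and stopword handling are also ours).

```python
def vocab_overlap(vocab_a, vocab_b, stopwords=frozenset()):
    """Percentage of shared vocabulary between two domains, with stopwords
    excluded; Jaccard-style normalisation by the union of the vocabularies."""
    a = set(vocab_a) - stopwords
    b = set(vocab_b) - stopwords
    return 100.0 * len(a & b) / len(a | b)
```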
Density Ratio of Masked Subsets. We analyze the density ratio r(w) = Pr_t(w) / Pr_s(w) of masked-out tokens, where Pr_s(w) and Pr_t(w) are the probabilities of token w in the source and target domains, respectively. These probabilities are given by unigram language models trained on the training sets of the source and target tasks. A higher value of r(w) means the token w is new or appears more often in the target text than in the source. Figure 3 plots the density ratio of masked-out tokens during domain tuning for four UDA tasks. Compared to the other strategies, we observe that the adversarial approach tends to select tokens with higher density ratio, i.e. tokens more significant in the target domain.
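The ratio r(w) can be computed as below. The paper does not specify a smoothing scheme, so this sketch assumes add-alpha smoothing over the joint vocabulary to keep the ratio finite for unseen words.

```python
from collections import Counter

def density_ratio(word, src_tokens, tgt_tokens, alpha=1.0):
    """r(w) = Pr_t(w) / Pr_s(w) under add-alpha smoothed unigram LMs;
    r(w) > 1 means w is more characteristic of the target domain."""
    src, tgt = Counter(src_tokens), Counter(tgt_tokens)
    vocab = len(set(src_tokens) | set(tgt_tokens))
    p_s = (src[word] + alpha) / (len(src_tokens) + alpha * vocab)
    p_t = (tgt[word] + alpha) / (len(tgt_tokens) + alpha * vocab)
    return p_t / p_s
```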
Syntactic Diversity in Masked Subset. Table 4 describes the percentage of POS tags in masked subset selected by different masking strategies. We observed that our method selects more tokens from the major POS tags (71%) compared to random (45%) and entropy-based (55%) strategies. It has chosen less nouns compared to the POS strategy, and more pronouns compared to all other strategies.
Tagging Accuracy of OOV and non-OOV. We compare the tagging accuracy of out-of-vocabulary (OOV) words, which appear in the target domain but not in the source, and non-OOV tokens in Table 5. As seen, our adversarial masking strategy achieves higher accuracy on both OOV and non-OOV tokens in most cases.

Related Work
Unsupervised Domain Adaptation. The main approaches in neural UDA include discrepancy-based and adversarial-based methods. Discrepancy-based methods use the maximum mean discrepancy or Wasserstein distance as a regulariser to enforce the learning of domain non-discriminative representations (Shen et al., 2018). Inspired by the Generative Adversarial Network (GAN) (Goodfellow et al., 2014), adversarial-based methods learn a representation that is discriminative for the target task and indiscriminative to the shift between the domains (Ganin and Lempitsky, 2015).
Domain Adaptation with MLM. The performance of fine-tuned MLMs can deteriorate substantially in the presence of domain mismatch. The most straightforward domain adaptation approach for MLMs is to adapt general contextual embeddings to a specific domain (Lee et al., 2020; Alsentzer et al., 2019; Chakrabarty et al., 2019), that is, to further improve a pretrained MLM by continuing to pretrain it on a related domain or similar tasks (Gururangan et al., 2020), or via an intermediate task, also referred to as STILTs (Phang et al., 2018). Recent works have proposed a two-step adaptive domain adaptation framework consisting of domain tuning and task finetuning (Ma et al., 2019; Wang et al., 2019c; Logeswaran et al., 2019). They have demonstrated that domain tuning is necessary to adapt the MLM with both domain knowledge and task knowledge before finetuning, especially when the labelled data in the target task is extremely small. Our experimental setting is similar to that of Han and Eisenstein (2019). However, we focus on learning a masking strategy to boost the domain-tuning step.

Table 5: Tagging accuracy of in-vocabulary (non-OOV) and out-of-vocabulary (OOV) words in UDA + 500K in-domain data.
Adversarial Learning. Recent research in adversarial machine learning has either focused on attacking models with adversarial examples (Alzantot et al., 2018; Ebrahimi et al., 2018), or on training models to be robust against these attacks. Wang et al. (2019b) and Liu et al. (2020) propose the use of adversarial learning for language models. They consider autoregressive LMs and train them to be robust against adversarial perturbations of the word embeddings of the target vocabulary.

Conclusion
We present an adversarial objective for further pretraining MLMs in the UDA problem. The intuition behind the objective is that the adaptation effort should focus on the subset of tokens which are challenging to the MLM. We establish a variational lower bound of the objective function and propose an effective sampling algorithm using dynamic programming and the Gumbel-Softmax trick. Compared to other masking strategies, our adversarial masking approach achieves substantially better performance on the UDA problem of named entity span prediction across several domains.