Semi-supervised Domain Adaptation for Dependency Parsing via Improved Contextualized Word Representations

In recent years, parsing performance on in-domain texts has improved dramatically thanks to the rapid progress of deep neural network models. The major challenge for current parsing research is to improve parsing performance on out-of-domain texts that are very different from the in-domain training data, when only small-scale out-of-domain labeled data is available. To deal with this problem, we propose to improve contextualized word representations via adversarial learning and fine-tuning BERT. Concretely, we apply adversarial learning to three representative semi-supervised domain adaptation methods, i.e., direct concatenation (CON), feature augmentation (FA), and domain embedding (DE), with two useful strategies, i.e., fused target-domain word representations and orthogonality constraints, thus enabling the models to learn purer yet more effective domain-specific and domain-invariant representations. Simultaneously, we utilize large-scale target-domain unlabeled data to fine-tune BERT with only the language model loss, thus obtaining reliable contextualized word representations that benefit cross-domain dependency parsing. Experiments on a benchmark dataset show that our proposed adversarial approaches achieve consistent improvements, and fine-tuning BERT further boosts parsing accuracy by a large margin. Our single model achieves the same state-of-the-art performance as the top submitted system in the NLPCC-2019 shared task, which uses ensemble models and BERT.


Introduction
Dependency parsing aims to capture syntax with a dependency tree and has proven helpful for various natural language processing (NLP) tasks, such as semantic role labeling (Xia et al., 2019), natural language generation (Park and Kang, 2019), and machine translation (Hadiwinoto and Ng, 2017). Given an input sentence $s = w_1 w_2 \dots w_n$, a dependency tree, as depicted in Figure 1, is defined as $d = \{(h, m, l) \mid 0 \le h \le n,\ 1 \le m \le n,\ l \in \mathcal{L}\}$, where $(h, m, l)$ is a dependency from the head word $w_h$ to the child word $w_m$ with the relation label $l \in \mathcal{L}$, and $w_0$ is a pseudo word that points to the root word of the sentence.
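As a concrete illustration of this definition (the toy sentence and all names below are ours, not from the paper), a dependency tree can be stored as a set of (h, m, l) triples with index 0 reserved for the pseudo root:

```python
# A dependency tree for an n-word sentence, stored as (h, m, l) triples:
# h is the head index (0 = pseudo root), m is the modifier index (1..n),
# and l is the relation label.

def heads_of(tree, n):
    """Return head[m] for m = 1..n; in a well-formed tree every word has exactly one head."""
    head = [None] * (n + 1)
    for h, m, l in tree:
        head[m] = h
    return head[1:]

# Toy example: "She reads books" with word indices 1..3; the root arc points to "reads".
tree = [(0, 2, "root"), (2, 1, "nsubj"), (2, 3, "obj")]
assert heads_of(tree, 3) == [2, 0, 2]
```

A tree over n words always contains exactly n such triples, one per word, which is what the arc-factored scoring in the base model exploits.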
In recent years, neural network based approaches have achieved remarkable improvements and outperformed the traditional discrete-feature based approaches by a large margin in dependency parsing (Chen and Manning, 2014; Kiperwasser and Goldberg, 2016; Andor et al., 2016; Dozat and Manning, 2017). Most remarkably, Dozat and Manning (2017) propose a simple yet effective deep BiAffine parser and achieve state-of-the-art accuracy on a variety of datasets and languages.
However, the domain adaptation problem, i.e., how to improve parsing performance on texts that are very different from the training data, remains a key challenge for the parsing community, especially when trying to apply parsing techniques to real-life web data. Taking the examples in Figure 1, we can see that, as user-generated text, the left sentence from the product comment (PC) domain is quite non-canonical and contains many ellipsis phenomena. In contrast, the right one from the balanced corpus (BC) domain is a typical sentence from newswire texts and is much more formal.

Figure 1: Examples of dependency trees. The left sentence is from the target-domain PC data and the right one is from the source-domain BC data.

Hence, domain
differences can be characterized by changes in both sentence and parse-tree distributions, due to new words and phrases, new expression structures, etc. The key to domain adaptation is how to model the differences and commonalities between domains. Most previous work focuses on unsupervised cross-domain parsing, assuming there is no target-domain labeled data. Typical methods include self-training (McClosky and Charniak, 2008; Yu et al., 2015) and co-training (Sarkar, 2001). However, due to the intrinsic difficulty of domain adaptation, progress in this direction has been slow. In the past few years, semi-supervised cross-domain parsing has attracted more attention due to the emergence of more labeled data. In particular, Li et al. (2019b) release large-scale labeled and unlabeled datasets, and find that their proposed domain embedding (DE) approach is more effective than the direct concatenation (CON) method. The feature augmentation (FA) method, another typical technique for semi-supervised domain adaptation, was first proposed by Daumé III (2007). Kim et al. (2016) successfully apply it to a neural model that leverages multiple BiLSTMs to extract shared and private domain features. To learn the differences and commonalities between source and target domains, the DE method uses explicit domain indicators as extra inputs, whereas the FA method employs one shared and two private BiLSTM encoders for feature separation.
This work proposes to improve contextualized word representations via adversarial learning and fine-tuning BERT, thus modeling purer yet more effective domain-specific and domain-invariant representations. To prevent the domain-invariant representations from being contaminated by domain-specific ones, we apply adversarial learning to enhance three typical semi-supervised approaches, i.e., CON, FA, and DE, with two useful strategies, i.e., fused target-domain word representations and orthogonality constraints. At the same time, we utilize large-scale target-domain unlabeled data to fine-tune BERT and obtain more reliable contextualized word representations, leading to a large improvement over using off-the-shelf BERT representations. Our final single model achieves nearly the same state-of-the-art performance as the ensemble models with BERT of Li et al. (2019c), which won first place in the cross-domain parsing shared task recently organized at the International Conference on Natural Language Processing and Chinese Computing (NLPCC-2019). Although we focus on semi-supervised domain adaptation for dependency parsing, the techniques and findings may be applicable to domain adaptation for other NLP tasks. All code is released publicly for research purposes 1 .

Base Model
In this work, we select the state-of-the-art BiAffine parser as our strong baseline model. As shown in the left part of Figure 2, the parser mainly contains four components: the input layer, the BiLSTM encoder, the MLP layer, and the BiAffine layer.
Input layer. Given an input sentence $s = w_0 w_1 \dots w_n$, the input layer directly maps it into vector representations $x_0 x_1 \dots x_n$. Each vector $x_i$ is the concatenation of word and POS-tag embeddings: $x_i = \mathrm{emb}^{word}_{w_i} \oplus \mathrm{emb}^{pos}_{t_i}$, where $\mathrm{emb}^{word}_{w_i}$ is the sum of a fixed word2vec representation and a fine-tuned word embedding.

Figure 2: The left part is the framework of the BiAffine parser, and the right is the framework of the adversarial CON model.
The POS-tag embedding $\mathrm{emb}^{pos}_{t_i}$ is fine-tuned. Additionally, we also enhance model performance by replacing the word embedding $\mathrm{emb}^{word}_{w_i}$ with its BERT representation $\mathrm{rep}^{BERT}_{w_i}$.

BiLSTM encoder. The BiLSTM encoder takes $x_0 x_1 \dots x_n$ as inputs and produces context-aware word representations $h_0 h_1 \dots h_n$. A three-layer BiLSTM sequentially encodes the input words in the forward and backward directions, and the forward and backward hidden states at each step are concatenated as the final hidden state $h_i$. We omit the detailed computation of the BiLSTM encoder and write it as:

$h_0 h_1 \dots h_n = \mathrm{BiLSTM}(x_0 x_1 \dots x_n; \theta_{BiLSTM})$

where $\theta_{BiLSTM}$ represents all the parameters of the BiLSTM encoder.

MLP (multi-layer perceptron) layer. The MLP layer takes $h_i$ as input and uses two separate MLPs to obtain two lower-dimensional representation vectors:

$r^H_i = \mathrm{MLP}^H(h_i), \qquad r^D_i = \mathrm{MLP}^D(h_i)$
where $r^H_i$ is the representation vector of $w_i$ as a head word, $r^D_i$ as a dependent, and $\mathrm{MLP}^{H/D}$ each have a single hidden layer with the ReLU activation function.
BiAffine layer. The scores of all dependencies are computed via a biaffine operation:

$\mathrm{score}(i \leftarrow j) = {r^D_i}^{\mathsf{T}} \mathbf{W}^b r^H_j$

where $\mathrm{score}(i \leftarrow j)$ is the score of the dependency $(j, i)$ and the matrix $\mathbf{W}^b$ is a BiAffine parameter. The score of a dependency tree factorizes over its arcs and is computed with extra MLPs, as detailed in Dozat and Manning (2017). After obtaining the scores, the highest-scoring tree can be decoded with the maximum spanning tree algorithm (McDonald et al., 2005).

Parser loss. Assuming $w_j$ is the gold-standard head of $w_i$, the BiAffine parser loss for each position $i$ is

$L_{parser}(i, j) = -\log \dfrac{e^{\mathrm{score}(i \leftarrow j)}}{\sum_{0 \le k \le n,\ k \ne i} e^{\mathrm{score}(i \leftarrow k)}}$

The BiAffine parser treats the classification of dependency labels as a separate task after finding the highest-scoring dependency tree.
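A minimal sketch of the biaffine arc scorer and the per-position loss, in plain Python with toy dimensions (the real parser uses batched tensor operations and an augmented bias term, which we omit here):

```python
import math

def biaffine_score(r_d, r_h, W):
    """Biaffine arc score r_d^T W r_h; bias terms omitted for brevity."""
    return sum(r_d[a] * W[a][b] * r_h[b]
               for a in range(len(r_d)) for b in range(len(r_h)))

def arc_loss(i, gold_j, r_D, r_H, W):
    """Cross-entropy over candidate heads k != i, as in the parser loss above."""
    scores = {k: biaffine_score(r_D[i], r_H[k], W)
              for k in range(len(r_H)) if k != i}
    z = sum(math.exp(s) for s in scores.values())
    return -math.log(math.exp(scores[gold_j]) / z)

# Toy usage: 2-dimensional representations, identity weight matrix.
W = [[1.0, 0.0], [0.0, 1.0]]
r_H = [[1.0, 0.0], [0.0, 1.0]]
r_D = [[1.0, 0.0], [0.0, 1.0]]
assert biaffine_score(r_D[0], r_H[0], W) == 1.0
```

Note that the loss normalizes only over competing heads for position i, which is why decoding still needs a tree constraint (the MST step) on top of the per-arc scores.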

Approaches
In this work, we propose to improve contextualized word representations via adversarial learning and fine-tuning BERT to boost the performance of cross-domain dependency parsing. Concretely, we apply adversarial learning to three typical semi-supervised approaches with two useful strategies, thus obtaining purer word representations. Simultaneously, we propose to fine-tune BERT on all target-domain unlabeled data to obtain more reliable word representations.

The Adversarial CON Method
The CON method is the most common technique for semi-supervised cross-domain dependency parsing: it ignores domain differences and directly trains the BiAffine parser on all source- and target-domain labeled data. To capture as much domain-invariant information as possible, i.e., information that is not specific to a particular domain, we add an adversarial network to the BiAffine parser, as shown in the right of Figure 2. Following Ganin and Lempitsky (2015), we use a Gradient Reversal Layer (GRL) for adversarial learning to prevent the domain classifier from accurately predicting the domain type of each word. First, the inputs from different domains are encoded by the same BiLSTM, and its output $h_i$ is used for both adversarial learning and dependency parsing. For adversarial learning, the GRL takes $h_i$ as its input, and its forward and backward propagations are defined as:

$\mathrm{GRL}(h_i) = h_i, \qquad \dfrac{\partial\, \mathrm{GRL}(h_i)}{\partial h_i} = -\lambda \mathbf{I}$

where $\lambda$ is a hyper-parameter. On top of the GRL, the domain classifier uses an MLP to compute domain scores and a softmax to obtain the domain distribution for each word $w_i$:

$z_i = \mathrm{softmax}\big(\mathbf{W}_2\, \mathrm{ReLU}(\mathbf{W}_1\, \mathrm{GRL}(h_i) + \mathbf{b}_1) + \mathbf{b}_2\big)$

where $\theta_d = \{\mathbf{W}_1, \mathbf{W}_2, \mathbf{b}_1, \mathbf{b}_2\}$ denotes the parameters of the domain classifier. The adversarial network is trained to minimize the cross-entropy between the predicted and true domain distributions:

$L_{adv} = -\sum_{i=1}^{n} \log z_i^{\hat{z}_i}$

where $\hat{z}_i$ is the gold domain of word $w_i$, $z_i^j$ is the predicted probability of word $w_i$ belonging to domain $j$ ($1 \le j \le m$), $n$ is the number of words in the sentence, and $m$ is the number of domains. Finally, the adversarial CON model is jointly trained with the parser and adversarial losses:

$L = L_{parser} + \alpha L_{adv}$

where $\alpha$ is a hyper-parameter balancing the parsing and adversarial learning tasks.
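The GRL's behavior can be sketched in a framework-agnostic way (this is an illustration of the mechanism, not the paper's actual implementation): the forward pass is the identity, while the backward pass multiplies the incoming gradient by -λ, so gradient descent updates the encoder to *increase* the domain classifier's loss.

```python
LAMBDA = 1e-5  # the GRL hyper-parameter lambda used in the experiments

def grl_forward(h):
    """Identity in the forward direction: the classifier sees h unchanged."""
    return h

def grl_backward(grad_output):
    """Reverse and scale the gradient flowing back into the encoder."""
    return [-LAMBDA * g for g in grad_output]

h = [0.5, -1.0, 2.0]
assert grl_forward(h) == h
assert grl_backward([1.0, 1.0, 1.0]) == [-1e-5, -1e-5, -1e-5]
```

In practice this is implemented as a custom autograd operation in the deep learning framework, so no explicit min-max alternation is needed: a single backward pass trains the classifier normally while adversarially updating the encoder.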

The Adversarial FA Method
The FA method is another popular technique for domain adaptation, which applies one shared and $m$ private BiLSTMs to learn domain-invariant and domain-specific features (Kim et al., 2016). To prevent the shared and private latent feature spaces from interfering with each other, we apply adversarial learning to the FA model with two useful strategies, i.e., fused target-domain word representations and orthogonality constraints. As shown in the left of Figure 3, we employ one shared and two private BiLSTM encoders for feature separation. First, the input $x_i$ is fed into the shared BiLSTM and its corresponding private BiLSTM, yielding the domain-invariant representation $h^{inv}_i$ and the domain-specific one $h^{spe}_i$.

Fused target-domain word representations. Since the target-domain labeled data is small, the target-domain private BiLSTM may be trained inadequately. For target-domain inputs, we therefore fuse the outputs of the source- and target-domain private BiLSTMs into $h^{spe}_i$, where $h^{src}_i$ and $h^{tgt}_i$ are the outputs of the source- and target-domain private BiLSTMs.

Orthogonality constraints. Following Bousmalis et al. (2016), we encourage the domain-specific features to be mutually exclusive with the shared features by imposing orthogonality constraints. The orthogonality loss is computed as:

$L_{orth} = \big\| {H^{spe}}^{\mathsf{T}} H^{inv} \big\|_F^2$

where $H^{spe}$ and $H^{inv}$ stack the domain-specific and domain-invariant representations row-wise, and $\|\cdot\|_F$ is the Frobenius norm. We then use the combination of $h^{inv}_i$ and $h^{spe}_i$ as the final contextualized word representation $h_i$ for dependency parsing, while $h^{inv}_i$ is used for adversarial learning to make the shared space purer. Finally, our adversarial FA model is jointly trained with the total loss $L^*_{fa}$:

$L^*_{fa} = L_{parser} + \alpha L_{adv} + \beta L_{orth}$

where $\alpha$ and $\beta$ are hyper-parameters.
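A small sketch of the squared-Frobenius-norm orthogonality penalty of Bousmalis et al. (2016), in plain Python (toy dimensions; real implementations use a single matrix multiplication):

```python
def orthogonality_loss(H_spe, H_inv):
    """||H_spe^T H_inv||_F^2 for row-stacked representation matrices.

    The loss is zero exactly when H_spe^T H_inv = 0, i.e., when the specific
    and shared feature dimensions are decorrelated over the batch of words.
    """
    d = len(H_spe[0])
    loss = 0.0
    for a in range(d):
        for b in range(d):
            # entry (a, b) of the d x d matrix M = H_spe^T @ H_inv
            m_ab = sum(row_s[a] * row_i[b] for row_s, row_i in zip(H_spe, H_inv))
            loss += m_ab ** 2
    return loss

# Decorrelated specific/shared features give zero loss; identical rows do not.
assert orthogonality_loss([[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, -1.0]]) == 0.0
assert orthogonality_loss([[1.0, 0.0]], [[1.0, 0.0]]) > 0.0
```

Because the penalty is differentiable, it is simply added to the joint objective with weight β and minimized alongside the parser and adversarial losses.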

The Adversarial DE Method
The DE method, recently proposed by Li et al. (2019b), trains the BiAffine parser by concatenating the primary input vector $x_i$ and a fine-tuned domain embedding $\mathrm{emb}^{dom}_{d_i}$ into a new input $x'_i = x_i \oplus \mathrm{emb}^{dom}_{d_i}$.
Since $\mathrm{emb}^{dom}_{d_i}$ explicitly indicates which domain the input comes from and adversarial learning helps detect domain-invariant knowledge, we propose a novel adversarial DE method for effective feature separation.
As shown in the right of Figure 3, we employ two independent BiLSTM encoders to capture domain-specific and domain-invariant features, using the domain embedding and adversarial learning. Concretely, one BiLSTM takes $x_i$ as its input, and its output $h^{inv}_i$ is fed into the GRL for adversarial learning. Simultaneously, the other BiLSTM takes the domain-embedding-augmented input (the concatenation of $x_i$ and $\mathrm{emb}^{dom}_{d_i}$) and produces the output $h^{spe}_i$. We then combine $h^{inv}_i$ and $h^{spe}_i$ into the final contextualized word representation $h_i$, which is used for dependency parsing via shared MLP and biaffine operations. In addition, the orthogonality loss is used to push apart the domain-specific and domain-invariant representations. Finally, the entire model is optimized with a joint loss defined in the same way as $L^*_{fa}$.
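The DE input construction is simply a lookup plus vector concatenation; a minimal sketch (the embedding values and names below are illustrative, not the trained parameters):

```python
# Fine-tuned domain embeddings, one per domain; values here are toy placeholders.
DOMAIN_EMB = {"BC": [0.1, 0.2], "PC": [0.3, 0.4], "PB": [0.5, 0.6], "ZX": [0.7, 0.8]}

def de_input(x_i, domain):
    """Build the DE input x'_i by concatenating x_i with the domain embedding."""
    return x_i + DOMAIN_EMB[domain]  # list concatenation = vector concatenation

assert de_input([1.0, 2.0, 3.0], "PC") == [1.0, 2.0, 3.0, 0.3, 0.4]
```

Because the domain embedding is fine-tuned with the parsing loss, it gives the domain-specific encoder an explicit, learnable signal of the input's origin.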

Fine-tuning BERT with All Target-domain Unlabeled Data
Recently proposed contextualized word representations, such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018), can further improve parsing performance by a large margin (Li et al., 2019a). Remarkably, BERT has proven effective on a variety of natural language processing tasks (Devlin et al., 2019). Recently, researchers have paid more attention to updating BERT representations with additional corpora and have made great progress on BERT applications (Gururangan et al., 2020). Motivated by these successful uses of BERT and its strong representational capability, we propose to fine-tune the BERT model parameters on all unlabeled data to obtain more reliable representations. First, we use the released Chinese BERT-Base model as the original BERT model. 2 Then, we fine-tune BERT on the unlabeled data, using the parameters of the original BERT model as the starting point. To save computational resources, we merge the train/unlabeled data of all domains into one unlabeled dataset and fine-tune BERT once; thus the same fine-tuned BERT model is used for all three target domains. Since the product comment (PC) and product blog (PB) data are user-generated independent sentences without context information, we remove the next-sentence loss and tune the BERT model parameters with only the language model loss. Following Li et al. (2019a), we train all BERT-enhanced models by replacing the pre-trained word embedding $\mathrm{emb}^{word}_{w_i}$ with the fixed BERT representation $\mathrm{rep}^{BERT}_{w_i}$. For $\mathrm{rep}^{BERT}_{w_i}$, we first compute the mean of the top-4 layer BERT outputs, and then apply a linear map to reduce the high-dimensional output to a low-dimensional vector.
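The construction of the fixed per-token representation can be sketched as follows (dimensions are toy and the function name is ours; the real model averages 768-dimensional vectors from BERT's top four transformer layers):

```python
def bert_token_rep(layer_outputs, W, b):
    """Mean of the top-4 per-layer vectors for one token, then a linear map.

    layer_outputs: list of per-layer vectors for one token, top layer last.
    W, b: weights of the dimensionality-reducing linear projection.
    """
    top4 = layer_outputs[-4:]
    d = len(top4[0])
    mean = [sum(layer[a] for layer in top4) / 4.0 for a in range(d)]
    # out = W @ mean + b
    return [sum(W[r][a] * mean[a] for a in range(d)) + b[r] for r in range(len(W))]

# Toy usage: 5 layers of 2-d vectors, projected down to 1 dimension.
layers = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
assert bert_token_rep(layers, W=[[1.0, 0.0]], b=[0.0]) == [2.5]
```

Keeping the projected representation fixed during parser training means only the small linear map and the parser itself are updated, which keeps training cheap after the one-off BERT fine-tuning step.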

Experiments
Datasets. We use the Chinese multi-domain dependency parsing datasets released at the NLPCC-2019 shared task 3 , containing four domains: one source domain, a balanced corpus (BC) from newswire, and three target domains: the product comments (PC) data from Taobao, the product blogs (PB) data from Taobao headlines, and data from a web fiction named "ZhuXian" (ZX). The detailed data statistics are shown in Table 1.
Evaluation. We use unlabeled attachment score (UAS) and labeled attachment score (LAS) to evaluate the dependency parsing accuracy. Each parser is trained for at most 1000 iterations, and the performance is evaluated on the dev data after each iteration for model selection. We stop the training if the peak performance does not increase in 100 consecutive iterations.
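The two evaluation metrics can be computed with a few lines of code (a minimal sketch; the official evaluation additionally handles punctuation, which we omit):

```python
def uas_las(gold, pred):
    """Compute UAS and LAS (in %) from aligned (head, label) pairs.

    UAS counts tokens whose predicted head is correct; LAS additionally
    requires the relation label to match.
    """
    n = len(gold)
    ua = sum(g[0] == p[0] for g, p in zip(gold, pred))
    la = sum(g == p for g, p in zip(gold, pred))
    return 100.0 * ua / n, 100.0 * la / n

# Toy example: 3 tokens, one wrong head.
gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (1, "obj")]
uas, las = uas_las(gold, pred)
assert round(uas, 2) == 66.67 and round(las, 2) == 66.67
```

Since LAS adds a constraint on top of UAS, LAS is never higher than UAS for the same predictions.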
Hyper-parameters. We follow the hyper-parameter settings of Dozat and Manning (2017), such as the learning rate and dropout ratios. The loss weights α and β are both set to 0.001. The GRL hyper-parameter λ is 10 −5 . For pre-trained word embeddings, we train word2vec embeddings (Mikolov et al., 2013) on Chinese Gigaword Third Edition, consisting of about 1.2 million sentences.

Table 2 presents the parsing accuracy on dev data when each parser is trained on single-domain training data. First, although PC-train is much smaller than BC-train, the PC-trained parser outperforms the BC-trained parser by about 30%, indicating that target-domain labeled data is useful and important for training a parser, especially when there is a large divergence between the two domains. Second, the gap between the PB-trained and BC-trained parsers is about 11%, while the scales of PB-train and PC-train are very close, demonstrating that PB-train is much more similar to BC-train. Third, the accuracy of the ZX-trained parser is about 5% higher than the BC-trained one. The reason may be that the BC-train data are from newswire, which may contain novel-like texts. Overall, the results clearly demonstrate that a model easily achieves good performance when the training and testing data are from the same domain.

Combining Two Training Datasets
We first train the three representative non-adversarial models on the combination of source- and target-domain data. Then, we conduct a detailed ablation study on the adversarial models to gain in-depth insight into the effect of different model components.
Results of non-adversarial models. As shown in the top block of Table 3, CON clearly outperforms FA on the PB and ZX domains but underperforms it on the PC domain, demonstrating that the FA approach performs well only when there is a large difference between the source and target domains. In addition, we find that the DE model achieves nearly the same accuracy as CON, indicating that both the domain-invariant features in the CON model and the domain-specific features in the DE model are equally important for cross-domain dependency parsing.
Results of adversarial models. The results of the comparison experiments on adversarial approaches are shown in the bottom block of Table 3. First, we can see that directly applying an adversarial network to the non-adversarial models even slightly reduces performance, especially for CON and FA. The reason may be that the target-domain related parameters are trained inadequately with only small-scale labeled data. Second, the fused word representations and orthogonality constraints clearly enhance the performance of the vanilla adversarial models, indicating that the two strategies are helpful for feature separation. Finally, we find that our proposed adversarial models consistently outperform the non-adversarial ones, demonstrating that purer word representations effectively improve the accuracy of cross-domain dependency parsing.

Table 5: Final results on test data. Due to the limited page length, we use "*" to denote "ensemble models", "E" to denote "model with ELMo", "B" to denote "model with BERT", and "FB" to denote "model with fine-tuning BERT".

Utilization of Unlabeled Dataset
To obtain more reliable domain-related word representations that benefit cross-domain dependency parsing, we exploit the large-scale target-domain unlabeled data to fine-tune the BERT model parameters. Detailed comparative experiments are conducted to verify the effectiveness of the fine-tuned BERT representations, and the results are shown in Table 4. First, we find that BERT, as a deep contextualized word representation, has strong representational capacity and achieves the highest performance among all models. Second, fine-tuning BERT with unlabeled data significantly improves the performance of both adversarial and non-adversarial models, demonstrating that BERT can learn domain-related knowledge and produce more reliable contextualized word representations through fine-tuning. Third, the performance gaps among the BERT-enhanced models shrink sharply, but the adversarial models still consistently improve over the non-adversarial ones, indicating that adversarial learning and fine-tuning BERT are complementary ways of improving word representations. Overall, we find that fine-tuning BERT is an effective method to leverage unlabeled data, and adversarial learning remains useful for BERT-enhanced models.

Table 5 shows the final results on test data and compares them with previous work. We report the parsing accuracy of our baseline models in the second block and of our proposed adversarial models in the last block. First, comparing the results in the two blocks, we can clearly see that all adversarial models outperform the non-adversarial ones, indicating that adversarial learning helps detect pure yet effective domain-invariant and domain-specific representations. Second, BERT improves the accuracy of both non-adversarial and adversarial models by a large margin, and fine-tuned BERT further enhances parsing performance.
The reason may be that fine-tuning BERT with large-scale target-domain unlabeled data is extremely useful for learning more reliable word representations. Finally and most importantly, although the baseline becomes much stronger with fine-tuned BERT, our proposed adversarial approach still achieves higher performance, demonstrating that adversarial learning and fine-tuning BERT are complementary and mutually beneficial for word representations. We also list the main results of systems submitted to the NLPCC-2019 shared task in the top block of Table 5. Yu et al. (2019) combine the power of self-training and ensemble models to improve performance. Peng et al. (2019) re-implement the DE method to learn explicit domain information and further improve parsing accuracy with ELMo. Li et al. (2019c) directly update the BERT representations via the parsing loss, and use tri-training to augment the target-domain training data. Our final single model achieves nearly the same performance as the top submitted system at the shared task (Li et al., 2019c) without the complex model ensemble process.

Related Work
Domain adaptation has been a long-standing yet challenging research topic. Here we briefly summarize representative approaches for both unsupervised and semi-supervised domain adaptation.

Unsupervised Domain Adaptation
Due to the lack of target-domain labeled data, previous research has mostly focused on unsupervised domain adaptation. Self-training is a simple method to incorporate unlabeled data into a new model: it first annotates the unlabeled data with the existing model, and then trains a new model on the combination of the newly generated data and the actual labeled data (Yarowsky, 1995). As a typical unsupervised approach, self-training has proven effective for cross-domain constituency parsing (McClosky et al., 2006) and dependency parsing (Yu et al., 2015), but there are also many reported failures. Charniak (1997) reports either minor improvements or significant damage to parsing performance from self-training, and similar findings have been reported for the POS-tagging task. Co-training is another way to utilize unlabeled data (Blum and Mitchell, 1998). It leverages multiple learners to annotate the unlabeled data separately, and then augments the training data with the newly labeled instances on which the learners agree. Sarkar (2001) and Steedman et al. (2003) demonstrate that co-training is helpful for unsupervised cross-domain parsing. However, selecting appropriate automatically labeled data for self-training and co-training remains a challenge.

Semi-supervised Domain Adaptation
Semi-supervised domain adaptation assumes the model is trained with both source- and target-domain labeled data. Most recently, Li et al. (2019c) and Yu et al. (2019) show that target-domain data newly generated by self-training or tri-training, together with model ensembles, can significantly improve cross-domain parsing performance. Model ensembling is a commonly used strategy to integrate different parsing models in dependency parsing (Nivre and McDonald, 2008). However, all these approaches require retraining the parser repeatedly, making them difficult to use in practical applications. Daumé III (2007) first proposes the FA method for sequence labeling, which distinguishes domain-specific and domain-invariant features with different feature extractors. Kim et al. (2016) successfully adapt the FA technique to neural networks, using one shared and $m$ private BiLSTM encoders for feature separation. In another direction, Li et al. (2019b) propose to utilize an extra domain embedding to indicate the domain of the input word, and find that the parsing accuracy of the DE model is clearly higher than that of other semi-supervised approaches.
Adversarial learning is a commonly used strategy for extracting pure domain-invariant representations that do not belong to any particular domain (Goodfellow et al., 2014; Bousmalis et al., 2016; Kim et al., 2017; Britz et al., 2017; Cao et al., 2018; Guo et al., 2018; Zeng et al., 2018; Adams et al., 2019). Most relevantly, Sato et al. (2017) apply adversarial networks to the FA and CON methods, finding that the gains are small and performance can even be damaged, especially when the target-domain labeled training data is small. Motivated by these works, we apply adversarial learning to three typical semi-supervised domain adaptation methods, i.e., CON, FA, and DE, with two useful strategies, i.e., fused target-domain word representations and orthogonality constraints, to detect purer yet more effective word representations, thus further boosting the performance of cross-domain dependency parsing.

Conclusions
This work successfully exploits adversarial learning and fine-tuning BERT to model pure yet effective word representations that benefit cross-domain dependency parsing. We have demonstrated the effectiveness of adversarial learning and fine-tuning BERT by applying them to three representative semi-supervised approaches. Experimental results show that our proposed adversarial approaches achieve consistent improvements, and fine-tuning BERT further boosts parsing accuracy by a large margin. Detailed comparison experiments demonstrate that both the fused target-domain word representations and the orthogonality loss help the adversarial models prevent the domain-invariant representations from being contaminated by domain-specific ones. The analysis of BERT utilization indicates that fine-tuning BERT on target-domain unlabeled data encourages BERT to learn more reliable contextualized word representations, leading to a large improvement over off-the-shelf BERT for both non-adversarial and adversarial models.