Fine-grained Knowledge Fusion for Sequence Labeling Domain Adaptation

In sequence labeling, previous domain adaptation methods focus on adapting from the source domain to the entire target domain without considering the diversity of individual target domain samples, which may lead to negative transfer for certain samples. Moreover, an important characteristic of sequence labeling tasks is that different elements within a given sample may also have diverse domain relevance, which requires further consideration. To take this multi-level discrepancy in domain relevance into account, we propose in this paper a fine-grained knowledge fusion model with a domain relevance modeling scheme to control the balance between learning from the target domain data and learning from the source domain model. Experiments on three sequence labeling tasks show that our fine-grained knowledge fusion model outperforms strong baselines and other state-of-the-art sequence labeling domain adaptation methods.


Introduction
Sequence labeling tasks, such as Chinese word segmentation (CWS), POS tagging (POS) and named entity recognition (NER), are fundamental tasks in natural language processing. Recently, with the development of deep learning, neural sequence labeling approaches have achieved high accuracy, relying on large-scale annotated corpora. However, most standard annotated corpora belong to the news domain, and models trained on them suffer sharp performance declines when applied to other domains like social media, forums, literature or patents (Daume III, 2007; Blitzer et al., 2007), which limits their application in the real world. Domain adaptation aims to exploit the abundant information of well-studied source domains to improve performance in target domains (Pan and Yang, 2010), which is well suited to handle this issue. Following Daume III (2007), we focus on the supervised domain adaptation setting, which utilizes large-scale annotated data from the source domain and small-scale annotated data from the target domain. Our code is available at https://github.com/yhy1117/FGKF-DA.

For sequence labeling tasks, each sample is usually a sentence, which consists of a sequence of words/Chinese characters, each denoted as an element. We notice an interesting phenomenon: different target domain samples may have varying degrees of domain relevance to the source domain. As depicted in Table 1, there are some tweets similar to the news domain (i.e. strongly relevant), but there are also tweets with a style of their own, which only appear in the social media domain (i.e. weakly relevant). The phenomenon can be more complicated in cases where the whole sample is strongly relevant while containing some target domain specific elements, or vice versa, showing the diversity of relevance at the element level. In the rest of this paper, we use 'domain relevance' to refer to the domain relevance to the source domain, unless specified otherwise.
Conventional neural sequence labeling domain adaptation methods (Liu and Zhang, 2012; Peng and Dredze, 2017; Lin and Lu, 2018) focus on reducing the discrepancy between the sets of source domain samples and target domain samples. However, they neglect the diverse domain relevance of individual target domain samples, let alone element-level domain relevance. As depicted in Figure 1, strongly relevant samples/elements should clearly learn more knowledge from the source domain, while weakly relevant ones should learn less and keep their own characteristics.
In this paper, we propose a fine-grained knowledge fusion model to control the balance between learning from the target domain data and learning from the source model, inspired by the knowledge distillation method (Bucila et al., 2006;Hinton et al., 2015). With both the sample-level and element-level domain relevance modeling and incorporating, the fine-grained knowledge fusion model can alleviate the negative transfer (Rosenstein et al., 2005) in sequence labeling domain adaptation.
We verify the effectiveness of our method on six domain adaptation experiments of three different tasks, i.e. CWS, POS and NER, in two different languages, i.e. Chinese and English, respectively. Experiments show that our method achieves better results than previous state-of-the-art methods on all tasks. We also provide detailed analyses to study the knowledge fusion process.
Contributions of our work are summarized as follows:
• We propose a fine-grained knowledge fusion model to balance learning from the target data against learning from the source model.
• We also propose multi-level relevance modeling schemes to model both the sample-level and element-level domain relevance.
• Empirical evidence and analyses are provided on three different tasks in two different languages, which verify the effectiveness of our method.

Knowledge Distillation for Adaptation
Knowledge distillation (KD), which distills the knowledge from a sophisticated model into a simple model, has been employed in domain adaptation (Bao et al., 2017; Meng et al., 2018). Recently, online knowledge distillation (Furlanello et al., 2018), which shares the lower layers between the two models and trains them simultaneously, has been shown to be more effective. For sequence labeling domain adaptation, we utilize online knowledge distillation to distill knowledge from the source model into the target model, denoted as basicKD and depicted in Figure 2. We use the Bi-LSTM-CRF architecture (Huang et al., 2015) for both the source model and the target model, and share the embedding layer between them.

Notations. For the rest of the paper, we use the superscripts S and T to denote the source domain and the target domain, respectively. The source domain data is a set of m samples with gold label sequences, denoted as {(x^S_j, y^S_j)}_{j=1}^m. Similarly, the target domain data has n samples, denoted as {(x^T_j, y^T_j)}_{j=1}^n, where n ≪ m. The training loss of the source model, L^S, is the cross entropy between the predicted label distribution ŷ and the gold label y. The training loss of the target model is composed of two parts, namely the sequence labeling loss L^T_SEQ and the knowledge distillation loss L^T_KD, where L^T_SEQ is analogous to L^S, while L^T_KD is the cross entropy between the probability distributions predicted by the source model and the target model. α is a hyper-parameter scalar used to balance learning from the target domain data against learning from the source model.
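As a rough sketch of the basicKD objective, the target model's loss can be written as below. The (1 − α)/α interpolation is our assumption for illustration; the paper only states that α balances the two terms, so the exact weighting may differ.

```python
import numpy as np

def cross_entropy(p_pred, p_target, eps=1e-12):
    """Cross entropy between a target distribution and a prediction,
    averaged over the positions of the sequence."""
    return float(-np.mean(np.sum(p_target * np.log(p_pred + eps), axis=-1)))

def basic_kd_target_loss(p_target, p_source, y_gold, alpha=0.5):
    """L^T combines the sequence labeling loss (vs. gold labels) and the
    KD loss (vs. the source model's soft targets).  The interpolation
    form here is a plausible instantiation, not the paper's exact formula.
    p_target, p_source, y_gold: (seq_len, num_labels) distributions."""
    l_seq = cross_entropy(p_target, y_gold)    # L^T_SEQ
    l_kd = cross_entropy(p_target, p_source)   # L^T_KD
    return (1.0 - alpha) * l_seq + alpha * l_kd
```

With α = 0 the target model ignores the source model entirely; with α = 1 it learns only from the source model's soft targets.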

Relevance Modeling
BasicKD provides individual learning goals for every sample and element of the target domain, weighted by a single scalar α. As a result, the source model has the same influence on all target samples, and the diversity of domain relevance is neglected.
Here we present methods to model the domain relevance of target samples and elements, which can then be used to guide the knowledge fusion process (see §4). The overall architecture is shown in Figure 3. The relevance of each sample is a scalar, denoted as the sample-level relevance weight w^samp_i for the i-th sample, which can be obtained by sample-level domain classification. The relevance of each element is also a scalar; the relevance weights of all elements within a sample form a weight vector w^elem, which can be obtained by similarity calculation.

Element-level Relevance
To acquire the element-level relevance, we employ a domain representation q ∈ R^{2d_h} (d_h is the hidden dimension of the Bi-LSTM) and calculate the similarity between each element representation and the domain representation. We incorporate two methods to obtain q: (1) Domain-q: q is a trainable domain-specific vector, where every element within a domain shares the same q; (2) Sample-q: q is a domain relevant feature extracted from each sample, where every element within a sample shares the same q. Because of the superiority of the capsule network in modeling abstract features, we use it to capture the domain relevant features within a sample: we apply a bottom-up aggregation process over the hidden state matrix h of a sample, and the encoded vector is regarded as q. The similarity is then computed with a bilinear (matrix dot) product, w^elem_j = h_j B q, where h_j is the hidden state of the j-th element, w^elem_j is its relevance weight, and B ∈ R^{2d_h × 2d_h} is a trainable matrix.
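The bilinear scoring step can be sketched as follows; shapes and the exact scoring form w^elem_j = h_j · (B q) are our reading of the text, with B treated as a fixed array rather than a trained parameter.

```python
import numpy as np

def element_relevance(h, q, B):
    """Element-level relevance weights via a bilinear score against the
    domain representation q: w_j = h_j . (B q).
    h: (L, 2*d_h) element hidden states
    q: (2*d_h,) domain representation (domain-q or sample-q)
    B: (2*d_h, 2*d_h) trainable bilinear matrix (here just an array)."""
    return h @ (B @ q)  # (L,) one relevance score per element
```

With B as the identity this reduces to a plain dot product between each hidden state and q.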

Sample-level Relevance
To acquire the sample-level domain relevance, we make use of the domain label to carry out sample-level text classification (two classes: source domain or target domain). The weight vector w^elem is normalized across the sample length using the softmax function, and the sample representation is then obtained as the weighted sum of hidden states: r = Σ_{j=1}^{L} softmax(w^elem)_j h_j, where r ∈ R^{2d_h} is the sample representation and L is the sample length.
Once the sample representation is obtained, a multi-layer perceptron (MLP) followed by a softmax performs the sample classification, from which the sample relevance weight w^samp is derived.
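A minimal sketch of the sample-level relevance computation, combining the two steps above; the one-layer MLP shape and the choice of the source-class probability as w^samp are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def sample_relevance(h, w_elem, W_mlp, b_mlp):
    """Normalize the element weights over the sequence length, build the
    sample representation r as a weighted sum of hidden states, then
    classify the domain with a one-layer MLP; the probability assigned
    to the source class is taken as w^samp (an assumption).
    h: (L, 2*d_h), w_elem: (L,), W_mlp: (2, 2*d_h), b_mlp: (2,)."""
    a = softmax(w_elem)             # (L,) attention over elements
    r = a @ h                       # (2*d_h,) sample representation
    p = softmax(W_mlp @ r + b_mlp)  # (2,) [P(source), P(target)]
    return p[0]                     # w^samp: predicted source probability
```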

Fine-grained Knowledge Fusion for Adaptation
With the relevance modeling, the fine-grained knowledge fusion model is proposed to fuse the knowledge from the source domain and the target domain at different levels. The overall architecture is shown in Figure 2.

Sample-level Knowledge Fusion
Different samples of the target domain tend to show different domain relevance and, as a result, need to acquire different amounts of knowledge from the source domain. We assign a different α to each target sample based on its domain relevance to achieve sample-level knowledge fusion: α^samp_i, the α of the i-th sample, is computed from its relevance weight w^samp_i through the sigmoid function σ, with temperature τ and bias γ.
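One plausible way to combine the temperature τ and bias γ named above is a shifted, scaled sigmoid; the paper's exact parameterization is not recoverable from the text, so this form is an assumption.

```python
import math

def sample_alpha(w_samp, tau=1.0, gamma=0.5):
    """alpha^samp_i = sigmoid((w^samp_i - gamma) / tau).
    An assumed reading of 'tau is temperature and gamma is bias':
    gamma centers the relevance weight and tau controls how sharply
    alpha saturates toward 0 or 1."""
    return 1.0 / (1.0 + math.exp(-(w_samp - gamma) / tau))
```

A sample with relevance weight exactly at the bias γ gets α = 0.5; more source-relevant samples get larger α and thus learn more from the source model.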
The loss functions of the target model are adapted accordingly. The sample classification losses of the source model, L^S_sc, and of the target model, L^T_sc, are both cross entropy losses.

Element-level Knowledge Fusion
Besides the sample-level domain relevance, different elements within a sample tend to present diverse domain relevance. We therefore assign a different α to each element based on its domain relevance weight to achieve element-level knowledge fusion. The new α^elem_i ∈ R^L is a vector whose entry α^elem_ij denotes the α of the j-th element in the i-th sample; it is computed from w^elem_i, the element relevance weights of the i-th sample, with trainable parameters W_α and b_α.
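A sketch of the element-level α, assuming a per-element affine map followed by a sigmoid (the text names W_α and b_α as trainable parameters but the exact form is not recoverable, so scalar W_α and b_α are an assumption here).

```python
import numpy as np

def element_alpha(w_elem, W_alpha, b_alpha):
    """alpha^elem = sigmoid(W_alpha * w_elem + b_alpha), one alpha per
    element.  W_alpha and b_alpha are trainable in the model; here they
    are plain scalars for illustration.
    w_elem: (L,) element relevance weights of one sample."""
    z = w_elem * W_alpha + b_alpha    # (L,) affine-transformed weights
    return 1.0 / (1.0 + np.exp(-z))   # (L,) element-level alphas in (0, 1)
```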
The loss functions of the target model can be expressed analogously, where *_ij denotes the * of the j-th element in the i-th sample, and the final loss function is the same as in Eq. (11).

Multi-level Knowledge Fusion
In this method, we take both the sample-level and element-level relevance diversities into account to implement multi-level knowledge fusion. The multi-level α is computed as α^multi = α^samp ⊙ α^elem, where ⊙ denotes the element-wise product (broadcasting each sample's scalar over its elements), and α^multi ∈ R^{n×L} is a matrix.
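The broadcasted product can be sketched directly; the (n, L) shapes follow from the text, while the broadcasting detail is our interpretation of how a per-sample scalar combines with per-element vectors.

```python
import numpy as np

def multi_alpha(alpha_samp, alpha_elem):
    """alpha^multi = alpha^samp ⊙ alpha^elem, broadcasting each sample's
    scalar over its sequence length.
    alpha_samp: (n,) sample-level alphas
    alpha_elem: (n, L) element-level alphas
    returns:    (n, L) multi-level alphas."""
    return alpha_samp[:, None] * alpha_elem
```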
Algorithm 1 (excerpt):
 7.   Sample b samples from the source data
 8.   Compute L^S, and update θ^S
 9.   Compute L^S_sc, and update θ^S
10.  end for
11.  Use θ^S to test x^T_train and get p^S
12.  while in an episode:
13.    Sample b samples from the target data
14.    Use relevance modeling to get w^samp, w^elem
15.    Use θ^T to predict p^T, and compute L^T_KD
17.    Compute L^T, and update θ^T
18.    Compute L^T_sc, and update θ^T
19.  end while
20.  until converge

Training Process
Both the source model and the target model can be pre-trained on the source domain data (warm up, optional). In the fine-grained knowledge fusion method, the source model and the target model are trained alternately. Within an episode, we first train the source model for I steps, obtain the soft targets p^S, and then train the target model. During the training of the target model, the parameters of the source model are fixed (gradient block). Every training step includes the sequence labeling training and the sample classification training. We conduct early stopping according to the performance of the target model. The whole training process is shown in Algorithm 1.
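The alternating schedule can be sketched with stand-in callables; all names here (source_step, target_step, get_soft_targets, converged) are hypothetical placeholders for the real training operations, not functions from the paper's code.

```python
def train_alternating(source_step, target_step, get_soft_targets,
                      converged, I=100):
    """Alternating schedule in the spirit of Algorithm 1 (sketch):
    train the source model for I steps, freeze it to obtain soft
    targets p^S on the target training data, then train the target
    model for one episode against those soft targets; repeat until
    the convergence criterion fires."""
    episodes = 0
    while not converged(episodes):
        for _ in range(I):           # train theta^S (labeling + sample cls)
            source_step()
        p_s = get_soft_targets()     # frozen source predictions (gradient block)
        target_step(p_s)             # train theta^T with L^T and L^T_sc
        episodes += 1
    return episodes
```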

Datasets
We conduct three sequence labeling tasks: CWS, POS and NER, with the latter two covering both Chinese and English settings. The datasets are detailed in Table 2. There are two kinds of source-target domain pairs: news-novels and news-social media. To be consistent with the setting where there is only small-scale target domain data, we use 5% of the Weibo training data for both CWS and POS. Given the different NER tag sets, we only focus on three types of entities, Person (PER), Location (LOC) and Organization (ORG), and regard other types as Other (O).

Settings
For each task, hyper-parameters are set via grid search on the target domain development set. The embedding size and the dimension of LSTM hidden states are set to 100. The batch size is set to 64 and the learning rate to 0.01. We employ dropout on the embedding and MLP layers with a rate of 0.2. The l2 regularization coefficient is set to 0.1 and the gradient clip to 5. The teach step I is set to 100. The routing iteration is set to 3 and the number of output capsules to 60. The temperature τ is initialized to 1 and the probability bias γ to 0.5. We set the α of the basicKD method to 0.5 following Hinton et al. (2015). We randomly initialize the embedding matrix without using extra data for pre-training, unless specified otherwise.

Baselines
We implement several baseline methods, including: source only (training with only source domain data), target only (training with only target domain data) and basicKD (see §2).
We also re-implement state-of-the-art sequence labeling domain adaptation methods, following their settings except for unifying the embedding size and the dimension of LSTM hidden states:
• Pre-trained methods: Pre-trained embedding incorporates source domain data with its gold labels to pre-train context-aware character embeddings (Zhou et al., 2017), which are used to initialize the target model; Pre-trained model trains the model on the source domain and then fine-tunes it on the target domain.
• Projection methods: Linear projection (Peng and Dredze, 2017) uses a domain-relevant matrix to transform the learned representations from different domains into a shared space; Domain mask (Peng and Dredze, 2017) masks the hidden states of the Bi-LSTM to split the representations into private and public regions for the projection; Neural adaptation layer (NAL) (Lin and Lu, 2018) incorporates adaptation layers at the input and output to conduct private-public-private projections.
• Adversarial method: Adversarial multi-criteria learning (AMCL) uses the shared-private architecture with an adversarial strategy to learn shared representations across domains.

Overall Results on CWS
We use the F1-score (F) and the recall of out-of-vocabulary words (R_OOV) to evaluate domain adaptation performance on CWS. We compare methods with different relevance modeling schemes and different levels of knowledge fusion, without warm up, and denote our final model as FGKF: the multi-level knowledge fusion model with sample-q relevance modeling and warm up.
The results in Table 4 show that both the basicKD method and the fine-grained methods achieve performance improvements through domain adaptation. Compared with basicKD, FGKF behaves better (+1.1% F and +2.8% R_OOV vs. basicKD on average), as it takes multi-level relevance discrepancies into account. The sample-q method performs better than the domain-q method, which shows the domain feature is better represented at the sample level than at the domain level. As for the granularity of α, the performance of α^elem is better than that of α^samp, showing the necessity of modeling element-level relevance; there is no distinct margin between α^elem and α^multi, as most of the multi-level domain relevance can already be captured at the element level. Results of FGKF with warm up indicate that starting from a sub-optimal point is better than starting from scratch for the target model. Among related works (Table 3), AMCL and the Pre-trained model method perform best on CWS. Compared with the other methods, FGKF achieves the best results in both F and R_OOV. These results demonstrate the effectiveness of our fine-grained knowledge fusion architecture for domain adaptation, and also show the significance of considering sample-level and element-level relevance discrepancies.

Overall Results on POS and NER
To further verify the effectiveness of FGKF, we conduct experiments on the POS and NER tasks, using F1-score as the evaluation criterion. Detailed results are shown in Table 3. In these tasks, FGKF achieves better results than the other adaptation methods, and extra gains can be obtained by using pre-trained embeddings. These results also verify the generalization of our method over different tasks and languages.

Analysis
In this section, we discuss the domain adaptation improvements provided by our fine-grained knowledge fusion method.

Performances of Elements with Different Relevance
To further probe the experimental results of the fine-grained knowledge fusion, we classify the target test data (at the element level) into two classes, strongly relevant and weakly relevant, based on their degrees of relevance to the source domain. The partition threshold is the average relevance score of the target training data. Detailed results on Twitter are depicted in Table 5. It is reasonable that both basicKD and FGKF enhance the performance on the strongly relevant part, while FGKF gets larger improvements because it is able to enhance the knowledge fusion by learning more from the source model. For the weakly relevant part, basicKD damages the performance (from 87.41 to 83.82 for POS and from 56.29 to 52.63 for NER), which indicates negative transfer. On the contrary, FGKF improves the performance on the weakly relevant part over the target only baseline by a large margin. This shows that the fine-grained domain adaptation method can reduce the negative transfer on the weakly relevant part while contributing to the transfer on the strongly relevant one.

Relevance Weight Visualization
We visualize the element-level relevance weights to illustrate the effects of the two relevance modeling schemes (domain-q and sample-q). Figure 4 exhibits two cases of element-level relevance modeling results, from which we can explicitly observe that the two schemes capture different domain relevance within a sample. In the first case, the sample-q method extracts more domain relevant elements, like "Qingyun", "Beast God" and "Zhuxian Old Sword", while the domain-q method ignores the last one. In the second case, the domain-q method extracts "front" incorrectly. These results indicate that the sample-q method implements better relevance modeling than the domain-q method to some extent, and confirm that the domain relevant feature is better represented at the sample level than at the domain level.

Case Study
We take two samples in Twitter test set as examples to show how the element-level relevance affects the adaptation. Results in Table 6 show that both basicKD and FGKF can improve the performance of strongly relevant elements, e.g. "got (VBD)", "Lovis (B-PER)". However, only FGKF  reduces the transfer of source domain errors, e.g. "u (NN)", "The (B-ORG) Sun (I-ORG)".

Ablation Study
We conduct the ablation study on the Twitter dataset.

Influence of Target Data Size
Here we investigate the impact of the target domain data size on FGKF. As depicted in Figure 5, when the size is small (20%), the gap between FGKF and basicKD is quite large, which verifies the significance of fine-grained knowledge fusion in the low-resource setting. Even as the size of the target data increases, there are still stable margins between the two methods.

Figure 5: Results on the CWS target test set with varying target training data size. Only 10% of the Weibo training data is utilized.

Related Work
Besides the source domain data, some methods utilize target domain lexicons, unlabeled (Liu and Zhang, 2012) or partially-labeled target domain data to boost sequence labeling adaptation performance; these belong to unsupervised or semi-supervised domain adaptation. In contrast, we focus on supervised sequence labeling domain adaptation, where large improvements can be achieved by utilizing only small-scale annotated data from the target domain.

Previous works in domain adaptation often try to find a subset of source domain data to align with the target domain data (Chopra et al., 2013; Ruder and Plank, 2017), which amounts to a kind of source data sampling, or to construct a common feature space; such methods may wash out informative characteristics of target domain samples. Instance-based domain adaptation (Jiang and Zhai, 2007; Zhang and Xiong, 2018) implements source sample weighting by assigning higher weights to source domain samples that are more similar to the target domain. Some methods (Guo et al., 2018; Kim et al., 2017; Zeng et al., 2018) also explicitly weight multiple source domain models for target samples in multi-source domain adaptation. However, our work focuses on supervised single-source domain adaptation, implementing knowledge fusion between the source domain and the target domain rather than within multiple source domains. Moreover, considering the characteristics of sequence labeling tasks, we pay more attention to finer-grained adaptation, modeling domain relevance at both the sample level and the element level.

Conclusion
In this paper, we propose a fine-grained knowledge fusion model for sequence labeling domain adaptation that takes the domain relevance diversity of target data into account. With relevance modeling at both the sample level and the element level, the knowledge of the source model and the target data can be fused at multiple levels. Experimental results on different tasks demonstrate the effectiveness of our approach, and show its potential in a broader range of domain adaptation applications.