Cross-Domain NER using Cross-Domain Language Modeling

Due to the limited availability of labeled resources, cross-domain named entity recognition (NER) has been a challenging task. Most existing work considers a supervised setting, making use of labeled data for both the source and target domains. A disadvantage of such methods is that they cannot be trained for domains without NER data. To address this issue, we use cross-domain language modeling (LM) as a bridge across domains for NER domain adaptation, performing cross-domain and cross-task knowledge transfer through a novel parameter generation network. Results show that our method can effectively extract domain differences from cross-domain LM contrast, allowing unsupervised domain adaptation while also giving state-of-the-art results among supervised domain adaptation methods.


Introduction
Named entity recognition (NER) is a fundamental task in information extraction and text understanding. Due to large variations in entity names and flexibility in entity mentions, NER has been a challenging task in NLP. Cross-domain NER adds to the difficulty of modeling due to differences in text genre and entity names. Existing methods make use of feature transfer (Daumé III, 2009; Kim et al., 2015; Obeidat et al., 2016; Wang et al., 2018) and parameter sharing (Lee et al., 2017; Yang et al., 2017; Lin and Lu, 2018) for supervised NER domain adaptation.
Language modeling (LM) has been shown to be useful for NER, both via multi-task learning (Rei, 2017) and via pre-training (Peters et al., 2018). Intuitively, both noun entities and context patterns can be captured during LM training, which benefits the recognition of named entities. A natural question that arises is whether cross-domain LM training can benefit cross-domain NER. We are interested in transferring NER knowledge from the news domain to the target domain by contrasting large raw data in both domains through cross-domain LM training.
* Work done when visiting Westlake University.
Naive multi-task learning by parameter sharing (Collobert and Weston, 2008) does not work effectively in this multi-task, multi-domain setting due to potential conflict of information. To achieve cross-domain information transfer, as shown by the red arrow in Figure 1, two types of connections must be made: (1) cross-task links between NER and LM (for vertical transfer) and (2) cross-domain links (for horizontal transfer). We investigate a novel parameter generator network to this end, decomposing the parameters $\theta$ of the NER or LM task on the source or target text domain into the combination $\theta = f(W, I^D_d, I^T_t)$ of a set of meta parameters $W$, a task embedding vector $I^T_t$ ($t \in \{ner, lm\}$) and a domain embedding vector $I^D_d$ ($d \in \{src, tgt\}$), so that domain and task correlations can be learned through similarities between the respective domain and task embedding vectors.
In Figure 1, the values of $W$, $\{I^T_t\}$, $\{I^D_d\}$ and the parameter generation network $f(\cdot, \cdot, \cdot)$ are all trained in a multi-task learning process optimizing NER and LM training objectives. Through the process, connections between the sets of parameters $\theta_{src,ner}$, $\theta_{src,lm}$, $\theta_{tgt,ner}$ and $\theta_{tgt,lm}$ are decomposed into two dimensions and distilled into two task embedding vectors $I^T_{ner}$, $I^T_{lm}$ and two domain embedding vectors $I^D_{src}$, $I^D_{tgt}$, respectively. Compared with traditional multi-task learning, our method has modular control over cross-domain and cross-task knowledge transfer. In addition, the four embedding vectors $I^T_{ner}$, $I^T_{lm}$, $I^D_{src}$ and $I^D_{tgt}$ can also be trained by optimizing on only three datasets for $\theta_{src,ner}$, $\theta_{src,lm}$ and $\theta_{tgt,lm}$, therefore achieving zero-shot NER learning on the target domain by deriving $\theta_{tgt,ner}$ automatically. Results on three different cross-domain datasets show that our method outperforms naive multi-task learning and a wide range of domain adaptation methods. To our knowledge, we are the first to consider unsupervised domain adaptation for NER via cross-domain LM tasks and the first to work on NER transfer learning between domains with completely different entity types (i.e. news vs. biomedical). We released our data and code at https://github.com/jiachenwestlake/Cross-Domain_NER.

Related Work
NER. Recently, neural networks have been used for NER and achieved state-of-the-art results. Hammerton (2003) uses a unidirectional LSTM with a softmax classifier. Collobert et al. (2011) use a CNN-CRF architecture. Santos and Guimarães (2015) extend the model by using a character CNN. Most recent work uses LSTM-CRF (Lample et al., 2016; Ma and Hovy, 2016; Chiu and Nichols, 2016). We choose BiLSTM-CRF as our method since it gives state-of-the-art results on standard benchmarks.
Cross-domain NER. Most existing work on cross-domain NER investigates the supervised setting, where both source and target domains have labeled data. Daumé III (2009) maps entity label space between the source and target domains. Kim et al. (2015) and Obeidat et al. (2016) use label embeddings instead of entities themselves as the features for cross-domain transfer. Wang et al. (2018) perform label-aware feature representation transfer based on text representation learned by BiLSTM networks.
Recently, parameter transfer approaches have seen increasing popularity for cross-domain NER. Such approaches first initialize a target model with parameters learned from source-domain NER (Lee et al., 2017) or LM (Sachan et al., 2018), and then fine-tune the model using labeled NER data from the target domain. Yang et al. (2017) jointly train source- and target-domain models with shared parameters; Lin and Lu (2018) add adaptation layers on top of existing networks. Except for , all the above methods use cross-domain NER data only. In contrast, we leverage both NER data and raw data for both domains. In addition, our method can deal with a zero-shot learning setting for unsupervised NER domain adaptation, which no existing work considers.
Learning task embedding vectors. There has been related work using task vector representations for multi-task learning. Ammar et al. (2016) learn language embeddings for multi-lingual parsing. Stymne et al. (2018) learn treebank embeddings for cross-annotation-style parsing. These methods use "task" embeddings to augment word embedding inputs, distilling "task" characteristics into these vectors for preserving word embeddings. Liu et al. (2018) learn domain embeddings for multi-domain sentiment classification. They combine domain vectors with a domain-independent representation of the input sentences to obtain a domain-specific input representation. A salient difference between our work and the methods above is that we use domain and task embeddings to obtain domain- and task-specific parameters, rather than input representations.
Closer in spirit to our work, Platanios et al. (2018) learn language vectors, using them to generate parameters for multi-lingual machine translation. While one of their main motivations is to save parameter space when the number of languages grows, our main goal is to investigate the modularization of transferable knowledge in a cross-domain and cross-task setting. To our knowledge, we are the first to study "task" embeddings in a multi-dimensional parameter decomposition setting (e.g. domain + task).

Methods
The overall structure of our proposed model is shown in Figure 2. The bottom shows the combination of two domains and two tasks. Given an input sentence, word representations are first calculated through a shared embedding layer (Subsection 3.1). Then a set of task- and domain-specific BiLSTM parameters is calculated through a novel parameter generation network (Subsection 3.2), for encoding the input sequence. Finally, respective output layers are used for different tasks and domains (Subsection 3.3).

Input Layer
Following , given an input $x = [x_1, x_2, \ldots, x_n]$ from a source-domain NER train- , each word $x_i$ is represented as the concatenation of its word embedding and the output of a character-level CNN:
$$v_i = [e^w(x_i) \oplus \mathrm{CNN}(e^c(x_i))]$$
where $e^w$ represents a shared word embedding lookup table and $e^c$ represents a shared character embedding lookup table. $\mathrm{CNN}(\cdot)$ represents a standard CNN acting on the character embedding sequence $e^c(x_i)$ of a word $x_i$. $\oplus$ represents vector concatenation.

Parameter Generation Network
To transfer knowledge across domains and tasks, we dynamically generate the parameters of the BiLSTM using a Parameter Generation Network ($f(\cdot, \cdot, \cdot)$). The resulting parameters are denoted as $\theta^{d,t}_{LSTM}$, where $d \in \{src, tgt\}$ and $t \in \{ner, lm\}$ represent the domain label and task label, respectively. More specifically:
$$\theta^{d,t}_{LSTM} = W \otimes I^D_d \otimes I^T_t$$
where $W \in \mathbb{R}^{P^{(LSTM)} \times V \times U}$ represents a set of meta parameters in the form of a 3rd-order tensor and $I^D_d \in \mathbb{R}^U$, $I^T_t \in \mathbb{R}^V$ represent the domain embedding and task embedding, respectively. $U$ and $V$ represent the domain and task embedding sizes, respectively. $P^{(LSTM)}$ is the number of BiLSTM parameters. $\otimes$ refers to tensor contraction.
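Given the input $v$ and the parameter $\theta^{d,t}_{LSTM}$, the hidden outputs of a task- and domain-specific BiLSTM unit can be uniformly written as:
$$\overrightarrow{h}^{d,t}_i = \mathrm{LSTM}(\overrightarrow{h}^{d,t}_{i-1}, v_i, \overrightarrow{\theta}^{d,t}_{LSTM})$$
$$\overleftarrow{h}^{d,t}_i = \mathrm{LSTM}(\overleftarrow{h}^{d,t}_{i+1}, v_i, \overleftarrow{\theta}^{d,t}_{LSTM})$$
for the forward and backward directions, respectively.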
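The tensor contraction above can be illustrated with a minimal sketch in plain Python. The toy sizes, the one-hot embedding values and the names `generate_params`, `META`, `DOMAIN_EMB` and `TASK_EMB` are assumptions made for illustration only; in the model described here the meta tensor and embeddings are learned, and the generated vector would be reshaped into BiLSTM weights.

```python
from typing import Dict, List, Tuple

Tensor3 = List[List[List[float]]]  # indexed as meta[p][v][u]


def generate_params(meta: Tensor3, domain_emb: List[float],
                    task_emb: List[float]) -> List[float]:
    """Contract meta[p][v][u] with task_emb[v] and domain_emb[u].

    Returns one value per generated parameter p (a flat vector of size P).
    """
    return [
        sum(plane[v][u] * task_emb[v] * domain_emb[u]
            for v in range(len(task_emb))
            for u in range(len(domain_emb)))
        for plane in meta
    ]


# Toy sizes (assumed): P = 2 generated parameters, V = U = 2.
META: Tensor3 = [[[1.0, 2.0], [3.0, 4.0]],
                 [[0.0, 1.0], [1.0, 0.0]]]
DOMAIN_EMB: Dict[str, List[float]] = {"src": [1.0, 0.0], "tgt": [0.0, 1.0]}
TASK_EMB: Dict[str, List[float]] = {"ner": [1.0, 0.0], "lm": [0.0, 1.0]}

# One parameter vector per (domain, task) pair. The target-domain NER entry is
# derived from embeddings that each also take part in one of the other pairs.
THETA: Dict[Tuple[str, str], List[float]] = {
    (d, t): generate_params(META, DOMAIN_EMB[d], TASK_EMB[t])
    for d in DOMAIN_EMB for t in TASK_EMB
}
```

Because the contraction is linear in each embedding, two domains (or tasks) with similar embedding vectors receive similar generated parameters, which is the property the decomposition relies on.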

Output Layers
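NER. Standard CRFs (Ma and Hovy, 2016) are used as output layers for NER. Given the BiLSTM hidden outputs $h_1, \ldots, h_n$, the output probability $p(y|x)$ over the label sequence $y = l_1, l_2, \ldots, l_i$ produced on input sentence $x$ is:
$$p(y|x) = \frac{\exp\left\{\sum_i \left(w^{l_i}_{CRF} \cdot h_i + b^{(l_{i-1}, l_i)}_{CRF}\right)\right\}}{\sum_{y'} \exp\left\{\sum_i \left(w^{l'_i}_{CRF} \cdot h_i + b^{(l'_{i-1}, l'_i)}_{CRF}\right)\right\}}$$
where $y'$ represents an arbitrary label sequence, $w^{l_i}_{CRF}$ is a model parameter specific to $l_i$, and $b^{(l_{i-1}, l_i)}_{CRF}$ is a bias specific to $l_{i-1}$ and $l_i$. Considering that the NER label sets across domains can be different, we use CRF(S) and CRF(T) to represent CRFs for the source and target domains in Figure 2, respectively. We use the first-order Viterbi algorithm to find the highest scored label sequence.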
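As an illustration of first-order Viterbi decoding, the following is a minimal sketch rather than the implementation used in the experiments. It assumes per-position emission scores (one score per label, standing in for the label-specific parameter applied to each hidden state) and a label-to-label transition score matrix (standing in for the pairwise bias); start and end transitions are omitted, and the function name `viterbi` is hypothetical.

```python
from typing import List


def viterbi(emissions: List[List[float]],
            transitions: List[List[float]]) -> List[int]:
    """Return the highest-scoring label sequence for a non-empty sentence.

    emissions[i][l]   : score of label l at position i
    transitions[p][l] : score of moving from label p to label l
    """
    n_labels = len(transitions)
    score = list(emissions[0])           # best score ending in each label
    backpointers: List[List[int]] = []   # best previous label per position

    for emit in emissions[1:]:
        new_score: List[float] = []
        pointers: List[int] = []
        for cur in range(n_labels):
            best_prev = max(range(n_labels),
                            key=lambda prev: score[prev] + transitions[prev][cur])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + transitions[best_prev][cur] + emit[cur])
        score = new_score
        backpointers.append(pointers)

    # Follow the back-pointers from the best final label.
    path = [max(range(n_labels), key=lambda label: score[label])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path
```

With separate CRFs for the source and target domains, each would carry its own transition matrix and label set, and decoding would be run with the matrix of the domain at hand.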
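Language modeling. The forward LM uses the forward LSTM hidden states $[\overrightarrow{h}_1, \ldots, \overrightarrow{h}_n]$ to compute the probability of the next word $x_{i+1}$ given $x_{1:i}$, represented as $p^f(x_{i+1} | x_{1:i})$; the backward LM likewise uses the backward hidden states to compute $p^b(x_{i-1} | x_{i:n})$. For computational efficiency, Negative Sampling Softmax (NSSoftmax) (Mikolov et al., 2013; Jean et al., 2014) is used to compute the forward and backward probabilities, respectively, as follows:
$$p^f(x_{i+1} | x_{1:i}) = \frac{1}{Z} \exp\left\{ w^{\top}_{\#x_{i+1}} \overrightarrow{h}_i + b_{\#x_{i+1}} \right\}$$
$$p^b(x_{i-1} | x_{i:n}) = \frac{1}{Z} \exp\left\{ w^{\top}_{\#x_{i-1}} \overleftarrow{h}_i + b_{\#x_{i-1}} \right\}$$
where $\#x$ represents the vocabulary index of the target word $x$. $w_{\#x}$ and $b_{\#x}$ are the target word vector and the target word bias, respectively. $Z$ is the normalization term computed by
$$Z = \sum_{k \in \{\#x\} \cup N_x} \exp\left\{ w^{\top}_k h_i + b_k \right\}$$
where $N_x$ represents the negative sample set of the target word $x$. Each element in the set is a random number from 1 to the cross-domain vocabulary size.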
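A minimal sketch of a negative sampling softmax is given below, assuming the normalization runs over the target word together with its sampled negatives. The function names, the 0-based vocabulary indices and the uniform sampler are assumptions made for illustration; they are not taken from the released code.

```python
import math
import random
from typing import List, Sequence


def sample_negatives(vocab_size: int, num_samples: int,
                     rng: random.Random) -> List[int]:
    """Draw random vocabulary indices (0-based here) to act as negatives."""
    return [rng.randrange(vocab_size) for _ in range(num_samples)]


def ns_softmax_prob(hidden: Sequence[float], target_idx: int,
                    word_vecs: Sequence[Sequence[float]],
                    word_bias: Sequence[float],
                    negatives: Sequence[int]) -> float:
    """Probability of the target word, normalized over target plus negatives."""
    candidates = {target_idx} | set(negatives)
    logits = {
        k: sum(w * h for w, h in zip(word_vecs[k], hidden)) + word_bias[k]
        for k in candidates
    }
    peak = max(logits.values())  # subtract the maximum for numerical stability
    z = sum(math.exp(value - peak) for value in logits.values())
    return math.exp(logits[target_idx] - peak) / z
```

The cost of one prediction then grows with the number of negative samples rather than with the full vocabulary, which is the reason for preferring it over a full softmax when the two domains share a large vocabulary.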

Training Objectives
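NER. Given a manually labeled dataset $D_{ner} = \{(x_n, y_n)\}_{n=1}^{N}$, the sentence-level negative log-likelihood loss is used for training:
$$L_{ner} = -\frac{1}{|D_{ner}|} \sum_{n=1}^{N} \log p(y_n | x_n)$$
Language modeling. Given a raw data set $D_{lm} = \{(x_n)\}_{n=1}^{N}$, $LM_f$ and $LM_b$ are trained jointly using Negative Sampling Softmax. Negative samples are drawn based on the word frequency distribution in $D_{lm}$. The loss function is:
$$L_{lm} = -\frac{1}{2|D_{lm}|} \sum_{n=1}^{N} \sum_{t=1}^{T} \left\{ \log p^f(x^n_{t+1} | x^n_{1:t}) + \log p^b(x^n_{t-1} | x^n_{t:T}) \right\}$$
where $x^n_t$ is the $t$-th word of sentence $x_n$ and $T$ is its length.
Joint training. To perform joint training for NER and language modeling on both the source and target domains, we minimize the overall loss:
$$L = \sum_{d \in \{src, tgt\}} \lambda_d \left( L^d_{ner} + \lambda_t L^d_{lm} \right) + \frac{\lambda}{2} \|\Theta\|^2$$
where $\lambda_d$ is a domain weight and $\lambda_t$ is a task weight. $\lambda$ is the $L_2$ regularization parameter and $\Theta$ represents the parameter set.
[Algorithm 1: Multi-task learning procedure. Note: * means none in unsupervised learning.]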
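The way the weights enter the overall loss can be sketched as follows. This is a sketch under an assumed form: a domain-weighted sum of the NER loss plus a task-weighted LM loss, with an L2 penalty on the parameters. The function name `joint_loss`, its argument names and the treatment of a missing target-domain NER loss as zero are assumptions for illustration.

```python
from typing import Dict, Sequence, Tuple


def joint_loss(task_losses: Dict[Tuple[str, str], float],
               domain_weight: Dict[str, float],
               lm_weight: float,
               l2_coeff: float,
               params: Sequence[float]) -> float:
    """Combine per-(domain, task) losses into one scalar objective."""
    total = 0.0
    for domain, weight in domain_weight.items():
        # In the unsupervised setting there is no target-domain NER loss,
        # so a missing entry contributes nothing.
        ner = task_losses.get((domain, "ner"), 0.0)
        lm = task_losses.get((domain, "lm"), 0.0)
        total += weight * (ner + lm_weight * lm)
    # L2 regularization over all trainable parameters.
    total += 0.5 * l2_coeff * sum(p * p for p in params)
    return total
```

For example, with source losses of 1.0 (NER) and 2.0 (LM), a target LM loss of 4.0, domain weights of 1.0 and 0.5, an LM weight of 0.5 and no regularization, the combined value is 1.0 * (1.0 + 1.0) + 0.5 * (0.0 + 2.0) = 3.0.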

Multi-Task Learning Algorithm
We propose a cross-task and cross-domain joint training method for multi-task learning. Algorithm 1 provides the training procedure. In each training step (lines 1 to 18), minibatches of the 4 tasks in Figure 1 take turns to train (lines 4-5, 7-8, 11-12 and 15-16, respectively). Each task first generates the parameters $\theta^{d,t}_{LSTM}$ using $W$ and its respective $I^D_d$, $I^T_t$, and then computes gradients for $f(W, I^D_d, I^T_t)$ and the domain-specific output layer ($\theta_{crfs}$, $\theta_{crft}$ or $\theta_{nss}$). In the scenario of unsupervised learning, there is no training data for target-domain NER, and lines 11-12 are not executed. At the end of each training step, the parameters of $f(\cdot, \cdot, \cdot)$ and the private output layers are updated together in line 17.
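The turn-taking described above can be sketched as a small scheduling routine. This is an illustrative sketch, not a transcription of Algorithm 1: the order of the tasks within a step, the gradient names and the callback interface (`compute_grads`, `apply_update`, `toy_grads`) are assumptions, and gradients are represented as plain numbers so that the control flow can be run on its own.

```python
from typing import Callable, Dict, List, Tuple

Task = Tuple[str, str]  # (domain, task)


def task_schedule(supervised: bool) -> List[Task]:
    """Tasks that take a turn in one training step (order assumed)."""
    schedule: List[Task] = [("src", "ner"), ("src", "lm")]
    if supervised:
        # Target-domain NER is skipped when no labeled target data exists.
        schedule.append(("tgt", "ner"))
    schedule.append(("tgt", "lm"))
    return schedule


def training_step(supervised: bool,
                  compute_grads: Callable[[str, str], Dict[str, float]],
                  apply_update: Callable[[Dict[str, float]], None]) -> Dict[str, float]:
    """Accumulate gradients over all scheduled tasks, then update once."""
    accumulated: Dict[str, float] = {}
    for domain, task in task_schedule(supervised):
        for name, grad in compute_grads(domain, task).items():
            accumulated[name] = accumulated.get(name, 0.0) + grad
    apply_update(accumulated)  # single joint update at the end of the step
    return accumulated


def toy_grads(domain: str, task: str) -> Dict[str, float]:
    """Stand-in gradients: every task touches W, its embeddings and its output layer."""
    output_layer = "nss" if task == "lm" else "crf_" + domain
    return {"W": 1.0, "I_D_" + domain: 1.0, "I_T_" + task: 1.0, output_layer: 1.0}
```

Running `training_step(False, toy_grads, lambda grads: None)` shows the zero-shot case: the target-domain CRF receives no gradient, while the target-domain embedding and the NER task embedding are still updated through the target LM and source NER tasks, respectively.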

Experimental Settings
Data. We take the CoNLL-2003 English NER data (Sang and Meulder, 2003) as our source-domain data. In addition, 377,592 sentences from Reuters are used for source-domain LM training in unsupervised domain adaptation. Three sets of target-domain data are used, including two publicly available biomedical NER datasets, BioNLP13PC (13PC) and BioNLP13CG (13CG) 1, and a science and technology dataset we collected and labeled. Statistics of the datasets are shown in Table 1.
CoNLL-2003 contains four types of entities, namely PER (person), LOC (location), ORG (organization) and MISC (miscellaneous). BioNLP13CG consists of five types, namely CHEM (chemical), CC (cellular component), G/P (gene/protein), SPE (species) and CELL (cell). BioNLP13PC consists of three of those entity types: CHEM, CC and G/P. We use the text of their training sets for language modeling training 2.
For the science and technology dataset, we collect 620 articles from CBS SciTech News 3, manually labeling them as a test set for unsupervised domain adaptation. It consists of four types of entities following the CoNLL-2003 standard. The numbers of each entity type are comparable to the CoNLL test set, as listed in Table 2. The main difference is that a great number of entities in the CBS News dataset are closely related to the domain of science and technology. In particular, for the MISC category, more technology terms such as Space X, bitcoin and IP are included, as compared with the CoNLL data set. Lack of such entities in the CoNLL training set and the difference in text genre cause the main difficulty in domain transfer. To address this difference, 398,990 unlabeled sentences from CBS SciTech News are used for LM training. We released this dataset as one contribution of this paper.
Hyperparameters. We choose NCRF++ for developing the models. Our hyperparameter settings largely follow , with the following exceptions: (1) The batch size is set to 30 instead of 10 for shorter training time in multi-task learning; (2) RMSprop with a learning rate of 0.001 is used for our Single Task Model (STM-TARGET) for the strongest baseline according to development experiments, while the multi-task models use SGD with a learning rate of 0.015 as . We use domain embeddings and task embeddings of size 8 to fit the model in one GPU of 8GB memory. The word embeddings for all models are initialized with GloVe 100-dimension vectors (Pennington et al., 2014) and fine-tuned during training. Character embeddings are randomly initialized.
1 https://github.com/cambridgeltl/MTL-Bioinformatics-2016
2 We tried to use a larger amount of raw data from PubMed, but this did not improve the performance.

Development Experiments
We report a set of development experiments on the biomedical datasets 13PC and 13CG.
Learning curves. Figure 3 shows the F1-scores against the number of training iterations on the 13CG development set. STM-TARGET is our single task model trained on the target-domain training set $T_{ner}$; FINETUNE is a model pre-trained using the source-domain training data $S_{ner}$ and then fine-tuned using the target-domain data $T_{ner}$; MULTITASK simultaneously trains source-domain NER and target-domain NER following Yang et al. (2017). For STM+ELMO, we mix the source- and target-domain raw data for training a contextualized ELMo representation (Peters et al., 2018), which is then used as input to an STM-TARGET model. This model shows a different way of transfer by using raw data, which is different from FINETUNE and MULTITASK. Note that due to differences in the label sets, FINETUNE and MULTITASK both share parameters between the two models except for the CRF layers.
As can be seen from Figure 3, the F1 of all models increases as the number of training iterations increases from 1 to 50, with only small fluctuations. All of the models converge to a plateau range when the iteration number increases to 100. All transfer learning methods outperform the STM-TARGET method, showing the usefulness of using source data to enhance target labeling. The strong performance of STM+ELMO over FINETUNE and MULTITASK shows the usefulness of raw text. By simultaneously using source-domain raw text and target-domain raw text, our model gives the best F1 over all iterations.
Effect of language model for transfer. Figure 4 shows the results of source language modeling, target language modeling, source NER and target NER for both development datasets as the number of training iterations increases. As can be seen, multi-task learning under our framework brings benefit to all tasks, without being negatively influenced by potential conflicts between tasks (Bingel and Søgaard, 2017; Mou et al., 2016).

Final Results on Supervised Domain Adaptation
We investigate supervised transfer from CoNLL to 13PC and 13CG, comparing our model with a range of baseline transfer approaches. In particular, three sets of comparisons are made, including (1) a comparison between our method and other supervised domain adaptation methods, such as MULTITASK(NER) 4 and ELMo, (2) a comparison between the use of different subsets of data for transfer under our own framework and (3) a comparison with the current state-of-the-art in the literature for these datasets.
(1) Comparison with other supervised transfer methods. We compare our method with STM-TARGET, MULTITASK(NER), FINETUNE and STM+ELMO. The observations are similar to those on the development set. Note that FINETUNE does not always improve over STM-TARGET, which shows that the difference between the two datasets can hurt naive transfer learning, without considering domain descriptor vectors.
ELMo. The ELMo methods use raw text via language model pre-training, which has been shown to benefit many NLP tasks (Peters et al., 2018). In our cross-domain setting, STM+ELMO gives a significant improvement over STM-TARGET on the 13CG dataset, but only a small improvement on the 13PC dataset. The overall improvements are comparable to that of MULTITASK only using the raw data. We also tried to use the ELMo model (Original) released by Peters et al. (2018). The results are 84.08% on 13PC and 79.57% on 13CG, respectively, which are lower than the 85.54% and 79.86% given by our method, despite the use of much larger external data. This shows the effectiveness of our model.
Multi-task of NER and LM. We additionally compare our method with the naive multi-task learning setting (Collobert and Weston, 2008), which uses shared parameters for the four tasks but the exact same data conditions as the FINAL model. This is shown as the MULTITASK(NER+LM) method in Table 3. The method gives an 81.33% F1 on 13PC and 75.27% on 13CG, which is much lower than all baseline models. This demonstrates the challenge of the cross-domain and cross-task setting, which contains conflicting information from different text genres and task requirements.
(2) Ablation experiments. Now that we have compared our method with baselines utilizing similar data sources, we turn to investigate the influence of data sources on our own framework. As shown in Figure 5, we make novel use of 4 data sources for the combination of two tasks in two domains. If some sources are removed, our settings fall back to traditional transfer learning. For example, if the LM task is not considered, then the task setting is standard supervised domain adaptation.
The baselines include (1) CO-LM, which represents our model without source-domain tasks, jointly training the target-domain NER and language modeling, transferring parameters as $\theta^{t}_{LSTM} = W \otimes I^T_t$ ($t \in \{ner, lm\}$); (2) CO-NER, which represents our model without the LM task, jointly training the source- and target-domain NER, transferring parameters as $\theta^{d}_{LSTM} = W \otimes I^D_d$ ($d \in \{src, tgt\}$); and (3) MIX-DATA, which uses the same NER data in the source and target domains as FINAL, but uses combined raw text to train both the source- and target-domain language models.
Our method outperforms all baselines significantly, which shows the importance of using rich data. A contrast between our method and MIX-DATA shows the effectiveness of using two different language models across domains. Even though MIX-DATA uses more data for training language models on both the source and target domains, it cannot learn a domain contrast since both sides use the same mixed data. In contrast, our model gives significantly better results by gleaning such contrast.
(3) Comparison with current state-of-the-art. Finally, Table 3 also shows a comparison with a state-of-the-art method on the 13PC and 13CG datasets (Crichton et al., 2017), which leverages POS tagging for multi-task learning using a co-training method. Our model outperforms their results, giving the best results in the literature.
Discussion. When the number of target-domain NER sentences is 0, the transfer learning setting is unsupervised domain adaptation. As the number of target-domain NER sentences increases, they will intuitively play an increasingly important role for target NER. Figure 6 compares the F1-scores of the baseline STM-TARGET and our multi-task model with varying numbers of target-domain NER training sentences under 100 training epochs. In the nearly unsupervised setting, our method gives the largest improvement of 20.5% F1. As the amount of training data increases, the gap between the two methods becomes smaller. But our method still gives a 3.3% F1 gain when the number of training sentences reaches 3,000, showing the effectiveness of LM in knowledge transfer. Figure 7 shows fine-grained NER results for all available entity types. In comparison to STM-TARGET, FINETUNE and MULTITASK, our method outperforms all the baselines on each entity type, which is in accordance with the conclusion of the development experiments.
Figure 7: Fine-grained comparisons on 13PC and 13CG.

Unsupervised Domain Adaptation
For unsupervised domain adaptation, many settings in Subsection 4.2 do not hold, including STM-TARGET, FINETUNE, MULTITASK, CO-LM and CO-NER. Instead, we add a naive baseline, STM-SOURCE, which directly applies a model trained on the source-domain CoNLL-2003 data to the target domain. In addition, we compare with models that make use of source NER, source LM and target LM data, including SELF-TRAIN, which improves a source NER model on target raw text (Daumé III, 2008); STM-ELMO, which uses ELMo embeddings trained over combined source- and target-domain raw text for STM-SOURCE; STM-ELMO(SRC), which uses only the source-domain raw data for training ELMo; STM-ELMO(TGT), which uses only the target-domain raw text for training ELMo; and DANN (Ganin et al., 2016), which performs generative adversarial training over source- and target-domain raw data.
Table 4: Three metrics on CBS SciTech News. We use the CoNLL dev set to select the hyperparameters of our models. ELMo and Ours are given the same overall raw data; SELF-TRAIN and DANN use raw data selected from the overall raw data for better performance. † indicates that our results are statistically significant compared to all baselines with p < 0.01 by t-test.
Final results. The final results are shown in Table 4. SELF-TRAIN gives better results compared with the STM-SOURCE baseline, which shows the effectiveness of target-domain raw data. Adversarial training brings significantly better improvements compared with naive self-training. Among ELMo methods, the model using both the source-domain raw data and target-domain raw data outperforms the model using only the source- or target-domain raw data. ELMo also outperforms DANN, which shows the strength of LM pre-training. Interestingly, ELMo with target-domain raw data gives similar accuracies to ELMo with mixed source- and target-domain data, which shows that target-domain LM is more useful for the pre-training method.
It also indicates that our method makes better use of LMs over two different domains. Compared with all baseline models, our model gives a final F1 of 73.59, significantly better than the best result of 70.85 obtained by STM+ELMO, demonstrating the effectiveness of the parameter generation network for cross-task, cross-domain knowledge transfer.
Influence of raw text. For zero-shot learning, domain adaptation is achieved solely through LM channels. We thus compare the effectiveness of raw text from both the source domain and the target domain. Figure 8 shows the results. The line "SRC: varying; TGT: varying" shows the F1-scores against varying numbers of raw sentences in both source and target domains. Each number on the x-axis indicates an equal amount of source- and target-domain text. As can be seen, increasing raw text gives increased F1 for NER, which demonstrates effective use of raw data by our method. The lines "SRC: 100%; TGT: varying" and "SRC: varying; TGT: 100%" show two alternative measures, fixing the source- or target-domain raw text at 100% of our data and varying only the other domain's text. A comparison between the two lines shows that the target-domain raw data has more influence on the domain adaptation power, which conforms to intuition.
Discussion. Table 5 shows a breakdown of the improvement of our model over STM-SOURCE by entity type. Compared with PER, LOC and ORG names, our method brings the most improvement on MISC entities, which are mostly types that are specific to the technology domain (see Subsection 4.1). Intuitively, the amount of overlap between raw text from the source and target domains is the weakest for this type of entity. Therefore, the results show the effectiveness of our method in deriving domain contrast with respect to NER from cross-domain language modeling. Table 6 shows a case study, where "Brittany Kaiser" is a personal name and "CBS This Morning" is a programme. Without using raw text, STM-SOURCE misclassifies "Brittany Kaiser" as ORG. Both DANN and our method give the correct result because the name is mentioned in raw text, from which connections to the pattern "PER spoke" can be drawn.
With the help of raw text, DANN and our method can also recognize "CBS This Morning" as an entity, which has a common pattern of consecutive capital letters in both source and target domains.
DANN misclassifies "CBS This Morning" as ORG. In contrast, our model classifies it correctly as MISC, a category in which most entities are specific to the target domain (see Subsection 4.1). This is likely because adversarial training in DANN aims to match feature distributions between source and target domains by mimicking the domain discriminator, which can lead to concentration on domain-common features but confusion about such domain-specific features. This demonstrates the advantage of our method in deriving both domain-common and domain-specific features.

Conclusion
We considered NER domain adaptation by extracting knowledge of domain differences from raw text. For this goal, cross-domain language modeling is conducted through a novel parameter generation network, which decomposes domain and task knowledge into two sets of embedding vectors. Experiments on three datasets show that our method is highly effective among supervised domain adaptation methods, while allowing zero-shot learning in unsupervised domain adaptation.