Neural Adaptation Layers for Cross-domain Named Entity Recognition

Recent research efforts have shown that neural architectures can be effective in conventional information extraction tasks such as named entity recognition, yielding state-of-the-art results on standard newswire datasets. However, despite significant resources required for training such models, the performance of a model trained on one domain typically degrades dramatically when applied to a different domain, yet extracting entities from new emerging domains such as social media can be of significant interest. In this paper, we empirically investigate effective methods for conveniently adapting an existing, well-trained neural NER model for a new domain. Unlike existing approaches, we propose lightweight yet effective methods for performing domain adaptation for neural models. Specifically, we introduce adaptation layers on top of existing neural architectures, where no re-training using the source domain data is required. We conduct extensive empirical studies and show that our approach significantly outperforms state-of-the-art methods.


Introduction
Named entity recognition (NER) focuses on extracting named entities in a given text while identifying their underlying semantic types. Most earlier approaches to NER are based on conventional structured prediction models such as conditional random fields (CRF) (Lafferty et al., 2001; Sarawagi and Cohen, 2004), relying on handcrafted features which can be designed based on domain-specific knowledge (Yang and Cardie, 2012; Passos et al., 2014; Luo et al., 2015). Recently, neural architectures have been shown to be effective in this task, requiring minimal feature engineering (Lample et al., 2016; Ma and Hovy, 2016; Peters et al., 2017; Liu et al., 2018).
Domain adaptation, as a special case of transfer learning, aims to exploit the abundant data of well-studied source domains to improve the performance in target domains of interest (Pan and Yang, 2010; Weiss et al., 2016). There is a growing interest in investigating the transferability of neural models for NLP. Two notable approaches, namely INIT (parameter initialization) and MULT (multi-task learning), have been proposed for studying the transferability of neural networks under tasks such as sentence (pair) classification (Mou et al., 2016) and sequence labeling (Yang et al., 2017b). The INIT method first trains a model using labeled data from the source domain; next, it initializes a target model with the learned parameters; finally, it fine-tunes the initialized target model using labeled data from the target domain. The MULT method, on the other hand, simultaneously trains two models, one on source data and one on target data, where some parameters are shared across the two models during the learning process. Figure 1 illustrates the two approaches based on the BLSTM-CRF (bidirectional LSTM augmented with a CRF layer) architecture for NER. While such approaches make intuitive sense, they also come with some limitations.
[Figure: BLSTM-CRF architecture on the example sentence "Bob Dylan visited Sweden", with character-level inputs, a bidirectional LSTM layer, and a CRF layer.]
First, these methods utilize shared domain-general word embeddings when learning from both source and target domains. This essentially assumes there is no domain shift of input feature spaces. However, in cases where the two domains are distinct (words may carry different semantics across the two domains), we believe such an assumption can be weak.
Second, existing approaches such as INIT directly augment the LSTM layer with a new output CRF layer when learning models for the target domain. One basic assumption involved here is that the model would be able to reconstruct a new CRF layer that can capture not only the variation of the input features (the hidden states output by the LSTM) to the final CRF layer across two domains, but also the structural dependencies between output nodes in the target output space. We believe this overly restrictive assumption may prevent the model from capturing rich, complex cross-domain information due to the inherent linear nature of the CRF model.
Third, most methods involving cross-domain embeddings often require highly time-consuming retraining of word embeddings on domain-specific corpora. This makes them less realistic in scenarios where source corpora are huge or even inaccessible. Also, MULT-based methods need retraining on the source-domain data for each new target domain. We think these disadvantages hinder neural domain adaptation methods for NER from being practical in real scenarios.
In this work, we propose solutions to address the above-mentioned issues. Specifically, we address the first issue at both the word and sentence level by introducing a word adaptation layer and a sentence adaptation layer respectively, bridging the gap between the two input spaces. Similarly, an output adaptation layer is also introduced between the LSTM and the final CRF layer, capturing the variations in the two output spaces. Furthermore, we introduce a single hyper-parameter that controls how much information we would like to preserve from the model trained on the source domain. These approaches are lightweight, without requiring re-training on data from the source domain. We show through extensive empirical analysis as well as an ablation study that our proposed approach can significantly improve the performance of cross-domain NER over existing transfer approaches.

Base Model
We briefly introduce the BLSTM-CRF architecture for NER, which serves as our base model throughout this paper. Our base model is the combination of two recently proposed popular works for named entity recognition by Lample et al. (2016) and Ma and Hovy (2016). Figure 2 illustrates the BLSTM-CRF architecture.
Following Lample et al. (2016), we develop the comprehensive word representations by concatenating pre-trained word embeddings and character-level word representations, which are constructed by running a BLSTM over sequences of character embeddings. The middle BLSTM layer takes a sequence of comprehensive word representations and produces a sequence of hidden states, representing the contextual information of each token. Finally, following Ma and Hovy (2016), we build the final CRF layer by utilizing potential functions describing local structures to define the conditional probabilities of complete predictions for the given input sentence.
This architecture is selected as our base model due to its generality and representativeness. We note that several recently proposed models (Peters et al., 2017; Liu et al., 2018) are built on it. As our focus is on how to better transfer such architectures for NER, we include further discussions of the model and training details in our supplementary material.
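To make the role of the final CRF layer more concrete, the following minimal sketch decodes the best label sequence from per-token emission scores and label-transition scores with the Viterbi algorithm. The function name `viterbi_decode`, the three-label tag set, and all scores are illustrative assumptions rather than values from the paper; in the actual model the emission scores would be computed from the BLSTM hidden states.

```python
from typing import List, Sequence


def viterbi_decode(emissions: Sequence[Sequence[float]],
                   transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring label sequence of a linear-chain CRF.

    emissions[t][y]   : score of label y at position t (hypothetical inputs)
    transitions[p][y] : score of moving from label p to label y
    """
    n_labels = len(transitions)
    # best[y] = best score of any path that ends in label y at the current position
    best = list(emissions[0])
    backpointers = []
    for t in range(1, len(emissions)):
        new_best, pointers = [], []
        for y in range(n_labels):
            # Choose the previous label that maximizes the path score into y.
            prev = max(range(n_labels), key=lambda p: best[p] + transitions[p][y])
            new_best.append(best[prev] + transitions[prev][y] + emissions[t][y])
            pointers.append(prev)
        best = new_best
        backpointers.append(pointers)
    # Follow the back-pointers from the best final label.
    y = max(range(n_labels), key=lambda k: best[k])
    path = [y]
    for pointers in reversed(backpointers):
        y = pointers[y]
        path.append(y)
    return path[::-1]


# Toy example (illustrative scores only).
LABELS = ["O", "B-PER", "I-PER"]
emissions = [[0.1, 2.0, 0.3],    # "Bob"
             [0.2, 0.9, 1.0],    # "Dylan"
             [2.0, 0.1, 0.1]]    # "visited"
transitions = [[0.5, 0.2, -5.0],  # O -> I-PER is strongly penalized
               [0.0, -1.0, 1.5],
               [0.3, -1.0, 0.5]]
decoded = [LABELS[y] for y in viterbi_decode(emissions, transitions)]
print(decoded)
```

The transition scores are what let the CRF layer model dependencies between adjacent output labels, which is the structural information the target-domain CRF has to relearn when the label set changes.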

Our Approach
We first introduce our proposed three adaptation layers and describe the overall learning process.

Word Adaptation Layer
Most existing transfer approaches use the same domain-general word embeddings for training both source and target models. Assuming that there is little domain shift of input feature spaces, they simplify the problem as homogeneous transfer learning (Weiss et al., 2016). However, this simplifying assumption becomes weak when the two domains have markedly different language styles and involve considerable domain-specific terms that may not share the same semantics across the domains; for example, the term "cell" has different meanings in biomedical articles and product reviews. Furthermore, some important domain-specific words may not be present in the vocabulary of domain-general embeddings. As a result, we have to regard such words as out-of-vocabulary (OOV) words, which may be harmful to the transfer learning process.
Stenetorp et al. (2012) show that domain-specific word embeddings tend to perform better when used in supervised learning tasks. (We also confirm this claim with experiments presented in the supplementary material.) However, maintaining such an improvement in the transfer learning process is very challenging. This is because the two domain-specific embeddings are trained separately on source and target datasets, and therefore the two embedding spaces are heterogeneous. Thus, model parameters trained in each model are heterogeneous as well, which hinders the transfer process. How can we address such challenges while maintaining the improvement gained by using domain-specific embeddings?
We address this issue by developing a word adaptation layer, bridging the gap between the source and target embedding spaces, so that both input features and model parameters become homogeneous across domains. Popular pre-trained word embeddings are usually trained on very large corpora, and thus methods requiring re-training them can be extremely costly and time-consuming. We propose a straightforward, lightweight yet effective method to construct the word adaptation layer that projects the learned embeddings from the target embedding space into the source space. This method only requires some corpus-level statistics from the datasets (used for learning embeddings) to build the pivot lexicon for constructing the adaptation layer.

Pivot Lexicon
A pivot word pair is denoted as $(w_s, w_t)$, where $w_s \in X_S$ and $w_t \in X_T$. Here, $X_S$ and $X_T$ are the source and target vocabularies. A pivot lexicon $P$ is a set of such word pairs.
To construct a pivot lexicon, first, motivated by Tan et al. (2015), we define $P_1$, which consists of the ordinary words that have high relative frequency in both source and target domain corpora:
$$P_1 = \{(w_s, w_t) \mid w_s = w_t,\ f(w_s) \geq \phi_s,\ f(w_t) \geq \phi_t\}$$
where $f(w)$ is the frequency function that returns the number of occurrences of the word $w$ in the dataset, and $\phi_s$ and $\phi_t$ are word frequency thresholds. Optionally, we can utilize a customized word-pair list $P_2$, which gives mappings between domain-specific words across domains, such as normalization lexicons (Han and Baldwin, 2011; Liu et al., 2012). The final lexicon is thus defined as $P = P_1 \cup P_2$.
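The construction of the pivot lexicon can be illustrated with a short sketch. It assumes that the frequency-based part pairs each sufficiently frequent word with itself (consistent with taking the intersection of frequent source and target words), and that the optional customized list is supplied as (source word, target word) pairs. The function name `build_pivot_lexicon`, the toy corpora, and the thresholds are hypothetical.

```python
from collections import Counter
from typing import Iterable, Set, Tuple


def build_pivot_lexicon(source_tokens: Iterable[str],
                        target_tokens: Iterable[str],
                        source_threshold: int,
                        target_threshold: int,
                        custom_pairs: Iterable[Tuple[str, str]] = ()
                        ) -> Set[Tuple[str, str]]:
    """Build a pivot lexicon as the union of two sets of (w_s, w_t) pairs."""
    f_s = Counter(source_tokens)  # word frequencies in the source corpus
    f_t = Counter(target_tokens)  # word frequencies in the target corpus
    # Frequency-based part: words that are frequent in both corpora,
    # paired with themselves.
    p1 = {(w, w) for w in f_s
          if f_s[w] >= source_threshold and f_t.get(w, 0) >= target_threshold}
    # Optional customized part, e.g. entries of a normalization lexicon.
    p2 = set(custom_pairs)
    return p1 | p2


# Toy corpora (illustrative only).
source = "the president said the market fell".split()
target = "the game tonight omg the best".split()
lexicon = build_pivot_lexicon(source, target,
                              source_threshold=2, target_threshold=2,
                              custom_pairs=[("you", "u")])
print(sorted(lexicon))
```

Only corpus-level word counts are needed here, which is why this step does not require access to the labeled NER data or any re-training of embeddings.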

Projection Learning
Mathematically, given the pre-trained domain-specific word embeddings $V_S$ and $V_T$ as well as a pivot lexicon $P$, we would like to learn a linear projection transforming word vectors from $V_T$ into $V_S$. This idea is based on a bilingual word embedding model (Mikolov et al., 2013), but we adapt it to this domain adaptation task.
We first construct two matrices $V^*_S$ and $V^*_T$, where the $i$-th rows of these two matrices are the vector representations for the words in the $i$-th entry of $P$: $P^{(i)} = (w_s^i, w_t^i)$. We use $V^{*i}_S$ to denote the vector representation of the word $w_s^i$, and similarly for $V^{*i}_T$ and $w_t^i$. Next, we learn a transformation matrix $Z$ minimizing the distances between $V^*_S$ and $V^*_T Z$ with the following loss function, where $c_i$ is the confidence coefficient for the entry $P^{(i)}$, highlighting the significance of the entry:
$$\min_Z \sum_{i=1}^{|P|} c_i \, \lVert V^{*i}_T Z - V^{*i}_S \rVert^2$$
We use the normalized frequency ($\bar{f}$) and the Sørensen-Dice coefficient (Sørensen, 1948) to describe the significance of each word pair. The intuition behind this scoring method is that a word pair is more important when both words have high relative frequency in both domains. This is because such words are likely to be more domain-independent, conveying identical or similar semantics across these two different domains.
Now, the matrix $Z$ exactly gives the weights of the word adaptation layer, which takes in the target domain word embeddings and returns the transformed embeddings in the new space. We learn $Z$ with stochastic gradient descent. After learning, the projected new embeddings are $V_T Z$, which are used in the subsequent steps as illustrated in Figure 2 and Figure 3. With such a word-level input-space transformation, the parameters of the pre-trained source models based on $V_S$ remain relevant and can be used in subsequent steps.
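The projection learning step can be sketched in a few lines of plain Python. The sketch minimizes a confidence-weighted sum of squared distances between projected target vectors and source vectors with stochastic gradient descent. Several details are assumptions made for illustration: the squared Euclidean distance, the Dice-style score `2ab / (a + b)` over two normalized frequencies used as the confidence coefficient, the learning rate and number of epochs, and all function names and toy vectors.

```python
import random
from typing import List

Matrix = List[List[float]]


def dice_confidence(f_s: float, f_t: float) -> float:
    """Dice-style score of two normalized frequencies in [0, 1] (assumed form)."""
    return 2.0 * f_s * f_t / (f_s + f_t) if (f_s + f_t) > 0 else 0.0


def project(vec: List[float], z: Matrix) -> List[float]:
    """Row vector times matrix: vec (d_t) times z (d_t x d_s)."""
    return [sum(vec[k] * z[k][j] for k in range(len(vec)))
            for j in range(len(z[0]))]


def weighted_loss(v_s: Matrix, v_t: Matrix, conf: List[float], z: Matrix) -> float:
    """Sum over pivot pairs of c_i * ||v_t[i] Z - v_s[i]||^2."""
    total = 0.0
    for s, t, c in zip(v_s, v_t, conf):
        total += c * sum((p - q) ** 2 for p, q in zip(project(t, z), s))
    return total


def learn_projection(v_s: Matrix, v_t: Matrix, conf: List[float],
                     lr: float = 0.1, epochs: int = 500, seed: int = 0) -> Matrix:
    """Learn Z by SGD, visiting one pivot pair per update."""
    rng = random.Random(seed)
    d_t, d_s = len(v_t[0]), len(v_s[0])
    z = [[0.0] * d_s for _ in range(d_t)]
    order = list(range(len(v_s)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            residual = [p - q for p, q in zip(project(v_t[i], z), v_s[i])]
            # d/dZ[k][j] of c_i * ||t Z - s||^2 equals 2 * c_i * t[k] * residual[j].
            for k in range(d_t):
                for j in range(d_s):
                    z[k][j] -= lr * 2.0 * conf[i] * v_t[i][k] * residual[j]
    return z


# Toy pivot pairs: the target space is a rotated copy of the source space.
v_source = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v_target = [[0.0, 1.0], [-1.0, 0.0], [-1.0, 1.0]]
confidence = [dice_confidence(0.9, 0.8),
              dice_confidence(0.5, 0.4),
              dice_confidence(0.2, 0.3)]
z_learned = learn_projection(v_source, v_target, confidence)
final_loss = weighted_loss(v_source, v_target, confidence, z_learned)
print(final_loss)
```

Because the source vectors never change, a source model trained on them can consume the projected target vectors directly; only the small matrix `z_learned` is new.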
We would like to highlight that, unlike many previous approaches to learning cross-domain word embeddings (Bollegala et al., 2015; Yang et al., 2017a), the learning of our word adaptation layer involves no modifications to the source-domain embedding spaces. It also requires no re-training of the embeddings based on the target-domain data. Such a distinctive advantage of our approach comes with some important practical implications: it essentially enables the transfer learning process to work directly on top of a well-trained model by performing adaptation without involving significant re-training efforts. For example, the existing model could be one that has already gone through extensive training, tuning and testing for months based on large datasets with embeddings learned from a particular domain (which may be different from the target domain).

Sentence Adaptation Layer
The word adaptation layer serves as a way to bridge the gap between heterogeneous input spaces, but it does so only at the word level and is context independent. We can take a step further and address the input space mismatch issue at the sentence level, capturing the contextual information in the learning process of such a mapping based on labeled data from the target domain. To this end, we add a BLSTM layer right after the word adaptation layer (see Figure 2), and we name it the sentence adaptation layer.
This layer pre-encodes the sequence of projected word embeddings for each target instance, before they serve as inputs to the LSTM layer inside the base model. The hidden states for each word generated from this layer can be seen as further transformed word embeddings capturing target-domain specific contextual information, where such a further transformation is learned in a supervised manner based on target-domain annotations. We also believe that with such a sentence adaptation layer, the OOV issue mentioned above may be partially alleviated. Without such a layer, OOV words would all be mapped to a single fixed vector representation, which is not desirable; with it, each OOV word is assigned its own "transformed" embedding based on its respective contextual information.

Output Adaptation Layer
We focus on the problem of performing domain adaptation for NER under a general setup, where we assume the sets of output labels for the source and target domains could be different. Due to the heterogeneity of the output spaces, we have to reconstruct the final CRF layer in the target models.
However, we believe solely doing this may not be enough to address the labeling difference problem highlighted by Jiang (2008), as the two output spaces may be very different. For example, in the sentence "Taylor released her new songs", "Taylor" should be labeled as "MUSIC-ARTIST" instead of "PERSON" in some social media NER datasets; this suggests that re-classifying with contextual information is necessary. In another example, we have a tweet "so...#kktny in 30 mins?"; here we should recognize "#kktny" as a CREATIVE-WORK entity, but there are few similar instances in newswire data, indicating that context-aware re-recognition is also needed. How can we perform re-classification and re-recognition with contextual information in the target model? We answer this question by inserting a BLSTM output adaptation layer in the base model, right before the final CRF layer, to capture variations in outputs with contextual information. This output adaptation layer takes the output hidden states from the BLSTM layer of the base model as its inputs, producing a sequence of new hidden states for the reconstructed CRF layer. Without this layer, the learning process directly updates the pre-trained parameters of the base model, which may lead to loss of knowledge that can be transferred from the source domain.

Overall Learning Process
Figure 3 depicts our overall learning process. We initialize the base model with the parameters from the pre-trained source model, and tune the weights for all layers, including the layers from the base model, the sentence and output adaptation layers, and the CRF layer. We use different learning rates when updating the weights in different layers using Adam (Kingma and Ba, 2014). In all our experiments, we fix the weights of the word adaptation layer to $Z$ to avoid over-fitting. This allows us to preserve the knowledge learned from the source domain while effectively leveraging the limited training data from the target domain.
Similar to Yang et al. (2017b), who utilize a hyper-parameter $\lambda$ for controlling the transferability, we also introduce a hyper-parameter $\psi$ that serves a similar purpose: it captures the relation between the learning rate used for the base model ($\alpha_{base}$) and the learning rate used for the adaptation layers plus the final CRF layer ($\alpha_{adapt}$), namely $\alpha_{base} = \psi \cdot \alpha_{adapt}$. If $\psi = 0$, we fix the learned parameters (from the source domain) of the base model completely (Ours-Frozen). If $\psi = 1$, we treat all the layers equally (Ours-FineTune).
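As a small illustration of how a single hyper-parameter can tie the two learning rates together, the sketch below derives per-layer learning rates from the adaptation learning rate and the scaling hyper-parameter (named `psi` in the code). The group names, the function name `layer_learning_rates`, and the example values are hypothetical; the word adaptation layer receives a rate of zero because its weights are kept fixed.

```python
from typing import Dict


def layer_learning_rates(alpha_adapt: float, psi: float) -> Dict[str, float]:
    """Derive per-layer learning rates from alpha_adapt and the scaling factor psi.

    The base model is updated with alpha_base = psi * alpha_adapt, while the
    adaptation layers and the CRF layer use alpha_adapt directly.
    """
    return {
        "word_adaptation": 0.0,             # fixed projection, never updated
        "sentence_adaptation": alpha_adapt,
        "base_model": psi * alpha_adapt,    # psi = 0 freezes, psi = 1 fine-tunes fully
        "output_adaptation": alpha_adapt,
        "crf": alpha_adapt,
    }


# Hypothetical values for illustration.
frozen = layer_learning_rates(alpha_adapt=0.001, psi=0.0)
partial = layer_learning_rates(alpha_adapt=0.001, psi=0.6)
fine_tune = layer_learning_rates(alpha_adapt=0.001, psi=1.0)
for name, rates in [("frozen", frozen), ("partial", partial), ("fine-tune", fine_tune)]:
    print(name, rates["base_model"], rates["crf"])
```

In a typical deep learning toolkit, such a dictionary would be translated into separate optimizer parameter groups, one per layer.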

Experimental Setup
In this section, we present the setup of our experiments. We show our choice for source and target domains, resources for embeddings, the datasets for evaluation and finally the baseline methods.

Source and Target Domains
We evaluate our approach with the setting that the source domain is newswire and the target domain is social media. We designed this experimental setup based on the following considerations:
• Challenging: Newswire is a well-studied domain for NER and existing neural models perform very well (around 90.0 F1-score (Ma and Hovy, 2016)). However, the performance drops dramatically on social media data (around 60.0 F-score (Strauss et al., 2016)).
• Important: Social media is a rich soil for text mining (Petrovic et al., 2010; Rosenthal and McKeown, 2015; Wang and Yang, 2015), and NER is of significant importance for other information extraction tasks in social media (Ritter et al., 2011a; Peng and Dredze, 2016; Chou et al., 2016).
• Representative: The noisy nature of user generated content as well as emerging entities with novel surface forms make the domain shift very salient (Finin et al., 2010; Han et al., 2016).
Nevertheless, the techniques developed in this paper are domain independent and thus can be used for other learning tasks across any two domains so long as we have the necessary resources.

Resources for Cross-domain Embeddings
We utilize GloVe (Pennington et al., 2014) to train domain-specific and domain-general word embeddings from different corpora, denoted as follows: 1) source emb, which is trained on the newswire domain corpus (New York Times and DailyMail articles); 2) target emb, which is trained on a social media corpus (Archive Team's Twitter stream grab); 3) general emb, which is pre-trained on CommonCrawl containing both formal and user-generated content. We obtain the intersection of the top 5K words from the source and target vocabularies sorted by frequency to build $P_1$. For $P_2$, we utilize an existing Twitter normalization lexicon containing 3,802 word pairs (Liu et al., 2012). More details are in the supplementary material.

NER Datasets for Evaluation
For the source newswire domain, we use the following two datasets: OntoNotes-nw, the newswire section of the publicly available OntoNotes 5.0 release (ON) (Weischedel et al., 2013), as well as the CoNLL03 NER dataset (CO) (Sang and Meulder, 2003). For the first dataset, we randomly split the data into three sets: 80% for training, 15% for development and 5% for testing. For the second dataset, we follow the provided standard train-dev-test split. For the target domain, we consider the following two datasets: Ritter11 (RI) (Ritter et al., 2011b) and WNUT16 (WN) (Strauss et al., 2016), both of which are publicly available. The statistics of the four datasets used in the paper are shown in Table 1.
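A random 80%/15%/5% split of the kind described above can be produced as follows. This is only a sketch: the seed, the function name `split_dataset`, and the use of integer placeholders for sentences are assumptions, and the actual split used in the experiments is not reproduced by this code.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(sentences: Sequence[T], seed: int = 13,
                  train_frac: float = 0.80, dev_frac: float = 0.15
                  ) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle once, then cut into train / development / test portions."""
    items = list(sentences)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split reproducible
    n_train = round(len(items) * train_frac)
    n_dev = round(len(items) * dev_frac)
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]      # the remainder, about 5%
    return train, dev, test


# Placeholder "sentences" for illustration.
train_set, dev_set, test_set = split_dataset(list(range(100)))
print(len(train_set), len(dev_set), len(test_set))
```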

Baseline Transfer Approaches
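We present the baseline approaches, which were originally investigated by Mou et al. (2016) and Yang et al. (2017b).
INIT: As described in the Introduction, the target model is initialized with the parameters learned from the source-domain data $D_S$ and then trained on the target-domain data $D_T$. We consider two variants: INIT-Frozen, which updates only the final CRF layer, and INIT-FineTune, which also updates the LSTM layers of the base model.
MULT: Mou et al. (2016) and Yang et al. (2017b) follow Collobert and Weston (2008) and use a hyper-parameter $\lambda$ as the probability of choosing an instance from $D_S$ instead of $D_T$ to optimize the model parameters. By tuning the hyper-parameter $\lambda$, the multi-task learning process tends to perform better in target domains. Note that this method needs re-training of the source model with $D_S$ every time we would like to build a new target model, which can be time-consuming especially when $D_S$ is large.
MULT+INIT: This is a combination of the above two methods. We first use INIT to initialize the target model. Afterwards, we train the two models as MULT does.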
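The instance selection used by MULT can be sketched as a simple sampling loop, where a probability (named `lam` in the code) decides whether the next training instance comes from the source or the target data. The function name `mult_instance_stream`, the placeholder instances, and the step count are illustrative assumptions; the parameter updates themselves are omitted.

```python
import random
from typing import Iterator, Sequence, Tuple


def mult_instance_stream(source_data: Sequence[str],
                         target_data: Sequence[str],
                         lam: float, steps: int,
                         seed: int = 0) -> Iterator[Tuple[str, str]]:
    """Yield (domain, instance) pairs; the source domain is chosen with probability lam."""
    rng = random.Random(seed)
    for _ in range(steps):
        if rng.random() < lam:
            yield "source", rng.choice(source_data)
        else:
            yield "target", rng.choice(target_data)


# Placeholder instances for illustration.
source_sentences = ["news sentence 1", "news sentence 2"]
target_sentences = ["tweet 1", "tweet 2", "tweet 3"]
stream = list(mult_instance_stream(source_sentences, target_sentences,
                                   lam=0.3, steps=10000))
source_share = sum(1 for domain, _ in stream if domain == "source") / len(stream)
print(round(source_share, 2))
```

The loop makes the cost of MULT visible: every new target model requires iterating over source-domain instances again.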

Main Results
We primarily focus on the discussion of experiments with a particular setup where $D_S$ is set to OntoNotes-nw and $D_T$ is Ritter11. In the experiments, "in-domain" means we only use $D_T$ to train our base model without any transfer from the source domain. "$\Delta$" represents the amount of improvement we can obtain (in terms of F-measure) using transfer learning over "in-domain" for each transfer method. The hyper-parameters $\psi$ and $\lambda$ are tuned from {0.1, 0.2, ..., 1.0} on the development set, and we show the results based on the tuned hyper-parameters.
We conduct the first set of experiments to evaluate the performance of different transfer methods under the assumption of homogeneous input spaces. Thus, we utilize the same word embeddings (general emb) for training both source and target models. Consequently, we remove the word adaptation layer (cd emb) in our approach under this setting. The results are listed in Table 2. As we can observe, the INIT-Frozen method leads to a slight "negative transfer", which is also reported in the experiments of Mou et al. (2016). This indicates that solely updating the parameters of the final CRF layer is not enough for performing re-classification and re-recognition of the named entities for the new target domain. The INIT-FineTune method yields better results because it also updates the parameters of the middle LSTM layers in the base model to mitigate the heterogeneity. The MULT and MULT+INIT methods yield higher results, partly due to the fact that they can better control the amount of transfer through tuning the hyper-parameter. Our proposed transfer approach outperforms all these baseline approaches. It not only controls the amount of transfer across the two domains but also explicitly captures variations in the input and output spaces when there is significant domain shift.
We use the second set of experiments to understand the effectiveness of each method when dealing with heterogeneous input spaces. We use source emb for training source models and target emb for learning the target models. From the results shown in Table 3, we can see that all the baseline methods degrade when the two word embeddings used for training source models and learning target models are different from each other. The heterogeneity of the input feature space prevents them from making better use of the information from source domains. However, with the help of the word and sentence adaptation layers, our method achieves better results. The experiment on learning without the word adaptation layer also confirms the importance of such a layer. (We use the approximate randomization test (Yeh, 2000) to assess the statistical significance of the difference between "Ours" and "MULT+INIT"; our improvements are statistically significant with a p-value of 0.0033.) Our results are also comparable to the results when the cross-domain embedding method of Yang et al. (2017a) is used instead of the word adaptation layer. However, as we mentioned earlier, their method requires re-training the embeddings on the target domain, and is more expensive.

Ablation Test
To investigate the effectiveness of each component of our method, we conduct an ablation test based on our full model (F = 66.40) reported in Table 3. We use $\Delta$ to denote the difference in performance between each setting and our full model. The results of the ablation test are shown in Table 4. We first set $\psi$ to 0 and 1 respectively to investigate the two special variants (Ours-Frozen, Ours-FineTune) of our method. We find that both perform worse than using the optimal $\psi$ (0.6).
One natural concern is whether our performance gain is truly caused by the effective approach for cross-domain transfer learning, or is simply because we use a new architecture with more layers (built on top of the base model) for learning the target model. To understand this, we carry out an experiment named "w/o transfer" by setting $\psi$ to 1, where we randomly initialize the parameters of the middle BLSTMs in the target model instead of using source model parameters. Results show that such a model does not perform well, confirming the effectiveness of transfer learning with our proposed adaptation layers. Results also confirm the importance of all three adaptation layers that we introduced. Learning the confidence scores ($c_i$) and employing the optional $P_2$ are also helpful, but they appear to play less significant roles.
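For readers unfamiliar with the approximate randomization test used for the significance claim, a minimal paired version is sketched below. It assumes that each system's output is summarized per sentence as counts of true positives, false positives and false negatives, that micro-averaged F1 is the test statistic, and that the p-value is estimated as (r + 1) / (N + 1), a common convention. The function names, the number of trials, and the toy counts are illustrative and do not correspond to the systems evaluated in the paper.

```python
import random
from typing import List, Sequence, Tuple

Counts = Tuple[int, int, int]  # (true positives, false positives, false negatives)


def micro_f1(counts: Sequence[Counts]) -> float:
    """Micro-averaged F1 over per-sentence counts."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def approximate_randomization(system_a: Sequence[Counts],
                              system_b: Sequence[Counts],
                              trials: int = 1000, seed: int = 0) -> float:
    """Estimate the p-value for the observed F1 difference between two systems."""
    rng = random.Random(seed)
    observed = abs(micro_f1(system_a) - micro_f1(system_b))
    at_least_as_large = 0
    for _ in range(trials):
        shuffled_a: List[Counts] = []
        shuffled_b: List[Counts] = []
        for ca, cb in zip(system_a, system_b):
            # Swap the two systems' outputs for this sentence with probability 0.5.
            if rng.random() < 0.5:
                ca, cb = cb, ca
            shuffled_a.append(ca)
            shuffled_b.append(cb)
        if abs(micro_f1(shuffled_a) - micro_f1(shuffled_b)) >= observed:
            at_least_as_large += 1
    return (at_least_as_large + 1) / (trials + 1)


# Toy outputs: system A is right on every sentence, system B on none.
perfect = [(1, 0, 0)] * 50
wrong = [(0, 1, 1)] * 50
p_different = approximate_randomization(perfect, wrong)
p_identical = approximate_randomization(perfect, perfect)
print(p_different, p_identical)
```

A small p-value indicates that the observed difference would rarely arise if the two systems' outputs were interchangeable.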

Additional Experiments
As shown in Table 5, we conduct some additional experiments to investigate the significance of our improvements on different source-target domains, and whether the improvement is simply because of the increased model expressiveness due to a larger number of parameters. We first set the hidden dimension of the sentence adaptation layer to be the same as the dimension of the source-domain word embeddings, which is 200 (I/M-200). The dimension used for the output adaptation layer is just half of that of the base BLSTM model. Overall, our model involves roughly 117.3% more parameters than the base model. To understand the effect of a larger parameter size, we further experimented with a hidden unit size of 300 (I/M-300), leading to a parameter size of 213,750 that is comparable to "Ours" (203,750). As we can observe, our approach outperforms these approaches consistently in the four settings. More experiments with other settings can be found in the supplementary material.

Effect of Target-Domain Data Size
To assess the effectiveness of our approach when we have different amounts of training data from the target domain, we conduct additional experiments by gradually increasing the amount of the target training data from 10% to 100%. We again use OntoNotes-nw and Ritter11 as $D_S$ and $D_T$, respectively. Results are shown in Figure 4. Experiments for INIT and MULT are based on the respective best settings used in Table 5. We find that the improvements of baseline methods tend to be smaller when the target training set becomes larger. This is partly because INIT and MULT do not explicitly preserve the parameters from source models in the constructed target models. Thus, the transferred information is diluted as we train the target model with more data. In contrast, our transfer method explicitly saves the transferred information in the base part of our target model, and we use separate learning rates to help the target model preserve the transferred knowledge. Similar experiments on other datasets are shown in the supplementary material.

Effect of the Hyper-parameter
We present a set of experiments around the hyper-parameter $\psi$ in Figure 5. Such experiments over different datasets can shed some light on how to select this hyper-parameter. We find that when the target dataset is small (Ritter11), the best values of $\psi$ are 0.5 and 0.6 respectively for the two source domains, whereas when the target dataset is larger (WNUT16), the best values become 0.7 and 0.8. The results suggest that the optimal $\psi$ tends to be relatively larger when the target dataset is larger.

Related Work
Domain adaptation and transfer learning have been popular topics that have been extensively studied in the past few years (Pan and Yang, 2010). For well-studied conventional feature-based models in NLP, there are various classic transfer approaches, such as EasyAdapt (Daumé, 2007), instance weighting (Jiang and Zhai, 2007) and structural correspondence learning (Blitzer et al., 2006). Fewer works have focused on transfer approaches for neural models in NLP. Mou et al. (2016) use intuitive transfer methods (INIT and MULT) to study the transferability of neural network models for the sentence (pair) classification problem; Lee et al. (2017) utilize the INIT method on highly related datasets of electronic health records to study their specific de-identification problem. Yang et al. (2017b) use the MULT approach in sequence tagging tasks including named entity recognition. Following the MULT scheme, Wang et al. (2018) introduce a label-aware mechanism into maximum mean discrepancy (MMD) to explicitly reduce domain shift between the same labels across domains in medical data. Their approach requires the output space to be the same in both source and target domains. Note that in the scenario considered in our paper the output spaces of the two domains are different.
None of these existing works use domain-specific embeddings for different domains, and they use the same neural model for source and target models. In contrast, our word adaptation layer opens the opportunity to use domain-specific embeddings. Our approach also addresses the domain shift problem at both the input and output levels by reconstructing target models with our specifically designed adaptation layers. The hyper-parameter $\psi$ in our proposed method and $\lambda$ in MULT both control the knowledge transfer from the source domain in the transfer learning process. While our method works on top of an existing pre-trained source model directly, MULT needs re-training with source domain data each time a target model is trained.
Fang and Cohn (2017) add an "augmented layer" before their final prediction layer for cross-lingual POS tagging; it is a simple multilayer perceptron performing local adaptation for each token separately, ignoring contextual information. In contrast, we employ a BLSTM layer due to its ability to capture contextual information, which was recently shown to be crucial for sequence labeling tasks such as NER (Ma and Hovy, 2016; Lample et al., 2016). We also notice that a similar idea to ours has been used in the recently proposed Deliberation Network (Xia et al., 2017) for the sequence generation task, where a second-pass decoder is added to a first-pass decoder to polish sequences generated by the latter.
Our proposal to learn the word adaptation layer is inspired by two prior studies. Fang and Cohn (2017) use cross-lingual word embeddings to obtain distant supervision for target languages. Yang et al. (2017a) propose to re-train word embeddings on the target domain by using regularization terms based on the source-domain embeddings, where some hyper-parameter tuning based on down-stream tasks is required. Our word adaptation layer serves as a linear transformation (Mikolov et al., 2013), which is learned based on corpus-level statistics. Although there are alternative methods that also learn a mapping between embeddings learned from different domains (Faruqui and Dyer, 2014; Artetxe et al., 2016; Smith et al., 2017), such methods usually involve modifying source domain embeddings, and thus re-training of the source model based on the modified source embeddings would be required for the subsequent transfer process.

Conclusion
We proposed a novel, lightweight transfer learning approach for cross-domain NER with neural networks. Our transfer method performs adaptation across two domains using adaptation layers added on top of the existing neural model. Through extensive experiments, we demonstrated the effectiveness of our approach, reporting better results than existing transfer methods. Our approach is general and can potentially be applied to other cross-domain structured prediction tasks. Future directions include investigating alternative neural architectures such as convolutional neural networks (CNNs) as adaptation layers, as well as learning the optimal value of $\psi$ from the data automatically rather than regarding it as a hyper-parameter.