Multi-Cell Compositional LSTM for NER Domain Adaptation

Cross-domain NER is a challenging yet practical problem. Entity mentions can be highly different across domains, but the correlations between entity types are relatively more stable. We investigate a multi-cell compositional LSTM structure for multi-task learning, modeling each entity type using a separate cell state. With the help of entity typed units, cross-domain knowledge transfer can be made at the entity type level. Theoretically, the resulting distinct feature distributions for each entity type make the model more powerful for cross-domain transfer. Empirically, experiments on four few-shot and zero-shot datasets show that our method significantly outperforms a series of multi-task learning methods and achieves the best results.


Introduction
Named entity recognition (NER) is a fundamental task in information extraction, providing necessary information for relation classification (Mooney and Bunescu, 2006), event detection (Popescu et al., 2011), sentiment classification (Mitchell et al., 2013), etc. NER is challenging because entity mentions form an open set and can be ambiguous in the context of a sentence. Due to the relatively high cost of manual labeling, cross-domain NER has received increasing research attention. Recently, multi-task learning methods (Yang et al., 2017; Wang et al., 2018, 2019; Zhou et al., 2019; Jia et al., 2019) have achieved great success for cross-domain NER. Other methods such as fine-tuning (Rodriguez et al., 2018), share-private architectures (Cao et al., 2018; Lin and Lu, 2018) and knowledge distillation (Yang et al., 2019) have also shown effectiveness for cross-domain NER.
There are three main sources of challenges in cross-domain NER. First, instances of the same entity type can be different across domains. For example, typical person names include "Trump" and "Clinton" in the political news domain, but "James" and "Trout" in the sports domain. Second, different entity types can exhibit different degrees of dissimilarity across domains. For example, a large number of location names, such as "Barcelona" and "Los Angeles", are shared between the political news domain and the sports domain, but the case is very different for organization names. Third, even the sets of entity types can differ across domains. For example, while disease names are an entity type in the medical domain, they are not in the biochemistry domain.
We investigate a multi-cell compositional LSTM structure to deal with the above challenges by separately and simultaneously considering the possibilities of all entity types for each word when processing a sentence. As shown in Figure 1, the main idea is to extend a standard LSTM structure by using a separate LSTM cell to model the state for each entity type in a recurrent step. Intuitively, the model differs from the baseline LSTM by simultaneously considering all possible entity types. A compositional cell (C cell) combines the entity typed cells (ET cells) for the next recurrent state transition by calculating a weighted sum of the ET cells, where the weight of each ET cell corresponds to the probability of its entity type. Different from naive parameter sharing on LSTM (Yang et al., 2017), the source and target domains in our multi-task learning framework share only the ET cells corresponding to the same entity types and the same C cell, but not the domain-specific ET cells. In this way, our model learns domain-invariant knowledge at the entity type level. Intuitively, our model addresses the above challenges by modeling entity type sequences more explicitly, which are relatively more robust across domains compared with entity instances. For example, the pattern "PER O PER O LOC" can exist in both the political and sports domains, despite the fact that the specific PER instances can be different.

[Figure 1: Overall structures (word embeddings, character CNN, source- and target-domain components). The red, blue and purple in (c) represent target, source and shared parts, respectively.]

In addition, thanks to the merging operation at each step, our method effectively encodes multiple entity type sequences in linear time, resulting in a sausage-shaped multi-cell LSTM. This allows us to learn distributional differences between entity type chains across domains, which effectively reduces confusion between different entities when the source and target domains have different entity types in few-shot transfer, where the target domain has a small amount of training data. In zero-shot transfer, where the target domain has no training data, a target-domain LM transfers source-domain knowledge. This knowledge transfer is also at the entity type level, thanks to the compositional weights, which are supervised by gold-standard entity type knowledge during source-domain training.
Theoretically, our method creates distinct feature distributions for each entity type across domains, which gives better transfer learning power compared to representation networks that do not explicitly differentiate entity types (§3.4). Empirically, experiments on four few-shot and zero-shot datasets show that our method gives significantly better results compared to standard BiLSTM baselines with the same numbers of parameters. In addition, we obtain the best results on four cross-domain NER datasets. The code is released at https://github.com/jiachenwestlake/Multi-Cell_LSTM.

Method
Given a sentence x = [x_1, ..., x_m], the vector representation w_t of each word x_t is the concatenation of its word embedding and the output of a character-level CNN, following previous work. A bi-directional LSTM encoder is used to obtain sequence-level features h = [h_1, ..., h_m]. We use the forward LSTM component to explain the details in the following subsections. Finally, a CRF layer outputs the label sequence y = [l_1, ..., l_m].
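To make the input representation concrete, the PyTorch sketch below builds w_t by concatenating a word embedding with a max-pooled character CNN feature. The module name, dimensions and pooling choice are our own illustrative assumptions, not the released code.

```python
import torch
import torch.nn as nn

class WordRepresentation(nn.Module):
    """Word embedding + character-level CNN, concatenated (illustrative sketch)."""
    def __init__(self, vocab_size=10000, char_vocab_size=100,
                 word_dim=100, char_dim=30, char_channels=50, kernel=3):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        self.char_emb = nn.Embedding(char_vocab_size, char_dim)
        self.char_cnn = nn.Conv1d(char_dim, char_channels, kernel, padding=1)

    def forward(self, word_ids, char_ids):
        # word_ids: (batch, seq_len); char_ids: (batch, seq_len, max_chars)
        b, m, n = char_ids.shape
        w = self.word_emb(word_ids)                               # (b, m, word_dim)
        c = self.char_emb(char_ids).view(b * m, n, -1)            # (b*m, n, char_dim)
        c = self.char_cnn(c.transpose(1, 2)).max(dim=2).values    # max-pool over characters
        return torch.cat([w, c.view(b, m, -1)], dim=-1)           # (b, m, word_dim + char_channels)

# usage sketch with random indices
rep = WordRepresentation()
w_t = rep(torch.randint(0, 10000, (2, 5)), torch.randint(0, 100, (2, 5, 12)))
print(w_t.shape)  # torch.Size([2, 5, 150])
```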

Baseline LSTM
We adopt the standard LSTM (Graves and Schmidhuber, 2005) as the baseline. At each time step t (t ∈ [1, ..., m]), the baseline calculates the current hidden vector h^(t) based on a memory cell c^(t). In particular, an input gate i^(t), an output gate o^(t) and a forget gate f^(t) are calculated from w^(t) and h^(t−1), where [W; b] are trainable parameters and σ represents the sigmoid activation function.
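For reference, the omitted gate equations presumably follow the standard LSTM formulation; a reconstruction is given below (the exact parameterization, e.g. whether the gates use separate weight matrices, is an assumption):

```latex
\begin{aligned}
\mathbf{i}^{(t)} &= \sigma\big(\mathbf{W}_i[\mathbf{w}^{(t)};\mathbf{h}^{(t-1)}] + \mathbf{b}_i\big), &
\mathbf{f}^{(t)} &= \sigma\big(\mathbf{W}_f[\mathbf{w}^{(t)};\mathbf{h}^{(t-1)}] + \mathbf{b}_f\big), \\
\mathbf{o}^{(t)} &= \sigma\big(\mathbf{W}_o[\mathbf{w}^{(t)};\mathbf{h}^{(t-1)}] + \mathbf{b}_o\big), &
\tilde{\mathbf{c}}^{(t)} &= \tanh\big(\mathbf{W}_c[\mathbf{w}^{(t)};\mathbf{h}^{(t-1)}] + \mathbf{b}_c\big), \\
\mathbf{c}^{(t)} &= \mathbf{f}^{(t)} \odot \mathbf{c}^{(t-1)} + \mathbf{i}^{(t)} \odot \tilde{\mathbf{c}}^{(t)}, &
\mathbf{h}^{(t)} &= \mathbf{o}^{(t)} \odot \tanh\big(\mathbf{c}^{(t)}\big).
\end{aligned}
```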

Multi-Cell Compositional LSTM
As shown in Figure 1 (b), we split the cell computation of the baseline LSTM unit into l copies, each corresponding to one entity type. These cells are shown in black. A compositional cell (shown in red) is used to merge the entity typed LSTM cells into one cell state for calculating the final hidden vector. In this process, a weight is assigned to each entity type according to the local context.

Entity typed LSTM cells (ET cells). Given w^(t) and ĥ^(t−1), the input gate i_k^(t) and the temporary cell state of the k-th ET cell are computed in the same way as in the baseline LSTM, where [W_k; b_k] are the trainable parameters specific to the k-th ET cell.
Then a copy of the compositional memory cell state ĉ^(t−1) from the previous time step (t−1) is combined with the temporary cell state to obtain the updated ET cell state c_k^(t).
The above operations are repeated for all l ET cells with the same ĉ^(t−1). We finally obtain a list of ET cell states [c_1^(t), ..., c_l^(t)].

Compositional LSTM cell (C cell). To facilitate integration of the ET cells, an input gate î^(t) and a temporary cell state of the compositional cell (C cell) are computed similarly to those of the ET cells, and an additional output gate ô^(t) is computed, where [Ŵ; b̂] are the trainable parameters of the C cell.

Merging. We use the temporary cell state of the C cell and the ET cell states [c_1^(t), ..., c_l^(t)] to obtain a compositional representation. To this end, additive attention (Bahdanau et al., 2015) is used, which achieves better results on our development sets than other attention mechanisms (Vaswani et al., 2017). The attention weight α_k^(t) reflects the similarity between the temporary C cell state and the k-th ET cell state c_k^(t), and is computed with trainable parameters [P; Q; v]. The memory cell state of the C cell ĉ^(t) is then updated as the attention-weighted sum of the ET cell states, and the hidden state ĥ^(t) is finally obtained from ĉ^(t) and the output gate ô^(t).
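The following PyTorch sketch illustrates one forward step of such a multi-cell unit. It is a minimal re-implementation under assumptions: the coupled input/forget gating of the ET cells, the exact C cell update and all names (MultiCellLSTMStep, et_gate, c_gate) are ours, since the original equations are not reproduced above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiCellLSTMStep(nn.Module):
    """One forward step of a multi-cell compositional LSTM (illustrative sketch).
    Each of the l entity types has its own ET cell; a compositional C cell merges
    the ET cell states with additive attention."""
    def __init__(self, input_dim, hidden_dim, num_types):
        super().__init__()
        self.num_types = num_types
        # per-type (ET cell) parameters: input gate and candidate state
        self.et_gate = nn.ModuleList(
            [nn.Linear(input_dim + hidden_dim, 2 * hidden_dim) for _ in range(num_types)])
        # C cell parameters: input gate, candidate state and output gate
        self.c_gate = nn.Linear(input_dim + hidden_dim, 3 * hidden_dim)
        # additive attention parameters [P; Q; v]
        self.P = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.Q = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.v = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, w_t, h_prev, c_prev):
        x = torch.cat([w_t, h_prev], dim=-1)
        et_states = []
        for k in range(self.num_types):
            i_k, cand_k = self.et_gate[k](x).chunk(2, dim=-1)
            i_k, cand_k = torch.sigmoid(i_k), torch.tanh(cand_k)
            # each ET cell updates a copy of the previous C cell state
            et_states.append((1 - i_k) * c_prev + i_k * cand_k)
        et_states = torch.stack(et_states, dim=1)                 # (batch, l, hidden)

        i_c, cand_c, o_c = self.c_gate(x).chunk(3, dim=-1)
        cand_c = (1 - torch.sigmoid(i_c)) * c_prev + torch.sigmoid(i_c) * torch.tanh(cand_c)

        # additive attention between the temporary C cell state and the ET cell states
        scores = self.v(torch.tanh(self.P(cand_c).unsqueeze(1) + self.Q(et_states))).squeeze(-1)
        alpha = F.softmax(scores, dim=1)                          # (batch, l), entity type weights
        c_t = (alpha.unsqueeze(-1) * et_states).sum(dim=1)        # weighted sum of ET cell states
        h_t = torch.sigmoid(o_c) * torch.tanh(c_t)
        return h_t, c_t, alpha
```

In a full encoder, this step would be unrolled over the sentence in both directions, with the attention weights alpha also exposed to the auxiliary supervision described in the next section.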

Training Tasks
Below we discuss the two auxiliary tasks before introducing the main NER task. The auxiliary tasks are designed in addition to the main NER task in order to better extract entity type knowledge from a set of labeled training data for training the ET cells and the C cell. Formally, denote a training set as D_ent = {(x_n, e_n)}_{n=1}^N, where each training instance consists of a word sequence x = [x_1, ..., x_m] and its corresponding entity types e = [e_1, ..., e_m]. Here each entity type e_t is a label such as PER, O or LOC, without segmentation tags (e.g., B/I/E).

Entity type prediction. Given the ET cell states [c_1^(t), ..., c_l^(t)], we define the aligned entity type distribution for x_t by scoring each ET cell state with parameters [w_k; b_k] specific to the k-th entity type e_k, followed by a softmax. The negative log-likelihood loss L_ent is used for training on D_ent.

Attention scoring. Similar to the entity type prediction task, given the attention scores between the temporary C cell state and the ET cell states in Equation 6, we convert the attention scores into entity aligned distributions for x_t using softmax, and use a negative log-likelihood loss L_atten analogous to that of entity type prediction. While entity type prediction brings supervised information to guide the ET cells, attention scoring introduces supervision to guide the C cell.

NER. This is the main task across domains. Following standard CRFs (Ma and Hovy, 2016), the output probability p(y|x) over label sequences y = [l_1, ..., l_m] is

$$p(y|x) = \frac{\exp\big(\sum_{t=1}^{m} (\mathbf{w}_{CRF}^{l_t} \cdot \mathbf{h}_t + b_{CRF}^{(l_{t-1}, l_t)})\big)}{\sum_{y'} \exp\big(\sum_{t=1}^{m} (\mathbf{w}_{CRF}^{l'_t} \cdot \mathbf{h}_t + b_{CRF}^{(l'_{t-1}, l'_t)})\big)},$$

where y' represents an arbitrary label sequence, w_CRF^{l_t} is a model parameter specific to l_t, and b_CRF^{(l_{t−1}, l_t)} is a bias specific to the label pair (l_{t−1}, l_t).
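A sketch of how the two auxiliary losses could be computed from the per-word ET cell states and attention weights is given below. The tensor shapes and the scorer type_proj are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def auxiliary_losses(et_states, alpha, type_proj, gold_types):
    """Entity type prediction loss and attention scoring loss (illustrative sketch).

    et_states:  (batch, seq_len, l, hidden)  ET cell states for each word
    alpha:      (batch, seq_len, l)          attention weights from the C cell
    type_proj:  nn.ModuleList of l scorers, one per entity type (playing the role of [w_k; b_k])
    gold_types: (batch, seq_len)             gold entity type index per word
    """
    # entity type prediction: score each word against its k-th ET cell state, then softmax
    logits = torch.stack(
        [type_proj[k](et_states[:, :, k, :]).squeeze(-1) for k in range(len(type_proj))],
        dim=-1)                                                   # (batch, seq_len, l)
    l_ent = F.cross_entropy(logits.flatten(0, 1), gold_types.flatten())

    # attention scoring: push the C cell's attention distribution towards the gold type
    l_atten = F.nll_loss(torch.log(alpha.clamp_min(1e-8)).flatten(0, 1),
                         gold_types.flatten())
    return l_ent, l_atten

# usage sketch with random tensors
l, hidden = 5, 8
type_proj = nn.ModuleList([nn.Linear(hidden, 1) for _ in range(l)])
et = torch.randn(2, 4, l, hidden)
alpha = torch.softmax(torch.randn(2, 4, l), dim=-1)
gold = torch.randint(0, l, (2, 4))
print(auxiliary_losses(et, alpha, type_proj, gold))
```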

Algorithm 1 Transfer learning
Input: source-domain NER dataset S_ner, target-domain NER dataset T_ner or raw data T_lm, and entity dictionary D_e
Output: target-domain model
1: while training steps not end do
2–17: ... (sample mini-batches; compute the auxiliary losses L_a (line 6), the source- and target-domain NER losses (lines 8 and 11) or the target-domain LM loss (line 13); accumulate them into L)
18: Update parameters of the networks based on L
19: end while
A sentence-level negative log-likelihood loss L_ner is used for training on D_ner = {(x_n, y_n)}_{n=1}^N.


Transfer Learning

The multi-cell LSTM structure above is domain agnostic, and can therefore be used for in-domain NER too. However, the main goal of the model is to transfer entity sequence knowledge across domains, and therefore the ET cells and the C cell play more significant roles in the transfer learning setting. Below we introduce the specific role each cell is assigned in cross-domain settings.

Multi-Task Structure
Following the common cross-domain setting, we use a source-domain NER dataset S_ner and either a target-domain NER dataset T_ner or target-domain raw data T_lm. The entity type sets of the source and target domains are denoted E_d, where d ∈ {S, T}.
As shown in Figure 1 (c), our multi-task learning structure follows Yang et al. (2017), consisting of a shared embedding layer and a shared BiLSTM layer, as well as domain-specific CRF layers. Our method replaces the LSTM with the multi-cell LSTM; below we introduce the multi-task parameter sharing mechanism in the multi-cell LSTM.
ET cells. The ET cells {C_k}_{k∈E_S∪E_T} in the multi-cell LSTM comprise the entity-specific cells of both the source and target domains. For each domain d ∈ {S, T}, the ET cells actually used are the domain-specific subset {C_k}_{k∈E_d}, aiming to preserve domain-specific features.
C cell. In order to make the source and target domains share the same feature space at the word level, we use a single C cell Ĉ shared across domains.
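The sharing scheme can be pictured as in the sketch below: both domains index into the union of ET cells, each domain uses only the cells of its own entity type set, and the C cell is shared. The class and method names are ours, for illustration only.

```python
import torch.nn as nn

class SharedMultiCellParams(nn.Module):
    """Cross-domain parameter sharing (illustrative sketch).
    ET cells for entity types present in both domains are the same modules;
    domain-specific types get their own cells; the C cell is always shared."""
    def __init__(self, input_dim, hidden_dim, source_types, target_types):
        super().__init__()
        all_types = sorted(set(source_types) | set(target_types))
        # one ET-cell parameter block per entity type in the union
        self.et_cells = nn.ModuleDict(
            {t: nn.Linear(input_dim + hidden_dim, 2 * hidden_dim) for t in all_types})
        self.c_cell = nn.Linear(input_dim + hidden_dim, 3 * hidden_dim)  # shared C cell
        self.domain_types = {"S": list(source_types), "T": list(target_types)}

    def cells_for(self, domain):
        # each domain only uses the ET cells of its own entity type set
        return [self.et_cells[t] for t in self.domain_types[domain]]

params = SharedMultiCellParams(150, 200,
                               source_types=["PER", "LOC", "ORG", "O"],
                               target_types=["PER", "LOC", "O"])
print(len(params.cells_for("S")), len(params.cells_for("T")))  # 4 3
```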

Unsupervised Domain Adaptation
To better leverage target-domain knowledge without labeled target-domain NER data, we conduct the auxiliary dictionary matching and language modeling tasks on target-domain raw data T_lm = {(x_n)}_{n=1}^N.

Auxiliary tasks. To better extract entity knowledge from raw data, we use the named entity dictionary D_e collected by Peng et al. (2019) to label T_lm, obtaining a set of entity-labeled words D_ent^+, which are used to train the entity type prediction and attention scoring tasks jointly.
Language modeling. Following Jia et al. (2019), we use a sampled softmax to compute the forward LM probability p_f(x_t | x_<t) and the backward LM probability p_b(x_t | x_>t), where w_x and b_x are the target word vector and bias, respectively, and Z is the normalization term computed over the target word and its negative samples. The LM loss L_lm on T_lm is the sum of the forward and backward negative log-likelihoods.
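A hedged sketch of such a sampled-softmax-style LM loss is shown below. It draws negative samples uniformly at random, which is an assumption; the sampling strategy and the helper name sampled_lm_loss are ours, not the paper's.

```python
import torch
import torch.nn.functional as F

def sampled_lm_loss(hidden, target_ids, out_emb, out_bias, num_neg=20):
    """Approximate -log p_f(x_t | x_<t) with sampled negatives (illustrative sketch).

    hidden:     (batch, dim)  forward LM state predicting the next word
    target_ids: (batch,)      gold next-word ids
    out_emb:    (vocab, dim)  output word vectors w_x
    out_bias:   (vocab,)      output biases b_x
    """
    # negatives drawn uniformly; a real sampler may use a unigram distribution
    neg_ids = torch.randint(0, out_emb.size(0), (hidden.size(0), num_neg))
    ids = torch.cat([target_ids.unsqueeze(1), neg_ids], dim=1)     # gold word in column 0
    logits = torch.einsum("bd,bkd->bk", hidden, out_emb[ids]) + out_bias[ids]
    # normalize only over the gold word and its negative samples (the term Z above)
    return F.cross_entropy(logits, torch.zeros(hidden.size(0), dtype=torch.long))
```

The backward LM loss would be computed symmetrically from the backward states, and the two are summed to give L_lm.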

Training Objective
Algorithm 1 presents the transfer learning procedure under both the supervised (SDA) and unsupervised (UDA) domain adaptation settings. Both source- and target-domain training instances undergo the auxiliary tasks, giving the loss L_a, which is a combination of L_ent and L_atten weighted by λ_ent and λ_atten, respectively (line 6).

Supervised domain adaptation. The auxiliary tasks together with the source- and target-domain NER tasks (lines 8 and 11) form the final training objective, where λ_d (d ∈ {S, T}) are the domain weights for the NER tasks, λ is the L2 regularization weight and Θ represents the parameter set.

Unsupervised domain adaptation. The training objective for UDA is similar to that of SDA, except that the target-domain LM task (line 13) is used instead of the target-domain NER task.
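A sketch of how the per-task losses could be combined into the final objective is given below; the exact weighting scheme and the helper combined_loss are assumptions based on the description above and on Algorithm 1, not a verbatim restatement of the paper's equations.

```python
def combined_loss(l_ent, l_atten, l_ner_src, l_ner_tgt=None, l_lm_tgt=None,
                  params=(), lam_ent=1.0, lam_atten=1.0, lam_s=1.0, lam_t=1.0, l2=1e-8):
    """Combine per-task losses into one training objective (illustrative sketch)."""
    l_a = lam_ent * l_ent + lam_atten * l_atten              # auxiliary losses (line 6)
    loss = l_a + lam_s * l_ner_src                            # source-domain NER (line 8)
    if l_ner_tgt is not None:                                 # SDA: target-domain NER (line 11)
        loss = loss + lam_t * l_ner_tgt
    elif l_lm_tgt is not None:                                # UDA: target-domain LM (line 13)
        loss = loss + lam_t * l_lm_tgt
    loss = loss + l2 * sum((p ** 2).sum() for p in params)    # L2 regularization over Θ
    return loss
```

In each training step, the losses returned by the NER, LM and auxiliary modules would be passed to this function and the result backpropagated (Algorithm 1, line 18).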

Theoretical Discussion
Below we show theoretically that our method in §2.2 is stronger than the baseline method in §2.1 for domain adaptation. Following Ben-David et al.
(2010), a domain is defined as a pair consisting of an input distribution D over X and a labeling function y: X → Y, where Y is a (l−1)-simplex. According to this definition, <D_S, y_S> and <D_T, y_T> represent the source and target domains, respectively. A hypothesis is a function h: X → {1, ..., l}, which can be a classification model. The target-domain error ε_T(h) is defined as the probability that h disagrees with the labeling function y_T under D_T; the source-domain error ε_S(h) is defined similarly. The training target for h is to minimize a convex weighted combination of the source and target errors, ε_α(h) = α ε_T(h) + (1 − α) ε_S(h), where α ∈ [0, 1) is the domain weight; when α = 0, this is the UDA setting.

Theorem 1 Let h be a hypothesis in class H. Then

ε_T(h) ≤ ε_α(h) + (1 − α)(½ d_{H∆H}(D_S, D_T) + λ).

Here λ is a constant that measures the shared error of the ideal joint hypothesis, and in d_{H∆H}(D_S, D_T), sup denotes the supremum over pairs of hypotheses of the difference between their disagreement rates on the two distributions. Intuitively, the theorem states an upper bound on ε_T(h) based on ε_α(h) and the distance between D_S and D_T in the H∆H space, which is measured as the discrepancy between two classifiers h and h′.
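For concreteness, the quantities above can be written as follows, assuming the standard definitions of Ben-David et al. (2010), which the omitted equations presumably follow:

```latex
\epsilon_T(h) = \Pr_{x \sim \mathcal{D}_T}\!\big[h(x) \neq y_T(x)\big], \qquad
\epsilon_\alpha(h) = \alpha\,\epsilon_T(h) + (1-\alpha)\,\epsilon_S(h),
```
```latex
d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T)
 = 2 \sup_{h, h' \in \mathcal{H}}
 \Big|\Pr_{x \sim \mathcal{D}_S}\!\big[h(x) \neq h'(x)\big]
     - \Pr_{x \sim \mathcal{D}_T}\!\big[h(x) \neq h'(x)\big]\Big|.
```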
The original theorem, however, concerns a single model h for transfer learning. In our supervised setting, in contrast, the CRF layers are specific to the source and target domains, respectively. Below we use h* to denote our overall model with the shared multi-cell LSTM and domain-specific CRF layers. Further, we use h_1 to denote the target-domain subsystem, which consists of the shared multi-cell LSTM and the target-specific CRF layer, and h_2 to denote its source counterpart. Theorem 1 can be extended to our setting, bounding ε_T(h_1) in terms of ε_α(h*), d_{H∆H}(D_S, D_T) and a constant λ*; the proof is mainly based on triangle inequalities (see Appendix A for details). Considering that ε_α(h) (ε_α(h*)) and λ (λ*) in the upper bound of ε_T(h) (ε_T(h_1)) are small when training converges, our goal is to reduce d_{H∆H}(D_S, D_T). In particular, we define a model h as a composition h = g ∘ f, where f represents the multi-cell LSTM, g represents the CRF layer, and ∘ denotes function composition. We assume that h and h′ share the same multi-cell LSTM, namely h = g ∘ f and h′ = g′ ∘ f. To obtain the supremum of the right term, we may further assume that both g and g′ classify correctly in the source domain; the optimization objective then becomes min_{f∈F} d_{H∆H}(D_S, D_T). Aiming at this objective, we decompose the unified feature space into several entity typed distributions using the multi-cell LSTM, so that source- and target-domain features belonging to the same entity type are clustered together. The argument is mainly based on the cluster assumption (Chapelle and Zien, 2005), which is equivalent to the low-density separation assumption and states that the decision boundary should lie in a low-density region. According to the cluster assumption, both g and g′ tend to cross the low-density regions in the shared feature space.

Experimental Settings

The development sets are used for developing the models. The multi-task baselines are based on Jia et al. (2019). Our hyperparameter settings largely follow previous work; word embeddings for all models are initialized with PubMed 200-dimension vectors (Chiu et al., 2016) for the biomedical experiments and GloVe 100-dimension vectors (Pennington et al., 2014) in the other experiments. All word embeddings are fine-tuned during training. Character embeddings are randomly initialized.

Figure 2 shows the performance of the main target-domain NER task and of the auxiliary entity type prediction and attention scoring tasks on the development sets of BioNLP13CG and Twitter as the number of training iterations increases. As can be seen from the figure, all three tasks show the same trend of improvement without potential conflicts between tasks, which indicates that the three tasks share a feature space of the same form.

Supervised Domain Adaptation
We compare our method with a range of baselines on the three few-shot datasets; the results are shown in Table 2.

[Table 2: Results on three few-shot datasets. * indicates that we reproduce the baseline bi-directional LSTM in a similar way to our model for fair comparisons. † indicates statistical significance compared to target-domain settings and cross-domain settings with p < 0.01 by t-test. ‡ indicates statistical significance compared to LM pre-training based methods with p < 0.01 by t-test.]

Compared with the target-domain only baselines BILSTM and MULTI-CELL LSTM, all of the multi-task models obtain significantly better results on all of the three datasets. This shows the effectiveness of multi-task learning in few-shot transfer.
Cross-domain settings. We make comparisons with the traditional parameter sharing mechanism MULTI-TASK (LSTM) (Yang et al., 2017) together with two improved methods: MULTI-TASK+PGN (Jia et al., 2019), which adds a parameter generation network (PGN) to generate parameters for the source- and target-domain LSTMs, and MULTI-TASK+GRAD (Zhou et al., 2019), which adds a generalized resource-adversarial discriminator (GRAD) and leverages adversarial training. The results show that our method significantly outperforms these multi-task methods on the same datasets, demonstrating the effectiveness of our multi-cell structure in cross-domain settings.
Comparison with the state-of-the-art models.
Results show that our model outperforms the cross-domain method of Jia et al. (2019), the cross-type method of Wang et al. (2019) and methods using additional features (Crichton et al., 2017). Recently, LM pre-training based methods such as ELMO/BIOELMO (Peters et al., 2018), BERT (Devlin et al., 2019) and BIOBERT (Lee et al., 2020) have achieved state-of-the-art results on NER. However, these methods use additional large-scale language resources, so direct comparisons with our method are unfair. We therefore leverage the outputs of LM pre-training methods as contextualized word embeddings. In particular, we use the same batch size as our method and the Adam optimizer with an initial learning rate of 3e-5 in the BERT fine-tuning baselines. Results show that our method benefits from these LM pre-training output features and outperforms these LM pre-training based methods.

Unsupervised Domain Adaptation
We conduct unsupervised domain adaptation on the CBS SciTech News test set, using CoNLL-2003 as the source-domain dataset. The overall results are listed in Table 3.

Adding a named entity dictionary. With the named entity dictionary collected by Peng et al. (2019), the results show a significant improvement (75.19% F1 vs. 72.81% F1). For a fair comparison, we add the entity dictionary information to BILSTM+LM by performing the entity type prediction task together with the target-domain LM. BILSTM+LM+DICT achieves a better result than BILSTM+LM (72.49% F1 vs. 71.30% F1), but it is still not comparable to our results. This shows that the auxiliary tasks can help learn entity knowledge from raw data, even though the named entity dictionary cannot label all entities in a sentence.

Analysis
Visualization. In the proposed multi-cell LSTM, both the ET cells and the C cell play important roles in constructing a shared feature space across domains. We visualize the feature spaces of the ET cells and the C cell in the Broad Twitter experiments. Features of the same ET cell gather together across domains, which indicates that our model can learn cross-domain entity typed knowledge with the help of ET cells, which are more robust across domains. Figure 4 visualizes the hidden vectors of the target-domain only baseline, the multi-task baseline and the proposed model. From the figure, we can see that both the multi-task baseline and our model obtain more similar feature distributions across domains compared with the target-domain only baseline. Compared with the multi-task baseline, our model also shows strong matches across domains at the entity type level, which can better narrow the gap between source and target domains, as discussed in §3.4.

Fine-grained comparison. We make fine-grained comparisons between our model and the multi-task baseline on the BioNLP dataset, aiming to show how our model achieves better results at the entity type level. Following Crichton et al. (2017) and Jia et al. (2019), we study five well-studied entity groups (not including all entity types) in BioNLP13CG. As shown in Table 4, both MULTI (the multi-task baseline) and Ours achieve significant F1 improvements over the target-domain only baseline LSTM on the biochemistry entity groups that appear in both the target and the source datasets, such as CHEM, CC and GGP, which is consistent with intuition.
However, for biology entity groups that do not appear in the source dataset, such as CELL and SPE, MULTI with traditional parameter sharing hardly improves the performance (+0.14% F1 for CELL and +0.39% F1 for SPE vs. +1.82% F1 for All). In contrast, Ours achieves relatively strong improvements (+2.10% F1 for CELL and +2.84% F1 for SPE). This benefit comes from the distinct feature distributions across entity types given by the multi-cell LSTM structure, which can effectively prevent the confusion caused by a unified feature space.

Ablation study. We conduct ablation studies on the auxiliary tasks and the number of model parameters. The results are listed in Table 5.
Auxiliary tasks. When we ablate only L_ent, the results on all three datasets decline significantly (-0.44% F1 on the BioNLP dataset, -0.85% F1 on the Broad Twitter dataset and -0.24% F1 on the CBS News dataset, respectively). When we ablate only L_atten, the decline is even larger (over -1.5% F1 on all three datasets). When we ablate both L_ent and L_atten, our model achieves results similar to the BILSTM-BASED baseline. This indicates that the domain transfer of our model depends heavily on both auxiliary tasks.
Number of parameters. We use two strategies to make the number of parameters of the BILSTM-BASED baseline comparable to that of our model: (i) STACKED BILSTMS, stacking multi-layer BiLSTMs and enlarging the hidden size; (ii) HIDDEN EXPANSION, keeping a similar model structure and only enlarging the hidden size. Our model still significantly outperforms these baselines, which shows that the gains of our model do not arise from a larger number of parameters.
Case study. Table 6 shows a case study, where "WHO" is an organization and "Nipah" is a virus. Without using target-domain raw data, the BiLSTM baseline misclassifies "Nipah" as ORG. Both Ours and BILSTM+LM give the correct result because this entity is mentioned in the raw data. Using the multi-cell structure, our method learns the pattern "ORG, O, ORG, O" from source data without being confused by target-domain specific entities, and thus Ours recognizes "WHO" correctly.

Conclusion
We have investigated a multi-cell compositional LSTM structure for cross-domain NER under the multi-task learning strategy. Theoretically, our method benefits from the distinct feature distributions for each entity type across domains. Results on a range of cross-domain datasets show that the multi-cell compositional LSTM outperforms BiLSTM under the multi-task learning strategy.