Zero-Resource Cross-Domain Named Entity Recognition

Existing models for cross-domain named entity recognition (NER) rely on large amounts of unlabeled corpora or labeled NER training data in target domains. However, collecting data for low-resource target domains is not only expensive but also time-consuming. Hence, we propose a cross-domain NER model that does not use any external resources. We first introduce Multi-Task Learning (MTL) by adding a new objective function to detect whether tokens are named entities or not. We then introduce a framework called Mixture of Entity Experts (MoEE) to improve the robustness of zero-resource domain adaptation. Finally, experimental results show that our model outperforms strong unsupervised cross-domain sequence labeling models, and that its performance is close to that of the state-of-the-art model, which leverages extensive resources.


Introduction
Named entity recognition (NER) is a fundamental task in text understanding and information extraction. Recently, supervised learning approaches have shown their effectiveness in detecting named entities (Ma and Hovy, 2016; Chiu and Nichols, 2016; Winata et al., 2019). However, performance drops sharply in low-resource target domains where massive training data are absent. To solve this data scarcity issue, a straightforward idea is to utilize the NER knowledge learned from high-resource domains and then adapt it to low-resource domains, which is called cross-domain NER.
Due to the large variances in entity names across different domains, cross-domain NER has thus far been a challenging task. Most existing methods consider a supervised setting, leveraging labeled NER data for both the source and target domains (Yang et al., 2017; Lin and Lu, 2018).
However, labeled data in target domains is not always available. Unsupervised domain adaptation naturally arises as a possible way to circumvent the usage of labeled NER data in target domains. However, the only existing method, proposed by Jia et al. (2019), requires an external unlabeled data corpus in both the source and target domains to conduct the unsupervised cross-domain NER task, and such resources are difficult to obtain, especially for low-resource target domains. Therefore, we consider unsupervised zero-resource cross-domain adaptation for NER, which utilizes only the NER training samples in a single source domain.
To meet the challenge of zero-resource cross-domain adaptation, we first propose to conduct multi-task learning (MTL) by adding an objective function to detect whether tokens are named entities or not. This objective function helps the model to learn general representations of named entities and to distinguish named entities in sequences from target domains. In addition, we observe that in many cases, different entity categories can share a similar or identical context. For example, in the sentence "Arafat subsequently cancelled a meeting between Israeli and PLO officials," the person entity "Arafat" can be replaced with an organization entity within the same context, which illustrates the confusion among different entity categories and makes zero-resource adaptation much more difficult. Intuitively, when the entity type of a token is hard to predict from the token itself and its context, we want to borrow the opinions (i.e., representations) from different experts. Hence, we propose a Mixture of Entity Experts (MoEE) framework to tackle the confusion of entity categories, where predictions are based on the tokens and the context, as well as all entity experts.
Experimental results show that our model is able to outperform current strong unsupervised cross-domain sequence tagging approaches, and reaches results comparable to the state-of-the-art unsupervised method that utilizes extensive resources.


Related Work
Most of the existing work on cross-domain NER has investigated the supervised setting, where both source and target domains have labeled data (Daume III, 2007; Obeidat et al., 2016; Yang et al., 2017; Lee et al., 2018). Yang et al. (2017) jointly trained models on the source and target domains with shared parameters. Lin and Lu (2018) added adaptation layers on top of existing models, and Wang et al. (2018) introduced label-aware feature representations for NER adaptation. Lee et al. (2018) utilized the idea of transfer learning by first initializing a target model with parameters learned from source-domain NER, and then using labeled target-domain data to fine-tune the model. However, no prior work has focused on the unsupervised setting of cross-domain NER, except for Jia et al. (2019). In Jia et al. (2019), however, external unlabeled data corpora in both the source and target domains are required to train language models for domain adaptation. This limitation has motivated us to develop a model that does not need any external resources. Tackling the low-resource scenario where there are zero or minimal existing resources has always been an interesting yet challenging task (Xie et al., 2018; Liu et al., 2019b; Shah et al., 2019). Instead of utilizing large amounts of bilingual resources, Liu et al. (2019a,b) only utilized a few word pairs for zero-shot cross-lingual dialogue systems. Unsupervised machine translation approaches (Artetxe et al., 2017) have also been introduced to circumvent the need for parallel data. Winata et al. (2020) introduced the cross-accent speech recognition task and utilized meta-learning to cope with the data scarcity issue in target accents. Bapna et al. (2017) and Shah et al. (2019) proposed to do cross-domain slot filling with minimal resources. To the best of our knowledge, we are the first to propose a method for cross-domain NER adaptation with zero external resources.

Methodology
As illustrated in Fig. 1, our model combines a bidirectional LSTM and a conditional random field (CRF) into a BiLSTM-CRF structure (Lample et al., 2016) with MTL and MoEE modules. The parameters of the BiLSTM are shared between the two tasks in multi-task learning.

Multi-Task Learning
Due to the large variations of named entities across domains, unsupervised cross-domain NER models often suffer from an inability to recognize named entities. Hence, we propose to learn general representations of named entities and enhance the robustness of adaptation by adding an objective function to predict whether tokens are named entities or not, which is represented as Task 1 in Fig. 1(a). To do so, based on the original named entity labels for each token in the training set, we create another label set, which represents whether tokens are named entities or not. Specifically, in this process, all non-entity tokens keep their original labels, while tokens belonging to any entity category are mapped to a single class representing the general named entity. Task 2 in Fig. 1(a) represents the original NER task, which is to predict a concrete category for each token. Let us denote X = [w_1, w_2, ..., w_n] as the input text sequence; the MTL can be formulated as:

[h_1, h_2, ..., h_n] = BiLSTM([w_1, w_2, ..., w_n]),
[p^{T1}_1, p^{T1}_2, ..., p^{T1}_n] = CRF_1([h_1, h_2, ..., h_n]),
[p^{T2}_1, p^{T2}_2, ..., p^{T2}_n] = CRF_2([h_1, h_2, ..., h_n]),

where CRF_1 and CRF_2 denote the CRF layers for Task 1 and Task 2, respectively, and [p^{T1}_1, p^{T1}_2, ..., p^{T1}_n] and [p^{T2}_1, p^{T2}_2, ..., p^{T2}_n] represent the corresponding predictions.
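The setup above can be sketched in PyTorch. This is a simplified illustration, not the authors' code: the CRF layers are replaced here by linear projection heads for brevity, and all layer sizes and label names are illustrative assumptions.

```python
import torch
import torch.nn as nn

def to_entity_labels(tags):
    # Collapse concrete categories (PER, ORG, ...) into one generic
    # entity class for Task 1; non-entity "O" tokens keep their label.
    return [t if t == "O" else t[:2] + "ENT" for t in tags]

class MultiTaskNER(nn.Module):
    # Shared BiLSTM encoder with two task heads (CRFs simplified to
    # linear layers in this sketch).
    def __init__(self, emb_dim=300, hidden_dim=200, n_tags=9, n_binary=3):
        super().__init__()
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, num_layers=2,
                              batch_first=True, bidirectional=True)
        # Task 1: entity vs. non-entity (3 classes under IOB: O, B-ENT, I-ENT).
        self.head_task1 = nn.Linear(2 * hidden_dim, n_binary)
        # Task 2: the concrete entity category of each token.
        self.head_task2 = nn.Linear(2 * hidden_dim, n_tags)

    def forward(self, embeddings):            # (batch, seq_len, emb_dim)
        h, _ = self.bilstm(embeddings)        # (batch, seq_len, 2*hidden_dim)
        return self.head_task1(h), self.head_task2(h)

model = MultiTaskNER()
x = torch.randn(2, 5, 300)                    # a toy batch of embedded tokens
p1, p2 = model(x)                             # per-token predictions for both tasks
```

Both heads read the same BiLSTM hidden states, so gradients from the general entity-detection objective regularize the representations used by the concrete NER task.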

Mixture of Entity Experts
Traditional NER models make predictions based on the features of the tokens and the context. Due to the confusion among different entity categories, NER models could easily overfit to the source domain entities and lose generalization ability to the target domain. Therefore, we introduce an MoEE framework, as depicted in Fig. 1(b). It combines representations generated by experts to produce the final prediction. In this way, the knowledge from different experts is incorporated to model the inherent confusion and improve the generalization ability to target domains.
Each entity category acts as an entity expert, which consists of a linear layer. Note that we consider the non-entity as a special entity category. The expert gate consists of a linear layer followed by a softmax layer, which generates a confidence distribution over entity experts. We use the gold labels in Task 2 to supervise the training of the expert gate. Finally, the meta-expert feature incorporates the features from all experts weighted by the confidence scores from the expert gate. We formulate the MoEE module as follows:

expt^e_i = L_e(h_i),   e = 1, ..., E,
α_i = softmax(L_gate(h_i)),
m_i = Σ_{e=1}^{E} α^e_i · expt^e_i,

where h_i is the i-th hidden state of the BiLSTM, expt^e_i is the feature generated by the e-th expert, α_i is the confidence distribution from the expert gate, L denotes a linear layer, and m_i is the meta-expert feature for the i-th hidden state. The MoEE has E experts, where E equals the number of entity categories plus one for the non-entity category. The expert features are computed based on the BiLSTM hidden states, and the predictions are conditioned on both the expert features and the hidden states, which makes cross-domain adaptation more robust.
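A minimal PyTorch sketch of such a mixture module follows. All names and sizes are our assumptions for illustration (e.g., five experts for the four entity categories plus non-entity); the actual implementation may differ.

```python
import torch
import torch.nn as nn

class MixtureOfEntityExperts(nn.Module):
    # One linear "expert" per entity category (non-entity included) plus a
    # softmax gate that weights the expert features into a meta-expert feature.
    def __init__(self, hidden_dim=400, n_experts=5):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim) for _ in range(n_experts)])
        self.gate = nn.Linear(hidden_dim, n_experts)  # linear + softmax gate

    def forward(self, h):                     # h: (batch, seq_len, hidden_dim)
        alpha = torch.softmax(self.gate(h), dim=-1)   # confidence per expert
        # Stack expert features: (batch, seq_len, n_experts, hidden_dim)
        expert_feats = torch.stack([e(h) for e in self.experts], dim=-2)
        # Meta-expert feature: confidence-weighted sum over experts.
        m = (alpha.unsqueeze(-1) * expert_feats).sum(dim=-2)
        return m, alpha

moee = MixtureOfEntityExperts()
h = torch.randn(2, 5, 400)                    # toy BiLSTM hidden states
m, alpha = moee(h)
```

During training, `alpha` would additionally be supervised with the gold Task 2 labels, so that each expert specializes in one entity category.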

Optimization
During training, we optimize Task 1, Task 2, and the expert gate with cross-entropy losses L_task1, L_task2, and L_gate, respectively:

L_task1 = - Σ_{j=1}^{J} Σ_{k=1}^{|Y_j|} y^{T1}_{jk} log p^{T1}_{jk},
L_task2 = - Σ_{j=1}^{J} Σ_{k=1}^{|Y_j|} y^{T2}_{jk} log p^{T2}_{jk},
L_gate = - Σ_{j=1}^{J} Σ_{k=1}^{|Y_j|} y^{gate}_{jk} log p^{gate}_{jk},

where J and |Y_j| denote the number of training samples and the number of tokens in each training sample, respectively; p_jk and y_jk denote the prediction and label for each token, respectively; and the superscripts of p_jk and y_jk indicate the corresponding task. Hence, the final objective function is to minimize the sum of all the aforementioned loss functions.
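The joint objective can be sketched as follows. This is an assumption-laden simplification: the three losses are summed with equal weights (the paper simply minimizes their sum), and the tensor shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def joint_loss(logits_t1, logits_t2, logits_gate, y_t1, y_t2, y_gate):
    # Token-level cross-entropy: flatten (batch, seq_len, n_classes)
    # to (batch * seq_len, n_classes) before applying the loss.
    ce = lambda logits, y: F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), y.reshape(-1))
    # Final objective: sum of the Task 1, Task 2, and expert-gate losses.
    return ce(logits_t1, y_t1) + ce(logits_t2, y_t2) + ce(logits_gate, y_gate)

# Toy batch: 2 sentences of 5 tokens each.
l1, l2, lg = torch.randn(2, 5, 3), torch.randn(2, 5, 9), torch.randn(2, 5, 5)
y1 = torch.randint(0, 3, (2, 5))
y2 = torch.randint(0, 9, (2, 5))
yg = torch.randint(0, 5, (2, 5))
loss = joint_loss(l1, l2, lg, y1, y2, yg)     # a single scalar to minimize
```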

Dataset
We take the CoNLL-2003 English NER data (Sang and De Meulder, 2003), containing 15.0K/3.5K/3.7K samples for the training/validation/test sets, as our source domain. We take the dataset containing 2K sentences from SciTech News provided by Jia et al. (2019) as our target domain. The datasets in the source and target domains contain the same four types of entities, namely, PER (person), LOC (location), ORG (organization), and MISC (miscellaneous).

Experimental Setup
Embeddings We test our approaches on FastText word embeddings (Bojanowski et al., 2017) and the pre-trained model BERT (Devlin et al., 2019). Entity names in the target domain are likely to be out-of-vocabulary (OOV) words because they do not typically appear in the source-domain training set. FastText word embeddings are able to leverage subword information and avoid the OOV problem, while BERT copes with it through its subword tokenization. We try both frozen and unfrozen settings for the FastText embeddings during training. For the BERT model, we add different modules (e.g., MoEE) on top for fine-tuning.
Baselines Since we are the first to conduct zero-resource cross-domain NER, we compare our approach with strong unsupervised cross-domain sequence labeling models under minimal resources. Concept Tagger was proposed by Bapna et al. (2017) to utilize entity descriptions for unsupervised cross-domain utterance slot filling, and Robust Sequence Tagger (Shah et al., 2019) was introduced to combine both entity descriptions and a few examples from each entity category for the same unsupervised task. In addition, we also compare our approach with the following baselines: BiLSTM-CRF (Lample et al., 2016), BiLSTM-CRF w/ MTL, and BiLSTM-CRF w/ MoEE, as well as with the state-of-the-art model for unsupervised cross-domain NER from Jia et al. (2019), which utilizes a large corpus in both the source and target domains.
Training Details For the FastText-based models, we use a BiLSTM with a 200-dimensional hidden state and two layers. The linear layer size for each entity expert is 200. An Adam optimizer with a learning rate of 1e-3, a batch size of 32, and a dropout rate of 0.3 are used to train our model. We utilize the binary models provided in FastText to obtain the embeddings for OOV words. For the BERT-based models, given the strong textual understanding ability of the BERT model, we remove the BiLSTM from the text encoder, and only a linear layer is utilized for sequence labeling (i.e., the CRF layer is removed) (Devlin et al., 2019). For the evaluation, we use the standard IOB (inside-outside-beginning) format to calculate the F1-score.
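The reported hyperparameters for the FastText-based setup can be collected into a small configuration sketch; the optimizer wiring below is our assumption (the full model definition is omitted, so an `nn.LSTM` stands in for it).

```python
import torch
import torch.nn as nn

# Hyperparameters as reported for the FastText-based models.
config = {
    "hidden_dim": 200,   # BiLSTM hidden state size
    "num_layers": 2,     # BiLSTM layers
    "expert_dim": 200,   # linear layer size per entity expert
    "lr": 1e-3,          # Adam learning rate
    "batch_size": 32,
    "dropout": 0.3,
}

# Stand-in encoder; the real model adds the CRF, MTL, and MoEE modules on top.
encoder = nn.LSTM(300, config["hidden_dim"], num_layers=config["num_layers"],
                  bidirectional=True, dropout=config["dropout"])
optimizer = torch.optim.Adam(encoder.parameters(), lr=config["lr"])
```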

Results & Discussion
From Table 1, our model combined with MTL and the MoEE outperforms the strong baselines Concept Tagger and Robust Sequence Tagger on all the embedding settings that we test. We conjecture that these two baselines, which utilize slot descriptions or slot examples, suit the limited set of slot names in the slot filling task but fail to cope with the wide variance of entity names across domains in the NER task, whereas our model is more robust to domain variations. MTL helps our model recognize named entities in the target domain, while the MoEE adds information from different entity experts and helps our model detect the specific named entity types. Surprisingly, the performance of our best model (with frozen FastText embeddings) is close to that of the state-of-the-art model that needs a large data corpus in the source and target domains, which illustrates our model's generalization ability to the target domain.
We observe that frozen FastText embeddings bring better performance than unfrozen ones. We conjecture that the embeddings could overfit to the source domain if we unfreeze them during training. Additionally, using frozen FastText embeddings is slightly better than fine-tuning BERT. We speculate that the reason is that NER is a word-level sequence tagging task, while the BERT model leverages subword embeddings, which could lose part of the word-level information needed for the task.
We visualize the confidence scores on different entity experts for each token in Fig. 2. The expert gate aligns non-entity tokens to the non-entity expert with strong confidence. For some entity tokens, e.g., "Drudge", the expert gate gives high confidence to more than one expert (e.g., "PER" and "ORG") since the model is not sure whether "Drudge" is a "PER" or an "ORG". Our model is expected to learn the "PER" and "ORG" expert representations based on the hidden state of "Drudge", which contains the information of this token and its context, and then combine the expert representations for the prediction.

Conclusion
In this paper, we propose a zero-resource cross-domain framework for the named entity recognition task, which consists of multi-task learning and Mixture of Entity Experts modules. The former learns general representations of named entities to cope with the model's inability to recognize named entities, while the latter learns to combine the representations of different entity experts, which are based on the BiLSTM hidden states. Experimental results show that our model outperforms strong cross-domain sequence tagging models, and its performance is close to that of the state-of-the-art model that utilizes extensive resources.