Dual Adversarial Neural Transfer for Low-Resource Named Entity Recognition

We propose a new neural transfer method termed Dual Adversarial Transfer Network (DATNet) for addressing low-resource Named Entity Recognition (NER). Specifically, two variants of DATNet, i.e., DATNet-F and DATNet-P, are investigated to explore effective feature fusion between high- and low-resource data. To address noisy and imbalanced training data, we propose a novel Generalized Resource-Adversarial Discriminator (GRAD). Additionally, adversarial training is adopted to boost model generalization. In experiments, we examine the effects of different components in DATNet across domains and languages and show that significant improvement can be obtained, especially for low-resource data, without any additional hand-crafted features or pre-trained language models.


Introduction
Named entity recognition (NER) is an important step in most natural language processing (NLP) applications. It detects not only the type of a named entity, but also the entity boundaries, which requires a deep understanding of the contextual semantics to disambiguate the different entity types of the same tokens. To tackle this challenging problem, most early studies were based on handcrafted rules, which suffered from limited performance in practice. Current methods are devoted to developing learning-based algorithms, especially neural network based methods, and have been advancing the state of the art progressively (Collobert et al., 2011; Lample et al., 2016; Chiu and Nichols, 2016; Ma and Hovy, 2016). These end-to-end models generalize well to new entities based on features automatically learned from the data. However, when the annotated corpus is small, especially in the low-resource scenario (Zhang et al., 2016), the performance of these methods degrades significantly since the hidden feature representations cannot be learned adequately.
† The first two authors contributed equally. ‡ Corresponding author.
Recently, more and more approaches have been proposed to address low-resource NER. Early works (Chen et al., 2010; Li et al., 2012) primarily assumed a large parallel corpus and focused on exploiting it to project information from high- to low-resource languages. Unfortunately, such a large parallel corpus may not be available for many low-resource languages. More recently, cross-resource word embedding (Fang and Cohn, 2017; Adams et al., 2017; Yang et al., 2017) was proposed to bridge the low and high resources and enable knowledge transfer. Although the aforementioned transfer-based methods show promising performance in low-resource NER, two issues remain for further study: 1) Representation Difference: they did not consider the representation difference across resources and enforced the feature representation to be shared across languages/domains; 2) Resource Data Imbalance: the training size of the high resource is usually much larger than that of the low resource. Existing methods neglect this difference in their models, resulting in poor generalization.
In this work, we present a general neural transfer framework termed Dual Adversarial Transfer Network (DATNet) to address the above issues in a unified framework for low-resource NER. Specifically, to handle the representation difference, we first investigate two architectures of hidden layers (Bi-LSTM) for transfer. In the first, all the units in the hidden layers are common units shared across languages/domains. The second is composed of both private and common units, where the private part preserves the independent language/domain information. Extensive experiments show that there is not always a winner and that the two transfer strategies each have advantages in different situations, which is largely ignored by existing research. On top of the common units, the adversarial discriminator (AD) loss is introduced to encourage a resource-agnostic representation so that the knowledge from the high resource can be more compatible with the low resource. To handle the resource data imbalance issue, we further propose a variant of the AD loss, termed Generalized Resource-Adversarial Discriminator (GRAD), which imposes a resource weight during training so that low-resource and hard samples receive more attention. In addition, we create adversarial samples to conduct Adversarial Training (AT), further improving generalization and alleviating the over-fitting problem. We unify the two kinds of adversarial learning, i.e., GRAD and AT, into one transfer learning model, DATNet, to achieve end-to-end training and obtain significant improvements on a series of NER tasks. In contrast with prior methods, we use neither additional hand-crafted features nor cross-lingual word embeddings or pre-trained language models (Peters et al., 2018; Radford, 2018; Akbik et al., 2018; Devlin et al., 2018) when addressing the cross-language tasks.

Related Work
Named Entity Recognition. NER is typically framed as a sequence labeling task which aims at the automatic detection of named entities (e.g., person, organization, location, etc.) in free text (Marrero et al., 2013). Early works applied CRF, SVM, and perceptron models with handcrafted features (Ratinov and Roth, 2009; Passos et al., 2014; Luo et al., 2015). With the advent of deep learning, the research focus has been shifting towards deep neural networks (DNN), which require little feature engineering and domain knowledge (Lample et al., 2016; Zukov Gregoric et al., 2018; Zhou et al., 2019). Collobert et al. (2011) proposed a feed-forward neural network with a fixed-size window for each word, which failed to consider useful relations between long-distance words. To overcome this limitation, Chiu and Nichols (2016) presented a bidirectional LSTM-CNNs architecture that automatically detects word- and character-level features. Ma and Hovy (2016) further extended it into a bidirectional LSTM-CNNs-CRF architecture, where the CRF module was added to optimize the output label sequence. A task-aware neural language model termed LM-LSTM-CRF was also proposed, where character-aware neural language models were incorporated to extract character-level embeddings under a multi-task framework.
Transfer Learning for NER. Transfer learning can be a powerful tool for low-resource NER tasks. To bridge high and low resources, transfer learning methods for NER can be divided into two types: parallel corpora based transfer and shared representation based transfer. Early works mainly focused on exploiting parallel corpora to project information between the high- and low-resource languages (Yarowsky et al., 2001; Chen et al., 2010; Li et al., 2012; Feng et al., 2018). For example, Chen et al. (2010) and Feng et al. (2018) proposed to jointly identify and align bilingual named entities. Ni and Florian (2016) and Ni et al. (2017) utilized Wikipedia entity type mappings to improve low-resource NER. Mayhew et al. (2017) created a cross-language NER system that works well with very minimal resources by translating annotated high-resource data into the low-resource language. On the other hand, the shared representation methods do not require parallel correspondence (Rei and Søgaard, 2018). For instance, Fang and Cohn (2017) proposed cross-lingual word embeddings to transfer knowledge across resources. Yang et al. (2017) presented a transfer learning approach based on a deep hierarchical recurrent neural network, where full or partial hidden features are shared between source and target tasks. Al-Rfou' et al. (2015) built massive multilingual annotators with minimal human expertise by using language-agnostic techniques. Cotterell and Duh (2017) proposed character-level neural CRFs to jointly train and predict low- and high-resource languages. Pan et al. (2017) proposed a large-scale cross-lingual named entity dataset which contains 282 languages for evaluation. In addition, multi-task learning (Yang et al., 2016; Luong et al., 2016; Rei, 2017; Aguilar et al., 2017; Hashimoto et al., 2017; Lin et al., 2018) shows that jointly training on multiple tasks/languages helps improve performance. Different from transfer learning methods, multi-task learning aims at improving the performance of all the resources instead of the low resource only.

Adversarial Learning
Adversarial learning originates from Generative Adversarial Nets (GAN) (Goodfellow et al., 2014), which show impressive results in computer vision. Recently, many papers have tried to apply adversarial learning to NLP tasks. Liu et al. (2017) presented an adversarial multi-task learning framework for text classification. Gui et al. (2017) applied the adversarial discriminator to POS tagging for Twitter. Kim et al. (2017) proposed a language discriminator to enable language-adversarial training for cross-language POS tagging. Apart from the adversarial discriminator, adversarial training is another concept, originally introduced by Szegedy et al. (2014) and Goodfellow et al. (2015) to improve the robustness of image classification models by injecting malicious perturbations into input images. Recently, Miyato et al. (2017) proposed a semi-supervised text classification method by applying adversarial training, where for the first time adversarial perturbations were added onto word embeddings. Yasunaga et al. (2018) applied adversarial training to POS tagging. Different from all these adversarial learning methods, our method is more general and integrates both the adversarial discriminator and adversarial training in a unified framework to enable end-to-end training.

Dual Adversarial Transfer Network
In this section, we introduce two transfer architectures for DATNet in detail. For the base model, we follow the state-of-the-art LSTM-CNN-CRF based structure (Lample et al., 2016; Chiu and Nichols, 2016; Ma and Hovy, 2016) for the NER task, as shown in Figure 1(a).

Character-level Encoder
Previous works have shown that character features can boost sequence labeling performance by capturing morphological and semantic information (Lin et al., 2018). When a low-resource dataset is too small to yield high-quality word features, character features learned from another language/domain may provide crucial information for labeling, especially for rare and out-of-vocabulary words. Character-level encoders are usually based on either a BiLSTM (Lample et al., 2016) or a CNN (Chiu and Nichols, 2016; Ma and Hovy, 2016). In practice, Reimers and Gurevych (2017) observed that the difference between the two approaches is statistically insignificant in sequence labeling tasks, but the character-level CNN is more efficient and has fewer parameters. Thus, we use a character-level CNN and share character features between high- and low-resource tasks to enhance the representations of the low resource.
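To make the character-level encoder concrete, the following is a minimal pure-Python sketch of a character CNN with max-over-time pooling. It is an illustration under stated assumptions rather than the authors' implementation: the filter width of 3, the zero padding, the max pooling, the lowercase Latin alphabet, and the names char_cnn and encode are all hypothetical choices; only the 30-dimensional character embeddings and the 20 filters follow the experimental setup reported in this paper.

```python
import random


def char_cnn(char_ids, emb, filters, width=3):
    """Max-over-time pooled 1-D convolution over a word's character embeddings."""
    dim = len(emb[0])
    # Zero-pad both sides so that even very short words yield at least one window.
    pad = [[0.0] * dim] * (width - 1)
    x = pad + [emb[c] for c in char_ids] + pad
    feats = []
    for f in filters:  # each filter has shape [width][dim]
        best = float("-inf")
        for start in range(len(x) - width + 1):
            score = sum(f[k][d] * x[start + k][d]
                        for k in range(width) for d in range(dim))
            best = max(best, score)
        feats.append(best)  # one pooled feature per filter
    return feats


rng = random.Random(0)
CHAR_DIM, N_FILTERS, WIDTH = 30, 20, 3  # WIDTH is an assumption, not from the paper
ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # hypothetical toy character vocabulary
emb = [[rng.uniform(-0.1, 0.1) for _ in range(CHAR_DIM)] for _ in ALPHABET]
filters = [[[rng.uniform(-0.1, 0.1) for _ in range(CHAR_DIM)]
            for _ in range(WIDTH)] for _ in range(N_FILTERS)]


def encode(word):
    """Map a lowercase ASCII word to a fixed-size character feature vector."""
    return char_cnn([ALPHABET.index(ch) for ch in word.lower()], emb, filters, WIDTH)
```

Because the embedding table and filters do not depend on the word vocabulary, the same parameters can in principle be applied to source and target words alike, which is the property that makes character features easy to share across resources.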

Word-level Encoder
To learn a better word-level representation, we concatenate the character-level features of each word with a latent word embedding as
$$w_i = [w_i^{char}, w_i^{emb}],$$
where the latent word embedding $w_i^{emb}$ is initialized with pre-trained embeddings and fixed during training. One unique characteristic of NER is that both the historical and the future input for a given time step can be useful for label inference. To exploit this characteristic, we use a bidirectional LSTM architecture (Hochreiter and Schmidhuber, 1997) to extract contextualized word-level features. In this way, we can gather information from the past and the future for a particular time step $t$ as follows,
$$\overrightarrow{h}_t = \mathrm{lstm}(\overrightarrow{h}_{t-1}, w_t), \qquad \overleftarrow{h}_t = \mathrm{lstm}(\overleftarrow{h}_{t+1}, w_t).$$
After the LSTM layer, the representation of a word is obtained by concatenating its left and right context representations as follows,
$$h_t = [\overrightarrow{h}_t, \overleftarrow{h}_t].$$
To account for the resource representation difference in word-level features, we introduce two kinds of transferable word-level encoders in our model, namely DATNet-Full Share (DATNet-F) and DATNet-Part Share (DATNet-P). In DATNet-F, all the BiLSTM units are shared by both resources while the word embeddings for different resources are disparate. This architecture is depicted in Figure 1(c). In contrast, DATNet-P decomposes the BiLSTM units into a shared component and a resource-related one, as shown in Figure 1(b). Different from existing works (Yang et al., 2017; Fang and Cohn, 2017; Cotterell and Duh, 2017; Cao et al., 2018), in this work we investigate the performance of the two shared representation architectures on different tasks and give the corresponding guidance (see Section 4.5).

Generalized Resource-Adversarial Discriminator
In order to make the feature representation extracted from the source domain more compatible with that from the target domain, we encourage the outputs of the shared BiLSTM part to be resource-agnostic by constructing a resource-adversarial discriminator, which is inspired by the Language-Adversarial Discriminator proposed by Kim et al. (2017). Unfortunately, previous works did not consider the imbalance of training size between the two resources. Specifically, the target domain consists of very limited labeled training data, e.g., 10 sentences. In contrast, labeled training data in the source domain are much richer, e.g., 10k sentences. If this imbalance is not considered during training, stochastic gradient descent (SGD) optimization makes the model more biased towards the high resource (Lin et al., 2017b).
To address this imbalance problem, we impose a weight α on the two resources to balance their influences. However, in the experiments we also observe that the easily classified samples from the high resource comprise the majority of the loss and dominate the gradient. To overcome this issue, we further propose the Generalized Resource-Adversarial Discriminator (GRAD) to enable adaptive weights for each sample, which focuses the model training on hard samples. To compute the loss of GRAD, the output sequence of the shared BiLSTM is first encoded into a single vector via a self-attention module (Bahdanau et al., 2015), and then projected into a scalar $r$ via a linear transformation. The loss function of the resource classifier is formulated as:
$$\ell_{GRAD} = -\sum_i \left\{ I_{i \in D_S}\, \alpha\, (1 - r_i)^{\gamma} \log r_i + I_{i \in D_T}\, (1 - \alpha)\, r_i^{\gamma} \log (1 - r_i) \right\},$$
where $I_{i \in D_S}$ and $I_{i \in D_T}$ are indicator functions denoting whether a sentence is from the high resource (source) or the low resource (target), respectively; α is a weighting factor that balances the loss contribution from the high and low resources; the factor $(1 - r_i)^{\gamma}$ (or $r_i^{\gamma}$) controls the loss contribution from individual samples by measuring the discrepancy between the prediction and the true label (easy samples contribute less); and γ scales the contrast in loss contribution between hard and easy samples. In practice, the value of γ does not need to be tuned much and is usually set to 2 in our experiments. Intuitively, the weighting factors α and $(1 - r_i)^{\gamma}$ reduce the loss contribution from the high resource and from easy samples, respectively. Note that although the resource classifier is optimized to minimize the resource classification error, when the gradients originating from the resource classification loss are back-propagated to the model parts other than the resource classifier, they are negated for the parameter updates so that these bottom layers are trained to be resource-agnostic.
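The per-sample weighting can be illustrated with a short pure-Python sketch. It is a sketch under assumptions, not the authors' code: it assumes that $r_i$ has been squashed into (0, 1) and is read as the predicted probability that sentence i comes from the source, and that the loss has a focal-loss-like form in which source terms are weighted by α(1 − r_i)^γ and target terms by (1 − α) r_i^γ. The function name grad_loss, the default α = 0.5, and the eps guard inside the logarithm are hypothetical.

```python
import math


def grad_loss(r, is_source, alpha=0.5, gamma=2.0, eps=1e-12):
    """Resource-classification loss with resource weight alpha and focusing power gamma.

    r         -- predicted probabilities in (0, 1) that each sentence is from the source
    is_source -- booleans, True for source (high-resource) sentences
    """
    total = 0.0
    for r_i, src in zip(r, is_source):
        if src:
            # Source sentence: a confident correct prediction (r_i near 1) is down-weighted.
            total -= alpha * (1.0 - r_i) ** gamma * math.log(r_i + eps)
        else:
            # Target sentence: a confident correct prediction (r_i near 0) is down-weighted.
            total -= (1.0 - alpha) * r_i ** gamma * math.log(1.0 - r_i + eps)
    return total


# An easy source sample (r = 0.9) contributes far less than a hard one (r = 0.1).
easy = grad_loss([0.9], [True])
hard = grad_loss([0.1], [True])
```

With γ = 0 the modulating factor disappears and the expression reduces to a class-weighted binary cross-entropy, which corresponds to weighting the two resources by α alone.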

Label Decoder
The label decoder induces a probability distribution over sequences of labels, conditioned on the word-level encoder features. In this paper, we use a linear-chain model based on the first-order Markov chain structure, termed the chain conditional random field (CRF) (Lafferty et al., 2001), as the decoder. In this decoder, there are two kinds of cliques: local cliques and transition cliques. Specifically, local cliques correspond to the individual elements in the sequence. Transition cliques, on the other hand, reflect the evolution of states between two neighboring elements at times $t-1$ and $t$, and we define the transition distribution as θ. Formally, a linear-chain CRF defines the conditional probability $p(y|h_{1:T})$ as the exponentiated sum of the local and transition scores along the sequence, divided by $Z(h_{1:T})$, where $Z(h_{1:T})$ is a normalization term and $y = y_{1:T}$ is the sequence of predicted labels. Model parameters are optimized to maximize this conditional log-likelihood, which acts as the objective function of the model. We define the loss functions for the source and target resources as follows, with the sums running over source and target sentences, respectively:
$$\ell_S = -\sum_i \log p(y|h_{1:T}), \qquad \ell_T = -\sum_i \log p(y|h_{1:T}).$$
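As an illustration of how such a decoder loss can be evaluated, the sketch below computes the negative log-likelihood of a label sequence under a generic linear-chain CRF using the forward algorithm. It is a textbook formulation under assumptions, not necessarily the exact parameterization used in this paper: emit[t][y] stands for a local (per-position) score, trans[a][b] for the transition score between neighboring labels, and all names and example numbers are hypothetical.

```python
import math
from itertools import product


def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def sequence_score(emit, trans, labels):
    """Unnormalized log-score: local scores plus transition scores along the path."""
    score = sum(emit[t][y] for t, y in enumerate(labels))
    score += sum(trans[a][b] for a, b in zip(labels, labels[1:]))
    return score


def log_partition(emit, trans):
    """log Z via the forward algorithm, in O(T * K^2) for T positions and K labels."""
    n_labels = len(trans)
    alpha = list(emit[0])
    for t in range(1, len(emit)):
        alpha = [emit[t][j] + logsumexp([alpha[i] + trans[i][j] for i in range(n_labels)])
                 for j in range(n_labels)]
    return logsumexp(alpha)


def crf_nll(emit, trans, labels):
    """Negative conditional log-likelihood of one label sequence."""
    return log_partition(emit, trans) - sequence_score(emit, trans, labels)


def brute_force_log_partition(emit, trans):
    """Exhaustive log Z over all K^T label sequences; only feasible for tiny inputs."""
    n_labels = len(trans)
    return logsumexp([sequence_score(emit, trans, list(y))
                      for y in product(range(n_labels), repeat=len(emit))])


# Toy scores for a 4-token sentence with 3 labels (made-up numbers).
emit = [[0.5, -0.2, 0.1], [0.0, 1.0, -1.0], [0.3, 0.3, 0.9], [-0.5, 0.2, 0.0]]
trans = [[0.1, -0.3, 0.2], [0.0, 0.5, -0.1], [-0.4, 0.2, 0.3]]
```

The brute-force helper exists only as a sanity check: on small inputs it should agree with the forward algorithm, and the probabilities of all label sequences should sum to one.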

Adversarial Training
So far our model can be trained end-to-end with standard back-propagation by minimizing the following loss:
$$\ell = \ell_{GRAD} + \ell_S + \ell_T. \qquad (2)$$
Recent works have demonstrated that deep learning models are fragile to adversarial examples (Goodfellow et al., 2015). In computer vision, such adversarial examples can be constructed by changing a very small number of pixels, and the changes are virtually indistinguishable to human perception (Pin-Yu et al., 2018). Recently, adversarial samples have been widely incorporated into training to improve the generalization and robustness of the model, which is known as adversarial training (AT) (Miyato et al., 2017). It has emerged as a powerful regularization tool to stabilize training and prevent the model from being stuck in a local minimum. In this paper, we explore AT in the context of NER. To be specific, we prepare an adversarial sample by adding to the original sample a perturbation, bounded by a small norm $\epsilon$, that maximizes the loss function as follows:
$$\eta_x = \arg\max_{\eta:\, \|\eta\|_2 \le \epsilon} \ell(\Theta; x + \eta),$$
where Θ is the current set of model parameters. However, we cannot calculate the value of η exactly in general, because the exact optimization with respect to η is intractable in neural networks. Following the strategy in (Goodfellow et al., 2015), this value can be approximated by linearization as follows,
$$\eta_x = \epsilon \frac{g}{\|g\|_2}, \quad \text{where } g = \nabla_x \ell(\Theta; x),$$
where $\epsilon$ can be determined on the validation set. In this way, adversarial examples are generated by adding small perturbations to the inputs in the direction that most significantly increases the loss function of the model. We find such an η against the current model parameterized by Θ at each training step, and construct an adversarial example by $x_{adv} = x + \eta_x$. Note that we generate this adversarial example on the word and character embedding layers, respectively, as shown in Figure 1(b) and 1(c). Then, the classifier is trained on the mixture of original and adversarial examples to improve generalization. To this end, we augment the loss in Eqn. 2 and define the loss function for adversarial training as:
$$\ell_{AT} = \ell(\Theta; x) + \ell(\Theta; x_{adv}),$$
where $\ell(\Theta; x)$ and $\ell(\Theta; x_{adv})$ represent the loss from an original example and its adversarial counterpart, respectively. Note that we present AT in a general form for convenience of presentation. For different samples, the loss and parameters should correspond to their counterparts. For example, for the source data with word embedding $w_S$, the loss can be defined as $\ell_{AT} = \ell(\Theta; w_S) + \ell(\Theta; w_{S,adv})$ with $w_{S,adv} = w_S + \eta_{w_S}$ and $\ell = \ell_{GRAD} + \ell_S$. Similarly, we can compute the perturbations $\eta_c$ for the character embedding and $\eta_{w_T}$ for the target word embedding.
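The linearized perturbation can be demonstrated on a toy model whose input gradient is available in closed form. The sketch below is only an illustration under assumptions: it replaces the NER model with a hypothetical logistic classifier on a single embedding vector x with label y in {-1, +1}, takes the perturbation to be the loss gradient with respect to the input, rescaled to L2 norm epsilon, and uses made-up names (loss, grad_x, adversarial_example, at_loss).

```python
import math


def loss(w, x, y):
    """Logistic loss log(1 + exp(-y * <w, x>)) of a toy linear classifier."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    return math.log1p(math.exp(-margin))


def grad_x(w, x, y):
    """Closed-form gradient of the logistic loss with respect to the input x."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    coeff = -y / (1.0 + math.exp(margin))
    return [coeff * wi for wi in w]


def adversarial_example(w, x, y, epsilon):
    """x_adv = x + epsilon * g / ||g||_2, the first-order worst-case perturbation."""
    g = grad_x(w, x, y)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:  # no direction increases the loss; leave the input unchanged
        return list(x)
    return [xi + epsilon * gi / norm for xi, gi in zip(x, g)]


def at_loss(w, x, y, epsilon):
    """Sum of the loss on the original input and on its adversarial counterpart."""
    return loss(w, x, y) + loss(w, adversarial_example(w, x, y, epsilon), y)


w, x, y, eps = [1.0, -2.0, 0.5], [0.3, 0.1, -0.4], 1, 0.5  # made-up toy values
x_adv = adversarial_example(w, x, y, eps)
```

In a neural tagger the gradient would come from back-propagation rather than a closed form, and the perturbation would be recomputed at every training step against the current parameters.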

Datasets
In order to evaluate the performance of DATNet, we conduct experiments on the following widely used NER datasets: CoNLL-2003 English NER (Kim and De, 2003), CoNLL-2002 Spanish & Dutch NER (Kim, 2002), and WNUT-2016 English Twitter NER (Zeman, 2017). The statistics of these datasets are described in Table 1. Unlike previous works that rely on additional hand-crafted features at the CRF layer (Collobert et al., 2011; Chiu and Nichols, 2016; Yang et al., 2017) or introduce the orthographic feature as additional input for learning social media NER in tweets (Partalas et al., 2016; Limsopatham and Collier, 2016; Aguilar et al., 2017), we do not use hand-crafted features, and only words and characters are considered as inputs. Our goal is to study the effects of transferring knowledge from a high-resource dataset to a low-resource dataset. Note that we use only the training set for model training on all datasets except the WNUT-2016 NER dataset, for which all previous studies merged the training and validation sets for training. Specifically, we use the CoNLL-2003 English NER dataset as the high resource (i.e., source) for all experiments, and the CoNLL-2002 and WNUT datasets as the low resource (i.e., target) in the cross-language and cross-domain NER settings, respectively.

Experimental Setup
We use 50-dimensional publicly available pre-trained word embeddings for English, Spanish and Dutch on the CoNLL and WNUT datasets in our experiments, which are trained by word2vec on the corresponding Wikipedia articles (Lin et al., 2018), and 30-dimensional randomly initialized character embeddings are used for all the datasets. We set the number of filters to 20 for the char-level CNN and the dimension of the hidden states of the word-level LSTM to 200 for both the base model and DATNet-F. For DATNet-P, we set the dimension to 100 for each of the source, shared, and target LSTMs. Parameter optimization is performed by Adam (Kingma and Ba, 2015) with gradient clipping of 5.0 and a learning rate decay strategy. We set the initial learning rate to $\beta_0 = 0.001$ for all experiments. At each epoch $t$, the learning rate $\beta_t$ is updated using $\beta_t = \beta_0 / (1 + \rho \times t)$, where ρ is the decay rate, set to 0.05. To reduce over-fitting, we apply Dropout (Srivastava et al., 2014) to the embedding layer and to the output of the LSTM layer.
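For reference, the learning rate schedule can be written as a one-line function. This is a minimal sketch that assumes an inverse-time decay of the form β_t = β_0 / (1 + ρ · t) with the initial rate 0.001 and decay rate 0.05 stated in the setup; the function name learning_rate is hypothetical.

```python
def learning_rate(epoch, beta0=0.001, rho=0.05):
    """Inverse-time decay (assumed form): beta_t = beta_0 / (1 + rho * t)."""
    return beta0 / (1.0 + rho * epoch)


# Learning rates for the first few epochs under the assumed schedule.
schedule = [learning_rate(t) for t in range(5)]
```

Under this assumed form the rate halves after 1/ρ = 20 epochs and keeps shrinking slowly afterwards, rather than dropping in discrete steps.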

Comparison with State-of-The-Art Results
In this section, we compare our approach with state-of-the-art methods on the CoNLL and WNUT benchmark datasets. Note that our models do not use any additional large-scale language resources, so for a fair comparison we do not consider methods built on pre-trained language models (Peters et al., 2018; Radford, 2018; Devlin et al., 2018). In the experiment, we exploit all the source data (i.e., CoNLL-2003 English NER) and target data to improve the performance of the target tasks. The averaged results with standard deviations over 10 repeated runs are summarized in Table 2, and we also report the best results on each task for a fair comparison with other SOTA methods. From the results, we observe that incorporating the additional resource helps improve performance. DATNet-P achieves the highest F1 score on CoNLL-2002 Spanish and sec-

Transfer Learning Performance
In this section, we investigate the improvements from transfer learning under multiple low-resource settings with partial target data. To simulate a low-resource setting, we randomly select subsets of the target data with data ratios of 0.05, 0.1, 0.2, 0.4, 0.6, and 1.0. The results for cross-language and cross-domain transfer are shown in Figure 2(a) and 2(b), respectively, where we compare the results of each part of DATNet under various data ratios. From those figures, we have the following observations: 1) both adversarial training and the adversarial discriminator in DATNet consistently contribute to the performance improvement; 2) the transfer learning component in DATNet consistently improves over the base model results, and the improvement margin is more substantial when the target data ratio is lower. For example, when the data ratio is 0.05, the DATNet-P model outperforms the base model by more than 4% absolute F1-score on Spanish NER, and the DATNet-F model improves by around 13% absolute F1-score over the base model on WNUT-2016 NER.
1 We are not sure whether (Feng et al., 2018) incorporated the validation set into training. If we merge the training and validation sets, we can push the F1 score to 88.71.
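The low-resource simulation amounts to drawing a random subset of target sentences for each data ratio, which might be sketched as follows. The sketch makes assumptions the paper does not state: sentences are sampled uniformly without replacement, the subsets for different ratios are drawn independently rather than nested, and a fixed seed is used for reproducibility. The name subsample and the placeholder corpus are hypothetical.

```python
import random


def subsample(sentences, ratio, seed=0):
    """Return a uniformly sampled subset containing about ratio * len(sentences) items."""
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    k = max(1, round(len(sentences) * ratio))  # keep at least one sentence
    return random.Random(seed).sample(sentences, k)


RATIOS = [0.05, 0.1, 0.2, 0.4, 0.6, 1.0]              # data ratios used in the experiments
target_train = [f"sentence-{i}" for i in range(1000)]  # placeholder corpus
subsets = {ratio: subsample(target_train, ratio) for ratio in RATIOS}
```

The source data would be left untouched in this setting, so only the size of the target training set varies across runs.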
In the second experiment, we further investigate DATNet in extremely low-resource cases, e.g., when the number of target training sentences is 10, 50, 100, 200, 500, or 1,000. This setting is quite challenging, and few previous works have studied it. The results are summarized in Table 3. We have two interesting observations: 1) DATNet-F outperforms DATNet-P on cross-language transfer when the target resource is extremely low; however, this situation is reversed when the target dataset is large enough (for this specific dataset, the threshold is 100 sentences); 2) DATNet-F is always superior to DATNet-P on cross-domain transfer. For the first observation, DATNet-F, with more shared hidden units, transfers knowledge more efficiently than DATNet-P when the data size is extremely small. For the second observation, because cross-domain transfer stays within the same language, more knowledge is common between the source and target domains, requiring more shared hidden features to carry this knowledge than in cross-language transfer. Therefore, for cross-language transfer with an extremely low resource and for cross-domain transfer, we suggest using DATNet-F to achieve better performance. For cross-language transfer with relatively more training data, DATNet-P is preferred.

Ablation Study of DATNet
In the proposed DATNet, both GRAD and AT play important roles in low resource NER. In this experiment, we further investigate how GRAD and AT help to transfer knowledge across language/domain. In the first experiment, we used t-SNE (Maaten and Hinton, 2008) to visualize the feature distribution of BiLSTM outputs without AD, with normal AD (GRAD without considering data imbalance), and with the proposed GRAD in Figure 3. From this figure, we can see that GRAD in DATNet makes the distribution of extracted features from source and target datasets much more similar by considering data imbalance, which indicates that the outputs are resource-invariant.
To better understand the working mechanism, Table 4 further reports a quantitative performance comparison between models with different components. We observe that GRAD is consistently superior to the normal AD regardless of the other components. There is not always a winner between DATNet-P and DATNet-F across settings: the DATNet-P architecture is more suitable for cross-language transfer, while DATNet-F is more suitable for cross-domain transfer.
From the previous results, we know that AT helps enhance the overall performance by adding perturbations to the inputs with the limit $\epsilon = 5$, i.e., $\|\eta\|_2 \le 5$. In this experiment, we further investigate how the target perturbation bound $\epsilon_{w_T}$, with the source perturbation bound fixed at $\epsilon_{w_S} = 5$, affects knowledge transfer in AT; the results on Spanish NER are summarized in Table 6. The results generally indicate that less training data requires a larger $\epsilon$ to prevent over-fitting, which further validates the necessity of AT in the case of low-resource data.
Finally, we analyze the discriminator weight α in GRAD; the results are summarized in Table 5. From the results, it is interesting to find that α generally increases with the data ratio ρ, which means that more target training data requires a larger α (i.e., a smaller 1−α, reducing the training emphasis on the target domain) to achieve better performance.

Conclusion
In this paper, we develop a transfer learning model, DATNet, for low-resource NER, which aims at addressing the representation difference and resource data imbalance problems. We introduce two variants, DATNet-F and DATNet-P, which can be chosen according to the cross-language/domain use case and the target dataset size. To improve model generalization, we propose dual adversarial learning strategies, i.e., AT and GRAD. Extensive experiments show the superiority of DATNet over existing models, and it achieves significant improvements on the CoNLL and WNUT NER benchmark datasets.