Adversarial Transfer Learning for Chinese Named Entity Recognition with Self-Attention Mechanism

Named entity recognition (NER) is an important task in natural language processing, which requires determining entity boundaries and classifying the entities into pre-defined categories. For the Chinese NER task, only a very small amount of annotated data is available. The Chinese NER task and the Chinese word segmentation (CWS) task share many word boundaries, but each task also has its own specificities. However, existing methods for Chinese NER either do not exploit word boundary information from CWS or cannot filter out the information specific to CWS. In this paper, we propose a novel adversarial transfer learning framework to make full use of task-shared boundary information while preventing the task-specific features of CWS from affecting NER. Besides, since any character can provide important cues when predicting the entity type, we exploit self-attention to explicitly capture long range dependencies between any two tokens. Experimental results on two widely used datasets show that our proposed model significantly and consistently outperforms other state-of-the-art methods.


Introduction
The task of named entity recognition (NER) is to recognize the named entities in a given text. NER is a preliminary and important task in natural language processing (NLP) and can be used in many downstream NLP tasks, such as relation extraction (Bunescu and Mooney, 2005), event extraction (Chen et al., 2015) and question answering (Yao and Van Durme, 2014). In recent years, numerous methods have been carefully studied for the NER task, including Hidden Markov Models (HMMs) (Bikel et al., 1997), Support Vector Machines (SVMs) (Isozaki and Kazawa, 2002) and Conditional Random Fields (CRFs) (Lafferty et al., 2001). Currently, with the development of deep learning, neural networks (Lample et al., 2016; Peng and Dredze, 2016; Luo and Yang, 2016) have been introduced to the NER task. All these methods need to determine entity boundaries and classify the entities into pre-defined categories.
Although these methods have achieved great improvements on the Chinese NER task, some issues have still not been well addressed. One significant drawback is that only a very small amount of annotated data is available. The Weibo NER dataset (Peng and Dredze, 2015; He and Sun, 2017a) and the Sighan2006 NER dataset (Levow, 2006) are two widely used datasets for the Chinese NER task, containing 1.3k and 45k training examples, respectively. On the two datasets, the highest F1 scores are 48.41% and 89.21%, respectively. For a basic task in NLP, this performance is not satisfactory. Fortunately, the Chinese word segmentation (CWS) task is to recognize word boundaries, and the amount of supervised training data for CWS is abundant compared with NER. There are many similarities between the Chinese NER task and the CWS task, which we call task-shared information. As shown in Figure 1, given the sentence "希尔顿离开休斯顿机场 (Hilton leaves Houston Airport)", the two tasks have the same boundaries for some words such as "希尔顿 (Hilton)" and "离开 (leaves)", while Chinese NER has more coarse-grained boundaries than CWS for certain words such as "休斯顿机场 (Houston Airport)", which we call task-specific information. In order to incorporate word boundary information from the CWS task into the NER task, Peng and Dredze (2016) propose a joint model that performs Chinese NER together with the CWS task. However, their model only focuses on the task-shared information between Chinese NER and CWS and does not filter out the specificities of each task, which brings noise to both tasks. For example, the CWS task splits "休斯顿机场 (Houston Airport)" into "休斯顿 (Houston)" and "机场 (Airport)", while the NER task takes "休斯顿机场 (Houston Airport)" as a whole entity. Thus, how to exploit task-shared information while preventing the noise brought by the CWS task to the Chinese NER task is a challenging problem.
Another issue is that most proposed models cannot explicitly model long range dependencies when predicting the entity type. Though bidirectional long short term memory (BiLSTM) can learn long-distance dependencies, it cannot establish direct connections between two arbitrary characters. As shown in Figure 1, if the model only focuses on the word "希尔顿 (Hilton)", it can be a person or an organization. However, when the model explicitly captures the dependency between "希尔顿 (Hilton)" and "离开 (leaves)", it is easy to classify "希尔顿 (Hilton)" into the "person" category. In contrast, in the sentence "我要住在希尔顿 (I will be staying at the Hilton)", the entity type of "希尔顿 (Hilton)" is "organization". Context information is therefore crucial for determining the entity type. Thus, how to better capture the global dependencies of the whole sentence is another challenging problem.
To address the above problems, we propose an adversarial transfer learning framework to integrate the task-shared word boundary information into the Chinese NER task. Adversarial transfer learning incorporates adversarial training into transfer learning. To better capture long range dependencies and synthesize the information of the sentence, we introduce a self-attention mechanism into the framework. Specifically, we try to improve Chinese NER performance by incorporating shared boundary information from the CWS task. To prevent the specific information of the CWS task from lowering the performance of the Chinese NER task, we introduce adversarial training to ensure that the Chinese NER task only exploits task-shared word boundary information. Then, to tackle the long range dependency problem, we apply self-attention to the hidden representations of the BiLSTM. Finally, we evaluate our model on two widely used Chinese NER datasets. Experimental results show that our proposed model achieves better performance than other state-of-the-art methods and sets new benchmarks.
In summary, the contributions of this paper are as follows:
• We propose an adversarial transfer learning framework to incorporate task-shared word boundary information from the CWS task into the Chinese NER task. To the best of our knowledge, this is the first work to apply an adversarial transfer learning method to the NER task.
• We introduce self-attention mechanism into our model, which aims to capture the global dependencies of the whole sentence and learn inner structure features of sentence.
• We conduct experiments on two widely used Chinese NER datasets, and the experimental results demonstrate that our proposed model significantly and consistently outperforms previous state-of-the-art methods. We release the source code publicly for further research.

Related Work
NER Many methods have been proposed for the NER task. Early studies on NER often exploit SVMs (Isozaki and Kazawa, 2002), HMMs (Bikel et al., 1997) and CRFs (Lafferty et al., 2001), relying heavily on feature engineering. Zhou et al. (2013) formulate Chinese NER as a joint identification and categorization task. In recent years, neural network models have been introduced to the NER task (Collobert et al., 2011; Huang et al., 2015; Peng and Dredze, 2016 … NER, which are jointly trained with CWS task. However, the specific features brought by the CWS task can lower the performance of the Chinese NER task.
Adversarial Training Adversarial networks have achieved great success in computer vision (Goodfellow et al., 2014; Denton et al., 2015). In NLP, adversarial training has been introduced for domain adaptation (Ganin and Lempitsky, 2014; Gui et al., 2017), cross-lingual transfer learning (Chen et al., 2016; Kim et al., 2017), multi-task learning (Liu et al., 2017) and crowdsourcing learning (Yang et al., 2018). Bousmalis et al. (2016) propose a shared-private model in domain separation networks. Different from these works, we exploit an adversarial network to jointly train the Chinese NER task and the CWS task, aiming to extract task-shared word boundary information from the CWS task. To our knowledge, this is the first work to apply an adversarial transfer learning framework to the Chinese NER task.
Self-Attention Self-attention was introduced to machine translation by Vaswani et al. (2017) for capturing global dependencies between input and output, and achieves state-of-the-art performance. For language understanding, Shen et al. (2017) exploit self-attention to learn long range dependencies. Tan et al. (2017) apply self-attention to the semantic role labelling task and achieve state-of-the-art results. We are the first to introduce a self-attention mechanism to the Chinese NER task.

Method
In this paper, we propose a novel adversarial transfer learning framework that learns task-shared word boundary information from the CWS task, filters out the information specific to CWS, and explicitly captures the long range dependencies between any two characters in a sentence. The architecture of our proposed model is illustrated in Figure 2. The model mainly consists of five components: an embedding layer, a shared-private feature extractor, self-attention, a task-specific CRF and a task discriminator. In the following sections, we describe each part of our proposed model in detail.

Embedding Layer
Similar to other neural network models, the first step of our proposed model is to map discrete characters into distributed representations. For a given Chinese sentence $x = \{c_1, c_2, \ldots, c_N\}$ from the Chinese NER dataset or the CWS dataset, we look up the embedding vector of each character $c_i$ in a pre-trained embedding matrix as $x_i \in R^{d_e}$.

Shared-Private Feature Extractor
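Long short term memory (LSTM) (Hochreiter and Schmidhuber, 1997) is a variant of the recurrent neural network (RNN) (Elman, 1990), which addresses the vanishing and exploding gradient problems of RNNs by introducing a gate mechanism and a memory cell. A unidirectional LSTM only leverages information from the past and ignores future information. In order to incorporate information from both sides of the sequence, we adopt a BiLSTM to extract features. Specifically, the hidden state of the BiLSTM can be expressed as follows:
$$\overrightarrow{h}_i = \overrightarrow{\mathrm{LSTM}}(x_i, \overrightarrow{h}_{i-1}) \tag{1}$$
$$\overleftarrow{h}_i = \overleftarrow{\mathrm{LSTM}}(x_i, \overleftarrow{h}_{i+1}) \tag{2}$$
$$h_i = \overrightarrow{h}_i \oplus \overleftarrow{h}_i \tag{3}$$
where $\overrightarrow{h}_i \in R^{d_h}$ and $\overleftarrow{h}_i \in R^{d_h}$ are the hidden states of the forward and backward LSTM at position $i$, respectively, and $\oplus$ denotes the concatenation operation.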
As shown in Figure 2, we propose a shared-private feature extractor, which assigns a private BiLSTM layer and a shared BiLSTM layer to each task $k \in \{NER, CWS\}$. The private BiLSTM layer is used to extract task-specific features, and the shared BiLSTM layer is used to learn task-shared word boundaries. Formally, for any sentence in the dataset of task $k$, the hidden states of the shared and private BiLSTM layers can be computed as follows:
$$s_i^k = \mathrm{BiLSTM}(x_i^k, s_{i-1}^k; \theta_s) \tag{4}$$
$$h_i^k = \mathrm{BiLSTM}(x_i^k, h_{i-1}^k; \theta_k) \tag{5}$$
where $\theta_s$ and $\theta_k$ are the shared BiLSTM parameters and the private BiLSTM parameters of task $k$, respectively.

Self-Attention
Inspired by the self-attention applied to machine translation (Vaswani et al., 2017) and semantic role labelling (Tan et al., 2017), we exploit self-attention to explicitly learn the dependencies between any two characters in a sentence and capture the inner structure information of the sentence. In this paper, we adopt the multi-head self-attention mechanism. $H = \{h_1, h_2, \ldots, h_N\}$ denotes the output of the private BiLSTM. Correspondingly, $S = \{s_1, s_2, \ldots, s_N\}$ is the output of the shared BiLSTM. We take the self-attention in the private space as an example to illustrate how it works. The scaled dot-product attention can be described as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V \tag{6}$$
where $Q \in R^{N \times 2d_h}$, $K \in R^{N \times 2d_h}$ and $V \in R^{N \times 2d_h}$ are the query matrix, key matrix and value matrix, respectively. In our setting, $Q = K = V = H$. $d$ is the dimension of the hidden units of the BiLSTM, which equals $2d_h$. Multi-head attention first linearly projects the queries, keys and values $h$ times using different linear projections. Then the $h$ projections perform the scaled dot-product attention in parallel. Finally, the attention results are concatenated and projected once again to obtain the new representation. Formally, the multi-head attention can be expressed as follows:
$$\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V}) \tag{7}$$
$$H' = (\mathrm{head}_1 \oplus \cdots \oplus \mathrm{head}_h)W_o \tag{8}$$
where $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$ and $W_o$ are trainable projection parameters.
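As an illustration of the scaled dot-product self-attention described above, the following is a minimal single-head sketch in plain Python with Q = K = V = H. It omits the learned linear projections and the multi-head concatenation, and the function names and the toy input are assumptions made for this example, not part of the released implementation.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(H):
    """Single-head self-attention with Q = K = V = H.

    H is an N x d list of lists (one row per character).
    Returns the N x d attended output and the N x N attention weights.
    """
    d = len(H[0])
    scale = math.sqrt(d)
    weights = []
    for q in H:
        # Dot product of this query with every key, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in H]
        weights.append(softmax(scores))
    # Each output row is a weighted sum of all value rows.
    out = [[sum(w * v[j] for w, v in zip(row, H)) for j in range(d)]
           for row in weights]
    return out, weights


# Toy hidden states for a three-character sentence (illustrative values).
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = scaled_dot_product_attention(H)
```

Every position attends to every other position in one step, which is the direct connection between arbitrary characters that a BiLSTM alone does not provide.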

Task-Specific CRF
For a sentence in the dataset of task $k$, we compute the final representation by concatenating the representations from the private space and the shared space after the self-attention layer:
$$H''^{k} = H'^{k} \oplus S'^{k} \tag{9}$$
where $H'^{k}$ and $S'^{k}$ are the outputs of the private self-attention and the shared self-attention of task $k$, respectively.
Considering the dependencies between successive labels, we exploit a CRF (Lafferty et al., 2001) to infer tags instead of making tagging decisions using $h''_i$ independently. Because the label sets differ, we introduce a specific CRF layer for each task. Given a sentence $x = \{c_1, c_2, \ldots, c_N\}$ with a predicted tag sequence $y = \{y_1, y_2, \ldots, y_N\}$, the CRF tagging process can be formalized as follows:
$$o_i = W_s h''_i + b_s \tag{10}$$
$$s(x, y) = \sum_{i=1}^{N} \left(o_{i,y_i} + T_{y_{i-1},y_i}\right) \tag{11}$$
$$\bar{y} = \arg\max_{y \in Y_x} s(x, y) \tag{12}$$
where $W_s \in R^{|T| \times 4d_h}$ and $b_s \in R^{|T|}$ are trainable parameters. $|T|$ denotes the number of output labels. $o_{i,y_i}$ represents the score of the $y_i$-th tag of the character $c_i$. $T$ is a transition score matrix which defines the scores of two successive labels. $Y_x$ represents all candidate tag sequences for the given sentence $x$. In decoding, we use the Viterbi algorithm to obtain the predicted tag sequence $\bar{y}$. For training, we use the negative log-likelihood objective as the loss function. The probability of the ground-truth label sequence is computed by:
$$p(\hat{y} \mid x) = \frac{\exp(s(x, \hat{y}))}{\sum_{\tilde{y} \in Y_x} \exp(s(x, \tilde{y}))} \tag{13}$$
where $\hat{y}$ denotes the ground-truth label sequence. Given $T$ training examples $(x^{(i)}; \hat{y}^{(i)})$, the loss function $L_{Task}$ can be defined as follows:
$$L_{Task} = -\sum_{i=1}^{T} \log p(\hat{y}^{(i)} \mid x^{(i)}) \tag{14}$$
We use gradient back-propagation to minimize the loss function.
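The CRF scoring, Viterbi decoding and negative log-likelihood described above can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: the transition into the first tag is omitted for brevity, the emission and transition values are made-up toy numbers, and all function names are hypothetical rather than taken from the released code.

```python
import itertools
import math


def sequence_score(emissions, transitions, tags):
    """s(x, y): emission scores o[i][y_i] plus transition scores T[y_{i-1}][y_i]."""
    score = emissions[0][tags[0]]
    for i in range(1, len(tags)):
        score += transitions[tags[i - 1]][tags[i]] + emissions[i][tags[i]]
    return score


def viterbi(emissions, transitions):
    """Return the highest-scoring tag sequence and its score."""
    n_tags = len(emissions[0])
    best = list(emissions[0])
    backptrs = []
    for i in range(1, len(emissions)):
        new_best, ptrs = [], []
        for t in range(n_tags):
            prev = max(range(n_tags), key=lambda p: best[p] + transitions[p][t])
            ptrs.append(prev)
            new_best.append(best[prev] + transitions[prev][t] + emissions[i][t])
        best = new_best
        backptrs.append(ptrs)
    last = max(range(n_tags), key=lambda t: best[t])
    path = [last]
    for ptrs in reversed(backptrs):
        path.append(ptrs[path[-1]])
    path.reverse()
    return path, max(best)


def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_partition(emissions, transitions):
    """Log of the sum of exp(s(x, y)) over all tag sequences (forward algorithm)."""
    n_tags = len(emissions[0])
    alpha = list(emissions[0])
    for i in range(1, len(emissions)):
        alpha = [logsumexp([alpha[p] + transitions[p][t] for p in range(n_tags)])
                 + emissions[i][t] for t in range(n_tags)]
    return logsumexp(alpha)


def neg_log_likelihood(emissions, transitions, gold):
    """-log p(gold | x) for one sentence."""
    return log_partition(emissions, transitions) - sequence_score(emissions, transitions, gold)


# Toy example: 3 characters, 3 tags (illustrative numbers only).
emissions = [[2.0, 0.5, 0.1], [0.3, 1.5, 0.2], [0.1, 0.4, 1.0]]
transitions = [[0.0, 1.0, -1.0], [-0.5, 0.5, 0.8], [0.2, -2.0, 0.0]]
path, score = viterbi(emissions, transitions)
# Brute-force check over all 27 sequences, feasible only for toy sizes.
brute = max(itertools.product(range(3), repeat=3),
            key=lambda y: sequence_score(emissions, transitions, y))
loss = neg_log_likelihood(emissions, transitions, path)
```

The forward algorithm computes the normalizer in time linear in the sentence length, whereas the brute-force enumeration is exponential and is included only to check the decoder on a toy input.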

Task Discriminator
Inspired by adversarial networks (Goodfellow et al., 2014), we incorporate adversarial training into the shared space to ensure that task-specific features do not exist in the shared space. We propose a task discriminator to estimate which task a sentence comes from. Formally, the task discriminator can be expressed as follows:
$$s'^{k} = \mathrm{Maxpooling}(S'^{k}) \tag{15}$$
$$D(s'^{k}; \theta_d) = \mathrm{softmax}(W_d s'^{k} + b_d) \tag{16}$$
where $\theta_d$ indicates the parameters of the task discriminator. $W_d \in R^{K \times 2d_h}$ and $b_d \in R^{K}$ are trainable parameters. $K$ is the number of tasks.
Besides the task loss $L_{Task}$, we introduce an adversarial loss $L_{Adv}$ to prevent specific features of the CWS task from creeping into the shared space. The adversarial loss trains the shared model to produce shared features such that the task discriminator cannot reliably recognize which task a sentence comes from. The adversarial loss can be computed as follows:
$$L_{Adv} = \min_{\theta_s}\left(\max_{\theta_d} \sum_{k=1}^{K} \sum_{i=1}^{T_k} \log D\left(E_s\left(x_k^{(i)}\right)\right)\right) \tag{17}$$
where $\theta_s$ denotes the trainable parameters of the shared BiLSTM. $E_s$ denotes the shared feature extractor. $T_k$ is the number of training examples of task $k$. $x_k^{(i)}$ is the $i$-th example of task $k$. This is a minimax optimization in which the shared BiLSTM generates a representation to mislead the task discriminator, while the discriminator tries its best to correctly determine the type of task.
We add a gradient reversal layer (Ganin and Lempitsky, 2014) below the softmax layer to address the minimax optimization problem. In the training phase, we minimize the task discriminator errors, and the gradient reversal layer flips the sign of the gradients so that the shared feature extractor is adversarially encouraged to learn task-shared word boundary information. After the training phase, the shared feature extractor and the task discriminator reach a point where the discriminator cannot differentiate the tasks according to the representations learned by the shared feature extractor.
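The task discriminator and the gradient reversal layer described above can be illustrated with the following plain-Python sketch. It assumes max pooling over sentence positions to obtain a fixed-size sentence vector, and it models gradient reversal as an explicit forward/backward pair rather than through an automatic differentiation framework. The class and function names, the `lam` scaling factor and the toy weights are all hypothetical.

```python
import math


def max_pool(S):
    """Element-wise max over positions; S is an N x d list of lists."""
    return [max(column) for column in zip(*S)]


def discriminator(S, W, b):
    """Softmax distribution over K tasks from the pooled shared features.

    W is a K x d weight matrix and b a length-K bias (toy parameters here).
    """
    pooled = max_pool(S)
    logits = [sum(w * v for w, v in zip(row, pooled)) + bias
              for row, bias in zip(W, b)]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


class GradientReversal:
    """Identity in the forward pass; negates (and scales) gradients in the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam  # hypothetical scaling factor for the reversed gradient

    def forward(self, x):
        return list(x)

    def backward(self, grad_output):
        return [-self.lam * g for g in grad_output]


# Toy shared features for a two-character sentence, K = 2 tasks (NER, CWS).
S = [[0.2, -1.0, 0.5], [0.7, 0.1, -0.3]]
W = [[1.0, 0.0, -1.0], [-1.0, 0.5, 1.0]]
b = [0.0, 0.0]
probs = discriminator(S, W, b)

grl = GradientReversal(lam=1.0)
features = grl.forward(max_pool(S))
reversed_grad = grl.backward([0.5, -0.25, 1.0])
```

Because the reversed gradient is what reaches the shared extractor, a single minimization of the discriminator loss trains the discriminator to classify the task while pushing the shared features toward being uninformative about the task.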

Training
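The final loss function of our proposed model can be written as follows:
$$L = L_{NER} \cdot I(x) + L_{CWS} \cdot (1 - I(x)) + \lambda L_{Adv} \tag{18}$$
where $\lambda$ is a hyper-parameter. $L_{NER}$ and $L_{CWS}$ can be computed via Eq. 14. $I(x)$ is a switching function that identifies which task the input comes from. It is defined as follows:
$$I(x) = \begin{cases} 1, & \text{if } x \in D_{NER} \\ 0, & \text{if } x \in D_{CWS} \end{cases} \tag{19}$$
where $D_{NER}$ and $D_{CWS}$ are the Chinese NER training corpus and the CWS training corpus, respectively.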
In the training phase, at each iteration, we first select a task from $\{NER, CWS\}$ in turn. Then, we sample a batch of training instances from the selected task to update the parameters. We use the Adam algorithm (Kingma and Ba, 2014) to optimize the final loss function. Since the Chinese NER task and the CWS task may converge at different rates, we repeat the above iterations until early stopping is triggered according to the performance on the Chinese NER task.
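The alternating schedule and the combined loss described above can be sketched as follows. This is only an outline under stated assumptions: `step_fn` and `eval_ner_f1` are hypothetical callbacks standing in for one optimizer update and one development-set evaluation, the `patience` criterion is one common way to implement early stopping and is not specified in the text, and evaluating after every NER step is a simplification.

```python
def switching(task):
    """I(x): 1 if the batch comes from the NER corpus, 0 if from the CWS corpus."""
    return 1 if task == "NER" else 0


def total_loss(l_ner, l_cws, l_adv, task, lam=0.06):
    """Task loss of the active task plus the weighted adversarial loss."""
    i = switching(task)
    return l_ner * i + l_cws * (1 - i) + lam * l_adv


def train(step_fn, eval_ner_f1, max_iters=1000, patience=5):
    """Alternate between tasks; stop early based on NER development F1."""
    tasks = ("NER", "CWS")
    best_f1, bad_evals, steps = float("-inf"), 0, 0
    for it in range(max_iters):
        task = tasks[it % len(tasks)]  # select tasks in turn
        step_fn(task)                  # one batch update for the selected task
        steps += 1
        if task != "NER":
            continue
        f1 = eval_ner_f1()
        if f1 > best_f1:
            best_f1, bad_evals = f1, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                break
    return best_f1, steps


# Dry run with fake callbacks: record the visited tasks and replay fixed F1 values.
visited = []
scores = iter([0.10, 0.20, 0.15, 0.15, 0.15])
best, steps = train(visited.append, lambda: next(scores), max_iters=100, patience=3)
```

Stopping on the NER score alone reflects that CWS serves only as an auxiliary source of boundary information, so its own convergence is not the criterion.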
Table 2: There are three blocks. The first two blocks contain the main and simplified models proposed by Peng and Dredze (2015) and Peng and Dredze (2016), respectively. The last block lists the performance of our proposed model.

We evaluate our model on the Weibo NER dataset (WeiboNER) (Peng and Dredze, 2015; He and Sun, 2017a) and the SIGHAN2006 NER dataset (SighanNER) (Levow, 2006). We use the MSR dataset (from SIGHAN2005) for the CWS task. WeiboNER is annotated with four entity types (person, location, organization and geopolitical entities), including named entities and nominal mentions. SighanNER is in simplified Chinese and contains three entity types (person, location and organization). For WeiboNER, we use the same training, development and testing splits as Peng and Dredze (2015). Since SighanNER does not have a development set, we sample 10% of the training set as the development set. We use the MSR dataset to improve the performance of the Chinese NER task. Table 1 gives the details of the three datasets.

Settings
For evaluation, we use the Precision (P), Recall (R) and F1 score as metrics in our experiment.
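As an illustration of how these metrics are typically computed for NER, the sketch below scores predictions at the entity level: a predicted entity counts as correct only if both its boundaries and its type match the gold annotation. The BIO tagging scheme and the function names are assumptions for this example; the text does not specify the tagging scheme or the scoring script.

```python
def extract_entities(tags):
    """Return the set of (start, end, type) spans encoded by a BIO tag sequence."""
    spans, start, etype = set(), None, None
    # A sentinel "O" closes any entity that runs to the end of the sentence.
    for i, tag in enumerate(list(tags) + ["O"]):
        inside = tag.startswith("I-") and start is not None and tag[2:] == etype
        if inside:
            continue
        if start is not None:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def precision_recall_f1(gold_tags, pred_tags):
    """Entity-level precision, recall and F1 for one tag sequence."""
    gold, pred = extract_entities(gold_tags), extract_entities(pred_tags)
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f1


gold = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC"]
pred = ["B-PER", "I-PER", "O", "B-LOC", "O"]
p, r, f1 = precision_recall_f1(gold, pred)
```

In this toy case the location entity is predicted with a wrong right boundary, so it counts as both a false positive and a false negative, which is why boundary information matters so much for the F1 score.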
For hyper-parameter configurations, we adjust them according to the performance on development set of Chinese NER task. We set the character embedding size d e to 100. The dimensionality of LSTM hidden states d h is 120. The initial learning rate is set to 0.001. The loss weight coefficient λ is set to 0.06. We set the dropout rate to 0.3.
The number of projections h is 8. We set the batch size of SighanNER and WeiboNER as 64 and 20, respectively.
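For reference, the reported settings can be gathered into a single configuration object, sketched below. The field names are hypothetical, and the per-head dimension is derived under the usual multi-head assumption that the BiLSTM output dimension $2d_h$ is split evenly across the $h$ projections.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hyperparameters:
    char_embedding_dim: int = 100          # d_e
    lstm_hidden_dim: int = 120             # d_h, per direction
    learning_rate: float = 0.001           # initial learning rate
    adversarial_loss_weight: float = 0.06  # lambda
    dropout_rate: float = 0.3
    num_heads: int = 8                     # number of projections h
    batch_size_sighan: int = 64
    batch_size_weibo: int = 20

    @property
    def bilstm_output_dim(self):
        """Forward and backward states concatenated: 2 * d_h."""
        return 2 * self.lstm_hidden_dim

    @property
    def head_dim(self):
        """Per-head dimension, assuming an even split across heads."""
        if self.bilstm_output_dim % self.num_heads != 0:
            raise ValueError("BiLSTM output dim must be divisible by num_heads")
        return self.bilstm_output_dim // self.num_heads


config = Hyperparameters()
```

With these values the BiLSTM output has 240 dimensions, which divides evenly into 8 heads of 30 dimensions each.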
To initialize the trainable parameters, we use the Xavier initializer (Glorot and Bengio, 2010). The character embeddings used in our experiments are pre-trained on the Baidu Encyclopedia corpus and the Weibo corpus using the word2vec toolkit (Mikolov et al., 2013).

Compared with State-of-the-art Methods
In this section, we give the experimental results of our proposed model and previous state-of-the-art methods on the WeiboNER dataset and the SighanNER dataset, respectively.

Evaluation on WeiboNER
We compare our proposed model with the latest models on WeiboNER dataset. Table 2 shows the experimental results for named entities on the original WeiboNER dataset.
In the first block of Table 2, we give the performance of the main model and baselines proposed by Peng and Dredze (2015). They propose a CRF-based model to jointly train the embeddings with the NER task, which achieves better results than pipeline models. In addition, they consider the position of each character in a word to train character and position embeddings.

Table 3: Experimental results on the updated WeiboNER dataset (He and Sun, 2017a). Columns: Models; Named Entity P(%), R(%), F1(%); Nominal Mention P(%), R(%), F1(%); Overall F1(%). There are two blocks. The first block is the performance of latest models. The second block reports the performance of our proposed model. With the limited length of the page, we use "adv" to denote "adversarial".

Models                                   P(%)    R(%)    F1(%)
…                                        91.22   81.71   86.20
Zhou et al. (2006)                       88.94   84.20   86.51
Luo and Yang (2016)                      91.30   87.22   89.21
BiLSTM+CRF+adversarial+self-attention    91.73   89.58   90.64
Table 4: Results on SighanNER dataset. There are two blocks. The first block reports the result of previous methods. The second block gives the performance of our proposed model.
In the second block of Table 2, we report the performance of the main model and baselines proposed by Peng and Dredze (2016). Aiming to incorporate word boundary information into the NER task, they propose an integrated model that is jointly trained with the CWS task, improving the F1 score from 46.20% to 48.41% compared with the pipeline model (Pipeline Seg.Repr.+NER).
In the last block of Table 2, we give the experimental result of our proposed model (BiLSTM+CRF+adversarial+self-attention). We can observe that our proposed model significantly outperforms the other models. Compared with the model proposed by Peng and Dredze (2016), our method gains a 4.67% improvement in F1 score. Interestingly, the WeiboNER dataset and the MSR dataset come from different domains: WeiboNER is from the social media domain, while MSR can be regarded as news domain. The improvement indicates that our proposed adversarial transfer learning framework may not only learn task-shared word boundary information from the CWS task but also help with the domain adaptation problem.
We also conduct an experiment on the updated WeiboNER dataset. Table 3 lists the performance of the latest models and our proposed model on the updated dataset. In the first block of Table 3, we report the performance of the latest models. The model proposed by Peng and Dredze (2015) achieves an overall F1 score of 56.05%. He and Sun (2017b) propose a unified model for the Chinese NER task that exploits data from an out-of-domain corpus and in-domain unlabelled texts. The unified model improves the F1 score from 54.82% to 58.23% compared with the model proposed by He and Sun (2017a).
In the second block of Table 3, we give the result of our proposed model. Our proposed model achieves very competitive performance. Compared with the latest model proposed by He and Sun (2017b), our model improves the overall F1 score from 58.23% to 58.70%. The improvement demonstrates the effectiveness of our proposed model.

Evaluation on SighanNER
Table 4 lists the comparisons on the SighanNER dataset. We observe that our proposed model achieves new state-of-the-art performance.
In the first block, we give the performance of previous methods for the Chinese NER task on the SighanNER dataset. … propose a character-based CRF model for the Chinese NER task. Zhou et al. (2006) introduce a pipeline model, which first segments the text with a character-level CRF model and then applies a word-level CRF for tagging. Luo and Yang (2016) first train a word segmenter and then use word segmentation as additional features for sequence tagging. Although the model achieves competitive performance, with an F1 score of 89.21%, it suffers from the error propagation problem.
In the second block, we report the result of our proposed model. Compared with the state-of-the-art model proposed by Luo and Yang (2016), our method improves the F1 score from 89.21% to 90.64% without any additional features, which demonstrates the effectiveness of our proposed model.

Effectiveness of Adversarial Transfer Learning and Self-Attention
Table 5 provides the experimental results of our proposed model, the baseline and its simplified models on the SighanNER dataset and the WeiboNER dataset. The simplified models are described as follows:
• BiLSTM+CRF: The model is used as a strong baseline in our work, and is trained using Chinese NER training data.
• BiLSTM+CRF+transfer: We apply transfer learning to BiLSTM+CRF model without adversarial loss and self-attention mechanism.
• BiLSTM+CRF+self-attention: The model integrates the self-attention mechanism based on BiLSTM+CRF model.
From the experimental results in Table 5, we have the following observations:
• Effectiveness of transfer learning. BiLSTM+CRF+transfer improves the F1 score from 89.13% to 89.89% compared with BiLSTM+CRF on the SighanNER dataset and achieves a 1.08% improvement on the WeiboNER dataset, which indicates that the word boundary information from CWS is very effective for the Chinese NER task.
• Effectiveness of adversarial training. By introducing adversarial training, BiLSTM+CRF+adversarial boosts the performance compared with the BiLSTM+CRF+transfer model, showing 0.15% and 0.36% improvements on the SighanNER dataset and the WeiboNER dataset, respectively. This suggests that adversarial training can prevent specific features of the CWS task from creeping into the shared space.
• Effectiveness of self-attention mechanism. Compared with BiLSTM+CRF, BiLSTM+CRF+self-attention significantly improves the performance on the two datasets with the help of information learned from self-attention, which verifies that the self-attention mechanism is effective for the Chinese NER task.
We observe that our proposed adversarial transfer learning framework and self-attention lead to noticeable improvements over the baseline, improving the F1 score from 51.01% to 53.08% on the WeiboNER dataset and giving a 1.51% improvement on the SighanNER dataset.

Case Study
Word boundary information from the CWS task is very important for the Chinese NER task, especially when different entities appear together. We take a sentence from the WeiboNER test set as an example to illustrate the effectiveness of our proposed model. As shown in Figure 4(a), when two "person" entities appear together, our proposed method exploits word segmentation information to determine the boundary between them and then tags them correctly. In Figure 4(b), when labelling the word "上司 (the boss)", the self-attention explicitly learns the dependency on "尊重 (respect)", so our model is able to correctly classify the word into the "person" category. This supports the claim that self-attention is effective for the Chinese NER task.

Error Analysis
According to the results in Table 2 and Table 4, our proposed model achieves 4.67% and 1.43% improvements over previous state-of-the-art methods on the WeiboNER dataset and the SighanNER dataset, respectively. However, the overall performance on the WeiboNER dataset is relatively low. There are two reasons for this. One reason is that the number of training examples in the WeiboNER dataset is very limited compared with the SighanNER dataset. There are only 1.3k examples in the WeiboNER training corpus, which is not enough to train deep neural networks. The other reason is that expressions in social media are informal, which lowers the performance on the WeiboNER dataset. Nevertheless, the greater improvement on the WeiboNER dataset suggests that our method helps to alleviate these problems.

Conclusions
In this paper, we propose a novel adversarial transfer learning framework for the Chinese NER task, which exploits task-shared word boundary features while filtering out the information specific to the CWS task. Besides, we introduce a self-attention mechanism to capture the dependencies between any two characters and learn the inner structure information of the sentence. Experiments on two widely used datasets demonstrate that our method significantly and consistently outperforms previous state-of-the-art models.