Improve Neural Entity Recognition via Multi-Task Data Selection and Constrained Decoding

Entity recognition is a widely benchmarked task in natural language processing because of its many applications. The state-of-the-art solution applies a neural architecture named BiLSTM-CRF to model language sequences. In this paper, we propose an entity recognition system that improves this neural architecture with two novel techniques. The first technique is Multi-Task Data Selection, which ensures the consistency of data distribution and labeling guidelines between source and target datasets. The second is constrained decoding using a knowledge base: the decoder of the model operates at the document level and leverages global and external information sources to further improve performance. Extensive experiments have been conducted to show the advantages of each technique. Our system achieves state-of-the-art results on the English entity recognition task in the KBP 2017 official evaluation, and it also yields very strong results in other languages.


Introduction
Entity Recognition (ER) is a fundamental task in Natural Language Processing (NLP). The task includes named entity recognition and nominal entity recognition. ER is a building block for higher-level applications such as natural language understanding, question answering, and machine reading comprehension. Both subtasks are usually treated as sequence labeling problems. Although the topic has been studied extensively over the past several decades, the development of neural network and deep learning based methods in recent years (Lample et al., 2016; Ma and Hovy, 2016; Kenton Lee and Zettlemoyer, 2017; Xinchi Chen, 2017) has significantly improved on the previous state of the art.
* Work was done while doing internship at Alibaba.
A popular neural architecture for ER is BiLSTM-CRF (Lample et al., 2016). The architecture has been shown to achieve the best performance on many sequence labeling tasks. In addition, it can easily be extended to model different sources of training data. In real-world applications, it is important to include external data sources for model training, because domain-specific data alone is usually not enough to achieve the best performance. For example, in the KBP 2016 tracks, both the 1st- and 2nd-ranked teams in the NERC evaluation used external data sources (Liu et al., 2016; Xu et al., 2017) for model training. The challenge is to transfer knowledge from the external data source to the target data source. The Multi-Task (MT) BiLSTM-CRF architecture is designed for this knowledge transfer.
In this work, we develop an ER model based on the MT BiLSTM-CRF architecture, with additional entity embeddings and domain adaptation. Two novel methods are proposed to further improve model performance.

Multi-Task Data Selection
To ensure homogeneity between source and target training data, adaptive training data selection is applied to the source data during multi-task learning, filtering out instances with a different distribution or misaligned annotation guidelines. Data selection is interleaved with model training iteratively, and the process continues until convergence.

Constrained Decoding using Knowledge Base
Knowledge-based constraints are enforced at decoding time. The goal is to capture document-level context given this knowledge. For example, a phrase is likely to be an entity if it is detected as one in another sentence of the same document. The constraints also help detect related mentions: for example, the mention apple is more likely to be an ORG when it occurs in the same discussion forum as Apple Inc.

Related Work
Many works in the literature apply neural networks to ER problems (Lample et al., 2016; Ma and Hovy, 2016; Peng and Dredze, 2016). The baseline model of this work is closest to the MT BiLSTM-CRF architecture. However, we introduce an additional channel in the embedding layer (Peng and Dredze, 2016). The idea of multi-task data selection is derived from work on data selection (Moore and Lewis, 2010) and instance weighting (Jiang and Zhai, 2007) in the transfer learning community. Unlike previous work, we propose an adaptive selection approach interleaved with MT BiLSTM-CRF model training. Decoding with global constraints has been studied in (Yarowsky, 1993; Krishnan and Manning, 2006). We share similar ideas with this previous work, but explore the use of an external knowledge base (Radford et al., 2015) as constraints.

Approach
This section describes the baseline model used for the ER task. We first describe a slight variant of BiLSTM-CRF and its MT version for transfer learning. For the sake of brevity, we skip the basic theory of MT learning; more details can be found in (Zhang and Yang, 2017). We then present in detail how data selection and constrained decoding are applied to further improve model performance.

BiLSTM-CRF
BiLSTM-CRF is a widely adopted neural architecture for sequence labeling problems including ER. BiLSTM-CRF is a hierarchical model and the architecture is illustrated in Figure 1(a).
The first layer of the model maps words to their embeddings. Let $x = (x_1, \cdots, x_n)$ denote a sentence composed of $n$ words in a sequence, with each $x_i$ being the combined word/character embedding of the $i$-th word. In the second layer, the word embeddings are encoded using a bidirectional LSTM network, whose output is $h = (h_1, \cdots, h_n)$. The encodings are further passed to a fully connected network to compute the CRF features $\phi(x) = G \cdot h$. Finally, the objective to optimize is the CRF likelihood $p(y|x; \theta)$, where $y$ are the predicted labels and $Z$ is the normalizing constant of the likelihood.
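For reference, a standard linear-chain CRF likelihood that is consistent with the features $\phi(x) = G \cdot h$ can be written as below. This is only a sketch under stated assumptions: the label transition matrix $A$, the start label $y_0$, and the indexing notation $\phi_i(x)[y_i]$ (the score of label $y_i$ at position $i$) are introduced here for illustration and are not defined in the text, so the exact parameterization used in the model may differ.

```latex
p(y \mid x; \theta) = \frac{1}{Z}
  \exp\Big( \sum_{i=1}^{n} \big( \phi_i(x)[y_i] + A_{y_{i-1},\, y_i} \big) \Big),
\qquad
Z = \sum_{y'} \exp\Big( \sum_{i=1}^{n} \big( \phi_i(x)[y'_i] + A_{y'_{i-1},\, y'_i} \big) \Big)
```

Under this form, $Z$ sums over all possible label sequences $y'$ of length $n$, which is what makes the likelihood a properly normalized distribution over whole label sequences rather than over individual labels.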

Entity Embeddings
We extend the BiLSTM-CRF model by adding an entity embedding channel to the embedding layer. As a result, $x_i$ is the concatenation of the word embedding, the character embedding, and the entity embedding of the $i$-th word. Entity embeddings are derived from a noisy gazetteer created using Wikipedia articles. The gazetteer is derived from word-entity statistics. More specifically, each coordinate of the entity embedding is the probability of the word occurring as the corresponding entity type.
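The construction of entity embeddings from word-entity statistics can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the entity type inventory `ENTITY_TYPES`, the function names, the lowercasing of words, and the toy observations are all assumptions made for the example. Each coordinate of the resulting vector is the relative frequency with which a word was observed as the corresponding entity type.

```python
from collections import Counter, defaultdict

# Hypothetical entity type inventory; the text does not list the exact types.
ENTITY_TYPES = ["PER", "ORG", "GPE", "LOC", "FAC"]


def build_entity_embeddings(word_type_pairs):
    """Map each word to a vector of P(type | word) estimated from noisy counts."""
    counts = defaultdict(Counter)
    for word, entity_type in word_type_pairs:
        counts[word.lower()][entity_type] += 1

    embeddings = {}
    for word, type_counts in counts.items():
        total = sum(type_counts.values())
        # One coordinate per entity type, in the fixed order of ENTITY_TYPES.
        embeddings[word] = [type_counts[t] / total for t in ENTITY_TYPES]
    return embeddings


def lookup(embeddings, word):
    """Return a word's entity embedding, or all zeros if it is not in the gazetteer."""
    return embeddings.get(word.lower(), [0.0] * len(ENTITY_TYPES))


# Toy word-entity observations standing in for statistics mined from Wikipedia.
pairs = [("Apple", "ORG"), ("Apple", "ORG"), ("Apple", "ORG"), ("Apple", "LOC"),
         ("Paris", "GPE"), ("Paris", "PER")]
table = build_entity_embeddings(pairs)
print(lookup(table, "apple"))   # [0.0, 0.75, 0.0, 0.25, 0.0]
print(lookup(table, "banana"))  # [0.0, 0.0, 0.0, 0.0, 0.0]
```

The vector returned by `lookup` would then be concatenated with the word and character embeddings of the same token. Returning zeros for out-of-gazetteer words is one possible design choice; a dedicated "unknown" vector would be another.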

Domain Adaptation
To make use of external datasets, we apply MT BiLSTM-CRF with domain adaptation, as illustrated in Figure 1(b). The fully connected layers are adapted to the different datasets, and the CRF features are computed separately, i.e. $\phi_T(x) = G_T \cdot h$ and $\phi_S(x) = G_S \cdot h$ for the target and source datasets, respectively. The objectives $p(y|x; \theta_T)$ and $p(y|x; \theta_S)$ are optimized in alternating order.
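The separation between a shared encoder and dataset-specific projections can be illustrated with a small sketch. It assumes that the BiLSTM output h is shared across datasets and that only the projections G_T and G_S differ, as the text describes; the helper names, the random stand-in values, and the exact batch-alternation scheme are hypothetical and are not taken from the paper.

```python
import random


def matvec(matrix, vector):
    """Multiply a (labels x hidden) matrix by a hidden-size vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def crf_features(projection, encodings):
    """Compute phi(x) = G . h for every position of one sentence."""
    return [matvec(projection, h_i) for h_i in encodings]


def alternating_schedule(num_target_batches, num_source_batches):
    """Yield dataset tags so that target and source batches alternate."""
    for step in range(max(num_target_batches, num_source_batches)):
        if step < num_target_batches:
            yield "T"
        if step < num_source_batches:
            yield "S"


random.seed(0)
hidden_size, num_labels = 4, 3
# Shared BiLSTM output h for a 2-word sentence (random stand-in values).
h = [[random.uniform(-1, 1) for _ in range(hidden_size)] for _ in range(2)]
# Dataset-specific projections G_T and G_S; only these differ between domains.
G = {tag: [[random.uniform(-1, 1) for _ in range(hidden_size)]
           for _ in range(num_labels)] for tag in ("T", "S")}

phi_T = crf_features(G["T"], h)
phi_S = crf_features(G["S"], h)
print(len(phi_T), len(phi_T[0]))         # 2 3
print(list(alternating_schedule(2, 3)))  # ['T', 'S', 'T', 'S', 'S']
```

In a real training loop, each tag produced by the schedule would select which likelihood, $p(y|x; \theta_T)$ or $p(y|x; \theta_S)$, receives the next gradient update, while the encoder parameters are updated in both cases.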

Multi-Task Data Selection
Multi-task training can alleviate some of the problems caused by data heterogeneity between target and source. This section presents an adaptive data selection algorithm, applied during multi-task training, that further removes noisy data from the source dataset.
The data selection procedure is described in detail in Algorithm 1. At each iteration, data selection from the source domain is interleaved with model parameter updates. Training data is selected based on a consistency score, which measures the similarity between the target and source data distributions. Specifically, the consistency score is derived from the KL divergence between $\phi_T(x)$ and $\phi_S(x)$ for every word in each sentence of the source training data. According to step 4, data that are not consistent with the target are eliminated from the training dataset. The iterations terminate when little additional data remains to be filtered out, as determined by a manually tuned threshold.

Algorithm 1 Multi-task Data Selection
Input: Target training dataset $(x, y) \in T$, source training dataset $(x', y') \in S$.
Initialize: $S_{train} \leftarrow S$; $X_S = \{x' : (x', y') \in S\}$.
Repeat:
1. Train the model for one iteration by optimizing an instance-weighted objective function.
2. Compute the consistency score for each training example in $S$.
3. Determine the sets $S_{same}$ and $S_{diff}$ from the consistency scores. The thresholds $\alpha$ and $\beta$ are set manually and determine the selection or exclusion of a data point.
4. Update the source training set $S_{train}$: $S_{train} \leftarrow S_{train} \cup S_{same} \setminus S_{diff}$. In the new training set, data with different distributions are eliminated.
Until: $|S_{diff}| < k$
Return: the final BiLSTM-CRF model.
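The selection loop of Algorithm 1 can be sketched in a few lines. Several details below are assumptions, because the text does not give them: the CRF features are turned into distributions with a softmax, the consistency score is the negative mean per-word KL divergence, examples scoring at least `alpha` go to S_same and examples scoring below `beta` go to S_diff, and the stopping test counts only newly excluded examples. The callables `train_step` and `features` are hypothetical stand-ins for the model, and instance weighting is left to `train_step`.

```python
import math


def softmax(scores):
    """Turn raw label scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def consistency_score(phi_t, phi_s):
    """Negative mean per-word KL between softmax(phi_T) and softmax(phi_S).

    Assumption: the normalization and the averaging are illustrative only.
    """
    kls = [kl_divergence(softmax(t), softmax(s)) for t, s in zip(phi_t, phi_s)]
    return -sum(kls) / len(kls)


def select_source_data(source, train_step, features, alpha, beta, k, max_iters=50):
    """Interleave training with source data selection until few items are dropped."""
    s_train = set(range(len(source)))
    for _ in range(max_iters):
        train_step(s_train)                                          # step 1
        scores = [consistency_score(*features(x)) for x in source]   # step 2
        s_same = {i for i, c in enumerate(scores) if c >= alpha}     # step 3
        s_diff = {i for i, c in enumerate(scores) if c < beta}
        newly_dropped = s_diff & s_train
        s_train = (s_train | s_same) - s_diff                        # step 4
        if len(newly_dropped) < k:  # interpretation of the |S_diff| < k test
            break
    return s_train


# Toy source set: each item is (phi_T, phi_S) for a one-word sentence.
source = [
    ([[2.0, 0.0]], [[2.0, 0.0]]),  # identical scores, KL = 0
    ([[2.0, 0.0]], [[1.8, 0.1]]),  # nearly identical scores
    ([[3.0, 0.0]], [[0.0, 3.0]]),  # strongly disagreeing scores
]
kept = select_source_data(source, train_step=lambda s: None,
                          features=lambda x: x, alpha=-0.05, beta=-0.5, k=1)
print(sorted(kept))  # [0, 1]
```

In this toy run the third example is excluded in the first pass because the two heads disagree strongly on it, and the loop stops in the second pass when nothing new is dropped.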

Constrained Decoding using Knowledge Base
It is well established that non-local information can help improve entity recognition performance (Radford et al., 2015; Krishnan and Manning, 2006). Here we describe the globally constrained decoding method (Graves et al., 2012) used in our model. In particular, we use external knowledge to guide the decoding process at the document level.

Knowledge Base
An external knowledge base is built from Wikipedia articles (Radford et al., 2015; Dalton et al., 2014). For each Wikipedia entity, we first extract all its aliases from the redirects, and then build a cluster of mentions for this entity that includes all its aliases. The goal is that, given a document that mentions Microsoft, the knowledge base can help identify other mentions such as MS Corp. The knowledge base can naturally be extended so that a cluster includes related entities (using anchor texts) in addition to aliases of the same entity; we leave this to future work.
We then apply global decoding with constraint $C$, which requires that all mentions belonging to the same cluster be labeled with the same entity type within a single document. Decoding is therefore performed jointly over the label sequences $y_{1:N}$, where the subscripts $1:N$ index the sentences within the same document. We use a greedy algorithm for decoding.
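One way to realize the same-cluster constraint greedily is sketched below. The greedy procedure itself is not specified in the text, so this is an assumption: every mention that maps to a knowledge-base cluster votes for its predicted type with its model score, and all mentions of that cluster in the document are relabeled with the winning type. The function name, the `alias_to_cluster` lookup, the cluster id, and the scores are hypothetical.

```python
from collections import Counter, defaultdict


def constrained_decode(mentions, alias_to_cluster):
    """Greedily relabel mentions so each KB cluster has one type per document.

    mentions: list of (sentence_index, surface_text, predicted_type, score)
    alias_to_cluster: lowercase alias -> cluster id (hypothetical KB lookup)
    """
    votes = defaultdict(Counter)
    for _, text, etype, score in mentions:
        cluster = alias_to_cluster.get(text.lower())
        if cluster is not None:
            votes[cluster][etype] += score  # accumulate model confidence

    decoded = []
    for sent, text, etype, _ in mentions:
        cluster = alias_to_cluster.get(text.lower())
        if cluster is not None:
            etype = votes[cluster].most_common(1)[0][0]  # cluster-level winner
        decoded.append((sent, text, etype))
    return decoded


# Toy knowledge base: two aliases of the same entity share one cluster id.
kb = {"microsoft": "cluster_ms", "ms corp": "cluster_ms"}
document = [
    (0, "Microsoft", "ORG", 0.95),
    (1, "MS Corp", "PER", 0.40),   # low-confidence local error
    (2, "Seattle", "GPE", 0.90),   # not in the KB, left unchanged
]
print(constrained_decode(document, kb))
# [(0, 'Microsoft', 'ORG'), (1, 'MS Corp', 'ORG'), (2, 'Seattle', 'GPE')]
```

Weighting votes by score lets one confident mention correct several uncertain ones; unweighted voting would be an equally simple alternative.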

Experiments
This section presents experimental results of our methods on the KBP 2016 and 2017 evaluation datasets. We focus on the English (ENG) and Mandarin Chinese (CMN) ER tasks, which include both named entity recognition (NAM) and nominal entity recognition (NOM). The neural models are implemented using TensorFlow (Abadi et al., 2016). Dropout and gradient clipping are applied when necessary to avoid numerical issues during training. Performance is reported using the NERC F1 score as defined in (Ji et al., 2016).

Datasets
KBP 2015 data is used for training when evaluating on the KBP 2016 evaluation dataset, and both datasets are used for training for the KBP 2017 evaluation. We also leverage external data sources to improve model performance. Unlike Liu et al. (2016), we cannot afford manual annotation because of budget limits; we instead use ACE (Walker et al., 2006) (Song et al., 2015) entity annotations as source datasets. It is worth noting that annotation guidelines differ from one dataset to another, especially for nominal entity annotations.

Baseline
The baseline is a BiLSTM-CRF model with word and character embeddings that is trained on the simple combination of source and target data. GloVe vectors (Pennington et al., 2014) are used as word embeddings. NAM and NOM models are trained separately with individually tuned parameters.

Results
First, we examine the performance impact of entity embeddings. As shown in Table 1, entity embeddings are very useful for both the NAM and NOM prediction tasks, and for both languages. They provide an overall performance improvement of 2.2 F1 points. Since the entity embeddings are derived from soft gazetteer features, this experiment confirms again the usefulness of gazetteers even in neural network models. In theory, the information in the entity embeddings could already be captured by the model itself; the additional predictive power actually comes from the external dataset (Wikipedia) from which the embeddings are derived.
Next, the effectiveness of Multi-Task Data Selection is evaluated. Results in Table 2 show that both MT and MTDS significantly improve NOM detection over the baseline, and that adaptive data selection in MTDS further improves over the MT model. However, there is no gain at all in NAM detection for either language. We manually examined the source and target datasets and found that the annotation guidelines and data distribution of the NAM data are quite similar, while there are some significant differences for the NOM data. Notably, many plural nouns are marked as nominal entities in the ACE dataset, while in our target KBP tasks plural nouns are not labeled as nominal entities.
Finally, we use a model ensemble to further improve model scores. Four models are combined for the final evaluation, and majority voting is applied to produce the final results. We present the evaluation results on both the KBP 2016 and 2017 datasets in Table 4, and compare them with state-of-the-art scores (Ji et al., 2016). Our system ranks 1st in the English entity recognition task in the official evaluation in 2017. We also perform very strongly in Chinese: the best team applies many hand-tuned rules in the evaluation, while our model is free of rules.
It can also be concluded from the table that the additional training data for KBP 2016 increases overall model performance by 0.7 F1 points.
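The majority vote used for the ensemble can be illustrated with a short sketch. The text does not say at which granularity the vote is taken or how ties are resolved, so the token-level voting and the tie-breaking rule below (prefer the earliest model in the list) are assumptions, as are the function name and the toy predictions.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-token label sequences from several models by majority vote.

    Ties are broken in favor of the earliest model in the list (an assumption).
    """
    combined = []
    for labels in zip(*predictions):
        counts = Counter(labels)
        best = max(counts.values())
        # Pick the first label, in model order, that reaches the top count.
        combined.append(next(label for label in labels if counts[label] == best))
    return combined


# Toy outputs of four models on one 4-token sentence.
model_outputs = [
    ["B-ORG", "I-ORG", "O", "B-PER"],
    ["B-ORG", "O",     "O", "B-PER"],
    ["B-ORG", "I-ORG", "O", "O"],
    ["B-GPE", "I-ORG", "O", "O"],
]
print(majority_vote(model_outputs))  # ['B-ORG', 'I-ORG', 'O', 'B-PER']
```

Token-level voting is simple but can produce label sequences that are not valid under a BIO scheme; voting on whole mention spans would avoid this at the cost of a slightly more involved implementation.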

Conclusion and Future Work
This paper presents novel methods to improve neural entity recognition. Multi-task data selection removes noise from the training data, while constrained decoding further improves the model by exploiting global and external information sources. Extensive experiments show the effectiveness of the methods. Further work is needed to establish a theoretical foundation for the adaptive data selection algorithm. Furthermore, the runtime and computational complexity of the system should be studied. We also plan to extend the knowledge base clusters to include related entities.