Gazetteer-Enhanced Attentive Neural Networks for Named Entity Recognition

Current region-based NER models rely solely on fully-annotated training data to learn an effective region encoder, and therefore often face a training data bottleneck. To alleviate this problem, this paper proposes Gazetteer-Enhanced Attentive Neural Networks (GEANN), which enhance region-based NER by learning name knowledge of entity mentions from easily-obtainable gazetteers rather than only from fully-annotated data. Specifically, we first propose an attentive neural network (ANN), which explicitly models the mention-context association and is therefore convenient for integrating externally-learned knowledge. Then we design an auxiliary gazetteer network, which can effectively encode the name regularity of mentions using only gazetteers. Finally, the learned gazetteer network is incorporated into ANN for better NER. Experiments show that ANN achieves state-of-the-art performance on the ACE2005 named entity recognition benchmark. Moreover, incorporating the gazetteer network further improves performance and significantly reduces the amount of training data required.


Introduction
Named entity recognition (NER), which aims to identify text mentions of specific entity types, is a fundamental NLP task. Recently, region-based NER approaches (Finkel and Manning, 2009; Xu et al., 2017; Sohrab and Miwa, 2018) have attracted significant attention. These approaches first encode all candidate regions (commonly all subsequences of a sentence) using a region encoder, then use a classifier to identify whether each subsequence is an entity mention of a target type. For example, in Figure 1 all subsequences of the sentence, such as "George Washington", are first encoded and then classified into entity types. Compared to sequential labeling models, region-based models can naturally detect nested or overlapping mentions by considering all subsequences, and are therefore of great value to NER.

* Xianpei Han is the corresponding author.

Figure 1: The overall architecture of GEANN. The candidate region is "George Washington", which literally could be a person or an organization (university).

Generally, an effective region encoder should capture two kinds of knowledge for NER. One is name knowledge, which encodes the inner compositional regularity of entity mentions, i.e., how likely a subsequence itself is to be an entity mention. For example, a region encoder should know that "George Washington" is a valid PER name because "George" is a common first name and "Washington" is a common last name. The other is context knowledge, which identifies whether the subsequence in its context indeed refers to an entity. For example, a region encoder should know that "X said" is a suitable context for a PER mention and "study at X" is a suitable context for an ORG mention.
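To make the region-based setup concrete (a minimal sketch of our own, not the paper's code), candidate-region enumeration over a sentence looks like:

```python
# Minimal sketch of candidate-region enumeration for region-based NER.
# All subsequences of the sentence become candidate regions s_ij,
# which a classifier would then label with an entity type or NIL.
def enumerate_regions(tokens):
    """Return all (i, j, subsequence) triples with i <= j (0-indexed, inclusive)."""
    regions = []
    for i in range(len(tokens)):
        for j in range(i, len(tokens)):
            regions.append((i, j, tokens[i:j + 1]))
    return regions

sentence = "George Washington studied here".split()
regions = enumerate_regions(sentence)
# A 4-token sentence yields 4*5/2 = 10 candidate regions, including the
# nested candidates "George Washington" and "Washington".
print(len(regions))  # 10
```

Because the candidates include every nested span, a region classifier can recognize overlapping mentions that sequential tagging schemes cannot.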
Currently, most region-based NER models learn these two kinds of knowledge only from expensive, fully-annotated training data, and therefore often face a training data bottleneck, i.e., a lack of training data will significantly undermine their performance. To address this problem, we observe that name knowledge can be effectively captured by leveraging easily-obtainable gazetteer resources. For example, it is easy to learn the company name patterns "the ... company" and "... Inc." from a company-name gazetteer containing "the Walt Disney company" and "Apple Inc.". By capturing the mention regularity entailed in gazetteers, region-based models can be enhanced with more accurate name knowledge, and the need for fully-annotated training data can thereby be reduced.
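As a toy illustration of this idea (our own sketch, not the paper's method), even simple boundary-token statistics over a small gazetteer surface patterns like "the ... company" and "... Inc.":

```python
from collections import Counter

# Toy illustration of the name regularity a gazetteer exposes: frequent
# boundary tokens of company names act like the patterns "the ... company"
# and "... Inc." mentioned above. The entries below are illustrative.
gazetteer = [
    "the Walt Disney company",
    "Apple Inc.",
    "the Ford Motor company",
    "Intel Inc.",
]
first = Counter(name.split()[0] for name in gazetteer)
last = Counter(name.split()[-1] for name in gazetteer)
print(first.most_common(1))  # [('the', 2)]
print(last.most_common(2))   # [('company', 2), ('Inc.', 2)]
```

A neural gazetteer network generalizes this idea: instead of counting surface tokens, it learns a distributed representation of such compositional regularities.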
To this end, this paper proposes Gazetteer-Enhanced Attentive Neural Networks (GEANN), whose architecture is shown in Figure 1. Specifically, to better decouple name and context knowledge for incorporating gazetteer information, we first design a new region-based attentive neural network (ANN), which introduces an attention mechanism (Bahdanau et al., 2014; Vaswani et al., 2017; Su et al., 2018) to explicitly model the association between mentions and contexts. Starting from ANN, we further introduce an auxiliary gazetteer network, which can effectively learn name knowledge using only gazetteers, i.e., it encodes each utterance in a context-free way and identifies whether it matches the regular patterns of mentions. Finally, the learned gazetteer network is incorporated into ANN to capture better name and context knowledge. Experiments show that ANN achieves state-of-the-art NER performance, and that incorporating name knowledge from gazetteers significantly reduces the training data requirement. To the best of our knowledge, this is the first work to explicitly exploit the mention-context association with an attention mechanism in region-based NER, as well as the first to enhance an NER model with name knowledge captured from gazetteers using neural networks.

Attentive Neural Network for NER
This section describes our attentive neural network, which directly classifies over all subsequences of a sentence to recognize whether each subsequence corresponds to an entity mention. Figure 1 (a) shows the architecture of ANN.
Given a sentence, ANN first maps all words into word representations $\{\mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_n\}$ following Lample et al. (2016). Then a BiLSTM layer is used to obtain context-aware word representations $\mathbf{h}_t^A$. After that, for each candidate region $s_{ij}$, we follow Sohrab and Miwa (2018) and use an inner-region encoder to obtain its representation $\mathbf{s}_{ij}$, which captures name knowledge from both its boundary and inside information:

$$\mathbf{s}_{ij} = \mathrm{MLP}\Big(\Big[\mathbf{h}_i^A; \mathbf{h}_j^A; \frac{1}{j-i+1}\sum_{t=i}^{j}\mathbf{h}_t^A\Big]\Big)$$

where MLP is a multi-layer perceptron. To explicitly model the association between a region and its context, we design an attentive context encoder, which outputs a contextual vector $\mathbf{c}_{ij}$ entailing the context knowledge of $s_{ij}$:

$$\mathbf{c}_{ij} = \sum_{k=1}^{n}\alpha_{ijk}\mathbf{h}_k^A, \qquad \alpha_{ijk} = \underset{k}{\mathrm{softmax}}\big(\mathrm{att}(\mathbf{s}_{ij}, \mathbf{h}_k^A)\big)$$

where $\mathrm{att}(\cdot, \cdot)$ is an attentive model which scores how important the word $x_k$ is for recognizing the entity type of $s_{ij}$. After obtaining $\mathbf{c}_{ij}$, we concatenate it with $\mathbf{s}_{ij}$ and feed the result into an MLP classifier to obtain the probability of $s_{ij}$ corresponding to each entity type (or NIL if $s_{ij}$ is not a mention). Like previous methods, ANN is learned by minimizing the cross-entropy loss on the fully-annotated training data.

By decoupling and explicitly modeling the mention-context association, ANN not only better identifies entity mentions but also makes it easy to incorporate external name knowledge. This enables the convenient integration of gazetteer knowledge, as we illustrate next.
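The attentive context encoder can be sketched in a few lines of NumPy (our own minimal sketch; the paper does not fix the exact scoring function $\mathrm{att}(\cdot, \cdot)$, so we assume simple dot-product scoring, and the function names are ours):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def attentive_context(s_ij, H):
    """Sketch of the attentive context encoder.

    H:    (n, d) context-aware word representations h_1..h_n
    s_ij: (d,)   region representation from the inner-region encoder
    Returns the context vector c_ij and the attention weights alpha.
    """
    scores = H @ s_ij        # att(s_ij, h_k): dot-product scoring (an assumption)
    alpha = softmax(scores)  # how important each word is for this region
    c_ij = alpha @ H         # weighted sum of word representations
    return c_ij, alpha

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))  # 5 words, hidden dimension 8
s_ij = rng.normal(size=8)
c_ij, alpha = attentive_context(s_ij, H)
assert np.isclose(alpha.sum(), 1.0) and c_ij.shape == (8,)
```

The concatenation [c_ij; s_ij] then goes to the MLP classifier; because s_ij is computed independently of the context, it is exactly the slot where an externally-learned gazetteer representation can be concatenated in.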
Gazetteer Network

Given an input utterance, the gazetteer network predicts whether it should be included in the gazetteer of specific entity types, i.e., whether it is a valid entity name. The gazetteer network is context-free: it only considers whether the input follows the compositional regularity of mentions, and therefore can be trained using gazetteers alone.
Formally, given an input utterance $u = \{u_i, ..., u_j\}$, an utterance encoder first learns its representation $\mathbf{u}$; this encoder has a similar structure to the inner-region encoder of ANN. After that, $\mathbf{u}$ is used to compute the probability of the utterance being a valid name of each type:

$$\mathbf{O}_u^G = \sigma\big(\mathrm{MLP}(\mathbf{u})\big)$$

where $\sigma$ is the sigmoid function and the $k$-th dimension of $\mathbf{O}_u^G$ indicates the probability of $u$ being a valid mention of type $y_k$. Because an utterance can be a valid mention of several types, we use a multi-label, multi-class cross-entropy loss to train the gazetteer network:

$$\mathcal{L}_G = -\sum_{u \in G}\sum_{k}\Big[g_{u,k}\log O_{u,k}^G + (1 - g_{u,k})\log\big(1 - O_{u,k}^G\big)\Big]$$

where $G$ is the set of gazetteers and $\mathbf{g}_u$ is a binary indicator vector whose $k$-th dimension is set to 1 if utterance $u$ is in the gazetteer of type $y_k$, and 0 otherwise.

In this way, a well-trained gazetteer network learns an effective utterance representation $\mathbf{u}$, which is used to identify whether the utterance is a valid mention. This means $\mathbf{u}$ should capture enough name knowledge of specific entity types. To incorporate such knowledge into ANN, we simply concatenate the representation learned by the gazetteer network with the representation learned by the original inner-region encoder, and feed this new representation into the subsequent modules of ANN. In this way, name knowledge learned from gazetteers enhances the region encoder, and the requirement for fully-annotated data is reduced.
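The multi-label objective can be sketched numerically (our own minimal reconstruction; the utterance encoder and MLP are replaced by raw per-type scores for brevity):

```python
import numpy as np

def sigmoid(z):
    # Element-wise sigmoid: P(u is a valid name of type y_k) per dimension.
    return 1.0 / (1.0 + np.exp(-z))

def gazetteer_loss(logits, g_u):
    """Multi-label binary cross-entropy, one sigmoid per entity type.

    logits: (K,) per-type scores for the utterance (stand-in for MLP(u))
    g_u:    (K,) binary gazetteer-membership vector
    """
    p = sigmoid(logits)
    eps = 1e-12  # guard against log(0)
    return -np.sum(g_u * np.log(p + eps) + (1 - g_u) * np.log(1 - p + eps))

# "Washington" could be both a PER name and a GPE name, so g_u may contain
# several 1s -- hence multi-label rather than softmax classification.
g_u = np.array([1.0, 0.0, 1.0])
good = gazetteer_loss(np.array([5.0, -5.0, 5.0]), g_u)   # confident & correct
bad = gazetteer_loss(np.array([-5.0, 5.0, -5.0]), g_u)   # confident & wrong
assert good < bad
```

The sigmoid-per-type formulation is what lets a single utterance belong to multiple gazetteers at once, which a single softmax over types could not express.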

Experimental Settings
Data Preparation. We conducted experiments on the ACE2005 named entity recognition task. (Conventional NER datasets, such as CoNLL2003, removed nested entity mentions and are therefore not suitable for evaluating region-based NER models.) We used the same dataset splits as Katiyar and Cardie (2018). For each entity type in ACE2005, we collect a gazetteer from Wikipedia anchor texts, i.e., anchor texts linking to entities of the same type are included in one gazetteer. As in previous studies, models are evaluated using micro-F1. To balance time complexity and recall, we follow previous work and restrict mention length to at most 6 words, which covers more than 93% of mentions.
Baselines. The following methods were compared:
1) LSTM-CRF (Lample et al., 2016), the most widely used NER baseline, which cannot handle nested or overlapping mentions.
2) Neural Transition, a transition-based model which achieves very competitive performance on ACE2005.
3) Segmental Hypergraph, a hypergraph-based model which introduces a new tagging schema and achieved the previous state-of-the-art performance on ACE2005.
4) Exhaustive Model (Sohrab and Miwa, 2018), a region-based model using a region encoder to capture both inner and boundary features of a candidate region, similar to ANN without the attentive context encoder.

Overall Results

Table 1 shows the overall results of our methods compared with the baselines. We can see that:
1) By explicitly modeling both name and context knowledge, the proposed attentive neural network is an effective region-based NER model and achieves state-of-the-art performance. Compared with the other baselines, ANN achieves significant F1-score improvements.
2) Incorporating name knowledge from gazetteers can significantly improve the performance.
GEANN further achieves a 1.1 F1-score improvement over ANN. This indicates that name knowledge learned from gazetteers can significantly enhance the region encoder and thereby improve NER performance.
3) Our attentive context encoder provides an effective way to exploit context knowledge for NER. Compared with the Exhaustive baseline, ANN achieves significant improvement by explicitly modeling the association between entity mentions and their contexts.

Effects of Gazetteer Network
To further investigate the effect of introducing gazetteers, Table 2 shows the results as the training data size varies. We can see that:
1) For Transition and SH, performance degrades significantly when training data is reduced. We believe this is because these approaches need to model complex label structures, for which large-scale training data is critical, so reducing the training data has a large impact on them.
2) Region-based models are less sensitive to the reduction of training data. We believe this is because their output structure is simple, so they can be trained with less data.
3) GEANN achieves significant improvements over ANN regardless of training data size. By leveraging name knowledge from gazetteers, GEANN with only 50% of the training data achieves performance comparable to ANN trained on the entire dataset.

GEANN with BERT
Pretrained contextual representations, such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018), have driven significant progress on many NLP tasks, especially in low-resource settings. To verify the adaptability of the proposed GEANN, we further introduce BERT into the models by replacing the word embeddings with BERT representations. Figure 3 shows the results. Even though BERT alone enhances NER, GEANN still achieves significant further improvement over BERT regardless of training data size. This verifies that GEANN captures task-specific name knowledge, which complements the universal knowledge of pretrained language models.

Related Work
Sequential labeling approaches (Zhou and Su, 2002; Chieu and Ng, 2002; Bender et al., 2003; Settles, 2004; Lample et al., 2016) are widely used for NER, but this paradigm cannot handle nested mentions without specially designed tagging schemas (Lu and Roth, 2015; Katiyar and Cardie, 2018; Lin et al., 2019). Recently, region-based models have provided a natural solution to this issue. Finkel and Manning (2009) first proposed classifying regions corresponding to parse-tree nodes. Xu et al. (2017) proposed directly classifying all subsequences with a neural network model. Sohrab and Miwa (2018) extended this method by introducing a new region encoder. These methods have achieved promising results but rely heavily on fully-annotated data.
Gazetteers or dictionaries have long been regarded as a useful and easily-obtainable resource for NER. Previous methods commonly incorporated gazetteers either by using them as handcrafted features (Bender et al., 2003; Tsuruoka and Tsujii, 2003; Ciaramita and Altun, 2005; Minkov et al., 2005; Ritter et al., 2011; Seyler et al., 2018), or by using them to generate training data via distant supervision (Ren et al., 2015; Giannakopoulos et al., 2017; Shang et al., 2018). However, the former cannot fully leverage the inner mention-structure knowledge entailed in gazetteers, while the latter introduces considerable noise.

Conclusions
This paper first proposes the attentive neural network (ANN), an effective region-based model which explicitly models the mention-context association. We then propose an auxiliary gazetteer network to enhance ANN. The gazetteer network can effectively learn name knowledge using only easily-available gazetteers, and therefore significantly improves model performance and reduces the training data requirement. Experiments show that GEANN achieves state-of-the-art performance on ACE2005 with a much lower data requirement.