Towards Improving Neural Named Entity Recognition with Gazetteers

Most of the recently proposed neural models for named entity recognition have been purely data-driven, with a strong emphasis on getting rid of the efforts for collecting external resources or designing hand-crafted features. This could increase the chance of overfitting since the models cannot access any supervision signal beyond the small amount of annotated data, limiting their power to generalize beyond the annotated entities. In this work, we show that properly utilizing external gazetteers could benefit segmental neural NER models. We add a simple module on the recently proposed hybrid semi-Markov CRF architecture and observe some promising results.


Introduction
In the past few years, neural models have become dominant in research on named entity recognition (NER) (Lample et al., 2016; Ma and Hovy, 2016; Chiu and Nichols, 2016, inter alia), as they effectively utilize distributed representations learned from large-scale unlabeled texts (Pennington et al., 2014; Peters et al., 2018; Devlin et al., 2018, inter alia), while avoiding the huge efforts required for designing hand-crafted features or gathering external lexicons. Modern neural NER models have achieved new state-of-the-art performance on standard benchmarks such as the popular CoNLL 2003 shared task dataset (Tjong Kim Sang and De Meulder, 2003).
An end-to-end model that lets the data speak for itself seems appealing at first sight. However, given that the amount of labeled training data for NER is relatively small compared with tasks that have millions of training examples, the annotated entities can only achieve rather limited coverage of a theoretically infinite space of variant entity names. Moreover, current neural architectures rely heavily on word forms due to the use of word embeddings and character embeddings, which could lead to a high chance of overfitting. For instance, all the appearances of the single token Clinton in the CoNLL 2003 dataset are person names, while in practice it can also refer to locations. Data-driven end-to-end models trained on that dataset could implicitly bias towards predicting PERSON for most occurrences of Clinton, even in contexts where it refers to a location.

* Work done during an internship at Microsoft Research Asia.
On the other hand, for frequently studied languages such as English, people have already collected dictionaries or lexicons consisting of long lists of entity names, known as gazetteers. Gazetteers can be treated as an external source of knowledge that guides models towards wider coverage beyond the annotated entities in NER datasets. In traditional log-linear named entity taggers (Ratinov and Roth, 2009; Luo et al., 2015), gazetteers are commonly used as discrete features indicating whether the current token or current span appears in the gazetteer. There seems to be no reason for a neural model not to utilize off-the-shelf gazetteers as well.
In this paper, we make a simple attempt at utilizing gazetteers in neural NER. Building on a recently proposed architecture called hybrid semi-Markov conditional random fields (HSCRFs), where span-level scores are derived from token-label scores, we introduce a simple additional module that scores a candidate entity span by the degree to which it softly matches the gazetteer. Experimental studies over CoNLL 2003 and OntoNotes show the utility of gazetteers for neural NER models.

Hybrid semi-Markov CRFs
Our approach is by nature based on the hybrid semi-Markov conditional random fields (HSCRFs) proposed by Ye and Ling (2018), which connect traditional CRFs (Lafferty et al., 2001) and semi-Markov CRFs (Sarawagi and Cohen, 2005) by simultaneously leveraging token-level and segment-level scoring information.
Let s = s_1, ..., s_p denote a segmentation of input sequence x = x_1, ..., x_n, where a segment s_j = ⟨t_j, u_j, y_j⟩ represents a span with a start position t_j, an end position u_j, and a label y_j ∈ Y. We assume that all segments have positive lengths and that the start position of the first segment is always 1; the segmentation s then satisfies t_1 = 1, u_p = n, u_j − t_j ≥ 0, and t_{j+1} = u_j + 1 for 1 ≤ j < p. Let l = l_1, ..., l_n be the corresponding token-level labels of x. A traditional semi-CRF (Sarawagi and Cohen, 2005) produces a segmentation of an input sequence and assigns a label to each segment in it. For named entity recognition, the correct segmentation of the sentence Scottish Labour Party narrowly backs referendum would be ⟨1, 3, ORG⟩, ⟨4, 4, O⟩, ⟨5, 5, O⟩, ⟨6, 6, O⟩, i.e., [Scottish Labour Party]_ORG followed by three single-token non-entity segments.

HSCRFs inherit the definition of segmentation probability from traditional semi-CRFs. Given a sequence x = x_1, ..., x_n, the probability of a segmentation s = s_1, ..., s_p is defined as

    p(s | x) = score(s, x) / Z(x),    (1)

where score(s, x) = ∏_{j=1}^{p} ψ(y_j, y_{j+1}, x, t_j, u_j), and Z(x) = Σ_{s'} score(s', x) is the normalization term. Note that y_{p+1} is defined as a special END label. The Viterbi algorithm can be used for decoding, i.e., finding the most likely segmentation of a query sentence.
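As a concrete illustration, the semi-Markov Viterbi decoding mentioned above can be sketched as a small dynamic program. This is our own minimal sketch, not the paper's implementation: the segment scorer `seg_score` is a hypothetical stand-in for the potential ψ, label transitions are omitted for brevity, and a maximum segment length is assumed.

```python
def semi_markov_viterbi(n, labels, seg_score, max_len=7):
    """Find the highest-scoring segmentation of a length-n sequence.

    seg_score(t, u, y) -> score of a segment covering tokens t..u
    (0-indexed, inclusive) with label y.  Transition scores between
    segment labels are omitted to keep the sketch short.
    """
    NEG_INF = float("-inf")
    best = [NEG_INF] * (n + 1)  # best[u] = best score of a segmentation of x[0:u]
    best[0] = 0.0
    back = [None] * (n + 1)     # back[u] = (start, label) of the last segment
    for u in range(1, n + 1):
        for t in range(max(0, u - max_len), u):
            for y in labels:
                s = best[t] + seg_score(t, u - 1, y)
                if s > best[u]:
                    best[u] = s
                    back[u] = (t, y)
    # Recover the segments by walking the backpointers.
    segments, u = [], n
    while u > 0:
        t, y = back[u]
        segments.append((t + 1, u, y))  # 1-indexed (t_j, u_j, y_j) as in the text
        u = t
    return list(reversed(segments))
```

With a toy scorer that prefers an ORG span over tokens 1-3 and single-token O segments elsewhere, the decoder recovers exactly the segmentation of the Scottish Labour Party example.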
HSCRFs employ a specific method to calculate the segment score from token-level labels, with the potential function ψ(·) defined as

    ψ(y_j, y_{j+1}, x, t_j, u_j) = exp( Σ_{i=t_j}^{u_j} ϕ_token(l_i, w_i) + b_{y_j, y_{j+1}} ),    (2)

    ϕ_token(l_i, w_i) = a_{l_i}^⊤ v_i,

where b_{y_j, y_{j+1}} is the segment-label transition score from y_j to y_{j+1}, ϕ_token(l_i, w_i) calculates the score of the i-th token being classified into token-level label l_i, v_i is the feature representation vector of the i-th token x_i, and a_{l_i} is the weight parameter vector for token label l_i. In HSCRFs, v_i is the concatenation of (1) the BiLSTM-encoded representation of x_i and (2) the embedding of the token's position in the segment.
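Under the definitions above, computing a single segment potential can be sketched as follows. This is our illustrative reading of Equation 2, not the authors' code; the array names (`V`, `A`, `b`) and the calling convention are ours.

```python
import numpy as np

def segment_potential(V, A, b, tok_labels, y_j, y_next):
    """psi(y_j, y_{j+1}, x, t_j, u_j) for one candidate segment.

    V          : (segment_len, d) token feature vectors v_i inside the segment
    A          : (num_token_labels, d) weight vectors a_{l_i}, one row per label
    b          : (num_seg_labels, num_seg_labels) transition scores b_{y_j, y_{j+1}}
    tok_labels : token-level label ids l_i expanded from the segment label y_j
    """
    # Sum of per-token scores phi_token(l_i, w_i) = a_{l_i} . v_i
    token_scores = sum(A[l] @ v for l, v in zip(tok_labels, V))
    # Add the segment-label transition score, then exponentiate.
    return np.exp(token_scores + b[y_j, y_next])
```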

Gazetteer-enhanced sub-tagger
The most naïve approach would be to treat each gazetteer entry as an additional labeled training sentence, but we found consistently decreased performance in our initial experiments: this introduces a shift in the label distribution, since the number of gazetteer entries is typically large. It therefore seems more natural to utilize gazetteers in a separate module rather than naïvely using them as augmented data. The structure of HSCRFs makes it straightforward to introduce a scoring scheme for candidate spans based on gazetteers. Following the scoring scheme of HSCRFs, we train a span classifier in the form of a sub-tagger and extract token-level features at the same time. Let z = z_1, ..., z_k be an entity in the gazetteer with a corresponding label m. This span-level label can be expanded into token-level labels m_1, ..., m_k. For example, the entity Scottish Labour Party is labeled as B-ORG, I-ORG, L-ORG, and Berlin is labeled as U-LOC under the BILOU scheme. Similar to Equation 2, the scoring function of our sub-tagger is defined as

    φ(m, z) = Σ_{i=1}^{k} w_{m_i}^⊤ v_i,    (3)

where v_i is defined in Section 2.1 and w_{m_i} is the weight parameter vector for token label m_i. We take the sigmoid of φ(m, z) as the probability of category m and minimize the cross-entropy loss to train this sub-tagger.
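The span-to-token label expansion described above is mechanical; a minimal sketch (the function name is ours):

```python
def expand_bilou(tokens, label):
    """Expand a span-level gazetteer label into BILOU token-level labels.

    Single-token spans get U-<label>; longer spans get B-, zero or more I-,
    and a final L- tag.
    """
    if len(tokens) == 1:
        return [f"U-{label}"]
    return ([f"B-{label}"]
            + [f"I-{label}"] * (len(tokens) - 2)
            + [f"L-{label}"])
```

For example, `expand_bilou(["Scottish", "Labour", "Party"], "ORG")` yields the B-ORG, I-ORG, L-ORG sequence from the text.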
The token-level BILOU scores derived from the sub-tagger are larger in scale than the other features, so we rescale them with the tanh activation function: for each token x_i in a segment, a feature vector η_i = ⊕_{m∈M} tanh(w_m^⊤ v_i) is derived, where ⊕ is the concatenation operation and M is the set of all BILOU-scheme token-level labels. The final token-level scoring function for the soft-dictionary-enhanced HSCRF is

    ϕ'_token(l_i, w_i) = b_{l_i}^⊤ μ_i,    (4)

where μ_i = η_i ⊕ v_i and b_{l_i} is the new weight parameter vector for token label l_i.
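As we read it, the feature construction amounts to squashing each token's sub-tagger scores with tanh and concatenating them onto the token representation. A sketch under that assumption (function and variable names are ours):

```python
import numpy as np

def gazetteer_features(V, W):
    """Soft-dictionary features eta_i and enriched vectors mu_i per token.

    V : (n, d)   token representations v_i
    W : (|M|, d) sub-tagger weight vectors w_m, one per BILOU token label
    Returns an (n, |M| + d) matrix whose i-th row is mu_i = eta_i (+) v_i,
    where eta_i = concatenation over m of tanh(w_m . v_i).
    """
    eta = np.tanh(V @ W.T)                   # (n, |M|): rescaled sub-tagger scores
    return np.concatenate([eta, V], axis=1)  # (n, |M| + d)
```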
The HSCRF model and the sub-tagger derived from it are linear in the way they calculate span scores. Unlike other semi-CRF models (Zhuo et al., 2016; Zhai et al., 2017; Sato et al., 2017) which use neural approaches to derive span scores from word-level representations, the HSCRF calculates a span score by summing up word-level scores inside the span along BILOU paths constrained by the tags m_i.
This sub-tagger could be analogously treated as playing the role of soft dictionary look-ups, as opposed to the traditional way that activates a discrete feature only for hard token/span matches.

Gazetteers
We use the gazetteers contained in the publicly available UIUC NER system (Khashabi et al., 2018). The gazetteers were originally collected from the web and Wikipedia, consisting of around 1.5 million entities grouped into 79 fine-grained categories. We trimmed and mapped these groups into CoNLL-formatted NER tags (see Appendix for details) with about 1.3 million entities kept.

Dataset
Evaluation is performed on the CoNLL-2003 English NER shared task dataset (Tjong Kim Sang and De Meulder, 2003) and the OntoNotes 5.0 dataset (Pradhan et al., 2013). We follow the standard train/development/test split described in the original papers along with previous evaluation settings (Chiu and Nichols, 2016).

Training
Due to the space limit, we leave hyperparameter details to the supplementary materials.

Word representation The representation for a word consists of three parts: a pretrained 50-dimensional GloVe word embedding (Pennington et al., 2014), a contextualized ELMo embedding (Peters et al., 2018), and a convolutional character encoder trained from randomly initialized character embeddings, following previous work (Ye and Ling, 2018).
Gazetteer-enhanced sub-tagger We randomly split the gazetteer entities into training (80%) and validation (20%) sets, and sampled 1 million non-entity n-grams (with maximal n of 7) from the CoNLL 2003 training set, excluding named entities, as negative samples (O labels). We applied early stopping on the validation loss when training the sub-tagger.
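The negative sampling described above might be implemented along these lines. This is our own sketch, and the exact exclusion criterion (here, rejecting any n-gram that overlaps an annotated entity span) is an assumption; the function and argument names are ours.

```python
import random

def sample_negative_ngrams(sentences, entity_spans, num_samples, max_n=7, seed=0):
    """Sample n-grams (n <= max_n) that do not overlap any annotated entity.

    sentences    : list of token lists
    entity_spans : list of sets of (start, end) spans, 0-indexed inclusive,
                   aligned with `sentences`
    """
    rng = random.Random(seed)
    candidates = []
    for sent, spans in zip(sentences, entity_spans):
        occupied = {i for (s, e) in spans for i in range(s, e + 1)}
        for n in range(1, max_n + 1):
            for start in range(len(sent) - n + 1):
                window = range(start, start + n)
                if occupied.isdisjoint(window):  # skip anything touching an entity
                    candidates.append(tuple(sent[start:start + n]))
    return rng.sample(candidates, min(num_samples, len(candidates)))
```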

Alternative baselines with gazetteers
Many previous NER systems (Ratinov and Roth, 2009; Passos et al., 2014; Chiu and Nichols, 2016) make use of discrete gazetteer features by directly concatenating them with word-level representations. Apart from simple discrete feature concatenation, we also compare our framework with another baseline that utilizes gazetteer embeddings as an additional feature. We add a single embedding layer for discrete gazetteer features. More specifically, if a text span corresponds to multiple tags in the gazetteer, we sum all the embedded vectors to form the final gazetteer tag representation; if a text span has no corresponding tags in the gazetteer, a zero vector of the same dimension is used instead. The gazetteer tag representation is then concatenated with each word-level representation inside the span.

Table 1 shows the results on the CoNLL 2003 and OntoNotes 5.0 datasets. HSCRFs using the gazetteer-enhanced sub-tagger outperform the baselines, achieving results comparable with those of more complex or larger models on CoNLL 2003 and new state-of-the-art results on OntoNotes 5.0. We also attach some out-of-domain analysis in the Appendix.
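The gazetteer-embedding baseline can be sketched as follows; the names are ours, and the dictionary-lookup interface is an assumed simplification of whatever matching the full system performs.

```python
import numpy as np

def gazetteer_tag_repr(span, gazetteer, tag_emb, dim):
    """Embedding-based gazetteer feature for one text span.

    gazetteer : dict mapping a span (tuple of tokens) to a list of tag ids
    tag_emb   : (num_tags, dim) embedding table for gazetteer tags
    Returns the sum of all matching tag embeddings, or a zero vector of the
    same dimension if the span has no tags in the gazetteer.
    """
    tags = gazetteer.get(tuple(span), [])
    if not tags:
        return np.zeros(dim)
    return tag_emb[tags].sum(axis=0)
```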

[Table 1: Test-set F1 scores (± std) on CoNLL 2003 and OntoNotes 5.0 for our models and previous systems, including Ma and Hovy (2016), Lample et al. (2016), Devlin et al. (2018), and Chiu and Nichols (2016).]

To better attribute the improvements of our model, we split the test sets into four non-overlapping subsets according to whether an entity appears in the training set and in the gazetteer, and collect results for each. We evaluate the performance of our systems on these subsets. Details of the evaluation of each system are shown in Table 2 and Table 3.
We observe that our approach of sub-tagger soft-dictionary matching consistently improves over the baseline approaches on most subsets, while directly concatenating discrete gazetteer features or using gazetteer embeddings sometimes decreases performance. However, the results on CoNLL and OntoNotes reveal slightly different patterns for the feature concatenation baseline and the gazetteer embedding baseline, making it difficult to analyze the underlying reasons. We leave more systematic experimental studies of the baselines to future work.
We also evaluate the gazetteer sub-tagger on the held-out portion of the gazetteer to analyze the potential impact of this module. For prediction, we choose the label with the highest probability; if no label receives a probability greater than 50%, the sample is labeled as a non-entity. The results are reported in Table 4.
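The thresholded prediction rule just described can be sketched as (the function name is ours):

```python
def predict_label(probs, threshold=0.5):
    """Pick the most probable label; fall back to 'O' (non-entity) if no
    label's probability exceeds the threshold."""
    label = max(probs, key=probs.get)
    return label if probs[label] > threshold else "O"
```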
We can see that while the sub-tagger module could help a lot in identifying person names (PER) and organization names (ORG), currently the worst-performing category is the miscellaneous type (MISC), which is possibly a result of the diversity in this category. Improving the prediction of such entities might further provide performance gains for named entity recognition in general.

Discussion
Experimental results demonstrate the usefulness of gazetteer knowledge and show some promising results from our initial attempt to make use of gazetteer information. The sub-tagger has an advantage over hard matching in its capability to recognize entity names that do not appear in the gazetteer but are similar to those contained in it. Table 5 lists some examples that the baselines failed to recognize as complete entity names, while the sub-tagger-enhanced system succeeded. We checked a few cases for which only the sub-tagger-enhanced model made correct predictions, and found terms with similar patterns in the gazetteer but not in the training data, as shown in Table 6. The gazetteer possesses an abundance of similar terms, which enables generalization to out-of-gazetteer items.
In summary, we show that gazetteer-enhanced modules can be useful for neural NER models. Future directions include trying similarly enhanced modules on other types of segmental models (Kong et al., 2016; Liu et al., 2016; Zhuo et al., 2016; Zhai et al., 2017; Sato et al., 2017), along with richer representations for further gains. We would also like to explore the possibility of using domain-specific gazetteers or dictionaries to boost NER performance in various domains beyond the standard corpora.