Soft Gazetteers for Low-Resource Named Entity Recognition

Traditional named entity recognition models use gazetteers (lists of entities) as features to improve performance. Although modern neural network models do not require such hand-crafted features for strong performance, recent work has demonstrated their utility for named entity recognition on English data. However, designing such features for low-resource languages is challenging, because exhaustive entity gazetteers do not exist in these languages. To address this problem, we propose a method of “soft gazetteers” that incorporates ubiquitously available information from English knowledge bases, such as Wikipedia, into neural named entity recognition models through cross-lingual entity linking. Our experiments on four low-resource languages show an average improvement of 4 points in F1 score.


Introduction
Before the widespread adoption of neural networks for natural language processing tasks, named entity recognition (NER) systems used linguistic features based on lexical and syntactic knowledge to improve performance (Ratinov and Roth, 2009). With the introduction of the neural LSTM-CRF model (Huang et al., 2015; Lample et al., 2016), the need to develop hand-crafted features to train strong NER models diminished. However, Wu et al. (2018) have recently demonstrated that integrating linguistic features based on part-of-speech tags, word shapes, and manually created lists of entities called gazetteers into neural models leads to better NER on English data. Of particular interest to this paper are the gazetteer-based features: binary-valued features determined by whether or not an entity is present in the gazetteer. Code and data are available at https://github.com/neulab/soft-gazetteers.
Although neural NER models have been applied to low-resource settings (Cotterell and Duh, 2017; Huang et al., 2019), directly integrating gazetteer features into these models is difficult because gazetteers in these languages are either limited in coverage or completely absent. Expanding them is time-consuming and expensive, due to the lack of available annotators for low-resource languages (Strassel and Tracey, 2016).
As an alternative, we introduce "soft gazetteers", a method to create continuous-valued gazetteer features based on readily available data from high-resource languages and large English knowledge bases (e.g., Wikipedia). More specifically, we use entity linking methods to extract information from these resources and integrate it into the commonly used CNN-LSTM-CRF NER model (Ma and Hovy, 2016) using a carefully designed feature set. We use entity linking methods designed for low-resource languages, which require far fewer resources than traditional gazetteer features (Upadhyay et al., 2018; Zhou et al., 2020).
Our experiments demonstrate the effectiveness of our proposed soft gazetteer features, with an average improvement of 4 F1 points over the baseline, across four low-resource languages: Kinyarwanda, Oromo, Sinhala, and Tigrinya.

Background
Named Entity Recognition NER identifies named entity spans in an input sentence and classifies them into predefined types (e.g., location, person, organization). A commonly used method for doing so is the BIO tagging scheme, representing the Beginning, the Inside, and the Outside of a text segment (Ratinov and Roth, 2009). The first word of a named entity is tagged with "B-", subsequent words in the entity are "I-", and non-entity words are "O".

Binary Gazetteer Features Gazetteers are lists of named entities collected from various sources (e.g., nation-wide census records, GeoNames). They have been used to create features for NER models, typically binary features indicating whether the corresponding n-gram is present in the gazetteer.
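As a toy illustration of these two ideas, BIO tagging and a binary gazetteer lookup might be sketched as follows (the function names and the tiny gazetteer are invented for this sketch, not taken from the paper's code):

```python
def bio_tags(tokens, entities):
    """Convert (start, end, type) entity spans to BIO tags (end exclusive)."""
    tags = ["O"] * len(tokens)
    for start, end, etype in entities:
        tags[start] = "B-" + etype          # first word of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + etype          # subsequent words of the entity
    return tags

def binary_gazetteer_feature(ngram, gazetteer):
    """1.0 if the n-gram appears in the gazetteer, else 0.0."""
    return 1.0 if " ".join(ngram) in gazetteer else 0.0

tokens = ["John", "lives", "in", "New", "York"]
tags = bio_tags(tokens, [(0, 1, "PER"), (3, 5, "LOC")])
# tags == ["B-PER", "O", "O", "B-LOC", "I-LOC"]

gazetteer = {"New York", "John"}
feat = binary_gazetteer_feature(["New", "York"], gazetteer)  # 1.0
```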
Entity Linking Entity linking (EL) is the task of associating a named entity mention with its corresponding entry in a structured knowledge base (KB) (Hachey et al., 2013). For example, linking the entity mention "Mars" with its Wikipedia entry.
In most entity linking systems (Hachey et al., 2013;Sil et al., 2018), the first step is shortlisting candidate KB entries, which are further processed by an entity disambiguation algorithm. Candidate retrieval methods, in general, also score each candidate with respect to the input mention.

Soft Gazetteer Features
As briefly alluded to in the introduction, creating binary gazetteer features is challenging for low-resource languages. The soft gazetteer features we propose instead take advantage of existing limited gazetteers and English knowledge bases using low-resource EL methods. In contrast to typical binary gazetteer features, the soft gazetteer feature values are continuous, lying between 0 and 1.
Given an input sentence, we calculate the soft gazetteer features for each span of n words, s = w_i, ..., w_{i+n-1}, and then apply the features to each word in the span. We assume that we have an EL candidate retrieval method that returns candidate KB entries C = (c_1, c_2, ...) for the input span, where c_1 is the highest scoring candidate.
As a concrete example, consider a feature that represents the score of the top-1 candidate. Figure 1 shows an example of calculating this feature on a sentence in Kinyarwanda, one of the languages used in our experiments. The feature vector f has an element corresponding to each named entity type in the KB (e.g., LOC, PER, and ORG).
For this feature, the element corresponding to the entity type of the highest scoring candidate c_1 is updated with the score of the candidate. That is, for each entity type t, the element for "B-t" is set to score(c_1) if type(c_1) = t, and to 0 otherwise. This feature vector is applied to each word in the span, considering the position of the specific word in the span according to the BIO scheme: we use the "B-" vector elements for the first word in the span and the "I-" elements otherwise.
For a word w_i, we combine features from different spans by performing an element-wise addition over the vectors of all spans of length n that contain w_i. The cumulative vector is then normalized by the number of spans of length n that contain w_i, so that all values lie between 0 and 1. Finally, we concatenate the normalized vectors for each span length n from 1 to N (N = 3 in this paper).
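This span-to-word aggregation can be sketched in NumPy. The sketch assumes a hypothetical `span_feature` callable that returns the per-span feature vector, and it simplifies the paper's scheme by ignoring the B-/I- position distinction within a span:

```python
import numpy as np

def word_features(tokens, span_feature, dim, N=3):
    """For each word, average the feature vectors of all spans of each
    length n (1..N) that contain it, then concatenate over n."""
    T = len(tokens)
    out = np.zeros((T, N * dim))
    for n in range(1, N + 1):
        sums = np.zeros((T, dim))
        counts = np.zeros((T, 1))
        for i in range(T - n + 1):
            vec = span_feature(tokens[i:i + n])  # e.g., top-1 score features
            for j in range(i, i + n):            # apply to each word in span
                sums[j] += vec
                counts[j] += 1
        # normalize by the number of length-n spans containing each word
        out[:, (n - 1) * dim : n * dim] = sums / np.maximum(counts, 1)
    return out
```

Because each element is an average of per-span values in [0, 1], the resulting word-level features also stay in [0, 1].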
We experiment with different ways in which the candidate list can be used to produce feature vectors. The complete feature set is:

1. top-1 score: This feature takes the score of the highest scoring candidate c_1 into account. The element for entity type t is score(c_1) · 1_{type(c_1)=t}, where 1_{type(c)=t} is an indicator function that returns 1.0 when the candidate type is the same as the feature element being updated, and 0.0 otherwise.

2. top-3 score: Like the top-1 feature, but we additionally create feature vectors for the second and third highest scoring candidates.

3. top-30 count: This feature computes type-wise counts for the top-30 candidates.

4. margin: The margin between the scores of consecutive candidates within the top-4. These features are not computed type-wise. For example, the feature value for the margin between the top-2 candidates is score(c_1) − score(c_2).

We experiment with different combinations of these features by concatenating their respective vectors. The concatenated vector is passed through a fully connected neural network layer with a tanh non-linearity and then used in the NER model.
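Under simple assumptions (candidate lists represented as sorted (score, KB type) pairs; the helper names below are invented for illustration), the top-1 and margin features might look like:

```python
def top1_feature(candidates, types=("LOC", "PER", "ORG")):
    """Type-wise vector holding the top candidate's score in the element
    matching its KB type (indicator behavior), zeros elsewhere."""
    vec = {t: 0.0 for t in types}
    if candidates:
        score, ctype = candidates[0]   # highest scoring candidate c_1
        vec[ctype] = score
    return [vec[t] for t in types]

def margin_features(candidates, k=4):
    """Score margins between consecutive candidates in the top-k
    (not computed type-wise)."""
    scores = [s for s, _ in candidates[:k]]
    return [scores[i] - scores[i + 1] for i in range(len(scores) - 1)]

cands = [(0.9, "LOC"), (0.6, "LOC"), (0.2, "PER")]  # (score, KB type)
top1 = top1_feature(cands)        # [0.9, 0.0, 0.0]
margins = margin_features(cands)  # approximately [0.3, 0.4]
```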

Named Entity Recognition Model
As our base model, we use the neural CRF model of Ma and Hovy (2016). We adopt the method from Wu et al. (2018) to incorporate linguistic features, which uses an autoencoder loss to help retain information from the hand-crafted features throughout the model (shown in Figure 2). We briefly discuss the model in this section, but encourage readers to refer to the original papers for a more detailed description.
NER objective Given an input sequence, we first calculate a vector representation for each word by concatenating the character representation from a CNN, the word embedding, and the soft gazetteer features. The word representations are then used as input to a bidirectional LSTM (BiLSTM), followed by a conditional random field (CRF) that predicts a sequence of NER labels. The training objective, L_CRF, is the negative log-likelihood of the gold label sequence.
Autoencoder objective Wu et al. (2018) demonstrate that adding an autoencoder to reconstruct the hand-crafted features leads to improvement in NER performance. The autoencoder takes the hidden states of the BiLSTM as input to a fully connected layer with a sigmoid activation function and reconstructs the features. This forces the BiLSTM to retain information from the features. The cross-entropy loss of the soft gazetteer feature reconstruction is the autoencoder objective, L_AE.
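A minimal NumPy sketch of this reconstruction loss, assuming illustrative shapes (the paper's actual implementation is in DyNet, and the names here are not from its code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def autoencoder_loss(hidden, W, b, gaz_features, eps=1e-8):
    """Reconstruct the soft gazetteer features from the BiLSTM hidden
    states and score the reconstruction with cross-entropy.
    hidden: (T, h); W: (h, d); b: (d,); gaz_features: (T, d) in [0, 1]."""
    recon = sigmoid(hidden @ W + b)  # fully connected layer + sigmoid
    # element-wise cross-entropy against the continuous feature targets
    ce = -(gaz_features * np.log(recon + eps)
           + (1 - gaz_features) * np.log(1 - recon + eps))
    return ce.sum()
```

Because the targets are continuous values in [0, 1] rather than hard labels, the cross-entropy here acts as a reconstruction penalty that pushes the BiLSTM states to encode the gazetteer information.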

Training and inference
The training objective is the joint loss: L CRF + L AE . The losses are given equal weight, as recommended in Wu et al. (2018). During inference, we use Viterbi decoding to obtain the most likely label sequence.

Experiments
In this section, we discuss our experiments on four low-resource languages and attempt to answer the following research questions: 1) "Although gazetteer-based features have been proven useful for neural NER on English, is the same true in the low-resource setting?" 2) "Do the proposed soft-gazetteer features outperform the baseline?" 3) "What types of entity mentions benefit from soft gazetteers?" and 4) "Does the knowledge base coverage affect performance?".

Experimental Setup

NER Dataset
We experiment on four low-resource languages: Kinyarwanda (kin), Oromo (orm), Sinhala (sin), and Tigrinya (tir). We use the LORELEI dataset (Strassel and Tracey, 2016), which has text from various domains, including news and social media, annotated for the NER task. Table 1 shows the number of annotated sentences. The data is annotated with four named entity types: locations (LOC), persons (PER), organizations (ORG), and geopolitical entities (GPE). Following the CoNLL-2003 annotation standard, we merge the LOC and GPE types (Tjong Kim Sang and De Meulder, 2003). Note that these datasets are very low-resource, merely 4% to 13% of the size of the CoNLL-2003 English dataset.
These sentences are also annotated with entity links to a knowledge base of 11 million entries, which we use only to aid our analysis. Of particular interest are "NIL" entity mentions that do not have a corresponding entry in the knowledge base (Blissett and Ji, 2019). The fraction of mentions that are NIL is shown in Table 1.
Gazetteer Data We also compare our method with binary gazetteer features, using entity lists from Wikipedia, the sizes of which are in Table 1.
Implementation Our model is implemented using the DyNet toolkit (Neubig et al., 2017), and we use the same hyperparameters as Ma and Hovy (2016). We use randomly initialized word embeddings since we do not have pretrained vectors for low-resource languages.

Evaluation We perform 10-fold cross-validation for all experiments because of the small size of our datasets. Our primary evaluation metric is span-level named entity F1 score.
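Span-level F1 counts a prediction as correct only on an exact match of both span boundaries and entity type. A minimal sketch (our own illustration, not the paper's evaluation script):

```python
def bio_to_spans(tags):
    """Extract (type, start, end) spans from a BIO tag sequence
    (end exclusive; stray "I-" tags without a "B-" are ignored)."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # "O" sentinel flushes last span
        inside = tag.startswith("I-") and tag[2:] == etype and start is not None
        if not inside:
            if start is not None:
                spans.append((etype, start, i))
            if tag.startswith("B-"):
                start, etype = i, tag[2:]
            else:
                start, etype = None, None
    return spans

def span_f1(gold_tags, pred_tags):
    """Span-level F1: credit only for exact span-and-type matches."""
    gold = set(bio_to_spans(gold_tags))
    pred = set(bio_to_spans(pred_tags))
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```

For example, predicting "B-LOC O" where the gold tags are "B-LOC I-LOC" earns no credit for that entity, since the span boundaries differ.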

Methods
Baselines We compare with two baselines:

• NOFEAT: The CNN-LSTM-CRF model (section 4) without any features.
• BINARYGAZ: We use Wikipedia entity lists (Table 1) to create binary gazetteer features.
Soft gazetteer methods We experiment with different candidate retrieval methods designed for low-resource languages. These are trained only with small bilingual lexicons from Wikipedia, of similar size to the gazetteers (Table 1).
• WIKIMEN: The WikiMention method is used in several state-of-the-art EL systems (Sil et al., 2018; Upadhyay et al., 2018), where bilingual Wikipedia links are used to retrieve the appropriate English KB candidates.

• Pivot-based entity linking (PBEL; Zhou et al., 2020): This method encodes entity mentions on the character level using n-gram neural embeddings (Wieting et al., 2016) and computes their similarity with KB entries. We experiment with two variants and follow Zhou et al. (2020) for hyperparameter selection:

1) PBELSUPERVISED: trained on the small number of bilingual Wikipedia links available in the target low-resource language.

2) PBELZERO: trained on some high-resource language (the "pivot") and transferred to the target language in a zero-shot manner. The transfer languages we use are Swahili for Kinyarwanda, Indonesian for Oromo, Hindi for Sinhala, and Amharic for Tigrinya.

A note on efficiency: our method involves computing entity linking candidates for each n-gram span in the dataset. The most computationally intensive candidate retrieval method (PBEL, discussed in subsection 5.2) takes ≈1.5 hours to process all spans on a single 1080Ti GPU. Note that this is a preprocessing step; once completed, it does not add any extra computational cost to NER training.
Oracles As an upper bound on accuracy, we compare to two artificially strong systems:

• ORACLEEL: For soft gazetteers, we assume perfect candidate retrieval that always returns the correct KB entry as the top candidate if the mention is non-NIL.
• ORACLEGAZ: We artificially inflate BINARYGAZ by augmenting the gazetteer with all the named entities in our dataset.

Results and Analysis
Results are shown in Table 2 (Rijhwani et al., 2019). Finally, we find that both ORACLEGAZ and ORACLEEL improve by a large margin over all non-oracle methods, indicating that there is substantial headroom to improve low-resource NER through either the development of gazetteer resources or the creation of more sophisticated EL methods.
How do soft gazetteers help? We look at two types of named entity mentions in our dataset that we expect to benefit from the soft gazetteer features: 1) non-NIL mentions with entity links in the KB that can use EL candidate information, and 2) mentions unseen in the training data that have additional information from the features as compared to the baseline. Table 3 shows that the soft gazetteer features increase the recall for both types of mentions by several points.

Table 3 also indicates that the soft gazetteer features benefit those entity mentions that are present in the KB. However, our dataset has a significant number of NIL-clustered mentions (Table 1). The ability of our features to add information to NIL mentions is diminished because they do not have a correct candidate in the KB. To measure the effect of KB coverage, we augment the soft gazetteer features with ORACLEGAZ features, applied only to the NIL mentions. The large F1 increases in Table 4 indicate that higher KB coverage will likely make the soft gazetteer features more useful, and they stress the importance of developing KBs that cover all entities in the document.

Conclusion
We present a method to create features for low-resource NER and show its effectiveness on four low-resource languages. Possible future directions include more sophisticated feature design and combinations of candidate retrieval methods.