Named entity recognition with document-specific KB tag gazetteers

We consider a novel setting for Named Entity Recognition (NER) where we have access to document-specific knowledge base tags. These tags consist of a canonical name from a knowledge base (KB) and an entity type, but are not aligned to the text. We explore how to use KB tags to create document-specific gazetteers at inference time to improve NER. We find that this kind of supervision helps recognise organisations more than standard wide-coverage gazetteers. Moreover, augmenting document-specific gazetteers with KB information lets users specify fewer tags for the same performance, reducing cost.


Introduction
NER is the task of identifying names in text and assigning them a type (e.g. person, location, organisation, miscellaneous). State-of-the-art supervised approaches use models that incorporate a name's form, its linguistic context and its compatibility with known names. These models rely on large manually-annotated corpora, specifying name spans and types. These are vital for training models, but it is laborious and expensive to label every occurrence of a name in a document.
We consider a non-standard setting where, for each document, we have metadata in the form of document-specific knowledge base tags. A KB tag is a canonical name, that is an identifier in a KB (e.g. a Wikipedia title), and an entity type. While these tags have a correct type assigned for at least one context, they are not aligned to phrases in the text, and may not share the same form as all of their mentions (e.g. we may see the tag United Nations for the mention UN). We also assume that each tag matches at least one mention in the document, but do not specify where in the document the mention is.
There are many sources of KB tags, such as manual entity indexing for news stories or data extracted from personalised knowledge stores. For example, the New York Times Annotated Corpus (Sandhaus, 2008) contains more than 1.5M articles "manually tagged by library scientists with tags drawn from a normalized indexing vocabulary of people, organizations, locations and topic descriptors". Names and types are also present in large quantities of financial news stories from Bloomberg (Bradesko et al., 2015), in the form of linked names of companies and people.
Document-level tags may be quicker for annotators to apply than the usual method of marking spans in text, and are thus a cheap form of supervision. It is hard to make strong comparisons to the standard NER task, as KB tags can be considered partial, unaligned gold-standard supervision, so fully supervised models should perform better; the question is by how much and why.
This paper explores effective ways to use KB tags for improving NER. We use the CoNLL 2003 English NER dataset (Tjong Kim Sang and De Meulder, 2003), annotated with Wikipedia links (Hoffart et al., 2011). This allows us to simulate a set of KB tags for each document in the TRAIN, TESTA and TESTB splits of the dataset. We use a document's KB tags to build document-specific gazetteers, which we use in addition to standard features for a conditional random field (CRF) model (Lafferty et al., 2001).
We compare against wide-coverage gazetteers, which score 89.85% F-score on TESTA. Assuming access to all possible KB tags, the upper bound for KB tag models is substantially better at 92.85% F-score. KB tags help NER accuracy across all entity types, but provide relatively better supervision for organisation entities than wide-coverage gazetteers. The benefit of KB tags comes from their type information, which is required for good performance. We also examine how performance degrades as we use fewer KB tags, simulating the use-case where a busy knowledge worker spends less time annotating. We find that KB augmentation means we require fewer tags to reach the same performance, which reduces the cost of obtaining KB tags. We show how KB tags can be exploited as a useful complement to traditional NER supervision.

Background
Gazetteers have long been used to augment statistical NER models, adding general evidence of tokens used in names (Nadeau and Sekine, 2007). These are usually drawn from wide-coverage sources like Wikipedia and census lists (Ratinov and Roth, 2009) and can be incorporated into sequence models by designing binary features that indicate whether a token appears in a gazetteer entry. Features can be refined by specifying which part of an entry a token matches using tag encoding schemes such as IOB (Kazama and Torisawa, 2007). Using multiple gazetteers allows feature weights to capture different name types and sources. Given their purpose to increase coverage beyond names included in training data, gazetteers are usually large, general and static, remaining the same during training and prediction time.
Beyond their use as sources for gazetteers, the link structure in and around KBs has been used to create training data. A prominent technique is to follow links back from KB articles to documents that mention the subject of the article, heuristically labelling high-precision matches to create training data. This has been used for genetic KBs (Morgan et al., 2003; Vlachos and Gasperin, 2006), and Wikipedia (Kazama and Torisawa, 2007; Richman and Schone, 2008; Nothman et al., 2013). These works do not consider our setting where gold-standard entities are given at inference time, as their goal is to generate training data.
KBs have also been used to help other natural language processing tasks such as coreference resolution (Rahman and Ng, 2011), topic modelling (Kataria et al., 2011) and named entity linking (Cucerzan, 2007; Ratinov et al., 2011). Finally, it may be that supervised data is only available in some circumstances, for example in the case of personalising NER models. Jung et al. (2015) query a user's smartphone data services to create user-specific gazetteers of personal information. The background NER model is initially trained without access to the user-specific information and later adapted on the user's smartphone.

Document-level KB tags
We incorporate information from KB tags by building document-specific gazetteers. Figure 1 shows an example with a document in which the names need to be recognised and typed (in square brackets). We are also given a list of KB tags, each of which is a canonical name and a type. These are linked to a KB, which we use to extract aliases: in our case each canonical name is a Wikipedia article, and redirects to that article are considered aliases. Our goal is that knowing that a document mentions the entity West Indies cricket team can help us identify Windies or Calypso Cavaliers.
To create gazetteers from a document's KB tags, we preprocess the canonical name from each KB tag, tokenising by underscore, lowercasing and removing parenthesised suffixes (e.g. Chris Lewis (cricketer) becomes chris lewis). We use an encoding scheme to incorporate the type information from the KB tag. Inspired by Kazama and Torisawa (2007), who applied IOB encoding to gazetteers, we apply BMEOW (a.k.a. BILOU) encoding, a scheme that also distinguishes between beginning, middle, end, outside and single-word positions (see footnote 1). For example, this allows us to map chris lewis to B-PER E-PER, and we can aggregate gazetteers of tokens for each encoded type, such that the gazetteer for B-PER contains chris.
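The gazetteer construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`normalise`, `bmeow_positions`, `build_gazetteers`) and the example tags are hypothetical, and KB tags are assumed to arrive as (canonical name, type) pairs.

```python
import re
from collections import defaultdict


def normalise(canonical_name):
    """Tokenise by underscore, lowercase and drop a parenthesised suffix."""
    name = canonical_name.replace(" ", "_")
    # Remove a trailing "(...)" disambiguator such as "_(cricketer)".
    name = re.sub(r"_?\([^)]*\)$", "", name)
    return [tok.lower() for tok in name.split("_") if tok]


def bmeow_positions(n_tokens):
    """Position labels for an entry; O is never needed inside an entry."""
    if n_tokens == 1:
        return ["W"]
    return ["B"] + ["M"] * (n_tokens - 2) + ["E"]


def build_gazetteers(kb_tags):
    """Map each encoded type (e.g. 'B-PER') to the set of tokens seen there."""
    gazetteers = defaultdict(set)
    for canonical_name, entity_type in kb_tags:
        tokens = normalise(canonical_name)
        for position, token in zip(bmeow_positions(len(tokens)), tokens):
            gazetteers[f"{position}-{entity_type}"].add(token)
    return dict(gazetteers)


# Hypothetical KB tags for one document.
example_tags = [
    ("Chris_Lewis_(cricketer)", "PER"),
    ("West_Indies_cricket_team", "ORG"),
    ("England", "LOC"),
]
example_gazetteers = build_gazetteers(example_tags)
```

With these example tags, the gazetteer for B-PER contains chris and the one for E-PER contains lewis, matching the mapping of chris lewis to B-PER E-PER in the text.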
Our CRF gazetteer features are calculated from an input token from the text that we wish to label. Having created a document's KB tag gazetteers, we can define binary features that are active if an input token matches (case-insensitively) with a particular gazetteer. This models both the part of the KB tag name that the token matched, and its type. The input token Chris thus activates the feature f_B-PER, and the token cricket would activate f_I-MISC and f_I-ORG, as it matches inside entries of the two types.
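As a rough sketch of these binary features, the function below returns the names of the gazetteer features that fire for a single token. The gazetteer contents, the `gaz=` feature prefix and the function name are assumptions made for illustration; the label names follow the example in the paragraph above.

```python
def gazetteer_features(token, gazetteers):
    """Return the active gazetteer feature names for one input token.

    `gazetteers` maps an encoded type (e.g. 'B-PER') to a set of
    lowercased tokens. Matching is case-insensitive.
    """
    lowered = token.lower()
    return sorted(
        f"gaz={label}"
        for label, entries in gazetteers.items()
        if lowered in entries
    )


# Hypothetical document gazetteers, keyed by encoded type.
doc_gazetteers = {
    "B-PER": {"chris"},
    "E-PER": {"lewis"},
    "I-ORG": {"indies", "cricket"},
    "I-MISC": {"cricket"},
}

chris_features = gazetteer_features("Chris", doc_gazetteers)
cricket_features = gazetteer_features("cricket", doc_gazetteers)
other_features = gazetteer_features("played", doc_gazetteers)
```

Each returned name would be added to the token's feature set alongside the standard CRF features; a token that matches no gazetteer simply contributes no gazetteer features.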

Methodology
We define several configurations to investigate KB tags. The first four baselines either do not use KB tags, or do not integrate them into the CRF. The second four configurations use KB tag features in the CRF model.

Baselines
KB tag matching (MATCH) We find the longest full match from the document gazetteer and apply the known type. This will not match partial or non-canonical names, but should be high-precision. This is similar to the CoNLL 2003 baseline system (Tjong Kim Sang and De Meulder, 2003).
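One way to realise longest full matching is a greedy left-to-right scan, sketched below. The representation of the document gazetteer as a dictionary from lowercased token tuples to types, the case-insensitive comparison and the function name are assumptions for illustration only.

```python
def longest_match_tagger(tokens, entries):
    """Greedy longest-match tagging against full gazetteer entries.

    `entries` maps a tuple of lowercased tokens to an entity type.
    Returns (start, end, type) spans with `end` exclusive.
    """
    max_len = max((len(key) for key in entries), default=0)
    spans = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate first, then shrink.
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tok.lower() for tok in tokens[i:i + length])
            if key in entries:
                match = (i, i + length, entries[key])
                break
        if match:
            spans.append(match)
            i = match[1]  # continue after the matched span
        else:
            i += 1
    return spans


# Hypothetical document gazetteer entries.
doc_entries = {
    ("chris", "lewis"): "PER",
    ("west", "indies"): "LOC",
    ("west", "indies", "cricket", "team"): "ORG",
}
sentence = "Chris Lewis played for the West Indies cricket team".split()
matched = longest_match_tagger(sentence, doc_entries)
# matched == [(0, 2, "PER"), (5, 9, "ORG")]
```

Because the longest candidate is tried first, West Indies cricket team is labelled as one ORG span rather than as the shorter LOC entry.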

Baseline (CRF) We train a CRF model using CRFsuite (Okazaki, 2007) with a standard set of features that encode lexical context and token shape, but no external knowledge features such as gazetteers. All following configurations build on the CRF with standard features.
KB tag repair (CRF+REPAIR) We label the text using the baseline CRF, then find the longest full match from the document gazetteer and assign the known type. When a gazetteer match overlaps with a CRF match, we prefer the gazetteer and remove the latter. Although we do not consider partial matches, this may recognise longer names that can be difficult for CRF models.
Wide-coverage gazetteers (CRF+WIDE) This uses gazetteers distributed with the Illinois NER system (Ratinov and Roth, 2009). We encode each phrase using the BMEOW scheme described above, and use the filename of each gazetteer as its type. There are 33 gazetteers drawn from many sources with approximately 2 million entries.
Footnote 1: We omit the O tag as all gazetteer tokens are inside.
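The overlap rule used by the KB tag repair configuration (prefer the gazetteer match and remove any overlapping CRF match) can be sketched as below. Spans are assumed to be (start, end, type) tuples with an exclusive end; the function name and example spans are hypothetical.

```python
def repair(crf_spans, gazetteer_spans):
    """Merge CRF spans with gazetteer spans, preferring the gazetteer.

    Any CRF span that overlaps a gazetteer span is dropped; all
    gazetteer spans are kept. Spans are (start, end, type), end exclusive.
    """
    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]

    kept = [
        span for span in crf_spans
        if not any(overlaps(span, gaz) for gaz in gazetteer_spans)
    ]
    return sorted(kept + list(gazetteer_spans))


# The CRF found "West Indies" as LOC; the gazetteer matched the longer ORG name.
crf_output = [(0, 2, "PER"), (5, 7, "LOC")]
gazetteer_output = [(5, 9, "ORG")]
repaired = repair(crf_output, gazetteer_output)
# repaired == [(0, 2, "PER"), (5, 9, "ORG")]
```

CRF spans that do not touch any gazetteer match pass through unchanged, so the repair can only alter predictions where a KB tag name appears in full.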

Using KB tags as CRF features
KB tag names (CRF+NAME) We generate document-specific gazetteer features, but use the same type for each entry.
KB tag names and types (CRF+NAME+TYPE) This is equivalent to CRF+NAME, but includes known types. Since type varies with context, this may not be correct, but is hopefully informative.
KB tag names, types and KB aliases (CRF+NAME+TYPE+AKA) This builds on the above, but uses the KB to augment the document-specific gazetteer with known aliases of the KB tags, for example adding UN for United Nations with the known type.
KB tag names, types, KB aliases and large gazetteers (CRF+NAME+TYPE+AKA+WIDE) This combines all KB tag features with the wide-coverage gazetteers.
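Alias augmentation can be pictured as expanding each KB tag into several named entries that share the tag's type. The sketch below assumes the aliases have already been fetched into a plain dictionary (`alias_lookup`, a hypothetical stand-in for the KB) and does not show how redirects are retrieved.

```python
def augment_with_aliases(kb_tags, alias_lookup):
    """Expand (canonical name, type) tags with their known aliases.

    `alias_lookup` maps a canonical name to a list of alias strings.
    Each alias inherits the type of the tag it came from.
    """
    augmented = []
    seen = set()
    for canonical_name, entity_type in kb_tags:
        names = [canonical_name] + list(alias_lookup.get(canonical_name, []))
        for name in names:
            if (name, entity_type) not in seen:
                seen.add((name, entity_type))
                augmented.append((name, entity_type))
    return augmented


# Hypothetical aliases, standing in for redirects to each KB article.
alias_lookup = {
    "United_Nations": ["UN", "U.N."],
    "West_Indies_cricket_team": ["Windies", "Calypso_Cavaliers"],
}
tags = [("United_Nations", "ORG"), ("Chris_Lewis_(cricketer)", "PER")]
augmented_tags = augment_with_aliases(tags, alias_lookup)
```

The augmented list can then be fed through the same gazetteer construction as the original tags, so UN contributes to the single-word ORG gazetteer.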
We fetch and cache KB information using a Wikipedia API client. We assume the tag set of person (PER), organisation (ORG), location (LOC) and miscellaneous (MISC), and report precision, recall and F-score from the conlleval evaluation script. The median proportion of mentions in a document that are linked to the KB is 81% in TRAIN and TESTB, and 85% in TESTA. Augmenting the gazetteer with aliases produces, on average, 26 times as many gazetteer entries as KB tags alone in TESTA, and 23 times as many in TESTB.

Results
Table 1 shows the performance of different configurations; we focus first on TESTA overall F-scores. Matching against KB tag names results in high precision but low recall, with an F-score of 55.35%, far worse than the baseline CRF at 87.68%. Despite its naïve assumptions, repairing the CRF tags using longest matches in the document gazetteer performs surprisingly well at 89.76%, just lower than using wide-coverage gazetteers, with an F-score of 89.85%.
Table 1: Results for CoNLL 2003 TESTA and TESTB. We report P/R/F for all tags and per-type F-scores. Methods starting with "+" build on the standard CRF by repairing or adding features.
Figure 2: How many sentences should an annotator check for KB tags? TESTA results for CRF+NAME+TYPE and CRF+NAME+TYPE+AKA where KB tags are drawn from the first n sentences. This is compared to the F-score for CRF+WIDE and versions of the models with access to all sentences in the document (horizontal, thin lines).
The first setting that uses KB tags as CRF features is CRF+NAME, which includes typeless names and has an F-score of 89.29%. Precision and recall are lower than with wide-coverage gazetteers, suggesting that, without type information, bigger gazetteers are better. Adding type features (CRF+NAME+TYPE) results in better performance than either CRF or CRF+WIDE at 92.7% F-score. Augmenting the document gazetteers with aliases from the KB further improves F-score (92.85%). Adding wide-coverage gazetteers to KB tags slightly decreases F-score to 92.57%. These results indicate that type information is critical and, to confirm this, we ran experiments that used only name and alias information from KB tags. This scores 89.45% F-score on TESTA and 83.62% F-score on TESTB. While aliases help, type information is required to improve performance beyond wide-coverage gazetteers.
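The scores in this section are phrase-level precision, recall and F-score, where a predicted name counts as correct only if both its boundaries and its type match the gold standard exactly. The sketch below illustrates that computation on toy spans; it is not the conlleval script itself, and the function name and example data are hypothetical.

```python
def precision_recall_f1(gold_spans, predicted_spans):
    """Phrase-level P/R/F over (start, end, type) spans, exact match only."""
    gold = set(gold_spans)
    predicted = set(predicted_spans)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = [(0, 2, "PER"), (5, 9, "ORG")]
predicted = [(0, 2, "PER"), (5, 7, "LOC")]
scores = precision_recall_f1(gold, predicted)
# scores == (0.5, 0.5, 0.5): the partial LOC span earns no credit.
```

Under this metric, a system that finds part of a long organisation name with the wrong type is penalised in both precision and recall, which is one reason full-name matches from KB tags can be valuable.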
To give some insight into why KB tag types are effective, consider the name West Indian. This appears 65 times across 11 of the 33 wide-coverage gazetteers [...]
We also examine the per-tag F-scores for TESTA to investigate whether KB tags help some types of entities more than others.
Using CRF+NAME+TYPE+AKA we obtain around 95.5% F-score for PER and LOC entities. As with CRF+WIDE, MISC entities remain hard to tag correctly. However, if we consider the percentage F-score gain by type over the CRF baseline, CRF+WIDE gazetteers improve performance most for PER (+2.94%), then ORG (+2.77%) entities. The top two are reversed for CRF+NAME+TYPE+AKA, with ORG (+6.83%), then PER (+5.48%). This suggests that KB tags are particularly well-suited for helping recognise organisation names.
We see similar trends in TESTB, except that KB tags and CRF+WIDE are complementary. The experiments illustrate that if we are lucky enough to have KB tags, they improve NER. However, the models use all possible KB tags and should be considered an upper bound. To better model busy workers, we restrict the gazetteers to only KB tags from mentions in the first n sentences. This corresponds to asking an annotator to look only at the first n sentences. Figure 2 shows how the KB tag models perform on TESTA as we increase n. To achieve better performance than CRF+WIDE, one should view the first 5 sentences for CRF+NAME+TYPE. Aliases (CRF+NAME+TYPE+AKA) reduce performance slightly when only using a few sentences, but with more than 4 sentences, aliases are consistently useful. This trend is also apparent in TESTB, showing that augmenting tags with KB information improves NER, especially when only a few tags are available.
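Restricting KB tags to the first n sentences can be simulated from linked mentions as sketched below. The input format, a list of (sentence index, canonical name, type) triples with zero-based sentence indices, and the function name are assumptions for illustration.

```python
def tags_from_first_n_sentences(linked_mentions, n):
    """Collect the distinct KB tags whose mentions occur in the first n sentences.

    `linked_mentions` holds (sentence_index, canonical_name, entity_type)
    triples; sentence indices start at 0.
    """
    tags = []
    seen = set()
    for sentence_index, canonical_name, entity_type in linked_mentions:
        tag = (canonical_name, entity_type)
        if sentence_index < n and tag not in seen:
            seen.add(tag)
            tags.append(tag)
    return tags


# Hypothetical linked mentions for one document.
mentions = [
    (0, "West_Indies_cricket_team", "ORG"),
    (0, "Chris_Lewis_(cricketer)", "PER"),
    (3, "England", "LOC"),
    (6, "West_Indies_cricket_team", "ORG"),
    (7, "United_Nations", "ORG"),
]
first_sentence_tags = tags_from_first_n_sentences(mentions, 1)
first_five_tags = tags_from_first_n_sentences(mentions, 5)
all_tags = tags_from_first_n_sentences(mentions, 10)
```

Entities that are first mentioned late in the document, such as United_Nations in this toy example, are missing from the restricted tag set, which is where alias and wide-coverage evidence would have to compensate.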

Discussion and conclusion
There are several avenues to explore further. Asking annotators to specify types is not ideal and it would be better to predict them from the KB. We only use the KB to collect aliases, but we could use it to harvest related entities. Another challenge is appropriately modelling the interaction between a sentence-level task and document-level constraints. A KB tag might match a mention in one sentence and this should influence predictions there. However, its evidence should be less important elsewhere since that constraint has already been satisfied. This would improve robustness, but global constraints are hard to model in sentence-unit CRF models.
This paper presents a novel NER setting in which we have access to some number of KB tags (canonical names and types) at training and inference time. We explore how best to use this information, finding that CRF models can indeed take advantage of this non-standard supervision. Moreover, models benefit from integration with the KB, in our case augmenting document gazetteers to maximise the benefit of KB tags.