Context-Sensitive Recognition for Emerging and Rare Entities

This paper is a shared task system description for the 2017 W-NUT shared task on Rare and Emerging Named Entities. Our paper describes the development and application of a novel algorithm for named entity recognition that relies only on the contexts of word forms. A comparison against the other submitted systems is provided.


Introduction
Named entity recognition (NER) is a common foundational step for many pipelines that rely on natural language processing (NLP). The main goal is the identification of mentions of entities (e.g., persons or locations). As a pre-processing task for unstructured text, NER may, for example, provide index keywords for information retrieval systems (Tjong Kim Sang and De Meulder, 2003), or topic-rich features for machine learning (ML) applications (Kumaran and Allan, 2004; Vavliakis et al., 2013). Effective approaches to NER have long utilized conditional random fields (Lafferty et al., 2001), support vector machines (McCallum and Li, 2003), and perceptrons (Settles, 2004; Ju et al., 2011; Luo et al., 2015). In addition to relying on face-value, gold-standard data, systems may benefit from a variety of other data representations and sources (Strauss et al., 2016), including gazetteers, word classes (e.g., Brown clusters), orthographic features, and grammatical relations between types of words, such as part of speech. Large-scale annotated resources for NER have also been developed in semi-supervised fashions, constructed from online encyclopedias (Nothman et al., 2008, 2012) and refined by crowdsourcing (Bos et al., 2017).
While NER systems have been in development for some time, their applicability to noisy-text domains (i.e., unedited, user-generated content) is somewhat limited. This is a multi-faceted problem (Derczynski et al., 2015), involving grammatical inconsistency and rapidly shifting domains, and it requires specialized algorithms. While progress has been made through annotation and specialized systems development (Ritter et al., 2011), there are still large gains to be made for this domain (Augenstein et al., 2017), which is highlighted well by both the W-NUT shared task this year (Strauss et al., 2016) and that of the previous year.
Adaptation to the task domain's wide range of writing styles and abundant grammatical inconsistencies presents the need for algorithmic flexibility. These properties make precision loss an issue, and the presence of rare and emerging entities makes recall an extreme challenge, too. Our participation in the present shared task relies on a novel approach: utilizing flexible "contexts" as features, derived from token forms alone. We rely upon these features for their capacity to relate to never-before-seen tokens as potential entities, and incorporate them into a statistical model that can handle both gold-standard data and large lexical resources.

Shared Task Data
We began our approach by scoping the composition of the task data set. There were 6 named entity types: corporation, creative work, group, location, person, and product, which were mapped down from the 10 types of the 2016 W-NUT Twitter NER Shared Task. A decomposition of the current shared task data (see Tab. 1) exhibits several important features. The proportion of unique entities among all entity mentions increased from about 80% in the training set to about 90% in the development and test sets. However, the training, development, and test sets all exhibited internal stability in the proportions of unique forms for each type of named entity. In other words, no named entity type dropped out of proportion when considering unique forms. At the same time, the focus on rare entities resulted in large increases in the percentage of the data occupied by the person category. These proportions and the availability of large-scale gazetteer data highlighted this type for the initial focus of our model's development.

Previous Work
Context models are conditional statistical models whose features are derived from the structural patterns surrounding or within written language. We refer to context models that rely on exterior information as external context models, and those that rely on interior information as internal context models. For example, word-level context models applied to the text: "Out to lunch in New York City." might place the entity "New York City" in the external context "Out to lunch in *.", or the internal context "New York *" (in each case reserving * as a wildcard).
Context models trace their roots to Shannon (1948), but have likewise seen recent attention (Piantadosi et al., 2011). They have been applied to both patterns of character appearance and word appearance, with the majority of attention directed towards word patterns and external models. In recent work by Williams et al. (2015a), an internal context model was used to identify missing multiword dictionary entries. We utilize this model here, but apply it at the character level so as to be able to identify both single- and multi-word named entities.

Context-Sensitive NER
We represent a token, $w$, by its sequence of $n$ characters, $w = (l_1, l_2, \cdots, l_n)$, and define its set of $2^{n-1}$ contexts, $C_w$, by the corresponding removal patterns of contiguous subsequences. The context, $c_{i \cdots j} \in C_w$, defined by the removal of characters $i$ through $j$ is:
$$c_{i \cdots j} = (l_1, \cdots, l_{i-1}, *, \cdots, *, l_{j+1}, \cdots, l_n).$$
Despite execution at the sub-word level, this is precisely the same construction as in Williams et al. (2015a), which was used to compute likelihoods of dictionary definition. For a given word, weighting across its contexts is accomplished as in Williams et al. (2015a), induced by a partition process (Williams et al., 2015b). However, instead of dictionary definition, we use the context conditional probabilities to determine the likelihoods of named entity tags. For any word, $w$, and positive tag, $t$ (e.g., B-location, I-person, B-group, etc.), a computed likelihood, $L(t|C_w)$, can be interpreted as "the likelihood of drawing a $t$-tagged word from the contexts of $w$". Note that these likelihoods can be non-zero for words that were not present in training, and are higher for words that are similar to tagged words. For example, if $w_1$ = Larry, $w_2$ = Harry, and only $w_1$ appeared in a gold standard, with tag $t$ = B-person, $L(t|C_{w_2})$ would be elevated.
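The Larry/Harry example can be made concrete with a minimal sketch. The sketch enumerates every contiguous character span and averages per-context tag proportions with uniform weights; both choices are illustrative assumptions that stand in for the partition-based weighting of Williams et al. (2015a, 2015b), which is not reproduced here. The function names `contexts`, `train`, and `likelihood` are hypothetical.

```python
from collections import defaultdict


def contexts(word):
    """Mask every contiguous character span of `word` with '*' wildcards.

    Assumption: one wildcard per removed character, all spans enumerated.
    """
    n = len(word)
    return {
        word[:i] + "*" * (j - i + 1) + word[j + 1:]
        for i in range(n)
        for j in range(i, n)
    }


def train(tagged_words):
    """Count, for each context, how often it was seen with each tag."""
    counts = defaultdict(lambda: defaultdict(float))
    for word, tag in tagged_words:
        for c in contexts(word):
            counts[c][tag] += 1.0
    return counts


def likelihood(tag, word, counts):
    """Uniform average, over the contexts of `word`, of the share of `tag`.

    Assumption: uniform weights replace the weighting used in the paper.
    """
    cs = contexts(word)
    if not cs:
        return 0.0
    total = 0.0
    for c in cs:
        seen = counts.get(c)
        if seen:
            total += seen.get(tag, 0.0) / sum(seen.values())
    return total / len(cs)


# Toy gold standard: only "Larry" is tagged as a person.
model = train([("Larry", "B-person"), ("the", "O")])
harry = likelihood("B-person", "Harry", model)
zebra = likelihood("B-person", "Zebra", model)
```

Under these assumptions, "Harry" shares the five contexts that mask its leading characters ("*arry", "**rry", and so on) with "Larry", so it scores higher than an unrelated five-letter word such as "Zebra", which shares only the fully masked context.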

Entity Recognition
To handle entities composed of multiple words, e.g., $(w_1, w_2, \cdots, w_k)$, we assess a potential entity's membership in a particular type, e.g., "location", via the harmonic mean, $L(t_1, t_2, \cdots, t_k | w_1, w_2, \cdots, w_k)$, of their component-word likelihood values, such that only the first word has the B-version tag ($t_1$) and all others have the I-version. A candidate is accepted if its likelihood mean is above a threshold value, which is determined in optimization (see Sec. 4).
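A minimal sketch of this scoring step follows. The per-word likelihood function is passed in as a parameter, and the names `harmonic_mean`, `score_candidate`, and `accept` are hypothetical; returning zero when any component likelihood is zero is an assumption made here to keep the harmonic mean well defined.

```python
def harmonic_mean(values):
    """Harmonic mean; defined as 0.0 if empty or any value is non-positive."""
    if not values or any(v <= 0.0 for v in values):
        return 0.0
    return len(values) / sum(1.0 / v for v in values)


def score_candidate(words, entity_type, word_likelihood):
    """Score a multi-word candidate with B- on the first word, I- on the rest."""
    tags = ["B-" + entity_type] + ["I-" + entity_type] * (len(words) - 1)
    return harmonic_mean([word_likelihood(t, w) for t, w in zip(tags, words)])


def accept(words, entity_type, word_likelihood, threshold):
    """Accept the candidate if its mean likelihood exceeds the type's threshold."""
    return score_candidate(words, entity_type, word_likelihood) > threshold


# Toy per-word likelihoods (illustrative values, not results from the paper).
toy = {("B-location", "New"): 0.5, ("I-location", "York"): 1.0}
toy_likelihood = lambda tag, word: toy.get((tag, word), 0.0)
score = score_candidate(["New", "York"], "location", toy_likelihood)
```

Because the harmonic mean is dominated by its smallest term, a single implausible component word pulls the whole candidate down, which suits a setting where every word of an entity should look like part of that entity type.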

Conflict Resolution
A given word may fall within multiple predicted entities, both of different types and lengths. To resolve potential conflicts between predicted entities we establish precedence by accepting 1) predictions appearing first, over 2) longer predictions, over 3) predictions of higher likelihood.
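One way to read this precedence is as a sort followed by a greedy pass, sketched below. The greedy, left-to-right acceptance of non-overlapping spans is an assumption, as are the `Prediction` record and the `resolve` function; spans use token offsets with an exclusive end.

```python
from typing import NamedTuple


class Prediction(NamedTuple):
    start: int          # index of the first token
    end: int            # index one past the last token
    entity_type: str
    likelihood: float


def resolve(predictions):
    """Keep non-overlapping predictions by the stated precedence.

    Order: earliest start, then longest span, then highest likelihood.
    """
    ordered = sorted(
        predictions,
        key=lambda p: (p.start, -(p.end - p.start), -p.likelihood),
    )
    accepted = []
    last_end = 0
    for p in ordered:
        # Sorted by start, so overlap can only occur with the last accepted span.
        if p.start >= last_end:
            accepted.append(p)
            last_end = p.end
    return accepted


candidates = [
    Prediction(0, 2, "person", 0.4),
    Prediction(0, 3, "location", 0.3),
    Prediction(2, 4, "group", 0.9),
    Prediction(3, 4, "person", 0.5),
]
kept = resolve(candidates)
```

In this toy input the three-token location span wins over the shorter person span that starts at the same position, even though its likelihood is lower, and the high-likelihood group span is dropped because it overlaps an earlier accepted span.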

Gold-Standard Data
In addition to the gold-standard data provided for the shared task (see Sec. 2.1 and Tab. 1) we utilize 1) all components of the W-NUT 2016 Twitter NER shared task (Strauss et al., 2016), 2) all components of the 2003 CoNLL NER shared task (Tjong Kim Sang and De Meulder, 2003), 3) the WikiNER annotations (Nothman et al., 2008, 2012), and 4) the Groningen Meaning Bank (Bos et al., 2017). Each corpus required mapping its entity types to the six 2017 shared task types, and for data sets (2), (3), and (4), only mappings for the location and person types were deemed appropriate (geo-loc, facility, and loc to location, and per to person). However, for data set (1), additional mappings were accepted from tvshow and movie to creative-work, sportsteam to group, and company to corporation.
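The mappings listed above can be written as two lookup tables, as in the following sketch. Treating unmapped entity labels as O and lower-casing labels before lookup are assumptions made for illustration; the text does not state how unmapped annotations were handled. The names `BASE_MAP`, `WNUT16_EXTRA_MAP`, and `map_tag` are hypothetical.

```python
# Mappings accepted for all external gold standards.
BASE_MAP = {
    "geo-loc": "location",
    "facility": "location",
    "loc": "location",
    "per": "person",
}

# Additional mappings accepted only for the W-NUT 2016 data, data set (1).
WNUT16_EXTRA_MAP = {
    "tvshow": "creative-work",
    "movie": "creative-work",
    "sportsteam": "group",
    "company": "corporation",
}


def map_tag(tag, allow_extra=False):
    """Map a BIO tag from an external corpus to the 2017 task types.

    Assumption: labels without an accepted mapping fall back to "O".
    """
    if tag == "O":
        return "O"
    prefix, _, label = tag.partition("-")
    table = dict(BASE_MAP)
    if allow_extra:
        table.update(WNUT16_EXTRA_MAP)
    mapped = table.get(label.lower())
    return f"{prefix}-{mapped}" if mapped else "O"
```

For example, `map_tag("B-geo-loc")` yields `B-location`, while `map_tag("B-movie")` yields `O` unless `allow_extra=True` is set for the W-NUT 2016 data.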

Supplemental Lexica
To extend model training to as many forms as possible, supplemental lexica were incorporated from the gazetteer materials provided alongside the gold data from the W-NUT 2016 Twitter NER shared task. Only a subset of the gazetteers was incorporated into the final model: automotive.model and business.consumer_product for the product type; firstname.5k, lastname.5000, people.family_name, and people.person.filtered for the person type; and location.country for the location type. Each entry in a given gazetteer was treated as a weighted instance of its named entity type. Weights offset the extreme size of gazetteers in comparison to the gold-standard data, and were determined as follows. For a given entity type, let x be the number of typed named entities in the gold-standard training data, and y be the number of gazetteer entries. The type's gazetteer entries were then incorporated with weight x/y, and all O-tagged tokens were counted with weight 2.
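The weighting arithmetic can be sketched as follows. With weight x/y on each of y gazetteer entries, a type's gazetteer contributes a total mass of x, the same as its gold-standard entities. The function names and the counts in the example are hypothetical and are not figures from the task data.

```python
O_TAG_WEIGHT = 2.0  # weight applied to every O-tagged token


def gazetteer_weight(num_gold_entities, num_gazetteer_entries):
    """Per-entry weight x / y for one entity type."""
    if num_gazetteer_entries <= 0:
        raise ValueError("gazetteer must contain at least one entry")
    return num_gold_entities / num_gazetteer_entries


def weighted_instances(entries, entity_type, num_gold_entities):
    """Turn gazetteer entries into (form, type, weight) training instances."""
    weight = gazetteer_weight(num_gold_entities, len(entries))
    return [(entry, entity_type, weight) for entry in entries]


# Illustrative counts only: 660 gold person entities, 5000 gazetteer names.
names = [f"name{i}" for i in range(5000)]
instances = weighted_instances(names, "person", 660)
total_mass = sum(w for _, _, w in instances)
```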

Optimization
Model development consisted of training on the gold-standard training data (see Sec. 2.1), in addition to the external gold standards (see Sec. 3.1), and the supplemental lexica (see Sec. 3.2). With the trained model, optimization was performed with respect to the development data set, which notably had a disproportionate representation of person entities. We determined thresholds for each of the entity types through separate optimizations. Given the brief timeline, these were conducted adaptively, optimizing thresholds for by-type F1 values, homing in by step sizes of 0.1, 0.01, and finally 0.001. Note that the optimization procedure exhibited no predictive power on the creative work and corporation entity types, leading us to restrain our model from predicting those types. After final threshold parameters were determined, a final combined model (see Sec. 2.2.4) was allowed to train additionally on the development data set before being applied to the final test data set.
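A coarse-to-fine search of this kind might look like the sketch below. The search range of [0, 1], the window of one step on either side of the current best value, and the callable `f1_at` (which would evaluate by-type F1 on the development set at a given threshold) are all assumptions; only the step sizes come from the description above.

```python
def refine_threshold(f1_at, steps=(0.1, 0.01, 0.001)):
    """Coarse-to-fine grid search for the threshold maximizing `f1_at`.

    Each pass scans a grid at the current step size, then narrows the
    window to one step on either side of the best value found.
    """
    lo, hi = 0.0, 1.0
    best = lo
    for step in steps:
        n = int(round((hi - lo) / step))
        # Rounding keeps grid points free of accumulated float error.
        candidates = [round(lo + k * step, 6) for k in range(n + 1)]
        best = max(candidates, key=f1_at)
        lo, hi = max(0.0, best - step), min(1.0, best + step)
    return best


# Stand-in objective with a single peak at 0.437 (not a real F1 curve).
toy_f1 = lambda t: -abs(t - 0.437)
best_threshold = refine_threshold(toy_f1)
```

Each entity type would be optimized separately with its own `f1_at`. A search of this shape evaluates roughly 11 + 21 + 21 thresholds per type rather than the 1001 of a full 0.001 grid, at the cost of possibly missing a peak that lies outside the window chosen by the coarse pass.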

Results
To understand our model's performance in the context of other systems, we provide a fine-grained system evaluation across the entity types (see Tab. 2). This follows the specialized shared-task evaluation method, focusing on precision, recall, and F1 with respect to unique named entity surface forms. On the primary categories in which our model made predictions (location and person), our model's performance was reasonably competitive, with high levels of precision. For the location type, our system outperformed two other models by overall F1, and was in range of the other models with respect to the person type. For all other entity types, our system performed poorly (although no predictions were made for the corporation and creative work categories). Notably, the only categories at which other teams performed consistently well were the person and location categories, with the main observation being low recall, rarely above 20%.
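Scoring over unique surface forms can be illustrated with a simplified sketch that compares sets of (surface form, type) pairs, so that repeated mentions of the same entity count once. This is an assumed reading of the evaluation for illustration and is not the official shared-task scorer; the function name `surface_prf` and the toy entities are hypothetical.

```python
def surface_prf(gold, predicted):
    """Precision, recall, and F1 over unique (surface form, type) pairs."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy annotations: "London" appears twice in gold but counts once.
gold = [
    ("London", "location"),
    ("London", "location"),
    ("Ann", "person"),
    ("Acme", "corporation"),
]
predicted = [("London", "location"), ("Bob", "person")]
p, r, f = surface_prf(gold, predicted)
```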

Discussion
For this shared task we developed and evaluated a novel NER algorithm that relies only on features derived from word forms. Despite having the lowest task evaluation scores, this model exhibited competitive performance at two of the largest categories. These two categories (person and location) had significant external data available (both gold standards and supplemental lexica), and exhibited the most promise during model optimization. The system's ability to perform competitively at these entity types suggests that increased performance at the other types may be possible with the availability of other, category-specific and large-scale external resources. We note that our model's optimization exhibited an extreme lack of predictive power at the corporation and creative work categories, which, in addition to being affected by sparsity, may have also been affected by the lack of acceptable mappings from the external gold-standard resources into these categories. While lexical data were weighted to good effect (increased performance), the coverage of the external gold-standard data only over the person and location entity types may have negatively impacted our system's ability to predict other types. Thus, prediction of these types might be improved by applying a similar weighting scheme to the external gold-standard data. This leaves us with avenues for improvement, along with competitive, task-specific scores at the person and location categories. All of this, while relying on features derived only from word forms, points toward value in the continued development of context-sensitive NER for rare and emerging entities.