On the Strength of Character Language Models for Multilingual Named Entity Recognition

Character-level patterns have been widely used as features in English Named Entity Recognition (NER) systems. However, to date there has been no direct investigation of the inherent differences between name and non-name tokens in text, nor whether this property holds across multiple languages. This paper analyzes the capabilities of corpus-agnostic Character-level Language Models (CLMs) in the binary task of distinguishing name tokens from non-name tokens. We demonstrate that CLMs provide a simple and powerful model for capturing these differences, identifying named entity tokens in a diverse set of languages at close to the performance of full NER systems. Moreover, by adding very simple CLM-based features we can significantly improve the performance of an off-the-shelf NER system for multiple languages.


Introduction
In English, there is strong empirical evidence that the character sequences that make up proper nouns tend to be distinctive. Even divorced from context, a human reader can predict that "hoekstenberger" is an entity, but "abstractually" (not a real name or a real word) is not. Some NER research explores the use of character-level features including capitalization, prefixes and suffixes (Cucerzan and Yarowsky, 1999; Ratinov and Roth, 2009), and character-level models (CLMs) (Klein et al., 2003) to improve the performance of NER, but to date there has been no systematic study isolating the utility of CLMs in capturing distinctions between name and non-name tokens in English or across other languages.
We conduct an experimental assessment of the discriminative power of CLMs for a range of languages: English, Amharic, Arabic, Bengali, Farsi, Hindi, Somali, and Tagalog. These languages use a variety of scripts and orthographic conventions (for example, only three use capitalization), come from different language families, and vary in their morphological complexity. We demonstrate the effectiveness of CLMs in distinguishing name tokens from non-name tokens, as illustrated by Figure 1, which shows perplexity histograms from a CLM trained on entity tokens. Our models use only individual tokens, but perform extremely well in spite of taking no account of word context. We then assess the utility of directly adding simple features based on this CLM implementation to an existing NER system, and show that they have a significant positive impact on performance across many of the languages we tried. By adding very simple CLM-based features to the system, our scores approach those of a state-of-the-art NER system (Lample et al., 2016) across multiple languages, demonstrating both the unique importance and the broad utility of this approach.
The code and resources for this publication can be found at: https://cogcomp.org/page/publication_view/846

Methods

Character Language Models
We propose a very simple model in which we train an entity CLM on a list of entity tokens, and a non-entity CLM on a list of non-entity tokens. Both lists are unordered, with all entries treated independently. Each token is split into characters and treated as a "sentence" where the characters are the "words." For example, "Obama" is an entity token, and is split into "O b a m a". From these examples we learn a score measuring how likely it is that a sequence of characters forms an entity. At test time, we also split each word into characters and determine perplexity using the entity and non-entity CLMs. We assign the label corresponding to the lower perplexity CLM. We experiment with four different kinds of language model: N-gram model, Skip-gram model, Continuous Bag-of-Words model (CBOW), and Log-Bilinear model (LB). We demonstrate that the N-gram model is best suited for this task.
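The procedure above can be illustrated with a minimal sketch. It is not the implementation used in our experiments: it swaps in a small add-k smoothed character n-gram model written in pure Python, and the class name `CharNgramLM`, the function `classify`, the order of 3, the smoothing constant, and the toy token lists are all assumptions made for illustration.

```python
import math
from collections import Counter


class CharNgramLM:
    """Add-k smoothed character n-gram model (illustrative stand-in only)."""

    def __init__(self, order=3, k=0.1):
        self.order = order
        self.k = k
        self.ngrams = Counter()    # counts of (context, next character)
        self.contexts = Counter()  # counts of each context
        self.vocab = {"</s>"}

    def _pad(self, token):
        # Treat the token as a "sentence" whose "words" are its characters.
        return ["<s>"] * (self.order - 1) + list(token) + ["</s>"]

    def fit(self, tokens):
        # Entries are treated independently; their order does not matter.
        for token in tokens:
            chars = self._pad(token)
            self.vocab.update(token)
            for i in range(self.order - 1, len(chars)):
                context = tuple(chars[i - self.order + 1:i])
                self.ngrams[(context, chars[i])] += 1
                self.contexts[context] += 1
        return self

    def perplexity(self, token):
        chars = self._pad(token)
        vocab_size = len(self.vocab) + 1  # +1 reserves mass for unseen characters
        log_prob = 0.0
        steps = 0
        for i in range(self.order - 1, len(chars)):
            context = tuple(chars[i - self.order + 1:i])
            numerator = self.ngrams[(context, chars[i])] + self.k
            denominator = self.contexts[context] + self.k * vocab_size
            log_prob += math.log(numerator / denominator)
            steps += 1
        return math.exp(-log_prob / steps)


def classify(token, entity_lm, nonentity_lm):
    """Assign the label of whichever CLM gives the lower perplexity."""
    if entity_lm.perplexity(token) < nonentity_lm.perplexity(token):
        return "entity"
    return "non-entity"


# Toy training lists (hypothetical, far smaller than any real list).
entity_lm = CharNgramLM().fit(["Obama", "Osama", "Omaha", "Alabama"])
nonentity_lm = CharNgramLM().fit(["the", "then", "there", "these", "other"])
print(classify("Obama", entity_lm, nonentity_lm))
print(classify("there", entity_lm, nonentity_lm))
```

With realistic list sizes, a toolkit with stronger smoothing would normally replace this hand-rolled model; the sketch only shows how tokens become character "sentences" and how the two perplexities are compared.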

Data
To determine whether name identifiability applies to languages other than English, we conduct experiments on a range of languages for which we had previously gathered resources (such as Brown clusters): English, Amharic, Arabic, Bengali, Farsi, Hindi, Somali, and Tagalog.
For English, we use the original splits from the ubiquitous CoNLL 2003 English dataset (Sang and Meulder, 2003), which is a newswire dataset annotated with Person (PER), Organization (ORG), Location (LOC) and Miscellaneous (MISC). To collect the list of entities and non-entities as the training data for the Entity and Non-Entity CLMs, we sample a large number of PER/ORG/LOC and non-entities from Wikipedia, using types derived from their corresponding Freebase entities (Ling and Weld, 2012).
For all other languages, we use a subset of the corpora from the LORELEI project annotated for the NER task (Strassel and Tracey, 2016). We build our entity list using the tokens labeled as entities in the training data, and our non-entity list from the remaining tokens. These two lists are then used to train two CLMs, as described above.
Our datasets vary in the number of entity and non-entity tokens, as shown in Table 1. The smallest, Farsi, has 4.5K entity and 50K non-entity tokens; the largest, English, has 29K entity and 170K non-entity tokens.

CLM for Named Entity Identification
In this section, we first show the power of CLMs for distinguishing between entity and non-entity tokens in English, and then that this power is robust across a variety of languages.
We refer to this task as Named Entity Identification (NEI), because we are concerned only with finding an entity span, not its label. We differentiate it from Named Entity Recognition (NER), in which both span and label are required. To avoid complicating this straightforward approach by requiring a separate mention detection step, we evaluate at the token level, as opposed to the more common phrase-level evaluation. We also apply one heuristic: if a word has length 1, we automatically predict 'O' (or non-entity). This captures most punctuation and words like 'I' and 'a'.
Figure 1 shows that for the majority of entity tokens, the entity CLM computes a relatively low perplexity compared to non-entity tokens. Though there also exist some non-entities with low entity CLM perplexity, we can still reliably identify a large proportion of non-entity words by setting a threshold value for entity CLM perplexity. If a token's perplexity lies above this threshold, we label it as a non-entity token.

Table 2: Token-level identification F1 scores. Averages are computed over all languages other than English. Two baselines are also compared here: Capitalization tags a token in Test as an entity if it is capitalized; and Exact Match keeps track of entities seen in training, tagging tokens in Test that exactly match some entity in Train. The bottom section shows state-of-the-art models which use complex features for names, including contextual information. Languages in order are: English, Amharic, Arabic, Bengali, Farsi, Hindi, Somali, and Tagalog.

Since we also build a CLM for non-entities, we can compare the entity and non-entity perplexity scores for a token. For those tokens not excluded using the threshold as described above, we compare the perplexity scores of the two models and assign the label corresponding to the model yielding the lower score.
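The full decision procedure (length-1 heuristic, perplexity threshold, then comparison of the two CLMs) can be summarized in a short sketch. The function name `identify`, the label string `ENT`, the threshold value, and the toy perplexity tables are hypothetical; in practice the two perplexity functions would come from the trained entity and non-entity CLMs.

```python
def identify(token, entity_ppl, nonentity_ppl, threshold):
    """Return 'ENT' or 'O' for a single token, using no sentence context.

    entity_ppl and nonentity_ppl are callables mapping a token to the
    perplexity assigned by the entity and non-entity CLM respectively.
    """
    # Heuristic: single-character tokens (punctuation, 'I', 'a') are non-entities.
    if len(token) == 1:
        return "O"
    ent = entity_ppl(token)
    # Tokens the entity CLM finds very surprising are labeled non-entity.
    if ent > threshold:
        return "O"
    # Otherwise the label follows the CLM with the lower perplexity.
    return "ENT" if ent < nonentity_ppl(token) else "O"


# Toy perplexities (made-up numbers, for illustration only).
toy_entity = {"Obama": 5.0, "table": 12.0, "xylograph": 90.0}
toy_nonentity = {"Obama": 30.0, "table": 8.0, "xylograph": 95.0}

labels = [
    identify(tok, toy_entity.get, toy_nonentity.get, threshold=50.0)
    for tok in ["Obama", "table", "xylograph", "I"]
]
print(labels)  # ['ENT', 'O', 'O', 'O']
```

Note that "xylograph" is rejected by the threshold even though its entity perplexity is lower than its non-entity perplexity, which is the case the threshold is meant to handle.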
We compare the N-gram model, trained with SRILM, against Skip-gram and CBOW, as implemented in Gensim, and the Log-Bilinear (LB) model. We trained both CBOW and Skip-gram with window size 3, and size 20. We tuned LB, and report results with embedding size 150, and learning rate 0.1. Despite tuning the neural models, the simple N-gram model outperforms them significantly, perhaps because of the relatively small amount of training data.
We compare the CLM's Entity Identification against two state-of-the-art NER systems: CogCompNER (Khashabi et al., 2018) and LSTM-CRF (Lample et al., 2016). We train the NER systems as usual, but at test time we convert all predictions into binary token-level annotations to get the final score. As Table 2 shows, the result of the N-gram CLM, which yields the highest performance, is remarkably close to the result of state-of-the-art NER systems (especially for English) given the simplicity of the model.
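The conversion of NER predictions to binary token-level annotations, and the resulting token-level F1, can be sketched as follows. This assumes a BIO-style tag scheme in which every non-entity token is tagged `O`; the helper names `to_binary` and `token_f1` and the example tag sequences are illustrative, not taken from our evaluation scripts.

```python
def to_binary(tags):
    """Collapse NER tags (e.g. B-PER, I-LOC, O) to entity / non-entity."""
    return [tag != "O" for tag in tags]


def token_f1(gold, pred):
    """Token-level F1 for the entity class, ignoring entity types."""
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum(p and not g for g, p in zip(gold, pred))
    fn = sum(g and not p for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold and predicted tag sequences for five tokens.
gold_tags = ["B-PER", "I-PER", "O", "O", "B-LOC"]
pred_tags = ["B-PER", "O", "O", "B-ORG", "B-LOC"]

score = token_f1(to_binary(gold_tags), to_binary(pred_tags))
print(round(score, 3))
```

Because the tags are collapsed before scoring, a prediction with the wrong entity type still counts as correct as long as the token is marked as part of some entity.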

Improving NER with CLM features
In this section we show that we can augment a standard NER system with simple features based on our entity/non-entity CLMs to improve performance in many languages. Based on their superior performance as reported in Section 3, we use the N-gram CLMs.

Features
We define three simple features that capture information provided by CLMs and which we expect to be useful for NER.
Entity Feature. We define one "isEntity" feature based on the perplexities of the entity and non-entity CLMs. We compare the perplexity calculated by the entity CLM and the non-entity CLM described in Section 3, and return a boolean value indicating whether the entity CLM score is lower.

Language Features
We define two language-related features: "isArabic" and "isRussian". We observe that there are many names in English text that originate from other languages, resulting in very different orthography than native English names. We therefore build two language-based CLMs for Arabic and Russian. We collect a list of Arabic names and a list of Russian names by scraping name-related websites, and train an Arabic CLM and a Russian CLM. For each token, when the perplexity of either the Arabic or the Russian CLM is lower than the perplexity of the Non-Entity CLM, we return True, indicating that this entity is likely to be a name from Arabic/Russian. Otherwise, we return False.

Table 3: NER results on 8 languages show that even a simplistic addition of CLM features to a standard NER model boosts performance. CogCompNER is run with standard features, including Brown clusters; (Lample et al., 2016) is run with default parameters and pre-trained embeddings. Unseen refers to performance on named entities in Test that were not seen in the training data. Full is performance on all entities in Test. Averages are computed over all languages other than English.
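The three features defined in this section can be sketched as simple perplexity comparisons. This is one reading of the definitions: each language feature is assumed to fire when its own CLM scores lower than the non-entity CLM. The function `clm_features`, the dictionary keys, and the toy perplexities are hypothetical.

```python
def clm_features(token, ppl):
    """Compute boolean CLM features for one token.

    ppl maps a CLM name to a callable returning that CLM's perplexity
    for the token (lower means the token looks more typical).
    """
    non_entity = ppl["non_entity"](token)
    return {
        "isEntity": ppl["entity"](token) < non_entity,
        "isArabic": ppl["arabic"](token) < non_entity,
        "isRussian": ppl["russian"](token) < non_entity,
    }


# Toy perplexity tables (made-up numbers standing in for trained CLMs).
toy = {
    "entity": {"Khalid": 6.0, "walking": 25.0},
    "non_entity": {"Khalid": 20.0, "walking": 7.0},
    "arabic": {"Khalid": 4.0, "walking": 30.0},
    "russian": {"Khalid": 22.0, "walking": 28.0},
}
ppl = {name: table.get for name, table in toy.items()}

features = clm_features("Khalid", ppl)
print(features)
```

The resulting booleans would then be added alongside the existing features of the NER system for each token.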

Experiments
We use CogCompNER (Khashabi et al., 2018) as our baseline NER system because it allows easy integration of new features, and evaluate on the same datasets as before. For English, we add all features described above. For other languages, due to the limited training data, we only use the "isEntity" feature. We compare with the state-of-the-art character-level neural NER system of Lample et al. (2016), which inherently encodes comparable information to CLMs, as a way to investigate how much of that system's performance can be attributed directly to name-internal structure.
The results in Table 3 show that for six of the eight languages we studied, the baseline NER can be significantly improved by adding simple CLM features; for English and Arabic, the augmented system performs better even than the neural NER model of Lample et al. (2016). For Tagalog, however, adding CLM features actually impairs system performance.
In the same table, the rows marked "unseen" report systems' performance on named entities in Test that were not seen in the training data. This setting more directly assesses the robustness of a system in identifying named entities in new data. By this measure, Farsi NER is not improved by name-only CLM features and Tagalog is impaired. Benefits for English, Hindi, and Somali are limited, but are quite significant for Amharic, Arabic, and Bengali.
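One way to picture the "unseen" setting is as a partition of the Test entities by whether their surface form occurs among the Train entities. The sketch below assumes exact string matching on the mention surface form, which may differ from the matching used to produce Table 3; the function `split_by_seen` and the example mentions are illustrative.

```python
def split_by_seen(train_mentions, test_mentions):
    """Partition test mentions by whether their surface form occurs in training.

    Each mention is a (surface form, entity type) pair.
    """
    seen_forms = {surface for surface, _ in train_mentions}
    seen = [m for m in test_mentions if m[0] in seen_forms]
    unseen = [m for m in test_mentions if m[0] not in seen_forms]
    return seen, unseen


# Hypothetical mentions.
train = [("Addis Ababa", "LOC"), ("Obama", "PER")]
test = [("Obama", "PER"), ("Mogadishu", "LOC"), ("Manila", "LOC")]

seen, unseen = split_by_seen(train, test)
print(unseen)
```

Scoring a system only on the `unseen` portion removes the benefit of memorizing names from Train, which is why it is a more direct test of robustness.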

Discussion
Our results demonstrate the power of CLMs for recognizing named entity tokens in a diverse range of languages, and that in many cases they can improve off-the-shelf NER system performance even when integrated in a simplistic way.
However, the results from Section 4.2 show that this is not true for all languages, especially when only considering unseen entities in Test: Tagalog and Farsi do not follow the trend for the other languages we assessed even though the CLM performs well for Named Entity Identification.
While the end-to-end model developed by Lample et al. (2016) clearly includes information comparable to that in the CLM, it requires a fully annotated NER corpus, takes significant time and computational resources to train, and is non-trivial to integrate into a new NER system. The CLM approach captures a very large fraction of the entity/non-entity distinction capacity of full NER systems, and can be rapidly trained using only entity and non-entity token lists; that is, it is corpus-agnostic. For some languages it can be used directly to improve NER performance; for others (such as Tagalog), the strong NEI performance indicates that while it does not immediately boost performance, it can ultimately be used to improve NER there too.

Related Work
Cucerzan and Yarowsky (1999) is one of the earliest works to use character-based features (character tries) for NER. Klein et al. (2003) was one of the original papers in the CoNLL 2003 NER shared task. Their approach, which ranked in the top 3 for both English and German shared tasks, used character-based features for NER. They do two experiments: one with a character-based HMM, another using character n-grams as features in a maximum entropy model. The focus on character-level patterns is similar to our work, but without the specific exploration of language models alone.
Using character-based models similar to ours, Smarr and Manning (2002) show that unseen noun phrases can be accurately classified into a small number of categories using only a character-based model independent of context. We tackle a somewhat more challenging task of distinguishing entities from non-entities. Lample et al. (2016) use character embeddings in an LSTM-CRF model. Their ablation studies show that character-level features improve performance significantly.
We are not aware of any work that directly evaluates CLMs for identifying name tokens, nor of work that demonstrates the utility of character-level information for identifying names in multiple languages.

Conclusions and Future Work
We have shown, in a series of simple experiments, that in many languages names are identifiable by character patterns alone, and that character-level patterns have strong potential for building better NER systems.
In the future, we plan to make a more thorough analysis of reasons for the high variance in NER performance. In particular, we will study why it is possible, as with Tagalog, to have high Named Entity Identification results but lose points in NER.

Figure 1: Perplexity histogram of entity (left) and non-entity tokens (right) in CoNLL Train calculated by entity CLM for both sides. The graphs show the percentage of tokens (y axis) with different levels of CLM perplexities (x axis). The entity CLM gives a low average perplexity and small variance to entity tokens (left), while giving non-entity tokens much higher perplexity and higher variance (right).
Note: The threshold on entity CLM perplexity is tuned on development data.