Robustness to Capitalization Errors in Named Entity Recognition

Robustness to capitalization errors is a highly desirable characteristic of named entity recognizers, yet we find that standard models for the task are surprisingly brittle to such noise. Existing methods for improving robustness to this noise completely discard the given orthographic information, which significantly degrades their performance on well-formed text. We propose a simple alternative approach based on data augmentation, which allows the model to learn to utilize or ignore orthographic information depending on its usefulness in context. It achieves competitive robustness to capitalization errors while making a negligible compromise in performance on well-formed text and significantly improving generalization to noisy user-generated text. Our experiments clearly and consistently validate our claim across different types of machine learning models, languages, and dataset sizes.


Introduction
In the last two decades, substantial progress has been made on the task of named entity recognition (NER), as it has benefited from developments in probabilistic modeling (Lafferty et al., 2001; Finkel et al., 2005), methodology (Ratinov and Roth, 2009), deep learning (Collobert et al., 2011; Huang et al., 2015; Lample et al., 2016), and semi-supervised learning (Peters et al., 2017, 2018). Evaluation of these developments, however, has mostly focused on their impact on global average metrics, most notably the micro-averaged F1 score (Chinchor, 1992).
For practical applications of NER, however, there can be other considerations for model evaluation. While standard training data for the task consists mainly of well-formed text (Tjong Kim Sang, 2002; Pradhan and Xue, 2009), models trained on such data are often applied to a broad range of domains and genres by users who are not necessarily NLP experts, thanks to the proliferation of toolkits (Manning et al., 2014) and general-purpose machine learning services. Therefore, there is an increasing demand for models that are strongly robust to unexpected noise.
In this paper, we tackle one of the most common types of noise in applications of NER: unreliable capitalization. Noisy capitalization is a typical characteristic of user-generated text (Ritter et al., 2011; Baldwin et al., 2015), but it is not uncommon even in formal text. Headings, legal documents, and emphasized sentences are often fully capitalized. All-lowercased text, on the other hand, can be produced at large scale by upstream machine learning models such as speech recognizers and machine translators (Kubala et al., 1998), or by processing steps in the data pipeline that are not fully under the control of the practitioner. Although text without correct capitalization is perfectly legible to human readers (Cattell, 1886; Rayner, 1975), with only a minor impact on reading speed (Tinker and Paterson, 1928; Arditi and Cho, 2007), we show that typical NER models are surprisingly brittle to all-uppercasing or all-lowercasing of text. The lack of robustness these models show to such common types of noise makes them unreliable, especially when the characteristics of the target text are not known a priori.
There are two standard treatments of the problem in the literature. The first is to train a case-agnostic model (Kubala et al., 1998; Robinson et al., 1999), and the second is to explicitly correct the capitalization (Srihari et al., 2003; Lita et al., 2003; Ritter et al., 2011). One of the main contributions of this paper is to empirically evaluate the effectiveness of these techniques across models, languages, and dataset sizes. However, both approaches have clear conceptual limitations. Case-agnostic models discard orthographic information (how the given text was capitalized), which is considered to be highly useful (Robinson et al., 1999); our experimental results also support this. The second approach of correcting the capitalization of the text, on the other hand, requires access to a high-quality truecasing model, and errors from the truecasing model cascade to the final named entity predictions.
We argue that an ideal approach should take full advantage of orthographic information when it is correctly present, but rather than assuming the information to be always perfect, the model should be able to learn to ignore the orthographic information when it is unreliable. To this end, we propose a novel approach based on data augmentation (Simard et al., 2003). In computer vision, data augmentation is a highly successful standard technique (Krizhevsky et al., 2012), and it has been adopted in natural language processing tasks such as text classification (Zhang and LeCun, 2015), question answering (Yu et al., 2018), and low-resource learning (Sahin and Steedman, 2018). Consistently across a wide range of models (from linear models and deep learning models to deep contextualized models), languages (English, German, Dutch, and Spanish), and dataset sizes (CoNLL 2003 and OntoNotes 5.0), the proposed method shows strong robustness while making little compromise in performance on well-formed text.

Formulation
Let $x = (x_1, x_2, \ldots, x_n)$ be a sequence of words in a sentence. We follow the standard approach of formulating NER as a sequence tagging task (Rabiner, 1989; Lafferty et al., 2001; Collins, 2002). That is, we predict a sequence of tags $y = (y_1, y_2, \ldots, y_n)$, where each $y_i$ identifies the type of the entity that the word $x_i$ belongs to, as well as its position in the surface form according to the IOBES scheme (Uchimoto et al., 2000). See Table 1 (a) for an example annotated sentence. We train probabilistic models under the maximum likelihood principle, which produce a probability score $P[y \mid x]$ for any possible output sequence $y$.
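To make the tagging scheme concrete, the following Python sketch converts entity spans into IOBES tags (B = begin, I = inside, E = end, S = single-token entity, O = outside). The function name `spans_to_iobes`, the span representation, and the example sentence are illustrative assumptions and are not taken from any dataset tooling.

```python
from typing import List, Tuple

# A span is (start, end_exclusive, entity_type); this representation is an assumption.
Span = Tuple[int, int, str]


def spans_to_iobes(n_tokens: int, spans: List[Span]) -> List[str]:
    """Return one IOBES tag per token for the given non-overlapping spans."""
    tags = ["O"] * n_tokens
    for start, end, etype in spans:
        if end - start == 1:
            # Single-token entity.
            tags[start] = f"S-{etype}"
        else:
            tags[start] = f"B-{etype}"
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{etype}"
            tags[end - 1] = f"E-{etype}"
    return tags


# Hypothetical example sentence with one three-token location.
tokens = ["I", "live", "in", "New", "York", "City"]
tags = spans_to_iobes(len(tokens), [(3, 6, "LOC")])
print(list(zip(tokens, tags)))
```

Each tag thus encodes both the entity type and the position of the word within the entity, which is the information that $y_i$ carries in the formulation above.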
All-uppercasing and all-lowercasing are common types of capitalization errors. Let $\mathrm{upper}(x_i)$ and $\mathrm{lower}(x_i)$ be functions that upper-case and lower-case the word $x_i$, respectively. Robustness of a probabilistic model to these types of noise can be understood as the quality of the scoring functions $P[y \mid \mathrm{upper}(x_1), \ldots, \mathrm{upper}(x_n)]$ and $P[y \mid \mathrm{lower}(x_1), \ldots, \mathrm{lower}(x_n)]$ in predicting the correct annotation $y$, which can still be quantified with standard evaluation metrics such as the micro-averaged F1 score.
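A minimal sketch of the two noise functions, assuming Python's built-in Unicode case mapping is an acceptable stand-in for upper and lower. The helper names and the example sentences are our own. Note that case mapping can change the length of a word (the German "ß" upper-cases to "SS"), but the number of words, and hence the alignment with the tag sequence, is unchanged.

```python
from typing import List


def upper_sentence(words: List[str]) -> List[str]:
    """Apply upper(x_i) to every word of the sentence."""
    return [w.upper() for w in words]


def lower_sentence(words: List[str]) -> List[str]:
    """Apply lower(x_i) to every word of the sentence."""
    return [w.lower() for w in words]


# Hypothetical example sentences (English and German).
english = ["Ohio", "is", "cold"]
german = ["Die", "Straße", "ist", "lang"]

print(upper_sentence(english))
print(lower_sentence(english))
print(upper_sentence(german))
```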

Prior Work
There are two common strategies to improve robustness to capitalization errors. The first is to completely ignore orthographic information by using case-agnostic models (Kubala et al., 1998; Robinson et al., 1999). For linear models, this can be achieved by restricting the choice of features to case-agnostic ones. Deep learning models without hand-curated features (Lample et al., 2016; Chiu and Nichols, 2016), on the other hand, can easily be made case-agnostic by lower-casing every input to the model. The second strategy is to explicitly correct the capitalization by using another model trained for this purpose, which is called "truecasing" (Srihari et al., 2003; Lita et al., 2003). Both methods, however, share the limitation that they discard the orthographic information in the target text, which may be correct; this leads to degradation of performance on well-formed text.
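The case-agnostic strategy for models without hand-curated features can be sketched as a thin wrapper that lower-cases every input before it reaches the tagger. The `Tagger` type, `make_caseless`, and the toy lookup tagger below are hypothetical stand-ins for a trained model; in practice the same lower-casing would also be applied to the training data.

```python
from typing import Callable, List

# Assumed interface: a tagger maps a list of words to a list of tags.
Tagger = Callable[[List[str]], List[str]]


def make_caseless(tagger: Tagger) -> Tagger:
    """Wrap a tagger so that it never sees the original capitalization."""
    def caseless_tagger(words: List[str]) -> List[str]:
        return tagger([w.lower() for w in words])
    return caseless_tagger


def toy_tagger(words: List[str]) -> List[str]:
    """Toy stand-in for a model trained on lower-cased text."""
    return ["S-LOC" if w == "ohio" else "O" for w in words]


caseless = make_caseless(toy_tagger)
for sentence in (["Ohio", "is", "cold"], ["OHIO", "IS", "COLD"], ["ohio", "is", "cold"]):
    print(sentence, caseless(sentence))
```

By construction the wrapped tagger returns identical predictions for all capitalizations of a sentence, which is also why it cannot exploit capitalization when it is correct.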

Data Augmentation
Data augmentation refers to the technique of increasing the size of the training data by adding label-preserving transformations of it (Simard et al., 2003). For example, in image classification, an object inside an image does not change if the image is rotated, translated, or slightly skewed; most people would still recognize the same object they would find in the original image. By training a model on transformed versions of the training images, the model becomes invariant to the transformations used (Krizhevsky et al., 2012). In order to improve the robustness of NER models to capitalization errors, we appeal to the same idea. When a sentence is all-lowercased or all-uppercased as in Table 1 (b) and (c), each word still corresponds to the same entity. This implies that such transformations are also label-preserving: for a sentence $x$ and its ground-truth annotation $y$, $y$ is still a correct annotation for the all-uppercased sentence $(\mathrm{upper}(x_1), \ldots, \mathrm{upper}(x_n))$ as well as the all-lowercased version $(\mathrm{lower}(x_1), \ldots, \mathrm{lower}(x_n))$. In other words, all three versions of the sentence in Table 1 share the same annotation.
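A minimal sketch of this augmentation, assuming the training set is held in memory as a list of (words, tags) pairs. The function name `augment_with_case_noise` and the example data are illustrative. Details such as whether to deduplicate sentences that are already all-lowercase are not specified here and are left out.

```python
from typing import List, Tuple

# Assumed representation: one training example is (words, tags).
Example = Tuple[List[str], List[str]]


def augment_with_case_noise(data: List[Example]) -> List[Example]:
    """Add all-uppercased and all-lowercased copies of each sentence.

    The tag sequence is reused unchanged because both transformations
    are label-preserving.
    """
    augmented: List[Example] = []
    for words, tags in data:
        augmented.append((list(words), list(tags)))
        augmented.append(([w.upper() for w in words], list(tags)))
        augmented.append(([w.lower() for w in words], list(tags)))
    return augmented


# Hypothetical one-sentence training set.
train = [(["Ohio", "is", "cold"], ["S-LOC", "O", "O"])]
for words, tags in augment_with_case_noise(train):
    print(words, tags)
```

The augmented set is three times the size of the original, and any model that trains on (words, tags) pairs can consume it without modification.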

Experiments
We consider the following three models, each of which is state-of-the-art in its respective group:
Linear: Linear CRF model (Finkel et al., 2005) from Stanford CoreNLP (Manning et al., 2014), which is representative of feature engineering approaches.
BiLSTM: Deep learning model from Lample et al. (2016), which uses a bidirectional LSTM for both the character-level encoder and the word-level encoder, with a CRF loss. This is the state-of-the-art supervised deep learning approach (Reimers and Gurevych, 2017).
ELMo: Bidirectional LSTM-CRF model which uses contextualized features from a deep bidirectional LSTM language model (Peters et al., 2018).
For all models, we used the hyperparameters from the original papers.
We compare four strategies:
Baseline: Models are trained on unmodified training data.
Caseless: We lower-case input data both at training time and at test time.
Truecasing: Models are still trained on unmodified training data, but every test input is "truecased" (Lita et al., 2003) using the CRF truecasing model from Stanford CoreNLP (Manning et al., 2014), which ignores the given orthographic information in the text. Due to the lack of access to truecasing models in other languages, this strategy was used only on English.
DA (Data Augmentation): We augment the original training set with upper-cased and lower-cased versions of it, as discussed in Section 4.
We evaluate these models and methods on three versions of the test set for each dataset: Original: Original test data. Upper: All words are upper-cased. Lower: All words are lower-cased. Note that the Caseless and Truecasing methods each perform identically on all three versions because they ignore any original orthographic information in the test dataset. We focus on micro-averaged F1 scores.
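The evaluation protocol can be sketched as follows: build the three versions of the test set, run the tagger on each, and compute entity-level micro-averaged F1. This is a simplified illustration and not the official CoNLL evaluation script; all names are our own, the tagger is a toy stand-in that only recognizes the capitalized form "Ohio", and malformed tag sequences are simply dropped.

```python
from typing import Callable, Dict, List, Set, Tuple

Span = Tuple[int, int, str]
Tagger = Callable[[List[str]], List[str]]


def iobes_to_spans(tags: List[str]) -> Set[Span]:
    """Extract (start, end_exclusive, type) spans from well-formed IOBES tags."""
    spans: Set[Span] = set()
    open_span = None  # (start index, entity type) of an entity being read
    for i, tag in enumerate(tags):
        if tag == "O":
            open_span = None
            continue
        prefix, etype = tag.split("-", 1)
        if prefix == "S":
            spans.add((i, i + 1, etype))
            open_span = None
        elif prefix == "B":
            open_span = (i, etype)
        elif prefix == "I" and open_span is not None and open_span[1] == etype:
            continue  # still inside the open entity
        elif prefix == "E" and open_span is not None and open_span[1] == etype:
            spans.add((open_span[0], i + 1, etype))
            open_span = None
        else:
            open_span = None  # malformed sequence: discard the partial entity
    return spans


def micro_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Pool true positives over all sentences, then compute F1 once."""
    tp = n_gold = n_pred = 0
    for g, p in zip(gold, pred):
        gs, ps = iobes_to_spans(g), iobes_to_spans(p)
        tp += len(gs & ps)
        n_gold += len(gs)
        n_pred += len(ps)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


def evaluate_variants(tagger: Tagger, sentences: List[List[str]],
                      gold: List[List[str]]) -> Dict[str, float]:
    """Score the tagger on the Original, Upper, and Lower test versions."""
    variants: Dict[str, Callable[[str], str]] = {
        "Original": lambda w: w, "Upper": str.upper, "Lower": str.lower,
    }
    return {name: micro_f1(gold, [tagger([f(w) for w in s]) for s in sentences])
            for name, f in variants.items()}


def toy_tagger(words: List[str]) -> List[str]:
    """Toy case-sensitive tagger, standing in for a Baseline model."""
    return ["S-LOC" if w == "Ohio" else "O" for w in words]


scores = evaluate_variants(toy_tagger, [["Ohio", "is", "cold"]], [["S-LOC", "O", "O"]])
print(scores)  # {'Original': 1.0, 'Upper': 0.0, 'Lower': 0.0}
```

The toy tagger is perfect on the original sentence and fails completely on both noisy versions, which is an exaggerated form of the brittleness this protocol is designed to expose.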
We use CoNLL-2002 Spanish and Dutch (Tjong Kim Sang, 2002) and CoNLL-2003 English and German (Sang and De Meulder, 2003) to cover four languages, in all of which orthographic information is useful in identifying named entities and upper- or lower-casing of text is straightforward. We additionally evaluate on OntoNotes 5.0 English (Pradhan and Xue, 2009), which is about five times larger than the CoNLL datasets and contains more diverse genres. F1 scores are shown in Tables 2 and 3.
Question 1: How robust are NER models to capitalization errors? Models trained with the standard Baseline strategy suffer a significant loss of performance when the test sentence is upper- or lower-cased (compare the 'Original' column with 'Lower' and 'Upper'). For example, the F1 score of BiLSTM on lower-cased CoNLL-2003 English is an abysmal 0.4%, a complete loss of predictive power. Linear and ELMo are more robust than BiLSTM thanks to smaller capacity and semi-supervision, respectively, but the degradation is still strong, ranging from a 20pp to a 60pp loss in F1.
Question 2: How effective are the Caseless, Truecasing, and Data Augmentation approaches in improving the robustness of models? All methods show similar levels of performance on lower-cased or upper-cased text. Since the Caseless and Data Augmentation strategies do not require an additional language-specific resource as Truecasing does, they seem to be superior to the truecasing approach, at least on the CoNLL-2003 English and OntoNotes 5.0 datasets with the particular truecasing model used. Across all datasets, the performance of the Linear model on the lower-cased or upper-cased test set is consistently enhanced with data augmentation, compared with caseless models.
Question 3: How much performance on well-formed text is sacrificed for robustness? The Caseless and Truecasing methods are perfectly robust to capitalization errors, but only at the cost of significant degradation on well-formed text: the Caseless and Truecasing strategies lose 5.1pp and 6.2pp, respectively, on the original test set of CoNLL-2003 English compared to the Baseline strategy, and on non-English datasets the drop is even bigger. On the other hand, data augmentation preserves most of the performance on the original test set: with BiLSTM, its F1 score drops by only 0.4pp and 0.1pp on CoNLL-2003 and OntoNotes 5.0 English, respectively. On non-English datasets, the drop is bigger (0.1pp on Spanish but 2.5pp on Dutch and 2.7pp on German), but data augmentation still performs about 7pp higher than Caseless on original well-formed text across languages.
Question 4: How do models trained on well-formed text generalize to noisy user-generated text? The robustness of models is especially important when the characteristics of the target text are not known at training time and can deviate significantly from those of the training data. To this end, we trained models on CoNLL-2003 English and evaluated them on annotations of Twitter data from Fromreide et al. (2014), which exhibits natural capitalization errors common in user-generated text. The 'Transfer to Twitter' column of Table 2 reports the results. In this experiment, the Data Augmentation approach consistently and significantly improves upon the Baseline strategy, by 3.8pp, 3.1pp, and 3.0pp with the Linear, BiLSTM, and ELMo models, respectively, on the Original test set of Twitter, demonstrating much stronger generalization when the test data is noisier than the training data.
In order to understand the results, we examined some samples from the dataset. Indeed, on a sentence like 'OHIO IS STUPID I HATE IT', the BiLSTM model trained with the Baseline strategy was unable to identify 'OHIO' as a location, although the state is mentioned fifteen times in the training dataset of CoNLL-2003 English as 'Ohio'. BiLSTM models trained with all other strategies correctly identified the state. On the other hand, on another sample sentence, 'Someone come with me to Raging Waters on Monday', BiLSTM models from the Baseline and Data Augmentation strategies were able to correctly identify 'Raging Waters' as a location thanks to the proper capitalization, while the model from the Caseless strategy failed on the entity because it ignores orthographic information.

Conclusion
We proposed a data augmentation strategy for improving the robustness of NER models to capitalization errors. Compared to previous methods, data augmentation provides competitive robustness without sacrificing performance on well-formed text, while also improving generalization to noisy text. This is consistently observed across models, languages, and dataset sizes. In addition, data augmentation does not require additional language-specific resources and is trivial to implement for many natural languages. Therefore, we recommend using data augmentation by default for training NER models, especially when little is known a priori about the characteristics of the test data.