Simple Features for Strong Performance on Named Entity Recognition in Code-Switched Twitter Data

In this work, we address the problem of Named Entity Recognition (NER) in code-switched tweets as a part of the Workshop on Computational Approaches to Linguistic Code-switching (CALCS) at ACL'18. Code-switching is the phenomenon where a speaker switches between two languages or variants of the same language within or across utterances, known as intra-sentential or inter-sentential code-switching, respectively. Processing such data is challenging for state-of-the-art methods, since such technology is generally geared towards monolingual text. In this paper we explore ways to use language identification and translation to recognize named entities in such data; however, a Conditional Random Field (CRF) classifier using simple features (without any multilingual features) achieved the best results. Our experiments were aimed mainly at the English-Spanish (ENG-SPA) dataset, but we also submitted a language-independent version of our system for the Arabic-Egyptian (MSA-EGY) dataset and achieved good results.


Introduction
Recently, social media texts such as tweets and Facebook posts have attracted attention from the Natural Language Processing (NLP) research community. This content has many applications, as it provides clues for analyzing the sentiments of the masses towards areas ranging from basic electronic products to mental health issues to even national political candidates. These applications have motivated the NLP community to rethink strategies for common tools, such as tokenizers, named entity taggers, POS taggers, and dependency parsers, in the context of informal and noisy text.
As access to the internet becomes more and more universal, a linguistically diverse population has come online. Hong et al. (2011) showed that in a collection of 62 million tweets, only a little over 50% of them were in English. This multilingualism has given rise to such interesting patterns as transliteration and code-switching. The multilingual behavior combined with the informal nature of the content makes the task of building NLP tools even harder.
In this paper, we solve the problem of Named Entity Recognition (NER) for code-switched Twitter data as a part of the ACL'18 Computational Approaches to Linguistic Code-switching (CALCS) Shared Task (Aguilar et al., 2018). Code-switching is a phenomenon that occurs when multilingual speakers alternate between two or more languages or dialects. This phenomenon can be observed across different sentences, within the same sentence, or even within the same word. This shared task is similar to other social media tasks, except that the data is explicitly chosen to contain code-switching. The entities for the task are: Event, Group, Location, Organization, Other, Person, Product, Time, and Title. Standard NER systems perform significantly worse on tweets than on edited text (Ritter et al., 2011). This can be explained by the fact that such systems rely on hand-crafted standard local features and some background knowledge, which are unreliable on data as noisy as tweets. With only a limited number of characters, people use a variety of creative ways to express their thoughts, including emoticons and novel abbreviations.
There have been a few recent workshops and shared tasks on the analysis of such noisy social media data, such as the Workshop on Noisy User-Generated Text (WNUT) at EMNLP (2014, 2016, 2017), the Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis (WASSA) at NAACL (2016), and the Forum for Information Retrieval Evaluation (FIRE: 2015, 2016, 2017).

Experimental Setup
Here we describe the data, evaluation, and the model we used.

Data
In our experiments, we focus primarily on the English-Spanish (ENG-SPA) dataset. However, we submitted our basic system results for Arabic-Egyptian (MSA-EGY) dataset as well.
The organizers provided annotated train and development sets for each language. They also provided an unannotated set of test data, which we annotated with our system, and submitted for evaluation. We never had access to the gold annotated test set, before or after the evaluation.
Tables 1 and 2 summarize the data in terms of the number of tweets and tokens for the English-Spanish (ENG-SPA) and Modern Standard Arabic-Egyptian (MSA-EGY) language pairs. Tables 3 and 4 provide statistics on the named entities for both language pairs, where each cell is read as Number (Percentage) and the tag 'O' covers all non-NE tokens. Note that the data has been tagged using the IOB scheme, and the counts in Tables 3 and 4 group named entities according to that scheme.
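To make the IOB grouping concrete, here is a minimal, illustrative helper (not part of the shared-task tooling) that collapses a token-level IOB tag sequence into entity spans, the unit over which the counts in Tables 3 and 4 are computed:

```python
def iob_to_spans(tags):
    """Group a sequence of IOB tags into (entity_type, start, end) spans.

    `end` is exclusive. A B- tag, or an I- tag whose type differs from
    the open span, starts a new entity; 'O' closes any open span.
    """
    spans = []
    start, etype = None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-") or (tag.startswith("I-") and etype != tag[2:]):
            if etype is not None:
                spans.append((etype, start, i))
            start, etype = i, tag[2:]
        elif tag == "O":
            if etype is not None:
                spans.append((etype, start, i))
            start, etype = None, None
    if etype is not None:
        spans.append((etype, start, len(tags)))
    return spans
```

For example, `iob_to_spans(["B-PER", "I-PER", "O", "B-LOC"])` yields one Person span covering tokens 0-1 and one Location span covering token 3.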

Evaluation
We used the standard harmonic mean F1 score to evaluate the system performance. Additionally, we used surface form F1 score as described in Derczynski et al. (2017). Both of these metrics were a part of the evaluation in the CALCS shared task.
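As a sketch of the metric (not the official CALCS evaluation script), entity-level precision, recall, and their harmonic-mean F1 can be computed over exact-match entity spans:

```python
def entity_f1(gold_spans, pred_spans):
    """Exact-match entity-level precision, recall, and harmonic-mean F1.

    Spans are (entity_type, start, end) tuples; a prediction counts as a
    true positive only if type and boundaries both match exactly.
    """
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```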

Method
We used the sklearn implementation of a Conditional Random Field (CRF) classifier (McCallum and Li, 2003) as the base model in our NER system.

Experiments
This section gives an overview of our experiments. First, we identify various local and global features using a variety of monolingual tweets and Gazetteers and train a CRF-based classifier on the data. Second, we try to improve system recall using a 2-step NER process. Third, we convert the code-mixed data to monolingual data using language identification (via a character-based language model) and translation.
Of the three experiments that we tried, the first method gave the best results. We compare against the best performing system in the shared task as well as the organizer's baseline in Table 5. The baseline was provided by the organizers and used Bi-directional LSTMs followed by softmax layer (trained for 5 epochs) to infer the output labels.
The shared task used Surface Form F1 scores as well, but we omit them from our results as they were the same as harmonic mean F1 in all cases. All scores are reported in Table 6. Detailed scores are available in the appendix.

Experiment 1
Our first experiment used a standard set of features, augmented with some task-specific ideas, defined as follows. Given a sequence of words in a sentence ..., w_{i-2}, w_{i-1}, w_i, w_{i+1}, w_{i+2}, ..., where w_i is the current word in consideration, we used the following features:
• Whether w_i is at the beginning of the sentence
• ... (2016) labels as features
• For each word w_k in a context window of ±2:
  - The word w_k itself
  - Whether w_k is upper case
  - The shape and short shape (where runs of identical characters in the shape are compressed to a single character) of w_k
  - Whether w_k contains any special symbol such as #, $, -, etc., or an emoji
  - Whether w_k is alphabetic or alphanumeric
• Emoji description: We identified the 40 most common emojis present in our dataset and manually labelled them with representative words, such as smile, kiss, sad, etc. The emoji description (sense) of every context word was used as another feature.
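To make the feature set concrete, here is a hypothetical extractor in the spirit of the list above; the function names, the feature-key format, and the tiny emoji dictionary are our own illustrative choices, not the system's actual code:

```python
import re
from itertools import groupby

def shape(word):
    # Map each character to a class: X (upper), x (lower), d (digit); keep others.
    return "".join("X" if c.isupper() else "x" if c.islower() else
                   "d" if c.isdigit() else c for c in word)

def short_shape(word):
    # Compress runs of identical shape characters: "Xxxxxx" -> "Xx".
    return "".join(ch for ch, _ in groupby(shape(word)))

# Illustrative stand-in for the 40 manually labelled emoji senses.
EMOJI_SENSE = {"\U0001F600": "smile", "\U0001F618": "kiss"}

def word_features(sent, i, k):
    """Features for the context word at offset k from position i."""
    j = i + k
    if j < 0 or j >= len(sent):
        return {}
    w = sent[j]
    return {
        f"{k}:word": w.lower(),
        f"{k}:is_upper": w.isupper(),
        f"{k}:shape": shape(w),
        f"{k}:short_shape": short_shape(w),
        f"{k}:has_symbol": bool(re.search(r"[#$@,\-]", w)),
        f"{k}:is_alpha": w.isalpha(),
        f"{k}:is_alnum": w.isalnum(),
        f"{k}:emoji_sense": EMOJI_SENSE.get(w, ""),
    }

def token_features(sent, i):
    # Beginning-of-sentence flag plus features over the +/-2 context window.
    feats = {"bos": i == 0}
    for k in range(-2, 3):
        feats.update(word_features(sent, i, k))
    return feats
```

Feature dictionaries of this form are the usual per-token input to a CRF sequence tagger.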
We also ran the experiment on the MSA-EGY dataset (without the Gazetteer features).

Experiment 2
Following the first experiment, our main observation was that the recall was quite low. One reason for this could be the presence of a large proportion of tokens tagged as 'O' (∼97%). In contrast, the standard CoNLL 2002 Spanish NER training corpus (Tjong Kim Sang, 2002) had ∼87% of its tokens tagged as 'O'.
To address this issue, we experimented with a 2-step NER process (similar to Eiselt and Figueroa (2013)):
1. Train a CRF model to identify whether a token is 'O' or not.
2. Train a CRF model to identify the type of named entity (for tokens identified as non-'O').
As expected, we saw major improvements in recall, but these were offset by a substantial drop in precision. Overall, this led to a lower F1 score than before. In light of these results, we did not use the 2-step approach in any other experiments.
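The 2-step cascade can be sketched as follows. The two callables stand in for the two trained CRF models; a real CRF decodes whole tag sequences jointly, so this per-token version only illustrates the control flow of the cascade:

```python
def two_step_tag(tokens, is_entity, entity_type):
    """2-step NER: first decide O vs. non-O, then type only the non-O tokens.

    `is_entity` and `entity_type` are stand-ins for the two trained models;
    here they are plain callables over (tokens, index).
    """
    tags = []
    for i, _ in enumerate(tokens):
        if not is_entity(tokens, i):
            tags.append("O")
        else:
            tags.append(entity_type(tokens, i))
    return tags
```

With a toy capitalization-based detector and a constant typer, `two_step_tag(["ver", "Boston", "hoy"], ...)` tags only the capitalized token as an entity.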

Experiment 3
In this experiment, we tried to eliminate the code-switching by converting the data to a monolingual form. Our method is to identify the language of each token in the dataset and translate it into a common language.
We collected training data for language identification using the Twitter API. We downloaded tweets in English and Spanish and assumed that every word in those tweets belonged to that particular language. The statistics for the downloaded data are:
1. 3,000 Spanish tweets (7,700 tokens, ∼56%)
2. 1,900 English tweets (6,100 tokens, ∼44%)
We then trained a character-level RNN-based language model on this data to perform language identification. For validation, we used 80% of the data for training and the rest for validation, achieving an accuracy of 79% on the held-out portion. We used this model to identify the language of every token in the dataset, then used the Google Translate API to translate English tokens into Spanish. Finally, we used the language identification and translation outputs as features in our CRF model, in addition to all the features used in experiment 1.
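To illustrate character-level language identification, here is a much simpler stand-in for the character-level RNN language model the system actually used: an add-one-smoothed character-bigram model per language, with each token assigned to the language whose model scores it higher. The training word lists below are toy data for illustration only:

```python
import math
from collections import Counter

class CharBigramLM:
    """Add-one-smoothed character-bigram language model.

    A deliberately simple stand-in for a character-level RNN LM:
    words are wrapped in ^...$ boundary markers and scored by the
    sum of smoothed bigram log-probabilities.
    """
    def __init__(self, texts):
        self.bigrams = Counter()
        self.unigrams = Counter()
        for t in texts:
            t = f"^{t.lower()}$"
            self.unigrams.update(t[:-1])
            self.bigrams.update(zip(t, t[1:]))
        self.vocab = len(self.unigrams) + 1

    def logprob(self, word):
        w = f"^{word.lower()}$"
        return sum(
            math.log((self.bigrams[(a, b)] + 1) / (self.unigrams[a] + self.vocab))
            for a, b in zip(w, w[1:])
        )

def identify(word, models):
    """Return the language whose model assigns the word the highest probability."""
    return max(models, key=lambda lang: models[lang].logprob(word))
```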
As compared to the results from experiment 1, this improved the recall on both development and test sets, but again, the loss in precision caused a slight overall drop in performance.

Conclusion
Our submissions earned 4th place out of 8 submissions in the ENG-SPA task, and 3rd place out of 6 submissions in the MSA-EGY task.
Surprisingly, our simplest NER model, trained without using any language identification or translations, worked best. The other more sophisticated experiments showed promise in improving the recall, but damaged the precision too much to improve the F1 score.
One of the challenges we faced was the dissimilarity between the development and test datasets. Although some of the techniques we tried improved performance on the development set, the same effect was not seen on the test set. For example, compare the change in performance between Table 7 and Table 8: the F1 score on the development set jumped 12 points, while the score on the test set dropped 9 points. This could be explained by the very small size of the development set, where a few errors or successes can change the score dramatically. Without access to the gold test annotations, we could not perform any qualitative error analysis.
Finally, since the 2-step NER achieved such a high recall, we believe that creating an ensemble of the 1-step and 2-step systems could achieve a better overall F1 score.

Appendix A: ENG-SPA Detailed Results
We show detailed results for ENG-SPA experiments in the following tables.