Named Entity Recognition on Code-Switched Data Using Conditional Random Fields

Named Entity Recognition is an important information extraction task that identifies proper names in unstructured texts and classifies them into some pre-defined categories. Identification of named entities in code-mixed social media texts is a more difficult and challenging task as the contexts are short, ambiguous and often noisy. This work proposes a Conditional Random Fields based named entity recognition system to identify proper names in code-switched data and classify them into nine categories. The system ranked fifth among nine participant systems and achieved a 59.25% F1-score.


Introduction
With the increasing usage of social media, micro blogs and chats in various socio-economic classes, ethnicities and genres in the global society, a new category of informal short texts has evolved in recent years. One of the important phenomena that can appear in such texts is code-mixing or code-switching (CS), where bilingual users often switch back and forth between their common languages during interactions. Automatic processing of such texts encounters several challenges due to the usage of mixed vocabulary, misspellings, abbreviations, transliterations, emojis, and many more. Furthermore, it is in many cases difficult to interpret the texts because of the short contexts.
The Natural Language Processing and text mining communities have taken necessary initiatives to encourage researchers through organizing various workshops and shared-tasks, and by opening mainstream research tracks to develop resources and novel approaches to processing code-mixed texts efficiently and for extracting valuable information from such messy contents. In this direction, the CALCS 2018 Shared Task (Aguilar et al., 2018) focused on identifying a predefined set of nine Named Entity (NE) types: Person, Location, Organization, Group, Title, Product, Event, Time, and Other. The NE identification task addressed code-mixed texts of Spanish-English (SPA-ENG) and Modern Standard Arabic-Egyptian (MSA-EGY); here we will look at the first pair (SPA-ENG) only.
In this work, the named entity recognition task is considered as a sequence labeling problem, for which CRF is a natural choice to identify entity mentions from code-switched data and classify them into one of the nine aforementioned NE categories. With initial named entity token and language identification, a wide range of features (described in Section 3) are explored for this purpose. As per the overall ranking of the submitted systems under the shared task, our approach is reasonably effective.
The paper is organized as follows: The shared task datasets are presented in Section 2. The named entity recognition system is described in Section 3. Results are presented in Section 4, with error analysis reported in Section 5. Section 6 addresses future work and concludes.

Named Entity Recognition
To identify and classify each token from the code-switched data into nine categories (Person, Location, Organization, Group, Title, Product, Event, Time and Other), a supervised CRF-based (Lafferty et al., 2001) approach was used. Different features were extracted from external sources and applied to recognize the target entities. In a first step, each token was identified as either being part of a named entity (called a mention) or not. All the beginning and intermediate parts of named entities (for all nine entity categories) were converted into 'B-mention' and 'I-mention', respectively, and a CRF-based model was applied to identify the mentions.
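The label collapsing used in this first step can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `to_mention_labels` and the exact label strings (such as 'B-PER') are assumptions made for the example.

```python
def to_mention_labels(labels):
    """Collapse typed BIO labels (e.g. 'B-PER') into untyped mention labels."""
    collapsed = []
    for label in labels:
        if label.startswith("B-"):
            collapsed.append("B-mention")
        elif label.startswith("I-"):
            collapsed.append("I-mention")
        else:
            collapsed.append("O")  # tokens outside any entity stay 'O'
    return collapsed


# Hypothetical label sequence for a five-token tweet.
example = ["O", "B-PER", "I-PER", "O", "B-LOC"]
print(to_mention_labels(example))
```

Collapsing the nine categories into a single mention class gives the first-stage model a simpler three-label problem.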
In the next step, the identified mentions ('B-mention' and 'I-mention') were used as features along with other features described in subsections 3.1 and 3.2 to classify each token into one of the nine categories. The 'BIO' notation was used to represent the named entities.
The CRF-based mention and named entity identification models were implemented using CRFsuite (python-crfsuite), which allows for fast training by utilizing L-BFGS (Liu and Nocedal, 1989), a limited memory quasi-Newton algorithm for large scale numerical optimization. The classifier was trained both on features retrieved from external resources and on features directly extracted from the training data, as detailed in the following two subsections.
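A minimal sketch of how a model might be trained and applied with python-crfsuite is given below. The helper names (`as_crfsuite_items`, `train_crf`, `tag_sequence`) and the hyperparameter values are assumptions for illustration only; the paper does not report the settings that were actually used.

```python
def as_crfsuite_items(token_feature_dicts):
    """Turn per-token feature dicts into lists of 'name=value' strings."""
    return [[f"{name}={value}" for name, value in sorted(feats.items())]
            for feats in token_feature_dicts]


def train_crf(feature_seqs, label_seqs, model_path):
    """Train a linear-chain CRF with L-BFGS and write it to model_path."""
    import pycrfsuite  # lazy import; needs the python-crfsuite package

    trainer = pycrfsuite.Trainer(algorithm="lbfgs", verbose=False)
    for xseq, yseq in zip(feature_seqs, label_seqs):
        trainer.append(xseq, yseq)  # one sentence (tweet) per call
    # Illustrative values only, not taken from the paper.
    trainer.set_params({"c1": 0.1, "c2": 0.01, "max_iterations": 100})
    trainer.train(model_path)


def tag_sequence(model_path, feature_seq):
    """Label one sentence with a previously trained model."""
    import pycrfsuite

    tagger = pycrfsuite.Tagger()
    tagger.open(model_path)
    try:
        return tagger.tag(feature_seq)
    finally:
        tagger.close()
```

Under this sketch, the same two functions would serve both stages: once with 'B-mention'/'I-mention'/'O' labels and once with the typed BIO labels.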

Features from external sources
The following features were extracted from other external resources:

Language identification
The language identification data from the previous code-switching workshop (Diab et al., 2016) was collected and converted into 'lang1', 'lang2' and 'other' (with 'other' grouping the labels 'mixed', 'ne', 'fw' and 'unknown'). If any token of the 'other' categories was followed by 'lang1', it was assigned to 'lang1'. If the token was followed by 'lang2', it was assigned to 'lang2'. A model described by Sikdar and Gambäck (2016) was built using the converted language identification data and applied to the current shared task's (Aguilar et al., 2018) training and development sets to get language information ('lang1', 'lang2' and 'other') for each token. This language information was then used as a feature for named entity identification in the current shared task.
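The label conversion described above can be sketched as follows. The function name is hypothetical, and two points are assumptions: every label other than 'lang1' and 'lang2' is treated as 'other', and "followed by" is read as referring to the immediately following token in the same tweet.

```python
def convert_language_labels(labels):
    """Map fine-grained language labels to 'lang1', 'lang2' or 'other'."""
    # Step 1: group everything that is not lang1/lang2 under 'other'.
    coarse = [lab if lab in ("lang1", "lang2") else "other" for lab in labels]

    # Step 2: an 'other' token takes the language of the next token,
    # if that token is lang1 or lang2 (assumed reading of "followed by").
    converted = []
    for i, label in enumerate(coarse):
        following = coarse[i + 1] if i + 1 < len(coarse) else None
        if label == "other" and following in ("lang1", "lang2"):
            converted.append(following)
        else:
            converted.append(label)
    return converted


print(convert_language_labels(["lang1", "ne", "lang2", "other", "fw"]))
```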

Named entity token identification
Only the tweets containing named entities were extracted from the data of the previous code-switching workshop. A CRF-based model was built on these tweets with different features (local context, suffix, prefix, all-upper-case, starts-with-upper-case, and hash symbol) and applied to the current shared task's training, development and test data to get named entity information for each token.

Part-of-speech information
The Stanford tagger was used to extract part-of-speech (POS) information for training, development and test data. First, the English version of the Stanford tagger was applied to get English POS tags, and then the Spanish version of the tagger was applied. For tokens belonging to 'lang1' or 'other', the English POS tag was considered. For tokens belonging to 'lang2', the Spanish POS tag was picked. The POS tag of a word together with the POS tags of its two preceding and two following tokens (i.e., a -2 to +2 window) were used as features.
In addition, the first two characters of the current word's POS tag and those of the previous and next two words' POS tags (-2 to +2 tokens) were used as features.
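The tag selection and window features described in this subsection can be sketched as follows. The function names, the feature keys and the 'PAD' placeholder for positions outside the sentence are assumptions; the example tags are hypothetical tagger outputs.

```python
def select_pos(lang, english_tag, spanish_tag):
    """Use the Spanish tag for 'lang2' tokens, the English tag otherwise."""
    return spanish_tag if lang == "lang2" else english_tag


def pos_window_features(pos_tags, i, window=2):
    """Full POS tags and their two-character prefixes in a -2..+2 window."""
    feats = {}
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(pos_tags):
            feats[f"pos[{offset}]"] = pos_tags[j]
            feats[f"pos2[{offset}]"] = pos_tags[j][:2]  # coarse tag
        else:
            feats[f"pos[{offset}]"] = "PAD"   # outside the sentence
            feats[f"pos2[{offset}]"] = "PAD"
    return feats


langs = ["lang1", "lang2", "other"]
english = ["NNP", "NN", "."]
spanish = ["np00000", "vmip000", "fp"]
tags = [select_pos(l, e, s) for l, e, s in zip(langs, english, spanish)]
print(pos_window_features(tags, 1))
```

The two-character prefix acts as a coarser tag, which may help when the full English and Spanish tag sets are sparse.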

Stem
The stem of each token was identified using the Stanford parser.

Noisy data named entity recognizer
The named entities of the current workshop's datasets were identified using the model for named entity recognition in noisy user generated texts described by Sikdar and Gambäck (2017).

Features from training data
The following features were extracted from the training data.
• word itself: the current word.
• word in lower case: all alphabetic characters in the word converted to lower-case.
• local context of word in lower-case (with a -2 to +2 window, i.e., from two preceding to two following tokens).
• all-upper-case: binary feature checking whether the current token only has uppercase letters or not.
• starts-with-upper-case: binary feature checking whether the current token starts with a capital letter or not.
• word-length: binary feature set if the length of a word is greater than a threshold (> 5).
• suffix: n-grams of the last 1, 2 or 3 characters.
• prefix characters: n-grams of the first 1, 2 or 3 characters.
• is-digit: binary feature checking whether the current word contains any digit or not.
• two-digit: binary feature set if the current word contains two digits.
• is-alphanumeric: current word contains both digits and letters.
• is-special-characters: binary feature set if the current word contains either '#' or '@'.
• is-stop-word: the current word is on NLTK's stop word list.
• most-frequent-word: after removing all stop words, a list was prepared based on high frequency of words (1000 words from the training data [...]
• [...] upper-case letters replaced with 'A', all digits replaced with '0', and all other characters left unaltered.
• Pair-wise-mutual-information-score: PMI calculated based on the number of times the current word belongs to each NE category divided by the word's total number of occurrences in training data.
• beginning-of-the-word: binary feature checking whether the current token is at the beginning of the sentence or not.
• ending-of-the-word: binary feature checking whether the current token is at the end of the sentence or not.
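Several of the features listed above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the feature keys, the 'PAD' placeholder and the function names are hypothetical, 'two-digit' is read as "exactly two digits", and the stop-word and most-frequent-word lists are left out.

```python
from collections import Counter, defaultdict


def token_features(tokens, i):
    """Orthographic and local-context features for tokens[i]."""
    word = tokens[i]
    n_digits = sum(ch.isdigit() for ch in word)
    feats = {
        "word": word,
        "lower": word.lower(),
        "all_upper": word.isalpha() and word.isupper(),
        "starts_upper": word[:1].isupper(),
        "long_word": len(word) > 5,
        "has_digit": n_digits > 0,
        "two_digits": n_digits == 2,  # assumed: exactly two digits
        "alphanumeric": n_digits > 0 and any(ch.isalpha() for ch in word),
        "has_special": "#" in word or "@" in word,
        "sentence_start": i == 0,
        "sentence_end": i == len(tokens) - 1,
    }
    for n in (1, 2, 3):
        feats[f"prefix{n}"] = word[:n]
        feats[f"suffix{n}"] = word[-n:]
    for offset in (-2, -1, 1, 2):  # lower-cased context window
        j = i + offset
        inside = 0 <= j < len(tokens)
        feats[f"lower[{offset}]"] = tokens[j].lower() if inside else "PAD"
    return feats


def category_ratios(tagged_tokens):
    """Per word: count(word, category) / count(word) over (word, category) pairs."""
    word_counts = Counter()
    pair_counts = defaultdict(Counter)
    for word, category in tagged_tokens:
        word_counts[word] += 1
        pair_counts[word][category] += 1
    return {word: {cat: n / word_counts[word] for cat, n in cats.items()}
            for word, cats in pair_counts.items()}


print(token_features(["Visit", "#Madrid", "in", "2018"], 1))
```

The `category_ratios` helper follows the ratio given in the last-but-two bullet (category count divided by total occurrences of the word), which is a relative frequency rather than a logarithmic mutual-information score.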
To identify the mentions, the above features were used together. To identify named entities, the predicted mentions along with contexts consisting of the previous two and the next two tokens were used as features, in addition to the other features described in subsections 3.1 and 3.2.
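The second-stage feature construction can be sketched as follows: the mention labels predicted in the first stage are added, with a window of two tokens on either side, to each token's existing feature dictionary. The function name, feature keys and 'PAD' placeholder are assumptions.

```python
def add_mention_context(base_features, predicted_mentions, window=2):
    """Extend per-token feature dicts with predicted mention labels.

    base_features: list of dicts, one per token.
    predicted_mentions: first-stage output, e.g. 'B-mention', 'I-mention', 'O'.
    """
    n = len(predicted_mentions)
    extended = []
    for i, feats in enumerate(base_features):
        merged = dict(feats)  # copy so the first-stage features stay untouched
        for offset in range(-window, window + 1):
            j = i + offset
            label = predicted_mentions[j] if 0 <= j < n else "PAD"
            merged[f"mention[{offset}]"] = label
        extended.append(merged)
    return extended


base = [{"lower": "hola"}, {"lower": "juan"}, {"lower": "perez"}]
mentions = ["O", "B-mention", "I-mention"]
print(add_mention_context(base, mentions)[1])
```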

Results
The supervised learning approach was applied to identify mentions. Identified mentions were taken as features along with the other features mentioned in Section 3 to recognize named entities. The classifiers were trained on the training data and tested on the development data. 5-fold cross-validation (CV) was applied to the training data. The mention identification results are shown in Table 2. The average precision, recall and F1-score values of 5-fold CV on the training data were 80.64%, 71.82% and 75.95%, respectively. The F1-score on the development data was 62.00% due to a significant drop in recall.
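As a worked check on the figures above, the F1-score is the harmonic mean of precision and recall. Applying it to the averaged precision and recall gives roughly 75.97, slightly different from the reported 75.95; this is presumably because the reported value averages the per-fold F1-scores, which need not equal the F1 of the averaged precision and recall. The function name is hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units in and out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Averaged 5-fold CV precision and recall for mention identification.
print(round(f1_score(80.64, 71.82), 2))  # 75.97
```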
The system was applied to named entity recognition and results are shown in Table 3. The average F1-score of 5-fold cross-validation was 59.19%. When tested on the development data, the system achieved an F1-score of 41.70%.
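The 5-fold protocol can be sketched with a simple index split. This is an assumption about the mechanics only: the paper does not say whether the data was shuffled or how the folds were drawn, and the function name is hypothetical. The example size of 12 is illustrative.

```python
def k_fold_splits(n_items, k=5):
    """Yield (train_indices, held_out_indices) for k contiguous folds."""
    indices = list(range(n_items))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    sizes = [n_items // k + (1 if fold < n_items % k else 0) for fold in range(k)]
    start = 0
    for size in sizes:
        held_out = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, held_out
        start += size


for train, held_out in k_fold_splits(12, k=5):
    print(len(train), held_out)
```

Each fold would be scored separately, and the per-fold scores averaged to obtain figures such as those reported above.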
The system was then applied to the unseen test data and achieved an F1-score of 59.25%, which is similar to the 5-fold CV F1-score.
Comparing our system ('Flytxt') to the other systems participating in the shared task, Table 4 reports the results and shows that the system secured fifth position and achieved clearly better scores than the baseline system ('Baseline').

Error Analysis
When analyzing the output on the development data for named entity recognition, it is clear that many of the named entities are not identified at all by the system. This might be due to the word itself and/or some of its context words not occurring in the training data.
Furthermore, some named entities are misclassified into other categories, plausibly because those words occur in both of the named entity categories involved. The confusion matrix for named entity recognition is reported in Table 5, for each of the nine classes ('EVENT', 'GROUP', 'LOC', 'ORG', 'OTHER', 'PER', 'PROD', 'TIME', 'TITLE'). The matrix was built using relaxed match, with the 'B-' and 'I-' distinctions ignored for each named entity class.
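A relaxed-match confusion count of this kind can be sketched as follows. The sketch counts at token level and keeps the 'O' label as its own class; both are assumptions, since the paper does not state how the matrix cells were counted. The function names are hypothetical.

```python
from collections import Counter


def strip_bio(label):
    """Drop the 'B-'/'I-' prefix so that only the entity class remains."""
    return label[2:] if label.startswith(("B-", "I-")) else label


def confusion_counts(gold, predicted):
    """Count (gold_class, predicted_class) pairs, ignoring B-/I- distinctions."""
    counts = Counter()
    for g, p in zip(gold, predicted):
        counts[(strip_bio(g), strip_bio(p))] += 1
    return counts


gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "B-PER", "O", "B-ORG"]
print(confusion_counts(gold, pred))
```

In the example, the second token counts as a correct 'PER' under relaxed match even though 'I-PER' was predicted as 'B-PER'.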

Conclusion
This paper proposed a Conditional Random Fields based approach to identifying and classifying named entities in code-switched data. Compared to the baseline, the proposed system achieved better results.
To investigate the effectiveness of the external features, a feature ablation study should be the next step. Most of the features have been extracted directly from training data, but the features could have been further optimized using grid search and evolutionary approaches.
As an alternative to the feature-based classifier, deep learning-based approaches such as LSTM (Long Short-Term Memory), stack-based LSTM and CNN (Convolutional Neural Network) can be explored to classify the proper names into the nine categories.