Multilingual Named Entity Recognition on Spanish-English Code-switched Tweets using Support Vector Machines

This paper describes our system submission for the ACL 2018 shared task on named entity recognition (NER) in code-switched Twitter data. Our best result (F1 = 53.65) was obtained using a Support Vector Machine (SVM) with 14 features combined with rule-based post-processing.


Introduction
Named Entity Recognition (NER) is a part of information extraction and refers to the automatic identification of named entities in text. The ACL 2018 shared task invited participants to extract and classify the following named entities in code-switched data obtained from Twitter: person, location, organization, group, title, product, event, time, and other (Aguilar et al., 2018). The Tweets are either Spanish-English or Modern Standard Arabic-Egyptian, and participants were free to participate in either language pair. This paper describes our system for the Spanish-English NER task.
This particular NER task is challenging for two reasons. Firstly, NER has proved to be more difficult for Tweets than for longer text, as accuracy in NER ranges from 85-90% on longer texts compared to 30-50% on Tweets (Derczynski et al., 2015). One of the reasons for this difference is that Tweets contain non-standard spelling, unusual punctuation, and unreliable capitalization. Fromheide et al. (2014) also point out that another difficulty stems from the rapidly changing topics and linguistic conventions on Twitter. The 2015 and 2016 shared tasks for NER on Noisy User-generated Text (W-NUT) reported F1 scores between 16.47 and 52.41 for identifying 10 different NE categories (Baldwin et al., 2015; Strauss et al., 2016). NER methods range from bidirectional long short-term memory (LSTM) networks (Limsopatham and Collier, 2016) and Conditional Random Fields (CRF) (Toh et al., 2015) to Named Entity Linking (Yamada et al., 2015). Secondly, the Tweets in this task contain both English and Spanish named entities. Both languages need to be taken into account in order to accurately identify the NEs in this data.

Data sets
The organizers provided three different Spanish-English data sets: a training set, a development set, and a test set. The data consists of multilingual Spanish-English Tweets and contains NEs in both languages. Table 1 provides an overview of the data and the total number of NEs available in each of the sets (Aguilar et al., 2018). The gold standard for the test set was not distributed and we are therefore not aware of the distribution of NEs in the test set.

Table 1: Overview of the data sets.
Data set      Tweets    Tokens     NEs
Train         50,757    616,069    12,366
Development   832       9,583      152
Test          15,634    183,011    -

System description
We used scikit-learn 0.19 (Pedregosa et al., 2011) to train and test five different types of classifiers using eight-fold cross-validation:
• Support Vector Machine (SVM) (Chang and Lin, 2011)
• Decision Trees (DT)
• K-nearest Neighbors (KNN)
• AdaBoost (Ada) (Freund and Schapire, 1995)
• Random Forest (RF) (Breiman, 2001)
We trained the classifiers with training corpus sizes of 80,000, 120,000, 200,000, 300,000 and 550,000 tokens, and we reserved 10% of each size for testing to avoid overfitting on the training data. The best classifier is the Support Vector Machine using the default scikit-learn parameters and a Radial Basis Function (RBF) kernel, which is defined as
$K(x, x') = \exp(-\gamma \lVert x - x' \rVert^2)$
where $x$ and $x'$ are two feature vectors and $\gamma > 0$ is the kernel parameter. The results are obtained using the pre- and post-processing steps that are described in further detail in Sections 3.1 (Pre-processing) and 3.3 (Post-processing).
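As a small illustration of the RBF kernel mentioned above, the following sketch computes the standard form exp(-gamma * ||x - x'||^2) for two feature vectors in plain Python. The function name, the example vectors, and the choice gamma = 1 / (number of features) are illustrative assumptions and are not taken from the system itself.

```python
import math


def rbf_kernel(x, y, gamma):
    """Return exp(-gamma * ||x - y||^2) for two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Hypothetical 14-dimensional feature vectors (the values are made up).
x = [1, 0, 3, 0, 1, 0, 0, 1, 0, 0.9, 0.0, 0.5, 0.0, 0.2]
y = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0.1, 0.0, 0.5, 0.0, 0.2]

gamma = 1 / len(x)  # assumption: one over the number of features

print(rbf_kernel(x, x, gamma))  # identical vectors have similarity 1.0
print(rbf_kernel(x, y, gamma))  # differing vectors score between 0 and 1
```

The kernel value shrinks towards 0 as two tokens' feature vectors move apart, which is what lets the SVM draw non-linear decision boundaries over a small feature set.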

Pre-processing
Early experiments showed that reducing the original tag set from two tags per category to one tag per category improved overall classification. In the original scheme, 'B-LOC' refers to either the first word in a multi-word NE or a single-word NE, and 'I-LOC' refers to any token in a multi-word NE that follows the initial 'B-' token. We removed the information about the position of a token within an NE sequence and reduced both tags to 'X-' (for example, 'X-LOC'). This improved classification performance, as it reduced the number of possible tags from 19 to 10 (one per NE category plus the 'O' tag), and it was easily reverted in the post-processing stage.
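The tag reduction and its later reversal can be sketched as follows. This is a minimal illustration and not the original implementation: the function names are hypothetical, and the restoration step assumes that adjacent tokens with the same reduced category belong to one NE, so two directly adjacent NEs of the same category would be merged.

```python
def reduce_tag(tag):
    """Collapse 'B-LOC' and 'I-LOC' to 'X-LOC'; leave 'O' unchanged."""
    if tag == "O":
        return tag
    _, category = tag.split("-", 1)
    return "X-" + category


def restore_tags(tags):
    """Turn a sequence of reduced 'X-' tags back into 'B-'/'I-' tags."""
    restored = []
    previous = "O"
    for tag in tags:
        if tag == "O":
            restored.append("O")
        elif tag == previous:
            # Same category as the previous token: continue the NE.
            restored.append("I-" + tag[2:])
        else:
            # First token of a new NE.
            restored.append("B-" + tag[2:])
        previous = tag
    return restored


original = ["O", "B-LOC", "I-LOC", "O", "B-PER"]
reduced = [reduce_tag(tag) for tag in original]
print(reduced)
print(restore_tags(reduced))
```

Under the stated assumption, the round trip returns the original sequence for this example.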

Feature selection
After testing numerous different features, and discarding ones such as 'preceded by a preposition or possessive pronoun' and 'difference in rank in the frequency dictionaries', we found that the features described below achieved the best result. There are three different types of features: token-centered features (1-5), context-related features (6-9), and rank dictionary lookup features (10-14). To reduce dimensionality and computational workload, we condensed several mutually exclusive boolean features into common functions returning different integer values according to their outcome. For example, for the capitalization feature, rather than returning a boolean outcome for each of the four possible capitalization options (all lowercase, all uppercase with length greater than 3, all uppercase with length less than 3, first letter capitalized), these options are combined into one feature that returns a value in [0, 1, 2, 3]. All rank features are obtained by sorting the corresponding list in order of frequency, with the most frequent item at rank one. We normalized the ranks so that the value stays between 0 and 1, where 0 denotes absence from the ranked lists and the closer the value is to 1, the more highly ranked the token is.
For each feature, the possible outcomes that are inserted into the vector are provided in square brackets, where 'int' denotes the absolute rank, pairs of [0-1] denote boolean outcomes, and lists of numbers correspond to the mutually exclusive outcomes of the function.

Post-processing
The first step in post-processing was to restore all the named entity categories that were simplified during the training of the SVM. All categories were reduced, for example, from 'B-PER' and 'I-PER' to 'X-PER' in the pre-processing step, and were now changed back to the original annotation. The second step in post-processing was to address misclassified tokens within multi-word NEs. For example, in a sequence of 'B-TITLE', 'I-TITLE', 'I-TITLE', if the middle token is misclassified as not being an NE, the tags become 'B-TITLE', 'O', 'B-TITLE' and the entire multi-word NE is therefore misclassified.
To solve this issue, we used a dictionary lookup approach and compared possible multi-word NE sequences to lists of multi-word tokens based on the types of tokens present in the training data. The '-GROUP', '-PERSON' and '-OTHER' lists stem from Wikipedia, and the '-TITLE' list contains titles of video games available from Steam. We found post-processing to be most effective when the multi-word NE consisted of at least two tokens and was no longer than five tokens. We started by checking the longest NEs first, so that, for example, 'Tomb Raider' would not split the longer NE 'Rise of the Tomb Raider'. If a match was found in any of the lists, the tags gained from post-processing replaced those tagged by the SVM.
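The two ideas in the paragraph above, a condensed capitalization feature and a normalized rank feature, could be sketched as below. Several details are assumptions made for illustration only: the mapping of the four options to the integers 0-3 follows the order in which they are listed, all-uppercase tokens of exactly length 3 are grouped with the longer ones, all other tokens (mixed case, digits, symbols) fall back to 0, and the normalization formula (N - rank + 1) / N is one possible way to satisfy the description. The function names and frequency counts are hypothetical.

```python
def capitalization_feature(token):
    """Condense four mutually exclusive capitalization checks into one integer."""
    if token.isupper():
        # Assumption: length 3 is grouped with the longer all-uppercase tokens.
        return 1 if len(token) >= 3 else 2
    if token[:1].isupper():
        return 3  # first letter capitalized
    return 0  # all lowercase, and (by assumption) everything else


def build_rank_lookup(frequencies):
    """Map each token to a normalized rank in (0, 1]; rank one maps to 1.0."""
    ranked = sorted(frequencies, key=frequencies.get, reverse=True)
    size = len(ranked)
    return {token: (size - index) / size for index, token in enumerate(ranked)}


def rank_feature(token, lookup):
    """Return the normalized rank, or 0.0 if the token is not in the list."""
    return lookup.get(token, 0.0)


# Hypothetical frequency list (the counts are made up).
frequencies = {"maria": 50, "juan": 30, "miami": 20, "steam": 10}
lookup = build_rank_lookup(frequencies)

for token in ["Friday", "USA", "RT", "happy"]:
    print(token, capitalization_feature(token))
for token in ["maria", "steam", "unknown"]:
    print(token, rank_feature(token, lookup))
```

Returning a single integer instead of four booleans keeps the feature vector short, at the cost of imposing an arbitrary ordering on the four options.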
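The longest-first dictionary lookup described above can be sketched as follows. This is an illustration under assumptions rather than the original code: the gazetteer is represented as a hypothetical dictionary from lowercased token tuples to a category, matching is case-insensitive, and tokens claimed by a longer match are not revisited by shorter ones.

```python
def apply_gazetteer(tokens, tags, gazetteer, min_len=2, max_len=5):
    """Overwrite tags for token spans found in the gazetteer, longest first."""
    tags = list(tags)
    claimed = [False] * len(tokens)
    lowered = [token.lower() for token in tokens]
    # Check the longest candidate spans first so that a short entry
    # cannot split a longer NE that contains it.
    for length in range(max_len, min_len - 1, -1):
        for start in range(len(tokens) - length + 1):
            end = start + length
            if any(claimed[start:end]):
                continue
            category = gazetteer.get(tuple(lowered[start:end]))
            if category is None:
                continue
            tags[start] = "B-" + category
            for i in range(start + 1, end):
                tags[i] = "I-" + category
            for i in range(start, end):
                claimed[i] = True
    return tags


# Hypothetical gazetteer entries and classifier output.
gazetteer = {
    ("tomb", "raider"): "TITLE",
    ("rise", "of", "the", "tomb", "raider"): "TITLE",
}
tokens = ["Playing", "Rise", "of", "the", "Tomb", "Raider", "tonight"]
svm_tags = ["O", "B-TITLE", "O", "O", "B-TITLE", "I-TITLE", "O"]

print(apply_gazetteer(tokens, svm_tags, gazetteer))
```

In this example the five-token entry is matched before 'Tomb Raider' is considered, so the whole sequence receives one continuous 'B-TITLE', 'I-TITLE', ... annotation.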
The final step addresses specific tokens that are very frequent in many of the categories and are therefore not learned correctly by the classifiers. The Spanish particle 'de' was often classified as an NE, but should have been classified as 'O'. So, if 'de' was tagged as an NE but not preceded by a token with a 'B-' tag, the NE tag was removed. A similar rule applies to the article 'the', which was frequently tagged as 'O' and caused issues for multi-word NEs starting with 'the'. If 'the' is followed by an NE, the tag is switched to match the rest of the tokens in the multi-word sequence.
Table 2 shows the best result obtained with a training size of 550,000 tokens for each of the five classifiers using 8-fold cross-validation, and the results of those five classifiers when applied to the held-out test data. Note that all figures are without post-processing. We only performed post-processing on the SVM to achieve the final result of 53.65. We also tested the classifiers with different sizes of training data. The evaluation of the results per named entity category using the best-performing SVM shows that some of the categories were classified more accurately than others. The best results were obtained for person (66.11), location (58.51) and product (54.11). The most challenging categories were time (5.06) and other (6.67).

Discussion
The large variation in F1 per category, for example in '-TIME', is partly due to the inconsistent annotation of tokens. Table 5 below shows the days of the week present in the training data in both Spanish and English and all the tags associated with these tokens. It shows that all of these tokens are inconsistently annotated, in that they are sometimes annotated as '-TIME' and sometimes as 'O'. For example, in Tweets (1) and (2) below, 'Happy Friday' is used in the same context, but is only tagged as 'B-TIME' in the first Tweet.
(1) Happy Friday Familia!!! #ElvacilonDeLaGatita #battingcage #HappyHour 17 ave NW 7 Calle http://t.co/fbPk0sER05
(2) RT @isazapata : Challenge yourself and move away from your comfort zone! Happy Friday!! http://t.co/OK320hNQ
Some variation in the annotation of tokens such as 'Friday' is to be expected, as the token may not always refer to a day of the week but to a title or another type of named entity. However, the SVM will discard the information from the feature vector if 'Friday' is tagged as 'O' more often than as '-TIME'.
Whilst training the classifiers, we noticed a large amount of variation in the results for the train/test data. To find out exactly how much the results fluctuate, we used the random split function in scikit-learn to split the training data into two chunks, 90% for training and 10% for testing, and retrained the classifier on each new version of the training data. Consequently, the intermediate results for each of the classifiers were always obtained on a different 10% test set. The difference between the best and the worst result can be as large as 0.12 in macro F1 with the same classifier and the same training set size. The results also showed that increasing the number of tokens in the training data improved the performance of the classifiers.
To illustrate why this may be the case, Table 6 below contains the number of overlapping NEs for three different splits for each training size. It shows the large amount of variance in the results depending on how the random split occurred. We counted all types that were tagged as an NE in the training data in total, and compared this to how many of those NEs were in the train and test sets. For example, for the first random split of 30,000 tokens, there were 456 NEs in the training set and 65 NEs in the test set extracted from the training data. A total of 17 NEs in this test set were also present in the training set, meaning that the SVM had already encountered these tokens. Depending on how the data was split, the overlap already encountered in the training data varies from 0.19 to 0.26 for 30,000 tokens. This difference is not as large for 550,000 tokens, where it varies between 0.60 and 0.63.
Table 6: Distribution of NEs in the training data. The overlap refers to the proportion of types that were present in both the training set and the test set extracted from the training data.
Table 6 also illustrates that the number of overlapping tokens increases considerably when the number of tokens in the training data increases. The overlap ranges from 0.19 to 0.63, which means that the more tokens the training set contains, the more likely it is that NEs in the test set are also present in the training data. Therefore, the classifier does not need to classify as many unseen tokens and overall performance increases.
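The overlap figures discussed above are consistent with dividing the number of NE types shared by both sets by the number of NE types in the test set (17 / 65 is roughly 0.26). The sketch below shows that calculation together with a simple seeded 90/10 random split. It uses only the Python standard library instead of scikit-learn, and the function names and the synthetic type identifiers are hypothetical.

```python
import random


def random_split(items, test_fraction=0.1, seed=0):
    """Shuffle items with a fixed seed and split them into train and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    test_size = round(len(shuffled) * test_fraction)
    cut = len(shuffled) - test_size
    return shuffled[:cut], shuffled[cut:]


def type_overlap(train_types, test_types):
    """Share of NE types in the test set that also occur in the training set."""
    train_types = set(train_types)
    test_types = set(test_types)
    if not test_types:
        return 0.0
    return len(train_types & test_types) / len(test_types)


# Synthetic stand-ins reproducing the counts quoted in the text:
# 456 NE types in training, 65 in the test split, 17 of them shared.
train_types = {"ne%d" % i for i in range(456)}
test_types = {"ne%d" % i for i in range(439, 504)}

print(round(type_overlap(train_types, test_types), 2))  # 0.26

# Different seeds produce different 90/10 splits of the same data.
for seed in range(3):
    train_part, test_part = random_split(range(100), seed=seed)
    print(seed, len(train_part), len(test_part))
```

Repeating the split with several seeds and recomputing the overlap each time gives a rough picture of how much of the variation in F1 may stem from the split itself rather than from the classifier.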

Conclusion and Future Work
We presented a named entity recognition system for Spanish-English code-switched Tweets based on a combination of classical machine learning algorithms and post-processing. The best-performing classifier was a Support Vector Machine with an RBF kernel, allowing it to be flexible and less prone to overfitting compared to other classifiers on the held-out test data. We used a small set of features which were selected based on frequency observations in the training data. This provides a classifier with low computational costs and could allow for easy adaptation to other language pairs. Overall, the task of recognizing named entities in multilingual Twitter data proved to be quite challenging. We managed to achieve an overall F1 of 53.65 and thus modestly outperformed the baseline provided by Aguilar et al. (2018). The results show that there is a large amount of variation in classifier performance depending on the specific NEs present in the training and test sets. The classifiers could be improved by incorporating gazetteer resources more specific to Spanish-speaking countries, for example lists of geographical entities similar to the United States census list. Currently, the focus lies on English NEs as there are more resources available. Furthermore, the current approach relies heavily on gazetteering, and the wider context of a token could be taken into account by, for example, determining correlations of certain types of NEs with related verbs in the same Tweet.