The University of Texas System Submission for the Code-Switching Workshop Shared Task 2018

This paper describes the system for the Named Entity Recognition Shared Task of the Third Workshop on Computational Approaches to Linguistic Code-Switching (CALCS) submitted by the Bilingual Annotations Tasks (BATs) research group of the University of Texas. Our system uses several features to train a Conditional Random Field (CRF) model for classifying input words as Named Entities (NEs) using the Inside-Outside-Beginning (IOB) tagging scheme. We participated in the Modern Standard Arabic-Egyptian Arabic (MSA-EGY) and English-Spanish (ENG-SPA) tasks, achieving weighted average F-scores of 65.62 and 54.16 respectively. We also describe the performance of a deep neural network (NN) trained on a subset of the CRF features, which did not surpass CRF performance.


Introduction & Prior Approaches
Named entity recognition (NER) and classification are essential tasks in information extraction (Nadeau and Sekine, 2007). However, NER in texts in which multiple languages are represented is not straightforward because NEs can be language-specific (e.g., Estados Unidos in Spanish vs. United States) or language-neutral but regionally specific (e.g., Los Angeles) or even mixed (e.g., Nueva York in Spanish) (Çetinoğlu, 2016; Guzmán et al., 2016). The task is further complicated by the fact that names of companies, institutions and brands in one language can be common nouns in another (e.g., Toro is a brand name for a U.S. company but toro in Spanish means 'bull'). These challenges confound the already difficult task of working with multilingual texts, which can be considered 'resource scarce' with respect to the availability of NLP tools (Riaz, 2010; Zirikly and Diab, 2015; Sitaram and Black, 2016; Guzmán et al., 2017). But NER in multilingual communication is essential given that multilingualism is common throughout the world, and, for many speakers, language mixing is a shared practice and one that can be prevalent in social media like Twitter (Jurgens et al., 2014; Jamatia et al., 2015, 2016; Vilares et al., 2015).

Data Description
Over 62k Tweets were collected and manually annotated for NEs to be used in this shared task (Aguilar et al., 2018). The annotators labeled each NE using one of ten tags: PERSON, LOCATION, ORGANIZATION, PRODUCT, GROUP, EVENT, TIME, TITLE, OTHER, or NOT-NE. All tokens are tagged using the IOB scheme, while hashtags and @-mentions are ignored; e.g., Louis Vuitton is tagged with B-ORG and I-ORG, but @RideAlong is tagged as O. NEs can occur in all languages and, since this is Twitter data, are frequently misspelled or lack orthographic features that would ease identification. The Tweets were divided into training, development, and test sets and released to the participants of the shared task along with tools for preprocessing the Tweets.
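The tagging convention described above can be illustrated with a minimal sketch. The function name `to_iob` and the span representation (start index, exclusive end index, label) are assumptions made for illustration; they are not taken from the shared task's own tools.

```python
from typing import List, Tuple


def to_iob(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert entity spans (start, end_exclusive, label) into IOB tags.

    Hashtags and @-mentions always receive O, mirroring the annotation
    convention described in the text.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        inside = False  # becomes True once the entity has received its B- tag
        for i in range(start, end):
            if tokens[i].startswith(("#", "@")):
                continue  # hashtags and mentions are ignored
            tags[i] = ("I-" if inside else "B-") + label
            inside = True
    return tags


tokens = ["I", "love", "Louis", "Vuitton", "@RideAlong"]
spans = [(2, 4, "ORG")]
print(to_iob(tokens, spans))
```

Under these assumptions, the two tokens of Louis Vuitton receive B-ORG and I-ORG, and every other token, including the mention, receives O.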

Conditional Random Field
One approach we used for NE recognition in this shared task was conditional random fields (CRFs) (Lafferty et al., 2001), a technique for sequence labeling. Specifically, we used python-crfsuite (Peng and Korobov, 2014), a Python wrapper around CRFsuite (Okazaki, 2007), which is an implementation of CRFs in C/C++. A CRF is trained on sequences of words, each described by its features and paired with its expected classification (in this case the NE tag), and uses the information gained to predict classifications for previously unseen data. For our use of CRFsuite, we set L1 regularization to 1.0 and L2 regularization to 0.001 (the values from the NER example provided by the package) and trained for a total of 150 iterations. All other parameters were left at their default values.
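A minimal sketch of this training setup with python-crfsuite might look as follows. The parameter values come from the text; the function names, the model file name `ner.crfsuite`, and the lazy import (which lets the snippet load even where the package is not installed) are assumptions made for illustration.

```python
from typing import Dict, List, Sequence

# Hyperparameters reported in the text; every other CRFsuite option keeps its default.
CRF_PARAMS: Dict[str, float] = {
    "c1": 1.0,              # L1 regularization coefficient
    "c2": 0.001,            # L2 regularization coefficient
    "max_iterations": 150,  # total number of training iterations
}


def train_crf(X: Sequence[List[dict]], y: Sequence[List[str]],
              model_path: str = "ner.crfsuite") -> None:
    """Train a CRF on per-tweet feature sequences X and IOB tag sequences y."""
    import pycrfsuite  # third-party package, imported lazily

    trainer = pycrfsuite.Trainer(verbose=False)
    trainer.set_params(CRF_PARAMS)
    for xseq, yseq in zip(X, y):
        trainer.append(xseq, yseq)  # one tweet = one training sequence
    trainer.train(model_path)


def tag_tweet(features: List[dict], model_path: str = "ner.crfsuite") -> List[str]:
    """Predict IOB tags for one tweet with a previously trained model."""
    import pycrfsuite

    tagger = pycrfsuite.Tagger()
    tagger.open(model_path)
    return tagger.tag(features)
```

Each element of `X` is assumed to be a list of per-token feature dictionaries for one tweet, with the matching element of `y` holding that tweet's IOB tags.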

Features Used
Several features of each tweet as a whole and of its individual tokens were used as input, some of which rely on external resources to generate. We initially developed our features on the ENG-SPA dataset; interestingly, many of them also performed well on the MSA-EGY data. Inspiration for the features was drawn from papers from the First Workshop on CALCS (Chittaranjan et al., 2014; Lin et al., 2014). The features can be grouped into five categories:
1. Word features: a lowercase copy of the word, its last two characters, its length, whether it is the first word, whether it consists entirely of alphanumeric characters (only for the MSA-EGY dataset), whether it is made up of digits, and whether it contains an emoji.
2. Capitalization: is the word all uppercase or title case?
3. Language tags: off-the-shelf taggers from the Natural Language Toolkit (NLTK) (Bird and Loper, 2004) were used to perform NE and part of speech (POS) tagging on one tweet at a time and the tags were applied to individual tokens.
4. Language detection: in the ENG-SPA dataset only, language detection on entire tweets was done using langdetect, a Python port (Danilák, 2017) of language-detection (Nakatani, 2010), which was originally written in Java. The probabilities of the tweet being English or Spanish, rounded to two decimal places, were used as features. If the tweet was classified as neither English nor Spanish, the probability was set to 0. This happened, for example, with "Quiero un roadtrip asap", which was misclassified as Romanian.
5. Twitter functionality: does the overall tweet contain an @mention or #hashtag? Is this word itself one of the two? Is this a URL?
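The language-detection feature in item 4 could be sketched as follows. The helper names and feature keys are hypothetical, the probabilities in the usage lines are made-up illustrative values, and `detect_langs` is the langdetect call assumed to supply the candidate languages.

```python
from typing import Dict, Iterable, Tuple


def language_probabilities(detected: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Keep only the English and Spanish probabilities, rounded to two decimals.

    A language that was not detected keeps a probability of 0.
    """
    probs = {"en": 0.0, "es": 0.0}
    for lang, prob in detected:
        if lang in probs:
            probs[lang] = round(prob, 2)
    return {"prob_en": probs["en"], "prob_es": probs["es"]}


def detect_tweet(text: str) -> Dict[str, float]:
    """Run langdetect on a whole tweet and derive the two probability features."""
    from langdetect import detect_langs  # third-party package, imported lazily

    return language_probabilities((c.lang, c.prob) for c in detect_langs(text))


# Illustrative inputs only: a tweet detected as Romanian, and a mixed tweet.
print(language_probabilities([("ro", 0.99)]))
print(language_probabilities([("es", 0.714283), ("en", 0.285712)]))
```

Every token of a tweet would then receive the same two values, since detection runs on the tweet as a whole.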
A subset of the features mentioned above was also computed for the previous and next words and used as features to classify the current word: the word in lowercase form, its last two characters, whether it is the first word, whether it is title case or uppercase, whether it is a URL, @mention, or #hashtag, whether it contains an emoji, and its NE and POS tags as assigned by NLTK. We experimented with additional features, and their results are included in section 4. These features include the last three characters of the word, whether it contains a digit (as opposed to being a digit itself), and whether it is made up exclusively of ASCII characters.
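A simplified sketch of how such per-token and context features might be assembled is given below. It covers only features computable with the standard library (emoji detection, the NLTK tags, and the language probabilities are omitted), and the feature names and the URL pattern are assumptions made for illustration.

```python
import re
from typing import Dict, List

# Rough URL test; the exact rule used by the system is not stated in the text.
URL_RE = re.compile(r"^(https?://|www\.)", re.IGNORECASE)

# Features copied over from the previous and next word.
CONTEXT_KEYS = ("lower", "suffix2", "is_first", "is_title", "is_upper",
                "is_url", "is_mention", "is_hashtag")


def basic_features(word: str, position: int) -> Dict[str, object]:
    """Features that depend only on the word and its position in the tweet."""
    return {
        "lower": word.lower(),
        "suffix2": word[-2:],
        "length": len(word),
        "is_first": position == 0,
        "is_digit": word.isdigit(),
        "is_upper": word.isupper(),
        "is_title": word.istitle(),
        "is_mention": word.startswith("@"),
        "is_hashtag": word.startswith("#"),
        "is_url": bool(URL_RE.match(word)),
    }


def token_features(tokens: List[str], i: int) -> Dict[str, object]:
    """Features for token i, including tweet-level flags and neighbor features."""
    feats = basic_features(tokens[i], i)
    feats["tweet_has_mention"] = any(t.startswith("@") for t in tokens)
    feats["tweet_has_hashtag"] = any(t.startswith("#") for t in tokens)
    for offset, prefix in ((-1, "prev"), (1, "next")):
        j = i + offset
        if 0 <= j < len(tokens):  # no neighbor features at the tweet boundary
            neighbor = basic_features(tokens[j], j)
            for key in CONTEXT_KEYS:
                feats[f"{prefix}_{key}"] = neighbor[key]
    return feats


tweet = ["Quiero", "un", "roadtrip", "asap"]
print(token_features(tweet, 0))
```

Calling `token_features` for every index of a tweet yields one feature sequence per tweet, which is the shape a sequence labeler expects.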

Deep / Wide Model
Deep and wide architectures have had recent success in recommendation engines (Cheng et al., 2016); here we adapt them for NER. These architectures embed categorical variables in a vector space, which allows for unseen feature combinations, and use cross-product feature transformations to obtain effective and interpretable features. This combination of cross-product features and dense embeddings allows deep and wide models both to memorize and to generalize from the input data while reducing feature engineering effort.
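To make the idea of a cross-product feature transformation concrete, the following sketch joins the values of several categorical features and hashes the result into a fixed number of buckets. The use of MD5, the bucket count, and the feature names are assumptions for illustration; this is not necessarily how any particular framework implements crossed features.

```python
import hashlib
from typing import Dict, Sequence


def crossed_feature(features: Dict[str, str], keys: Sequence[str],
                    num_buckets: int = 1000) -> int:
    """Map the combination of several categorical values to one bucket id.

    A linear (wide) model can then learn a separate weight per bucket and so
    memorize specific co-occurrences of the crossed values.
    """
    joined = "_X_".join(str(features[k]) for k in keys)
    digest = hashlib.md5(joined.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


token = {"word": "toro", "capitalization": "title", "word_type": "word"}
print(crossed_feature(token, ["word", "capitalization"]))
```

With such a cross, the combination of the word toro with title case gets its own weight, separate from the weights for toro alone or for title case alone, apart from possible hash collisions.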

Training process
The model was trained using TensorFlow, an open-source machine learning framework designed by Google (Abadi et al., 2016). Its DNNLinearCombinedClassifier provides a general-purpose wide and deep learning model for users to train. The wide model is a pre-built linear classifier which attempts to classify each word in a particular tweet based on linear combinations of its feature values.
The deep model used a pre-built neural network to classify the data by letting its features propagate through the network. Using Python, Tweets from the tsv file were first parsed into an internal data model in which the features are computed as properties of the individual words. This data model is then written out as a csv file with each feature listed as a column, which can be conveniently passed to the DNNLinearCombinedClassifier. We used a subset of the CRF features, including the word itself, the capitalization of the word, the word type, and the adjacent words.
The wide portion of the model performs NER tagging through linear combinations of the features. Features were input as base columns to provide information to the activation layer of the neural network. Some features, such as the word, its capitalization, and its type, were crossed as a set so that the model could recognize dependencies among these grouped features. Because it implements a neural network, the deep model greatly increased the training time, by a ratio of roughly 1:20 per iteration. Neither model performed well against the CRF, possibly due to a lack of features; hence the CRF was used in the final submission.
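The conversion from tokens to a feature table might be sketched as follows. The column names, the capitalization categories, and in particular the reading of "word type" as mention, hashtag, digit, or plain word are assumptions made for illustration, since the text does not define them.

```python
import csv
import io
from typing import Dict, List

COLUMNS = ["word", "capitalization", "word_type", "prev_word", "next_word", "label"]


def capitalization(word: str) -> str:
    """Coarse capitalization class of a word."""
    if word.isupper():
        return "upper"
    if word.istitle():
        return "title"
    return "other"


def word_type(word: str) -> str:
    """Hypothetical word-type classes; the original definition is not given."""
    if word.startswith("@"):
        return "mention"
    if word.startswith("#"):
        return "hashtag"
    if word.isdigit():
        return "digit"
    return "word"


def tweet_to_rows(tokens: List[str], labels: List[str]) -> List[Dict[str, str]]:
    """One row per token, with the adjacent words as extra columns."""
    rows = []
    for i, (word, label) in enumerate(zip(tokens, labels)):
        rows.append({
            "word": word,
            "capitalization": capitalization(word),
            "word_type": word_type(word),
            "prev_word": tokens[i - 1] if i > 0 else "",
            "next_word": tokens[i + 1] if i + 1 < len(tokens) else "",
            "label": label,
        })
    return rows


def rows_to_csv(rows: List[Dict[str, str]]) -> str:
    """Serialize the rows as CSV text with one column per feature."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(tweet_to_rows(["Louis", "Vuitton", "rocks"],
                                     ["B-ORG", "I-ORG", "O"]))
print(csv_text)
```

A table of this shape has one categorical column per feature, which is the form that column-based classifiers typically consume.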

CRF Performance
Our submission for the shared task was evaluated using both the harmonic mean F1 and the surface forms F1 metrics (Derczynski et al., 2017) on each dataset. In line with the baseline performance, our system performs better on the MSA-EGY data than on the ENG-SPA data despite the difference in data size. The scores on the two challenges were 65.62 for MSA-EGY and 54.16 for ENG-SPA. After the shared task submission closed, we continued experimenting with different features. The F1-scores (computed using scikit-learn (Pedregosa et al., 2011)) of the CRF trained on the training set and evaluated on the test set using various feature configurations are shown in Table 2. These results differ from those submitted to the competition because they were evaluated on a different data set.
Inclusion or omission of certain features affected the two data sets differently: for example, including the ASCII feature improves scores for ENG-SPA but decreases them for MSA-EGY. The last row (special) shows an attempt to maximize the score by combining successful individual features; while scores do increase, this attempt does not perform as well as expected. Going by F1-score, the submitted configuration excluding POS and NE seems to work best for ENG-SPA, while the submitted configuration with a combination of changes (shown in Table 1) works best for MSA-EGY. Table 1 shows the features that were modified. An asterisk (*) indicates a change relative to the submitted configuration a. Features that remained unchanged throughout are not listed.
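For orientation, the F1-score is the harmonic mean of precision and recall. The sketch below computes it at the entity level from gold and predicted IOB sequences, counting a prediction as correct only when its span and label match exactly. This is an illustrative assumption, not the official scoring script of the shared task; the surface-forms variant would compare sets of unique entity strings instead of positions.

```python
from typing import List, Set, Tuple

Entity = Tuple[int, int, str]  # (start, end_exclusive, label)


def extract_entities(tags: List[str]) -> Set[Entity]:
    """Collect entity spans from an IOB tag sequence."""
    entities: Set[Entity] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        if tag.startswith("I-") and label == tag[2:]:
            continue  # current entity continues
        if label is not None:
            entities.add((start, i, label))
            start, label = None, None
        if tag.startswith(("B-", "I-")):  # a stray I- tag opens a new entity
            start, label = i, tag[2:]
    return entities


def f1_score(gold: List[str], pred: List[str]) -> float:
    """Harmonic mean of entity-level precision and recall."""
    gold_entities = extract_entities(gold)
    pred_entities = extract_entities(pred)
    true_positives = len(gold_entities & pred_entities)
    precision = true_positives / len(pred_entities) if pred_entities else 0.0
    recall = true_positives / len(gold_entities) if gold_entities else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = ["B-ORG", "I-ORG", "O", "B-ORG"]
pred = ["B-ORG", "I-ORG", "O", "O"]
print(f1_score(gold, pred))
```

In this example the prediction finds one of the two gold entities and nothing spurious, so precision is 1, recall is 0.5, and the F1-score is 2/3.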

NN Performance
As shown in Table 3, the F1-score was suboptimal due to low recall. Two different models, one implementing only the wide portion and the other implementing both the deep and wide portions, were trained with features extracted from the data set. Three different variants of the features and their results are displayed in Table 3. Surprisingly, the wide model showed better overall performance than the wide and deep model. This may be because too few features were extracted from the dataset for the deep model to build on. The low recall may have the same cause, which ultimately led us to reject this model.

Conclusion
In this paper, we described the University of Texas BATs research group's submission for the CALCS 2018 Shared Task for NER. We found that some features improved the results of the CRF model on one language combination but not on the other. In both cases, our CRF model outperformed the baseline NER performance. Training an NN on a subset of the CRF features did not improve on the CRF's F1-scores, but further feature engineering on, or a combination of, both models could improve performance.