Named Entity Recognition on Code-Switched Data: Overview of the CALCS 2018 Shared Task

In the third shared task of the Computational Approaches to Linguistic Code-Switching (CALCS) workshop, we focus on Named Entity Recognition (NER) on code-switched social-media data. We divide the shared task into two competitions based on the English-Spanish (ENG-SPA) and Modern Standard Arabic-Egyptian (MSA-EGY) language pairs. We use Twitter data and 9 entity types to establish a new dataset for code-switched NER benchmarks. In addition to the CS phenomenon, the diversity of the entities and the social media challenges make the task considerably hard to process. As a result, the best scores of the competitions are 63.76% and 71.61% for ENG-SPA and MSA-EGY, respectively. We present the scores of 9 participants and discuss the most common challenges among submissions.


Introduction
Code-switching (CS) is a linguistic behavior that occurs in spoken and written language. CS happens when multilingual speakers move back and forth from one language to another in the same discourse. The growing role of social media in the way we communicate has also increased the occurrence of code-switching in informal written language. As a result, there is a pressing demand for more tools and resources that can help to process this phenomenon.
In the previous editions of the Computational Approaches to Linguistic Code-Switching (CALCS) workshop, we focused on providing annotated corpora for language identification (Solorio et al., 2014; Molina et al., 2016). On this occasion, we extend the annotations to the Named Entity Recognition (NER) level. The goal of this shared task is to provide a code-switched NER dataset that can help to benchmark state-of-the-art NER approaches. This will directly impact the performance of higher-level NLP applications where code-switching behavior is commonly found.
ENG-SPA Tweet
Original: @ xoxoBecky lmao ni ganas tengo de llorar , the last movie that made me cry was [Pineapple Express]TITLE me dejo llorando de risa
English: @ xoxoBecky lmao I don't even want to cry , the last movie that made me cry was
We had a total of 9 participants, from which we received 8 submissions on English-Spanish and 5 submissions on Modern Standard Arabic-Egyptian. The best F1-score reported for ENG-SPA was 63.76% by the IIT BHU team (Trivedi et al., 2018), whereas the best for MSA-EGY was 71.61% by the FAIR team (Wang et al., 2018).

Task definition
The task consists of recognizing entities in a relatively short code-switched context. The entity types for this task are person, organization, location, group, title, product, event, time, and other. We describe each entity type in Section 3.1. Since NER is a sequential tagging task, we use the IOB scheme to identify multiple words as a single named entity. The addition of this scheme doubles the number of entity labels in the task, yielding B(eginning) and I(nside) variants of each type. Together with the O(utside) label, this leaves us with 19 possible labels for the classification task.
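The label space described above can be sketched as follows. This is a minimal illustration, not the organizers' code: the uppercase type spellings, the `B-`/`I-` prefix format, and the helper names `build_iob_labels` and `tag_tokens` are assumptions made for the example.

```python
# Entity types as listed in the task definition. The uppercase spellings
# and the "B-"/"I-" prefix format are illustrative assumptions.
ENTITY_TYPES = [
    "PERSON", "ORGANIZATION", "LOCATION", "GROUP", "TITLE",
    "PRODUCT", "EVENT", "TIME", "OTHER",
]


def build_iob_labels(entity_types):
    """Return the IOB label set: B- and I- variants per type, plus O."""
    labels = ["O"]
    for entity_type in entity_types:
        labels.append(f"B-{entity_type}")
        labels.append(f"I-{entity_type}")
    return labels


def tag_tokens(tokens, entities):
    """Assign IOB tags given (start, end_exclusive, type) entity spans."""
    tags = ["O"] * len(tokens)
    for start, end, entity_type in entities:
        tags[start] = f"B-{entity_type}"
        for i in range(start + 1, end):
            tags[i] = f"I-{entity_type}"
    return tags


LABELS = build_iob_labels(ENTITY_TYPES)
```

Nine types with two variants each, plus the O label, give the 19 labels mentioned in the text. A two-word title such as Pineapple Express receives a B-TITLE tag followed by an I-TITLE tag.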
The evaluation of the task uses two versions of the F1-score. The first is the standard F1, and the second is the Surface Form F1-score introduced by Derczynski et al. (2014). The Surface Form F1-score captures the rare and emerging aspects of the entities. We average both metrics to determine the positions in the leaderboard. Additionally, the shared task was conducted on the CodaLab platform, where participants are able to directly evaluate their approaches against the gold data.

Datasets
In this section we provide the definition of our labels, describe the annotation process and show the distribution of the ENG-SPA and MSA-EGY datasets.

Entity instructions
The named entities have been annotated using the instructions below. Note that the definitions of the entity types apply to both language pairs.
• Person: This entity type includes proper names and nicknames that can identify a person uniquely. We ignore cases where a person is referred to by nouns with adjectives that are not necessarily a nickname. Single artists and famous people are treated as person.
• Organization: This entity type includes names of companies, institutions and corporations, i.e., every entity that has employees and takes actions as a whole. If the NE can potentially be any other type, the context should be sufficient to support whether it is organization or not (e.g., Facebook as organization vs. Facebook as the website application).
• Location: This NE refers to physical places that people can visit. It includes cities, countries, addresses, facilities, tourist attractions, etc. This entity type is not to be confused with organization. For instance, when people use organization names to refer to places that can be visited (e.g., restaurants), those entities must be tagged as location.
• Group: This NE includes sports teams, music bands, duets, etc. Group and organization are not to be confused. For example, the Houston Astros as a team (i.e., group) is different from the Houston Astros institution.
• Product: This NE refers to articles that have been manufactured or refined for sale, like devices, medicine, food produced by a company, any well-defined service, website accounts, etc.
• Title: This type includes titles of movies, books, TV shows, songs, etc. Very often, titles can be sentences (e.g., the movie We're the Millers). Titles usually refer to media and must not be confused with the product type.
• Event: This type refers to situations or scenarios that gather people for a specific purpose such as concerts, competitions, conferences, award events, etc. Holidays are not considered events.
• Time: This NE includes months, days of the week, seasons, holidays and dates that happen periodically, which are not events (e.g., Christmas). It excludes hours, minutes, and seconds. 'Yesterday', 'tomorrow', 'week' and 'year' are not tagged as time.
• Other: This type includes any other named entity that does not fit in the previous categories. This may include nationalities, languages, music genres, etc.
The motivation behind these entity types partly lies in the different contexts in which they appear. For instance, when an organization can be lexically confused with a product, the context should resolve the ambiguity. Additionally, we tried to include entity types that have an impact on higher-level NLP applications under similar social media scenarios.

ENG-SPA
Data annotation: We use the English-Spanish language identification dataset introduced in the first CALCS shared task (Solorio et al., 2014). We build upon this dataset to generate the entity labels. To annotate the data, we designed a CrowdFlower job from scratch. The interface of the job is shown in Figure 2. The job allows annotators to select one or many words for a single NE. When the annotators select a word, the tool suggests adding the surrounding words to the current selection.
When the selection of a whole entity is done, the annotators can add the entity to the second step, where the type is determined. The annotators repeat this process until no more named entities can be identified in the tweet. The output of our customized job contains the entity type of one or multiple words that identify an NE according to the criteria of the annotators. The annotators are required to know both English and Spanish, and the job is constrained to reach an accuracy of at least 80%. We also required 3 annotators per tweet. Additionally, the job was launched in geographic locations where both English and Spanish are reasonably common. Some of these places were the USA, Mexico, Central America, Puerto Rico, Colombia, Venezuela, Chile, Uruguay, Paraguay and Spain.
After getting the output data from CrowdFlower, we reviewed the results to correct any possible mistakes.
Data distribution: The entity types along with their distribution are listed in Table 1. The training and test sets follow a very similar data distribution, which can also help to adapt the learning from training to testing.

MSA-EGY
Validating old tweets: For the Modern Standard Arabic-Egyptian Arabic Dialect (MSA-EGY) language pair, we combined the training, development, and test sets that we used in the EMNLP 2016 CS Shared Task (Molina et al., 2016) to create the new training corpus for the NER Shared Task. The data was harvested from Twitter. We applied a number of quality and validation checks to ensure the quality of the old data. To this end, we retrieved all old tweets again using the new version of the Arabic Tweets Token Assigner, which is made available through the Shared Task website (https://code-switching.github.io/2018/). One of the main reasons for the re-crawling step is to eliminate the tweets that have been deleted, or that belong to users whose accounts have been suspended by Twitter. The other reason is that some tweets may cause encoding issues when they are retrieved using the crawler script. All such tweets were removed. After performing the validation checks, we accepted and published 11,224 tweets (10,102 tweets for the training set, and 1,122 tweets for the development set).
Data creation and annotation: Since we combined the test set used in the EMNLP-2016 CS Shared Task (Molina et al., 2016) with the dataset used in the EMNLP-2014 CS Shared Task (Solorio et al., 2014) to form the new training and development sets, we needed to crawl and annotate a new test set for our new Shared Task. We used the Tweepy library to harvest the timelines of 12 Egyptian public figures. We applied the same filtering criteria that we used when crawling and building the test set of the 2016 CS shared task (Molina et al., 2016). We divided the old combined tweets into training and development sets, which correspond to 80% and 10% of the data, respectively. Thus, we needed ∼1,110 tweets for the new test set, which represents the remaining 10%. As in the previous Shared Task, we preferred public figures whose tweets contain more code-switching points. Therefore, we used the Automatic Identification of Dialectal Arabic (AIDA2) tool (Al-Badrashiny et al., 2015) to perform token-level language identification for the MSA and EGY tokens in context. Public figures with more than 35% of code-switching points in their tweets were considered.
Figure 3: MSA-EGY Data Annotation
The annotation work of the MSA-EGY dataset was done in-lab by two trained Egyptian native speakers. Our annotation team followed the Named Entity Annotation Guidelines for MSA-EGY, which are made available through the Shared Task website. In the two previous editions of the CS Shared Task (Solorio et al., 2014; Molina et al., 2016), we used a Named Entity ("ne") tag. The "ne" tag was defined as a word or multi-word expression that represents the name of a unique entity such as people's names, countries and places, organizations, companies, websites, etc. The AIDA2 tool (Al-Badrashiny et al., 2015) was used to assign initial automatic tags for highly confident data categories (e.g., URL, Punctuation, Number) in addition to named entities. Then, we extracted and prepared all the tweets that contained "ne" for annotation. As mentioned earlier, the IOB scheme is used as an annotation scheme to identify multiple words as a single named entity. All the URL, Punctuation and Number tags are deterministically converted to the "O" tag, while the tweets that include "ne" tags were given to our in-lab annotators for validation and re-annotation if needed.
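The deterministic conversion step described above could look roughly like the sketch below. The tag spellings (`url`, `punct`, `num`, `ne`) and the function name `convert_initial_tags` are hypothetical; the text only states that URL, Punctuation and Number tags become "O" and that tweets containing "ne" go to the annotators.

```python
# Tag names are illustrative assumptions, not the actual AIDA2 output format.
AUTO_O_TAGS = {"url", "punct", "num"}


def convert_initial_tags(tagged_tweet):
    """Map high-confidence automatic tags to 'O' and flag tweets with 'ne'.

    tagged_tweet is a list of (token, tag) pairs. Returns the converted
    pairs and a flag telling whether the tweet needs manual annotation.
    """
    converted = []
    needs_annotation = False
    for token, tag in tagged_tweet:
        if tag in AUTO_O_TAGS:
            converted.append((token, "O"))
        else:
            if tag == "ne":
                needs_annotation = True
            # Other tags are left untouched in this sketch.
            converted.append((token, tag))
    return converted, needs_annotation
```

Only the tweets for which the flag is set would be routed to the human annotators, who then replace each "ne" tag with a typed IOB label.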
Quality checks and data distribution: We computed the Inter-Annotator Agreement (IAA) on 10% of the dataset to validate the performance and agreement among annotators. One of our annotators is a specialist linguist who carried out adjudication and revisions of accuracy measurements. We reached a stable IAA of over 92% pairwise agreement. The workflow of the annotation process for MSA-EGY is shown in Figure 3.
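As a rough illustration of a pairwise agreement figure, the sketch below computes the fraction of tokens on which two annotators assign the same label, averaged over all annotator pairs. The text does not specify how the agreement was computed, so token-level observed agreement and the function name `pairwise_agreement` are assumptions.

```python
from itertools import combinations


def pairwise_agreement(annotations):
    """Average fraction of tokens on which each pair of annotators agrees.

    annotations is a list with one label sequence per annotator, all
    covering the same tokens in the same order.
    """
    scores = []
    for first, second in combinations(annotations, 2):
        if len(first) != len(second):
            raise ValueError("annotators must label the same tokens")
        matches = sum(1 for a, b in zip(first, second) if a == b)
        scores.append(matches / len(first))
    return sum(scores) / len(scores)
```

With two annotators there is a single pair, so the result is simply the share of identically labeled tokens. Note that this raw measure does not correct for chance agreement, which is high when most tokens are "O".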
The MSA-EGY dataset contains a total of 12,334 tweets. It is divided into training, development, and test sets (10,102, 1,122, and 1,110 tweets, respectively). Table 1 shows that the total number of NE tokens in the training set is 23,093, which means that NE tokens represent 11.3% of the total number of tokens. Similarly, the percentages of NE tokens in the development and test sets are 7.5% and 11.9%, respectively. As mentioned earlier, the MSA-EGY tweets were harvested from the timelines of 12 Egyptian political public figures. Generally, politicians tend to use NEs more often when they write their tweets. This explains why the percentage of NE tokens in the MSA-EGY dataset is higher than in the ENG-SPA dataset.

Approaches
In this section, we briefly describe the systems of the participants and discuss their results as well as the final scores.
• IIT BHU (Trivedi et al., 2018). They proposed a "new architecture based on gating of character- and word-based representation of a token". They captured the character and the word representations using a CNN and a bidirectional LSTM, respectively. They also used multi-task learning on the output layer and transferred the learning to a CRF classifier following Aguilar et al. (2017). Moreover, they fed a gazetteer representation to their model.
• FAIR (Wang et al., 2018). They proposed a joint bidirectional LSTM-CRF network that uses attention at the embedding layer. They also preprocessed the data before feeding the network.
• Linguists (Jain et al., 2018). They used Conditional Random Fields (CRF) with many handcrafted features. Their focus was primarily on English-Spanish data.
• Flytxt (Sikdar et al., 2018). This team also employed a CRF. They fed the CRF with features from both external and internal resources. Additionally, they incorporated the language identification labels of the datasets from the previous editions of this workshop.
• semantic (Geetha et al., 2018). They jointly trained a bidirectional LSTM with a CRF on the output layer.
• BATs (Janke et al., 2018). They used a CRF with multiple features. Some of those features were also used for a neural network, but they got better results with the CRF approach.
• Fraunhofer FKIE (Claeser et al., 2018). They used a Support Vector Machine (SVM) classifier with a radial basis function kernel. They handcrafted many features and also included gazetteers.
• Baseline. We used a simple bidirectional LSTM network with randomly initialized embedding vectors of 200 dimensions. We also used dropout operations on each direction of the BLSTM component.

Evaluation
The evaluation of the shared task was conducted through CodaLab, where the participants were able to obtain immediate feedback on their submissions. The metrics used for the evaluation phase were the standard harmonic mean F1-score and the Surface Form F1 variation proposed by Derczynski et al. (2014). Additionally, to have a single leaderboard per language pair, we unified both metrics by averaging them. The average values are the ones reported in Table 3. As stated by Derczynski et al. (2014), the idea of the Surface Form F1-score is to capture the novel and emerging aspects that are usually encountered in social media data. Those aspects describe a fast-moving language that constantly produces new entities, which challenges the recall of state-of-the-art models more than their precision.
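The averaging of the two metrics can be sketched as below. This is not the official scorer: it assumes that the standard F1 is computed over entity mentions (position and type) and that the Surface Form F1 is computed over unique (surface string, type) pairs, which is one common reading of the metric. The function names are hypothetical, and the sketch handles a single token sequence; a corpus-level version would also need a sentence index in each mention.

```python
def extract_entities(tokens, tags):
    """Collect (start, end_exclusive, type, surface) spans from IOB tags."""
    entities = []
    start, current = None, None
    # A trailing "O" sentinel flushes an entity that ends the sequence.
    for i, tag in enumerate(list(tags) + ["O"]):
        inside = tag.startswith("I-") and current == tag[2:]
        if start is not None and not inside:
            entities.append((start, i, current, " ".join(tokens[start:i])))
            start, current = None, None
        if tag.startswith("B-"):
            start, current = i, tag[2:]
    return entities


def f1(gold, predicted):
    """Harmonic mean of precision and recall over two collections."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def evaluate(tokens, gold_tags, pred_tags):
    """Return (standard F1, surface-form F1, their average)."""
    gold = extract_entities(tokens, gold_tags)
    pred = extract_entities(tokens, pred_tags)
    standard = f1(gold, pred)
    # Surface forms ignore position, so repeated mentions count once.
    surface = f1({(s, t) for _, _, t, s in gold},
                 {(s, t) for _, _, t, s in pred})
    return standard, surface, (standard + surface) / 2
```

Under these assumptions, a system that finds only one of two identical mentions of the same title is penalized by the standard F1 but not by the surface-form variant, which is why the latter puts more weight on rare and novel entities than on frequent ones.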

Results and Error analysis
Although all the scores reported by the participants outperformed the baselines in both the ENG-SPA and MSA-EGY language pairs, the results are arguably low considering that current state-of-the-art systems achieve an F1-score of around 91.2% on well-formatted text (Lample et al., 2016; Ma and Hovy, 2016; Liu et al., 2017). As mentioned before, the best performing systems reached 63.76% (Trivedi et al., 2018) and 71.61% (Wang et al., 2018) for ENG-SPA and MSA-EGY, respectively. These low scores are consistent with the challenges that come along with social media data and the addition of more heterogeneous entity types (Ritter et al., 2011; Augenstein et al., 2017; Derczynski et al., 2014; Aguilar et al., 2018).
Most of the MSA-EGY tweets are related to politics because they were harvested from the timelines of Egyptian public figures.
According to the results of the participants in the ENG-SPA shared task, the three most challenging entity types were event, title, and time. It is worth noting that these three classes are among the least frequent types in the dataset (see Table 1), which suggests that having more data samples would produce better results. However, in the case of title, there are 1,980 samples against 1,381 samples of organization, and the performance is significantly better for the latter (19% vs. 35% F1-score). Additionally, looking at Table 4, the entity Orange is the New Black was not recognized by participants as a title. This is an example of what we refer to as a heterogeneous entity type, meaning that the entity instances are flexible in format and can even form independent sentences (in contrast, a homogeneous type is person). The entities Love Man (title), Billboard 2014 (event), and show de shamu (event) follow the same pattern, and they were rarely identified by participants.
Unlike English and Spanish, which can be considered two distinct languages, Modern Standard Arabic and Egyptian Arabic are closely related, which makes the task of identifying NE tokens more challenging. This is mainly because Modern Standard Arabic and Egyptian are close variants of one another and hence share a considerable number of lexical items. Some of the challenges faced by the participants include words that still have punctuation attached to them (e.g., "(mSr", "(Egypt"). To mitigate these issues, some participants preprocessed these cases by, for example, removing any leading and trailing punctuation from those tokens. Other participants normalized these cases by unifying all the attached punctuation, while the remaining participants decided to keep them and let their models learn them. Table 5 and the following examples show some challenges faced by the submitted systems:
• Clitic attachment can obscure tokens, e.g.

Related work
Before the CALCS workshop series, code-switching behavior was studied from different perspectives and for many languages (Toribio, 2001; Solorio and Liu, 2008a,b; Piergallini et al., 2016; AlGhamdi et al., 2016). Most of these studies focused on either exploring this phenomenon or solving core code-switching tasks from the NLP pipeline. More recently, researchers have been considering the sentiment analysis task in code-switching settings (Lee and Wang, 2015; Vilares et al., 2015). However, the lack of resources at the core level of the NLP pipeline greatly reduces the chances of improving higher-level applications. In this line, we aim at providing two datasets for named entity recognition benchmarks on the English-Spanish and Modern Standard Arabic-Egyptian language pairs. It is worth noting that there are some contributions of CS corpora, such as a collection of Turkish-German CS tweets (Calzolari et al., 2016), a large collection of Modern Standard Arabic and Egyptian Dialectal Arabic CS data (Diab et al., 2016) and a collection of sentiment-annotated Spanish-English tweets (Vilares et al., 2016).
Named entity recognition has been extensively studied over the years (Sang and Meulder, 2003). More recently, however, the focus has drastically moved to social media data due to the large role that social networks play in our daily communication (Ritter et al., 2011; Augenstein et al., 2017). The workshop on Noisy User-generated Text (W-NUT) has been a great effort towards the study of named entity recognition on noisy data. In 2016, the organizers focused on named entities from different topics to evaluate the adaptation of models from one topic to another (Strauss et al., 2016). In 2017, the organizers introduced the Surface Form F1-score metric and collected data from multiple social media platforms (Derczynski et al., 2014). The challenge lies not only in the entity types and the social media noise but also in the distribution of the datasets and their different data domain patterns.

Conclusion
We presented the setup and results of the 3rd shared task of the Computational Approaches to Linguistic Code-Switching workshop. We introduced a named entity recognition dataset focused on code-switched social media text for two language pairs: English-Spanish and Modern Standard Arabic-Egyptian. We received submissions from nine teams; eight of them submitted to ENG-SPA and six to MSA-EGY. Similar to the previous sequence tagging tasks of our workshop, the predominant component among the approaches was Conditional Random Fields. Additionally, the combination of the CRF with a bidirectional LSTM (with some variations) yielded the best results among participants. The best F1-score for ENG-SPA was 63.7628% and for MSA-EGY was 71.6154%. Compared to monolingual formal text (i.e., newswire), the reported scores are significantly lower due to the code-switching phenomenon as well as the noise of the social media environment. This serves as strong evidence that we need more robust approaches that can detect and process named entities in such challenging conditions.
Figure 1: Examples of the CALCS 2018 dataset. In the English-Spanish data, the highlighted words represent a movie, tagged as TITLE, while in the MSA-EGY data, the bolded words represent government agencies, tagged as ORGANIZATION.

Figure 2: The CrowdFlower interface that we developed to annotate the ENG-SPA dataset. The green-highlighted words are the entities selected by the annotator. The words in the same green area describe a single entity. Once the NE selection has been added, the annotators have to select the type of the entities.

Table 1: The named entity distribution of the training, development and testing sets for both language pairs. Note that the NE tokens row contains the B(eginning) and I(nside) tokens of the datasets following the IOB scheme. The O Tokens row refers to the non-entity tokens.

Table 2: The main components and strategies used by the participants. Ext Res means external resources such as pre-trained word embeddings, gazetteers, etc. Hand Feats means handcrafted features such as capitalization.

Table 3: The results of the participants in both the ENG-SPA and MSA-EGY language pairs. The scores are based on the average of the standard and the Surface Form F1 metrics. The highlighted teams achieved the best scores of the shared task.

Table 4 ,
the entity Orange is the New Black was not recognized by participants as a title.This is an example of what we refer to heterogeneous entity type, mean-

Table 4: Challenging samples from the test set. The bold words are the ground truth samples and the underscored words are the predictions of the best performing systems.

Table 5: Challenging samples from the MSA-EGY test set. The bold words are the ground truth samples.