Joint Part-of-Speech and Language ID Tagging for Code-Switched Data

Code-switching is the fluent alternation between two or more languages in conversation between bilinguals. Large populations of speakers code-switch in everyday communication, but little effort has been made to develop tools for code-switched language, including part-of-speech taggers. In this paper, we propose an approach to POS tagging of code-switched English-Spanish data based on recurrent neural networks. We first test our model on known monolingual benchmarks to demonstrate that our neural POS tagging model is on par with state-of-the-art methods. We then test our methods on the Miami Bangor corpus of English-Spanish conversation, focusing on two types of experiments: POS tagging alone, for which we achieve 96.34% accuracy, and joint part-of-speech and language ID tagging, which achieves similar POS tagging accuracy (96.39%) and very high language ID accuracy (98.78%). Finally, we show that our proposed models outperform other state-of-the-art code-switched taggers.


Introduction
Code-switching (CS) is the phenomenon by which multilingual speakers switch between languages in written or spoken communication. For example, an English-Spanish speaker might say "El teacher me dijo que Juanito is very good at math." CS can be observed at various linguistic levels: phonological, morphological, lexical, and syntactic. It can be classified as intra-sentential (if the switch occurs within the boundaries of a sentence or utterance) or inter-sentential (if the switch occurs between two sentences or utterances). The importance of developing NLP technologies for CS data is immense. In the US alone there is an estimated population of 56.6 million Hispanic people (US Census Bureau, 2014), of which 40 million are native speakers (US Census Bureau, 2015). Most of these speakers routinely code-switch. However, very little research has been done to develop NLP approaches to CS language, due largely to the lack of sufficient corpora of high-quality annotated data to train on. Yet CS presents serious challenges to all language technologies, including part-of-speech (POS) tagging, parsing, language modeling, machine translation, and automatic speech recognition, since techniques developed on one language quickly break down when that language is mixed with another.
One of Artificial Intelligence's ultimate goals is to enable seamless natural language interactions between artificial agents and human users. In order to achieve that goal, it is imperative that users be able to communicate with artificial agents as they do with other humans. In addition to such real-time interactions, CS language is also pervasive in social media (David, 2001; Danet and Herring, 2007; Cárdenas-Claros and Isharyanti, 2009). Any system that attempts to communicate with these users or to mine their social media content therefore needs to deal with CS language.
POS tagging is a key component of any Natural Language Understanding system and one of the first tools researchers employ to process data. As such, it is crucial that POS taggers be able to process CS content. Monolingual POS taggers stumble when processing CS sentences due to out-of-vocabulary words in one language, confusable words that exist in both language lexicons, and differences in the syntax of the two languages. For example, when running monolingual English and Spanish taggers on the CS English-Spanish sentence shown in Figure 1, the English tagger erroneously tagged most Spanish tokens, and similarly the Spanish tagger mistagged most English tokens. A tagger trained on monolingual English and Spanish sentences (EN+ES tagger) fared better, making only two mistakes: on the word "when", where the switch occurs (confusing the subordinating conjunction for an adverb), and the word "in" (which exists in both vocabularies). A tagger trained on CS instances of English-Spanish, however, was able to tag the whole sentence correctly.
In this paper, we present a comprehensive study of POS tagging for CS utterances that includes the following: a) use of a state-of-the-art bidirectional recurrent neural network; b) use of a large CS English-Spanish corpus annotated with high-quality labels from the Universal POS tagset; c) extensive analyses of the performance of our taggers on monolingual and CS sentences; d) study of the performance of a tagger trained on the subset of the monolingual sentences of the CS corpus (in-genre baseline); e) examination of the effect of language identifiers both as feature inputs and for joint language identification and POS tagging; and f) comparison to state-of-the-art taggers for code-switching on the same corpus.

Related Work
A variety of tasks have been studied in CS data. For language identification (LID), Rosner and Farrugia (2007) proposed a word-level Hidden Markov Model and a character-level Markov Model to revert to when a word is out-of-vocabulary, and tested these on a corpus of Maltese-English sentences, achieving 95% accuracy. Working on a Bengali-Hindi-English dataset of Facebook posts, Barman et al. (2014) employed classifiers using n-gram and contextual features to obtain 95% accuracy.
In the first statistical approach to POS-tagging on CS data, Solorio and Liu (2008) collected the Spanglish corpus, a small set of 922 English-Spanish sentences. They proposed several heuristics to combine monolingual taggers with limited success, achieving 86% accuracy when choosing the output of a monolingual tagger based on the dictionary language ID of each token. However, an SVM trained on the output of the monolingual taggers performed better than their oracle, reaching 93.48% accuracy. On the same dataset, Rodrigues (2013) compared the performance of a POS-tagger trained on CS sentences with a dynamic model that switched between taggers based on gold language identifiers; they found the latter to work better (89.96% and 90.45% respectively). Note, however, that the monolingual taggers from (Solorio and Liu, 2008) were trained on other larger corpora, while all the models used in (Rodrigues, 2013) were trained on the Spanglish corpus. Jamatia et al. (2015) used CS English-Hindi Facebook and Twitter posts to train and test POS taggers. They found a Conditional Random Field model to perform best (71.6% accuracy), and a combination of monolingual taggers similar to the one in (Solorio and Liu, 2008) achieved 72.0% accuracy. Again using Hindi-English Facebook posts, Vyas et al. (2014) ran Hindi and English monolingual taggers on monolingual chunks of each sentence. Sequiera et al. (2015) tested algorithms from (Solorio and Liu, 2008) and (Vyas et al., 2014) on the Facebook dataset from (Vyas et al., 2014) and the Facebook+Twitter dataset from (Jamatia et al., 2015), and found that (Solorio and Liu, 2008) yielded better results. Similarly, Barman et al. (2016) compared the methods proposed in (Solorio and Liu, 2008) and (Vyas et al., 2014) on a subset of 1,239 code-mixed Facebook posts from (Barman et al., 2014) and found that a modified version of (Solorio and Liu, 2008) performed best. 
They also experimented with performing joint POS and LID tagging using a 2-level factorial Conditional Random Field and achieved statistically similar results.
AlGhamdi et al. (2016) tested seven different POS tagging strategies for CS data: four consisted of combinations of monolingual systems and the other three were integrated systems. They tested them on MSA-Egyptian Arabic and English-Spanish. The first three combined strategies consisted of running monolingual POS taggers and language ID taggers in different order and combining the outputs in a single multilingual prediction. The fourth approach involved training an SVM on the output of the monolingual taggers. The three integrated approaches trained a supervised model on a) the Miami Bangor corpus (which contains switched and monolingual utterances), b) the union of two monolingual corpora (Ancora-ES and Penn Treebank), and c) the union of the three corpora. The monolingual approaches consistently underperformed compared to the other strategies. The SVM approach consistently outperformed the integrated approaches; however, this method was trained on both monolingual and multilingual resources: the Penn Treebank for the English model and the Ancora-ES dataset for the Spanish model. In Section 6.4, we run experiments in similar conditions to the integrated approaches from (AlGhamdi et al., 2016), which we will compare to our work. The main contributions of this paper over this previous research on POS tagging for CS data are the following: a) Our tagger is a bidirectional LSTM that achieves POS tagging accuracy comparable to state-of-the-art taggers on benchmark datasets like the Wall Street Journal corpus and the Universal Dependencies corpora. It is the first such model used to train code-switched POS taggers; b) Our model can simultaneously perform POS and LID tagging without loss of POS tagging accuracy; c) We run experiments on the Miami Bangor corpus of Spanish and English conversational speech.
However, unlike AlGhamdi et al. (2016), who used POS tags obtained from an automatic tagger and then mapped to a deprecated version of the Universal POS tagset, our experiments are run on newly crowd-sourced Universal POS tags (Soto and Hirschberg, 2017), which were obtained with high accuracy and inter-annotator agreement.

A Model for Neural POS Tagging
For our experiments we use a bi-directional LSTM network similar to the one proposed by Wang et al. (2015) with the following set of features: 1) word embeddings, 2) prefix and suffix embeddings of one, two and three characters, and 3) four boolean features that encode whether the word is all upper case, all lower case, formatted as a title, or contains any digits. In total, the input space consists of seven embeddings and four boolean features. For the embeddings, we compute word, prefix and suffix lexicons, excluding tokens that appear fewer than five times in the training set, and then assign a unique integer to each token. We also reserve two integers for the padding and out-of-lexicon symbols.
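As a rough sketch, the lexicon construction described above can be implemented as follows. The function and symbol names (`build_lexicon`, `<PAD>`, `<OOV>`) are ours, not from the paper; the minimum-count threshold of five follows the text:

```python
from collections import Counter

# Reserved indices for the padding and out-of-lexicon symbols (assumed convention).
PAD_ID, OOV_ID = 0, 1

def build_lexicon(tokens, min_count=5):
    """Assign a unique integer to every token appearing at least min_count
    times in the training set; 0 and 1 are reserved for <PAD> and <OOV>."""
    counts = Counter(tokens)
    lexicon = {"<PAD>": PAD_ID, "<OOV>": OOV_ID}
    for tok in sorted(counts):
        if counts[tok] >= min_count:
            lexicon[tok] = len(lexicon)
    return lexicon

def encode(tokens, lexicon):
    """Integer-encode a token sequence, mapping unknown tokens to <OOV>."""
    return [lexicon.get(t, OOV_ID) for t in tokens]

# The same procedure yields the prefix and suffix lexicons, where the
# prefixes/suffixes are the first/last 1, 2 and 3 characters of each word:
def affixes(word, n):
    return word[:n], word[-n:]
```

The word, prefix and suffix streams are each encoded against their own lexicon before being fed to the embedding layers described below.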
We present two architectures for POS tagging and one for joint POS and LID tagging. In the most basic architecture the word, prefix and suffix embeddings and the linear activation units are concatenated into a single layer. The second layer of the network is a bidirectional LSTM. Finally, the output layer is a softmax activation layer, whose i-th output unit at time t represents the probability that word w_t has part-of-speech tag POS_i. We refer to this model as the Bi-LSTM POS tagger for the rest of the article and in our tables. For the second model, given the multilingual nature of our experiments, we modify the input space of our Bi-LSTM tagger to make use of the language ID information in our corpus. We add six more boolean features to represent the language ID (one for each label) and add six linear activation units in the first hidden layer, which are then concatenated with the rest of the linear activation units and word embeddings in the basic model. This model is referred to as the Bi-LSTM POS tagger + LID features.
Finally, our third model simultaneously tags words with POS and LID labels. The architecture of this model follows the Bi-LSTM POS architecture very closely, adding a second output layer with softmax activations for LID prediction. Note that the POS and LID output layers are independent and are connected by their own weight matrices to the hidden layer, and both loss functions are given the same weight. This model is referred to as the joint POS+LID tagger. We implemented our code using the deep learning library Keras (Chollet, 2015) on a TensorFlow backend (Abadi et al., 2015).
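The joint architecture can be sketched in Keras (here `tf.keras`) roughly as follows. The embedding and hidden-layer sizes follow the Methodology section; the maximum utterance length, vocabulary sizes, and variable names are our assumptions, and the boolean features are concatenated directly rather than passed through the paper's linear activation units:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

MAX_LEN = 50            # assumed maximum utterance length (not given in the paper)
N_POS, N_LID = 17, 6    # 17 Universal POS tags; 6 language ID labels

def embedded_input(vocab_size, dim, name):
    """One integer-encoded input stream and its trainable embedding."""
    inp = layers.Input(shape=(MAX_LEN,), dtype="int32", name=name)
    return inp, layers.Embedding(vocab_size, dim)(inp)

# Word embeddings plus 1-, 2- and 3-character prefix/suffix embeddings
# (embedding sizes from the paper; vocabulary sizes are placeholders).
word_in, word_emb = embedded_input(20000, 128, "words")
affix_ins, affix_embs = [], []
for side in ("prefix", "suffix"):
    for n, dim in ((1, 4), (2, 8), (3, 16)):
        i, e = embedded_input(2000, dim, f"{side}{n}")
        affix_ins.append(i)
        affix_embs.append(e)

# Four boolean word-shape features: all-upper, all-lower, title-case, has-digit.
bool_in = layers.Input(shape=(MAX_LEN, 4), name="bools")

x = layers.Concatenate()([word_emb, *affix_embs, bool_in])
x = layers.Bidirectional(layers.LSTM(200, return_sequences=True))(x)

# Two independent softmax output layers connected to the shared Bi-LSTM;
# the two cross-entropy losses are given equal weight.
pos_out = layers.Dense(N_POS, activation="softmax", name="pos")(x)
lid_out = layers.Dense(N_LID, activation="softmax", name="lid")(x)

model = Model([word_in, *affix_ins, bool_in], [pos_out, lid_out])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              loss_weights={"pos": 1.0, "lid": 1.0})
```

Dropping the `lid` head and its loss recovers the basic Bi-LSTM POS tagger; adding six boolean LID inputs alongside `bools` recovers the tagger with LID features.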

Datasets
Throughout our experiments we use three corpora for different purposes. The Wall Street Journal (WSJ) corpus is used to demonstrate that our proposed Bi-LSTM POS tagger is on par with current state-of-the-art English POS taggers. The Universal Dependencies (UD) corpus is used to train baseline monolingual POS taggers in English and Spanish that we can use to test on our CS data, since both employ the Universal POS tagset (Petrov et al., 2012). The Miami Bangor corpus, which contains instances of inter- and intra-sentential CS utterances in English and Spanish, is used for training and testing CS models and comparing these to monolingual models. Table 1 shows the number of sentences/utterances and tokens in each dataset split. For the MB corpus, Inter-CS refers to the subset of monolingual sentences and Intra-CS refers to the subset of CS sentences.

Wall Street Journal Corpus
The WSJ corpus (Marcus et al., 1999) is a monolingual English news corpus comprised of 49,208 sentences and over 1.1 million tokens. It is tagged with the Treebank tagset (Santorini, 1990; Marcus et al., 1993), which has a total of 45 tags. We use the standard training, development and test splits from (Collins, 2002), which span sections 0-18, 19-21 and 22-24 respectively.

Universal Dependency Corpora
Universal Dependencies (UD) is a project to develop cross-linguistically consistent treebank annotations for many languages. The English UD corpus (Silveira et al., 2014) is built from the English Web Treebank (Bies et al., 2012). The corpus is comprised of news blog data and has a total of 16,013 sentences and over 455k tokens.

Miami Bangor Corpus
The Miami Bangor (MB) corpus is a conversational speech corpus recorded from bilingual English-Spanish speakers living in Miami, FL. It includes 56 conversations recorded from 84 speakers. The corpus consists of 242,475 words (333,069 including punctuation tokens) and 35 hours of recorded conversation. The language markers in the corpus were manually annotated. Table 2 shows the language composition of the corpus. The dominant language in this corpus is English (53.48% of the tokens), followed by Spanish (27.78%). The ambiguous label includes words that are difficult to tag as either English or Spanish due to lack of context (e.g. "no"). Since, in the original corpus, punctuation tokens were labeled as ambiguous, we created an additional punctuation tag for our experiments. The mixed category contains tokens that are formed by morphemes and roots from both languages (e.g. "ripear"), and the category 'Other' contains untranscribed tokens. The composition of the subset of CS sentences is different, however: Spanish becomes the dominant language, comprising 46.12% of the tokens, compared to 38.98% for English.
The utterances in the original MB corpus were transcribed in the CHAT transcription and coding format (MacWhinney, 2000), which allows annotators to divide full utterances into chunks to represent citations and other speech discourse phenomena. However, working on full utterances is more suitable in the context of POS tagging. Therefore, following the guidelines in (MacWhinney, 2009), we used the utterance linkers and utterance terminators to reconstruct full utterances when possible. After this, the corpus had a total of 16,013 sentences and 333K tokens. The original MB corpus was automatically glossed and tagged with POS tags using the Bangor Autoglosser (Donnelly and Deuchar, 2011a,b). The autoglosser finds the gloss for each token in the corpus and assigns the tag or group of tags most common for that word in the annotated language. However, here we use the Universal POS tags obtained by Soto and Hirschberg (2017). These tags were collected using crowdsourcing tasks and automatic labeling, with high annotation accuracy and label recall. We split the MB corpus into training and test. For the test split we randomly drew 4,200 utterances. The training split is used for 4-fold cross-validation. Table 3 shows the degree of multilingualism in the MB corpus and the two splits. In the full dataset, about 6.94% of the utterances contain intra-sentential switches. Note that the full dataset and its train and test splits (columns 2 to 4) have very similar degrees of multilingualism according to the reported measures, whereas the subset of intra-sentential CS sentences (column 5) has a much higher rate of switched tokens (11%, from 1.26%) and average number of switches per sentence (1.41, from 0.098). More than 93% of CS utterances contain one or two switches; some contain up to eight switches. For example, the following sentence contains five switches (marked with '|'): "... y en | summer | y en | fall | tengo que hacer | one class."
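As an illustration, the number of intra-sentential switch points in an utterance can be counted directly from the token-level language tags. This is a minimal sketch with assumed tag names, treating only English-Spanish transitions as switches and skipping ambiguous, mixed, punctuation and other tokens:

```python
def count_switches(lid_tags):
    """Count intra-sentential switch points: transitions between an English
    and a Spanish token. Ambiguous/mixed/punctuation tokens between them are
    skipped, so a switch across such a token still counts once."""
    langs = [t for t in lid_tags if t in ("eng", "spa")]
    return sum(1 for a, b in zip(langs, langs[1:]) if a != b)

# "... y en summer y en fall tengo que hacer one class"
tags = ["spa", "spa", "eng", "spa", "spa", "eng",
        "spa", "spa", "spa", "eng", "eng"]
```

On the example sentence above, `count_switches(tags)` finds the five switch points marked in the text.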

Methodology
For the experiments involving the Bangor corpus, we perform 4-fold cross-validation (CV) on the training corpus to run grid search and obtain the best values of the learning rate and learning-rate decay parameters. For the experiments on WSJ and UD, we use the official development set. The performance of the best parameter values is reported as "Dev" accuracy. We then train a model using the best parameter values on the full train set and obtain predictions for the test set (reported as "Test"). When pertinent we also report results on the subset of intra-sentential CS utterances of the test set (reported as "Intra-CS Test").
During CV, each model is trained for a maximum of 75 epochs using batches of 128 examples. We use early stopping to halt training when the development POS accuracy has not improved for the last three epochs, and keep only the best performing model. However, when training the final model, we train for the number of epochs that the best model trained for during CV. The loss function used is categorical cross-entropy and we use ADAM (Kingma and Ba, 2015), with its default β1, β2 and ε parameter values, as the stochastic optimization method.
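The early-stopping rule above (halt once development accuracy has stalled for three epochs, keeping the best model) can be sketched as a small helper; the function name and return convention are ours:

```python
def train_with_patience(epoch_dev_accs, patience=3):
    """Scan per-epoch dev accuracies and return (best_epoch, best_acc,
    stop_epoch) under the rule: halt once dev accuracy has not improved
    for `patience` consecutive epochs; the best model is kept."""
    best_epoch, best_acc, since_best = 0, float("-inf"), 0
    for epoch, acc in enumerate(epoch_dev_accs):
        if acc > best_acc:
            best_epoch, best_acc, since_best = epoch, acc, 0
        else:
            since_best += 1
            if since_best >= patience:
                return best_epoch, best_acc, epoch  # early stop triggered
    return best_epoch, best_acc, len(epoch_dev_accs) - 1
```

The `best_epoch` returned by CV is then reused as the fixed epoch budget when training the final model on the full training set, as described above.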
The word embeddings (Bengio et al., 2003) we use are trained with the rest of the network during training following the Keras implementation (Gal and Ghahramani, 2016). The size of the embedding layers is 128 for the word embeddings and 4, 8 and 16 for the prefix and suffix embeddings of length 1, 2 and 3 respectively. The Bi-LSTM hidden layer has 200 units for each direction.
Finally, we run McNemar's test (McNemar, 1947) to test for statistically significant differences between pairs of classifiers when their accuracies are similar, and report statistical significance for p-values smaller than 0.05.
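A minimal, standard-library-only sketch of McNemar's test as used here: it builds the 2x2 table of token-level disagreements between two taggers and applies the chi-square statistic with continuity correction (the chi-square survival function with one degree of freedom reduces to `erfc`). Function and variable names are ours:

```python
import math

def mcnemar(preds_a, preds_b, gold):
    """McNemar's chi-square test with continuity correction.
    n01: tokens tagger A gets right and tagger B gets wrong;
    n10: tokens tagger A gets wrong and tagger B gets right."""
    n01 = sum(1 for a, b, g in zip(preds_a, preds_b, gold) if a == g and b != g)
    n10 = sum(1 for a, b, g in zip(preds_a, preds_b, gold) if a != g and b == g)
    if n01 + n10 == 0:
        return 0.0, 1.0
    stat = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
    # Survival function of chi-square with 1 dof: p = erfc(sqrt(x / 2)).
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p
```

Only the tokens on which the two taggers disagree in correctness contribute to the statistic, which is why the test can separate classifiers whose overall accuracies are close.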

Experiments & Results
In this section, we present our experiments using the three Bi-LSTM models introduced in Section 3 and the datasets from Section 4. Our goal is a) to show that the basic Bi-LSTM POS tagger performs very well against known POS tagging benchmarks; b) to obtain baseline performances for monolingual taggers when tested on CS data; and c) to train and test the proposed models on CS data and analyze their performance when trained on different proportions of monolingual and CS data.

WSJ results
We begin by evaluating the performance of the Bi-LSTM POS tagger on the benchmark WSJ corpus to show that it is on par with current state-of-the-art English POS taggers. We train taggers on three incremental feature sets to measure how much each feature adds. Using only word embeddings we achieve 95.14% accuracy on the test set; adding the boolean word features increases accuracy to 95.84%; and adding the prefix and suffix embeddings further increases accuracy to 97.10%. This demonstrates that our tagger is on par with current state-of-the-art systems, which report 97.78% (Ling et al., 2015), 97.45% (Andor et al., 2016), 97.35% (Huang et al., 2012), 97.34% (Moore, 2014) and 97.33% (Shen et al., 2007) accuracy on the same standard test set. Systems most similar to our Bi-LSTM tagger with basic features reported 97.20% (Collobert et al., 2011) and 97.26% accuracy.

Universal tagset baseline
In the second set of experiments we train three baseline taggers on the UD corpora: one monolingual Spanish tagger, one monolingual English tagger, and one tagger trained on both corpora. The goal of these experiments is to obtain taggers trained on the Universal tagset that we can use to obtain a baseline performance of monolingual taggers on the CS Bangor corpus. The results are shown in Table 4. The accuracy of the baseline UD taggers is slightly worse than that of the WSJ taggers, probably due to the smaller size of the UD datasets. The accuracy of the taggers on their own test sets is 94.78% and 95.02% for English and Spanish respectively. In comparison, Stanford's neural dependency parser (Dozat et al., 2017) reports accuracy values of 95.11% and 96.59% respectively.
In order to approximate how a monolingual tagger trained on established datasets performs on a conversational CS dataset, we test the baseline UD taggers on the MB test set and observe a dramatic drop in accuracy, due perhaps to the difference in genre (web blog data vs. transcribed conversation) and the bilingual nature of the Miami corpus. When testing on the subset of intra-sentential CS utterances (Table 4), we observe that the English model decreases further in accuracy, whereas the Spanish tagger performs better. This is due to the CS sentences having more Spanish than English.

Miami Bangor results
In the third set of experiments we train the three proposed models (Bi-LSTM tagger, Bi-LSTM tagger with LID features, and joint POS and LID tagger) on: a) the full MB corpus, b) the joint MB and UD EN&ES corpora, and c) the instances of inter-sentential CS utterances from the MB corpus. The LID features were obtained from the MB corpus language tags. POS and LID accuracy results are shown in Table 5 and Table 6 respectively. When training on the full MB corpus (top subtable of Table 5), the POS tagger achieves 96.34% accuracy, a significant improvement over the 88.17% of the UD EN&ES tagger. The improvement holds up on the subset of CS utterances, where it achieves 96.10% accuracy. Adding the LID features improves performance by 0.15 and 0.34 absolute percentage points respectively. In both cases these differences are statistically significant (p = 0.03). Furthermore, when running joint POS and LID tagging, we see that tagging accuracy decreases only slightly with respect to the POS tagger with LID features. This result reaffirms the contribution of the LID features. The difference in performance between the joint tagger and the basic tagger is slightly higher but not statistically significant (p ∼ 0.5), showing that joint decoding does not harm overall performance. The best POS tagging accuracy is always achieved by the Bi-LSTM tagger with LID features on both Test and CS Test; however, the joint tagger is very close, at no more than 0.1 percentage points behind on Test. When adding the UD corpora during training (middle subtable of Table 5) we see some improvements for the three models (0.13, 0.14 and 0.22 absolute percentage points respectively), and once again the difference in performance between the basic tagger and the tagger with LID features is statistically significant (p < 0.05).
We performed statistical tests to measure how different the models trained on MB are from the models trained on MB+UD and found that the addition of more monolingual data only makes a difference for the joint tagger (p < 0.01) when looking at the performance on the Test set. On the CS test set, these models achieve about the same performance in POS tagging with a slight decrease for the basic tagger (-0.11 points, not significant) and a slight increase in accuracy for the joint tagger (0.38 percentage points, again not significant). Thus, it is clear that our model is able to learn from a few CS examples -even when many more monolingual sentences, from a different genre, are added to the train set.
Finally, we trained models on the subset of monolingual English and Spanish sentences from the MB training set to measure how well a model trained on the same genre would generalize to unseen intra-sentential CS sentences (bottom subtable of Table 5, marked as Inter-CS). This model is closer to an in-genre inter-sentential CS tagger, tested on intra-sentential CS. Compared to the models trained on UD EN&ES, this model performs much better: 96.03% compared to 88.17% on the MB test set. This is mainly due to the fact that the UD corpus is not conversational speech. When comparing this result to the taggers trained on the full MB corpus, it can be seen that these new models achieve the lowest test accuracy across all models, probably because they never see intra-sentential CS examples during training. The difference in performance is more pronounced on the subset of CS utterances. Again, we ran statistical tests to compare these three new taggers to the taggers trained on the full MB corpus, and found that the differences were statistically significant in all three cases (p < 0.001).
With respect to the LID accuracy of the joint tagger, the best model is the one trained on the MB corpus, followed very closely by the model trained on MB and UD data. In both cases, the test set accuracy is above 98.49%. The accuracy on the CS test subset is slightly lower, at 98.01% and 97.93% respectively. The monolingual Bangor tagger sees a decrease in test accuracy (97.99%) and a bigger drop, down to 90.25%, on the CS subset. All the differences in performance between every pair of the three LID taggers are statistically significant (p < 10^-5).

Comparison to Previous Work
We compare the performance of our models to the Integrated and Combined models proposed in (AlGhamdi et al., 2016). In that paper, POS tagging results are reported on the MB corpus, but using a preliminary mapping to the first iteration of the Universal tagset (12 tags, as opposed to the current 17); furthermore, the train and test splits were different. Therefore, we decided to replicate their experiments using our data configuration and compare them to our own classifiers. With respect to their "Integrated" models, in terms of data INT3: AllMonoData+CSD is most similar to our POS tagger trained on Miami and EN&ES UD, with which we reached 96.47% accuracy compared to their 92.20%. Furthermore, we note that our joint POS+LID tagger also achieves better POS accuracy than the counterpart Integrated systems from (AlGhamdi et al., 2016), in addition to performing LID tagging.

Table 7: Out-of-vocabulary (OOV) rate; sentence accuracy (Sacc) and word accuracy (Wacc) at the sentence level; fragment accuracy (CSFAcc) and word accuracy (CSFWacc) at the fragment level; average minimum distance from a tagging error to a CSF (AvgMinDistCSF); and percentage of errors that occur within a CSF (%ErrorsInCSF).

Error Analysis
In this section we analyze the performance of the POS taggers on the CS sentences of the Bangor test set and, more specifically, on the CS fragments (CSFs) of those test sentences. We define a CSF as the minimum contiguous span of words where a CS occurs. Most often a CSF will be two words long, spanning a Spanish token and an English one or vice versa, but it is also possible for fragments to be longer than that, given that a Mixed or Ambiguous token could occur within a fragment. The average CSF length in the Bangor test set is 2.16. We compare the performance of the UD-EN, UD-ES, UD-EN&ES and the Bangor-trained taggers on the Bangor CS Test set to understand the difference in errors made by monolingual and CS taggers. Table 7 shows the following measures: OOV rate, POS tagging accuracy at the sentence and word level, POS tagging accuracy in CS fragments at the fragment and word level, the average distance from a POS tagging error to the nearest CSF (AvgMinDistCSF), and the percentage of POS tagging errors that occur within the boundaries of a CSF (%ErrorsInCSF). All measures are computed on the CS subset of test sentences of the Bangor corpus using the basic POS taggers trained on UD-EN, UD-ES, UD EN&ES and the Bangor corpus. In the table, we see that the multilingual models have much lower OOV rates, which translates into much higher sentence-level and word-level POS tagging accuracy. The CS Bangor-trained model fares much better than the UD EN&ES model in terms of word-level accuracy (96.1% versus 87.2%), and especially in terms of sentence-level accuracy (60.7% versus 21.8%), because the Bangor model is able to deal with code-switches. When looking at the tagging accuracy on the CS utterances, the relative gains at the word level are even larger. This demonstrates that training on CS sentences is an important factor in achieving high POS tagging accuracy.
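The fragment-level measures can be computed from the token-level language tags and the positions of tagging errors. This is a minimal sketch under the CSF definition above, with assumed tag names:

```python
def cs_fragments(lid_tags):
    """Return [start, end) spans of code-switch fragments: the minimal
    contiguous spans covering a transition between an English and a Spanish
    token. Ambiguous/mixed tokens between them are absorbed into the span,
    so fragments can be longer than two tokens."""
    core = [(i, t) for i, t in enumerate(lid_tags) if t in ("eng", "spa")]
    spans = []
    for (i, a), (j, b) in zip(core, core[1:]):
        if a != b:
            spans.append((i, j + 1))
    return spans

def min_dist_to_csf(error_idx, spans):
    """Token distance from an error position to the nearest fragment
    (0 if the error falls inside a fragment) - the per-error quantity
    averaged to obtain AvgMinDistCSF."""
    return min(max(s - error_idx, error_idx - (e - 1), 0) for s, e in spans)
```

%ErrorsInCSF then follows as the fraction of errors whose distance is 0.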
It can also be seen from the table that, even as the models achieve higher CS tagging accuracy, tagging errors remain concentrated near or within CSFs: for the UD EN&ES and Bangor models, AvgMinDistCSF and %ErrorsInCSF decrease as the CSF-level accuracies increase. This shows that even as the models improve at tagging CS fragments, CS fragments still remain the most challenging aspect of the task.

Conclusions
In this paper, we have presented a neural model for POS tagging and language identification on CS data. The neural network is a state-of-the-art bidirectional LSTM with prefix, suffix and word embeddings and four boolean features. We used the Miami Bangor corpus to train and test models and showed that: a) monolingual taggers trained on benchmark training sets perform poorly on the test set of the CS corpus; b) our CS models achieve high POS accuracy when trained and tested on CS sentences; c) expanding the feature set to include language ID as input features yields the best performing models; d) a joint POS and language ID tagger performs comparably to the POS tagger, and its LID accuracy is higher than 98%; and e) a model trained on instances of in-genre inter-sentential CS performs much better than the monolingual baselines, but yields worse test results than the model trained on instances of both inter-sentential and intra-sentential code-switching. Furthermore, we compared our results to the previous state-of-the-art POS taggers for this corpus and showed that our classifiers outperform them in every configuration.