Language Identification and Analysis of Code-Switched Social Media Text

In this paper, we detail our work on comparing different word-level language identification systems for code-switched Hindi-English data and a standard Spanish-English dataset. In this regard, we build a new code-switched dataset for Hindi-English. To understand the code-switching patterns in these language pairs, we investigate different code-switching metrics. We find that the CRF model outperforms the neural network based models by a margin of 2-5 percentage points for Spanish-English and 3-5 percentage points for Hindi-English.


Introduction
Code-switching occurs when a person switches between two or more languages in a single instance of spoken or written communication (Gumperz, 1982; Myers-Scotton, 1997). Code-switching is prevalent in modern informal communication between multilingual individuals, especially on social media platforms such as Facebook and Twitter. Given this prevalence, there is value in automatic processing and understanding of such data. Language identification at the word level is the first step in computational modeling of code-switched data. It is important for a wide variety of end-user applications such as information extraction systems, voice assistant interfaces, and machine translation, as well as for tools that assist language assessment in bilingual children (Chandu et al., 2017; Roy et al., 2013). In addition, language detection enables sociolinguistic and pragmatic studies of code-switching behavior.
Code-switching in speech is well studied in linguistics, psycholinguistics, and sociolinguistics (Sankoff, 1970; Lipski, 1978; Poplack, 1980; Gumperz, 1982; Auer, 1984; Myers-Scotton, 1997, 2002). The alternation of languages across sentence boundaries is known as code-switching and the alternation within a sentence is known as code-mixing. In this paper we refer to both as code-switching and differentiate between the types of code-switching when necessary. Table 1 shows examples of code-switching for Hindi-English and Spanish-English.
Example 1: Good morning sirji, aaj ka weather kaisa hai?
(Good morning sir, How is the weather today?)
Example 2: Styling day trabajando con @username vestuario para #ElFactorX y soy hoy chofer. I will get you there in pieces im a Safe Driver.
(Styling day working with @username wardrobe for #ElFactorX and today I am a driver. I will get you there in pieces im a Safe Driver.)
Table 1: Example 1 shows code-switching between Hindi-English and Example 2 between Spanish-English (Molina et al., 2016).
Word-level language identification of code-switched text is inherently difficult. First, a single code-switched instance can have mixing at the sentence or clause level, the word level, and even the sub-word level (e.g. sir-ji, chapathis). Second, the typology of the languages involved in switching and their inter-relatedness further increase the task complexity. For example, a shared Latin influence on Spanish and English results in lexical relatedness (Smith, 2001; August et al., 2002), making Spanish-English language identification harder than Hindi-English. Third, although Hindi has a native script (Devanagari), most Hindi social media text is transliterated. Transliteration is the conversion of text from one script to another; in the case of Hindi, text is converted from the native Devanagari script to Roman script. Due to the lack of standardization in transliteration, a single Hindi word can have multiple surface forms (e.g. Humara, Hamara, Hamaaraa). Some Hindi words can also take the same surface form as an English word. The words 'hi' (an auxiliary verb), 'is' (this), and 'us' (that) are some examples. Finally, the characteristics of social media text, such as non-standard spelling, contractions, and loose adherence to the grammar of the language, add to the list of challenges.
In this work, we make three contributions. First, we build a new code-switched dataset for the Hindi-English (HIN-ENG) language pair from Facebook public pages and Twitter. Second, we investigate different code-switching metrics for Hindi-English and a standard Spanish-English (SPA-ENG) dataset. Third, we compare a traditional machine learning model, the conditional random field (CRF), with two recurrent neural network (RNN) based systems for word-level language identification of the above language pairs. In contrast to the CRF model, the RNN-based systems do not involve language-specific resources or sophisticated feature engineering. We test these models first for each of the language pairs individually, and then for a corpus with both language pairs combined.
Among the language identification systems, the CRF model outperforms both RNN-based systems across language pairs. When both language pairs are combined, the result from the best performing model (CRF) is 25 percentage points higher than the baseline system. The RNN-based models also give reasonable results.

Related Work
Over the last decade, several researchers have explored word-level language identification for different language pairs and dialect varieties. The FIRE shared task series (Roy et al., 2013; Sequiera et al., 2015b) focuses on language identification of code-mixed search queries in English and Indian languages for information retrieval. We use a larger set of labels compared to these tasks. The First and Second Shared Task on Language Identification in Code-Switched Data (Solorio et al., 2014; Molina et al., 2016) show the necessity for automatic processing of code-switched text and report a comparison of different language identification systems. The best system from the second iteration of these shared tasks uses a logistic regression model and reports a token-level F1-score of 97.3% for SPA-ENG. Our results are competitive with this score. A dictionary-based method and an SVM model with various features have also been applied to Hindi-English and Bengali-English, achieving an F1-score of 79% for Hindi-English. Barman et al. (2014) create a new dataset and study code-mixing between three languages (English, Hindi, and Bengali) using CRF and SVM models. In another work, Gella et al. (2014) build a language detection system for a synthetically created code-mixed dataset covering 28 languages. Similar to some of the above-mentioned works, we model the language detection task as a sequence labeling problem and explore combinations of several features using the CRF model, but we use a larger set of labels. We obtain significantly higher performance for the Hindi-English language pair than the 79% F1-score reported above.
Along with the traditional machine learning approach, some researchers have also used models based on artificial neural networks. Chang and Lin (2014) use an RNN architecture with pre-trained word2vec embeddings for the SPA-ENG and Nepali-English datasets from the First Shared Task on Language Identification in Code-Switched Data. Samih et al. (2016) build an LSTM-based neural network architecture for the SPA-ENG and MSA-DA datasets from the Second Shared Task on Language Identification in Code-Switched Data. Their model combines word and character representations initialized with pre-trained word2vec embeddings. We replicate their model with a softmax output layer for SPA-ENG and run similar experiments for HIN-ENG, as well as with both corpora combined. Our results for SPA-ENG match those of Samih et al. (2016).

Data
We use the SPA-ENG dataset from the EMNLP Code-Switching Workshop 2016. This data is collected from Twitter, based on geographical areas with a strong presence of Spanish and English bilingual speakers: California, Texas, Miami, and New York (Solorio et al., 2014; Molina et al., 2016). The labels used are summarized in Table 2. The hashtags are treated as a word.
For HIN-ENG, we crawl posts and their comments from the Facebook public pages of various sports-persons, political figures, and movie stars. We also crawl random tweets from the geographical locations Mumbai and Delhi using the Twitter API. From the crawled posts, we remove the posts in native scripts, as well as duplicate and promotional posts. We also filter out posts containing URLs and those with fewer than 3 words.
Table 3: Corpus statistics for the language pairs. Token ratio is the percentage of the total tokens that are unique. A higher token ratio implies a richer corpus vocabulary.
We follow the EMNLP 2016 shared task annotation guidelines and use a semi-automatic approach to annotate the data. The labels are reviewed and corrected with the help of in-lab annotators. The inter-annotator agreement score over approximately 4,000 tokens is 0.935. A portion of the Facebook dataset is annotated using the English lexicon (http://wortschatz.uni-leipzig.de/en/download) and Hindi transliterated pairs (http://cse.iitkgp.ac.in/resgrp/cnerg/qa/fire13translit/). We use pattern matching rules to label punctuation, emoticons, and usernames. These labels are then corrected manually for the ne, fw, mixed, ambiguous, and unk labels. We also make use of two existing datasets: the Facebook dataset from the ICON2016 POS tagging shared task and the dataset from Sequiera et al. (2015a). We manually map the labels of these datasets to the labels in Table 2. We train a character n-gram based CRF model using the above-mentioned three datasets (see Section 5.2) and predict the labels for all the posts crawled from Facebook and the random tweets from Twitter. From these, we identify the posts predicted as code-switched, correct the labels where necessary, and add them to the final dataset. The F1-weighted score for this model is close to 96 percent.
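The pattern-matching step for punctuation, emoticons, and usernames can be illustrated with a short sketch. The paper does not list the actual rules, so the regular expressions below, the function names, and the choice of the label other for all three token types are assumptions made for illustration only.

```python
import re

# Hypothetical rules: the exact patterns used for the dataset are not given
# in the paper. Mapping all three token types to "other" is an assumption.
USERNAME = re.compile(r"^@\w+$")
EMOTICON = re.compile(r"^[:;=8][\-o\*']?[\)\(\]\[dDpP/\\|]+$")
PUNCT = re.compile(r"^[^\w\s]+$")

def rule_label(token):
    """Return 'other' for usernames, emoticons and punctuation, else None."""
    if USERNAME.match(token) or EMOTICON.match(token) or PUNCT.match(token):
        return "other"
    return None  # left for the lexicon lookup or manual annotation

def prelabel(tokens):
    """Attach a rule-based label (or None) to every token of a post."""
    return [(tok, rule_label(tok)) for tok in tokens]
```

Tokens that receive None here would be passed on to the lexicon lookup or to manual annotation, which matches the semi-automatic workflow described above.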

Code-Switching Analysis
In this section we provide some descriptive statistics about the corpora to understand the language distribution and language-relatedness. Table 4 shows the language distribution at the post (tweet) level. The SPA-ENG dataset has a balanced distribution, whereas in the HIN-ENG dataset the majority of the instances are in English. These statistics show that both datasets have a substantial number of code-switched instances to train and test the language identification systems. [...] Spanish, and 2% are named-entities. The higher proportion of named-entities in the HIN-ENG dataset is a result of the way the data is sourced. Figure 1 shows the overlap between the tokens belonging to lang1, lang2, and ne. These overlaps introduce ambiguity for the automatic labeling task. Around 2.5% of the Hindi words in HIN-ENG share the same spelling as some English words because of the transliteration of Hindi text to Roman script. In comparison, there is a 6% overlap between Spanish and English words in the SPA-ENG dataset (e.g. no, a, final). This indicates a higher degree of lexical relatedness between Spanish and English as compared to Hindi and English. The overlap between language words and named-entities is due to words such as university and united. These words can be part of the names of organizations, movie titles, or song titles, and can also be used as ordinary words in either of the languages.
Figure 2: Plot of character n-gram overlap between the languages in the datasets, for n = 2, 3, 4, 5 and 6.
In another analysis, we explore the similarity in character n-gram profiles of the languages involved (Maharjan et al., 2015). A higher similarity in the character n-grams increases the difficulty of the task. We generate character n-grams of length 2 to 6 from the language vocabularies of each corpus. We show the plot of the character n-gram overlaps for HIN-ENG and SPA-ENG in Figure 2. As expected, the overlap decreases rapidly as the n-gram length increases. The SPA-ENG n-gram overlap is higher than that of HIN-ENG for all n-gram lengths. This trend is consistent with the results in Figure 1. To further understand the complexity involved, for an n-gram occurring in both languages, we calculate the probability of that n-gram being part of an English word in the corpus. A probability closer to 50% indicates higher ambiguity in classifying that n-gram. We find that a significant fraction (25%) of these shared n-grams, averaged over all n-gram lengths, have a probability in the range of 40% to 60%.
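The n-gram analysis can be sketched in a few lines. The paper does not state how the overlap is normalized or whether counts are taken over word types or word occurrences, so the Jaccard-style ratio over n-gram types and the count-based probability below are assumptions, and all function names are hypothetical.

```python
from collections import Counter

def char_ngrams(word, n):
    """All contiguous character n-grams of a word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def ngram_counts(vocab, n):
    """Count character n-grams over an iterable of words."""
    counts = Counter()
    for word in vocab:
        counts.update(char_ngrams(word.lower(), n))
    return counts

def overlap_ratio(vocab_a, vocab_b, n):
    """Assumed normalization: shared n-gram types over all n-gram types."""
    a = set(ngram_counts(vocab_a, n))
    b = set(ngram_counts(vocab_b, n))
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def p_english(ngram, eng_counts, other_counts):
    """Probability that an n-gram comes from an English word, from raw counts."""
    e, o = eng_counts[ngram], other_counts[ngram]
    return e / (e + o) if e + o else None
```

A shared n-gram whose p_english value falls between 0.4 and 0.6 would count as ambiguous under the criterion described above.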

Code-Switching Metrics
The code-switching behavior can differ depending on the medium of communication, context of language use, topic, authors (or speakers), and the languages being mixed, among other factors. We compute three different metrics to understand the code-switching patterns in our datasets, as well as to rationalize the performance of the language identification models.
M-Index: The Multilingual Index is a word-count-based measure that quantifies the inequality of the distribution of language tags in a corpus of at least two languages (Barnett et al., 2000). Equation (1) defines the M-Index as:
\[ \text{M-Index} = \frac{1 - \sum_{j=1}^{k} p_j^2}{(k - 1) \sum_{j=1}^{k} p_j^2} \tag{1} \]
where $k$ is the total number of languages and $p_j$ is the total number of words in language $j$ over the total number of words in the corpus. The value ranges between 0 and 1, where a value of 0 corresponds to a monolingual corpus and 1 corresponds to a corpus with an equal number of tokens from each language.
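As a rough illustration of the M-Index, the sketch below computes it from a flat list of token labels. Restricting the count to the labels lang1 and lang2 is an assumption, as is the function name; the paper does not say which labels enter the computation.

```python
from collections import Counter

def m_index(tags, languages=("lang1", "lang2")):
    """M-Index over the language-tagged tokens of a corpus.

    Assumption: only the labels listed in `languages` are counted;
    all other labels (ne, other, ...) are ignored.
    """
    counts = Counter(t for t in tags if t in languages)
    total = sum(counts.values())
    k = len(languages)
    if total == 0 or k < 2:
        return 0.0
    sum_sq = sum((counts[lang] / total) ** 2 for lang in languages)
    return (1 - sum_sq) / ((k - 1) * sum_sq)
```

For a corpus with three lang1 tokens for every lang2 token, the proportions are 0.75 and 0.25, the sum of squares is 0.625, and the index is 0.375 / 0.625 = 0.6.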
Integration Index: The Integration Index (I-Index) is the approximate probability that any given token in the corpus is a switch point (Guzman et al., 2016; Guzmán et al., 2017). Given a corpus of $n$ tokens tagged by language $\{l_j\}$, where $j$ ranges from 1 to $n$ and $i = j - 1$, the I-Index is computed as follows:
\[ \text{I-Index} = \frac{1}{n - 1} \sum_{1 \le i = j - 1 \le n - 1} S(l_i, l_j) \]
where $S(l_i, l_j) = 1$ if $l_i \neq l_j$ and 0 otherwise. For a corpus with $n$ tokens, there are $n - 1$ possible switch points. The I-Index quantifies the frequency of code-switching in a corpus.
Code-Mixing Index: At the utterance level, the Code-Mixing Index (CMI) is computed by finding the most frequent language in the utterance and then counting the frequency of the words belonging to all other languages present. It is calculated using:
\[ \text{CMI} = \begin{cases} 100 \times \dfrac{\sum_{i=1}^{N} (w_i) - \max(w_i)}{n - u} & \text{if } n > u \\ 0 & \text{if } n = u \end{cases} \]
where $\sum_{i=1}^{N} (w_i)$ is the sum over the number of words for all $N$ languages in the utterance, $\max(w_i)$ is the highest number of words present from any language, $n$ is the total number of tokens, and $u$ is the number of language-independent tokens. Here, we consider the labels lang1, lang2, and fw as language words and the rest as other. The range of CMI values is [0, 100). If an utterance has only language-independent tokens or only monolingual tokens, then the corresponding CMI value is 0. A higher value of CMI indicates a higher level of mixing between the languages. CMI-all is an average over all utterances in the corpus and CMI-mixed is an average over only the code-switched instances.
SPA-ENG has a higher M-Index value (Table 5), indicating a more balanced ratio of words from the two languages. This is consistent with the distribution of language words in the datasets (Table 2). The difference between HIN-ENG and SPA-ENG is about 0.9 percentage points for CMI-all and 0.1 percentage points for CMI-mixed. The larger difference for CMI-all could be because of the higher percentage of code-switched instances (9%) in HIN-ENG as compared to SPA-ENG (Table 4).
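The two remaining metrics can be sketched as follows. The paper treats lang1, lang2, and fw as language words for CMI; applying the same filter before counting switch points for the I-Index, and treating an utterance as code-switched when its CMI is above zero, are assumptions of this sketch. The function names are hypothetical.

```python
from collections import Counter

# Labels treated as language words, following the CMI setup in the text.
LANG_TAGS = {"lang1", "lang2", "fw"}

def i_index(tags):
    """Fraction of adjacent language-tagged token pairs whose tags differ.

    Assumption: language-independent tokens are dropped before counting.
    """
    langs = [t for t in tags if t in LANG_TAGS]
    if len(langs) < 2:
        return 0.0
    switches = sum(1 for a, b in zip(langs, langs[1:]) if a != b)
    return switches / (len(langs) - 1)

def cmi(tags):
    """Utterance-level Code-Mixing Index in [0, 100)."""
    counts = Counter(t for t in tags if t in LANG_TAGS)
    lang_tokens = sum(counts.values())  # this is n - u
    if lang_tokens == 0:
        return 0.0
    return 100.0 * (lang_tokens - max(counts.values())) / lang_tokens

def corpus_cmi(utterances):
    """Return (CMI-all, CMI-mixed); 'mixed' is approximated as CMI > 0."""
    scores = [cmi(u) for u in utterances]
    mixed = [s for s in scores if s > 0]
    cmi_all = sum(scores) / len(scores) if scores else 0.0
    cmi_mixed = sum(mixed) / len(mixed) if mixed else 0.0
    return cmi_all, cmi_mixed
```

Because CMI depends only on label counts while the I-Index depends on label order, two utterances with the same CMI can have very different I-Index values, which is why the metrics are read together.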
Considering the CMI-mixed and I-Index metrics together, it is evident that HIN-ENG has more language mixing and a higher number of code-switching points than SPA-ENG. This is because HIN-ENG has more instances with multiple word insertions. In SPA-ENG, instances with word insertions at more than one place in an utterance are less frequent. We also observe that a larger share of code-switching happens between language words in HIN-ENG (76%) than in SPA-ENG (69%). For example, a number of Hindi word insertions are due to the use of the honorific ji with an address form (Sir/Madam). In general, the higher amount of code-switching observed in HIN-ENG reflects the fact that code-switching between Hindi and English is very widespread in India (Parshad et al., 2016).

Language Identification Models
We provide below a brief description of each of the models used.
CRF: Language identification is a sequence labeling task where the label of a token in a sequence is correlated with the labels of its neighboring tokens. We therefore use a CRF, a sequence labeling model, to capture the structure in the data. We explore different language-independent features such as character n-grams, word unigrams, morphological features, affixes, and contextual information for the language pairs. For each word, we generate character n-grams of length 1 to 5 and filter them based on a minimum threshold frequency of 5. To capture the morphological information of the tokens, we use binary features: is digit, is special character, is all capital, is title case, begins with @ character, has accent character (for SPA-ENG only), and has apostrophe. We also use language-dependent resources such as lexicons and monolingual part-of-speech (POS) taggers. For HIN-ENG, we use three different lexicons (the Leipzig corpus for English, the FIRE 2013 transliterated Hindi word pairs, and the lexically normalized dictionary from Han et al. (2012)) together with the output of a Twitter POS tagger and a CRF++ based Hindi POS tagger. For SPA-ENG, we use the Leipzig corpus for Spanish along with the other two lexicons mentioned above and the output from the monolingual TreeTaggers for Spanish and English.
Bidirectional LSTM: Long Short-Term Memory networks (LSTMs) (Hochreiter and Schmidhuber, 1997) are a variation of recurrent neural networks (RNNs) that address the vanishing gradient issue (Hochreiter, 1998) by extending RNNs with memory cells. A shortcoming of the LSTM is that only the previous history in a sequence can be utilized. In a sequence labeling task like language identification, it is helpful to also use the future context in the sequence. Bidirectional LSTM (BLSTM) networks can access both the preceding and succeeding contexts by involving two separate hidden layers. These networks can capture long-distance relations in the sequence efficiently, in both directions. We build an end-to-end sequence model with a single BLSTM layer (Figure 3).
Word-Character LSTM: This model is a replication of the model proposed by Samih et al. (2016) (Figure 4). The input layer in this model has word and character embeddings. The latter are used to capture the morphological features of a word. We use two LSTMs to learn fixed-dimensional representations from the embedding layers. At the output layer, we apply a softmax over the concatenated word and character vectors to obtain the token label. Unlike the BLSTM model, here only the current token and its neighboring tokens are considered to predict the label for the current token. We replace the emoticons in the dataset with a placeholder character to reduce the vocabulary size and, as a result, the dimension of the character embeddings. This decreases the number of trainable model parameters and thereby mitigates overfitting to some extent.
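The language-independent CRF features can be sketched as a per-token feature dictionary of the kind accepted by common CRF toolkits. This is a minimal sketch, not the authors' feature extractor: the feature names, the affix length of 3, and the boundary markers are assumptions, and the frequency threshold of 5 for n-grams would have to be applied separately over the training corpus.

```python
import unicodedata

def char_ngrams(word, max_n=5):
    """Character n-grams of length 1 to max_n."""
    return [word[i:i + n] for n in range(1, max_n + 1)
            for i in range(len(word) - n + 1)]

def has_accent(word):
    """True if any character carries a combining diacritic after NFD."""
    decomposed = unicodedata.normalize("NFD", word)
    return any(unicodedata.combining(c) for c in decomposed)

def token_features(tokens, i, window=1, spanish=False):
    """Feature dictionary for tokens[i]; names and affix length are assumed."""
    word = tokens[i]
    feats = {
        "word.lower": word.lower(),
        "is_digit": word.isdigit(),
        "is_special": not any(c.isalnum() for c in word),
        "is_all_caps": word.isupper(),
        "is_title": word.istitle(),
        "starts_with_at": word.startswith("@"),
        "has_apostrophe": "'" in word,
        "prefix3": word[:3].lower(),
        "suffix3": word[-3:].lower(),
    }
    if spanish:  # accent feature is used for SPA-ENG only
        feats["has_accent"] = has_accent(word)
    for gram in char_ngrams(word.lower()):
        feats["ng=" + gram] = True
    # Contextual information: surrounding words within the window.
    for off in range(1, window + 1):
        feats["prev%d" % off] = tokens[i - off].lower() if i - off >= 0 else "<s>"
        feats["next%d" % off] = tokens[i + off].lower() if i + off < len(tokens) else "</s>"
    return feats
```

Calling token_features for every position of a post yields the sequence of feature dictionaries; setting window to 0, 1, or 2 corresponds to varying the amount of contextual information.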

Experiments and Results
For the CRF, we run experiments with different combinations of the hand-crafted features discussed in the previous section. We run three different sets of experiments: with no contextual information, and with surrounding words in context windows of size 1 and 2. Table 6 and Table 7 show the results from these experiments.
For the RNN-based systems, we use pre-trained fastText word embeddings. We learn the embeddings using a large monolingual corpus for each of the languages and a smaller code-switched corpus for the language pairs. The rationale for using large monolingual data is that it is readily available and that it can account for the different contexts in which words appear in different languages, thus providing an accurate separation between the languages. We train three separate sets of embeddings, one each for SPA-ENG, HIN-ENG, and SPA-ENG + HIN-ENG. The embeddings for SPA-ENG are trained by combining a portion of the English Gigaword corpus (Graff et al., 2003) and the Spanish Gigaword corpus (Graff, 2006), and a subset of tweets from Samih et al. (2016). For HIN-ENG, we combine a portion of the English Gigaword corpus, a transliterated Hindi monolingual corpus, and Facebook posts that contain code-switching. All these corpora are used to train the embeddings for SPA-ENG + HIN-ENG. This helps to capture word usage in the context of each language and to resolve the ambiguity of words that have the same surface form in multiple languages. We train 300-dimensional embedding vectors using the fastText skip-gram model for 250 epochs with a learning rate of 0.001 and a minimum word count threshold of 5.
For the BLSTM model, we initialize the embedding layer with the pre-trained fastText word embeddings and feed the output sequence from this layer to the BLSTM layer. At the output layer, a softmax activation function is applied over the hidden representation learned in the BLSTM layer. For the word-char model, we initialize the word embedding matrix with fastText embeddings and use random initialization for the character embedding matrix. We train both RNN-based models by optimizing the cross-entropy objective function with the Adam optimizer (Kingma and Ba, 2014). To mitigate overfitting, we use dropout masks after the BLSTM layer in the BLSTM model, after the LSTM layers in the word-char model, and after the embedding layer in each model. The reported BLSTM and word-char models have 80 and 100 hidden units, respectively, in the LSTM layers. For the word-char model, we try neighboring token window sizes of 1, 2, and 3 for each token. The context window size of 2 gives better results and is reported here.
Multiple Language Pair Experiment. We use the models described in Section 6 in an experiment to identify the labels for a dataset with multiple language pairs. This dataset has both the Spanish-English and Hindi-English language pairs (SPA-ENG + HIN-ENG). To account for the third language, we use an additional label, lang3 (HIN). Except for the pre-trained word embeddings, the models do not involve any language-dependent feature engineering and are easy to scale to multiple language pairs. As the word embeddings are trained mostly on monolingual data, this dependency does not constrain the systems.

Results and Evaluation
We use a simple lexicon-based model as the baseline for our language identification systems. We use F1-weighted scores for model evaluation to account for the imbalance in the label distributions (Table 2). All the models improve over the respective baseline models by 7 to 25 percentage points. For the CRF, which is the best performing model across language pairs, the current word and its character n-grams are the most important features. Adding POS tags does not improve these results by much. This could be because the POS taggers are optimized for monolingual data and their output for the code-switched data contains noise. Using contextual information improves the results for HIN-ENG, but not for SPA-ENG. In Table 8, we compare the RNN-based models and the CRF model. For a fair comparison, we consider the performance of the CRF model using only the language-independent features with a context size of 2. Among the RNN-based systems, while the results are competitive overall, there is no single system that performs best across language pairs. The BLSTM system performs better for HIN-ENG, while the word-char system performs better for SPA-ENG. The BLSTM model captures long-distance dependencies in a sequence, and this is in line with the observation made above with the CRF model: more context helps for HIN-ENG. It is also consistent with the code-switching patterns discussed in Section 5. A majority of the code-switched tweets in SPA-ENG have a single instance of word insertion, and these are being mislabeled by the models. The overall better results for SPA-ENG are due to the larger training data used. The baseline results for SPA-ENG + HIN-ENG are relatively low compared to the individual language pairs. This shows that simultaneously identifying the language for multiple language pairs is harder. We obtain reasonable results for these initial experiments with all the models.
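To make the evaluation setup concrete, the sketch below pairs a lexicon-lookup baseline with a support-weighted F1 computation. The paper gives no details of the baseline, so the lookup order, the default label, and the function names are assumptions; the weighted F1 follows the usual definition of averaging per-label F1 by gold label frequency.

```python
from collections import Counter

def lexicon_baseline(token, lexicons, default="lang1"):
    """Label a token with the first lexicon that contains it.

    Assumption: lexicons is a dict {label: set of words}, checked in
    insertion order; unknown tokens fall back to a default label.
    """
    for label, words in lexicons.items():
        if token.lower() in words:
            return label
    return default

def weighted_f1(gold, pred):
    """Per-label F1 averaged with weights equal to gold label support."""
    support = Counter(gold)
    total = 0.0
    for label, n in support.items():
        tp = sum(1 for g, p in zip(gold, pred) if g == p == label)
        fp = sum(1 for g, p in zip(gold, pred) if p == label and g != label)
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += f1 * n
    return total / len(gold) if gold else 0.0
```

Weighting by support means frequent labels such as lang1 and lang2 dominate the score, so the label-wise F1 for rarer labels such as ne is worth inspecting separately.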
To understand these results better, we look at the label-wise F1-scores for lang1, lang2, and ne (Table 10). The F1-scores for the CRF are better across the labels, and the difference is particularly large for ne. The F1-score for ne is relatively high for HIN-ENG, which can be attributed to the fact that around 58% of the named-entities in the test set appear in the training set. This overlap is only 17% for SPA-ENG. So, infrequent named-entities seem to be the hardest to label accurately. In addition, the RNN-based models are more sensitive to the amount of training samples.
Further, we examine the transitions learned by our best CRF model for each of the language pairs (Table 9). For both language pairs, the transitions between the same languages are more likely than switching. But we also observe that the transitions from lang1 to lang2 and vice-versa rank higher for HIN-ENG than SPA-ENG. This is because there are fewer code-switching points in SPA-ENG as compared to HIN-ENG in these datasets.
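A simple empirical counterpart to the learned transitions is the relative frequency of label bigrams in the annotated data. The sketch below is not the CRF's transition weights, which are learned jointly with the feature weights; it only estimates, from label sequences, how often one label follows another. The function name is hypothetical.

```python
from collections import Counter, defaultdict

def transition_probs(sequences):
    """Relative frequency of each label following each other label."""
    counts = defaultdict(Counter)
    for tags in sequences:
        for a, b in zip(tags, tags[1:]):
            counts[a][b] += 1
    return {a: {b: c / sum(nxt.values()) for b, c in nxt.items()}
            for a, nxt in counts.items()}
```

Comparing the lang1-to-lang2 and lang2-to-lang1 entries across two corpora gives a quick check on whether one corpus has more switch points than the other.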
Figure 5: Projection of the word representations learned by the neural network models for HIN-ENG + SPA-ENG: (a) BLSTM model, (b) word-char LSTM model. We reduce the word vector dimensions using PCA. The mapping of labels to colors: lang1: red, lang2: green, lang3: blue, ne: black, other: orange, ambiguous: purple, mixed: purple, fw: yellow, unk: yellow.
We also visualize the feature representations learned by the RNN-based models by projecting the word embeddings for a randomly selected subset of words from the development datasets for SPA-ENG + HIN-ENG (Figure 5). The word-char model gives a clearer separation between the three languages and the words belonging to the labels other and ne. While the BLSTM model also provides a clear separation between the language words, there is an overlap with the tokens from other. These results show that these models can be scaled to detect code-switching in multiple language pairs without any additional feature engineering.

Conclusions
The complexity of language identification of code-switched data depends on the data source, the code-switching behavior, and the typology of and relation between the languages involved. We find that the code-switching metrics complement each other in explaining the code-switching patterns across language pairs. The analysis of code-switching metrics shows that, in our datasets, Hindi-English speakers tend to switch languages more often than Spanish-English speakers. In the future, it would be interesting to explore and compare the code-switching behavior of data from different sources such as movie scripts, song lyrics, and chat conversations across different language pairs. We successfully use two different deep learning architectures for the task without sophisticated feature engineering and obtain competitive results. However, a traditional CRF model performs better than the deep learning models for the language pairs considered. This is probably due to the limited amount of training data we have. The results show that word embeddings are able to capture the language separation well. Scaling these systems to identify languages in datasets with many language pairs, and in datasets with switching between more than two languages, is a potential future direction to explore.