GLUECoS: An Evaluation Benchmark for Code-Switched NLP

Code-switching is the use of more than one language in the same conversation or utterance. Recently, multilingual contextual embedding models, trained on multiple monolingual corpora, have shown promising results on cross-lingual and multilingual tasks. We present an evaluation benchmark, GLUECoS, for code-switched languages, that spans several NLP tasks in English-Hindi and English-Spanish. Specifically, our evaluation benchmark includes Language Identification from text, POS tagging, Named Entity Recognition, Sentiment Analysis, Question Answering and a new task for code-switching, Natural Language Inference. We present results on all these tasks using cross-lingual word embedding models and multilingual models. In addition, we fine-tune multilingual models on artificially generated code-switched data. Although multilingual models perform significantly better than cross-lingual models, our results show that in most tasks, across both language pairs, multilingual models fine-tuned on code-switched data perform best, showing that multilingual models can be further optimized for code-switching tasks.


Introduction
Code-switching, or code-mixing, is the use of more than one language in the same utterance or conversation and is prevalent in multilingual societies all over the world. It is a spoken phenomenon and is found most often in informal chat and social media on the Internet. Processing, understanding, and generating code-mixed text and speech has become an important area of research.
Recently, contextual word embedding models trained on large amounts of text data have shown state-of-the-art results in a variety of NLP tasks. Models such as BERT (Devlin et al., 2018) and its multilingual version, mBERT, rely on large amounts of unlabeled monolingual text data to build monolingual and multilingual models that can be used for downstream tasks involving limited labelled data. Wang et al. (2018) propose the General Language Understanding Evaluation (GLUE) benchmark to evaluate embedding models on a wide variety of language understanding tasks. This benchmark has spurred research in monolingual transfer learning settings.
Data and annotated resources are scarce for code-switched languages, even if one or both languages being mixed are high-resource. Due to this, there is a lack of standardized datasets in code-switched languages other than those used in shared tasks in a few language pairs. Although models using synthetic code-switched data and cross-lingual embedding techniques have been proposed for code-switching (Pratapa et al., 2018a), there has not been a comprehensive evaluation of embedding models across different types of tasks. Furthermore, there have been claims that multilingual models such as mBERT are competent in zero-shot cross-lingual transfer and code-switched settings. Though these claims were comprehensively validated by Pires et al. (2019) in the case of zero-shot transfer, the probing in code-switched settings was limited to one dataset of one task, namely POS tagging.
To address all these issues, and inspired by the GLUE benchmark (Wang et al., 2018), we propose GLUECoS, a language understanding evaluation framework for code-switched NLP. We include five tasks from previously conducted evaluations and shared tasks, and propose a sixth, a Natural Language Inference task for code-switching, using a new dataset (Khanuja et al., 2020). We include tasks varying in complexity, ranging from word-level tasks [Language Identification (LID); Named Entity Recognition (NER)] through syntactic tasks [POS tagging] and semantic tasks [Sentiment Analysis; Question Answering] to a Natural Language Inference task. Where available, we include multiple datasets for each task in English-Spanish and English-Hindi. We choose these language pairs not only due to the relative abundance of publicly available datasets, but also because they represent variations in types of code-switching, language families, and scripts between the languages being mixed. We test various cross-lingual and multilingual models on all of these tasks. In addition, we also test models trained with synthetic code-switched data. Lastly, we fine-tune the best performing multilingual model with synthetic code-switched data and show that in most cases its performance exceeds that of the original multilingual model, highlighting that multilingual models can be further optimized for code-switched settings.
The main contributions of our work are as follows:
• We point out the lack of standardized datasets for code-switching and propose an evaluation benchmark, GLUECoS, which can be used to test models on various NLP tasks in English-Hindi and English-Spanish.
• In creating the benchmark, we highlight the tasks that are missing from code-switched NLP and propose a new task, Natural Language Inference, for code-switched data.
• We evaluate cross-lingual and pre-trained multilingual embeddings on all these tasks, and observe that pre-trained multilingual embeddings significantly outperform cross-lingual embeddings. This highlights the competence of generalized language models over cross-lingual word embeddings.
• We fine-tune pre-trained multilingual models on linguistically motivated synthetic code-switched data, and observe that they perform better in most cases, highlighting that these models can be further optimized for code-switched settings.
The rest of the paper is organized as follows. We relate our work to prior work to situate our contributions. We introduce the tasks and datasets used for GLUECoS motivating the choices we make. We describe the experimental setup, with details of the models used for baseline evaluations. We present the results of testing all the models on the benchmark and analyze the results. We conclude with a direction for future work and highlight our main findings.

Relation to prior work
The idea of a generalized benchmark for code-switching is inspired by GLUE (Wang et al., 2018), which has spurred research in Natural Language Understanding in English to such an extent that a set of harder tasks has been curated in a follow-up benchmark, SuperGLUE (Wang et al., 2019), once models beat the human baseline for GLUE. The motivation behind GLUE is to evaluate models in a multi-task learning framework across several tasks, so that tasks with less training data can benefit from others. Although our current work does not include models evaluated in a multi-task setting, we plan to implement this in subsequent versions of the benchmark.
There have been shared tasks conducted in the past as part of code-switching workshops co-located with notable NLP conferences. The first and second workshops on Computational Approaches to Code Switching conducted a shared task on Language Identification for several language pairs. The third workshop (Aguilar et al., 2018) included a shared task on Named Entity Recognition for the English-Spanish and Modern Standard Arabic-Egyptian Arabic language pairs (Aguilar et al., 2019).
The Forum for Information Retrieval Evaluation (FIRE) aims to meet new challenges in multilingual information access and has conducted several shared tasks on code-switching. These include tasks on transliterated search (Roy et al., 2013; Choudhury et al., 2014), code-mixed entity extraction (Rao and Devi, 2016) and mixed script information retrieval (Sequiera et al., 2015; Banerjee et al., 2016). Other notable shared tasks include the Tool Contest on POS Tagging for Code-Mixed Indian Social Media at ICON 2016, Sentiment Analysis for Indian Languages (Code-Mixed) at ICON 2017 (Patra et al., 2018) and the Code-Mixed Question Answering Challenge (Chandu et al., 2018a).
Each of the shared tasks mentioned above attracted several participants and has led to follow-up research on these problems. However, each task has focused on a single NLP problem and, so far, there has not been an evaluation of models across several code-switched NLP tasks. Our objective in proposing GLUECoS is to address this gap and determine which models generalize best across different tasks, languages and datasets.

Tasks and Datasets
Some NLP tasks are inherently more complex than others. For example, a Question Answering task, which requires understanding the meaning of both the question and the answer, is harder for a machine to solve than a word-level Language Identification task, in which a dictionary lookup can give reasonable results. Some datasets and domains may contain very little code-switching, while others may contain more frequent and complex code-switching. Similar languages, when code-switched, may maintain the word order of both languages, while language pairs that are very different may take on the word order of one of the languages. With these in mind, our choice of tasks and datasets for GLUECoS is based on the following principles:
• We choose a variety of tasks, ranging from simpler ones, on which the research community has already achieved high accuracies, to relatively more complex ones, on which very few attempts have been made.
• We want to evaluate models on language pairs from different language families, and on a varied number of tasks, to enable detailed analysis and comparison. This led us to choose English-Hindi and English-Spanish, as we found previously studied datasets for almost all tasks in our benchmark for these language pairs.
• English and Spanish are written in the Roman script, while English-Hindi datasets can contain Hindi words written either in the original Devanagari script or in the Roman script, thus adding script variance as an additional parameter to analyse.
• We include multiple datasets from each language pair where available, so that results can be compared across datasets for the same task.
Due to the lack of standardized datasets, we choose to create our own train-test-validation splits for some tasks. Also, we use an off-the-shelf transliterator and language detector where necessary, details of which can be found in Appendix A. Table 1 shows all the datasets that we use, with their statistics, while Table 2 shows the code-switching statistics of the data in terms of standardized metrics for code-switching (Gambäck and Das, 2014; Guzmán et al., 2017). Briefly, the code-mixing metrics include:
• Code-Mixing Index (CMI): The fraction of language-dependent tokens not belonging to the matrix language in the utterance.
• Average switch-points (SP Avg): The average number of intra-sentential language switch-points in the corpus.
• Multilingual Index (M-index): A word-count-based measure quantifying the inequality of distribution of language tags in a corpus of at least two languages.
• Probability of Switching (I-index): The proportion of the number of switch-points in the corpus, relative to the number of language-dependent tokens.
• Burstiness: The quantification of whether switching occurs in bursts or has a more periodic character, relative to switching at random (as in a Poisson process).
• Language Entropy (LE): The bits of information needed to describe the distribution of language tags.
• Span Entropy (SE): The bits of information needed to describe the distribution of language spans.
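As a rough illustration of how some of these metrics can be computed from word-level language tags, the following Python sketch implements CMI, M-index, I-index and Language Entropy for a single tag sequence. The formulas follow our reading of Gambäck and Das (2014) and Guzmán et al. (2017) and should be checked against those papers before use; the function names and the `other` tag for language-independent tokens are assumptions made for this example and are not part of the benchmark code.

```python
import math
from collections import Counter

OTHER = "other"  # hypothetical tag for language-independent tokens


def language_tags(tags):
    """Keep only the tags of language-dependent tokens."""
    return [t for t in tags if t != OTHER]


def cmi(tags):
    """Code-Mixing Index on a 0-100 scale: share of tokens outside the matrix language."""
    lang = language_tags(tags)
    if not lang:
        return 0.0
    dominant = max(Counter(lang).values())  # token count of the matrix language
    return 100.0 * (1.0 - dominant / len(lang))


def m_index(tags):
    """Multilingual Index: 0 for monolingual text, 1 for perfectly balanced languages."""
    lang = language_tags(tags)
    counts = Counter(lang)
    k = len(counts)
    if k < 2:
        return 0.0
    sum_sq = sum((c / len(lang)) ** 2 for c in counts.values())
    return (1.0 - sum_sq) / ((k - 1) * sum_sq)


def i_index(tags):
    """Probability of switching: switch-points divided by possible switch sites."""
    lang = language_tags(tags)
    if len(lang) < 2:
        return 0.0
    switches = sum(a != b for a, b in zip(lang, lang[1:]))
    return switches / (len(lang) - 1)


def language_entropy(tags):
    """Entropy (in bits) of the distribution of language tags."""
    lang = language_tags(tags)
    counts = Counter(lang)
    return -sum((c / len(lang)) * math.log2(c / len(lang)) for c in counts.values())


# Toy tag sequence (hypothetical): three English tokens, two Hindi tokens, one neutral token.
example = ["en", "en", "hi", "other", "hi", "en"]
scores = {f.__name__: f(example) for f in (cmi, m_index, i_index, language_entropy)}
```

Corpus-level values such as those in Table 2 would be obtained by aggregating over utterances; the aggregation scheme is not specified here and is left out of the sketch.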
In cases where the datasets have been a part of shared tasks, we report the highest scores obtained in each task as the State Of The Art (SOTA) for the dataset. However, we report these only to situate our results in context; they cannot be directly compared with our results, since each task's SOTA is obtained with a different training architecture, suited to perform well on that particular task alone.

Language Identification (LID)
Language Identification is the task of obtaining word-level language labels for code-switched sentences. For English-Hindi we choose the FIRE 2013 (FIRE LID) dataset originally created for the transliterated search subtask (Roy et al., 2013). The test and development sets provided contain word-level language tagged sentences. For training we     which also contains language labels.
For English-Spanish we choose the dataset in , provided as part of the LID shared task at EMNLP 2014. We report the highest score obtained for SPA-EN  as the SOTA for this task.

Part of Speech (POS) tagging
POS tagging involves labelling each word with a grammatical part-of-speech tag such as noun, verb, adjective, pronoun or preposition. For English-Hindi, we use two datasets. The first is the code-switched Universal Dependency parsing dataset provided by Bhat et al. (2018) (UD POS). This corpus contains a transliterated version, where Hindi is in the Roman script, and also a corrected version in which Hindi has been manually converted back to Devanagari. We report the highest score obtained by Bhat et al. (2018) as the SOTA for this task.
The second English-Hindi dataset we use was part of the ICON 2016 Tool Contest on POS Tagging for Code-Mixed Indian Social Media Text (Jamatia et al., 2016) (FG POS). We report the highest score obtained by Anupam Jamatia (2016), from a report communicated directly by the authors, as the SOTA for this task.
For English-Spanish, of the two corpora utilised in , we choose the Bangor Miami corpus (Bangor POS) owing to the larger size of the corpus. We report the highest score obtained by  as the SOTA for this task.

Named Entity Recognition (NER)
NER involves recognizing named entities such as person, location, organization etc. in a segment of text. For English-Hindi we use the Twitter NER corpus provided by ) (IIITH NER). We report the highest score obtained by  as the SOTA for this task.
For English-Spanish, we use the Twitter NER corpus provided as part of the CALCS 2018 shared task on NER for code-switched data (Aguilar et al., 2019) (CALCS NER). We report the highest score obtained by (Winata et al., 2019) as the SOTA for this task.

Sentiment Analysis
Sentiment analysis is a sentence classification task wherein each sentence is labeled to be expressing a positive, negative or neutral sentiment.
For English-Hindi we choose the sentiment annotated social media corpus used in the ICON 2017 shared task; Sentiment Analysis for Indian Languages (SAIL) (Patra et al., 2018). This corpus is originally language tagged at the word level with Hindi in the Roman script. We report the highest score obtained for HI-EN (Patra et al., 2018) as the SOTA for this task.
For English-Spanish we choose the sentiment annotated Twitter dataset provided by Vilares et al. (2016), which we split into an 8:1:1 train:test:validation split while preserving the sentiment distribution. Vilares et al. (2016) report an average F1 score of 58.9 on this dataset, while Pratapa et al. (2018b) report an F1 of 64.6, which we report as the SOTA for this dataset. We are not aware of further work done on this dataset.
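A split of this kind can be sketched as follows. This is a minimal illustration of a label-stratified 8:1:1 split, not the script used to build the benchmark; the function name, the `label_of` callback and the fixed seed are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_split(examples, label_of, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split examples into train/test/validation, keeping label proportions roughly equal."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for ex in examples:
        by_label[label_of(ex)].append(ex)

    train, test, valid = [], [], []
    for label in sorted(by_label):      # sorted for reproducibility
        group = by_label[label]
        rng.shuffle(group)
        n_train = round(ratios[0] * len(group))
        n_test = round(ratios[1] * len(group))
        train += group[:n_train]
        test += group[n_train:n_train + n_test]
        valid += group[n_train + n_test:]   # remainder goes to validation
    return train, test, valid


# Toy data (hypothetical): 50 positive, 30 negative and 20 neutral tweets.
data = [(f"tweet-{i}", lab)
        for i, lab in enumerate(["pos"] * 50 + ["neg"] * 30 + ["neu"] * 20)]
train, test, valid = stratified_split(data, label_of=lambda ex: ex[1])
```

Splitting within each label group, rather than over the whole dataset at once, is what keeps the sentiment distribution similar across the three parts.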

Question Answering (QA)
Question Answering is the task of answering a question based on the given context or world knowledge. We choose the dataset provided by Chandu et al. (2018a), which contains two types of questions for En-Hi, one with context (185 article based questions) and one containing image based questions (774 questions). For the image based questions we use the DrQA Document Retriever module (https://github.com/facebookresearch/DrQA) to extract the most relevant context from Wikipedia. Since it is a code-switched dataset, context could not be extracted for all questions. We obtain a final dataset having 313 (question-answer-context) triples.

Natural Language Inference (NLI)
Natural Language Inference is the task of inferring a positive (entailed) or negative (contradicted) relationship between a premise and hypothesis. While most NLI datasets contain sentences or images as premises, the code-switched NLI dataset we use contains conversations as premises, making it a conversational NLI task (Khanuja et al., 2020). Since this is a new dataset, we report our number as the SOTA for this task.

Experimental Setup
We use standard architectures for solving each of the tasks mentioned above (see Appendix B). We experiment with several existing cross-lingual word embeddings that have been shown to perform well on cross-lingual tasks. We also experiment with the Multilingual BERT (mBERT) model released by Devlin et al. (2018). In a survey on cross-lingual word embeddings, Ruder et al. (2017) establish that various embedding methods optimize for similar objectives given that the supervision data involved in training them is similar. Based on this, we choose the following representative embedding methods that vary in the amount of supervision involved in training them.

MUSE Embeddings
We use the MUSE library to train both supervised and unsupervised word embeddings. The unsupervised word embeddings are learnt without any parallel data or anchor points: the method learns a mapping from the source to the target space using adversarial training and (iterative) Procrustes refinement (Conneau et al., 2017). The supervised method leverages a bilingual dictionary (or identical character strings as anchor points) to learn a mapping from the source to the target space using (iterative) Procrustes alignment.
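The Procrustes step has a closed-form solution, which the following NumPy sketch illustrates on toy data. It shows only the single alignment step for a fixed set of anchor pairs, under the assumption that rows of `X` and `Y` are embeddings of dictionary-aligned word pairs; it is not the MUSE implementation, and the adversarial initialisation and iterative dictionary refinement are omitted.

```python
import numpy as np


def procrustes_map(X, Y):
    """Return the orthogonal matrix W minimising ||X @ W - Y||_F.

    X, Y: (n_pairs, dim) arrays whose i-th rows are the source and target
    embeddings of the i-th anchor pair.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt


# Toy check (hypothetical data): recover a hidden rotation between two spaces.
rng = np.random.default_rng(0)
dim = 5
X = rng.normal(size=(50, dim))                    # "source-language" vectors
R, _ = np.linalg.qr(rng.normal(size=(dim, dim)))  # hidden orthogonal map
Y = X @ R                                         # "target-language" vectors
W = procrustes_map(X, Y)
```

Constraining the map to be orthogonal preserves distances and angles within the source space, so monolingual neighbourhood structure is unchanged by the alignment.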

BiCVM Embeddings
This method, proposed by Hermann and Blunsom (2014), leverages parallel data, based on the assumption that parallel sentences are equivalent in meaning and consequently have similar sentence representations. We use the BiCVM toolkit to learn these embeddings. The parallel corpus we use for English-Spanish consists of 4.5M parallel sentences from Twitter. For English-Hindi, we make use of an internal parallel corpus consisting of roughly 5M parallel sentences.

BiSkip Embeddings
This method makes use of parallel corpora as well as word alignments to learn cross-lingual embeddings. Luong et al. (2015) adapt the skip-gram objective originally proposed by Mikolov et al. (2013) to a bilingual setting wherein a model learns to predict words cross-lingually along with the monolingual objectives. We make use of the fast_align toolkit to learn word alignments given parallel corpora and use the BiVec toolkit to learn the final BiSkip embeddings given the parallel corpora and the word alignments. The parallel corpora utilised to learn these are the same as those used to learn the BiCVM embeddings.
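To make the bilingual objective concrete, the sketch below enumerates the (centre word, context word) training pairs that such a model would see for one aligned sentence pair: the usual monolingual skip-gram pairs, plus cross-lingual pairs in which a word predicts the neighbours of its aligned word. This is a simplified illustration under our own assumptions (function names, window size, toy sentences and alignments are all hypothetical), not the BiVec toolkit's code.

```python
def context(sentence, position, window):
    """Words within `window` positions of `position`, excluding the word itself."""
    lo = max(0, position - window)
    hi = min(len(sentence), position + window + 1)
    return [sentence[k] for k in range(lo, hi) if k != position]


def biskip_pairs(src, tgt, alignments, window=2):
    """Enumerate (centre, context) pairs for one aligned sentence pair."""
    pairs = []
    # Monolingual skip-gram pairs within each sentence.
    for sent in (src, tgt):
        for i, word in enumerate(sent):
            pairs += [(word, c) for c in context(sent, i, window)]
    # Cross-lingual pairs: each word predicts the context of its aligned word.
    for i, j in alignments:
        pairs += [(src[i], c) for c in context(tgt, j, window)]
        pairs += [(tgt[j], c) for c in context(src, i, window)]
    return pairs


# Toy example (hypothetical alignment: like~pasand, tea~chai).
src = ["I", "like", "tea"]
tgt = ["mujhe", "chai", "pasand", "hai"]
pairs = biskip_pairs(src, tgt, alignments=[(1, 2), (2, 1)], window=1)
```

In this toy example "like" is asked to predict "chai" and "hai", the neighbours of its aligned word "pasand", in addition to its own English neighbours.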

Synthetic Data (GCM) Embeddings
We also experiment with skip-gram embeddings learnt from synthetically generated code-mixed data as proposed by Pratapa et al. (2018b). We make use of the fastText library to learn the skip-gram embeddings. For English-Spanish, we obtain data from Pratapa et al. (2018a), which consists of 8M synthetic code-switched sentences. For English-Hindi, we generate synthetic data from the IITB parallel corpus. We sample from the generated sentences obtained using Switch Point Fraction (SPF), as described in Pratapa et al. (2018a), to obtain a GCM corpus of roughly 10M sentences.

mBERT
Multilingual BERT is pre-trained on monolingual corpora of 104 languages and has been shown to perform well on zero-shot cross-lingual model transfer and code-switched POS tagging (Pires et al., 2019). Specifically, we use the bert-base-multilingual-cased model for our experiments. Sun et al. (2019) show that fine-tuning BERT with in-domain data on language modeling improves performance on downstream tasks. Along similar lines, we fine-tune the mBERT model with synthetically generated code-switched data (gCM) and a small amount of real code-switched data (rCM) on the masked language modeling objective. The training curriculum we use in fine-tuning this model is similar to that proposed by Pratapa et al. (2018a), which has been shown to improve language modeling perplexity. Although we train on real code-mixed data, it accounts for a small fraction (less than 5%) of the total code-mixed data used. Refer to Appendix C for training details.
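For readers unfamiliar with the masked language modeling objective, the sketch below shows how one training example can be corrupted. It assumes the masking scheme described by Devlin et al. (2018), in which 15% of tokens are selected and, of those, 80% are replaced by a mask symbol, 10% by a random token and 10% are left unchanged; we assume, but have not stated above, that the same scheme applies to the fine-tuning here. The sketch works on whitespace tokens rather than WordPiece units, and the function name, toy sentence and toy vocabulary are hypothetical.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels); labels are None where no prediction is required."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)              # the model must recover the original token
            roll = rng.random()
            if roll < 0.8:
                inputs.append(MASK)         # 80%: replace with the mask symbol
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                inputs.append(tok)          # 10%: keep the token unchanged
        else:
            labels.append(None)             # position is not scored
            inputs.append(tok)
    return inputs, labels


# Toy code-switched sentence and vocabulary (hypothetical).
sentence = "mujhe yeh movie bahut pasand hai".split()
vocab = ["movie", "film", "accha", "good", "hai"]
corrupted, targets = mask_tokens(sentence, vocab, mask_prob=0.5, seed=1)
```

The loss is computed only at positions whose label is not `None`, so fine-tuning on code-switched text pushes the model to predict words from context that mixes both languages.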

Results and Analysis
Tables 3-8 show the results of using the embedding techniques described above for each task and dataset. mBERT provides a large increase in accuracy as compared to cross-lingual techniques, and in most cases, the modified mBERT technique performs best. We do not experiment with baseline or cross-lingual embedding techniques for NLI, since we find that mBERT surpasses the other techniques for all other tasks. For NLI, as in the other cases, we find that modified mBERT performs better than mBERT. We hypothesize that this happens because code-switched languages are not just a union of two monolingual languages. The distributions and usage of words in code-switched languages differ from their monolingual counterparts, and can only be captured with real code-switched data, or synthetically generated data that closely mimics real data.
Glavas et al. (2019) point out how all cross-lingual word embedding methods optimize for bilingual lexicon induction. Each model is trained using different language pairs and different training and evaluation dictionaries, leading to it overfitting to the task it is optimizing for and failing in other cross-lingual scenarios. Also, the loss function in training cross-lingual word embeddings has a component where w1 in one language predicts the context of its aligned word w2 in the other language. However, in the case of code-switching, w1 appearing in the context of w2 may not be natural. This clearly highlights the need to learn cross-lingual embeddings keeping code-mixed language processing as an optimization objective.
9 The original task was language tagging and transliteration of Hindi words in the Roman script, while we report LID results for Hindi in Devanagari. An accuracy of 99.0 was obtained on the original subtask (Roy et al., 2013).
10 We create our own test split from the training data, since the test data is not publicly available.
11 The original dataset contains multiple code-mixed pairs and there exists no language based segregation of the results. Since we only choose the EN-HI examples we report this as N/A.
The results using mBERT cannot be directly compared to the cross-lingual models because of the difference in the magnitude of data involved in training. Also, because mBERT is trained on 104 languages together, with massive amounts of data for a large number of epochs, it learns several common features better, yielding a well-represented common embedding space. The training data used for the cross-lingual embeddings is restricted to Twitter and query logs, while mBERT is trained on the entire Wikipedia dump.
Overall, the cross-lingual and mBERT models perform better for English-Spanish as compared to English-Hindi. This could be due to several reasons.
• English and Spanish are similar languages, with both mostly retaining individual word order while code-switching, which is not the case for English and Hindi.
• Romanized Hindi does not use standardized spellings, and errors made by the transliterator could have influenced the results.
• We use Twitter and social media data to train cross-lingual word embeddings for English-Spanish, which are similar in domain to the task datasets, while for English-Hindi we use the IITB and query-based parallel corpora, which are generic in domain; this choice was constrained by the resources available to us.
We find that for most tasks, modified mBERT performs better than mBERT. In cases where this is not true (QA En-Hi; FG En-Hi), the difference in accuracy between the two models is small. This could be attributed to errors made by the transliterator or corpus differences, but in general we observe that the modified En-Hi mBERT model does not significantly outperform the base mBERT model. Given the promising results obtained by modified mBERT, it would be interesting to pre-train a language model for code-switched data which is trained on the monolingual corpora of languages involved and fine-tuned on GCM as proposed, to compare against fine-tuning mBERT itself, which is trained on multiple languages.
We find that accuracies vary across tasks in the GLUECoS benchmark, and except in the case of LID, code-switched NLP is far from solved. This is particularly stark in the case of Sentiment and NLI, which are three-way and two-way classification tasks respectively. Modified mBERT performs only a little over chance, which shows that we are still in the early days of solving NLI for code-switched languages, and also indicates that our models are far from truly being able to understand code-switched language.

Conclusion
In this paper, we introduce the first evaluation benchmark for code-switching, GLUECoS. The benchmark contains datasets in English-Hindi and English-Spanish for six NLP tasks: LID, POS tagging, NER, Sentiment Analysis, Question Answering and a new code-switched Natural Language Inference task. We test various embedding techniques across all tasks and datasets and find that multilingual BERT outperforms cross-lingual embedding techniques on all tasks. We also find that for most datasets, a modified version of mBERT that has been fine-tuned on synthetically generated code-switched data with a small amount of real code-switched data performs best. This indicates that while multilingual models do go a long way in solving code-switched NLP, they can be improved further by using real and synthetic code-switched data, since the distributions in code-switched languages differ from those of the two languages being mixed.
In this work, we use standard architectures to solve each NLP task individually and vary the embeddings used. In future work, we would like to experiment with a multi-task setup wherein tasks with less training data can benefit significantly from those having abundant labelled data, since most code-switched datasets are small and difficult to annotate. We experiment with datasets having varied amounts of code-switching and from different domains, and show that some tasks, such as LID and POS tagging, are relatively easier to solve, while tasks such as QA and NLI have low accuracies. We would like to add more diverse tasks and language pairs to the GLUECoS benchmark in a future version.
All the datasets used in the GLUECoS benchmark are publicly available, and we plan to make the NLI dataset available for research use. We hope that this will encourage researchers to test multilingual, cross-lingual and code-switched embedding techniques and models on this benchmark.

A Additional Dataset Details
For each dataset wherein the training, development and test splits are not provided, we create balanced custom splits in an 8:1:1 ratio.
For En-Hi datasets, where the corpus is originally in the Roman script and is not language tagged, we use the LID tool provided by (Rijhwani et al., 2017) to obtain language tags.
In cases where language tags are provided, we convert Roman Hindi words to Devanagari using an off-the-shelf transliterator.

B Additional Training Details
We conduct each experiment for 5 random seed values and report the average of the results obtained.

B.1 Word-Level Tasks
For the word-level tasks, including Language Identification, Named Entity Recognition and Part of Speech tagging, we make use of the sequence labeler. This implements a BiLSTM with a CRF layer on top, as described in Lample et al. (2016). We use the Adadelta optimizer with a learning rate of 1.0, dropout of 0.5 and a batch size of 32. We run the model for a maximum of 20 epochs and stop early if the model selection metric, the F1 score on the validation set, shows no improvement for 5 consecutive epochs. The dimension of the word embeddings is 300.
We make use of the transformers library for the mBERT experiments. We use the AdamW optimizer with a learning rate of 5e-5, epsilon of 1e-8, and a batch size of 32, as suggested by Devlin et al. (2018). We train for 5 epochs.
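The stopping rule for the sequence labeler can be sketched as a small loop. This is an illustration of patience-based early stopping with the limits quoted above (at most 20 epochs, patience of 5), not the sequence labeler's actual code; the function name and the `validation_f1_for_epoch` callback, which stands in for "train one epoch, then evaluate on the validation set", are assumptions made for the example.

```python
def train_with_early_stopping(validation_f1_for_epoch, max_epochs=20, patience=5):
    """Run up to max_epochs; stop once F1 has not improved for `patience` epochs in a row.

    Returns (best_epoch, best_f1, last_epoch_run).
    """
    best_f1, best_epoch, stale = float("-inf"), None, 0
    for epoch in range(1, max_epochs + 1):
        f1 = validation_f1_for_epoch(epoch)   # hypothetical: train one epoch, then validate
        if f1 > best_f1:
            best_f1, best_epoch, stale = f1, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_epoch, best_f1, epoch


# Toy validation curve (hypothetical F1 scores per epoch).
curve = [0.50, 0.60, 0.70, 0.69, 0.68, 0.70, 0.65, 0.60, 0.90, 0.95]
result = train_with_early_stopping(lambda epoch: curve[epoch - 1], max_epochs=len(curve))
```

On the toy curve, the best score is reached at epoch 3 and training halts at epoch 8, before the later improvement is seen; this is the usual trade-off of a fixed patience value.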

B.2 Sentence-Level Tasks
For the sentence-level tasks (Sentiment Classification) we implement a BiLSTM with one hidden layer of dimension 256. We apply a dropout of 0.5, and use the Adam optimizer with a learning rate of 0.001 and an epsilon value of 1e-8. We use a batch size of 64 and train for a maximum of 15 epochs, stopping if the validation accuracy drops for 3 consecutive epochs. The dimension of the word embeddings is 300.

B.3 Sentence-Pair Tasks
For the embedding evaluations on the QA task, we make use of the BiDAF architecture as proposed in Seo et al. (2016). We keep the default training hyperparameters, which include a learning rate of 0.5, a batch size of 1, 5 training epochs, a maximum context length of 400 tokens and a maximum question length of 50 tokens. We make use of the SQuAD training script and the XNLI training script of the Transformers library with its default hyperparameters for the mBERT experiments.

C BERT LM fine-tuning
We take the BERT model released by Google (bert-base-multilingual-cased) and fine-tune it for masked language modeling on two types of code-mixed datasets.
We use a curriculum wherein the model is first trained on generated code-mixed data (gCM) for 10 epochs and then on real code-mixed data (rCM) for 10 epochs.
For English-Hindi, the details of the datasets are as follows:
• 2M gCM sentences generated from the parallel corpus by Kunchukuttan et al. (2018)
• 93k rCM sentences from the corpora by Chandu et al. (2018b)
For English-Spanish, the details of the datasets are as follows:
• 8M gCM sentences generated from the corpus by Rijhwani et al. (2017)