CodemixedNLP: An Extensible and Open NLP Toolkit for Code-Mixing

The NLP community has witnessed steep progress in a variety of tasks across the realms of monolingual and multilingual language processing recently. These successes, in conjunction with the proliferating mixed-language interactions on social media, have boosted interest in modeling code-mixed texts. In this work, we present CodemixedNLP, an open-source library with the goals of bringing together the advances in code-mixed NLP and opening it up to a wider machine learning community. The library consists of tools to develop and benchmark versatile model architectures that are tailored for mixed texts, methods to expand training sets, techniques to quantify mixing styles, and fine-tuned state-of-the-art models for 7 tasks in Hinglish. We believe this work has the potential to foster a distributed yet collaborative and sustainable ecosystem in an otherwise dispersed space of code-mixing research. The toolkit is designed to be simple, easily extensible, and useful to researchers and practitioners alike. Demo: https://bit.ly/3rzOcWb and Library: github.com/murali1996/CodemixedNLP


Introduction
Code-mixing refers to the fluid alternation between two or more languages in a given utterance. This phenomenon is ubiquitous and natural in multilingual communities, and is highly prevalent on social media platforms. Developing tools that can comprehend mixed texts has a multitude of advantages, ranging from socially responsible NLP applications, such as moderating abusive content on social media, to improving the naturalness of ubiquitous technologies such as conversational AI assistants, and further to enabling socio-cultural studies of human cognition, such as why and when people code-mix.
NLP tools for monolingual and multilingual language processing have rapidly progressed in the past few years, thanks to transformer-based models such as Multilingual BERT (Devlin et al., 2019) and XLM-RoBERTa (Conneau et al., 2020) and their pretraining techniques. On various mixed datasets, recent studies have shown that adopting multilingual pretrained models can outperform their earlier deep learning counterparts (Pires et al., 2019; Khanuja et al., 2020; Chakravarthy et al., 2020; Jayanthi and Gupta, 2021). While this looks promising for multilingual processing, the gains do not directly carry over to code-mixing. Hence, a critical investigation is required to identify generalizable modeling strategies that enhance performance on mixed texts (Winata et al., 2021). At the same time, practitioners who need an off-the-shelf tool for a downstream mixed-text application (e.g., sentiment or language identification) currently have to resort to monolingual toolkits such as NLTK, Flair, IndicNLP, and iNLTK. On the other hand, while there have been several episodic works on mixed-text processing, such as proposing novel datasets, shared tasks, or training strategies, there have been few initiatives to collate these resources into a common setting; doing so can benefit both researchers and practitioners, thereby accelerating NLP for mixed texts.
In this work, we address some of these shortcomings by creating an extensible and open-source toolkit for a variety of semantic and syntactic NLP applications in mixed languages. Our toolkit offers:
• simple plug-and-play command line interfaces with fine-grained control over inputs, models, and tasks, for developing, quantifying, benchmarking, and re-using versatile model architectures tailored for mixed texts (§ 2.1, § 2.2, § 2.3)
• an easy-to-use, one-stop interface for a variety of data augmentation techniques, including transliteration, spelling variations, expansion with monolingual corpora, etc., leveraging a collation of publicly available tools
• a toolkit library to import fine-tuned and ready-to-use models for 7 different tasks in Hinglish, along with an easy-to-setup web interface wrapper based on a Flask server (§ 4)
We believe the fine-grained plug-and-play interfacing of the toolkit can serve a multitude of purposes in both academia and industry. Such fine control over the individual components of a model can enable accelerated experimentation with different model architectures, such as multi-tasking, representation fusion, and language-informed modeling.
This, in turn, improves our understanding of how to utilize pretrained transformer models for mixed datasets. In addition, our toolkit offers metrics to quantify code-mixing, such as the Code-Mixing Index and Language Entropy, which can be used to find peculiarities of low-performing subsets.
Although code-mixing is widely prevalent on social media, the resulting text is riddled with non-standard spellings, mixed scripts, and ill-formed sentences. To combat this, our toolkit offers techniques to augment training sets with multiple views of each input corresponding to these variations.
Among many potential applications, we first demonstrate our toolkit's utility in benchmarking (§ 3). In addition, we publish state-of-the-art models for different NLP tasks in Hinglish and wrap them into a command line / deployable web interface (§ 4). Our toolkit is easily extensible: practitioners can incorporate new pretrained as well as fine-tuned models, include text processors such as tokenizers, transliterators, and translators, and add wrappers on existing methods for downstream NLP applications.

Toolkit
Our toolkit is organized into components as depicted in Figure 1.
In a nutshell, an end-to-end model architecture consists of one or more encoder components, a component for combining encodings, and one or more adaptor plus task components.

Input Embeddings
Multi-view Integration: Tokens in mixed texts often manifest in cross-script and mixed forms, which we refer to as views. This motivates integrating text representations in varied forms, such as transliterated, translated, and script-normalized text, as well as tokens belonging to one of the participating languages. Especially in the context of pretrained multilingual models, this technique amounts to extracting a holistic representation of a mixed text. To this end, the toolkit facilitates combining representations from different views of an input. Text Tokenization: Motivated by recent related work on using different word-level and sub-word-level embeddings (Winata et al., 2019), our toolkit offers different tokenization methods for encoding text. Among the encoders available in our toolkit (§ 2.2), pretrained transformer-based encoders can either use their default tokenization technique (i.e., subwords) or a character-CNN architecture (Boukkouri et al., 2020). LSTM-based models can take inputs in the form of tensor representations, e.g., word-level FastText (Bojanowski et al., 2017) or semi-character (Sakaguchi et al., 2017) representations, or character-level representations, e.g., a char-BiLSTM.
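Below is a minimal sketch (not the toolkit's exact interface) of producing and tokenizing two views of a mixed input; the Devanagari view is written by hand here, standing in for the output of a transliterator such as indic-trans.

```python
from transformers import AutoTokenizer

# Two "views" of the same Hinglish input; the Devanagari view is hand-written
# here for illustration and would normally come from a transliterator.
views = {
    "romanized": "yeh movie bahut achhi thi",
    "devanagari": "यह movie बहुत अच्छी थी",
}

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

for name, text in views.items():
    # Default subword tokenization used by pretrained transformer encoders
    print(name, tokenizer.tokenize(text))

# Character-level tokenization, e.g., for a char-BiLSTM or character-CNN encoder
print(list(views["romanized"]))
```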

Tag-Informed Modeling:
Studies in the past have shown the usefulness of language tag-aware modeling for mixed and cross-lingual texts (Chandu et al., 2018; Lample and Conneau, 2019). However, its usefulness in the context of pretrained models and code-mixing has not been thoroughly investigated. To this end, our toolkit offers a more generalized method to conduct tag-aware fine-tuning, wherein representations for different kinds of tags can be added to the text representations. Examples of such tags include POS tags, language IDs, etc.
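A minimal sketch of this idea follows, with module and argument names of our own choosing rather than the toolkit's:

```python
import torch
import torch.nn as nn

class TagInformedLayer(nn.Module):
    """Adds a learned tag embedding (e.g., language ID or POS) to each token
    representation; a sketch of tag-aware fine-tuning, not the toolkit's code."""
    def __init__(self, hidden_dim, num_tags):
        super().__init__()
        self.tag_embedding = nn.Embedding(num_tags, hidden_dim)

    def forward(self, token_reprs, tag_ids):
        # token_reprs: (batch, seq_len, hidden_dim), e.g., transformer outputs
        # tag_ids:     (batch, seq_len), one tag per token
        return token_reprs + self.tag_embedding(tag_ids)

# e.g., 3 language tags: {0: "hi", 1: "en", 2: "other"}
layer = TagInformedLayer(hidden_dim=768, num_tags=3)
out = layer(torch.randn(2, 6, 768), torch.randint(0, 3, (2, 6)))
```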

Models
Encoders: An Encoder in our toolkit can be a transformer-based or BiLSTM-based architecture. Specifically, for the former we utilize pretrained models from the HuggingFace library (Wolf et al., 2020), and the latter is implemented in PyTorch (Paszke et al., 2019).

Representation Fusion:
Encodings from different encoders can be combined and, if required, augmented with (non-trainable) representations before passing through an adaptor. To combine encodings, one can either simply concatenate them or compute a (trainable) weighted average, a more parameter-efficient choice than the former. Both choices are available in our toolkit. Adaptors: An adaptor is a task-specific neural layer; currently, BiLSTM and Multi-Layer Perceptron (MLP) choices are available as part of our toolkit. The input to an adaptor is the fused representation if multiple encoders are specified, or the output of the single encoder otherwise. These adaptors serve as task-specific learnable parameters.
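The weighted-average variant can be sketched as follows (an illustration assuming 768-dimensional pooled encoder outputs, not the toolkit's exact implementation):

```python
import torch
import torch.nn as nn

class WeightedAverageFusion(nn.Module):
    """Trainable softmax-weighted average of same-sized encodings; unlike
    concatenation, it does not widen the input of the downstream adaptor."""
    def __init__(self, num_encoders):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_encoders))

    def forward(self, encodings):
        w = torch.softmax(self.weights, dim=0)
        return sum(wi * enc for wi, enc in zip(w, encodings))

# Stand-ins for pooled outputs of two encoders (e.g., mBERT and XLM-R, both
# 768-dimensional), followed by a simple MLP adaptor.
pooled_a, pooled_b = torch.randn(4, 768), torch.randn(4, 768)
fusion = WeightedAverageFusion(num_encoders=2)
adaptor = nn.Sequential(nn.Linear(768, 256), nn.ReLU())
task_input = adaptor(fusion([pooled_a, pooled_b]))
```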

Multitasking:
Multi-task learning can help models pick up relevant cues from one task and apply them to another. Such a setting was previously investigated in the context of mixed texts and showed promising improvements (Chandu et al., 2018). Furthermore, it has been shown in monolingual NLP that incorporating explicit semantics as an auxiliary task can enhance BERT's performance (Zhang et al., 2020). Motivated by these, our toolkit supports training on one or more tasks jointly. Once a final representation is produced by each task's adaptor, we use a training criterion to compute the loss and perform gradient backpropagation.
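A sketch of such a joint step (the dict-based interface is illustrative, not the toolkit's API):

```python
import torch

def multitask_step(encoder, adaptors, criteria, batch, optimizer):
    """One joint training step over tasks sharing an encoder; `adaptors` and
    `criteria` map task names (e.g., "sentiment", "langid") to modules."""
    shared = encoder(batch["inputs"])  # shared representation for all tasks
    # Sum the per-task losses computed from each task's adaptor output
    loss = sum(criteria[task](adaptors[task](shared), batch[task])
               for task in adaptors)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```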

Tasks
Tasks: Our toolkit currently supports two kinds of tasks-sentence-level text classification and word-level sequence tagging, the flow for each is demonstrated in Figure 1. The decoupled design of our toolkit helps in seamlessly creating multi-task training setups. The kinds of tasks for which we offer support currently are listed in Table 1.
Adaptive Pretraining: Following the successes of task-adaptive and domain-adaptive pretraining in monolingual and multilingual NLP tasks (Gururangan et al., 2020), users of our toolkit can also perform such adaptive pretraining with mixed texts on top of pretrained transformer-based models.
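Using the HuggingFace Trainer, such continued masked-language-model pretraining can be sketched as follows; the corpus path is a placeholder for a file with one mixed sentence per line.

```python
import torch
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# "hinglish_corpus.txt" is a placeholder: one mixed sentence per line.
lines = [l.strip() for l in open("hinglish_corpus.txt") if l.strip()]
encodings = tokenizer(lines, truncation=True, max_length=128)

class MixedTextDataset(torch.utils.data.Dataset):
    def __init__(self, enc): self.enc = enc
    def __len__(self): return len(self.enc["input_ids"])
    def __getitem__(self, i): return {k: v[i] for k, v in self.enc.items()}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mbert-dapt", num_train_epochs=1),
    train_dataset=MixedTextDataset(encodings),
    # Random 15% token masking, the standard MLM objective
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                                  mlm_probability=0.15),
)
trainer.train()
```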

Codemixed Quantification
Our toolkit offers 5 standardized metrics for quantifying mixing in text, namely Code-Mixing Index (Gambäck and Das, 2014), Average switch-points (Khanuja et al., 2020), Multilingual Index, Probability of Switching and Language Entropy (Guzmán et al., 2017). We offer simple command line methods to compute these metrics and also offer metric-based data sampling.
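For instance, the Code-Mixing Index and Language Entropy can be computed from per-token language tags as below (a sketch following the cited formulations; the toolkit's own implementation may differ in tag conventions):

```python
import math
from collections import Counter

def code_mixing_index(lang_tags):
    """Code-Mixing Index (Gambäck and Das, 2014): 0 for monolingual text,
    higher for more evenly mixed text. Tags 'other'/'univ' are treated as
    language-independent tokens."""
    counts = Counter(lang_tags)
    n = len(lang_tags)
    u = counts.pop("other", 0) + counts.pop("univ", 0)
    if n == u:  # no language-specific tokens
        return 0.0
    return 100.0 * (n - u - max(counts.values())) / (n - u)

def language_entropy(lang_tags):
    """Language Entropy (Guzmán et al., 2017): entropy (in bits) of the
    distribution of languages over tokens."""
    counts = Counter(t for t in lang_tags if t not in ("other", "univ"))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(code_mixing_index(["hi", "hi", "en", "hi", "other"]))  # 25.0
print(language_entropy(["hi", "hi", "en", "hi", "other"]))   # ~0.81
```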

Data Augmentation
Our toolkit also offers techniques for data augmentation. While data augmentation is useful where training data is scarce, for mixed datasets it is also essential for producing a more generalized model. As part of this feature, the toolkit currently offers augmentation through transliteration, spelling variations, and monolingual corpora. We currently support transliteration of Indic languages through an off-the-shelf tool, indic-trans (Bhat et al., 2015). Spelling-variation augmentation introduces spelling noise, such as randomly removing or replacing vowel characters. Monolingual corpora augmentation is task-specific: for a given task, such as sentiment classification, we add publicly available monolingual corpora of the same task type, from one or all of the participating languages, and use them while fine-tuning models.
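The spelling-variation component, for example, can be sketched as a simple character-level noiser (an illustration of the idea; the toolkit's exact noising scheme may differ):

```python
import random

VOWELS = "aeiou"

def vowel_noise(text, p=0.1, seed=None):
    """Randomly remove or replace vowel characters with probability p,
    mimicking informal romanized spellings (e.g., 'bahut' -> 'bht')."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch.lower() in VOWELS and rng.random() < p:
            if rng.random() < 0.5:
                continue                    # remove the vowel
            out.append(rng.choice(VOWELS))  # or replace it with another vowel
        else:
            out.append(ch)
    return "".join(out)

print(vowel_noise("yeh movie bahut achhi thi", p=0.3, seed=13))
```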

Data Format
Due to the diverse data formats of existing mixed datasets, benchmarking and comparing results across tasks is not readily feasible. To this end, we propose a standardized data format for syntactic- and semantic-level understanding tasks as well as generation tasks, and our toolkit offers command line methods to adapt a user's dataset to this standard format.
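As an illustration only, a record in such a format could look like the following; the field names here are hypothetical, not the toolkit's actual schema:

```python
import json

# Hypothetical JSON-lines record covering classification and tagging tasks;
# see the repository for the actual standardized format.
record = {
    "uid": "sentimix-0001",                     # example identifier
    "text": "yeh movie bahut achhi thi",        # raw mixed text
    "langids": ["hi", "en", "hi", "hi", "hi"],  # word-level tags (tagging)
    "label": "positive",                        # sentence-level label (classification)
}
with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```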

Experiments
Among the many potential research applications of our toolkit, in this section we demonstrate one: benchmarking. Table 1 presents the performance of selected model architectures obtained using our toolkit on some popular mixed datasets. In Table 2, we also demonstrate the performance of different architectural choices implemented through our toolkit on two Hinglish datasets. For domain-adaptive pretraining on Hinglish, we collate around 160K mixed sentences from several publicly available Hinglish datasets. For task-adaptive pretraining, we use only the training and testing data available in the dataset of interest. For training, we use standard optimizers and model configurations.

Demo
We fine-tune and publish transformer-based models for 7 tasks in Hinglish. We include three task types: (1) Semantic (Sentiment Classification, Hate Speech and Aggression Identification), (2) Syntactic (NER, POS, and Language Identification), and (3) Generation (Hinglish→English Machine Translation). We present some examples of utilizing these models in Figure 2.
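A minimal Flask wrapper around one of these models could look like the sketch below; the model-loading calls are hypothetical, and the actual import path is documented in the CodemixedNLP repository.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical loading call; see the CodemixedNLP repository for the actual
# API used to import the published fine-tuned Hinglish models.
# from CodemixedNLP import load_model
# sentiment_model = load_model("hinglish-sentiment")

@app.route("/predict", methods=["POST"])
def predict():
    text = request.get_json()["text"]
    # prediction = sentiment_model.predict(text)
    prediction = "positive"  # placeholder output for this sketch
    return jsonify({"text": text, "prediction": prediction})

if __name__ == "__main__":
    app.run(port=5000)
```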

Conclusion
In this work, we presented a unified toolkit for modeling code-mixed texts. The toolkit additionally contains functionalities such as data augmentation, code-mixing quantification, and ready-to-use fine-tuned models for 7 different NLP tasks in Hinglish. Our toolkit is simple enough for practitioners to integrate new features into, as well as to develop wrappers around its existing functionalities. We believe this contribution facilitates a sustainable and extensible ecosystem, to which the community can add novel pretraining techniques tailored for mixed texts, text normalization techniques to counter spelling variations, error analysis tools to identify peculiarities in incorrect predictions, and so on.