Detecting Entailment in Code-Mixed Hindi-English Conversations

The presence of large-scale corpora for Natural Language Inference (NLI) has spurred deep learning research in this area, though much of this research has focused solely on monolingual data. Code-mixing is the intertwined usage of multiple languages, and is commonly seen in informal conversations among polyglots. Given the rising importance of dialogue agents, it is imperative that they understand code-mixing, but the scarcity of code-mixed Natural Language Understanding (NLU) datasets has precluded research in this area. The dataset by Khanuja et al. for detecting conversational entailment in code-mixed Hindi-English text is the first of its kind. We investigate the effectiveness of language modeling, data augmentation, translation, and architectural approaches to address the code-mixed, conversational, and low-resource aspects of this dataset. We obtain an 8.09% increase in test set accuracy over the current state of the art.


Introduction
Natural Language Inference (NLI) is a widely researched NLP task which involves determining if a premise entails or contradicts a hypothesis. The performance of machine learning models on this task has important implications for other Natural Language Understanding tasks such as Question Answering, Semantic Search and Text Summarization. While large corpora such as SNLI (Bowman et al., 2015) and MultiNLI  are available for monolingual and cross-lingual NLI, Khanuja et al. (2020a) introduce the first NLI dataset with Hindi-English (Hinglish) text. We refer to this dataset as CS-NLI.
Code-mixing is a phenomenon prevalent in multilingual communities (Claros and Isharianty, 2009). It poses a number of interesting challenges for NLP applications, such as the mixing of units from multiple grammar systems, morphological differences between monolingual and code-mixed text due to the intermixing of affixes, and non-standard transliteration between the writing systems involved. In CS-NLI, Hindi is present in a non-standard Romanized form. Multilingual speakers most often code-mix in informal settings such as social media, in-person, and telephonic conversations, due to which there is a dearth of clean, large-scale code-mixed corpora such as Wikipedia articles and books that can be used for pre-training, making this a low-resource task. Khanuja et al. (2020a) leverage Bollywood movie scripts containing Hinglish text to create CS-NLI, with conversations as premises. The creation of hypotheses based on dialogue-like premises transforms the task from one of textual entailment to one of conversational entailment. The inclusion of scripts from multiple movies makes this data inherently noisy due to non-standard Romanization of Hindi, the variation in dialects across movies, and differing grammar styles among Hinglish speakers.
In this work, we explore and analyze a variety of techniques to leverage existing pre-trained models such as BERT (Devlin et al., 2019) for processing code-mixed and conversational text. We present a comparison of linguistic, data augmentation, and architectural approaches to conversational entailment in code-mixed text. We show that multiple techniques interestingly yield similar results, while also beating the current state of the art. The code for the approaches described in this paper will be made available on GitHub.

Related Work
NLI for monolingual and cross-lingual text is a well-researched task that has been addressed using a variety of techniques including neural networks, symbolic logic and knowledge bases (Bowman et al., 2015). The use of transformer models such as BERT and RoBERTa (Liu et al., 2019), pre-trained on large monolingual corpora, has advanced the state of the art on the SNLI and MultiNLI datasets. While unsupervised pre-training of deep learning models has been shown to improve performance on a variety of NLP tasks, the limited amount of data available precludes large-scale pre-training on code-mixed text. Multilingual BERT (mBERT) (Devlin et al., 2019) is pre-trained on monolingual Wikipedia corpora from 104 languages, including Hindi in its original Devanagari script. XLM-RoBERTa (XLM-R) (Conneau et al., 2020) is trained on the CommonCrawl corpus, which includes Romanized Hindi text, making this model the closest one to being pre-trained on Hinglish. Khanuja et al. (2020a) introduce a dataset spurring two challenging directions of research: NLI for code-mixing, and conversational entailment. The dataset contains 2,240 unique code-mixed premise-hypothesis pairs and their corresponding labels, with an 80:20 train-test split. We tackle the binary classification task of assigning an ENTAILMENT label if the premise entails the hypothesis and a CONTRADICTION label if the premise contradicts the hypothesis. Premises are in the form of multiple utterances from a conversation, with each utterance preceded by the name of the speaker. For example:
Premise (Code-Mixed): RAHUL : Tumhara scooter aur ek joota security guard ko lobby mein mila . ## RIANA : Thank god !!
Premise (Translation): RAHUL : The security guard found your scooter and one shoe in the lobby. ## RIANA : Thank god !!
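Premises like the one above follow a simple structure: speaker-prefixed utterances separated by "##". As an illustrative sketch (not part of the dataset's official tooling), such a premise can be parsed into turns as follows:

```python
def parse_premise(premise: str):
    """Split a CS-NLI premise into (speaker, utterance) turns.

    Turns are separated by '##' and each turn has the form
    'SPEAKER : utterance'.
    """
    turns = []
    for turn in premise.split("##"):
        speaker, _, utterance = turn.strip().partition(":")
        turns.append((speaker.strip(), utterance.strip()))
    return turns

premise = ("RAHUL : Tumhara scooter aur ek joota security guard "
           "ko lobby mein mila . ## RIANA : Thank god !!")
```

This turn structure is what the conversational approaches in Section 4 operate on.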

Methodology
Given the success of pre-trained models on other NLI tasks, we tackle this task by fine-tuning BERT, mBERT and XLM-R for sentence-pair classification. Due to the scarcity of examples in CS-NLI, we focus our efforts on the modification and augmentation of the data used to fine-tune these models. In this section, we describe techniques to address the code-mixed, low-resource, and conversational aspects of the task.

Addressing Code-Mixing
We use approaches such as language modeling, transliteration, and translation to alleviate the absence of code-mixing in the data used to pre-train transformer models.
Masked Language Modeling: We fine-tune mBERT on the masked language modeling objective, following Khanuja et al. (2020b), on a combination of in-domain code-mixed movie scripts and publicly available datasets by Roy et al. (2013) and Bhat et al. (2018) to obtain modified mBERT (mod-mBERT), which is then fine-tuned on the sentence-pair classification task.
Transliteration: We perform token-level language identification and transliterate the detected Romanized Hindi words in CS-NLI to Devanagari script using the approach in Singh et al. (2018), enabling mBERT to better represent them.
Translation: Due to the difficulty of training code-mixed to monolingual translation models, we follow the approach of Dhar et al. (2018) to obtain translations. We first transliterate the Romanized Hindi words, and then translate English phrases to Hindi using the Google Cloud Translation API.
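The masked language modeling objective can be illustrated in isolation. The sketch below applies BERT-style masking (Devlin et al., 2019), replacing roughly 15% of tokens with a [MASK] symbol, to a whitespace-tokenized Hinglish sentence; the helper is our own illustration of the objective, not the exact training code:

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=1):
    """BERT-style masking: replace ~15% of tokens with [MASK].

    Returns (masked_tokens, labels), where labels holds the original
    token at masked positions (the model must predict these) and
    None elsewhere.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

tokens = "tumhara scooter lobby mein mila".split()
masked, labels = mask_tokens(tokens, seed=1)
```

In practice, masking is applied to WordPiece tokens and the model is trained to recover the masked positions from context.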

Addressing the Low-Resource Aspect
Due to the limited amount of code-mixed NLI data available for fine-tuning, we augment CS-NLI with 4000 monolingual entailment and contradiction examples sampled from the SNLI, XNLI (Conneau et al., 2018), and MPE (Lai et al., 2017) datasets. Transliterations of Devanagari Hindi sentence-pairs from the XNLI dataset provide additional NLI data in Romanized Hindi while SNLI examples do the same in English. The MPE dataset adds examples requiring aggregation of information across sentences (Lai et al., 2017).
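A minimal sketch of this augmentation step, assuming each corpus is a list of (premise, hypothesis, label) triples already filtered to the entailment and contradiction labels; the sampling is random, matching our setup:

```python
import random

def augment(cs_nli, external_corpora, n_extra=4000, seed=42):
    """Augment CS-NLI with n_extra examples drawn at random from
    external monolingual NLI corpora (e.g. SNLI, XNLI, MPE)."""
    rng = random.Random(seed)
    pool = [ex for corpus in external_corpora for ex in corpus]
    extra = rng.sample(pool, min(n_extra, len(pool)))
    augmented = cs_nli + extra
    rng.shuffle(augmented)  # mix code-mixed and monolingual examples
    return augmented
```

The shuffle ensures that fine-tuning batches interleave code-mixed and monolingual examples rather than seeing them in separate phases.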

Approaches to Conversational NLI
Each premise in CS-NLI contains turns in the form "Speaker Name: Utterance". Khanuja et al. (2020a) show that a number of hypotheses require an understanding of the transition between speakers, in addition to the meaning of the utterance itself. To estimate whether BERT understands the role of speakers, we remove the speaker names preceding each utterance and fine-tune the models on CS-NLI. We find that accuracy does not deteriorate, indicating that the BERT models may benefit from reinforcing speaker roles.
Data Augmentation with Speaker Names: Following the analysis of Khanuja et al. (2020a), we augment the training data with examples that reinforce speaker roles, such as role-swapped variants of existing examples, and train with a weighted cross-entropy loss.
Utterance Representations using BERT: The premises in CS-NLI contain multiple turns of a conversation. Since BERT is commonly used for single-sentence representations, we encode each turn separately using mod-mBERT. We obtain utterance representations from mod-mBERT and pass them through a bidirectional LSTM (biLSTM). We concatenate the initial and final hidden states of the biLSTM with the mod-mBERT encoding of the hypothesis, and pass them through an MLP with two linear layers to obtain a classification output.
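The utterance-level architecture can be sketched in PyTorch. Here the mod-mBERT encodings are stubbed with pre-computed 768-dimensional vectors, since in the actual model each turn and the hypothesis are first encoded by mod-mBERT; the hidden size and MLP width below are illustrative choices, not our tuned values:

```python
import torch
import torch.nn as nn

class UtteranceEntailmentClassifier(nn.Module):
    """biLSTM over per-utterance encodings, concatenated with the
    hypothesis encoding and passed through a two-layer MLP."""

    def __init__(self, enc_dim=768, hidden=256, n_classes=2):
        super().__init__()
        self.bilstm = nn.LSTM(enc_dim, hidden, batch_first=True,
                              bidirectional=True)
        # first + last biLSTM outputs (2*hidden each) + hypothesis
        self.mlp = nn.Sequential(
            nn.Linear(4 * hidden + enc_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, utt_encs, hyp_enc):
        # utt_encs: (batch, n_turns, enc_dim); hyp_enc: (batch, enc_dim)
        out, _ = self.bilstm(utt_encs)
        first, last = out[:, 0, :], out[:, -1, :]
        features = torch.cat([first, last, hyp_enc], dim=-1)
        return self.mlp(features)
```

Taking the first and last biLSTM outputs is one reading of "initial and final hidden states"; the classifier head then scores ENTAILMENT versus CONTRADICTION.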

Experimental Setup
In the majority of our approaches, we fine-tune BERT, mBERT, mod-mBERT (110M parameters), and XLM-R (550M parameters) for 1 to 6 epochs on an Nvidia GeForce GTX 1070 GPU. We experiment with batch sizes of 8, 16, and 32, and learning rates between 1e-5 and 5e-5, and report results using a batch size of 8 and a learning rate of 1e-5.

Results and Analysis
On fine-tuning the BERT models on CS-NLI, we observe a large variation in the results depending on the subset of data used for evaluation, as demonstrated in Table-2. To address this variation, we perform eight-fold cross-validation with early stopping, and report the mean and standard deviation of the accuracies across the eight splits. These results are shown in Table-3. We evaluate the models with the highest cross-validation accuracy on the test set and report the corresponding results. In the remainder of this section, we provide qualitative and quantitative analysis of our approaches; the qualitative analysis is performed on the cross-validation splits.
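The eight-fold cross-validation can be sketched as follows, assuming the training examples are held in a list; the data is shuffled once and each fold holds out one eighth for evaluation:

```python
import random

def k_fold_splits(examples, k=8, seed=7):
    """Yield (train, val) splits for k-fold cross-validation."""
    rng = random.Random(seed)
    data = list(examples)
    rng.shuffle(data)
    fold_size = len(data) // k
    for i in range(k):
        start = i * fold_size
        end = (i + 1) * fold_size if i < k - 1 else len(data)
        val = data[start:end]            # held-out eighth
        train = data[:start] + data[end:]
        yield train, val
```

Each split is fine-tuned independently with early stopping on its validation fold, and the reported accuracy is averaged over the eight folds.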

Comparison of Pre-Trained Models
The majority of Hindi words in the NLI dataset are out of vocabulary for BERT. Nevertheless, it obtains a high cross-validation accuracy of 61.11%. We believe it achieves this by tuning the embeddings of WordPiece tokens of both Hindi and English text present in the dataset. To verify that it does not learn only from in-vocabulary English words, we fine-tune BERT after removing the words identified as Hindi, and find that its performance deteriorates sharply.
The benefit of mBERT's multilingual pre-training seems to be lost on CS-NLI due to the script mismatch between the Devanagari Hindi used to pre-train mBERT and the Romanized Hindi in CS-NLI. mod-mBERT performs better than BERT and mBERT due to its enhanced understanding of Hinglish. We believe that fine-tuning on in-domain movie scripts increases mBERT's understanding of conversational code-mixed text, while the inclusion of code-mixed text from other sources enables it to better understand non-standard Romanization.
Although XLM-R is the only model which contains Romanized Hindi in its pre-training data, the model does not converge when fine-tuned on just CS-NLI. However, on augmentation with monolingual NLI examples, there is a large improvement in performance, as shown in Table-3. The output of XLM-R's tokenizer shows that many of the Romanized Hindi words are in the model's vocabulary, in contrast to BERT and mBERT, where the words get broken into multiple WordPiece tokens. Despite this fact, the model is unable to fit the training data even with an extensive hyper-parameter search, leading us to hypothesize that larger amounts of data are required for fine-tuning XLM-R. However, the performance of this model on code-mixed datasets bears further investigation.

Transliteration and Translation
Manual inspection shows that errors in language identification and transliteration result in noisy translated and transliterated versions of the data, hurting performance. However, we find that augmenting the original training set with its translations allows the model to learn from both the code-mixed and monolingual forms of the same examples.

Data Augmentation
Although the SNLI, XNLI and MPE datasets contain monolingual examples of textual, nonconversational entailment, augmenting the data with examples from these datasets improves the performance of the models. We believe this is because the addition of these examples aids their general understanding of entailment. The mismatch between the nature of the entailment tasks poses the question of whether there exists an optimal subset and quantity of external data for augmentation. We were unable to find a correlation between the performance and number of external examples added. Finding the categories, if any exist, of examples that are most helpful to the model is challenging. Possible strategies include selection based on length, language complexity, dialect, and domain similarity in the case of Hindi XNLI data. In this work, however, we take a random sample of examples from these corpora.
Since each of these augmentation techniques improve the performance of the model, we augment CS-NLI with different combinations of the datasets, shown in Table-3. We observe an improvement, although it is not proportional to that of the individual augmentations.

Utterance Representations Using BERT
Separating utterance representations performs worse than the majority of our approaches. The addition of biLSTM layers over the BERT model introduces a large number of uninitialized parameters. We believe that the scarcity of data available to train these parameters leads to its poor performance. Further, the lack of an attention mechanism between utterances and the hypothesis may also pose a problem.
Khanuja et al. (2020a) provide an analysis of the various kinds of examples present in CS-NLI. We attempt to discern similarities in the examples that the various models predict incorrectly in order to better address these classes of examples. We analyze various statistical properties of the premises, such as their length, the number of turns in the conversation, and the number of distinct speakers, and observe no correlation between these properties and the correctness of the model's predictions. While the complexity of the Hindi and English vocabulary used may make some code-mixed examples more difficult than others, automatically identifying such differences is difficult.

Qualitative Analysis
McCoy et al. (2019) show that most neural models, including BERT, are expected to accurately predict examples involving negation, role swapping, paraphrasing, and numerical changes, such as those shown in Khanuja et al. (2020a). However, cross-lingual paraphrasing and negation in CS-NLI make even these otherwise simple examples hard to classify in code-mixed settings.
We evaluate the ability of BERT models to recognize role-swapping by generating examples of this nature. We find that mod-mBERT trained only on CS-NLI predicts just 19% of these examples correctly, whereas a model trained using the speaker-name data augmentation technique described in Section-4.3, with weighted cross-entropy loss, achieves 87% accuracy on these examples, substantiating this approach.
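Such role-swap probes can be generated mechanically. The sketch below is our illustration of how a probe could be produced from a hypothesis mentioning the two speakers of a two-party premise, not the exact generation procedure; swapping the names exchanges the speaker roles, which typically flips the example's label:

```python
def swap_roles(hypothesis, speaker_a, speaker_b):
    """Swap two speaker names in a hypothesis to create a
    role-swapped probe example."""
    placeholder = "\x00"  # temporary marker to avoid double-replacement
    return (hypothesis.replace(speaker_a, placeholder)
                      .replace(speaker_b, speaker_a)
                      .replace(placeholder, speaker_b))

hyp = "Rahul found Riana's scooter in the lobby"
```

For instance, `swap_roles(hyp, "Rahul", "Riana")` yields a hypothesis in which the roles of the two speakers are exchanged.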

Performance on the Test Set
The accuracy of mBERT with early stopping is 6% higher than the baseline. mod-mBERT shows the best performance, with an accuracy 8% higher than the baseline, while the augmentation and modification approaches appear to reduce the performance of the model. We attribute the large difference between the test set and cross-validation accuracies to the sensitivity of the models to different splits of the dataset, as demonstrated in Table-2.

Conclusion
Our results show that there is a long way to go in NLP for code-mixed language tasks. Even with standard techniques such as multilingual language modeling and data augmentation, our results still lag behind those on equivalent tasks in high-resource settings. Although this dataset contains higher-level challenges such as sarcasm detection that are not yet solved even in high-resource languages, even phenomena such as negation, role swapping, and paraphrasing become challenging due to code-mixing.
Code-mixed language pairs can be thought of as a separate language (Sitaram et al., 2019), and perhaps large-scale pre-training on code-mixed data would be able to push the boundaries of code-mixed interpretation, as has been the case with high-resource languages.