Code-Switching Patterns Can Be an Effective Route to Improve Performance of Downstream NLP Applications: A Case Study of Humour, Sarcasm and Hate Speech Detection

In this paper, we demonstrate how code-switching patterns can be utilised to improve various downstream NLP applications. In particular, we encode various switching features to improve humour, sarcasm and hate speech detection tasks. We believe that this simple linguistic observation can also be potentially helpful in improving other similar NLP applications.


Introduction
Code-mixing/switching in social media has become commonplace. Over the past few years, the NLP research community has started to vigorously investigate various properties of such code-switched posts in order to build downstream applications. Hidayat (2012) demonstrated that inter-sentential switching is preferred over intra-sentential switching by Facebook users; further, while 45% of the switching was done for real lexical needs, 40% was for discussing a particular topic and 5% for content classification. In another study, Dey and Fung (2014) interviewed Hindi-English bilingual students and reported that 67% of the words were in Hindi and 33% in English. Recently, many downstream applications have been designed for code-mixed text. Han et al. (2012) attempted to construct a normalisation dictionary offline using the distributional similarity of tokens together with their string edit distance. Vyas et al. (2014) developed a POS tagging framework for Hindi-English data.
More nuanced applications such as humour detection, sarcasm detection and hate speech detection (Bohra et al., 2018) have been targeted for code-switched data in the last two to three years.

Motivation
The primary motivation for the current work is derived from (Vizcaíno, 2011), where the author notes: "The switch itself may be the object of humour". In fact, Siegel (1995) studied humour in the Fijian language and notes that when trying to be comical, or to convey humour, speakers switch from Fijian to Hindi. Humour here is thus produced by the change of code rather than by the referential meaning or content of the message. The paper also discusses similar phenomena observed in Spanish-English settings.
In a study of English-Hindi code-switching and swearing patterns on social networks (Agarwal et al., 2017), the authors show that when people code-switch, there is a strong preference for swearing in the dominant language. These studies together lead us to hypothesize that the patterns of switching might be useful in building various NLP applications.

The present work
To corroborate our hypothesis, in this paper we consider three downstream applications for Hindi-English code-switched data: (i) humour detection, (ii) sarcasm detection and (iii) hate speech detection (Bohra et al., 2018). We first provide empirical evidence that the switching patterns between native (Hindi) and foreign (English) words distinguish the two classes of a post, i.e., humorous vs non-humorous, sarcastic vs non-sarcastic, and hateful vs non-hateful. We then featurise these patterns and feed them into state-of-the-art classification models to show the benefits. We obtain macro-F1 improvements of 2.62%, 1.85% and 3.36% over the baselines on humour detection, sarcasm detection and hate speech detection respectively. As a next step, we introduce a modern deep neural model, the Hierarchical Attention Network (HAN; Yang et al., 2016), to improve the performance of the models further. Finally, we concatenate the switching features with the last hidden layer of the HAN and pass the result to the softmax layer for classification. This final architecture yields macro-F1 improvements of 4.9%, 4.7% and 17.7% over the original baselines on humour detection, sarcasm detection and hate speech detection respectively.

Dataset
We consider three datasets of Hindi (hi)-English (en) code-mixed tweets scraped from Twitter for our experiments: Humour, Sarcasm and Hate. We discuss the details of each of these datasets below.

Humour: The humour dataset was released by  and contains Hindi-English code-mixed tweets from domains such as 'sports', 'politics' and 'entertainment'. The dataset has a uniform distribution of tweets in each category to yield better supervised classification results (see Table 1), as described by (Du et al., 2014). Here the positive class refers to humorous tweets while the negative class corresponds to non-humorous tweets. Below are some representative examples from the data, with markers showing the start and the end of the humour component.
• women can crib on things like humour start bhaiyya ye shakkar bahot zyada meethi hai humour end, koi aur quality dikhao 1
• shashi kapoor trending on mothersday how apt, humour start mere paas ma hai humour end 2
• political journey of kejriwal, from humour start mujhe chahiye swaraj humour end to humour start mujhe chahiye laluraj humour end

Sarcasm: The sarcasm dataset released by  contains tweets carrying the hashtags #sarcasm and #irony. The authors used additional keywords such as 'bollywood', 'cricket' and 'politics' to collect sarcastic tweets from these domains. In this case, the dataset is heavily unbalanced (see Table 1). Here the positive class refers to sarcastic tweets and the negative class to non-sarcastic tweets. Some representative examples from our data show the points where the sarcasm starts and ends.

Hate speech: (Bohra et al., 2018) created the corpus using tweets posted online in the last five years that have a high propensity to contain hate speech (see Table 1). The authors mined tweets by selecting certain hashtags and keywords from 'politics', 'public protests', 'riots' etc. The positive class refers to hateful tweets while the negative class means non-hateful tweets 6 . An example of a hateful tweet showing the point of switch corresponding to the start and the end of the hate component:
• I hate my university, hate start koi us jagah ko aag laga dey hate end 7 .

Switching features
In this section, we outline the key contribution of this work. In particular, we identify how patterns of switching correlate with the tweet text being humorous, sarcastic or hateful. We outline a synopsis of our investigation below.

Switching and NLP tasks
In this section, we identify how switching behavior is related to the three NLP tasks at hand. Let Q be the property that a sentence has en words which are surrounded by hi words, i.e., there exists an English word in a Hindi context. For instance, the tweet koi hi to hi pray en karo hi mere hi liye hi bhi hi satisfies the property Q. However, bumrah hi dono hi wicketo hi ke hi beech hi gumrah hi ho hi gaya hi does not satisfy Q.

4 Gloss: said aib filthy pandit ji, whatever you are telling is it pure sanskrit? irony shameonyou.
5 irony bappi lahiri sings. Gloss: doesn't matter you do not get gold or silver, you have got a friend to love.
6 The dataset released by this paper only had the hate/non-hate tags for each tweet. However, the language tag for each word required for our experiments was not available. Two of the authors independently language-tagged the data and obtained an agreement of 98.1%. While language tagging, we noted that the dataset is a mixed bag including hate speech, offensive and abusive tweets, which have already been shown to be different in earlier works (Waseem et al., 2017). However, this was the only Hindi-English code-mixed hate speech dataset available.
7 Gloss: I hate my university. Someone burn that place.
We performed a statistical analysis to determine the correlation between the switching patterns and the classification task at hand (represented by T). Let p(T|Q) denote the probability that a tweet belongs to the positive class for task T given that it satisfies property Q. Similarly, let p(T|∼Q) be the probability that a tweet belongs to the positive class for task T given that it does not satisfy property Q.
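As a concrete illustration, property Q can be checked on a language-tagged token sequence. The sketch below is our own (the helper name is not from the paper) and assumes tokens have already been tagged 'hi'/'en':

```python
def satisfies_q(tags):
    """True if some 'en' token has 'hi' tokens on both sides,
    i.e. an English word appears in a Hindi context (property Q)."""
    return any(
        tags[i] == "en" and tags[i - 1] == "hi" and tags[i + 1] == "hi"
        for i in range(1, len(tags) - 1)
    )

# "koi/hi to/hi pray/en karo/hi mere/hi liye/hi bhi/hi" satisfies Q
print(satisfies_q(["hi", "hi", "en", "hi", "hi", "hi", "hi"]))  # True
# an all-Hindi tweet does not
print(satisfies_q(["hi"] * 8))  # False
```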
Further, let avg(S|T) be the average switching in positive samples for task T and avg(S|∼T) the average switching in negative samples for task T. The main observations from this analysis for the three tasks (humour, sarcasm and hate) are noted in Table 2. For the humour task, p(humour|Q) dominates over p(humour|∼Q). Further, the average number of switches in the positive samples for the humour task is larger than in the negative samples. Finally, we observe a positive Pearson's correlation coefficient of 0.04 between a text being humorous and the text having property Q. Together, these indicate that switching behavior has a positive connection with a tweet being humorous.
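The switching statistic underlying avg(S|·) is a count of language-change points between adjacent tokens; a minimal sketch (our own helper, under the same 'hi'/'en' tagging assumption):

```python
def switch_count(tags):
    """Number of positions where the language changes between
    adjacent tokens, i.e. the switching count S of a tweet."""
    return sum(a != b for a, b in zip(tags, tags[1:]))

# koi/hi to/hi pray/en karo/hi mere/hi liye/hi: hi->en and en->hi give 2 switches
print(switch_count(["hi", "hi", "en", "hi", "hi", "hi"]))  # 2
```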
On the other hand, p(sarcasm|∼Q) and p(hate|∼Q) respectively dominate over p(sarcasm|Q) and p(hate|Q). Moreover, the average number of switches in the negative samples for both these tasks is larger than in the positive samples. The Pearson's correlation between a text being sarcastic (hateful) and the text having property Q is negative: -0.17 (-0.04). This shows an overall negative connection between switching behavior and the sarcasm/hate speech detection tasks. While we have tested on one language pair (Hindi-English), our hypothesis is generic and has already been noted by linguists (Vizcaíno, 2011).

Tweets are tokenized and punctuation marks are removed. All hashtags, mentions and URLs are stored and converted to the strings 'hashtag', 'mention' and 'url' to capture the general semantics of the tweet. Camel-case hashtags were segmented and included in the tokenized tweets (see (Belainine et al., 2016; Khandelwal et al., 2017)). For example, #AadabArzHai can be decomposed into three distinct words: Aadab, Arz and Hai. We use the same pre-processing for all the results presented in this paper.
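The camel-case hashtag segmentation described above can be sketched with a simple regular expression; this is an illustrative approximation of the step, not the exact scheme used in the cited works:

```python
import re

def split_camel_hashtag(hashtag):
    """Split a camel-case hashtag into its component words,
    e.g. '#AadabArzHai' -> ['Aadab', 'Arz', 'Hai']."""
    return re.findall(r"[A-Z][a-z]+|[A-Z]+(?![a-z])|[a-z]+|\d+",
                      hashtag.lstrip("#"))

print(split_camel_hashtag("#AadabArzHai"))  # ['Aadab', 'Arz', 'Hai']
```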

Machine learning baselines
Humour baseline: Uses features such as n-grams, bag-of-words, common words and hashtags to train standard machine learning models such as SVM and Random Forest. The authors used character n-grams, as previous work shows that these features are effective for classifying text: they do not require expensive pre-processing techniques such as tokenization, stemming and stop-word removal, and they are language independent and can be used on code-mixed text. In their paper, the authors report results for trigrams.

Sarcasm baseline: This model also uses a combination of word n-grams, character n-grams, the presence or absence of certain emoticons, and sarcasm-indicative tokens as features. A sarcasm-indicative score is computed, and chi-squared feature reduction is used to keep the 500 most relevant words. These were incorporated into the features used for classification. Standard off-the-shelf machine learning models such as SVM and Random Forest were used.

Hate baseline (Bohra et al., 2018): The hate speech detection baseline also consists of similar features such as character n-grams, word n-grams, negation words 9 and a lexicon of hate-indicative tokens. The chi-squared feature reduction method was used to decrease the dimensionality of the features. Once again, SVM and Random Forest based classifiers were used for this task.
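A baseline of this family (character trigrams, chi-squared feature selection, linear SVM) can be sketched with scikit-learn. This is our re-creation of the general recipe, not the authors' code; the toy corpus is invented, and k is reduced to fit it (the sarcasm and hate baselines keep the top 500 features):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# character trigrams -> chi-squared feature selection -> linear SVM
baseline = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(3, 3)),
    SelectKBest(chi2, k=20),  # the papers keep the top 500 features
    LinearSVC(),
)

# invented toy examples for illustration only
texts = ["yeh joke bahut funny hai haha lol",
         "aaj ka match dekha kya scorecard",
         "bhaiyya ye shakkar bahot meethi hai hahaha",
         "kal office mein meeting hai boring report"]
labels = [1, 0, 1, 0]
baseline.fit(texts, labels)
preds = baseline.predict(texts)
```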
Switching features: We plug in the nine switching features introduced in the previous section to the three baseline models for humour, sarcasm and hate speech detection.

Deep learning architecture
In order to draw on the benefits of modern deep learning machinery, we build an end-to-end model for the three tasks at hand. We use the Hierarchical Attention Network (HAN) (Yang et al., 2016), one of the state-of-the-art models for text and document classification. It can represent sentences at different levels of granularity by stacking recurrent neural networks at the character, word and sentence levels and attending over the informative words. We use the GRU implementation of HAN to encode the text representation for all three tasks.

Handling data imbalance by sub-sampling: Since the sarcasm dataset is heavily unbalanced, we sub-sampled the data to balance the classes. To this end, we categorise the negative samples into those that are easy or hard to classify, hypothesizing that if a model can reliably predict the hard samples, it can do the same with the easy ones. We trained a classifier model on the training dataset and obtained the softmax score, which represents p(sarcastic|text), for the test samples. Test samples with a score below a very low confidence threshold (say 0.001) are removed, treating them as easy samples. The dataset thus becomes smaller and more balanced. It is important to note that positive samples are never removed. We validated this hypothesis on the test set: our trained HAN model achieves an accuracy of 94.4% in classifying the easy (thrown-out) samples as non-sarcastic, justifying the sub-sampling.

Switching features: We include the switching features in the pre-final fully-connected layer of HAN to observe whether this harnesses additional benefits (see Figure 1).

9 See Christopher Potts's sentiment tutorial: http://sentiment.christopherpotts.net/lingstruc.html
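The sub-sampling step amounts to filtering negatives on model confidence; a minimal sketch (function and variable names are ours; the threshold of 0.001 is from the text):

```python
def subsample(samples, labels, pos_scores, threshold=0.001):
    """Keep every positive sample; drop negatives whose predicted
    p(sarcastic|text) falls below the threshold ('easy' negatives)."""
    return [
        (x, y)
        for x, y, p in zip(samples, labels, pos_scores)
        if y == 1 or p >= threshold
    ]

data = ["t1", "t2", "t3", "t4"]
labels = [1, 0, 0, 1]
scores = [0.9, 0.0005, 0.4, 0.2]  # model's p(sarcastic|text)
kept = subsample(data, labels, scores)
print(kept)  # [('t1', 1), ('t3', 0), ('t4', 1)]
```

Note that only negatives can ever be discarded, matching the requirement that positive samples are never removed.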
Pre-trained embeddings: We obtained pre-trained embeddings by training GloVe from scratch on the large code-mixed dataset (725,173 tweets) released by (Patro et al., 2017) plus all the tweets (13,278) in our three datasets.
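The feature concatenation at the pre-final layer can be sketched as follows; this is a hypothetical numpy fragment with made-up dimensions (the actual HAN trains these weights end-to-end):

```python
import numpy as np

def classify(han_vec, switch_feats, W, b):
    """Concatenate the HAN text representation with the nine switching
    features, apply a dense layer, and return softmax class probabilities."""
    h = np.concatenate([han_vec, switch_feats])
    logits = W @ h + b
    e = np.exp(logits - logits.max())  # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
han_vec = rng.standard_normal(100)     # assumed HAN representation size
switch_feats = rng.standard_normal(9)  # the nine switching features
W, b = rng.standard_normal((2, 109)), np.zeros(2)
probs = classify(han_vec, switch_feats, W, b)
```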

Results
We compare the baseline models with (i) the baseline + switching-feature models and (ii) the HAN models, using the macro-F1 score throughout. The main results are summarized in Table 4. The interesting observations one can make from these results are: (i) inclusion of the switching features always improves the overall performance of any model (machine learning or deep learning) for all three tasks; (ii) the deep learning models are always better than the machine learning models. Including the switching features in the machine learning models (indicated as BF in Table 4) yields macro-F1 improvements of 2.62%, 1.85% and 3.36% over the baselines (indicated as B in Table 4) on humour detection, sarcasm detection and hate speech detection respectively. Including the switching features in the HAN model (indicated as HF in Table 4) yields macro-F1 improvements of 4.9%, 4.7% and 17.7% over the original baselines on the same three tasks.

Success of our model: The success of our approach is evident from the following examples. As we demonstrated earlier, humour is positively correlated with switching: a tweet with a switching pattern such as anurag hi kashyap hi can en never en join en aap hi because en ministers en took en oath en, "main hi kisi hi anurag hi aur hi dwesh hi ke hi bina hi kaam hi karunga hi" was not detected as humorous by the baseline (B) but was detected by our models (BF and HF). Note that the author of this tweet seems to have categorically switched to Hindi to express the humour; similar observations were made in (Rudra et al., 2016), where opinion expression was cited as a reason for switching. Sarcasm being negatively correlated with switching, a tweet without switching is more likely to be sarcastic.
For instance, the tweet naadaan hi baalak hi kalyug hi ka hi vardaan hi hai hi ye hi, which bears no switching, was labeled non-sarcastic by the baseline. Our models (BF and HF) rectified this and correctly detected it as sarcastic.
Similarly, hate being negatively correlated with switching, a tweet with no switching, shilpa hi ji hi aap hi ravidubey hi jaise hi tuchho hi ko hi jawab hi mat hi dijiye hi ye hi log hi aap hi ke hi sath hi kabhi hi nahi hi, which was labeled as non-hateful by the baseline, was detected as hateful by our methods (BF and HF).

Conclusion
In this paper, we identified how switching patterns can be effective in improving three different NLP applications. We presented a set of nine features that improve upon the state-of-the-art baselines. In addition, we exploited modern deep learning machinery to improve the performance further, and showed that the model can be improved yet further by feeding the switching features into the final layer of the deep network.
In the future, we would like to extend this work to other language pairs. For instance, we have seen examples of such switching in English-Spanish 10 and English-Telugu 11 pairs as well. Further, we plan to investigate other NLP applications that can benefit from the simple linguistic features introduced here.