Corpus Creation and Emotion Prediction for Hindi-English Code-Mixed Social Media Text

Emotion Prediction is a Natural Language Processing (NLP) task dealing with the detection and classification of emotions in various monolingual and bilingual texts. While some work has been done on code-mixed social media text and on emotion prediction separately, our work is the first attempt at identifying the emotion associated with Hindi-English code-mixed social media text. In this paper, we analyze the problem of emotion identification in code-mixed content and present a Hindi-English code-mixed corpus extracted from Twitter and annotated with the associated emotion. For every tweet in the dataset, we annotate the source language of all the words present, and also the causal language of the expressed emotion. Finally, we propose a supervised classification system which uses various machine learning techniques for detecting the emotion associated with the text using a variety of character-level, word-level, and lexicon-based features.


Introduction
Micro-blogging sites like Twitter and Facebook encourage users to express their daily thoughts in real time, which often results in millions of emotional statements being posted online every day. Identification and analysis of emotions in social media texts are of great significance in understanding trends, reviews, events and human behaviour. Emotion prediction aims to identify fine-grained emotions, i.e., Happiness, Anger, Fear, Sadness, Surprise and Disgust, if any are present in the text. Previous research related to this task has mainly focused on monolingual text (Chen et al., 2010; Alm et al., 2005) due to the availability of large-scale monolingual resources. However, the use of code-mixed language in online posts is very common, especially in multilingual societies like India, for expressing one's emotions and thoughts, particularly when the communication is informal. Code-Mixing (CM) is a natural phenomenon of embedding linguistic units such as phrases, words or morphemes of one language into an utterance of another (Myers-Scotton, 1993; Muysken, 2000; Duran, 1994; Gysels, 1992). Following are some instances from a Twitter corpus of Hindi-English code-mixed texts, together with their English translations.
T1 : "I don't want to go to school today, teacher se dar lagta hai mujhe."
Translation : "I don't want to go to school today, I am afraid of the teacher."
T2 : "Finally India away series jeetne mein successful ho hi gayi :D"
Translation : "Finally India got success in winning the away series :D"
T3 : "This is a big surprise that Rahul Gandhi congress ke naye president hain."
Translation : "This is a big surprise that Rahul Gandhi is the new president of Congress."
The above examples contain both English and Hindi text. T1 expresses fear through the Hindi phrase "dar lagta hai mujhe", happiness is expressed in T2 through the Hindi-English mixed phrase "jeetne mein successful ho hi gayi", while in T3, surprise is expressed through the English phrase "This is a big surprise". Since very few resources are available for Hindi-English code-mixed text, in this paper we present our initial efforts in constructing the corpus and annotating the code-mixed tweets with the associated emotion and the causal language for that emotion. We strongly believe that this annotated code-mixed emotion corpus will prove to be extremely valuable for researchers working on various natural language processing tasks on social media. The structure of the paper is as follows. In Section 2, we review related research in the area of code mixing and emotion prediction. In Section 3, we describe the corpus creation and annotation scheme. In Section 4, we discuss the data statistics. In Section 5, we summarize our classification system, which includes the pre-processing steps and the construction of the feature vector. In Section 6, we present the results of experiments conducted using various character-level, word-level and lexicon features. In the last section, we conclude our paper, followed by future work and the references.
* These authors contributed equally to this work.
Background and Related Work
Earlier work analyzed data from Facebook posts generated by English-Hindi bilingual users. The authors created the corpus using posts from Facebook pages in which En-Hin bilinguals are highly active, and also collected data from the BBC Hindi News page. Their final corpus consisted of 6983 posts and 113,578 words. Among the 6983 posts, 206 were in Devanagari script, 6544 in Roman script, 246 in Mixed script and 28 in Other script. After annotating the data with Named Entities, POS tags and Word Origin, and deleting all posts which had fewer than 5 words, they analyzed the data. Their analysis showed that at least 4.2% of the data is code-switched, indicating that a significant amount of code-mixing was present in the posts. Another study formalized the problem, created a POS tag annotated Hindi-English code-mixed corpus and reported the challenges and problems in Hindi-English code-mixed text. The authors also performed experiments on language identification, transliteration, normalization and POS tagging of the dataset. Their POS tagger accuracy fell by 14% to 65% without using gold language labels and normalization. Thus, language identification and normalization are critical for POS tagging. Sharma et al. (2016) addressed the problem of shallow parsing of Hindi-English code-mixed social media text and developed a system that can identify the language of the words, normalize them to their standard forms, assign them their POS tags and segment them into chunks. Barman et al. (2014) addressed the problem of language identification on Bengali-Hindi-English Facebook comments. They annotated a corpus and achieved an accuracy of 95.76% using statistical models with monolingual dictionaries. Raghavi et al. (2015) developed a Question Classification system for Hindi-English code-mixed language using word-level resources such as language identification, transliteration, and lexical translation.
In addition to information, text also contains some emotional content. Alm et al. (2005) addressed the problem of text-based emotion prediction in the domain of children's fairy tales using supervised machine learning. Das and Bandyopadhyay (2010) dealt with the extraction of emotional expressions and the tagging of English blog sentences with Ekman's six basic emotion tags and any of three intensities: low, medium and high. Xu et al. (2010) built a Chinese emotion lexicon for public use. They adopted a graph-based algorithm which ranks words according to a few seed emotion words. Wang et al. (2016) performed emotion analysis on Chinese-English code-mixed texts using a BAN network. Joshi et al. (2016) and Ghosh et al. (2017) performed Sentiment Identification in Hindi-English code-mixed social media text.

Corpus Creation and Annotation
We created the Hindi-English code-mixed corpus using tweets posted online in the last 8 years. Tweets were scraped from Twitter using the Twitter Python API 1 , which uses the advanced search option of Twitter. We mined the tweets by selecting certain hashtags from politics, social events, and sports, so that the dataset is not limited to a particular domain. The hashtags used can be found in the appendix section. The retrieved tweets are in JSON format, which contains all the information such as timestamp, URL, text, user, retweets, replies, full name, id and likes. Extensive semi-automated processing was carried out to remove all the noisy tweets. Noisy tweets are those which consist only of hashtags or URLs. Tweets in which a language other than Hindi or English is used were also considered noisy and hence removed from the corpus. Furthermore, all tweets written in either pure English or pure Hindi were removed, thus keeping only the code-mixed tweets. In the annotation phase, we further removed all those tweets which did not express any emotion.

Annotation
The annotation step was carried out in the following two phases:
Language Annotation : For each word, a tag was assigned to its source language. Three kinds of tags, namely 'eng', 'hin' and 'other', were assigned to the words by bilingual speakers. The 'eng' tag was assigned to words which are present in the English vocabulary, such as "successful" and "series" used in T2. The 'hin' tag was assigned to words which are present in the Hindi vocabulary, such as "naye" (new) and "hain" (is) used in T3. The tag 'other' was given to symbols, emoticons, punctuation, named entities, acronyms, and URLs.
Emotion and Causal Language Annotation : We annotated the tweets with six standard emotions, namely Happiness, Sadness, Anger, Fear, Disgust and Surprise (Ekman, 1992, 1993). Hindi and English were annotated as the two causal languages. Since emotion in a statement can be expressed through the two languages separately, and also through mixed phrases like "mujhe fear hai", it is essential to annotate the data with four kinds of causal situations (Lee and Wang, 2015), i.e., Hindi, English, Mixed and Both. Next, we discuss these situations in detail.
Hindi means the emotion of the given post is solely expressed through Hindi text. In example T4, happiness is expressed through Hindi text.
T4 : "Bahut badiya, ab sab okay hai surgical strike ke baad."
Translation : "Very good, now everything is okay after the surgical strike."
English means the emotion of the given post is solely expressed through English text. T5 is an example that expresses surprise through English text.
T5 : "He is in complete shock, itni property waste ho gayi uski."
Translation : "He is in complete shock that so much of his property has been wasted."
Both means the emotion of the given tweet is expressed through both Hindi and English text. Since a user can express one kind of emotion using multiple phrases, it is essential to incorporate the case where the same emotion is expressed through both languages. T6 is an example where sadness is expressed through both Hindi and English texts.
Mixed means the emotion of the given tweet is expressed through one or multiple Hindi-English mixed phrases. T7 is an example which expresses sadness through the mixed phrase 'dekhke sad lagta hai'.
T7 : "In this country gareeb logo ki haalat dekhke sad lagta hai."
Translation : "It is sad to see the condition of poor people in this country."
Annotation of this dataset was performed by two of the co-authors, who are native Hindi speakers and are proficient in both Hindi and English. Figure 1 shows an instance of annotation, where both the emotion and the causal language are annotated. For each emotion, the annotator marked whether a given tweet expresses that emotion, along with its causal language. The annotated dataset and the classification system are made available online 2 .

Inter Annotator Agreement
Annotation of the dataset to identify emotion in the tweets was carried out by two human annotators having a linguistic background and proficiency in both Hindi and English. In order to validate the quality of annotation, we calculated the inter-annotator agreement (IAA) between the two annotation sets of 2866 code-mixed tweets using Cohen's Kappa coefficient. Table 1 shows the results of the agreement analysis. We find that the agreement is high, which suggests that the annotations are of good quality and that the presented schema is productive. Furthermore, the agreement on emotion annotation is lower than that on causal language, which is probably due to the fact that in some tweets, emotions are expressed indirectly.
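As an illustration of the agreement measure, Cohen's Kappa compares the observed agreement p_o between two annotators with the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). The following is a minimal sketch, not the script used to produce Table 1; the function name `cohens_kappa` and the toy labels are hypothetical and chosen only for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Toy labels invented for illustration; they are NOT taken from the corpus.
annotator_1 = ["happy", "sad", "anger", "happy", "sad", "happy"]
annotator_2 = ["happy", "sad", "happy", "happy", "sad", "anger"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

In this toy case the annotators agree on 4 of 6 items, while their label distributions alone would produce agreement on 14/36 of the items by chance, so the resulting kappa is noticeably lower than the raw agreement rate.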

Data Statistics
We retrieved 355,448 tweets from Twitter. After manually filtering the tweets as described in Section 3, we found that only 5546 of them were code-mixed. Of these, we kept only those code-mixed tweets which expressed any of the six emotions. It is also important to note that some of the tweets contained multiple phrases depicting different emotions, and these emotions could be caused by any of the four causal languages. As a result, the total number of causal language annotations is greater than the number of tweets in the dataset. Usually, a user posting a tweet feels only one kind of emotion. Hence, all tweets depicting more than one emotion were discarded to avoid any conflict between the literal depiction and the implicit conveyance of emotions in the tweets. This resulted in 2698 emotional code-mixed tweets. Table 3 shows the count of sentences in which emotion was expressed in English, Hindi, Both and Mixed. It clearly shows that in most of the sentences emotion is expressed through a mixed Hindi-English phrase.

System Architecture
After developing the annotated corpus, we try to detect emotion in the code-mixed tweets. We break down the process of emotion detection into three sub-processes: pre-processing of raw tweets, feature identification and extraction, and finally, the classification of emotion as happiness, sadness, or anger. It is important to note that classification is carried out only for three classes, i.e., 'happiness', 'sadness' and 'anger', as the number of tweets which express 'fear', 'disgust' and 'surprise' is extremely limited. The steps are discussed in sequential order.

Pre-processing of the code-mixed tweets
Following are the steps which were performed in order to pre-process the data prior to feature extraction.
1. Removal of URLs: All the links and URLs in the tweets are stored and replaced with "URL", as these do not contribute towards emotion of the text.
2. Replacing User Names: Tweets often contain mentions which are directed towards certain users. We replaced all such mentions with "USER".
3. Replacing Emoticons: All the emoticons used in the tweets are replaced with "Emoticon". Before replacing, the emoticons along with their respective counts are stored since we use them as one of the features for classification.
4. Removal of Punctuations: All the punctuation marks in a tweet are removed. However, before removing them we store the count of each punctuation mark since we use them as one of the features in classification.
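The four pre-processing steps above can be sketched as follows. This is an illustrative approximation rather than the released system: the regular expressions, the short emoticon list and the function name `preprocess` are assumptions, and the example tweet is T2 with an invented mention and URL added.

```python
import re
import string
from collections import Counter

# Assumed patterns; the paper does not specify how URLs and mentions are matched.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# Tiny illustrative list; the paper uses a list of Western emoticons from Wikipedia.
EMOTICONS = [":o(", ":(", ":)", ";)", ":D"]
EMOTICON_RE = re.compile(
    "|".join(re.escape(e) for e in sorted(EMOTICONS, key=len, reverse=True))
)
PUNCT_RE = re.compile("[" + re.escape(string.punctuation) + "]")


def preprocess(tweet):
    """Return cleaned text plus the stored URLs, emoticon and punctuation counts."""
    # Step 1: store the URLs, then replace them with a placeholder.
    urls = URL_RE.findall(tweet)
    text = URL_RE.sub("URL", tweet)
    # Step 2: replace user mentions.
    text = MENTION_RE.sub("USER", text)
    # Step 3: store emoticon counts, then replace the emoticons.
    emoticon_counts = Counter(EMOTICON_RE.findall(text))
    text = EMOTICON_RE.sub(" Emoticon ", text)
    # Step 4: store punctuation counts, then remove the punctuation marks.
    punct_counts = Counter(PUNCT_RE.findall(text))
    text = PUNCT_RE.sub(" ", text)
    return " ".join(text.split()), urls, emoticon_counts, punct_counts


example = ("@rahul Finally India away series jeetne mein successful "
           "ho hi gayi :D !! https://t.co/abc")
clean_text, urls, emoticon_counts, punct_counts = preprocess(example)
```

The order of the steps matters in this sketch: URLs and emoticons are replaced before punctuation is counted, so that characters such as ':' inside a URL or an emoticon are not counted as punctuation marks.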

Feature Identification and Extraction :
In our work, we have used the following feature vectors to train our supervised machine learning model.
1. Character N-Grams (C): Character N-Grams are language independent and have proven to be very efficient for classifying text. These are also useful in situations when the text suffers from errors such as misspellings (Cavnar et al., 1994;Huffman, 1995;Lodhi et al., 2002). Groups of characters can help in capturing semantic meaning, especially in the code-mixed language where there is an informal use of words, which vary significantly from the standard Hindi and English words. We use character n-grams as one of the features, where n varies from 1 to 3.
2. Word N-Grams (W) : Bag-of-words features have been widely used to capture emotion in text (Purver and Battersby, 2012) and to detect hate speech (Warner and Hirschberg, 2012). Thus we use word n-grams, where n varies from 1 to 3, as a feature to train our classification models.

3. Emoticons (E) : We also use emoticons as a feature for emotion classification since they often represent textual portrayals of a writer's emotion in the form of symbols. For example, ':o(' and ':(' express sadness, while ':)' and ';)' express happiness. We use a list of Western Emoticons from Wikipedia. 3
4. Punctuations (P): Punctuation marks can also be useful for emotion classification. Users often use exclamation marks when they want to express strong feelings. Multiple question marks in the text can denote surprise, excitement, and anger. Usage of an exclamation mark in conjunction with a question mark indicates astonishment and annoyance. We count the occurrences of each punctuation mark in a sentence and use them as a feature.

5. Repetitive Characters (R) : Users on social media often repeat some characters in a word to stress a particular emotion. For example, 'lol' (abbreviated form of laughing out loud) can be written as 'loool' or 'looool', and 'Happy' can be written as 'happppyyyy' or 'haaappyy'. We store the count of all words in a tweet in which a particular character is repeated more than two times in a row and use it as one of the features.
6. Uppercase Words (U) : Users often write some words in a text in capital letters to represent shouting and anger (Dadvar et al., 2013). Hence for every tweet, we count all such words which are completely written in capital letters and contain more than 4 letters and use it as a feature.

7. Intensifiers (I): Users often tend to use intensifiers to lay emphasis on sentiment and emotion. For example, in the code-mixed text "Wo kisi se baat nahi karega because he is too sad" (Translation : "He will not talk to anyone because he is too sad"), "too" is used to emphasize the sadness of the boy. A list of English intensifiers was taken from Wikipedia 4 . For creating the list of Hindi intensifiers, English intensifiers were [...]. We count the number of intensifiers in a tweet and use the count as a feature.

8. Negation Words (N) : We select negation words to address variance from the desired emotion caused by negated phrases like "not sad" or "not happy". For example, the tweet "It's diwali today and subah jaldi uthna padega!! Not happy" should be classified as a sad tweet, even though it contains the unigram "happy". To tackle this problem we define negation as a separate feature. A list of English negation words was taken from Christopher Potts' sentiment tutorial 5 . Hindi negation words were manually selected from the corpus. We count the number of negations in a tweet and use the count as a feature.
9. Lexicon (L) : It has been demonstrated in (Mohammad, 2012) that emotion lexicon features provide a significant gain in classification accuracy when combined with corpus-based features, if training and testing sets are drawn from the same domain. We used the emotion lexicon of Mohammad and Turney (2010, 2013), containing 14182 unigrams in both English and Hindi. The words in the Hindi emotion lexicon were written in the Devanagari 6 script and had to be transliterated into Roman script by the authors. Each word in the lexicon is given an association score of 1 if it is related to an emotion; otherwise the association score is 0. A weight was given to each word in the lexicon. The exact weight values are given in Table 4. This assignment of weights ensured that if a word is related to more than one emotion, we do not lose any information.
Table 5: Impact of each feature on the classification accuracy of emotion in the text, calculated by eliminating one feature at a time.
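Several of the features described in this section are simple n-gram lists or counts, and can be sketched as below. This is a minimal illustration under stated assumptions, not the released feature extractor: the function names, the dictionary keys and the tiny `NEGATIONS` set are hypothetical, and the example tweet is invented.

```python
import re
import string

# Illustrative negation list only; the paper takes English negations from a
# sentiment tutorial and selects Hindi negations manually from the corpus.
NEGATIONS = {"not", "no", "never", "nahi"}
# A letter repeated more than two times in a row, e.g. "loool".
REPEAT_RE = re.compile(r"([A-Za-z])\1{2,}")


def char_ngrams(text, n_min=1, n_max=3):
    """All character n-grams of the text for n in [n_min, n_max]."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def word_ngrams(tokens, n_min=1, n_max=3):
    """All word n-grams of the token list for n in [n_min, n_max]."""
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def count_features(raw_tweet):
    """Count-based features computed on the raw (unprocessed) tweet."""
    tokens = [t.strip(string.punctuation) for t in raw_tweet.split()]
    tokens = [t for t in tokens if t]
    return {
        "repetitive_words": sum(1 for t in tokens if REPEAT_RE.search(t)),
        # Fully capitalised words with more than 4 letters.
        "uppercase_words": sum(
            1 for t in tokens if t.isalpha() and t.isupper() and len(t) > 4
        ),
        "exclamation_marks": raw_tweet.count("!"),
        "question_marks": raw_tweet.count("?"),
        "negations": sum(1 for t in tokens if t.lower() in NEGATIONS),
    }


features = count_features("I am sooooo HAPPY today!!! Not sad, bilkul nahi")
```

In this sketch the counts are taken from the raw tweet, since punctuation is removed during pre-processing; intensifier counts could be added in the same way as the negation count, given a word list.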

Results and Discussions
This section presents the results of the various feature experiments.

Feature Experiments
In order to determine the effect of each feature on classification, we performed several experiments by eliminating one feature at a time. In all the experiments, we carried out 10-fold cross-validation.
We performed experiments using an SVM classifier with a radial basis function (RBF) kernel, as it performs efficiently in the case of high-dimensional feature vectors. The results of the experiments performed after eliminating one feature at a time (i.e., an ablation test to examine the interaction of feature sets) are given in Table 5. Since the feature vectors formed are very large, we applied the chi-square feature selection algorithm, which reduces the size of our feature vector to 1600 7 . For training our classifier, we used Scikit-learn (Pedregosa et al., 2011). The results in Table 5 show that Character N-Grams, Punctuation Marks, Word N-Grams, Emoticons and Upper Case Words are the features which affect the accuracy most. We were able to achieve the best accuracy of 58.2% using Character N-Grams, Word N-Grams, Punctuation Marks and Emoticons as features, trained with the SVM classifier.
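The leave-one-feature-out procedure behind Table 5 can be sketched generically as follows. The function `ablation_study` and the `evaluate` callback are hypothetical names; in practice `evaluate` would train the SVM on the given feature groups and return the cross-validated accuracy. The toy scoring function and its numbers below are invented placeholders and do not correspond to any result in this paper.

```python
def ablation_study(feature_names, evaluate):
    """Score the full feature set, then each set with one feature removed.

    `evaluate` is any callable that takes a list of feature names and
    returns a score (for example, cross-validated accuracy).
    """
    baseline = evaluate(list(feature_names))
    without = {}
    for name in feature_names:
        remaining = [f for f in feature_names if f != name]
        without[name] = evaluate(remaining)
    return baseline, without


# Placeholder scoring function with made-up integer gains, for illustration only.
TOY_GAIN = {"C": 10, "W": 3, "P": 2}


def toy_evaluate(features):
    return 40 + sum(TOY_GAIN[f] for f in features)


baseline, without = ablation_study(["C", "W", "P"], toy_evaluate)
# Drop in score caused by removing each feature.
drops = {name: baseline - score for name, score in without.items()}
```

A larger drop for a feature suggests that the remaining features cannot compensate for its removal, which is how the per-feature impact is read off an ablation table.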

Conclusion and Future Work
In this paper, we present a freely available corpus of Hindi-English code-mixed text, consisting of tweet ids and the corresponding annotations. We also present the supervised system used for classifying the emotion of the tweets. The corpus consists of 2866 code-mixed tweets annotated with 6 emotions, namely happiness, sadness, anger, surprise, fear and disgust, and with the causal language, i.e., English, Hindi, Mixed and Both. The words in the tweets are also annotated with their source language. Experiments clearly show that the use of punctuation marks and emoticons results in better accuracy. The Char N-Grams feature vector is also important for classification: as is clear from the results, in the absence of char n-grams, the classification accuracy drops by nearly 16%. This paper describes the initial efforts in emotion prediction in Hindi-English code-mixed social media texts.
As a part of future work, the corpus can be annotated with part-of-speech tags at the word level, which may yield better results. Moreover, the dataset contains very few tweets expressing fear, disgust, and surprise. Thus it can be extended to include more tweets expressing these emotions. In future, the annotations and experiments described in this paper can also be carried out for code-mixed texts containing more than two languages from multilingual societies.