Emotion in Code-switching Texts: Corpus Construction and Analysis

Previous research has focused on analyzing emotion in monolingual text, although bilingual or code-switching posts are also common in social media. Despite the important implications of code-switching for emotion analysis, existing automatic emotion extraction methods fail to accommodate code-switching content. In this paper, we propose a general framework to construct and analyze code-switching emotional posts in social media. We first propose an annotation scheme to identify the emotions together with the languages expressing them in a Chinese-English code-switching corpus. We then make some observations and generate statistics from the corpus to analyze the linguistic phenomena of code-switching texts in social media. Finally, we propose a multiple-classifier-based automatic detection approach to detect emotion in the code-switching corpus and to evaluate the effectiveness of both Chinese and English texts.


Introduction
Due to the popularity of opinion-rich resources (e.g., online review sites, forums, and microblog websites), emotion analysis in text is of great significance in obtaining useful information for studies on social media (Pang et al., 2002; Liu et al., 2013; Lee et al., 2014). Previous research has mainly focused on analyzing emotion in monolingual text (Lee et al., 2013a). However, code-switching posts are also common in social media. Emotions can be expressed by either monolingual text or bilingual text in code-switching posts. Code-switching text is defined as text that contains more than one language ('code') (Adel et al., 2013; Auer, 1999). [E1-E3] are three examples of code-switching emotional posts on Weibo.com that contain both Chinese and English text. [E1] expresses the happiness emotion through English, the sadness emotion in [E2] is expressed through both Chinese and English, and the sadness emotion in [E3] is expressed through a mixed Chinese-English phrase (hold 不住 'cannot take it').
[E1] 玩了一下午轮滑 so happy ！ (I went rollerblading the whole afternoon, so happy!)
[E2] 开学以来，浮躁的情绪。不安稳的心态。确实该自己检讨一下了。。。sigh~~~ (I have been grumpy and emotional since the first day of school, unstable mindset too. It's really time to self-evaluate...sigh~~~)
[E3] 上了一天的课，嗓子 hold 不住了啊 (I have been teaching the whole day, my throat can't take it anymore.)
Despite the important implications of code-switching for emotion analysis, existing emotion analysis approaches fail to accommodate code-switching content. Thus, there is a crucial need for analyzing emotions in code-switching texts.
In this paper, we provide a well-defined and efficient method for constructing and analyzing a large-scale code-switching corpus from social media. We believe the annotated corpus provides a valuable resource for both linguistic analysis and natural language processing of emotion and code-switching texts. We construct and analyze the corpus in the following steps. First, we extract and filter the code-switching posts from a large-scale dataset by removing monolingual and noise posts. Second, we propose an annotation scheme to annotate both the emotions and the language(s) expressing the emotions (hereafter, caused language(s)) in the dataset. Third, we analyze the inter-annotator agreement on the corpus to verify the quality of the annotation and the effectiveness of the scheme. We also present some observations and statistics on the corpus to analyze the linguistic phenomena of code-switching texts on social media. Finally, we propose a multiple-classifier-based automatic detection approach to detect emotion in the annotated code-switching corpus, in order to show the effectiveness of both the Chinese text and the English text in code-switching posts for detecting emotions.
The rest of the paper is organized as follows. In Section 2, we give an overview of related work. In Section 3, we introduce our data collection method and the annotation scheme. In Section 4, we report the analysis of the corpus, including the inter-annotator agreement and other relevant statistics. In Section 5, we propose an automatic emotion detection framework for code-switching text. Finally, we conclude our work in Section 6.

Related Work
In this section, we discuss related work on emotion analysis and code-switching text analysis.

Emotion Analysis
The earliest research on emotion focused on the representation and processing of emotion in facial expressions and body language (Andrew, 1963; Ekman and Friesen, 1978). More recently, there has been mounting research on the neurobiological basis of emotion (Olson et al., 2007; Hervé et al., 2012) and how emotion is linked with other aspects of human cognition (Smith and Lazarus, 1993; Smith and Kirby, 2001; Bridge et al., 2010).
Emotion has been well studied in natural language processing, but most previous research has focused on analyzing emotions in monolingual text. Some of these studies focus on lexicon building: for example, Rao et al. (2012) automatically built a word-emotion mapping dictionary for social emotion detection, and Yang et al. (2014) proposed a novel emotion-aware topic model to build a domain-specific lexicon. Emotion classification is another important task in emotion analysis. For example, Liu et al. (2013) used a co-training framework to infer the news reader's and comment writer's emotions collectively; Wen and Wan (2014) used class sequential rules for emotion classification of microblog texts by regarding each post as a data sequence.
Research on emotion has also been linked to the field of bilingualism. Previous studies have demonstrated that emotion is closely related to second language learning and use (Arnold, 1999; Schumann, 1999), as well as to bilingual performance and language choice (Schrauf, 2000; Pavlenko, 2008). For example, a number of factors may affect the use of emotion vocabulary, such as sociocultural competence, gender, and topic (Dewaele and Pavlenko, 2002).
Despite a growing body of research on emotion, little has been done on the analysis of emotion in code-switching contexts due to the complications in processing two languages at the same time.

Analysis of Code-switching Texts
Research on code-switching can be traced back to the 1970s. Several theories have been proposed to account for the motivation behind code-switching, such as diglossia (Blom and Gumperz, 1972), communication accommodation theory (Giles and Clair, 1979), the markedness model (Myers-Scotton, 1993), and the conversational analysis model (Auer, 1984).
Code-switched documents have also received considerable attention in the NLP community. Several studies have focused on identification and analysis, including mining translations in code-switched documents (Ling et al., 2013), predicting code-switched points (Solorio and Liu, 2008), identifying code-switched tokens (Lignos and Marcus, 2013), adding code-switched support to language models (Li and Fung, 2012), and learning poly-lingual topic models from code-switching text.
Another related research topic, multilingual natural language processing, has begun to attract attention in the computational linguistics community due to its broad real-world applications. Relevant studies have been reported in different natural language processing tasks, such as parsing (Burkett et al., 2010), information retrieval (Gao et al., 2009), text classification (Amini et al., 2010), and sentiment analysis (Lu et al., 2011).
However, none have studied the multilingual code-switching issues in the task of emotion detection and classification. This area of research is especially crucial when public emotions are mostly expressed on the Internet. Additionally, the important implications of code-switching in emotion analysis serve as a first step towards an automatic multilingual classification system.

Data Collection and Annotation
In this section, we describe how to collect and filter code-switching posts on Weibo.com. We also discuss the annotation scheme and the annotation tool.

Data Collection
We sourced our dataset from Weibo.com, one of the most popular social networking sites in China. We identified a post as code-switched if at least two predicted languages, i.e., Chinese and English, appeared in the text. Because Chinese and English characters are encoded differently (English characters have character codes below 128), we used the code of each character as a simple way to identify its language. We also removed noise and advertisement posts ([E4] and [E5] are examples of noise and advertisement posts).
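The character-code test described above can be sketched as follows. This is a minimal illustration rather than the filtering code used to build the corpus: the paper only states that English characters have codes below 128, so the CJK Unified Ideographs range (U+4E00 to U+9FFF) used to recognize Chinese characters, the decision to count only ASCII letters as English, and the function names are all assumptions.

```python
def char_language(ch):
    """Label one character as 'EN', 'CN', or None (digits, punctuation, other)."""
    code = ord(ch)
    if code < 128:
        # Assumption: only ASCII letters count as English text.
        return "EN" if ch.isalpha() else None
    if 0x4E00 <= code <= 0x9FFF:
        # Assumption: CJK Unified Ideographs block stands in for "Chinese".
        return "CN"
    return None


def is_code_switched(post):
    """Return True if the post contains both Chinese and English characters."""
    languages = {char_language(ch) for ch in post}
    return {"CN", "EN"} <= languages


if __name__ == "__main__":
    print(is_code_switched("玩了一下午轮滑 so happy ！"))  # True
    print(is_code_switched("so happy"))                    # False
```

Full-width punctuation such as "！" falls outside both ranges and is ignored, so it does not affect the decision.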

Annotation Scheme
Five basic emotions were annotated, namely happiness, sadness, fear, anger and surprise (Lee et al., 2013b). Two languages, Chinese and English, were annotated as caused languages. Since emotion can be expressed through the two languages separately or collectively, and also could be expressed through mixed phrases e.

Annotation Tool and Format
An annotation tool was designed to facilitate the annotation process and improve consistency. Figure 1 shows an example instance annotated with both emotion and caused languages using our annotation tool. For each emotion, annotators marked whether the post expresses the emotion, together with the caused language(s) of that emotion. Figure 2 is a sample of an annotated instance. Each instance contains the caused language within the emotion tag, e.g., "<Happiness>CN </Happiness>"; this example tag means that the post expresses the happiness emotion through Chinese text.
Figure 1: A sample of code-switching emotion annotation using the annotation tool
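As an illustration of how such tags could be read back, the sketch below extracts (emotion, caused language) pairs from a string. Only the single tag "<Happiness>CN </Happiness>" is attested in the text, so the rest of the file format, the value "EN" for English, and the function name parse_annotation are assumptions made for the example.

```python
import re

# Matches <Emotion>value</Emotion>, tolerating spaces around the value.
TAG_PATTERN = re.compile(r"<(\w+)>\s*(.*?)\s*</\1>", re.S)


def parse_annotation(text):
    """Return a list of (emotion, caused_language) pairs found in the text."""
    return TAG_PATTERN.findall(text)


if __name__ == "__main__":
    print(parse_annotation("<Happiness>CN </Happiness>"))
    # [('Happiness', 'CN')]
```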

Statistics and Analysis
In this section, we analyze the agreement of the corpus, and present some observations and statistics.

Agreement Analysis
To verify the quality of the annotation, two human annotators were asked to annotate 1,000 posts. We then calculated the inter-annotator agreement between them using Cohen's Kappa coefficient. Table 2 shows the results of the agreement analysis. We find that the agreement is high, indicating that the annotation is of good quality and that the scheme is effective. In addition, the agreement on emotion annotation is lower than that on caused language, which is probably due to the fact that some posts express more than one emotion and some emotions are expressed implicitly.

Table 2: Inter-annotator agreement
Annotation        Kappa score
Emotion           0.692
Caused Language   0.767
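For reference, Cohen's Kappa compares the observed agreement between two annotators with the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it for two label sequences; the toy labels are invented for illustration and are not taken from the corpus.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    annotator_1 = ["happiness", "happiness", "sadness", "sadness"]
    annotator_2 = ["happiness", "happiness", "sadness", "happiness"]
    print(cohens_kappa(annotator_1, annotator_2))  # 0.5
```

In the toy example the annotators agree on 3 of 4 posts (p_o = 0.75) while chance agreement is 0.5, which gives a Kappa of 0.5.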

Statistics and Observations
In this subsection, we discuss some statistics from the dataset.

General Distribution of Data
Out of 4,195 annotated posts, 2,312 posts are found to express emotions. Of these emotional posts, 81.4% express emotion through Chinese. Although the English text contains relatively few words in each post, 43.5% of emotional posts still express emotion through English. This indicates that English is of vital importance to emotion expression even in code-switching contexts dominated by Chinese. Note that the Chinese and English emotional posts overlap, since some posts express emotion in both Chinese and English. Moreover, although some posts express the same emotion through both Chinese and English text ([E2]), others express different emotions through different languages. For example, the happiness emotion in [E7] is expressed through Chinese, while the surprise emotion is expressed through English.
Moreover, as shown in Figure 3, we find that most posts describe people's daily lives, since people like to discuss their life on their microblogs, and posts from financial and political domains were limited.

Joint Distribution of Emotions and Caused Languages
To analyze the distribution of emotions and caused languages, we first calculate the joint distribution between emotions and caused languages, as shown in Figure 4. The Y-axis of the figure presents the conditional probability of a post expressing the emotion $e_i$ given that $l_j$ is the caused language, $p(e_i \mid l_j)$.
Figure 4 suggests that: 1) happiness occurs more frequently than other emotions; 2) people prefer to use English text to express happiness more than sadness; 3) the distributions of emotions expressed through Chinese and English text are similar; and 4) fear and surprise occur less frequently in English text.
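The conditional probability of an emotion given a caused language can be estimated from counts as the number of annotations with that emotion and language divided by the number of annotations with that language. The sketch below shows this estimate on a handful of invented (emotion, language) pairs; the data layout and the function name are assumptions, and the numbers have no relation to Figure 4.

```python
from collections import Counter


def emotion_given_language(annotations):
    """Estimate p(emotion | language) from (emotion, language) pairs."""
    annotations = list(annotations)
    pair_counts = Counter(annotations)
    language_counts = Counter(language for _, language in annotations)
    return {
        (emotion, language): count / language_counts[language]
        for (emotion, language), count in pair_counts.items()
    }


if __name__ == "__main__":
    toy = [
        ("happiness", "EN"), ("happiness", "EN"), ("sadness", "EN"),
        ("happiness", "CN"), ("happiness", "CN"),
        ("sadness", "CN"), ("anger", "CN"),
    ]
    for key, value in sorted(emotion_given_language(toy).items()):
        print(key, round(value, 3))
```

For each language, the estimated probabilities over emotions sum to one, which is what allows the Chinese and English distributions to be compared on the same axis.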

Transfer Probability between Emotions
We then examine the conditional probability of a post expressing emotion $e_i$ given that the post contains emotion $e_j$. The conditional probabilities are shown in Table 3. From the table, we find that the probability that a post contains more than one emotion is small. Moreover, the probability of polarity shifting between emotions (happiness vs. sadness, fear, anger) is limited.
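One way to estimate these transfer probabilities is to treat each post as a set of emotion labels and divide the number of posts containing both emotions by the number of posts containing the conditioning emotion. The sketch below follows that reading; representing posts as sets and the function name are assumptions, and the toy posts are invented.

```python
from collections import Counter
from itertools import permutations


def emotion_transfer_probabilities(posts):
    """Return {(e_i, e_j): p(e_i | e_j)} for emotions that co-occur in a post."""
    single_counts = Counter()
    pair_counts = Counter()
    for emotions in posts:
        single_counts.update(emotions)
        # Count every ordered pair of distinct emotions in the same post.
        pair_counts.update(permutations(sorted(emotions), 2))
    return {
        (e_i, e_j): count / single_counts[e_j]
        for (e_i, e_j), count in pair_counts.items()
    }


if __name__ == "__main__":
    toy_posts = [
        {"happiness"}, {"happiness", "surprise"}, {"sadness"},
        {"sadness", "anger"}, {"happiness"}, {"surprise"},
    ]
    table = emotion_transfer_probabilities(toy_posts)
    print(table[("surprise", "happiness")])  # p(surprise | happiness)
```

Pairs that never co-occur are simply absent from the returned dictionary, which corresponds to a probability of zero.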

Transfer Probability between Caused Languages
We also examine the conditional probability of the emotion(s) being expressed in one language $l_i$ given that the emotion is simultaneously expressed in another language $l_j$ in the same post. The conditional probabilities are shown in Table 4.
Table 4: Transfer probability between caused languages
          Chinese   English
Chinese   -         0.236
English   0.614     -
From the table, we find that there is a high probability that both languages express emotion in the same post, especially given that the emotion is expressed in English: in that case, it is highly likely that the emotion is also expressed in Chinese.
Table 5 shows statistics on the average sentence length of each language. We notice that, as our data are written by Chinese individuals, the Chinese text is longer than the English text. Besides, the emotions expressed through English text are mostly single words, e.g., happy, high, and surprise. Note that, as mentioned above, although the Chinese text is longer than the English text, English is of vital importance to emotion expression even in code-switching contexts dominated by Chinese.

Distribution of Cue Words
In addition, we count the ten most frequent emotion cue words in both the English and the Chinese text, as given in Table 6. We find that the most frequent cue words express happiness, for example, happy, nice, and 喜欢 (like). Moreover, there are several negative expressions among the top-10 English cue words, e.g., sorry and shit, while the top-10 Chinese cue words are all positive. This may be because expressing negative emotion through the native language (Chinese) would be too explicit for Chinese individuals, most of whom tend to express their negative emotions implicitly.

Automatic Emotion Detection in Code-switching Texts
Based on the annotated corpus data, we attempt to detect emotion in code-switching text automatically. Results show both Chinese and English texts are effective, and the classifier combination approach which incorporates both Chinese and English text achieves the best performance.

Overview of Detection Approach
A straightforward approach to detecting emotion in code-switching text is to use a supervised learning approach to classify the mixed text without any processing, with unigrams extracted as features for each post. As emotions could be expressed in either Chinese or English text, we also adopt two classification approaches which consider the Chinese or the English text individually. However, a more effective way to detect emotion in code-switching posts is to incorporate both Chinese and English text through a Multiple Classifier System (MCS). The key issue in constructing a multiple classifier system is to find a suitable way to combine the outputs of the base classifiers. In the MCS literature, various methods are available for combining the outputs, such as fixed rules including the voting rule, the product rule and the sum rule (Kittler et al., 1998). In this study, we adopt the sum rule, a popular fixed rule, to combine the outputs of the Chinese and English text classifiers.
To use MCS to detect emotion in code-switching texts, we first define the base classifiers. In this paper, the two base classifiers are the Chinese text classifier $f_{CN}$ and the English text classifier $f_{EN}$, which consider only the Chinese text and only the English text, respectively. Each base classifier provides a kind of confidence measurement, e.g., the posterior probabilities of the test sample belonging to each class. Formally, each base classifier assigns a test sample a posterior probability for each emotion class, and the sum rule selects the class whose posterior probabilities, summed over the base classifiers, are highest. Figure 5 illustrates the process of the multiple classifier system for emotion detection in code-switching texts.
Figure 5: The multiple classifier system for emotion detection in code-switching texts (documents; code-switching text identification; Chinese text and English text; classifiers $f_{CN}$ and $f_{EN}$; classifier combination by sum rule; emotions)
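The sum rule itself is simple to state: for each class, add the posterior probabilities produced by the base classifiers and pick the class with the largest total. The sketch below applies it to two hand-written posterior distributions standing in for the outputs of the Chinese and English text classifiers; the numbers, the dictionary representation, and the function name are assumptions for illustration and not outputs of the actual system.

```python
def combine_by_sum_rule(posteriors_cn, posteriors_en):
    """Combine two {label: posterior} dicts with the sum rule.

    Returns the winning label and the per-label summed scores.
    """
    labels = set(posteriors_cn) | set(posteriors_en)
    totals = {
        label: posteriors_cn.get(label, 0.0) + posteriors_en.get(label, 0.0)
        for label in labels
    }
    # Sorting first makes tie-breaking deterministic (alphabetical order).
    best_label = max(sorted(totals), key=totals.get)
    return best_label, totals


if __name__ == "__main__":
    # Hypothetical posteriors for one post.
    from_chinese_text = {"happiness": 0.40, "sadness": 0.45, "anger": 0.15}
    from_english_text = {"happiness": 0.70, "sadness": 0.20, "anger": 0.10}
    label, scores = combine_by_sum_rule(from_chinese_text, from_english_text)
    print(label)  # happiness
```

In this toy case the Chinese text alone would narrowly favour sadness, but the confident English evidence shifts the combined decision to happiness, which is the kind of complementarity the combination is meant to exploit.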

Experiments
As described in Section 3, the data are collected from Weibo.com. We randomly select half of the posts as the training data and the other half as the test data. We use FudanNLP for Chinese word segmentation and Maximum Entropy (ME) as the basic supervised classification model; the ME algorithm is implemented with the MALLET Toolkit. Note that, as the number of posts which express fear and surprise is limited, we only detect the other three emotions, i.e., happiness, sadness, and anger.
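A random half-and-half split of this kind can be sketched in a few lines. The fixed seed and the function name are assumptions added so that the example is reproducible; the paper does not say how the random selection was seeded or implemented.

```python
import random


def split_in_half(posts, seed=0):
    """Shuffle the posts and return (training half, test half)."""
    shuffled = list(posts)
    random.Random(seed).shuffle(shuffled)  # seed is an assumption
    middle = len(shuffled) // 2
    return shuffled[:middle], shuffled[middle:]


if __name__ == "__main__":
    train, test = split_in_half(["post %d" % i for i in range(10)])
    print(len(train), len(test))  # 5 5
```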
As discussed in the above subsection, we use the following approaches for automatic emotion detection in code-switching text:
• $f_{ALL}$: uses all the words of each post as features to train a Maximum Entropy (ME) classification model.
• $f_{CN}$: uses only the Chinese text of each post.
• $f_{EN}$: uses only the English text of each post.
• $f_{comb}$: combines $f_{CN}$ and $f_{EN}$ in a multiple classifier system using the sum rule.
From the table, we find that:
1) The performance of the basic approach $f_{ALL}$, which uses the mixed text directly, is inferior.
2) As Chinese is the dominant language and the English text is loosely distributed, using the Chinese text ($f_{CN}$) outperforms both using all the text ($f_{ALL}$) and using the English text ($f_{EN}$). Besides, as the English text in the posts is always composed of single words, the performance of $f_{EN}$ is much lower than that of the other two approaches.
3) By incorporating both the Chinese and the English classifiers into a multiple classifier system, $f_{comb}$ achieves better performance than the other approaches. This also indicates that both the Chinese text and the English text in code-switching posts are effective for detecting emotions.

Conclusion
This paper presents the development of a code-switching emotion corpus in which emotion is expressed through either Chinese or English. We first collect and filter data from Weibo.com and annotate it with both emotion and caused language; we then analyze the inter-annotator agreement on the dataset, and present our findings and analysis. Finally, we propose a multiple-classifier-based approach to detect emotion in the annotated code-switching corpus. Results show that both the Chinese text and the English text in code-switching posts are effective in detecting emotions. We believe that emotion analysis in code-switching text underlies an innovative approach towards a linguistic model of emotion as well as automatic emotion detection and classification.