Emotion Distribution Learning from Texts



Introduction
The advent of social media and its prosperity enable the creation of massive online user-generated content, including opinions and product reviews. Analyzing such user-generated content allows the detection of users' emotional states, which may be useful for downstream applications such as brand watching, product recommendation, and detection of health-related issues. Based on the way emotions are represented, computational models for emotion analysis can be categorized into dimensional models and categorical models (Calvo and D'Mello, 2010). Dimensional approaches (Russell, 2003) emphasize the fundamental dimensions of valence and arousal in understanding emotional experience, which have long been studied by emotion theorists. Categorical models (Gupta et al., 2013) involve the use of a categorical representation, in which emotions are represented by a number of labels. For example, Ekman's basic emotion set (Ekman, 1992) consists of anger, disgust, fear, happiness, sadness and surprise. An example of a sentence and its annotated emotions can be found in Table 1.

Table 1
Sentence:  Trains crash near Thai resort town
Emotions:  anger  disgust  fear  joy  sadness  surprise
           2      0        62    0    90       10

Considering each basic emotion as a class label for the sentence, emotion detection can be treated as a classification problem. There is a large body of prior work on emotion classification (Mishne and de Rijke, 2006; Lin and He, 2009; Quan et al., 2015; Wang and Pal, 2015). By choosing the strongest emotion as the emotion label for the sentence, most classification approaches are based on single-label learning. However, as shown in Table 1, a sentence might contain multiple emotions with varying intensities. Some lexicon-based approaches, such as (Wang and Pal, 2015), can output multiple emotions with intensities using non-negative matrix factorization, but they can only guarantee convergence to a local minimum, which is prohibitive for large, realistically sized emotion detection problems.
Machine learning methods such as multi-label learning (MLL) can be employed to identify multiple emotions for each sentence (Zhang and Zhou, 2014). MLL usually selects a threshold, then labels emotions with scores higher than the threshold as relevant and the others as irrelevant. However, these methods are not able to learn the intensity of each emotion. To address this problem, a new machine learning paradigm called Label Distribution Learning (LDL) (Geng, 2016) was proposed in recent years. Similarly, in this paper, we propose an emotion distribution learning (EDL) algorithm. Unlike previous approaches, EDL assumes that each sentence contains a mixture of basic emotions with different intensities. Using a categorical model, we can label each sentence with an emotion vector in which each element corresponds to one basic emotion and the value of each element indicates the intensity of that emotion. We require that each vector element has a value between 0 and 1 and that the elements sum to 1. By doing so, the emotion vectors can be considered as emotion distributions, and the proposed EDL algorithm aims to learn the mapping from sentences to their corresponding emotion distributions by minimizing the differences between the true distributions and the predicted distributions. Both single-label learning and MLL can be considered as special cases of EDL in emotion detection. Moreover, as some emotions co-occur more often while others rarely co-exist, the relations between basic emotions are captured according to Plutchik's wheel of emotions theory (Plutchik, 1980) and are incorporated in the learning framework as constraints in order to improve the accuracy of emotion detection.
Our work makes the following contributions:
• We propose a novel approach based on emotion distribution learning to identify multiple emotions with their intensities from texts. To the best of our knowledge, it is the first attempt to identify both emotions and intensities in the distribution learning framework.
• The relations between basic emotions are incorporated into the learning framework as constraints to improve the emotion detection accuracy. To avoid the incorporation of noisy information from the training data, the relation constraint is set based on Plutchik's wheel of emotions theory.
• Experimental results show that the proposed approach can effectively deal with the emotion distribution detection problem and performs remarkably better than the state-of-the-art multi-label learning methods and the state-of-the-art emotion detection method.

Related Work
In general, emotion classification can be approached by two types of methods, lexicon-based or corpus-based. Lexicon-based approaches rely on emotion lexicons consisting of words and their corresponding emotion labels for detecting emotions from text. For example, WordNetAffect (Strapparava and Valitutti, 2004) was constructed by extending WordNet, a lexical database of English terms, with information on affective terms. EmoSenticNet assigns six WordNetAffect emotion labels to SenticNet concepts (Poria et al., 2013), which can be thought of as an expansion of WordNetAffect emotion labels to a larger vocabulary. Many approaches have been proposed based on emotion lexicons. For example, Aman and Szpakowicz (2007) classified emotional and non-emotional sentences using a constructed emotion lexicon. Choudhury et al. (2012) employed a classifier to detect human affective states in social media. Wang and Pal (2015) proposed a model with several constraints based on an emotion lexicon for emotion classification.
Corpus-based methods aim to train supervised classifiers from annotated training data where each sentence or document is labelled with an emotion class. Mishne and de Rijke (2006) constructed models to predict the levels of various moods according to the language used by bloggers at a given time. Aman and Szpakowicz (2007) described an emotion annotation task of identifying emotion category, emotion intensity and the words/phrases that indicate emotions in text. Emotion classification was conducted using trained support vector machines. Agrawal and An (2012) proposed an unsupervised context-based approach to detect emotions from text at the sentence level. They computed an emotion vector for each potential affect-bearing word based on the semantic relatedness between words and various emotion concepts. The scores are then tuned using the syntactic dependencies within the sentence structure. Bao et al. (2009) proposed an emotion topic model by augmenting latent Dirichlet allocation with an intermediate emotion layer. Quan et al. (2015) proposed a logistic regression model for social emotion detection. Intermediate hidden variables were also introduced to model the latent structure of input text corpora.
Our work is partly inspired by (Quan et al., 2015). However, our proposed approach differs from (Quan et al., 2015) in two aspects: 1) by introducing the emotion distribution learning framework, many different criteria can be used to measure the distance between the true distribution and the predicted distribution, such as squared $\chi^2$, Euclidean and Jeffrey's divergence, in addition to the Kullback-Leibler divergence employed in the logistic regression model; 2) the relations between basic emotions are captured based on Plutchik's wheel of emotions theory to avoid the incorporation of any noisy information from the training data.

Problem Setting
As discussed in Section 1, one sentence might contain one or more emotions, and each emotion has its own intensity. We use $d_x^y$ to indicate the intensity of emotion $y$ for sentence $x$, where $x \in \mathcal{X}$ and $y \in \mathcal{Y}$. The emotion intensity is normalized so that $d_x^y \in [0, 1]$ and $\sum_y d_x^y = 1$, which constitutes the emotion distribution.
Note that $d_x^y$ denotes the proportion that $y$ accounts for in a full emotion distribution of $x$. It is different from the probability of $y$ being a correct emotion label for $x$. A probability distribution implies that only one emotion label is correct for each sentence, while an emotion distribution allows multiple emotions in one sentence. The goal of EDL is to learn a mapping from sentences $\mathcal{X} = \mathbb{R}^m$ to the distributions over a finite set of labels $\mathcal{Y} = \{y_1, y_2, \dots, y_c\}$. Each label represents one of the basic emotions.
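As a small illustration of the two constraints above, the following sketch turns raw annotation scores into an emotion distribution. It assumes that normalization simply divides each score by the sum of all scores; the function name `to_distribution` is hypothetical, and the scores are those of the example sentence in Table 1.

```python
def to_distribution(scores):
    """Normalize non-negative emotion scores so they lie in [0, 1] and sum to 1.

    Assumption: normalization divides each score by the total score.
    """
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("at least one emotion score must be positive")
    return {emotion: value / total for emotion, value in scores.items()}


# Scores of the example sentence "Trains crash near Thai resort town".
raw_scores = {"anger": 2, "disgust": 0, "fear": 62, "joy": 0, "sadness": 90, "surprise": 10}
distribution = to_distribution(raw_scores)
```

Under this assumption, single-label learning corresponds to a distribution with all mass on one emotion, which is why it can be viewed as a special case.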

Learning
Given a training set $P = \{(x_1, E_1), (x_2, E_2), \dots, (x_n, E_n)\}$, where $E_i = \{d_{x_i}^{y_1}, d_{x_i}^{y_2}, \dots, d_{x_i}^{y_c}\}$ is the emotion distribution associated with $x_i$, the goal of EDL is to learn a conditional probability mass function $p(y|x)$ from $P$, where $x \in \mathcal{X}$ and $y \in \mathcal{Y}$. Assuming that $p(y|x)$ is a parametric model $p(y|x; \theta)$, where $\theta$ denotes the model parameters, many different criteria can be used to measure the distance between two distributions, such as squared $\chi^2$, Euclidean, Jeffrey's divergence, Kullback-Leibler (K-L) divergence and so on. Here we use a divergence that sums the distances between the emotion intensities at the same position in the two distributions.
The optimal model parameters $\theta^*$ are then determined by Equation 1, where $E_i$ is the ground truth emotion distribution of the $i$-th sentence and $\hat{E}_i$ is the one predicted by $p(y|x_i; \theta)$. The second term is a regularizer that makes the predicted emotion distribution sparse, and the third term considers the relationship between different emotions. As mentioned in Section 1, some emotions often co-occur, such as joy and love, and some rarely co-exist, such as joy and anger. Therefore, the third term is employed to incorporate such prior knowledge. The weight $\omega_{jk}$ models the relationship between the $j$-th emotion and the $k$-th emotion in the distribution. In this paper, we capture the relationships between different emotions based on Plutchik's wheel of emotions (Plutchik, 1980), which originates from psychology. Plutchik's wheel of emotions includes several typical emotions, and its eight sectors indicate eight primary emotion dimensions arranged as four pairs of opposites. We reproduce a wheel of the relationships among eight emotions according to Plutchik's theory, which is shown in Figure 1. In the emotion wheel, emotions located at opposite ends have an opposite relationship, while emotions next to each other are more closely related. We quantify the relation between each pair of emotions based on the angle between them in the wheel of emotions (Plutchik, 2001). For example, emotion pairs separated by 180 degrees are opposite to each other and are assigned −1, while emotion pairs separated by 90 degrees are assigned 0, meaning there is no relationship between them. Emotion pairs separated by 45 degrees have a relationship value of 0.5, while emotion pairs separated by 135 degrees have a relationship value of −0.5. Figure 2 shows the gray-scale image of the pair-wise relationships of the emotions presented in Figure 1. In each cell, the darker the color is, the more similar the two emotions are.
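The angle-to-weight mapping described above can be sketched as follows. The four stated values (45 → 0.5, 90 → 0, 135 → −0.5, 180 → −1) lie on the line 1 − angle/90, so the sketch assumes this linear rule, which also gives 1 for an emotion paired with itself; that diagonal value is an assumption. The emotion ordering `WHEEL` uses Plutchik's eight primary emotions for illustration only and may differ from the wheel reproduced in Figure 1; all function names are hypothetical.

```python
# Illustrative ordering of Plutchik's eight primary emotions around the wheel
# (adjacent sectors are 45 degrees apart). This ordering is an assumption.
WHEEL = ["joy", "trust", "fear", "surprise", "sadness", "disgust", "anger", "anticipation"]


def wheel_angle(i, j, n=len(WHEEL)):
    """Smallest angle in degrees between sectors i and j on an n-sector wheel."""
    steps = abs(i - j) % n
    steps = min(steps, n - steps)
    return steps * 360.0 / n


def relation_weight(angle):
    """Linear rule through the stated values: 45 -> 0.5, 90 -> 0, 135 -> -0.5, 180 -> -1."""
    return 1.0 - angle / 90.0


def relation_matrix(wheel=WHEEL):
    """Pair-wise relation weights omega[j][k] for every pair of emotions on the wheel."""
    n = len(wheel)
    return [[relation_weight(wheel_angle(j, k, n)) for k in range(n)] for j in range(n)]


omega = relation_matrix()
joy = WHEEL.index("joy")
```

With this ordering, `omega[joy][WHEEL.index("sadness")]` is the weight of an opposite pair and `omega[joy][WHEEL.index("trust")]` the weight of a neighbouring pair.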
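As for $p(y|x; \theta)$, similar to (Geng, 2016), we assume it takes the form of a maximum entropy model, i.e.,

$$p(y_k|x_i; \theta) = \frac{1}{Z_i} \exp\Big(\sum_r \theta_{kr} x_i^r\Big), \quad (2)$$

where $Z_i = \sum_k \exp\big(\sum_r \theta_{kr} x_i^r\big)$ is the normalization factor, $x_i^r$ is the $r$-th feature of $x_i$, and $\theta_{kr}$ is an element of $\theta$. Substituting Equation 2 into Equation 1 yields the target function $T(\theta)$. The minimization of $T(\theta)$ can be effectively solved by the limited-memory quasi-Newton method (L-BFGS). The basic idea of L-BFGS is to avoid explicit calculation of the inverse Hessian matrix used in the Newton method. L-BFGS approximates the inverse Hessian matrix with an iteratively updated matrix instead of actually storing the full matrix. Here we follow the idea of the effective quasi-Newton method BFGS. Consider the second-order Taylor series of $T'(\theta) = -T(\theta)$ at the current estimate of the parameter vector $\theta^{(l)}$:

$$T'(\theta^{(l+1)}) \approx T'(\theta^{(l)}) + \nabla T'(\theta^{(l)})^{\top} \Delta + \frac{1}{2} \Delta^{\top} H(\theta^{(l)}) \Delta, \quad (4)$$

where $\Delta = \theta^{(l+1)} - \theta^{(l)}$ is the update step, and $\nabla T'(\theta^{(l)})$ and $H(\theta^{(l)})$ are the gradient and Hessian matrix of $T'(\theta)$ at $\theta^{(l)}$, respectively. The minimizer of Equation 4 is

$$\Delta^{(l)} = -H^{-1}(\theta^{(l)}) \nabla T'(\theta^{(l)}). \quad (5)$$

The line search Newton method uses $\Delta^{(l)}$ as the search direction $p^{(l)} = \Delta^{(l)}$ and updates the model parameters by

$$\theta^{(l+1)} = \theta^{(l)} + \alpha^{(l)} p^{(l)},$$

where the step length $\alpha^{(l)}$ is obtained from a line search procedure to satisfy the strong Wolfe conditions (Nocedal and Wright, 2006):

$$T'(\theta^{(l)} + \alpha^{(l)} p^{(l)}) \le T'(\theta^{(l)}) + c_1 \alpha^{(l)} \nabla T'(\theta^{(l)})^{\top} p^{(l)},$$
$$\big|\nabla T'(\theta^{(l)} + \alpha^{(l)} p^{(l)})^{\top} p^{(l)}\big| \le c_2 \big|\nabla T'(\theta^{(l)})^{\top} p^{(l)}\big|,$$

where $0 < c_1 < c_2 < 1$. The idea of BFGS is to avoid explicit calculation of $H^{-1}(\theta^{(l)})$ by approximating it with an iteratively updated matrix $B$, i.e.,

$$B^{(l+1)} = \big(I - \rho^{(l)} s^{(l)} (u^{(l)})^{\top}\big) B^{(l)} \big(I - \rho^{(l)} u^{(l)} (s^{(l)})^{\top}\big) + \rho^{(l)} s^{(l)} (s^{(l)})^{\top},$$

where $s^{(l)} = \theta^{(l+1)} - \theta^{(l)}$, $u^{(l)} = \nabla T'(\theta^{(l+1)}) - \nabla T'(\theta^{(l)})$ and $\rho^{(l)} = 1 / \big((s^{(l)})^{\top} u^{(l)}\big)$. As for the optimization of the target function $T(\theta)$, the computation of BFGS is mainly related to the first-order gradient of $T'(\theta)$, which can be obtained in closed form in terms of the predicted intensities $p_{ij} = \frac{1}{Z_i} \exp\big(\sum_r \theta_{jr} x_i^r\big)$. Thus it performs more efficiently than the standard line search Newton method.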
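The maximum entropy model with its normalization factor $Z_i$ can be sketched in a few lines. This is only an illustration of the prediction step, not of the training procedure; the function name `predict_distribution` and the toy parameter values are hypothetical.

```python
import math


def predict_distribution(theta, x):
    """Return p(y_k | x; theta) for every label k under the maximum entropy model.

    theta: list of c rows (one per emotion), each holding m weights theta_kr.
    x:     list of m feature values x^r.
    """
    scores = [sum(t_kr * x_r for t_kr, x_r in zip(row, x)) for row in theta]
    shift = max(scores)  # subtracting the maximum avoids overflow and cancels in the ratio
    exps = [math.exp(s - shift) for s in scores]
    z = sum(exps)  # normalization factor Z_i (up to the common shift)
    return [e / z for e in exps]


# Toy example: three emotions, two features (values chosen for illustration).
theta = [[0.2, -0.1], [0.0, 0.3], [-0.4, 0.1]]
x = [1.0, 2.0]
p = predict_distribution(theta, x)
```

Because the output is normalized by $Z_i$, every prediction automatically satisfies the distribution constraints (values in [0, 1], summing to 1).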
In order to compare with the MLL methods, labels in the predicted distribution need to be divided into two sets, i.e., the relevant and irrelevant sets. For this purpose, an extra virtual label $y_0$ is added to the label set, giving the extended label set $\mathcal{Y}' = \mathcal{Y} \cup \{y_0\} = \{y_0, y_1, y_2, \dots, y_c\}$. Using the extended label set in the training process, the optimal parameter vector $\theta^*$ is learned. As $y_0$ is the label that directly distinguishes the relevant emotions from the irrelevant ones, it is initialized as the threshold used in MLL. Given a sentence $x'$, its emotion distribution is predicted by $p(y|x'; \theta^*)$. The intensity value of $y_0$ splits the predicted distribution into two sets. The emotions with an intensity value higher than that of $y_0$ are regarded as the relevant emotions, and the remaining emotions are regarded as irrelevant ones. Therefore, EDL in fact implements the function of MLL without the need to set the threshold manually.
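The split induced by the virtual label can be sketched as below. The function name `split_by_virtual_label`, the key `"y0"` and the predicted values are hypothetical and serve only to illustrate the rule that emotions with an intensity strictly higher than that of $y_0$ are relevant.

```python
def split_by_virtual_label(distribution, virtual_label="y0"):
    """Split emotions into relevant and irrelevant sets using the virtual label's intensity."""
    threshold = distribution[virtual_label]
    relevant = {e for e, v in distribution.items() if e != virtual_label and v > threshold}
    irrelevant = {e for e in distribution if e != virtual_label and e not in relevant}
    return relevant, irrelevant


# Hypothetical predicted distribution over the extended label set.
predicted = {"y0": 0.10, "joy": 0.45, "love": 0.30, "anxiety": 0.10, "sorrow": 0.05}
relevant, irrelevant = split_by_virtual_label(predicted)
```

Note that the threshold here is learned as part of the distribution for each sentence rather than fixed globally.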

Setup
We evaluate the proposed approach on the Ren-CECps corpus (Quan and Ren, 2010). It contains 35,096 sentences selected from blogs in Chinese. Each sentence is annotated with 8 basic emotions, namely anger, anxiety, expect, hate, joy, love, sorrow and surprise, together with their emotion scores. A higher score represents a higher emotion intensity. We use $AS_i(j)$ to represent the score of emotion $j$ in sentence $i$. Given a sentence $x_i$, the intensity of emotion $j$ is calculated by $d_{x_i}^{y_j} = AS_i(j) / \sum_k AS_i(k)$. By doing so, each intensity value fulfills $d_{x_i}^{y_j} \in [0, 1]$ and $\sum_j d_{x_i}^{y_j} = 1$. For each sentence, features are extracted using recursive auto-encoders (RAEs) (Socher et al., 2011). RAEs are neural networks that represent the meanings of fixed-size inputs in a reduced dimensional space. For example, each word in a sentence is represented using a vector $w \in \mathbb{R}^d$, and the RAE method reduces the entire sentence to a single vector in $\mathbb{R}^d$. Sentences are sequences of words that can be represented by a binary tree structure. The words are the leaves of the tree, and their combined grouping is used to get a notion of the meaning of the sentence.
The internal nodes of the tree correspond to the combined meaning of the nodes underneath them. Each internal node is also represented in the same manner as individual words, in the form of a vector $\hat{w} \in \mathbb{R}^d$. These internal nodes are the hidden representations of the neural network. In the RAE model, the vocabulary is stored in an embedding matrix $V \in \mathbb{R}^{d \times D}$, where $D$ is the cardinality of the vocabulary. Typically, each word $w \in V$ is initialized independently following a Gaussian distribution $w_i \sim \mathcal{N}(0, \gamma^2)$. In our experiment, we set the dimension of each sentence representation to 100.
We build the gray-scale image shown in Figure 3 by computing the correlation coefficients of the emotions from the Ren-CECps corpus. It can be observed that Figure 3 is quite similar to Figure 2, which shows that our proposed way of capturing the relations between emotions is in line with what has been revealed by the emotion annotations in the Ren-CECps corpus.
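A data-driven relation matrix of this kind can be sketched as follows, assuming the correlation coefficient is the Pearson correlation between the per-sentence intensities of two emotions. The function names and the four toy annotations are hypothetical and are not taken from the Ren-CECps corpus.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def correlation_matrix(intensities, emotions):
    """Pair-wise correlations between emotions over a list of annotated sentences."""
    columns = {e: [row[e] for row in intensities] for e in emotions}
    return {a: {b: pearson(columns[a], columns[b]) for b in emotions} for a in emotions}


# Toy annotations (one emotion distribution per sentence), for illustration only.
toy = [
    {"joy": 0.6, "love": 0.4, "sorrow": 0.0},
    {"joy": 0.5, "love": 0.3, "sorrow": 0.2},
    {"joy": 0.1, "love": 0.1, "sorrow": 0.8},
    {"joy": 0.0, "love": 0.2, "sorrow": 0.8},
]
corr = correlation_matrix(toy, ["joy", "love", "sorrow"])
```

In this toy data, joy and love tend to rise together while joy and sorrow move in opposite directions, mirroring the kind of pattern the wheel-based weights are meant to encode.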

Experimental Results
As the output of EDL is a distribution, a natural choice of criteria is the averaged similarity or distance between the actual emotion distribution and the predicted distribution. There are many metrics that can be applied to measure the distance between two distributions. In this paper six of them are used to evaluate the results of EDL, i.e., Euclidean, Sørensen, squared $\chi^2$, K-L divergence, Intersection and Fidelity, as suggested in (Geng and Ji, 2013). The formulae of the six criteria are summarized in Table 4.2. Note that the virtual label $y_0$ is removed before evaluation.
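The six measures can be sketched as below, assuming their standard definitions for two discrete distributions P (ground truth) and Q (prediction): the first four are distances (smaller is better) and the last two are similarities (larger is better). The function names, the `eps` guard and the toy distributions are choices made for this illustration.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def sorensen(p, q):
    return sum(abs(a - b) for a, b in zip(p, q)) / sum(a + b for a, b in zip(p, q))


def squared_chi2(p, q):
    # Positions where both intensities are zero contribute nothing.
    return sum((a - b) ** 2 / (a + b) for a, b in zip(p, q) if a + b > 0)


def kl_divergence(p, q, eps=1e-12):
    # eps guards against division by zero; terms with a == 0 contribute 0.
    return sum(a * math.log(a / max(b, eps)) for a, b in zip(p, q) if a > 0)


def intersection(p, q):
    return sum(min(a, b) for a, b in zip(p, q))


def fidelity(p, q):
    return sum(math.sqrt(a * b) for a, b in zip(p, q))


# Toy ground-truth and predicted distributions, for illustration only.
truth = [0.5, 0.3, 0.2, 0.0]
pred = [0.4, 0.4, 0.1, 0.1]
```

For identical distributions the four distances are 0 and the two similarities are 1, which gives a quick sanity check for any implementation.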
As EDL can output both the relevant emotions and their respective intensities, MLL can be seen as a special case of EDL that outputs only emotion labels but not their intensities. Several evaluation criteria typically used in MLL can also be used to measure EDL's ability to distinguish relevant emotions from irrelevant ones, including Hamming loss, one-error, coverage, ranking loss, and average precision, as suggested by (Zhang and Zhou, 2014), which are summarized in Table 4.2. Hamming loss evaluates how many times an emotion label is misclassified. One-error evaluates the fraction of sentences whose top-ranked emotion is not in the relevant emotion set. Coverage evaluates how many steps are needed to move down the ranked emotion list so as to cover all the relevant emotions of the example. Ranking loss evaluates the fraction of reversely ordered emotion pairs. Average precision evaluates the average fraction of the relevant emotions ranked higher than a particular emotion $y \in \mathcal{Y}$.
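The five multi-label criteria can be sketched for a single sentence as follows, assuming their usual definitions; corpus-level values would be averages over all sentences. Labels are represented by integer indices, the relevant and irrelevant sets are assumed to be non-empty, and all function names and toy values are hypothetical.

```python
def ranks(scores):
    """Rank labels by descending score; the top-scoring label gets rank 1."""
    order = sorted(range(len(scores)), key=lambda k: -scores[k])
    return {label: position + 1 for position, label in enumerate(order)}


def hamming_loss(predicted, relevant, n_labels):
    """Fraction of labels on which the predicted and true label sets disagree."""
    return len(predicted ^ relevant) / n_labels


def one_error(scores, relevant):
    """1 if the top-ranked label is not relevant, else 0."""
    top = max(range(len(scores)), key=lambda k: scores[k])
    return 0.0 if top in relevant else 1.0


def coverage(scores, relevant):
    """Steps down the ranked list needed to cover all relevant labels."""
    r = ranks(scores)
    return max(r[y] for y in relevant) - 1


def ranking_loss(scores, relevant):
    """Fraction of (relevant, irrelevant) pairs that are ordered the wrong way."""
    irrelevant = [k for k in range(len(scores)) if k not in relevant]
    bad = sum(1 for y in relevant for z in irrelevant if scores[y] <= scores[z])
    return bad / (len(relevant) * len(irrelevant))


def average_precision(scores, relevant):
    """Average, over relevant labels, of the fraction of relevant labels ranked at or above each."""
    r = ranks(scores)
    return sum(
        sum(1 for z in relevant if r[z] <= r[y]) / r[y] for y in relevant
    ) / len(relevant)


# Toy example with four labels: predicted intensities, true and predicted label sets.
scores = [0.5, 0.05, 0.3, 0.15]
relevant = {0, 3}
predicted = {0, 2}
```

In this example the top-ranked label is relevant, so one-error is 0, while the relevant label 3 is ranked below the irrelevant label 2, which shows up in coverage, ranking loss and average precision.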
For each algorithm, ten-fold cross validation is conducted. EDL is first compared with four existing Label Distribution Learning (LDL) methods (Geng, Table 4.2. For all the measures, "↓" indicates "the smaller the better", while "↑" indicates "the larger the better". The best performance on each measure is highlighted in boldface. Two-tailed t-tests at the 5% significance level are performed to see whether the differences between EDL and the baselines are statistically significant. We use • to indicate a significant difference. As the state-of-the-art emotion detection method proposed in (Wang and Pal, 2015) can output emotion distributions based on a dimensional reduction method, we present its experimental results on the Ren-CECps corpus in the last row of Table 4.2. It can be observed that EDL performs significantly better than all the baseline LDL methods and the state-of-the-art emotion detection approach on all criteria considered here.
Since EDL can be seen as an extension of MLL, EDL is compared with 7 widely used MLL methods using the virtual label $y_0$, namely ML-kNN (Zhang and Zhou, 2014), ECC (Read et al., 2011), MLLOC (Huang and Zhou, 2012), LIFT (Zhang, 2011), ML-RBF (Zhang, 2009), Rank-SVM (Zhang and Zhou, 2014) and BP-MLL (Zhang and Zhou, 2006). Among the compared algorithms, ML-kNN is derived from the traditional k-nearest neighbor (kNN) algorithm. The maximum a posteriori (MAP) principle is used to determine which emotion set is related to the given sentence. CC (the classifier chains method) overcomes the limitations of BR (binary relevance) and performs better but requires more computation. ECC (ensemble classifier chains) applies classifier chains in an ensemble framework and obtains high predictive performance. MLLOC (Multi-label LOcal Correlation) tries to exploit emotion correlations in the expression data locally. The global discrimination fitting and local correlation sensitivity are incorporated into a unified framework, and a solution for the optimization is developed. Rank-SVM provides a way of controlling the complexity of the overall learning system while having a small empirical error. The architecture of Rank-SVM is based on linear models of Support Vector Machines (SVM) (Boser et al., 1992). LIFT constructs features specific to each emotion by conducting clustering analysis on its positive or negative instances, and then performs training and testing by querying the clustering results (Zhang, 2011). BP-MLL is derived from the well-known backpropagation algorithm by employing a novel error function that captures the characteristics of multi-label learning, i.e., the emotions belonging to a sentence should be ranked higher than those not belonging to that sentence (Zhang and Zhou, 2006).
The virtual label $y_0$ used in EDL and the threshold value used in MLL are both set to 2.5. In addition, $\varepsilon$, $\xi_1$ and $\xi_2$ are set to 0.25, 0.0001 and 0.1, respectively. For the MLL methods, the value of $k$ is set to 8 in ML-kNN, and the ratio is 0.02 and $\mu$ is 2 in ML-RBF. A linear kernel is used in LIFT. Rank-SVM uses the RBF kernel with the width $\sigma$ equal to 1. The evaluation results of the proposed approach in comparison to all MLL baselines are presented in Table 4.2. EDL performs best on all evaluation measures. This verifies the advantage that EDL gains from considering the varying intensities of the basic emotions.

Further Analysis
To fully understand the emotion detection results, we use word clouds (Harris, 2011) to output the top 30 most frequent words in the testing data for the emotions love and anxiety based on the annotation, as shown in the left part of Figure 4. We also output the top 30 most frequent words for the two emotions based on the prediction generated by EDL, as shown in the right part of Figure 4. It can be observed that most words based on the prediction indeed express their associated emotions. For example, the word "like" conveys the emotion of love (right part of Figure 4(a)) and the word "problem" conveys anxiety (right part of Figure 4(b)). Moreover, the annotation and the prediction share 20 out of the top 30 most frequent words for the emotion love, such as "friend", "joy" and "happiness", as shown in the middle of Figure 4(a), and 19 out of 30 for the emotion anxiety (the middle of Figure 4(b)). This demonstrates that EDL can learn emotions from text precisely.
To investigate the emotion distributions generated by EDL, a sentence from the Ren-CECps corpus together with the emotion distribution output by EDL is illustrated in Figure 5. The ground truth emotion distribution is obtained by normalizing the scores and the virtual label $y_0$. As can be seen, the curve of the predicted emotion distribution is very similar to the ground truth distribution, which demonstrates that EDL can learn the varying intensities of all the basic emotions well.

Conclusions and Future Work
In this paper, we have proposed a novel approach based on EDL to identify multiple emotions with their intensities from texts. Moreover, the relations between basic emotions are incorporated in the learning framework as constraints to improve the learning accuracy. Experimental results show that the proposed approach can effectively deal with the emotion distribution detection problem and performs remarkably better than the state-of-the-art multi-label learning methods and the state-of-the-art emotion detection method. In future work, we will investigate the efficiency of the proposed approach on other datasets and explore other methods of capturing the inter-relations of emotions.