Relevant Emotion Ranking from Text Constrained with Emotion Relationships

Text might contain or invoke multiple emotions with varying intensities. As such, emotion detection, which predicts multiple emotions associated with a given text, can be cast as a multi-label classification problem. We would like to go one step further and generate a ranked list of relevant emotions, where top-ranked emotions are more intensely associated with the text than lower-ranked emotions, whereas the rankings of irrelevant emotions are not important. A novel framework of relevant emotion ranking is proposed to tackle the problem. In the framework, the objective loss function is carefully designed so that both emotion prediction and rankings of only relevant emotions can be achieved. Moreover, we observe that some emotions co-occur more often while other emotions rarely co-exist. Such information is incorporated into the framework as constraints to improve the accuracy of emotion detection. Experimental results on two real-world corpora show that the proposed framework can effectively deal with emotion detection and performs remarkably better than the state-of-the-art emotion detection approaches and multi-label learning methods.


Introduction
With the growing prosperity of Web 2.0, people tend to share their feelings, attitudes and opinions through social platforms such as online news sites and blogs. Detecting emotions from text can enhance the understanding of users' emotional states, which is useful in many downstream applications, such as human-computer interaction and personalized recommendation. Therefore, it is crucial to analyze and predict emotions from text accurately (Picard and Picard, 1997).
Research on emotion detection can be roughly categorized into two types: lexicon-based and learning-based approaches. Lexicon-based approaches usually rely on emotion lexicons (Lei et al., 2014; Rao et al., 2012). They cannot deal with text whose words are not found in the emotion lexicons. Learning-based approaches can be further classified into unsupervised and supervised learning methods. Unsupervised approaches do not require annotated data for training. For example, by adding an emotion layer into traditional topic models, emotion-topic models were constructed to detect users' emotions (Bao et al., 2012, 2009). Supervised learning approaches consider each emotion category as a class label, and emotion detection is cast as a classification problem. If only the strongest emotion is chosen as the emotion label for a given text, emotion detection is essentially a single-label classification problem (Lin et al., 2008; Quan et al., 2015). To predict multiple emotions simultaneously, emotion detection can be solved in the multi-label classification framework (Bhowmick, 2009). Moreover, to predict both multiple emotions and their intensities, some approaches have been proposed using emotion distribution learning (Zhou et al., 2016). Some lexicon-based approaches such as (Wang and Pal, 2015) can also output multiple emotions with intensities using non-negative matrix factorization.
In this paper, we are interested in exploring emotion ranking from either the readers' perspective or the writers' perspective in two different real-world corpora. In both cases, a given text is associated with multiple emotions. For example, Figure 1 illustrates an online news article crawled from the Sina News Society Channel together with readers' emotion votes. It can be observed that when reading the news article, readers expressed different emotions, with the majority showing "Sadness" and "Anger". We notice that some emotions such as "Touching", "Curiosity" and "Amusement" only received 1 to 3 votes. In comparison to the total number of votes received, these votes could be considered as outliers or irrelevant. Also, the extremely low emotion votes might be due to readers' clicking errors. Taking such emotions into account during the learning process could introduce bias. Therefore, we aim to differentiate relevant emotions from irrelevant ones and only learn the rankings of relevant emotions while neglecting the irrelevant ones.
[Figure 1: An online news article from the Sina News Society Channel with readers' emotion votes. Article text shown in the figure (translated from Chinese): "2-year-old baby found abandoned in garbage heap by his runaway mother and drug-taking father. Recently, a netizen sought help for a 2-year-old baby who was alone at home, unattended and starving, because of his runaway mother and drug-taking father. According to the published pictures, the baby lives in a messy home with garbage everywhere. ……"]
Our work makes the following contributions:
• We propose a novel framework based on relevant emotion ranking to identify multiple emotions and produce the rankings of relevant emotions from text. In the framework, the objective emotion loss function is carefully designed so that both emotion prediction and rankings of only relevant emotions are achieved without being affected by irrelevant ones. To the best of our knowledge, it is the first attempt to perform emotion detection and relevant emotion ranking at the same time.
• As some emotions co-occur more often while others rarely co-exist, the prior knowledge of emotion relationships is incorporated into the framework as a constraint. Such emotion relationships can provide important cues for emotion detection.
• Experimental results on two real-world corpora show that the proposed framework can effectively deal with the emotion detection problem and performs better than the state-of-the-art emotion detection methods and multi-label learning methods.

Related work
Emotion detection is one of the subfields of sentiment analysis where emotions are more fine-grained and expressive. In general, emotion detection approaches can be categorized into two types: lexicon-based and learning-based approaches. Lexicon-based approaches usually rely on emotion lexicons consisting of words and their corresponding emotion labels. For example, Aman and Szpakowicz (2007) classified emotional and non-emotional sentences with a predefined emotion lexicon. Emotional dictionaries could also be constructed from training corpora of news articles and be used to predict the readers' emotions for a new article (Lei et al., 2014; Rao et al., 2012). Agrawal and An (2012) proposed a context-based approach to detect emotions from text at the sentence level. An emotion vector was generated for each potential affect-bearing word based on the semantic relation between emotion concepts and words. The emotion vector was then tuned based on the syntactic dependencies within the sentence structure. Other lexicon-based approaches such as (Wang and Pal, 2015) can also output multiple emotions with intensities using non-negative matrix factorization with constraints derived from an emotion lexicon.
Learning-based approaches can be further categorized into unsupervised and supervised learning methods. Unsupervised learning approaches do not require labelled data for training. For example, the emotion-topic models (Bao et al., 2012, 2009) were proposed by adding an extra emotion layer into traditional topic models such as Latent Dirichlet Allocation (Blei et al., 2003), thus capturing the generation of both emotion and text at the same time.
Supervised learning approaches typically cast emotion detection as a classification problem by considering each emotion category as a class label. If only the strongest emotion is chosen as the label for a given text, emotion detection is essentially a single-label classification problem. Lin, Yang and Chen (2008) studied the classification of news articles into different categories based on readers' emotions with various combinations of feature sets. Strapparava and Mihalcea (2008) proposed several knowledge-based and corpus-based methods for emotion classification. Quan et al. (2015) proposed a logistic regression model with emotion dependency for emotion detection. Latent variables were introduced to model the latent structure of the input text. To predict multiple emotions simultaneously, emotion detection can be solved using multi-label classification. Bhowmick (2009) presented a method for classifying news sentences into multiple emotion categories using an ensemble-based multi-label classification technique. Zhou et al. (2016) proposed a novel approach based on emotion distribution learning to predict multiple emotions with different intensities in a single sentence.

Methodology
Assume a set of $T$ emotions $E = \{e_1, e_2, \ldots, e_T\}$ and a set of $n$ instances $X = \{x_1, x_2, x_3, \ldots, x_n\}$. Each instance $x_i \in \mathbb{R}^d$ is associated with a ranked list of its relevant emotions $R_i \subseteq E$ and also a list of irrelevant emotions $\overline{R}_i = E - R_i$. Relevant emotion ranking aims to learn a score function $g(x_i) = [g_1(x_i), \ldots, g_T(x_i)]$ assigning a score $g_t(x_i)$ to each emotion $e_t$, $t \in \{1, \ldots, T\}$. As mentioned before, it is unnecessary to consider the rankings of irrelevant emotions since they might introduce errors into the model during the learning process. In order to differentiate relevant emotions from irrelevant ones, we need to define a threshold $g_\Theta(x)$, which could be simply set to 0 or learned from data (Fürnkranz et al., 2008). Those emotions with scores lower than the threshold will be considered as irrelevant and hence discarded. The identification of relevant emotions and their ranking can be obtained simultaneously according to the scores assigned by the ranking function $g$. Here, the predicted relevant emotions of instance $x_i$ are denoted as $\hat{R}_i = \{e_t \in E \mid g_t(x_i) > g_\Theta(x_i)\}$.
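The thresholding and ranking step described above can be sketched as follows. This is a minimal illustration, assuming the scores $g_t(x)$ have already been computed; the function name `rank_relevant_emotions` and the example scores are hypothetical and not taken from the paper.

```python
import numpy as np

def rank_relevant_emotions(scores, threshold=0.0):
    """Return indices of emotions scoring above the threshold, highest score first."""
    scores = np.asarray(scores, dtype=float)
    # Emotions at or below the threshold are treated as irrelevant and dropped.
    relevant = np.flatnonzero(scores > threshold)
    # Only the relevant emotions are ranked, by descending score.
    order = np.argsort(-scores[relevant], kind="stable")
    return relevant[order].tolist()

emotions = ["Amusement", "Touching", "Anger", "Sadness", "Curiosity", "Shock"]
g_x = [-0.4, -0.1, 0.9, 1.3, -0.7, 0.2]  # hypothetical scores g_t(x)
ranked = [emotions[t] for t in rank_relevant_emotions(g_x, threshold=0.0)]
print(ranked)
```

With the threshold fixed at 0, the three emotions with positive scores are kept and ordered by score, while the order among the discarded emotions is never examined.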

Emotion Loss Function
The goal of relevant emotion ranking is to learn the parameters of the ranking function $g$. Without loss of generality, we assume that the $g_t$ are linear models, i.e., $g_t(x_i) = w_t^\top x_i$, $t \in \{1, \ldots, T\} \cup \{\Theta\}$, where $\Theta$ denotes the threshold. Relevant emotion ranking can be regarded as a special case of multi-label learning. Several evaluation criteria typically used in multi-label learning can also be used to measure the ranking function's ability to distinguish relevant emotions from irrelevant ones, such as hamming loss, one error, coverage, ranking loss, and average precision, as suggested in (Zhang and Zhou, 2014). However, these multi-label criteria cannot meet our requirement exactly, as none of them considers the ranking among emotions which are considered relevant. Therefore, by incorporating PRO loss (Xu et al., 2013), the loss function for the instance $x_i$ is defined as follows:
$$L(x_i, R_i, \prec, g) = \sum_{e_t \in R_i \cup \{\Theta\}} \sum_{e_s \in \prec(e_t)} \frac{1}{norm_{t,s}} \, l_{t,s} \quad (1)$$
where $e_t$ refers to an emotion belonging to the relevant emotion set $R_i$ or the threshold $\Theta$ of instance $x_i$, while $e_s$ refers to an emotion which is less relevant than $e_t$, denoted as $\prec$. Thus, $(e_t, e_s)$ represents four types of emotion pairs, i.e., (relevant, relevant), (relevant, irrelevant), (relevant, threshold), and (threshold, irrelevant). The normalization term $norm_{t,s}$ is used to balance those four types of emotion pairs so that no type dominates because of its set size. The set sizes of the four different types of emotion pairs mentioned above are $|R_i| \times (|R_i| - 1)/2$, $|R_i| \times |\overline{R}_i|$, $|R_i|$, and $|\overline{R}_i|$, respectively. Here, $l_{t,s}$ refers to a modified 0-1 error. Specifically,
$$l_{t,s} = \begin{cases} 1, & g_t(x_i) < g_s(x_i) \\ 1/2, & g_t(x_i) = g_s(x_i) \\ 0, & \text{otherwise} \end{cases}$$
Note that $l_{t,s}$ is non-convex and difficult to optimize. Thus, a large margin surrogate convex loss (Vapnik and Vapnik, 1998) implemented in hinge form is used instead as follows:
$$L(x_i, R_i, \prec, g) = \sum_{e_t \in R_i \cup \{\Theta\}} \sum_{e_s \in \prec(e_t)} \frac{1}{norm_{t,s}} \left(1 + g_s(x_i) - g_t(x_i)\right)_+ \quad (2)$$
where $(u)_+ = \max\{0, u\}$. However, Eq. 2 ignores the relationships between different emotions. As mentioned in the Introduction section, some emotions often co-occur, such as "joy" and "love", while some rarely co-exist, such as "joy" and "anger".
Such relationship information among emotions can provide important clues for emotion ranking. Therefore, we incorporate this information into the emotion loss function as constraints. The objective function $L(x_i, R_i, \prec, g)$ can be redefined as in Eq. 3, where the weight $\omega_{ts}$ models the relationship between the $t$-th emotion and the $s$-th emotion in the emotion set and can be calculated in multiple ways. Since the Pearson correlation coefficient (Nicewander, 1988) is the most familiar measure of the relationship between two variables, we use it to measure the relationship between two emotions using their original emotion scores across each corpus.
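The Pearson-based emotion relationship weights can be sketched with NumPy as below. This is only an illustration under stated assumptions: the vote matrix is made up, and the function name `emotion_correlations` is hypothetical; how the resulting weights enter the loss is defined by Eq. 3 and is not reproduced here.

```python
import numpy as np

def emotion_correlations(scores):
    """Pearson correlation between the emotion columns of an (n_texts, T) matrix."""
    scores = np.asarray(scores, dtype=float)
    # rowvar=False treats each column (emotion) as one variable.
    return np.corrcoef(scores, rowvar=False)

# Hypothetical votes for 4 texts over the emotions (joy, love, anger).
votes = np.array([[8, 6, 0],
                  [5, 4, 1],
                  [1, 1, 9],
                  [0, 2, 7]])
omega = emotion_correlations(votes)
print(np.round(omega, 2))
```

In this toy example the joy and love columns are positively correlated while joy and anger are negatively correlated. An emotion whose score is constant across the corpus would give undefined (NaN) correlations and would need separate handling.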
From the above, it can be observed that the goal of relevant emotion ranking can be achieved through predicting an accurate relevant emotion set as well as the ranking of relevant emotions.
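To make the four pair types and the hinge-form loss of Eq. 2 concrete, the following sketch enumerates the pairs for one instance and sums the weighted hinge terms. It rests on assumptions that should be read as such: the relevant emotions are given in strict rank order, each pair contributes $(1 + g_s - g_t)_+$, and $norm_{t,s}$ equals the number of pairs of the same type. The emotion relationship weights $\omega_{ts}$ are left out. The helper names are hypothetical.

```python
import numpy as np

def emotion_pairs(ranked_relevant, num_emotions):
    """List (t, s, norm) with e_s less relevant than e_t; index T is the threshold."""
    T = num_emotions
    rel = list(ranked_relevant)                      # most relevant first
    irr = [t for t in range(T) if t not in rel]
    n_rr = len(rel) * (len(rel) - 1) // 2
    pairs = []
    for a in range(len(rel)):
        for b in range(a + 1, len(rel)):
            pairs.append((rel[a], rel[b], n_rr))             # (relevant, relevant)
    for t in rel:
        for s in irr:
            pairs.append((t, s, len(rel) * len(irr)))         # (relevant, irrelevant)
    for t in rel:
        pairs.append((t, T, len(rel)))                        # (relevant, threshold)
    for s in irr:
        pairs.append((T, s, len(irr)))                        # (threshold, irrelevant)
    return pairs

def hinge_emotion_loss(scores, ranked_relevant):
    """scores has length T + 1; the last entry is the threshold score."""
    scores = np.asarray(scores, dtype=float)
    T = len(scores) - 1
    return float(sum(max(0.0, 1.0 + scores[s] - scores[t]) / norm
                     for t, s, norm in emotion_pairs(ranked_relevant, T)))

# Three emotions plus a threshold score of 0; emotions 0 and 1 are relevant, 0 ranked first.
well_separated = hinge_emotion_loss([3.0, 1.5, -2.0, 0.0], ranked_relevant=[0, 1])
poorly_ranked = hinge_emotion_loss([0.5, 1.0, 0.2, 0.0], ranked_relevant=[0, 1])
```

The first score vector respects every pairwise order with a margin of at least 1, so its loss vanishes; the second swaps the two relevant emotions and leaves small margins elsewhere, so several pair types contribute.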

Relevant Emotion Ranking
After defining an appropriate loss function, we need to define a way to minimize the empirical error measured by the appropriate loss function and at the same time to control the complexity of the resulting model. It can be done by introducing a maximum margin strategy and regularization to deal with emotion ranking data, where a set of linear classifiers are optimized to minimize the emotion loss function mentioned before while having a large margin. We could potentially use an approach based on a label ranking method (Elisseeff and Weston, 2001). It is worth mentioning that the margin of the (relevant, relevant) label pair needs to be dealt with carefully, which is not considered in (Elisseeff and Weston, 2001).
The learning procedure of relevant emotion ranking (RER) is illustrated in Figure 2. The big rectangular dashed-line boxes denoted by $x_1$ to $x_n$ represent the $n$ instances in the training set. In each small box, $e_i$, $i \in \{1, \ldots, T\} \cup \{\Theta\}$, represents an emotion of the instance, where the shaded small boxes represent the relevant emotions, the dashed small boxes represent the irrelevant ones, and the last one, $e_\Theta$, is the threshold. Each emotion's corresponding weight vector is $w_i$. We use $m_{t,s}$ to represent the margin between labels $e_t$ and $e_s$. There are four types of emotion-pair margins in total, i.e., (relevant, relevant), (relevant, irrelevant), (relevant, threshold), and (threshold, irrelevant). Different types of emotion-pair margins are denoted using different text/line colors. For each training instance $x_i$, $margin(x_i)$ represents the margin of instance $x_i$, which can be obtained by taking the minimum margin $m_{t,s}$ over all its possible label pairs. Similarly, the margin of the learning system, $margin(learning\ system)$, can be obtained by taking the minimum margin over all the training instances. By maximizing the margin of the learning system, the weight vector of each emotion can be derived, from which the predicted emotion set and the ranking of relevant emotions can be obtained.
The learning system is composed of $T + 1$ linear classifiers $[w_1; \ldots; w_T; w_\Theta]$, one classifier for each emotion label plus one for the threshold, where $w_t$, $t \in \{1, \ldots, T\} \cup \{\Theta\}$, is the weight vector of the $t$-th classifier for emotion $e_t$. For a training instance $x_i$ and its corresponding emotion label set $E_i$, the learning system's margin on instance $x_i$ is defined in Eq. 4 by considering its ranking ability on the four types of emotion pairs of $x_i$, i.e., (relevant, relevant), (relevant, irrelevant), (relevant, threshold), and (threshold, irrelevant). Here, $\langle u, v \rangle$ returns the inner product $u^\top v$. For each emotion pair $(e_t, e_s)$, its discrimination boundary corresponds to the hyperplane $\langle w_t - w_s, x_i \rangle = 0$. Therefore, Eq. 4 returns the minimum value as the margin on instance $x_i$. The margin on the whole training set $G$ is calculated in Eq. 5. If the learning algorithm is capable of properly ranking the four types of label pairs for each training instance, Eq. 5 will return a positive margin. In this ideal case, the final goal is to maximize the margin in Eq. 5, which gives the objective in Eq. 6. Suppose we have sufficient training examples such that for each label pair $(e_t, e_s)$, there exists $x_i \in G$ satisfying $e_t \in R_i \cup \{\Theta\}$ and $e_s \in \prec(e_t)$. Then the objective in Eq. 6 becomes equivalent to $\max_{w_j} \min_{1 \le s < t \le T+1} \frac{1}{\|w_t - w_s\|}$ and can be rewritten as $\min_{w_j} \max_{1 \le s < t \le T+1} \|w_t - w_s\|$.
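The margin computation described above can be illustrated with a short sketch. It assumes that the margin of a pair $(e_t, e_s)$ on an instance is the signed distance $\langle w_t - w_s, x \rangle / \|w_t - w_s\|$ to the pair's hyperplane, in the style of the label ranking method of Elisseeff and Weston (2001); this form, the function names, and the toy weights are assumptions made for illustration.

```python
import numpy as np

def instance_margin(W, x, pairs):
    """Minimum signed distance of x to the hyperplanes <w_t - w_s, x> = 0."""
    margins = []
    for t, s in pairs:
        diff = W[t] - W[s]
        margins.append(float(diff @ x) / float(np.linalg.norm(diff)))
    return min(margins)

def system_margin(W, X, pairs_per_instance):
    """Minimum instance margin over the whole training set."""
    return min(instance_margin(W, x, p) for x, p in zip(X, pairs_per_instance))

# Rows 0 and 1 are two emotions, row 2 is the threshold classifier (toy values).
W = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 0.0]])
X = np.array([[2.0, 0.0], [-1.0, 0.0]])
pairs_per_instance = [
    [(0, 1), (0, 2), (2, 1)],  # emotion 0 relevant, emotion 1 irrelevant
    [(1, 0), (1, 2), (2, 0)],  # emotion 1 relevant, emotion 0 irrelevant
]
m_first = instance_margin(W, X[0], pairs_per_instance[0])
m_system = system_margin(W, X, pairs_per_instance)
```

A positive system margin means every listed pair is ranked in the right order on every instance; the smallest pairwise distance over the training set is the quantity that the maximum margin strategy tries to enlarge.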
Moreover, to overcome the complexity brought in by the max operator, the objective of the optimization problem can be rewritten by approximating the max operator with the sum operator. Thus, the objective of Eq. 6 can be transformed into Eq. 7. To accommodate real-world scenarios where the constraints in Eq. 7 cannot be fully satisfied, slack variables can be incorporated into the objective function, giving Eq. 8. Since $\xi_{its}$ can be easily determined by $w_t$ and $w_s$, it does not need to be optimized, and the final objective function can be reformulated as Eq. 9. As can be seen, Eq. 9 consists of two parts balanced by the trade-off parameter $\lambda$. Specifically, the first part corresponds to the maximum margin of the learning system and can also represent the complexity of the learning system, while the second part corresponds to the emotion loss function of the learning system implemented in hinge form.

Parameter Estimation
Let $w = [w_1; \ldots; w_T; w_\Theta]$. Eq. 9 can be cast into a general SVM-type form (Eq. 10), where $p$ is the total number of label pairs, calculated by $\sum_{i=1}^{n} \sum_{e_t \in R_i \cup \{\Theta\}} \sum_{e_s \in \prec(e_t)} norm_{t,s}$, and $\mathbf{1}_p$ ($\mathbf{0}_p$) is the $p \times 1$ all-one (all-zero) vector. The entries in vector $C$ correspond to the weights of the hinge losses, i.e., the normalization terms that balance the four kinds of label pairs. The matrix $A$ corresponds to the constraints for instances, which reflect the emotion relationships and the margins of the label pairs. $\xi$ does not need to be optimized since it can be easily determined by $w$. Hence the objective function can be reformulated without $\xi$ as Eq. 11. By minimizing the objective function $F(w, G)$, we finally obtain the parameter $w$ and the ranking function $g$. Eq. 11 involves a large-scale optimization. To address Eq. 11, we consider an efficient Alternating Direction Method of Multipliers (ADMM) solution (Bertsekas and Tsitsiklis, 1989). The basic idea of ADMM is to follow a decomposition-coordination procedure such that the solutions of subproblems can be coordinated to find the solution to the original problem. We decompose $G$ into $M$ disjoint subsets, i.e., $\{G_1, G_2, \ldots, G_M\}$, and then Eq. 11 is converted into Eq. 12. The surrogate augmented Lagrangian function (LF) is introduced into Eq. 12, which is cast into Eq. 13, where $\alpha$ and $\beta$ are the Lagrange multipliers. The updating process of Eq. 13 is shown in Algorithm 1.
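As a rough illustration of minimizing a regularized hinge objective of this kind, the sketch below uses plain subgradient descent, which is a simpler stand-in and not the ADMM procedure of Algorithm 1. It assumes the objective has the form $\sum_t \|w_t\|^2 + \lambda \sum_i \sum_{(t,s)} \frac{1}{norm_{t,s}} (1 + \langle w_s - w_t, x_i \rangle)_+$, omits the emotion relationship constraints, and uses hypothetical function names, step sizes and toy data.

```python
import numpy as np

def objective_and_subgradient(W, X, pairs_per_instance, lam):
    """Value and one subgradient of sum_t ||w_t||^2 + lam * weighted hinge losses."""
    value = float(np.sum(W * W))
    grad = 2.0 * W
    for x, pairs in zip(X, pairs_per_instance):
        for t, s, norm in pairs:
            slack = 1.0 + float((W[s] - W[t]) @ x)
            if slack > 0.0:  # only violated or tight-margin pairs contribute
                value += lam * slack / norm
                grad[s] += lam * x / norm
                grad[t] -= lam * x / norm
    return value, grad

def fit(X, pairs_per_instance, num_rows, lam=10.0, steps=500, lr=0.01):
    """Subgradient descent with a diminishing step size lr / sqrt(k + 1)."""
    W = np.zeros((num_rows, X.shape[1]))
    for k in range(steps):
        _, grad = objective_and_subgradient(W, X, pairs_per_instance, lam)
        W -= lr / np.sqrt(k + 1.0) * grad
    return W

# Toy data: two emotions (rows 0, 1) plus the threshold classifier (row 2).
X = np.array([[1.0, 0.0], [0.0, 1.0]])
pairs_per_instance = [
    [(0, 1, 1), (0, 2, 1), (2, 1, 1)],  # instance 0: emotion 0 relevant
    [(1, 0, 1), (1, 2, 1), (2, 0, 1)],  # instance 1: emotion 1 relevant
]
initial_value, _ = objective_and_subgradient(np.zeros((3, 2)), X, pairs_per_instance, 10.0)
W = fit(X, pairs_per_instance, num_rows=3)
final_value, _ = objective_and_subgradient(W, X, pairs_per_instance, 10.0)
scores = X @ W.T  # columns: emotion 0, emotion 1, threshold
```

On this toy problem the learned scores place the relevant emotion above the threshold and the irrelevant emotion below it. For the corpus sizes reported in the paper, a decomposition method such as ADMM is the approach the authors adopt; the loop above is meant only to show what is being minimized.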
Algorithm 1: Parameter updating process.
We evaluate the proposed approach on two real-world corpora, one at the document level and the other at the sentence level:
Sina Social News (News) was collected from the Sina News Society channel, where readers can choose one of six emotions, namely Amusement, Touching, Anger, Sadness, Curiosity, and Shock, after reading a news article. As Sina is one of the largest online news sites in China, it is sensible to carry out experiments on it to explore the readers' emotions (social emotions). News articles with fewer than 20 votes were discarded, since so few votes cannot be considered a proper representation of social emotion. In total, 5,586 news articles published from January 2014 to July 2016 were kept, together with the readers' emotion votes.
Ren-CECps corpus (Blogs) (Quan and Ren, 2010) contains 34,719 sentences selected from blogs in Chinese. Each sentence is annotated with eight basic emotions from the writer's perspective, including anger, anxiety, expect, hate, joy, love, sorrow and surprise, together with their emotion scores, which range from 0 to 1 and indicate the level of emotion intensity. Higher scores represent higher emotion intensity. The statistics of the two corpora are shown in Table 1.

[Table 1: Statistics of the Sina Social News and Ren-CECps corpora.]
The two corpora were preprocessed using word segmentation and filtering. The Python jieba segmenter is used for the segmentation, and stop words are removed based on a stop word thesaurus. Words that appeared only once or in fewer than two documents were removed to alleviate data sparsity. We used a single-layer long short-term memory (LSTM) network (Hochreiter and Schmidhuber, 1997) to extract the features of each text. LSTM is a kind of recurrent neural network which can capture sequence information from text and can represent the meaning of its inputs in a reduced-dimensional space. It treats text as a sequence of word embeddings and outputs a state vector for each word, which contains the information of the previous words. The final state vector can be used as the representation of the text. In our experiments, we set the dimension of each text representation to 100. During LSTM model training, we optimized the hyperparameters using a development dataset built from external data. We train the LSTM using a learning rate of 0.001, a dropout rate of 0.3 and categorical cross-entropy as the loss function. The mini-batch (Cotter et al., 2011) size is set to 32. After that, the learned text representations are fed into the proposed system for relevant emotion ranking, as presented in the Methodology section.
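The vocabulary filtering step can be sketched in plain Python as follows. The sketch assumes the texts have already been segmented into token lists (for example by jieba, which is not called here); the function name `filter_rare_words` and the toy documents are hypothetical.

```python
from collections import Counter

def filter_rare_words(docs, stop_words=frozenset()):
    """Drop stop words, then words seen only once or in fewer than two documents."""
    docs = [[w for w in doc if w not in stop_words] for doc in docs]
    term_freq = Counter(w for doc in docs for w in doc)
    doc_freq = Counter(w for doc in docs for w in set(doc))
    keep = {w for w in term_freq if term_freq[w] > 1 and doc_freq[w] >= 2}
    return [[w for w in doc if w in keep] for doc in docs]

# Toy, already-segmented documents.
docs = [["a", "b", "b", "the"], ["a", "c", "the"], ["d"]]
filtered = filter_rare_words(docs, stop_words={"the"})
print(filtered)
```

Here "b" is dropped because it occurs in a single document even though it appears twice, which is the case the document-frequency condition is meant to catch.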

Comparison with Baselines
There are only a few baselines which address the learning of multiple emotions from text. We first compare the proposed framework with two baselines which have previously achieved state-of-the-art performance on multi-emotion detection.
• Emotion Distribution Learning (EDL) (Zhou et al., 2016): It learns a mapping function from texts to their emotion distributions, describing multiple emotions and their respective intensities, based on label distribution learning. Moreover, the relationships between emotions are captured based on Plutchik's wheel of emotions and are subsequently incorporated into the learning algorithm in order to improve the accuracy of emotion detection.
• EmoDetect (Wang and Pal, 2015): It outputs the emotion distribution based on a dimensionality reduction method using non-negative matrix factorization, which combines several constraints such as emotion bindings, topic correlations and emotion lexicons in a constrained optimization framework.
For each method, 10-fold cross validation is conducted using the same feature construction method to get the final performance. Supposing $n$ test instances and $T$ emotion categories, several evaluation criteria typically used in multi-label learning, as presented in Table 2, can be used to measure the effectiveness of the proposed framework and the baseline approaches. PRO Loss concerns the prediction on all labels as well as the rankings of only relevant labels. Hamming loss evaluates how many times an emotion label is misclassified. Ranking loss evaluates the fraction of reversely ordered emotion pairs. One-error evaluates the fraction of texts whose top-ranked emotion is not in the relevant emotion set. Average precision evaluates the average fraction of the relevant emotions ranked higher than a particular emotion. Coverage evaluates how many steps are needed to move down the ranked emotion list so as to cover all the relevant emotions of the example. Subset Accuracy evaluates the fraction of correctly classified examples, i.e., examples whose predicted label set is identical to the ground-truth label set. $F1_{exam}$ evaluates the averaged F1 (Manning et al., 2008) over instances. MicroF1 pools the per-document decisions across categories, and then computes an effectiveness measure on the pooled contingency table. MacroF1 takes the average of F1 over all categories. Note that the threshold $\Theta$ is removed before evaluation.
Table 2: Evaluation criteria for the Multi-Label Learning (MLL) methods. $TP_t$, $FP_t$, $TN_t$, $FN_t$ represent the number of true positive, false positive, true negative, and false negative test examples with respect to emotion $t$, respectively. $F1(TP_t, FP_t, TN_t, FN_t)$ represents the specific binary classification metric F1 (Manning et al., 2008).
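Several of the example-based criteria described above can be sketched with NumPy. Because the formulas of Table 2 are not reproduced in this text, the code follows the standard multi-label definitions in the style of Zhang and Zhou (2014), which is an assumption; it also assumes every test instance has at least one relevant emotion. PRO Loss and $F1_{exam}$ are not included, and the function name is hypothetical.

```python
import numpy as np

def example_based_metrics(scores, relevant, threshold=0.0):
    """scores: (n, T) real-valued predictions; relevant: (n, T) boolean ground truth."""
    scores = np.asarray(scores, dtype=float)
    relevant = np.asarray(relevant, dtype=bool)
    predicted = scores > threshold
    n = scores.shape[0]
    ranks = (-scores).argsort(axis=1).argsort(axis=1) + 1  # rank 1 = highest score
    one_error, coverage, rank_loss, avg_prec = [], [], [], []
    for i in range(n):
        rel, irr = scores[i, relevant[i]], scores[i, ~relevant[i]]
        one_error.append(not relevant[i, scores[i].argmax()])
        coverage.append(ranks[i, relevant[i]].max() - 1)
        # Fraction of (relevant, irrelevant) pairs that are reversely ordered.
        rank_loss.append(np.mean(rel[:, None] <= irr[None, :]) if irr.size else 0.0)
        r = np.sort(ranks[i, relevant[i]])
        avg_prec.append(np.mean(np.arange(1, r.size + 1) / r))
    return {
        "hamming_loss": float(np.mean(predicted != relevant)),
        "subset_accuracy": float(np.mean(np.all(predicted == relevant, axis=1))),
        "one_error": float(np.mean(one_error)),
        "coverage": float(np.mean(coverage)),
        "ranking_loss": float(np.mean(rank_loss)),
        "average_precision": float(np.mean(avg_prec)),
    }

scores = [[0.9, 0.4, -0.3], [0.5, 0.3, -0.1]]
relevant = [[1, 1, 0], [0, 1, 0]]
metrics = example_based_metrics(scores, relevant)
```

In the toy example the first instance is predicted perfectly, while the second ranks an irrelevant emotion first, which raises one-error and ranking loss and lowers average precision.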
It should be pointed out that the metrics from PRO Loss to $F1_{exam}$ work by evaluating the learning system's performance on each test example separately, and then returning the mean value across the test set. MicroF1 and MacroF1 work by evaluating the learning system's performance on each emotion category separately, and then returning the micro/macro-averaged value across all emotion categories. The evaluation results using 10 different evaluation criteria are shown in Table 3. It can be observed that our proposed method, Relevant Emotion Ranking (RER), outperforms the baseline approaches on all 10 evaluation metrics on both datasets.
Table 3: Comparison with emotion detection baselines. "↓" indicates "the smaller the better", while "↑" indicates "the larger the better". The best performance on each evaluation measure is highlighted in boldface.
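The difference between the two label-based averages can be shown with a small sketch: MicroF1 pools the contingency counts over all emotions before computing F1, whereas MacroF1 computes F1 per emotion and then averages. Assigning an F1 of 0 to an emotion with no positive predictions and no positive ground truth is a convention chosen here, not something stated in the paper; the function name is hypothetical.

```python
import numpy as np

def micro_macro_f1(predicted, relevant):
    """predicted, relevant: (n, T) boolean matrices of predicted / true label sets."""
    predicted = np.asarray(predicted, dtype=bool)
    relevant = np.asarray(relevant, dtype=bool)
    tp = np.sum(predicted & relevant, axis=0).astype(float)
    fp = np.sum(predicted & ~relevant, axis=0).astype(float)
    fn = np.sum(~predicted & relevant, axis=0).astype(float)

    def f1(tp, fp, fn):
        denom = 2 * tp + fp + fn
        # Emotions that never occur and are never predicted get F1 = 0 by convention.
        return np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 0.0)

    macro = float(np.mean(f1(tp, fp, fn)))         # average of per-emotion F1
    micro = float(f1(tp.sum(), fp.sum(), fn.sum()))  # F1 on pooled counts
    return micro, macro

predicted = [[1, 1, 0], [1, 1, 0]]
relevant = [[1, 1, 0], [0, 1, 0]]
micro, macro = micro_macro_f1(predicted, relevant)
```

Because the third emotion never occurs in this toy example, it pulls MacroF1 down while leaving MicroF1 untouched, which is why the two averages can diverge on corpora with rare emotions.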
We have also extended RER by incorporating emotion relationships as constraints into the learning framework, denoted as RERc in Table 3. The correlation of every pair of emotions is calculated based on their respective votes from news articles or scores from blogs. It can be observed from Table 3 that in almost all cases, incorporating the constraints gives better performance. It should be pointed out that the results of the baseline approach EDL are slightly different from those reported in (Zhou et al., 2016), since we employ LSTM for feature construction instead of recursive autoencoders.
Since relevant emotion ranking can be seen as an extension of multi-label learning, the proposed framework is also compared with 8 widely used multi-label learning methods using the threshold $\Theta$, which is initialized as 0.15 after normalization: ML-KNN (Zhang and Zhou, 2007), LIFT (Zhang, 2011), CLR (Fürnkranz et al., 2008), Rank-SVM (Zhang and Zhou, 2014), MLLOC (Huang and Zhou, 2012), BP-MLL (Zhang and Zhou, 2006), ECC (Read et al., 2009) and ML-RBF (Zhang, 2009). ML-KNN is based on the traditional k-nearest neighbor (KNN) algorithm. LIFT constructs features specific to each emotion by conducting clustering analysis on its positive and negative instances. CLR transforms MLL into a label ranking problem by pairwise comparison; it considers every label pair and ranks all the labels without recognizing that only the rankings of relevant ones are crucial. Rank-SVM focuses on distinguishing relevant from irrelevant emotions while neglecting the rankings of relevant ones. MLLOC tries to exploit emotion correlations in the expression data locally; the global discrimination fitting and local correlation sensitivity are incorporated into a unified framework. BP-MLL is derived from the back-propagation algorithm by employing a novel error function capturing the characteristics of multi-label learning. ECC applies classifier chains in an ensemble framework. ML-RBF adapts the traditional Radial Basis Function (RBF) neural network to multi-label learning.
For the MLL methods, the value of k is set to 8 in ML-KNN; in ML-RBF, the ratio is 0.02 and µ is 2. A linear kernel is used in LIFT. Rank-SVM uses the RBF kernel with the width σ equal to 1. C4.5 is used as the base classifier for CLR and ECC. The evaluation results of the proposed approach in comparison to all MLL baselines are presented in Table 4. RERc performs the best on all evaluation measures. This verifies the advantage of RERc, which comes from considering the varying intensities of the emotion labels and disregarding irrelevant emotions when learning the relevant emotion ranking model. We also observe that, in most cases, the performance on the News dataset is better than that on the Blogs dataset. This may be due to the different types of text observed on these two platforms. News articles are more formal, while blogs typically contain informal language and are more colloquial.
Table 4: Comparison with Multi-Label Learning (MLL) methods. "↓" indicates "the smaller the better", while "↑" indicates "the larger the better". The best performance on each evaluation measure is highlighted in boldface.

Result Analysis
To better understand the emotion detection results, we generate the top 10 most frequent words in the test set for each emotion label from the Blogs dataset, as presented in Figure 3. Intuitively, most top words are quite expressive of their associated emotions. For example, the word "happy" conveys the emotion of Joy and the word "tears" conveys Sorrow. Moreover, we also observe that some words are common across multiple emotion categories. For instance, "friend" appears in both Joy and Love. The results suggest that the proposed framework can learn emotions from text precisely.
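A word-frequency analysis of this kind can be sketched as below. How texts are assigned to emotion labels for Figure 3 is not specified here, so the sketch assumes each text is grouped under a single emotion label (for instance its top-ranked predicted emotion); that assumption, the function name, and the toy tokens are illustrative only.

```python
from collections import Counter, defaultdict

def top_words_per_emotion(docs, emotion_labels, k=10):
    """Count tokens per emotion label and return the k most frequent words for each."""
    counts = defaultdict(Counter)
    for tokens, emotion in zip(docs, emotion_labels):
        counts[emotion].update(tokens)
    return {e: [w for w, _ in c.most_common(k)] for e, c in counts.items()}

# Toy, already-segmented texts with one emotion label each.
docs = [["happy", "friend", "happy"], ["tears", "friend"], ["happy", "smile"]]
labels = ["Joy", "Sorrow", "Joy"]
top = top_words_per_emotion(docs, labels, k=2)
```

A word such as "friend" can appear in the lists of several emotions, as in the toy data, which mirrors the shared words observed across categories.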

Conclusions
In this paper, we have proposed a novel framework to detect multiple emotions from text based on relevant emotion ranking. Moreover, the relationships between emotions are incorporated into the learning framework as constraints. Experimental results on both the News and Blogs datasets show that the proposed framework is able to detect multiple emotions and to generate rankings of relevant emotions. It performs remarkably better than the state-of-the-art baselines on multi-emotion detection and also outperforms several different methods used for multi-label learning. In the future, we will explore the possibility of extending the current framework to detect emotions at a more fine-grained level, for example, emotions associated with specific events reported in text.