An Interpretable Neural Network with Topical Information for Relevant Emotion Ranking

Text might express or evoke multiple emotions with varying intensities. As such, it is crucial to predict and rank multiple relevant emotions by their intensities. Moreover, as emotions might be evoked by hidden topics, it is important to unveil and incorporate such topical information to understand how the emotions are evoked. We propose a novel interpretable neural network approach for relevant emotion ranking. Specifically, motivated by transfer learning, the neural network is initialized to make the hidden layer approximate the behavior of topic models. Moreover, a novel error function is defined to optimize the whole neural network for relevant emotion ranking. Experimental results on three real-world corpora show that the proposed approach performs remarkably better than state-of-the-art emotion detection approaches and multi-label learning methods. Moreover, the extracted emotion-associated topic words indeed represent emotion-evoking events and are in line with our common-sense knowledge.


Introduction
With the growth of the social web, people tend to share their opinions, feelings, and attitudes on social platforms such as online news sites and blogs. Emotion detection can enhance the understanding of users' emotional states, which is useful in many downstream applications such as human-computer interaction and personalized recommendation. Therefore, it is crucial to predict emotions from texts accurately (Picard and Picard, 1997).
Research on emotion detection can be roughly categorized into two types: generative model based and discriminative model based. Generative model based approaches (Bao et al., 2012; Rao et al., 2014a) usually build on topic models and assume texts are generated from emotions and hidden topics. While these models can extract emotion-associated topics, they perform less satisfactorily in emotion classification since they are not optimized directly to minimize the misclassification rate. Discriminative model based approaches consider each emotion category as a class label and typically cast emotion detection as a classification problem. Approaches to the prediction of both multiple emotions and their intensities include (Zhou et al., 2018, 2016; Wang and Pal, 2015). Those approaches usually assumed word-level representations and ignored the latent topical information behind words, and therefore failed to effectively distinguish different emotions carried by the same word in different topical contexts.
In this paper, we focus on relevant emotion ranking (RER), which differentiates relevant emotions from irrelevant ones and learns the rankings of the relevant emotions only, ignoring the irrelevant ones. A neural network with a novel loss function is proposed to tackle the RER problem. A topic, representing a real-world event, an abstract entity, or an object, can indicate the subject or context of an emotion, and different topics might contain or invoke different emotions (Stoyanov and Cardie, 2008). Incorporating such latent topics is therefore essential for discovering topic-associated emotions. Motivated by transfer learning, we incorporate hidden topics and the topic distributions generated from a topic model into a neural network for RER. The main contributions of the paper are summarized below:

• A novel Interpretable Neural Network for Relevant Emotion Ranking (INN-RER) is proposed, with a novel error function employed to optimize the whole network for parameter estimation. To the best of our knowledge, it is the first neural network based approach for RER.
• To understand how the emotions are evoked, the neural network is initialized to make its hidden layer approximate the behavior of topic models so that the topical information is unveiled and incorporated.
• Experimental results on three different real-world corpora show that the proposed method can effectively deal with the emotion detection problem and performs better than state-of-the-art emotion detection methods and multi-label learning methods. Moreover, the emotion-associated topic words extracted by INN-RER indeed represent emotion-evoking events.

Related Work
In general, approaches for emotion detection can be divided into two categories: generative model based and discriminative model based. Generative model based approaches are typically built on topic models. For example, the emotion-topic model (Bao et al., 2012) added an extra emotion layer into traditional topic models to capture the generation of both emotions and topics from text at the same time. Other topic model based approaches, such as the affective topic model (Rao et al., 2014a) and the multi-label supervised topic model and sentiment latent topic model (Rao et al., 2014b), also modeled emotions and topics simultaneously. The contextual sentiment topic model (Rao, 2016) assumed each word is drawn from either a background theme, a contextual theme, or a topic, and explicitly distinguished between context-dependent and context-independent topics. In discriminative model based methods, emotion detection is often cast as a classification problem by considering each emotion category as a class label. If only the strongest emotion is chosen as the label for a given text, emotion detection is essentially a single-label classification problem. Lin et al. (2008) studied the classification of news articles into different categories based on readers' emotions with various combinations of feature sets. Strapparava and Mihalcea (2008) proposed several knowledge-based and corpus-based methods for emotion classification. Quan et al. (2015) proposed a logistic regression model with emotion dependency for emotion detection, in which latent variables were introduced to model the latent structure of the input text. Li et al. (2016) combined the bi-term topic model and a conventional neural network to detect a single social emotion from short texts. To predict multiple emotions simultaneously, emotion detection can be solved using multi-label classification. Bhowmick (2009) presented a method for classifying news sentences into multiple emotion categories using an ensemble based multi-label classification technique. Wang and Pal (2015) produced multiple emotions with intensities using non-negative matrix factorization with several novel constraints such as topic correlation and emotion bindings. To predict multiple emotions with different intensities in a single sentence, Zhou et al. (2016) proposed a novel approach based on emotion distribution learning. Following this line of work, a relevant label ranking framework was proposed to predict multiple relevant emotions as well as the rankings of emotions based on their intensities (Zhou et al., 2018).
Our work is partly inspired by (Zhou et al., 2018) for relevant emotion ranking, but with the following differences: (1) our model takes into account latent topics in texts for emotion detection, which was ignored in the model proposed in (Zhou et al., 2018); (2) our model is built upon topic models and neural networks with a novel objective function defined to consider the interplay between topics and emotions, while the model in (Zhou et al., 2018) was developed based on a ranking framework with a linear objective function which was not able to describe complex relations between the input texts and their emotions.

The Proposed Approach
Assume a set of T emotions $L = \{e_1, e_2, \dots, e_T\}$ and a set of n text instances $X = \{x_1, x_2, x_3, \dots, x_n\}$, where each instance $x_i \in \mathbb{R}^d$ is associated with a ranked list of its relevant emotions $R_i \subseteq L$ and a list of irrelevant emotions $\bar{R}_i = L - R_i$. Relevant emotion ranking aims to learn a score function $g(\cdot)$ which assigns a score $g_j(x_i)$ to each emotion $e_j$ ($j \in \{1, \dots, T\}$). In order to differentiate relevant emotions from irrelevant ones, we need to define a threshold $\Theta$, which could be simply set to a fixed value or learned from data (Fürnkranz et al., 2008). Emotions with scores lower than the threshold are considered irrelevant and hence discarded. The identification of relevant emotions and their ranking can be obtained simultaneously according to the scores assigned by the learned ranking function g. As mentioned before, it is unnecessary to consider the rankings of irrelevant emotions since they might introduce errors into the model during the learning process.
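As a concrete illustration, the following minimal sketch shows how relevant emotions and their ranking fall out of the score function and the threshold $\Theta$; the emotion names, scores, and threshold value here are hypothetical placeholders, not outputs of the trained model:

```python
import numpy as np

# Hypothetical emotion labels and scores g_j(x_i) for one instance x_i;
# in the real model these come from the trained network.
emotions = ["joy", "anger", "sadness", "surprise", "fear", "love"]
scores = np.array([0.82, 0.10, 0.55, 0.47, 0.05, 0.31])
theta = 0.30  # threshold separating relevant from irrelevant emotions

# Keep emotions scoring above the threshold, ranked by score (descending).
order = np.argsort(scores)[::-1]
relevant = [emotions[j] for j in order if scores[j] > theta]
print(relevant)  # ['joy', 'sadness', 'surprise', 'love']
```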
We propose an Interpretable Neural Network for Relevant Emotion Ranking (INN-RER) built upon a multi-layer feed-forward neural network. Instead of using the simple sum-of-squares error function, a novel loss function is designed and employed. Accordingly, a new learning algorithm is proposed to minimize the new loss function. Furthermore, motivated by transfer learning, topical information generated from a topic model is transferred into the neural network by making its hidden layer approximate the behavior of topic models. The overall framework of INN-RER is shown in Figure 1. The left part is a typical topic model (Blei et al., 2003), designed for discovering the main topics that pervade a large unstructured collection of documents. A document is allowed to contain a mixture of topics with different weights; as such, a document d can be represented by its topic distribution $\theta_d$. The right part is a three-layer neural network. It has d input units corresponding to the d-dimensional feature vector of each training sample $x_i$, T output units corresponding to all possible emotion labels, and one hidden layer with P hidden units corresponding to the hidden topics. The input layer is fully connected to the hidden layer with weights $V = [v_{qh}]$ ($1 \le q \le d$, $1 \le h \le P$), and the hidden layer is fully connected to the output layer with weights $W = [w_{hj}]$ ($1 \le h \le P$, $1 \le j \le T$). The bias parameters $\alpha_h$ ($1 \le h \le P$) of the hidden units are treated as weights from an extra input unit with a fixed value of 1. Similarly, the bias parameters $\beta_j$ ($1 \le j \le T$) of the output units are treated as weights from an extra hidden unit with a fixed value of 1.
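To make the network structure concrete, here is a minimal numpy sketch of the three-layer forward pass just described. The dimensions d = 2,000, P = 60, and tanh activations follow the settings reported later in the paper; T = 6 and the random weights are illustrative placeholders for the learned parameters:

```python
import numpy as np

d, P, T = 2000, 60, 6  # input features, hidden topics, emotion labels
rng = np.random.default_rng(0)

# Learned parameters (randomly initialized here for illustration).
V = rng.normal(scale=0.01, size=(d, P))   # input -> hidden weights v_qh
alpha = np.zeros(P)                        # hidden biases
W = rng.normal(scale=0.01, size=(P, T))   # hidden -> output weights w_hj
beta = np.zeros(T)                         # output biases

def forward(x):
    """Compute emotion scores g_j(x) for one d-dimensional instance."""
    b = np.tanh(x @ V + alpha)    # hidden activations, one per topic
    g = np.tanh(b @ W + beta)     # one score per emotion label
    return g

x = rng.random(d)  # a dummy term-frequency feature vector
print(forward(x).shape)  # (6,)
```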

The learning process of INN-RER consists of two main steps. First, the first two layers of the network are initialized based on the output of the topic model: the feature transformation in the neural network is conducted by minimizing the Kullback-Leibler (KL) divergence between the topic distribution θ produced by the topic model and the approximated distribution $[\mathrm{Topic}_1, \mathrm{Topic}_2, \dots, \mathrm{Topic}_P]$ learned by the first two layers of the neural network, denoted by the blue dashed boxes in Figure 1. Then, the whole network is learned and fine-tuned based on the novel loss function, denoted by the orange dashed boxes. Each step is described in detail in the following subsections.

INN-RER Initialization
As the hidden neurons and their semantic meanings are usually treated as a black box in conventional neural networks, the topics generated from the topic model are employed to guide the construction of the hidden layer of the proposed neural network. By doing so, semantic topic information is incorporated to enhance both the interpretability and the accuracy of the proposed neural network. For a particular text sample $x_i$ in the training set G, the input layer takes its term-frequency representation $x_i^q$ as the input and feeds it to the hidden layer. Assuming the total number of topics is fixed at P, the hidden layer contains P neurons. The topic mixture $\theta_{x_i}$ generated from the topic model is approximated by the weights connecting the input and hidden layers. Mathematically, the initialization procedure learns a function $f(x_i^q \mid v_{qh}, \alpha_h)$ whose output is as close to $\theta_{x_i}$ as possible, where $x_i^q$, $v_{qh}$, $\alpha_h$ and $\theta_{x_i}$ denote the input, the weights and biases of the first two layers of the network, and the topic distribution of text $x_i$ generated from the topic model, respectively. A softmax function is applied to the output of the hidden layer, i.e., $f(x_i^q \mid v_{qh}, \alpha_h)$, and the Kullback-Leibler divergence (Leahy, 2006) is employed as follows:

$$KL(\theta \,\|\, f) = \sum_{h=1}^{P} \theta_h \log \frac{\theta_h}{f_h} \qquad (1)$$

where θ denotes the topic distribution derived from the topic model and f denotes the (softmax-normalized) output of the hidden layer. The KL divergence is a measure of the difference between two distributions: it is always non-negative and equals zero when the two distributions are identical. As shown in Equation 1, the KL divergence describes the difference between the topic distribution generated from the topic model and the approximate distribution learned in the initialization procedure. Note that the topic distribution for a document generated by the topic model is used as the supervision information for initializing INN-RER. Thus, maximizing the log-likelihood is equivalent to minimizing the KL divergence according to Equation 1. Since f is a softmax output, the gradient of Equation 1 takes the standard form $\partial KL / \partial v_{qh} = (f_h - \theta_h)\, x_i^q$, giving the update rules

$$v_{qh} \leftarrow v_{qh} - \eta_{init}\big((f_h - \theta_h)\, x_i^q + \lambda v_{qh}\big) \qquad (2)$$

$$\alpha_h \leftarrow \alpha_h - \eta_{init}\,(f_h - \theta_h) \qquad (3)$$

According to the gradient descent method, the first two layers can be initialized iteratively by Equations 2 and 3. The initialization procedure for INN-RER is shown in Algorithm 1, where $\eta_{init}$ denotes the learning rate during the initialization procedure and λ is the penalty term. Note that the first two layers should learn from the topic model as much as possible in order to incorporate topic information, so $\eta_{init}$ should be larger than the learning rate used during the training procedure.
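A minimal numpy sketch of one initialization step is given below. It assumes the softmax hidden output and the gradient form reconstructed above; in practice θ would come from the topic model rather than being hand-set:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def init_step(x, theta, V, alpha, eta_init=0.9, lam=0.001):
    """One KL-minimization step aligning the hidden layer with topic mixture theta."""
    f = softmax(x @ V + alpha)       # approximated topic distribution
    grad_z = f - theta               # gradient of KL(theta || f) w.r.t. pre-softmax input
    V -= eta_init * (np.outer(x, grad_z) + lam * V)
    alpha -= eta_init * grad_z
    return np.sum(theta * np.log(theta / np.clip(f, 1e-12, None)))

# Toy usage: d = 5 features, P = 3 topics.
rng = np.random.default_rng(1)
V, alpha = rng.normal(size=(5, 3)) * 0.01, np.zeros(3)
x = rng.random(5)
theta = np.array([0.6, 0.3, 0.1])  # topic mixture, e.g., from LDA
for _ in range(50):
    kl = init_step(x, theta, V, alpha)
print(round(kl, 4))  # KL shrinks toward 0 as the hidden layer matches theta
```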

INN-RER Learning
This step aims to optimize the three-layer neural network to tackle the relevant emotion ranking problem, while at the same time adjusting the network initialized in the previous step. An intuitive way is to define a global error function for the network on the training set. However, some important characteristics of relevant emotion ranking, such as ranking relevant emotions and ignoring irrelevant ones, are not considered in the classical back propagation algorithm (Rumelhart et al., 1988). The error function defined in traditional neural networks, such as the mean-square error, only focuses on individual label discrimination, i.e., whether a predicted label is correct or not. It does not consider the correlations between different labels of a training instance, e.g., that relevant emotions should be ranked higher than irrelevant ones and that the relevant emotions are themselves ranked according to their intensities. Therefore, to fulfil the requirements of relevant emotion ranking, a novel global error function is defined as follows:

$$E = \sum_{i=1}^{n} \sum_{e_t \in L} \sum_{e_s \in \prec(e_t)} \frac{\omega_{ts}}{norm_{t,s}} \exp\big(-(g_t(x_i) - g_s(x_i))\big) \qquad (4)$$

Here, $e_t$ and $e_s$ are two emotion labels where $e_s$ is less relevant than $e_t$, written $e_s \in \prec(e_t)$. The normalization term $norm_{t,s}$ balances the emotion pairs $(e_t, e_s)$ so that no term dominates due to the sizes of the corresponding sets. The term $g_t(x_i) - g_s(x_i)$ measures the difference between the two emotion outputs for a given input text $x_i$; we want this difference to be as large as possible. The negation of the difference is fed to the exponential function so that the i-th error term is severely penalized when the score of $e_t$ is much smaller than that of $e_s$. As the relationships among different emotions provide important clues for emotion detection, we further incorporate them into the loss function through the weights $\omega_{ts}$, where $\omega_{ts}$ is the relationship between emotions $e_t$ and $e_s$ computed by the Pearson correlation coefficient (Rodgers and Nicewander, 1988).
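A minimal sketch of this pairwise exponential loss, following the reconstruction in Equation 4, is shown below; the uniform normalization and the all-ones correlation weights are simplifying placeholder assumptions:

```python
import numpy as np

def rer_loss(scores, ranks, omega=None):
    """Pairwise exponential RER loss for one instance.

    scores: g_j(x_i), one score per emotion.
    ranks:  relevance ranks (1 = most intense relevant emotion, 2 = next, ...;
            0 = irrelevant). Pairs of two irrelevant emotions are ignored.
    omega:  optional emotion-correlation weights (placeholder: all ones).
    """
    T = len(scores)
    omega = np.ones((T, T)) if omega is None else omega
    loss, pairs = 0.0, 0
    for t in range(T):
        for s in range(T):
            # e_s is less relevant than e_t: e_t relevant and e_s either
            # irrelevant or ranked below e_t.
            if ranks[t] > 0 and (ranks[s] == 0 or ranks[t] < ranks[s]):
                loss += omega[t, s] * np.exp(-(scores[t] - scores[s]))
                pairs += 1
    return loss / max(pairs, 1)  # uniform normalization stands in for norm_{t,s}

scores = np.array([0.9, 0.2, 0.6, -0.4])
ranks = np.array([1, 0, 2, 0])  # emotion 0 ranked above emotion 2; 1 and 3 irrelevant
print(round(rer_loss(scores, ranks), 4))
```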
The minimization of the global relevant emotion ranking loss defined in Equation 4 is carried out by gradient descent combined with back propagation (Rumelhart et al., 1988). For a training instance $x_i$ and its label set $L_i$, the actual output of the j-th output neuron is (omitting the superscript i without loss of generality):

$$g_j = g(netg_j + \beta_j)$$

where $\beta_j$ is the bias of the j-th output neuron and $g(\cdot)$ is a "tanh" function. $netg_j$ is the input to the j-th output neuron:

$$netg_j = \sum_{h=1}^{P} w_{hj} b_h$$

where $w_{hj}$ is the weight connecting the h-th hidden neuron and the j-th output neuron, and P is the number of hidden neurons, i.e., topics. $b_h$ is the output of the h-th hidden neuron:

$$b_h = f(netb_h + \alpha_h)$$

where $\alpha_h$ is the bias of the h-th hidden neuron and $f(\cdot)$ is also a "tanh" function. $netb_h$ is the input to the h-th hidden neuron:

$$netb_h = \sum_{q=1}^{d} v_{qh} x_q$$

where $x_q$ is the q-th dimension of instance x and $v_{qh}$ is the weight connecting the q-th input neuron and the h-th hidden neuron. Since the "tanh" function is differentiable, the error of the j-th output neuron can be defined as:

$$d_j = -\frac{\partial E}{\partial g_j}\, g'(netg_j + \beta_j)$$

Similarly, the error of the h-th hidden neuron can be defined as:

$$e_h = \Big(\sum_{j=1}^{T} d_j w_{hj}\Big) f'(netb_h + \alpha_h)$$

In order to reduce the error of INN-RER, we use the gradient descent strategy, under which the weights and biases are updated as follows:

$$w_{hj} \leftarrow w_{hj} + \eta\, d_j b_h, \qquad \beta_j \leftarrow \beta_j + \eta\, d_j$$

$$v_{qh} \leftarrow v_{qh} + \eta\, e_h x_q, \qquad \alpha_h \leftarrow \alpha_h + \eta\, e_h$$

where η is the learning rate. The training process of the neural network is presented in Algorithm 2: for each text $x_i \in G$, forward compute the output of INN-RER's score function g given $x_i$, then backward compute the gradients of the relevant emotion ranking loss with respect to all weights and biases, and update them with learning rate $\eta_{learn}$ and penalty term λ.
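Putting the pieces together, a compact numpy sketch of one training epoch in the spirit of Algorithm 2 might look as follows. Uniform pair normalization and unit correlation weights $\omega_{ts}$ are simplifying assumptions, and the positive-gradient sign convention is used rather than the error terms $d_j$, $e_h$ above (the two are equivalent up to sign):

```python
import numpy as np

def train_epoch(X, rank_lists, V, alpha, W, beta, eta=0.1, lam=0.001):
    """One Algorithm-2-style pass over the training set (illustrative sketch)."""
    for x, ranks in zip(X, rank_lists):
        # Forward pass through the two tanh layers.
        b = np.tanh(x @ V + alpha)          # hidden (topic) activations
        g = np.tanh(b @ W + beta)           # emotion scores

        # dE/dg for the pairwise exponential loss of Equation 4.
        dE_dg = np.zeros_like(g)
        for t in range(len(g)):
            for s in range(len(g)):
                if ranks[t] > 0 and (ranks[s] == 0 or ranks[t] < ranks[s]):
                    e = np.exp(-(g[t] - g[s]))
                    dE_dg[t] -= e           # push relevant score up
                    dE_dg[s] += e           # push less-relevant score down

        # Backward pass and gradient-descent updates with penalty lam.
        d_out = dE_dg * (1 - g ** 2)        # gradient at output pre-activations
        d_hid = (W @ d_out) * (1 - b ** 2)  # gradient at hidden pre-activations
        W -= eta * (np.outer(b, d_out) + lam * W)
        beta -= eta * d_out
        V -= eta * (np.outer(x, d_hid) + lam * V)
        alpha -= eta * d_hid
```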

Experiments
We evaluate the proposed approach on the following three corpora. Sina Social News (News) was collected from the Sina news Society channel, where readers can choose one of six emotions, namely Amusement, Touching, Anger, Sadness, Curiosity, and Shock, after reading a news article. As Sina is one of the largest online news sites in China, it is sensible to carry out experiments on it to explore readers' emotions (social emotions). News articles with fewer than 20 votes were discarded, since so few votes cannot be considered a proper representation of social emotion. In total, 5,586 news articles published from January 2014 to July 2016 were kept, together with the readers' emotion votes. The Ren-CECps corpus (Blogs) (Quan and Ren, 2010) contains 1,487 blogs in Chinese. Each document is annotated with eight basic emotions from the writer's perspective, including anger, anxiety, expect, hate, joy, love, sorrow and surprise, together with emotion scores indicating the level of emotion intensity in the range of [0, 1]; higher scores represent higher emotion intensity. SemEval (Strapparava and Mihalcea, 2007) is an English data set containing 1,250 news headlines extracted from Google News, CNN, and many other portals. The news headlines are typically short. Each headline was manually scored on a fine-grained valence scale of 0 to 100 across 6 emotions (i.e., anger, disgust, fear, joy, sadness and surprise). After pruning 4 items whose total scores equal 0, 1,246 headlines were kept for the experiments.

The statistics of the three corpora are shown in Table 1. The first two corpora were preprocessed using the python jieba segmenter (https://github.com/fxsjy/jieba) for word segmentation and filtering. The third corpus, SemEval, is in English and can be tokenized by white spaces. Stop words and words appearing only once or in fewer than two documents were removed to alleviate data sparsity. Next, TF-IDF (term frequency-inverse document frequency) was used to extract features from the text. TF-IDF is a numerical statistic designed to reflect how important a word is to a document in a corpus. In our experiments, we set the dimension of each text representation to 2,000 according to the ranking of the TF-IDF weights, with each dimension taking the term-frequency (TF) feature value. After that, the text representations are fed into the proposed INN-RER method.
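A sketch of this preprocessing pipeline with scikit-learn is shown below. The vocabulary-size cap and the TF weighting follow the description above, while the toy documents and the English stop-word list are illustrative assumptions (the Chinese corpora would use jieba output and their own stop-word lists):

```python
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# Segmented corpus: one whitespace-joined string per document.
docs = [
    "hero rescues child from river",
    "hero saves child in fire",
    "readers angry about accident",
]

# Rank the vocabulary by TF-IDF weight; min_df=2 drops words occurring in
# fewer than two documents, as in the preprocessing described above.
tfidf = TfidfVectorizer(min_df=2, stop_words="english")
weights = tfidf.fit_transform(docs).sum(axis=0).A1
vocab = tfidf.get_feature_names_out()
top_words = [vocab[i] for i in weights.argsort()[::-1][:2000]]

# Each of the (up to) 2,000 retained dimensions carries a term-frequency value.
tf = CountVectorizer(vocabulary=top_words)
X = tf.fit_transform(docs).toarray()
print(X.shape)
```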
$\eta_{init}$, $\eta_{learn}$, λ, the number of iterations and the number of topics are set to 0.9, 0.1, 0.001, 100 and 60, respectively. The parameters were chosen by 10-fold cross-validation. The topic distributions used in INN-RER are derived in different ways. For long texts such as News and Blogs, Latent Dirichlet Allocation (LDA) (Blei et al., 2003) is employed for generating topic distributions. For the short texts in SemEval, the bi-term topic model (BTM) (Cheng et al., 2014) is used, since a short text typically contains only a few words, which results in sparse word co-occurrence patterns. BTM is a variant of LDA that effectively infers the latent topic distribution of short texts by modeling the generation of bi-terms in the whole corpus, thereby alleviating the sparsity problem at the document level. For each method, 10-fold cross-validation is conducted using the same feature construction method to obtain the final performance.
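For the long-text corpora, deriving the supervision topic distributions might look like the following sketch with scikit-learn's LDA; the 60-topic setting follows the paper, the toy documents are placeholders, and BTM for the short texts is not shown since it is not part of scikit-learn:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "hero rescues child from river",
    "hero saves child in fire",
    "readers angry about accident",
]

counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=60, random_state=0)
theta = lda.fit_transform(counts)            # per-document topic mixtures
theta /= theta.sum(axis=1, keepdims=True)    # ensure rows are distributions
print(theta.shape)  # (n_docs, 60): the supervision signal for initialization
```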
Evaluation metrics typically used in multi-label learning and label ranking are employed, which differ from those of classical single-label learning systems (Sebastiani, 2001). Detailed explanations of the evaluation metrics are presented in Table 2. Note that the metrics from PRO Loss to $F_1^{exam}$ evaluate performance on each test example separately and return the mean value across the test set, while MicroF1 and MacroF1 evaluate performance on each emotion category separately and return the micro-/macro-averaged value across all emotion categories.
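For instance, the label-based and ranking-based metrics can be computed with scikit-learn as follows; the binary relevance indicators and scores below are hypothetical, and ranking-based metrics additionally require the predicted scores:

```python
import numpy as np
from sklearn.metrics import f1_score, label_ranking_loss

# Ground-truth relevant emotions and predictions for 3 test texts, 3 emotions.
y_true = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 1], [1, 0, 0]])
scores = np.array([[0.9, 0.2, 0.4], [0.1, 0.8, 0.7], [0.6, 0.5, 0.3]])

print(f1_score(y_true, y_pred, average="micro"))  # MicroF1
print(f1_score(y_true, y_pred, average="macro"))  # MacroF1
print(label_ranking_loss(y_true, scores))         # ranking loss
```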

Experimental Results
There are several approaches addressing the detection of multiple emotions from texts. Three generative model based baselines and three discriminative model based baselines are chosen, including:

• EDL (Zhou et al., 2016) learns a mapping function from texts to their emotion distributions based on label distribution learning.

• EmoDetect (Wang and Pal, 2015) outputs the emotion distribution based on a dimensionality reduction method using non-negative matrix factorization which combines several constraints such as emotion bindings and topic correlations.

• RER (Zhou et al., 2018) predicts multiple emotions and their rankings from text based on relevant emotion ranking using support vector machines.

We also evaluated INN-RER with random initialization instead of the proposed initialization procedure, denoted as INN-RER(-t).

Table 3: Experimental results of the proposed approach and the baselines. 'PL' represents PRO Loss, 'HL' Hamming Loss, 'RL' ranking loss, 'OE' one error, 'AP' average precision, 'Cov' coverage, 'F1' $F_1^{exam}$, 'MiF1' MicroF1, and 'MaF1' MacroF1. "↓" indicates "the smaller the better", while "↑" indicates "the larger the better". The best performance on each evaluation measure is highlighted in boldface.
Experimental results on the three corpora are summarized in Table 3. It can be observed from the table that: (1) INN-RER outperforms the baselines on almost all evaluation metrics across all the data sets; (2) INN-RER achieves better performance on almost all the evaluation metrics than INN-RER(-t), which further verifies the effectiveness of incorporating the topic information; (3) both INN-RER and INN-RER(-t) perform remarkably better than RER, which is based on linear models. This verifies the effectiveness of using neural networks, which are able to learn dynamic and complex functions, for the RER task.

INN-RER Interpretation
In addition to comparing the performance of the proposed model with several baselines, we also present the experimental results from the perspective of interpretability to fully understand INN-RER. The topic words of each emotion in the three corpora are extracted according to the ranking of the weights learned by INN-RER, i.e., the probabilities of topics conditioned on emotions (the weights between the hidden layer and the output layer). The extracted topic words accord with what has been observed in social psychology (Stoyanov and Cardie, 2008). For example, in the Sina corpus, Topic 1 under the emotion touching is about a "heroic rescue"; Topic 1 under the emotion anger is about the "sexual molestation of a child"; and Topic 2 under the emotion sadness is about a "car accident". In the SemEval and Blog corpora, we can also find that the topic words listed under each emotion category are related to social events. For example, in the SemEval corpus, the Joy topic is about "home entertainment" and the Anger topic is about a "terrorist attack". In the Blog corpus, the sorrow topic is about "earthquakes and the loss of loved ones". The extracted emotion-associated topic words unveil how the corresponding emotion is evoked. By incorporating topical information into neural network learning, we are able to obtain more interpretable results from INN-RER.
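A short sketch of how such emotion-associated topic words can be read off the learned weights is given below; the variable names are illustrative, with `W` standing for the hidden-to-output weights and `topic_words` for the per-topic word rankings obtained from the topic model:

```python
import numpy as np

def top_topic_words_per_emotion(W, topic_words, k_topics=2, k_words=5):
    """Rank topics by their weight toward each emotion and surface their words.

    W:           (P, T) hidden-to-output weights learned by INN-RER.
    topic_words: list of P lists, each holding a topic's words sorted by
                 probability under the topic model.
    """
    P, T = W.shape
    result = {}
    for j in range(T):
        top_topics = np.argsort(W[:, j])[::-1][:k_topics]
        result[j] = [topic_words[h][:k_words] for h in top_topics]
    return result

# Toy usage with 3 topics and 2 emotions.
W = np.array([[0.9, -0.2], [0.1, 0.8], [0.3, 0.4]])
topic_words = [["rescue", "hero"], ["accident", "crash"], ["child", "school"]]
print(top_topic_words_per_emotion(W, topic_words, k_topics=1, k_words=2))
```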

Comparison with Multi-Label Methods
Since relevant emotion ranking can be seen as an extension of multi-label learning, the proposed INN-RER is also compared with three well-established multi-label learning methods: LIFT (Zhang, 2011), Rank-SVM (Zhang and Zhou, 2014) and BP-MLL (Zhang and Zhou, 2006). In our experiments, LIFT used a linear kernel and Rank-SVM used the RBF kernel with the width σ set to 1; the threshold Θ was initialized to 0.15 after normalization. The results of INN-RER in comparison with the multi-label learning baselines are presented in Table 4. It can be observed that INN-RER outperforms all the baselines across all the datasets on most evaluation measures. This further verifies the effectiveness of the proposed INN-RER for multi-label emotion detection, due to its consideration of the rankings of the relevant emotions and its incorporation of topic models.

Conclusion
In this paper, we have proposed a novel interpretable neural network for relevant emotion ranking. Specifically, motivated by transfer learning, the neural network is initialized to make its hidden layer approximate the behavior of a topic model. Moreover, a novel error function is defined to optimize the whole neural network for relevant emotion ranking. Experimental results on three real-world corpora show that the proposed approach performs remarkably better than the state-of-the-art emotion detection approaches and multi-label learning methods. Moreover, the extracted emotion-associated topic words indeed represent emotion-evoking events which are in line with our common-sense knowledge. In the future, we will explore the possibility of learning a topic model and an emotion ranking function simultaneously in a unified framework.