Interpretable Relevant Emotion Ranking with Event-Driven Attention

Multiple emotions with different intensities are often evoked by events described in documents. Often, such event information is hidden and needs to be discovered from texts. Unveiling the hidden event information can help to understand how the emotions are evoked and provide explainable results. However, existing studies often ignore the latent event information. In this paper, we propose a novel interpretable relevant emotion ranking model that incorporates event information into a deep learning architecture using event-driven attention. Moreover, corpus-level event embeddings and document-level event distributions are introduced to consider the global events in the corpus and the document-specific events simultaneously. Experimental results on three real-world corpora show that the proposed approach performs remarkably better than the state-of-the-art emotion detection approaches and multi-label approaches. Moreover, interpretable results can be obtained to shed light on the events which trigger certain emotions.


Introduction
The advent and prosperity of social media enable users to share their opinions, feelings and attitudes online. Apart from directly expressing their opinions on social media posts, users can also vote for their emotional states after reading an article online. An example of a news article crawled from the Sina News Society Channel together with its associated emotion votes received from readers is illustrated in Figure 1. Treating these emotion votes as labels for the news article, we can define the emotion detection problem as an emotion ranking problem that ranks emotions based on their intensities. Moreover, some of the emotion labels could be considered irrelevant emotions. For example, the emotion categories 'Moved', 'Funny' and 'Strange' in Figure 1 only received one or two votes. These emotion votes could be noise (e.g., readers accidentally clicked on a wrong emotion button) and hence can be considered irrelevant emotions. We need to separate the relevant emotions from irrelevant ones and only predict the ranking results for the relevant emotion labels. Therefore, the task we need to perform is relevant emotion ranking. Understanding and automatically ranking users' emotional states would be potentially useful for downstream applications such as dialogue systems (Picard and Picard, 1997). Multiple emotion detection from texts has been previously addressed by Zhou et al. (2016), who predicted multiple emotions with different intensities based on emotion distribution learning. A relevant emotion ranking framework was also proposed to predict multiple relevant emotions as well as their rankings based on their intensities. However, existing emotion detection approaches do not model the events in texts, which are crucial for emotion detection. Moreover, most of the existing approaches only produce emotion classification or ranking results, and they do not provide interpretations such as identifying which event triggers a certain emotion.
We argue that emotions may be evoked by latent events in texts. Let us refer back to the example shown in Figure 1 and read the text more carefully. We notice that words such as 'beat', 'child' and 'stick' marked in red are event-related words indicating the event of "child abuse" which may evoke the emotions of "Anger", "Sadness" and "Shock".
The above example shows that it is important to simultaneously consider the latent events in texts for relevant emotion ranking. In this paper we propose an interpretable relevant emotion ranking model with event-driven attention (IRER-EA). We focus on relevant emotion ranking (RER) by discriminating relevant emotions from irrelevant ones and only learning the rankings of the relevant emotions based on their intensities.
Our main contributions are summarized below:
• A novel interpretable relevant emotion ranking model with event-driven attention (IRER-EA) is proposed. The latent event information is incorporated into a deep learning architecture through event-driven attention, which can provide clues of how the emotions are evoked with interpretable results. To the best of our knowledge, it is the first deep event-driven neural approach for RER.
• To consider event information comprehensively, corpus-level event embeddings are incorporated to consider global events in the corpus, and document-level event distributions are incorporated to learn document-specific event-related attention.
• Experimental results on three different real-world corpora show that the proposed method performs better than the state-of-the-art emotion detection methods and multi-label learning methods. Moreover, the event-driven attention enables dynamically highlighting important event-related parts evoking the emotions in texts.

Related Work
In general, emotion detection methods can mainly be categorized into two classes: lexicon-based methods and learning-based methods. Lexicon-based approaches utilize emotion lexicons including emotion words and their emotion labels for detecting emotions from texts. For example, emotion lexicons are used in (Aman and Szpakowicz, 2007) to distinguish emotional and non-emotional sentences. Emotion dictionaries could also be used to predict the readers' emotion of new articles (Rao et al., 2012; Lei et al., 2014). A model with several constraints using non-negative matrix factorization based on an emotion lexicon has also been proposed for multiple emotion detection. However, these approaches often suffer from low recall.
Learning-based approaches can be further categorized into unsupervised and supervised learning methods. Unsupervised learning approaches do not require labeled training data (Blei et al., 2003). Supervised learning methods typically frame emotion detection as a classification problem by training supervised classifiers from texts with emotion categories (Wang and Pal, 2015; Rao, 2016). Lin et al. (2008) studied readers' emotion detection with various kinds of feature sets on news articles. Quan et al. (2015) detected emotions from texts with a logistic regression model introducing intermediate hidden variables to model the latent structure of input text corpora. Zhou et al. (2016) predicted multiple emotions with intensities based on emotion distribution learning. A relevant label ranking framework for emotion detection was proposed to predict multiple relevant emotions as well as the rankings of emotions based on their intensities. However, these approaches do not model the latent events in texts.
In recent years, deep neural network models have been widely used for text classification. In particular, attention-based recurrent neural networks (RNNs) (Schuster and Paliwal, 2002; Yang et al., 2016) prevail in text classification. However, these approaches ignore the latent events in texts and thus fail to attend to event-related parts. Moreover, they lack interpretability.
Our work is partly inspired by previous work on relevant emotion ranking, but with the following significant differences: (1) our model incorporates corpus-level event embeddings and document-level event distributions through an event-driven attention mechanism attending to event-related words, which are ignored in the earlier model that simply uses a Kullback-Leibler (KL) divergence to approximately learn the documents' topic distributions; (2) our model incorporates the event information into a deep learning architecture and can thus consider the sequential information of texts, which is ignored in the earlier model based on shallow bag-of-words representations.

Problem Setting
Assume a set of $T$ emotions, $L = \{l_1, l_2, \ldots, l_T\}$, and a set of $Q$ document instances, $D = \{d_1, d_2, d_3, \ldots, d_Q\}$. Each instance $d_i$ is associated with a list of its relevant emotions $R_i \subseteq L$ ranked by their intensities and also a list of irrelevant emotions $\overline{R}_i = L - R_i$. Relevant emotion ranking aims to learn a score function $g(d_i) = [g_1(d_i), \ldots, g_T(d_i)]$ which assigns a score $g_j(d_i)$ to each emotion $l_j$, $j \in \{1, \ldots, T\}$. Relevant emotions and their rankings can be obtained simultaneously according to the scores assigned by the learned ranking function $g$.
The learning objective of relevant emotion ranking (RER) is both to discriminate relevant emotions from irrelevant ones and to rank relevant emotions according to their intensities. Therefore, to fulfil the requirements of RER, the global objective function is defined over pairs of emotions $(l_t, l_s)$ with $l_s \in \prec(l_t)$, which denotes that emotion $l_s$ is less relevant than emotion $l_t$. The normalization term $norm_{t,s}$ is used to avoid terms being dominated by the sizes of emotion pairs. The term $g_t(d_i) - g_s(d_i)$ measures the difference between the scores of the two emotions. $\omega_{ts}$ represents the relationship between two emotions $l_t$ and $l_s$, which is calculated by the Pearson correlation coefficient (Nicewander, 1988).
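To illustrate the emotion relationship weights, the following minimal sketch computes pairwise Pearson correlation coefficients between emotions from a matrix of per-document emotion intensities. The toy vote matrix and the function name `emotion_correlations` are assumptions made for illustration; the text only states that the relationship is measured by the Pearson correlation coefficient.

```python
import numpy as np

def emotion_correlations(intensities):
    """Pairwise Pearson correlation between emotion columns.

    intensities: array of shape (Q, T), one row per document and one
    column per emotion (for example, vote counts).
    Returns a (T, T) matrix whose entry [t, s] plays the role of omega_ts.
    """
    intensities = np.asarray(intensities, dtype=float)
    # rowvar=False treats each column (emotion) as a variable.
    return np.corrcoef(intensities, rowvar=False)

# Hypothetical votes for 4 documents over 3 emotions (Angry, Sad, Funny).
votes = np.array([
    [10.0, 8.0, 0.0],
    [7.0, 6.0, 1.0],
    [1.0, 2.0, 9.0],
    [0.0, 1.0, 12.0],
])
omega = emotion_correlations(votes)
```

In this toy data, emotions that tend to be voted together (Angry and Sad) obtain a positive coefficient, while emotions that rarely co-occur (Angry and Funny) obtain a negative one.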
We present the overall architecture of the proposed interpretable relevant emotion ranking with event-driven attention (IRER-EA) model in Figure 2. It consists of four layers: (1) the Input Embedding Layer, including both word embeddings and event embeddings; (2) the Encoder Layer, including both the word encoder and the event encoder; (3) the Attention Layer, which computes the word-level attention scores and the event-driven attention scores, taking into account the corpus-level and document-level event information, respectively; (4) the Output Layer, which generates the emotion ranking results.

Input Embedding Layer
The Input Embedding Layer contains word embeddings and event embeddings. Assuming a document $d_i$ consisting of $N$ words represented as $d_i = \{w_1, w_2, \ldots, w_N\}$, pre-trained GloVe word vectors (Pennington et al., 2014) are used to obtain the fixed word embedding of each word, and $d_i$ can be represented as $d_i = \{x_1, x_2, \ldots, x_N\}$ as shown in Figure 2.
Since nouns and verbs are more important than other word types in referring to specific events, they are used as inputs to a topic model such as Latent Dirichlet Allocation (LDA) (Blei et al., 2003) to generate events automatically. The granularity of the extracted events is therefore controlled by the predefined number of events $K$. For the corpus $D$ consisting of $K$ events $\{e_1, e_2, \ldots, e_K\}$, the event embedding of the $k$th event $e_k$ can be obtained from the output event-word distribution matrix $E$ of the topic model. For a single document $d_i$, the event distribution $p = (p_1, p_2, \ldots, p_K)$ obtained from the topic model represents the probability of the text expressing each event.

Encoder Layer
The Encoder Layer contains both the word encoder and the event encoder. For the word encoder, an alternative RNN structure (Zhang et al., 2018) is used to encode texts into semantic representations, since it has been shown to be more effective in encoding longer texts. For document $d_i$, formally, the state at time step $t$ can be denoted by $H^t = \langle h^t_1, \ldots, h^t_N, h^t_q \rangle$, which consists of sub-states $h^t_i$ for the $i$th word $w_i$ in document $d_i$ and a document-level sub-state $h^t_q$, as shown in Figure 3. The hidden states are independent of each other at the present recurrent step and are connected across recurrent steps, which can capture long-range dependencies. The recurrent state transition process is used to model information exchange between those sub-states to enrich state representations incrementally. The state transition is similar to LSTM (Hochreiter and Schmidhuber, 1997): a recurrent cell $c^t_i$ for each word $w_i$ and a cell $c^t_q$ for the document-level sub-state $h_q$ are used. The value of each $h^t_i$ is computed based on values from two adjacent recurrent time steps. Note that the window size between two adjacent steps can be set manually. Hence, the hidden sub-states $h_i$ for individual words $w_i$ and a global document hidden state $h_q$ for $d_i$ are obtained.
For the event encoder, event representations are produced by ReLU-activated perceptrons taking the event-word weight matrix $E \in \mathbb{R}^{V \times K}$ as input, where $V$ is the vocabulary size. Hence, each event representation $s_k$ representing event $k$ is obtained from the event embedding $e_k$, $k \in \{1, 2, 3, \ldots, K\}$.
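The event encoder can be sketched as a single ReLU layer applied to each column of the event-word matrix. This is a minimal illustration under stated assumptions: the layer sizes, the use of exactly one layer, the random toy data and the function name `encode_events` are all hypothetical, since the text only specifies ReLU-activated perceptrons over $E$.

```python
import numpy as np

def encode_events(E, W, b):
    """Map each event's word distribution to an event representation.

    E: (V, K) event-word weight matrix from the topic model; column k is e_k.
    W: (V, H) weights and b: (H,) bias of one ReLU layer (assumed shapes).
    Returns S of shape (K, H); row k is the event representation s_k.
    """
    return np.maximum(0.0, E.T @ W + b)

rng = np.random.default_rng(0)
V, K, H = 50, 4, 8                        # toy vocabulary size, events, hidden size
E = rng.dirichlet(np.ones(V), size=K).T   # (V, K); each column sums to 1
W = rng.normal(scale=0.1, size=(V, H))
b = np.zeros(H)
S = encode_events(E, W, b)
```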

Attention Layer
Given a word $w_n$ in document $d_i$, $h_n$ is the hidden representation of $w_n$ produced by the word encoder. Given an event embedding $e_k$ in the corpus, $s_k$ is the event representation of $e_k$ generated by the event encoder. We then utilize attention weights to enhance the word representations and event representations from different aspects.
Our model contains two kinds of attention mechanisms including the word-level attentions and the event-driven attentions.

Word-Level Attention
For word-level attention, since not all words contribute equally to the meaning of a document, we introduce an attention mechanism to extract the more important words and aggregate the representations of those informative words to form the document representation, as shown in the left part of Figure 2. More concretely, each word $w_i$ receives an attention weight $a^w_i$, computed with parameters $W_w$, $b_w$ and $u_w$ in a way similar to (Pappas and Popescu-Belis, 2017). Note that we further incorporate the global information of the document representation $h_q$ obtained from the encoder to strengthen the word attention.
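A minimal sketch of such a word-level attention is given below. It follows the common form of additive attention with a context vector; the way the document state $h_q$ enters the computation (simple addition to each word state), the tanh activation and all tensor shapes are assumptions for illustration and may differ from the actual model.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def word_attention(h, h_q, W_w, b_w, u_w):
    """h: (N, H) word states, h_q: (H,) document-level state.

    Assumption: h_q is added to every word state before the projection.
    Returns the attention weights a_w (N,) and the representation r_w (H,).
    """
    u = np.tanh((h + h_q) @ W_w + b_w)   # (N, A) projected word states
    a_w = softmax(u @ u_w)               # (N,) attention over words
    r_w = a_w @ h                        # (H,) weighted sum of word states
    return a_w, r_w

rng = np.random.default_rng(0)
N, H, A = 5, 6, 4                        # toy sizes
h = rng.normal(size=(N, H))
h_q = h.mean(axis=0)                     # stand-in for the document state
W_w = rng.normal(size=(H, A))
b_w = np.zeros(A)
u_w = rng.normal(size=A)
a_w, r_w = word_attention(h, h_q, W_w, b_w, u_w)
```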

Event-Driven Attention
In our model, we use the event-driven attention mechanism to attend to event-related words, which can discover the words that are more important for text-related events. The event-driven attention leverages the corpus-level event information based on each event representation $s_k$, $k \in \{1, 2, 3, \ldots, K\}$, obtained from the corpus and the document-level event information based on the document's event distribution $p = (p_1, p_2, \ldots, p_K)$.

Corpus-level Event-Driven Attention
The model utilizes the corpus-level event information through a joint attention mechanism to consider global events in the corpus. It takes the semantic representations $h = (h_1, h_2, \ldots, h_N)$ of an input text, that is, the hidden states of all words in the document, and measures the interaction of the words in the text with the event representations $s = (s_1, s_2, \ldots, s_K)$ by the event-driven attention. The corpus-level event-driven attention is calculated in the following steps, where $W_c$ and $b_c$ are the parameters to be learnt. First, $\phi^c = (\phi^c_1, \phi^c_2, \ldots, \phi^c_N)$ is the hidden representation of $h$ obtained through a fully connected layer. Given the event representation $s_k$, we measure the interaction between the words in the document and the event by an attention weight vector $m^c_k$, which is computed as the inner product of $s_k$ and $\phi^c$ followed by a softmax layer. $a^c = (a^c_1, a^c_2, \ldots, a^c_N)$ is the average of the attention weights over all events, which helps to discover the event keywords of a document according to the different events in the corpus. Finally, we construct the text representation $r^c$ as the sum of hidden states weighted by $a^c$.
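The steps described above can be sketched as follows. This is a hedged illustration rather than the exact formulation: the tanh activation of the fully connected layer, the tensor shapes, the random toy inputs and the function name `corpus_level_attention` are assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def corpus_level_attention(h, S, W_c, b_c):
    """h: (N, H) word states, S: (K, D) event representations.

    Returns the averaged weights a_c (N,), the text representation r_c (H,)
    and the per-event attention matrix m (K, N).
    """
    phi = np.tanh(h @ W_c + b_c)      # (N, D) fully connected layer (tanh assumed)
    m = softmax(S @ phi.T, axis=1)    # (K, N): each event's attention over words
    a_c = m.mean(axis=0)              # (N,) average over all events
    r_c = a_c @ h                     # (H,) weighted sum of word states
    return a_c, r_c, m

rng = np.random.default_rng(0)
N, H, K, D = 5, 6, 3, 4               # toy sizes
h = rng.normal(size=(N, H))
S = rng.normal(size=(K, D))
W_c = rng.normal(size=(H, D))
b_c = np.zeros(D)
a_c, r_c, m = corpus_level_attention(h, S, W_c, b_c)
```

Because each row of `m` is a distribution over words, their average `a_c` is again a distribution over words.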

Document-level Event-driven Attention
We further incorporate a document-level event-driven attention mechanism. Our model can attend to the event distribution of the current document in order to strengthen the effect of the events that the current document expresses and to learn document-specific event-related attention. For each document, $p = (p_1, p_2, \ldots, p_K)$ denotes the event distribution of the document, with each dimension representing the prominence of the corresponding event in the document. The corpus-level event-driven attention weights can be further strengthened by including document-level event distributions. The document-level event-driven attention is calculated in the following steps, where $h = (h_1, h_2, \ldots, h_N)$ stands for the hidden states of all words in the document and $W_d$ and $b_d$ are the parameters to be learnt. First, $\phi^d = (\phi^d_1, \phi^d_2, \ldots, \phi^d_N)$ is the hidden representation of $h$ obtained through a fully connected layer. $m^d_k$ represents the interaction between the words in the document and the event, which is computed as the inner product of $s_k$ and $\phi^d$. Then $m^d_k$ is weighted by the document-level event distribution $p = (p_1, p_2, \ldots, p_K)$, followed by a softmax layer, and $a^e = (a^e_1, a^e_2, \ldots, a^e_N)$ stands for the attention weights after incorporating the document-level event distributions. We then construct the text representation $r^e$ as the sum of hidden states weighted by $a^e$. Finally, $r^e$ is used as the final text representation obtained by the event-driven attention, which simultaneously takes into account both the corpus-level event information and the document-level event information.
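A minimal sketch of the document-level event-driven attention is shown below. It assumes that the event-specific interaction scores are combined as a sum weighted by $p_k$ before the softmax; this combination, the tanh activation, the shapes and the name `document_level_attention` are illustrative assumptions, not a confirmed formulation.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)                 # numerical stability
    e = np.exp(z)
    return e / e.sum()

def document_level_attention(h, S, p, W_d, b_d):
    """h: (N, H) word states, S: (K, D) event representations,
    p: (K,) event distribution of the current document.

    Returns the attention weights a_e (N,) and the representation r_e (H,).
    """
    phi = np.tanh(h @ W_d + b_d)      # (N, D) fully connected layer (tanh assumed)
    m = S @ phi.T                     # (K, N) word-event interactions m_k
    weighted = p @ m                  # (N,) interactions weighted by p_k (assumed sum)
    a_e = softmax(weighted)           # (N,) attention over words
    r_e = a_e @ h                     # (H,) weighted sum of word states
    return a_e, r_e

rng = np.random.default_rng(0)
N, H, K, D = 5, 6, 3, 4               # toy sizes
h = rng.normal(size=(N, H))
S = rng.normal(size=(K, D))
W_d = rng.normal(size=(H, D))
b_d = np.zeros(D)
p = np.array([0.7, 0.2, 0.1])         # hypothetical event distribution
a_e, r_e = document_level_attention(h, S, p, W_d, b_d)
```

Under this assumption, a document whose distribution puts all its mass on one event attends to words exactly as that event alone would dictate.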

Output Layer
Finally, we concatenate the representations calculated by the word-level attention and the event-driven attention to obtain the final representation $r = [r^w, r^e]$, which is fed to a multi-layer perceptron and a softmax layer for identifying relevant emotions and their rankings.
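The output step can be sketched as a concatenation followed by a small perceptron and a softmax, with emotions ranked by their scores. The single hidden layer, its ReLU activation, the layer sizes and the random inputs are assumptions; how the relevant emotions are cut off from the irrelevant ones is not specified in this section, so the sketch only produces the full ranking.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)                 # numerical stability
    e = np.exp(z)
    return e / e.sum()

def output_scores(r_w, r_e, W1, b1, W2, b2):
    """Concatenate both representations and score each emotion."""
    r = np.concatenate([r_w, r_e])              # final representation [r_w, r_e]
    hidden = np.maximum(0.0, r @ W1 + b1)       # one hidden layer with ReLU (assumed)
    return softmax(hidden @ W2 + b2)            # one score per emotion

rng = np.random.default_rng(0)
H, M, T = 6, 8, 6                     # toy state size, hidden units, emotions
r_w = rng.normal(size=H)
r_e = rng.normal(size=H)
W1, b1 = rng.normal(size=(2 * H, M)), np.zeros(M)
W2, b2 = rng.normal(size=(M, T)), np.zeros(T)
scores = output_scores(r_w, r_e, W1, b1, W2, b2)
ranking = np.argsort(-scores)         # emotion indices from highest to lowest score
```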

Experiments
To evaluate our proposed approach, we conducted experiments on the following three corpora:
• Sina Social News (News) consists of 5,586 news articles collected from the Sina News Society channel. Each document was kept together with the readers' emotion votes for six emotions: Funny, Moved, Angry, Sad, Strange, and Shocked.
• Ren-CECps corpus (Blogs) (Quan and Ren, 2010) is a Chinese data set containing 1,487 blogs annotated with eight basic emotions from the writer's perspective: Anger, Anxiety, Expect, Hate, Joy, Love, Sorrow and Surprise. The emotions are represented by their emotion scores in the range of [0, 1]. Higher scores represent higher emotion intensities.
• SemEval (Strapparava and Mihalcea, 2007) contains 1,250 English news headlines extracted from Google News, CNN, and many other portals, which are manually annotated with a fine-grained valence scale of 0 to 100 across six emotions: Anger, Disgust, Fear, Joy, Sad and Surprise.

The statistics for the three corpora used in our experiments are shown in Table 1.
In our experiments, the News and Blogs corpora were preprocessed using the Python jieba segmenter for word segmentation and filtering. The third corpus, SemEval, is in English and was tokenized by white space. Stanford CoreNLP was applied for part-of-speech tagging to obtain the nouns and verbs of the documents. Stop words and words appearing fewer than twice were removed from documents. We used the pre-trained Chinese GloVe and English GloVe vectors as the word embeddings in the experiments, and the dimension of the word embeddings was 300.
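The filtering step can be illustrated with a short sketch on whitespace-tokenized text. The toy documents, the stop-word list and the helper name `filter_tokens` are hypothetical, and counting word frequency over the whole corpus (rather than per document) is an assumption.

```python
from collections import Counter

def filter_tokens(docs, stop_words, min_count=2):
    """Drop stop words and words occurring fewer than min_count times in the corpus."""
    counts = Counter(tok for doc in docs for tok in doc)
    return [
        [tok for tok in doc if tok not in stop_words and counts[tok] >= min_count]
        for doc in docs
    ]

# Toy English documents, tokenized by white space.
docs = [
    "the man beat the child with a stick".split(),
    "the child cried after the man left".split(),
]
filtered = filter_tokens(docs, stop_words={"the", "a", "with", "after"})
```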
The event embeddings and event distributions used in the proposed method are derived in different ways. For long documents, including News and Blogs, LDA was employed to generate event embeddings and event distributions using verbs and nouns as the input. For the short texts in SemEval, which suffer from the sparsity problem, the Biterm Topic Model (BTM) (Cheng et al., 2014) was chosen. The number of topics was 60. The parameters were tuned on a validation set comprising 10% of the training set. The encoder was trained using a learning rate of 0.001, a dropout rate of 0.5, a window size of 1 and a layer number of 3. The number of epochs was 10 and the mini-batch (Cotter et al., 2011) size was 16. For each method, 10-fold cross-validation was conducted to get the final results.
The baselines can be categorized into two classes: emotion detection methods and multi-label methods. Most of these baselines are either reimplemented or cited from published papers. For instance, the multi-label methods are reimplemented, since they were not proposed for relevant emotion ranking. The performances of some emotion detection methods, such as EDL, EmoDetect, RER and INNRER, are cited from the published paper, as they use the same experimental data as ours. Evaluation metrics typically used in multi-label learning and label ranking are employed in our experiments, which are different from those of classical single-label learning systems (Sebastiani, 2001). The detailed explanations of the evaluation metrics are presented in Table 2.

Table 3: Comparison with Emotion Detection Methods and Multi-label Methods. 'PL' represents Pro Loss, 'HL' represents Hamming Loss, 'RL' represents Ranking Loss, 'OE' represents One Error, 'AP' represents Average Precision, 'Cov' represents Coverage, 'F1' represents F1 exam, 'MiF1' represents MicroF1, 'MaF1' represents MacroF1. "↓" indicates "the smaller the better", while "↑" indicates "the larger the better". The best performance on each evaluation measure is highlighted in boldface.
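As an illustration of two of the ranking-oriented metrics, the sketch below implements One Error and Ranking Loss using their usual definitions from the multi-label learning literature. These definitions are assumed to agree with the ones used in the experiments; the toy scores and relevant sets are hypothetical, and every document is assumed to have at least one relevant and one irrelevant emotion.

```python
def one_error(scores, relevant):
    """Fraction of documents whose top-scored emotion is not a relevant one."""
    misses = 0
    for s, rel in zip(scores, relevant):
        top = max(range(len(s)), key=lambda j: s[j])
        misses += top not in rel
    return misses / len(scores)

def ranking_loss(scores, relevant):
    """Average fraction of (relevant, irrelevant) pairs that are ordered wrongly."""
    total = 0.0
    for s, rel in zip(scores, relevant):
        irrelevant = [j for j in range(len(s)) if j not in rel]
        pairs = [(t, u) for t in rel for u in irrelevant]
        wrong = sum(s[t] <= s[u] for t, u in pairs)
        total += wrong / len(pairs)
    return total / len(scores)

# Two toy documents with three emotions each.
scores = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
relevant = [{0, 1}, {0}]
oe = one_error(scores, relevant)
rl = ranking_loss(scores, relevant)
```

For both metrics, smaller values indicate better rankings.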

Compared Methods
There are several emotion detection approaches addressing multiple emotions detection from texts.
• Emotion Distribution Learning (EDL) (Zhou et al., 2016) learns a mapping function from sentences to their emotion distributions.
• EmoDetect (Wang and Pal, 2015) employs a constraint optimization framework with several constraints to obtain multiple emotions.
• RER uses support vector machines to predict relevant emotions and rankings in one text based on their intensities.
We also compare the proposed IRER-EA with several widely-used multi-label learning (MLL) methods.
• LIFT (Zhang, 2011) constructs features specific to each label.
• BP-MLL (Zhang and Zhou, 2006) employs a novel error function in the back-propagation algorithm to capture the characteristics of multi-label learning.
For the MLL methods, a linear kernel is used in LIFT. Rank-SVM uses the RBF kernel with the width $\sigma$ equal to 1.

Table 4: Comparison of different IRER-EA components. "↓" indicates "the smaller the better", while "↑" indicates "the larger the better". The best performance on each evaluation measure is highlighted in boldface.
Experimental results on three corpora are shown in Table 3. It can be summarized from the table that: (1) IRER-EA performs better than state-of-the-art emotion detection baselines on almost all evaluation metrics across the three corpora, which shows the effectiveness of incorporating event information to obtain event-driven attention for relevant emotion ranking; (2) IRER-EA achieves remarkably better results than the MLL methods. This further confirms the effectiveness of IRER-EA, which uses a deep learning architecture incorporating event-driven attention for better performance.

Model Analysis
To further validate the effectiveness of the event-driven attention components, we compare IRER-EA with two sub-networks based on our architecture.
Experimental results on three corpora are shown in Table 4. It can be summarized from the table that: (1) none of the sub-networks can compete with IRER-EA on the three corpora, which indicates that the corpus-level and document-level event information are effective for the relevant emotion ranking task; (2) IRER-EA(-DEA) performs better than IRER-EA(-EA) on most of the evaluation metrics, which verifies the effectiveness of incorporating corpus-level event-driven attention; (3) IRER-EA achieves better results than IRER-EA(-DEA) on almost all the evaluation metrics, which further supports the effectiveness of document-level event-driven attention.

Case Study of Interpretability
To further investigate whether the event-driven attention is able to capture the event-associated words in a given document and provide interpretable results, we compare the word-level attention and the event-driven attention by visualizing the weights of words in the same documents, as shown in Figure 4. As the documents of the News and Blogs corpora are too long, we manually simplified the texts for better visualization and provided English translations of the texts. The words marked in red represent highly weighted ones according to the event-driven attention, while the words with blue underlines are the highly weighted ones according to the word-level attention.

Corpus: News
Text: 古琴台附近一名女子轻生溺亡。从她走进水里到最终溺亡，持续时间将近两分钟。当时有10余人围观，竟无人下水施救。这种现象让我们寒心。 (A woman drowned near Guqin Terrace. It took nearly two minutes from her stepping into the water to her death. More than 10 people were watching her, but no one came to the rescue. Such a phenomenon chills us.)
Emotions: Angry, Sad, Shocked

Corpus: Blogs
Text: 地震过后两百个小时里，每一张前沿的照片，每一段获救的视频，每个爸爸妈妈发自心底的痛哭，都令我心痛。 (During the two hundred hours after the earthquake, each photo and video about the rescue process and the crying of the parents left my heart broken.)
Emotions: Sorrow, Love

Corpus: SemEval
Text: teacher in hide after attack on islam stir threat
Emotions: Fear, Sad

From the visualization results on an example News article, it can be observed that, unlike the word-level attention, which pays more attention to emotion-associated words such as 'chill', which may only evoke the emotion "Sad", the event-driven attention can find words indicating latent events in the document, such as 'drown', 'death' and 'no one rescue', which are all closely related to the event "Suicide without rescue" and may evoke emotions such as "Angry" and "Shocked". In an example Blog article, the word-level attention highlights emotion-associated words such as 'crying' and 'broken', which may evoke the emotion "Sorrow", while the event-driven attention focuses on event-related words such as 'earthquake' and 'rescued', representing the event "Earthquake Relief". Finally, in an example from the SemEval corpus, we can see that the word-level attention mechanism only gives a higher attention weight to the word 'threat' and ignores the word 'attack', which is also an important indicator of the emotions "Fear" and "Sad". In contrast, the event-driven attention mechanism highlights both 'attack' and 'threat', representing the event "Terrorist attack".
In summary, we can observe from Figure 4 that: (1) event-driven attention can capture words representing latent events in texts; (2) compared with the word-level attention, which is prone to attend to emotion-associated keywords, event-driven attention can find words representing one or more hidden events in a document, which can provide more explainable clues about which event triggers certain emotions; (3) event-driven attention can achieve better performance, especially in documents without any emotion-associated words.

Conclusion
In this paper, we have proposed an interpretable relevant emotion ranking model with event-driven attention. The event information is incorporated into a neural model through event-driven attention, which can provide clues of how the emotions are evoked with explainable results. Moreover, corpus-level event embeddings and document-level event distributions are incorporated to consider event information comprehensively. Experimental results show that the proposed method performs better than the state-of-the-art emotion detection methods and multi-label learning methods.