Adversarial Attention Modeling for Multi-dimensional Emotion Regression

In this paper, we propose a neural network-based approach, the Adversarial Attention Network, for the task of multi-dimensional emotion regression, which automatically rates multiple emotion dimension scores for an input text. Specifically, to determine which words are valuable for a particular emotion dimension, an attention layer is trained to weight the words in an input sequence. Furthermore, adversarial training is employed between two attention layers to learn better word weights via a discriminator. In addition, a shared attention layer is incorporated to learn word weights common to the two emotion dimensions. Empirical evaluation on the EMOBANK corpus shows that our approach achieves notable improvements in r-values on both EMOBANK Reader's and Writer's multi-dimensional emotion regression tasks in all domains over the state-of-the-art baselines.


Introduction
Emotion analysis aims to recognize human emotion expressed in a given text (Mishne et al., 2005; Abdul-Mageed and Ungar, 2017). Typically, studies in emotion analysis can be divided into either emotion classification (Yang et al., 2007; Tripathi et al., 2017) or emotion regression (Yu et al., 2015; Wang et al., 2016a). While emotion classification aims to label an input text with one or more emotion categories, emotion regression aims to rate one or more emotion dimension scores of an input text through machine learning models. In this study, we focus on emotion regression.
Compared with the large body of work on emotion classification, studies in emotion regression started late, largely due to the inherent difficulty of the regression task and the lack of large-scale, high-quality emotion regression corpora. Despite its difficulty, emotion regression is more suitable for fine-grained emotion analysis and has gained increasing attention recently due to the availability of several emotion regression corpora in the last few years (Preotiuc-Pietro et al., 2016; Hahn and Buechel, 2017). In principle, these emotion regression corpora apply the widely accepted Valence-Arousal model or Valence-Arousal-Dominance model (Barrett, 2006) to describe emotions in a continuous real-valued space of two or three dimensions. In contrast, while different emotion classification corpora often apply different classification systems, they all describe emotions with a limited number of discrete pre-defined emotion categories.
Figure 1: An example of multi-dimensional emotion regression. The dimensional emotion score ranges from 1.0 to 5.0. In this example, the word very (in blue) suggests only one emotion dimension (i.e., a high Arousal score). The words scared and disaster (in red) each suggest two emotion dimensions. Specifically, scared suggests a low Valence score and a low Dominance score, while disaster suggests a low Valence score and a high Arousal score.
In the literature, most existing studies in emotion regression focus on a single emotion dimension by training multiple independent models for different emotion dimensions (Yu et al., 2015; Wang et al., 2016a). Hence, in this paper, we seek to solve multi-dimensional emotion regression via a joint approach. Recently, the attention mechanism has been widely applied in sentiment and emotion classification (Wang et al., 2016b; Potamianos and Kokkinos, 2017). Likewise, in emotion regression, the attention mechanism is expected to be effective in determining which words are emotional when rating dimensional emotion scores. Figure 1 shows an example of emotion regression. The dimensional emotion scores can be inferred from the colored words in this figure. Although a degree adverb, such as very, only suggests a high Arousal score, an emotional word often suggests more than one dimensional emotion score. This hints at the possibility that the relationship between two emotion dimensions can be leveraged, which is overlooked by existing single-dimensional emotion regression studies.
In this paper, we model multi-dimensional emotion regression as a multi-task learning problem solved through adversarial learning. Recently, studies in multi-task learning via adversarial learning (Liu et al., 2017; Masumura et al., 2018), which conduct adversarial learning (Goodfellow et al., 2014) between multiple tasks to learn task-specific features and achieve better performance on each task, have achieved great success. We apply adversarial learning to model the task not only because of its capability for multi-task learning, but also because of its inherent compatibility with the attention mechanism. In the literature, adversarial learning has difficulty learning latent representations from discrete structures (e.g., sequences of word embeddings). Thus, most existing studies in NLP apply adversarial learning with autoencoder-based models, which map a discrete word sequence into a continuous code space beforehand (Makhzani et al., 2015). In this study, we propose a more straightforward yet effective way to learn better representations via adversarial learning, which directly learns continuous attention weights. This is done via an Adversarial Attention Network (AAN), which leverages the advantages of both adversarial learning and the attention mechanism. AAN conducts adversarial learning between two attention layers to learn two sets of word weight parameters for two emotion dimensions. In this way, better weight information can be learned to represent the importance of words for rating dimensional scores. Specifically, our proposed AAN has two features:
• First, AAN conducts adversarial learning between two attention layers to decide the values of words for rating two emotion dimension scores. In particular, we propose an adversarial training algorithm to learn, in two attention layers, two sets of better word weights which contribute to two emotion dimensions.
• Second, unlike existing single-dimensional emotion regression studies which separately train models for different emotion dimensions, AAN can leverage shared information between emotion dimensions (e.g., the word scared contributes to both Valence and Dominance in the example shown in Figure 1) to better rate different emotion dimension scores, and thus achieve better regression results.
We apply AAN to the task of multi-dimensional emotion regression on a large-scale emotion regression corpus, namely EMOBANK, contributed by Hahn and Buechel (2017). Empirical evaluation on EMOBANK Reader's and Writer's multi-dimensional emotion regression tasks shows that AAN achieves significant improvements in r-values over several strong baselines. Furthermore, it also shows that adversarial training between two attention layers is more effective than simply applying the attention mechanism individually to each emotion dimension, or simply training two regressors jointly for a pair of emotion dimensions.

Emotion Regression
Compared with emotion classification, emotion regression had a late start due to the severe lack of large-scale annotated emotion regression corpora and the inherent difficulty of the regression task. Yu et al. (2015) implemented a lexicon-based weighted graph approach which models the relationship and similarity among emotion word nodes to rate the Valence-Arousal scores of emotion words. Their approach outperformed a simple linear regression approach, the kernel method, and the PageRank algorithm. Preotiuc-Pietro et al. (2016) collected user information from Facebook, and built an English emotion regression corpus containing 2,895 texts. Wang et al. (2016a) proposed a regional CNN-LSTM-based approach to document-level emotion regression. Their approach first divides a whole text into several regions, and then extracts regional features from each region with multiple CNNs. The regional features are then fused, and an LSTM layer is finally applied to rate the Valence-Arousal scores of the whole text. Evaluation on several corpora showed that the regional CNN-LSTM outperformed both the vanilla single-layer CNN and the single-layer LSTM. A Chinese emotion regression corpus containing 2,009 texts has also been constructed from multiple online resources. Buechel and Hahn (2016) investigated mapping the dimensional emotion scores of a text to an emotion category. They first annotated the SemEval07 Task 14 corpus with dimensional scores, and then constructed the mapping from dimensional emotion scores to emotion categories with KNN. On this basis, Hahn and Buechel (2017) built an emotion regression corpus, namely EMOBANK, which contains over 10,000 texts.

Adversarial Learning
Due to the success of the generative adversarial network (GAN) in image generation (Goodfellow et al., 2014), adversarial learning has drawn more and more attention in recent years. To address the instability of GAN training, Arjovsky et al. (2017) proposed the Wasserstein GAN (WGAN). Specifically, WGAN uses the Wasserstein distance between two distributions instead of the JS divergence adopted in GAN. This avoids the training instability that arises because the JS divergence fails to indicate the training progress of the discriminator when there is little overlap between the two distributions.
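The contrast between the two measures can be illustrated with a small numeric sketch. It is not taken from the paper: the helper names (`js_divergence`, `wasserstein_1d`, `point_mass`) and the toy one-dimensional grid are assumptions chosen for illustration. For two distributions with no overlap, the JS divergence stays at log 2 however far apart they are, while the Wasserstein distance keeps growing with the separation and therefore still carries a training signal.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def wasserstein_1d(p, q):
    """Wasserstein-1 distance on a unit-spaced 1-D grid: sum of |CDF_p - CDF_q|."""
    total, cdf_p, cdf_q = 0.0, 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_p += pi
        cdf_q += qi
        total += abs(cdf_p - cdf_q)
    return total

def point_mass(position, size=10):
    """All probability mass on a single grid cell (toy distribution)."""
    return [1.0 if i == position else 0.0 for i in range(size)]

near = (point_mass(0), point_mass(1))   # disjoint supports, one cell apart
far = (point_mass(0), point_mass(8))    # disjoint supports, eight cells apart

print(js_divergence(*near), js_divergence(*far))    # identical: both equal log 2
print(wasserstein_1d(*near), wasserstein_1d(*far))  # grows with the separation
```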
In recent years, NLP researchers have begun to apply adversarial learning to various NLP tasks. Zhao et al. (2017), among others, constructed adversarial networks with CNNs and LSTMs to train text generation models. Wu et al. (2017) proposed two types of adversarial models, which consist of CNNs and RNNs, respectively. They discussed the advantages and disadvantages of the two implementations on two relation extraction datasets. Masumura et al. (2018) proposed an adversarial training approach for multi-task multilingual learning, which jointly conducts task discrimination among languages and language discrimination among tasks. Chen and Cardie (2018) applied adversarial learning to multilingual word representation learning, which maps word embeddings in multiple languages to the same vector space.
In comparison, our study focuses on the task of multi-dimensional emotion regression. To the best of our knowledge, it is the first attempt to apply adversarial learning to emotion regression.

Adversarial Attention Network
In this section, we introduce AAN, which conducts adversarial learning between a pair of emotion dimensions. Taking the Valence dimension and the Arousal dimension as an example, Figure 2 illustrates the framework of the Valence-Arousal AAN. Besides the Valence-Arousal AAN, there are the Valence-Dominance AAN and the Arousal-Dominance AAN. Unless otherwise mentioned, in the rest of this section we only describe the implementation of the Valence-Arousal AAN for convenience.

Attention Modeling
AAN takes a sequence of word vectors $X = [x_1 \; x_2 \; \cdots \; x_i \; \cdots \; x_k]$ of a text containing k words as input, where $x_i$ denotes the word vector of the ith word in the text. The attention layer aims to learn a normalized weight vector $A = [a_1 \; a_2 \; \cdots \; a_i \; \cdots \; a_k]$ from X with a one-layer LSTM to decide the value of each word vector, and finally outputs a weighted sequence:
$$A = \mathrm{softmax}(\mathrm{LSTM}(X))$$
$$X' = \mathrm{Att}(X) = X \cdot \mathrm{diag}(A)$$
where Att denotes an attention layer, diag(A) places the elements of A on the principal diagonal of a diagonal matrix with zero off-diagonal elements (Mulaik, 2009), $X'$ denotes the weighted input sequence, and softmax denotes the Softmax activation function used for normalization. An AAN contains three attention layers, denoted as $\mathrm{Att}_V$, $\mathrm{Att}_A$, and $\mathrm{Att}_S$. $\mathrm{Att}_V$ and $\mathrm{Att}_A$ decide which words are valuable for rating the Valence score and the Arousal score, respectively. $\mathrm{Att}_S$ is a shared attention layer that indicates which words contribute to the rating scores of both emotion dimensions:
$$X'_V = \mathrm{Att}_V(X), \quad X'_A = \mathrm{Att}_A(X), \quad X'_S = \mathrm{Att}_S(X)$$
where $X'_V$, $X'_A$, and $X'_S$ denote the weighted sequences returned by the three attention layers, respectively.
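The weighting step can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the per-word `raw_scores` stand in for the output of the one-layer LSTM (which is not implemented here), and the toy embeddings and the function names `softmax` and `apply_attention` are hypothetical. Multiplying by diag(A) amounts to scaling each word vector by its own weight.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of real-valued scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def apply_attention(word_vectors, scores):
    """Return (A, X') where X' scales each word vector x_i by its weight a_i."""
    weights = softmax(scores)
    weighted = [[a * v for v in x] for a, x in zip(weights, word_vectors)]
    return weights, weighted

# Toy input: k = 3 words with 2-dimensional embeddings (hypothetical values).
X = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
raw_scores = [0.1, 2.0, 0.1]  # stand-in for the LSTM outputs, one per word
A, X_weighted = apply_attention(X, raw_scores)
print(A)           # normalized weights that sum to 1
print(X_weighted)  # each word vector scaled by its weight
```

In a Valence-Arousal AAN, three such layers with separate parameters would each produce their own weight vector over the same input X.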

Feature Extraction
The feature extractor of AAN (denoted as Ext) is trained to extract a feature vector from the weighted sequence returned by an attention layer. In this study, the feature extractor is implemented using a single-layer bidirectional LSTM (BiLSTM):
$$H = \mathrm{BiLSTM}(X')$$
In most previous studies, the hidden state of the last time step $h_k$ from the output sequence H of the BiLSTM layer is chosen as the feature vector. In this study, we further apply mean pooling to capture richer textual information from the weighted sequence:
$$\bar{h} = \frac{1}{k} \sum_{i=1}^{k} h_i$$
After mean pooling, $h_k$ and $\bar{h}$ are concatenated and activated by the tanh function to form the output feature vector Feat:
$$\mathrm{Feat} = \mathrm{Ext}(X') = \tanh(h_k \oplus \bar{h})$$
where ⊕ denotes the concatenation operator. In AAN, the extraction of feature vectors from the three weighted sequences is denoted as follows:
$$\mathrm{Feat}_V = \mathrm{Ext}(X'_V), \quad \mathrm{Feat}_A = \mathrm{Ext}(X'_A), \quad \mathrm{Feat}_S = \mathrm{Ext}(X'_S)$$
where $\mathrm{Feat}_V$ and $\mathrm{Feat}_A$ denote the features for Valence and Arousal, and $\mathrm{Feat}_S$ denotes the shared feature which contributes to both emotion dimensions.
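The pooling and concatenation step described above might look as follows. This sketch assumes the BiLSTM output sequence is already available as a list of hidden-state vectors; the values in `H` and the function name `extract_feature` are hypothetical, and the recurrent layer itself is not implemented.

```python
import math

def extract_feature(hidden_states):
    """Concatenate the last hidden state h_k with the mean-pooled state, then apply tanh."""
    k = len(hidden_states)
    dim = len(hidden_states[0])
    last = hidden_states[-1]
    mean = [sum(h[j] for h in hidden_states) / k for j in range(dim)]
    return [math.tanh(v) for v in last + mean]

# Hypothetical BiLSTM outputs for a 3-word text, hidden size 2.
H = [[0.2, -0.4], [0.6, 0.0], [1.0, 0.4]]
feat = extract_feature(H)
print(feat)  # length 4: tanh of [h_k ; mean(h_1..h_k)]
```

The output has twice the hidden size, since the last state and the pooled state are concatenated rather than summed.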

Dimensional Emotion Regression
The regressor rates an emotion dimension score. The regressor in AAN can be implemented in various ways as long as gradients can be propagated through the network. To highlight the contribution of the proposed adversarial model, in this study we implement the regressor simply as a single-layer fully connected neural network:
$$S = R(\mathrm{Feat}) = \mathrm{relu}(W \cdot \mathrm{Feat} + b)$$
where S denotes the regression score of an emotion dimension, R denotes a regressor, W denotes the parameters of the fully connected layer, b denotes the bias term, and relu stands for the ReLU activation function. Note that the input of a regressor in AAN is the concatenation of the dimensional feature and the shared feature, so the Valence score $S_V$ and the Arousal score $S_A$ are computed as follows:
$$S_V = R_V(\mathrm{Feat}_V \oplus \mathrm{Feat}_S)$$
$$S_A = R_A(\mathrm{Feat}_A \oplus \mathrm{Feat}_S)$$
where $R_V$ and $R_A$ denote the two regressors in AAN.
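As a rough illustration of the regressor, the sketch below applies one fully connected unit with ReLU to the concatenation of a dimensional feature and the shared feature. The feature values, weights, and bias are made-up numbers, and `regress` is a hypothetical helper name rather than anything defined in the paper.

```python
def relu(value):
    """Rectified linear unit."""
    return max(0.0, value)

def regress(dim_feature, shared_feature, weights, bias):
    """Single fully connected unit with ReLU on the concatenated features."""
    features = dim_feature + shared_feature  # list concatenation
    if len(features) != len(weights):
        raise ValueError("weights must match the concatenated feature length")
    return relu(sum(w * f for w, f in zip(weights, features)) + bias)

# Hypothetical features and parameters for the Valence regressor.
feat_v = [0.5, -0.2]   # Valence-specific feature
feat_s = [0.1, 0.3]    # shared feature
w_v, b_v = [1.0, 2.0, -1.0, 0.5], 3.0
s_v = regress(feat_v, feat_s, w_v, b_v)
print(s_v)
```

The Arousal regressor would have the same form with its own parameters, taking the Arousal feature concatenated with the same shared feature.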

Emotion Dimension Discrimination
The discriminator D judges which emotion dimension an input feature vector contributes to. In implementing D, we follow WGAN and apply the Wasserstein distance between the two feature distributions as the loss function of the discriminator, in order to provide a smoother measure of training progress than the KL divergence and the JS divergence. In this study, the discriminator is implemented as a single-layer fully connected neural network that approximately fits the Wasserstein distance:
$$P = D(\mathrm{Feat}) = \tanh(W \cdot \mathrm{Feat} + b)$$
where W denotes the parameters of the fully connected layer, b denotes the bias term, and tanh stands for the Tanh activation function. $P \in (-1, 1)$ stands for the discriminating result. In AAN, the closer the value of P is to 1, the more probably Feat contributes to Valence. The discriminator outputs the results for $\mathrm{Feat}_V$ and $\mathrm{Feat}_A$:
$$P_V = D(\mathrm{Feat}_V), \quad P_A = D(\mathrm{Feat}_A)$$
where $P_V$ and $P_A$ denote the discriminating results of $\mathrm{Feat}_V$ and $\mathrm{Feat}_A$, respectively.
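A minimal sketch of such a discriminator is a single fully connected unit followed by tanh, so that its output always lies strictly between -1 and 1. The parameters and feature vectors below are invented for illustration, and `discriminate` is a hypothetical name.

```python
import math

def discriminate(feature, weights, bias):
    """Single fully connected unit with tanh; the output lies in (-1, 1)."""
    if len(feature) != len(weights):
        raise ValueError("feature and weights must have the same length")
    return math.tanh(sum(w * f for w, f in zip(weights, feature)) + bias)

W_d, b_d = [1.5, -1.0], 0.0                 # hypothetical discriminator parameters
p_v = discriminate([0.9, -0.3], W_d, b_d)   # feature from the Valence branch
p_a = discriminate([-0.4, 0.8], W_d, b_d)   # feature from the Arousal branch
print(p_v, p_a)  # a positive value leans towards Valence, a negative one towards Arousal
```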

Adversarial Training
To adversarially train the model, we first train $\mathrm{Att}_V$, $\mathrm{Att}_A$, $\mathrm{Att}_S$, $R_V$, $R_A$, and Ext by minimizing the following regression losses. In this study, the mean square error is applied as the regression loss:
$$L_V = \frac{1}{n} \sum_{i=1}^{n} (S_{V_i} - T_{V_i})^2, \quad L_A = \frac{1}{n} \sum_{i=1}^{n} (S_{A_i} - T_{A_i})^2$$
where $S_{V_i}$ and $S_{A_i}$ denote the regression scores of the Valence dimension and the Arousal dimension of the ith input sample, respectively, $T_{V_i}$ and $T_{A_i}$ denote the annotated true values of the two emotion dimensions of the ith input sample, and n denotes the total number of input samples. Then, we update the parameters of D by maximizing the Wasserstein distance between the two feature distributions, estimated from the scores of the two feature vectors extracted from each input sample. Note that we clip the parameters of D to a fixed absolute value at each training epoch. This training technique follows Arjovsky et al. (2017) and enforces the Lipschitz continuity that is required for a fully connected layer to approximately fit the Wasserstein distance. Finally, we update the parameters of $\mathrm{Att}_V$ and $\mathrm{Att}_A$ by adversarially fooling D.
Regarding the optimization algorithm, in this study we use different optimizers for different parts of our model. $\mathrm{Att}_V$, $\mathrm{Att}_A$, $\mathrm{Att}_S$, and Ext use Adam as their optimizer, while $R_V$, $R_A$, and D use RMSProp. Parameters in the network are initialized with uniform samples in $[-\sqrt{6/(r + c)}, \sqrt{6/(r + c)}]$, where r and c are the numbers of rows and columns of the weight matrices (Glorot and Bengio, 2010).
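The individual quantities used in this training procedure can be sketched without a deep learning framework. The snippet below is a hedged illustration rather than the authors' implementation: `critic_objective` assumes the standard WGAN critic form (mean score on Valence features minus mean score on Arousal features), which the text does not spell out; the clipping constant 0.01 is the default from Arjovsky et al. (2017), not a value reported here; and all function names are hypothetical.

```python
import math

def mse(scores, targets):
    """Mean square error between predicted scores and annotated values."""
    return sum((s - t) ** 2 for s, t in zip(scores, targets)) / len(scores)

def critic_objective(p_valence, p_arousal):
    """Assumed WGAN-style estimate: mean D output on Valence features minus
    mean D output on Arousal features. D is trained to maximize this value."""
    return sum(p_valence) / len(p_valence) - sum(p_arousal) / len(p_arousal)

def clip_weights(weights, limit=0.01):
    """Clip every discriminator parameter into [-limit, limit] (assumed limit)."""
    return [max(-limit, min(limit, w)) for w in weights]

def glorot_limit(rows, cols):
    """Bound of the uniform initialization range of Glorot and Bengio (2010)."""
    return math.sqrt(6.0 / (rows + cols))

print(mse([3.0, 2.0], [2.0, 4.0]))                    # regression loss on two samples
print(critic_objective([0.8, 0.6], [-0.5, -0.7]))     # larger means D separates better
print(clip_weights([0.5, -0.2, 0.005]))               # parameters after clipping
print(glorot_limit(3, 3))                             # weights drawn from [-limit, limit]
```

Under this assumed objective, the attention layers would be updated in the opposite direction, so that D can no longer tell the two feature distributions apart.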

Experimentation
In this section, we systematically evaluate our proposed AAN by applying it to the EMOBANK Reader's and Writer's multi-dimensional emotion regression tasks and comparing it with several baselines. For thorough evaluation, five-fold cross validation is applied in all experiments.
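A five-fold split can be produced along the following lines. The text does not say how the folds were drawn, so the shuffling, the fixed seed, and the helper name `k_fold_indices` are assumptions made for illustration only.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # assumed: shuffle once, then split
    folds = [indices[i::k] for i in range(k)]     # k roughly equal folds
    for held_out in range(k):
        test = folds[held_out]
        train = [idx for j, fold in enumerate(folds) if j != held_out for idx in fold]
        yield train, test

splits = list(k_fold_indices(n_samples=23, k=5))
for train, test in splits:
    print(len(train), len(test))
```

Each sample is held out exactly once, and the reported r-value would then be aggregated over the five held-out folds.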

Experimental Settings

Dataset
In this study, the EMOBANK corpus (Hahn and Buechel, 2017) is used in our experiments to evaluate the proposed approach. This multi-dimensional emotion regression corpus is available from the contributors' GitHub repository.
EMOBANK contains 10,548 texts annotated with 10,325 Reader's and 10,279 Writer's dimensional emotion scores, ranging from 1.0 to 5.0, in six domains. Table 1 gives the numbers of texts in the different domains of EMOBANK. In this study, we evaluate our approach in all six domains of the EMOBANK corpus.

Evaluation Metrics
We apply the widely used Pearson's correlation coefficient r as the evaluation metric in all experiments, for fair comparison, because the contributors of EMOBANK also use r to evaluate the annotation quality between human annotators.
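For reference, Pearson's r between predicted and annotated scores can be computed as in the sketch below. The function name `pearson_r` and the example scores are illustrative and do not come from the corpus.

```python
import math

def pearson_r(predicted, annotated):
    """Pearson's correlation coefficient between two equal-length score lists."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_a = sum(annotated) / n
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, annotated))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_a = sum((a - mean_a) ** 2 for a in annotated)
    return cov / math.sqrt(var_p * var_a)

# Made-up scores on the 1.0 to 5.0 scale.
predicted = [2.8, 3.1, 3.0, 4.2, 2.2]
annotated = [2.5, 3.0, 3.4, 4.0, 2.0]
print(pearson_r(predicted, annotated))
```

Because r is invariant to shifting and scaling of the predictions, it measures how well a model ranks and spaces texts along a dimension rather than the absolute error of its scores.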

Baselines
In this study, the following baselines for emotion regression are implemented for fair comparison:
• Deep CNN: A CNN-based approach proposed by Bitvai and Cohn (2015). This approach applies multiple parallel CNNs to extract multiple n-gram features from a text, and is considered one of the state-of-the-art baselines for sentiment regression. In our implementation of Deep CNN, three parallel CNNs are applied to extract the unigram, bigram, and trigram features of a text.
• Regional CNN-LSTM: A state-of-the-art emotion regression baseline proposed by Wang et al. (2016a). This approach first divides a whole text into several regions, and then extracts regional features from each region with multiple CNNs.
• Context LSTM-CNN: A state-of-the-art text classification baseline proposed by Song et al. (2018). This approach models the long-range dependencies within the classified sentences with an LSTM, and short-span features with a stacked CNN. We modified this approach by changing its activation function in order to return the dimensional emotion scores.
• Attention Network: A simpler counterpart of AAN. It contains only one attention layer, a feature extractor, and a regressor, for single-dimensional emotion regression.
• Joint Learning: Another simpler counterpart of AAN. It trains two regressors for two emotion dimensions in a joint learning style without any adversarial training technique. That is, this approach has a structure similar to AAN, except for the absence of the discriminator.

Experimental Results
Here, three emotion dimension pairs are evaluated. Table 3 gives the performance of each approach in all six domains. Our proposed AAN performs notably better than the other baselines, including the strong baselines Regional CNN-LSTM and Context LSTM-CNN, in all cases. Furthermore, AAN outperforms its two counterparts (i.e., Attention Network and Joint Learning), which justifies the effectiveness of the proposed adversarial learning approach. However, the overall r-values on EMOBANK are relatively low. This indicates the inherent difficulty of emotion regression on EMOBANK. As a reference, the average oracle r-value between human annotators of EMOBANK is about 0.6 (Hahn and Buechel, 2017).
In the News domain, we find that the performance in Valence is notably higher than that in the other two dimensions. Moreover, the r-values in Arousal of Writer's emotion are lower than those of Reader's emotion. This suggests that a writer avoids writing a news article with too much emotional arousal in order to stay objective. For instance, the text "Scam lures victims with free puppy offer." relates to a negative emotion, which explains why the Valence scores of Reader's emotion and Writer's emotion are both low (<2.50). However, the Arousal score of Reader's emotion reaches 4.00, while the Arousal score of Writer's emotion is a moderate 3.25. This shows that in the News domain, the Arousal score of Writer's emotion tends to be moderate, even when a text arouses a distinct Reader's emotion.
In the Fictions domain, the r-values in Arousal of Reader's emotion and Writer's emotion are much closer to each other than those in the News domain. This indicates that writers in the Fictions domain write texts with more distinct emotional arousal. For instance, the text "She screamed: I haven't socialized with Terra's elite for most of my life!" relates to a negative emotion, and the Arousal scores of both Reader's and Writer's emotion reach 4.20. This shows that in the Fictions domain, a writer's emotional arousal is better represented by the Arousal score, and thus the r-value in Arousal of Writer's emotion is higher than that in the News domain.
Similar to the Fictions domain, the r-values in Arousal of Reader's emotion and Writer's emotion in the Blogs domain are very close. Furthermore, the r-values in Arousal are higher than those in the Fictions domain. This indicates that emotional arousal in the Blogs domain is more distinct than that in the Fictions domain. For instance, the text "lol Wonderful Simply Superb." has extremely high scores in Valence (4.8) and Arousal (4.8) of Reader's emotion, while its Valence and Arousal scores of Writer's emotion are also high (4.4 and 3.8, respectively). This implies that writers in the Blogs domain express their emotion more frankly than those in the Fictions domain, and thus the regressor can better detect the emotion contained in texts in the Blogs domain.
Unlike other domains, in the Essays domain, the r-values in Dominance of both Reader's emotion and Writer's emotion are extremely low. None of the baselines achieves an r-value in Dominance above 0.1. The reason is that most texts in the Essays domain only state facts objectively. For instance, the text "Moore's second hypothesis is that America's foreign policy may contribute to the belief that violence is an appropriate means to solve conflicts a hypothesis which is shared by many sociologists and psychologists." only introduces "Moore's second hypothesis" in an objective tone, and for this kind of text it is hard to decide whether it expresses an active emotion or a passive emotion (i.e., whether the Dominance is high or low).
In the Letters domain, the performance in all dimensions reaches a high r-value compared with that in other domains. Specifically, there is no extremely low r-value (<0.20) in any dimension of either Reader's emotion or Writer's emotion. This suggests that writers in the Letters domain mostly write texts which relate to their own real lives or those of people around them. For instance, the text "They do not have the resources necessary to purchase gifts or food for a holiday meal." conveys a pure emotion of the writer, and such a text can arouse a more distinct emotion in readers.
Although overall performance is lower than in other domains because the Travel Guides domain has the fewest text samples, no approach yields an extremely low r-value in this domain. Compared with the texts in the Essays domain, some texts in the Travel Guides domain say much about the histories and anecdotes of tourist attractions. However, besides the historical stories, there are texts such as "Good for the health is just one of the many magical qualities that are attributed to these beautiful emerald-green or turquoise stones.", which promotes the tourist attraction in order to attract tourists and contains a distinct positive emotion. Thus, compared with the low r-values in Dominance in the Essays domain, the r-values in the Travel Guides domain remain at a reasonable level.

Conclusion
In this paper, we propose an Adversarial Attention Network (AAN) for multi-dimensional emotion regression. AAN takes advantage of both adversarial learning and the attention mechanism by conducting adversarial learning between two attention layers in order to learn better word weight information for a given text. Empirical evaluation on EMOBANK Reader's and Writer's three-dimensional emotion regression tasks shows the superiority of the proposed model, with better performance than several state-of-the-art baselines. This indicates the effectiveness of the proposed adversarial learning approach to multi-dimensional emotion regression.
However, our proposed AAN still has several limitations. In future work, we would like to improve the model structure and the adversarial learning algorithm. Moreover, we would like to seek a stable and controllable way to conduct adversarial learning among more than two objects. Last but not least, we would like to apply our approach to other NLP tasks that involve heterogeneous texts.