Improving Multi-label Emotion Classification via Sentiment Classification with Dual Attention Transfer Network

In this paper, we aim to improve the performance of multi-label emotion classification with the help of sentiment classification. Specifically, we propose a new transfer learning architecture that divides the sentence representation into two different feature spaces, which are expected to capture the general sentiment words and the other important emotion-specific words, respectively, via a dual attention mechanism. Experimental results on two benchmark datasets demonstrate the effectiveness of our proposed method.


Introduction
In recent years, the number of user-generated comments on social media platforms has grown exponentially. In particular, social platforms such as Twitter allow users to easily share their personal opinions, attitudes and emotions about any topic through short posts. Understanding people's emotions expressed in these short posts can facilitate many important downstream applications such as emotional chatbots (Zhou et al., 2018b), personalized recommendations, stock market prediction, policy studies, etc. Therefore, it is crucial to develop effective emotion detection models to automatically identify emotions from these online posts.
In the literature, emotion detection is typically modeled as a supervised multi-label classification problem, because each sentence may contain one or more emotions from a standard emotion set containing anger, anticipation, disgust, fear, joy, love, optimism, pessimism, sadness, surprise and trust. Table 1 shows three example sentences along with their emotion labels. Traditional approaches to emotion detection include lexicon-based methods (Wang and Pal, 2015), graphical model-based methods (Li et al., 2015b) and linear classifier-based methods (Quan et al., 2015; Li et al., 2015a). Given the recent success of deep learning models, various neural network models and advanced attention mechanisms have been proposed for this task and have achieved highly competitive results on several benchmark datasets (Wang et al., 2016; Abdul-Mageed and Ungar, 2017; Felbo et al., 2017; Baziotis et al., 2018; He and Xia, 2018; Kim et al., 2018). However, these deep models must overcome a heavy reliance on large amounts of annotated data in order to learn a robust feature representation for multi-label emotion classification. In reality, large-scale datasets are usually not readily available and are costly to obtain, partly due to the ambiguity of many informal expressions in user-generated comments. Conversely, it is easier to find datasets (especially in English) associated with another closely related task: sentiment classification, which aims to classify the sentiment polarity of a given piece of text (i.e., positive, negative and neutral). We expect that these resources may allow us to improve sentiment-sensitive representations and thus more accurately identify emotions in social media posts. To achieve these goals, we propose an effective transfer learning (TL) approach in this paper.
Most existing TL methods either 1) assume that both the source and the target tasks share the same sentence representation (Mou et al., 2016) or 2) divide the representation of each sentence into a shared feature space and two task-specific feature spaces (Liu et al., 2017; Yu et al., 2018), as demonstrated by Fig 1.a and Fig 1.b. However, when applying these TL approaches to our scenario, the former approach may lead the learnt sentence representation to pay more attention to general sentiment words such as "good" but less attention to other sentiment-ambiguous words like "shock" that are also integral to emotion classification. The latter approach can capture both the sentiment and the emotion-specific words. However, some sentiment words only occur in the source sentiment classification task. These words tend to receive more attention in the source-specific feature space but less attention in the shared feature space, so they will be ignored in our emotion classification task. Intuitively, any sentiment word also indicates emotion and should not be ignored by our emotion classification task.
Therefore, we propose a shared-private (SP) model as shown in Fig 1.c, where we employ a shared LSTM layer to extract shared sentiment features for both sentiment and emotion classification tasks, and a target-specific LSTM layer to extract specific emotion features that are only sensitive to our emotion classification task. However, as pointed out by Liu et al. (2017) and Yu et al. (2018), there is no guarantee that such a simple model can differentiate the two feature spaces well enough to extract shared and target-specific features as we expect. Take the sentence T1 in Table 1 as an example. Both the shared and task-specific layers could assign higher attention weights to "good" and "goodness" due to their high frequencies in the training data, but lower attention weights to "fearless" due to its rare occurrences. In this case, the SP model predicts only the joy emotion and misses the optimism emotion. Hence, to enforce the orthogonality of the two feature spaces, we further introduce a dual attention mechanism, which feeds the attention weights in one feature space as extra inputs to compute those in the other feature space, and explicitly minimizes the similarity between the two sets of attention weights. Experimental results show that our dual attention transfer architecture brings consistent performance gains in comparison with several existing transfer learning approaches, achieving state-of-the-art performance on two benchmark datasets.

Base Model for Emotion Classification
Given an input sentence, the goal of emotion analysis is to identify one or multiple emotions contained in it. Formally, let $x = (w_1, w_2, \ldots, w_n)$ be the input sentence with $n$ words, where $\mathbf{w}_j$ is the $d$-dimensional word vector for word $w_j$ in the vocabulary $V$, retrieved from a lookup table $\mathbf{E} \in \mathbb{R}^{d \times |V|}$. Moreover, let $\mathcal{E}$ be a set of $K$ pre-defined emotion labels. Accordingly, for each $x$, our task is to predict whether it contains one or more emotions in $\mathcal{E}$. We denote the output as $e \in \{0, 1\}^K$, where $e_k \in \{0, 1\}$ denotes whether or not $x$ contains the $k$-th emotion. We further assume that we have a set of labeled sentences, denoted by $D_e$.

Sentence Representation: We use the standard bi-directional Long Short-Term Memory (Bi-LSTM) network to sequentially process each word in the input:

$$\overrightarrow{h}_j = \mathrm{LSTM}(\overrightarrow{h}_{j-1}, \mathbf{w}_j; \Theta_f), \qquad \overleftarrow{h}_j = \mathrm{LSTM}(\overleftarrow{h}_{j+1}, \mathbf{w}_j; \Theta_b),$$

where $\Theta_f$ and $\Theta_b$ denote all the parameters in the forward and backward LSTMs. Then, for each word $w_j$, its hidden state $h_j \in \mathbb{R}^d$ is generated by concatenating the two directions, $h_j = [\overrightarrow{h}_j ; \overleftarrow{h}_j]$.

For emotion classification, since emotion words are relatively more important for final predictions, we adopt the widely used attention mechanism (Bahdanau et al., 2014) to select the key words for sentence representation. Specifically, we first take the final hidden state $h_n$ as a sentence summary vector $z$, and then obtain the attention weight $\alpha_j$ for each hidden state $h_j$ as follows:

$$u_j = v^{\top} \tanh(W_h h_j + W_z z), \tag{1}$$

$$\alpha_j = \frac{\exp(u_j)}{\sum_{l=1}^{n} \exp(u_l)}, \tag{2}$$

where $W_h, W_z \in \mathbb{R}^{a \times d}$ and $v \in \mathbb{R}^{a}$ are learnable parameters. The final sentence representation $H$ is computed as:

$$H = \sum_{j=1}^{n} \alpha_j h_j.$$

Output Layer: We first apply a Multilayer Perceptron (MLP) with one hidden layer on top of $H$ and then normalize its output to obtain the probability distribution over all of the emotion labels. We then minimize the KL divergence between the predicted probability distribution and the normalized ground-truth distribution as our objective function. At test time, we select a threshold $\gamma$ on the development set, so that every emotion whose score is higher than $\gamma$ is predicted as 1.
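The attention pooling, KL objective and thresholded prediction described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the direction of the KL divergence (ground truth relative to prediction) and the assumption that every training sentence has at least one gold emotion are illustrative choices.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_pool(h, z, W_h, W_z, v):
    """Additive attention over hidden states.

    h: (n, d) hidden states, z: (d,) summary vector,
    W_h, W_z: (a, d), v: (a,). Returns (alpha, H).
    """
    scores = np.tanh(h @ W_h.T + W_z @ z) @ v  # (n,)
    alpha = softmax(scores)                    # attention weights
    return alpha, alpha @ h                    # weighted sum, shape (d,)

def kl_loss(e_true, p_pred, eps=1e-12):
    """KL(q || p), where q is the gold label vector normalized to sum to 1.

    Assumption: e_true contains at least one positive label.
    """
    q = e_true / e_true.sum()
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / (p_pred[mask] + eps))))

def predict(p_pred, gamma):
    """Mark every emotion whose probability exceeds the threshold gamma."""
    return (p_pred > gamma).astype(int)
```

Because the predicted distribution sums to one across labels, a fairly low threshold is needed for several labels to fire at once, which fits the small values of γ reported in the parameter settings.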

Transfer Learning Architecture
Due to the limited amount of annotated data for multi-label emotion classification, we resort to sentiment classification and consider a transfer learning scenario. Let $D_s$ be another set of labeled sentences for sentiment classification, where $y^{(m)}$ is the ground-truth label indicating whether the $m$-th sentence is positive, negative or neutral.

Shared-Private (SP) Model
Intuitively, sentiment classification is a coarse-grained emotion analysis task, and can be fully leveraged to learn a more robust sentiment-sensitive representation. Therefore, we first use a shared attention-based Bi-LSTM layer to transform the input sentences in both tasks into a shared hidden representation $H_c$, and also employ another task-specific Bi-LSTM layer to get the target-specific hidden representation $H_t$. Next, we map the hidden representations to the sentiment label $y$ and the emotion label $e$. The sentiment distribution is obtained from the shared representation as $p(y \mid H_c) = \mathrm{softmax}(W_s^{\top} H_c + b_s)$, where $W_s \in \mathbb{R}^{d \times 3}$ and $b_s \in \mathbb{R}^{3}$ are the parameters for the source sentiment classification task, while the emotion distribution is computed from $H_c$ and $H_t$ together.
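A minimal sketch of the two output heads of such a shared-private model follows. The text only fixes the shapes of the sentiment parameters; combining the shared and target-specific representations by concatenation, the ReLU hidden layer, and the names `W_1`, `b_1`, `W_2`, `b_2` are assumptions made purely for illustration.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def sp_heads(H_c, H_t, W_s, b_s, W_1, b_1, W_2, b_2):
    """Map shared (H_c) and target-specific (H_t) vectors to two distributions.

    H_c, H_t: (d,); W_s: (d, 3); b_s: (3,).
    W_1: (m, 2d), b_1: (m,), W_2: (K, m), b_2: (K,) are hypothetical MLP weights.
    """
    # Sentiment head: uses only the shared representation.
    p_sentiment = softmax(W_s.T @ H_c + b_s)
    # Emotion head: assumed to see both feature spaces via concatenation.
    features = np.concatenate([H_c, H_t])
    hidden = np.maximum(0.0, W_1 @ features + b_1)  # assumed ReLU hidden layer
    p_emotion = softmax(W_2 @ hidden + b_2)
    return p_sentiment, p_emotion
```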

Proposed Dual Attention Transfer Network (DATN)
As introduced above, the shared and target-specific feature spaces in the SP model are expected to capture the general sentiment words and the task-specific emotion words, respectively. However, without any constraints, the two feature spaces may both tend to pay more attention to frequently occurring and important sentiment words like "great" and "happy", but less to rarely occurring but crucial emotion words like "anxiety" and "panic". Therefore, to encourage the two feature spaces to focus on sentiment words and emotion-specific words respectively, we propose using the attention weights computed from the shared layer as extra inputs to compute the attention weights of the target-specific layer. Specifically, as shown in Fig. 2, we first use Eq.1 and Eq.2 to compute the attention weights $\alpha_s$ in the shared layer, and then compute the attention weights $\alpha_t$ in the target-specific layer with $\alpha_s$ as an additional input.
In addition, we introduce a similarity loss that explicitly enforces the difference between the two sets of attention weights by minimizing the cosine similarity between $\alpha_s$ and $\alpha_t$. Finally, our combined objective function adds this similarity loss to the classification objective, where $\lambda$ is a hyperparameter used to control the effect of the similarity loss.
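To make the dual attention idea concrete, here is a small NumPy sketch under stated assumptions. How the shared weights enter the target-specific scoring function is not recoverable from the text, so injecting them through an extra learned vector `w_alpha` inside the tanh is a hypothetical choice, as is adding the similarity term to a single scalar `task_loss`.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def additive_scores(h, z, W_h, W_z, v, extra=None, w_extra=None):
    """Additive attention scores; optionally conditioned on another layer's weights."""
    pre = h @ W_h.T + W_z @ z                  # (n, a)
    if extra is not None:
        pre = pre + np.outer(extra, w_extra)   # hypothetical injection of alpha_s
    return np.tanh(pre) @ v                    # (n,)

def cosine_similarity(a, b, eps=1e-12):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def dual_attention(h_s, z_s, h_t, z_t, params_s, params_t, w_alpha):
    """Compute shared weights first, then target weights conditioned on them."""
    alpha_s = softmax(additive_scores(h_s, z_s, *params_s))
    alpha_t = softmax(additive_scores(h_t, z_t, *params_t,
                                      extra=alpha_s, w_extra=w_alpha))
    return alpha_s, alpha_t

def combined_loss(task_loss, alpha_s, alpha_t, lam=0.05):
    """Classification loss plus lambda times the attention similarity penalty."""
    return task_loss + lam * cosine_similarity(alpha_s, alpha_t)
```

Since both weight vectors are non-negative, their cosine similarity lies in [0, 1], and the penalty vanishes only when the two layers attend to disjoint sets of words.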

Model Details
During the training stage, we adopted the widely used alternating optimization strategy, which iteratively samples one mini-batch from $D_s$ to update only the parameters in the left part of our model, and then samples another mini-batch from $D_e$ to update all the parameters in our model. It is also worth noting that in Fig. 2, we first obtain the shared attention weights $\alpha_s$ and feed them as extra inputs to compute $\alpha_t$. In fact, to differentiate the attention weights in the two feature spaces, we can also first compute $\alpha_t$ and then compute $\alpha_s$ based on $\alpha_t$. We refer to these two variants of our model as DATN-1 and DATN-2, respectively.
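The alternating optimization strategy can be sketched as a simple schedule generator. This only illustrates the ordering of updates: the parameter-group labels `"left_part"` and `"all"` are placeholder names, and cycling through the batch lists stands in for random mini-batch sampling.

```python
from itertools import cycle

def alternating_schedule(source_batches, target_batches, steps):
    """Yield (task, batch, parameter_group) triples in alternating order.

    Each step first emits a sentiment batch that should update only the
    left part of the model, then an emotion batch that updates everything.
    Both batch lists are assumed to be non-empty.
    """
    src, tgt = cycle(source_batches), cycle(target_batches)
    for _ in range(steps):
        yield "sentiment", next(src), "left_part"
        yield "emotion", next(tgt), "all"
```

A training loop would consume these triples and apply the gradient step only to the named parameter group.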

Experiment Settings
Datasets: We conduct experiments on both English and Chinese. For English (E1), we employ a widely used Twitter dataset from SemEval 2016 Task 4A (Nakov et al., 2016) as our source sentiment classification task. For our target emotion classification task, we use the Twitter dataset recently released by SemEval 2018 Task 1C (Mohammad et al., 2018), which contains 11 emotions as shown in the top of Fig. 2. To tokenize the tweets in our dataset, we follow Owoputi et al. (2013) and adopt most of their preprocessing rules, except that we split each hashtag into '#' and its subsequent word. For Chinese (E2), we use the well-known Chinese blog dataset Ren-CECps (Quan and Ren, 2010), which contains 1487 documents, with each sentence labeled by a sentiment label and 8 emotion labels: anger, expectation, anxiety, joy, love, hate, sorrow and surprise. Given the difficulty of finding a large-scale sentiment classification dataset specific to Chinese blogs, we simply divided the original dataset to form our source and target tasks. The basic statistics of our two datasets are summarized in Table 2.
Parameter Settings: The word embedding size $d$ is set to 300 for E1 and 200 for E2, and the lookup table $\mathbf{E}$ is initialized with pre-trained GloVe word embeddings. The hidden dimension and the number of LSTM layers are set to 200 and 1, respectively, for both datasets. During training, Adam (Kingma and Ba, 2014) is used for optimization, with the initial learning rate set to 0.001. The dropout rate is set to 0.5. After tuning, $\lambda$ is set to 0.05 for both datasets, and $\gamma$ is set to 0.12 for E1 and 0.2 for E2. All the models are implemented with TensorFlow.
Evaluation Metrics: We take the official code from SemEval-18 Task 1C and use accuracy and Macro F1 score as the main metrics. For E2, we follow Zhou et al. (2018a) in using average precision (AP) and one error (OE) as secondary metrics.
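For readers unfamiliar with these multi-label metrics, the sketch below gives simplified NumPy versions of three of them. It is not the official SemEval scoring code: multi-label accuracy is taken to be the sample-averaged Jaccard index, and the handling of edge cases (empty label sets, labels with no positives) is an assumption.

```python
import numpy as np

def jaccard_accuracy(Y_true, Y_pred):
    """Sample-averaged |gold AND pred| / |gold OR pred| for 0/1 matrices (N, K)."""
    inter = np.logical_and(Y_true, Y_pred).sum(axis=1)
    union = np.logical_or(Y_true, Y_pred).sum(axis=1)
    # Assumption: a sample with empty gold and predicted sets scores 1.
    scores = np.where(union == 0, 1.0, inter / np.maximum(union, 1))
    return float(scores.mean())

def macro_f1(Y_true, Y_pred):
    """Unweighted mean of per-label F1; labels with no support score 0."""
    tp = ((Y_true == 1) & (Y_pred == 1)).sum(axis=0)
    fp = ((Y_true == 0) & (Y_pred == 1)).sum(axis=0)
    fn = ((Y_true == 1) & (Y_pred == 0)).sum(axis=0)
    denom = 2 * tp + fp + fn
    f1 = np.where(denom == 0, 0.0, 2 * tp / np.maximum(denom, 1))
    return float(f1.mean())

def one_error(Y_true, scores):
    """Fraction of samples whose top-scored label is not a gold label."""
    top = scores.argmax(axis=1)
    return float(np.mean(Y_true[np.arange(len(Y_true)), top] == 0))
```

Higher is better for accuracy and Macro F1, while one error is a ranking metric for which lower is better.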

Results
To better evaluate our proposed methods, we employed the following systems for comparison: 1) Base, training our base model in Section 2.1 only on $D_e$; 2) FT (Fine-Tuning), using $D_s$ to pre-train the whole model, followed by using $D_e$ to fine-tune the model parameters; 3) FS, the Fully-Shared framework by Mou et al. (2016) as shown in Fig 1.a; 4) PSP and APSP, the Private-Shared-Private framework and its extension with adversarial losses by Liu et al. (2017).

Table 3: Results of the compared methods, obtained by averaging ten runs (top), and the comparison between our best model and the state-of-the-art systems (bottom). DATN-2* indicates the ensemble results of ten runs. Base† and DATN-2† denote the average results of ten-fold cross validation on the whole dataset for fair comparison; here the source and target tasks in DATN-2† use the same training data. For E1, Rank1 and Rank2 are the top two systems from the official leaderboard; for E2, Rank1 and Rank2 are from Zhou et al. (2016, 2018a).

In Table 3, we report the comparison results between our method and the baseline systems. It can be easily observed that 1) for transfer learning, although the performance of SP is similar to or even lower than some baseline systems, our proposed dual attention models, i.e., DATN-1 and DATN-2, can generally boost SP to achieve the best results. To investigate the significance of the improvements, we pool each model's predictions over all emotion labels, treat them as a single label, and then perform McNemar's significance tests (Gillick and Cox, 1989). We verify that for English, DATN-1 is significantly better than Base, FT, FS and SP, while DATN-2 is significantly better than all the methods except APSP; for Chinese, DATN-1 and DATN-2 are significantly better than all the compared methods.
2) Even compared with the state-of-the-art systems in E1 which also employ other external resources, including the affective embedding, emotion lexicon and sentiment classification datasets (Baziotis et al., 2018), the ensemble results of DATN-2 can achieve slightly better performance; in addition, it is clear that our model can obtain the best performance in E2.
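The McNemar's test used for the significance analysis above can be sketched with the standard library alone. This is a generic version with continuity correction, not the authors' evaluation script; treating each pooled (sentence, label) decision as one paired observation is an assumption about how the predictions were combined.

```python
import math

def mcnemar(correct_a, correct_b):
    """McNemar's test with continuity correction on paired correctness flags.

    correct_a, correct_b: equal-length sequences of booleans telling whether
    system A / system B got each pooled decision right.
    Returns (chi-square statistic, p-value with 1 degree of freedom).
    """
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if not x and y)
    if b + c == 0:
        return 0.0, 1.0  # the systems never disagree
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-square with 1 d.o.f.: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value
```

Only the decisions on which the two systems disagree influence the statistic, so two models with similar overall scores can still differ significantly.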
Furthermore, to obtain a better understanding of the advantage of our method, we choose one sentence from the test set of E1, and visualize the attention weights obtained by Base and DATN-2 in Fig 3. We can see that Base pays more attention to those frequent emotion words while ignoring the less frequent but important emoji, and thus fails to predict the love emotion implied by the emoji. In contrast, with the proposed dual attention mechanism, DATN-2 makes correct predictions since it can respectively capture the general sentiment words and the emotion-specific emojis.

Conclusion
In this paper, we proposed a dual attention-based transfer learning approach that leverages sentiment classification to improve the performance of multi-label emotion classification. Experiments on two benchmark datasets demonstrate the effectiveness of the proposed transfer learning method.