A Multi-sentiment-resource Enhanced Attention Network for Sentiment Classification

Deep learning approaches for sentiment classification do not fully exploit sentiment linguistic knowledge. In this paper, we propose a Multi-sentiment-resource Enhanced Attention Network (MEAN) to alleviate the problem by integrating three kinds of sentiment linguistic knowledge (i.e., sentiment lexicon, negation words, and intensity words) into the deep neural network via attention mechanisms. By using various types of sentiment resources, MEAN utilizes sentiment-relevant information from different representation subspaces, which makes it more effective at capturing the overall semantics of the sentiment, negation and intensity words for sentiment prediction. The experimental results demonstrate that MEAN consistently outperforms strong competitors.

Despite the remarkable progress made by previous work, we argue that sentiment analysis remains a challenge. Sentiment resources, including sentiment lexicons, negation words, and intensity words, play a crucial role in traditional sentiment classification approaches (Maks and Vossen, 2012; Duyu et al., 2014). Despite their usefulness, sentiment linguistic knowledge has to date been underutilized in most recent deep neural network models (e.g., CNNs and LSTMs).
In this work, we propose a Multi-sentiment-resource Enhanced Attention Network (MEAN) for sentence-level sentiment classification, which integrates several kinds of sentiment linguistic knowledge into deep neural networks via a multi-path attention mechanism. Specifically, we first design a coupled word embedding module to model the word representation from character-level and word-level semantics. This helps to capture morphological information such as the prefixes and suffixes of words. Then, we propose a multi-sentiment-resource attention module to learn a more comprehensive and meaningful sentiment-specific sentence representation by using the three types of sentiment resource words as attention sources that attend to the context words respectively. In this way, we can attend to different sentiment-relevant information from the different representation subspaces implied by the different types of sentiment sources, and capture the overall semantics of the sentiment, negation and intensity words for sentiment prediction.
The main contributions of this paper are summarized as follows. First, we design a coupled word embedding obtained from character-level embedding and word-level embedding to capture both character-level morphological information and word-level semantics. Second, we propose a multi-sentiment-resource attention module to learn a more comprehensive sentiment-specific sentence representation from the multiple subspaces implied by three kinds of sentiment resources: sentiment lexicon, intensity words, and negation words. Finally, the experimental results show that MEAN consistently outperforms competitive methods.

Model
Our proposed MEAN model consists of three key components: a coupled word embedding module, a multi-sentiment-resource attention module, and a sentence classifier module. In the rest of this section, we elaborate on these three parts in detail. The overall framework is shown in Figure 1.

Coupled Word Embedding
To exploit the sentiment-related morphological information implied by some prefixes and suffixes of words (such as "Non-", "In-", "Im-"), we design a coupled word embedding learned from character-level embedding and word-level embedding. We first design a character-level convolutional neural network (Char-CNN) to obtain the character-level embedding (Zhang et al., 2015). Different from (Zhang et al., 2015), the designed Char-CNN is a fully convolutional network without a max-pooling layer, so as to better capture the semantic information in character chunks. Specifically, we first input one-hot-encoded character sequences to a 1 × 1 convolution layer to enhance the nonlinear semantic representation ability of our model (Long et al., 2015), and the output is then fed into a multi-gram (i.e., different window sizes) convolution layer to capture different local character chunk information. For word-level embedding, we use pretrained word vectors, GloVe (Pennington et al., 2014), to map each word to a low-dimensional vector space. Finally, each word is represented as a concatenation of the character-level embedding and the word-level embedding. This is performed on the context words and the three types of sentiment resource words, resulting in four final coupled word embedding matrices: $W^c = [w^c_1, \ldots, w^c_t] \in \mathbb{R}^{d \times t}$ for context words, $W^s = [w^s_1, \ldots, w^s_m] \in \mathbb{R}^{d \times m}$ for sentiment words, $W^i = [w^i_1, \ldots, w^i_k] \in \mathbb{R}^{d \times k}$ for intensity words, and $W^n = [w^n_1, \ldots, w^n_p] \in \mathbb{R}^{d \times p}$ for negation words. Here, t, m, k, p are the lengths of the corresponding items respectively, and d is the embedding dimension. Each W is normalized to better calculate the following word correlation.
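As a rough illustration of the coupling step only, the following NumPy sketch concatenates a character-level vector and a word-level vector for each word and normalizes the result. The random arrays are stand-ins for the Char-CNN output and the GloVe vectors, the function name `couple_embeddings` is hypothetical, and per-word L2 normalization is an assumption, since the text does not state which normalization is used.

```python
import numpy as np

def couple_embeddings(char_emb, word_emb, eps=1e-12):
    """Concatenate character-level and word-level vectors, one column per word.

    char_emb: (d_char, n) stand-in for Char-CNN outputs.
    word_emb: (d_word, n) stand-in for pretrained GloVe vectors.
    Returns a (d_char + d_word, n) matrix whose columns have unit L2 norm
    (the choice of L2 normalization is an assumption of this sketch).
    """
    coupled = np.concatenate([char_emb, word_emb], axis=0)
    norms = np.linalg.norm(coupled, axis=0, keepdims=True)
    return coupled / np.maximum(norms, eps)

rng = np.random.default_rng(0)
t = 7                                   # number of context words (toy value)
char_emb = rng.normal(size=(300, t))    # placeholder for Char-CNN features
word_emb = rng.normal(size=(300, t))    # placeholder for GloVe vectors
W_c = couple_embeddings(char_emb, word_emb)
print(W_c.shape)  # (600, 7)
```

With unit-norm columns, the dot products used later for word correlation behave like cosine similarities, which is one plausible reading of why the matrices are normalized.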

Multi-sentiment-resource Attention Module
After obtaining the coupled word embedding, we propose a multi-sentiment-resource attention mechanism to help select the crucial sentiment-resource-relevant context words to build the sentiment-specific sentence representation. Concretely, we use the three kinds of sentiment resource words as attention sources to attend to the context words respectively, which helps capture the different sentiment-relevant context words corresponding to the different types of sentiment sources. For example, using sentiment words as the attention source attending to the context words helps form the sentiment-word-enhanced sentence representation. Then, we combine the three kinds of sentiment-resource-enhanced sentence representations to learn the final sentiment-specific sentence representation. We design three types of attention mechanisms, namely sentiment attention, intensity attention, and negation attention, to model the three kinds of sentiment resources, respectively. In the following, we elaborate on the three types of attention mechanisms in detail. First, inspired by (Xiong et al.), we expect to establish the word-level relationship between the context words and the different kinds of sentiment resource words. To be specific, we define the dot products between the context words and the three kinds of sentiment resource words as correlation matrices. Mathematically, the detailed formulation is as follows:
$$M^s = (W^c)^T \cdot W^s \in \mathbb{R}^{t \times m}, \quad M^i = (W^c)^T \cdot W^i \in \mathbb{R}^{t \times k}, \quad M^n = (W^c)^T \cdot W^n \in \mathbb{R}^{t \times p} \qquad (1)$$
where $M^s$, $M^i$, $M^n$ are the correlation matrices that measure the relevance between each context word and each sentiment, intensity, and negation word, respectively.
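The correlation step described above can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions: embeddings are stored one word per column, the sizes are toy values rather than the paper's settings, the helper names `correlation` and `unit_columns` are hypothetical, and the L2 normalization is assumed.

```python
import numpy as np

def correlation(W_context, W_resource):
    """Dot products between every context word and every resource word.

    W_context:  (d, t) context word embeddings, one column per word.
    W_resource: (d, r) sentiment-resource word embeddings.
    Returns a (t, r) matrix; entry (a, b) scores context word a
    against resource word b.
    """
    return W_context.T @ W_resource

def unit_columns(W):
    # Assumed L2 normalization, so each dot product is a cosine similarity.
    return W / np.linalg.norm(W, axis=0, keepdims=True)

rng = np.random.default_rng(0)
d, t, m, k, p = 8, 5, 3, 2, 2           # toy sizes, not the paper's settings
W_c = unit_columns(rng.normal(size=(d, t)))   # context words
W_s = unit_columns(rng.normal(size=(d, m)))   # sentiment words
W_i = unit_columns(rng.normal(size=(d, k)))   # intensity words
W_n = unit_columns(rng.normal(size=(d, p)))   # negation words

M_s = correlation(W_c, W_s)   # (t, m)
M_i = correlation(W_c, W_i)   # (t, k)
M_n = correlation(W_c, W_n)   # (t, p)
print(M_s.shape, M_i.shape, M_n.shape)  # (5, 3) (5, 2) (5, 2)
```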
After obtaining the correlation matrices, we can compute the context-word-relevant sentiment word representation matrix $X^s$, the context-word-relevant intensity word representation matrix $X^i$, and the context-word-relevant negation word representation matrix $X^n$ by the dot products between the context words and the corresponding correlation matrices. Meanwhile, we can also obtain the sentiment-word-relevant context word representation matrix $X^c_s$ by the dot product between the correlation matrix $M^s$ and the sentiment words $W^s$, the intensity-word-relevant context word representation matrix $X^c_i$ by the dot product between the intensity words $W^i$ and the correlation matrix $M^i$, and the negation-word-relevant context word representation matrix $X^c_n$ by the dot product between the negation words $W^n$ and the correlation matrix $M^n$. The detailed formulas are presented as follows:
$$X^c_s = W^s (M^s)^T, \quad X^s = W^c M^s$$
$$X^c_i = W^i (M^i)^T, \quad X^i = W^c M^i$$
$$X^c_n = W^n (M^n)^T, \quad X^n = W^c M^n$$
The final enhanced context word representation matrix is computed as:
$$X^c = X^c_s + X^c_i + X^c_n$$
Next, we employ four independent GRU networks (Chung et al., 2015) to encode the hidden states of the context words and the three types of sentiment resource words, respectively. Formally, given the word embeddings $X^c$, $X^s$, $X^i$, $X^n$, the hidden state matrices $H^c$, $H^s$, $H^i$, $H^n$ can be obtained as follows:
$$H^c = \mathrm{GRU}(X^c), \quad H^s = \mathrm{GRU}(X^s), \quad H^i = \mathrm{GRU}(X^i), \quad H^n = \mathrm{GRU}(X^n)$$
After obtaining the hidden state matrices, the sentiment-word-enhanced sentence representation $o_1$ can be computed as:
$$o_1 = \sum_{i=1}^{t} \alpha_i h^c_i, \quad q^s = \frac{1}{m} \sum_{i=1}^{m} h^s_i$$
$$\beta([h^c_i; q^s]) = u_s^T \tanh(W_s [h^c_i; q^s])$$
$$\alpha_i = \frac{\exp(\beta([h^c_i; q^s]))}{\sum_{j=1}^{t} \exp(\beta([h^c_j; q^s]))}$$
where $q^s$ denotes the result of the mean-pooling operation over $H^s$, $\beta$ is the attention function that scores the i-th context word $h^c_i$, $\alpha_i$ is the resulting normalized importance of the i-th word in the context, and $u_s$ and $W_s$ are learnable parameters.
Similarly, with the hidden states $H^i$ and $H^n$ of the intensity words and the negation words as attention sources, we can obtain the intensity-word-enhanced sentence representation $o_2$ and the negation-word-enhanced sentence representation $o_3$. The final comprehensive sentiment-specific sentence representation $\tilde{o}$ is the composition of the above three sentiment-resource-specific sentence representations $o_1$, $o_2$, $o_3$:
$$\tilde{o} = [o_1, o_2, o_3]$$
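To make the three attention paths concrete, here is a minimal NumPy sketch of the pooling step. It rests on several assumptions: random matrices stand in for the GRU hidden states, the scoring function is taken to be additive, of the form u^T tanh(W [h; q]), which is consistent with the two named learnable parameters, and the final "composition" is taken to be concatenation. The names `attend`, `W_att`, and `u_att` are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def attend(H_context, H_source, W_att, u_att):
    """Additive attention of one sentiment-resource path over the context.

    H_context: (h, t) hidden states of the context words.
    H_source:  (h, r) hidden states of the resource words (attention source).
    W_att: (a, 2h) and u_att: (a,) play the role of the learnable parameters.
    Returns the attended sentence vector (h,) and the weights alpha (t,).
    """
    q = H_source.mean(axis=1)                          # mean-pooled source, (h,)
    t = H_context.shape[1]
    q_tiled = np.tile(q[:, None], (1, t))              # (h, t)
    pairs = np.concatenate([H_context, q_tiled], axis=0)   # [h_i; q], (2h, t)
    scores = u_att @ np.tanh(W_att @ pairs)            # one score per word, (t,)
    alpha = softmax(scores)
    return H_context @ alpha, alpha

rng = np.random.default_rng(0)
h, a, t, m, k, p = 6, 4, 5, 3, 2, 2     # toy sizes

# Random stand-ins for the GRU outputs of the context and resource words.
H_c = rng.normal(size=(h, t))
sources = [rng.normal(size=(h, r)) for r in (m, k, p)]  # sentiment, intensity, negation

outputs, alphas = [], []
for H_src in sources:
    # Each path gets its own (here randomly initialized) parameters.
    W_att = rng.normal(size=(a, 2 * h))
    u_att = rng.normal(size=a)
    o, alpha = attend(H_c, H_src, W_att, u_att)
    outputs.append(o)
    alphas.append(alpha)

o_tilde = np.concatenate(outputs)       # concatenation assumed for "composition"
print(o_tilde.shape)  # (18,)
```

Because every path attends over the same context states but is conditioned on a different pooled source, the three output vectors can emphasize different context words.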

Sentence Classifier
After obtaining the final sentence representation $\tilde{o}$, we feed it to a softmax layer to predict the sentiment label distribution of a sentence:
$$\hat{y} = \mathrm{softmax}(W_o \tilde{o} + b_o) \in \mathbb{R}^{C}$$
where $\hat{y}$ is the predicted sentiment distribution of the sentence, C is the number of sentiment labels, and $W_o$ and $b_o$ are parameters to be learned.
For model training, our goal is to minimize the cross entropy between the ground truth and the predicted results for all sentences. Meanwhile, in order to avoid overfitting, we use the dropout strategy to randomly omit parts of the parameters on each training case. Inspired by (Lin et al., 2017), we also design a penalization term to ensure the diversity of semantics across the different sentiment-resource-specific sentence representations, which reduces the information redundancy among the different sentiment resource attentions. Specifically, the final loss function is presented as follows:
$$L = -\sum_{i} \sum_{j=1}^{C} y_i^j \log \hat{y}_i^j + \lambda \sum_{\theta \in \Theta} \theta^2 + \mu \left\| \tilde{O}^T \tilde{O} - \psi I \right\|_F^2, \quad \tilde{O} = [o_1; o_2; o_3]$$
where $y_i^j$ is the target sentiment distribution of the sentence, $\hat{y}_i^j$ is the predicted probability, $\theta$ denotes each parameter to be regularized, $\Theta$ is the parameter set, $\lambda$ is the coefficient for L2 regularization, $\mu$ is a hyper-parameter to balance the three terms, $\psi$ is the weight parameter, $I$ denotes the identity matrix, and $\|\cdot\|_F$ denotes the Frobenius norm of a matrix. Here, the first two terms of the loss function are the cross entropy between the predicted and true distributions and the L2 regularization respectively, and the final term is a penalization term that encourages the diversity of sentiment sources.
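The three terms of the loss can be sketched as follows. This is a sketch for a single sentence under stated assumptions: the three sentence representations are stacked as the columns of a matrix `O`, so that the Gram matrix compared against the scaled identity is 3 × 3, and the default coefficients are the values reported in the implementation details. The function name `sentiment_loss` is hypothetical.

```python
import numpy as np

def sentiment_loss(y_true, y_pred, params, O,
                   lam=1e-5, mu=1e-4, psi=0.9, eps=1e-12):
    """Cross entropy + L2 regularization + diversity penalty.

    y_true, y_pred: (n, C) target and predicted label distributions.
    params: iterable of parameter arrays to regularize.
    O: (d, 3) matrix whose columns are o_1, o_2, o_3 for one sentence
       (this layout is an assumption of the sketch).
    """
    cross_entropy = -np.sum(y_true * np.log(y_pred + eps))
    l2 = lam * sum(np.sum(theta ** 2) for theta in params)
    gram = O.T @ O                                   # (3, 3) pairwise dot products
    penalty = mu * np.linalg.norm(gram - psi * np.eye(O.shape[1]), ord="fro") ** 2
    return cross_entropy + l2 + penalty

y = np.array([[0.0, 1.0, 0.0]])                      # one sentence, C = 3

# Diverse case: orthogonal columns with squared norm psi, so the penalty vanishes.
O_diverse = np.sqrt(0.9) * np.eye(4)[:, :3]
# Redundant case: the three representations are identical.
O_same = np.sqrt(0.9) * np.tile(np.eye(4)[:, :1], (1, 3))

loss_diverse = sentiment_loss(y, y, [], O_diverse)
loss_same = sentiment_loss(y, y, [], O_same)
print(loss_same > loss_diverse)  # True
```

The penalty is zero only when the three representations are mutually orthogonal with squared norm ψ, so identical representations are penalized, which is the redundancy the term is meant to discourage.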

Datasets and Sentiment Resources
Movie Review (MR) and Stanford Sentiment Treebank (SST) are used to evaluate our model. The MR dataset has 5,331 positive samples and 5,331 negative samples. We adopt the same data split as in (Qian et al., 2017). SST consists of 8,545 training samples, 1,101 validation samples, and 2,210 test samples. Each sample is marked as very negative, negative, neutral, positive, or very positive. The sentiment lexicon combines the sentiment words from both (Qian et al., 2017) and (Hu and Liu, 2004), resulting in 10,899 sentiment words in total. We collect negation and intensity words manually, as the number of these words is limited.

Baselines
In order to comprehensively evaluate the performance of our model, we list several baselines for sentence-level sentiment classification.
RNTN: Recursive Neural Tensor Network (Socher et al., 2013) is used to model correlations between different dimensions of child node vectors.
LSTM/Bi-LSTM: Cho et al. (2014) employ Long Short-Term Memory and its bidirectional variant to capture sequential information.
Tree-LSTM: Tree-Structured Long Short-Term Memory (Tai et al., 2015) introduces memory cells and gates into tree-structured neural networks, which helps capture semantic relatedness by parsing syntax trees.

Implementation Details
In our experiments, the dimensions of the character-level embedding and the word embedding (GloVe) are both set to 300. The kernel sizes of the multi-gram convolution for Char-CNN are set to 2 and 3. All the weight matrices are initialized as random orthogonal matrices, and all the bias vectors are initialized as zero vectors. We optimize the proposed model with the RMSprop algorithm, using mini-batch training. The mini-batch size is 60. The dropout rate is 0.5, and the coefficient λ of L2 regularization is set to $10^{-5}$. µ is set to $10^{-4}$, and ψ is set to 0.9. When there are no sentiment resource words in a sentence, all the context words are treated as sentiment resource words to implement the multi-path self-attention strategy.
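For reference, the settings listed above can be collected in a small configuration object. This is only an organizational sketch: the class name `MeanConfig` and all field names are hypothetical, and only the values are taken from the text.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class MeanConfig:
    char_embedding_dim: int = 300
    word_embedding_dim: int = 300                     # GloVe vectors
    char_cnn_kernel_sizes: Tuple[int, ...] = (2, 3)   # multi-gram convolution
    weight_init: str = "orthogonal"
    bias_init: str = "zeros"
    optimizer: str = "rmsprop"
    batch_size: int = 60
    dropout_rate: float = 0.5
    l2_coefficient: float = 1e-5                      # lambda
    penalty_weight: float = 1e-4                      # mu
    psi: float = 0.9

config = MeanConfig()
# Dimension of a coupled embedding if the two parts are simply concatenated.
print(config.char_embedding_dim + config.word_embedding_dim)  # 600
```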

Experiment Results
In our experiments, to be consistent with the recent baseline methods, we adopt classification accuracy as the evaluation metric. We summarize the experimental results in Table 1. Our model consistently outperforms its competitors and achieves state-of-the-art results on the MR and SST datasets. First, our model brings a substantial improvement over the methods that do not leverage sentiment linguistic knowledge (e.g., RNTN, LSTM, BiLSTM, CNN and ID-LSTM) on both datasets. This verifies the effectiveness of combining sentiment linguistic resources with deep learning algorithms. Second, our model also consistently outperforms LR-Bi-LSTM, which integrates the linguistic roles of sentiment, negation and intensity words into neural networks via linguistic regularization. For example, our model achieves a 2.4% improvement on the MR dataset and a 0.8% improvement on the SST dataset compared to LR-Bi-LSTM. This is because MEAN designs attention mechanisms that leverage sentiment resources efficiently by utilizing the interactive information between context words and sentiment resource words.
In order to analyze the effectiveness of each component of MEAN, we also report an ablation test in which we discard the character-level embedding (denoted as MEAN w/o CharCNN) and the sentiment words, negation words, or intensity words (denoted as MEAN w/o sentiment words/negation words/intensity words). All the tested factors contribute greatly to the performance of MEAN. In particular, the accuracy decreases sharply when the sentiment words are discarded. This is within our expectation, since sentiment words are vital when classifying the polarity of sentences.

Conclusion
In this paper, we propose a novel Multi-sentiment-resource Enhanced Attention Network (MEAN) to enhance the performance of sentence-level sentiment analysis, which integrates sentiment linguistic knowledge into the deep neural network.

Figure 1: The Overall Framework of Our Model

Table 1: Evaluation results. The best result for each dataset is in bold. The results marked with # are retrieved from (Qian et al., 2017), and the results marked with * are obtained by our implementation.