Neural Sentiment Classification with User and Product Attention

Document-level sentiment classification aims to predict a user's overall sentiment about a product in a document. However, most existing methods focus only on local text information and ignore global user preferences and product characteristics. Even the works that take such information into account usually suffer from high model complexity and consider only word-level preferences rather than semantic levels. To address this issue, we propose a hierarchical neural network that incorporates global user and product information into sentiment classification. Our model first builds a hierarchical LSTM model to generate sentence and document representations. Afterwards, user and product information is incorporated via attention over different semantic levels, owing to its ability to capture crucial semantic components. The experimental results show that our model achieves significant and consistent improvements compared to all state-of-the-art methods. The source code of this paper can be obtained from https://github.com/thunlp/NSC.


Introduction
Sentiment analysis aims to analyze people's sentiments or opinions from the texts they generate, and plays a critical role in data mining and natural language processing. Recently, sentiment analysis has drawn increasing attention from researchers with the rapid growth of online review sites such as Amazon, Yelp and IMDB, due to its importance to personalized recommendation.

* Corresponding author: M. Sun (sms@tsinghua.edu.cn)
In this work, we focus on the task of document-level sentiment classification, which is a fundamental problem in sentiment analysis. Document-level sentiment classification assumes that each document expresses a sentiment about a single product and aims to determine the overall sentiment toward that product.
Most existing methods treat sentiment classification as a special case of text classification. Such methods treat annotated sentiment polarities or ratings as categories and apply machine learning algorithms to train classifiers with text features, e.g., bag-of-words vectors (Pang et al., 2002). Since the performance of text classifiers heavily depends on the extracted features, such studies usually focus on designing effective features from text or additional sentiment lexicons (Ding et al., 2008; Taboada et al., 2011).
Motivated by the successful application of deep neural networks in computer vision (Ciresan et al., 2012), speech recognition (Dahl et al., 2012) and natural language processing (Bengio et al., 2006), several neural network based sentiment analysis models have been proposed to learn low-dimensional text features without any feature engineering (Glorot et al., 2011; Socher et al., 2011; Socher et al., 2012; Socher et al., 2013; Kim, 2014). Most of these models take the text of a sentence or a document as input and generate semantic representations using well-designed neural networks. However, they focus only on the text content and ignore the crucial characteristics of users and products. It is common sense that a user's preferences and a product's characteristics have a significant influence on ratings.
To incorporate user and product information into sentiment classification, (Tang et al., 2015b) bring a text preference matrix and a representation vector for each user and product into a CNN sentiment classifier. Their model modifies word meanings in the input layer with the preference matrix and concatenates the user/product representation vectors with the generated document representation before the softmax layer. The proposed model achieves some improvements but suffers from two problems: (1) A preference matrix for each user/product is difficult to train well with limited reviews. For example, most users in IMDB and Yelp have only several tens of reviews, which is not enough to obtain a well-tuned preference matrix. (2) The characteristics of users and products should be reflected at the semantic level, not only the word level. For example, a two-star review in Yelp says "great place to grab a steak and I am a huge fan of the hawaiian pizza · · · but I don't like to have to spend 100 bucks for a diner and drinks for two". It is obvious that the poor rating mainly stems from the last clause rather than the others.
To address these issues, we propose a novel hierarchical LSTM model that introduces user and product information into sentiment classification. As illustrated in Fig. 1, our model consists of two main parts. First, we build a hierarchical LSTM model to generate sentence-level and document-level representations jointly. Afterwards, we introduce user and product information as attention over the different semantic levels of a document. With the consideration of user and product information, our model significantly improves the performance of sentiment classification on several real-world datasets.
To summarize, our work provides the following three contributions: (1) We propose an effective Neural Sentiment Classification model that takes global user and product information into consideration. Compared with (Tang et al., 2015b), our model contains far fewer parameters and is more efficient to train.
(2) We introduce user- and product-based attention over the different semantic levels of a document. Traditional attention-based neural network models take only the local text information into consideration. In contrast, our model puts forward the idea of user-product attention by utilizing global user preferences and product characteristics.
(3) We conduct experiments on several real-world datasets to verify the effectiveness of our model. The experimental results demonstrate that our model significantly and consistently outperforms other state-of-the-art models.

Related Work
With the rise of deep learning in computer vision, speech recognition and natural language processing, neural models have been introduced into the sentiment classification field due to their ability to learn text representations. (Glorot et al., 2011) use a Stacked Denoising Autoencoder for sentiment classification for the first time. Socher develops a series of recursive neural network models that learn representations based on the recursive tree structure of sentences, including the Recursive Autoencoder (RAE) (Socher et al., 2011), the Matrix-Vector Recursive Neural Network (MV-RNN) (Socher et al., 2012) and the Recursive Neural Tensor Network (RNTN) (Socher et al., 2013). Besides, (Kim, 2014) and (Johnson and Zhang, 2014) adopt convolutional neural networks (CNN) to learn sentence representations and achieve outstanding performance in sentiment classification.
Recurrent neural networks also benefit sentiment classification because they are capable of capturing sequential information. (Li et al., 2015) and (Tai et al., 2015) investigate tree-structured long short-term memory (LSTM) networks for text and sentiment classification. There are also hierarchical models proposed for document-level sentiment classification (Tang et al., 2015a; Bhatia et al., 2015), which generate semantic representations at different levels (e.g., phrase, sentence or document) within a document. Moreover, the attention mechanism has also been introduced into sentiment classification, aiming to select important words from a sentence or important sentences from a document (Yang et al., 2016).
Most existing sentiment classification models ignore global user preferences and product characteristics, which have crucial effects on sentiment polarities. To address this issue, (Tang et al., 2015b) propose to add user/product preference matrices and representation vectors into CNN models. Nevertheless, their model suffers from high complexity and considers only word-level preferences rather than semantic levels. In contrast, we propose an efficient neural sentiment classification model in which users and products serve as attention at both the word and sentence levels.

Methods
In this section, we introduce our User Product Attention (UPA) based Neural Sentiment Classification (NSC) model in detail. First, we give the formalization of document-level sentiment classification. Afterwards, we discuss how to obtain the document semantic representation via a hierarchical long short-term memory (HLSTM) network. Finally, we present our attention mechanism, which incorporates the global information of users and products to enhance document representations. The enhanced document representation is used as the feature vector for sentiment classification. An overall illustration of the UPA-based NSC model is shown in Fig. 1.

Formalizations
Suppose a user $u \in U$ writes a review about a product $p \in P$. We represent the review as a document $d$ with $n$ sentences $\{S_1, S_2, \cdots, S_n\}$. The $i$-th sentence $S_i$ consists of $l_i$ words $\{w_1^i, w_2^i, \cdots, w_{l_i}^i\}$, where $l_i$ is the length of the $i$-th sentence. Document-level sentiment classification aims to predict the sentiment distributions or ratings of these reviews according to their text information.
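As a concrete toy instance of this formalization (all tokens here are illustrative), a document is simply a sequence of sentences, each a sequence of word tokens:

```python
# A toy review by user u about product p: a document d with n sentences,
# where sentence S_i is a list of l_i word tokens.
document = [
    ["great", "wine", ","],                 # S_1, l_1 = 3
    ["amazing", "music", "!"],              # S_2, l_2 = 3
    ["too", "expensive", "for", "dinner"],  # S_3, l_3 = 4
]

n = len(document)                     # number of sentences
lengths = [len(s) for s in document]  # l_i for each sentence
print(n, lengths)  # → 3 [3, 3, 4]
```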

Neural Sentiment Classification Model
According to the principle of compositionality (Frege, 1892), we model the semantics of a document through a hierarchical structure composed of word, sentence and document levels. To model the semantic representations of sentences, we adopt the Long Short-Term Memory (LSTM) network because of its excellent performance on sentiment classification, especially for long documents. Similarly, we also use an LSTM to learn document representations.
At the word level, we embed each word in a sentence into a low-dimensional semantic space; that is, each word $w_j^i$ is mapped to its embedding $\mathbf{w}_j^i \in \mathbb{R}^d$. At each step, given an input word embedding $\mathbf{w}_j^i$, the current cell state $\mathbf{c}_j^i$ and hidden state $\mathbf{h}_j^i$ are updated from the previous cell state $\mathbf{c}_{j-1}^i$ and hidden state $\mathbf{h}_{j-1}^i$ as follows:

$$
\begin{aligned}
\mathbf{i}_j^i &= \sigma(\mathbf{W}_i [\mathbf{h}_{j-1}^i; \mathbf{w}_j^i] + \mathbf{b}_i), \\
\mathbf{f}_j^i &= \sigma(\mathbf{W}_f [\mathbf{h}_{j-1}^i; \mathbf{w}_j^i] + \mathbf{b}_f), \\
\mathbf{o}_j^i &= \sigma(\mathbf{W}_o [\mathbf{h}_{j-1}^i; \mathbf{w}_j^i] + \mathbf{b}_o), \\
\hat{\mathbf{c}}_j^i &= \tanh(\mathbf{W}_c [\mathbf{h}_{j-1}^i; \mathbf{w}_j^i] + \mathbf{b}_c), \\
\mathbf{c}_j^i &= \mathbf{f}_j^i \odot \mathbf{c}_{j-1}^i + \mathbf{i}_j^i \odot \hat{\mathbf{c}}_j^i, \\
\mathbf{h}_j^i &= \mathbf{o}_j^i \odot \tanh(\mathbf{c}_j^i),
\end{aligned}
$$

where $\mathbf{i}$, $\mathbf{f}$, $\mathbf{o}$ are gate activations, $\odot$ stands for element-wise multiplication, $\sigma$ is the sigmoid function, and $\mathbf{W}$, $\mathbf{b}$ are the parameters to be trained. We then feed the hidden states $[\mathbf{h}_1^i, \mathbf{h}_2^i, \cdots, \mathbf{h}_{l_i}^i]$ to an average pooling layer to obtain the sentence representation $\mathbf{s}_i$.
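A single LSTM step followed by average pooling can be sketched in NumPy (a toy illustration, not the paper's implementation: parameters are random, the dimension is reduced to 4, and each gate matrix acts on the concatenation of the previous hidden state and the current word embedding):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # toy embedding / hidden dimension (the paper uses 200)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One weight matrix per gate, acting on the concatenation [h_{j-1}; w_j].
W_i, W_f, W_o, W_c = (rng.normal(scale=0.1, size=(d, 2 * d)) for _ in range(4))
b_i = b_f = b_o = b_c = np.zeros(d)

def lstm_step(h_prev, c_prev, w):
    x = np.concatenate([h_prev, w])   # [h_{j-1}; w_j]
    i = sigmoid(W_i @ x + b_i)        # input gate
    f = sigmoid(W_f @ x + b_f)        # forget gate
    o = sigmoid(W_o @ x + b_o)        # output gate
    c_hat = np.tanh(W_c @ x + b_c)    # candidate cell state
    c = f * c_prev + i * c_hat        # element-wise gated update
    h = o * np.tanh(c)                # new hidden state
    return h, c

# Run over one toy sentence of l_i = 3 word embeddings,
# then average-pool the hidden states into the sentence vector s_i.
words = rng.normal(size=(3, d))
h = c = np.zeros(d)
hidden = []
for w in words:
    h, c = lstm_step(h, c, w)
    hidden.append(h)
s_i = np.mean(hidden, axis=0)
print(s_i.shape)  # → (4,)
```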
At the sentence level, we likewise feed the sentence representations $[\mathbf{s}_1, \mathbf{s}_2, \cdots, \mathbf{s}_n]$ into an LSTM and obtain the document representation $\mathbf{d}$ through an average pooling layer in a similar way.

User Product Attention
We bring in User Product Attention to capture the crucial components over different semantic levels for sentiment classification. Specifically, we employ word-level UPA to generate sentence representations and sentence-level UPA to obtain document representation. We give the detailed implementations in the following parts.
It is obvious that not all words contribute equally to the sentence meaning for different users and products. Hence, at the word level, instead of feeding hidden states to an average pooling layer, we adopt a user product attention mechanism to extract user/product-specific words that are important to the meaning of the sentence. We then aggregate the representations of these informative words to form the sentence representation. Formally, the enhanced sentence representation is a weighted sum of hidden states:

$$
\mathbf{s}_i = \sum_{j=1}^{l_i} \alpha_j^i \mathbf{h}_j^i,
$$

where $\alpha_j^i$ measures the importance of the $j$-th word for the current user and product. Here, we embed each user $u$ and each product $p$ as continuous, real-valued vectors $\mathbf{u} \in \mathbb{R}^{d_u}$ and $\mathbf{p} \in \mathbb{R}^{d_p}$, where $d_u$ and $d_p$ are the dimensions of the user and product embeddings respectively. The attention weight $\alpha_j^i$ for each hidden state is thus defined as:

$$
\alpha_j^i = \frac{\exp(e(\mathbf{h}_j^i, \mathbf{u}, \mathbf{p}))}{\sum_{k=1}^{l_i} \exp(e(\mathbf{h}_k^i, \mathbf{u}, \mathbf{p}))},
$$

where $e$ is a score function which scores the importance of words for composing the sentence representation. The score function $e$ is defined as:

$$
e(\mathbf{h}_j^i, \mathbf{u}, \mathbf{p}) = \mathbf{v}^\top \tanh(\mathbf{W}_H \mathbf{h}_j^i + \mathbf{W}_U \mathbf{u} + \mathbf{W}_P \mathbf{p} + \mathbf{b}),
$$

where $\mathbf{W}_H$, $\mathbf{W}_U$ and $\mathbf{W}_P$ are weight matrices, $\mathbf{v}$ is a weight vector and $\mathbf{v}^\top$ denotes its transpose.
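The word-level attention can be sketched in NumPy (a toy illustration with random parameters and reduced dimensions; `H` stacks the hidden states of one sentence):

```python
import numpy as np

rng = np.random.default_rng(1)
d, du, dp = 4, 3, 3          # toy dims for hidden/user/product vectors

H = rng.normal(size=(5, d))  # hidden states h_1^i ... h_{l_i}^i, l_i = 5
u = rng.normal(size=du)      # user embedding
p = rng.normal(size=dp)      # product embedding

W_H = rng.normal(size=(d, d))
W_U = rng.normal(size=(d, du))
W_P = rng.normal(size=(d, dp))
b = np.zeros(d)
v = rng.normal(size=d)

# Score e(h, u, p) = v^T tanh(W_H h + W_U u + W_P p + b) for every word.
scores = np.array([v @ np.tanh(W_H @ h + W_U @ u + W_P @ p + b) for h in H])

# Softmax over the words of the sentence gives the attention weights alpha.
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()

# Sentence representation: weighted sum of hidden states.
s_i = alpha @ H
print(f"{alpha.sum():.6f}", s_i.shape)  # → 1.000000 (4,)
```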
The sentences that provide clues to the meaning of the document vary across users and products. Therefore, at the sentence level, we also use an attention mechanism with the user vector $\mathbf{u}$ and product vector $\mathbf{p}$, as at the word level, to select informative sentences for composing the document representation. The document representation $\mathbf{d}$ is obtained via:

$$
\mathbf{d} = \sum_{i=1}^{n} \beta_i \mathbf{h}_i,
$$

where $\beta_i$ is the weight of the sentence-level hidden state $\mathbf{h}_i$, calculated analogously to the word attention.

Sentiment Classification
Since the document representation $\mathbf{d}$ is hierarchically extracted from the words and sentences in the document, it is a high-level representation of the document. Hence, we regard it as the feature vector for document sentiment classification. We use a non-linear layer to project $\mathbf{d}$ into the target space of $C$ classes:

$$
\hat{\mathbf{d}} = \tanh(\mathbf{W}_c \mathbf{d} + \mathbf{b}_c).
$$

Afterwards, we use a softmax layer to obtain the predicted document sentiment distribution:

$$
p_c = \frac{\exp(\hat{d}_c)}{\sum_{k=1}^{C} \exp(\hat{d}_k)},
$$

where $C$ is the number of sentiment classes and $p_c$ is the predicted probability of sentiment class $c$. In our model, the cross-entropy error between the gold sentiment distribution and the predicted distribution is used as the loss function:

$$
L = -\sum_{d \in D} \sum_{c=1}^{C} p_c^g(d) \cdot \log p_c(d),
$$

where $p_c^g$ is the gold probability of sentiment class $c$, with the ground-truth class being 1 and the others 0, and $D$ represents the training documents.
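A minimal NumPy sketch of the classification layer and loss (toy dimensions, random parameters; the projection weights are named `W_c`, `b_c` here for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
d, C = 4, 5                      # toy document dim and number of classes

doc = rng.normal(size=d)         # document representation d
W_c = rng.normal(size=(C, d))
b_c = np.zeros(C)

# Non-linear projection into the target space of C classes.
logits = np.tanh(W_c @ doc + b_c)

# Softmax gives the predicted sentiment distribution p_c.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Cross-entropy against a one-hot gold distribution p^g.
gold = np.zeros(C)
gold[3] = 1.0                    # ground-truth class (toy choice)
loss = -np.sum(gold * np.log(probs))

print(f"{probs.sum():.6f}", loss > 0)  # → 1.000000 True
```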

Experiments
In this section, we introduce the experimental settings and empirical results on the task of document-level sentiment classification.

Experimental Settings
We evaluate the effectiveness of our NSC model on three sentiment classification datasets with user and product information: IMDB, Yelp 2013 and Yelp 2014, which are built by (Tang et al., 2015b). The statistics of the datasets are summarized in Table 1. We split the datasets into training, development and testing sets in the proportion of 8:1:1, with tokenization and sentence splitting performed by Stanford CoreNLP (Manning et al., 2014). We use two metrics: Accuracy, which measures the overall sentiment classification performance, and RMSE, which measures the divergence between predicted sentiment classes and ground-truth classes. The Accuracy and RMSE metrics are defined as:

$$
\mathrm{Accuracy} = \frac{T}{N}, \qquad \mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{N} (gd_i - pr_i)^2}{N}},
$$

where $T$ is the number of predicted sentiment ratings identical to the gold sentiment ratings, $N$ is the number of documents, and $gd_i$, $pr_i$ represent the gold and predicted sentiment rating of the $i$-th document respectively.

Word embeddings could be randomly initialized or pre-trained. Following (Tang et al., 2015a), we pre-train 200-dimensional word embeddings on each dataset with SkipGram (Mikolov et al., 2013). We set the user and product embedding dimensions to 200, initialized to zero. The dimensions of the hidden states and cell states in our LSTM cells are also set to 200. We tune the hyper-parameters on the development sets and use AdaDelta (Zeiler, 2012) to update parameters during training. We select the best configuration based on performance on the development set and evaluate that configuration on the test set.
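The two metrics can be computed directly from gold and predicted rating lists, e.g.:

```python
import numpy as np

def accuracy(gold, pred):
    # Fraction of predicted ratings identical to the gold ratings: T / N.
    gold, pred = np.asarray(gold), np.asarray(pred)
    return np.mean(gold == pred)

def rmse(gold, pred):
    # Root mean squared divergence between predicted and gold ratings.
    gold, pred = np.asarray(gold, float), np.asarray(pred, float)
    return np.sqrt(np.mean((gold - pred) ** 2))

gold = [5, 3, 4, 1, 2]
pred = [5, 2, 4, 1, 4]
print(accuracy(gold, pred), round(rmse(gold, pred), 4))  # → 0.6 1.0
```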

Baselines
We compare our NSC model with several baseline methods for document sentiment classification:

Majority regards the majority sentiment category in the training set as the sentiment category of each document in the test set.
Trigram trains an SVM classifier with unigrams, bigrams and trigrams as features.
TextFeature extracts text features including word and character n-grams, sentiment lexicon features, etc., and then trains an SVM classifier.
UPF extracts user-leniency features (Gao et al., 2013) and corresponding product features from training data, which are further concatenated with the features in Trigram and TextFeature.
AvgWordvec averages the word embeddings in a document to obtain the document representation, which is fed into an SVM classifier as features.
RNTN + RNN represents each sentence with the Recursive Neural Tensor Network (RNTN) (Socher et al., 2013) and feeds the sentence representations into a recurrent neural network to obtain the document representation.

JMARS considers the information of users and aspects with collaborative filtering and topic modeling for document sentiment classification.
UPNN brings a text preference matrix and a representation vector for each user and product into a CNN sentiment classifier (Kim, 2014). It modifies the word meaning in the input layer with the preference matrix and concatenates the user/product representation vectors with the generated document representation before the softmax layer.
For all the baseline methods above, we report the results from (Tang et al., 2015b), since we use the same datasets.

Model Comparisons
We list the experimental results in Table 2. As shown in the table, we divide the results into two parts: the first considers only the local text information, while the second incorporates both the local text information and the global user and product information.
From the first part of Table 2, we observe that NSC, the basic implementation of our model, significantly outperforms all baseline methods that consider only the local text information. Specifically, NSC achieves more than 4% improvement over typical well-designed neural network models on all datasets. This demonstrates that NSC effectively captures sequential information, which can be a crucial factor in sentiment classification. Moreover, we employ the idea of local semantic attention (LA) from (Yang et al., 2016) and implement it in the NSC model (denoted as NSC+LA). The results show that the attention-based NSC obtains considerable improvements over the original one. This confirms the importance of selecting more meaningful words and sentences in sentiment classification, which is also a main reason for introducing global user and product information in the form of attention.
In the second part of Table 2, we show the performance of models with user and product information. From this part, we have the following observations: (1) The global user and product information is helpful to neural network based models for sentiment classification. With such information on IMDB, UPNN achieves a 3% improvement and our proposed NSC+UPA obtains a 9% improvement in accuracy. These significant improvements demonstrate the necessity of considering such global information in sentiment classification.
(2) Our proposed NSC model with user product attention (NSC+UPA) significantly and consistently outperforms all the other baseline methods, which indicates the flexibility of our model on various real-world datasets. Note that we also implement (Tang et al., 2015b)'s method of handling user and product information on top of NSC (denoted as UPNN (NSC)). Though the use of NSC improves the performance of UPNN, it is still not comparable to our model. More specifically, UPNN exceeds the memory of our GPU (12 GB) on the Yelp2014 dataset due to its large number of parameters. Compared to UPNN, which utilizes user and product information with both matrices and vectors, our model embeds each user and product only as a vector, which makes it suitable for large-scale datasets. This demonstrates that our NSC model handles additional user and product information more effectively and efficiently.
The observations above demonstrate that NSC with user product attention (NSC+UPA) is capable of capturing the meanings of multiple semantic layers within a document. Compared with other user- and product-based models, our model incorporates global user and product information in an effective and efficient way. Furthermore, the model is robust and achieves consistent improvements over state-of-the-art methods on various real-world datasets.

Model Analysis: Effect of Attention Mechanisms at Word and Sentence Levels

Table 3 shows the effect of the attention mechanisms at the word and sentence levels respectively. From the table, we can observe that: (1) Both the word-level and sentence-level attention mechanisms improve performance for document sentiment classification, compared with average pooling at the word and sentence levels; (2) The word-level attention mechanism improves our model more than the sentence-level one. The reason is that word attention can capture informative words in all documents, while sentence attention may only work in long documents with various topics.
(3) The model considering both word-level and sentence-level attention outperforms those considering attention at only one semantic level. This proves that the characteristics of users and products are reflected at multiple semantic levels, which is also a critical motivation for introducing User Product Attention into sentiment classification.

Model Analysis: Effect of User and Product Information

Table 4 shows the performance of the attention mechanisms with the information of users or products. From the table, we can observe that: (1) The information of both users and products contributes to our model, compared to a purely semantic attention. This demonstrates that our attention mechanism can capture the specific characteristics of a user or a product.
(2) The information of users is more effective than that of products in enhancing document representations. This suggests that the discrimination of user preferences is more pronounced than that of product characteristics.

Model Analysis: Performance over Sentence Numbers and Lengths
To investigate the performance of our model over documents with various lengths, we compare different implementations of NSC under different document-length and sentence-number settings. Fig. 2 shows the sentiment classification accuracy of NSC, NSC+ATT, UPNN(NSC) and NSC+UPA on the IMDB test set with respect to input document lengths and the number of sentences per document. From Fig. 2, we observe that NSC with the attention mechanism of user and product information consistently outperforms the other baseline methods across all input document lengths and sentence numbers. This indicates the robustness and flexibility of NSC on different datasets.

Case Study

To demonstrate the effectiveness of our global attention, we provide a review instance from the Yelp2013 dataset as an example. The content of this review is "Great wine, great ambiance, amazing music!". We visualize the word-level attention weights for two distinct users and for the local semantic attention (LA) in Fig. 3. Here, the local semantic attention represents the implementation in (Yang et al., 2016), which calculates attention without considering the global information of users and products. Note that darker color means lower weight.

According to our statistics, the first user often mentions "wine" in his/her review sentences. On the contrary, the second user never talks about "wine" in his/her review sentences. Hence, we infer that the first user may have a special preference for wine while the second one has no particular concern about wine. From the figure, we observe an interesting phenomenon which conforms to our inference: for the word "wine", the first user has the highest attention weight and the second user has the lowest. This indicates that our model can capture global user preferences via our user attention.

Conclusion and Future Work
In this paper, we propose a hierarchical neural network which incorporates user and product information via word- and sentence-level attention. With the user and product attention, our model can take account of the global user preferences and product characteristics at both the word level and the semantic level. In experiments, we evaluate our model on the sentiment analysis task. The experimental results show that our model achieves significant and consistent improvements compared to other state-of-the-art models.
In future work, we will explore the following directions: (1) In this paper, we only consider global user preferences and product characteristics according to their behaviors. In fact, most users and products also come with text information such as user and product profiles. We will take advantage of such information for sentiment analysis in the future.
(2) Aspect level sentiment classification is also a fundamental task in the field of sentiment analysis. The user preference and product characteristics may also implicitly influence the sentiment polarity of the aspect. We will explore the effectiveness of our model on aspect level sentiment classification.