Learning Semantic Representations of Users and Products for Document Level Sentiment Classification

Neural network methods have achieved promising results for sentiment classification of text. However, these models only use the semantics of texts, while ignoring the users who express the sentiment and the products which are evaluated, both of which have great influence on interpreting the sentiment of text. In this paper, we address this issue by incorporating user- and product-level information into a neural network approach for document level sentiment classification. Users and products are modeled using vector space models, the representations of which capture important global clues such as individual preferences of users or overall qualities of products. Such global evidence in turn facilitates the embedding learning procedure at document level, yielding better text representations. By combining evidence at user-, product- and document-level in a unified neural framework, the proposed model achieves state-of-the-art performance on IMDB and Yelp datasets.


Introduction
Document-level sentiment classification is a fundamental problem in the field of sentiment analysis and opinion mining (Pang and Lee, 2008; Liu, 2012). The task is to infer the sentiment polarity or intensity (e.g. 1-5 or 1-10 stars on review sites) of a document. Dominant studies follow Pang et al. (2002; 2005) and regard this problem as a multi-class classification task. They usually use machine learning algorithms, and build a sentiment classifier from documents with accompanying sentiment labels. Since the performance of a machine learner is heavily dependent on the choice of data representations (Domingos, 2012), many works focus on designing effective features (Pang et al., 2002; Qu et al., 2010; Kiritchenko et al., 2014) or learning discriminative features from data with neural networks (Socher et al., 2013; Kalchbrenner et al., 2014; Le and Mikolov, 2014).
Despite the apparent success of neural network methods, they typically only use text information while ignoring the important influences of users and products. Let us take reviews with respect to 1-5 rating scales as an example. A critical user might write a review "it works great" and mark 4 stars, while a lenient user might give 5 stars even when posting an (almost) identical review. In this case, user preference affects the sentiment rating of a review. Product quality also has an impact on review sentiment rating. Reviews towards high-quality products (e.g. Macbook) tend to receive higher ratings than those towards low-quality products. Therefore, it is feasible to leverage individual preferences of users and overall qualities of products to build a smarter sentiment classifier and achieve better performance. One can manually design a small number of user and product features (Gao et al., 2013). However, we argue that they are not effective enough to capture sophisticated semantics of users and products.
In this paper, we propose a new model dubbed User Product Neural Network (UPNN) to capture user- and product-level information for sentiment classification of documents (e.g. reviews). UPNN takes as input a variable-sized document as well as the user who writes the review and the product which is evaluated. It outputs the sentiment polarity label of a document. Users and products are encoded in continuous vector spaces, the representations of which capture important global clues such as user preferences and product qualities. These representations are further integrated with continuous text representation in a unified neural framework for sentiment classification.
We apply UPNN to three datasets derived from IMDB and Yelp Dataset Challenge. We compare to several neural network models including recursive neural networks (Socher et al., 2013), paragraph vector (Le and Mikolov, 2014), sentiment-specific word embedding (Tang et al., 2014b), and a state-of-the-art recommendation algorithm JMARS (Diao et al., 2014). Experimental results show that: (1) UPNN outperforms baseline methods for sentiment classification of documents; (2) incorporating representations of users and products significantly improves classification accuracy. The main contributions of this work are as follows:
• We present a new neural network method (UPNN) by leveraging users and products for document-level sentiment classification.
• We validate the influences of users and products in terms of sentiment and text on massive IMDB and Yelp reviews.
• We report empirical results on three datasets, and show that UPNN outperforms state-of-the-art methods for sentiment classification.

Consistency Assumption Verification
We detail the effects of users and products in terms of sentiment (e.g. 1-5 rating stars) and text, and verify them on review datasets.
We argue that the influences of users and products include the following four aspects.
• user-sentiment consistency. A user has a specific preference when giving sentiment ratings. Some users favor giving higher ratings like 5 stars and some users tend to give lower ratings. In other words, sentiment ratings from the same user are more consistent than those from different users.
• product-sentiment consistency. Similar to user-sentiment consistency, a product also has its "preference" to receive different average ratings on account of its overall quality. Sentiment ratings towards the same product are more consistent than those towards different products.
• user-text consistency. A user likes to use personalized sentiment words when expressing opinion polarity or intensity. For example, a strict user might use "good" to express an excellent attitude, but a lenient user may use "good" to evaluate an ordinary product.
• product-text consistency. Similar to user-text consistency, a product also has a collection of product-specific words suited to evaluate it. For example, people prefer using "sleek" and "stable" to evaluate a smartphone, while they like to use "wireless" and "mechanical" to evaluate a keyboard.

Algorithm 1 Consistency Assumption Testing
Input: data X, number of users/products m, number of iterations n
Output: meaSame_k, meaDiff_k, 1 ≤ k ≤ n
for k = 1 to n do
    meaSame_k = 0, meaDiff_k = 0
We test the four consistency assumptions mentioned above with the same testing criterion, which is formalized in Algorithm 1. For each consistency assumption, we test it for $n = 50$ iterations on each of the IMDB, Yelp Dataset Challenge 2013 and 2014 datasets. Taking user-sentiment consistency as an example, in each iteration, we randomly select two reviews $x_i$, $x_i^+$ written by the same user $u_i$, and a review $x_i^-$ written by another randomly selected user. Afterwards, we calculate the measurements of $(x_i, x_i^+)$ and $(x_i, x_i^-)$, and aggregate these statistics for $m$ users. In the user-sentiment assumption test, we use the absolute rating difference $\| rating_a - rating_b \|$ as the measurement between two reviews $a$ and $b$. We illustrate the results in Figure 1 (a), where 2013same/2014same/amzsame (red plots) means that two reviews are written by the same user, and 2013diff/2014diff/amzdiff (black plots) means that two reviews are written by different users. We can find that the absolute rating differences between two reviews written by the same user are lower than those written by different users (t-test with p-value < 0.01). In other words, sentiment ratings from the same user are more consistent than those from different users. This validates the user-sentiment consistency.
For testing product-sentiment consistency, we use the absolute rating difference as the measurement. The reviews $x_i$, $x_i^+$ are towards the same product $p_i$, and $x_i^-$ is towards another randomly selected product. From Figure 1 (b), we can see that sentiment ratings towards the same product are more consistent than those towards different products. In order to verify the assumptions of user-text and product-text consistencies, we use cosine similarity between the bag-of-words of two reviews as the measurement. Results are given in Figure 1 (c) and (d). We can see that the textual similarity between two reviews written by the same user (or towards the same product) is higher than that between reviews written by different users (or towards different products).
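The testing procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the review dictionaries with `rating`, `text` and a grouping key, the function names, the whitespace tokenization, and the choice to average the accumulated measurements over the $m$ samples are all assumptions made for the example.

```python
import math
import random
from collections import Counter, defaultdict


def rating_diff(a, b):
    """Absolute rating difference between two reviews."""
    return abs(a["rating"] - b["rating"])


def cosine_bow(a, b):
    """Cosine similarity between bag-of-words vectors (whitespace tokens)."""
    ca = Counter(a["text"].lower().split())
    cb = Counter(b["text"].lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def consistency_test(reviews, key, measure, m, n, seed=0):
    """Return (mea_same, mea_diff): n per-iteration averages each.

    key is the grouping field ("user" or "product" in this sketch).
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for review in reviews:
        groups[review[key]].append(review)
    # Only groups with two or more reviews can supply the pair (x_i, x_i^+).
    eligible = [g for g, items in groups.items() if len(items) >= 2]
    if not eligible or len(groups) < 2:
        raise ValueError("need one group with two reviews and two groups overall")
    mea_same, mea_diff = [], []
    for _ in range(n):
        same = diff = 0.0
        for _ in range(m):
            g = rng.choice(eligible)
            x, x_pos = rng.sample(groups[g], 2)      # same user/product
            other = rng.choice([h for h in groups if h != g])
            x_neg = rng.choice(groups[other])        # different user/product
            same += measure(x, x_pos)
            diff += measure(x, x_neg)
        # Averaging over m is an assumption; the text only says "aggregate".
        mea_same.append(same / m)
        mea_diff.append(diff / m)
    return mea_same, mea_diff
```

Passing `rating_diff` corresponds to the sentiment consistency tests (lower is more consistent), and passing `cosine_bow` corresponds to the text consistency tests (higher is more consistent).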

User Product Neural Network (UPNN) for Sentiment Classification
We present the details of User Product Neural Network (UPNN) for sentiment classification. An illustration of UPNN is given in Figure 2. It takes as input a review, the user who posts the review, and the product which is evaluated. UPNN captures the four kinds of consistencies which are verified in Section 2. It outputs the sentiment category (e.g. 1-5 stars) of a review by considering not only the semantics of the review text, but also the information of the user and product. In the following subsections, we first describe the use of a neural network for modeling the semantics of variable-sized documents. We then present the methods for incorporating user and product information, followed by the use of UPNN in a supervised learning framework for sentiment classification.

Modeling Semantics of Document
We model the semantics of documents based on the principle of compositionality (Frege, 1892), which states that the meaning of a longer expression (e.g. a sentence or a document) comes from the meanings of its words and the rules used to combine them. Since a document consists of a list of sentences and each sentence is made up of a list of words, we model the semantic representation of a document in two stages. We first produce a continuous vector of each sentence from word representations. Afterwards, we feed sentence vectors as inputs to compose the document representation. For modeling the semantics of words, we represent each word as a low dimensional, continuous and real-valued vector, also known as word embedding (Bengio et al., 2003). All the word vectors are stacked in a word embedding matrix $L_w \in \mathbb{R}^{d \times |V|}$, where $d$ is the dimension of word vector and $|V|$ is the size of word vocabulary. These word vectors can be randomly initialized from a uniform distribution, regarded as a parameter and jointly trained with other parameters of neural networks. Alternatively, they can be pretrained from a text corpus with embedding learning algorithms (Mikolov et al., 2013; Pennington et al., 2014; Tang et al., 2014b), and applied as initial values of the word embedding matrix. We adopt the latter strategy, which better exploits the semantic and grammatical associations of words.

Figure 2: An illustration of the neural network approach for sentiment classification. $w_i$ means the i-th word of a review text. $u_k$ and $p_j$ are continuous vector representations of user $k$ and product $j$ for capturing user-sentiment and product-sentiment consistencies. $U_k$ and $P_j$ are continuous matrix representations of user $k$ and product $j$ for capturing user-text and product-text consistencies.
To model semantic representations of sentences, convolutional neural network (CNN) and recursive neural network (Socher et al., 2013) are two state-of-the-art methods. We use CNN (Kim, 2014; Kalchbrenner et al., 2014) in this work as it does not rely on an external parse tree. Specifically, we use multiple convolutional filters with different widths to produce the sentence representation. The reason is that they are capable of capturing local semantics of n-grams of various granularities, which are proven powerful for sentiment classification. A convolutional filter with a width of 3 essentially captures the semantics of trigrams in a sentence. Accordingly, multiple convolutional filters with widths of 1, 2 and 3 encode the semantics of unigrams, bigrams and trigrams in a sentence.
An illustration of CNN with three convolutional filters is given in Figure 3.
Let us denote a sentence consisting of $n$ words as $\{w_1, w_2, ..., w_i, ..., w_n\}$. Each word $w_i$ is mapped to its embedding representation $e_i \in \mathbb{R}^d$. A convolutional filter is a list of linear layers with shared parameters. Let $l_{cf}$ be the width of a convolutional filter, and let $W_{cf}$, $b_{cf}$ be the shared parameters of linear layers in the filter. The input of a linear layer is the concatenation of word embeddings in a fixed-length window of size $l_{cf}$, which is denoted as $I_{cf} = [e_i; e_{i+1}; ...; e_{i+l_{cf}-1}] \in \mathbb{R}^{d \cdot l_{cf}}$. The output of a linear layer is calculated as

$$O_{cf} = W_{cf} \cdot I_{cf} + b_{cf} \qquad (1)$$

where $W_{cf} \in \mathbb{R}^{len \times d \cdot l_{cf}}$, $b_{cf} \in \mathbb{R}^{len}$, and $len$ is the output length of the linear layer. In order to capture the global semantics of a sentence, we feed the output of a convolutional filter to an average pooling layer, resulting in an output vector of fixed length. We further add hyperbolic tangent functions (tanh) to incorporate element-wise nonlinearity, and fold (average) their outputs to generate the sentence representation.
We feed sentence vectors as the input of an average pooling layer to obtain the document representation. Alternative document modeling approaches include CNN or recurrent neural network. However, we prefer average pooling for its computational efficiency and good performance in our experiment.
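The composition just described (shared linear layer over each window, average pooling, tanh, averaging across filters, then averaging sentence vectors into a document vector) can be sketched as follows. This is an illustrative forward pass only, written with plain Python lists; the function names are hypothetical, and skipping filters wider than the sentence is an assumption, since the text does not say how short sentences are handled.

```python
import math


def linear(W, b, x):
    """Linear layer: W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def average(vectors):
    """Element-wise average of equally sized vectors (average pooling)."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def conv_filter(embeddings, W, b, width):
    """Apply one filter to every window, average-pool, then apply tanh."""
    windows = [
        [value for e in embeddings[i:i + width] for value in e]  # [e_i; ...]
        for i in range(len(embeddings) - width + 1)
    ]
    pooled = average([linear(W, b, window) for window in windows])
    return [math.tanh(value) for value in pooled]


def sentence_vector(embeddings, filters):
    """Average the outputs of all filters; filters is a list of (W, b, width).

    Assumption: filters wider than the sentence are skipped, and the
    sentence is non-empty so that at least a width-1 filter applies.
    """
    outputs = [
        conv_filter(embeddings, W, b, width)
        for W, b, width in filters
        if width <= len(embeddings)
    ]
    return average(outputs)


def document_vector(sentences, filters):
    """Average pooling over sentence vectors."""
    return average([sentence_vector(s, filters) for s in sentences])
```

Each `W` here has shape `len × (d · width)`, matching the dimensions given for $W_{cf}$; all filters must share the same output length `len` so that their outputs can be averaged.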

Modeling Semantics of Users and Products
We integrate semantic representations of users and products in UPNN to capture user-sentiment, product-sentiment, user-text and product-text consistencies.
For modeling user-sentiment and product-sentiment consistencies, we embed each user as a continuous vector $u_k \in \mathbb{R}^{d_u}$ and embed each product as a continuous vector $p_j \in \mathbb{R}^{d_p}$, where $d_u$ and $d_p$ are the dimensions of the user vector and product vector, respectively. The basic idea behind this is to map users with similar rating preferences (e.g. prefer assigning 4 stars) into close vectors in the user embedding space. Similarly, the products which receive similar averaged ratings are mapped into neighboring vectors in the product embedding space.
In order to model user-text consistency, we represent each user as a continuous matrix $U_k \in \mathbb{R}^{d_U \times d}$, which acts as an operator to modify the semantic meaning of a word. This is on the basis of vector based semantic composition (Mitchell and Lapata, 2010). They regard a compositional modifier as a matrix $X_1$ to modify another component $x_2$, and use matrix-vector multiplication $y = X_1 \times x_2$ as the composition function. Multiplicative semantic composition is suitable for our need of a user modifying word meaning, and it has been successfully utilized to model adjective-noun composition (Clark et al., 2008; Baroni and Zamparelli, 2010) and adverb-adjective composition (Socher et al., 2012). Similarly, we model product-text consistency by encoding each product as a matrix $P_j \in \mathbb{R}^{d_P \times d}$, where $d$ is the dimension of the word vector and $d_P$ is the output length of product-word multiplicative composition. After conducting user-word multiplication and product-word multiplication operations, we concatenate their outputs and feed them to CNN (detailed in Section 3.1) for producing the user and product enhanced document representation.
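As a small illustration of the multiplicative composition above, the sketch below modifies a word embedding $e_i$ with a user matrix $U_k$ and a product matrix $P_j$ and concatenates the two results. The function names are hypothetical and the matrices are plain lists of rows; only the operation $y = X_1 \times x_2$ and the concatenation come from the text.

```python
def matvec(M, x):
    """Matrix-vector product, with M given as a list of rows."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def compose_word(U_k, P_j, e_i):
    """Concatenate the user-modified and product-modified word vectors.

    U_k has shape d_U x d and P_j has shape d_P x d, so the result has
    length d_U + d_P.
    """
    return matvec(U_k, e_i) + matvec(P_j, e_i)


def compose_sentence(U_k, P_j, embeddings):
    """Apply the composition to every word before the CNN sees the sentence."""
    return [compose_word(U_k, P_j, e) for e in embeddings]
```

With this composition in front of the CNN, the window input has $d_U + d_P$ values per word instead of $d$, so the filter parameters would need to be sized accordingly.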

Sentiment Classification
We apply UPNN to document level sentiment classification under a supervised learning framework (Pang and Lee, 2005). Instead of using handcrafted features, we use continuous representation of documents, users and products as discriminative features. The sentiment classifier is built from documents with gold standard sentiment labels.
As is shown in Figure 2, the feature representation for building the rating predictor is the concatenation of three parts: continuous user representation $u_k$, continuous product representation $p_j$ and continuous document representation $v_d$, where $v_d$ encodes user-text consistency, product-text consistency and document level semantic composition. We use softmax to build the classifier because its outputs can be interpreted as conditional probabilities. Softmax is calculated as given in Equation 2, where $C$ is the category number (e.g. 5 or 10) and $x_i$ is the input score of category $i$.

$$softmax_i = \frac{\exp(x_i)}{\sum_{i'=1}^{C} \exp(x_{i'})} \qquad (2)$$
We regard the cross-entropy error between the gold sentiment distribution and the predicted sentiment distribution as the loss function of softmax.
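The classifier and its loss can be illustrated with a short sketch. It assumes a single linear layer (hypothetical parameters `W` and `b`) that maps the concatenated features to one score per category before softmax; the function names are illustrative, and subtracting the maximum score is a standard numerical-stability step rather than something stated in the text.

```python
import math


def softmax(scores):
    """Softmax over category scores, shifted by the max for stability."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(W, b, features):
    """Predicted distribution over C categories from concatenated features.

    features is the concatenation u_k + p_j + v_d (list concatenation).
    """
    scores = [sum(w * f for w, f in zip(row, features)) + bi
              for row, bi in zip(W, b)]
    return softmax(scores)


def cross_entropy(gold, predicted):
    """Cross-entropy between a gold distribution and a predicted one."""
    return -sum(g * math.log(p) for g, p in zip(gold, predicted) if g > 0)
```

For a review with a single gold rating, `gold` is a one-hot vector, so the loss reduces to the negative log-probability assigned to the correct category.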
We take the derivative of the loss function through back-propagation with respect to the whole set of parameters $\theta = [W_{cf}^{1,2,3}; b_{cf}^{1,2,3}; u_k; p_j; U_k; P_j; W_{softmax}; b_{softmax}]$, and update parameters with stochastic gradient descent. We set the widths of the three convolutional filters as 1, 2 and 3. We learn 200-dimensional sentiment-specific word embeddings (Tang et al., 2014b) on each dataset separately, randomly initialize other parameters from a uniform distribution $U(-0.01, 0.01)$, and set the learning rate as 0.03.

Experiment
We conduct experiments to evaluate UPNN by applying it to sentiment classification of documents.

Experimental Setting
Existing benchmark datasets for sentiment classification such as Stanford Sentiment Treebank (Socher et al., 2013) typically only have text information, but do not contain users who express the sentiment or products which are evaluated. Therefore, we build the datasets by ourselves. In order to obtain large scale corpora without manual annotation, we derive three datasets from IMDB (Diao et al., 2014) and the Yelp Dataset Challenge in 2013 and 2014. Statistics of the datasets are given in Table 1. We split each corpus into training, development and testing sets with an 80/10/10 split, and conduct tokenization and sentence splitting with Stanford CoreNLP. We use standard accuracy (Manning and Schütze, 1999; Jurafsky and Martin, 2000) to measure the overall sentiment classification performance, and use MAE and RMSE to measure the divergences between predicted sentiment ratings ($pr$) and ground truth ratings ($gd$).
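For concreteness, the three evaluation metrics can be computed as below, using the standard definitions of accuracy, mean absolute error and root mean squared error over $N$ test reviews. The function names are illustrative, and `gd` and `pr` are assumed to be equally long lists of ground truth and predicted ratings.

```python
import math


def accuracy(gd, pr):
    """Fraction of reviews whose predicted rating equals the gold rating."""
    return sum(1 for g, p in zip(gd, pr) if g == p) / len(gd)


def mae(gd, pr):
    """Mean absolute error between gold and predicted ratings."""
    return sum(abs(g - p) for g, p in zip(gd, pr)) / len(gd)


def rmse(gd, pr):
    """Root mean squared error between gold and predicted ratings."""
    return math.sqrt(sum((g - p) ** 2 for g, p in zip(gd, pr)) / len(gd))
```

Accuracy treats all errors alike, whereas MAE and RMSE penalize a prediction more the further it lies from the gold rating, and RMSE weights large deviations more heavily than MAE.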

Baseline Methods
We compare UPNN with the following baseline methods for document-level sentiment classification.
(1) Majority is a heuristic baseline method, which assigns the majority sentiment category in training set to each review in the test dataset.
(4) We extract user-leniency features (Gao et al., 2013) and corresponding product features (denoted as UPF) from training data, and concatenate them with the features in baseline (2) and (3).
(5) We learn word embeddings from training and development sets with word2vec (Mikolov et al., 2013), average word embeddings to get the document representation, and train an SVM classifier.
(6) We learn sentiment-specific word embeddings (SSWE) from training and development sets, and use max/min/average pooling (Tang et al., 2014b) to generate document representation.
(7) We represent sentence with RNTN (Socher et al., 2013) and compose document representation with recurrent neural network. We average hidden vectors of recurrent neural network as the features for sentiment classification.
(8) We re-implement PVDM in Paragraph Vector (Le and Mikolov, 2014) because its code is not officially provided. The window size is tuned on the development set.
(9) We compare with a state-of-the-art recommendation algorithm JMARS (Diao et al., 2014), which leverages user and aspects of a review with collaborative filtering and topic modeling.

Model Comparisons
Experimental results are given in Table 2. The results are separated into two groups: the methods above only use texts of review, and the methods below also use user and product information.
From the first group, we can see that majority performs very poorly because it does not capture any text or user information. SVM classifiers with trigrams and hand-crafted text features are powerful for document level sentiment classification and hard to beat. We compare the word embedding learnt from each corpus with off-the-shelf general word embeddings. Results show that tailored word embedding from each corpus performs slightly better than general word embeddings (about 0.01 improvement in terms of accuracy). SSWE performs better than context-based word embedding by incorporating sentiment information of texts. Setting a large window size (e.g. 15) is crucial for effectively training SSWE from documents with accompanying sentiment labels. RNTN+Recurrent is a strong performer by effectively modeling document representation with semantic composition. Our text based model (UPNN no UP) performs slightly better than RNTN+Recurrent, trigram and text features.

Table 2: Sentiment classification on IMDB, Yelp 2014 and Yelp 2013 datasets. Evaluation metrics are accuracy (Acc, higher is better), MAE (lower is better) and RMSE (lower is better). Our full model is UPNN (full). Our model without using user and product information is abbreviated as UPNN (no UP). The best method in each group is in bold.
From the second group, we can see that concatenating user product feature (UPF) with existing feature sets does not show significant improvements. This is because the dimension of existing feature sets is typically huge (e.g. 1M trigram features in Yelp 2014), so that concatenating a small number of UPF features does not have a great influence on the whole model. We do not evaluate JMARS in terms of accuracy because JMARS outputs real-valued ratings. Our full model UPNN yields the best performance on all three datasets. Incorporating semantic representations of user and product significantly (t-test with p-value < 0.01) boosts our text based model (UPNN no UP). This shows the effectiveness of UPNN over standard trigrams and hand-crafted features when incorporating user and product information.

Model Analysis: Effect of User and Product Representations
We investigate the effects of vector based user and product representations ($u_k$, $p_j$) as well as matrix based user and product representations ($U_k$, $P_j$) for sentiment classification. We remove vector based representations ($u_k$, $p_j$) and matrix based representations ($U_k$, $P_j$) from UPNN separately, and conduct experiments on three datasets. From Table 3, we can find that vector based representations ($u_k$, $p_j$) are more effective than matrix based representations ($U_k$, $P_j$). This is because $u_k$ and $p_j$ encode user-sentiment and product-sentiment consistencies, which are more directly associated with sentiment labels than user-text ($U_k$) and product-text ($P_j$) consistencies. Another reason might be that the vector representations have fewer parameters than the matrix representations, so that the vector representations are better estimated. We also examine the contribution of users and of products by removing ($U_k$, $u_k$) and ($P_j$, $p_j$) separately. Results are given in Table 3. It is interesting to find that user representations are clearly more effective than product representations for review rating prediction.

Discussion: Out-Of-Vocabulary Users and Products
An out-of-vocabulary (OOV) situation occurs if a user or a product in the testing/decoding process is never seen in training data. We give two natural solutions (avg UP and unk UP) to deal with OOV users and products. One solution (avg UP) is to regard the averaged representations of users/products in training data as the representation of an OOV user/product. Another way (unk UP) is to learn a shared "unknown" user/product representation for low-frequency users in training data, and apply it to an OOV user/product.

Table 3: Influence of user and product representations. For user $k$ and product $j$, $u_k$ and $p_j$ are their continuous vector representations, $U_k$ and $P_j$ are their continuous matrix representations (see Figure 2).

Figure 4: Accuracy of OOV user and product on OOV test set (panels: IMDB and Yelp 2014).
In order to evaluate the two strategies for the OOV problem, we randomly select 10 percent of the users and products from each development set, and mask their user and product information. We run avg UP and unk UP together with UPNN (no UP), which only uses text information, and UPNN (full), which learns a tailored representation for each user and product. We evaluate classification accuracy on the extracted OOV test set. Experimental results are given in Figure 4. We can find that these two strategies perform slightly better than UPNN (no UP), but still worse than the full model.
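The two OOV strategies can be illustrated with a small lookup sketch. It shows only the vector representations for brevity (the matrix representations would be handled the same way), and the `<unk>` token, the `min_count` threshold and the function names are hypothetical choices made for this example, since the text does not specify how low-frequency users are identified.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical id shared by all low-frequency users/products


def avg_representation(table):
    """avg UP: element-wise average of all trained user/product vectors."""
    vectors = list(table.values())
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def remap_low_frequency(ids, min_count):
    """unk UP (training side): replace rare ids with the shared UNK id.

    min_count is an assumed frequency threshold.
    """
    counts = Counter(ids)
    return [i if counts[i] >= min_count else UNK for i in ids]


def lookup(table, key, fallback):
    """Return the trained representation, or the fallback for an OOV id."""
    return table.get(key, fallback)
```

For avg UP, the fallback is `avg_representation(table)`; for unk UP, the training ids are first passed through `remap_low_frequency` so that a vector for `UNK` is learned, and that vector is then used as the fallback.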

Sentiment Classification
Sentiment classification is a fundamental problem in sentiment analysis, which aims at inferring the sentiment label of a document. Pang and Lee (2002; 2005) cast this problem as a classification task, and use machine learning methods in a supervised learning framework. Goldberg and Zhu (2006) use unlabelled reviews in a graph-based semi-supervised learning method. Many studies design effective features, such as text topic (Ganu et al., 2009), bag-of-opinion (Qu et al., 2010) and sentiment lexicon features (Kiritchenko et al., 2014). User information is also used for sentiment classification. Gao et al. (2013) design user-specific features to capture user leniency. Li et al. (2014) incorporate textual topic and user-word factors with supervised topic modeling. Tan et al. (2011) and Hu et al. (2013) utilize user-text and user-user relations for Twitter sentiment analysis. Unlike most previous studies that use hand-crafted features, we learn discriminative features from data. We differ from Li et al. (2014) in that we encode four kinds of consistencies and use a neural network approach. User representation is also leveraged for recommendation (Weston et al., 2013), web search (Song et al., 2014) and social media analytics (Perozzi et al., 2014).

Neural Network for Sentiment Classification
Neural networks have achieved promising results for sentiment classification. Existing neural network methods can be divided into two groups: word embedding and semantic composition. For learning word embeddings, Mikolov et al. (2013) and Pennington et al. (2014) use local and global contexts, while others (Maas et al., 2011; Labutov and Lipson, 2013; Tang et al., 2014b; Tang et al., 2014a; Zhou et al., 2015) further incorporate the sentiment of texts. For learning semantic composition, Glorot et al. (2011) use stacked denoising autoencoders, and Socher et al. (2013) introduce a family of recursive deep neural networks (RNN). RNN is extended with adaptive composition functions (Dong et al., 2014), global feedbackward (Paulus et al., 2014) and feature weight tuning (Li, 2014), and is also used for opinion relation detection. Li et al. (2015) compare the effectiveness of recursive neural network and recurrent neural network on five NLP tasks including sentiment classification. Kalchbrenner et al. (2014), Kim (2014) and Johnson and Zhang (2014) use convolutional neural networks. Le and Mikolov (2014) introduce Paragraph Vector. Unlike existing neural network approaches that only use the semantics of texts, we take user and product representations into consideration and leverage their connections with text semantics for sentiment classification. This work is an extension of our previous work (Tang et al., 2015), which only considers user-word association.

Conclusion
In this paper, we introduce User Product Neural Network (UPNN) for document level sentiment classification under a supervised learning framework. We validate user-sentiment, product-sentiment, user-text and product-text consistencies on massive reviews, and effectively integrate them in UPNN. We apply the model to three datasets derived from IMDB and Yelp Dataset Challenge. Empirical results show that: (1) UPNN outperforms state-of-the-art methods for document level sentiment classification; (2) incorporating continuous user and product representations significantly boosts sentiment classification accuracy.