Cold-Start Aware User and Product Attention for Sentiment Classification

The use of user/product information in sentiment analysis is important, especially for cold-start users/products, whose number of reviews is very limited. However, current models do not deal with the cold-start problem, which is typical in review websites. In this paper, we present Hybrid Contextualized Sentiment Classifier (HCSC), which contains two modules: (1) a fast word encoder that returns word vectors embedded with short- and long-range dependency features; and (2) Cold-Start Aware Attention (CSAA), an attention mechanism that considers the existence of the cold-start problem when attentively pooling the encoded word vectors. HCSC introduces shared vectors that are constructed from similar users/products, and are used when the original distinct vectors do not have sufficient information (i.e. cold-start). This is decided by a frequency-guided selective gate vector. Our experiments show that in terms of RMSE, HCSC performs significantly better than previous models on widely used datasets, despite having less complexity, and thus can be trained much faster. More importantly, our model performs significantly better than previous models when the training data is sparse and has cold-start problems.


Introduction
Sentiment classification is the fundamental task of sentiment analysis (Pang et al., 2002), where we are to classify the sentiment of a given text. It is widely used on online review websites as they contain huge amounts of review data that can be classified by sentiment. On these websites, a sentiment is usually represented as an intensity (e.g. 4 out of 5). The reviews are written by users who have bought a product. Recently, sentiment analysis research has focused on personalization (Zhang, 2015) to recommend products to users, and vice versa.
To this end, many have used user and product information not only to develop personalization but also to improve the performance of the classification model (Tang et al., 2015). Indeed, this information is important in two ways. First, some expressions are user-specific for a certain sentiment intensity. For example, the phrase "very salty" may carry different sentiments for a person who likes salty food and a person who does not. This is also apparent in terms of products. Second, these additional contexts help mitigate data sparsity and cold-start problems. Cold-start is the problem where the model cannot draw useful information from users/products for which data is insufficient. User and product information can help by introducing a frequent user/product with similar attributes to the cold-start user/product.
Thanks to the promising results of deep neural networks on the sentiment classification task (Glorot et al., 2011; Tang et al., 2014), more recent models incorporate user and product information into convolutional neural networks (Tang et al., 2015) and deep memory networks (Dou, 2017), and have shown significant improvements. The current state-of-the-art model, NSC (Chen et al., 2016a), introduced an attention mechanism called UPA, which is based on user and product information, and applied it to a hierarchical LSTM. The main problem with current models is that they use user and product information naively as an ordinary additional context, not considering the possible existence of cold-start problems. This makes NSC more problematic than helpful in reality, since the majority of the users in review websites have very few reviews.
To this end, we propose the idea shown in Figure 1. It can be described as follows: If the model does not have enough information to create a user/product vector, then we use a vector computed from other user/product vectors that are similar. We introduce a new model called Hybrid Contextualized Sentiment Classifier (HCSC), which consists of two modules. First, we build a fast yet effective word encoder that accepts word vectors and outputs new encoded vectors that are contextualized with short- and long-range contexts. Second, we combine these vectors into one pooled vector through a novel attention mechanism called Cold-Start Aware Attention (CSAA). The CSAA mechanism has three components: (a) a user/product-specific distinct vector derived from the original user/product information of the review, (b) a user/product-specific shared vector derived from other users/products, and (c) a frequency-guided selective gate which decides which vector to use. Multiple experiments are conducted with the following results: On the original non-sparse datasets, our model performs significantly better than the previous state-of-the-art, NSC, in terms of RMSE, despite being less complex. On the sparse datasets, HCSC performs significantly better than previous competing models.

Related work
Previous studies have shown that using additional contexts for sentiment classification helps improve the performance of the classifier. We survey several competing baseline models that use user and product information and other models using other kinds of additional context.
Baselines: Models with user and product information User and product information is helpful for improving the performance of a sentiment classifier. This argument was verified by Tang et al. (2015) through the observation of the consistency between user/product information and the sentiments and expressions found in the text. Listed below are the models that employ user and product information:
• JMARS (Diao et al., 2014) jointly models the aspects, ratings, and sentiments of a review while considering the user and product information using collaborative filtering and topic modeling techniques.
• UPNN (Tang et al., 2015) uses a CNN-based classifier and extends it to incorporate user- and product-specific text preference matrices at the word level, which modify the word meaning.
• TLFM+PRC (Song et al., 2017) is a text-driven latent factor model that unifies user- and product-specific latent factor models represented using the consistency assumption by Tang et al. (2015).
• UPDMN (Dou, 2017) uses an LSTM classifier as the document encoder and modifies the encoded vector using a deep memory network with other documents of the user/product as the memory.
• TUPCNN (Chen et al., 2016b) extends the CNN-based classifier by adding temporal user and product embeddings, which are obtained from a sequential model and learned through the temporal order of reviews.
• NSC (Chen et al., 2016a) is the current state-of-the-art model that utilizes a hierarchical LSTM model (Yang et al., 2016) and incorporates user and product information in the attention mechanism.
Models with other additional contexts Other additional contexts used previously are spatial (Yang et al., 2017) and temporal (Fukuhara et al., 2007) features, which help contextualize the sentiment based on the location where and the time when the text is written. Inferred contexts were also used as additional contexts for sentiment classifiers, such as latent topics (Lin and He, 2009) and aspects (Jo and Oh, 2011) from a topic model, argumentation features (Wachsmuth et al., 2015), and more recently, latent review clusters (Amplayo and Hwang, 2017). These additional contexts were especially useful when data is sparse, i.e. the number of instances is small or there exist cold-start entities.
Our model differs from the baseline models mainly because we consider the possible existence of the data sparsity problem. Through this, we are able to construct more effective models that are comparably powerful yet more efficient complexity-wise than the state-of-the-art, and are better when the training data is sparse. Ultimately, our goal is to demonstrate that, similar to other additional contexts, user and product information can be used to effectively mitigate the problem caused by cold-start users and products.

Our model
In this section, we present our model, Hybrid Contextualized Sentiment Classifier (HCSC), which consists of a fast hybrid contextualized word encoder and an attention mechanism called Cold-Start Aware Attention (CSAA). The word encoder returns word vectors with both local and global contexts to cover both short- and long-range dependency relationships between words. The CSAA then incorporates user and product information into the contextualized words through an attention mechanism that considers the possible existence of cold-start problems. The full architecture of the model is presented in Figure 2. We describe the subparts of the model below.

Hybrid contextualized word encoder
The base model is a word encoder that transforms vectors of words $\{w_i\}$ in the text to new word vectors. In this paper, we present a fast yet very effective hybrid contextualized word encoder (HCWE) based on two different off-the-shelf classifiers.
The first part of HCWE is based on a CNN model, which is widely used in text classification (Kim, 2014). This encoder contextualizes words based on local context words to capture short-range relationships between words. Specifically, we apply the convolution operation using filter matrices $W_f \in \mathbb{R}^{h \times d}$ with filter size $h$ to a window of $h$ words. We do this for different sizes of $h$. This produces new feature vectors $c_{i,h}$, obtained by passing the output of the convolution through a non-linear function $f(\cdot)$. The convolution operation reduces the number of words differently depending on the filter size $h$. To prevent loss of information and to produce the same number of feature vectors $c_{i,h}$, we pad the texts dynamically such that when the filter size is $h$, the number of paddings on each side is $(h - 1)/2$. This requires the filter sizes to be odd numbers. Finally, we concatenate all feature vectors of different $h$'s for each $i$ as the new word vector.
The second part of HCWE is based on an RNN model, which is used when texts are longer and include word dependencies that may not be captured by the CNN model. Specifically, we use a bidirectional LSTM and concatenate the forward and backward hidden state vectors as the new word vector.
Whether to use local or global context to encode words for sentiment classification is still unclear, and there is previous empirical evidence for each of the CNN and RNN models performing better than the other (Kim, 2014; McCann et al., 2017). We believe that both short- and long-range relationships, captured by CNN and RNN respectively, are useful for sentiment classification. There are already previous attempts to intricately combine both CNN and RNN (Zhou et al., 2016), resulting in a slower model. HCWE, on the other hand, combines them by simply concatenating, for each word, the word vectors encoded by the CNN and RNN encoders.
This straightforward yet fast alternative outputs a word vector with semantics contextualized from both local and global contexts. Moreover, it performs as well as complex hierarchical structured models (Yang et al., 2016; Chen et al., 2016a), which train very slowly.
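To illustrate the padding and concatenation scheme described above, the following is a minimal Python sketch and not the authors' implementation. The function names `conv_features` and `hybrid_encode`, the choice of tanh as the non-linear function, the omission of a bias term, and the `rnn_vectors` argument (which stands in for the outputs of a bidirectional LSTM computed elsewhere) are all assumptions made for this example.

```python
import math


def conv_features(words, filt, h):
    """Slide one h x d filter over zero-padded word vectors.

    With (h - 1) / 2 zero vectors on each side, the output contains exactly
    one feature per input word. tanh is an assumed choice of non-linearity.
    """
    if h % 2 != 1:
        raise ValueError("filter size must be odd")
    d = len(words[0])
    pad = (h - 1) // 2
    zero = [0.0] * d
    padded = [zero] * pad + list(words) + [zero] * pad
    feats = []
    for i in range(len(words)):
        window = padded[i:i + h]
        total = sum(filt[r][c] * window[r][c] for r in range(h) for c in range(d))
        feats.append(math.tanh(total))
    return feats


def hybrid_encode(words, filters, rnn_vectors):
    """Concatenate CNN features (all filter sizes) with given RNN vectors.

    filters: dict mapping an odd filter size h to a list of h x d filters.
    rnn_vectors: one vector per word, e.g. concatenated forward and backward
    hidden states; producing them is outside the scope of this sketch.
    """
    per_filter = [
        conv_features(words, filt, h)
        for h, filts in sorted(filters.items())
        for filt in filts
    ]
    encoded = []
    for i in range(len(words)):
        cnn_vec = [feats[i] for feats in per_filter]
        encoded.append(cnn_vec + list(rnn_vectors[i]))
    return encoded
```

Because every filter size yields one feature per word, the CNN part and the RNN part can be concatenated position by position without any alignment step.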

Cold-start aware attention
Incorporating the user and product information of the text as context vectors $u$ and $p$ to attentively pool the word vectors, i.e. $e(w_i, u, p) = v^\top \tanh(W_w w_i + W_u u + W_p p + b)$, has been shown to improve the performance of sentiment classifiers (Chen et al., 2016a). However, this method assumes that the user and product vectors are always present. This is not the case in real-world settings, where a user/product may be new and may have just received its first review. In this case, the vectors $u$ and $p$ are rendered useless and may also contain noisy signals that decrease the overall performance of the models.
To this end, we present an attention mechanism called Cold-Start Aware Attention (CSAA). CSAA operates on the idea that a cold-start user/product can use the information of other similar users/products with a sufficient number of reviews. CSAA separates the construction of pooled vectors for user and for product, unlike previous methods that use both user and product information to create a single pooled vector. Constructing a user/product-specific pooled vector consists of three parts: the distinct pooled vector created using the original user/product, the shared pooled vector created using similar users/products, and the selective gate to select between the distinct and shared vectors. Finally, the user- and product-specific pooled vectors are combined into one final pooled vector.
In the following paragraphs, we describe step by step how the user-specific pooled vector is constructed. A similar process is used to construct the product-specific pooled vector, but is not presented here for conciseness.
The user-specific distinct pooled vector $v_u^d$ is created using a method similar to the additive attention mechanism (Bahdanau et al., 2014), where the context vector is the distinct vector of user $u$: each encoded word vector is scored against the context vector, the scores are normalized with a softmax, and the word vectors are summed using the resulting weights. An equivalent method is used to create the distinct product-specific pooled vector $v_p^d$.
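The attentive pooling step can be sketched in Python as follows. This is an illustration under stated assumptions and not the authors' code: the scoring function is taken to be the single-context analogue of the additive form quoted earlier for user and product contexts, and the names `attentive_pool`, `W_w`, `W_c`, `b`, and `v` are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_pool(words, context, W_w, W_c, b, v):
    """Additive attention: score each word against a context vector,
    normalize the scores, and return the weighted sum of the word vectors.

    Assumed score: v . tanh(W_w w_i + W_c context + b).
    """
    ctx = matvec(W_c, context)
    scores = []
    for w in words:
        hidden = [
            math.tanh(a + c + bias)
            for a, c, bias in zip(matvec(W_w, w), ctx, b)
        ]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    weights = softmax(scores)
    dim = len(words[0])
    pooled = [sum(a * w[j] for a, w in zip(weights, words)) for j in range(dim)]
    return pooled, weights
```

Passing the distinct user vector as `context` would give a distinct pooled vector in this sketch; passing any other context vector reuses the same pooling routine unchanged.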
The user-specific shared pooled vector $v_u^s$ is created using the same method as above, but using a shared context vector $u'$. The shared context vector $u'$ is constructed from the vectors of other users, weighted by a similarity weight. Similarity is defined as how similar the word usages of two users are. This means that if a user $u_k$ uses words similarly to the original user $u$, then $u_k$ receives a high similarity weight. The similarity weight $a^s_{u_k}$ is calculated as the softmax of the product of $\mu(\{w_i\})$ and $u_k$ with a projection matrix in the middle, where $\mu(\{w_i\})$ is the average of the word vectors. The similarity weights are then used to create $u'$ as the weighted sum of the other users' vectors. A similar method is used for the shared product-specific pooled vector $v_p^s$.
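A small Python sketch of the shared context vector follows. It is only an illustration: the function name `shared_context` and the parameter name `W_s` for the projection matrix are hypothetical, and the set of other users is passed in explicitly as a list of vectors.

```python
import math


def shared_context(words, other_users, W_s):
    """Build a shared context vector from other users' vectors.

    Each other user u_k is scored with the bilinear form
    mean(words)^T W_s u_k; the scores are normalized with a softmax and
    used as weights in a sum over the other users' vectors.
    """
    dim = len(words[0])
    mean_word = [sum(w[j] for w in words) / len(words) for j in range(dim)]
    scores = []
    for u_k in other_users:
        projected = [sum(W_s[r][c] * u_k[c] for c in range(len(u_k))) for r in range(dim)]
        scores.append(sum(m * p for m, p in zip(mean_word, projected)))
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    user_dim = len(other_users[0])
    shared = [
        sum(a * u_k[j] for a, u_k in zip(weights, other_users))
        for j in range(user_dim)
    ]
    return shared, weights
```

In this sketch, a user whose vector aligns with the average word vector of the review dominates the shared vector, which matches the intuition that users with similar word usage should contribute most.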
We select between the user-specific distinct and shared pooled vectors, $v_u^d$ and $v_u^s$, to form one user-specific pooled vector $v_u$ through a gate vector $g_u$. The vector $g_u$ should put more weight on the distinct vector when user $u$ is not cold-start and on the shared vector otherwise. We use a frequency-guided selective gate that utilizes the frequency, i.e. the number of reviews user $u$ has written. The challenge is that we do not know how many reviews should be considered cold-start or not. This is automatically learned through a two-parameter Weibull cumulative distribution: given the review frequency of the user $f(u)$, a learned shape vector $k_u$, and a learned scale vector $\lambda_u$, a probability vector is obtained and used as the gate vector $g_u$ to create $v_u$ from $v_u^d$ and $v_u^s$. We normalize $f(u)$ by dividing it by the average user review frequency. A relu function applied to $k_u$ and $\lambda_u$ ensures that both are non-negative vectors. The final product-specific pooled vector $v_p$ is created in a similar manner.
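The frequency-guided gate can be sketched in Python as below. This is a hedged illustration, not the authors' implementation. It uses the standard two-parameter Weibull CDF, $1 - \exp(-(f/\lambda)^k)$, evaluated per dimension. The convex combination in `select`, the small `eps` added to the scale to avoid division by zero, and all function and parameter names are assumptions of this example.

```python
import math


def relu(x):
    """Rectified linear unit for a scalar."""
    return max(0.0, x)


def weibull_gate(freq, avg_freq, k, lam, eps=1e-8):
    """Per-dimension Weibull CDF of the normalized review frequency.

    freq: number of reviews of the user; avg_freq: average user frequency.
    k, lam: shape and scale vectors (learned in the model, given here).
    eps is an assumed safeguard against a zero scale after relu.
    """
    f = freq / avg_freq
    gate = []
    for k_j, lam_j in zip(k, lam):
        ratio = f / (relu(lam_j) + eps)
        gate.append(1.0 - math.exp(-(ratio ** relu(k_j))))
    return gate


def select(gate, distinct, shared):
    """Assumed convex combination: gate near 1 favors the distinct vector,
    gate near 0 favors the shared vector."""
    return [g * d + (1.0 - g) * s for g, d, s in zip(gate, distinct, shared)]
```

With a positive shape parameter, a user with no reviews receives a gate of 0 and falls back entirely on the shared vector, while a user with many reviews receives a gate close to 1.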
Finally, we combine the user- and product-specific pooled vectors, $v_u$ and $v_p$, into one pooled vector $v_{up}$. This is done using a gate vector $g_{up}$ created by a sigmoidal transformation of the concatenation of $v_u$ and $v_p$.
We note that our attention mechanism can be applied to any word encoder, from the basic bag of words (BoW) to more recent models such as CNN and RNN. Later (in Section 4.2), we show that CSAA greatly improves the performance of simpler models.
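The gated combination of the user- and product-specific vectors can be illustrated with the short Python sketch below. The text only states that the gate is a sigmoidal transformation of the concatenated vectors; the linear map `W_g`, the bias `b_g`, and the convex combination used to mix the two vectors are assumptions made for this example.

```python
import math


def sigmoid(x):
    """Logistic function for a scalar."""
    return 1.0 / (1.0 + math.exp(-x))


def combine(v_u, v_p, W_g, b_g):
    """Mix user and product pooled vectors with a learned-style gate.

    gate = sigmoid(W_g [v_u; v_p] + b_g), one value per output dimension.
    The mixing rule gate * v_u + (1 - gate) * v_p is an assumption.
    """
    concat = list(v_u) + list(v_p)
    gate = [
        sigmoid(sum(w * c for w, c in zip(row, concat)) + bias)
        for row, bias in zip(W_g, b_g)
    ]
    v_up = [g * a + (1.0 - g) * p for g, a, p in zip(gate, v_u, v_p)]
    return v_up, gate
```

With all-zero gate parameters, every gate value is 0.5 and the result is the plain average of the two vectors, which is a convenient sanity check.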

Training objective
Normally, a sentiment classifier transforms the final vector $v_{up}$, usually in a linear fashion, into a vector with a dimension equivalent to the number of classes $C$. A softmax layer is then used to obtain a probability distribution $y$ over the sentiment classes. Finally, the full model uses the cross-entropy between $y$ and the gold probability distribution, summed over all training documents $D$, as the objective function $L$ during training.
However, the architecture of HCSC can be exploited to improve the training. It contains seven pooled vectors $V = \{v_u^d, v_u^s, v_p^d, v_p^s, v_u, v_p, v_{up}\}$ that are essentially in the same vector space. This is because these vectors are created using weighted sums of either the encoded word vectors through attention or the parent pooled vectors through the selective gates. Therefore, we can train separate classifiers for each pooled vector using the same parameters $W$ and $b$. Specifically, for each $v \in V$, we calculate the loss $L_v$ in the same way as for $v_{up}$. The final loss is then the sum of all the losses, i.e. $L = \sum_{v \in V} L_v$.
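The summed objective for a single document can be sketched in Python as follows. This is an illustrative example rather than the authors' training code: the gold distribution is assumed to be one-hot (so the cross-entropy reduces to the negative log-probability of the gold class), and the dictionary keys naming the pooled vectors are hypothetical.

```python
import math


def cross_entropy(vec, gold_class, W, b):
    """Negative log-probability of the gold class under softmax(W vec + b)."""
    logits = [sum(w * x for w, x in zip(row, vec)) + bias for row, bias in zip(W, b)]
    top = max(logits)
    log_z = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_z - logits[gold_class]


def total_loss(pooled_vectors, gold_class, W, b):
    """Sum the per-vector losses, sharing the same classifier W, b.

    pooled_vectors: dict from a (hypothetical) name to a pooled vector,
    e.g. the distinct, shared, gated, and final vectors of one document.
    """
    return sum(cross_entropy(v, gold_class, W, b) for v in pooled_vectors.values())
```

Summing this quantity over all training documents gives the full objective; because the classifier parameters are shared, every pooled vector receives a direct training signal.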

Experimental settings
Implementation We set the size of the word, user, and product vectors to 300 dimensions. We use pre-trained GloVe embeddings (Pennington et al., 2014) to initialize our word vectors. We simply set the parameters for both BiLSTMs and CNN to produce an output with 300 dimensions: For the BiLSTMs, we set the state sizes of the LSTMs to 75 dimensions, for a total of 150 dimensions. For CNN, we set h = 3, 5, 7, each with 50 feature maps, for a total of 150 dimensions. These two are concatenated to create 300-dimensional encoded word vectors. We use dropout (Srivastava et al., 2014) on all non-linear connections with a dropout rate of 0.5. We set the batch size to 32. Training is done via stochastic gradient descent over shuffled mini-batches with the Adadelta update rule (Zeiler, 2012), with an l2 constraint (Hinton et al., 2012) of 3. We perform early stopping using a subset of the given development dataset. Training and experiments are all done using an NVIDIA GeForce GTX 1080 Ti graphics card. Additionally, we also implement two versions of our model where the word encoder is a subpart of HCSC, i.e. (a) the CNN-based model (CNN+CSAA) and (b) the RNN-based model (RNN+CSAA). For the CNN-based model, we use 100 feature maps for each of the filter sizes h = 3, 5, 7, for a total of 300 dimensions. For the RNN-based model, we set the state sizes of the LSTMs to 150, for a total of 300 dimensions.

Table 1: Dataset statistics

                      Train                    Dev                      Test
Datasets   Classes    #docs  #users  #prods    #docs  #users  #prods    #docs  #users  #prods
IMDB       10         67426  1310    1635      8381   1310    1574      9112   1310    1578
Yelp 2013  5          62522  1631    1633      7773   1631    1559      8671   1631    1577

                      Sparse20                 Sparse50                 Sparse80
Datasets   Classes    #docs  #users  #prods    #docs  #users  #prods    #docs  #users  #prods
IMDB       10         44261  1042    1323      17963  659     840       2450   250     312
Yelp 2013  5          38687  1301    1288      16058  818     823       2406   352     304

Datasets and evaluation
We evaluate and compare our models with other competing models using two widely used sentiment classification datasets with available user and product information: IMDB and Yelp 2013. Both datasets are curated by Tang et al. (2015), where they are divided into train, dev, and test sets using an 8:1:1 ratio, and are tokenized and sentence-split using Stanford CoreNLP. In addition, we create three subsets of the train dataset to test the robustness of the models on sparse datasets. To create these datasets, we randomly remove all the reviews of x% of all users and products, where x = 20, 50, 80. These datasets are not only more sparse than the original datasets, but also have a smaller number of users and products, introducing cold-start users and products. All datasets are summarized in Table 1. Evaluation is done using two metrics: Accuracy, which measures the overall sentiment classification performance, and RMSE, which measures the divergence between predicted and ground truth classes. We notice very minimal differences among the performances of different runs.
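The sparse-subset construction and the two metrics can be sketched in Python as follows. This is a hedged illustration and not the authors' script: the review record layout (dictionaries with `"user"` and `"product"` keys), the fixed random seed, and the reading that a review is dropped when either its user or its product was sampled for removal are assumptions of this example.

```python
import math
import random


def make_sparse(reviews, x, seed=0):
    """Remove all reviews of x% of the users and x% of the products.

    Assumption: a review is dropped if its user OR its product was sampled.
    """
    rng = random.Random(seed)
    users = sorted({r["user"] for r in reviews})
    products = sorted({r["product"] for r in reviews})
    dropped_users = set(rng.sample(users, round(len(users) * x / 100)))
    dropped_products = set(rng.sample(products, round(len(products) * x / 100)))
    return [
        r for r in reviews
        if r["user"] not in dropped_users and r["product"] not in dropped_products
    ]


def accuracy(predicted, gold):
    """Fraction of documents whose predicted class equals the gold class."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def rmse(predicted, gold):
    """Root mean squared error between predicted and gold class indices."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(predicted, gold)) / len(gold))
```

RMSE penalizes a prediction that is four classes away from the gold rating much more than one that is one class away, whereas accuracy counts both as equally wrong.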

Comparisons on original datasets
We report the results on the original datasets in Table 2. On both datasets, HCSC outperforms all previous models based on both accuracy and RMSE. Based on accuracy, HCSC performs significantly better than all previous models except NSC, where it performs slightly better with 0.9% and 0.7% increases on the IMDB and Yelp 2013 datasets, respectively. Based on RMSE, HCSC performs significantly better than all previous models, except when compared with UPDMN on the Yelp 2013 dataset, where it performs slightly better. We note that RMSE is a better metric because it measures how close the wrongly predicted sentiment and the ground truth sentiment are. Although NSC performs as well as HCSC based on accuracy, it performs worse based on RMSE, which means that its wrong predictions deviate farther from the ground truth sentiment. It is also interesting to note that when CSAA is used as attentive pooling, both simple CNN and RNN models perform just as well as NSC, despite NSC being very complex and modeling the documents with compositionality (Chen et al., 2016a). This is especially true when compared using RMSE, where both CNN+CSAA and RNN+CSAA perform significantly better (p < 0.01) than NSC. This shows that CSAA makes effective use of the user and product information for sentiment classification.
Table 3 shows the accuracy of NSC (Chen et al., 2016a) and our models CNN+CSAA, RNN+CSAA, and HCSC on the sparse datasets. As shown in the table, on all datasets with different levels of sparsity, HCSC performs the best among the competing models. The difference between the accuracy of HCSC and NSC increases as the level of sparsity intensifies: While HCSC only gains 0.8% and 1.0% over NSC on the less sparse Sparse20 IMDB and Yelp 2013 datasets, it improves over NSC significantly with 7.6% and 2.7% increases on the more sparse Sparse80 IMDB and Yelp 2013 datasets, respectively.
We also run our experiments using NSC without user and product information, i.e. NSC(LA), which reduces the model to a hierarchical LSTM model (Yang et al., 2016). Results show that although the use of user and product information in NSC improves the model on less sparse datasets (as also shown in the original paper (Chen et al., 2016a)), it decreases the performance of the model on more sparse datasets: It performs 2.0%, 1.7%, and 1.2% worse than NSC(LA) on the Sparse50 IMDB, Sparse80 IMDB, and Sparse80 Yelp 2013 datasets, respectively. We argue that this is because NSC does not consider the existence of cold-start problems, which makes the additional user and product information more noisy than helpful.

Analysis
In this section, we show further interesting analyses of the properties of HCSC. We use the Sparse50 datasets and the corresponding results of several models as the experimental data.
Performance per review frequency We investigate the performance of the model over users/products with different numbers of reviews. Figure 3 shows plots of the accuracy of both NSC and HCSC over (a) different user review frequencies on the IMDB dataset and (b) different product review frequencies on the Yelp 2013 dataset. On both plots, we observe that when the review frequency is small, the performance gain of HCSC over NSC is very large. However, as the review frequency becomes larger, the performance gain of HCSC over NSC decreases to a very marginal increase. This means that HCSC obtains its improvements over NSC from cold-start users and products, which NSC does not consider explicitly.
How few is cold-start? One intriguing question is when we can say that a user/product is cold-start or not. Obviously, users/products with no previous reviews at all should be considered cold-start; however, the cut-off point between cold-start and non-cold-start entities is vague. Although we cannot provide an exact answer to this question, HCSC is able to provide a useful visualization by reducing the shape and scale vectors, $k$ and $\lambda$, of the frequency-guided selective gate to their averages and drawing a Weibull cumulative distribution graph, as shown in Figure 5. The figure provides us with these observations: First, users have a more lenient cold-start cut-off point compared to products; in the IMDB dataset, a user needs only approximately five reviews to use at least 80% of its own information (i.e. distinct vector). On the other hand, products tend to need more reviews to be considered sufficient and not cold-start; in the IMDB dataset, a product needs approximately 40 reviews to use at least 80% of its own information. This explains the marginal increase in performance of previous models when only product information is used as additional context, as reported by previous papers (Tang et al., 2015; Chen et al., 2016a).
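The cut-off reading described above amounts to inverting the Weibull cumulative distribution. The Python sketch below shows the calculation under stated assumptions: it uses the standard two-parameter Weibull CDF with scalar (averaged) shape and scale, the function names are hypothetical, and no learned parameter values from the paper are assumed or reproduced.

```python
import math


def weibull_cdf(f, k, lam):
    """Standard two-parameter Weibull CDF at normalized frequency f."""
    return 1.0 - math.exp(-((f / lam) ** k))


def reviews_needed(target, k, lam, avg_freq):
    """Number of reviews at which the gate reaches `target` (e.g. 0.8).

    Solves 1 - exp(-(f / lam) ** k) = target for the normalized frequency f,
    giving f = lam * (-ln(1 - target)) ** (1 / k), then undoes the
    normalization by the average review frequency.
    """
    normalized = lam * (-math.log(1.0 - target)) ** (1.0 / k)
    return normalized * avg_freq
```

Given averaged shape and scale values and the average review frequency of a dataset, `reviews_needed(0.8, k, lam, avg_freq)` returns the point on the curve at which 80% of the distinct vector is used.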
On the different pooled vectors We visualize the attention and gate values of two example results from HCSC in Figure 4 to investigate how user/product vectors and distinct/shared vectors work. In the first example, both user and product are cold-start. The user distinct vector focuses its attention on the wrong words, since it is not able to use any useful information from the user at all. In this case, HCSC uses the user shared vector by using a gate vector $g_u = 0$. The user shared vector correctly attends to important words such as fresh, baked, soft, and pretzels. In the second example, both user and product are not cold-start. In this case, the distinct vectors are used almost entirely by setting the gates close to 1. Still, the corresponding shared vectors are similar to the distinct vectors, indicating that HCSC is able to create useful user/product-specific context from similar users/products. Finally, we look at the differing attention values of users and products. We observe that user vectors focus on words that describe the product or express their emotions (e.g. fresh and enjoyed). On the other hand, product vectors focus more on words pertaining to the products/services (e.g. pretzels and waitress).
On the time complexity of models Finally, we report the time in seconds to run 100 batches of data of the models NSC, CNN+CSAA, RNN+CSAA, and HCSC in Figure 4. NSC takes too long to train, needing at least 6500 seconds to process 100 batches of data. This is because it uses two non-parallelizable LSTMs on top of each other. Our models, on the other hand, use only one level of BiLSTM (or none in the case of CNN+CSAA). This results in at least a 6.6x speedup on the IMDB datasets, and at least a 10.7x speedup on the Yelp 2013 datasets. This means that HCSC does not sacrifice much training time to obtain better results.

Conclusion
We propose Hybrid Contextualized Sentiment Classifier (HCSC) with a fast word encoder which contextualizes words to contain both short- and long-range word dependency features, and an attention mechanism called Cold-Start Aware Attention (CSAA) which considers the existence of the cold-start problem among users and products by using a shared vector and a frequency-guided selective gate, in addition to the original distinct vector. Our experimental results show that our model performs significantly better than previous models. These improvements increase when the level of sparsity in the data increases, which confirms that HCSC is able to deal with the cold-start problem.