Dual Memory Network Model for Biased Product Review Classification

In sentiment analysis (SA) of product reviews, both user and product information have proven useful. Current models handle user profiles and product information jointly in a unified model, which may fail to learn salient features of users and products effectively. In this work, we propose a dual user and product memory network (DUPMN) model that learns user profiles and product reviews using separate memory networks. The two representations are then used jointly for sentiment prediction. The use of separate models aims to capture user profiles and product information more effectively. Compared to state-of-the-art unified prediction models, evaluations on three benchmark datasets, IMDB, Yelp13, and Yelp14, show that our dual learning model gives performance gains of 0.6%, 1.2%, and 0.9%, respectively. The improvements are also highly significant as measured by p-values.


Introduction
Written text is often meant to express sentiments of individuals. Recognizing the underlying sentiment expressed in the text is essential to understand the full meaning of the text. The SA community is increasingly interested in using natural language processing (NLP) techniques as well as sentiment theories to identify sentiment expressions in the text.
Recently, deep learning based methods have overtaken feature engineering approaches, gaining further performance improvements in SA. Typical neural network models include Convolutional Neural Networks (CNNs) (Kim, 2014), recursive auto-encoders (Socher et al., 2013), Long Short-Term Memory (LSTM) networks (Tang et al., 2015a), and many more.
Attention-based models are introduced to highlight important words and sentences in a piece of text. Different attention models are built using information embedded in the text, including users, products, and text in local context (Tang et al., 2015b; Yang et al., 2016; Gui et al., 2016). In order to incorporate other aspects of knowledge, Qian et al. (2016) developed a model to employ additional linguistic resources to benefit sentiment classification. Long et al. (2017b) and Mishra et al. (2016) proposed cognition-based attention models learned from cognition-grounded eye-tracking data.
Most text-based SA is modeled as a sentiment classification task. In this work, SA is applied to product reviews. We use the term users to refer to the writers of text, and products to refer to the targets of the reviews. A user profile is defined by the collection of reviews a user writes; product information for a product is defined by the collection of reviews of that product. Note that user profiles and product information are not independent of each other, which is one reason why previous works use unified models. Common sense tells us that review text written by a person may be subjective or biased towards his/her own preferences: lenient users tend to give higher ratings than finicky ones even when they review the same products. Popular products do receive higher ratings than unpopular ones because the aggregation of user reviews still reflects differences in opinion across products. While users and products both play crucial roles in sentiment analysis, they are fundamentally different.
Reviews written by a user can be affected by user preference, which is more subjective, whereas reviews for a product are useful only if they come from a collection of different reviewers, because individual reviews can be biased. The popularity of a product tends to reflect the general impression of a collection of users as an aggregated result. Therefore, sentiment prediction for a product should give dual consideration to individual users as well as all reviews as a collection.
In this paper, we address the aforementioned issue by proposing to learn user profiles and product review information separately before making a joint prediction on sentiment classification. In the proposed Dual User and Product Memory Network (DUPMN) model, we first build a hierarchical LSTM (Hochreiter and Schmidhuber, 1997) model to generate document representations. Then a user memory network (UMN) and a product memory network (PMN) are separately built based on document representation of user comments and product reviews. Finally, sentiment prediction is learned from a dual model.
To validate the effectiveness of the proposed model, evaluations are conducted on three benchmark review datasets from IMDB and the Yelp data challenge (Yelp13 and Yelp14) (Tang et al., 2015a). Experimental results show that our algorithm outperforms baseline methods by large margins. Compared to the state-of-the-art method, DUPMN achieves 0.6%, 1.2%, and 0.9% increases in accuracy with p-values of 0.007, 0.004, and 0.001 on the three benchmark datasets, respectively. Results show that leveraging user profiles and product information separately can be more effective for sentiment prediction.
The rest of this paper is organized as follows. Section 2 gives related work, especially memory network models. Section 3 introduces our proposed DUPMN model. Section 4 gives the evaluation compared to state-of-the-art methods on three datasets. Section 5 concludes this paper and gives some future directions in sentiment analysis models to consider individual bias.

Related Work
Related work includes neural network models and the use of user/product information in sentiment analysis.

Neural Network Models
In recent years, deep learning has greatly improved the performance of sentiment analysis. Commonly used models include Convolutional Neural Networks (CNNs) (Socher et al., 2011), Recursive Neural Networks (ReNNs) (Socher et al., 2013), and Recurrent Neural Networks (RNNs) (Irsoy and Cardie, 2014). RNNs naturally benefit sentiment classification because of their ability to capture sequential information in text. However, standard RNNs suffer from the so-called gradient vanishing problem (Bengio et al., 1994), in which gradients may grow or decay exponentially over long sequences. LSTM models are adopted to solve the gradient vanishing problem: an LSTM provides a gated mechanism to keep long-term memory. Each LSTM layer is generally followed by mean pooling, and the output is fed into the next layer. Experiments on datasets containing both sentences and long documents demonstrate that LSTM models outperform traditional RNNs (Tang et al., 2015a,c). An attention mechanism can also be added to LSTM models to highlight important segments at both the sentence level and the document level. Attention models can be built from text in local context (Yang et al., 2016), user/product information (Long et al., 2017a), and other information such as cognition-grounded eye-tracking data (Long et al., 2017b). LSTM models with attention mechanisms are currently the state-of-the-art models in document sentiment analysis tasks (Long et al., 2017b).
Memory networks are designed to handle larger context for a collection of documents. They introduce inference components combined with a so-called long-term memory component (Weston et al., 2014), a large external memory that represents data as a collection. This collective information can contain local context (Das et al., 2017) or an external knowledge base (Jain, 2016). It can also be used to represent the context of users and products globally (Tang et al., 2016). Dou (2017) uses a memory network model in document-level sentiment analysis and achieves results comparable to the state-of-the-art model.

Incorporating User and Product Information
Both user profiles and product information have crucial effects on sentiment polarity. Tang et al. (2015b) proposed a model incorporating user and product information into a CNN network for document-level sentiment classification. User ids and product names are included as features in a unified document vector using the vector space model, such that document vectors capture important global clues including individual preferences and product information.
Nevertheless, this method suffers from high model complexity, and only word-level preference is considered rather than information at the semantic level. Gui et al. (2016) introduce an inter-subjectivity network to link users to the terms they use as well as the polarities of those terms. The network aims to learn writer embeddings which are subsequently incorporated into a CNN network for sentiment analysis. The LSTM+UPA model incorporates user and product information into an LSTM with an attention mechanism and is reported to produce the state-of-the-art results on the three benchmark datasets (IMDB, Yelp13, and Yelp14). Dou (2017) also proposes a deep memory network to integrate user profiles and product information in a unified model. However, that model only achieves results comparable to the state-of-the-art attention-based LSTM.

The DUPMN Model
We propose the DUPMN model as follows. First, document representations are learned by a hierarchical LSTM network to obtain both sentence-level and document-level representations (Sundermeyer et al., 2012). Dual memory networks are then trained, one on user profiles and the other on product reviews. Both are joined together to predict the sentiment of documents.

Task Definition
Let D be the set of review documents for classification, U the set of users, and P the set of products. For each document d (d ∈ D), user u (u ∈ U) is the writer of d on product p (p ∈ P). Let U_u(d) be all documents posted by u and P_p(d) be all documents on p. U_u(d) and P_p(d) define the user context and the product context of d, respectively. For simplicity, we write U(d) and P(d) directly. The goal of the sentiment analysis task is to predict the sentiment label of each d.
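As a concrete illustration of the notation above, the user context and product context can be built by grouping documents by writer and by target; the record layout and ids below are hypothetical:

```python
from collections import defaultdict

# Hypothetical review records: (user_id, product_id, document_id, label).
reviews = [
    ("u1", "p1", "d1", 4),
    ("u1", "p2", "d2", 2),
    ("u2", "p1", "d3", 5),
]

# U_u(d): all documents posted by user u.
# P_p(d): all documents on product p.
user_context = defaultdict(list)
product_context = defaultdict(list)
for user, product, doc, _label in reviews:
    user_context[user].append(doc)
    product_context[product].append(doc)

# The user context of d3 is every review written by u2; the product
# context of d3 is every review about p1 (including d3 itself).
print(user_context["u2"])
print(product_context["p1"])
```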

Document Embedding
Since review documents for sentiment classification, such as restaurant reviews and movie comments, are normally very long, a proper method to embed the documents is needed to speed up training and achieve better accuracy. Inspired by the work of Chen et al., a hierarchical LSTM network is used to obtain the embedding representation of documents. The first LSTM layer obtains the sentence representation from the hidden state of an LSTM network. The same mechanism is used for the document-level representation, with the sentence-level representations as input. User and product attentions are included in the network so that all salient features are captured in the document representation. For document d, its embedding is denoted as d, a vector representation of dimension n. In principle, the embedding representations of the user context of d, denoted by Û(d), and of the product context, P̂(d), vary in size depending on d. For easy matrix calculation, we take m as a model parameter so that Û(d) and P̂(d) are two fixed n × m matrices.
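A minimal sketch of this hierarchical composition, with mean pooling standing in for the attention-equipped LSTM layers of the paper; the sizes n and m are illustrative, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8   # document embedding dimension
m = 4   # fixed number of context documents per memory

def embed_document(sentences):
    """Compose word vectors -> sentence vectors -> a document vector.
    Mean pooling stands in for the two LSTM layers used in the paper."""
    sent_vecs = [np.mean(words, axis=0) for words in sentences]  # sentence level
    return np.mean(sent_vecs, axis=0)                            # document level

# A toy document: 2 sentences with 3 and 2 word vectors of size n.
doc = [rng.normal(size=(3, n)), rng.normal(size=(2, n))]
d = embed_document(doc)

# Context memory: m context document vectors stacked as a fixed n x m matrix.
context_docs = [embed_document([rng.normal(size=(2, n))]) for _ in range(m)]
U_hat = np.stack(context_docs, axis=1)  # shape (n, m)
print(d.shape, U_hat.shape)
```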

Memory Network Structure
Inspired by the successful use of memory networks in language modeling, question answering, and sentiment analysis (Sukhbaatar et al., 2015; Tang et al., 2016; Dou, 2017), we propose DUPMN by extending a single memory network model to two memory networks that reflect the different influences of the users' perspective and the products' perspective. The structure of the model is shown in Figure 1 with 3 hops as an example, although in principle a memory network can have K computational hops.
The DUPMN model has two separate memory networks: the UMN and the PMN. Each hop k in a memory network includes an attention layer Attention_k and a linear addition Σ_k. Since the external memories Û(d) and P̂(d) have the same structure, we use a generic notation M̂ to denote them in the following explanation. Each document vector d is fed into the first hop of the two networks (d^0 = d). Each d^(k-1) (k = 1 ... K) passes through the attention layer, defined by a softmax function, to obtain the attention weights p^k for document d:

p^k = softmax((d^(k-1))^T · M̂)    (1)

The attention-weighted vector a^k is then produced and added linearly to form the hop output:

a^k = M̂ · p^k,  d^k = a^k + d^(k-1)    (2)

The output of DUPMN, Output_DUPMN, combines the final states of the two networks through a weighted mechanism:

Output_DUPMN = w_U · W_U d^u_K + w_P · W_P d^p_K    (3)

Two different weight vectors W_U and W_P in Formula 3 are trained for the UMN and the PMN. w_U and w_P are two constant weights reflecting the relative importance of the user profile d^u_K and the product information d^p_K. The parameters of the model include W_U, W_P, w_U, and w_P; they are optimized by minimizing the loss.
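The hop computation and the dual combination (Formula 3) can be sketched in NumPy as follows. This is a plausible reading under standard memory-network conventions, not the authors' code; the shapes, class count, and the sample weights w_U, w_P are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def memory_hops(d, M, K):
    """K hops over an external memory M of shape (n, m):
    softmax attention over the m memory slots, then linear addition."""
    dk = d
    for _ in range(K):
        p = softmax(dk @ M)   # attention weights over memory slots
        a = M @ p             # attention-weighted memory vector a^k
        dk = a + dk           # linear addition Sigma_k
    return dk

rng = np.random.default_rng(1)
n, m, K = 8, 4, 1
d = rng.normal(size=n)
U_hat = rng.normal(size=(n, m))   # user memory
P_hat = rng.normal(size=(n, m))   # product memory

d_u = memory_hops(d, U_hat, K)    # UMN final state
d_p = memory_hops(d, P_hat, K)    # PMN final state

# Dual combination: trainable W_U, W_P plus constant weights w_U, w_P.
C = 5  # number of sentiment classes (illustrative)
W_U, W_P = rng.normal(size=(C, n)), rng.normal(size=(C, n))
w_U, w_P = 0.6, 0.4
output = w_U * (W_U @ d_u) + w_P * (W_P @ d_p)
print(output.shape)
```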
Sentiment prediction is obtained through a Softmax layer. The loss function is the cross entropy between the prediction from Output_DUPMN and the ground-truth labels.
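A minimal sketch of this prediction layer, assuming standard softmax cross-entropy; the logit values below are made up:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def cross_entropy(logits, gold):
    """Negative log-likelihood of the gold class under the softmax prediction."""
    probs = softmax(logits)
    return -np.log(probs[gold])

logits = np.array([0.2, 1.5, -0.3, 0.1, 0.0])  # Output_DUPMN for 5 classes
loss = cross_entropy(logits, gold=1)
print(float(loss))
```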

Experiment and Result Analysis
Performance evaluations are conducted on three datasets, and DUPMN is compared with a set of commonly used baseline methods including the state-of-the-art LSTM-based method (Wu et al., 2018).

Datasets
The three benchmark datasets include movie reviews from IMDB and restaurant reviews from the Yelp data challenge (Yelp13 and Yelp14), developed by Tang et al. (2015a). Since postings in social networks by both users and products follow a long-tail distribution (Kordumova et al., 2016), we only show the distribution of the total number of posts for different products. For example, #p(0-50) denotes the number of products which have between 0 and 50 reviews. We split train/development/test sets at the ratio of 8:1:1, following the same setting as (Tang et al., 2015b). The best configuration on the development set is used for the test set to obtain the final result.
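The 8:1:1 split can be sketched as follows; the document ids and shuffling seed are illustrative:

```python
import random

# Hypothetical list of review document ids; the paper splits 8:1:1.
docs = list(range(1000))
random.Random(42).shuffle(docs)

n = len(docs)
train = docs[: int(0.8 * n)]          # 80% training
dev = docs[int(0.8 * n): int(0.9 * n)]  # 10% development (model selection)
test = docs[int(0.9 * n):]            # 10% held-out test
print(len(train), len(dev), len(test))
```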

Baseline Methods
In order to make a systematic comparison, three groups of baselines are used in the evaluation. Group 1 includes all the commonly used feature sets reported in previous work; the majority use an SVM classifier. Group 2 includes recently published sentiment analysis models which only use context information:
• SSWE (Tang et al., 2014) - An SVM model using sentiment-specific word embeddings.
• CLSTM - A cached LSTM model to capture overall semantic information in long text.
• LSTM+LA - A state-of-the-art LSTM using local context as the attention mechanism at both the sentence level and the document level.
• LSTM+CBA (Long et al., 2017b) - A state-of-the-art LSTM model using cognition-grounded data to build the attention mechanism.
Group 3 methods are recently published neural network models which incorporate user and product information, including: • UPNN (Tang et al., 2015b) -User and product information for sentiment classification at document level based on a CNN network.
• UPDMN (Dou, 2017) -A deep memory network for document level sentiment classification by including user and product information in a unified model. Hop 1 gives the best result, and thus K=1 is used.
• InterSub (Gui et al., 2016) - A CNN model making use of user and product information.
For the DUPMN model, we also include two variations which use only one memory network. The first variation only includes user profiles in the memory network, denoted as DUPMN-U. The second variation only uses product information, denoted as DUPMN-P. Table 2 shows the results of the first experiment, comparing DUPMN to the other state-of-the-art models. DUPMN uses one hop (the best performer) with m set to 100, a commonly used memory size for memory networks.

Performance Evaluation
Generally speaking, Group 2 performs better than Group 1. This is because Group 1 uses a traditional SVM with feature engineering (Chang and Lin, 2011), whereas Group 2 uses more advanced deep learning methods proven effective by recent studies (Kim, 2014). However, some feature engineering methods are no worse than some deep learning methods; for example, the TextFeature model outperforms SSWE by a significant margin. Comparing Group 2 and Group 3 methods, we can see that user profiles and product information can improve performance, as most of the methods in Group 3 perform better than the methods in Group 2. This is most obvious in the IMDB dataset, which naturally contains more subjectivity: almost all models with user and product information outperform the text-only models in Group 2, with LSTM+CBA (Long et al., 2017b) as the exception. The two LSTM models in Group 2 which include a local attention mechanism do show that attention-based methods can outperform some methods using user profile and product information. In fact, the LSTM+CBA model in Group 2, whose attention mechanism is based on cognition-grounded eye-tracking data, outperforms quite a number of methods in Group 3, and is only inferior to LSTM+UPA in Group 3 because of the additional user profile and product information used in LSTM+UPA.
Most importantly, the DUPMN model with both user memory and product memory significantly outperforms all the baseline methods, including the state-of-the-art LSTM+UPA model. By using user profiles and product information in memory networks, DUPMN outperforms LSTM+UPA on all three datasets. In the IMDB dataset, our model achieves a 0.6% improvement over LSTM+UPA in accuracy with a p-value of 0.007, and also achieves a lower RMSE. In the Yelp review datasets, the improvement is even more significant: DUPMN achieves a 1.2% improvement in accuracy in Yelp13 with a p-value of 0.004 and 0.9% in Yelp14 with a p-value of 0.001. The lower RMSE obtained by DUPMN also indicates that the proposed model predicts review ratings more accurately.

Effects of different hop sizes
The second set of experiments evaluates the effectiveness of DUPMN using different numbers of hops K. Table 3 shows the evaluation results; the number in brackets after each model name indicates the number of hops used. A clear conclusion can be drawn from Table 3: more hops do not bring benefit. In all three models, the single-hop model obtains the best performance. Unlike video and image information, written text is grammatically structured and contains abstract information, so multiple hops may introduce more information distortion. Another reason may be over-fitting caused by the additional hops.

Effects of DUPMN-U and DUPMN-P
Comparing the performance of DUPMN-U and DUPMN-P in Table 3 shows that user memory and product memory indeed provide different kinds of information, and their usefulness differs across datasets. For the movie review dataset, IMDB, which is more subjective, user profile information helps more: DUPMN-U achieves a 1.3% gain over DUPMN-P. However, on the restaurant reviews in the Yelp datasets, DUPMN-P performs better than DUPMN-U, indicating that product information is more valuable there.
To further examine the effects of the UMN and the PMN on sentiment classification, we examine the optimized values of the constant weights w_U and w_P of the UMN and the PMN given in Formula 3. The difference in their values indicates the relative importance of the two networks. The optimized weights given in Table 4 for the three datasets show that the user profile has a higher weight than product information in IMDB, because movie reviews are more related to personal preferences, whereas product information carries a higher weight in the Yelp datasets, consistent with the results in Table 3 on DUPMN-U and DUPMN-P.

Effects of the memory size
Most social network data follows a long-tail distribution. If the memory size used to represent the data is too small, some context information will be lost. On the other hand, a memory size that is too large requires more computation and storage without introducing much benefit. Thus, the fourth set of experiments evaluates the effect of the dimension size m of the DUPMN memory networks. Figure 2 shows the results for the 1-hop configuration with the memory size starting at 1, increasing in steps of 10 up to 75 and in steps of 25 from 75 to 200, to cover most postings. Results show that when the memory size increases from 10 to 100, the performance of DUPMN steadily increases. Beyond 100, DUPMN is no longer sensitive to the memory size. This is related to the distribution of document frequency per user/product in Table 1, whose average is around 50. With a long-tail distribution, beyond 75 not many new documents will be included in the context. To improve efficiency without much compromise on performance, m can be set to about double the average, so values between 100 and 200 are quite sufficient for our algorithm.
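One plausible way to realize the fixed memory size m is to truncate contexts longer than m and zero-pad shorter ones; the paper does not spell out its exact padding scheme, so the sketch below is an assumption:

```python
import numpy as np

def fix_context(doc_vecs, m):
    """Truncate or zero-pad a variable-length context (n x k matrix of
    context document vectors) to a fixed n x m memory matrix."""
    n = doc_vecs.shape[0]
    out = np.zeros((n, m))
    k = min(m, doc_vecs.shape[1])
    out[:, :k] = doc_vecs[:, :k]
    return out

rng = np.random.default_rng(2)
long_ctx = rng.normal(size=(8, 130))   # a prolific user: 130 reviews
short_ctx = rng.normal(size=(8, 12))   # a sparse user: 12 reviews

print(fix_context(long_ctx, 100).shape)
print(fix_context(short_ctx, 100).shape)
```

With m = 100 (about double the average of ~50 documents per user/product), few contexts are truncated while most of the memory is used.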

Case Analysis
The review text below is for a sci-fi movie which has the gold label 10 (most positive). However, if it is read as an isolated piece of text, identifying its sentiment is difficult. The LSTM+LA model gives it a rating of 1 (most negative), perhaps because, on the surface, there are many negative words like unacceptable, criticize, and sucks even though the reviewer is praising the movie. Since our user memory can learn that the reviewer is a fan of sci-fi movies, our DUPMN model gives the correct rating of 10.

Conclusion and Future Work
We propose a novel dual memory network model for sentiment prediction. We argue that user profiles and product information are fundamentally different: user profiles reflect more on subjectivity, whereas product information reflects salient features of products at an aggregated level. Based on this hypothesis, two separate memory networks for user context and product context are built at the document level through a hierarchical learning model. The inclusion of an attention layer further captures semantic information more effectively. Evaluation on three benchmark review datasets shows that the proposed DUPMN model outperforms the current state-of-the-art systems, with significant improvements shown by p-values of 0.007, 0.004, and 0.001, respectively. We also show that the single-hop memory network is the most effective model. Evaluation results show that user profiles and product information are indeed different and have different effects on different datasets. In more subjective datasets such as IMDB, the inclusion of user profile information is more important, whereas in more objective datasets such as the Yelp data, collective information about restaurants plays a more important role in classification. Future work includes two directions. One is to explore the contribution of user profiles and product information in aspect-level sentiment analysis tasks. The other is to explore how knowledge-based information can be incorporated to further improve sentiment classification.