Opinion Recommendation Using A Neural Model

We present opinion recommendation, a novel task of jointly generating a review and a rating score that a certain user would give to a certain product that the user has not reviewed, given existing reviews of the product by other users and the reviews that the user has given to other products. A characteristic of opinion recommendation is its reliance on multiple data sources for multi-task joint learning. We use a single neural network to model users and products, generating customised product representations using a deep memory network, from which customised ratings and reviews are constructed jointly. Results show that our opinion recommendation system gives ratings that are closer to real user ratings on Yelp.com data than Yelp's own ratings. Our method also gives better results than several pipelined baselines.


Introduction
Offering a channel for customers to share opinions and give scores to products and services, review websites have become a highly influential information source that customers refer to when making purchase decisions. Popular examples include IMDB.com in the movie domain, Epinions.com in the product domain, and Yelp.com in the service domain. Figure 1 shows a screenshot of a restaurant review page on Yelp.com, which offers two main types of information. First, an overall rating score is given under the restaurant name; second, detailed user reviews are listed below the rating. Though offering a useful overview and details about a product or service, such information has several limitations for a user who has not used the product or service. First, the overall rating is general and not necessarily agreeable to the taste of an individual customer. Being a simple reflection of all customer scores, it serves an average customer well, but can be rather inaccurate for individuals. For example, the authors themselves often find highly rated movies tedious. Second, there can be hundreds of reviews for a product or service, which makes exhaustive reading infeasible. It would be useful to have a brief summary of all reviews, which ideally should be customized to the reader.
* This work was done while the first author worked at SUTD.
To address the limitations above, we propose a new task called opinion recommendation, which is to generate a customized rating score that the user is likely to give to the target product, as well as a customized review that the user would have written for the target product had the user reviewed it. The proposed opinion recommendation task is closely related to several existing lines of work in NLP. The first is sentiment analysis (Hu and Liu, 2004; Pang and Lee, 2008) and opinion summarization (Nishikawa et al., 2010; Wang and Ling, 2016), which is to give a rating score or summary based on existing customer reviews. Our task is different in that we aim to generate the rating score and review of a product that the user has not reviewed. The second is recommendation (Su and Khoshgoftaar, 2009; Yang et al., 2014), which is to give a ranking score to a certain product or service based on the purchase history of the user and other customers who have purchased the target product. Our task is different in the source of input, which is textual customer reviews and ratings rather than numerical purchase history.
There are two types of inputs for our task, namely existing reviews of the target product and the reviews of the user on other products, and two types of outputs, namely a customized rating score and a customized review. The ideal solution should consider the interaction between all given types of information, jointly predicting the two types of outputs. This poses significant challenges to statistical models, which require manually defined features to capture relevant patterns from training data. Deep learning is a more feasible choice, since it allows information fusion through fully connected hidden layers (Collobert et al., 2011; Henderson et al., 2013; Zhang and Weiss, 2016; Chen et al., 2016a). We leverage this advantage in building our model.
In particular, we use a sub RNN to model the semantic content of each review. A sub product model is used to consolidate existing reviews for the target product, and a user model is built by consolidating the reviews of the given user into a single vector form. To address potential sparsity of a user's review history, neighbor users are identified by collaborative filtering (Ding et al., 2006), and a vector representation is learned by using a neural neighborhood model. Finally, a deep memory network is utilized to find the association between the user and target product, jointly yielding the rating score and customised review. Experiments on a Yelp dataset show that the model outperforms several pipelined baselines. We make our source code publicly available under GPL at https://github.com/wangzq870305/opinion_recommend.

Related Work
Sentiment Analysis. Our task is related to document-level sentiment classification (Pang and Lee, 2008), for which various neural network models have been used, including convolutional neural networks (Kim, 2014), recursive neural networks (Socher et al., 2013) and recurrent neural networks (Teng et al., 2016; Tai et al., 2015). Review rating prediction aims to predict the numeric rating of a given review. Pang and Lee (2005) pioneered this task by regarding it as a classification/regression problem. Most subsequent work focuses on designing effective textual features of reviews (Qu et al., 2010; Li et al., 2011; Wan, 2013).
User information has been widely investigated in sentiment analysis. Gao et al. (2013) developed user-specific features to capture user leniency, and Li et al. (2014) incorporated textual topic and user-word factors through topic modeling. For integrating user information into neural network models, Tang et al. (2015) predicted the rating score given a review by using both lexical semantic information and a user embedding model. Chen et al. (2016b) proposed a neural network to incorporate global user and product information for sentiment classification via an attention mechanism. Different from the above research, which focuses on predicting the opinion expressed in existing reviews, our task is to recommend the score that a user would give to a new product without knowing the review text. The difference originates from the objective: previous research aims to predict opinions on reviewed products, while our task is to recommend opinions on new products, which the user has not reviewed.
Opinion Summarization. Our work also overlaps with the area of opinion summarization, which constructs natural language summaries for multiple product reviews (Hu and Liu, 2004). Most previous work extracts opinion words and aspect terms. Typical approaches include association mining of frequent candidate aspects (Hu and Liu, 2004; Qiu et al., 2011), sequence labeling based methods (Jakob and Gurevych, 2010; Yang and Cardie, 2013), as well as topic modeling techniques (Lin and He, 2009). Recently, word embeddings and recurrent neural networks have also been used to extract aspect terms (Irsoy and Cardie, 2014). While all the methods above are extractive, Ganesan et al. (2010) presented a graph-based summarization framework to generate concise abstractive summaries of highly redundant opinions, and Wang and Ling (2016) used an attention-based neural network model to absorb information from multiple text units and generate summaries of movie reviews. We also perform abstractive summarization. However, different from the above research, which summarizes existing reviews, we generate customized reviews for an unreviewed product.
Recommendation. Recommendation has mainly been addressed using purchase history. There are two main approaches, which are content-based and collaborative-filtering (CF) based (Adomavicius and Tuzhilin, 2005; Yang et al., 2014), respectively. Most existing social recommendation systems are CF-based, and can be further grouped into model-based CF and neighborhood-based CF (Kantor et al., 2011; Su and Khoshgoftaar, 2009). Matrix Factorization (MF) is one of the most popular models for CF. In recent MF-based social recommendation work, user-user social trust information is integrated with user-item feedback history (e.g., ratings, clicks, purchases) to improve the accuracy of traditional recommendation systems, which only factorize user-item feedback data (Ding et al., 2006; Koren, 2008). There has been work integrating sentiment analysis and recommendation systems, which uses recommendation strategies such as matrix factorization to improve the performance of sentiment analysis (Leung et al., 2006; Singh et al., 2011). These methods typically use ensemble learning (Singh et al., 2011) or probabilistic graph models (Wu and Ester, 2015). For example, a factor graph model has been proposed to recommend opinion rating scores by using explicit product features as hidden variables. Different from the above research, we recommend user opinions.
Neural Network Models. Multi-task learning has been recognised as a strength of neural network models for natural language processing (Collobert et al., 2011; Henderson et al., 2013; Zhang and Weiss, 2016; Chen et al., 2016a), where hidden feature layers are shared between different tasks that have a common basis. Our work can be regarded as an instance of such multi-task learning via shared parameters, which has been widely used in the research community recently.
Dynamic memory network models have been applied to NLP tasks such as question answering (Sukhbaatar et al., 2015; Kumar et al., 2016), language modeling (Tran et al., 2016) and machine translation. They are typically used to find abstract semantic representations of texts towards certain tasks, which is consistent with our main need, namely abstracting the representation of a product that is biased towards the taste of a certain user. We use a variation of the memory network model for obtaining user-specific review representations.

Model
Formally, the input to our model is a tuple $\langle R_T, R_U, R_N \rangle$, where $R_T = \{r_{T_1}, r_{T_2}, \ldots, r_{T_{n_t}}\}$ is the set of existing reviews of a target product, $R_U = \{r_{U_1}, r_{U_2}, \ldots, r_{U_{n_u}}\}$ is the set of the user's history reviews, and $R_N = \{r_{N_1}, r_{N_2}, \ldots, r_{N_{n_n}}\}$ is the set of the user's neighborhood reviews. All the reviews are sorted in temporal order. The output is a pair $\langle Y_S, Y_R \rangle$, where $Y_S$ is a real number between 0 and 5 representing the customized rating score of the target product, and $Y_R$ is a customised review. A characteristic of our model is that $Y_S$ and $Y_R$ are generated for a product that the user has not reviewed.
To capture both general and personalized information, we first build a product model, a user model, and a neighborhood model, and then use a memory network model to integrate these three types of information, constructing a customized product model. Finally, we predict a customized rating score and a review collectively using a neural stacking framework. The overall architecture of the model is shown in Figure 2.

Review Model
A review is the foundation of our model, based on which we derive representations of both a user and a target product. In particular, a user profile can be obtained by modeling all the reviews $R_U$ of the user, and a target product profile can be obtained by using all existing reviews $R_T$ of the product. We use the average of word embeddings to model a review. Formally, given a review $r = \{x_1, x_2, \ldots, x_m\}$, where $m$ is the length of the review, each word $x_k$ is represented with a $K$-dimensional embedding $e^w_k$ (Mikolov et al., 2013). We use $e_d(r) = \sum_k e^w_k / m$ as the representation of the review.
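As an illustration, the averaging step can be sketched in a few lines of Python. This is a minimal sketch rather than the released implementation: the names `review_embedding` and `toy_table` are hypothetical, the embeddings are toy values, and mapping out-of-vocabulary words to a zero vector is an assumption that the text does not specify.

```python
from typing import Dict, List


def review_embedding(words: List[str],
                     table: Dict[str, List[float]],
                     dim: int) -> List[float]:
    """Average the word embeddings of one review, i.e. e_d(r) = sum_k e_k / m."""
    total = [0.0] * dim
    for word in words:
        # Assumption: words missing from the table contribute a zero vector.
        vector = table.get(word, [0.0] * dim)
        for k in range(dim):
            total[k] += vector[k]
    m = len(words)
    return [value / m for value in total] if m else total


# Toy 2-dimensional embedding table (illustrative values only).
toy_table = {"good": [1.0, 0.0], "food": [0.0, 1.0]}
print(review_embedding(["good", "food"], toy_table, dim=2))
```

Averaging discards word order inside a review; in the model, order is only tracked across reviews by the LSTMs that consume these vectors.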

User Model
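A standard LSTM (Hochreiter and Schmidhuber, 1997) is used to learn the hidden states of a user's reviews to build the user model. Denoting the recurrent function at step $t$ as $\mathrm{LSTM}(x_t, h_{t-1})$, we obtain a sequence of hidden state vectors $h_U = \{h_{U_1}, h_{U_2}, \ldots, h_{U_{n_u}}\}$ by recurrently feeding the review representations $\{e_d(r_{U_1}), e_d(r_{U_2}), \ldots, e_d(r_{U_{n_u}})\}$ as inputs. The initial state and all standard LSTM parameters are randomly initialized and tuned during training.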
Not all reviews contribute equally to the representation of a user. We use the attention mechanism (Bahdanau et al., 2014; Yang et al., 2016) to extract the reviews that are relatively more important, aggregating the representations of reviews to form a vector. Taking the hidden states $h_U$ as input, the attention model outputs a continuous vector $v_U = \sum_{i=1}^{n_u} \alpha_i h_{U_i}$, where $n_u$ is the number of hidden states, $\alpha_i \in [0, 1]$ is the weight of $h_{U_i}$, and $\sum_i \alpha_i = 1$.
For each hidden state $h_{U_i}$, the weight $\alpha_i$ is computed by a scoring function whose model parameters are $W_U$ and $b_U$. The attention vector $v_U$ is used to represent the user in the user model.
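The attention pooling described above can be sketched as follows. The exact scoring function is not reproduced in this text, so the sketch assumes one common form: a scalar score $\tanh(w \cdot h_i + b)$ followed by a softmax. The function name `attention_pool` and the vector-valued `w` are hypothetical choices for illustration, not the paper's definition.

```python
import math
from typing import List, Tuple


def attention_pool(states: List[List[float]],
                   w: List[float],
                   b: float) -> Tuple[List[float], List[float]]:
    """Return (pooled vector, attention weights) for a list of hidden states.

    Assumed scoring form: score_i = tanh(w . h_i + b), alpha = softmax(score).
    """
    scores = [math.tanh(sum(wk * hk for wk, hk in zip(w, h)) + b)
              for h in states]
    peak = max(scores)                      # subtract the max for stability
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    alphas = [e / norm for e in exps]       # non-negative, sums to 1
    dim = len(states[0])
    pooled = [sum(a * h[k] for a, h in zip(alphas, states))
              for k in range(dim)]
    return pooled, alphas


v_U, alphas = attention_pool([[1.0, 0.0], [0.0, 1.0]], w=[0.5, -0.5], b=0.0)
print(v_U, alphas)
```

Whatever scoring form is used, the properties stated in the text hold by construction: each weight lies in [0, 1] and the weights sum to one, so the pooled vector is a convex combination of the hidden states.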

Neighborhood Model
We use neighborhood reviews to improve the user model, since a user may not have sufficient reviews to construct a reliable model. Here a neighbor refers to a user that has similar tastes to the target user (Koren, 2008; Desrosiers and Karypis, 2011). As in the user model, we construct the neighborhood model $v_N$ from the neighborhood reviews $R_N = \{r_{N_1}, r_{N_2}, \ldots, r_{N_{n_n}}\}$ with an attentional recurrent network.
A key issue in building the neighborhood model is how to find neighbors of a certain user. In this study, we use matrix factorization (Koren, 2008) to detect neighbors, which is a standard approach for recommendation (Ding et al., 2006; Li et al., 2009). In particular, users' rating scores of products are used to build a product-user matrix $M \in \mathbb{R}^{n_t \times n_u}$ with $n_t$ products and $n_u$ users. We approximate it using three factors, specifying soft membership of products and users (Ding et al., 2006), by finding $F$, $S$ and $T$ such that $M \approx F S T$, where $F \in \mathbb{R}^{n_t \times K}$ represents the posterior probability of $K$ topic clusters for each product; $S \in \mathbb{R}^{K \times K}$ encodes the distribution of each topic $k$; and $T \in \mathbb{R}^{K \times n_u}$ indicates the posterior probability of $K$ topic clusters for each user.
As a result of matrix factorization, we directly obtain the probability of each user on each topic from the user-topic matrix $T$. To infer $T$, the optimization problem in Eq.4 can be solved using an iterative updating rule. With the user-topic matrix $T$, we measure the implicit connection between two users with a similarity function $sim(i, j)$, which measures the implicit connection degree between users $i$ and $j$. If $sim(i, j)$ is higher than a threshold $\eta$, we consider user $j$ a neighbor of user $i$.
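The thresholding step can be illustrated with a short sketch. The similarity formula itself is not reproduced in this text, so the code below substitutes cosine similarity between the topic columns of two users; this is an assumption made for illustration, as are the names `cosine` and `find_neighbors` and the toy matrix. Only the threshold rule (keep user j when the similarity exceeds η) follows the description above.

```python
import math
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_neighbors(T: List[List[float]], i: int, eta: float = 0.25) -> List[int]:
    """Return the users j != i whose topic profile is similar to that of user i.

    T is the K x n_u user-topic matrix, so column u holds the topic
    memberships of user u. Cosine similarity is an assumed stand-in for sim.
    """
    n_users = len(T[0])

    def column(u: int) -> List[float]:
        return [row[u] for row in T]

    target = column(i)
    return [j for j in range(n_users)
            if j != i and cosine(target, column(j)) > eta]


# Toy matrix with K = 2 topics and 3 users (illustrative values only).
T = [[0.9, 0.8, 0.0],
     [0.1, 0.2, 1.0]]
print(find_neighbors(T, 0))
```

With a fixed threshold, the number of neighbors varies per user, so users with unusual tastes may end up with few or no neighbors.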

Product Model
Given the representations of existing reviews $\{e_d(r_{T_1}), e_d(r_{T_2}), \ldots, e_d(r_{T_{n_t}})\}$ of the product, we use an LSTM to model their temporal order, obtaining a sequence of hidden vectors $h_T = \{h_{T_1}, h_{T_2}, \ldots, h_{T_{n_t}}\}$ by recurrently feeding $\{e_d(r_{T_1}), e_d(r_{T_2}), \ldots, e_d(r_{T_{n_t}})\}$ as inputs. The hidden state vectors $h_T$ are used to represent the product.
Customized Product Model. The product model represents salient information of existing reviews in their temporal order, yet does not reflect the taste of a particular user. We build the customised product model to integrate user information and product information (as reflected by the product model), resulting in a single vector that represents a customised product. From this vector we are able to synthesize both a customised review and a customised rating score. In particular, we use the user representation $v_U$ and the neighbour representation $v_N$ to transform the target product representation $h_T = \{h_{T_1}, h_{T_2}, \ldots, h_{T_{n_t}}\}$ into a customised product representation $v_C$, which is tailored to the taste of the user.
A naive model for yielding $v_C$ could utilise the attention mechanism over $h_T$, deriving a weighted sum according to user information. However, dynamic memory networks have been shown to be highly useful for deriving abstract semantic information compared with simple attention, and hence we follow Sukhbaatar et al. (2015) and Xiong et al. (2016), building a variation of DMN to iteratively find increasingly abstract representations of $h_T$ by injecting $v_U$ and $v_N$ information.
The memory model consists of multiple dynamic computational layers (hops), each of which contains an attention layer and a linear layer. In the first computational layer (hop 1), we take the hidden variables $h_{T_i}$ ($0 \le i \le n_t$) of the product model as input, adaptively selecting important evidence through one attention layer using $v_U$ and $v_N$. The output of the attention layer gives a linear interpolation of $h_T$, and the result is considered as input to the next layer (hop 2). In the same way, we stack multiple hops and run the steps multiple times, so that more abstract representations of the target product can be derived.

The attention model outputs a continuous vector $v_C = \sum_{i=1}^{n_t} \beta_i h_{T_i}$, where $n_t$ is the number of hidden states, $\beta_i \in [0, 1]$ is the weight of $h_{T_i}$, and $\sum_i \beta_i = 1$. For each hidden state $h_{T_i}$, we use a feed-forward neural network to compute its semantic relatedness with the abstract representation $v_C$. At hop $t$, the scoring function is computed from $h_{T_i}$, the representation $v_C^{t-1}$ of the previous hop, and the user and neighbor vectors $v_U$ and $v_N$, and the scores are normalized to give the weights $\beta_i$. The vector $v_C$ is used to represent the customized product model. At the first hop, we define $v_C^0 = \sum_i h_{T_i} / n_t$.
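The hop structure can be sketched as a loop that repeatedly re-attends over the product states. The sketch below is a simplification under stated assumptions: the feed-forward scoring network is replaced by a dot product between each state and the sum of the previous hop's vector, the user vector and the neighbor vector, and the per-hop linear layer is omitted. The name `memory_hops` is hypothetical. Only the initialisation with the mean of the states and the normalised weights follow the text directly.

```python
import math
from typing import List, Tuple


def memory_hops(h_T: List[List[float]],
                v_U: List[float],
                v_N: List[float],
                hops: int = 3) -> Tuple[List[float], List[float]]:
    """Return (v_C, betas) after a number of attention hops over h_T."""
    n, dim = len(h_T), len(h_T[0])
    # First-hop initialisation: mean of the product hidden states.
    v_C = [sum(h[k] for h in h_T) / n for k in range(dim)]
    betas = [1.0 / n] * n
    for _ in range(hops):
        # Assumed query: previous v_C plus user and neighbor information.
        query = [v_C[k] + v_U[k] + v_N[k] for k in range(dim)]
        scores = [math.tanh(sum(hk * qk for hk, qk in zip(h, query)))
                  for h in h_T]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        norm = sum(exps)
        betas = [e / norm for e in exps]
        # Weighted sum of product states becomes the input to the next hop.
        v_C = [sum(b * h[k] for b, h in zip(betas, h_T)) for k in range(dim)]
    return v_C, betas


v_C, betas = memory_hops([[1.0, 0.0], [0.0, 1.0]], v_U=[1.0, 0.0], v_N=[1.0, 0.0])
print(v_C, betas)
```

With `hops=0` the function returns the plain average of the product states, which corresponds to using no user or neighbor information at all.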

Customized Review Generation
The goal of customized review generation is to generate a review $Y_R$, composed of a sequence of words $y_{R_1}, \ldots, y_{R_{n_r}}$, from the customized product representation $v_C$. We use a standard LSTM decoder (Rush et al., 2015) to decompose the prediction of $Y_R$ into a sequence of word-level predictions: $\log P(Y_R \mid v_C) = \sum_j \log P(y_{R_j} \mid y_{R_1}, \ldots, y_{R_{j-1}}, v_C)$, where each word $y_{R_j}$ is predicted conditional on the previously generated words $y_{R_1}, \ldots, y_{R_{j-1}}$ and the customized product vector $v_C$. The probability is estimated by applying a standard word softmax to $h_{R_j}$, the hidden state at time step $j$, which is modeled as $\mathrm{LSTM}(u_{j-1}, h_{R_{j-1}})$.
Here an LSTM is used to generate a new state $h_{R_j}$ from the previous state $h_{R_{j-1}}$ and $u_{j-1}$, where $u_{j-1}$ is the concatenation of the previously generated word $y_{R_{j-1}}$ and the customized product representation $v_C$.

Customized Rating Prediction
A straightforward approach to predicting the rating score of a product is to take the average of existing review scores. However, the drawback is that it cannot reflect the variance in user tastes. In order to integrate user preferences into the rating, we instead take a user-based weighted average of existing rating scores, so that the scores of reviews that are closer to the user preference are given higher weights. However, existing ratings can all be different from a user's personal rating if the existing reviews do not come from the user's neighbours. We thus use the customized product vector $v_C$ as a bias on the weighted average of existing rating scores. Formally, given the rating scores $s_1, s_2, \ldots, s_n$ of existing reviews and the customized product representation $v_C$, we calculate $Y_S = \sum_{i=1}^{n} \beta_i \cdot s_i + \mu \tanh(W_S v_C + b_S)$. In the left term $\sum_{i=1}^{n} \beta_i \cdot s_i$, we use the attention weights $\beta_i$ in Eq.9 to measure the importance of each rating score $s_i$. The right term $\tanh(W_S v_C + b_S)$ is a review-based shift, weighted by $\mu$.
Since the result of customized review generation can be helpful for rating score prediction, we use neural stacking, additionally feeding the last hidden state $h_{R_n}$ of the review generation model as input for $Y_S$ prediction, resulting in $Y_S = \sum_{i=1}^{n} \beta_i \cdot s_i + \mu \tanh(W_S (v_C \oplus h_{R_n}) + b_S)$, where $\oplus$ denotes vector concatenation.
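The rating rule described above, an attention-weighted average of existing scores plus a tanh shift scaled by µ, can be sketched as follows. The function name `customized_rating` and all numeric values are hypothetical, and the weight $W_S$ is treated as a plain vector so that the shift is a scalar, which is an assumption about its shape.

```python
import math
from typing import List


def customized_rating(betas: List[float],
                      scores: List[float],
                      features: List[float],
                      w_S: List[float],
                      b_S: float,
                      mu: float = 1.0) -> float:
    """Weighted average of existing scores plus a review-based shift."""
    weighted_average = sum(b * s for b, s in zip(betas, scores))
    shift = math.tanh(sum(w * f for w, f in zip(w_S, features)) + b_S)
    return weighted_average + mu * shift


betas = [0.5, 0.5]            # attention weights over existing reviews
scores = [4.0, 2.0]           # rating scores of those reviews
v_C = [0.5]                   # toy customized product vector
h_Rn = [0.2]                  # toy last decoder state

# Without stacking: the shift depends on v_C only.
print(customized_rating(betas, scores, v_C, w_S=[1.0], b_S=0.0))
# With neural stacking: v_C is concatenated with the last decoder state.
print(customized_rating(betas, scores, v_C + h_Rn, w_S=[1.0, 1.0], b_S=0.0))
```

Because tanh is bounded in (-1, 1), the shift moves the weighted average by at most µ in either direction, and setting µ to 0 recovers the pure weighted average.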

Training
For our task, there are two joint training objectives, for review scoring and review summarisation, respectively. For review scoring, the loss function is defined over the error between $Y^*_{S_i}$, the predicted rating score, and $Y_{S_i}$, the rating score in the training data, together with an L2 regularization term over $\Theta$, the set of model parameters, weighted by the parameter $\lambda$. For customized review generation, the objective is to maximize the log probability in Eq.10 (Sutskever et al., 2014; Rush et al., 2015). The two loss functions for score and review prediction share the representation vectors under $v_C$, hence forming multi-task learning.
Standard back propagation is performed to optimize parameters, where gradients also propagate from the scoring objective to the review generation objective due to neural stacking (Eq.13). We apply online training, where model parameters are optimized by using AdaGrad (Duchi et al., 2011). Word embeddings are trained using the Skip-gram algorithm (Mikolov et al., 2013).
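A rating loss of the kind described above can be sketched as follows. The exact functional form is not reproduced in this text, so the sketch assumes a summed squared error and a regulariser of the form (λ/2)·‖Θ‖²; both are assumptions, and `rating_loss` is a hypothetical name.

```python
from typing import List


def rating_loss(predicted: List[float],
                gold: List[float],
                params: List[float],
                lam: float = 1e-8) -> float:
    """Assumed form: sum of squared errors plus (lam / 2) * ||params||^2."""
    squared_error = sum((p - g) ** 2 for p, g in zip(predicted, gold))
    l2_penalty = sum(theta * theta for theta in params)
    return squared_error + 0.5 * lam * l2_penalty


print(rating_loss([4.0, 2.0], [3.0, 2.0], params=[1.0, 2.0]))
```

With a regularization weight as small as the one reported in the hyper-parameter section, the penalty term is negligible next to the error term unless the parameters grow very large.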

Experimental Settings
Our data are collected from the Yelp academic dataset, provided by Yelp.com, a popular restaurant review website. The data set contains three types of objects: business, user, and review, where business objects contain basic information about local businesses (i.e. restaurants), review objects contain review texts and star ratings, and user objects contain aggregate information about a single user across all of Yelp. Table 1 illustrates the general statistics of the dataset. For evaluating our model, we choose 4,755 user-product pairs from the dataset. The user-product pairs are extracted according to the following criteria: for each selected user-product pair, the user should have written at least 10 reviews, and the product should have at least 100 reviews. In addition, the gold-standard review that the user wrote for the corresponding product should have at least 10 helpful hits. We did not try alternative data selection rules.
For each pair, the existing reviews of the target service (restaurant) are used for the product model. The rating score given by each user to the target service is considered as the gold customized rating score, and the review of the target service given by each user is used as the gold-standard customized review for the user. The remaining reviews of each user are used for training the user model. We use 3,000 user-product pairs to train the model, 1,000 pairs as testing data, and the remaining pairs for development.
We use the ROUGE-1.5.5 (Lin, 2004) toolkit for evaluating the performance of customized review generation, and report unigram overlap (ROUGE-1) as a means of assessing informativeness. Mean Square Error (MSE) (Wan, 2013;Tang et al., 2015) is used as the evaluation metric for measuring the performance of customized rating score prediction. MSE penalizes more severe errors more heavily.
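To make the two metrics concrete, the sketch below computes MSE and a simplified unigram-overlap recall. The recall function is only a rough stand-in for ROUGE-1: it uses lower-cased whitespace tokens and none of the options of the ROUGE-1.5.5 toolkit, so its numbers are not comparable to the reported ones. The names `mse` and `unigram_recall` are hypothetical.

```python
from collections import Counter
from typing import List


def mse(predicted: List[float], gold: List[float]) -> float:
    """Mean squared error between predicted and gold rating scores."""
    return sum((p - g) ** 2 for p, g in zip(predicted, gold)) / len(gold)


def unigram_recall(candidate: str, reference: str) -> float:
    """Fraction of reference unigrams that also occur in the candidate."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[word], count) for word, count in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0


print(mse([4.0, 2.0], [5.0, 2.0]))
print(unigram_recall("cheap and clean", "cheap clean food"))
```

The squaring is what makes larger errors count more: an error of 2 stars contributes four times as much as an error of 1 star.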

Hyper-parameters
There are several important hyper-parameters in our models, and we tune their values using the development dataset. We set the regularization weight $\lambda = 10^{-8}$ and the initial learning rate to 0.01. We set the size of word vectors to 128 and the size of hidden vectors in the LSTM to 128. In order to avoid over-fitting, dropout (Hinton et al., 2012) is used for word embeddings with a ratio of 0.2. The neighbor similarity threshold $\eta$ is set to 0.25.

Ablation Test
The effects of various configurations of our model are shown in Table 2, where Joint is the full model of this paper, -user ablates the user model, -neighbor ablates the neighbor model, -rating is a single-task model that generates a review without the rating score, and -generation yields only the rating.
By comparing "Joint" and "-user,-neighbor", we can find that customized information has a significant influence on both the rating and review generation results (p-value < 0.01 using t-test). In addition, comparison between "Joint" and "-user", and between "-user" and "-user,-neighbor", shows that both the user information and the neighbour information of the user are effective for improving the results. A user's neighbours can indeed alleviate the scarcity of user reviews.
Finally, comparison between "Joint" and "-generation", and between "Joint" and "-rating", shows that multi-task learning by parameter sharing is highly useful.
Figure 3: Influence of hops.

Influence of Hops
We show the influence of the number of hops of the memory network on customized review generation in Figure 3. When hop = 0, the model considers only the general product reviews (-user, -neighbor). When hop ≥ 1, customized product information is leveraged. From the figure we can see that the performance is best when hop = 3. This indicates that multiple hops can capture more abstract evidence from external memory to improve the performance. However, too many hops lead to over-fitting, thereby harming the performance. As a result, we choose 3 as the number of hops in our final test.

Influence of µ
We show the influence of the bias weight parameter µ on rating prediction in Figure 4. With µ being 0, the model uses the weighted sum of existing review scores to score the product. When µ is very large, the system tends to use only the customized product representation $v_C$ to score the product, hence ignoring existing review scores, which are a useful source of information. Our results show that the performance is optimal when µ is 1, indicating that existing review scores and review contents are equally useful.

Final Results
Figure 4: Influence of bias score.
We show the final results for opinion recommendation, comparing our proposed model with the following state-of-the-art baseline systems:
• RS-Average-Yelp is the widely-adopted baseline (e.g., by Yelp.com), using the averaged review scores as the final score.
• RS-Linear estimates the rating score that a user would give by $s_{ui} = s_{all} + s_u + s_i$ (Ricci et al., 2011), where $s_u$ and $s_i$ are the training deviations of the rating scores of the user $u$ and the product $i$, respectively.
• RS-Item applies kNN to estimate the rating score (Sarwar et al., 2001). We use the cosine similarity between $v_C$ vectors to measure the distance between products.
• RS-MF is a state-of-the-art recommendation model, which uses matrix factorisation to predict the rating score (Ding et al., 2006; Li et al., 2009).
• Sum-Opinosis uses a graph-based framework to generate abstractive summaries given redundant opinions (Ganesan et al., 2010).
• Sum-LSTM-Att is a state-of-the-art neural abstractive summariser, which uses an attentional neural model to consolidate information from multiple text sources, generating summaries using LSTM decoding (Rush et al., 2015; Wang and Ling, 2016).
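As an illustration of the simplest personalised baseline in the list above, RS-Linear can be sketched as follows. The sketch assumes that a "training deviation" is the difference between the mean score of a user (or product) and the global mean, and that unseen users or products get a deviation of zero; both points are assumptions, and the function name `rs_linear` and the toy ratings are hypothetical.

```python
from collections import defaultdict
from typing import List, Tuple


def rs_linear(ratings: List[Tuple[str, str, float]], user: str, item: str) -> float:
    """Predict s_ui = s_all + s_u + s_i from (user, item, score) triples."""
    s_all = sum(score for _, _, score in ratings) / len(ratings)
    by_user = defaultdict(list)
    by_item = defaultdict(list)
    for u, i, score in ratings:
        by_user[u].append(score)
        by_item[i].append(score)

    def deviation(scores: List[float]) -> float:
        # Assumption: deviation = mean score minus global mean, 0.0 if unseen.
        return sum(scores) / len(scores) - s_all if scores else 0.0

    return s_all + deviation(by_user.get(user, [])) + deviation(by_item.get(item, []))


ratings = [("a", "x", 5.0), ("a", "y", 3.0), ("b", "x", 4.0), ("b", "y", 2.0)]
print(rs_linear(ratings, "a", "y"))
```

This baseline uses rating scores only; it has no access to review text.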
Being non-opinion recommendation methods, all the baselines are single-task models, without considering rating and summarisation prediction jointly. The results are shown in Table 3. Our model ("Joint") significantly outperforms both "RS-Average-Yelp" and "RS-Linear" (p-value < 0.01 using t-test). Note that our proposed rating recommendation for the user is significantly closer to the individual user's real rating than Yelp's rating. Our proposed model also significantly outperforms state-of-the-art recommendation systems (RS-Item and RS-MF) (p-value < 0.01 using t-test), indicating that textual information is a useful addition to the rating scores themselves for recommending a product.
Finally, comparison between our proposed model and state-of-the-art summarisation techniques (Sum-Opinosis and Sum-LSTM-Att) shows the advantage of leveraging user information to enhance customised review generation, and also the strength of joint learning. Table 4 shows example outputs of rating scores and reviews. Ref. is the rating score and review written by the user, and Base is the baseline model, which generates the rating score by RS-MF and the review by Sum-LSTM-Att. From these examples, we can see that both the rating score and the review generated by the proposed Joint model are closer to those of the real user. In particular, in the first example, the Joint model correctly identifies both the price and the quality information that the target user wrote in the review, whereas the baseline model, which relies on the reviews of other users, did not yield comments about the price. Associating reviews and ratings closely, the joint model gives a rating score that is much closer to the real user score than the score given by the recommendation model RS-MF. In addition, we can also find some habits of certain users from their customized reviews, for example, a preference for Mexican food or for cheap and clean restaurants.

Conclusion
We proposed a novel task called opinion recommendation, which is to generate the review and rating score that a certain user would give to an unreviewed product or service. In particular, a deep memory network was utilized to find the association between the user and the product, jointly yielding the rating score and customised review. Results show that our method gives better results than several pipelined baselines using state-of-the-art sentiment rating and summarisation systems. Review scores given by the opinion recommendation system are closer to real user review scores than the review scores that Yelp assigns to target products.