Combining Deep Learning and Topic Modeling for Review Understanding in Context-Aware Recommendation

With the rise of e-commerce, people have become accustomed to writing reviews after receiving their goods. These reviews matter: a single bad review can directly affect other customers' purchasing decisions. Moreover, the abundant information in user reviews is very useful for extracting user preferences and item properties. In this paper, we investigate how to effectively utilize review information in recommender systems. The proposed model, LSTM-Topic matrix factorization (LTMF), integrates both LSTM and topic modeling for review understanding. In experiments on the popular Amazon review dataset, our LTMF model outperforms the previously proposed HFT and ConvMF models in rating prediction. Furthermore, LTMF shows a better topic clustering ability than the traditional topic-model-based method, which implies that integrating information from deep learning and topic modeling is a meaningful approach to better understanding reviews.


Introduction
Recommender systems (RSs) are widely used in electronic commerce to provide personalized recommendation services for customers. Most popular RSs are based on Collaborative Filtering (CF), which makes use of users' explicit ratings or implicit behaviour for recommendations (Koren, 2008). However, CF models suffer from data sparsity, which is also called the "cold-start" problem: models perform poorly when little data is available. Utilizing user reviews is a good way to alleviate this problem, because reviews directly reflect users' preferences and items' properties, which correspond to the user latent factors and item latent factors in CF models.
To understand user reviews, previous approaches are mainly based on topic modeling, a suite of algorithms that aim to discover the thematic information among documents (Blei, 2012). The simplest and most commonly used topic model is latent Dirichlet allocation (LDA). Recently, as deep learning has shown strong performance in computer vision (Krizhevsky et al., 2017) and NLP (Kim, 2014), several approaches combining deep learning with CF have been proposed to capture latent context features from reviews.
However, we find some limitations in existing models. First, the LDA algorithm used in previous models such as Hidden Factors as Topics (HFT) (McAuley and Leskovec, 2013) ignores contextual information. If a user writes "I prefer apple than banana when choosing fruits" in a review, we clearly know the user's preference and can recommend items related to apples. But LDA ignores this structural information and treats the two words identically, since each appears once in the sentence.
Compared with topic modeling, deep learning methods such as the Convolutional Neural Network (CNN) and the Recurrent Neural Network (RNN) are able to retain more context information. A CNN uses sliding windows to capture local context and word order. An RNN treats a sentence as a word sequence, and information from earlier words is preserved and passed on, which gives the RNN the ability to retain information about the whole sentence.
But some problems remain. For a CNN, the sliding windows are often small, so the model fails to link words at the beginning and end of a sentence. Given the review "I prefer apple than google when choosing jobs", a CNN cannot see the two words 'apple' and 'jobs' at the same time if the window size is small, so it faces an ambiguity: the word 'apple' may mean a fruit or a company. An RNN retains earlier information better than a CNN, but the retained information still decreases as sentences get longer. So when a review is long, the effect of an RNN is limited.
Faced with these problems, we propose to integrate deep learning and topic modeling to extract more global context information and gain a deeper understanding of user reviews. Deep learning methods can preserve context information, while topic modeling can provide word co-occurrence relations that compensate for the information loss.
We use a Long Short-Term Memory (LSTM) network for the deep learning part, because it is a special type of RNN that handles the vanishing gradient and long-term dependency problems better than the vanilla RNN structure. We use LDA for the topic modeling part. The two parts are then integrated into a matrix factorization framework. The final model is named LSTM-Topic matrix factorization (LTMF).
Furthermore, since the topic modeling part and the deep learning part are connected in our model, the topic clustering results are influenced by the deep learning information. In experiments, LTMF shows a better topic clustering ability than the traditional LDA-based HFT model. This suggests that the integrated method could also be applied to other tasks such as sentiment classification.
In the remainder of the paper, we first review previous work related to ours. We then introduce preliminaries and present our model in detail. After that, we evaluate our approach by comparing it with state-of-the-art algorithms. Finally, we conclude the paper and discuss future work.

Related Work
There have been some earlier approaches to extracting review information for RSs. Wang and Blei (2011) proposed collaborative topic regression (CTR), which combines topic modeling and collaborative filtering in a probabilistic model. McAuley and Leskovec (2013) developed a statistical model, HFT, which uses a transfer function to tightly combine ratings and reviews. Ling et al. (2014) and Bao et al. (2014) proposed models similar to CTR and HFT with some structural differences.
Recently, several researchers have begun to utilize deep learning in RSs. Wang et al. (2015) presented collaborative deep learning (CDL), a Bayesian model that leverages SDAE neural networks as a text feature learning component. Bansal et al. (2016) trained a gated recurrent unit (GRU) network to encode text sequences into latent vectors. Zhao et al. (2016) trained a deep CNN to discover abstract representations of movie posters and still frames, and incorporated them into a neighborhood CF model. Kim et al. (2016) utilized a CNN to retain contextual information in reviews, and developed a document context-aware recommendation model (ConvMF). ConvMF is a recently proposed model that has been shown to outperform PMF and CDL, so we choose it as a baseline in our experiments. Zheng et al. (2017) proposed the Deep Cooperative Neural Networks (DeepCoNN) model, which uses two parallel CNNs to simultaneously model user and item reviews and then combines the features in a Factorization Machine. Attention in neural networks has become popular in recent years; Seo et al. (2017) proposed a model using CNNs with dual attention (D-attn) for rating prediction. The D-attn model is similar to our LTMF model in that both aim to extract more global information: they use an attention-based CNN, while we utilize information from both topic modeling and deep learning. However, the D-attn model fails to work if there are not enough reviews, whereas our LTMF model uses review information as a supplement to ratings, so it can still work effectively even when there are few reviews.
Besides, Diao et al. (2014) proposed a method that jointly models aspects, sentiments and ratings for movie recommendation. Hu et al. (2015) proposed the MR3 model, which combines ratings, social relations and reviews for rating prediction. These hybrid models perform better than their individual components, which also inspired us to propose the LTMF framework.

Notations
We use explicit ratings as the training and test data. Suppose there are $M$ users and $N$ items, where each user and item is represented by a $K$-dimensional latent vector, $u_i \in \mathbb{R}^K$ and $v_j \in \mathbb{R}^K$. The sparse rating matrix is denoted as $R \in \mathbb{R}^{M \times N}$, where $r_{ij}$ is the rating of user $u_i$ on item $v_j$. $D$ is the review (document) corpus, where $d_{ij}$ is the review of user $u_i$ on item $v_j$.

PMF: a standard matrix factorization model
Probabilistic Matrix Factorization (PMF) (Mnih and Salakhutdinov, 2008) is an effective recommendation model that uses the matrix factorization (MF) technique to find the latent features of users and items from a probabilistic perspective. In PMF, the predicted rating $\hat{R}_{ij}$ is expressed as the inner product of the user latent vector $u_i$ and the item latent vector $v_j$:
$$\hat{R}_{ij} = u_i^T v_j$$
To get the latent vectors, PMF minimises the following loss function:
$$L = \sum_{i=1}^{M} \sum_{j=1}^{N} I_{ij} \left(R_{ij} - u_i^T v_j\right)^2 + \lambda_u \sum_{i=1}^{M} \|u_i\|_F^2 + \lambda_v \sum_{j=1}^{N} \|v_j\|_F^2 \qquad (1)$$
where $R_{ij}$ is the observed rating. The first part of Eq. (1) is the sum of squared errors between predicted and observed ratings, and the second part consists of quadratic regularization terms that avoid overfitting. $\lambda_u$ and $\lambda_v$ are the corresponding regularization parameters. $I_{ij}$ is the indicator function, which equals 1 if the $i$-th user rated the $j$-th item and 0 otherwise.
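The PMF loss can be evaluated directly from its description: a sum of squared errors over observed ratings plus quadratic penalties on the latent vectors. The following is a minimal pure-Python sketch under the assumption that the loss has no extra constant factors (such as 1/2); the function name and the dictionary-based rating storage are illustrative choices, not part of the original model description.

```python
def pmf_loss(ratings, U, V, lam_u, lam_v):
    """Regularized squared-error loss in the spirit of Eq. (1).

    ratings: dict mapping (i, j) -> observed rating R_ij. Only observed
             pairs are stored, which plays the role of the indicator I_ij.
    U, V:    lists of K-dimensional latent vectors (lists of floats).
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Sum of squared errors between observed and predicted ratings.
    squared_error = sum((r - dot(U[i], V[j])) ** 2
                        for (i, j), r in ratings.items())
    # Quadratic regularization on user and item vectors.
    reg_u = lam_u * sum(dot(u, u) for u in U)
    reg_v = lam_v * sum(dot(v, v) for v in V)
    return squared_error + reg_u + reg_v
```

Storing only the observed pairs keeps the cost proportional to the number of ratings rather than to $M \times N$, which matters because the rating matrix is sparse.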

HFT: understand reviews through topic modeling
Hidden Factors as Topics (HFT) (McAuley and Leskovec, 2013) provides an effective approach to integrating topic modeling into traditional CF models. It utilizes LDA, the simplest topic model, which assumes there are $K$ topics $T = \{t_1, t_2, \ldots, t_K\}$ in the document corpus $D$. Each document $d \in D$ has a topic distribution $\theta_d$ over $T$, and each topic has a word distribution $\phi$ over a fixed vocabulary. To connect the document-topic distribution $\theta$ and the item factors $v$, HFT proposes a transformation function:
$$\theta_{j,k} = \frac{\exp(\kappa v_{j,k})}{\sum_{k'} \exp(\kappa v_{j,k'})} \qquad (2)$$
where $v_{j,k}$ is the $k$-th latent factor in the item vector $v_j$, $\theta_{j,k}$ is the $k$-th topic probability in the item document-topic distribution $\theta_j$, and $\kappa$ is the parameter controlling the "peakiness" of the transformation.
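Besides, HFT introduces an additional variable $\psi$ to ensure that the word distribution $\phi_k$ is a stochastic vector satisfying $\sum_w \phi_{k,w} = 1$. The relation is denoted as follows:
$$\phi_{k,w} = \frac{\exp(\psi_{k,w})}{\sum_{w'} \exp(\psi_{k,w'})} \qquad (3)$$
The final loss function is:
$$L = \sum_{i=1}^{M} \sum_{j=1}^{N} I_{ij} \left(R_{ij} - \hat{R}_{ij}\right)^2 - \lambda_t \sum_{d \in D} \sum_{n} \log\left(\theta_{d, z_{d,n}} \, \phi_{z_{d,n}, w_{d,n}}\right) \qquad (4)$$
where $\hat{R}_{ij}$ is the predicted rating, $\theta$ and $\phi$ are the topic and word distributions respectively, $w_{d,n}$ is the $n$-th word in document $d$, $z_{d,n}$ is that word's topic, and $\lambda_t$ is a regularization parameter.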
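Both HFT transformations are softmax maps: one turns item factors into a topic distribution with a peakiness parameter, the other turns unconstrained weights into a word distribution. The sketch below is illustrative only; the function names are hypothetical, and subtracting the maximum before exponentiating is a standard numerical-stability step that is not mentioned in the text.

```python
import math

def topic_distribution(v_j, kappa):
    """Map item factors v_j to a topic distribution theta_j.

    theta_{j,k} is proportional to exp(kappa * v_{j,k}); a larger kappa
    gives a peakier distribution.
    """
    m = max(kappa * x for x in v_j)  # shift for numerical stability
    exps = [math.exp(kappa * x - m) for x in v_j]
    z_j = sum(exps)                  # normalizer over topics
    return [e / z_j for e in exps]

def word_distribution(psi_k):
    """Map unconstrained weights psi_k to a stochastic vector phi_k."""
    m = max(psi_k)                   # shift for numerical stability
    exps = [math.exp(p - m) for p in psi_k]
    z_w = sum(exps)                  # normalizer over the vocabulary
    return [e / z_w for e in exps]
```

For example, `topic_distribution([0.2, 1.0, -0.5], kappa)` concentrates more mass on the second topic as `kappa` grows, which is the "peakiness" effect described above.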

The LTMF Model
We propose the LSTM-Topic matrix factorization (LTMF) model, which integrates LSTM and topic modeling for recommendation. The model utilizes both rating and review information. For the rating part, we use probabilistic matrix factorization to extract rating latent vectors. For the review part, we use LDA (following HFT) to extract topic latent vectors and adopt an LSTM architecture to generate document latent vectors. We then combine the three kinds of vectors into a unified model. An overview of the LTMF model is shown in Figure 1.

Parameter Relation
The left of Figure 1 shows the parameter relations in the LTMF model, which can be divided into three parts: $\Theta = \{U, V\}$ are the parameters associated with rating MF, $\Phi = \{\theta, \phi\}$ are the parameters associated with the topic model, and $\Omega = \{W, l\}$ are the parameters associated with the LSTM. The shaded nodes are data ($R$: ratings, $D$: reviews), while the others are parameters. A single connection line indicates a constraint relationship between two nodes. Double connections (e.g. between $V$ and $\theta$) mean the relationship is bidirectional, so the two nodes can affect each other's results.

LSTM Architecture
The right of Figure 1 shows the LSTM architecture used in our models. For the $j$-th item, we concatenate all of its reviews into one document sequence $D_j$. Every word in the document sequence $D_j = (w_1, w_2, \ldots, w_{n_j})$ is first embedded into a $p$-dimensional vector. Next, the word vectors are fed into the LSTM network in the order of the words in $D_j$, producing a latent vector. Finally, the latent vector is sent to a fully connected layer whose output is the document latent vector $l_j$. The above process can be written as:
$$l_j = \mathrm{LSTM}(D_j, W)$$
where $D_j$ is the input document sequence and $W$ represents the weight and bias variables in the LSTM network.

Probabilistic Prior
The Gaussian distribution is the basic prior hypothesis in our model. We place zero-mean spherical Gaussian priors on the user latent features $u$, the LSTM weights $W$ and the observed ratings $R$. For the item vector $v$, we place the Gaussian prior on its difference from the LSTM output $l_j$:
$$v_j = l_j + \sigma_v$$
where $\sigma_v$ is a zero-mean spherical Gaussian noise vector. This function is important for connecting ratings and reviews. The document vector $l_j$ is close to the item feature vector $v_j$, since both reflect the item's properties, but some discrepancies remain. For example, when writing reviews, users usually write more about appearance and only briefly mention price. So in the review-based document vector $l_j$, the weight of "appearance" will be larger than in the rating-based latent vector $v_j$.
To preserve the discrepancy between $v_j$ and $l_j$, we introduce the Gaussian noise vector $\sigma_v$ as the offset.

Objective Function
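Finally, we maximize the log-posterior of the three parts and get the objective function as follows:
$$L = \sum_{i=1}^{M} \sum_{j=1}^{N} I_{ij} \left(R_{ij} - u_i^T v_j\right)^2 + \lambda_u \sum_{i=1}^{M} \|u_i\|_F^2 + \lambda_v \sum_{j=1}^{N} \|v_j - l_j\|_F^2 + \lambda_W \sum_{k=1}^{N_k} \|W_k\|_F^2 - \lambda_t \sum_{d \in D} \sum_{n} \log\left(\theta_{d, z_{d,n}} \, \phi_{z_{d,n}, w_{d,n}}\right) \qquad (6)$$
where $N_k$ is the number of weights in the LSTM network and $\lambda_u$, $\lambda_v$, $\lambda_W$ are regularization parameters. $z$ is the topic assignment of each word, and $\lambda_t$ is the regularization parameter that controls the proportion of the topic part. The objective function of LTMF can be considered an extended PMF model in which the information from topic modeling and the LSTM is included as regularization terms. In the next section, we explain how LTMF leverages the information from topic modeling and the LSTM, and why LTMF can combine the information of the two parts.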

The Effectiveness of LTMF
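As shown in Figure 1, the item vectors $V$ connect with both the topic part and the LSTM part, which means the information from both parts affects the resulting item vectors. If we take the partial derivative of Eq. (6) with respect to $v_j$ (written here for its $k$-th component), the constraint relationship becomes clearer:
$$\frac{\partial L}{\partial v_{j,k}} = -2 \sum_{i=1}^{M} I_{ij} \left(R_{ij} - u_i^T v_j\right) u_{i,k} + 2 \lambda_v \left(v_{j,k} - l_{j,k}\right) - \lambda_t \kappa \left(n_{j,k} - N_j \theta_{j,k}\right) \qquad (7)$$
In Eq. (7), the optimization direction of $v_j$ is subject to two regularization terms. The former is controlled by the LSTM vector $l_j$. The latter is controlled by the topic parameters ($\kappa$, $n_{j,k}$, $N_j$). Hence, we can leverage the information from both the LSTM and topic modeling for recommendation.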
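To make the two regularization effects on $v_j$ concrete, the sketch below evaluates the $v_j$-dependent terms of an assumed objective and its gradient, and the usage note describes a finite-difference check. The assumptions are stated explicitly: the objective is taken to be squared rating error, plus $\lambda_v \|v_j - l_j\|^2$, minus $\lambda_t$ times the topic log-likelihood $\sum_k n_{j,k} \log \theta_{j,k}$, with $\theta_j$ the softmax of $\kappa v_j$ and no extra constant factors. All function names are hypothetical.

```python
import math

def softmax(v, kappa):
    m = max(kappa * x for x in v)
    e = [math.exp(kappa * x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def item_loss(v_j, users, ratings, l_j, n_j, kappa, lam_v, lam_t):
    """Terms of the assumed objective that depend on v_j.

    users:   latent vectors u_i of the users who rated item j
    ratings: their observed ratings R_ij, in the same order
    l_j:     LSTM document vector of item j
    n_j:     n_j[k] = number of words in item j's document assigned to topic k
    """
    sq = sum((r - sum(a * b for a, b in zip(u, v_j))) ** 2
             for u, r in zip(users, ratings))
    reg = lam_v * sum((a - b) ** 2 for a, b in zip(v_j, l_j))
    theta = softmax(v_j, kappa)
    loglik = sum(n * math.log(t) for n, t in zip(n_j, theta))
    return sq + reg - lam_t * loglik

def item_grad(v_j, users, ratings, l_j, n_j, kappa, lam_v, lam_t):
    """Analytic gradient of item_loss with respect to v_j."""
    theta = softmax(v_j, kappa)
    N_j = sum(n_j)  # total number of words in the document of item j
    errors = [r - sum(a * b for a, b in zip(u, v_j))
              for u, r in zip(users, ratings)]
    grad = []
    for k in range(len(v_j)):
        rating_term = -2.0 * sum(e * u[k] for e, u in zip(errors, users))
        lstm_term = 2.0 * lam_v * (v_j[k] - l_j[k])        # pulls v_j toward l_j
        topic_term = -lam_t * kappa * (n_j[k] - N_j * theta[k])  # pulls theta_j toward n_j / N_j
        grad.append(rating_term + lstm_term + topic_term)
    return grad
```

Comparing `item_grad` against a central finite difference of `item_loss` is a quick way to confirm that the LSTM term and the topic term enter the gradient with the signs described above.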
Besides, note the double connections between the item vectors $V$ and the topic distribution $\theta$ in Figure 1. They mean that the information from topic modeling can affect $V$, while changes in $V$ are also passed to the topic part through Eq. (2) and affect the review understanding result of topic modeling. The same analysis applies to $V$ and the LSTM vector $l$. In effect, the item vectors $V$ play the role of a transporter that connects the LSTM part and the topic modeling part. This is why LTMF can combine the information of topic modeling and the LSTM to reach a deeper understanding of user reviews.
Furthermore, LTMF provides an effective framework for integrating topic models with deep learning networks for recommendation. In experiments, we replace the LSTM part with a CNN to build a comparison model. The experiments show that both models improve rating prediction accuracy.

Optimization
Our objective is to search:
$$\arg\min_{\Theta, \Phi, \kappa, z, \Omega} L(\Theta, \Phi, \kappa, z, \Omega)$$
Recall that $\Theta$ are the parameters associated with rating MF, $\Phi$ are the parameters associated with topic modeling, $z$ is the topic assignment of each word, $\kappa$ is the peakiness parameter that controls the transformation between the item vector $v$ and the topic distribution $\theta$, and $\Omega$ are the parameters associated with the LSTM.
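Since $v_j$ is coupled with the parameters of topic modeling and the LSTM vector, we cannot optimize these parameters independently. We adopt a procedure that alternates between two steps. In each step we fix some parameters and optimize the others. The optimization process is shown below:
1. Solve the objective by fixing $z^t$ and $\Omega^t$:
$$\Theta^{t+1}, \Phi^{t+1}, \kappa^{t+1} = \arg\min_{\Theta, \Phi, \kappa} L(\Theta, \Phi, \kappa, z^t, \Omega^t)$$
2. Fix $\Theta^{t+1}$, $\Phi^{t+1}$ and $\kappa^{t+1}$, then:
(a) update $\Omega^{t+1}$ with fixing $v_j^{t+1}$ and document sequence $D_j$;
(b) update the topic assignments $z^{t+1}$ via Gibbs sampling.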
In step 1, we fix $z$ and $\Omega$ and update the remaining terms $\Theta$, $\Phi$ and $\kappa$ with the L-BFGS algorithm. In step 2, we fix $\Theta$, $\Phi$ and $\kappa$ and update the LSTM parameters $\Omega$ and the topic model parameters $z$. Since the LSTM part and the topic part are independent once the item vectors $V$ are fixed, we can update the two terms separately. In step 2(a), we update $\Omega$ by back propagation. With the other parameters fixed, the objective function of $W$ can be seen as a weighted squared error function ($\|v_j - l_j\|_F^2$) with $L_2$ regularization terms ($\|W\|_F^2$), which means we can run back propagation with $D_j$ as the input and $v_j$ as the label. In step 2(b), we iterate through all documents and each word within them to update $z_{d,j}$ via Gibbs sampling. We do not divide the process into three steps because steps 2(a) and 2(b) are independent of each other once step 1 has finished, which means the two can run in parallel.
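The text does not spell out the sampling distribution used in step 2(b). A minimal sketch, assuming the HFT-style rule in which the topic of each word is drawn with probability proportional to $\theta_{j,k} \, \phi_{k,w}$, could look as follows; the function name and argument layout are hypothetical.

```python
import random

def resample_topics(doc_words, theta_j, phi, rng):
    """Resample the topic assignment of every word in one item document.

    doc_words: word ids of the document
    theta_j:   topic distribution of the item, length K
    phi:       phi[k][w] = probability of word w under topic k
    rng:       a random.Random instance
    """
    topics = range(len(theta_j))
    z = []
    for w in doc_words:
        # Unnormalized probability of each topic for this word (assumed rule).
        weights = [theta_j[k] * phi[k][w] for k in topics]
        z.append(rng.choices(topics, weights=weights)[0])
    return z
```

Because each document is resampled using only its own $\theta_j$ and the shared $\phi$, documents can be processed independently, which fits the remark that step 2(b) can run in parallel with step 2(a).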
Finally, we repeat these two steps until convergence. In practice, we run step 1 for 5 gradient iterations using L-BFGS, then iterate the LSTM part 5 times. At the same time, we update the topic model part once. The whole process is called a cycle, and it usually takes 30 cycles to reach a local optimum.
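The cycle structure described above can be sketched as a small driver loop. The three callables are hypothetical placeholders for the actual update routines, and the sketch makes two simplifications that should be read as assumptions: it runs steps 2(a) and 2(b) sequentially rather than in parallel, and it uses a fixed number of cycles instead of a convergence test.

```python
def train_ltmf(lbfgs_step, lstm_epoch, gibbs_sweep,
               cycles=30, lbfgs_iters=5, lstm_epochs=5):
    """Alternate between the two optimization steps for a fixed number of cycles.

    lbfgs_step():  one L-BFGS gradient iteration on (Theta, Phi, kappa),
                   with z and Omega held fixed
    lstm_epoch():  one back-propagation pass that fits l_j to v_j,
                   i.e. updates Omega with item vectors held fixed
    gibbs_sweep(): one pass that resamples the topic assignment of every word
    """
    for _ in range(cycles):
        # Step 1: update rating and topic parameters.
        for _ in range(lbfgs_iters):
            lbfgs_step()
        # Step 2(a): update the LSTM weights.
        for _ in range(lstm_epochs):
            lstm_epoch()
        # Step 2(b): update topic assignments once per cycle.
        gibbs_sweep()
```

With the default arguments this performs 150 L-BFGS iterations, 150 LSTM passes and 30 Gibbs sweeps in total.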
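In addition to the gradient of $v_j$, the gradients of the other parameters used in step 1 are listed as follows:
$$\frac{\partial L}{\partial u_i} = -2 \sum_{j=1}^{N} I_{ij} \left(R_{ij} - u_i^T v_j\right) v_j + 2 \lambda_u u_i$$
$$\frac{\partial L}{\partial \psi_{k,w}} = -\lambda_t \left(n_{k,w} - N_k \frac{\exp(\psi_{k,w})}{z_w}\right)$$
$$\frac{\partial L}{\partial \kappa} = -\lambda_t \sum_{j} \sum_{k} v_{j,k} \left(n_{j,k} - N_j \frac{\exp(\kappa v_{j,k})}{z_j}\right) \qquad (11)$$
where $\psi$ is used to determine the word distribution $\phi$ by Eq. (3); $n_{k,w}$ is the number of times that word $w$ occurs in topic $k$; $N_w$ is the vocabulary size of the document corpus; $N_k$ is the number of words in topic $k$; $n_{j,k}$ is the number of times topic $k$ occurs in the document of item $j$; $N_j$ is the total number of words in document $j$; and $z_w$ and $z_j$ are the corresponding normalizers, $z_w = \sum_{w'=1}^{N_w} \exp(\psi_{k,w'})$ and $z_j = \sum_{k'} \exp(\kappa v_{j,k'})$.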

Datasets
We use the real-world Amazon dataset (collected by McAuley et al. (2015)) for our experiments. Since the original dataset is too large, we choose 10 sub-datasets for the experiments. To increase data density, we remove users who have fewer than 3 ratings. For raw review texts, we adopt the same preprocessing methods as ConvMF: set the maximum length of an item document to 300; remove common stop words and document-specific words with a document frequency higher than 0.5; choose the top 8000 distinct words as the vocabulary; and remove all non-vocabulary words to construct the input document sequences. After preprocessing, the statistics of the datasets are listed in Table 1, where the abbreviations of the datasets are shown in parentheses.
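The preprocessing steps above can be sketched in a few lines of pure Python. Several details are not specified in the text and are assumptions here: the stop-word list is supplied by the caller, the "top" words are ranked by corpus frequency, and truncation to the maximum length happens after non-vocabulary words are removed. The function name is hypothetical.

```python
from collections import Counter

def build_item_documents(raw_docs, stop_words,
                         max_len=300, max_df=0.5, vocab_size=8000):
    """raw_docs: list of token lists, one per item (all reviews concatenated)."""
    n_docs = len(raw_docs)
    # Document frequency: in how many item documents each word occurs.
    doc_freq = Counter(w for doc in raw_docs for w in set(doc))
    # Drop stop words and words occurring in more than max_df of the documents.
    candidates = {w for w, df in doc_freq.items()
                  if w not in stop_words and df / n_docs <= max_df}
    # Keep the vocab_size most frequent remaining words (assumed ranking).
    term_freq = Counter(w for doc in raw_docs for w in doc if w in candidates)
    vocab = {w for w, _ in term_freq.most_common(vocab_size)}
    # Remove non-vocabulary words, then truncate each document.
    docs = [[w for w in doc if w in vocab][:max_len] for doc in raw_docs]
    return docs, vocab
```

The returned token lists would still need to be mapped to integer ids before being fed to an embedding layer.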

Baseline
The baselines used in our experiments are listed as follows:
• HFT: This is a state-of-the-art method that combines reviews with ratings (McAuley and Leskovec, 2013). It utilizes LDA to capture unstructured textual information in reviews.
• ConvMF: Convolutional Matrix Factorization (ConvMF) (Kim et al., 2016) is a recently proposed recommendation model. It utilizes a CNN to capture contextual information in item reviews.
• LMF: LSTM Matrix Factorization (LMF) is a submodel of LTMF without the topic part. Comparing it with ConvMF shows the effectiveness of LSTM relative to CNN for review understanding.
• CTMF: We modify the LTMF model by replacing the LSTM part with CNN (following the structure of ConvMF) and construct the comparison model CNN-Topic Matrix Factorization (CTMF). CTMF can be used to evaluate the effectiveness of combining deep learning and topic modeling.
In experiments, we randomly split each dataset into a training set, a test set and a validation set in proportions of 80%, 10% and 10%, such that each user and item appears at least once in the training set. We use Mean Squared Error (MSE) as the metric to evaluate the models.
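One way to obtain such a split is to first reserve for training any rating whose user or item has not yet been seen, and then fill the three sets to the target proportions. The text does not describe the exact procedure, so the following is only a sketch under that assumption, with hypothetical function names; the MSE helper follows the usual definition.

```python
import random

def split_ratings(ratings, seed=0, train_frac=0.8, test_frac=0.1):
    """ratings: list of (user, item, rating) triples."""
    rng = random.Random(seed)
    shuffled = ratings[:]
    rng.shuffle(shuffled)
    train, rest = [], []
    seen_users, seen_items = set(), set()
    # First pass: force a rating into training if its user or item is unseen.
    for u, i, r in shuffled:
        if u not in seen_users or i not in seen_items:
            train.append((u, i, r))
            seen_users.add(u)
            seen_items.add(i)
        else:
            rest.append((u, i, r))
    # Second pass: top up training to the target size, then split the rest.
    n_train = max(int(train_frac * len(ratings)) - len(train), 0)
    n_test = int(test_frac * len(ratings))
    train += rest[:n_train]
    test = rest[n_train:n_train + n_test]
    valid = rest[n_train + n_test:]
    return train, test, valid

def mse(pairs):
    """pairs: iterable of (observed, predicted) ratings."""
    pairs = list(pairs)
    return sum((r - p) ** 2 for r, p in pairs) / len(pairs)
```

On very sparse data the first pass alone may exceed 80% of the ratings, in which case the test and validation sets end up smaller than their nominal 10%.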

Implementation Details
For all models, we set the dimension of the user and item latent vectors to $K = 5$ and initialize the vectors randomly between 0 and 1. The number of topics and the dimension of the document latent vector $l$ are also set to 5. For methods using deep learning, we initialize the word latent vectors randomly with embedding dimension $p = 200$. The optimization algorithm used in back propagation is RMSprop, and the activation function used in the fully connected layer is tanh. In the LSTM network, we set the output dimension to 128 and the dropout rate to 0.2. For CTMF, we adopt the same setting as ConvMF, where the sliding window sizes are {3, 4, 5} and the number of shared weights per window size is 100. Hyperparameters are set as follows. For PMF, $\lambda_u = \lambda_v = 0.1$. For HFT, we select from $\lambda_t \in \{1, 5\}$ the value that gives the better result in each experiment. For LMF and ConvMF, we set $\lambda_u = 0.1$ and $\lambda_v = 5$. For LTMF and CTMF, we select from $\lambda_t \in \{0.05, 0.1, 0.5\}$ the value that gives the lowest validation set error.

Quantitative analysis of rating prediction
We evaluate these models and report the lowest test set error on each dataset. The MSE results are shown in Table 2, where the best result on each dataset is highlighted in bold and the standard deviations of the corresponding MSE values are given in parentheses.
We can see that the LTMF model consistently outperforms these baselines on all datasets, which confirms the effectiveness of our proposed method. To make a more intuitive comparison, the improvement histograms of these models are shown in Figure 2. The upper figure shows the improvements of HFT, ConvMF and LMF over PMF on different datasets, where PMF only uses rating information and the other three use both rating and review information with different approaches. All three methods make significant improvements over PMF, which indicates that review information helps to model user and item features and to improve recommendation results. Compared with HFT, LMF achieves over 3% improvement on 9 out of the 10 datasets. ConvMF performs better than HFT, while LMF still obtains over 3% improvement over ConvMF on 7 datasets. The differences between HFT, ConvMF and LMF can be attributed to their individual methods for review understanding.
The lower figure compares the two integrated models that import topic information (LTMF and CTMF) with the two original models that only use deep learning (LMF and ConvMF). Both integrated models outperform the original models, which confirms our conjecture that recommendation results can be improved by combining structural and unstructured information. CTMF achieves over 2% improvement over ConvMF on 5 out of 10 datasets. LTMF achieves nearly 1% improvement over LMF on 7 out of 10 datasets.
The smaller gain of LTMF can be explained from two sides. Numerically, the comparison model LMF, which we propose ourselves, is already a strong baseline, so it is more difficult to improve on it significantly. Theoretically, since an LSTM can retain enough global information when the input sentence is relatively short, the supplementary topic information in LTMF contributes less. As an illustration, we can compare the results on the datasets "DM" and "VG". As shown in Table 1, the dataset "DM" has the fewest words per item (38.79), and the improvement of LTMF is smallest there. The dataset "VG", in contrast, has the most words per item (92.55). The global context information obtained by the LSTM still decreases with such long sentences, and the topic information provides an effective supplement. So the improvement of LTMF on "VG" is greater and comparable with that of CTMF.

Recommendation with different data sparsity
Rating data and review data are always sparse in RSs. To compare how these models recommend under different levels of data sparsity, especially for new users who only have limited ratings, we choose the dataset "Baby" and refilter it to make sure every user has at least $N$ ratings ($N$ varies from 1 to 10). A greater $N$ means the users have rated more items, so the data sparsity problem is weaker. We test all models on the 10 subsets of "Baby" with the same dataset split ratio and text preprocessing. The final results are shown in Figure 3, where the left panel shows the MSE values of all models and the right panel shows the improvement of the other models over PMF. All models gain better recommendation accuracy as the user rating number $N$ increases. In other words, user and item latent features can be better extracted with more useful information. When $N$ is small, especially when $N$ is 2 or 3, the models that utilize both review and rating information achieve their biggest improvements over PMF. This suggests that review information provides an effective supplement when rating data is scarce. As $N$ increases, the improvements of all review-based models become smaller. This is because the models can extract more features from the gradually denser rating data, so the benefit of review data begins to decrease. As in the previous experiment, our LTMF model achieves the best results among the compared models.
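The refiltering step can be sketched as a single pass that counts ratings per user and keeps only users who meet the threshold. The function name is hypothetical, and the sketch assumes the threshold is applied once to the original rating list rather than iteratively.

```python
from collections import Counter

def filter_by_min_ratings(ratings, min_ratings):
    """Keep only ratings from users who have at least min_ratings ratings.

    ratings: list of (user, item, rating) triples.
    """
    counts = Counter(u for u, _, _ in ratings)
    return [(u, i, r) for u, i, r in ratings if counts[u] >= min_ratings]
```

The ten subsets used in this experiment would then correspond to calling the function with `min_ratings` ranging from 1 to 10.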

Qualitative Analysis
In HFT, the resulting topic words depend only on the information from topic modeling. In our proposed LTMF framework, however, the information extracted by the LSTM and by topic modeling both affect the final word clustering results. So we can compare the topic words discovered by HFT and LTMF to evaluate whether combining LSTM and topic modeling leads to a better understanding of user reviews. We choose the dataset "Office Product" (OP) and show the top topic words of HFT and LTMF in Table 3 and Table 4. As we can see, many words appear in both tables (e.g. "wallet", "notebooks", "document"). These words are closely related to the category of the dataset "Office Product", which implies that both models can produce a good interpretation of user reviews.
However, a careful comparison of the two tables reveals some differences. In Table 3, there are some adjectives and verbs that contribute little to topic clustering (e.g. "nice", "huge", "attach"), but they still get large weights and appear near the front of the topic word list. HFT apparently misinterprets these words because they usually appear together with the real topic words. In Table 4, they do not appear in the top word list, because the extra information from the LSTM provides a timely supplement. Similar situations occur with the words "document" and "compatible". The word "document" is an evident topic word, so LTMF gives it a larger weight in the topic word list. The word "compatible", as an adjective, provides less topic information than nouns, so LTMF decreases its weight and puts "camera" in second place. From the above analysis we can see that LTMF shows a better topic clustering ability than HFT.

Conclusion and Future Work
In this paper, we investigate how to effectively utilize review information for RSs. We propose the LTMF model, which integrates both LSTM and topic modeling for context-aware recommendation. In the experiments, our LTMF model outperforms HFT and ConvMF in rating prediction, especially when the data is sparse. Furthermore, LTMF shows a better topic clustering ability than the traditional topic-model-based method HFT, which implies that integrating information from deep learning and topic modeling is a meaningful approach to better understanding reviews. In the future, we plan to evaluate more complex networks for recommendation tasks under the framework proposed by LTMF. We are also interested in applying the method of combining topic models and deep learning to some traditional NLP tasks.