Towards Controllable and Personalized Review Generation

In this paper, we propose RevGAN, a novel model that automatically generates controllable and personalized user reviews based on arbitrarily given sentiment and style information. RevGAN combines three novel components: a self-attentive recursive autoencoder, a conditional discriminator, and a personalized decoder. We test its performance on several real-world datasets, where our model significantly outperforms state-of-the-art generation models in terms of sentence quality, coherence, personalization, and human evaluation. We also empirically show that the generated reviews cannot easily be distinguished from organically produced reviews and that they follow the same statistical laws of linguistics.


Introduction
With the ever-increasing interest in user-generated reviews on online marketplace websites such as Amazon, Yelp and TripAdvisor, it is necessary to provide tools that encourage users to provide feedback more efficiently and effectively, as only a small fraction of users actually take the time to write their own reviews (Chen and Xie, 2008). Automatic review generation, for example, takes product information and user behavior as input and generates user reviews that follow an arbitrarily given sentiment designation and a writing style personalized to the specific product and user.
Researchers have proposed various product review generation methods (Yao et al., 2017; Dong et al., 2017; Lipton et al., 2015; Radford et al., 2017) and achieved strong performance. However, these methods do not consider the inner hierarchical word-sentence-paragraph structure of user reviews, so their generation results are significantly limited in length and coherence. Li et al. (2015) and Zang and Wan (2017) did include this hierarchical connection in their review generation models, but they did not address controllable and personalized review generation targeted at a specific product and user, which is essential for the usefulness of generated reviews. Most importantly, none of the aforementioned generative models include product descriptions in the generation process, so their generation results lack credibility and diversity.
To address these problems, we propose a novel model, RevGAN, that automatically generates high-quality user reviews given product descriptions, sentiment labels and users' historical reviews. RevGAN follows a three-stage process. In Stage 1, we use a Self-Attentive Recursive Autoencoder to map discrete user reviews and product descriptions into continuous embeddings, which captures the "essence" of the textual information and is convenient for the subsequent optimization. In Stage 2, we utilize a novel Conditional Discriminator structure to control the sentiment of generated reviews by conditioning the sentiment on the discriminator, forcing the generator to adapt its generation policy correspondingly. Finally, in Stage 3, to improve the personalization of generated reviews, we use a new Personalized Decoder to decode the generated review embeddings according to users' writing styles extracted from their history corpus.
We conduct extensive experiments on multiple real-world datasets and show that the proposed RevGAN model significantly and consistently outperforms state-of-the-art baseline models and generates reviews that are very similar in style and content to the original reviews.
In general, this paper makes the following contributions: a. We propose a novel RevGAN model that automatically generates controllable and personalized reviews from product information, a set of user reviews and their writing styles. Specifically, we propose three novel components of the generative framework: the Self-Attentive Recursive Autoencoder, which captures the hierarchical structure and latent semantic meanings of user reviews; the Conditional Discriminator, which generates controllable user reviews by conditioning the sentiment information on the discriminator to improve sentence quality and context accuracy; and the Personalized Decoder, which takes the user's writing style into account by concatenating the user's vocabulary preferences onto the decoder to improve the personalization and credibility of the generated results, as validated by our empirical human evaluation.
b. We empirically demonstrate that our proposed RevGAN model achieves state-of-the-art review generation performance, statistically and empirically outperforming several important benchmarks on multiple datasets. We also show that the reviews generated by our method are very similar to organically produced reviews and that their linguistic features follow the same statistical laws of linguistics.

Related Work
In this section, we briefly summarize related work along two threads: automated review generation and GANs for natural language generation (NLG). We point out the connections and differences between our proposed model and the prior literature, which lead to significant improvements in review generation performance.
However, these review generation models include neither product information nor users' writing styles in the generation process, making the generated reviews less persuasive. They are also limited in length and lack coherence because they neglect the hierarchical connections between sentences, which are important elements of the helpfulness of a review (Mudambi and Schuff, 2010). Our RevGAN model, on the other hand, combines a self-attentive hierarchical autoencoder with a conditional discriminator for improved and controllable review generation, while concatenating the contextual labels and users' history corpus into the personalized decoder. Experimental results confirm that our proposed model achieves significantly better generation results than the prior literature.

GAN for NLG
GAN (Goodfellow et al., 2014) has become a powerful method for reconstruction and generation in real data spaces, which makes it a promising tool for natural language generation. Various methods have been proposed to overcome the central difficulty of the discreteness of textual information, including SeqGAN, TextGAN, RankGAN (Lin et al., 2017) and LeakGAN (Guo et al., 2017). However, for long-text generation tasks, the computational complexity of all these models can be too high to produce satisfying results, and none of them take contextual or personalized information into consideration for controllable generation. Conditional GAN (Mirza and Osindero, 2014; Hu et al., 2017; Dong et al., 2017) concatenates supervised labels into the generator input and can control the generation of simple sentences. However, given the high dimensionality of the latent embedding vectors, concatenating significantly lower-dimensional supervised information into the input may not be strong enough to force the generator to update in the designated direction.
Therefore, to address these problems, we propose a novel conditional discriminator that conditions the sentiment label on the discriminator, changing how the discriminator operates and forcing it to backpropagate a loss that makes the generator learn what the user actually wants. Experimental results reported in Section 5 show that we outperform the other GAN-based NLG models in generation performance, and outperform Conditional GAN in sentiment accuracy as well.

Models
In this section, we introduce the proposed RevGAN model for review generation, which includes three novel components: the Self-Attentive Recursive AutoEncoder, the Conditional Discriminator and the Personalized Decoder. We first use the Self-Attentive Recursive AutoEncoder to map discrete user reviews and product descriptions into a latent continuous space, and then utilize a novel variant of cGAN to generate review embeddings for subsequent personalized decoding and review generation. Experimental results show that the combination of all three components achieves state-of-the-art review generation performance.

Figure 1: Illustration of the Self-Attentive Recursive AutoEncoder, which maps sentences (e.g., "The product does exactly as it should", "Buy this product!") to word indexes, encodes them, and decodes sentences via word sampling.

Self-Attentive Recursive AutoEncoder
The proposed Self-Attentive Recursive AutoEncoder is illustrated in Figure 1. We implement a bidirectional Gated Recurrent Unit (GRU) (Cho et al., 2014) neural network as the encoder and the decoder of the model. Compared with classical models such as RNN or LSTM, GRU is computationally more efficient and better captures latent semantic meanings. We split each user review or product description into single sentences, and map each sentence to its corresponding word indexes in the pre-defined dictionary. These index sequences constitute the input of our proposed model.
We use $R$ to represent a review consisting of $N_R$ sentences, $R = \{s_1, s_2, \cdots, s_{N_R}\}$. Each sentence $s$ consists of $N_s$ words, $s = \{w_1, w_2, \cdots, w_{N_s}\}$, where $w_i$ is the index of a word in the vocabulary of size $V$. We denote by $[W_z, W_r, U_z, U_r]$ the weight matrices of the update and reset gates, by $z_t, r_t$ the states of the update and reset gates, and by $x_t, h_t$ the input and output vectors at time $t$. GRU learns the hidden representations using the following equations:
$$z_t = \sigma(W_z x_t + U_z h_{t-1}), \quad r_t = \sigma(W_r x_t + U_r h_{t-1}),$$
$$\tilde{h}_t = \tanh(W x_t + U(r_t \odot h_{t-1})), \quad h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t,$$
where the hidden state $h_t$ at the end of the sequence constitutes the latent sentence embedding.
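To make the recurrence concrete, the GRU gating above can be traced with a toy scalar-state implementation. This is only a didactic sketch with hand-picked scalar weights, not the paper's bidirectional, 300-dimensional encoder; the function names are ours.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, W, U):
    """One GRU step with scalar states (toy sizes):
    z_t = sigmoid(Wz x_t + Uz h_prev)          update gate
    r_t = sigmoid(Wr x_t + Ur h_prev)          reset gate
    h~_t = tanh(W x_t + U (r_t * h_prev))      candidate state
    h_t = z_t * h_prev + (1 - z_t) * h~_t
    """
    z_t = sigmoid(Wz * x_t + Uz * h_prev)
    r_t = sigmoid(Wr * x_t + Ur * h_prev)
    h_tilde = math.tanh(W * x_t + U * (r_t * h_prev))
    return z_t * h_prev + (1.0 - z_t) * h_tilde

def encode(sequence):
    """Run the GRU over an input sequence; the final hidden state
    serves as the sentence embedding."""
    h = 0.0
    for x in sequence:
        h = gru_step(x, h, Wz=0.5, Uz=0.5, Wr=0.5, Ur=0.5, W=0.5, U=0.5)
    return h
```

Since the candidate state is a tanh and each step is a convex combination, the hidden state stays bounded in (-1, 1) regardless of sequence length.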
Besides, to capture relative position representations, we incorporate the self-attention mechanism (Shaw et al., 2018) during the encoding process. Each output element $h_i$ is computed as a weighted sum of linearly transformed input elements, $h_i = \sum_j \alpha_{ij}(x_j W^V + a_{ij}^V)$, where each weight $\alpha_{ij}$ is obtained by a softmax over compatibility scores $e_{ij}$, and $e_{ij}$ is computed by a compatibility function that compares the two corresponding input elements. We visualize the self-attention mechanism with an example in Figure 2. After obtaining the sentence-level embeddings within each review, we merge those sentence embeddings recursively into paragraph embeddings via a binary tree structure. We denote the embedding of sentence $s$ by $e_s$. During the encoding process, the first parent node $y_2$ is computed from the first two child nodes $(e_1, e_2)$ by a standard neural network layer, $y_2 = f(W_e(e_1 : e_2) + b)$, where $(e_1 : e_2)$ is the concatenation of the two embedding vectors and $W_e$ is a weight matrix with twice the size of the embedding vectors. The second parent node $y_3$ is computed from the concatenation of the first parent node $y_2$ and the next embedding vector $e_3$: $y_3 = f(W_e(y_2 : e_3) + b)$. We obtain the representations of the remaining nodes similarly. The embedding of the root node constitutes the entire review embedding. The training of $W_e$ follows a standard MLP network, where we optimize the reconstruction loss over every layer of the binary tree.
To unfold the recursive autoencoder, we start from the root node $y_{N_R}$ of the binary tree. Using another MLP network, we expand the paragraph embedding vector into two vectors, a leaf node and a lower-level parent node, and the parent node goes through the same procedure until all of the leaf nodes of the binary tree are decoded.
Finally, we assemble the paragraph from the bottom leaf node to the top one, such that $R = \{e_1, e_2, \cdots, e_{N_R}\}$, to complete the review reconstruction process.
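The left-to-right recursive merging of sentence embeddings can be sketched as follows. The toy dimensions, the tanh nonlinearity and the omitted bias are illustrative assumptions; in practice the weight matrix is trained by minimizing the reconstruction loss at every tree layer.

```python
import math

def merge(We, a, b, f=math.tanh):
    """Parent embedding y = f(We (a : b)): (a : b) is vector concatenation,
    We has shape dim x (2*dim). Bias omitted for brevity."""
    concat = a + b  # list concatenation = vector concatenation here
    return [f(sum(w * x for w, x in zip(row, concat))) for row in We]

def encode_review(sentence_embeddings, We):
    """Left-to-right recursive merging: y2 = f(We(e1 : e2)),
    y3 = f(We(y2 : e3)), ...; the final node is the review embedding."""
    node = sentence_embeddings[0]
    for e in sentence_embeddings[1:]:
        node = merge(We, node, e)
    return node

dim = 2
We = [[0.1] * (2 * dim) for _ in range(dim)]  # toy weights; trained in practice
review = encode_review([[0.3, -0.1], [0.2, 0.4], [-0.5, 0.0]], We)
```

Note the merge keeps the embedding dimension fixed, so reviews of any length map to a single fixed-size vector.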

Conditional Discriminator
To generate meaningful reviews from product information, we construct a cGAN model that transfers product descriptions into user reviews given specific sentiment labels. However, unlike traditional conditional GAN methods (Mirza and Osindero, 2014; Hu et al., 2017), we do not concatenate the sentiment label directly into the latent codes: given the relatively high dimensionality of the latent embeddings, concatenating a sentiment scalar into the input may not be powerful enough to force the generator to match the designated sentiment. Instead, we condition the sentiment labels on the discriminator, changing the rules by which the discriminator operates and forcing it to backpropagate losses that update the generator policy correspondingly. Generated reviews and original reviews with the opposite sentiment are judged as negative examples, while only original reviews that match the given sentiment are judged as positive examples. Formally, we propose the novel conditional discriminator $D$ (here for positive sentiment):
$$D(x \mid c) = \begin{cases} 1, & \text{organic positive reviews} \\ 0, & \text{others.} \end{cases}$$
The generator is optimized to minimize the reconstruction error between generated and original reviews, $L(\theta_G) = \mathrm{KL}(p_{real}(x) \,\|\, p_G(x; \theta))$, where KL stands for the Kullback-Leibler divergence. The loss function of the discriminator $D$ is the standard adversarial objective
$$L(\theta_D) = -\mathbb{E}_{x \sim p_{real}}[\log D(x \mid c)] - \mathbb{E}_{\hat{x} \sim p_G}[\log(1 - D(\hat{x} \mid c))].$$
Under ideal circumstances, when the generator and discriminator both reach equilibrium, we get $D(x) = D(G(x)) = \frac{1}{2}$, and the generated reviews are indistinguishable from the original reviews from the discriminator's point of view.
The core idea is that, by artificially forcing the discriminator to accept only a certain type of reviews as real samples, the generator must learn the conditioned information and transform the generated data distribution accordingly. This structure makes the controllable review generation process possible, and experimental results support the strength of our model over classical cGAN models.
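A minimal sketch of the conditional labeling rule and the discriminator objective. The function and variable names are hypothetical, and the binary cross-entropy form is the standard GAN discriminator loss rather than a detail confirmed by the text above.

```python
import math

def discriminator_target(is_organic, review_sentiment, condition):
    """Label rule of the conditional discriminator D(x|c): only organic
    reviews whose sentiment matches the condition count as real (1);
    generated reviews and organic reviews of opposite sentiment are fake (0)."""
    return 1 if (is_organic and review_sentiment == condition) else 0

def d_loss(preds, targets):
    """Binary cross-entropy minimized by the discriminator."""
    eps = 1e-12  # numerical guard against log(0)
    return -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
                for p, t in zip(preds, targets)) / len(preds)

batch = [
    {"organic": True,  "sentiment": "positive"},   # real and matches c -> 1
    {"organic": True,  "sentiment": "negative"},   # real but opposite  -> 0
    {"organic": False, "sentiment": "positive"},   # generated          -> 0
]
targets = [discriminator_target(r["organic"], r["sentiment"], "positive")
           for r in batch]
```

Flipping the condition to "negative" inverts which organic reviews count as real, which is what lets the same machinery steer the generator toward either sentiment.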

Personalized Decoder
To personalize the generation process, apart from conditioning on sentiment labels, we also take users' specific writing styles into account. We adopt the definition of writing style from Zheng et al. (2006): writing style refers to a user's distinctive vocabulary choices and style of expression in their reviews. Assuming the historical reviews written by user $i$ are $[R_{i1}, R_{i2}, \cdots, R_{iN_i}]$, we calculate the usage frequency of each word in this corpus, denoted as a $V$-dimensional writing style vector $W_{style}$. The intuition is that, during the decoding process, instead of sampling each word directly from the word distribution computed by the GRU network, we combine the writing style vector with that distribution and sample the generated word afterwards, so that the output is determined by both the writing style vector $W_{style}$ and the distribution vector $W$. Note that, to deal with the cold-start problem when a user has no historical reviews, we can simply set the writing style vector to the all-ones vector $W_{style} = \mathbf{1}$ and generate reviews under the normal settings. Experimental results show that incorporating personalized information (sentiment information and writing style) indeed improves the generation results as well as the helpfulness scores in our empirical study.
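One plausible reading of combining the writing style vector with the decoder's word distribution is an element-wise re-weighting before sampling; the exact combination rule is not fully specified above, so this sketch is an assumption, with names of our own choosing.

```python
import random

def personalized_sample(word_probs, style_counts, rng=random.Random(0)):
    """Re-weight the decoder's word distribution element-wise by the user's
    word-usage frequencies (the writing style vector), then sample a word
    from the re-weighted distribution. With no history, uniform (all-ones)
    style weights recover the base distribution (the cold-start case)."""
    weighted = {w: p * style_counts.get(w, 1.0) for w, p in word_probs.items()}
    total = sum(weighted.values())
    # inverse-CDF sampling over the unnormalized weights
    r, acc = rng.random() * total, 0.0
    for w, p in weighted.items():
        acc += p
        if acc >= r:
            return w
    return w  # guard against floating-point underflow
```

A user who has historically overused a word thus becomes more likely to "say" it again in the generated review, which is the personalization effect the decoder aims for.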

Dataset
To empirically validate our proposed model, we implemented RevGAN on three subsets of the Amazon Review Dataset (He and McAuley, 2016), namely Musical Instrument, Automotive and Patio, which include 44,006 reviews written by 3,697 users on 6,039 items.

Experiment Settings
The self-attentive recursive autoencoder is implemented with bidirectional GRUs with embedding dimension 300. GRU parameters and word embeddings are initialized from a uniform distribution over [-0.1, 0.1]. The initial learning rate is 1e-3 and is halved every 50 epochs until convergence. The batch size is set to 128 (128 sentences across review documents) for batch normalization (Ioffe and Szegedy, 2015). Sentences are padded to the maximum length within each batch. Gradient clipping (Gulrajani et al., 2017) is adopted by scaling the gradients when the norm exceeds the threshold of 1. For the recursive structure, the parameter settings are the same as for the sentence-level autoencoder, except that the size of the weight matrix is 600 × 300. The beam size for beam search (Wiseman and Rush, 2016) is fixed at 3. To validate the emotion label of each review, we use the state-of-the-art sentiment classifier VADER (Gilbert, 2014) to label the sentiment score of each review. The baseline SeqGAN, RankGAN and LeakGAN models are implemented through the Texygen toolkit (Zhu et al., 2018). The generator and the conditional discriminator of the GAN are both multilayer perceptrons (MLP) (Rumelhart et al., 1985) with hidden dimension 300. Their parameters are initialized from the normal distribution N(0, 0.02). The learning rates of the generator and the conditional discriminator are fixed at 5e-5 and 1e-5 respectively. During each epoch, the generator G iterates 5 times while the discriminator D iterates once. The model updates 30,000 times in total. We implemented our model on a Tesla K80 GPU in a PyTorch environment, where the whole training takes about 12 hours.
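Two of the settings above, gradient clipping at norm 1 and halving the 1e-3 learning rate every 50 epochs, can be sketched in a framework-independent way; the helper names are ours, not part of any library.

```python
def clip_gradients(grads, max_norm=1.0):
    """Scale the gradient vector so its global L2 norm never exceeds max_norm."""
    norm = sum(g * g for g in grads) ** 0.5
    if norm > max_norm:
        scale = max_norm / norm
        grads = [g * scale for g in grads]
    return grads

def learning_rate(epoch, base_lr=1e-3):
    """Schedule used above: initial learning rate 1e-3, halved every 50 epochs."""
    return base_lr * (0.5 ** (epoch // 50))
```

Clipping by rescaling (rather than truncating each component) preserves the gradient direction while bounding the step size, which stabilizes adversarial training.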

Evaluation Metrics
To demonstrate that our proposed model indeed achieves state-of-the-art review generation performance, we employ various evaluation metrics: distribution-based Log-Likelihood and Perplexity, coherence-based Word Mover Distance (WMD) (Kusner et al., 2015), n-gram-based BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004), contextual label accuracy, and human evaluation. Specifically, following the same metric as Dong et al. (2017), we use sentiment accuracy, the ratio of reviews whose sentiment matches the given label, as an important indication of the controllability of the generator: the higher the sentiment accuracy, the better the model provides supervised generation results. Besides, we conduct a human evaluation to assess the quality and helpfulness of the generated results: we randomly select the same number of reviews from the original dataset, from RevGAN's generations and from the baseline models' generations, and ask participants to judge which ones are machine-generated and which ones are written by humans. A significance test shows that our generated reviews are indeed indistinguishable from the original data.
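The sentiment accuracy metric described above reduces to a simple ratio; a minimal sketch, with placeholder label values.

```python
def sentiment_accuracy(predicted_sentiments, designated_labels):
    """Fraction of generated reviews whose classifier-assigned sentiment
    matches the designated sentiment label."""
    assert len(predicted_sentiments) == len(designated_labels)
    hits = sum(p == d for p, d in zip(predicted_sentiments, designated_labels))
    return hits / len(designated_labels)
```

In the pipeline above, the predicted sentiments would come from a classifier such as VADER applied to the generated text, while the designated labels are the conditions fed to the discriminator.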

Baseline Models
To demonstrate that our proposed model indeed achieves state-of-the-art review generation performance, we compare it across the evaluation metrics with several important benchmarks, including charRNN (Yao et al., 2017), MLE (Bahl et al., 1990), SeqGAN, LeakGAN (Guo et al., 2017), RankGAN (Lin et al., 2017) and Attr2Seq (Dong et al., 2017). Besides, to verify the effectiveness of combining the three novel components into the RevGAN model, we also compare the performance of RevGAN+CD (Conditional Discriminator), RevGAN+CD+SA (Self-Attentive Autoencoder) and RevGAN+CD+SA+PD (Personalized Decoder). The results show that our model outperforms all the selected benchmark models significantly and consistently.

Significance Testing
We conduct significance tests to identify whether the difference between two review generation algorithms indicates a difference in true system quality. Following Koehn (2004), we use bootstrap re-sampling to obtain the asymptotic standard error of the estimated value of each evaluation metric. A paired two-sample t-test is then used to test whether the population means differ statistically.
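A compact sketch of bootstrap re-sampling for the standard error of a corpus-level mean metric; the resample count and seed are arbitrary choices of ours.

```python
import random

def bootstrap_se(scores, n_resamples=1000, rng=random.Random(42)):
    """Bootstrap standard error of a corpus-level mean metric (e.g. a
    per-review BLEU score): resample the test set with replacement and
    measure the spread of the resampled means."""
    n = len(scores)
    means = []
    for _ in range(n_resamples):
        sample = [scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    grand = sum(means) / n_resamples
    var = sum((m - grand) ** 2 for m in means) / (n_resamples - 1)
    return var ** 0.5
```

The resulting standard errors feed directly into the paired t-test on the per-system metric means.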
Regarding the indistinguishability of the generated results, we conduct a chi-square test of independence to test whether there is a significant association between the human assessment and the actual source. As the test results are statistically insignificant, we can claim that our generated reviews are indistinguishable from the original ones in the sense that humans cannot tell them apart.

Evaluation of review generation
To illustrate the superiority and generalizability of our RevGAN model, we apply it to three different domains of the Amazon Review Dataset: musical instruments, automotive and patio products. The summary of our experimental results is reported in Table 1, from which we can clearly observe that, compared with the baseline text-generation models, our proposed RevGAN model performs significantly better in sentence quality and coherence. On average, we observe a 5% improvement in Word Mover Distance (WMD), an 80% improvement in BLEU and a 10% improvement in ROUGE. Besides, the comparison between the different variations of the RevGAN model verifies that the combination of all three novel components gives the best generation performance. Using the bootstrap re-sampling techniques introduced in the previous section, we conduct hypothesis tests, all of which confirm the significant improvement of our RevGAN model. In that sense, we claim that our model achieves state-of-the-art results on review generation. We also showcase some generated reviews at the end of this section.

Evaluation of controllable generation
In this part, we evaluate the controllable generation performance of our proposed model by pre-setting the contextual labels. We fix the sentiment label conditioned on the discriminator as 'positive' and 'negative' respectively, and then evaluate the sentiment accuracy of the generated reviews. The results are reported in Table 2, where our model beats the state-of-the-art algorithm Attr2Seq (Dong et al., 2017) and the classical Conditional GAN, with the same sentiment conditioned on these two models as well.

Evaluation of Personalized Generation
Besides the statistical and semantic metrics, we also design an empirical study to test the personalization performance of our generated reviews. Each questionnaire contains 15 randomly selected reviews: 5 from the original dataset, 5 from RevGAN with personalization and 5 from RevGAN without personalization. Participants are asked to judge which reviews are machine-generated and which are written by humans, and to rate the helpfulness of each review on a 1-5 scale. We sent out 100 questionnaires in total and received 36 responses; the resulting confusion matrix is reported in Table 3. To test whether the RevGAN-generated reviews are statistically indistinguishable from the original ones, we run a chi-square significance test, which shows that, at the 95% confidence level, there is no statistical difference between our machine-generated reviews and those actually written by humans.
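The chi-square test of independence on a 2x2 confusion matrix can be computed directly. The statistic is compared against the critical value 3.841 (df = 1, alpha = 0.05); the table values below are illustrative, not the study's actual responses.

```python
def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2x2 contingency table
    (rows: actual source human/machine; columns: judged human/machine).
    Values below the critical value 3.841 (df=1, alpha=0.05) mean the
    judgments are statistically independent of the true source."""
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (table[i][j] - expected) ** 2 / expected
    return stat
```

An insignificant statistic is exactly the outcome reported above: participants' human/machine judgments carry no information about which reviews were actually generated.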
Besides, the results indicate that our generated reviews show no statistical difference in helpfulness scores from those written by consumers for the corresponding products, with average helpfulness scores of 3.10 and 3.03 for machine-generated and real-world reviews respectively. Thus, based on the t-test, we cannot reject the hypothesis that there is no statistical difference in helpfulness between the two groups.

Showcase
We present several showcases of our generated results with different contextual labels and domains in Table 4. Additionally, we showcase the modification process in Table 5, where the personalized generated reviews tend to use more words from the user's history corpus. Besides, we check whether reviews generated by RevGAN exhibit the same linguistic features as organic ones by testing two major statistical laws of linguistics (Altmann and Gerlach, 2016): Zipf's law (Zipf, 1935) and Heaps' law (Herdan, 1964). The former states that if words are ranked by their frequency of appearance $r = 1, 2, \cdots, V$, the frequency $f(r)$ of the $r$-th word scales with the rank as $f(r) = \beta_Z r^{-\alpha_Z}$, while the latter states that the number of distinct words $V$ scales with the database size $N$, measured in the total number of words, as $V \sim N^{\alpha_H}$. As shown in Figures 3 and 4, both the original reviews and the generated ones satisfy these two linguistic laws.
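Both laws can be checked empirically from token counts; a minimal sketch (fitting the exponents, e.g. by log-log regression, is omitted, and the helper names are ours).

```python
from collections import Counter

def zipf_profile(tokens, top_k=10):
    """Rank-frequency pairs (r, f(r)); under Zipf's law f(r) ~ beta * r^(-alpha),
    so log f falls roughly linearly in log r."""
    freqs = [f for _, f in Counter(tokens).most_common(top_k)]
    return list(enumerate(freqs, start=1))

def heaps_curve(tokens):
    """Vocabulary size V after the first N tokens; Heaps' law predicts V ~ N^alpha."""
    seen, curve = set(), []
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok)
        curve.append((n, len(seen)))
    return curve
```

Plotting both profiles on log-log axes for the organic and generated corpora and comparing the slopes is the comparison behind Figures 3 and 4.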

Conclusion
In this paper, we proposed RevGAN, a model that automatically generates personalized product reviews from product embeddings rather than labels, producing results targeted at specific products and users. To do so, we incorporate three novel components: the self-attentive recursive autoencoder, the conditional discriminator and the personalized decoder. Experimental results show that RevGAN performs significantly better than the baseline models and that our generated reviews are very similar to organically produced user reviews, as shown in Section 5.2 and Table 3.

Table 4: Examples of generated reviews.
Domain | Sentiment | Generated Review
Musical | Positive | These chords got me to play my guitar better in less than one day. An excellent overdrive and an incredible value. I'll use them all the time.
Musical | Negative | These pedals are not budget friendly. If you are looking for classic rock sounds, you won't love these expensive hardware.
Automotive | Positive | I bought two sets of seat covers and this roll kit. Both fit well and look good. They were much easier to slide over the leather seats.
Automotive | Positive | These seat covers look good and seem to be made of a good quality material. For the price, these are a great buy.
Patio | Negative | It is not recommended. The cover is a little tight and hard to open and close.
Patio | Positive | These traps have caught more mice than ever give. You only need a little peanut butter for the bait and tomcat would caught so many mice in one night. Will order again if needed.
As part of future work, we would like to extend the review generation process to receive several keywords from users as input and generate reviews based on this prior information. Another direction for future research lies in developing novel methods that distinguish the type of generated reviews described in this paper from organic reviews.