Towards Automatic Generation of Product Reviews from Aspect-Sentiment Scores

Data-to-text generation is essential for machine writing applications. Recent deep learning models, such as Recurrent Neural Networks (RNNs), have shown promising results on related text generation tasks. However, little work has been done on the automatic generation of long reviews from user opinions. In this paper, we introduce a deep neural network model to generate long Chinese reviews from aspect-sentiment scores representing users' opinions. We conduct our study within the framework of encoder-decoder networks, and we propose a hierarchical structure with aligned attention in the Long Short-Term Memory (LSTM) decoder. Experiments show that our model outperforms retrieval based baseline methods, and also beats the sequential generation models in qualitative evaluations.


Introduction
Text generation is a central task in NLP. Progress in text generation will contribute substantially to building strong artificial intelligence (AI) systems that can comprehend and compose human language.
Review generation is an interesting subtask of data-to-text generation. With the growth of online shopping, customers are often reluctant to put in the effort of writing reviews, while sellers want to benefit from good reviews, so review generation can be genuinely useful and is worthy of study. However, recent research on text generation has mainly focused on generating weather reports, financial news, sports news (Konstas, 2014; Kim et al., 2016), and so on. The task of review generation still needs further exploration.
Consider how we write review texts: we usually have sentiment polarities with respect to product aspects in mind before we speak or write. Inspired by this, we focus on the study of review generation from structured data consisting of aspect-sentiment scores.
Traditional generation models are mainly rule based, and handcrafting rules is time consuming. Thanks to the rapid development of neural networks and deep learning, text generation has achieved breakthroughs in recent years in many domains, e.g., image-to-text (Karpathy and Fei-Fei, 2015; Xu et al., 2015), video-to-text (Yu et al., 2016), and text-to-text (Sutskever et al., 2014; Li et al., 2015). More and more work shows that neural generation models can produce meaningful and grammatical texts (Bahdanau et al., 2015; Sutskever et al., 2011). However, recent studies of text generation mainly focus on generating short, sentence-level texts. Handling long texts remains challenging for modern sequential generation models, and very little work has been done on generating long reviews.
In this paper, we aim to address the challenging task of long review generation within the encoder-decoder neural network framework. Based on this framework, we investigate different models to generate review texts. Among these models, the encoders are typically Multi-Layer Perceptrons (MLPs) that embed the input aspect-sentiment scores, while the decoders are RNNs with LSTM units that differ in architecture. We propose a hierarchical generation model with a new attention mechanism, which shows better results than the other models in both automatic and manual evaluations on a real Chinese review dataset.
To the best of our knowledge, our work is the first attempt to generate long review texts from aspect-sentiment scores with neural network models. Experiments show that it is feasible to generate long product reviews with our model.

Problem Definition and Corpus
To have a better understanding of the task investigated in this study, we'd like to introduce the corpus first.
Without loss of generality, we use Chinese car reviews in this study; reviews in other domains can be processed and generated in the same way. The Chinese car reviews are crawled from the website AutoHome. Each review text contains eight sentences describing eight aspects, respectively: 空间/Space, 动力/Power, 控制/Control, 油耗/Fuel Consumption, 舒适度/Comfort, 外观/Appearance, 内饰/Interior, and 性价比/Price. Each review text corresponds to these eight aspects and their sentiment ratings, and the review sentences are aligned with the aspects and ratings, so we may split the whole review into eight sentences when needed. Note that the sentences in each review are correlated with each other: if we generated them as independent sentences with respect to individual aspect-sentiment scores, the resulting review would likely read as incoherent when put together. We should therefore keep each review text as a whole and generate the long, complete review at one time, rather than generating each review sentence independently. Specifically, we define our task as generating long Chinese car reviews from eight aspect-sentiment scores.
The raw data are badly formatted. To clean the data, we keep only the reviews whose sentences correspond to all eight aspects, and we skip reviews whose sentences are too long or too short, accepting 10 to 40 words per sentence. We use Jieba for Chinese word segmentation. Note that each review text contains eight sentences, and each sentence has 24 Chinese characters on average, so the review texts in our corpus are actually very long, about 195 Chinese characters per review.
Finally, we obtain 43,060 pairs of aspect-sentiment vectors and corresponding review texts, among which there are 8,340 distinct inputs. We then split the data randomly into a training set and a test set. The training set contains 32,195 pairs (about 75%) covering 6,290 distinct inputs, while the test set contains the remaining 10,865 pairs with 2,050 distinct inputs. The test set does not overlap with the training set with respect to the input aspect-sentiment vectors.
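Since the two sets must be disjoint at the level of input vectors, one way to realize the split is to group pairs by input vector and assign whole groups to one side. A minimal Python sketch under that assumption (the paper does not describe its exact procedure):

```python
import random
from collections import defaultdict

def split_by_input(pairs, train_frac=0.75, seed=42):
    """Split (input_vector, review) pairs so that no input vector
    appears in both train and test (hypothetical reconstruction)."""
    groups = defaultdict(list)
    for vec, review in pairs:
        groups[tuple(vec)].append((vec, review))
    keys = list(groups)
    random.Random(seed).shuffle(keys)
    target = train_frac * len(pairs)
    train, test = [], []
    for key in keys:
        # Assign whole groups until the train side reaches ~75% of pairs.
        (train if len(train) < target else test).extend(groups[key])
    return train, test
```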
Furthermore, we transform the input vector into aspect-oriented vectors as input for our models. For each aspect, we use an additional one-hot vector to represent the aspect and append it to the input vector. For example, for the aspect Power, which corresponds to the one-hot vector [0,1,0,0,0,0,0,0], and a review with input vector [-1.0,-0.5,0.0,0.5,1.0,0.5,0.0,-0.5], the new input vector with respect to this aspect is [-1.0,-0.5,0.0,0.5,1.0,0.5,0.0,-0.5,0,1,0,0,0,0,0,0]. Each new input vector is aligned with one review sentence. In this way, we obtain eight new vectors with respect to the eight aspects as input for our models.
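This transformation is mechanical; a short Python sketch reproducing the example above:

```python
N_ASPECTS = 8  # Space, Power, Control, Fuel Consumption,
               # Comfort, Appearance, Interior, Price

def aspect_inputs(v_s):
    """Build the eight aspect-oriented vectors by appending a
    one-hot aspect indicator to the sentiment vector v_s."""
    out = []
    for i in range(N_ASPECTS):
        one_hot = [0] * N_ASPECTS
        one_hot[i] = 1
        out.append(v_s + one_hot)
    return out

v_s = [-1.0, -0.5, 0.0, 0.5, 1.0, 0.5, 0.0, -0.5]
print(aspect_inputs(v_s)[1])  # the vector for Power (index 1):
# [-1.0, -0.5, 0.0, 0.5, 1.0, 0.5, 0.0, -0.5, 0, 1, 0, 0, 0, 0, 0, 0]
```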

Preliminaries
In this section, we will give a brief introduction to LSTM Network (Hochreiter and Schmidhuber, 1997).

LSTM Network
An LSTM network contains LSTM units in an RNN; an LSTM unit is a recurrent network unit that excels at remembering values over either long or short durations of time (Graves, 2012b; Sundermeyer et al., 2012). It contains an input gate, a forget gate, an output gate, and a memory cell, denoted at time t as $i_t$, $f_t$, $o_t$, $c_t$, respectively. In an LSTM network, the states propagate as:

$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$
$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$
$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$
$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c)$
$h_t = o_t \odot \tanh(c_t)$

where $\sigma$ is the logistic sigmoid and $\odot$ denotes element-wise multiplication.
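A minimal NumPy sketch of a single step following these equations (the stacked 4d-row weight layout and all names here are our illustration, not the paper's):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (4d, input_dim), U: (4d, d), b: (4d,)
    stack the input/forget/output/candidate parameters row-wise."""
    d = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[:d])            # input gate i_t
    f = sigmoid(z[d:2*d])         # forget gate f_t
    o = sigmoid(z[2*d:3*d])       # output gate o_t
    g = np.tanh(z[3*d:])          # candidate cell value
    c = f * c_prev + i * g        # memory cell c_t
    h = o * np.tanh(c)            # hidden state h_t
    return h, c
```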
In the past few years, many generation models based on LSTM networks have given promising results in different domains (Xu et al., 2015; Shang et al., 2015; Wu et al., 2016). Compared to other RNN units, such as the GRU (Chung et al., 2014), the LSTM is considered among the most effective in most cases.

Notations
We define our task as receiving a vector of aspect-sentiment scores $V_s$ and generating a review text, i.e., a long sequence of words $Y = \{y_1, y_2, \ldots, y_{|Y|-1}, EOS\}$ (EOS is the special word representing the end of a sequence). As mentioned in Section 2, we also transform an input vector $V_s$ into a series of new input vectors $\{V_1, V_2, \ldots, V_8\}$ with respect to the eight aspects. More specifically, to obtain each $V_i$ we append a one-hot vector representing the specific aspect to $V_s$; that is, $V_i = [V_s; O_i]$, where $O_i$ is a one-hot vector of size eight whose $i$-th element is 1.
We have three different kinds of embeddings: $E_W$ stands for word embedding, $E_V$ stands for the embedding of the input vector produced by an MLP encoder, and $E_C$ stands for the embedding of context sentences. Subscripts specify the particular word, vector, or context.
In the LSTM, $h$ is a hidden vector, $x$ is an input vector, $P$ is the probability distribution over words, $y$ is the predicted word, and $t$ is the time step.

Sequential Review Generation Models (SRGMs)
SRGMs are similar to the popular Seq2Seq models (Chung et al., 2014; Sutskever et al., 2011), except that they receive structured data (such as aspect-sentiment scores) as input and encode it with an MLP. The encoder's output $E_{V_s}$ is treated as the initial hidden state $h_0$ of the decoder, and the initial input vector is set to the word embedding of BOS (the special word representing the beginning of a sequence). The decoder then proceeds as a standard LSTM network.
At time $t$ ($t \geq 1$), the hidden state $h_t$ of the decoder is used to predict a distribution over words via a softmax layer. We choose the word with the highest probability as the word predicted at time $t$, and this word is fed back as the decoder input at time $t + 1$.
This procedure can be formulated as $y_t = \arg\max_w P_{t,w}$.

[Figure 1: The architecture of SRGM-w.]
In each training step, we adopt the negative log-likelihood loss function.
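A self-contained NumPy sketch of this greedy decoding loop (the helper names, the way the LSTM step function is passed in, and the output projection are our illustration):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def greedy_decode(lstm_step, h0, c0, embed, W_out, b_out,
                  bos_id, eos_id, max_len=200):
    """Greedy decoding: project the hidden state to a word
    distribution with a softmax layer, pick the argmax word,
    and feed its embedding back in at the next step."""
    h, c, y = h0, c0, bos_id
    words = []
    for _ in range(max_len):
        h, c = lstm_step(embed[y], h, c)
        p = softmax(W_out @ h + b_out)   # P_t over the vocabulary
        y = int(np.argmax(p))            # y_t = argmax_w P_{t,w}
        if y == eos_id:
            break
        words.append(y)
    return words

# Training minimizes the negative log-likelihood of the gold word
# at each step: loss_t = -log(p[gold_word_t]).
```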
However, Sutskever et al. (2014) and Pouget-Abadie et al. (2014) have shown that a standard LSTM decoder does not perform well in generating long sequences. Therefore, besides treating the review as one whole sequence, we also try splitting the reviews into sentences, generating each sentence separately, and then concatenating the generated sentences. We name the sequential model that generates the whole review SRGM-w, and the one that generates separate sentences SRGM-s.

Hierarchical Review Generation Models (HRGMs)
Inspired by Li et al. (2015), we build a hierarchical LSTM decoder based on the SRGMs. Note that the hierarchical models use two different LSTM units: the superscript $S$ denotes the sentence-level LSTM, and the superscript $P$ denotes the paragraph-level one. $t$ is the time step in the sentence decoder, while $T$ is the time step in the paragraph decoder; both time step symbols appear as subscripts.
A one-hidden-layer MLP encodes the input vector into $E_{V_s}$. $LSTM^P$ receives $E_{V_s}$ as its initial hidden state, and its initial input $x^P_1$ is a zero vector. At time $T$ ($T \geq 1$), the output of $LSTM^P$ is used as the initial hidden state of $LSTM^S$, which then works just like the LSTM decoder in SRGMs. The final output of $LSTM^S$ is treated as the embedding of the context sentences $E_{C_T}$, which in turn is the input of $LSTM^P$ at time $T + 1$. We call this hierarchical model HRGM-o.
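A high-level Python sketch of this decoding loop (all function names are our illustration; lstm_p stands for a single step of LSTM^P, and decode_sentence runs the LSTM^S decoder):

```python
import numpy as np

def hrgm_o_decode(mlp_encode, lstm_p, decode_sentence, v_s,
                  n_sentences=8, d=500):
    """HRGM-o sketch: the paragraph-level LSTM provides the initial
    hidden state of the sentence-level LSTM; the sentence decoder's
    final output E_{C_T} becomes the paragraph input at T + 1."""
    h_p = mlp_encode(v_s)      # E_{V_s} initializes LSTM^P
    c_p = np.zeros(d)
    x_p = np.zeros(d)          # initial paragraph-level input is zero
    review = []
    for T in range(n_sentences):
        h_p, c_p = lstm_p(x_p, h_p, c_p)
        # decode_sentence runs the greedy LSTM^S loop from h_p and
        # returns the generated words plus its final output vector.
        words, e_c = decode_sentence(h_p)
        review.append(words)
        x_p = e_c              # E_{C_T} feeds LSTM^P at time T + 1
    return review
```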
In the experimental results of HRGM-o, we find that the model has a drawback: in some test cases, the output texts miss some of the important input aspects.
Many previous studies have shown that the attention mechanism yields better results by considering the context (Bahdanau et al., 2015; Fang et al., 2016; Li et al., 2015). We therefore apply attention to the generation of each sentence, aligned with the sentence's main aspect.
Different from the attention mechanisms in previous studies, in our setting we have the alignment between aspect-sentiment ratings and sentences, which provides a natural form of attention to use in the generation process. By feeding the additional input vector $V_T$ at each time step $T$, we obtain the initial hidden state of $LSTM^S$ from two source vectors, $E_{V_T}$ and $h^P_T$. We therefore train a gate vector $g$ to balance the two sources of information. $V_T$ is encoded by an MLP of the same form as the input encoder but with separate parameters, and the initial hidden state of $LSTM^S$ becomes a gated combination of $E_{V_T}$ and $h^P_T$.
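A minimal sketch of one plausible form of this gate; the sigmoid parameterization below is our assumption, since the paper's exact equations are not reproduced here:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_init_state(e_v, h_p, W_g, b_g):
    """Combine the aspect encoding e_v (E_{V_T}) and the paragraph
    state h_p (h^P_T) into the sentence decoder's initial hidden
    state. The gate form is an assumption, not the paper's exact
    published equations."""
    g = sigmoid(W_g @ np.concatenate([e_v, h_p]) + b_g)  # gate vector g
    return g * e_v + (1.0 - g) * h_p
```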
Based on all of these, we propose a hierarchical model with a special aligned attention mechanism as shown in Figure 2. We call the model HRGM-a.

Training Details
We implemented our models with TensorFlow r0.10 6 , and trained them on an NVIDIA TITAN X GPU (12 GB).
Due to hardware limitations, we only run experiments with one encoder layer and one LSTM layer. The batch size is 4 for HRGMs and 32 for SRGMs. The initial learning rate is set to 0.5, and we dynamically adjust the learning rate according to the loss value. Since experiments show that the size of the hidden layer does not affect the results in a regular way, we set all hidden sizes to 500.
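The paper only states that the learning rate is adjusted "according to the loss value"; a loss-plateau decay like the following sketch is one plausible reading (the schedule and constants are our assumption):

```python
def adjust_learning_rate(lr, losses, decay=0.5, patience=3, min_lr=1e-4):
    """Halve the learning rate when the loss has not improved over
    the last `patience` recorded values (hypothetical schedule)."""
    if len(losses) > patience and \
            min(losses[-patience:]) >= min(losses[:-patience]):
        lr = max(lr * decay, min_lr)
    return lr
```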
All the remaining parameters in our models are learned during training.

6 github.com/tensorflow/tensorflow/tree/r0.10

Baselines
Apart from SRGM-w and SRGM-s, we also developed several baselines for comparison.
• Rand-w: It randomly chooses a whole review from the training set.
• Rand-s: It randomly chooses a sentence for each aspect from the training set and concatenates the sentences to form a review.
• Cos: It finds the sentiment vector in the training set with the largest cosine similarity to the input vector, and then returns the corresponding review text (see the sketch after this list).
• Match: It finds a sentiment vector from the training set which has the maximum number of rating scores matching exactly with that in the input vector, and then returns the corresponding review text.
• Pick: It finds one sentence for each aspect respectively in the training set by matching the same sentiment rating, and then concatenates them to form a review.
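A minimal sketch of the two vector-retrieval baselines, assuming parallel lists of training vectors and review texts (names are ours):

```python
import numpy as np

def cos_baseline(v_in, train_vecs, train_reviews):
    """Cos: return the review whose training vector has the largest
    cosine similarity with the input vector."""
    v_in = np.asarray(v_in, dtype=float)
    sims = [np.dot(v_in, v) / (np.linalg.norm(v_in) * np.linalg.norm(v))
            for v in np.asarray(train_vecs, dtype=float)]
    return train_reviews[int(np.argmax(sims))]

def match_baseline(v_in, train_vecs, train_reviews):
    """Match: return the review whose training vector agrees with the
    input vector on the most rating positions."""
    matches = [sum(a == b for a, b in zip(v_in, v)) for v in train_vecs]
    return train_reviews[int(np.argmax(matches))]
```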
Generally speaking, the models in this paper fall into four classes. The first class contains the lower bound methods (Rand-w, Rand-s), which choose something from the training set at random. The second is retrieval based (Cos, Match, Pick), using similarity to decide what to return. The third consists of the sequential generation models based on RNNs (SRGM-w, SRGM-s). The last contains the hierarchical RNN models that handle whole-review generation (HRGM-o, HRGM-a).

Automatic Evaluation
We used the popular BLEU scores (Papineni et al., 2002) as evaluation metrics; BLEU has shown good consistency with human evaluation in many machine translation and text generation tasks. A high BLEU score means that many n-grams in the hypothesis texts match the gold-standard references. Here, we report BLEU-2 to BLEU-4 scores, and the evaluation is conducted after Chinese word segmentation.
The only parameter in BLEU is the weight vector W for the n-gram precisions. In this study, we use uniform weights (W_i = 1/n for the BLEU-n evaluation). For inputs with multiple reference reviews, we put all of them into the reference set of that input.
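A minimal sketch of this evaluation setup using NLTK's corpus_bleu (NLTK is our choice of tool here; the paper does not name its BLEU implementation):

```python
from nltk.translate.bleu_score import corpus_bleu

def bleu_n(references, hypotheses, n):
    """references: one list of reference token lists per input (all
    reviews sharing that input grouped together); hypotheses: one
    token list per input. Uniform weights W_i = 1/n."""
    weights = tuple(1.0 / n for _ in range(n))
    return corpus_bleu(references, hypotheses, weights=weights)

# e.g. BLEU-2 with two references for one generated review
# (toy segmented tokens):
refs = [[["空间", "很", "大"], ["空间", "不", "错"]]]
hyps = [["空间", "很", "大"]]
print(bleu_n(refs, hyps, 2))
```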
The results are shown in Table 1. The retrieval based baselines get low scores in the BLEU-2, BLEU-3, and BLEU-4 evaluations. Among these models, Cos and Match even get lower BLEU scores than the lower bound methods in some BLEU evaluations, which may be attributed to the sparsity of the data in the training set. Pick is better than the lower bound methods in all of the BLEU evaluations. Compared to the retrieval based baselines, SRGMs get higher scores in BLEU-2, BLEU-3, and BLEU-4. It is very promising that HRGMs get the highest BLEU scores in all evaluations, which demonstrates the effectiveness of the hierarchical structures. Moreover, HRGM-a achieves better scores than HRGM-o, which verifies the helpfulness of our proposed new attention mechanism. In all, the retrieval models and sequential generation models cannot handle long sequences well, but the hierarchical models can. The reviews generated by our models are of better quality according to the BLEU evaluations.

Human Evaluation
We also perform human evaluation to further compare these models. Human evaluation requires human judges to read all the results and give judgments with respect to different aspects of quality.
We randomly choose 50 different inputs from the test set. For each input, we compare the best models in each class, specifically Rand-s, Pick, SRGMs, HRGM-a, and the Gold (gold-standard) answer. We employ three subjects (excluding the authors of this paper) who have good knowledge of the car review domain to evaluate the outputs of the models. The outputs are shuffled before being shown to the subjects. Without knowing which output belongs to which model, the subjects are required to rate readability, accuracy, and usefulness on a 5-point Likert scale, where 5 points means "very satisfying" and 1 point means "very terrible". The ratings with respect to each aspect of quality are then averaged across the three subjects and the 50 inputs.
To be more specific, we define readability, accuracy, and usefulness as follows. Readability concerns the fluency and coherence of the texts. Accuracy indicates how well the review text matches the given aspects and sentiment ratings. Usefulness is more subjective: subjects decide whether they would accept the text when it is shown to them. The readability, accuracy, and even the length of the review text all have an effect on the usefulness metric.

The results are shown in Table 2. We can see that in the human evaluations, all the models get high scores in readability. The readability score of our model HRGM-a is very close to the highest readability score, achieved by Pick. Rand-s gets the worst scores for accuracy and usefulness, while the remaining models perform much better on these metrics. Compared to the strong baselines Pick and SRGM-s, our model is not the best in readability but performs better in accuracy and usefulness. These results also demonstrate the efficacy of our proposed models.

Samples
To get a clearer view of what we have done and have an intuitive judgment of the generated texts, we present some samples in Table 3.
In Table 3, the first three samples are the output texts of Gold-Standard, Pick, and our model HRGM-a for the same input. In the last sample, we change one rating in the input to show how our model changes the output according to this slight difference in the input.

[Table 3 excerpt (translation of one sample output; the marginal input ratings visible in this span are Comfort: 3, Appearance: 5, Interior: 4, Price: 4): Trunk space is quite large, but the rear space is relatively small, and the seat is not smooth. Power is also okay, as long as you are willing to give it gas. The steering wheel has high precision. Probably because of the thin tires, the fuel consumption of the automatic transmission is a bit high: around 10 in the city and about 7 on the highway. The seat is still quite comfortable, but there is a lot of noise when driving; I suggest the car have better sound insulation. Little UNK's appearance is better than others in the same class, which is quite good. I especially like the Tomahawk wheels. The materials are okay; there is occasionally a little abnormal sound. The price of the car is acceptable, worthy of the price UNK. After all, the price is not that high.]
As we can see, Pick is a little better than our model HRGM-a in text length and content abundance. But the output of Pick has a few problems. For example, there is a serious logic problem in the reviews of Space and Comfort: the review says the car is narrow under Space but spacious under Comfort, which violates context consistency.
What's more, it gives an improper review for Comfort: although Comfort gets 3 points, the review sentence is somewhat positive, which can be considered a mismatch with the input. In contrast, our model produces the review text as a whole, and the text is aligned more appropriately with the input aspect-sentiment scores. All 3-point aspects get neutral or slightly negative reviews, while all 5-point aspects get clearly positive comments, and 4-point aspects get reviews biased towards the positive.
As for the last example, after changing the rating of Comfort from 3 points to 5 points, we can see that the other sentences do not change noticeably, but the review sentence for Comfort changes significantly from neutral to positive, which shows the power of our model.

Related Work
Several previous studies have attempted review generation (Tang et al., 2016; Lipton et al., 2015; Dong et al., 2017). They generate personalized reviews according to an overall rating, but they do not consider the product aspects or whether each generated sentence is produced as the user requires. The models they proposed are very similar to SRGMs, and the review texts they generate are not as long as ours. Therefore, our work can be regarded as a significant improvement over their research.
Much other research on text generation is also closely related to our work. Traditional approaches to text generation (Genest and Lapalme, 2012; Yan et al., 2011) mainly focus on grammars, templates, and so on. However, it is usually complicated to make every part of such a system work and cooperate perfectly, while today's end-to-end generation systems, such as those within the encoder-decoder framework (Sordoni et al., 2015), have distinct architectures and achieve promising performance.
Moreover, recent research on hierarchical structures has helped greatly in improving generation systems. Li et al. (2015) experimented with LSTM autoencoders to show the power of hierarchically structured LSTM networks to encode and decode long texts. Recent studies have also successfully generated Chinese poetry and Song iambics (Wang et al., 2016) with hierarchical RNNs.
The attention mechanism originated in the image domain (Mnih et al., 2014), but it is now widely used in all kinds of generation models in NLP (Bahdanau et al., 2015; Fang et al., 2016). Moreover, attention today is not identical to the original formulation; it is more a way of thinking than a fixed algorithm, and various changes can be made to construct a better model.

Conclusion and Future Work
In this paper, we design end-to-end models to tackle the automatic review generation task. Retrieval based methods have problems generating texts consistent with the input aspect-sentiment scores, while RNNs cannot deal well with long texts. To overcome these obstacles, we propose several models and find that our model with hierarchical structure and aligned attention can produce long reviews of high quality, outperforming the baseline methods.
However, there are still some problems in the texts generated by our models. In some generated texts, the contents are not rich enough compared to human-written reviews, which may be improved by applying diversity decoding methods (Vijayakumar et al., 2016). And there are a few logical problems in some generated texts, which may be improved by generative adversarial nets (Goodfellow et al., 2014) or reinforcement learning (Sutton and Barto, 1998).
In future work, we will apply our proposed models to text generation in other domains. As mentioned earlier, our models can be easily adapted for other data-to-text generation tasks, if the alignment between structured data and texts can be provided. We hope our work will not only be an exploration of review generation, but also make contributions to general data-to-text generation.