Personalized Review Generation By Expanding Phrases and Attending on Aspect-Aware Representations

In this paper, we focus on the problem of building assistive systems that can help users to write reviews. We cast this problem using an encoder-decoder framework that generates personalized reviews by expanding short phrases (e.g. review summaries, product titles) provided as input to the system. We incorporate aspect-level information via an aspect encoder that learns aspect-aware user and item representations. An attention fusion layer is applied to control generation by attending on the outputs of multiple encoders. Experimental results show that our model successfully learns representations capable of generating coherent and diverse reviews. In addition, the learned aspect-aware representations discover those aspects that users are more inclined to discuss and bias the generated text toward their personalized aspect preferences.


Introduction
Contextual, or 'data-to-text', natural language generation is one of the core tasks in natural language processing and has a considerable impact on various fields (Gatt and Krahmer, 2017). Within the field of recommender systems, a promising application is to estimate (or generate) personalized reviews that a user would write about a product, i.e., to discover their nuanced opinions about each of its individual aspects. A successful model could work (for instance) as (a) a highly-nuanced recommender system that tells users their likely reaction to a product in the form of text fragments; (b) a writing tool that helps users 'brainstorm' the review-writing process; or (c) a querying system that facilitates personalized natural language queries (i.e., to find items about which a user would be most likely to write a particular phrase). Some recent works have explored the review generation task and shown success in generating cohesive reviews (Dong et al., 2017; Ni et al., 2017; Zang and Wan, 2017). Most of these works treat the user and item identity as input; we seek a system with more nuance and more precision by allowing users to 'guide' the model via short phrases, or auxiliary data such as item specifications. For example, a review writing assistant might allow users to write short phrases and expand these key points into a plausible review.
Review text has been widely studied in traditional tasks such as aspect extraction (Mukherjee and Liu, 2012; He et al., 2017), extraction of sentiment lexicons (Zhang et al., 2014), and aspect-aware sentiment analysis (Wang et al., 2016; McAuley et al., 2012). These works are related to review generation since they can provide prior knowledge to supervise the generative process. We are interested in exploring how such knowledge (e.g. extracted aspects) can be used in the review generation task.
In this paper, we focus on designing a review generation model that is able to leverage both user and item information as well as auxiliary, textual input and aspect-aware knowledge. Specifically, we study the task of expanding short phrases into complete, coherent reviews that accurately reflect the opinions and knowledge learned from those phrases.
These short phrases could include snippets provided by the user, or manifest aspects about the items themselves (e.g. brand words, technical specifications, etc.). We propose an encoder-decoder framework that takes into consideration three encoders (a sequence encoder, an attribute encoder, and an aspect encoder), and one decoder. The sequence encoder uses a gated recurrent unit (GRU) network to encode text information; the attribute encoder learns a latent representation of user and item identity; finally, the aspect encoder finds an aspect-aware representation of users and items, which reflects user-aspect preferences and item-aspect relationships. The aspect-aware representation helps to discover what each user is likely to discuss about each item. Finally, the output of these encoders is passed to the sequence decoder with an attention fusion layer. The decoder attends on the encoded information and biases the model to generate words that are consistent with the input phrases and words belonging to the most relevant aspects.

Related Work
Review generation belongs to a large body of work on data-to-text natural language generation (Gatt and Krahmer, 2017), which has applications including summarization (See et al., 2017), image captioning (Vinyals et al., 2015), and dialogue response generation (Xing et al., 2017; Ghosh et al., 2017), among others. Among these, review generation is characterized by the need to generate long sequences and estimate high-order interactions between users and items. Several approaches have been recently proposed to tackle these problems. Dong et al. (2017) proposed an attribute-to-sequence (Attr2Seq) method that encodes user and item identities as well as rating information with a multi-layer perceptron; a decoder then generates reviews conditioned on this information. They also used an attention mechanism to strengthen the alignment between output and input attributes. Ni et al. (2017) trained a collaborative-filtering generative concatenative network to jointly learn the tasks of review generation and item recommendation. Zang and Wan (2017) proposed a hierarchical structure to generate long reviews; they assume each sentence is associated with an aspect score, and learn the attention between aspect scores and sentences during training. Our approach differs from these mainly in our goal of incorporating auxiliary textual information (short phrases, product specifications, etc.) into the generative process, which facilitates the generation of higher-fidelity reviews.
Another line of work related to review generation is aspect extraction and opinion mining (Park et al., 2015; Qiu et al., 2017; He et al., 2017; Chen et al., 2014). In this paper, we argue that the extra aspect (opinion) information extracted using these previous works can effectively improve the quality of generated reviews. We propose a simple but effective way to incorporate aspect information into the generative model.

Approach
We describe the review generation task as follows. Given a user $u$, item $i$, several short phrases $\{d_1, d_2, ..., d_M\}$, and a group of extracted aspects $\{A_1, A_2, ..., A_k\}$, our goal is to generate a review $(w_1, w_2, ..., w_T)$ that maximizes the probability $P(w_{1:T} \mid u, i, d_{1:M})$. To solve this task, we propose a method called ExpansionNet which contains two parts: 1) three encoders to leverage the input phrases and aspect information; and 2) a decoder with an attention fusion layer to generate sequences and align the generation with the input sources. The model structure is shown in Figure 1.

Sequence encoder, attribute encoder and aspect encoder
Our sequence encoder is a two-layer bi-directional GRU, as is commonly used in sequence-to-sequence (Seq2Seq) models. Input phrases first pass through a word embedding layer, then go through the GRU one token at a time, finally yielding a sequence of hidden states $\{e_1, e_2, ..., e_L\}$.
In the case of multiple phrases, these share the same sequence encoder and have different lengths $L$. To simplify notation, we only consider one input phrase in this section. The attribute encoder and aspect encoder both consist of two embedding layers and a projection layer. For the attribute encoder, we define two general embedding layers $E_u \in \mathbb{R}^{|U| \times m}$ and $E_i \in \mathbb{R}^{|I| \times m}$ to obtain the attribute latent factors $\gamma_u$ and $\gamma_i$; for the aspect encoder, we use two aspect-aware embedding layers $E'_u \in \mathbb{R}^{|U| \times k}$ and $E'_i \in \mathbb{R}^{|I| \times k}$ to obtain aspect-aware latent factors $\beta_u$ and $\beta_i$. Here $|U|$, $|I|$, $m$ and $k$ are the number of users, number of items, the dimension of attributes, and the number of aspects, respectively. After the embedding layers, the attribute and aspect-aware latent factors are concatenated and fed into a projection layer with tanh activation. The outputs of the attribute encoder and the aspect encoder are calculated as $\tanh(W_u[\gamma_u; \gamma_i] + b_u)$ and $\tanh(W_v[\beta_u; \beta_i] + b_v)$, respectively, where $W_u \in \mathbb{R}^{n \times 2m}$, $b_u \in \mathbb{R}^n$, $W_v \in \mathbb{R}^{n \times 2k}$, $b_v \in \mathbb{R}^n$ are learnable parameters and $n$ is the dimensionality of the hidden units in the decoder.
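The two projection steps described above can be sketched in a few lines. This is only an illustration in plain Python, not the authors' PyTorch implementation: the toy sizes, the random initialisation, and the helper names `random_matrix`, `project` and `encode` are all assumptions made for the example.

```python
import math
import random

def random_matrix(num_rows, num_cols, rng):
    """Small random table; used here for both embeddings and projection weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(num_cols)] for _ in range(num_rows)]

def project(weights, bias, vector):
    """Affine map followed by tanh, i.e. tanh(W x + b)."""
    return [math.tanh(sum(w * x for w, x in zip(row, vector)) + b)
            for row, b in zip(weights, bias)]

rng = random.Random(0)
num_users, num_items = 5, 7   # toy |U| and |I| (assumed values)
m, k, n = 4, 3, 6             # attribute dim, number of aspects, decoder hidden size

# Attribute embedding tables (|U| x m, |I| x m) and aspect-aware tables (|U| x k, |I| x k).
E_u, E_i = random_matrix(num_users, m, rng), random_matrix(num_items, m, rng)
Ea_u, Ea_i = random_matrix(num_users, k, rng), random_matrix(num_items, k, rng)

# Projection parameters: W_u is n x 2m, W_v is n x 2k.
W_u, b_u = random_matrix(n, 2 * m, rng), [0.0] * n
W_v, b_v = random_matrix(n, 2 * k, rng), [0.0] * n

def encode(user_id, item_id):
    """Look up both embeddings, concatenate user and item factors, then project."""
    gamma = E_u[user_id] + E_i[item_id]    # list concatenation: [gamma_u; gamma_i]
    beta = Ea_u[user_id] + Ea_i[item_id]   # [beta_u; beta_i]
    return project(W_u, b_u, gamma), project(W_v, b_v, beta)

attr_out, aspect_out = encode(user_id=2, item_id=3)
```

Both outputs have length $n$, so they can be summed with a decoder-sized state, which is what the initialisation of the decoder relies on.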

Decoder with attention fusion layer
The decoder is a two-layer GRU that predicts the target words given the start token. The hidden state of the decoder is initialized using the sum of the three encoders' outputs. The hidden state at time-step $t$ is updated via the GRU unit based on the previous hidden state and the input word. We write $h_0 \in \mathbb{R}^n$ for the decoder's initial hidden state and $h_t \in \mathbb{R}^n$ for the hidden state at time-step $t$.
To fully exploit the encoder-side information, we apply an attention fusion layer to summarize the output of each encoder and jointly determine the final word distribution. For the sequence encoder, the attention vector is defined as in many other applications (Luong et al., 2015), as a weighted sum of the encoder hidden states, $a^1_t = \sum_{j=1}^{L} \alpha^1_{tj} e_j$, where $a^1_t \in \mathbb{R}^n$ is the attention vector on the sequence encoder at time-step $t$, $\alpha^1_{tj}$ is the attention score over the encoder hidden state $e_j$ and decoder hidden state $h_t$, and $Z$ is the normalization term of the attention scores.
For the attribute encoder, the attention vector is calculated analogously as $a^2_t = \sum_{j \in \{u, i\}} \alpha^2_{tj} \gamma_j$, where $a^2_t \in \mathbb{R}^n$ is the attention vector on the attribute encoder, and $\alpha^2_{tj}$ is the attention score between the attribute latent factor $\gamma_j$ and decoder hidden state $h_t$.
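As a rough illustration of how such an attention vector is formed, the sketch below scores each encoded vector against the decoder state, normalizes the scores, and returns the weighted sum. The dot-product score is an assumption chosen for brevity (the text here does not state the score function), and `softmax` and `attend` are hypothetical helper names.

```python
import math

def softmax(scores):
    """Numerically stable softmax; the denominator plays the role of Z."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(keys, query):
    """Return (weights, context) for one decoder state over a list of vectors."""
    # Assumption: dot-product scoring stands in for the model's score function.
    scores = [sum(k * q for k, q in zip(key, query)) for key in keys]
    weights = softmax(scores)
    dim = len(keys[0])
    context = [sum(w * key[d] for w, key in zip(weights, keys)) for d in range(dim)]
    return weights, context

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy e_1..e_3
h_t = [2.0, 0.0]                                         # toy decoder hidden state
weights, a1_t = attend(encoder_states, h_t)
```

The same `attend` call could be applied to the two attribute latent factors instead of the encoder hidden states to obtain the attribute-side vector.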
Inspired by the copy mechanism (Gu et al., 2016; See et al., 2017), we design an attention vector $a^3_t \in \mathbb{R}^k$ that estimates the probability that each aspect will be discussed in the next time-step. It is computed from $s_{ui} \in \mathbb{R}^k$, the aspect importance considering the interaction between $u$ and $i$, and from $e_t$, the decoder input after the embedding layer at time-step $t$; the resulting $a^3_t$ is a probability vector used to bias each aspect at time-step $t$. Finally, the first two attention vectors are concatenated with the decoder hidden state at time-step $t$ and projected to obtain the output word distribution $P_v$. The attention scores from the aspect encoder are then directly added to the aspect words in the final word distribution. The output probability for word $w$ at time-step $t$ is given by $P(w_t) = P_v[w_t] + a^3_t[k] \cdot \mathbb{1}_{w_t \in A_k}$, where $w_t$ is the target word at time-step $t$, $a^3_t[k]$ is the probability that aspect $k$ will be discussed at time-step $t$, $A_k$ represents all words belonging to aspect $k$ and $\mathbb{1}_{w_t \in A_k}$ is a binary variable indicating whether $w_t$ belongs to aspect $k$.
During inference, we use greedy decoding by choosing the word with maximum probability, denoted as $y_t = \arg\max_{w_t} \mathrm{softmax}(P(w_t))$. Decoding finishes when an end token is encountered.
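The additive aspect bias can be pictured with a toy vocabulary. The sketch below is an assumption-laden illustration: the scores, the aspect word sets and the function name `add_aspect_bias` are invented for the example, and a word that appears in several aspects simply receives each matching aspect's score.

```python
def add_aspect_bias(word_scores, aspect_probs, aspect_words):
    """Add the score of aspect k to every vocabulary word that belongs to aspect k."""
    biased = dict(word_scores)
    for k, words in enumerate(aspect_words):
        for word in words:
            if word in biased:
                biased[word] += aspect_probs[k]
    return biased

# Toy values (assumed): base word scores, two aspect probabilities, two aspect word sets.
P_v = {"screen": 0.2, "battery": 0.1, "the": 0.5}
a3_t = [0.7, 0.3]
A = [{"screen", "display"}, {"battery"}]

biased = add_aspect_bias(P_v, a3_t, A)
```

In this toy case "screen" overtakes "the" once the first aspect's score is added, which is the intended effect of steering generation toward aspects the user is likely to discuss.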
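Greedy decoding as described above can be sketched as a short loop. The step function interface, the `max_len` cap and the scripted `TOY_MODEL` table are assumptions for illustration; a real decoder would return scores from the GRU and attention fusion layer. Because softmax is monotonic, taking the argmax of the raw scores selects the same word as taking it after softmax.

```python
def greedy_decode(step_fn, start_token, end_token, max_len=100):
    """Repeatedly pick the highest-scoring next word until the end token appears."""
    tokens = []
    prev, state = start_token, None
    for _ in range(max_len):            # max_len is a safety cap (assumed value)
        scores, state = step_fn(prev, state)
        prev = max(scores, key=scores.get)
        if prev == end_token:
            break
        tokens.append(prev)
    return tokens

# Hypothetical stand-in for the decoder: next-word scores keyed by the previous word.
TOY_MODEL = {
    "<s>": {"i": 0.9, "love": 0.1},
    "i": {"love": 0.8, "</s>": 0.2},
    "love": {"it": 0.7, "</s>": 0.3},
    "it": {"</s>": 0.9, "i": 0.1},
}

def toy_step(prev, state):
    return TOY_MODEL[prev], state

review = greedy_decode(toy_step, "<s>", "</s>")
```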

Experiments
We consider a real-world dataset from Amazon Electronics (McAuley et al., 2015) to evaluate our model. We convert all text into lowercase, add start and end tokens to each review, and perform tokenization using NLTK. We discard reviews with length greater than 100 tokens and consider a vocabulary of 30,000 tokens. After preprocessing, the dataset contains 182,850 users, 59,043 items, and 992,172 reviews (sparsity 99.993%), which is much sparser than the datasets used in previous works (Dong et al., 2017; Ni et al., 2017). On average, each review contains 49.32 tokens as well as a short-text summary of 4.52 tokens. In our experiments, the basic ExpansionNet uses these summaries as input phrases. We split the dataset into training (80%), validation (10%) and test sets (10%). All results are reported on the test set.
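The preprocessing steps listed above might look roughly like the following sketch. Several details are assumptions rather than facts from the paper: whitespace splitting stands in for NLTK tokenization so the example has no dependencies, the length limit is applied before the start and end tokens are added, the split is a seeded random shuffle, and the names `preprocess`, `<s>` and `</s>` are illustrative.

```python
import random
from collections import Counter

def preprocess(reviews, max_len=100, vocab_size=30000, seed=0):
    """Lowercase, tokenize, filter long reviews, build a vocabulary, split 80/10/10."""
    tokenized = []
    for text in reviews:
        tokens = text.lower().split()     # assumption: stand-in for NLTK tokenization
        if len(tokens) > max_len:         # discard reviews longer than max_len tokens
            continue
        tokenized.append(["<s>"] + tokens + ["</s>"])

    counts = Counter(tok for review in tokenized for tok in review)
    vocab = {tok for tok, _ in counts.most_common(vocab_size)}

    rng = random.Random(seed)
    rng.shuffle(tokenized)                # assumption: random split
    n_train = int(0.8 * len(tokenized))
    n_valid = int(0.1 * len(tokenized))
    train = tokenized[:n_train]
    valid = tokenized[n_train:n_train + n_valid]
    test = tokenized[n_train + n_valid:]
    return vocab, train, valid, test

toy_reviews = ["Great tablet number %d" % i for i in range(10)]
toy_reviews.append(" ".join(["long"] * 101))   # dropped: exceeds 100 tokens
vocab, train, valid, test = preprocess(toy_reviews)
```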

Aspect Extraction
We use the method of He et al. (2017) to extract 15 aspects and consider the top 100 words from each aspect. Table 2 shows 10 inferred aspects and representative words (inferred aspects are manually labeled). ExpansionNet calculates an attention score based on the user and item aspect-aware representation, then determines how strongly these representative words are biased in the output word distribution.

Experiment Details
We use PyTorch to implement our model. Parameter settings are shown in Table 1. For the attribute encoder and aspect encoder, we set the dimensionality to 64 and 15 respectively. For both the sequence encoder and decoder, we use a 2-layer GRU with hidden size 512. We also add dropout layers before and after the GRUs. The dropout rate is set to 0.1. During training, the input sequences of the same source (e.g. review, summary) inside each batch are padded to the same length.

Performance Evaluation
We evaluate the model on six automatic metrics (Table 3): Perplexity, BLEU-1/BLEU-4, ROUGE-L and Distinct-1/2 (percentage of distinct unigrams and bi-grams). We compare against three baselines: Rand (randomly choose a review from the training set), GRU-LM (the GRU decoder works alone as a language model) and a state-of-the-art model, Attr2Seq, that only considers user and item attributes (Dong et al., 2017). ExpansionNet (with summary, item title, attribute and aspect as input) achieves significant improvements over Attr2Seq on all metrics. As we add more input information, the model continues to obtain better results, except on the ROUGE-L metric. This indicates that our model can effectively learn from short input phrases and aspect information and improve the correctness and diversity of generated results.

Figure 2: Examples of a real review and reviews generated by different models given a user, item, review summary, and item title. Highlights added for emphasis.
User/Item: user A3G831BTCLWGVQ and item B007M50PTM
Review summary: "easy to use and nice standard apps"
Item title: "samsung galaxy tab 2 (10.1-Inch, wi-fi) 2012 model"
Real review: "the display is beautiful and the tablet is very easy to use. it comes with some really nice standard apps."
Attr2Seq: "i bought this for my wife 's new ipad air . it fits perfectly and looks great . the only thing i do n't like is that the cover is a little too small for the ipad air . "
ExpansionNet: "i love this tablet . it is fast and easy to use . i have no complaints . i would recommend this tablet to anyone ."
+title: "i love this tablet . it is fast and easy to use . i have a galaxy tab 2 and i love it ."
+attribute & aspect: "i love this tablet . it is easy to use and the screen is very responsive . i love the fact that it has a micro sd slot . i have not tried the tablet app yet but i do n't have any problems with it . i am very happy with this tablet ."

Figure 2 presents a sample generation result. ExpansionNet captures fine-grained item information (e.g. that the item is a tablet), which Attr2Seq fails to recognize. Moreover, given a phrase like "easy to use" in the summary, ExpansionNet generates reviews containing the same text. This demonstrates the possibility of using our model in an assistive review generation scenario. Finally, given extra aspect information, the model successfully estimates that the screen would be an important aspect (i.e., for the current user and item); it generates phrases such as "screen is very responsive" about the aspect "screen", which is also covered in the real (ground-truth) review ("display is beautiful").
We are also interested in seeing how the aspect-aware representation can find related aspects and bias the generation to discuss more about those aspects. We analyze the average number of aspects in real and generated reviews and show how many aspects in real reviews are, on average, covered in generated reviews. We consider a review as covering an aspect if any of the aspect's representative words exists in the review. As shown in Table 4, Attr2Seq tends to cover more aspects in generation, many of which are not discussed in real reviews. On the other hand, ExpansionNet better captures the distribution of aspects that are discussed in real reviews.
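Two of the quantities used in this evaluation are simple enough to sketch directly: the Distinct-n percentage and the aspect coverage rule (a review covers an aspect if it contains any of that aspect's representative words). The code below is a hedged illustration; computing Distinct-n over the whole set of reviews rather than per review is an assumption, and the function names and toy aspect word sets are invented for the example.

```python
def distinct_n(reviews, n):
    """Percentage of n-grams that are distinct, pooled over all tokenized reviews."""
    ngrams = [tuple(tokens[i:i + n])
              for tokens in reviews
              for i in range(len(tokens) - n + 1)]
    return 100.0 * len(set(ngrams)) / len(ngrams) if ngrams else 0.0

def covered_aspects(tokens, aspect_words):
    """Indices of aspects with at least one representative word in the review."""
    present = set(tokens)
    return {k for k, words in enumerate(aspect_words) if present & words}

def coverage_stats(real, generated, aspect_words):
    """(#aspects in real, #aspects in generated, #real aspects also generated)."""
    real_aspects = covered_aspects(real, aspect_words)
    gen_aspects = covered_aspects(generated, aspect_words)
    return len(real_aspects), len(gen_aspects), len(real_aspects & gen_aspects)

# Toy aspect word sets (assumed, not the extracted aspects from the paper).
aspects = [{"screen", "display"}, {"battery"}, {"price"}]
real = "the display is beautiful".split()
generated = "the screen is very responsive and the battery lasts".split()

stats = coverage_stats(real, generated, aspects)
d1 = distinct_n([["a", "b", "a"]], 1)
d2 = distinct_n([["a", "b", "a"]], 2)
```

Averaging the three counts returned by `coverage_stats` over a test set would give the kind of per-review aspect statistics that the comparison above relies on.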