Attribute-aware Sequence Network for Review Summarization

Review summarization aims to generate a condensed summary for a review or multiple reviews. Existing review summarization systems mainly generate summaries based only on review content and neglect the authors' attributes (e.g., gender, age, and occupation). In fact, when summarizing a review, users with different attributes usually pay attention to specific aspects and have their own word-using habits or writing styles. Therefore, we propose an Attribute-aware Sequence Network (ASN) to take these user characteristics into account. ASN includes three modules: an attribute encoder encodes the attribute preferences over words; an attribute-aware review encoder adopts an attribute-based selective mechanism to select the important information of a review; and an attribute-aware summary decoder incorporates the attribute embedding and attribute-specific word-using habits into word prediction. To validate our model, we collect a new dataset, TripAtt, comprising 495,440 attribute-review-summary triplets with three kinds of attribute information: gender, age, and travel status. Extensive experiments show that ASN achieves state-of-the-art performance on review summarization in both automatic ROUGE evaluation and human evaluation.


Introduction
Review summarization aims to generate a condensed summary for a review or multiple reviews¹. Dominant studies can be divided into two groups: extractive and abstractive approaches. Extractive approaches (Hu and Liu, 2004; Ganesan, 2010) extract sentences or phrases from a review, while abstractive methods (Gerani et al., 2014; Wang and Ling, 2016; Yang et al., 2018a; Li et al., 2019; Gao et al., 2019) summarize a review by employing graph-based or sequence-to-sequence (S2S) models, which can generate new phrases and sentences that do not appear in the review.

¹ Here, we focus on single-review summarization, and we leave adapting our model to the multi-review summarization scenario to future work.

[Figure 1: A review-summary pair posted by a businessperson in our dataset, showing the effect of attribute information on summarizing a review. Review: "definitely a night crowd type of hotel ... very trendy common spaces, stylish and bold. however, the room felt a bit underdone ... too simple. location is great but very noisy even if we were on the top floor. will try another place next time." Summary: "trendy, stylish, but not meeting expectations". Attribute information: traveled on business; businessperson preference for aspects; businessperson-specific vocabulary: meeting, business, conference, internet, working. Underlined words in the review indicate the important sentences that the businessperson cares about when summarizing the review. The bold word in the businessperson-specific vocabulary shows that the businessperson's word-using habits may help to generate the summary.]
Despite the remarkable progress of previous studies, they typically focus only on review content and neglect the attribute information of users who post these reviews (e.g., gender, age, and occupation). In fact, such information is vital for generating summaries, for two reasons. (1) People with different attributes may care about different aspects². For example, when choosing hotels, business people may care about location and room more than price, while solo travelers may care more about price. Figure 1 presents a review posted by a businessperson. Although the hotel is trendy, stylish, and in a good location, it is noisy. The businessperson therefore summarizes that the hotel is not suitable for meetings. (2) People with different attributes have different word-using habits or writing styles when summarizing a review. According to our statistics (Section 2.2), business people often summarize a review with words like "meeting", "business", and "conference", while solo travelers often use "budget" and "inexpensive" to summarize their reviews. These attribute-specific words may help generate summaries. Figure 1 shows an example: without considering the attribute information "businessperson", it is hard to generate the word "meeting" when summarizing the review, since the word does not appear in the review itself. Intuitively, "meeting" belongs to the businessperson-specific vocabulary, and such an attribute-specific vocabulary can be incorporated to further improve summarization performance.
Inspired by the above observations, we propose a model called Attribute-aware Sequence Network (ASN) to incorporate attribute information into review summarization. ASN is based on sequence-to-sequence (S2S) models, which are popular methods in text summarization (Rush et al., 2015; See et al., 2017; Li et al., 2018a) and review summarization (Wang and Ling, 2016; Ma et al., 2018). ASN extends the standard S2S framework in three ways. First, in addition to the standard encoder and decoder, we design an attribute encoder, which encodes attribute preferences for word usage into an attribute embedding. Second, an attribute-aware review encoder is proposed to generate an attribute-aware review representation. It utilizes a bidirectional LSTM to encode a review, and then applies an attribute-based selective mechanism to select the important information from it, obtaining a better review representation. Third, we propose an attribute-aware summary decoder to account for the different writing styles of users with different attributes. It incorporates the attribute embedding and the attribute-specific vocabulary memory into the word prediction module to generate summaries.
To validate our approach, we collect a review summarization dataset with user attribute information, named TripAtt, from the TripAdvisor website. It contains 495,440 attribute-review-summary triplets with three kinds of attribute information: gender, age, and travel status. Extensive experiments show that ASN achieves state-of-the-art performance on review summarization in both automatic ROUGE evaluation and human evaluation. Our contributions are as follows:

• To the best of our knowledge, we are the first to propose an attribute-aware S2S-based model, named Attribute-aware Sequence Network (ASN), to incorporate attribute information into review summarization.
• Our model adopts an attribute-based selective mechanism to consider different user preferences for review content and applies attribute-specific vocabulary to take the different writing styles of users with different attributes into consideration when generating a summary.
• For the evaluation of review summarization with attribute information, we collect a new dataset named TripAtt, which is available at https://github.com/Junjieli0704/ASN.

Dataset
Since there is no available review summarization dataset with user attribute information, we build a new one, named TripAtt, from TripAdvisor, an online hotel review site. TripAdvisor contains many user-generated reviews along with their authors and titles. The title of a review often summarizes its main idea; therefore, we take the title as the reference summary of the review. For attribute information, we can first obtain users' demographic information, such as age and gender, from the website. Second, users also explicitly label their travel status when booking hotels, such as traveled solo or traveled on business. Therefore, we take gender, age, and travel status as our attribute information, and collect nearly 3 million attribute-review-summary triplets. However, users may write titles arbitrarily, which results in many meaningless titles, such as "i will be back again" and "twice in one trip". To remove these noisy samples, we apply the filters proposed by Li et al. (2019). We also remove samples in which the value of any attribute (gender, age, or travel status) is NULL. Finally, we construct TripAtt with 495,440 attribute-review-summary triplets. We randomly split the dataset into 2,000 reviews for testing, 2,000 reviews for development, and the rest for training. Table 1 shows statistics of TripAtt. Figure 2 presents the value distribution of gender, age, and travel status in TripAtt: males are more likely to travel than females, middle-aged (35-64 years old) users account for nearly 74%, and around 40% of users traveled as couples.

Attribute-specific Vocabulary
The frequent words of a user can reflect the user well. Therefore, we mine attribute-specific words from TripAtt to model attributes. We first merge all summaries posted by users in the same group (such as male users or 35-49-year-old users) into one document. We then compute tf-idf scores³ for each word that appears in the document, and finally select the top-N words for each group. The last column in Figure 2 shows the top-5 words in different attribute-specific vocabularies. We find that these words indeed reflect the different groups well. For example, female users often use "lovely", "beautiful", and "cute" to summarize reviews, while these words rarely appear in the summaries of male users. Users who traveled with family often care about whether the hotel is suitable for their kids, as they usually summarize reviews using "kids", "disney", and "parks", while business people frequently consider whether the hotel is suitable for meetings.
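The mining procedure above can be sketched as follows. This is a minimal illustration, not the authors' implementation: each group's merged summaries are treated as one document, and tf-idf is computed over these group-documents so that words common to all groups get an idf of zero.

```python
import math
from collections import Counter

def top_n_words(group_docs, n=5):
    """group_docs: dict mapping group name -> list of tokenized summaries.
    Returns dict mapping group name -> top-n words by tf-idf, where each
    group's merged summaries form a single document."""
    # merge every group's summaries into one token list ("document")
    merged = {g: [w for s in docs for w in s] for g, docs in group_docs.items()}
    num_docs = len(merged)
    # document frequency: in how many group-documents each word appears
    df = Counter()
    for toks in merged.values():
        df.update(set(toks))
    result = {}
    for g, toks in merged.items():
        tf = Counter(toks)
        total = len(toks)
        # tf-idf; words used by every group score 0 and are pushed down
        scores = {w: (c / total) * math.log(num_docs / df[w])
                  for w, c in tf.items()}
        result[g] = [w for w, _ in
                     sorted(scores.items(), key=lambda kv: -kv[1])[:n]]
    return result
```

On a toy corpus, a word like "hotel" that every group uses is excluded from the top-N, while group-specific words such as "meeting" or "kids" surface, mirroring the behavior described in the text.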
Attribute-aware Sequence Network

Problem Formulation
Suppose we have a corpus D with |D| attribute-review-summary triplets, where each triplet contains a review x = (x_1, x_2, ..., x_{|x|}), a summary y = (y_1, y_2, ..., y_{|y|}), and an attribute vector a = (a_1, a_2, ..., a_{|a|}) that records |a| kinds of attribute information about x's author. Since we have three kinds of attribute information (gender, age, and travel status), |a| equals 3. Classical review summarization generates y from x, while our goal is to also take a's characteristics into account when generating y.

³ Using tf-idf scores means we do not include overly general terms that all users commonly use, because they do not help model the specific group.

Model Framework
As shown in Figure 3, our model consists of three modules: an attribute encoder, an attribute-aware review encoder, and an attribute-aware summary decoder. The attribute encoder is based on the attribute-specific words obtained in Section 2.2; it not only stores these words in the attribute-specific vocabulary memory, but also utilizes a multi-layer perceptron to merge them into an attribute embedding. We then introduce four strategies to exploit the attribute embedding and the attribute-specific vocabulary memory. Equipped with the attribute embedding, our attribute-aware review encoder can select vital information from the review representation. By importing the attribute embedding and the attribute-specific vocabulary memory into the word prediction process of our attribute-aware summary decoder, our model can generate summaries well.

Attribute Encoder
This module produces the attribute embedding and the attribute-specific vocabulary memory. Suppose we select the top-K attribute-specific words for each attribute in a, and concatenate these words to get the attribute-specific vocabulary A = (A_1, ..., A_K, ..., A_{2K}, ..., A_{|a|K}), where words with indexes between (i-1)×K+1 and i×K in A belong to attribute a_i. We then use an embedding matrix E_v to embed each word {A_i}_{i=1}^{|a|K}, obtaining the matrix A, which we also call the attribute-specific vocabulary memory. Next, we use a nonlinear layer to merge the words of A belonging to attribute a_i into an embedding a_i (see Equation (1)), which represents attribute a_i alone. After that, we merge the |a| attribute embeddings a_1, a_2, ..., a_{|a|} through another nonlinear layer to get the attribute embedding a (see Equation (2)), which represents the attribute vector a.

[Figure 3: The architecture of the Attribute-aware Sequence Network (ASN). ASN encodes two kinds of attribute information, the attribute embedding (a) and the attribute-specific vocabulary memory (A), into its two basic modules (the attribute-aware review encoder and the attribute-aware summary decoder). ① and ② show strategies based on the attribute embedding, and represent the Attribute Selection strategy and the Attribute Prediction strategy, respectively. ③ and ④ indicate strategies based on the attribute-specific vocabulary memory, and represent the Attribute Memory Prediction strategy and the Attribute Memory Generation strategy, respectively.]
where W_a, w_a, b_a, and b_a are learnable parameters, and σ denotes the sigmoid function.
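Concretely, Equations (1) and (2) can be written in the following form. The exact pooling of each attribute's word embeddings (concatenation here) is our assumption; the parameters and the sigmoid nonlinearity match those named in the text:

```latex
\mathbf{a}_i = \sigma\!\left(\mathbf{W}_a \big[\mathbf{A}_{(i-1)K+1};\,\ldots;\,\mathbf{A}_{iK}\big] + \mathbf{b}_a\right) \quad (1)

\mathbf{a} = \sigma\!\left(\mathbf{w}_a \big[\mathbf{a}_1;\,\ldots;\,\mathbf{a}_{|a|}\big] + \mathbf{b}_a\right) \quad (2)
```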

Attribute-aware Review Encoder
Given a review x, the encoder first embeds each word x_i into a vector x_i using the embedding matrix E_v, the same embedding matrix as in the attribute encoder module. Then, these word vectors are fed into a single-layer bidirectional LSTM one by one, producing a sequence of encoder hidden states h_i. A classical review encoder uses h_i to represent review x and stops here. However, we find that users with different backgrounds pay attention to different content of a review. Inspired by Zhou et al. (2017), we propose an attribute-based selective mechanism to select the important information from a review for users with different attributes. The selective mechanism constructs a tailored representation of review x by considering a. In detail, our attribute-based selective network takes the attribute embedding a and the encoder hidden state h_i as input, and outputs a gate vector gate_i to select h_i.
where [;] is the concatenation operator, σ denotes the sigmoid function, and ⊙ is element-wise multiplication. From Equation (4), gate_i is a vector whose values lie between 0 and 1. A high value means most of the information in h_i passes through the filter, indicating that the word x_i is important. This is the first attribute-based strategy, called the Attribute Selection strategy.
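The gating step can be sketched as follows. This is a minimal NumPy sketch, not the authors' exact parameterization: `W_g` and `b_g` stand in for the learnable parameters of the selective network, and a single linear-plus-sigmoid layer is our assumption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attribute_select(h, a, W_g, b_g):
    """Attribute-based selective gate.
    h:   (T, d_h) encoder hidden states
    a:   (d_a,)   attribute embedding
    W_g: (d_h + d_a, d_h), b_g: (d_h,) -- assumed learnable parameters
    Returns the gated hidden states, shape (T, d_h)."""
    T = h.shape[0]
    # concatenate every hidden state with the attribute embedding: [h_i; a]
    ha = np.concatenate([h, np.tile(a, (T, 1))], axis=1)
    gate = sigmoid(ha @ W_g + b_g)   # gate_i, values in (0, 1)
    return h * gate                  # element-wise selection of h_i
```

Because every gate value lies in (0, 1), each component of the output is attenuated relative to the original hidden state, which is exactly the filtering behavior described above.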

Attribute-aware Summary Decoder
At each decoding time step t, the decoder (a single-layer unidirectional LSTM) receives the previous word embedding and produces the new hidden state s_t. It then computes the context vector c_t for time step t through the attention mechanism, where MLP stands for a multi-layer perceptron and α_{t,i} is the importance score between the current decoder state s_t and the encoder hidden state h_i.
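This attention step can be sketched as follows; `mlp` stands for any scoring function over the concatenated states (the authors' MLP parameters are not specified here, so the scorer is passed in as an argument):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_context(s_t, H, mlp):
    """Compute attention weights alpha_{t,i} over encoder states and the
    resulting context vector c_t.
    s_t: (d_s,)   current decoder state
    H:   (T, d_h) encoder hidden states
    mlp: scoring function applied to the concatenation [s_t; h_i]."""
    scores = np.array([mlp(np.concatenate([s_t, h_i])) for h_i in H])
    alpha = softmax(scores)   # importance of each source position at step t
    return alpha @ H          # context vector c_t, shape (d_h,)
```

Since the weights sum to one, c_t is a convex combination of the encoder states, weighted toward the positions the decoder currently attends to.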
The classical summary decoder combines the context vector c t and the decoder state s t , and then feeds the merged vector into a linear layer to produce the vocabulary distribution.
However, when generating a summary, users with different attributes may have their own vocabularies. It is therefore natural to take the attribute-specific vocabulary memory A into consideration when predicting the output vocabulary distribution; moreover, different words in A may have different effects. Thus, we utilize an attention mechanism to extract important words from A when computing the vocabulary state m_t.
where β_{t,k} measures the importance score between the current decoder state s_t and the k-th word A_k in the attribute-specific vocabulary memory. Then we combine the context vector c_t, the decoder state s_t, and m_t into the readout state r_t. We can further enhance the readout state r_t by also incorporating the attribute embedding a. After that, we feed the readout state into a linear layer to produce the vocabulary distribution P_voc.
where W_r and b_r are learnable parameters. The strategies of adding the attribute embedding a and the vocabulary state m_t to the readout state r_t are called the Attribute Prediction strategy and the Attribute Memory Prediction strategy, respectively. Finally, inspired by See et al. (2017), we also propose a soft copy mechanism to copy attribute-specific words when generating summaries; this is the fourth strategy, called the Attribute Memory Generation strategy.
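Putting the pieces together, a plausible form of the memory attention, readout, and prediction step is the following; the exact concatenation order inside r_t is our assumption:

```latex
\beta_{t,k} \propto \exp\big(\mathrm{MLP}([\,s_t;\, \mathbf{A}_k\,])\big), \qquad
m_t = \sum_{k=1}^{|a|K} \beta_{t,k}\, \mathbf{A}_k

r_t = [\,c_t;\, s_t;\, m_t;\, \mathbf{a}\,], \qquad
P_{voc} = \mathrm{softmax}(W_r\, r_t + b_r)
```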
The generation probability p_mgn ∈ [0, 1] for time step t is calculated from the context vector c_t, the decoder state s_t, and the vocabulary state m_t, where W_mg and b_mg are learnable parameters, [;] is the concatenation operator, and σ is the sigmoid function. Then p_mgn is used as a soft switch to choose between generating a word from the target vocabulary V_t and copying a word from the attribute-specific vocabulary.
The first part in Equation (10) represents generating words from our vocabulary, and the second part represents copying words from the attribute-specific vocabulary memory.
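A minimal sketch of the resulting mixture, as we read Equation (10) from the surrounding text (the index bookkeeping is illustrative): the generator distribution is scaled by p_mgn, and the copy attention mass over attribute-specific words is scaled by 1 − p_mgn and added onto those words' vocabulary entries.

```python
import numpy as np

def final_distribution(p_voc, beta, attr_word_ids, p_mgn):
    """Mix the generator and copy distributions.
    p_voc:         (V,)   generator distribution over the target vocabulary
    beta:          (|A|,) copy attention over attribute-specific words
    attr_word_ids: (|A|,) vocabulary index of each attribute-specific word
    p_mgn:         scalar soft switch in [0, 1]."""
    p = p_mgn * np.asarray(p_voc, dtype=float).copy()
    for k, wid in enumerate(attr_word_ids):
        # add the copy probability mass of attribute word k to its vocab slot
        p[wid] += (1.0 - p_mgn) * beta[k]
    return p
```

If both input distributions sum to one, the mixture does as well, and attribute-specific words (such as "meeting" for business travelers) receive extra probability even when the generator alone ranks them low.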

Objective Function
Our goal is to maximize the probability of the output summary given the input review. Therefore, we minimize the negative log-likelihood loss function:
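In standard S2S form, conditioning on the attribute vector as set up in the problem formulation, the loss can be written as:

```latex
\mathcal{L} = -\sum_{(x,\, y,\, a) \in D} \sum_{t=1}^{|y|} \log P\big(y_t \mid y_{<t},\, x,\, a\big)
```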

Evaluation Metric
We exploit ROUGE (Lin, 2004) as our evaluation metric. ROUGE scores reported in this paper are computed by Pyrouge package 4 .

Comparison Methods
In the experiments, we compare our model with several strong baseline methods, which can be divided into two types: extractive and abstractive approaches. LEAD1 is an extractive approach that selects the first sentence of a review as the summary. LEXRANK (Erkan and Radev, 2004) is another well-known extractive approach, which computes text centrality based on the PageRank algorithm. TEXTRANK (Mihalcea and Tarau, 2004) is an unsupervised algorithm based on weighted graphs.
S2SATT is a sequence-to-sequence model with attention implemented by us. SEASS (Zhou et al., 2017) employs a selective encoding model to control the information flow from encoder to decoder. PGN (See et al., 2017) copies words from the source text via pointing, while retaining the ability to produce novel words through the generator.

Implementation Details
Model Parameters The vocabulary is collected from the TripAtt training data. We lowercase the text, and there are 362,103 unique word types. We use the top 30,000 words as the model vocabulary since they can cover 99.01% of the training data.

Model Training
We set the batch size to 128. We truncate each review to 200 tokens to expedite training and testing; however, we also find that truncating the review can raise the performance of the model⁵. We use the development set to choose the size of the attribute-specific vocabulary and set it to 100. We also use Adam (Kingma and Ba, 2015), dropout (Srivastava et al., 2014), and gradient clipping (Pascanu et al., 2013) to make our model robust. We set the word embedding size to 128 and all LSTM hidden state sizes to 200. We use Adam as our optimization algorithm, with learning rate α = 0.001, momentum parameters β₁ = 0.9 and β₂ = 0.999, and ε = 10⁻⁸. We use dropout with probability p = 0.2 and apply gradient clipping with range [−5, 5].
Model Testing At test time, we use beam search with a beam size of 4.

Results
Our results are given in Table 2. For extractive methods, we can see that LEAD1 performs best. However, it obtains only 11.59 ROUGE-1, 2.41 ROUGE-2, and 10.21 ROUGE-L F1 scores. The reason is that summaries in TripAtt are very succinct⁶ and often cover content across several sentences, such as the summary in Figure 1.

⁵ Indeed, we found that using only the first 200 tokens of the review yields higher ROUGE scores than using all tokens.
For abstractive methods, we find that S2SATT outperforms all extractive methods. After adding the selective mechanism to S2SATT, the performance of SEASS decreases slightly, because the selective mechanism proposed by SEASS is designed for sentence summarization and may not be suitable for summarizing reviews: the average input length is less than 40 in Zhou et al. (2017), while it is about 170 in TripAtt. When incorporating the copy mechanism into S2SATT, PGN obtains better performance.
Finally, after adding our proposed attribute encoder and four attribute-based strategies, ASN performs significantly better than all previous methods. Compared to S2SATT, our model achieves gains of 0.91 ROUGE-1, 0.65 ROUGE-2, and 0.79 ROUGE-L, which shows that explicitly modeling attribute-related characteristics can indeed improve summarization quality. Our model also surpasses PGN by 0.84 ROUGE-1, 0.41 ROUGE-2, and 0.61 ROUGE-L, and achieves state-of-the-art performance on review summarization.

Human Evaluation on Aspect-level Coverage
Previous experiments show that ASN outperforms the baselines when evaluated with ROUGE. However, ROUGE is a purely word-based metric and cannot measure the semantic similarity between two summaries. Table 3 illustrates an example. Since the summary generated by S2SATT contains more overlapping words with the gold summary than the one generated by ASN, S2SATT obtains higher ROUGE scores than ASN. However, from the aspect perspective, ASN may be better: its summary describes the location and service of a hotel, which is consistent with the gold summary, while S2SATT's summary misses service. Therefore, beyond ROUGE, we also want to evaluate the aspect-level coverage of different systems. To perform the aspect-level evaluation, we first define seven aspects: location, service, room, value, facility, food, and hotel, where hotel describes the overall attitude. Then, we randomly sample 1000 attribute-review-summary triplets from the test set and generate summaries for these reviews using S2SATT, PGN, and ASN. After that, we ask two students to annotate aspects for these generated summaries and the reference summaries⁷. Finally, we compute aspect-level precision, recall, and F1 for the different systems. Table 4 shows the aspect-level results: ASN outperforms the other models by a large margin, which shows that summaries generated by our model not only contain more correct words but are also more consistent with the references at the aspect level.

[Table 3: A sample from the test set with RG-1 (%), RG-2 (%), and RG-L (%) scores.]

Discussions
In this section, we study the effects of different attributes, different attribute-based strategies, and different attribute-specific vocabulary sizes on review summarization.

Effects of Different Attributes
To understand which kind of attribute information is the most important in review summarization, we perform an ablation study and give the results in Table 5.
First, all these kinds of attribute information are helpful for review summarization: adding one kind of attribute information yields gains of at least 0.41 ROUGE-1, 0.18 ROUGE-2, and 0.22 ROUGE-L. Second, travel status is the most important attribute for review summarization in TripAtt, likely because travel status is a domain-dependent attribute, while the others are domain-independent.
⁷ Some examples of how to label aspects for summaries are shown in the last column of Table 3.

[Table 6: Effects of different attribute-based strategies on review summarization. "✓" means our model considers the specific strategy, while "-" means it does not. When no attribute-based strategies are considered, our model degrades into S2SATT (line 1).]

Effects of Different Strategies
In this paper, we propose four attribute-based strategies to construct our attribute-aware review summarization model: the Attribute Selection strategy (ASel), the Attribute Prediction strategy (APre), the Attribute Memory Prediction strategy (AMP), and the Attribute Memory Generation strategy (AMG). To evaluate the effect of each strategy on review summarization, we perform an ablation study and report the results in Table 6. First, we observe that models with only one attribute-based strategy (lines 2-5) exceed S2SATT by at least (+0.17 ROUGE-1, +0.13 ROUGE-2, +0.12 ROUGE-L) points. This shows that all these strategies improve the performance of review summarization. APre and AMG are the two most effective strategies, because they directly affect the word prediction module in ASN. Second, models that delete one attribute-based strategy from ASN (lines 6-9) drop by at least 0.11 ROUGE-2 compared with ASN (line 10). This shows that all four strategies are complementary. The most complementary strategies are ASel and AMG. For ASel, the reason is that it is applied in the encoder module of ASN, while the others are applied in the decoder module. For AMG, it differs from the other decoder-side strategies by affecting the decoding process through an additional vocabulary distribution, while the other strategies (APre and AMP) only add more features to the classical word prediction module.

[Figure 4: Effects of attribute-specific vocabulary size on review summarization on the development set of TripAtt. When there is no attribute-specific vocabulary (size 0) in ASN, our model degrades into S2SATT. The primary axis is for ROUGE-1 and ROUGE-L, and the secondary axis is for ROUGE-2.]
Third, ASN obtains the best result when considering all these strategies.

Effects of Attribute-specific Vocabulary Size
Since the attribute-specific vocabulary is the cornerstone of our attribute-aware model, we test the effect of its size on review summarization and show the results in Figure 4. First, we find that considering the attribute-specific vocabulary can indeed improve the performance of review summarization, even when the vocabulary size is very small (such as 20). Second, all curves first increase and then decrease as the attribute-specific vocabulary size grows. This indicates that a small vocabulary can help ASN, whereas a large vocabulary may introduce noise and harm it. The best performance is obtained when the attribute-specific vocabulary size is 100, which is why we set our attribute-specific vocabulary size to 100.

Case Study
We present two cases from TripAtt that show the effect of attributes on review summarization in Figure 5. Figure 5 (a) shows the effect of travel status. Business people often book a hotel for work; in this case, the businesswoman stayed at the hotel for a company meeting, so whether the hotel meets this requirement may be the most important factor for her, and she summarizes "great for a meeting". S2SATT could not get the point. Even though "meeting" appears in the review, PGN, which can copy words from the review, also fails to generate the word. Our model, equipped with the businessperson-specific vocabulary, generates the word correctly and obtains a better summary. Figure 5 (b) shows the effect of gender. Word-using habits differ between genders: female users often use "lovely", "beautiful", and "cute" to summarize reviews, while these words rarely appear in summaries from male users. Without considering gender, S2SATT and PGN cannot generate the summary well. By incorporating such writing styles of female users, ASN generates "lovely" correctly, although it does not appear in the review.

Related Work
Review summarization belongs to sentiment analysis (Liu, 2016; Xia et al., 2015), which is a large area in natural language processing and contains sentiment classification (Li and Zong, 2008; Xia et al., 2011; Li et al., 2016, 2018b), emotion detection, spam detection (Wang et al., 2017), and so on. There are two mainstream approaches to the problem: extractive and abstractive approaches.
A key task in extractive methods (Hu and Liu, 2004;Lerman et al., 2009;Xiong and Litman, 2014;Kunneman et al., 2018) is to identify important text units. For example, Hu and Liu (2004) first recognize the frequent product features and then attach extracted opinion sentences to the corresponding feature. Xiong and Litman (2014) exploit review helpfulness for review summarization. However, many studies (Carenini et al., 2013;Fabbrizio et al., 2014) have shown that abstractive approaches may be more appropriate for summarizing evaluative text than extractive ones. That is also the reason why we build our attribute-aware model based on abstractive methods.
Abstractive approaches (Ganesan, 2010; Gerani et al., 2014; Wang and Ling, 2016) are also very popular in review summarization. For example, Ganesan (2010) first represents reviews as token-based graphs following the token order in the string, and then ranks summary candidates by scoring paths after removing redundant information from the graph. Gerani et al. (2014) utilize the discourse structure of a review to identify important aspects and then design a set of templates to generate summaries. Wang and Ling (2016) propose an attention-based neural network model for generating abstractive summaries of opinionated text.
All of these studies focus on review summarization in the multiple-review scenario, while our work focuses on the single-review scenario. Recent review summarization studies (Ma et al., 2018; Yang et al., 2018a,b) also focus on this scenario. Ma et al. (2018) jointly model review summarization and sentiment classification in a unified framework. Yang et al. (2018a) study aspect/sentiment-aware abstractive review summarization in an end-to-end manner. These works mainly generate summaries based only on review content and overlook the crucial influence of users. The most related work is Li et al. (2019), who study personalized review summarization but also neglect the effect of user attributes on review summarization. Our proposed model fills this gap in the literature.

Conclusion and Future Work
In this paper, we propose an Attribute-aware Sequence Network (ASN) to incorporate attribute information into review summarization. ASN uses attribute-specific vocabularies to model attribute information and utilizes four attribute-based strategies to build an attribute-aware review encoder and an attribute-aware summary decoder. To validate our model, we construct a new dataset (TripAtt). Extensive experiments on TripAtt show that ASN achieves state-of-the-art performance on review summarization.