OpinionDigest: A Simple Framework for Opinion Summarization

We present OpinionDigest, an abstractive opinion summarization framework, which does not rely on gold-standard summaries for training. The framework uses an Aspect-based Sentiment Analysis model to extract opinion phrases from reviews, and trains a Transformer model to reconstruct the original reviews from these extractions. At summarization time, we merge extractions from multiple reviews and select the most popular ones. The selected opinions are used as input to the trained Transformer model, which verbalizes them into an opinion summary. OpinionDigest can also generate customized summaries, tailored to specific user needs, by filtering the selected opinions according to their aspect and/or sentiment. Automatic evaluation on Yelp data shows that our framework outperforms competitive baselines. Human studies on two corpora verify that OpinionDigest produces informative summaries and shows promising customization capabilities.


Introduction
The summarization of opinions in customer reviews has received significant attention in the Data Mining and Natural Language Processing communities. Early efforts (Hu and Liu, 2004a) focused on producing structured summaries which numerically aggregate the customers' satisfaction about an item across multiple aspects, and often included representative review sentences as evidence. Considerable research has recently shifted towards textual opinion summaries, fueled by the increasing success of neural summarization methods (Cheng and Lapata, 2016; Paulus et al., 2018; See et al., 2017; Liu and Lapata, 2019; Isonuma et al., 2019).
Opinion summaries can be extractive, i.e., created by selecting a subset of salient sentences from the input reviews, or abstractive, where summaries are generated from scratch. Extractive approaches produce well-formed text, but selecting the sentences which approximate the most popular opinions in the input is challenging. Angelidis and Lapata (2018) used sentiment and aspect predictions as a proxy for identifying opinion-rich segments. Abstractive methods (Chu and Liu, 2019; Bražinskas et al., 2019), like the one presented in this paper, attempt to model the prevalent opinions in the input and generate text that articulates them.
Opinion summarization can rarely rely on gold-standard summaries for training (see Amplayo and Lapata (2019) for a supervised approach). Recent work has utilized end-to-end unsupervised architectures, based on auto-encoders (Chu and Liu, 2019; Bražinskas et al., 2019), where an aggregated representation of the input reviews is fed to a decoder, trained via reconstruction loss to produce review-like summaries. Similarly to their work, we assume that review-like generation is appropriate for opinion summarization. However, we explicitly deal with opinion popularity, which we believe is crucial for multi-review opinion summarization. Additionally, our work is novel in its ability to explicitly control the sentiment and aspects of selected opinions. The aggregation of input reviews is no longer treated as a black box, thus allowing for controllable summarization.
Specifically, we take a step towards more interpretable and controllable opinion aggregation, as we replace the end-to-end architectures of previous work with a pipeline framework. Our method has three components: a) a pre-trained opinion extractor, which identifies opinion phrases in reviews; b) a simple and controllable opinion selector, which merges, ranks, and (optionally) filters the extracted opinions; and c) a generator model, which is trained to reconstruct reviews from their extracted opinion phrases and can then generate opinion summaries based on the selected opinions. We describe our framework in Section 2 and present two types of experiments in Section 3: a quantitative comparison against established summarization techniques on the YELP summarization corpus (Chu and Liu, 2019); and two user studies, validating the automatic results and our method's ability for controllable summarization.

OPINIONDIGEST Framework
Let $D$ denote a dataset of customer reviews on individual entities $\{e_1, e_2, \ldots, e_{|D|}\}$ from a single domain, e.g., restaurants or hotels. For every entity $e$, we define a review set $R_e = \{r_i\}_{i=1}^{|R_e|}$, where each review is a sequence of words $r = (w_1, \ldots, w_n)$.
Within a review, we define a single opinion phrase, $o = (w_{o_1}, \ldots, w_{o_m})$, as a subsequence of tokens that expresses the attitude of the reviewer towards a specific aspect of the entity. Formally, we define the opinion set of $r$ as $O_r = \{(o_i, pol_i, a_i)\}_{i=1}^{|O_r|}$, where $pol_i$ is the sentiment polarity of the $i$-th phrase (positive, neutral, or negative) and $a_i$ is the aspect category it discusses (e.g., a hotel's service or cleanliness).
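For concreteness, the following minimal sketch shows one way these opinion records could be represented in code; the class and field names are illustrative and not part of the original framework.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Opinion:
    """A single extracted opinion phrase (o_i, pol_i, a_i)."""
    phrase: str    # e.g., "very friendly staff"
    polarity: str  # "positive", "neutral", or "negative"
    aspect: str    # e.g., "service", "cleanliness"

@dataclass
class Review:
    """A review r together with its opinion set O_r."""
    text: str
    opinions: List[Opinion]
```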
For each entity $e$, our task is to abstractively generate a summary $s_e$ of the most salient opinions expressed in the reviews $R_e$. Contrary to previous abstractive methods (Chu and Liu, 2019; Bražinskas et al., 2019), which never explicitly deal with opinion phrases, we put the opinion sets of reviews at the core of our framework, as described in the following sections and illustrated in Figure 1.

Opinion Extraction
Extracting opinion phrases from reviews has been studied for years under the Aspect-based Sentiment Analysis (ABSA) task (Hu and Liu, 2004b; Luo et al., 2019; Dai and Song, 2019). We follow existing approaches to obtain an opinion set $O_r$ for every review in our corpus. Specifically, we used a pre-trained tagging model (Miao et al., 2020) to extract opinion phrases, their polarity, and aspect categories.
Step 1 (top-left) of Figure 1 shows a set of opinions extracted from a full review.
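In code, the extraction step amounts to a thin wrapper around a pre-trained ABSA tagger. The sketch below reuses the Opinion dataclass from above; the `absa_model.tag` interface is an assumption made for illustration and does not reflect the actual API of the tagger of Miao et al. (2020).

```python
from typing import List

def extract_opinions(review_text: str, absa_model) -> List[Opinion]:
    """Run a pre-trained ABSA tagging model over one review (hypothetical API)."""
    opinions = []
    # Assumed interface: the tagger yields (phrase, polarity, aspect) triples.
    for phrase, polarity, aspect in absa_model.tag(review_text):
        opinions.append(Opinion(phrase=phrase, polarity=polarity, aspect=aspect))
    return opinions
```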

Opinion Selection
Given the set of reviews $R_e = \{r_1, r_2, \ldots\}$ for an entity $e$, we define the entity's opinion set as $O_e = O_{r_1} \cup O_{r_2} \cup \ldots$. Summarizing the opinions about entity $e$ relies on selecting the most salient opinions $S_e \subset O_e$. As a departure from previous work, we explicitly select the opinion phrases that will form the basis for summarization, in the following steps.
Opinion Merging: To avoid selecting redundant opinions in $S_e$, we apply a greedy algorithm to merge similar opinions into clusters $C = \{C_1, C_2, \ldots\}$: given an opinion set $O_e$, we start with an empty $C$ and iterate through every opinion in $O_e$. For each opinion $(o_i, pol_i, a_i)$, we further iterate through every existing cluster in random order. The opinion is added to the first cluster $C$ whose opinions are sufficiently similar to it, i.e., $\cos(v_i, v_j) \geq \theta$ for the opinions $o_j$ already in $C$, or to a newly created cluster otherwise. Here $v_i$ and $v_j$ are the average word embeddings of opinion phrases $o_i$ and $o_j$ respectively, $\cos(\cdot, \cdot)$ is the cosine similarity, and $\theta \in (0, 1]$ is a hyperparameter. For each opinion cluster $C_i$, we define its representative opinion $Repr(C_i)$ as the opinion phrase closest to the cluster's centroid.
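A minimal sketch of this merging step is shown below. It assumes an `embed` function that returns the average word embedding of a phrase, requires similarity to every opinion already in a cluster (one plausible reading of the threshold criterion), and uses an illustrative default for θ; none of these details are prescribed by the paper.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def merge_opinions(opinions, embed, theta=0.8):
    """Greedily merge similar opinions; return (representative, cluster size) pairs."""
    clusters = []  # each cluster is a list of (opinion, embedding) pairs
    for op in opinions:
        v = embed(op.phrase)
        placed = False
        for cluster in clusters:
            # Join the first cluster whose members are all within the threshold.
            if all(cosine(v, w) >= theta for _, w in cluster):
                cluster.append((op, v))
                placed = True
                break
        if not placed:
            clusters.append([(op, v)])

    representatives = []
    for cluster in clusters:
        centroid = np.mean([w for _, w in cluster], axis=0)
        # Representative opinion: the phrase closest to the cluster centroid.
        rep, _ = max(cluster, key=lambda item: cosine(item[1], centroid))
        representatives.append((rep, len(cluster)))
    return representatives
```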
Opinion Ranking: We assume that larger clusters contain opinions which are popular among reviews and, therefore, should have higher priority to be included in $S_e$. We use the representative opinions of the top-k largest clusters as the selected opinions $S_e$.

Opinion Filtering (optional): We can further control the selection by filtering opinions based on their predicted aspect category or sentiment polarity. For example, we may only allow opinions where $a_i$ = "cleanliness".
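Ranking and optional filtering can then be a few lines on top of the merging output. In this sketch the filter is applied before the top-k cut, and k defaults to an illustrative value; both choices are assumptions rather than specifics from the paper.

```python
def select_opinions(representatives, k=15, aspect=None, polarity=None):
    """Pick the representative opinions of the k largest clusters, optionally filtered."""
    if aspect is not None:
        representatives = [(o, n) for o, n in representatives if o.aspect == aspect]
    if polarity is not None:
        representatives = [(o, n) for o, n in representatives if o.polarity == polarity]
    # Larger clusters correspond to more popular opinions.
    ranked = sorted(representatives, key=lambda item: item[1], reverse=True)
    return [o for o, _ in ranked[:k]]
```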

Summary Generation
Our goal is to generate a natural language summary which articulates $S_e$, the set of selected opinions. To achieve this, we need a natural language generation (NLG) model which takes a set of opinion phrases as input and produces a fluent, review-like summary as output. Because we cannot rely on gold-standard summaries for training, we train an NLG model that encodes the extracted opinion phrases of a single review and then attempts to reconstruct the review's full text. Then, the trained model can be used to generate summaries.
Training via Review Reconstruction: Having extracted $O_r$ for every review $r$ in the corpus, we construct training examples that pair each review's full text with $T(O_r)$, a textualization of the review's opinion set in which all opinion phrases are concatenated in their original order using a special token [SEP]. The model is trained to reconstruct the review $r$ from $T(O_r)$.

Summarization: At summarization time, we use the textualization of the selected opinions, $T(S_e)$, as input to the trained Transformer, which generates a natural language summary $s_e$ as output (Figure 1, Step 3b). We order the selected opinions by frequency (i.e., by their respective cluster's size), but any desired ordering may be used.
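The textualization is the only interface between the selection step and the generator, so it can be sketched compactly; the `[SEP]`-joined format follows the description above, while the `model.generate` call is a hypothetical seq2seq interface.

```python
def textualize(opinions):
    """T(O): concatenate opinion phrases, separated by the [SEP] token."""
    return " [SEP] ".join(op.phrase for op in opinions)

# Training pairs: reconstruct each review's text from its own opinions.
# train_pairs = [(textualize(r.opinions), r.text) for r in reviews]

# Summarization: verbalize the selected opinions with the trained model
# (hypothetical generation interface, shown for illustration only).
# summary = model.generate(textualize(selected_opinions))
```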

Datasets
We used two review datasets for evaluation. The first is YELP, a public corpus of restaurant reviews previously used by Chu and Liu (2019). We used a different snapshot of the data, filtered to the same specifications as the original paper, resulting in 624K training reviews. We used the same gold-standard summaries for 200 restaurants as in Chu and Liu (2019).
We also used HOTEL, a private hotel review dataset that consists of 688K reviews for 284 hotels collected from multiple hotel booking websites. There are no gold-standard summaries for this dataset, so systems were evaluated by humans.

Baselines
LexRank (Erkan and Radev, 2004): A popular unsupervised extractive summarization method. It selects sentences based on centrality scores computed over a graph of sentence similarities (a toy illustration of this centrality idea is sketched at the end of this section).
MeanSum (Chu and Liu, 2019): An unsupervised multi-document abstractive summarizer that minimizes a combination of reconstruction and vector similarity losses. We only applied MeanSum to YELP, due to its requirement for a pre-trained language model, which was not available for HOTEL.
Best Review / Worst Review (Chu and Liu, 2019): A single review that has the highest/lowest average word overlap with the input reviews.
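For intuition, the following toy sketch ranks sentences by degree centrality on a cosine-similarity graph. It is not the authors' LexRank implementation (LexRank additionally runs a PageRank-style iteration over the graph); it only illustrates the general centrality idea.

```python
import numpy as np

def centrality_ranking(sentence_vectors, sim_threshold=0.1):
    """Rank sentences by summed cosine similarity to all other sentences."""
    X = np.stack(sentence_vectors).astype(float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = X @ X.T
    np.fill_diagonal(sim, 0.0)
    sim[sim < sim_threshold] = 0.0   # sparsify the similarity graph
    scores = sim.sum(axis=1)         # degree centrality of each sentence
    return np.argsort(-scores)       # indices, most central sentence first
```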

Experimental Settings
For opinion extraction, the ABSA models are trained with 1.3K labeled review sentences for YELP and 2.4K for HOTEL.
We trained a Transformer with the original architecture (Vaswani et al., 2017). We used SGD with an initial learning rate of 0.1, a momentum of β = 0.1, and a decay of γ = 0.1 for 5 epochs with a batch size of 8. For decoding, we used Beam Search with a beam size of 5, a length penalty of 0.6, 3-gram blocking (Paulus et al., 2018), and a maximum generation length of 60. We tuned hyperparameters on the dev set, and our system appears robust to their setting (see Appendix A).
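In PyTorch terms, the reported optimization settings correspond roughly to the snippet below; the placeholder model and the per-epoch StepLR schedule for the γ = 0.1 decay are assumptions for illustration, not the authors' exact training code.

```python
import torch
import torch.nn as nn

# Placeholder seq2seq model; the paper uses the original Transformer
# encoder-decoder architecture (Vaswani et al., 2017).
model = nn.Transformer(d_model=512, nhead=8)

# SGD with lr=0.1 and momentum=0.1, decayed by gamma=0.1 (assumed here
# to be applied once per epoch), trained for 5 epochs with batch size 8.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.1)
```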
We performed automatic evaluation on the YELP dataset with ROUGE-1 (R1), ROUGE-2 (R2), and ROUGE-L (RL) (Lin, 2004) scores based on the 200 reference summaries (Chu and Liu, 2019). We also conducted user studies on both YELP and HOTEL datasets to further understand the performance of different models.
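One common way to compute these metrics is Google's `rouge-score` package, sketched below; this is not necessarily the exact scorer used in the paper.

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

def rouge_f1(reference: str, summary: str):
    """Return ROUGE-1/2/L F1 scores for a generated summary against a reference."""
    scores = scorer.score(reference, summary)
    return {name: s.fmeasure for name, s in scores.items()}
```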

Results
Automatic Evaluation: Table 1 shows the automatic evaluation scores for our model and the baselines on the YELP dataset. As shown, our framework outperforms all baseline approaches. Although OPINIONDIGEST is not a fully unsupervised framework, labeled data is only required by the opinion extractor and is easier to acquire than gold-standard summaries: on the YELP dataset, the opinion extraction models are trained on a publicly available ABSA dataset (Wang et al., 2017).

Human Evaluation: We conducted three user studies to evaluate the quality of the generated summaries (more details in Appendix B).
First, we generated summaries from 3 systems (ours, LexRank, and MeanSum/Best Review) for every entity in YELP's summarization test set and 200 random entities in the HOTEL dataset, and asked judges to indicate the best and worst summary according to three criteria: informativeness (I), coherence (C), and non-redundancy (R). The systems' scores were computed using Best-Worst Scaling (Louviere et al., 2015), with values ranging from -100 (unanimously worst) to +100 (unanimously best). We aggregated users' responses and present the results in Table 2(a). As shown, summaries generated by OPINIONDIGEST achieve the best informativeness and coherence scores compared to the baselines. However, OPINIONDIGEST may still generate redundant phrases in the summary.

Second, we performed a summary content support study. Judges were given 8 input reviews from YELP and a corresponding summary produced either by MeanSum or by our system. For each summary sentence, they were asked to evaluate the extent to which its content was supported by the input reviews. Table 3 shows the proportion of summary sentences that were fully, partially, or not supported for each system. OPINIONDIGEST produced significantly more sentences with full or partial support, and fewer sentences without any support.
Finally, we evaluated our framework's ability to generate controllable output. We produced aspect-specific summaries using our HOTEL dataset, and asked participants to judge whether the summaries discussed the specified aspect exclusively, partially, or not at all. Table 4 shows that 46.6% of the summaries exclusively summarized the specified aspect, while only 10.3% of the summaries failed to mention the aspect at all.
Example Output: Example summaries in Table 5 further demonstrate that a) OPINIONDIGEST is able to generate abstractive summaries from more than a hundred reviews and b) it produces controllable summaries through opinion filtering.
The first two examples in Table 5 show summaries that are generated from 8 and 128 reviews of the same hotel. OPINIONDIGEST performs robustly even for a large number of reviews. Since our framework is not based on aggregating review representations, the quality of the generated text is not affected by the number of inputs, which may result in better-informed summaries. This is a significant difference from previous work (Chu and Liu, 2019; Bražinskas et al., 2019), where averaging the vectors of many reviews may hinder performance.

Table 5: Example summaries on HOTEL (first two) and YELP (last four). Input opinions were filtered by aspect category (Asp), sentiment polarity (Pol), and number of reviews (N). Colors show the alignments between opinions and summaries. Italic denotes incorrect extraction. Underlined opinions do not explicitly appear in the summaries.

Finally, we provide a qualitative analysis of the controllable summarization abilities of OPINIONDIGEST, which are enabled by input opinion filtering. As discussed in Section 2.2, we filtered input opinions based on predicted aspect categories and sentiment polarity. The examples of controlled summaries (last 4 rows of Table 5) illustrate this ability: although the input contains redundant opinions and incorrect extractions, OPINIONDIGEST is able to convert the input opinions into natural summaries. Based on OPINIONDIGEST, we have built an online demo (Wang et al., 2020) that allows users to customize the generated summary by specifying search terms.

Conclusion
We described OPINIONDIGEST, a simple yet powerful framework for abstractive opinion summarization. OPINIONDIGEST is a combination of existing ABSA and seq2seq models and does not require any gold-standard summaries for training. Our experiments on the YELP dataset showed that OPINIONDIGEST outperforms baseline methods, including a state-of-the-art unsupervised abstractive summarization technique. Our user study and qualitative analysis confirmed that our method can generate controllable high-quality summaries, and can summarize large numbers of input reviews.

A Hyper-parameter Sensitivity Analysis
We present OPINIONDIGEST's hyper-parameters and their default settings in Table 6. Among these, we found that the performance of OPINIONDIGEST is most sensitive to the following hyper-parameters: the number of selected opinions (k), the merging threshold (θ), and the maximum token length (L).
To better understand OPINIONDIGEST's performance, we conducted an additional sensitivity analysis for these three hyper-parameters. The results, shown in Figure 2, demonstrate that OPINIONDIGEST is robust to the choice of these hyper-parameters and consistently outperforms the best-performing baseline method.

B Human Evaluation Setup
We conducted our user studies via crowdsourcing on the FigureEight platform (https://www.figure-eight.com/). To ensure the quality of annotators, we used a dedicated expert-worker pool provided by FigureEight. We present the detailed setup of our user studies below.
Best-Worst Scaling Task: For each entity in the YELP and HOTEL datasets, we presented 8 input reviews and 3 automatically generated summaries to human annotators (Figure 3). The methods that generated those summaries were hidden from the annotators, and the order of the summaries was shuffled for every entity. We then asked the annotators to select the best and worst summaries w.r.t. the following criteria:

• Informativeness: How much useful information about the business does the summary provide? You need to skim through the original reviews to answer this.
• Coherence: How coherent and easy to read is the summary?

• Non-redundancy: Is the summary successful at avoiding redundant and repeated opinions?
To evaluate the quality of the summaries for each criterion, we counted the number of best/worst votes for every system and computed its Best-Worst Scaling score (Louviere et al., 2015), i.e., the percentage of times the system was voted best minus the percentage of times it was voted worst, ranging from -100 to +100. Best-Worst Scaling is known to be more robust for NLP annotation tasks and requires fewer annotations than rating-scale methods (Kiritchenko and Mohammad, 2016).
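In code, the score for one system under one criterion reduces to a single expression; the function name and argument layout below are illustrative.

```python
def best_worst_score(n_best: int, n_worst: int, n_judgements: int) -> float:
    """Best-Worst Scaling score in [-100, 100] for one system and criterion."""
    return 100.0 * (n_best - n_worst) / n_judgements
```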
We collected responses from 3 human annotators for each question and computed the scores w.r.t. informativeness (I-score), coherence (C-score), and non-redundancy (R-score) accordingly.
Content Support Task: For the content support study, we presented the annotators with the 8 input reviews and an opinion summary produced from these reviews by one of the competing methods (ours or MeanSum). We asked the annotators to determine, for every summary sentence, whether it is fully supported, partially supported, or not supported by the input reviews (Figure 4). We collected 3 responses per summary sentence and calculated the ratio of responses for each category.
Aspect-Specific Summary Task: Finally, we studied OPINIONDIGEST's ability to generate controllable output. We presented the summaries to human judges and asked them to judge whether the summaries discussed the specified aspect exclusively, partially, or not at all (Figure 5). We again collected 3 responses per summary and calculated the percentage of responses in each category.