Unsupervised Opinion Summarization with Noising and Denoising

The supervised training of high-capacity models on large datasets containing hundreds of thousands of document-summary pairs is critical to the recent success of deep learning techniques for abstractive summarization. Unfortunately, in most domains (other than news) such training data is not available and cannot be easily sourced. In this paper we enable the use of supervised learning for the setting where there are only documents available (e.g., product or business reviews) without ground truth summaries. We create a synthetic dataset from a corpus of user reviews by sampling a review, pretending it is a summary, and generating noisy versions thereof which we treat as pseudo-review input. We introduce several linguistically motivated noise generation functions and a summarization model which learns to denoise the input and generate the original review. At test time, the model accepts genuine reviews and generates a summary containing salient opinions, treating those that do not reach consensus as noise. Extensive automatic and human evaluation shows that our model brings substantial improvements over both abstractive and extractive baselines.


Introduction
The proliferation of massive numbers of online product, service, and merchant reviews has provided strong impetus to develop systems that perform opinion mining automatically (Pang and Lee, 2008). The vast majority of previous work (Hu and Liu, 2006) breaks down the problem of opinion aggregation and summarization into three interrelated tasks involving aspect extraction (Mukherjee and Liu, 2012), sentiment identification (Pang et al., 2002; Pang and Lee, 2004), and summary creation based on extractive (Radev et al., 2000; Lu et al., 2009) or abstractive methods (Ganesan et al., 2010; Carenini et al., 2013; Gerani et al., 2014; Di Fabbrizio et al., 2014). Although potentially more challenging, abstractive approaches seem more appropriate for generating informative and concise summaries, e.g., by performing various rewrite operations (e.g., deletion of words or phrases and insertion of new ones) which go beyond simply copying and rearranging passages from the original opinions.
Abstractive summarization has enjoyed renewed interest in recent years thanks to the availability of large-scale datasets (Sandhaus, 2008; Hermann et al., 2015; Grusky et al., 2018; Liu et al., 2018; Fabbri et al., 2019) which have driven the development of neural architectures for summarizing single and multiple documents. Several approaches (See et al., 2017; Celikyilmaz et al., 2018; Paulus et al., 2018; Gehrmann et al., 2018; Liu et al., 2018; Perez-Beltrachini et al., 2019; Wang and Ling, 2016) have shown promising results with sequence-to-sequence models that encode one or several source documents and then decode the learned representations into an abstractive summary.
The supervised training of high-capacity models on large datasets containing hundreds of thousands of document-summary pairs is critical to the recent success of deep learning techniques for abstractive summarization. Unfortunately, in most domains (other than news) such training data is not available and cannot be easily sourced. For instance, manually writing opinion summaries is practically impossible since an annotator must read all available reviews for a given product or service which can be prohibitively many. Moreover, different types of products impose different restrictions on the summaries which might vary in terms of length, or the types of aspects being mentioned, rendering the application of transfer learning techniques (Pan and Yang, 2010) problematic.
Motivated by these issues, Chu and Liu (2019) consider an unsupervised learning setting where there are only documents (product or business reviews) available without corresponding summaries. They propose an end-to-end neural model to perform abstractive summarization based on (a) an autoencoder that learns representations for each review and (b) a summarization module which takes the aggregate encoding of reviews as input and learns to generate a summary which is semantically similar to the source documents. Due to the absence of ground truth summaries, the model is not trained to reconstruct the aggregate encoding of reviews; rather, it only learns to reconstruct the encodings of individual reviews. As a result, it may not be able to generate meaningful text when the number of reviews is large. Furthermore, autoencoders are constrained to use simple decoders lacking the attention (Bahdanau et al., 2014) and copy (Vinyals et al., 2015) mechanisms which have proven useful in the supervised setting, leading to the generation of informative and detailed summaries. Problematically, a powerful decoder might be detrimental to the reconstruction objective, learning to express arbitrary distributions of the output sequence while ignoring the encoded input (Kingma and Welling, 2014; Bowman et al., 2016).
In this paper, we enable the use of supervised techniques for unsupervised summarization. Specifically, we automatically generate a synthetic training dataset from a corpus of product reviews, and use this dataset to train a more powerful neural model with supervised learning. The synthetic data is created by selecting a review from the corpus, pretending it is a summary, generating multiple noisy versions thereof, and treating these as pseudo-reviews. The latter are obtained with two noise generation functions targeting textual units of different granularity: segment noising introduces noise at the word and phrase level, while document noising replaces a review with a semantically similar one. We use the synthetic data to train a neural model that learns to denoise the pseudo-reviews and generate the summary. This is motivated by how humans write opinion summaries, where denoising can be seen as removing diverging information. Our proposed model consists of a multi-source encoder and a decoder equipped with an attention mechanism. Additionally, we introduce three modules: (a) explicit denoising guides how the model removes noise from the input encodings, (b) partial copy enables the model to copy information from the source reviews only when necessary, and (c) a discriminator helps the decoder generate topically consistent text.
We perform experiments on two review datasets representing different domains (movies vs. businesses) and summarization requirements (short vs. longer summaries). Results based on automatic and human evaluation show that our method outperforms previous unsupervised summarization models, including the state-of-the-art abstractive system of Chu and Liu (2019), and is on a par with a state-of-the-art supervised model (Wang and Ling, 2016) trained on a small sample of (genuine) review-summary pairs.

Related Work
Most previous work on unsupervised opinion summarization has focused on extractive approaches (Ku et al., 2006; Paul et al., 2010; Angelidis and Lapata, 2018), where a clustering model groups opinions of the same aspect, and a sentence extraction model identifies text representative of each cluster. Ganesan et al. (2010) propose a graph-based abstractive framework for generating concise opinion summaries, while Di Fabbrizio et al. (2014) use an extractive system to first select salient sentences and then generate an abstractive summary based on hand-written templates (Carenini and Moore, 2006).
As mentioned earlier, we follow the setting of Chu and Liu (2019) in assuming that we have access to reviews but no gold-standard summaries. Their model learns to generate opinion summaries by reconstructing a canonical review from the average encoding of input reviews. Our proposed method is also abstractive and neural-based, but eschews the use of an autoencoder in favor of supervised sequence-to-sequence learning through the creation of a synthetic training dataset. Concurrently with our work, Bražinskas et al. (2019) use a hierarchical variational autoencoder to learn a latent code of the summary. While they also use randomly sampled reviews for supervised training, our dataset construction method is more principled, making use of linguistically motivated noise functions.
Our work relates to denoising autoencoders (DAEs; Vincent et al., 2008), which have been effectively used as unsupervised methods for various NLP tasks. Earlier approaches have shown that DAEs can be used to learn high-level text representations for domain adaptation (Glorot et al., 2011) and multimodal representations of textual and visual input (Silberer and Lapata, 2014). Recent work has applied DAEs to text generation tasks, specifically to data-to-text generation (Freitag and Roy, 2018) and extractive sentence compression (Fevry and Phang, 2018). Our model differs from these approaches in two respects. Firstly, while previous work has adopted trivial noising methods such as randomly adding or removing words (Fevry and Phang, 2018) and randomly corrupting encodings (Silberer and Lapata, 2014), our noise generators are more linguistically informed and suitable for the opinion summarization task. Secondly, while in Freitag and Roy (2018) the decoder is limited to vanilla RNNs, our noising method enables the use of more complex architectures, enhanced with attention and copy mechanisms, which are known to improve the performance of summarization systems (Rush et al., 2015;See et al., 2017).

Modeling Approach
Let X = {x_1, ..., x_N} denote a set of reviews about a product (e.g., a movie or business). Our aim is to generate a summary y of the opinions expressed in X. We further assume access to a corpus C = {X_1, ..., X_M} containing multiple reviews about M products without corresponding opinion summaries.
Our method consists of two parts. We first create a synthetic dataset D = {(X, y)} consisting of summary-review pairs. Specifically, we sample a review from C, pretend it is a summary y, and generate multiple noisy versions thereof (i.e., pseudo-reviews X). At training time, a denoising model learns to remove the noise from the pseudo-reviews and generate the summary. At test time, the same denoising model is used to summarize actual reviews. We use denoising as an auxiliary task for opinion summarization to simulate the fact that summaries tend to omit opinions that do not represent a consensus (i.e., noise in the pseudo-review), but include salient opinions found in most reviews (i.e., non-noisy parts of the pseudo-review).

Synthetic Dataset Creation via Noising
We sample a review as a candidate summary and generate noisy versions thereof, using two functions: (a) segment noising adds noise at the token and chunk level, and (b) document noising adds noise at the text level. The noise functions are illustrated in Figure 1.
[Figure 1: Example of the noise functions applied to a candidate summary, e.g., "the movie is a fun comedy with fine performances ." yields the token-level noisy version "the film is a nice comedy with stellar cast ." and the chunk-level noisy version "with good production and zohan , a nice comedy by the film ."]

Summary Sampling Summaries and reviews follow different writing conventions. For example, reviews are subjective, and often include first-person singular pronouns such as I and my and several unnecessary characters or symbols. They may also vary in length and detail. We discard reviews from corpus C which display an excess of these characteristics, based on a list of domain-specific constraints (detailed in Section 4). We sample a review y from the filtered corpus, which we use as the candidate summary.
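For concreteness, a minimal sketch of this filtering step is given below; the thresholds mirror the constraints reported in Section 4, while the pronoun list and function name are illustrative assumptions rather than the authors' exact implementation.

```python
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}  # assumed pronoun list

def is_candidate_summary(review, min_len=20, max_len=30, max_symbols=3):
    """Check the summary-sampling constraints (Section 4):
    (1) fewer than `max_symbols` non-alphanumeric symbols,
    (2) no first-person singular pronouns,
    (3) token length within [min_len, max_len] (50-90 for Yelp)."""
    tokens = review.lower().split()
    symbols = [ch for ch in review if not ch.isalnum() and not ch.isspace()]
    if len(symbols) >= max_symbols:
        return False
    if any(t.strip(".,!?") in FIRST_PERSON for t in tokens):
        return False
    return min_len <= len(tokens) <= max_len
```

Candidate summaries y are then drawn from the reviews that pass this filter.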

Segment Noising Given candidate summary y = {w_1, ..., w_L}, we create a set of segment-level noisy versions X^(c) = {x^(c)_1, ..., x^(c)_N}. Previous work has adopted noising techniques based on random n-gram alterations (Fevry and Phang, 2018); we instead rely on two simple, linguistically informed noise functions. Firstly, we train a bidirectional language model (BiLM; Peters et al., 2018) on the review corpus C. For each word in y, the BiLM predicts a softmax word distribution which can be used to replace words. Secondly, we utilize FLAIR (Akbik et al., 2019), an off-the-shelf state-of-the-art syntactic chunker that leverages contextual embeddings, to shallow parse each review r in corpus C. This results in a list of chunks C_r = {c_1, ..., c_K} with corresponding syntactic labels G_r = {g_1, ..., g_K} for each review r, which we use for replacing and rearranging chunks.
Segment-level noise involves token- and chunk-level alterations. Token-level alterations are performed by replacing tokens in y with probability p_R. Specifically, we replace token w_j in y by sampling a new token from the BiLM's predicted word distribution (see Figure 1). We use nucleus sampling (Holtzman et al., 2019), which samples from a rescaled distribution over the smallest set of words whose cumulative probability exceeds a threshold p_N, instead of the original distribution. This has been shown to yield better samples in comparison to top-k sampling, mitigating the problem of text degeneration (Holtzman et al., 2019). Chunk-level alterations are performed by removing and inserting chunks in y, and rearranging them based on a sampled syntactic template. Specifically, we first shallow parse y using FLAIR, obtaining a list of chunks C_y, each of which is removed with probability p_R. We then randomly sample a review r from our corpus and use its sequence of chunk labels G_r as a syntactic template, which we fill in with chunks from C_y (sampled without replacement), if available, or with chunks from corpus C, otherwise. This results in a noisy version x^(c) (see Figure 1 for an example). Repeating the process N times produces the noisy set X^(c). We describe this process step-by-step in the Appendix.

Document Noising Given candidate summary y = {w_1, ..., w_L}, we also create a set of document-level noisy versions X^(d) = {x^(d)_1, ..., x^(d)_N}. Instead of manipulating parts of the summary, we replace it altogether with a similar review from the corpus and treat that review as a noisy version. Specifically, we select the N reviews that are most similar to y and discuss the same product. To measure similarity, we use IDF-weighted ROUGE-1 F1 (Lin, 2004), i.e., the lexical overlap between a review and the candidate summary, weighted by token importance:

\[ P = \frac{\sum_{w \in x} \mathbb{1}(w \in y)\,\mathrm{IDF}(w)}{\sum_{w \in x} \mathrm{IDF}(w)} \qquad R = \frac{\sum_{w \in y} \mathbb{1}(w \in x)\,\mathrm{IDF}(w)}{\sum_{w \in y} \mathrm{IDF}(w)} \qquad F_1 = \frac{2PR}{P + R} \]

where x is a review in the corpus, 1(·) is an indicator function, and P, R, and F_1 are the ROUGE-1 precision, recall, and F1, respectively. The N reviews with the highest F_1 are selected as noisy versions of y, resulting in the noisy set X^(d) (see Figure 1).

We create a total of 2N noisy versions of y, i.e., X = X^(c) ∪ X^(d), and obtain our synthetic training data D = {(X, y)} by generating |D| pseudo-review-summary pairs. Both noising methods are necessary to achieve aspect diversity amongst input reviews: segment noising creates reviews which may mention aspects not found in the summary, while document noising creates reviews with content similar to the summary. Relying on either noise function alone decreases performance (see the ablation studies in Section 5). We show examples of these noisy versions in the Appendix.
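The two core primitives of these noise functions can be sketched as follows: nucleus (top-p) sampling for token-level replacement, and the IDF-weighted ROUGE-1 F1 used to rank document-level noisy versions. This is a simplified illustration, not the authors' code: the BiLM and chunker are assumed to live elsewhere, `probs` is the BiLM's replacement distribution for one position, and `idf` is a precomputed token-to-IDF dictionary.

```python
import numpy as np

def nucleus_sample(probs, vocab, p_n=0.9):
    """Sample a replacement token from the smallest set of words whose
    cumulative probability exceeds p_n (Holtzman et al., 2019)."""
    order = np.argsort(probs)[::-1]               # token ids, most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p_n)) + 1
    nucleus = order[:cutoff]                      # the top-p "nucleus"
    rescaled = probs[nucleus] / probs[nucleus].sum()
    return vocab[np.random.choice(nucleus, p=rescaled)]

def idf_rouge1_f1(review, summary, idf):
    """IDF-weighted ROUGE-1 F1 between review x and candidate summary y,
    following the precision/recall definitions above."""
    x, y = review.split(), summary.split()
    x_set, y_set = set(x), set(y)
    p_num = sum(idf.get(w, 0.0) for w in x if w in y_set)
    p = p_num / max(sum(idf.get(w, 0.0) for w in x), 1e-8)
    r_num = sum(idf.get(w, 0.0) for w in y if w in x_set)
    r = r_num / max(sum(idf.get(w, 0.0) for w in y), 1e-8)
    return 2 * p * r / max(p + r, 1e-8)
```

Document noising then amounts to scoring every same-product review with `idf_rouge1_f1` and keeping the N highest-scoring ones.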

Summarization via Denoising
We summarize (i.e., denoise) the input X with our model, which we call DENOISESUM, illustrated in Figure 2. A multi-source encoder produces an encoding for each pseudo-review. The encodings are further corrected via an explicit denoising module, and then fused into an aggregate encoding for each type of noise. Finally, the fused encodings are passed to a decoder with a partial copy mechanism to generate the summary y.
Multi-Source Encoder For each pseudo-review x_j ∈ X, where x_j = {w_1, ..., w_L} and w_k is the kth token in x_j, we obtain contextualized token encodings {h_k} and an overall review encoding d_j with a BiLSTM encoder (Hochreiter and Schmidhuber, 1997):

\[ h_k = [\overrightarrow{h}_k ; \overleftarrow{h}_k] \qquad d_j = [\overrightarrow{h}_L ; \overleftarrow{h}_1] \]

where \overrightarrow{h}_k and \overleftarrow{h}_k are the forward and backward hidden states of the BiLSTM at timestep k, and [·;·] denotes concatenation (see module (a) in Figure 2).

Explicit Denoising The model should be able to remove noise from the encodings before decoding the text. While previous methods (Vincent et al., 2008; Freitag and Roy, 2018) implicitly assign the denoising task to the encoder, we propose an explicit denoising component (see module (b) in Figure 2). Specifically, we create a correction vector c^(c)_j and add it to the review encoding:

\[ c^{(c)}_j = \tanh\big(W^{(c)}[d_j ; q] + b^{(c)}\big) \qquad \tilde{d}^{(c)}_j = d_j + c^{(c)}_j \]

where q represents a mean review encoding and functions as a query vector, W^(c) and b^(c) are learned parameters, and superscript (c) signifies segment noising. We can interpret the correction vector as removing or adding information to each dimension when its value is negative or positive, respectively. Analogously, we obtain d̃^(d)_j for document noisy inputs.

Noise-Specific Fusion For each type of noise (segment and document), we create a noise-specific aggregate encoding by fusing the denoised encodings into one (see module (c) in Figure 2). Given {d̃^(c)_j}, the set of denoised encodings corresponding to segment noisy inputs, we create aggregate encoding s^(c) as a gated sum:

\[ \alpha_j = \sigma\big(W^{(f)}[\tilde{d}^{(c)}_j ; q] + b^{(f)}\big) \qquad s^{(c)} = \sum_j \alpha_j \odot \tilde{d}^{(c)}_j \]

where α_j is a gate vector with the same dimensionality as the denoised encodings. Analogously, we obtain s^(d) from document noisy inputs.
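A minimal PyTorch sketch of the explicit denoising and fusion modules, as reconstructed above, is shown below; the module name, dimensions, and single-layer parameterization are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DenoiseFuse(nn.Module):
    """Explicit denoising (module b) followed by noise-specific fusion
    (module c) for one noise type; one instance per noise type."""
    def __init__(self, dim):
        super().__init__()
        self.correct = nn.Linear(2 * dim, dim)   # correction vector c_j
        self.gate = nn.Linear(2 * dim, dim)      # fusion gate alpha_j

    def forward(self, d):                         # d: (N, dim) review encodings
        q = d.mean(dim=0, keepdim=True)           # mean encoding as query vector
        c = torch.tanh(self.correct(torch.cat([d, q.expand_as(d)], dim=-1)))
        d_tilde = d + c                           # denoised encodings
        alpha = torch.sigmoid(
            self.gate(torch.cat([d_tilde, q.expand_as(d_tilde)], dim=-1)))
        s = (alpha * d_tilde).sum(dim=0)          # aggregate encoding s
        return s, d_tilde
```

One such module would be instantiated for segment noisy inputs and another for document noisy inputs, yielding s^(c) and s^(d).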
Partial Copy An advantage of our method is its ability to incorporate techniques used in supervised models, such as attention (Bahdanau et al., 2014) and copy (Vinyals et al., 2015). Pseudo-reviews created using segment noising include various chunk permutations, which can result in ungrammatical and incoherent text; using a copy mechanism on these texts may hurt the fluency of the output. We therefore allow copying from document noisy inputs only (see module (d) in Figure 2).

We use two LSTM decoders over the aggregate encodings, one equipped with attention and copy mechanisms (for s^(d)) and one with attention but without a copy mechanism (for s^(c)). We then combine the outputs of these decoders using a learned gate. Specifically, token w_t at timestep t is predicted as:

\[ g_t = \sigma\big(W^{(p)}[s^{(c)}_t ; s^{(d)}_t] + b^{(p)}\big) \qquad p(w_t) = g_t\, p^{(c)}(w_t) + (1 - g_t)\, p^{(d)}(w_t) \]

where s_t and p(w_t) are the hidden state and predicted token distribution of the corresponding decoder at timestep t, and σ(·) is the sigmoid function.
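The per-step combination can likewise be sketched in a few lines; the gate parameterization is an assumption consistent with the equation above.

```python
import torch
import torch.nn as nn

class PartialCopyGate(nn.Module):
    """Mix the token distributions of the copy-less decoder (segment side)
    and the copy-equipped decoder (document side) with a learned gate."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, 1)

    def forward(self, s_c, s_d, p_c, p_d):
        # s_c, s_d: (batch, dim) hidden states; p_c, p_d: (batch, vocab)
        g = torch.sigmoid(self.gate(torch.cat([s_c, s_d], dim=-1)))
        return g * p_c + (1 - g) * p_d            # final p(w_t)
```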

Training and Inference
We use a maximum likelihood loss to optimize the generation probability distribution based on summary y = {w_1, ..., w_L} from our synthetic dataset:

\[ \mathcal{L}_{gen} = -\sum_{t=1}^{L} \log p(w_t) \]

The decoder depends on L_gen to generate meaningful, denoised outputs. As this is a rather indirect way to optimize our denoising module, we additionally use a discriminative loss providing direct supervision. The discriminator operates at the output of the fusion module and predicts the category distribution p(z) of the output summary y (see module (e) in Figure 2). The type of categories varies across domains: for movies, categories can be information about their genre (e.g., drama, comedy), while for businesses, their specific type (e.g., restaurant, beauty parlor). This information is often included in reviews, but we assume otherwise and use an LDA topic model (Blei et al., 2003) to infer p(z) (we present experiments with human-labeled and automatically induced categories in Section 5). An MLP classifier takes as input the aggregate encodings s^(c) and s^(d) and infers q(z). The discriminator is trained by minimizing the KL divergence between the actual and predicted category distributions p(z) and q(z):

\[ \mathcal{L}_{dis} = D_{KL}\big(p(z) \,\|\, q(z)\big) \]

The final objective is the sum of both loss functions:

\[ \mathcal{L} = \mathcal{L}_{gen} + \mathcal{L}_{dis} \]

At test time, we are given genuine reviews X as input instead of synthetic ones. We generate a summary by treating X as both X^(c) and X^(d), i.e., as the outcome of segment and document noising.
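A sketch of the combined objective in PyTorch follows; tensor shapes and names are illustrative, and the KL direction matches the reconstruction above.

```python
import torch.nn.functional as F

def total_loss(log_probs, target, q_log_probs, p_dist):
    """L = L_gen + L_dis.
    log_probs:   (L, vocab) decoder log-distributions per timestep
    target:      (L,) gold summary token ids
    q_log_probs: (num_categories,) classifier log-distribution log q(z)
    p_dist:      (num_categories,) LDA-inferred target distribution p(z)"""
    l_gen = F.nll_loss(log_probs, target, reduction="sum")  # -sum_t log p(w_t)
    l_dis = F.kl_div(q_log_probs, p_dist, reduction="sum")  # KL(p(z) || q(z))
    return l_gen + l_dis
```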

Experimental Setup
Dataset We performed experiments on two datasets which represent different domains and summary types. The Rotten Tomatoes dataset (Wang and Ling, 2016) contains a large set of reviews for various movies written by critics. Each set of reviews has a gold-standard consensus summary written by an editor. We follow the data partition of Wang and Ling (2016). The Yelp dataset of Chu and Liu (2019) includes a large training corpus of reviews without gold-standard summaries; the latter are provided for the development and test sets and were generated by an Amazon Mechanical Turker. We follow the splits introduced in their work. A comparison between the two datasets is provided in Table 1. As can be seen, Rotten Tomatoes summaries are generally short, while Yelp summaries are about three times longer. Interestingly, there are many more reviews to summarize in Rotten Tomatoes (approximately 100 reviews), while input reviews in Yelp are considerably fewer (i.e., 8 reviews).
Implementation To create the synthetic dataset, we sample candidate summaries using the following constraints: (1) the number of non-alphanumeric symbols must be less than 3, (2) there must be no first-person singular pronouns (not used for Yelp), and (3) the number of tokens must be between 20 and 30 (50 and 90 for Yelp). We set p_R to 0.8 and 0.4 for token and chunk noise, respectively, and p_N to 0.9. For each review-summary pair, the number of reviews N is sampled from the Gaussian distribution N(µ, σ²), where µ and σ are the mean and standard deviation of the number of reviews in the development set. We created 25k (Rotten Tomatoes) and 100k (Yelp) pseudo-reviews for our synthetic datasets (see Table 1). We set the dimensions of the word embeddings to 300, the vocabulary size to 50k, the hidden dimensions to 256, the batch size to 8, and dropout (Srivastava et al., 2014) to 0.1. For our discriminator, we employed an LDA topic model trained on the review corpus, with 50 (Rotten Tomatoes) and 100 (Yelp) topics (tuned on the development set). The LSTM weights were pretrained with a language modeling objective, using the corpus as training data. For Yelp, we additionally trained a coverage mechanism (See et al., 2017) in a separate training phase to avoid repetition. We used the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 0.001 and an l2 constraint of 3. At test time, summaries were generated using length-normalized beam search with a beam size of 5. We performed early stopping based on the performance of the model on the development set. Our model was trained on a single GeForce GTX 1080 Ti GPU and is implemented in PyTorch.

[Table 4: ROUGE-L of our model and versions thereof with less synthetic data (second block), using only one noising method (third block), and without some modules (fourth block). A more comprehensive table and discussion can be found in the Appendix.]

Comparison Systems We compared DENOISESUM to several unsupervised extractive and abstractive methods. Extractive approaches include (a) LEXRANK (Erkan and Radev, 2004), an algorithm similar to PageRank that generates summaries by selecting the most salient sentences; (b) WORD2VEC (Rossiello et al., 2017), a centroid-based method which represents the input as IDF-weighted word embeddings and selects as summary the review closest to the centroid; and (c) SENTINEURON, which is similar to WORD2VEC but uses a language model called Sentiment Neuron (Radford et al., 2017).
Abstractive methods include (d) OPINOSIS (Ganesan et al., 2010), a graph-based summarizer that generates concise summaries of highly redundant opinions, and (e) MEANSUM (Chu and Liu, 2019), a neural model that generates a summary by reconstructing text from aggregate encodings of reviews. Finally, for Rotten Tomatoes, we also compared with the state-of-the-art supervised model proposed in Amplayo and Lapata (2019) which used the original training split. Examples of system summaries are shown in the Appendix.

Results
Automatic Evaluation Our results on Rotten Tomatoes are shown in Table 2. Following previous work (Wang and Ling, 2016; Amplayo and Lapata, 2019), we report five metrics: METEOR (Denkowski and Lavie, 2014), a recall-oriented metric that rewards matching stems, synonyms, and paraphrases; ROUGE-SU4 (Lin, 2004), the recall of unigrams and skip-bigrams of up to four words; and the F1-scores of ROUGE-1/2/L, which respectively measure word overlap, bigram overlap, and the longest common subsequence between system and reference summaries. Results on Yelp are given in Table 3, where we compare systems using ROUGE-1/2/L F1, following Chu and Liu (2019).

As can be seen, DENOISESUM outperforms all competing models on both datasets. When compared to MEANSUM, the difference in performance is especially large on Rotten Tomatoes, where we see a 4.01 improvement in ROUGE-L. We believe this is because MEANSUM does not learn to reconstruct encodings of aggregated inputs, and as a result is unable to produce meaningful summaries when the number of input reviews is large, as is the case for Rotten Tomatoes. In fact, the best extractive model, SENTINEURON, slightly outperforms MEANSUM on this dataset across metrics, with the exception of ROUGE-L. When compared to the best supervised system, DENOISESUM performs comparably on several metrics, specifically METEOR and ROUGE-1; however, there is still a gap on ROUGE-2, showing the limitations of systems trained without gold-standard summaries.

Table 4 presents various ablation studies on Rotten Tomatoes (RT) and Yelp which assess the contribution of different model components. Our experiments confirm that increasing the size of the synthetic data improves performance, and that both segment and document noising are useful. We also show that explicit denoising, partial copy, and the discriminator help achieve the best results. Finally, human-labeled categories (instead of LDA topics) decrease model performance, which suggests that more useful labels can be approximated by automatic means.
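As a pointer for replication, the ROUGE variants reported in this section can be computed with the rouge-score package; note that this is one of several reimplementations and its scores may differ slightly from the toolkit used in the original evaluations.

```python
# pip install rouge-score
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    target="the movie is a fun comedy with fine performances .",
    prediction="the film is a nice comedy with a stellar cast .",
)
for name, score in scores.items():
    print(name, round(score.fmeasure, 4))  # F1 of unigram/bigram/LCS overlap
```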
Human Evaluation We also conducted two judgment elicitation studies using the Amazon Mechanical Turk (AMT) crowdsourcing platform. The first study assessed the quality of the summaries using Best-Worst Scaling (BWS; Louviere et al., 2015), a less labor-intensive alternative to paired comparisons that has been shown to produce more reliable results than rating scales (Kiritchenko and Mohammad, 2017). Specifically, participants were shown the movie/business name, some basic background information, and a gold-standard summary. They were also presented with three system summaries, produced by SENTINEURON (best extractive model), MEANSUM (most related unsupervised model), and DENOISESUM.
Participants were asked to select the best and worst among system summaries, taking into account how much they deviated from the ground truth summary in terms of: Informativeness (i.e., does the summary present opinions about specific aspects of the movie/business in a concise manner?), Coherence (i.e., is the summary easy to read and does it follow a natural ordering of facts?), and Grammaticality (i.e., is the summary fluent and grammatical?). We randomly selected 50 instances from the test set and collected five judgments for each comparison. The order of summaries was randomized per participant. A rating per system was computed as the percentage of times it was chosen as best minus the percentage of times it was selected as worst. Results are reported in Table 5, where Inf, Coh, and Gram are shorthands for Informativeness, Coherence, and Grammaticality. DENOISESUM was ranked best in terms of informativeness and coherence, while the extractive system SENTINEURON was ranked best on grammaticality. This is not entirely surprising since extractive summaries are composed of sentences written by humans and are therefore grammatical by construction.
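Since the BWS rating is just best-minus-worst percentages, it can be made explicit with a small helper; the data layout below is hypothetical.

```python
from collections import Counter

def bws_rating(judgments, system):
    """Rating in [-1, 1]: fraction of times chosen best minus fraction worst.
    judgments: list of (best_system, worst_system) tuples, one per comparison."""
    best = Counter(b for b, _ in judgments)
    worst = Counter(w for _, w in judgments)
    return (best[system] - worst[system]) / len(judgments)

# toy example: five comparisons of three systems
votes = [("DenoiseSum", "MeanSum")] * 3 + [("SentiNeuron", "DenoiseSum")] * 2
print(bws_rating(votes, "DenoiseSum"))  # (3 - 2) / 5 = 0.2
```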
Our second study examined the veridicality of the generated summaries, namely whether the facts mentioned in them are indeed discussed in the input reviews. Participants were shown reviews and the corresponding summary, and were asked to verify for each summary sentence whether it was fully supported by the reviews, partially supported, or not supported at all. We performed this experiment on Yelp only, since the number of reviews is small and participants could read them all in a timely fashion. We used the same 50 instances as in our first study and collected five judgments per instance. Participants assessed the summaries produced by MEANSUM and DENOISESUM. We also included GOLD-standard summaries as an upper bound, but no output from an extractive system, as it by default contains facts mentioned in the reviews. Table 5 reports the percentage of fully supported (FullSupp), partially supported (PartSupp), and unsupported (NoSupp) sentences. Gold summaries display the highest percentage of fully supported sentences (63.3%), followed by DENOISESUM (55.1%) and MEANSUM (41.7%). These results are encouraging, indicating that our model hallucinates to a lesser extent than MEANSUM.

Conclusions
We consider an unsupervised learning setting for opinion summarization where there are only reviews available without corresponding summaries.
Our key insight is to enable the use of supervised techniques by creating synthetic review-summary pairs using noise generation methods. Our summarization model, DENOISESUM, introduces explicit denoising, partial copy, and discrimination modules which improve overall summary quality, outperforming competitive systems by a wide margin. In the future, we would like to model aspects and sentiment more explicitly as well as apply some of the techniques presented here to unsupervised single-document summarization.