Summarizing Opinions: Aspect Extraction Meets Sentiment Prediction and They Are Both Weakly Supervised

We present a neural framework for opinion summarization from online product reviews which is knowledge-lean and only requires light supervision (e.g., in the form of product domain labels and user-provided ratings). Our method combines two weakly supervised components to identify salient opinions and form extractive summaries from multiple reviews: an aspect extractor trained under a multi-task objective, and a sentiment predictor based on multiple instance learning. We introduce an opinion summarization dataset that includes a training set of product reviews from six diverse domains and human-annotated development and test sets with gold standard aspect annotations, salience labels, and opinion summaries. Automatic evaluation shows significant improvements over baselines, and a large-scale study indicates that our opinion summaries are preferred by human judges according to multiple criteria.


Introduction
Opinion summarization, i.e., the aggregation of user opinions as expressed in online reviews, blogs, internet forums, or social media, has drawn much attention in recent years due to its potential for various information access applications. For example, consumers have to wade through many product reviews in order to make an informed decision. The ability to summarize these reviews succinctly would allow customers to efficiently absorb large amounts of opinionated text and manufacturers to keep track of what customers think about their products (Liu, 2012).
The majority of work on opinion summarization is entity-centric, aiming to create summaries from text collections that are relevant to a particular entity of interest, e.g., product, person, company, and so on. A popular decomposition of the problem involves three subtasks (Hu and Liu, 2004, 2006): (1) aspect extraction, which aims to find specific features pertaining to the entity of interest (e.g., battery life, sound quality, ease of use) and identify expressions that discuss them; (2) sentiment prediction, which determines the sentiment orientation (positive or negative) on the aspects found in the first step; and (3) summary generation, which presents the identified opinions to the user (see Figure 1 for an illustration of the task).
Footnote 1: Our code and dataset are publicly available at https://github.com/stangelid/oposum.
A number of techniques have been proposed for aspect discovery using part of speech tagging (Hu and Liu, 2004), syntactic parsing (Lu et al., 2009), clustering (Mei et al., 2007; Titov and McDonald, 2008b), data mining (Ku et al., 2006), and information extraction (Popescu and Etzioni, 2005). Various lexicon and rule-based methods (Hu and Liu, 2004; Ku et al., 2006; Blair-Goldensohn et al., 2008) have been adopted for sentiment prediction together with a few learning approaches (Lu et al., 2009; Pappas and Popescu-Belis, 2017; Angelidis and Lapata, 2018). As for the summaries, a common format involves a list of aspects and the number of positive and negative opinions for each (Hu and Liu, 2004). While this format gives an overall idea of people's opinion, reading the actual text might be necessary to gain a better understanding of specific details. Textual summaries are created following mostly extractive methods (but see Ganesan et al. 2010 for an abstractive approach), and various formats ranging from lists of words (Popescu and Etzioni, 2005), to phrases (Lu et al., 2009), and sentences (Mei et al., 2007; Blair-Goldensohn et al., 2008; Lerman et al., 2009; Wang and Ling, 2016).
In this paper, we present a neural framework for opinion extraction from product reviews. We follow the standard architecture for aspect-based summarization, while taking advantage of the success of neural network models in learning continuous features without recourse to preprocessing tools or linguistic annotations. Central to our system is the ability to accurately identify aspect-specific opinions by using different sources of information freely available with product reviews (product domain labels, user ratings) and minimal domain knowledge (essentially a few aspect-denoting keywords). We incorporate these ideas into a recently proposed aspect discovery model (He et al., 2017) which we combine with a weakly supervised sentiment predictor (Angelidis and Lapata, 2018) to identify highly salient opinions. Our system outputs extractive summaries using a greedy algorithm to minimize redundancy. Our approach takes advantage of weak supervision signals only, requires minimal human intervention and no gold-standard salience labels or summaries for training.
Figure 1: Aspect-based opinion summarization. Opinions on image quality, sound quality, connectivity, and price of an LCD television are extracted from a set of reviews. Their polarities are then used to sort them into positive and negative, while neutral or redundant comments are discarded.
Our contributions in this work are three-fold: a novel neural framework for the identification and extraction of salient customer opinions that combines aspect and sentiment information and does not require unrealistic amounts of supervision; the introduction of an opinion summarization dataset which consists of Amazon reviews from six product domains, and includes development and test sets with gold standard aspect annotations, salience labels, and multi-document extractive summaries; and a large-scale user study on the quality of the final summaries paired with automatic evaluations for each stage in the summarization pipeline (aspects, extraction accuracy, final summaries). Experimental results demonstrate that our approach outperforms strong baselines in terms of opinion extraction accuracy and similarity to gold standard summaries. Human evaluation further shows that our summaries are preferred over comparison systems across multiple criteria.

Related Work
It is outside the scope of this paper to provide a detailed treatment of the vast literature on opinion summarization and related tasks. For a comprehensive overview of non-neural methods we refer the interested reader to Kim et al. (2011) and Liu and Zhang (2012). We are not aware of previous studies which propose a neural-based system for end-to-end opinion summarization without direct supervision, although as we discuss below, recent efforts tackle various subtasks independently.
Aspect Extraction Several neural network models have been developed for the identification of aspects (e.g., words or phrases) expressed in opinions. This is commonly viewed as a supervised sequence labeling task; Liu et al. (2015) employ recurrent neural networks, whereas Yin et al. (2016) use dependency-based embeddings as features in a Conditional Random Field (CRF). Wang et al. (2016) combine a recursive neural network with CRFs to jointly model aspect and sentiment terms. He et al. (2017) propose an aspect-based autoencoder to discover fine-grained aspects without supervision, in a process similar to topic modeling. Their model outperforms LDA-style approaches and forms the basis of our aspect extractor.
Sentiment Prediction Fully-supervised approaches based on neural networks have achieved impressive results on fine-grained sentiment classification (Kim, 2014; Socher et al., 2013). More recently, Multiple Instance Learning (MIL) models have been proposed that use freely available review ratings to train segment-level predictors. Kotzias et al. (2015) and Pappas and Popescu-Belis (2017) train sentence-level predictors under a MIL objective, while our previous work (Angelidis and Lapata, 2018) introduced MILNET, a hierarchical model that is trained end-to-end on document labels and produces polarity-based opinion summaries of single reviews. Here, we use MILNET to predict the sentiment polarity of individual opinions.
Multi-document Summarization A few extractive neural models have been recently applied to generic multi-document summarization. Cao et al. (2015) train a recursive neural network using a ranking objective to identify salient sentences, while follow-up work (Cao et al., 2017) employs a multi-task objective to improve sentence extraction, an idea we adapted to our task. Yasunaga et al. (2017) propose a graph convolution network to represent sentence relations and estimate sentence salience. Our summarization method is tailored to the opinion extraction task: it identifies aspect-specific and salient units, while minimizing the redundancy of the final summary with a greedy selection algorithm (Cao et al., 2015; Yasunaga et al., 2017). Redundancy is also addressed in Ganesan et al. (2010), who propose a graph-based framework for abstractive summarization. Wang and Ling (2016) introduce an encoder-decoder neural method for extractive opinion summarization. Their approach requires direct supervision via gold-standard extractive summaries for training, in contrast to our weakly supervised formulation.

Problem Formulation
Let $C$ denote a corpus of reviews on a set of products $E_C = \{e_i\}_{i=1}^{|E_C|}$ from a domain $d_C$, e.g., televisions or keyboards. For every product $e$, the corpus contains a set of reviews $R_e = \{r_i\}_{i=1}^{|R_e|}$ expressing customers' opinions. Each review $r_i$ is accompanied by the author's overall rating $y_i$ and is split into segments $(s_1, \ldots, s_m)$, where each segment $s_j$ is in turn viewed as a sequence of words $(w_{j1}, \ldots, w_{jn})$. A segment can be a sentence, a phrase, or in our case an Elementary Discourse Unit (EDU; Mann and Thompson 1988) obtained from a Rhetorical Structure Theory (RST) parser (Feng and Hirst, 2012). EDUs roughly correspond to clauses and have been shown to facilitate performance in summarization (Li et al., 2016), document-level sentiment analysis (Bhatia et al., 2015), and single-document opinion extraction (Angelidis and Lapata, 2018).
A segment may discuss zero or more aspects, i.e., different product attributes. We use $A_C = \{a_i\}_{i=1}^{K}$ to refer to the aspects pertaining to domain $d_C$. For example, picture quality, sound quality, and connectivity are all aspects of televisions. By convention, a general aspect is assigned to segments that do not discuss any specific aspects. Let $A_s \subseteq A_C$ denote the set of aspects mentioned in segment $s$; $pol_s \in [-1, +1]$ marks the polarity a segment conveys, where $-1$ indicates maximally negative and $+1$ maximally positive sentiment. An opinion is represented by tuple $o_s = (s, A_s, pol_s)$, and $O_e = \{o_s\}_{s \in R_e}$ represents the set of all opinions expressed in $R_e$.
For each product $e$, our goal is to produce a summary of the most salient opinions expressed in reviews $R_e$, by selecting a small subset $S_e \subset O_e$. We expect segments that discuss specific product aspects to be better candidates for useful summaries. We hypothesize that general comments mostly describe customers' overall experience, which can also be inferred by their rating, whereas aspect-related comments provide specific reasons for their overall opinion. We also assume that segments conveying highly positive or negative sentiment are more likely to present informative opinions compared to neutral ones, a claim supported by previous work (Angelidis and Lapata, 2018).
We describe our novel approach to aspect extraction in Section 4 and detail how we combine aspect, sentiment, and redundancy information to produce opinion summaries in Section 5.

Aspect Extraction
Our work builds on the aspect discovery model developed by He et al. (2017), which we extend to facilitate the accurate extraction of aspect-specific review segments in a more realistic setting. In this section, we first describe their approach, point out its shortcomings, and then present the extensions and modifications introduced in our Multi-Seed Aspect Extractor (MATE) model.

Aspect-Based Autoencoder
The Aspect-Based Autoencoder (ABAE; He et al. 2017) is an adaptation of the Relationship Modeling Network (Iyyer et al., 2016), originally designed to identify attributes of fictional book characters and their relationships. The model learns a segment-level aspect predictor without supervision by attempting to reconstruct the input segment's encoding as a linear combination of aspect embeddings. ABAE starts by pairing each word $w$ with a pre-trained word embedding $v_w \in \mathbb{R}^d$, thus constructing a word embedding dictionary $L \in \mathbb{R}^{V \times d}$, where $V$ is the size of the vocabulary. The model also keeps an aspect embedding dictionary $A \in \mathbb{R}^{K \times d}$, where $K$ is the number of aspects to be identified and the $i$-th row $a_i \in \mathbb{R}^d$ is a point in the word embedding space. Matrix $A$ is initialized using the centroids from a k-means clustering on the vocabulary's word embeddings.
The autoencoder first produces a vector $v_s$ for review segment $s = (w_1, \ldots, w_n)$ using an attention encoder that learns to attend on aspect words. A segment encoding is computed as the weighted average of word vectors:
$$v_s = \sum_{i=1}^{n} c_i v_{w_i}, \tag{1}$$
$$c_i = \frac{\exp(v_{w_i}^\top M v'_s)}{\sum_{j=1}^{n} \exp(v_{w_j}^\top M v'_s)}, \tag{2}$$
where $c_i$ is the $i$-th word's attention weight, $v'_s$ is a simple average of the segment's word embeddings, and attention matrix $M \in \mathbb{R}^{d \times d}$ is learned during training.
Vector $v_s$ is fed into a softmax classifier to predict a probability distribution over $K$ aspects:
$$p_s^{asp} = \mathrm{softmax}(W v_s + b), \tag{3}$$
where $W \in \mathbb{R}^{K \times d}$ and $b \in \mathbb{R}^K$ are the classifier's weight and bias parameters. The segment's vector is then reconstructed as the weighted sum of aspect embeddings:
$$r_s = A^\top p_s^{asp}. \tag{4}$$
The model is trained by minimizing a reconstruction loss $J_r(\theta)$ that uses randomly sampled segments $n_1, n_2, \ldots, n_{k_n}$ as negative examples:
$$J_r(\theta) = \sum_{s \in C} \sum_{i=1}^{k_n} \max(0,\; 1 - r_s^\top v_s + r_s^\top v_{n_i}). \tag{5}$$
ABAE is essentially a neural topic model; it discovers topics which will hopefully map to aspects, without any preconceptions about the aspects themselves, a feature shared with most previous LDA-style aspect extraction approaches (Titov and McDonald, 2008a; He et al., 2017; Mukherjee and Liu, 2012). These models set the number of topics to be discovered to a much larger number (∼15) than the actual aspects found in the data (∼5). This requires a many-to-one mapping between discovered topics and genuine aspects, which is performed manually.
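The encoder, classifier, reconstruction, and hinge-loss steps described above can be traced on toy inputs with the following minimal sketch. It is a plain-Python illustration under simplifying assumptions (no batching, no training loop, no vector normalization), and all function names are hypothetical rather than taken from the released code.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(mat, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, vec) for row in mat]

def encode_segment(word_vecs, M):
    """Attention-weighted average of the segment's word vectors."""
    n, d = len(word_vecs), len(word_vecs[0])
    v_avg = [sum(w[k] for w in word_vecs) / n for k in range(d)]  # simple average v'_s
    Mv = matvec(M, v_avg)
    c = softmax([dot(w, Mv) for w in word_vecs])  # attention weights c_i
    v_s = [sum(c[i] * word_vecs[i][k] for i in range(n)) for k in range(d)]
    return v_s, c

def aspect_probs(v_s, W, b):
    """Softmax classifier over K aspects."""
    return softmax([x + b_k for x, b_k in zip(matvec(W, v_s), b)])

def reconstruct(A, p):
    """Weighted sum of the K aspect embeddings (rows of A)."""
    return [sum(p[i] * A[i][k] for i in range(len(A))) for k in range(len(A[0]))]

def reconstruction_loss(r_s, v_s, negatives):
    """Hinge loss against sampled negative segment vectors."""
    return sum(max(0.0, 1.0 - dot(r_s, v_s) + dot(r_s, n)) for n in negatives)
```

In a real model the parameters M, W, b (and, for ABAE, the aspect matrix) would be learned by gradient descent on the summed loss; the sketch only evaluates one forward pass.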

Multi-Seed Aspect Extractor
Dynamic aspect extraction is advantageous since it assumes nothing more than a set of relevant reviews for a product and may discover unusual and interesting aspects (e.g., whether a plasma television has protective packaging). However, it suffers from the fact that the identified aspects are fine-grained, they have to be interpreted post-hoc, and manually mapped to coarse-grained ones.
We propose a new weakly-supervised set-up for aspect extraction which requires little human involvement. For every aspect $a_i \in A_C$, we assume there exists a small set of seed words $\{sw_j\}_{j=1}^{l}$ which are good descriptors of $a_i$. We can think of these seeds as query terms that someone would use to search for segments discussing $a_i$. They can be set manually by a domain expert or selected using a small number of aspect-annotated reviews. Figure 2 (top) depicts four television aspects (image, sound, connectivity and price) and three of their seeds in word embedding space. MATE replaces ABAE's aspect dictionary with multiple seed matrices $\{A_1, A_2, \ldots, A_K\}$. Every matrix $A_i \in \mathbb{R}^{l \times d}$ contains one row per seed word and holds the seeds' word embeddings, as illustrated by the set of [3 × 2] matrices in Figure 2. MATE still needs to produce an aspect matrix $A \in \mathbb{R}^{K \times d}$ in order to reconstruct the input segment's embedding. We accomplish this by reducing each seed matrix to a single aspect embedding with the help of seed weight vectors $z_i \in \mathbb{R}^l$ ($\sum_j z_{ij} = 1$), and concatenating the results, illustrated by the [4 × 2] aspect matrix in Figure 2:
$$a_i = A_i^\top z_i, \qquad A = [a_1^\top; \ldots; a_K^\top].$$
The segment is reconstructed as in Equation (5).
Weight vectors $z_i$ can be uniform (for manually selected seeds), fixed, learned during training, or set dynamically for each input segment, based on the cosine distance of its encoding to each seed embedding. Our experiments showed that fixed weights, selected through a technique described below, result in most stable performance across domains. We only focus on this variant due to space restrictions (but provide more details in the supplementary material).
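To make the reduction from seed matrices to a single aspect matrix concrete, the sketch below collapses each seed matrix into one aspect embedding as a weighted sum of its rows and stacks the results. It assumes fixed weight vectors that sum to one; the function names and toy values are illustrative only.

```python
def aspect_embedding(seed_matrix, z):
    """Collapse an l x d seed matrix into one d-dimensional aspect
    embedding, using seed weights z that sum to one."""
    if len(seed_matrix) != len(z):
        raise ValueError("one weight per seed word is required")
    if abs(sum(z) - 1.0) > 1e-9:
        raise ValueError("seed weights must sum to one")
    d = len(seed_matrix[0])
    return [sum(z[j] * seed_matrix[j][k] for j in range(len(z))) for k in range(d)]

def aspect_matrix(seed_matrices, weights):
    """Stack the K aspect embeddings into a K x d aspect matrix."""
    return [aspect_embedding(A_i, z_i) for A_i, z_i in zip(seed_matrices, weights)]

# Two toy aspects with two seed words each, in a 2-dimensional embedding space.
seeds = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 2.0], [0.0, 0.0]],
]
weights = [[0.75, 0.25], [0.5, 0.5]]
A = aspect_matrix(seeds, weights)
```

With uniform weights each aspect embedding is simply the centroid of its seed embeddings, which corresponds to the manually selected seed setting mentioned above.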
When a small number of aspect-annotated reviews are available, seeds and their fixed seed weights can be selected automatically. To obtain a ranked list of terms that are most characteristic for each aspect, we use a variant of the clarity scoring function which was first introduced in information retrieval (Cronen-Townsend et al., 2002). Clarity measures how much more likely it is to observe word $w$ in the subset of segments that discuss aspect $a$, compared to the corpus as a whole:
$$score_a(w) = t_a(w) \log_2 \frac{t_a(w)}{t(w)}, \tag{9}$$
where $t_a(w)$ and $t(w)$ are the $l_1$-normalized tf-idf scores of $w$ in the segments annotated with aspect $a$ and in all annotated segments, respectively.
Higher scores indicate higher term importance, and truncating the ranked list of terms gives a fixed set of seed words, as well as their seed weights by normalizing the scores to add up to one. Table 1 shows the highest ranked terms obtained for every aspect in the televisions domain of our corpus (see Section 6 for a detailed description of our data).
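The seed selection procedure can be sketched as follows. The sketch assumes the clarity score takes the form t_a(w) · log2(t_a(w) / t(w)), consistent with the description above, and that the l1-normalized tf-idf scores have already been computed; keeping only positively scored terms before normalization is an additional assumption, and the toy numbers are made up.

```python
import math

def clarity_scores(t_a, t):
    """Clarity of each word for one aspect, given l1-normalized tf-idf
    scores in aspect-annotated segments (t_a) and in all segments (t)."""
    return {
        w: t_a[w] * math.log2(t_a[w] / t[w])
        for w in t_a
        if t_a[w] > 0 and t.get(w, 0) > 0
    }

def select_seeds(t_a, t, num_seeds):
    """Top-scoring words with weights normalized to sum to one."""
    scores = clarity_scores(t_a, t)
    top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:num_seeds]
    top = [(w, s) for w, s in top if s > 0]  # assumption: drop non-positive scores
    total = sum(s for _, s in top)
    return [(w, s / total) for w, s in top]

# Toy tf-idf distributions for a hypothetical "image" aspect.
t_a = {"picture": 0.5, "color": 0.25, "the": 0.25}
t = {"picture": 0.125, "color": 0.125, "the": 0.75}
seeds = select_seeds(t_a, t, num_seeds=2)
```

Words that are frequent everywhere (such as "the" in the toy data) receive negative scores and never become seeds, which is the intended effect of comparing against the whole corpus.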

Multi-Task Objective
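MATE (and ABAE) relies on the attention encoder to identify and attend to each segment's aspect-signalling words. The reconstruction objective only provides a weak training signal, so we devise a multi-task extension to enhance the encoder's effectiveness without additional annotations. We assume that aspect-relevant words not only provide a better basis for the model's aspect-based reconstruction, but are also good indicators of the product's domain. For example, the words colors and crisp, in the segment "The colors are perfectly crisp", should be sufficient to infer that the segment comes from a television review, whereas the words keys and type in the segment "The keys feel great to type on" are more representative of the keyboard domain. Additionally, all four words are characteristic of specific aspects.
Let $C_{all} = C_1 \cup C_2 \cup \ldots$ denote the union of multiple review corpora, where $C_1$ is considered in-domain and the rest are considered out-of-domain. We use $d_s \in \{d_{C_1}, d_{C_2}, \ldots\}$ to denote the true domain of segment $s$ and define a classifier that uses the vectors from our segment encoder as inputs:
$$p_s^{dom} = \mathrm{softmax}(W_C v_s + b_C), \tag{10}$$
where $p_s^{dom} = \langle p_s^{(d_{C_1})}, p_s^{(d_{C_2})}, \ldots \rangle$ is a probability distribution over product domains for segment $s$ and $W_C$ and $b_C$ are the classifier's weight and bias parameters. We use the negative log likelihood of the domain prediction as the objective function, combined with the reconstruction loss of Equation (5) to obtain a multi-task objective:
$$J_{MT}(\theta) = J_r(\theta) - \lambda \sum_{s \in C_{all}} \log p_s^{(d_s)}, \tag{11}$$
where $\lambda$ controls the influence of the classification loss. Note that the negative log-likelihood is summed over all segments in $C_{all}$, whereas $J_r(\theta)$ is only summed over the in-domain segments $s \in C_1$. It is important not to use the out-of-domain segments for segment reconstruction, as they will confuse the aspect extractor due to the aspect mismatch between different domains.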
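The way the two loss terms are combined can be illustrated with a short sketch. It assumes that per-segment reconstruction losses (for in-domain segments only) and the predicted probabilities of each segment's true domain (for all segments) are already available; the function name and inputs are hypothetical.

```python
import math

def multitask_loss(recon_losses_in_domain, true_domain_probs_all, lam):
    """Reconstruction loss over in-domain segments plus lam times the
    negative log-likelihood of the true domain over all segments."""
    j_r = sum(recon_losses_in_domain)
    nll = -sum(math.log(p) for p in true_domain_probs_all)
    return j_r + lam * nll

# Two in-domain segments contribute reconstruction losses; three segments
# (in-domain and out-of-domain) contribute to the domain classification term.
loss = multitask_loss([0.5, 0.25], [1.0, 0.5, 0.5], lam=10.0)
```

Keeping the two sums over different segment sets mirrors the requirement that out-of-domain segments inform the encoder through domain classification only, never through reconstruction.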

Opinion Summarization
We now move on to describe our opinion summarization framework which is based on the aspect extraction component discussed so far, a polarity prediction model, and a segment selection policy which identifies and discards redundant opinions.
[Table 2, partial row recovered from extraction: "4. The sound on this is horrendous.", with adjacent values [-]0.75 and [+]0.44.]
Given review $r$ consisting of segments $(s_1, \ldots, s_m)$, MILNET uses a CNN segment encoder to obtain segment vectors $(u_1, \ldots, u_m)$, each used as input to a segment-level sentiment classifier.
For every vector $u_i$, the classifier produces a sentiment prediction $p_i^{stm} = \langle p_i^{(1)}, \ldots, p_i^{(M)} \rangle$, where $p_i^{(1)}$ and $p_i^{(M)}$ are probabilities assigned to the most negative and most positive sentiment class respectively. Resulting segment predictions $(p_1^{stm}, \ldots, p_m^{stm})$ are combined via a GRU-based attention mechanism to produce a document-level prediction $p_r^{stm}$, and the model is trained end-to-end on the reviews' user ratings using negative log-likelihood.
The essential by-products of MILNET are segment-level sentiment predictions $p_i^{stm}$, which are transformed into polarities $pol_{s_i}$ by projecting them onto the $[-1, +1]$ range using a uniformly spaced sentiment class weight vector.
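The projection onto the polarity range can be sketched as a dot product between the class distribution and uniformly spaced class weights. The sketch assumes classes are ordered from most negative to most positive; the function names are illustrative.

```python
def class_weights(num_classes):
    """Uniformly spaced weights from -1 (most negative class) to +1
    (most positive class); requires at least two classes."""
    if num_classes < 2:
        raise ValueError("at least two sentiment classes are required")
    return [-1.0 + 2.0 * k / (num_classes - 1) for k in range(num_classes)]

def polarity(probs):
    """Project a sentiment distribution onto the [-1, +1] range."""
    return sum(p * w for p, w in zip(probs, class_weights(len(probs))))
```

For a five-class rating scale the weights are -1, -0.5, 0, 0.5 and 1, so a segment with all probability mass on the top class receives polarity +1, while a symmetric distribution receives polarity 0.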
Opinion Ranking Aspect predictions $p_s^{asp} = \langle p_s^{(a_1)}, \ldots, p_s^{(a_K)} \rangle$ and polarities $pol_s$ form the opinion set $O_e = \{(s, A_s, pol_s)\}_{s \in R_e}$ for every product $e \in E_C$. For simplicity, we set the predicted aspect-set $A_s$ to only include the aspect with the highest probability, although it is straightforward to allow for multiple aspects. We rank every opinion $o_s \in O_e$ according to its salience:
$$sal(o_s) = |pol_s| \cdot \left(\max_i p_s^{(a_i)} - p_s^{(GEN)}\right), \tag{12}$$
where the quantity in parentheses is the probability difference between the most probable aspect and the general aspect. The salience score will be high for opinions that are very positive or very negative and are also likely to discuss a non-general aspect.
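A minimal sketch of the ranking step follows, under the reading that salience is the absolute polarity multiplied by the gap between the most probable aspect and the general aspect. The tuple layout, function names, and example segments are assumptions made for illustration.

```python
def salience(polarity, aspect_probs, general_index):
    """Absolute polarity times the probability gap between the most
    probable aspect and the general aspect."""
    return abs(polarity) * (max(aspect_probs) - aspect_probs[general_index])

def rank_opinions(opinions, general_index):
    """Sort (segment, polarity, aspect_probs) tuples by decreasing salience."""
    return sorted(
        opinions,
        key=lambda o: salience(o[1], o[2], general_index),
        reverse=True,
    )

# Index 0 plays the role of the general aspect in this toy example.
opinions = [
    ("It arrived on Tuesday", 0.1, [0.8, 0.1, 0.1]),
    ("The sound on this is horrendous", -0.75, [0.1, 0.7, 0.2]),
]
ranked = rank_opinions(opinions, general_index=0)
```

When the general aspect is itself the most probable one, the gap is zero and the opinion receives zero salience regardless of its polarity.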
Opinion Selection The final step towards producing summaries is to discard potentially redundant opinions, something that is not taken into account by our salience scoring method. Table 2 shows a partial ranking of the most salient opinions found in the reviews for an LCD television. All segments provide useful information, but it is evident that segments 1 and 6 as well as 4 and 5 are paraphrases of the same opinions.
We follow previous work on multi-document summarization (Cao et al., 2015; Yasunaga et al., 2017) and use a greedy algorithm to eliminate redundancy. We start with the highest ranked opinion, and keep adding opinions to the final summary one by one, unless the cosine similarity between the candidate segment and any segment already included in the summary exceeds 0.5.
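The greedy selection can be sketched as below. The sketch takes the reading that a candidate is skipped when it is too similar (cosine similarity above the threshold) to a segment already in the summary. The segment vectors, the way the word budget is enforced (skipping candidates that do not fit), and all names are assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def greedy_select(ranked, threshold=0.5, word_budget=100):
    """ranked: list of (text, vector) pairs sorted by decreasing salience.
    Adds candidates in order, skipping any that is too similar to an
    already selected segment or that would exceed the word budget."""
    selected, words = [], 0
    for text, vec in ranked:
        length = len(text.split())
        if words + length > word_budget:
            continue  # assumption: skip and keep looking for shorter segments
        if any(cosine(vec, kept) > threshold for _, kept in selected):
            continue  # redundant with something already in the summary
        selected.append((text, vec))
        words += length
    return [text for text, _ in selected]

ranked = [
    ("The picture is beautiful", [1.0, 0.0]),
    ("Unbelievable picture", [0.9, 0.1]),
    ("The sound is good", [0.0, 1.0]),
]
summary = greedy_select(ranked)
```

Because candidates are visited in salience order, the most salient member of each group of paraphrases is the one that survives.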

The OPOSUM Dataset
We created OPOSUM, a new dataset for the training and evaluation of Opinion Summarization models which contains Amazon reviews from six product domains: Laptop Bags, Bluetooth Headsets, Boots, Keyboards, Televisions, and Vacuums. The six training collections were created by downsampling from the Amazon Product Dataset introduced in McAuley et al. (2015) and contain reviews and their respective ratings. The reviews were segmented into EDUs using a publicly available RST parser (Feng and Hirst, 2012).
To evaluate our methods and facilitate research, we produced a human-annotated subset of the dataset. For each domain, we uniformly sampled (across ratings) 10 different products with 10 reviews each, amounting to a total of 600 reviews, to be used only for development (300) and testing (300). We obtained EDU-level aspect annotations, salience labels and gold standard opinion summaries, as described below. Statistics are provided in Table 3 and in supplementary material.
Aspects For every domain, we pre-selected nine representative aspects, including the general aspect. We presented the EDU-segmented reviews to three annotators and asked them to select the aspects discussed in each segment (multiple aspects were allowed). Final labels were obtained using a majority vote among annotators. Inter-annotator agreement across domains and annotated segments using Cohen's Kappa coefficient was K = 0.61 (N = 8,175, k = 3).
Opinion Summaries We produced opinion summaries for the 60 products in our benchmark using a two-stage procedure. First, all reviews for a product were shown to three annotators. Each annotator read the reviews one-by-one and selected the subset of segments they thought best captured the most important and useful comments, without taking redundancy into account. This phase produced binary salience labels against which we can judge the ability of a system to identify important opinions. Again, using the Kappa coefficient, agreement among annotators was K = 0.51 (N = 8,175, k = 3). In the second stage, annotators were shown the salient segments they identified (for every product) and asked to create a final extractive summary by choosing opinions based on their popularity, fluency and clarity, while avoiding redundancy and staying under a budget of 100 words. We used ROUGE (Lin and Hovy, 2003) as a proxy to inter-annotator agreement. For every product, we treated one reference summary as system output and computed how it agrees with the rest. ROUGE scores are reported in Table 5 (last row).

Experiments
In this section, we discuss implementation details and present our experimental setup and results. We evaluate model performance on three subtasks: aspect identification, salient opinion extraction, and summary generation.
Implementation Details Reviews were lemmatized and stop words were removed. We initialized MATE using 200-dimensional word embeddings trained on each product domain using skip-gram (Mikolov et al., 2013) with default parameters. We used 30 seed words per aspect, obtained via Equation (9). Word embeddings $L$, seed matrices $\{A_i\}_{i=1}^{K}$ and seed weight vectors $\{z_i\}_{i=1}^{K}$ were fixed throughout training. We used the Adam optimizer (Kingma and Ba, 2014) with learning rate $10^{-4}$ and mini-batch size 50, and trained for 10 epochs. We used 20 negative examples per input for the reconstruction loss and, when used, the multi-tasking coefficient $\lambda$ was set to 10. Seed words and hyperparameters were selected on the development set and we report results on the test set, averaged over 5 runs.

Aspect Extraction
We trained aspect models on the collections of Table 3 and evaluated their predictions against the human-annotated portion of each corpus. Our MATE model and its multitask counterpart (MATE+MT) were compared against a majority baseline and two ABAE variants: vanilla ABAE, where aspect matrix $A$ is initialized using k-means centroids and fine-tuned during training; and ABAE$_{init}$, where rows of $A$ are fixed to the centroids of respective seed embeddings. This allows us to examine the benefits of our multi-seed aspect representation.
Opinion Salience We are also interested in our system's ability to identify salient opinions in reviews. The first phase of our opinion extraction annotation provides us with binary salience labels, which we use as gold standard to evaluate system opinion rankings. For every product $e$, we score each segment $s \in R_e$ using Equation (12) and evaluate the obtained rankings via Mean Average Precision (MAP) and Precision at the 5th retrieved segment (P@5). Polarity scores were produced via MILNET; we obtained aspect probabilities from ABAE$_{init}$, MATE, and MATE+MT. We also experimented with a variant that only uses MILNET's polarities and, additionally, with variants that ignore polarities and only use aspect probabilities.
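For readers unfamiliar with the two ranking metrics, the sketch below computes P@k and (mean) average precision from binary salience labels listed in system-ranked order. It uses the standard textbook definitions; how ties or products without any salient segment were handled in the reported experiments is not specified here, so returning 0.0 in the latter case is an assumption.

```python
def precision_at_k(ranked_labels, k):
    """Fraction of the top-k ranked segments that are labeled salient."""
    return sum(ranked_labels[:k]) / k

def average_precision(ranked_labels):
    """Mean of the precision values at each rank holding a salient segment."""
    hits, total = 0, 0.0
    for rank, is_salient in enumerate(ranked_labels, start=1):
        if is_salient:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0  # assumption: 0.0 when nothing is salient

def mean_average_precision(rankings):
    """Average of per-product average precision scores."""
    return sum(average_precision(r) for r in rankings) / len(rankings)

# Gold salience labels (1 = salient) in the order a system ranked the segments.
labels = [1, 0, 1, 0, 0]
p_at_5 = precision_at_k(labels, 5)
ap = average_precision(labels)
```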
Results are shown in Table 4 (bottom). The combined use of polarity and aspect information improves the retrieval of salient opinions across domains, as all model variants that use our salience formula of Equation (12) outperform the MILNET-only and aspect-only baselines. When comparing between aspect-based alternatives, we observe that the extraction accuracy correlates with the quality of aspect prediction. In particular, ranking using MILNET+MATE+MT gives best results, with a 2.6% increase in MAP against MILNET+MATE and 4.6% against MILNET+ABAE$_{init}$. The trend persists even when MILNET polarities are ignored, although the quality of rankings is worse in this case.
Opinion Summaries We now turn to the summarization task itself, where we compare our best performing model (MILNET+MATE+MT), with and without a redundancy filter (RD), against the following methods: a baseline that selects segments randomly; a Lead baseline that only selects the leading segments from each review; SumBasic, a generic frequency-based extractive summarizer (Nenkova and Vanderwende, 2005); LexRank, a generic graph-based extractive summarizer (Erkan and Radev, 2004); and Opinosis, a graph-based abstractive summarizer that is designed for opinion summarization (Ganesan et al., 2010). All extractive methods operate on the EDU level with a 100-word budget. For Opinosis, we tested an aspect-agnostic variant that takes every review segment for a product as input, and a variant that uses MATE's groupings of segments to produce and concatenate aspect-specific summaries. Table 5 presents ROUGE-1, ROUGE-2 and ROUGE-L F1 scores, averaged across domains. Our model (MILNET+MATE+MT) significantly outperforms all comparison systems (p < 0.05; paired bootstrap resampling; Koehn 2004), whilst using a redundancy filter slightly improves performance. Assisting Opinosis with aspect predictions is beneficial; however, it remains significantly inferior to our model (see the supplementary material for additional results).
Figure 3 (excerpt). Opinosis: The picture and not bright at all even compared to my 6-year old sony lcd tv. It will not work with an hdmi. Connection because of a conflict with comcast's dhcp. Being generous because I usuallly like the design and attention to detail of sony products). I am very disappointed with this tv for two reasons: picture brightness and channel menu. Numbers of options available in the on-line area of the tv are numerous and extremely useful. Wow look at the color, look at the sharpness of the picture, amazing and the amazing.
Figure 3 (excerpt). This work: Plenty of ports and settings and have been extremely happy with it. The sound is good and strong. The picture is beautiful. And the internet apps work as expected. And the price is even better. Unbelieveable picture and the setup is so easy. Wow look at the color, look at the sharpness of the picture. The Yahoo! widgets do not work. And avoid the Sony apps at all costs. Communication a bit difficult. :(

We also performed a large-scale user study. For every product in the OPOSUM test set, participants were asked to compare summaries produced by: a (randomly selected) human annotator, our best performing model (MILNET+MATE+MT+RD), Opinosis, and the Lead baseline. The study was conducted on the Crowdflower platform using Best-Worst Scaling (BWS; Louviere and Woodworth 1991; Louviere et al. 2015), a less labour-intensive alternative to paired comparisons that has been shown to produce more reliable results than rating scales (Kiritchenko and Mohammad, 2017). We arranged every 4-tuple of competing summaries into four triplets. Every triplet was shown to three crowdworkers, who were asked to decide which summary was best and which one was worst according to four criteria: Informativeness (How much useful information about the product does the summary provide?), Polarity (How well does the summary highlight positive and negative opinions?), Coherence (How coherent and easy to read is the summary?), and Redundancy (How successfully does the summary avoid redundant opinions?).
For every criterion, a system's score is computed as the percentage of times it was selected as best minus the percentage of times it was selected as worst (Orme, 2009). The scores range from -100 (unanimously worst) to +100 (unanimously best) and are shown in Table 6. Participants favored our model over comparison systems across all criteria (all differences are statistically significant at p < 0.05 using post-hoc Tukey HSD tests). Human summaries are generally preferred over our model; however, the difference is significant only in terms of coherence (p < 0.05).
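The Best-Worst Scaling score described above amounts to simple counting, as the following sketch shows. The function name and the counts in the example are illustrative only and are not taken from the study.

```python
def bws_score(times_best, times_worst, times_shown):
    """Percentage of judgments where the system was picked as best minus
    the percentage where it was picked as worst (range -100 to +100)."""
    if times_shown <= 0:
        raise ValueError("the system must appear in at least one judgment")
    return 100.0 * (times_best - times_worst) / times_shown

# A system judged 90 times, chosen as best 45 times and as worst 18 times.
score = bws_score(times_best=45, times_worst=18, times_shown=90)
```

A system that is always chosen as best scores +100, one that is always chosen as worst scores -100, and one chosen equally often as best and worst scores 0.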
Finally, Figure 3 shows example summaries for a product from our televisions domain, produced by one of our annotators and by 3 comparison systems (LexRank, Opinosis and our MILNET+MATE+MT+RD). The human summary is primarily focused on aspect-relevant opinions, a characteristic that is also captured to a large extent by our method. There is substantial overlap between extracted segments, although our redundancy filter fails to identify a few highly similar opinions (e.g., those relating to the picture quality). The LexRank summary is inferior as it only identifies a few useful opinions, and instead selects many general or non-opinionated comments. Lastly, the abstractive summary of Opinosis does a good job of capturing opinions about specific aspects but lacks in fluency, as it produces grammatical errors. For additional system outputs, see supplementary material.

Conclusions
We presented a weakly supervised neural framework for aspect-based opinion summarization. Our method combined a seeded aspect extractor that is trained under a multi-task objective without direct supervision, and a multiple instance learning sentiment predictor, to identify and extract useful comments in product reviews. We evaluated our weakly supervised models on a new opinion summarization corpus across three subtasks, namely aspect identification, salient opinion extraction, and summary generation. Our approach delivered significant improvements over strong baselines in each of the subtasks, while a large-scale judgment elicitation study showed that crowdworkers favor our summarizer over competitive extractive and abstractive systems.
In the future, we plan to develop a more integrated approach where aspects and sentiment orientation are jointly identified, and work with additional languages and domains. We would also like to develop methods for abstractive opinion summarization using weak supervision signals.

Figure 3: Human and system summaries for a product in the Televisions domain.

Table 1: Highest ranked words for the television corpus according to Equation (9).

Table 2: Most salient opinions according to scores from Equation (12) for an LCD TV.

Table 3: The OPOSUM corpus. Numbers in parentheses correspond to the human-annotated subset.

Table 4: Experimental results for the identification of aspect segments (top) and the retrieval of salient segments (bottom) on OPOSUM's six product domains and overall (AVG).

Table 6: Best-Worst Scaling human evaluation.

Figure 3 (excerpt). Product domain: Televisions. Product name: Sony BRAVIA 46-Inch HDTV.
Human: Plenty of ports and settings. Easy hookups to audio and satellite sources. The sound is good and strong. This TV looks very good. and the price is even better. The on-screen menu/options is quite nice. and the internet apps work as expected. The picture is clear and sharp. which is TOO SLOW to stream HD video... The software and apps built into this TV. are difficult to use and setup. Their service is handled off shore making. communication a bit difficult. :(
LexRank: Get a Roku or Netflix box. I watch cable, Netflix, Hulu Plus, YouTube videos and computer movie files on it. Sound is good much better. DO NOT BUY! this SONY Bravia 'Smart' TV... and avoid the Sony apps at all costs. Because of these two issues, I returned the Sony TV. Also you can change the display and sound settings on each port. However, the streaming speed for netflix is just down right terrible. Most of the time I just quit. Since I do not own the cable box, So, I have the cable.