Towards Opinion Summarization of Customer Reviews

In recent years, the volume of available text has grown rapidly. For example, most review-based portals, like Yelp or Amazon, contain thousands of user-generated reviews. It is impossible for any human reader to process even the most relevant of these documents. The most promising tool for this task is text summarization. Most existing approaches, however, work on small, homogeneous, English datasets, and do not account for multilinguality, opinion shift, and domain effects. In this paper, we introduce our research plan to use neural networks on user-generated travel reviews to generate summaries that take into account shifting opinions over time. We outline future directions in summarization to address all of these issues. By resolving the existing problems, we will make it easier for users of review sites to make more informed decisions.


Introduction
In recent years, the amount of available text has been growing rapidly with the increasing popularity of the web. Users produce a huge amount of text every day. As the volume of text and of the information it contains grows, it becomes impossible for people to read all of it, which leads to information overload. An ordinary reader cannot go through all the available text, even when restricted to the most relevant documents. The task of text summarization has been studied for a long time. In the late 1950s, Luhn (Luhn, 1958) attempted to create abstracts of documents automatically. Over the decades, many summarization systems have been developed for different forms of summarization. Summarization remains one of the most challenging tasks in natural language processing. It can be particularly important for decision making or relevance judgments (Nenkova and McKeown, 2011).
Automatic text summarization has become a useful and important tool that helps users obtain as much information as possible without having to read all the original documents. Many definitions of text summarization exist. A text summary can be defined as a text produced from one or more texts that contains the same information as the original text and is no longer than half of the original text (Hovy and Lin, 1998). Mani (Mani, 2001) defined summarization as a process of finding the source of information, extracting content from it, and presenting the most important content to the user in a concise form and in a manner sensitive to the needs of the user's application.
Summarization techniques can be divided into two categories: abstractive and extractive (Gambhir and Gupta, 2017). Extractive summarization selects parts of the original document, such as parts of sentences, whole sentences, or paragraphs. Abstractive summarization paraphrases the content of the original document while aiming for a cohesive and concise output summary. Selecting text sections in extractive summarization leads to a partial loss of cohesion in the output, which abstractive summarization tries to avoid.
In the last few years, approaches based on neural networks have become very popular for the summarization task (Rush et al., 2015). A specific branch of text summarization is the summarization of opinions from human-generated text. We can summarize opinions from customer reviews or from comments on social networks. This problem differs from the standard summarization task because of the large amount of repetitive and redundant information. The polarity of opinions can also differ between users. This type of summary can be very useful both for customers and for product owners. Opinion summarization can be particularly important for decision making (Yuan et al., 2015). It can also show trends in opinions collected from comments on social networks, especially when the number of text entries grows very fast.
In this work, we also discuss a specific aspect of opinion summarization: sentiment analysis of customer reviews. In the summarization task, sentiment information can be viewed as one of the inputs, along with the text corpus itself. The difference between the sentiment of individual text fragments and the sentiment of the whole summary is a particularly interesting aspect to consider.
The expected contributions of our research are: (1) an overview of recent developments in opinion summarization, (2) the assembly of a reasonably large dataset for opinion summarization (from travel portals), and (3) a novel method for opinion summarization based on state-of-the-art neural network architectures.
The rest of this paper is structured as follows. Section 2 reviews recent advances in text summarization. In Section 3, we focus on opinion mining and the summarization of opinions. Future directions in summarization are outlined in Section 4. The research proposal, including a possible dataset and planned experiments, is described in Section 5. Section 6 presents final observations and conclusions.

Text Summarization
In recent years, research on text summarization has focused on abstractive summarization using neural models. Rush et al. (Rush et al., 2015) showed how to use a neural network based on the encoder-decoder architecture to create abstractive summaries at the sentence level. This type of model originates from machine translation, where it had been used before. The approach presented in (Chopra et al., 2016) can be considered a successor of this work. Instead of a feed-forward neural network, a recurrent neural network (RNN) was used, which takes the order of the input words into account. The authors presented a conditional RNN with a convolutional attention-based encoder.
Ferreira et al. presented a sentence clustering algorithm to deal with the problems of redundancy and information diversity (Ferreira et al., 2014). The algorithm converts the input text into a graph model that captures four types of relations between sentences.
A specific, though not widely used, technique is Abstract Meaning Representation (AMR). For text summarization, a framework for abstractive summarization based on the recent development of a treebank for AMR can be employed. This framework parses the source text into a set of AMR graphs, transforms these graphs into a summary graph, and then generates the output summary from the summary graph.
Neural architectures from machine translation have become widely popular, and many authors have conducted research in this area. Nallapati et al. (Nallapati et al., 2016) presented a neural model for abstractive summarization and introduced a new dataset for the evaluation of summarization.
Attention mechanisms have become widespread in neural networks, and many works have shown their usefulness in other tasks. For summarization, See et al. (See et al., 2017) introduced an encoder-decoder method that computes an attention distribution over the input text. They used a hybrid pointer-generator architecture with coverage. The pointer mechanism decides, for each output word, whether to copy a word from the source text or to generate a new one from the vocabulary. The coverage component reduces repetition in the later parts of the generated output. An interesting modification was introduced by Paulus et al. (Paulus et al., 2017). Their model modifies the standard attention mechanism and also the objective function, which combines the maximum-likelihood cross-entropy loss with a reinforcement learning objective. Tan et al. (Tan et al., 2017) proposed another modification of the attention mechanism and used their graph-based attention in a sequence-to-sequence framework. The encoder maps the input documents to a vector representation, and the decoder then generates the output sentences. The novelty of their method lies in using a graph-based attention mechanism in a hierarchical encoder-decoder framework.
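To make the pointer-generator idea more concrete, the following minimal sketch mixes a vocabulary distribution with a copy distribution given by the attention weights, and computes a coverage penalty that grows when the decoder attends to the same source position repeatedly. It follows the general formulation of See et al. (2017) only in outline; the function names, the dictionary-based representation, and the toy numbers are our own assumptions, and a real model would compute these quantities with learned tensors.

```python
from collections import defaultdict


def final_distribution(p_gen, vocab_dist, attention, source_tokens):
    """Mix generation and copy probabilities for one decoder step.

    p_gen         -- probability of generating from the vocabulary (0..1)
    vocab_dist    -- dict word -> probability under the decoder vocabulary
    attention     -- attention weight for each source position
    source_tokens -- source words aligned with `attention`
    """
    mixed = defaultdict(float)
    for word, prob in vocab_dist.items():
        mixed[word] += p_gen * prob
    # Copy part: attention mass of every position holding the word.
    for token, weight in zip(source_tokens, attention):
        mixed[token] += (1.0 - p_gen) * weight
    return dict(mixed)


def coverage_loss(attention_steps):
    """Sum over decoder steps of sum_i min(attention_i, coverage_i),
    where coverage is the attention accumulated over previous steps."""
    coverage = [0.0] * len(attention_steps[0])
    loss = 0.0
    for attention in attention_steps:
        loss += sum(min(a, c) for a, c in zip(attention, coverage))
        coverage = [c + a for c, a in zip(coverage, attention)]
    return loss


# Toy example: "ritz" is out of vocabulary but can still be copied.
mixed = final_distribution(
    p_gen=0.5,
    vocab_dist={"the": 0.5, "hotel": 0.5},
    attention=[0.25, 0.75],
    source_tokens=["hotel", "ritz"],
)
print(mixed)
```

In this toy setting the out-of-vocabulary source word receives probability mass only through the copy term, which is the behaviour the pointer mechanism is meant to provide.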
Neural models are widely used for both abstractive and extractive summarization. Nallapati et al. (Nallapati et al., 2017) presented a neural sequential model for the extractive summarization of documents. Another contribution of their paper is a visualization of the impact of particular parts of the input text on the output summary.
As in other natural language processing tasks, convolutional neural networks can be used for summarization. Yasunaga et al. (Yasunaga et al., 2017) incorporate sentence relations using a Graph Convolutional Network on relation graphs, with sentence embeddings obtained from an RNN as input node features. This system tries to exploit both the representational power of neural networks and the sentence relation information that can be encoded in the graph representation of document clusters.
Much research has been conducted in this field in recent years. Another interesting modification is a sequence-to-sequence encoder-decoder model that incorporates a latent structure modeling component (Li et al., 2017). This model generates abstractive summaries based on both the latent variables and the discriminative deterministic states.
All the aforementioned works were primarily aimed at the summarization of news articles. There are also other types of summarization, such as summarization of emails (Carenini et al., 2008; Yousefi-Azar and Hamey, 2017), event-based summarization (Glavaš and Šnajder, 2014; Kedzie et al., 2015), personalized summarization (Díaz and Gervás, 2007; Moro and Bielikova, 2012), and sentiment-based or opinion summarization, which is described in the next section.

Opinion Mining and Summarization
Summarization of opinions is a special type of summarization. Reviews of products and services, as well as comments on social networks, can consist of hundreds of entries and can lead to information overload. Repetition of opinions is one of the major differences from the summarization of news. User-generated text also often differs markedly from news text, which is usually carefully edited.

Summarization of Customer Opinions
Summarization of opinions from product reviews is the most common example of opinion summarization. These reviews often come from online stores such as Amazon. Yuan et al. (Yuan et al., 2015) presented a user study of how opinion summarization can help consumers make decisions before a purchase.
The work of Hu and Liu (Hu and Liu, 2004) can be considered one of the first in opinion summarization. They proposed a set of techniques for mining and summarizing product reviews. The main goal of their opinion summarization system is to provide a feature-based summary.
Tadano et al. proposed a method based on evaluative sentence extraction, in which aspects are judged by their ratings, their tf-idf values, and the number of mentions with a similar topic (Tadano et al., 2010).
A summarization approach based on topical structure was introduced by Zhan et al. (Zhan et al., 2009). They presented the topical structure as a list of significant topics drawn from a document set. To reduce the redundancy of sentences, they applied maximal marginal relevance.
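As an illustration of how maximal marginal relevance (MMR) trades relevance against redundancy, the sketch below greedily selects sentences that are similar to a query but dissimilar to the sentences already chosen. It is not the implementation used in any of the cited works: the word-overlap (Jaccard) similarity, the parameter name `lam`, and the example sentences are assumptions made for this example, and practical systems typically use tf-idf or embedding-based similarity.

```python
def jaccard(a, b):
    """Word-overlap similarity between two sentences (an assumed,
    deliberately simple stand-in for tf-idf or embedding similarity)."""
    a_set, b_set = set(a.lower().split()), set(b.lower().split())
    union = a_set | b_set
    return len(a_set & b_set) / len(union) if union else 0.0


def mmr_select(sentences, query, k, lam=0.5):
    """Greedily pick up to k sentences by maximal marginal relevance.

    score(s) = lam * sim(s, query) - (1 - lam) * max sim(s, selected)
    """
    selected = []
    candidates = list(sentences)
    while candidates and len(selected) < k:
        def score(sentence):
            redundancy = max(
                (jaccard(sentence, chosen) for chosen in selected),
                default=0.0,
            )
            return lam * jaccard(sentence, query) - (1.0 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


reviews = [
    "the room was clean",
    "the room was clean and quiet",
    "breakfast was cold",
]
# The near-duplicate second sentence is skipped in favour of new content.
print(mmr_select(reviews, query="room clean breakfast", k=2))
```

With `lam` close to 1 the selection degenerates to pure relevance ranking, while smaller values penalize near-duplicate opinions more strongly, which matters for highly redundant review collections.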
The Opinosis project presented a graph-based summarization framework (Ganesan et al., 2010). This framework generates abstractive summaries of highly redundant opinions. The authors showed that their summaries agree better with human summaries than those of the baseline extractive methods.
Dalal and Zaveri applied a multi-step approach to automatic opinion mining (Dalal and Zaveri, 2013). The authors showed that this multi-step, feature-based, semi-supervised opinion mining approach can successfully identify opinionated sentences in user reviews.
Aspect-based sentiment analysis can help to produce a structured summary based on positive and negative opinions about the features of a product (Kansal and Toshniwal, 2014). Their system takes into consideration not only the information in a sentence, but also contextual information, i.e., pieces of information from other sentences or reviews. The authors also showed that the polarity of a word can differ even within one domain.
Kurian and Asokan presented a method that combines cross-domain sentiment classification with the distributional similarity of opinion words (Kurian and Asokan, 2015). This method helps to classify and summarize product reviews and, in contrast with other methods, it does not require labeled data from the target domain or other lexical resources.
Unlike other opinion summarization systems, which deal with sentiment polarity, another study formulated opinion summarization as a community leader detection problem (Zhu et al., 2015). The authors proposed a graph-based method to identify informative sentences, together with algorithms for detecting leaders in the sentence graph, and evaluated the method on product reviews.
A system named Gist (Lovinger et al., 2017) is intended to deal with a large amount of text and to summarize it automatically into informative and actionable key sentences. Gist condenses the original reviews into a short text consisting of a few key sentences that capture the overall sentiment about the product.
Another kind of opinion summarization is the summarization of travel reviews, which gives feedback on the quality of hotels, restaurants, or other services. A clustering-based method for the summarization of hotel reviews was proposed by Hu et al. (Hu et al., 2017). They also showed that additional information, such as the author's reputation or the creation date, can have a large impact on the creation of a relevant summary. Raut and Londhe (Raut and Londhe, 2014) presented a method based on machine learning and SentiWordNet for mining opinions from hotel reviews, as well as a method for scoring sentence relevance.

Summarization of Community Answers
Besides the summarization of customer reviews, another important type of summarization takes comments on social networks or answers in question answering (QA) systems as its input. Summarizing these forms of text can also make decision making easier.
A framework based on submodular functions was presented by Wang et al. (Wang et al., 2014). This framework can be used for query-focused opinion summarization. Statistically learned sentence relevance and information coverage with respect to diverse topics are encoded as submodular functions. The authors evaluated the framework on QA and blog datasets.
The work of Lloret et al. (Lloret et al., 2015) deals with the summarization of opinions both in social networks and in product reviews. Their method integrates sentence simplification, sentence regeneration, and an internal concept representation into the summarization task, with the aim of generating abstractive summaries.
Several other topics in this area lead to interesting observations. Guo et al. (Guo et al., 2015) proposed a model for summarizing highly contrastive opinions, particularly on controversial issues. They integrated expert opinions with ordinary opinions to create an output of contrastive sentence pairs. The study also presented this method as a unified way for users to better summarize opinions concerning controversial issues.
Another study explores opinion summarization of spontaneous conversations (Wang and Liu, 2015). A phone conversation corpus was annotated in this study, and the authors investigated two methods of extractive summarization: a graph-based method incorporating topic and sentiment information, and a supervised method that casts the problem as classification.
A study by Li et al. deals with opinion summarization in blogs (Li et al., 2016). Building on recent deep-learning research, they proposed a convolutional neural network for opinion summarization. Maximal marginal relevance is used to extract representative opinion sentences.
The important problem of the volume and volatility of opinionated data published in social media was discussed by Tsirakis et al. (Tsirakis et al., 2016). They argued that most methods are quite effective on a small volume of data but usually do not scale up.

Future Directions
As presented in the previous sections, many challenges remain in summarization. Standard text summarization, usually applied to news articles, still struggles with abstractive summarization. Summarization techniques operate at different levels of abstraction and are usually not fully abstractive (See et al., 2017).
Another big challenge lies in the summarization of text in languages other than English. Most methods have been evaluated only on English, and it would be interesting to evaluate them on other languages and take their specifics into account. A further aspect is summarization over multiple languages, where the input text does not need to be in a single language.
Summarization of user-generated content has to deal with noisy and ungrammatical documents, but also with the very diverse and conflicting opinions contained in these documents (Murray et al., 2017). This problem is even more pronounced in minor languages, for which opinion mining and sentiment analysis are not well developed (Krchnavy and Simko, 2017).
In opinion summarization with thousands of input entries, a researcher should deal with changes of opinion over time. When summarizing customer reviews of services like hotels or restaurants, changes in quality should be considered. This can lead to a specific time-based summary that reflects the development of opinions over time. The problem is also relevant to the summarization of text in social networks, especially for controversial topics, where opinions reflect the current mood in society or can be affected by community leaders. The lack of large available datasets for this task is another crucial subject for research in the coming years, since most research has been evaluated only on small ones.
A significant problem arises in the evaluation phase. Automatic evaluation can be quite controversial, as there is usually more than one correct summary. Automatic evaluation measures like ROUGE and its modifications (Lin, 2004) can partially deal with this problem using n-grams, but they still do not handle synonyms. The existence of multiple valid formulations of the ground truth causes problems for human evaluation as well. Experiments with several human participants have to deal with the agreement between them, which can be quite low.
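The synonym problem can be seen in a minimal sketch of n-gram recall in the spirit of ROUGE-N. This is a simplified illustration rather than the official ROUGE toolkit: it uses a single reference, whitespace tokenization, and no stemming, and the function names and example sentences are our own assumptions.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also occur in the candidate,
    with counts clipped so a repeated n-gram is not rewarded twice."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


reference = "the hotel was great"
print(rouge_n_recall("the hotel was great", reference))      # 1.0
# A synonym is scored as a miss although the meaning is preserved.
print(rouge_n_recall("the hotel was excellent", reference))  # 0.75
```

Both candidates express the same opinion, yet the second one is penalized for using a synonym, which illustrates why n-gram overlap alone is an incomplete measure of summary quality.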
In the next few years, we expect opinion summarization to address domain specifics and user satisfaction. The future growth of user-generated content is likely to increase the focus on reducing information overload and on text summarization itself.

Research Proposal
As described earlier, we would like to focus on a specific type of summarization: the creation of opinion summaries. Nowadays, travel sites include thousands of reviews from users who have visited the reviewed places. These reviews are very important for the decision making of prospective customers, but also for the owners of the services. With tens of new reviews every day, it is impossible to read all of them, and it is often very difficult to choose only the relevant ones. Owners, in turn, cannot manually read all the reviews that could help them improve their services.
Recent advances in neural networks and in text summarization have shown that the encoder-decoder architecture can be very useful. However, the summarization of customer reviews differs from standard single-document summarization, to which these models have been applied so far.
For this task, we should therefore consider a framework composed of multiple modules.
The main idea of this proposal is to improve user satisfaction with review summaries and to examine the effect of time on opinion summarization. Opinion summarization should proceed in three phases: (1) aspect detection, (2) clustering of opinionated features, and (3) sentence generation.
The first step of our proposal is the detection of aspects. Before creating any summary, we need to identify the aspects discussed in the reviews. A further mechanism is needed to determine the similarity of aspects. A taxonomy of aspects could be a very useful tool for avoiding the separation of similar aspects, but other approaches based on distributional vector-space representations should also be examined. We plan to use a bidirectional LSTM neural network with a convolutional attention mechanism to identify aspects within the text and to determine the polarity of each aspect.
In the second step, we have to cluster opinionated sentences by the aspects they discuss. In each cluster, the sentiment and polarity of the opinions need to be determined. Whereas sentiment analysis typically provides only the overall or average sentiment, a summary should include opinions of every polarity. To reach this goal, all opinionated sentences and opinions have to be identified.
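The difference between reporting an average sentiment and keeping every polarity can be sketched as follows. The example assumes that the first step has already produced (aspect, polarity, sentence) triples; this input format, the function name, and the sample data are hypothetical, and the sketch groups by exact aspect label only, leaving out the merging of similar aspects through a taxonomy or vector-space similarity.

```python
from collections import Counter, defaultdict


def group_by_aspect(opinions):
    """Group opinionated sentences by aspect and keep the full polarity
    distribution of each group instead of a single average score.

    opinions -- iterable of (aspect, polarity, sentence) triples,
                an assumed output format of the aspect detection step
    """
    clusters = defaultdict(lambda: {"polarity": Counter(), "sentences": []})
    for aspect, polarity, sentence in opinions:
        clusters[aspect]["polarity"][polarity] += 1
        clusters[aspect]["sentences"].append(sentence)
    return dict(clusters)


opinions = [
    ("room", "positive", "The room was spotless."),
    ("room", "positive", "Our room had a great view."),
    ("room", "negative", "The room was far too small."),
    ("staff", "positive", "The staff were very helpful."),
]
for aspect, cluster in group_by_aspect(opinions).items():
    print(aspect, dict(cluster["polarity"]))
```

Keeping the counts per polarity means that a minority negative opinion about an aspect remains visible to the sentence generation step instead of being averaged away.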
In the final phase, we would like to employ a neural network architecture to generate output sentences from the collected aspect-oriented information. The sentences of the output summary need to be generated from the clustered aspects and to reflect the polarity of these aspects. We plan to use an LSTM network for this stage, generating sentences from the extracted aspects and their polarity.
Another important point in summarization has not been discussed yet. The time horizon is an often neglected feature that should be considered in the opinion summarization of customer reviews, as customers' opinions can develop over time in both positive and negative ways. We plan to include the creation time, along with other information about the reviewer, in the process of clustering opinions, which may lead to better summarization accuracy as well as higher user satisfaction with the output summary.
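One possible way to take the creation time into account is to weight each review by its age, so that recent opinions count more than old ones. The sketch below uses exponential decay with a half-life; this weighting scheme, the 180-day default, and the numeric polarity encoding are purely illustrative assumptions and not a design that has been fixed or evaluated in this proposal.

```python
from datetime import date, timedelta


def recency_weight(review_date, today, half_life_days=180.0):
    """Weight that halves every `half_life_days` (assumed decay model)."""
    age_days = max((today - review_date).days, 0)
    return 0.5 ** (age_days / half_life_days)


def weighted_polarity(reviews, today, half_life_days=180.0):
    """Recency-weighted mean polarity.

    reviews -- iterable of (creation_date, polarity), polarity in {-1, +1}
    """
    total = 0.0
    weight_sum = 0.0
    for review_date, polarity in reviews:
        weight = recency_weight(review_date, today, half_life_days)
        total += weight * polarity
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


today = date(2018, 1, 1)
reviews = [
    (today - timedelta(days=360), -1),  # old complaint
    (today, +1),                        # recent praise
]
# The unweighted mean would be 0.0; the recent review dominates here.
print(weighted_polarity(reviews, today))
```

An unweighted average would treat the old complaint and the recent praise equally, whereas the decayed score leans towards the current state of the service.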
Another significant point to discuss is the use of end-to-end deep learning for opinion summarization. The major problem is the lack of a large dataset, which is necessary for such learning. Creating this kind of dataset is expensive and could take hundreds of hours if performed manually.

Dataset
To create an appropriate dataset, we plan to gather customer reviews from large travel portals (e.g., TripAdvisor, Booking.com). Reviews come with other useful information, such as rating scores, which should be included too. Any public information about the reviewers could also be very useful, as it indicates the relevance and importance of a reviewer.

Experiments
To evaluate the quality of the generated summaries, several experiments are required. We will have to create our own ground truth, i.e., reference summaries, to evaluate summary quality automatically. As mentioned before, automatic evaluation alone is not sufficient, and other experiments including human evaluation will be needed. Because this type of evaluation is time-consuming and resource-intensive, a posteriori evaluation is a more feasible way to assess the quality of generated summaries. A particularly interesting view of opinion summaries is the comparison of the sentiment of the generated summaries with the sentiment of the original input reviews. In the human evaluation, we would like to present users with a list of original reviews and the generated summary and ask about their satisfaction. We also plan to use automatic measures like ROUGE (Lin, 2004) to compare generated summaries with summaries created by humans. Another important measure is aspect coverage, i.e., the ratio of aspects from the original reviews that are included in the generated summary.
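The aspect coverage measure can be written down directly as a ratio of sets. The sketch below is one straightforward reading of the measure and assumes that aspects have already been extracted from both the reviews and the summary as normalized labels; the function name and the example labels are hypothetical.

```python
def aspect_coverage(review_aspects, summary_aspects):
    """Share of distinct aspects found in the original reviews that
    also appear in the generated summary (0.0 when no aspect exists)."""
    source = set(review_aspects)
    if not source:
        return 0.0
    return len(source & set(summary_aspects)) / len(source)


review_aspects = ["room", "staff", "location", "price", "room"]
summary_aspects = ["room", "location"]
print(aspect_coverage(review_aspects, summary_aspects))  # 0.5
```

Aspects that occur only in the summary do not increase the score, so the measure rewards coverage of what reviewers actually discussed and not the sheer number of aspects mentioned.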

Conclusion
In this paper, we described the background of the summarization task. More importantly, we described recent contributions and developments in this area, along with many of the problems that current research deals with. We emphasized the main problems and future research directions in summarization in general and in opinion summarization in particular. We also introduced our future research intentions, along with the design of the first experiments and a possible model and dataset. We showed that summarization, and especially opinion summarization, still has major open issues that will be researched in the next few years.