Unsupervised Neural Single-Document Summarization of Reviews via Learning Latent Discourse Structure and its Ranking

This paper focuses on the end-to-end abstractive summarization of a single product review without supervision. We assume that a review can be described as a discourse tree, in which the summary is the root, and the child sentences explain their parent in detail. By recursively estimating a parent from its children, our model learns the latent discourse tree without an external parser and generates a concise summary. We also introduce an architecture that ranks the importance of each sentence on the tree to support summary generation focusing on the main review point. The experimental results demonstrate that our model is competitive with or outperforms other unsupervised approaches. In particular, for relatively long reviews, it achieves a competitive or better performance than supervised models. The induced tree shows that the child sentences provide additional information about their parent, and the generated summary abstracts the entire review.


Introduction
The need for automatic document summarization is growing with the ever-increasing volume of online textual data. For product reviews on e-commerce websites, succinct summaries allow both customers and manufacturers to obtain large numbers of opinions (Liu and Zhang, 2012). Under these circumstances, supervised neural network models have achieved wide success, using a large number of reference summaries (Wang and Ling, 2016; Ma et al., 2018). However, a model trained on these summaries cannot be adopted in other domains, as salient phrases are not common across domains, and preparing large volumes of references for each domain is costly (Isonuma et al., 2017). An unsupervised approach is a possible solution to this problem. Previously, unsupervised learning has been widely applied to extractive approaches (Radev et al., 2004; Mihalcea and Tarau, 2004). As mentioned by Carenini et al. (2013) and Gerani et al. (2014), extractive approaches often fail to provide an overview of the reviews, while abstractive ones successfully condense an entire review via paraphrasing and generalization. Our work focuses on the one-sentence abstractive summarization of a single review without supervision.
The difficulties of unsupervised abstractive summarization are two-fold: obtaining the representation of the summaries, and learning a language model to decode them. As an unsupervised approach for multiple reviews, Chu and Liu (2018) regarded the mean of the document embeddings as the summary, while learning a language model via the reconstruction of each review. However, such an approach cannot be directly extended to a single review, because averaging also condenses trivial or redundant sentences into the summary (its performance is demonstrated in Section 4.4).
To overcome these problems, we apply the discourse tree framework. Extractive summarization and document classification techniques sometimes use a discourse parser to gain a concise representation of documents (Hirao et al., 2013; Bhatia et al., 2015; Ji and Smith, 2017); however, Ji and Smith (2017) pointed out the limitations of using external discourse parsers. In this context, Liu and Lapata (2018) proposed a framework to induce a latent discourse tree without a parser. While their model constructed the tree via a supervised document classification task, our model induces it by identifying and reconstructing a parent sentence from its children. Consequently, we gain the representation of a summary as the root of the induced latent discourse tree, while learning a language model through reconstruction.

Figure 1: Example of the discourse tree of a jigsaw puzzle review. StrSum induces the latent tree and generates the summary from the children of a root, while DiscourseRank supports it to focus on the main review point.
Summary: Good quality floor puzzle
Body:
(1) This floor puzzle is a nice size not huge but larger than normal kid puzzles
(2) The pieces are thick and lock together well even on carpet
(3) The pieces are cardboard but are very dense almost like wood but not quite that solid
(4) I bought this puzzle for my son for his first birthday at the store …
(5) My son put it together on berber carpet without having any issues with pieces not staying together

Figure 1 shows an example of a jigsaw puzzle review and its dependency-based discourse tree. The summary describes its quality. The child sentences provide an explanation in terms of the size (1st) and thickness (2nd), or provide the background (4th). Thus, we assume that reviews can generally be described as a multi-root non-projective discourse tree, in which the summary is the root and each sentence constitutes a node. The child sentences present additional information about the parent sentence.
To construct the tree and generate the summary, we propose a novel architecture, StrSum. It reconstructs a parent from its children recursively and induces a latent discourse tree without a parser. As a result, our model generates a summary from the surrounding sentences of the root while learning a language model through reconstruction in an end-to-end manner. We also introduce DiscourseRank, which ranks the importance of each sentence in terms of the number of descendants. It supports StrSum in generating a summary that focuses on the main review point.
The contributions of this work are three-fold:
• We propose a novel unsupervised end-to-end model to generate an abstractive summary of a single product review while inducing a latent discourse tree.
• The experimental results demonstrate that our model is competitive with or outperforms other unsupervised models. In particular, for long reviews, it achieves a competitive or better performance than the supervised models.
• The induced tree shows that the child sentences present additional information about their parent, and the generated summary abstracts the entire review.

Proposed Model
In this section, we present our unsupervised end-to-end summarization model with descriptions of StrSum and DiscourseRank.

StrSum: Structured Summarization
Model Training: The outline of StrSum is presented in Figure 2. $y_i$ and $s_i \in \mathbb{R}^{d}$ indicate the $i$-th sentence and its embedding in a document $D = \{y_1, \ldots, y_n\}$, respectively. $w_i^t$ is the $t$-th word in a sentence $y_i = \{w_i^1, \ldots, w_i^l\}$. $s_i$ is computed via a max-pooling operation across hidden states $h_i^t \in \mathbb{R}^{d}$ of the Bi-directional Gated Recurrent Units (Bi-GRU):
Here, we assume that a document $D$ and its summary compose a discourse tree, in which the root is the summary, and all sentences are the nodes. We denote $a_{ij}$ as the marginal probability of dependency where the $i$-th sentence is the parent node of the $j$-th sentence. In particular, $a_{0j}$ denotes the probability that the root node is the parent (see Figure 2). We define the probability distribution $a_{ij}$ ($i \in \{0, \ldots, n\}$, $j \in \{1, \ldots, n\}$) as the posterior marginal distribution of a non-projective dependency tree. The calculation of the marginal probability is explained later.
Similar to Liu and Lapata (2018), to avoid overloading the sentence embeddings, we decompose them into two parts: where the semantic vector $s_i^e \in \mathbb{R}^{d_e}$ encodes the semantic information, and the structure vector $s_i^f \in \mathbb{R}^{d_f}$ is used to calculate the marginal probability of dependencies.
The embedding of the parent sentence $\hat{s}_i$ and that of the summary $\hat{s}_0$ are defined with parameters $W_s \in \mathbb{R}^{d_e \times d_e}$ and $b_s \in \mathbb{R}^{d_e}$ as:
Using $\hat{s}_i$, the GRU-decoder learns to reconstruct the $i$-th sentence, i.e., to obtain the parameters $\theta$ that maximize the following log likelihood:
Summary Generation: An explanation of how the training contributes to the learning of a language model and the gaining of the summary embedding is provided here. As for the former, the decoder learns a language model to generate grammatical sentences by reconstructing the document sentences. Therefore, the model can appropriately decode the summary embedding to $\hat{y}_0$.
As for the latter, if the $j$-th sentence contributes to generating the $i$-th one, $a_{ij}$ becomes higher. This mechanism models our assumption that child sentences can generate their parent sentence, but not vice versa, because the children present additional information about their parent. Hence, the most concise sentences, indexed here by $k$ (e.g., the 1st, 2nd, and 4th in Figure 1), contribute less to the reconstruction of any other sentence. Thus, $a_{ik}$ becomes lower for all $i \neq 0$. Because $a_{ik}$ satisfies the constraint $\sum_{i=0}^{n} a_{ik} = 1$, $a_{0k}$ is expected to be larger, and thus the $k$-th sentence contributes to the construction of the summary embedding $\hat{s}_0$.
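The sum-to-one constraint on $a_{ik}$ can be checked on a toy document. The following is a minimal sketch, not the method used in this paper: it computes the edge marginals by brute-force enumeration of every multi-root tree, which is only feasible for a handful of sentences, whereas the model relies on the Matrix-Tree Theorem for efficiency. The function names and the edge weights in `f` are invented for illustration.

```python
from itertools import product


def is_tree(parents):
    """True if every sentence reaches the root (node 0) without a cycle.

    parents[j - 1] is the parent of sentence j, for j = 1..n.
    """
    n = len(parents)
    for start in range(1, n + 1):
        node, steps = start, 0
        while node != 0:
            node = parents[node - 1]
            steps += 1
            if steps > n:  # walked more edges than nodes: there is a cycle
                return False
    return True


def tree_marginals(f):
    """Edge marginals a[i][j] over all multi-root trees.

    f[i][j] is the non-negative weight of the edge parent i -> child j,
    with index 0 reserved for the root. A tree's weight is the product
    of its edge weights; a[i][j] is the share of total tree weight that
    belongs to trees containing the edge (i, j).
    """
    n = len(f) - 1
    total = 0.0
    edge_mass = [[0.0] * (n + 1) for _ in range(n + 1)]
    for parents in product(range(n + 1), repeat=n):
        if not is_tree(parents):
            continue
        weight = 1.0
        for child, parent in enumerate(parents, start=1):
            weight *= f[parent][child]
        total += weight
        for child, parent in enumerate(parents, start=1):
            edge_mass[parent][child] += weight
    return [[mass / total for mass in row] for row in edge_mass]


# Hypothetical weights for a three-sentence review (row = parent, column = child).
f = [
    [0.0, 2.0, 1.0, 0.5],
    [0.0, 0.0, 3.0, 1.0],
    [0.0, 1.0, 0.0, 2.0],
    [0.0, 0.5, 0.5, 0.0],
]
a = tree_marginals(f)
for j in range(1, len(f)):
    # Each sentence has exactly one parent in every tree, so its column sums to 1.
    print(j, round(sum(a[i][j] for i in range(len(f))), 6))
```

Because every tree assigns exactly one parent to each sentence, lowering the weight of all non-root parents of a sentence necessarily shifts its probability mass toward the root, which is the effect the paragraph above relies on.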
Marginal Probability of Dependency: The calculation of the marginal probability of dependency, $a_{ij}$, is explained here. We first define the weighted adjacency matrix $F = (f_{ij}) \in \mathbb{R}^{(n+1) \times (n+1)}$, where the indices of the first column and row are 0, denoting the root node. $f_{ij}$ denotes the un-normalized weight of an edge between a parent sentence $i$ and its child $j$. We define it as a pair-wise attention score following Liu and Lapata (2018). By assuming a multi-root discourse tree, $f_{ij}$ is defined as:
are the weight and bias respectively, for constructing the representation of the parent nodes. $W_c \in \mathbb{R}^{d_f \times d_f}$ and $b_c \in \mathbb{R}^{d_f}$ correspond to those of the child nodes. We normalize $f_{ij}$ into $a_{ij}$ based on Koo et al. (2007). $a_{ij}$ corresponds to the proportion of the total weight of the spanning trees containing an edge $(i, j)$:
where $T$ denotes the set of all spanning trees in a document $D$. $v(t|F)$ is the weight of a tree $t \in T$, and $Z(F)$ denotes the sum of the weights of all trees in $T$. From the Matrix-Tree Theorem (Tutte, 1984), $Z(F)$ can be rephrased as:
where $L(F) \in \mathbb{R}^{(n+1) \times (n+1)}$ and $L_0(F) \in \mathbb{R}^{n \times n}$ are the Laplacian matrix of $F$ and its principal submatrix formed by deleting row 0 and column 0, respectively. By solving Eq. 12, $a_{ij}$ is given by:

DiscourseRank
StrSum generates the summary under the large influence of the child sentences of the root. Therefore, sentences that are not related to the rating (e.g., the 4th in Figure 1) also affect the summary and can be considered noise. Here, we assume that meaningful sentences (e.g., the 1st and 2nd in Figure 1) typically have more descendants, because many sentences explain them. Hence, we introduce DiscourseRank to rank the importance of the sentences in terms of the number of descendants. Inspired by PageRank (Page et al., 1999), the DiscourseRank of the root and $n$ sentences at the $t$-th iteration, $r_t = [r_0, \ldots, r_n] \in \mathbb{R}^{n+1}$, is defined as:
where $\hat{A} = (\hat{a}_{ij}) \in \mathbb{R}^{(n+1) \times (n+1)}$ denotes the stochastic matrix for each dependency, $\lambda$ is a damping factor, and $v \in \mathbb{R}^{n+1}$ is a vector with all elements equal to $1/(n+1)$. Eq. 18 implies that $r_i$ reflects $r_j$ more if the $i$-th sentence is more likely to be the parent of the $j$-th sentence. The solution $r$ and the updated score of the edge $(0, j)$, $\bar{a}_{0j}$ ($j \in \{1, \ldots, n\}$), are calculated by:
The updated score $\bar{a}_{0j}$ is used to calculate the summary embedding $\hat{s}_0$ instead of Eq. 16. As a result, the generated summary reflects the sentences with a higher marginal probability of dependency on the root, while focusing on the main review point.
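The ranking described above can be illustrated with a short power-iteration sketch. It rests on several assumptions that are not confirmed by the text: a PageRank-style update $r_{t+1} = \lambda \hat{A} r_t + (1 - \lambda) v$, a uniform fill for the root column of $\hat{A}$ (the root has no parent, so that column is otherwise undefined), and a re-weighting of $a_{0j}$ that is proportional to $a_{0j} r_j$. The function names and the toy marginals are hypothetical.

```python
def discourse_rank(a, damping=0.9, iters=100):
    """PageRank-style scores over the root (index 0) and n sentences.

    a[i][j] is the marginal probability that node i is the parent of
    sentence j, so every column j >= 1 sums to one.
    """
    size = len(a)
    n = size - 1
    # Assumption of this sketch: the root column (j = 0) is spread
    # uniformly over the sentences so that every column sums to one.
    a_hat = [
        [a[i][j] if j >= 1 else (1.0 / n if i >= 1 else 0.0) for j in range(size)]
        for i in range(size)
    ]
    v = [1.0 / size] * size
    r = v[:]
    for _ in range(iters):
        # r_i collects the scores of the nodes that i is likely to be the parent of.
        r = [
            damping * sum(a_hat[i][j] * r[j] for j in range(size))
            + (1.0 - damping) * v[i]
            for i in range(size)
        ]
    return r


def rerank_root_edges(a, r):
    """Assumed update: scale a[0][j] by r[j] and renormalise over j = 1..n."""
    scores = [a[0][j] * r[j] for j in range(1, len(a))]
    total = sum(scores)
    return [0.0] + [score / total for score in scores]


# Toy marginals: sentences 1 and 2 attach to the root equally often,
# but sentence 3 is most likely a child of sentence 1.
a = [
    [0.0, 0.50, 0.50, 0.1],
    [0.0, 0.00, 0.25, 0.8],
    [0.0, 0.25, 0.00, 0.1],
    [0.0, 0.25, 0.25, 0.0],
]
r = discourse_rank(a)
a_bar = rerank_root_edges(a, r)
print([round(x, 3) for x in r])
print([round(x, 3) for x in a_bar])
```

In this toy case sentences 1 and 2 start with the same root probability, yet sentence 1 ends with the larger updated score because it has a likely descendant, which matches the intuition that sentences with more descendants should dominate the summary.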

Related Work

Supervised Review Summary Generation
Several previous studies have addressed abstractive summarization for product reviews (Carenini et al., 2013; Di Fabbrizio et al., 2014; Bing et al., 2015; Yu et al., 2016); however, their output summaries are not guaranteed to be grammatical (Wang and Ling, 2016). Neural sequence-to-sequence models have improved the quality of abstractive summarization. Beginning with the adaptation to sentence summarization (Rush et al., 2015; Chopra et al., 2016), several studies have tackled the generation of an abstractive summary of news articles (Nallapati et al., 2016; See et al., 2017; Tan et al., 2017; Paulus et al., 2018). With regard to product reviews, the neural sequence-to-sequence based model (Wang and Ling, 2016) and joint learning with sentiment classification (Ma et al., 2018; Wang and Ren, 2018) have improved the performance of one-sentence summarization. Our work is also based on the neural sequence-to-sequence model, while introducing the new concept of generating the summary by recursively reconstructing a parent sentence from its children.

Unsupervised Summary Generation
Although supervised abstractive summarization has been successfully improved, unsupervised techniques have not matured to the same degree. Ganesan et al. (2010) proposed Opinosis, a graph-based method for generating review summaries. Their method is word-extractive, rather than abstractive, because the generated summary only contains words that appear in the source document. With the recently increasing number of neural summarization models, Miao and Blunsom (2016) applied a variational auto-encoder for semi-supervised sentence compression. Chu and Liu (2018) proposed MeanSum, an unsupervised neural multi-document summarization model for reviews. However, their model is not aimed at generating a summary from a single document and cannot be directly extended to that setting. Although several previous studies (Fang et al., 2016; Dohare et al., 2018) have used external parsers for unsupervised abstractive summarization, our work, to the best of our knowledge, proposes the first unsupervised abstractive summarization method for a single product review that does not require an external parser.

Discourse Parsing and its Applications
Discourse parsing has been extensively researched and used for various applications.  (2017) also constructed a dependency-based discourse tree for document classification. Ji and Smith (2017) pointed out the limitations of using external parsers, demonstrating that the performance depends on the amount of the RST-DT and the domain of the documents. Against such a background, Liu and Lapata (2018) proposed a model that induces a latent discourse tree without an external corpus. Inspired by structure bias (Cheng and Lapata, 2016;Kim et al., 2017), they introduced Structured Attention, which normalizes attention scores as the posterior marginal probabilities of a nonprojective discourse tree. The probability distribution of Structured Attention implicitly represents a discourse tree, in which the child sentences present additional information about their parent. We extend it to the unsupervised summarization, i.e., obtaining a summary as the root sentence of a latent discourse tree. While Liu and Lapata (2018) introduce a virtual root sentence and induce a latent discourse tree via supervised document classification, we generate a root sentence via reconstructing a parent sentence from its children without supervision.

Experiments
In this section, we present our experiments for the evaluation of the summary generation performance of online reviews. The following section provides the details of the experiments and results.

Dataset
Our experiments use the Amazon product review dataset (McAuley et al., 2015;He and McAuley, 2016), which contains Amazon online reviews and their one-sentence summaries. It includes 142.8 Because our model is trained by identifying and reconstructing a parent sentence from its children, it sometimes fails to construct an appropriate tree for relatively short reviews. It also has a negative influence on summary generation. Therefore, we use reviews with 10 or more sentences for training, and those with 5 or more sentences for validation and evaluation. Table 1 indicates the number of reviews in each domain.

Experimental Details
The source sentences and the summaries share the same vocabulary, which is extracted from the training sources of each domain. We limit the vocabulary to the 50,000 most frequent words appearing in the training sets.
The hyper-parameters are tuned based on the performance using the reference summaries in validation sets. We use 300-dimensional word embeddings and initialize them with pre-trained FastText vectors (Joulin et al., 2017). The encoder is a single-layer Bi-GRU with 256-dimensional hidden states for each direction and the decoder is a uni-directional GRU with 256-dimensional hidden states. The damping factor of DiscourseRank is 0.9. We train the model using Adagrad with a learning rate of $10^{-1}$, an initial accumulator value of $10^{-1}$, and a batch size of 16. At evaluation time, a beam search with a beam size of 10 is used.
Similar to See et al. (2017) and Ma et al. (2018), our evaluation metric is the ROUGE-F1 score (Lin, 2004), computed by the pyrouge package. We use ROUGE-1, ROUGE-2, and ROUGE-L, which measure the word overlap, bigram overlap, and longest common subsequence between the reference and generated summaries, respectively.
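To make the three metrics concrete, the following is a simplified sketch of ROUGE-1, ROUGE-2, and ROUGE-L F1 on whitespace-tokenized, single-sentence summaries. It is not the pyrouge implementation used for the reported scores: it assumes plain lower-cased tokens and omits stemming and the other preprocessing options of the official toolkit. The function names and the example hypothesis are invented for illustration.

```python
from collections import Counter


def _f1(overlap, ref_total, hyp_total):
    """Harmonic mean of precision (overlap / hyp) and recall (overlap / ref)."""
    if overlap == 0:  # also covers empty inputs
        return 0.0
    precision = overlap / hyp_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def _ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, hypothesis, n=1):
    """F1 over clipped n-gram matches (n=1: words, n=2: bigrams)."""
    ref, hyp = _ngrams(reference, n), _ngrams(hypothesis, n)
    overlap = sum((ref & hyp).values())  # multiset intersection clips counts
    return _f1(overlap, sum(ref.values()), sum(hyp.values()))


def rouge_l(reference, hypothesis):
    """F1 based on the length of the longest common subsequence (LCS)."""
    prev = [0] * (len(hypothesis) + 1)
    for ref_tok in reference:
        cur = [0]
        for j, hyp_tok in enumerate(hypothesis, start=1):
            if ref_tok == hyp_tok:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return _f1(prev[-1], len(reference), len(hypothesis))


reference = "i love this game".split()
hypothesis = "i really love this card game".split()
print(round(rouge_n(reference, hypothesis, 1), 3))  # 0.8
print(round(rouge_n(reference, hypothesis, 2), 3))  # 0.25
print(round(rouge_l(reference, hypothesis), 3))     # 0.8
```

The example shows why ROUGE-L tolerates insertions that ROUGE-2 penalizes: the inserted words break three of the four hypothesis bigrams that could have matched, but the reference still appears in order as a subsequence.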

Baseline
For the comparisons, two unsupervised baseline models are employed. As a graph-based unsupervised sentence extraction method, we employ TextRank (Mihalcea and Tarau, 2004), where sentence embeddings are used instead of bag-of-words representations, based on Rossiello et al. (2017). As an unsupervised word-level extractive approach, we employ Opinosis (Ganesan et al., 2010), which detects salient phrases in terms of their redundancy. Because we observe repetitive expressions in the dataset, Opinosis is added as a baseline. Both methods extract or generate a one-sentence summary. Furthermore, a third, novel unsupervised baseline model, MeanSum-single, is introduced, which is an extended version of the unsupervised neural multi-document summarization model (Chu and Liu, 2018). While the original model decodes the mean of multiple document embeddings to generate the summary, MeanSum-single generates a single-document summary by decoding the mean of the sentence embeddings in a document. It learns a language model through reconstruction of each sentence. By comparing with MeanSum-single, we verify that our model focuses on the main review points, and does not simply take the average of the entire document.
As supervised baselines, we employ vanilla neural sequence-to-sequence models for abstractive summarization (Hu et al., 2015), following previous studies (Ma et al., 2018; Wang and Ren, 2018). We denote the model as Seq-Seq and that with the attention mechanism as Seq-Seq-att. The encoder and decoder used are the same as those used in our model.

Figure 4(a), body of the board game review:
1. I love this game
2. It is so much fun
3. I'm all about new and different games
4. I love to play this with my brother because he is very bad at keeping score so I win most of the time and he loves to tell each characters story
5. And he loves to tell each characters story and to tell why each person got what fate
6. It's a must buy if you want a fun and fast card game

Figure 4(b), body of the camping mattress review:
1. have not used it yet at the campground but tested it at home and works fine
2. use a toothpick to hold the valve open so you can deflate it easily
3. if you sit on it and your butt just touches the ground your at the right pressure
4. for the price i would recommend it for occasional use
5. if your a hard core camper you may want a name brand
6. it suits my needs perfectly

difference between our models and the others are statistically significant (p < 0.05). Because the abstractive approach generates a concise summary by omitting trivial phrases, it can lead to a better performance than those of the extractive ones.

Evaluation of Summary Generation
On the other hand, for Movies & TV, our model is competitive with the other unsupervised extractive approaches, TextRank and Opinosis. One possible explanation is that the summary typically includes named entities, such as the names of characters, actors and directors, which may lead to a better performance of the extractive approaches. For all datasets, our full model outperforms the one using only StrSum. Our models significantly outperform MeanSum-single, indicating that our model focuses on the main review points, and does not simply take the average of the entire document.
Figure 3 shows the ROUGE-L F1 scores of our models on the evaluation sets with various numbers of sentences compared to the supervised baseline model (Seq-Seq-att). For reviews with fewer than 30 sentences, the performance of our models is inferior to that of the supervised baseline model. Because our full model generates summaries via learning the latent discourse tree, it sometimes fails to construct a tree, and thus experiences a decline in performance for relatively short reviews. On the other hand, for reviews with more than 30 sentences, our model achieves competitive or better performance than the supervised model.
Figure 4 presents the generated summary and the latent discourse tree induced by our full model. We obtained the maximum spanning tree from the probability distribution of dependency, using the Chu-Liu-Edmonds algorithm (Chu, 1965; Edmonds, 1967). Figure 4(a) shows the summary and the latent discourse tree for a board game review. Our model generates the summary, "i love this game", which is almost identical to the reference. The induced tree shows that the 2nd sentence elaborates on the generated summary, while the 3rd sentence provides its background. The 4th and 5th sentences explain the 1st sentence in detail, i.e., describe why the author loves the game.
Figure 4(b) shows the summary and latent discourse tree of a camping mattress review. Although there is no word overlap between the reference and generated summary, our model focuses on the positivity in terms of the price. On the induced tree, the 1st to 3rd sentences provide a background of the summary and mention the high quality of the product. The 6th sentence indicates that the reviewer is satisfied, while the 4th sentence provides its explanation with regard to the price.

Analysis of the Induced Structure
In Figure 4(c), we present a failure example of a review of a concert DVD. The reviewer is disappointed by the poor quality of the sound; however, our model generates a positive summary, "this is a great movie". The induced tree shows that the sentences describing the high potential (1st), quality of the video (4th), and preference for the picture (7th) all affect the summary generation. Our model regards the sound quality as a secondary factor to that of the video. Therefore, it fails to prioritize between the contrasting aspects, the sound and the video, and generates an inappropriate summary. DiscourseRank cannot work well on this example, because the numbers of sentences mentioning each aspect are not significantly different. To solve such a problem, the aspects of each product must be ranked explicitly, such as in Gerani et al. (2014) and Angelidis and Lapata (2018).
Table 3 summarizes the characteristics of the induced latent discourse trees. These are compared with those obtained by the Structured Attention model, StrAtt (Liu and Lapata, 2018). StrAtt induces single-root trees via the document classification task based on the review ratings. For each domain, our model induces more non-projective trees than StrAtt. Additionally, the height (the average maximum path length from a root to a leaf node) is larger than that of StrAtt. Our model estimates the parent of all the sentences and can induce deeper trees in which the edges connect trivial sentences. On the other hand, StrAtt identifies salient sentences required for the document classification, and thus induces shallow trees that connect the salient sentences and others. As our model prevents the summary from focusing on trivial or redundant sentences by inducing deep and complex trees, it specifically achieves higher performance when considering relatively long reviews.

Figure 5: Visualization of DiscourseRank. The darker the highlighting, the higher the rank score. The references and generated summaries are also shown.
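The two tree statistics discussed for Table 3, height and projectivity, can be computed from a parent array as in the sketch below. This is an illustration under assumptions rather than the evaluation script behind Table 3: the tree is given as a list where `parents[j]` is the parent of sentence `j`, index 0 is the root, and a tree counts as non-projective when any two edges cross once the nodes are laid out in document order. The function names and both example trees are hypothetical.

```python
def tree_height(parents):
    """Longest root-to-leaf path, counted in edges.

    parents[j] is the parent of sentence j; parents[0] is ignored
    because node 0 is the root.
    """
    def depth(node):
        steps = 0
        while node != 0:
            node = parents[node]
            steps += 1
        return steps

    return max(depth(j) for j in range(1, len(parents)))


def is_projective(parents):
    """True if no two edges cross when nodes are placed in document order."""
    arcs = [(min(head, dep), max(head, dep))
            for dep, head in enumerate(parents) if dep != 0]
    # Arcs (a, b) and (c, d) cross when exactly one endpoint of the second
    # lies strictly inside the first: a < c < b < d.
    return not any(a < c < b < d for a, b in arcs for c, d in arcs)


# Hypothetical six-sentence tree: sentences 1, 2, 3 and 6 attach to the
# root, while sentences 4 and 5 elaborate on sentence 1.
multi_root = [0, 0, 0, 0, 1, 1, 0]
print(tree_height(multi_root), is_projective(multi_root))  # 2 False

# A simple chain 0 -> 1 -> 2 -> 3 is deeper but has no crossing edges.
chain = [0, 0, 1, 2]
print(tree_height(chain), is_projective(chain))  # 3 True
```

In the first example the edge from sentence 1 to sentence 4 crosses the edges from the root to sentences 2 and 3, which is the kind of long-range attachment that makes a tree non-projective.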

DiscourseRank Analysis
In this section, we demonstrate how DiscourseRank affects the summary generation. Figure 5 visualizes the sentences in the main body and their DiscourseRank scores. We highlight the sentences that achieve a high DiscourseRank score with a darker color.
A review of a car coloring book is presented in Figure 5(a). As expected, the score of the 1st sentence, which is not related to the review evaluation, is low; that is, DiscourseRank emphasizes the evaluative sentences, such as the 2nd and 6th sentences.
A review of swimming goggles is presented in Figure 5(b). The reviewer is satisfied with the quality of the product. The highlighting shows that DiscourseRank focuses on the sentences that mention leaking (e.g., the 2nd and 5th). While our model with only StrSum emphasizes the price sufficiency, adding DiscourseRank yields a summary stating that there is no issue with the quality.

Conclusion
In this work, we proposed a novel unsupervised end-to-end model to generate an abstractive summary of a single product review while inducing a latent discourse tree. The experimental results demonstrated that our model is competitive with or outperforms other unsupervised approaches. In particular, for relatively long reviews, our model achieved competitive or better performance compared to supervised models. The induced tree shows that the child sentences present additional information about their parent, and the generated summary abstracts the entire review.
Our model can also be applied to other applications, such as argument mining, because arguments typically have the same discourse structure as reviews. Our model can not only generate the summary but also identify the argumentative structures. Unfortunately, we cannot directly compare our induced trees with the output of a discourse parser, which typically splits sentences into elementary discourse units. In future work, we will make comparisons with those of a human-annotated dataset.