Deep Recurrent Generative Decoder for Abstractive Text Summarization

We propose a new framework for abstractive text summarization based on a sequence-to-sequence oriented encoder-decoder model equipped with a deep recurrent generative decoder (DRGN). Latent structure information implied in the target summaries is learned based on a recurrent latent random model for improving the summarization quality. Neural variational inference is employed to address the intractable posterior inference for the recurrent latent variables. Abstractive summaries are generated based on both the generative latent variables and the discriminative deterministic states. Extensive experiments on some benchmark datasets in different languages show that DRGN achieves improvements over the state-of-the-art methods.


Introduction
Automatic summarization is the process of automatically generating a summary that retains the most important content of the original text document (Edmundson, 1969;Luhn, 1958;Nenkova and McKeown, 2012). Different from the common extraction-based and compression-based methods, abstraction-based methods aim at constructing new sentences as summaries, thus they require a deeper understanding of the text and the capability of generating new sentences, which provide an obvious advantage in improving the focus of a summary, reducing the redundancy, and keeping a good compression rate Rush et al., 2015;Nallapati et al., 2016). Some previous research works show that human-written summaries are more abstractive (Jing and McKeown, 2000). Moreover, our investigation reveals that people may naturally follow some inherent structures when they write the abstractive summaries. To illustrate this observation, we show some examples in Figure 1, which are some top story summaries or headlines from the channel "Technology" of CNN. After analyzing the summaries carefully, we can find some common structures from them, such as "What", "What-Happened" , "Who Action What", etc. For example, the summary "Apple sues Qualcomm for nearly $1 billion" can be structuralized as "Who (Apple) Action (sues) What (Qualcomm)". Similarly, the summaries " [Twitter] [fixes] [botched @POTUS account transfer]", " [Uber] [to pay] [$20 million] for misleading drivers", and "[Bipartisan bill] aims to [reform] [H-1B visa system]" also follow the structure of "Who Action What". The summary "The emergence of the 'cyber cold war"' matches with the structure of "What", and the summary "St. Louis' public library computers hacked" follows the structure of "What-Happened".
Intuitively, if we can incorporate the latent structure information of summaries into the abstractive summarization model, it will improve the quality of the generated summaries. However, very few existing works specifically consider the latent structure information of summaries in their summarization models. Although a very popular neural network based sequence-to-sequence (seq2seq) framework has been proposed to tackle the abstractive summarization problem (Lopyrev, 2015;Rush et al., 2015;Nallapati et al., 2016), the calculation of the internal decoding states is entirely deterministic. The deterministic transformations in these discriminative models lead to limitations on the representation ability of the latent structure information. Miao and Blunsom (2016) extended the seq2seq framework and proposed a generative model to capture the latent summary information, but they did not consider the recurrent dependencies in their generative model leading to limited representation ability.
To tackle the above mentioned problems, we design a new framework based on sequenceto-sequence oriented encoder-decoder model equipped with a latent structure modeling component. We employ Variational Auto-Encoders (VAEs) (Kingma and Welling, 2013;Rezende et al., 2014) as the base model for our generative framework which can handle the inference problem associated with complex generative modeling. However, the standard framework of VAEs is not designed for sequence modeling related tasks. Inspired by (Chung et al., 2015), we add historical dependencies on the latent variables of VAEs and propose a deep recurrent generative decoder (DRGD) for latent structure modeling.
Then the standard discriminative deterministic decoder and the recurrent generative decoder are integrated into a unified decoding framework. The target summaries will be decoded based on both the discriminative deterministic variables and the generative latent structural information. All the neural parameters are learned by back-propagation in an end-to-end training paradigm.
The main contributions of our framework are summarized as follows: (1) We propose a sequence-to-sequence oriented encoder-decoder model equipped with a deep recurrent generative decoder (DRGD) to model and learn the latent structure information implied in the target summaries of the training data. Neural variational inference is employed to address the intractable posterior inference for the recurrent latent variables.
(2) Both the generative latent structural information and the discriminative deterministic variables are jointly considered in the generation process of the abstractive summaries. (3) Experimental results on some benchmark datasets in different languages show that our framework achieves better performance than the state-of-the-art models.

Related Works
Automatic summarization is the process of automatically generating a summary that retains the most important content of the original text document (Nenkova and McKeown, 2012). Traditionally, the summarization methods can be classified into three categories: extraction-based methods (Erkan and Radev, 2004;Goldstein et al., 2000;Wan et al., 2007;Min et al., 2012;Nallapati et al., 2017;Cheng and Lapata, 2016;Cao et al., 2016;Song et al., 2017), compression-based methods (Li et al., 2013;Wang et al., 2013;Li et al., 2015, and abstraction-based methods. In fact, previous investigations show that human-written summaries are more abstractive (Barzilay and McKeown, 2005;. Abstraction-based approaches can generate new sentences based on the facts from different source sentences. Barzilay and McKeown (2005) employed sentence fusion to generate a new sentence.  proposed a more fine-grained fusion framework, where new sentences are generated by selecting and merging salient phrases. These methods can be regarded as a kind of indirect abstractive summarization, and complicated constraints are used to guarantee the linguistic quality.
Recently, some researchers employ neural network based framework to tackle the abstractive summarization problem. Rush et al. (2015) proposed a neural network based model with local attention modeling, which is trained on the Gigaword corpus, but combined with an additional loglinear extractive summarization model with handcrafted features. Gu et al. (2016) integrated a copying mechanism into a seq2seq framework to improve the quality of the generated summaries. Chen et al. (2016) proposed a new attention mechanism that not only considers the important source segments, but also distracts them in the decoding step in order to better grasp the overall meaning of input documents. Nallapati et al. (2016) utilized a trick to control the vocabulary size to improve the training efficiency. The calculations in these methods are all deterministic and the representation ability is limited. Miao and Blunsom (2016) extended the seq2seq framework and proposed a generative model to capture the latent summary information, but they do not consider the recurrent dependencies in their generative model leading to limited representation ability.
Some research works employ topic models to capture the latent information from source documents or sentences. Wang et al. (2009) proposed a new Bayesian sentence-based topic model by making use of both the term-document and termsentence associations to improve the performance of sentence selection. Celikyilmaz and Hakkani-Tur (2010) estimated scores for sentences based on their latent characteristics using a hierarchical topic model, and trained a regression model to extract sentences. However, they only use the latent topic information to conduct the sentence salience estimation for extractive summarization. In contrast, our purpose is to model and learn the latent structure information from the target summaries and use it to enhance the performance of abstractive summarization.

Overview
As shown in Figure 2, the basic framework of our approach is a neural network based encoderdecoder framework for sequence-to-sequence learning. The input is a variable-length sequence X = {x 1 , x 2 , . . . , x m } representing the source text. The word embedding x t is initialized randomly and learned during the optimization process. The output is also a sequence Y = {y 1 , y 2 , . . . , y n }, which represents the generated abstractive summaries. Gated Recurrent Unit (GRU) (Cho et al., 2014) is employed as the basic sequence modeling component for the encoder and the decoder. For latent structure modeling, we add historical dependencies on the latent variables of Variational Auto-Encoders (VAEs) and propose a deep recurrent generative decoder (DRGD) to distill the complex latent structures implied in the target summaries of the training data. Finally, the abstractive summaries will be decoded out based on both the discriminative deterministic variables H and the generative latent structural information Z.

Recurrent Generative Decoder
Assume that we have obtained the source text representation h e ∈ R k h . The purpose of the decoder is to translate this source code h e into a series of hidden states {h d 1 , h d 2 , . . . , h d n }, and then revert these hidden states to an actual word sequence and generate the summary.
For standard recurrent decoders, at each time step t, the hidden state h d t ∈ R k h is calculated using the dependent input symbol y t−1 ∈ R kw and the previous hidden state h d t−1 : where f (·) is a recurrent neural network such as vanilla RNN, Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997), and Gated Recurrent Unit (GRU) (Cho et al., 2014). No matter which one we use for f (·), the common transformation operation is as follows: where W d yh ∈ R k h ×kw and W d hh ∈ R k h ×k h are the linear transformation matrices. b d h is the bias. k h is the dimension of the hidden layers, and k w is the dimension of the word embeddings. g(·) is the non-linear activation function. From Equation 2, we can see that all the transformations are deterministic, which leads to a deterministic recurrent hidden state h d t . From our investigations, we find that the representational power of such deterministic variables are limited. Some more complex latent structures in the target summaries, such as the high-level syntactic features and latent topics, cannot be modeled effectively by the deterministic operations and variables.
Recently, a generative model called Variational Auto-Encoders (VAEs) (Kingma and Welling, 2013;Rezende et al., 2014) shows strong capability in modeling latent random variables and improves the performance of tasks in different fields such as sentence generation (Bowman et al., 2016) and image generation (Gregor et al., 2015). However, the standard VAEs is not designed for modeling sequence directly. Inspired by (Chung et al., 2015), we extend the standard VAEs by introducing the historical latent variable dependencies to make it be capable of modeling sequence data. Our proposed latent structure modeling framework can be viewed as a sequence generative model which can be divided into two parts: inference (variational-encoder) and generation (variational-decoder). As shown in the decoder component of Figure 2, the input of the original VAEs only contains the observed variable y t , and the variational-encoder can map it to a latent variable z ∈ R kz , which can be used to reconstruct the original input. For the task of summarization, in the sequence decoder component, the previous latent structure information needs to be considered for constructing more effective representations for the generation of the next state. For the inference stage, the variational-encoder can map the observed variable y <t and the previous latent structure information z <t to the posterior probability distribution of the latent structure variable p θ (z t |y <t , z <t ). It is obvious that this is a recurrent inference process in which z t contains the historical dynamic latent structure information. Compared with the variational inference process p θ (z t |y t ) of the typical VAEs model, the recurrent framework can extract more complex and effective latent structure features implied in the sequence data.
For the generation process, based on the latent structure variable z t , the target word y t at the time step t is drawn from a conditional probability distribution p θ (y t |z t ). The target is to maximize the probability of each generated summary y = {y 1 , y 2 , . . . , y T } based on the generation process according to: For the purpose of solving the intractable integral of the marginal likelihood as shown in Equation 3, a recognition model q φ (z t |y <t , z <t ) is introduced as an approximation to the intractable true posterior p θ (z t |y <t , z <t ). The recognition model parameters φ and the generative model parameters θ can be learned jointly. The aim is to reduce the Kulllback-Leibler divergence (KL) between q φ (z t |y <t , z <t ) and p θ (z t |y <t , z <t ): where · denotes the conditional variables y <t and z <t . Bayes rule is applied to p θ (z t |y <t , z <t ), and we can extract log p θ (z) from the expectation, transfer the expectation term E q φ (zt|y<t,z<t) back to KL-divergence, and rearrange all the terms. Consequently the following holds: Let L(θ, φ; y) represent the last two terms from the right part of Equation 4: Since the first KL-divergence term of Equation 4 is non-negative, we have log p θ (y <t ) ≥ L(θ, φ; y) meaning that L(θ, φ; y) is a lower bound (the objective to be maximized) on the marginal likelihood. In order to differentiate and optimize the lower bound L(θ, φ; y), following the core idea of VAEs, we use a neural network framework for the probabilistic encoder q φ (z t |y <t , z <t ) for better approximation.

Abstractive Summary Generation
We also design a neural network based framework to conduct the variational inference and generation for the recurrent generative decoder component similar to some design in previous works (Kingma and Welling, 2013;Rezende et al., 2014;Gregor et al., 2015). The encoder component and the decoder component are integrated into a unified abstractive summarization framework. Considering that GRU has comparable performance but with less parameters and more efficient computation, we employ GRU as the basic recurrent model which updates the variables according to the following operations: where r t is the reset gate, z t is the update gate. denotes the element-wise multiplication. tanh is the hyperbolic tangent activation function. As shown in the left block of Figure 2, the encoder is designed based on bidirectional recurrent neural networks. Let x t be the word embedding vector of the t-th word in the source sequence. GRU maps x t and the previous hidden state h t−1 to the current hidden state h t in feed-forward direction and back-forward direction respectively: Then the final hidden state h e t ∈ R 2k h is concatenated using the hidden states from the two directions: h e t = h t || h. As shown in the middle block of Figure 2, the decoder consists of two components: discriminative deterministic decoding and generative latent structure modeling.
The discriminative deterministic decoding is an improved attention modeling based recurrent sequence decoder. The first hidden state h d 1 is initialized using the average of all the source input where h e t is the source input hidden state. T e is the input sequence length. The deterministic decoder hidden state h d t is calculated using two layers of GRUs. On the first layer, the hidden state is calculated only using the current input word embedding y t−1 and the previous hidden state h d 1 t−1 : where the superscript d 1 denotes the first decoder GRU layer. Then the attention weights at the time step t are calculated based on the relationship of h d 1 t and all the source hidden states {h e t }. Let a i,j be the attention weight between h d 1 i and h e j , which can be calculated using the following formulation: The attention context is obtained by the weighted linear combination of all the source hidden states: The final deterministic hidden state h d 2 t is the output of the second decoder GRU layer, jointly considering the word y t−1 , the previous hidden state h d 2 t−1 , and the attention context c t : For the component of recurrent generative model, inspired by some ideas in previous works (Kingma and Welling, 2013;Rezende et al., 2014;Gregor et al., 2015), we assume that both the prior and posterior of the latent variables are Gaussian, i.e., p θ (z t ) = N (0, I) and q φ (z t |y <t , z <t ) = N (z t ; µ, σ 2 I), where µ and σ denote the variational mean and standard deviation respectively, which can be calculated via a multilayer perceptron. Precisely, given the word embedding y t−1 , the previous latent structure variable z t−1 , and the previous deterministic hidden state h d t−1 , we first project it to a new hidden space: where W ez yh ∈ R k h ×kw , W ez zh ∈ R k h ×kz , W ez hh ∈ R k h ×k h , and b ez h ∈ R k h . g is the sigmoid activation function: σ(x) = 1/(1 + e −x ). Then the Gaussian parameters µ t ∈ R kz and σ t ∈ R kz can be obtained via a linear transformation based on h ez t : The latent structure variable z t ∈ R kz can be calculated using the reparameterization trick: where ε ∈ R kz is an auxiliary noise variable. The process of inference for finding z t based on neural networks can be teated as a variational encoding process.
To generate summaries precisely, we first integrate the recurrent generative decoding component with the discriminative deterministic decoding component, and map the latent structure variable z t and the deterministic decoding hidden state h d 2 t to a new hidden variable: Given the combined decoding state h dy t at the time t, the probability of generating any target word y t is given as follows: where W d hy ∈ R ky×k h and b d hy ∈ R ky . ς(·) is the softmax function. Finally, we use a beam search algorithm (Koehn, 2004) for decoding and generating the best summary.

Learning
Although the proposed model contains a recurrent generative decoder, the whole framework is fully differentiable. As shown in Section 3.3, both the recurrent deterministic decoder and the recurrent generative decoder are designed based on neural networks. Therefore, all the parameters in our model can be optimized in an end-to-end paradigm using back-propagation. We use {X} N and {Y } N to denote the training source and target sequence. Generally, the objective of our framework consists of two terms. One term is the negative loglikelihood of the generated summaries, and the other one is the variational lower bound L(θ, φ; Y ) mentioned in Equation 5. Since the variational lower bound L(θ, φ; Y ) also contains a likelihood term, we can merge it with the likelihood term of summaries. The final objective function, which needs to be minimized, is formulated as follows: 4 Experimental Setup

Datesets
We train and evaluate our framework on three popular datasets. Gigawords is an English sentence summarization dataset prepared based on Annotated Gigawords 1 by extracting the first sentence from articles with the headline to form a sourcesummary pair. We directly download the prepared dataset used in (Rush et al., 2015). It roughly contains 3.8M training pairs, 190K validation pairs, and 2,000 test pairs. DUC-2004 2 is another English dataset only used for testing in our experiments. It contains 500 documents. Each document contains 4 model summaries written by experts. The length of the summary is limited to 75 bytes.
LCSTS is a large-scale Chinese short text summarization dataset, consisting of pairs of (short text, summary) collected from Sina Weibo 3 (Hu et al., 2015). We take Part-I as the training set, Part-II as the development set, and Part-III as the test set.
There is a score in range 1 ∼ 5 labeled by human to indicate how relevant an article and its summary is. We only reserve those pairs with scores no less than 3. The size of the three sets are 2.4M, 8.7k, and 725 respectively. In our experiments, we only take Chinese character sequence as input, without performing word segmentation.

Evaluation Metrics
We use ROUGE score (Lin, 2004) as our evaluation metric with standard options. The basic idea of ROUGE is to count the number of overlapping units between generated summaries and the reference summaries, such as overlapped n-grams, word sequences, and word pairs. F-measures of ROUGE-1 (R-1), ROUGE-2 (R-2), ROUGE-L (R-L) and ROUGE-SU4 (R-SU4) are reported.

Comparative Methods
We compare our model with some baselines and state-of-the-art methods. Because the datasets are quite standard, so we just extract the results from their papers. Therefore the baseline methods on different datasets may be slightly different.
• TOPIARY (Zajic et al., 2004) is the best on DUC2004 Task-1 for compressive text summarization. It combines a system using linguistic based transformations and an unsupervised topic detection algorithm for compressive text summarization.
• MOSES+ (Rush et al., 2015) uses a phrasebased statistical machine translation system trained on Gigaword to produce summaries. It also augments the phrase table with "deletion" rulesto improve the baseline performance, and MERT is also used to improve the quality of generated summaries.
• ABS and ABS+ (Rush et al., 2015) are both the neural network based models with local attention modeling for abstractive sentence summarization. ABS+ is trained on the Gigaword corpus, but combined with an additional log-linear extractive summarization model with handcrafted features.
• RNN and RNN-context (Hu et al., 2015) are two seq2seq architectures. RNN-context integrates attention mechanism to model the context.
• RNN-distract (Chen et al., 2016) uses a new attention mechanism by distracting the historical attention in the decoding steps.
• RAS-LSTM and RAS-Elman (Chopra et al., 2016) both consider words and word positions as input and use convolutional encoders to handle the source information. For the attention based sequence decoding process, RAS-Elman selects Elman RNN (Elman, 1990) as decoder, and RAS-LSTM selects Long Short-Term Memory architecture (Hochreiter and Schmidhuber, 1997).
• LenEmb (Kikuchi et al., 2016) uses a mechanism to control the summary length by considering the length embedding vector as the input.
• ASC+FSC 1 (Miao and Blunsom, 2016) uses a generative model with attention mechanism to conduct the sentence compression problem. The model first draws a latent summary sentence from a background language model, and then subsequently draws the observed sentence conditioned on this latent summary.
• lvt2k-1sent and lvt5k-1sent (Nallapati et al., 2016) utilize a trick to control the vocabulary size to improve the training efficiency.

Experimental Settings
For the experiments on the English dataset Gigawords, we set the dimension of word embeddings to 300, and the dimension of hidden states and latent variables to 500. The maximum length of documents and summaries is 100 and 50 respectively. The batch size of mini-batch training is 256. For DUC-2004, the maximum length of summaries is 75 bytes. For the dataset of LCSTS, the dimension of word embeddings is 350. We also set the dimension of hidden states and latent variables to 500. The maximum length of documents and summaries is 120 and 25 respectively, and the batch size is also 256. The beam size of the decoder was set to be 10. Adadelta (Schmidhuber, 2015) with hyperparameter ρ = 0.95 and = 1e − 6 is used for gradient based optimization. Our neural network based framework is implemented using Theano (Theano Development Team, 2016). We first depict the performance of our model DRGD by comparing to the standard decoders (StanD) of our own implementation. The comparison results on the validation datasets of Gigawords and LCSTS are shown in Table 1. From the results we can see that our proposed generative decoders DRGD can obtain obvious improvements on abstractive summarization than the standard decoders. Actually, the performance of the standard    Table 2 and  Table 3 respectively. Our model DRGD achieves the best summarization performance on all the ROUGE metrics. Although ASC+FSC 1 also uses a generative method to model the latent summary variables, the representation ability is limited and it cannot bring in noticeable improvements. It is worth noting that the methods lvt2k-1sent and lvt5k-1sent (Nallapati et al., 2016) utilize linguistic features such as parts-of-speech tags, namedentity tags, and TF and IDF statistics of the words as part of the document representation. In fact, extracting all such features is a time consuming work, especially on large-scale datasets such as Gigawords. lvt2k and lvt5k are not end-to-end style models and are more complicated than our model in practical applications.

ROUGE Evaluation
The results on the Chinese dataset LCSTS are shown in Table 4. Our model DRGD also achieves the best performance. Although CopyNet employs a copying mechanism to improve the summary quality and RNN-distract considers attention information diversity in their decoders, our model is still better than those two methods demonstrating that the latent structure information learned from target summaries indeed plays a role in abstractive summarization. We also believe that integrating the copying mechanism and coverage diversity in our framework will further improve the summarization performance.

Summary Case Analysis
In order to analyze the reasons of improving the performance, we compare the generated summaries by DRGD and the standard decoders StanD used in some other works such as (Chopra et al., 2016). The source texts, golden summaries, and the generated summaries are shown in Table 5. From the cases we can observe that DRGD can indeed capture some latent structures which are consistent with the golden summaries. For example, our result for S(1) "Wuhan wins men's soccer title at Chinese city games" matches the "Who Action What" structure. However, the standard decoder StanD ignores the latent structures and generates some loose sentences, such as the results for S(1) "Results of men's volleyball at Chinese city games" does not catch the main points. The reason is that the recurrent variational auto-encoders used in our framework have better representation ability and can capture more effective and complicated latent structures from the sequence data. Therefore, the summaries generated by DRGD have consistent latent structures with the ground truth, leading to a better ROUGE evaluation.

Conclusions
We propose a deep recurrent generative decoder (DRGD) to improve the abstractive summarization performance. The model is a sequenceto-sequence oriented encoder-decoder framework equipped with a latent structure modeling component. Abstractive summaries are generated based on both the latent variables and the deterministic states. Extensive experiments on benchmark S(1): hosts wuhan won the men 's soccer title by beating beijing shunyi #-# here at the #th chinese city games on friday. Golden: hosts wuhan wins men 's soccer title at chinese city games. StanD: results of men 's volleyball at chinese city games. DRGD: wuhan wins men 's soccer title at chinese city games. S(2): UNK and the china meteorological administration tuesday signed an agreement here on long -and short-term cooperation in projects involving meteorological satellites and satellite meteorology. Golden: UNK china to cooperate in meteorology. StanD: weather forecast for major chinese cities. DRGD: china to cooperate in meteorological satellites. S(3): the rand gained ground against the dollar at the opening here wednesday , to #.# to the greenback from #.# at the close tuesday. Golden: rand gains ground. StanD: rand slightly higher against dollar. DRGD: rand gains ground against dollar. S(4): new zealand women are having more children and the country 's birth rate reached its highest level in ## years , statistics new zealand said on wednesday. Golden: new zealand birth rate reaches ##year high. StanD: new zealand women are having more children birth rate hits highest level in ## years. DRGD: new zealand 's birth rate hits ##year high.
datasets show that DRGD achieves improvements over the state-of-the-art methods.