VAE-PGN based Abstractive Model in Multi-stage Architecture for Text Summarization

This paper describes our submission to the TL;DR challenge. Neural abstractive summarization models have been successful in generating fluent and consistent summaries, aided by advancements such as the copy (pointer-generator) and coverage mechanisms. However, these models tend to behave extractively, because they learn to copy words from the source text. In this paper, we propose a novel abstractive model based on the Variational Autoencoder (VAE) to address this issue. We also propose a Unified Summarization Framework for generating summaries. Our model eliminates non-critical information at the sentence level with an extractive summarization module and generates the summary word by word with an abstractive summarization module. To implement our framework, we combine submodules built on state-of-the-art techniques, including the Pointer-Generator Network (PGN) and BERT, with our new VAE-PGN abstractive model. We evaluate our model on the benchmark Reddit corpus as part of the TL;DR challenge and show that it outperforms the baseline in ROUGE score while generating diverse summaries.


Introduction
Text summarization is the task of producing an accurate summary that preserves the essential information of a long text document. It is a challenging and significant task with many real-world applications, such as summarizing news articles, social media, web pages, blogs or other long documents. Many approaches have been proposed to solve the text summarization problem (See et al., 2017; Liu, 2019b; Gehrmann et al., 2018). Extractive summarization generates a summary by selecting important phrases or sentences from the source text, typically by ranking phrases or sentences by importance and keeping only the top-ranked ones (Liu, 2019b). The abstractive approach, in contrast, generates entirely new phrases or sentences that capture the meaning of the source text. In this paper, we focus on a few challenges that need to be addressed when generating an abstractive summary of a given text.
The first challenge is the difficulty of preserving the contents of a large source text in the generated summary. State-of-the-art models for abstractive summarization use a sequence-to-sequence attention model with a copy mechanism (Gu et al., 2016; See et al., 2017) to select the relevant content of the source text. However, because of the copy mechanism, these models tend to generate summaries that are largely extractive (Boutkan et al., 2019; Chawla et al., 2019). We introduce a VAE-based PGN model to overcome this extractive tendency. Another challenge is to eliminate non-critical information from the source text. Gehrmann et al. (2018) employ a word-level content selection model to focus only on critical information, but handling critical information at the word level is difficult in long sentences because many common words are repeated. Our approach handles critical information with a multi-stage model that combines a sentence-level selection (extractive) module with an abstractive summarization module.
First, we use a fine-tuned BERT-based extractive model named BERTSUM (Liu, 2019b) to eliminate less important sentences by scoring each sentence in the source text. Second, an abstractive summary is generated from the sentences selected by the extractive model. The combination of the VAE and PGN mechanisms brings diversity to the abstractive summaries, and the sequential multi-stage processing improves performance compared to single-stage abstractive models. We found that the proposed model performs well, achieving the best result compared to our baselines on this task. The contributions of our work are as follows:
• We develop a VAE-based PGN to address the diversity issue and generate abstractive summaries.
• We propose a new architecture that excludes less important information at the first stage.

Building on BERT (Devlin et al., 2018), which has achieved state-of-the-art performance on multiple NLP tasks, Liu (2019b) uses a fine-tuned BERT model to score the sentences. Our work uses this model as the extractive summarization module.

Abstractive Summarization
The work on attention-based encoder-decoder models (Rush et al., 2015; Chopra et al., 2016) created a surge of research interest in text generation approaches. Recent abstractive summarization models adopt the attention mechanism (Bahdanau et al., 2015) and a pointer network to copy infrequent words and entities to the target sentence (See et al., 2017; Gülçehre et al., 2016; Gu et al., 2016). Gehrmann et al. (2018) use a content selector as a form of bottom-up attention that restricts the ability of abstractive summarizers to copy words from the source text.

Diversity
In this work, the diversity of a summary refers to generating words that are different from those in the source text. VAEs have become increasingly popular (Hu et al., 2017; Shen et al., 2017) for generating diverse sentences by learning a high-level latent variable representation of the context. Bowman et al. (2015) point out that sentences generated from a continuous space, as in a VAE, are of higher grammatical quality than those produced by other techniques such as beam search.

Proposed Architecture
In this section, we explain the proposed architecture to resolve the problems discussed above.

VAE-PGN Model
As illustrated in Figure 1, the proposed model is based on two main components: the pointer-generator network and the VAE mechanism. We incorporate a vanilla VAE to learn a representation of the source text that captures the complex semantic structures underlying the text. The latent representation from the VAE component is combined with the pointer-generator component to generate the summary at the decoder.
An LSTM encoder encodes the source text. From the final state of the encoder, we sample a latent variable from a Gaussian distribution. An LSTM decoder is used to generate the summary. Each decoding step uses the latent variable from the encoder and an attention mechanism to generate a probability distribution over the vocabulary. A pointer-generator network is used in tandem to aid in copying words from the source text.
We now define the notation used in the rest of the paper. The training data consists of N content-summary pairs $(s, t)$. $s = (w^s_1, w^s_2, \ldots, w^s_p)$ represents a source text of length $p$, and $t = (w^t_1, w^t_2, \ldots, w^t_q)$ represents the target summary of length $q$. The source and target embeddings of a word $w_i$ are represented as $e^s_{w_i}$ and $e^t_{w_i}$ respectively. The embeddings of the words in the source text are encoded by a bidirectional LSTM encoder.
$h^s = \mathrm{ENC}^s_e(e^s_{w_1}, e^s_{w_2}, \ldots, e^s_{w_p})$ (1)
where $\mathrm{ENC}^s_e$ is the bidirectional LSTM encoder and $h^s$ is the final encoder state of the source text. Since we use a bidirectional encoder, the hidden states of the forward and backward encoders are concatenated and passed through a feed-forward network to get $h^s$. The final state of the encoder is passed through two feed-forward layers to get $\mu$ and $\sigma$ respectively.
$\mu = \mathrm{Linear}_1(h^s)$ and $\sigma = \mathrm{Linear}_2(h^s)$ (2)
Although $\mathrm{Linear}_1$ and $\mathrm{Linear}_2$ are both feed-forward layers, they are labelled separately to reflect that their weights are updated differently during backpropagation. The latent variable, which encodes the representation of the source sentence, is modelled as a one-dimensional tensor whose size is a hyperparameter. We set this hyperparameter to the dimension of the encoder and decoder. A random variable is sampled from the normal distribution $\mathcal{N}(0, I)$ and transformed to the required $\mathcal{N}(\mu, \sigma)$ distribution.
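The sampling step described above can be sketched as follows. This is a minimal illustration rather than the submitted implementation: it assumes the usual reparameterization $z = \mu + \sigma \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, treats $\sigma$ directly as a standard deviation (many implementations predict a log-variance instead), and uses the hypothetical helper names `linear` and `sample_latent`.

```python
import random


def linear(weights, bias, x):
    """Apply a feed-forward (affine) layer: y = W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def sample_latent(h_s, w_mu, b_mu, w_sigma, b_sigma, rng):
    """Sample z from N(mu, sigma) given the final encoder state h_s.

    Assumption: sigma is used as a standard deviation, and the sample is
    obtained by shifting and scaling noise drawn from N(0, I).
    """
    mu = linear(w_mu, b_mu, h_s)            # mu = Linear_1(h_s)
    sigma = linear(w_sigma, b_sigma, h_s)   # sigma = Linear_2(h_s)
    eps = [rng.gauss(0.0, 1.0) for _ in mu]  # eps ~ N(0, I)
    z = [m + s * e for m, s, e in zip(mu, sigma, eps)]
    return z, mu, sigma


# Toy usage with a 2-dimensional encoder state and latent variable.
rng = random.Random(0)
h_s = [1.0, 2.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
z, mu, sigma = sample_latent(h_s, identity, [0.5, -0.5],
                             identity, [0.0, 0.0], rng)
```

Because the noise is drawn independently of the network parameters, gradients can flow through $\mu$ and $\sigma$ during backpropagation.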
The LSTM decoder is initialized with the final state of the encoder, $h^s$. The input to the decoder at every time step is $z \oplus e_{w_t}$, the concatenation of the latent variable $z$ and the embedding of the previous word ($e_{w_t}$). During training, $e_{w_t}$ is the embedding of the previous word of the target summary; at test time, it is the embedding of the word generated by the decoder at step $t-1$.
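The decoder input described above, together with the copy step of the pointer-generator, can be sketched as follows. This is an illustration under assumptions: the mixing rule follows the standard pointer-generator formulation of See et al. (2017), which we assume carries over unchanged; out-of-vocabulary handling through an extended vocabulary is omitted; and the function names `decoder_input` and `final_distribution` are hypothetical.

```python
def decoder_input(z, prev_word_embedding):
    """Build the decoder input z (+) e: latent variable, then word embedding."""
    return list(z) + list(prev_word_embedding)


def final_distribution(p_gen, vocab_dist, attention, source_ids):
    """Mix the generation and copy distributions of a pointer-generator.

    P(w) = p_gen * P_vocab(w) + (1 - p_gen) * (attention mass on the
    source positions that hold word w).
    Simplification: every source word is assumed to be in the vocabulary.
    """
    mixed = [p_gen * p for p in vocab_dist]
    for weight, word_id in zip(attention, source_ids):
        mixed[word_id] += (1.0 - p_gen) * weight
    return mixed


# Toy usage: a 3-word vocabulary and a 2-token source that repeats word 2.
x_t = decoder_input([0.1, -0.2], [0.3, 0.4, 0.5])
dist = final_distribution(0.5, [0.5, 0.25, 0.25], [0.5, 0.5], [2, 2])
```

A low generation probability `p_gen` shifts mass towards words that appear in the source, which is the copying behaviour that the latent variable is intended to counterbalance.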
The loss function is the sum of three terms: the cross-entropy loss between the target and the generated summary, the coverage loss, and the Kullback-Leibler divergence (KL-D) loss between the $\mathcal{N}(0, I)$ distribution and the generated $\mathcal{N}(\mu, \sigma)$ distribution.
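Written out, the sum of these three terms takes the form below. This is a sketch under assumptions that the text does not state: the terms are combined with unit weights, $\sigma$ is a per-dimension standard deviation, $d$ is the size of the latent variable, and the KL term is taken in the direction that is usual for VAEs, from the generated distribution to the prior.

```latex
\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \mathcal{L}_{\mathrm{cov}} + \mathcal{L}_{\mathrm{KL}},
\qquad
\mathcal{L}_{\mathrm{KL}}
  = D_{\mathrm{KL}}\big(\mathcal{N}(\mu, \sigma^{2} I)\,\|\,\mathcal{N}(0, I)\big)
  = \frac{1}{2} \sum_{j=1}^{d} \left( \mu_j^{2} + \sigma_j^{2} - \ln \sigma_j^{2} - 1 \right)
```

The KL term vanishes when $\mu = 0$ and $\sigma = 1$ in every dimension, so it pulls the latent distribution towards the prior from which the noise is sampled.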

Multi-Stage Architecture
Gehrmann et al. (2018) study the effect of eliminating duplicate or insignificant words through word-level content selection. Motivated by this content selection, we apply a sentence-level selection before generating an abstractive summary, as described in Algorithm 1.
Algorithm 1 Multi-Stage Architecture
1: Train two language models, one using BERTSUM for extractive summarization and the other using PGN or VAE-PGN for abstractive summarization.
2: Get scores for sentences from the source text using BERTSUM.
3: Reorder the sentences using Algorithm 2.
4: Generate the abstractive summary using the abstractive model (PGN or VAE-PGN).

We obtain a score for each sentence through BERTSUM (Liu, 2019b), a high-performance extractive summarization model based on fine-tuned BERT. The scored sentences are then reordered using Algorithm 2. The number of input sentences for the abstractive summary is selected dynamically according to the parameter min_words in Algorithm 2. This hyperparameter is tuned based on the average target summary length of a given dataset. As a result, abstractive summarization is performed on a filtered input from which the unimportant sentences have been removed.
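One way the sentence selection step could work is sketched below. Algorithm 2 is not reproduced in this text, so the procedure is hypothetical: it assumes that sentences are taken in decreasing score order until at least min_words words are collected, and that the kept sentences are then restored to their original document order. The function name `select_sentences` is also an assumption.

```python
def select_sentences(sentences, scores, min_words):
    """Keep the highest-scoring sentences until min_words words are reached.

    Hypothetical stand-in for Algorithm 2: the kept sentences are returned
    in their original document order.
    """
    # Indices of sentences, best extractive score first.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: scores[i], reverse=True)
    chosen, total_words = [], 0
    for i in ranked:
        if total_words >= min_words:
            break
        chosen.append(i)
        total_words += len(sentences[i].split())
    return [sentences[i] for i in sorted(chosen)]


# Toy usage with made-up sentences and scores.
source = ["a b c", "d e", "f g h i"]
kept = select_sentences(source, [0.2, 0.9, 0.5], min_words=5)
```

With this rule the number of kept sentences varies with their length, which matches the description of the input size being selected dynamically.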

Datasets and Experimental Setup
TL;DR Reddit corpus (Völske et al., 2017): This is the dataset for the TL;DR challenge.
The extractive and abstractive summarization modules are fine-tuned on each dataset to obtain the respective results. Apart from the min_words parameter, we borrow all hyperparameters from the baseline code implementations. The min_words parameter is initialized to the average target summary length and then tuned with a grid search around this initial value.

Metrics and Baselines
The ROUGE metric (Lin, 2004) is used for automatic evaluation. ROUGE-N (R-n) scores refer to the n-gram overlap between the generated and the reference summary. The ROUGE-L (R-L) score refers to the overlap based on the Longest Common Subsequence. All scores are calculated using the virtual machine provided by competition (2018) on the TIRA platform, which uses the tagucci (2019) project to calculate the scores. Due to resource and time constraints, we compared models on a validation set, chose the best-scoring models, and then evaluated them on the official test set.
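The two overlap measures can be illustrated with the short sketch below. It is not the scoring tool used on TIRA: it computes only the recall variant on whitespace tokens, without stemming or any other preprocessing, and the function names are our own.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n])
                   for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n):
    """Fraction of reference n-grams that also occur in the candidate."""
    ref = ngrams(reference, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Counter intersection clips each n-gram count to the smaller side.
    overlap = sum((ngrams(candidate, n) & ref).values())
    return overlap / total


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def rouge_l_recall(candidate, reference):
    """LCS length divided by the reference length."""
    if not reference:
        return 0.0
    return lcs_length(candidate, reference) / len(reference)


generated = "the cat sat on the mat".split()
gold = "the cat lay on the mat".split()
r1 = rouge_n_recall(generated, gold, 1)
r2 = rouge_n_recall(generated, gold, 2)
rl = rouge_l_recall(generated, gold)
```

Unlike R-n, R-L rewards words that appear in the same order even when they are not adjacent, so the two measures can move in different directions for the same pair of summaries.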
PGN is the pointer-generator model with copy and coverage mechanisms. VAE-PGN is the VAE-based pointer-generator model with copy and coverage mechanisms, as shown in Figure 1. Unified PGN and Unified VAE-PGN refer to the Unified Architecture with the PGN and VAE-PGN models, respectively, used as the abstractive summarization module. Although the unified model increases the R-1 and R-2 scores for both abstractive models, it reduces the R-L score. We suspect that this decline is caused by the different tokenization used in the abstractive model (Stanford NLP tokenizer) and the extractive model (BERT tokenizer).
The results on the validation set and the test set are presented in Table 2 and Table 3 respectively. The results show that the unified architecture outperforms the standalone abstractive model for both the PGN and the VAE-PGN based models, and that the Unified VAE-PGN model achieves a better R-1 score than Unified PGN on the validation set. On the test set, however, Unified PGN and Unified VAE-PGN perform similarly.

Conclusions
In this paper, we propose a Unified VAE-PGN model and an effective multi-stage architecture for abstractive summarization. Our model eliminates non-critical information at the sentence level and generates diverse summaries using a continuous-space representation of the information. We evaluate our models on the benchmark Reddit dataset as part of the TL;DR challenge and show that our proposed models outperform the baseline models. In future work, we plan to add word-level content selection to the multi-stage architecture in order to eliminate non-critical information at the word level.