Transformer-based Model for Single Documents Neural Summarization

We propose a system that improves performance on the single-document summarization task using the CNN/DailyMail and Newsroom datasets. It follows the popular encoder-decoder paradigm, but with an extra focus on the encoder. The intuition is that the probability of correctly decoding a piece of information depends significantly on the pattern and correctness of its encoding. Hence we introduce encode-encode-decode, a framework that encodes the source text first with a transformer and then with a sequence-to-sequence (seq2seq) model. We find that the transformer and the seq2seq model complement each other well, making for a richer encoded vector representation. We also find that paying more attention to the vocabulary of target words during abstraction improves performance. We test our hypothesis and framework on extractive and abstractive single-document summarization and evaluate using the standard CNN/DailyMail dataset and the recently released Newsroom dataset.


Introduction
Document summarization has been an active area of research, especially on the CNN/DailyMail dataset. Even with recent progress (Gehrmann et al., 2018; Chen and Bansal, 2018), there is still work to be done in the field. Although extractive summarization seems less challenging because no new words are generated, identifying the salient parts of a document without any guide in the form of a query is a substantial problem.
More recent approaches are data-driven and implement a variety of neural networks (Jadhav and Rajan, 2018; Narayan et al., 2017), mostly within an encoder-decoder framework (Narayan et al., 2018; Cheng and Lapata, 2016).
Similar to the work of Nallapati et al. (2017), we consider the extractive summarization task as a sequence classification problem. A major challenge with this approach is that the training data is not labelled at the sentence level. Hence creating such labels from the abstractive ground-truth summary is crucial. We improve on the approach of Nallapati et al. (2017) for generating this labelled data, and evaluation shows that our extractive labels are more accurate. Another hurdle in this task is the imbalance in the created data: most of the document's sentences are labelled 0 (excluded from the summary) rather than 1, because only a few sentences actually make up a summary. Hence the neural extractor tends to be biased and suffers from many false-negative labels. We also present a simple approach to reduce this bias. Most importantly, our neural extractor uses the recent bidirectional transformer encoder (Vaswani et al., 2017), with details provided in Section 3.1.
More interesting than extractive summaries, abstractive summaries correlate better with the summaries that a human would write. Abstractive summarization does not simply reproduce salient parts of the document verbatim, but rewrites them in a concise form, usually introducing novel words along the way by using key abstraction techniques such as paraphrasing (Gupta et al., 2018), compression (Filippova et al., 2015) or sentence fusion (Barzilay and McKeown, 2005). However, it faces major challenges such as grammatical correctness and repetition of words, especially when generating long sentences. Nonetheless, remarkable progress has been achieved with the use of seq2seq models (Gehrmann et al., 2018; See et al., 2017; Chopra et al., 2016; Rush et al., 2015) and with a reward function in place of a loss function via deep reinforcement learning (Chen and Bansal, 2018; Paulus et al., 2017; Ranzato et al., 2015).
We see abstractive summarization in the same light as several other authors (Chen and Bansal, 2018; Hsu et al., 2018; Liu et al., 2018): extract salient sentences and then abstract, which shares the advantages of the popular divide-and-conquer strategy. Moreover, it mitigates the problem of information redundancy, since the mini-source, i.e., the extracted document, contains distinct salient sentences. Our abstractive model is a blend of the transformer and the seq2seq model. We notice improvements using this framework in the abstractive setting. This is because, to generate coherent and grammatically correct sentences, we need to be able to learn long-term dependency relations. The transformer complements the seq2seq model in this regard with its multi-head self-attention. Also, the individual attention heads in the transformer model mimic behavior related to the syntactic and semantic structure of the sentence (Vaswani et al., 2017, 2018). Hence, the transformer produces a richer, more meaningful vector representation of the input, from which we can encode a fixed state vector for decoding.
The main contributions of this work are:
• We present a simple algorithm for building a sentence-labelled corpus for extractive summarization training that produces more accurate results.
• We propose a novel framework for the task of extractive single-document summarization that improves on the current state of the art on two specific datasets.
• We introduce the encode-encode-decode paradigm, using two complementary models (transformer and seq2seq) to generate abstractive summaries, which improves on the current top performance on two specific datasets.

Task Definition
Given a document $D = (S_1, \ldots, S_n)$ with $n$ sentences comprising a set of words $D_W = \{d_1, \ldots, d_w\}$, the task is to produce an extractive ($S_E$) or abstractive ($S_A$) summary that contains the salient information in $D$, where $S_E \subseteq D_W$ and $S_A = \{w_1, \ldots, w_s\} \mid \exists\, w_i \in D_W$.

Method
We describe our summarization model in two modules: extraction and abstraction. The abstraction module simply learns to paraphrase and compress the sentences output by the extraction module.

Extraction
As illustrated in Figure 1, our model classifies each sentence in a document as being summary-worthy or not. To enhance this sequence classification process, we first encode the input document with a Transformer. A logistic classifier then learns to label each sentence in the transformed document.

TRANSFORMER Encoder
The input to the Transformer is the document representation, which is a concatenation of the vector representation of its sentences. Each sentence representation is obtained by averaging the vector representation of its constituent words.
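The sentence representation described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function names, the toy two-dimensional embeddings, and the choice to skip out-of-vocabulary words (falling back to a zero vector) are all assumptions made for the example.

```python
from typing import Dict, List


def sentence_vector(tokens: List[str], embeddings: Dict[str, List[float]], dim: int) -> List[float]:
    """Average the embeddings of the tokens that have a vector.

    Assumption: unknown words are skipped, and a sentence with no known
    words maps to a zero vector.
    """
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def document_matrix(sentences: List[List[str]], embeddings: Dict[str, List[float]], dim: int) -> List[List[float]]:
    """Stack one averaged vector per sentence, in document order."""
    return [sentence_vector(s, embeddings, dim) for s in sentences]


# Toy 2-dimensional embeddings (hypothetical values, for illustration only).
toy_embeddings = {"real": [1.0, 0.0], "madrid": [0.0, 1.0], "wins": [1.0, 1.0]}
doc = [["real", "madrid"], ["madrid", "wins"], ["unknown"]]
doc_repr = document_matrix(doc, toy_embeddings, dim=2)
```

Each row of `doc_repr` plays the role of one sentence position in the sequence fed to the encoder.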
The transformer encoder is composed of 6 stacked identical layers. Each layer contains 2 sub-layers: multi-head self-attention and a position-wise fully connected feed-forward network. Full details with implementation are provided in Vaswani et al. (2017, 2018). The bidirectional Transformer, often referred to as the Transformer encoder, learns a rich representation of the document that captures long-range syntactic and semantic dependencies between the sentences.

Sentence Extraction
The final layer of our extraction model is a softmax layer which performs the classification. We learn the probability $y_p$ of including a sentence in the summary by minimizing the cross-entropy loss between the predicted probabilities $y_p$ and the true sentence labels $y_t$ during training. Here $W$ and $b$ are the trainable parameters of the softmax layer and $S'_i$ is the transformed representation of the $i$-th sentence in document $D_j$.
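As a hedged illustration of this classification step, the sketch below applies a two-class softmax layer to one transformed sentence vector and computes the cross-entropy against its label. The two-row layout of `W` (class 0 = exclude, class 1 = include), the function names and the toy values are assumptions for the example, not the paper's exact formulation.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def include_probability(sentence_vec: List[float], W: List[List[float]], b: List[float]) -> List[float]:
    """Return [P(exclude), P(include)] for one transformed sentence vector.

    Assumption: W has one row per class and b one bias per class.
    """
    logits = [sum(w * x for w, x in zip(row, sentence_vec)) + bias
              for row, bias in zip(W, b)]
    return softmax(logits)


def cross_entropy(predicted: List[float], true_label: int) -> float:
    """Negative log-probability assigned to the true class."""
    return -math.log(predicted[true_label])


# With all-zero parameters both classes are equally likely.
W = [[0.0, 0.0], [0.0, 0.0]]
b = [0.0, 0.0]
probs = include_probability([0.3, -0.2], W, b)
loss = cross_entropy(probs, 1)
```

Training would adjust `W` and `b` to lower this loss averaged over all labelled sentences.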

Extractive Training
Filtering. Currently, no extractive summarization dataset exists. Hence it is customary to create one from the abstractive ground-truth summaries (Chen and Bansal, 2018; Nallapati et al., 2017). We observe, however, that some summaries are more abstractive than others. Since the extractive labels are usually obtained by some form of n-gram overlap matching, the more abstractive the ground truth is, the less accurate the tuned extractive labels are. We filter out such samples, as illustrated in Table 1. In our work, we consider a reference summary $R_j$ as overly abstractive if it has zero bigram overlap with the corresponding document $D_j$, excluding stop words.
See et al. (2017) and Paulus et al. (2017) truncate source documents to 400 tokens and target summaries to 100 tokens. We exclude documents with more than 30 sentences entirely and truncate or pad the remaining documents as necessary to 20 sentences per document. From the more than 280,000 and 1.3M training pairs in the CNN/DM and Newsroom training sets respectively, our filtering yields abstractive summarization sub-datasets of approximately 150,000 and 250,000 pairs. We report evaluation scores using the training sets as-is versus our filtered training sets, to show that filtering the training samples does improve results.
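The two filtering rules above can be sketched as follows. This is an illustrative reading, not the authors' code: the stop-word list is a small placeholder, stop words are assumed to be removed before bigrams are formed, and padding with empty strings is an assumption.

```python
from typing import List, Optional, Set, Tuple

# Placeholder stop-word list; the list actually used is not given in the text.
STOP_WORDS = {"the", "a", "an", "in", "on", "of", "to", "and"}


def content_bigrams(tokens: List[str]) -> Set[Tuple[str, str]]:
    """Bigrams formed after dropping stop words (one reading of 'excluding stop words')."""
    content = [t for t in tokens if t not in STOP_WORDS]
    return set(zip(content, content[1:]))


def is_overly_abstractive(document_tokens: List[str], summary_tokens: List[str]) -> bool:
    """True when the summary shares no content bigram with the document."""
    return not (content_bigrams(document_tokens) & content_bigrams(summary_tokens))


def fit_to_length(sentences: List[str], max_allowed: int = 30,
                  target: int = 20, pad: str = "") -> Optional[List[str]]:
    """Drop documents over max_allowed sentences; truncate or pad the rest to target."""
    if len(sentences) > max_allowed:
        return None
    return (sentences + [pad] * target)[:target]


document = "anthony bourdain visits quebec in the next episode".split()
abstractive_summary = "11 things to know about quebec".split()
extractive_summary = "bourdain visits quebec".split()
```

On the Table 1 style example, `abstractive_summary` shares no content bigram with `document` and would be filtered out, whereas `extractive_summary` would be kept.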
Table 1:
Document: world-renowned chef, author and emmy winning television personality anthony bourdain visits quebec in the next episode of " anthony bourdain : parts unknown, " airing sunday, may 5, at 9 p.m. et. follow the show on twitter and facebook.
Summary: 11 things to know about quebec. o canada! our home and delicious land.
Tuning. We use a very simple approach to create extractive labels for our neural extractor. We hypothesize that each reference summary sentence originates from at least one document sentence. The goal is to identify the most likely document sentence. Unlike the approach of Nallapati et al. (2017), which greedily adds to the summary the sentences that maximize the ROUGE score, our approach is closer to the model of Chen and Bansal (2018), which scores each individual reference sentence by its similarity with each sentence in the corresponding document. However, for each $t$-th sentence of the reference summary $R_j$, we score its similarity to each $i$-th sentence of document $D_j$ by their bigram overlap, in contrast to Chen and Bansal (2018), who use the ROUGE-L recall score. Additionally, each time both words of an overlapping bigram are stop words, we decrement the similarity score by 1; for example, (on, the) is an invalid bigram overlap while (the, President) is valid. We do this to capture important similarities rather than trivial ones. For statistical purposes, we evaluate our extractive trainer, which tunes the document's sentences to 0's and 1's, against that of Nallapati et al. (2017).
Table 2: ROUGE-F1 (%) scores of manually crafted extractive trainers for producing sentence-level extractive labels for CNN/DM.
We apply our tuned dataset to the neural extractive summarizer explained in Sections 3.1.1 and 3.1.2 and report results in Tables 3 and 4.
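The labelling procedure from the Tuning paragraph can be sketched roughly as below. Several details are assumptions made for the example: an all-stop-word bigram is counted and then decremented (so it contributes nothing), a document sentence is labelled only when its best score is positive, and ties go to the earliest sentence. The function names and the toy stop-word set are hypothetical.

```python
from typing import List, Set, Tuple


def bigrams(tokens: List[str]) -> Set[Tuple[str, str]]:
    """Set of adjacent word pairs in a tokenized sentence."""
    return set(zip(tokens, tokens[1:]))


def similarity(doc_sentence: List[str], ref_sentence: List[str], stop_words: Set[str]) -> int:
    """Bigram overlap, decremented by 1 for every shared bigram made only of stop words."""
    shared = bigrams(doc_sentence) & bigrams(ref_sentence)
    penalty = sum(1 for first, second in shared
                  if first in stop_words and second in stop_words)
    return len(shared) - penalty


def extractive_labels(document: List[List[str]], reference: List[List[str]],
                      stop_words: Set[str]) -> List[int]:
    """Label 1 the best-matching document sentence for each reference sentence."""
    labels = [0] * len(document)
    for ref_sentence in reference:
        scores = [similarity(s, ref_sentence, stop_words) for s in document]
        best = max(range(len(document)), key=scores.__getitem__)
        if scores[best] > 0:  # assumption: ignore reference sentences with no match
            labels[best] = 1
    return labels


stop = {"the", "on", "it"}
document = [["the", "president", "spoke", "on", "the", "economy"],
            ["markets", "fell", "on", "the", "news"],
            ["it", "rained"]]
reference = [["the", "president", "spoke"], ["markets", "fell"]]
labels = extractive_labels(document, reference, stop)
```

Here the first two document sentences each match one reference sentence and receive label 1, while the third stays at 0.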
Imbalanced Extractive Labels. Because a summary is a snippet of the document, the majority of the labels are rightly 0 (excluded from the summary). Hence a high classification accuracy does not necessarily translate to a highly salient summary. Therefore, we consider the F1 score, which is the harmonic mean of precision and recall, and apply an early stopping criterion when minimizing the loss: training stops if the F1 score does not increase after a set number of training epochs. Additionally, during training we synthetically balance the labels by forcing some random sentences to be labelled as 1 and subsequently masking their weights.
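To make the point about accuracy versus F1 concrete, the sketch below computes F1 for the positive class and a simple patience-based stopping check. The `should_stop` rule and its `patience` parameter are assumptions for illustration; the number of epochs actually used for extraction is not stated.

```python
from typing import List


def f1_score(true_labels: List[int], predicted_labels: List[int]) -> float:
    """F1 for the positive class (label 1): harmonic mean of precision and recall."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def should_stop(f1_history: List[float], patience: int) -> bool:
    """True when the best F1 has not improved during the last `patience` epochs."""
    if len(f1_history) <= patience:
        return False
    return max(f1_history[-patience:]) <= max(f1_history[:-patience])


# A classifier that always predicts 0 is 80% accurate here but has F1 = 0.
always_zero_f1 = f1_score([0, 0, 0, 0, 1], [0, 0, 0, 0, 0])
mixed_f1 = f1_score([1, 0, 1, 0], [1, 1, 0, 0])
```

The always-zero predictor illustrates why accuracy alone is misleading on imbalanced sentence labels.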
Number of sentences to extract. Choosing the number of sentences to extract is not trivial, as it significantly affects the summary length and hence the evaluation scores. Chen and Bansal (2018) introduced a stop criterion in their reinforcement learning process. We implemented a basic subjective approach based on the dataset. Since the gold summaries are typically 3 or 4 sentences long, we extract the top 3 sentences by default, but additionally extract a fourth sentence if its confidence score from the softmax function is greater than 0.55.
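The selection rule above is simple enough to state directly in code. This is a sketch under assumptions: the function name is hypothetical, and returning the chosen indices in document order is a presentation choice not specified in the text.

```python
from typing import List


def select_sentences(confidences: List[float], default_k: int = 3,
                     threshold: float = 0.55) -> List[int]:
    """Pick the top `default_k` sentences, plus one more if it clears `threshold`.

    `confidences` holds the softmax probability of inclusion per sentence.
    """
    ranked = sorted(range(len(confidences)), key=lambda i: confidences[i], reverse=True)
    chosen = ranked[:default_k]
    if len(ranked) > default_k and confidences[ranked[default_k]] > threshold:
        chosen.append(ranked[default_k])
    return sorted(chosen)  # assumption: emit the summary in document order


with_fourth = select_sentences([0.9, 0.2, 0.6, 0.8, 0.7])
without_fourth = select_sentences([0.9, 0.2, 0.5, 0.8, 0.7])
```

In the first call the fourth-ranked sentence has confidence 0.6 and is added; in the second it has 0.5 and is left out.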

Abstraction
The input to our abstraction module is a subset of the document's sentences, namely the output of the extraction phase from Section 3.1.2. For each document $D_j$, initially comprising $n$ sentences, we abstract its $m$ extracted sentences $S^E_j$, where $m < n$ and $S^E_j \subseteq D_j$, by learning to jointly paraphrase (Gupta et al., 2018) and compress (Filippova et al., 2015). We add one more encoding layer to the standard encoder-aligner-decoder (Luong et al., 2015), i.e., encode-encode-align-decode. The intuition is to improve the performance of the decoder by providing an interpretable and richly encoded sequence. For this, we interleave two efficient models: the transformer (Vaswani et al., 2017) and the sequence-to-sequence model (Sutskever et al., 2014), specifically a GRU-RNN (Chung et al., 2014). Details are presented in the subsequent subsections.

Encoder - TRANSFORMER
The transformer encoder has the same implementation as in Vaswani et al. (2017), as explained in Section 3.1.1, except that the inputs are sentence-level rather than document-level vector representations. Also, the sentence representations in this module are not averaged constituent word representations, as in the extraction module, but concatenated ones. That is, for each $i$-th sentence in equation 7, its vector representation is the concatenation of its constituent word embeddings. The output of equation 8 serves as the input vector representation to the transformer encoder. We use the transformer encoder during abstraction as a sort of pre-training module for the input sentence.

Encoder - GRU-RNN
We use a single-layer unidirectional GRU-RNN whose input is the output of the transformer. The GRU-RNN encoder (Chung et al., 2014) produces a fixed-state vector representation of the transformed input sequence using the GRU update equations, where $r$ and $z$ are the reset and update gates respectively, $W$ and $U$ are the network's parameters, $x_t$ is the hidden state vector at timestep $t$, $s_t$ is the input vector and $\odot$ represents the Hadamard product.
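For reference, the standard GRU update of Chung et al. (2014) can be written in the notation defined above ($s_t$ as input, $x_t$ as hidden state). This is offered as a sketch under assumptions: the per-gate subscripts on $W$ and $U$, the candidate state $\tilde{x}_t$, and the omission of bias terms are choices made here, and the exact form used in this work may differ.

```latex
\begin{align}
  z_t &= \sigma\left(W_z s_t + U_z x_{t-1}\right) \\
  r_t &= \sigma\left(W_r s_t + U_r x_{t-1}\right) \\
  \tilde{x}_t &= \tanh\left(W_x s_t + U_x \left(r_t \odot x_{t-1}\right)\right) \\
  x_t &= (1 - z_t) \odot x_{t-1} + z_t \odot \tilde{x}_t
\end{align}
```

The reset gate $r_t$ controls how much of the previous state enters the candidate, and the update gate $z_t$ interpolates between the previous state and the candidate; the final $x_t$ after the last input is the fixed-state vector handed to the decoder.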

Decoder - GRU-RNN
The fixed-state vector representation produced by the GRU-RNN encoder is used as the initial state for the decoder. At each time step, the decoder receives the previously generated word $y_{t-1}$ and the hidden state $s_{t-1}$ from time step $t-1$. The output word $y_t$ at each time step is obtained from a softmax probability of the vector in equation 11 over the set of vocabulary words $V$.

Experiments
We used pre-trained 300-dimensional GloVe word embeddings (Pennington et al., 2014; https://nlp.stanford.edu/projects/glove/). The transformer encoder was set up with the transformer_base hyperparameter setting from the tensor2tensor library (Vaswani et al., 2018; https://github.com/tensorflow/tensor2tensor), but the hidden size and dropout were reset to 300 and 0.0 respectively. We also use 300 hidden units for the GRU-RNN encoder. The tensor2tensor library comes with pre-processed/tokenized versions of the dataset; however, we perform these operations independently. For abstraction, our target vocabulary is a set of approximately 50,000 and 80,000 words for the CNN/DM and Newsroom corpora respectively. It contains the words in our target training and test sets that occur at least twice. Experiments showed that using this subset rather than the more than 320,000 vocabulary words contained in GloVe improves both the training time and the performance of the model. During abstractive training, we match each summary sentence with its corresponding extracted document sentence using equation 6 and learn to minimize the seq2seq loss implemented in the TensorFlow API with AdamOptimizer (Kingma and Ba, 2014). We employ early stopping when the validation loss does not decrease after 5 epochs. We apply gradient clipping at 5.0 (Pascanu et al., 2013). We use greedy decoding during training and validation and set the maximum number of iterations to 5 times the target sentence length. Beam-search decoding is used during inference.
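The target-vocabulary rule described above (keep words that occur at least twice) can be sketched as follows. The function name, the special tokens and the alphabetical index order are assumptions for illustration; the text does not say which special symbols were used.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def build_vocabulary(tokenized_sentences: Iterable[List[str]], min_count: int = 2,
                     specials: Tuple[str, ...] = ("<pad>", "<unk>", "<s>", "</s>")) -> Dict[str, int]:
    """Map each word seen at least `min_count` times to an integer id.

    Assumption: special tokens take the first ids; kept words follow alphabetically.
    """
    counts = Counter(token for sentence in tokenized_sentences for token in sentence)
    kept = sorted(token for token, count in counts.items() if count >= min_count)
    return {token: index for index, token in enumerate(list(specials) + kept)}


target_sentences = [["real", "madrid", "wins"], ["real", "madrid", "loses"]]
vocab = build_vocabulary(target_sentences)
```

Words seen only once, such as "wins" here, would be mapped to the unknown token at training time, which keeps the output softmax far smaller than the full embedding vocabulary.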

Datasets
We evaluate our models on the non-anonymized version of the CNN/DM corpus (Hermann et al., 2015; Nallapati et al., 2016) and the recent Newsroom dataset (Grusky et al., 2018) released by Connected Experiences Lab. The Newsroom corpus contains over 1.3M news articles together with various metadata such as the title, summary, coverage and compression ratio. CNN/DM summaries are more than twice as long as Newsroom summaries, with average lengths of 66 and 26 words respectively.
Table 5 (partial):
Abstractive Model                     R-1     R-2     R-L
RL+Intra-Att (Paulus et al., 2017)    41.16   15.75   39.08
KIGN+Pred (Li et al., 2018)           38.95   17.12   35.68
FAST (Chen and Bansal, 2018)          40.88   17.80   38.54
Bottom-Up (Gehrmann et al., 2018)
Table 6: ROUGE-F1 (%) scores (with 95% confidence interval) of various abstractive models on the Newsroom released test set. * marks results taken from Grusky et al. (2018).

Evaluation
Following previous work (See et al., 2017; Nallapati et al., 2017; Chen and Bansal, 2018), we evaluate on both datasets using the standard ROUGE-1, ROUGE-2 and ROUGE-L metrics (Lin, 2004). ROUGE calculates the appropriate n-gram word overlap between the reference and system summaries.
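To illustrate what the n-gram overlap measures, here is a simplified ROUGE-N F1 computation for a single reference and candidate. It is only a sketch of the idea: it omits the stemming, confidence intervals and other options of the official ROUGE-1.5.5 script, does not cover ROUGE-L, and should not be expected to reproduce reported scores.

```python
from collections import Counter
from typing import List


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Multiset of n-grams in a token list (empty if the list is shorter than n)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(reference: List[str], candidate: List[str], n: int) -> float:
    """Simplified ROUGE-N F1 from clipped n-gram overlap."""
    ref, cand = ngram_counts(reference, n), ngram_counts(candidate, n)
    overlap = sum((ref & cand).values())  # Counter & keeps the minimum count per n-gram
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = ["a", "b", "c", "d"]
candidate = ["a", "b", "d"]
rouge_1 = rouge_n_f1(reference, candidate, 1)
rouge_2 = rouge_n_f1(reference, candidate, 2)
```

For this toy pair the candidate covers three of four reference unigrams but only one of three reference bigrams, so ROUGE-2 is noticeably lower than ROUGE-1.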

Results Analysis
We used the official pyrouge script (https://github.com/andersjo/pyrouge/tree/master/tools/ROUGE-1.5.5) with the options -n 2 -w 1.2 -m -a -c 95. Tables 3 and 5 present extractive and abstractive results on the CNN/DM dataset respectively, while Tables 4 and 6 present those for the Newsroom dataset. For clarity, we present results separately for each model and dataset. Our baseline non-filtered extractive (TRANS-ext) model is highly competitive with top models. Our TRANS-ext + filter model yields an average gain of about +1 and +9 points across the reported ROUGE variants on the CNN/DM and Newsroom datasets respectively, showing that our model does a better job at identifying the most salient parts of the document than existing state-of-the-art extractive models. The margin is large on the Newsroom dataset because the existing baselines are just LEAD-3 and the TEXTRANK of Barrios et al. (2016). The Newsroom dataset was recently released and is yet to be thoroughly explored; however, it is a larger dataset and contains more diverse summaries, as analyzed by Grusky et al. (2018).
We also examined the effect of training with imbalanced extractive labels, which usually leads to a bias towards the majority class. Interestingly, our extractive model shows a 20% increase in F-score when trained with balanced labels. Replacing the transformer encoder with a seq2seq encoder resulted in a drop of about 2 ROUGE points, showing that the transformer encoder does learn features that add meaning to the vector representation of our input sequence.
Our baseline non-filtered abstractive (TRANS-ext + abs) model is also highly competitive with top models, with a drop of 0.81 ROUGE-2 points against the model of Gehrmann et al. (2018), which is the current state of the art. Our TRANS-ext + filter + abs model yields an average gain of about +0.5 and +7 points across the reported ROUGE variants on the CNN/DM and Newsroom datasets respectively, showing empirically that our model is an improvement over existing abstractive summarization models.
On the abstractiveness of our summaries: after aligning with the ground truth as explained in Section 3.2, about 60% of our extracted document sentences were paraphrased and compressed.
We highlight examples of some of the generated paraphrases in Table 7. Table 7 shows that our paraphrases are well formed, abstractive (e.g. powerful weapon → gun, messing around → fooling around), and capable of performing syntactic manipulations (e.g. for leeds in their win over salford → win over salford for leeds) and compression, as seen in all the examples.
Table 7:
O: the two clubs, who occupy the top two spots in spain's top flight, are set to face each other at the nou camp on sunday.
G: real madrid face barcelona in the nou camp
R: real madrid will travel to the nou camp to face barcelona on sunday.
O: dangelo conner, from new york, filmed himself messing around with the powerful weapon in a friend's apartment, first waving it around, then sending volts coursing through a coke can.
G: dangelo conner from new york was fooling around with his gun
R: dangelo conner, from new york, was fooling around with stun gun.
O: jamie peacock broke his try drought with a double for leeds in their win over salford on sunday.
G: jamie adam scored to win over salford for leeds
R: jamie peacock scored two tries for leeds in their win over salford.
O: britain's lewis hamilton made the perfect start to his world title defense by winning the opening race of the f1 season in australia sunday to lead a mercedes one-two in melbourne.
G: lewis hamilton wins first race of season in australia
R: lewis hamilton wins opening race of 2015 f1 season in australia.

Related Work
Summarization has remained an interesting and important NLP task for years due to its diverse applications: news headline generation, weather forecasting, email filtering, medical cases, recommendation systems, machine reading comprehension (MRC) and so forth (Khargharia et al., 2018).
Identifying the most salient part of the text as summary-worthy is crucial. Some authors have employed integer linear programming (Martins and Smith, 2009; Gillick and Favre, 2009; Boudin et al., 2015), graph concepts (Erkan and Radev, 2004), ranking with reinforcement learning (Narayan et al., 2018) and, most closely related to our work, binary classification (Shen et al., 2007; Nallapati et al., 2017; Chen and Bansal, 2018). Our binary classification architecture differs significantly from existing models because it uses a transformer as the building block instead of a bidirectional GRU-RNN (Nallapati et al., 2017) or a bidirectional LSTM-RNN (Chen and Bansal, 2018). To the best of our knowledge, our use of the transformer encoder as a building block for binary classification is novel, although the transformer has been successfully used for language understanding (Devlin et al., 2018), machine translation (MT) (Vaswani et al., 2017) and paraphrase generation.
For the generation of abstractive summaries, before the ubiquitous use of neural networks, manually crafted rules and graph techniques were used with considerable success. Barzilay and McKeown (2005) and Cheung and Penn (2014) fused two sentences into one using their dependency parse trees. Recently, sequence-to-sequence models (Sutskever et al., 2014) with attention (Chopra et al., 2016), copy mechanism (Gu et al., 2016), pointer-generator (See et al., 2017) and graph-based attention (Tan et al., 2017) have been explored. Since system-generated summaries are usually evaluated on ROUGE, it has been beneficial to directly optimize this metric during training via a suitable policy using reinforcement learning (Paulus et al., 2017; Celikyilmaz et al., 2018).
Similar to Rush et al. (2015) and Chen and Bansal (2018), we abstract by simplifying our extracted sentences. We jointly learn to paraphrase and compress, but unlike existing models based purely on RNNs, we implement a blend of two proven efficient models: the transformer encoder and the GRU-RNN. Whereas prior work paraphrased with a transformer decoder, we find that using the GRU-RNN decoder with a two-level stack of hybrid encoders (transformer and GRU-RNN) gives better performance. To the best of our knowledge, this architectural blend is novel.

Conclusion
We proposed two frameworks for extractive and abstractive summarization and demonstrated that each improves results over the existing state of the art. Our models are simple to train, and the intuition and hypothesis behind the formulation are straightforward and logical. The approach is well grounded, as parts of our model architecture have been used in other NLG-related tasks such as MT with state-of-the-art results.