BiSET: Bi-directional Selective Encoding with Template for Abstractive Summarization

The success of neural summarization models stems from the meticulous encodings of source articles. To overcome the impediments of limited and sometimes noisy training data, one promising direction is to make better use of the available training data by applying filters during summarization. In this paper, we propose a novel Bi-directional Selective Encoding with Template (BiSET) model, which leverages templates discovered from the training data to softly select key information from each source article and guide its summarization process. Extensive experiments are conducted on a standard summarization dataset, and the results show that the template-equipped BiSET model improves summarization performance significantly and sets a new state of the art.


Introduction
Abstractive summarization aims to shorten a source article or paragraph by rewriting while preserving the main idea. Due to the difficulties in rewriting long documents, a large body of research on this topic has focused on paragraph-level article summarization. Among these approaches, sequence-to-sequence models have become the mainstream and some have achieved state-of-the-art performance (Rush et al., 2015; Chopra et al., 2016). In general, the only information available to these models during decoding is the source article representations from the encoder and the words generated at previous time steps (Gu et al., 2016; Lin et al., 2018), and the previous words are themselves generated from the article representations. Since natural language text is complicated and verbose in nature, and training data is insufficient in size to help the models distinguish important article information from noise, sequence-to-sequence models tend to deteriorate as word generation accumulates, e.g., they frequently generate irrelevant and repeated words (Koehn and Knowles, 2017).
Template-based summarization (Zhou and Hovy, 2004) is an effective approach to traditional abstractive summarization, in which a number of hard templates are manually created by domain experts, and key snippets are then extracted and populated into the templates to form the final summaries. The advantage of this approach is that it can guarantee concise and coherent summaries without any training data. However, it is unrealistic to create all the templates manually, since this work requires considerable domain knowledge and is also labor-intensive. Fortunately, the summaries of some specific training articles can provide guidance for summarization similar to that of hard templates. Accordingly, these summaries are referred to as soft templates, or templates for simplicity, in this paper.
Despite their potential in relieving the verbosity and insufficiency problems of natural language data, templates have not been exploited to full advantage. For example, Cao et al. (2018a) simply concatenated the template encoding after the source article in their summarization work. To address this, we propose a Bi-directional Selective Encoding with Template (BiSET) model for abstractive sentence summarization. Our model involves a novel bi-directional selective layer with two gates to mutually select key information from an article and its template to assist with summary generation. Due to the limitations in obtaining handcrafted templates, we further propose a multi-stage process for automatic retrieval of high-quality templates from the training corpus. Extensive experiments were conducted on the Gigaword dataset (Rush et al., 2015), a public dataset widely used for abstractive sentence summarization, and the results appear to be quite promising. First, merely using the templates selected by our approach as the final summaries already achieves superior performance to some baseline models, demonstrating the effect of our templates. This may also indicate the availability of many quality templates in the corpus. Second, the template-equipped summarization model, BiSET, significantly outperforms all the state-of-the-art models. To evaluate the importance of the bi-directional selective layer and the two gates, we conducted an ablation study by removing them in turn, and the results show that, while both gates are necessary, the template-to-article (T2A) gate tends to be more important than the article-to-template (A2T) gate. A human evaluation further validates the effectiveness of our model in generating informative, concise and readable summaries.
The contributions of this work include:
• We propose a novel bi-directional selective mechanism with two gates to mutually select important information from both article and template to assist with summary generation.
• We develop a Fast Rerank method to automatically select high-quality templates from training corpus.
• Empirical evaluations on the benchmark dataset show our model has achieved a new state of the art.
• The source code of this work has been released for future research.

The Framework
Our framework includes three key modules: Retrieve, Fast Rerank, and BiSET. For each source article, Retrieve returns a few candidate templates from the training corpus. Then, the Fast Rerank module quickly identifies the best template among the candidates. Finally, BiSET mutually selects important information from the source article and the template to generate an enhanced article representation for summarization.

Retrieve
Following Cao et al. (2018a), this module uses a standard information retrieval library to retrieve a small set of candidates for fine-grained filtering. Before retrieval, all non-alphabetic characters (e.g., dates) are removed to eliminate their influence on article matching. The retrieval process then queries the training corpus with a source article to find a few (5 to 30) related articles, the summaries of which are treated as candidate templates.
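The retrieval step can be illustrated with a minimal sketch. It assumes a toy token-overlap scorer in place of the retrieval library, and the names `normalize` and `retrieve_candidates` are hypothetical; only the removal of non-alphabetic characters and the use of the retrieved articles' summaries as candidates come from the description above.

```python
import re
from collections import Counter
from typing import List, Tuple


def normalize(text: str) -> List[str]:
    """Lowercase and keep alphabetic tokens only (drops digits, dates, punctuation)."""
    return re.findall(r"[a-z]+", text.lower())


def retrieve_candidates(query: str,
                        corpus: List[Tuple[str, str]],
                        top_n: int = 5) -> List[str]:
    """Return the summaries of the top_n training articles closest to `query`.

    `corpus` holds (article, summary) pairs. The score is plain token overlap,
    an assumed stand-in for the ranking function of a real retrieval engine.
    """
    query_counts = Counter(normalize(query))
    scored = []
    for article, summary in corpus:
        overlap = sum((query_counts & Counter(normalize(article))).values())
        scored.append((overlap, summary))
    # Stable sort: ties keep their corpus order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [summary for overlap, summary in scored[:top_n] if overlap > 0]


corpus = [
    ("stocks fell sharply in tokyo", "tokyo stocks fall"),
    ("oil prices rose in london", "oil prices up"),
    ("the weather is sunny", "sunny day"),
]
candidates = retrieve_candidates("oil prices rose 3 % in new york", corpus, top_n=2)
```

In practice `top_n` would be set within the 5 to 30 range mentioned above, and the candidates would then be passed to Fast Rerank.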

Fast Rerank
The above retrieval process is essentially based on superficial word matching and cannot measure the deep semantic relationship between two articles. Therefore, the Fast Rerank module is developed to identify the best template from the candidates based on their deep semantic relevance to the source article. We regard the candidate with the highest relevance as the template. As illustrated in Figure 1, this module consists of a Convolution Encoder Block, a Similarity Matrix and a Pooling Layer.
Convolution Encoder Block. This block maps the input article and its candidate templates into high-level representations. The popular ways to do this use either a recurrent neural network (RNN) or a stack of convolutional neural networks (CNN), but neither is suitable for our problem. This is because a source article is usually much longer than a template, and both RNN and CNN may lead to semantic irrelevance after encoding. Instead, we implement a new convolution encoder block which includes a word embedding layer, a 1-D convolution followed by a non-linearity function, and residual connections (Gehring et al., 2017). Formally, given word embeddings $\{e_i\}_{i=1}^{E}$, $e_i \in \mathbb{R}^{d}$, of an article, we use a 1-D convolution with kernel $k \in \mathbb{R}^{2d \times kd}$ and bias $b_h \in \mathbb{R}^{2d}$ to extract the n-gram features:
$$h_i = k \cdot [e_{i-k/2}; \ldots; e_{i+k/2}] + b_h$$
where $h_i \in \mathbb{R}^{2d}$. We pad both sides of an article/template with zeros to keep the length fixed. After that, we employ the gated linear unit (GLU) (Dauphin et al., 2017) as our activation function to control the proportion of information to pass through. GLU takes half the dimension of $h_i$ as input and reduces the input dimension to $d$. Let $h_i = [h_i^1; h_i^2]$ with $h_i^1, h_i^2 \in \mathbb{R}^{d}$; then
$$r_i = h_i^1 \otimes \sigma(h_i^2)$$
where $r_i \in \mathbb{R}^{d}$, $\sigma$ is the sigmoid function, and $\otimes$ means element-wise multiplication. To retain the original information, we add residual connections from the input of the convolution layer to the output of this block:
$$z_i = r_i + e_i$$
Similarity Matrix. The above encoder block generates a high-level representation for each source article/candidate template.
Then, a similarity matrix $S \in \mathbb{R}^{m \times n}$ is calculated for a given article representation, $\mathbf{S} \in \mathbb{R}^{m \times d}$, and a template representation, $\mathbf{T} \in \mathbb{R}^{n \times d}$:
$$s_{ij} = f(\mathbf{S}_i, \mathbf{T}_j)$$
where $f$ is the similarity function, and the common options for $f$ include:
$$f(x, y) = \begin{cases} x^{\top} y, & \text{dot product} \\ x^{\top} W y, & \text{bilinear function} \\ \lVert x - y \rVert, & \text{Euclidean distance} \end{cases} \tag{4}$$
Most previous work uses the dot product or the bilinear function (Chen et al., 2016) for the similarity, yet we find that the family of Euclidean distance performs much better for our task. Therefore, we define our similarity function within this family.
Pooling Layer. This layer is intended to filter out unnecessary information in the matrix $S$. Before applying pooling operations such as max-pooling and k-max pooling (Kalchbrenner et al., 2014) over the similarity matrix, we note that there are repeated words in the source article, which we only want to count once. For this reason, we first identify some salient weights from $S$:
$$q = \mathrm{max}_{column}(S)$$
where $\mathrm{max}_{column}$ is a column-wise maximum function. We then apply k-max pooling over $q$ to select the $k$ most important weights, $p \in \mathbb{R}^{k}$. Finally, we apply a two-layer feed-forward network (FFN) to output a similarity score for the source article and the candidate template:
$$s = \mathrm{FFN}(p) \tag{9}$$
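The Fast Rerank pipeline described above can be sketched in plain Python. Several details are assumptions made only for illustration: the Euclidean-family similarity is taken to be $\exp(-\lVert x - y \rVert^2)$, the feed-forward network uses a tanh hidden layer with a sigmoid output, the kernel width is odd, and all function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def conv_glu_block(emb: Matrix, kernel: Matrix, bias: Vector, width: int = 3) -> Matrix:
    """Zero-padded 1-D convolution -> GLU -> residual connection.

    emb: E x d embeddings, kernel: 2d x (width*d), bias: length 2d.
    """
    d = len(emb[0])
    pad = [[0.0] * d for _ in range(width // 2)]
    padded = pad + emb + pad
    out = []
    for i, e_i in enumerate(emb):
        window = [x for row in padded[i:i + width] for x in row]   # length width*d
        h = [sum(w * x for w, x in zip(k_row, window)) + b
             for k_row, b in zip(kernel, bias)]                    # length 2d
        r = [h[j] * sigmoid(h[d + j]) for j in range(d)]           # GLU halves the size
        out.append([r_j + e_j for r_j, e_j in zip(r, e_i)])        # residual
    return out


def similarity_matrix(article: Matrix, template: Matrix) -> Matrix:
    """s_ij = exp(-||a_i - t_j||^2); the exact form is an assumption."""
    return [[math.exp(-sum((a - t) ** 2 for a, t in zip(a_i, t_j)))
             for t_j in template] for a_i in article]


def column_max(sim: Matrix) -> Vector:
    """One weight per template position, so repeated article words count once."""
    return [max(col) for col in zip(*sim)]


def k_max(values: Vector, k: int) -> Vector:
    """Keep the k largest values in their original order; zero-pad if too short."""
    top = sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]
    kept = [values[i] for i in sorted(top)]
    return kept + [0.0] * (k - len(kept))


def ffn_score(p: Vector, w1: Matrix, b1: Vector, w2: Vector, b2: float) -> float:
    """Two-layer feed-forward scorer (tanh and sigmoid are assumed activations)."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, p)) + b) for row, b in zip(w1, b1)]
    return sigmoid(sum(w * h for w, h in zip(w2, hidden)) + b2)


def rerank_score(article: Matrix, template: Matrix, k: int,
                 w1: Matrix, b1: Vector, w2: Vector, b2: float) -> float:
    p = k_max(column_max(similarity_matrix(article, template)), k)
    return ffn_score(p, w1, b1, w2, b2)
```

With trained weights, `rerank_score` would be evaluated once per candidate and the highest-scoring candidate kept as the template.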

Traditional Methodologies
In this section, we explore three traditional approaches to taking advantage of the templates for summarization. They share the same encoder and decoder layers, but have different interaction layers for combining a source article and its template. The encoder layer uses a standard bi-directional RNN (BiRNN) to separately encode the source article and the template into hidden states $h^s_i$ and $h^t_j$.
Concatenation. This approach directly concatenates the hidden states of a template, $\{h^t_i\}_{i=1}^{N}$, after the article representation. This approach is similar to R³Sum (Cao et al., 2018a) but uses our Fast Rerank and summary generation modules.
Concatenation+Self-Attention. This approach adds a multi-head self-attention (Vaswani et al., 2017) layer with 4 heads on top of the above direct concatenation.
DCN Attention. Initially introduced for machine reading comprehension (Seo et al., 2017), this interaction approach is employed here to create template-aware article representations. First, we compute a similarity matrix, $S \in \mathbb{R}^{m \times n}$, for each pair of article and template words through a function of their hidden states in which ';' denotes the concatenation operation. We then normalize each row and column of $S$ by softmax, giving rise to two new matrices $\bar{S}$ and $\bar{\bar{S}}$. After that, the Dynamic Coattention Network (DCN) attention is applied to compute the bi-directional attention: $A = \bar{S} \cdot h^t$ and $B = \bar{S} \cdot \bar{\bar{S}}^{\top} \cdot h^s$, where $A$ denotes article-to-template attention and $B$ is template-to-article attention. Finally, we obtain the template-aware article representation.
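The two attention products in the DCN variant can be sketched as follows. This is a minimal illustration that takes the similarity matrix as given, and it assumes that $\bar{S}$ is the row-normalized matrix and $\bar{\bar{S}}$ the column-normalized one; the helper names are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def softmax(xs: Vector) -> Vector:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def dcn_attention(sim: Matrix, h_s: Matrix, h_t: Matrix) -> Tuple[Matrix, Matrix]:
    """sim: m x n similarity, h_s: m x d article states, h_t: n x d template states."""
    s_row = [softmax(row) for row in sim]                    # each row sums to 1
    s_col = transpose([softmax(col) for col in zip(*sim)])   # each column sums to 1
    a = matmul(s_row, h_t)                                   # article-to-template, m x d
    b = matmul(matmul(s_row, transpose(s_col)), h_s)         # template-to-article, m x d
    return a, b


sim = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
h_s = [[2.0, 0.0], [0.0, 4.0]]
h_t = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
A, B = dcn_attention(sim, h_s, h_t)
```

With a uniform similarity matrix, every row of `A` is the mean of the template states and every row of `B` is the mean of the article states, which gives a quick sanity check on the shapes.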

BiSET
Inspired by the research in machine reading comprehension (Seo et al., 2017) and selective mechanism (Zhou et al., 2017), we propose a novel Bi-directional Selective Encoding with Template (BiSET) model for abstractive sentence summarization. The core idea behind BiSET is to involve templates to assist with article representation and summary generation. As shown in Figure 2, BiSET contains two selective gates: the Template-to-Article (T2A) gate and the Article-to-Template (A2T) gate. The role of T2A is to use a template to filter the source article representation, turning each article hidden state $h^s_i$ into a gated state $h^g_i$. The template enters this gate through $h^t$, the concatenation of the last forward hidden state, $\overrightarrow{h}^t_n$, and the first backward hidden state, $\overleftarrow{h}^t_1$, of the template. On the other hand, the purpose of A2T is to control the proportion of $h^g$ in the final article representation. We assume the source article is credible and use its representation $h^s$ together with $h^t$ to calculate a confidence degree $d$, where $h^s$ is obtained in a similar way to $h^t$. The final source article representation is calculated as the weighted sum of $h^s_i$ and $h^g_i$:
$$z^s_i = d \, h^g_i + (1 - d) \, h^s_i$$
which allows a flexible manner of template incorporation and helps to resist errors when low-quality templates are given.
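A minimal sketch of the bi-directional selective layer follows. The weighted sum is taken from the description above, while the parameterization of the two gates is an assumption made for illustration: a sigmoid gate over a linear function of $h^s_i$ and $h^t$ for T2A, and a sigmoid over a bilinear form of $h^s$ and $h^t$ for A2T. All weight names are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def matvec(w: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in w]


def t2a_gate(h_s_i: Vector, h_t: Vector, w_s: Matrix, w_t: Matrix, b: Vector) -> Vector:
    """Filter one article state with the template (assumed sigmoid gate)."""
    pre = [x + y + c for x, y, c in zip(matvec(w_s, h_s_i), matvec(w_t, h_t), b)]
    return [h * sigmoid(p) for h, p in zip(h_s_i, pre)]


def a2t_confidence(h_s: Vector, h_t: Vector, w_d: Matrix, b_d: float) -> float:
    """Confidence degree d in (0, 1) (assumed bilinear form)."""
    return sigmoid(sum(a * b for a, b in zip(h_s, matvec(w_d, h_t))) + b_d)


def selective_encode(states: Matrix, h_s: Vector, h_t: Vector,
                     w_s: Matrix, w_t: Matrix, b: Vector,
                     w_d: Matrix, b_d: float) -> Matrix:
    """z_i = d * h_g_i + (1 - d) * h_s_i for every article state."""
    d = a2t_confidence(h_s, h_t, w_d, b_d)
    out = []
    for h_s_i in states:
        h_g_i = t2a_gate(h_s_i, h_t, w_s, w_t, b)
        out.append([d * g + (1.0 - d) * s for g, s in zip(h_g_i, h_s_i)])
    return out


zeros = [[0.0, 0.0], [0.0, 0.0]]
states = [[2.0, -4.0]]
z_half = selective_encode(states, [1.0, 1.0], [1.0, 1.0], zeros, zeros, [0.0, 0.0], zeros, 0.0)
z_low = selective_encode(states, [1.0, 1.0], [1.0, 1.0], zeros, zeros, [0.0, 0.0], zeros, -50.0)
```

When the confidence degree is close to zero, as in `z_low`, the output falls back to the original article states, which is the behavior that lets the layer resist a low-quality template.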
The decoder layer. This layer includes an ordinary RNN decoder (Luong et al., 2015). At each time step $t$, the decoder reads the word $w_{t-1}$ and hidden state $h^c_{t-1}$ generated in the previous step, and gives a new hidden state for the current step:
$$h^c_t = \mathrm{RNN}(w_{t-1}, h^c_{t-1})$$
where the hidden state is initialized with the original source article representation, $h^s$. We then compute the attention between $h^c_t$ and the final article representation $z^s$ to obtain a context vector $c_t$. After that, a simple concatenation layer is used to combine the hidden state $h^c_t$ and the context vector $c_t$ into a new hidden state $h^a_t$, which is mapped to a new representation of vocabulary size and fed through a softmax layer to output the target word distribution:
$$p(w_t \mid w_1, \ldots, w_{t-1}) = \mathrm{softmax}(W_p h^a_t) \tag{20}$$
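One decoding step after the RNN update can be sketched as below. The dot-product attention score and the tanh in the concatenation layer are assumptions in the style of Luong et al. (2015), not details confirmed by the text above, and the function names are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def softmax(xs: Vector) -> Vector:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(h_c_t: Vector, z_s: Matrix) -> Tuple[Vector, Vector]:
    """Attend over the final article representation z_s (assumed dot-product score)."""
    scores = [sum(a * b for a, b in zip(h_c_t, z_i)) for z_i in z_s]
    alpha = softmax(scores)
    dim = len(z_s[0])
    context = [sum(a * z_i[j] for a, z_i in zip(alpha, z_s)) for j in range(dim)]
    return context, alpha


def attentional_state(h_c_t: Vector, c_t: Vector, w_a: Matrix) -> Vector:
    """Concatenation layer: h_a_t = tanh(W_a [c_t; h_c_t]) (tanh is assumed)."""
    joined = c_t + h_c_t
    return [math.tanh(sum(w * x for w, x in zip(row, joined))) for row in w_a]


def word_distribution(h_a_t: Vector, w_p: Matrix) -> Vector:
    """p(w_t | w_1..w_{t-1}) = softmax(W_p h_a_t)."""
    return softmax([sum(w * x for w, x in zip(row, h_a_t)) for row in w_p])


z_s = [[1.0, 0.0], [0.0, 1.0]]
context, alpha = attention_context([0.0, 0.0], z_s)
h_a = attentional_state([0.0, 0.0], context, [[0.0] * 4, [0.0] * 4])
dist = word_distribution(h_a, [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
```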

Training
The Retrieve module involves an unsupervised process with traditional indexing and retrieval techniques. For Fast Rerank, since there is no ground truth available, we use ROUGE-1 (Lin and Hovy, 2003) to evaluate the saliency of a candidate template with respect to the gold summary of the current source article. The loss function (Equation 21) is therefore defined between this ROUGE-1 target and $s$, the score predicted by Equation 9, over $N$ article-template pairs, where $N$ is the product of the training set size, $D$, and the number of retrieved templates for each article.
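The supervision signal for Fast Rerank can be sketched as follows. This is a simplified unigram-overlap version of ROUGE-1 on whitespace tokens, not the official script, and the use of the F1 variant as the target is an assumption; `rerank_targets` is a hypothetical helper name.

```python
from collections import Counter
from typing import List


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between two whitespace-tokenized strings."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def rerank_targets(candidates: List[str], gold_summary: str) -> List[float]:
    """One training target per (article, candidate template) pair."""
    return [rouge1_f1(c, gold_summary) for c in candidates]


targets = rerank_targets(
    ["oil prices rise", "tokyo stocks fall", "oil prices rise sharply"],
    "oil prices rise sharply",
)
```

Each article contributes one target per retrieved candidate, which is why the number of training pairs is the product of the training set size and the number of candidates per article.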
For the BiSET module, the loss function is chosen as the negative log-likelihood between the generated summary, $w$, and the true summary, $w^*$:
$$\mathcal{L}(\theta) = -\sum_{t=1}^{L} \log p(w^*_t \mid w^*_1, \ldots, w^*_{t-1}, x, y; \theta)$$
where $L$ is the length of the true summary, $\theta$ contains all the trainable variables, and $x$ and $y$ denote the source article and the template, respectively.

Experiments
In this section, we introduce our evaluations on a standard dataset.

Dataset and Implementation
The dataset used for evaluation is Annotated English Gigaword (Napoles et al., 2012), a parallel corpus formed by pairing the first sentence of an article with its headline. For a fair comparison, we use the version preprocessed by Rush et al. (2015), as in previous work.
During training, both the Fast Rerank and BiSET modules use a batch size of 64 with the Adam optimizer (Kingma and Ba, 2015). We also apply gradient clipping (Pascanu et al., 2013) with a range of [-5, 5]. The differences between the two modules in settings are listed below.
Fast Rerank. We set the size of word embeddings to 300, the convolution encoder block number to 1, and the kernel size of CNN to 3. The weights are shared between the article and template encoders. The k of k-max pooling is set to 10. L2 weight decay with $\lambda = 3 \times 10^{-6}$ is performed over all trainable variables. The initial learning rate is 0.001 and is multiplied by 0.1 every 10K steps. Dropout between layers is applied.
BiSET. A two-layer BiLSTM is used as the encoder, and another two-layer LSTM as the decoder. The sizes of word embeddings and LSTM hidden states are both set to 500. We only apply dropout in the LSTM stack with a rate of 0.3. The learning rate is set to 0.001 for the first 50K steps and halved every 10K steps afterwards. Beam search with size 5 is applied to search for optimal answers.
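The two learning-rate schedules and the clipping range can be written down as a small sketch. It assumes that the first halving for BiSET happens exactly at step 50K and that clipping is applied element-wise to gradient values; both readings are assumptions, and the function names are hypothetical.

```python
from typing import List


def fast_rerank_lr(step: int, base: float = 1e-3,
                   decay: float = 0.1, every: int = 10_000) -> float:
    """Initial rate 0.001, multiplied by 0.1 every 10K steps."""
    return base * decay ** (step // every)


def biset_lr(step: int, base: float = 1e-3,
             hold: int = 50_000, every: int = 10_000) -> float:
    """Rate 0.001 for the first 50K steps, then halved every 10K steps.

    Assumption: the first halving takes effect at step `hold` itself.
    """
    if step < hold:
        return base
    return base * 0.5 ** (1 + (step - hold) // every)


def clip_gradients(grads: List[float], limit: float = 5.0) -> List[float]:
    """Clip every gradient value into [-limit, limit]."""
    return [max(-limit, min(limit, g)) for g in grads]


schedule = [(s, fast_rerank_lr(s), biset_lr(s)) for s in (0, 10_000, 50_000, 60_000)]
```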

Evaluation Metrics
Following previous work (Zhou et al., 2017; Cao et al., 2018a), we use the standard F1 scores of ROUGE-1, ROUGE-2 and ROUGE-L (Lin and Hovy, 2003) to evaluate the selected templates and generated summaries, where the official ROUGE script is applied. We employ the normalized discounted cumulative gain (NDCG) (Järvelin and Kekäläinen, 2002) from information retrieval to evaluate the Fast Rerank module.

Results and Analysis
In this section, we report our experimental results with thorough analysis and discussions.

Performance of Retrieve
The Retrieve module is intended to narrow down the search range for the best template. We evaluated this module by considering three types of templates: (a) Random means a randomly selected summary from the training corpus; (b) Retrieve-top is the highest-ranked summary by Retrieve; (c) N-Optimal means that, among the N top search results, the template is the summary with the largest ROUGE score against the gold summary.
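The Retrieve-top and N-Optimal settings can be sketched as a selection over a ranked candidate list. The scoring function is passed in as a parameter; the `jaccard` scorer used here is only a toy stand-in for ROUGE, and all names are hypothetical.

```python
from typing import Callable, List


def retrieve_top(ranked: List[str]) -> str:
    """Retrieve-top: the highest-ranked summary returned by Retrieve."""
    return ranked[0]


def n_optimal(ranked: List[str], gold: str, n: int,
              score: Callable[[str, str], float]) -> str:
    """N-Optimal: the best-scoring summary among the top n search results."""
    return max(ranked[:n], key=lambda summary: score(summary, gold))


def jaccard(a: str, b: str) -> float:
    """Toy stand-in for ROUGE: Jaccard overlap of the two word sets."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


ranked = ["stocks fall in tokyo", "oil prices rise", "oil prices rise sharply"]
gold = "oil prices rise sharply"
best_of_two = n_optimal(ranked, gold, 2, jaccard)
best_of_three = n_optimal(ranked, gold, 3, jaccard)
```

Because N-Optimal needs the gold summary, it serves as an upper bound on what a re-ranker could select from the top N results rather than as a usable selection rule at test time.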
As the results in Table 1 show, randomly selected templates are totally irrelevant and unhelpful. When they are replaced by the Retrieve-top templates, the results improve noticeably, demonstrating the relatedness of top-ranked summaries to gold summaries. Furthermore, when the N-Optimal templates are used, additional improvements can be observed as N grows. This trend is also confirmed by Figure 3, in which the ROUGE scores increase before 30 and stabilize afterwards. These results suggest that the ranges given by Retrieve indeed help to find quality templates.

Fast Rerank
As mentioned before, the role of Fast Rerank is to re-rank the initial search results and return the best template for summarization. To examine the effect of this module, we studied its ranking quality under different ranges as in Section 4.1. The original rankings by Retrieve are presented for comparison under the NDCG metric. We regard the ROUGE-2 score of each candidate template with the reference summary as the ground truth. As shown in Figure 4, Fast Rerank consistently provides better rankings than the original.
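The NDCG comparison can be illustrated with a short sketch. It assumes the linear-gain form of DCG with a log2 position discount, which is one common variant; the gains would be the ROUGE-2 scores of the candidates, listed in the order produced by the ranking being evaluated. The numbers below are made up for illustration.

```python
import math
from typing import List, Optional


def dcg(gains: List[float]) -> float:
    """Discounted cumulative gain with a log2 position discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg(ranked_gains: List[float], k: Optional[int] = None) -> float:
    """NDCG@k of a ranking; `ranked_gains` are gains in ranked order."""
    ideal = sorted(ranked_gains, reverse=True)[:k]
    denom = dcg(ideal)
    return dcg(ranked_gains[:k]) / denom if denom > 0 else 0.0


# Hypothetical ROUGE-2 gains of three candidates under two rankings.
retrieve_order = [0.10, 0.30, 0.20]
rerank_order = [0.30, 0.20, 0.10]
retrieve_ndcg = ndcg(retrieve_order)
rerank_ndcg = ndcg(rerank_order)
```

A ranking that places candidates in decreasing order of gain reaches an NDCG of 1, so a higher value for the re-ranked list indicates that better templates were moved towards the top.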

Interaction Approaches
In Section 2.3, we also explored three alternative approaches to integrating an article with its template. The results are shown in Table 2, from which we can note that none of these approaches yields satisfactory performance. Even though DCN Attention works impressively in machine reading comprehension, it performs even worse in this task than the simple concatenation. We conjecture the reason is that DCN Attention attempts to fuse the template information into an article as in machine reading comprehension, rather than selecting key information from the two to form an enhanced article representation.

BiSET
The overall performance of all the studied models is shown in Table 3. The results show that our model significantly outperforms all the baseline models and sets a new state of the art for abstractive sentence summarization. To evaluate the impact of templates on our model, we also implemented BiSET with two other types of templates: randomly selected templates and the best templates identified by Fast Rerank under different ranges. As shown in Table 4, the performance of our model improves consistently with the improvement of template quality (larger ranges lead to better chances for good templates). Even with randomly selected templates, our model still works with stable performance, demonstrating its robustness.

Speed Comparison
Our model is designed for both accuracy and efficiency. Due to the parallelizable nature of CNN, the Fast Rerank module only takes about 30 minutes for training and 3 seconds for inference on
Model                        ROUGE-1  ROUGE-2  ROUGE-L
ABS ‡ (Rush et al., 2015)    29.55    11.32    26.42
ABS+ ‡ (Rush et al., 2015)   29

Ablation Study
The purpose of this study is to examine the roles of the bi-directional selective layer and its two gates. Firstly, we removed the selective layer and replaced it with the direct concatenation of an article with its template representation. As the results show in Table 5, the model performs even worse than some ordinary sequence-to-sequence models in Table 3. The reason might be that templates would overwhelm the original article representations and become noise after concatenation. Then, we removed the Template-to-Article (T2A) gate, and as a result the model shows a great decline in performance, indicating the importance of templates in article representations. Finally, when we removed the Article-to-Template (A2T) gate, whose role is to control the weight of T2A in article representations, only a small performance decline is observed. This may suggest that the T2A gate alone can already capture most of the important article information, while A2T plays some supplemental role.
6 It takes about 2 days for training.

Human Evaluation
We then carried out a human evaluation to assess the generated summaries from another perspective. Our evaluators include 8 graduate students and 4 senior undergraduates, and the dataset consists of 100 randomly selected articles from the test set. Each sample in this dataset also includes: 1 reference summary, 5 summaries generated by OpenNMT (Klein et al., 2017), R³Sum (Cao et al., 2018a) and BiSET under three settings, respectively, and 3 randomly selected summaries for trapping. We asked the evaluators to independently rate each summary on a scale of 1 to 5 with respect to its informativeness, conciseness, and readability. While collecting the results, we rejected the samples in which more than half of the evaluators rated the informativeness of the reference summary below 3. We also rejected the samples in which the informativeness of a randomly selected summary was scored higher than 3. Finally, we obtained 43 remaining samples and calculated an average score for each aspect. As the results in Table 6 show, our model not only performs much better than the baselines, but also shows performance quite comparable to the reference summaries. In Table 7 we present two real examples, which show that the templates found by our model are indeed related to the source articles, and that with their aid, our model succeeds in keeping the main content of the source articles for summarization while discarding unrelated words like 'US' and 'Olympic Games'.
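The two rejection rules can be expressed as a small filter. The sketch assumes that the score of a randomly selected summary is the mean rating across evaluators, which the description above does not specify; `keep_sample` and its arguments are hypothetical names.

```python
from statistics import mean
from typing import List


def keep_sample(reference_ratings: List[int],
                random_summary_ratings: List[List[int]]) -> bool:
    """Apply the two rejection rules to the informativeness ratings of one sample.

    reference_ratings: one rating (1 to 5) per evaluator for the reference summary.
    random_summary_ratings: per trap summary, one rating per evaluator.
    """
    low_votes = sum(1 for r in reference_ratings if r < 3)
    if low_votes > len(reference_ratings) / 2:
        return False  # more than half rate the reference below 3
    if any(mean(ratings) > 3 for ratings in random_summary_ratings):
        return False  # a trap summary is scored higher than 3 (assumed: on average)
    return True


samples = [
    ([4, 4, 2, 5], [[1, 2, 1, 1]]),   # kept
    ([2, 2, 2, 5], [[1, 2, 1, 1]]),   # rejected: unreliable reference ratings
    ([4, 4, 4, 4], [[4, 4, 3, 4]]),   # rejected: trap summary rated too high
]
kept = [keep_sample(ref, traps) for ref, traps in samples]
```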

Related Work
Abstractive sentence summarization, a task analogous to headline generation or sentence compression, aims to generate a brief summary given a short source article. Early studies of this problem mainly focused on statistical or linguistic-rule-based methods, including those based on extraction and compression (Jing and McKeown, 2000; Knight and Marcu, 2002; Clarke and Lapata, 2010), templates (Zhou and Hovy, 2004) and statistical machine translation (Banko et al., 2000). The advent of large-scale summarization corpora has accelerated the development of various neural network methods. Rush et al. (2015) first applied an attention-based sequence-to-sequence model to abstractive summarization, which includes a convolutional neural network (CNN) encoder and a feed-forward network decoder. Chopra et al. (2016) replaced the decoder with a recurrent neural network (RNN). The sequence-to-sequence model was later changed to a fully RNN-based model. Besides, Gu et al. (2016) found that this task benefits from copying words from the source articles and proposed CopyNet correspondingly. With a similar purpose, a switch gate has been proposed to control when to copy from the source article and when to generate from the vocabulary. Zhou et al. (2017) employed a selective gate to filter out unimportant information when encoding.
Some other work attempts to incorporate external knowledge for abstractive summarization. For example, encoders have been enriched with handcrafted features such as named entities and part-of-speech (POS) tags. Guu et al. (2018) also attempted to encode human-written sentences to improve neural text generation. Similar to our work, Cao et al. (2018a) proposed to retrieve a related summary from the training set as a soft template to assist with the summarization. However, their approach tends to oversimplify the role of the template by directly concatenating a template after the source article encoding. In contrast, our bi-directional selective mechanism is a novel attempt to select key information from the article and the template in a mutual manner, offering greater flexibility in using the template.

Conclusion
In this paper, we presented a novel Bi-directional Selective Encoding with Template (BiSET) model for abstractive sentence summarization. To counteract the verbosity and insufficiency of training data, we proposed to retrieve high-quality existing summaries as templates to assist with source article representations through a bi-directional selective layer. The enhanced article representations are expected to contribute to better summarization. We also developed the corresponding retrieval and re-ranking modules for obtaining quality templates. Extensive evaluations were conducted on a standard benchmark dataset, and the experimental results show that our model can quickly pick out high-quality templates from the training corpus, laying a key foundation for effective article representation and summary generation. The results also show that our model outperforms all the baseline models and sets a new state of the art. An ablation study validates the role of the bi-directional selective layer, and a human evaluation further indicates that our model can generate informative, concise, and readable summaries.