RefSum: Refactoring Neural Summarization

Although some recent works show potential complementarity among different state-of-the-art systems, few works try to investigate this problem in text summarization. Researchers in other areas commonly refer to the techniques of reranking or stacking to approach this problem. In this work, we highlight several limitations of previous methods, which motivates us to present a new framework Refactor that provides a unified view of text summarization and summaries combination. Experimentally, we perform a comprehensive evaluation that involves twenty-two base systems, four datasets, and three different application scenarios. Besides new state-of-the-art results on CNN/DailyMail dataset (46.18 ROUGE-1), we also elaborate on how our proposed method addresses the limitations of the traditional methods and the effectiveness of the Refactor model sheds light on insight for performance improvement. Our system can be directly used by other researchers as an off-the-shelf tool to achieve further performance improvements. We open-source all the code and provide a convenient interface to use it: https://github.com/yixinL7/Refactoring-Summarization.


Introduction
In neural text summarization, system designers commonly have flexible choices in model architectures (Rush et al., 2015;Kedzie et al., 2018), decoding strategies (Paulus et al., 2018) (e.g. beam search) and etc. As a result, even on the same dataset, different selection biases of these choices will lead to diverse system outputs (Kedzie et al., 2018;Hossain et al., 2020).
To combine complementarity of system's output under different setups, researchers have made some preliminary efforts on two-stage learning (Collins and Koo, 2005;Huang, 2008; González-Rubio * Corresponding author. "Doc, Hypo, Ref" represent "input document, generated hypothesis, gold reference" respectively. "Hypo'" represents texts generated during test phase. Θ Base and Θ Meta represent learnable parameters in two stages. Mizumoto and Matsumoto, 2016), consisting of (i) a base-stage: first generates different outputs under different setups, and (ii) a meta-stage: then aggregates them in diverse ways, exemplified by stacking that uses a high-level model to combine multiple low-level models (Ting and Witten, 1997), or reranking (Collins and Koo, 2005), which aims to rerank different outputs of one system. Although these methods each play a role in different scenarios, they suffer from following potential limitations: (i) Ad-hoc Methods: most existing methods are designed for a specific scenario. For example, Li et al. (2015) and Narayan et al. (2018b) resort to reranking techniques to select summary-worthy sentences that are usually generated from one system. By contrast, Hong et al. (2015) focus on summaries generated from different systems and use a non-neural system combination method to make their complementary advantages. Few works explore if the complementarity existing in different scenarios could be utilized in a unified framework.
(ii) Base-Meta Learning Gap: parameterized models between two learning stages are relatively independent. For example, Zhou et al. (2017) and  adapt the seq2seq (Sutskever et al., 2014) framework as the meta model for combination, which takes the outputs of multiple base systems as a part of the inputs for machine translation. As a result, there is no parameter sharing between the meta model and base systems as shown in Fig. 1, which prevents the meta model from fully utilizing the knowledge encoded in the base systems.
(iii) Train-Test Distribution Gap: regarding the meta-learning stage, there is a distribution gap between the training and test distributions. Fig. 1 elucidates this phenomenon: the training distribution of Hypo differs from the test distribution of Hypo'. Although both two are outputs from the base stage, Hypo would be more accurate (closer to gold summaries) since it is the output during the training phase.
In this work, we aim to address these limitations by proposing a general framework, named Refactor, which can not only serve as a base system to construct a summary by selecting sentences from the source document but also act as a meta system to select the best system output from multiple candidates. The unification of base and meta systems allows them to share a set of parameters, thereby alleviating the "Base-Meta learning gap". Besides, we propose a pretrain-then-finetune paradigm for Refactor that mitigates the "Train-Test distribution gap". In practice, our proposed Refactor can be applied to different scenarios. For example, as a meta system, it can be used for multiple system combination or single system re-ranking.
Our contributions can be briefly summarized as: (1) We dissect two major factors that influence the performance of two-stage learning when leveraging the complementarity among different systems: (i) Base-Meta Learning Gap (ii) Train-Test Distribution Gap; (2) We show these two types of gaps can be alleviated by promoting communication between the two stages in §4 , and therefore present a new paradigm where the base and meta learners are parameterized with shared parameters; (3) We have made comprehensive experiments (twenty-two top-scoring systems, four datasets). In addition to achieving state-of-the-art results on CNN/DailyMail dataset ( §5) by a significant margin, the efficacy of the proposed Refactor opens up a thought-provoking direction for performance improvement: instead of pursuing a purely end-toend system, a promising exploration is to incorporate different types of inductive biases stage-wisely with the same parameterized function. Our exper-imental results demonstrate that there exists complementarity introduced by decoding algorithms (e.g. beam search) §5.5 or system combination §5.6 among the current state-of-the-art summarization systems, which can be effectively utilized by our model for boosting the system performance.

Preliminaries
Existing works commonly design systems in an end-to-end fashion (Sutskever et al., 2014;Sukhbaatar et al., 2015), which, though effective, also proves to be insufficient in some scenarios (Glasmachers, 2017;Webb et al., 2019). Instead of optimizing a system in an end-to-end fashion, one more flexible paradigm, stage-wise learning, is to break down the holistic process into different stages. The basic idea is to incorporate different types of inductive biases stage-wisely and two typical examples are: Stacking and Reranking.
Stacking Stacking (a.k.a, Stacked Generalization) is a general method of using a high-level model to combine lower-level models to achieve greater predictive accuracy (Ting and Witten, 1997).
In NLP research, this method has been widely explored in machine translation (MT) task. Traditionally, it is used to improve the performance of statistical MT systems (González-Rubio et al., 2011;Watanabe and Sumita, 2011;Duh et al., 2011;Mizumoto and Matsumoto, 2016). Some recent work (Zhou et al., 2017; also extends this method to neural MT where the meta model and base systems are all neural models. There is a handful of works about system combination for summarization (Hong et al., 2015), in which a feature-based meta model is used for combining unsupervised text summarization systems.
Reranking Reranking is a technique to improve performance by reranking the output of an existing system, which has been widely used across different NLP tasks, such as constituency parsing (Collins and Koo, 2005;Huang, 2008), dependency parsing Do and Rehbein, 2020), semantic parsing (Ge and Mooney, 2006;Yin and Neubig, 2019), machine translation (Shen et al., 2004;Mizumoto and Matsumoto, 2016).
Comparing reranking and stacking, both of them involve two-stage learning and the first stage would provide multiple candidate outputs as the input for the second stage. However, they differ in the way how multiple candidate outputs are generated at the first stage. Specifically, reranking usually decodes k-most qualified results during inference, using one base system. By contrast, stacking generates multiple outputs that are usually from different base systems.

Summarization as Two-stage Learning
In what follows, we detail how to formulate summarization as a two-stage learning task.

Base system
The system in the base stage aims to generate a summary based on the input text. Specifically, given a document D = {s 1 , · · · , s n } with n sentences, we refer to C as a candidate summary of D generated by a summarization system, which can be parameterized in diverse forms: where BASE(, Θ base ) represents a base system that can be instantiated either as an extractive model or abstractive model with a specific experimental setup: training method T , decoding strategy S.

Meta system
In practice, different choices of parameterized function BASE(·), training method T and decoding strategy S commonly lead to different candidate summaries, C = {C 1 , · · · , C k }, where C represents a set of different candidate summaries. The goal of the meta system is to utilize complementarities among C by popular techniques, such as reranking and system combination. Specifically, given a set of candidate summaries C, a meta system is used to re-construct a new candidate summary C * where Θ meta represents learnable parameters of the meta system.

Refactoring Text Summarization
Despite effectiveness of existing meta systems, they, as briefly mentioned in §1, suffer from two major problems: (i) Base-Meta Learning Gap and (ii) Train-Test Distribution Gap.

Refactoring
In this paper, we propose the model Refactor that unifies the goal of the base and meta systems by the view that a summary can be generated by selecting the best combination of document sentences. Therefore, both base and meta systems aim to select an optimal candidate summary, and they only differ in how the candidate summary set is constructed. For example, Refactor can be a base system when the candidate summary set C is formed by directly enumerating different combinations of document sentences and would be a meta system when C represents summaries from different systems. This formulation is advantageous in two points: (1) No matter where a system selects (from document sentences or multiple system outputs), the chosen criteria that define a good summary are shared. Therefore, the learning process of base and meta systems can be parameterized using a set of parameters, maximizing the information-sharing across two stages and mitigating the Base-Meta Learning Gap.
where REFACTOR(·, Θ refactor ) is the Refactor model, and the candidate summaries C can be constructed in different ways.
(2) Additionally, learning to select candidate summaries from document sentences enables the system to see more diverse candidates with different distributions. This is effective for solving the Train-Test Distribution Gap, where the distribution of the meta system outputs in training samples deviates from the test one.
Specifically, our proposed Refactor first learns to select candidate summaries from document sentences (pre-trained Refactor) and then learns to select candidate summaries from different system outputs (fine-tuned Refactor).

Pre-trained Refactor
Pre-trained Refactor takes as input a document D = {s 1 , · · · , s n } as well as a set of candidate summaries C = {C 1 , · · · , C m }, which can be constructed by enumerating possible combinations of source sentences with heuristic pruning. For example, an extractive system could be used to prune unlikely sentences to control the number of candidates. REFACTOR(·, Θ refactor ) is instantiated as a score function which quantifies the degree to which a candidate summary C i is matched with the source document D.
where D and C i denote document and summary representations respectively, which are calculated by a BERT (Devlin et al., 2019) model. SCORE(·) is a function that measures the similarity between a document and candidate summary.
Contextualized Similarity Function To instantiate SCORE(·), we follow the forms as mentioned in Zhang et al. (2019b); Zhao et al. (2019); Gao et al. (2020), which have shown superior performance on measuring semantic similarity between documents and summaries. Specifically, SCORE(·) is defined based on the greedy matching algorithm, which matches every word in one text sequence to the most similar word in another text sequence and vise versa. Given the document embedding matrix D = d 1 , · · · , d k and the candidate embedding matrix C = c 1 , · · · , c l encoded by BERT, SCORE(·) can be calculated as: where the weighted recall R, precision P are defined as follows: 1 w i is the weight of the i-th token in the document. We use weighted recall R based on the assumption that for text summarization, tokens in the source document have different importance and the summary should capture the most important information of the source document. Therefore, we introduce a weighting module built by a twolayer Transformer (Vaswani et al., 2017) assigning weights w i : represents the embedding of the "[CLS]" token which encodes the global information. d is the dimension of d i .
Learning Objective We use a ranking loss to learn the parameter Θ refactor , inspired by the assumption (Zhong et al., 2020) that a good candidate summary should be as close with the source 1 We found that adding 1 to the precision and recall helps to stabilize the training.
where C i and C j denote the i-th and j-th sample of the candidate list which is descendingly sorted by the ROUGE (Lin, 2004) scores between the reference summaryĈ and candidates. That is, λ c is the corresponding margin set to 0.01.

Fine-tuned Refactor
In order to fit the distributions of the specific types of input, we then fine-tune Refactor using the outputs generated by the base systems. Specifically, fine-tuning is also based on Eq. 9 where the candidate summaries C are generated by the base systems under different application scenarios.
Why does Pre-train and Fine-tune matter? We elaborate on the proposed two-step training using a real case. Fig. 2 depicts the distribution of ROUGE-1 scores regarding the candidate summaries in the pre-training stage training set, finetuning stage training set and test set on the XSum dataset, where we sample the same number of {document, candidate summaries} pairs. We can observe that: (i) there is a distribution gap between train and test samples in fine-tuning stage. (ii) in pre-training stage the pre-trained Refactor has seen a large number of candidate summaries with diverse performance (ROUGE value), which improves its generalization ability. In §5 we will show that the Pre-train and Fine-tune paradigm outperforms one-step training where the model is directly trained with data generated from the base systems.

Application Scenarios
Our Refactor can be used as different roles in different scenarios as follows.

Refactor as Base Learner
The pre-trained Refactor can not only be fine-tuned for a better selection of candidate summaries, but also be regarded as a base system, providing one system output. This feature of Refactor maximizes parameter sharing across the two training stages.

Refactor as Meta Learner
Both pre-trained Refactor and fine-tuned Refactor can be used as a meta system to select the best candidate when we have multiple system summaries. In this work, we explore the following settings: (1) Single System: It considers re-ranking candidate summaries generated from a single abstractive system using beam search.
(2) Multi-system Summary-level: It is tasked to select the best candidate summary from the results of different systems.
(3) Multi-system Sentence-level: We also take a step towards the fine-grained fusion of summaries from extractive and abstractive systems. Specifically, here candidate summaries are generated by combining the results of different systems at the sentence level.

Datasets
We mainly experiment on four datasets, whose statistics are shown in Tab. 1. CNNDM 2 (Hermann et al., 2015) is a widely used dataset containing news articles and the associated highlights which are used as the reference summaries. We follow the work of Nallapati et al. (2016)  WikiHow 5 (Koupaee and Wang, 2018) is a largescale dataset constructed from the articles using online WikiHow knowledge base.

Base Systems
Below, we mainly use BART, GSum and PEGA-SUS as the base systems since they have achieved state-of-the-art performance on at least one dataset.
BART (Lewis et al., 2020) is a large pre-trained sequence-to-sequence model that achieves strong performance on the abstractive summarization.
GSum (Dou et al., 2020) enhances the performance of BART using additional guidance information, which achieves the current state-of-the-art performance on the CNNDM dataset. PEGASUS  achieves competitive performance on various summarization datasets and is the current state-of-the-art on the XSum dataset.
To make a comprehensive evaluation of our proposed model, we additionally collect 19 top-scoring systems as base systems on CNNDM. 6 In details, for §5.7 we use the following systems: pointer-generator+coverage (See et al., 2017)

Baseline Systems
Neural system combinator: We use BERTScore (Zhang et al., 2019b) as an unsupervised baseline with neural models, which is an automatic evaluation metric computing the similarity of text pairs based on the corresponding BERT-encoded representations. We use it to directly compute the similarity score between the source documents and candidate summaries. Non-Neural system combinator: We use RankSVM 7 (Joachims, 2002) as a non-neural baseline. We perform cross-validation on the development set for hyper-parameter searching and train the model on the development set. The set of features is listed in Appendix A. Oracles: We compare our model with sample-wise Min, Max and Random oracles using ROUGE.

Training Details
For the following experiments in §5.5, §5.6 and §5.7 on CNNDM, we pre-train the Refactor model with a candidate set generated by enumerating combinations of sentences in the source documents. To reduce the number of candidates, we prune the sentences assigned with lower scores by an extractive model, BERTSum (Liu and Lapata, 2019), following Zhong et al. (2020). The maximum number of candidates for one data sample is 20. The pretrained Refactor is also used a base system in §5.6, whose outputs are used together with other base systems as candidate summaries. For different experiments, we fine-tune pre-trained Refactor on the base system's output, and name the model as fine-tuned Refactor. To analyze the effectiveness of the proposed two-stage training, we additionally train the model without the pre-training step, which is named as supervised Refactor.
The pre-trained BERT model we used is from Transformers library (Wolf et al., 2020). 8 We use Adam optimizer (Kingma and Ba, 2015) with learning rate scheduling.
where the warmup_steps is 10000. The model performance on the validation set is used to select the checkpoint. Pre-training takes around 40 hours  on 4 GTX-1080-Ti GPUs while fine-tuning takes around 20 hours.

Exp-I: Single System Reranking
We use BART and GSum for this experiment, and use beam search to generate the candidate summaries where the beam size is set to 4. The results are listed in Tab. 2, which shows that (1) Refactor can boost the base system's performance by a significant margin, (2) the fine-tuned Refactor outperforms supervised Refactor directly trained on the base system's outputs, showing the effectiveness of the two-step training. Notably, we observe the fine-tuned Refactor can boost BART's performance from 44.26 to 45.15 on ROUGE-1, indicating that the top-1 output selected by beam search is not always the best one, and Refactor can effectively utilize the complementarity introduced by considering all the beam search results.

Exp-II: Multiple Systems Stacking
Summary-level For summary-level combination, we explore two-system combination (BART & pre-trained Refactor) and three-system combination (BART, GSum & pre-trained Refactor). The results are shown in Tab. 3.
Sentence-level For sentence-level combination, we use BART and pre-trained Refactor as the base  Table 3: Summary level combination on CNNDM. Two denotes two-system combination (BART and pre-trained Refactor).
systems. The sentences of each system's output are merged together to form the candidate sentence set, and all combinations of three sentences in the candidate set are generated as candidate summaries.
To prune the candidates, we use tri-gram blocking to filter out candidates of which there exists an identical tri-gram in two sentences.  tion, supervised Refactor has similar performance as fine-tuned Refactor. We hypothesis that this is because here the number of candidates in the finetuning data is relatively large, therefore directly training on the fine-tuning data is sufficient enough.
(ii) The pre-trained Refactor cannot outperform GSum model in the three-system combination setting in Tab. 3. The reason might be that GSum has much stronger performance than the other two systems, which intuitively makes the expected gain from system combination lower than other settings.

Exp-III: Generalization on 19 Top-performing Systems
To evaluate the Refactor's generalization ability, we explore another setting where the pre-trained Refactor is directly used to select the outputs of multiple systems without fine-tuning.
To this end, we collect 19 top-performing summarization systems on CNNDM dataset. Here, we investigate if our Refactor can boost the performance of candidate systems with similar performance. In addition, we also aim to investigate how the range width of different systems' performance affects Refactor's performance. Therefore, we group the candidate systems into equal-width bins based on their average ROUGE-1 scores, and evaluate our Refactor on each bin separately.
In Tab. 5 we report the average ROUGE-1 scores of the oracles, Refactor, and the best candidate system in each bin whose width is 1. Refactor consistently outperforms the best candidate system, showing its generalization ability.
Next, in Fig. 3 we plot the change of Refactor's performance with different bin widths. We define the success rate of Refactor with a given bin width to be the number of bins where Refactor outperforms the single best base system normalized by the total number of bins. We observe that Refactor is more likely to improve the performance of base systems when the system-level performance of the  base systems is similar. Intuitively, if one base system is significantly better than the other systems, it is more difficult for Refactor to use other systems to complement the best base system.

Exp-IV: Effectiveness on More Popular Datasets
Next, we move on to other text summarization datasets to evaluate our proposed method's strength beyond CNNDM dataset. Some of the datasets used here are not as well-studied as CNNDM dataset, so there are less top-performing systems on these datasets. Therefore, here we focus on the experiments of the single system setting. The results in Tab. 6 show that Refactor is able to bring stable improvement over the base systems. The average summary length of these datasets varies from 23.3 (XSum) to 210.3 (Pubmed). Therefore, the results here demonstrate the Refactor can be applied to datasets with different characteristics. On XSum dataset, the pre-trained Refactor outperforms the fine-tuned Refactor. This may result from the additional pre-training data we introduced using BART, which is effective enough to train the Refactor for reranking PEGASUS output.

Fine-grained Analysis
We perform a fine-grained evaluation of Refactor to understand where improvement mainly comes.
Setup We choose the summary-level system combination setting on CNNDM test set in §5.6 as a case study, where the base systems are: BART and pre-trained Refactor, and then we use a fine-tuned Refactor 9 to combine them. Specifically, we first (i) define δ(C BART , C Pretrain ) as the performance (i.e., ROUGE) gap on the candidate summary C.
(ii) then partition test samples into different buckets S 1 , · · · , S n according to the performance gap δ. (iii) calculate selection accuracy for each bucket, which represents how accurately the Refactor can identify the best one from two candidate summaries.
The results are shown in Fig. 4. We observe that the selection accuracy is increasing as the gap δ becoming larger, indicating that Refactor performs better on the candidate summaries with diverse performance. Combining the results we get in §5.7, we conclude that Refactor has the largest potential gain when the base systems effectively complement each other -They have similar system-level performance but diverse summary-level performance. For example, each base system may perform significantly better than others on a subset of data with different characteristics but could not outperform others across the whole dataset.

Implications and Future Directions
We present a general framework for utilizing the complementarity of modern text summarization systems by formulating text summarization as a two-stage learning problem. Our proposed model, Refactor, can be used either as a base system or a meta system, effectively mitigating the learning gaps introduced in the two-stage learning. Experimental results show that Refactor is able to boost the performance of the base systems, and achieves the state-of-the-art performance on CNNDM and XSum datasets. We believe this work opens up a new direction for improving the performance of text summarization systems apart from an iterative process of searching for better model architectures -The gain of performance could be made by fully investigating and utilizing the complementarity of different systems with various architectures, problem formulations, decoding strategies, etc.