CDEvalSumm: An Empirical Study of Cross-Dataset Evaluation for Neural Summarization Systems

Neural network-based models augmented with unsupervised pre-trained knowledge have achieved impressive performance on text summarization. However, most existing evaluation methods are limited to an in-domain setting, where summarizers are trained and evaluated on the same dataset. We argue that this approach can narrow our understanding of the generalization ability of different summarization systems. In this paper, we perform an in-depth analysis of the characteristics of different datasets and investigate the performance of different summarization models under a cross-dataset setting, in which a summarizer trained on one corpus is evaluated on a range of out-of-domain corpora. A comprehensive study of 11 representative summarization systems on 5 datasets from different domains reveals the effect of model architectures and generation ways (i.e., abstractive and extractive) on model generalization ability. Further, experimental results shed light on the limitations of existing summarizers. A brief introduction and supplementary code can be found at https://github.com/zide05/CDEvalSumm.


Introduction
Neural summarizers have achieved impressive performance when evaluated by ROUGE (Lin, 2004) in the in-domain setting, and the recent success of pretrained models has driven the state-of-the-art results on benchmarks to a new level (Liu and Lapata, 2019; Liu, 2019; Zhong et al., 2019a; Zhang et al., 2019; Lewis et al., 2019; Zhong et al., 2020). However, superior performance does not guarantee a perfect system, since existing models tend to show defects when evaluated from other aspects. For example, Zhang et al. (2018) observe that many abstractive systems tend to be near-extractive in practice. Cao et al. (2018); Wang et al. (2020); Kryscinski et al. (2019); Maynez et al. (2020); Durmus et al. (2020) reveal that most generated summaries are factually incorrect. These non-mainstream evaluation methods make it easier to identify the model's weaknesses.
Figure 1: Ranking of 11 top-scoring summarization systems (abstractive models are red while extractive ones are blue). Each system is evaluated based on three diverse evaluation methods: (a) averaging each system's in-dataset ROUGE-2 F1 scores (R2) over five datasets; (b-c) evaluating systems using our designed cross-dataset measures: stiff-R2, stable-R2 (Sec. 5). Notably, BERT match and BART are two state-of-the-art models for extractive and abstractive summarization respectively (highlighted by blue and red boxes).
Orthogonal to the above two evaluation aspects, we aim to diagnose the limitations of existing systems under cross-dataset evaluation, in which a summarization system trained on one corpus is evaluated on a range of out-of-dataset corpora. Instead of evaluating the quality of summarizers solely based on one dataset or multiple datasets individually, cross-dataset evaluation enables us to evaluate model performance from a different angle. For example, Fig. 1 shows the ranking of 11 summarization systems studied in this paper under different evaluation metrics, in which the ranking list "(a) in-dataset R2" is obtained by traditional ranking criteria while the other two are based on our designed cross-dataset measures. Intuitively, we observe that 1) there are different definitions of a "good" system in various evaluation aspects; 2) abstractive and extractive systems exhibit diverse behaviors when evaluated under the cross-dataset setting.
The above example recaps the general motivation of this work, encouraging us to rethink the generalization ability of current top-scoring summarization systems from the perspective of cross-dataset evaluation. Specifically, we ask two questions as follows:
Q1: How do different neural architectures of summarizers influence cross-dataset generalization performance? When designing summarization systems, a plethora of neural components can be adopted (Zhou et al., 2018; Chen and Bansal, 2018; Gehrmann et al., 2018; Cheng and Lapata, 2016; Nallapati et al., 2017). For example, will copy (Gu et al., 2016) and coverage (See et al., 2017) mechanisms improve the cross-dataset generalization ability of summarizers? Is there a risk that BERT-based summarizers will perform worse when adapted to new areas compared with the ones without BERT? So far, the generalization ability of current summarization systems when transferring to new datasets remains unclear, which poses a significant challenge for designing a reliable system in realistic scenarios. Thus, in this work, we take a closer look at the effect of model architectures in the cross-dataset generalization setting.
Q2: Do different generation ways (extractive and abstractive) of summarizers influence the cross-dataset generalization ability? Extractive and abstractive models, as two typical ways to summarize texts, usually follow diverse learning frameworks and favor different datasets. It would be interesting to know how they differ from the perspective of cross-dataset generalization (e.g., whether abstractive summarizers are better at generating informative or faithful summaries on a new test set).
To answer the questions above, we have conducted a comprehensive experimental analysis, which involves eleven summarization systems (including the state-of-the-art models), five benchmark datasets from different domains, and two evaluation aspects. Tab. 1 illustrates the overall analysis framework. We explore the effect of different architectures and generation ways on model generalization ability in order to answer Q1 and Q2. Semantic equivalence (e.g., ROUGE) and factuality are adopted to characterize the different aspects of cross-dataset generalization ability. Additionally, we strengthen our analysis by presenting two views of evaluation: holistic and fine-grained views (Sec. 5).
Our contributions can be summarized as: 1) Cross-dataset evaluation is orthogonal to other evaluation aspects (e.g., semantic equivalence, factuality) and can be used to re-evaluate current summarization systems, accelerating the creation of more robust summarization systems. 2) We design two measures, Stiffness and Stableness, which help us characterize generalization ability from different views and diagnose the weaknesses of state-of-the-art systems. 3) We conduct a dataset bias-aided analysis (Sec. 4.3) and suggest that a better understanding of datasets is helpful for interpreting systems' behaviours.

Representative Systems
Although it is intractable to cover all neural summarization systems, we try to include as many representative models as possible to make the evaluation comprehensive. Our selection strategy is as follows: 1) the source code of the system is publicly available; 2) the system achieves state-of-the-art or top performance on benchmark datasets (e.g., CNNDM (Nallapati et al., 2016)); 3) the system is equipped with typical neural components (e.g., Transformer, LSTM) or mechanisms (e.g., copy).

Extractive Summarizers
Extractive summarizers directly choose and output the salient sentences (or phrases) in the original document. Generally, most existing extractive summarization systems follow a framework consisting of three major modules: sentence encoder, document encoder and decoder. In this paper, we investigate extractive summarizers with different choices of encoders and decoders.
LSTM non (Kedzie et al., 2018) This summarizer adopts a convolutional neural network as the sentence encoder and an LSTM to model the cross-sentence relation. Finally, each sentence is selected in a non-autoregressive way.
Trans non (Liu and Lapata, 2019) The TransformerExt model in Liu and Lapata (2019), similar to the above setting except that the document encoder is replaced with a Transformer layer.
Trans auto (Zhong et al., 2019a) The decoder is replaced with a pointer network to avoid repetition (autoregressive).
BERT non (Liu and Lapata, 2019) The BertSumExt model in Liu and Lapata (2019). This model is an extension of Trans non that introduces a BERT (Devlin et al., 2018) layer.
BERT match (Zhong et al., 2020) This is the existing state-of-the-art extractive summarization system, which introduces a matching layer using siamese BERT.

Abstractive Summarizers
The abstractive approach involves paraphrasing the inputs using novel words. Current abstractive summarization systems mainly focus on the encoder-decoder paradigm.
L2L cov ptr (See et al., 2017) This model is an LSTM-based sequence-to-sequence summarizer with copy and coverage mechanisms.
L2L ptr We remove the coverage module and keep the other parts unchanged.
L2L This model is implemented by removing the pointer network of the above summarizer.
T2T (Liu and Lapata, 2019) A sequence-to-sequence model with Transformer as the encoder and decoder.
BE2T (Liu and Lapata, 2019) A sequence-to-sequence model with BERT as the encoder and Transformer as the decoder.
BART (Lewis et al., 2019) A fully pre-trained sequence-to-sequence model. It is the existing state-of-the-art abstractive summarization system.

Datasets
We explore five typical summarization datasets: CNNDM, Xsum, PubMed, Bigpatent B and Reddit TIFU. CNNDM (Nallapati et al., 2016) and Xsum (Narayan et al., 2018) are news-domain summarization datasets which differ in their source publications and abstractiveness. PubMed (Cohan et al., 2018) is a scientific paper dataset, which can be used to investigate the generalization ability of models in the scientific domain. Bigpatent B (Sharma et al., 2019) is the B category of Bigpatent (a dataset consisting of patent documents from Google Patents Public Datasets). Reddit TIFU (Kim et al., 2019) is a dataset of less formal posts collected from the online discussion forum Reddit. Detailed statistics and an introduction to the datasets are presented in the appendix.

Evaluation for Summarization
Existing summarization systems are usually evaluated on different datasets individually based on an automatic metric: r = eval(D, S, m), where D and S represent a dataset (e.g., CNNDM) and a system (e.g., L2L) respectively, and m denotes an evaluation metric (e.g., ROUGE). To evaluate the quality of generated summaries, metrics can be designed from diverse perspectives, which are abstractly characterized in Fig. 2. Specifically, semantic equivalence is used to quantify the relation between generated summaries (Gsum) and references (Ref), while factuality aims to characterize the relation between generated summaries (Gsum) and input documents (Doc).
Besides evaluation metrics, in this paper, we also introduce some measures that quantify the relation between input documents (Doc) and references (Ref). We claim that a better understanding of dataset biases can help us interpret models' discrepancies.

ROUGE (Lin, 2004) is a classic metric to evaluate the quality of model-generated summaries by counting the number of overlapping n-grams between the evaluated summaries and the ideal references.
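As a rough illustration of the n-gram counting described above, the following minimal sketch computes a simplified ROUGE-N precision, recall and F1 for a single summary-reference pair. It assumes whitespace tokenization and lowercasing, and it omits stemming, multi-reference handling and the other options of the official ROUGE toolkit, so its numbers are not expected to match the scores reported in this paper. The function names are illustrative only.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=2):
    """Simplified ROUGE-N (precision, recall, F1) from clipped n-gram overlap.

    Assumption: whitespace tokenization, no stemming, a single reference.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


p, r, f = rouge_n("the cat sat on the mat", "the cat lay on the mat", n=2)
print(round(p, 3), round(r, 3), round(f, 3))
```

In this toy pair, three of the five bigrams are shared, so precision, recall and F1 all come out at 0.6.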

Factuality
Apart from evaluating the semantic equivalence between generated summaries and the references, another evaluation aspect of recent interest is factuality. In order to analyze the generalization performance of models from different perspectives, in this work we also take factuality evaluation into consideration.
Factcc Factcc (Kryscinski et al., 2019) is introduced to measure the fact consistency between the generated summaries and source documents. It is a model based metric which is weakly-supervised. We use the proportion of summary sentences that factcc predicts as factually consistent as the factuality score in this paper.
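The factuality score described above is a simple proportion, which can be sketched as follows. The sketch assumes a callback `is_consistent(document, sentence)` that stands in for a trained checker such as Factcc; this callback, the regex sentence splitter and the toy checker are all hypothetical placeholders and not part of the Factcc release.

```python
import re


def factuality_score(document, summary, is_consistent):
    """Share of summary sentences that the checker judges consistent.

    `is_consistent(document, sentence) -> bool` is a hypothetical stand-in
    for a model-based checker; sentences are split with a naive regex.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", summary) if s.strip()]
    if not sentences:
        return 0.0
    hits = sum(1 for s in sentences if is_consistent(document, s))
    return hits / len(sentences)


def toy_checker(document, sentence):
    """Toy rule for illustration: a sentence counts as consistent only if it
    appears verbatim (case-insensitively) in the document."""
    return sentence.lower().rstrip(".!?") in document.lower()


doc = "The plant opened in 2010. It employs 300 people."
summ = "The plant opened in 2010. It employs 500 people."
print(factuality_score(doc, summ, toy_checker))
```

With a perfect checker, a purely extractive summary would score 1.0 under this definition, which is why on-diagonal factuality scores below 100% for extractive models point to errors of the checker itself.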

Dataset Bias
We detail several measures that quantify the characteristics of datasets, which help us understand the differences among models.
Coverage (Grusky et al., 2018) illustrates the overlap rate between document and summary. It is defined as the proportion of copied segments in the summary.
Copy Length measures the average length of the segments in the summary that are copied from the source document.
Novelty (See et al., 2017) is defined as the proportion of segments in the summaries that do not appear in the source documents. The segments can be instantiated as n-grams.
Repetition (See et al., 2017) measures the rate of repeated segments in summaries. Similar to the above measure, we choose n-grams (n ranges from one to four) as the segment unit.
Sentence fusion score is calculated using the result of the algorithm proposed by Lebanoff et al. (2019), which determines whether a summary sentence is compressed from one document sentence or fused from several. The sentence fusion score is then calculated as the proportion of fused sentences (sentences that are fused from two or three document sentences) among all summary sentences.
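To make the two copy-based measures concrete, here is a minimal sketch in the spirit of the greedy fragment matching of Grusky et al. (2018): for each summary position, it takes the longest token span that also occurs in the document. The tokenization, the function names and the choice to average copy length over the matched fragments are assumptions made for illustration, not necessarily the exact procedure used for Fig. 3.

```python
def copied_fragments(doc_tokens, sum_tokens):
    """Greedily collect the longest summary spans that also occur in the document."""
    fragments = []
    i = 0
    while i < len(sum_tokens):
        best = 0
        for j in range(len(doc_tokens)):
            k = 0
            # Extend the match starting at summary position i and document position j.
            while (i + k < len(sum_tokens) and j + k < len(doc_tokens)
                   and sum_tokens[i + k] == doc_tokens[j + k]):
                k += 1
            best = max(best, k)
        if best > 0:
            fragments.append(sum_tokens[i:i + best])
        i += max(best, 1)  # skip the matched span, or one unmatched token
    return fragments


def coverage(doc_tokens, sum_tokens):
    """Proportion of summary tokens that lie inside a copied fragment."""
    frags = copied_fragments(doc_tokens, sum_tokens)
    return sum(len(f) for f in frags) / max(len(sum_tokens), 1)


def copy_length(doc_tokens, sum_tokens):
    """Average length (in tokens) of the copied fragments."""
    frags = copied_fragments(doc_tokens, sum_tokens)
    return sum(len(f) for f in frags) / len(frags) if frags else 0.0


doc = "the quick brown fox jumps over the lazy dog".split()
summ = "a quick brown fox sleeps near the dog".split()
print(copied_fragments(doc, summ), coverage(doc, summ), copy_length(doc, summ))
```

In the toy example the copied fragments are "quick brown fox", "the" and "dog", so five of the eight summary tokens are covered.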
A high value of coverage and copy length suggests the dataset is more extractive, while novelty represents the rate of novel units in the summary and the sentence fusion score represents the proportion of summary sentences that are fused from two or three document sentences. Zhong et al. (2019b) also explore dataset bias to aid the analysis of model performance, but they only focus on metrics for extractive summarizers.
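The n-gram based measures, novelty and repetition, can be sketched in a few lines. The sketch assumes pre-tokenized inputs and one particular reading of "rate of repeated segments", namely the share of n-gram occurrences in the summary that duplicate another occurrence; the paper may count repetition differently.

```python
def ngram_list(tokens, n):
    """All n-grams of a token list, in order, with duplicates kept."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def novelty(doc_tokens, sum_tokens, n=3):
    """Proportion of summary n-grams that never appear in the document."""
    grams = ngram_list(sum_tokens, n)
    if not grams:
        return 0.0
    seen = set(ngram_list(doc_tokens, n))
    return sum(1 for g in grams if g not in seen) / len(grams)


def repetition(sum_tokens, n=2):
    """Share of summary n-gram occurrences that are duplicates.

    Assumption: repetition = 1 - (unique n-grams / total n-grams).
    """
    grams = ngram_list(sum_tokens, n)
    if not grams:
        return 0.0
    return 1 - len(set(grams)) / len(grams)


doc = "the quick brown fox jumps over the lazy dog".split()
print(novelty(doc, "the quick brown fox naps".split(), n=3))
print(repetition("a b a b a".split(), n=2))
```

Under these definitions, a dataset with high 3-gram novelty (such as Xsum or Reddit in Fig. 3) leaves little for a copy mechanism to exploit, and a dataset with low repetition gives a coverage mechanism few training signals.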

Dataset Bias Analysis
According to the coverage and copy length results in Fig. 3, CNNDM is the most extractive dataset. Bigpatent B also exhibits a relatively high copy rate in its summaries, but its copied segments are shorter than those of CNNDM. On the other hand, Bigpatent B and Xsum obtain higher sentence fusion scores, which suggests that the proportion of fused sentences in these two datasets is high. Xsum and Reddit contain more novel 3-grams in their summaries, reflecting that these two datasets are more abstractive. In terms of repetition in Fig. 3, only PubMed and Bigpatent B contain more repeated 2-grams in their summaries.
Table 3: Illustration of two views (Stiffness: r µ and Stableness: r σ) to characterize the cross-dataset (a and b) generalization based on models A and B. U A and U B represent the cross-dataset matrices of the two models. ... suggests that model A is more robust.

Cross-dataset Evaluation
Despite recent impressive results on diverse summarization datasets, modern summarization systems mainly focus on extensive in-dataset architecture engineering while ignoring the generalization ability that is indispensable when systems are required to process samples from new datasets or domains. Therefore, instead of evaluating the quality of a summarization system solely based on one dataset, we introduce cross-dataset evaluation: a summarizer (e.g., L2L) trained on one dataset (e.g., CNNDM) is evaluated on a range of other datasets (e.g., Xsum). Methodologically, we perform cross-dataset evaluation from two views, fine-grained and holistic, which we detail below.

Methodology
Given a summarization system S, a set of datasets $D = \{D_1, \cdots, D_N\}$, and an evaluation metric m, we can design different evaluation functions to quantify the system's quality: r = eval(D, S, m). Depending on the form of the function eval(·), r can be instantiated as either a scalar or a vector (or matrix).

Fine-grained Measures
Once r, the cross-dataset evaluation result, is instantiated as a matrix, we can characterize the given system in a fine-grained way. Specifically, we define r as $r = U \in \mathbb{R}^{N \times N}$, where each cell $U_{ij}$ refers to the metric result (e.g., ROUGE) when a summarizer is trained on dataset $D_i$ and tested on dataset $D_j$ (N refers to the number of datasets). Additionally, we can normalize each cell by the diagonal value of its column: $r = \hat{U}$, with $\hat{U}_{ij} = U_{ij} / U_{jj} \times 100\%$. The ratio $U_{ij} / U_{jj}$ measures how close the out-of-dataset performance (trained on $D_i$ and tested on $D_j$) of a system is to its in-dataset performance (trained on $D_j$ and tested on $D_j$).
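The matrix view can be sketched as follows. The callback `train_and_score(train_set, test_set)` is a hypothetical placeholder for training a summarizer on one dataset and scoring it on another, and the numbers in `toy_scores` are made up purely to show the normalization; neither comes from the paper.

```python
def cross_dataset_matrix(datasets, train_and_score):
    """Build U, where U[i][j] is the score of a model trained on datasets[i]
    and tested on datasets[j]. `train_and_score` is a hypothetical callback."""
    return [[train_and_score(src, tgt) for tgt in datasets] for src in datasets]


def normalize(U):
    """Divide every cell by the in-dataset score of its test dataset
    (the diagonal entry of the same column) and express it in percent."""
    n = len(U)
    return [[100.0 * U[i][j] / U[j][j] for j in range(n)] for i in range(n)]


# Made-up scores for two datasets "A" and "B", keyed by (train, test).
toy_scores = {("A", "A"): 40.0, ("A", "B"): 10.0,
              ("B", "A"): 20.0, ("B", "B"): 20.0}
U = cross_dataset_matrix(["A", "B"], lambda src, tgt: toy_scores[(src, tgt)])
U_hat = normalize(U)
print(U)
print(U_hat)
```

The diagonal of the normalized matrix is always 100%, and each off-diagonal cell reads as "how much of the in-dataset score survives the transfer".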

Holistic Measures
Instead of using a matrix, holistically, we can quantify the cross-dataset generalization ability of each summarization system using a scalar. Specifically, we propose two views to characterize the crossdataset generalization.
Stiffness This measure reflects the absolute performance of a system under the cross-dataset setting. Given a system, its stiffness can be calculated as: $r^{\mu} = \frac{1}{N \times N} \sum_{i,j} U_{ij}$. Intuitively, a higher value of stiffness suggests that the system obtains better performance when transferred to new datasets.
Stableness It characterizes the relative performance gap between the in-dataset and cross-dataset settings: $r^{\sigma} = \frac{1}{N \times N} \sum_{i,j} U_{ij} / U_{jj} \times 100\%$. Generally, a higher value of stableness suggests that the gap between in-dataset and cross-dataset results is smaller.
Tab. 3 gives an example of characterizing generalization ability in the two views. It shows that stiffness and stableness do not always agree: a model with higher stiffness may obtain lower stableness.
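The disagreement between the two holistic views can be reproduced with a toy example. The sketch below assumes that stiffness is the mean of all cells of the cross-dataset matrix U and that stableness is the mean of all cells of the column-normalized matrix (each cell divided by the in-dataset score of its test dataset, in percent). The two matrices are invented for illustration and are not the values of Tab. 3.

```python
def stiffness(U):
    """Mean of all cells of the cross-dataset matrix (absolute performance)."""
    n = len(U)
    return sum(U[i][j] for i in range(n) for j in range(n)) / (n * n)


def stableness(U):
    """Mean of all cells after dividing each by the in-dataset score of its
    test dataset, in percent (relative performance)."""
    n = len(U)
    return sum(100.0 * U[i][j] / U[j][j] for i in range(n) for j in range(n)) / (n * n)


# Invented matrices for two hypothetical models on two datasets.
U_a = [[40.0, 30.0], [30.0, 40.0]]
U_b = [[50.0, 25.0], [25.0, 50.0]]

print("A:", stiffness(U_a), stableness(U_a))
print("B:", stiffness(U_b), stableness(U_b))
```

Here the hypothetical model B has the higher stiffness (37.5 against 35.0) because its in-dataset scores are strong, yet the lower stableness (75.0 against 87.5) because it loses more when transferred.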

Experiment
In what follows, we analyze different summarization systems in terms of semantic equivalence and factuality. Moreover, the results are studied in holistic and fine-grained views based on the measures defined above. Holistic results are shown in Fig. 4 and Fig. 5. On the other hand, Tab. 4 and Tab. 5 display the fine-grained observations. Tab. 2 displays the in-dataset results of all models on five benchmark datasets.

Semantic Equivalence Analysis
We conduct pair-wise Wilcoxon signed-rank significance tests with α = 0.05. The null hypothesis is that the expected performances (stiffness and stableness) of a pair of summarization models are identical. We report the observations that are statistically significant.
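For readers unfamiliar with the test, the sketch below implements a two-sided Wilcoxon signed-rank test in plain Python using the normal approximation, without continuity or tie corrections. It assumes that the paired samples are corresponding cells of two models' cross-dataset matrices; the text does not spell out the pairing, so this is only one plausible reading. For small samples an exact test from a statistics library (for example `scipy.stats.wilcoxon`) would be the safer choice.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test with a normal approximation.

    Returns (W, p_value). Zero differences are dropped and tied absolute
    differences share their average rank.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    order = sorted(range(n), key=lambda idx: abs(diffs[idx]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean) / sd
    p_value = min(math.erfc(abs(z) / math.sqrt(2)), 1.0)  # two-sided
    return w, p_value


# Invented paired scores: model X beats model Y on every cell.
scores_x = [30, 32, 35, 28, 40, 38, 33, 36, 31, 34]
scores_y = [29, 30, 32, 24, 35, 32, 26, 28, 22, 24]
w, p = wilcoxon_signed_rank(scores_x, scores_y)
print(w, p, p < 0.05)
```

A difference is reported as significant when the returned p-value falls below α = 0.05.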

Architecture
Match-based reranking improves stiffness significantly. BERT match, which uses semantic match scores to rerank candidate summaries, enhances the stiffness of the model significantly in Fig. 4a while obtaining stableness comparable with other extractive models in Fig. 4b. This indicates that BERT match not only increases the absolute performance but also retains robustness.
BERT match is not stable when transferred from other datasets to Bigpatent B. As Tab. 4g shows, when compared to BERT non, BERT match obtains a larger gap between in-dataset and cross-dataset performance when tested on Bigpatent B. This is because Bigpatent B possesses a higher sentence fusion score and higher repetition compared with other datasets, as Sec. 4.4 demonstrates. When serving as the test set, such a dataset makes it very challenging for BERT match to correctly rank the candidate summaries, while it provides more training signals when serving as the training set. Thus the model trained in-dataset (on Bigpatent B) obtains a much higher score than the cross-dataset models trained on other datasets, which causes lower stableness.
Non-autoregressive decoder is more robust than autoregressive for extractive models. Regarding the decoder of extractive systems, as shown in Fig. 4a and Fig. 4b, the non-autoregressive extractive decoder (Trans non ) is more stable while it possesses lower stiffness than its autoregressive counterpart (Trans auto ).
Pointer network and coverage mechanism are instrumental in improving stiffness and stableness of abstractive systems. The pointer network and coverage mechanism do enhance the absolute performance of abstractive systems, as Fig. 4a demonstrates (r µ (L2L cov ptr ) > r µ (L2L ptr ) > r µ (L2L)). Also, the stableness results of L2L ptr and L2L in Fig. 4b reveal that once the pointer mechanism is removed, the value of r σ decreases, which suggests that a system will be more stable if it is augmented with the ability to directly extract text spans from the source document.
However, the pointer network brings trivial improvement when tested on Xsum and Reddit. The absolute performance improvement from the pointer network is trivial when tested on Xsum and Reddit, as shown in Tab. 4c, which is in line with expectations because these two datasets are more abstractive, as analyzed in Sec. 4.4.
On the other hand, coverage is not that helpful when tested on Reddit and Xsum and is even harmful when trained on Xsum. The heatmap of L2L cov ptr vs. L2L ptr in Tab. 4d shows that when tested on Reddit and Xsum, the improvement from the coverage mechanism is trivial. These two datasets possess less repetition, thus coverage cannot provide much help when transferring to these datasets. Moreover, when trained on Xsum, L2L cov ptr gets lower stiffness compared with L2L ptr, which is in accordance with the normalized result in Tab. 4j. This is because the gold summaries of Xsum exhibit a lower repetition score (as analyzed in Sec. 4.4) and thus cannot provide enough learning signals for the coverage mechanism.
BERT sometimes brings unstableness. As shown in Fig. 4a, once summarizers (extractive or abstractive) are equipped with a pre-trained encoder, the stiffness increases significantly (e.g., r µ (BE2T) >> r µ (T2T)), suggesting that the overall cross-dataset performance has been improved. However, we are surprised to find (from Fig. 4b) that BERT sometimes leads to unstableness (i.e., r σ (Trans non ) > r σ (BERT non )). This result motivates the search for other architectures or learning schemas to offset the unstableness brought by BERT.
As the heatmap of BERT non vs. Trans non in Tab. 4h shows, BERT brings unstableness especially when tested in Reddit and Xsum.
BERT sometimes can even harm the absolute cross-dataset performance. BERT non performs worse than Trans non in some cells (e.g., trained on Xsum and tested on CNNDM) in Tab. 4b.
BART shows superior performance in terms of stiffness and stableness. As Fig. 4a shows, BART obtains the highest stiffness among all abstractive models, and is even comparable with BERT match. In addition, BART is also outstanding in terms of stableness when compared with other abstractive models (Fig. 4b). The performance gap between BART and BE2T suggests that for abstractive models, pre-training the whole sequence-to-sequence model works better than using a pre-trained model on only one side (encoder or decoder).

Generation ways
Extractive models are superior to abstractive models in terms of stiffness and robustness. Extractive models show a clear advantage in absolute performance, as shown in Fig. 4a. Moreover, comparing the stableness of abstractive and extractive models in Fig. 4b, we surprisingly find that abstractive approaches except for BART are extremely brittle, since their r σ values are much lower than those of any extractive approach, with a maximum margin of 37%; the gap can be reduced by introducing a pointer network. This observation poses a great challenge to the development of abstractive systems, encouraging researchers to pay more attention to improving generalization ability. We have also provided hints for solutions, such as enabling the model to extract granular information from the source document or using a well pre-trained sequence-to-sequence model (e.g., BART).
When tested on Xsum and Reddit, abstractive systems possess comparable or even better performance. The supremacy of extractive models is not retained on all datasets (Tab. 4f and Tab. 4e). Though extractive models obtain higher stiffness scores when tested on CNNDM and PubMed, abstractive approaches (BE2T, L2L) obtain higher or comparable stiffness scores when tested on Xsum and Reddit. This is because Xsum and Reddit are more abstractive, as analyzed in Sec. 4.4.

Factuality Analysis
1) All extractive models achieve higher factuality scores, while all abstractive models obtain much lower ones (Fig. 5a). One interesting observation is that, for extractive models, not all factuality scores under the in-dataset setting are 100% in Tab. 5 (on-diagonal values), which reveals the limitation of the existing factuality checker.
2) BART significantly improves the ability to generate factual summaries compared with other abstractive models, as shown in Fig. 5a, even compared with L2L ptr, which is equipped with a pointer network and tends to copy from the source document.
3) Abstractive models obtain higher stableness of factuality scores in Fig. 5b, which surpasses 100%. This is because when tested on abstractive datasets (e.g., Xsum, as Sec. 4.4 shows), abstractive summarizers trained in-dataset tend to be more abstractive and obtain lower factuality scores, while summarizers trained on other, more extractive datasets (e.g., CNNDM) obtain higher factuality scores. The superiority of cross-dataset results over in-dataset results thus leads to higher stableness.

Related Work
Our work is connected to the following threads of topics of NLP research.
Cross-Dataset Generalization in NLP Recently, more researchers have shifted their focus from individual datasets to cross-dataset evaluation, aiming to get a comprehensive understanding of a system's generalization ability. [...] (2019), on the other hand, shows the generalization ability of reading comprehension models can be improved by pretraining on one or two other reading comprehension datasets. Fu et al. (2020) study model generalization in the field of NER. They point out the bottleneck of existing NER systems through in-depth analyses and provide suggestions for further improvement. Different from the above works, we attempt to explore generalization ability for summarization systems.
Diagnosing Limitations of Existing Summarization Systems Beyond ROUGE, some recent works try to explore the weaknesses of existing systems from diverse aspects. Zhang et al. (2018) try to figure out to what extent neural abstractive summarization systems are abstractive and discover that many abstractive systems tend to be near-extractive. On the other hand, Cao et al. (2018) and Kryscinski et al. (2019) study the factuality problem in modern neural summarization systems. The former put forward a model that combines the source document with preliminarily extracted fact descriptions and demonstrate the effectiveness of this model in terms of factual correctness, while the latter design a model-based automatic factuality evaluation metric. The abstractiveness and factuality errors studied in the above works are orthogonal to this work and can be easily combined with the cross-dataset evaluation framework in this paper, as Sec. 6.2 shows. Moreover, Wang et al. (2019); Hua and Wang (2017) attempt to investigate the domain shift problem in text summarization, but they focus on a single generation way (either abstractive or extractive). We also investigate the generalization of summarizers when transferring to different datasets, but include more datasets and models.

Conclusion
By performing a comprehensive evaluation of eleven summarization systems on five mainstream datasets, we summarize our observations below:
1) Abstractive summarizers are extremely brittle compared with extractive approaches, and the maximum gap between them reaches 37% in terms of the stableness measure (ROUGE) defined in this paper.
2) BART (SOTA system) is superior to other abstractive models and even comparable with extractive models in terms of stiffness (ROUGE). It is also robust when transferring between datasets, as it possesses high stableness (ROUGE).
3) BERT match (SOTA system) performs excellently in terms of stiffness, but still lacks stableness when transferred to Bigpatent B from other datasets.
4) The robustness of models can be improved either by equipping the model with the ability to copy spans from the source document (i.e., Lebanoff et al. (2019)) or by making use of a well-trained sequence-to-sequence pre-trained model (BART).
5) Simply adding BERT to the encoder improves the stiffness (ROUGE) of the model but causes a larger gap between cross-dataset and in-dataset performance. A better way should be found to merge BERT into abstractive models, or a better training strategy should be applied to offset the negative influence it brings.
6) The existing factuality checker (Factcc) is limited in its predictive power on positive samples (Sec. 6.2).
7) Out-of-domain systems can even surpass in-domain systems in terms of factuality (Sec. 6.2).

PUBMED PUBMED (Cohan et al., 2018) is drawn from scientific papers, specifically medical journal articles from the PubMed Open Access Subset. We use the introduction as the source document and the abstract as the summary here.
BIGPATENT BIGPATENT (Sharma et al., 2019) consists of 1.3 million records of U.S. patent documents with corresponding human-written summaries. According to the Cooperative Patent Classification (CPC), the dataset is divided into nine categories. One of the nine categories is chosen as a dataset from a different domain in our experiment (Category B: Performing Operations; Transporting).
REDDIT TIFU REDDIT TIFU (Kim et al., 2019) is a dataset with less formal posts compared with datasets mentioned above which mostly use formal documents as source. It is collected from the online discussion forum Reddit. They regard the body text as source, the title as short summary, and the TL;DR summary as long summary, thus making two sets of datasets: TIFU-short and TIFU-long. TIFU-long is used in this paper.

A.2 Dataset statistics
The detailed dataset statistics are presented in Tab. 6.
Table 6: Detailed statistics of five datasets. Lead-k indicates the ROUGE-1 F1 score of the first k sentences in the document, and Oracle indicates the globally optimal combination of sentences in terms of ROUGE-1 F1 score against the ground truth; the latter represents the upper bound of extractive models.

A.3.1 Extractive Summarizers
We use the same training setup as Zhong et al. (2019a). We use cross entropy as the loss function to train LSTM non and Trans auto. The hidden state dimension of the LSTM in LSTM non is set to 512 and the hidden state dimension of the Transformer in Trans auto is 2048. We use a Transformer with 8 heads.
BERT non and Trans non are constructed according to Liu and Lapata (2019). All documents and summaries are truncated to 512 tokens during training. BERT non and Trans non are trained for 50000 steps, and the gradient is accumulated every two steps. We use Adam as the optimizer and the learning rate is set to 2e-3.
BERT match is trained as in Zhong et al. (2020). It uses the base version of BERT as base model. We use Adam optimizer with warming up. The learning rate schedule follows Vaswani et al. (2017).

A.3.2 Abstractive Summarizers
L2L, L2L ptr and L2L cov ptr are trained using a PyTorch reimplementation of See et al. (2017). We use the same vocabulary size (50k), hidden state dimension (256) and word embedding dimension (128) as in the paper. All three models are trained for a maximum of 650000 training steps. We use Adagrad to train the models with a learning rate of 0.15.
BE2T and T2T are constructed according to Liu and Lapata (2019). For BE2T, we use two separate optimizers for the encoder and the decoder to offset the mismatch between them, since the former is pre-trained while the latter is not. The learning rates for the encoder and decoder optimizers are 0.002 and 0.2 respectively. BE2T and T2T are trained for 200000 steps with gradient accumulation every five steps.
BART uses the large pre-trained sequence-to-sequence model of Lewis et al. (2019). The total number of fine-tuning steps is set to 20000 with 500 warm-up steps. We use Adam as the optimizer and the learning rate is 3e-05.

Table 10: The difference in ROUGE-1 F1 scores between different model pairs. Every column of the table represents the comparison result for one pair of models. The holistic analysis row displays the overall stiffness and stableness of the compared models. The rest of the table shows the fine-grained results: the first and third rows are the original comparison results (U A − U B for model pair A and B), and the second and fourth rows are the normalized comparison results (Û A − Û B for model pair A and B). For all heatmaps, 'grey' represents positive, 'red' represents negative and 'white' represents approximately zero.