The Effect of Pretraining on Extractive Summarization for Scientific Documents

Large pretrained models have seen enormous success in extractive summarization tasks. In this work, we investigate the influence of pretraining on a BERT-based extractive summarization system for scientific documents. We derive significant performance improvements using an intermediate pretraining step that leverages existing summarization datasets, and report state-of-the-art results on a recently released scientific summarization dataset, SciTLDR. We systematically analyze the intermediate pretraining step by varying the size and domain of the pretraining corpus, the length of the input sequence in the target task, and the target task itself. We also investigate how intermediate pretraining interacts with contextualized word embeddings trained on different domains.


Introduction
Text summarization is a quintessential NLP task that involves generating a coherent and succinct summary containing the most salient information from the original article. Summarization systems are particularly useful for scientific articles, which tend to be long and rich in technical content. Summarization can arguably reduce information overload on researchers and facilitate the quick retrieval of relevant papers from vast amounts of scientific literature. Broadly, summarization techniques can be categorized as extractive or abstractive. While abstractive systems treat summarization as a natural language generation task and produce new phrases and sentences directly in the summary, extractive techniques select salient phrases or sentences verbatim from the original document to create a summary. Maynez et al. (2020), Kryscinski et al. (2020) and Huang et al. (2020) report factual hallucinations in abstractive summarization, and Durmus et al. (2020) highlight the trade-off between faithfulness and abstractiveness. Since it is critical for scientific summarization to be factually accurate and faithful to the source document, we focus on extractive summarization of scientific articles.
Large pretrained language models (e.g., BERT (Devlin et al., 2019)) have been successfully used for many NLP tasks, including summarization (Liu and Lapata, 2019), via the following, now widely-adopted, two-step approach:
• Pretraining. Start with a pretrained model like BERT and suitably adapt its architecture to fit the target task.
• Finetuning. Finetune the model using a labeled dataset for the target task.
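To ground this recipe, here is a minimal sketch of a BERTSUM-style extractive scorer in PyTorch. It follows the simple-classifier variant of Liu and Lapata (2019), where a [CLS] token is inserted before every sentence and each sentence is scored from the encoding of its [CLS] token; it omits the inter-sentence Transformer layers of the full model for brevity, and all class and variable names are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class ExtractiveSummarizer(nn.Module):
    """BERTSUM-style sentence scorer (illustrative sketch, not the authors' code).

    A [CLS] token is inserted before every sentence of the input document;
    each sentence is scored from the encoding of its [CLS] token.
    """
    def __init__(self, encoder_name: str = "bert-base-uncased"):
        super().__init__()
        self.encoder = BertModel.from_pretrained(encoder_name)
        self.scorer = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask, cls_positions):
        # hidden: (batch, seq_len, hidden_size)
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        # Gather the encoding of each per-sentence [CLS] token.
        idx = cls_positions.unsqueeze(-1).expand(-1, -1, hidden.size(-1))
        sent_vecs = hidden.gather(1, idx)           # (batch, n_sents, hidden)
        return torch.sigmoid(self.scorer(sent_vecs)).squeeze(-1)  # selection probs
```

During finetuning, these scores are trained with binary cross-entropy against oracle extractive labels, and the top-scoring sentences form the summary at inference time.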
Recent work shows the benefits of interspersing the pretraining and finetuning steps with an intermediate pretraining step (Phang et al., 2018; Vu et al., 2020). This intermediate step often involves supervised pretraining using labeled datasets from different domains for a task that is related to, or the same as, the target task. While the efficacy of such pretraining approaches has been studied in prior work for natural language understanding tasks (such as entailment and question answering (Vu et al., 2020)), the effect of pretraining on summarization has been far less explored.
In this work, we explore the benefits of intermediate pretraining using existing summarization datasets for a target task involving the summarization of scientific articles. We obtain improvements in performance over state-of-the-art extractive summarization baselines on a new scientific summarization benchmark, SCITLDR (Cachola et al., 2020). We also make the following key observations:
• Intermediate pretraining using labeled summarization datasets (even when they contain articles from domains very different from scientific articles) is very beneficial for low-resource target tasks like SCITLDR. We derive additional benefits by filtering the intermediate pretraining data to retain only the subset of articles that best matches the target task, based on a similarity metric (see the sketch below).
• While starting with a BERT-based model pretrained on scientific articles (e.g., SCIBERT (Beltagy et al., 2019)) offers a small advantage over the standard BERT-based model as an initialization, this advantage is eclipsed by the much more significant effect of intermediate pretraining.
• The benefits of intermediate pretraining diminish with access to sufficiently large amounts of finetuning data in the target task. We also observe diminishing returns from intermediate pretraining as we increase the amount of pretraining data.
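The first observation above mentions filtering the pretraining data with a similarity metric. The paper does not specify the metric at this point, so the following is only a plausible sketch, assuming TF-IDF cosine similarity to a centroid of target-task documents; the function name select_pretraining_subset and all parameters are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def select_pretraining_subset(pretrain_docs, target_docs, keep_fraction=0.5):
    """Keep the pretraining articles most similar to the target task.

    Illustrative sketch only; the actual filtering criterion may differ.
    """
    vec = TfidfVectorizer(max_features=50_000, stop_words="english")
    # Fit on both corpora so they share one vocabulary.
    vec.fit(pretrain_docs + target_docs)
    pre = vec.transform(pretrain_docs)
    centroid = np.asarray(vec.transform(target_docs).mean(axis=0))
    sims = cosine_similarity(pre, centroid).ravel()
    k = int(len(pretrain_docs) * keep_fraction)
    keep = np.argsort(-sims)[:k]              # indices of the most similar docs
    return [pretrain_docs[i] for i in keep]
```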

Models and Implementation Details
Our extractive summarization system uses the BERT-based architecture of Liu and Lapata (2019) described in Section 3. For intermediate pretraining, we use one of CNN/DM, Pubmed or MIXED. The finetuning step involves data from one of three target tasks: SCITLDR-A, SCITLDR-AIC and Pubmed. For all training steps, we set the dropout rate to 0.1 and the learning rate to 2e-3, the parameters reported by Liu and Lapata (2019) for CNN/DM. We use a batch size of 3000 for all experiments involving CNN/DM during pretraining. The best model is selected on the basis of ROUGE scores for one-line summaries on the validation set; this selects the model with the best "extreme" summarization capability. When evaluating on Pubmed, the number of extracted sentences is set to 6, following Zhong et al. (2020). For finetuning on SCITLDR-A as well as SCITLDR-AIC, the batch size is set to 100 and a single sentence is extracted to form the final summary.
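For reference, the settings above can be collected in one place; this is only a restatement of the hyperparameters stated in this section as a plain dictionary, not the authors' actual configuration file.

```python
# Training configuration across all stages (values restated from the text above).
CONFIG = {
    "dropout": 0.1,
    "learning_rate": 2e-3,          # as reported by Liu and Lapata (2019) for CNN/DM
    "pretrain_batch_size": 3000,    # all experiments involving CNN/DM in pretraining
    "finetune_batch_size": 100,     # SCITLDR-A and SCITLDR-AIC finetuning
    "extract_sentences": {          # sentences selected to form the final summary
        "pubmed": 6,                # following Zhong et al. (2020)
        "scitldr_a": 1,
        "scitldr_aic": 1,
    },
}
```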
Evaluation Metrics. The SCITLDR tasks have multiple reference summaries for each test article. We compute ROUGE scores between the summary generated by our system and each of the reference summaries, and consider the reference with the maximum ROUGE-1 score as the main gold summary for further evaluations. We use ROUGE-1 (R1), ROUGE-2 (R2) and ROUGE-L (RL) as our main evaluation metrics, as is typical for summarization tasks. To determine the best possible performance of an extractive summarization system, we also compute oracle scores by choosing, from each test article, the sentence with the highest R1 score across all reference summaries and averaging these scores over the test articles.
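A minimal sketch of this evaluation protocol, assuming the rouge-score package (Google's reference implementation); the helper names best_reference and oracle_r1 are ours, purely for illustration.

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

def best_reference(prediction, references):
    """Return the reference summary with the highest ROUGE-1 F1 vs. the prediction."""
    return max(references,
               key=lambda ref: scorer.score(ref, prediction)["rouge1"].fmeasure)

def oracle_r1(article_sentences, references):
    """Best achievable ROUGE-1 F1 for a single extracted sentence,
    maximized over all sentences and all reference summaries."""
    return max(scorer.score(ref, sent)["rouge1"].fmeasure
               for sent in article_sentences
               for ref in references)

# Corpus-level oracle: average the per-article oracle scores (illustrative names).
# oracle = sum(oracle_r1(s, r) for s, r in test_set) / len(test_set)
```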

Results and Discussion
Table 1 and Table 2 show our main results. In the first two rows, we present results from the state-of-the-art MatchSum system (Zhong et al., 2020) and oracle scores. The remaining rows show pretraining results using BERT and SCIBERT embeddings in the BERTSUM model. Without any intermediate pretraining, SCIBERT offers a small advantage over BERT on Pubmed and is statistically comparable to BERT on both SCITLDR tasks. With pretraining and using BERT, we observe significant improvements in performance regardless of the pretraining corpus used. With pretraining and replacing BERT with SCIBERT, we observe a deterioration in performance, indicated by the drop in ROUGE scores (especially with CNN/DM). The SCIBERT initialization appears to be counterproductive when using CNN/DM during intermediate pretraining: it is more beneficial to start with BERT, rather than SCIBERT, and pretrain on CNN/DM before the final finetuning step.
Additionally, we undertake two ablation experiments. 1) We investigate the effect of varying the amount of pretraining data: we vary the size of CNN/DM across 83K, 176K and 286K articles and analyze the finetuning results on SCITLDR-AIC with BERT embeddings. As shown in Table 3 (results obtained by varying the size of the CNN/DM pretraining dataset while finetuning on SCITLDR-AIC), the R1, R2 and RL scores increase on moving from 83K to 176K articles, but performance stagnates with a further increase in the size of the pretraining corpus. 2) During finetuning, we experiment with truncating the input sequences of SCITLDR-AIC and Pubmed at 512, 1024 and 1500 tokens, as shown in Table 4. We initialize the model with BERT embeddings for the first 512 tokens and repeat the last set of weights for the remaining input tokens (see the sketch below). We observe that ROUGE scores improve with longer input lengths, with a sizeable boost for Pubmed.
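Ablation 2) needs position embeddings beyond BERT's 512-token limit. Here is a sketch of one way to implement the initialization described above using HuggingFace transformers, tiling the pretrained position embeddings over the extra positions; the exact repetition scheme in the paper may differ, and extend_position_embeddings is an illustrative name.

```python
from transformers import BertConfig, BertModel

def extend_position_embeddings(model_name="bert-base-uncased", max_len=1500):
    """Initialize a BERT encoder for long inputs: positions 0-511 use the
    pretrained embeddings, later positions repeat them (illustrative sketch)."""
    src = BertModel.from_pretrained(model_name)
    cfg = BertConfig.from_pretrained(model_name, max_position_embeddings=max_len)
    model = BertModel(cfg)

    # Copy all pretrained weights except position-related tensors,
    # whose shapes no longer match the extended model.
    state = {k: v for k, v in src.state_dict().items() if "position" not in k}
    model.load_state_dict(state, strict=False)

    old = src.embeddings.position_embeddings.weight.data     # (512, hidden)
    new = model.embeddings.position_embeddings.weight.data   # (max_len, hidden)
    n = old.size(0)
    for start in range(0, max_len, n):                       # tile the 512 block
        end = min(start + n, max_len)
        new[start:end] = old[: end - start]
    return model
```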

Conclusions and Future Work
In this paper, we present a systematic investigation of the benefits of transfer learning via pretraining for extractive summarization of scientific articles. We show improvements in ROUGE scores on the SCITLDR benchmark using an intermediate pretraining step that leverages existing summarization datasets, and obtain additional benefits by filtering these datasets to construct a pretraining corpus that best matches the target task. This motivates future work exploring different criteria for selective pretraining and how such criteria could benefit both extractive and abstractive summarization.

A.1.1 SCITLDR
This dataset is built from a combination of TLDRs written by human experts and author-written TLDRs of computer science papers from OpenReview. OpenReview (https://openreview.net/) is a platform where authors are asked to submit TLDRs of their papers, which communicate the main content of the paper to both reviewers and users of OpenReview. SCITLDR has multiple reference summaries for each of the test and validation articles; the additional reference summaries (apart from the author-written one) were obtained from human annotators. This is an "extreme" summarization task, as the compression ratio is very high compared to the other datasets (around 47 for the AIC task). While the dataset is inherently abstractive in nature, the extractive oracle scores listed in Table ?? are quite high (in fact, much higher than existing abstractive and extractive SoTA scores), which implies there is ample scope for extractive summarization.

A.1.2 CNN/DM
This dataset contains online news articles paired with multi-sentence summaries (highlights of the news articles). The dataset is fairly large and also has a high extractive oracle (ROUGE-1 / ROUGE-2 / ROUGE-L scores of 52.59 / 31.24 / 48.87), although the summaries are not inherently extractive. The compression ratio, around 13, is much lower than that of SCITLDR.

A.1.3 Pubmed
This dataset is collected from scientific papers. It has a very low compression ratio of around 2, a direct consequence of using the introduction section as the document and the abstract as the corresponding summary. The summaries are relatively long compared to SCITLDR and CNN/DM, with around 6 sentences per summary.
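For concreteness, the compression ratios quoted in this appendix are simply document-to-summary length ratios; a small sketch, assuming whitespace tokenization (the paper does not specify its exact tokenization):

```python
def compression_ratio(document: str, summary: str) -> float:
    # Whitespace tokenization is an assumption, not the paper's exact method.
    return len(document.split()) / len(summary.split())

# Illustrative numbers only: a 4,700-token document with a 100-token TLDR
# gives a ratio of 47, the order of magnitude quoted for SCITLDR-AIC.
print(compression_ratio("w " * 4700, "w " * 100))  # -> 47.0
```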

A.2 Qualitative Analysis
We present examples of SCITLDR articles and generated summaries to illustrate the effects of pretraining and other design choices (such as varying input lengths and BERT/SCIBERT initializations).

A.2.1 Effect of Input Sequence Length on SciTLDR-AIC
Article 1
Good representations facilitate transfer learning and few-shot learning. Motivated by theories of language and communication that explain why communities with large number of speakers have, on average, simpler languages with more regularity, we cast the representation learning problem in terms of learning to communicate. Our starting point sees traditional autoencoders as a single encoder with a fixed decoder partner that must learn to communicate. Generalizing from there, we introduce community-based autoencoders in which multiple encoders and decoders collectively learn representations by being randomly paired up on successive training iterations. Our experiments show that increasing community sizes reduce idiosyncrasies in the learned codes, resulting in more invariant representations with increased reusability and structure. The importance of representation learning lies in two dimensions. First and foremost, representation learning is a crucial building block of a neural model being trained to perform well on a particular task, i.e., representation learning that induces the "right" manifold structure can lead to models that generalize better, and even extrapolate. Another property of representation learning, and arguably the most important one, is that it can facilitate transfer of knowledge across different tasks, essential for transfer learning and few-shot learning among others BID0. With this second point in mind, we can define good representations as the ones that are reusable, induce the abstractions that capture the "right" type of invariances and can allow for generalizing very quickly to a new task. Significant efforts have been made to learn representations with these properties; one frequently explored direction involves trying to learn disentangled representations BID12 BID6 BID5 BID17, while others focus on general regularization methods BID15 BID18. In this work, we take a different approach to representation learning, inspired by successful abstraction mechanisms found in nature, to wit human language and communication. Human languages and their properties are greatly affected by the size of their linguistic community BID11 BID19 BID16 BID9 .....

Ground Truth Summaries
Motivated by theories of language and communication, we introduce community-based autoencoders, in which multiple encoders and decoders collectively learn structured and reusable representations.
The authors tackle the problem of representation learning, aim to build reusable and structured represenation, argue co-adaptation between encoder and decoder in traditional AE yields poor representation, and introduce community based auto-encoders.
The paper presents a community based autoencoder framework to address co-adaptation of encoders and decoders and aims at constructing better representations.

Input Length 512 (ROUGE-1: 18.18, ROUGE-2: 0.00, ROUGE-L: 12.12)
Good representations facilitate transfer learning and few-shot learning.

Input Length 1024 (ROUGE-1: 28.57, ROUGE-2: 0.00, ROUGE-L: 14.29)
Our starting point sees traditional autoencoders as a single encoder with a fixed decoder partner that must learn to communicate.

Input Length 1500 (ROUGE-1: 60.0, ROUGE-2: 49.99, ROUGE-L: 55.99)
Generalizing from there, we introduce community-based autoencoders in which multiple encoders and decoders collectively learn representations by being randomly paired up on successive training iterations.

Article 2
Generative models are important tools to capture and investigate the properties of complex empirical data. Recent developments such as Generative Adversarial Networks (GANs) and Variational Auto-Encoders (VAEs) use two very similar, but reverse, deep convolutional architectures, one to generate and one to extract information from data. Does learning the parameters of both architectures obey the same rules? We exploit the causality principle of independence of mechanisms to quantify how the weights of successive layers adapt to each other. Using the recently introduced Spectral Independence Criterion, we quantify the dependencies between the kernels of successive convolutional layers and show that those are more independent for the generative process than for information extraction, in line with results from the field of causal inference. In addition, our experiments on generation of human faces suggest that more independence between successive layers of generators results in improved performance of these architectures. Deep generative models have proven powerful in learning to design realistic images in a variety of complex domains (handwritten digits, human faces, interior scenes). In particular, two approaches have recently emerged: Generative Adversarial Networks (GANs) BID8, which train an image generator by having it fool a discriminator that should tell apart real from artificially generated images; and Variational Autoencoders (VAEs) BID15 BID21 that learn both a mapping from latent variables to the data (the decoder) and the converse mapping from the data to the latent variables (the encoder), such that correspondences between latent variables and data features can be easily investigated.....

Ground Truth Summaries
We use causal inference to characterise the architecture of generative models.
This paper examines the nature of convolutional filters in the encoder and a decoder of a VAE, and a generator and a discriminator of a GAN.
This work exploits the causality principle to quantify how the weights of successive layers adapt to each other.

Input Length 512 (ROUGE-1: 25.92, ROUGE-2: 3.84, ROUGE-L: 14.81)
Using the recently introduced Spectral Independence Criterion, we quantify the dependencies between the kernels of successive convolutional layers and show that those are more independent for the generative process than for information extraction, in line with results from the field of causal inference.

A.2.2 Effect of Pretraining on SciTLDR-AIC
Article 1
Recent advances in neural Sequence-to-Sequence (Seq2Seq) models reveal a purely data-driven approach to the response generation task. Despite its diverse variants and applications, the existing Seq2Seq models are prone to producing short and generic replies, which blocks such neural network architectures from being utilized in practical open-domain response generation tasks. In this research, we analyze this critical issue from the perspective of the optimization goal of models and the specific characteristics of human-to-human conversational corpora. Our analysis is conducted by decomposing the goal of Neural Response Generation (NRG) into the optimizations of word selection and ordering. It can be derived from the decomposing that Seq2Seq based NRG models naturally tend to select common words to compose responses, and ignore the semantic of queries in word ordering. On the basis of the analysis, we propose a max-marginal ranking regularization term to avoid Seq2Seq models from producing the generic and uninformative responses. The empirical experiments on benchmarks with several metrics have validated our analysis and proposed methodology. Past years have witnessed the dramatic progress on the application of generative sequential models (also noted as seq2seq learning (Sutskever et al., 2014))... Despite these promising results, current Sequence-to-Sequence (Seq2Seq) architectures for response generation are still far from steadily generating relevant and coherent replies. The essential issue identified by many studies is the Universal Replies: the model tends to generate short and general replies which contain limited information, such as "That's great!", "I don't know", etc. Nevertheless, most previous analysis over the issue are empirical and lack of statistical evidence. Therefore, in this paper, we conduct an in-depth investigation on the performance of seq2seq models on the NRG task....

Ground Truth Summaries
Analyze the reason for neural response generative models preferring universal replies; Propose a method to avoid it.
Investigates the problem of universal replies plaguing the Seq2Seq neural generation models.
The paper looks into improving the neural response generation task by deemphasizing the common responses using modification of the loss function and presentation the common/universal responses during the training phase.

Therefore, in this paper, we conduct an in-depth investigation on the performance of seq2seq models on the NRG task.

Article 2
Graph convolutional networks (GCNs) have been widely used for classifying graph nodes in the semi-supervised setting. Previous works have shown that GCNs are vulnerable to the perturbation on adjacency and feature matrices of existing nodes. However, it is unrealistic to change the connections of existing nodes in many applications, such as existing users in social networks. In this paper, we investigate methods attacking GCNs by adding fake nodes. A greedy algorithm is proposed to generate adjacency and feature matrices of fake nodes, aiming to minimize the classification accuracy on the existing ones. In additional, we introduce a discriminator to classify fake nodes from real nodes, and propose a Greedy-GAN algorithm to simultaneously update the discriminator and the attacker, to make fake nodes indistinguishable to the real ones....

Ground Truth Summaries
non-targeted and targeted attack on GCN by adding fake nodes
The authors propose a new adversarial technique to add "fake" nodes to fool a GCN-based classifier

Pubmed (ROUGE-1: 23.53, ROUGE-2: 0.0, ROUGE-L: 11.76)
Graph convolutional networks (GCNs) have been widely used for classifying graph nodes in the semi-supervised setting.

Article 1
In this paper, we introduce a system called GamePad that can be used to explore the application of machine learning methods to theorem proving in the Coq proof assistant. Interactive theorem provers such as Coq enable users to construct machine-checkable proofs in a step-by-step manner. Hence, they provide an opportunity to explore theorem proving with human supervision. We use GamePad to synthesize proofs for a simple algebraic rewrite problem and train baseline models for a formalization of the Feit-Thompson theorem. We address position evaluation (i.e., predict the number of proof steps left) and tactic prediction (i.e., predict the next proof step) tasks, which arise naturally in tactic-based theorem proving. Theorem proving is a challenging AI task that involves symbolic reasoning (e.g., SMT solvers BID2) and intuition guided search. Recent work (BID7; Loos et al., 2017) has shown the promise of applying deep learning techniques in this domain, primarily on tasks useful for automated theorem provers (e.g., premise selection) which operate with little to no human supervision. In this work, we aim to move closer to learning on proofs constructed with human supervision. We look at theorem proving in the realm of formal proofs. A formal proof is systematically derived in a formal system, which makes it possible to algorithmically (i.e., with a computer) check these proofs for correctness....

Ground Truth Summaries
We introduce a system called GamePad to explore the application of machine learning methods to theorem proving in the Coq proof assistant.
This paper describes a system for applying machine learning to interactive theorem proving, focuses on tasks of tactic prediction and position evaluation, and shows that a neural model outperforms an SVM on both tasks.
Proposes that machine learning techniques be used to help build proof in the theorem prover Coq.

In this paper, we introduce a system called GamePad that can be used to explore the application of machine learning methods to theorem proving in the Coq proof assistant.

Article 2
We propose a novel method that makes use of deep neural networks and gradient decent to perform automated design on complex real world engineering tasks. Our approach works by training a neural network to mimic the fitness function of a design optimization task and then, using the differential nature of the neural network, perform gradient decent to maximize the fitness. We demonstrate this methods effectiveness by designing an optimized heat sink and both 2D and 3D airfoils that maximize the lift drag ratio under steady state flow conditions. We highlight that our method has two distinct benefits over other automated design approaches. First, evaluating the neural networks prediction of fitness can be orders of magnitude faster then simulating the system of interest. Second, using gradient decent allows the design space to be searched much more efficiently then other gradient free methods. These two strengths work together to overcome some of the current shortcomings of automated design. Automated Design is the process by which an object is designed by a computer to meet or maximize some measurable objective. This is typically performed by modeling the system and then exploring the space of designs to maximize some desired property whether that be an automotive car styling with low drag or power and cost efficient magnetic bearings BID1 BID4. A notable historic example of this is the 2006 NASA ST5 spacecraft antenna designed by an evolutionary algorithm to create the best radiation pattern (Hornby et al.). More recently, an extremely compact broadband on-chip wavelength demultiplexer was design to split electromagnetic waves with different frequencies BID17. While there have been some significant successes in this field the dream of true automated is still far from realized. The main challenges present are heavy computational requirements for accurately modeling the physical system under investigation and often exponentially large search spaces. These two problems negatively complement each other making the computation requirements intractable for even simple problems. Our approach works to solve the current problems of automated design in two ways. First, we learn a computationally efficient representation of the physical system on a neural network. This trained network can be used to evaluate the quality or fitness of the design several orders of magnitude faster. Second, we use the differentiable nature of the trained network to get a gradient on the parameter space when performing optimization. This allows significantly more efficient optimization requiring far fewer iterations then other gradient free methods such as genetic algorithms or simulated annealing....

Ground Truth Summaries
A method for performing automated design on real world objects such as heat sinks and wing airfoils that makes use of neural networks and gradient descent.
Neural network (parameterization and prediction) and gradient descent (back propogation) to automatically design for engineering tasks.
This paper introduces using a deep network to approximate the behavior of a complex physical system, and then design optimal devices by optimizing this network with respect to its inputs.

BERT Output (ROUGE-1: 16.67, ROUGE-2: 4.35, ROUGE-L: 12.49)
This allows significantly more efficient optimization requiring far fewer iterations then other gradient free methods such as genetic algorithms or simulated annealing.

SCIBERT Output (ROUGE-1: 62.75, ROUGE-2: 40.82, ROUGE-L: 39.22)
We propose a novel method that makes use of deep neural networks and gradient descent to perform automated design on complex real world engineering tasks.