Using Pre-Trained Transformer for Better Lay Summarization

In this paper, we tackle the lay summarization task, which aims to automatically produce lay summaries for scientific papers, as participants in the first CL-LaySumm 2020 shared task at the SDP workshop, EMNLP 2020. We present our approach of using Pre-training with Extracted Gap-sentences for Abstractive Summarization (PEGASUS; Zhang et al., 2019b) to produce the lay summary and combining it with an extractive summarization model based on Bidirectional Encoder Representations from Transformers (BERT; Devlin et al., 2019) and a readability metric that measures how easy each sentence is to read. Our model achieves strong performance on ROUGE metrics, and the produced summaries are more readable while still covering the main points of the document.


Introduction
Recent summarization techniques have greatly benefited from the advancement of language models and successfully produce plausible summaries both for general news articles and for technical scholarly documents in expert domains. An informative but concise summary can help people reduce search time and speed up decision making by expeditiously surfacing the most relevant documents (Mani et al., 2002; Roussinov and Chen, 2001; Maña-López et al., 2004; McKeown et al., 2005). For scholarly documents, automatic summarization is especially promising since it can help researchers cope with the exponentially growing number of publications (Bornmann and Mutz, 2015).
Despite recent advances in automatic summarization, summarization of scholarly documents has been explored far less than summarization of news articles, largely due to the absence of large-scale datasets. Writing lay summaries for scholarly documents is challenging since it requires expert knowledge to understand the technical jargon and the complex structure of scientific documents. Because of these inherent challenges, existing summarization techniques for scientific documents are limited: the produced summary is either too concise to provide important information (Vasilyev et al., 2019; Cachola et al., 2020) or extracted directly from the abstract or citation sentences, so that it mostly resembles the abstract, making it hard for the public and for researchers from outside the particular domain to understand the main points of a scientific paper. Although the readability of abstracts in scientific papers has continuously decreased due to the increasing use of technical jargon (Plavén-Sigray et al., 2017), summarization of scientific papers for the public and for researchers outside a given field has remained elusive.
To provide a better summary for the public and researchers, we participated in the first Computational Linguistics Lay Summary Challenge shared task (CL-LaySumm 2020; Chandrasekaran et al., forthcoming) and developed a summarization system that automatically produces lay summaries for scholarly documents. The main task of CL-LaySumm is to produce a lay summary given the full text and abstract of a research paper. We employed the dataset provided by the CL-LaySumm 2020 committee and performed experiments using recent summarization models, including Pre-training with Extracted Gap-sentences for Abstractive Summarization (PEGASUS; Zhang et al., 2019b) and extractive summarization with Bidirectional Encoder Representations from Transformers (BERT; Devlin et al., 2019), together with an evaluation protocol that measures the readability of the sentences in the summary. We showcase how PEGASUS, BERT, and the readability metric improve the summarization system and demonstrate that the produced summary is more readable while it summarizes the main ideas of the documents.

Related Work
The recent benchmark datasets widely used to evaluate summarization systems fall into two categories: news articles and scientific documents.
Summarization of news articles has been explored more actively since it is relatively easy to obtain human-written summaries. Woodsend and Lapata (2010) and Cheng and Lapata (2016) created large-scale datasets containing around 200K news articles with manually written gold summaries.
Owing to these large-scale datasets and the relatively simple structure of news articles, neural abstractive summarization using sequence models such as Long Short-Term Memory networks (LSTM; Hochreiter and Schmidhuber, 1997) with an attention mechanism (Bahdanau et al., 2014) has been widely applied to news articles. The attention-based encoder-decoder network has since been improved in several directions. See et al. (2017) augmented an LSTM with two networks: a pointer-generator network that reproduces accurate expressions by pointing to words in the source, and a coverage network that avoids repetition. Paulus et al. (2017) incorporated reinforcement learning (RL) into sequence models for summarization, and Celikyilmaz et al. (2018) developed multi-agent encoders that communicate with each other by sharing the outputs of each encoder layer.
Since the advent of pre-trained language models such as the Transformer, BERT, and Bidirectional and Auto-Regressive Transformers (BART), the summarization literature has benefited from the more contextual word representations these models provide (Vaswani et al., 2017; Devlin et al., 2019; Lewis et al., 2020). Liu and Lapata (2019) used a BERT model as the encoder, Zhang et al. (2019a) applied BERT to both the encoder and decoder networks, Scialom et al. (2020) constructed generative adversarial networks using BERT models, and Yoon et al. (2020) appended semantic similarity layers on top of a pre-trained BART. While neural sequence models have been successfully applied to news summarization, applying the same techniques to scientific documents is challenging since scholarly documents are far longer than ordinary news articles and have a more complicated structure. Our work differs from the works described above in that we tackle summarization for scientific documents.
Although summarization of scientific texts is less explored, Cohan et al. (2018) proposed a hierarchical encoder-decoder network to handle long scholarly documents when generating abstract-style summaries; other work suggested summarization using abstract and citation sentences with graph convolutional networks (Kipf and Welling, 2016) and LSTMs, and released a medium-scale dataset containing 1,000 scientific papers in the computational linguistics domain with human-written summaries and citation sentences for each paper. Cachola et al. (2020) implemented an extreme summarization system, TLDR (Too Long; Didn't Read) summarization, for scientific documents using multi-task learning with headline generation models (Vasilyev et al., 2019). Zhang et al. (2019b) proposed PEGASUS, which masks important sentences in the input document and trains a Transformer-based encoder-decoder network to summarize the main points of the content given the remainder of the text. PEGASUS tackled summarization for both news articles and scholarly documents, but it only aimed to reproduce the abstract. In contrast, our work is distinct from the previous approaches as we aim to produce lay summaries for scientific documents, rather than generating extremely short summaries or summaries full of technical words that are difficult for lay audiences to understand.
To facilitate scholarly document processing, there have been annual workshops on data mining, natural language processing (NLP), and information retrieval for scientific publications: BIRNDL (Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries), WOSP (Workshop on Mining Scientific Publications), and TAC (Text Analysis Conference). In particular, the annual CL-SciSumm shared task (Jaidka et al., 2016, 2018; Chandrasekaran et al., 2019) has encouraged participants to work on scientific document summarization. Our work is closely related to CL-LaySumm 2020, the first lay summary shared task. We employed the LaySumm dataset provided by the workshop organizing committee and performed experiments using a variety of recent summarization models to develop our lay summarization system.

Data Analysis

Overview
The LaySumm dataset consists of around 600 scientific papers in the epilepsy, archeology, and materials engineering domains, including the full text, the abstract, and a corresponding lay summary written by the authors and journalists. The task for CL-LaySumm 2020 is to create a lay summary of fewer than 150 words given the full text and the abstract of the paper. For evaluation, a test set containing 37 scientific papers without ground-truth lay summaries is provided. Table 1 shows the average number of words and sentences for each document; spaCy (Honnibal and Johnson, 2015) was used for tokenization.
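As an illustration of how these statistics can be computed, the sketch below counts words and sentences with spaCy; the model name and sample text are assumptions for illustration, not details from the task setup.

```python
# Minimal sketch: per-document word and sentence counts with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed English pipeline

def count_words_and_sentences(text: str):
    doc = nlp(text)
    n_words = sum(1 for tok in doc if not tok.is_punct and not tok.is_space)
    n_sents = sum(1 for _ in doc.sents)
    return n_words, n_sents

print(count_words_and_sentences("We study lay summarization. PEGASUS is fine-tuned on paired data."))
```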

Sentence similarity
Before developing a specific summarization model, we measured sentence similarity to determine which type of summarization is suitable for lay summarization. There are two types of summarization: extractive and abstractive. Extractive summarization scores the importance of sentences in the source and directly extracts sentences based on the score. In contrast, abstractive summarization generates the summary from scratch while preserving the representative content of the source. We assumed that abstractive summarization resembles the way humans summarize content and that lay summarization therefore belongs to the abstractive category. However, if the sentences in the lay summary already exist in the abstract or full text of the paper, extractive summarization is more promising. Table 2 shows the average number of overlapping sentences between the lay summary and the abstract and full text for the training set.
As shown in Table 2, the lay summaries were written from scratch rather than by directly reusing sentences from the abstract or full text of the paper. We observed that overlap occurs in only 7% of the training set (40 of 572 documents), and when a lay summary shares any sentence with the abstract, it shares 1.73 sentences on average.
We also measured the similarity between the sentences in the lay summary and the sentences in the full text, using Term Frequency-Inverse Document Frequency (TF-IDF) and Jaccard similarity (Sammut and Webb, 2010; Hamers et al., 1989). Figure 1 shows the maximum similarity for each sentence in the lay summary in terms of TF-IDF and Jaccard similarity. As the figure shows, the similarity is below 0.4 on average for TF-IDF and even lower for Jaccard similarity. Based on this analysis, we excluded the full text and aimed to produce the lay summary solely from the abstract.
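A minimal sketch of this analysis, assuming sentences have already been segmented; fitting the TF-IDF vectorizer on the pooled sentences is one reasonable choice among several.

```python
# Sketch: for each lay-summary sentence, the maximum TF-IDF cosine and
# Jaccard similarity against all full-text sentences.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def max_similarities(lay_sents, full_sents):
    vec = TfidfVectorizer().fit(full_sents + lay_sents)
    lay_mat, full_mat = vec.transform(lay_sents), vec.transform(full_sents)
    tfidf_max = cosine_similarity(lay_mat, full_mat).max(axis=1)   # per lay sentence
    jaccard_max = [max(jaccard(l, f) for f in full_sents) for l in lay_sents]
    return tfidf_max, jaccard_max
```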

Method
Our system uses two main summarization models to generate the lay summary: PEGASUS for abstractive summarization and PreSumm (Liu and Lapata, 2019) for extractive summarization. We trained the summarization models on the lay summary dataset in a supervised way by pairing the abstract with the corresponding lay summary of each paper. After producing the lay summary with PEGASUS, we improved its quality by appending important sentences whenever the number of words in the summary was under a certain threshold. For example, if the lay summary generated by the abstractive model was under 90 words, we added sentences from the corresponding summary generated by the extractive model until this threshold was reached, as sketched below. When appending sentences to the produced summary, we prioritized the sentences in the abstract based on the score predicted by the PreSumm model and the readability metric, and applied trigram blocking to avoid repetition (Paulus et al., 2017). Detailed descriptions of each summarization model and the readability metric follow in the next sections.
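The following sketch captures the combination step under stated assumptions: `abs_sents` is the sentence-split PEGASUS output, and `ext_candidates` holds abstract sentences already ordered by PreSumm score and filtered by the readability metric described later.

```python
# Sketch of the combination step: pad a short abstractive summary with
# extractive sentences, with trigram blocking to avoid repetition.
def trigrams(text: str):
    toks = text.lower().split()
    return {tuple(toks[i:i + 3]) for i in range(len(toks) - 2)}

def combine(abs_sents, ext_candidates, threshold=90):
    summary = list(abs_sents)
    seen = set().union(*map(trigrams, summary))
    n_words = sum(len(s.split()) for s in summary)
    for sent in ext_candidates:
        if n_words >= threshold:
            break
        tg = trigrams(sent)
        if tg & seen:            # trigram blocking: skip overlapping sentences
            continue
        summary.append(sent)
        seen |= tg
        n_words += len(sent.split())
    return " ".join(summary)
```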

PEGASUS
We used PEGASUS, which is pre-trained on a large corpus of news text from web pages, to produce abstractive summaries (Zhang et al., 2019b). PEGASUS is a Transformer-based encoder-decoder network; during pre-training, principal sentences, greedily selected by ROUGE score (Lin, 2004), are masked in the input text and the model is trained to generate these important sentences from the remainder. We used the official implementation and checkpoint of the pre-trained PEGASUS model without any modification and fine-tuned it directly on the lay summary dataset.
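We used the authors' official implementation; the sketch below shows an equivalent fine-tuning step using the HuggingFace port of PEGASUS, where the checkpoint name, learning rate, and truncation settings are assumptions for illustration rather than our exact setup.

```python
# Illustrative fine-tuning step: abstracts are inputs, author-written lay
# summaries are targets (HuggingFace port, not the official implementation).
import torch
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

model_name = "google/pegasus-large"  # assumed checkpoint
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def training_step(abstract: str, lay_summary: str) -> float:
    inputs = tokenizer(abstract, truncation=True, return_tensors="pt")
    labels = tokenizer(lay_summary, truncation=True, return_tensors="pt").input_ids
    loss = model(**inputs, labels=labels).loss  # cross-entropy on target tokens
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```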

PreSumm
For extractive summarization, we used the PreSumm model (Liu and Lapata, 2019), which was developed for news article summarization, without any modification. PreSumm uses BERT, a pre-trained language model, as its encoder. The authors insert a [CLS] token at the start of each sentence in the BERT input to obtain a representation of that sentence; the representation at each [CLS] position is then used to compute a score that determines whether the sentence is included in the summary.
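A simplified sketch of this scoring scheme, assuming the HuggingFace BERT interface; PreSumm additionally uses interval segment embeddings and trains the scorer jointly with the encoder, which this sketch omits.

```python
# Sketch: insert [CLS] before each sentence, run BERT, and score each
# [CLS] position with a sigmoid to decide if the sentence is extracted.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
scorer = torch.nn.Linear(bert.config.hidden_size, 1)  # trained jointly in PreSumm

def score_sentences(sentences):
    text = " ".join(f"[CLS] {s} [SEP]" for s in sentences)
    enc = tokenizer(text, add_special_tokens=False,
                    truncation=True, max_length=512, return_tensors="pt")
    hidden = bert(**enc).last_hidden_state[0]             # (seq_len, hidden_size)
    cls_positions = enc.input_ids[0] == tokenizer.cls_token_id
    return torch.sigmoid(scorer(hidden[cls_positions])).squeeze(-1)  # one score per sentence
```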
To train this summarization model, we assumed that a large-scale dataset with thousands of instances would be needed to train over one hundred million parameters. Since the lay summary dataset consists of only around 600 documents, we used the CNN/DM dataset, which consists of 300K news articles, for a pre-training stage before training on the lay summary dataset. CNN/DM is a common benchmark in the summarization literature, and its target summaries are somewhat extractive rather than abstractive, so we considered it suitable for the extractive model. After training the model on the CNN/DM dataset for a number of iterations, we switched to the lay summary dataset.

Readability of the Sentence
The evaluation metric most widely used in the summarization literature is ROUGE, which reflects the ratio of overlapping vocabulary between the produced summary and the ground-truth summary. However, ROUGE only counts overlapping words and cannot determine whether a sentence is difficult to understand. We believe a lay summary has to be readable for a lay audience, so we adopted the readability of the sentence as an additional metric and combined it with extractive summarization: when producing the extractive summary based on the importance scores predicted by the PreSumm model, we pruned sentences whose readability score was under a certain threshold.
The readability of a sentence is measured by the ratio of jargon it contains. We used the word corpus developed by Rakedzon et al. (2017), who collected 900 million words published on the BBC site and classified each word as easy, medium, or rare (jargon) based on its frequency there. The resulting dictionary contains the roughly 500K most frequently used words. To measure the readability of a sentence, we followed the authors, as shown in Equation (1), with constant factors (c_1, c_2, c_3) in front of the ratios of medium (r_1), rare (r_2), and out-of-dictionary (r_3) words; we set the constant factors to 10, 20, and 30, respectively:

Score = 100 − (c_1 r_1 + c_2 r_2 + c_3 r_3)    (1)

Using this metric, we measured the average sentence readability of the abstracts and lay summaries in the LaySumm dataset. As shown in Table 3, the lay summaries achieve higher readability than the abstracts since they avoid technical words. In the next section, we report the readability of the produced summaries alongside the ROUGE metric to show whether the summarization model can score highly on both.
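A sketch of Equation (1), treating each ratio as a fraction of the words in the sentence; the easy/medium/rare word sets are placeholders for the frequency-based vocabulary of Rakedzon et al. (2017), which we cannot reproduce here.

```python
# Sketch of Equation (1). EASY/MEDIUM/RARE stand in for the word lists
# of Rakedzon et al. (2017); load them from the jargon corpus in practice.
EASY, MEDIUM, RARE = set(), set(), set()
KNOWN = EASY | MEDIUM | RARE
C1, C2, C3 = 10, 20, 30  # constant factors for medium, rare, out-of-dictionary

def readability(sentence: str) -> float:
    words = [w.strip(".,;:()").lower() for w in sentence.split()]
    n = len(words) or 1
    r1 = sum(w in MEDIUM for w in words) / n     # medium
    r2 = sum(w in RARE for w in words) / n       # rare (jargon)
    r3 = sum(w not in KNOWN for w in words) / n  # out of dictionary
    return 100 - (C1 * r1 + C2 * r2 + C3 * r3)
```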

Dataset and Evaluation
We evaluated the performance of the models on the lay summary dataset, which we divided into train, validation, and test sets with an 8/1/1 split. The evaluation metrics are ROUGE recall and F1 scores in terms of unigram, bigram, and longest common subsequence overlap (ROUGE-1, ROUGE-2, and ROUGE-L).
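An illustrative sketch of the evaluation, using the rouge-score package as a stand-in for the official scorer:

```python
# Sketch: ROUGE-1/2/L recall and F1 between produced and reference summaries.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

def evaluate(produced: str, reference: str):
    scores = scorer.score(reference, produced)   # (target, prediction)
    return {name: {"recall": s.recall, "f1": s.fmeasure}
            for name, s in scores.items()}
```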

Implementation Details
We mainly used the official implementations of PEGASUS and PreSumm along with the pre-trained checkpoints provided by the authors, without modifying any network architecture. For the PreSumm model, the dataset was switched from CNN/DM to the lay summary data after sufficient training steps, after which all trainable parameters were gradually fine-tuned with a lower learning rate. The PreSumm extractive models were trained on two GPUs (NVIDIA RTX 2080 Ti) with gradients accumulated every 4 steps. The model was trained for 50,000 steps in the pre-training stage and for 10,000 steps after switching to the LaySumm dataset. We saved checkpoints every 200 steps after switching datasets and chose the three checkpoints with the lowest validation loss to evaluate on the test set. To generate the extractive summary, we selected sentences in decreasing order of score, keeping a sentence only if its readability score was over 85, until the produced summary exceeded 150 words; trigram blocking (Paulus et al., 2017) was applied during generation to reduce redundancy, as sketched below.
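A sketch of this selection loop, reusing the `readability` function from the earlier readability sketch and assuming `scored_sents` holds (score, sentence) pairs from PreSumm:

```python
# Sketch of extractive generation: take sentences by descending PreSumm score,
# keep only readable ones, stop past 150 words, and block repeated trigrams.
def trigrams(text: str):
    toks = text.lower().split()
    return {tuple(toks[i:i + 3]) for i in range(len(toks) - 2)}

def extract(scored_sents, max_words=150, min_readability=85):
    chosen, seen, n_words = [], set(), 0
    for score, sent in sorted(scored_sents, reverse=True):
        if readability(sent) <= min_readability:  # defined in the earlier sketch
            continue
        tg = trigrams(sent)
        if tg & seen:                             # trigram blocking
            continue
        chosen.append(sent)
        seen |= tg
        n_words += len(sent.split())
        if n_words > max_words:
            break
    return chosen
```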
The PEGASUS model was trained for 20,000 steps on a single GPU (NVIDIA RTX 2080 Ti) with the hyperparameters provided by the authors, except for the batch size and learning rate: due to memory constraints, we decreased the batch size to 1 and lowered the learning rate to 0.0001. We saved checkpoints every 1,000 steps, performed the same validation procedure as for the extractive model, and used beam search with beam size 10 to discourage the model from generating overly short summaries.
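Decoding under this setup might look like the following, where `model` and `tokenizer` are the fine-tuned PEGASUS objects from the earlier sketch and the length cap is an assumption:

```python
# Sketch: beam-search decoding with beam size 10, as described above.
def generate_lay_summary(abstract: str, model, tokenizer) -> str:
    inputs = tokenizer(abstract, truncation=True, return_tensors="pt")
    ids = model.generate(**inputs, num_beams=10, max_length=256)  # max_length assumed
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```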

Results
The best results were achieved by submitting different checkpoints from the validation and test stages for each model. The performance of the extractive and abstractive models is summarized in Table 4.

Analysis of Threshold
In this section, we investigate how the number of words in the produced lay summary affects the performance of the summarization model. We first produced lay summaries using PEGASUS (ABS) and measured the number of words in each summary. Then, we set a threshold and appended sentences from the extractive summary produced by PreSumm (EXT) whenever the number of words in the abstractive summary was below that limit. Table 5 shows the ROUGE F1 scores for different threshold values together with the average number of words in the lay summaries after appending sentences. ABSEXT with a threshold of 90 performs best, which shows that appending sentences from the extractive model to the abstractive summary consistently improves performance. This makes sense, as the abstractive model (ABS) tends to produce short summaries: the average number of words in an abstractive summary is 82, whereas the average length of the ground-truth lay summaries in the LaySumm dataset is around 110 words.

Readability of Summary
We provide ROUGE F1 and readability scores for each model. As shown in Table 6, for the extractive summaries, EXT performs better than EXT W/O R, demonstrating that excluding hard-to-read sentences improves performance on both the ROUGE and readability metrics. When the extractive summary is combined with the abstractive one (ABSEXT), the ROUGE scores improve further, consistent with the threshold analysis above.

Discussion and Future work
We applied transfer learning to mitigate the absence of a large-scale dataset for the lay summarization task. While we demonstrated that transfer learning can yield good performance, it can also create a bottleneck due to the discrepancy between the distributions of the datasets, resulting in sub-optimal solutions. Our summarization model also excludes the full text of the paper and produces the summary solely from the abstract. Although the model achieves good performance, important points may exist in the body of the paper; humans naturally use the full text to write a better lay summary. Creating a large-scale lay summary dataset for scholarly documents and incorporating important sentences from the body text are promising directions for addressing these issues.
The readability score might be useful for constructing such a large-scale dataset, since it is necessary to pair difficult source sentences with a more readable lay summary. Secondly, during training we optimized the model only to predict the single ground-truth lay summary, which may limit its capability. Applying the readability score as an additional signal during training could make the model more flexible and help the system summarize the content while selectively choosing easier words.