AUTH @ CLSciSumm 20, LaySumm 20, LongSumm 20

We present the systems we submitted for the shared tasks of the Workshop on Scholarly Document Processing at EMNLP 2020. Our approaches focus on exploiting large Transformer models pre-trained on huge corpora and adapting them to the different shared tasks. For tasks 1A and 1B of CL-SciSumm we use different variants of the BERT model to tackle the tasks of "cited text span" and "facet" identification. For task 2 of CL-SciSumm, as well as for LaySumm and LongSumm, we make use of different variants of the PEGASUS model, with and without fine-tuning, adapted to the nuances of each particular task.


Introduction
For scholars in every scientific domain, the ever-growing number of articles published each year has made the long-lasting challenge of keeping up with the recent literature significantly harder. In addition, there is an increasing need to make research accessible and relevant to the general public and not just a small group of researchers and practitioners. For example, taxpayers want to know where federal money supporting research goes. As a result, there is a need for different types of summaries that can either facilitate scientific research by compressing the key ideas discussed in a scientific paper or make scientific research more relevant for a lay audience.
Tasking the author of a paper with writing multiple summaries of her work for different audiences is obviously time-consuming. Consequently, interest in methods that automatically summarize scientific documents in different styles and variations has increased significantly. In this direction, the 1st Scholarly Document Processing Shared Task (SDP 2020) (Chandrasekaran et al., 2020) introduces a number of different tasks that address these challenges. In addition to the original CL-SciSumm sub-tasks of previous years, the 2020 version includes tasks that target the summarization of complete papers as well as the generation of lay summaries. The different tasks of SDP 2020 can be summarized as follows.
1. CL-SciSumm: The original CL-SciSumm challenge includes three sub-tasks. Given a set of reference papers (RP) and the corresponding papers that cite them (CP), task 1A requires for each citance (i.e. a sentence of the CP that references the RP) the identification of the "cited" text spans in the RP. In task 1B participants have to tag each cited span with the appropriate "facets" from a predefined set. Finally, for the optional task 2 a summary of the RP should be generated.
2. LaySumm: This task requires participants to generate a short summary for each given paper that accurately represents the content and at the same time is comprehensible and interesting to a lay audience.
3. LongSumm: In this task the requirement is to generate an extensive and detailed summary for each of the given scientific papers that sufficiently covers all the salient information.
Exploiting large language models that are pre-trained on huge corpora of unlabelled data and then adapting them to solve NLP problems has proven to be a very successful strategy. This type of approach has yielded state-of-the-art results in a variety of NLP tasks, such as question answering, machine translation and summarization, and has proven to be especially beneficial when training data are limited. In this work our main focus is to explore large pre-trained Transformers like BERT (Devlin et al., 2018) and PEGASUS and how they can be effectively used in the context of the SDP 2020 shared task. For CL-SciSumm task 1A we fine-tune a BERT pairwise classifier that is able to identify "cited text spans" of the RP given a number of "citing spans" from multiple CPs. We further improve the efficiency of our system by adding a pre-filtering stage based on TF-IDF that selects good "candidates" for the BERT model. For task 1B we train a simple Logistic Regression classifier that uses embeddings from another pre-trained model, SciBERT (Beltagy et al., 2019), and is able to classify each cited text span into one of five distinct facets. We show that even an algorithm as simple as Logistic Regression can effectively learn a fairly complex task with very little training data when using features from a powerful pre-trained model such as SciBERT.
We approach task 2 of CL-SciSumm as well as LaySumm and LongSumm as abstractive summarization tasks. More specifically, we employ the PEGASUS pre-trained model, which has demonstrated very good results on various abstractive summarization benchmarks. For task 2 of CL-SciSumm, we use the pre-trained model without any additional training to generate a comprehensive summary of an article given its abstract as well as the parts of the full text that are cited by other articles. For the LaySumm task, we fine-tune the PEGASUS model to compress and re-write the abstract of the given article in order to generate a summary suited for a lay audience. Finally, for the LongSumm task our approach makes use of the Divide-ANd-ConquER (DANCER) (Gidiotis and Tsoumakas, 2020) summarization method in combination with the PEGASUS model, aiming to generate an accurate and detailed summary of the article by separately summarizing important sections of the full text.
The rest of this work is structured as follows. In section 2 we very briefly present different summarization approaches focusing on academic articles with emphasis on pre-trained Transformer models. In sections 3 to 5 we describe our approaches to each one of the three tasks and in section 6 we discuss our experiments and results.

Related Work
The task of summarizing scientific articles has received increased attention lately. Existing methods usually approach the problem in one of two fundamental ways. Extractive methods (Goharian, 2015, 2018; Collins et al., 2017) focus mainly on identifying key sentences of the text and creating a summary by combining the extracted sentences. On the other hand, abstractive methods (Subramanian et al., 2019) use language models conditioned on the input text in order to generate a summary.
Both extractive and abstractive methods when applied to scientific articles are typically taking as input the abstract and/or the full text of the article and try to generate an abstract-like summary. In contrast, DANCER (Gidiotis and Tsoumakas, 2020) learns to summarize different sections of the full text separately and combines the individual summaries into a single article summary.
Given the increased popularity and success of large pre-trained Transformer models in various NLP tasks, multiple approaches have used similar models for summarization. Such approaches have either used pre-trained Transformers as encoders combined with a classification decoder that selects sentences in an extractive manner (Subramanian et al., 2019; Liu and Lapata, 2019), or have employed full encoder-decoder models that are pre-trained on various tasks and fine-tuned for abstractive summarization (Song et al., 2019; Dong et al., 2019; Yan et al., 2020). One notable such model is the Pre-training with Extracted Gap-sentences for Abstractive SUmmarization Sequence-to-sequence (PEGASUS) model. PEGASUS is a Transformer encoder-decoder model pre-trained on massive corpora of documents (Web and news articles) that has demonstrated great potential on various summarization benchmarks. The pre-training of PEGASUS is based on optimizing the Gap Sentence Generation (GSG) objective, where whole sentences of the input are masked and the model attempts to generate these gap-sentences from the rest of the input.
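The GSG idea can be sketched as follows. This is a simplified illustration, not the original implementation: it scores sentence "importance" with a crude unigram-overlap measure in place of the ROUGE-based selection used for PEGASUS, and the mask token shown is a placeholder.

```python
MASK_TOKEN = "<mask>"  # placeholder; the actual PEGASUS mask token may differ

def unigram_overlap(a, b):
    """Fraction of a's unigrams that also appear in b (a crude ROUGE-1-like score)."""
    a_tokens, b_tokens = set(a.lower().split()), set(b.lower().split())
    return len(a_tokens & b_tokens) / max(len(a_tokens), 1)

def gsg_example(sentences, gap_ratio=0.3):
    """Build a (masked_input, target) pair in the spirit of Gap Sentence Generation:
    the sentences most 'important' to the rest of the document are masked and
    become the generation target."""
    n_gaps = max(1, int(len(sentences) * gap_ratio))
    # Score each sentence by its overlap with the remainder of the document.
    scores = []
    for i, sent in enumerate(sentences):
        rest = " ".join(s for j, s in enumerate(sentences) if j != i)
        scores.append((unigram_overlap(sent, rest), i))
    gap_ids = {i for _, i in sorted(scores, reverse=True)[:n_gaps]}
    masked = " ".join(MASK_TOKEN if i in gap_ids else s
                      for i, s in enumerate(sentences))
    target = " ".join(sentences[i] for i in sorted(gap_ids))
    return masked, target
```

During pre-training the model reads `masked` and learns to generate `target`, which closely resembles abstractive summarization.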
A number of summarization approaches were proposed for the summarization task of CL-SciSumm in previous years of the challenge, including extractive methods based on probabilistic models and large pre-trained BERT models (Zerva et al., 2019).

Data Processing
The corpus of the CL-SciSumm shared task is split into two separate collections: 1. The manually annotated training set, which consists of 40 articles and their citing papers. For task 1A, we have multiple citances for each RP and each of these citances corresponds to a specific cited text span. For task 1B we are given the facet annotations for each of the cited text spans, and for task 2 human-written summaries are provided.
2. The ScisummNet corpus (Yasunaga et al., 2019), which consists of 1000 articles that are paired with multiple automatically annotated citing articles.
Out of the 40 articles in the manually annotated dataset, we randomly select 30 articles for the training set and 10 articles for the test set. For task 1A we created "positive" pairs of citing and reference spans as well as "negative" pairs of citing spans and randomly selected sentences from the RP. We also included the whole ScisummNet dataset in the training set of this task. For task 1B we are only able to use the manually annotated data, since the ScisummNet dataset does not include facet annotations. One important note about the task 1B data is their severe class imbalance, which can potentially be problematic for the training of machine learning models.
The dataset for task 2 includes multiple human-written summaries for each article of the manually annotated dataset. Those summaries are annotated as "author summary", "community summary" and "human" in the JSON schema. We decided to use the summary labeled "human" as the target summary, because it was the one of the three that was present in almost all articles of the dataset. Since we do not perform any additional training for this task, we split the 40 articles into 20 for the validation set and 20 for the test set.

Task 1A
Our approach for task 1A makes use of the pre-trained BERT model (Devlin et al., 2018) and fine-tunes it for the task. More specifically, we formulate the task as a sequence classification problem, using the binary classification capabilities of the BERT architecture. The fine-tuning objective trains the model to take as input a pair of text spans and predict whether the second span is the span of the RP cited by the first. We train on "positive" and "negative" pairs with a 1:1 ratio, having found that creating the same number of negative pairs as positive pairs yields the best results. The training set for this objective includes both the training part of the manually annotated dataset and the whole ScisummNet dataset.
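The pair-construction step can be sketched as follows; the function and variable names are illustrative, not taken from our released code.

```python
import random

def build_pairs(citances, rp_sentences, seed=0):
    """Build training pairs for the pairwise classifier: one 'negative'
    pair per 'positive' pair (the 1:1 ratio described above).

    `citances` is a list of (citing_span, cited_span) tuples; `rp_sentences`
    holds all sentences of the reference paper to sample negatives from."""
    rng = random.Random(seed)
    pairs = []
    for citing, cited in citances:
        pairs.append((citing, cited, 1))  # positive: the true cited span
        # negative: a random RP sentence that is not the cited span
        candidates = [s for s in rp_sentences if s != cited]
        pairs.append((citing, rng.choice(candidates), 0))
    return pairs
```

Each (citing, candidate) pair is then fed to BERT as a standard sentence-pair classification input.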
During the prediction phase, our system evaluates each of the given citing spans in a pairwise fashion against different sentences from the RP and selects at most two sentences that have the highest probability of being the corresponding cited text. If the probability difference between the top-2 sentences is higher than a threshold T = 0.015, we only keep the first sentence. This way we are able to identify text spans instead of single sentences, although we found that most of the time the cited text span is indeed a single sentence. To further improve the predictive power of our model, we provide additional context by extending both the citing and cited text spans with the previous and next sentence. When making predictions during the test phase we use the citing span as is and only extend the candidate cited spans with the surrounding sentences.
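This selection rule can be re-implemented in a few lines, assuming the pairwise model's probabilities have already been computed:

```python
def select_cited_span(scored_sentences, threshold=0.015):
    """Pick at most two RP sentences as the predicted cited text span.
    `scored_sentences` maps each RP sentence to its probability under the
    pairwise model. If the top-2 probabilities differ by more than
    `threshold`, keep only the top sentence (illustrative sketch of the
    rule described above)."""
    ranked = sorted(scored_sentences.items(), key=lambda kv: kv[1], reverse=True)
    top2 = ranked[:2]
    if len(top2) == 2 and top2[0][1] - top2[1][1] > threshold:
        return [top2[0][0]]
    return [s for s, _ in top2]
```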
Based on the findings of Zerva et al. (2019), we also employed an additional pre-training step before fine-tuning our model on the task-specific objective. In our approach, we further pre-train BERT Base using the MLM objective on the ACL Anthology Reference Corpus (Bird et al., 2008).
In order to increase the computational efficiency of the pairwise evaluation, we first use TF-IDF similarity to select the top-20 sentences most similar to the citing text span. We then evaluate those "candidate" sentences with the pairwise model described previously.
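A minimal, dependency-free sketch of this pre-filtering stage is shown below; a production system would more likely use an off-the-shelf TF-IDF implementation such as scikit-learn's.

```python
import math
from collections import Counter

def tfidf_top_k(query, sentences, k=20):
    """Rank RP sentences by TF-IDF cosine similarity to the citing span and
    return the top-k as candidates for the pairwise model."""
    docs = [s.lower().split() for s in sentences]
    q = query.lower().split()
    n = len(docs) + 1  # treat the query as part of the collection
    df = Counter()
    for d in docs + [q]:
        df.update(set(d))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}

    def vec(tokens):
        tf = Counter(tokens)
        return {t: tf[t] * idf[t] for t in tf}

    def cosine(u, v):
        dot = sum(u[t] * v.get(t, 0.0) for t in u)
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    qv = vec(q)
    ranked = sorted(sentences,
                    key=lambda s: cosine(qv, vec(s.lower().split())),
                    reverse=True)
    return ranked[:k]
```

Only the returned candidates are scored by the (much more expensive) pairwise BERT model.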

Task 1B
For task 1B we build a classification model that uses as input features contextual embeddings from the pre-trained SciBERT model (Beltagy et al., 2019) in order to classify each cited text into one of the five facets. Previous research has explored the use of contextual embeddings extracted from different layers of Transformer language models such as BERT (Devlin et al., 2018), ELMo (Peters et al., 2018) and GPT (Radford et al., 2019) as features for classification models. In addition, Ethayarajh (2019) offers interesting insights into how the outputs of the different layers compare with each other. Here we decided to use the last layer of SciBERT to get the embeddings, because this model is the most relevant to the domain of the task articles; we did not experiment with other model types. The contextual embeddings of each cited text are obtained from the CLS vector of the last layer of SciBERT.
Although there are some occasions where multiple facets apply to the same span, the vast majority of samples have a single facet. For this reason we decided to treat the task as a simple multi-class sequence classification problem. We experimented with multiple classification algorithms such as Logistic Regression and Random Forests, opting for simpler models because the limited amount of training data severely restricted our ability to train more sophisticated ones.
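To illustrate the setup, the sketch below trains a multinomial Logistic Regression on fixed embedding features with plain gradient descent. The synthetic features stand in for SciBERT CLS vectors, and this hand-rolled optimizer is only a stand-in for the off-the-shelf classifiers we actually used.

```python
import numpy as np

def train_softmax_classifier(X, y, n_classes, lr=0.5, epochs=300):
    """Multinomial logistic regression trained with batch gradient descent.
    X: (n, d) array of fixed sentence embeddings (here synthetic; in our
    system they would be SciBERT CLS vectors); y: (n,) integer facet labels."""
    n, d = X.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]  # one-hot targets
    for _ in range(epochs):
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)
        G = (P - Y) / n  # gradient of mean cross-entropy w.r.t. logits
        W -= lr * (X.T @ G)
        b -= lr * G.sum(axis=0)
    return W, b

def predict(W, b, X):
    """Return the most probable class for each row of X."""
    return np.argmax(X @ W + b, axis=1)
```

Because the embeddings are fixed, only the small weight matrix W is learned, which is what makes the approach viable with very little training data.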

Task 2
We approach task 2 as an abstractive summarization problem. Yasunaga et al. (2019) suggest that a combination of the abstract and the cited text spans is sufficient content to cover the main aspects and findings of an academic article. We follow this idea and propose a summarization scheme that takes those inputs and tries to generate a comprehensive summary of the article.
More specifically, we use the PEGASUS model pre-trained on the arXiv dataset in order to generate the summary given the abstract and the cited text spans identified in the previous tasks. Given that the combined input sequences are much longer than the maximum input size we can support for PEGASUS (due to GPU memory limitations), we run the PEGASUS model twice, once for the abstract and once for the cited texts, and combine the two individual summaries into the final summary. The combination is a simple concatenation of the abstract summary with the cited text summary. Due to the small size of the manually annotated dataset we cannot expect to sufficiently train a PEGASUS model, so we opted to use the model without any additional fine-tuning.

Data Processing
For the LaySumm task, the data are provided in the form of plain text files that have already been parsed from the paper PDFs. For each article in the dataset we are given three text files, one with the full text of the article, one with the abstract and one with the target lay summary. The corpus covers three distinct domains, namely epilepsy, archaeology and materials engineering, and consists of 573 articles in total. We split the dataset into three parts, using 338 samples for the training set, 113 samples for the validation set and leaving 114 samples for the test set. We focused our pre-processing on cleaning noise and removing unwanted tokens and artifacts such as equations, tabular elements and references.
The PEGASUS model uses tokenization with the SentencePiece Unigram algorithm (Kudo, 2018) and requires all text to be lowercased. The particular pre-trained model we are using comes with the Unigram 96k vocabulary that was created during its pre-training. We identified that this vocabulary misses several symbols that appear quite frequently in the LaySumm data (e.g. Greek letters), so we encode those symbols with other "complex" tokens from the vocabulary before tokenization in order for the model to be able to parse them. For example, the Greek letter α is replaced with the complex token "greekalpha" before being tokenized. This allows the model to successfully encode and learn the symbol and gives us the ability to map it back to the original symbol during the decoding phase.
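The replacement scheme amounts to a reversible string substitution. In the sketch below, only the α/"greekalpha" pair is taken from the text above; the remaining entries are illustrative assumptions.

```python
# Illustrative mapping; only the α entry is an example from the paper,
# the rest are assumed for demonstration.
SYMBOL_MAP = {
    "α": "greekalpha",
    "β": "greekbeta",
    "γ": "greekgamma",
    "µ": "greekmu",
}

def encode_symbols(text):
    """Replace symbols missing from the model vocabulary with 'complex'
    placeholder tokens before tokenization."""
    for sym, token in SYMBOL_MAP.items():
        text = text.replace(sym, token)
    return text

def decode_symbols(text):
    """Restore the original symbols in the generated summary."""
    for sym, token in SYMBOL_MAP.items():
        text = text.replace(token, sym)
    return text
```

Encoding is applied before SentencePiece tokenization and decoding after generation, so the round trip is lossless for the mapped symbols.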

Lay Summary Generation
Our approach for the LaySumm task focuses on re-writing selected parts of the article in order to make them more relevant and easier to understand for a lay audience. Our main system uses the PEGASUS Large model and fine-tunes it on the task dataset. Based on the idea that much of the key information we want to include is present in the abstract of the article, we focus our methods on re-writing and compressing the abstract. Our fine-tuning objective involves feeding the abstract as input to the PEGASUS model and using the provided lay summary as the target for the summarization training.
We experimented with different variants of the pre-trained PEGASUS model: 1) the pre-trained PEGASUS model, 2) a model fine-tuned on the arXiv dataset and 3) a model fine-tuned on the PubMed dataset. All pre-trained models were open sourced by the authors of the PEGASUS paper. We further fine-tuned the different models on the specific LaySumm task using the provided dataset.

Data Processing
The dataset for the LongSumm task was given in the form of JSON files including article metadata, target summaries and URLs to download the article PDFs. We used the provided scripts to download a total of 497 out of the 528 PDFs (we did not have access to the rest) and then extracted the abstract and section text from the downloaded files using Science-Parse 1 . We ended up with a total of 497 JSON files with the combined full text, abstract and target summaries. Out of these articles, 297 were used in the training set, 100 in the validation set and 100 in the test set.
The pre-processing steps followed in this dataset were similar to the ones we have described in the previous section and involve basic cleaning, normalisation and filtering operations. Again we use the same strategy for the tokens that are not supported by the PEGASUS vocabulary.

Long Article Summarization
Our approach for the LongSumm task is based on the Divide-ANd-ConquER (DANCER) summarization method, which tackles a long article by splitting it into sections. The method uses text similarities between sentences of the target summary and sections of the full text in order to create a better alignment during training, and learns a summarization model that is able to summarize each section of the article separately.
More specifically, our system selects "types" of sections, namely the introduction, methods, results and conclusion, and uses the PEGASUS model to generate a summary for each section. The corresponding summaries are then concatenated to form the complete summary of the article. When training this system we use as input the full text of the section and as target the part of the summary that is most similar to that particular section.
We first use ROUGE-1 recall as a similarity metric in order to assign each sentence of the summary to one of the selected sections of the full text, and then we group all the sentences assigned to each section to form the target summary corresponding to that section. The complete system architecture is shown in Figure 1.

Table 1: The different section types and the common keywords that are used in order to identify them using heuristics. If the header of a section includes any of the keywords associated with a specific section type it is assigned to that section type. Sections that can't be matched with any section type are ignored.
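The summary-to-section alignment described above can be sketched as follows, using a simplified ROUGE-1 recall (no stemming or stopword removal):

```python
def rouge1_recall(reference_tokens, candidate_tokens):
    """Unigram recall of `reference_tokens` covered by `candidate_tokens`
    (a simplified ROUGE-1 recall)."""
    ref = set(reference_tokens)
    return len(ref & set(candidate_tokens)) / max(len(ref), 1)

def assign_summary_sentences(summary_sentences, sections):
    """Assign each summary sentence to the section that covers it best, then
    group the assigned sentences into per-section target summaries.
    `sections` maps section name -> section text."""
    targets = {name: [] for name in sections}
    section_tokens = {name: text.lower().split() for name, text in sections.items()}
    for sent in summary_sentences:
        tokens = sent.lower().split()
        best = max(sections, key=lambda n: rouge1_recall(tokens, section_tokens[n]))
        targets[best].append(sent)
    return {name: " ".join(sents) for name, sents in targets.items()}
```

Each per-section target produced this way is then paired with the corresponding section text as one training example for the section-level summarizer.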

Section Tagger
In order to select the aforementioned types of sections we employ a classification model that classifies each section of a given article into one of six distinct categories (introduction, literature, methods, results, conclusion, acknowledgments). Based on our experiments we found that the combination of introduction, methods, results and conclusion gives us the best summaries overall.
This classifier has a single LSTM (Hochreiter and Schmidhuber, 1997) layer with additive attention (Bahdanau et al., 2015) and takes as input subword-level BPEmb embeddings (Heinzerling and Strube, 2019). To train this model, we select full text sections from the arXiv dataset whose heading includes specific keywords that are characteristic of a section type; these keywords are shown in Table 1. Sections whose heading does not match this pattern are skipped. The model takes as input the text of the section (without the heading) and tries to predict the section category.
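The heading heuristic used to build this training data can be sketched as follows. The keyword lists below are illustrative assumptions; Table 1 lists the exact keywords we used.

```python
# Illustrative keyword lists; see Table 1 for the actual ones.
SECTION_KEYWORDS = {
    "introduction": ["introduction"],
    "literature": ["related work", "background", "literature"],
    "methods": ["method", "approach", "model"],
    "results": ["result", "experiment", "evaluation"],
    "conclusion": ["conclusion", "discussion"],
    "acknowledgments": ["acknowledg"],
}

def tag_section(heading):
    """Map a section heading to a section type via keyword matching;
    return None for headings that match no type (such sections are
    skipped when building the classifier's training data)."""
    h = heading.lower()
    for section_type, keywords in SECTION_KEYWORDS.items():
        if any(kw in h for kw in keywords):
            return section_type
    return None
```

At test time the learned classifier replaces this heuristic, which lets the system handle sections whose headings match no keyword at all.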

Experimental Setup
In our experiments for task 1A we use the Tensorflow implementation of BERT Base provided by huggingface 2 . After pre-training for 20k steps on the ACL corpus we fine-tune for another 4k steps on the pairwise classification objective of task 1A. The TF-IDF model is built using only the manually annotated dataset. For task 1B we use the SciBERT version open sourced by the authors of the original paper, which follows the BERT Base model architecture. We use the SciBERT model to get sentence-level embeddings of size 768, which are then used as input to the classifiers.
All of our summarization methods use the PEGASUS Large model, which was pre-trained on the C4 and HugeNews datasets and open sourced by the authors of the original paper. We also used two variants of this model that were fine-tuned for abstractive summarization: the first was fine-tuned for 74k steps on the arXiv dataset and the second for 100k steps on the PubMed dataset.
For task 2 of CL-SciSumm we use the arXiv version of PEGASUS without any additional fine-tuning. We run the model twice, generating summaries of up to 256 tokens for the abstract and for the cited text spans identified in task 1A.
When fine-tuning our models on the LaySumm task we follow a very basic setup without extensive hyper-parameter tuning. More specifically, we used an input size of 1,024 tokens and an output size of 256 tokens, since the evaluation scripts provided by the competition constrained the summary length to 150 words. We fine-tuned for 10k steps and monitored the ROUGE-1 F1 score on the validation set in order to avoid overfitting. For the LongSumm task we use the arXiv PEGASUS model and further fine-tune it for 10k steps using the DANCER method on the dataset of the task. The hyper-parameters used are identical to the ones used for the LaySumm model. Detailed hyper-parameters can be found in Appendix A.1.

Table 2: Results on our test set for task 1A. TF-IDF uses sentence similarities to select top-3 sentences. BERT is the proposed method.

CL-SciSumm
We evaluate task 1A on our test set (including only the manually annotated data) and measure the standard micro and macro precision, recall and F1 score. These metrics are shown in Table 2. For reference, we also compare our method with a simple baseline that uses only TF-IDF to select the top-3 sentences for each citing span.
The BERT-based model is clearly superior to the baseline and is able to correctly retrieve a fair amount of the cited text spans.
For task 1B we are evaluating our methods using 10-fold cross validation on the whole manually annotated dataset. In Table 3 we show the macro precision, recall and F1 of our Logistic Regression classifier versus a Random Forest classifier.

Table 4: Shared evaluation results on our test set using the best performing models for both tasks. Spans that are incorrectly retrieved in the first task are not scored by the second task.
These results show that very simple algorithms like Logistic Regression and Random Forests with SciBERT features perform well on this task with very few training examples. On the other hand, training more sophisticated models like neural networks was simply not feasible due to the small size of the dataset and the severe class imbalance. For example when we attempted to train neural networks for the task we ended up with models that only predicted the "methods" facet.
One should keep in mind that the previous results only measure the performance of the task 1B model, assuming that all "cited text spans" have been correctly identified by the task 1A model. We are also evaluating the combination of our best performing methods on our test set using the evaluation scripts provided by the competition. We use the text spans retrieved from task 1A as input for task 1B. The scores from the shared evaluation are shown in Table 4.
Finally, the evaluation of the summarization task 2 is done by comparing the generated summary of each article with the "human" summary using ROUGE metrics (Lin, 2004). In Table 5 we present the results of our proposed approach on our test set. Those scores demonstrate the "zero-shot" capabilities of the PEGASUS pre-trained model, which is able to perform well on a new task without any additional training.

LaySumm
When evaluating the results we used the official evaluation script provided by the competition, which measures ROUGE-1, ROUGE-2 and ROUGE-L recall and F1-score. The results on our hold-out test set are shown in Table 6. As expected, both of the fine-tuned models outperform the model without any prior fine-tuning, since they are better adapted to summarizing academic articles. However, the differences between the two models are very small, with the arXiv model achieving a better ROUGE-1 score and the PubMed one achieving better ROUGE-2 and ROUGE-L F1 scores.

Table 6: Results on our test set for LaySumm.
Model   F1 R-1   F1 R-2   F1 R-L   Recall R-1   Recall R-2   Recall R-L
arXiv   47.93    25.36    31.66    46.13        23.85        30.17
In Table 7 we show the results on the blind test set of the competition. Similarly to the numbers on our own test set, the arXiv model performs slightly better in terms of ROUGE-1 while the PubMed model is better in terms of ROUGE-2 and ROUGE-L. Once again, both models have a clear advantage compared to the model without prior summarization fine-tuning.

LongSumm
Based on the LaySumm results, the model fine-tuned on the arXiv dataset had superior performance to both the model without fine-tuning and the model fine-tuned on PubMed, so we decided to use this variant for the LongSumm task.
In order to evaluate the impact of the section tagger model we repeated the same experiments, but this time, instead of using the section tagger to help us select the appropriate sections, we used the section headings and the heuristics described in 5.3.

Table 8: Section level comparison between methods on the LongSumm test set. Notrain uses the model fine-tuned on arXiv without additional training. ArXiv is additionally fine-tuned on LongSumm. Notrain-notag and ArXiv-notag are the same models but using heuristics instead of the section tagger for section selection.
First, we evaluated performance at the section level using ROUGE-1, ROUGE-2 and ROUGE-L recall and F1 scores between the generated section summary and the target section summary. Results for this evaluation are shown in Table 8. Second, we evaluate at the article level, computing the same metrics between the full generated summary of each article and the full target summary. For this evaluation we use the official evaluation script provided by the competition; the results are shown in Table 9.
Looking at the section level results, we can see that fine-tuning the model with DANCER improves the results in every metric since it is better tuned for section level summarization compared to the model that is trained on whole articles. We should note that in this setup it is hard to have a direct comparison between the systems using the section tagger and the systems that use heuristics because using the section tagger results in a much larger test set.
The article level results give us a better idea about the performance of the system on the LongSumm task itself. Here we run our summarization system to generate the section summaries, concatenate them to create the article level summary and compare it with the target article summary. The results on our test set show that once again DANCER training improves performance across the board. It is also clear that the section tagger has a very large effect, as it improves both the trained and untrained systems by more than 10 ROUGE-1 points. This is due to the fact that, when using the section tagger, we include in the summary many more sections of the text that might not have a heading matching the patterns of the heuristic approach.
Results on the test set of the competition are shown in Table 10. Similar to the results from our test set, we can see that the system trained with DANCER combined with the section tagger is clearly superior to all other systems.

Conclusion
We have presented the systems we developed for the SDP 2020 shared task. For task 1A we implemented an efficient pairwise classification approach based on the BERT model that tackles the "cited text identification" problem. For task 1B we show how a simple Logistic Regression classifier using pre-trained SciBERT embeddings as features can effectively learn to solve the problem of facet classification.
For the summarization tasks we employ different variants of the PEGASUS model and adapt them to the nuances of each particular task. For task 2 we use the pre-trained PEGASUS model in a zero-shot setting to generate a summary given the abstract of an article along with the cited text spans. For LaySumm we propose a re-writing strategy based on the PEGASUS model that works on the abstract and generates a lay summary. Finally, we showcase how the PEGASUS model can be used to summarize an academic paper in a divide-and-conquer fashion, and we demonstrate an end-to-end system that generates a "long" summary by selecting key sections, summarizing each section independently and combining the section summaries to form the final summary.