Overview and Insights from the Shared Tasks at Scholarly Document Processing 2020: CL-SciSumm, LaySumm and LongSumm

We present the results of three Shared Tasks held at the Scholarly Document Processing Workshop at EMNLP 2020: CL-SciSumm, LaySumm and LongSumm. We report on each of the tasks, which received 18 submissions in total, with some submissions addressing two or three of the tasks. In summary, the quality and quantity of the submissions show that there is ample interest in scholarly document summarization, and that the state of the art in this domain is at a midway point between being an impossible task and one that is fully resolved.


Introduction
Scientific documents constitute a rich field for different tasks such as Reference String Parsing, Citation Intent Classification, Summarization and more. The constantly increasing number of scientific publications raises additional issues such as making these publications accessible to non-expert readers, or, on the other hand, to experts who are interested in a deeper understanding of the paper without reading a paper in full.
For this year's Scholarly Document Processing workshop (Chandrasekaran et al., 2020) at EMNLP 2020, we proposed three tasks, CL-SciSumm, LaySumm and LongSumm, to improve the state of the art for different aspects of scientific document summarization.
The CL-SciSumm task was introduced in 2014 and aims to explore the summarization of scientific research in the domain of computational linguistics research. It encourages the incorporation of new kinds of information in automatic scientific paper summarization, such as the facets of research information being summarized in the research paper. CL-SciSumm also encourages the use of citing mini-summaries written in other papers, by other scholars, when they refer to the paper.
LaySumm (Lay Summarization) addresses the issue of making research results available to a larger audience by automatically generating 'Lay Summaries', or summaries that explain the science contained within the paper in laymen's terms.
Finally, the LongSumm (Long Scientific Document Summarization) task focuses on generating long summaries of scientific text. It is fundamentally different from generating short summaries that mostly aim at teasing the reader. The LongSumm task strives to learn how to cover the salient information conveyed in a given scientific document, taking into account the characteristics and the structure of the text. The motivation for LongSumm was first demonstrated by the IBM Science Summarizer system (Erera et al., 2019), which retrieves and creates long summaries of scientific documents. While Erera et al. (2019) studied some use-cases and proposed a summarization approach with some human evaluation, the authors stressed the need for a large dataset to enable research in this domain. LongSumm aims at filling this gap by providing a large dataset of long summaries, which are based on blogs written by Machine Learning and NLP experts.
In this paper we present the tasks, the datasets and a description of the participating systems, and provide their results and insights from the shared tasks.

Overview
The CL-SciSumm Shared Task was launched in 2014 as a pilot task aimed at bringing together the summarization community to address challenges in scientific communication summarization. Over time, the Shared Task has spurred the creation of new resources, tools and evaluation frameworks. As a consequence of this wide interest, CL-SciSumm 2020 was jointly organised with the inaugural editions of two other scientific summarization shared tasks, all of which were held as part of the SDP 2020 workshop at EMNLP (Chandrasekaran et al., 2020). A pilot CL-SciSumm task was conducted at TAC 2014, as part of the larger BioMedSumm Task. In 2016, a second CL-SciSumm Shared Task followed (Jaidka et al., 2018). In this section we provide the results and insights from CL-SciSumm 2020.

Corpus
We built the CL-SciSumm corpus by randomly sampling research papers (Reference papers, RPs) from the ACL Anthology corpus and then downloading the citing papers (CPs) for those which had at least ten citations. The prepared dataset then comprised annotated citing sentences for a research paper, mapped to the sentences in the RP which they referenced. Summaries of the RP were also included.
The CL-SciSumm 2020 corpus consisted of 40 annotated RPs and their CPs. These are the same as described in our overview papers for CL-SciSumm 2019 and 2018. The test set was blind. We reused the blind test set from CL-SciSumm 2018 and 2019 so that CL-SciSumm 2020 systems could be evaluated comparably. After three iterations, we now release the gold labels for the 2018 test set.
For details of the general procedure followed to construct the CL-SciSumm corpus, and changes made to the procedure in CL-SciSumm-2016, please see (Jaidka et al., 2018). In 2017, we made revisions to the corpus to remove citances from passing citations. These are described in (Jaidka et al., 2017).
Annotation. Given each RP and its associated CPs, the annotation group was instructed to find citations to the RP in each CP. Specifically, the citation text, citation marker, reference text, and discourse facet were identified for each citation of the RP found in the CP. The corpus has 40 annotated RPs, exclusive of the 1000 auto-annotated RPs added in CL-SciSumm 2019. For CL-SciSumm 2020 we encouraged participants to use out-of-domain data (i.e., scientific document corpora from papers outside of the ACL Anthology corpus; e.g., BIGPATENT (Sharma et al., 2019)) to bootstrap training using transfer learning. From 2019 onward, the Task 2 training data (summaries) has been augmented with the SciSummNet corpus.

Task
CL-SciSumm defined two serially dependent tasks that participants could attempt, given a canonical training and testing set of papers.
Given: A topic consists of a Reference Paper (RP) and ten or more Citing Papers (CPs) that all contain citations to the RP. In each CP, the text spans (i.e., citances) have been identified that pertain to a particular citation to the RP. Additionally, the dataset provides three types of summaries for each RP:
• the abstract, written by the authors of the research paper.
• the community summary, collated from the reference spans of its citances.
• a human-written summary, written by the annotators of the CL-SciSumm annotation effort.
Task 1A: For each citance, identify the spans of text (cited text spans) in the RP that most accurately reflect the citance. These are of the granularity of a sentence fragment, a full sentence, or several consecutive sentences (no more than 5).
Task 1B: For each cited text span, identify what facet of the paper it belongs to, from a predefined set of facets.
Task 2: Finally, generate a structured summary of the RP from the cited text spans of the RP. The length of the summary should not exceed 250 words. This was an optional bonus task.

Evaluation
An automatic evaluation script was used to measure system performance for Task 1A, in terms of the sentence ID overlaps between the sentences identified in the system output and the gold standard created by human annotators. The raw number of overlapping sentences was used to calculate the precision, recall and F1 score for each system. We followed the approach of most SemEval tasks in reporting the overall system performance as its micro-averaged performance over all topics in the blind test set.
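The sentence-ID overlap scoring described above can be illustrated with a short sketch. This is not the official evaluation script: the function name `micro_prf` and the input layout (a mapping from a citance identifier to a set of RP sentence IDs) are assumptions made for illustration only.

```python
from typing import Dict, Set, Tuple


def micro_prf(
    system: Dict[str, Set[int]], gold: Dict[str, Set[int]]
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 over sentence-ID overlaps.

    Both arguments map a (hypothetical) citance identifier to the set of
    reference-paper sentence IDs selected for that citance.
    """
    overlap = predicted = expected = 0
    # Pool the raw counts over every citance before computing the ratios;
    # this pooling is what makes the average "micro" rather than "macro".
    for citance_id in set(system) | set(gold):
        sys_ids = system.get(citance_id, set())
        gold_ids = gold.get(citance_id, set())
        overlap += len(sys_ids & gold_ids)
        predicted += len(sys_ids)
        expected += len(gold_ids)
    precision = overlap / predicted if predicted else 0.0
    recall = overlap / expected if expected else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data, not taken from the corpus.
gold = {"c1": {10, 11}, "c2": {42}}
system = {"c1": {11, 12}, "c2": {42}}
precision, recall, f1 = micro_prf(system, gold)
```

In this toy case two of the three predicted sentences and two of the three gold sentences overlap, so precision, recall and F1 are all 2/3.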
Additionally, we calculated lexical overlaps in terms of the ROUGE-2 scores (Lin, 2004) between the system output and the human annotated gold standard reference spans.
We have been reporting ROUGE scores since CL-SciSumm 2017, for Task 1a and Task 2.
Task 1B was evaluated as the proportion of discourse facets correctly classified by the system, contingent on the expected response of Task 1A. As it is a multi-label classification, this task was also scored based on the precision, recall and F1 scores.
Task 2 was optional, and was also evaluated using ROUGE-2 between the system output and three types of gold standard summaries of the research paper: the reference paper's abstract, a community summary, and a human summary.
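To make the ROUGE-2 comparison against the three gold summary types concrete, the following is a simplified sketch of bigram-overlap scoring. It assumes lowercased whitespace tokenization with no stemming or stopword handling, so its numbers will generally differ from those of the official ROUGE toolkit; the function names and the example strings are illustrative and not part of the task's evaluation scripts.

```python
from collections import Counter
from typing import Dict


def bigrams(text: str) -> Counter:
    """Count bigrams over lowercased, whitespace-separated tokens."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge2(candidate: str, reference: str) -> Dict[str, float]:
    """Simplified ROUGE-2 precision, recall and F-measure."""
    cand, ref = bigrams(candidate), bigrams(reference)
    # Counter intersection keeps the minimum count of each shared bigram.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    denom = precision + recall
    fmeasure = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "fmeasure": fmeasure}


def score_against_references(
    candidate: str, references: Dict[str, str]
) -> Dict[str, Dict[str, float]]:
    """Score one system summary against each gold summary type separately."""
    return {name: rouge2(candidate, ref) for name, ref in references.items()}


# Toy strings standing in for the three gold summary types.
references = {
    "abstract": "the cat sat on a mat",
    "community": "a cat was sitting on a mat",
    "human": "the cat sat on the mat",
}
scores = score_against_references("the cat sat on the mat", references)
```

Scoring each reference type separately, rather than pooling them, mirrors the way results are reported per summary type.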
We provided the evaluation scripts and the gold test set in the CL-SciSumm GitHub repository. For transparency, we published all the system runs submitted by the participants. The participants then ran the evaluation and reported the results back to us. We collate and publish these as the official CL-SciSumm 2020 results.

Systems Overview
The following teams submitted systems for evaluation for Tasks 1a and 1b; their systems are described in the cited system papers: NJUST (Zhang et al., 2020), CIST (Li et al., 2020), AUTH (Gidiotis et al., 2020). Official evaluation results for these systems are presented in the next section.

Results
Out of the 11 participating systems, 8 were able to complete the final evaluation correctly. We have excluded the remaining 3 from Tables 1 and 2, which report the results on the blind test set. However, their systems and results on the development set are published in their respective system papers. We allow teams to submit an unlimited number of runs since this is an offline evaluation with a blind test set. However, we tabulate only the results from the top 5 runs when a large number of runs are submitted.
Task 1a. (Table 1) NLP-PINGAN-TECH (Chai et al., 2020) achieves the best result on Task 1a when evaluated using both sentence overlaps and n-gram overlaps (ROUGE-SU4). All of their top 5 runs outperform the other systems. Runs from UniHD's system are a close second.
Task 1b. (Table 2) We note that the runs that perform best on Task 1a are not the same as those that top the rankings on Task 1b, even though Task 1b is evaluated conditioned on Task 1a. CIST (Li et al., 2020)'s systems do consistently well on this task. We note that UniHD's systems, intersection 2 field and intersection 3 field, do well on both Task 1a and 1b though they do not top the rankings on either task.
Task 2. Four of the eleven teams also participated in the bonus summarization task. On the summarization task, AUTH (Gidiotis et al., 2020) does well when evaluated against both abstracts and human-written summaries. They score 0.41 on ROUGE-2 on abstracts, which is comparable to the state of the art in general summarization. However, their system does not do well on community summaries, which are dependent on Task 1a. IIITBH-IITP (Reddy et al., 2020)'s systems consistently perform better than the rest on community summaries. CIST (Li et al., 2020)'s systems are second and are comparable to the top-performing system in this category. Notably, CIST's runs do well on both human and community summaries and are second only to AUTH on abstracts. Systems of this type are the intended goal of the CL-SciSumm shared task.

Task Overview
To improve public understanding of science, researchers are increasingly asked by funders and publishers to outline the scope of their research, described in scientific research articles, by writing a summary for a lay audience. We call this a Lay Summary: a text of about 70-100 words intended for a non-technical audience that explains, succinctly and without using technical jargon, the overall scope, goal, and potential impact expressed in a scientific paper. The Lay Summarization task provides data for and evaluates automatically produced Lay Summaries.

Corpus
The corpus comprised 572 author-generated lay summaries from a multidisciplinary collection of journals in Materials Science, Archaeology, Hepatology and Artificial Intelligence, together with their corresponding abstracts and full-text articles, provided by Elsevier. A small sample dataset can be found on the GitHub repository (https://github.com/WING-NUS/scisumm-corpus/blob/master/README_Laysumm.md#sample-dataset). A training corpus of 37 full-text papers and abstracts was made available to enable evaluation.

Task
The Lay Summary Task requires systems to generate a lay summary, given a full-text paper and its abstract. This summary should be representative of the content, comprehensible, and interesting to a lay audience. In addition to their results, system builders were asked to provide an automatically generated lay summary of their own system-description paper. The task was run on CodaLab.

Evaluation
We measured summary quality using the ROUGE measure (Lin, 2004). We used the Py-Rouge 0.1.3 package, which is built on the ROUGE 1.5.5 toolkit, with its standard parameter settings. We report both Recall and F-Measure for ROUGE-1, ROUGE-2, and ROUGE-L. The evaluation results were displayed on a public leaderboard on CodaLab. In addition, a number of automatically generated lay summaries underwent human evaluation by science journalists and communicators for comprehensiveness, legibility, and interest.

Systems Overview
We received eight submissions. We briefly describe the approaches taken by the participating teams:
AUTH (Gidiotis et al., 2020) - The authors use a summarization method utilizing PEGASUS to compress and rewrite the abstract of a given article to generate a lay summary. The PEGASUS model is fine-tuned to generate lay summaries, using the article abstract as input and the lay summary as the reference for training the summarization model.
Dimsum (Tiezheng Yu and Fung, 2020) - The system generates a summary by using a joint extractive and abstractive summarization approach, based on the intuition that lay summaries are grounded in sentences that occur within the scientific document. The abstractive summaries are converted to extractive labels by selecting sentences that maximize the ROUGE score with the reference summary. The BART encoder (Lewis et al., 2020) is then used to build sentence representations, and the model is trained with both extractive and abstractive summarization objectives.
Seungwon (Kim, 2020) - The system built by the team from Georgia Tech primarily uses the PEGASUS model to generate lay summaries, combining this with a BERT-based extractive summarization model. After generating a lay summary using PEGASUS, if the generated summary is shorter than a specified length, the extractive model is used to identify candidate sentences in the document that can be included in the summary. Sentences are only included in the summary by the extractive model if they are judged sufficiently readable, according to a sentence readability metric defined by the authors.
This method uses a standard encoder-decoder framework for abstractive summarization. The system is based on BERT fine-tuned on the CNN/Dailymail dataset (Liu and Lapata, 2019a), with a decoder consisting of six transformer layers.
DUCS (no paper submitted) - This system uses a two-stage pipeline. In the first phase, extractive summarization is performed, and relevant sentences are selected from the introduction, discussion and conclusion of the article. In the second phase, the abstract and the extracted sentences from the introduction, discussion and conclusion are summarized using the BART model (Lewis et al., 2020), and the summaries are concatenated.
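The conversion of abstractive reference summaries into extractive labels, mentioned for Dimsum above, is commonly done with a greedy search, sketched below. This is an illustration under stated assumptions rather than the team's implementation: the scoring function (the sum of simplified ROUGE-1 and ROUGE-2 F-measures over whitespace tokens), the `max_sentences` cap and all function names are hypothetical choices.

```python
from collections import Counter
from typing import List


def _ngram_f(cand_tokens: List[str], ref_tokens: List[str], n: int) -> float:
    """Simplified ROUGE-n F-measure over pre-tokenized text."""
    cand = Counter(zip(*(cand_tokens[i:] for i in range(n))))
    ref = Counter(zip(*(ref_tokens[i:] for i in range(n))))
    overlap = sum((cand & ref).values())
    if not overlap:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def _score(cand_tokens: List[str], ref_tokens: List[str]) -> float:
    # Assumption: use ROUGE-1 F plus ROUGE-2 F as the selection criterion.
    return _ngram_f(cand_tokens, ref_tokens, 1) + _ngram_f(cand_tokens, ref_tokens, 2)


def greedy_oracle_labels(
    sentences: List[str], reference: str, max_sentences: int = 3
) -> List[int]:
    """Return a 0/1 label per sentence marking a greedy extractive oracle."""
    ref_tokens = reference.lower().split()
    selected: List[int] = []
    best = 0.0
    while len(selected) < max_sentences:
        best_idx = None
        for i in range(len(sentences)):
            if i in selected:
                continue
            # Score the already selected sentences plus candidate i,
            # concatenated in document order.
            tokens = [
                tok
                for j in sorted(selected + [i])
                for tok in sentences[j].lower().split()
            ]
            score = _score(tokens, ref_tokens)
            if score > best:
                best, best_idx = score, i
        if best_idx is None:  # no remaining sentence improves the score
            break
        selected.append(best_idx)
    return [1 if i in selected else 0 for i in range(len(sentences))]


# Toy document and reference, not taken from the corpus.
document = [
    "we study kidney failure in mice",
    "the weather was nice",
    "a new drug reduces kidney failure",
]
labels = greedy_oracle_labels(document, "a new drug reduces kidney failure in mice")
```

The search stops as soon as no additional sentence raises the score, so the number of positive labels can be smaller than `max_sentences`.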

Results
Taking these metrics into account, the top 3 systems are: #1 Seungwon Kim, #2 HYTZ, and #3 Summaformers. In addition to the formal ROUGE scores, a subset of documents was evaluated by a team of domain experts. Gratifyingly, this human assessment confirmed this ordering of the results. Overall, the majority of submitted Lay Summaries were easy to read, though in some cases there were odd errors (e.g., inserted ellipses). The winning systems all produced legible and accessible summaries. Four of the papers complied with the request that the systems generate a Lay Summary of their own paper, using their own tools. This both helps to explain the concept of a Lay Summary and offers insights into the output of the software; hopefully it also helps explain this work to a non-specialised audience. For examples, please see the Lay Summary Submissions elsewhere in this Anthology.

Discussion
A comparison of Lay Summaries against typical paper abstracts (Technical Summaries) reveals several systematic differences. These include:
• Lexical specialization: This category includes both domain-based terminological differences (e.g., "renal" vs "kidney" failure, "high-octane" vs "powerful" gasoline) and conceptual specificity / specialization (e.g., "bubblesort" vs "sorting", "kNN" vs "clustering"). Even at the same level of specificity, the expert uses domain-specialist words. It is well known that experts' Basic Level categories (in the sense of Prototype Theory) (Rosch, 1973) are one level lower/more specific than normal speakers' categories.
• Syntactic complexity: This includes more-complex descriptive NPs vs simpler NPs across more sentences, and longer and deeper sentence parse trees vs shorter and more straightforward ones. Generally an expert author's abstract has no direct verb forms and no personal pronouns, while the lay summary has nothing but. Direct quotes typically make a lay summary read like journalism.
• Epistemic complexity: Expert text includes more (and more-precise) hedging vs simpler, more absolutist claims, and fewer evaluative interjections ("surprising", "lovely", "elegant").
• Content detail: Generally, lay content is more general, wider-ranging, and includes a longer but much shallower historical overview compared to the Related Work section of an expert text. Typically there are more examples in the lay text, and the examples employ out-of-domain scenarios/entities.
• Author presence: In lay summaries there is generally more explicit 'author foregrounding', leading to the personalization of the knowledge source. The opposite in expert summaries has been argued to suggest the statement of known facts, a tactic that scientists often use.
As described in the previous section, only a few systems implemented some of these strategies explicitly. Generally, the hope was that the training data would allow a sufficiently powerful machine learning model to learn what to do by itself. The results do not really bear out this hope. We believe there is some very interesting and fruitful analysis to be done in order to create machine-learning models that are sufficiently rich to produce truly interesting and readable Lay Summaries.

Task Overview
Existing work on scientific document summarization focuses on generating short, abstract-like summaries. While this might be appropriate when summarizing news articles, such summaries cannot cover all the salient information conveyed in a scientific paper. Writing longer summaries requires deep understanding and domain expertise, as can be found in research blogs. To address this point, the LongSumm task opted to leverage blog posts created by researchers in the NLP and Machine Learning communities that summarize scientific articles, and to use these posts as reference summaries (Boni et al., 2020). The task is, given a scientific document, to generate a 600-word summary.

Corpus
The corpus for this task includes a training set that consists of 1705 extractive summaries and 531 abstractive summaries of NLP and Machine Learning scientific papers. The extractive summaries are based on video talks from associated conferences, and contain up to 30 sentences. The abstractive summaries are blog posts created by NLP and ML researchers, with lengths varying between 100 and 1500 words, an average of 779 (±460) words, and an average of 31 (±18) sentences per summary. In addition, we created a (blind) test set of 22 abstractive summaries for evaluating the submissions. The corpus can be found on the LongSumm GitHub repository.

Evaluation
We measured summarization quality using the ROUGE measure (Lin, 2004). The evaluation script utilizes the rouge-score Python package, which is designed to replicate results from the original Perl package with its standard parameters. We report both Recall and F-Measure of ROUGE-1, ROUGE-2, and ROUGE-L. The evaluation was executed on a public leaderboard, forked from EvalAI (Yadav et al., 2019), an open-source AI challenge hosting platform. In addition, 6 summaries were randomly selected from the top-performing systems to undergo human evaluation. The evaluation focuses on informativeness and readability.

Systems Overview
Nine systems participated in the task, with a total of 100 submissions. We briefly describe the eight that submitted a research report describing their approach.
ARTU (El-Ebshihy et al., 2020) - The system generates an extractive summary based on the paper's abstract. Each sentence from the abstract becomes a query to an index that contains all of the paper's paragraphs. For each abstract sentence, a cluster that contains the top retrieved paragraphs is created. The final set of sentences is chosen based on the sentences' LexRank values, their discourse (based on the section they belong to), and the size of the cluster.
AUTH (Gidiotis et al., 2020) - The authors propose an extractive summarization method that utilizes DANCER, a divide-and-conquer approach for long document summarization. DANCER (Gidiotis and Tsoumakas, 2020) helps to select key sections in the document to be summarized separately; to this end, each sentence in the article is classified to a section type. Then, using a PEGASUS-based Transformer, the section summaries are combined to form a complete article summary.
CIST BUPT (Li et al., 2020) - The system supports both extractive and abstractive summaries using deep-learning architectures. For extractive summaries, they used an RNN to compress and represent a sentence, and built sentence relation graphs which are fed into a Graph Convolutional Network (GCN) and a Graph Attention Network (GAT) to create a summary. For abstractive summaries, they used the gap-sentence method in (Zhang et al., 2015) to combine and transform all the data, and then T5 (Raffel et al., 2019), a pre-trained Transformer-based model, for fine-tuning and generation.
GUIR (Sotudeh et al., 2020) - A summarization method that utilizes the BERT summarizer (Liu and Lapata, 2019b). The idea is based on a multi-task learning heuristic, in which two tasks are optimized. The first is a binary classification task, for sentence selection. The second is section prediction, in which the model predicts section labels associated with input sentences. The extractive network is then trained to optimize both tasks. The authors also propose an abstractive summarizer based on the BART (Lewis et al., 2020) transformer that runs after the extractive summarizer.
IIITBH-IITP (Reddy et al., 2020) - The authors propose an extractive sentence classification method. They develop a deep learning architecture utilizing a CNN to extract features, followed by max-pooling and flattening for sentence representation and classification.
IITP-AI-NLP-ML (Mishra et al., 2020) - An unsupervised summarization technique that is used to extract salient sentences. First, article sentences are clustered together using various clustering methods (the authors considered methods such as K-means (Lloyd, 1982) and DBScan (Ester et al., 1996)). Then, each cluster is ranked based on its centrality. Finally, salient sentences are selected from each cluster, taking into account the cluster score, until the desired summary length is reached.
Monash-Summ (Ju et al., 2020) - The system, inspired by SummPip (Zhao et al., 2020), proposes an unsupervised approach that leverages linguistic knowledge to construct a sentence graph. The graph nodes, which represent sentences, are further clustered. This enables control of the summary length. Finally, for each cluster they considered the key phrases and discourse and created an abstractive sentence.
Summaformers (Roy et al., 2020) - To handle long documents, each section was allocated a budget based on its contribution in the training data. Each section was summarized separately, using SummaRuNNer (Nallapati et al., 2017), a neural extractive summarizer.
Table 4 reports the results of the 9 participating systems, 8 of which submitted a research report describing their system. In order to compare the systems, we considered the average of the ROUGE-1, ROUGE-2, and ROUGE-L scores.
Although some of the systems developed an abstractive variant, the highest ROUGE scores were obtained by leveraging extractive summarization techniques. The only system that reported abstractive summarization results on the official leaderboard is Monash-Summ. Most of the systems, except ARTU and IITP-AI-NLP-ML, employ supervised learning approaches. The system that achieved the highest average ROUGE score is GUIR, with its multi-task learning heuristic. Second best is Summaformers, with an average ROUGE score about 3% lower.
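Ranking systems by the average of their ROUGE-1, ROUGE-2 and ROUGE-L scores can be sketched as follows. The use of F-measure values as inputs is an assumption, since the text does not say whether Recall or F-Measure was averaged; the system names and numbers are placeholders and are not results from Table 4.

```python
from statistics import mean
from typing import Dict, List, Tuple

METRICS = ("rouge1", "rouge2", "rougeL")


def rank_by_average_rouge(
    scores: Dict[str, Dict[str, float]]
) -> List[Tuple[str, float]]:
    """Sort systems by the mean of their ROUGE-1, ROUGE-2 and ROUGE-L scores."""
    averaged = {
        name: mean(metrics[m] for m in METRICS) for name, metrics in scores.items()
    }
    # Highest average first.
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)


# Placeholder values for illustration only.
placeholder_scores = {
    "system_a": {"rouge1": 0.5, "rouge2": 0.2, "rougeL": 0.2},
    "system_b": {"rouge1": 0.4, "rouge2": 0.1, "rougeL": 0.1},
}
ranking = rank_by_average_rouge(placeholder_scores)
```

An unweighted mean treats the three metrics as equally important; a weighted variant would need weights that the text does not specify.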

Results
In addition, we randomly selected 5 summaries from the top-3 ranked systems, namely GUIR, Summaformers and IIITBH-IITP, to be evaluated by experts. We asked them to rank the systems with respect to coverage and readability. For coverage, we asked them to take into account how well the summary captures the important, informative content conveyed in the text. For readability, we asked them to take into account fluency, coherence and grammatical correctness. From the coverage perspective, all experts reported that GUIR summaries outperform the other systems; the main issue with Summaformers and IIITBH-IITP is that they mainly cover the introduction and related work sections. From the readability perspective, the experts pointed out several issues, such as out-of-context formulas and references to tables and figures, sentences that are not ordered according to the paper's discourse, and footnotes that are clearly not relevant, such as URLs, author information, etc. Our analysis ignores Wing, since they did not submit a system report as required.

Discussion
Scientific documents can be characterized as long, structured, and utilizing technical language (i.e., formulas, tables, definitions, etc.). Analyzing the summaries and reports of the participating systems shows that most of them considered the structure of the document while generating summaries, by utilizing sections and document discourse. From a language perspective, some systems utilized language models that were pre-trained on scientific corpora. However, we believe that more effort should be focused on handling mathematical definitions, formulas, tables, and the text surrounding them. For example, it is not clear whether these entities should be treated differently from narrative text and whether they should be considered as atomic units that should not be compressed further.
Moreover, readability should play an important role in algorithmic design. Due to the nature of scientific documents and the LongSumm length requirement, we believe this is even more challenging than in traditional summarization tasks. Readability deserved more attention from the participating systems.
Finally, it was surprising to see that most evaluated systems are extractive and not abstractive. In the future we plan to extend this corpus, with the hope that LongSumm will help foster further research in this domain.

Conclusion
The First Scholarly Document Processing workshop (Chandrasekaran et al., 2020) comprised three summarization tasks, each of which aimed to improve the state of the art of scientific document summarization. In total, we received 18 submissions that addressed one or more of these tasks. It was a useful exercise to compare and contrast each of these summarization tasks, since they allowed researchers to explore their systems in different contexts, on different corpora, and for different audiences. Overall, what this effort has shown is that the summarization of scientific documents is neither in its infancy nor a fully solved problem. We are interested in expanding task-based efforts in scholarly document summarization in future workshops, and in investigating how scholarly documents differ from or are similar to other texts. We are interested in collaborating with others in the NLP and AI communities to investigate to what degree new technologies can be utilized and developed, to allow for a future where some of the work of tracking the scientific literature can be supported by machines. CL-SciSumm has now run for 6 editions and, with the 2020 edition, has set up two standard benchmark evaluation datasets for citation-based summarization, intended for use by researchers to aid in scientific discovery (breadth). LongSumm and LaySumm are inaugural tasks towards building systems that improve the understanding and dissemination of papers (depth).