Divide and Conquer: From Complexity to Simplicity for Lay Summarization

We describe our approach for the 1st Computational Linguistics Lay Summary Shared Task CL-LaySumm20. The task is to produce non-technical summaries of scholarly documents. The summary should be within easy grasp of a layman who may not be well versed with the domain of the research article. We propose a two step divide-and-conquer approach. First, we judiciously select segments of the documents that are not overly pedantic and are likely to be of interest to the laity, and over-extract sentences from each segment using an unsupervised network based method. Next, we perform abstractive summarization on these extractions and systematically merge the abstractions. We run ablation studies to establish that each step in our pipeline is critical for improvement in the quality of lay summary. Our approach leverages state-of-the-art pre-trained deep neural network based models as zero-shot learners to achieve high scores on the task.


Introduction
Acceptance of science by society is accelerated by sharing scientific knowledge and engaging with the public at large. Scientifically backed information, when suitably summarized and conveyed to the general public, empowers people to combat the spread of misinformation. A lay summary of a scholarly scientific text, targeted at the general public, captures the broad scientific idea and its potential impact with minimal technical jargon. Funding agencies, scientists within and outside the field, and science journalists also benefit from lay summaries (Kuehne and Olden, 2015). The CL-LaySumm20 shared task aims to develop NLP methods to bridge the gap between advances made by the scientific community and the non-specialist audience, by summarizing scholarly scientific articles in language understandable to lay persons. Evaluation for the task is based on the Recall and F1-scores of the ROUGE-1, -2, and -L metrics (Lin, 2004). Additionally, selected summaries are evaluated by science journalists and communicators for ease of comprehension as well as for interestingness. Chandrasekaran et al. (Forthcoming) document the results and insights from the shared task.
* These authors have equal contribution to this work.

Abstractive Vs. Extractive Summarization
Automatic summarization of generic documents is accomplished using either an extractive or an abstractive approach. Extractive summarization algorithms rank salient sentences in the input text, and subsequently select the top ranked sentences for inclusion in the summary. These algorithms effectively identify sentences containing important facts, but often suffer from weak coherence. An extractive summary, which is more like a list of bullet points, does not compare favourably with a human written summary, which is a cohesive piece of text generally written after paraphrasing and fusing different sentences or phrases from the text. Overall coherence between sentences in an extractive summary deteriorates because of severe loss of context and several dangling anaphora (Antunes et al., 2018). With rapid and remarkable developments in neural language models, abstractive summarization algorithms have gained traction. These models are trained on sequence to sequence text generation (Sutskever et al., 2014) and are able to generate high quality natural language texts. They can abstract long sentences into short and meaningful ones, and can introduce novel expressions and paraphrases while maintaining almost human-like quality. The current state-of-the-art neural abstractive summarizers are based on transformers (Vaswani et al., 2017), which use a self-attention mechanism to allow contextual encoding of the input sequence. A major shortcoming of transformer based models is that their memory requirement and computational cost depend quadratically on the length of the input sequence. A two stage extractive-abstractive pipeline is usually proposed to alleviate this shortcoming, including in the prominent works of Chen and Bansal (2018), Gehrmann et al. (2018) and Zhao et al. (2020). An extractive step before abstraction has also been deemed important for improving content selection in abstractive summaries (Liu and Liu, 2009; Mehdad et al., 2014).

Lay Summarization
Dubé and Lapane (2014) provide a checklist for manually writing a lay summary for a specified audience, which serves as a set of desiderata for designing algorithms for lay summarization. Manually translating complex research ideas into lay language demands considerable patience, time, subject knowledge and effort. This has motivated research in the area of automatic lay summarization, which aims at condensing the core ideas of scientific research and expressing them in language accessible to lay audiences, while remaining true to the science.
Extractive summarization is insufficient for the task of lay summarization for two reasons. First, when sentences are selected for inclusion in the summary they carry the burden of scientific jargon along with them, which degrades readability and comprehension for a lay audience. Second, loss of contextual information and the consequent lack of coherence seriously undermine the purpose of a lay summary.
As discussed earlier, state-of-the-art transformer based neural abstractive summarizers do not scale well for documents exceeding 1000 tokens (Zhao et al., 2020). As scholarly articles are usually much longer, abstractive summarizers cannot be effectively used standalone for the CL-LaySumm20 task.
Lay summarization can benefit from exploiting the strengths of extractive and abstractive summarization while avoiding their respective weaknesses. Distilling important sentences conveying the core scientific ideas of the paper using extractive summarization, and feeding them to a state-of-the-art abstractive summarizer, has the potential to yield the desired non-technical summary of the scientific article in simple and understandable language.

Our Approach
We propose a two step approach that divides the scholarly scientific text into segments to conquer the complexity before generating a simple lay summary. Following the heuristic advanced by Collins et al. (2017a) that certain sections of the document are more pertinent from the summarization viewpoint, we exploit the structure of the text to select information rich segments. We combine state-of-the-art extractive and abstractive summarization methods, first extracting important sentences from the selected segments, and then compressing and paraphrasing these sentences via an abstractive summarizer. Subsequently, we combine the summaries in a rule based manner to obtain the final lay summary. We report systematic ablation studies to demonstrate the benefit of (i) using abstraction after extraction, and (ii) focusing on specific sections for lay summarization.
Various supervised and unsupervised techniques have been used so far for accomplishing distinct tasks pertinent to scientific articles (Altmami and Menai, 2020). Recently, Miller (2019) proposed to leverage the state-of-the-art BERT model (Devlin et al., 2018) for extractive summarization of lectures. In this approach, K-means clustering is performed on sentence embeddings obtained from BERT, and the sentences that are closest to the cluster centroids are extracted to create the summary. Among non-neural models, a popular approach is to capture relations between sentences or word phrases via a weighted graph. Gupta et al. (2014, 2019) model the sentences of the document as nodes of a weighted directed graph and compute idf based entailment scores between sentence pairs. They use weighted minimum vertex cover to extract the most salient sentences. Most recent neural abstractive summarizers are trained on a masked language modeling task where random sequences of inputs are masked and the model learns to reproduce the masked portions of text. One such model that has achieved state-of-the-art results on abstractive summarization datasets is BART (Lewis et al., 2019). BART is an autoencoder which is pretrained to reproduce the original input after it has been corrupted with arbitrary noise. BART uses a transformer (Vaswani et al., 2017) based architecture that employs a self-attention mechanism to allow contextual encoding of the input sequence.
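The centroid-based selection attributed to Miller (2019) above can be illustrated with a minimal sketch. It is not the bert-extractive-summarizer implementation: the 2-D vectors are hypothetical stand-ins for BERT sentence embeddings, the function names are ours, and the plain Lloyd's k-means with first-k initialization is an arbitrary simplification.

```python
import math

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means; centroids start at the first k points (arbitrary choice)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def centroid_sentences(sentences, embeddings, k):
    """Return the sentence nearest to each centroid, restored to document order."""
    centroids = kmeans(embeddings, k)
    chosen = {
        min(range(len(embeddings)), key=lambda i: math.dist(embeddings[i], c))
        for c in centroids
    }
    return [sentences[i] for i in sorted(chosen)]

# Toy 2-D vectors standing in for BERT sentence embeddings (hypothetical values).
sentences = ["s0", "s1", "s2", "s3", "s4", "s5"]
embeddings = [(0.0, 0.0), (0.2, 0.0), (0.1, 0.1), (5.0, 5.0), (5.2, 5.0), (5.1, 5.1)]
summary = centroid_sentences(sentences, embeddings, k=2)
```

With two well separated groups of vectors, one sentence is picked from each group, which is the behaviour the clustering step is meant to provide: coverage of distinct topics rather than repeated selection from one.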

Data
The organizers provide training and validation corpora for the CL-LaySumm20 task, named Laysumm2 (215 documents) and Batch3 (357 documents). These documents comprise abstracts and full texts of scholarly articles from the epilepsy, archaeology, and materials engineering domains. Each document in the two corpora is accompanied by a gold-standard lay summary. The test set contains 37 documents (abstracts and full texts).

Methodology
Our approach is based on the premise that not all sections of a scholarly scientific text are equally comprehensible to non-experts. The gist of the scientific ideas and the important findings are concentrated in the Abstract and Conclusion sections, while most of the technical details of the research are spread across the sections describing methodology and experimentation. The Introduction and Discussion sections lie somewhere in between. Based on the intuition that the Abstract, Introduction and Conclusion sections of a scholarly scientific text are information rich, Kavila and Radhika (2015) construct summaries sourced from these sections. Collins et al. (2017a) argue that the Abstract, being an author-generated summary, is the most important section in a paper. Using a corpus of 10K computer science research papers, they empirically compare the overlap between different sections and paper highlights. It is reported that among the Abstract, Conclusion, Discussion and Introduction sections (ACDI), the Introduction section shows the least overlap. The authors attribute the low importance of the Introduction section to its longer length.
We empirically test this conjecture for lay summaries. We divide the scientific document into two parts, (i) the combined ACDI text, and (ii) the rest of the document, and compute the ROUGE scores of the two parts with respect to the gold standard summary. Table 2 shows the result of the experiment for both corpora, affirming the observations documented by Collins et al. (2017a). For both corpora the combined ACDI sections, despite being shorter, achieve higher ROUGE scores than the remaining text. The results confirm that the ACDI sections are apposite for generating summaries from a layman's perspective.
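To make the comparison concrete, the following is a minimal sketch of a unigram ROUGE-1 computation applied to two document parts. It is a simplification and not the official ROUGE toolkit used for the shared task: it lowercases and splits on whitespace, with no stemming or stopword handling, and the three example strings are invented for illustration.

```python
from collections import Counter

def rouge_1(candidate, reference):
    """Unigram-overlap ROUGE-1 as (recall, precision, F1) on lowercased whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return recall, precision, f1

# Invented toy texts: a gold lay summary, an ACDI-like part, and a methods-like part.
gold = "the study finds that dogs and wolves differ"
acdi_text = "we find that dogs and wolves differ in skull height"
rest_text = "samples were measured with calipers"

acdi_scores = rouge_1(acdi_text, gold)
rest_scores = rouge_1(rest_text, gold)
```

Recall rewards longer candidates because more reference words can be matched, while F1 also penalizes unmatched candidate words. This is why section length has to be kept in mind when reading the recall and F-score columns side by side.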

Section-wise Analysis
We further study the relative importance of each of these four sections from the lay summary perspective and present our results in Table 3. Note that not all research papers in the corpora are structured uniformly, and the sections present vary from paper to paper. Column N Doc shows the number of documents that contain the particular section. All ROUGE scores are computed over the sections that exist in the documents. For the Laysumm2 corpus, Abstract consistently exhibits high ROUGE scores, except for the slightly better ROUGE recall of Introduction. The length of the Introduction section, which is almost four times that of the Abstract, possibly accounts for this advantage. The Discussion and Introduction sections, which have similar average lengths, score comparably. Conclusion, the shortest section, displays a relatively high F-score for its length. Its low recall score is clearly due to its short length. Documents in the Batch3 corpus show a different trend in ROUGE scores due to differences in the lengths of the sections. The Discussion section is strikingly longer than the others, gaining higher recall scores. Interestingly, the gain due to length is annulled in the F-scores, which are the lowest among the four sections. Abstract consistently earns the second highest score, despite its short length.
Table 3: Average ROUGE scores for Abstract (A), Conclusion (C), Discussion (D), Introduction (I) sections with respect to gold standard lay summaries for Laysumm2 (L) and Batch3 (B) datasets. The averages are taken by considering only the cases where these sections are present in the document. N Doc is the number of documents in which a particular section appears.
The experiment leads to a conclusion not different from that of Collins et al. (2017a), and forms the basis of the rules we use for generation of the lay summary (described in the following subsection).

Lay Summarization Framework
The complete pipeline of our system is shown in Figure 1. We reconstruct the input for summarization by extracting the Abstract, Conclusion, Discussion and Introduction sections from the preprocessed text. Recognizing the richness and simplicity of the information contained in the Abstract, supported by high ROUGE scores, we choose not to perform extractive summarization over it. We over-extract important sentences from each of the remaining three sections using a common extractive summarization method. The four segments, viz. Abstract, Conclusion, Discussion and Introduction, are further condensed using an abstractive summarizer to obtain corresponding simplified texts. Finally, the four abstractive summaries are concatenated one by one in ACDI order until the desired lay summary length is achieved. We observe from Table 3 that some sections might be missing in some documents. In such cases, we simply move on to the next most important section (as per ACDI order).
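The final merging rule can be sketched as follows. This is an illustrative reading, not the released implementation: the dictionary keys, the `max_words` parameter, and the choice to cut at sentence boundaries are assumptions, since the text only states that the section summaries are appended in ACDI order until the desired length is reached.

```python
ACDI_ORDER = ("abstract", "conclusion", "discussion", "introduction")

def merge_in_acdi_order(section_summaries, max_words=110):
    """Append section summaries sentence by sentence, in ACDI order, up to max_words.

    section_summaries maps a section name to a list of sentences; sections that
    are absent from the paper are simply skipped.
    """
    merged, count = [], 0
    for name in ACDI_ORDER:
        for sentence in section_summaries.get(name, []):
            n_words = len(sentence.split())
            if count + n_words > max_words:
                return " ".join(merged)  # budget reached: stop here
            merged.append(sentence)
            count += n_words
    return " ".join(merged)

# Toy input with no Conclusion section (hypothetical content).
toy = {
    "abstract": ["a b c."],
    "discussion": ["d e."],
    "introduction": ["f g h i."],
}
lay_summary = merge_in_acdi_order(toy, max_words=6)
```

Because Abstract comes first, a long and simple abstract can fill the whole budget on its own, which matches the sensitivity to section order noted later in the paper.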

Experimental Setting
In this section we describe the choices we made for implementing the lay summarization framework. All the source code is made publicly available on GitHub.

Data Pre-processing
We pre-process the input text by removing redundant whitespace, hyperlinks, and references. Based on the intuition that sentences containing relatively many mathematical symbols and non-English characters might not be comprehensible to lay readers, we completely remove the sentences in which more than one fifth of the characters are special characters. We also remove any single character or numeral preceded by a period, since this character will constitute the beginning of a valid sentence only if it is upper case and is preceded by a period and a single whitespace; for instance, 'ab.c' is replaced with 'ab c'. We also replace common acronyms with their full forms. Finally, we remove all punctuation symbols except periods, question marks and exclamation marks, which constitute end of sentence markers for effective sentence tokenization.
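A minimal sketch of the first steps of this pre-processing (hyperlink removal, whitespace normalization, and the one-fifth special character filter) is given below. The set of characters treated as ordinary, the URL pattern, and the function names are assumptions made for illustration; the paper does not define "special character" precisely.

```python
import re
import string

# Assumption: letters, digits, spaces and common punctuation count as ordinary characters.
ALLOWED = set(string.ascii_letters + string.digits + " .,;:!?'\"()-")

def special_ratio(sentence):
    """Fraction of characters in the sentence that are outside ALLOWED."""
    if not sentence:
        return 0.0
    return sum(ch not in ALLOWED for ch in sentence) / len(sentence)

def clean(text, max_special=0.2):
    """Drop hyperlinks, collapse whitespace, and remove symbol-heavy sentences."""
    text = re.sub(r"https?://\S+", " ", text)   # remove hyperlinks
    text = re.sub(r"\s+", " ", text).strip()    # collapse redundant whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text)  # naive sentence split
    return [s for s in sentences if special_ratio(s) <= max_special]

raw = ("Seizures were reduced   in most patients. "
       "α=β+γ∑∂ ≤ 0.05. "
       "See https://example.org for details.")
kept = clean(raw)
```

The second toy sentence is dominated by symbols and is dropped, while the other two survive with the hyperlink stripped out.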

Extractive Summarization
We experiment with two extractive summarization methods of different kinds, with the objective of comparing cost and benefit. We choose a pretrained supervised neural model, BioBERT with k-means clustering (BioBERT SUM), and a frugal, unsupervised network based summarization algorithm. The two methods are briefly described below.
(i) Supervised: BioBERT SUM. Motivated by the approach proposed in Miller (2019), we apply k-means clustering on BioBERT embeddings. BioBERT (Lee et al., 2020) is initialized with weights from the BERT (Devlin et al., 2018) model pretrained on general domain corpora, followed by further training on scholarly text specific to the biomedical domain, which is one of the specified domains of our input corpora. Thus, we expect it to perform well with both general domain and biomedical domain inputs. We use the fine-tuned version (BioBERT-NLI 3) of BioBERT with the bert-extractive-summarizer package 4 for extractive summarization.
(ii) Unsupervised: Entailment based Weighted Minimum Vertex Cover (wMVC) is an unsupervised network based approach proposed by Gupta et al. (2014). The sentences are modelled as vertices of a graph, and Inverse Document Frequency (IDF) based entailment is employed to link sentences (Gupta et al., 2019). The algorithm considers those sentences important which entail many other sentences. The extent to which a sentence A entails another sentence B is captured by the weight of the directed edge (A, B), defined as:
$$E_{A,B} = \frac{\sum_{w \in A \cap B} idf_w}{\sum_{w \in B} idf_w}$$
where the idf score of a word w is computed as:
$$idf_w = \log \frac{N}{n_w}$$
with n_w = number of sentences containing w, and N = total number of sentences in the document. The connectivity score Conn_u of a vertex u determines the importance of the corresponding sentence. In a vertex pruning step, all vertices having a connectivity score below a threshold are removed.
Finally, the minimum number of sentences that encapsulate the essence of the document is identified using weighted minimum vertex cover. The aim is to prefer vertices with a high connectivity score; therefore, the vertex weights are inverted for the reduction to weighted minimum vertex cover. The k highest-scoring sentences are extracted from the solution. These sentences are then re-ordered as per the original document ordering. We implement wMVC using Python 3.8 and the NetworkX (Hagberg et al., 2008) package.
3 https://huggingface.co/gsarti/biobert-nli
4 https://pypi.org/project/bert-extractive-summarizer/0.4.2/
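The idf score and the entailment edge weight E_{A,B} described for wMVC can be sketched in a few lines. This is an illustration under assumptions rather than the released code: each sentence is represented as a set of lowercase tokens (so a repeated word counts once), the three toy sentences are invented, and the edges are kept in a plain dictionary where the actual system builds a NetworkX graph. The connectivity score, pruning and vertex cover steps are not shown.

```python
import math

def idf_scores(sentences):
    """idf_w = log(N / n_w): N sentences in the document, n_w of them containing w."""
    n_sentences = len(sentences)
    vocabulary = set().union(*sentences)
    return {
        w: math.log(n_sentences / sum(w in s for s in sentences))
        for w in vocabulary
    }

def entailment(a, b, idf):
    """E_{A,B}: idf mass of the words shared by A and B over the idf mass of B."""
    denominator = sum(idf[w] for w in b)
    return sum(idf[w] for w in a & b) / denominator if denominator else 0.0

# Toy document: each sentence is a set of tokens (hypothetical content).
doc = [
    {"dogs", "and", "wolves", "differ", "in", "skull", "height"},
    {"dogs", "and", "wolves", "differ"},
    {"skull", "height", "was", "measured"},
]
idf = idf_scores(doc)
edges = {
    (i, j): entailment(doc[i], doc[j], idf)
    for i in range(len(doc))
    for j in range(len(doc))
    if i != j
}
```

In the toy document, sentence 0 contains every word of sentence 1, so the edge (0, 1) has weight 1, while the reverse edge is weaker because sentence 1 covers only part of sentence 0. The asymmetry is what makes the graph directed.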

Abstractive Summarization
We noticed through manual checks that the gold summaries provided for the task are abstractive in nature. Therefore we extract a summary that is longer than the required length using the extractive summarizer and compress it using the BART abstractive summarizer. We use the transformers library provided by Wolf et al. (2019) and the weights of the pretrained model facebook/bart-large-cnn for our experiments. We run BART and BioBERT SUM on Google Colaboratory with the GPU setting, while the wMVC experiments are run on a CPU.

Experimental Design
We design experiments to answer three research questions.
I. How does the unsupervised wMVC method compare with BioBERT SUM for extractive summarization?
II. Does staging of extractive and abstractive summarization bring an improvement in the quality of lay summaries?
III. Does the divide-and-conquer approach for generating lay summaries pay off?
To answer questions (I) and (II), we extract summaries from the full text using BioBERT SUM and wMVC. Next we feed the extracted summaries to BART for comparison. The findings are described in Section 6.1. To answer question (III), we compare lay summaries generated by combining ACDI as single unit and those generated by the framework. The observations are discussed in Section 6.2.

Experiment I and II
We compare the performances of wMVC and BioBERT based summarizers by extracting 100 word summaries from the full text of the given documents and computing their respective ROUGE scores (Section (i) of Table 4). Macro-averaged scores for both datasets are higher for wMVC summaries, indicating that wMVC yields better quality summary for the two corpora.
In order to test the effectiveness of staging extractive and abstractive summarization, we extract 200 word summaries (twice the stipulated summary length) from the full text using the two extractive summarizers, and feed these to the BART abstractive summarizer to obtain two sets of lay summaries. The final average summary length is 103 words for Laysumm2 and 93 words for Batch3 documents, meeting the stipulated length restriction. ROUGE scores of the summaries are recorded in Section (ii) of Table 4.
It is observed that abstraction distinctly improves ROUGE scores in all cases for all metrics. Interestingly, the improvement appears to be larger for BioBERT based summaries, which makes it the winner for lay summaries of Batch3 documents. This result is not conclusive, however. We plan to investigate further using statistical tests.
This experiment indicates that the performance of abstractive summarizers for generating lay summaries can be boosted by feeding them focused, good quality content obtained from an extractive summarizer. However, the size of the boost is not predictable and depends on the input. It is noteworthy that staging pretrained extractive and abstractive summarizers for inference is less data and resource intensive than training a model end to end.

Experiment III
Next we describe our experiment to test our conjecture of divide-and-conquer. We generate lay summaries using the Abstract, Conclusion, Discussion and Introduction sections as a single unit, and compare them with those obtained by the proposed framework.
We extract 200 word summaries after combining the Abstract, Introduction, Discussion and Conclusion sections in document order, and abstract them using BART, which delivers lay summaries with an average length of 113 words. Next, based on the framework (Figure 1), we generate extractive summaries of 150-170 words from the Conclusion, Discussion and Introduction sections. If any section is shorter than 250 words, it is not subjected to extractive summarization. Section-wise lay summaries for each of the four sections are obtained, which are finally assembled by appending them in ACDI order until the desired length of the lay summary is achieved (90-110 words). Note that in case any section is missing from the paper, the framework quietly ignores it. We present the results in Table 5.
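The section routing rule of the framework (Abstract bypasses extraction, short sections bypass extraction, long sections are over-extracted) can be sketched as below. The function and parameter names are hypothetical, `target_words=160` is simply a value inside the stated 150-170 range, and `first_words` is a deliberately trivial stand-in extractor, not wMVC or BioBERT SUM.

```python
def prepare_for_abstraction(sections, extract, min_words=250, target_words=160):
    """Decide, per section, what text is handed to the abstractive summarizer."""
    prepared = {}
    for name in ("abstract", "conclusion", "discussion", "introduction"):
        text = sections.get(name)
        if not text:
            continue  # a missing section is quietly ignored
        if name == "abstract" or len(text.split()) < min_words:
            prepared[name] = text  # passed on without extraction
        else:
            prepared[name] = extract(text, target_words)
    return prepared

def first_words(text, n):
    """Stand-in extractor for illustration only: keeps the first n words."""
    return " ".join(text.split()[:n])

# Toy paper with no Introduction section (hypothetical content).
sections = {
    "abstract": "short abstract " * 150,   # 300 words, still not extracted
    "conclusion": "brief conclusion",       # under 250 words, passed through
    "discussion": "word " * 400,            # long enough to be extracted
}
prepared = prepare_for_abstraction(sections, first_words)
```

Passing the extractor in as an argument keeps the routing logic independent of the summarizer, so either of the two extractive methods compared in the paper could be plugged in.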
All ROUGE scores for both data sets show marked improvement over the scores obtained for lay summaries of the full text (Part (ii) of Table 4 and part (i) of Table 5). It is clear that the Abstract, Introduction, Discussion and Conclusion sections are the most useful for generating lay summaries. Part (ii) of Table 5 further reveals that generating summaries from individual sections and subsequently merging them in ACDI order results in higher scoring lay summaries than those generated from the combined ACDI text.
This validates the divide-and-conquer approach of focussing on limited segments of scholarly scientific documents, extracting the gist and abstracting it to make it comprehensible. The insight available from this result may help in designing better lay summarizers.
We report the evaluation results of summaries produced by our final experiment, ACDI Incremental, on the test set in Table 6. In the first variant,
It is noteworthy that the quality of the generated lay summary is sensitive to the order of the sections. If the abstract is simple and long enough, the lay summary may be a condensed form of the abstract only. The lay summary of this paper is shown in Appendix A.

Discussion
We present two sample system summaries along with the gold standard summaries in Appendix B in Tables 8 and 9. We can observe that the summary in Table 8 has remarkably high overlap (highlighted) with the gold summary. In Table 9, we observe that some technical terms (highlighted) do find their way into the lay summary. Using appropriate ontologies and substituting these terms with their synonyms or entity classes could make the meaning clearer to the laity. At times, some sentences (example highlighted in Table 9) end up being extracted that are loosely coupled with the rest of the summary and are not even important from the viewpoint of a layman. Such sentences may increase the ROUGE score, but degrade the overall readability. Manual inspection of a few other system summaries by the authors indicates that we need to improve pre-processing and sentence tokenization. For example, one of the summaries contains confidence interval values which may not be comprehensible to the laity. Certain inconsistencies in the input format confuse our parsing algorithms, leading to inaccurate segmentation of sections in a few cases. Moreover, we notice that in a few cases BART leaves incomplete sentences at the end, which degrades the quality of the lay summary. A post-processing check is desirable to address this problem.
Dependence on the abstractive summarizer for the quality of the lay summary is the main caveat of the proposed framework. Anticipating constant improvement in the state-of-the-art in NLG, we expect the framework to yield high quality lay summaries.
6 The evaluation scores for the test set are retrieved from CodaLab: https://competitions.codalab.org/competitions/25516#results
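One possible form of the post-processing check mentioned in the discussion, for summaries that end in an incomplete sentence, is sketched below. It is a suggestion rather than part of the reported system; the function name is hypothetical and the rule is deliberately naive.

```python
def drop_trailing_fragment(summary):
    """Cut the summary after its last sentence-ending mark ('.', '!' or '?').

    If no such mark is present the text is returned unchanged, because there
    is no complete sentence to fall back on.
    """
    text = summary.strip()
    last = max(text.rfind(mark) for mark in ".!?")
    return text[: last + 1] if last != -1 else text

truncated = "Dogs and wolves differ in skull height. The results of the"
repaired = drop_trailing_fragment(truncated)
```

A rule this simple treats any period as a sentence end, so a fragment that contains a decimal number such as 0.05 would be cut in the wrong place; a sentence tokenizer would be a sturdier basis for the same check.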

Conclusion
We propose a framework for generating lay summaries of scholarly scientific documents. The framework is based on the core idea of an extractive-abstractive pipeline. We divide the text into segments and focus on information rich segments to extract important sentences. These extracts are fed to a state-of-the-art abstractive summarizer for further compression, which improves the readability of the summary. This strategy improves the quality of the lay summary while cutting down on both the training data requirement and the computational resources. We show that reusing pre-trained, publicly available models can be favoured over devising new training architectures, thereby reaping the advantages of transfer learning for specialized tasks.

A Lay Summary of Present Paper
Title: Divide and Conquer: From Complexity to Simplicity for Lay Summarization
Summary: The task is to produce non technical summaries of scholarly documents. The summary should be within easy grasp of a layman who may not be well versed with the domain of the research article. We propose a two step divide and conquer approach. We judiciously select segments of the documents that are not overly pedantic and are likely to be of interest to the laity. We over extract sentences from each segment using an unsupervised network based method. We perform abstractive summarization on these extractions and systematically merge the abstractions. We run ablation studies to establish that each step in our pipeline is critical for improvement in the quality of lay summary.

Gold Standard: Measures of site location in relation to agricultural potential are an important tool for identifying relative shifts in the importance of agriculture in prehistoric economies over time.
We examine GIS modeling of agricultural potential based on soil characteristics, topography, and proximity to drainage in the highlands of the Mogollon culture of southcentral New Mexico. We describe methods, limitations, and advantages of this approach. Preliminary results support other evidence of strong agricultural reliance in the pithouse period, substantially greater than in the Archaic; the pueblo period may be slightly more linked to optimal agricultural land, though the latter conclusion is uncertain.
BART wMVC: Measures of site location in relation to agricultural potential are an important tool for identifying relative shifts in the importance of agriculture over time within a given region.
We examine the application of GIS modeling of agricultural potential based on soil characteristics, topography, and proximity to drainage in the highlands of the Jornada branch of the Mogollon culture of southcentral New Mexico.
Our results support other evidence of strong agricultural reliance in the pithouse period, substantially greater than in the Archaic the pueblo period occupation may be slightly more tightly linked to optimal agricultural land, though the latter conclusion is uncertain.
Our results have potential implications for both the interpretation of Formative period settlement in the Sierra Blanca Capitan Mountain highlands, and for further methodological approaches to settlement analysis.

Summaries Corresponding to Paper ID:S2352409X18303663
Title: An evaluation of classical morphologic and morphometric parameters reported to distinguish wolves and dogs
Gold Standard: Visual traits and measurements that support distinguishing dog and wolf skeletal remains have been long-used, but insufficiently researched. We evaluated 14 of these, including dental abnormalities; mandible shape; orbital angle; hard palate; snout dimensions; and skull dimensions. We found only a few reliable measures, including skull height, very small or large orbital angle, snout width index, and specific measures of the 1st molar and 4th premolar teeth. Thus, much earlier research now must be re-considered toward use of combined visual, measured, and genetic traits for accurate archaeological identifications.
BART wMVC: Morphological and morphometric differences between wolves and dogs are often overlooked. This article shows how these differences can be used to better understand the history of wolf-dog relations.
The study also shows that the differences between the two species are not as large as previously thought.
The results of the study were published in the Journal of Archaeology and Ethnology, a journal of the American Museum of Natural History and the American Academy of Arts and Sciences.
Traditional morphometric identification of potential early domesticated dogs largely has been based on low numbers of specimens, as well as unverified diagnostic methods and variables. We propose the use of much larger canid reference groups to explore whether variation identified as signs of domestication in these specimens actually reflects natural variation that will be seen more easily within larger sample groups.