IITP-AI-NLP-ML@ CL-SciSumm 2020, CL-LaySumm 2020, LongSumm 2020

The publication rate of scientific literature is increasing rapidly, which makes it challenging for researchers to keep up with the state of the art. Scientific document summarization addresses this problem by condensing the essential facts and findings of a document. In this paper, we present the participation of the IITP-AI-NLP-ML team in three shared tasks, namely CL-SciSumm 2020, LaySumm 2020, and LongSumm 2020, which aim to generate medium-length, lay, and long summaries of scientific articles, respectively. To solve the CL-SciSumm 2020 and LongSumm 2020 tasks, three well-known clustering techniques are used, and then various sentence-scoring functions, including textual entailment, are used to extract sentences from each cluster for summary generation. For LaySumm 2020, an encoder-decoder based deep learning model has been utilized. The performance of our systems is evaluated in terms of ROUGE measures on the datasets associated with the shared tasks.


Introduction
Massive numbers of scientific articles are published every day (Cohan et al., 2015; Cohan and Goharian, 2017, 2018), which poses a big challenge for researchers in various fields to keep themselves up-to-date with new developments. A bibliometric study shows that the number of published articles doubles approximately every nine years (Bornmann and Mutz, 2015). The objective of scientific document summarization is to provide a summary of the reference paper that contains all of its important facts, thereby reducing the human effort required to understand the document.
Challenges of each style of summary are as follows:
• The objective of CL-SciSumm is to generate a short summary of a paper that must contain all relevant facts and findings. To solve this problem, we have used an extractive summarization technique: unsupervised methods combined with different features for scientific summarization. These features help identify the important sentences of the article from different aspects.
• The objective of LongSumm is to generate a long summary of a scientific article, which may be extractive or abstractive, and must contain all the important facts of the article. To solve the extractive long-summarization problem, we have used an unsupervised technique similar to that for CL-SciSumm. To solve the abstractive LongSumm problem, we have used an encoder-decoder based generative model.
• The objective of CL-LaySumm is to generate a lay summary that can be understood by a non-technical reader; the generated summary should not contain any technical words or jargon. We have solved this task using an abstractive summarization technique: a fine-tuned BERT-based encoder-decoder architecture.
The current paper addresses these challenges by participating in three shared tasks, namely CL-SciSumm 2020, LaySumm 2020, and LongSumm 2020 (Chandrasekaran et al., 2020). The goals of these tasks are to generate medium-length, lay (understandable for a non-technical audience), and long summaries of scientific articles, respectively. We use an extractive approach for CL-SciSumm; for LongSumm, an extractive approach followed by an abstractive one; and for LaySumm, an abstractive approach. Detailed descriptions of these tasks are provided in the subsequent sections.

CL-SciSumm 2020
This is the sixth shared task on scientific document summarization. In the literature, two approaches have been used to solve this problem. The first considers the abstract as the summary of the paper, but the problem with this approach is that the abstract conveys only the theme of the paper and may not contain all the important points of a summary (Yasunaga et al., 2019; Atanassova et al., 2016). Therefore, a second approach, citation-based summarization, has been followed (Qazvinian et al., 2013). It utilizes the set of citations referencing the original article (the reference paper to be summarized). Citations are short descriptions that explain the reference paper and its contributions; this text is termed the citation text or citance.

Dataset
The dataset contains a blind test set of 20 papers and their corresponding citing papers. Each paper belongs to the computational linguistics and natural language processing domain. It can be found at https://github.com/WING-NUS/scisumm-corpus/tree/master/data/Test-Set-2018.
Training data is also provided, but as our approach is purely unsupervised, we make use of only the test data.

Tasks Descriptions
• Task-1 (A): Given the reference paper (RP) and its citing papers (CPs), the objective is to identify the spans of text (cited text spans) in the RP for each citance.
• Task-1 (B) is to classify each cited text span into facets that are predefined (Hypothesis, Aim, Method, Results, and Implication).
• Task-2 is to produce a summary of the reference paper by utilizing its citances. The generated summary should be at most 250 words long.

System Description
In this section, the steps followed in our proposed framework for solving different sub-tasks are elaborated.
Task 1 (A) To find out the cited text span in the reference paper for each citance, we have utilized the word mover's distance (Kusner et al., 2015).
For each citation sentence, WMD is used to identify the most similar sentences from the reference paper. WMD denotes the semantic similarity between sentences. Here we have selected the top five most similar sentences from the reference paper.
Task 1 (B) To classify each cited text span, we have calculated the similarity between the cited text span and all the five facets using word mover's distance (WMD). The cited text span is assigned that facet, which is closest in terms of WMD.
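The two WMD-based steps above can be sketched as follows. This is a minimal illustration rather than the authors' code: true WMD is an optimal-transport distance over word embeddings (gensim's `wmdistance` computes it), so the relaxed lower bound from Kusner et al. (2015), in which each word simply travels to its nearest counterpart, stands in here; sentences are token lists, and the embeddings and function names are assumptions.

```python
import math

def relaxed_wmd(s1, s2, emb):
    """Relaxed Word Mover's Distance (a lower bound on true WMD):
    each word moves to its nearest word in the other sentence."""
    def one_way(a, b):
        return sum(min(math.dist(emb[w], emb[v]) for v in b) for w in a) / len(a)
    return max(one_way(s1, s2), one_way(s2, s1))

def cited_text_spans(citance, ref_sentences, emb, k=5):
    """Task 1(A): the k reference-paper sentences most similar to the citance."""
    return sorted(ref_sentences, key=lambda s: relaxed_wmd(citance, s, emb))[:k]

def assign_facet(span, facet_tokens, emb):
    """Task 1(B): the facet (Hypothesis, Aim, Method, Results, Implication)
    whose token description is closest to the cited text span."""
    return min(facet_tokens, key=lambda f: relaxed_wmd(span, facet_tokens[f], emb))
```

In the actual system, pre-trained word vectors would populate `emb`, and the facet names themselves (or short descriptions) would supply `facet_tokens`.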
Task 2 To generate a structured summary, we have used an unsupervised technique, i.e., clustering, followed by sentence extraction from each cluster based on various sentence-scoring functions. The highest-scoring sentence from each cluster is included in the summary until the desired summary length is reached. The steps followed are:
1. Sentences are grouped using traditional clustering techniques, namely K-means (Lloyd, 1982), K-medoid (Kaufman et al., 1987), and DB-scan (Ester et al., 1996).
2. We determine the document center/representative sentence (RS) of the reference paper: the sentence in the article that is most similar to the remaining sentences, i.e., the sentence having the minimum average WMD with respect to all other sentences.
3. Clusters are ranked based on their distances from the representative sentence (RS). In other words, the cluster closest to the RS is assigned the highest rank.
4. After ranking the clusters, we have calculated the scores of the sentences within each cluster based on the following features and then selected the highest scored sentence from each cluster considering their rankings. Note that the selection of sentences from the ranked clusters (in a sequence) will continue until we get the desired length of the summary.
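Steps 2-4 above can be sketched as follows. This is a minimal illustration, not the authors' code: the clusters are assumed to be produced beforehand (e.g., by scikit-learn's KMeans), `dist` stands for any sentence-distance function (WMD in our system), and the function names and word budget are illustrative.

```python
def representative_sentence(sentences, dist):
    """Step 2: the sentence with minimum average distance to all others."""
    return min(sentences, key=lambda s: sum(dist(s, t) for t in sentences
                                            if t is not s) / (len(sentences) - 1))

def rank_clusters(clusters, rs, dist):
    """Step 3: clusters closest to the representative sentence rank first."""
    return sorted(clusters, key=lambda c: min(dist(s, rs) for s in c))

def build_summary(clusters, rs, dist, score, budget=250):
    """Step 4: cycle through the ranked clusters, taking each cluster's
    highest-scoring remaining sentence until the word budget is reached."""
    ranked = [sorted(c, key=score, reverse=True)
              for c in rank_clusters(clusters, rs, dist)]
    summary, words = [], 0
    while any(ranked) and words < budget:
        for c in ranked:
            if c and words < budget:
                s = c.pop(0)
                summary.append(s)
                words += len(s.split())
    return summary
```

Any distance and scoring function can be plugged in; the round-robin over ranked clusters is what keeps the summary covering all parts of the paper.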
Position of the Sentence (F1): In the literature, it has been shown that important sentences tend to appear in the title and lead sentences of a paragraph. The feature is expressed as m_i = 1/n_i, where n_i is the position of the sentence in the reference paper, so sentences at the start of the document receive the highest priority (Saini et al., 2019).
Similarity with the title (F2): In any document, a sentence that is highly similar to the title can be an important sentence for the summary (Saini et al., 2019), as the title represents the theme of the article. Here, word mover's distance is used to compute the similarity.
Length of the Sentence (F3): Previous work has shown that longer sentences can be more relevant when generating a summary of a news document (Mendoza et al., 2014; Saini et al., 2019). The longest sentences are therefore assigned the highest priority.
Textual Entailment (F4): Textual entailment (Saini et al., 2020) has been used as an anti-redundancy measure: in a good summary, sentences should not entail each other, so that coverage is maximized. Initially, the cluster centers are included in the summary following the ranked order of the clusters. Next, sentences that do not entail any sentence already in the summary are selected from the ranked clusters and added.
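The four features can be written as simple scoring functions. The sketch below is illustrative only: F2 uses token overlap in place of WMD, and the entailment predicate in F4 is left abstract (in our system it comes from a textual-entailment model); all function names are our own.

```python
def f1_position(index):
    """F1: lead sentences score higher, m_i = 1 / n_i (1-based position)."""
    return 1.0 / index

def f2_title_similarity(sentence, title):
    """F2: similarity with the title. The paper uses word mover's distance;
    a token-overlap (Jaccard) ratio stands in here for illustration."""
    s, t = set(sentence.lower().split()), set(title.lower().split())
    return len(s & t) / len(s | t)

def f3_length(sentence):
    """F3: longer sentences score higher (word count)."""
    return len(sentence.split())

def f4_non_redundant(sentence, summary, entails):
    """F4: admit a sentence only if it entails no sentence already in the
    summary; `entails(a, b)` is a textual-entailment predicate."""
    return all(not entails(sentence, s) for s in summary)
```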

Submitted Run
Details of the submitted systems are provided in Table 1. Here, we have used five clusters for K-means and K-medoid, whereas DB-scan decides the number of clusters automatically. In Table 1, each run describes the features used for the selection of sentences within the clusters to form the summary. In total, twelve different runs have been used for task-2.

Result
Results of task-1 (a), task-1 (b) and task-2 are shown in Table 2, Table 3 and Table 4 respectively.

CL-LaySumm 2020
The motivation of the CL-LaySumm shared task is to automatically produce lay summaries of technical texts (scientific research articles). A lay summary is defined as a textual summary that is easily understood by a non-technical audience; it is typically produced either by the authors or by a journalist or commentator. The corpus released in the shared task covers three distinct domains: epilepsy, archeology, and materials engineering. A lay summary should not contain any technical jargon; it should reflect the overall scope, goal, and potential impact of a scientific paper, and it is typically less than 150 words in length. The objective is to generate summaries that are representative of the content and are understandable and interesting to a lay audience.

Dataset
The dataset has a training set of 572 articles having corresponding lay summaries. It contains a blind test set of 37 papers.

System Description
Neural network based approaches formulate abstractive summarization as a sequence-to-sequence problem: an encoder reads the tokens of the source document x = [x_1, x_2, ..., x_n] and a decoder generates the summary token by token. We have used a standard encoder-decoder architecture for our lay summarization task. Here the encoder is a pre-trained BERTSUM, fine-tuned on the CNN/DailyMail dataset, and the decoder is a six-layer transformer network (as shown in Fig 1). Note that there is a mismatch between the encoder and the decoder: the encoder is pre-trained while the decoder has to be trained from scratch. This can make fine-tuning unstable, with the encoder over-fitting while the decoder under-fits, or vice versa. To resolve this problem, different optimizers for the encoder and decoder have been used.

        Precision          Recall             F1 score
        micro    macro     micro    macro     micro    macro
        0.0222   0.0221    0.1049   0.1058    0.0367   0.0365

Table 4: Task-2 scores of different runs in terms of ROUGE. Here, R-2 and R-SU4 denote ROUGE-2 and ROUGE-SU4, respectively. All reported values are F1 scores.

lr_E = lr̃_E · min(step^(-0.5), step · warmup_E^(-1.5))    (1)
lr_D = lr̃_D · min(step^(-0.5), step · warmup_D^(-1.5))    (2)

where lr̃_E = 2e-3 and warmup_E = 20000 for the encoder, whereas lr̃_D = 0.1 and warmup_D = 10000 for the decoder. The assumption is that the pre-trained encoder must be trained with a lower learning rate, and a lower learning rate smooths the decay; this helps the encoder train with better gradients while the decoder reaches a stable condition.
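The warmup-then-decay learning-rate schedule above can be written as a small helper; the base learning rates and warmup lengths are those reported for the encoder and decoder, while the function name is ours.

```python
def fine_tune_lr(step, base_lr, warmup):
    """lr = base_lr * min(step^-0.5, step * warmup^-1.5):
    linear warmup for `warmup` steps, then inverse-square-root decay."""
    return base_lr * min(step ** -0.5, step * warmup ** -1.5)

# Separate schedules for the pre-trained encoder and the freshly
# initialized decoder (values from the paper).
encoder_lr = lambda step: fine_tune_lr(step, 2e-3, warmup=20000)
decoder_lr = lambda step: fine_tune_lr(step, 0.1, warmup=10000)
```

The two branches of the `min` meet exactly at `step == warmup`, which is where the learning rate peaks.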
We have used a two-stage fine-tuning approach: first fine-tuning for extractive summarization, and then for abstractive summarization. It has been shown in the literature (Li et al., 2018; Gehrmann et al., 2018) that an extractive objective helps in obtaining a better abstractive summary.

Result
Our system obtained the scores shown in Table 5.

LongSumm 2020
In all previous work on scientific summarization (Cohan and Goharian, 2017, 2018), there is a summary length constraint of a maximum of 250 words. In the current LongSumm shared task, however, the generated summary can be between 100 and 1500 words long.

Dataset
This dataset consists of a training set of 1705 papers associated with extractive summaries and 531 papers associated with abstractive summaries. It has a blind test set of 22 files. It can be found at https://github.com/guyfe/LongSumm.

System Description
We have used both extractive and abstractive approaches on the blind test set to generate a structured summary. We have used clustering followed by sentence scoring and extraction within each cluster based on various features; the highest-scoring sentence is included in the summary. Note that for LongSumm extractive summarization, we have used the same methodology as for CL-SciSumm 2020. A deep learning-based encoder-decoder model is used for LongSumm abstractive summarization.

Figure 1: Architecture for the Lay Summarization task.

R-1 (f)  R-1 (r)  R-2 (f)  R-2 (r)  R-l (f)  R-l (r)
0.3132   0.3705   0.0631   0.0746   0.1662   0.1973

Table 5: Obtained scores on the blind set of CL-LaySumm. Here, R-1, R-2, and R-l denote ROUGE-1, ROUGE-2, and ROUGE-l, respectively; f and r represent F1 score and recall, respectively.

Submitted Run:
Details of the submitted systems are provided in Table 6. Here, we have used five clusters for K-means and K-medoid, whereas DB-scan decides the number of clusters automatically. In Table 6, each run describes the feature used for the selection of sentences within the clusters to form the summary.

Feature   K-means   K-medoid   DB-scan
F1        run1      run2       run3
F2        run4      run5       run6
F3        run7      run8       run9
F4        run10     run11      run12

We have also used a deep learning-based technique as run13: an encoder-decoder based deep learning model, with a fine-tuned BERT model used for the generation of embeddings. We have used the same model as for LaySumm.

Result
The results of all runs are shown in Table 7; runs 1 to 12 are the scores of long summarization using the extractive approach, whereas run 13 is the score of long summarization using the abstractive approach.

Conclusion
In this paper, we have presented our participation in the CL-SciSumm 2020, CL-LaySumm 2020, and LongSumm 2020 shared tasks. For CL-SciSumm task 1 (A), word mover's distance has been used to extract the cited text span from the reference paper; for task 1 (B), a similarity-based measure has been used to identify the facet of each cited text span. Task 2 is based on clustering, followed by sentence extraction from each cluster based on relevance/score. For LongSumm, we have utilized clustering and deep learning techniques and reported 13 different ways to generate a long summary. For LaySumm, we have proposed a deep learning-based encoder-decoder model that generates the lay summary utilizing the fine-tuned BERT language model's embeddings.