IIITBH-IITP@CL-SciSumm20, CL-LaySumm20, LongSumm20

In this paper, we present the IIIT Bhagalpur and IIT Patna team's effort to solve three shared tasks, namely CL-SciSumm 2020, CL-LaySumm 2020 and LongSumm 2020, at SDP 2020. The goal of these tasks is to generate medium-scale, lay and long summaries, respectively, for scientific articles. For the first two tasks, unsupervised systems are developed, while for the third, we develop a supervised system. The performance of all systems is evaluated on the datasets associated with the shared tasks in terms of the well-known ROUGE metric.


Introduction
Due to the large amount of research in the computational linguistics (CL) domain as well as in other domains, the rate of publication of scientific articles has increased and will continue to grow (Nallapati et al., 2016, 2017; Jaidka et al., 2019). This makes it challenging for researchers to keep up with the latest advancements. A survey (review) article may help a researcher obtain a gist of recent advancements, but writing a survey paper is a laborious and time-consuming task. This challenge calls for summarization of scientific articles (Cohan and Goharian, 2018; Conroy and Davis, 2018): condensing each article into a few words, from which the survey article can then be prepared.
Moreover, published and survey articles may be difficult for non-practitioners to understand. To make them accessible to non-practitioners and to benefit all researchers, there is a need to outline the contributions of research articles in lay language.
The current paper describes the participation of the IIIT Bhagalpur and IIT Patna team in three shared tasks, namely CL-SciSumm 2020, LongSumm 2020 and CL-LaySumm 2020, at the First Workshop on Scholarly Document Processing 1 , 2020 (Chandrasekaran et al., 2020). The goal of these tasks is to generate medium-scale, long and lay summaries, respectively. Here, a lay summary means a textual summary intended for a non-technical audience. The scientific articles used for the first and third tasks belong to the computational linguistics domain, while for the second task the articles cover three distinct domains: archeology, epilepsy, and materials engineering. In the current paper, all these tasks are posed as extractive summarization (Saini et al., 2019) problems in which a subset of sentences is selected from a scientific article based on relevance. For CL-LaySumm and CL-SciSumm, we have developed systems based on maximal marginal relevance (MMR) (Carbinell and Goldstein, 2017), which balances the novelty and informativeness of a sentence with respect to what is already included in the summary. For LongSumm, our system uses a neural network based approach. Further details of these tasks, including the datasets and methodology used, are provided in the subsequent sections. The performance of the systems is evaluated in terms of ROUGE (1-gram, 2-gram, and L) metrics on the provided datasets.

CL-SciSumm 2020
CL-SciSumm 2020 is the sixth Computational Linguistics Scientific Document Summarization shared task, which aims to generate summaries of scientific articles not exceeding 250 words. Each instance in the associated dataset consists of a Reference Paper (RP), the paper to be summarized, and 10 or more Citing Papers (CPs) containing citations to the RP, which are used to summarize the RP. The task includes two sub-tasks: (a) Task 1(A), identifying the text spans in the reference article that best reflect the citation contexts (i.e., citances that cite the RP) of the citing articles; and (b) Task 1(B), categorizing the identified text spans into a predefined set of facets. Task 2 covers the generation of a structured summary of the scientific document using the identified text spans.

Dataset Description
The dataset associated with the CL-SciSumm 2020 shared task consists of 40 annotated scientific articles and their citations for training. In addition, a corpus of 1000 documents released as part of the ScisummNet dataset for scientific document summarization is readily available for training. For testing, a blind test set of 20 articles used for the CL-SciSumm 2018 (Jaidka et al., 2019) and 2019 shared tasks is reused for the current shared task.

Methodology
In this section, we discuss the systems developed for Task 1 and Task 2. The corresponding flowchart is shown in Figure 1.

Task 1(A)
For a given reference paper (RP), in order to identify the reference text spans using the citation context, we have used an unsupervised approach in which we extract the top 5 sentences by computing the cosine similarity between each citance and the sentences of the RP. These 5 sentences are treated as the cited/reference text spans. Note that before computing the similarity, we convert the text into a (numeric) vector space using different types of sentence embeddings, namely Albert (Beltagy et al., 2019a), ELMo (Peters et al., 2018), fastText (Athiwaratkun et al., 2018), SciBERT (Beltagy et al., 2019a), Universal Sentence Encoder (Cer et al., 2018) and XLNet (Yang et al., 2019), which are capable of capturing the semantics of the sentences. Thus, in total, six systems are developed for Task 1(A).
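The span-retrieval step above can be sketched as follows, a minimal example assuming the sentence embeddings (from any of the six encoders) are already computed; the toy 3-dimensional vectors below merely stand in for them:

```python
import numpy as np

def top_k_spans(citance_vec, rp_sentence_vecs, k=5):
    """Rank RP sentences by cosine similarity to a citance and return
    the indices of the top-k candidates (the cited text spans)."""
    C = np.asarray(citance_vec, dtype=float)
    S = np.asarray(rp_sentence_vecs, dtype=float)
    sims = S @ C / (np.linalg.norm(S, axis=1) * np.linalg.norm(C) + 1e-12)
    return list(np.argsort(-sims)[:k])

# toy 3-dimensional "embeddings" standing in for the real encoder outputs
citance = [1.0, 0.0, 0.0]
sentences = [[1.0, 0.1, 0.0],   # very similar to the citance
             [0.0, 1.0, 0.0],   # orthogonal
             [0.9, 0.0, 0.4],
             [0.0, 0.0, 1.0],
             [0.8, 0.2, 0.1],
             [0.5, 0.5, 0.5]]
print(top_k_spans(citance, sentences, k=5))
```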

Task 1(B)
For identifying the discourse facets (Hypothesis, Implication, Aim, Results and Method) of the cited text spans, we have used a voting based method. A supervised multi-class classification model based on Gradient Boosting (La Quatra et al., 2019; Li et al., 2008) is trained to assign a facet to each cited text span. Training data statistics are described in Table 1. In our approach, we extract the top 5 text spans for each citance in Task 1(A), use the trained model to predict a facet for each cited text span, and then apply voting to finalize the facet of each citance.

Task 2

For generating the structured summary of 250 words, we use the unique sentences extracted in Task 1(A) (i.e., the cited text spans) as the candidate set of sentences. This approach is known as citation-based summarization. For this purpose, a diversity-based unsupervised measure, maximal marginal relevance (MMR), inspired from (Carbinell and Goldstein, 2017), is utilized. It is a linear combination of the informativeness of a sentence (with respect to the document consisting of the chosen candidate sentences) and its novelty (with respect to the sentences already included in the summary). Mathematically, it is expressed as

MMR_1(Q) = λ_1 · Sim_1(Q, D) − (1 − λ_1) · max_{s ∈ d} Sim_2(Q, s)    (1)

where Q is the current sentence, D is the list of sentences extracted in Task 1(A), d is the summary generated so far, Sim_1 is the similarity of a sentence with respect to all other sentences in the document, and Sim_2 is the similarity of the current sentence with the sentences already included in the summary. Note that to represent sentences in vector form, we use CountVectorizer 2 , which counts the term frequency of each term in the article.
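The voting step for Task 1(B) can be sketched as follows; the per-span predictions are assumed to come from the trained gradient-boosting classifier, and the tie-breaking rule (first facet encountered wins) is our assumption:

```python
from collections import Counter

# The five discourse facets used in Task 1(B)
FACETS = {"Hypothesis", "Implication", "Aim", "Results", "Method"}

def vote_facet(span_predictions):
    """Majority vote over the facets predicted for the top-5 text spans
    of one citance; ties are broken by first occurrence."""
    assert all(f in FACETS for f in span_predictions)
    return Counter(span_predictions).most_common(1)[0][0]

print(vote_facet(["Method", "Method", "Results", "Method", "Aim"]))
```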
The authors of (Jaidka et al., 2017), a paper on summarizing scientific articles, noted that system performance under the ROUGE measure does not always agree with sentence-overlap F1 scores; they demonstrated that the ROUGE score is biased towards shorter sentences over longer ones. Motivated by this, we propose a variant of MMR that incorporates sentence length, expressed as

MMR_2(Q) = λ_2 · MMR_1(Q) + (1 − λ_2) · L    (2)

where L is the length of the current sentence. In total, twelve systems are developed using the citation-based approach in the 6 different semantic spaces (refer to Section 2.2.1), each space utilizing MMR_1 and MMR_2 for summary generation. To show the potential of citation-based summarization, we have also developed a full-text based summarization system in which all sentences of the scientific article form the candidate set and MMR_2 is used for summary generation. Thus, in total, 13 systems are submitted to the CL-SciSumm 2020 shared task.
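A minimal sketch of the greedy MMR selection loop, covering both plain MMR and the length-aware variant; the exact way the length term is combined (a λ_2-weighted linear mix) and the toy similarity matrix are assumptions for illustration:

```python
import numpy as np

def mmr_select(sim, lengths=None, lam1=0.75, lam2=0.5, budget=3):
    """Greedy MMR sentence selection. sim[i][j] is a precomputed
    sentence-sentence similarity matrix (e.g. from CountVectorizer
    term-frequency vectors). With lengths=None this is plain MMR;
    passing normalised sentence lengths adds the length term of the
    MMR_2 variant (the combination used here is an assumption)."""
    sim = np.asarray(sim, dtype=float)
    n = sim.shape[0]
    informativeness = sim.mean(axis=1)   # Sim1: similarity to the document
    selected = []
    while len(selected) < budget:
        best, best_score = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            # Sim2: similarity to sentences already in the summary
            novelty = max((sim[i][j] for j in selected), default=0.0)
            score = lam1 * informativeness[i] - (1 - lam1) * novelty
            if lengths is not None:      # length-aware variant
                score = lam2 * score + (1 - lam2) * lengths[i]
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected

sim = [[1.0, 0.9, 0.2],
       [0.9, 1.0, 0.5],
       [0.2, 0.5, 1.0]]
print(mmr_select(sim, lam1=0.7, budget=2))                       # plain MMR
print(mmr_select(sim, lengths=[0.1, 0.1, 0.9], lam1=0.7, budget=2))
```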

Discussion of Results
We have submitted a total of 13 system runs, of which 6 cover both Task 1(A) and Task 1(B), each using a different semantic space; the remaining 7 runs are for Task 2 only. The results obtained by our different runs for Task 1(A) and Task 1(B) are reported in Table 2 and Table 3, respectively. For Task 2, we generate a single summary for each reference paper using MMR_1 and MMR_2 with different embeddings. Of the 13 system runs, 12 are citation-based and the remaining one is full-text based; the results are reported in Table 4. For Tasks 1(A) and 1(B), the best results are obtained using the Universal Sentence Encoder for sentence embedding. For Task 2, our enhanced diversity-based sentence selection approach, MMR_2, performs better than the existing maximal marginal relevance model (MMR_1). Note that MMR_2 was tested with the different embedding spaces but produced identical results irrespective of the embedding used; therefore, in Table 4 we report a single MMR_2 entry as representative of those 6 runs. From Table 4, we can also infer that citation-based summarization achieves better sentence overlap than full-text based summarization (last row of Table 4).
The poor performance of our system on abstract summaries can be explained by the fact that our approach focuses on coverage and diversity: the abstract of a scientific article lies at the very beginning, and we do not consider sentence position in our proposed approach.

CL-LaySumm 2020
CL-LaySumm 2020, the first shared task 3 for lay summary generation, concerns the automatic generation of a lay summary of 70-100 words that is readable and easily understandable by the general public. In other words, given a full-text paper and its abstract, the task is to generate a lay summary of the specified length for that paper.

Description of Dataset
The dataset consists of 600 scientific articles, each provided with its abstract, full text and corresponding lay summary (gold summary) of around 70-100 words. The test data consists of 37 articles (out of the 600). Test data statistics in terms of the number of words are shown in Figure 2.

Methodology
In this section, we discuss the methodology used for lay summary generation. As in CL-SciSumm, we treat this problem as a sentence selection problem in which relevant sentences are selected from the document to generate the summary.
As in CL-SciSumm, we use both variants of maximal marginal relevance (MMR), given in Eq. 1 and Eq. 2, for generating the summary. Since the abstract (henceforth ABS) conveys the outline of the paper, we generate summaries with the different MMR variants using the ABS as the source text. Further runs, also compared against the original lay summary, use (a) the full text of the article and (b) the abstract (ABS) plus conclusion (CON) of the paper.
Note that the goal of lay summary generation is to create a human-readable summary for a non-technical audience. To avoid scientific jargon in the generated summary, we propose a three-step process (call it CWR: Complex Word Removal): first, identify the complex words in a given sentence; second, generate similar words for each identified complex word; third, replace the complex word with the most suitable word from the generated list. In this paper, we have only identified complex words and removed them; pseudo code for identifying complex words is given in Algorithm 1.

The results are reported in Table 5. From this table, it can be observed that incorporating sentence length into MMR_1, i.e., using Eq. (2), and generating the summary from the abstract of the article improve system performance in comparison to MMR_1. We also illustrate how the ROUGE-1 F score varies with λ_1 and λ_2 in Table 6; these parameters appear in Eqs. (1) and (2) and play important roles in generating an informative and novel lay summary. The best values of the parameters used in MMR_1 and MMR_2 are highlighted (in bold) in Table 6. Here, λ_1 represents the diversity factor: as λ_1 increases, the diversity of the generated summary decreases. The reader may note that we use a high value of λ_1, i.e., 0.75, so the summary may have less coverage; since the average number of words in the ABS is around 250 (Figure 2) and our task is to produce a lay summary of around 100 words, a higher λ_1 is appropriate. λ_2, in turn, is tuned to maximize the ROUGE score. As Table 5 shows the abstract to be a good source for summary generation, we execute Algorithm 1 on the ABS. The results attained by CWR with the different variants of MMR are shown in Table 7; they are slightly lower than the best results of Table 5.

Error analysis of CWR: A common issue in identifying complex words is finding the lemma. Lemmatization usually refers to reducing a word properly with the use of a vocabulary and morphological analysis, normally aiming to remove inflectional endings only and to return the base or dictionary form of the word, known as the lemma. Some common scientific terms that are not present in lexical databases like WordNet (Miller, 1995) can nevertheless be important in the context of the paper. For example, words like "hepatocellular" and "carcinoma" are essential for paper S016882782030009X but are not present in the WordNet vocabulary. Therefore, our results using CWR (Table 7) underperform those of MMR_2 (Table 5), demanding a more sophisticated model.

Table 7: Results attained by applying CWR on the generated summary using the abstract (ABS). Here, R in the second row stands for 'ROUGE'.
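The complex-word identification step of CWR (Algorithm 1) can be sketched as below; a small hand-made vocabulary set stands in for WordNet, and the crude lemmatisation (lower-casing plus stripping a plural "s") is an assumption for illustration:

```python
# A plain set of common words stands in for the WordNet vocabulary used in
# Algorithm 1; any word whose lemma is missing from it is flagged complex.
COMMON_VOCAB = {"the", "tumour", "is", "a", "cancer", "of", "liver", "cells"}

def lemma(word):
    """Very rough lemmatisation: lower-case, strip punctuation, and drop a
    plural 's' when the singular form is in the vocabulary."""
    w = word.lower().strip(".,;:")
    return w[:-1] if w.endswith("s") and w[:-1] in COMMON_VOCAB else w

def complex_words(sentence):
    """Return words whose lemma is missing from the vocabulary."""
    return [w for w in sentence.split() if lemma(w) not in COMMON_VOCAB]

print(complex_words("Hepatocellular carcinoma is a cancer of liver cells"))
```

As the error analysis above notes, domain terms like "hepatocellular" are flagged as complex even when they are essential, which is exactly why removal alone hurts the summary.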

LongSumm 2020
Most existing work on scientific document summarization focuses on generating summaries of shorter length (at most 250 words). Such a length constraint can be sufficient when summarizing news articles, but for scientific articles such a short summary requires domain expertise to understand. The LongSumm 2020 shared task addresses this issue by targeting longer summaries (up to 600 words) of scientific articles.

Dataset Description
The training corpus for this task includes 1705 extractive summaries and 531 abstractive summaries of NLP/ML scientific papers. The extractive summaries are based on video talks from associated conferences (Lev et al., 2019), while the abstractive summaries come from blog posts created by NLP and ML researchers. The test set consists of 22 research papers for both extractive and abstractive summarization, and the task is to generate a summary of up to 600 words. In the current paper, we focus only on the extractive track of LongSumm.

Methodology
To solve LongSumm in an extractive way, we utilize a neural network based approach, namely a convolutional neural network (CNN) (Kim, 2014). Sentences that are part of the gold summary are labeled 1 and the remaining sentences are labeled 0; in other words, we pose this task as a binary classification problem in which the goal is to decide whether a given sentence should be part of the summary. Positional embedding is used along with the sentence embedding. The detailed methodology of our CNN is described below:
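The sentence-labelling step can be sketched as follows; the unigram-overlap matching criterion and threshold are our assumptions, since the exact procedure for deriving binary labels from the gold summaries is not detailed here:

```python
def label_sentences(article_sents, summary_sents, threshold=0.5):
    """Assign 1 to article sentences that share at least `threshold`
    of their unigrams with some gold-summary sentence, else 0."""
    labels = []
    for sent in article_sents:
        words = set(sent.lower().split())
        overlap = max((len(words & set(s.lower().split())) / max(len(words), 1)
                       for s in summary_sents), default=0.0)
        labels.append(1 if overlap >= threshold else 0)
    return labels

article = ["We propose a new model.",
           "The weather was nice.",
           "Our model improves ROUGE."]
summary = ["We propose a new model.", "Our model improves ROUGE scores."]
print(label_sentences(article, summary))
```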

1. Convolution:
The authors of (Kim, 2014) showed that a CNN with one layer of convolution performs remarkably well on sentence classification tasks. Therefore, we use a one-dimensional CNN to extract features from sentences, described mathematically as

c_i = g(W_f · X_{i:i+m−1} + b)

where b is the bias term, g is a non-linear activation function, and W_f, m and X are the convolution filter, window size and concatenation vector, respectively.

2. MaxPooling: Pooling is a down-sampling operation. In max pooling, each pooling operation selects the maximum value of the current view, reducing the size while preserving the salient features:

h_j = max(c_{2j−1}, c_{2j})    (6)

where c is the feature map produced by the convolution layer and h is the pooled hidden representation of the sentence.

3. Positional Embedding:
In any document, regardless of the domain, more relevant sentences tend to occur in certain sections, such as the leading paragraphs (Saini et al., 2019). Scientific articles in particular are structured such that sentences at the start (the abstract) are more informative, which we encode as

score(p_i) = (N − i) / N

where p_i is the i-th (0 ≤ i < N) sentence of the article; the higher the score of a sentence, the more informative it is. Therefore, positional embedding is also utilized in our CNN framework.
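A minimal sketch of such a position score; the linear decay used here is an assumption standing in for the exact positional formula:

```python
def position_scores(n_sentences):
    """Linearly decaying position score: sentence 0 (the abstract region)
    gets the highest weight, the last sentence the lowest."""
    return [(n_sentences - i) / n_sentences for i in range(n_sentences)]

print(position_scores(4))  # [1.0, 0.75, 0.5, 0.25]
```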

4. Flattening:
After the max-pooling layer, we obtain the penultimate representation h (Eq. 6), the vector representation of the input sentence produced by the CNN. We also feed the sentence position encoding (h_p) as an additional feature:

h = [h* ; h_p]

where h* is the semantic representation obtained from the CNN and h_p is the position encoding. To avoid overfitting, we apply dropout regularization as in Eq. 10:

h̃ = h ∘ r    (10)

where r is a masking vector of Bernoulli random variables and ∘ denotes element-wise multiplication. Finally, we use the sigmoid function as per Eq. 11 to obtain the probability score:

p = σ(w_o · h̃ + b_o) = 1 / (1 + e^{−(w_o · h̃ + b_o)})    (11)

Note that we use the sigmoid probabilities to rank sentences, selecting the top-ranked sentences for inclusion in the summary until the length constraint is satisfied.
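The full forward pass can be sketched with toy dimensions as below; a single filter and tiny embeddings stand in for the actual 600 SciBERT-based filters, and the weights are random, so only the shape of the computation is illustrative:

```python
import numpy as np

def conv1d(x, w, b):
    """One 1-D convolution filter slid over a sequence x of shape
    (seq_len, emb_dim), window size m = w.shape[0], followed by ReLU."""
    m = w.shape[0]
    c = np.array([np.sum(x[i:i + m] * w) + b for i in range(len(x) - m + 1)])
    return np.maximum(c, 0.0)            # ReLU activation g

def predict(x, w, b, wo, bo, pos_score):
    """Forward pass: convolution -> max pooling -> concatenate the
    position feature -> sigmoid probability of summary membership."""
    c = conv1d(x, w, b)
    h = np.array([c.max(), pos_score])   # [h*, h_p]
    return float(1.0 / (1.0 + np.exp(-(wo @ h + bo))))

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 4))          # 6 tokens, embedding dim 4 (toy)
w = rng.standard_normal((3, 4))          # one filter, window size 3
p = predict(x, w, b=0.1, wo=np.array([0.5, 0.5]), bo=0.0, pos_score=1.0)
print(p)
```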

Experimental setup
For our experiments, we use SciBERT (Beltagy et al., 2019b) to obtain sentence embeddings, as it is trained on a large multi-domain corpus of scientific publications to improve performance on scientific NLP tasks such as summarization (Gabriel et al., 2019) and relation extraction (Sung et al., 2019). For the convolution layer, we use 600 filters with a kernel size of 3 and ReLU as the activation function. For pooling, we use a pool size of 2. We train the model for 10 epochs with the Adadelta optimizer.

Discussion of Results
We have submitted 4 systems for the LongSumm shared task, of which two are based on the CNN architecture. The key difference between the two neural models is the limit on the number of words for summary generation: the first system (CNN_1) enforces a strict 600-word limit, while the second system (CNN_2) maintains an average of 600 words when generating summaries. For the other two systems, we use MMR_1 and MMR_2 with the same hyper-parameters as in LaySumm (Section 3.3). The results obtained for the LongSumm 2020 task are reported in Table 8, from which it can be inferred that CNN_2 performs best in terms of ROUGE-2 and ROUGE-L F1-measure, while MMR_2 performs best in terms of ROUGE-1 F1-measure. Training vs. testing accuracy for CNN_2 is shown in Figure 4.

Conclusion and Future work
We have investigated the effect of using maximal marginal relevance (MMR) in developing systems for three shared tasks: CL-SciSumm, CL-LaySumm, and LongSumm 2020. A variant of MMR is also proposed by incorporating a length-based feature. For LongSumm, we have additionally investigated the effect of using a convolutional neural network. As the goal of LaySumm is to generate a lay summary understandable by a non-technical audience, we tried a complex word removal approach based on a lexical database (WordNet), which falls short because many scientific terms are absent from it. In the future, we would like to develop a more sophisticated approach for lay summary generation.