CIST@CL-SciSumm 2020, LongSumm 2020: Automatic Scientific Document Summarization

Our system participates in two shared tasks, CL-SciSumm 2020 and LongSumm 2020. In the CL-SciSumm shared task, building on our previous work, we apply more machine learning methods to position features and content features for facet classification in Task1B, and introduce GCN in Task2 to perform extractive summarization. In the LongSumm shared task, we integrate both extractive and abstractive summarization. Three methods were tested: T5 fine-tuning, DPPs sampling, and GRU-GCN/GAT.


Introduction
The increasing number of scientific documents published on the Internet allows researchers to find more and more documents of interest. However, quickly and efficiently obtaining the most important facts or ideas of a document remains a big challenge. Summarization of scientific documents can mitigate this issue by presenting researchers with a brief summary of the whole document. This year, we participate in two shared tasks of SDP 2020 (Chandrasekaran et al., forthcoming). The CL-SciSumm shared task is the first medium-scale shared task on scientific document summarization in the field of Computational Linguistics; it aims to generate a structured summary for an RP (Reference Paper) using 10 or more CPs (Citing Papers). The LongSumm shared task leverages blogs created by researchers in the NLP (Natural Language Processing) and Machine Learning communities, using these blog summaries as reference summaries for generating abstractive and extractive summaries of scientific papers.
In this paper, we introduce our methods, experiments, and results for the two shared tasks. For the CL-SciSumm shared task, based on our previous work (Li et al., 2019), we continue to leverage similarity calculation on multiple features to perform citation linkage in Task1A. In Task1B, we first extract position features and content features of the RT (Reference Text) and CT (Citation Text), then apply different machine learning methods to classify the facet. In Task2, we apply DPPs (Determinantal Point Processes) and GCN (Graph Convolutional Network) to perform extractive summarization. As for the LongSumm shared task, we retain the extractive methods from Task2 of the CL-SciSumm shared task as the basis of our summarization system. Furthermore, we introduce GAT (Graph Attention Network) and apply an abstractive summarization method based on fine-tuning.

Related work
Task1A of CL-SciSumm is a citation linkage task. The most intuitive method is to calculate and compare the similarity between the CTS (Citation Text Spans) and every text span in the RP (Reference Paper), and select the RT with the highest similarity as the result. There are many ways to calculate this similarity, including traditional IDF and Jaccard similarity as well as Levenshtein distance (Yujian and Bo, 2007). The basic characteristics of words often play an important role in similarity calculation. As data sets continue to grow, neural network language models such as Word2vec (Goldberg and Levy, 2014) and BERT (Devlin et al., 2018), which capture semantic similarity at the word level, can bring large improvements. However, these word embedding methods gradually smooth out the differences between keywords during calculation, so WMD (Kusner et al., 2015) was proposed to attend to the feature mapping between words. In addition to improvements in feature extraction, researchers have also proposed many new algorithms to process features, such as introducing CNNs (Kim, 2014; Dos Santos and Gatti, 2014) into the NLP field to make more complex judgments on feature vectors, or using MatchPyramid (Pang et al., 2016) to perform similarity comparison focusing on word-level similarity.
Task2 of CL-SciSumm and the LongSumm shared task are both summarization tasks. Recent research on automatic summarization has mainly followed two approaches: extractive summarization and abstractive summarization. In the field of extractive summarization, we studied the sampling process used in DPPs (Kulesza and Taskar, 2012), calculating the kernel matrix with WMD sentence similarity for further sampling (Li et al., 2018). Zhong et al. (2019) explored how to make systems generate higher-quality summaries: they selected three factors, namely network architecture, knowledge transfer, and learning mode, and analyzed their impact on the quality of summary generation through experiments.
GCN is a powerful neural network framework for processing graph-structured data. Defferrard et al. (2016) extended the traditional CNN to non-Euclidean space and introduced local spectral filtering to optimize the propagation process during the training of the standard graph neural network. Kipf and Welling (2017) further studied the application of GCN to semi-supervised classification. GAT (Veličković et al., 2017) allocates different weights to different node neighbors when aggregating information. A document can also be converted into a graph. Yasunaga et al. (2017) introduced GCN into multi-document summarization: clusters of documents were fed into an RNN to obtain intermediate representations, and a GCN then extracted features considering the connections among document clusters. Finally, each sentence was scored based on its cluster-aware representation, and sentences with high scores were chosen as the summary.
As for abstractive summarization, Rush et al. (2015) introduced an attention mechanism into the Seq2Seq model, which enables the model to focus on words at specific positions in the original text via a weight matrix when generating abstracts, thus avoiding the loss of information on long sentences. Since BERT (Devlin et al., 2018) achieved great success in the field of NLP, pre-training followed by fine-tuning has become a new paradigm, and researchers began to explore how to apply pre-trained models to natural language generation. At first, researchers tried replacing the encoder with a pre-trained BERT (Liu and Lapata, 2019); then more and more pre-training objectives for the Seq2Seq model were explored, such as masked generation (Song et al., 2019), denoising (Lewis et al., 2019), and text-to-text (Raffel et al., 2019a). Tasks specially designed for summarization have also been proposed, such as extracting gap-sentences (Zhang et al., 2019). We use the gap-sentence method of Zhang et al. (2019) to combine and transform all the data, then fine-tune the T5 model (Raffel et al., 2019b) to generate the summary.

Task1A
As shown in Figure 1, the citation linkage task, Task1A of CL-SciSumm, contains two steps: feature extraction and content linkage. In the feature extraction step, we perform similarity calculation based on different feature extraction methods for each RT and every CT (Citation Text) in the CTS (Citation Text Spans), using traditional features such as IDF similarity and Jaccard similarity. Additionally, sentence context information is used on top of these simple features to reflect the similarity information of a sentence more comprehensively. Besides, we also use the Lin and Jcn features of WordNet, word-cos, word vectors, and LDA-Jaccard (Li et al., 2019).
LDA-Jaccard performs better than LDA on sparse topics and pays more attention to the topics shared by the two sentences. In the content linkage step, we sum the scores of all CTs belonging to the same CTS, sort all RTs by the final scores, and take the first N results as the final answer for Task1A. We use four multi-feature fusion methods, Voting-1.2, Voting-2.1, Jaccard-Focused, and Jaccard-Cascade, based on our last year's work (Li et al., 2019), increasing the training set and adjusting the hyper-parameters.
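The content linkage step described above can be sketched as follows. This is a minimal illustration on hypothetical toy sentences, with plain Jaccard similarity standing in for the full feature set used by our system; `link_cts` sums the similarity of each candidate RT against every CT in one CTS and keeps the top-N reference sentences.

```python
from collections import defaultdict

def jaccard(a, b):
    """Jaccard similarity between two whitespace-tokenized sentences."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def link_cts(cts, rts, n=2):
    """Sum the similarity scores of all CTs in one CTS per RT,
    then return the ids of the top-n reference sentences."""
    totals = defaultdict(float)
    for ct in cts:
        for rt_id, rt in enumerate(rts):
            totals[rt_id] += jaccard(ct, rt)
    return sorted(totals, key=totals.get, reverse=True)[:n]

# hypothetical reference sentences and one citation text span
rts = ["we propose an attention model",
       "results are reported in table two",
       "the attention model aligns words"]
cts = ["their attention model performs alignment",
       "the proposed model uses attention"]
top = link_cts(cts, rts, n=2)  # ids of the two best-matching RTs
```

In the real system the per-pair score is a fusion of multiple features (IDF, WordNet, word vectors, LDA-Jaccard) rather than Jaccard alone.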

Task1B
Our system applies multiple machine learning methods to multiple features representing different aspects of the CT and RT. Since a scientific paper is well-structured and each section represents a different facet of the document, our first motivation is to leverage the position features of the CT and RT to classify which facet a citation belongs to. As shown in Figure 2, the position features are the relative positions of the CT and RT, the relative positions of the sections they belong to, and the section title text. Suppose the section id is sid, the total number of sections is tsid, the sentence id is ssid, and the total number of sentences is tssid. Then the section relative position (SecPos) of a CT or RT is sid/tsid, and the sentence relative position (SenPos) is ssid/tssid. Since the section title text (STT) of a CT or RT also indicates the role it plays in the whole paper, we leverage TF-IDF to select the top 189 words occurring at least 3 times in the training set as keywords, and convert the section title to a one-hot vector. We then train LR, XGBoost, and Adaboost on the position features.
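The position features above can be assembled as in the following sketch. The keyword list here is a hypothetical five-word stand-in for the 189 TF-IDF keywords used in our system:

```python
def position_features(sid, tsid, ssid, tssid, title_tokens, keywords):
    """SecPos = sid/tsid, SenPos = ssid/tssid, plus a one-hot
    vector over the section-title keyword list."""
    sec_pos = sid / tsid
    sen_pos = ssid / tssid
    one_hot = [1 if kw in title_tokens else 0 for kw in keywords]
    return [sec_pos, sen_pos] + one_hot

# illustrative keyword list (the real system uses 189 TF-IDF keywords)
keywords = ["introduction", "method", "experiment", "result", "conclusion"]
feat = position_features(sid=2, tsid=5, ssid=10, tssid=40,
                         title_tokens=["experiment", "settings"],
                         keywords=keywords)
# feat[:2] == [0.4, 0.25]; the one-hot part marks "experiment"
```

The resulting vector is what LR, XGBoost, and Adaboost are trained on.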
Next, we focus on text content, since the texts of the CT and RT indicate the content in detail. First, the texts of the CT and RT are preprocessed: text is extracted from the XML files, stop words are removed, and words are tokenized. The texts are then represented by word embeddings and mapped into a dense vector space by FastText. The architecture of FastText is shown in Figure 3, where x_1, x_2, ..., x_{N-1}, x_N represent the n-gram features, each of which is the average of its word embeddings. The hidden layer is the average of x_1, x_2, ..., x_{N-1}, x_N; the output layer is fully connected to the hidden layer, and the predicted label is finally obtained via hierarchical softmax. We choose FastText as our classifier on content features because it is lighter than other text classifiers and can avoid overfitting on our small training set.
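The forward pass of this architecture can be sketched in a few lines of NumPy. All shapes and names here are illustrative, and plain softmax is used in place of the hierarchical softmax:

```python
import numpy as np

rng = np.random.default_rng(0)

def fasttext_forward(ngram_ids, emb, W_out):
    """Average the n-gram embeddings to form the hidden layer, then
    apply a fully connected output layer with softmax (FastText itself
    uses hierarchical softmax; plain softmax is used for simplicity)."""
    hidden = emb[ngram_ids].mean(axis=0)   # average of n-gram vectors
    logits = hidden @ W_out                # fully connected output layer
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    return probs / probs.sum()

vocab, dim, n_labels = 1000, 128, 2        # sizes matching our setup
emb = rng.normal(size=(vocab, dim))        # n-gram embedding table
W_out = rng.normal(size=(dim, n_labels))
p = fasttext_forward([3, 17, 42], emb, W_out)  # probabilities over 2 labels
```

In training, `emb` and `W_out` would be learned; here they are random for illustration.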

Task2
Task2 is a summarization task, and we apply two extractive methods in this paper.

Extractive summarization based on DPPs:
This method assumes that each document is a set of sentences, and that extracting the summary means extracting the highest-quality subset from this set. To achieve this, we first represent the document as a matrix L encoding the relationships between sentences, and then apply the DPPs sampling algorithm to extract candidate sentences. The matrix L is constructed by the Quality-Diversity (QD) model and the Sent2Vec (SV) model.
In the Quality-Diversity model, matrix L can be calculated by

L_ij = q_i · Sim_ij · q_j

where q_i is the quality of each sentence, computed from the features we selected, such as Sentence Length (SL), Sentence Position (SP), and Sentence Coverage (SC). Sim_ij represents the similarity between sentences, which can be implemented as

Sim_ij = φ_i^T φ_j

where φ_i is the diversity vector of a single sentence. In the Sent2Vec model, we construct matrix L from the inner products of Sent2Vec sentence embeddings, L_ij = v_i^T v_j. With matrix L constructed, we can apply the DPPs sampling algorithm to select sentences, so that the extracted summaries have both high quality and low mutual similarity. For the details of DPPs, we refer the reader to the work of Kulesza and Taskar (2012).
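As a sketch of the QD kernel and of DPP-style selection, the toy code below builds L_ij = q_i φ_i^T φ_j q_j and then uses a greedy MAP approximation (maximizing det(L_Y) of the selected subset) as a simple stand-in for the sampling procedure of Kulesza and Taskar (2012); the quality scores and diversity vectors are hypothetical:

```python
import numpy as np

def qd_kernel(quality, phi):
    """Quality-Diversity DPP kernel: L_ij = q_i * (phi_i . phi_j) * q_j."""
    phi = phi / np.linalg.norm(phi, axis=1, keepdims=True)  # unit diversity vectors
    return np.outer(quality, quality) * (phi @ phi.T)

def greedy_dpp(L, k):
    """Greedy MAP approximation: repeatedly add the sentence that
    maximizes det(L_Y) of the selected subset Y."""
    selected = []
    for _ in range(k):
        best, best_det = None, -np.inf
        for i in range(len(L)):
            if i in selected:
                continue
            idx = selected + [i]
            d = np.linalg.det(L[np.ix_(idx, idx)])
            if d > best_det:
                best, best_det = i, d
        selected.append(best)
    return selected

q = np.array([1.0, 0.9, 0.8])                           # sentence qualities
phi = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]])   # 0 and 1 are near-duplicates
L = qd_kernel(q, phi)
picks = greedy_dpp(L, k=2)  # the best sentence plus the diverse one, not the duplicate
```

Because sentences 0 and 1 are nearly identical, the determinant penalizes selecting both, so the diverse sentence 2 is chosen despite its lower quality.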
Extractive summarization based on GNN: We propose an extractive summarization method based on GCN and GAT (Figure 5). As shown in Figure 4, we first build a sentence relation graph based on sentence similarity, calculated by cosine similarity. The similarity graph objectively reflects the associations between sentences, including keyword and sentence similarity information. The graph and the low-level sentence representations compressed by a GRU are fed into the GCN or GAT. Each node in the undirected graph is a sentence, connected to another sentence if their similarity is greater than 0.2, and the original node feature is the last hidden state of the GRU. Graph convolution can leverage both the feature information of the node itself and the structure information of the graph. In an L-layer convolutional network, H^(l) represents the hidden features of the l-th layer, parameterized by a weight matrix W^(l), and Ã is the symmetrically normalized form of the graph adjacency matrix A. After a non-linear function (ReLU), we obtain high-level representations as the final scoring features.
f(H^(l), A) = σ(Ã H^(l) W^(l))

Figure 5: Left: multi-head attention (with 3 heads) computed on node 1 and its neighborhood; h_1 is obtained by concatenating or averaging the aggregated features of each head. Right: the attention mechanism a(W h_i, W h_j), activated by LeakyReLU.

Figure 6: T5 is a transformer pre-trained on a large corpus; we fine-tune it for the abstractive summarization task.
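A minimal NumPy sketch of this graph-convolution layer, using a hypothetical 4-sentence graph and random GRU features, with Ã = D^{-1/2}(A + I)D^{-1/2}:

```python
import numpy as np

def gcn_layer(H, A, W):
    """One graph-convolution layer: f(H, A) = ReLU(Ã H W), where
    Ã = D^{-1/2} (A + I) D^{-1/2} is the symmetrically normalized
    adjacency matrix with self-loops."""
    A_hat = A + np.eye(len(A))
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt
    return np.maximum(0.0, A_norm @ H @ W)   # ReLU non-linearity

# toy sentence graph: edges where cosine similarity exceeds the 0.2 threshold
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 8))    # stand-in for the GRU hidden states
W = rng.normal(size=(8, 16))   # layer weight matrix
H1 = gcn_layer(H, A, W)        # high-level node representations
```

Stacking such layers and adding a scoring head yields the sentence scores used for extraction.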
In the training period, we select the sentences from the RP most similar to the community summary as the summary sentences. The selected sentences are labeled 1 and the remaining sentences are labeled 0, and the model is trained as a binary classifier. Finally, we greedily select the highest-scoring sentences from the sentence set.

LongSumm
For the LongSumm shared task, we use three methods based on the aforementioned summarization methods for Task2 of CL-SciSumm.

T5 Fine-tuning
Although we have divided the dataset into section-wise samples and obtained more than 30,000 section-summary pairs, this is still not sufficient to train an abstractive model from scratch. Therefore we use pre-training and fine-tuning to deal with this problem. As shown in Figure 6, T5 (Raffel et al., 2019b) is a transformer-like pre-trained model with great performance when transferred to a summarization task. It treats every NLP task as a text-to-text task and performs both unsupervised pre-training and supervised multi-task pre-training on a large corpus.

DPPs Sampling
This method is based on DPPs sampling, which is similar to the method in Task2 of the CL-SciSumm shared task. We utilize two models to construct matrix L, that is, the Quality-Diversity (QD) model and the Sent2Vec (SV) model. Then DPPs sampling can automatically select the candidate sentences with high quality.

GRU-GCN/GAT
This method contains two parts: an RNN model and a GCN/GAT model. When processing the original text data, we use a GRU to compress the sequences. A similarity graph is constructed for each sentence group as described in 3.1.3 and fed, together with the sentence representations as node features, into the GCN or GAT. The model is then trained as a binary classifier in a supervised manner, and the highest-scored sentences are selected according to the training results.

Task1A
In our previous work (Li et al., 2019), we extracted many kinds of features through various methods. In terms of semantic information, the features are word vectors, word-cos, and Lin and Jcn in WordNet. Some traditional features such as IDF and Jaccard similarity are also used. Considering that the topic vector gradually becomes sparse as the number of topics in the LDA model increases, this time we abandon the LDA and LDA-cos features and introduce the LDA-Jaccard similarity, which improves the discrimination performance of LDA when the topic vector is sparse and focuses on the similarity within the same topic. Based on the original fusion method, we obtain four new fusion methods by increasing the training set and adjusting the hyper-parameters: Voting-1.2, Voting-2.1, Jaccard-Focused, and Jaccard-Cascade.
The LDA topic size is set to 600, and the pre-trained word vector size is set to 300. In the high-dimensional LDA case, although the word distribution within each topic becomes very sparse, the performance is improved. The parameter settings of the four multi-feature fusion methods are listed in the corresponding table.
As shown in Table 1, Jaccard-Focused performs the best among the four methods. At the same time, there is a big gap between precision and recall: because we manually specify the top-N sentences as answers, the system returns more sentences in general, so the recall rate is higher than the precision rate.

Task1B
For XGBoost (POS-XGB), we set the learning rate to 0.3 and the max depth to 1; for Adaboost (POS-ADB), we use a decision tree with max depth 2 as the weak learner and a learning rate of 0.3; for LR (POS-LR), we set the learning rate to 0.3. We also implement a voting method (POS-Vote) based on these base classifiers. As for FastText (CON-CT-FastText and CON-RT-FastText) applied to content features, the CT and RT lengths are 40 and 50, respectively. The sizes of the word embedding, hidden layer, and output layer are 128, 256, and 2, respectively. We use Adam as the optimizer with a learning rate of 0.0001 and train for 50 iterations. Finally, we combine the classifiers on position features and content features via a voting method (CON-POS-Vote). Both voting methods follow the majority rule.
Since Task1B is a multi-label classification task and the training set is severely imbalanced, as shown in Table 2, we randomly sample an equal number of negative samples for each discourse facet and train five independent classifiers, respectively. When predicting on the test set, we select at most the top 2 facets with the highest probability. Table 3 shows the results of Task1B. We find that CON-POS-Vote has the best precision, while POS-XGB performs best on recall and F1 score. FastText based on content features outperforms most of the machine learning methods based on position features, and CTs contain more information indicating the facet than RTs.
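The balancing and prediction steps above can be sketched as follows. The facet names, probabilities, and sample structure are illustrative, not the actual CL-SciSumm data:

```python
import random

def balance(samples, facet):
    """Undersample negatives so the binary training set for one
    facet has equal numbers of positive and negative samples."""
    pos = [s for s in samples if facet in s["facets"]]
    neg = [s for s in samples if facet not in s["facets"]]
    random.seed(0)
    neg = random.sample(neg, min(len(neg), len(pos)))
    return pos + neg

def top_facets(probs, k=2, threshold=0.5):
    """Keep at most the top-k facets whose classifier probability
    passes the threshold, always predicting at least one facet."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    chosen = [f for f, p in ranked[:k] if p >= threshold]
    return chosen or [ranked[0][0]]

# toy imbalanced data: 1 positive vs 5 negatives for "Method"
toy = [{"facets": {"Method"}}] + [{"facets": set()} for _ in range(5)]
balanced = balance(toy, "Method")          # 1 positive + 1 sampled negative

probs = {"Aim": 0.1, "Method": 0.9, "Result": 0.7,
         "Implication": 0.3, "Hypothesis": 0.2}
facets = top_facets(probs)                 # at most the top 2 facets
```

Five such balanced binary classifiers, one per facet, produce the probabilities that `top_facets` consumes.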

Task2
In DPPs sampling, Sentence Length (SL), Sentence Position (SP), and Sentence Coverage (SC) are selected as features to calculate sentence quality, and the summary compression ratio is set to 20%. For the GCN method, we pick the top 50k words sorted by frequency from the vocabulary of the original text. We select the sentence subset with the largest ROUGE score as the target for extractive summarization: following a greedy algorithm, the sentence with the largest ROUGE gain is taken out one by one as a positive sample and added to the extractive summary set until the set can no longer increase the score. After cleaning the RP, we rank the sentences by the output score and generate the summaries. Table 5 shows the results on the test set.
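The greedy oracle construction can be sketched as below. The toy sentences are hypothetical, and a simple unigram-overlap F1 stands in for ROUGE:

```python
def overlap_f1(candidate_tokens, reference_tokens):
    """Unigram-overlap F1, a simple stand-in for ROUGE-1."""
    common = len(set(candidate_tokens) & set(reference_tokens))
    if common == 0:
        return 0.0
    p = common / len(set(candidate_tokens))
    r = common / len(set(reference_tokens))
    return 2 * p * r / (p + r)

def greedy_oracle(sentences, summary, max_sents=10):
    """Greedily add the sentence that most increases overlap with the
    gold summary; stop when no remaining sentence improves the score."""
    ref = summary.split()
    chosen, score = [], 0.0
    while len(chosen) < max_sents:
        gains = []
        for i, s in enumerate(sentences):
            if i in chosen:
                continue
            cand = " ".join(sentences[j] for j in chosen + [i]).split()
            gains.append((overlap_f1(cand, ref), i))
        best_score, best_i = max(gains)
        if best_score <= score:
            break
        chosen.append(best_i)
        score = best_score
    return sorted(chosen)

sents = ["graph networks score sentences", "we thank the reviewers",
         "sentences are scored by a graph network"]
labels = greedy_oracle(sents, "a graph network scores sentences")
```

The indices returned become the positive labels for the binary extractive classifier.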
From Table 5 we can see that GCN-based methods perform better than DPPs on various metrics over three different gold summaries. This indicates that an end-to-end supervised learning method can extract better features than handcrafted ones, even though the supervision signals are constructed indirectly (we construct extractive summarization training data from a human-written summarization dataset). Although DPPs performs well at improving the diversity of summaries, its ability to evaluate sentence quality comes from handcrafted features, which generalize worse.

Data preprocessing
The training data set is composed of abstractive parts and extractive parts. The abstractive summarization data come from published papers and blogs and contain around 700 articles, with an average of 31.7 sentences per summary and an average of 21.6 words per sentence. The extractive data come from Lev et al. (2019) and contain 1,705 paper-summary pairs; for each paper, a summary with 30 sentences and 990 words on average is provided. The LongSumm shared task is characterized by long input and output with a high compression ratio, so we choose a mix-and-divide method to deal with it: 1. To make full use of all data samples, we mix abstractive and extractive data.
2. Transform full-paper summarization into short-document summarization by dividing all article-summary pairs into section-summary pairs.
3. Relabel all samples for abstractive models and extractive models.
The first step is easy to understand. The second step is achieved as follows: with a PDF parser, we identify the sections in the paper; the highest Jaccard similarity among all pairs between section sentences and a summary sentence is used as the section-sentence Jaccard similarity; each summary sentence is then allocated to the section with the highest section-sentence Jaccard similarity. Other co-occurrence-based metrics like ROUGE (Gidiotis and Tsoumakas, 2020) or BLEU could also be applied, but we choose Jaccard for its simplicity (these metrics usually lead to the same allocation). We obtain 30,230 section-summary pairs in total. Finally, we build two datasets of different types: 1. For extractive models, sentences in a section that have the highest Jaccard similarity with summary sentences are labeled to be extracted.
2. For abstractive models, there is no need to process the abstractive samples. The extractive samples are processed following Zhang et al. (2019): for a long section, we use TextRank to extract some sentences as the summary and exclude these sentences from the section, a preprocessing trick that prevents the abstractive model from learning to copy the input; for a short section, we do not exclude the summary sentences from the section. We divide the dataset into train/dev/test splits to compare different models in this report. ROUGE evaluation is given on this test split, and we use all 30,230 samples for training when inferring on the blind test set.
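The section-allocation step of the dividing procedure can be sketched as follows, on hypothetical toy sentences:

```python
def jaccard(a, b):
    """Jaccard similarity between two whitespace-tokenized sentences."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def allocate(summary_sentences, sections):
    """Assign each summary sentence to the section containing the
    sentence with the highest Jaccard similarity to it."""
    pairs = []
    for sum_sent in summary_sentences:
        best_sec, best_sim = 0, -1.0
        for sec_id, sec_sentences in enumerate(sections):
            sim = max(jaccard(sum_sent, s) for s in sec_sentences)
            if sim > best_sim:
                best_sec, best_sim = sec_id, sim
        pairs.append((sum_sent, best_sec))
    return pairs

# toy paper with two sections and a two-sentence summary
sections = [["we introduce a new summarization model"],
            ["experiments show strong rouge scores"]]
summary = ["the model summarizes documents", "rouge scores improve"]
alloc = allocate(summary, sections)  # sentence 1 -> section 0, sentence 2 -> section 1
```

Grouping the allocated sentences per section yields the section-summary pairs used for training.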

Result
The results of the LongSumm shared task are illustrated in Table 6. For model T5, we use the small version, which has about 60 million parameters. All input sections are truncated to a maximum of 1024 words. The model is fine-tuned for 5 epochs on the section-wise dataset with a learning rate of 1e-4. The batch size is 32, which we achieve on a single GPU via gradient accumulation. We then attempt different ways to process the original data, expecting to find the proper input for the model.
1. Construct summary: as mentioned above, all data are transformed into abstractive data.
2. Original summary: as the name suggests, the original data are used as input. Because many sections do not have corresponding summaries, fewer samples can be utilized, but some of the corresponding summaries are relatively longer.
3. Original+Construct summary: this method merges the original section and the constructed section.
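The gradient-accumulation trick used in the fine-tuning setup above (effective batch size 32 on a single GPU) rests on a simple identity: averaging the gradients of micro-batches equals the full-batch gradient of an averaged loss. A toy demonstration with a hypothetical mean-squared-error model:

```python
import numpy as np

def grad(w, X, y):
    """Gradient of the mean squared error 0.5*mean((Xw - y)^2) w.r.t. w."""
    return X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))   # one full batch of 32 samples
y = rng.normal(size=32)
w = rng.normal(size=4)

# full batch of 32 in a single step
g_full = grad(w, X, y)

# the same batch as 8 micro-batches of 4: accumulate, then average
acc = np.zeros(4)
for i in range(0, 32, 4):
    acc += grad(w, X[i:i+4], y[i:i+4])
g_acc = acc / 8
```

Since the two gradients coincide, accumulating 8 micro-batches of 4 reproduces the batch-of-32 update without the memory cost.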
To generate summaries as long as possible within the summary-length limitation, we design two plans to process the generated summaries. Plan A simply merges the first sentence of the summaries generated from the different sections. Plan B extracts at most three sentences from each summary; for summaries with fewer sentences, we use all of them. The merged summary is truncated to 600 words if the word count exceeds the limit. As for DPPs, because the LongSumm task focuses on long summaries, we control the summary length via the document compression ratio, which we set to 20% and 30%. For the QD method, we select Sentence Length (SL), Sentence Position (SP), and Sentence Coverage (SC) as features and merge them to calculate sentence quality.
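The two merging plans can be sketched as below, on hypothetical section summaries represented as lists of sentences:

```python
def truncate(text, limit):
    """Cut a summary to at most `limit` words."""
    return " ".join(text.split()[:limit])

def plan_a(section_summaries, limit=600):
    """Plan A: take only the first sentence of each section summary."""
    merged = [s[0] for s in section_summaries if s]
    return truncate(" ".join(merged), limit)

def plan_b(section_summaries, limit=600):
    """Plan B: take at most three sentences from each section summary."""
    merged = [sent for s in section_summaries for sent in s[:3]]
    return truncate(" ".join(merged), limit)

# toy section summaries: four sentences for section 1, one for section 2
sec_sums = [["a b.", "c d.", "e f.", "g h."], ["i j."]]
a = plan_a(sec_sums)
b = plan_b(sec_sums)
```

Plan B trades some per-section coverage for length, which is why both are truncated against the same 600-word limit.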
As for GRU-GCN/GAT, we divide each paper into sections, since sections are the natural division of a paper, and match each section to its gold summary. For every section, its relation graph is constructed and system summaries are extracted by sentence scores. After obtaining the section summaries, the paper summary is formed by concatenating and ranking the sentences from the sections. In our work, GAT has more parameters and is thus more difficult to converge; moreover, its advantage in learning graph structure is weakened since the section graphs are rather small, which partly explains why the attention mechanism does not outperform GCN.
The results on the test set show that the extractive summarization model using GCN performs best on the long summary task, with T5 and DPPs performing slightly worse. Generally speaking, the ROUGE value of abstractive summaries is lower than that of extractive summaries, but as an abstractive model, T5 can compress more semantic information and generate summaries closer to human-written ones. As for DPPs, being an unsupervised model, it uses hand-constructed features to rank sentences, so the resulting sentence-quality estimates are not accurate. The GNN models use an RNN to model sentences and account for sentence diversity during neural network training, so DPPs' ability to measure sentence quality is weaker than that of the GNN. However, DPPs works well in situations where training data is scarce.

Conclusion and Future Work
In the CL-SciSumm shared task, Jaccard-Focused performs better than the other methods in Task1A. In future work, we will try to use knowledge graphs and GNNs for better expression of semantic and structural information. In Task1B, POS-XGB performs best, which shows that the position features contribute more than the content features. In the future, more information can be extracted and fused to obtain richer features, or combined with some hand-crafted rules to assist the classification.
In Task2, GCN shows great potential for the summarization task. We expect neural network language models to contribute more meaningful semantic representations of sentences than statistical features. In the LongSumm shared task, the T5 model and the GCN-based extractive summarization model perform well on the official data set, and DPPs still has great potential; we expect to provide more features or modify the sampling process to improve the performance of our models. What's more, in this paper we mainly focus on how to extract or generate section-wise summaries with high quality and diversity, but how to pick and combine these summaries is also interesting future work.