1A-Team / Martin-Luther-Universität Halle-Wittenberg@CLSciSumm 20

This document describes our group's approach to the CL-SciSumm shared task 2020. There are three tasks in CL-SciSumm 2020. In Task 1a, we apply a Siamese neural network to identify the text spans in the reference paper that best reflect a citation. In Task 1b, we use an SVM to classify the facet of a citation.


Introduction
Task 1 of the CL-SciSumm shared task 2020 contains two subtasks. The dataset for the tasks consists of multiple reference papers (RPs) and, for each RP, a set of citing papers (CPs) that all contain a citation of the RP. For each of these citations, the cited text spans and the corresponding facet have been manually annotated.
For task 1a, the goal was to predict the cited text span for a given citation and its reference paper.
In task 1b, the participants had to identify which facet a cited text span belongs to, from a predefined set of facets.
Our team's approach for task 1a utilizes a neural network to classify pairs of (citation, reference paper sentence) as either matching or not matching.
For task 1b, the syntax of the reference text, in the form of part-of-speech n-grams, is used to predict its facet.

Related Work
Citations play a more significant role in scientific development than one might expect. They help track the development of scientific problems and build a foundation for future research. Citations spread information and are a key attribute in determining the impact of a paper, or rather its value to science (Hernández-Alvarez and Gomez, 2016).
There are different methods of extracting useful citations. Some utilize supervised Markov Random Field classifiers (Qazvinian and Radev, 2010), others model the link information and the citation texts (Kataria et al., 2010) or use sequence labeling with segment classification (Abu-Jbara and Radev, 2012). The main goal of these approaches is to find the sentences or spans of a CP that explain some facets of the RP, since citations can be seen as short textual parts describing some facets of the cited work.
However, in this document we do not need to generate or extract citations from a cited work. The citances are already given, and we need a method to determine the sentence or span in an RP that corresponds to a given citance. For this purpose it may help to analyze the aim or rhetorical status of a citance, as in Hernández-Alvarez and Gomez (2016). Teufel et al. (2006) presented a classification framework based on lexically and linguistically inspired features for classifying citation functions.
Text summarization may also help to find the textual span corresponding to a given citance. Fortunately, summarization has grown into a well-researched field in recent decades. There are several approaches to consider, among them topic modeling (Gong and Liu, 2001), supervised models (Chali and Hasan, 2012), graph-based models (Mihalcea, 2004) and neural networks (Chopra et al., 2016). In topic modeling, a probabilistic framework is used to estimate the distribution of content in the final summary. Supervised models learn from a selection of sentences relevant to the final summary in order to afterwards pick suitable sentences for a new summary. Graph-based models focus on finding the most central sentence in a graph of a text, where sentences are nodes and similarities are edges; this sentence either serves as a summarizing sentence or as the basis on which to build a summary.

Baseline Task 1a
As a baseline, we trained an SVM for each citation and chose the one with the largest tf-idf score as the prediction. On the 2018 training set, this yielded an F1-score of 0.09 (micro) and 0.10 (macro).
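To make the difference between the two reported averages concrete, the following is a minimal sketch of micro- and macro-averaged F1. It assumes that micro-averaging pools the true-positive, false-positive and false-negative counts over all evaluation units, while macro-averaging takes the unweighted mean of the per-unit F1 scores. The function names and the toy counts are hypothetical and are not taken from our evaluation.

```python
def f1(tp, fp, fn):
    """F1 score from true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(counts):
    """counts: list of (tp, fp, fn) tuples, one per evaluation unit."""
    # Micro: pool all counts first, then compute a single F1.
    micro = f1(sum(c[0] for c in counts),
               sum(c[1] for c in counts),
               sum(c[2] for c in counts))
    # Macro: compute F1 per unit, then take the unweighted mean.
    macro = sum(f1(*c) for c in counts) / len(counts)
    return micro, macro


# Hypothetical counts for two evaluation units (illustrative only).
toy_counts = [(1, 3, 1), (0, 2, 2)]
micro, macro = micro_macro_f1(toy_counts)
```

With these toy counts the two averages differ, which illustrates why both values are usually reported side by side.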

Baseline Task 1b
The 2018 dataset consists of 176 citations in total: 104 are labelled with the method facet, 34 with the result facet, 22 with the aim facet, 9 with the implication facet and only 7 with the hypothesis facet. Because of this imbalance, we kept our baseline simple and tagged all citations with the majority label "method". The performance of this simple baseline can be seen in table 4.
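The majority-label baseline can be sketched in a few lines. The snippet below uses only the facet counts stated above and assumes exactly one facet per citation; the computed fraction is plain arithmetic on these counts and is not meant to reproduce the figures in table 4.

```python
from collections import Counter

# Facet counts of the 2018 data as stated in the text.
facet_counts = Counter({"method": 104, "result": 34, "aim": 22,
                        "implication": 9, "hypothesis": 7})

total = sum(facet_counts.values())

# The baseline predicts the most frequent facet for every citation.
majority_label, majority_count = facet_counts.most_common(1)[0]

# Fraction of citations that a constant prediction would label correctly.
majority_accuracy = majority_count / total
```

Under these assumptions, a constant "method" prediction is correct for 104 of the 176 citations, which is why any learned classifier has to be compared against this baseline.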

Task 1a
Our first preprocessing step is computing the Cartesian product of all citations and all sentences of a reference paper, given the annotated citations. Pairs consisting of a citation and its matching reference sentence were labelled as class "1" and all other pairs as class "0". The resulting data matrix, as shown in table 2, contains the citation-sentence pairs and the class labels. By defining a threshold value of 0.9 on the network output, we were able to use our neural network (NN) as a binary classifier. Figure 1 shows the performance of our system for different thresholds; on our training dataset, 0.9 appeared to be the most suitable value.
Our second preprocessing step was mapping each word of the training data that is contained in the word2vec vocabulary (Mikolov et al., 2013b,a) to a unique number. Based on this mapping, a |word2vec vector size| × |vocabulary size| embedding matrix E was constructed as the base layer of the NN. As word2vec embedding we used the set pre-trained on the Google News archive. Reference sentences and citations are represented as one-hot vectors over the vocabulary.
Because of the way the training data is constructed, class "1" was heavily underrepresented. To enable the NN to handle this, we undersampled the much larger class "0". This improved our results by a factor of 30, as shown in table 3.
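The pair construction and the undersampling of class "0" could look roughly like the following sketch. The data layout (dictionaries keyed by ids), the function names and the `ratio` parameter are assumptions made for illustration; in particular, the ratio of negatives to positives that we actually used is not implied by this example.

```python
import random
from itertools import product


def build_pairs(citations, sentences, gold):
    """Label every (citation, sentence) pair with 1 if annotated, else 0.

    gold maps a citation id to the set of matching sentence ids.
    """
    pairs = []
    for (cid, ctext), (sid, stext) in product(citations.items(),
                                              sentences.items()):
        label = 1 if sid in gold.get(cid, set()) else 0
        pairs.append((ctext, stext, label))
    return pairs


def undersample(pairs, ratio=1.0, seed=0):
    """Keep all positives and a random subset of the negatives.

    ratio is the number of negatives kept per positive (an assumed knob).
    """
    positives = [p for p in pairs if p[2] == 1]
    negatives = [p for p in pairs if p[2] == 0]
    rng = random.Random(seed)
    keep = min(len(negatives), int(ratio * len(positives)))
    return positives + rng.sample(negatives, keep)


# Toy data, purely illustrative.
citations = {"c1": "They use a Siamese LSTM.",
             "c2": "An SVM classifies facets."}
sentences = {"s1": "We propose a Siamese LSTM.",
             "s2": "Data is described here.",
             "s3": "Facets are classified with an SVM."}
gold = {"c1": {"s1"}, "c2": {"s3"}}

all_pairs = build_pairs(citations, sentences, gold)
balanced = undersample(all_pairs)
```

In the toy data, two citations and three sentences give six pairs, of which only two are positive; after undersampling with a ratio of 1, both classes have the same size.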
Our system for task 1a is based on a neural network (NN) that utilizes two identical long short-term memory (LSTM) networks, commonly referred to as a "Siamese" neural network. The outputs of the two networks are compared using the exponentiated negative Manhattan distance function (1), as proposed by Mueller and Thyagarajan (2016):
$$ g(h^{(a)}, h^{(b)}) = \exp\left(-\lVert h^{(a)} - h^{(b)} \rVert_1\right) \qquad (1) $$
where $h^{(a)}$ and $h^{(b)}$ denote the output vectors of the two LSTMs.
The complete NN architecture is shown in figure 2. Table 1 shows the evaluation results of our system on the 2017 training data. For the experiment we trained the NN on the 2016, 2018 and 2019 training data for 50 epochs and used a threshold value of 0.9. As optimizer we used the "adam" function of the Keras TensorFlow library (Chollet et al., 2015; Kingma and Ba, 2014).
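As a minimal sketch, the similarity function and the thresholding step can be written in plain Python as follows. The vectors stand in for the outputs of the two LSTMs and are hypothetical; whether the comparison with the threshold is inclusive is also an assumption of this example.

```python
import math


def manhattan_similarity(h_a, h_b):
    """exp(-||h_a - h_b||_1), a similarity in (0, 1]."""
    distance = sum(abs(a - b) for a, b in zip(h_a, h_b))
    return math.exp(-distance)


def is_match(h_a, h_b, threshold=0.9):
    """Binary decision obtained by thresholding the similarity."""
    return manhattan_similarity(h_a, h_b) >= threshold


# Hypothetical LSTM output vectors.
same = manhattan_similarity([0.2, -0.5], [0.2, -0.5])
close = is_match([0.2, -0.5], [0.25, -0.5])
far = is_match([0.2, -0.5], [0.9, 0.4])
```

Identical vectors yield a similarity of exactly 1, and the value decays exponentially with the Manhattan distance, so a high threshold such as 0.9 only accepts pairs whose representations are very close.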

Task 1b
Our approach is based on a support vector machine (SVM) which uses part-of-speech (POS) n-grams as features. During the experiments, we tried different POS n-gram features in SVMs with linear and polynomial kernels and compared their performance. We did not include the results of the SVMs with polynomial kernels, because they performed poorly. In machine learning, kernel methods are a class of algorithms that use a kernel function to perform their calculations implicitly in a higher-dimensional space. On the one hand, we used the function linear_kernel, which computes the linear kernel. On the other hand, we used the function polynomial_kernel, which computes the degree-d polynomial kernel between two vectors. The polynomial kernel represents the similarity between two vectors.
Basically, the polynomial kernel considers both the similarity between vectors in the same dimension and the similarities across dimensions. When used in machine learning algorithms, this makes it possible to account for interactions between different features.
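To illustrate the two kernels, the following plain-Python sketch assumes the common convention k(x, y) = (γ xᵀy + c0)^d for the polynomial kernel. The functions are simplified stand-ins written for this example, not the library functions named above, and the default parameter values are assumptions.

```python
def polynomial_kernel(x, y, degree=3, gamma=1.0, coef0=1.0):
    """(gamma * <x, y> + coef0) ** degree for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return (gamma * dot + coef0) ** degree


def linear_kernel(x, y):
    """Special case of the polynomial kernel: degree 1, gamma 1, coef0 0."""
    return polynomial_kernel(x, y, degree=1, gamma=1.0, coef0=0.0)


x, y = [1.0, 2.0], [3.0, 0.5]
k_lin = linear_kernel(x, y)                 # 1*3 + 2*0.5 = 4.0
k_poly = polynomial_kernel(x, y, degree=2)  # (4.0 + 1.0) ** 2 = 25.0
```

Expanding the degree-2 case by hand shows products of different coordinates, which is where the interaction between features enters.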
The polynomial kernel with input vectors x, y, kernel degree d and scaling factor γ is defined as:
$$ k(x, y) = (\gamma x^\top y + c_0)^d $$
If $c_0 = 0$, the kernel is called homogeneous. The linear kernel is a special case of the polynomial kernel where d = 1 and $c_0 = 0$ (and γ = 1). If x, y are column vectors, their linear kernel is described as:
$$ k(x, y) = x^\top y $$
We tried different degrees for the polynomial kernel, but did not include these results, because they were poor; the same holds for the results obtained with unbalanced training data. We used the Python libraries nltk (Bird et al., 2009) and spaCy (Honnibal and Montani, 2017) for POS tagging and n-gram construction. As shown in table 4, the biggest improvement was gained when increasing n from POS 4-grams to POS 5-grams. Increasing n further seems to deteriorate the results. As figure 3 shows, the best performance was reached using POS 5-grams in combination with a linear kernel SVM.
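The construction of POS n-gram features can be sketched as follows. To keep the example self-contained, the tag sequence is written by hand instead of being produced by a tagger; the function names and the bag-of-n-grams count representation are assumptions made for illustration.

```python
from collections import Counter


def pos_ngrams(tags, n):
    """Return the list of POS n-grams (as tuples) of a tag sequence."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


def ngram_features(tags, n):
    """Bag-of-n-grams count features for one sentence."""
    return Counter(pos_ngrams(tags, n))


# Hand-written tags for "We propose a new parsing model ." (illustrative).
tags = ["PRON", "VERB", "DET", "ADJ", "NOUN", "NOUN", "PUNCT"]
features = ngram_features(tags, 5)
```

A sentence with seven tags yields only three distinct 5-grams, and sentences shorter than n yield none. Larger n therefore produces sparser feature vectors, which is one plausible reason why increasing n beyond 5 did not help further.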

Conclusion
We were able to improve upon the earlier PolyU approach (Cao et al., 2016) for task 1a. In future work, better results may be obtained with more training data, as is often the case with neural networks. Moreover, the hyperparameters of the neural network for task 1a could be tuned.