On the Use of Context for Predicting Citation Worthiness of Sentences in Scholarly Articles

In this paper, we study the importance of context in predicting the citation worthiness of sentences in scholarly articles. We formulate this problem as a sequence labeling task solved using a hierarchical BiLSTM model. We contribute a new benchmark dataset containing over two million sentences and their corresponding labels. We preserve the sentence order in this dataset and perform document-level train/test splits, which importantly allows incorporating contextual information in the modeling process. We evaluate the proposed approach on three benchmark datasets. Our results quantify the benefits of using context and contextual embeddings for citation worthiness. Lastly, through error analysis, we provide insights into cases where context plays an essential role in predicting citation worthiness.


Introduction and Background
Citation worthiness is an emerging research topic in natural language processing (NLP), where the goal is to determine whether a sentence in a scientific article requires a citation. This research has potential applications in citation recommendation systems (Strohman et al., 2007; Küçüktunç et al., 2014; He et al., 2010), and is also useful for scientific publishers to regularize the citation process. Providing appropriate citations is critical to scientific writing because it helps readers understand how the current work relates to existing research.
Citation worthiness was first introduced by Sugiyama et al. (2010), who formulated it as a sentence-level binary classification task solved using classical machine learning techniques such as Support Vector Machines (SVMs). Subsequent works (Färber et al., 2018; Bonab et al., 2018) used a similar approach but employed deep learning models such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). More recently, Zeng and Acuna (2020) proposed a Bidirectional Long Short-Term Memory (BiLSTM) based architecture and demonstrated that context, specifically the two adjacent sentences, can help improve the prediction of citation worthiness.
Citation worthiness is closely related to citation recommendation (suggesting a reference for a sentence in a scientific article), which is often approached as a ranking problem solved using models that combine textual, contextual, and document-level features (Strohman et al., 2007; He et al., 2010). More recent works employ deep learning models (Huang et al., 2015; Ebesu and Fang, 2017) and personalization. Citation analysis (Athar and Teufel, 2012) and citation function (Teufel et al., 2006; Li et al., 2013; Hernandez-Alvarez et al., 2017), two other closely related areas, aim to predict the sentiment and the motivation of a citation, respectively. Researchers have used many supervised approaches, such as sequence labeling (Abu-Jbara et al., 2013), structure-based prediction (Athar, 2011), and multi-task learning (Yousif et al., 2019), to address these problems.
In this paper, we investigate two research questions about citation worthiness. First, we posit that citation worthiness is not purely a sentence-level classification task because the surrounding context can influence whether a sentence requires a citation. This context could include not only adjacent sentences but also information about section titles, paragraphs, and other included citations. Previous work (Bonab et al., 2018) has explored using the two adjacent sentences; we predict that citation worthiness models would improve with access to more contextual information. To pursue this hypothesis, we propose two new formulations: (a) sentence-pair classification and (b) sentence sequence modeling. For the latter formulation, we propose a new hierarchical architecture, where the first layer provides sentence-level representations and the second layer predicts the citation worthiness of the sentence sequence. We also introduce a new dataset, mainly because the prior datasets (Bonab et al., 2018; Zeng and Acuna, 2020) do not have sufficient contextual information to study this research question.
The second research objective is to understand whether contextual embedding models help citation worthiness. Recent developments in language modeling, specifically contextual embedding models (Liu et al., 2019; Devlin et al., 2019), have demonstrated significant improvements on various NLP research tasks. We expect to observe similar gains for citation worthiness. The following is a summary of the main contributions of this work:
• We propose two new formulations for citation worthiness: sentence-pair classification and sentence sequence modeling.
• We contribute a new dataset containing significantly more context, and we expect it to serve as another benchmark.
• Through rigorous experimental work, we demonstrate the benefits of sequential modeling and contextual embeddings for citation worthiness.
• We obtain new state-of-the-art results on three benchmark datasets.

Problem Statement
Let d = {s_1, s_2, ..., s_n} be a scientific article, where s_i is the i-th sentence. The problem of citation worthiness is to assign each sentence s_i one of two possible labels L = {l_c, l_n}, where l_c denotes that the sentence requires a citation and l_n means otherwise. We present three different formulations to investigate our main research objectives.

Sentence Classification
Our first formulation (Figure 1 (a)) approaches citation worthiness as a sentence-level classification task, similar to the prior works of (Bonab et al., 2018; Färber et al., 2018). Given a sentence s_i, we map it to a fixed-size dense vector x_i using a contextual embedding model (e.g., BERT (Devlin et al., 2019)). We then feed x_i to a feed-forward layer to obtain the citation worthiness label. We fine-tune this entire architecture by optimizing the weights of the final layer.
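As a minimal sketch (assuming PyTorch and a pretrained encoder that supplies a pooled [CLS] vector), the classification head described above amounts to a single linear layer over the sentence embedding x_i; the encoder itself is abstracted away here:

```python
import torch
import torch.nn as nn

class SentenceClassifier(nn.Module):
    """Sketch of the sentence-level model: the [CLS] vector of a
    contextual encoder followed by a single feed-forward layer.
    The encoder is left abstract; in the paper it is a pretrained
    model such as BERT or RoBERTa."""

    def __init__(self, hidden_size=768, num_labels=2):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, cls_embedding):
        # cls_embedding: (batch, hidden_size) pooled sentence vector x_i
        return self.classifier(cls_embedding)
```

The 768-dimensional input is an assumption matching BERT-base; any encoder that produces a fixed-size sentence vector would fit the same head.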

Sentence-Pair Classification
Our second approach (Figure 1 (b)) formulates citation worthiness as a sentence-pair classification task, where the pair consists of the given sentence and a sentence-like representation of its context. Namely, for a given sentence s_i, we define the context c_i as the concatenation of the previous sentence s_{i-1}, the sentence s_i itself, and the next sentence s_{i+1}:

c_i = s_{i-1} ⊕ s_i ⊕ s_{i+1}    (1)

where ⊕ denotes concatenation. We then concatenate s_i with c_i, separated by the [SEP] token, and pass the result through the embedding layer to obtain a vector representation x_i. This vector is then passed through a feed-forward layer to obtain the class label. This approach is similar to (Zeng and Acuna, 2020), where the authors used GloVe embeddings (Pennington et al., 2014) to obtain sentence representations and BiLSTMs for context representations. This formulation has also been used previously for question answering (Devlin et al., 2019) and passage re-ranking (Nogueira and Cho, 2019). In our sentence-pair classification approach, we defined c_i to include only the two adjacent sentences, but it could easily include more. However, if we included too many sentences, the context might be too long for most transformer-based models.
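A minimal sketch of how such a (sentence, context) pair could be assembled as input text, assuming boundary sentences simply drop the missing neighbour (the paper's exact preprocessing may differ):

```python
def build_pair(sentences, i, sep="[SEP]"):
    """Build the (sentence, context) input for sentence i, where the
    context concatenates the previous, current, and next sentence.
    At document boundaries the missing neighbour is omitted."""
    prev_s = sentences[i - 1] if i > 0 else ""
    next_s = sentences[i + 1] if i < len(sentences) - 1 else ""
    context = " ".join(t for t in (prev_s, sentences[i], next_s) if t)
    return f"{sentences[i]} {sep} {context}"
```

For example, `build_pair(["First.", "Second.", "Third."], 1)` yields `"Second. [SEP] First. Second. Third."`, which is then tokenized by the embedding model.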

Sentence Sequence Modeling
The third formulation addresses citation worthiness as a sequence labeling task solved using a hierarchical BiLSTM architecture (Figure 1 (c)).
We first map each sentence s_i and its context c_i (eq. 1) to a fixed-size dense vector x_i using the same approach as in Section 2.3. The given document d is thus represented as a sequence of vectors x = {x_1, x_2, ..., x_n}. We then feed these vectors to a BiLSTM model to capture the sequential relations between sentences. The hidden state h_i of the BiLSTM provides a vector representation for sentence s_i that incorporates information from the surrounding sentences. The sequence modeling approach thus captures long-term dependencies between sentences without requiring us to encode them explicitly as extra features. We then use a feed-forward layer to map the BiLSTM output to the citation worthiness labels. We also experimented with using only the sentence s_i to construct the vector x_i, but observed that using the context c_i improved model performance.
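The sequence-labeling layer described above can be sketched as follows in PyTorch; the sizes are illustrative (128 hidden units match the experimental settings, while the 768-dimensional inputs assume a BERT-style sentence encoder):

```python
import torch
import torch.nn as nn

class HierarchicalBiLSTMTagger(nn.Module):
    """Sketch of the second layer of the hierarchy: sentence vectors
    x_1..x_n (produced by the embedding layer) pass through a BiLSTM,
    and each hidden state h_i is mapped to a citation-worthiness label."""

    def __init__(self, input_size=768, hidden_size=128, num_labels=2):
        super().__init__()
        self.bilstm = nn.LSTM(input_size, hidden_size,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_size, num_labels)

    def forward(self, x):
        # x: (batch, n_sentences, input_size) sentence vectors
        h, _ = self.bilstm(x)       # (batch, n_sentences, 2 * hidden_size)
        return self.classifier(h)   # per-sentence label logits
```

Because the BiLSTM runs over whole sentence sequences, each prediction can draw on information from sentences well beyond the immediate neighbours.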

Datasets
Prior works in citation worthiness presented two benchmark datasets: SEPID-cite (Bonab et al., 2018) and PMOA-cite (Zeng and Acuna, 2020). SEPID-cite contains 1,228,053 sentences extracted from 10,921 articles but does not contain the source of the sentences (e.g., paper id) or the sentence order. PMOA-cite contains 1,008,042 sentences extracted from 6,754 papers from PubMed Open Access. PMOA-cite also contains the preceding sentence, the next sentence, and the section header. However, the authors of PMOA-cite split the data at the sentence level, which means sentences from the same research paper can appear in both the train and test sets. Since we cannot use either of these datasets directly for sequence modeling, we chose to process the ACL Anthology Reference Corpus (Bird et al., 2008) (ACL-ARC) while preserving the sentence order, and then split the data at the document level. The latest version of ACL-ARC, released in 2015, contains 22,878 articles. Each article contains the full text and metadata such as author names, section headers, and references. We first processed this corpus to exclude all articles without abstracts because they typically were conference cover sheets. Then, for each section in an article, we extracted paragraph information based on newlines. We then split the paragraphs into constituent sentences and processed the sentences to obtain citation labels based on regular expressions (Appendix A). Finally, we sanitized the sentences to remove all citation patterns.
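The labeling step can be illustrated as follows; the paper's actual regular expressions are listed in its Appendix A, so the pattern below is only a hypothetical stand-in covering common citation forms such as "(Bird et al., 2008)" or "[12]":

```python
import re

# Hypothetical citation pattern: parenthetical author-year citations,
# possibly with several entries separated by semicolons, or bracketed
# numeric citations. The paper's real patterns are in its Appendix A.
CITATION = re.compile(
    r"\((?:[A-Z][^()]*?,\s*\d{4}[a-z]?(?:;\s*)?)+\)"  # (Author, 2008; ...)
    r"|\[\d+(?:,\s*\d+)*\]"                            # [12] or [3, 4]
)

def label_and_sanitize(sentence):
    """Return the sentence with citation patterns removed, plus a
    boolean citation-worthiness label (True if a citation was found)."""
    has_citation = bool(CITATION.search(sentence))
    return CITATION.sub("", sentence).strip(), has_citation
```

Sentences matching the pattern receive the label l_c and are sanitized so the model cannot rely on the citation string itself.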

Experimental settings
We applied the sentence classification (SC) model to all three datasets, the sentence-pair classification (SPC) model to PMOA-cite and ACL-cite, and the sentence sequence modeling (SSM) approach to ACL-cite. This is because SEPID-cite does not have any context to apply SPC or SSM, and PMOA-cite does not have sufficient context for SSM. To obtain sentence representations, we also explored pooling word-level embeddings obtained using CNNs; however, we observed no significant difference in model performance compared to using the [CLS] token. We also experimented with the choice of contextual embeddings: BERT (Devlin et al., 2019), SciBERT (Beltagy et al., 2019), RoBERTa (Liu et al., 2019), and XLNet (Yang et al., 2019). The RoBERTa model consistently gave the best results; therefore, we only report those numbers.
We used a batched training approach for the SSM models: we split each article into sequences of m sentences with an overlap of m/2 sentences. For example, for a document with 32 sentences and m = 16, we create three training sequences: sentences 1 to 16, sentences 9 to 24, and sentences 17 to 32. During inference, for a given sentence, we include the preceding m/2 sentences and the succeeding m/2 − 1 sentences. We trained and evaluated models at m = 4, 8, and 16. We trained all models using the Adam optimizer with a batch size of 16, a learning rate of 1e-5, and a maximum of 4 epochs, optimizing a cross-entropy loss. The hidden layers in the BiLSTM models were set to 128 units. The models were trained on a GPU machine with 6 cores, and each training epoch took approximately 4 hours. More details on the experimental settings are available in the Appendix.

Table 2 summarizes the results in terms of precision, recall, and F1 score for l_c, and the overall weighted F1 score. The baseline numbers reported here are either from prior works (Färber et al., 2018; Bonab et al., 2018; Zeng and Acuna, 2020) or based on architectures very similar to those used in these prior works. On the SEPID-cite dataset, our SC model obtained significantly better performance than the state-of-the-art results from (Zeng and Acuna, 2020), with the F1 score increasing by more than 12%. On the PMOA-cite dataset, we obtain an F1 gain of 1.2% for sentence-level models and 1.7% for contextual models. We note that the numbers from (Zeng and Acuna, 2020) use additional hand-crafted contextual features, including the labels of surrounding sentences, whereas our models use only textual features.
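The overlapping-window scheme from the experimental settings can be sketched as follows (0-based, half-open sentence indices; `make_windows` is a hypothetical helper, not code from the paper):

```python
def make_windows(n_sentences, m):
    """Split a document of n_sentences into training windows of length m
    with an overlap of m/2 sentences, as described in the batched
    training setup. Returns (start, end) index pairs, half-open."""
    step = m // 2
    starts = range(0, max(n_sentences - m, 0) + 1, step)
    return [(s, min(s + m, n_sentences)) for s in starts]
```

For the 32-sentence example with m = 16 this produces windows (0, 16), (8, 24), and (16, 32), i.e., sentences 1-16, 9-24, and 17-32 in the paper's 1-based numbering.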

Results
The results on the ACL-cite dataset clearly show the importance of context in this domain. Using the two surrounding sentences boosted performance by nearly 6 percentage points, and performance continues to improve with added context, increasing by another 5.6 points for 16 sentences. Performance improves by another 0.7% with the inclusion of section headers in the context. Table 4 compares the performance of the SC and SSM models for different sections of the papers. The F1 score improves for all sections, but most prominently for the Abstract and Conclusion sections, because of significant improvements in precision.

Subjective Analysis
We observed some interesting trends during the error analysis of the SC and SSM models. We categorized these trends into three groups and selected an example from each to illustrate the impact of context (Table 1).
• Prior works: In the first excerpt, the last sentence could be interpreted as the authors' contribution if no context were available. The preceding sentences in the paragraph seem to help the model understand that this sentence refers to prior work and therefore requires a citation.
• Sections: In the second excerpt, the second sentence could be interpreted as an introduction or conclusion. Once again, the context provides information to infer the section correctly and, therefore, the correct label.
• Topic sentences: Context is essential to understand whether a sentence is the first statement about a topic, which is typically when researchers provide citations, or the continuation of a discussion. In the third excerpt, the model does not predict l_c for the last sentence because the authors already introduced the concept of LexRank in the previous sentences.

Conclusions
In this paper, we study the impact of context and contextual models on citation worthiness. We propose two new formulations for this problem: sentence-pair classification and sentence sequence modeling. We contribute a new benchmark dataset with document-level train/dev/test splits, which enables better incorporation of contextual information. We propose a hierarchical BiLSTM approach for sequence modeling, but a transformer-based approach, possibly extended with a CRF layer, could improve results further. Likewise, we also want to consider some of the newer language models (Zaheer et al., 2020; Beltagy et al., 2020) that handle longer inputs.
We expect citation worthiness to become an important component of writing assistants for scientific documents. Although we studied the citation worthiness of sentences in scholarly articles, we believe these findings are relevant to other domains such as news, Wikipedia, and legal documents.