Automatic Generation of Citation Texts in Scholarly Papers: A Pilot Study

In this paper, we study the challenging problem of automatically generating citation texts in scholarly papers. Given the context of a citing paper A and a cited paper B, the task aims to generate a short text that describes B in the given context of A. One big challenge in addressing this task is the lack of training data. Explicit citation texts are easy to extract from scholarly papers, but implicit citation texts are not. We thus first train an implicit citation extraction model based on BERT and leverage it to construct a large training dataset for the citation text generation task. We then propose and train a multi-source pointer-generator network with a cross attention mechanism for citation text generation. Empirical evaluation results on a manually labeled test dataset verify the efficacy of our model. This pilot study confirms the feasibility of automatically generating citation texts in scholarly papers, and the technique has great potential to help researchers prepare their scientific papers.


Introduction
A scientific paper usually needs to cite many reference papers and introduce each of them with some text. In this study, the text describing a reference paper is called the citation text. When writing a scientific paper, a researcher usually needs to find the relevant papers to cite and write text to introduce them. However, the process of writing citation texts is tedious and time-consuming. In order to reduce the burden on researchers, we propose and address the task of automatic citation text generation.
Automatic generation of citation texts in scholarly papers is a challenging and meaningful task, yet very few studies have investigated this problem. Given a cited paper B and the context in a citing paper A (i.e., the sentences before and after a specific position in paper A), the task aims to generate a short text to describe B with respect to the given context in A. The task is related to scholarly paper summarization (Luhn, 1958; Edmundson, 1969; Qazvinian and Radev, 2008; Mei and Zhai, 2008). Both tasks aim to produce a text describing the cited paper B. The major difference is that a citation text reflects not only the salient content of B but also the context of A. Different citing papers usually describe the same cited paper differently. Sometimes one paper may cite another paper several times in different positions and give different descriptions, because the specific contexts differ. Another difference between the two tasks is the length of the text: a citation text is usually much shorter than a paper summary. Generally, citation text generation can be considered as generating a very short summary of paper B given the context of paper A. The difficulty lies in producing different citation texts for the same B when given different citing papers A or different contexts within A.
Most commonly, the citation text is a single sentence, but sometimes it may consist of several sentences (Jebari et al., 2018; Qazvinian and Radev, 2010; Sondhi and Zhai, 2014). Following Small (2011), we define a citation text as a block of one or more consecutive sentences surrounding the reference sign. Each citation sentence can be classified as explicit or implicit (Qazvinian and Radev, 2010; Athar and Teufel, 2012; Yasunaga et al., 2019). An explicit citation sentence contains an explicit reference to the cited paper. An implicit (or non-explicit) citation sentence appears around an explicit citation sentence; it does not contain any explicit reference to the cited paper but supplies additional information about it. The citation text generation task in this study aims to generate both explicit and implicit citation sentences.
We build a citation text generation dataset based on the ACL Anthology Network corpus (AAN) (Radev et al., 2013). We first perform human annotation and obtain 1,000 citation texts (including explicit and implicit citation sentences). We randomly select 400 of them as the test set and use the other 600 to train a citation text extraction model, which we then use to automatically extract many more citation texts and build a large-scale training dataset.
With the constructed training dataset, we can train our citation generation model. In this paper, we use the pointer-generator network (See et al., 2017) as the baseline model. We believe the key to the citation text generation problem is modelling the relationship between the context of the citing paper A and the content of the cited paper B. We therefore encode the context of paper A and the abstract of paper B separately, and add a cross attention mechanism by making the context and the abstract attend to each other. We call our model the multi-source pointer-generator network with cross attention mechanism. The evaluation results show that our model outperforms the baseline models.
Our contributions are summarized as follows:
• We propose a new task of automatic citation text generation in scholarly papers.
• We annotate 1,000 citation texts and train a citation extraction model to automatically construct a large training dataset for the citation text generation task. The data are available at https://github.com/XingXinyu96/citation_generation.
• We propose the multi-source pointer-generator network with cross attention mechanism to address this challenging task. Evaluation results demonstrate the efficacy of our proposed model.

Related Work
Firstly, we introduce studies on citation extraction. Kaplan et al. (2009) proposed a coreference-chain-based method for citation extraction. Sondhi and Zhai (2014) first independently trained a separate HMM for each citation in the article and then performed constrained joint inference to label non-explicit citing sentences. Qazvinian and Radev (2010) proposed a framework based on probabilistic inference to extract implicit citations. Jebari et al. (2018) proposed an unsupervised approach based on topic modeling and word embeddings for implicit citation extraction. Jebari et al. (2018) also introduced a neural network based method but did not report convincing evaluation results.

A few studies have investigated the task of summarizing a single scholarly paper, i.e., single-document summarization in the scientific domain, which is relevant to the citation text generation task. Early works (Luhn, 1958; Baxendale, 1958; Edmundson, 1969) tried to use various features specific to scientific articles for summary extraction. Later on, citation information was shown to be useful for scientific paper summarization (Qazvinian and Radev, 2008; Mei and Zhai, 2008; Qazvinian and Radev, 2010; Cohan and Goharian, 2018; Yasunaga et al., 2019). Several benchmarks have been set up for scientific summarization, including the TAC 2014 Biomedical Summarization track and the CL-SciSumm Shared Task (Jaidka et al., 2016).

A few other studies have investigated the task of summarizing multiple scholarly papers, i.e., multi-document summarization in the scientific domain (Mohammad et al., 2009; Yeloglu et al., 2011; Chen and Zhuge, 2014). Related work generation is a special case of multi-document scientific summarization (Hoang and Kan, 2010; Hu and Wan, 2014; Chen and Zhuge, 2019). However, all the above work on scholarly paper summarization differs from the task of citation text generation, which aims to generate a usually very short text describing the cited paper in the given context of the citing paper.

Problem and Corpus
Formally, given a citing paper A, a cited paper B and the context C in A, the task aims to generate the citation text T that describes B. The context C refers to the sentences surrounding the target citation text in A, and it is provided to distinguish different mentions of B at different positions in A. The following example shows a paragraph of (Lu et al., 2008), which cites (Wong and Mooney, 2006). In this example, A refers to (Lu et al., 2008) and B refers to (Wong and Mooney, 2006). The underlined sentence (i.e., the second sentence) is an explicit citation, and the sentence in italics (i.e., the third sentence) is an implicit citation; together they compose the citation text. The remaining two sentences (i.e., the first and last sentences) compose the context C of A. The phrase in bold, which indicates the explicit reference to paper B, is called the reference sign. The explicit citation text can thus be defined as the sentence containing a reference sign to the cited paper, and the implicit citation text as the sentences that provide information about the cited paper but contain no reference sign.
... SILT (Kate et al., 2005) learns deterministic rules to transform either sentences or their syntactic parse trees to meaning structures. WASP (Wong and Mooney, 2006) ...

In this study, we build a citation generation dataset based on the ACL Anthology Network corpus (AAN) (Radev et al., 2013). The ACL Anthology is a collection of papers from the Computational Linguistics journal and proceedings from ACL conferences and workshops. In particular, we download and use the 2014 version of the AAN corpus, which includes about 23,594 papers. After removing papers containing many garbled characters and papers without abstracts, 16,675 papers remain. The metadata of each paper and the paper citation network have been extracted and stored. We find all the mentions of each reference paper in a citing paper by using manually designed regular expressions to match the corresponding reference signs, as illustrated by the sketch below. In total, we extract 86,052 explicit citations for further use.
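The following is a hypothetical sketch of the kind of regular expression that could match such reference signs; the patterns shown here are illustrative and are not the hand-crafted patterns actually used to build the dataset.

```python
import re

# Hypothetical patterns for reference signs such as
# "(Wong and Mooney, 2006)" or "Kate et al. (2005)".
AUTHOR = r"[A-Z][A-Za-z'-]+"
PAREN = rf"\({AUTHOR}(?: (?:and {AUTHOR}|et al\.))?, \d{{4}}\)"  # (Author and Author, 2006)
NARR = rf"{AUTHOR}(?: (?:and {AUTHOR}|et al\.))? \(\d{{4}}\)"    # Author et al. (2005)
REF_SIGN = re.compile(rf"{PAREN}|{NARR}")

text = ("SILT (Kate et al., 2005) learns deterministic rules. "
        "WASP (Wong and Mooney, 2006) is based on statistical MT.")
print(REF_SIGN.findall(text))
# -> ['(Kate et al., 2005)', '(Wong and Mooney, 2006)']
```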

Annotation Process
For each reference sign, we perform human annotation to get all citation sentences. We label a vector in which each dimension corresponds to a sentence. A sentence is marked with C if it is an explicit citation and with 1 if it is an implicit citation. All other sentences are marked with 0. The label vector of the example mentioned above is [0, C, 1, 0].
Our annotation process has two steps. First, we annotate the explicit citation sentences. Although we have extracted explicit citations with rules, we cannot be sure that the extraction is completely correct. In order to accurately evaluate the performance of our methods, the explicit citations in the test dataset should be human annotated. We randomly choose some automatically extracted explicit citations and highlight the reference signs we find. The annotators only need to judge whether the extracted reference sign is correct. We stop this step when we have 1,000 explicit citations verified as correct by humans. The second step is to annotate implicit citation texts. For each explicit citation sentence, we take the three sentences before it and the three sentences after it as candidate sentences. Note that all the candidate sentences must be in the same section as the explicit citation sentence. We provide the candidate sentences, the explicit citation sentence, and the abstracts of the citing and cited papers to every annotator. The explicit citation sentence has already been labelled with C, and the annotators just need to label the other sentences with 1 or 0. Note that we require the citation sentences to be continuous, which means there cannot be non-citation sentences between two citation sentences. To make the data more reliable, every annotation instance is annotated by three different people. When they disagree, we take the majority label.
After the annotation process, we get 1,000 annotated citation texts (including both explicit and implicit citation sentences) for further use. We randomly choose 400 citation texts as the final test dataset and the remaining citation texts are used for training.

Implicit Citation Text Extraction Model
After the annotation process, we have 400 citation texts as the test dataset and 600 citation texts for training. However, we need large-scale training data to train a feasible citation text generation model, so we use the 600 human-annotated citation texts to train an implicit citation text extraction model that expands our training dataset. We treat implicit citation text extraction as a sequence labeling problem and use BERT (Devlin et al., 2018) to deal with it. We add a classification layer on the final hidden representation of BERT and fine-tune the whole model on our dataset. We concatenate all the candidate sentences, the explicit citation sentence and the abstract of the cited paper as the input of BERT. We add a special tag '[s]' at the beginning of every sentence, a special tag '[explicit]' at the beginning of the explicit citation sentence, and a special tag '[abs]' at the beginning of the cited paper's abstract. The abstract of the cited paper does not need to be labelled, but it provides a lot of information to help label the candidate sentences.

BERT outputs the probability of each sentence being an implicit citation. We set a threshold α to control the identification of implicit citation sentences: when the probability output by BERT is greater than α, we take the corresponding sentence as an implicit citation sentence. Obviously, the smaller α is, the more sentences will be recognized as implicit citations. To ensure the citation text is continuous, we identify implicit citation sentences starting from the explicit citation sentence and moving outward to both sides, stopping when we meet the first non-citation sentence, as sketched in the code below.

We perform 10-fold cross-validation on our training dataset and use the 400 test instances as external test data. The 600 training instances are split into 10 subsets; in each fold, we train on 9 subsets and test on the remaining one. The average cross-validation results are shown in Table 1. The average results on the external test data are shown in Table 2.
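To make the thresholded, continuity-constrained identification concrete, here is a minimal Python sketch; the probabilities are stand-ins for the scores a fine-tuned BERT classifier would output.

```python
# A minimal sketch of the thresholded, continuity-constrained identification.
# `probs` stands in for per-sentence implicit-citation probabilities from a
# fine-tuned BERT classifier; `explicit_idx` marks the explicit citation
# sentence among the candidates, which is always kept.
def extract_implicit(probs, explicit_idx, alpha=0.9):
    selected = {explicit_idx}
    for i in range(explicit_idx - 1, -1, -1):      # scan leftward
        if probs[i] > alpha:
            selected.add(i)
        else:
            break                                  # citation text must stay continuous
    for i in range(explicit_idx + 1, len(probs)):  # scan rightward
        if probs[i] > alpha:
            selected.add(i)
        else:
            break
    return sorted(selected)

# Seven candidate sentences with the explicit citation at index 3:
print(extract_implicit([0.05, 0.2, 0.95, 1.0, 0.8, 0.92, 0.1], 3))
# -> [2, 3]; index 4 falls below alpha, so index 5 is excluded despite 0.92
```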
Our model is compared with the following baseline models:
All one: It labels all candidate sentences with 1.
Random: It labels all candidate sentences randomly.
Cosine sim: It first uses a bag-of-words model to represent all texts as vectors. It then calculates the cosine similarity between the candidate sentence and the cited paper's abstract, and between the candidate sentence and the explicit citation sentence. When both similarities are greater than the threshold, the sentence is labelled with 1.
W2v sim: This model is also based on similarity, but the similarity is calculated with a word2vec model: each text is mapped to the corresponding sequence of word vectors, and the similarities are computed over these vectors. The labelling procedure is otherwise the same as in Cosine sim.

Results of all these baseline models are shown in Table 3.
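For concreteness, here is a minimal sketch of the Cosine sim labelling rule, assuming bag-of-words count vectors and an illustrative threshold (the threshold actually used is not reproduced here).

```python
from collections import Counter
import math

# Bag-of-words cosine similarity between two texts.
def cosine(a, b):
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

# Label a candidate as implicit citation (1) only if it is similar enough to
# BOTH the cited paper's abstract and the explicit citation sentence.
def label_candidate(sent, abstract, explicit_sent, threshold=0.1):
    return int(cosine(sent, abstract) > threshold
               and cosine(sent, explicit_sent) > threshold)
```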
As shown in these tables, our extraction model outperforms all the baseline models. The F1 scores of our extraction models with α=0.1 and α=0.9 are very close, indicating comparable overall performance. The model with α=0.9 achieves higher precision, while the model with α=0.1 achieves higher recall. We can thus obtain two different extraction models with the two different values of α, and with them construct two different datasets for further training the citation generation model.
To obtain the two datasets, we use all 600 training instances to train two final extraction models. We call the extraction model with α=0.1 EXTα=0.1 and the extraction model with α=0.9 EXTα=0.9. The results on the external test data when using the full training data are shown in Table 4.

Final Evaluation Datasets
With the two implicit citation extraction models trained in the previous section, we construct three datasets for experiments. In each dataset, a data example is a triple: [citing paper's context, cited paper's abstract, gold citation text]. The first dataset is an explicit citation text generation dataset (Explicit dataset). Its gold citation text in both the training data and the test data is a single explicit citation sentence. Note that the explicit citation sentences in the training data are automatically extracted with rules, while those in the test data are human annotated. The second dataset is a full citation text generation dataset. The gold full citation texts of the test data are human annotated, and the gold full citation texts of the training data are constructed as follows: the gold explicit citation text is extracted with rules and the gold implicit citation text is extracted with EXTα=0.1. This extraction model achieves higher recall, so we call this dataset the high-recall full citation text generation dataset (HR dataset). The third dataset is also a full citation text generation dataset, constructed in the same way as the second dataset except that the gold implicit citation text of the training data is extracted with EXTα=0.9; we call it the high-precision full citation text generation dataset (HP dataset). The cited paper's abstract in all three datasets refers to the abstract of the cited paper B. We use it to represent the content of paper B because the whole article is too long to encode. The citing paper's context in all three datasets refers to the sentences around the gold citation text in the citing paper A: we take the three sentences before the gold citation text and the three sentences after it as the context. Note that all the context sentences must be in the same section as the gold citation text.
Finally, we have three datasets for experiments:
• the Explicit dataset;
• the HR dataset;
• the HP dataset.
A sketch of how one training triple could be assembled is shown below.
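The following is a simplified sketch of assembling one [context, abstract, gold citation text] triple; the function and variable names are illustrative, not taken from our implementation.

```python
# A simplified sketch of assembling one training triple. `section_sents`
# holds the sentences of the section containing the citation text, and
# `start`/`end` delimit the gold citation text within it (end exclusive).
def build_example(section_sents, start, end, cited_abstract, window=3):
    before = section_sents[max(0, start - window):start]
    after = section_sents[end:end + window]
    return {
        "context": " ".join(before + after),                  # citing paper's context C
        "abstract": cited_abstract,                           # cited paper's abstract
        "gold_citation": " ".join(section_sents[start:end]),  # gold citation text T
    }
```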

Citation Generation Model
Our citation text generation model is a multi-source pointer-generator network with cross attention mechanism. Because the citation generation task has two input sequences, we use two encoders to encode them separately and allow the model to copy words from both input sequences. Such a multi-source pointer-generator network by itself cannot model the relationship between the two input sequences, so we add a cross attention mechanism on top of them. The cross attention mechanism calculates the attention distribution of every word over the other sequence of words. These attention distributions are used to help the decoder. We believe that the citing paper's context can tell the model what information in the cited paper's abstract is important, and vice versa. The structure of the whole model is shown in Figure 1.

Pointer-Generator Network
A typical seq2seq model with an attention mechanism has three components: an encoder, a decoder and an attention network. The input text is seen as a sequence of words $\{w_1, w_2, \dots, w_n\}$. The encoder, a single-layer bidirectional LSTM network, receives the input words one by one and produces a sequence of encoder hidden states $\{h_i\}$. At each decoding step $t$, the decoder, a single-layer unidirectional LSTM, receives the previous word and produces the decoder state $s_t$. The attention distribution $a^t$ is calculated as in (Bahdanau et al., 2014):

$$e_i^t = v^T \tanh(W_h h_i + W_s s_t + b_{attn}) \qquad (1)$$
$$a^t = \mathrm{softmax}(e^t) \qquad (2)$$

where $v$, $W_h$, $W_s$ and $b_{attn}$ are learnable parameters. At each decoding step $t$, the attention distribution $a^t$ is used to calculate the context vector $c_t$:

$$c_t = \sum_i a_i^t h_i \qquad (3)$$

Figure 1: The structure of our generation model

The context vector $c_t$ and the decoder state $s_t$ are used to produce the vocabulary distribution $P_v$:

$$P_v = \mathrm{softmax}(V_2(V_1[s_t, c_t] + b) + b') \qquad (4)$$

where $V_1$, $V_2$, $b$ and $b'$ are learnable parameters. $P_v$ is a probability distribution over all words in the vocabulary. During training, we use $P_v$ to calculate the cross-entropy loss. At each decoding step, the network can either generate a word like a normal seq2seq model or copy a word from the source sequence. The generation probability $p_{gen}$ for timestep $t$ is:

$$p_{gen} = \sigma(W_c^T c_t + W_s^T s_t + W_x^T x_t + b_{ptr}) \qquad (5)$$

where $c_t$ is the context vector, $s_t$ is the decoder state, $x_t$ is the decoder input, $W_c$, $W_s$, $W_x$ and $b_{ptr}$ are learnable parameters, and $\sigma$ is the sigmoid function. $p_{gen}$ is used as a soft switch to choose between generating a word from the vocabulary and copying a word from the input sequence. For each text, we define an extended vocabulary, which is the union of the vocabulary and all words appearing in the source text. We obtain the following probability distribution over the extended vocabulary:

$$P(w) = p_{gen} P_v(w) + (1 - p_{gen}) \sum_{i: w_i = w} a_i^t \qquad (6)$$

Note that if $w$ is not in the vocabulary, $P_v(w)$ is zero. We then use the probability distribution over the extended vocabulary to calculate the loss.
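For illustration, here is a toy numpy sketch of one decoding step of equations (1)-(6). Dimensions are arbitrary, some biases are omitted, and the $p_{gen}$ computation folds the learnable weights into plain sums, so this is a sketch of the data flow rather than a faithful implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

n, d, vocab = 5, 8, 20                  # source length, hidden size, vocab size
rng = np.random.default_rng(0)
h = rng.normal(size=(n, d))             # encoder hidden states {h_i}
s_t = rng.normal(size=d)                # decoder state
x_t = rng.normal(size=d)                # decoder input embedding
W_h, W_s = rng.normal(size=(d, d)), rng.normal(size=(d, d))
v, b_attn = rng.normal(size=d), rng.normal(size=d)

e_t = np.array([v @ np.tanh(W_h @ hi + W_s @ s_t + b_attn) for hi in h])  # eq (1)
a_t = softmax(e_t)                                                        # eq (2)
c_t = a_t @ h                                                             # eq (3)

V1, V2 = rng.normal(size=(d, 2 * d)), rng.normal(size=(vocab, d))
P_v = softmax(V2 @ (V1 @ np.concatenate([s_t, c_t])))                    # eq (4), biases omitted

# eq (5): learnable weights folded into plain sums here for brevity
p_gen = 1.0 / (1.0 + np.exp(-(c_t.sum() + s_t.sum() + x_t.sum())))

src_ids = [3, 7, 7, 21, 22]             # source token ids; 21 and 22 are OOV
P_ext = np.zeros(vocab + 3)             # extended vocabulary: in-vocab ids + OOV slots
P_ext[:vocab] = p_gen * P_v             # eq (6), generation part
for i, w in enumerate(src_ids):         # eq (6), copy part: scatter attention mass
    P_ext[w] += (1 - p_gen) * a_t[i]
assert np.isclose(P_ext.sum(), 1.0)     # a valid probability distribution
```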

Multi-Source Pointer-Generator Network with Cross Attention
We now introduce our generation model. First, we extend the pointer-generator network to a multi-source pointer-generator network with two encoders and one decoder. The two encoders encode the citing paper's context and the cited paper's abstract separately. The input context of the citing paper is seen as a sequence of words $\{cw_1, cw_2, \dots, cw_n\}$ and the input cited paper's abstract is seen as a sequence of words $\{aw_1, aw_2, \dots, aw_m\}$. We use the same notation to represent both a word and its embedding vector. The context is encoded by the corresponding encoder into a sequence of encoder hidden states $\{ch_i\}$ and the cited paper's abstract is encoded into a sequence of encoder hidden states $\{ah_j\}$. At each decoding step $t$, we calculate attention distributions $\{ac_i^t\}$, $\{as_j^t\}$ and the corresponding context vectors $c_t^1$, $c_t^2$ separately, as described in equations (1), (2) and (3). To let the model copy words from both encoders, we change equation (5) to:

$$[p_{gen}, p_{copy1}, p_{copy2}] = \mathrm{softmax}(W_{c_1}^T c_t^1 + W_{c_2}^T c_t^2 + W_s^T s_t + W_x^T x_t + b_{ptr}) \qquad (7)$$

where $p_{gen}$ is the probability of generating a word, $p_{copy1}$ is the probability of copying a word from the citing paper's context, and $p_{copy2}$ is the probability of copying a word from the cited paper's abstract. Equation (6) is changed accordingly to:

$$P(w) = p_{gen} P_v(w) + p_{copy1} \sum_{i: cw_i = w} ac_i^t + p_{copy2} \sum_{j: aw_j = w} as_j^t \qquad (8)$$

Then we add the cross attention mechanism to the multi-source pointer-generator network. By making the citing paper's context and the cited paper's abstract attend to each other, we capture the relationships between them. First, we calculate a match matrix $M$ between the sequence of context states $\{ch_i\}$ and the sequence of abstract states $\{ah_j\}$. The element $M_{i,j}$ of the match matrix is:

$$M_{i,j} = ch_i^T ah_j \qquad (9)$$

Then we apply the softmax function to the row vectors of the matrix and get an attention matrix $A^{row}$, whose row vector $A^{row}_i$ is:

$$A^{row}_i = \mathrm{softmax}(M_{i,:}) \qquad (10)$$

The vector $A^{row}_i$ represents the attention of word $cw_i$ over the sequence of words $\{aw_1, aw_2, \dots, aw_m\}$. We also apply the softmax function to the column vectors of the matrix and get another attention matrix $A^{column}$; the column vector $A^{column}_j$ represents the attention of word $aw_j$ over the sequence of words $\{cw_1, cw_2, \dots, cw_n\}$. With the two attention matrices, we calculate two special sequences of vectors. The first sequence $\{r_1, r_2, \dots, r_n\}$ is calculated as:

$$r_i = \sum_j A^{row}_{i,j} ah_j \qquad (11)$$

The second sequence $\{q_1, q_2, \dots, q_m\}$ is calculated as:

$$q_j = \sum_i A^{column}_{i,j} ch_i \qquad (12)$$

The vector $r_i$ represents what the word $cw_i$ thinks about the sequence of words $\{aw_1, aw_2, \dots, aw_m\}$, while the vector $q_j$ represents what the word $aw_j$ thinks about the sequence of words $\{cw_1, cw_2, \dots, cw_n\}$. We believe these two sequences of vectors can model the relationship between the input citing paper's context and the cited paper's abstract, so we call them relationship vectors. With these two sequences of relationship vectors, we calculate two additional context vectors $c_t^3$ and $c_t^4$ at each decoding step $t$, by replacing the encoder hidden state $h_i$ in equations (1), (2) and (3) with the relationship vector $r_i$ or $q_j$. Finally, we calculate the vocabulary distribution with all four context vectors by changing equation (4) to:

$$P_v = \mathrm{softmax}(V_2(V_1[s_t, c_t^1, c_t^2, c_t^3, c_t^4] + b) + b') \qquad (13)$$
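The cross attention computation of equations (9)-(12) can be sketched in a few lines of numpy; the dot-product scoring in the match matrix mirrors equation (9) above, which is itself a reconstruction.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, m, d = 4, 6, 8                # context length, abstract length, hidden size
ch = rng.normal(size=(n, d))     # context encoder states {ch_i}
ah = rng.normal(size=(m, d))     # abstract encoder states {ah_j}

M = ch @ ah.T                    # match matrix: M[i, j] scores the pair (cw_i, aw_j)
A_row = softmax(M, axis=1)       # each context word attends over the abstract
A_col = softmax(M, axis=0)       # each abstract word attends over the context

r = A_row @ ah                   # r_i: what cw_i "thinks" about the abstract
q = A_col.T @ ch                 # q_j: what aw_j "thinks" about the context
assert r.shape == (n, d) and q.shape == (m, d)
```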

Experimental Setup
The baseline models include:
RandomSen: It randomly selects a sentence from the abstract of paper B.
MaxSimSen: It selects the sentence from the abstract of paper B that has the largest similarity with the context of A.
EXT-ORACLE: It can be viewed as an upper bound for extractive models. It creates an oracle citation text by selecting the sentence from the abstract of paper B that achieves the highest ROUGE score with respect to the gold text (a sketch of this selection follows the list of baselines).
COPY-CIT: It randomly copies one citation text from the papers in the training dataset that also cite paper B.
PTGEN: It is a pointer-generator network which allows both copying words via pointing and generating words from a fixed vocabulary. When using this model, we concatenate the citing paper's context and the cited paper's abstract as the input sequence.
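As referenced above, here is a sketch of the EXT-ORACLE selection, using unigram overlap (ROUGE-1 F1) as the scoring function; which ROUGE variant is maximized is an assumption made for illustration.

```python
from collections import Counter

# Unigram-overlap ROUGE-1 F1 between a candidate and a reference text.
def rouge1_f1(cand, ref):
    c, r = Counter(cand.lower().split()), Counter(ref.lower().split())
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    p, rec = overlap / sum(c.values()), overlap / sum(r.values())
    return 2 * p * rec / (p + rec)

# EXT-ORACLE: pick the abstract sentence that best matches the gold text.
def ext_oracle(abstract_sentences, gold_citation):
    return max(abstract_sentences, key=lambda s: rouge1_f1(s, gold_citation))
```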
Our proposed model is called PTGEN-Cross. Both our model and PTGEN have 256-dimensional hidden states and 128-dimensional word embeddings. The vocabulary size is set to 50k. At test time, the citation texts are produced using beam search with beam size 4.

Automatic Evaluation
We evaluate our models with ROUGE (Lin, 2004), reporting the F1 scores for ROUGE-1, ROUGE-2 and ROUGE-L. The results on the three datasets are shown in Tables 5, 6 and 7, respectively. On all three datasets, extractive models perform poorly. Our baseline generation model PTGEN outperforms EXT-ORACLE, which can be seen as a 'perfect' extractive system. This is completely different from how these models perform on other summarization tasks such as news document summarization, and we believe it shows the particularity of this task. The task not only requires the model to capture the important content of the cited paper, but also requires it to capture the attitude of the citing paper towards the cited paper. The model not only needs to generate fluent and informative text, but also needs to ensure contextual coherence.
Our proposed model PTGEN-Cross clearly outperforms the baseline model PTGEN, which demonstrates the effectiveness of the cross attention mechanism. We think the cross attention mechanism helps the model capture the relationship between the citing paper's context and the cited paper's abstract. The results on the explicit citation text generation dataset are all higher than the results on the other two datasets, which means the task of explicit citation text generation is easier than the task of full citation text generation. We think this is because the context of an explicit citation sometimes contains implicit citation sentences, and these sentences can be very helpful for generating the explicit citation text. Another possible reason is that the quality of the training dataset for explicit citation generation is higher than that of the other two training datasets. Because the test data of the two full citation text generation datasets are the same, we can compare the results of our model trained on the two datasets. The model trained on the high-recall dataset performs slightly better. This suggests that the coverage ability of the implicit citation extraction model is more important when constructing the training dataset for citation generation.

Human Evaluation
We randomly sample 50 instances from the high-recall test set and perform human evaluation on them. Three graduate students are employed to rate the citation text produced by each method in four aspects: readability (whether the citation text is fluent), content (whether the citation text is relevant to the cited paper's abstract), coherence (whether the citation text is coherent with the citing paper's context) and overall quality. The rating score ranges from 1 (very bad) to 5 (very good). Note that every text is scored by three judges and we take the average of the three scores. The results are shown in Table 9.
As shown in the table, our model outperforms the baseline model, especially with respect to the coherence and overall aspects. This further demonstrates the efficacy of our proposed model. We show an example of generation in Table 8. Note that all reference signs to the cited paper are masked as '[refer]' and all reference signs to other papers are masked as '[otherrefer]'. The '[cit]' in bold in the context indicates the position where the citation text should be. We can see that the citation text generated by our model is more contextually coherent, because the model better captures the relationship between the context and the cited paper's abstract.

Conclusion and Future Work
In this paper, we investigate the challenging task of automatic generation of citation texts in scholarly papers. We annotate a dataset and train an implicit citation extraction model to automatically enlarge the training data. We then propose the multi-source pointer-generator network with cross attention mechanism to deal with this task. Empirical evaluation results on three datasets verify the efficacy of our proposed method. In future work, we will consider introducing more information, such as citation texts describing the cited paper in other papers, to help the generation.