Dual Attention Model for Citation Recommendation

Owing to the exponentially increasing number of academic articles, discovering and citing comprehensive and appropriate resources has become a non-trivial task. Conventional citation recommendation methods suffer from severe information loss. For example, they do not consider the section of the paper that the user is writing and for which they need to find a citation, the relatedness between the words in the local context (the text span that describes a citation), or the importance of each word in the local context. These shortcomings make such methods insufficient for recommending adequate citations for academic manuscripts. In this study, we propose a novel embedding-based neural network called “dual attention model for citation recommendation (DACR)” to recommend citations during manuscript preparation. Our method embeds three types of semantic information: words in the local context, structural contexts, and the section on which a user is working. A neural network model is designed to maximize the similarity between the embedding of the three inputs (local context words, section, and structural contexts) and the target citation appearing in the context. The core of the neural network model is composed of self-attention and additive attention, where the former aims to capture the relatedness between the contextual words and structural contexts, and the latter aims to learn their importance. Experiments on real-world datasets demonstrate the effectiveness of the proposed approach.


Introduction
When writing an academic paper, one of the most frequent questions considered is: "Which paper should I cite at this place?" Given the massive number of papers being published, it is impossible for a researcher to read every article that might be relevant to their study. Thus, recommending a handful of useful citations based on the contents of a working draft can significantly alleviate the burden of writing a paper. An example of the application scenario is demonstrated in Figure 1.
Currently, many scholars rely on "keyword searches" on search engines, such as Google Scholar and DBLP. However, keyword-based systems often generate unsatisfying results, because query words may not convey adequate information to reflect the context that needs to be supported (Jia and Saule, 2017;Jia and Saule, 2018). Researchers in various fields have proposed various methods to solve this problem. For example, the studies in (McNee et al., 2002;Gori and Pucci, 2006;Caragea et al., 2013;Küçüktunç et al., 2013;Jia and Saule, 2018) considered recommendations based on a collection of seed papers, and (Alzoghbi et al., 2015;Li et al., 2018) proposed methods using meta-data, such as authorship information, titles, abstracts, keyword lists, and publication years. However, when such methods are applied to real-world paper-writing tasks, they do not consider the local context of a citation within a draft, which can potentially lead to suboptimal results. Context-based recommendations adopt a more practical concept that generates potential citations for an input context (He et al., 2010;He et al., 2011). Building on the context-based methodology, HyperDoc2Vec (Han et al., 2018) is an embedding framework that additionally encodes the citation links between the local context in a citing paper and the content of a cited paper. Our previous study (Zhang and Ma, 2020) incorporated structural contexts in addition to citation links to further improve recommendation performance. Context-based approaches could be potentially applicable in the real-world paper-writing process.
Figure 1: Concept of dual attention model for citation recommendation (DACR)
However, the studies mentioned above still fail to take into consideration a number of essential characteristics of academic papers, which limits their usefulness.
1. Scientific papers tend to follow the established "IMRaD" format (introduction, methods, results and discussion, and conclusions) (Mack, 2014), where each section of an article has a specific purpose. For example, the introduction section defines the topic of the paper in a broader context, the method section includes information on how the results were produced, and the results and discussion section presents the results. Therefore, citations used in each section should comply with the specific purpose of that section. For example, citations in the introduction section should support the main concepts of the paper, citations in the methods section should provide technical details, and citations in the results and discussion section should aim to compare results to those of other works. Therefore, recommendations of suitable citations for a given context should also consider the purpose of the corresponding section.
2. Certain words and cited articles in a paper are much more closely related than other words and articles in the same paper. Capturing these interactions is essential for understanding a paper. For example, in Figure 1, the word "recommendation" is closely related to the words "context-based," "citations," and "context," but has a weak relationship with the words "adopt," "more," and "input." Additionally, a given word may have strong relatedness with some citations that appear in the paper. For example, the word "recommendation" has strong relatedness to the citations "(Li et al., 2018)" and "(Han et al., 2018)" because both of these citations focus on recommendation algorithms.
3. Not every word or cited article has the same importance within a given paper. Important words and cited articles are more informative with respect to the topic of a paper. For example, in Figure 1, the words "context-based," "recommendations," "citations," and "context" are more informative than the words "adopt," "more," or "generates." The citation "(Han et al., 2018)" may be more essential than "(Jia and Saule, 2018)" because the former is related to context-based recommendations, while the latter is related to a different approach.
Adequate recommendations of citations for a manuscript should capture the relatedness and importance of words and cited articles in the context that needs citations, as well as the purpose of the section on which the writer is currently working. To this end, we propose a novel embedding-based neural network called dual attention model for citation recommendation (DACR), which is designed to capture the relatedness and importance of the words in the context that needs citations and of the structural contexts in the manuscript, as well as the section on which the user is working. The core of the proposed neural network is composed of two attention mechanisms, namely self-attention and additive attention. The former captures the relatedness between contextual words and structural contexts, and the latter learns the importance of contextual words and structural contexts. Additionally, the proposed model embeds sections into an embedding space and utilizes the embedded sections as additional features for recommendation tasks.

Related Work

Document Embedding
Document embedding refers to the representation of words and documents as continuous vectors. Word2Vec (Mikolov et al., 2013a) was proposed as a shallow neural network for learning word vectors from texts while preserving word similarities. Doc2Vec (Le and Mikolov, 2014) is an extension of Word2Vec for embedding documents with content words. However, these two methods generally treat documents as "plain texts," meaning that when they are applied to scholarly articles, some essential information (for example, citations and metadata in scientific papers) can be lost, which in turn can lead to suboptimal recommendation results. Some more recent studies have attempted to remedy this issue. HyperDoc2Vec (Han et al., 2018) is a fine-tuning model for embedding additional citation relations. DocCit2Vec (Zhang and Ma, 2020), proposed in our previous work, considers both structural contexts and citation relations. Regardless, some vital information is still not considered, such as the semantics of section headers and the relatedness and importance of the words in the context requiring the support of citations, which are included in this study.

Citation Recommendation
Citation recommendation refers to the task of finding relevant documents based on an input query. The query could be a collection of seed papers (McNee et al., 2002;Gori and Pucci, 2006;Caragea et al., 2013;Küçüktunç et al., 2013;Jia and Saule, 2017), and the recommendations are then generated via collaborative filtering (McNee et al., 2002;Caragea et al., 2013) or PageRank-based methods (Gori and Pucci, 2006;Küçüktunç et al., 2013;Jia and Saule, 2017). Some studies (Alzoghbi et al., 2015;Li et al., 2018) have proposed using meta-data, such as titles, abstracts, keyword lists, and publication years as query information. However, in real-world applications, when providing support for writing manuscripts, these techniques lack practicability. Context-based methods (He et al., 2010;He et al., 2011;Han et al., 2018;Zhang and Ma, 2020) use a passage requiring support as a query to determine the most relevant papers, which can potentially enhance the paper-writing process. However, such methods may suffer from information loss because they do not consider sections within papers or the relative importance and relatedness of local context words.

Attention Mechanisms
Attention mechanisms are commonly applied in the field of computer vision (Tang et al., 2014) to detect important parts of an image and thereby improve prediction accuracy. They have also been adopted in recent text-mining research. For example, (Ling et al., 2015) extended Word2Vec with a simple attention mechanism to improve word classification performance. Google's BERT algorithm (Devlin et al., 2019) uses multi-head attention and provides excellent performance for several natural language

Notations and Definitions
Academic papers can be treated as a type of hyper-document, in which citations are equivalent to hyperlinks. Based on paper modeling with citations (Han et al., 2018) and modeling of citations with structural contexts (Zhang and Ma, 2020), we introduce a novel modeling with citations, structural contexts, and sections.
Definition 1 (Academic Paper). Let $w \in W$ represent a word from a vocabulary, $W$, let $s \in S$ represent a section from a section header collection, $S$, and let $d \in D$ represent a document ID (paper DOI) from an ID collection, $D$. The textual information of a paper, $H$, is represented as a sequence of words, sections, and IDs of cited documents (i.e., $\hat{W} \cup \hat{S} \cup \hat{D}$, where $\hat{W} \subseteq W$, $\hat{S} \subseteq S$, and $\hat{D} \subseteq D$).
Definition 2 (Citation Relationships). The citation relationships, $\mathcal{C}$, (see Figure 2) in a paper, $H$, are expressed by a tuple, $\langle s, d_t, D_n, C \rangle$, where $d_t \in \hat{D}$ represents a target citation, $\hat{D}$ represents the IDs of all the cited documents from $H$, $C \subseteq \hat{W}$ is the local context surrounding $d_t$, and $s \in \hat{S}$ is the title of the section in which the contextual words appear. If other citations exist within the same manuscript, then they are defined as "structural contexts" and denoted by $D_n = \{d_n \mid d_n \in \hat{D}, d_n \neq d_t\}$.
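To make Definition 2 concrete, the following minimal Python sketch extracts the tuples from one tokenized section. The class name, field names, token format, and the decision to drop other citation IDs from the local context are illustrative assumptions, not part of the original pipeline.

```python
from typing import List, NamedTuple


class CitationRelation(NamedTuple):
    """One tuple <s, d_t, D_n, C> from Definition 2 (field names are illustrative)."""
    section: str                    # s: header of the section containing the context
    target: str                     # d_t: ID of the citation to be predicted
    structural_contexts: List[str]  # D_n: IDs of the other documents cited in the paper
    context_words: List[str]        # C: words surrounding the target citation


def extract_relations(section, tokens, cited_ids, window=50):
    """Build one relation per citation occurrence in a tokenized section.

    `tokens` mixes words and document IDs; `cited_ids` lists every ID cited
    anywhere in the paper; `window` is the number of words kept on each side.
    """
    cited = set(cited_ids)
    relations = []
    for pos, tok in enumerate(tokens):
        if tok not in cited:
            continue
        # Assumption: other citation IDs are not counted as context words.
        before = [t for t in tokens[:pos] if t not in cited]
        after = [t for t in tokens[pos + 1:] if t not in cited]
        context = before[max(0, len(before) - window):] + after[:window]
        others = [d for d in cited_ids if d != tok]
        relations.append(CitationRelation(section, tok, others, context))
    return relations


tokens = ["context", "based", "methods", "DOC_A", "use", "embeddings", "DOC_B"]
relations = extract_relations("introduction", tokens, ["DOC_A", "DOC_B", "DOC_C"], window=2)
```

Each occurrence of a cited ID yields one tuple, so a paper with several citations contributes several training examples that share the same pool of cited documents.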

Problem Definition
Embedding matrices are denoted as $D \in \mathbb{R}^{k \times |D|}$ for documents, $W \in \mathbb{R}^{k \times |W|}$ for words, and $S \in \mathbb{R}^{k \times |S|}$ for sections. The $i$-th column of $D$, denoted by $d_i$, is a $k$-dimensional vector representing document $d_i$. Additionally, the $j$-th column of $W$ is a $k$-dimensional vector for word $w_j$, and the $s$-th column of $S$ is a $k$-dimensional vector for section $s$.
The proposed model initializes two embedding matrices (IN and OUT) for documents (i.e., $D^I$ and $D^O$), a word embedding matrix, $W^I$, and a section embedding matrix, $S^I$. A column vector from $D^I$ represents the role of a document as a structural context and a column vector from $D^O$ represents the role of a document as a citation (the implementation details of the experiment in Section 5.4 explain this in more detail). The word embedding matrix, $W^I$, and section embedding matrix, $S^I$, are initialized for all words of the word vocabulary and all sections of the section header collection.
The goal of this model is to optimize the following objective function: (1)

Dual Attention Model for Citation Recommendation
An overview of the proposed DACR approach is presented in Figure 2. DACR is composed of two main components: a context encoder (Section 4.1) for encoding contextual words, sections, and structural contexts into a fixed-length vector and a citation encoder (Section 4.2) for predicting the probability of a target citation.

Context Encoder
The context encoder takes three inputs, namely, context words, sections, and structural contexts, from citation relationships. The encoder contains three layers: an embedding layer for converting words and documents (structural contexts) into vectors, a self-attention layer with an Add&Norm sub-layer (Vaswani et al., 2017) for capturing the relatedness between words and structural contexts, and an additive attention layer (Wu et al., 2019) for recognizing the importance of each word and structural context.

IN Embedding, Add and Concatenation layer
The IN embedding layer initially generates three embedding matrices, $D^I$, $W^I$, and $S^I$, for the document collection, the word vocabulary, and the section header collection. For a given citation relationship, the one-hot vectors of structural contexts, context words, and sections are projected with the three embedding matrices, and the projections are denoted as $D^I_{\{D_n\}}$, $W^I_{\{C\}}$, and $S^I_s$. The projected section vector is then added to the word vectors (each word vector is added to the section vector), and the resultant matrix is denoted as $W'$. $W'$ and $D^I_{\{D_n\}}$ are then concatenated column-wise to form one matrix, i.e., $[w'_1, \dots, w'_m, d^I_1, \dots, d^I_n]$, denoted as $E$, where $m$ is the number of input context words and $n$ is the number of input structural contexts.
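A minimal sketch of this layer in plain Python is given below. It assumes that each embedding matrix is stored as a dictionary from a key to its k-dimensional column vector and that E is held as a list of columns; this storage layout and the helper names are hypothetical and do not reflect the authors' PyTorch implementation.

```python
import random


def init_embeddings(keys, k, rng):
    """Hypothetical helper: one random k-dimensional column per key."""
    return {key: [rng.uniform(-0.1, 0.1) for _ in range(k)] for key in keys}


def build_input_matrix(W_in, D_in, S_in, context_words, structural_contexts, section):
    """Return E as a list of m + n columns (m words first, then n documents)."""
    s_vec = S_in[section]
    # Add the section vector to each context-word vector.
    word_cols = [[w + s for w, s in zip(W_in[word], s_vec)] for word in context_words]
    # Structural contexts are looked up in the IN document matrix unchanged.
    doc_cols = [list(D_in[doc]) for doc in structural_contexts]
    # Column-wise concatenation of the two groups.
    return word_cols + doc_cols


rng = random.Random(0)
k = 4
W_in = init_embeddings(["citation", "recommendation"], k, rng)
D_in = init_embeddings(["DOC_A", "DOC_B", "DOC_C"], k, rng)
S_in = init_embeddings(["introduction", "method"], k, rng)
E = build_input_matrix(W_in, D_in, S_in, ["citation", "recommendation"], ["DOC_B"], "method")
```

Looking up a column by key is equivalent to multiplying the embedding matrix by a one-hot vector, which is why no explicit one-hot encoding appears in the sketch.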

Self-attention Mechanism with Add&Norm
Self-attention (Vaswani et al., 2017) is utilized to capture the relatedness between input context words and structural contexts. It applies scaled dot-product attention in parallel for a number of heads, to allow the model to jointly consider interactions from different representation sub-spaces at different positions.
The $k$-dimensional embedding matrix, $E$, from the last layer is first transposed and projected with three linear projections ($A^Q_i$, $A^K_i$, and $A^V_i$, where $i \in \{1, \dots, h\}$ and $h$ denotes the number of heads). The $E$ matrix is projected $h$ times, and each projection is called a "head". At each projection (i.e., within a "head"), the dot product of the first two projected versions of $E$, i.e., $E^{\top} A^Q_i$ and $E^{\top} A^K_i$, is computed and divided by $\sqrt{d_h}$. Subsequently, softmax is applied to obtain the resulting weight matrix with dimensions of $(m + n) \times (m + n)$, i.e., $\mathrm{softmax}\left(\frac{(E^{\top} A^Q_i)(E^{\top} A^K_i)^{\top}}{\sqrt{d_h}}\right)$, where $(m + n)$ is the total number of input context words and structural contexts. This weight matrix represents the relatedness between the input words and articles. The dot product of the weight matrix and the third projected version of $E$, i.e., $E^{\top} A^V_i$, is computed as the output matrix of the head, denoted as $\mathrm{head}_i$. The $h$ output head matrices are concatenated column-wise and projected again with $A^O$ to yield the final output matrix. The computation procedure is represented as follows:
$$\mathrm{head}_i = \mathrm{softmax}\left(\frac{(E^{\top} A^Q_i)(E^{\top} A^K_i)^{\top}}{\sqrt{d_h}}\right) E^{\top} A^V_i, \quad (2)$$
$$\mathrm{MultiHead}(E) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, A^O, \quad (3)$$
where $A^O \in \mathbb{R}^{k \times k}$, $A^Q_i \in \mathbb{R}^{k \times d_h}$, $A^K_i \in \mathbb{R}^{k \times d_h}$, and $A^V_i \in \mathbb{R}^{k \times d_h}$ are projection parameters. $d_h$ is the embedding dimension of the heads, $h$ is the number of heads, and $k = d_h \times h$, where $k$ is the dimension of the embedding vectors. The output matrix of the self-attention mechanism is then transposed and added to the original $E$ matrix. Next, dropout (Hinton et al., 2012) is applied to avoid over-fitting, followed by layer normalization (Ba et al., 2016) to facilitate the convergence of the model during training. The final output matrix is denoted as $E'$.
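The procedure described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: matrices are lists of rows, the input is the transposed matrix (one row per word or document), dropout is omitted as it would be at inference time, and layer normalization is applied without learnable gain and bias. All function names are hypothetical.

```python
import math
import random


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def layer_norm(row, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no gain or bias)."""
    mean = sum(row) / len(row)
    var = sum((x - mean) ** 2 for x in row) / len(row)
    return [(x - mean) / math.sqrt(var + eps) for x in row]


def self_attention_add_norm(E_t, heads, A_o):
    """E_t: (m + n) x k. heads: list of (A_q, A_k, A_v), each k x d_h. A_o: k x k."""
    d_h = len(heads[0][0][0])
    head_outputs = []
    for A_q, A_k, A_v in heads:
        Q, K, V = matmul(E_t, A_q), matmul(E_t, A_k), matmul(E_t, A_v)
        K_t = [list(col) for col in zip(*K)]
        scores = matmul(Q, K_t)  # (m + n) x (m + n)
        weights = [softmax([s / math.sqrt(d_h) for s in row]) for row in scores]
        head_outputs.append(matmul(weights, V))  # (m + n) x d_h
    # Concatenate the heads column-wise, then project with A_o.
    concat = [sum((head[i] for head in head_outputs), []) for i in range(len(E_t))]
    projected = matmul(concat, A_o)
    # Residual connection followed by layer normalization of each item vector.
    return [layer_norm([e + p for e, p in zip(e_row, p_row)])
            for e_row, p_row in zip(E_t, projected)]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
k, h, n_items = 4, 2, 3
d_h = k // h
E_t = random_matrix(n_items, k, rng)
heads = [tuple(random_matrix(k, d_h, rng) for _ in range(3)) for _ in range(h)]
A_o = random_matrix(k, k, rng)
E_prime_t = self_attention_add_norm(E_t, heads, A_o)
```

The result keeps one row per input item, so it corresponds to the transpose of the output matrix described in the text.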

Additive Attention Mechanism
The additive attention layer (Wu et al., 2019) is utilized to recognize informative contextual words and structural contexts. It takes matrix $E'$ from the last layer as input, whereby each column represents the vector of a word or document. The weight of each item is computed as follows:
$$\mathrm{Weight} = q^{\top} \tanh(V E' + V'), \quad (4)$$
where $V \in \mathbb{R}^{k \times k}$ is the projection parameter matrix, $V' \in \mathbb{R}^{k \times (n+m)}$ is the bias matrix, and $q$ ($k$-dimensional) is a parameter vector. The Weight vector is a row vector of dimension $(m + n)$, where each column represents the weight of a corresponding word or document. Dropout is applied to the Weight vector to avoid over-fitting.
The output, EncoderVector, is the product of the input matrix, $E'$, and the softmaxed Weight vector, so that the embedding vectors (the columns of $E'$) are weighted and summed, as illustrated below:
$$\mathrm{EncoderVector} = E' \cdot \mathrm{softmax}(\mathrm{Weight}^{\top}). \quad (5)$$
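The weighting and summation can be sketched as follows. The sketch assumes the tanh-based scoring used in the additive attention of Wu et al. (2019), stores the input as a list of item vectors (one per column), and omits dropout; the function and argument names are hypothetical.

```python
import math


def additive_attention(items, V, bias_cols, q):
    """Weight each item vector and return (attention weights, encoder vector).

    items:     m + n vectors of length k (the columns of the input matrix)
    V:         k x k projection matrix, stored as a list of rows
    bias_cols: one k-dimensional bias vector per item
    q:         k-dimensional parameter vector
    """
    raw = []
    for e, b in zip(items, bias_cols):
        # Assumed nonlinearity: tanh(V e + b), scored against q.
        hidden = [math.tanh(sum(v * x for v, x in zip(row, e)) + b_i)
                  for row, b_i in zip(V, b)]
        raw.append(sum(q_i * h_i for q_i, h_i in zip(q, hidden)))
    # Softmax over the m + n raw weights.
    peak = max(raw)
    exps = [math.exp(w - peak) for w in raw]
    total = sum(exps)
    alphas = [x / total for x in exps]
    # Weighted sum of the item vectors gives a single k-dimensional vector.
    k = len(items[0])
    encoder_vector = [sum(a * e[i] for a, e in zip(alphas, items)) for i in range(k)]
    return alphas, encoder_vector


zeros = [[0.0, 0.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
uniform_alphas, mean_vector = additive_attention(
    [[1.0, 0.0], [0.0, 1.0]], zeros, zeros, [1.0, 1.0])
skewed_alphas, skewed_vector = additive_attention(
    [[2.0, 0.0], [0.0, 2.0]], identity, zeros, [1.0, 0.0])
```

With all-zero parameters every item receives the same weight and the encoder vector reduces to the plain average that the embedding baselines use; non-zero parameters let the model move away from that average.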

Citation Encoder
The citation encoder is designed to predict potential citations by calculating the probability score between an OUT document matrix, $D^O$, and the EncoderVector from the context encoder, which is defined as follows:
$$\hat{y} = (D^O)^{\top} \cdot \mathrm{EncoderVector}. \quad (6)$$
The scores are then normalized using the softmax function as follows:
$$p = \mathrm{softmax}(\hat{y}). \quad (7)$$
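As an illustration, the scoring step can be sketched in plain Python as follows, assuming the score is the dot product between each OUT document vector and the encoder vector. The dictionary layout and the function name are hypothetical.

```python
import math


def rank_citations(D_out, encoder_vector, top_k=10):
    """Score every candidate document and return (top-k IDs, probabilities).

    D_out maps a document ID to its k-dimensional OUT vector.
    """
    ids = list(D_out)
    # Dot product between each OUT vector and the encoder vector.
    scores = [sum(a * b for a, b in zip(D_out[d], encoder_vector)) for d in ids]
    # Softmax normalization over all candidates.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    probs = {d: e / total for d, e in zip(ids, exps)}
    ranked = sorted(ids, key=lambda d: probs[d], reverse=True)
    return ranked[:top_k], probs


D_out = {"DOC_A": [1.0, 0.0], "DOC_B": [0.0, 1.0], "DOC_C": [-1.0, 0.0]}
top, probs = rank_citations(D_out, [2.0, 0.5], top_k=2)
```

Because softmax is monotonic, ranking by probability and ranking by raw dot product give the same order; the normalization matters only for the training loss.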

Model Training and Optimization
We adopted a negative sampling training strategy (Mikolov et al., 2013b) to speed up the training process for DACR. In each iteration, one positive sample (the correctly cited paper) and $n$ negative samples are generated. Therefore, the calculated probability vector, $p$, is composed of $[p_{\mathrm{positive}}, p_{\mathrm{negative}_1}, p_{\mathrm{negative}_2}, \dots, p_{\mathrm{negative}_n}]$. The loss function computes the negative log-likelihood of the probability of the positive sample, as follows:
$$\mathcal{L} = -\log(p_{\mathrm{positive}}). \quad (8)$$
Stochastic gradient descent (SGD) (Sutskever et al., 2013) is used to optimize the model.
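A minimal sketch of this loss for a single training example is given below. It assumes that negatives are drawn uniformly without replacement from the other documents; the sampling distribution actually used by the authors is not stated here, and the function name is hypothetical.

```python
import math
import random


def negative_sampling_loss(D_out, encoder_vector, positive_id, n_negative, rng):
    """Return -log p_positive, with softmax taken over 1 positive + n negatives."""
    candidates = [d for d in D_out if d != positive_id]
    # Assumption: uniform sampling without replacement.
    negatives = rng.sample(candidates, n_negative)
    scores = [sum(a * b for a, b in zip(D_out[d], encoder_vector))
              for d in [positive_id] + negatives]
    # Numerically stable log of the softmax normalizer.
    peak = max(scores)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return log_norm - scores[0]


rng = random.Random(0)
D_out = {"pos": [1.0, 0.0], "n1": [0.0, 1.0], "n2": [-1.0, 0.0], "n3": [0.0, -1.0]}
loss_aligned = negative_sampling_loss(D_out, [3.0, 0.0], "pos", 3, rng)
loss_opposed = negative_sampling_loss(D_out, [-3.0, 0.0], "pos", 3, rng)
loss_uninformed = negative_sampling_loss(D_out, [0.0, 0.0], "pos", 3, rng)
```

The softmax is taken over only n + 1 candidates instead of the whole document collection, which is where the speed-up comes from.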

Experiments
We evaluated the recommendation performance of our model and five baseline models on two datasets, namely DBLP and ACL Anthology (Han et al., 2018). The recall, mean average precision (MAP), mean reciprocal rank (MRR), and normalized discounted cumulative gain (nDCG) are reported for a comparison of the models. These values are summarized in Table 2. Additionally, we verified the effectiveness of adding information about sections, relatedness, and importance, as shown in Figure 3.
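For reference, the four metrics can be computed per query as in the following sketch, which uses common textbook definitions with binary relevance. Whether the evaluation in this study uses exactly these variants (for example, the normalization of average precision) is an assumption. MAP and MRR are the means of the per-query values over all test queries.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of relevant documents that appear in the top k."""
    return len(set(ranked[:k]) & relevant) / len(relevant)


def average_precision_at_k(ranked, relevant, k):
    """Mean of precision values at the ranks where a relevant document appears."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / min(len(relevant), k)


def reciprocal_rank_at_k(ranked, relevant, k):
    """Inverse rank of the first relevant document, or 0 if none is in the top k."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, relevant, k):
    """Discounted cumulative gain divided by the gain of an ideal ranking."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(ranked[:k], start=1) if doc in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal


ranked = ["DOC_X", "DOC_T", "DOC_Y"]
relevant = {"DOC_T"}
metrics = {
    "recall": recall_at_k(ranked, relevant, 10),
    "ap": average_precision_at_k(ranked, relevant, 10),
    "rr": reciprocal_rank_at_k(ranked, relevant, 10),
    "ndcg": ndcg_at_k(ranked, relevant, 10),
}
```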

Dataset Overview
The larger dataset, DBLP (Han et al., 2018), contains 649,114 full paper texts with 2,874,303 citations (approximately five citations per paper) in the field of computer science. The ACL Anthology dataset (Han et al., 2018) is smaller, containing 20,408 texts with 108,729 citations; however, it has a similar number of citations per paper (about five per paper) to the DBLP dataset. We split the datasets into a training dataset, for training the document, word, and section vectors, and test dataset with papers containing more than one citation published in the last few years for recommendation experiments. An experimental overview is provided in Table 1.

Document Preprocessing
The texts were pre-processed using ParsCit (Councill et al., 2008) to recognize citations and sections. In-text citations were replaced with the corresponding unique document IDs in the dataset. Section headers often have diverse names. For example, many authors name the "methodology" section using customized algorithm names. Therefore, we replaced all section headers with fixed generic section headers using ParsLabel (Luong et al., 2010). The generic headers from ParsLabel are abstract, background, introduction, method, evaluation, discussions, and conclusions. If ParsLabel was not able to recognize a section, we labeled it as "unknown." Detailed information for each section is listed in Table 1.

Implementation and Settings
DACR was developed using PyTorch 1.2.0 (Paszke et al., 2019). In our experiments, word and document embeddings were pre-trained using DocCit2Vec with an embedding size of 100, a window size of 50 (also known as the length of the local context, i.e., 50 words before and after a citation), a negative sampling value of 1000, and 100 iterations (default settings in (Zhang and Ma, 2020)). The word vectors for generic headers, such as "introduction" and "method," were selected as pre-trained vectors for the section headers. DACR was implemented with 5 heads, 100 dimensions for the query vector, and a negative sampling value of 1000. The SGD optimizer was implemented with a learning rate of 0.001, a batch size of 100, and 100 iterations for the DBLP dataset, or 200 iterations for the ACL Anthology dataset. To avoid over-fitting, we applied a 20% dropout in the two attention layers. Word2Vec and Doc2Vec were implemented using Gensim 2.3.0 (Řehůřek and Sojka, 2010), and HyperDoc2Vec and DocCit2Vec were developed based on Gensim. All baseline models were initialized with an embedding size of 100, a window size of 50, and default values for the remaining parameters.
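The DACR settings listed above can be gathered in a small configuration object, as in the sketch below. The class and field names are hypothetical; only the values come from the text. The derived head dimension follows from the constraint that the embedding size equals the head dimension multiplied by the number of heads, which gives 100 / 5 = 20.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DACRConfig:
    """Hypothetical container for the reported DACR hyperparameters."""
    embedding_dim: int = 100      # k, size of word, document, and section vectors
    num_heads: int = 5            # h, number of self-attention heads
    query_dim: int = 100          # size of the additive-attention query vector
    window: int = 50              # context words kept before and after a citation
    negative_samples: int = 1000
    learning_rate: float = 0.001
    batch_size: int = 100
    iterations: int = 100         # 100 for DBLP, 200 for ACL Anthology
    dropout: float = 0.2          # applied in both attention layers

    @property
    def head_dim(self) -> int:
        """d_h, derived from k = d_h * h."""
        if self.embedding_dim % self.num_heads != 0:
            raise ValueError("embedding_dim must be divisible by num_heads")
        return self.embedding_dim // self.num_heads


dblp_config = DACRConfig()
acl_config = DACRConfig(iterations=200)
```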

Recommendation Evaluation
We designed three usage cases to simulate real-world scenarios:
• Case 1: In this case, we assumed the manuscript was approaching its completion phase, meaning the writer had already inserted the majority of their citations into the manuscript. Based on the leave-one-out approach, the task was to predict a target citation by providing the contextual words (50 words before and after the target citation), structural contexts (the other cited papers in the source paper), and section header as input information for DACR.
• Case 2: Here, we assumed that some existing citations were invalid because they were not available in the dataset, i.e., the author had made typographical errors or the manuscript was in an early stage of development. In this case, given a target citation, its local context, and its section header, we randomly selected structural contexts to predict the target citation. The random selection was implemented using the built-in Python 3 random function. All case 2 experiments were conducted three times, and the results were averaged to reduce the bias introduced by random selection.
• Case 3: Here, we assumed that the manuscript was in an early phase of development, where the writer had not inserted any citations or all existing citations were invalid. Only context words and section headers were utilized for the prediction of the target citation (no structural contexts were used).
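The three cases differ only in which structural contexts are passed to the model, which can be sketched as follows. The function name and the `keep` parameter are hypothetical, and the number of structural contexts retained in case 2 is an assumption because it is not specified in the text.

```python
import random


def select_structural_contexts(structural_contexts, case, rng, keep=None):
    """Return the structural contexts fed to the model for a given usage case."""
    if case == 1:
        # Near-complete manuscript: every other cited paper is available.
        return list(structural_contexts)
    if case == 2:
        # Some citations are invalid: keep a random subset.
        if not structural_contexts:
            return []
        size = keep if keep is not None else rng.randint(1, len(structural_contexts))
        return rng.sample(structural_contexts, min(size, len(structural_contexts)))
    if case == 3:
        # Early draft: no usable citations at all.
        return []
    raise ValueError("case must be 1, 2, or 3")


rng = random.Random(0)
cited = ["DOC_A", "DOC_B", "DOC_C", "DOC_D"]
case1 = select_structural_contexts(cited, 1, rng)
case2 = select_structural_contexts(cited, 2, rng, keep=2)
case3 = select_structural_contexts(cited, 3, rng)
```

Repeating the case 2 selection with different seeds and averaging the metrics mirrors the three repeated runs described above.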
To conduct recommendation via DACR, an encoder vector was initially inferred using the trained model with inputs of cases 1, 2, and 3, and then, the OUT document vectors were ranked based on dot products.
Five baseline models were adapted for comparison with DACR. As the baseline models do not explicitly consider section information, the section headers were omitted from their inputs.
1. Citations as words via Word2Vec (W2V). This method was presented in (Berger et al., 2017), where all citations were treated as special words. The recommendation of documents was defined as ranking the OUT word vectors of documents relative to the averaged IN vectors of context words and structural contexts via dot products. The word vectors were trained using the Word2Vec CBOW algorithm.
2. Citations as words via Doc2Vec (D2V-nc) (Berger et al., 2017). The citations were removed in this method, and the recommendations were made by ranking the IN document vectors via cosine similarity relative to the vector inferred from the learnt model by taking context words and structural contexts as input (this method results in better performance than the dot product). The word and document vectors were trained using Doc2Vec PV-DM.
3. Citations as content via Doc2Vec (D2V-cac) (Han et al., 2018). In this method, all context words around a citation were copied into the cited document as supplemental information. The recommendations were made based on cosine similarity between the IN document vectors and inferred vector from the learnt model. The vectors were trained using Doc2Vec PV-DM.
4. Citations as links via HyperDoc2Vec (HD2V) (Han et al., 2018). In this method, citations were treated as links pointing to target documents. The recommendations were made by ranking OUT document vectors relative to the averaged IN vectors of input contextual words based on dot products. The embedding vectors were pre-trained by Doc2Vec PV-DM using default settings.

5. Citations as links with structural contexts via DocCit2Vec (DC2V) (Zhang and Ma, 2020). The recommendations were made by ranking OUT document vectors relative to the averaged IN vectors of input contextual words and structural contexts based on dot products. The embedding vectors were pre-trained by Doc2Vec PV-DM with default settings.
There are three main conclusions that can be drawn from Table 2. First, DACR outperforms all baseline models at the 1% significance level across all evaluation scores for all cases and datasets. This implies that the additional information included in the model, namely sections, relatedness, and importance, is essential for predicting useful citations. The effectiveness of each type of added information is presented in Section 5.5.
Second, performance increases when additional information is preserved in the embedding vectors. When comparing Word2Vec, HyperDoc2Vec, DocCit2Vec, and DACR, Word2Vec only preserves contextual information, HyperDoc2Vec considers citations as links, DocCit2Vec includes structural contexts, and DACR exploits the internal structure of a scientific paper to extract richer information. The evaluation scores increase with the amount of information preserved, indicating that overcoming information loss in embedding algorithms is helpful for recommendation tasks.
Third, DACR is effective for both the large (DBLP) and medium (ACL Anthology) sized datasets. However, we also observed that the smaller dataset requires more iterations for the model to produce effective results. We presume that more training iterations can compensate for a lack of diversity in the training data.
The performance of DACR could be further improved by more accurately recognizing section headers, because we found that some section labels were incorrectly recognized or could not be recognized by ParsLabel. Therefore, we will work on improving the accuracy of section recognition in future work.

Effectiveness of Adding Sections, Relatedness, and Importance
In this section, we explore the effectiveness of adding the following information: section headers, relatedness, and importance. We ran three modified DACR models, each with one component removed: the section embedding layer (to verify the effectiveness of section information), the self-attention layer (to verify the effectiveness of the relatedness between contextual words and articles), and the additive attention layer (to verify the effectiveness of the importance of context). We present the recall, MAP, MRR, and nDCG at 10 for case 1 on the DBLP dataset for comparison, as illustrated in Figure 3. DACR without self-attention and DACR without additive attention both perform significantly worse than the full DACR model, whereas the performance of DACR without section information drops by only a minor margin. Three conclusions can be drawn from these results.
First, all modified models performed worse than the full model, which supports our hypothesis that sections, relatedness, and importance of contextual words and articles all contribute to recommending useful citations. The relatedness information is more beneficial than the section information, which is evident when comparing DACR without section information and DACR without self-attention.
Second, the primary reason for the near-zero scores of the model without additive attention is that the loss of the model did not converge without the additive attention layer. Therefore, we consider that the additive attention has a two-fold purpose: ensuring convergence and learning the importance of context.
Lastly, only appropriate combinations of information and neural network layers lead to optimal solutions, as deficits in any of the three types of information (section, relatedness, and importance, or attention layers) result in low performance.

Conclusions and Future Work
In this study, we proposed a citation recommendation model with dual attention mechanisms. This model aims to simplify real-world paper-writing tasks by alleviating the issue of information loss in existing methods. Our model considers three types of essential information: the section on which a user is working and into which citations need to be inserted, the relatedness between the local context words and structural contexts, and their importance. The core of the proposed model is composed of two attention mechanisms: self-attention for capturing relatedness and additive attention for learning importance. Extensive experiments demonstrated the effectiveness of the proposed model in scenarios designed to mimic real-world usage, as well as the efficiency of the proposed neural network.
In future work, we will first attempt to improve the accuracy of recognizing section headers to improve the usability and performance of the algorithm. Second, we will include additional paper-related information in the model, such as word positions. Third, we will explore more sophisticated neural network architectures to improve accuracy and reduce the training time of the model.