A Neural Citation Count Prediction Model based on Peer Review Text

Citation count prediction (CCP) has been an important research task for automatically estimating the future impact of a scholarly paper. Previous studies mainly focus on extracting or mining useful features from the paper itself or the associated authors. An important kind of data signal, peer review text, has not been utilized for the CCP task. In this paper, we take the initiative to utilize peer review data for the CCP task with a neural prediction model. Our focus is to learn a comprehensive semantic representation of peer review text for improving the prediction performance. To achieve this goal, we incorporate the abstract-review match mechanism and the cross-review match mechanism to learn deep features from peer review text. We also integrate hand-crafted features via a wide component. The deep and wide components jointly make the prediction. Extensive experiments have demonstrated the usefulness of the peer review data and the effectiveness of the proposed model. Our dataset has been released online.


Introduction
In recent years, the number of scientific publications has been growing at a dramatic rate. For example, the numbers of submissions and accepted papers of EMNLP 2019 have increased to 2,877 and 684, respectively. Given the huge volume of scholarly papers, a long-standing research challenge is how to effectively evaluate the impact of scientific literature (Garfield, 1999; Saha et al., 2003; Bornmann, 2013). A typical way to measure the impact of a scholarly paper is through the number of citations received after publication (Garfield, 1979; Aksnes, 2006), reflecting its influence in the research community.
Since citation count is an important evaluation measure for scientific impact, many researchers aim to develop automatic ways to predict the future citations of a paper (Castillo et al., 2007; Ibáñez et al., 2009; Davletov et al., 2014; Xiao et al., 2016). A typical approach is to cast the problem as a classification or regression task, focusing on extracting useful feature information, e.g., h-index and topic distribution (Yan et al., 2011; Chen and Zhang, 2015; Singh et al., 2015; Park et al., 2017). Although these studies have achieved important progress on this task, they mainly utilize information from the papers themselves or their associated authors. They have neglected an important kind of data signal for the prediction task, i.e., peer review data.
Peer review is a widely adopted paper evaluation mechanism, in which three or more reviewers are assigned to decide whether to accept or reject a paper. During the review process, the reviewers assess the paper quality in terms of several important factors, including originality, correctness, substance and readability. Intuitively, peer review data should be useful for predicting the future impact of a paper, since the review text contains assessment comments from domain experts. Fortunately, the mechanism of open review (Soergel et al., 2013) has made it possible to obtain peer review data for the citation count prediction (CCP) task.
Although it is appealing to leverage peer reviews for the CCP task, it is difficult to effectively extract supporting evidence and learn comprehensive semantic representations from peer review data. Reviews are usually written in natural language text, covering the assessment comments of a paper in multiple aspects. Some comments may not focus on the main contribution of a paper. For example, a review typically contains reminders about minor spelling errors or format problems. Another interesting observation is that different reviewers may focus on different aspects in their comments, and may even hold divergent attitudes on the same aspect. Hence, it is important to consider both coverage and divergence of review comments for making a comprehensive decision on the paper impact.
In this paper, we take the initiative to study how to utilize the peer review data in the CCP task. We focus on how to learn a comprehensive semantic representation from peer review text for improving the prediction performance. To identify relevant evidence from long text, we utilize the abstract-review match method to learn abstract-aware review representations by using abstract text as an attentive query. In this way, we can reduce the influence of irrelevant content or noise. To further characterize the interaction among multiple reviews, we propose a novel cross-review match mechanism. With such a mechanism, a review representation will be decomposed into a parallel representation and an orthogonal representation by referring to the rest of the reviews. Our model can derive an effective semantic representation for capturing the comprehensive semantics of all the reviewers.
To evaluate our model, we have constructed two peer review datasets with citation counts. Extensive experiments have demonstrated the superiority of the proposed model over several competitive baselines. To our knowledge, it is the first time that peer review data has been utilized in the CCP task. Our work has shown that peer review data is important to improve the prediction performance. Our code and dataset have been released at https://github.com/RUCAIBox/Citation-Count-Prediction.

Related Work
Citation count prediction has been a hot research topic in the literature (Castillo et al., 2007; Ibáñez et al., 2009; Chakraborty et al., 2014). Early studies cast this task as a classification or regression task (Fu and Aliferis, 2008). Their focus was to identify features in a certain aspect to explore the factors of the impact of papers. Subsequent work formally defined this task and thoroughly examined various possible factors correlated with citation counts (Yan et al., 2011; Bhat et al., 2015; Chen and Zhang, 2015; Singh et al., 2015; Park et al., 2017). These studies mainly model the long-term scientific impact (Wu et al., 2019; Abrishami and Aliakbary, 2018; Yuan et al., 2018). Furthermore, some researchers cast the problem as a time series task, and focused on analyzing temporal features or patterns in the process of citation growth (Davletov et al., 2014; Xiao et al., 2016; Yuan et al., 2018). However, to the best of our knowledge, no work has utilized peer review data of scholarly papers for citation count prediction.
As an important paper evaluation mechanism, peer review has been widely adopted in various journals and conferences (Ross et al., 2006; Fisher et al., 1994). Based on private review data, researchers have explored the usefulness of peer reviews in several aspects, such as issue localization (Xiong et al., 2010), review utility (Xiong and Litman, 2011) and quality/tone (Ramachandran and Gehringer, 2011). More recently, to lower the barrier to studying peer reviews for the scientific community, a public dataset of peer reviews has been released for research purposes (Kang et al., 2018). Based on this dataset, Wang and Wan (2018) have employed peer review text to predict the overall decision status for a paper submission. Compared with Wang and Wan (2018), we focus on a different task by studying how to leverage peer review for future impact estimation instead of paper acceptance, which has its own technical challenges. Besides, we have released our dataset with citation counts online.
Our work is also related to the studies that analyze scientific literature or citation data, including concept or keyphrase extraction (Shen et al., 2018; Gordon et al., 2016; Luan et al., 2017; Caragea et al., 2014), citation or influence analysis (Lauscher et al., 2018; Chakraborty and Narayanam, 2016; Bonab et al., 2018; Chen et al., 2018), context modeling (Cohan and Goharian, 2017; Jin and Szolovits, 2018) and automatic paper rating (Yang et al., 2018).
We also assume that $K$ peer reviews are available for paper $d$, characterized by $\{r_k\}_{k=1}^{K}$, where $r_k$ denotes the $k$-th review consisting of multiple review sentences. We assume that both abstract and review text share the same vocabulary $V$. Besides these features, we assume that other types of information (e.g., authors' h-index) are also available for our task. We use a vectorized representation $x_d$ to encode all non-review features.
Based on the above preliminaries, we now define the Citation Count Prediction (CCP) task. We aim to learn an effective predictive function $f(\cdot)$ that takes as input the abstract text, review text and other available information and estimates the future citation count after a given time period:
$$\hat{c}_d = f\left(a_d, \{r_k\}_{k=1}^{K}, x_d\right), \tag{1}$$
where $\hat{c}_d$ is the estimated citation count for $d$. To make the citation number more predictable, we normalize the value range of $c_d$ within the interval (0, 1). Here, we consider long-term citation count prediction in terms of years.

The Proposed Model
In this section, we present a neural citation count prediction model based on peer review text. Our model consists of two major components, namely the deep component and the wide component, which model review-based text features and other handcrafted features, respectively. Figure 1 presents an illustrative sketch of our model architecture. The notations and their descriptions are shown in Table 1.

| Symbol | Description |
| --- | --- |
| $x_d$ | the non-review features of paper $d$ |
| $z_d$ | the final review representation of paper $d$ |
| $c_d$ | the citation count of paper $d$ |
| $a_d$ | the abstract text of paper $d$ |
| $r_k$ | the $k$-th review consisting of multiple review sentences |
| $s^A_j$ | the learned representation of the $j$-th sentence in the abstract text |
| $s^R_{k,j}$ | the learned representation of the $j$-th sentence in the $k$-th review text |
| $h_S$ | the dimension of sentence vectors |
| $n_d$ | the number of sentences in the abstract of paper $d$ |
| $n_k$ | the number of sentences in the $k$-th review of paper $d$ |
| $u^R_t$ | the updated representation of the $t$-th sentence in a review after abstract-review match |
| $h_H$ | the dimension of hidden representations such as $u^R_t$ and $z_d$ |
| (symbol lost in extraction) | the representation of other reviews excluding the $k$-th review |
| $\hat{v}^R_k$ | the refined representation of the $k$-th review after cross-review match |

Table 1: Notations used in the paper.

The Deep Component
The deep component is the core part of our model, which aims to extract important semantic characteristics from peer review text for the CCP task. We first encode abstract and review sentences into embedding vectors, and then distill the relevant evidence from review text by referring to the abstract. To characterize the interaction of multiple reviewers, we further design a cross-review match mechanism to capture both consistency and divergence among different reviews.

Encoding Abstract and Review Sentences
We first pretrain the word embeddings with the word2vec model (Mikolov et al., 2013) on the whole scientific corpus. To effectively encode the abstract and review sentences, we adopt the convolution-based method of Kim (2014) to model the sentences, sequentially consisting of a lookup layer, a convolution layer of 100 filters, and a max pooling layer. We denote the learned sentence representations of the abstract text as $\{s^A_j\}_{j=1}^{n_d}$ and those of the $k$-th review as $\{s^R_{k,j}\}_{j=1}^{n_k}$, where $s^A_j \in \mathbb{R}^{h_S}$ or $s^R_{k,j} \in \mathbb{R}^{h_S}$ is the vector for the $j$-th sentence in the abstract or the $k$-th review, and $n_d$ and $n_k$ are the numbers of sentences in the abstract of paper $d$ and in its $k$-th review, respectively.
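The convolution and max pooling steps of the sentence encoder can be illustrated with a minimal sketch in plain Python. It is not the released implementation: the function name `conv_max_pool`, the ReLU activation, the omitted bias terms, and the toy filter width are all assumptions made for illustration, and a sentence is assumed to contain at least `width` words.

```python
def conv_max_pool(embeddings, filters, width):
    """Convolve filters over a sentence and max-pool over positions.

    embeddings: list of word vectors (lists of floats), length >= width.
    filters: list of filters, each a flat list of width * dim weights.
    Returns one pooled feature per filter (a sentence vector).
    """
    features = []
    for f in filters:
        best = float("-inf")
        for start in range(len(embeddings) - width + 1):
            # Flatten the window of `width` consecutive word vectors.
            window = [x for vec in embeddings[start:start + width] for x in vec]
            act = sum(w * x for w, x in zip(f, window))
            # ReLU (an assumption), then keep the maximum over positions.
            best = max(best, max(act, 0.0))
        features.append(best)
    return features


# Toy example: three 2-dimensional word vectors, two filters of width 2.
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_filters = [[1.0, 0.0, 0.0, 1.0], [-1.0, -1.0, -1.0, -1.0]]
sentence_vector = conv_max_pool(sentence, toy_filters, width=2)
```

With 100 filters, as in the setting described above, the resulting sentence vector would have $h_S = 100$ entries.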

Improving the Review Representations with Abstract-Review Match
Review text reflects the reviewers' subjective assessment of a paper. A review is likely to cover detailed comments from multiple aspects. It may contain information that is irrelevant to the prediction task, such as requests for source code release or remarks on minor spelling errors. It is key to identify relevant information focusing on the core contributions of a paper. Intuitively, we can utilize the abstract information to purify the original review sentence representations, since it provides a good summary of the main contributions of a paper. Inspired by the recent progress on machine reading comprehension, we adopt the gated attention-based recurrent networks (Wang et al., 2017) for refining the review representations with regard to the abstract text. In our task, the abstract is considered as a question, and a review is considered as a passage. Similar to machine reading comprehension, we aim to learn a "question"-relevant review sentence representation that focuses on the core content from the abstract. For simplicity, we only consider the interaction between the abstract and an individual review, and omit the review index from the notations. Formally, we update the representation of the $t$-th sentence in a review as $u^R_t \in \mathbb{R}^{h_H}$ with a bidirectional GRU (Eq. 2), where $p_t \in \mathbb{R}^{h_S}$ is an attentional vector of a review computed based on the interaction between review and abstract sentence representations, and "$\odot$" is an element-wise product operation for vectors used in that computation.
In this way, for a review, we can obtain the abstract-aware review sentence representations $\{u^R_j\}_{j=1}^{n_k}$, which encode more relevant information emphasized by the abstract. To learn the overall representation for the $k$-th review, we concatenate the sentence embeddings of the first and last sentences in it:
$$v^R_k = [u^R_1; u^R_{n_k}],$$
where $u^R_1$ and $u^R_{n_k}$ are the sentence embeddings learned by the bidirectional Gated Recurrent Unit (BiGRU) in Eq. 2.
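To give an intuition for how the abstract can act as an attentive query, the following sketch computes, for one review sentence, attention weights over the abstract sentences and the resulting abstract context vector. It is a deliberately simplified stand-in: it uses plain dot-product attention and leaves out the learned projections, the gate, and the recurrent update of the gated attention-based network, and the names `softmax` and `abstract_context` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def abstract_context(review_sentence, abstract_sentences):
    """Attention-weighted summary of the abstract for one review sentence.

    Dot-product scoring is an assumption; the model in the text uses
    learned, gated attention instead.
    """
    scores = [sum(a * r for a, r in zip(abs_vec, review_sentence))
              for abs_vec in abstract_sentences]
    weights = softmax(scores)
    dim = len(review_sentence)
    context = [sum(w * abs_vec[i] for w, abs_vec in zip(weights, abstract_sentences))
               for i in range(dim)]
    return weights, context


# Toy example: the review sentence aligns with the first abstract sentence.
weights, context = abstract_context([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

In this toy case the first abstract sentence receives the larger weight, so the context vector leans towards it.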

Improving Review Representations with Cross-Review Match
Previously, we have considered the interaction between the abstract and an individual review. The evaluation process of a paper typically requires multiple reviewers to make the final decision. According to Hirschauer (2010), coverage and divergence should be considered for the acceptance decision of a paper. Therefore, we propose to utilize cross-review match to learn a comprehensive semantic representation from different reviewers.
Given a review, we take the rest of the reviews as a reference source. We aim to learn the common semantics that are also discussed by other reviews (maybe with different attitudes), and to identify semantics that are not mentioned by other reviews. To implement this idea, we adopt an orthogonal decomposition strategy from prior work. We decompose the original review representation into a parallel representation and an orthogonal representation. Formally, the representation of the $k$-th review is decomposed as
$$v^R_k = v^R_{k,\parallel} + v^R_{k,\perp},$$
where the parallel representation $v^R_{k,\parallel} \in \mathbb{R}^{h_H}$ encodes common semantics also discussed by other reviews, and the orthogonal representation $v^R_{k,\perp} \in \mathbb{R}^{h_H}$ encodes semantics that are not mentioned in other reviews. Such a decomposition is useful to extract more comprehensive semantics from multiple reviews. We use average pooling to construct the reference vector of other reviews.
We perform the above decomposition for each review associated with a paper. The parallel representation reflects the common semantics shared by other reviews. Since different reviewers may have divergent comments (or attitudes) towards the same content, we further introduce a corresponding attentional representation for enriching the semantics of the original representation (Eq. 10). Then we combine the three representations and adopt a fully connected layer to obtain the refined representation $\hat{v}^R_k$ for the $k$-th review. It is able to capture the coverage and divergence in semantics for peer review text to some extent.
Finally, we use an average pooling operation over all review representations of paper $d$ as its final representation $z_d \in \mathbb{R}^{h_H}$:
$$z_d = \frac{1}{K} \sum_{k=1}^{K} \hat{v}^R_k.$$
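The decomposition step can be sketched as a standard vector projection onto the reference vector of the other reviews. The code below is a minimal illustration under that assumption; the function names (`mean_vector`, `decompose`, `cross_review_decompose`) are hypothetical, and the handling of a zero reference vector is a choice made here, not something stated in the text.

```python
def mean_vector(vectors):
    """Average pooling over a list of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def decompose(review, reference):
    """Split `review` into parts parallel and orthogonal to `reference`."""
    dot = sum(r * q for r, q in zip(review, reference))
    norm_sq = sum(q * q for q in reference)
    if norm_sq == 0.0:
        # No direction to project onto: treat everything as orthogonal.
        return [0.0] * len(review), list(review)
    scale = dot / norm_sq
    parallel = [scale * q for q in reference]
    orthogonal = [r - p for r, p in zip(review, parallel)]
    return parallel, orthogonal


def cross_review_decompose(reviews):
    """Decompose every review against the average of the remaining ones."""
    results = []
    for k, review in enumerate(reviews):
        others = reviews[:k] + reviews[k + 1:]
        results.append(decompose(review, mean_vector(others)))
    return results


# Toy example with three 2-dimensional review vectors.
toy_reviews = [[2.0, 1.0], [1.0, 0.0], [3.0, 0.0]]
parts = cross_review_decompose(toy_reviews)
```

For the first toy review, the reference vector is the average of the other two, and the two parts sum back to the original vector while the orthogonal part has zero inner product with the reference.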

The Wide Component
Besides review-based features, we consider directly integrating other important features for the prediction task, called wide features. Here, we use a vectorized representation $x_d$ to represent all the wide features. We construct the vector by using the features proposed in previous studies (Yan et al., 2011; Bhat et al., 2015):
• Topic distribution: We utilize Latent Dirichlet Allocation (LDA) (Blei et al., 2003) to learn the probability distribution over topics as the topic features.
• Diversity: We calculate the entropy of the paper's topic distribution to measure the topical breadth of a paper.
• Recency: We use the year of publication as the temporal feature to predict the citation count.
• Author influence: We use the number of authors and the average h-index as author features.
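The wide features listed above can be assembled as in the following sketch. It assumes the topic distribution has already been inferred (e.g., by LDA) and that per-author citation lists are available; the function names and the ordering of the features in the vector are illustrative assumptions. The `h_index` helper is included only to make the feature concrete, since in our setting h-index values are collected from an external source rather than computed.

```python
import math


def topic_entropy(topic_dist):
    """Shannon entropy (natural log) of a topic distribution."""
    return -sum(p * math.log(p) for p in topic_dist if p > 0.0)


def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    h = 0
    for rank, c in enumerate(sorted(citations, reverse=True), start=1):
        if c >= rank:
            h = rank
        else:
            break
    return h


def wide_features(topic_dist, year, author_citation_lists):
    """Concatenate topic, diversity, recency and author features."""
    h_values = [h_index(c) for c in author_citation_lists]
    return list(topic_dist) + [
        topic_entropy(topic_dist),        # diversity
        float(year),                      # recency
        float(len(h_values)),             # number of authors
        sum(h_values) / len(h_values),    # average h-index
    ]


x_d = wide_features([0.5, 0.5], 2016, [[10, 8, 5, 4, 3], [1, 0]])
```

In practice the year and the author features would likely be rescaled before being fed to the model; no such scaling is applied here.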

The Joint Deep and Wide Model
Finally, we integrate the two components into a unified model. We take as input the deep and wide features discussed in Section 4.1 and Section 4.2, respectively, and combine them in the prediction function (Eq. 1), where $z_d$ and $x_d$ are the derived feature representations from the deep and wide components, respectively, and $w_{deep}$ and $w_{wide}$ are the corresponding parameter vectors. Furthermore, we define the citation count prediction error over the training set $\mathcal{D}$ with the Mean Squared Error (MSE):
$$L(\theta) = \frac{1}{|\mathcal{D}|} \sum_{d \in \mathcal{D}} \left(\hat{c}_d - c_d\right)^2,$$
where $c_d$ is the normalized real citation count of each scholarly paper. We learn our model parameters by optimizing the loss $L(\theta)$. The parameters in GRU and CNN are initialized by a normal distribution with zero mean and 0.01 variance, and the biases are initialized as zeros. To optimize our model, we adopt the Stochastic Gradient Descent (SGD) optimizer.
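As a rough illustration of how the two components could be combined and scored, the sketch below adds the two weighted feature sums and squashes the result into (0, 1) before computing the MSE. The sigmoid output and the bias term are assumptions chosen to match the normalized target range; they are not confirmed by the text, and the function names are hypothetical.

```python
import math


def predict(z_d, x_d, w_deep, w_wide, bias=0.0):
    """Joint prediction from deep features z_d and wide features x_d."""
    score = (sum(w * z for w, z in zip(w_deep, z_d))
             + sum(w * x for w, x in zip(w_wide, x_d))
             + bias)
    # Sigmoid keeps the output in (0, 1); this choice is an assumption.
    return 1.0 / (1.0 + math.exp(-score))


def mse(predictions, targets):
    """Mean squared error over paired predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


c_hat = predict([0.0, 0.0], [0.0], w_deep=[1.0, 1.0], w_wide=[1.0])
loss = mse([0.5, 1.0], [0.0, 1.0])
```

A training loop would then update `w_deep`, `w_wide` and the encoder parameters with SGD on this loss.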

Experiments
In this section, we first set up the experiments, and then present the results and analysis.

Experimental Setup
Datasets. Peer review data is not available for the majority of mainstream journals and conferences. Fortunately, ICLR and NIPS have provided the review text on their websites. NIPS does not provide rating scores from the reviewers, so we only consider utilizing text data in this paper. For most published papers, it is difficult to accumulate a considerable number of citations in a short period. Hence, we only use the data of ICLR 2013-2017 and NIPS 2013-2016 for evaluation, which has a two-year span up to now for long-term impact prediction. The data of NIPS 2013-2016 and ICLR 2017 have been shared by Kang et al. (2018). The data of ICLR 2013-2016 was crawled from the OpenReview website, including abstract text, review text and author data. Note that we only consider the accepted papers for citation prediction. We further crawl author data (e.g., h-index) and paper citations from Google Scholar. When encountering any ambiguity in author names or paper titles, a senior graduate student manually collected the corresponding data. All the Google Scholar data was accessed on March 31, 2019 to guarantee the recency of citation data. We remove the papers with only one reviewer. We perform basic text preprocessing using NLTK, including tokenization, lowercasing, and stopword removal, and only retain the words that occur three times or more. In order to simulate the real situation, for both datasets, we take the data from the last year as test data, and the data from the previous years as the training set. The detailed statistics of the two datasets are summarized in Table 2.
Baseline Models. We compare our model against a number of baseline models:
• Linear Regression (LR), K-Nearest Neighbor (KNN), Support Vector Regression (SVR) and Gradient Boost Regression Tree (GBRT): The four methods are commonly used regression models for citation prediction (Yan et al., 2011; Bhat et al., 2015). We adopt the same wide features from (Yan et al., 2011).
• W&D: We modify the deep component by implementing it as a feed-forward neural network on top of a bi-directional RNN component over review text.
• MILAM (Wang and Wan, 2018): It is a multiple instance learning network with a novel abstract-based memory mechanism to predict the overall decision (accept, reject, or borderline) based on review text. We modify the loss of this model as the MSE loss for citation regression. For a fair comparison, we also integrate the wide features in a similar way as our wide component.
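The text preprocessing and the year-based split described in the dataset setup can be sketched as follows. To keep the example self-contained, a regular expression and a tiny hand-written stopword list stand in for the NLTK tokenizer and stopword corpus; these stand-ins, the function names, and the `year` field of the paper records are assumptions for illustration only.

```python
import re
from collections import Counter

# Tiny illustrative list; the actual pipeline uses NLTK's stopword corpus.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "to", "in"}


def tokenize(text):
    """Lowercase, split into alphanumeric tokens, drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOPWORDS]


def build_vocab(documents, min_count=3):
    """Keep only words that occur at least `min_count` times overall."""
    counts = Counter(tok for doc in documents for tok in tokenize(doc))
    return {tok for tok, c in counts.items() if c >= min_count}


def split_by_year(papers):
    """Use the last year as test data and all earlier years for training."""
    last_year = max(p["year"] for p in papers)
    train = [p for p in papers if p["year"] < last_year]
    test = [p for p in papers if p["year"] == last_year]
    return train, test


docs = ["The model is novel", "A novel model", "Model of attention"]
vocab = build_vocab(docs)
train, test = split_by_year([{"year": 2015}, {"year": 2016}, {"year": 2017}])
```

Splitting by year rather than at random mimics the real setting, in which a model trained on past papers is applied to newly published ones.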
Evaluation Metrics. To evaluate the performance of different methods on citation count prediction, following previous studies (Bhat et al., 2015; Yuan et al., 2018), we adopt five evaluation metrics, including MAE, RMSE, OR@30, OR@50, and Spearman's rank correlation coefficient. MAE and RMSE measure the difference between the real value and the predicted value for a regression task. Spearman's rank correlation measures the overall correlation between the predicted list and the ground-truth list sorted by citation number in descending order. OR@k measures the overlapping ratio between the top k predicted results and the real ordered list.
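For concreteness, the metrics can be computed as in the sketch below. MAE, RMSE and Spearman's rank correlation follow their standard definitions (ties receive average ranks). For OR@k, we assume here that the top k predicted papers are compared with the top k papers of the ground-truth ranking; this reading and all function names are assumptions.

```python
import math


def mae(pred, true):
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)


def rmse(pred, true):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))


def overlap_ratio_at_k(pred, true, k):
    """Share of the top-k predicted papers that are also in the true top k."""
    top = lambda vals: set(sorted(range(len(vals)), key=lambda i: vals[i],
                                  reverse=True)[:k])
    return len(top(pred) & top(true)) / k


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for idx in order[i:j + 1]:
            ranks[idx] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def spearman(pred, true):
    """Pearson correlation of the ranks (undefined for constant inputs)."""
    rp, rt = average_ranks(pred), average_ranks(true)
    mp, mt = sum(rp) / len(rp), sum(rt) / len(rt)
    cov = sum((a - mp) * (b - mt) for a, b in zip(rp, rt))
    var_p = sum((a - mp) ** 2 for a in rp)
    var_t = sum((b - mt) ** 2 for b in rt)
    return cov / math.sqrt(var_p * var_t)


pred = [0.1, 0.4, 0.3, 0.9]
true = [0.2, 0.5, 0.1, 0.8]
scores = (mae(pred, true), rmse(pred, true),
          overlap_ratio_at_k(pred, true, 2), spearman(pred, true))
```

Lower MAE and RMSE are better, while higher OR@k and Spearman values indicate better ranking quality.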

Results and Analysis
In this subsection, we conduct a series of experiments on the effectiveness of the proposed model for the citation count prediction task.
Main Results. Table 3 presents the performance of different methods on citation count prediction. We can make the following observations. First, the four traditional baselines (LR, KNN, SVR and GBRT) perform worse than the two deep learning baselines (W&D, MILAM). These four baselines only utilize the wide features with traditional machine learning models. Second, MILAM performs consistently better than W&D, since it has a more elaborate architecture to model the review text. Finally, our model outperforms all the baselines by a substantial margin, especially on the ICLR dataset. Our model is able to integrate the wide features and learn a comprehensive representation of peer review text, which is the key to the performance improvement over the baselines. Overall, the two datasets show similar findings. In what follows, we report the results on the ICLR dataset due to the space limit.
Ablation Analysis. The major novelty of our model is that it utilizes abstract-review match and cross-review match to learn a comprehensive abstract-aware representation for peer review text. Moreover, we integrate the wide features for the prediction task. To examine the contribution of the three parts, we evaluate model variants obtained by removing each module from the complete model. We present the MAE results of our model and its three variants in Table 4. As we can see, all components are useful to improve the final performance.
Usefulness of Peer Review Text. A major motivation of this paper is that peer review text is useful for the citation prediction task, which has seldom been studied previously. Hence, we examine whether peer review text is also useful to improve traditional prediction models (LR, KNN, SVR and GBRT). Note that our focus is to verify the general usefulness of peer review text instead of finding the most suitable text features for the baselines. Here, we adopt the simple yet classic doc2vec model (Le and Mikolov, 2014) to encode peer review text into a vectorized representation. Then, we integrate these text features into the four baseline methods. As shown in Table 5, the performance of all four methods has been improved with the text features. The results show that peer review text is indeed generally useful for citation prediction.
Table 7 (excerpt), abstract of the sample paper: · · · we present a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure. Our method uses a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector. Our main result is that on an English to French translation task · · · Additionally, the LSTM did not have difficulty on long sentences. The LSTM also learned sensible phrase and sentence representations that are sensitive to word order and are relatively invariant to the active and the passive voice. Finally, we found that reversing the order of the words in all source sentences (but not target sentences) improved the LSTM's performance markedly.
Cross-venue Evaluation. To examine the robustness of our model, we further perform a cross-venue evaluation. For the ICLR (NIPS) test set, we apply the models trained on the full NIPS (ICLR) dataset. We only select the best baseline, MILAM, for comparison. As shown in Table 6, the performance decreases compared with the results in Table 3, since we use a training set from a different venue. However, the decline is small, and our model is still better than the baseline MILAM on both datasets. In the future, we will consider how to improve the cross-venue performance.
Parameter Sensitivity. Next, we investigate the performance with respect to two major parameters in our model, i.e., the word embedding size and the GRU hidden size. As shown in Figure 2(a) and Figure 2(b), our model is consistently better than the best baseline MILAM for all the parameter values. An embedding size of 300 and a hidden size of 128 yield the best results for our model.

Qualitative Analysis
Previously, we have shown that both the abstract-review match and the cross-review match are useful for the prediction performance. In Table 7, we perform a qualitative analysis on a sample paper with three reviews to understand how the two mechanisms work. We first compute the similarity weight between the abstract representation and a review sentence, and sort the review sentences according to such weights. As we can see, the comments on model design have overall been ranked in a higher position than those on experiments. With the abstract-review match, our model indeed identifies content that is more relevant to the abstract text.
Then, we analyze the corresponding semantic explanation of the cross-review match, in which we decompose an overall review representation into a parallel representation $v_{\parallel}$ and an orthogonal representation $v_{\perp}$. It is difficult to directly understand the two vectors. Instead, we compute the similarity between a comment sentence and the parallel (or orthogonal) representation of the review. Then we normalize the two similarity values into a distribution over the two representations (parallel or orthogonal). It can be seen that the comments focusing on the common aspect have a larger weight on $v_{\parallel}$. For example, for the comments from the first row, reviewer #2 and reviewer #3 have similar general comments about the model architecture, and both comments have very large weights on the parallel representation. In contrast, reviewer #1 raises a different comment on another model detail, corresponding to a large weight on the orthogonal representation. Interestingly, the fourth row corresponds to the comments on the experiments. The three comment sentences are more related to the parallel representation (i.e., the common issue), while they convey different attitudes. This phenomenon can be partially captured by the attentional representation (Eq. 10), which attends to the parallel representations of other reviews.
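The weighting used in this analysis can be illustrated with the following sketch, which scores a comment sentence against the parallel and orthogonal representations and turns the two scores into a distribution. Cosine similarity and softmax normalization are assumptions made for illustration, since the exact similarity and normalization functions are not specified above; the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def parallel_orthogonal_weights(sentence, parallel, orthogonal):
    """Distribution of a sentence over the parallel/orthogonal parts."""
    sims = [cosine(sentence, parallel), cosine(sentence, orthogonal)]
    exps = [math.exp(s) for s in sims]  # softmax normalization (assumption)
    total = sum(exps)
    return exps[0] / total, exps[1] / total


# Toy example: the sentence points in the direction of the parallel part.
w_par, w_orth = parallel_orthogonal_weights([1.0, 0.0], [2.0, 0.0], [0.0, 3.0])
```

A sentence that discusses an aspect shared by the other reviews would thus receive the larger weight on the parallel representation.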

Conclusion
In this paper, we studied how to utilize the peer review text to improve the citation prediction task. We developed a joint deep and wide model that was able to integrate both deep and wide features into a unified predictive function. The deep features were learned from peer review text by applying the abstract-review and cross-review match mechanisms. We constructed two evaluation datasets with peer review text. Extensive results have demonstrated the effectiveness of our proposed model. Currently, we only consider the semantic-level representations from peer review. As future work, we will consider how to extend our work by modeling sentiments of review text.