Hierarchical Bi-Directional Self-Attention Networks for Paper Review Rating Recommendation

Review rating prediction from text reviews is a rapidly growing research area with a wide range of applications in natural language processing. However, most existing methods either use hand-crafted features or learn features with deep learning from a simple text corpus, ignoring the hierarchies within the data. In this paper, we propose a Hierarchical bi-directional self-attention Network framework (HabNet) for paper review rating prediction and recommendation, which can serve as an effective decision-making tool for the academic paper review process. Specifically, we leverage the hierarchical structure of paper reviews with three levels of encoders: a sentence encoder (level one), an intra-review encoder (level two) and an inter-review encoder (level three). Each encoder first derives a contextual representation of its level and then generates a higher-level representation; after training, we are able to identify useful predictors for the final acceptance decision, as well as to help discover inconsistencies between numerical review ratings and the text sentiment conveyed by reviewers. Furthermore, we introduce two new metrics to evaluate models under data imbalance. Extensive experiments on a publicly available dataset (PeerRead) and our own collected dataset (OpenReview) demonstrate the superiority of the proposed approach over state-of-the-art methods.


Introduction
With the increasing number of academic paper submissions in recent years, making final decisions manually incurs significant overhead for program chairs; it is therefore desirable to automate the process. In this study, we aim at utilizing document-level semantic analysis for paper review rating prediction and recommendation. Given the reviews of each paper from several reviewers as input, our goal is to infer the final acceptance decision for that paper and each reviewer's evaluation as a numeric rating (e.g., 1-10 points). Paper review rating prediction and recommendation is a practical and important task in AI applications that can help improve the efficiency of the paper review process. It is also intended to enhance the consistency of the assessment procedure and its outcomes, and to diversify the paper review process by comparing human-recommended ratings with machine-recommended ratings. In the literature, most existing studies cast review rating prediction as a multi-class classification/regression task (Pang and Lee, 2005). They build a predictor using supervised machine learning models with review texts and corresponding ratings. Due to the importance of features, much research focuses on extracting effective features such as context-level features (Qu et al., 2010) and user features (Gao et al., 2013) to boost prediction performance. However, feature engineering is time-consuming and labor-intensive.
Recently, with the development of neural networks and their wide application, various deep learning-based models have been proposed for automatically learning features from text data (Bengio et al., 2013). Existing deep learning models usually learn continuous representations at different granularities (e.g., word, phrase, sentence, document) from a text corpus (Pennington et al., 2014; Lai et al., 2015; Kim, 2014; Conneau et al., 2017; Wang, 2018; Qiao et al., 2018). Although deep learning models can automatically learn extensive feature representations, they cannot efficiently capture the hierarchical relationships inherent to review data. To address this problem, Yang et al. (2016) studied a hierarchical architecture and implemented it in a deep learning framework to learn a better document-level representation. Also, following the success of attention mechanisms in tasks such as machine translation and question answering (Vaswani et al., 2017), Shen et al. (2018b) designed a directional self-attention network to obtain context-aware embeddings for words and sentences. Despite the great progress made by these models, they do not focus on the task of paper review rating recommendation and cannot be directly applied to it effectively, for the following reasons. First, the review data is hierarchical in nature: there is a three-level hierarchical structure in the review data (word level, intra-review level and inter-review level), while previous models capture only two levels of this hierarchy (the word level and the intra-review level). Second, paper reviews are usually much longer than other reviews (e.g., product, movie or restaurant reviews), while most of these models work on such shorter reviews and do not leverage up-to-date representation techniques such as BERT (Devlin et al., 2019) and SciBERT (Beltagy et al., 2019).
In this paper, we propose a novel neural network framework for paper review rating recommendation that takes word, intra-review and inter-review information into account. Specifically, inspired by HAN (Yang et al., 2016) and DiSAN (Shen et al., 2018b), we introduce a Hierarchical Bi-directional self-Attention Network (HabNet) framework to effectively incorporate the different levels of hierarchical information. The proposed framework consists of three main modules connected end-to-end: a sentence encoder, an intra-review encoder and an inter-review encoder, which together model the hierarchical structure of review data as comprehensively as possible. The outputs of the inter-review encoder are used as features to build the rating predictor without any feature engineering. We release the code and the data we collected to enable replication and application to new tasks, available at https://github.com/RingBDStack/HabNet.
The contributions of this work are as follows: • We present a novel framework to guide the investigation and assessment of the effects of hierarchies on review data. To the best of our knowledge, this is the first work that incorporates different levels of semantic information into a hierarchical neural network to perform paper review rating recommendation.
• We introduce two new metrics to better evaluate models when the distributions of classes are highly imbalanced (such as the paper review data we are working with).
• Empirical results on OpenReview (ours) and the extended PeerRead dataset demonstrate the effectiveness of the proposed method in automatically making final acceptance decisions and helping reveal rating inconsistencies between the semantic review content and the numerical review ratings.
Related Work

Review Rating Prediction
Review rating prediction is a basic task in sentiment analysis. It was initially studied by Pang and Lee (2005), who cast the problem as a multi-class classification/regression task. In the literature, most studies following this approach used supervised machine learning models for review rating prediction. Since the features used by these models are critical for prediction performance, more refined textual features have been exploited. Qu et al. (2010) introduced a bag-of-opinions representation, where an opinion is composed of a root word, a set of modifier words and one or more negation words. Gao et al. (2013) used user-specific and product-specific features to increase the reliability of sentiment classification. With the popularity of deep learning models, many works were proposed to learn features automatically from text corpora instead of hand-crafting them. Lai et al. (2015) applied a recurrent structure to a convolutional neural network to capture contextual information for learning word representations. Kang et al. (2018) collected a dataset of peer reviews from several conferences and predicted paper acceptance decisions from paper drafts. Gao et al. (2019) focused on predicting after-rebuttal scores using their presented corpus. Hua et al. (2019) applied argument mining on their AMPERE dataset to assess the efficiency of the reviewing process. Li et al. (2019) designed a neural model to predict the citation counts of accepted papers. Yang et al. (2018) designed a hierarchical attention-based CNN for automatic academic paper rating using the source paper; it adopts the original attention mechanism, which cannot capture interactions between elements at the same level. Leng et al. (2019) proposed DeepReviewer for automatic paper review, utilizing a paper's grammar and innovation to help learn better representations and predict the paper's final review score. Different from the above works, we aim at predicting the final acceptance decisions for papers and the ratings for reviews with a self-attention-based framework using raw review texts. Our collected dataset also contains the rating score of each review and the final decision of each paper.

Attention Mechanism
Attention mechanisms were proposed to improve the performance of various NLP tasks. There are two common attention mechanisms: additive attention (Bahdanau et al., 2015) and multiplicative attention (Rush et al., 2015; Vaswani et al., 2017; Peng et al., 2019), which use different compatibility functions to compute the attention weights. Lin et al. (2017) introduced self-attention to extract an interpretable sentence embedding. Yang et al. (2016) proposed a hierarchical attention network for document classification, which applies the attention mechanism at the word and sentence levels. Vaswani et al. (2017) built a simple network architecture based solely on attention mechanisms, without convolutions or recurrence. Yin and Schütze (2018) proposed an attentive convolution network that derives higher-level features for a word from information extracted from non-local context. Shen et al. (2018b) designed a new attention mechanism that is directional and multi-dimensional, and proposed a neural network based solely on this mechanism to learn sentence embeddings. Shen et al. (2018a) proposed a memory-efficient bi-directional self-attention network that splits a sequence into blocks to save memory.
Our framework is also based on the self-attention mechanism; it combines the hierarchical design of HAN (Yang et al., 2016) with DiSAN's ability to capture relationships between words from two directions (Shen et al., 2018b).

Methodology
In this section, we first describe the problem setting, and then present the details of our proposed framework for paper review rating prediction and recommendation.

Problem Setting
We consider the problem of paper review rating prediction and recommendation on a dataset containing K papers, where each paper has M reviews with corresponding ratings and a decision class. Concretely, for a scientific paper we are given the set R = {(r_1, c_1), ..., (r_M, c_M), y}, where r_i is the i-th reviewer's text review, c_i is its associated numeric rating and y is the final decision (i.e., accept or reject). Let w_{i,j,t} denote the t-th word in the j-th sentence of the i-th review document. Given a new paper with a set of reviews R = {r_1, ..., r_M}, our goal is to predict the decision class y, which enables the program chairs to automatically make the final decision/recommendation, and also to generate a rating c for each review r that is consistent with the text sentiment, as an aid to reviewers for discovering inconsistencies between ratings and review sentiments during the review process. Similar to (Zhang et al., 2010; Hassan and Shoaib, 2020), we treat the paper review rating prediction problem as a multi-class classification problem, where the class labels are the rating scores c. We treat the final decision prediction as a binary classification problem, where the class labels are the decisions y.

Our Approach
The proposed framework takes raw review texts as input and consists of four main components: a sentence encoder, an intra-review encoder, an inter-review encoder and a rating predictor, as shown in Figure 1. Before describing each component in detail, we introduce the multi-dimensional source2token self-attention module following (Shen et al., 2018b), taking the module in the sentence encoder as an example. The attention weight of each word embedding we_{i,j,t}, t ∈ [1, L], is obtained by applying softmax over the scores f(we_{i,j,t}), t ∈ [1, L], calculated by Eq. (1), where W^T, W^{(1)}, b^{(1)} and b are trainable parameters. The output of this module is the weighted sum of the inputs (e.g., we_{i,j,t}, t ∈ [1, L], in the sentence encoder).
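As a concrete illustration, the module's mechanics (a per-dimension softmax over the token scores, followed by a weighted sum) can be sketched as below. The parameter names W1, b1, W, b mirror the trainable parameters of Eq. (1), but the specific shapes and the tanh activation are our assumptions, not taken from the paper:

```python
import numpy as np

def source2token_attention(inputs, W1, b1, W, b):
    """Multi-dimensional source2token self-attention, sketched after
    Shen et al. (2018b). `inputs` has shape (L, d): one row per token.
    W1, b1, W, b are illustrative stand-ins for the trainable parameters."""
    # Alignment scores f(x_t) = W^T tanh(W1 x_t + b1) + b: one d-dim score per token.
    scores = np.tanh(inputs @ W1.T + b1) @ W.T + b        # (L, d)
    # Softmax over the token axis gives per-dimension attention weights.
    e = np.exp(scores - scores.max(axis=0, keepdims=True))
    weights = e / e.sum(axis=0, keepdims=True)            # (L, d), columns sum to 1
    # Output: weighted sum of the inputs, a single d-dim encoding.
    return (weights * inputs).sum(axis=0)                 # (d,)
```

Note that with a single input token the softmax weights are all one, so the module simply returns that token's embedding.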
• Sentence Encoder. The sentence encoder is designed to capture the relationships between the words in a sentence and the importance of each word to the meaning of that sentence. It is shown in the first part of Figure 1. It first generates a context-aware embedding for each word in a sentence using the bi-directional self-attention module (Bi-SAN) (Shen et al., 2018b). Based on these context-aware word embeddings, the encoding for the sentence, which contains all words' information and the relations between words, is then obtained from the multi-dimensional source2token self-attention module (Shen et al., 2018b), which combines the context-aware word embeddings into a sentence encoding. Specifically, the inputs of the sentence encoder are pre-trained word embeddings obtained from the raw review texts using GloVe pre-trained word embeddings (Pennington et al., 2014), BERT (Devlin et al., 2019) or SciBERT (Beltagy et al., 2019). Each word (e.g., w_{i,j,1}, w_{i,j,2}) is represented by a d_e-dimensional vector. These vectors are fed into Bi-SAN, which includes a forward self-attention network and a backward self-attention network. Each of these two networks outputs a refined embedding for each word, and the two refined embeddings of each word are concatenated by Bi-SAN as the final context-aware embedding for that word (e.g., we_{i,j,1} ∈ R^{2d_e}). The context-aware embedding of each word has 2d_e dimensions because of the two networks (forward and backward) in Bi-SAN. After obtaining the context-aware embedding of each word, the sentence encoder generates an encoding s_{i,j} ∈ R^{2d_e} for each sentence through the multi-dimensional source2token self-attention module.
• Intra-Review Encoder. The sentences in one review may have temporal order, causality and other logical relationships, and some sentences carry more information for the review. The intra-review encoder is therefore designed to capture these relations within each individual review. Its input is the sentence embedding s_{i,j} generated by the first-level sentence encoder. The structure of the intra-review encoder is similar to that of the sentence encoder: it first feeds the sentence embeddings to the Bi-SAN module, which captures the relations between sentences and the importance of one sentence to another from two directions by generating a forward embedding s^{fw}_{i,j} and a backward embedding s^{bw}_{i,j} for sentence s_{i,j}. The final embedding se_{i,j} ∈ R^{4d_e} for each sentence s_{i,j} in the i-th review is generated by concatenating s^{fw}_{i,j} and s^{bw}_{i,j}: se_{i,j} = [s^{fw}_{i,j} || s^{bw}_{i,j}], where || denotes the concatenation operation. Next, the multi-dimensional source2token self-attention module takes se_{i,j} as input and generates an encoding r_i ∈ R^{4d_e} for the i-th review by combining all se_{i,j} in that review according to their importance (attention) weights. As shown in the second part of Figure 1, the intra-review encoder generates an encoding r_i for each review of the same paper. The dimension of r_i is 4d_e, twice that of the sentence encoding, because of Bi-SAN.
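The dimension growth across the first two levels can be traced with a shape-only sketch (d_e = 50 is an arbitrary choice; Bi-SAN's forward/backward concatenation doubles the dimension at each level):

```python
import numpy as np

d_e = 50
word_emb = np.zeros(d_e)                                      # w_{i,j,t}: d_e dims
word_ctx = np.concatenate([word_emb, word_emb])               # we_{i,j,t} = [fw || bw]: 2*d_e
sentence_enc = word_ctx                                       # s_{i,j}: 2*d_e (source2token keeps the dim)
sentence_ctx = np.concatenate([sentence_enc, sentence_enc])   # se_{i,j} = [fw || bw]: 4*d_e
review_enc = sentence_ctx                                     # r_i: 4*d_e
assert word_ctx.shape == (2 * d_e,) and review_enc.shape == (4 * d_e,)
```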
• Inter-Review Encoder. The integration of different reviews is essential for performing a comprehensive analysis and supporting the final decision on a paper. We use the inter-review encoder as the third level of our framework to integrate information from the different reviews of each paper, as shown in the third part of Figure 1. It first feeds the second-level encoding r_i of the i-th review of a paper to a bi-directional GRU (Bahdanau et al., 2015) layer, and then uses a Bi-SAN to model the relations between reviews from two directions by generating a refined encoding re_i for this review. Thus re_i contains information from the other reviews. Then, a multi-dimensional source2token self-attention module is applied to these encodings re_i to get a final compact vector representation rs of the paper. This encoder can handle papers with different numbers of reviews by using padding. The whole process is formulated as follows. Step 1: The encoding r_i of each review of a paper, generated by the intra-review encoder, is fed to the bi-directional GRU layer, which outputs a new encoding for each review (we still use r_i to denote the new encoding of the i-th review). These new encodings are then fed to the following Bi-SAN module.
Step 2: Bi-SAN has a forward self-attention network and a backward self-attention network. Two attention matrices, denoted P^{i(fw)} ∈ R^{4d_e×M} and P^{i(bw)} ∈ R^{4d_e×M}, are calculated for the i-th review in these two networks respectively, where M is the number of reviews for one paper. The forward encoding re^{fw}_i and backward encoding re^{bw}_i for this review are then generated as follows (⊙ denotes element-wise multiplication):

re^{fw}_i = Σ_{o=1}^{M} P^{i(fw)}_{·o} ⊙ r_o,    re^{bw}_i = Σ_{o=1}^{M} P^{i(bw)}_{·o} ⊙ r_o,

where P^{i(fw)}_{·o} and P^{i(bw)}_{·o} denote the o-th columns of the attention matrices P^{i(fw)} and P^{i(bw)} respectively. The refined encoding re_i for the i-th review, which contains information from the other reviews of the same paper, is generated by concatenating the two: re_i = [re^{fw}_i || re^{bw}_i].
Step 3: The multi-dimensional source2token self-attention module takes the encodings of all reviews of one paper output by Bi-SAN as input, computes an importance weight for each review encoding re_i, and then combines all these review encodings to get the final vector representation rs of the paper based on the importance weights, in a similar way as shown in Eq. (2).
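The column-wise refinement of Step 2 can be sketched as follows. The attention matrices P_fw and P_bw are taken as given inputs here (in the model they are produced by the forward and backward self-attention networks), and concatenating the forward and backward refinements into re_i is our reading of the construction:

```python
import numpy as np

def refine_reviews(R, P_fw, P_bw):
    """Step 2 of the inter-review encoder, sketched with externally supplied
    attention matrices. R has shape (M, d): one row per review encoding r_i.
    P_fw[i] and P_bw[i] have shape (d, M); their o-th columns weight review
    r_o element-wise when refining review i."""
    M, d = R.shape
    re = np.empty((M, 2 * d))
    for i in range(M):
        re_fw = sum(P_fw[i][:, o] * R[o] for o in range(M))  # forward refinement
        re_bw = sum(P_bw[i][:, o] * R[o] for o in range(M))  # backward refinement
        re[i] = np.concatenate([re_fw, re_bw])               # re_i = [fw || bw]
    return re
```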
• Rating Prediction and Recommendation. On top of the three levels of encoding above, a fully connected layer with a softmax function makes the rating prediction and the final recommendation. Specifically, it takes the compact representation rs of all reviews as input to predict the final decision, and the encoding r_i of the i-th review as input to predict the corresponding rating. It is worth noting that the predicted review ratings are consistent with the text sentiment conveyed by the reviewers, so they can serve as guidance for finding inconsistencies between the semantic review content and the numerical review ratings during the review process.

Model Variants
To understand the contribution of the different components of the proposed framework, we derive several variants for an ablation study. Below are the three variants implemented in our experiments.
HabNet-V1: After obtaining the encoding r_i of each review of a paper, output by the intra-review encoder, we sum these encodings with equal weights and use the result as the final encoding rs of that paper, i.e., rs = (1/M) Σ_{i=1}^{M} r_i. The inter-review encoder is thus removed in this variant. HabNet-V2: We remove the sentence encoder from the proposed framework as the second variant, to verify the sentence encoder's contribution. Specifically, for a sentence, we use the average of all its words' pre-trained embeddings as its encoding, and feed such sentence encodings to the intra-review encoder. This variant therefore cannot encode the relations between the words in a sentence.
HabNet-V3: We remove the intra-review encoder as the third variant, to understand how effectively this encoder captures the interactions between sentences in a review document and to demonstrate its importance to the proposed framework. To be specific, the encoding of a review document is the mean of the sentence embeddings in that review.
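The mean-pooling replacements used by the three variants can be sketched as simple stand-ins (in each variant the rest of the pipeline is unchanged; only the removed encoder is replaced by the corresponding average):

```python
import numpy as np

def v1_paper_encoding(review_encodings):
    """HabNet-V1: an equal-weight average of review encodings replaces the
    inter-review encoder, i.e. rs = (1/M) * sum_i r_i."""
    return np.mean(review_encodings, axis=0)

def v2_sentence_encoding(word_embeddings):
    """HabNet-V2: the average of pre-trained word embeddings replaces the
    sentence encoder."""
    return np.mean(word_embeddings, axis=0)

def v3_review_encoding(sentence_embeddings):
    """HabNet-V3: the average of sentence embeddings replaces the
    intra-review encoder."""
    return np.mean(sentence_embeddings, axis=0)
```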

Dataset
We conduct experiments on two datasets to validate our approach for scientific paper decision recommendation and review rating prediction. One is the OpenReview dataset, which we collected. The other extends the PeerRead dataset originally published by Kang et al. (2018). Table 1 shows the statistics of OpenReview and Extended PeerRead. For all experiments on these two datasets, all samples are randomly shuffled before splitting.
• OpenReview. This collection contains all reviews for ICLR conference and workshop papers from 2017 to 2019. Generally, each paper has 3-5 reviews with corresponding ratings; each rating is a numeric value from 1 to 10 (10 being the highest). There is also a decision (accept or reject) associated with each paper. The numbers of accepted and rejected papers are 1341 and 1962, respectively. For paper decision recommendation, we use 2293, 491 and 492 papers as the training, validation and testing sets, respectively. For review rating prediction, 7600, 1000 and 1000 reviews are used as the training, validation and testing sets, respectively. As shown in Table 1, the numbers of reviews with different ratings are highly imbalanced.
• Extended PeerRead. The majority of papers with reviews in the original PeerRead dataset are accepted papers collected from NIPS 2013-2017; as shown in Table 1, the numbers of accepted and rejected papers are 2054 and 0, respectively. The original PeerRead dataset therefore cannot be used directly for predicting final accept/reject decisions due to this severe imbalance. We thus collected an additional 2211 papers from the ICLR 2020 conference, together with their reviews, from the openreview website to extend the PeerRead dataset. The extended PeerRead dataset has 4265 papers and 13721 reviews in total. However, since most review ratings are not available in the original PeerRead, we only use this extended dataset to predict the final decision.

Evaluation Metrics and Baselines
We use Accuracy, Macro-F1 and Micro-F1 to evaluate the effectiveness of our framework on the paper decision recommendation task. For review rating prediction, due to the imbalanced distribution of ratings (shown in Table 1) and the ineffectiveness of standard remedies for the imbalance problem (such as the oversampling and rating-range reduction we tried), two new metrics with better discernibility are designed to evaluate our framework and the baselines, in addition to Accuracy. Distance Measure (DM). The distance between the true label and the predicted label is crucial for evaluating a model when there are multiple ordered labels, as in our review rating prediction task: the smaller the distance, the better the model. We therefore design a new metric that incorporates the distance between the predicted rating and the true rating, which distinguishes better models from a more reasonable perspective. For example, a model that predicts a rating of 8 as 7 is much better than one that predicts it as 3. Let p_i and r_i be the predicted rating and true rating for the i-th sample respectively, and n the total number of samples. We define DM as follows:

d_i = |p_i - r_i|,    DM = 1 - (1/n) Σ_{i=1}^{n} d_i / d_max.    (4)

DM first calculates the distance d_i for each sample and then averages the normalized distances over all samples according to Eq. (4). When all predictions are correct, DM achieves its best value of 1. When all predictions are wrong and the distances between predicted and true ratings all equal the maximum distance d_max (in our case d_max = 9), DM is 0. As the distances become smaller, DM becomes larger, so it evaluates models appropriately. The range of DM is [0, 1]; the larger the value, the better the algorithm works.
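A minimal implementation of DM as described above (assuming d_i = |p_i - r_i| and d_max = 9 for 1-10 ratings):

```python
def distance_measure(preds, trues, d_max=9):
    """Distance Measure (DM): 1 when every prediction is exact, 0 when every
    prediction is off by the maximum distance d_max. A sketch following the
    stated properties of the metric."""
    n = len(preds)
    return 1.0 - sum(abs(p - r) / d_max for p, r in zip(preds, trues)) / n
```

For instance, predicting a true rating of 8 as 7 scores higher under DM than predicting it as 3, which Accuracy alone cannot distinguish.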
Optimized Precision (OP). It is important to correctly predict all classes when the data is imbalanced. Inspired by Hossin and Sulaiman (2015), we combine accuracy and the recalls of all classes into a unified measure that better handles imbalanced data. Let ACC be the accuracy, N the number of classes, and R_i the recall of the i-th class, i = 1, ..., N. We define OP as follows:

OP = ACC - ( Σ_{i<j} |R_i - R_j| ) / ( Σ_{i=1}^{N} R_i ).    (5)

As shown in Eq. (5), OP first computes the absolute differences between the recalls of each pair of classes and sums them, then normalizes the sum by the sum of all recalls and subtracts the result from the accuracy. In this way, the metric rewards models that achieve both high accuracy and high, balanced recall across all classes. The higher the value of OP, the better the model fits the data. We compare our proposed method with fourteen state-of-the-art methods and three variants of our model: (1) 10 flat baselines: three traditional text classification models, Support Vector Machines (SVM), Logistic Regression (LR) and Naïve Bayes (NB); five deep learning models, RNN (i.e., Bi-GRU) (Cho et al., 2014; Bahdanau et al., 2015), TextCNN (Kim, 2014), TextRCNN (Lai et al., 2015), VDCNN (Conneau et al., 2017) and DPCNN (Johnson and Zhang, 2017); and two attention-based models, Transformer (Vaswani et al., 2017) and SA-Sent-EM (Lin et al., 2017), which exploit various relationships in review text. (2) 
4 hierarchical baselines which leverage the hierarchical structure of the dataset: HAN-extended is an extension of HAN (Yang et al., 2016) re-implemented by us; we also implement three Bert-based baselines (Devlin et al., 2019; Beltagy et al., 2019) using large pre-trained contextual embeddings. Specifically, Bert-base and Bert-large use 768- and 1024-dimensional Bert embeddings respectively, while SciBert utilizes 768-dimensional SciBert embeddings. For our proposed framework HabNet, apart from GloVe embeddings, we also conduct experiments using the above three Bert-based contextual embeddings. (3) To demonstrate the contribution of each encoder, we also implement three variants of our proposed framework.
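Returning to the OP metric defined above, a minimal sketch under the assumption that the penalty is the sum of pairwise absolute recall differences normalized by the sum of all recalls (for N = 2 this reduces to the original formulation of Hossin and Sulaiman (2015)):

```python
from itertools import combinations

def optimized_precision(acc, recalls):
    """Optimized Precision (OP): accuracy penalized by the normalized sum of
    pairwise absolute recall differences, so balanced per-class recall
    scores higher. A sketch following the description of Eq. (5)."""
    penalty = sum(abs(ri - rj) for ri, rj in combinations(recalls, 2))
    return acc - penalty / sum(recalls)
```

With perfectly balanced recalls the penalty vanishes and OP equals the accuracy; a model that ignores a minority class is penalized even if its accuracy is high.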

Experimental Settings
We use raw review texts as input for all models. For the decision recommendation task, 50-dimensional pre-trained GloVe word embeddings are used for our framework and HAN-extended, while 100-dimensional ones are adopted for the other deep learning models, except the Bert-based baselines, which use the corresponding Bert embeddings. The number of training epochs is set to 100. For the rating prediction task, except for the Bert-based baselines, 100-dimensional pre-trained GloVe word embeddings are used for all models. The number of training epochs is set to 50, since all models converge quickly. We use cross entropy as the objective function to train all deep learning models. Common hyperparameters, such as the learning rate and batch size, are set empirically. For all experiments on HabNet and its variants, we train each model 10 times and report the average results.

Experimental Results and Discussion
The experimental results are shown in Table 2. For the paper decision recommendation task, HabNet achieves the best performance regardless of which embedding is used, demonstrating the effectiveness and generality of our framework. Specifically, compared with the flat baselines, our framework with GloVe embeddings, i.e., HabNet (GloVe), performs much better, which shows that our framework makes good use of the hierarchical structure in the dataset. Compared with the hierarchical baselines, HabNet (Bert-base), HabNet (Bert-large) and HabNet (SciBert) obtain good performance gains (5.4%, 8.2%, 5.4% and 8.2%, 8.0%, 4.1% in terms of accuracy on the two datasets) over the corresponding Bert-based baselines. This improvement indicates that our framework can capture the relationships between words, sentences and reviews in the dataset. Although the three Bert-based baselines obtain contextual word embeddings, they cannot capture intra-review-level and inter-review-level relationships as our framework does. Even HabNet (GloVe) still performs better than the three Bert-based baselines with pre-trained contextual embeddings and HAN-extended, which further demonstrates the encoders' ability to capture the three-level relationships. In addition, the best performance of our framework on both datasets validates its generality, and HabNet outperforming all baselines with each kind of embedding consolidates this. It is worth noting that the results of all models on the extended PeerRead dataset are higher than those on the OpenReview dataset. The reason may be that the review texts (especially those from NIPS 2013-2017) in the PeerRead dataset are much shorter and less complex than those in the OpenReview dataset; this suggests that our framework handles long review texts much better than the other models. Note that Table 2 only contains results on the OpenReview dataset for review rating prediction, as PeerRead does not contain ratings. HabNet with various embeddings (including GloVe and Bert embeddings) achieving the best performance again demonstrates the effectiveness and generalization ability of our framework, since HabNet shows a similar performance improvement over the flat and hierarchical baselines as in the paper decision recommendation task. Furthermore, the ratings predicted by HabNet, although not completely correct, can still be used as an aid to find inconsistencies between the given ratings and the text sentiment conveyed by reviewers.

Ablation Study
We conduct an ablation study of our framework to evaluate the contribution of each component. The results are shown in Table 3. For the paper decision recommendation task, HabNet's better performance over HabNet-V1 on both datasets indicates that the inter-review encoder integrates the information from a paper's different reviews well, which helps the decision recommendation. That HabNet performs better than HabNet-V2 verifies the importance of the sentence encoder, which encodes the relationships between the words in a sentence. HabNet outperforming HabNet-V3 demonstrates the ability of the intra-review encoder to capture sentence-level relations in a review text and shows that such relations between sentences contribute much information to the meaning of the review document. The results of HabNet and its variants on the review rating prediction task show a similar trend, which further validates the contribution of the different encoders. In conclusion, the three encoders help HabNet capture the three-level relationships in the dataset, which plays a vital role in improving prediction performance.

Case Study
To examine the ability of our proposed framework to capture the importance of words in the scientific paper decision recommendation task, we visualize the top-15 most attended words for accepted and rejected papers, as shown in Figure 2. One can see that for the accepted papers, the attention on positive words such as "excellent" and "competitive" is much higher than on other words. For the rejected papers, negative words such as "unsatisfactory" and "incoherent" have higher attention weights than others. Intuitively, reviewers express their tendency toward the outcome of a paper more clearly through such keywords. Moreover, compared to the attention weights of words in accepted papers, the attention weights of words in rejected papers are generally greater. A possible reason is that reviewers' comments on rejected papers are more consistent than those on accepted papers.
We also visualize the sentence-level attention on one accepted paper and one rejected paper, as shown in Figure 3; the darker the color, the larger the attention weight. For the accepted paper, the sentence with the darkest color expresses a strong positive attitude toward the acceptance decision, while the other sentences, which lack strong sentiment, have smaller attention weights (i.e., much lighter colors). The same trend appears in the rejected paper. This result shows that our framework can capture the most important sentence-level signal within a review for predicting the final decision on a paper.

Error Analysis
We investigate the cases on the OpenReview dataset that HabNet did not predict correctly and find that: (1) 67% of them are predicted by HabNet as rejected papers but are actually accepted; (2) 33% of them are predicted as accepted but are in fact rejected.
We randomly select 20 examples from case (1) and read the review texts carefully. We find that: a) 18 of them contain many negative keywords and phrases, such as "unclear", "limited", "hard to interpret" and "not provable". Although there are also positive expressions such as "looks good" and "interesting", the majority of the content is negative, so the overall sentiment of the review text is classified as negative by HabNet. b) 2 of them contain indicators of acceptance, such as the keywords "accepted" or "acceptance", but they also contain many negative words, such as "confusing" and "no comparison". These mixed positive and negative signals together prevent the model from making the correct prediction.
We also randomly select 20 examples from case (2), which fall into three cases: a) 7 of them contain very strong acceptance keywords and sentences, such as "pretty impressive", "promising" and "I recommend acceptance"; because of these strong indicators, HabNet predicts them as accepted papers. b) 2 of them have strong indicators of rejection, such as "The novelty of the paper is not enough to justify its acceptance", but they also contain several strong positive keywords which shift the overall sentiment of the review text and thus mislead HabNet's prediction. c) 11 of them contain many positive and negative keywords and sentences at the same time, with no strong indicator of rejection. HabNet cannot handle these cases well, because it takes all the positive and negative information into consideration.

Conclusion
In this paper, a scientific paper review dataset called OpenReview is collected from the ICLR OpenReview website and released. We observe a three-level hierarchical structure in this dataset (i.e., word level, intra-review level and inter-review level): the information in and relationships between the reviews of one paper may affect the final decision, and so may the relationships between words and sentences within each review. Based on these observations, a hierarchical bi-directional self-attention network (HabNet) framework is proposed for paper review rating prediction and recommendation, which models the interactions among words, sentences, and intra- and inter-review information in an end-to-end manner. Moreover, considering the imbalanced distribution of classes (i.e., ratings from 1 to 10) in the review rating prediction task, we design two new metrics to better evaluate models. Experimental results on two datasets (OpenReview and extended PeerRead), for both predicting final decisions of submitted papers and identifying ratings of reviews, demonstrate that our proposed framework captures the hierarchical structure of words, sentences and reviews well and outperforms other models. In the future, we plan to investigate multi-task learning for paper review rating recommendation.
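The three-level composition described above can be illustrated with a minimal sketch (plain NumPy with untrained random parameters; the query vectors and pooling scheme here are illustrative assumptions, whereas the actual model uses trained bi-directional self-attention encoders at each level): word vectors are attention-pooled into sentence vectors, sentences into review vectors, and reviews into a single paper representation that a classifier would consume.

```python
import numpy as np

def attend_pool(H, q):
    """Attention-pool a sequence of vectors H (n x d) with query q (d,) into one vector."""
    scores = H @ q / np.sqrt(H.shape[1])  # relevance of each item to the query
    scores -= scores.max()                # numerical stability
    a = np.exp(scores)
    a /= a.sum()                          # softmax attention weights
    return a @ H                          # weighted sum -> single d-dim vector

d = 16
rng = np.random.default_rng(1)
q_sent, q_rev, q_paper = rng.normal(size=(3, d))  # one (hypothetical) query per level

# a paper = 3 reviews; each review = 4 sentences; each sentence = a few word vectors
paper = [[rng.normal(size=(int(rng.integers(3, 7)), d)) for _ in range(4)]
         for _ in range(3)]

# level 1: words -> sentence vectors; level 2: sentences -> review vectors;
# level 3: reviews -> paper vector
review_vecs = []
for review in paper:
    sent_vecs = np.stack([attend_pool(words, q_sent) for words in review])
    review_vecs.append(attend_pool(sent_vecs, q_rev))
paper_vec = attend_pool(np.stack(review_vecs), q_paper)
```

The final `paper_vec` would feed the acceptance-decision classifier, while the intermediate review vectors would support per-review rating prediction.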

Table 3 :
Results of the ablation study of our framework on the OpenReview and Extended PeerRead datasets. Accuracy, Macro-F1 and Micro-F1 are the metrics for the decision prediction/recommendation task, while Accuracy, DM and OP are used for the rating prediction task; there are no results for the latter task on the extended PeerRead dataset because it does not provide a rating for each review. HabNet achieves the best results (in bold); arrow ↑ indicates statistical significance (p < 0.05).

Figure 2 :
Attention weights of the top 15 approbatory words for accepted and rejected papers; the left subfigure is for accepted papers, the right one for rejected papers.
(a) A review for an accepted paper. (b) A review for a rejected paper.

Figure 3 :
Sentence-level attention visualization for accepted and rejected papers.
Peng et al. (2017) used very deep convolutional networks to learn hierarchical representations of whole sentences. Johnson and Zhang (2017) studied deepening word-level CNNs to capture global representations of text. Peng et al. (2018) designed a deep Graph-CNN to learn both non-consecutive and long-distance features of text.

Table 1 :
Statistics of OpenReview and Extended PeerRead; "-" means the rating distribution is unavailable.