Multi-task Peer-Review Score Prediction

Automatic prediction of the peer-review aspect scores of academic papers can be a useful assistant tool for both reviewers and authors. To handle the small size of published datasets for a target aspect score, we propose a multi-task approach that leverages additional information from other aspect scores to improve performance on the target. Two key problems in building multi-task models are how to select proper auxiliary resources and how to select proper shared structures. We propose a multi-task shared structure encoding approach that automatically selects good shared network structures as well as good auxiliary resources. Experiments on peer-review datasets show that our approach is effective and achieves better performance on the target scores than the single-task method and naive multi-task methods.


Introduction
Automatic prediction of the peer-review aspect scores (e.g., "clarity" and "originality") of academic papers can be a useful assistant tool for both reviewers and authors. On the one hand, the number of submissions to AI-related international conferences has increased significantly in recent years, which makes the review process challenging; rejecting papers of evidently low quality early can reduce the workload. On the other hand, pointing out weak aspects to the authors can help them improve their papers.
Several existing works related to paper review concentrate on the quality of the review itself (De Silva and Vance, 2017; Langford and Guzdial, 2015). Huang (2018) predicted the acceptance of a paper based only on its visual appearance. Automatic essay scoring (Dong and Zhang, 2016; Dong et al., 2017; Amorim et al., 2018) can be regarded as a related sub-topic that mainly focuses on grammatical and syntactic features in short essays. PeerRead is the first public dataset of scientific peer reviews for research purposes (Kang et al., 2018). It provides detailed peer reviews, including the final decisions, aspect scores such as clarity and originality, and the review contents, and it raises two NLP tasks: paper acceptance classification and review aspect score prediction. We focus on the latter in this paper. However, the dataset is relatively small, and the set of papers for each review aspect can differ. To improve the performance of aspect score prediction, we propose a solution based on multi-task learning that can leverage additional rich information from the resources of other aspect scores. We treat the prediction of each aspect as a separate task, and the multi-task model for each aspect score follows a main-auxiliary manner.
Multi-task methods have been widely utilized in many NLP tasks, such as summarization (Isonuma et al., 2017; Guo et al., 2018), classification (Liu et al., 2017b; Shimura et al., 2019), parsing (Hershcovich et al., 2018), sequence labeling (Lin et al., 2018), and entity and relation extraction (Luan et al., 2018). When building a multi-task model, there are two critical issues: which auxiliary resources (tasks) can be used for sharing useful information, and how to share the information among the tasks. In these previous studies, researchers typically select specific auxiliary resources and design handcrafted shared structures for a particular NLP topic.
However, for different datasets and tasks, there may exist other, better auxiliary resources and shared structures. We thus propose an approach that automatically selects the shared structures as well as the auxiliary resources that are most beneficial for the main task. There are diverse parameter-sharing manners in multi-task methods for deep neural networks (Ruder, 2017), and how to define the exploration space for automatic selection is a problem. Our approach encodes the multi-task shared structures in the manner of hard parameter sharing and defines the exploration space accordingly. We also propose a strategy to search for the optimal structures and auxiliaries among the candidate models, and it is flexible to add more auxiliary tasks. Our approach can be integrated with hyperparameter optimization methods (Snoek et al., 2012) or network architecture search methods (Zoph and Le, 2016) for the search. Furthermore, our method is applicable not only to review score prediction but also to other NLP tasks such as text classification. Our main contributions can be summarized as follows. (1) We address the application of predicting the peer-review aspect scores of papers, which can be a useful assistant tool for both reviewers and authors. (2) We propose a multi-task shared structure encoding method that automatically selects good shared network structures as well as good auxiliary resources. (3) Experiments on real paper peer-review datasets show that our approach can build a multi-task model with effective structures and auxiliaries that outperforms the single-task model and naive multi-task models.

Preliminary
Peer-review aspect score prediction is a regression problem on text data. We can utilize existing text classification methods (Kim, 2014; Liu et al., 2017a) based on deep neural networks for this problem by changing the loss function from cross-entropy (classification) to mean squared error (regression). Without loss of generality, we use the basic CNN-based text classification model (Kim, 2014) as an example to facilitate the description of our multi-task approach. Figure 1 shows the architecture of this model for predicting an aspect score: an embedding layer, a convolutional and pooling layer, and fully connected layers. The multi-task approach we propose is not limited to this model; it can be integrated with similar neural network structures, e.g., XML-CNN (Liu et al., 2017a) and DPCNN (Johnson and Zhang, 2017).
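As a concrete illustration, the forward pass of such a CNN regressor can be sketched as follows. This is a minimal NumPy sketch with toy, hypothetical dimensions (the actual hyperparameters used in our experiments are listed in Table 2); it only shows how the cross-entropy loss of the classification model is replaced by mean squared error for regression.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy, hypothetical dimensions for illustration only
vocab_size, embed_dim, seq_len = 100, 8, 20
num_filters, kernel_size = 4, 3

# Layer 1: embedding lookup
embedding = rng.normal(size=(vocab_size, embed_dim))
tokens = rng.integers(0, vocab_size, size=seq_len)
x = embedding[tokens]                               # (seq_len, embed_dim)

# Layer 2: 1-D convolution over token positions + max-over-time pooling
filters = rng.normal(size=(num_filters, kernel_size, embed_dim))
conv = np.stack([
    [np.sum(x[i:i + kernel_size] * f) for i in range(seq_len - kernel_size + 1)]
    for f in filters
])                                                  # (num_filters, positions)
pooled = conv.max(axis=1)                           # (num_filters,)

# Layer 3: fully connected layer with a single linear output (regression)
w, b = rng.normal(size=num_filters), 0.0
score_pred = pooled @ w + b                         # predicted aspect score

# Mean squared error against the gold aspect score replaces cross-entropy
score_gold = 3.5
mse = (score_pred - score_gold) ** 2
```

In the classification version of this network, the last layer would instead produce class logits trained with cross-entropy; only the output unit and the loss change for regression.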
We have n single tasks (i.e., aspect scores) and assume that they have the same network structure with k layers. We regard each task in turn as the main task and search for the proper shared structures and auxiliary tasks.

Multi-task Shared Structures
To automatically search for proper shared structures and auxiliary tasks, we need to define the exploration space. Because it is difficult to mix the diverse parameter-sharing manners proposed in various multi-task methods (Ruder, 2017), we use the typical manner of hard parameter sharing as the starting point to implement our idea; other manners of parameter sharing will be addressed in future work. Figure 2 shows an example of the shared structure encoding (SSE) that we propose, with three tasks (one main task and two auxiliary tasks). Given a main task t_0, for each auxiliary task t_i, if the j-th layer of t_i is shared with t_0, we encode this shared structure as l_ij = 1; if the j-th layer is not shared, then l_ij = 0. We do not encode shared structures among auxiliary tasks, to decrease the complexity of the model. It is flexible to add more auxiliary tasks to a model. There are two special cases of this SSE. One is l_ij = 1 for all auxiliary tasks; the corresponding model is equivalent to one single model for all tasks. The other is l_ij = 0 for all auxiliary tasks, which is equivalent to a single-task model for the main task. In other words, these models are also included in the search stage. Lu et al. (2017) adaptively generate the feature-sharing structure by splitting the network into branches without merging; its exploration space is a subset of ours.
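The SSE exploration space can be enumerated directly. The sketch below (with a hypothetical helper name) lists all encodings for m auxiliary tasks and k layers, each mapping an auxiliary task index to its k sharing bits, and checks the two special cases described above:

```python
from itertools import product

def sse_candidates(m, k):
    """Enumerate all shared structure encodings for m auxiliary tasks and
    k layers. Each candidate maps auxiliary task i to a tuple of k bits,
    where bit j = 1 means layer j of task i is shared with the main task."""
    per_task = list(product((0, 1), repeat=k))          # 2^k encodings per task
    return [dict(enumerate(combo)) for combo in product(per_task, repeat=m)]

candidates = sse_candidates(m=2, k=3)
print(len(candidates))                                  # 2^(k*m) = 64

# Special cases noted in the text:
all_shared = {0: (1, 1, 1), 1: (1, 1, 1)}   # one single model for all tasks
none_shared = {0: (0, 0, 0), 1: (0, 0, 0)}  # equivalent to a single-task model
assert all_shared in candidates and none_shared in candidates
```

Both extreme encodings appear in the enumeration, so the single-task model and the fully shared model are automatically considered during the search.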
Our multi-task approach utilizes a main-auxiliary manner, rather than a manner that treats all tasks equally. The latter manner takes a weighted sum of the losses of all tasks and requires a trade-off among them (Sener and Koltun, 2018), which may not reach optimal results for a specific task. In our approach, we thus use every single task as the main task in turn, with the other tasks as candidate auxiliary tasks. This makes it flexible to define all candidate shared structures in the exploration space while decreasing its size.

Shared Structure and Auxiliary Task Search
In our search strategy, we denote the number of auxiliary tasks in a model as m, with m ≤ n−1. There are C(n−1, m) combinations of the auxiliary tasks. For each combination, we search the shared structures and select the one with the minimum loss. For the selection criterion, because the dataset is small, we use the loss on both the training set and the validation set rather than only the loss on the validation set.
After selecting the shared structures for all combinations of the auxiliary tasks, we select the combination whose average loss over all candidate shared structures is minimal. For a main task, the number of candidate multi-task models is N_m = C(n−1, m) × 2^(km). When m = n−1, i.e., using all other tasks as auxiliary tasks, this number is N_{n−1} = 2^(k(n−1)). If m ≪ n−1, then N_m ≪ N_{n−1}. If N_m is small, we can explore all candidates. Otherwise, we need to resort to other methods to search the exploration space, for example, hyperparameter optimization methods based on Bayesian optimization (Snoek et al., 2012) or network architecture search (NAS) methods based on reinforcement learning (Zoph and Le, 2016; Zoph et al., 2018; Liu et al., 2018). Random search can also be used.
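The size of the exploration space follows directly from n, k, and m. The short sketch below (hypothetical function name) computes N_m for the setting used later in our experiments, n = 6 aspects and k = 3 layers:

```python
from math import comb

def num_candidates(n, k, m):
    """Number of candidate multi-task models for one main task:
    choose m of the other n-1 tasks as auxiliaries, times the
    2^(k*m) shared structure encodings for each combination."""
    return comb(n - 1, m) * 2 ** (k * m)

# With n = 6 aspects and k = 3 layers:
for m in (1, 2, 5):
    print(m, num_candidates(6, 3, m))
# m = 5 uses all other tasks: C(5, 5) * 2^15 = 8^5 = 32768
```

The counts grow quickly with m, which is why exhaustive exploration is feasible only for small m and a sampled search is needed for m = n−1.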

Experimental Settings
We use the ICLR and ACL datasets in the PeerRead dataset (Kang et al., 2018) because they provide the scores of the peer-review aspects. Table 1 shows the statistics of these datasets. We use the papers that have scores in some of the six aspects (n = 6), i.e., Clarity (cla), Originality (ori), Correctness (cor), Comparison (com), Substance (sub), and Impact (imp). These scores range from 1 to 5. We use the dataset splits provided by PeerRead. Because not all papers in the ICLR dataset contain all six aspects, the number of papers differs across aspects. For the ground truth, we use the mean score over multiple reviews, which is the usual method of aggregating multiple scores without considering reviewer bias; analyzing the review bias among different reviewers is out of the scope of this paper. Note that although PeerRead contains both paper text and review text, we only use the paper text, because the purpose of this work is to predict the aspect scores before the review process. Moreover, the PeerRead authors (Kang et al., 2018) used only the first 1,000 tokens because the paper text is extremely long, whereas we used the full paper text with our own text pre-processing; the results obtained in our experiments and those reported in PeerRead are thus not exactly comparable.
We remove stop words and apply stemming to the words in the papers. The initial word embeddings in the models are pre-trained with fastText on each dataset. The hyperparameters of the CNN structures follow the common ones used in existing work (Shimura et al., 2018). Table 2 shows the parameter settings of CNN and XML-CNN, which are used as the basic models of the proposed multi-task approach in this paper.
The baselines are as follows.
Single-task model: It is equivalent to the case in which the SSEs of all auxiliary tasks are "000". It uses one network per aspect score, like the models in (Dong and Zhang, 2016; Dong et al., 2017).
All-in-one (Ain1): It builds a single model in which the main task and the m auxiliary tasks use the same network, like the models in PeerRead (Kang et al., 2018). It is equivalent to treating the prediction of all aspects as one task, or to a multi-task model in which the SSEs of all auxiliary tasks are "111".
Average performance of all explored Multi-Task models (AMT): It is equivalent to the expected performance of randomly selecting a multi-task model from all candidates.
We select the Clarity aspect, which has the most test data, as the main task for the evaluation in this paper. The evaluation metric is the Root Mean Square Error (RMSE). We first verify our approach by using CNN (Kim, 2014) as the basic model. We set m ∈ {1, 2, n−1}. When m = n−1, N_m = 8^5 = 32,768 is very large, so we use random search, exploring 1,000 candidate models, and report the mean performance over five runs.
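For reference, the RMSE metric over predicted and gold aspect scores can be computed as in this small sketch (hypothetical helper name, illustrative score values on the 1-5 scale):

```python
from math import sqrt

def rmse(preds, golds):
    """Root mean square error between predicted and gold aspect scores."""
    assert len(preds) == len(golds)
    return sqrt(sum((p - g) ** 2 for p, g in zip(preds, golds)) / len(preds))

# Hypothetical predicted vs. gold (mean-of-reviews) scores
print(rmse([3.2, 4.1, 2.5], [3.0, 4.5, 2.0]))
```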

Experimental Results
We first verify whether our SSE method can select a good shared structure for a given combination of auxiliary tasks. Table 3(a) shows the results in the case of m = 1 (italics mark the better of "Our" and "AMT"). It shows that our method successfully builds a better model than the single-task model and the model in which all tasks completely share with each other. The comparison with AMT shows that our method can select a better shared structure from all candidate structures. In the case of m = 2, our method again selects a better shared structure from all candidate structures, but it cannot always beat the single-task model this time, because the corresponding combinations of auxiliaries are not proper. After using our search strategy to select the combinations of auxiliaries, as shown in the 2nd row of Table 4, our method selects auxiliaries and structures with better performance. In addition, Table 4 shows that the performance for m = 2 is better than for m = 1, so increasing m can improve performance. However, a large m results in a large N_m. In the case of m = 5, although it would be possible to obtain a better model than with m = 1 or 2 by exploring all N_5 = 8^5 candidate models, exploring only a subset (1,000 models) does not reach better performance, even though this subset is already larger than N_2. Without a better search method, using a small m (e.g., m = 2) rather than a large m (e.g., m = 5, all other aspects as auxiliaries) is recommended. Furthermore, we also change each of the following four settings individually, keeping the others unchanged, to verify our approach under different conditions: (1) basic model: XML-CNN (Liu et al., 2017a), one of the state-of-the-art text classification methods; (2) main task: Originality, i.e., besides the Clarity aspect, we also show the results when another aspect is the main task; (3) dataset: ACL; (4) embedding: the fastText embeddings are initialized with embeddings trained on Wikipedia data.
Table 5 shows that our approach robustly generates better results across the different settings. Tables 4 and 5 also show that the selected auxiliary tasks and shared structures differ across settings; it is therefore better to select them automatically rather than decide them manually. Regarding the underlying characteristics of the review aspects in this dataset, there is no apparent evidence that any single aspect is exactly related to the main aspect and must be the auxiliary. Finally, the results for the Originality aspect in Table 5 show that Substance, Comparison, and Impact support Originality; the aspects selected by SSE are reasonable and fit human intuition.

Conclusion
In this paper, we focus on peer-review score prediction for papers. We propose a multi-task shared structure encoding approach that automatically selects good shared network structures as well as good auxiliary resources. Future work includes trying more advanced search methods such as network architecture search and finding evidence to explain the score predictions.