Multi-Stage Pre-training for Automated Chinese Essay Scoring

This paper proposes a pre-training based automated Chinese essay scoring method. The method involves three components: weakly supervised pre-training, supervised cross-prompt fine-tuning, and supervised target-prompt fine-tuning. An essay scorer is first pre-trained on a large essay dataset that covers diverse topics and carries coarse ratings, i.e., good and poor, which serve as a kind of weak supervision. The pre-trained essay scorer is then further fine-tuned on previously rated essays from existing prompts, which have the same score range as the target prompt and provide extra supervision. Finally, the scorer is fine-tuned on the target-prompt training data. The evaluation on four prompts shows that this method can improve a state-of-the-art neural essay scorer in terms of both effectiveness and domain adaptation ability, while in-depth analysis also reveals its limitations.


Introduction
Automated essay scoring (AES) is an important educational application of natural language processing (NLP) (Page, 1966). AES aims to automatically judge the quality of student essays, which can reduce teachers' burden on essay scoring and provide fast feedback to students.
Most of the proposed methods, whether feature based or representation-learning based, work in an in-domain setting: a scorer is trained for a specific prompt on a set of example essays for that prompt, and is then used to rate more essays from the same prompt. This setting usually requires many rated examples to achieve acceptable performance.
Although cross-domain transferable essay scoring has gained more attention (Phandi et al., 2015; Jin et al., 2018), progress is still limited. A possible reason is that the available corpora for essay scoring usually cover narrow topics on a small scale, and the topics, scoring criteria, and score ranges of different prompts often vary. Since an AES system should have the ability to appreciate or criticize essays, supervised pre-training is necessary. Intuitively, if a reader has read many rated essays from different prompts, she should be better prepared to judge the quality of an essay that responds to a new prompt. At the very least, she should require less guidance than a novice.
In this work, we empirically evaluate a pre-training based method for AES. Figure 1 illustrates the main framework. Our method has three components, each of which incorporates a different level of supervision. The first component is weakly supervised pre-training. An essay scorer is pre-trained on a large-scale essay corpus. The corpus covers diverse topics and is prompt-free. The essays were collected from the Web and have been rated by anonymous teachers. The essays' ratings are converted to binary coarse ratings, good and poor, for the ease of weakly supervised pre-training. The second component is supervised cross-prompt pre-training / fine-tuning. This component aims to exploit the supervision from the training data of other prompts to pre-train or further fine-tune an essay scorer. The third component is supervised target-prompt fine-tuning. The pre-trained scorer is fine-tuned on the training data for target prompts. Since human ratings are expensive to collect, we expect the essay scorer to depend on as little target-prompt training data as possible.
Although there are publicly available English datasets such as the ASAP dataset (https://www.kaggle.com/c/asap-aes/), these datasets usually cover only a few topics, making it difficult to find datasets for pre-training and fine-tuning. We therefore collected datasets and conducted experiments for automated Chinese essay scoring. We built a dataset with more than 85,000 essays written by junior and senior high school students for weakly supervised pre-training. We also collected nearly 4,000 essays in response to four prompts from senior high schools. These essays were carefully rated by teachers and are used for cross-prompt fine-tuning and evaluation.
Although the framework is straightforward, the evaluation demonstrates the effectiveness of the proposed method.
(1) Higher performance in general: The cooperation of the three components improves the attentional recurrent convolutional neural network model (ARCNN), which achieved state-of-the-art results on the ASAP dataset. On average, the best pre-training enhanced ARCNN achieves a 4.2% absolute improvement in QWK and a 3.1% absolute improvement in Pearson coefficient over the ARCNN trained on the target-prompt training data only.
(2) Better domain adaptation ability: With both weakly supervised pre-training and cross-prompt fine-tuning, our method can use 10% of the target-prompt training data (about 50 essays) to achieve 93.6% of the relative performance of the full model trained with 100% of the training data. Supervised cross-prompt fine-tuning is essential for domain adaptation, though it is also expensive due to the requirement of human-rated essays. With weakly supervised pre-training only, our method can use half of the training data to achieve the same performance as the base scorer trained with 100% of the training data but without pre-training.
To the best of our knowledge, we are the first to investigate multi-stage pre-training based AES. We conduct careful analysis to gain more insights about how the method works and its limitations. Although our research focuses on Chinese, the results and observations should be useful for AES in other languages as well.

Related Work
AES is commonly viewed as a supervised learning problem with various feature templates (Larkey, 1998; Attali and Burstein, 2006; Chen and He, 2013; Phandi et al., 2015; Cummins et al., 2016; Song et al., 2017). These methods assume that essay quality correlates with surface-level features. Their drawbacks are that feature design and engineering are difficult and their semantic understanding of essays is limited.
Domain Adaptation for AES

However, most of these systems are prompt-specific: new training data has to be annotated to train a new model for a new prompt. Phandi et al. (2015) proposed domain adaptation as a solution, adapting an AES system from an initial prompt to another prompt based on Bayesian linear ridge regression. Dong and Zhang (2016) examined neural models under cross-prompt settings. Xia et al. (2016) also attempted to incorporate external knowledge for readability assessment. However, these methods mostly focused on domain adaptation from one domain to another and did not explore external resources.

Pre-training for AES

Recently, pre-training language models (LMs) has become a trend (Devlin et al., 2019; Yang et al., 2019), which leads to the pre-training then fine-tuning paradigm and has achieved great success in many NLP tasks.
For AES, Mim et al. (2019) proposed an unsupervised pre-training approach for evaluating the organization and argument strength of argumentative essays, where coherence modeling is used for pre-training. Rodriguez et al. (2019) attempted to apply BERT (Devlin et al., 2019) and XLNet (Yang et al., 2019) to AES, but the results on ASAP are similar to those of an LSTM-based scorer. Howard and Ruder (2018) proposed the universal language model fine-tuning approach for text classification, with components such as general-domain LM pre-training, target-domain LM fine-tuning, and target-task classifier fine-tuning. Gururangan et al. (2020) showed that task-adaptive pre-training can provide a large performance boost for RoBERTa across four domains and eight classification tasks. Motivated by this previous work, this paper also adopts a multi-stage pre-training strategy, exploiting weak, distant, and target-oriented supervision for AES.

The ARCNN Model
Our base model is the attentional recurrent convolutional neural network model (ARCNN), one of the state-of-the-art neural AES systems.

Sentence Representation A sequence of words x = {w_1, ..., w_N} is modeled with a CNN encoder. The feature representation of the i-th word is

z_i = f(W_z · [e(w_i) : e(w_{i+h_w-1})] + b_z),

where we use tanh as the activation function f, e(w_i) ∈ R^d is the embedding of a word, h_w is the window size of the convolutional layer, and W_z and b_z are the weight matrix and bias vector.
Above the convolutional layer, attention pooling is employed to obtain the sentence representation s:

m_i = tanh(W_m z_i + b_m),   u_i = exp(w_u · m_i) / Σ_j exp(w_u · m_j),   s = Σ_i u_i z_i,

where W_m, b_m, and w_u are learnable parameters.

Text Representation The sentence representations are then modeled with an LSTM encoder:

h_j = LSTM(s_j, h_{j-1}),

where s_j is the representation of the j-th sentence and h_{j-1} is the hidden state of the previous step. Two LSTM encoders are applied in both directions and the bidirectional hidden states are concatenated. The whole sequence is then represented as a fixed-length vector o = φ(h_1, ..., h_M), where φ(·) is a function that summarizes the hidden states; the same attention mechanism is used as φ(·) to obtain the text representation.
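To make the attention-pooling step concrete, the following is a minimal NumPy sketch of the mechanism described above. The parameter names (`W_a`, `b_a`, `w_u`) are our own illustrative choices, and in the real model these would be learned jointly with the rest of the network; this is not the authors' implementation.

```python
import numpy as np

def attention_pool(Z, W_a, b_a, w_u):
    """Attention pooling over row vectors Z (n x d):
    m_i = tanh(W_a z_i + b_a), u = softmax(w_u . m_i), return sum_i u_i z_i."""
    M = np.tanh(Z @ W_a.T + b_a)        # attention hidden states, (n, d_a)
    scores = M @ w_u                    # unnormalized attention scores, (n,)
    u = np.exp(scores - scores.max())   # numerically stable softmax
    u = u / u.sum()
    return u @ Z                        # weighted sum of the input vectors, (d,)
```

With `w_u` set to zeros, the attention weights are uniform and the pooled vector reduces to a simple average, which is a useful sanity check.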
The Prediction Layer Finally, the rating of the essay is predicted as

ŷ = sigmoid(w_y · o + b_y),

where w_y and b_y are the weight vector and bias.

Weakly Supervised Pre-training
We attempt to explore corpora with diverse topics and weak/distant quality judgements for pretraining a general essay scorer.

Data Collection
We collected essays from the website LeleKetang. The essays were written by Chinese students in grades 7 to 12. The corpus covers diverse topics and multiple genres, including narrative, argumentative, and prose essays. The average numbers of sentences and Chinese characters per essay are 30 and 779, respectively. Each essay was rated by a teacher to indicate its quality before it was uploaded to the website. The ratings range from 1 to 4, indicating poor, normal, good, and excellent. However, the ratings are imbalanced: ratings 3 and 1 are far more frequent than ratings 2 and 4. The corresponding statistics are shown in Table 1.
For pre-training, we combine ratings 4 and 3 to represent good essays, treat rating 1 as poor essays, and remove rating 2 to ensure that the good and poor essays are distinguishable.

Table 1: The basic statistics of the dataset used for weakly supervised pre-training. We combine rating 3 and rating 4 as good essays, and use rating 1 as poor essays. Rating 2 is not used in this work.
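The rating-to-label conversion above is simple enough to state directly in code. The sketch below is our own illustration of the mapping (the function name and the tiny example corpus are hypothetical):

```python
def to_binary_label(rating):
    """Map the 1-4 website ratings to weak binary labels.

    Ratings 3 and 4 -> 1 (good); rating 1 -> 0 (poor);
    rating 2 -> None (dropped, to keep the two classes separable).
    """
    if rating in (3, 4):
        return 1
    if rating == 1:
        return 0
    return None  # rating 2 is excluded from pre-training

# Hypothetical corpus of (text, rating) pairs:
corpus = [("essay A", 4), ("essay B", 2), ("essay C", 1)]
pretrain_set = [(text, to_binary_label(r)) for text, r in corpus
                if to_binary_label(r) is not None]
```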

Pre-training the ARCNN Model
Formally, we have an essay dataset E = {(x, y)}, where y ∈ {0, 1} indicates a poor or good essay.
We train the ARCNN model on the dataset E to distinguish good from poor essays. The learning objective is to minimize the cross-entropy loss summed over all training examples.
Since the collected ratings might be noisy and are converted to coarse binary ratings, we call it weakly supervised pre-training (WSP).
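The WSP objective can be sketched as a plain binary cross-entropy sum over the dataset. In this sketch, `predict(x)` stands in for the ARCNN forward pass (the function names are our own; the real model would be trained with a deep learning framework):

```python
import math

def weak_pretraining_loss(examples, predict):
    """Cross-entropy objective over E = {(x, y)}, y in {0, 1}.

    `predict(x)` returns the scorer's probability that essay x is good.
    """
    total = 0.0
    for x, y in examples:
        # Clip predictions away from 0 and 1 for log safety.
        p = min(max(predict(x), 1e-12), 1.0 - 1e-12)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total
```

A scorer that outputs 0.5 everywhere incurs a loss of log 2 per example, while a near-perfect scorer drives the loss toward zero.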

Supervised Target-Prompt Fine-tuning
The WSP model is pre-trained only on the coarse ratings, so its predictions fall within the range [0, 1], which differs from the score ranges used in real examinations. Moreover, the essays should be closely related to the prompts. As a result, the model should be fine-tuned on the training data of target prompts.
Following prior work, the real scores are scaled to the range [0, 1] for fine-tuning:

y = (ŷ − min) / (max − min),

where ŷ is the real score, and min and max are the minimum and maximum scores in the training data. In the evaluation phase, the predicted scores are rescaled to integer scores in the original score range. The token representations are fixed during fine-tuning, the same as during pre-training; the other parameters are fine-tuned. We call this strategy WSP-Finetune.
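The scaling and rescaling steps are a straightforward min-max transformation; a small sketch (function names are ours):

```python
def scale_score(y_raw, lo, hi):
    """Scale a raw score into [0, 1] using the training-set min/max."""
    return (y_raw - lo) / (hi - lo)

def rescale_score(y_pred, lo, hi):
    """Map a model prediction in [0, 1] back to an integer score
    in the original score range, as done at evaluation time."""
    return int(round(y_pred * (hi - lo) + lo))
```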

Supervised Transfer Fine-tuning
If rated essays from other prompts are available, such data can be used to further train our weakly pre-trained WSP model before fine-tuning it on target prompts. We simply continue to fine-tune WSP on the available prompt-specific rated datasets. Since the rating knowledge learned from cross-prompt data is transferred to scoring target-prompt essays, we call this strategy supervised transfer fine-tuning (Trans).
To be consistent with the score range of the target prompt, we only choose essay datasets that have the same score range as the target prompt for supervised transfer fine-tuning. The main procedure is the same as described in Section 3.3.1. We apply Trans before target-prompt fine-tuning, so the complete model is denoted WSP-Trans-Finetune. Of course, Trans can also be used for pre-training alone if the weakly supervised pre-training data is not available, denoted Trans-Finetune.

Model Parameter Settings
We use the tokenizer of BERT (Devlin et al., 2019) to obtain tokens and token embeddings. The vocabulary size is 21,128. The dimension of the token embeddings is 768. The token embeddings are fixed during both the pre-training and fine-tuning phases. We segment an essay into sentences by punctuation. The length limit of each sentence is set to 50: if a sentence is longer than 50 tokens, it is truncated and the remaining part is treated as another sentence. The detailed hyper-parameter settings are listed in Table 2.

Evaluation Metrics Since the coarse ratings are binary, we view pre-training as a classification problem. Macro precision (P), recall (R), F1-score (F1), and accuracy (Acc.) are used as evaluation metrics. Table 3 shows the experimental results on the development and test data of the pre-training dataset.
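The segment-then-truncate rule described above can be sketched as follows. The punctuation set and function name are our own assumptions (we include common Chinese and Latin end-of-sentence marks for illustration):

```python
import re

MAX_LEN = 50  # per-sentence length limit, as in the paper

def segment_essay(text, max_len=MAX_LEN):
    """Split an essay into sentences on end punctuation, then chop
    any sentence longer than max_len; the overflow is treated as a
    new sentence."""
    raw = [s for s in re.split(r"[。！？!?.]", text) if s]
    sentences = []
    for s in raw:
        for i in range(0, len(s), max_len):
            sentences.append(s[i:i + max_len])
    return sentences
```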

Results
The performance is moderate: the macro F1 score is about 0.74. This indicates that these essays are distinguishable to a certain degree. Note that the dataset covers diverse topics and genres, so the task is not easy, because different types of essays should be judged with different evaluation criteria. We also tried to incorporate genre and grade information in a multi-task learning setting for pre-training, but the results on the pre-training dataset and target prompts were not obviously better than using the coarse ratings only. The acceptable results indicate that essays on different topics and in different genres still share features that indicate essay quality.

Settings
Dataset We used four prompts that were previously used for the writing test in college entrance examinations by two provinces in China during 2012-2014. Each prompt is a short text describing an event, a quote, a fable, or other background information (see Appendix A). We asked students from several senior high schools to write an essay according to their understanding of each prompt. The collected essays were scored by high school teachers; each essay was scored by two teachers. The scores range from 0 to 60. If the difference between their scores is no larger than 6 (10% of the score range), the average score is taken as the final score. Otherwise, a third teacher participates in the evaluation, and the average of the two closest scores among the three is taken as the final score. This procedure is the same as the evaluation procedure in college entrance examinations. The collected essays are grouped by prompt. The statistics of the datasets are shown in Table 4.
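The two-rater adjudication procedure above can be sketched directly (the function name and signature are our own):

```python
def final_score(s1, s2, third=None, threshold=6):
    """Adjudicate two teacher scores on the 0-60 scale.

    If |s1 - s2| <= threshold (10% of the range), average them.
    Otherwise a third score is required, and the two closest of
    the three scores are averaged.
    """
    if abs(s1 - s2) <= threshold:
        return (s1 + s2) / 2
    if third is None:
        raise ValueError("disagreement: a third rater is needed")
    pairs = [(s1, s2), (s1, third), (s2, third)]
    a, b = min(pairs, key=lambda p: abs(p[0] - p[1]))
    return (a + b) / 2
```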

Evaluation Metrics
We use the quadratic weighted Kappa (QWK) and the Pearson coefficient as evaluation metrics. QWK is widely adopted for evaluating AES, while the Pearson coefficient reflects ranking consistency. We conducted 5-fold cross-validation: in each run, we used 60%, 20%, and 20% of the dataset for each prompt as training, development, and test data, respectively. The average performance is reported.
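For reference, QWK can be computed from the observed and expected rating matrices with quadratic disagreement weights. A minimal NumPy sketch (our own implementation, not the evaluation script used in the paper):

```python
import numpy as np

def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """Quadratic weighted Kappa between two lists of integer scores."""
    a = np.asarray(rater_a) - min_score
    b = np.asarray(rater_b) - min_score
    n = max_score - min_score + 1
    # Observed co-occurrence matrix O.
    O = np.zeros((n, n))
    for i, j in zip(a, b):
        O[i, j] += 1
    # Quadratic disagreement weights, 0 on the diagonal.
    idx = np.arange(n)
    w = (idx[:, None] - idx[None, :]) ** 2 / (n - 1) ** 2
    # Expected matrix from the marginal histograms.
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()
    return 1.0 - (w * O).sum() / (w * E).sum()
```

Perfect agreement yields 1.0, and systematic disagreement pushes the value toward (and below) 0.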
Comparisons We compare two sets of systems. The first set consists of previously proposed neural AES systems. The second set consists of variations of the proposed pre-training based AES series; all variations use ARCNN as the base model.
• ARCNN: The ARCNN model is trained only based on the target-prompt training data for each prompt.
• WSP: The ARCNN model is weakly pre-trained on the LeleKetang dataset and then directly used to predict target-prompt test data without fine-tuning.
• WSP-Finetune: The weakly supervised pre-trained model is further fine-tuned on the target-prompt training data and development data.

Table 5: QWK and Pearson coefficient scores on target-prompt test sets. All models are trained or fine-tuned using the full target-prompt training data.
• Trans: Other prompts' training data is used to pre-train a model. In experiments, for each target-prompt test set, we use the training and development data of the other three prompts for scorer training and model selection.
• Trans-Finetune: This setting further fine-tunes the Trans model based on the target-prompt training data and development data.
• WSP-Trans-Finetune: The weakly supervised pre-trained model is fine-tuned on the cross-prompt data before being fine-tuned on the target-prompt data.

Table 5 shows the performance of the previous neural AES models and the variants of our proposed pre-training based models. ARCNN obtains competitive performance compared with the other neural scorers. The results verify that ARCNN is an effective neural essay scorer, which is also the reason we use it as the base model for pre-training.

Overall Results
Our final model, WSP-Trans-Finetune, achieves the best average performance and outperforms ARCNN, which is trained and tested on data from the same prompts. The average improvements in QWK and Pearson coefficient are 4.4% and 3.1%. The final model also outperforms Trans-Finetune and WSP-Finetune in most cases. These results verify that the multi-stage pre-training strategy is feasible and effective for AES in general.
One issue is that the performance gain across datasets is inconsistent. The improvement on Set3 is large, while the improvement on Set4 is relatively small.
In addition, fine-tuning on the target-prompt training data is still essential. The performance of Trans decreases considerably without fine-tuning, while the WSP model cannot be directly applied for scoring because of the different score ranges between the pre-training and target-prompt data.

Analysis and Discussions
We provide more detailed analysis and discussion from several aspects.
The effect of weakly supervised pre-training As shown in the last three rows of Table 5, when WSP is directly applied to score essays from target prompts, the QWK scores are very low. This is reasonable, since the distribution of the coarse ratings in the pre-training dataset is far from the distribution of scores in the target-prompt datasets. As a result, the differences between predicted and real scores are large, which leads to low QWK scores. However, the Pearson coefficient scores are not as low as the QWK scores. This indicates that weakly supervised pre-training helps capture some common indicators of essay quality without considering prompt-specific information. After WSP is fine-tuned on the target prompts, WSP-Finetune obtains improvements on all four datasets.
The effect of supervised transfer pre-training Trans-Finetune pre-trains a model on narrow topics (3 prompts) but performs surprisingly well. Trans-Finetune may provide a kind of regularization that improves the generalization of the essay scorer; this explanation of the effectiveness of pre-training is well accepted (Erhan et al., 2010). Moreover, more training data from the same score range also helps shape the real distribution of scores and avoids overfitting to the distribution of the target-prompt training data.

The combination of Trans and WSP WSP-Trans-Finetune achieves the best performance, but its advantage over Trans-Finetune and WSP-Finetune is not very obvious, indicating that Trans and WSP benefit each other but also play similar roles.
On one hand, both Trans and WSP can act as regularization: the topics for cross-prompt pre-training are still narrow, so new bias might be introduced, and WSP can help alleviate this effect. On the other hand, WSP is trained on coarse binary ratings, and Trans can help WSP adapt its prediction distribution toward the score range of the target prompts.
Can pre-training reduce the requirement of target-prompt training data? This is a key question for this research. To answer it, we use different ratios of target-prompt training data to train ARCNN and to fine-tune the pre-trained models. We sampled these subsets according to the score distribution of the whole dataset for each prompt. Figure 2 shows the average QWK and Pearson coefficient scores over the four prompts with different ratios of training data. When the size of the training data decreases, the performance of ARCNN drops sharply. In contrast, all three pre-trained models, WSP-Finetune, Trans-Finetune, and WSP-Trans-Finetune, achieve very consistent performance even when the ratio of used training data is small. For example, on average, WSP-Finetune can use 50% of the target-prompt training data to obtain performance similar to ARCNN trained with all the training data, and can use 10% of the target-prompt training data to obtain 93.6% of ARCNN's performance. Trans-Finetune and WSP-Trans-Finetune perform even better than WSP-Finetune; the cross-prompt supervised transfer fine-tuning (Trans) is useful for domain adaptation.

Figure 3 shows the QWK scores with different ratios of target-prompt training data across the four prompts in detail. The trends on the four datasets are generally consistent with the average performance. The pre-training based models outperform ARCNN by a large margin when the ratio of target-prompt training data is small. WSP-Trans-Finetune performs best on 3 datasets, while WSP-Finetune performs best on 1 dataset. Trans-Finetune obtains performance close to WSP-Trans-Finetune.
On one hand, these observations are encouraging: if we have high-quality rated cross-prompt essays, supervised transfer pre-training can help a lot with domain adaptation. But such datasets are still expensive, and large-scale datasets of this kind may not always be available. Even so, the weak supervision from coarse ratings can also make an impact on domain adaptation.
On the other hand, the effects of different pre-training strategies differ across datasets. This indicates that the effects of pre-training may also be related to the properties of the target prompts. Moreover, we observe that in some cases (e.g., Set1 and Set3) using less training data (e.g., 30%) performs better than using more (e.g., 50%). This may relate to the representativeness of the selected subsets of essays used for training.
How does pre-training affect essays from different score ranges compared with ARCNN? We divide all the essays from the four datasets into four ranges according to their real scores. The distribution of scores is shown in Figure 4(a). The essay scores are concentrated in the range [40, 50].
We analyze the WSP-Trans-Finetune model. We define improvement here as reducing the difference between the predicted and real scores compared with ARCNN. Figure 4(b) shows the results. The pre-training improves the scoring ability for essays in the range [40, 50], so the general performance of WSP-Trans-Finetune is good. The essays in this range are at an intermediate level, written in a common way; the pre-trained models may help find subtle distinctions in style to distinguish them better.

Table 6: Some statistics of the Jensen-Shannon divergence between topic vectors of essays on the four datasets. Each essay is represented with a topic distribution vector inferred by an LDA model.
However, pre-training hurts the performance in the other ranges, although the number of essays in these ranges is small. The reasons might be as follows. High-score prediction is a challenge for AES, because there are fewer training examples than in other ranges and some high-score essays were written in unique ways. Essays in the range [0, 40] often include off-topic essays. The pre-trained models cannot help much in these cases, because they do not capture topic information very well.
Why is the performance gain inconsistent across prompts? We observe that the effects of pre-training vary across prompts; e.g., the performance gains on Set3 and Set4 are quite different.
Qualitatively, we speculate the inconsistency is related to distinct properties of the prompts. For example, prompt 3 has a semi-open topic setting: writing an essay to discuss " to know", where the underlined part should be filled in by students. The students therefore discussed the topic from a variety of angles. In this case, the importance of target-prompt examples might be weakened, and pre-training plays an important role. Prompt 4 asked students to imagine a situation in which we had an intelligent chip that knows all kinds of knowledge. In this case, a good sense of imagination and creativity may become a scoring dimension for human raters, but this dimension is difficult for AES models to capture.
We try to find quantitative evidence to support our speculation. We analyze the topical diversity of essays within each prompt: we train an LDA model with 200 topics on the pre-training dataset and infer the topic distribution of each essay in the four prompt datasets. We then compute the Jensen-Shannon divergence between every pair of essays. Table 6 shows some statistics of these values. Unfortunately, we do not find an obvious regularity, except that the essays from prompt 3 cover more diverse topics than the other prompts according to the quartile deviation and the coefficient of variation. We leave the investigation of the correlation between dataset properties and scoring performance as future work.
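The pairwise divergence measure used above can be sketched as follows, assuming each essay is already represented as an LDA topic distribution. This is our own minimal implementation of the Jensen-Shannon divergence, not the analysis script used in the paper:

```python
import numpy as np

def jensen_shannon(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two topic distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        # eps avoids log(0) for topics with zero probability
        a, b = a + eps, b + eps
        return float(np.sum(a * np.log(a / b)))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Identical distributions give 0, and fully disjoint distributions approach the maximum of ln 2 (in nats), so the statistics in Table 6 are bounded and comparable across prompts.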

Conclusion
In this paper, we presented a pre-training based approach to automated Chinese essay scoring. Our method investigates multi-stage pre-training and incorporates multi-level supervision, including weak supervision from large-scale coarse ratings, supervision from rated essays of other prompts, and the target-prompt training data.
The experimental results show that the pre-training based approach is effective for AES in terms of both effectiveness and domain adaptation ability. We carefully analyzed the effects of each component and found that multi-stage pre-training improves the base model in general; the domain adaptation ability is consistently improved; target-prompt fine-tuning is still indispensable, but the required amount of training data can be largely reduced; and weakly supervised pre-training and supervised transfer fine-tuning are both helpful.
We also observe some phenomena for which we do not have good explanations. For example, the performance gain across prompts is inconsistent; when the pre-trained scorer works best should be studied further. We suggest that prompts' properties be investigated more when applying AES.
The proposed method has the limitation that it pays more attention to the score range most essays come from, and it may hurt performance in other ranges. Another limitation is its dependence on the pre-training dataset: the pre-training dataset used in this paper is still small compared with the data used to pre-train language models. Larger pre-training datasets with supervised labels, or self-supervised learning strategies, could be explored. Moreover, we are interested in understanding which features or traits of essays are captured by the deep models for scoring. We plan to investigate these in future work.