Automatic Essay Scoring Incorporating Rating Schema via Reinforcement Learning

Automatic essay scoring (AES) is the task of assigning grades to essays without human intervention. Existing AES systems are typically trained to predict the score of a single essay at a time, without considering the rating schema. To address this issue, we propose a reinforcement learning framework for essay scoring that uses quadratic weighted kappa as guidance to optimize the scoring system. Experimental results on benchmark datasets show the effectiveness of our framework.


Introduction
In recent years, neural networks have been widely used to grade student essays automatically and have achieved state-of-the-art performance. Typically, a distributed representation of an essay is learned with a neural network, and a linear layer then produces the final score. Existing research focuses on learning better essay representations with different neural networks, including long short-term memory (LSTM) networks (Taghipour and Ng, 2016), hierarchical convolutional neural networks (CNN) (Dong and Zhang, 2016), a hierarchical CNN-LSTM structure with an attention mechanism (Dong et al., 2017), and SKIPFLOW LSTM (Tay et al., 2017).
The major evaluation metric for AES is quadratic weighted kappa (QWK), which is also the official metric of the Automated Student Assessment Prize (ASAP). It evaluates scoring results by taking the rating schema into consideration. Because QWK is not differentiable, it is hard to train systems by optimizing this metric directly. Instead, existing AES systems are typically trained to predict the score of a single essay and are optimized using mean square error (MSE). This gap between the training objective and the evaluation metric limits the performance of state-of-the-art AES systems.
Recently, reinforcement learning (RL) has been introduced to optimize models with respect to non-differentiable quality metrics, and studies have shown its effectiveness for various tasks including language generation (Ranzato et al., 2015; Rennie et al., 2016), machine translation (Bahdanau et al., 2016) and relation classification (Feng et al., 2018).
Inspired by this research, we propose a novel reinforcement learning framework that incorporates QWK as guidance to optimize the essay scoring system. In our framework, we score a pack of essays at a time, and the scoring of each single essay is treated as an action. The QWK value computed for the pack of essays is then delivered as a reward to update the scoring system. Because the existing regression-based essay scorer does not naturally produce a probability distribution, it is non-trivial to use it within the reinforcement learning framework. We therefore propose to use a classification-based scoring system instead. The proposed framework is evaluated on the benchmark datasets from ASAP, and experimental results confirm its effectiveness with two different essay representation structures.

Model
Typically, an essay scorer contains two components, namely essay representation and essay scoring. The essay representation component transforms an input essay into a distributed vector, and the essay scoring component assigns a score to the essay based on this vector. Both components are usually trained jointly. In order to incorporate QWK to guide the process of essay scoring, we introduce a novel essay scoring strategy named packed evaluation. Each time, the essay scorer grades a pack of essays that includes the target essay, and QWK is calculated for the pack. To reduce the effect of chance, we repeat the packed evaluation multiple times for each target essay, randomly choosing the other essays in the pack each time. The average QWK achieved is used as the reward. The reward is then delivered to the essay scorer as a weak supervision signal. Figure 1 illustrates the training process of our model. We introduce the different parts in detail in the rest of this section.
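As an illustration of packed evaluation, the following minimal Python sketch averages a pack-level metric over several random packs that all contain the target essay. The function and parameter names (`packed_reward`, `metric`, `pack_size`, `repeats`) are hypothetical and not taken from this work, and the `agreement` metric is only a stand-in so that the example stays self-contained; the reward described above uses QWK instead.

```python
import random
from typing import Callable, Sequence


def packed_reward(
    target: int,
    predicted: Sequence[int],
    gold: Sequence[int],
    metric: Callable[[Sequence[int], Sequence[int]], float],
    pack_size: int,
    repeats: int,
    rng: random.Random,
) -> float:
    """Average `metric` over `repeats` random packs that all contain `target`."""
    # Candidate companions: every essay except the target itself.
    others = [i for i in range(len(gold)) if i != target]
    total = 0.0
    for _ in range(repeats):
        # Each pack is the target essay plus randomly chosen other essays.
        pack = [target] + rng.sample(others, pack_size - 1)
        total += metric([gold[i] for i in pack], [predicted[i] for i in pack])
    return total / repeats


def agreement(gold: Sequence[int], predicted: Sequence[int]) -> float:
    """Stand-in metric (exact agreement rate); an assumption for the demo only."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


if __name__ == "__main__":
    gold = [1, 2, 3, 4, 2, 3]
    predicted = [1, 2, 3, 0, 2, 1]
    reward = packed_reward(0, predicted, gold, agreement, 4, 5, random.Random(0))
    print(f"reward for essay 0: {reward:.3f}")
```

Passing the metric as a callable keeps the sampling logic independent of how the pack is scored, so a QWK implementation can be plugged in without changing the loop.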

Essay Representation
This component converts an input essay into a dense vector as its representation. Recurrent neural networks (Williams and Zipser, 1989) are widely used to learn a representation of a sequence of words for essay scoring. Following existing research, we also use recurrent neural networks (RNN) and test two different structures.
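Bidirectional LSTM. We first use a double-layer bidirectional LSTM network (Hochreiter and Schmidhuber, 1997) to process the essay. LSTM is a variant of the recurrent neural network that uses gates to control the information flow. Our LSTM processes one word per time step. Given the word embedding sequence $\{x_1, x_2, ..., x_n\}$ for the essay, the hidden states of the LSTM are calculated as follows:

$$
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= i_t \circ \tilde{c}_t + f_t \circ c_{t-1} \\
h_t &= o_t \circ \tanh(c_t)
\end{aligned}
$$

where $W_i$, $W_f$, $W_o$, $W_c$, $U_i$, $U_f$, $U_o$ and $U_c$ are weight matrices, and $b_i$, $b_f$, $b_o$ and $b_c$ are bias vectors. $\sigma$ denotes the sigmoid function and $\circ$ denotes element-wise multiplication.

In particular, the hidden states of each LSTM layer are averaged, and the mean states of the two layers are concatenated to form the embedding vector of the essay. Given $h_{i,j}$ as the j-th hidden state of the i-th layer, the layer outputs $o_i$ and the essay embedding vector $E$ are defined as follows:

$$o_i = \frac{1}{n}\sum_{j=1}^{n} h_{i,j}, \qquad E = [o_1; o_2]$$

Dilated LSTM. Dilated recurrent neural networks (Chang et al., 2017) have been shown to be more effective than traditional RNNs at processing long sequences, because their dilated skip connections capture multi-timescale information along the sequence. Denoting $h_{i,j}$ as the j-th hidden state of the i-th layer and $x_{i,j}$ as the corresponding input, the dilated skip connection can be written as

$$h_{i,j} = f(x_{i,j}, h_{i,j-k_i})$$

where $f$ is the LSTM cell update and $k_i$ is the skip length in the i-th layer. To retain as much information as possible, we simply concatenate the average hidden states of every layer to form the essay embedding:

$$E = [o_1; o_2; ...; o_L]$$

where L is the number of layers.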
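The combination of dilated skip connections and mean pooling can be sketched in a few lines of NumPy. This is only an illustration under simplifying assumptions: a plain tanh recurrence stands in for the LSTM cell, the layers are unidirectional, and all names (`dilated_layer`, `essay_embedding`, `skips`, `params`) are hypothetical.

```python
import numpy as np


def dilated_layer(x, k, W, U, b):
    """Simple tanh recurrence whose state at step j depends on the state at step j - k.

    x: (n, d_in) inputs, W: (d_in, hidden), U: (hidden, hidden), b: (hidden,).
    A tanh cell is used here instead of an LSTM cell to keep the sketch short.
    """
    n, hidden = x.shape[0], U.shape[0]
    h = np.zeros((n, hidden))
    for j in range(n):
        # Dilated skip connection: look k steps back (zeros before the sequence starts).
        prev = h[j - k] if j >= k else np.zeros(hidden)
        h[j] = np.tanh(x[j] @ W + prev @ U + b)
    return h


def essay_embedding(x, skips, params):
    """Stack dilated layers and concatenate the mean hidden state of every layer."""
    means = []
    layer_input = x
    for k, (W, U, b) in zip(skips, params):
        h = dilated_layer(layer_input, k, W, U, b)
        means.append(h.mean(axis=0))  # average over all time steps of this layer
        layer_input = h               # the next layer reads this layer's states
    return np.concatenate(means)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_words, d_in, hidden, skips = 12, 5, 4, (1, 2, 4, 8)
    params, d = [], d_in
    for _ in skips:
        params.append((rng.normal(size=(d, hidden)),
                       rng.normal(size=(hidden, hidden)),
                       np.zeros(hidden)))
        d = hidden
    E = essay_embedding(rng.normal(size=(n_words, d_in)), skips, params)
    print(E.shape)
```

With L layers of the same hidden size, the embedding has L times that many dimensions, one mean vector per layer.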

Essay Scoring
Traditionally, a linear layer with a sigmoid function is used to score an essay. Given an essay embedding vector $E$, the essay score $s$ is calculated as follows:

$$s = \mathrm{sigmoid}(W_l^\top E + b_l)$$

where $W_l$ and $b_l$ are the weight vector and bias for scoring. For a batch of n examples, mean square error is used to evaluate the predicted scores:

$$\mathrm{loss}_{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

where $y$ and $\hat{y}$ are score vectors representing the predicted scores and the ground truth scores, respectively. As we can see, this objective function is unable to take the rating schema into consideration.
The regression-based scorer outputs only a single value, without a probability distribution. It is thus non-trivial to use it directly for policy learning in an RL framework. Therefore, we propose to use a classification-based scorer, in which the different score categories and their probabilities constitute the action space.
Classification-based Scoring. We first feed the essay vector into a fully connected layer, and a softmax function then transforms the output into a probability distribution. Given an essay embedding vector $E$, the probability distribution vector $c$ is calculated as follows:

$$c = \mathrm{softmax}(W_c E + b_c)$$

where $W_c$ and $b_c$ are the weight matrix and bias vector, respectively.

Given the ground truth category, cross entropy loss is applied to evaluate the agreement of the probabilities as follows:

$$\mathrm{loss}_{CE} = -\sum_{i=1}^{N} Y_i \log c_i$$

where N is the number of categories, which is equal to the number of possible ratings. $Y$ is a one-hot vector in which the element representing the ground truth category is set to one.
Inter-class Penalty. The cross entropy loss used in the classification-based scorer does not capture the distance between categories, i.e. the rank information that is deemed important for essay scoring. We therefore add a penalty on top of the cross entropy loss. Inspired by the definition of QWK, the penalty vector p is defined as follows: where score is the ground truth score of the essay. The penalty loss function is defined as:
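To make the classification-based objective more concrete, the sketch below computes a softmax distribution, the cross entropy loss, and a rank-aware penalty for one essay. The penalty used here, $p_i = (i - score)^2 / (N - 1)^2$ with $\mathrm{loss}_P = p \cdot c$, is an assumed form borrowed from the QWK weight matrix and is not necessarily the exact definition used in this work. Categories are assumed to be indexed from 0 to N - 1, and the function names are hypothetical.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D vector of logits."""
    e = np.exp(z - np.max(z))
    return e / e.sum()


def cross_entropy(c, gold):
    """Cross entropy against a one-hot target whose hot index is `gold`."""
    return float(-np.log(c[gold]))


def penalty_loss(c, gold):
    """Expected quadratic distance from the gold category under distribution c.

    ASSUMPTION: p_i = (i - gold)^2 / (N - 1)^2, mirroring the QWK weights.
    """
    n = len(c)
    p = (np.arange(n) - gold) ** 2 / (n - 1) ** 2
    return float(p @ c)


if __name__ == "__main__":
    c = softmax(np.array([2.0, 1.0, 0.1, -1.0]))
    print("cross entropy:", cross_entropy(c, gold=0))
    print("penalty:", penalty_loss(c, gold=0))
```

Under this assumed form, probability mass placed far from the ground truth category costs more than mass placed on a neighbouring category, which is the rank information that plain cross entropy ignores.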

Mixed Scoring
In practice, we jointly train both a regression-based scoring layer and a classification-based scoring layer over the same document representation to help the classification-based scorer converge. By combining the two scorers, the overall loss function can be written as:

$$\mathrm{loss}_{pre} = \alpha_0\,\mathrm{loss}_{MSE} + \beta_0\,\mathrm{loss}_{CE} + \gamma_0\,\mathrm{loss}_{P}$$

where $\alpha_0$, $\beta_0$ and $\gamma_0$ are hyperparameters. Mixed scoring is used to pre-train our essay scorer before the reinforcement learning phase.

Reinforcement Learning
We define our loss function as the negative expected reward:

$$\mathrm{loss}_{RL} = -\mathbb{E}_{\tau}[r(\tau)]$$

where $\tau$ is the set of actions and $r$ denotes the reward, which is the average QWK an essay achieves in the packed evaluation.

By running n examples at a time, according to the REINFORCE algorithm (Williams, 1992), an approximated gradient can be calculated by:

$$\nabla_\theta\,\mathrm{loss}_{RL} \approx -\frac{1}{n}\sum_{i=1}^{n} r_i\,\frac{\partial \log(p_{i,y} \mid E_i; \theta)}{\partial \theta}$$

where $r_i$ is the reward of the i-th essay, $\theta$ denotes all parameters relevant to score calculation, and $\partial \log(p_{i,y} \mid E_i; \theta)$ can be computed by standard back propagation. Note that only the classification-based scorer is involved in the process of reinforcement learning for essay scoring. The overall loss function for this phase can be written as:

$$\mathrm{loss}_{overall} = \alpha_1\,\mathrm{loss}_{RL} + \beta_1\,\mathrm{loss}_{CE} + \gamma_1\,\mathrm{loss}_{P}$$

where $\alpha_1$, $\beta_1$ and $\gamma_1$ are hyperparameters.
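For a softmax scorer, the REINFORCE estimate has a simple closed form with respect to the logits, which the following NumPy sketch illustrates. It assumes that each action is a score category drawn from the softmax distribution and that no baseline is subtracted from the reward; both are assumptions, and the names `rl_loss` and `reinforce_logit_grad` are hypothetical.

```python
import numpy as np


def softmax_rows(z):
    """Row-wise numerically stable softmax."""
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def rl_loss(logits, actions, rewards):
    """Surrogate loss -(1/n) * sum_i r_i * log c[i, a_i]; rewards are treated as constants."""
    n = logits.shape[0]
    probs = softmax_rows(logits)
    return float(-np.mean(rewards * np.log(probs[np.arange(n), actions])))


def reinforce_logit_grad(logits, actions, rewards):
    """Gradient of the surrogate loss with respect to the logits.

    For a softmax policy, d(-log c[a]) / d(logits) = c - onehot(a), so each
    row is that difference scaled by the essay's reward and by 1/n.
    """
    n = logits.shape[0]
    probs = softmax_rows(logits)
    onehot = np.zeros_like(probs)
    onehot[np.arange(n), actions] = 1.0
    return rewards[:, None] * (probs - onehot) / n


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(3, 4))       # 3 essays, 4 score categories
    actions = np.array([0, 2, 3])          # sampled categories (assumed)
    rewards = np.array([0.5, 0.2, 0.9])    # e.g. average pack-level QWK per essay
    print(reinforce_logit_grad(logits, actions, rewards))
```

A gradient step on this loss raises the probability of actions that received a high reward, and the remaining parameters receive their gradients from these logit gradients through ordinary back propagation.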

Quadratic Weighted Kappa (QWK)
The QWK calculation emphasizes the overall rating schema. By setting QWK as the reward, our model is trained at a macro level, taking the grading characteristics of different sets of essays into consideration.
An N-by-N quadratic weight matrix W is first computed to encode the rating information:

$$W_{i,j} = \frac{(i - j)^2}{(N - 1)^2}$$

where N is the number of possible ratings. An N-by-N matrix A is calculated such that $A_{i,j}$ corresponds to the number of essays that receive a score i from the human rater and a score j from the scoring system. Another N-by-N matrix B is constructed as the outer product of the histogram vectors of the two ratings. A and B are then normalized such that they have the same sum. Finally, from the three matrices, the quadratic weighted kappa is calculated as follows:

$$\kappa = 1 - \frac{\sum_{i,j} W_{i,j} A_{i,j}}{\sum_{i,j} W_{i,j} B_{i,j}}$$

Experiment

Experiment Setup
The ASAP dataset is used for evaluation. It consists of essays written by English-speaking middle-school students on eight different topics. More details are listed in Table 1. As there are no released labels for the test data, we split the validation set and the test set off from the original training data. Following Taghipour and Ng (2016) and Dong et al. (2017), we use 5-fold cross-validation. In each fold, the split is 60%, 20% and 20% for training, validation and testing, respectively.
All essays are tokenized with the NLTK tokenizer. We pre-train the word embeddings via word2vec (Mikolov et al., 2013) on the whole dataset. The number of hidden states in the LSTMs is 200. We use a four-layer bidirectional dilated LSTM with skip lengths of 1, 2, 4 and 8 in the respective layers. During training and scoring, scores are scaled to the range [0, 1] for the regression-based scorer. They are restored to integers when calculating QWK values. In the RL phase, the pack size is 64 essays, and packed evaluation is repeated 7 times per essay. The essay scorer for RL is pre-trained by mixed scoring.
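The step from scaled predictions back to a QWK value can be sketched as follows. This is a minimal NumPy illustration of the standard QWK definition (quadratic weights, an observed agreement matrix, and an expected matrix built from the two rating histograms), not the evaluation script used in this work. The rounding rule and the function names (`restore_scores`, `quadratic_weighted_kappa`) are assumptions, and the rating range in the demo is purely illustrative.

```python
import numpy as np


def restore_scores(scaled, low, high):
    """Map scores in [0, 1] back to integer ratings in [low, high].

    ASSUMPTION: nearest-integer rounding (np.rint rounds halves to even),
    followed by clipping to the valid rating range.
    """
    raw = np.asarray(scaled, dtype=float) * (high - low) + low
    return np.clip(np.rint(raw), low, high).astype(int)


def quadratic_weighted_kappa(human, system, low, high):
    """QWK between two integer rating vectors with ratings in [low, high]."""
    n_ratings = high - low + 1
    h = np.asarray(human) - low
    s = np.asarray(system) - low
    idx = np.arange(n_ratings)
    # Quadratic weights: W[i, j] = (i - j)^2 / (N - 1)^2.
    W = (idx[:, None] - idx[None, :]) ** 2 / (n_ratings - 1) ** 2
    # Observed matrix: A[i, j] counts essays rated i by the human and j by the system.
    A = np.zeros((n_ratings, n_ratings))
    np.add.at(A, (h, s), 1)
    # Expected matrix: outer product of the two histograms, rescaled to the sum of A.
    B = np.outer(np.bincount(h, minlength=n_ratings),
                 np.bincount(s, minlength=n_ratings)).astype(float)
    B *= A.sum() / B.sum()
    denom = (W * B).sum()
    if denom == 0:
        # Happens when both raters give the same single rating to every essay.
        raise ValueError("QWK is undefined for this pack")
    return 1.0 - (W * A).sum() / denom


if __name__ == "__main__":
    low, high = 2, 12                      # illustrative rating range
    human = np.array([2, 8, 10, 12, 7, 9])
    scaled_predictions = np.array([0.05, 0.58, 0.81, 0.93, 0.40, 0.75])
    system = restore_scores(scaled_predictions, low, high)
    print(system, quadratic_weighted_kappa(human, system, low, high))
```

When QWK serves as a reward on small packs, the undefined case deserves attention, since a pack in which every essay receives the same rating from both raters has a zero denominator.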
We compare the performance of different approaches:
• B0: This model uses a double-layer bidirectional LSTM to encode an essay and mean square error as the objective function to train the essay scorer;
• B1: This is a classification-based scorer and it is trained jointly with a regression-based scorer;
• P0: Based on B1, this model incorporates the penalty loss function;

Results
The overall results of our models in terms of QWK are shown in Table 2. We have the following findings:
• By incorporating a penalty loss into the classification scorer, P0 performs as well as or better than B1 on all eight sets. This indicates the effectiveness of combining rank information with cross entropy loss for essay scoring.
• By replacing the double-layer bidirectional LSTM with the dilated LSTM, P1 improves the QWK values by a large margin compared with P0 on all eight sets. This indicates the effectiveness of the dilated LSTM for document representation in automatic essay scoring. The improvement of P1 over P0 is even greater when essays are longer (sets 1, 2, 7 and 8, see Table 1), indicating that dilated networks are particularly good at processing long sequences.
• By incorporating QWK to guide the optimization of the essay scorer, the approaches with the reinforcement learning strategy (RL0 and RL1) improve performance consistently on all eight sets compared to their counterparts (P0 and P1). We also performed a one-tailed t-test, which shows that the improvements brought by reinforcement learning are significant with p < 0.05 compared to the base scorer models (RL0 vs. P0 and RL1 vs. P1).
• The classification-based scorer B1 matches or outperforms the regression-based scorer B0 on four datasets (sets 3, 4, 5 and 6). The rating ranges for sets 1, 2, 7 and 8 are much wider than those for sets 3, 4, 5 and 6 (see Table 1). The performance difference between B1 and B0 decreases (from positive to negative) as the number of rating categories increases. This is because a larger number of categories requires many more parameters for the classification-based scorer to be well trained. Given N categories, the classification layer outputs N probabilities per essay, one for each category, which requires N times as many parameters as regression-based scoring.

Related Work
Two lines of research are related to our work: text quality evaluation and reinforcement learning for natural language processing.

Text Quality Evaluation
Traditionally, AES models are divided into three categories: classification, regression and ranking. Naive Bayes models are mostly used in classification tasks. Larkey (1998) uses bag-of-words features. Following that, Rudner and Liang (2002) develop a system based on multinomial Bernoulli Naive Bayes, using content and style features. E-rater (Attali and Burstein, 2004) is one of the earliest systems to adopt regression methods. Phandi et al. (2015) use correlated Bayesian Linear Ridge Regression (cBLRR), focusing on domain-adaptation tasks. Ranking models use linguistic features. Yannakoudakis et al. (2011) formulate AES as a pair-wise ranking problem by ranking the order of essay pairs. Chen and He (2013) formulate AES as a list-wise ranking problem by considering the order relation among all essays.
Argument quality evaluation is a task closely related to AES, which involves the evaluation of argumentative texts at various granularities (argument-level, post-level, etc.). Tan et al. (2016), Wei et al. (2016a) and Wang et al. (2017) make use of linguistic features to evaluate the persuasiveness of arguments in online forums. Wei et al. (2016b) and Ji et al. (2018) consider features from the perspective of argumentation interaction between participants. Persing and Ng (2017) construct their model based on error types for argumentation.

Reinforcement Learning for Natural Language Processing
Being able to optimize non-differentiable quality metrics, reinforcement learning has been widely used in natural language processing tasks such as machine translation (Bahdanau et al., 2016), image captioning (Rennie et al., 2016) and text summarization (Ranzato et al., 2015). To the best of our knowledge, this paper is the first attempt to optimize the scorer with QWK, which takes the rating schema into account. Skip connections in RNNs are capable of capturing long-term dependencies in sequences. Vezhnevets et al. (2017) introduce a dilated LSTM to allow its manager to operate at a low temporal resolution. A reinforcement learning method has also been proposed to let the network learn how long to skip.

Conclusion and Future Work
In this paper, we propose a reinforcement learning framework that uses the QWK metric as the reward to train the essay scoring system directly. A packed evaluation strategy is used for QWK computation, and the scoring of each essay is treated as a single action. In particular, a dilated LSTM is used to encode an essay, and a softmax layer is used for essay grading. Experimental results on benchmark datasets show that training the grading system toward QWK is effective.
Further analysis of the experimental results indicates the disadvantage of using a classification-based scorer for essays with a complex grading schema. One future direction is to explore scoring actions other than classification under the reinforcement learning framework.