Neural Automated Essay Scoring Incorporating Handcrafted Features

Automated essay scoring (AES) is the task of automatically assigning scores to essays as an alternative to grading by human raters. Conventional AES typically relies on handcrafted features, whereas recent studies have proposed AES models based on deep neural networks (DNNs) to obviate the need for feature engineering. Furthermore, hybrid methods that integrate handcrafted features in a DNN-AES model have been recently developed and have achieved state-of-the-art accuracy. One of the most popular hybrid methods is formulated as a DNN-AES model with an additional recurrent neural network (RNN) that processes a sequence of handcrafted sentence-level features. However, this method has the following problems: 1) It cannot incorporate effective essay-level features developed in previous AES research. 2) It greatly increases the numbers of model parameters and tuning parameters, increasing the difficulty of model training. 3) It has an additional RNN to process sentence-level features, making extension to various DNN-AES models complex. To resolve these problems, we propose a new hybrid method that integrates handcrafted essay-level features into a DNN-AES model. Specifically, our method concatenates handcrafted essay-level features to a distributed essay representation vector, which is obtained from an intermediate layer of a DNN-AES model. Our method is a simple DNN-AES extension, but significantly improves scoring accuracy.


Introduction
In various assessment fields, essay-writing tests have attracted much attention as a way to measure practical higher-order abilities such as logical thinking, critical reasoning, and creative-thinking skills (Hussein et al., 2019; Uto, 2019). In essay-writing tests, test-takers are required to write essays about a given topic, and human raters grade those essays based on a scoring rubric. However, because the scoring process takes much time and effort, it is hard to grade large numbers of essays (Hussein et al., 2019). Further, subjectivity in human scoring can reduce accuracy (Amorim et al., 2018; Uto and Ueno, 2018; Uto and Okano, 2020). Automated essay scoring (AES), which utilizes natural language processing and machine learning techniques to automatically grade essays, is one method for resolving these problems.
Many AES methods have been developed over the past decades, and can generally be categorized as feature-engineering and neural-network approaches (Hussein et al., 2019; Ke and Ng, 2019). The feature-engineering approach predicts scores using handcrafted features such as essay length or spelling errors (e.g., Amorim et al., 2018; Dascalu et al., 2017; Shermis, 2016; Nguyen and Litman, 2018). The advantages of this approach include interpretability and explainability. However, this approach generally requires extensive effort for engineering effective features to achieve high scoring accuracy for various essays.
These two approaches can be viewed as complementary rather than competing because they provide different advantages. Specifically, the neural-network approach can extract dataset-specific features from word sequence patterns, whereas the feature-engineering approach can use existing effective features that are difficult to extract using DNNs from only word sequence information. To obtain both benefits, Dasgupta et al. (2018) proposed a hybrid method that integrates both approaches. This method is formulated as a DNN-AES model with an additional recurrent neural network (RNN) that processes a sequence of handcrafted sentence-level features. This method provides state-of-the-art accuracy, but has the following drawbacks:
1. It cannot incorporate effective essay-level features developed in previous AES research.
2. It greatly increases the numbers of model parameters and tuning parameters, increasing the difficulty of model training.
3. It has an additional RNN that processes sequences of handcrafted sentence-level features, making extension to various DNN-AES models complex.
To resolve these problems, we propose a new hybrid method that integrates handcrafted essay-level features into a DNN-AES model. Specifically, our method concatenates handcrafted essay-level features to a distributed essay representation vector, which is obtained from an intermediate layer of a DNN-AES model. The advantages of our method are as follows:
1. It can incorporate various existing essay-level features for which effectiveness has been shown.
2. The number of required additional parameters is only the number of incorporated essay-level features, and there are no additional hand-tuned parameters.
3. It can be easily applied to various DNN-AES models, because conventional models commonly have a layer that produces a distributed essay-representation vector.
Our method is a simple DNN-AES extension, but experimental results on real-world benchmark data show that it significantly improves accuracy.

Automated essay scoring methods
This section briefly reviews conventional AES methods based on the feature-engineering and neural-network approaches.

Feature-engineering approach
Following the first AES method, Project Essay Grade (PEG) (Page, 2003), many feature-engineering-based AES methods have been developed, including Intelligent Essay Assessor (IEA) (Foltz et al., 2013), e-rater (Attali and Burstein, 2006), the Bayesian Essay Test Score sYstem (BETSY) (Rudner and Liang, 2002), and IntelliMetric (Schultz, 2013). These methods have been applied to various actual tests. For example, e-rater, a popular commercial AES, now plays the role of a second rater in the Test of English as a Foreign Language (TOEFL) and the Graduate Record Examination (GRE).
These AES methods predict scores by supervised machine learning models using handcrafted features. For instance, PEG and e-rater use multiple regression models, and Phandi et al. (2015) used a correlated Bayesian linear-ridge-regression model. BETSY and Larkey (1998) perform AES using classification models. Other recent works solve AES by using preference-ranking models (Yannakoudakis et al., 2011; Chen and He, 2013).
The features used in previous research differ among the methods, ranging from simple features (e.g., word or essay length) to more complex ones (e.g., readability or grammatical errors). Table 1 shows examples of representative features (Phandi et al., 2015; Ke and Ng, 2019).

Neural-network approach
This section introduces two DNN-AES models as AES methods based on the neural-network approach: the most popular model, which uses a long short-term memory (LSTM), and an advanced model based on the transformer architecture.

LSTM-based model
The LSTM-based model (Alikaniotis et al., 2016), which was the first DNN-AES model, predicts essay scores through the multi-layered neural networks shown in Fig. 1 by inputting essay word sequences. Letting $\mathcal{V} = \{1, \cdots, V\}$ be a vocabulary list, an essay $j$ is defined as a list of vocabulary words $\{w_{ji} \in \mathcal{V} \mid i \in \{1, \cdots, n_j\}\}$, where $w_{ji}$ is a $V$-dimensional one-hot representation of the $i$-th word in essay $j$ and $n_j$ is the number of words in essay $j$. This model processes word sequences through the following layers:
Lookup table layer: This layer transforms each word in a given essay into a $D$-dimensional word-embedding representation, in which words with similar meanings have similar representations. Specifically, letting $A$ be a $D \times V$-dimensional embedding matrix, the word-embedding representation $x_{ji}$ corresponding to $w_{ji}$ is calculated as the product $A \cdot w_{ji}$.
Recurrent layer: This layer is an LSTM network that outputs a vector at each timestep to capture long-distance word dependencies. Specifically, this layer transforms the sequence $\{x_{j1}, x_{j2}, \cdots, x_{jn_j}\}$ to an LSTM output sequence $\{h_{j1}, h_{j2}, \cdots, h_{jn_j}\}$. A single-layer unidirectional LSTM is generally used, but bidirectional or multilayered LSTMs are also often used. A convolutional neural network is optionally used before the recurrent layer to capture n-gram-level textual dependencies.
Pooling layer: This layer transforms recurrent layer outputs into a fixed-length vector. Mean-over-time (MoT) pooling, which calculates an average vector $M_j = \frac{1}{n_j} \sum_{i=1}^{n_j} h_{ji}$, is generally used. Other frequently used pooling methods include last pooling, which uses the last output of the recurrent layer $h_{jn_j}$, and a pooling-with-attention mechanism.
Linear layer with sigmoid activation: This layer projects the pooling-layer output $M_j$ to a scalar value in the range $[0, 1]$ by the sigmoid function $\sigma(W M_j + b)$, where $W$ is a weight matrix and $b$ is a bias.
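The non-recurrent layers above can be illustrated with a minimal pure-Python sketch. It is not the authors' Keras implementation: the function names (`lookup`, `mot_pooling`, `last_pooling`, `sigmoid_linear`) are hypothetical, the recurrent outputs are taken as given rather than computed by an LSTM, and the weight matrix $W$ is represented as a single row because the output is a scalar.

```python
import math

def lookup(A, word_index):
    """Return column `word_index` of the D x V embedding matrix A.

    Multiplying A by a one-hot vector selects exactly this column,
    so no explicit matrix product is needed.
    """
    return [row[word_index] for row in A]

def mot_pooling(H):
    """Mean-over-time pooling: average the outputs h_j1, ..., h_jn."""
    n = len(H)
    return [sum(h[k] for h in H) / n for k in range(len(H[0]))]

def last_pooling(H):
    """Last pooling: keep only the final recurrent output."""
    return list(H[-1])

def sigmoid_linear(W, b, M):
    """Project the pooled vector M to a scalar in (0, 1)."""
    z = sum(w_k * m_k for w_k, m_k in zip(W, M)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Toy values (assumptions for illustration only).
A = [[0.1, 0.2, 0.3],
     [0.4, 0.5, 0.6]]           # D = 2, V = 3
H = [[1.0, 2.0], [3.0, 4.0]]    # two timesteps of recurrent output
M = mot_pooling(H)
score = sigmoid_linear([0.0, 0.0], 0.0, M)
```

With all-zero weights and bias, the sigmoid returns 0.5 regardless of the pooled vector, which is a convenient sanity check for the final layer.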
Model training is conducted by backpropagation with a mean squared error (MSE) loss function using a training dataset in which scores are normalized to a [0, 1] scale. During the prediction phase, predicted scores are rescaled to the original score range. This model has been used as the basis model in various current DNN-AES models (e.g., Dasgupta et al., 2018; Farag et al., 2018; Jin et al., 2018; Mesgar and Strube, 2018; Wang et al., 2018; Mim et al., 2019; Nadeem et al., 2019; Uto and Okano, 2020).
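The score normalization, MSE loss, and rescaling steps can be sketched as follows. The helper names and the example score range are assumptions made for illustration, and rounding the rescaled prediction to an integer is likewise an assumption (the text only states that predictions are rescaled to the original range).

```python
def normalize(score, low, high):
    """Map a raw score in [low, high] to the [0, 1] training scale."""
    return (score - low) / (high - low)

def rescale(pred, low, high):
    """Map a prediction in [0, 1] back to the original score range.

    Rounding to the nearest integer is an assumption made here so that
    predictions can be compared with integer human scores.
    """
    return round(pred * (high - low) + low)

def mse(preds, targets):
    """Mean squared error between predictions and normalized targets."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

# Hypothetical score range used only for this example.
LOW, HIGH = 2, 12
target = normalize(7, LOW, HIGH)
restored = rescale(target, LOW, HIGH)
```

Normalizing targets this way keeps them within the output range of the sigmoid layer, which is why the loss is computed on the [0, 1] scale rather than on raw scores.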

Transformer-based model
Transformer-based architectures have recently attracted attention as an alternative approach to RNNs for processing sequential data. Specifically, bidirectional encoder representations from transformers (BERT), a pre-trained multilayer bidirectional transformer network (Vaswani et al., 2017) released by the Google AI Language team, has achieved state-of-the-art results in various NLP tasks, such as question answering, named entity recognition, natural language inference, and text classification (Devlin et al., 2019). BERT was also applied to AES (Rodriguez et al., 2019) and automated short-answer grading (Sung et al., 2019) in 2019, and demonstrated good performance. Transformers are a neural network architecture designed to handle ordered data sequences using an attention mechanism. Specifically, transformers consist of multiple layers (called transformer blocks), each containing a multi-head self-attention mechanism and a position-wise fully connected feed-forward network. See Vaswani et al. (2017) for details of this architecture.
BERT is trained in pre-training and fine-tuning steps. Pre-training is conducted on huge amounts of unlabeled text data over two tasks, masked language modeling and next-sentence prediction, the former predicting the identities of words that have been masked out of the input text and the latter predicting whether two given sentences are adjacent.
Using BERT for a target NLP task, including AES, requires fine-tuning (retraining), which is conducted from a task-specific supervised dataset after initializing model parameters to pre-trained values. When using BERT for regression or classification tasks such as AES, input texts require preprocessing, namely, adding a special token ("[CLS]") to the beginning of each text. BERT output corresponding to this token is used as a fixed-length distributed text representation (Devlin et al., 2019). We can thus conduct target regression or classification tasks based on the text representation. In this study, we assume the use of the linear layer with sigmoid activation, described in the previous subsection, to predict essay scores from the text representation (Fig. 2).

Hybrid method
The feature-engineering approach and the neural-network approach can be viewed as complementary rather than competing approaches, because as mentioned in Section 1 they provide different advantages. To receive both benefits, Dasgupta et al. (2018) proposed a hybrid method that integrates the two approaches. Figure 3 shows the model architecture of the hybrid method. As that figure shows, it mainly consists of two DNNs. One processes word sequences in a given essay in the same way as the conventional LSTM-based DNN-AES model. Specifically, it transforms a word sequence $w_j = \{w_{j1}, w_{j2}, \cdots, w_{jn_j}\}$ to a hidden vector $H_j$, which is a fixed-length distributed essay representation, through the lookup table layer, recurrent layer, and pooling layer. The other DNN processes a sequence of handcrafted sentence-level features. Letting the $j$-th essay have $N_j$ sentences, and letting sentence-level features for the $n$-th essay sentence be $f_{jn}$, the feature sequence $F_j = \{f_{j1}, f_{j2}, \cdots, f_{jN_j}\}$ is transformed to a fixed-length hidden vector $H^f_j$ through a recurrent layer and a pooling layer. (Note that the original article used an LSTM for the recurrent layer and attention pooling for the pooling layer.) Finally, inputting a concatenated vector $[H_j; H^f_j]$, the linear layer with sigmoid activation produces a predicted score. This method has provided higher accuracy than feature-engineering-based methods or DNN-based methods. However, it has the following drawbacks.
1. It cannot incorporate essay-level features developed in conventional AES research.
2. It has far more model and tuning parameters than does a base DNN-AES model. Specifically, letting the number of handcrafted sentence-level features be $f$, and the hidden variable size of the LSTM in the recurrent layer be $d$, this method requires at least $(4df + d^2 + 5d)$ additional parameters, and further parameters are required if attention pooling is used. It also requires tuning parameters for the LSTM and the pooling layer, making model training more difficult.
3. It requires an additional RNN for processing sequences of handcrafted sentence-level features, making implementation with transformer-based models and other DNN-AES models complex.

Proposed method
To resolve the above problems, we propose a new hybrid method that incorporates handcrafted essay-level features into a DNN-AES model. Our method concatenates handcrafted essay-level features to the distributed essay representation $H_j$, which is the input vector for the last linear layer in conventional DNN-AES models. Letting essay-level features for the $j$-th essay be $F^o_j$, the proposed method projects the concatenated vector $[H_j; F^o_j]$ to a scalar value by using a linear layer with sigmoid activation, as in conventional DNN-AES models.
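The concatenation step can be sketched in a few lines of pure Python. This is an illustrative sketch under stated assumptions, not the authors' Keras code: `predict_score` is a hypothetical name, `H` stands for the distributed essay representation produced by any base DNN-AES model, and `F` stands for the handcrafted essay-level feature vector.

```python
import math

def predict_score(H, F, W, b):
    """Apply a sigmoid-activated linear layer to the concatenation [H; F].

    H: distributed essay representation from the base model
    F: handcrafted essay-level features for the same essay
    W: one weight per element of [H; F]; b: scalar bias
    """
    x = list(H) + list(F)  # concatenated vector [H; F]
    if len(W) != len(x):
        raise ValueError("W must have len(H) + len(F) weights")
    z = sum(w_k * x_k for w_k, x_k in zip(W, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Toy values (assumptions for illustration only).
H = [0.2, -0.1, 0.4]
F = [1.5, -0.5]
W = [0.3, 0.3, 0.3, 0.1, 0.1]
pred = predict_score(H, F, W, 0.0)

# Compared with a base model whose last layer sees only H,
# the only new weights are the len(F) entries that multiply F.
extra_params = len(W) - len(H)
```

Because the base model is only required to expose `H`, the same function applies whether `H` comes from an LSTM pooling layer or from a BERT text representation.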
The proposed method can be easily applied to existing DNN-AES models, because they commonly have a layer that produces a distributed essay representation before the last linear layer. As examples, Figs. 4, 5, and 6 show model architectures for LSTM, BERT, and conventional hybrid models integrating essay-level features.
The proposed method can incorporate various existing essay-level features for which effectiveness has been shown. As essay-level features, this study uses the 25 features presented in Table 2, which have been widely used in various AES studies. We assume that the feature values are standardized to have mean 0 and standard deviation 1.0. Another advantage of our method is that it requires additional weight parameters in only the last linear layer, and the number of additional parameters is only the number of incorporated essay-level features $F^o_j$, as compared with the basis DNN-AES model. It requires no additional hand-tuned parameters.
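The standardization assumed above can be sketched as a per-feature z-score. The function name `standardize` is hypothetical, the use of the population standard deviation is an assumption (the text does not say which estimator is used), and mapping constant features to 0.0 is a choice made here to avoid division by zero.

```python
import math

def standardize(rows):
    """Standardize each feature column to mean 0 and standard deviation 1.

    rows: one list of feature values per essay.
    Constant columns (zero standard deviation) are mapped to 0.0.
    """
    n = len(rows)
    k = len(rows[0])
    means = [sum(r[c] for r in rows) / n for c in range(k)]
    stds = [
        math.sqrt(sum((r[c] - means[c]) ** 2 for r in rows) / n)
        for c in range(k)
    ]
    return [
        [(r[c] - means[c]) / stds[c] if stds[c] > 0 else 0.0 for c in range(k)]
        for r in rows
    ]

# Toy feature matrix: 2 essays x 2 features (values are illustrative).
features = [[1.0, 10.0],
            [3.0, 10.0]]
z = standardize(features)
```

In a cross-validation setting, it would be usual to compute the means and standard deviations on the training folds only and reuse them for the held-out fold; the sketch omits this for brevity.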

Experiments
This section demonstrates the effectiveness of the proposed method using real-world benchmark data.

Experimental procedures
This study employed the Automated Student Assessment Prize (ASAP) dataset, which is widely used as benchmark data in AES research. The ASAP dataset provides eight sets of essays, each set associated with a prompt. Essays were written by students in grades 7-10. Table 3 summarizes the number of essays, score range, and average essay length for each prompt.
Table 2 (readability features, partial): ... (Coleman and Liau, 1975), Dale-Chall readability score, difficult word count, Flesch reading ease (Kincaid et al., 1975), Flesch-Kincaid grade (Kincaid et al., 1975), Gunning fog (Whisner, 2004), Linsear write formula, SMOG index (Fitzsimmons et al., 2010), syllable count.
Using this dataset, we evaluated score prediction accuracies through five-fold cross-validation for each prompt. The accuracy metric was the quadratic weighted kappa (QWK), which examines agreement between predicted scores and ground truth. We conducted this experiment for the LSTM-based model (Fig. 1), the BERT-based model (Fig. 2), Dasgupta's hybrid model (Fig. 3), and the proposed method with these models (Figs. 4, 5, and 6). In the LSTM-based model, we used a single-layer LSTM, a two-layer LSTM, and a bidirectional LSTM for the recurrent layer. We used last pooling as the pooling layer for these LSTM-based models, and also examined MoT pooling for the single-layer LSTM-based model. As sentence features for Dasgupta's hybrid model, we used features similar to the essay-level features shown in Table 2 after two modifications: 1) For length-based features, we removed the number and average length of sentences. 2) We removed the SMOG index from the readability features, because it is not definable for a sentence. We also examined a logistic regression model using essay-level features as a method based on the feature-engineering approach.
We implemented the models in the Python programming language with the Keras library. As the embedding matrix, we used GloVe (Pennington et al., 2014) with 50 dimensions. We set the LSTMs' hidden-variable dimension to 300, the mini-batch size to 32, and the maximum epochs to 50. We used dropout regularization to avoid overfitting, with dropout probabilities for lookup table layer output and pooling layer output set to 0.5. The recurrent dropout probability was set to 0.2. We used the Adam optimization algorithm (Kingma and Ba, 2014) to minimize the mean squared error (MSE) loss function over the training data. For the BERT model, we used a base-sized pre-trained model. Table 4 shows the experimental results.
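For readers unfamiliar with the evaluation metric, the following is a minimal sketch of the standard quadratic weighted kappa computation on integer ratings. The function name and argument layout are assumptions, and this is not necessarily the evaluation script used in the experiments.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_rating, max_rating):
    """Quadratic weighted kappa between two lists of integer ratings."""
    n = max_rating - min_rating + 1
    if n < 2:
        raise ValueError("need at least two distinct rating levels")
    total = len(rater_a)

    # Observed agreement matrix O[i][j].
    observed = [[0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_rating][b - min_rating] += 1

    # Marginal histograms of each rater.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2          # quadratic penalty
            expected = hist_a[i] * hist_b[j] / total      # chance agreement
            numerator += weight * observed[i][j]
            denominator += weight * expected
    if denominator == 0.0:
        return 1.0  # both raters gave one identical rating to every essay
    return 1.0 - numerator / denominator

perfect = quadratic_weighted_kappa([1, 2, 3], [1, 2, 3], 1, 3)
opposite = quadratic_weighted_kappa([1, 3], [3, 1], 1, 3)
```

A value of 1 indicates complete agreement, 0 indicates agreement at chance level, and negative values indicate agreement worse than chance.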

Experimental results
Comparing accuracy among prompts shows that accuracy tends to be higher for prompts with short average essay lengths than for those with long essays. For example, the accuracy for prompts 4, 5, 6, and 7 tends to be higher than that for prompts 2 and 8 in each model. This tendency is consistent with previous studies.
Comparing the conventional DNN-AES models shows that the LSTM-based model with MoT pooling has higher performance than models with last pooling, which is also consistent with previous studies (Alikaniotis et al., 2016; Riordan et al., 2017). BERT tends to outperform the LSTM-based models, as in other BERT applications including automated short-answer grading (Devlin et al., 2019; Lun et al., 2020; Sung et al., 2019). As Dasgupta et al. (2018) reported, the conventional hybrid model shows the highest average accuracy among the conventional models.
Table 4 shows that by incorporating handcrafted essay-level features, the proposed method drastically improves the accuracy of all base DNN-AES models. We conducted paired t-tests to examine whether the averaged performance of the proposed method is significantly higher than base model performance. The results, shown in the "p-value" column in Table 4, indicate that the proposed method improved performance at the 5% significance level for the LSTM- and BERT-based models, and at the 10% significance level for the conventional hybrid model.
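The paired t-test mentioned above can be sketched as follows. The sketch assumes that each pair consists of the base-model QWK and the proposed-method QWK for the same prompt, which the text does not state explicitly; the function name is hypothetical, and only the test statistic and degrees of freedom are computed (the p-value would then be read from a t-distribution).

```python
import math

def paired_t_statistic(base, proposed):
    """Return (t, degrees_of_freedom) for paired samples.

    t = mean(d) / (sd(d) / sqrt(n)), where d = proposed - base and
    sd is the sample standard deviation of the differences.
    """
    if len(base) != len(proposed) or len(base) < 2:
        raise ValueError("need two equally sized samples with n >= 2")
    diffs = [p - b for b, p in zip(base, proposed)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    if var == 0.0:
        raise ValueError("differences have zero variance")
    return mean / math.sqrt(var / n), n - 1
```

A positive statistic indicates that the second sample is higher on average; whether a one-sided or two-sided p-value is appropriate depends on the hypothesis being tested.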
Comparing the proposed methods with the logistic regression model (a feature-engineering approach), all of the proposed methods provided a higher average accuracy. The paired t-test between the logistic regression model and the proposed method shows that averaged QWKs of the proposed method using LSTM with MoT pooling and the conventional hybrid model were higher at the 5% significance level, and that of the BERT-based proposed method was higher at the 1% significance level.
Among the proposed methods, the one using the BERT model provided the highest average accuracy.
To confirm whether the handcrafted essay-level features were effective, Table 5 shows weight parameter values in the final linear layer of the BERT-based proposed model. In the table, the row "Distributed representation" shows the average of the absolute weight values for the 300-dimensional distributed essay representation vector $H_j$. A higher weight value means that the feature has more influence on score prediction. This table suggests that each handcrafted feature contributes to some extent, whereas the features with large weights vary across prompts.
These experimental results show that the proposed method effectively improves AES accuracy.

Conclusions
We proposed a simple method that incorporates handcrafted essay-level features into DNN-AES models. Our method adds handcrafted features to a distributed essay representation vector obtained as an intermediate hidden representation of a DNN-AES model. Our method can be easily applied to various conventional DNN-AES models without greatly increasing model complexity, while significantly improving accuracy.
In this study, we evaluated the effectiveness of the proposed method using relatively simple features, but in future studies, we will use more varied essay-level features, such as those shown in Table 1. Additionally, we will conduct an ablation experiment on essay-level features to clarify which features are effective for which DNN-AES models. Another future aim is to apply the proposed method to more varied DNN-AES models, such as those mentioned in Subsection 2.3. Moreover, although our method directly adds essay-level features to the DNN-based distributed essay representation vector, accuracy might be further improved by appending several layers after the feature input layer. Such model extensions are another topic for future study.