Can Neural Networks Automatically Score Essay Traits?

Essay traits are attributes of an essay that help explain how well (or badly) written the essay is. Examples of traits include Content, Organization, Language, Sentence Fluency, Word Choice, etc. Much of the research in the last decade has dealt with automatic holistic essay scoring, where a machine rates an essay and gives a single overall score for it. However, writers need feedback, especially if they want to improve their writing, which is why trait scoring is important. In this paper, we show how a deep-learning based system can outperform feature-based machine learning systems, as well as a string kernel system, in scoring essay traits.


Introduction
An essay is a piece of text that is written in response to a topic, called a prompt. Writing a good essay is a very useful skill. However, evaluating essays consumes a lot of time and resources. Hence, in 1966, Ellis Page proposed a method for evaluating essays by computer (Page, 1966). The aim of automatic essay grading (AEG) is to have machines, rather than humans, score the text.
An AEG system is a piece of software that takes an essay as input and returns a score as output. That score could either be an overall score for the essay, or a trait-specific score, based on essay traits like content, organization, style, etc. To the best of our knowledge, most of the systems today use feature engineering and ordinal classification / regression to score essay traits.
From the late 1990s and early 2000s onwards, many commercial systems have used automatic essay grading. Shermis and Burstein (2013) cover a number of systems that are used commercially, such as E-Rater (Attali and Burstein, 2006), Intelligent Essay Assessor (Landauer, 2003), LightSide (Mayfield and Rosé, 2013), etc.
In 2012, Kaggle conducted a competition called the Automatic Student Assessment Prize (ASAP), which had two parts: the first was essay scoring, and the second was short-answer scoring. The release of the ASAP AEG dataset led to a large number of papers on automatic essay grading using a number of different techniques, from machine learning to deep learning. Section 3 lists the different work in automatic essay grading.
In addition to the Kaggle dataset, another dataset, the International Corpus of Learner English (ICLE), is also used in some trait-specific essay grading papers (Granger et al., 2009). Our work, though, makes use of only the ASAP dataset, and the trait-specific scores provided by Mathias and Bhattacharyya (2018a) for that dataset.
The rest of the paper is organized as follows. In Section 2, we give the motivation for our work. In Section 3, we describe related work done for trait-specific automatic essay grading. In Section 4, we describe the dataset. In Section 5, we describe the experiments, such as the baseline machine learning system, the string kernel and super word embedding system, the neural network system, etc. We report the results and analyze them in Section 6, and conclude our paper and describe future work in Section 7.

Motivation
Most of the work dealing with automatic essay grading provides only an overall score for the essay, and often doesn't provide any further feedback to the essay's writer (Carlile et al., 2018).
One way to resolve this is by using trait-specific scoring, where we either do feature engineering or construct a neural network for each individual trait. However, coming up with different systems for measuring different traits is often going to be a challenge, especially if someone decides to introduce a new trait to score. Our work shows how we can take existing general-purpose systems and use them to score traits in essays.
In our paper, we demonstrate that a neural network, built for scoring essays holistically, performs reasonably well for scoring essay traits too. We compare it with a machine learning system using task-independent features (Mathias and Bhattacharyya, 2018a), as well as a state-of-the-art string kernel system (Cozma et al., 2018), and report statistically significant results when we use the attention-based neural network (Dong et al., 2017).

Related Work
In this section, we describe related work in the area of automatic essay grading.

Holistic Essay Grading
Holistic essay grading is assigning an overall score for an essay. Ever since the release of Kaggle's Automatic Student Assessment Prize's (ASAP) Automatic Essay Grading (AEG) dataset in 2012, there has been a lot of work on holistic essay grading. Initial approaches, such as those of Phandi et al. (2015) and Zesch et al. (2015), made use of machine learning techniques in scoring the essays. A number of other works used various deep learning approaches, such as Long Short-Term Memory (LSTM) networks (Taghipour and Ng, 2016; Tay et al., 2018) and Convolutional Neural Networks (CNN) (Dong and Zhang, 2016; Dong et al., 2017). The current state of the art in holistic essay grading makes use of word embedding clusters, called super word embeddings, and string kernels (Cozma et al., 2018).
In contrast to works that build separate systems for scoring individual traits, in our paper we compare three approaches to scoring essay traits which are trait-agnostic. The first uses a set of task-independent features as described by Zesch et al. (2015) and Mathias and Bhattacharyya (2018a). The second uses a string kernel-based approach as well as super word embeddings as described by Cozma et al. (2018). The third is a deep-learning attention-based neural network described by Dong et al. (2017). Our work is also, to the best of our knowledge, the first that uses the same neural network architecture to automatically score all the essay traits.

Dataset
The dataset we use is the ASAP AEG dataset. The original ASAP AEG dataset only has trait scores for prompts 7 & 8. Mathias and Bhattacharyya (2018a) provide the trait scores for the remaining prompts. Tables 1 and 2 describe the different essay sets and the traits for each essay set respectively.

Experiments
We use the following systems for our experiments:
1. Feature-Engineering System. This is a machine-learning system described by Mathias and Bhattacharyya (2018a).
2. String Kernels and Super Word Embeddings. This is a state-of-the-art system for holistic essay grading developed by Cozma et al. (2018) using string kernels and super word embeddings.
3. Attention-based Neural Network. This is a system for holistic automatic essay grading described by Dong et al. (2017), that we adapt for trait-specific essay grading.

Baseline Feature-Engineering System
The baseline system we use is the one described by Mathias and Bhattacharyya (2018a). Their system used a Random Forest classifier to score the essay traits. The features that they used are length-based features (word count, sentence count, sentence length, word length), punctuation features (counts of commas, apostrophes, quotations, etc.), syntax features (parse tree depth, number of clauses (denoted by SBAR in the parse tree), etc.), stylistic features (formality, type-token ratio, etc.), cohesion features (discourse connectives, entity grid (Barzilay and Lapata, 2008)), etc.
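As an illustration, a few of the length- and punctuation-based features can be computed as follows (the tokenization and feature names here are our own simplification, not the exact implementation of Mathias and Bhattacharyya (2018a)):

```python
import re

def simple_features(essay: str) -> dict:
    """Compute a few illustrative length- and punctuation-based features."""
    # Naive tokenization: words are runs of letters/apostrophes,
    # sentences end at '.', '!' or '?'.
    words = re.findall(r"[A-Za-z']+", essay)
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    return {
        "word_count": len(words),
        "sentence_count": len(sentences),
        "avg_sentence_length": len(words) / max(len(sentences), 1),
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
        "comma_count": essay.count(","),
        "apostrophe_count": essay.count("'"),
    }

feats = simple_features("I like essays. Don't you, too?")
```

A real system would use a proper tokenizer and parser; the point is only that each feature is computed independently of any particular trait.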

String Kernels and Super Word Embeddings
Cozma et al. (2018) showed that using string kernels and a bag of super word embeddings drastically improved on the state-of-the-art for essay grading.
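As a toy sketch of the bag-of-super-word-embeddings representation detailed in the subsections below (the embeddings, vocabulary, and k = 2 centroids here are invented for illustration; the actual system clusters real word embeddings with k-means and a much larger k):

```python
import numpy as np

# Toy 2-D word embeddings and k = 2 cluster centroids, assumed to have
# already been learned by k-means over the embedding vocabulary.
embeddings = {
    "good":  np.array([1.0, 0.1]),
    "great": np.array([0.9, 0.2]),
    "car":   np.array([0.1, 1.0]),
    "road":  np.array([0.2, 0.9]),
}
centroids = np.array([[1.0, 0.0],   # cluster 0
                      [0.0, 1.0]])  # cluster 1

def bag_of_super_word_embeddings(tokens):
    """Count how many of an essay's word vectors fall into each cluster."""
    counts = np.zeros(len(centroids), dtype=int)
    for tok in tokens:
        if tok in embeddings:
            # Assign the word to its nearest centroid (Euclidean distance).
            cluster = np.argmin(
                np.linalg.norm(centroids - embeddings[tok], axis=1))
            counts[cluster] += 1
    return counts

features = bag_of_super_word_embeddings("the good great car".split())
```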

String Kernels
A string kernel is a similarity function that operates on a pair of strings a and b. The string kernel used is the histogram intersection string kernel, given by the formula

HISK(a, b) = \sum_{x} \min(\#_x(a), \#_x(b)),

where the sum runs over the substrings x considered, and \#_x(a) and \#_x(b) are the numbers of occurrences of the substring x in the strings a and b respectively. The string kernel is then normalized as follows:

\hat{k}(i, j) = \frac{k(i, j)}{\sqrt{k(i, i) \cdot k(j, j)}},

where \hat{k}(i, j) is the normalized value of the string kernel k(i, j) between the strings i and j.
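A minimal Python sketch of these two formulas, using a single substring length p for brevity (Cozma et al. (2018) combine several p-gram lengths; the function names are ours):

```python
from collections import Counter

def substr_counts(s: str, p: int) -> Counter:
    """Count all character p-grams (substrings of length p) in s."""
    return Counter(s[i:i + p] for i in range(len(s) - p + 1))

def hisk(a: str, b: str, p: int = 2) -> int:
    """Histogram intersection string kernel over character p-grams:
    HISK(a, b) = sum over substrings x of min(#_x(a), #_x(b))."""
    ca, cb = substr_counts(a, p), substr_counts(b, p)
    return sum(min(ca[x], cb[x]) for x in ca)

def normalized_hisk(a: str, b: str, p: int = 2) -> float:
    """Normalized kernel: k(a, b) / sqrt(k(a, a) * k(b, b))."""
    return hisk(a, b, p) / (hisk(a, a, p) * hisk(b, b, p)) ** 0.5
```

The normalization bounds the kernel value by 1, with 1 reached when the two histograms are identical.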

Super Word Embeddings
A super word embedding is a word embedding created by clustering word embeddings (Cozma et al., 2018). The clusters are created using the k-means algorithm, with k = 500. For each essay, we use the count of words in each cluster as features.

Attention-based Neural Network

The 4,000 most frequent words are used as the vocabulary, with all other words mapped to a special unknown token. For each sentence, we get the embeddings from the word embedding layer. This sequence of word embeddings is given as input to a 1-D CNN layer. The output from the CNN layer is pooled using an attention layer, which gives a word-level representation for every sentence in the essay. These sentence representations are then sent through a sentence-level LSTM, and its outputs are pooled by a sentence-level attention layer to get the representation for the essay. The essay representation is then sent through a dense layer to score the essay trait. As the scores were scaled to the range [0, 1], we use the sigmoid activation function in the output layer, minimizing the mean squared error loss. To evaluate the system, we convert the predicted trait scores back to the original score range.
We use the 50-dimensional GloVe pre-trained word embeddings (Pennington et al., 2014). We run the experiments with a batch size of 100, for 100 epochs, a learning rate of 0.001, and a dropout rate of 0.5. The word-level CNN layer has a kernel size of 5, with 100 filters. The sentence-level LSTM layer and modeling layer both have 100 hidden units. We use the RMSProp optimizer (Dauphin et al., 2015) with a momentum of 0.9.
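The attention pooling used at both the word and sentence levels can be sketched in NumPy as follows (an illustration with a random, rather than learned, scoring vector; in the real network the scoring vector is trained end-to-end with the rest of the model):

```python
import numpy as np

def attention_pool(h: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Pool a sequence of hidden states h (steps x dim) into one vector.

    Each step gets a scalar score h_i . w; a softmax over the scores gives
    attention weights, and the pooled vector is the weighted sum of states.
    """
    scores = h @ w                                  # (steps,)
    scores = scores - scores.max()                  # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()   # softmax attention weights
    return alpha @ h                                # (dim,) weighted sum

rng = np.random.default_rng(0)
hidden = rng.normal(size=(7, 100))  # e.g. 7 words, 100-dim CNN outputs
w = rng.normal(size=100)            # stand-in for the learned scoring vector
sentence_vec = attention_pool(hidden, w)
```

The same pooling is applied twice: once over the CNN outputs of each sentence, and once over the LSTM outputs across sentences.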

Experimental Setup
In this section, we describe our experimental setup.

Evaluation Metric
We choose Cohen's Kappa with quadratic weights (Cohen, 1968), i.e. Quadratic Weighted Kappa (QWK), as the evaluation metric, for the following reasons. First, unlike accuracy and F-score, Kappa accounts for agreement that happens by chance. Second, accuracy and F-score do not consider the fact that the classes here are ordered, while using weights allows Kappa to take that ordering into account. Third, quadratic weights reward matches and punish mismatches more than linear weights do. Hence, we use QWK as the evaluation metric, rather than accuracy or F-score.
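QWK can be computed as follows (a standard implementation of the metric described above, not necessarily the exact code used in our experiments):

```python
import numpy as np

def quadratic_weighted_kappa(a, b, min_score, max_score):
    """Cohen's kappa with quadratic weights between two sets of ratings."""
    n = max_score - min_score + 1
    a = np.asarray(a) - min_score
    b = np.asarray(b) - min_score
    # Observed agreement: normalized confusion matrix of the two raters.
    O = np.zeros((n, n))
    for i, j in zip(a, b):
        O[i, j] += 1
    O /= O.sum()
    # Expected matrix under chance: outer product of the marginals.
    E = np.outer(O.sum(axis=1), O.sum(axis=0))
    # Quadratic disagreement weights: (i - j)^2, scaled to [0, 1].
    idx = np.arange(n)
    W = (idx[:, None] - idx[None, :]) ** 2 / (n - 1) ** 2
    return 1 - (W * O).sum() / (W * E).sum()
```

Perfect agreement yields 1, and chance-level agreement yields 0.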

Evaluation Method
We evaluate the systems using five-fold cross-validation, with 60% training data, 20% development data, and 20% testing data in each fold. The folds that we use are the same as those used by Taghipour and Ng (2016). Table 3 gives the results of the experiments using the different classification systems. In each cell, we compare the results of the three systems for a given trait and prompt. The bold value in each cell corresponds to the system giving the best result of the three. Traits which are not applicable to a prompt are marked with a "-".
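As an illustration only (the actual folds are those released by Taghipour and Ng (2016)), a 60/20/20 five-fold layout can be constructed with simple index arithmetic:

```python
def five_fold_splits(n_essays: int):
    """Yield (train, dev, test) index lists for five folds:
    in each fold, 3/5 of the data trains, 1/5 develops, 1/5 tests."""
    indices = list(range(n_essays))
    chunks = [indices[i::5] for i in range(5)]  # five roughly equal parts
    for fold in range(5):
        test = chunks[fold]
        dev = chunks[(fold + 1) % 5]
        train = [i for c in range(5)
                 if c not in (fold, (fold + 1) % 5)
                 for i in chunks[c]]
        yield train, dev, test

splits = list(five_fold_splits(10))
```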

Results and Analysis
We see that the attention-based neural network system outperforms both the baseline system of Mathias and Bhattacharyya (2018a) and the histogram intersection string kernel system of Cozma et al. (2018) for all the traits, and across all 8 prompts. We also check whether the improvements are statistically significant. We find that the improvements of the neural network system over the baseline system of Mathias and Bhattacharyya (2018a) and the histogram intersection string kernel system of Cozma et al. (2018) are statistically significant at p < 0.05 using the paired t-test.
Between the other two systems, the string kernels system performed better than the baseline system in most cases. The only prompt on which it did not was Prompt 8, mainly because the number of essays is very low and the essays are much longer compared to the other prompts.
Among the traits, the easiest to score are content and prompt adherence (wherever they are applicable), as they yielded the best agreement with the human raters. The hardest trait to score was Voice, which yielded the lowest QWK in the only prompt in which it was scored.

Conclusion and Future Work
In this paper, we describe a comparison between a feature-engineering system, a string kernel-based system, and an attention-based neural network in scoring different traits of an essay. We found that the neural network system provided the best results. To the best of our knowledge, this is the first work that shows how neural networks, in particular, can be used to score essay traits.
As part of future work, we plan to investigate how to incorporate trait scoring as a means of helping to score essays holistically.