Neural Automated Essay Scoring and Coherence Modeling for Adversarially Crafted Input

We demonstrate that current state-of-the-art approaches to Automated Essay Scoring (AES) are not well-suited to capturing adversarially crafted input of grammatical but incoherent sequences of sentences. We develop a neural model of local coherence that can effectively learn connectedness features between sentences, and propose a framework for integrating and jointly training the local coherence model with a state-of-the-art AES model. We evaluate our approach against a number of baselines and experimentally demonstrate its effectiveness on both the AES task and the task of flagging adversarial input, further contributing to the development of an approach that strengthens the validity of neural essay scoring models.


Introduction
Automated Essay Scoring (AES) focuses on automatically analyzing the quality of writing and assigning a score to the text. Typically, AES models exploit a wide range of manually-tuned shallow and deep linguistic features (Shermis and Hammer, 2012; Burstein et al., 2003; Rudner et al., 2006; Williamson et al., 2012; Andersen et al., 2013). Recent advances in deep learning have shown that neural approaches to AES achieve state-of-the-art results (Alikaniotis et al., 2016; Taghipour and Ng, 2016) with the additional advantage of utilizing features that are automatically learned from the data. In order to facilitate interpretability of neural models, a number of visualization techniques have been proposed to identify textual (superficial) features that contribute to model performance (Alikaniotis et al., 2016).
To the best of our knowledge, however, no prior work has investigated the robustness of neural AES systems to adversarially crafted input that is designed to trick the model into assigning desired misclassifications; for instance, a high score to a low quality text. Examining and addressing such validity issues is critical for AES deployment. Previous work has primarily focused on assessing the robustness of "standard" machine learning approaches that rely on manual feature engineering; for example, Powers et al. (2002) and Yannakoudakis et al. (2011) have shown that such AES systems, unless explicitly designed to handle adversarial input, can be susceptible to subversion by writers who understand something of the systems' workings and can exploit this to maximize their score.
In this paper, we make the following contributions:
i. We examine the robustness of state-of-the-art neural AES models to adversarially crafted input, and specifically focus on input related to local coherence; that is, grammatical but incoherent sequences of sentences. In addition to the superiority in performance of neural approaches against "standard" machine learning models (Alikaniotis et al., 2016; Taghipour and Ng, 2016), such a setup allows us to investigate their potential superiority / capacity in handling adversarial input without being explicitly designed to do so.
ii. We demonstrate that state-of-the-art neural AES is not well-suited to capturing adversarial input of grammatical but incoherent sequences of sentences, and develop a neural model of local coherence that can effectively learn connectedness features between sentences.
iii. A local coherence model is typically evaluated based on its ability to rank coherently ordered sequences of sentences higher than their incoherent / permuted counterparts (e.g., Barzilay and Lapata (2008)). We focus on a stricter evaluation setting in which the model is tested on its ability to rank coherent sequences of sentences higher than any incoherent / permuted set of sentences, and not just its own permuted counterparts. This supports a more rigorous evaluation that facilitates development of more robust models.
iv. We propose a framework for integrating and jointly training the local coherence model with a state-of-the-art AES model. We evaluate our approach against a number of baselines and experimentally demonstrate its effectiveness on both the AES task and the task of flagging adversarial input, further contributing to the development of an approach that strengthens AES validity.
At the outset, our goal is to develop a framework that strengthens the validity of state-of-the-art neural AES approaches with respect to adversarial input related to local aspects of coherence. For our experiments, we use the Automated Student Assessment Prize (ASAP) dataset, which contains essays written by students ranging from Grade 7 to Grade 10 in response to a number of different prompts (see Section 4).

Related Work
AES Evaluation against Adversarial Input One of the earliest attempts at evaluating AES models against adversarial input was by Powers et al. (2002), who asked writing experts who had been briefed on how the e-Rater scoring system works to write essays to trick e-Rater (Burstein et al., 1998). The participants managed to fool the system into assigning higher-than-deserved grades, most notably by simply repeating a few well-written paragraphs several times. Yannakoudakis et al. (2011) and Yannakoudakis and Briscoe (2012) created and used an adversarial dataset of well-written texts and their random sentence permutations, which they released in the public domain, together with the grades assigned by a human expert to each piece of text. Unfortunately, however, the dataset is quite small, consisting of [...] (Barzilay and Lapata, 2008). Lin et al. (2015) developed a hierarchical Recurrent Neural Network (RNN) for document modeling. Among others, they looked at capturing coherence between sentences using a sentence-level language model, and evaluated their approach on the sentence ordering task. Tien Nguyen and Joty (2017) built a CNN over entity grid representations, and trained the network in a pairwise ranking fashion.
Their model outperformed other graph-based and distributed sentence models. We note that our goal is not to identify the "best" model of local coherence on randomly permuted grammatical sentences in the domain of AES, but rather to propose a framework that strengthens the validity of AES approaches with respect to adversarial input related to local aspects of coherence.

Local Coherence (LC) Model
Our local coherence model is inspired by the model of Li and Hovy (2014), which uses a window approach to evaluate coherence. Figure 1 presents a visual representation of the network architecture, which is described below in detail.
Sentence Representation This part of the model composes the sentence representations that can be utilized to learn connectedness features between sentences. Each word in the text is initialized with a k-dimensional vector $w$ from a pre-trained word embedding space. Unlike Li and Hovy (2014), we use an LSTM layer to capture sentence compositionality by mapping words in a sentence $s = \{w_1, w_2, ..., w_n\}$ at each time step $t$ ($w_t$, where $t \le n$) onto a fixed-size vector $h^{wrd}_t \in \mathbb{R}^{d_{lstm}}$ (where $d_{lstm}$ is a hyperparameter). The sentence representation $h^{snt}$ is then the representation of the last word in the sentence:

$$h^{snt} = h^{wrd}_n \quad (1)$$

Clique Representation Each window of sentences in a text represents a clique $q = \{s_1, ..., s_m\}$, where $m$ is a hyperparameter indicating the window size. A clique is assigned a score of 1 if it is coherent (i.e., the sentences are not shuffled) and 0 if it is incoherent (i.e., the sentences are shuffled). The clique embedding is created by concatenating the representations of the sentences it contains according to Equation 1. A convolutional operation, using a filter $W^{clq} \in \mathbb{R}^{m \times d_{lstm} \times d_{cnn}}$, where $d_{cnn}$ denotes the convolutional output size, is then applied to the clique embedding, followed by a non-linearity $f$ in order to extract the clique representation $h^{clq} \in \mathbb{R}^{d_{cnn}}$:

$$h^{clq}_j = f([h^{snt}_j; ...; h^{snt}_{j+m-1}] * W^{clq})$$

Here, $j \in \{1, ..., N - m + 1\}$, $N$ is the number of sentences in the text, and $*$ is the linear convolutional operation.
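The windowing step described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes sentence representations are already available as plain lists of floats, the function name `build_cliques` is hypothetical, and returning no cliques for texts shorter than the window is an assumption (the text does not say how such texts are handled).

```python
from typing import List, Sequence


def build_cliques(sentence_vectors: Sequence[Sequence[float]], m: int = 3) -> List[List[float]]:
    """Concatenate every window of m consecutive sentence vectors into one clique embedding."""
    n = len(sentence_vectors)
    if n < m:
        # Assumption: a text with fewer than m sentences yields no cliques.
        return []
    cliques = []
    for j in range(n - m + 1):  # j covers the N - m + 1 window positions
        window = sentence_vectors[j:j + m]
        cliques.append([value for vector in window for value in vector])
    return cliques


# Four toy 2-dimensional sentence vectors give two cliques of size 3.
toy_sentences = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
toy_cliques = build_cliques(toy_sentences, m=3)
```

Each clique embedding here has length m times the sentence dimension, which is the input the convolutional filter would operate on.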
Scoring The cliques' predicted scores are calculated via a linear operation followed by a sigmoid function to project the predictions to a [0, 1] probability space:

$$\hat{y}^{clq}_j = \mathrm{sigmoid}(h^{clq}_j \cdot V)$$

where $V \in \mathbb{R}^{d_{cnn}}$ is a learned weight. The network optimizes its parameters to minimize the negative log-likelihood of the cliques' gold scores $y^{clq}$, given the network's predicted scores:

$$L = \frac{1}{T} \sum_{j=1}^{T} \left[ -y^{clq}_j \log(\hat{y}^{clq}_j) - (1 - y^{clq}_j) \log(1 - \hat{y}^{clq}_j) \right]$$

where $T = N - m + 1$ is the number of cliques in the text. The final prediction of the text's coherence score is calculated as the average of all of its clique scores:

$$\hat{y}^{coh} = \frac{1}{T} \sum_{j=1}^{T} \hat{y}^{clq}_j$$

This is in contrast to Li and Hovy (2014), who multiply all the estimated clique scores to generate the overall document score. Under the multiplicative approach, if only one clique is misclassified as incoherent and assigned a score of 0, the whole document is regarded as incoherent. We aim to soften this assumption and use the average instead to allow for a more fine-grained modeling of degrees of coherence (footnote 6). We train the LC model on synthetic data automatically generated by creating random permutations of highly-scored ASAP essays (Section 4).
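To make the difference between averaging and multiplying clique scores concrete, the following sketch scores cliques with a sigmoid over a dot product and aggregates them both ways. It is illustrative only: the helper names are hypothetical, the mean normalization of the loss is an assumption, and the clipping constant `eps` is added purely for numerical safety.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def clique_score(h_clq: Sequence[float], v: Sequence[float]) -> float:
    """Linear operation followed by a sigmoid, giving a score in [0, 1]."""
    return sigmoid(sum(h * w for h, w in zip(h_clq, v)))


def clique_nll(gold: Sequence[float], predicted: Sequence[float], eps: float = 1e-12) -> float:
    """Mean negative log-likelihood of binary gold clique scores (mean is an assumption)."""
    total = 0.0
    for y, p in zip(gold, predicted):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(predicted)


def coherence_average(scores: Sequence[float]) -> float:
    """Document score as the average of clique scores (the approach proposed here)."""
    return sum(scores) / len(scores)


def coherence_product(scores: Sequence[float]) -> float:
    """Document score as the product of clique scores (multiplicative variant)."""
    return math.prod(scores)


# One clique scored 0 drives the product to 0, while the average stays graded.
toy_scores = [0.9, 0.9, 0.0, 0.9]
avg_score = coherence_average(toy_scores)
mul_score = coherence_product(toy_scores)
```

With these toy values the product collapses to zero because of a single low clique, whereas the average still reflects that most cliques look coherent.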

LSTM AES Model
We utilize the LSTM AES model of Taghipour and Ng (2016) shown in Figure 2 (LSTM T&N), which is trained on, and yields state-of-the-art results on, the ASAP dataset. The model is a one-layer LSTM that encodes the sequence of words in an essay, followed by a Mean over Time operation that averages the word representations generated from the LSTM layer (footnote 7).
Footnote 6: Our experiments showed that using the multiplicative approach gives poor results, as presented in Section 6.
Footnote 7: We note that the authors achieve slightly higher results when averaging ensemble results of their LSTM model together with CNN models. We use their main LSTM model which, for the purposes of our experiments, does not affect our conclusions.

Combined Models
We propose a framework for integrating the LSTM T&N model with the Local Coherence (LC) one. Our goal is to have a robust AES system that is able to correctly flag adversarial input while maintaining a high performance on essay scoring.

Joint Learning
Instead of training the LSTM T&N and LC models separately and then concatenating their output representations, we propose a framework where both models are trained jointly, and where the final network then has the capacity to predict AES scores and flag adversarial input (Figure 3). Specifically, the LSTM T&N and LC networks predict an essay and coherence score respectively (as described earlier), but now they both share the word embedding layer. The training set is the aggregate of both the ASAP and permuted data to allow the final network to learn from both simultaneously. Concretely, during training, for the ASAP essays, we assume that both the gold essay and coherence scores are the same and equal to the gold ASAP scores. This is not too strict an assumption, as overall scores of writing competence tend to correlate highly with overall coherence. For the synthetic essays, we set the "gold" coherence scores to zero, and the "gold" essay scores to those of their original non-permuted counterparts in the ASAP dataset. The intuition is as follows: firstly, setting the "gold" essay scores of synthetic essays to zero would bias the model into over-predicting zeros; secondly, our approach reinforces the LSTM T&N's inability to detect adversarial input, and forces the overall network to rely on the LC branch to identify such input (footnote 9).
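The "gold" score assignment for joint training can be summarized in a few lines. This is a sketch under stated assumptions: the function name `joint_targets` and the dictionary layout are hypothetical, and scores are assumed to be already mapped to the [0, 1] range.

```python
from typing import Dict


def joint_targets(asap_score: float, is_permuted: bool) -> Dict[str, float]:
    """Return the training targets for the essay and coherence branches.

    asap_score is the gold score of the essay itself or, for a permuted
    essay, the gold score of its original non-permuted counterpart.
    """
    return {
        # The essay target is never zeroed for synthetic essays, to avoid
        # biasing the scoring branch towards predicting zeros.
        "essay": asap_score,
        # Coherence equals the ASAP score for originals and 0 for permutations.
        "coherence": 0.0 if is_permuted else asap_score,
    }


original_targets = joint_targets(0.8, is_permuted=False)
permuted_targets = joint_targets(0.8, is_permuted=True)
```

For an original essay both targets coincide; for a permuted one, only the coherence target changes, which leaves adversarial detection to the coherence branch.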
The two sub-networks are trained together and the error gradients are back-propagated to the word embeddings. To detect whether an essay is adversarial, we further augment the system with an adversarial text detection component that simply captures adversarial input based on the difference between the predicted essay and coherence scores. Specifically, we use our development set to learn a threshold for this difference, and flag an essay as adversarial if the difference is larger than the threshold. We experimentally demonstrate that this approach enables the model to perform well on both original ASAP and synthetic essays. During model evaluation, the texts flagged as adversarial by the model are assigned a score of zero, while the rest are assigned the predicted essay score (ŷ esy in Figure 3).
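The evaluation-time decision rule can be sketched as below. The function name `final_score` is hypothetical, and taking the difference as essay score minus coherence score (signed, not absolute) is an assumption consistent with permuted texts receiving high essay scores and low coherence scores.

```python
from typing import Tuple


def final_score(pred_essay: float, pred_coherence: float, threshold: float) -> Tuple[float, bool]:
    """Flag a text as adversarial when essay minus coherence exceeds the threshold.

    Flagged texts receive a score of zero; all others keep the predicted essay score.
    """
    is_adversarial = (pred_essay - pred_coherence) > threshold
    return (0.0 if is_adversarial else pred_essay), is_adversarial


flagged = final_score(0.8, 0.1, threshold=0.5)   # large gap between the two scores
accepted = final_score(0.8, 0.7, threshold=0.5)  # small gap between the two scores
```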

Data and Evaluation
We use the ASAP dataset, which contains 12,976 essays written by students ranging from Grade 7 to Grade 10 in response to 8 different prompts. We follow the ASAP data split by Taghipour and Ng (2016), and apply 5-fold cross validation in all experiments using the same train/dev/test splits. For each prompt, the fold predictions are aggregated and evaluated together. In order to calculate the overall system performance, the results are averaged across the 8 prompts.
Footnote 9: We note that, during training, the scores are mapped to a range between 0 and 1 (similarly to Taghipour and Ng (2016)), and then scaled back to their original range during evaluation.
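The mapping of scores to the [0, 1] range during training, and back to the original range for evaluation, can be illustrated with simple min-max scaling. This is only a sketch: the function names are hypothetical, linear min-max scaling is an assumption about how the mapping is done, and the 2 to 12 range used in the example is an illustrative value rather than one taken from this text.

```python
def to_unit_range(score: float, low: float, high: float) -> float:
    """Map a score from the prompt's [low, high] range to [0, 1]."""
    return (score - low) / (high - low)


def from_unit_range(value: float, low: float, high: float) -> float:
    """Map a prediction in [0, 1] back to the prompt's original range."""
    return value * (high - low) + low


# Illustrative range only; each prompt has its own score range.
scaled = to_unit_range(7, low=2, high=12)
restored = from_unit_range(scaled, low=2, high=12)
```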
To create adversarial input, we select high scoring essays per prompt (given a pre-defined score threshold, Table 1) that are assumed coherent, and create 10 permutations per essay by randomly shuffling its sentences. In the joint learning setup, we augment the original ASAP dataset with a subset of the synthetic essays. Specifically, we randomly select 4 permutations per essay to include in the training set, but include all 10 permutations in the test set. Table 1 presents the details of the datasets.
We test performance on the ASAP dataset using Quadratic Weighted Kappa (QWK), which was the official evaluation metric in the ASAP competition, while we test performance on the synthetic dataset using pairwise ranking accuracy (PRA) between an original non-permuted essay and its permuted counterparts. PRA is typically used as an evaluation metric on coherence assessment tasks in other domains (Barzilay and Lapata, 2008), and is based on the fraction of correct pairwise rankings in the test data (i.e., a coherent essay should be ranked higher than its permuted counterpart). Herein, we extend this metric and furthermore evaluate the models by comparing each original essay to all adversarial / permuted essays in the test data, and not just its own permuted counterparts; we refer to this metric as total pairwise ranking accuracy (TPRA).
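Generating the synthetic essays amounts to shuffling sentence order. The sketch below is one possible way to do it, not the authors' procedure: the function names, the fixed seeds, the `max_tries` guard, and the choice to discard duplicate permutations and the unshuffled order are all assumptions.

```python
import random
from typing import List, Sequence


def make_permutations(sentences: Sequence[str], n_permutations: int = 10,
                      seed: int = 0, max_tries: int = 1000) -> List[List[str]]:
    """Create up to n_permutations distinct random reorderings of an essay's sentences."""
    rng = random.Random(seed)
    original = list(sentences)
    seen = set()
    results: List[List[str]] = []
    tries = 0
    while len(results) < n_permutations and tries < max_tries:
        tries += 1
        candidate = original[:]
        rng.shuffle(candidate)
        key = tuple(candidate)
        # Assumption: skip the original order and repeated permutations.
        if candidate == original or key in seen:
            continue
        seen.add(key)
        results.append(candidate)
    return results


def select_for_training(permutations: Sequence[List[str]], n_train: int = 4,
                        seed: int = 0) -> List[List[str]]:
    """Randomly pick a subset of permutations to add to the training set."""
    rng = random.Random(seed)
    return rng.sample(list(permutations), min(n_train, len(permutations)))


essay = ["S1.", "S2.", "S3.", "S4.", "S5.", "S6."]
all_permutations = make_permutations(essay)          # used in the test set
train_permutations = select_for_training(all_permutations)  # 4 used for training
```

Very short essays have few distinct orderings, so fewer than 10 permutations may be returned for them under these assumptions.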
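The difference between PRA and TPRA is easiest to see on a small example. The following sketch assumes each test item is a pair of an original essay's predicted score and the scores of its own permutations; the function names are hypothetical, and counting ties as incorrect is an assumption.

```python
from typing import List, Sequence, Tuple

Group = Tuple[float, Sequence[float]]  # (original score, scores of its permutations)


def pairwise_ranking_accuracy(groups: Sequence[Group]) -> float:
    """PRA: each original is compared only with its own permuted counterparts."""
    correct = total = 0
    for original, permuted in groups:
        for score in permuted:
            total += 1
            if original > score:  # assumption: ties count as incorrect
                correct += 1
    return correct / total if total else 0.0


def total_pairwise_ranking_accuracy(groups: Sequence[Group]) -> float:
    """TPRA: each original is compared with every permuted essay in the test data."""
    all_permuted: List[float] = [s for _, permuted in groups for s in permuted]
    correct = total = 0
    for original, _ in groups:
        for score in all_permuted:
            total += 1
            if original > score:
                correct += 1
    return correct / total if total else 0.0


toy_groups = [(0.9, [0.2, 0.5]), (0.4, [0.1, 0.3])]
pra = pairwise_ranking_accuracy(toy_groups)         # 4 of 4 pairs correct
tpra = total_pairwise_ranking_accuracy(toy_groups)  # 7 of 8 pairs correct
```

In the toy data every original outranks its own permutations, so PRA is perfect, but the second original (0.4) is outranked by a permutation of the first essay (0.5), which only TPRA penalizes.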

Model Parameters and Baselines
Coherence models We train and test the LC model described in Section 3.1 on the synthetic dataset and evaluate it using PRA and TPRA. During pre-processing, words are lowercased and initialized with pre-trained word embeddings (Zou et al., 2013). We use as a baseline the LC model that is based on the multiplication of the clique scores (similarly to Li and Hovy (2014)), and compare the results (LC mul) to our averaged approach. As another baseline, we use the entity grid (EGrid) (Barzilay and Lapata, 2008) that models transitions between sentences based on sequences of entity mentions labeled with their grammatical role. EGrid has been shown to give competitive results on similar coherence tasks in other domains. Using the Brown Coherence Toolkit (Eisner and Charniak, 2011), we construct the entity transition probabilities with length = 3 and salience = 2. The transition probabilities are then used as features that are fed as input to an SVM classifier with an RBF kernel and penalty parameter C = 1.5 to predict a coherence score.
LSTM T&N model We replicate and evaluate the LSTM model of Taghipour and Ng (2016) on ASAP and our synthetic data.

Combined models
After training the LC and LSTM T&N models, we concatenate their output vectors to build the Baseline: Vector Concatenation (VecConcat) model as described in Section 3.3.1, and train a Kernel Ridge Regression model.
The Joint Learning network is trained on both the ASAP and synthetic dataset as described in Section 3.3.2. Adversarial input is detected based on an estimated threshold on the difference between the predicted essay and coherence scores (Figure 3). The threshold value is empirically calculated on the development sets, and set to be the average difference between the predicted essay and coherence scores in the synthetic data:

$$threshold = \frac{1}{M} \sum_{i=1}^{M} (\hat{y}^{esy}_i - \hat{y}^{coh}_i)$$

where $M$ is the number of synthetic essays in the development set.
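Estimating the threshold from development data is a one-line average. The sketch below assumes the predictions for the synthetic development essays are available as (essay score, coherence score) pairs; the function name `estimate_threshold` is hypothetical.

```python
from typing import Sequence, Tuple


def estimate_threshold(dev_synthetic: Sequence[Tuple[float, float]]) -> float:
    """Average of (predicted essay score - predicted coherence score) over synthetic dev essays."""
    if not dev_synthetic:
        raise ValueError("at least one synthetic development essay is required")
    differences = [essay - coherence for essay, coherence in dev_synthetic]
    return sum(differences) / len(differences)


# Two toy synthetic essays with differences of roughly 0.6 and 0.4.
toy_threshold = estimate_threshold([(0.8, 0.2), (0.6, 0.2)])
```

Because the threshold is an average over permuted texts, some permuted texts necessarily fall below it; how strict the detector is therefore depends on how tightly the differences cluster on the development set.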
We furthermore evaluate a baseline where the joint model is trained without sharing the word embedding layer between the two submodels, and report the effect on performance (Joint Learning no layer sharing ).Finally, we evaluate a baseline where for the joint model we set the "gold" essay scores of synthetic data to zero (Joint Learning zero score ), as opposed to our proposed approach of setting them to be the same as the score of their original non-permuted counterpart in the ASAP dataset.

Results
The state-of-the-art LSTM T&N model, as shown in Table 2, gives the highest performance on the ASAP data, but is not robust to adversarial input and therefore unable to capture aspects of local coherence, with performance on synthetic data that is less than 0.5. On the other hand, both our LC model and the EGrid significantly outperform LSTM T&N on synthetic data. While EGrid is slightly better in terms of TPRA compared to LC (0.706 vs. 0.689), LC is substantially better on PRA (0.946 vs. 0.718). This could be attributed to the fact that LC is optimised using PRA on the development set. The LC mul variation has a performance similar to LC in terms of PRA, but is significantly worse in terms of TPRA, which further supports the use of our proposed LC model.
Our Joint Learning model manages to exploit the best of both the LSTM T&N and LC approaches: performance on synthetic data is significantly better compared to LSTM T&N (and in particular gives the highest TPRA value on synthetic data compared to all models), while it manages to maintain the high performance of LSTM T&N on ASAP data (performance drops slightly from 0.739 to 0.724, though not significantly). When the Joint Learning model is compared against the VecConcat baseline, we can again confirm its superiority on both datasets, giving significant differences on synthetic data.

Further Analysis
We furthermore evaluate the performance of the Joint Learning model when trained using different parameters (Table 3). When assigning "gold" essay scores of zero to adversarial essays (Joint Learning zero score), AES performance on the ASAP data drops to 0.449 QWK, and the results are statistically significant. This is partly explained by the fact that the model, given the training data gold scores, is biased towards predicting zeros. The result, however, further supports our hypothesis that forcing the Joint Learning model to rely on the coherence branch for adversarial input detection further improves performance. Importantly, we need something more than just training a state-of-the-art AES model (in our case, LSTM T&N) on both original and synthetic data.
We also compare Joint Learning to Joint Learning no layer sharing, in which the two submodels are trained separately without sharing the first layer of word representations. While the difference in performance on the ASAP test data is small, the differences are much larger on synthetic data, and are significant in terms of TPRA. By examining the false positives of both systems (i.e., the coherent essays that are misclassified as adversarial), we find that when the embeddings are not shared, the system is biased towards flagging long essays as adversarial, while interestingly, this bias is not present when the embeddings are shared. For instance, the average number of words in the false positive cases of Joint Learning no layer sharing on the ASAP data is 426, and the average number of sentences is 26; on the other hand, with the Joint Learning model, these numbers are 340 and 19 respectively. A possible explanation for this is that training the words with more contextual information (in our case, via embeddings sharing) is advantageous for longer documents with a large number of sentences.
Ideally, no essays in the ASAP data should be flagged as adversarial as they were not designed to trick the system. We calculate the number of ASAP texts incorrectly detected as adversarial, and find that the average error in the Joint Learning model is quite small (0.382%). This increases with Joint Learning no layer sharing (1%), although it still remains relatively small. We further investigate the essay and coherence scores predicted by our best model, Joint Learning, for the permuted and original ASAP essays in the synthetic dataset (for which we assume that the selected, highly scored ASAP essays are coherent, Section 4), and present results for 3 randomly selected prompts in Figure 4. The graphs show a large difference between predicted essay and coherence scores on permuted / adversarial data ((a), (b) and (c)), where the system predicts high essay scores for permuted texts (as a result of our training strategy), but low coherence scores (as predicted by the LC model). For highly scored ASAP essays ((d), (e) and (f)), the system predictions are less varied and positively contribute to the performance of our proposed approach.

Conclusion
We evaluated the robustness of state-of-the-art neural AES approaches on adversarial input of grammatical but incoherent sequences of sentences, and demonstrated that they are not well-suited to capturing such cases. We created a synthetic dataset of such adversarial examples and trained a neural local coherence model that is able to discriminate between such cases and their coherent counterparts. We furthermore proposed a framework for jointly training the coherence model with a state-of-the-art neural AES model, and introduced an effective strategy for assigning "gold" scores to adversarial input during training. When compared against a number of baselines, our joint model achieves better performance on randomly permuted sentences, while maintaining a high performance on the AES task. Among others, our results demonstrate that it is not enough to simply (re-)train neural AES models with adversarially crafted input, nor is it sufficient to rely on "simple" approaches that concatenate output representations from different neural models. Finally, our framework strengthens the validity of neural AES approaches with respect to adversarial input designed to trick the system.

Figure 1: Local Coherence (LC) model architecture using a window of size 3. All $h^{snt}$ representations are computed the same way as $h^{snt}_1$. The figure depicts the process of predicting the first clique score, which is applied to all the cliques in the text. The output coherence score is the average of all the clique scores. T is the number of cliques.
3.3.1 Baseline: Vector Concatenation (VecConcat)
The baseline model simply concatenates the output representations of the pre-prediction layers of the trained LSTM T&N and LC networks, and feeds the resulting vector to a machine learning algorithm (e.g., Support Vector Machines, SVMs) to predict the final overall score. In the LSTM T&N model, the output representation (hereafter referred to as the essay representation) is the vector produced from the Mean Over Time operation; in the LC model, we use the generated clique representations (Figure 1) aggregated with a max operation (hereafter referred to as the clique representation). Although the LC model is trained on permuted ASAP essays (Section 4) and the LSTM T&N model on the original ASAP data, essay and clique representations are generated for both the ASAP and the synthetic essays containing reordered sentences.
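The feature construction for this baseline can be sketched as follows. It is an illustration under assumptions: the function names are hypothetical, and interpreting the "max operation" as an element-wise maximum across all clique representations of a text is an assumption. The downstream regressor is left out.

```python
from typing import List, Sequence


def max_pool_cliques(clique_representations: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise maximum over all clique representations of one text (assumed reading)."""
    return [max(column) for column in zip(*clique_representations)]


def vec_concat_features(essay_representation: Sequence[float],
                        clique_representations: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate the essay representation with the pooled clique representation."""
    return list(essay_representation) + max_pool_cliques(clique_representations)


# Toy 2-dimensional representations; real ones would come from the trained networks.
toy_features = vec_concat_features([0.1, 0.2], [[1.0, 5.0], [3.0, 2.0]])
```

The resulting fixed-length vector is what would be passed to the separate learning algorithm that predicts the final score.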

Figure 3: A joint network for scoring essays as well as detecting adversarial input. The LSTM T&N model is the one depicted in Figure 2, and the LC in Figure 1.

Figure 4: Joint Learning model predictions on the synthetic test set for 3 randomly selected prompts. The upper graphs ((a), (b) and (c)) show the predicted essay and coherence scores on adversarial text, while the bottom ones ((d), (e) and (f)) show the predicted scores for highly scored / coherent ASAP essays. The blue circles represent the essay scores, and the red pluses the coherence scores. All predicted scores are mapped to their original scoring scale.

Table 1: Statistics for each dataset per prompt. For the synthetic dataset, the high scoring ASAP essays are selected based on the indicated score threshold (inclusive). "total size" refers to the number of the ASAP essays selected + their 10 different permutations.

Words that occur only once in the training set are mapped to a special UNK embedding. All network weights are initialized to values drawn randomly from a uniform distribution with scale = 0.05, and biases are initialized to zeros. We apply a learning rate of 0.001 and RMSProp (Tieleman and Hinton, 2012) for optimization. A size of 100 is chosen for the hidden layers ($d_{lstm}$ and $d_{cnn}$), and the convolutional window size ($m$) is set to 3. Dropout (Srivastava et al., 2014) is applied for regularization to the output of the convolutional operation with probability 0.3. The network is trained for 60 epochs and performance is monitored on the development sets; we select the model that yields the highest PRA value.

Table 2: Model performance on ASAP and synthetic test data. Evaluation is based on the average QWK, PRA and TPRA across the 8 prompts. * indicates significantly different results compared to LSTM T&N (two-tailed test with p < 0.01).

Table 3: Evaluation of different Joint Learning model parameters. * indicates significantly different results compared to our Joint Learning approach.