Investigating neural architectures for short answer scoring

Neural approaches to automated essay scoring have recently shown state-of-the-art performance. The automated essay scoring task typically involves a broad notion of writing quality that encompasses content, grammar, organization, and conventions. This differs from the short answer content scoring task, which focuses on content accuracy. The inputs to neural essay scoring models (n-grams and embeddings) are arguably well-suited to evaluating content in short answer scoring tasks. We investigate how several basic neural approaches similar to those used for automated essay scoring perform on short answer scoring. We show that neural architectures can outperform a strong non-neural baseline, but that performance and optimal parameter settings vary across the more diverse types of prompts typical of short answer scoring.


Introduction
Deep neural network approaches have recently been successfully developed for several educational applications, including automated essay assessment. In several cases, neural network approaches exceeded the previous state of the art on essay scoring (Taghipour and Ng, 2016).
The task of automated essay scoring (AES) is generally different from the task of automated short answer scoring (SAS). Essay scoring generally focuses on writing quality, a multidimensional construct that includes ideas and elaboration, organization, style, and writing conventions such as grammar and spelling (Burstein et al., 2013). Short answer scoring, by contrast, typically focuses only on the accuracy of the content of responses (Burrows et al., 2015). Analyzing the rubrics of prompts from the Automated Student Assessment Prize (ASAP) shared tasks on AES and SAS shows that, while there is some overlap between essay scoring and short answer scoring, there are three main dimensions of difference:
1. Response length. Responses in SAS tasks are typically shorter. For example, while the ASAP-AES data contains essays that average between about 100 and 600 tokens (Shermis, 2014), short answer scoring datasets may have average answer lengths ranging from just several words (Basu et al., 2013) to almost 60 words (Shermis, 2015).
2. Rubrics focus on content only in SAS vs. broader writing quality in AES.
3. Purpose and genre. AES tasks cover persuasive, narrative, and source-dependent reading comprehension and English Language Arts (ELA), while SAS tasks tend to be from science, math, and ELA reading comprehension.
Given these differences, the feature sets for AES and SAS systems are often different, with AES incorporating a larger set of features to capture writing quality (Shermis and Hamner, 2013). Nevertheless, deep learning approaches to AES have thus far demonstrated strong performance with minimal inputs consisting of unigrams and word embeddings. For example, Taghipour and Ng (2016) explore simple LSTM and CNN-based architectures with regression and evaluate on the ASAP-AES data. Alikaniotis et al. (2016) train score-specific word embeddings with several LSTM architectures. Dong and Zhang (2016) demonstrate that a hierarchical CNN architecture produces strong results on the ASAP-AES data. Recently, Zhao et al. (2017) show state-of-the-art performance on the ASAP-AES dataset with a memory network architecture.
In this work, we investigate whether deep neural network approaches with similarly minimal feature sets can produce good performance on the SAS task, including whether they can exceed a strong non-neural baseline. Unigram embedding-based neural network approaches to essay scoring capture content signals from their input features, but the extent to which they capture other aspects of writing quality rubrics has not been established. These approaches as implemented would seem to lend themselves even better to the purely content-focused rubrics in SAS, where content signals should dominate in achieving good human-machine agreement. On the other hand, recurrent neural networks may derive some of their predictive power in AES from more redundant signals in longer input sequences (as sketched by Taghipour and Ng (2016)). As a result, the shorter responses in SAS may hinder the ability of recurrent networks to achieve state-of-the-art results.
To explore the effectiveness of neural network architectures on SAS, we use the basic architecture and parameters of Taghipour and Ng (2016) on three publicly available short answer datasets: ASAP-SAS (Shermis, 2015), Powergrading (Basu et al., 2013), and SRA (Dzikovska et al., 2013, 2016). While these datasets differ with respect to the length and complexity of student responses, all prompts in the datasets focus on content accuracy. We explore how well the optimal parameters for AES from Taghipour and Ng (2016) fare on these datasets, and whether different architectures and parameters perform better on the SAS task.

Datasets
The three datasets we use cover different kinds of prompts and vary considerably in the length of the answers as well as their well-formedness. Table 1 shows basic statistics for each dataset. Figures 1, 2, and 3 show examples from each of the datasets.

ASAP-SAS
The Automated Student Assessment Prize Short Answer Scoring (ASAP-SAS) dataset (https://www.kaggle.com/c/asap-sas) contains 10 individual prompts covering science, biology, and ELA. The prompts were administered to U.S. high school students in several state-level assessments. Each prompt has an average of 2,200 individual responses, typically consisting of one or a few sentences. Responses are scored by two human annotators on a scale from 0 to 2 or 0 to 3, depending on the prompt (Shermis, 2015). Following the guidelines from the Kaggle competition, we always use the score assigned by the first annotator.

Powergrading
The Powergrading dataset (Basu et al., 2013) contains 10 individual prompts from U.S. immigration exams with about 700 responses each. Each prompt is accompanied by one or more reference responses. Because responses are very short (typically a few words; see Figure 2) and the percentage of correct responses is very high, responses in the Powergrading dataset are to some extent repetitive. The Powergrading dataset tests models' ability to perform well on extremely short responses.
The Powergrading dataset was originally used for the task of (unsupervised) clustering (Basu et al., 2013), so there are no state-of-the-art scoring results available for this dataset. For simplicity, we use the first of the three binary human-annotated correctness scores.

SRA
The SRA dataset (Dzikovska et al., 2012) became widely known as the dataset used in SemEval-2013 Shared Task 7, "The Joint Student Response Analysis and 8th Recognizing Textual Entailment Challenge" (Dzikovska et al., 2013). It consists of two subsets: Beetle, with student responses from interactions with a tutorial dialogue system, and SciEntsBank (SEB), with science assessment questions. We use two label sets from the shared task: the 2-way labels classify responses as correct or incorrect, while the 5-way labels provide a more fine-grained classification of responses into the categories non domain, correct, partially correct incomplete, contradictory, and irrelevant. In contrast with most SAS datasets, the SRA dataset contains a large number of prompts with relatively few responses per prompt (see Table 1). Following the procedure from the shared task, we train models for each SRA dataset (Beetle, SEB) across all responses to all prompts.

Figure 1: ASAP-SAS Prompt 1 example.
QUESTION: After reading the group's procedure, describe what additional information you would need in order to replicate the experiment. Make sure to include at least three pieces of information.

SCORING RUBRIC FOR A 3 POINT RESPONSE:
The response is an excellent answer to the question. It is correct, complete, and appropriate and contains elaboration, extension, and/or evidence of higher-order thinking and relevant prior knowledge. There is no evidence of misconceptions. Minor errors will not necessarily lower the score.

STUDENT RESPONSES:
• 3 points: Some additional information you will need are the material. You also need to know the size of the contaneir to measure how the acid rain effected it. You need to know how much vineager is used for each sample. Another thing that would help is to know how big the sample stones are by measureing the best possible way.
• 1 point: After reading the expirement, I realized that the additional information you need to replicate the expireiment is one, the amant of vinegar you poured in each container, two, label the containers before you start yar expirement and three, write a conclusion to make sure yar results are accurate.

Table 1: Overview of the datasets used in this work. Since we train prompt-specific models for ASAP-SAS and PG, we report the mean number of responses per set per prompt. For SRA, we train one model per label set across prompts and report the overall number of prompts per set as well as the mean number of responses per prompt per set (in parentheses).

Method
We carried out a series of experiments across datasets to discern the effect of specific parameters in the SAS setting. We took the best parameter set from Taghipour and Ng (2016) as our reference, since it performed best on the AES data. We looked at the effect of varying several important parameters to discern the effectiveness of each for SAS:
• the role of the mean-over-time layer, which was crucial for good performance in Taghipour and Ng (2016)
• the utility of pretrained embeddings
• the contribution of features derived from a convolutional layer
• the need for network representational capacity via recurrent hidden layer size
• the role of bidirectional architectures for short response lengths
• regression versus classification
• the effect of attention
To explore the effect of specific parameters, we trained models on the training set and evaluated on the development set only. Following these experiments, we trained a model on the combined training and development sets and evaluated on the test set. We report prompt-level results for this model in Section 3.6.
For evaluation, we use quadratic weighted kappa (QWK) for the ASAP-SAS and Powergrading datasets. Because the class labels in the SRA dataset are unordered, we report the weighted F1 score, which was the preferred metric in the SemEval shared task (Dzikovska et al., 2016).
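For concreteness, quadratic weighted kappa can be computed as follows (a standard numpy formulation; function and variable names are our own, not from the systems described here):

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, min_score, max_score):
    """QWK: agreement between two integer ratings, with disagreements
    penalized by the squared distance between the scores."""
    n = max_score - min_score + 1
    # Observed matrix O[i, j]: count of (true=i, predicted=j) pairs.
    observed = np.zeros((n, n))
    for t, p in zip(y_true, y_pred):
        observed[t - min_score, p - min_score] += 1
    # Quadratic disagreement weights.
    weights = np.array([[(i - j) ** 2 / (n - 1) ** 2 for j in range(n)]
                        for i in range(n)])
    # Expected matrix under chance agreement (outer product of marginals).
    expected = np.outer(observed.sum(axis=1),
                        observed.sum(axis=0)) / observed.sum()
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()
```

Identical rating sequences yield a QWK of 1.0, while chance-level agreement yields a value near 0.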

Baseline
As a baseline system, we use a supervised learner based on a hand-crafted feature set. This baseline is built on DKPro TC (Daxenberger et al., 2014) and relies on support vector classification using Weka (Hall et al., 2009). We preprocess the data using the ClearNLP segmenter via DKPro Core (Eckart de Castilho and Gurevych, 2014).
The features used in the baseline system comprise a commonly used and effective feature set for the SAS task. We use binary word and character uni- to trigram occurrence features, using the top 10,000 most frequent n-grams in the training data, as well as answer length, measured by the number of tokens in a response.
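A schematic pure-Python sketch of this feature extraction follows. The actual baseline is implemented with DKPro TC and Weka; the pooled 10,000-n-gram cap and the whitespace tokenization here are simplifications of that setup:

```python
from collections import Counter

def ngrams(items, n_max):
    # All n-grams of orders 1..n_max over a sequence of tokens or characters.
    return [tuple(items[i:i + n]) for n in range(1, n_max + 1)
            for i in range(len(items) - n + 1)]

def build_baseline_features(train_texts, top_k=10000):
    # Collect word and character uni- to trigrams, keeping the top_k most
    # frequent in the training data (10,000 in the baseline system).
    counts = Counter()
    for text in train_texts:
        counts.update(("w",) + g for g in ngrams(text.split(), 3))
        counts.update(("c",) + g for g in ngrams(list(text), 3))
    index = {g: i for i, (g, _) in enumerate(counts.most_common(top_k))}

    def transform(text):
        # Binary occurrence features plus answer length in tokens (last slot).
        vec = [0] * (len(index) + 1)
        for g in ngrams(text.split(), 3):
            if ("w",) + g in index:
                vec[index[("w",) + g]] = 1
        for g in ngrams(list(text), 3):
            if ("c",) + g in index:
                vec[index[("c",) + g]] = 1
        vec[-1] = len(text.split())
        return vec
    return transform
```

The resulting vectors would then be fed to a support vector classifier, as in the Weka-based baseline.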

Neural networks
We work with the basic neural network architecture explored by Taghipour and Ng (2016) (Figure 4). First, the word tokens of each response are converted to embeddings. Optionally, features are extracted from the embeddings by a convolutional network layer. This output forms the input to an LSTM layer. The hidden states of the LSTM are aggregated in either a "mean-over-time" (MoT) layer or an attention layer. The MoT layer simply averages the hidden states of the LSTM across the input. We use the same attention mechanism employed in Taghipour and Ng (2016), which involves taking the dot product of each LSTM hidden state and a vector that is trained with the network. The aggregation layer output is a single vector, which is input to a fully connected layer. This layer computes a scalar (regression) or class label (classification).
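The two aggregation options can be illustrated with a minimal numpy sketch; in the real model the attention vector v is trained jointly with the rest of the network rather than supplied by hand:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a vector of scores.
    e = np.exp(x - x.max())
    return e / e.sum()

def mean_over_time(h):
    """MoT layer: average the LSTM hidden states h (timesteps x dim)."""
    return h.mean(axis=0)

def attention_pool(h, v):
    """Attention layer: score each hidden state by its dot product with a
    learned vector v, softmax the scores, and return the weighted sum."""
    weights = softmax(h @ v)   # one weight per timestep, summing to 1
    return weights @ h         # single (dim,) vector for the dense layer
```

Both layers map a variable-length sequence of hidden states to a single fixed-size vector; attention simply replaces the uniform 1/T weights of MoT with learned, input-dependent weights.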

Setup, training, and evaluation
The text is lightly preprocessed as input to the neural networks, following Taghipour and Ng (2016). The text is tokenized with the standard NLTK tokenizer and lowercased. All numbers are mapped to a single <num> symbol. Each response is padded with a dummy token to uniform length, but these dummy tokens are masked out during model training.
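This pipeline can be sketched as follows; a simple regex tokenizer stands in here for the NLTK tokenizer, and the symbol names are illustrative:

```python
import re

NUM = "<num>"   # symbol replacing all numbers
PAD = "<pad>"   # dummy token, masked out during training

def preprocess(text, max_len):
    # Lowercase and tokenize (regex stand-in for the NLTK tokenizer),
    # then map pure numbers to the single <num> symbol.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    tokens = [NUM if re.fullmatch(r"\d+(\.\d+)?", t) else t for t in tokens]
    # Pad (or truncate) each response to a uniform length.
    return (tokens + [PAD] * max_len)[:max_len]
```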
For the ASAP-SAS and Powergrading datasets, prior to training, we scale all scores of responses to [0, 1] and use these scaled scores as input to the networks. For evaluation, the scaled scores are converted back to their original range. The SRA class labels are used as is.
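A minimal sketch of the score scaling; rounding predictions to the nearest valid integer score at evaluation time is our assumption about the conversion back to the original range:

```python
def scale_scores(scores, lo, hi):
    # Map integer scores onto [0, 1] as regression targets.
    return [(s - lo) / (hi - lo) for s in scores]

def unscale_scores(preds, lo, hi):
    # Map model outputs back to the original range for evaluation,
    # rounding to the nearest valid integer score (our assumption).
    return [int(round(p * (hi - lo) + lo)) for p in preds]
```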
We fix a number of neural network parameters for our experiments. For pretrained embeddings, in preliminary experiments the 100-dimensional GloVe vectors (Pennington et al., 2014) performed slightly better than a selection of other off-the-shelf embeddings, and hence we use these for all conditions that involve pretrained embeddings. Embeddings for word tokens that are not found in the pretrained set are randomly initialized from a uniform distribution. The convolutional layer uses a window length of 3 or 5 and 50 filters. We use a mean squared error loss for regression models and a cross-entropy loss for classification models. To train the network, we use RMSProp with ρ set to 0.9 and a learning rate of 0.001. We clip the norm of the gradient to 10. The fully connected layer's bias is initialized to the mean score of the training data, and the layer is regularized with dropout of 0.5. We use a batch size of 32, which provided a good compromise between performance and runtime in preliminary experiments.

(Footnote: mapping numbers to <num> may ignore relevant content information. However, since many numbers occur with units of measurement, e.g. 1g, we do not have word embeddings for them either, and the embeddings would simply be random initializations. We leave a full exploration of this issue to future work.)
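The optimizer settings can be summarized in a single simplified update step. This is a per-array stand-in for a framework RMSProp implementation; in practice the gradient norm would be clipped over all weights jointly and the update applied by the framework:

```python
import numpy as np

def rmsprop_step(w, grad, cache, lr=0.001, rho=0.9, eps=1e-7, clip=10.0):
    # Clip the gradient norm to 10, as in the training setup described above.
    norm = np.linalg.norm(grad)
    if norm > clip:
        grad = grad * (clip / norm)
    # RMSProp: running average of squared gradients with rho = 0.9.
    cache = rho * cache + (1 - rho) * grad ** 2
    # Scale the learning rate (0.001) by the root of the running average.
    w = w - lr * grad / (np.sqrt(cache) + eps)
    return w, cache
```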
To obtain more consistent results and improve predictive performance, we evaluate the models by keeping an exponential moving average of the model's weights during training. The moving average weights w_EMA are updated after each batch as w_EMA ← d · w_EMA + (1 − d) · w, where w are the current model weights and d is a decay rate that is updated dynamically at each batch by taking into account the number of batches so far: d = min(decay, (1 + #batches)/(10 + #batches)), where decay is a maximum decay rate, which we set to 0.999. This decay rate updating procedure allows the weights to be updated quickly at first while stabilizing across time.
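The moving-average update amounts to the following (a minimal sketch over flat lists of weights):

```python
def ema_update(w_ema, w, n_batches, max_decay=0.999):
    """One exponential-moving-average step over model weights.

    The decay grows toward max_decay as training proceeds, so early
    updates move the average quickly while later ones stabilize it."""
    d = min(max_decay, (1 + n_batches) / (10 + n_batches))
    return [d * e + (1 - d) * x for e, x in zip(w_ema, w)]
```

After the first batch the decay is only 2/11, so the average tracks the current weights closely; after a few thousand batches it saturates at 0.999.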
All models are trained for 50 epochs for parameter exploration on the development set (Section 3.5) and 50 epochs for the final models on the test set (Section 3.6). Following Taghipour and Ng (2016), for our parameter exploration experiments on the development set, we report the best performance across epochs. When we train final models on the combined training and development set and evaluate on the test set, we report the results from the last epoch.
During development, we observed that even after employing best practices for ensuring reproducibility of results, there was still some small variation between runs with the same parameter settings. The reasons for this variability were not clear.

Parameter exploration results
Our focus in this section is comparing different architecture and parameter choices for the neural networks with the best parameters from Taghipour and Ng (2016). Table 2 shows the results of our experiments on the development set for ASAP-SAS and Powergrading, and Table 3 shows the corresponding results for SRA.
Does the mean-over-time layer improve performance? Taghipour and Ng (2016) demonstrate a large performance gain with the mean-over-time layer that averages the LSTM hidden states across the response tokens. Comparing "T&N best" with "no MoT" across the datasets, we see mixed results. The mean-over-time layer performs relatively well across datasets, but achieves the best results only on the SRA-SEB dataset. We hypothesized that the mean-over-time layer is helpful when the input consists of longer responses (as was the case for the essay data in Taghipour and Ng (2016)). We computed the Pearson correlation on the ASAP-SAS data between the per-prompt difference between the two conditions and the mean response length in the development set. However, the correlation was modest at 0.437.
Do pretrained embeddings with tuning outperform fixed or randomly initialized embeddings? On all datasets, the pretrained embeddings with tuning (among the "T&N best" parameters) performed better than fixed pretrained or learned embeddings. 6 Tuned embeddings were especially important for the ASAP-SAS and Powergrading datasets.
Does a convolutional layer produce useful features for the SAS task? The results for convolutional features are mixed: convolutional features contribute small performance improvements on Powergrading and one of the SRA label sets (SRA SEB 2-way).
Can smaller hidden layers be used for the SAS task? Although LSTMs with smaller hidden states often outperformed the 300-dimensional LSTM in the T&N best parameter set (compare 'T&N best' performance with performance for 'LSTM dims' conditions), the improvements were all quite small.
Do bidirectional LSTMs improve performance? Bidirectional LSTM architectures produced solid gains over the T&N best parameters on ASAP-SAS, Powergrading, and two of the four SRA label sets.
Can classification improve performance? The T&N model used regression. While the labels in SRA allow only for classification, the scores in ASAP-SAS and PG can be modeled with either regression or classification. However, we found consistently better results using regression.
Can attention improve performance? The attention mechanism we considered in this paper yielded strong performance improvements over the mean-over-time layer on all datasets except SRA-SEB 5-way. The largest improvements were on Powergrading and SRA-Beetle 5-way, where increases were almost 3 points weighted F1.
We also report the results of the combinations of individual parameters that performed well on the development data at the bottom of Table 2 and Table 3. While these combinations performed better than any individual parameter variation on ASAP-SAS and Powergrading, the combination performed worse on three of the four label sets in the SRA data. These results underscore that these parameters do not always produce additive effects in practice.
We examined the predictions from the baseline system and the T&N system for the ASAP-SAS development set and conducted a brief error analysis. In general, across the 10 prompts, it can be observed that when the baseline system is incorrect it tends to under-predict the scores, whereas the T&N system tends to slightly over-predict scores when it is incorrect. These effects are typically small, but consistent.

Test performance
We selected the top parameter settings on the development set and trained models on the full training set (i.e., training and development sets) for each dataset:
• ASAP-SAS: 250-dimensional bidirectional LSTM, attention mechanism
• Powergrading: CNN features with window length 5, 150-dimensional bidirectional LSTM, attention mechanism
• SRA: because of the decreased performance of the combined best individual parameters on the development data, a 300-dimensional unidirectional LSTM with attention mechanism

Table 2: Parameter experiment results on ASAP-SAS and Powergrading on the development set. "Baseline" is the baseline non-neural system. "T&N best" is the best-performing parameter set in Taghipour and Ng (2016): tuned embeddings (here, 100-dimensional GloVe), 300-dimensional LSTM, unidirectional, mean-over-time layer. Scores are bolded if they outperform the score for the "T&N best" parameter setting.
These models appear as "T&N tuned" in Table 4, along with the non-neural baseline system. On ASAP-SAS, the "T&N tuned" parameter configuration outperformed the baseline system and the "T&N best" parameters. The tuned system does not reach the state-of-the-art Fisher-transformed mean score on the ASAP-SAS dataset (Ramachandran et al., 2015), which, like the winner of the ASAP-SAS competition (Tandalla, 2012), employed prompt-specific regular expressions. Other top-performing systems used prompt-specific preprocessing and ensemble-based approaches over rich feature spaces (Higgins et al., 2014).
On the Powergrading dataset, the "T&N tuned" system did not match the performance of the baseline system, consistent with the results on the development set (Table 2). It appears that on the very short and redundant data in this dataset, the character- and word-n-gram-based system can learn somewhat more efficiently than the neural systems.
On the SRA datasets, the "T&N tuned" model outperformed the baseline and the "T&N best" settings on average across prompts, by a larger margin than on the other datasets. On the SRA data, as on the ASAP-SAS data, a gap remains between the tuned model's performance and the state of the art. On SRA, this may be partly due to the use of "question indicator" features by the top-performing systems (Heilman and Madnani, 2013; Ott et al., 2013).
The performance improvement over the baseline system was larger on the development sets than on the test sets. Part of the reason for this is that the test set evaluation procedure likely did not choose the best-performing epoch for the neural models.

Table 3: Parameter experiment results on the SRA datasets on the development set. "wF1" is the weighted F1 score. "Baseline" is the baseline non-neural system. "T&N best" is the best-performing parameter set in Taghipour and Ng (2016): tuned embeddings (here, 100-dimensional GloVe), 300-dimensional LSTM, unidirectional, mean-over-time layer. Scores are bolded if they outperform the score for the "T&N best" parameter setting.

Discussion
Our results establish that the basic neural architecture of pretrained embeddings tuned during model training plus LSTMs is a reasonably effective architecture for the short answer content scoring task. The architecture performs well enough to exceed a non-neural content scoring baseline system in most cases. Given the diversity of prompts in SAS, there was a good deal of variation in the effectiveness of parameter choices in this neural architecture. Still, some basic trends emerged. First, pretrained embeddings tuned during model training were crucial for competitive performance on most datasets. Second, neural models for SAS generally benefit from hidden dimensions of similar size to those of models for AES. Only the Powergrading dataset, with very short answers and a small vocabulary for each prompt, benefited from a significantly smaller LSTM dimensionality. The relationship between task, rubrics, vocabulary size, and the representational capacity of neural models for SAS needs further exploration.
Third, a mean-over-time aggregation mechanism on top of the LSTM generally performed well, but notably this mechanism was not nearly as important as in the AES task. Mean-over-time produced competitive results on many prompts, but contrary to Taghipour and Ng (2016), bidirectional LSTMs and attention produced some of the best results, which is consistent with results for neural models on other text classification tasks (e.g., Longpre et al. (2016)).
Research is needed to explain these emerging differences in effective neural architectures for AES vs. SAS, including model-specific factors such as the interaction of an LSTM's integration of features over time and the redundancy of predictive signals in essays vs. short answers, along with data-specific factors such as the consistency of human scoring, the demands of different rubrics, and the homogeneity or diversity of prompts in each setting. At the same time, different from the AES task, the family of neural architectures explored here needs further augmenting to achieve state-of-the-art results on the SAS task. Moreover, more experiments are needed to document how well neural systems perform relative to highly optimized non-neural systems. While further parameter optimizations and different architectures may yield better results, it may be the case that the SAS task of content scoring with relatively short response sequences requires neural approaches to employ a larger set of features (Pado, 2016) or a greater level of prompt-specific tuning, or pairing with methods from active learning (Horbach and Palmer, 2016).