The Impact of Training Data on Automated Short Answer Scoring Performance

Automatic evaluation of written responses to content-focused assessment items (automated short answer scoring) is a challenging educational application of natural language processing. It is often addressed using supervised machine learning by estimating models to predict human scores from detailed linguistic features such as word n-grams. However, training data (i.e., human-scored responses) can be difficult to acquire. In this paper, we conduct experiments using scored responses to 44 prompts from 5 diverse datasets in order to better understand how training set size and other factors relate to system performance. We believe this will help future researchers and practitioners working on short answer scoring to answer practically important questions such as, “How much training data do I need?”


Introduction
Automated short answer scoring is a challenging educational application of natural language processing that has received considerable attention in recent years, including a SemEval shared task (Dzikovska et al., 2013), a public competition on the Kaggle data science website (https://www.kaggle.com/c/asap-sas), and various other research papers (Leacock and Chodorow, 2003; Nielsen et al., 2008; Mohler et al., 2011).
The goal of short answer scoring is to create a predictive model that can take as input a text response to a given prompt (e.g., a question about a reading passage) and produce a score representing the accuracy or correctness of that response. One well-known approach is to learn a prompt-specific model using detailed linguistic features such as word n-grams from a large training set of responses that have been previously scored by humans. This approach works very well when large sets of training data are available, such as in the ASAP 2 competition, where there were thousands of labeled responses per prompt. However, little work has been done to investigate the extent to which short answer scoring performance depends on the availability of large amounts of training data. This is important because short answer scoring is different from tasks where one dataset can be used to train models for a wide variety of inputs, such as syntactic parsing. Current short answer scoring approaches depend on having training data for each new prompt.
* Michael Heilman is now a data scientist at Civis Analytics.
Here, we investigate the effects on performance of training sample size and a few other factors, in order to help answer extremely practical questions like, "How much data should I gather and label before deploying automated scoring for a new prompt?" Specifically, we explore the following research questions:
• How strong is the association between training sample size and automated scoring performance?
• If the training set size is doubled, how much improvement in performance should we expect?
• Are there other factors such as the number of score levels that are strongly associated with performance?
• Can we create a model to predict scoring model performance from training sample size and other factors (and how confident would we be of its estimates)?

Short Answer Scoring System
In this section, we describe the basic short answer scoring system that we will use for our experiment. We believe that this system is broadly representative of the current state of the art in short answer scoring. Its performance is probably slightly lower than what one would find for a system highly tailored to a specific dataset. Although features derived from automatic syntactic or semantic parses might also result in small improvements, we did not include such features for simplicity.
The system uses support vector regression (Smola and Schölkopf, 2004) to estimate a model that predicts human scores from vectors of binary indicators for linguistic features. We use the implementation from the scikit-learn package (Pedregosa et al., 2011), with default parameters except for the complexity parameter, which is tuned using cross-validation on the data provided for training. For features, we include indicator features for the following:
• lowercased word unigrams
• lowercased word bigrams
• length bins (specifically, whether the log of 1 plus the number of characters in the response, rounded down to the nearest integer, equals x, for all possible x from the training set)
Note that word unigrams and bigrams include punctuation.
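The feature set described above can be sketched in a few lines of standard-library Python. This is only an illustration, not the system's actual code: the regular-expression tokenizer, the use of the natural logarithm for the length bin, and the names tokenize, length_bin, and extract_features are all assumptions made for the example.

```python
import math
import re


def tokenize(text):
    """Lowercase and split into word and punctuation tokens (assumed tokenizer)."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def length_bin(text):
    """floor(log(1 + number of characters)); the natural log is an assumption."""
    return int(math.floor(math.log(1 + len(text))))


def extract_features(text):
    """Return the set of active binary indicator features for one response."""
    tokens = tokenize(text)
    # Unigram indicators; punctuation tokens are kept, as in the description.
    features = {"unigram=" + tok for tok in tokens}
    # Bigram indicators over adjacent token pairs.
    features.update("bigram=" + a + " " + b for a, b in zip(tokens, tokens[1:]))
    # A single length-bin indicator per response.
    features.add("length_bin=%d" % length_bin(text))
    return features


feats = extract_features("The cell divides.")
print(sorted(feats))
```

Each response is represented by the set of indicators that fire for it. Mapping those sets to fixed-length binary vectors over the features seen in the training set, and fitting support vector regression on them, would follow as a separate step.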

Datasets
We conducted experiments using responses to 44 prompts from five different datasets. The data for each of the 44 prompts was split into a training set and a testing set. Table 1 provides an overview of the datasets. The ASAP 2 dataset is from the 2012 public competition hosted on Kaggle (https://www.kaggle.com/c/asap-sas) and is publicly available. The Math and Reading 1 datasets were developed as part of the Educational Testing Service's "Cognitively Based Assessment of, for, and as Learning" research initiative (Bennett, 2010). The Reading 2 dataset was developed as part of the "Reading for Understanding" framework (Sabatini and O'Reilly, 2013). The Science dataset was developed and scored as part of the Knowledge Integration framework (Linn, 2006). Note that only the ASAP 2 dataset is publicly available.
Every prompt has at least 359 and at most 2,633 training examples.

Experiments
For each prompt, we trained a model on the full training set for that prompt and evaluated on the testing set. In addition, we trained models from randomly selected subsamples of the training set and evaluated each on the full testing set. Specifically, we created 20 replications of samples (without replacement) of sizes 2^n × 100 (i.e., 100, 200, 400, ...) up to the full training sample size.
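A minimal sketch of this subsampling scheme follows. The function names and the fixed seed are hypothetical, and the sketch assumes that subsample sizes stop strictly below the full training set size, since the full set is handled by the separate full-data run.

```python
import random


def subsample_sizes(full_size, base=100):
    """Sizes base * 2**n (100, 200, 400, ...) strictly below the full training set size."""
    sizes = []
    size = base
    while size < full_size:
        sizes.append(size)
        size *= 2
    return sizes


def draw_replications(train_ids, size, n_reps=20, seed=0):
    """Draw n_reps random subsamples of the given size, each without replacement."""
    rng = random.Random(seed)
    return [rng.sample(train_ids, size) for _ in range(n_reps)]


# Example with the smallest training set size reported (359 responses).
train_ids = list(range(359))
sizes = subsample_sizes(len(train_ids))
reps = draw_replications(train_ids, sizes[0])
print(sizes, len(reps), len(reps[0]))
```

Sampling is without replacement within each replication, but different replications may overlap, because each one is drawn independently from the same training set.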
For subsamples of the training data, we averaged the results across the 20 replications before further analyses. We used the Fisher Transformation z(κ) when averaging because of its variance-stabilization properties. The same transformation was also used by the ASAP 2 competition as part of its official evaluation.
Table 2: Descriptive statistics about performance in terms of averaged quadratically weighted κ for different training sample sizes (N), aggregated across all prompts. "med." = median, "s.d." = standard deviation.
This gives us a dataset of averaged κ values for different combinations of prompts and sample sizes. Table 2 shows descriptive statistics.
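As an illustration of the averaging step, the sketch below uses the standard definition of the Fisher transformation, z(κ) = 0.5 ln((1 + κ) / (1 − κ)), which equals the inverse hyperbolic tangent. The function names are hypothetical, and every κ value is assumed to lie strictly between -1 and 1.

```python
import math


def fisher_z(kappa):
    """Fisher transformation: z = 0.5 * ln((1 + kappa) / (1 - kappa)) = atanh(kappa)."""
    return math.atanh(kappa)


def inverse_fisher_z(z):
    """Map a z value back to the kappa scale."""
    return math.tanh(z)


def average_kappa(kappas):
    """Average kappa values in z space, then convert the mean back to kappa."""
    zs = [fisher_z(k) for k in kappas]
    return inverse_fisher_z(sum(zs) / len(zs))


print(average_kappa([0.2, 0.8]))
```

Because the transformation stretches values near 1, the z-space average of unequal κ values is slightly higher than their arithmetic mean.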
For each data point, in addition to the κ value and prompt, we compute the following:
• log2SampleSize: log 2 of the training sample size,
• log2MinSampleSizePerScore: log 2 of the minimum number of examples for a score level (e.g., log 2 (16) if the least frequent score level in the training sample had 16 examples),
• meanLog2NumChar: The mean, across training sample responses, of log 2 of the number of characters (a measure of response length),
• numLevels: The number of score levels.

Variable                     r
log2SampleSize               .550
log2MinSampleSizePerScore    .392
meanLog2NumChar              -.365
numLevels                    .033
Table 3: Pearson's r correlations between training set characteristics and human-machine κ.
We first compute Pearson's r to measure the association between κ and each of these variables. The results are shown in Table 3.
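The four training set characteristics could be computed along the following lines. This is a sketch under stated assumptions: the function name is hypothetical, responses are assumed to be non-empty strings, and numLevels is counted from the score levels observed in the sample, which may differ from the number of levels in the scoring rubric.

```python
import math
from collections import Counter


def training_set_covariates(responses, scores):
    """Compute the four training set characteristics for one training sample."""
    counts = Counter(scores)  # number of examples per observed score level
    return {
        "log2SampleSize": math.log2(len(responses)),
        "log2MinSampleSizePerScore": math.log2(min(counts.values())),
        "meanLog2NumChar": sum(math.log2(len(r)) for r in responses) / len(responses),
        "numLevels": len(counts),  # assumption: levels observed in the sample
    }


# Made-up sample: 16 two-character responses scored 0, 48 four-character responses scored 1.
responses = ["ab"] * 16 + ["abcd"] * 48
scores = [0] * 16 + [1] * 48
covariates = training_set_covariates(responses, scores)
print(covariates)
```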
Not surprisingly, the variable most strongly associated with performance (i.e., κ) is the log 2 of the number of responses used for training. However, having a large sample does not ensure high human-machine agreement: the correlation between κ and log2SampleSize was only r = .550. Performance varies considerably across prompts, as illustrated in Figure 1.
Next, we tested whether we could predict human-machine agreement for different size training sets for new prompts. We used the dataset of κ values for different prompts and training set sizes described above (N = 224). We iteratively held out each dataset and used it as a test set to evaluate performance of a model trained on the remaining datasets. For the model, we used a simple ordinary least squares linear regression model, with the variables from Table 3 as features (see footnote 5). For labels, we used z(κ) instead of κ, and then converted the model's predictions back to κ values using the inverse of the z function (Eq. 1). We report two measures of correlation (Pearson's and Spearman's) and two measures of error (root mean squared error and mean absolute error). The results are shown in Table 4.
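The leave-one-dataset-out procedure can be sketched as follows. This is not the code used for the experiments: it fits ordinary least squares through the normal equations in plain Python, covers only the two error measures, and runs on synthetic, made-up data points. All function names are hypothetical.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_ols(xs, ys):
    """Ordinary least squares with an intercept, via the normal equations."""
    rows = [[1.0] + list(x) for x in xs]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(xtx, xty)


def predict(weights, x):
    """Linear prediction: intercept plus weighted sum of the features."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], x))


def leave_one_dataset_out(points):
    """points: (dataset_name, features, kappa) tuples. Returns (true, predicted) kappa pairs."""
    pairs = []
    for held_out in sorted({p[0] for p in points}):
        train = [p for p in points if p[0] != held_out]
        test = [p for p in points if p[0] == held_out]
        # Fit on z(kappa), then map predictions back to kappa with the inverse transformation.
        weights = fit_ols([p[1] for p in train], [math.atanh(p[2]) for p in train])
        pairs += [(p[2], math.tanh(predict(weights, p[1]))) for p in test]
    return pairs


def rmse(pairs):
    """Root mean squared error over (true, predicted) pairs."""
    return math.sqrt(sum((t - p) ** 2 for t, p in pairs) / len(pairs))


def mae(pairs):
    """Mean absolute error over (true, predicted) pairs."""
    return sum(abs(t - p) for t, p in pairs) / len(pairs)


# Synthetic, hypothetical data: three made-up datasets whose z(kappa) is exactly
# linear in a single feature (log2 of the training sample size).
points = [(name, [x], math.tanh(0.1 * x)) for name in ("A", "B", "C") for x in (7.0, 8.0, 9.0, 10.0)]
pairs = leave_one_dataset_out(points)
print(rmse(pairs), mae(pairs))
```

Holding out whole datasets, rather than individual prompts or data points, means every prediction is made for prompts from a dataset the regression model has never seen.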

Discussion and Conclusion
In response to the research questions we posed earlier, we found that:
• The correlation between training sample size and human-machine agreement is strong, though performance varies considerably by prompt (Table 2 and Figure 1).
• If the training sample is doubled in size, then performance increases .02 to .05 in κ (Table 2). This rate of increase was fairly consistent across prompts. However, as with other supervised learning tasks, there will likely be a point where increasing the sample size does not yield large improvements.
• Variables such as the minimum number of examples per score level and the length of typical responses are also associated with performance (Table 3), though not as much as the overall sample size.
• A model for predicting human-machine agreement from training sample size and other factors could provide useful information to developers of automated scoring, though predictions from our simple model show considerable error (Table 4). More detailed features of prompts, scoring rubrics, and student populations might lead to better predictions.
Footnote 5: We prefer to use a simpler linear model instead of a more complex hierarchical model for the sake of interpretability.
In this paper, we investigated the impact of training sample size on short answer scoring performance. Our results should help researchers and practitioners of automated scoring answer the highly practical question, "How much data do I need to get good performance?", for new short answer prompts. We conducted our experiments using a basic system with only n-gram and length features, though it is likely that the observed trends (e.g., the rate of increase in κ with more data) would be similar for many other systems. Future work could explore issues such as how much performance varies by task type or by the amount of linguistic variation in responses at particular score levels.