Preventing Critical Scoring Errors in Short Answer Scoring with Confidence Estimation

Many recent Short Answer Scoring (SAS) systems have employed Quadratic Weighted Kappa (QWK) as their evaluation measure. However, we hypothesize that QWK is unsatisfactory for evaluating SAS systems when we consider measuring their effectiveness in actual usage. We introduce a new task formulation of SAS that matches this actual usage. In our formulation, a SAS system should produce as many scoring predictions as possible that are free of critical scoring errors (CSEs). We conduct experiments under our new task formulation and demonstrate that a typical SAS system can predict scores with zero CSEs for up to approximately 50% of test data by filtering out low-reliability predictions on the basis of a certain confidence estimation. This result directly indicates the possibility of halving the scoring cost of human raters, which makes our formulation more suitable for evaluating SAS systems.


Introduction
Automated Short Answer Scoring (SAS) is the task of estimating the score of a short-text answer written in response to a given prompt, on the basis of whether the answer satisfies rubrics prepared in advance by humans. SAS systems have mainly been developed to markedly reduce the scoring cost of human raters. Moreover, SAS systems play a central role in providing stable and sustainable scoring for repeated, large-scale examinations and (online) self-study learning support systems (Attali and Burstein, 2006; Shermis et al., 2010; Leacock and Chodorow, 2003; Burrows et al., 2015).
The development of SAS systems has a long history (Page, 1994; Foltz et al., 1999). Many recent studies (e.g., Taghipour and Ng, 2016; Riordan et al., 2017; Wang et al., 2019) utilize Quadratic Weighted Kappa (QWK) (Cohen, 1968) as a measure for assessing and comparing the performance of SAS systems. QWK is indeed useful for measuring and comparing the overall performance of each system during the day-to-day development of scoring models. In our experiments, however, we reveal that SAS systems with high QWK can still incur serious scoring errors (see the experiments in Section 5.3). Such serious scoring errors are rarely made by trained human raters; therefore, we need to avoid this type of error to ensure sufficient scoring quality of SAS systems, e.g., for use in scoring commercial examinations. When we strictly focus on measuring the effectiveness of SAS systems in actual usage, QWK thus seems unsatisfactory as an evaluation measure. Here, we assume that the following procedure is a realistic configuration for utilizing SAS systems in practice: (1) apply a SAS system to score each answer, (2) treat the predicted score as the final decision if it is highly reliable, and proceed to the next step otherwise, and (3) discard the unreliable predicted score and have a human rater reevaluate the answer to make the final decision. We therefore aim to establish an appropriate evaluation scheme for accurately estimating the effectiveness of SAS systems in actual usage, instead of relying on the current de facto standard evaluation measure, QWK.
To do so, we first introduce a key concept, the critical scoring error (CSE), which represents an unacceptable prediction error. Specifically, a CSE occurs when the gap between a predicted score and the ground truth is larger than a predefined threshold, which, for example, can be determined from the average gap between human raters. Then, in our task formulation, the goal of automated SAS is to obtain as many predictions without CSEs as possible, which directly reflects the effectiveness of SAS models in actual usage. We also introduce the critical scoring error rate (CSRate), which is the rate of CSEs in a subset of the test data selected on the basis of the confidence of predictions, for evaluating the performance of SAS systems.

Prompt: Explain what the author means by the phrase "this tension has caused several different philosophical viewpoints in Western culture"

Figure 1: Example of a prompt and a student's short-text response excerpted from the dataset proposed by Mizumoto et al. (2019). The allotment score of this prompt is 16, and this response is assigned four points by a human rater. Note that the prompt and the response are translated from the original ones given in Japanese.
In our experiments, we select two methods, i.e., the posterior probability and the trust score (Jiang et al., 2018), as case studies for estimating whether or not each prediction is reliable. We use these two confidence estimation methods to obtain a set of highly reliable predictions. The experimental results show that SAS systems can predict scores with zero CSEs for up to approximately 50% of test data by filtering out low-reliability predictions.

Task Description
As shown in the example in Figure 1, for a short answer question, a student writes a short text as a response to a given prompt, and a human rater marks the response on the basis of the rubrics for the prompt. Analogously, given a student response $x = (x_1, x_2, \ldots, x_n)$ for a prompt allotted $N$ points, our short answer scoring task is defined as predicting a score $s \in S = \{0, \ldots, N\}$ for that response.
SAS models are often evaluated in terms of the agreement between the scores of a model prediction and human annotation with QWK. QWK is calculated as:

$$\kappa = 1 - \frac{\sum_{i,j} W_{i,j} O_{i,j}}{\sum_{i,j} W_{i,j} E_{i,j}},$$

where $O \in \mathbb{R}^{N \times N}$ is the confusion matrix of the two ratings and $E \in \mathbb{R}^{N \times N}$ is the outer product of the histogram vectors of the two ratings; $O$ and $E$ are normalized to have the same sum of elements. $W_{i,j}$ is calculated as:

$$W_{i,j} = \frac{(i - j)^2}{N^2},$$

where $i$ and $j$ are the score rated by a human and the score predicted by the SAS system, respectively, and $N$ is the allotment score defined for the prompt.
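The QWK computation above can be sketched in plain Python. This is a minimal illustration assuming integer scores in {0, ..., N}, not the authors' implementation:

```python
def qwk(human, system, N):
    """Quadratic Weighted Kappa for two integer ratings in {0, ..., N}."""
    K = N + 1
    # O: confusion matrix of the two ratings
    O = [[0.0] * K for _ in range(K)]
    for i, j in zip(human, system):
        O[i][j] += 1.0
    # E: outer product of the two score histograms, normalized so that
    # its elements sum to the same total as O
    total = len(human)
    hist_h = [sum(1 for s in human if s == i) for i in range(K)]
    hist_s = [sum(1 for s in system if s == j) for j in range(K)]
    E = [[hist_h[i] * hist_s[j] / total for j in range(K)] for i in range(K)]
    # W: quadratic disagreement weights
    W = [[(i - j) ** 2 / N ** 2 for j in range(K)] for i in range(K)]
    num = sum(W[i][j] * O[i][j] for i in range(K) for j in range(K))
    den = sum(W[i][j] * E[i][j] for i in range(K) for j in range(K))
    return 1.0 - num / den
```

Perfect agreement yields 1.0 and perfect disagreement on the extreme scores yields -1.0; intermediate errors are penalized quadratically by their distance from the gold score.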

Scoring Model
Following related work on confidence calibration (Nguyen and O'Connor, 2015; Jiang et al., 2018; Hendrycks and Gimpel, 2017), we formalize our SAS model as a classification model. Note that our focus in this paper is on evaluating the effectiveness of confidence scores for SAS tasks rather than on building the most accurate SAS model. Therefore, we employ a standard Bidirectional Long Short-Term Memory (Bi-LSTM) based neural network as a representative scoring model for typical SAS tasks.
Given an input student response $x$, the model outputs a score $s \in S$ for the response as follows. First, we convert the tokens in $x$ to word-embedding vectors. These embeddings are fed into a Bi-LSTM, and $D$-dimensional hidden vectors $\{h_1, h_2, \ldots, h_n\}$ are obtained as the sums of the hidden vectors from the forward and backward LSTMs. The response vector $h$ is then computed by averaging these hidden vectors.
A probability distribution over the scores is calculated as:

$$p(\cdot \mid x) = \mathrm{softmax}(W h + b),$$

where $W \in \mathbb{R}^{N \times D}$ and $b \in \mathbb{R}^{N}$ are learnable parameters. Finally, we select the most likely output score $\hat{s} \in S$ for a given input $x$ as:

$$\hat{s} = \operatorname*{argmax}_{s \in S} p(s \mid x).$$
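The classification head (the softmax and argmax above) can be sketched in plain Python; the weights and dimensions below are toy values, and the Bi-LSTM encoder producing the response vector $h$ is assumed to exist elsewhere:

```python
import math

def predict_score(h, W, b):
    """Classification head: logits = W h + b, softmax over the N + 1
    possible scores, then argmax to pick the most likely score."""
    logits = [sum(w_i * h_i for w_i, h_i in zip(row, h)) + b_k
              for row, b_k in zip(W, b)]
    # numerically stable softmax
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    Z = sum(exps)
    probs = [e / Z for e in exps]
    s_hat = probs.index(max(probs))  # argmax score
    return s_hat, probs
```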

Task Formulation
The goal of our new task formulation for applying SAS to real-world educational measurement is to obtain as many scoring predictions without CSEs as possible. This is because we can trust such predictions and thus markedly reduce the cost of human scoring. In this section, we describe our new task formulation of automated SAS.

Table 1: Statistics of the dataset used in this paper. "Length limit (char.)" is the maximum character length of a response permitted for a prompt. "Allotment score" is the maximum score for a prompt. "Human agreement" represents the QWK and Cohen's Kappa (shown in brackets) between the scores annotated by two human raters.
First, to evaluate the proportion of CSEs in the predictions, we define a function on the gold dataset $D$ that returns whether or not the predicted score $\hat{s}$ for an input $x$ is categorized as a CSE:

$$\mathrm{CSE}(x, \hat{s}) = \begin{cases} 1 & \text{if } |\hat{s} - s| > \lambda N, \\ 0 & \text{otherwise}, \end{cases}$$

where $\lambda \in [0, 1]$ is a given threshold, $N$ is the allotment score for the prompt, $s$ is the ground-truth score of input $x$, and $\hat{s}$ is obtained using Equation 5. Note that we can choose the value of $\lambda$ depending on the situation. For example, for a high-stakes examination such as an entrance examination, $\lambda$ should be smaller than for daily tests in schools.
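The CSE indicator amounts to a one-line check; a minimal sketch, assuming integer scores and using the λ = 0.2 adopted later in the experiments as the default:

```python
def is_cse(pred, gold, N, lam=0.2):
    """Return True when the prediction is a critical scoring error,
    i.e., its gap from the gold score exceeds the fraction lam of the
    allotment score N."""
    return abs(pred - gold) > lam * N
```

For a prompt allotted 16 points and λ = 0.2, any prediction more than 3.2 points away from the gold score counts as a CSE.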
Here, let $D$ be the test data set and let $D'$ be a subset of $D$, that is, $D' \subseteq D$. Our objective is then to maximize the size of the subset $D'$ under the condition that it contains no CSEs. To obtain $D'$, we estimate a confidence score $C(x, \hat{s})$ for each prediction on the basis of a certain confidence measure, and then gather the predictions whose confidence scores exceed a threshold $\tau$:

$$D' = \{\, x \in D \mid C(x, \hat{s}) > \tau \,\}.$$

For evaluating performance under our task formulation, we propose the critical scoring error rate (CSRate), defined as:

$$\mathrm{CSRate} = \frac{\sum_{x \in D'} \mathrm{CSE}(x, \hat{s})}{|D'|},$$

where $\hat{s}$ is obtained using Equation 5. In real-world tasks, the model is expected to select as large a subset $D'$ as possible with a very small, ideally zero, CSRate.
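Putting the selection and the metric together, a minimal sketch of choosing the subset by a confidence threshold τ and computing CSRate on it (function and variable names are ours, not the paper's):

```python
def csrate(preds, golds, confs, N, lam, tau):
    """Select the subset of predictions with confidence above tau, then
    return (CSRate over that subset, size of the subset)."""
    kept = [(p, g) for p, g, c in zip(preds, golds, confs) if c > tau]
    if not kept:
        return 0.0, 0
    n_cse = sum(1 for p, g in kept if abs(p - g) > lam * N)
    return n_cse / len(kept), len(kept)
```

Raising τ shrinks the kept subset but should drive CSRate toward zero if the confidence measure is well behaved.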

Filtering Out Low-Reliability Predictions Using a Confidence Score
As described in Equation 8, the quality of the confidence measure is important for our task configuration. In this paper, we employ two methods for computing the confidence score as case studies: (1) the posterior probability of the classification model and (2) the trust score (Jiang et al., 2018).

Posterior Probability
The most straightforward method for computing the confidence of a prediction in a classification problem is to employ a probabilistic model and use the output label probability:

$$C_{\mathrm{prob}}(x, \hat{s}) = p(\hat{s} \mid x).$$

Although the label probability is often used as a confidence score for predictions, some authors are skeptical of its utility (Guo et al., 2017; Kumar et al., 2018). In our experiments, we evaluate the effectiveness of this posterior probability as a confidence estimation method for SAS models.
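As a sketch, this confidence is simply the probability mass the softmax assigns to the argmax label (assuming the probability distribution produced by the scoring model):

```python
def posterior_confidence(probs):
    """Confidence of a prediction = posterior probability p(s_hat | x)
    of the argmax label s_hat, i.e., the largest softmax output."""
    return max(probs)
```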

Trust Score
Trust score (Jiang et al., 2018) is an indicator of the reliability of prediction based on the distance between a target data point and its nearest data points in training data. The intuition behind this score is that the reliability of a prediction is higher when the target data point is closer to the nearest training data point with the same label and farther away from the nearest training data point with a different label.
In this paper, trust score is calculated as follows.
Given training data $\{(x_1, s_1), \ldots, (x_m, s_m)\}$, a target data point $x$ for prediction, and its predicted label $\hat{s}$, we first obtain a vector representation for each data point. In our model, the representation of each data point is the sentence vector of the student response described in Section 2.2. Let $H = \{h_1, \ldots, h_m\}$ be the set of vector representations of the training data points, and let $h_x$ be the vector of the target data point $x$.
Then we collect the representations in the training data that have the same label as the predicted label $\hat{s}$:

$$H_{\hat{s}} = \{\, h_i \in H \mid s_i = \hat{s} \,\}.$$

The trust score $C_{\mathrm{trust}}$ for $x$ is then calculated as the ratio of the Euclidean distances $d(\cdot, \cdot)$ between the target representation $h_x$ and two data-point representations $h_p$ and $h_c$ in the training data:

$$C_{\mathrm{trust}}(x, \hat{s}) = \frac{d(h_x, h_c)}{d(h_x, h_p)},$$

where $h_p$ is the representation of the nearest training data point having the same label as the predicted label $\hat{s}$, and $h_c$ is that of the nearest training data point with a different label:

$$h_p = \operatorname*{argmin}_{h \in H_{\hat{s}}} d(h_x, h), \qquad h_c = \operatorname*{argmin}_{h \in H \setminus H_{\hat{s}}} d(h_x, h).$$
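A minimal sketch of this trust-score computation over precomputed response vectors (pure Python; variable names are ours, and no guard against a zero same-label distance is included):

```python
import math

def euclid(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def trust_score(h_x, s_hat, train_vecs, train_labels):
    """Ratio of the distance to the nearest training point with a
    different label over the distance to the nearest training point
    with the same label as the predicted label s_hat."""
    same = [h for h, s in zip(train_vecs, train_labels) if s == s_hat]
    diff = [h for h, s in zip(train_vecs, train_labels) if s != s_hat]
    d_p = min(euclid(h_x, h) for h in same)  # nearest same-label point
    d_c = min(euclid(h_x, h) for h in diff)  # nearest other-label point
    return d_c / d_p
```

A score above 1 means the response vector lies closer to training examples of the predicted label than to any other label, i.e., the prediction is more trustworthy.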

Dataset
We use the Japanese short answer scoring dataset introduced by Mizumoto et al. (2019). The dataset consists of six prompts, each with its rubric, student responses, and scores. The prompts, rubrics, and student responses were collected from examinations conducted by a Japanese education company, Takamiya Gakuen Yoyogi Seminar. Each response was manually scored on multiple analytic criteria for the prompt, and the subscore for each criterion was rated individually on the basis of the corresponding rubric. In the experiments, we use the sum of these analytic scores as the ground-truth score of each response. Table 1 shows the statistics of the dataset. In the dataset, 100 randomly sampled responses per prompt are annotated by two human raters. We can therefore calculate the QWK and Kappa values (Cohen, 1960) between the two human raters to confirm the degree of human agreement. The Kappa values on this dataset are comparable to or higher than those on other datasets for the SAS task (Leacock and Chodorow, 2003; Mohler and Mihalcea, 2009; Mohler et al., 2011; Basu et al., 2013).
As additional statistics, we calculated the number of CSEs and the CSRate over the annotated scores of the two human raters under various settings of λ in Equation 6. Table 2 shows the results; the number of CSEs reported there is the sum over all prompts.

Settings
We split the dataset into training data (1,600 responses), validation data (200), and test data (200). We used pretrained BERT (Devlin et al., 2019) as the embedding layer of the model. We adopted the same optimization algorithm, learning rate, batch size, and output dimension of the recurrent layer as Taghipour and Ng (2016). We trained the SAS models for 50 epochs and selected the parameters from the epoch that achieved the best QWK on the development set. We trained five models with different random seeds and report the averages of the results. Choosing a reasonable λ to define CSEs is crucial for our formulation. In our experiments, we set λ = 0.2. There is no theoretical or statistical evidence that 0.2 is the optimal value for our formulation. However, as shown in Table 2, 0.2 can be considered strict, given that even human raters make CSEs on about 6% of responses. This choice therefore offers a meaningful evaluation of our formulation.

Result
Can confidence scores filter out CSEs? Figure 2 shows the CSRates on test data when we choose a certain proportion of the predicted instances in descending order of confidence score. The figure illustrates that the CSRate for each prompt increases almost monotonically for both confidence metrics. We also observe that the CSRates on four out of six prompts are suppressed to 0% for a certain amount of high-confidence predictions (20% to 60% of the test data). This is an important observation for our objective; it demonstrates that the proposed procedure using confidence scoring can obtain a reasonably large set of highly reliable predictions. Comparing the two confidence estimation methods, the trust score is more effective than the posterior probability at suppressing CSEs on Prompts 1, 3, 4, 5, and 6.

Table 3: Proportion of remaining test data and the number of CSEs (#CSEs) when using the trust score or the posterior probability to filter out unreliable predictions in the test data, with a threshold τ determined on the development set.
Filtering CSEs using the threshold In a practical situation, it is necessary to determine a threshold τ on the development set and use it for filtering low-reliability predictions on unseen samples. Assuming this situation, we evaluate how many CSEs in the test set can be removed using the threshold τ determined by the procedure described in Section 3. Table 3 shows the proportions of the remaining test data and the numbers of CSEs after filtering out low-reliability predictions using the thresholds for each prompt. The results for both confidence estimation methods indicate that the proposed approach successfully filters out unreliable predictions and achieves a sufficiently low CSRate.
QWK in highly reliable predictions We also report the QWK of the top 10, 30, and 50% most confident predictions in Table 4 to illustrate model performance under the de facto standard metric, with the QWK of our model on all test data shown as Base. The table shows that selecting high-confidence predictions on the basis of confidence scores increases QWK markedly compared with using the whole test data. Moreover, with the top 30% most confident predictions, we achieve a QWK of 1.0 on some prompts, meaning that the model predictions perfectly agree with the ground-truth scores.
Note that a higher QWK value does not always mean that the predictions do not contain CSEs. For example, in Table 4, the QWK values for prompts 1 and 2 are higher than 0.9. However, as shown in Figure 2, even with such high QWK values, these predictions include 1.5 to 2.0% of CSEs. This observation justifies the concept of CSE. QWK possibly conceals serious mispredictions, which are important to filter out in actual usage.

Conclusion and Future Work
In this paper, we introduced a new formulation of the SAS task to evaluate the effectiveness of SAS systems in actual usage. We defined the concept of a critical scoring error (CSE), which represents an unacceptable prediction error, and formulated the objective of the task as obtaining as many predictions without CSEs as possible. The experimental results show that, by using our proposed procedure of selecting reliable predictions, SAS systems can predict scores with zero CSEs for up to approximately 50% of test data. This result directly indicates the possibility of halving the scoring cost of human raters, which, we believe, makes our formulation highly preferable for the evaluation of SAS systems.
Our study revealed the potential of a task formulation of SAS that better links to actual usage. However, some issues remain; for example, how to determine an effective threshold τ that strictly guarantees zero CSEs is still unknown, and this is one major challenge for our formulation. Moreover, we must develop methods for more accurately estimating confidence scores, which is our primary focus in the next step.