Inject Rubrics into Short Answer Grading System

Short Answer Grading (SAG) is the task of scoring students' answers in examinations. Most existing SAG systems, including the state-of-the-art model used as the baseline in this paper, predict scores based only on the answers. They ignore important evaluation criteria such as rubrics, which play a crucial role in evaluating answers in real-world situations. In this paper, we present a method to inject information from rubrics into SAG systems. We implement our approach on top of a word-level attention mechanism that introduces the rubric information, in order to locate the information in each answer that is highly related to the score. Our experimental results demonstrate that injecting rubric information effectively improves performance and that our proposed model outperforms the state-of-the-art SAG model on the widely used ASAP-SAS dataset under low-resource settings.


Introduction
Short Answer Grading (SAG) is the task of automatically evaluating the correctness of students' answers to a given prompt in an examination (Mohler et al., 2011). It would be beneficial particularly in an educational context where teachers' availability is limited (Mohler and Mihalcea, 2009). Motivated by this background, SAG has been studied mainly with machine learning-based approaches, where the task is considered as inducing a regression model from a given set of manually scored sample answers (i.e., training instances). As observed in a variety of other NLP tasks, recently proposed neural models have been yielding strong results (Riordan et al., 2017).
In general, a prompt is provided along with a scoring rubric. Figure 1 shows a typical example. Students are required to answer the steps involved in protein synthesis. Each answer is scored based on a rubric, which contains several scoring criteria called key elements. Each of them stipulates different aspects of the conditions for an answer to gain a score. Based on the number of the key elements mentioned in an answer, its final score is determined. In Figure 1, the answer mentions two key elements, so it gains 1 point. Thus, rubrics and key elements play an essential role in SAG. Few previous studies, however, use information from rubrics for SAG.

Figure 1: Example of a prompt, its rubric and a scored answer.
Prompt: Starting with mRNA leaving the nucleus, list and describe four major steps involved in protein synthesis.
Key elements:
2. mRNA travels through the cytoplasm to the ribosome or enters the rough endoplasmic reticulum.
3. mRNA bases are read in triplets called codons (by rRNA).
…
Answer (1 point): When the mRNA leaves the nucleus, it travels through the cell. It moves to a ribosome. The ribosome makes tRNA. Then, protein is synthesized.
In this paper, we present a method to incorporate rubric information into neural SAG models. Our idea is to enable neural models to capture alignments between an answer and each key element. Specifically, we use a word-level attention mechanism to compute alignments and generate an attentional feature vector for each pair of an answer and a key element.
The contributions of this study are summarized as follows:
• This is the first study that explores how to incorporate rubric information into neural SAG models.
• We propose a general framework to extend existing neural SAG models with a component for exploiting rubric information.
• Our empirical evaluation shows that our proposed model achieves a significant performance improvement, particularly in low-resource settings.

Related Work
Many existing SAG studies focus on exploring better representations of answers and similarity measures between student answers and reference answers. A wide variety of methods have been explored so far, ranging from Latent Semantic Analysis (LSA) (Mohler et al., 2011), edit distance-based similarity, and knowledge-based similarity using WordNet (Pedersen et al., 2004) (Magooda et al., 2016) to word embedding-based similarity (Sultan et al., 2016). Recently, Riordan et al. (2017) report that neural network-based feature representation learning (Taghipour and Ng, 2016) is effective for SAG. In contrast to the popularity of learning answer representations, the use of rubric information for SAG has gained little attention so far. Sakaguchi et al. (2015) compute similarities, such as BLEU (Papineni et al., 2002), between an answer and each key element in a rubric, and use them as features in a support vector regression (SVR) model. Ramachandran et al. (2015) generate text patterns from top answers and rubrics, and report that the automatically generated patterns perform better than manually generated regex patterns. Nevertheless, it remains an open issue (i) whether a rubric is effective even in the context of a neural representation learning paradigm (Riordan et al., 2017), and (ii) what kinds of neural architectures should be employed for the efficient use of rubrics.
Another issue in SAG is low-resource settings. Previous work investigates the importance of the training data size for non-neural SAG models with discrete features. Horbach and Palmer (2016) show that active learning is effective for increasing useful training instances. This is orthogonal to our approach: combining active learning with our rubric-aware SAG model is an interesting future direction.

We assume the base component encodes an answer into a feature vector $\mathbf{f}^a$. We also assume that a given rubric stipulates a set of key elements in natural language. We build a rubric component to encode rubric information, based on the relevance between the answer $a$ and each key element $k \in \{k_1, k_2, \cdots, k_K\}$ provided in the rubric.
The rubric component first encodes each key element that consists of $m$ words, $k = (w_1, w_2, \cdots, w_m)$, into its feature vector $\mathbf{k}$ and the answer $a$ into $\mathbf{a}$. Then, it computes the relevance between the given answer $a$ and each key element $k \in \{k_1, k_2, \cdots, k_K\}$ using a word-level attention mechanism, and generates attentional feature vectors $\mathbf{f}^r_1, \cdots, \mathbf{f}^r_K$, which represent the aggregated information of each key element. A rubric feature $\mathbf{f}^r$ is generated based on the obtained $K$ attentional feature vectors. Finally, $\mathbf{f}^a$ and $\mathbf{f}^r$ are merged into one vector $\mathbf{f}$, which is used for scoring:

$$\mathrm{score}(a) = \beta \cdot \mathrm{sigmoid}(\mathbf{w} \cdot \mathbf{f} + b),$$

where $\mathbf{w}$ is a parameter vector, $\beta$ is a prompt-specific scaling constant, and $b$ is a bias term. Note that the model does not require explicit annotation of key elements on the training answer samples, because the model implicitly estimates which key elements are included in each student answer in the course of training. It is also important to note that our framework is encoder-agnostic; namely, any answer encoder that produces a fixed-length feature vector can be used as the base component.

Base component
As the base component, we employ the neural SAG model proposed by Riordan et al. (2017), which is the state-of-the-art SAG system among published methods. As shown in Figure 3, this model consists of three layers, namely (i) the embedding layer, (ii) the BiLSTM (bidirectional Long Short-Term Memory (Schuster and Paliwal, 1997)) layer and (iii) the pooling layer. Given an answer $a = (w_1, w_2, \ldots, w_n)$, the embedding layer outputs a vector $\mathbf{e}^a_i \in \mathbb{R}^d$ for each word $w_i$. Taking a sequence of these vectors $(\mathbf{e}^a_1, \mathbf{e}^a_2, \cdots, \mathbf{e}^a_n)$ as input, the BiLSTM layer then produces a contextualized vector $\mathbf{h}_i = [\overrightarrow{\mathbf{h}}_i ; \overleftarrow{\mathbf{h}}_i] \in \mathbb{R}^{2h}$ for each word, where $\overrightarrow{\mathbf{h}}_i$ and $\overleftarrow{\mathbf{h}}_i$ are the hidden states of the forward and backward LSTM, respectively. Finally, the pooling layer averages these contextualized vectors to obtain a feature vector for the answer as follows:

$$\mathbf{f}^a = \frac{1}{n} \sum_{i=1}^{n} \mathbf{h}_i.$$
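The pooling step of the base component can be illustrated with a minimal sketch. It assumes the forward and backward hidden states are already available as plain lists of floats; the helper names `concat_states` and `mean_pool` are hypothetical and are not taken from the implementation of Riordan et al. (2017).

```python
from typing import List

Vector = List[float]


def concat_states(forward: Vector, backward: Vector) -> Vector:
    """Build the contextualized vector h_i = [forward_i ; backward_i] (length 2h)."""
    return forward + backward


def mean_pool(hidden_states: List[Vector]) -> Vector:
    """Average the n contextualized vectors into one answer feature f^a."""
    n = len(hidden_states)
    if n == 0:
        raise ValueError("cannot pool an empty answer")
    dim = len(hidden_states[0])
    return [sum(h[k] for h in hidden_states) / n for k in range(dim)]


# Toy example with h = 2 (each h_i has 2h = 4 entries) and n = 2 words.
forward_states = [[1.0, 0.0], [3.0, 2.0]]
backward_states = [[0.0, 4.0], [2.0, 0.0]]
hidden = [concat_states(f, b) for f, b in zip(forward_states, backward_states)]
f_a = mean_pool(hidden)
print(f_a)  # [2.0, 1.0, 1.0, 2.0]
```

Because the pooled vector has a fixed length of 2h regardless of the answer length n, it can be passed directly to a regression layer.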

Rubric component
Inspired by Chen et al. (2016), we compute word-level attention between each key element and a given answer, as illustrated in Figure 4. In this way, the rubric component captures how relevant a key element is to the given answer. Given word embedding sequences of an answer $(\mathbf{e}^a_1, \mathbf{e}^a_2, \cdots, \mathbf{e}^a_n)$ and a key element $(\mathbf{e}^k_1, \mathbf{e}^k_2, \cdots, \mathbf{e}^k_m)$, the rubric component first calculates the word-level attention between $\mathbf{e}^k_i$ and $\mathbf{e}^a_j$:
• Calculate the inner products between word embeddings from the answer and key element: $z_{i,j} = \mathbf{e}^k_i \cdot \mathbf{e}^a_j$
• Calculate the softmax of $z_{i,j}$ over the rows and columns, respectively: $\boldsymbol{\alpha}^k_i = \mathrm{softmax}(z_{i,1}, \cdots, z_{i,n})$ and $\boldsymbol{\alpha}^a_j = \mathrm{softmax}(z_{1,j}, \cdots, z_{m,j})$
Note that $\boldsymbol{\alpha}^k_i \in \mathbb{R}^n$ stands for the attention from the $i$-th word of a key element to each word in the answer $a$. Similarly, $\boldsymbol{\alpha}^a_j \in \mathbb{R}^m$ stands for the attention from the $j$-th word of the answer to each word in the key element $k$.
Next, the attentional vectors of key-to-answer ($\mathbf{v}$) and answer-to-key ($\mathbf{u}$) are calculated as sums of word embeddings weighted by $\boldsymbol{\alpha}^a$ and $\boldsymbol{\alpha}^k$, respectively. Intuitively, $\mathbf{u}$ is the aggregation of answer tokens that are highly relevant to a key element, and $\mathbf{v}$ is the aggregation of tokens in the key element that are highly relevant to the answer. We then concatenate $\mathbf{u}$ and $\mathbf{v}$ to obtain a feature vector for the key element.
Finally, we generate feature vectors $\mathbf{f}^r_1, \cdots, \mathbf{f}^r_K$ for all key elements in this manner, and then generate the rubric feature $\mathbf{f}^r$ based on them.
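The attention computation for a single key element can be sketched as follows. This is an illustrative sketch rather than the authors' code: the function names are hypothetical, embeddings are plain lists of floats, and averaging the per-word attended vectors to obtain u and v is an assumption made here for concreteness.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(xs: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(key_emb: Matrix, ans_emb: Matrix) -> Tuple[Matrix, Matrix]:
    """Return (alpha_k, alpha_a) for one key element and one answer."""
    m, n = len(key_emb), len(ans_emb)
    # z[i][j] is the inner product of key word i and answer word j.
    z = [[dot(ek, ea) for ea in ans_emb] for ek in key_emb]
    # alpha_k[i] has length n: attention of key word i over the answer words.
    alpha_k = [softmax(z[i]) for i in range(m)]
    # alpha_a[j] has length m: attention of answer word j over the key words.
    alpha_a = [softmax([z[i][j] for i in range(m)]) for j in range(n)]
    return alpha_k, alpha_a


def weighted_sum(weights: Vector, vectors: Matrix) -> Vector:
    dim = len(vectors[0])
    return [sum(w * v[k] for w, v in zip(weights, vectors)) for k in range(dim)]


def average(vectors: Matrix) -> Vector:
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def key_element_feature(key_emb: Matrix, ans_emb: Matrix) -> Vector:
    """Concatenate u (attended answer tokens) and v (attended key tokens)."""
    alpha_k, alpha_a = attention_weights(key_emb, ans_emb)
    # ASSUMPTION: the per-word attended vectors are averaged into u and v.
    u = average([weighted_sum(w, ans_emb) for w in alpha_k])
    v = average([weighted_sum(w, key_emb) for w in alpha_a])
    return u + v  # length 2d


# Toy example with d = 2: a one-word key element and a two-word answer.
key = [[1.0, 0.0]]
answer = [[1.0, 0.0], [1.0, 0.0]]
feature = key_element_feature(key, answer)
```

Each key element yields a vector of length 2d, so K key elements give K such vectors that can then be concatenated or averaged into the rubric feature.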

Merge features
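We introduce two methodologies to merge $\mathbf{f}^a$ and $\mathbf{f}^r$ into one single feature $\mathbf{f}$.
Concatenation We concatenate $\mathbf{f}^a$ and $\mathbf{f}^r$: $\mathbf{f} = [\mathbf{f}^a ; \mathbf{f}^r]$. In this case, we expect the regression layer to learn weights for the two feature spaces at the same time.
Weighted Sum Besides, we introduce a trainable parameter $\lambda$, which controls the influence of the rubric component. We then generate a rubric-aware answer feature as a $\lambda$-weighted sum of $\mathbf{f}^a$ and the projection of $\mathbf{f}^r$ by $\mathbf{M}$, where $\mathbf{M} \in \mathbb{R}^{2d \times 2h}$ is a transformation matrix to learn, projecting $\mathbf{f}^r$ to the space of $\mathbf{f}^a$. To reduce the number of parameters to learn, we compute $\mathbf{f}^r$ by averaging the key element feature vectors instead of concatenating them. $\lambda$ is initialized to 0.5 in our experiments.
Finally, the answer $a$ is scored as follows: $\mathrm{score}(a) = \beta \cdot \mathrm{sigmoid}(\mathbf{w} \cdot \mathbf{f} + b)$, where $\mathbf{w} \in \mathbb{R}^{2h+2dK}$ (or $\mathbf{w} \in \mathbb{R}^{2h}$ for the 'weighted sum' strategy) is a model parameter, $\beta$ is a prompt-specific scaling constant, and $b$ is a bias term.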
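The two merge strategies and the scoring function can be sketched as below. The function names are hypothetical. For the weighted sum, the sketch assumes a convex combination in which λ weights the base feature and (1 − λ) weights the projected rubric feature; this reading follows the later analysis of λ but is an assumption, not a confirmed formula.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def merge_concat(f_a: Vector, f_r: Vector) -> Vector:
    """f = [f^a ; f^r], with length 2h + 2dK when f^r concatenates K key features."""
    return f_a + f_r


def merge_weighted_sum(f_a: Vector, f_r: Vector, M: Matrix, lam: float) -> Vector:
    """Project f^r (length 2d) with M (2d x 2h) and mix it with f^a (length 2h)."""
    projected = [
        sum(f_r[r] * M[r][c] for r in range(len(f_r))) for c in range(len(f_a))
    ]
    # ASSUMPTION: lam weights the base feature, (1 - lam) the rubric feature.
    return [lam * a + (1.0 - lam) * p for a, p in zip(f_a, projected)]


def score(f: Vector, w: Vector, b: float, beta: float) -> float:
    """score(a) = beta * sigmoid(w . f + b)."""
    return beta * sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b)


# Toy example: 2h = 2, 2d = 1, lambda at its initial value of 0.5.
f_a = [1.0, 1.0]
f_r = [2.0]
M = [[1.0, 3.0]]
f_sum = merge_weighted_sum(f_a, f_r, M, lam=0.5)  # [1.5, 3.5]
f_cat = merge_concat(f_a, f_r)                    # [1.0, 1.0, 2.0]
neutral = score([0.0, 0.0], [1.0, 1.0], b=0.0, beta=3.0)  # 1.5
```

Since the sigmoid output lies in (0, 1), the constant β rescales it to the score range of the prompt; with β = 3 a zero pre-activation maps to the midpoint 1.5.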

Settings
We apply the proposed model to the widely used, rubric-rich ASAP-SAS dataset, which includes 10 prompts, with 2,226 answers for each prompt on average, including around 1,704 training answers and 522 test answers. In this paper, we choose prompts 1, 2, 5, 6 and 10, where key elements are explicitly provided in the rubric, and we randomly take 20% of the answers from the training data as the development data. On average, we have 1,308 answers as training data, 327 answers as development data and 545 answers as test data. For both the base and rubric components, we use 300-dimensional GloVe embeddings pretrained on Wikipedia and Gigaword5 (Pennington et al., 2014) to initialize the word embedding layer (d = 300), and update them during the training phase.
For the BiLSTM layer of the base component, we set h = 256, set the dropout probability for the linear transformation to 0.5, and set the dropout probability for the recurrent state to 0.1, following the setting of Riordan et al. (2017).
Mean Squared Error (MSE) is used as the loss function, and optimized by RMSprop optimizer with a learning rate of 0.001. The batch size is set to 32.
The model is trained separately on each prompt. We first train the base component, then fix the base component and train the whole model. We run the training phase for 50 epochs and choose the best model on the development data. For each prompt, we repeat the experiments 5 times with different random seeds (0 to 4) for initialization and evaluate each run independently with Quadratic Weighted Kappa (QWK). We then take the average QWK over all random seeds as the final performance of the model on the corresponding prompt.
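For reference, QWK and the averaging over seeds can be sketched as follows. The sketch uses the standard definition of quadratic weighted kappa and assumes that predicted scores have already been converted to integers in the score range of the prompt; the function names and the toy labels are hypothetical.

```python
from typing import List, Sequence


def quadratic_weighted_kappa(
    gold: Sequence[int], pred: Sequence[int], min_score: int, max_score: int
) -> float:
    """Standard QWK between two integer ratings on [min_score, max_score]."""
    n_labels = max_score - min_score + 1
    if n_labels < 2:
        raise ValueError("need at least two score levels")
    if len(gold) != len(pred) or len(gold) == 0:
        raise ValueError("gold and pred must be non-empty and of equal length")
    n_items = len(gold)
    # Observed confusion matrix between gold and predicted scores.
    observed = [[0.0] * n_labels for _ in range(n_labels)]
    for g, p in zip(gold, pred):
        observed[g - min_score][p - min_score] += 1.0
    gold_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(n_labels)) for j in range(n_labels)]
    disagreement = 0.0
    chance = 0.0
    for i in range(n_labels):
        for j in range(n_labels):
            weight = (i - j) ** 2 / (n_labels - 1) ** 2
            disagreement += weight * observed[i][j]
            # Expected count under independence of the two marginals.
            chance += weight * gold_hist[i] * pred_hist[j] / n_items
    if chance == 0.0:
        return 1.0  # both ratings use one identical score level
    return 1.0 - disagreement / chance


def mean_over_seeds(values: List[float]) -> float:
    """Average the per-seed QWK values into one number per prompt."""
    return sum(values) / len(values)


perfect = quadratic_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 3], 0, 3)  # 1.0
chance_level = quadratic_weighted_kappa([0, 0, 1, 1], [0, 1, 0, 1], 0, 1)  # 0.0
```

A value of 1.0 indicates perfect agreement and 0.0 indicates agreement at chance level; the quadratic weights penalize predictions more heavily the further they fall from the gold score.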
To evaluate the robustness of our model in low-resource settings, we train our model on various sizes of the training data (12.5%, 25%, 50%, 75% and 100%).
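One way to build such reduced training sets is sketched below. It assumes the subsets are drawn by uniform random sampling with a fixed seed; the helper `subsample` and the stand-in answer IDs are hypothetical.

```python
import random
from typing import Dict, List, Sequence, TypeVar

T = TypeVar("T")

FRACTIONS = [0.125, 0.25, 0.5, 0.75, 1.0]


def subsample(train: Sequence[T], fraction: float, seed: int = 0) -> List[T]:
    """Draw a random subset containing `fraction` of the training answers."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)
    size = max(1, int(len(train) * fraction))
    # ASSUMPTION: uniform sampling without replacement.
    return rng.sample(list(train), size)


# Stand-in answer IDs; 1,308 is the average training size per prompt.
train_ids = list(range(1308))
subsets: Dict[float, List[int]] = {f: subsample(train_ids, f) for f in FRACTIONS}
sizes = [len(subsets[f]) for f in FRACTIONS]
print(sizes)  # [163, 327, 654, 981, 1308]
```

At 12.5% the model sees only about 163 scored answers per prompt, which is the regime where the rubric is expected to help most.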

Results
The experimental results under different sizes of training data are shown in Figure 5. The performance of the base component ('Base') with 100% training data was 0.770 QWK, which is comparable to the best QWK of 0.773 on the corresponding 5 prompts reported by Riordan et al. (2017). This indicates that we successfully replicated their best-performing model. Adding the rubric component ('+Rubric') improved the performance, especially when less training data was available. This suggests that the rubric component compensates for the lack of training data. This is consistent with Sakaguchi et al. (2015), a non-neural counterpart of our study.
Performance on each prompt is shown in Table 1. The results indicate that the benefit obtained from the rubric component varies across prompts. For instance, we achieve larger improvements on prompts 2, 5 and 6 than on the others. One reason is that the rubrics differ across prompts. For instance, in prompts 5 and 6, all key elements with which an answer can get points are listed, while in prompt 10 only four example answers are provided.

Analysis
Contribution of components Figure 5 demonstrates that when trained with the full training data, our rubric-aware model ('+Rubric') achieved performance comparable to the base component. To reveal the reasons for this, we conduct two analyses.
First, for '+Rubric (concat)', we investigate the distribution of the learned weights of the regression layer corresponding to the base and rubric components, following the idea of Meftah et al. (2019). The distribution is shown in Figure 6. When the model was trained on 100% training data, the weights for the rubric component were closer to 0, while the weights for the base component were more dispersed (Figure 6b), compared to the distribution for 12.5% training data (Figure 6a).
Second, for '+Rubric (weighted sum)', we plot the values of the trained λ, which represents the weight of the base component, in Figure 7. Generally, the values of λ grow with data size, which is consistent with Figure 6. This means that as training data increases, the rubric component contributes less to the performance, and thus little improvement is obtained from it. Addressing this issue is an interesting direction for our future research.
Word-level attention To get further insights on the rubric component, we analyzed 1-point answers in the test set. We show two typical examples of 1-point answers in Table 2, where each answer is graded (a) correctly and (b) incorrectly by the system trained with 12.5% training data. Both answers are graded incorrectly as 0 points by the baseline. The corresponding prompt and its rubric are shown in Figure 1. Both answers contain only the first key element provided in the rubric. The first answer is correctly graded as 1 point, while the second is graded as 0 points.

Figure 6: Value distribution of learned weights of regression layer corresponding to base and rubric component for prompt 5.
The word-level attention shown in Figure 8 indicates how the proposed model identified the relevance of the answer to the key element. Figure 8a shows that the model successfully found the words and phrases most related to the key element, which helps the model improve its performance. On the other hand, Figure 8b shows that the model incorrectly aligned words in the answer and the key element. Specifically, the model aligned "exists" in the answer with "exists" in the key element. However, these two verbs should not be aligned because their objects are different from each other (i.e., "the cell" in the answer, but "nucleus" in the key element). Because the attention is calculated at the word level, the model tends to simply find similar words that appear in the key element, ignoring the context around the words.

Conclusion
Rubrics play a crucial role for SAG but have attracted little attention in the SAG community. In this paper, we present an approach for incorporating rubrics into neural SAG models. We replicated a state-of-the-art neural SAG model as the base component, and injected rubrics (key elements) through the rubric component as an extension. In the low-resource setting where the base component had difficulty learning key elements directly from answers, our experimental results showed that the rubric component significantly improved the performance of SAG. When all training data was used, the rubric component did not have a negative effect on the overall performance.
Overall, the proposed model still has much room for improvement. For example, the approach to calculating the alignment between answers and key elements could be improved by taking context into account, instead of using word-level attention. Moreover, other types of rubrics could be explored in the SAG task, especially for prompts where key elements are not provided explicitly. We also expect to obtain a further improvement when the full training data is available by increasing the weights of the rubric component feature, as discussed with Figure 6. Beyond SAG, we would like to explore approaches for generating feedback based on the computed attention to key elements.