A Short Answer Grading System in Chinese by Support Vector Approach

In this paper, we report a short answer grading system for Chinese. We build the system on standard machine learning approaches and test it on corpora translated from two publicly available English corpora. The experiments show that the results on the two corpora are similar to those reported for English.


Introduction
Students' learning outcomes are assessed with tests that use various question types and grading methods. The short answer question is one type that tests how well students understand specific concepts in a subject domain. Because grading short answer questions requires natural language understanding, such tests have traditionally been graded manually by teachers.
Although technically similar to automatic essay grading, automatic short answer grading is less mature. Burrows et al. (2015) survey how various researchers have approached automatic short answer grading. The traditional approach is string matching, which can be very efficient but is not very effective.
Early work relied on regular expression patterns that were manually extracted from reference answers (Mitchell et al., 2002). The patterns included keywords in the reference answers. Patterns could also be learnt from the reference answers (Ramachandran et al., 2015). Sultan et al. (2016) adopted the simpler notion of semantic alignment to avoid explicitly generating complicated patterns.
Semantic matching was also proposed in early work (Leacock and Chodorow, 2003). Many researchers have since used this approach (Mohler et al., 2009; Mohler et al., 2011; Heilman and Madnani, 2013) within supervised machine learning. A large set of similarity measures is defined as features for a supervised learning model. Features range from word-level n-gram overlap to deeper semantic similarity measures based on dictionary and distributional methods.
The short-text grading in the SemEval Semantic Textual Similarity (STS) task (Agirre et al., 2012; Agirre et al., 2013; Agirre et al., 2014; Agirre et al., 2015) drew the attention of many researchers and provided an evaluation platform. Since then, several systems have been proposed for short answer grading based on semantic similarity with given reference answers (Mohler and Mihalcea, 2009; Mohler et al., 2011; Heilman and Madnani, 2013; Ramachandran et al., 2015). Sultan et al. (2016) presented a simple grading system for short answers in English. Given a question and its reference answers, such a system measures the correctness of a student answer by calculating its similarity to the reference answers.
Compared with the work on English, there are very few research projects on short answer grading in Chinese, and there is no publicly available corpus for short answer grading in Chinese.
In this paper, we report how we built such a system and how we tested it on corpora translated from two publicly available English corpora.
The system first extracts text similarity features, which are then used in a support vector model. In the first corpus, answers are graded from 0 to 5; we use a support vector regression (SVR) model to learn the grading. In the second corpus, answers are graded as correct/incorrect; we use a support vector machine (SVM) classifier to deal with it. In the following sections, we describe the system architecture and the experimental results.

System Architecture
We adopt previous work on textual entailment (TE) as our prototype to tackle the short answer grading problem in Chinese. TE can be briefly defined as: "Given a pair of sentences (Student answer, Reference answer), a program has to decide whether the information in the Reference answer can be inferred from the Student answer". TE can be used in various applications, such as question answering, information extraction, information retrieval, and machine translation. Once a system is able to decide whether T1 entails T2 (here, T1 is the student answer and T2 is the reference answer), it can be regarded as an information filter that helps users find useful information. Traditional approaches to TE are based on the semantic and syntactic similarities of the words in the sentences.

Support Vector Machines
The support vector machine (SVM) is a supervised machine learning algorithm that can be used for classification problems in n-dimensional space. It is widely used in natural language processing research and generally gives good results. Compared with other classification algorithms, SVM usually performs better when the number of features is large and the data are sparse.
SVM uses $f(x) = w^T \phi(x) + b$ as the linear separating hyperplane, where $w$ is the weight vector, $b$ is the bias, and $\phi(\cdot)$ is a high-dimensional non-linear transformation. $w$ and $b$ are determined from the training data by solving the following optimization problem:

$$\min_{w, b, \xi} \; \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i$$

subject to $y_i (w^T \phi(x_i) + b) \ge 1 - \xi_i$, $\xi_i \ge 0$, $i = 1, \ldots, l$,

where $\xi_i$ are the slack variables and $C$ is the penalty coefficient for the training samples $(x_i, y_i)$.
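As an illustration of the soft-margin objective described above, the following minimal sketch evaluates the decision value and the objective for a fixed hyperplane. It assumes the identity map for the transformation (a linear kernel), and the names `decision_value` and `primal_objective` as well as the toy data are hypothetical; it is not the LIBSVM solver used in the experiments.

```python
from typing import Sequence


def decision_value(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Return f(x) = w . x + b (assumption: phi is the identity map)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def primal_objective(
    w: Sequence[float],
    b: float,
    samples: Sequence[Sequence[float]],
    labels: Sequence[int],
    C: float,
) -> float:
    """Return 0.5 * ||w||^2 + C * sum of slacks for labels in {-1, +1}."""
    # The smallest slack satisfying y * f(x) >= 1 - slack and slack >= 0.
    slacks = [
        max(0.0, 1.0 - y * decision_value(w, b, x))
        for x, y in zip(samples, labels)
    ]
    return 0.5 * sum(wi * wi for wi in w) + C * sum(slacks)


# Toy data (hypothetical): the third sample lies inside the margin.
toy_samples = [[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]]
toy_labels = [1, -1, 1]
toy_objective = primal_objective([1.0, 0.0], 0.0, toy_samples, toy_labels, C=1.0)
print(toy_objective)
```

Only the sample inside the margin contributes a non-zero slack, so a larger C penalizes such margin violations more heavily relative to the norm of w.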

Support Vector Regression
Support vector regression (SVR) applies the SVM algorithm to regression problems. The goal of SVM is to find the separating hyperplane, whereas the goal of SVR is to find the regression hyperplane. Consider the training set $\{(x_1, z_1), \ldots, (x_l, z_l)\}$, where $x_i \in R^n$ is a feature vector and $z_i \in R^1$ is the target output. To find the hyperplane, two parameters $C > 0$ and $\epsilon > 0$ must be given, and support vector regression is defined as:

$$\min_{w, b, \xi, \xi^*} \; \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i + C \sum_{i=1}^{l} \xi_i^*$$

subject to $w^T \phi(x_i) + b - z_i \le \epsilon + \xi_i$, $z_i - w^T \phi(x_i) - b \le \epsilon + \xi_i^*$, $\xi_i, \xi_i^* \ge 0$, $i = 1, \ldots, l$.

In our experiment, we use a free SVM toolkit, LIBSVM 1 (Chang and Lin, 2011), to train the SVR model.

1 https://www.csie.ntu.edu.tw/~cjlin/libsvm/
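The role of the two parameters C and ε can be illustrated with a minimal sketch that evaluates the SVR objective for a fixed regression hyperplane. It assumes a linear kernel (identity transformation); the function name `svr_objective` and the toy scores are hypothetical, and the sketch does not reproduce the LIBSVM training procedure.

```python
from typing import Sequence


def svr_objective(
    w: Sequence[float],
    b: float,
    samples: Sequence[Sequence[float]],
    targets: Sequence[float],
    C: float,
    epsilon: float,
) -> float:
    """Return 0.5 * ||w||^2 + C * sum(xi_i + xi_i*) for a fixed (w, b)."""
    total_slack = 0.0
    for x, z in zip(samples, targets):
        # Signed error of the prediction w . x + b against the target z.
        residual = sum(wi * xi for wi, xi in zip(w, x)) + b - z
        xi = max(0.0, residual - epsilon)        # prediction too high
        xi_star = max(0.0, -residual - epsilon)  # prediction too low
        total_slack += xi + xi_star
    return 0.5 * sum(wi * wi for wi in w) + C * total_slack


# Toy data (hypothetical): only the last prediction misses by more than epsilon.
toy_x = [[1.0], [2.0], [3.0]]
toy_z = [1.0, 2.5, 5.0]
toy_value = svr_objective([1.0], 0.0, toy_x, toy_z, C=2.0, epsilon=0.5)
print(toy_value)
```

Errors no larger than ε cost nothing, which is what distinguishes this loss from the squared error used in ordinary least squares.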

Feature extraction
In this section, we briefly introduce the features used in the SVM, which are the same as those used in previous work. Table 1 shows the ten features used in the experiments. The first three features are the numbers of terms common to T1 and T2. The next three features are BLEU scores. The remaining four features are the sentence lengths of T1 and T2 and the differences between them.
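A ten-dimensional feature vector of this kind can be sketched as follows. Since the exact definitions are given only in Table 1, several choices here are assumptions: the three common-term counts are taken at the unigram, bigram, and trigram levels, the BLEU scores are replaced by clipped n-gram precisions without a brevity penalty, and the four length features are the two lengths, their signed difference, and their absolute difference. All function names and the example sentences are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Count the n-grams of a segmented sentence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def common_ngrams(t1: Sequence[str], t2: Sequence[str], n: int) -> int:
    """Number of n-grams shared by t1 and t2 (clipped by multiplicity)."""
    return sum((ngrams(t1, n) & ngrams(t2, n)).values())


def ngram_precision(candidate: Sequence[str], reference: Sequence[str], n: int) -> float:
    """Clipped n-gram precision, a simplified stand-in for a BLEU score."""
    total = sum(ngrams(candidate, n).values())
    if total == 0:
        return 0.0
    return common_ngrams(candidate, reference, n) / total


def extract_features(t1: Sequence[str], t2: Sequence[str]) -> List[float]:
    """Build the assumed 3 + 3 + 4 feature layout for one sentence pair."""
    overlaps = [float(common_ngrams(t1, t2, n)) for n in (1, 2, 3)]
    precisions = [ngram_precision(t1, t2, n) for n in (1, 2, 3)]
    len1, len2 = len(t1), len(t2)
    lengths = [float(len1), float(len2), float(len1 - len2), float(abs(len1 - len2))]
    return overlaps + precisions + lengths


# Hypothetical segmented student answer (T1) and reference answer (T2).
student = ["堆疊", "是", "後進", "先出", "的", "結構"]
reference = ["堆疊", "是", "一種", "後進", "先出", "的", "資料", "結構"]
features = extract_features(student, reference)
print(features)
```

The input lists are assumed to be already segmented into words, so the sketch is independent of the segmentation toolkit.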

Data Sets in English
SciEntBank: This data set was used in SemEval-2013 and is available via GitHub 2 . The data set assigns one of five labels to a student response: correct, partially [...]

Data Structure Data Set: The data set is provided by (Mohler and Mihalcea, 2009) and consists of Data Structure questions and student responses graded by two judges. The data set assigns one of two labels to a student response: correct or incorrect. The questions are collected from ten assignments and two tests, and each one has a topic such as programming basics or sorting algorithms. A reference answer is also provided for each question. The inter-annotator agreement is 0.586 (Pearson's r) and 0.659 (RMSE on a 5-point scale). The average score of the two judges is used as the final gold score for each student answer.

Chinese Corpus Translation
Since there is no publicly available data set in Chinese, our experiments are conducted on translated corpora. We translate the two data sets into Chinese with machine translation and use them in our experiments. The sentences are then segmented into words by the Jieba 4 word segmentation toolkit. Because the quality of machine translation is not perfect, 12% of the sentences had to be corrected manually. The major error type is a synonym used improperly in its context, for both nouns and adjectives. There are also sentences with bad grammar.

Experiments
Since the SciEntBank data set has 5-way labelling, we use a regression model to predict the scores of the student responses. Since the Data Structure Data Set has 2-way labelling, we use a classification model to predict the labels of the student responses.

Metrics
To evaluate regression results, we adopt the squared correlation coefficient and the mean squared error. To evaluate classification results, we adopt accuracy.

Squared correlation coefficient, R²
R² is the square of the Pearson correlation coefficient between the observed values x and the modeled (predicted) values y of the score. Pearson's correlation coefficient is commonly represented by the letter r. Given a data set $\{x_1, \ldots, x_n\}$ containing n values and its predictions $\{y_1, \ldots, y_n\}$ containing n values, the formula for r is:

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

where $n$ is the sample size, $x_i$ is the sample indexed by $i$, $y_i$ is the corresponding system prediction, and $\bar{x}$, $\bar{y}$ are the means of $x_i$ and $y_i$, respectively.

3 http://web.eecs.umich.edu/mmihalcea/downloads/ShortAnswerGrading_v1.0.tar.gz
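The correlation-based metric and the accuracy described above can be computed directly from their definitions. The following is a minimal sketch using only the Python standard library; the function names and the toy values are hypothetical and are not results from our experiments.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between observed scores x and predictions y."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("x and y must be non-empty and of equal length")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / math.sqrt(var_x * var_y)


def squared_correlation(x: Sequence[float], y: Sequence[float]) -> float:
    """R^2 as the square of Pearson's r."""
    return pearson_r(x, y) ** 2


def accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of predicted labels that match the gold labels."""
    if len(gold) == 0 or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Toy values (hypothetical).
toy_r2 = squared_correlation([1.0, 2.0, 3.0], [1.0, 3.0, 2.0])
toy_acc = accuracy(
    ["correct", "incorrect", "correct", "correct"],
    ["correct", "correct", "correct", "incorrect"],
)
print(toy_r2, toy_acc)
```

Note that R² computed this way is invariant to a linear rescaling of the predictions, so it measures how well the predicted scores track the gold scores rather than how close they are in absolute terms.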

Root mean squared error (RMSE)
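RMSE is defined as

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - x_i)^2}$$

where $n$ is the sample size, $x_i$ is the observed score of sample $i$, and $y_i$ is the corresponding system prediction.

Table 3: Performance on the Chinese version of the SemEval-2013 datasets.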
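For illustration, the RMSE metric of this section can be sketched in a few lines of standard-library Python. The function name `rmse` and the toy scores on a 0 to 5 scale are hypothetical and are not taken from our experiments.

```python
import math
from typing import Sequence


def rmse(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error between observed scores and predictions."""
    n = len(observed)
    if n == 0 or n != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - o) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / n)


# Toy scores (hypothetical): every prediction is off by exactly one point.
toy_rmse = rmse([5.0, 3.0, 4.0, 1.0], [4.0, 4.0, 3.0, 2.0])
print(toy_rmse)
```

Squaring the value returned by `rmse` gives the mean squared error.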

Features
[...] the system uses only the BLEU features. In this experiment, the accuracy is almost the same. The result shows that using more features does not improve the performance.

Discussions
Since the data sets are translated, the results are not directly comparable to those on the original data sets. However, compared with the results in English (Sultan et al., 2016), we find that the performance is similar.

Conclusion and Future Works
In this paper, we report a short answer grading system for Chinese based on a machine learning approach. We test it on corpora translated from two publicly available English corpora. The experimental results on the two corpora are promising.
In the future, we will further develop the system with deep learning models. First of all, we will use distributed word embedding techniques, such as word2vec, to improve the representation of the text. We then plan to replace the SVM model with a recurrent neural network with long short-term memory units. Curating a corpus from native Chinese students is also important. Word segmentation matters as well; instead of Jieba, we might use the CKIP word segmentation service (Ma and Chen, 2003).
Most research projects require reference answers, and unsupervised automatic short answer grading is an interesting way to bypass this requirement (Adams et al., 2016).