Automated essay scoring with string kernels and word embeddings

In this work, we present an approach based on combining string kernels and word embeddings for automatic essay scoring. String kernels capture the similarity among strings based on counting common character n-grams, which are a low-level yet powerful type of feature, demonstrating state-of-the-art results in various text classification tasks such as Arabic dialect identification or native language identification. To the best of our knowledge, we are the first to apply string kernels to automatically score essays. We are also the first to combine them with a high-level semantic feature representation, namely the bag-of-super-word-embeddings. We report the best performance on the Automated Student Assessment Prize data set, in both in-domain and cross-domain settings, surpassing recent state-of-the-art deep learning approaches.


Introduction
Automatic essay scoring (AES) is the task of assigning grades to essays written in an educational setting, using a computer-based system with natural language processing capabilities. The aim of designing such systems is to reduce the involvement of human graders as far as possible. AES is a challenging task as it relies on grammar as well as semantics, pragmatics and discourse (Song et al., 2017).
In this paper, we propose to combine string kernels (low-level character n-gram features) and word embeddings (high-level semantic features) to obtain state-of-the-art AES results. Recent methods based on string kernels have demonstrated remarkable performance in various text classification tasks, ranging from authorship identification (Popescu and Grozea, 2012) and sentiment analysis (Giménez-Pérez et al., 2017) to native language identification (Popescu and Ionescu, 2013; Ionescu et al., 2014; Ionescu, 2015) and dialect identification. We therefore believe that string kernels can reach equally good results in AES. To the best of our knowledge, string kernels have never been used for this task. As string kernels are a simple approach that relies solely on character n-grams as features, it is fairly obvious that such an approach will not cover several aspects (e.g. semantics, discourse) required for the AES task. To solve this problem, we propose to combine string kernels with a recent approach based on word embeddings, namely the bag-of-super-word-embeddings (BOSWE). To our knowledge, this is the first successful attempt to combine string kernels and word embeddings. We evaluate our approach on the Automated Student Assessment Prize data set, in both in-domain and cross-domain settings. The empirical results indicate that our approach yields a better performance than state-of-the-art approaches (Phandi et al., 2015; Dong and Zhang, 2016; Tay et al., 2018).

Method
String kernels. Kernel functions (Shawe-Taylor and Cristianini, 2004) capture the intuitive notion of similarity between objects in a specific domain. For example, in text mining, string kernels can be used to measure the pairwise similarity between text samples, simply based on character n-grams. Various string kernel functions have been proposed to date (Lodhi et al., 2002; Shawe-Taylor and Cristianini, 2004; Ionescu et al., 2014). One of the most recent string kernels is the histogram intersection string kernel (HISK) (Ionescu et al., 2014). For two strings over an alphabet $\Sigma$, $x, y \in \Sigma^{*}$, the intersection string kernel is formally defined as follows:

$$k^{\cap}(x, y) = \sum_{v \in \Sigma^{n}} \min\{\mathrm{num}_v(x), \mathrm{num}_v(y)\},$$

where $\mathrm{num}_v(x)$ is the number of occurrences of n-gram $v$ as a substring in $x$, and $n$ is the length of $v$. In our AES experiments, we use the intersection string kernel based on a range of character n-grams. We approach AES as a regression task, and employ ν-Support Vector Regression (ν-SVR) (Suykens and Vandewalle, 1999; Shawe-Taylor and Cristianini, 2004) for training.
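As an illustration, the histogram intersection over character n-grams can be sketched in a few lines of Python. This is a minimal sketch rather than the open-source implementation used in our experiments: the function names are hypothetical, and the kernel values for the different n-gram lengths are simply summed over the blended range (1 to 15 by default, as in our implementation choices).

```python
from collections import Counter


def ngram_counts(text, n):
    """Count every character n-gram of length n occurring in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def intersection_string_kernel(x, y, n_min=1, n_max=15):
    """Sum, over all n-gram lengths in [n_min, n_max], the minimum of the
    occurrence counts of each n-gram in the two strings.

    Function name and default arguments are illustrative assumptions.
    """
    total = 0
    for n in range(n_min, n_max + 1):
        counts_x = ngram_counts(x, n)
        counts_y = ngram_counts(y, n)
        # Counter intersection keeps min(count_x, count_y) for shared n-grams.
        total += sum((counts_x & counts_y).values())
    return total
```

For example, `intersection_string_kernel("abab", "ab")` returns 3: the unigrams "a" and "b" are shared once each, and the bigram "ab" is shared once.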
Bag-of-super-word-embeddings. Word embeddings have long been known in the NLP community (Bengio et al., 2003; Collobert and Weston, 2008), but they have recently become more popular due to the word2vec framework (Mikolov et al., 2013), which enables the building of efficient vector representations of words. On top of word embeddings, an approach termed bag-of-super-word-embeddings (BOSWE) was developed by adapting an efficient computer vision technique, the bag-of-visual-words model (Csurka et al., 2004), for natural language processing tasks. The adaptation consists of replacing the image descriptors (Lowe, 2004) useful for recognizing object patterns in images with word embeddings (Mikolov et al., 2013) useful for recognizing semantic patterns in text documents.
The BOSWE representation is computed as follows. First, each word in the collection of training documents is represented as a word vector using a pre-trained word embeddings model. Based on the fact that word embeddings carry semantic information by projecting semantically related words in the same region of the embedding space, the next step is to cluster the word vectors in order to obtain relevant semantic clusters of words. As in the standard bag-of-visual-words model, the clustering is done by k-means (Leung and Malik, 2001), and the formed centroids are stored in a randomized forest of k-d trees (Philbin et al., 2007) to reduce search cost. The centroid of each cluster is interpreted as a super word embedding or super word vector that embodies all the semantically related word vectors in a small region of the embedding space. Every embedded word in the collection of documents is then assigned to the nearest cluster centroid (the nearest super word vector). Put together, the super word vectors generate a vocabulary (codebook) that can further be used to describe each document as a bag-of-super-word-embeddings. To obtain the BOSWE representation for a document, we just have to compute the occurrence count of each super word embedding in the respective document. After building the representation, we employ a kernel method to train the BOSWE model for our specific task. To be consistent with the string kernel approach, we choose the histogram intersection kernel and the same regression method, namely ν-SVR.
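The steps after clustering can be sketched as follows. This is a simplified illustration under stated assumptions: the centroids are taken as already computed by k-means, a brute-force nearest-centroid search stands in for the randomized forest of k-d trees, out-of-vocabulary words are skipped, and all function names are hypothetical.

```python
import math
from collections import Counter


def nearest_centroid(vector, centroids):
    """Index of the centroid (super word vector) closest to vector.

    Brute-force search; an assumption replacing the k-d tree forest.
    """
    return min(range(len(centroids)),
               key=lambda c: math.dist(vector, centroids[c]))


def boswe_histogram(tokens, embeddings, centroids):
    """Occurrence count of each super word embedding in one document.

    Tokens without a pre-trained embedding are ignored (an assumption).
    """
    counts = Counter(nearest_centroid(embeddings[t], centroids)
                     for t in tokens if t in embeddings)
    return [counts.get(c, 0) for c in range(len(centroids))]


def l1_normalize(histogram):
    """Scale a histogram so that its components sum to 1."""
    total = sum(histogram)
    return [h / total for h in histogram] if total else list(histogram)


def intersection_kernel(h1, h2):
    """Histogram intersection kernel between two BOSWE representations."""
    return sum(min(a, b) for a, b in zip(h1, h2))
```

With toy two-dimensional embeddings and two centroids, a document is thus reduced to a two-bin histogram, and two documents are compared by summing the bin-wise minima of their histograms.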
Model fusion. In the primal form, a linear classifier takes as input a feature matrix $X$ of $r$ samples (rows) with $m$ features (columns) and optimizes a set of weights in order to reproduce the $r$ training labels. In the dual form, the linear classifier takes as input a kernel matrix $K$ of $r \times r$ components, where each component $k_{ij}$ is the similarity between examples $x_i$ and $x_j$. Kernel methods work by embedding the data in a Hilbert space and by searching for linear relations in that space, using a learning algorithm. The embedding can be performed either (i) implicitly, by directly specifying the similarity function between each pair of samples, or (ii) explicitly, by first giving the embedding map $\phi$ and by computing the inner product between each pair of samples embedded in the Hilbert space. For the linear kernel, the associated embedding map is $\phi(x) = x$ and options (i) and (ii) are equivalent, i.e. the similarity function is the inner product. Hence, the linear kernel matrix $K$ can be obtained as $K = X \cdot X'$, where $X'$ is the transpose of $X$. For other kernels, e.g. the histogram intersection kernel, it is not possible to explicitly define the embedding map (Shawe-Taylor and Cristianini, 2004), and the only solution is to adopt option (i) and compute the corresponding kernel matrix directly. Therefore, we combine HISK and BOSWE in the dual (kernel) form, by simply summing up the two corresponding kernel matrices. Note that summing up kernel matrices is equivalent to feature vector concatenation in the primal Hilbert space.
To better explain this statement, let us suppose that we can define the embedding map of the histogram intersection kernel and, consequently, we can obtain the corresponding feature matrix of HISK with $r \times m_1$ components denoted by $X_1$ and the corresponding feature matrix of BOSWE with $r \times m_2$ components denoted by $X_2$. We can now combine HISK and BOSWE in two ways. One way is to compute the corresponding kernel matrices $K_1 = X_1 \cdot X_1'$ and $K_2 = X_2 \cdot X_2'$, and to sum the matrices into a single kernel matrix $K_+ = K_1 + K_2$. The other way is to first concatenate the feature matrices into a single feature matrix $X_+ = [X_1 \; X_2]$ of $r \times (m_1 + m_2)$ components, and to compute the final kernel matrix using the inner product, i.e. $K_+ = X_+ \cdot X_+'$. Either way, the two approaches, HISK and BOSWE, are fused before the learning stage. As a consequence of kernel summation, the search space of linear patterns grows, which should help the kernel classifier, in our case ν-SVR, to find a better regression function.
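The equivalence between kernel summation and feature concatenation can be checked numerically on a toy example. The sketch below uses explicit feature matrices and the linear kernel, for which the embedding map is known; the matrices and helper names are illustrative assumptions. For HISK and BOSWE themselves, only the kernel matrices are available, so only the summation route can actually be computed.

```python
def transpose(matrix):
    """Transpose a matrix given as a list of rows."""
    return [list(column) for column in zip(*matrix)]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    b_columns = transpose(b)
    return [[sum(p * q for p, q in zip(row, column)) for column in b_columns]
            for row in a]


def linear_kernel(x):
    """Kernel matrix K = X . X' for a feature matrix X (r samples as rows)."""
    return matmul(x, transpose(x))


def kernel_sum(k1, k2):
    """Element-wise sum of two r x r kernel matrices."""
    return [[a + b for a, b in zip(row1, row2)] for row1, row2 in zip(k1, k2)]


def concatenate_features(x1, x2):
    """Column-wise concatenation [X1 X2] of two feature matrices."""
    return [row1 + row2 for row1, row2 in zip(x1, x2)]


# Toy feature matrices with r = 3 samples, m1 = 2 and m2 = 1 features.
X1 = [[1, 2], [0, 1], [3, 1]]
X2 = [[2], [1], [0]]

summed = kernel_sum(linear_kernel(X1), linear_kernel(X2))
concatenated = linear_kernel(concatenate_features(X1, X2))
```

Both routes produce the same 3 × 3 kernel matrix, since each entry of the concatenated inner product splits into the inner product over the first block of features plus the inner product over the second block.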

Experiments
Data set. To evaluate our approach, we use the Automated Student Assessment Prize (ASAP) data set from Kaggle (https://www.kaggle.com/c/asap-aes/data). The ASAP data set contains 8 prompts of different genres. The number of essays per prompt along with the score ranges are presented in Table 1. Since the official test data of the ASAP competition is not released to the public, we, as well as others before us (Phandi et al., 2015; Dong and Zhang, 2016; Tay et al., 2018), use only the training data in our experiments.
Evaluation procedure. Following Dong and Zhang (2016), we scaled the essay scores into the range 0-1. We closely followed the same settings for data preparation as Phandi et al. (2015) and Dong and Zhang (2016). For the in-domain experiments, we use 5-fold cross-validation. The 5-fold cross-validation procedure is repeated 10 times and the results are averaged to reduce the accuracy variation introduced by randomly selecting the folds. We note that the standard deviation in all cases is below 0.2%.
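The repeated cross-validation protocol can be sketched as follows. This is only an illustration of the procedure, not the code used in our experiments: the `evaluate` callback, the seed and the way folds are drawn are assumptions.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, k, rng):
    """Shuffle the sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]


def repeated_cross_validation(n_samples, evaluate, k=5, repeats=10, seed=0):
    """Average a score over `repeats` independent runs of k-fold validation.

    `evaluate(train_idx, test_idx)` is a hypothetical callback that trains a
    model on train_idx and returns its score (e.g. QWK) on test_idx.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        folds = k_fold_indices(n_samples, k, rng)
        for i, test_idx in enumerate(folds):
            # All folds except the i-th one form the training set.
            train_idx = [j for f, fold in enumerate(folds) if f != i
                         for j in fold]
            scores.append(evaluate(train_idx, test_idx))
    return mean(scores)
```

Drawing new random folds in every repetition is what makes the averaged score less sensitive to a particular split.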
For the cross-domain experiments, we use the same source→target domain pairs as Phandi et al. (2015) and Dong and Zhang (2016), namely, 1→2, 3→4, 5→6 and 7→8. All essays in the source domain are used as training data. Target domain samples are randomly divided into 5 folds, where one fold is used as test data, and the other 4 folds are collected together to sub-sample target domain training data. The sub-sample sizes are $n_t \in \{10, 25, 50, 100\}$. The sub-sampling is repeated 5 times as in (Phandi et al., 2015; Dong and Zhang, 2016) to reduce bias. As our approach performs very well in the cross-domain setting, we also present experiments without sub-sampling data from the target domain, i.e. when the sub-sample size is $n_t = 0$. As evaluation metric, we use the quadratic weighted kappa (QWK).
Baselines. We compare our approach with state-of-the-art methods based on handcrafted features (Phandi et al., 2015), as well as deep features (Dong and Zhang, 2016; Tay et al., 2018). We note that results for the cross-domain setting are reported only in some of these recent works (Phandi et al., 2015; Dong and Zhang, 2016).
Implementation choices. For the string kernels approach, we used the histogram intersection string kernel (HISK) based on the blended range of character n-grams from 1 to 15. To compute the intersection string kernel, we used the open-source code provided by Ionescu et al. (2014). For the BOSWE approach, we used the pre-trained word embeddings computed by the word2vec toolkit (Mikolov et al., 2013) on the Google News data set using the Skip-gram model, which produces 300-dimensional vectors for 3 million words and phrases. We used functions from the VLFeat library (Vedaldi and Fulkerson, 2008) for the other steps involved in the BOSWE approach, such as the k-means clustering and the randomized forest of k-d trees. We set the number of clusters (dimension of the vocabulary) to $k = 500$. After computing the BOSWE representation, we apply the $L_1$-normalized intersection kernel. We combine HISK and BOSWE in the dual form by summing up the two corresponding matrices. For the learning phase, we employ the dual implementation of ν-SVR available in LibSVM (Chang and Lin, 2011). We set its regularization parameter to $c = 10^{3}$ and $\nu = 10^{-1}$ in all our experiments.
Table 2: In-domain automatic essay scoring results of our approach versus several state-of-the-art methods (Phandi et al., 2015; Dong and Zhang, 2016; Tay et al., 2018). Results are reported in terms of the quadratic weighted kappa (QWK) measure, using 5-fold cross-validation. The best QWK score (among the machine learning systems) for each prompt is highlighted in bold.
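For reference, the quadratic weighted kappa used as evaluation metric can be computed as in the sketch below, which follows the standard definition of the measure. It operates on integer ratings, so regression outputs in the 0-1 range would first have to be mapped back to the original score range of the prompt; that mapping is not shown and the function name is an assumption.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_rating, max_rating):
    """Agreement between two lists of integer ratings on a shared scale.

    Returns 1.0 for perfect agreement and values near 0.0 for chance-level
    agreement. Assumes max_rating > min_rating.
    """
    if max_rating <= min_rating:
        raise ValueError("the rating scale needs at least two values")
    n_ratings = max_rating - min_rating + 1
    n_items = len(rater_a)

    # Observed confusion matrix between the two raters.
    observed = [[0] * n_ratings for _ in range(n_ratings)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_rating][b - min_rating] += 1

    # Marginal rating histograms of each rater.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(column) for column in zip(*observed)]

    numerator = 0.0
    denominator = 0.0
    for i in range(n_ratings):
        for j in range(n_ratings):
            weight = (i - j) ** 2 / (n_ratings - 1) ** 2
            expected = hist_a[i] * hist_b[j] / n_items
            numerator += weight * observed[i][j]
            denominator += weight * expected
    if denominator == 0.0:
        # Both raters gave one identical rating to every item.
        return 1.0
    return 1.0 - numerator / denominator
```

The quadratic weights penalize a disagreement of two score points four times as much as a disagreement of one point, which suits ordinal essay grades.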
In-domain results. The results for the in-domain automatic essay scoring task are presented in Table 2. In our empirical study, we also include feature ablation results. We report the QWK measure on each prompt as well as the overall average. We first note that the histogram intersection string kernel alone reaches better overall performance (0.780) than all previous works (Phandi et al., 2015; Dong and Zhang, 2016; Tay et al., 2018). Remarkably, the overall performance of the HISK is also higher than the inter-human agreement (0.754). Although the BOSWE model can be regarded as a shallow approach, its overall results are comparable to those of deep learning approaches (Dong and Zhang, 2016; Tay et al., 2018). When we combine the two models (HISK and BOSWE), we obtain even better results. Indeed, the combination of string kernels and word embeddings attains the best performance on 7 out of 8 prompts. The average QWK score of HISK and BOSWE (0.785) is more than 2% better than the average scores of the best-performing state-of-the-art approaches (Tay et al., 2018).
Cross-domain results. The results for the cross-domain automatic essay scoring task are presented in Table 3. For each and every source→target pair, we report better results than both state-of-the-art methods (Phandi et al., 2015; Dong and Zhang, 2016). We observe that the differences between our best QWK scores and the other approaches are sometimes much higher in the cross-domain setting than in the in-domain setting. We particularly notice that the difference from Phandi et al. (2015) when $n_t = 0$ is always higher than 10%. Our highest improvement over Phandi et al. (2015) (more than 54% in absolute terms, from 0.187 to 0.728) is recorded for the pair 5→6, when $n_t = 0$. Our score in this case (0.728) is even higher than both scores of Phandi et al. (2015) and Dong and Zhang (2016) when they use $n_t = 50$.
Different from the in-domain setting, we note that the combination of string kernels and word embeddings does not always provide better results than string kernels alone, particularly when the number of target samples ($n_t$) added into the training set is less than or equal to 25.
Table 3: Cross-domain automatic essay scoring results of our approach versus two state-of-the-art methods (Phandi et al., 2015; Dong and Zhang, 2016). Results are reported in terms of the quadratic weighted kappa (QWK) measure, using the same evaluation procedure as (Phandi et al., 2015; Dong and Zhang, 2016). The best QWK scores for each source→target domain pair are highlighted in bold.
Discussion. It is worth noting that in a set of preliminary experiments (not included in the paper), we actually considered another approach based on word embeddings. We tried to obtain a document embedding by averaging the word vectors for each document. We computed the average as well as the standard deviation for each component of the word vectors, resulting in a total of 600 features, since the word vectors are 300-dimensional. We applied this method in the in-domain setting and we obtained a surprisingly low overall QWK score, around 0.251. We concluded that this simple approach is not useful, and decided to use BOSWE instead. It would have been interesting to present an error analysis based on the discriminant features weighted higher by the ν-SVR method. Unfortunately, this is not possible because our approach works in the dual space and we cannot transform the dual weights into primal weights, as long as the histogram intersection kernel does not have an explicit embedding map associated with it. In future work, however, we aim to replace the histogram intersection kernel with the presence bits kernel, which will enable us to perform an error analysis based on the overused or underused patterns, as described in prior work.
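The preliminary document embedding mentioned in the discussion (per-component mean and standard deviation of the word vectors) can be sketched as follows. The choice of the population standard deviation, the handling of out-of-vocabulary words and the function name are assumptions, as these details are not specified above.

```python
from statistics import mean, pstdev


def mean_std_embedding(tokens, embeddings):
    """Concatenate the per-component mean and standard deviation of the
    word vectors of one document.

    For d-dimensional word vectors the result has 2 * d features
    (600 features for 300-dimensional vectors).
    """
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        raise ValueError("no token of the document has an embedding")
    components = list(zip(*vectors))  # one tuple per embedding dimension
    means = [mean(component) for component in components]
    deviations = [pstdev(component) for component in components]
    return means + deviations
```

Such a representation discards which regions of the embedding space the words of a document fall into, whereas the BOSWE histogram keeps that information.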

Conclusion
In this paper, we described an approach based on combining string kernels and word embeddings for automatic essay scoring. We compared our approach on the Automated Student Assessment Prize data set, in both in-domain and cross-domain settings, with several state-of-the-art approaches (Phandi et al., 2015; Dong and Zhang, 2016; Tay et al., 2018). Overall, the in-domain and the cross-domain comparative studies indicate that string kernels, both alone and in combination with word embeddings, attain the best performance on the automatic essay scoring task. Using a shallow approach, we report better results compared to recent deep learning approaches (Dong and Zhang, 2016; Tay et al., 2018).