The Effect of Adding Authorship Knowledge in Automated Text Scoring

Some language exams include multiple writing tasks. When a learner writes multiple texts in a language exam, it is not surprising that the quality of these texts tends to be similar, yet existing automated text scoring (ATS) systems do not explicitly model this similarity. In this paper, we suggest that it can be useful to include the other texts written by a learner in the same exam as extra references in an ATS system. We propose several approaches to fusing information from multiple tasks and passing this authorship knowledge into our ATS model, and evaluate them on six different datasets. We show that this can positively affect model performance at a global level.


Introduction
The existence of various English exam products gives language learners a useful and fair way to measure their English skills accurately. It also offers a well-accepted standard that helps schools and companies quantitatively judge whether their non-native English applicants meet the language requirements they set. Many learners take different English exams to obtain the qualifications required by different organisations. For example, more than two million International English Language Testing System (IELTS) exam sessions were taken in 2012-2013, and more than 30 million people have taken the Test of English as a Foreign Language (TOEFL) exam.
English exams like IELTS and TOEFL have free-text writing tasks to evaluate a learner's writing ability. For a writing task, each learner needs to write a text to answer the prompt in the task.
Appropriately assessing the quality of free-text writing requires highly proficient human examiners, and the shortage of professional, qualified examiners makes it hard for learners to get accurate and timely feedback on the quality of their writing. Consequently, it is hoped that an ATS system can act as a kind of examiner to mitigate this problem, offering assistance to both learners and educators. The goal of ATS is to improve consistency and reduce human resource overheads. ATS usually employs machine learning techniques to build a model of the underlying relationship between texts and scores. ATS is often used as the second marker in high-stakes exams, and as the only marker in practice tests and tutoring software products.

Multiple Writing Tasks
To evaluate a learner's writing skill more thoroughly, many English exams like IELTS and TOEFL ask learners to answer multiple writing tasks. These tasks are drawn from different topics and genres, and each learner is required to write a text for each task. In practice, human judges score each text with an individual score, and these scores are aggregated into an overall score that reflects the learner's writing skill. We refer to an ATS model that predicts the individual score of a text as an individual-level model, and one that predicts the overall score of all the texts as an overall-level model.
When an individual-level ATS model scores texts, previous work has made the implicit assumption that all responses to all tasks are composed independently. This does not hold for exams requiring responses to multiple tasks. The writing skill a learner exhibits during a single exam session will not normally vary greatly, so the texts written by one learner may share commonalities, such as preferred word usages and common mistakes, and should also reflect their writing skill approximately equally. We suggest that when an individual-level model predicts the score of a text written by a learner, it is helpful to use their other texts as a reference and pass them to the model as an extra piece of information. We refer to this information as authorship knowledge.
We suggest that the potential benefit of passing this authorship knowledge to an ATS model might come from a reduction of data sparsity and improvement in the robustness and reliability of feature extraction. Normally the text length for each task is limited, and so there may be insufficient features exemplified in a single response to differentiate language proficiency levels. It can be challenging for an ATS model to learn the mapping between texts and scores accurately, and adding authorship knowledge might provide additional salient features to learn the mapping.
In this paper, we test the hypothesis that authorship knowledge can improve individual-level model performance. We pass this knowledge to an individual-level model in two independent ways: feature fusion and score fusion. In both methods, when the model predicts text scores, all the texts written by the same learner are passed to the model as an extra reference. We show that adding this knowledge helps an individual-level ATS model in most cases. To the best of our knowledge, this is the first study of how authorship knowledge affects ATS system performance.

Related Work
In most previous work, text features are defined manually and extracted automatically from each text. A machine learning model is then applied to learn the mapping from features to scores. Many different machine learning models have been used, including regression (Page, 2003; Attali and Burstein, 2006; Phandi et al., 2015), classification (Larkey, 1998; Rudner and Liang, 2002) and ranking (Chen and He, 2013; Cummins et al., 2016b). The features used in previous work range from shallow textual features to discourse structure and semantic coherence (Higgins et al., 2004; Yannakoudakis and Briscoe, 2012; Somasundaran et al., 2014), and from prompt-independent to prompt-dependent features (Cummins et al., 2016a). Some recent models have dispensed with feature engineering and utilised word embeddings and neural networks (Alikaniotis et al., 2016; Dong and Zhang, 2016; Taghipour and Ng, 2016).
However, no previous work has investigated the utility of authorship knowledge in ATS. One possible reason is that most datasets only have one text written by each learner. The First Certificate in English (FCE) dataset released by Yannakoudakis et al. (2011), on the other hand, contains two texts per learner. We primarily focus on the FCE dataset in this paper, but also utilise other datasets to corroborate our results. Yannakoudakis et al. defined all the texts written by a learner as a script. They extracted features from each text and then combined the features of the two texts within the same script together. A support vector machine (SVM) ranking model was trained to learn the relationship between features and overall scores.

Datasets
In this paper, we require datasets that include more than one text written by each learner, where each text is scored with an individual-level score. We obtained six such datasets for our experiments. Each dataset is a set of texts collected from a real exam, and each exam targets one or more Common European Framework of Reference for Languages (CEFR) levels in English. There are six CEFR levels in total: A1, A2, B1, B2, C1 and C2, arranged from lowest to highest.
In each dataset, each script consists of the answers to two tasks, and the answers to both tasks were scored on the same grading scale. Each script was written on the same day, so we can safely assume no dramatic variation in writing skill for each learner. The FCE dataset discussed in Section 2 was collected from the FCE exam. The other five datasets were provided by Cambridge Assessment and were collected in different years.
We need to choose the score for each text for an ATS model to learn. As the original score for each text in the FCE is not reported on a numerical scale, Cambridge Assessment helped us convert the grades to integers between 0 and 20; this mapping is given in Table 2. All the texts from the B2-U, B2-S, C1-U and C1-S datasets are evaluated in terms of four aspects: content, communicative achievement, language quality and organisation. Each aspect is scored as an integer in the range 0-5. We add the scores of these four aspects together to obtain a total score in the range 0-20, and we use this total score as the score of the text in our study. In contrast, AL-U is marked on a scale of 0-9 at 0.5 mark intervals, where each text also receives a score for each of four aspects: task achievement, coherence, word usage and grammar. The total score is aggregated from the scores on all four aspects by Cambridge Assessment and is normalised to the 0-9 scale at 0.5 mark intervals. In this case, we directly use the existing total score as the individual score of each text in AL-U.

Table 1: For the five datasets other than FCE, the name of each dataset encodes its target CEFR level and whether it is unshuffled or shuffled; B2-U, for example, targets B2 level learners and is unshuffled. MEAN and STD describe the mean and standard deviation of the scores. All the datasets have two writing tasks, and for each writing task, each learner is required to write an answer to one prompt. # prompts describes how many prompts exist in each dataset.
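The aggregation just described can be sketched in a few lines. This is only illustrative: the function names are ours, and the AL-U snapping helper is an assumption, since the paper uses the official AL-U totals as-is and does not specify how Cambridge Assessment aggregates the aspect scores.

```python
def total_score_0_20(content, communicative, language, organisation):
    """Sum the four aspect scores (each an integer in 0-5) into the
    0-20 total used for the B2-U, B2-S, C1-U and C1-S datasets."""
    aspects = (content, communicative, language, organisation)
    assert all(0 <= a <= 5 for a in aspects), "aspect scores must be in 0-5"
    return sum(aspects)


def snap_to_alu_scale(score):
    """Hypothetical helper: snap a raw aggregate onto AL-U's 0-9 scale
    at 0.5 mark intervals (not the official aggregation procedure)."""
    return min(9.0, max(0.0, round(score * 2) / 2))
```

For example, aspect scores of 3, 4, 3 and 4 give a total of 14 on the 0-20 scale.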
Due to transcription errors, we only kept scripts which do not contain any invalid individual score.
After we cleaned the text scores, each dataset was split into training, development and test sets. The total number of scripts in each dataset, and the number of scripts in the training, development and test sets, are summarised in Table 1. The test set for FCE is the same as in Yannakoudakis et al. (2011).

Notations
Let us introduce some notation to facilitate our discussion. Each dataset consists of M tasks for each learner to answer, and there are J learners in a dataset. We assume that each learner takes any exam only once. All the datasets described in Section 3 require learners to write two texts, so M = 2 in each dataset. t_{m,j} denotes the m-th text written by learner l_j, answering the m-th task task_m in a dataset; t_{m,j} can be represented as a sequence of words written by learner l_j. The individual score for text t_{m,j} assigned by a human examiner is s_{m,j}.
T^L_j = {t_{m,j} | m = 1, ..., M} denotes the set of all the texts written by l_j in a dataset. In other words, T^L_j is equivalent to the script answered by learner l_j.
T^N_{m,j} = T^L_j \ {t_{m,j}} denotes the neighbouring text set of t_{m,j}, that is, all the texts written by learner l_j except for t_{m,j}. Since each dataset contains only two texts per learner, T^N_{m,j} always contains exactly one text, t_{(M+1-m),j}, which we call the neighbouring text of t_{m,j}.
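This notation maps directly onto a small data structure. The sketch below is purely illustrative; the `Text` class and helper names are ours, not the paper's.

```python
from dataclasses import dataclass, field

M = 2  # every dataset in this paper has two writing tasks


@dataclass
class Text:
    m: int                 # task index, 1-based
    j: int                 # learner index
    words: list = field(default_factory=list)  # the text as a word sequence
    score: float = 0.0     # human-assigned individual score s_{m,j}


def script(texts, j):
    """T^L_j: all the texts written by learner l_j."""
    return [t for t in texts if t.j == j]


def neighbours(texts, t):
    """T^N_{m,j} = T^L_j \\ {t_{m,j}}: the learner's other texts.
    With M = 2 this set always holds exactly one text, t_{(M+1-m),j}."""
    return [u for u in script(texts, t.j) if u.m != t.m]
```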

Assumptions
There are two assumptions behind the use of authorship knowledge in ATS that we want to validate.
The first assumption is that there is a variable skill_j describing the writing skill of each learner l_j, and that skill_j is approximately constant during an exam. If we believe the skill of a learner can be measured by the English exam they take, then s_{m,j} for any m is a sample from a distribution constrained by skill_j during the exam. We also assume that no learner behaves totally differently on the two tasks during the same exam. In this case, the individual text scores should be close to each other and correlate well with skill_j, and this correlation might be helpful in training an individual-level model.
However, the first assumption is not always correct. In some circumstances, learners perform well on some tasks but fail to complete other tasks to the same quality, and so receive low scores on those tasks. One obvious reason is that some learners manage their time badly and fail to finish the second task; others may be better prepared for the topic and genre elicited by one of the prompts.
To verify and measure this assumption, we calculate the root-mean-squared error (RMSE), quadratic weighted kappa (κ), Pearson correlation (ρ_prs) and Spearman correlation (ρ_spr) between all the responses to the first task, T^T_1, and the second task, T^T_2, answered by all learners. The results are given in Table 3. As we can see, κ, ρ_prs and ρ_spr are above 0.6 on the four unshuffled datasets, and around 0.4 on the two shuffled datasets. Landis and Koch (1977) suggest that there is substantial agreement between two sequences if Cohen's Kappa is above 0.6, and moderate agreement when it is between 0.4 and 0.6. Using their interpretation, there is at least a moderate correlation and agreement between the scores of T^T_1 and T^T_2, which verifies our first assumption to some degree. Whether this amount of agreement affects the performance of an ATS model is investigated in the following sections.
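The four agreement statistics can be computed with nothing beyond the standard library. The sketch below implements the standard definitions (the paper does not specify an implementation); quadratic weighted kappa assumes integer scores on a known range.

```python
import math


def rmse(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def _ranks(x):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    ranks = [0.0] * len(x)
    i = 0
    while i < len(x):
        k = i
        while k + 1 < len(x) and x[order[k + 1]] == x[order[i]]:
            k += 1
        for idx in order[i:k + 1]:
            ranks[idx] = (i + k) / 2 + 1
        i = k + 1
    return ranks


def spearman(x, y):
    return pearson(_ranks(x), _ranks(y))


def quadratic_weighted_kappa(x, y, s_min, s_max):
    """QWK for integer scores on [s_min, s_max]."""
    n, k = len(x), s_max - s_min + 1
    obs = [[0.0] * k for _ in range(k)]
    for a, b in zip(x, y):
        obs[a - s_min][b - s_min] += 1
    hx = [sum(row) for row in obs]
    hy = [sum(obs[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2
            num += w * obs[i][j]          # observed disagreement
            den += w * hx[i] * hy[j] / n  # expected disagreement
    return 1 - num / den
```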
The second assumption concerns whether passing authorship knowledge to an ATS model truly improves model performance by providing more reliable features and a better understanding of each learner's writing skill. An alternative explanation for any improvement is bias during the marking procedure. Comparing RMSE for the unshuffled and shuffled datasets in Table 3, we can see that RMSE is higher for B2-S than for B2-U, and higher for C1-S than for C1-U. This suggests that human judges might be biased when marking the second text after the first. Hence, we determine whether authorship knowledge truly improves ATS performance by looking at the shuffled datasets, since any improvement on the unshuffled datasets might be the result of grading bias.

Methods
To study how authorship knowledge affects ATS, we first need a baseline model.

Baseline
In the baseline model, a feature vector f_{m,j}, extracted from text t_{m,j} written by learner l_j, is used to train an individual-level model to learn the relationship between the feature vector space F and the text score space S. The model predicts the score of text t_{m,j} as ŝ_{m,j}. The predicted score ŝ_{m,j} might be invalid on the given grading scale; for example, an ATS model might predict a score of 4.3 when the grading scale requires an integer. Hence, we round ŝ_{m,j} to the nearest valid score on the given grading scale, denoted rs_{m,j}, which is 4 in this example.
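Rounding a raw prediction onto the grading scale is a nearest-neighbour search over the valid scores. The helper below is a minimal sketch; the two example scale layouts follow the descriptions in Section 3.

```python
def round_to_scale(raw_score, valid_scores):
    """Round a raw prediction s-hat to the nearest valid score rs on
    the exam's grading scale."""
    return min(valid_scores, key=lambda v: abs(v - raw_score))


# Example scales from Section 3:
fce_scale = list(range(21))             # integers 0, 1, ..., 20
alu_scale = [i / 2 for i in range(19)]  # 0.0, 0.5, ..., 9.0
```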

Features
The features for the baseline model are similar to those of Yannakoudakis et al. mentioned in Section 2. More specifically, we use word and POS n-grams, script length, the n-gram missing rate estimated on a background corpus, phrase structure rules, and grammatical dependency distances between any two words within the same sentence, though we only use the top parse result for the grammatical relation distance measures. The n-gram missing rate is estimated on UKWaC (Ferraresi et al., 2008). In addition, we include the number of misspelt words, the count of grammatical relation types, and the minimum, maximum and average sentence and word lengths. The POS tags, grammatical relations and phrase structure rules are derived from the RASP (robust accurate statistical parsing) toolkit (Briscoe et al., 2006). We remove any feature whose frequency in the training set is below 4, and keep the top 25,000 features with the highest absolute Pearson correlation with the text scores. Each feature vector is normalised so that ||f_{m,j}|| = 1.
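The frequency cut-off, correlation-based selection and L2 normalisation can be sketched as follows, treating each text as a sparse {feature: value} dict. This is our reading of the pipeline, not the authors' code.

```python
import math


def select_by_pearson(X, y, min_freq=4, top_k=25000):
    """Keep features occurring at least min_freq times in the training
    set, then keep the top_k with highest |Pearson correlation| to the
    text scores. X is a list of sparse {feature: value} dicts."""
    n = len(X)
    freq = {}
    for fv in X:
        for f in fv:
            freq[f] = freq.get(f, 0) + 1
    candidates = [f for f, c in freq.items() if c >= min_freq]
    my = sum(y) / n

    def abs_corr(f):
        col = [fv.get(f, 0.0) for fv in X]
        mx = sum(col) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(col, y))
        sx = math.sqrt(sum((a - mx) ** 2 for a in col))
        sy = math.sqrt(sum((b - my) ** 2 for b in y))
        return abs(cov / (sx * sy)) if sx > 0 and sy > 0 else 0.0

    return sorted(candidates, key=abs_corr, reverse=True)[:top_k]


def l2_normalise(fv):
    """Scale a sparse feature vector so that ||f_{m,j}|| = 1."""
    norm = math.sqrt(sum(v * v for v in fv.values()))
    return {f: v / norm for f, v in fv.items()} if norm > 0 else fv
```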

Benchmark
Yannakoudakis et al. (2011) only built an overall-level model and evaluated it in terms of ρ_prs and ρ_spr. As we use more features and add a global feature selection step, we should ensure that our model is close to optimal and thus a challenging baseline.
We first concatenate all the texts in script T^L_j together as the concatenated text ct_j, so that the whole script can be treated as a single text. We extract the script feature vector cf_j from the concatenated text ct_j based on the features defined in Section 6.1.1. We define the combined script score cs_j as the sum of the individual text scores, representing the overall score of each script: cs_j := Σ_{m=1}^{M} s_{m,j}. The FCE dataset has another overall script score ss_j for script T^L_j, used by Yannakoudakis et al. (2011). We train an overall-level model, by means of support vector regression (SVR) and SVM ranking with a linear kernel, between cf_j and the script score ss_j rather than cs_j. To obtain a valid predicted score on the given grading scale for SVM ranking, we train another linear regression model on the training set between the ranking scores and the actual text scores. For both SVR and SVM ranking, we then round the scores predicted by their corresponding regressors to the nearest valid integers on the given grading scale.
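A minimal sketch of the script-level quantities, plus the extra regression step that maps SVM-ranking scores back onto the grading scale. The 1-D least-squares fit here is an assumption; the paper does not give the exact form of its linear regressor.

```python
def concatenate_script(texts_of_learner):
    """ct_j: all of learner l_j's texts joined into one document."""
    return " ".join(texts_of_learner)


def combined_script_score(scores_of_learner):
    """cs_j := sum over m of s_{m,j}."""
    return sum(scores_of_learner)


def fit_line(x, y):
    """Least-squares line mapping ranking scores onto real scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (c - my) for a, c in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return lambda v: my + slope * (v - mx)
```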
We tune the regularisation hyper-parameter on the development set and report the results achieving the lowest RMSE on the development set. The results are included in Table 4, where the UKWAC and CLC versions use UKWaC and the Cambridge Learner Corpus (Nicholls, 2003) as the background corpus for n-gram missing rate estimation respectively. DISCOURSE is the CLC version with extra discourse features. In the DISCOURSE version, Yannakoudakis and Briscoe (2012) investigated different features to measure the coherence of a text and how these features affect the overall score of the texts in the FCE dataset. They showed that the coherence feature based on incremental semantic analysis (Baroni et al., 2007), measuring average adjacent-sentence similarity, can improve their ATS system in terms of the Pearson and Spearman correlations. Table 4 does not include any recent neural model on the FCE dataset, because the neural model developed by Farag et al. (2017) shows that there is still a performance gap between neural models and models built on hand-crafted features.
Our models achieve relatively good performance, and we also find that, with appropriate features and hyper-parameters, the difference between using ranking and regression to train an ATS model is relatively small. This contrasts with the finding of Yannakoudakis et al. (2011) that ranking is much better than regression on this task. We therefore use SVR (BASE) in the following experiments.

Model Fusion
There are two ways in which we pass authorship knowledge into our ATS model. We refer to them as score fusion and feature fusion.
For score fusion, we concatenate all the texts within the same script, written by learner l_j, together as ct_j, and extract the script feature vector cf_j from ct_j. An overall-level model is trained on cf_j and its combined script score cs_j, which is the sum of all the individual scores of one script. This overall-level model predicts the combined script score of ct_j as ĉs_j, and the predicted normalised combined score ĉs_j / M is fused with the predicted individual score ŝ_{m,j} by linear interpolation to give the predicted fused score fs_{m,j}:

fs_{m,j} := α · (ĉs_j / M) + (1 − α) · ŝ_{m,j}

The interpolation hyper-parameter α is tuned on the development set, and fs_{m,j} is then rounded to the nearest valid score on the given grading scale as the final predicted individual-level score for t_{m,j}.
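Score fusion is a one-line interpolation plus a grid search for α on the development set. The interpolation direction (α weighting the overall-level prediction) follows the later hyper-parameter discussion, where α > 0.5 favours the other source; the grid itself is an assumption of this sketch.

```python
import math


def score_fusion(s_hat, cs_hat, alpha, M=2):
    """fs_{m,j} := alpha * (cs_hat / M) + (1 - alpha) * s_hat."""
    return alpha * (cs_hat / M) + (1 - alpha) * s_hat


def tune_alpha(dev_pairs, gold, M=2, grid=None):
    """Pick the alpha minimising development-set RMSE.
    dev_pairs holds (s_hat, cs_hat) tuples aligned with gold scores."""
    grid = grid if grid is not None else [i / 20 for i in range(21)]

    def dev_rmse(a):
        return math.sqrt(sum(
            (score_fusion(s, cs, a, M) - g) ** 2
            for (s, cs), g in zip(dev_pairs, gold)) / len(gold))

    return min(grid, key=dev_rmse)
```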
For feature fusion, we again extract the script feature vector cf_j from ct_j. We then define the fused feature vector ff_{m,j} of t_{m,j} as the concatenation of f_{m,j} and cf_j:

ff_{m,j} := [(1 − β) · f_{m,j} ; β · cf_j]

where β is the concatenation weighting hyper-parameter tuned on the development set. We train an individual-level model on the fused feature vector ff_{m,j} and the text score s_{m,j}, and the predicted score ŝ_{m,j} is rounded to the nearest valid score rs_{m,j} on the given grading scale.
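Feature fusion then reduces to a weighted concatenation. The exact weighting scheme below, (1 − β) on the text part and β on the script part, is our reconstruction from the description of β as a concatenation weighting hyper-parameter.

```python
def feature_fusion(f_text, f_script, beta):
    """ff_{m,j}: concatenation of f_{m,j} and cf_j, with the two parts
    weighted by (1 - beta) and beta respectively."""
    return [(1 - beta) * v for v in f_text] + [beta * v for v in f_script]
```

With β > 0.5 the script-level block dominates the fused vector, matching the interpretation that larger β favours the other source.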
Another question raised by this discussion is what to fuse. For text t_{m,j} in score fusion, instead of fusing the individual score ŝ_{m,j} with the combined script score ĉs_j, we can fuse ŝ_{m,j} with the predicted individual score ŝ_{(M−m+1),j} of the other text within the same script, i.e. the neighbouring text t_{(M−m+1),j}.
For feature fusion, when augmenting the text feature vector f_{m,j} to ff_{m,j}, we can concatenate it with the feature vector f_{(M−m+1),j} of the neighbouring text t_{(M−m+1),j} instead of the script feature vector cf_j derived from the concatenated text ct_j. We therefore have two fusion approaches, and each approach has two different sources to fuse.
Note that the two tasks in each dataset are designed to be of a similar difficulty level. The fusion approach can easily be adapted even when the tasks are not of the same difficulty: if the difficulty gap between the targeted task and the neighbouring task is too large, we can penalise the neighbouring task so that the ATS model mainly looks at the targeted task. This is straightforward in our method, as we simply adjust the weight of the neighbouring task. We will investigate tasks of different difficulty levels in future work once a suitable dataset is available.
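One hypothetical way to realise this penalty is to decay the neighbouring task's weight with the difficulty gap. Nothing in the paper fixes the functional form, so the exponential decay below is purely illustrative.

```python
import math


def neighbour_weight(difficulty_gap, base_beta=0.5, scale=1.0):
    """Shrink the neighbouring task's fusion weight as the difficulty
    gap between the two prompts grows (illustrative choice only)."""
    return base_beta * math.exp(-scale * abs(difficulty_gap))
```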

Results and Discussion
In this section, we evaluate the baseline model and the fusion approaches to study the influence of authorship knowledge. For each setup, we train an individual-level model for each dataset. The model for each setup is optimised on each development set in terms of RMSE. We tune the SVR regularisation and interpolation hyper-parameters on each development set. We report RMSE, κ, ρ prs and ρ spr in Table 6 for each test set. The optimal interpolation hyper-parameters for each fusion approach are reported as α/β in Table 6.
Some readers might notice a numerical difference between Table 4 and Table 6 for the same baseline model evaluated on the FCE test set. The reason is that the two tables correspond to two different tasks: Table 4 reports predictions of the overall-level score, while Table 6 reports predictions of the individual-level score of a text. Predicting individual-level scores appears to be a harder task, as there is less text to assess (Section 1.1).
For feature fusion, both feature fusion with the neighbouring text (FF-NT) and with the concatenated text (FF-CT) are consistently better than the baseline (BASE) on all the datasets, except for B2-U on κ, ρ_prs and ρ_spr. For score fusion, score fusion with the concatenated text (SF-CT) is better than BASE on all six datasets except for κ on AL-U. In contrast, score fusion with the neighbouring text (SF-NT) is better than BASE in terms of RMSE on all the datasets except FCE, but its κ is only better than BASE on C1-S. Both SF-CT and SF-NT are better than BASE in terms of ρ_prs and ρ_spr. The improvement is also visible on the two shuffled datasets, so we suggest that the gain from adding authorship knowledge is not merely the result of modelling grading bias, which addresses the second assumption in Section 5.
To give a better global understanding of how each approach performs, we conduct the Wilcoxon signed-rank test (Wilcoxon, 1945; Demšar, 2006) across the six datasets to see whether any setup is significantly better or worse than BASE at a global level. We use the SciPy implementation to run the test, and the p-values of all the metrics across all six datasets are listed in Table 5. Based on these results, there is a significant difference (p < 0.05) between BASE and all the fusion approaches on all the metrics across the datasets, except for SF-NT on κ.

Table 5: p-values for each approach estimated by the Wilcoxon signed-rank test across all six datasets. Values bigger than 0.05 are in bold.

Hyper-parameters

A value of α or β above 0.5 in a fusion approach tells the ATS model to favour the information from the other source over the current individual text t_{m,j} being marked, and vice versa. We also visualise the relation between β and RMSE for the feature fusion approaches in Figures 1 and 2.
For fusion with the concatenated text ct_j, α > 0.5 on FCE and C1-S in SF-CT, and β > 0.5 on all the datasets except B2-U in FF-CT. Furthermore, if we tune β on the test sets, the optimal β is bigger than 0.5 on all six datasets. On the one hand, we are somewhat surprised that the fusion approaches with the concatenated text favour ct_j; it might mean that ct_j is more salient than the original text t_{m,j} in ATS. On the other hand, these results are still to be expected, because ct_j also contains t_{m,j}, so the information from t_{m,j} remains influential in the model even when α, β > 0.5.
In contrast, we expect a model fused with the neighbouring text to achieve its best performance on each dataset when α or β is smaller than 0.5, as the model should focus on the text t_{m,j} being marked. For SF-NT in Table 6, the optimal α is indeed always smaller than 0.5. However, for FF-NT, the optimal β = 0.5 for AL-U and C1-U in Table 6. Furthermore, if we tune β on the test sets instead of the development sets, β > 0.5 on the FCE, C1-U and AL-U datasets in Figure 2. Based on these results, we suggest that in some cases the features of the two texts written by the same learner can be highly similar and shared to some extent in an ATS model.

Score Difference
Although positive effects are observed in most cases, no method is significantly better than BASE on every dataset and metric we used. One reason might be that it is not ideal to aggregate the two texts written by the same learner when the performance difference between these texts is large. For example, some learners might perform well on the first task but fail to complete the second. This is the situation discussed under the first assumption in Section 5, where that assumption may be invalid. We therefore conduct another study to see how the score difference between the two texts in each script affects model performance.
We define the script score difference sd_j as the score difference between the two texts t_{1,j} and t_{2,j} within the same script T^L_j: sd_j := |s_{1,j} − s_{2,j}|.
The text score difference of text t_{m,j} is defined as the score difference of the script to which it belongs: sd_{m,j} := sd_j.
The text score error error_{m,j} denotes the difference between the predicted score and the gold score of t_{m,j}: error_{m,j} := |rs_{m,j} − s_{m,j}|.
The text score errors produced by BASE and by a fusion approach on text t_{m,j} are denoted error^BASE_{m,j} and error^FUSION_{m,j}, respectively. The performance difference PD_{m,j} between BASE and a fusion approach for text t_{m,j} is the difference between the errors made by the two setups:

PD_{m,j} := error^BASE_{m,j} − error^FUSION_{m,j}

PD_{m,j} > 0 means that the fusion approach is better than BASE at predicting the score of t_{m,j}, and vice versa. We calculate the Pearson correlation ρ_prs between PD_{m,j} and sd_{m,j} for each test set in Table 7. Although we do not find any clear relation between the correlations here and the performance variation in Table 6, Table 7 does reveal some patterns. On the one hand, most values are negative; the five positive values in bold tend to be close to 0, and p is always bigger than 0.05 for all the positive values. We suggest that there is a negative correlation between performance difference PD_{m,j} and text score difference sd_{m,j} on some datasets.

Table 6: GREEN means improvement and RED means degradation over BASE. The optimal interpolation hyper-parameters for each fusion approach are reported as α/β. + means significantly better (p < 0.05) than BASE using the permutation randomisation test (Yeh, 2000) with 2,000 samples. No metric is found significantly worse than BASE.
On the other hand, only six of the negative values in Table 7 have p-values smaller than 0.05. We think the negative influence of the score difference is not large, because the scores of the two texts written by the same learner are at least moderately correlated, as shown in Table 3. This correlation might reduce the negative influence of the score difference here.
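The correlation analysis above can be reproduced as follows. The data layout here, one (s_1j, s_2j) pair per script for the gold, BASE and fusion scores, is an assumption of this sketch.

```python
import math


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def pd_sd_correlation(gold, base_pred, fusion_pred):
    """Correlation between PD_{m,j} = error_BASE - error_FUSION and
    sd_{m,j}, the score gap of the script each text belongs to."""
    pd, sd = [], []
    for (g1, g2), (b1, b2), (f1, f2) in zip(gold, base_pred, fusion_pred):
        gap = abs(g1 - g2)                     # sd_j, shared by both texts
        for g, b, f in ((g1, b1, f1), (g2, b2, f2)):
            pd.append(abs(b - g) - abs(f - g))
            sd.append(gap)
    return pearson(pd, sd)
```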
In some operational settings, it might be considered unfair to use other responses to score a new response, and grading guidelines usually require texts to be marked independently. Nevertheless, we found a clear improvement when making use of such information, and no approach is significantly worse than BASE on any metric or dataset. In other words, the positive influence brought by our fusion approaches outweighs any possible negative effects.

Table 7: The Pearson correlation between performance difference PD_{m,j} and script score difference sd_{m,j} on the test sets. * denotes p-value < 0.05, and bold denotes a positive correlation.

Conclusion
In this paper, we studied how authorship knowledge, passed to the model by means of score fusion and feature fusion, can serve as useful information in ATS. We showed that including such information improves model performance on most datasets, and that the improvement does not come purely from modelling grading bias. One possible topic for future work is to study whether the target CEFR level of each dataset affects the benefit of adding authorship knowledge.