Automated Essay Scoring in the Presence of Biased Ratings

Studies in the social sciences have revealed that when people evaluate someone else, their evaluations often reflect their biases. As a result, rater bias may introduce highly subjective factors that make evaluations inaccurate. This may affect automated essay scoring models in many ways, as these models are typically designed to model (potentially biased) essay raters. While there is a sizeable literature on rater effects in general settings, it remains unknown how rater bias affects automated essay scoring. To this end, we present a new annotated corpus containing essays and their respective scores. Unlike existing corpora, our corpus also contains comments provided by the raters in order to ground their scores. We present features to quantify rater bias based on these comments, and we find that rater bias plays an important role in automated essay scoring. We investigate the extent to which rater bias affects models based on hand-crafted features. Finally, we propose to rectify the training set by removing essays associated with potentially biased scores while learning the scoring model.


Introduction
Automated Essay Scoring (AES) aims at developing models that can grade essays automatically or with reduced involvement of human raters (Page, 1967). AES systems may rely not only on grammar, but also on more complex features such as semantics, discourse and pragmatics (Davis and Veloso, 2016; Song et al., 2014; Farra et al., 2015; Somasundaran et al., 2014). Thus, a prominent approach to AES is to learn scoring models from previously graded samples, by modeling the scoring process of human raters. When given the same set of essays to evaluate and enough graded samples, AES systems tend to achieve high agreement levels with trained human raters (Taghipour and Ng, 2016).
While research in AES has focused on designing scoring models that maximize the agreement with human raters (Chen and He, 2013; Alikaniotis et al., 2016), there is a lack of discussion on how biased human ratings are. Despite making judgments on a common dimension, raters may be influenced by their attitudes, their cultural background, and their political and economic views (Guerra et al., 2011). Since AES models are designed to learn by analyzing human-graded essays, they could inherit rating biases present in the scores from human raters, and this may result in systematic errors. Thus, our objective in this paper is to examine the extent to which rater bias affects the effectiveness of state-of-the-art AES models. A deeper understanding of such factors may help mitigate the effects of rater bias, enabling AES models to achieve greater objectivity.
In order to study the effects of rater bias in essay scoring, we created an annotated corpus containing essays written by high school students as part of a standardized Brazilian national exam. Our corpus contains a number of essays, written in Portuguese, along with their respective scores. Further, raters must also provide a comment for each essay in order to ground their scores. As in (Recasens et al., 2013), we built subjectivity and sentiment lexicons that serve as features to represent the comments, that is, rater comments are represented according to the subjectivity distribution as given by specific subjectivity cues in our lexicons. We present empirical evidence suggesting that the subjectivity distribution within rater comments is a proxy for the score that is given to the essay. More specifically, very low (or very high) scores are associated with essays for which rater comments showed a very particular subjectivity distribution. We also investigated the relationship between subjectivity distribution and the misalignment between human raters and AES models. Interestingly, the subjectivity distribution becomes very characteristic as the misalignment increases.
Our main contributions are three-fold:
• We built subjectivity lexicons for the Portuguese language. These lexicons include words and phrases associated with different subjectivity dimensions: sentiments, factive verbs, entailments, intensifiers and hedges. We identify biased language within rater comments by calculating the word mover's distance (Kusner et al., 2015) between comments and the lexicons. This approach benefits from large unlabeled corpora, which can be used to learn effective word embeddings (Mikolov et al., 2013). By identifying biased language, we observed that biases can work to inflate essay scores or to deflate them.
• We employ a set of linguistic features in order to learn different AES models, and we evaluate the effects of biased ratings on the efficacy of these models. In summary, biased ratings affect AES models in different ways, but in general the misalignment between the human rater and the AES model is more acute when the rater shows biased language in their comments.
• We propose simple ways of preventing and reducing the negative effects of biased ratings while learning AES models. Results in a controlled experimental setting revealed that detecting and removing biased ratings from the training set leads to significant improvements in automated essay scoring.
In the remainder of this paper, Section 2 discusses related work on automated essay scoring. Section 3 describes the features used for learning AES models, as well as the features used for identifying biased language in rater comments. Further, our debiasing approach is also discussed in Section 3. Section 4 describes the data, the setup and the results of our empirical evaluation. Finally, Section 5 provides our conclusions.

Related Work
Research in cognitive science, psychology and other social studies offers a great amount of work on (conscious and unconscious) biases and their effects on a variety of human activities (Kahneman and Tversky, 1972; Tversky and Kahneman, 1974). Biases can create situations that lead us to make decisions that project our experiences and values onto others (Baron, 2007; Ariely, 2008). While there is a sizeable literature on rater effects in general settings (Myford and Wolfe, 2003), it remains unknown how biased ratings affect automated essay scoring models. Rather, works on automated essay scoring are mainly focused on designing AES models by maximizing the agreement with human raters, regardless of how accurate those ratings are.
Typically, AES systems are built on the basis of predefined linguistic features that are then given to a machine learning algorithm (Amorim and Veloso, 2017). Works that fall into this approach include (Srihari et al., 2007, 2008; Cummins et al., 2016; McNamara et al., 2015). Further, the authors in (Dong and Zhang, 2016) presented an empirical analysis of features typically used for learning AES models. Other authors studied a broader category of features that can also be used to build AES models. There are also more recent approaches for learning AES models that do not assume a set of predefined features. These approaches are based on deep architectures, and include (Alikaniotis et al., 2016; Taghipour and Ng, 2016; Riordan et al., 2017; Dong et al., 2017). Finally, there are also models based on domain adaptation (Phandi et al., 2015) and unsupervised learning (Chen et al., 2010).
Few works have investigated the subjective nature of essay scoring. An interesting exception investigated the misalignment between students' and teachers' ratings of essays. Results revealed that students who were less accurate in their self-assessments produced essays that were more causal, contained less meaningful words, and had less argument overlap between sentences.
The work in this paper builds upon prior work on building subjectivity lexicons (Klebanov et al., 2012) and subjectivity detection (Recasens et al., 2013), but in our case applied to score agreement. In this respect, our work is more comparable to prior studies in which the authors discussed and investigated the problem of learning in the presence of biased annotators. Other works that are also close to ours include (Farra et al., 2015; Somasundaran et al., 2016; Song et al., 2014), in which the authors studied the problem of scoring persuasive and argumentative essays.

Method
Our aim in this work is to learn AES models that are less prone to the effects of biased ratings, that is, models that are able to perform highly objective and impartial judgements. Thus, we start this section by proposing features that are useful for building AES models. Then, we propose another set of features that are useful for identifying biased ratings based on subjectivity cues. Finally, we propose an approach to remove biased ratings from the training set, thus learning more objective AES models.

Features for Essay Scoring
Like most existing AES systems, our models are built on the basis of predefined features (e.g., number of words, average word length, and number of spelling errors) that are given to a machine learning algorithm. The features used to build our AES models are discussed and evaluated in (Amorim and Veloso, 2017). They fall into two broad categories:
Domain features: These are simple linguistic features, including the number of first-person pronouns, demonstrative pronouns and verbs. Features also include the number of pronouns and verbs normalized by the number of tokens in the corresponding sentence.
General features: Most of the general features are based on (Attali and Burstein, 2006). However, due to the lack of tools for processing the Portuguese language, we implemented the following features ourselves. They are sub-divided as follows:
Grammar and style: Features include the number of grammar errors and misspellings. These numbers are also normalized by the number of tokens in the corresponding sentence. In order to evaluate style, we designed features based on the style rules suggested in (Martins, 2000). Features include the number of style errors and the number of style errors per sentence.
Organization and development: Features include the number of discourse markers from the Portuguese grammar, and the number of discourse markers per sentence. Discourse markers are linguistic units that establish connections between sentences to build a coherent and tightly knit discourse.
Lexical complexity: Features include the Portuguese version of the Flesch score (Martins et al., 1996), the average word length (i.e., the number of syllables), the number of tokens in an essay, and the number of different words in an essay.
Prompt-specific vocabulary usage: Features include different distances between prompt and essay (e.g., cosine distance). In this case, both the prompt and the essay are treated as frequency vectors of words.
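As an illustration of the prompt-specific vocabulary feature, the following minimal sketch computes the cosine distance between two texts represented as word-frequency vectors. The whitespace tokenization and the function names are assumptions made for this example; they are not taken from our implementation.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Lowercase, split on whitespace (an assumed tokenizer), count words."""
    return Counter(text.lower().split())


def cosine_distance(text_a, text_b):
    """Return 1 - cosine similarity of the two word-frequency vectors."""
    freq_a, freq_b = term_frequencies(text_a), term_frequencies(text_b)
    # A Counter returns 0 for missing words, so only shared words contribute.
    dot = sum(count * freq_b[word] for word, count in freq_a.items())
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # assumption: an empty text is treated as maximally distant
    return 1.0 - dot / (norm_a * norm_b)
```

A distance close to 0 indicates that the essay reuses the vocabulary of the prompt, while a distance of 1 indicates no shared words.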

Features for Identifying Biased Ratings
We assume a scenario in which essay raters must ground the provided scores with specific comments. We also assume that we can identify biased ratings by detecting comments with biased language. In order to detect biased language, we developed subjectivity lexicons for the Portuguese language. Specifically, a linguist built a list of Portuguese lexicons based on the analysis of expressions that seem to express some subjectivity of the human evaluator. Our subjectivity lexicons are categorized into the following groups:
Argumentation: This lexicon includes markers of argumentative discourse. Argumentative markers include lexical expressions and connectives, such as: "even" (até), "by the way" (aliás), "as a consequence" (como consequência), "or else" (ou então), "as if" (como se), "rather than" (em vez de), "somehow" (de certa forma), "despite" (apesar de), among others.
Presupposition: This lexicon includes markers that suggest the rater assumes something is true. Some examples of such markers include: "nowadays" (hoje em dia), "to keep on doing" (continuar a), and factive verbs.
Modalization: This lexicon indicates that the writer exhibits a stance towards their own statement. Some examples of such markers are adverbs, auxiliary verbs, modality clauses, and some types of verbs.
Sentiment: This lexicon includes markers that indicate a state of mind or a sentiment of the rater while evaluating the essay. Some examples of such markers include: "with regret" (infelizmente), "with pleasure" (felizmente), and "it is preferable" (preferencialmente).
Valuation: This lexicon assigns a value to facts. Usually, adjectives are employed as valuation, but as adjectives are context dependent, we include in this class only the markers related to intensification, such as: "absolutely" (absolutamente), "highly" (altamente), and "approximately" (aproximadamente).

Debiasing the Training Set
Bias is generally defined as a deviation from a norm. If the norm is unknown to us, then bias is hard to identify. Thus, our approach for debiasing the training set starts by finding the norm (in terms of the subjectivity within rater comments) for each score value. Intuitively, the amount of subjectivity within a comment should be similar to the amount of subjectivity within another comment, given that the scores associated with the corresponding essays are close to each other. So, we should not expect to find essays having discrepant scores, but for which the corresponding comments show a similar amount of subjectivity. Our debiasing approach is divided into three steps:
1. Rater comments are represented according to the amount of subjectivity cues. In order to represent a comment, we calculate the distance between it and each of the five subjectivity lexicons. More specifically, we learn word embeddings (Mikolov et al., 2013) for the Portuguese language, and then we compute the Word Mover's Distance (Kusner et al., 2015) between a comment and each of the five subjectivity lexicons. As a result, each comment is finally represented by a five-dimensional subjectivity vector, where each dimension corresponds to the amount of a specific type of subjectivity. This results in a subjectivity space, where comments are placed according to their amount of subjectivity.
2. We group subjectivity vectors according to the score misalignment associated with the corresponding essay. Then, we calculate centroids for each group in order to find the prototypical subjectivity vector for each group (or misalignment level).
3. The distance to the prototypical subjectivity vector is used as a measure of deviation from the norm. Specifically, we sort essays according to the distance between the subjectivity vector and the corresponding centroid. Then, we define a number of essays to be removed from the training set. The relative number of essays to be removed from the training set is controlled by hyper-parameter α.
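Steps 2 and 3 can be sketched as follows, assuming the five-dimensional subjectivity vectors of step 1 have already been computed. This is a minimal illustration rather than our exact implementation: the function names are hypothetical, the use of Euclidean distance to the centroid is an assumption, and so is the choice to rank all essays together instead of removing a fraction α within each group.

```python
import math
from collections import defaultdict


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def debias(subjectivity, groups, alpha):
    """Return the sorted indices of the essays kept in the training set.

    subjectivity: one subjectivity vector per essay
    groups: one group label per essay (e.g., its misalignment level)
    alpha: fraction of essays to remove, in [0, 1]
    """
    # Step 2: one prototypical (centroid) vector per group.
    by_group = defaultdict(list)
    for vector, label in zip(subjectivity, groups):
        by_group[label].append(vector)
    centroids = {label: centroid(vs) for label, vs in by_group.items()}

    # Step 3: deviation from the norm, measured here as the Euclidean
    # distance to the centroid of the essay's own group (an assumption).
    deviation = [
        math.dist(vector, centroids[label])
        for vector, label in zip(subjectivity, groups)
    ]

    # Rank essays from most to least typical and drop the last alpha fraction.
    ranked = sorted(range(len(subjectivity)), key=lambda i: deviation[i])
    n_keep = len(ranked) - int(alpha * len(ranked))
    return sorted(ranked[:n_keep])
```

With α = 0 the training set is left unchanged, and larger values of α remove the essays whose comments lie farthest from the prototypical subjectivity vector of their group.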

Experiments
In this section, we present the data we used to learn and evaluate different AES models. Then, we discuss our evaluation procedure and report the results obtained with our debiasing approach. In particular, our experiments aim to answer the following research questions:
RQ1: How are scores distributed across the essays? How aligned with human raters are different AES models?
RQ2: Does subjectivity in rater comments vary depending on the given score?
RQ3: Does subjectivity in rater comments vary depending on the misalignment between the AES model and the human rater?
RQ4: Can we mitigate the effects of biased ratings?

Corpus
Our corpus is composed of essays (n = 1,840) that were written by high-school students as part of a standardized Brazilian national exam. Each essay is evaluated according to the following five objective aspects:
Formal language: Mastery of the formal Portuguese language.
Relevance to the prompt: Understanding of the essay prompt and application of concepts from different knowledge fields, to develop the theme in an argumentative dissertation format.
Organization of information: Selecting, connecting, organizing, and interpreting information.
Argumentation: Demonstration of knowledge of linguistic mechanisms required to construct arguments.
Solution proposal: Formulation of a proposal to the problem presented.
The final score is given as the sum of the scores associated with each aspect. Raters are supposed to perform impartial and objective evaluations, and they must enter specific comments in order to ground their scores. Also, each essay was assessed by one rater.
Bias-free ratings: We also set aside a number of essays (n = 50) that received similar scores from three expert raters who were directly instructed to perform impartial, objective, and unbiased evaluations. These raters are PhD-level linguists who had unlimited time to provide their ratings, and they did not participate in the creation of the training set. We assume the ratings given to these essays were not contaminated by biased judgements, and we use these essays for evaluating the efficacy of AES models learned after the training set is debiased.

Setup
We implemented the different AES models using scikit-learn (Pedregosa et al., 2011). Specifically, we learn AES models using Support Vector Regression (SVR), Random Forests (RF), Logistic Regression (LR), Gradient Boosting (GB), and Multi-Layer Perceptron (MLP). All models are based on the same set of features, previously described in Section 3.1, and all models are trained in regression mode. The measure used to evaluate the effectiveness of the different models is the quadratic weighted kappa (κ), which measures the inter-agreement between human raters and AES models (Cohen, 1960). We conducted five-fold cross validation, where the dataset is arranged into five folds with approximately the same number of examples. At each run, four folds are used as training set, and the remaining fold is used as test set. We also kept a separate validation set. The training set is used to learn the models, the validation set is used to tune hyper-parameters, and the test set is used to estimate κ numbers for the different models. Unless otherwise stated, the results reported are the average of the five runs, and are used to assess the overall effectiveness of each model. To ensure the relevance of the results, we assess the statistical significance of our measurements by comparing each pair of models using Welch's t-test with p-value ≤ 0.01.
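For reference, the quadratic weighted kappa can be computed as in the minimal sketch below. It is a plain-Python illustration with hypothetical names, and it assumes integer scores; since our models are trained in regression mode, it further assumes that predictions are rounded to the nearest valid score before κ is computed.

```python
def quadratic_weighted_kappa(human, model, min_score, max_score):
    """Quadratic weighted kappa between two equal-length lists of integer scores."""
    n_levels = max_score - min_score + 1
    n_items = len(human)
    if n_levels == 1:
        return 1.0  # a single possible score leaves no room for disagreement

    # Observed confusion matrix and the marginal histogram of each rater.
    observed = [[0] * n_levels for _ in range(n_levels)]
    hist_human = [0] * n_levels
    hist_model = [0] * n_levels
    for h, m in zip(human, model):
        observed[h - min_score][m - min_score] += 1
        hist_human[h - min_score] += 1
        hist_model[m - min_score] += 1

    numerator = 0.0    # weighted observed disagreement
    denominator = 0.0  # weighted disagreement expected by chance
    for i in range(n_levels):
        for j in range(n_levels):
            weight = (i - j) ** 2 / (n_levels - 1) ** 2
            expected = hist_human[i] * hist_model[j] / n_items
            numerator += weight * observed[i][j]
            denominator += weight * expected
    if denominator == 0:
        return 1.0  # both raters always gave the same single score
    return 1.0 - numerator / denominator
```

A value of 1 corresponds to perfect agreement, a value of 0 corresponds to the agreement expected by chance, and negative values indicate agreement below chance.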

Results and Discussion
Next we report results obtained from the execution of the experiments, and discuss these results in the light of our research questions.
Score distribution: The first experiment is concerned with RQ1. Figure 1 shows how scores are distributed over the essays in our corpus. Although the distribution differs for each AES model, scores are centered around 4, and few essays received extreme scores. The LR model seems to have a preference for lower scores. The scores provided by the GB and MLP models are better distributed. Figure 2 shows how well aligned the different AES models are with human raters. For most of the essays, AES models are well aligned with human raters, showing misalignments that vary from −2 to +2. For some essays, the LR model tends to give scores that are much lower than the score given by the human rater. The GB and MLP models perform very similarly, but the MLP model shows a slightly better alignment.

Subjectivity vectors and biased ratings: The second experiment is concerned with RQ2. Figure 3 shows the average subjectivity vector grouped according to the score given to the corresponding essay (i.e., the centroid or prototypical vector of a score). More specifically, we first grouped subjectivity vectors according to the score associated with the corresponding essay, and then we calculated the average subjectivity vector for each group. As shown in Figure 3, the argumentation dimension increases with the score, while modalization tends to decrease. Presupposition, valuation and sentiment dimensions show a very similar trend with varying score values. Figure 4 shows t-SNE representations (van der Maaten and Hinton, 2008) for the average subjectivity vectors (centroids for each group of scores). Three larger clusters emerged: subjectivity vectors associated with score 0, subjectivity vectors associated with scores between 1 and 6, and subjectivity vectors associated with scores between 6 and 10.

Subjectivity vectors and misalignment: The third experiment is concerned with RQ3. Figure 5 shows the average subjectivity vector considering different levels of misalignment. More specifically, we grouped essays according to the misalignment between the score provided by the AES model and the human rater. Then, we calculated the average subjectivity vector for each group. As we can see, subjectivity affects AES models in different ways. In general, however, subjectivity vectors within groups of essays associated with extreme misalignments are very different from subjectivity vectors associated with mild misalignments. Figure 6 shows t-SNE representations for subjectivity vectors grouped by misalignment levels. Each cluster contains ≈ 80% of the vectors associated with one of the misalignment levels inside the cluster. That is, 20% of the essays will be removed from the training set (i.e., α = 0.2).
Debiasing the training set: The last experiment is concerned with RQ4. As described in Section 3.3, our debiasing approach works by removing from the training set a number of essays (controlled by α) that are more likely to be associated with biased ratings. Table 1 shows κ numbers for different α values. Clearly, the inter-agreement decreases as we remove essays with potentially biased ratings from the training set. This happens because the test set remains with essays that are potentially associated with biased ratings. In this case, removing biased ratings from the training set is always detrimental to the efficacy of AES models.
In order to properly evaluate our debiasing approach, we employ the 50 separate essays with bias-free ratings as our test set. In this case, biased ratings are removed from the training set, and the test set is composed of unbiased ratings. Table 2 shows κ numbers for different α values. As expected, the inter-agreement increases significantly with α, up to a point at which removing further essays from the training set becomes detrimental. This happens either because we start to remove unbiased ratings, or because the training set becomes too small. In all cases, the MLP model proved to be statistically superior to the other models.

Conclusions
In this paper, we investigated the problem of automated essay scoring in the presence of biased ratings. Most of the existing work on automated essay scoring is devoted to maximizing the agreement with the human rater. This is fairly dangerous, since human ratings may be biased. Overall, discussion about the quality of the ratings in automated essay scoring is lacking, and this was a central interest in this paper. Specifically, we created a subjectivity space from which potentially biased scores/ratings can be identified. We showed that removing biased scores from the training set results in improved AES models. Finally, the essay data as well as the subjectivity lexicons that we will release as part of this research could prove useful in other bias-related tasks.