Topicality-Based Indices for Essay Scoring

In this paper, we address the problem of quantifying the overall extent to which a test-taker's essay deals with the topic it is assigned (prompt). We experiment with a number of models for word topicality, and a number of approaches for aggregating word-level indices into text-level ones. All models are evaluated for their ability to predict the holistic quality of essays. We show that the best text-topicality model provides a significant improvement in a state-of-the-art essay scoring system. We also show that the findings on the relative merits of different models generalize well across three different datasets.


Introduction
The instruction to "stay on topic" oft given to developing writers seems intuitively unproblematic, yet the question of the best way to measure this property of a text is far from settled, and little is known about the interaction of topicality and other properties of text, such as length. We develop text topicality indices and evaluate them in the context of automated scoring of essays. Specifically, we investigate the relationship between the extent to which the essay engages the topic provided in the essay question (prompt) and the quality of the essay as quantified by a human-provided holistic score.
In the existing literature, topicality has been addressed as a control flag to identify off-topic essays or spoken responses (Yoon and Xie, 2014; Louis and Higgins, 2010; Higgins et al., 2006) or as an element in the overall coherence of the essay (Higgins et al., 2004; Foltz et al., 1998). Persing and Ng (2014) annotated essays for prompt adherence, and found that achieving inter-rater reliability was very challenging, reporting Pearson r = 0.243 between two raters. We address the relationship between a continuous topicality score and the holistic quality of an essay.
Generally, one can think of the topicality of a given word w on a given topic T as the extent to which w occurs more often in texts addressing T than in otherwise comparable texts addressing a different topic. We consider three models of word topicality from the literature: the significance-test approach as in topic signatures (Lin and Hovy, 2000), the score-product approach as described in the essay scoring literature (Higgins et al., 2006), and a simple cutoff-based approach relying on difference in probabilities.
Given a definition of word topicality, the question arises of how to quantify the topicality of the whole text. Specifically, is topicality a property of the vocabulary of a text (of word types) or a property of both the vocabulary and the unfolding discourse (of word tokens)? Thus, do the sentences "I hate restaurants, abhor restaurants, loathe restaurants, and love restaurants" and "I hate restaurants, abhor waiters, loathe menus, and love food" address the topic of restaurants to the same extent (this would be the prediction of the token-based model), or does the latter sentence address the topic to a greater extent than the former (this would be the prediction of the type-based model)?¹ The second sentence seems to engage more with the topic because it attends to more aspects (or details) of the topic.
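The contrast between the two predictions can be made concrete with a small sketch. It assumes a hypothetical topical word list for the restaurant topic and treats only the listed words as content words; the function names are illustrative and not part of the paper.

```python
# Hypothetical topical word list for the "restaurants" topic (an assumption for illustration).
TOPICAL = {"restaurants", "waiters", "menus", "food"}

def token_topicality(content_tokens, topical):
    """Proportion of content-word tokens that are topical."""
    return sum(tok in topical for tok in content_tokens) / len(content_tokens)

def type_topicality(content_tokens, topical):
    """Proportion of distinct content words (types) that are topical."""
    types = set(content_tokens)
    return len(types & topical) / len(types)

# Content words of the two example sentences (function words such as "I" and "and" removed).
s1 = ["hate", "restaurants", "abhor", "restaurants",
      "loathe", "restaurants", "love", "restaurants"]
s2 = ["hate", "restaurants", "abhor", "waiters",
      "loathe", "menus", "love", "food"]

for name, sent in (("sentence 1", s1), ("sentence 2", s2)):
    print(name, token_topicality(sent, TOPICAL), type_topicality(sent, TOPICAL))
```

Under these assumptions the token-based proportion is the same for both sentences, while the type-based proportion is higher for the second sentence, which is the distinction the two models are meant to capture.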
In this paper, we implement type-based and token-based approaches to text topicality, using a number of different models for word topicality. All models are evaluated for their ability to predict the holistic quality of an essay.
The contributions of this paper are as follows. First, assuming a number of common definitions of word topicality and an application of predicting holistic quality of essays, we show that text-level topicality is most effectively modeled (a) as a property of word types rather than tokens in the text; (b) taking essay length into account. Second, we show that when word topicality is defined using a simple cutoff-based measure and text-topicality is modeled as in (a) and (b) above, we obtain a predictor of essay score that yields a statistically significant improvement in a state-of-the-art essay scoring system. Third, we show that the characteristics of the best topicality model and its effectiveness in improving essay scoring generalize across different kinds of essays.

Data
We experiment with three datasets. Two are datasets of essays responding to two different essay tasks written for a large-scale college-level examination in the United States. These essays are scored by professional raters on a 6-point scale. These sets contain tens of thousands of essays responding to dozens of different prompt questions (82,500 essays, 76 prompts for each dataset). Their sheer sizes and the variety of topics (prompts) allow for a thorough evaluation of the proposed measures. However, the proprietary nature of these data does not allow for easy replication of the results, or benchmarking; we therefore use a third, publicly available dataset containing 12,100 essays written for the TOEFL test by non-native speakers of English seeking college entrance in the United States, as well as for other purposes. The dataset was originally built for the task of native language identification (Blanchard et al., 2013; Tetreault et al., 2012); however, the distribution provides coarse-grained holistic scores as well (3-point scale). We describe each of the datasets in detail below. Table 1 shows the sizes of the partitions of the datasets into Dev (used for building topicality models), Train (used for selecting the best topicality model and for training the essay scoring system), and Test (used for a blind test of the essay scoring system). Table 2 shows score distributions and mean essay length on Train data.

¹ The token-based model says that in both sentences, half the content words are topical, whereas the type-based model says that only 1 out of 5 different content words is topical in the first sentence, and 4 out of 8 in the second.

Set 1
Dataset 1 is comprised of essays written in 2012 and 2013 as part of a large-scale college-level examination in the United States, by a mix of native and nonnative speakers of English. The essays respond to a "criticize an argument" task, where a test-taker is given a short prompt text of about 150 words that typically describes a setting where some recommendation is made or a claim is put forward. The task of the test-taker is then to critically evaluate the arguments presented in support of the claim. An example prompt is shown in Figure 1.

Set 2
The second dataset is used for evaluating the generalization of the text-topicality models to a different type of essay. Essays in this dataset are written in a more open-ended "support your position on an argument" genre, where the prompt is typically a single sentence that puts forward a general claim, such as "As people rely more and more on technology to solve problems, the ability of humans to think for themselves will surely deteriorate." This task is administered on the same test as the one discussed above, and the general properties, such as scale and distribution of scores, are similar.

Figure 1: An example prompt for the "criticize an argument" task (Set 1): "In surveys Mason City residents rank water sports (swimming, boating and fishing) among their favorite recreational activities. The Mason River flowing through the city is rarely used for these pursuits, however, and the city park department devotes little of its budget to maintaining riverside recreational facilities. For years there have been complaints from residents about the quality of the river's water and the river's smell. In response, the state has recently announced plans to clean up Mason River. Use of the river for water sports is therefore sure to increase. The city government should for that reason devote more money in this year's budget to riverside recreational facilities. Write a response in which you examine the stated and/or unstated assumptions of the argument. Be sure to explain how the argument depends on the assumptions and what the implications are if the assumptions prove unwarranted."

TOEFL Set
The third dataset is used to assess generalization of the findings regarding text-topicality models to shorter essays written by non-native speakers of English, a generally less English-proficient population than writers in Sets 1 and 2. This dataset is publicly available from the Linguistic Data Consortium (Blanchard et al., 2013).² This set contains 12,100 essays written for the Test of English as a Foreign Language (TOEFL), responding to the question "Do you agree or disagree with the following statement?", a genre similar to that of Set 2. The essays in this set were written in response to 8 different prompts, such as: "A teacher's ability to relate well with students is more important than excellent knowledge of the subject being taught." Only coarse-grained scores are provided, corresponding to low, medium, and high proficiency, which we represent as scores 1, 2, and 3, respectively. The data was partitioned so that 500 essays per prompt are used in Dev to estimate the topical lists; the remaining essays are split 75% (Train) and 25% (Test) within each prompt.

² LDC Catalogue No: LDC2014T06
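The per-prompt partitioning described for the TOEFL set can be sketched as follows. This is a minimal illustration, not the authors' procedure: the text does not say how essays were assigned to each partition, so random shuffling with a fixed seed is an assumption, and the function and parameter names are hypothetical.

```python
import random
from collections import defaultdict

def partition_by_prompt(essays, dev_per_prompt=500, train_frac=0.75, seed=0):
    """Split (essay_id, prompt_id) pairs into Dev/Train/Test within each prompt.

    Assumption: essays are assigned at random; the source only gives the sizes.
    """
    by_prompt = defaultdict(list)
    for essay_id, prompt_id in essays:
        by_prompt[prompt_id].append(essay_id)

    rng = random.Random(seed)
    parts = {"dev": [], "train": [], "test": []}
    for prompt_id in sorted(by_prompt):
        ids = list(by_prompt[prompt_id])
        rng.shuffle(ids)
        dev, rest = ids[:dev_per_prompt], ids[dev_per_prompt:]
        n_train = round(train_frac * len(rest))  # 75% of the remainder
        parts["dev"].extend(dev)
        parts["train"].extend(rest[:n_train])
        parts["test"].extend(rest[n_train:])
    return parts
```

Splitting within each prompt, rather than over the pooled data, keeps every prompt represented in all three partitions, which matters because the topical lists are prompt-specific.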

Models of Word Topicality
Let $T_1, \ldots, T_m$ be sets of essays responding to $m$ different prompts $t_1, \ldots, t_m$. For a word $w$ and a prompt $t_k$, we define the following contingency table of counts, where $\neg w$ corresponds to any content word other than $w$ and $\neg T_k$ corresponds to $\bigcup_{r \neq k} T_r$:

We define the following word topicality models. The first model, LH, due to Lin and Hovy (2000), quantifies the topicality of a word in a topic as a reduction in the entropy of topic distribution achieved by partitioning on the word $w$, scaled so that the resulting value is distributed according to $\chi^2$. Note that to avoid division by zero when $1 - p_1 = 0$, we only consider words with $A_{12} > 0$.
(1)

where the proportions $p$, $p_1$, and $p_2$ are given by:

(2)

From this definition, we derive three word topicality weights: the first uses the continuous values of topicality mapped to the [0, 1] range; the second is binarized to separate out only words that reach the 0.001 significance level; the third is a binarized model with a more permissive threshold of 0.05 significance, which creates larger but noisier sets of topical vocabulary.³

The second model, HBA, due to Higgins et al. (2006), quantifies the topicality of a word as the geometric mean of its probability of occurrence in the topic and the complement of its probability of occurrence overall. Thus, the more topical words tend to occur more frequently in the current topic and more rarely in general (this reasoning is similar to tf-idf). According to this model, the weight of a word in a topic is defined as follows:⁴

$$\alpha_{HBA} = \sqrt{P(w \mid T_k)\,\bigl(1 - P(w)\bigr)} \qquad (7)$$

Lastly, we define a simple (S) cutoff-based binary index, where the word is topical if it is likelier to occur in the current topic than overall:

$$\alpha_{S} = \begin{cases} 1 & \text{if } P(w \mid T_k) > P(w) \\ 0 & \text{otherwise} \end{cases}$$
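The three word topicality models can be sketched in code under stated assumptions. The layout of the count table is assumed to follow Lin and Hovy (2000): `a11` is the count of $w$ in $T_k$, `a12` the count of $w$ in $\neg T_k$, `a21` the count of other content words in $T_k$, and `a22` the count of other content words in $\neg T_k$. The LH statistic is written as the standard log-likelihood ratio ($-2\log\lambda$) with $p = P(T_k)$, $p_1 = P(T_k \mid w)$ and $p_2 = P(T_k \mid \neg w)$; this is an assumption about the form used here, and the mapping of the continuous value to [0, 1] is not specified in the text, so it is omitted. All function names are hypothetical.

```python
import math

# Chi-square critical values for 1 degree of freedom.
CHI2_CRIT_001 = 10.828  # significance level 0.001
CHI2_CRIT_05 = 3.841    # significance level 0.05

def _k_log_p(k, p):
    """k * log(p), with the convention 0 * log(0) = 0."""
    return 0.0 if k == 0 else k * math.log(p)

def _log_lik(p, k, n):
    """Binomial log-likelihood (constant term dropped) of k successes in n trials."""
    return _k_log_p(k, p) + _k_log_p(n - k, 1.0 - p)

def lh_statistic(a11, a12, a21, a22):
    """Log-likelihood ratio statistic (assumed form); 0 when w is not over-represented."""
    n1, n2 = a11 + a12, a21 + a22
    if a12 == 0 or n2 == 0:
        return 0.0  # the text excludes words with A12 = 0; treating them as 0 is an assumption
    p = (a11 + a21) / (n1 + n2)
    p1, p2 = a11 / n1, a21 / n2
    if p2 >= p1:
        return 0.0  # per the text: value set to 0 if p2 >= p1
    return 2.0 * (_log_lik(p1, a11, n1) + _log_lik(p2, a21, n2)
                  - _log_lik(p, a11, n1) - _log_lik(p, a21, n2))

def lh_binary(a11, a12, a21, a22, critical_value=CHI2_CRIT_001):
    """Binarized LH weight: 1 if the statistic reaches the chosen critical value."""
    return 1 if lh_statistic(a11, a12, a21, a22) >= critical_value else 0

def hba_weight(a11, a12, a21, a22):
    """Geometric mean of P(w | T_k) and 1 - P(w)."""
    p_w_topic = a11 / (a11 + a21)
    p_w = (a11 + a12) / (a11 + a12 + a21 + a22)
    return math.sqrt(p_w_topic * (1.0 - p_w))

def simple_weight(a11, a12, a21, a22):
    """1 if w is likelier in the current topic than overall, else 0."""
    p_w_topic = a11 / (a11 + a21)
    p_w = (a11 + a12) / (a11 + a12 + a21 + a22)
    return 1 if p_w_topic > p_w else 0
```

For example, a word seen 30 times in a 1,000-token topic and 10 times in 9,000 tokens of other topics would be topical under all three binary criteria in this sketch, while a word seen once in the topic and 39 times elsewhere would not.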

Models of Text Topicality
For an essay $e$, let $Y$ be the set of all content word⁵ types in $e$ and let $O$ be the set of all content word tokens.⁶ Further, let $\alpha_w$ be the topicality value of the word $w \in e$. We then define text topicality as the proportion of topical words (for binary word topicality indices) or mean topicality per word (for continuous word topicality indices), for types and tokens, as follows:

$$\mathrm{TypeTop}(e) = \frac{1}{|Y|} \sum_{w \in Y} \alpha_w \qquad \mathrm{TokTop}(e) = \frac{1}{|O|} \sum_{w \in O} \alpha_w$$

³ For all indices, we set the value to 0 if $p_2 \geq p_1$, even though a reduction in entropy due to a partition on $w$ could occur when the topic is substantially less likely given that $w$ occurred.
⁴ In Higgins et al. (2006), the probability of occurrence in general is estimated from a different dataset than that used to estimate prompt-specific probabilities. However, presumably, the general dataset would contain some number of essays responding to the current prompt, so we believe our approximation is faithful to the spirit of the original.
⁵ We assume function words are irrelevant for topicality.
⁶ tYpes vs tOkens

We observe that all the word topicality models defined above essentially produce a final list of topical words, based on some estimation set. This introduces a dependence between text length and its topicality, especially under the type-based definition: the longer the text, the less likely it is that the next new word would be topical, simply because there are only so many topical words, and their supply diminishes with every newly chosen word, whereas the (theoretical) supply of non-topical words is infinite. This reasoning suggests that the longer the text, the less topical it would be, on average. Note that we are not implying that longer texts digress more; it is just that the modeling of topicality that is based on a finite list of topical words is inherently biased against longer essays. Figure 2 shows that this is indeed the case, using a separate set of 3,000 essays responding to the same task as Set 1, using the $\alpha_S$ word topicality index and type-based aggregation.
The different series correspond to sets of essays within a certain length band, with color codes ranging from the lightest blue for the shortest essays (more than 2 standard deviations below mean length) to the lightest orange for the longest ones (more than 1.5 standard deviations above mean length). It is clearly the case that longer essays tend to be less topical, as moving from blue to red to orange generally aligns with moving down the topicality axis. Thus, given that essay length is typically strongly positively correlated with essay scores, we would expect topicality to be negatively correlated with score. However, separating essays by length bands reveals that the relationship between topicality and score is in fact positive: when length is held approximately constant, better essays tend to be more topical.

Figure 2: Illustration of the relationship between essay score, essay length, and topicality, using the $\alpha_S$ index in type aggregation, using an additional sample of 3,000 essays responding to the task in Set 1. The series correspond to length bands, with the lightest blue line showing mean topicality per score level for essays that are more than 2 standard deviations below mean essay length.

These observations suggest that the estimated topicality of the essay needs to be scaled to compensate for "baseline" topicality differences that are due to length. We therefore define a length-scaled version of the two indices as follows:

$$\mathrm{TypeTopL}(e) = \frac{\log(|Y|)}{|Y|} \sum_{w \in Y} \alpha_w \qquad \mathrm{TokTopL}(e) = \frac{\log(|O|)}{|O|} \sum_{w \in O} \alpha_w \qquad (12)$$
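The four text-level aggregations (types or tokens, with or without length scaling) can be sketched in one function. This is an illustrative sketch rather than the authors' implementation: the base of the logarithm is not stated in the text, so the natural logarithm is assumed, words missing from the topicality table are assumed to have value 0, and the function name and parameters are hypothetical.

```python
import math

def text_topicality(content_tokens, alpha, use_types=True, length_scaled=True):
    """Mean word topicality over types or tokens, optionally scaled by log of their count.

    content_tokens: list of content words in the essay (function words already removed).
    alpha: dict mapping a word to its topicality value for the essay's prompt.
    """
    words = set(content_tokens) if use_types else list(content_tokens)
    n = len(words)
    if n == 0:
        return 0.0
    mean_topicality = sum(alpha.get(w, 0.0) for w in words) / n
    if length_scaled:
        # Natural log is an assumption; it rewards longer texts that stay equally topical.
        return math.log(n) * mean_topicality
    return mean_topicality
```

With a binary `alpha`, the unscaled variants reduce to the proportion of topical types or tokens; the scaled variants multiply that proportion by the log of the number of types or tokens, which offsets the tendency of longer essays to exhaust the finite topical list.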

Selecting Best Topicality Model(s)
We evaluate each of the 5 word-topicality models ($\alpha$) with each of the 4 text-aggregation methods (types/tokens, scaled/unscaled), 20 models in total, for their ability to predict essay score above and beyond the prediction based on essay length. Essay length is a well-known confounder for essay scoring systems (Page, 1966): it is a strong predictor of essay score (r = 0.65 for Set 1); yet, an automated essay scoring system needs to capture additional aspects of the essay quality construct beyond the basic English production fluency captured by essay length. Our measure of success is therefore the partial correlation $r_p$ between the feature and the human-provided essay score, excluding the effect of essay length. Table 3 shows the results. We make the following observations based on these results.
First, the relative merits of various topicality models generalize very well across the three sets. We calculated rank-order (Spearman) correlations between the 20 partial correlations for the various models on the three pairs of datasets. Thus, the rank order correlation between column $r_p$ for Set 1 and column $r_p$ for Set 2 in Table 3 is ρ = 0.92; Set 1 vs TOEFL ρ = 0.93; Set 2 vs TOEFL ρ = 0.98.
Second, it is clearly the case that the text topicality indices based on continuous word topicality indices (LH1, HBA) are less effective, their partial correlations with score excluding length being within a 0.15 band around zero (lines 1-8 in Table 3). Although some overall correlations with score are reasonable (such as 0.262 in line 4, Set 1; 0.235 in line 2, Set 1), these are mostly accounted for by the even higher correlation with essay length. This suggests that accounting for the nuances of the extent of the topicality of each word is generally not effective: once the word is topical enough, it does not matter exactly how topical it is. Or, at the very least, we have not yet found a way to devise an effective continuous topicality score for a word.
Let us now consider the more effective cutoff-based binary indices (LH2, LH3, SIMPLE), and evaluate the effects of the two manipulations applied across the word topicality models: the log scaling and the use of types vs tokens.
Log Scaling: This manipulation is effective in every single case (compare odd lines n to even lines n+1 for n > 8, for each of the datasets, for a total of 18 comparisons).
Type vs Token: Types are better than tokens in every single case (compare lines n to lines n+2 within each word topicality model, for n > 8, for each of the datasets, for a total of 18 comparisons).

Table 3: Performance of the different word-topicality models ($\alpha$), with or without log length scaling, in type or token aggregation, on the three datasets, in terms of Pearson correlation with essay score ($R_s$), Pearson correlation with essay length ($R_l$), and partial correlation with score controlling for length ($r_p$). Evaluations are performed on Train data in each dataset.

Finally, we observe that among the cutoff models, the more permissive, the better: the model with a stricter significance threshold for topicality performs worse than the one with a looser threshold, which in turn performs worse than a simple cutoff model with no significance test at all (compare line n to line n+4, for n > 8, in Table 3). This suggests that richer but noisier topical lists are generally more effective in the essay scoring context.
Following these observations, we select the type-aggregated log-scaled simple topicality index for evaluation within an essay scoring system for the three datasets.
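The two statistics used in this section can be sketched as follows: the first-order partial correlation of a feature with score controlling for length, and the Spearman rank-order correlation used to compare model rankings across datasets. The partial correlation is written with the standard formula $r_p = (r_{fs} - r_{fl} r_{sl}) / \sqrt{(1 - r_{fl}^2)(1 - r_{sl}^2)}$; the text does not say how it was computed, so this is an assumption, and all function names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(feature, score, length):
    """Correlation of feature with score after removing the linear effect of length."""
    r_fs = pearson(feature, score)
    r_fl = pearson(feature, length)
    r_sl = pearson(score, length)
    return (r_fs - r_fl * r_sl) / math.sqrt((1 - r_fl ** 2) * (1 - r_sl ** 2))

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

In this setting, `partial_corr` would be applied once per model and dataset to fill the $r_p$ column, and `spearman` would then be applied to two such columns of 20 values to quantify how well the ranking of models carries over from one dataset to another.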

Essay Scoring Experiments
In this section, we present an evaluation of the best topicality index for each of the three datasets as a feature in a comprehensive, state-of-the-art essay scoring system. The baseline engine (e-rater®, described in Burstein et al. (2013)) computes more than 100 micro-features, which are aggregated into macro-features aligned with specific aspects of the writing construct. The system incorporates macro-features measuring grammar, usage, mechanics, organization, development, etc.; Table 4 shows the nine macro-features, with examples of micro-features. In addition, we put essay length (number of words) as the 10th macro-feature into the baseline model, to ensure that any gains observed in the experimental condition are not due to the introduction of length as part of the scaling in the topicality feature.
In the baseline condition, a scoring model is built over the ten macro-features using linear regression on the Train set and evaluated on the Test set, for each of the datasets. In the experimental condition, the topicality index is added as the 11th macro-feature into the linear regression model; the experimental system is also trained on the Train set and evaluated on the Test set, for each of the datasets. We evaluate essay scoring performance using Pearson correlation with human holistic score.
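The comparison between the baseline and experimental conditions can be sketched with a small ordinary least squares implementation. This is a toy illustration under stated assumptions: the actual macro-features of the scoring engine are not available here, so the feature matrices are placeholders supplied by the caller, the regression is solved through the normal equations, and all function names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fit_linear(X, y):
    """Ordinary least squares with intercept, solving (X'X) b = X'y by Gauss-Jordan."""
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(k)]
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

def predict(beta, X):
    """Predicted scores for feature rows X given coefficients beta (intercept first)."""
    return [beta[0] + sum(b * v for b, v in zip(beta[1:], row)) for row in X]

def add_feature(X, feature):
    """Append one feature column (e.g. the topicality index) to a feature matrix."""
    return [list(row) + [f] for row, f in zip(X, feature)]

def evaluate(train_X, train_y, test_X, test_y):
    """Train on Train, then report Pearson correlation with human scores on Test."""
    beta = fit_linear(train_X, train_y)
    return pearson(predict(beta, test_X), test_y)
```

A baseline run would call `evaluate` on the ten macro-feature columns, and the experimental run would call it on `add_feature(...)` applied to both the Train and Test matrices, so that the only difference between the two conditions is the added topicality column.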
To test statistical significance of the improvements, we use the Wilcoxon signed-rank test for matched pairs. We calculate the baseline and experimental performance on each prompt separately, and use the 76 pairs of values (for each of Sets 1 and 2) and 8 pairs of values (for TOEFL) as inputs for the test. We use VassarStats for performing the significance tests. Table 5 shows the results. We find that the addition of the topicality feature leads to a statistically significant improvement over the baseline for each of the three datasets. In an additional set of experiments, we removed essay length from both the baseline and the experimental conditions to check whether the topicality feature would improve upon a state-of-the-art essay scoring system as-is; we found an improvement in all three datasets, at the same significance levels as those reported in Table 5.
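The per-prompt significance test can be sketched as a signed-rank computation over paired correlations. This is a simplified sketch, not necessarily identical to the tool used above: it computes the sum of signed ranks and a two-sided p-value from the normal approximation without continuity or tie corrections, which is reasonable for 76 pairs but crude for 8 pairs, where exact critical values would normally be consulted. The function name and return format are hypothetical.

```python
import math

def wilcoxon_signed_rank(baseline, experimental):
    """Return (n, W, z, p) for matched pairs, using the normal approximation.

    n: number of non-zero differences; W: sum of signed ranks;
    z: W divided by its standard deviation under the null; p: two-sided p-value.
    """
    diffs = [e - b for b, e in zip(baseline, experimental) if e != b]
    n = len(diffs)
    if n == 0:
        return 0, 0.0, 0.0, 1.0
    # Rank absolute differences, giving tied values the average of their ranks.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    w = sum(r if d > 0 else -r for r, d in zip(ranks, diffs))
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 6)
    z = w / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return n, w, z, p
```

Here each element of `baseline` and `experimental` would be the Pearson correlation obtained on one prompt, so the test asks whether the improvement is consistent across prompts rather than driven by a few of them.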

Related Work
The two approaches that are most closely related to the current work are those of Higgins et al. (2006) and Lin and Hovy (2000), who present word topicality models based on a comparison between the distribution of words in on-topic and off-topic texts. Indeed, these models were the starting point of our work, along with a simpler comparison model based on raw frequencies. Higgins et al. (2006) aggregated the word-level scores using unscaled token-level aggregation; our results suggest that this aggregation method can be improved upon by log-length scaling and type-based aggregation. We also showed that the Lin and Hovy (2000) topicality models produce better predictions of essay quality, with appropriate scaling and aggregation. Louis and Higgins (2010) and Higgins et al. (2006) address the task of detecting off-topic essays without on-topic training materials. Persing and Ng (2014) reported a study where essays were scored on an analytic rubric of adherence to the prompt; while this is a promising way to evaluate text-topicality models intrinsically, the reliability of the annotations was low (r=0.234). Content scoring was also studied for essays written in response to an extensive reading or listening prompt; quality of content is then related to integrating information from the source materials (Beigman Klebanov et al., 2014; Kakkonen et al., 2005; Lemaire and Dessus, 2001).
A related direction of research implicitly treats topicality as a part of a more generalized notion of "good content," namely, words that are used by good writers. The approach to estimating the quality of content is to compare the content of the current text to sets of training texts that represent various score points (Attali and Burstein, 2006; Kakkonen et al., 2005; Xie et al., 2012). In this approach, there is no differentiation between content that is topical and other words that might be used for other reasons, such as discourse markers used for organizational purposes or spurious, shell-like elements (Madnani et al., 2012); an essay that is dissimilar from high-scoring essays on all or some of these accounts is likely to be viewed as having "bad content." An essay rife with misspellings would likewise be seen as having "bad content", because the model high-scoring essays are generally not prone to misspellings. In contrast, our topicality lists are estimated based on a random sample of essays, including low-scoring essays; this allows introduction of common misspellings of words frequently used to address the given topic into the topical lists. For example, one of the topical lists includes more than a dozen misspellings of the word contemporaries.
There is a large body of work using topic models to capture different topics typically addressed in a corpus of text (Mimno et al., 2011; Newman et al., 2011; Gruber et al., 2007; Blei et al., 2003). In this general framework, each text can address a few different topics and the number and identity of topics for the corpus is typically unknown. In our setting, we assume that each essay is on a single topic, and that topic is known in advance. However, many of these topics are very open-ended, so they might exhibit non-trivial sub-topical structures. For example, a topic about cultural role models might be dealt with by discussing politicians, musicians, sportsmen; each of these could yield a specific sub-topic. In fact, Persing and Ng (2014) used LDA to create subtopics in this way, and derived features to predict prompt-adherence of an essay. The authors found that in order to make these features more effective, it was beneficial for humans to go over the topics and assign relevance estimates for each sub-topic.

Conclusion
In this paper, we addressed the problem of quantifying the overall extent to which a test-taker's essay deals with the topic it is assigned (prompt). We experimented with a number of approaches for quantifying the topicality of a word, and with a number of approaches for aggregating word-level topicality into text-level topicality. We found that type-based, log-length scaled aggregation generally works better than the token-based and unscaled one, for the task of predicting the holistic quality of essays. The findings of the effectiveness of log length scaling and of type-based accounting when estimating the topicality of an essay for the purposes of holistic scoring are novel contributions of this work.
We also showed that incorporation of text-topicality into essay scoring yields a significant improvement for two different writing tasks over a very strong baseline, a state-of-the-art essay scoring system augmented with an essay length feature. A significant improvement is also observed on the publicly available set of TOEFL essays, even though the set is smaller, there are only a handful of different prompts, the essays are shorter and less proficiently written, and the scores are given on a coarser-grained scale than for the other two datasets. The demonstration of the excellent generalization of the relative merits of the various topicality models across three datasets and the effectiveness of the topicality feature for improving essay scoring on the three sets is another novel contribution of this work; it suggests robustness of our findings regarding the relationship between topicality, length and quality of essays.
An interesting direction of future work is an intrinsic evaluation of topicality indices against human judgments of topicality. This is a difficult annotation task (Persing and Ng, 2014), and, to our knowledge, no reliable protocol exists for this task.