Automated Evaluation of Writing – 50 Years and Counting

In this theme paper, we focus on Automated Writing Evaluation (AWE), using Ellis Page’s seminal 1966 paper to frame the presentation. We discuss some of the current frontiers in the field and offer some thoughts on the emergent uses of this technology.


A Minimal Case for AWE
In a seminal paper on the imminence of automated grading of essays, Page (1966) showed that a high correlation between holistic machine and human scores is possible. He demonstrated automated scoring of 276 essays written by high school students using a system with 32 features, achieving a multiple R = 0.65 between the machine score and the average human score, after adjustment. He also provided a thoughtful discussion of his ambitions for automated scoring and of possible objections.
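To make the setup concrete, here is a minimal sketch, in Python, of a Page-style scoring model: shallow, automatically extractable features are regressed onto averaged human scores, and the multiple correlation R between predicted and observed scores is reported. The four features are illustrative stand-ins, not Page's actual 32 proxes.

```python
# A minimal sketch of Page-style essay scoring, assuming a list of essay
# strings and averaged human ratings. The features are illustrative proxes.
import numpy as np
from sklearn.linear_model import LinearRegression

def prox_features(essay):
    words = essay.split()
    sentences = [s for s in essay.replace("!", ".").replace("?", ".").split(".")
                 if s.strip()]
    return [
        len(words),                           # essay length
        np.mean([len(w) for w in words]),     # average word length
        len(words) / max(len(sentences), 1),  # average sentence length
        essay.count(","),                     # comma count
    ]

def fit_and_evaluate(essays, human_scores):
    X = np.array([prox_features(e) for e in essays])
    y = np.array(human_scores)
    model = LinearRegression().fit(X, y)
    # Multiple R: correlation between model predictions and human scores.
    multiple_r = np.corrcoef(model.predict(X), y)[0, 1]
    return model, multiple_r
```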
Page made the case that automated evaluation of student writing is needed to take some of the evaluation load off teachers and to provide students with fast-turnaround evaluations of their (potentially multiple) drafts. He then appealed to the then-burgeoning interest in, and fascination with, machine learning to argue for the feasibility of such an enterprise, namely, that machines can learn to assign the right grades to essays if trained on an expert-scored sample.
As part of the feasibility argument, Page emphasized the need to carefully define the goal so that success can be judged appropriately. The goal is not a "real" master analysis of the essay the way a human reader would perform it, but merely an imitation that would produce a correlated result, using what Page called proxes (approximations). Page considered this goal to be both useful and achievable.
Report Card: Where Are We Now?

Accomplishments
Page's minimal desiderata have certainly been achieved: AWE systems today can score in agreement with the average human rater, at least in some contexts. For example, Pearson's Intelligent Essay Assessor™ (IEA) scores essays written for the Pearson Test of English (PTE) as well as for other contexts: "IEA was developed more than a decade ago and has been used to evaluate millions of essays, from scoring student writing at elementary, secondary and university level, to assessing military leadership skills." Besides serving as the sole score, as for PTE, there are contexts where the automated score is used in addition to a human score, such as for essays written for the Graduate Record Examination (GRE®) or for the Test of English as a Foreign Language (TOEFL®). Does this mean that the problem of AWE is solved? Well, not exactly.
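Before turning to what still needs improvement, a brief note on how such human-machine agreement is commonly quantified: alongside correlation, the field typically reports quadratically weighted kappa (QWK). A minimal check on hypothetical scores:

```python
# Agreement between machine and human essay scores, on hypothetical data.
import numpy as np
from sklearn.metrics import cohen_kappa_score

human = np.array([3, 4, 2, 5, 3, 4])    # hypothetical integer essay scores
machine = np.array([3, 4, 3, 5, 2, 4])

pearson_r = np.corrcoef(human, machine)[0, 1]
qwk = cohen_kappa_score(human, machine, weights="quadratic")
print(f"r = {pearson_r:.2f}, QWK = {qwk:.2f}")
```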

Needs Improvement
Page did anticipate some difficulties for AWE systems. It is instructive to see where we are with those.

Originality
What about the gifted student who is offbeat and original? Won't he be overlooked by the computer? (Page, 1966)

Page's argument is that the original student is not going to be much worse off with a computer than with an (average) human reader, because originality is a subjective construct. Thus, once research uncovers objective and measurable aspects of "original" writing, the relevant features can be added to an AWE system; finding such aspects, as well as measuring them, is still work in progress. While no current operational scoring system we are aware of specifically looks for originality, research into aspects of writing that are often considered original is taking place. For example, using data from different tests, Beigman Klebanov and Flor (2013a) and Beigman Klebanov et al. (2018) found that the extent of metaphor use (the proportion of metaphorically used words in an essay) correlates with essay quality; Littlemore et al. (2014) likewise found that more skilled writers use metaphor more often. Song et al. (2016) observed a positive correlation between essay quality and the use of parallelism (syntactically similar and semantically related constructions, often used for emphasis or to enhance memorability) in student essays. Some pioneering work has been done on comparing writing that is recognized as outstanding (through receiving prestigious prizes) vs. writing that is "merely" good in the domain of scientific journalism (Louis and Nenkova, 2013). Once various indicators of originality can be successfully measured, additional work may be necessary to incorporate these measurements into scoring ecosystems, since such indicators may occur only infrequently. One way to achieve this would be to compute a "macro" feature that aggregates multiple such indicators (a sketch follows below); another would be to direct such essays to a human rater for review.
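To illustrate the "macro" feature idea, the sketch below z-scores individually sparse indicators (here, hypothetical per-essay rates of metaphor use and parallelism) against the essay pool and averages them into a single, denser feature:

```python
# Aggregating sparse originality indicators into one "macro" feature.
# The indicator values are hypothetical, not output of any real detector.
import numpy as np

def macro_originality(indicator_matrix):
    """indicator_matrix: essays x indicators (e.g., metaphor rate, parallelism rate)."""
    z = (indicator_matrix - indicator_matrix.mean(axis=0)) / indicator_matrix.std(axis=0)
    return z.mean(axis=1)  # one macro originality score per essay

# Hypothetical metaphor and parallelism rates for four essays:
rates = np.array([[0.08, 0.02],
                  [0.03, 0.00],
                  [0.12, 0.05],
                  [0.05, 0.01]])
print(macro_originality(rates))
```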

Gaming
Won't this grading system be easy to con? Can't the shrewd student just put in the proxies which will get a good grade? (Page, 1966)

Certainly, students can and do employ gaming strategies to discover and exploit weaknesses of AWE systems. Such strategies include repeating the same paragraphs over and over, varying sentence structure, replacing words with more sophisticated variants, re-using words from the prompt, using general academic words, plagiarizing from other responses or from material found on the Internet, inserting unnecessary shell language (linguistic scaffolding for organizing claims and arguments), and automated generation of essays (Powers et al., 2001; Bejar et al., 2013, 2014; Higgins and Heilman, 2014; Sobel et al., 2014). Such strategies are generally handled by building in filters or flags for aberrant responses (Higgins et al., 2006; Yoon et al., 2018). However, developers of AWE systems can never anticipate all possible strategies and may have to react quickly as new ones are discovered in use, by developing new AWE methods to identify them. This cat-and-mouse game is particularly rampant in the context of standardized testing (§3.2). This is one of the reasons standardized tests are often not scored solely by an AWE system but also by a human rater.
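As a sketch of what such filters can look like, the function below flags two of the simpler gaming patterns listed above; the thresholds are illustrative assumptions, not values from any deployed system:

```python
# A pre-scoring filter that flags two simple gaming patterns:
# paragraph repetition and excessive re-use of prompt vocabulary.
from collections import Counter

def flag_aberrant(response, prompt,
                  max_dup_para_ratio=0.3,   # illustrative threshold
                  max_prompt_overlap=0.6):  # illustrative threshold
    flags = []
    paragraphs = [p.strip().lower() for p in response.split("\n\n") if p.strip()]
    if paragraphs:
        dup_ratio = 1 - len(Counter(paragraphs)) / len(paragraphs)
        if dup_ratio > max_dup_para_ratio:
            flags.append("repeated-paragraphs")
    resp_words = set(response.lower().split())
    prompt_words = set(prompt.lower().split())
    if resp_words:
        overlap = len(resp_words & prompt_words) / len(resp_words)
        if overlap > max_prompt_overlap:
            flags.append("excessive-prompt-reuse")
    return flags  # a non-empty list would route the response to a human rater
```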

Content
We are talking awfully casually about grading subject matter like history. Isn't this a wholly different sort of problem? Aren't we supposed to see that what the students are saying makes sense, above and beyond their using commas in the right places? (Page, 1966)

Indeed, work has been done over the last decade on automated evaluation of written responses for their content rather than their general writing quality (Sukkarieh and Bolge, 2008; Mohler et al., 2011; Ziai et al., 2012; Basu et al., 2013; Ramachandran et al., 2015; Burrows et al., 2015; Sakaguchi et al., 2015; Madnani et al., 2016; Padó, 2016; Madnani et al., 2017a; Riordan et al., 2017; Kumar et al., 2017; Horbach et al., 2018; Riordan et al., 2019). Scoring for content focuses primarily on what students know, have learned, or can do in a specific subject area such as Computer Science, Biology, or Music, with the fluency of the response being secondary. For example, some spelling or grammar errors are acceptable as long as the desired specific information (e.g., scientific principles, trends in a graph, or details from a reading passage) is included in the response. Note that most current content scoring systems ascertain the "correctness" of a response based on its similarity to other responses that humans have deemed correct or, at least, high-scoring; they do not employ explicit fact-checking or reasoning for this purpose.
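A minimal sketch of such similarity-based content scoring follows, in which a new response inherits the average score of its most similar previously scored peers; the TF-IDF representation and the choice of k are illustrative, not any particular system's design:

```python
# Similarity-based content scoring: score a new response by its nearest
# previously human-scored neighbors in TF-IDF space.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def score_by_similarity(new_response, scored_responses, human_scores, k=5):
    vectorizer = TfidfVectorizer()
    ref_vectors = vectorizer.fit_transform(scored_responses)
    new_vector = vectorizer.transform([new_response])
    sims = cosine_similarity(new_vector, ref_vectors).ravel()
    top_k = np.argsort(sims)[-k:]  # indices of the k most similar responses
    return float(np.mean([human_scores[i] for i in top_k]))
```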
Concerns about specific content extend to other cases where the scoring system needs to pay attention to details of genre and task: not all essays are five-paragraph persuasive essays. The specific task might require assessing whether the student has appropriately used specific source materials (Beigman Klebanov et al., 2014; Rahimi et al., 2017; Zhang and Litman, 2018), or assessing narrative (Somasundaran et al., 2018) or reflective (Beigman Klebanov et al., 2016a; Luo and Litman, 2016) writing rather than persuasive writing.

Feedback
Page emphasized the importance of feedback, and considered a number of specific types of comments to be "the sort of feedback that can almost be programmed right now" (original italics). Current systems indeed offer a range of automated feedback, including stylistic rewriting suggestions (e.g., Rao and Tetreault (2018)), thesis relevance (Writing Pal), sentence "spicing" through suggestions of synonyms and idioms (Ginger's Sentence Rephraser), and style- and argumentation-related feedback (Revision Assistant).
Can we then put a green check-mark against Page's agenda for automated feedback, which "may magnify and disseminate the best human capacities to criticize, evaluate, and correct"? Alas, not yet; research on the effectiveness of automated feedback on writing is inconclusive (Englert et al., 2007; Shermis et al., 2008; Grimes and Warschauer, 2010; Choi, 2010; Roscoe and McNamara, 2013; Wilson and Czik, 2016; Wilson, 2017; Bai and Hu, 2017; Ranalli et al., 2017). One potential reason for the differing outcomes is differences in user populations: feedback that works for L1 writers might not work for L2 writers, and differences in age, skill level, and the presence or absence of learning disabilities could all play a role. Adjusting the evaluation methodology to the specific purpose of the writing assistance tool is another issue for consideration; we will return to it in §4.

Going off the Page
So far, Page's outline of the promises and challenges of AWE has provided a good framework for surveying the field. There are also a number of developments that were not mapped on Page's chart; we turn to reviewing those next.

Assessing writing in multiple languages
In order to advance the work on understanding and assessing writing quality, there is clearly a need for a multi-lingual perspective, since methods developed for one language or dialect may not work for another. This consideration does not appear in Page (1966), yet it is an active line of subsequent work. While most of the research we cited so far has been on English, various aspects of writing evaluation, e.g., annotation, detection of various types of errors, and building AWE systems, have been researched for a variety of languages: Song et al. (2016), Rao et al. (2017), and Shiue et al. (2017) worked with data in Chinese, Lorenzen et al. (2019) in Danish, Berggren et al. (2019) in Norwegian, Amorim and Veloso (2017) in Portuguese, Stymne et al. (2017) in Swedish, Berkling (2018) and Weiss and Meurers (2019) in German, Mezher and Omar (2016) in Arabic, Kakkonen et al. (2005) in Finnish, Loraksa and Peachavanish (2007) in Thai, Lemaire and Dessus (2001) in French, and Ishioka and Kameda (2006) in Japanese. The list is by no means exhaustive; see Flor and Cahill (2020) for a recent review.

Standardized Testing
The use of automated evaluation technology envisioned by Page was as a service to reduce a teacher's burden: to eventually "lift from the shoulders of the English teacher, that brave and harried soul, his perpetual pressure of unassigned papers, or his unassuaged guilt." While such use has certainly been made (Burstein et al., 2004; Grimes and Warschauer, 2010), the most visible use case for AWE technology has arguably evolved to be in the context of standardized testing, be it for a test of English such as TOEFL® or PTE, a broader, more advanced psychometric examination such as the GRE® or GMAT, or for professional licensure such as AICPA or PRAXIS®.
This development of often high-stakes usage has led to somewhat different challenges from those that Page had anticipated. These challenges generally fall under the purview of the field of educational measurement (Bennett and Bejar, 1998; Clauser et al., 2002; Williamson et al., 2012): How to ensure that the automatic scores assigned to test takers are (1) valid, i.e., they actually measure the skill that the test developer designed the test to measure; (2) defensible, i.e., there is a reasonably clear explanation of why test takers received the particular scores they did; and (3) fair to all test takers. We address each of these challenges separately below. Note that an additional challenge of high-stakes usage, not elaborated on here, is how to architect scoring systems for large-scale, low-latency use, which requires them to be reliable, scalable, flexible, and attentive to the choice of software and application frameworks.

Construct Validity
Page declares that he is not after "generating measures of what the true characteristics of the essays are, as ordinarily discussed by human raters" but rather is content "to settle for the correlates of these true characteristics." Page seems to do away rather quickly with trying to measure the actual thing, the set of all and only "true characteristics of essays", or trins. Why is that? He explains:

Notwithstanding the wonders of the computer, we have to develop a strategy in order to tell the computer what to do. The difficult part is the development of this strategy. It is difficult because we do not really understand what the psychological components are in the judgment of essays. It is easy enough to get persons to expound authoritatively on such judgment, but the fuzziness and inutility of their thinking becomes at once evident when the effort is made to translate it into a computer program. (Page, 1966)

Page's argument is that we do not know precisely enough what human raters are doing to try to implement it. Some work on rater cognition had already been done in the 1950s and 1960s, e.g., in the context of the College Entrance Examination Board's development of the General Composition Test. Diederich et al. (1961) had 53 distinguished individuals from various academic disciplines and beyond (English, Social Science, Natural Science, Law, Writers and Editors, Business Executives) sort student essays "in order of merit", with no definition thereof, instructing readers as follows:

Use your own judgment as to what constitutes "writing ability." Do not assume that we want you to do this or that. We want you to use whatever hunches, intuitions, or preferences you normally use in deciding that one piece of writing is better than another. You need not even act as a representative of your field, since individuals in any field have varying tastes and standards.
Readers were also asked to write brief comments on anything they liked or disliked about the essay, on as many essays as possible. For the study, a sample of U.S. college freshmen were asked to write essays in response to four topics as part of homework. A total of 300 essays addressing two topics were chosen for the analyses, sampled so as to ensure that the full range of abilities was represented (approximated via SAT Verbal scores). The researchers performed a factor analysis on the matrix of pairwise correlations among the readers and identified groups of readers (factors) representing five "schools of thought" about writing quality. Analyzing the comments made by readers belonging to the different "schools of thought", they identified five categories, each prioritized by one of the groups of readers.

It is on the basis of such findings that general scoring criteria emerged (Deane, 2013) and morphed into scoring rubrics: explicit criteria set by and for human raters for evaluating essays. For example, to score highly on the GRE® Issue essay-writing task, one typically:

• articulates a clear and insightful position on the issue in accordance with the assigned task
• develops the position fully with compelling reasons and/or persuasive examples
• sustains a well-focused, well-organized analysis, connecting ideas logically
• conveys ideas fluently and precisely, using effective vocabulary and sentence variety
• demonstrates superior facility with the conventions of standard written English (i.e., grammar, usage and mechanics), but may have minor errors

In the current practice of automated scoring of standardized tests, developers of a scoring engine often need to provide a construct validity argument in order to show that what the system is measuring is actually aligned with the "writing construct": the actual set of writing skills that the test is supposed to measure.
For other items in a rubric, one might need to drill down: articulate a reliable guideline for humans to assess that particular aspect of the essay, annotate a sufficiently large number of essays using the guidelines to make machine learning possible, and then find automatically measurable properties of essays that provide information relevant to that particular aspect of essay quality. This would be a mix between what Page called a prox and a trin, in that a particular, intrinsically interesting aspect of an essay can be identified reliably by humans, and an automated system can learn to approximate that particular construct. Such approaches have been developed for organization (well-organized) (Burstein et al., 2003), coherence (well-focused, conveys ideas fluently) (Burstein et al., 2010), grammaticality (facility with conventions) (Heilman et al., 2014), and thesis clarity (clarity) (Persing and Ng, 2013), as well as for aspects of scoring rubrics that are more task-specific, e.g., argumentation (clear position, with compelling reasons) (Stab and Gurevych, 2014; Ghosh et al., 2016; Beigman Klebanov et al., 2017; Stab and Gurevych, 2017; Carlile et al., 2018) and the use of evidence in the context of source-based writing (Rahimi et al., 2017).
Finally, for some rubric items it is not clear exactly how to reliably translate the relevant aspect of the writing construct into annotation guidelines, and so proxes might be employed. For example, consider Page's argument for capturing "diction" (appropriate word choice) through word frequency: a writer who can use many different words, including rarer and often semantically nuanced ones, is likelier to make precise word choices than a writer who uses a more limited vocabulary. Attempts to capture topicality (Beigman Klebanov et al., 2016b) or development (Beigman Klebanov and Flor, 2013b; Somasundaran et al., 2016) through properties of the vocabulary distribution, without human annotation of topicality and development, exemplify such approaches.
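A sketch of such a frequency-based diction prox follows; the tiny frequency table is a stand-in for the large reference corpora an operational system would use:

```python
# A frequency-based "diction" prox: essays drawing on rarer vocabulary
# receive a lower mean log frequency, suggesting more precise word choices.
import math

# Hypothetical frequencies per million words (stand-in values, not a real list).
CORPUS_FREQ = {"good": 12000.0, "beneficial": 150.0, "salutary": 2.0}

def mean_log_frequency(essay, default_freq=1.0):
    words = [w.strip(".,;!?").lower() for w in essay.split()]
    words = [w for w in words if w]
    # Lower values => rarer, potentially more sophisticated vocabulary.
    return sum(math.log(CORPUS_FREQ.get(w, default_freq))
               for w in words) / max(len(words), 1)
```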

Model Interpretability
Recent research has shown that more sophisticated machine learning models might perform better than simple regression-based models when it comes to predictive accuracy (Chen and He, 2013; Cummins et al., 2016; Taghipour and Ng, 2016; Alikaniotis et al., 2016; Dong et al., 2017; Dasgupta et al., 2018; Jin et al., 2018). However, unlike linear regression, where stakeholders can understand how much each feature used in the model contributed to the predicted score, many of the more complex models are essentially "black boxes" and do not really lend themselves to post-hoc interpretability (Lipton, 2016). Although interpretability is an active area of research in the machine learning literature (Ribeiro et al., 2016; Koh and Liang, 2017; Doshi-Velez and Kim, 2017), it currently lags behind the research on machine learning methods. For this reason, some automated scoring systems used for high-stakes standardized testing, like ETS's e-Rater (Attali and Burstein, 2006), still use some variant of least squares linear regression as the machine learning model to predict test taker scores.
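The attraction of a linear model in this setting is that each feature's contribution to a particular predicted score can be read off directly. A small sketch, with hypothetical feature names and training data rather than e-Rater's actual ones:

```python
# Interpreting a linear scoring model: per-feature contributions to one score.
import numpy as np
from sklearn.linear_model import LinearRegression

feature_names = ["grammar", "usage", "mechanics", "development"]  # hypothetical
X_train = np.array([[0.8, 0.7, 0.9, 0.6],
                    [0.4, 0.5, 0.6, 0.3],
                    [0.9, 0.8, 0.8, 0.9]])
y_train = np.array([4.0, 2.5, 5.0])  # hypothetical human scores

model = LinearRegression().fit(X_train, y_train)

x_new = np.array([0.7, 0.6, 0.8, 0.5])       # one test taker's features
contributions = model.coef_ * x_new          # each feature's share of the score
predicted = model.intercept_ + contributions.sum()
for name, c in zip(feature_names, contributions):
    print(f"{name}: {c:+.2f}")
print(f"predicted score: {predicted:.2f}")
```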

Increased Attention to Fairness
It would probably not be an overstatement to say that fairness in AI is quickly becoming its own sub-field, with a new annual ACM conference on Fairness, Accountability, and Transparency having been inaugurated in 2018 (https://facctconference.org/) and relevant research appearing at many impactful publication venues, such as Science (Caliskan et al., 2017), NIPS (Pleiss et al., 2017; Kim et al., 2018), ICML (Kearns et al., 2018), ACL (Hovy and Spruit, 2016; Sun et al., 2019; Sap et al., 2019), KDD (Speicher et al., 2018), AAAI (Zhang and Bareinboim, 2018), and others (Dwork et al., 2012; Hajian and Domingo-Ferrer, 2013). There is also recent work that examines fairness and ethical considerations when using AI in education (Mayfield et al., 2019; Gardner et al., 2019).
In the context of assessment, fairness considerations dictate that the test reflect the same construct(s) for the entire test-taking population; that scores from the test have the same meaning for the entire test-taking population; and that a fair test not offer undue advantages (or disadvantages) to some individuals because of their characteristics (such as those associated with race, ethnicity, gender, age, socioeconomic status, or linguistic or cultural background) or because of characteristics of the test itself, e.g., the different prompts shown to different test-takers at test time.
The educational measurement community has long been studying fairness in automated scoring (Williamson et al., 2012; Ramineni and Williamson, 2013; AERA, 2014), and recent progress made by the NLP community towards enhancing the usual accuracy-based evaluations with some of these psychometric analyses, from computing indicators of potential biases in automatic scores across various demographic sub-groups to computing new metrics that incorporate measurement theory to produce more reliable indicators of system performance, is quite promising (Madnani et al., 2017b; Loukina et al., 2019).
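One such indicator of potential subgroup bias is the standardized mean difference (SMD) between machine and human scores, computed per demographic subgroup; the sketch below uses a pooled-standard-deviation variant, stated as an illustrative assumption rather than any particular operational formula:

```python
# Per-subgroup standardized mean difference between machine and human scores.
# Values far from 0 for a particular subgroup signal potential bias.
import numpy as np

def smd(machine, human):
    # Pooled-SD standardized difference between machine and human means.
    pooled_sd = np.sqrt((machine.var(ddof=1) + human.var(ddof=1)) / 2)
    return (machine.mean() - human.mean()) / pooled_sd

def subgroup_smds(machine, human, groups):
    machine, human = np.asarray(machine), np.asarray(human)
    results = {}
    for g in set(groups):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        results[g] = smd(machine[idx], human[idx])
    return results
```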

Pervasiveness of Technology
Page's gedankenexperiment on the potential of automated essay evaluation in a classroom context no doubt appeared audacious in 1966, but nothing back then could have prepared his readers for the pervasiveness of technology we are experiencing today. Today you can quite literally carry your AWE system in your pocket; you can even carry several. You can use them (almost) at any time and in any place: not only in classrooms, but at home, at work, and even while texting with a friend. This is perhaps the biggest issue that Page's vision did not address: the possibility of universal availability and the concomitant co-optation of a tool beyond its original intended purpose. Much like the calculator (invented by Blaise Pascal to help his father with the tedious arithmetic of tax collection) ended up "freeing" people from the burden of figuring out their intended tip at a restaurant through mental arithmetic, a future writing aid meant to help a student improve an argument-writing assignment for a class could end up being used by a lawyer to compose a closing argument. Since such usages are on the horizon, we should consider the implications now.

Discussion
Once an invention is out in the open, it is difficult to predict what specific uses people would put it to. How do we go about evaluating the tool if we don't know what the user's goal is? While it isn't possible to anticipate all specific uses, it is possible, we believe, to consider the types of uses that suggest different evaluation strategies. From the current vantage point, we see three types of uses.

Support Consequential Decision Making
The first type of use is where a consequential decision about the writer or a related entity (such as a class or a school) is made based on the written product. This use is exemplified by the application of automated scoring in a standardized testing context to decide on admission to an institution of higher education or the granting of a professional license; other cases, such as course placement decisions, coursework grading, or even the extension of a job offer (where the submission of a writing sample is part of the job application process), would belong to this type of use. In all such cases, the automated system needs to provide valid and fair scores (or other types of feedback), since the livelihood or professional trajectory of people might depend on the outcome. We dealt with the particulars of this case in detail in §3.2.

Create a Better Written Product
The second type of use is one where the focus is on the final product, namely, the actual piece of writing produced following the writer's use of AWE technology. In this context, it does not much matter exactly which part of the final product is due to the human and which part is due to the machine: perhaps the machine only corrected misspellings, or suggested improvements for the human to vet, or maybe the human only contributed the very first ideation, and the machine did the rest. Perhaps all the human writer contributed was the thesis ('I think school should start at 8 rather than 7') before clicking 'submit' to get back an essay making a cogent and convincing case in support of that thesis. Mining large textual databases for arguments and evaluating them are feasible today, as recently demonstrated by IBM's Debater technology (https://www.research.ibm.com/artificial-intelligence/project-debater/) (Rinott et al., 2015; Levy et al., 2017; Gretz et al., 2019); introduce some figuration to make it more appealing (Veale et al., 2017; Veale, 2018) and storify it (Riegl and Veale, 2018; Radford et al., 2019), et voilà! This type of use is essentially a machine's augmentation of human ability, and is hinted at, for example, in a customer testimonial for Grammarly: "Grammarly allows me to get those communications out and feel confident that I'm putting my best foot forward. Grammarly is like a little superpower, especially when I need to be at 110%." The human presumably remains at the same level of ability, but the product of the machine-human collaboration is superior to what the human alone could have produced.
In this context, the primary evaluation criterion for AWE is the fitness of the resulting communication to its purpose, or, at least, some evidence of improvement of the product over the human's first draft. Indeed, measurements of improvement across drafts and evidence of students' making corrections following feedback are often used for evaluation (Attali, 2004;Lipnevich and Smith, 2008;Foltz et al., 2014;Chapelle et al., 2015).
Within the product-centered evaluation paradigm, there could be various specific objectives other than the improvement of the holistic quality of the piece of writing; it could be an increase in the speed of production, or the maximization of click-through rate in an advertisement text, for example.

Help the User Learn to Write Better
The third type of use for AWE software is to help the writer improve his or her writing skill. Scores or other types of feedback are designed, in this context, to provide tutoring or guidance, not to fix specific problems in the current piece of writing but to help the user learn more general skills that would make the first draft of their next essay better than the first draft of their current one.
Evaluation of a tool through a demonstration of skill improvement (the efficacy of the tool) is a complicated endeavor. Demonstrating that an observed improvement in skill is specifically due to the use of the writing tool, and not to something else happening in students' lives and education at the same time, requires a research design that can take other potential sources of variation in outcomes into account, such as the randomized controlled studies often used to assess interventions, including in education (Connolly et al., 2018); some such studies have been performed with respect to AWE tools (Rock, 2007; Wilson and Roscoe, 2020). A tool that allows for monitoring of improvement in skill (even if the improvement is due to other factors such as school instruction or participation in some activity or community) could also be useful in the broader context of skill-oriented use, as the learner and the teacher would be able to tell that improvement is happening, even if we do not know exactly why. Improvement in important aspects of learning such as motivation and self-efficacy could also provide value to the learner (Grimes and Warschauer, 2010; Wilson and Roscoe, 2020).

Relationships between Types of Use
One could argue that an ideal automated writing assistant would support all the different goals at once: help one produce better writing, help one learn, and do both in a psychometrically responsible fashion, with benefits not restricted to certain types of users more than others, so that decision-making based on the outcome of the usage of the tool can also be supported.
Indeed, the uses are not necessarily mutually exclusive. For example, the human-augmentation and consequential-decision use cases could apply at the same time. It is possible that, at some future point in time, spelling will be deemed to lie outside of the construct targeted by the consequential assessment of writing, and spell-correction software will be made available to test-takers. However, this would require a careful examination of the impact of correction on the distributions and interpretations of the scores. In particular, Choi and Cho (2018) found that manually vetted correction of spelling errors yielded a significant increase in the scores assigned to essays by trained raters, and that, even after controlling for the error quantity and quality predictors, the magnitude of the average gain in score was smaller for responses with higher original scores. Add to the mix the finding that automated spelling correction is more accurate on essays that are of better quality to begin with (Flor, 2012), and it is likely that the automated assessment of an automatically spell-corrected version of an essay might show an unexpected relationship with the original scores, one that would need to be closely examined for bias or for an increase in construct-irrelevant variance.
It is also possible that the effect of using a tool optimized for one use case could be the opposite of what another use case requires. If 'use it or lose it' has any truth to it, a potential consequence of extensive, consistent, and pervasive human augmentation for producing superior written products is an adverse impact on the skill of the human in the human-machine team. If the near-universal adoption of calculators is any guide, once a skill (long division) can be reliably outsourced to a machine, humans stop valuing it in daily practice and, therefore, might stand to lose it in the long run. (The 1989 Curriculum and Evaluation Standards for School Mathematics from the National Council of Teachers of Mathematics recommends, in the Summary of Changes to Content and Emphasis in K-4 Mathematics (p. 21), decreasing the attention devoted to long division specifically and to "complex paper-and-pencil computations" in general; the recommendation for grades 5-8 is likewise to decrease emphasis on "tedious paper-and-pencil computations" (p. 71); see https://archive.org/details/curriculumevalua00nati. The document has sparked substantial controversy, including with regard to long division (Klein and Milgram, 2000).) Spelling is a likely candidate writing skill where reliable access to high-quality correction software could make humans stop worrying about it rather than invest effort in improving it.
Many of the tools mentioned in §2.2.4 seem to position themselves somewhere between the skill-improvement and the product-improvement use cases, perhaps assuming that quantity will eventually turn into quality, namely, that extensive work on improving the written product might lead to internalization and generalization of the skill to new contexts. This might or might not be true. Feedback that helps the user fix an error quickly by pointing it out and suggesting a correction might be good in a product-oriented context, but not in a skill-oriented context; letting the user pinpoint and fix the error himself or herself might be a better skill-development strategy (Hyland and Hyland, 2006). According to Graham and Perin's (2007) meta-analysis of writing interventions for adolescents, explicit grammar instruction tended to be ineffective; this finding is cited by the developers of Writing Pal to support their decision to forgo giving explicit feedback on grammar (McNamara et al., 2013), in contrast to most other AWE systems, which do provide such feedback.

Summary & Conclusion
In his visionary paper from 1966, Ellis Page provided a proof-of-concept demonstration of the possibility of automated grading of essays, and outlined some potential challenges to its adoption. Subsequent research and practice have delivered on Page's minimum desiderata for an AWE system; current research is working to address the outstanding challenges, dealing with a variety of languages, content domains, and writing tasks.
The field of AWE has thus progressed according to the trajectory charted by Page to a large extent, though not completely. In particular, while Page imagined the main use case of AWE to be in the service of a harried English teacher and his feedback-thirsty students, in reality, the most visible use case has arguably evolved to be automated scoring of essays for standardized testing, which, in turn, has led to new challenges, such as ensuring the validity and fairness of scores.
The other development that Page could not anticipate is the sheer pervasiveness of technology in people's daily lives; AWE software can be made available not only in classrooms to be used under the watchful eye of the English teacher, but (almost) anywhere and at any time, including on mobile devices. While it is difficult to predict specific uses people would find for such software, we outlined a number of types of use, depending on the goal: (a) consequential decision making about the user; (b) delivery of the best possible written product in partnership with the user; and (c) assisting the user in improving her writing skills. We believe that we, as researchers, can help users find value in our technology by considering the goals, engaging partners from other relevant disciplines, and designing the tools as well as their evaluations to focus on specific types of use.