Annotation and Classification of Sentence-level Revision Improvement

Studies of writing revisions rarely focus on revision quality. To address this issue, we introduce a corpus of between-draft revisions of student argumentative essays, annotated as to whether each revision improves essay quality. We demonstrate a potential usage of our annotations by developing a machine learning model to predict revision improvement. With the goal of expanding training data, we also extract revisions from a dataset edited by expert proofreaders. Our results indicate that blending expert and non-expert revisions increases model performance, with expert data particularly important for predicting low-quality revisions.


Introduction
Supporting student revision behavior is an important area of writing-related natural language processing (NLP) research. While revision is particularly effective in response to detailed feedback by an instructor (Paulus, 1999), human writing evaluation is time-consuming. To help students improve their writing skills, various writing assistant tools have thus been developed (Eli Review, 2014; Turnitin, 2014; Writing Mentor, 2016; Grammarly, 2016). While these tools offer instant feedback on a particular writing draft, they typically fail to explicitly compare revisions between drafts.
Our long-term goal is to build a system for supporting students in revising argumentative essays, where the system automatically compares multiple drafts and provides useful feedback (e.g., informing students whether their revisions are improving the essay). One step towards this goal is the development of a machine-learning model to automatically analyze revision improvement. Specifically, given only two sentences (original and revised), our current goal is to predict whether the revised sentence is better than the original.
In this paper, we focus on predicting revision improvement using non-expert (i.e., student) writing data. We first introduce a corpus of paired original and revised sentences that has been newly annotated as to whether each revision made the original sentence better or not. The revisions are a subset of those in the freely available ArgRewrite corpus (Zhang et al., 2017), with improvement annotated using standard rubric criteria for evaluating student argumentative writing. By adapting NLP features used in previous revision classification tasks, we then develop a prediction model that outperforms baselines, even though the size of our non-expert revision corpus is small. Hence, we explore extracting paired revisions from an expert-edited dataset to increase training data. The expert revisions are a subset of those in the freely available Automated Evaluation of Scientific Writing (AESW) corpus (Daudaravicius et al., 2016). Our experiments show that with proper sampling, combining expert and non-expert revisions can improve prediction performance, particularly for low-quality revisions.

Related Work
Prior NLP revision analysis work has developed methods for identifying pairs of original and revised textual units in both Wikipedia articles and student essays, as well as for classifying such pairs with respect to schemas of coarse (e.g., syntactic versus semantic) and fine-grained (e.g., lexical versus grammatical syntactic changes) revision purposes (Bronner and Monz, 2012; Daxenberger and Gurevych, 2012; Zhang and Litman, 2015; Yang et al., 2017). For example, the ArgRewrite corpus (Zhang et al., 2017) was introduced to facilitate argumentative revision analysis and automatic revision purpose classification. However, purpose classification does not address revision quality. For example, a spelling change can either fix or introduce an error, while lexical changes can either enhance or reduce fluency. On the other hand, while some work has focused on correction detection in revision (Dahlmeier and Ng, 2012; Xue and Hwa, 2014; Felice et al., 2016), such work has typically been limited to grammatical error detection. The AESW shared task of identifying sentences in need of correction (Daudaravicius et al., 2016) goes beyond just grammatical errors, but the original task does not compare multiple versions of text, and also focuses on scientific writing.
In contrast, Tan and Lee (2014) created a dataset of paired revised sentences in academic writing annotated as to whether one sentence was stronger or weaker than the other. Their work directly sheds light on annotating sentence revision quality in terms of statement strength. However, their corpus focuses on the abstracts and introductions of ArXiv papers. Building on their annotation methodology, we consider paired sentences as our revision unit, but 1) annotate revision quality in terms of argumentative writing criteria, 2) use a corpus of revisions from non-expert student argumentative essays, and 3) move beyond annotation to automatic revision quality classification.

Annotating ArgRewrite
The revisions that we annotated for improvement in quality are a subset of the freely available ArgRewrite revision corpus (Zhang et al., 2017) 1. This corpus was created by extracting revisions from three drafts of argumentative essays written by 60 non-expert writers in response to a prompt 2. Essay drafts were first manually aligned at the sentence level based on semantic similarity. Non-identical aligned sentences (e.g., modified, added, and deleted sentences) were then extracted as the revisions. Our work uses only the 940 modification revisions, as our annotation does not yet consider a sentence's context in its paragraph.
We annotated ArgRewrite revisions for improvement using the labels Better or NotBetter. Better is used when the modification yields an improved sentence from the perspective of argumentative writing, while NotBetter is used when the modification either makes the sentence worse or does not have any significant effect. Binary labeling lets us determine an unambiguous gold standard by majority voting over an odd number of annotators. Binary labels should also suffice for our long-term goal of triggering tutoring in a writing assistant (e.g., when the label is NotBetter).
Inspired by Tan and Lee (2014), our annotation instructions included explanatory guidelines along with example annotated sentence pairs. The guidelines were crafted to describe improvement in terms of typical argumentative writing criteria. We depend on annotators' judgment for cases not covered by the guidelines. According to the guidelines 3, a revised sentence S2 is better than the original sentence S1 when: (1) S2 provides more information that strengthens the idea/major claim in S1; (2) S2 provides more evidence/justification for some aspects of S1; (3) S2 is more precise than S1; (4) S2 is easier to understand compared to S1 because it is fluent, well-structured, and has no unnecessary words; and (5) S2 is grammatically correct and has no spelling mistakes.
To provide context, annotators were told that the data was taken from student argumentative essays about electronic communications. We also let the annotators know the identity of the original and revised sentences (S1 and S2, respectively). Although this may introduce an annotation bias, it mimics feedback practice where instructors know which are the original versus revised sentences.
We collected 7 labels along with explanatory comments for each of the 940 revisions using Amazon Mechanical Turk (AMT). Table 1 shows examples (1, 2, and 3) of original and revised ArgRewrite sentences with their majority-annotated labels. The first revision clarifies a claim of the essay, the second removes some information and is less precise, while the third fixes a spelling mistake. As shown in Table 2, for all 940 revisions, our annotation has slight agreement (Landis and Koch, 1977) using Fleiss's kappa (Fleiss, 1971). If we only consider revisions where at least 5 out of the 7 annotators chose the same label (majority ≥ 5), kappa increases to 0.263, which corresponds to fair agreement. For comparison, Tan and Lee (2014) achieved fair agreement (Fleiss's kappa of 0.242) with 9 annotators labeling 500 sentence pairs for statement strength.
Table 1: Example original (S1) and revised (S2) sentences with their labels. Examples 1-3 are from ArgRewrite; examples 4 and 5 are from AESW.
1  S1: The world has experienced various changes throughout its lifetime.
   S2: The world has been defined by its revolutions - the most recent one being technological.
   Label: Better
2  S1: Technology is changing the world, and in particular the way we communicate.
   S2: Technology is changing the way we communicate.
   Label: [...]
[...]
4  S1: [...]
   S2: This is numerically expensive, but leads to proper results.
   Label: Better
5  S1: Section 2 formulates and solves the balance equations.
   S2: The balance equations are formulated and solved in Section 2.
   Label: [...]
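The gold-standard and agreement computations described in this section can be illustrated with a minimal sketch. It assumes each revision is represented as a list of the 7 labels it received; the function names (`majority_label`, `filter_by_majority`, `fleiss_kappa`) are hypothetical and not taken from the original annotation pipeline. The kappa computation follows the standard Fleiss (1971) definition for a fixed number of raters per item.

```python
from collections import Counter

def majority_label(labels):
    """Return (label, votes) for the most frequent label of one revision."""
    label, votes = Counter(labels).most_common(1)[0]
    return label, votes

def filter_by_majority(items, min_votes=5):
    """Keep only revisions whose majority label received at least min_votes."""
    return [labels for labels in items if majority_label(labels)[1] >= min_votes]

def fleiss_kappa(items, categories=("Better", "NotBetter")):
    """Fleiss's kappa for items that all carry the same number of labels."""
    n = len(items[0])  # raters per item
    assert all(len(labels) == n for labels in items)
    n_items = len(items)
    counts = [Counter(labels) for labels in items]
    # Per-item agreement: fraction of rater pairs that chose the same label.
    p_items = [
        (sum(c[cat] ** 2 for cat in categories) - n) / (n * (n - 1))
        for c in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall proportion of each category.
    p_cats = [sum(c[cat] for c in counts) / (n_items * n) for cat in categories]
    p_e = sum(p ** 2 for p in p_cats)
    # Undefined (division by zero) if every label in the data is identical.
    return (p_bar - p_e) / (1 - p_e)
```

With an odd number of annotators and two labels, `majority_label` can never tie, which is the property that motivates the binary scheme.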

Sampling AESW
The Automated Evaluation of Scientific Writing (AESW) shared task (Daudaravicius et al., 2016) was to predict whether a sentence needed editing or not. Professional proofreaders edited sentences to correct issues ranging from grammatical errors to stylistic problems, intuitively yielding 'Better' sentences. Therefore, we can use the AESW edit information to create an automatically annotated corpus for revision improvement. In addition, by randomly flipping the order of the sentences in a pair, we can include 'NotBetter' labels in the corpus. The AESW dataset was created from different scientific writing genres (e.g., Mathematics, Astrophysics) with placeholders for anonymization. We use two random samples of 5000 AESW revisions for the experiments in Section 5. "AESW all" samples revisions from all scientific genres, while "AESW plaintext" ignores sentences containing placeholders (e.g., MATH, MATHDISP) to make the data more similar to ArgRewrite. Table 1 shows two example AESW revisions (4 and 5).
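The automatic labeling idea can be sketched as follows. This is an illustration under stated assumptions, not the sampling code used for the experiments: the function name `label_expert_pairs`, the `flip_prob` parameter, and its default of 0.5 are hypothetical, since the text does not specify the proportion of flipped pairs.

```python
import random

def label_expert_pairs(pairs, flip_prob=0.5, seed=0):
    """Turn (before_edit, after_edit) pairs into (S1, S2, label) triples.

    A pair kept in its original order is labeled 'Better' because the
    proofreader's edit is S2. A flipped pair puts the edited sentence
    first, so the "revision" undoes the edit and is labeled 'NotBetter'.
    flip_prob is an assumed parameter, not a value from the paper.
    """
    rng = random.Random(seed)
    labeled = []
    for before, after in pairs:
        if rng.random() < flip_prob:
            labeled.append((after, before, "NotBetter"))
        else:
            labeled.append((before, after, "Better"))
    return labeled
```

How to balance the two labels when sampling is left open here, in line with the future work noted in the conclusion.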
Following prior work, we count each unigram across, as well as unique to, S1 or S2 (Daxenberger and Gurevych, 2013; Zhang and Litman, 2015). However, we also count bigrams and trigrams to better capture introduced or deleted argumentative discourse units.
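One way to read the n-gram features above is sketched below. Whitespace tokenization, lowercasing, and the feature-key naming (`both`, `only_s1`, `only_s2`) are assumptions made for illustration; the original feature extraction may differ in all three respects.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_features(s1, s2, max_n=3):
    """Count n-grams across both sentences and those in excess in S1 or S2."""
    feats = Counter()
    t1, t2 = s1.lower().split(), s2.lower().split()
    for n in range(1, max_n + 1):
        c1, c2 = Counter(ngrams(t1, n)), Counter(ngrams(t2, n))
        for gram, count in (c1 + c2).items():
            feats[("both", gram)] = count      # occurrences across S1 and S2
        for gram, count in (c1 - c2).items():
            feats[("only_s1", gram)] = count   # deleted by the revision
        for gram, count in (c2 - c1).items():
            feats[("only_s2", gram)] = count   # introduced by the revision
    return feats
```

Counter subtraction keeps only positive differences, so an n-gram that appears equally often in both sentences contributes to `both` but to neither of the one-sided features.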
Another group of features is based on sentence differences similar to those proposed by Zhang and Litman (2015), e.g., difference in length, commas, symbols, named entities, etc., as well as edit distance. However, to capture improvement rather than just difference, we also introduce asymmetric distance metrics, e.g., Kullback-Leibler divergence 4. We also capture differences using BLEU 5 score, motivated by its use in evaluating machine-translated text quality.
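To illustrate why an asymmetric metric can separate "S1 revised to S2" from "S2 revised to S1", the following sketch computes Kullback-Leibler divergence between the unigram distributions of two sentences. The choice of unigram distributions, whitespace tokenization, and add-alpha smoothing are assumptions for this example; the text does not state how the divergence was actually computed.

```python
import math
from collections import Counter

def kl_divergence(s1, s2, alpha=1.0):
    """KL(P_s1 || P_s2) over smoothed unigram distributions.

    Add-alpha smoothing over the joint vocabulary (an assumption here)
    keeps every probability non-zero so the logarithm is always defined.
    """
    t1, t2 = s1.lower().split(), s2.lower().split()
    vocab = set(t1) | set(t2)
    c1, c2 = Counter(t1), Counter(t2)

    def prob(counts, total, word):
        return (counts[word] + alpha) / (total + alpha * len(vocab))

    return sum(
        prob(c1, len(t1), w)
        * math.log(prob(c1, len(t1), w) / prob(c2, len(t2), w))
        for w in vocab
    )
```

Because `kl_divergence(s1, s2)` and `kl_divergence(s2, s1)` generally differ, the feature value changes when the direction of the revision is reversed, which a symmetric measure such as edit distance cannot express.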
Following Zhang and Litman (2015), we calculate the count and difference of spelling and language errors 6, in our case to capture improvement as a result of error corrections.
As stated in the annotation guidelines, one way a revised sentence can be better is because it is more precise or specific. Therefore, we introduce the use of the Speciteller (Li and Nenkova, 2015) tool to quantify the specificity of S1 and S2, and take the specificity difference as a new feature.
Remse et al. (2016) used parse-tree-based features to capture the readability, coherence, and fluency of a sentence. Inspired by them, we calculate the difference in count of subordinate clauses (SBAR), verb phrases (VP), noun phrases (NP), and tree height in the parse trees 7 of S1 and S2.
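The parse-tree difference features can be sketched as below. The example assumes the constituency parser's output is already available as Penn Treebank style bracketed strings, and it approximates tree height by the maximum bracket nesting depth; both are assumptions for illustration, and the helper names are hypothetical.

```python
import re
from collections import Counter

def tree_stats(bracketed):
    """Return (label counts, max nesting depth) for a bracketed parse string."""
    labels = Counter()
    depth = max_depth = 0
    tokens = re.findall(r"\(|\)|[^\s()]+", bracketed)
    for i, tok in enumerate(tokens):
        if tok == "(":
            depth += 1
            max_depth = max(max_depth, depth)
            # The token right after an opening bracket is the node label.
            if i + 1 < len(tokens) and tokens[i + 1] not in ("(", ")"):
                labels[tokens[i + 1]] += 1
        elif tok == ")":
            depth -= 1
    return labels, max_depth

def parse_diff_features(parse1, parse2, labels=("SBAR", "VP", "NP")):
    """Differences (S2 minus S1) in constituent counts and tree depth."""
    l1, h1 = tree_stats(parse1)
    l2, h2 = tree_stats(parse2)
    feats = {f"diff_{lab}": l2[lab] - l1[lab] for lab in labels}
    feats["diff_height"] = h2 - h1
    return feats
```

Positive values indicate that the revised sentence contains more of a constituent type, or a deeper tree, than the original.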

Experiments and Results
Our goal is to examine whether we can predict improvement for non-expert ArgRewrite revisions, using AESW expert and/or ArgRewrite non-expert revisions for training. Our experiments are structured to answer the following research questions:
Q1: Can we use only non-expert revisions to train a model that outperforms a baseline?
Q2: Can we use only expert revisions to train a model that outperforms a baseline?
Q3: Can we combine expert and non-expert training revisions to improve model performance?
Our machine learning experiments use Random Forest (RF) 8 from the Python scikit-learn toolkit (Pedregosa et al., 2011) with 10-fold cross-validation. Parameters were tuned using AESW development data. Because of the ArgRewrite class imbalance (Table 2, All row), we used SMOTE (Chawla et al., 2002) oversampling for each training fold. Feature selection was also performed on each training fold. Average unweighted precision, recall, and F1 are reported and compared to majority-class baselines.
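The evaluation measures can be made concrete with a small sketch. It assumes that "average unweighted" means the macro average over the two classes (each class counted equally regardless of size), and it assumes a score of 0.0 when a class is never predicted; neither convention is stated explicitly in the text. The function names are hypothetical.

```python
from collections import Counter

def macro_prf(gold, pred, classes=("Better", "NotBetter")):
    """Unweighted (macro) average of per-class precision, recall, and F1."""
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        # Assumed convention: undefined ratios are scored as 0.0.
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    k = len(classes)
    return sum(precisions) / k, sum(recalls) / k, sum(f1s) / k

def majority_baseline(train_gold, n_test):
    """Predict the most frequent training label for every test item."""
    majority = Counter(train_gold).most_common(1)[0][0]
    return [majority] * n_test
```

Under macro averaging, a majority-class baseline scores zero on the minority class, so its macro recall on two classes is 0.5 even when its raw accuracy looks high on imbalanced data.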
To answer Q1, we train a model using only ArgRewrite data. Table 3 shows that this model outperforms the majority baseline, significantly so for Precision and F1. Compared to all other models (Figure 1), this model can identify 'Better' revisions with the highest recall, and can identify 'NotBetter' revisions with the highest precision. However, for our long-term goal of building an effective revision assistant tool, intuitively we will also need to identify 'NotBetter' revisions with higher recall, which is very low for this model.
7 https://nlp.stanford.edu/software/lex-parser.shtml
8 Random Forest outperformed Support Vector Machines.
To answer Q2, we train only on AESW data but test on the same ArgRewrite folds as above. For both AESW revision samples (before and after removing the placeholders), only Precision is significantly better than the baseline. However, Figure 1 shows that AESW plaintext has significantly higher (p < 0.05) Recall than any other model in predicting 'NotBetter' revisions (which motivates Q3 as a way to address the limitation noted in Q1).
To answer Q3, during each run of cross-validation training we inject the AESW data in addition to the 90% of ArgRewrite data used for training, then test on the remaining 10% as before. As can be seen from Table 3, AESW plaintext combined with ArgRewrite shows the best classification performance using all three metrics. It also has improved Recall for 'NotBetter' revisions compared to training only on ArgRewrite data. This result indicates that selective extraction of revisions from AESW data helps improve model performance, especially when classifying low-quality revisions.
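The Q3 setup, where expert data is added to every training fold but never to a test fold, can be sketched as follows. The function name `folds_with_injection` is hypothetical, and the split shown is a plain shuffled (non-stratified) split, which is a simplification; the text does not say whether the original folds were stratified.

```python
import random

def folds_with_injection(target, extra, k=10, seed=0):
    """Yield (train, test) splits for k-fold cross-validation.

    Each test set is one fold of `target` (e.g., ArgRewrite revisions).
    Each training set is the remaining target folds plus all of `extra`
    (e.g., sampled AESW revisions), so evaluation stays on target data only.
    """
    idx = list(range(len(target)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for held_out in folds:
        held = set(held_out)
        train = [target[i] for i in idx if i not in held] + list(extra)
        test = [target[i] for i in held_out]
        yield train, test
```

Any per-fold step such as oversampling or feature selection would then be fitted on `train` only, so that no information from `test` leaks into training.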
Finally, to understand feature utility, we compute average feature importance across the 10 folds for each experiment. The most important features include unigrams, trigrams, length difference, language errors, edit distance, BLEU score, specificity difference, and parse-tree features. For example, length difference ranks in the top 5 for all experiments. This is intuitive, as the annotation guidelines state that adding evidence can make a better revision. Other features such as differences in language errors, specificity scores, and BLEU scores show more importance when training on combined ArgRewrite and AESW data than when training on only ArgRewrite. Surprisingly, spelling error corrections show low importance.

Table 4: Example NotBetter revisions misclassified as Better by most models.
1  S1: A 1,000-word letter is considered long, and takes days, if not weeks, to reach the recipient.
   S2: A 1,000-word letter is considered long, and takes days, if not weeks, to reach the recipient, with risks of getting lost along the way.
   Label distribution: [...] vs 4
   Sample comments:
     NotBetter: S1 is clearer than S2 and the 'risks along the way' could be included as a second sentence to increase readability.
     Better: S2 provides more information that strengthens the idea/major claim in S1.
2  S1: People can't feel the atmosphere of the conversation.
   S2: Also, people can't feel the atmosphere of the conversation.
   Label distribution: [...] vs 4
   Sample comments:
     NotBetter: Either sentence is fine, but sentence two is not any better.
     Better: Assuming this sentence originally came from the context of a larger part of text, I imagine the continuation included here improves the flow of the original context.
3  S1: With respect to personal life, social networking provides us opportunities to interact with people from different areas, such as Facebook and Twitter.
   S2: With respect to personal life, social networkings provide us opportunities to interact with people from different areas, such as Facebook and Twitter.
   Label distribution: [...]
   Sample comments: [...]

Discussion
Although AESW plaintext helped classify NotBetter revisions, performance is still low. Table 4 shows some example NotBetter revisions misclassified as Better by most models. The first two examples were also difficult for humans to classify. In the first example, one annotator for Better (the minority label) points out that the revision provides more information. We speculate that our models might similarly rely too heavily on length and classify longer sentences as Better, since as noted above, length difference was a top 5 feature in all experiments. In fact, for the best model (ArgRewrite+AESW plaintext), the length difference for predicted Better revisions was 4.81, while for predicted NotBetter revisions it was −3.99.
In the second example, one of the annotators who labeled the revision as Better noted that the added word 'Also' indicates a larger context not available to the annotators. This suggests that including revision context could help improve both annotation and classification performance.
The third revision was annotated as NotBetter by 6 annotators. We looked into our features and found that the 'language-check' tool was in fact able to catch this grammatical mistake. Yet only the model trained on just ArgRewrite classified this revision correctly; all models using AESW data misclassified it.

Conclusion and Future Work
We created a corpus of sentence-level student revisions annotated with labels regarding improvement with respect to argumentative writing. 9 We used this corpus to build a machine learning model for automatically identifying revision improvement. We also demonstrated that an existing corpus of expert edits can be selectively sampled to improve model performance.
In the future, we would like to improve inter-rater reliability by collecting expert annotations rather than using crowdsourcing. We would also like to examine how the accuracy of our feature extraction algorithms impacted our feature utility results. Finally, we would like to improve our use of the AESW data, e.g., by automatically clustering revisions for more targeted sampling. Optimizing how many AESW revisions to use and how to balance labels in AESW sampling are also areas for future research.