Annotation and Classification of Argumentative Writing Revisions

This paper explores the annotation and classification of students' revision behaviors in argumentative writing. A sentence-level revision schema is proposed to capture why and how students make revisions. Based on the proposed schema, a small corpus of student essays and revisions was annotated. Studies show that manual annotation with the schema is reliable and that the annotated information is helpful for revision analysis. Furthermore, features and methods are explored for the automatic classification of revisions. Intrinsic evaluations demonstrate promising performance in high-level revision classification (surface vs. text-based). Extrinsic evaluations demonstrate that our method for automatic revision classification can be used to predict a writer's improvement.


Introduction
Rewriting is considered an important factor in successful writing. Research shows that expert writers revise in ways different from inexperienced writers (Faigley and Witte, 1981). Recognizing the importance of rewriting, more and more efforts are being made to understand and utilize revisions. There are rewriting suggestions made by instructors (Wells et al., 2013), studies modeling revisions for error correction (Xue and Hwa, 2010; Mizumoto et al., 2011) and tools aiming to help students with rewriting (Elireview, 2014; Lightside, 2014).
While there is increasing interest in improving writers' rewriting skills, there is still a lack of study on the details of revisions. First, to find out what has been changed (defined as revision extraction in this paper), a typical approach is to extract and analyze revisions at the word/phrase level based on edits extracted with character-level text comparison (Bronner and Monz, 2012; Daxenberger and Gurevych, 2012). The semantic information of sentences is not considered in character-level text comparison, which can lead to errors and loss of information in revision extraction. Second, the differentiation of different types of revisions (defined as revision categorization) is typically not fine-grained. A common categorization is a binary classification of revisions according to whether the information of the essay is changed or not (e.g. text-based vs. surface as defined by Faigley and Witte (1981)). This categorization ignores potentially important differences between revisions under the same high-level category. For example, changing the evidence of a claim and changing the reasoning of a claim are both considered text-based changes, yet changing the evidence usually makes a paper more grounded, while changing the reasoning helps with the paper's readability; the two could indicate different levels of improvement to the original paper. Finally, for the automatic differentiation of revisions (defined as revision classification), while there is work on the classification of Wikipedia revisions (Adler et al., 2011; Bronner and Monz, 2012), there is a lack of work on revision classification in other datasets such as student writing. It is not clear whether current features and methods can be adapted or whether new features and methods are required.

Figure 1: In the example, words in sentence 1 of Draft 1 are rephrased and reordered into sentence 3 of Draft 2. Sentences 1 and 2 in Draft 2 are newly added. Our method first marks 1 and 3 as aligned and the other two sentences of Draft 2 as newly added, based on the semantic similarity of sentences. The purposes and operations are then marked on the aligned pairs. In contrast, previous work extracts differences between drafts at the character level to get edit segments; the revision is extracted as a set of sentences covering the contiguous edit segments. Sentence 1 in Draft 1 is wrongly marked as being modified into 1, 2, 3 in Draft 2 because character-level text comparison cannot identify the semantic similarity between sentences.

To address the issues above, this paper makes the following efforts. First, we propose that it is better to extract revisions at a level higher than the character level, and in particular we explore the sentence level. This avoids the misalignment errors of character-level text comparison. Finer-grained studies can still be done on the extracted sentence-level revisions, such as fluency prediction (Chae and Nenkova, 2009), error correction (Cahill et al., 2013; Xue and Hwa, 2014) and statement strength identification (Tan and Lee, 2014). Second, we propose a sentence-level revision schema for argumentative writing, a common form of writing in education. In the schema, categories are defined for describing an author's revision operations and revision purposes. The revision operations can be decided directly from the results of sentence alignment, while revision purposes can be reliably manually annotated. We also conduct a corpus study to demonstrate the utility of sentence-level revisions for revision analysis. Finally, we adapt features from Wikipedia revision classification work and explore new features for our classification task, which differs from prior work with respect to both the revision classes to be predicted and the sentence-level revision extraction method. Our models are able to distinguish whether revisions change the content or not.
For fine-grained classification, our models also demonstrate good performance for some categories. Beyond the classification task, we also investigate the pipelining of revision extraction and classification. Results of an extrinsic evaluation show that the automatically extracted and classified revisions can be used for writing improvement prediction.

Related work
Revision extraction. To extract revisions for revision analysis, a widely chosen strategy first uses character-based text comparison algorithms and then builds revision units on the extracted differences (Bronner and Monz, 2012). While theoretically revisions extracted with this method can be more precise than sentence-level extractions, it can suffer from misalignments of revised content due to character-level text comparison. For example, when a sentence is rephrased, a character-level text comparison algorithm is likely to make alignment errors because it cannot recognize semantic similarity. As educational research has suggested that revision analysis can be done at the sentence level (Faigley and Witte, 1981), we propose to extract revisions at the sentence level based on semantic sentence alignment instead. Figure 1 provides an example comparing revisions annotated in our work to revisions extracted in prior work (Bronner and Monz, 2012). Our work identifies the fact that the student added new information to the essay and modified the organization of old sentences. The previous work, however, extracts all the modifications as one unit and cannot distinguish the different kinds of revisions inside the unit. Our method is similar to that of Lee and Webster (2012), where a sentence-level revision corpus is built from college students' ESL writing. However, their corpus only includes the comments of the teachers and does not have every revision annotated.

Revision categorization
In an early educational work by Faigley and Witte (1981), revisions are categorized into text-based changes and surface changes based on whether they change the information of the essay or not. A similar categorization (factual vs. fluency) was chosen by Bronner and Monz (2012) for classifying Wikipedia edits. However, many differences cannot be captured with such coarse-grained categorizations. In other works on Wikipedia revisions, finer categorizations were thus proposed: vandalism, paraphrase, markup, spelling/grammar, reference, information, template, file, etc. (Pfeil et al., 2006; Jones, 2008; Liu and Ram, 2009; Daxenberger and Gurevych, 2012). Corpus studies were conducted to analyze the relationship between revisions and the quality of Wikipedia articles based on these categorizations. Unfortunately, these categories are customized for Wikipedia revisions and cannot easily be applied to educational revisions such as ours. In our work, we provide a fine-grained revision categorization designed for argumentative writing, a common form of writing in education, and conduct a corpus study to analyze the relationship between our revision categories and paper improvement.
Revision classification. Features and methods have been widely explored for Wikipedia revision classification (Adler et al., 2011; Bronner and Monz, 2012; Ferschke et al., 2013). Classification tasks include binary classification for coarse categories (e.g. factual vs. fluency) and multi-class classification for fine-grained categories (e.g. the 21 categories defined in prior work). Results show that binary classification on Wikipedia data achieves promising results, while classification of finer-grained categories is more difficult, with the difficulty varying across categories. In this paper we explore whether the features used in Wikipedia revision classification can be adapted to the classification of the revision categories in our work. We also utilize features from research on argument mining and discourse parsing (Sporleder and Lascarides, 2008; Falakmasir et al., 2014; Braud and Denis, 2014) and evaluate revision classification both intrinsically and extrinsically. Finally, we explore end-to-end revision processing by combining automatic revision extraction and categorization via automatic classification in a pipelined manner.

Sentence-level revision extraction and categorization
This section describes our work on sentence-level revision extraction and revision categorization. A corpus study demonstrates the use of the sentence-level revision annotations for revision analysis.

Revision extraction
As stated in the previous section, our method takes semantic information into consideration when extracting revisions and uses the sentence as the basic semantic unit; besides the utility of sentence revisions for educational analysis (Faigley and Witte, 1981; Lee and Webster, 2012), automatic sentence segmentation is quite accurate. Essays are first split into sentences, then sentences across the essays are aligned based on semantic similarity. An added or deleted sentence is treated as aligned to null, as in Figure 1. The aligned pairs whose sentences are not identical are extracted as revisions. For the automatic alignment of sentences, we used the algorithm from our prior work (Zhang and Litman, 2014), which considers both sentence similarity (calculated using TF*IDF scores) and the global context of sentences.
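As an illustration of the alignment step, the following is a minimal Python sketch of TF*IDF-based sentence alignment. It is not the authors' implementation (which also models the global context of sentences); the greedy argmax strategy, the 0.5 similarity threshold and all function names are illustrative assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(sentences):
    """Map each sentence to a sparse TF*IDF vector (dict of term -> weight)."""
    docs = [Counter(s.lower().split()) for s in sentences]
    n = len(docs)
    df = Counter()
    for d in docs:
        df.update(d.keys())
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    return [{t: tf * idf[t] for t, tf in d.items()} for d in docs]


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def align(draft1, draft2, threshold=0.5):
    """Greedily align each Draft 2 sentence to its most similar Draft 1
    sentence if the similarity exceeds the threshold, else to None (null)."""
    vecs = tfidf_vectors(draft1 + draft2)
    v1, v2 = vecs[:len(draft1)], vecs[len(draft1):]
    pairs = []
    for j, v in enumerate(v2):
        sims = [cosine(v, u) for u in v1]
        best = max(range(len(sims)), key=sims.__getitem__) if sims else None
        pairs.append((best if best is not None and sims[best] >= threshold
                      else None, j))
    return pairs
```

A rephrased sentence keeps enough word overlap to align, while a genuinely new sentence aligns to null, mirroring the Add/Modify distinction in Figure 1.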

Revision schema definition
As shown in Figure 2, two dimensions are considered in the definition of the revision schema: the author's behavior (revision operation) and the reason for the author's behavior (revision purpose).
Revision operations include three categories: Add, Delete and Modify. The operations are decided automatically after the sentences are aligned. For example, in Figure 1, where sentence 3 in Draft 2 is aligned to sentence 1 in Draft 1, the revision operation is decided as Modify. The other two sentences are aligned to null, so the revision operations of these alignments are both decided as Add.
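Under the assumption that the aligner outputs (Draft 1 index, Draft 2 index) pairs with None for an unaligned side, the operation decision described above can be sketched as follows (function and variable names are ours, not the paper's):

```python
def revision_operations(pairs, draft1, draft2):
    """Assign Add/Delete/Modify operations to aligned sentence pairs.

    pairs: list of (i, j) where i indexes draft1 (or None) and j indexes
    draft2 (or None), as produced by a sentence aligner.
    """
    ops = []
    for i, j in pairs:
        if i is None:                      # new sentence in Draft 2
            ops.append((i, j, "Add"))
        elif j is None:                    # sentence removed from Draft 1
            ops.append((i, j, "Delete"))
        elif draft1[i] != draft2[j]:       # aligned but changed
            ops.append((i, j, "Modify"))
        # identical aligned pairs are not revisions, so no operation
    return ops
```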
The definitions of revision purposes come from several works in argumentative writing and discourse analysis.
Claims/Ideas, Warrant/Reasoning/Backing, Rebuttal/Reservation and Evidence come from Claim, Rebuttal, Warrant, Backing and Grounds in Toulmin's model (Kneupper, 1978). General Content comes from the Introductory material category of prior essay-based discourse categorization work. The rest come from the categories within the surface changes of Faigley and Witte (1981). Examples of all categories are shown in Table 1. These categories can further be mapped to the surface and text-based changes defined by Faigley and Witte (1981), as shown in Figure 2.
Note that while our categorization comes from the categorization of argumentative writing elements, a key difference is that our categorization focuses on revisions. For example, while an evidence revision must be related to the evidence element of the essay, the reverse is not necessarily true. The modifications on an evidence sentence could be just a correction of spelling errors rather than an evidence revision.

Data annotation
Our data consists of the first draft (Draft 1) and second draft (Draft 2) of papers written by high school students taking English writing courses; papers were revised after receiving and generating peer feedback. Two assignments (from different teachers) have been annotated so far. Corpus C1 comes from an AP-level course, contains papers about Dante's Inferno, and includes drafts from 47 students with 1262 sentence revisions. A Draft 1 paper contains 38 sentences on average and a Draft 2 paper contains 53. Examples from this corpus are shown in Table 1. After the data was collected, a score from 0 to 5 was assigned to each draft by experts (for research prior to our study). The score was based on the student's performance, including whether the student stated the ideas clearly, had a clear paper organization, provided good evidence, chose the correct wording and followed writing conventions. The class's average score improved from 3.17 to 3.74 after revision. Corpus C2 (not AP) contains papers about the poverty issues of the modern reservation and includes drafts from 38 students with 495 revisions; expert ratings are not available. Papers in C2 are shorter than in C1; a Draft 1 paper contains 19 sentences on average and a Draft 2 paper contains 26.
Two steps were involved in the revision schema annotation of these corpora. In the first step, sentences between the two drafts were aligned based on semantic similarity. The kappa was 0.794 for the sentence alignment on C1. The two annotators discussed the disagreements, and one annotator's work was judged to be better and chosen as the gold standard. The sentence alignment on C2 was done by one annotator after his annotation and discussion of the sentence alignment on C1. In the second step, revision purposes were annotated on the aligned sentence pairs. Each aligned sentence pair could have multiple revision purposes (although this is rare in the annotation of our current corpus). The full papers were also provided to the annotators for context.

Table 1: Examples of different revision purposes. Note that in the second example the alignment is not extracted as a revision when the sentences are identical.

Codes: Claims/Ideas: change of the position or claim being argued for. Conventions/Grammar/Spelling: changes to fix spelling or grammar errors or misuse of punctuation, or to follow the organizational conventions of academic writing.
Example: Draft 1: (1, "Saddam Hussein and Osama Bin Laden come to mind when mentioning wrathful people"). Draft 2: (1, "Fidel Castro comes to mind when mentioning wrathful people").
Revisions: (1->1, Modify, "Claims/Ideas"), (1->1, Modify, "Conventions/Grammar/Spelling")

Codes: Evidence: change of facts, theorems or citations for supporting claims/ideas. Rebuttal/Reservation: change of content that rebuts the current claims/ideas.
Example: Draft 1: (1, "In this circle I would place Fidel."). Draft 2: (1, "In the circle I would place Fidel"), (2, "He was annoyed with the existence of the United States and used his army to force them out of his country"), (3, "Although Fidel claimed that this is for his peoples' interest, it could not change the fact that he is a wrathful person.").
Revisions: (null->2, Add, "Evidence"), (null->3, Add, "Rebuttal/Reservation")

Codes: Word-usage/Clarity: change of words or phrases for better representation of ideas. Organization: changes to help the author get a better flow of the paper. Warrant/Reasoning/Backing: change of the principle or reasoning of the claim. General Content: change of content that does not directly support or rebut claims/ideas.
Example: as in Figure 1.
The kappa scores for the revision purpose annotation are shown in Table 2, demonstrating that our revision purposes can be annotated reliably by humans. Again, one annotator's annotation was chosen as the gold standard after discussion. The distribution of different revision purposes is shown in Tables 3 and 4.

Corpus study
To demonstrate the utility of our sentence-level revision annotations for revision analysis, we conducted a corpus study analyzing relations between the number of revisions of each type in our schema and student writing improvement, based on the expert paper scores available for C1. In particular, the number of revisions of each category is counted for each student, and the Pearson correlation between the number of revisions and the students' Draft 2 scores is calculated. Given that students' Draft 1 and Draft 2 scores are significantly correlated (p < 0.001, R = 0.632), we controlled for the effect of the Draft 1 score by regressing it out of the correlation. We expect surface changes to have a smaller impact than text-based changes, as Faigley and Witte (1981) found that advanced writers make more text-based changes compared to inexperienced writers.
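Regressing the Draft 1 score out of the correlation is equivalent to a first-order partial correlation. A self-contained Python sketch (function names are ours, not the paper's analysis code):

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)


def partial_corr(revision_counts, draft2_scores, draft1_scores):
    """Correlation of revision counts with Draft 2 scores, controlling for
    Draft 1 scores (first-order partial correlation)."""
    rxy = pearson(revision_counts, draft2_scores)
    rxz = pearson(revision_counts, draft1_scores)
    ryz = pearson(draft2_scores, draft1_scores)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))
```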
As shown by the first row in Table 5, the overall number of revisions is significantly correlated with students' writing improvement. However, when we compare revisions using Faigley and Witte's binary categorization, only the number of text-based revisions is significantly correlated. Within the text-based revisions, only Claims/Ideas, Warrant/Reasoning/Backing and Evidence are significantly correlated. These findings demonstrate that revisions at different levels of granularity have different relationships to students' writing success, which suggests that our schema is capturing salient characteristics of writing improvement. While correlational, these results also suggest the potential utility of educational technologies based on fine-grained revision analysis. For teachers, summaries of the revision purposes in a particular paper (e.g. "The author added more reasoning sentences to his old claim, and changed the evidence used to support the claim.") or across the papers of multiple students (e.g. "90% of the class made only surface revisions") might provide useful information for prioritizing feedback. Fine-grained revision analysis might also be used to provide feedback to students directly in an intelligent tutoring system.

Revision classification
In the previous section we described our revision schema and demonstrated its utility. This section investigates the feasibility of automatic revision analysis. We first explore classification assuming revisions have been extracted with perfect sentence alignment. We then combine revision extraction and revision classification in a pipelined manner.

Features
As shown in Figure 3, besides using unigram features as a baseline, our features are organized into Location, Textual, and Language groups following prior work (Adler et al., 2011; Bronner and Monz, 2012).

Baseline: unigram features, following Daxenberger and Gurevych (2012).

Table 5: Partial correlation between number of revisions and Draft 2 score on corpus C1 (partial correlation regresses out Draft 1 score); rebuttal is not evaluated as there is only 1 occurrence.

Textual group. A sentence containing a specific person's name is more likely to be an example for a claim; sentences containing "because" are more likely to be sentences of reasoning; and a sentence generated by a text-based revision is likely more different from the original sentence than a sentence generated by a surface revision. These intuitions are operationalized using several feature groups: Named entity features (also used in Bronner and Monz (2012)'s Wikipedia revision classification task), Discourse marker features (used in prior work on discourse structure identification), Sentence difference features and Revision operation features (similar features are used in prior work).
Language group. Different types of sentences can have different distributions of POS tags. The difference in the number of spelling/grammar mistakes is a possible indicator, as Conventions/Grammar/Spelling revisions probably decrease the number of mistakes.
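A few of the Textual-group intuitions above could be operationalized roughly as follows. This is our own simplification for illustration, not the paper's exact feature set; the marker list and feature names are assumptions.

```python
def revision_features(s1, s2):
    """Toy textual features for an aligned sentence pair (s1 from Draft 1,
    s2 from Draft 2): discourse-marker counts and sentence-difference
    measures of the kind described in the text."""
    markers = {"because", "however", "although", "therefore", "since"}
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    union = w1 | w2
    return {
        "markers_in_new": sum(w in w2 for w in markers),
        "words_added": len(w2 - w1),
        "words_removed": len(w1 - w2),
        # large difference suggests a text-based rather than surface revision
        "jaccard_diff": 1 - len(w1 & w2) / len(union) if union else 0.0,
        "length_change": len(s2.split()) - len(s1.split()),
    }
```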

Experiments
Experiment 1: Surface vs. text-based. As the corpus study in Section 3 shows that only text-based revisions predict writing improvement, our first experiment checks whether we can distinguish between the surface and text-based categories. The classification is done on all the non-identical aligned sentence pairs with Modify operations. We choose 10-fold (student) cross-validation for our experiment. Random Forest from the Weka toolkit (Hall et al., 2009) is chosen as the classifier. To address class imbalance, the training data is resampled in each round with a cost matrix set according to the distribution of categories in the training data. All features are used except Revision operation (since only Modify revisions are in this experiment).

Tables 6 and 7 (Surface vs. Text-based) report average unweighted precision, recall and F-score from 10-fold cross-validation; * indicates significantly better than majority and unigram.

Experiment 2: Binary classification for each revision purpose category. In this experiment, we test whether the system can identify whether revisions of each specific category exist in an aligned sentence pair. The same experimental setting as for the surface vs. text-based classification is applied.
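The cost-sensitive resampling step can be approximated by simple oversampling of minority classes. The sketch below is a stand-in for Weka's cost-matrix-based resampling, not a reproduction of it; all names are ours.

```python
import random
from collections import Counter


def rebalance(examples, labels, seed=0):
    """Oversample minority classes so every class matches the majority
    class count, yielding a balanced training set for each CV round."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    by_class = {}
    for x, y in zip(examples, labels):
        by_class.setdefault(y, []).append(x)
    out_x, out_y = [], []
    for y, xs in by_class.items():
        # keep all originals, then draw random duplicates up to the target
        picks = xs + [rng.choice(xs) for _ in range(target - len(xs))]
        out_x += picks
        out_y += [y] * target
    return out_x, out_y
```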
Experiment 3: Pipelined revision extraction and classification. In this experiment, revision extraction and Experiment 1 are combined into a pipelined approach (we leave pipelined fine-grained classification to the future). The output of sentence alignment is used as the input to the classification task. The accuracy of sentence alignment is 0.9177 on C1 and 0.9112 on C2. The predicted Add and Delete revisions are directly classified as text-based changes. Features are used as in Experiment 1.

Evaluation
In the intrinsic evaluation, we compare different feature groups' importance. Paired t-tests are utilized to compare whether there are significant differences in performance. Performance is measured using unweighted F-score. In the extrinsic evaluation, we repeat the corpus study from Section 3 using the predicted counts of revision. If the results in the intrinsic evaluation are solid, we expect that a similar conclusion could be drawn with the results from either predicted or manually annotated revisions.
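Here "unweighted" means macro-averaged over classes, i.e. each class contributes equally regardless of its frequency. A minimal sketch (our own helper, not the evaluation code used in the paper):

```python
def unweighted_prf(gold, pred):
    """Macro-averaged ("unweighted") precision, recall and F-score:
    compute per-class scores, then average with equal weight per class."""
    classes = sorted(set(gold) | set(pred))
    ps, rs, fs = [], [], []
    for c in classes:
        tp = sum(1 for g, h in zip(gold, pred) if g == c and h == c)
        fp = sum(1 for g, h in zip(gold, pred) if g != c and h == c)
        fn = sum(1 for g, h in zip(gold, pred) if g == c and h != c)
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        ps.append(p)
        rs.append(r)
        fs.append(f)
    n = len(classes)
    return sum(ps) / n, sum(rs) / n, sum(fs) / n
```

Unlike a weighted average, this metric is not dominated by the majority class, which matters given the skewed distribution of revision categories.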
Intrinsic evaluation. Tables 6 and 7 present the results of the classification between surface and text-based changes on corpora C1 and C2. Results show that for both corpora, our learned models significantly beat the majority and unigram baselines on unweighted precision, recall and F-score; the F-score for both corpora is approximately 55. Tables 8 and 9 show the classification results for the fine-grained categories. Our results are not significantly better than the unigram baseline on Evidence for C1 and C2 and on Claim for C2. While the poor performance on Evidence might be due to the skewed class distribution, our model also performs better on Conventions, where there are not many instances. For the categories where our model performs significantly better than the baselines, we see that the location features are the best features to add to unigrams for the text-based changes (significantly better than baselines except for Evidence), while the language and textual features are better for surface changes. We also see that using all features does not always lead to better results, probably due to overfitting. Replicating the experiments in two corpora also demonstrates that our schema and features can be applied across essays with different topics (Dante vs. poverty) written in different types of courses (advanced placement or not) with similar results.
For the intrinsic evaluation of our pipelined approach (Experiment 3), as the extracted revisions are not exactly the same as the revisions annotated by humans, we only report unweighted precision and unweighted recall here: C1 (p: 40.25, r: 45.05) and C2 (p: 48.08, r: 54.30). A paired t-test shows that the results drop significantly compared to Tables 6 and 7 because of the errors made in revision extraction, although they still outperform the majority baseline.
Extrinsic evaluation. According to Table 10, the conclusions drawn from the predicted revisions and the annotated revisions are similar (Table 5). Text-based changes are significantly correlated with writing improvement, while surface changes are not. We can also see that the correlation coefficient for the predicted text-based changes is close to that of the manually annotated results.

Conclusion and current directions
This paper contributes to the study of revisions in argumentative writing. A revision schema is defined for revision categorization, and two corpora are annotated based on the schema. The agreement study demonstrates that the defined categories can be reliably annotated by humans, and a study of the annotated corpus demonstrates the utility of the annotation for revision analysis. For automatic revision classification, our system beats the unigram baseline in the classification of higher-level categories (surface vs. text-based). However, the difficulty increases for fine-grained category classification, and results show that different feature groups are required for classifying different revision purposes. Results of extrinsic evaluations show that the automatically analyzed revisions can be used for writer improvement prediction.
In the future, we plan to annotate revisions from different student levels (college-level, graduate level, etc.), as our current annotations lack full coverage of all revision purposes (e.g. "Rebuttal/Reservation" rarely occurs in our high school corpora). We also plan to annotate data from other educational genres (e.g. technical reports, science papers, etc.) to see whether the schema generalizes, and to explore more category-specific features to improve the fine-grained classification results. In the longer term, we plan to apply our revision predictions in a summarization or learning analytics system for teachers, or in a tutoring system for students.