Using Context to Predict the Purpose of Argumentative Writing Revisions

While there is increasing interest in automatically recognizing the argumentative structure of a text, recognizing the argumentative purpose of revisions to such texts has been less explored. Furthermore, existing revision classification approaches typically ignore contextual information. We propose two approaches for utilizing contextual information when predicting argumentative revision purposes: developing contextual features for use in the classification paradigm of prior work, and transforming the classification problem to a sequence labeling task. Experimental results using two corpora of student essays demonstrate the utility of contextual information for predicting argumentative revision purposes.


Introduction
Incorporating natural language processing into systems that provide writing assistance beyond grammar is an area of increasing research and commercial interest (e.g., (Writelab, 2015; Roscoe et al., 2015)). As one example, the automatic recognition of the purpose of each of an author's revisions allows writing assistance systems to provide better rewriting suggestions. In this paper, we propose context-based methods to improve the automatic identification of revision purposes in student argumentative writing. Argumentation plays an important role in analyzing many types of writing such as persuasive essays, scientific papers (Teufel, 2000) and law documents (Palau and Moens, 2009). In student papers, identifying revision purposes with respect to argument structure has been used to predict the grade improvement in the paper after revision (Zhang and Litman, 2015).
Existing work on the analysis of writing revisions (Adler et al., 2011; Bronner and Monz, 2012; Daxenberger and Gurevych, 2013; Zhang and Litman, 2015) typically compares two versions of a text to extract revisions, then classifies the purpose of each revision in isolation. That is, while limited contextual features such as revision location have been utilized in prior work, such features are computed from the revision being classified but typically not from its neighbors. In addition, ordinary classifiers rather than structured prediction models are typically used. To increase the role of context during prediction, in this paper we 1) introduce new contextual features (e.g., the impact of a revision on local text cohesion), and 2) transform revision purpose classification to a sequential labeling task to capture dependencies among revisions (as in Table 1). An experimental evaluation demonstrates the utility of our approach.

Related Work
There are multiple works on the classification of revisions (Adler et al., 2011; Javanmardi et al., 2011; Bronner and Monz, 2012; Daxenberger and Gurevych, 2013; Zhang and Litman, 2015). While different classification tasks were explored, similar approaches were taken by extracting features (location, text, meta-data, language) from the revised text to train a classification model (SVM, Random Forest, etc.) on the annotated data. One problem with prior works is that the contextual features used were typically shallow (location), while we capture additional contextual information as text cohesion/coherence changes and revision dependencies.

[1] Writer Richard Louv emphasises this expanding chasm between people and nature and tries to convince people to go back to nature through his parallelism and pathos.
[First Revision: 1->1, Type: Claim, Modify], [Second Revision: 2->null, Type: Warrant, Delete]
Table 1: Example dependency between Claim and Warrant revisions. Sentence 1 acts as the Claim (argument structure) of Draft 1 and sentence 2 acts as the Warrant for the Claim. Sentence 1 in Draft 1 is modified to sentence 1 (also acts as the Claim) of Draft 2. Sentence 2 in Draft 1 is deleted in Draft 2. The first revision is a Claim revision as it modifies the Claim of the paper by removing "rhetorical questions." This leads to the second Warrant revision, which deletes the Warrant for "rhetorical questions."
As our task focuses on identifying the argumentative purpose of writing revisions, work in argument mining is also relevant. In fact, many features for predicting argument structure (e.g., location, discourse connectives, punctuation) (Moens et al., 2007; Palau and Moens, 2009; Feng and Hirst, 2011) are also used in revision classification. In addition, Lawrence et al. (2014) use changes in topic to detect argumentation, which leads us to hypothesize that different types of argumentative revisions will have different impacts on text cohesion and coherence. Guo et al. (2011) and Park et al. (2015) both utilize Conditional Random Fields (CRFs) for identifying argumentative structures. While we focus on the different task of identifying revisions to argumentation, we similarly hypothesize that dependencies exist between revisions and thus utilize CRFs in our task. While our task is similar to argument mining, a key difference is that the revisions do not always appear near each other. For example, a five-paragraph essay might have only two or three revisions located in different paragraphs. Thus, the types of previous revisions cannot always be used as contextual information. Moreover, the type of a revision is not necessarily the argument type of its revised sentence. For example, a revision to an Evidence sentence can be just a correction of spelling mistakes.

Data Description
Revision purposes. To label our data, we adapt the schema defined in (Zhang and Litman, 2015) as it can be reliably annotated and is argument- (Faigley and Witte, 1981). As we focus on argumentative changes, we merge all the Surface subcategories into one Surface category. As Zhang and Litman (2015) reported that both Rebuttals and multiple labels for a single revision were rare, we merge Rebuttal and Warrant into one Warrant category and allow only a single (primary) label per revision.

Corpora. Our experiments use two corpora consisting of Drafts 1 and 2 of papers written by high school students taking AP-English courses; papers were revised after receiving and generating peer feedback. Corpus A was collected in our earlier paper (Zhang and Litman, 2015), although the original annotations were modified as described above. It contains 47 paper draft pairs about placing contemporaries in Dante's Inferno. Corpus B was collected in the same manner as A, with an agreement Kappa of 0.69. It contains 63 paper draft pairs explaining the rhetorical strategies used by the speaker/author of a previously read lecture/essay. Both corpora were double coded and gold standard labels were created upon agreement of two annotators. Two example annotated revisions from Corpus B are shown in Table 1, while the distribution of annotated revision purposes for both corpora is shown in Table 2.

Adding contextual features
Our previous work (Zhang and Litman, 2015) used three types of features primarily from prior work (Adler et al., 2011; Bronner and Monz, 2012; Daxenberger and Gurevych, 2013) for argumentative revision classification. Location features encode the location of the sentence in the paragraph and the location of the sentence's paragraph in the essay. Textual features encode revision operation, sentence length, edit distance between aligned sentences and the difference in sentence length and punctuation numbers. Language features encode part of speech (POS) unigrams and difference in POS tag counts.
We implement this feature set as the baseline as our tasks are similar, then propose two new types of contextual features. The first type (Ext) extends prior work by extracting the baseline features not only from the aligned sentence pair representing the revision in question, but also from the sentence pairs before and after the revision. The second type (Coh) measures the cohesion and coherence changes in a 2-sentence block around the revision 2.
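As a rough illustration of the Ext idea, the sketch below copies each revision unit's baseline features and adds the features of the units immediately before and after it. The function name `extend_with_neighbors`, the `prev_`/`next_` prefixes, the dictionary representation, and the choice to simply omit neighbor features at the start and end of the sequence are all assumptions made for this example, not details given in the paper.

```python
def extend_with_neighbors(unit_features):
    """Add the previous and next unit's features to each unit's feature dict.

    unit_features: list of dicts, one per aligned sentence pair, in essay order.
    Neighbor features are prefixed with 'prev_' / 'next_' (hypothetical naming).
    Units at the boundaries simply have no neighbor features on that side.
    """
    extended = []
    last = len(unit_features) - 1
    for i, feats in enumerate(unit_features):
        row = dict(feats)  # copy so the input is not modified
        if i > 0:
            row.update({f"prev_{k}": v for k, v in unit_features[i - 1].items()})
        if i < last:
            row.update({f"next_{k}": v for k, v in unit_features[i + 1].items()})
        extended.append(row)
    return extended


# Toy example with a single made-up feature per unit.
baseline = [{"length": 3}, {"length": 5}, {"length": 2}]
ext = extend_with_neighbors(baseline)
```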
Utilizing the cohesion and coherence difference. Inspired by (Lee et al., 2015; Vaughan and McDonald, 1986), we hypothesize that different revisions can have different impacts on the cohesion and coherence of the essay. We propose to extract features for both impact on cohesion (lexical) and impact on coherence (semantic). Inspired by (Hearst, 1997), sequences of blocks are created for sentences.

Two types of features are extracted. The first type describes the cohesion and coherence between the revised sentence and its adjacent sentences. The similarities (lexical/semantic) between the revised sentence block and the sentence block before, $\mathit{Sim}(\mathit{Block\_Up}, \mathit{Block\_Up\_Self})$, and after, $\mathit{Sim}(\mathit{Block\_Down}, \mathit{Block\_Down\_Self})$, are calculated as the cohesion/coherence scores Coh_Up and Coh_Down. The features are extracted separately for Draft 1 and Draft 2 sentences. The second type describes the impact of sentence modification on cohesion and coherence. Features Change_Up and Change_Down are extracted as the ratio of the cohesion/coherence scores of the two drafts:

$$\mathit{Change\_Up} = \frac{\mathit{Coh\_Up}(\mathit{Draft2})}{\mathit{Coh\_Up}(\mathit{Draft1})}, \qquad \mathit{Change\_Down} = \frac{\mathit{Coh\_Down}(\mathit{Draft2})}{\mathit{Coh\_Down}(\mathit{Draft1})}$$

A bag-of-words representation is generated for each sentence block after stop-word filtering and stemming. Jaccard similarity is used for the calculation of lexical similarity between sentence blocks. Word embedding vectors (Mikolov et al., 2013) are used for the calculation of semantic similarity. A vector is calculated for each sentence block by summing up the embedding vectors of words that are not stop-words. Afterwards the similarity is calculated as the cosine similarity between the block vectors. This approach has been taken by multiple groups in the SemEval-2015 semantic similarity task (SemEval-2015 Task 1) (Xu et al., 2015).

2 In this paper we consider the most adjacent sentence only.
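The similarity and ratio computations described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes each block is already a list of stemmed, stop-word-filtered tokens, that embeddings are supplied as a plain dictionary from token to vector, and that a zero Draft 1 score yields a caller-chosen default (the text does not say how that case is handled). All function and parameter names are hypothetical.

```python
import math


def jaccard(block_a, block_b):
    """Lexical similarity: Jaccard overlap of the two blocks' token sets."""
    a, b = set(block_a), set(block_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def block_vector(block, embeddings, dim):
    """Sum the embedding vectors of the tokens in a block (unknown tokens skipped)."""
    vec = [0.0] * dim
    for token in block:
        for i, value in enumerate(embeddings.get(token, ())):
            vec[i] += value
    return vec


def cosine(u, v):
    """Semantic similarity: cosine between two block vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def change_ratio(coh_draft1, coh_draft2, default=0.0):
    """Change feature: Draft 2 score divided by Draft 1 score.

    Returning `default` when the Draft 1 score is zero is an assumption.
    """
    return coh_draft2 / coh_draft1 if coh_draft1 else default


# Toy example with made-up tokens and 2-dimensional embeddings.
toy_embeddings = {"natur": [1.0, 2.0], "peopl": [3.0, 4.0]}
lexical = jaccard(["natur", "peopl", "chasm"], ["peopl", "chasm", "patho"])
summed = block_vector(["natur", "peopl", "unknown"], toy_embeddings, dim=2)
```

The same `change_ratio` helper would apply to either the lexical or the semantic score, once computed on the corresponding blocks of each draft.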

Transforming to sequence labeling
To capture dependencies among predicted revisions, we transform the revisions to a consecutive sequence and label it with Conditional Random Fields (CRFs) as demonstrated in Figure 2. For both drafts, sentences are sorted according to their order of occurrence in the essay. Aligned sentences are put into the same row and each aligned pair of sentences is treated as a unit of revision. The "cross-aligned" pairs of sentences (which occur rarely) are broken into deleted and added sentences (i.e., the cross-aligned sentences in Draft 1 are treated as deleted and the sentences in Draft 2 are treated as added). After generating the sequence, each revision unit in the sequence is assigned the revision purpose label according to the annotations, with unchanged sentence pairs labeled as Nochange.
We conducted labeling on both essay-level and paragraph-level sequences. Essay-level labeling treats the whole essay as a single sequence segment, while paragraph-level labeling treats each paragraph as a segment. After labeling, the label of each changed sentence pair is marked as the purpose of the revision 7.
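The construction of label sequences can be sketched as below. The representation is an assumption made for illustration: each revision unit is a tuple `(draft1_index, draft2_index, paragraph_id)` already sorted by position in the essay, with `None` marking an added or deleted sentence, and annotations are a dictionary keyed by the index pair. How a unit is assigned to a paragraph when the two drafts differ is not specified in the text, so the paragraph id is simply taken as given. The resulting sequences (together with per-unit features) would then be passed to a CRF toolkit.

```python
from itertools import groupby


def build_label_sequences(units, annotations, level="essay"):
    """Turn ordered revision units into label sequences for sequence labeling.

    units: list of (draft1_index, draft2_index, paragraph_id) in essay order.
    annotations: dict mapping (draft1_index, draft2_index) to a revision purpose.
    level: "essay" gives one sequence; "paragraph" gives one per paragraph.
    Units without an annotation are unchanged pairs and get "Nochange".
    """
    labeled = [
        (para, annotations.get((d1, d2), "Nochange")) for d1, d2, para in units
    ]
    if level == "essay":
        return [[label for _, label in labeled]]
    return [
        [label for _, label in group]
        for _, group in groupby(labeled, key=lambda item: item[0])
    ]


# Toy essay modeled on Table 1: sentence 1 modified, sentence 2 deleted,
# and a third unchanged sentence placed in a second paragraph (made up).
units = [(1, 1, 0), (2, None, 0), (3, 2, 1)]
annotations = {(1, 1): "Claim", (2, None): "Warrant"}
essay_level = build_label_sequences(units, annotations, level="essay")
paragraph_level = build_label_sequences(units, annotations, level="paragraph")
```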

Experiments and Results
Our prior work (Zhang and Litman, 2014) proposed an approach for the alignment of sentences. The approach achieves 92% accuracy on both corpora. In this paper we focus on the prediction task and assume we have gold-standard sentence alignments 8 . The first four columns of Table 3 show the performance of baseline features with and without our new contextual features using an SVM prediction model 9 . The last four columns show the performance of CRFs 10 . All experiments are conducted using 10-fold (student) cross-validation with 300 features selected using learning gain ratio 11 .
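One reading of "10-fold (student) cross-validation" is that folds are formed over students, so that all revisions from one student's paper fall into the same fold. The sketch below illustrates that reading only; the round-robin assignment of students to folds and the function name are assumptions, not details from the paper.

```python
def student_folds(student_ids, k=10):
    """Group revision indices into k folds so that no student spans two folds.

    student_ids: one student identifier per revision, in dataset order.
    Returns a list of k lists of revision indices (some may be empty).
    Students are assigned to folds round-robin in sorted order (an assumption).
    """
    students = sorted(set(student_ids))
    fold_of = {student: i % k for i, student in enumerate(students)}
    folds = [[] for _ in range(k)]
    for index, student in enumerate(student_ids):
        folds[fold_of[student]].append(index)
    return folds


# Toy example: five revisions written by three students, split into two folds.
toy_folds = student_folds(["a", "a", "b", "c", "b"], k=2)
```

Each fold in turn would serve as the test set while the remaining folds are used for feature selection and training.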
For the SVM approach, we observe that the Coh features yield a significant improvement over the baseline features in Corpus B, and a nonsignificant improvement in Corpus A. This indicates that changes in text cohesion and coherence can indeed improve the prediction of argumentative revision types. The Ext feature set, which computes features for not only the revision but also its immediately adjacent sentences, also yields a slight (although not significant) improvement. However, adding the two feature sets together does not further improve the performance using the SVM model. The CRF approach almost always yields the best results for both corpora, with all such CRF results better than all other results. This indicates that dependencies exist among argumentative revisions that cannot be identified with traditional classification approaches.

7 Revisions on cross-aligned pairs are marked as Surface.
8 Similar to settings in (Daxenberger and Gurevych, 2013).
9 We compared three models used in discourse analysis and revision classification (C4.5 Decision Tree, SVM and Random Forests) (Bronner and Monz, 2012) and SVM yielded the best performance.
10 SVM model implemented with Weka (Hall et al., 2009) and CRF model implemented with CRFSuite (Okazaki, 2007).
11 We tested with parameters 100, 200, 300, 500 on a development dataset disjoint from Corpora A and B and chose 300, which yielded the best performance.

Error Analysis
To better understand how the sequence labeling approach improves the classification performance, we counted the errors in the cross-validation results on Corpus A (where the revisions are more evenly distributed). Figure 3 compares the errors made by SVM and CRFs 12.
We notice that the CRF approach makes fewer errors than the SVM approach in recognizing Claim changes (General-Claim, Evidence-Claim, Warrant-Claim, Surface-Claim). This matches our intuition that there is a dependency between revisions on supporting materials and revisions on Claim. We also observe that the same problems exist in both approaches. The biggest difficulty is the differentiation between General and Warrant revisions, which accounts for 37.6% of the SVM errors and 40.1% of the CRF errors. It is also common that Claim and Evidence revisions are classified as Warrant revisions. Approaches need to be designed for such cases to further improve the classification performance.

12 Both use models with all the features.
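An error breakdown of this kind can be computed by tallying confused label pairs. The sketch below assumes that a pair such as General-Warrant is counted regardless of which label was gold and which was predicted; the paper does not state this, and the function name and toy labels are hypothetical.

```python
from collections import Counter


def confusion_shares(gold, predicted):
    """Share of all errors attributable to each unordered pair of labels.

    gold, predicted: equal-length lists of revision purpose labels.
    Returns a dict mapping a sorted (label, label) tuple to its error share.
    """
    errors = Counter(
        tuple(sorted((g, p))) for g, p in zip(gold, predicted) if g != p
    )
    total = sum(errors.values())
    return {pair: count / total for pair, count in errors.items()} if total else {}


# Toy example: three errors, two of which confuse General and Warrant.
gold = ["General", "Warrant", "Claim", "General"]
predicted = ["Warrant", "General", "Claim", "Claim"]
shares = confusion_shares(gold, predicted)
```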

Conclusion
In this paper we proposed different methods for utilizing contextual information when predicting the argumentative purpose of revisions in student writing. Adding features that captured changes in text cohesion and coherence, as well as using sequence modeling to capture revision dependencies, both significantly improved predictive performance in an experimental evaluation.
In the future, we plan to investigate whether performance can be further improved when more sentences in the context are included. We also plan to investigate whether revision dependencies exist in other types of corpora such as Wikipedia revisions. While the corpora used in this study cannot be published because the required IRB approval is lacking, we are starting a user study project (Zhang et al., 2016) on the application of our proposed techniques and will publish the data collected from this project.