Coarse-grained Argumentation Features for Scoring Persuasive Essays

Scoring the quality of persuasive essays is an important goal of discourse analysis, addressed most recently with high-level persuasion-related features such as thesis clarity, or opinions and their targets. We investigate whether argumentation features derived from a coarse-grained argumentative structure of essays can help predict essay scores. We introduce a set of argumentation features related to argument components (e.g., the number of claims and premises), argument relations (e.g., the number of supported claims), and the typology of argumentative structure (chains, trees). We show that these features are good predictors of human scores for TOEFL essays, both when the coarse-grained argumentative structure is manually annotated and when it is automatically predicted.


Introduction
Persuasive essays are frequently used to assess students' understanding of subject matter and to evaluate their argumentation skills and language proficiency. For instance, the prompt for a TOEFL (Test of English as a Foreign Language) persuasive writing task is: Do you agree or disagree with the following statement? It is better to have broad knowledge of many academic subjects than to specialize in one specific subject. Use specific reasons and examples to support your answer.
Automatic essay scoring systems generally use features based on grammar usage, spelling, style, and content (e.g., topics, discourse) (Attali and Burstein, 2006; Burstein, 2003). However, recent work has begun to explore the impact of high-level persuasion-related features, such as opinions and their targets, thesis clarity, and argumentation schemes (Farra et al., 2015; Song et al., 2014; Ong et al., 2014; Persing and Ng, 2015). In this paper, we investigate whether argumentation features derived from a coarse-grained, general argumentative structure of essays are good predictors of holistic essay scores. We use the argumentative structure proposed by Stab and Gurevych (2014a): argument components (major claims, claims, premises) and argument relations (support, attack). Figure 1(i) shows an extract from an essay written in response to the above prompt, labeled with a claim and two premises. The advantage of having a simple annotation scheme is two-fold: it allows for more reliable human annotations and it enables better performance for argumentation mining systems designed to automatically identify the argumentative structure (Stab and Gurevych, 2014b).
The paper has two main contributions. First, we introduce a set of argumentation features related to three main dimensions of argumentative structure: 1) features related to argument components, such as the number of claims in an essay, the number of premises, and the fraction of sentences containing argument components; 2) features related to argument relations, such as the number and percentage of supported and unsupported claims; and 3) features related to the typology of argumentative structure, such as the number of chains (see Figure 1(ii) for an example of a chain) and trees (Section 3). On a dataset of 107 TOEFL essays manually annotated with the argumentative structure proposed by Stab and Gurevych (2014a) (Section 2), we show that using all the argumentation features predicts essay scores that are highly correlated with human scores (Section 3). We discuss which features are correlated with high scoring essays vs. low scoring essays. Second, we show that the argumentation features extracted from argumentative structures automatically predicted by a state-of-the-art argumentation mining system (Stab and Gurevych, 2014b) are also good predictors of essay scores (Section 4).

Figure 1: (i) Essay extract showing a claim and two premises and (ii) the corresponding argumentative structure (i.e., chain).

Data and Annotation
We use a set of 107 essays from the TOEFL11 corpus, which was proposed for the first shared task on Native Language Identification (Blanchard et al., 2013). The essays are sampled from two prompts: P1 (shown in the Introduction) and P3. Each essay is associated with a score: high, medium, or low. From prompt P1 we selected 25 high, 21 medium, and 16 low essays, while for prompt P3 we selected 15 essays for each of the three scores.
For annotation, we used the coarse-grained argumentative structure proposed by Stab and Gurevych (2014a): argument components (major claim, claim, premise) and argument relations (support/attack). The unit of annotation is a clause. Our annotated dataset, TOEFL_arg, includes 107 major claims, 468 claims, 603 premises, and 641 sentences that do not contain any argument component. To measure inter-annotator agreement we calculated P/R/F1 measures, which account for fuzzy boundaries (Wiebe et al., 2005). The F1 measure for overlap matches (between two annotators) is 73.98% for argument components and 67.56% for argument relations.
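As a rough illustration, the sketch below computes overlap-based P/R/F1 between two annotators' span sets. The character-span representation and the "any overlap counts as a match" criterion are our simplifying assumptions; the exact matching procedure of Wiebe et al. (2005) is not reproduced here.

```python
def overlaps(a, b):
    # Two (start, end) character spans overlap if they share any offset.
    return a[0] < b[1] and b[0] < a[1]

def overlap_prf(spans_a, spans_b):
    """Overlap-based agreement: treating annotator A as the reference,
    precision = fraction of B's spans overlapping some span of A,
    recall = fraction of A's spans matched by some span of B."""
    tp_b = sum(1 for b in spans_b if any(overlaps(b, a) for a in spans_a))
    tp_a = sum(1 for a in spans_a if any(overlaps(a, b) for b in spans_b))
    p = tp_b / len(spans_b) if spans_b else 0.0
    r = tp_a / len(spans_a) if spans_a else 0.0
    return p, r, (2 * p * r / (p + r) if p + r else 0.0)

# Example: two annotators marking argument components as character spans.
# overlap_prf([(0, 40), (50, 90)], [(5, 38), (60, 95), (100, 120)])
```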

Argumentation Features for Predicting Essay Scores
A major contribution of this paper is a thorough analysis of the key features derived from a coarse-grained argumentative structure that are correlated with essay scores. Based on our annotations, we propose three groups of features (Table 1). The first group consists of features related to argument components (AC), such as the number of claims, the number of premises, and the fraction of sentences containing argument components. One hypothesis is that an essay with a higher percentage of argumentative sentences will receive a higher score. The second group consists of features related to argument relations (AR), such as the number and percentage of supported claims (i.e., claims that are supported by at least one premise) and the number and percentage of dangling claims (i.e., claims with no supporting premises). In low scoring essays, test takers often fail to justify their claims with proper premises, a phenomenon captured by the dangling claims feature. In contrast, in high scoring essays it is common to find many claims that are justified by premises. We also consider the number of attack relations and of attacks against the major claim. Finally, the third group consists of features related to the typology of argument structures (TS), such as the number of argument chains (Chain), the number of argument trees of height = 1 (Tree_h=1), and the number of argument trees of height > 1 (Tree_h>1). We define an argument chain as a claim supported by a linear chain of premises. We define Tree_h=1 as a tree structure of height 1 with more than one leaf, where the root is a claim and the leaves are premises or claims. Finally, Tree_h>1 is a tree structure of height > 1, where the root is a claim and the internal nodes and leaves are either supporting claims or supporting premises. Figure 2 shows examples of a Tree_h>1 structure, a Chain structure, and a Tree_h=1 structure. The dark nodes represent claims (C), lighter nodes can be either claims or premises (C/P), and white nodes are premises (P). Figure 1 shows an extract from an essay and the corresponding Chain structure.
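To make these definitions concrete, below is a minimal sketch computing the three feature groups over a toy graph representation of one essay's annotation. The data format and the exact typology rules (e.g., treating any single linear path of supporting nodes as a chain, and excluding chains from the Tree_h>1 count) are our simplifying assumptions, not the paper's specification.

```python
from collections import defaultdict

def argumentation_features(components, supports, n_sentences, n_arg_sents):
    """components: {id: 'MajorClaim' | 'Claim' | 'Premise'};
    supports: iterable of (child_id, parent_id) support edges,
    assumed to form a forest (no cycles)."""
    claims = [c for c, t in components.items() if t == "Claim"]
    children = defaultdict(list)
    for child, parent in supports:
        children[parent].append(child)

    def height(node):
        kids = children[node]
        return 0 if not kids else 1 + max(height(k) for k in kids)

    def is_chain(node):
        # Chain: a single linear path of supporting nodes under the claim.
        kids = children[node]
        if not kids:
            return False
        while kids:
            if len(kids) > 1:
                return False
            kids = children[kids[0]]
        return True

    supported = [c for c in claims if children[c]]
    return {
        # AC: argument component features
        "n_claims": len(claims),
        "n_premises": sum(t == "Premise" for t in components.values()),
        "frac_arg_sentences": n_arg_sents / n_sentences,
        # AR: argument relation features
        "n_supported_claims": len(supported),
        "n_dangling_claims": len(claims) - len(supported),
        # TS: typology of structures rooted at claims
        "n_chains": sum(is_chain(c) for c in claims),
        "n_trees_h1": sum(height(c) == 1 and len(children[c]) > 1
                          for c in claims),
        "n_trees_hgt1": sum(height(c) > 1 and not is_chain(c)
                            for c in claims),
    }
```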
To measure the effectiveness of the above features in predicting the holistic essay scores (high/medium/low), we use Logistic Regression (LR) learners and evaluate them using quadratic-weighted kappa (QWK) against the human scores, a methodology commonly used for essay scoring (Farra et al., 2015). QWK corrects for chance agreement between the system prediction and the human prediction, and it takes into account the extent of the disagreement between labels. Table 2 reports the performance for the three feature groups as well as their combination. Our baseline feature (bl) is the number of sentences in the essay, since essay length has been shown to be generally highly correlated with essay scores (Chodorow and Burstein, 2004). We found that all three feature groups individually are strongly correlated with the human scores, much better than the baseline feature, with the AC features showing the highest correlation. We also see that although the number of claims and premises can affect the score of an essay, the argumentative structure (i.e., how the claims and premises are connected in an essay) is also important. Combining all features gives the highest QWK score (0.803).
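Concretely, this scoring step can be sketched as follows. This is a minimal sketch: the ordinal label encoding (low=0, medium=1, high=2) and the cross-validation protocol are our assumptions, since the paper does not detail them.

```python
# Logistic regression over the Table 1 features, scored with
# quadratic-weighted kappa against the human labels.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import cross_val_predict

def evaluate_qwk(X, y):
    """X: feature matrix; y: human scores encoded as 0 (low),
    1 (medium), 2 (high)."""
    clf = LogisticRegression(max_iter=1000)
    preds = cross_val_predict(clf, X, y, cv=10)  # CV setup is an assumption
    # 'quadratic' weights penalize high<->low confusions more heavily
    # than adjacent (high<->medium) ones, respecting the ordinal scale.
    return cohen_kappa_score(y, preds, weights="quadratic")
```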
We also looked at which features are associated with high scoring vs. low scoring essays. Based on the regression coefficients, we observe that a high "number and % of dangling claims" is a strong indicator of low scoring essays, whereas the "fraction of sentences containing argument components" (AC feature), the "number of supported claims" (AR feature), and the "number of Tree_h=1 structures" and "number of Tree_h>1 structures" (TS features) have the highest correlation with high scoring essays. For example, in a good persuasive essay, test takers are inclined to use multiple premises (e.g., reasons or examples) to support a claim, which is captured by the TS and AR features. In addition, we notice that attack relations are sparse, as was the case in the Stab and Gurevych (2014b) dataset, and thus the coefficients for the attack relation features (#10, #11 in Table 1) are negligible.
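The coefficient inspection behind this analysis might look like the following hypothetical helper over a fitted multi-class model (the helper name and its use here are illustrative, not the paper's tooling):

```python
import numpy as np

def top_features(clf, feature_names, class_idx, k=5):
    """Rank features by their weight for one class (e.g., 'high' or 'low')
    of a fitted LogisticRegression; clf.coef_ has shape
    (n_classes, n_features) for a 3-class problem."""
    order = np.argsort(clf.coef_[class_idx])[::-1]
    return [(feature_names[i], float(clf.coef_[class_idx, i]))
            for i in order[:k]]
```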
In summary, our findings contribute to research on essay scoring by showing that argumentation features are good predictors of essay scores, alongside spelling, grammar, and stylistic properties of the text.

Automatic Extraction of Argumentation Features for Predicting Essay Scores
To automatically generate the argumentation features (Table 1), we first need to identify the argumentative structure: argument components (major claim, claim, and premise) and relations (support/attack). We use the approach proposed by Stab and Gurevych (2014b); in future work, we plan to use the authors' improved approach and larger dataset, released after the acceptance of this paper (Stab and Gurevych, 2016). For argument component identification, we categorize clauses into one of four classes: major claim (MC), claim (C), premise (P), and None. For argument relation identification, given a pair of argument clauses Arg1 and Arg2, the classifier decides whether the pair holds a support (S) or non-support (NS) relation (binary classification). For each essay, we extract all possible combinations of Arg1 and Arg2 within each paragraph as training data (654 S and 2503 NS instances; attack relations are few and are included in NS). To reduce the number of non-support instances, we do not consider relations that span multiple paragraphs. For both tasks we use Lexical features (e.g., unigrams, bigrams, trigrams, modal verbs, adverbs, and, for relation identification, word pairs), Structural features (e.g., the number of tokens/punctuation marks in the argument and in the sentence containing it, the position of the argument in the essay, and the position of the paragraph containing it), Syntactic features (e.g., production rules from parse trees, the number of clauses in the argument), and Indicators (discourse markers selected from the three top-level Penn Discourse Treebank (PDTB) relation senses: Comparison, Contingency, and Expansion (Prasad et al., 2008)).

We use two settings for the classification experiments, with libSVM (Chang and Lin, 2011) for both argument component and relation identification. In the first (out-of-domain) setting, we use the dataset of 90 high quality persuasive essays from Stab and Gurevych (2014b) (S&G) for training and TOEFL_arg for testing. In the second (in-domain) setting, we randomly split TOEFL_arg into 80% for training and 20% for testing (sampled equally from each category: MC, C, P, and None for argument components; S and NS for relations). Tables 3 and 4 present the classification results for identifying argument components in the first and second settings, respectively. We ran experiments for all the different feature groups and observe that, with the exception of the P class, the F1 scores for all classes are comparable to the results reported by Stab and Gurevych (2014b). One explanation for the lower performance on the P (premise) category is that the S&G dataset used for training contains higher quality essays, while two thirds of our TOEFL_arg dataset consists of medium and low scoring essays (the writing style for providing reasons or examples can differ between high and low scoring essays). When we select the top 100 features ("top100") using Information Gain (Hall et al., 2009), the F1 score for the P class improves.
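A rough sketch of this setup follows. It is not the original implementation: the clause/pair data format is illustrative, scikit-learn's SVC wraps libSVM, and SelectKBest with mutual information is a stand-in for Weka's Information Gain ranking (Hall et al., 2009).

```python
from itertools import permutations
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def relation_instances(paragraph_args, support_edges):
    """All ordered within-paragraph pairs (Arg1, Arg2), labeled S if the
    gold annotation contains a support edge Arg1 -> Arg2, else NS."""
    return [((a1, a2), "S" if (a1, a2) in support_edges else "NS")
            for a1, a2 in permutations(paragraph_args, 2)]

def classifier(k=100):
    """Shared 'top100' setup for the 4-way component task (MC/C/P/None)
    and the binary relation task (S/NS)."""
    return make_pipeline(SelectKBest(mutual_info_classif, k=k),
                         SVC(kernel="linear"))

# Out-of-domain setting (hypothetical variable names): fit on S&G essays,
# evaluate on TOEFL_arg, e.g.
# comp_clf = classifier().fit(X_sg, y_sg); comp_clf.score(X_toefl, y_toefl)
```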
The results in Table 4 show that when training and testing on the same type of essays, the results are better for all categories except MC when using the "top100" setup. Table 5 shows the results for relation identification in the first (out-of-domain) setting. The F1 score for identifying support relations is 84.3% (or 89% using top100), much higher than that reported by Stab and Gurevych (2014b). We obtain similar results when training and testing on TOEFL_arg. We observe that two specific feature groups, Structural and Lexical, individually achieve high F1 scores, and when combined with the other features they help the classifier reach F1 scores in the high 80s. There are two possible explanations for this: 1) essays in TOEFL_arg have multiple short paragraphs, so position features such as the position of the argument in the essay and in the paragraph (Structural group) are strong indicators of argument relations; and 2) due to the short paragraphs, the percentage of NS instances is lower than in the S&G dataset, hence the Lexical features (i.e., word pairs between Arg1 and Arg2) perform very well.

Based on the automatic identification of argument components and relations, we generate the argumentation features to see whether they still predict essay scores that are highly correlated with human scores. Since our goal is to compare with the manual annotation setup, we use the first setting, where we train on the S&G dataset and test on our TOEFL_arg dataset, and we select the best system setup (top100 for both tasks; Tables 3 and 5). We ran Logistic Regression learners and evaluated their performance using QWK scores. Table 6 shows that the argumentation features related to argument relations (AR) and the typology of argument structures (TS), when extracted from the automatically predicted argumentative structure, perform worse than the scores based on manual annotations (Table 2). Our error analysis shows that this is due to wrong predictions of argument components, specifically wrongly labeling claims as premises (Table 3). The AR and TS features rely on correctly identifying the claims, so a wrong prediction affects the features in these two groups even if the accuracy of support relations is high. This also explains why the argument component (AC) features still have a high correlation with human scores (0.669). When we extracted the argumentation features using gold-standard argument components and predicted argument relations, the correlation of the AR and TS features improved to 0.576 and 0.504, respectively, and the correlation of all features reached 0.769.
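Putting the pieces together, the end-to-end evaluation might be glued as sketched below. This is hypothetical code: the essay object with .clauses, .pairs, .n_sentences, .n_arg_sentences, and .score attributes is entirely our illustrative assumption, and argumentation_features and evaluate_qwk refer to the earlier sketches.

```python
def qwk_from_predicted_structure(essays, comp_clf, rel_clf):
    """comp_clf / rel_clf: fitted classifiers from the sketch above."""
    X, y = [], []
    for essay in essays:
        # Predicted component label per clause (MC / C / P / None).
        comps = {c.id: comp_clf.predict([c.features])[0]
                 for c in essay.clauses}
        # Keep only the pairs the relation classifier labels as support.
        rels = [pair for pair, feats in essay.pairs
                if rel_clf.predict([feats])[0] == "S"]
        f = argumentation_features(comps, rels, essay.n_sentences,
                                   essay.n_arg_sentences)
        X.append(list(f.values()))
        y.append(essay.score)  # ordinal label: low=0, medium=1, high=2
    return evaluate_qwk(X, y)
```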

Related Work
Researchers have begun to study the impact of features specific to the persuasive construct on student essay scores (Farra et al., 2015; Song et al., 2014; Ong et al., 2014; Persing and Ng, 2013; Persing and Ng, 2015). Farra et al. (2015) investigate the impact of opinion and target features on TOEFL essay scores. Our work goes a step further by exploring argumentation features. Song et al. (2014) show that adding features related to argumentation schemes (from manual annotation) to an automatic scoring system increases the correlation with human scores. We show that argumentation features are good predictors of human scores for TOEFL essays, both when the coarse-grained argumentative structure is manually annotated and when it is automatically predicted. Persing and Ng (2015) proposed a feature-rich approach for modeling argument strength in student essays, where the features are related to argument components. Our work explores features related to argument components, relations, and the typology of argument structures, showing that the argument relation features have the best correlation with human scores (based on manual annotation).

Conclusion
We show that argumentation features derived from a coarse-grained argumentative structure of essays are helpful in predicting essay scores that are highly correlated with human scores. Our manual annotation study shows that features related to argument relations are particularly useful. Our experiments using current methods for the automatic identification of argumentative structure confirm that distinguishing between claims and premises is a particularly hard task. This leads to lower performance in predicting essay scores using automatically generated argumentation features, especially for features related to argument relations and the typology of structure. As future work, we plan to improve the automatic methods for identifying argument components, similar to Stab and Gurevych (2016), and to use the dataset introduced by Persing and Ng (2015) to investigate how our argumentation features impact the argument strength score rather than the holistic essay score.