Argumentation: Content, Structure, and Relationship with Essay Quality

In this paper, we investigate the relationship between argumentation structures and (a) argument content and (b) holistic essay quality.


Introduction
With the advent of the Common Core Standards for Education (www.corestandards.org), argumentation, and, more specifically, argumentative writing, is receiving increased attention, along with a demand for argumentation-aware Automated Writing Evaluation (AWE) systems. However, current AWE systems typically do not consider argumentation (Lim and Kahng, 2012), and instead employ features that address grammar, mechanics, discourse structure, and syntactic and lexical richness. Developments in Computational Argumentation (CA) could bridge this gap.
Recently, progress has been made towards a more detailed understanding of argumentation in essays (Song et al., 2014; Stab and Gurevych, 2014; Persing and Ng, 2015; Ong et al., 2014). An important distinction emerging from this work is that between argumentative structure and argumentative content. Facility with argumentation structure underlies the contrast between (1) and (2) below: in (1), claims are made without support, relationships between claims are not explicit, and there is intervening irrelevant material. In (2), the argumentative structure is clear: there is a critical claim supported by a specific reason. Yet is it in fact a good argument? When choosing a provider for trash collection, how relevant is the color of the trucks? In contrast, in (3) the argumentative structure is not very explicit, yet the argument itself, if the reader is willing to engage, is actually more pertinent to the case, content-wise. Example (4) has both the structure and the content.
(1) "The mayor is stupid. People should not have voted for him. His policy will fail. The new provider uses ugly trucks."

(2) "The mayor's policy of switching to a new trash collector service is flawed because he failed to consider the ugly color of the trucks used by the new provider."

(3) "The mayor is stupid. The switch is a bad policy. The new collector uses old and polluting trucks."

(4) "The mayor's policy of switching to a new trash collector service is flawed because he failed to consider the negative environmental effect of the old and air-polluting trucks used by the new provider."

Song et al. (2014) took the content approach, annotating essays for arguments that are pertinent to the argumentation scheme (Walton et al., 2008; Walton, 1996) presented in the prompt. Thus, a critique raising undesirable side effects (examples 3 and 4) is appropriate for a prompt where a policy is proposed, while the critique in (1) and (2) is not. The authors show, using the annotations, that raising pertinent critiques correlates with holistic essay scores. They build a content-heavy automated model; the model, however, does not generalize well across prompts, since different prompts use different argumentation schemes and contexts.
We take the structure-based approach, which is independent of particular content and thus has better generalization potential. We study its relationship with the content-based approach and with overall essay quality. Our contributions are the answers to the following research questions: (1) does the use of good argumentation structure correlate with essay quality? (2) while structure and content are conceptually distinct, do they in reality go together? To answer the second question, we evaluate the ability of a structure-based system to predict content-based annotations of argumentation.

Related Work
Existing work in CA focuses on argumentation mining in various genres. Moens et al. (2007) identify argumentative sentences in newspapers, parliamentary records, court reports, and online discussions. Mochales-Palau and Moens (2009) identify argumentation structures, including claims and premises, in court cases. Other approaches focus on online comments and recognize argument components (Habernal and Gurevych, 2015), justifications (Biran and Rambow, 2011), or different types of claims (Kwon et al., 2007). Work in the context of the IBM Debater project deals with identifying claims and evidence in Wikipedia articles (Rinott et al., 2015; Aharoni et al., 2014). Peldszus and Stede (2015) identify argumentation structures in microtexts (similar to essays); they rely on several base classifiers and minimum spanning trees to recognize argumentative tree structures. Stab and Gurevych (2016) extract argument structures from essays by recognizing argument components and jointly modeling their types and the relations between them. Both approaches focus on the structure and neglect the content of arguments. Persing and Ng (2015) annotate argument strength, which is related to content, yet what makes an argument strong is not made explicit in the rubric, and the annotations are essay-level. Song et al. (2014) follow the content-based approach, annotating essay sentences for raising topic-specific critical questions (Walton et al., 2008). Ong et al. (2014) report on correlations between argument component types and holistic essay scores; they find that rule-based approaches to identifying argument components can be effective for ranking but not for rating, although they used a very small data set. In contrast, we study the relationship between content-based and structure-based approaches and investigate whether argumentation structures correlate with the holistic quality of essays using a large public data set.
In the literature on the development of argumentation skills, emphasis is placed both on the structure, namely, the need to support one's position with reasons and evidence (Ferretti et al., 2000), and on the content, namely, on evaluating the effectiveness of arguments. For example, in a study by Goldstein et al. (2009), middle-schoolers compared more and less effective rebuttals to the same original argument.

Argumentation Structure Parser
For identifying argumentation structures in essays, we employ the system by Stab and Gurevych (2016) as an off-the-shelf argument structure parser. The parser performs the following steps:
Segmentation: separates argumentative from non-argumentative text units and identifies the boundaries of argument components at the token level.
Classification: classifies each argument component as Claim, Major Claim, or Premise.
Linking: identifies links between argument components by classifying ordered pairs of components in the same paragraph as either linked or not.
Tree generation: finds tree structures (or forests) in each paragraph that optimize the results of the previous analysis steps.
Stance recognition: classifies each argument component as either for or against, in order to discriminate supporting from opposing argument components and argumentative support from attack relations.
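The linking and tree-generation steps can be sketched in Python. All names here (Component, candidate_links, to_tree) are hypothetical illustrations of the decisions the parser makes, not the actual API of the Stab and Gurevych (2016) system; the real parser optimizes the tree globally rather than greedily.

```python
from dataclasses import dataclass
from itertools import permutations

@dataclass
class Component:
    text: str
    ctype: str  # "MajorClaim" | "Claim" | "Premise" (classification step)

def candidate_links(components):
    """Linking step: every ordered pair of components in the same
    paragraph is a candidate to be classified as linked or not."""
    return list(permutations(range(len(components)), 2))

def to_tree(components, link_scores):
    """Tree generation (greedy sketch): each non-root component keeps
    its best-scoring outgoing link as its parent; Major Claims are
    treated as roots of the paragraph-level tree."""
    parent = {}
    for src, comp in enumerate(components):
        if comp.ctype == "MajorClaim":
            continue  # roots have no parent
        outgoing = {tgt: score for (s, tgt), score in link_scores.items()
                    if s == src}
        if outgoing:
            parent[src] = max(outgoing, key=outgoing.get)
    return parent

# Toy paragraph modeled on example (3)/(4) from the introduction.
para = [Component("The policy is flawed", "MajorClaim"),
        Component("The switch is a bad policy", "Claim"),
        Component("The new collector uses polluting trucks", "Premise")]
scores = {(1, 0): 0.9, (2, 1): 0.8, (2, 0): 0.3}  # hypothetical link scores
print(to_tree(para, scores))  # {1: 0, 2: 1}
```

The greedy parent selection is only a stand-in for the parser's joint optimization, but it conveys how pairwise link scores become a per-paragraph tree.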

Data
We use data from Song et al. (2014): essays written for a college-level exam requiring test-takers to criticize an argument presented in the prompt. Each sentence in each essay is classified as generic (does not raise a critical question appropriate for the argument in the prompt) or non-generic (raises an apt critical question); about 40% of sentences are non-generic. Data sizes are shown in Table 1.

Selection of Structural Elements
We use the training data to gain a better understanding of the relationship between structural and content aspects of argumentation. Each selection is evaluated using kappa against Song et al.'s (2014) generic vs. non-generic annotation.
Our first hypothesis is that any sentence where the parser detected an argument component (any claim or premise) could contain argument-relevant (non-generic) content. This approach yields kappa of 0.24 (prompt A) and 0.23 (prompt B).
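Agreement with the content-based labels can be measured with Cohen's kappa over per-sentence binary decisions. A minimal, self-contained implementation (our own sketch, not the evaluation script used in the paper):

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa between two label sequences: observed agreement
    corrected for the agreement expected from the label marginals."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    labels = set(rater_a) | set(rater_b)
    expected = sum((rater_a.count(l) / n) * (rater_b.count(l) / n)
                   for l in labels)
    return (observed - expected) / (1 - expected)

# Toy example: structure-based predictions vs. gold generic/non-generic
# labels (1 = non-generic); chance-level agreement gives kappa = 0.
gold = [1, 1, 0, 0]
pred = [1, 0, 1, 0]
print(cohens_kappa(gold, pred))  # 0.0
```

Note that the sketch assumes the two raters do not agree perfectly by chance alone (expected < 1); libraries such as scikit-learn provide an equivalent `cohen_kappa_score`.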
We observed that the linking step in the parser's output identified many cases of singleton claims, namely, claims not supported by an elaboration. For example, "The county is following wrong assumptions in the attempt to improve safety" is an isolated claim. This sentence is classified as "generic", since no specific scheme-related critique is being raised. Removing unsupported claims yields kappas of 0.28 (A) and 0.26 (B).
Next, we observed that even sentences containing supported claims are often labeled "generic". Test-takers often precede a specific critique with one or more claims that set the stage for the main critique. For example, in the following three-sentence sequence, only the last is marked as raising a critical question: "If this claim is valid we would need to know the numbers. The whole argument in contingent on the reported accidents. Less reported accidents does not mean less accidents." The parser classified these as Major Claim, Claim, and Premise, respectively. Our next hypothesis is therefore that it is the premises, rather than the claims, that are likely to contain specific argumentative content: we predict that only sentences containing a premise are "non-generic." This yields a substantial improvement in agreement, reaching kappas of 0.34 (A) and 0.33 (B).
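The resulting selection rule is easy to state as code. The sentence representation here (a set of component types per sentence) is our own simplification of the parser's token-level output:

```python
def predict_labels(sentence_components):
    """Label a sentence 'non-generic' iff the parser found at least one
    Premise in it; claim-only and component-free sentences are 'generic'."""
    return ["non-generic" if "Premise" in comps else "generic"
            for comps in sentence_components]

# The three-sentence sequence from the text: the parser assigned
# Major Claim, Claim, Premise; only the last raises a critical question.
sequence = [{"MajorClaim"}, {"Claim"}, {"Premise"}]
print(predict_labels(sequence))  # ['generic', 'generic', 'non-generic']
```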
Looking at the overall pattern of structure-based vs. content-based predictions, we note that the structure-based prediction over-generates: the ratio of false positives to false negatives is 2.9 (A) and 3.1 (B). That is, argumentative structure without argumentative content is about three times more common than the reverse. False positives include sentences that are too general ("Numbers are needed to compare the history of the roads") as well as sentences that have an argumentative form but fail to make a germane argument ("If accidents are happening near a known bar, drivers might be under the influence of alcohol").
Out of all the false negatives, 30% were cases where the argument parser predicted no argumentative structures at all (no claims of any type and no premises). Such sentences might not have a clear argumentative form but are understood as making a critique in context. For example, "What was it 3 or 4 years ago?" and "Has the population gone up or down?" look like fact-seeking questions in terms of structure, but are interpreted in context as questioning the causal mechanism presented in the prompt. Overall, in 9% of all non-generic sentences the argument parser detected no claims or premises.

Table 2 shows the evaluation of the structure-based predictions (classifying all sentences with a Premise as non-generic) on test data, in comparison with the published results of Song et al. (2014), who used content-heavy features (such as word ngrams in the current, preceding, and subsequent sentences). The results clearly show that while the structure-based prediction is inferior to the content-based one when the test data are essays responding to the same prompt as the training data, the off-the-shelf structure-based prediction is on par with the content-based prediction in the cross-prompt evaluation. Thus, when the content is expected to shift, falling back to structure-based prediction is potentially a reasonable strategy.

Table 2: Evaluation of content-based (Song et al., 2014) and structure-based prediction on content-based annotations. [Columns: System, Train, Test; cell values not recovered.]

Experiment 2: Argumentation Structure and Essay Quality
Using argumentation structure and putting forward a germane argument are distinct, not only theoretically, but also empirically, as suggested by the results of Experiment 1. In this section, we evaluate to what extent the use of argumentation structures correlates with overall essay quality.

Data
We use a publicly available set of essays written for the TOEFL test in an argue-for-an-opinion-on-an-issue genre (Blanchard et al., 2013). Although this data set was originally used for native language identification experiments, coarse-grained holistic scores (3-level scale) are provided as part of the LDC distribution. Essays were written by non-native speakers of English; we believe this increases the likelihood that fluency with argumentation structures is predictive of the score. We sampled 6,074 essays for training and 2,023 for testing, both across 8 different prompts.
In the training data, 54.5% of essays received the middle score, 11% the low score, and 34.5% the high score.

Features for essay scoring
Our feature set consists of the following essay-level aggregates: the numbers of argument components (of any type), major claims, claims, premises, supporting premises, attacking premises, arguments for, and arguments against, as well as the average number of premises per claim. Using the training data, we found that 90% Winsorization followed by a log transformation improved the correlation with scores for all features. The correlations range from 0.08 (major claims) to 0.39 (argument components).
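As a sketch of the feature transformation, assuming that "90% Winsorization" means capping each feature at its empirical 90th percentile (the paper does not spell out the exact quantile estimator), and using log(1 + x) so that zero counts are defined:

```python
import math

def winsorized_log(values, upper=0.90):
    """Cap values at the empirical `upper` quantile (simple sorted-index
    estimate), then apply log(1 + x). Returns the transformed list."""
    s = sorted(values)
    cap = s[int(upper * (len(s) - 1))]
    return [math.log1p(min(v, cap)) for v in values]

# Hypothetical per-essay premise counts: the outlier 100 is capped at
# the 90th-percentile value (9) before the log transform.
counts = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
transformed = winsorized_log(counts)
print(round(max(transformed), 4))  # log1p(9) = log(10) ≈ 2.3026
```

In practice the cap would be estimated on the training data and then applied unchanged to test essays, so that the transformation does not leak test-set statistics.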

Evaluation
To evaluate whether the use of argumentation structures correlates with holistic scores, we estimated a linear regression model using the nine argument features on the training data and evaluated it on the test data. We use Cohen's kappa, as well as Pearson's correlation and quadratically weighted kappa, the latter two being standard measures in the essay scoring literature (Shermis and Burstein, 2013). Row "Arg" in Table 3 shows the results; argument structures have a moderate positive relationship with holistic scores. More extensive use of argumentation structures is thus correlated with the overall quality of an argumentative essay. However, argumentative fluency specifically is difficult to disentangle from fluency in language production in general, manifested through the sheer length of the essay: in a timed test, a more fluent writer will be able to write more. To examine whether fluency in argumentation structures can explain additional variance in scores beyond that explained by general fluency (as approximated by the number of words in an essay), we estimated a length-only linear regression model as well as a model that uses all nine argument structure features in addition to length. As shown in Table 3, the addition of argumentation structures yields a small improvement across all measures over the length-only model.

Table 3: Prediction of holistic scores using argument structure features (Arg), length (Len), and argument structure features and length (Arg+Len). "qwk" stands for quadratically weighted kappa.
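Quadratically weighted kappa penalizes disagreements by the squared distance between score levels. A self-contained implementation for an ordinal scale (our own sketch, not the paper's evaluation code):

```python
def quadratic_weighted_kappa(y_true, y_pred, levels):
    """QWK = 1 - (weighted observed disagreement) / (weighted expected
    disagreement), with weights (i - j)^2 / (k - 1)^2 over score levels."""
    idx = {lvl: i for i, lvl in enumerate(levels)}
    k, n = len(levels), len(y_true)
    O = [[0] * k for _ in range(k)]            # observed confusion matrix
    for t, p in zip(y_true, y_pred):
        O[idx[t]][idx[p]] += 1
    row = [sum(r) for r in O]                   # true-label marginals
    col = [sum(O[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2
            num += w * O[i][j]
            den += w * row[i] * col[j] / n      # expected under independence
    return 1.0 - num / den

# On a 3-level scale like the TOEFL scores, perfect predictions give QWK = 1.
true = [1, 2, 3, 3, 2, 1]
print(quadratic_weighted_kappa(true, true, levels=[1, 2, 3]))  # 1.0
```

Unlike unweighted kappa, this measure rewards a prediction of "middle" for a "high" essay more than a prediction of "low", which is why it is standard for ordinal essay scores.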

Conclusion & Future Work
In this paper, we set out to investigate the relationship between argumentation structures, argument content, and essay quality. Our experiments suggest that (a) more extensive use of argumentation structures is predictive of better quality of argumentative writing, beyond overall fluency in language production; and (b) structure-based detection of argumentation is a possible fallback strategy for approximating argumentative content when an automated argument detection system must generalize to new prompts. Together, the two findings suggest that the structure-based approach is a promising avenue for research in argumentation-aware automated writing evaluation.
In future work, we intend to improve the structure-based approach by identifying characteristics of argument components that are too general to count as evidence of germane, case-specific argumentation on the student's part (claims like "More information is needed"), as well as by studying properties of seemingly non-argumentative sentences that nevertheless have a potential for argumentative use in context (such as fact-seeking questions). We believe this would push the envelope of structure-based analysis towards identifying arguments that have a higher likelihood of being effective.