Recognizing the Absence of Opposing Arguments in Persuasive Essays

In this paper, we introduce an approach for recognizing the absence of opposing arguments in persuasive essays. We model this task as binary document classification and show that adversative transitions in combination with unigrams and syntactic production rules significantly outperform a challenging heuristic baseline. Our approach yields an accuracy of 75.6%, corresponding to 84% of human performance, on a persuasive essay corpus covering various topics.


Introduction
Developing well-reasoned arguments is an important ability and constitutes an essential part of education programs (Davies, 2009). A frequent mistake when writing argumentative texts is to consider only arguments supporting one's own standpoint and to ignore opposing arguments (Wolfe and Britt, 2009). This tendency to ignore opposing arguments is known as myside bias or confirmation bias (Stanovich et al., 2013). It has been shown that guiding students to include opposing arguments in their writing significantly improves the argumentation quality, the precision of claims and the elaboration of reasons (Wolfe and Britt, 2009). Therefore, a system that automatically recognizes the absence of opposing arguments is likely to be effective in guiding students to improve their argumentation. For the same reason, the Common Core writing standards (www.corestandards.org) require that students be able to clarify the relation between their own standpoint and opposing arguments on a controversial topic.
Existing structural approaches to argument analysis, such as the argumentation structure parser presented by Stab and Gurevych (2016) or the approach introduced by Peldszus and Stede (2015a), recognize the internal microstructure of arguments. Although these approaches can be exploited for identifying opposing arguments, they require several consecutive analysis steps such as separating argumentative from non-argumentative text units (Moens et al., 2007), recognizing the boundaries of argument components (Goudas et al., 2014) and classifying individual arguments as supporting or opposing (Somasundaran and Wiebe, 2009). Certainly, an advantage of structural approaches is that they recognize the position of opposing arguments in text. However, knowing the position of opposing arguments is only relevant for positive feedback to the author and irrelevant for negative feedback, i.e. pointing out that opposing arguments are missing. Therefore, it is reasonable to model the recognition of missing opposing arguments as a document classification task.
The contributions of this paper are the following: first, we introduce a corpus for detecting the absence of opposing arguments that we derive from argument structure annotated essays. Second, we propose a novel model and a new feature set for detecting the absence of opposing arguments in persuasive essays. We show that our model significantly outperforms a strong heuristic baseline and an existing structural approach. Third, we show that our model achieves 84% of human performance.

Related Work
Existing approaches in computational argumentation focus primarily on the identification of arguments, their components (e.g. claims and premises) (Rinott et al., 2015; Levy et al., 2014) and their structures (Mochales-Palau and Moens, 2011; Stab and Gurevych, 2014b). Among these, only a few approaches distinguish between supporting and opposing arguments. Peldszus and Stede (2015b) use lexical, contextual and syntactic features to classify argument components as supporting or opposing. They experiment with pro/contra columns of a German newspaper and German microtexts. Similarly, their minimum spanning tree (MST) approach identifies the structure of arguments and recognizes whether an argument component belongs to the proponent or opponent (Peldszus and Stede, 2015a). However, both approaches presuppose that the components of an argument are already known. Thus, they omit important analysis steps and cannot be applied directly to recognizing the absence of opposing arguments. Stab and Gurevych (2016) present an argumentation structure parser that includes all required steps for identifying argument structures as well as supporting and opposing arguments. First, they separate argumentative from non-argumentative text units using conditional random fields (CRF). Second, they jointly model the argument component types and argumentative relations using integer linear programming (ILP), and finally they distinguish between supporting and opposing arguments. We employ this parser as a structural approach and compare it to our document classification approach for recognizing the absence of opposing arguments in persuasive essays.
Another related area is stance recognition, which aims at identifying the author's stance on a controversy by labeling a document as either "for" or "against" (Somasundaran and Wiebe, 2009; Hasan and Ng, 2014). Consequently, stance recognition systems are designed to identify the predominant stance of a text instead of recognizing the presence of less conspicuous opposing arguments.
Other approaches on argumentation in essays focus on thesis clarity (Persing and Ng, 2013), argumentation schemes (Song et al., 2014) or argumentation strength (Persing and Ng, 2015). We are not aware of any approach that focuses on recognizing the absence of opposing arguments.

Data
For our experiments, we employ an argument structure annotated essay corpus (Stab and Gurevych, 2014a; Stab and Gurevych, 2016). To the best of our knowledge, this corpus is the only available resource that exhibits an appropriate size and class distribution for detecting the absence of opposing arguments at the document level. Each essay in this corpus is annotated with argumentation structures that allow us to derive document-level annotations. The argumentation structures include arguments supporting or opposing the author's stance. Accordingly, we consider an essay as negative if it solely includes supporting arguments and as positive if it includes at least one opposing argument. Note that the manual identification of opposing arguments is a subtask of the argumentation structure identification. Both require that the annotators identify the author's stance, the individual arguments and whether each argument supports or opposes the author's stance. Thus, deriving document-level annotations from argumentation structures is a valid approach since the decisions of the annotators in both tasks are equivalent.
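The label derivation described above can be sketched as follows. The stance labels and record format are illustrative stand-ins, not the corpus's actual annotation schema:

```python
# Hypothetical sketch: deriving a document-level label from the
# stance labels of an essay's annotated arguments. The label names
# ("supports"/"opposes") are assumptions for illustration.

def derive_label(argument_stances):
    """Return 'positive' if the essay contains at least one opposing
    argument, 'negative' if it contains only supporting arguments."""
    return "positive" if "opposes" in argument_stances else "negative"

# An essay whose arguments all support the author's stance
print(derive_label(["supports", "supports"]))  # negative
# An essay containing one opposing argument
print(derive_label(["supports", "opposes"]))   # positive
```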

Inter-Annotator Agreement
To verify that the derived document-level annotations are reliable, we compare the annotations derived from the argumentation structure annotations of three independent annotators. In particular, we determine the inter-annotator agreement on a subset of 80 essays. The comparison shows an observed agreement of 90%. We obtain substantial chance-corrected agreement scores of Fleiss' κ = .786 (Fleiss, 1971) and Krippendorff's α = .787 (Krippendorff, 2004). Thus, we conclude that the derived annotations are reliable since they are only slightly below the "good reliability threshold" proposed by Krippendorff (2004).
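For illustration, the chance-corrected agreement can be computed as follows. The ratings below are a toy example, not the actual annotations of the 80 essays:

```python
# Fleiss' kappa for a fixed number of raters per item. Each row holds
# the per-category rating counts for one item; the data is a toy
# example, not the paper's annotation study.

def fleiss_kappa(ratings):
    """ratings: list of per-item category counts, each row summing to
    the number of annotators."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    # observed per-item agreement
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_bar = sum(p_i) / n_items
    # expected chance agreement from marginal category proportions
    totals = [sum(row[j] for row in ratings) for j in range(len(ratings[0]))]
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)

# Three annotators, four items, two categories (negative/positive)
ratings = [[3, 0], [0, 3], [2, 1], [3, 0]]
print(round(fleiss_kappa(ratings), 3))  # 0.625
```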

Approach
We consider the recognition of opposing arguments as a binary document classification task. Due to the size of the corpus, and to prevent errors in model assessment stemming from a particular data split (Krstajic et al., 2014), we employ a stratified and repeated 5-fold cross-validation setup. We report the average evaluation scores and the standard deviation over the 100 folds resulting from 20 iterations. For model selection, we randomly sample 10% of the training set of each run as a development set. We report accuracy, macro precision, macro recall and macro F1 scores as described by Sokolova and Lapalme (2009, p. 430). We employ the Wilcoxon signed-rank test on macro F1 scores for significance testing (significance level = .005). We preprocess the essays using several models from the DKPro framework (Eckart de Castilho and Gurevych, 2014). For tokenization as well as sentence and paragraph splitting, we employ the LanguageTool segmenter and check for line breaks. We lemmatize each token using the mate-tools lemmatizer (Bohnet et al., 2013) and apply the Stanford parser (Klein and Manning, 2003) for constituency and dependency parsing. Finally, we use a PDTB-style parser (Lin et al., 2014) and a sentiment analyzer (Socher et al., 2013) for identifying discourse relations and sentence-level sentiment scores. As a learner, we choose a support vector machine (SVM) (Cortes and Vapnik, 1995) with a polynomial kernel as implemented in Weka (Hall et al., 2009). For extracting features, we use the DKPro TC framework (Daxenberger et al., 2014).
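A minimal sketch of this evaluation setup, using scikit-learn in place of Weka and DKPro TC; the feature extraction is reduced to binary unigrams here, and the corpus is a tiny synthetic stand-in:

```python
# Sketch of stratified, repeated 5-fold cross-validation of an SVM
# with a polynomial kernel. The essays and labels are placeholders,
# not the actual corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy stand-in data: 1 = opposing arguments present, 0 = absent
essays = (["However, opponents argue that this policy fails."] * 10
          + ["Clearly, this policy is beneficial for everyone."] * 10)
labels = [1] * 10 + [0] * 10

model = make_pipeline(
    CountVectorizer(binary=True, lowercase=False),  # binary, case-sensitive unigrams
    SVC(kernel="poly"),
)
# 20 repetitions of stratified 5-fold CV -> 100 folds in total
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=20, random_state=0)
scores = cross_val_score(model, essays, labels, scoring="f1_macro", cv=cv)
print(f"macro F1: {scores.mean():.3f} +/- {scores.std():.3f}")
```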

Features
We experiment with the following features:

Unigrams (uni): To capture the lexical characteristics of an essay, we extract binary, case-sensitive unigrams.

Dependency triples (dep): The binary dependency features are triples consisting of the lemmatized governor, the lemmatized dependent and the dependency type.

Production rules (pr): We employ binary production rules extracted from the constituent parse trees (Lin et al., 2009) that occur at least five times.

Adversative transitions (adv): We assume that opposing arguments are frequently signaled by lexical indicators. We use 47 adversative transitional phrases that are compiled as a learning resource and grouped into the following categories: concession (18), conflict (12), dismissal (9), emphasis (5) and replacement (3). For each of the five categories, we add two binary features set to true if a phrase of the category is present in the surrounding paragraphs (introduction or conclusion) or in a body paragraph. Note that we consider lowercase and uppercase versions of these features, which results in a total of 20 binary features.

Sentiment features (sent): We average the five sentiment scores over all essay sentences to determine the global sentiment of an essay. In addition, we count the number of negative sentences and define a binary feature indicating the presence of a negative sentence.

Discourse relations (dis): The binary discourse features encode the type of the discourse relation and indicate whether the relation is implicit or explicit. For instance, "Contrast imp" indicates an implicit contrast relation. Note that we only consider the discourse relations of body paragraphs since the introduction frequently includes a description of the controversy, which is not relevant to the author's argumentation and whose discourse relations could mislead the learner.
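The adversative-transition features could be extracted along the following lines. The phrase lists are illustrative stand-ins for the 47-phrase resource, and simple case folding replaces the paper's separate lowercase/uppercase features:

```python
# Sketch of the adversative-transition features: two binary features
# per category (surrounding paragraphs vs. body paragraphs). The
# phrases below are examples, not the actual 47-phrase resource, and
# naive substring matching is used for brevity.

ADVERSATIVE = {
    "concession": ["admittedly", "even though", "nonetheless"],
    "conflict": ["however", "in contrast", "on the other hand"],
    "dismissal": ["in any case", "either way"],
    "emphasis": ["more importantly"],
    "replacement": ["instead", "rather"],
}

def adversative_features(intro_and_conclusion, body_paragraphs):
    """Return one binary feature per category and location."""
    surround = intro_and_conclusion.lower()
    body = " ".join(body_paragraphs).lower()
    features = {}
    for category, phrases in ADVERSATIVE.items():
        features[f"{category}_surround"] = any(p in surround for p in phrases)
        features[f"{category}_body"] = any(p in body for p in phrases)
    return features

feats = adversative_features(
    "Nowadays many people discuss this topic.",
    ["Admittedly, some claim the opposite.", "However, the benefits prevail."],
)
print(feats["concession_body"], feats["conflict_body"])  # True True
```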

Baselines
For model assessment, we use the following two baselines: first, a majority baseline that classifies each essay as negative (not including opposing arguments); second, a rule-based heuristic baseline that classifies an essay as positive if it includes the case-sensitive term "Admittedly" or the phrase "argue that", both of which often indicate the presence of opposing arguments.
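The heuristic baseline amounts to a simple rule, sketched here in Python:

```python
# Rule-based heuristic baseline as described: positive if the essay
# contains the case-sensitive "Admittedly" or the phrase "argue that",
# otherwise negative.

def heuristic_baseline(essay_text):
    if "Admittedly" in essay_text or "argue that" in essay_text:
        return "positive"
    return "negative"

print(heuristic_baseline("Admittedly, there are downsides."))       # positive
print(heuristic_baseline("Some people argue that it is harmful."))  # positive
print(heuristic_baseline("Everyone should recycle."))               # negative
```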

Results
In order to select a model and to analyze our features, we conduct feature ablation tests (lower part of Table 2) and evaluate our system with individual features. The adversative transitions and unigrams are the most informative features. Both show the best individual performance and a significant decrease if removed from the entire feature set. Thus, we conclude that lexical indicators are the most predictive features in our feature set. The sentiment and discourse features do not perform well. Individually, they do not achieve better results than the majority baseline, and the accuracy increases slightly when removing them from the entire feature set. By experimenting with various feature combinations, we found that combining unigrams, production rules and adversative transitions yields the best results (SVM uni+pr+adv).
For model assessment, we evaluate the best performing model on our test data and compare it to the baselines (upper part of Table 2). The heuristic baseline considerably outperforms the majority baseline and achieves an accuracy of 71.1%. Our best system significantly outperforms this challenging baseline with respect to all evaluation measures. It achieves an accuracy of 75.6% and a macro F1 score of .734. We determine the human upper bound by comparing pairs of annotators and averaging the results over the 80 independently annotated essays (cf. Section 3). Compared to this upper bound, our system achieves 14.4% less accuracy, which amounts to 84% of human performance. We also compare our system to an argumentation structure parser that recognizes opposing components, using a designated 80:20 train-test split (Stab and Gurevych, 2016). We consider essays with predicted opposing arguments as positive and essays for which the parser does not recognize any opposing argument as negative. This yields a macro F1 score of .648, whereas our document-level approach achieves a macro F1 score of .710 and thus considerably outperforms the component-based approach. This confirms our assumption that modeling the task as document classification outperforms structural approaches.

Error Analysis
To analyze frequent errors of our system, we manually investigate essays that are misclassified in all 100 runs of the repeated cross-validation experiment on the development set. In total, 29 positive essays are consistently misclassified as negative. As the reason for these errors, we found that the opposing arguments in these essays lack lexical indicators. In addition, we found 14 negative essays which are always misclassified as positive. Among these essays, we observe that the majority include opposition indicators (e.g. "but") that are used in another sense (e.g. expansion). Therefore, the investigation of both false negatives and false positives shows that most errors are due to misleading lexical signals. Consequently, word-sense disambiguation for identifying the senses of indicators, or the integration of domain and world knowledge for cases lacking lexical signals, could further improve the results.

Conclusion
We introduced the novel task of recognizing the absence of opposing arguments in persuasive essays. In contrast to existing structural approaches, we model this task as a document classification which does not presuppose several complex analysis steps. The analysis of several features showed that adversative transitions and unigrams are most indicative for this task. We showed that our best model significantly outperforms a strong heuristic baseline, yields a promising accuracy of 75.6%, outperforms a structural approach and achieves 84% of human performance. For future work, we plan to integrate the system in writing environments and to investigate its effectiveness for fostering argumentation skills.