Intrinsic Quality Assessment of Arguments

Several quality dimensions of natural language arguments have been investigated. Some are likely to be reflected in linguistic features (e.g., an argument’s arrangement), whereas others depend on context (e.g., relevance) or topic knowledge (e.g., acceptability). In this paper, we study the intrinsic computational assessment of 15 dimensions, i.e., learning only from an argument’s text. In systematic experiments with eight feature types on an existing corpus, we observe moderate but significant learning success for most dimensions. Rhetorical quality seems hardest to assess, and subjectivity features turn out strong, although length bias in the corpus impedes full validity. We also find that the human assessors differ more clearly from each other than from our approach.


Introduction
Good arguments help to persuade people, to compromise, or to at least understand each other better. What quality dimension is meant by good depends on the setting, though (van Eemeren and Grootendorst, 2004; Johnson and Blair, 2006). Several dimensions may be assessed computationally, as we exemplify for an argument in favor of advancing the common good, taken from Wachsmuth et al. (2017a): "While striving to make advancements for the common good you can change the world forever. Allot of people have succeded in doing so. Our founding fathers, Thomas Edison, George Washington, Martin Luther King jr, and many more. These people made huge advances for the common good and they are honored for it." The argument is well-organized (Persing et al., 2010), its premises are certainly largely acceptable (Yang et al., 2019) and relevant to the topic (Wachsmuth et al., 2017c). Whether they suffice to draw the conclusion (Stab and Gurevych, 2017) is another question, let alone how convincing the argument is (Habernal and Gurevych, 2016b). Some dimensions may be reflected in linguistic features of an argument's text. Others depend on context, require topic or background knowledge, or are inherently subjective.
In this paper, we benchmark what quality dimensions of an argument can be assessed intrinsically, i.e., when analyzing the text of an argument only. Given the corpus of Wachsmuth et al. (2017b) with 304 English debate portal arguments on 16 topics scored for 15 logical, rhetorical, and dialectical quality dimensions by three experts, we carry out systematic leave-one-topic-out cross-validation experiments. In particular, we learn standard supervised score regression on various text features; from content and distributional semantics, to style, structure, and length, to text quality, evidence, and subjectivity.
For 11 dimensions, we observe moderate but significant prediction gains of the features over a mean baseline. In line with intuition, rhetorical quality related to credibility and emotions seems hardest to assess. Features capturing subjectivity (e.g., sentiment and pronoun usage) turn out particularly effective. The length feature performs even better, though, revealing bias in the corpus and matching previous findings on other corpora (Potash et al., 2017). Follow-up experiments indicate that the experts' scores strongly differ in how well they can be assessed, and that some experts beat our features only slightly in assessing quality. Altogether, an intrinsic computational assessment of argument quality seems useful, but not sufficient on its own.

Related Work
Soon after the rise of argument mining, argument quality assessment emerged as a task (Stede and Schneider, 2018), due to its importance for applications such as Project Debater (Gleize et al., 2019). It rests on extensive theoretical discussions about what good arguments are (Johnson and Blair, 2006), what bad arguments are (Walton, 2006), and how to argue reasonably (van Eemeren and Grootendorst, 2004).
Several corpora and approaches were proposed for specific argument quality dimensions, first related to essay scoring (Persing and Ng, 2015), some of which model arguments explicitly (Wachsmuth et al., 2016). Later approaches targeted arguments from debate portals (Wei et al., 2016), student essays (Stab and Gurevych, 2017), mixed web texts (Wachsmuth et al., 2017c), and news editorials (Yang et al., 2019; El Baff et al., 2020). We use the corpus of Wachsmuth et al. (2017b), as it is the only one annotated for diverse dimensions and is claimed to reflect argument quality comprehensively. In follow-up work, Wachsmuth et al. (2017a) found correlations with the convincingness reasons of Habernal and Gurevych (2016a), and Potthast et al. (2019) as well as Gretz et al. (2020) have evaluated their annotations against the quality annotation scheme of the corpus. However, we are not aware of any previous assessment approach developed on the corpus, possibly due to its limited size (see Section 3).
We focus on features intrinsic to an argument's text. This complements the study of Potash et al. (2017), who employed external knowledge to assess convincingness. Like them, we find that longer arguments tend to be judged better. Toledo et al. (2019) limit arguments to at most 36 words, avoiding length bias but also preventing deeper reasoning. Quality in their corpus reflects which argument is preferred in case of doubt. Ultimately, such judgments remain subjective (Lukin et al., 2017). To alleviate this, El Baff et al. (2018) encode the reader's ideology and personality, but such information is often not given in practice.

Data
The corpus of Wachsmuth et al. (2017b) is a subset of 320 debate portal posts from the dataset of Habernal and Gurevych (2016a), 20 each for 16 controversial topics. Three human experts (all authors of the paper) scored all posts that they saw as arguments for the following 15 logical, rhetorical, and dialectical quality dimensions on a scale from 1 (low) to 3 (high). In line with the experiments of Wachsmuth et al. (2017a), we use only those 304 texts in Section 5 that were seen as arguments by all three experts.
Logic The main logical dimension is called cogency (Cog). It is defined based on three subdimensions: the local acceptability (LAc) of the truth of an argument's premises, the premises' local relevance (LRe) for the argument's conclusion, and their local sufficiency (LSu) to infer the conclusion.

Rhetoric The main rhetorical dimension is the effectiveness (Eff) in persuading readers. Subdimensions are the argument's clarity (Cla), the author's credibility (Cre), the appropriateness (App) of the argument's language, its success in emotional appeal (Emo), and its sequential arrangement (Arr) in the text.
Dialectic The main dialectical dimension is reasonableness (Rea), with three subdimensions: the global acceptability (GAc) of stating the argument when discussing a given issue, the argument's global relevance (GRe) for achieving agreement, and its global sufficiency (GSu) in discussing both sides of the issue.
Overall The overall quality (OvQ) reflects the subjective weighting of all 14 other quality dimensions.
Both the single expert scores and the mean score are provided for each dimension. The inter-annotator agreement for the different dimensions ranged from 0.26 (emotional appeal) to 0.51 (overall quality) in terms of Krippendorff's α. The majority score of most dimensions is 2, but scores 1 and 3 also occur frequently for many of them. Matching the hierarchical idea of the dimensions, overall quality correlates strongest with cogency, effectiveness, and reasonableness. For details on agreements, score distributions, and correlations, we refer the reader to the original paper (Wachsmuth et al., 2017b).

Approach
This paper does not aim to propose a novel quality assessment approach, but to evaluate what features of a text help to assess which quality dimension. External knowledge is included only via lexicons and embeddings. As an example, Figure 1 shows selected textual aspects of the argument from Section 1 that may be predictive of certain dimensions. We quantify these and other aspects in the following eight feature types, which are employed in linear SVMs for score regression (Chang and Lin, 2011):

Figure 1: Exemplary analysis of an argument from the used corpus for selected text features that might affect quality: content-related key phrases, text quality indicators such as spelling errors, subjective pronoun usage, length in sentences/tokens, and evidence distributions reflected by premises and conclusions. On the right, all mean quality scores of the argument in the corpus are shown (worst is 1, best is 3).

Content As often done in text classification (Aggarwal and Zhai, 2012), we aim to capture important content-related key phrases simply as part of the distribution of word 1- to 3-grams, taking all those that occur in ≥ 3% of all training texts (such thresholds were set after initial tests).
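As an illustrative sketch (not the original Java implementation), the content feature type can be reconstructed with scikit-learn's CountVectorizer, whose min_df parameter accepts the 3% document-frequency threshold as a proportion:

```python
from sklearn.feature_extraction.text import CountVectorizer

def content_features(train_texts):
    """Word 1- to 3-gram counts, keeping only n-grams that occur in
    at least 3% of the training texts (min_df as a proportion)."""
    vectorizer = CountVectorizer(ngram_range=(1, 3), min_df=0.03)
    matrix = vectorizer.fit_transform(train_texts)
    return vectorizer, matrix
```

Fitting the vectorizer on the training topics only keeps the leave-one-topic-out evaluation clean, since the vocabulary never sees the test topic.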
Embedding We capture an argument's distributional semantics by a sentence vector, using the pretrained fasttext model based on Wikipedia (Mikolov et al., 2018). Each vector position becomes one feature.
Structure In terms of structure, we look for enumeration indicators, such as "1." and "2." or "on one hand" and "on the other hand". In addition, we use the first token 1-, 2-, and 3-grams of the text as features.
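A minimal sketch of such structure features follows; the regular expression is a hypothetical stand-in, as the paper does not list the full set of enumeration indicators:

```python
import re

# Assumed indicator patterns: numbered items and common discourse markers.
ENUM_PATTERN = re.compile(
    r"^\s*\d+\.|first(?:ly)?\b|second(?:ly)?\b"
    r"|on (?:the )?one hand|on the other hand",
    re.IGNORECASE | re.MULTILINE)

def structure_features(text):
    """Count enumeration indicators and extract the first token n-grams."""
    tokens = text.split()
    return {
        "enumeration_indicators": len(ENUM_PATTERN.findall(text)),
        "first_1gram": " ".join(tokens[:1]).lower(),
        "first_2gram": " ".join(tokens[:2]).lower(),
        "first_3gram": " ".join(tokens[:3]).lower(),
    }
```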
Length Our length feature type includes normalized counts of characters, syllables, tokens, phrases, sentences, and paragraphs, along with ratios between each pair of these linguistic units.
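The length feature type can be sketched as follows; syllable and phrase counts are omitted for brevity, and unit detection is deliberately crude:

```python
import re
from itertools import combinations

def length_features(text):
    """Counts of linguistic units plus the ratio between each pair.
    Sentence and paragraph detection are rough approximations."""
    counts = {
        "characters": len(text),
        "tokens": len(text.split()),
        "sentences": max(1, len(re.findall(r"[.!?]+", text))),
        "paragraphs": max(1, text.count("\n\n") + 1),
    }
    features = dict(counts)
    for a, b in combinations(counts, 2):
        features[f"{a}_per_{b}"] = counts[a] / counts[b]
    return features
```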
Text Quality Motivated by classical essay scoring (Ke and Ng, 2019), we model text quality by spelling correctness and readability. The former is quantified as absolute and relative counts of three error types from www.languagetool.org (hints, unknown words, and others). For the latter, we calculate 10 common readability scores, including Flesch-Kincaid Reading Ease, the Gunning Fog Index, and LIX.
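For illustration, one of these measures, the Flesch Reading Ease score, can be computed with a crude vowel-group heuristic for syllables (a rough approximation, not the exact implementation used):

```python
import re

def count_syllables(word):
    # Crude heuristic: count groups of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Flesch Reading Ease = 206.835 - 1.015 * (words/sentences)
    - 84.6 * (syllables/words); higher scores mean easier text."""
    words = re.findall(r"[A-Za-z]+", text)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    syllables = sum(count_syllables(w) for w in words)
    return (206.835 - 1.015 * (len(words) / sentences)
            - 84.6 * (syllables / len(words)))
```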
Evidence On one hand, we capture the evidence given in an argument in terms of the frequency of links. On the other hand, we apply an out-of-the-box argument mining algorithm (Wachsmuth et al., 2016) that classifies each sentence into one of four argumentative unit types: thesis, conclusion, premise, or none.

Experiments
We now report on experiments with the features from Section 4 on the corpus from Section 3. In particular, we systematically study three research questions for the 15 given argument quality dimensions:

Q1. To what extent can each quality dimension be assessed from an argument's text only?

Q2. How dependent is the assessability on the subjective view of the experts?

Q3. How well do the considered features predict argument quality compared to humans?

Experimental Setup We approached all 15 dimensions using each feature type alone, feature ablation (all but one type), and all features together. We split the corpus into 16 test sets, one per topic. For each approach and topic, we trained one SVM on the other 15 topics, optimizing its C hyperparameter in 15-fold cross-validation on the training set (tested C range: 10^-4 · 2^j for 7 ≤ j ≤ 16). We then computed the mean absolute error (MAE), averaged over the 16 test sets. This leave-one-topic-out setup ensures that no topic information can be exploited in the assessment on the test sets.
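The leave-one-topic-out protocol can be sketched with scikit-learn; this is an illustrative reconstruction (the paper uses LIBSVM from Java, and a 15-fold inner cross-validation, which we replace with a 5-fold search here for brevity):

```python
import numpy as np
from sklearn.svm import LinearSVR
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV

def leave_one_topic_out_mae(X, y, topics):
    """Train a linear SVM regressor on all but one topic, test on the
    held-out topic, and average the MAE over all topics."""
    c_grid = {"C": [1e-4 * 2**j for j in range(7, 17)]}  # 10^-4 * 2^j
    maes = []
    for topic in np.unique(topics):
        train, test = topics != topic, topics == topic
        model = GridSearchCV(LinearSVR(), c_grid, cv=5,
                             scoring="neg_mean_absolute_error")
        model.fit(X[train], y[train])
        maes.append(mean_absolute_error(y[test], model.predict(X[test])))
    return float(np.mean(maes))
```

Because the C grid is searched on the training topics only, no information about the held-out topic leaks into model selection.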
To focus on the learning success, we compare the features only to the mean baseline, which always predicts the mean score of all arguments in the given training set. For the SVM with all features, we use the 16 single MAEs in a one-tailed independent t-test to check whether improvements over the baseline are significant at p < .05 (marked † below) and p < .01 (‡). The Java code for reproducing the experiments can be accessed here: http://arguana.com/software

Quality Assessment (Q1) To answer question Q1, we let all SVMs learn to assess the mean score of the three experts. Table 1 shows the MAE of each feature type (A_i), each feature ablation (A_\i), and all features (A_1-8) in comparison to the baseline for each quality dimension. The SVM with all features (A_1-8) outperforms the baseline in all cases. Only for four dimensions are the gains not significant, three of which being rhetorical: clarity (Cla), credibility (Cre), and emotional appeal (Emo). This may be due to their subjective nature, as reflected in their limited inter-annotator agreement (Wachsmuth et al., 2017b). The highest MAE reduction is achieved for local sufficiency (0.39 to 0.30), which has also been successfully assessed in previous studies (Stab and Gurevych, 2017). Other clear gains are achieved for overall quality (0.45 to 0.37) and for cogency (0.44 to 0.37). No dimension is really "solved" by the given features, but we conclude that intrinsic argument quality assessment is effective to some extent.
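The per-dimension significance test on the 16 per-topic MAEs can be sketched with SciPy (the alternative parameter requires SciPy ≥ 1.6):

```python
from scipy.stats import ttest_ind

def significant_improvement(maes_features, maes_baseline, alpha=0.05):
    """One-tailed independent t-test: are the per-topic MAEs of the
    feature-based SVM significantly lower than the baseline's?"""
    result = ttest_ind(maes_features, maes_baseline, alternative="less")
    return bool(result.pvalue < alpha)
```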
Looking at the features, we see that content (A_1) and embedding (A_2) perform rather badly, unlike in many NLP tasks. This is somewhat expected, though, due to our leave-one-topic-out setting. While feature ablation leads to the best results for some dimensions (e.g., Cre and Arr), two feature types dominate the assessment: Subjectivity (A_8) alone minimizes the MAE for five dimensions, once being the single best approach (for local relevance, LRe). Length (A_5) is even stronger, however, e.g., being best for reasonableness and overall quality. While quality may require a certain number of words, this reveals a length bias inherent to the corpus. Such bias was also found in other argument quality corpora (Potash et al., 2017). It questions the validity of the annotated scores, even if A_5 is not needed for many dimensions (see A_\5).

Subjectiveness (Q2) For Q2, we learn to assess the score of each single expert and compare our features to the baseline in Table 2. Mean scores lead to the lowest MAE, due to their natural tendency towards middle scores. We find clear differences between the experts, reflecting how subjective the assessment is: Hardly any significant learning success is observed on the scores of expert #2, whereas particularly the scores of expert #3 seem well-assessable. While this may mean that some experts are either more reliable or more influenced by surface text features, it raises the question whether assessing the mean score is the best choice.
Human vs. Machine (Q3) For Q3, finally, we evaluate how much the experts diverge from the majority score, as opposed to the all-features SVM. Since the experts could give integer scores only, for fairness we rounded the scores of the SVM before computing the MAE. Still, Table 3 reveals that expert #2 significantly beats our features on only three dimensions (Eff, GAc, and OvQ) and is even worse on five (App, Emo, Arr, GRe, and GSu). So, our features can compete with some humans. Expert #3, in contrast, clearly outperforms the SVM with a very low MAE for most dimensions. Together with the results on Q2, this suggests that some experts score more consistently than others.

Conclusion
In this focused study, we have systematically benchmarked how well argument quality can be assessed computationally on an existing corpus annotated for several quality dimensions. Modeling subjectiveness in terms of sentiment, pronoun usage, and similar seems useful on the debate portal arguments included in the corpus, at least for logical and dialectical dimensions. However, the limited corpus size naturally makes it hard to find more complex features that robustly predict argument quality. In addition, the correlation of quality and length in the corpus limits the generalizability of our findings. This calls for more large-scale and balanced argument quality corpora. First attempts in this direction have been made (Toledo et al., 2019), but the comprehensive view on quality of the corpus used here has no equal so far.