Which argument is more convincing? Analyzing and predicting convincingness of Web arguments using bidirectional LSTM

We propose a new task in the field of computational argumentation in which we investigate qualitative properties of Web arguments, namely their convincingness. We cast the problem as relation classification, where a pair of arguments having the same stance to the same prompt is judged. We annotate a large dataset of 16k pairs of arguments over 32 topics and investigate whether the relation “A is more convincing than B” exhibits properties of total ordering; these findings are used as global constraints for cleaning the crowdsourced data. We propose two tasks: (1) predicting which argument from an argument pair is more convincing and (2) ranking all arguments to the topic based on their convincingness. We experiment with feature-rich SVM and bidirectional LSTM and obtain 0.76-0.78 accuracy and 0.35-0.40 Spearman’s correlation in a cross-topic evaluation. We release the newly created corpus UKPConvArg1 and the experimental software under open licenses.


Introduction
What makes a good argument? Despite the recent achievements in computational argumentation, such as identifying argument components (Habernal and Gurevych, 2015; Habernal and Gurevych, 2016), finding evidence for claims (Rinott et al., 2015), or predicting argument structure (Peldszus and Stede, 2015; Stab and Gurevych, 2014), this question remains too hard to be answered.
Even Aristotle claimed that perceiving an argument as a "good" one depends on multiple factors (Aristotle and Kennedy (translator), 1991): not only on the logical structure of the argument (logos), but also on the speaker (ethos), emotions (pathos), or context (cairos) (Schiappa and Nordin, 2013). Experiments also show that different audiences perceive the very same arguments differently (Mercier and Sperber, 2011). A solid body of argumentation research has been devoted to the quality of arguments (Walton, 1989; Johnson and Blair, 2006), giving more profound criteria that "good" arguments should fulfill. However, the empirical evidence proving applicability of many theories falls short on everyday arguments (Boudry et al., 2015).
Since the main goal of argumentation is persuasion (Nettel and Roque, 2011; Mercier and Sperber, 2011; Blair, 2011; OKeefe, 2011), we take a pragmatic perspective on qualitative properties of argumentation and investigate a new high-level task. We asked whether we could quantify and predict how convincing an argument is. If we take Argument 1 from Figure 1, assigning a single "convincingness score" is highly subjective, given the lack of context, reader's prejudice, beliefs, etc. However, when comparing both arguments from the same example, one can decide that A1 is probably more convincing than A2, because it uses at least some statistics and addresses the health factor, whereas A2 is just harsh and attacks. We adopt pairwise comparison as our backbone approach. We propose a novel task of predicting convincingness of arguments in an argument pair, as well as ranking arguments related to a certain topic. Since no data for such a task are available, we create a new annotated corpus. We employ an SVM model with rich linguistic features as well as bidirectional Long Short-Term Memory (BLSTM) neural networks because of their excellent performance across various end-to-end NLP tasks (Goodfellow et al., 2016; Piech et al., 2015; Wen et al., 2016; Dyer et al., 2015; Rocktäschel et al., 2016).
The main contributions of this article are (1) a large annotated dataset consisting of 16k argument pairs with 56k reasons in natural language (700k tokens), (2) a thorough investigation of the annotated data with respect to properties of convincingness as a measure, and (3) an SVM model and an end-to-end BLSTM model. The annotated data, licensed under the CC-BY-SA license, and the experimental code are publicly available at https://github.com/UKPLab/acl2016-convincing-arguments.
We hope it will foster future research in computational argumentation and beyond.

Related Work
Recent years can be seen as the dawn of computational argumentation, an emerging sub-field of NLP in which natural language arguments and argumentation are modeled, searched, analyzed, generated, and evaluated. The main focus has been on analyzing argument structures, under the umbrella term argumentation mining.
Web discourse as a data source has been exploited in several tasks in argumentation mining, such as classifying propositions in user comments into three classes (verifiable experiential, verifiable non-experiential, and unverifiable) (Park and Cardie, 2014), or mapping argument components to Toulmin's model of argument in user-generated Web discourse (Habernal and Gurevych, 2015), to name a few. While these approaches are crucial for understanding the structure of an argument, they do not directly address any qualitative criteria of argumentation.
Argumentation quality has been an active topic among argumentation scholars. Walton (1989) discusses validity of arguments in informal logic, while Johnson and Blair (2006) elaborate on criteria for practical argument evaluation (namely Relevance, Acceptability, and Sufficiency). Yet, empirical research on argumentation quality does not seem to reflect these criteria and leans toward simplistic evaluation using argument structures, such as how many premises support a claim (Stegmann et al., 2011), or by the complexity of the analyzed argument scheme (Garcia-Mila et al., 2013).
To the best of our knowledge, there have been only a few attempts in computational argumentation that go deeper than analyzing argument structures (e.g., Park and Cardie (2014) mentioned above). Persing and Ng (2015) model argument strength in persuasive essays using a manually annotated corpus of 1,000 documents labeled with a 1-4 score value.
Our newly created corpus of annotated pairs of arguments might resemble recent large-scale corpora for textual inference. Bowman et al. (2015) introduced a corpus of 570k sentence pairs written by crowd-workers, the largest corpus to date. Whereas their task is to classify whether the sentence pair represents entailment, contradiction, or is neutral (thus heading towards a deep semantic understanding), our goal is to assess the pragmatic properties of the given multiple-sentence-long arguments (to what extent they fulfill the goal of persuasion). Moreover, each of our annotated argument pairs is accompanied by five textual reasons that explain the rationale behind the labeler's decision. This is, to the best of our knowledge, a unique novel feature of our data.
Pairwise assessment for obtaining relative preference was examined by Chen et al. (2013), among many others. Their system was tested on ranking documents by their reading difficulty. Relative preference annotations have also been heavily employed in assessing machine translation (Aranberri et al., 2016). In contrast to our work, the underlying relations (reading difficulty 1-12 or better translation) have well-known properties of total ordering, while convincingness of arguments is a yet unexplored task, thus no assumptions can be made a priori. There is also a substantial body of work on learning to rank, where a pairwise approach is also widely used (Cao et al., 2007). These methods have been traditionally used in IR, where the retrieved documents are ranked according to their relevance and pairs of documents are automatically sampled.
Employing LSTM for natural language inference tasks has recently gained popularity (Rocktäschel et al., 2016; Wang and Jiang, 2016; Cheng et al., 2016). These methods are usually tested on the SNLI data introduced above (Bowman et al., 2015).

Data annotation
Since assessing convincingness of a single argument directly is a very subjective task with high probability of introducing annotator's bias (because of personal preferences, beliefs, or background), we cast the problem as a relation annotation task. Given two arguments, one should be selected as more convincing, or they might be both equally convincing (see an example in Figure 1).

Sampling annotation candidates
Sampling large sets of arguments for annotation from the Web poses several challenges. First, we must be sure that the obtained texts are actual arguments. Second, the context of the argument should be known (the prompt and the stance). Finally, we need sources with permissive licenses, which allow us to release the resulting corpus further to the community. These criteria are met by arguments from two debate portals. We will use the following terminology. We use topic to refer to a subset of an on-line debate with a given prompt and a certain stance (for example, "Should physical education be mandatory in schools? -yes" is considered as a single topic). Each debate has two topics, one for each stance. Argument is a single comment directly addressing the debate prompt. Argument pair is an ordered set of two arguments (A1 and A2) belonging to the same topic; see Figure 1.
We automatically selected debates that contained at least 25 top-level arguments that were 10-110 words long (the mean for all top-level arguments was 66 ± 130 and the median 36, so we excluded the lengthy outliers in our sampling). We manually filtered out obviously silly debates (e.g., 'Superman vs. Batman') and ended up with 32 topics (the full topic list is presented together with experimental results later in Table 3). From each topic we automatically sampled n = 25-35 random arguments and created n(n − 1)/2 argument pairs by combining all selected arguments. Sampling argument pairs only from the same topics and not combining opposite stances was a design decision to mitigate annotators' bias. The order of arguments A1 and A2 in each argument pair was randomly shuffled. In total we sampled 16,927 argument pairs.
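The pair construction described above can be illustrated with a minimal Python sketch. The function name `sample_argument_pairs` and the fixed random seed are assumptions made for illustration only; they are not taken from the released experimental code.

```python
import itertools
import random


def sample_argument_pairs(arguments, seed=0):
    """Create all n*(n-1)/2 unordered pairs and randomize the A1/A2 order.

    Hypothetical helper: the name and the seed handling are illustrative.
    """
    rng = random.Random(seed)
    pairs = []
    for a, b in itertools.combinations(arguments, 2):
        # Swap at random so that the position (A1 vs. A2) carries no information.
        if rng.random() < 0.5:
            a, b = b, a
        pairs.append((a, b))
    return pairs


arguments = [f"arg{i}" for i in range(30)]
pairs = sample_argument_pairs(arguments)
print(len(pairs))  # 30 * 29 / 2 = 435
```

With 25-35 arguments per topic, this yields between 300 and 595 pairs per topic, which is consistent in magnitude with 16,927 pairs over 32 topics.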

Crowdsourcing annotations
Let us extend our terminology. Worker is a single annotator in Amazon Mechanical Turk. Reason is an explanation why A1 is more convincing than A2 (or the other way round, or why they are equally convincing). Gold reason is a reason whose label matches the gold label in the argument pair (see Figure 2).
In the HIT, workers were presented with an argument pair, the prompt, and the stance as in Figure 1. They had to choose either "A1 is more convincing than A2" (A1>A2), "A1 is less convincing than A2" (A1<A2), or "A1 and A2 are convincing equally" (A1=A2). Moreover, they were obliged to write a reason of 30-140 characters. An example of a fully annotated argument pair is shown in Figure 2. The workers were also provided with clear and crisp instructions (e.g., do not judge the truth of the proposition; be objective; do not express your position; etc.).
All 16,927 argument pairs were annotated by five workers each (85k assignments in total). We also allowed workers to express their own standpoint toward the topics. While 66% of workers had no standpoint, 14% had the opposite view and 20% the same view. This indicates that there should be no systematic bias in the data. Crowdsourcing took about six weeks in 2016 plus two weeks of pilot studies. In total, about 3,900 workers participated. Total costs including pilot studies and bonus payments were 5,520 USD.

Quality control and agreement
We performed several steps in controlling the quality of the crowdsourced data. First, we allowed only workers from the U.S. with ≥ 96% acceptance rate to work on the task. Second, we employed MACE (Hovy et al., 2013) for estimating the true labels and ranking the annotators. We set MACE's threshold parameter to 0.95 to keep only instances whose entropy is among the 95% best estimates. Third, we manually checked all the reasons for each worker. Paying more attention to workers with low MACE scores, we rejected all assignments of workers if they (1) copied and pasted the same or very similar reasons across argument pairs, (2) were only copying or rephrasing the texts from the arguments, (3) provided their opinion or were arguing, or (4) had many typos or provided obvious nonsense. In total, we rejected 1,161 assignments.
We do not report any 'standard' inter-annotator agreement measures such as Fleiss' κ or Krippendorff's α, as their suitability for crowdsourcing has been recently disputed (Passonneau and Carpenter, 2014). However, in order to estimate the human performance, we analyzed the output of the pilot study. For each argument pair, we took the best-ranked worker for that particular pair (worker ranks are globally estimated by MACE) and computed her accuracy against the estimated gold labels (a similar approach was recently reported by Nakov et al. (2016)). The best-ranked worker for each argument pair is not necessarily the globally best-ranked worker; in the pilot study, the average global rank of this hypothetical worker was 11 ± 6.6. This rank can be interpreted as a decently performing worker; the obtained score reached 0.935 accuracy.

What makes a convincing argument?
We manually examined a small sample of 200 gold reasons to find out what makes one argument more convincing than the other. A very common type of answer mentioned giving examples or actual reasons ("A1 cited several reasons to back up their argument.") and facts ("A1 cites an outside source which can be more credible than opinion"). This is not surprising, as argumentation is often perceived as reason giving (Freeley and Steinberg, 2008). Others point out strengths in explaining the reasoning or logical coherence ("A1 gives a succinct and logical answer to the argument. A2 strays away from the argument in the response."). The confirmation bias (Mercier and Sperber, 2011) also played a role ("A1 argues both viewpoints, A2 chooses a side."). Given the noisiness of Web data, some of the arguments might be non-sense, which was also pointed out as a reason ("A1 attempts to argue that since porn exists, we should watch it. A2 doesn't make sense or answer the question."). Apart from the logical structure of the argument, emotional aspects and rhetorical moves were also spotted ("A1 contributes a viewpoint based on morality, which is a stronger argument than A2, which does not argue for anything at all.", or "A1 calls for the killing of all politicians, which is an immature knee-jerk reaction to a topic. A2's argument is more intellectually presented.").

Transitivity evaluation using argument graphs
The previous section shows a variety of reasons that make one argument more convincing than other arguments. Considering "A1 is more convincing than A2" as a binary relation R, we thus asked the following research question: Is convincingness a measure with total strict order or strict weak order? Namely, is the relation R that compares convincingness of two arguments transitive, antisymmetric, and total? In particular, does it exhibit properties such that if A≥B and B≥C, then A≥C (total ordering)? We can treat arguments as nodes in a graph and argument pairs as graph edges. We will denote such a graph as an argument graph (and use nodes/arguments and edges/pairs interchangeably in this section); an argument pair A>B becomes a directed edge A → B. As the sampled argument pairs contained all argument pair combinations for each topic, we ended up with an almost fully connected argument graph for each topic (remember that we discarded 5% of argument pair annotations with lowest reliability). We further investigate the properties of the argument graphs. Transitivity is only guaranteed if the argument graph is a DAG (directed acyclic graph).
Building argument graph from crowdsourced argument pairs We build the argument graph iteratively by sampling annotated argument pairs and adding them as graph edges (see Algorithm 1). We consider two possible scenarios in the graph building algorithm. In the first scenario, we accept only argument pairs without equivalency (thus A>B is allowed but A=B is forbidden and discarded). The second scenario accepts all pairs, but since the resulting graph must be a DAG, equivalent arguments are merged into one node. We use Johnson's algorithm for finding all elementary cycles in a directed graph (Johnson, 1975).
Argument pair weights By building the argument graph from all pairs, introducing cycles into the graph seems to be inevitable, given a certain amount of noise in the annotations. We asked the following question: to what extent does the occurrence of cycles in an argument graph depend on the quality of annotations?
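The iterative construction for the first scenario (no equivalent pairs) can be sketched in Python as follows. This is a simplified illustration, not the procedure of Algorithm 1 itself: instead of rebuilding the graph and enumerating cycles with Johnson's algorithm, it uses a plain reachability check before adding each edge, which rejects exactly those edges that would close a cycle. The names `has_path` and `build_dag` are hypothetical.

```python
def has_path(adj, src, dst):
    """Depth-first search: is dst reachable from src along directed edges?"""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(adj.get(node, ()))
    return False


def build_dag(pairs):
    """Add edges (more_convincing, less_convincing) one by one.

    An edge is skipped if it would introduce a cycle (assumed policy,
    mirroring the 'report about breaking DAG' branch of the pseudocode).
    """
    adj, kept, rejected = {}, [], []
    for winner, loser in pairs:
        # winner -> loser closes a cycle iff winner is reachable from loser.
        if has_path(adj, loser, winner):
            rejected.append((winner, loser))
        else:
            adj.setdefault(winner, set()).add(loser)
            kept.append((winner, loser))
    return adj, kept, rejected


pairs = [("A", "B"), ("B", "C"), ("C", "A"), ("A", "C")]
adj, kept, rejected = build_dag(pairs)
print(kept)      # [('A', 'B'), ('B', 'C'), ('A', 'C')]
print(rejected)  # [('C', 'A')]
```

Because edges are processed in order, the result depends on which pairs come first; this is why the ordering of pairs matters for how many of them end up being discarded.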
We thus compute a weight for each argument pair. Let $e_i$ be a particular annotation pair (edge). Let $G_i$ be all labels in that pair that match the predicted gold label, and $O_i$ the opposite labels (different from the gold label). Let $v$ be a single worker's vote and $c_v$ a global worker's competence score. Then the weight $w$ of edge $e_i$ is computed as follows:

$$w(e_i) = \sigma\left(\sum_{v \in G_i} c_v - \lambda \sum_{v \in O_i} c_v\right) \qquad (1)$$

where $\sigma$ is a sigmoid function $\sigma(x) = \frac{1}{1+e^{-x}}$ to squeeze the weight into the (0, 1) interval and $\lambda$ is a penalty for opposite labels (we empirically set $\lambda$ to 10.0 to ensure strict penalization). For example, if the predicted gold label from Figure 2 were A1>A2, then $G_i$ would contain four votes and $O_i$ one vote (the last one).
This weight allows us to sort argument pairs before sampling them for building the argument graph. We test the following three strategies. As a baseline, we use random shuffling (Rand), where no prior information about the weight of the pairs is given. The other two sorting algorithms use the argument pair weight computed by Equation 1. As the worst case scenario, we sort the pairs in ascending order (Asc), which means that the "worse" pairs come first to the graph building algorithm. We used this scenario to see how much the prior pair weight information actually matters, because building a graph preferably from bad pair label estimates should cause more harm. Finally, the Desc algorithm sorts the pairs given their weight in descending order (the "better" estimates come first).
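As a small worked illustration of the pair weight, the following Python sketch assumes that the weight is the sigmoid of the summed competence scores of the agreeing workers minus λ times the summed competence scores of the disagreeing workers, as the definitions above suggest. The function names and the competence values in the usage example are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function, maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def edge_weight(gold_competences, opposite_competences, lam=10.0):
    """Weight of one argument pair (assumed form of the weighting formula).

    gold_competences: competence scores c_v of workers whose vote matches
        the estimated gold label.
    opposite_competences: competence scores c_v of workers who voted otherwise.
    lam: penalty for opposite labels (10.0 in the text).
    """
    return sigmoid(sum(gold_competences) - lam * sum(opposite_competences))


# Four agreeing workers and one disagreeing worker (made-up competences).
w = edge_weight([0.9, 0.8, 0.85, 0.7], [0.3])
print(round(w, 2))  # roughly 0.56
```

With λ = 10, even a single opposing vote from a moderately competent worker pulls the weight down sharply, whereas unanimous pairs receive weights close to 1.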
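The three ordering strategies can be expressed compactly; the sketch below is only an illustration, and the function name `sort_pairs`, the strategy strings, and the seed argument are assumptions rather than names from the released code.

```python
import random


def sort_pairs(weighted_pairs, strategy, seed=0):
    """Order argument pairs before graph building.

    weighted_pairs: list of (pair, weight) tuples.
    strategy: 'rand' (shuffle), 'asc' (worst first) or 'desc' (best first).
    """
    items = list(weighted_pairs)
    if strategy == "rand":
        random.Random(seed).shuffle(items)
    elif strategy in ("asc", "desc"):
        items.sort(key=lambda item: item[1], reverse=(strategy == "desc"))
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [pair for pair, _ in items]


weighted = [(("A", "B"), 0.9), (("B", "C"), 0.2), (("A", "C"), 0.6)]
print(sort_pairs(weighted, "desc"))  # [('A', 'B'), ('A', 'C'), ('B', 'C')]
print(sort_pairs(weighted, "asc"))   # [('B', 'C'), ('A', 'C'), ('A', 'B')]
```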
Algorithm 1: Building DAG from sorted argument pairs.
input: argumentPairs; sortingAlg
output: DAG
SortPairs(argumentPairs, sortingAlg);
finalPairs ← [];
foreach pair in argumentPairs do
    currentPairs ← [finalPairs, pair];
    /* cluster edges labeled as equal so they will be treated as a single node */
    clusters ← clusterEqNodes(currentPairs);
    /* wire the pairs into directed graph */
    g ← buildGraph(currentPairs, clusters);
    if hasCycles(g) then
        // report about breaking DAG
    else
        finalPairs += pair;
return buildGraph(finalPairs);

Measuring transitivity score We measure how "good" the graph is by a transitivity score. Here we assume that the graph is a DAG. Given two nodes A and Z, let $P_L$ be the longest path between these nodes and $P_S$ the shortest path, respectively. For example, let $P_L$ = A → B → C → Z and $P_S$ = A → D → Z. Then the transitivity score is the ratio of the lengths of the longest and the shortest path, $|P_L| / |P_S|$ (which is 1.5 in our example). The average transitivity score is then an average of transitivity scores for each pair of nodes from the graph that are connected by two or more paths. Analogously, the maximum transitivity score is the maximal value. We restrict the shortest path to be a direct edge only.
The motivation for the transitivity score is the following. If the longest path between A and Z (A → ... → Z) consists of 10 other nodes, then the total ordering property requires that there also exists a direct edge A → Z. This is indeed empirically confirmed by the presence of the shortest path between A and Z. Thus the longer the longest path and the shorter the shortest path are on average, the stronger the empirical evidence for the transitivity property. Figure 3 shows an example of an argument graph built using only non-equivalent pairs and the Desc prior sorting of argument pairs. There are a few "bad" arguments in the middle (many incoming edges, no outgoing ones) and a few very convincing arguments (large circles). Notice the high maximum transitivity score even for medium-sized nodes.
Observations First, let us compare the different sorting algorithms for each sampling strategy. As Table 1 shows, on average, 158 pairs are ignored in total when all pairs are used for sampling (26 removed by MACE and 132 by the graph building algorithm), while 164 pairs are ignored when only non-equivalent pairs are sampled (129 had already been removed a priori, namely 26 by MACE and 103 as equivalent pairs, and 35 by the graph algorithm).
The results show a tendency that, when sampling annotated argument pairs for building a DAG, sorting argument pairs by their weight based on workers' scores influences the number of pairs that break the DAG by introducing cycles. In particular, starting with more confident argument pairs, the graph grows bigger while keeping its DAG consistency. The presence of the equal relation causes cycles to break the DAG sooner as compared to argument pairs in which one argument is more convincing than the other. We interpret this finding as that it is easier for humans to judge A>B than A=B consistently across all possible pairs of arguments from a given topic.
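The transitivity score with the shortest path restricted to a direct edge can be sketched in Python as below. The sketch assumes that path length is counted in edges and that the graph is given as a dictionary of successor sets; the function name `transitivity_scores` and the choice to return zeros for a graph without qualifying node pairs are illustrative assumptions.

```python
from functools import lru_cache


def transitivity_scores(adj):
    """Average and maximum transitivity score of a DAG.

    adj: dict mapping each node to the set of its successors.
    Only node pairs joined by a direct edge AND at least one longer
    path contribute; their score is longest_path_length / 1.
    """
    def longest(src, dst):
        @lru_cache(maxsize=None)
        def go(node):
            if node == dst:
                return 0
            best = float("-inf")  # stays -inf if dst is unreachable
            for nxt in adj.get(node, ()):
                best = max(best, 1 + go(nxt))
            return best
        return go(src)

    scores = []
    for a, successors in adj.items():
        for z in successors:
            length = longest(a, z)  # edges on the longest a -> z path
            if length > 1:          # something longer than the direct edge
                scores.append(length / 1)
    if not scores:
        return 0.0, 0.0
    return sum(scores) / len(scores), max(scores)


graph = {"A": {"B", "C", "Z"}, "B": {"C"}, "C": {"Z"}, "Z": set()}
print(transitivity_scores(graph))  # (2.5, 3.0)
```

In the example, A → Z is backed by the path A → B → C → Z (score 3) and A → C by A → B → C (score 2), while B → Z has no direct edge and therefore does not contribute.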

Gold-standard corpora
Our experiments show that convincingness between a pair of arguments exhibits properties of strict total order when the possibility of two equally convincing arguments is prohibited. We thus used the above-mentioned method for graph building as a tool for posterior gold data filtering. We discard the equal argument pairs in advance and filter out argument pairs that break the DAG properties. As a result, a set of 11,650 argument pairs labeled as either A>B or A<B remains, which is summarized in Table 2. We call this corpus UKPConvArgStrict.
However, since the total strict ordering property of convincingness is only an empirically confirmed working hypothesis, we also propose another realistic application. We construct a mixed graph by treating equal argument pairs (A=B) as undirected edges. Using PageRank, we rank the arguments (nodes) globally. The higher the PageRank for a particular node is, the "less convincing" the argument is (it has a globally higher probability of incoming edges). This allows us to rank all arguments for a particular topic. We call this dataset UKPConvArgRank (see Table 2).
We also release the full dataset UKPConvArgAll. In this data, no global filtering using graph construction methods is applied, only the local pre-filtering using MACE. We believe this dataset can be used as supporting training data for some tasks that do not rely on the property of total ordering. Along with the actual argument texts, all the gold-standard corpora contain the reasons as well as full workers' information and debate meta-data.
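The ranking step can be illustrated with a small pure-Python PageRank by power iteration. This is a sketch under stated assumptions: the damping factor 0.85, the fixed number of iterations, the uniform redistribution of rank from nodes without outgoing edges, and the representation of an undirected edge as two directed ones are common choices, not details given in the text. The function name `pagerank` is hypothetical.

```python
def pagerank(nodes, directed, undirected=(), damping=0.85, iterations=100):
    """Power-iteration PageRank on a mixed graph.

    directed: (more_convincing, less_convincing) edges, i.e. A>B is A -> B.
    undirected: equally convincing pairs, added in both directions.
    """
    out = {node: set() for node in nodes}
    for a, b in directed:
        out[a].add(b)
    for a, b in undirected:
        out[a].add(b)
        out[b].add(a)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[node] for node in nodes if not out[node])
        new = {node: (1.0 - damping) / n + damping * dangling / n
               for node in nodes}
        for node in nodes:
            for target in out[node]:
                new[target] += damping * rank[node] / len(out[node])
        rank = new
    return rank


nodes = ["A", "B", "C", "D"]
rank = pagerank(nodes,
                directed=[("A", "B"), ("A", "C"), ("B", "C")],
                undirected=[("C", "D")])
# Lower PageRank means fewer incoming edges, i.e. a more convincing argument.
ranking = sorted(nodes, key=lambda node: rank[node])
print(ranking)
```

In this toy graph, A is never judged less convincing than any other argument, so it receives the lowest PageRank and comes first in the ranking.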

Experiments
We experiment with two machine learning algorithms on two tasks using the two new benchmark corpora (UKPConvArgStrict and UKPConvArgRank). In both tasks, we perform 32-fold cross-topic cross-validation (one topic is test data, remaining 31 topics are training ones). This rather challenging setting ensures that no arguments are seen in both training and test data.
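The cross-topic setting amounts to leave-one-topic-out cross-validation. A minimal sketch follows, assuming each instance is stored as a (topic, example) tuple; the generator name `cross_topic_folds` and this data layout are illustrative assumptions.

```python
def cross_topic_folds(instances):
    """Yield one fold per topic: (test_topic, train_examples, test_examples).

    instances: list of (topic, example) tuples.
    """
    topics = sorted({topic for topic, _ in instances})
    for test_topic in topics:
        train = [ex for topic, ex in instances if topic != test_topic]
        test = [ex for topic, ex in instances if topic == test_topic]
        yield test_topic, train, test


data = [("t1", "p1"), ("t1", "p2"), ("t2", "p3"), ("t3", "p4"), ("t3", "p5")]
for test_topic, train, test in cross_topic_folds(data):
    print(test_topic, len(train), len(test))
```

Since argument pairs are only formed within a topic, holding out a whole topic guarantees that no argument text appears on both sides of the split.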

Predicting convincingness of pairs
Since this task is a binary classification and the classes are equally distributed (see Table 2), we report accuracy and average the final score over folds (Forman and Scholz, 2010).
Methods As a "traditional" method, we employ SVM with RBF kernel (using LIBSVM (Chang and Lin, 2011)) based on a large set of rich linguistic features. They include uni- and bi-gram presence, ratio of adjective and adverb endings that may signal neuroticism (Corney et al., 2002), contextuality measure (Heylighen and Dewaele, 2002), dependency tree depth, ratio of exclamation or quotation marks, ratio of modal verbs, counts of several named entity types, ratio of past vs. future tense verbs, POS n-grams, presence of dependency tree production rules, seven different readability measures (e.g., ARI (Senter and Smith, 1967), Coleman-Liau (Coleman and Liau, 1975), Flesch (Flesch, 1948), and others), five sentiment scores (from very negative to very positive) (Socher et al., 2013), spell-checking using standard Unix words, ratio of superlatives, and some surface features such as sentence lengths, longer words count, etc. The resulting feature vector dimension is about 64k.
We also use a bidirectional Long Short-Term Memory (BLSTM) neural network for end-to-end processing. The input layer relies on pre-trained word embeddings, in particular GloVe (Pennington et al., 2014) trained on 840B tokens from Common Crawl; the embedding weights are further updated during training. The core of the model consists of two bi-directional LSTM networks with 64 output neurons each. Their output is then concatenated into a single drop-out layer and passed to the final sigmoid layer for binary predictions. We train the network with the ADAM optimizer (Kingma and Ba, 2015) using the binary cross-entropy loss function and regularize by early stopping (5 training epochs) and a high drop-out rate (0.5) in the dropout layer. For both models, each training/test instance simply concatenates A1 and A2 from the argument pair.
As shown in Table 3, SVM (0.78) outperforms BLSTM (0.76) with a subtle but significant difference. It is also apparent that some topics are more challenging regardless of the system (e.g., "Is it better to have a lousy father or to be fatherless? -Lousy father"). Both systems outperform a simple baseline (lemma n-gram presence features with SVM, not reported in detail, achieved 0.65 accuracy) but still do not reach the human upper bounds (0.93 as reported in Section 3.3).
We examined about fifty random false predictions to gain some insight into the limitations of both systems. We looked into argument pairs in which both methods failed, as well as into instances where only one model was correct. BLSTM won in a few cases by properly catching jokes or off-topic arguments; SVM was properly catching all-upper-case arguments (considered as less convincing). By examining failures common to both systems, we found several cases where the prediction was wrong due to very negative sentiment (which might be a sign of the less convincing argument), but in other cases an argument with strong negative sentiment was actually the more convincing one.
In general, we did not find any tendency in the failures; they were also independent of the worker assignments distribution, thus not caused by likely ambiguous (hard) instances.

Ranking arguments
We address this problem as a regression task. We use the UKPConvArgRank data, in which a real-value score is assigned to each argument so the arguments can be ranked by their convincingness (for each topic independently). The task is thus to predict a real-value score for each argument from the test topic (remember that we use 32-fold cross-validation). We measure Spearman's and Pearson's correlation coefficients on all results combined (not on each fold separately).
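The evaluation on combined results can be illustrated as follows: predictions from all folds are pooled first, and the correlation is computed once on the pooled lists. The sketch below implements Pearson's r directly and Spearman's ρ as Pearson's r on average ranks; the function names and the toy scores are assumptions for illustration, and the code does not guard against constant inputs (which would cause a division by zero).

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson's correlation coefficient of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Pool (gold, predicted) scores from all folds, then correlate once.
folds = [([0.1, 0.4, 0.2], [0.2, 0.5, 0.1]), ([0.3, 0.9], [0.4, 0.6])]
gold = [g for fold_gold, _ in folds for g in fold_gold]
pred = [p for _, fold_pred in folds for p in fold_pred]
print(spearman(gold, pred), pearson(gold, pred))
```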
Without any modifications, we use the same SVM and features as described in Section 4.1. Regarding the BLSTM, we only replace the output layer with a linear activation function and optimize mean absolute error loss. Table 4 shows that SVM outperforms BLSTM. All correlations are highly statistically significant.

Results discussion
Although the "traditional" SVM with rich linguistic features outperforms BLSTM in both tasks, there are other aspects to be considered. First, the employed features require heavy language-specific preprocessing machinery (lemmatizer, POS tagger, parser, NER, sentiment analyzer). By contrast, BLSTM only requires pre-trained embedding vectors, while delivering comparable results. Second, we only experimented with vanilla LSTMs. Recent developments of deep neural networks (especially attention mechanisms or grid-LSTMs) open up many future possibilities to gain performance in this end-to-end task.

Conclusion and future work
We propose a novel task of predicting Web argument convincingness. We crowdsourced a large corpus of 16k argument pairs over 32 topics and used global constraints based on transitivity properties of the convincingness relation for cleaning the data. We experimented with feature-rich SVM and bidirectional LSTM and obtained 0.76-0.78 accuracy and 0.35-0.40 Spearman's correlation in a cross-topic scenario. We release the newly created corpus UKPConvArg1 and the experimental software under free licenses. To the best of our knowledge, we are the first to deal with argument convincingness in Web data on such a large scale.
In the current article, we have only briefly touched on the annotated natural-language reasons. We believe that the presence of 44k reasons (550k tokens) is another important asset of the newly created corpus, which deserves future investigation.