Retrieval of the Best Counterargument without Prior Topic Knowledge

Given any argument on any controversial topic, how to counter it? This question implies the challenging retrieval task of finding the best counterargument. Since prior knowledge of a topic cannot be expected in general, we hypothesize the best counterargument to invoke the same aspects as the argument while having the opposite stance. To operationalize our hypothesis, we simultaneously model the similarity and dissimilarity of pairs of arguments, based on the words and embeddings of the arguments’ premises and conclusions. A salient property of our model is its independence from the topic at hand, i.e., it applies to arbitrary arguments. We evaluate different model variations on millions of argument pairs derived from the web portal idebate.org. Systematic ranking experiments suggest that our hypothesis is true for many arguments: For 7.6 candidates with opposing stance on average, we rank the best counterargument highest with 60% accuracy. Even among all 2801 test set pairs as candidates, we still find the best one about every third time.


Introduction
Many controversial topics in real life divide us into opposing camps, such as whether to ban guns, who should become president, or what phone to buy. When being confronted with arguments against our stance, but also when forming our own arguments, we need to think about how they could best be countered. Argumentation theory tells us that, aside from ad-hominem attacks, a counterargument denies either an argument's premises, its conclusion, or the reasoning between them (Walton, 2009). Take the following argument in favor of the right to bear arms from the web portal idebate.org:
Argument "Gun ownership is an integral aspect of the right to self defence. (conclusion) Law-abiding citizens deserve the right to protect their families in their own homes, especially if the police are judged incapable of dealing with the threat of attack. [...]" (premise)
While the conclusion seems well-reasoned, the web portal directly provides a counter to the argument:
Counterargument "Burglary should not be punished by vigilante killings of the offender. No amount of property is worth a human life. Perversely, the danger of attack by homeowners may make it more likely that criminals will carry their own weapons. If a right to self-defence is granted in this way, many accidental deaths are bound to result. [...]"
As in this example, we observe that a counterargument often takes on the aspects of the topic invoked by the argument, while adding a new perspective to its conclusion and/or premises, conveying the opposite stance. Research has tackled the stance of argument units (Bar-Haim et al., 2017) as well as the attack relations between arguments (Cabrio and Villata, 2012). However, existing approaches learn the interplay of aspects and topics on training data or infer it from external knowledge bases (details in Section 2). This does not work for topics unseen before. Moreover, to our knowledge, no work so far aims at actual counterarguments.
This paper studies the task of automatically finding the best counterargument to any argument. In the general case, we cannot expect prior knowledge of an argument's topic. Following the observation above, we thus hypothesize the best counterargument to invoke the same aspects as the argument while having the opposite stance. Figure 1 sketches how we operationalize the hypothesis. In particular, we simultaneously model the topic similarity and stance dissimilarity of a candidate counterargument to the argument. Both are inferred, in different ways, from the similarities to the argument's conclusion and premises, since it is unclear in advance whether either of these units or the reasoning between them is countered. Thereby, we find the most dissimilar among the most similar arguments.
To study counterarguments, we provide a new corpus with 6753 argument-counterargument pairs, taken from 1069 debates on idebate.org, as well as millions of false pairs derived from them. Given the corpus, we define eight retrieval tasks that differ in the types of candidate counterarguments. Based on the words and embeddings of the arguments, we develop similarity functions that realize the outlined model as a ranking approach. In systematic experiments, we evaluate the different building blocks of our model on all defined tasks.
The results suggest that our hypothesis is true for many arguments. The best model configuration improves common word and embedding similarity measures by eight to ten accuracy points in all tasks. Inter alia, we rank 60.3% of the best counterarguments highest when given all arguments with opposite stance (7.6 on average). Even with all 2801 test arguments as candidates, we still achieve 32.4% (and a mean rank of 15), fitting the intuition that off-topic arguments are easier to discard. Our analysis reveals notable gaps across topical themes, though.
Contributions We believe that our findings will be important for applications such as automatic debating technologies (Rinott et al., 2015) and argument search (Wachsmuth et al., 2017b). To summarize, our main contributions are:
• A large corpus for studying multiple counterargument retrieval tasks (Sections 3 and 4).
• A topic-independent approach to find the best counterargument to any argument (Section 5).
• Evidence that many counterarguments can be found without topic knowledge (Section 6).
The corpus as well as the Java source code for reproducing the experiments are available at http://www.arguana.com.

Related Work
Counterarguments rebut arguments. In the theoretical model of Toulmin (1958), a rebuttal in fact does not attack the argument, but it merely shows exceptions to the argument's reasoning. Govier (2010) suggests speaking of counterconsiderations in such cases instead. Unlike Damer (2009), who investigates how to attack several kinds of fallacies, we are interested in how to identify attacks. We focus on those that target arguments, excluding personal (ad-hominem) attacks (Habernal et al., 2018).
Following Walton (2006), an argument can be attacked in two ways: One is to question its validity, which does not mean that its conclusion must be wrong. The other is to rebut it with a counterargument that entails the opposite conclusion, often by revisiting aspects or introducing new ones. This is the type of attack we study. As Walton (2009) details, rebuttals may target an argument's premises or conclusion, or they may undercut the reasoning between them.
Recently, the computational analysis of natural language argumentation has been receiving much attention. Most research focuses on argument mining, ranging from segmenting a text into argument units (Ajjour et al., 2017), over identifying unit types (Rinott et al., 2015) and roles (Niculae et al., 2017), to classifying argument schemes (Feng and Hirst, 2011) and relations (Lawrence and Reed, 2017). Some works detect counterconsiderations in a text (Peldszus and Stede, 2015) or their absence (Stab and Gurevych, 2016). Such considerations make arguments more balanced (see above). In contrast, we seek arguments that defeat others.
Many approaches mine attack relations between arguments. Some use deep learning to find attacks in discussions (Cocarascu and Toni, 2017). Closer to this paper, others determine them in a given set of arguments, using textual entailment (Cabrio and Villata, 2012) or a combination of Markov logic and stance classification (Hou and Jochim, 2017). In principle, any attacking argument denotes a counterargument. Unlike previous work, however, we aim for the best counterargument to an argument.
Classifying the stance of a text towards a topic (pro or con) generally defines an alternative way of addressing counterarguments. Sobhani et al. (2015) specifically classify health-related arguments using supervised learning, while we do not expect to have prior topic knowledge. Bar-Haim et al. (2017) approach the stance of claims towards open-domain topics. Their approach combines aspect-based sentiment with external relations between aspects and topics from Wikipedia. As such, it is in fact limited to the topics covered there. Our model applies to arbitrary arguments and counterarguments.
We need to identify only whether arguments oppose each other, not their actual stance. Similarly, Menini et al. (2017) classify only the disagreement of political texts. Part of their approach is to detect topical key aspects in an unsupervised manner, which seems useful for our purposes. Analogously, Beigman Klebanov et al. (2010) study differences in vocabulary choice for the related task of perspective classification, and Tan et al. (2016) find that the best way to persuade opinion holders in the Change my view forum on reddit.com is to use dissimilar words. As we report later, however, our experiments did not show such results for the argument-counterargument pairs we deal with.
The goal of persuasion reveals the association of counterarguments to argumentation quality. Many quality criteria have been assessed for arguments, as surveyed by Wachsmuth et al. (2017a). In the study of Habernal and Gurevych (2016), one reason annotators gave for why an argument was more convincing than another was that it tackled flaws in the opposing view. Zhang et al. (2016) even found that debate winners tend to counter opposing arguments rather than focusing on their own arguments.
Argument quality assessment is particularly important in retrieval scenarios. Existing approaches aim to retrieve documents that contain many claims (Roitman et al., 2016) or that provide most support for their claims (Braunstain et al., 2016). In Wachsmuth et al. (2017c), we adapt PageRank to argumentative relations, in order to assess argument relevance objectively. While our search engine args for arguments on the web still uses content-based relevance measures in its first version (Wachsmuth et al., 2017b), its long-term idea is to rank the best arguments highest.1 The model presented in this work finds the best counterarguments, but it is meant to be integrated into args at some point.
Like here, args uses idebate.org arguments. Others take data from that portal for studying support (Boltužić and Šnajder, 2014) or for the distant supervision of argument mining (Al-Khatib et al., 2016). Our corpus is not only larger, though, but it is the first to utilize a unique feature of idebate.org: the explicit specification of counterarguments.
1 Argument search engine args: http://args.me

The ArguAna Counterargs Corpus
This section introduces our ArguAna Counterargs corpus with argument-counterargument pairs, created automatically from the structure of idebate.org. The corpus is freely available at http://www.arguana.com/data. We also provide the code to replicate the construction process.

The Web Portal idebate.org
On the portal idebate.org, diverse controversial topics of usually rather general interest are discussed in debates, subsumed under 15 themes, such as "economy" and "health". Each debate has a title capturing a thesis on a topic, such as "This House would limit the right to bear arms", followed by an introductory text, a set of mostly elaborated and well-written points that have a pro or a con stance towards the thesis, and a bibliography.
A specific feature of idebate.org is that virtually every point comes along with a counter that immediately attacks the point and its stance. Both points and counters can be seen as arguments. While a point consists of a one-sentence claim (the argument's conclusion) and a few sentences justifying the claim (the premise(s)), the counter's (opposite) conclusion remains implicit.
All arguments on the portal are established by a community with the goal of showing both sides of a topic in a balanced manner. We therefore assume each counter to be the best counterargument available for the respective point, and we use all resulting true argument pairs as the basis of our corpus. Figure 2 illustrates the italicized concepts, showing the structure of idebate.org. An example argument pair has been discussed in Section 1.

Corpus Construction
We crawled all debates from idebate.org that follow the portal's theme-guided folder structure (last access: January 30, 2018). From each debate, we extracted the thesis, the introductory text, all points and counters, the bibliography, and some metadata. Each was stored separately in one plain text file, and we also created a file with the entire debate in its original order. Only points and counters are used in our experiments in Section 6. The underlying experiment settings are described in Section 4.
Table 1: Distribution of debates, points, and counters over the themes in the counterargs-18 corpus. The bottom rows show the size of the datasets.

Corpus Statistics
Table 1 lists the number of debates crawled for each theme, along with the numbers of points and counters in the debates. The 26 found points without a counter are included in the corpus, but we do not use them in our experiments. In total, the ArguAna Counterargs corpus consists of 1069 debates with 6753 points that have a counter. The mean length of points is 196.3 words, whereas counters span only 129.6 words, largely due to the missing explicit conclusion. To avoid exploiting this corpus bias, no approach in our experiments captures length differences.

Datasets
We split the corpus into a training set, consisting of the first 60% of all debates of each theme (ordered alphabetically), as well as a validation set and a test set, each covering 20%. The dataset sizes are found at the bottom of Table 1. By putting all arguments from a debate into a single dataset, no specific topic knowledge can be gained from the training or validation set. We include all themes in all datasets, because we expect the set of themes to be stable.
We checked for duplicates. Among the 13532 points and counters, 3407 appear twice, 723 three times, 36 four times, and 1 five times. We ensure that no true pair is used as a false pair in our tasks.
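The split described above can be sketched as follows. This is a minimal illustration, not the released Java code: the function name `split_debates`, the rounding of the 60% and 20% shares, and the assignment of the second block to validation and the remainder to test are assumptions.

```python
from collections import defaultdict


def split_debates(debates, train_share=0.6, validation_share=0.2):
    """Split (theme, title) tuples per theme, ordered alphabetically by title.

    Whole debates are assigned to exactly one dataset, so no argument of a
    test debate is ever seen in training or validation.
    """
    titles_by_theme = defaultdict(list)
    for theme, title in debates:
        titles_by_theme[theme].append(title)

    splits = {"training": [], "validation": [], "test": []}
    for theme, titles in titles_by_theme.items():
        titles.sort()
        # Assumption: shares are rounded to whole debates per theme.
        n_train = round(len(titles) * train_share)
        n_validation = round(len(titles) * validation_share)
        splits["training"] += [(theme, t) for t in titles[:n_train]]
        splits["validation"] += [
            (theme, t) for t in titles[n_train:n_train + n_validation]
        ]
        splits["test"] += [(theme, t) for t in titles[n_train + n_validation:]]
    return splits
```

Because every theme is split separately, all themes occur in all three datasets, in line with the expectation that the set of themes is stable.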

Counterargument Retrieval Tasks
Based on the new corpus, we define the following eight counterargument retrieval tasks of different complexity. All tasks consider all true argument-counterargument pairs, while differing in terms of what arguments (points and/or counters) from which context (same debate, same theme, or entire portal) are candidates for a given argument.
Same Debate: Opposing Counters All counters in the same debate with stance opposite to the given argument are candidates (Figure 2: a, b). The task is to find the best counterargument among all counters to the argument's stance.
Same Debate: Counters All counters in the same debate irrespective of their stance are candidates (Figure 2: a-c). The task is to find the best counterargument among all on-topic arguments phrased as counters.
Table 2: Number of true and false argument-counterargument pairs as well as their ratio for each evaluated context and type of candidate counterarguments in the three datasets. Each line defines one retrieval task.
Same Debate: Opposing Arguments All arguments in the same debate with opposite stance are candidates (Figure 2: a, b, d). The task is to find the best among all on-topic counterarguments.
Same Debate: Arguments All arguments in the same debate irrespective of their stance are candidates (Figure 2: a-e). The task is to find the best counterargument among all on-topic arguments.
Same Theme: Counters All counters from the same theme are candidates (Figure 2: a-c, f). The task is to find the best counterargument among all on-theme arguments phrased as counters.
Same Theme: Arguments All arguments from the same theme are candidates (Figure 2: a-g). The task is to find the best counterargument among all on-theme arguments.
Entire Portal: Counters All counters are candidates (Figure 2: a-c, f, h). The task is to find the best counterargument among all arguments phrased as counters.
Entire Portal: Arguments All arguments are candidates (Figure 2: a-i). The task is to find the best counterargument among all arguments.
Table 2 lists the numbers of true and false pairs for each task. Experiment files containing the file paths of all candidate pairs are provided in our corpus.
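The eight task definitions differ only in three switches: the context, whether candidates are restricted to counters, and whether they are restricted to opposite stance. The sketch below makes this explicit. The `Argument` record, its field names, and the function `select_candidates` are hypothetical; the corpus itself ships experiment files with candidate pairs instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Argument:
    id: str
    debate: str
    theme: str
    kind: str    # "point" or "counter"
    stance: str  # "pro" or "con" towards the thesis of its own debate


# Task name -> (context, counters_only, opposing_only)
TASKS = {
    "Same Debate: Opposing Counters": ("debate", True, True),
    "Same Debate: Counters": ("debate", True, False),
    "Same Debate: Opposing Arguments": ("debate", False, True),
    "Same Debate: Arguments": ("debate", False, False),
    "Same Theme: Counters": ("theme", True, False),
    "Same Theme: Arguments": ("theme", False, False),
    "Entire Portal: Counters": ("portal", True, False),
    "Entire Portal: Arguments": ("portal", False, False),
}


def select_candidates(query, pool, context, counters_only, opposing_only=False):
    """Return all candidate counterarguments for `query` under one task."""
    if context not in ("debate", "theme", "portal"):
        raise ValueError("unknown context: " + context)
    if opposing_only and context != "debate":
        # Arguments from other debates have no stance towards the query.
        raise ValueError("stance filtering is only defined within a debate")
    selected = []
    for arg in pool:
        if arg.id == query.id:
            continue  # an argument is never its own counterargument
        if context == "debate" and arg.debate != query.debate:
            continue
        if context == "theme" and arg.theme != query.theme:
            continue
        if counters_only and arg.kind != "counter":
            continue
        if opposing_only and arg.stance == query.stance:
            continue
        selected.append(arg)
    return selected
```

For instance, `select_candidates(point, pool, *TASKS["Same Theme: Counters"])` yields the candidates (a)-(c) and (f) of Figure 2 for a given point.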

Retrieval of the Best Counterargument without Prior Topic Knowledge
The eight defined tasks indicate the subproblems of retrieving the best counterargument to a given argument: Finding all arguments that address the same topic, filtering those arguments with an opposite stance towards the topic, and identifying the best counter among these arguments. This section presents our approach to solving these problems computationally without prior knowledge of the argument's topic, based on the simultaneous similarity and dissimilarity of arguments.

Topic as Word and Embedding Similarity
We do not reinvent the wheel to assess topical relevance, but rather follow common practice. Concretely, we hypothesize a candidate counterargument to be on-topic if it is similar to the argument in terms of its words and its embedding. We capture these two types of similarity as follows.
Word Argument Similarity To best represent the words in arguments, we did initial counterargument retrieval tests with token, stem, and lemma n-grams, n ∈ {1, 2, 3}. While the differences were not large, stems worked best and stem 1-grams sufficed. Both might be a consequence of the limited data size. In our experiments in Section 6, we determine the stem 1-grams to be considered on the training set of each task.
For word similarity computation, we tested four inverse vector-based distance measures: Cosine, Euclidean, Manhattan, and Jaccard similarity (Cha, 2007). On the validation sets, the Manhattan similarity performed best, closely followed by the Jaccard similarity. Both clearly outperformed Euclidean and especially Cosine similarity. This suggests that the presence and absence of words are equally important and that outliers should not be punished more. For brevity, we report only results for the Manhattan similarity below.
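A minimal sketch of a Manhattan similarity over word vectors is given next. Several details are assumptions made for illustration only: lowercase tokens stand in for stems (no stemmer is applied), vector entries are raw term counts over a given lexicon, and the inverse of the distance is taken as 1 / (1 + distance).

```python
import re
from collections import Counter


def bag_of_words(text, lexicon):
    """Count vector of `text` over the sorted lexicon terms.

    Assumption: lowercase word tokens are used in place of stems.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(token for token in tokens if token in lexicon)
    return [counts[term] for term in sorted(lexicon)]


def manhattan_similarity(u, v):
    """Inverse Manhattan (L1) distance, here assumed as 1 / (1 + distance)."""
    distance = sum(abs(a - b) for a, b in zip(u, v))
    return 1.0 / (1.0 + distance)


lexicon = {"gun", "police", "right"}
argument = bag_of_words("The right to a gun is a right.", lexicon)
candidate = bag_of_words("Police protect the right.", lexicon)
print(manhattan_similarity(argument, candidate))
```

Unlike the Euclidean distance, the L1 distance grows only linearly with each difference in a vector entry, which matches the observation that outliers should not be punished more.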
To capture argument-level embedding similarity, we compared the four inverse vector-based distance measures above on average word embeddings against the inverse Word Mover's distance, which quantifies the optimum alignment of two word embedding sequences (Kusner et al., 2015). This Word Mover's similarity consistently beat the others, so we decided to restrict our view to it.
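To give an intuition of an embedding-based similarity of this kind, the sketch below computes the relaxed Word Mover's distance, a lower bound of the exact distance that Kusner et al. (2015) also describe. It is a swapped-in simplification, not the exact optimal-transport computation. Uniform word weights, Euclidean word distances, and the inversion 1 / (1 + distance) are further assumptions.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def relaxed_word_movers_distance(doc_a, doc_b):
    """Lower bound of the Word Mover's distance between two documents.

    doc_a, doc_b: non-empty lists of word embedding vectors. Every word
    carries the same weight and moves all of it to its nearest word in the
    other document; the larger of the two directions is returned.
    """
    def one_sided(source, target):
        nearest = (min(euclidean(s, t) for t in target) for s in source)
        return sum(nearest) / len(source)

    return max(one_sided(doc_a, doc_b), one_sided(doc_b, doc_a))


def word_movers_similarity(doc_a, doc_b):
    """Inverse distance, here assumed as 1 / (1 + distance)."""
    return 1.0 / (1.0 + relaxed_word_movers_distance(doc_a, doc_b))
```

The exact distance additionally requires solving a transportation problem, for which an optimal-transport or linear-programming solver would be needed.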

Stance as Topic Dissimilarity
Stance classification without prior topic knowledge is challenging: While we can compare the topics of any two arguments, it is impossible in general to infer the stance of the specific aspects invoked by one argument to those of the other. As sketched in Section 2, related work employs external knowledge to infer stance relations of aspects and topics (Bar-Haim et al., 2017) or trains classifiers of attack relations (Cabrio and Villata, 2012). Unfortunately, neither applies to topics unseen before.
For argument pairs invoking similar aspects, one option is in principle to assess sentiment polarity; intuitively, two arguments with the same topic but opposite sentiment have opposing stance. However, when we tested topic-agnostic sentiment lexicons (Baccianella et al., 2010) and state-of-the-art sentiment classifiers trained on large-scale multiple-domain review data (Prettenhofer and Stein, 2010; Joulin et al., 2017), the correlation between sentiment and stance differences of training arguments was close to zero. A possible explanation is the limited explicitness of sentiment on idebate.org, making the lexicons and classifiers fail there.
Other related work suggests that the vocabulary of opposing sides differs (Beigman Klebanov et al., 2010). We thus checked on the training set whether counterarguments are similar in their embeddings but dissimilar in their words. The measures above did not support this hypothesis, i.e., both embedding and word similarity increased the likelihood of a candidate counterargument being the best. Still, there must be a difference between an argument and its counterargument by concept. As a solution, we capture dissimilarity with the same similarity functions as above, but we change the granularity level on which we measure similarity.

Simultaneous Similarity and Dissimilarity
The arising question is how to assess similarity and dissimilarity at the same time. We hypothesize the best counterargument to be very similar in overall terms, but very dissimilar in certain respects. To capture this intuition, we rely on expert knowledge from argumentation theory (see Section 2).

Word and Embedding Unit Similarities
In particular, we follow the notion that a counterargument attacks either the conclusion of an argument, the argument's premises, or both. As a consequence, we compute two word and two embedding similarities as specified above for each candidate counterargument: once to the argument's conclusion (called w_c and e_c for words and embeddings respectively) and once to the argument's premises (w_p and e_p).
Now, to capture similarity and dissimilarity simultaneously, we need multiple ways to aggregate conclusion and premise similarities. As we do not generally know which argument unit is attacked, we resort to four standard aggregation functions that generalize over the unit similarities. For words, these are the following word unit similarities:
$$w_{\downarrow} := \min\{w_c, w_p\}, \qquad w_{\uparrow} := \max\{w_c, w_p\}, \qquad w_{\times} := w_c \cdot w_p, \qquad w_{+} := w_c + w_p$$
Accordingly, we define four respective embedding unit similarities, e↓, e↑, e×, and e+.
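The aggregation of unit similarities can be sketched in a few lines. The sketch assumes that the four aggregation functions behind the symbols ↓, ↑, ×, and + are the minimum, maximum, product, and sum of the conclusion and premise similarities; the function name and the dictionary keys are hypothetical.

```python
def unit_similarities(to_conclusion, to_premises):
    """Aggregate the similarity of a candidate to an argument's two units.

    to_conclusion: similarity to the argument's conclusion (w_c or e_c)
    to_premises:   similarity to the argument's premises (w_p or e_p)
    """
    return {
        "min": min(to_conclusion, to_premises),  # assumed meaning of the down arrow
        "max": max(to_conclusion, to_premises),  # assumed meaning of the up arrow
        "product": to_conclusion * to_premises,  # assumed meaning of the cross
        "sum": to_conclusion + to_premises,      # assumed meaning of the plus
    }


word_units = unit_similarities(0.2, 0.5)
embedding_units = unit_similarities(0.4, 0.6)
# Word and embedding values of the same aggregation type are added up.
combined = {name: word_units[name] + embedding_units[name] for name in word_units}
```

The same function serves for words and embeddings, since only the underlying similarity measure differs.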
As mentioned above, both word similarity and embedding similarity positively affect the likelihood that a candidate is the best counterargument. Therefore, we combine each pair of similarities as w↓ + e↓, w↑ + e↑, w× + e×, and w+ + e+, but we also evaluate their impact in isolation below.
Counterargument Scoring Model Based on the unit similarities, we finally define a scoring model for a given pair of argument and candidate counterargument. The model includes two unit similarity values, sim and dissim, but dissim is subtracted from sim, such that it actually favors dissimilarity. Thereby, we realize the topic and stance similarity sketched in Figure 1. We weight the two values with a damping factor α:
$$\alpha \cdot \mathit{sim} - (1 - \alpha) \cdot \mathit{dissim}$$
where sim, dissim ∈ {w↓ + e↓, w↑ + e↑, w× + e×, w+ + e+} and sim ≠ dissim.
The general idea of the scoring model is that sim rewards one type of similarity, whereas subtracting dissim punishes another type. We seek to thereby find the most dissimilar candidate among the similar candidates. The model is meant to give a higher score to a pair the more likely the candidate is the best counterargument to the argument, so the scores can be used for ranking.
What combination of sim and dissim turns out best is hard to foresee and may depend on the retrieval task at hand. We hence evaluate different combinations empirically below. The same holds for the damping factor α ∈ [0, 1]. If our hypothesis on similarity and dissimilarity is true, then the best α should be close to but lower than 1. Conversely, if α = 1.0 achieves the best performance, then only similarity would be captured by our model.
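How the damping factor changes a ranking can be illustrated with a small sketch. It assumes that the score takes the form α · sim − (1 − α) · dissim; the candidate names and similarity values are made up for illustration and are not results from our experiments.

```python
def score(sim, dissim, alpha=0.9):
    """Score of a candidate: reward `sim`, punish `dissim`.

    Assumption: score = alpha * sim - (1 - alpha) * dissim, alpha in [0, 1].
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * sim - (1.0 - alpha) * dissim


def rank_candidates(candidates, alpha=0.9):
    """Order candidate ids by descending score.

    candidates: dict mapping a candidate id to its (sim, dissim) values.
    """
    return sorted(
        candidates,
        key=lambda cid: score(candidates[cid][0], candidates[cid][1], alpha),
        reverse=True,
    )


# Hypothetical values: A and B are equally similar overall, but A is
# very similar to a single unit of the argument, which is punished.
candidates = {"A": (1.2, 0.9), "B": (1.2, 0.65), "C": (0.4, 0.3)}
print(rank_candidates(candidates, alpha=0.9))  # B is ranked before A
print(rank_candidates(candidates, alpha=1.0))  # A and B tie, dissim is ignored
```

With α = 1.0, the second term vanishes and the ranking reflects similarity only, which is the case that would contradict the hypothesis.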

Evaluation
We now report on systematic ranking experiments with our counterargument scoring model.The goal is to evaluate on all eight retrieval tasks from Section 4 to what extent our hypothesis holds that the best counterargument to an argument invokes the same aspects while having opposing stance.The Java source code of the experiments is available at: http://www.arguana.com/software

Experimental Set-up
We evaluated the following set-up of tasks, data, measures, baselines, and approaches.
Tasks We tackled each of the eight retrieval tasks as a ranking problem, i.e., we aimed to rank the best counterargument to each argument highest, given all candidates. Accordingly, only one candidate counterargument per argument is correct.4
4 One alternative would be to see each argument pair as one instance of a classification problem. However, our preliminary tests confirmed the intuition that identifying the best counterargument is hard without knowing the other candidates, i.e., there is no general (dis)similarity threshold that makes an argument the best counterargument. Rather, how similar or dissimilar a counterargument needs to be depends on the topic and on the other candidates. Another alternative would be to treat all candidates for an argument as one instance, but this makes the experimental set-up very intricate.
Data Table 2 shows the true and false argument pairs in all datasets. We undersampled each training set, resulting in 4065 true and 4065 false training pairs in all tasks. Our model does not do any learning-to-rank on these pairs, but we derived lexicons for the word similarities from them (all stems included in at least 1% of all pairs). As detailed below, we then determined the best model configurations on the validation sets and evaluated these configurations on the test sets.
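The lexicon derivation can be sketched as a simple frequency filter over pairs. It is an assumption of this sketch that a stem counts as included in a pair if it occurs in the argument or in the candidate; the function name and input format are hypothetical.

```python
from collections import Counter


def derive_lexicon(pairs, min_share=0.01):
    """Keep all stems that occur in at least `min_share` of all pairs.

    pairs: list of (argument_stems, candidate_stems) tuples, each element
    being an iterable of stems.
    """
    pair_frequency = Counter()
    for argument_stems, candidate_stems in pairs:
        # Count each stem at most once per pair.
        pair_frequency.update(set(argument_stems) | set(candidate_stems))
    threshold = min_share * len(pairs)
    return {stem for stem, count in pair_frequency.items() if count >= threshold}
```

With the 8130 undersampled training pairs of a task, the 1% threshold corresponds to stems found in at least 82 pairs.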
Measures As only one candidate is true per argument, we report the accuracy@1 of each approach, i.e., the percentage of arguments for which the true counterargument was ranked highest. Besides, we compute the rounded mean rank of the best counterargument in all rankings, reflecting the average performance of an approach. Exemplarily, we also mention the mean reciprocal rank (MRR), which is more sensitive to outliers.
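The three measures follow directly from the rank of the true counterargument in each ranking. The following sketch uses hypothetical names and made-up ranks.

```python
def evaluate(ranks):
    """Compute the measures from 1-based ranks of the true counterarguments.

    ranks: one rank per argument, 1 meaning the true counterargument was
    ranked highest.
    """
    n = len(ranks)
    return {
        # Percentage of arguments with the true counterargument on top.
        "accuracy@1": 100.0 * sum(1 for rank in ranks if rank == 1) / n,
        # Note: Python's round() rounds exact halves to the nearest even number.
        "mean_rank": round(sum(ranks) / n),
        "mrr": sum(1.0 / rank for rank in ranks) / n,
    }


print(evaluate([1, 1, 2, 4]))
```

For the four made-up ranks, the true counterargument is on top in two of four cases, the mean rank is 2, and the MRR is (1 + 1 + 0.5 + 0.25) / 4.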
Baselines A trivial way to address the given tasks is to pick any candidate by chance for each argument. This random baseline allows quantifying the impact of other approaches. As counterargument retrieval has not been tackled yet, we do not use any existing baseline. Instead, we evaluate the effects of the different building blocks of our scoring model. On the one hand, we check the need for distinguishing conclusions and premises by comparing to the word argument similarity (w) and the embedding argument similarity (e). On the other hand, we consider all eight word and embedding unit similarities (w↓, w↑, ..., e+) as baselines, in order to see whether and how to best aggregate them.
Approaches After initial tests, we reduced the set of tested values of the damping factor α in our scoring model to {0.8, 0.9, 1.0}. On the validation sets of the first six tasks, we then analyzed all possible combinations of w↓ + e↓, w↑ + e↑, w× + e×, w+ + e+, as well as w + e for sim and dissim. Three configurations of the model turned out best:
$$\begin{aligned} we &:= 1.0 \cdot (w + e) \\ we_{\downarrow} &:= 0.9 \cdot (w_{+} + e_{+}) - 0.1 \cdot (w_{\downarrow} + e_{\downarrow}) \\ we_{\uparrow} &:= 0.9 \cdot (w_{+} + e_{+}) - 0.1 \cdot (w_{\uparrow} + e_{\uparrow}) \end{aligned}$$
we was best on the validation set of Same Debate: Opposing Arguments (accuracy@1: 62.1) and we↓ on the one of Same Debate: Arguments (49.0). All other tasks were dominated by we↑. Especially, we↑ was better than 1.0 · (w+ + e+) in all of them with clear leads of up to 2.2 points. This underlines the importance of modeling dissimilarity for counterargument retrieval. We took we, we↓, and we↑ as our approaches for the test set.8

Results
Table 3 shows the accuracy@1 and the mean rank of all baselines and approaches on each of the eight given retrieval tasks.
Overall, the counter-only tasks seem slightly harder within the same debate (comparing Counters to Opposing), i.e., stance is harder to assess than topical relevance. Conversely, the other Counters tasks seem easier, suggesting that topically close but false candidate counterarguments with the same stance as the argument (which are not included in any Counters task) are classified wrongly most often. Besides, these results support that potential differences in the phrasing of counters are not exploited, as desired.
The accuracy of the standard similarity measures, w and e, goes from 65.9 and 62.9 respectively in the smallest task down to 21.8 and 23.9 in the largest. w is stronger when only counters are candidates, e otherwise. This implies that words capture differences between the best and other counters, whereas embeddings rather help discard false candidates with the same stance as the argument.
8 All validation set results are found in the supplementary material, which we provide at http://www.arguana.com/publications
Among the eight unit similarity baselines, w+ performs best on five tasks (e× twice, w× once). w+ finds 71.5% true counterarguments among all opposing counters in a debate, and 28.6% among all test arguments from the entire portal. In that task, however, the mean ranks of w+ (33) and particularly of w× (354) are much worse than for e× (21), meaning that words are insufficient to robustly find counterarguments.
we, we↓, and we↑ outperform all baselines in all tasks, improving the accuracy by 8.1 (Same Theme: Arguments) to 10.3 points (Entire Portal: Counters) over w and e, and at least 3.0 over the best baseline in each task. Among all opposing arguments from the same debate (true-to-false ratio 1:6.6), we finds 60.3% of the best counterarguments, 44.9% when all arguments are given (1:13.3).
The winner in our evaluation is we↑, though, being best in five of the eight tasks. It found the true counterargument among all opposing counters in 74.5% of all cases, and about every third time (32.4%) among all 2801 test set arguments, a setting where the random baseline has virtually no chance. Given all arguments from the same theme, we↑ puts the best counterargument at a mean rank of 5 (MRR 0.58), and for the entire portal still at 15 (MRR 0.5). Although our scoring model thus does not solve the retrieval tasks, we conclude that it serves as a robust approach to rank the best counterargument high.
To test significance, we separately computed the accuracy@1 for the arguments from each theme. The differences between the 15 values of the best approach on each task and those of the best baseline (w+, w×, or e×) were normally distributed. Since the baselines and approaches are dependent, we used a one-tailed dependent t-test with paired samples. As Table 3 specifies, our approaches are consistently better, partly with at least 99% confidence, partly even with 99.9% confidence.
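The test statistic of such a paired t-test can be computed as sketched below. The function name is hypothetical, and the sketch compares the statistic against standard one-tailed critical values of the t-distribution with 14 degrees of freedom (15 themes) rather than computing a p-value.

```python
import math


def paired_t_statistic(approach, baseline):
    """t statistic of a dependent t-test on paired samples.

    approach, baseline: per-theme accuracy@1 values in the same theme order.
    A positive value indicates that the approach is better on average.
    """
    differences = [a - b for a, b in zip(approach, baseline)]
    n = len(differences)
    mean = sum(differences) / n
    variance = sum((d - mean) ** 2 for d in differences) / (n - 1)
    return mean / math.sqrt(variance / n)


# One-tailed critical values for 14 degrees of freedom (standard t-table).
CRITICAL_T_DF14 = {0.01: 2.624, 0.001: 3.787}


def is_significant(t_statistic, level=0.01):
    """One-tailed decision for 15 paired values, i.e., 14 degrees of freedom."""
    return t_statistic > CRITICAL_T_DF14[level]
```

Exceeding 2.624 corresponds to 99% confidence and exceeding 3.787 to 99.9% confidence in the one-tailed setting.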
As an example, Table 4 details the comparison of the best approach (we↑) to the best baseline (w+) on Entire Portal: Arguments. The mean ranks across themes underline the robustness of we↑, being in the top 10 for 7 and in the top 20 even for 13 themes. Still, the accuracy@1 of both w+ and we↑ varies notably, in case of we↑ from 12.1 for free speech debate to 46.7 for sport. For free speech debates (e.g., "This House would criminalise blasphemy"), we observed that their arguments tend to be disproportionately long, which might lead to deviating similarities. In case of sport, the topical specificity (e.g., "This House would ban boxing") reduces the probability of mistakenly choosing candidates from other themes.
Free speech debate turned out to be the hardest theme in seven tasks, health in the remaining one. Besides sport, in some tasks the best results were obtained for religion and science, both of which share the characteristic of dealing with very specific topics.

Conclusion
This paper has asked how to find the best counterargument to any argument without prior knowledge of the argument's topic. We did not aim to engineer the best approach to this retrieval task, but to study whether we can model the simultaneous similarity and dissimilarity of a counterargument to an argument computationally. For the restricted domain of debate portal arguments, our main result is quite intriguing: The best model (we↑) rewards a high overall similarity to the argument's conclusion and premises while punishing a too high similarity to either of them. Despite its simplicity, we↑ found the best counterargument among 2801 candidates in almost a third of all cases, and ranked it into the top 15 on average. This speaks for our hypothesis that the best counterargument often just addresses the same topical aspects with opposite stance.
Of course, our hypothesis is simplifying, i.e., there are counterarguments that will not be found based on aspect and stance similarity only. Apart from some hyperparameters, however, our model is unsupervised and it does not make any assumption about an argument's topic. Hence, it applies to any argument, given a pool of candidate counterarguments. While the model can be considered open-topic, a next step will be to study counterargument retrieval open-source.
We are confident that the modeled intuition generalizes beyond idebate.org. To obtain further insights into the nature of counterarguments, deeper linguistic analysis along with supervised learning may be needed, though. We provide a corpus to train respective approaches, but leave the according research to future work.
The intended practical application of our model is to retrieve counterarguments in automatic debating technologies (Rinott et al., 2015) and argument search (Wachsmuth et al., 2017b). While debate portal arguments are often suitable in this regard, a real counterargument does not always exist for an argument. Still, returning one that addresses similar aspects with opposite stance makes sense then. An alternative would be to generate counterarguments, but we believe that humans are currently better than machines at writing them.

Figure 1: Modeling the simultaneous similarity and dissimilarity of a counterargument to an argument.

Figure 2: Structure of idebate.org for one true argument pair in our corpus. Colors denote matching stance; we assume arguments from other debates to have no stance towards a point. Points have a conclusion and premises, counters only premises. (a)-(i) are used in Section 4 to specify the candidates in different tasks.

Table 3: Test set accuracy of ranking the best counterargument highest (@1) and mean rank (R) for 14 baselines and approaches (w, e, w↓, ..., r) in all eight tasks (given by Context and Candidates). Each best accuracy value (bold) significantly outperforms the best baseline with 99% (†) or 99.9% (‡) confidence.

Table 4: Accuracy@1 and mean rank of the best baseline (w+) and approach (we↑) on each theme when all 2801 test set arguments are candidates.