Classification and Clustering of Arguments with Contextualized Word Embeddings

We experiment with two recent contextualized word embedding methods (ELMo and BERT) in the context of open-domain argument search. For the first time, we show how to leverage the power of contextualized word embeddings to classify and cluster topic-dependent arguments, achieving impressive results on both tasks and across multiple datasets. For argument classification, we improve the state-of-the-art for the UKP Sentential Argument Mining Corpus by 20.8 percentage points and for the IBM Debater - Evidence Sentences dataset by 7.4 percentage points. For the understudied task of argument clustering, we propose a pre-training step which improves by 7.8 percentage points over strong baselines on a novel dataset, and by 12.3 percentage points for the Argument Facet Similarity (AFS) Corpus.


Introduction
Argument mining methods have been applied to different tasks such as identifying reasoning structures (Stab and Gurevych, 2014), assessing the quality of arguments (Wachsmuth et al., 2017), or linking arguments from different documents (Cabrio and Villata, 2012). Broadly speaking, existing methods either approach argument mining from the discourse-level perspective (aiming to analyze local argumentation structures), or from an information-seeking perspective (aiming to detect arguments relevant to a predefined topic). While discourse-level approaches mostly focus on the analysis of single documents or document collections (Eger et al., 2017), information-seeking approaches need to be capable of dealing with heterogeneous sources and topics (Shnarch et al., 2018) and also face the problem of redundancy, as arguments might be repeated across sources. As a result, this perspective naturally calls for a subsequent clustering step, which is able to identify and aggregate similar arguments for the same topic. In this work, we focus on the latter perspective, referring to it as open-domain argument search, and show how contextualized word embeddings can be leveraged to overcome some of the challenges involved in topic-dependent argument classification and clustering.
Identifying arguments for unseen topics is a challenging task for machine learning systems. The lexical appearance of two topics, e.g. "net neutrality" and "school uniforms", is vastly different. Hence, in order to perform well, systems must develop a deep semantic understanding of both the topic and the sources to search for arguments. Clustering similar arguments is even more demanding, as fine-grained semantic nuances may determine whether two arguments (talking about the same topic) are similar. Figure 1 gives an example of arguments on the topic "net neutrality". Both arguments center around the aspect of "equal access for every Internet user" but are phrased differently.

A1: The ultimate goal is fast, affordable, open Internet access for everyone, everywhere.
A2: If this does not happen, we will create an Internet where only users able to pay for privileged access enjoy the network's full capabilities.

[...] GloVe (Pennington et al., 2014), these methods compute the embeddings for a sentence on the fly by taking the context of a target word into account. This yields word representations that better match the specific sense of the word in a sentence. In cross-topic scenarios, with which we are dealing in open-domain argument search, contextualized representations need to be able to adapt to new, unseen textual topics. We thus analyze ELMo and BERT in a cross-topic scenario for the tasks of argument classification and clustering on four different datasets. For argument classification, we use the UKP Sentential Argument Mining Corpus by Stab et al. (2018b) and the IBM Debater® - Evidence Sentences corpus by Shnarch et al. (2018). For argument clustering, we introduce a novel corpus on aspect-based argument clustering and evaluate the proposed methods on this corpus as well as on the Argument Facet Similarity Corpus (Misra et al., 2016).
The contributions of this publication are:
(1) We frame the problem of open-domain argument search as a combination of topic-dependent argument classification and clustering and discuss how contextualized word embeddings can help to improve these tasks across four different datasets.
(2) We show that our suggested methods improve the state-of-the-art for argument classification when fine-tuning the models, thus significantly reducing the gap to human performance.
(3) We introduce a novel corpus on aspect-based argument similarity and demonstrate how contextualized word embeddings help to improve clustering similar arguments in a supervised fashion with little training data.
We present the four different datasets used in this work in Section 3, before we discuss our experiments and results on argument classification and clustering in Sections 4 and 5. We conclude our findings for open-domain argument search in Section 6.

Related Work
In the following, we concentrate on the fundamental tasks involved in open-domain argument search. First, we discuss work that experiments with sentence-level argument classification. Second, we review work that provides us with the necessary tools to cluster extracted arguments by their similarity. Third, we take a deeper look into contextualized word embeddings.
Argument Classification, as viewed in this work, aims to identify topic-related, sentence-level arguments from (heterogeneous) documents. Levy et al. (2014) identify context-dependent claims (CDCs) by splitting the problem into smaller sub-problems. Rinott et al. (2015) extend this work with a pipeline of feature-based models that find and rank supporting evidence from Wikipedia for the CDCs. However, neither of these approaches leverages the potential of word embeddings in capturing semantic relations between words. Shnarch et al. (2018) aim to identify topic-dependent evidence sentences by blending large automatically generated training sets with manually annotated data as an initialization step. They use a BiLSTM with GloVe embeddings and integrate the topic via attention. For topic-dependent argument detection, Stab et al. (2018b) deploy a modified LSTM cell that is able to directly integrate topic information. They show the importance of topic information by outperforming a BiLSTM baseline by around 4.5pp. Yet, their best model only shows mediocre recall for arguments, while showing an even lower precision when compared to their baseline. As argument classification is the first logical step in open-domain argument search, a low performance would eventually propagate further down to the clustering of similar arguments. Hence, in this work, we aim to tackle this problem by leveraging superior contextualized language models to improve on precision and recall of argumentative sentences.
Argument Clustering aims to identify similar arguments. Previous research in this area mainly used feature-based approaches in combination with traditional word embeddings like word2vec or GloVe. Boltužić and Šnajder (2015) applied hierarchical clustering on semantic similarities between users' posts from a two-side online debate forum using word2vec. Wachsmuth et al. (2018) experimented with different word embedding techniques for (counter)argument similarity. Misra et al. (2016) presented a new corpus on argument similarity on three topics. They trained a Support Vector Regression model using different hand-engineered features including custom trained word2vec. Trabelsi and Zaïane (2015) used an augmented LDA to automatically extract coherent words and phrases describing arguing expressions and applied constrained clustering to group similar viewpoints of topics.
In contrast to previous work, we apply argument clustering on a dataset containing both relevant and non-relevant arguments for a large number of different topics, which is closer to a realistic setup.
Contextualized word embeddings compute a representation for a target word based on the specific context in which the word is used within a sentence. In contrast, with traditional word embedding methods like word2vec or GloVe, a word is always mapped to the same vector. Contextualized word embeddings thus tackle the issue that words can have different senses depending on the context. Two approaches that became especially popular are ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018).
ELMo (Embeddings from Language Models) representations are derived from a bidirectional language model that is trained on a large corpus. Peters et al. combine a character-based CNN with two bidirectional LSTM layers. The ELMo representation is then derived from all three layers.
BERT (Bidirectional Encoder Representations from Transformers) uses a deep transformer network (Vaswani et al., 2017) with 12 or 24 layers to derive word representations. Devlin et al. presented two new pre-training objectives: the "masked language model" and the "next sentence prediction" objectives. They demonstrate that the pre-trained BERT models can be fine-tuned for various tasks, including sentence classification and sentence-pair classification.
ELMo and BERT were primarily evaluated on datasets where the test and training sets have comparable distributions. In cross-topic setups, however, the distributions for training and testing are vastly different. It is unclear whether ELMo and BERT will be able to adapt to this additional challenge for cross-topic argument mining.

Datasets
No dataset is available that allows evaluating open-domain argument search end-to-end. Hence, we analyze and evaluate the involved steps (argument classification and clustering) independently.

Argument Classification
To our knowledge, to date there are only two suitable corpora for the task of topic-dependent argument classification.
UKP Corpus. The UKP Sentential Argument Mining Corpus (Stab et al., 2018b) [...]
IBM Corpus. [...] Shnarch et al. (2018) extracted sentences from Wikidata that were in turn annotated by crowd-workers (10 for each topic-sentence pair) with one of the two labels: evidence or no evidence in regard to the topic.

Argument Clustering
Topic-dependent argument clustering is an understudied problem with few resources available. Arguments on controversial topics usually address a limited set of aspects; for example, many arguments on "nuclear energy" address safety concerns. Argument pairs addressing the same aspect should be assigned a high similarity score, and arguments on different aspects a low score. To date, the only available resource of that kind we are aware of is the Argument Facet Similarity (AFS) Corpus (Misra et al., 2016).
AFS Corpus. The AFS corpus annotates similarities of arguments pairwise. Misra et al. (2016) aimed to create automatic summaries for controversial topics. As an intermediate step, they extracted 6,000 sentential argument pairs from curated online debating platforms for three topics and annotated them on a scale from 0 ("different topic") to 5 ("completely equivalent"). A drawback of this corpus is that the arguments are curated, i.e., the dataset does not include noise or non-relevant arguments. Furthermore, the corpus covers only three different topics.
UKP ASPECT Corpus. To remedy these shortcomings, we create a new corpus with annotations on similar and dissimilar sentence-level arguments (Stab et al., 2018b), referred to as the Argument Aspect Similarity (UKP ASPECT) Corpus in the following. The UKP ASPECT corpus consists of sentences which have been identified as arguments for given topics using the ArgumenText system (Stab et al., 2018a). The ArgumenText system expects as input an arbitrary topic (query) and searches a large web crawl for relevant documents. Finally, it classifies all sentences contained in the most relevant documents for a given query into pro, con or non-arguments (with regard to the given topic).
We picked 28 topics related to currently discussed issues from technology and society. To balance the selection of argument pairs with regard to their similarity, we applied a weak supervision approach. For each of our 28 topics, we applied a sampling strategy that picks two pro or con argument sentences at random, calculates their similarity using the system by Misra et al. (2016), and keeps pairs with a probability aiming to balance diversity across the entire similarity scale. This was repeated until we reached 3,595 argument pairs, about 130 pairs for each topic.
The argument pairs were annotated on a range of three degrees of similarity (no, some, and high similarity) with the help of crowd workers on the Amazon Mechanical Turk platform. To account for unrelated pairs due to the sampling process, crowd workers could choose a fourth option. We collected seven assignments per pair and used Multi-Annotator Competence Estimation (MACE) with a threshold of 1.0 (Hovy et al., 2013) to consolidate votes into a gold standard. About 48% of the gold standard pairs are labeled with no similarity, whereas about 23% and 13% are labeled with some and high similarity, respectively. Furthermore, 16% of the pairs were labeled as containing invalid argument(s) (e.g. irrelevant to the topic at hand).
We asked six experts (graduate research staff familiar with argument mining) to annotate a random subset of 50 pairs from 10 topics. The resulting agreement among experts was Krippendorff's α = 0.43 (binary distance) and 0.47 (weighted distance), reflecting the high difficulty of the task. Krippendorff's α agreement between experts and the gold standard from crowd workers was 0.54 (binary) and 0.55 (weighted distance).

Argument Classification
To prevent the propagation of errors to the subsequent task of argument clustering, it is paramount to reach a high performance in this step.

Experimental Setup
For the UKP Corpus, we use the evaluation scheme proposed by Stab et al. (2018b): The models are trained on the train split (70% of the data) of seven topics, tuned on the dev split (10%) of these seven topics, and then evaluated on the test split (20%) of the eighth topic. A macro F1-score is computed over the three label classes, and scores are averaged over all topics and over ten random seeds. For the IBM Corpus, we use the setup by Shnarch et al. (2018): Training on 83 topics (4,066 sentences) and testing on 35 topics (1,719 sentences). We train for five different random seeds and report the average accuracy over all runs.

Methods
We experiment with a number of different models and distinguish between models which use topic information and ones that do not.
bilstm. This model was presented as a baseline by Stab et al. (2018b). It trains a bi-directional LSTM network on the sentence, followed by a softmax classifier and has no information about the topic. As input, pre-trained word2vec embeddings (Google News dataset) were used.
biclstm. Stab et al. (2018b) presented the contextualized LSTM (clstm), which adds topic information to the i- and c-cells of the LSTM. The topic information is represented by using pre-trained word2vec embeddings.
IBM. Shnarch et al. (2018) blend large automatically generated training sets with manually annotated data in the initialization step. They use an LSTM with 300-d GloVe embeddings and integrate the topic via attention. We re-implemented their system, as no official code is available.
We experiment with these three models by replacing the word2vec / GloVe embeddings with ELMo and BERT embeddings. The ELMo embeddings are obtained by averaging the output of the three layers from the pre-trained 5.5B ELMo model. For each token in a sentence, we generate a BERT embedding with the pre-trained BERT-large-uncased model.
Further, we evaluate fine-tuning the transformer network from BERT for our datasets:
BERT. We add a softmax layer to the output of the first token from BERT and fine-tune the network for three epochs with a batch size of 16 and a learning rate of 2e-5. We only present the sentence to the BERT model.
BERT topic. We add topic information to the BERT network by changing the input to the network. We concatenate the topic and the sentence (separated by a special [SEP]-token) and fine-tune the network as mentioned before.
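The difference between the two fine-tuning inputs can be illustrated with a minimal sketch. It assumes the standard BERT input layout (a leading [CLS] token, a closing [SEP] token, and segment ids that distinguish the two parts); the whitespace tokenizer is only a stand-in for BERT's WordPiece tokenizer, and the function name `build_bert_input` is hypothetical.

```python
def build_bert_input(sentence, topic=None):
    """Return (tokens, segment_ids) for a sentence, optionally paired with a topic.

    Assumption: whitespace splitting stands in for WordPiece tokenization.
    """
    sentence_tokens = sentence.lower().split()
    if topic is None:
        # Sentence-only input: a single segment.
        tokens = ["[CLS]"] + sentence_tokens + ["[SEP]"]
        segment_ids = [0] * len(tokens)
    else:
        # Topic-aware input: topic and sentence separated by [SEP].
        first = ["[CLS]"] + topic.lower().split() + ["[SEP]"]
        second = sentence_tokens + ["[SEP]"]
        tokens = first + second
        segment_ids = [0] * len(first) + [1] * len(second)
    return tokens, segment_ids


tokens, segments = build_bert_input("Uniforms reduce bullying", topic="school uniforms")
plain_tokens, plain_segments = build_bert_input("Uniforms reduce bullying")
```

With the topic in the same input sequence, every sentence token can attend to the topic tokens in each transformer layer, whereas the sentence-only variant never sees the topic.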

Results and Analysis
In the following, we present and analyze the results.
UKP Corpus. Replacing traditional embeddings in the bilstm with contextualized word embeddings improves the model's performance by around 6pp and 8pp in F1 for ELMo and BERT, respectively (see Table 1). The fine-tuned BERT-large improves by as much as 12pp over the baseline bilstm and thereby also outperforms the bilstm with BERT embeddings by around 4pp. Hence, using an intermediary BiLSTM layer for the BERT model even hurts the performance.
Using ELMo and BERT embeddings in the topic-integrating biclstm model significantly decreases the performance, as compared to their performance in the bilstm. The contextualized word embedding for a topic is different from the one of a topic appearing in a sentence, and the biclstm fails to learn a connection between them.
Including the topic in the fine-tuned BERT models increases the F1 score by approx. 14.5pp and 13pp for BERT-base and BERT-large, respectively. This is due to a vast increase in recall for both models; while changes in precision are mostly small, recall for positive and negative arguments increases by at least 21pp for both models. As such, BERT-large topic also beats the biclstm by almost 21pp in F1 score and represents a new state-of-the-art on this dataset.
While the gap to human performance remains at around 18pp in F1, our proposed approach decreases this gap significantly as compared to the previous state-of-the-art. Based on preliminary experimental results, we suspect that this gap can be further reduced by adding more topics to the training data.
The results show that (1) the BERT-[base/large] models largely improve F1 and precision for arguments and (2) leveraging topic information yields another strong improvement on the recall of argumentative sentences. The usefulness of topic information has already been shown by Stab et al. (2018b) through their biclstm and stems from a much higher recall of arguments while losing some of the precision when compared to their bilstm. Yet, their approach cannot compete with BERT's superior architecture; the topic-integrating models BERT-base topic and BERT-large topic not only compensate for the biclstm's drop in precision, but also increase the recall for pro and con arguments by at least 18pp and 15pp. We attribute this performance increase to BERT's multi-head attention between all word pairs, where every word in a sentence has an attention value with the topic (words).
IBM corpus. As a baseline for models that do not use any topic information, we train three simple BiLSTMs with ELMo, BERT, and 300-d GloVe embeddings and compare them to the fine-tuned base and large BERT models. As Table 1 shows, BERT and ELMo embeddings perform around 2.7 and 3.7pp better in accuracy than the GloVe embeddings. BERT-base yields 7pp higher accuracy, while its difference to the large model is only +1pp.
Both BERT-base and BERT-large outperform the baseline IBM set by Shnarch et al. (2018) by more than 6pp in accuracy. The topic-integrating models IBM ELMo and IBM BERT do not improve much over their BiLSTM counterparts, which do not use any topic information. Similar to the conclusion for the UKP corpus, we attribute this to the different embedding vectors we retrieve for a topic as compared to the vectors for a topic mention within a sentence. BERT-base topic and BERT-large topic show the largest improvement with 8pp over the baseline and represent a new state-of-the-art on this dataset. The fine-tuned BERT models show vast improvements over the baseline, which is in line with the findings for the UKP corpus.
Yet, in contrast to the results on the UKP corpus, adding topic information to the fine-tuned BERT models has only a small effect on the score. This can be explained by the different composition of the two corpora: while sentences in the UKP corpus may only be implicitly connected to their related topic (only 20% of all sentences contain their related topic), sentences in IBM's corpus all contain their related topic and are thus explicitly connected to it (although topics are masked with a placeholder). Hence, in the IBM corpus, there is much less need for the additional topic information in order to recognize the relatedness to a sentence.

Argument Clustering
Having identified a large amount of argumentative text for a topic, we next aim at grouping the arguments talking about the same aspects. For any clustering algorithm, a meaningful similarity between argument pairs is crucial and needs to account for the challenges regarding argument aspects, e.g., different aspect granularities, context-dependency or aspect multiplicity. Another requirement is robustness to topic-dependent differences.
Therefore, in this section, we study how sentence-level argument similarity and clustering can be improved by using contextualized word embeddings. We evaluate our methods on the UKP ASPECT and the AFS corpus (see Section 3.2).

Clustering Method
We use agglomerative hierarchical clustering (Day and Edelsbrunner, 1984) to cluster arguments.
We use the average linkage criterion to compute the similarity between two clusters A and B: $\frac{1}{|A||B|} \sum_{a \in A} \sum_{b \in B} d(a, b)$, for a given similarity metric d. As it is a priori unknown how many different aspects are discussed for a topic (number of clusters), we apply a stopping threshold which is determined on the train set.
We also tested the k-means and the DBSCAN clustering algorithms, but we found that agglomerative clustering generally yielded better performances in preliminary experiments.
Agglomerative clustering uses a pairwise similarity metric d between arguments. We propose and evaluate various similarity metrics in two setups: (1) Without performing a clustering, i.e. the quality of the metric is directly evaluated (without clustering setup), and (2) in combination with the described agglomerative clustering method (with clustering setup).
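The clustering procedure described above can be sketched in a few lines. This is a minimal, unoptimized illustration under stated assumptions: the similarity matrix `sim` is a toy example, and the stopping rule (stop once the best average-linkage similarity falls below the threshold) is one straightforward reading of the stopping threshold; the function names are hypothetical.

```python
def average_linkage(cluster_a, cluster_b, sim):
    """Mean pairwise similarity d(a, b) over all a in A and b in B."""
    total = sum(sim[a][b] for a in cluster_a for b in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def agglomerative_clustering(sim, threshold):
    """Greedily merge the two most similar clusters until none reaches the threshold."""
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > 1:
        best_score, best_pair = None, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = average_linkage(clusters[i], clusters[j], sim)
                if best_score is None or score > best_score:
                    best_score, best_pair = score, (i, j)
        if best_score < threshold:
            break  # no remaining pair of clusters is similar enough
        i, j = best_pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy symmetric similarity matrix for five arguments (illustrative values only).
sim = [
    [1.0, 0.9, 0.8, 0.1, 0.2],
    [0.9, 1.0, 0.7, 0.2, 0.1],
    [0.8, 0.7, 1.0, 0.3, 0.2],
    [0.1, 0.2, 0.3, 1.0, 0.6],
    [0.2, 0.1, 0.2, 0.6, 1.0],
]
clusters = agglomerative_clustering(sim, threshold=0.5)
```

Because the number of clusters is not fixed in advance, the threshold alone controls granularity: a higher value yields more, smaller clusters.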

Experimental Setup
We differentiate between unsupervised and supervised methods. Our unsupervised methods include no pre-training whereas the supervised methods use some data for fine-tuning the model. For the UKP ASPECT corpus, we binarize the four labels to only indicate similar and dissimilar argument pairs. Pairs labeled with some and high similarity were labeled as similar, pairs with no similarity and different topic as dissimilar.
We evaluate methods in a 4-fold cross-validation setup: seven topics are used for testing and 21 topics are used for fine-tuning. Final evaluation results are the average over the four folds. In case of supervised clustering methods, we use 17 topics for training and four topics for tuning. In their experiments on the AFS corpus, Misra et al. (2016) only performed a within-topic evaluation by using 10-fold cross-validation. As we are primarily interested in cross-topic performances, we evaluate our methods also cross-topic: we train on two topics, and evaluate on the third.
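The topic-level split for the UKP ASPECT corpus can be sketched as follows. The random assignment of topics to folds, the seed, and the placeholder topic names are assumptions for illustration; only the fold sizes (7 test, 17 train, 4 dev) come from the setup described above.

```python
import random


def topic_folds(topics, n_folds=4, n_dev=4, seed=0):
    """Split topics into cross-topic folds so that no topic is shared across splits."""
    topics = list(topics)
    random.Random(seed).shuffle(topics)  # assumption: folds are assigned at random
    fold_size = len(topics) // n_folds
    folds = []
    for i in range(n_folds):
        test = topics[i * fold_size:(i + 1) * fold_size]
        rest = [t for t in topics if t not in test]
        folds.append({"train": rest[n_dev:], "dev": rest[:n_dev], "test": test})
    return folds


# Placeholder topic names; the real corpus uses 28 named topics.
folds = topic_folds([f"topic_{i}" for i in range(28)])
```

Splitting by topic rather than by sentence pair is what makes the evaluation cross-topic: every test pair belongs to a topic the model has never seen during fine-tuning.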

Evaluation
For the UKP ASPECT dataset, we compute the macro-average F_mean of the F1-scores for the similar label (F_sim) and for the dissimilar label (F_dissim).
In the without clustering setup, we compute the similarity metric (d(a, b)) for an argument pair directly, and assign the label similar if it exceeds a threshold, otherwise dissimilar. The threshold is determined on the train set of a fold for unsupervised methods. For supervised methods, we use a held-out dev set.
In the with clustering setup, we use the similarity metric to perform agglomerative clustering. This assigns each argument exactly one cluster ID. Argument pairs in the same cluster are assigned the label similar, and argument pairs in different clusters are assigned the label dissimilar. We use these labels to compute F_sim and F_dissim given our gold label annotations.
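Both evaluation setups reduce to comparing predicted pair labels against gold pair labels. The following minimal sketch shows how labels are derived in each setup and how the three scores are computed; all function names and the toy data are hypothetical.

```python
def f1_for_label(gold, pred, label):
    """F1-score for a single label over aligned gold and predicted label lists."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f_scores(gold, pred):
    """Return (F_sim, F_dissim, F_mean)."""
    f_sim = f1_for_label(gold, pred, "similar")
    f_dissim = f1_for_label(gold, pred, "dissimilar")
    return f_sim, f_dissim, (f_sim + f_dissim) / 2


def labels_from_threshold(scores, threshold):
    """Without clustering: a pair is similar if its score exceeds the threshold."""
    return ["similar" if s > threshold else "dissimilar" for s in scores]


def labels_from_clusters(pairs, cluster_of):
    """With clustering: a pair is similar if both arguments share a cluster ID."""
    return ["similar" if cluster_of[a] == cluster_of[b] else "dissimilar"
            for a, b in pairs]


# Toy data (illustrative only).
pairs = [("a1", "a2"), ("a1", "a3"), ("a2", "a3"), ("a3", "a4")]
gold = ["similar", "dissimilar", "dissimilar", "similar"]

without_clustering = f_scores(gold, labels_from_threshold([0.9, 0.4, 0.7, 0.8], 0.5))
with_clustering = f_scores(
    gold, labels_from_clusters(pairs, {"a1": 0, "a2": 0, "a3": 0, "a4": 1}))
```

In the toy example, argument a3 is forced into a single cluster although the gold labels relate it to a4 and not to a1 or a2, which lowers the scores in the with clustering setup.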
For the AFS dataset, Misra et al. (2016) computed the correlation between the predicted similarity and the annotated similarity score. They do not mention which correlation method they used. In our evaluation, we show Pearson correlation (r) and Spearman's rank correlation coefficient (ρ).
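The two correlation measures differ in what they reward, which the following minimal pure-Python sketch makes explicit: Pearson's r measures linear agreement between the raw values, whereas Spearman's ρ is Pearson's r computed on ranks. The helper names and toy scores are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation r between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy predicted similarities and annotated scores on the 0-5 scale.
predicted = [0.1, 0.4, 0.35, 0.8, 0.95]
annotated = [0, 2, 1, 4, 5]
r = pearson(predicted, annotated)
rho = spearman(predicted, annotated)
```

Here the predictions order all pairs correctly, so ρ is 1 even though r stays below 1 because the relationship is not perfectly linear.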

Similarity Metrics
We experiment with the following methods to compute the similarity between two arguments.
Tf-Idf. We compute the most common words (without stop-words) in our training corpus and compute the cosine similarity between the Tf-Idf vectors of the two sentences.
InferSent. We compute the cosine similarity between the sentence embeddings returned by InferSent (Conneau et al., 2017).
Average Word Embeddings. We compute the cosine-similarity between the average word embeddings for GloVe, ELMo and BERT.
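A minimal sketch of this unsupervised metric is given below. It assumes a static lookup table of toy two-dimensional vectors; for ELMo and BERT, the per-token vectors would instead be computed from the sentence itself, but the averaging and cosine steps are the same. All names and values are hypothetical.

```python
import math


def average_embedding(tokens, embeddings):
    """Average the vectors of all tokens that have an embedding."""
    dim = len(next(iter(embeddings.values())))
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy 2-d embeddings (illustrative values only).
EMBEDDINGS = {
    "open": [1.0, 0.0],
    "internet": [0.8, 0.2],
    "access": [0.9, 0.1],
    "paid": [0.0, 1.0],
}
arg_a = average_embedding(["open", "internet"], EMBEDDINGS)
arg_b = average_embedding(["internet", "access"], EMBEDDINGS)
arg_c = average_embedding(["paid"], EMBEDDINGS)
sim_ab = cosine_similarity(arg_a, arg_b)
sim_ac = cosine_similarity(arg_a, arg_c)
```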
BERT. We fine-tune the BERT-uncased model to predict the similarity between two given arguments. We add a sigmoid layer to the special [CLS] token and train it on some of the topics. We fine-tune for three epochs, with a learning rate of 2e-5 and a batch size of 32.
Human Performance. We approximated the human upper bound on the UKP ASPECT corpus in the following way: we randomly split the seven pair-wise annotations into two groups, computed their corresponding MACE (Hovy et al., 2013) scores and calculated F_sim, F_dissim and F_mean. We repeated this process ten times and averaged over all runs (without clustering setup). For the with clustering setup, we applied agglomerative hierarchical clustering on the MACE scores of one of the two groups and computed the evaluation metrics using the other group as the gold label. For the AFS dataset, Misra et al. (2016) computed the correlation between the three human annotators.

Results and Analysis
Unsupervised Methods. Table 2 shows the performance on the novel UKP ASPECT Corpus. When evaluating the argument similarity metrics directly (without clustering setup), we notice no large differences between averaging GloVe, ELMo or BERT embeddings. These three setups perform worse than applying InferSent with fastText embeddings. Tf-Idf shows the worst performance. In Table 3, we show the performances for the AFS corpus (detailed results in the appendix, Table 5). In contrast to the ASPECT Corpus, the Tf-Idf method achieves the best performance and InferSent with fastText embeddings achieves the worst performance. As for the ASPECT Corpus, ELMo and BERT embeddings do not lead to an improvement compared to averaged GloVe embeddings. Unsupervised methods compute some type of similarity between sentence pairs. However, as our experiments show, this similarity notion is not necessarily the notion needed for the task.
Table 3: Pearson correlation r and Spearman's rank correlation ρ on the AFS dataset (Misra et al., 2016) averaged over the three topics.
Supervised Methods. We fine-tune the BERT model for some of the topics and study the performance on unseen topics. For the ASPECT Corpus, we observe a performance increase of 7.8pp. Identifying dissimilar arguments (F_dissim) is on par with the human performance, and identifying similar arguments achieves an F-score of .67, compared to .75 for human annotators. For the AFS dataset, we observe that fine-tuning the BERT model significantly improves the performance by 11pp compared to the previous state-of-the-art from Misra et al. (2016).
In a cross-topic evaluation setup on the AFS dataset, we observe that the performance drops to .57 Spearman correlation. This is still significantly larger than the best unsupervised method.
We evaluated the effect of the training set size on the performance of the BERT model for the ASPECT Corpus. A certain number of topics were randomly sampled and the performance was evaluated on distinct topics. This process was repeated 10 times with different random seeds (Reimers and Gurevych, 2018). Table 4 shows the averaged results.
By allowing fine-tuning on five topics, we are able to improve the F_mean score to .71 compared to .65 when using BERT without fine-tuning (without clustering setup). Adding more topics then slowly increases the performance.
With Clustering. We studied how the performance changes on the ASPECT corpus if we combine the similarity metric with agglomerative clustering (Table 2). We notice that the performances drop by up to 7.64pp. Agglomerative clustering is a strict partitioning algorithm, i.e., each object belongs to exactly one cluster. However, an argument can address more than one aspect of a topic and could therefore belong to more than one cluster. Hence, strict partitioning clustering methods introduce a new source of errors.
We can estimate this source of error by evaluating the transitivity in our dataset. For a strict partitioning setup, if arguments A ∼ B and B ∼ C are similar, then A ∼ C must be similar as well. This transitivity property is violated in 376 out of 1,714 (21.9%) cases, indicating that strict partitioning is a suboptimal setup for the ASPECT dataset. This also explains why the human performance in the with clustering setup is significantly lower than in the without clustering setup. As Table 2 shows, a better similarity metric does not necessarily lead to a better clustering performance with agglomerative clustering. Humans are better than the proposed BERT model at estimating the pairwise similarity of arguments. However, when combined with a clustering method, the performances are on par.
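A transitivity check of this kind can be sketched as follows. The exact counting protocol behind the reported numbers is not specified in the text, so this sketch makes an assumption: it considers every argument triple for which all three pairs are annotated and at least two pairs are similar, and counts a violation when the third pair is dissimilar. The function name and toy labels are hypothetical.

```python
from itertools import combinations


def transitivity_violations(labels):
    """Count violated and checked triples.

    labels maps frozenset({a, b}) -> True (similar) or False (dissimilar).
    """
    arguments = sorted({arg for pair in labels for arg in pair})
    checked = violated = 0
    for a, b, c in combinations(arguments, 3):
        triple = [frozenset((a, b)), frozenset((b, c)), frozenset((a, c))]
        if not all(pair in labels for pair in triple):
            continue  # transitivity can only be tested on fully annotated triples
        n_similar = sum(labels[pair] for pair in triple)
        if n_similar >= 2:
            checked += 1
            if n_similar == 2:
                violated += 1  # two similar pairs, but the third is dissimilar
    return violated, checked


# Toy annotations (illustrative only).
labels = {
    frozenset(("a1", "a2")): True,
    frozenset(("a2", "a3")): True,
    frozenset(("a1", "a3")): False,
    frozenset(("a1", "a4")): True,
    frozenset(("a2", "a4")): True,
    frozenset(("a3", "a4")): False,
}
violated, checked = transitivity_violations(labels)
```

Any violated triple is guaranteed to produce at least one error under strict partitioning, since no assignment of three arguments to disjoint clusters can reproduce two similar pairs and one dissimilar pair.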

Conclusion
Open-domain argument search, i.e. identifying and aggregating arguments for unseen topics, is a challenging research problem. The first challenge is to identify suitable arguments. Previous methods achieved low F1-scores in a cross-topic scenario, e.g., Stab et al. (2018b) achieved an F1-score of .27 for identifying pro-arguments. We could significantly improve this performance to .53 by using contextualized word embeddings. The main performance gain came from integrating topic information into the transformer network of BERT, which added 13pp compared to the setup without topic information.
The second challenge we addressed is to decide whether two arguments on the same topic are similar. Previous datasets on argument similarity used curated lists of arguments, which eliminates noise from the argument classification step. In this publication, we annotated similar argument pairs that came from an argument search engine. As the annotation showed, about 16% of the pairs were noisy and did not address the target topic.
Unsupervised methods on argument similarity showed rather low performance scores, confirming that fine-grained semantic nuances rather than lexical overlap determine the similarity between arguments. We were able to train a supervised similarity function based on the BERT transformer network that, even with little training data, significantly improved over unsupervised methods.
While these results are very encouraging and stress the feasibility of open-domain argument search, our work also points to some weaknesses of the current methods and datasets. A good argument similarity function is only the first step towards argument clustering. We evaluated the agglomerative clustering algorithm in combination with our similarity function and identified it as a new source of errors. Arguments can address multiple aspects and therefore belong to multiple clusters, something that is not possible to model using partitional algorithms. Future work should thus study the overlapping nature of argument clustering. Further, more realistic datasets that allow end-to-end evaluation are required.

A.1 UKP ASPECT Corpus: Amazon Mechanical Turk Guidelines and Inter-annotator Agreement
The annotations required for the UKP ASPECT Corpus were acquired via crowdsourcing on the Amazon Mechanical Turk platform. Workers participating in the study had to be located in the US, with more than 100 HITs approved and an overall acceptance rate of 90% or higher. We paid them at the US federal minimum wage of $7.25/hour. Workers also had to qualify for the study by passing a qualification test consisting of twelve test questions with argument pairs. Figure 2 shows the instructions given to workers.

Read each of the following sentence pairs and indicate whether they argue about the same aspect with respect to the given topic (given as "Topic Name" on top of the HIT). There are four options, of which one needs to be assigned to each pair of sentences (arguments). Please read the following for more details.
• Different Topic/Can't decide: Either one or both of the sentences belong to a topic different than the given one, or you can't understand one or both sentences. If you choose this option, you need to very briefly explain why you chose it (e.g. "The second sentence is not grammatical", "The first sentence is from a different topic" etc.). For example,
Argument A: "I do believe in the death penalty, tit for tat".
Argument B: "Marriage is already a civil right everyone has, so like anyone you have it too".
• No Similarity: The two arguments belong to the same topic, but they don't show any similarity, i.e. they speak about completely different aspects of the topic. For example,
Argument A: "If murder is wrong then so is the death penalty".
Argument B: "The death penalty is an inappropriate way to work against criminal activity".
• Some Similarity: The two arguments belong to the same topic, showing semantic similarity on a few aspects, but the central message is rather different, or one argument is way less specific than the other. For example,
Argument A: "The death penalty should be applied only in very extreme cases, such as when someone commands genocide".
Argument B: "An eye for an eye: He who kills someone else should face capital punishment by the law".
• High Similarity: The two arguments belong to the same topic, and they speak about the same aspect, e.g. using different words. For example,
Argument A: "An ideal judiciary system would not sentence innocent people".
Argument B: "The notion that guiltless people may be sentenced is indeed a judicial system problem".
Your rating should not be affected by whether the sentences attack (e.g. "Animal testing is cruel and inhumane" for the topic "Animal testing") or support (e.g. "Animals do not have rights, therefore animal testing is fair" for the topic "Animal testing") the topic, but only by the aspect they are using to support or attack the topic.

A.2 AFS Corpus: Detailed Results
Table 5: Pearson correlation r and Spearman's rank correlation ρ on the AFS dataset. Within-Topic Evaluation: 10-fold cross-validation. Cross-Topic Evaluation: System trained on two topics, evaluated on the third.