Exploiting Debate Portals for Semi-Supervised Argumentation Mining in User-Generated Web Discourse

Analyzing arguments in user-generated Web discourse has recently gained attention in argumentation mining, an evolving field of NLP. Current approaches, which employ fully-supervised machine learning, are usually domain dependent and suffer from the lack of large and diverse annotated corpora. However, annotating arguments in discourse is costly, error-prone, and highly context-dependent. We ask whether leveraging unlabeled data in a semi-supervised manner can boost the performance of argument component identification and to what extent the approach is independent of domain and register. We propose novel features that exploit clustering of unlabeled data from debate portals based on a word embeddings representation. Using these features, we significantly outperform several baselines in the cross-validation, cross-domain, and cross-register evaluation scenarios.


Introduction
Argumentation mining, an evolving sub-field of NLP, deals with analyzing argumentation in various genres, such as legal cases (Mochales and Moens, 2011), student essays (Stab and Gurevych, 2014a), and medical and scientific articles (Green, 2014; Teufel and Moens, 2002). Recently, the focus of argumentation mining has also shifted to Web registers (such as comments to articles, forum posts, or blogs), motivated by the need to retrieve and understand ordinary people's arguments on various contentious topics on a large scale. Applications include passenger rights and protection (Park and Cardie, 2014), hotel reviews (Wachsmuth et al., 2014), and controversies in education (Habernal et al., 2014).
Despite the plethora of existing argumentation theories (van Eemeren et al., 2014), the prevalent view in argumentation mining treats arguments as discourse structures consisting of several argument components, such as claims and premises (Peldszus and Stede, 2013). Current approaches to automatic analysis of argumentation usually follow the fully supervised machine-learning paradigm (Biran and Rambow, 2011; Stab and Gurevych, 2014b; Park and Cardie, 2014) and rely on manually annotated datasets. Only a few publicly available argumentation corpora exist, as annotations are costly, error-prone, and require skilled human annotators (Stab and Gurevych, 2014a; Habernal et al., 2014).
To overcome the limited scope and size of the existing annotated corpora, semi-supervised methods can be adopted, as they gain performance by exploiting large unlabeled datasets (Settles, 2012). However, unlike in other NLP tasks where data can be cheaply labeled using, for example, distant supervision, employing such methods in argumentation mining is questionable. First, argumentation is an act of persuasion (Nettel and Roque, 2011; Mercier and Sperber, 2011), but not all user-generated texts can be treated as persuasive (Park and Cardie, 2014; Habernal et al., 2014); thus the selection of an appropriate unlabeled dataset represents a problem of its own. Second, argument components (e.g., claims or premises) are highly context-dependent and cannot be easily labeled in distant data using predefined patterns. So far, semi-supervised methods for argumentation mining remain unexplored.
In this article, we tackle argumentation mining of user-generated Web data by exploiting debate portals, i.e., semi-structured discussion websites where members pose contentious questions to the community and allow others to pick a side and provide their opinions and arguments in order to 'win' the debate. Our first research question is whether debate portals (which contain noisy user-generated data) can be utilized in a semi-supervised manner for fine-grained identification of argument components. As a second research question, we investigate to what extent our methods are domain independent and evaluate their adaptation across several domains and registers. Our contribution is three-fold. First, to the best of our knowledge, we present the first successful attempt at semi-supervised argumentation mining in Web data based on exploiting unlabeled external resources. We leverage these resources and derive features in an unsupervised manner by projecting data from debate portals into a latent argument space using unsupervised word embeddings and clustering. Second, our novel features significantly outperform state-of-the-art features in all scenarios, namely in cross-validation, cross-domain evaluation, and cross-register evaluation. Third, to ensure full reproducibility of our experiments, we provide all data and source codes under free licenses.

Related work
Analysis of argumentation has been an active topic in numerous research areas, such as philosophy (van Eemeren et al., 2014), communication studies (Mercier and Sperber, 2011), and informal logic (Blair, 2004), among others. In this section, we first focus on the most closely related work on argumentation mining techniques in NLP and then on work dealing with Web data. Mochales and Moens (2011) based their work on argumentation schemes (Walton et al., 2008) and experimented with the Araucaria and ECHR datasets using supervised models to classify argumentative and non-argumentative sentences (≈ 0.7 F1) and their structure. Feng and Hirst (2011) classified argument schemes on the Araucaria dataset, reaching 0.6-0.9 accuracy. Experiments on this dataset were also conducted by Rooney et al. (2012), who classified sentences into four categories (conclusion, premise, conclusion-premise, and none) and achieved 0.65 accuracy. These approaches assume the text is already segmented into argument components. Stab and Gurevych (2014b) examined argumentation in persuasive essays and classified argument components into four categories (premise, claim, major claim, non-argumentative) using SVM and achieved a 0.73 macro F1 score. They further classified argument relations (support and attack) and reached a 0.72 macro F1 score. The best-performing features were structural features (such as the location or length ratios), as persuasive essays usually comply with a certain structure, which can be seen as a potential drawback of this approach.
Regarding user-generated Web data, Biran and Rambow (2011) used naive Bayes for classifying justification of subjective claims from blogs and Wikipedia talk pages, relying on features from the RST Treebank and manually-processed n-grams. In similar Web registers, Rosenthal and McKeown (2012) automatically determined whether a sentence is a claim using logistic regression and various lexical and sentiment-related features and achieved an accuracy of about 0.66-0.71. Park and Cardie (2014) classified propositions in user comments into three classes (verifiable experiential, verifiable non-experiential, and unverifiable) using SVM and reached a 0.69 macro F1 score. Goudas et al. (2014) identified premises in Greek social media texts using BIO encoding and achieved a 0.42 F1 score with Conditional Random Fields. The research gaps in the above-mentioned approaches are the following. First, the argumentation models are simplified to either claims or a few types of premises/propositions. Second, the segmentation of discourse into argument components is ignored (except in the work of Goudas et al. (2014)). Recently, Boltužić and Šnajder (2015) employed hierarchical clustering to cluster arguments in online debates using embeddings projection, but in contrast to our work they performed only intrinsic evaluation of the clusters.
Debate portals have been used in a related body of research, such as classifying support and attack between posts by Cabrio and Villata (2012), or stance detection by Hasan and Ng (2013) or Gottipati et al. (2013). These approaches consider the complete documents (posts) but do not analyze the micro-level argumentation (e.g., claims or premises).

Doc #2823 (article comment, public-private-schools): [claim: I agree -Kids can do great in the public school system and parents DO need to be involved.] The more people leave, the worse its going to become.
Can't say I particularly liked it, I would of much preferred gone to a co-ed.] [premise: It is closer to the 'real world' that way. Kids should grow up in the company of both sexes... They will be more at ease around the opposite sex when they are older and it just makes sense.] If it is purely education you are concerned about (and not so much behaviour), our year (at a private school) went shockingly bad in OP scores. We were the worst in 12 years and were beaten by LOTS of co-ed and public schools... So you can never tell. In saying that my sister really enjoyed going to an all girls school. Her year went really well too.

Data
As data for training and evaluation of our methods, we use a corpus consisting of 340 English documents (approx. 90k tokens) annotated with argumentation by Habernal et al. (2014). Compared to other corpora mentioned in the related work, this corpus is the largest one to date that covers different domains and spans several registers of user-generated Web content. In particular, the corpus comprises four registers (comments to articles, forum posts, blogs, and argumentative newswire articles) and covers six domains related to educational controversies (homeschooling, private vs. public schools, mainstreaming, single-sex education, prayer in schools, and redshirting). The argumentation model used in this corpus is based on an extended Toulmin's model (Toulmin, 1958). Each document usually contains one argument, where each argument consists of several argument components. There are five different components in this model, namely, the claim (the statement to be established in the argument, which conveys the author's stance towards the topic), the premise(s) (propositions that are intended to give reasons of some kind for the claim), the backing (additional information used to back up the argument), the rebuttal (which attacks the claim), and the refutation (which attacks the rebuttal). Relations between the argument components are encoded implicitly in the function of the particular component type; for instance, premises are always attached to the claim. We made two observations in the data: the claim is often implicit (must be inferred by the reader), and some sentences have no argumentative function (thus are not labeled by any argument component). Figure 1 depicts two example annotations from the corpus. Argument components were annotated on the token level as non-overlapping annotation spans. We therefore represent the argument annotations using BIO encoding. Each token is labeled with one of 11 categories (5 argument component types × B or I tag + one O category for non-argumentative text).
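The BIO scheme described above can be illustrated with a minimal sketch. The function name `bio_encode`, the span format `(start, end, component)` with an exclusive end offset, and the toy sentence are assumptions made for illustration; they are not taken from the corpus tooling.

```python
# Five component types from the extended Toulmin model used in the corpus.
COMPONENTS = ["Claim", "Premise", "Backing", "Rebuttal", "Refutation"]

# 5 components x {B, I} + "O" = 11 token-level categories.
LABELS = [f"{c}-{t}" for c in COMPONENTS for t in ("B", "I")] + ["O"]


def bio_encode(n_tokens, spans):
    """Turn non-overlapping (start, end, component) spans into BIO labels.

    `end` is exclusive; tokens outside any span receive "O".
    """
    labels = ["O"] * n_tokens
    for start, end, component in spans:
        labels[start] = f"{component}-B"
        for i in range(start + 1, end):
            labels[i] = f"{component}-I"
    return labels


# Hypothetical toy input: the first three tokens form a claim.
tokens = "I agree . The more people leave".split()
print(list(zip(tokens, bio_encode(len(tokens), [(0, 3, "Claim")]))))
```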

Method
We cast the task of identifying argument components as a sequence tagging problem and employ SVM^hmm (Joachims et al., 2009). For linguistic annotations and feature engineering, we rely on two UIMA-based frameworks: DKPro Core (Eckart de Castilho and Gurevych, 2014) and DKPro TC (Daxenberger et al., 2014).
Although the argument component annotations in the corpus are aligned to the token boundaries (token-level annotations), the minimal classification unit in our sequence tagging approach is set to the sentence level. First, this allows us to capture rich features that are available for entire sentences as opposed to the token level. Second, by modeling sequences on the token level we would lose the advantage of SVM^hmm to estimate dependencies between labels, as the label context is limited due to computational feasibility. On the token level, the label sequences are rather static (long sequences with the same label), as opposed to the sentence level. Before the classification step, we adjust all annotation boundaries (note that we use 11 BIO labels) so that they are aligned to the sentence boundaries, and each sentence is then treated as a single classification unit with one label (for example, the first sentence from Figure 1 with token labels Claim-B, Claim-I, Claim-I, ... becomes Claim-B). After classification, the labels are mapped back to tokens (so that, for example, a Claim-B sentence label is transformed to Claim-B, Claim-I, ... token labels). However, all evaluations are performed on the token level and the performance is always measured against the original token labels. Using this approximation, we lose only about 10% of F1 performance.
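The round trip between token labels and sentence labels can be sketched as follows. The text does not state how a sentence with mixed token labels is resolved, so the majority rule used here is an assumption, as are the function names and the `(start, end)` sentence offsets.

```python
from collections import Counter


def sentence_labels(token_labels, sentences):
    """Collapse token-level BIO labels to one label per sentence.

    Assumption: a sentence takes the component covering most of its tokens.
    It gets a -B tag if it contains that component's B token or if the
    previous sentence carried a different component, and -I otherwise.
    """
    out, prev = [], None
    for start, end in sentences:
        chunk = token_labels[start:end]
        comp = Counter(l.split("-")[0] for l in chunk).most_common(1)[0][0]
        if comp == "O":
            out.append("O")
        elif f"{comp}-B" in chunk or prev != comp:
            out.append(f"{comp}-B")
        else:
            out.append(f"{comp}-I")
        prev = comp
    return out


def expand_to_tokens(sent_labels, sentences):
    """Map sentence labels back to tokens (X-B becomes X-B, X-I, X-I, ...)."""
    tokens = []
    for label, (start, end) in zip(sent_labels, sentences):
        n = end - start
        if label.endswith("-B"):
            tokens += [label] + [label[:-2] + "-I"] * (n - 1)
        else:
            tokens += [label] * n
    return tokens


# Toy document: two sentences of three tokens each.
gold = ["Claim-B", "Claim-I", "Claim-I", "O", "Premise-B", "Premise-I"]
sents = [(0, 3), (3, 6)]
coarse = sentence_labels(gold, sents)
restored = expand_to_tokens(coarse, sents)
print(coarse)
print(restored)
```

In the toy document the second sentence starts with a non-argumentative token, so the restored token labels differ from the gold labels at the span boundary. This is the kind of approximation loss the paragraph above refers to, since evaluation is still carried out against the original token labels.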

Baseline features
Lexical baseline (FS0) We encode the presence of unigrams, bigrams, and trigrams in the sentence as 'one-hot' (binary) features.
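As a minimal sketch of such binary n-gram features, assuming a tokenized sentence and a hypothetical feature naming scheme (`1gram=...`, `2gram=...`, `3gram=...`):

```python
def ngram_features(tokens, max_n=3):
    """Binary presence features for all 1- to max_n-grams in a sentence."""
    feats = {}
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            feats[f"{n}gram=" + "_".join(tokens[i:i + n])] = 1
    return feats


feats = ngram_features(["kids", "should", "learn"])
print(sorted(feats))
```

Absent n-grams are simply missing from the dictionary, which corresponds to a zero in a sparse one-hot vector.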

Structural and syntactic features (FS1)
Since the presence of discourse markers has been shown to be helpful in argument component analysis (e.g., "therefore" and "since" for premises or "think" and "believe" for claims), we encode the first and last three words as binary features. Furthermore, we capture the relative position of the sentence in the paragraph and the document, the number of part-of-speech 1-3 grams, maximum dependency tree depth, constituency tree production rules, and number of sub-clauses (Stab and Gurevych, 2014b). We used the Stanford POS Tagger (Toutanova et al., 2003), the Berkeley parser (Petrov et al., 2006), and the Malt parser (Nivre, 2009).
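Two of these features, the first and last three words and the relative position in the document, can be sketched without any parser. The function name, the feature names, the lowercasing, and the position normalization are assumptions for illustration; the syntactic features would additionally require the taggers and parsers named above.

```python
def structural_features(sentence_tokens, sent_index, n_sentences):
    """First/last three words as binary features plus relative position."""
    feats = {}
    for w in sentence_tokens[:3]:
        feats[f"first3={w.lower()}"] = 1
    for w in sentence_tokens[-3:]:
        feats[f"last3={w.lower()}"] = 1
    # Position of the sentence in the document, scaled to [0, 1].
    feats["rel_pos_doc"] = sent_index / max(n_sentences - 1, 1)
    return feats


f = structural_features(["Therefore", ",", "kids", "should", "learn"], 1, 3)
print(f)
```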

Sentiment and topic features (FS2)
We assume that claims express sentiment, thus we compute five sentiment categories (from very negative to very positive) using the Stanford sentiment analyzer (Socher et al., 2013) and use these values directly as features. Furthermore, in order to help detect off-topic and non-argument sentences, we employ topic model features. In particular, we use features taken from a vector representation of the sentence obtained by using Gibbs sampling on an LDA model (Blei et al., 2003; McCallum, 2002) with topics trained on unlabeled data provided as a part of the corpus.

Semantic and discourse features (FS3)
Features based on semantic frames have been introduced in relevant works on stance recognition (Hasan and Ng, 2013). Our features, based on PropBank semantic role labels and obtained from

Unsupervised features
We enrich the above-mentioned features by utilizing large external unlabeled resources: debate portals. They fulfill several criteria, namely (a) they are 'argumentative' (as opposed to, for example, prose or encyclopedic genres), (b) they consist of user-generated content, and (c) there is at least some overlap with topics from our experimental corpus. On the other hand, they contain noisy texts of questionable quality and they do not provide any specific argumentative structure (in fact, these debates are simple discussions on a topic, where each post is only labeled with a pro or contra stance). Nevertheless, we assume that the posts from (unlabeled) debate portals contain valuable information that will help us with classifying argument components in labeled data. In order to do so, we employ clustering based on latent semantics, which we now formalize as argument space features.
We assume that phrases (sentences or documents) can be projected into a latent vector space, using, typically, a sum or a weighted average of all the word embeddings vectors in the phrase; see for example (Le and Mikolov, 2014). Neighboring vectors in the latent vector space exhibit some interesting properties, such as semantic similarity (thoroughly studied within the distributional semantics area). If the latent vector space is clustered, each n-dimensional vector gets reduced to a single cluster number; such clusters have been used directly as features in many tasks, such as NER (Turian et al., 2010), POS tagging (Owoputi et al., 2013), or sentiment analysis (Habernal and Brychcín, 2013).
We build upon the above-mentioned approach (described by Søgaard (2013) as the 'clusters-as-features' semi-supervised paradigm) and extend it further. We take both sentences and posts from the unlabeled debate portals, project them into a latent space using word embeddings, and cluster them. The motivation is that these clusters will contain similar phrases (or similar 'arguments'). Centroids of these clusters would then represent a 'prototypical argument' (note that the centroids exist only in the latent vector space and thus do not correspond to any existing sentence or post). Then we project each sentence (classification unit) in the labeled data to the latent vector space, compute its distance vector to all the cluster centroids, and encode this distance vector directly as real-valued features. In contrast to the above-mentioned works using a single cluster label as a feature, the distance vector to cluster centroids resembles a soft labeling where each sentence belongs to several clusters with a certain 'weight'. We also use the latent vector space representation of the sentence directly as a feature vector.
As unlabeled data, we use data from the two largest debate portals. 9 As a pre-processing step we removed all posts with less than one 'point' earned. 10 The data were then indexed using the Lucene framework and the top 100 debates for each of the 6 domains were retrieved, which resulted in 5,759 posts (≈ 35k sentences) in the unlabeled data in total. Our approach is formalized in the following paragraph.
Argument space features (FS4) Let $e(w)$ be the embedding vector of word $w$ and $\mathrm{tfidf}(w)$ be the TF-IDF value of $w$. A sentence $s = (w_1, \dots, w_n)$ is then projected into the embedding space $E$ as $\vec{s}_e = n^{-1} \sum_{i=1}^{n} \mathrm{tfidf}(w_i)\, e(w_i)$, so $\dim(\vec{s}_e) = \dim(E)$. Analogously to $s$, we project the entire post $a = (w_1, \dots, w_m)$ into the same embedding space $E$ such that $\vec{a}_e = m^{-1} \sum_{i=1}^{m} \mathrm{tfidf}(w_i)\, e(w_i)$. Let $K$ be the number of sentence clusters in $E$ and $\vec{c}_k$ the centroid vector of cluster $k \in \{1, \dots, K\}$. Then $\vec{s}_c$ denotes the distance of sentence $\vec{s}_e$ to the sentence cluster centroids such that $\vec{s}_c = (\cos(\vec{s}_e, \vec{c}_1), \dots, \cos(\vec{s}_e, \vec{c}_K))$, where $\dim(\vec{s}_c) = K$ and $\cos(\cdot, \cdot)$ denotes cosine similarity. Analogously, let $L$ be the number of post clusters in $E$ and $\vec{a}_l$ the centroid vector of cluster $l \in \{1, \dots, L\}$. Then $\vec{s}_a$ denotes the distance of sentence $\vec{s}_e$ to the post cluster centroids such that $\vec{s}_a = (\cos(\vec{s}_e, \vec{a}_1), \dots, \cos(\vec{s}_e, \vec{a}_L))$. We construct the feature vector by concatenating $\vec{s}_e$, $\vec{s}_c$, and $\vec{s}_a$.

9 createdebate.com and convinceme.net, licensed under Creative Commons (CC-BY and CC0, resp.)
10 'Points' is the sum of up-votes/down-votes by other users to the particular post. Zero-point posts were usually noisy and spam-like.
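The FS4 construction can be sketched in plain Python. The toy two-dimensional embeddings, TF-IDF values, and centroids below are invented for illustration, and the cluster centroids are taken as given because the clustering algorithm is not specified in this section. Skipping out-of-vocabulary words while still dividing by the full sentence length is also an assumption.

```python
import math


def project(tokens, emb, tfidf, dim):
    """TF-IDF-weighted average of word embeddings: n^-1 * sum tfidf(w) e(w)."""
    vec = [0.0] * dim
    for w in tokens:
        if w in emb:  # assumption: words without an embedding contribute zero
            weight = tfidf.get(w, 0.0)
            for j in range(dim):
                vec[j] += weight * emb[w][j]
    n = len(tokens)
    return [v / n for v in vec] if n else vec


def cosine(u, v):
    """Cosine similarity; defined as 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def argument_space_features(tokens, emb, tfidf, sent_centroids, post_centroids, dim):
    """Concatenate s_e, s_c (K similarities), and s_a (L similarities)."""
    s_e = project(tokens, emb, tfidf, dim)
    s_c = [cosine(s_e, c) for c in sent_centroids]
    s_a = [cosine(s_e, a) for a in post_centroids]
    return s_e + s_c + s_a


# Hypothetical toy resources: dim(E) = 2, K = 2, L = 1.
emb = {"school": [1.0, 0.0], "public": [0.0, 1.0]}
tfidf = {"school": 2.0, "public": 1.0}
sent_centroids = [[1.0, 0.0], [0.0, 1.0]]
post_centroids = [[1.0, 1.0]]

fs4 = argument_space_features(["public", "school"], emb, tfidf,
                              sent_centroids, post_centroids, dim=2)
print(fs4)
```

The resulting vector has dim(E) + K + L entries, so with the values K = 1000 and L = 100 reported later in the paper it grows by 1,100 dimensions beyond the embedding size.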

Results
We investigate three evaluation scenarios. First, we report 10-fold cross-validation over all 340 documents, where the data are randomly distributed across the folds regardless of the domain or register. In this scenario, the model can benefit from domain-dependent features for the testing data, such as lexical knowledge (FS0) or domain-relevant argument space features (FS4). Second, we evaluate the cross-domain performance; the model is always trained on five domains and tested on the sixth one. In this setting, we also remove all features that exploit distant data relevant to the test set. For instance, if the test domain is mainstreaming, we exclude all debates relevant to this domain before constructing the argument space features (FS4). This evaluates the model's cross-domain performance without any target domain data available. Finally, we test cross-register performance in two set-ups: we train the models using comments and forum posts and test on blogs and newswire articles, and then the other way round. We divided the data into these two parts based on similar properties of blogs/articles and comments/forums, such as the length or the distribution of argumentative and non-argumentative text.
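The cross-domain protocol described above, including the removal of debates from the held-out domain before FS4 is built, can be sketched as a generator of splits. The domain identifiers and the dictionary-based document format are hypothetical.

```python
# Hypothetical identifiers for the six domains named in the Data section.
DOMAINS = [
    "homeschooling", "public-private-schools", "mainstreaming",
    "single-sex-education", "prayer-in-schools", "redshirting",
]


def cross_domain_splits(labeled_docs, debate_posts):
    """Yield (test_domain, train, test, allowed_debates) for each domain.

    Debate posts from the test domain are excluded so that the argument
    space features never see target-domain data.
    """
    for test_domain in DOMAINS:
        train = [d for d in labeled_docs if d["domain"] != test_domain]
        test = [d for d in labeled_docs if d["domain"] == test_domain]
        allowed = [p for p in debate_posts if p["domain"] != test_domain]
        yield test_domain, train, test, allowed


# Toy data: one labeled document and one debate post per domain.
docs = [{"id": i, "domain": dom} for i, dom in enumerate(DOMAINS)]
posts = [{"id": 100 + i, "domain": dom} for i, dom in enumerate(DOMAINS)]
splits = list(cross_domain_splits(docs, posts))
for name, train, test, allowed in splits:
    print(name, len(train), len(test), len(allowed))
```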
In the evaluation, we focus on F1 scores achieved on claims, premises, backing, and non-argumentative text (the 'O' class). Although the classifier is trained and tested on all 11 classes including rebuttal and refutation, we do not report performance of these two argument components; the results are very poor regardless of the parameters for two reasons. First, these classes are underrepresented in the data (Rebuttal-B, Rebuttal-I, Refutation-B and Refutation-I are present in only about 4% of sentences). Second, the inter-annotator agreement reached on these classes was reported to be very low (Habernal et al., 2014).
Table 1 shows results for the cross-validation scenario. The human baseline in the first row is an average score between three original annotators of the dataset. The baseline features (FS0) perform poorly, yet they beat the random assignment and majority vote (< 0.12 F1). The argument space features (FS4) increase the performance in every combination. The best results for claims are achieved when only discourse, sentiment, and argument space features are involved (FS3 and FS4), whereas premises and backing benefit from the presence of lexical, syntactic, and semantic features (the richest feature set). The overall average best results are obtained from a feature combination with a higher level of abstraction, in particular without low-level lexical features from FS0. After the cross-validation experiments, we also fixed the hyperparameters (using grid search) to K = 1000, L = 100 for the cluster sizes and t = 1 and e = 0 for the hyperparameters of SVM^hmm.
Table caption (fragment): star (*) denotes that the row is significantly better than the previous row; dagger (†) means the row is not significantly better than the previous row, but is significantly better than the previous row minus one; p < 0.001 using exact Liddell's test (Liddell, 1983).
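A per-class F1 score on the token level, as used in the evaluation above, can be computed as in the following sketch. The function name and the toy label sequences are assumptions, and the sketch does not reproduce how scores were averaged for the tables.

```python
def f1_per_class(gold, pred, label):
    """Token-level F1 for one label; returns 0.0 when there are no true positives."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example: the system misses the second claim token.
gold = ["Claim-B", "Claim-I", "O", "O"]
pred = ["Claim-B", "O", "O", "O"]
for label in ("Claim-B", "Claim-I", "O"):
    print(label, f1_per_class(gold, pred, label))
```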

Cross-domain results
For each domain, the cross-domain results are shown in Table 2. On average, the best results are about 0.10 F1 points worse than in the cross-validation setting (Table 1). In all domains, the best average performance was achieved using only the argument space features (FS4); in four cases this system significantly outperforms all other systems (p < 0.001). Moreover, higher-level feature set combinations that also contain argument space features (such as FS2+FS3+FS4 or FS3+FS4) usually yield better results for particular argument components than features based on lexical or syntactic information (FS0 and FS1). For identifying non-argumentative texts, there is no clear winner with respect to feature set abstraction (in three domains the best results are achieved using FS4, but in the other three domains the baseline FS0 performs best).

Cross-register results
The argument space features (FS4) also perform best on average in the cross-register evaluation (see Table 3). In recognizing premises, better results were achieved by a system trained on blogs and articles and tested on comments and forum posts. Recognizing claims exhibits similar behavior. On the other hand, recognizing non-argumentative text performs better in the opposite direction. On average, the cross-register results are much worse than the cross-validation results and slightly worse than the cross-domain results.

Error analysis
First, we quantitatively investigate errors in the cross-validation scenario. The confusion matrix in Table 4 shows that about 50-60% of errors for each argument component were caused by misclassifying it as non-argumentative (the 'O' class). The system tends to prefer the 'O' predictions because of the high presence of non-argumentative sentences in the corpus (about 57%). Backing is often confused with premises; in particular, Backing-B with Premise-B in 14%, Backing-I with Premise-I in 17%. These two argument components have a similar function (to support the claim), so the differences in the discourse (which are sometimes very subtle) confuse the system. Note that despite the confusion between these classes, the -I and -B tags mostly remain the same (the system correctly predicts whether the argument component begins or not). 13

13 To provide the complete picture, we also show the previously unreported classes (rebuttal and refutation). Rebuttal is usually misclassified as non-argumentative or premise, refutation as either non-argumentative, backing, or premise.

Table caption (fragment): ... (winning) row signals a significant difference between this row and all other rows, while star (*) denotes that the row is significantly better than the previous row; p < 0.001 using exact Liddell's test (Liddell, 1983).
Table columns: FS, B-B, B-I, C-B, C-I, O, P-B, P-I, Avg.

We also analyzed the errors of the best-performing cross-domain system in detail. We randomly sampled 40 documents and manually compared the predicted arguments with the gold data. We found that 11 predicted documents were simply wrong or no argument components were predicted at all (e.g., document #1640, #1658, #1021, #5258). Most of these errors occur in blogs, which seem to convey rather complex argumentation structure (#1666, #1197, #4586, #5258). In 8 documents, we identified that only some premises were (correctly) spotted by the system. This happened mostly in long comments (#452) and blogs (#400, #697, #4583). In 7 investigated documents, we identified errors caused by slightly different boundaries of recognized argument components (#4517, #2447, #2252, #4840) or when multiple segments were merged/split (#1604, #2180, #2310).
By analyzing the predicted output, we also found that in 12 documents the recognized argument components seemed to be valid to some extent, although this was our subjective judgment. For instance, in #4285 (see Figure 2), the first premise was misclassified as a claim. The gold-data argument was annotated as an enthymeme (with an implicit claim that advocates private schools), while in the prediction, the same proposition was identified as an explicit claim supporting private schools with one premise explaining why the education was not satisfying, which might be another valid interpretation. The second example, #2180 in Figure 2, shows that the boundaries of the predicted premises are mixed up (two recognized instead of three), but the longer backing is also meaningful. These examples demonstrate that argument analysis is in some cases ambiguous and allows for different valid interpretations.

Table 4: Confusion matrix for the cross-validation scenario. Columns are assumed to follow the same class order as the rows (B = Backing, C = Claim, P = Premise, Rb = Rebuttal, Rf = Refutation); the Rebuttal and Refutation rows and the last Premise-I cell are missing from the extracted text.
             B-B    B-I  C-B    C-I       O  P-B    P-I  Rb-B  Rb-I  Rf-B  Rf-I
Backing-B     54     12   12      1     106   31      8     0     0     0     0
Backing-I     12  3,238    1    353   5,089   17  1,777     1    18     0    45
Claim-B        7      3   41      5     107   19      9     1     1     0     2
Claim-I        0    160    0    713   2,095    1    456     0    25     0    25
O             97  3,170   53  1,135  36,061  156  5,459     4   178     1    38
Premise-B     35     17   17      2     290  142     28     6     0     0     1
Premise-I     18  1,680    2    544  10,779   51  7,015     2   234     2
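The confusion rates quoted in the error analysis (Backing-B with Premise-B in 14%, Backing-I with Premise-I in 17%) can be checked against the two Backing rows of the confusion matrix. In this sketch the column order is assumed to match the alphabetical class order of the rows, and the first row is assumed to be Backing-B because its label is missing in the extracted text.

```python
# Assumed column order (same as the row order of the confusion matrix).
CLASSES = [
    "Backing-B", "Backing-I", "Claim-B", "Claim-I", "O", "Premise-B",
    "Premise-I", "Rebuttal-B", "Rebuttal-I", "Refutation-B", "Refutation-I",
]

# Counts copied from the two Backing rows of the confusion matrix.
ROWS = {
    "Backing-B": [54, 12, 12, 1, 106, 31, 8, 0, 0, 0, 0],
    "Backing-I": [12, 3238, 1, 353, 5089, 17, 1777, 1, 18, 0, 45],
}


def share(row, col):
    """Fraction of all tokens in `row` that fall into column `col`."""
    counts = ROWS[row]
    return counts[CLASSES.index(col)] / sum(counts)


def error_share(row, col):
    """Fraction of the off-diagonal (error) tokens in `row` that fall into `col`."""
    counts = ROWS[row]
    errors = sum(counts) - counts[CLASSES.index(row)]
    return counts[CLASSES.index(col)] / errors


print(f"Backing-B as Premise-B: {share('Backing-B', 'Premise-B'):.0%}")
print(f"Backing-I as Premise-I: {share('Backing-I', 'Premise-I'):.0%}")
print(f"Backing-B errors going to O: {error_share('Backing-B', 'O'):.0%}")
```

Under these assumptions the first two shares round to the 14% and 17% reported in the text, which supports the assumed column order.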

Conclusion
In this article, we proposed a semi-supervised model for argumentation mining of user-generated Web content. We developed new unsupervised features for argument component identification that exploit clustering of unlabeled argumentative data from debate portals based on a word embeddings representation. With the help of these features, we significantly improved the performance of the argumentation mining system and outperformed several baselines. While the improvement was decent in the cross-validation scenario, we gained almost 100% improvement in the cross-domain and cross-register settings. We evaluated the methods on a publicly available corpus annotated with argumentation that originates from user-generated Web data. By a detailed analysis of the errors, we pointed out the strengths (such as domain adaptability) and weaknesses (such as unsatisfying results for rebuttal and refutation components), as well as the challenges for the argumentation mining task (such as boundary identification issues or ambiguous arguments). If we put our results into the context of existing works, the most relevant one, by Goudas et al. (2014), achieved a 0.42 F1 score on identifying only premises. We get comparable results in the cross-validation setting (F1 0.31-0.40), yet with a more complex argumentation model (five different components).
Although argumentation mining in user-generated Web discourse has a long way to go (our methods currently achieve only about 50% of human performance), we see a huge potential for various future tasks, such as information seeking for better-informed personal decision making or support for argument quality assessment. To foster the research within the community, we provide all source codes and data required for the experiments under free licenses.

There's no doubt boys behave a little different when girls are watching, and i also found boys were quite good at limiting the bitchyness girls are renowned for.] [premise: So both kept one another in line, and made for a more positive and dynamic environment. I also think there's a few extra life lessons and skills children can learn at co ed schools. Dating, relationships, interacting with the opposite sex, i think children at co ed schools tend to have a far better grasp of these skills then students who've only attended same sex schools.] (b) Doc #2180 (forum post, single-sex education)
Figure 2: Examples of gold data annotations (on the left-hand side) and system predictions in the best-performing cross-domain evaluation scenario (on the right-hand side).
Elena Cabrio and Serena Villata. 2012. Combining textual entailment and argumentation theory for supporting online debates interactions. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers -Volume 2, ACL '12, pages 208-212, Jeju Island, Korea. Association for Computational Linguistics.