Identifying Prominent Arguments in Online Debates Using Semantic Textual Similarity

Online debates spark argumentative discussions from which generally accepted arguments often emerge. We consider the task of unsupervised identification of prominent arguments in online debates. As a first step, in this paper we perform a cluster analysis using semantic textual similarity to detect similar arguments. We perform a preliminary cluster evaluation and error analysis based on cluster-class matching against a manually labeled dataset.


Introduction
Argumentation mining aims to detect the argumentative discourse structure in text. It is an emerging field at the intersection of natural language processing, logic-based reasoning, and argumentation theory; see (Moens, 2014) for a recent overview.
While most work on argumentation mining has focused on well-structured (e.g., legal) text, attention has recently also turned to user-generated content such as online debates and product reviews. The main motivation is to move beyond simple opinion mining and discover the reasons underlying opinions. As users' comments are generally poorly structured and noisy, argumentation mining proper (extraction of argumentative structures) is rather difficult. However, a sensible first step is to identify the arguments (also referred to as reasons and claims) expressed by users to back up their opinions.
In this work we focus on online debates. Given a certain topic, a number of prominent arguments often emerge in the debate, and the majority of users will back up their stance with one or more of these arguments. The problem, however, is that linking users' statements to arguments is far from trivial. Besides language variability, due to which the same argument can be expressed in infinitely many ways, many other factors add to the variability, such as entailment, implicit premises, value judgments, etc. This is aggravated by the fact that many users express their arguments in a rather confusing and poorly worded manner. Another principal problem is that, in general, the prominent arguments for a given topic are not known in advance. Thus, to identify the arguments expressed by the users, one first needs to come up with a set of prominent arguments. Manual analysis of the possible arguments neither generalizes to unseen topics nor scales to large datasets.
In this paper, we are concerned with automatically identifying prominent arguments in online debates. This is a formidable task, but as a first step towards this goal, we present a cluster analysis of users' argumentative statements from online debates. The underlying assumption is that statements that express the same argument will be semantically more similar than statements that express different arguments, so that we can group together similar statements into clusters that correspond to arguments. We operationalize this by using hierarchical clustering based on semantic textual similarity (STS), defined as the degree of semantic equivalence between two texts (Agirre et al., 2012).
The purpose of our study is twofold. First, we wish to investigate the notion of prominent arguments, considering in particular the variability in expressing arguments, and how well it can be captured by semantic similarity. Secondly, from a more practical perspective, we investigate the possibility of automatically identifying prominent arguments, setting a baseline for the task of unsupervised argument identification.

Related Work
The pioneering work in argumentation mining is that of Moens et al. (2007), who addressed mining of argumentation from legal documents. Recently, the focus has also moved to mining from user-generated content, such as online debates (Cabrio and Villata, 2012), discussions on regulations (Park and Cardie, 2014), and product reviews (Ghosh et al., 2014). Boltužić and Šnajder (2014) introduced argument recognition as the task of identifying what arguments, from a predefined set of arguments, have been used in users' comments, and how. They frame the problem as multiclass classification and describe a model with similarity- and entailment-based features.
Essentially the same task of argument recognition, but at the level of sentences, is addressed by Hasan and Ng (2014). They use a probabilistic framework for argument recognition (reason classification) jointly with the related task of stance classification. Similarly, Conrad et al. (2012) detect spans of text containing arguing subjectivity and label them with argument tags using a model that relies on sentiment, discourse, and similarity features.
The above approaches are supervised and rely on datasets manually annotated with arguments from a predefined set of arguments. In contrast, in this work we explore unsupervised argument identification. A similar task is described by Trabelsi and Zaïane (2014), who use topic modeling to extract words and phrases describing arguing expressions, and also discuss how the arguing expressions could be clustered according to the arguments they express.

Data and Model
Dataset. We conduct our study on the dataset of users' posts compiled by Hasan and Ng (2014). The dataset is acquired from two-side online debate forums on four topics: "Obama", "Marijuana", "Gay rights", and "Abortion". Each post is assigned a stance label (pro or con), provided by the author of the post. Furthermore, each post is split up into sentences and each sentence is manually labeled with one argument from a predefined set of arguments (different for each topic). Note that all sentences in the dataset are argumentative; non-argumentative sentences were removed from the dataset (the ratio of argumentative sentences varies from 20.4% to 43.7%, depending on the topic). Hasan and Ng (2014) report high levels of inter-annotator agreement (between 0.61 and 0.67, depending on the topic).
For our analysis, we removed sentences labeled with rarely occurring arguments (<2%), allowing us to focus on prominent arguments. The dataset we work with contains 3,104 sentences ("Abortion" 814, "Gay rights" 824, "Marijuana" 836, and "Obama" 630) and 47 different arguments (25 pro and 22 con, on average 12 arguments per topic). The majority of sentences (2,028 in total) are labeled with pro arguments. The average sentence length is 14 words.
Argument similarity. We experiment with two approaches for measuring the similarity of arguments.
Vector-space similarity: We represent statements as vectors in a semantic space. We use two representations: (1) a bag-of-words (BoW) vector, weighted by inverse sentence frequency, and (2) a distributed representation based on the recently proposed neural network skip-gram model of Mikolov et al. (2013a).
As noted by Ramage et al. (2009), BoW has shown to be a powerful baseline for semantic similarity. The rationale for weighting by inverse sentence frequency (akin to inverse document frequency) is that more frequently used words are less argument-specific and hence should contribute less to the similarity.
On the other hand, distributed representations have been shown to work exceptionally well (outperforming BoW) for representing the meaning of individual words. Furthermore, they have been shown to model quite well the semantic composition of short phrases via simple vector addition (Mikolov et al., 2013b). To build a vector for a sentence, we simply sum the distributed vectors of the individual words. For both representations, we remove stopwords before building the vectors. To compute the similarity between two sentences, we compute the cosine similarity between their corresponding vectors.
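The two representations described above can be sketched as follows. This is a minimal illustration, not the authors' implementation; the word-vector lookup `word_vecs` is a hypothetical dictionary (in practice it would be loaded from a pretrained skip-gram model), and sentences are assumed to be tokenized and stopword-filtered.

```python
import numpy as np
from collections import Counter

def isf_weights(sentences):
    """Inverse sentence frequency: words occurring in many sentences
    get lower weight, analogous to IDF over documents."""
    n = len(sentences)
    sf = Counter(w for s in sentences for w in set(s))
    return {w: np.log(n / sf[w]) for w in sf}

def bow_vector(sentence, vocab, weights):
    """ISF-weighted bag-of-words vector over a fixed vocabulary."""
    v = np.zeros(len(vocab))
    for w in sentence:
        if w in vocab:
            v[vocab[w]] += weights.get(w, 0.0)
    return v

def skipgram_vector(sentence, word_vecs, dim):
    """Additive composition: sum of per-word skip-gram embeddings."""
    v = np.zeros(dim)
    for w in sentence:
        if w in word_vecs:  # out-of-vocabulary words are skipped
            v += word_vecs[w]
    return v

def cosine(a, b):
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return float(a @ b / (na * nb)) if na and nb else 0.0
```

The cosine of two such sentence vectors then serves directly as the argument similarity.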
Semantic textual similarity (STS): Following the work of Boltužić and Šnajder (2014), we use an off-the-shelf STS system developed by Šarić et al. (2012). It is a supervised system trained on a manually labeled STS dataset, utilizing a rich set of text comparison features (incl. vector-space comparisons). Given two sentences, the system outputs a real-valued similarity score, which we use directly as the similarity between two argument statements.
Clustering. For clustering, we use the hierarchical agglomerative clustering (HAC) algorithm (see (Xu et al., 2005) for an overview of clustering algorithms). This is motivated by three considerations. First, HAC allows us to work directly with similarities coming from the STS system, instead of requiring explicit vector-space representations as some other algorithms do. Second, it produces hierarchical structures, allowing us to investigate the granularity of arguments. Finally, HAC is a deterministic algorithm, hence its results are more stable.
HAC works with a distance matrix computed for all pairs of instances. We compute this matrix for all pairs of sentences s1 and s2 from the corresponding similarities: 1 − cos(v1, v2) for vector-space similarity and 1/(1 + sim(s1, s2)) for STS similarity. The linkage criterion has been shown to greatly affect clustering performance. We experiment with complete linkage (farthest-neighbor clustering) and Ward's method (Ward Jr, 1963), which minimizes the within-cluster variance (the latter is applicable only to vector-space similarity). Note that we do not cluster the statements from the pro and con stances separately. This allows us to investigate to what extent stance can be captured by semantic similarity of the arguments, while it also corresponds to a more realistic setup.
Cluster Analysis

Analysis 1: Clustering Models
Evaluation metrics. A number of clustering evaluation metrics have been proposed in the literature. We adopt the external evaluation approach, which compares the hypothesized clusters against target clusters. We use the argument labels of Hasan and Ng (2014) as target clusters. As noted by Amigó et al. (2009), external cluster evaluation is a non-trivial task and there is no consensus on the best approach. We therefore chose to use two established, but rather different measures: the Adjusted Rand Index (ARI) (Hubert and Arabie, 1985) and the information-theoretic V-measure (Rosenberg and Hirschberg, 2007). An ARI of 0 indicates clustering expected by chance and 1 indicates perfect clustering. The V-measure trades off measures of homogeneity (h) and completeness (c). It ranges from 0 to 1, with 1 being perfect clustering.
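Both measures are available off the shelf; a small sketch with scikit-learn (the toy labelings below are invented for illustration):

```python
from sklearn.metrics import (adjusted_rand_score,
                             homogeneity_completeness_v_measure)

# Toy example: gold argument labels vs. hypothesized cluster labels.
gold = [0, 0, 1, 1, 2, 2]
pred = [0, 0, 1, 1, 1, 2]

ari = adjusted_rand_score(gold, pred)
h, c, v = homogeneity_completeness_v_measure(gold, pred)
print(f"ARI={ari:.2f}  h={h:.2f}  c={c:.2f}  V={v:.2f}")
```

Both scores are invariant to a relabeling of the clusters, so the cluster IDs produced by HAC need not be aligned with the gold argument IDs before evaluation.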
Results. We cluster the sentences from the four topics separately, using the gold number of clusters for each topic. Results are shown in Table 1. Overall, the best model is skip-gram with Ward's linkage, generally outperforming the other models considered in terms of both ARI and V-measure. This model also yields the most consistent clusters in terms of balanced homogeneity and completeness. Ward's linkage seems to work better than complete linkage for both BoW and skip-gram. STS-based clustering performs comparably to the baseline BoW model. We attribute this to the fact that the STS model was trained on different domains, and therefore probably does not extend well to the kind of argument-specific similarity we are trying to capture here.
We observe considerable variance in performance across topics. Arguments from the "Gay rights" topic seem to be the most difficult to cluster, while "Marijuana" seems to be the easiest. In absolute terms, the clustering performance of the skip-gram model is satisfactory given the simplicity of the model. In the subsequent analysis, we focus on the skip-gram model with Ward's linkage and the "Marijuana" topic.

Analysis 2: Clustering Quality
Cluster-class matching. To examine the cluster quality and clustering errors, we perform a manual cluster-class matching for the "Marijuana" topic against the target clusters, using again the gold number of clusters (10). Cluster-class matching is done on a class-majority basis, resulting in six gold classes matched. Table 2 shows the results. We list the top three gold classes (and the percentage of sentences from these classes) in each of our clusters, and the top three clusters (and the percentage of sentences from these clusters) in each of the gold classes. Some gold classes (#4, #9) frequently co-occur, indicating their high similarity. We characterize each cluster by its medoid (the sentence closest to the cluster centroid).
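Extracting a cluster medoid is straightforward once vectors and cluster labels are available; a minimal sketch (the variable names are illustrative, not from the original implementation):

```python
import numpy as np

def medoid_index(vectors, labels, cluster_id):
    """Return the index of the sentence whose vector lies closest
    to the centroid of the given cluster."""
    idx = np.where(labels == cluster_id)[0]
    members = vectors[idx]
    centroid = members.mean(axis=0)
    dists = np.linalg.norm(members - centroid, axis=1)
    return int(idx[np.argmin(dists)])
```

The sentence at the returned index then serves as a human-readable summary of the cluster, as in Table 2.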
Error analysis. Grouping statements into coherent clusters proved a challenging task. Our preliminary analysis indicates that the main problems are related to (a) need for background knowledge, (b) use of idiomatic language, (c) grammatical errors, (d) opposing arguments, and (e) too fine/coarse gold argument granularity. We show some sample errors in Table 3, but leave a detailed error analysis for future work.
[Table 1: External evaluation of clustering models on the four topics ("Obama", "Marijuana", "Gay rights", "Abortion")]
Ex. #knowledge demonstrates the need for background knowledge (exports are government regulated). A colloquial expression (pot) is used in Ex. #colloquial. In Ex. #oppose, the statement is assigned to a cluster of an opposing argument. In Ex. #general, our model predicts a coarser argument.
Another observation concerns the level of argument granularity. In the previous analysis, we used the gold number of clusters. We note, however, that the level of granularity is to a certain extent arbitrary. To exemplify this, we look at the dendrogram (Fig. 1) of the last 15 HAC steps on the "Marijuana" topic. The medoids of the clusters divided at point CD are (1) the economy would get billions of dollars (...) no longer would this revenue go directly into the black market. and (2) If the tax on cigarettes can be $5.00/pack imagine what we could tax pot for!. These could well be treated as separate arguments about economy and taxes, respectively. On the other hand, the clusters merged at point CM consist mostly of the gold arguments (1) Damages our bodies and (2) Responsible for brain damage, which could be represented by a single argument, Damaging our entire bodies. The dendrogram also suggests that the 10-cluster cut is perhaps not optimal for the similarity measure used.

Conclusion
In this preliminary study, we addressed unsupervised identification of prominent arguments in online debates, using hierarchical clustering based on textual similarity. Our best performing model, a simple distributed representation of argument sentences, performs in a 0.15 to 0.30 V-measure range. Our analysis of clustering quality and errors on manually matched cluster-classes revealed that there are difficult cases that textual similarity cannot capture. A number of errors can be traced back to the fact that it is sometimes difficult to draw clear-cut boundaries between arguments. In this study we relied on simple text similarity models. One way to extend our work would be to experiment with models better tuned for argument similarity, based on a more detailed error analysis. Also of interest are internal evaluation criteria for determining the optimal argument granularity.
A more fundamental issue, raised by one reviewer, concerns the potential long-term limitations of the clustering approach to argument recognition. While we believe that there is a lot of room for improvement, we think that identifying arguments fully automatically is hardly feasible. However, we are convinced that argument clustering will prove valuable in human-led argumentative analysis. Argument clustering may also prove useful for semi-supervised argument recognition, where it may be used as unsupervised pre-training followed by supervised fine-tuning.