Determining Relative Argument Specificity and Stance for Complex Argumentative Structures

Systems for automatic argument generation and debate require the ability to (1) determine the stance of any claims employed in the argument and (2) assess the specificity of each claim relative to the argument context. Existing work on understanding claim specificity and stance, however, has been limited to the study of argumentative structures that are relatively shallow, most often consisting of a single claim that directly supports or opposes the argument thesis. In this paper, we tackle these tasks in the context of complex arguments on a diverse set of topics. In particular, our dataset consists of manually curated argument trees for 741 controversial topics covering 95,312 unique claims; lines of argument are generally of depth 2 to 6. We find that as the distance between a pair of claims increases along the argument path, determining the relative specificity of a pair of claims becomes easier and determining their relative stance becomes harder.


Introduction
The tasks of automatic argument generation and debate require the ability to present a diverse and comprehensive set of supporting and opposing arguments given a controversial topic. Two critical components of such systems are an ability to determine the stance and the specificity of any claims employed in the proposed argument. Consider, for example, the argument thesis (i.e., the topic) of Figure 1: (THESIS) Would we like to live in the world of Harry Potter? Construction of an argument in support of or in opposition to this thesis necessarily requires knowing the stance of the claims that comprise it: the claim Magic opens a lot of interesting possibilities should be identified as a claim in support of the THESIS, and The capacity of harm is greater when magic is involved (HARM), as a claim in opposition. Indeed, previous work has studied this task (e.g., Bar-Haim et al. (2017); Faulkner (2014)).
It is not sufficient, however, to determine claim stance only with respect to the argument thesis. Debate and argument generation systems, in general, should also be able to determine whether two claims that address the same line of reasoning represent the same, or the opposing stance: using Defense is also made easier through magic to refute the HARM claim in Figure 1, for example, requires recognizing that it represents the opposite stance.
The issue of claim specificity in argumentation has been much less addressed. Existing work, however, suggests that a high degree of specificity is correlated with argument quality and persuasiveness (Carlile et al., 2018; Swanson et al., 2015). In terms of argument quality, though, it is entirely possible for the presented claims to be coherent and meaningful, yet be too specific within the given discourse, and therefore be logically irrelevant (Dessalles, 2016). As a concrete example, suppose we wanted to assert a claim in support of the argument THESIS of Figure 1. While The Unforgivable Curses are illegal...and their use is grounds for immediate life imprisonment supports the THESIS, it is too specific a claim to introduce at this point in the argument. Namely, it doesn't flow naturally without first introducing the concept of Unforgivable Curses.
To date, existing work on understanding claim specificity and stance has mostly employed annotated monologic persuasive documents or discussion forums and, as a result, has been limited to the study of argumentative structures that are relatively shallow, most often consisting only of claims that directly support or oppose the argument thesis (Bar-Haim et al., 2017; Faulkner, 2014).
Figure 1: Partial tree for the controversial topic "Would we like to live in the world of Harry Potter?". Each claim's position towards its parent argument is indicated in the box on the edge between the claim and its parent. The full argument tree for this topic can be found at https://www.kialo.com/is-the-world-of-harry-potter-really-the-placeto-be-2415/2415.0=2415.1.
To support the generation of diverse and potentially complex arguments on a topic of choice, we present here a dataset of manually curated argument trees for 741 controversial topics covering 95,312 unique claims. In contrast to existing datasets, ours consists of argument trees where each root node represents the argument thesis (main claim) and every other node represents a claim that either supports or opposes its parent. Taking advantage of this relatively complex argumentative structure, we formulate two prediction tasks to study relative specificity and stance. The main contributions of our study are the following:
• We provide a publicly available dataset of argument trees consisting of a diverse set of supporting and opposing claims for 741 controversial topics 1.
• We propose two novel settings to study claim specificity and stance in the context of a diverse set of supporting and opposing points.
• We control for specific aspects of the argument tree (e.g., depth, stance) in our experiments to understand their effect on claim specificity and stance detection.

Dataset
We extracted argument trees for 741 controversial topics from www.kialo.com 2. Kialo is a collaborative platform where users provide supporting and opposing claims for each claim related to a controversial issue. Besides providing the claims themselves, users also help to improve the quality of existing claims by suggesting edits and rating the quality of claims. This process of collaborative editing helps to create a high-quality, diverse set of supporting and opposing points for each controversial topic 3. The dataset includes a diverse set of controversial topics. Each controversial topic is represented by a thesis and tagged with pre-defined generic categories such as Politics, Ethics, Society and Technology 4. Figure 2(a) shows the number of controversial topics with the given pre-defined categories. The controversial topics' theses include: "A free Press is necessary to democracy.", "All drugs should be legalised.", "A society with no gender would be better.", "Hate speech should be banned", etc.

Structure of the arguments
Figure 2: (a) Number of controversial topics with the given pre-defined categories. Note that a controversial topic could be related to multiple pre-defined categories. (b) Number of claims at given depths. The majority of the claims lie at depth 3 or higher. (c) Number of trees with given range of total number of claims. For the majority of trees, the argument tree has more than 30 claims. The average number of claims per argument tree is 127. (d) Number of trees with given range of depth. For the majority of trees, the depth of the argument tree is 4 or higher, and the average depth per argument tree is 5.
The arguments for each controversial topic are represented as trees. The root node of each such tree represents the thesis of the controversial topic. Every other node in the tree represents a claim that either supports or opposes its parent claim. Figure 1 shows a partial argument tree for the thesis "Would we like to live in the world of Harry Potter?". We see that besides the supporting and opposing claims for the thesis, there are supporting and opposing claims for the claims at different depths. With this structure, we can identify indirect support/oppose relationships even between nodes without parent-child relationships if they are on the same argument path. For example, the claim "Defense is also made easier through magic" indirectly supports the thesis, since it is in opposition with its parent "The capacity of harm is greater when the magic is involved", which is an opposing claim to the thesis. Another observation is that as we go deeper along an argument path, the claims get more specific, since each claim aims to either support or oppose its parent. For example, while the claim "The capacity of harm is greater when the magic is involved" refers to the general harms that can be caused by magic, one of its child claims "There is a great capacity to harm others using the Unforgivable Curses" is more specific as it refers to harm via a particular set of curses in magic.
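To make the tree structure concrete, the following is a minimal sketch of how an argument tree and its argument paths (root-to-leaf lines of argument) might be represented. The `Claim` class, its field names, and the `argument_paths` helper are hypothetical names chosen for illustration; they are not the format in which the dataset is distributed.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Claim:
    """One node of an argument tree (hypothetical schema, for illustration only)."""
    text: str
    # Stance toward the parent claim: "support" or "oppose"; None for the thesis.
    stance: Optional[str] = None
    children: List["Claim"] = field(default_factory=list)


def argument_paths(node: Claim, prefix: Optional[List[Claim]] = None) -> Iterator[List[Claim]]:
    """Yield every root-to-leaf path, i.e. every line of argument in the tree."""
    path = (prefix or []) + [node]
    if not node.children:
        yield path
        return
    for child in node.children:
        yield from argument_paths(child, path)


# A small fragment of the tree in Figure 1.
thesis = Claim("Would we like to live in the world of Harry Potter?")
possibilities = Claim("Magic opens a lot of interesting possibilities", "support")
harm = Claim("The capacity of harm is greater when magic is involved", "oppose")
defense = Claim("Defense is also made easier through magic", "oppose")
harm.children.append(defense)
thesis.children.extend([possibilities, harm])

paths = list(argument_paths(thesis))
for p in paths:
    print(" -> ".join(claim.text for claim in p))
```

Each yielded path lists claims from the thesis down to a leaf, so the position of a claim in the list is its depth in the tree.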

Data Statistics
The dataset consists of argument trees for 741 controversial topics comprising 95,312 unique claims. The distributions of argument trees by total number of claims and by depth are shown in Figures 2(c) and 2(d), respectively. We see that for the majority of trees, the depth is 4 or higher, and the number of claims is greater than 30. Figure 2(b) shows the total number of claims at a given depth. We see that only 7,618 out of 95,312 claims directly support or oppose the theses of the controversial topics. The majority of the claims lie at depth 3 or higher. This shows that the dataset has a rich set of supporting and opposing claims not only for the theses, but also for claims at different depths of the tree. In total, there are 44,572 claims that support and 50,740 claims that oppose their parent claims. 90% of claims consist of 1 (61%) to 2 (29%) sentences, and the average number of tokens per claim is 30.

Claim Specificity
Determining the relative specificity of arguments is an important step towards being able to generate logically relevant arguments in a given discourse (Dessalles, 2016). For a system that disregards the relative specificity of claims, it is entirely possible to generate coherent and meaningful, yet logically irrelevant claims, when the generated claims are either too generic or specific for the given argument discourse.
Figure 2: Hierarchical model for stance classification. A pre-trained BERT model is used to encode pairs of claims, which are then fed into a bi-directional GRU to encode the path. In the figure, E_i represents the input embedding for token TOK_i, R_i represents the contextual representation for token TOK_i from the final layer in the BERT model, and R_pair_i is the representation of Pair i.
In this work, we determine the relative specificity between a pair of claims that are along the same argument path from the thesis to a given leaf claim. We note that specificity always increases along a given path, as each child claim is addressing some aspect of its parent claim, by either supporting or opposing, and therefore by definition has to be more specific. While an increase in depth is correlated with an increase in specificity for claims within a given argument path, this correlation does not necessarily hold for claims across different argument paths within a tree 5. One important note is that we use the path information only as a way to reliably generate specificity labels, without requiring human annotations. The task of relative specificity detection itself does not require any path information to be present, nor do we make any assumptions in our models about the availability of path information.
For this task, given a pair of claims, we want the model to determine whether the second claim is more specific than the first claim. We note that unlike in stance prediction, we never provide the path information between a pair of claims, as this would be equivalent to giving the gold label as input to the model, since given the path, the relative specificity is deterministic.
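The label generation described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `specificity_pairs` is hypothetical, and producing negative examples by reversing the order of each pair is our assumption, since the text does not spell out how negatives are constructed.

```python
from itertools import combinations
from typing import List, Tuple


def specificity_pairs(path: List[str]) -> List[Tuple[str, str, int, int]]:
    """Return (first, second, label, distance) tuples for one argument path.

    `path` lists claims from the thesis down to a leaf. The label is 1 when
    `second` is deeper (more specific) than `first`, and 0 otherwise.
    The path itself is never part of the model input, only of label creation.
    """
    examples = []
    for i, j in combinations(range(len(path)), 2):  # i < j, so path[j] is deeper
        distance = j - i
        examples.append((path[i], path[j], 1, distance))
        # Assumption: negatives are obtained by swapping the two claims.
        examples.append((path[j], path[i], 0, distance))
    return examples


path = [
    "Would we like to live in the world of Harry Potter?",
    "The capacity of harm is greater when magic is involved",
    "There is a great capacity to harm others using the Unforgivable Curses",
]
pairs = specificity_pairs(path)
for first, second, label, distance in pairs:
    print(label, distance, "|", first[:30], "->", second[:30])
```

Keeping the distance alongside each pair makes it straightforward to restrict evaluation to, for example, parent-child pairs (distance 1).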

Results and Analysis
Baseline. We experiment with a feature-based Logistic Regression (LR) model that incorporates all the features that have been shown to be effective in determining sentence specificity (Louis and Nenkova, 2012). For example, this feature list includes the polarity of the claims (Wilson et al., 2005), the number of personal pronouns in the claims, and the length of the claims, since Louis and Nenkova (2012) show that generic sentences have stronger polarity, contain fewer personal pronouns, and are shorter in length. While Ko et al. (2019) have also looked at the task of specificity prediction, we cannot directly apply their models to our data, since their annotation scheme requires each sentence to be labelled as general or specific, whereas we argue that specificity is relative.
Fine-tuned BERT. We compare our baselines with a fine-tuned BERT model (Devlin et al., 2018). BERT is a pre-trained deep bidirectional transformer model that can encode sentences into dense vector representations. It is trained on large un-annotated corpora such as Wikipedia and the BooksCorpus (Zhu et al., 2015) using two different learning objectives, namely masked language modeling and next sentence prediction. These learning objectives together allow the model to learn representations that can be easily fine-tuned to achieve state-of-the-art performance for a wide range of natural language processing tasks. For relative specificity detection, we feed the pair of claims as a single sequence with the special [SEP] token between the claims, and a [CLS] token at the beginning of the sequence, as shown in Figure 2, into a pre-trained BERT model 6. In addition, we indicate each token in the first claim (as well as the [CLS] and [SEP] tokens) as belonging to sentence A, and each token in the second claim as belonging to sentence B, which is used by the BERT model to add the appropriate learned sentence embedding to each token. Note that this approach of packing a pair of claims into a single sequence is consistent with the input representation used by Devlin et al. (2018) for tasks where the input is a pair of sequences. We then take the output of the [CLS] token from the final layer of the BERT model, and feed it into a classification layer. We fine-tune 7 this architecture for relative specificity detection.
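As an illustration of the input packing described above, the sketch below builds the token sequence and the sentence A/B indicators for a pair of claims. It is deliberately simplified: whitespace splitting stands in for BERT's WordPiece tokenizer, the function name `pack_pair` is hypothetical, and the trailing [SEP] token (assigned to sentence B) follows the convention of Devlin et al. (2018) rather than anything stated explicitly in the text.

```python
from typing import List, Tuple


def pack_pair(first_claim: str, second_claim: str) -> Tuple[List[str], List[int]]:
    """Pack two claims into one BERT-style sequence with segment indicators.

    Segment id 0 marks sentence A ([CLS], the first claim, and the first [SEP]);
    segment id 1 marks sentence B (the second claim and the trailing [SEP]).
    """
    # Assumption: whitespace tokens stand in for WordPiece tokens.
    tokens_a = first_claim.lower().split()
    tokens_b = second_claim.lower().split()
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


tokens, segment_ids = pack_pair(
    "The capacity of harm is greater when magic is involved",
    "Defense is also made easier through magic",
)
for token, segment in zip(tokens, segment_ids):
    print(segment, token)
```

In the setup described above, the final-layer representation at the [CLS] position (index 0 of this sequence) is what the classification layer receives.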
We split our data into train, development and test sets by topic, which ensures that all nodes from the same tree are confined to a single split. We split the data in this way in order to encourage our models to learn more domain-independent features that are applicable across the diverse set of controversial topics. The number of examples in each split for each task is shown in Table 1. Table 2 compares the performance of the different models for relative specificity, across three different settings. In the first setting, we evaluate the models across all claim pairs that occur in the same argument path in a given tree. In the second setting, we control for the distance between the pair by evaluating only across pairs of nodes that are distance 1 from each other, i.e. have a parent-child relationship. Finally, in the third setting, we control for the stance and evaluate across pairs of claims that have the same stance relative to their parent.
6 Specifically, we use the BERT-Base (Uncased) model, which contains 12 layers of bidirectional transformers, with a hidden size of 768 units and 12 attention heads (for a total of 110M parameters).
7 For all fine-tuning experiments with BERT, we used a learning rate of 2e-5. We ran the fine-tuning jobs for a maximum of 5 epochs, and used the validation performance for early stopping.
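A topic-level split of the kind described above can be sketched as follows. The function name `split_by_topic`, the 80/10/10 proportions, and the fixed seed are assumptions made for the example; the actual split sizes are those reported in Table 1.

```python
import random
from collections import Counter
from typing import Dict, Iterable


def split_by_topic(topic_ids: Iterable[str], dev_frac: float = 0.1,
                   test_frac: float = 0.1, seed: int = 0) -> Dict[str, str]:
    """Assign each topic (i.e. each argument tree) to exactly one split.

    Every claim pair inherits the split of its topic, so no tree is shared
    between train, development and test data.
    """
    topics = sorted(set(topic_ids))          # deterministic order before shuffling
    random.Random(seed).shuffle(topics)
    n_test = int(len(topics) * test_frac)
    n_dev = int(len(topics) * dev_frac)
    assignment = {}
    for rank, topic in enumerate(topics):
        if rank < n_test:
            assignment[topic] = "test"
        elif rank < n_test + n_dev:
            assignment[topic] = "dev"
        else:
            assignment[topic] = "train"
    return assignment


# Toy data: 100 claim pairs drawn from 20 hypothetical topics.
examples = [(f"topic-{i % 20}", f"claim pair {i}") for i in range(100)]
assignment = split_by_topic(topic for topic, _ in examples)
counts = Counter(assignment.values())
print(counts)
```

Splitting on topic identifiers rather than on individual pairs is what keeps claims from one tree out of more than one split.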
Analysis. Consistent with previous work (Li and Nenkova, 2015), we find that length is highly predictive of specificity and that more specific claims are longer than more generic claims. Across all settings, the fine-tuned BERT model achieves the best performance. As expected, the performance degrades for all models as we control for distance and stance, since in both cases the claims become more similar in language. Table 4 shows the words weighted most highly by the BOW model for each class. We find that connectives (such as also, but, because, when) are associated more with arguments with higher specificity, as they are mostly used to add more specific information to the claims, as also found by Lugini and Litman (2017), whereas concept words (such as society, world, gender) have a higher association with more generic arguments, since these words represent the concepts of the controversial topics that people argue about.
We further evaluate our models on the claim pairs with distance values 2 to 5, as shown in Table 3. We find that the BERT model is consistently the best performing model for all distance values. As we increase the distance, the models achieve higher prediction performance despite having fewer training examples for higher distance values.

Claim Stance Detection
It is not sufficient for debate and argument generation systems to determine the claim stance only with respect to the argument thesis; it is also necessary to determine the stance between any pair of claims that address the same line of reasoning. An argument generation system, for example, may need to generate arguments that oppose some of the opponent's previous claims while supporting some of its own previous claims during the debate, which requires determining the stance between any candidate claims and the claims in the previous argument discourse. In this work, given a claim A at depth d and a claim B at depth > d along the same argument path, we determine whether B (in)directly SUPPORTS or OPPOSES A (stance). If A and B do not have a parent-child relationship, we determine whether B indirectly SUPPORTS or OPPOSES A by considering the support/oppose relationship of each parent-child pair of claims between A and B. Following the example shown in Figure 1, the claim "The capacity of harm is greater when the magic is involved" is directly supported by the claim "There is a great capacity to harm others using the Unforgivable Curses", with a direct parent-child relationship. However, the argument "The Unforgivable Curses are illegal in the wizarding world and their use is grounds for immediate life imprisonment in Azkaban Prison" is indirectly opposing the same claim, by rebutting its parent, which presents a supporting point for the claim.
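One way to derive the indirect stance from the parent-child relationships is a parity rule: B supports A when the number of opposing edges between them is even, and opposes A when it is odd. The sketch below assumes this rule, which is consistent with the two worked examples in the text but is our reading rather than a procedure stated by the paper; the function name `relative_stance` is hypothetical.

```python
from typing import List


def relative_stance(edge_stances: List[str]) -> str:
    """Stance of claim B toward an ancestor claim A on the same argument path.

    `edge_stances` holds, for each claim strictly below A down to and including
    B, that claim's stance toward its own parent ("support" or "oppose").
    Assumed rule: an even number of "oppose" edges yields "support".
    """
    if not edge_stances:
        raise ValueError("A and B must be distinct claims on the same path")
    unknown = set(edge_stances) - {"support", "oppose"}
    if unknown:
        raise ValueError(f"unknown stance labels: {sorted(unknown)}")
    n_oppose = sum(1 for stance in edge_stances if stance == "oppose")
    return "support" if n_oppose % 2 == 0 else "oppose"


# "Defense is also made easier through magic" opposes HARM, which opposes the thesis.
print(relative_stance(["oppose", "oppose"]))
# "The Unforgivable Curses are illegal ..." opposes its parent, which supports HARM.
print(relative_stance(["support", "oppose"]))
```

Under this rule the first example resolves to an indirect support of the thesis and the second to an indirect opposition of the HARM claim, matching the descriptions above.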

Results and Analysis
We experiment with a feature-based Logistic Regression model and a fine-tuned BERT model (Devlin et al., 2018) using the same strategy to split the data into train, development and test sets as in Section 3.1.
Baseline. Our feature-based model employs features shown to be effective in stance detection tasks (Mohammad et al., 2016) such as bag of words, word match, sentiment match, document embedding similarity, and MPQA subjectivity features (Wilson et al., 2005) 8 . We cannot evaluate the model from Sun et al. (2018) as a baseline, as that requires additional annotations for argument phrases for the given topics. Similarly, we cannot evaluate the model from Bar-Haim et al. (2017) as a baseline, since it would require additional annotations for target phrases in each claim, polarity towards the target phrases, and consistent/contrastive labels between the target phrases of two claims.
Fine-tuned BERT. We feed a pair of claims into a pre-trained BERT model, in the same manner as detailed above for relative specificity detection, take the output of the [CLS] token from the final layer, and feed it into a classifier. We fine-tune this model for relative stance detection.
Fine-tuned BERT with path (simple). In this model, we incorporate path information in a very naïve manner. For a given pair of claims A and B, we pack claim B and the claims along the argument path into a single input sequence. We indicate each token from claim B as belonging to sentence A, and the tokens from all other claims in the path, including claim A, as belonging to sentence B. We note that this way of processing the input is similar to how Devlin et al. (2018) processed their input for the QA task.
Similar to the previous model, we feed the output of the [CLS] token from the final layer into a classifier. We then fine-tune this model for relative stance classification.
Fine-tuned BERT with path (hierarchical). We hypothesize that the task of determining relative stance becomes easier if we can follow along the argument path and determine the relative stance between parent-child claims. We incorporate this inductive bias into the model by constructing a hierarchical architecture for relative stance classification, as shown in Figure 2. First, we feed each parent-child pair along an argument path as a single sequence into the BERT encoder, separated by the [SEP] token, and take the representation of the [CLS] token from the final layer of the BERT model as the pair representation. We then feed the sequence of pair representations into a bidirectional Gated Recurrent Unit (GRU) (Cho et al., 2014) to get the path representation. In our experiments, we used a single bidirectional GRU layer with 128 units. The output of the last token from the forward GRU and the output of the first token from the backward GRU are concatenated together to get the final path representation. We then feed this into a classifier to predict relative stance. We fine-tune this architecture for relative stance classification.
Table 5 compares the performance of the different models for argument stance detection, across two different settings. In the first setting, we evaluate the models only across pairs of claims that are distance 1 from each other, i.e. in a parent-child relationship. In the second setting, we evaluate the models across all pairs that occur in the same argument path in a given tree, with and without incorporating the claims along the argument path between the pair of claims.
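The input to the hierarchical model is the ordered sequence of parent-child pairs along the path from claim A to claim B. A minimal sketch of that first step is given below; the function name `parent_child_pairs` is hypothetical, and the BERT encoder and GRU themselves are not reproduced here.

```python
from typing import List, Tuple


def parent_child_pairs(path: List[str]) -> List[Tuple[str, str]]:
    """Turn a path [A, ..., B] into consecutive (parent, child) pairs.

    Each pair would be encoded separately by BERT; the resulting sequence of
    pair representations is what the bidirectional GRU reads.
    """
    if len(path) < 2:
        raise ValueError("a path needs at least two claims")
    return list(zip(path, path[1:]))


path = [
    "The capacity of harm is greater when magic is involved",
    "There is a great capacity to harm others using the Unforgivable Curses",
    "The Unforgivable Curses are illegal in the wizarding world",
]
pair_sequence = parent_child_pairs(path)
for parent, child in pair_sequence:
    print(parent[:35], "=>", child[:35])
```

A path of n claims yields n - 1 pairs, so a parent-child input (distance 1) reduces to a single pair and a GRU sequence of length one.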
Analysis. We find that the fine-tuned BERT models perform much better than the feature-based models and baselines across both settings. Also, as we hypothesized, having the argument path information is useful for determining relative stance between claims that do not have a parent-child relationship, as the BERT models with path information consistently perform better in the second setting, with the hierarchical BERT model being the best. In our dataset, an argument path from the tree is the best approximation that we have for an argumentative discourse, and as such our results suggest that considering discourse-level context is useful in determining relative stance between two claims. However, as shown by our results, our models can still be employed when there is limited or no discourse information.
The performance degrades significantly 9 in the second setting, where we include claim pairs at all distances, implying that it is easier to determine the stance of a claim relative to its parent than relative to claims that are farther away on the same path.
We carry out a more fine-grained analysis of the performance of the fine-tuned BERT models at different distances, which we present in Table 6. As expected, performance degrades for all models as the distance between the pair of claims increases. We find that at distance d=4 the fine-tuned BERT model that incorporates path information in a simple manner performs similarly to the model without path information. The hierarchical model, however, performs significantly better, which further justifies our choice to treat the argument path context as a hierarchical rather than a flat representation.

Related Work
Argumentation Generation. Previous work in argument generation has focused on generating summaries of opinionated text (Wang and Ling, 2016), rebuttals for a given argument (Jitnah et al., 2000), paraphrases from predicate/argument structure (Kozlowski et al., 2003), generation via sentence retrieval (Sato et al., 2015) and developing argumentative dialogue agents (Le et al., 2018; Rakshit et al., 2017). The work on developing argumentative dialogue agents, in particular, has employed mostly social media data such as IAC (Walker et al., 2012c) to design retrieval-based or generative models to make argumentative responses to the users. These models, however, employ very limited context in generating the claims, and there is no notion of generating a claim with a particular stance or the appropriate level of specificity within the context. Furthermore, these models are trained on social media conversations, which can be noisy, and as noted by Rakshit et al. (2017), many sentences either do not express an argument or cannot be understood out of context. In contrast, our dataset explicitly provides the sequence of claims in an argument path that leads to any particular claim, which can enable an argument generation system to generate relevant claims, with a particular stance and at the right level of specificity. Recent work by Hua and Wang (2018) studies the task of generating claims of a different stance for a given statement; however, their context is limited to the given statement and they do not take specificity into account.
Stance Detection. Previous work on claim stance detection has studied the important linguistic features to determine the stance of a claim relative to a thesis/main claim (Somasundaran and Wiebe, 2009, 2010; Walker et al., 2012a,b; Hasan and Ng, 2013; Sridhar et al., 2014; Thomas et al., 2006; Yessenalina et al., 2010; Burfoot et al., 2011; Kwon et al., 2007; Faulkner, 2014; Bar-Haim et al., 2017).
Some of these studies have shown that simple linear classifiers with uni-gram and n-gram features are effective for this task (Somasundaran and Wiebe, 2010;Hasan and Ng, 2013;Mohammad et al., 2016). However, in our setting, since we try to predict the stance between all pairs of claims on an argument path, rather than simply claims that are directed towards the thesis or the parent claim, we find that the models with a hierarchical representation of the argument path, i.e. the context, significantly outperform these baselines.
Argument Structure and Quality. There has been a tremendous amount of work in computational argumentation mining focusing on determining argumentative components (Mochales and Moens, 2011; Stab and Gurevych, 2014; Nguyen and Litman, 2015) and argument structure in text (Palau and Moens, 2009; Biran and Rambow, 2011; Feng and Hirst, 2011; Lippi and Torroni, 2015; Park and Cardie, 2014; Peldszus and Stede, 2015; Niculae et al., 2017; Rosenthal and McKeown, 2015), and understanding the argument quality dimensions (Wachsmuth et al., 2017; Carlile et al., 2018) and the characteristics of persuasive arguments (Kelman, 1961; Burgoon et al., 1975; Chaiken, 1987; Tykocinskl et al., 1994; Chambliss and Garner, 1996; Durmus and Cardie, 2018; Dillard and Pfau, 2002; Cialdini, 2007; Durik et al., 2008; Tan et al., 2014; Marquart and Naderer, 2016; Durmus and Cardie, 2019). Existing work on claim specificity and stance detection has mostly employed datasets extracted from monologic documents that include shallower support/oppose structures (Bar-Haim et al., 2017; Faulkner, 2014). Although there has been some work on constructing argument structure datasets using news sources (Reed et al., 2008), microtexts (Peldszus, 2014) and user comments (Park and Cardie, 2018), these structures tend to be shallower and include fewer opposing claims, since they employ existing monologic texts that are relatively short. In contrast, the dataset we provide is constructed with the goal of providing supporting and opposing claims for each of the claims presented in an argument tree. Therefore, these argument tree structures are deeper and have a more balanced number of supporting and opposing claims.

Conclusion
We present a new dataset of manually curated argument trees, which can open interesting avenues of research in argumentation. We use this dataset to study methods for determining claim stance and relative claim specificity for complex argumentative structures. We find that it is easier to predict stance for claims that have a parent-child relationship, whereas relative specificity is more difficult to predict in the same case. For future work, it may be interesting to understand which other models would be effective in claim specificity and stance detection tasks. Developing techniques to incorporate the claim stance and specificity detection models into argument generation, in order to generate more coherent and consistent arguments, is another interesting research direction to be explored.