Exploring the Role of Argument Structure in Online Debate Persuasion

Online debate forums provide users a platform to express their opinions on controversial topics while being exposed to a diverse set of viewpoints. Existing work in Natural Language Processing (NLP) has shown that linguistic features extracted from debate text and features encoding the characteristics of the audience are both critical in persuasion studies. In this paper, we further investigate the role of the discourse structure of arguments from online debates in their persuasiveness. In particular, we use a factor graph model to obtain features for the argument structure of debates from an online debating platform and incorporate these features into an LSTM-based model to predict the debater that makes the most convincing arguments. We find that incorporating argument structure features plays an essential role in achieving better predictive performance in assessing the persuasiveness of arguments in online debates.


Introduction
The increasing availability of online argumentation platforms has provided an opportunity for researchers to develop computational methods, at a larger scale, for studying the important factors of persuasiveness, such as language use (Hidey et al., 2017; Zhang et al., 2016), characteristics of the audience (i.e., prior beliefs, demographics) (Durmus and Cardie, 2019a, 2018), and social interactions (Durmus and Cardie, 2019b).
Prior work has shown that incorporating argument structure features is important in assessing the quality of monological persuasive essays (Klebanov et al., 2016; Wachsmuth et al., 2016). Hidey et al. (2017) and Egawa et al. (2019) further collected annotations for the semantic types of argument components and studied the relationship between these semantic types and the persuasiveness of arguments from the online argumentative platform ChangeMyView (CMV). CMV consists of discussion trees in which users interact with the original poster to change their opinion on a given topic. Although the discussion trees are of high quality, since they are monitored by moderators, they are not as structured: any user in the subreddit can participate in the discussion once the original post is published. Furthermore, the persuasiveness of posts in CMV is evaluated only by the original poster (i.e., whether they change their stance or not). In this paper, we investigate the effect of argument structure on persuasion in online debates. We focus on debates from the DDO corpus (Durmus and Cardie, 2019a), where debaters from two diverging sides of an issue express their opinions on a controversial topic in turns, since these debates are more structured and the persuasiveness of their arguments is evaluated by a larger set of audience members. Moreover, this setup allows us to account for audience characteristics when studying the effect of argument structure on persuasion.
We first generate argument structures on the DDO dataset (Durmus and Cardie, 2019a) using the model proposed by Niculae et al. (2017). We then incorporate the features extracted from the argument structures into an LSTM-based model that encodes the sequence of turns from the two sides (i.e., PRO vs. CON). We compare our results with the baselines proposed in Durmus and Cardie (2018), which extract linguistic features from the debate text as well as features that encode characteristics of the audience. We find that incorporating argument structure features achieves significantly better results than the baselines. Our analysis further shows that argument structure features encode important strategies of persuasion: for example, we find that more convincing arguments are more likely to include personal experiences of the debater and appeal to audience emotion.

Related Work
Analysis of discourse structure There has been a lot of effort to understand the role of discourse structure in argumentation. Jiang et al. (2019) applied Rhetorical Structure Theory (RST) to essays written by students in K-12 schools and demonstrated its potential to provide automated feedback on essay quality. Argument structures, which can be considered a special kind of discourse structure, have been widely analyzed in the tasks of automatic essay scoring and feedback (Klebanov et al., 2016; Ghosh et al., 2016; Wachsmuth et al., 2016). Furthermore, Duthie and Budzynska (2018) studied the relationship between ethos, a specific kind of argument unit, and the dynamics of governments in UK parliamentary debates. The role of argument structure in persuasion in online debates is much less explored, and it is the main focus of this paper.
Analysis of persuasion Prior studies on persuasion have mainly focused on understanding the role of linguistic factors (Petty et al., 1983; Chaiken, 1987; Dillard and Pfau, 2002; Gold et al., 2015). In addition, the interaction between debaters has been shown to be an important cue in persuasion studies (Zhang et al., 2016; Wang et al., 2017). Luu et al. (2019) further found that a debater's skill, estimated from their debate history, is also predictive of convincing the audience. User factors have been explored in previous work (Durmus and Cardie, 2019a, 2018; Longpre et al., 2019), demonstrating the importance of the characteristics and beliefs of the audience. Furthermore, Potash and Rumshisky (2017) proposed a recurrent neural network architecture with attention and annotated audience favorability to predict the winner of a debate. Villata et al. (2018) and Benlamine et al. (2017) studied the correlation of the engagement index in brain hemispheres with persuasion strategies. Argument structures have been used to understand argumentative strategies in dialogues and news editorials (Al Khatib et al., 2017; Wang et al., 2019). A few studies have explored the impact of argument structures in predicting persuasion on the CMV dataset based on statistical analysis of proposition types (Hidey et al., 2017; Egawa et al., 2019; Morio et al., 2019). In this paper, we specifically study persuasion in online debates. We propose novel argument structure features based on n-grams of the supporting relations in the argument structure graph of the debate text and experiment with these features using both linear and neural models.

Dataset
We experiment with the DDO dataset (Durmus and Cardie, 2019a), which includes 77,655 debates covering 23 different topic categories. Each debate consists of multiple rounds, with each round containing one utterance from the PRO side and one utterance from the CON side. Besides the text of the debates, the dataset also contains user information and votes provided by the audience on six different criteria for evaluating both debaters. We use the criterion "Made more convincing arguments" as an overall signal to study the role of argument structure in predicting more convincing arguments.

Prediction Task
Task. We aim to predict which side (i.e., PRO vs. CON) makes more convincing arguments during a debate, and thus is more persuasive.
Data preprocessing. We count which side of the debate gets more votes for the criterion "Made more convincing arguments". We eliminate debates that are tied or in which the difference in votes is only one (since the average number of total votes in a debate is 8, we consider a difference of two or more votes significant). The final dataset contains 2,606 debates.
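As a concrete illustration, here is a minimal sketch of this filtering step; the function and threshold names are ours, not part of the DDO corpus schema:

```python
MIN_MARGIN = 2  # ties and one-vote differences are discarded

def label_convincing_side(pro_votes, con_votes):
    """Return 'PRO' or 'CON' if the vote margin is decisive, else None."""
    if abs(pro_votes - con_votes) < MIN_MARGIN:
        return None
    return "PRO" if pro_votes > con_votes else "CON"

# e.g. a debate with 5 PRO vs. 3 CON votes is kept and labeled "PRO",
# while a 4 vs. 3 debate is dropped
assert label_convincing_side(5, 3) == "PRO"
assert label_convincing_side(4, 3) is None
```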

Argument structure features
We apply the pre-trained model of Niculae et al. (2017) to the DDO dataset to obtain the structure of the arguments. We select this method because it predicts argumentative relations and classifies proposition types at the same time, whereas the method proposed by Chakrabarty et al. (2020) mainly focuses on predicting argumentative relations. Moreover, this model can represent argumentative relations that do not necessarily form a tree structure, which makes it more suitable for argumentation in the wild than the models proposed by Stab and Gurevych (2017) and Peldszus and Stede (2015). We generate argument structures for the selected 2,606 debates. (Since the model has a relatively long inference time and performs worse on long debates, we eliminate all debates with more than 40 sentences in one round from one side; we also eliminate debates where one of the debaters forfeits during the debate.) The argument structure model outputs the proposition type for each sentence (i.e., REFERENCE, TESTIMONY, FACT, VALUE, POLICY) as well as the supporting relationships between propositions. An example of the argument structure generated for the text from one side in one round of the debate 'Preschool Is A Waste Of Time' is shown in Figure 1. We use Amazon Mechanical Turk (AMT) to further evaluate the quality of the argument structures on debates by asking Turkers to classify each argument from 30 randomly picked debates into five categories: POLICY, VALUE, FACT, TESTIMONY, REFERENCE. In total, we obtain annotations for 1,098 sentences, and each sentence is annotated by two annotators. We find that around 64% of the output generated by the pre-trained model is consistent with at least one of the two Turker annotations. We then extract three sets of argument structure features to capture the proposition types and the links between propositions:

Proposition n-gram frequency Similar to Wachsmuth et al. (2016), we obtain the frequencies of proposition unigrams, bigrams, and trigrams from the sequence of propositions. For example, the (POLICY, VALUE) and (VALUE, VALUE) bigram features in Figure 1 have values 0.25 and 0.75, respectively.
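A minimal sketch of how these frequencies can be computed; the helper name is ours, and the example sequence is chosen only to reproduce the bigram values above:

```python
from collections import Counter

def proposition_ngram_features(types, n):
    """Relative frequencies of proposition-type n-grams over the
    sequence of proposition types in one side's text."""
    ngrams = [tuple(types[i:i + n]) for i in range(len(types) - n + 1)]
    counts = Counter(ngrams)
    total = sum(counts.values())
    return {ng: c / total for ng, c in counts.items()}

# One (POLICY, VALUE) bigram and three (VALUE, VALUE) bigrams
# give the frequencies 0.25 and 0.75 from the Figure 1 example.
seq = ["POLICY", "VALUE", "VALUE", "VALUE", "VALUE"]
print(proposition_ngram_features(seq, 2))
```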
Link n-gram frequency We extract n-gram information from the supporting relations in the argument structure graph. For example, we represent two propositions connected with a link as a bigram (i.e., a → b in the graph is represented by the bigram (a, b)).
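A small sketch of this link-bigram computation; encoding the graph as a list of (source, target) index pairs is our illustration, not the paper's implementation:

```python
from collections import Counter

def link_bigram_features(links, types):
    """Each support edge src -> dst becomes the type bigram
    (type(src), type(dst)); return relative frequencies."""
    bigrams = Counter((types[src], types[dst]) for src, dst in links)
    total = sum(bigrams.values())
    return {bg: c / total for bg, c in bigrams.items()}

# Hypothetical graph: proposition 0 (FACT) and 2 (TESTIMONY)
# both support proposition 1 (VALUE).
types = {0: "FACT", 1: "VALUE", 2: "TESTIMONY"}
links = [(0, 1), (2, 1)]
print(link_bigram_features(links, types))
# {('FACT', 'VALUE'): 0.5, ('TESTIMONY', 'VALUE'): 0.5}
```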
Graphical representation Rahwan (2008) found that there are five common argument structures in online environments: basic argument, convergent argument, serial argument, divergent argument, and linked argument. A typical basic argument is a → b, while a serial argument is a chain a → b → c. Similarly, a linked argument is of the form a, c → b. We extract features representing which of these types of arguments are used in the text of the debaters. We further classify convergent arguments into two categories: those where two propositions support one proposition (regular convergent arguments) and those where more than two propositions support one proposition (multi convergent arguments). Similarly, we classify divergent arguments into regular divergent and multi divergent arguments.
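A simplified sketch of how such structures could be read off the support graph from node degrees; it covers only the convergent, divergent, and serial cases, and all names are ours:

```python
from collections import defaultdict

def argument_graph_motifs(links):
    """Count degree-based argument structures from (source, target)
    support edges; basic and linked arguments are omitted here."""
    out_deg, in_deg = defaultdict(int), defaultdict(int)
    for src, dst in links:
        out_deg[src] += 1
        in_deg[dst] += 1
    motifs = defaultdict(int)
    for n in set(out_deg) | set(in_deg):
        if in_deg[n] == 2:
            motifs["regular_convergent"] += 1  # two propositions support n
        elif in_deg[n] > 2:
            motifs["multi_convergent"] += 1    # more than two support n
        if out_deg[n] == 2:
            motifs["regular_divergent"] += 1   # n supports two propositions
        elif out_deg[n] > 2:
            motifs["multi_divergent"] += 1
        if in_deg[n] == 1 and out_deg[n] == 1:
            motifs["serial"] += 1              # middle of a chain a -> n -> c
    return dict(motifs)

# a -> b <- c : one regular convergent structure centered on b
print(argument_graph_motifs([("a", "b"), ("c", "b")]))
```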

Experiments and Analysis
We compare our model with the baseline proposed by Durmus and Cardie (2019a), which employs linguistic features and features encoding audience characteristics. Prediction accuracy is evaluated using 5-fold cross-validation, and the model parameters for each split are picked with 3-fold cross-validation on the training set. As shown in Table 2, adding argument structure features to Logistic Regression achieves significantly better performance than the baseline with linguistic and user-based features. The LSTM with argument structure features achieves the best predictive performance, since the LSTM can better represent context and the interplay between debaters. We perform a t-test over 10 runs between the models with and without argument structure features; the p-value is 0.0038, indicating a statistically significant difference. Furthermore, we perform an ablation over the different sets of argument structure features. The results show that using the sequential flow of arguments is more effective than using argumentative relations in our setting.
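For reference, such a significance test could be run as below; the per-run accuracies are illustrative placeholders, not the paper's numbers, and the independent two-sample t-test is one plausible reading of the setup:

```python
from scipy.stats import ttest_ind

# Hypothetical accuracies over 10 runs for the models with and
# without argument structure features (placeholder values only).
with_struct = [0.71, 0.70, 0.72, 0.69, 0.71, 0.70, 0.73, 0.70, 0.71, 0.72]
without_struct = [0.66, 0.67, 0.65, 0.68, 0.66, 0.67, 0.66, 0.65, 0.67, 0.66]

t_stat, p_value = ttest_ind(with_struct, without_struct)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # p < 0.05 => significant
```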
We further analyze which types of argument structure are more correlated with making more convincing arguments. Comparing the unigram, bigram, and trigram frequencies of the propositions by more convincing vs. less convincing debaters, we find that the unigram TESTIMONY (p < 0.0001), the bigram (VALUE, TESTIMONY) (p < 0.001), and the trigram (VALUE, TESTIMONY, VALUE) (p < 0.0001) appear more frequently on the more convincing side. This result suggests that justifying objective claims with personal experiences is an effective strategy, as also shown in previous work (Villata et al., 2018). Table 1 shows an example that is classified correctly by the model after adding argument structure features. We observe that the side referring to their personal experiences (PRO) is voted as the side making more convincing arguments. Besides, we find that the unigram POLICY (p < 0.0001) and the bigram (POLICY, VALUE) (p < 0.005) appear more frequently on the less convincing side, suggesting that using propositions of type POLICY, which specify a specific course of action to be taken, may not be a very effective strategy in online debating. Analyzing the link n-gram frequency features, we further find that propositions of type VALUE on the more convincing side are supported by a FACT (p < 0.05) more often. This suggests that more convincing debaters may be using logos to support their views, as also shown in previous work (Hidey et al., 2017). Finally, we observe that the more convincing side tends to have more divergent arguments (p = 0.052). Divergent arguments involve three or more consecutive sentences most of the time. In the case of three consecutive sentences, the middle sentence supports both of the other two sentences by giving explanations or evidence, and serves as a transition between two similar ideas.
We also examine some examples that are classified incorrectly by the model. A typical error is caused by incorrect proposition type classification. For example, in the debate "Driving on public roads is a right not a privilege", the sentences from the PRO side "In addition, in purchasing our vehicles, we have the right to drive said vehicle." and "I appreciate the insight given by my opponent but he/she has failed to address the issue at hand." are wrongly classified as TESTIMONY, which makes the model prefer PRO as the more convincing side. We believe that incorporating more accurate argument structure generation models can further improve performance on persuasion prediction.

Conclusion
In this work, we explore the role of argument structure in online debate persuasion and find that incorporating argument structure features along with linguistic features achieves the best predictive performance. Moreover, we observe that argument structure features provide important cues about effective persuasion strategies in online debates.

…ance is the same, though one utterance will contain multiple sentences. Due to BERT's maximum sequence length of 128 tokens (we use BERT-base with uncased input as the pre-trained model), which is much shorter than the average length of an utterance in one round for one debater, we truncate the debate text input and preserve only the last three sentences in each round for each debater. (We also experimented with keeping more sentences, e.g., the last five, in cases where the sequence length was not maxed out, but this did not show significant improvement.) We also tested truncating to the first three sentences of the utterance, but the model's performance was around 3% lower.
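A minimal sketch of this truncation step, assuming NLTK for sentence splitting (the paper does not specify which splitter is used):

```python
import nltk  # requires a one-time nltk.download("punkt")

def truncate_utterance(text: str, keep_last: int = 3) -> str:
    """Keep only the last `keep_last` sentences of an utterance so the
    encoded turn fits within BERT's 128-token limit."""
    sentences = nltk.sent_tokenize(text)
    return " ".join(sentences[-keep_last:])
```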

A.4 Implementation Details
We use grid search to pick the hyperparameters. For the model that encodes linguistic information, we use a one-layer bidirectional LSTM with 768-dimensional BERT representations as input and 32-dimensional hidden states. To test the stability of our results, we train and evaluate our model 10 times and take the average accuracy.
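A PyTorch sketch consistent with this configuration; the classifier head and the pooling of the final turn state are our assumptions, not details given in the paper:

```python
import torch
import torch.nn as nn

class DebateLSTM(nn.Module):
    """One-layer bidirectional LSTM over 768-dim BERT turn
    representations with 32-dim hidden states."""

    def __init__(self, input_dim=768, hidden_dim=32):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers=1,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, 2)  # PRO vs. CON

    def forward(self, turn_embeddings):
        # turn_embeddings: (batch, num_turns, 768)
        outputs, _ = self.lstm(turn_embeddings)
        return self.classifier(outputs[:, -1, :])  # last turn's state

model = DebateLSTM()
logits = model(torch.randn(2, 6, 768))  # 2 debates, 6 turns each
```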
A.5 Details on AMT results
Figure 3 shows a screenshot of a typical HIT for the Turkers. For each HIT, the Turkers are given the debate topic and the sentence to be classified. They need to choose among five categories: POLICY, VALUE, FACT, TESTIMONY, REFERENCE. The definitions of these proposition types and corresponding examples are included in the full instructions. In total, we get annotations for 1,098 sentences from seventeen annotators. The detailed results are listed in Table 3. Consistency means that the generated annotation is consistent with at least one of the Turkers' annotations. We also compute Inter-Annotator Agreement (IAA) using Krippendorff's alpha (Krippendorff, 1970). The Krippendorff's alpha is 0.2, indicating that annotating argument structure is still a hard task for Turkers.
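For reference, Krippendorff's alpha for two annotators can be computed with the `krippendorff` Python package (pip install krippendorff); the label coding and toy data below are illustrative:

```python
import krippendorff

# rows = annotators, columns = sentences; categories coded as integers
# (0=POLICY, 1=VALUE, 2=FACT, 3=TESTIMONY, 4=REFERENCE)
reliability_data = [
    [1, 3, 2, 1, 0],  # annotator 1
    [1, 3, 1, 1, 4],  # annotator 2
]
alpha = krippendorff.alpha(reliability_data=reliability_data,
                           level_of_measurement="nominal")
print(round(alpha, 2))
```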