Agreement Prediction of Arguments in Cyber Argumentation for Detecting Stance Polarity and Intensity

In online debates, users express different levels of agreement/disagreement with one another's arguments and ideas. Often, these levels of agreement/disagreement are implicit in the text and must be predicted to analyze collective opinions. Existing stance detection methods predict the polarity of a post's stance toward a topic or post, but do not consider the stance's degree of intensity. We introduce a new research problem, stance polarity and intensity prediction in response relationships between posts. This problem is challenging because differences in stance intensity are often subtle and require nuanced language understanding. Cyber argumentation research has shown that incorporating both stance polarity and intensity data in online debates leads to better discussion analysis. We explore five different learning models: Ridge-M regression, Ridge-S regression, SVR-RF-R, pkudblab-PIP, and T-PAN-PIP for predicting stance polarity and intensity in argumentation. These models are evaluated using a new dataset for stance polarity and intensity prediction collected using a cyber argumentation platform. The SVR-RF-R model performs best, predicting stance polarity with an accuracy of 70.43% and intensity with an RMSE of 0.596. This work is the first to train models for predicting a post's stance polarity and intensity as one combined value in cyber argumentation with reasonably good accuracy.


Introduction
Many major social media and networking sites, such as Facebook, Twitter, and Wikipedia, have taken over as the new public forums where people discuss and debate issues of national and international importance. With more participants in these debates than ever before, the volume of unstructured discourse data continues to increase, and the need for automatic processing of this data is pressing. A critical task in processing online debates is to automatically determine the different argumentative relationships between online posts in a discussion. These relationships typically consist of a stance polarity (i.e., whether a post is supporting, opposing, or neutral toward another post) and the degree of intensity of the stance.
Automatically determining these types of relationships from a given text is a goal in both stance detection and argumentation mining research. Stance detection models seek to automatically determine a text's stance polarity (Favoring, Opposing, or Neutral) toward another text or topic based on its textual information. Likewise, argumentation mining seeks to determine the stance relationship (Supporting, Attacking, or Neutral) between argumentation components in a text (Stede and Schneider, 2018). However, in both cases, attention is paid only to the stance's polarity, while the intensity of the relationship is often ignored. Some studies have tried to incorporate intensity into their predictions by expanding the number of classes to predict (Strongly For, For, Other, Against, and Strongly Against); however, this expansion lowered their classification performance considerably compared to classification without intensity (Sobhani et al., 2015). Thus, effective incorporation of stance intensity into stance classification remains an open issue.
Research in Cyber Argumentation has shown that incorporating both stance polarity and intensity information into online discussions improves the analysis of discussions and of the various phenomena that arise during a debate, such as opinion polarization (Sirrianni et al., 2018) and outlier opinions (Arvapally et al., 2017), compared to using stance polarity alone. Thus, automatically identifying both a post's stance polarity and intensity allows these powerful analytical models to be applied to unstructured debate data from platforms such as Twitter, Facebook, Wikipedia, comment threads, and online forums.
To that end, in this paper, we introduce a new research problem, stance polarity and intensity prediction in a responsive relationship between posts, which aims to predict a text's stance polarity and intensity, which we combine into a single continuous agreement value. Given an online post A, which is replying to another online post B, we predict the stance polarity and intensity value of A toward B using A's (and sometimes B's) textual information. The stance polarity and intensity value is a continuous value, bounded from -1.0 to +1.0, where the value's sign (positive, negative, or zero) corresponds to the text's stance polarity (favoring, opposing, or neutral) and the value's magnitude (0 to 1.0) corresponds to the text's stance intensity.
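The decomposition of a single agreement value into its two stance components can be sketched as follows (a minimal illustration; the function name is ours, not part of any system described here):

```python
def decompose_agreement(value):
    """Split a bounded agreement value in [-1.0, +1.0] into
    (stance polarity, stance intensity).

    The sign gives the polarity (favoring / opposing / neutral);
    the magnitude gives the intensity.
    """
    if not -1.0 <= value <= 1.0:
        raise ValueError("agreement value must lie in [-1.0, +1.0]")
    if value > 0:
        polarity = "favoring"
    elif value < 0:
        polarity = "opposing"
    else:
        polarity = "neutral"
    return polarity, abs(value)
```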
Stance polarity and intensity prediction encapsulates stance detection within its problem definition and is thus a more difficult problem to address. While stance polarity can be identified through specific keywords (e.g., "agree", "disagree"), intensity is a much fuzzier concept. The difference between strong opposition and weak opposition is often expressed through subtle word choices and conversational behaviors. Thus, to accurately predict agreement intensity, a learned model must understand the nuances between word choices in the context of the discussion.
We explore five machine learning models for agreement prediction, adapted from top-performing models for stance detection: Ridge-M regression, Ridge-S regression, SVR-RF-R, pkudblab-PIP, and T-PAN-PIP. These models were adapted from prior stance detection work, including Mourad et al. (2018), Wei et al. (2016), and Dey et al. (2018). We evaluated these models on a new dataset for stance polarity and intensity prediction, collected over three empirical studies using our cyber argumentation platform, the Intelligent Cyber Argumentation System (ICAS). This dataset contains over 22,000 online arguments from over 900 users discussing four important issues. In the dataset, each argument is manually annotated by its authoring user with an agreement value.
Results from our empirical analysis show that the SVR-RF-R ensemble model performed the best for agreement prediction, achieving an RMSE score of 0.596 for stance polarity and intensity prediction, and an accuracy of 70% for stance detection. Further analysis revealed that the models trained for stance polarity and intensity prediction often had better accuracy for stance classification (polarity only) compared to their counterpart stance detection models. This result demonstrates that the added difficulty of detecting stance intensity does not come at the expense of detecting stance polarity. To our knowledge, this is the first time that learning models have been trained to predict an online post's stance polarity and intensity simultaneously.
The contributions of our work are the following:

• We introduce a new research problem called stance polarity and intensity prediction, which seeks to predict a post's agreement value toward its parent post, containing both the stance polarity (the value's sign) and intensity (the value's magnitude).

• We apply five machine learning models to our dataset for agreement prediction. Our empirical results reveal that an ensemble model with many hand-crafted features performed the best, with an RMSE of 0.596, and that models trained for stance polarity and intensity prediction do not lose significant performance on stance detection.
Related Work

Stance Detection
Stance detection has attracted wide interest in a variety of application areas, including opinion mining (Hasan and Ng, 2013), sentiment analysis, rumor veracity (Derczynski et al., 2017), and fake news detection (Lillie and Middelboe, 2019). Prior works have applied stance detection to many types of debate and discussion settings, including congressional floor debates (Burfoot et al., 2011), online forums (Hasan and Ng, 2013; Dong et al., 2017), persuasive essays (Persing and Ng, 2016), news articles (Hanselowski et al., 2018), and social media data such as Twitter. Approaches to stance detection depend on the type of text and the relationship the stance describes. For example, stance detection on Twitter often determines the author's stance (for/against/neutral) toward a proposition or target. In this work, we adapt the feature sets and models used on the SemEval 2016 Twitter stance detection dataset.
This dataset has many similarities to our data in terms of post length and topics addressed. Approaches to Twitter stance detection include SVMs (Elfardy and Diab, 2016), ensemble classifiers (Tutek et al., 2016; Mourad et al., 2018), convolutional neural networks (Igarashi et al., 2016; Vijayaraghavan et al., 2016; Wei et al., 2016), recurrent neural networks (Zarrella and Marsh, 2016; Dey et al., 2018), and deep learning approaches (Sun et al., 2018; Sobhani et al., 2019). Due to the size of the dataset, the difference in domain, and time constraints, we did not test Sun et al. (2018)'s model in this work, as we could not gather sufficient argument representation features.

Argumentation Mining
Argumentation mining is applied to argumentative text to identify the major argumentative components and their relationships to one another (Stede and Schneider, 2018). While stance detection identifies an author's stance toward a concept or target, argumentation mining identifies relationships between arguments, similar to our task in agreement prediction. However, unlike our task, argumentation mining typically defines arguments based on argument components, instead of treating an entire post as a single argument. In argumentation mining, a single text may contain many arguments. The major tasks of argumentation mining include: 1) distinguishing argumentative text from non-argumentative text, 2) classifying argumentation components (e.g., Major Claim, Claims, Premise, etc.) in the text, 3) determining the relationships between the different components, and 4) classifying the relationships as supporting, attacking, or neutral (Lippi and Torroni, 2016). End-to-end argument mining seeks to solve all the argumentation mining tasks at once (Persing and Ng, 2016; Eger et al., 2017), but most research focuses on one or two tasks at a time. The most pertinent task to this work is the fourth (though often it is combined with task 3). Approaches to this task include using textual entailment suites with syntactic features (Boltužić and Šnajder, 2014), or machine learning classifiers with different combinations of features, including structural and lexical features (Persing and Ng, 2016), sentiment features (Stab and Gurevych, 2017), and topic modeling features (Nguyen and Litman, 2016). We use many of these types of features in our Ridge-S and SVR-RF-R models.

Cyber Argumentation Systems
Cyber argumentation systems help facilitate and improve understanding of large-scale online discussions, compared to other platforms used for debate, such as social networking and media platforms, online forums, and chat rooms (Klein, 2011). These systems typically employ argumentation frameworks, like IBIS (Kunz and Rittel, 1970) and Toulmin's structure of argumentation (Toulmin, 2003), to provide structure to discussions, making them easier to analyze. More specialized systems include features that improve the quality and understanding of discussions. Argumentation learning systems teach the users effective debating skills using argumentation scaffolding (Bell and Linn, 2000). More complex systems, like ICAS and the Deliberatorium (Klein, 2011), provide several integrated analytical models that identify and measure various phenomena occurring in the discussions.
ICAS implements an IBIS structure (Kunz and Rittel, 1970), where each discussion is organized as a tree. In ICAS, discussions are organized by issue. Issues are important problems that need to be addressed by the community. Under each issue are several positions, which act as solutions or approaches toward solving the issue. Under each position, there are several arguments that argue for or against the parent position. Under these arguments, there can be any number of follow-on arguments that argue for or against the parent argument, and so on until the discussion has ended. Figure 1 provides a visualization of the discussion tree structure ICAS employs.
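The IBIS-style discussion tree described above can be sketched as a simple recursive data structure (the class and field names are ours, for illustration only, not ICAS's actual implementation):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    """One node in an ICAS-style discussion tree."""
    kind: str                           # "issue", "position", or "argument"
    text: str
    agreement: Optional[float] = None   # arguments only: -1.0 .. +1.0
    children: List["Node"] = field(default_factory=list)

    def reply(self, kind, text, agreement=None):
        """Attach a child node (a position or a (follow-on) argument)."""
        child = Node(kind, text, agreement)
        self.children.append(child)
        return child

# A tiny discussion: issue -> position -> argument -> follow-on argument
issue = Node("issue", "Should individuals be required to have health insurance?")
pos = issue.reply("position", "Yes, a mandate is necessary.")
arg = pos.reply("argument", "Mandates spread risk across the pool.", agreement=0.8)
arg.reply("argument", "Partially agree, but enforcement is hard.", agreement=0.4)
```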
In ICAS, arguments have two components: a textual component and an agreement value. The textual component is the written argument the user makes. ICAS does not limit the length of argument text; however, in practice, the average argument length is about 160 characters, similar to the length of a tweet. The agreement value is a numerical value that indicates the extent to which an argument agrees or disagrees with its parent. Unlike other argumentation systems, this system allows users to express partial agreement or disagreement with other posts. Users are allowed to select agreement values from a range of -1 to +1 at 0.2 increments that indicate different partial agreement values. Positive values indicate partial or complete agreement, negative values indicate partial or complete disagreement, and a value of 0 indicates indifference or neutrality. These agreement values represent each post's stance polarity (the sign) and intensity (the magnitude). These agreement values are distinctly different from other argumentation weighting schemes where argument weights represent the strength or veracity of an argument (see Amgoud and Ben-Naim, 2018; Levow et al., 2014). Each agreement value is selected by the author of the argument, and selecting one is a mandatory step when posting.

Models for Stance Polarity and Intensity Prediction
This section describes the models we applied to the stance polarity and intensity prediction problem. We applied five different models, adapted from top-performing stance classification models based on their performance and approach on the SemEval 2016 Twitter stance classification dataset.

Ridge Regressions (Ridge-M and Ridge-S)
Our first two models use a linear ridge regression as the underlying model. We created two ridge regression models using two feature sets. The first ridge model (Ridge-M) used a benchmark feature set of word 1-3 grams and character 2-5 grams. We filtered out English stop words, tokens that existed in more than 95% of posts, and tokens that appeared in fewer than 0.01% of posts for word N-grams and fewer than 10% for character N-grams. There were a total of 838 N-gram features for the Ridge-M model. The second ridge model (Ridge-S) used the feature set described in Sobhani, Mohammad, and Kiritchenko's follow-up paper (2016). In that paper, they found the sum of trained word embeddings with 100 dimensions, in addition to the N-gram features above, to be the best-performing feature set. We trained a word-embedding (skip-gram word2vec) model on the dataset. For each post, the embeddings of each token were summed and normalized by the total number of tokens in the post to generate the word-embedding features. Ridge-S had 938 total features.
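The N-gram extraction behind these feature sets can be sketched as follows (a minimal illustration with hypothetical helper names; the actual models additionally apply the stop-word and frequency filtering described above):

```python
def word_ngrams(text, n_range=(1, 3)):
    """Extract word n-grams for every n in the inclusive range n_range."""
    tokens = text.lower().split()
    grams = []
    for n in range(n_range[0], n_range[1] + 1):
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams

def char_ngrams(text, n_range=(2, 5)):
    """Extract character n-grams for every n in the inclusive range n_range."""
    s = text.lower()
    grams = []
    for n in range(n_range[0], n_range[1] + 1):
        grams += [s[i:i + n] for i in range(len(s) - n + 1)]
    return grams
```

In practice these grams would be assembled into a sparse document-term matrix (e.g., with a standard count vectorizer) before being fed to the ridge regression.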

Ensemble of Regressions (SVR-RF-R)
This model (SVR-RF-R) consisted of an average-voting ensemble containing three different regression models: an Epsilon-Support Vector Regression model, a Random Forest regressor, and a ridge regression model. This model is an adaptation of the ensemble model presented by Mourad et al. (2018) for stance detection. Their model used a large assortment of features, including linguistic features, topic features, tweet-specific features, label-based features, word-embedding features, similarity features, context features, and sentiment lexicon features. They then used the feature selection technique ReliefF (Kononenko et al., 1997) to select the top 50 features. Due to the changes in context (Twitter vs. cyber argumentation), we constructed a subset of their feature set, which included the following features:

• Linguistic Features: Word 1-3 grams as binary vectors, count vectors, and tf-idf weighted vectors; character 1-6 grams as count vectors; POS tag 1-3 grams concatenated with their words (e.g., word1 pos1 ...) and concatenated to the end of the post (e.g., word1, word2, ..., POS1, POS2, ...).
• Topic Features: Topic membership of each post after LDA topic modeling (Blei et al., 2003) was run on the entire post corpus.
• Word Embedding Features: The 100-dimensional word-embedding sums over each word in a post, and the cosine similarity between the summed embedding vectors of the target post and its parent post.
We tested using the top 50 features selected by ReliefF, reducing the feature set to 50 dimensions with Principal Component Analysis (PCA), and using the full feature set. We found that the full feature set (2,855 total features) performed significantly better than the ReliefF and PCA feature sets, so we used the full feature set in our final model.
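The average-voting regression ensemble can be sketched as follows (a simplified stand-in for the SVR / random forest / ridge combination; the `Constant` toy regressor below is ours, used only so the example is runnable without fitted models):

```python
class AveragingEnsemble:
    """Average the predictions of several fitted regressors.

    Each member must expose predict(X) -> list of floats, mirroring the
    scikit-learn interface of the SVR, random forest, and ridge members.
    """
    def __init__(self, members):
        self.members = members

    def predict(self, X):
        preds = [m.predict(X) for m in self.members]
        # Average the member predictions element-wise.
        return [sum(col) / len(self.members) for col in zip(*preds)]

class Constant:
    """Toy regressor standing in for a fitted model."""
    def __init__(self, value):
        self.value = value
    def predict(self, X):
        return [self.value for _ in X]

ensemble = AveragingEnsemble([Constant(0.9), Constant(0.3), Constant(0.6)])
preds = ensemble.predict([None, None])   # two dummy inputs
```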

pkudblab-PIP
The highest-performing CNN model on the SemEval 2016 benchmark dataset, pkudblab, was submitted by Wei et al. (2016). Their model applied a convolutional neural network to the word-embedding features of a tweet. We modified this model for agreement prediction. The resulting model's (pkudblab-PIP) architecture is shown in Figure 2. We used pre-trained embeddings (300-dimensional) published by the word2vec team (Mikolov et al., 2013). Given an input of word embeddings of size d by |s|, where d is the size of the word embedding and |s| is the normalized post length, the input was fed into a convolution layer. The convolution layer contained filters with window sizes (m) of 3, 4, and 5 words, with 100 filters (n) each. The convolution outputs were passed to a max-pooling layer and finally through a fully-connected sigmoid layer to produce the final output value. We trained the model using a mean squared error loss function and used a 50% dropout layer after the max-pooling layer.
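The convolution-and-max-pool core of this architecture can be sketched in NumPy (randomly initialized weights, dimensions scaled down for illustration; the real model learns the filters and adds dropout and a sigmoid output layer):

```python
import numpy as np

def conv_maxpool(X, filters):
    """1-D convolution over word positions followed by max-over-time pooling.

    X       : (|s|, d) matrix of word embeddings for one post.
    filters : list of (m, d, n) weight tensors, one per window size m.
    Returns a feature vector of length sum-of-n over all filter banks.
    """
    feats = []
    for W in filters:
        m, d, n = W.shape
        # Slide a window of m words over the post; one activation per position.
        acts = np.array([
            np.tensordot(X[i:i + m], W, axes=([0, 1], [0, 1]))
            for i in range(X.shape[0] - m + 1)
        ])                                               # (|s| - m + 1, n)
        feats.append(np.maximum(acts, 0).max(axis=0))    # ReLU + max pooling
    return np.concatenate(feats)

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 8))                               # 10 words, 8-dim embeddings
filters = [rng.normal(size=(m, 8, 4)) for m in (3, 4, 5)]  # 4 filters per window size
features = conv_maxpool(X, filters)
```

In the full model, this pooled vector feeds the fully-connected sigmoid layer trained with mean-squared-error loss.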

T-PAN-PIP
The RNN model (T-PAN-PIP) is adapted from the T-PAN framework by Dey et al. (2018), which was one of the highest-performing neural network models on the SemEval 2016 benchmark dataset. The T-PAN framework uses a two-phase LSTM model with attention, based on the architecture proposed by Du et al. (2017). We adapted this model for regression: our model follows the architecture of Du et al. (2017), but the output is the predicted agreement value instead of a categorical prediction. Figure 3 illustrates the architecture of T-PAN-PIP. It uses word-embedding features (with embedding size 300) as input to two network branches. The first branch feeds the word embeddings into a bi-directional LSTM (Bi-LSTM) with 256 hidden units, which outputs the hidden states for each direction (128 hidden units each) at every time step. The other branch appends the average topic embedding from the topic text (i.e., the text of the post that the input is responding to) to the input embeddings and feeds that input into a fully-connected softmax layer to calculate what Dey et al. (2018) call the "subjectivity attention signal." The subjectivity attention signals are a linear mapping of each input word's target-augmented embedding to a scalar value that represents the importance of that word relative to the target's text. These values serve as the attention weights used to scale the hidden-state output of the Bi-LSTM.
The weighted attention application layer combines each attention weight with its corresponding hidden-state output, as shown in (1):

Q = Σ_{s=1}^{|s|} a_s h_s    (1)

where a_s is the attention signal for word s, h_s is the hidden-layer output of the Bi-LSTM for word s, |s| is the total number of words, and Q is the resulting attention-weighted vector of size 256, the size of the Bi-LSTM hidden-state output. The output Q feeds into a fully-connected sigmoid layer that outputs the predicted agreement value. We train the model using a mean absolute error loss function.
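This attention-weighted combination amounts to a weighted sum of the Bi-LSTM hidden states, which can be sketched with toy dimensions (the function name is ours):

```python
import numpy as np

def attention_combine(a, H):
    """Combine hidden states H of shape (|s|, hidden) with scalar
    attention signals a of shape (|s|,) into one vector Q = sum_s a_s * h_s."""
    a = np.asarray(a)
    H = np.asarray(H)
    return (a[:, None] * H).sum(axis=0)

H = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])  # 3 words, hidden size 2
a = [0.5, 0.25, 0.25]
Q = attention_combine(a, H)                          # -> [1.0, 0.75]
```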

Empirical Dataset Description
The dataset was constructed from three separate empirical studies collected in Fall 2017, Spring 2018, and Spring 2019. In each study, a class of undergraduate students in an entry-level sociology class was offered extra credit to participate in discussions in ICAS. Each student was asked to discuss four different issues relating to the content they were covering in class. The issues were: 1) Healthcare: Should individuals be required by the government to have health insurance? 2) Same Sex Adoption: Should same-sex married couples be allowed to adopt children? 3) Guns on Campus: Should students with a concealed carry permit be allowed to carry guns on campus? 4) Religion and Medicine: Should parents who believe in healing through prayer be allowed to deny medical treatment for their child?
Under each issue, there were four positions (with the exception of the Healthcare issue for Fall 2017, which had only 3 positions) to discuss. The positions were constructed such that there was one strongly conservative position, one moderately conservative position, one moderately liberal position, and one strongly liberal position. The students were asked to post ten arguments under each issue.
The combined dataset contains 22,606 total arguments from 904 different users. Of those arguments, 11,802 are replying to a position, and 10,804 are replying to another argument. The average depth of a reply thread tends to be shallow, with 52% of arguments on the first level (reply to position), 44% on the second level, 3% on the third level, and 1% on the remaining levels (deepest level was 5).
When a student posted an argument, they were required to annotate their argument with an agreement value. The annotated labels in this dataset are self-labeled: when a user replies to a post, they provide their own stance polarity and intensity label. The label reflects the author's intended stance toward a post, where the post's text is a semantic description of that intention. While these label values are somewhat subjective, they are an accurate reflection of their author's agreement, which we need to capture to analyze opinions in the discussion. Self-annotated datasets like this one have been used in stance detection for argumentation mining in the past (see Boltužić and Šnajder, 2014; Hasan and Ng, 2014).

Agreement Prediction Problem
In this study, we want to evaluate the models' performance on the stance polarity and intensity prediction problem. We separated the dataset into training and testing sets using a 75-25 split. For the neural network models (pkudblab-PIP and T-PAN-PIP), we separated out 10% of the training set as a validation set to detect over-fitting. The split was performed randomly without consideration of the discussion issue. Each issue was represented proportionally in the training and testing data sets with a maximum discrepancy of less than 1%.
For evaluation, we want to see how well the regression models are able to predict the continuous agreement value for a post. We report the root-mean-squared error (RMSE) for the predicted results.
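RMSE over the predicted agreement values follows the standard definition (this is the generic metric, not code from the paper):

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-squared error between true and predicted agreement values."""
    assert len(y_true) == len(y_pred) and len(y_true) > 0
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))
```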

Agreement Prediction Models for Stance Detection
We wanted to investigate whether training models for agreement prediction would degrade their performance for stance detection. Ideally, these models should learn to identify stance intensity without impacting their ability to identify stance polarity.
To test this, we compared each model to the original stance classification model described in its source paper. Thus, Ridge-M is compared with an SVM trained on the same feature set (SVM-M), Ridge-S is compared to a linear SVM trained on the same feature set (SVM-S), SVR-RF-R is compared to a majority-voting ensemble of a linear SVM, Random Forest, and Naïve Bayes classifier using the same feature set (SVM-RF-NB), pkudblab-PIP is compared to the original pkudblab model trained using a softmax cross-entropy loss function, and T-PAN-PIP is compared to the original T-PAN model trained using a softmax cross-entropy loss function. We trained the classification models for stance detection by converting the continuous agreement values into categorical polarity values: all positive agreement values are classified as Favoring, all negative values as Opposing, and zero values as Neutral. In the dataset, 12,258 arguments are Favoring (54%), 8,962 are Opposing (40%), and 1,386 are Neutral (6%). To assess the stance detection performance of the models trained for agreement prediction, we converted the predicted continuous agreement values output by the models into categorical values using the same method.
For evaluation, we report both the accuracy value of the predictions and the macro-average F1-scores for the Favoring and Opposing classes on the testing set. This scoring scheme allows us to treat the Neutral category as a class that is not of interest (Mourad et al., 2018).
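Macro-averaged F1 restricted to the two classes of interest can be computed as follows (a standard metric definition; Neutral still affects the per-class counts through misclassifications but is excluded from the average):

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score of one class given gold and predicted label sequences."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1_favor_oppose(y_true, y_pred):
    """Average F1 over Favoring and Opposing only, ignoring Neutral."""
    return (f1_for_class(y_true, y_pred, "Favoring")
            + f1_for_class(y_true, y_pred, "Opposing")) / 2
```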

Agreement Prediction Results
The results for agreement prediction are shown in Table 1. A mean-prediction baseline model is included in the table to demonstrate the difficulty of the problem. The neural network models perform worse than both the ridge regression and ensemble models. Ridge-S performed slightly better than Ridge-M due to the added sum word-embedding features.

We performed feature analysis on the SVR-RF-R model using ablation testing (i.e., removing one feature set from the model at a time). Results showed that removing any single feature set (word N-grams, character N-grams, POS N-grams, topic features, lexicon features, word-embedding features, or the cosine similarity feature) changed the RMSE of the model by less than 0.005. Using only the N-gram features resulted in an RMSE of 0.599, only 0.0047 worse than the full feature set. This result matches the difference between Ridge-M (which uses only N-gram features) and Ridge-S (which includes N-gram and word-embedding features). Since the N-gram features contain most of the textual information, they had the most impact on the model, while the additional features had smaller effects on model accuracy.

Agreement Prediction models for Stance Detection Results
We compare the models trained on the agreement prediction task to their classification model counterparts in terms of performance on the stance detection task. Tables 2 and 3 show the comparison between the models in terms of accuracy and (macro) F1-score. SVR-RF-R has the best accuracy and F1-score for stance detection, outperforming its classifier counterpart (SVM-RF-NB) by 2.12% in accuracy and +0.016 in F1-score. Three of the models trained for stance polarity and intensity prediction (SVR-RF-R, Ridge-S, and T-PAN-PIP) outperformed their classifier counterparts by 1-2% in accuracy and by +0.009 in F1-score on average. The other two (Ridge-M and pkudblab-PIP) slightly underperformed their classifier counterparts, by -0.36% in accuracy and -0.011 in F1-score on average.

Discussion
The models behaved very similarly on the agreement prediction problem: the difference in RMSE between the best- and worst-performing models is only 0.061. Overall, the best model achieved an RMSE of 0.596, which is reasonably good but leaves room for improvement. T-PAN-PIP had the worst performance, which is surprising, as it was the only model to incorporate the parent post's information into its prediction, which should have improved its performance. It is possible that its architecture is unsuitable for agreement prediction; other architectures that incorporate a post's parent and ancestors into a stance prediction have been deployed and might be more suitable for agreement prediction. Future model designs should better incorporate a post's parent information into their predictions.
The difference in performance between the agreement prediction models and the classification models on the stance detection task was small, and the agreement prediction models were sometimes better. This demonstrates that models learning to identify stance intensity do so without significant loss of performance in identifying stance polarity.
Larger gains in performance will likely require information about the post's author. Some authors state strong levels of agreement in their text but annotate their argument with weaker agreement levels. For example, one author wrote, "Agree completely. Government should stay out of healthcare." and annotated that argument with an agreement value of +0.6. The authors were instructed on how to annotate their posts, but the annotations themselves were left to each author's discretion. Thus, including author information in our models would likely improve stance polarity and intensity prediction results.

Conclusion
We introduce a new research problem called stance polarity and intensity prediction in a responsive relationship between posts, which predicts both an online post's stance polarity and intensity value toward another post. This problem encapsulates stance detection and adds the additional difficulty of detecting subtle differences in intensity found in the text. We introduced a new large empirical dataset for agreement prediction, collected using a cyber argumentation platform. We implemented five models, adapted from top-performing stance detection models, for evaluation on the new dataset for agreement prediction. Our empirical results demonstrate that the ensemble model SVR-RF-R performed the best for agreement prediction and models trained for agreement prediction learn to differentiate between intensity values without degrading their performance for determining stance polarity. Research into this new problem of agreement prediction will allow for a more nuanced annotation and analysis of online debate.
• Maximum Sentence Length (|s|): 150. Posts longer than 150 words were truncated from the beginning; posts shorter than 150 words were padded at the end.
The model was trained using a batch size of 64 and used an Adam optimizer.