Incorporating Topic Aspects for Online Comment Convincingness Evaluation

In this paper, we propose to incorporate topic aspect information into the evaluation of online comment convincingness. Our model uses a graph convolutional network to exploit implicit topic information within a discussion thread when evaluating the convincingness of each individual comment. To test the effectiveness of the proposed model, we annotate topic information on top of a public dataset for argument convincingness evaluation. Experimental results show that topic information improves the performance of convincingness evaluation. We also take a first step toward detecting topic aspects automatically.


Introduction
With the popularity of online forums such as idebate (https://idebate.org/) and convinceme (http://convinceme.net/), researchers have been paying increasing attention to analyzing persuasion content (Wei et al., 2016a,b). Argument convincingness assessment plays an important role in persuasion content analysis. Previous researchers attribute the convincingness of arguments to argument structure (Potash et al., 2017; Peldszus and Stede, 2015), strong evidence (Hasan and Ng, 2014; Park and Cardie, 2014), specific argument components (Habernal and Gurevych, 2016a; Persing and Ng, 2015), interactions (Ji et al., 2018), domain knowledge (Wei et al., 2017) and so on. Most work on convincingness evaluation focuses on explicit linguistic features, such as words (Chalaguine and Schulz, 2017) and part-of-speech (POS) tags (Wachsmuth et al., 2017a). Since people construct their arguments around different topic aspects, we argue that topic information can be crucial for convincingness evaluation.

Table 1: An example argument pair.
Argument 1
Content: The American Water companies are Aquafina (Pepsi), Dasani (Coke), Perrier (Nestle) which provide jobs for the american citizens.
Topic aspect: Economy
Argument 2
Content: If bottled water did not exist, more people would be drinking sweetened liquids because it would be the only portable drinks! People would become fat!
Topic aspect: Convenience and health

To illustrate this idea, Table 1 gives a brief example of an argument pair. Both arguments oppose the banning of plastic water bottles. Argument 1 argues from the topic aspect of economy, while Argument 2 makes its point from the aspect of convenience and health. As we can see, for a specific discussion subject, different aspects may carry different degrees of convincingness. Wang et al. (2017) have already attempted to exploit the latent persuasive strengths of topic aspects for quality evaluation on a formal debate dataset. However, there is still no such research on online debating texts, which are unstructured and involve multiple participants.
In this paper, we propose to incorporate latent topic aspect information to evaluate the convincingness of comments in online forums. We use graph convolutional networks (GCN) to exploit the latent topic information of comments on a specific subject. We assume that arguments sharing the same topic aspect are more likely to have a similar degree of convincingness, and that a GCN is able to exploit the topic similarity among arguments. A bi-directional long short-term memory network (Bi-LSTM) is used to encode each argument. To evaluate our proposed model, we annotate topic aspect information on top of a public dataset collected from online forums (Habernal and Gurevych, 2016b).
The main contributions of this article are threefold: (1) we annotate topic aspects for each argument in an existing dataset over 16 discussion threads (2 stances for each subject); (2) we propose a BiLSTM-GCN model and demonstrate the effectiveness of topic aspects in convincingness evaluation; (3) we implement several baseline models to detect the topic aspect automatically.

Data Description
Our experiments are conducted on the UKPConvArg1 corpus released by Habernal and Gurevych (2016b). The source arguments come from 16 debates on the Web debate platforms createdebate.com and convinceme.net. Each debate is about a specific topic and has two stances. We call each (debate, stance) tuple a "split", so there are 32 splits in total. The dataset contains a set of argument pairs for each of the 32 splits, 11,650 pairs in total, and each argument pair is annotated with a binary relation (0 or 1) that represents the pairwise convincingness relationship (more/less convincing). Since we use the UKPConvArg1Strict version as our dataset, the equal instances are removed. The topic aspect annotations are not part of the original dataset; they are our own. To extract the topic aspects of each topic, two annotators manually annotate every argument. The annotation process is as follows. First, the two annotators discuss and determine possible topic aspect candidates for each split. Second, the two annotators independently check every argument and assign one main topic aspect to it. For arguments carrying multiple aspects, we pick the primary topic aspect. Third, after all the annotations are made, the annotators rank the topic aspects within a split by topic aspect strength through discussion. Due to the quality of the corpus, some arguments have to be assigned to the aspect None, namely when an argument (1) has nothing to do with the topic, (2) has no point of view, or (3) is contradictory or ambiguous. The results of the topic aspect annotations are shown in the Appendix.
To clarify the annotation of topic aspects, we take the annotation process for comments under the topic of "banning plastic water bottles" as an example. Comments from the con side of this debate may hold different topic aspects. Some of them are concerned about economic decline after banning plastic water bottles. Some suggest that we can recycle plastic water bottles instead of banning them. Others care more about inconvenience and safety after banning plastic water bottles. After reading all these arguments, the annotators conclude that there are three main topic aspects: "economics", "bottle recycling" and "convenience and health". Some arguments have no point of view or seem ambiguous, and the topic aspect of these arguments is set to "None". Since the arguments are relatively short on average, most of them hold a single topic aspect. Therefore, to simplify the problem, each argument has only one label. The dataset is available online (footnote 3).

Proposed Model
In this paper, we propose a BiLSTM-GCN model for the convincingness evaluation task. The BiLSTM serves as the foundation that generates the representation of each argument, and the GCN is built on top of the BiLSTM to exploit the interrelationships of similar arguments. The architecture of our model is shown in Figure 1. Our aim is to predict a binary relation (0 or 1), representing more/less convincing, for a given argument pair. All the arguments in the same split are treated as a batch.

Content Layer
The input of the content layer is the word embedding matrix of each argument, and its output is the representation vector of each argument. A BiLSTM plays the role of encoder in this layer; it has been shown to be effective for encoding sentences (Goodfellow et al., 2016; Dyer et al., 2015; Wang and Jiang, 2016).
We use the pre-trained GloVe word embeddings (Pennington et al., 2014), specifically the 840B-token, 300-dimensional version. For each word that is absent from the GloVe vocabulary, we randomly generate a 300-dimensional vector. These vectors are then fed into the BiLSTM to obtain the basic representation of each argument. The dimension of the argument representation is set to 64 after tuning.
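The out-of-vocabulary handling described above can be sketched as follows. This is a minimal illustration in plain Python rather than the authors' code: the function name `embed_tokens`, the toy vocabulary, the uniform sampling range, and the choice to reuse one random vector per unseen word are all assumptions, since the text only states that a random 300-dimensional vector is generated.

```python
import random

EMBED_DIM = 300  # embedding size stated in the text


def embed_tokens(tokens, vocab, dim=EMBED_DIM, seed=0):
    """Map tokens to vectors; each unseen token gets one random vector.

    `vocab` is a hypothetical dict from word to a list of floats,
    standing in for the pre-trained embedding table.
    """
    rng = random.Random(seed)
    oov_vectors = {}  # cache so a repeated unseen word maps to the same vector
    vectors = []
    for token in tokens:
        if token in vocab:
            vectors.append(vocab[token])
        else:
            if token not in oov_vectors:
                # Assumption: small uniform noise; the range is not given in the text.
                oov_vectors[token] = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
            vectors.append(oov_vectors[token])
    return vectors


toy_vocab = {"water": [1.0] * EMBED_DIM}
embedded = embed_tokens(["water", "aquafina", "aquafina"], toy_vocab)
```

The resulting list of vectors is what would be passed to the BiLSTM encoder.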

Topic Layer
The input of the topic layer is the representation vector of each argument, and its output is the updated representation vector of each argument. The core of our topic layer is a GCN (Kipf and Welling), whose main function is to exploit the topic aspect information.
We consider a single-layer GCN for better argument representation. The GCN layer propagates the information of a node onto its nearest neighbours.
Our GCN model takes the following form:

$$Z = \hat{A} X W$$

Here $A$ is the adjacency matrix and $X$ is the feature matrix, obtained by stacking the raw argument representations generated by the BiLSTM. $W$ is a parameter matrix learned during training. We add self-connected edges to $A$ to keep the information of the original argument itself. $\hat{A}$ is the normalized form of $A$. Normalization is necessary when combining information, since the ratio of self information to neighbours' information has to be the same for each argument. $Z$ is the final feature matrix, and each of its rows is the new representation of one argument.
In the normalization, $I_N$ is the identity matrix, which represents the self-connections, and $D$ is the diagonal degree matrix with $D_{ii} = \sum_j A_{ij}$, so that each diagonal element is the degree of argument $i$. The self-connections are not normalized, since we consider the original argument more useful than the other arguments.
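To make the propagation step concrete, here is a small sketch of a single GCN layer in plain Python. It rests on two assumptions that the text does not confirm: the neighbour part of the adjacency matrix is normalized symmetrically as $D^{-1/2} A D^{-1/2}$ (the form used by Kipf and Welling), and no activation function is applied. The unnormalized identity added for self-connections follows the description above. All function names are illustrative.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalize_adjacency(adj):
    """Return D^{-1/2} A D^{-1/2} + I for an adjacency matrix without self-loops.

    Assumption: symmetric normalization of the neighbour part; the identity
    is added afterwards so self-connections stay unnormalized.
    """
    n = len(adj)
    degree = [sum(row) for row in adj]
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in degree]
    return [
        [inv_sqrt[i] * adj[i][j] * inv_sqrt[j] + (1.0 if i == j else 0.0)
         for j in range(n)]
        for i in range(n)
    ]


def gcn_layer(adj, features, weight):
    """Compute Z = A_hat X W for one layer, without a nonlinearity."""
    return matmul(matmul(normalize_adjacency(adj), features), weight)


# Arguments 0 and 1 are neighbours; argument 2 is isolated.
adj = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
features = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
identity_weight = [[1.0, 0.0], [0.0, 1.0]]
z = gcn_layer(adj, features, identity_weight)
```

In this toy case the two connected arguments end up with the same representation, the sum of their own and their neighbour's features, while the isolated argument keeps its original vector.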

Adjacency Matrix
The adjacency matrix records whether there is an edge between two nodes, but the graph structure among the arguments is implicit. We can capture this implicit structure by using either argument similarity or topic aspects.
Argument similarity: We compute the similarity of two arguments from their distance in the embedding space and use a threshold to decide whether there is an edge.
Topic aspects: When two arguments share the same topic aspect, there is an edge between the corresponding nodes.
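The topic-aspect variant can be sketched in a few lines of Python. The function name `topic_adjacency` is illustrative. The diagonal is left at zero on the assumption that self-connections are added later during normalization, and the sketch makes no special case for the "None" label, because the text does not say how such arguments are linked when building the graph.

```python
def topic_adjacency(aspects):
    """Build a 0/1 adjacency matrix linking arguments with the same aspect.

    `aspects` holds one topic aspect label per argument in a split.
    The diagonal stays 0; self-connections are assumed to be added elsewhere.
    """
    n = len(aspects)
    return [
        [1 if i != j and aspects[i] == aspects[j] else 0 for j in range(n)]
        for i in range(n)
    ]


labels = ["economics", "bottle recycling", "economics", "convenience and health"]
adjacency = topic_adjacency(labels)
```

Here only the first and third arguments, which share the "economics" aspect, are connected.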

Feature Matrix
The feature matrix is built on top of the BiLSTM by stacking the argument representations. Each argument representation is the mean pooling of the BiLSTM outputs. The feature dimension is set to 64, and the size of each matrix is fixed across the 32 splits by using the maximum number of arguments as the row dimension. Rows without a corresponding argument are filled with zero vectors. The output of the graph convolutional network is still a feature matrix, in which each row has absorbed information from its neighbouring nodes.
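The construction of the feature matrix can be illustrated with the following sketch, which uses plain Python lists in place of tensors. The function names and the toy two-dimensional vectors are assumptions made for illustration; in the model described above the vectors would have 64 dimensions and come from the BiLSTM.

```python
def mean_pool(hidden_states):
    """Average per-token vectors into a single argument vector."""
    dim = len(hidden_states[0])
    count = len(hidden_states)
    return [sum(state[k] for state in hidden_states) / count for k in range(dim)]


def build_feature_matrix(argument_vectors, max_arguments, dim):
    """Stack argument vectors and pad with zero rows up to max_arguments."""
    rows = [list(vector) for vector in argument_vectors]
    rows += [[0.0] * dim for _ in range(max_arguments - len(rows))]
    return rows


# One argument with two token states, padded to a split size of three.
pooled = mean_pool([[1.0, 2.0], [3.0, 4.0]])
feature_matrix = build_feature_matrix([pooled], max_arguments=3, dim=2)
```

Padding every split to the same number of rows keeps the matrix shape fixed across batches, as the text describes.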

Convincingness Layer
The input of the convincingness layer is the updated representation vector of each argument. Its output is a binary label (0 or 1) representing more/less convincing for a given argument pair. The core of this layer is therefore a classifier.
We could simply apply a softmax classifier. However, inspired by DistMult (Yang et al., 2014), we design a classifier that performs better. Our classifier takes the following form:

$$f(e_s, e_o) = e_s^{\top} W e_o$$

Here $e_s$ is the representation of argument 1, $e_o$ is the representation of argument 2, and $W$ is a parameter matrix. We constrain the parameter matrix to be a real antisymmetric matrix, so the result of comparing argument 1 with argument 2 is the opposite of the result of comparing argument 2 with argument 1.
Our parameter matrix has an advantage over an ordinary linear layer mainly because it can capture interactions between different feature dimensions.
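The effect of the antisymmetry constraint can be checked with a short sketch. It assumes that the classifier is the bilinear score of the first argument's vector, the parameter matrix and the second argument's vector, and that the antisymmetric matrix is obtained as $M - M^{\top}$ from an unconstrained matrix $M$. This parameterization is one common way to enforce the constraint and is not stated in the text; the function names are illustrative.

```python
def antisymmetric(matrix):
    """Return M - M^T, which satisfies W[i][j] == -W[j][i]."""
    n = len(matrix)
    return [[matrix[i][j] - matrix[j][i] for j in range(n)] for i in range(n)]


def bilinear_score(e_s, weight, e_o):
    """Compute e_s^T W e_o for two argument vectors."""
    return sum(
        e_s[i] * weight[i][j] * e_o[j]
        for i in range(len(e_s))
        for j in range(len(e_o))
    )


raw = [[0.5, 1.0], [-2.0, 0.3]]  # unconstrained parameters (toy values)
w = antisymmetric(raw)
arg1, arg2 = [1.0, 2.0], [3.0, 1.0]
forward = bilinear_score(arg1, w, arg2)
backward = bilinear_score(arg2, w, arg1)
```

Swapping the two arguments flips the sign of the score, and comparing an argument with itself gives zero, which matches the behaviour described above. The off-diagonal entries of the matrix are what let the score combine different feature dimensions.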

Experiment Setup
We test our model on the dataset described in Section 2 to evaluate the convincingness of arguments. To allow comparison with the algorithms applied in the original task (Habernal and Gurevych, 2016b), we also use 32-fold cross-split cross-validation, in which 31 splits are used as training data and the remaining one as test data. The preprocessing is the same as in the original task. We train our BiLSTM-GCN model as described in Section 3 and evaluate prediction accuracy on the test split.
We implement our BiLSTM-GCN model in PyTorch. Each split is treated as a batch for both training and testing. We tried both the cross-entropy loss and the quadratic loss; the latter performs better with our classifier, so we use the quadratic loss. The batch loss is calculated by summing the loss of each argument pair. The weights of the parameter matrix classifier are initialized randomly from a normal distribution, and the initial hidden state of our BiLSTM is set to zero. In the convincingness evaluation task, we use the topic aspects to build the adjacency matrix.
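The cross-split protocol amounts to leave-one-split-out evaluation, which can be sketched as a small generator. The split identifiers below are placeholders rather than the dataset's real names, and the function name is illustrative.

```python
def cross_split_folds(split_ids):
    """Yield (training splits, held-out test split) for every split."""
    for held_out in split_ids:
        training = [s for s in split_ids if s != held_out]
        yield training, held_out


# Placeholder identifiers for the 32 (debate, stance) splits.
splits = [f"split_{i:02d}" for i in range(32)]
folds = list(cross_split_folds(splits))
```

Each of the 32 folds trains on 31 splits and tests on the remaining one, so every split serves as test data exactly once.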

Baselines
The baselines for convincingness evaluation include:
(1) SVM (Habernal and Gurevych, 2016b): an SVM with an RBF kernel based on a number of linguistic features.
(2) BiLSTM (Habernal and Gurevych, 2016b): the input word embeddings are as described in Section 3.1, and the hidden dimension is 64.
(3) BiLSTM+argument length: since the BiLSTM transforms all arguments into the same dimension, information about argument length is lost to some extent. We include argument length because it handles this task quite well on its own: judging convincingness by length alone reaches an accuracy of 0.73, which is not mentioned in the original work. We therefore take it as a useful feature to help our model.
Here, we do not list a baseline that uses our topic aspect annotations as a feature, because the topic aspects differ across debates and therefore cannot serve as an explicit feature. The topic aspect information has to be encoded, which is also a reason for applying a graph convolutional network in our work.
We also try different methods of building the adjacency matrix described in Section 3.2.1, including the unit matrix, Jaccard similarity, cosine similarity, word mover's distance and topic aspects.
(1) Unit matrix: the adjacency matrix is just a unit matrix, so the GCN acts as a single linear layer.
(2) Jaccard similarity: the Jaccard similarity is calculated as $J(A, B) = |A \cap B| / |A \cup B|$, where $A$ and $B$ denote argument 1 and argument 2 in an argument pair. In our experiments we find that using a threshold to build the adjacency matrix works better than using a weighted adjacency matrix. We therefore use a threshold to build a zero-one adjacency matrix in all of the following methods. The threshold for Jaccard similarity is set to 0.5 after normalizing.
(3) Cosine similarity: the cosine similarity is expressed with a dot product and magnitudes as $\cos(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}$. The threshold for cosine similarity is set to 0.95 after normalizing.
(4) Word mover's distance (Kusner et al., 2015): we reimplement the word mover's distance computation following the original paper. The threshold is set to 0.35 after tuning.
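The similarity-based variants share one pattern: compute a pairwise score, then threshold it into a zero-one adjacency matrix. The sketch below shows this for Jaccard and cosine similarity using only the standard library. Several details are assumptions, because the text does not specify them: Jaccard is computed over sets of tokens, an edge is added when the score is greater than or equal to the threshold, and the normalizing step mentioned for the thresholds is omitted. For a distance such as word mover's distance the comparison would presumably be reversed.

```python
import math


def jaccard(tokens_a, tokens_b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| over token sets."""
    a, b = set(tokens_a), set(tokens_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def threshold_adjacency(items, similarity, threshold):
    """Link items i and j (i != j) when similarity(i, j) >= threshold."""
    n = len(items)
    return [
        [1 if i != j and similarity(items[i], items[j]) >= threshold else 0
         for j in range(n)]
        for i in range(n)
    ]


token_lists = [["ban", "plastic", "bottles"], ["plastic", "bottles", "recycle"], ["jobs"]]
adjacency = threshold_adjacency(token_lists, jaccard, threshold=0.5)
```

With the toy token lists, the first two arguments share two of four distinct tokens, which meets the 0.5 threshold, so they are linked; the third argument stays isolated.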

Experimental Results
In this part, we compare our BiLSTM-GCN model with the baselines mentioned above. Table 2 lists the results of the convincingness evaluation task. The adjacency matrix of our BiLSTM-GCN model in Table 2 is based on our topic aspect annotations. In Table 3, we test the other methods of building the adjacency matrix described above and analyze the results. The results show that our BiLSTM-GCN model performs better than the best baseline model, and clearly better than the models used in the original task. Moreover, the results indicate that exploiting the interrelationships of arguments through a GCN helps to evaluate convincingness.
The results in Table 3 show that our annotations perform best among all the methods, which suggests that topic aspects are an effective way to capture the relationship between arguments. We also observe that a more recent text similarity metric, word mover's distance, performs better than classical text similarity metrics such as Jaccard similarity and cosine similarity.

Further analysis of topic detection
Topic aspects are not labeled in most data. Since we already have topic aspect annotations, we can set up a classification task whose output can later be used for automatic annotation. The training data and test data are the same as in convincingness evaluation, but the labels change: if the two arguments in an argument pair share the same topic aspect, the label is set to 1; otherwise it is set to 0. We also treat the None type as different from all other types, including None itself. We test several baseline models and our BiLSTM-GCN model on this task and evaluate the F-score on the test split. We do not use accuracy because the labels are imbalanced: only about thirty percent of the argument pairs have positive labels. The baselines for topic aspect detection include:
(1) Random classification: select zero or one at random.
(2) LDA clustering (Blei et al., 2003): automatically cluster the arguments of each split with LDA into as many clusters as the number of topic aspects we annotated.
(3) SVM: an SVM with an RBF kernel based on a number of linguistic features.
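The pair labelling rule for this task, including the special treatment of None, can be written as a small function. The function name and the string used to represent the None aspect are illustrative choices.

```python
def same_aspect_label(aspect_a, aspect_b, none_label="None"):
    """Return 1 if both arguments share a real topic aspect, else 0.

    The None aspect never matches anything, not even another None.
    """
    if aspect_a == none_label or aspect_b == none_label:
        return 0
    return 1 if aspect_a == aspect_b else 0


pairs = [
    ("economics", "economics"),
    ("economics", "bottle recycling"),
    ("None", "None"),
]
labels = [same_aspect_label(a, b) for a, b in pairs]
```

Only the first pair receives a positive label: the second pair has different aspects, and the third pair consists of two None arguments, which are treated as different from each other.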
For our BiLSTM-GCN model, we build the adjacency matrix with word mover's distance instead of our annotations, since it is a relatively recent metric for text similarity. We reimplement the word mover's distance computation following the original paper (Kusner et al., 2015). The threshold is set to 0.35 after tuning. Table 4 lists the results of the topic aspect detection task.

Table 4: Results of topic aspect detection task

Model                   F-score
Random classification   0.425
LDA clustering          0.524
SVM                     0.589
BiLSTM-GCN              0.612

The results show that our BiLSTM-GCN model performs best among all the methods, and that supervised methods such as SVM perform better than unsupervised methods such as LDA clustering. All these models perform clearly better than the lower bound given by random classification.

Conclusions and Future Work
In this paper, we apply a BiLSTM-GCN model to a convincingness evaluation task, and the model performs 3-5% better than the original method on the online debate dataset. Our model uses not only explicit argument features such as length but also topic aspects, which are latent features. Our experiments show that topic information improves the performance of convincingness evaluation. In the future, we will consider using external knowledge to further improve the performance of convincingness evaluation. Possible sources of external knowledge include similar arguments from other websites or an argument search engine (Wachsmuth et al., 2017b).