Integrating Semantic and Structural Information with Graph Convolutional Network for Controversy Detection

Identifying controversial posts on social media is a fundamental task for mining public sentiment, assessing the influence of events, and alleviating polarized views. However, existing methods fail to 1) effectively incorporate the semantic information from content-related posts; 2) preserve the structural information for reply relationship modeling; 3) properly handle posts from topics dissimilar to those in the training set. To overcome the first two limitations, we propose the Topic-Post-Comment Graph Convolutional Network (TPC-GCN), which integrates the information from the graph structure and content of topics, posts, and comments for post-level controversy detection. To address the third limitation, we extend our model to Disentangled TPC-GCN (DTPC-GCN), which disentangles topic-related and topic-unrelated features and then fuses them dynamically. Extensive experiments on two real-world datasets demonstrate that our models outperform existing methods. Analysis of the results and cases shows that our models can integrate both semantic and structural information with significant generalizability.


Introduction
Social media such as Reddit and Chinese Weibo has been the major channel through which people can easily propagate their views. In this open and free environment, the views expressed in posts often spark fierce discussion and raise controversy among the engaged users. These controversial posts provide a lens on public sentiment, which brings about several tasks such as news topic selection, influence assessment (Hessel and Lee, 2019), and alleviation of polarized views (Garimella et al., 2017). As a basis of all the mentioned tasks, automatically identifying controversial posts has attracted wide attention (Addawood et al., 2017; Coletto et al., 2017; Rethmeier et al., 2018; Hessel and Lee, 2019).

Comments Attached to P
(Refute) $C_3$: The point is that their lights, skins, functions, and even names are similar. No reason to say that Xiaomi don't copy.
↳ (Refute) $C_{3-1}$: No, the point is that the manuscript is original.

Related Posts
(Refute) $RP_1$: I'm against Xiaomi this time. The component library is too similar. Whether the faces are hand-made or not is not important. Can't this fact be the evidence?
(Refute) $RP_2$: I think Mimoji is similar to Memoji. Even if the process of faces is different, their ideas are too close.

Figure 1: A controversial post P about whether Xiaomi's Mimoji copies Apple's Memoji. These Supports and Refutations are to either their respective parent comments or P.

This work focuses on post-level controversy detection on social media, i.e., classifying whether a post is controversial or non-controversial. According to Coletto et al. (2017), a controversial post has debatable content and expresses an idea or an opinion which generates an argument in the responses, representing opposing opinions in favor or in disagreement with the post. In practice, the responses to a target post (the post to be judged) generally come from two sources, i.e., the comments attached to the post and other content-related posts. Figure 1 shows an example where the target post P expresses that Xiaomi's Mimoji does not copy Apple's Memoji. We can see that: 1) The comments show more supports and fewer refutations to P, which raises a small controversy. However, the related posts show extra refutations and enhance the controversy of P. 2) $C_{3-1}$ expresses refutation literally, but it actually supports P because, in the comment tree, it refutes $C_3$, a comment refuting P. 3) There exist two kinds of semantic clues for detection, topic-related and topic-unrelated clues. For example, support and against are unrelated to this topic, while copy and similar are topic-related. Topic-related clues can help identify posts in a similar topic, but how effective they are for posts in dissimilar topics depends on the specific situation. Therefore, to comprehensively evaluate the controversy of a post, the information from both the comments and related posts should be integrated properly at the semantic and structural levels.
Existing methods for detecting controversy on social media have exploited the semantic features of the target post and its comments as well as structural features. However, three drawbacks limit their performance: 1) These methods ignore the role of related posts in the same topic in providing extra supports or refutations of the target post. Exploiting only the information from comments is insufficient. 2) These methods use statistical structure-based features, which cannot model the reply-structure relationships (like $P$-$C_1$ and $C_3$-$C_{3-1}$ in Figure 1). The stances of some comments may be misunderstood by the model (like $C_{3-1}$). 3) These methods tend to capture topic-related features that are not shared among different topics, because they directly use the content information (Wang et al., 2018). The topic-related features can be helpful when the test post is from a topic similar to those in the training set but would hurt detection otherwise.
Recently, graph convolutional networks have achieved great success in many areas (Marcheggiani et al., 2018; Ying et al., 2018; Yao et al., 2019; Li and Goldwasser, 2019) due to their ability to encode both the local graph structure and the features of nodes (Kipf and Welling, 2017). To overcome the first two drawbacks of existing works, we propose a Topic-Post-Comment Graph Convolutional Network (TPC-GCN) (see Figure 2a) that integrates the information from the graph structure and content of topics, posts, and comments for post-level controversy detection. First, we create a TPC graph to describe the relationship among topics, posts, and comments. To preserve the reply-structure information, we connect each comment node with the post/comment node it replies to. To include the information from related posts, we connect each post node with its topic node. Then, a GCN model is applied to learn node representations that fuse content and reply-structure information. Finally, the updated vectors of a post and its comments are fused to predict the controversy.
TPC-GCN is mainly intended for detection in intra-topic mode, i.e., the topics of the testing posts appear in the training set, because it cannot overcome the third drawback. We thus extend TPC-GCN to a two-branch version named Disentangled TPC-GCN (DTPC-GCN) (see Figure 2b) for inter-topic mode (no testing posts are from the topics in the training set). We use a TPC-GCN in each branch, but add an auxiliary task, topic classification. The goals of the two branches for the auxiliary task are opposite, so as to disentangle the topic-related and topic-unrelated features. The disentangled features can be dynamically fused according to the content of the test samples with an attention mechanism for the final decision. Extensive experiments demonstrate that our models outperform existing methods and can exploit features dynamically and effectively. The main contributions of this paper are as follows:
1. We propose two novel GCN-based models, TPC-GCN and DTPC-GCN, for post-level controversy detection. The models can integrate the information from the structure and content of topics, posts, and comments, especially the information from the related posts and the reply tree. In particular, DTPC-GCN can further disentangle the topic-related and topic-unrelated features for inter-topic detection.
2. We build a Chinese dataset for controversy detection, consisting of 5,676 posts collected from Chinese Weibo, each of which is manually labeled as controversial or non-controversial. To the best of our knowledge, this is the first released Chinese dataset for controversy detection.
3. Experiments on two real-world datasets demonstrate that the proposed models can effectively identify controversial posts and outperform existing methods in terms of performance and generalization.

Related Work
Controversy detection on the Internet has been studied on both web pages and social media. Existing works detecting controversy on web pages mostly aim at identifying controversial articles in [...]

In Figure 2, $H_B^{(l)}$ is the representation matrix containing all node vectors in the $l$-th layer of Branch $B$. $X$ is the initial representation. $L_c$ and $L_t$ refer to the controversy classification loss and the topic classification loss, respectively. FC means fully connected layer.
Unlike web pages, social media contains more diverse topics and fiercer discussion among users, which makes controversy detection on social media more challenging. Early studies assume that a topic has its intrinsic controversy and focus on topic-level controversy detection. Popescu and Pennacchiotti (2010) detect controversial snapshots (consisting of many tweets referring to a topic) based on Twitter-based and external-knowledge features. Garimella et al. (2018) build graphs based on a Twitter topic, such as the retweeting graph and the following graph, and then apply graph partitioning to measure the extent of controversy. However, topic-level detection is coarse, because there exist non-controversial posts in a controversial topic and vice versa. Recent works focus on post-level controversy detection by leveraging language features, such as emotional and topic-related phrases (Rethmeier et al., 2018) and emphatic and Twitter-specific features (Addawood et al., 2017). Other graph-based methods exploit the features from the following graph and the comment tree (Coletto et al., 2017; Hessel and Lee, 2019). The limitations of current post-level works are that they do not effectively integrate the information from content and reply structure, and that they ignore the role of posts in the same topic. Moreover, the difference between intra-topic and inter-topic mode is not recognized. Only Hessel and Lee (2019) deal with topic transfer, but they train on each topic and test on the others to explore transferability, which is not suitable in practice.

Methodology
In this section, we introduce the Topic-Post-Comment Graph Convolutional Network (TPC-GCN) and its extension Disentangled TPC-GCN (DTPC-GCN), as shown in Figure 2. We first introduce the TPC graph construction and then detail the two models.

TPC Graph Construction
To model the paths of message passing among topics, posts, and comments, we first construct a topic-post-comment graph G = (V, E) for the target posts, where V and E denote the sets of nodes and edges, respectively. First, to preserve the post-comment and inter-comment relationships, we incorporate the comment tree, in which each comment node is connected with the post/comment node it replies to. Then, to help posts capture information from related posts in the same topic, which was shown to be helpful in Section 1, we connect each post with its topic. The topic node can be regarded as a hub node that integrates and interchanges the information. An alternative is to connect the post nodes in a topic pairwise, but the complexity would be high. Note that the topic here is not necessarily provided by the platform, as the subreddit on Reddit and the hashtag (#) on Weibo are. When topics are not provided, text-based clustering algorithms can be used to construct a topic from related posts (Nematzadeh et al., 2019).
In G, each node may represent a topic, a post, or a comment, and each edge may represent a topic-post, post-comment, or comment-comment connection. We initially represent each node v with an embedding vector x of its text, obtained from a pre-trained language model.
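The construction above can be sketched in a few lines. This is a minimal illustration only: the function name `build_tpc_graph`, the input format, and the toy node ids (loosely following Figure 1) are assumptions made here, not part of the described method.

```python
from collections import defaultdict

def build_tpc_graph(posts, comments):
    """Build an undirected topic-post-comment graph as an adjacency list.

    posts:    list of (post_id, topic_id) pairs
    comments: list of (comment_id, parent_id) pairs; the parent is the post
              or comment that the comment replies to
    Node ids are assumed to be unique across topics, posts, and comments.
    """
    adj = defaultdict(set)
    # Topic-post edges: the topic node acts as a hub for its posts.
    for post_id, topic_id in posts:
        adj[post_id].add(topic_id)
        adj[topic_id].add(post_id)
    # Post-comment and comment-comment edges follow the reply tree.
    for comment_id, parent_id in comments:
        adj[comment_id].add(parent_id)
        adj[parent_id].add(comment_id)
    return adj

# Toy example: one topic T, target post P, two related posts, four comments.
posts = [("P", "T"), ("RP1", "T"), ("RP2", "T")]
comments = [("C1", "P"), ("C2", "P"), ("C3", "P"), ("C3-1", "C3")]
graph = build_tpc_graph(posts, comments)
```

In this toy graph, P reaches RP1 and RP2 only through the hub T, which needs 3 edges instead of the 3 pairwise edges among posts here but grows linearly rather than quadratically with the number of posts in a topic.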

TPC-GCN
In this subsection, we detail the TPC-GCN, by first introducing the generic GCN and then our TPC-GCN model.
The GCN has proved to be an efficient neural network that operates on a graph to encode both the local graph structure and the features of nodes (Kipf and Welling, 2017). This characteristic is consistent with our goal of integrating semantic and structural information. In a GCN, each node is updated according to the aggregated information of its neighbor nodes and itself, so the learned representation can include information from both content and structure. For a node $v_i \in V$, the update rule in the message passing process is as follows:

$$h_i^{(l+1)} = \sigma\Big(\sum_{j \in N_i} g\big(h_i^{(l)}, h_j^{(l)}\big) + b^{(l)}\Big) \tag{1}$$

where $h_i^{(l)}$ is the hidden state of node $v_i$ in the $l$-th layer of a GCN and $N_i$ is the neighbor set of node $v_i$ with itself included. Incoming messages from $N_i$ are transformed by the function $g$ and then pass through the activation function $\sigma$ (such as ReLU) to output the new representation of each node. $b^{(l)}$ is the bias term. Following Kipf and Welling (2017), we use a linear transform function $g(h_i^{(l)}, h_j^{(l)}) = W^{(l)} h_j^{(l)}$, where $W^{(l)}$ is a learnable weight matrix. Based on the node-wise Equation 1, the layer-wise propagation rule can be written in the following form:

$$H^{(l+1)} = \sigma\big(\hat{A} H^{(l)} W^{(l)} + B^{(l)}\big) \tag{2}$$

where $H^{(l)}$ contains all node vectors in the $l$-th layer and $\hat{A}$ is the normalized adjacency matrix with inserted self-loops. $W^{(l)}$ is the weight matrix and $B^{(l)}$ is the broadcast bias term.
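As a rough illustration of the layer-wise rule, the following pure-Python sketch applies one GCN step to a three-node path graph. The symmetric normalization (degree-scaled adjacency with self-loops, as in Kipf and Welling, 2017) is an assumption here, since the text only states that the adjacency matrix is normalized; the helper names and the toy matrices are also hypothetical.

```python
import math

def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a square 0/1 adjacency matrix.

    The symmetric normalization is an assumption of this sketch.
    """
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def gcn_layer(a_norm, h, w, b):
    """One propagation step: ReLU(A_norm @ H @ W + b), bias broadcast over nodes."""
    z = matmul(matmul(a_norm, h), w)
    return [[max(0.0, value + bias) for value, bias in zip(row, b)] for row in z]

# Toy path graph 0 - 1 - 2 with 2-dimensional node features.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
h0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w0 = [[1.0, 0.0], [0.0, 1.0]]  # identity weights keep the example readable
b0 = [0.0, 0.0]
a_norm = normalize_adjacency(adj)
h1 = gcn_layer(a_norm, h0, w0, b0)
```

After one step, node 0 already mixes its own vector with that of node 1; a second layer would bring in information from node 2, which is why two layers let a post see the replies to its comments.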
In TPC-GCN (see Figure 2a), we input the matrix consisting of $N$ $d$-dimensional embedding vectors, $H^{(0)} = X \in \mathbb{R}^{N \times d}$, to a two-layer GCN to obtain the representation after message passing, $H^{(2)}$. Next, the vectors of each post node $i$ and its attached comment nodes are averaged to form the fusion vector $f_i$ of the post. Finally, we apply a softmax function to the fusion vectors to obtain the controversy probability of each post. The cross entropy is the loss function:

$$L_c = -\frac{1}{N} \sum_{i=1}^{N} \big[ y_i^c \log p_i^c + (1 - y_i^c) \log(1 - p_i^c) \big] \tag{3}$$

where $y_i^c$ is a label with 1 representing controversial and 0 representing non-controversial, $p_i^c$ is the predicted probability that the $i$-th post is controversial, and $N$ is the size of the training set. The limitation of TPC-GCN is that the representation tends to be topic-related, as discussed in Section 1. The limited generalizability of TPC-GCN makes it more suitable for intra-topic detection than for inter-topic detection.
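The fusion and loss steps can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names are hypothetical, the output vectors are toy values, and treating index 1 of the softmax output as the "controversial" class is a convention chosen here, not something the text specifies.

```python
import math

def fuse(h, post_idx, comment_idxs):
    """Average the vectors of a post node and its attached comment nodes."""
    rows = [h[post_idx]] + [h[j] for j in comment_idxs]
    return [sum(col) / len(rows) for col in zip(*rows)]

def softmax(v):
    """Numerically stable softmax over a list of scores."""
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def controversy_loss(probs, labels):
    """Binary cross entropy averaged over posts.

    probs[i] is the predicted probability that post i is controversial.
    """
    eps = 1e-12  # guards against log(0)
    return -sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                for p, y in zip(probs, labels)) / len(labels)

# Toy output of the second GCN layer: node 0 is a post, nodes 1 and 2 its comments.
h2 = [[2.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
f = fuse(h2, 0, [1, 2])
p = softmax(f)                       # assumed order: [non-controversial, controversial]
loss = controversy_loss([p[1]], [1])
```

Here the post vector alone leans toward class 0, while the comments pull the averaged vector back to an even split, which illustrates how comment nodes can change the decision for a post.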

Disentangled TPC-GCN
Intuitively, topic-unrelated features are more effective when testing on posts from unknown topics (inter-topic detection). However, topic-related features can help when the unknown topics are similar to the topics in the training set. Therefore, both topic-related and topic-unrelated features are useful, but their weights vary from sample to sample. This indicates that the two kinds of features should be disentangled and then dynamically fused. Based on the above analysis, we propose the extension of TPC-GCN, Disentangled TPC-GCN (see Figure 2b), for inter-topic detection. DTPC-GCN consists of two parts: the two-branch multi-task architecture for disentanglement, and the attention mechanism for dynamic fusion.

Two-branch Multi-task Architecture. To obtain the topic-related and topic-unrelated features at the same time, we use two branches of TPC-GCN with a multi-task architecture, denoted as R for the topic-related branch and U for the topic-unrelated one. In both R and U, an auxiliary task, topic classification, is introduced to guide the learning of topic-oriented representations.
For each branch, we first train the first layer of the GCN with the topic classification task. The input of the topic classifier is the fusion vectors from $H^{(1)}$, which are obtained in the same way as $f_i$ in TPC-GCN. The cross entropy is used as the loss function:

$$L_t = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k} y_{ik}^t \log p_{ik}^t \tag{4}$$

where $y_{ik}^t$ is a label with 1 representing the ground-truth topic and 0 representing an incorrect topic class, $p_{ik}^t$ is the predicted probability of the $i$-th post belonging to the $k$-th topic, and $N$ is the size of the training set. The difference between R and U is that we minimize $L_t$ in Branch R to obtain topic-distinctive features, but maximize $L_t$ in Branch U to obtain topic-confusing features.
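The opposite goals of the two branches can be illustrated with a small sketch. The text does not say how the maximization in Branch U is implemented; negating the loss before handing it to a minimizer is one simple option and is an assumption of this example, as are the function names and the toy probabilities.

```python
import math

def topic_loss(probs, topics):
    """Multi-class cross entropy averaged over posts.

    probs[i][k] is the predicted probability that post i belongs to topic k;
    topics[i] is the index of the ground-truth topic of post i.
    """
    eps = 1e-12  # guards against log(0)
    return -sum(math.log(p[t] + eps) for p, t in zip(probs, topics)) / len(topics)

def branch_objective(l_t, branch):
    """Quantity a minimizer would see for the auxiliary task.

    Branch R minimizes L_t (topic-distinctive features); Branch U maximizes
    L_t, expressed here as minimizing -L_t (an assumption of this sketch).
    """
    if branch == "R":
        return l_t
    if branch == "U":
        return -l_t
    raise ValueError("branch must be 'R' or 'U'")

# Two posts, two topics: the classifier is unsure about post 0, fairly sure about post 1.
probs = [[0.5, 0.5], [0.25, 0.75]]
topics = [0, 1]
l_t = topic_loss(probs, topics)
obj_r = branch_objective(l_t, "R")
obj_u = branch_objective(l_t, "U")
```

Under this reading, a topic classifier that is close to uniform is bad for Branch R and good for Branch U, which matches the "topic-confusing" goal described above.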
Then we include the second layer of the GCN and train on two tasks, i.e., topic and controversy classification, for each branch individually. Branches U and R are expected to evaluate controversy effectively with features that differ in their relationship with the topics.

Attention Mechanism. After the individual training, Branches U and R are expected to capture the topic-unrelated and topic-related features, respectively. We further fuse the features from the two branches dynamically. Specifically, we freeze the parameters of U and R, and further train the dynamic fusion component. For the weighted combination of the fusion vectors $f_U$ and $f_R$ from the two branches, we use the attention mechanism as follows:

$$F(f) = v^{T} \tanh(W_F f + b_F) \tag{5}$$

$$\alpha_i = \frac{\exp(F(f_i))}{\sum_{j \in \{U, R\}} \exp(F(f_j))}, \quad i \in \{U, R\} \tag{6}$$

$$u = \alpha_U f_U + \alpha_R f_R \tag{7}$$

where $W_F$ is the weight matrix and $b_F$ is the bias term. $v^{T}$ is a transposed weight vector and $F(\cdot)$ outputs the score of the input vector. The scores of the features from Branches U and R are normalized via a softmax function to give the branch weights $\alpha_U$ and $\alpha_R$. The weighted sum of the two fusion vectors, $u$, is finally used for controversy classification. The loss function is the same as Equation 3.
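A minimal sketch of the dynamic fusion step is given below. The tanh nonlinearity inside the scoring function is an assumption of this example (a common choice for additive attention); the function names and the toy parameters are hypothetical, and in the model the parameters would be learned while the two branches stay frozen.

```python
import math

def score(f, w_f, b_f, v):
    """Score of one fusion vector: v^T tanh(W_F f + b_F)."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, f)) + b)
              for row, b in zip(w_f, b_f)]
    return sum(v_k * h_k for v_k, h_k in zip(v, hidden))

def attention_fuse(f_u, f_r, w_f, b_f, v):
    """Softmax-normalize the two branch scores and return the weighted sum."""
    s_u, s_r = score(f_u, w_f, b_f, v), score(f_r, w_f, b_f, v)
    m = max(s_u, s_r)
    e_u, e_r = math.exp(s_u - m), math.exp(s_r - m)
    a_u, a_r = e_u / (e_u + e_r), e_r / (e_u + e_r)
    fused = [a_u * x + a_r * y for x, y in zip(f_u, f_r)]
    return fused, (a_u, a_r)

# Toy parameters (hypothetical values chosen only for illustration).
w_f = [[0.1, 0.2], [0.3, -0.1]]
b_f = [0.0, 0.0]
v = [1.0, -1.0]
f_u, f_r = [1.0, 0.0], [0.0, 1.0]
u, (alpha_u, alpha_r) = attention_fuse(f_u, f_r, w_f, b_f, v)
```

Because the weights are computed from the fusion vectors themselves, they differ from post to post, which is what allows one post to lean on Branch U and another on Branch R.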

Experiment
In this section, we conduct experiments to compare our proposed models with baseline models. Specifically, we mainly answer the following evaluation questions:
EQ1: Are TPC-GCN and DTPC-GCN able to improve the performance of controversy detection?
EQ2: How effective are the different types of information in TPC-GCN, including the content of topics, posts, and comments as well as the topic-post-comment structure?
EQ3: Can DTPC-GCN learn disentangled features and dynamically fuse them for controversy detection?

Dataset
We perform our experiments on two real-world datasets in different languages. Table 1 shows the statistics of the two datasets. The details are as follows:

Reddit Dataset. The Reddit dataset released by Hessel and Lee (2019) and Jason Baumgartner of pushshift.io is the only accessible English dataset for controversy detection of social media posts. This dataset contains six subreddits (which can be regarded as over-arching topics): AskMen, AskWomen, Fitness, LifeProTips, personalfinance, and relationships. Each post belongs to a subreddit, and the number of attached comments is guaranteed to be over 30. The tree structure of the comments is also maintained. We use the comment data from the first hour after a post is published.
Weibo Dataset. We built a Chinese dataset for controversy detection on Weibo in this work. We first manually selected 49 widely discussed, multi-domain topics from July 2017 to August 2019 (see Appendix A). Then, we crawled the posts on those topics and preserved those with at least two comments. We rebuilt the comment tree according to the comment time and usernames because no officially provided structure is available. Finally, annotators were asked to read and then annotate each post based on both the post content and the user stances in the comments/replies. Each post was labeled by two annotators (Cohen's Kappa coefficient = 0.71). When the annotators disagreed, the authors discussed and determined the label. In total, this dataset contains 1,992 controversial posts and 3,684 non-controversial posts, which is in line with the imbalanced distribution in real-world scenarios. As far as we know, this is the first released dataset for controversy detection on Chinese social media. We use at most 15 comments for each post due to the computation limit.
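For readers unfamiliar with the agreement statistic reported above, the following sketch computes Cohen's Kappa for two annotators. The label sequences are made-up toy data, not the actual annotations of the Weibo dataset, and the function name is hypothetical.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa: (observed agreement - chance agreement) / (1 - chance agreement).

    Undefined (division by zero) when chance agreement equals 1, e.g. when
    both annotators always assign the same single label.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c]
                   for c in set(count_a) | set(count_b)) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy annotations for ten posts (1 = controversial, 0 = non-controversial).
annotator_1 = [1, 1, 0, 0, 0, 1, 0, 0, 1, 0]
annotator_2 = [1, 0, 0, 0, 0, 1, 0, 1, 1, 0]
kappa = cohens_kappa(annotator_1, annotator_2)
```

In this toy case the annotators agree on 8 of 10 posts, but chance agreement is 0.52, so Kappa is about 0.58, lower than the raw agreement of 0.8.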
In the intra-topic experiments: For the Weibo dataset, we randomly divided the posts of each topic with a ratio of 4:1:1 and merged the resulting parts across all topics. For the Reddit dataset, we apply the data partition provided by the authors, whose ratio is 3:1:1.
In the inter-topic experiments: For both the Weibo and Reddit datasets, we still divided the data with a ratio of 4:1:1, but at the topic level.
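The difference between the two partitioning schemes can be sketched as follows. Reading the three parts of the 4:1:1 ratio as training, validation, and test sets is an assumption of this example, as are the function names, the fixed seed, and the toy data.

```python
import random

def split_items(items, ratio=(4, 1, 1), seed=0):
    """Shuffle items and split them into three parts by the given ratio."""
    items = list(items)
    random.Random(seed).shuffle(items)
    total = sum(ratio)
    n_first = len(items) * ratio[0] // total
    n_second = len(items) * ratio[1] // total
    return (items[:n_first],
            items[n_first:n_first + n_second],
            items[n_first + n_second:])

def intra_topic_split(posts_by_topic, seed=0):
    """Split the posts inside every topic, then merge the parts across topics."""
    train, val, test = [], [], []
    for topic in sorted(posts_by_topic):
        a, b, c = split_items(posts_by_topic[topic], seed=seed)
        train += a
        val += b
        test += c
    return train, val, test

def inter_topic_split(posts_by_topic, seed=0):
    """Split the topics themselves, so test topics never appear in training."""
    parts = split_items(sorted(posts_by_topic), seed=seed)
    return tuple([post for topic in part for post in posts_by_topic[topic]]
                 for part in parts)

# Toy data: six topics with six posts each; a post is a (topic, index) pair.
data = {f"t{k}": [(f"t{k}", i) for i in range(6)] for k in range(6)}
intra_train, intra_val, intra_test = intra_topic_split(data)
inter_train, inter_val, inter_test = inter_topic_split(data)
```

With the intra-topic split every topic contributes to all three parts, while with the inter-topic split the topic sets of the parts are disjoint, which is the setting DTPC-GCN is designed for.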

Implementation Details
In the (D)TPC-GCN model, each node is initialized with its textual content using the pre-trained BERT (BERT-Base Chinese for Weibo and BERT-Base Uncased for Reddit), and the padding size for each text is 45. For simplicity, we only fine-tune the last layer of BERT, namely layer 11, and then apply a dense layer with a ReLU activation function to reduce the dimensionality of the representation from 768 to 300. In TPC-GCN, the sizes of the hidden states of the two GCN layers are 100 and 2, respectively, with ReLU for the first GCN layer. To avoid overfitting, a dropout layer is added between the two layers with a rate of 0.35. We apply a softmax function to the fusion vector to obtain the controversy probability. In DTPC-GCN, the sizes of the hidden states of the first and second GCN layers in each branch are 32 and 16, respectively. The dropout rate between the two GCN layers in each branch is set to 0.4. The batch size is 1 (one TPC graph) in our (D)TPC-GCN model, and 128 (posts and attached replies) in our PC-GCN model and the baselines. The optimizer is BertAdam in all BERT-based models and Adam (Kingma and Ba, 2014) in the other semantic models. The learning rate is 1e-4 and the number of epochs is 100. We report the best model according to the performance on the validation set. In the semantic models that are not based on BERT, we use two publicly available large-scale word embedding files to obtain the model input, sgns.weibo.bigram-char for Weibo and glove.42B.300d for Reddit.
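For quick reference, the reported settings can be collected in a single structure. The values below are taken from the paragraph above; the dictionary layout, the key names, and the helper `layer_dims` are hypothetical and only meant as a convenient summary.

```python
# Hyperparameters as reported in the text; layout and key names are hypothetical.
CONFIG = {
    "bert": {
        "weibo": "BERT-Base Chinese",
        "reddit": "BERT-Base Uncased",
        "padding_size": 45,
        "fine_tuned_layers": [11],   # only the last layer is fine-tuned
        "output_dim": 768,
        "reduced_dim": 300,          # after the dense layer with ReLU
    },
    "tpc_gcn": {"hidden_sizes": [100, 2], "dropout": 0.35},
    "dtpc_gcn": {"hidden_sizes": [32, 16], "dropout": 0.4},  # per branch
    "training": {"graphs_per_batch": 1, "learning_rate": 1e-4, "epochs": 100},
}

def layer_dims(model):
    """Return the (input, output) sizes of the GCN layers for a model key."""
    sizes = [CONFIG["bert"]["reduced_dim"]] + CONFIG[model]["hidden_sizes"]
    return list(zip(sizes[:-1], sizes[1:]))

tpc_dims = layer_dims("tpc_gcn")
dtpc_dims = layer_dims("dtpc_gcn")
```

The second TPC-GCN layer has size 2 because its output feeds the softmax directly, whereas each DTPC-GCN branch keeps a 16-dimensional output for the later attention-based fusion.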

Baselines
To validate the effectiveness of our methods, we implemented several representative methods including content-based, structure-based and fusion methods as baselines.

Content-based Methods
We implement mainstream text classification models including TextCNN (Kim, 2014), BiLSTM-Att (bi-directional LSTM with attention) (Graves and Schmidhuber, 2005; Bahdanau et al., 2015), BiGRU-Att (bi-directional GRU with attention) (Cho et al., 2014), and BERT (Devlin et al., 2019) (only the last layer is fine-tuned, for simplicity). For a fair comparison, we concatenate the post and its attached comments together as the input, instead of feeding the post only.

Structure-based Methods
Considering that structure-based features of the post and its comment tree are rare and unsystematic in previous works, we integrate the plausible features from Coletto et al. (2017) and Hessel and Lee (2019). As the latter paper does, we feed them into a series of classifiers and choose the best model for classification. We name the method SFC. For a post-comment graph, the feature set contains the average depth (average length of root-to-leaf paths), the maximum relative degree (the largest node degree divided by the degree of the root), C-RATE features (the logged reply time between the post and comments, or over pairs of comments), and C-TREE features (statistics of a comment tree, such as the maximum depth/total comment ratio).
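Two of the structural features listed above can be computed as in the following sketch. It follows one plausible reading of the short definitions given in the text (the cited papers may define details differently), and the function name, input format, and toy tree (loosely following Figure 1) are assumptions.

```python
def tree_features(parent):
    """Compute simple statistics of a comment tree.

    parent maps each comment id to the id of the node it replies to; the
    post is the only node that never appears as a key (the root).
    """
    children = {}
    for node, par in parent.items():
        children.setdefault(par, []).append(node)
    root = next(p for p in parent.values() if p not in parent)

    def leaf_depths(node, depth):
        """Depths of all leaves below node (root has depth 0)."""
        kids = children.get(node, [])
        if not kids:
            return [depth]
        return [d for kid in kids for d in leaf_depths(kid, depth + 1)]

    depths = leaf_depths(root, 0)
    # Undirected degree: number of replies, plus one edge to the parent.
    degree = {n: len(children.get(n, [])) + (0 if n == root else 1)
              for n in set(parent) | {root}}
    return {
        "avg_depth": sum(depths) / len(depths),
        "max_depth": max(depths),
        "max_relative_degree": max(degree.values()) / degree[root],
    }

# Toy comment tree: C1, C2, C3 reply to post P; C3-1 replies to C3.
features = tree_features({"C1": "P", "C2": "P", "C3": "P", "C3-1": "C3"})
```

Such statistics describe the shape of the discussion but not who replies to whom with which stance, which is the limitation of statistical structural features noted in the Introduction.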

Fusion Method
The compared fusion method from Hessel and Lee (2019) aims to identify controversial posts with semantic and structural information. They extract text features of topics, posts, and comments with BERT, and structural features including the C-RATE and C-TREE features mentioned above. In addition, publish time features are also exploited.

Performance Comparison
To answer EQ1, we compare the performance of the proposed (D)TPC-GCN with the mentioned baselines on the two datasets. The evaluation metrics include the macro average precision (Avg. P), macro average recall (Avg. R), macro average F1 score (Avg. F1), and accuracy (Acc.). Tables 2 and 3 show the performance of all compared methods for intra-topic detection and inter-topic detection, respectively.
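The four metrics can be computed as in the sketch below. Macro F1 is taken here as the unweighted mean of the per-class F1 scores, which is one common convention and an assumption of this example; the function name and the toy labels are also hypothetical.

```python
def macro_metrics(y_true, y_pred, classes=(0, 1)):
    """Macro-averaged precision, recall, and F1, plus accuracy."""
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n_classes = len(classes)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return {
        "avg_p": sum(precisions) / n_classes,
        "avg_r": sum(recalls) / n_classes,
        "avg_f1": sum(f1s) / n_classes,
        "acc": accuracy,
    }

# Toy predictions on an imbalanced set (1 = controversial).
y_true = [1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1]
metrics = macro_metrics(y_true, y_pred)
```

On this toy set the accuracy (about 0.67) is higher than the macro F1 (0.625), because macro averaging gives the minority class the same weight as the majority class, which matters for the imbalanced Weibo data.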
In the intra-topic experiments, we can see that 1) TPC-GCN outperforms all compared methods on the two datasets. This indicates that our model can effectively detect controversy with significant generalizability across different datasets. 2) The structure-based model, SFC, reports low scores on the two datasets, indicating that the statistical structural information is insufficient for identifying controversy in a timely manner. 3) The fusion models outperform or are comparable to the other baselines, which indicates that fusing content and structural information is necessary to improve the performance.
In the inter-topic experiments, we can see that 1) DTPC-GCN outperforms all baselines by at least 6.4% in F1 score, which validates that DTPC-GCN can detect controversy on unseen or dissimilar topics. 2) DTPC-GCN outperforms TPC-GCN by 3.74% on Weibo and 4.00% on Reddit. This indicates that feature disentanglement and dynamic fusion can significantly improve the performance of inter-topic controversy detection.

Ablation Study
To answer EQ2 and part of EQ3, we also evaluate several internal models, i.e., the simplified variations of (D)TPC-GCN by removing some components or masking some representations.By the ablation study, we aim to investigate the impact of content and structural information in TPC-GCN and topic-related and topic-unrelated information in DTPC-GCN.

Ablation Study of TPC-GCN
We delete certain types of nodes (and the edges connected to them) to investigate their overall impact, and mask the content by randomizing the initial representation to investigate the impact of the content. Specifically, we investigate the following simplified models of TPC-GCN:
PC-GCN / TP-GCN: discard the topic / comment nodes.
(RT)PC-GCN / T(RP)C-GCN / TP(RC)-GCN: randomly initialize the representation of the topic / post / comment nodes.
From Table 4, we have the following observations: 1) TPC-GCN outperforms all simplified models, indicating the necessity of the structure and content from all types of nodes. 2) Although PC-GCN uses no extra information (the information of other posts in the same topic), its performance is still better than that of the baselines (Tables 2 and 4), showing the effectiveness of our methods. 3) The models that delete comment information, i.e., TP-GCN and TP(RC)-GCN, experience a dramatic drop in performance, which shows that the comment information is the most important. 4) The effect of structural information varies across situations. Without the contents, the comment structure can work on its own (TP(RC)-GCN > TP-GCN), while for topics, the structure has to collaborate with the contents ((RT)PC-GCN < PC-GCN on the Weibo dataset).
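The content-masking variants can be illustrated with a short sketch that replaces the initial vectors of one node type with random values while leaving the graph untouched. Drawing the random values from a standard normal distribution is an assumption of this example, as are the function name and the toy features.

```python
import random

def mask_node_type(features, node_types, masked_type, seed=0):
    """Replace the initial vectors of one node type with random vectors.

    features:   list of initial node vectors
    node_types: list of "topic", "post", or "comment", aligned with features
    Edges are not changed, so the structure of the masked nodes is kept.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in vec] if t == masked_type else list(vec)
            for vec, t in zip(features, node_types)]

# Toy graph with one topic, one post, and one comment (2-dimensional vectors).
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
node_types = ["topic", "post", "comment"]
masked = mask_node_type(features, node_types, "comment")  # TP(RC)-GCN-style input
```

Comparing such a masked model with the model that removes the nodes entirely separates the contribution of the content from that of the structure, as in observation 4 above.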

Ablation Study of DTPC-GCN
We focus on the roles of the U (topic-unrelated) branch and the R (topic-related) branch:
U branch only: Only the U branch is trained to capture topic-unrelated features.
R branch only: Only the R branch is trained to capture topic-related features.
Table 5 shows that both branches can identify controversial posts well, but their performance is worse than that of the fusion model. Specifically, the U branch performs slightly better than the R branch, indicating that topic-unrelated features are more suitable for inter-topic detection. We infer that the two branches can learn good but different representations under the guidance of the auxiliary task.

Figure 3 (text fragments):
Comments Attached to Post 1: (Refute) Don't think the cost can be reduced. The costs of new electronic devices and larger data system are not small.
Target Post 2: Human traffickers are hateful. People's Congress Baoyan Zhang thinks that woman- and child-trafficking cases should be sentenced to death and the present sentence of five to 10 years in prison is not heavy enough.

Case Study
We conduct a case study to further answer EQ3 from the perspective of samples.We compare the attention weight of the U and R branch in DTPC-GCN and exhibit some examples where the final decisions lean on one of the two branches.
Figure 3 shows two examples from the testing set of the Weibo dataset. DTPC-GCN relies more on the topic-unrelated features from Branch U when classifying Post 1 (0.874 > 0.126), but more on the topic-related features from Branch R when classifying Post 2 (0.217 < 0.783). The topic of Post 1, Cancel the Driving License, is weakly relevant to the topics in the training set, and the comments mostly use topic-unspecific words such as simple support and good proposal. Thus, the topic-unrelated features are more beneficial for the judgment. In contrast, Post 2 discusses the death penalty for traffickers of women and children, which is relevant to one of the topics in the training set, Improve Sentencing Standards for Sexual Assault on Children. Further, both topics are full of comments on the death penalty. Exploiting more of the topic-related features is therefore reasonable for the final decision.

Error Analysis
By conducting an error analysis on 186 misclassified samples in the Weibo dataset, we find three main types of samples that lead to misclassification: 1) 22.6% of the wrong samples have too much noise in the comments, including unrelated and neutral comments. 2) 16.1% have a very deep tree structure. This kind of structure is helpful for controversy detection (Hessel and Lee, 2019), but the ability of a GCN to obtain information from this kind of structure is limited. 3) 10.2% contain obscure and complex statements. These wrong cases indicate that better handling of noisy data, learning deeper structural features, and mining the semantics more deeply have the potential to improve the performance.

Conclusion
In this paper, we propose a novel method, TPC-GCN, to integrate the information from the graph structure and content of topics, posts, and comments for post-level controversy detection on social media. Unlike existing works, we exploit the information from related posts in the same topic and the reply structure for more effective detection. To improve the performance of our model for inter-topic detection, we propose an extension of TPC-GCN named DTPC-GCN, which disentangles the topic-related and topic-unrelated features and then dynamically fuses them. Extensive experiments conducted on two datasets demonstrate that our proposed models outperform the compared methods and show that our models can integrate both semantic and structural information with significant generalizability.

Figure 2: Architecture of (a) the Topic-Post-Comment Graph Convolutional Network (TPC-GCN) and (b) the Disentangled TPC-GCN (DTPC-GCN). The upper post in the TPC graph is taken as an example to illustrate the methods.

Figure 3 (text fragments):
Target Post 1 (Topic: Cancel the Driving License): Cancelling the physical driving license can bring many benefits: No punishment because of forgetting to carry the license; reduce the administrative costs; put an end to the use of fake licenses…
(Support) Yes! Just use the citizen's ID card for replacement.
(Support) Good proposal! Support!
(Refute) I don't support it.

Figure 3: Examples of controversial posts that rely more on one of the two branches. The attention weights of the two posts are shown on the horizontal bars (left: the U branch, right: the R branch). Post 1 relies more on U (0.874 > 0.126), while Post 2 relies more on R (0.217 < 0.783).

Figure 1 (text fragments):
Target Post P (Topic: A microblogger implies that Xiaomi's Mimoji copies Apple's Memoji.)
(Support) C 1: A rational fan appeared finally. Support you.
(Support) C 2: What you said is persuasive.
(Refute) C 3

Table 1: Statistics of the two datasets.

Figure 3 (text fragments):
Target Post 2 (Topic: Suggest Death Penalty for Woman- & Child-traffickers)
(Support) Directly sentence to death. Execute immediately!
(Support) Those harboring traffickers also need death penalty!
(Support) Support! All the child traffickers should be sentenced to death penalty!
(Refute) Drug smugglers are sentenced to death, but so many people still do. If we use death penalty to traffickers, they may take crazier actions. Should think more carefully.