Exploiting Microblog Conversation Structures to Detect Rumors

As one of the most popular social media platforms, Twitter has become a primary source of information for many people. Unfortunately, both valid information and rumors are propagated on Twitter due to the lack of an automatic information verification system. Twitter users communicate by replying to other users’ messages, forming a conversation structure. Using this structure, users can decide whether the information in the source tweet is a rumor by reading the tweet’s replies, which voice other users’ stances on the tweet. The majority of rumor detection researchers process such tweets based on time, ignoring the conversation structure. To reap the benefits of the Twitter conversation structure, we developed a model to detect rumors by modeling conversation structure as a graph. Thus, our model’s improved representation of the conversation structure enhances its rumor detection accuracy. The experimental results on two rumor datasets show that our model outperforms several baseline models, including a state-of-the-art model.


Introduction
Social media platforms have become a primary source of information due to the ease of sharing information they provide. The latest survey from the Pew Research Center states that 68% of American adults occasionally read news on social media platforms (Holcomb et al., 2013). However, the credibility of the massive amount of news propagated through social media is questionable due to the absence of editors who can validate it (Zubiaga et al., 2018). As a result, social media platforms have become perfect avenues for spreading unverified information and rumors.
Users on social media platforms communicate by replying to other users' posts, repeatedly responding to one another, forming a conversation structure. The conversation structure established by social media consists of a tree representation of the information distributed by users posting at a specific time in response to a source post (Belkaroui et al., 2014; Cogan et al., 2012; Magnani et al., 2011). Furthermore, Pace et al. (2016) distinguish two types of conversation: dialogic (horizontal conversation among users) and dialectic (vertical conversation with a source post) conversations.
Conversations on social media can influence users' perception of the information within them (Pace et al., 2016). For example, Figure 1 shows a rumor conversation on Twitter in which users give opinions, make conjectures, or supply evidence in reply to the source post or other users' replies. Users can obtain clues about the truth of the source post by reading the replies in the conversation structure. Ma et al. (2018) refer to this self-correcting mechanism in their research. However, the majority of users disregard such comments, preferring to immediately share tweets; thus, this self-correcting mechanism cannot prevent the spread of rumors in social media. Therefore, an automatic rumor detection mechanism is necessary.
Several researchers have attempted to automatically detect rumors by analyzing the text features (Ajao et al., 2018; T. Chen et al., 2018; Guo et al., 2018; Ma et al., 2016; Yu et al., 2019; Zubiaga et al., 2017), user and message features (Liu et al., 2015; Yang et al., 2012; Zhao et al., 2015), or both (Lukasik et al., 2016; Nguyen et al., 2017). However, most of the existing approaches rely on analyzing a single message and ignore the topological information of the social media conversation structure.
Ignoring the conversation structure can lead to information misinterpretation and affect rumor detection accuracy. For example, in Figure 1, the numbers indicate the sequence of tweet publishing times, and the words depicted in red font represent the stance of the posts, namely, support, deny, or comment, symbolized by (+), (-), and (O), respectively. Because each specific reply is directly related to the source tweet or other responses, a conversation structure is formed. Deep learning methods such as RNNs (including LSTM and GRU) and CNNs ignore this relationship and oversimplify it into a time-based chain structure for encoding tweets (see Figure 2). For instance, based on the tweet publication time, the 5th tweet will be fed into the model after the 4th tweet and before the 6th tweet. According to this time-based chain structure and the tweets' stances, the 5th tweet supports the 4th tweet, and the 6th tweet denies the 5th tweet. However, the 5th tweet is actually a reply to the source tweet, not the 4th tweet. In other words, there is no actual relationship between the 4th, 5th, and 6th tweets, but the time-based chain structure assigns them a false relationship.
A social media conversation can be illustrated by a graph where each message is a node, and the relationships between posts are edges. Figure 3 shows a graph representation of the conversation structure described in Figure 1 that maintains the original conversation structure.
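The difference between the two encodings can be illustrated with a small sketch. The post ids and reply links below are hypothetical and are not taken from Figure 1; the only property carried over from the text is that the 5th tweet replies to the source rather than to the 4th tweet.

```python
# Toy conversation: (post_id, parent_id), listed in order of publication time.
# The ids and reply links are hypothetical, not the ones shown in Figure 1.
posts = [
    ("src", None),   # source tweet
    ("t1", "src"),
    ("t2", "t1"),
    ("t3", "src"),
    ("t4", "t2"),
    ("t5", "src"),   # replies to the source, not to t4
    ("t6", "t3"),
]

def chain_edges(posts):
    """Links implied by a time-based chain: each post follows the previous one."""
    ids = [pid for pid, _ in posts]
    return list(zip(ids[:-1], ids[1:]))

def reply_edges(posts):
    """Links of the conversation graph: each reply is attached to its parent."""
    return [(parent, pid) for pid, parent in posts if parent is not None]

chain = chain_edges(posts)
graph = reply_edges(posts)

# Links that the chain asserts but that do not exist in the conversation.
spurious = [edge for edge in chain if edge not in graph]
print(spurious)
```

In this toy case the chain links t4 to t5 and t5 to t6 even though neither pair is connected in the conversation, which is the kind of false relationship described above.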
To utilize the conversation structure between a source tweet and its replies, we propose a graph-based deep learning model for rumor detection. The graph structure overcomes the shortcomings of RNNs and CNNs by processing tweets based on conversation structure rather than based on time. Our model consists of the following three hierarchically structured modules: the Tweet Representation Module (TRM), the Conversation Propagation Module (CPM) and the Classifier Module (CM). The TRM captures the high-level information of a tweet and creates its representation, the CPM propagates the tweet's representation through a graph structure, and the CM is a deep neural network for rumor classification.
The main contributions of our study are as follows:
• To the best of our knowledge, our study is the first to empirically integrate both the conversation structure and graph neural networks to detect rumors.
• Experiments based on two Twitter datasets show that our model achieves the highest accuracy and outperforms state-of-the-art baselines.
• Our model can successfully detect rumors in early stages.
On the other hand, deep learning does not depend on hand-crafted features and automatically extracts features from data. RNNs are the most widely used deep learning models for text classification tasks, including rumor detection. RNNs can extract features and learn contextual information from text over time (Ma et al., 2016; Rath et al., 2017). To shorten the training time and enhance the prediction accuracy, Ajao et al. (2018) proposed a hybrid approach that integrates CNNs and RNNs.
Regardless of whether they use traditional machine learning or deep learning, all of the aforementioned approaches process data based on time, ignoring structure information. To take advantage of the event structure in social media, Guo et al. (2018) proposed a hierarchical structure combined with social attention to process the data based on an event. These authors divided the process into three levels: word, post, and subevent levels. However, this event structure does not reflect the actual conversation structure. Using another approach, Ma et al. (2018) constructed a recursive neural network to handle conversational structure. This model generates a tree structure by bottom-up or top-down propagation. However, the nodes' influences are unbalanced, as the last nodes have a greater impact on the representation results. Moreover, because this model uses an RNN as a processing unit, it also encounters the long-term dependency problem. In addition, an acyclic graph is required as input; thus, graph generalization is unreliable.
Graph neural networks (GNNs) have recently become a popular model in deep learning research. GNNs process information by modeling the dependencies between nodes through message passing. Moreover, each node in a GNN reaches a state that contains information from its neighborhood at varying depths (Xu et al., 2019; J. Zhou et al., 2018). GNN variants have demonstrated solid performance on a variety of NLP tasks, e.g., text classification, sentiment analysis, neural machine translation, and multihop reading comprehension. The GNN variant GraphSAGE uses a general inductive model to learn node embeddings, where each node is represented by the aggregation of its neighbors (Hamilton et al., 2017). GraphSAGE can learn from dynamic graphs such as those found in social media conversations with variable numbers of graph nodes.

Model Architecture
We formulate our task as a supervised classification problem, designating the detection unit as a conversation involving a single source post and its replies. Let $C$ denote a conversation, $C = \{s, p_1, p_2, \ldots, p_{|C|}\}$, where $s$ is the source post, and $p_{|C|}$ is the last relevant post. Note that the numbers assigned to each post do not indicate that the conversation has a sequential structure; rather, the links between each post are based on reply or repost relationships. The objective of our model is to classify $C$ as 'rumor' or 'nonrumor'. The classifier performs learning through labeled information, i.e., $f: C \rightarrow Y$. The core concept of our approach is to strengthen the representation of the information by propagating it through conversation structure. To achieve this goal, we designed three modules: the Tweet Representation Module (TRM), Conversation Propagation Module (CPM), and Classifier Module (CM) (Figure 4).

TRM
The TRM module contains two components: a word embedding component that maps input words into fixed-sized vectors and a deep BiLSTM that processes sequential word vectors and extracts high-level information from each tweet.
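In the word embedding layer, we map the words in post $p_i$ into vectors, yielding fixed-length vectors for each word:
$x_j = E \, w_j$ (1)
where $w_j$ is the $j$-th word in a post, and $E$ is a special word embedding matrix. A deep BiLSTM is used to capture the relationships between words and generate the tweet representation:
$h^{LSTM}_{p_i} = \mathrm{BiLSTM}(x_1, x_2, \ldots, x_n)$ (2)
where $n$ is the number of words in the post.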

CPM
To propagate through the conversation structure and generate improved representations of the conversation, we use GraphSAGE to create low-dimensional vector representations from both training and unseen nodes (Hamilton et al., 2017). The output of the TRM module is a set of tweet-embedding vectors lacking conversation structure information. Thus, before the vectors are fed into the CPM module, a mapping process needs to be performed. This mapping process aims to map the conversation structure to a graph object where one post becomes one node, and the edges reflect the reply relationship it has with each of the other posts:
$G = (V, E, U)$ (3)
where $V$ contains the output of the TRM module represented as $h^{LSTM}_{p_i}$, $E$ is the relationship of the vertices, and $U$ is the global property of the graph. Since the model is trained in a supervised manner, the label of the conversation is saved as the global property of the graph object.
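The mapping step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `build_graph`, the dictionary layout, and the two-dimensional feature vectors (stand-ins for the TRM output) are all hypothetical.

```python
def build_graph(embeddings, reply_to, label):
    """Map one conversation to a graph object (hypothetical layout).

    embeddings: dict post_id -> feature vector (stand-in for the TRM output)
    reply_to:   dict post_id -> parent post_id (None for the source post)
    label:      conversation-level label, stored as the global property
    """
    ids = list(embeddings)
    index = {pid: i for i, pid in enumerate(ids)}
    nodes = [embeddings[pid] for pid in ids]
    # One edge per reply relationship, from the parent post to the reply.
    edges = [
        (index[parent], index[pid])
        for pid, parent in reply_to.items()
        if parent is not None
    ]
    return {"nodes": nodes, "edges": edges, "global": label}

g = build_graph(
    {"src": [0.1, 0.3], "a": [0.7, 0.2], "b": [0.4, 0.9]},
    {"src": None, "a": "src", "b": "a"},
    label="rumor",
)
print(g["edges"], g["global"])
```

Each post becomes exactly one node, the edges follow the reply links rather than the publication order, and the label travels with the graph as a single global attribute.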
At the beginning of the forward propagation step, the feature of each node is assigned to the nodes in the hidden state as follows:
$h^{0}_{graph,v} \leftarrow x_v, \forall v \in V$ (4)
where $h^{0}_{graph,v}$ is the initial hidden state of the nodes in GraphSAGE, and $x_v$ is the feature of node $v$. GraphSAGE works by aggregating information from local neighbor nodes at each iteration until all the nodes are accessed. This process makes the nodes gain incrementally richer information (Hamilton et al., 2017):
$h^{k}_{graph,N(v)} \leftarrow \mathrm{AGGREGATE}_k(\{h^{k-1}_{graph,u}, \forall u \in N(v)\})$ (5)
where $h^{k}_{graph,N(v)}$ is the aggregated neighborhood vector, $k$ is the depth of the information transmission updates (the number of times the graph information is updated), $N$ is the neighborhood function, $N(v)$ is the set of the node's immediate neighborhood, and AGGREGATE is the aggregation function.
In this paper, we use the max pooling aggregator to aggregate a node with neighborhood information, where max is the element-wise max operator, and $\sigma$ is a nonlinear activation function:
$\mathrm{AGGREGATE}^{pool}_k = \max(\{\sigma(W_{pool} h^{k-1}_{graph,u} + b), \forall u \in N(v)\})$ (6)
After $k$ iterations, we obtain the output representation $z_v$, the conversation embedding results:
$z_v \leftarrow h^{k}_{graph,v}, \forall v \in V$ (7)
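A dependency-free sketch of max pooling aggregation over a reply graph may help make the propagation concrete. It follows the general GraphSAGE recipe (Hamilton et al., 2017) rather than the authors' code: the update step that concatenates a node's state with its aggregated neighborhood comes from that recipe, and the weight matrices, the undirected adjacency, and the toy vectors are assumptions chosen so that the result is easy to check by hand.

```python
def relu(x):
    return [max(0.0, v) for v in x]

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def max_pool_aggregate(h, neighbors, W_pool, b):
    """Element-wise max over relu(W_pool h_u + b) for every neighbor u."""
    if not neighbors:
        return [0.0] * len(b)
    transformed = [
        relu([a + c for a, c in zip(matvec(W_pool, h[u]), b)])
        for u in neighbors
    ]
    return [max(column) for column in zip(*transformed)]

def propagate(h, adjacency, W_pool, b, W, depth):
    """Run `depth` rounds of aggregate-then-update over all nodes."""
    for _ in range(depth):
        new_h = {}
        for v, nbrs in adjacency.items():
            h_nbr = max_pool_aggregate(h, nbrs, W_pool, b)
            # Concatenate the node's own state with its neighborhood vector.
            new_h[v] = relu(matvec(W, h[v] + h_nbr))
        h = new_h
    return h

# Toy conversation src - a - b, treated as an undirected graph (assumption).
h0 = {"src": [1.0, 0.0], "a": [0.0, 2.0], "b": [3.0, 1.0]}
adjacency = {"src": ["a"], "a": ["src", "b"], "b": ["a"]}
I2 = [[1.0, 0.0], [0.0, 1.0]]                       # identity pooling weights
W_sum = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]  # adds self and neighborhood

out1 = propagate(h0, adjacency, I2, [0.0, 0.0], W_sum, depth=1)
out2 = propagate(h0, adjacency, I2, [0.0, 0.0], W_sum, depth=2)
print(out1["src"], out2["src"])
```

With depth 1 the source only sees its direct reply; with depth 2 its state also reflects the reply two hops away, which is how a deeper setting lets information from distant replies reach the source post.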

CM
The CM module is an MLP module. Based on the output of the CPM, we use a softmax function in the output layer to predict the label of the conversation:
$\hat{y} = \mathrm{softmax}(W z + b)$ (8)
where $z$ is the output of the CPM, and $W$ and $b$ are parameters in the output layer. For each training process, the goal is to minimize the squared error between the predicted and target values using the following loss function:
$L = \sum (y - \hat{y})^2 + \sum \|\theta\|^2$ (9)
where $y$ is the target value, and $\theta$ is the model parameters to be estimated. The L2-regularization penalty is used for trading off the error and scale of the problem.
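The output layer and the loss can be sketched in a few lines. This is only an illustration of a softmax prediction combined with a squared-error term and an L2 penalty; the `reg` coefficient is an assumption added to expose the trade-off mentioned in the text, and it defaults to 1.0.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def loss(y_true, y_pred, params, reg=1.0):
    """Squared error plus an L2 penalty on the parameters.

    `reg` is a hypothetical trade-off coefficient, not a value from the paper.
    """
    squared_error = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    l2_penalty = sum(w * w for w in params)
    return squared_error + reg * l2_penalty

y_hat = softmax([0.0, 0.0])          # two classes: rumor, nonrumor
value = loss([1.0, 0.0], y_hat, params=[1.0, 2.0])
print(y_hat, value)
```

Lowering `reg` lets the error term dominate, while raising it pushes the parameters toward smaller magnitudes.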

Dataset
Two Twitter datasets, PHEME 2017 and PHEME 2018, are used to extensively evaluate our proposed model; they were created by  and , respectively. Table 1 describes the statistics of these datasets.

Experimental Setup
In our preprocessing phase, we clean the text by deleting hyperlinks, emojis, and stop-words. For word embedding, we use 200-dimensional GloVe vectors pretrained on Twitter (27B tokens) and set the maximum vocabulary size to 80,000. We use Adam with a learning rate of 0.001 to optimize the model during training.
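A possible shape for the cleaning step is sketched below. The regular expressions and the stop-word list are assumptions made for illustration; the paper does not specify which patterns or which stop-word list were used.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Rough emoji ranges (assumption); real tweets may need a fuller set.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and"}

def clean_tweet(text):
    """Remove hyperlinks, emojis, and stop-words; return lowercase tokens."""
    text = URL_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    tokens = re.findall(r"[#@]?\w+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(clean_tweet("The shooter is in the building \U0001F631 https://t.co/abc #breaking"))
```

Hashtags and mentions are kept as single tokens here; whether to keep, split, or drop them is a design choice the text leaves open.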
The hyperparameters for the TRM are: batch size = 64; dropout rate = 0.5; hidden size = 70; and number of layers = 2. For the CPM, they are: batch size = 128; aggregation function = max pooling; and activation function = ReLU. For the CM, they are: number of layers = 2; dropout rate = 0.5; and activation function = ReLU between layers and Sigmoid in the last layer to predict rumors.
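For reproduction purposes, the reported settings could be gathered in a single configuration object. The values below are the ones stated in this section; the key names and the nesting are hypothetical.

```python
# Values come from the experimental setup; key names are illustrative only.
CONFIG = {
    "embedding": {"source": "GloVe Twitter 27B", "dim": 200, "max_vocab": 80_000},
    "optimizer": {"name": "adam", "lr": 0.001},
    "trm": {"batch_size": 64, "dropout": 0.5, "hidden_size": 70, "num_layers": 2},
    "cpm": {"batch_size": 128, "aggregator": "max", "activation": "relu"},
    "cm": {
        "num_layers": 2,
        "dropout": 0.5,
        "hidden_activation": "relu",
        "output_activation": "sigmoid",
    },
}

print(CONFIG["trm"]["hidden_size"], CONFIG["cpm"]["aggregator"])
```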

Experiments
We compared our model with the following baseline models, among which RDM is considered the state-of-the-art model:
• SVM-BOW: an SVM classifier using bag-of-words and N-gram (e.g., 1-gram, bigram, and trigram) features (Ma et al., 2018).
• CNN: a convolutional neural model for obtaining the representation of each tweet and classifying tweets with a softmax layer (Y.-C. Chen et al., 2017).
• BERT: a fine-tuned BERT to detect rumors (Devlin et al., 2019).
• RDM: a method integrating GRU and reinforcement learning to detect rumors at an early stage (K. Zhou et al., 2019).
Table 2 shows that our model outperforms the other models, including the state-of-the-art model, RDM. The SVM uses bag-of-words features for text encoding and statistical methods that miss essential text features, which leads to low rumor detection accuracy. CNN, BiLSTM, and RDM process tweets based on time; thus, they lose valuable information from the conversation structure. BERT has a multilayer architecture that performs well in various NLP tasks; however, BERT also processes tweets based on time and therefore has the same disadvantage as the other models. On the other hand, our model benefits from the utilization of conversation structure, where every reply supports, denies or comments on the source message. The results show that exploiting conversation structure enables improved rumor detection performance.
Table 2. Experiment results

Conversation representation impact study
To further investigate the impact of the tweet representation used in our model, we evaluated several conversation information extraction approaches by replacing the TRM module with hand-crafted features and sentence embedding, namely: (1) Table 3, the performance of the first three models based on hand-crafted features is low, as characterized by accuracies of less than 68%. This low performance indicates that hand-crafted features with small dimensions are unlikely to represent all the helpful information in a conversation.
On the other hand, the sentence embedding methods GNN-S and GNN-SB achieve higher performance than the first three models. GNN-S outperforms GNN-MU by 7.62% on the PHEME 2017 dataset and 5.61% on the PHEME 2018 dataset, while GNN-SB achieves 15.35% and 13.95% higher accuracies on the PHEME 2017 and PHEME 2018 datasets, respectively. These results show that information propagation through the conversation structure based on the text can generate a more comprehensive representation.
Overall, we found that TRM as a Twitter conversation representation achieves the best performance because it was built to perform well on the rumor detection task. Unlike the other representations, TRM is trained through the model's loss function, so the way posts are represented can adapt to the task.

Early detection performance
Detecting rumors at an early stage of propagation is vital so that interventions can be carried out as soon as possible.
To evaluate our model's performance on the early detection of rumors, we created eight test sets reflecting real scenarios of rumors spreading on Twitter. Unlike other researchers who define early rumors based on time, we define them based on the number of replies. We claim that, on the one hand, a small number of replies to a source tweet means that the rumor has just begun to spread because only a few tweets refer to it. On the other hand, a large number of replies to a source tweet suggests that the rumor has been widely spread.
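Building a reply-count-based test set can be sketched as follows. The tuple layout and the function name are hypothetical, and the cut-off values are placeholders, since the eight thresholds used in the experiments are not listed here.

```python
def first_n_replies(posts, n):
    """Keep the source post and its first n replies by publication time.

    posts: list of (post_id, parent_id, timestamp); the source has parent_id None.
    Because a reply is always published after the post it answers, every kept
    reply still has its parent in the truncated conversation.
    """
    source = [p for p in posts if p[1] is None]
    replies = sorted((p for p in posts if p[1] is not None), key=lambda p: p[2])
    return source + replies[:n]

conversation = [
    ("src", None, 0),
    ("a", "src", 5),
    ("b", "a", 9),
    ("c", "src", 7),
]

# Placeholder cut-offs; the actual thresholds are not given in the text.
for n in (1, 2, 3):
    kept = first_n_replies(conversation, n)
    print(n, [pid for pid, _, _ in kept])
```

Truncating by reply count rather than by elapsed time keeps the conversation graph connected, so the same graph construction can be applied to each truncated test set.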
Table 3. Conversation representation study
The detection capability of the SVM is deficient in all the cases, achieving accuracies of less than 70%. Improved results of approximately 80% accuracy are obtained by CNN and BiLSTM. Moreover, CNN and BiLSTM exhibit higher accuracy when they use only a few replies (fewer than ten replies). In Figure 5, we see that the accuracies of CNN and BiLSTM slightly decrease as the number of replies increases. This result occurs because CNN and BiLSTM ignore conversation structure information and use a time-based chain structure. As a result, CNN and BiLSTM are unable to encode correct representations when the number of replies is greater than ten.
Unlike the other models, our model encodes information based on the conversation structure and thus obtains a correct representation. The results show that, in every case, our model outperforms the other methods, suggesting that it can classify rumors at both very early and late stages.

Conclusions
We introduce a novel model that shows how the conversation structure of social media can help detect rumors. By viewing conversation structure as a graph, we propagate a message through its structure and benefit from users' related stances. In experiments using the PHEME 2017 and PHEME 2018 datasets, which contain only small amounts of data, our model outperforms the baseline and state-of-the-art models. We expect that our model's performance will increase when it is trained on larger datasets.
Moreover, since the CPM module can aggregate information from local neighbors and create a representation of unseen nodes, our model can be used for unsupervised tasks because it enables nearby nodes to have similar representations while enforcing highly distinct representations for disparate nodes. In the future, we will investigate unsupervised models on massive amounts of unlabeled rumor data from social media.