A State-independent and Time-evolving Network with Applications to Early Rumor Detection

In this paper, we study automatic rumor detection in social media at the event level, where an event consists of a sequence of posts organized according to posting time. The state of an event typically evolves dynamically. However, most existing methods for this task ignore this property and build a global representation from all the posts in the event's life cycle. Such coarse-grained methods fail to capture the event's unique features in different states. To address this limitation, we propose a State-independent and Time-evolving Network (STN) for rumor detection based on fine-grained event state detection and segmentation. Given an event composed of a sequence of posts, STN first predicts the corresponding sequence of states and segments the event into several state-independent sub-events. For each sub-event, STN independently trains an encoder to learn that sub-event's feature representation, and incrementally fuses the representation of the current sub-event with previous ones for rumor prediction. This framework learns a more accurate representation of an event in its initial stage and enables early rumor detection. Experiments on two benchmark datasets show that STN significantly improves rumor detection accuracy in comparison with several strong baseline systems. We also design a new evaluation metric for early rumor detection, under which STN shows an even larger advantage.


Introduction
Rumor is defined as an unverified statement, which may be unintentionally created or deliberately fabricated (DiFonzo and Bordia, 2007). False rumors are damaging, as they may cause public panic and social unrest. Social media platforms have become ideal places for spreading rumors, so it is important to automatically detect rumors and debunk them before they spread widely. In recent years, the rumor detection task has attracted continuous attention from many researchers in the NLP community. We denote a statement in social media as an event consisting of a source post and its following posts, such as comments or reposts (collectively called posts). Given an event, the rumor detection task is typically defined as a text classification problem (Zubiaga et al., 2018): the goal is to detect whether the event is a rumor or not.
In the literature, the typical method was to first obtain a global representation of the event based on all posts in the event's life cycle, and then employ a machine learning algorithm, such as Random Forest (RF, Kwon et al. 2013), Support Vector Machine (SVM, Ma et al. 2015), Convolutional Neural Network (CNN, Yu et al. 2017) or Recurrent Neural Network (RNN, Ma et al. 2016), to learn the connection between the representation and the class labels.
On the one hand, events in social media evolve dynamically. According to communication studies, the dissemination of an event can be roughly divided into an evolution period, a high-tide period and an extinction period (Li et al., 2014; Han et al., 2014). As shown in Figure 1, similar curves can be observed in two real-world social media rumor datasets (i.e., Twitter and Weibo). Each state of an event has a different posting density and data distribution. However, most of the aforementioned coarse-grained methods ignore the dynamics in the text data stream and fail to capture the unique features of different states. Although some of these methods consider temporal features or model the sequential dynamics with RNNs, they still fail to establish fine-grained representations for different states.
On the other hand, the early detection of rumors is of great importance. According to our observations on the two rumor datasets, most events reach the high-tide period in less than five minutes. Although some previous work segmented the timeline by equal time spans or equal numbers of posts for early rumor detection (Ma et al., 2016; Guo et al., 2018; Chen et al., 2018), these approaches potentially ignore the vital features of early states and fail to train targeted models for early detection.
To address the limitations mentioned above, we propose a new State-independent and Time-evolving Network (STN) for rumor detection based on propagation state detection and segmentation, and apply it to early rumor detection. Specifically, since an event in social media is a sequence of posts sorted by posting time, it can be viewed as a time-series text data stream. To learn the propagation states in this stream, we first employ the Kleinberg algorithm (Kleinberg, 2003) to segment an event into several sub-events based on state transitions, each of which represents a continuous and identical state. Subsequently, we train an encoder to fit each sub-event separately. We further propose a time-evolving fusion (He et al., 2018) mechanism that merges the current sub-event representation with previous ones and combines them for incremental prediction. STN no longer outputs a single predictive label per event, but a sequence of labels, one for each state-independent sub-event, which enables early detection of rumors. Moreover, we present a new evaluation metric, called Time-series Smoothing Accuracy (TS-Acc), for measuring the performance of early rumor detection.
Experimental results on two real-world rumor detection datasets released by Ma et al. (2016) demonstrate the effectiveness of our STN model. It not only achieves significant improvements for rumor detection in comparison with several strong baseline systems, but also greatly improves the early rumor detection performance.

Related Work
In recent years, rumor classification systems have developed rapidly. Based on the definition in (Zubiaga et al., 2018), a complete rumor classification system consists of four components: i. rumor detection; ii. rumor tracking; iii. stance classification; iv. rumor verification. Among the four sub-tasks, rumor verification resembles rumor detection most closely. For rumor detection, the goal is to detect whether a statement is a rumor or not (i.e., the class labels are rumor and non-rumor); for rumor verification, the goal is to determine whether a rumor is true, false or unconfirmed. Some later work combined the class labels and considered it a four-class classification problem (non-rumor, true rumor, false rumor, unverified rumor) (Ma et al., 2017).

In early studies of rumor detection, researchers focused on extracting various explicit features of microblog events on social media platforms, and combined these features with traditional machine learning classifiers to detect rumors or identify information credibility (Castillo et al., 2011; Yang et al., 2012; Kwon et al., 2013; Liu et al., 2015; Ma et al., 2015; Wu et al., 2015; Zhao et al., 2015; Wang and Terano, 2015; Vosoughi, 2015). These manually designed features can be roughly categorized into three groups: text content, user portraits and propagation states. However, it is hard for these traditional approaches to capture the dynamic characteristics during the spread of an event and the relationships between posts.
To address this issue, Kwon et al. (2013) constructed a message propagation model to find the difference in the volume of related posts between rumors and non-rumors. Ma et al. (2015) first proposed to divide the event timeline into equal-span periods and utilized the dynamic changes of features in adjacent periods. Based on this, Ma et al. (2016) further introduced RNN models to encode the time periods, which verified the effectiveness of RNN models in encoding sequential posts. Zubiaga et al. (2017) utilized a sequential approach based on linear-chain Conditional Random Fields (CRF) to learn the dynamic relations between posts, which relies on the content of a source microblog and its related posts. Kwon et al. (2017) employed different sets of features to preserve the properties of the propagation structure and the temporal relations among posts. Moreover, Guo et al. (2018) incorporated the attention mechanism into stacked RNNs to model the temporal propagation of an event.
In addition, there is another line of research focusing on modeling post sequences with tree structures, which aims to capture useful relations among the responsive posts (Nadamoto et al., 2013; Wu et al., 2015; Ma et al., 2017, 2018; Kumar and Carley, 2019). Among them, the representative studies are Ma et al. (2018) and Kumar and Carley (2019), which respectively proposed a recursive neural network and a Tree-LSTM architecture to explicitly model the tree structure. Different from all the studies mentioned above, a recent study by Ma et al. (2019) proposed to leverage Generative Adversarial Networks (GAN) to improve the robustness of rumor detection, where a generative model is trained to confuse the rumor detection discriminator by generating pseudo-real examples.
Although much work has been done on rumor detection, only a few previous studies focused on the early detection of rumors (EDR). Zhao et al. (2015) argued that rumors are more likely to arouse users' suspicion, and proposed to aggregate related posts containing specific phrases, followed by performing EDR with cluster-based classifiers. However, this work inevitably involved much human effort. To alleviate the reliance on feature engineering, Nguyen et al. (2017) utilized deep neural networks to automatically capture features at the post level. Although it achieves better early detection performance, it is difficult to apply to large events. A subsequent study argued that early posts are easily manipulated by the source microblog, while user characteristics are relatively stable, and integrated RNN and CNN models to capture user characteristics in the propagation process of an event. However, relying only on user features prevents that model from achieving continuous performance improvements as time goes by. More recently, Song et al. (2019) introduced the concept of credible detection points, and proposed to gather every ten posts along the timeline as one time step of an RNN and make a prediction at each step. However, tens of thousands of posts lead to a large number of time steps, which may reduce the reliability of long-distance dependencies.

Task Definition
Let D = {(E, y)} be a rumor detection dataset, where E denotes one event and y denotes its class label. Each event E consists of a large number of posts, E = [(c_0, t_0), (c_1, t_1), ..., (c_{|E|}, t_{|E|})], where |E| is the number of posts in it. The first post c_0 in E is regarded as the source post, published at time t_0. Each following post c_i has an arrival time t_i, and c_i denotes the feature representation of that post. After sorting all the posts in event E according to arrival time, E can be considered a time-series text data stream. We train a rumor detection model on D and use it to predict the class label y of an unseen event E.

Event State Detection and Segmentation
The Kleinberg algorithm (Kleinberg, 2003) was originally used to detect burst incidents on news or e-mails. In this paper, we employ it to detect the state for each post in an event. Based on the hidden Markov model, the Kleinberg algorithm can identify the hidden state sequence corresponding to a post sequence.
For an event consisting of multiple posts E = [(c_0, t_0), (c_1, t_1), ..., (c_{|E|}, t_{|E|})], we first build a sequence of arrival time intervals X = [x_1, x_2, ..., x_{|E|}], where x_i = t_i - t_{i-1}. Our goal is to obtain the corresponding state sequence Q = [q_1, q_2, ..., q_{|E|}] for the interval sequence X, where q_i ∈ {1, 2, ..., N} denotes the state of x_i, and N is the number of state levels in Q. Figure 2 illustrates a case of the state changes of part of the posts in an event when N = 3.

The Kleinberg algorithm assumes the arrival time intervals follow a memoryless exponential distribution:

p(x_i | q_i = j) = α_j · exp(-α_j · x_i)

where α_j can be regarded as the arrival rate of posts in state j. It can be derived that the expected value of x_i under state j is 1/α_j. For the basic state q = 1, we set its arrival rate α_1 as the reciprocal of the average interval of all posts in E. The values of α_j corresponding to higher states q_i = j are then set as

α_j = α_1 · s^(j-1)

where s > 1 is a preset scaling parameter. For adjacent arrival time intervals x_i and x_{i+1} with corresponding states q_i = a and q_{i+1} = b, the loss of transition from state a to b is defined as

τ(a, b) = (b - a) · γ · ln|E| if b > a, and τ(a, b) = 0 otherwise,

where γ is a preset parameter controlling the magnitude of the transition loss. The objective of the algorithm is to solve for a state sequence Q that minimizes the cost function

L(Q | X) = Σ_i τ(q_i, q_{i+1}) - Σ_i ln p(x_i | q_i)

where the first term is the loss of state transitions, which encourages transitions to be as infrequent as possible, and the second term is the negative log-likelihood, which encourages high density p(x_i | q_i) for each (x_i, q_i) pair.
After obtaining the optimal state sequence Q, we merge continuous posts with the same state into a single sub-event, and finally represent an event E as a sequence of K state-independent sub-events:

E = [E_0, E_1, ..., E_{K-1}]

Each sub-event E_k consists of a series of continuous posts E_k = [(c_{k,0}, t_{k,0}), (c_{k,1}, t_{k,1}), ...], where c_{k,l} denotes the l-th post in the k-th sub-event.
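As a concrete illustration, the state-assignment step above can be sketched as a Viterbi-style dynamic program over the state levels. This is a minimal, dependency-free re-implementation of Kleinberg's automaton; the parameter defaults and the 0-indexed state labels are illustrative choices, not the paper's exact configuration:

```python
import math

def kleinberg_states(intervals, n_states=3, s=2.0, gamma=0.3):
    """Assign a state level (0 .. n_states-1; the paper numbers them
    1 .. N) to each arrival interval by minimizing Kleinberg's cost:
    transition penalties plus the negative log-likelihood of each
    interval under an exponential density whose rate grows by a
    factor of s per level."""
    n = len(intervals)
    base_rate = n / sum(intervals)                # 1 / average interval
    alpha = [base_rate * (s ** j) for j in range(n_states)]

    def tau(a, b):                                # moving up costs; down is free
        return (b - a) * gamma * math.log(n) if b > a else 0.0

    def nll(x, j):                                # -ln p(x | state j)
        return alpha[j] * x - math.log(alpha[j])

    # Viterbi: cost[j] = cheapest cost of any path ending in state j.
    cost = [tau(0, j) + nll(intervals[0], j) for j in range(n_states)]
    back = []
    for x in intervals[1:]:
        step_cost, step_back = [], []
        for j in range(n_states):
            c, a = min((cost[a] + tau(a, j), a) for a in range(n_states))
            step_cost.append(c + nll(x, j))
            step_back.append(a)
        cost = step_cost
        back.append(step_back)
    # Backtrack the optimal state sequence.
    q = [min(range(n_states), key=cost.__getitem__)]
    for step_back in reversed(back):
        q.append(step_back[q[-1]])
    return q[::-1]
```

On a stream whose intervals suddenly shrink (a burst of posts), the burst region is assigned a higher state level, and merging runs of equal levels yields the sub-event segmentation described above.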

State-independent Sub-event Encoder
For each sub-event E_k, we train a state-independent sub-event encoder e_k to obtain the sub-event representation. First, the mean pooling of the embeddings of all words in a post is used as the post representation:

c_i = (1 / |c_i|) · Σ_l w_l^(c_i)

where w_l^(c_i) is the word embedding vector of the l-th word in post c_i, retrieved from a pre-trained word embedding matrix, and c_i denotes the representation of the i-th post. Based on c_i, we can then build the input representation of the sub-event E_k, denoted by X_k = [c_{k,0}, c_{k,1}, ...]. Second, we employ a basic encoder (e.g., CNN, LSTM) to compute the sub-event representation h_k from the input post representations X_k:

h_k = e_k(X_k)

Note that the state-independent sub-event encoder is a general framework compatible with widely used text encoders, e.g., CNN, LSTM, GRU, etc. In the experiments, in addition to CNN, we also report results based on LSTM and GRU.
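The mean-pooling step that builds the encoder input can be sketched in plain Python (tensor shapes, batching, and the downstream CNN/LSTM encoder itself are omitted):

```python
def post_representation(word_vectors):
    """Mean-pool a post's word embeddings into one vector c_i.
    `word_vectors` is a list of equal-length float lists, one per word."""
    dim = len(word_vectors[0])
    n = len(word_vectors)
    return [sum(v[d] for v in word_vectors) / n for d in range(dim)]

def subevent_input(posts):
    """Stack the post representations of sub-event E_k into its input X_k,
    which is then fed to the chosen encoder e_k (CNN, LSTM, GRU, ...)."""
    return [post_representation(p) for p in posts]
```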

Time-evolving Representation and Classification
Social media events evolve dynamically, and the representations of preceding sub-events may be helpful for predicting the current sub-event.

Figure 3: A schematic diagram of our model when learning for sub-event E_2. The model outputs the prediction result ĥ_2 and updates all the visible and unfrozen modules accordingly.

Therefore,
we add a time-evolving fusion module after each sub-event encoder to fuse the representation of the current sub-event with the previous ones:

ĥ_k = δ(W_k · [h_k ; ĥ_{k-1}])

where δ is the Sigmoid activation function, [· ; ·] denotes concatenation, and W_k is a weight matrix. Similarly, ĥ_k will be used to guide the encoding ĥ_{k+1} of the next sub-event, forming a recursive encoding mode.
We independently predict the authenticity of sub-event E_k under each state. The encoding ĥ_k of E_k is fed into a separate softmax classifier to obtain the prediction result ŷ_k:

ŷ_k = softmax(V_k · ĥ_k + b_k)

where V_k and b_k are the weight and bias parameters. Given the sequence of sub-events E = [E_0, E_1, ..., E_{K-1}], our model incrementally outputs a corresponding sequence of predictive probabilities Ŷ = [ŷ_0, ŷ_1, ..., ŷ_{K-1}]. The training objective for each sub-event is to minimize the cross-entropy loss between the predictive probability ŷ_k and the true class label y:

L_{E_k} = -Σ_c y_c · log ŷ_{k,c}

where L_{E_k} denotes the loss for sub-event E_k and c ranges over the class labels.
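A minimal sketch of the fusion and per-state prediction steps follows. The sigmoid-over-concatenation form of the fusion is our reading of the text, not necessarily the paper's exact parameterization, and the weights here are toy values:

```python
import math

def fuse(h_k, h_prev, W_k):
    """Time-evolving fusion: hat_h_k = sigmoid(W_k . [h_k ; hat_h_{k-1}]).
    The concatenation-based form is a reconstruction from the text."""
    x = list(h_k) + list(h_prev)                  # [h_k ; hat_h_{k-1}]
    return [1.0 / (1.0 + math.exp(-sum(w * xi for w, xi in zip(row, x))))
            for row in W_k]

def softmax_predict(h_hat, V_k, b_k):
    """Per-sub-event classifier: hat_y_k = softmax(V_k . hat_h_k + b_k)."""
    logits = [sum(v * hi for v, hi in zip(row, h_hat)) + b
              for row, b in zip(V_k, b_k)]
    m = max(logits)                               # stabilized softmax
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]
```

In the full model, `fuse` is applied once per sub-event, so the prediction at state k depends on all earlier states through the recursively fused representation.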
It should be noted that our model is learned with an incremental training mechanism. When training on the current sub-event E_k, the parameters of all previous encoders are frozen; that is, we only update the parameters of the current encoder. However, the fusion parameters (i.e., W_{k-1}, W_{k-2}, ...) are fine-tuned synchronously. The incremental training process is illustrated in Figure 3.

Early Detection of Rumors
In this subsection, we further propose a new evaluation metric, named Time-series Smoothing Accuracy (TS-Acc), to measure the performance of early rumor detection.
Since earlier prediction results are more important for rumor detection, we first employ a smoothed exponential function to assign a weight to the accuracy of the predictions in each sub-event:

v(t) = exp(-t / λ)

where t is the arrival time of a sub-event and λ is a smoothing parameter, set to 60 in our experiments. TS-Acc is then defined as a weighted sum of the accuracies of the already-appeared sub-events:

TS-Acc = Σ_k Norm(v(t^(k))) · Acc_k

where Norm(v(t^(k))) denotes the weight normalized over the k already-appeared sub-events, and Acc_k is the detection accuracy at sub-event E_k.
It is also worth noting that, in the case of discrete time points, the area under the accuracy-time curve is equivalent to the sum of accuracies at all time points. Since TS-Acc is a weighted sum of accuracies, it can be regarded as a weighted version of the area under the accuracy-time curve.
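Assuming the "smoothed exponential" weight is the decaying form v(t) = exp(-t/λ), as the description suggests, TS-Acc can be computed as:

```python
import math

def ts_acc(arrival_times, accuracies, lam=60.0):
    """Time-series Smoothing Accuracy: a normalized, weighted sum of the
    per-sub-event accuracies.  The decaying form v(t) = exp(-t / lam)
    is a reconstruction of the 'smoothed exponential' weight described
    in the text; the paper sets the smoothing parameter to 60."""
    weights = [math.exp(-t / lam) for t in arrival_times]
    z = sum(weights)                  # Norm(.): normalize over sub-events
    return sum((w / z) * acc for w, acc in zip(weights, accuracies))
```

Because the weights decay with t, a model that is accurate early scores higher than one that only becomes accurate late, even if both end at the same final accuracy.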

Datasets and Experimental Settings
The Twitter and Weibo datasets were published by Ma et al. (2016); both provide a large number of relevant posts for each microblog event. Due to the protection policy of Twitter, we re-crawled all the posts in the Twitter dataset according to their ID numbers. Since some of the tweets are no longer available, we discarded the unresponsive source microblogs and finally obtained 90% of the original dataset for our experiments. For the Weibo dataset, events of misinformation are marked as rumor; according to the definition in (Zubiaga et al., 2018), this label is closer to true rumor, but to be consistent with (Ma et al., 2016), we still use the event categories rumor and non-rumor. The detailed statistics of both datasets are shown in Table 1.

Following the settings of previous papers, we hold out 10% of the events in both datasets for model tuning, and split the remaining events with a ratio of 3:1 for training and test. To guarantee globally comparable states across different events, we do not run the Kleinberg algorithm on each event separately. Instead, we combine all events in the dataset and align their posts according to posting time; the Kleinberg algorithm is then performed on this combined dataset. We use Chinese word embeddings from Tencent AI Lab and English word embeddings from Google News. When training the model, we use the Adam optimizer (Kingma and Ba, 2014).

Rumor Detection Performance
In this subsection, we compare our proposed STN model with the following rumor detection methods on the standard rumor detection task, i.e., evaluating the detection accuracy after the end of the event propagation:

• DTR: A ranking model based on a decision tree that identifies trending rumors by searching for disputed claims (Zhao et al., 2015);

• CED: A model that obtains a credible detection point for each repost sequence and then makes reliable predictions based on the information before that point (Song et al., 2019).
Based on the results reported in Table 2, we can make a couple of observations. First, compared with the traditional models DTR and SVM, GRU achieves obvious improvements on both datasets, and AIM shows even better performance by using the attention mechanism. Second, based on the credible detection point, CED further boosts the detection accuracy on Weibo to 94.6%. Finally, our STN model consistently achieves the best performance on both the Twitter and Weibo datasets, outperforming the state-of-the-art models by around two percentage points in both detection accuracy and F1 score.

Early Detection Performance
In this subsection, we compare the performance of all the models in early detection of rumors (EDR), i.e., predicting the credibility of microblog events based on the posts released before a detection time point.
(1) The curve of detection accuracy

In Figure 4, we show the detection accuracy of all the models as time goes by. In particular, we illustrate more detection results within the first 6 hours.
First, we can see from Figure 4 that the accuracies of DTR and SVM grow slowly on both the Twitter and Weibo datasets. In contrast, the GRU model has a faster and more stable rising curve on both datasets. Second, compared with the previous methods, AIM consistently improves the detection accuracy at each detection time point. Moreover, PPC can quickly raise the detection accuracy to over 92% in the first 5 minutes, but it cannot continue to improve, whereas CED continuously improves its detection accuracy to over 94% within the first 6 hours on the Weibo dataset. Finally, in comparison with all the methods mentioned above, STN shows a significant improvement within the first 6 hours. Specifically, STN achieves over 75% and 94% accuracy on the two datasets, respectively, at the 10th minute. In addition, as time goes by, STN gradually improves its detection accuracy and outperforms all the state-of-the-art models at each detection time point.

(2) Time-series Smoothing Accuracy
As introduced in Section 3.5, we propose the Time-series Smoothing Accuracy (TS-Acc) to evaluate the efficiency of EDR. In Table 3, we report the results of comparing all the models under this evaluation metric. To be consistent with Figure 4, we select 3 and 9 time points within the first 6 and 96 hours, respectively, to re-evaluate the TS-Acc performance of all the models.
First, we can see that for each approach, the overall trend of the TS-Acc performance is similar to that of the accuracy performance in Figure 4. Second, it is worth noting that for the Weibo dataset, the TS-Acc of PPC in 96 hours is slightly lower than AIM and CED, whereas its TS-Acc in 6 hours is significantly higher than AIM and CED. This indicates that our evaluation metric TS-Acc primarily reflects the speed of improvement in the early stage. Finally, we can clearly observe that the TS-Acc of STN is significantly higher than that of all the state-of-the-art models on both datasets, which is consistent with the performance trend shown in Figure 4.

Discussion on State Detection and Event Segmentation
(1) Discussion on the Dynamics of Data Distribution

To show the state detection and event segmentation advantages of the Kleinberg algorithm, we compare it with baseline segmentation strategies (e.g., Chen et al. 2018) that divide the event by equal time spans or by equal numbers of posts. Specifically, we calculate the intra-class distance of each divided sub-event and take the mean over all sub-events for each method. Note that the smaller the intra-class distance, the closer the post features within a sub-event. In Table 4, it is easy to observe that the Kleinberg algorithm obtains the lowest intra-class distance, which demonstrates its better state segmentation ability; this may reduce the dynamics of the data distribution within sub-events and further enhance the feature extraction ability of our encoders.
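The paper does not spell out the exact distance; one natural reading of the intra-class distance measure is the mean distance of posts to their sub-event centroid, averaged over sub-events (Euclidean distance is an assumption here):

```python
def mean_intra_class_distance(subevents):
    """For each sub-event (a list of post feature vectors), compute the
    average Euclidean distance of its posts to the sub-event centroid,
    then average over sub-events.  A sketch of the segmentation-quality
    measure discussed around Table 4; the exact distance used in the
    paper is not specified."""
    def centroid(vs):
        dim = len(vs[0])
        return [sum(v[d] for v in vs) / len(vs) for d in range(dim)]

    def dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

    per_subevent = []
    for posts in subevents:
        c = centroid(posts)
        per_subevent.append(sum(dist(p, c) for p in posts) / len(posts))
    return sum(per_subevent) / len(per_subevent)
```

Under this reading, a segmentation whose sub-events group similar posts together yields a lower score, matching the comparison reported in Table 4.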
(2) Effects of the Parameters of Kleinberg

The Kleinberg algorithm has two important preset parameters, s and γ, which set the expected arrival rates of posts and the transition loss between different state levels. As shown in Table 5, we explore the impact of several pairs of s and γ on the sub-event partition.
To ensure effective training of STN, we adjust the state division of the Kleinberg algorithm: if the number of events that have posts under a single state is less than 30% of the total number of events, we merge that state with the next one; and if the duration of a single state exceeds two hours, we truncate it at the 2nd hour. Finally, we find that reasonable changes of s and γ have little effect on the number and boundaries of sub-events. Thus, we conclude that the performance of STN is not sensitive to the parameter settings of Kleinberg.
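The two adjustment heuristics can be sketched as a post-processing pass over the globally detected state segments. The segment representation (a dict with start/end times and an event count) is our own illustration, not the paper's data structure:

```python
def adjust_segments(segments, n_events, min_frac=0.3, max_len=2 * 3600.0):
    """Post-process globally detected states, following the two heuristics
    described in the text: (1) if fewer than `min_frac` of all events have
    posts under a state, merge that state into the next one; (2) if a
    state lasts longer than `max_len` seconds, truncate it at that length.
    Each segment is a dict with 'start'/'end' (seconds) and 'n_events',
    the number of events with posts in that state (combining counts via
    max when merging is an illustrative assumption)."""
    out = []
    pending = None                       # sparse segment waiting to be merged
    for seg in segments:
        if pending is not None:          # fold the sparse segment forward
            seg = {"start": pending["start"], "end": seg["end"],
                   "n_events": max(pending["n_events"], seg["n_events"])}
            pending = None
        if seg["n_events"] < min_frac * n_events:
            pending = seg                # too few events: merge with next state
            continue
        if seg["end"] - seg["start"] > max_len:
            seg = {**seg, "end": seg["start"] + max_len}   # truncate at 2 h
        out.append(seg)
    if pending is not None:              # a trailing sparse segment stays as-is
        out.append(pending)
    return out
```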

Discussion on the Compatibility of State-independent Encoders
As mentioned above, the state encoder of STN is a general framework compatible with traditional feature extraction and classification algorithms as well as deep neural networks. We conduct experiments with the following encoders on the Weibo dataset: LR (Logistic Regression), LSTM, GRU, GRU-ATT (GRU with self-attention) and CNN. Table 6 shows the detection accuracy and the early detection efficiency (TS-Acc) of STN in the first 6 and 96 hours with different encoders. First, we can see that the traditional machine learning model LR already achieves good performance. Second, among all the deep learning encoders, CNN obtains the best performance in both settings. Moreover, by comparing the results in Table 3 and Table 6, we can see that all the deep learning encoders outperform all the state-of-the-art models, which indicates the effectiveness and generalization ability of our STN model.

Discussion on the Incremental Training
Finally, to verify the effect of the time-evolving fusion module, we replace this module of STN with a standard GRU, which degenerates STN into an integrated training model (GRU with Kleinberg, GK). As shown in Table 7, our incremental training method (i.e., STN) consistently performs better than GRU and GK, which demonstrates the usefulness of our time-evolving fusion module. Moreover, in our experiments, we also find that since the prediction of the current state depends on the previous state in our time-evolving fusion module, events that are predicted correctly in earlier states rarely change in follow-up states, while most events that are wrongly predicted in earlier states are largely corrected in follow-up states. This further proves the effectiveness of our STN model.

Conclusion and Future Work
In this paper, we first introduce the Kleinberg algorithm to identify the propagation states of an event composed of a sequence of posts and segment the sequence into several state-independent sub-events. On this basis, we propose a state-independent and time-evolving network (STN) for rumor detection as well as early rumor detection. We also present a new metric, called Time-series Smoothing Accuracy (TS-Acc), for measuring the efficiency of early rumor detection. The experimental results on two real-world microblog rumor datasets demonstrate the advantages of our STN approach in terms of both rumor detection accuracy and our proposed TS-Acc metric, in comparison with several strong rumor detection systems.
One disadvantage of this work is that the Kleinberg algorithm is performed on the combination of all events in the dataset in order to maintain global states. This approach may fail to capture the individual state transitions within single events. Second, it is a retrospective algorithm that requires all posts along the timeline to be provided in advance. Therefore, one direction for future work is to explore an online state detection algorithm that can be performed on each event while still ensuring that the states are globally defined. It would be even better if the state detection and segmentation step could be integrated with the subsequent state-independent feature extraction and rumor detection in an end-to-end framework.