Modeling Evolution of Message Interaction for Rumor Resolution

Previous work on rumor resolution concentrates on exploiting time-series characteristics or modeling topology structure separately. However, how local interactive patterns affect global information assemblage has not been explored. In this paper, we attempt to address this problem by learning the evolution of message interaction. We model confrontation and reciprocity between message pairs via discrete variational autoencoders, which effectively reflect the diversified interactivity of opinions. Moreover, we capture the variation of message interaction using a hierarchical framework to better integrate the information flow of a rumor cascade. Experiments on the PHEME dataset demonstrate that our proposed model achieves higher accuracy than existing methods.


Introduction
With the increasing openness of social media platforms, unverified messages can be easily disseminated from person to person, resulting in tremendous rumor cascades that pose a huge threat to individuals and society. To resolve rumors, we first need to detect statements that are ambiguous at the time of posting, then explore how users share and discuss rumors, and finally assess their veracity as true, false or unverified. This can be represented as a pipeline of sub-tasks: rumor detection, stance classification and rumor verification (Zubiaga et al., 2018a).
Identifying and debunking rumors automatically has been extensively studied in the past few years. State-of-the-art approaches construct sequential representations following a chronological order and then utilize temporal features to capture dynamic signals (Zubiaga et al., 2016; Ma et al., 2016). Although the source content stays invariable, time-series modeling successfully locates modifiers who might import evidence to correct misinformation or stir up enmity to discredit the truth (Zhang et al., 2013). These models generate promising results; however, they ignore local interactions that happen during message diffusion, which are deemed important for the identification of rumors. Figure 1(a) shows a rumor cascade identified as false, in which suspicion is maliciously cast on Ray Radley's role in the appalling Sydney siege. As can be seen, denial of a false rumor tends to evoke affirmative replies, which further confuses the factuality of the message. Besides, disagreement and queries toward descriptive statements are able to trigger drastic discussion and result in validity modification. Although some researchers explore the propagation structure of rumor proliferation (Ma et al., 2017; Kumar and Carley, 2019), they typically rely on rough aggregation of locally successive messages.
Moreover, the evolution of message interaction depicts the global characteristics of rumor cascades, which improves verification performance. Figure 1(b) illustrates the intuition using statistics drawn from the PHEME dataset (Kochkina et al., 2018). It can be seen that denial tweets with supportive parent posts appear frequently in false rumors, especially at an early stage, while unverified rumors constantly stimulate queries behind positive messages over time. As a rumor cascade evolves, with more dialogue context and auxiliary evidence, comprehensively assessing message credibility becomes possible. In order to capture local interactive patterns and explore how interactivity dominates global factuality judgment, we propose to learn conversational message interaction and combine it with propagation structure to improve the performance of rumor resolution. To model message interaction, we learn the latent interactive pattern of a repost toward its original post via discrete variational autoencoders (DVAEs), which have shown great potential in learning categorical latent patterns (interaction patterns in our case). For rumor resolution, the latent variables not only represent a participant's attitude, but can also control how much literal information is reserved for claim confirmation. We then employ an attention-based hierarchical architecture to capture the temporal variation of message interaction.
Our contributions are threefold:
• To the best of our knowledge, this is the first study modeling the interactive patterns of messages, rather than their coarse aggregation, for rumor verification. By exploiting interaction between post pairs, we also make it possible to combine propagation structure with time-series modeling.
• We utilize DVAEs to capture the interactive patterns of online conversational discussion and also interpret the latent representation of message interaction by associating it with stance information.
• Extensive experiments on real-world datasets collected from TWITTER demonstrate that our proposed model outperforms state-of-the-art rumor verification methods by a large margin.

Related Work
Our research is related to two areas: rumor resolution and the application of discrete variational autoencoders.

Rumor Resolution
There have been numerous studies on the constituent tasks of rumor resolution. Traditional approaches (Castillo et al., 2011; Yang et al., 2012; Kwon et al., 2013; Liu et al., 2015) exploit features manually crafted from post text, user profiles and media sources, and use straightforward machine learning algorithms to classify the set of messages. Moreover, rather than only considering properties of individual messages, dynamic time-series structures (Ma et al., 2015) and tree models using propagation patterns (Ma et al., 2017) are effective at depicting the global difference between rumor and non-rumor claims.
To avoid the effort and bias of feature engineering, methods based on deep neural networks have been widely applied and have demonstrated great efficacy at discovering data representations automatically. Ma et al. (2016) employ recurrent neural networks (RNNs) to capture dynamic temporal signals. Yu et al. (2017) use convolutional neural networks (CNNs) to flexibly extract evidential posts. Recently, Zhou et al. (2019) integrate reinforcement learning to select the minimum number of posts required for early rumor detection. Ma et al. (2019) generate less indicative semantic representations via generative adversarial networks to gain better generalization for rumor detection. Besides, since rumor resolution is a coherent process, researchers also combine detection and stance classification with verification under the framework of multi-task learning (Ma et al., 2018; Kochkina et al., 2018; Kumar and Carley, 2019). In summary, deep learning approaches for rumor resolution involve three critical parts: (1) capturing the local attributes of every single message, (2) integrating information flow to acquire a globally coherent representation and (3) exploring the synergy of local and global information to promote holistic performance. However, it is inadequate to learn interaction between messages by simply sharing model parameters and aggregating information. Our work is closely related to methods based on modeling time-series characteristics (Ma et al., 2016). Different from their work, our proposed model manages to learn local interactive patterns to assist the final verdict and employs an attention mechanism to locate messages that significantly influence the classification result. Table 1 lists the fundamental modules that recent research adopts for each part.

Table 1: Fundamental modules (message modeling, cascade modeling and union approach) adopted by recent research.

Application of Discrete Variational Autoencoders
Variational autoencoders (VAEs) are devised to learn low-dimensional latent variables strongly linked with fundamental attributes (Kingma and Welling, 2013) and have shown great promise in smoothly generating diversified sentences from a continuous space (Bowman et al., 2015).
In the setting of VAEs, the latent variables are considered independent and continuous in a Gaussian latent space. For datasets composed of discrete classes, discrete latent variables are more suitable to capture the different distributions over disconnected manifolds. To overcome the problem of training discrete latent variables, Rolfe (2016) proposes discrete variational autoencoders (DVAEs), which assume that the prior distribution over the latent space is characterized by independent categorical distributions.
Especially for text mining, discrete variables are well suited to the holistic properties of text and much friendlier for interpreting categories of natural language such as style, topic and high-level syntactic features. For instance, in neural dialog generation, DVAEs are able to learn underlying dialogue intentions that can be interpreted as actions guiding the generation of machine responses (Wen et al., 2017; Zhao et al., 2018). In this paper, we learn discrete latent variables between inherited post pairs and incorporate them with textual information to model message interaction.

Proposed Model
Resolution of rumor cascades can be formulated as a supervised classification problem. Given a tree-structured TWITTER cascade C which corresponds to a root tweet r_0 and its responsive tweets {r_1, r_2, ..., r_T}, the goal is to recognize the stance Y_i^s of each tweet as support, comment, deny or query, as well as to determine the class Y^v of the cascade as true, false or unverified. From our dataset, for each tweet r_i, its post time t_i and the parent post r_i^p from which it retweets are also available. Our model is based on a hierarchical architecture which consists of two components: (1) interaction modeling, which combines a child post with its parent via DVAEs to generate the message interaction, and (2) evolution capturing, which employs attention-based recurrent neural networks to capture temporal variation and make predictions, as shown in Figure 2.
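The formulation above can be sketched as a simple data structure. This is only an illustrative reading of the problem setup; the class and field names below are our own, not part of any released code.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Tweet:
    """One node r_i in a tree-structured cascade (names are illustrative)."""
    text: str
    post_time: float                # t_i
    parent: Optional[int]           # index of the parent post r_i^p; None for r_0
    stance: Optional[str] = None    # tweet-level Y_i^s: support/comment/deny/query

@dataclass
class Cascade:
    """A cascade C: the root tweet r_0 followed by responses in time order."""
    tweets: List[Tweet]
    veracity: Optional[str] = None  # cascade-level Y^v: true/false/unverified

    def pairs(self):
        """(child, parent) post pairs that interaction modeling operates on."""
        return [(t, self.tweets[t.parent])
                for t in self.tweets if t.parent is not None]
```

A cascade built this way exposes exactly the two label granularities the model predicts: tweet-level stances and one cascade-level veracity class.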

Interaction Modeling
We use the mean of GloVe word vectors to encode the textual information of each post and then employ DVAEs to explore the relationship between post pairs so as to generate a representation of the message interaction.
Post Representation. For each tweet r, we represent the textual information as a sequence of words {w_1, w_2, ..., w_n}. Besides, we extract its post time t and look up the corresponding parent post r^p for further use.
Given a sequence of words {w_1, w_2, ..., w_n}, an embedding layer maps each w_i into a dense vector x_i = E w_i, where E is the embedding matrix and x_i is the embedding of the word w_i. Then we take the average of these word embeddings to obtain the sentence-level representation c. Similarly, we obtain the representation c^p of the corresponding parent post r^p. We have also tried other, more complex methods of sentence representation, including CNNs, RNNs and pretrained BERT embeddings. They are not as effective as in other tasks, since text on TWITTER contains numerous informal expressions and these methods are likely to intensify the semantic gap under the setting of cross-event validation.
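The averaging step above can be sketched as follows; the handling of out-of-vocabulary tokens and empty posts is our assumption, since the paper does not specify it.

```python
import numpy as np

def post_representation(tokens, embeddings, dim=300):
    """Average pre-trained word vectors to get the sentence-level vector c.

    `embeddings` maps token -> vector (e.g. loaded from GloVe). As an
    assumption for this sketch, out-of-vocabulary tokens are skipped and
    a post with no known tokens maps to the zero vector.
    """
    vecs = [embeddings[w] for w in tokens if w in embeddings]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)
```

The same function yields c for a repost and c^p for its parent, so both sides of a post pair live in the same embedding space.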
Latent Interaction Modeling. To model message interaction, we propose to explore the relationship between three random variables: the repost tweet c, the parent post c^p and the latent interactive pattern z. Before introducing our adaptation of DVAEs, we first identify two key properties of tweet claim formulation.
On one hand, the latent meaning of z should be independent of c^p, since contradictory opinions are highly likely to appear after the same original post. On the other hand, different from text generation, the latent action z is the product of the interaction between c and c^p and should reciprocate with the textual information to guide rumor discrimination. Thus, our DVAEs include two critical modules: (1) a recognition network R, q_R(z|c), that recognizes the attitude of a retweet post; and (2) a policy network π, p_π(a|z, c, c^p), that constrains the distribution of z and incorporates textual information to form the interaction a, as shown in Figure 2(b).
In the setting of DVAEs, the latent action z is a series of K-way categorical variables {z_1, z_2, ..., z_M}, where the z_i are independent of each other and M is the number of latent variables. Conditioning on the retweet post c, the recognition network calculates the logits of the latent space with a single fully-connected layer, l_i = W_i c + b_i, where W_i and b_i are the weight matrix and bias vector.
As sampling from the distribution of z via the softmax operation presents a great challenge for back propagation, we apply the Gumbel-Softmax trick to create a differentiable estimator for categorical variables (Maddison et al., 2016; Jang et al., 2016). A random variable g has a standard Gumbel distribution if g = −log(−log(u)) with u ∼ U(0, 1). Let {g_1, g_2, ..., g_K} be an i.i.d. sequence of Gumbel random variables; by adding the Gumbel noise g_k to the k-th component of the logits l_i, the categorical distribution can be appropriately reparameterized. A relaxation introduced by a temperature parameter τ then makes it possible to implement a continuous approximation and provides a guarantee for optimization.
With the Gumbel-Softmax trick, we obtain each element of the posterior distribution q_R(z_i|c) as d_i = softmax((l_i + g)/τ). With higher τ, the vector d_i is much smoother and appears almost continuous. Then the discrete code of each z_i can be acquired.
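The reparameterization just described is short enough to write out directly; this is a minimal numpy sketch of the standard Gumbel-Softmax sampler, not the authors' implementation.

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Relaxed categorical sample d_i = softmax((logits + g) / tau),
    where g_k = -log(-log(u_k)) with u_k ~ U(0, 1) is standard Gumbel noise.
    Higher tau yields a smoother, more uniform d_i."""
    if rng is None:
        rng = np.random.default_rng()
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + g) / tau
    y = y - y.max()            # subtract max for numerical stability
    e = np.exp(y)
    return e / e.sum()

def discrete_code(d):
    """Hard one-hot code of z_i, taken as the argmax of the relaxed sample."""
    z = np.zeros_like(d)
    z[np.argmax(d)] = 1.0
    return z
```

Because the noise enters additively and softmax is differentiable, gradients can flow back to the logits, which is exactly what the plain categorical sampling step prevents.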
In the policy network, we concatenate c and c^p to form the semantic signal and combine it with the learned latent interactive pattern z to generate a control vector a which represents the message interaction, where W_a^0 and W_a^1 are weight matrices, b_a^1 is a bias vector, and ⊕ denotes concatenation. The sigmoid gate allows z to control the degree of semantic information flowing from the post representation.
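The exact gating equation is not reproduced in this excerpt, so the sketch below shows one plausible reading of the description: a sigmoid gate computed from z modulates a transform of the concatenated signal c ⊕ c^p. All shapes and the tanh nonlinearity are our assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def interaction_vector(z, c, c_p, W0, W1, b1):
    """One plausible form of the gated fusion described above (assumption,
    not the paper's exact Equation 5): the gate derived from the latent
    pattern z scales how much of the semantic signal c (+) c^p flows into
    the interaction vector a."""
    signal = np.tanh(W1 @ np.concatenate([c, c_p]) + b1)  # semantic signal
    gate = sigmoid(W0 @ z)                                 # z-controlled gate
    return gate * signal                                   # elementwise product
```

Whatever the precise form, the key design choice survives: when the gate saturates near zero, the literal content is suppressed and a encodes mostly the latent attitude; near one, textual evidence passes through for claim confirmation.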
In order to demonstrate that discrete latent variables are more effective than continuous ones, we also compare performance when following the framework proposed by Bowman et al. (2015) to obtain continuous latent variables z.

Evolution Capturing
After exploring the interactivity between messages, we employ and modify the dynamic time-series model of Ma et al. (2016) to capture the temporal variation of this interactive information, as shown in Figure 2(b). Different from their preprocessing procedure, we remove the tedious process of time-series partitioning, since the average cascade size of our dataset is relatively small and simplifying the data storage structure is friendlier for batch training. Bidirectional LSTM layers are then applied to these sequential message interactions to obtain the intermediate hidden states h_i^j,
where j denotes the j-th LSTM layer and h_i^{j-1} equals a_i at the first layer. We then utilize the inner hidden states to output stance labels ŷ_1^s, ŷ_2^s, ..., ŷ_T^s in the framework of multi-task learning. Although the bidirectional LSTM network can have several layers, we use the first layer of hidden states as the source of the stance output because they are closer to the original local representation.
After obtaining a coherent global representation of each message, an attention pooling layer is used as the last step of integration in order to capture the imbalance of contributions. For the last layer of hidden states h_1, h_2, ..., h_T, we calculate the cascade representation s as u_i = softmax(w_u^T tanh(W_m h_i + b_s)) and s = Σ_i u_i h_i, where W_m and w_u are the weight matrix and vector, b_s is the bias vector and u_i represents the attention weights.
Finally, one linear layer is applied to the cascade representation s to obtain the prediction result ŷ.
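The pooling and prediction steps above can be sketched as follows. This uses one common form of additive attention consistent with the parameters named in the text; the final softmax over veracity classes is our assumption.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(H, W_m, w_u, b_s):
    """Attention pooling over hidden states h_1..h_T:
    u = softmax(w_u^T tanh(W_m h + b_s)), s = sum_i u_i h_i.
    H has shape (T, hidden_dim)."""
    scores = np.array([w_u @ np.tanh(W_m @ h + b_s) for h in H])
    u = softmax(scores)                 # attention weights u_i over messages
    s = (u[:, None] * H).sum(axis=0)    # cascade representation s
    return s, u

def predict(s, W_out, b_out):
    """Final linear layer, here followed by a softmax over the cascade
    classes {true, false, unverified} (an assumption for this sketch)."""
    return softmax(W_out @ s + b_out)
```

The returned weights u_i are also what the later analysis inspects: averaging them per stance pair reveals which interaction types dominate the verdict.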

Joint Learning
On one hand, our proposed model aims at modeling the interactivity between messages; on the other hand, the ultimate goal is to make precise judgments on rumor claims. As a result, the objective of the overall framework has to consider both aspects. We define the loss function with a tradeoff hyperparameter λ that balances the task-oriented loss and the DVAE loss. The first two loss terms are defined on the rumor resolution tasks: we adopt the well-known cross-entropy loss, where N is the number of instances and L is the number of considered classes. The last term is defined on the generation validity of the DVAEs. To carry out inference for interaction modeling, we introduce a parameterized network q_Φ(z|c, c^p) to approximate the posterior distribution p_π(z|c, c^p). Since it is a trainable parameter space, we simplify the expression as q_Φ(z). We can then write the objective of the DVAEs as follows.
Inspired by the decomposition of Zhao et al. (2018), we use cross entropy to approximate the reconstruction loss and derive the KL-divergence in closed form.
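For categorical latents the KL term indeed has a simple closed form. The sketch below assumes a uniform prior over each K-way variable, which is a common choice when the text does not state the prior explicitly.

```python
import numpy as np

def categorical_kl_to_uniform(q):
    """KL(q || Uniform(K)) for one K-way categorical posterior q.

    Closed form: sum_k q_k (log q_k - log(1/K)). The uniform prior is an
    assumption for this sketch; probabilities are clipped to avoid log(0).
    """
    K = len(q)
    q = np.clip(q, 1e-12, 1.0)
    return float(np.sum(q * (np.log(q) - np.log(1.0 / K))))
```

The term is zero when the posterior stays uniform and grows toward log K as it collapses onto one category, so it regularizes how confidently each latent interaction pattern is assigned.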

Data Set
We evaluate our interaction-aware model on a real-world dataset collected from TWITTER, developed by Kochkina et al. (2018). It contains rumor and non-rumor claims related to five breaking news events, and each rumor claim is annotated with its credibility: true, false or unverified. In addition, the dataset constructors supplemented sparse stance annotations (Zubiaga et al., 2018b), so that multi-task learning is able to show its validity and we can carry out further analysis to confirm the effectiveness of message interaction. Of these two tasks, verification is labeled at the cascade level while stance is a tweet-level annotation. PHEME is well suited to our exploration of message interaction, as it consists of a large number of conversational threads in which participants tend to launch discussions rather than merely pass judgment on the source tweet.

Preprocessing and Training Details
We preprocess each tweet with the NLTK toolkit (Bird et al., 2009), following a procedure of removing URLs and @-mentions, tokenizing, lemmatizing, and removing all stop words. GloVe (Pennington et al., 2014) word embeddings with a dimension of 300 are adopted without fine-tuning. For training, we perform leave-one-event-out (LOEO) cross validation (Kochkina et al., 2018). Although it suffers from problems such as ill-balanced instances for each event and semantic inconsistency between events, LOEO is much more representative of the real world and has been adopted by recent research (Kumar and Carley, 2019). Hyperparameters performing best on the development set are fixed and recorded. The network is trained with back propagation using the Adagrad update rule (Duchi et al., 2011). The final hyperparameters of the best-performing network are as follows. For the DVAE module, the number of discrete variables M is set to 4, the number of categories K of each variable is 4, and the temperature τ equals 10. For the integration part, the number of hidden units is 200, with a dropout rate of 0.3. During training, the batch size is set to 32, the maximum number of training epochs is 50, and the tradeoff parameter of the loss terms is 0.4. We assign the verification and stance classification tasks different initial learning rates, namely 1e-5 and 1e-4 respectively, because the two tasks share the same input while most of the stance labels are missing, which requires a larger learning rate to catch up. We have made our code and preprocessed data publicly available.¹

Models for Comparison
We compare our model with the following models:
RNN: An RNN-based model (Ma et al., 2016) with GRUs to capture dynamic textual variation.
CNN: A CNN-based rumor detection model (Yu et al., 2017) to locate key information.
BranchLSTM: A branchLSTM-based network (Kochkina et al., 2018) that combines the detection and stance classification tasks to boost verification.
GCN-RNN: A combination model which uses a GCN to update messages and employs an RNN to acquire the cascade representation.
VAE-RNN: Our proposed model with the discrete latent variables replaced by continuous ones.
DVAE-RNN: Our proposed model, which considers the time-series effect and propagative interactivity at the same time.

Overall Performance
We implement the task of rumor verification and stance classification to evaluate the performance of our proposed model.
Rumor verification. The overall results for rumor verification are shown in Table 2. We can see that our interaction-aware model significantly outperforms all baselines across nearly all metrics, especially in recognizing misleading messages (false rumors), which is extremely important for practical use. On the whole, methods using multi-task learning are more robust than those that do not. Compared with the plain RNN model, introducing local interaction modeling brings a performance improvement, which illustrates that to measure the attributes of a whole cascade, the local attributes of interactions need to be considered. Compared with VAE-RNN, the discrete variables are more representative of the latent interactive patterns, as the input of neural networks is already a form of continuous dense vectors.
Stance classification. Under the framework of multi-task learning, we also test the performance of stance classification, as shown in Table 3. Our proposed model achieves the highest accuracy and macro F1-score, even though some other methods show sudden performance boosts on certain events or stances. The main reason is that stance classification depends much more on the semantics of a tweet and its surrounding claims, and the huge semantic gap between the event-related corpora brings about drastic fluctuation. Compared with VAE-RNN, which converts the discrete variables into continuous ones, the improvement indicates that discrete latent variables are more suitable for representing categorical information.

Table 3: Results for stance classification. MaF: the value of macro F1-score. Bold: the best performance in each column. The 5 columns in the middle report the macro F1-score using each event as the test data. The 4 columns on the right show the averaged F1-score of classifying supporting, commenting, denying and querying messages. '*' denotes values taken from the original publication.

Further Analysis on Interaction Modeling
In order to analyze the effectiveness of DVAEs for interaction modeling, we propose to use the stance information as assistance. Our model attempts to learn the latent vector aroused by a specific post but constrained by the parent-relevant distribution, which means the interaction we model depends heavily on the pair relationship between a parent post and its repost. Besides, with the attention-based integration strategy, we are able to locate what kind of message interaction dominantly determines the classification of rumor cascades. Using the best-performing model, we calculate the average attention weights of different stance pairs to estimate whether interactive patterns assist in verifying rumors. The distribution of attention weights over different interaction patterns is shown in Figure 3. It is obvious that supportive or denial posts whose parents hold the same stance play a critical part in verifying rumors, and discussion aroused by judgmental (supporting/denying) tweets immensely promotes the process of identification.

Hyperparameter Sensitivity
In this section, we explore the influence of three hyper-parameters, namely the trade-off factor λ, the number of discrete latent variables M and categories for each latent variable K.
Impact of λ. In order to investigate the influence of the interactive effect, we set the tradeoff factor λ to 0, 0.2, 0.4, 0.6, 0.8 and 1 respectively to control the dominance of message interaction modeling. As shown in Figure 4, we observe that with λ set to 0.4, our model achieves the highest accuracy for rumor detection and verification. Even when λ descends to 0, the model remains robust as a result of the plain integration of message pairs, and as λ increases our proposed model gradually demonstrates its effectiveness for rumor verification. However, since the assessment criterion is task-oriented, with too large a λ the generation objective of the DVAEs is likely to become unconstrained, so the test accuracy decreases rapidly.
Impact of M and K. Furthermore, we explore the optimal scope of the latent space z by tuning M and K. Through extensive experiments, we confirm that our framework works best when M and K are both set to 4. Figure 4 illustrates the results of varying M and K compared with the plain hierarchical structure. Varying M affects the classification result little; the reason probably lies in the independence of each z_i. However, increasing K brings about a severe decline in prediction accuracy. This is principally because a large K makes it more difficult to approximate the complex posterior distribution.

Conclusion and Future work
In this paper, we propose to model the evolution of message interaction for rumor resolution. The interaction pattern between post-repost pairs is modeled via discrete variational autoencoders, and an attention-based hierarchical architecture is employed to capture the evolution of message interactions. Experimental results on the PHEME dataset show that our framework significantly outperforms the baselines for rumor verification. Further analysis shows that DVAEs are able to model interaction features for better interaction pattern identification. Besides, a closer look at the attention weights shows that some specific types of interactions contribute more to rumor resolution.
In the future, we would like to explore the task of interaction type classification to further analyze the influence of various interaction types on rumor resolution. In addition, it would be interesting to identify those change points along the timeline when misinformation emerges.