Early Detection of Fake News by Utilizing the Credibility of News, Publishers, and Users based on Weakly Supervised Learning

The dissemination of fake news significantly affects personal reputation and public trust. Fake news detection has recently attracted tremendous attention, and previous studies mainly focused on finding clues in news content or diffusion paths. However, the features these models require are often unavailable or insufficient in early detection scenarios, resulting in poor performance; early fake news detection thus remains a tough challenge. Intuitively, news from trusted and authoritative sources, or shared by many users with a good reputation, is more reliable than other news. Using the credibility of publishers and users as prior, weakly supervised information, we can quickly locate fake news among massive amounts of news and detect it in the early stages of dissemination. In this paper, we propose a novel structure-aware multi-head attention network (SMAN), which combines the news content with the publishing and reposting relations of publishers and users to jointly optimize the fake news detection and credibility prediction tasks. In this way, we can explicitly exploit the credibility of publishers and users for early fake news detection. We conducted experiments on three real-world datasets, and the results show that SMAN can detect fake news within 4 hours with an accuracy of over 91%, which is much faster than state-of-the-art models.


Introduction
The widespread dissemination of fake news has a significant impact on personal reputation, public trust, and security. For example, spreading misinformation about COVID-19, such as "Asians are more vulnerable to novel coronavirus", has very serious repercussions, making people ignore the harmfulness of the virus and directly affecting public health. Research has shown that misinformation spreads faster, farther, deeper, and more widely than true information (Vosoughi et al., 2018). Therefore, fake news detection on social media has recently attracted tremendous attention in both research and industry.
Early research on fake news detection mainly focused on designing effective features from various sources, including textual content, user profile data, and news diffusion patterns. Linguistic features, such as writing styles and sensational headlines (Kwon et al., 2013) and lexical and syntactic features (Potthast et al., 2017), have been explored to separate fake news from true news. Apart from linguistic features, some studies also proposed a series of user-based features (Castillo et al., 2011; Shu et al., 2018) and temporal features (Kwon et al., 2013) of news diffusion. However, these feature-based methods are time-consuming, biased, and labor-intensive to design. Moreover, such features are easily manipulated by users.
To solve the above problems, many recent studies (Ma et al., 2016; Yu et al., 2017; Guo et al., 2018; Yuan et al., 2019) apply various neural networks to automatically learn high-level representations for fake news detection. For example, recurrent neural networks (RNN) (Ma et al., 2016), convolutional neural networks (CNN) (Yu et al., 2017), matrix factorization, and graph neural networks (Yuan et al., 2019) have been applied to learn representations of the content and diffusion graph of news. These methods exploit more types of information for fake news detection but pay little attention to early detection. Moreover, these models can only detect fake news given all, or a fixed proportion of, the repost information, so in practice they cannot detect fake news in the early stage of news propagation (Song et al., 2018). Some studies (Liu and Wu, 2018; Song et al., 2018) have explored detecting fake news early by relying on a minimum number of posts. The main limitation of these methods is that they ignore the importance of publishers' and users' credibility for the early detection of fake news. When we humans see a piece of breaking news, we may first use common sense to judge whether it contains factual errors. At the same time, we also consider the reputation of the publisher and the reposting users. People tend to believe news from a trusted and authoritative source, or news shared by many users with a good reputation. If the publisher is reliable, we tend to believe the news. On the other hand, if the news is reposted by many low-reputation users in a short period, some spammers may be trying to hype the news (Chen and Chen, 2015; Vosoughi et al., 2018), which lowers its credibility. Inspired by these observations, we explicitly take the credibility of publishers and users as supervised information and model fake news detection as a multi-task classification problem.
We can annotate a small portion of publishers and users based on their historical publishing and reposting behaviors. Although the credibility of publishers and users does not always provide correct information, it is necessary complementary supervised information for fake news detection. To generalize the credibility information to other, unannotated users, we construct a heterogeneous graph that connects publishers, news, and users. Through a graph-based encoding algorithm, every node in the graph is influenced by the credibility of publishers and users.
In this paper, we address the following challenges: (1) how to fully encode the heterogeneous graph structure and news content; and (2) how to explicitly utilize the credibility of publishers and users to facilitate early detection of fake news. To tackle these challenges, we propose a novel structure-aware multi-head attention network for early detection of fake news. Firstly, we design a structure-aware multi-head attention module to learn the structure of the publishing graph and produce the publisher representations for the credibility prediction of publishers. Then, we apply the same module to encode the diffusion graph of the news among users and generate user representations for the credibility prediction of users. Finally, we apply a convolutional neural network to map the news text from the word embedding space to a semantic space and utilize a fusion attention module to combine the news, publisher, and user representations for early fake news detection.
The contributions of this paper can be summarized as follows:
• We propose a novel strategy that explicitly takes the credibility of publishers and users as weakly supervised information to facilitate early detection of fake news.
• We provide a principled way to jointly utilize the credibility of publishers and users, and the heterogeneous graph for credibility prediction and fake news detection.
• We conduct extensive experiments on three real-world datasets. Experimental results show that our model achieves significant improvement over state-of-the-art models on both fake news detection and early detection tasks.
Related Work

Feature-based Methods
Early studies in fake news detection concentrate on designing some good features for separating fake news from true news. These features are mainly extracted from text content or users' profile information.
Linguistic patterns, such as special characters and keywords (Castillo et al., 2011), writing styles and sensational headlines (Kwon et al., 2013), lexical and syntactic features (Feng et al., 2012; Potthast et al., 2017), and temporal-linguistic features (Ma et al., 2015; Zhao et al., 2015a), have been explored to detect fake news. Apart from linguistic features, some studies also proposed a series of user-based features (Castillo et al., 2011; Yang et al., 2012), e.g., the number of fans, registration age, and gender (Castillo et al., 2011), to find clues for fake news detection. However, the language used on social media is highly informal and ungrammatical, which makes it hard for traditional natural language processing techniques to effectively learn semantic information from news content. Second, designing effective features is often time-consuming and relies heavily on expert knowledge of specific fields. Third, some features are often unavailable or inadequate in the early stage of news propagation.

Deep Learning Methods
Recurrent neural networks (RNN) (Ma et al., 2016), convolutional neural networks (CNN) (Yu et al., 2017), and graph neural networks (Yuan et al., 2019) have been applied to learn representations from news content or the diffusion graph. Some studies also combine news content with users' responses, such as conflicting viewpoints (Jin et al., 2016), topics (Guo et al., 2018), or stances (Bhatt et al., 2018), to find clues for fake news detection with neural networks. These methods exploit more types of information for fake news detection but pay little attention to early detection.
Recently, some studies (Liu and Wu, 2018; Song et al., 2018) have proposed methods to detect fake news at the early stage of propagation. However, these methods ignore the importance of publishers' and users' credibility for the early detection of fake news. Different from these studies, our method explicitly takes the credibility of publishers and users as weakly supervised information to facilitate fake news detection. We propose a novel deep learning model to simultaneously optimize the fake news detection task and the credibility prediction task.

Problem Formulation
Let N = {m_1, m_2, . . . , m_{|N|}} be the set of news. Each news item m_j has at least one publisher and is reposted by at most K users {R_1, R_2, . . . , R_K}. The publisher-news relations form a publishing graph G(V_p, E), and the publisher-user relations form a diffusion graph G(V_u, E). In the diffusion graph of a news item, we regard the users who repost it as neighbor nodes of the publisher. We use |P|, |N|, and |U| to denote the number of publishers, news items, and users, respectively. For the fake news detection task, our target is to learn a function p(c | m_j, P, N, U; θ_3) that predicts whether a piece of news is fake, where c is the class label of the news and θ_3 represents all parameters of the model.
In this paper, we design a credibility prediction subtask to explicitly utilize the publishers' and users' credibility information for fake news detection. For the credibility prediction task, our goal is to learn a function p(c | G(V_p, E), P; θ_1) or p(c | G(V_u, E), U; θ_2) that predicts the credit scores of publishers or users from the publishing graph or diffusion graph.
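To make the formulation concrete, here is a minimal sketch of how the publishing and diffusion relations can be encoded as adjacency matrices. All ids are toy values, and the user-user connectivity rule below (linking users who repost the same news) is a simplifying assumption for illustration; the paper connects reposting users to the publisher in the diffusion graph.

```python
import numpy as np

# Hypothetical toy data: 2 publishers, 3 news items, 4 users.
# publishes: (publisher_id, news_id); reposts: (user_id, news_id).
publishes = [(0, 0), (0, 1), (1, 2)]
reposts = [(0, 0), (1, 0), (2, 1), (3, 2)]
P, N, U = 2, 3, 4

# Publishing graph G(V_p, E): A_pn[i, j] = 1 iff publisher i published news j.
A_pn = np.zeros((P, N))
for p, n in publishes:
    A_pn[p, n] = 1.0

# A simplified user-user adjacency for the diffusion graph:
# A_uu[i, j] = 1 iff users i and j reposted the same news item.
A_uu = np.zeros((U, U))
for u1, n1 in reposts:
    for u2, n2 in reposts:
        if n1 == n2:
            A_uu[u1, u2] = 1.0

print(A_pn.sum(axis=1))  # news published per publisher -> [2. 1.]
```

The degree sums of these matrices are exactly the diagonal normalizers D used by the attention module in the next section.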

The Proposed Framework
The proposed framework consists of three major components: (1) publisher credibility prediction; (2) user credibility prediction; and (3) fake news classification. Figure 1 illustrates the architecture of the proposed model.

Publisher Credibility Prediction
In recent years, the multi-head attention mechanism (Vaswani et al., 2017) has shown a superior ability to learn semantic representations of documents in natural language processing, which inspires us to extend it to learn node representations for graph representation learning. In this paper, we extend multi-head attention (Vaswani et al., 2017) into a structure-aware multi-head attention module to encode the structure of the graph and learn node representations from the publishing graph.
The structure-aware multi-head attention module has three input items: the query, the key, and the value, namely Q ∈ R^{n_q×d}, K ∈ R^{n_k×d}, and V ∈ R^{n_v×d}, where n_q, n_k, and n_v denote the number of nodes in each item and d is the dimensionality of the node embeddings.

Figure 1: The architecture of the proposed fake news detection model.

The attention module first takes each node in the query to attend to all nodes in the key item via a dot-product attention unit. In fact, however, it is impossible for each node to establish connections with all nodes in the social graph. Thus, we encode the adjacency relations of the graph structure into the attention module through the adjacency matrix A^{pn} ∈ R^{|P|×|N|}, whose element A^{pn}_{ij} = 1 if publisher i published news j. Finally, we apply the attention weights to the value item:

Attention(Q, K, V)_h = (softmax((Q W_h)(K W_h)^T / √d) ⊙ Â^{pn}) V,   Â^{pn} = D_p^{-1/2} A^{pn} D_n^{-1/2},   (1)

where W_h ∈ R^{d×d} is a transformation matrix, D^p_{ii} = Σ_j A^{pn}_{ij} and D^n_{jj} = Σ_i A^{pn}_{ij} are diagonal degree matrices used to normalize the adjacency matrix A^{pn}, and ⊙ denotes the element-wise product.
The entries of V are then linearly combined with the weights to form a new representation of Q. In this way, the structure-aware attention module can capture relations across query nodes and key nodes, and further use the relations to aggregate embeddings in the query to produce new node representations. We usually let K = V. Therefore, every node in Q is represented by its most similar nodes in V.
Since each attention head captures relations among Q, K, and V from one aspect, we extend the single-head attention to a multi-head schema: Q, K, and V are dispatched to H heads. Specifically, for each h ∈ [1, H], the output of head h is given by:

head_h = Attention(P, N, N)_h,   (2)

where P ∈ R^{|P|×d} is the publishers' embedding matrix, N ∈ R^{|N|×d} is the news embedding matrix, and H is the number of heads in the attention module. Every publisher and news item is mapped to a d-dimensional embedding by its id, and the vector is initialized from a normal distribution (Glorot and Bengio, 2010). Then, the output features of the heads are concatenated, and a fully-connected layer transforms them into the final output:

P̂ = ELU([head_1; head_2; . . . ; head_H] W_o),   (3)

where W_o ∈ R^{Hd×d} is a linear transformation matrix and ELU(·) is the activation function. After the above procedure, we obtain the publishers' representations P̂ ∈ R^{|P|×d}. Finally, we use these features to predict the publishers' credibility:

ŷ^{(p)} = softmax(P̂ W_p + b_p),   (4)

where W_p ∈ R^{d×|c|} is a trainable matrix, b_p is a bias term, and |c| is the number of credibility levels. The credit scores have three levels (|c| = 3): "unreliable", "uncertain", and "reliable". The annotation of credibility will be introduced in Section 5.1. The publisher credibility prediction task is thus a classification task, optimized with the cross-entropy loss:

L_1(θ_1) = − Σ_i y^{(p)}_i log ŷ^{(p)}_i + λ‖θ_1‖²_2,   (5)

where y^{(p)}_i is the true credibility of publisher i and θ_1 denotes all parameters trained in this subtask. We apply L2 regularization on all parameters of the model to alleviate overfitting, with λ as the regularization factor.
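The structure-aware attention above can be sketched as follows, under the assumption that the adjacency mask is combined with scaled dot-product attention by element-wise product as in Equation (1); the degree normalization and ELU output follow the definitions in the text, while all shapes, weights, and the toy graph are illustrative stand-ins.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def structure_aware_attention(Q, K, V, A, W_h):
    """One head: dot-product attention masked by the normalized adjacency A.
    Q: (n_q, d), K/V: (n_k, d), A: (n_q, n_k), W_h: (d, d)."""
    d = Q.shape[1]
    # Degree-normalize: D_p^{-1/2} A D_n^{-1/2} (epsilon guards empty rows/cols).
    d_p = np.maximum(A.sum(axis=1), 1e-12)
    d_n = np.maximum(A.sum(axis=0), 1e-12)
    A_hat = A / np.sqrt(d_p)[:, None] / np.sqrt(d_n)[None, :]
    scores = softmax((Q @ W_h) @ (K @ W_h).T / np.sqrt(d))
    # Element-wise product restricts attention to connected nodes.
    return (scores * A_hat) @ V

def multi_head(Q, K, V, A, Ws, W_o):
    heads = [structure_aware_attention(Q, K, V, A, W) for W in Ws]
    Z = np.concatenate(heads, axis=1) @ W_o   # concat heads, then project
    return np.where(Z > 0, Z, np.exp(np.minimum(Z, 0)) - 1)  # ELU

rng = np.random.default_rng(0)
n_pub, n_news, d, H = 2, 3, 8, 2
A_pn = np.array([[1., 1., 0.], [0., 0., 1.]])        # toy publishing graph
P_emb = rng.normal(size=(n_pub, d))
N_emb = rng.normal(size=(n_news, d))
Ws = [rng.normal(size=(d, d)) for _ in range(H)]
W_o = rng.normal(size=(H * d, d))
P_hat = multi_head(P_emb, N_emb, N_emb, A_pn, Ws, W_o)  # publisher reps
```

Each row of `P_hat` is a publisher representation aggregated only from the news that publisher actually delivered, which is the structural constraint the module adds over plain multi-head attention.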

User Credibility Prediction
As with the publisher credibility prediction task, we apply user credibility as weakly supervised information to facilitate fake news detection. Firstly, we construct the diffusion graph G(V_u, E) of the news, which records how news propagates from publishers to other users. The nodes V_u of the graph belong to the user set, and the edges denote the diffusion traces.
Suppose that every news item is reposted by at most K different users. We use a matrix R ∈ R^{|U|×K} to record the ids of the users who have reposted the news; '0' is padded at the start of a row of R when the number of reposting users is less than K. We again apply structure-aware multi-head attention to learn the user node representations from the diffusion graph. The attention unit is defined as follows:

Attention(R_j, U, U)_h = (softmax((R_j W_h)(U W_h)^T / √d) ⊙ Â^{uu}) U,   Â^{uu} = D_u^{-1/2} A^{uu} D_u^{-1/2},   (6)

where W_h ∈ R^{d×d} is a transformation matrix and D^u_{ii} = Σ_j A^{uu}_{ij} is a diagonal degree matrix used to normalize the adjacency matrix A^{uu}. The complete computation process is shown in Algorithm 1, which calculates Z_h = Attention(R_j, U, U)_h by Equation (6) for each head. To learn rich representations from different reposting relations, we extend the structure-aware attention to a multi-head paradigm: H independent attention units execute the transformation of Equation (6), and their output features are concatenated, resulting in the user representations.
Finally, we use these user representations R̂ ∈ R^{|U|×K×d} to predict the users' credibility scores:

ŷ^{(u)}_{ij} = softmax(R̂_{ij} W_r + b_r),   (7)

where i ∈ [1, |U|], j ∈ [1, K], W_r ∈ R^{d×|c|} is a trainable matrix, |c| is the number of credibility levels, and b_r ∈ R^{|c|} is a bias term.
The credit scores of users are annotated in the same way as those of publishers. We apply the cross-entropy loss as the optimization function:

L_2(θ_2) = − Σ_i Σ_j y^{(u)}_{ij} log ŷ^{(u)}_{ij} + λ‖θ_2‖²_2,   (8)

where y^{(u)}_{ij} is the credibility of user u_{ij} and θ_2 denotes all parameters trained in this subtask.
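A minimal sketch of the user credibility head: a softmax over |c| = 3 levels per user slot, with padded slots excluded from the cross-entropy. The padding mask is an assumption motivated by the '0'-padding of R described above; labels and shapes are toy values.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy shapes: 2 news rows, K = 3 repost slots, d = 4, |c| = 3 levels.
rng = np.random.default_rng(1)
R_hat = rng.normal(size=(2, 3, 4))          # user representations R̂
W_r, b_r = rng.normal(size=(4, 3)), np.zeros(3)

y_hat = softmax(R_hat @ W_r + b_r)          # per-slot credibility distribution
y_true = np.array([[0, 1, 2], [2, 0, 0]])   # annotated levels (toy labels)
mask = np.array([[1, 1, 1], [1, 0, 0]])     # 0 marks padded repost slots

# Cross-entropy L_2, averaged over real (unpadded) user slots only.
logp = np.log(np.take_along_axis(y_hat, y_true[..., None], axis=-1))[..., 0]
loss = -(logp * mask).sum() / mask.sum()
```

Masking keeps the padded '0' slots from contributing spurious gradients to the user credibility objective.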

Fake News Classification
For fake news classification, we combine the news content with the publishing and diffusion graphs to more comprehensively capture the differences in the content and diffusion patterns of true and fake news.

News Content Representation
Many natural language processing models can be used to learn text representations from word sequence embeddings, such as CNNs (Kim, 2014; Kalchbrenner et al., 2014) and RNNs (Tai et al., 2015; Yang et al., 2016). For a fair comparison, we also apply a CNN (Kim, 2014) as the basic component to learn the representation of news, the same as in (Yuan et al., 2019).
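A toy sketch of the Kim-style CNN encoder assumed here: 1-D convolutions of several widths over word embeddings, each followed by max-over-time pooling. The tanh nonlinearity and the tiny filter counts are illustrative choices, not the paper's settings (which use widths (3, 4, 5) with 100 kernels each).

```python
import numpy as np

def text_cnn(E, widths=(3, 4, 5), n_kernels=2, rng=None):
    """Kim-style text CNN over word embeddings E of shape (seq_len, d)."""
    rng = rng or np.random.default_rng(3)
    L, d = E.shape
    feats = []
    for w in widths:
        F = rng.normal(size=(n_kernels, w, d))   # n_kernels filters of width w
        conv = np.stack([
            np.tanh(np.einsum('kwd,wd->k', F, E[t:t + w]))
            for t in range(L - w + 1)
        ])                                        # (L - w + 1, n_kernels)
        feats.append(conv.max(axis=0))            # max-over-time pooling
    return np.concatenate(feats)                  # len(widths) * n_kernels

# A 10-word "news item" with 8-dimensional word embeddings.
m_j = text_cnn(np.random.default_rng(4).normal(size=(10, 8)))
```

With the paper's settings (3 widths, 100 kernels each, d = 100), the output dimension is 300, matching the 3d content representation m_j used below.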

Fusion Attention Unit
After content encoding, we have obtained the content representation m_j ∈ R^{3d} of news m_j from its word embeddings via the CNN. We now introduce how to fuse the publisher, user, and content representations for classification.
Firstly, we find the publisher id p_i from the publishing and diffusion graphs by the news id m_j. Then, we look up the publisher representation P̂_i ∈ R^d from the publisher representation table P̂ by p_i. In the same way, we look up the user representations R̂_i ∈ R^{K×d} from the user representation table R̂; R̂_i denotes the K users who have reposted the news m_j.
We aggregate the reposting users' embeddings R̂_i ∈ R^{K×d} by an attention module:

α = softmax(N_j R̂_i^T / √d),   c_j = α R̂_i,   (9)

where N_j ∈ R^{1×d} is the embedding of news m_j looked up from the news embedding table N. Then, we fuse the publisher representation and the combined user representation by a heuristic method:

m̃_j = ELU([P̂_i; c_j; P̂_i ⊙ c_j; P̂_i − c_j] W_F + b_F),   (10)

where W_F ∈ R^{4d×d} is a transformation matrix and b_F ∈ R^d is a bias term. The news content representation m_j captures the semantic differences between fake and true news, while m̃_j captures the differences from the diffusion graph. Both representations are important for fake news detection, so they are concatenated as the final features, and a fully-connected layer projects them into the space of class probabilities:

ŷ^{(n)}_j = softmax([m_j; m̃_j] W_m + b),   (11)

where W_m ∈ R^{4d×|c|} is a transformation matrix and b ∈ R^{|c|} is a bias term. Finally, the cross-entropy loss is used as the optimization objective for fake news detection:

L_3(θ_3) = − Σ_j y^{(n)}_j log ŷ^{(n)}_j + λ‖θ_3‖²_2,   (12)

where y^{(n)}_j is the gold class probability of news m_j. To simultaneously optimize the credibility prediction tasks and the fake news detection task, we combine all objectives as follows:

L(θ) = L_1(θ_1) + L_2(θ_2) + L_3(θ_3),   (13)

where θ = {θ_1, θ_2, θ_3} represents all parameters of the model SMAN.
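The fusion steps can be sketched end-to-end as below. Note that the exact form of the "heuristic method" is an assumption inferred from the shape W_F ∈ R^{4d×d}: concatenating [P̂_i; c_j; P̂_i ⊙ c_j; P̂_i − c_j] is one common matching heuristic that produces a 4d input, but the paper may use a different combination. All tensors are toy stand-ins.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def elu(x):
    return np.where(x > 0, x, np.exp(np.minimum(x, 0)) - 1)

rng = np.random.default_rng(2)
d, K, c = 4, 3, 2
m_j = rng.normal(size=3 * d)        # CNN content representation (3d)
N_j = rng.normal(size=d)            # news embedding (attention query)
P_i = rng.normal(size=d)            # publisher representation
R_i = rng.normal(size=(K, d))       # K reposting-user representations

# Attend over the reposting users with the news embedding as the query.
alpha = softmax(N_j @ R_i.T / np.sqrt(d))
u = alpha @ R_i                      # combined user representation c_j

# Assumed heuristic fusion: [P_i; u; P_i * u; P_i - u] projected to d.
W_F, b_F = rng.normal(size=(4 * d, d)), np.zeros(d)
fused = elu(np.concatenate([P_i, u, P_i * u, P_i - u]) @ W_F + b_F)

# Final classifier over [m_j; fused], which is 3d + d = 4d dimensional.
W_m, b = rng.normal(size=(4 * d, c)), np.zeros(c)
probs = softmax(np.concatenate([m_j, fused]) @ W_m + b)
```

The dimensional bookkeeping here (3d content + d graph representation = 4d) is what fixes the shape of W_m in the final layer.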

Experiments
In this section, we introduce the experiments conducted to evaluate the effectiveness of SMAN. Specifically, we aim to answer the following evaluation questions:
• EQ1: Can SMAN improve fake news classification performance by jointly optimizing the fake news detection task and the publishers' and users' credibility prediction tasks?
• EQ2: How effective are publishers' and users' credibility prediction tasks, respectively, in improving the detection performance of SMAN?
• EQ3: Can SMAN improve the performance of fake news early detection task?

Datasets
We evaluate SMAN on three real-world datasets: Twitter15 (Ma et al., 2017), Twitter16 (Ma et al., 2017), and Weibo (Ma et al., 2016). Table 1 shows the statistics of the three datasets. For a fair comparison, we use the train, validation, and test split of (Yuan et al., 2019), where 10% of the samples are used as the validation set and the rest are split into training and test sets with a ratio of 3:1.
The credit scores of publishers and users in the three datasets are annotated according to the training set. We define three levels of credibility for publishers and users: (1) "0" means "reliable" (the publisher has never delivered fake or unverified news); (2) "1" means "uncertain" (the publisher delivers true news but also publishes false news); (3) "2" means "unreliable" (the publisher always publishes false or unverified news and never publishes true news).
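The annotation rule can be sketched as a small labeling function. The label-set representation is hypothetical; only the three-level rule comes from the text.

```python
# Map the set of news labels seen for a publisher/user in the training set
# to a credibility level. `labels` is a set like {"true", "fake"}.
def credibility_level(labels):
    if "fake" not in labels:
        return 0  # "reliable": never delivered fake or unverified news
    if "true" in labels:
        return 1  # "uncertain": delivers both true and false news
    return 2      # "unreliable": only fake/unverified news, never true news

assert credibility_level({"true"}) == 0
assert credibility_level({"true", "fake"}) == 1
assert credibility_level({"fake"}) == 2
```

Because the labels come from the training split only, this annotation is weak supervision: it is cheap to compute but noisy for ids with little history.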

Baseline Models
We compare our model with a series of fake news detection methods as follows: (1) Feature-based methods: DTC (Castillo et al., 2011): a decision-tree-based model that utilizes a combination of news characteristics. SVM-RBF (Yang et al., 2012): an SVM model with an RBF kernel that utilizes hand-crafted news features. SVM-TS (Ma et al., 2015): an SVM model that utilizes time series to model the variation of news characteristics. DTR (Zhao et al., 2015b): a decision-tree-based method that detects fake news through enquiry phrases. RFC (Kwon et al., 2017): a random forest classifier that utilizes user, linguistic, and structural features. cPTK (Ma et al., 2017): an SVM classifier with a propagation tree kernel that detects fake news by learning temporal-structure patterns.
(2) Deep learning methods: GRU (Ma et al., 2016): an RNN-based model that learns temporal-linguistic patterns from user comments. RvNN (Ma et al., 2018): bottom-up and top-down tree-structured models based on recursive neural networks for fake news detection on Twitter. PPC (Liu and Wu, 2018): a model that detects fake news through propagation path classification with a combination of recurrent and convolutional networks. GLAN (Yuan et al., 2019): a model that jointly encodes the local semantics and global structure of the diffusion graph.

Evaluation Metrics and Parameter Settings
Following previous studies (Liu and Wu, 2018; Ma et al., 2018; Yuan et al., 2019), we adopt accuracy, precision, recall, and F1 score as the evaluation metrics. The parameters of SMAN are updated by the Adam algorithm (Reddi et al., 2018) with default parameters. All word embeddings of the model are initialized with the 300-dimensional word vectors released by Yuan et al. (2019). The convolutional kernel sizes are set to (3, 4, 5) with 100 kernels for each size. The number of heads H in the structure-aware multi-head attention is chosen from {1, 2, . . . , 12} and is set to 7. The λ in Equations (5), (8), and (12) is chosen from {1e-8, 1e-7, . . . , 1e-2} and is set to 1e-6. The source code will be released in the future.

Results and Analysis
To answer EQ1, we compare SMAN with the baselines introduced in Section 5.2 for fake news classification. The experimental results of all methods are shown in Tables 2, 3, and 4. For a fair comparison, the performance of the baselines is directly cited from previous studies (Ma et al., 2018; Liu and Wu, 2018; Yuan et al., 2019). GLAN was the state-of-the-art method at the time of submission. We bold the best performance in each column of all tables. From the tables, we can observe that: (1) Methods based on manually designed features (DTR, DTC, RFC, SVM-RBF, cPTK, and SVM-TS) have poorer performance. This indicates that: 1) hand-crafted features cannot effectively encode the semantic information of news content; and 2) these methods cannot perform deep feature interaction and are thus unable to fully learn the differences between fake and true news.
(2) Deep learning methods (GRU, RvNN, PPC, and GLAN) significantly outperform conventional classifiers using manually designed features. This observation indicates that deep learning models can learn better semantic representations and perform better feature interactions. We can also observe that GLAN is more effective than RvNN and PPC because it deeply integrates local semantics and the global diffusion structure for fake news detection.
(3) SMAN achieves significant improvement compared with GLAN. Different from GLAN, SMAN not only optimizes the fake news detection task but also tries to predict the credibility of publishers and users. The results show that the credibility of publishers and users is critical for learning the differences between fake and true news.

Ablation Study
To answer EQ2, we perform ablation studies over the different modules of SMAN. The experimental results are presented in Table 5. We first evaluate the impact of the publishers' credibility (PC) prediction subtask. We can observe that the performance drops considerably without PC. The PC subtask exploits the publishing relations between publishers and their news to transfer the influence of publisher credibility to news credibility, thus facilitating fake news detection. The ablation results also show that it is very important to explicitly encode the credibility of publishers.
Then, we analyze the influence of the user credibility (UC) prediction subtask. We can observe that the absence of UC also causes a significant performance decline on all datasets. Intuitively, if a piece of news is reposted by many low-reputation users, its credibility is indeed greatly reduced. As with the PC subtask, users' credibility can also be transferred to news credibility through the diffusion graph, thereby improving the detection performance.
Finally, we find that the performance after removing both the publisher and user credibility prediction subtasks is much lower than that of the complete SMAN model, which further shows that the two tasks provide complementary information. Thus, it is essential to jointly optimize the fake news detection and credibility prediction tasks.

Early Detection
For the fake news detection task, one of the most essential goals is to detect fake news as soon as possible in order to intervene in time (Zhao et al., 2015b). To answer EQ3, we compare different methods under different time delays; performance is evaluated by the accuracy obtained when we incrementally add data up to the checkpoint given by the targeted time delay. The accuracy of several competitive models under varying time delays is shown in Figure 2. Within 0 to 4 hours, SMAN significantly outperforms the tree-based and feature-based methods and achieves better performance than the state-of-the-art method, indicating the superior early detection performance of SMAN. In particular, SMAN achieves about 91% accuracy on the Twitter15 and Twitter16 datasets and 95% accuracy on Weibo within 4 hours, which is much faster than most of the baselines.
After 8 hours, our model still significantly surpasses the state-of-the-art method. Using more reposting relations makes the construction of the diffusion graph more complete and allows the influence of credibility to transfer more easily from publishers and users to the news representations. Overall, the experimental results show that SMAN not only improves detection performance but also significantly reduces the time required for detection.

Conclusion
This paper proposes a novel structure-aware multi-head attention network that combines news content with the heterogeneous graph of publishers and users, and jointly optimizes the fake news detection and credibility prediction tasks for early fake news detection. Different from most existing research based on hand-crafted features or deep learning methods, we explicitly treat the credibility of publishers and users as a kind of weakly supervised information to facilitate fake news detection. Extensive experiments conducted on three real-world datasets show that the proposed model significantly surpasses other state-of-the-art models on both the fake news classification and early detection tasks.