Different Absorption from the Same Sharing: Sifted Multi-task Learning for Fake News Detection

Recently, neural networks based on multi-task learning have achieved promising performance on fake news detection, focusing on learning shared features among tasks as complementary features to serve the different tasks. However, in most existing approaches, the shared features are assigned to the different tasks in full, without selection, which may cause useless and even adverse features to be integrated into specific tasks. In this paper, we design a sifted multi-task learning method with a selected sharing layer for fake news detection. The selected sharing layer adopts a gate mechanism and an attention mechanism to filter and select the shared feature flows between tasks. Experiments on two public and widely used competition datasets, i.e., RumourEval and PHEME, demonstrate that our proposed method achieves state-of-the-art performance and boosts the F1-score by more than 0.87% and 1.31%, respectively.


Introduction
In recent years, the proliferation of fake news, with its varied content, high-speed spread, and extensive influence, has become an increasingly alarming issue. A concrete instance was cited by Time Magazine in 2013, when a false announcement of Barack Obama's injury in a White House explosion "wiped off 130 billion US dollars in stock value in a matter of seconds". As another example, an analysis of the 2016 US Presidential Election (Allcott and Gentzkow, 2017) revealed that fake news was widely shared during the three months prior to the election, with 30 million total Facebook shares of 115 known pro-Trump fake stories and 7.6 million of 41 known pro-Clinton fake stories. Therefore, automatically detecting fake news has attracted significant research attention.

Most existing methods devise deep neural networks to capture credibility features for fake news detection. Some methods provide in-depth analysis of text features, e.g., linguistic (Conroy et al., 2015), semantic (Yang et al., 2012), emotional (Wang et al., 2015), and stylistic (Potthast et al., 2017) features. On this basis, some work additionally extracts social context features (a.k.a. meta-data features) as credibility features, including source-based (Castillo et al., 2011), user-centered (Long et al., 2017), post-based, and network-based (Ruchansky et al., 2017) features. These methods have attained a certain level of success. Additionally, recent research (Thorne et al., 2017; Dungs et al., 2018) finds that doubtful and opposing voices against fake news are always triggered along with its propagation. Fake news tends to provoke controversies compared to real news (Mendoza et al., 2010; Zubiaga et al., 2016b). Therefore, stance analysis of these controversies can serve as a source of valuable credibility features for fake news detection.
An effective and novel way to improve the performance of fake news detection is to combine it with stance analysis by building multi-task learning models that jointly train both tasks (Ma et al., 2018a; Kochkina et al., 2018; Li et al., 2018). These approaches model information sharing and representation reinforcement between the two tasks, which expands the valuable features available to each task. However, a prominent drawback of these methods, and even of typical multi-task learning methods like the shared-private model, is that the shared features in the shared layer are sent equally to the respective tasks without filtering, so that some useless and even adverse features are mixed into different tasks, as shown in Figure 1(a). As a result, the network can be confused by these features, which interferes with effective sharing and can even mislead the predictions.
To address the above problems, we design a sifted multi-task learning model with a filtering mechanism (Figure 1(b)) to detect fake news jointly with the stance detection task. Specifically, we introduce a selected sharing layer for each task after the shared layer of the model to filter shared features. The selected sharing layer consists of two cells: a gated sharing cell for discarding useless features and an attention sharing cell for focusing on features that are conducive to the respective tasks. Besides, to better capture long-range dependencies and improve the parallelism of the model, we apply the transformer encoder module (Vaswani et al., 2017) to encode the input representations of both tasks. Experimental results reveal that the proposed model outperforms the compared methods and sets new benchmarks.
In summary, the contributions of this paper are as follows: • We explore a selected sharing layer relying on a gate mechanism and an attention mechanism, which can selectively capture valuable shared features between the tasks of fake news detection and stance detection for their respective tasks.
• The transformer encoder is introduced into our model for encoding the inputs of both tasks, which enhances the performance of our method by taking advantage of its long-range dependency modeling and parallelism.
• Experiments on two public, widely used fake news datasets demonstrate that our method significantly outperforms previous state-of-the-art methods.

Related Work
Fake News Detection Existing studies on fake news detection can be roughly divided into two categories. The first category extracts or constructs comprehensive and complex features manually (Castillo et al., 2011; Ruchansky et al., 2017; Flintham et al., 2018). The second category automatically captures deep features based on neural networks, in one of two ways. One is to capture linguistic features from text content, such as semantics (Wu et al., 2018), writing styles (Potthast et al., 2017), and textual entailments (Oshikawa et al., 2018). The other is to gain effective features from the organic integration of text and user interactions (Wu et al., 2019). User interactions include users' behaviours, profiles, and the networks between users. In this work, following the second way, we automatically learn representations of text and stance information from responses and forwarding (users' behaviour) based on multi-task learning for fake news detection.

Stance Detection
Studies (Lukasik et al., 2016; Zubiaga et al., 2016a) demonstrate that the stance detected from fake news can serve as an effective credibility indicator to improve the performance of fake news detection. The common approach to stance detection in rumours is to capture deep semantics from text content with neural networks (Mohtarami et al., 2018). For instance, Kochkina et al. (2017) propose a branch-nested LSTM model that encodes the text of each tweet while considering the features and labels of previously predicted tweets, which achieved the best performance on the RumourEval dataset. In this work, we utilize the transformer encoder to acquire semantics from responses and forwarding of fake news for stance detection.
Multi-task Learning A collection of improved models (Liu et al., 2019) have been developed based on multi-task learning. In particular, the shared-private model, a popular multi-task learning model, divides the features of different tasks into private and shared spaces, where the shared features, i.e., task-irrelevant features in the shared space, serve as supplementary features for the different tasks. Nevertheless, the shared space usually mixes in some task-relevant features, which introduces noise into the learning of the different tasks. To address this issue, Liu et al. explore an adversarial shared-private model to prevent the shared and private latent feature spaces from interfering with each other. However, these models transmit all shared features in the shared layer to the related tasks without distillation, which disturbs specific tasks due to some useless and even harmful shared features. How to overcome this drawback is the main challenge of this work.

Figure 2: The architecture of the sifted multi-task learning method based on the shared-private model. In particular, the two blue boxes represent the selected sharing layers of stance detection and fake news detection, and the red box denotes the shared layer between tasks.

Method
We propose a novel sifted multi-task learning method built on the shared-private model to jointly train the tasks of stance detection and fake news detection, filtering the original outputs of the shared layer through a selected sharing layer. Our model consists of a 4-level hierarchical structure, as shown in Figure 2. Next, we describe each level of the proposed model in detail.

Input Embeddings
In our notation, a sentence of l tokens is denoted as X = {x_1, x_2, ..., x_l}. Each token is the concatenation of a word embedding and a position embedding. The word embedding w_i of token x_i is a d_w-dimensional vector obtained by a pre-trained Word2Vec model (Mikolov et al., 2013), i.e., w_i ∈ R^(d_w). Position embeddings are vector representations of the positions of words in a sentence. We employ one-hot encoding to represent the position embedding p_i of token x_i, where p_i ∈ R^(d_p) and d_p is the position embedding dimension. Therefore, the embeddings of a sentence are represented as:

E = {[w_1; p_1], [w_2; p_2], ..., [w_l; p_l]} ∈ R^(l×(d_w+d_p))    (1)

where ; denotes concatenation. In particular, we adopt one-hot encoding to embed the positions of tokens, rather than the sinusoidal position encoding of the original transformer (Vaswani et al., 2017). The reason is that our experiments show that, compared with one-hot encoding, sinusoidal position encoding not only increases model complexity but also performs poorly on relatively small datasets.
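As a concrete illustration, the embedding construction above can be sketched in a few lines of numpy. The function name and toy sizes are ours, and the random vectors merely stand in for pretrained Word2Vec embeddings:

```python
import numpy as np

def build_input_embeddings(token_ids, word_vectors, d_p):
    """Concatenate each token's word vector with a one-hot position vector."""
    rows = []
    for pos, tid in enumerate(token_ids):
        w = word_vectors[tid]          # d_w-dimensional word embedding
        p = np.zeros(d_p)              # one-hot position embedding
        p[pos] = 1.0
        rows.append(np.concatenate([w, p]))
    return np.stack(rows)              # shape (l, d_w + d_p)

# toy usage: vocabulary of 10 words, d_w = 4, sentence length up to d_p = 6
rng = np.random.default_rng(0)
vecs = rng.normal(size=(10, 4))        # stand-in for pretrained Word2Vec
E = build_input_embeddings([3, 1, 7], vecs, d_p=6)
print(E.shape)  # (3, 10)
```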

Shared-private Feature Extractor
The shared-private feature extractor is mainly used for extracting shared features and private features among different tasks. In this paper, we apply the encoder module of the transformer (Vaswani et al., 2017) (henceforth, transformer encoder) as the shared-private extractor of our model. Specifically, we employ two transformer encoders to encode the input embeddings of the two tasks as their respective private features, and another transformer encoder to simultaneously encode the input embeddings of both tasks as their shared features. This process is illustrated in the shared-private layer of Figure 2: the red box in the middle denotes the extraction of shared features, and the left and right boxes represent the extraction of the private features of the two tasks. Next, we take the extraction of the private features for fake news detection as an example to elaborate on the transformer encoder. The kernel of the transformer encoder is scaled dot-product attention, a special case of the attention mechanism, which can be precisely described as follows:

Attention(Q, K, V) = softmax(QK^T / √(d_p + d_w)) V    (2)

where Q ∈ R^(l×(d_p+d_w)), K ∈ R^(l×(d_p+d_w)), and V ∈ R^(l×(d_p+d_w)) are the query, key, and value matrices, respectively. In our setting, the queries stem from the inputs themselves, i.e., Q = K = V = E.
To exploit the high parallelizability of attention, the transformer encoder builds a multi-head attention mechanism on top of scaled dot-product attention. More concretely, multi-head attention first linearly projects the queries, keys, and values h times with different linear projections. The h projections then perform scaled dot-product attention in parallel. Finally, the attention results are concatenated and projected once more to obtain the new representation. Formally, multi-head attention can be formulated as follows:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O, head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)    (3)

where W_i^Q, W_i^K, W_i^V, and W^O are projection parameter matrices.
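The two formulas above can be sketched in plain numpy. This is an illustration, not the authors' implementation: untrained random projection matrices stand in for the learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def multi_head_attention(Q, K, V, h, rng):
    # project h times, attend in parallel, concatenate, project once more
    d = Q.shape[-1]
    d_head = d // h
    heads = []
    for _ in range(h):
        Wq, Wk, Wv = (rng.normal(size=(d, d_head)) for _ in range(3))
        heads.append(scaled_dot_product_attention(Q @ Wq, K @ Wk, V @ Wv))
    Wo = rng.normal(size=(h * d_head, d))
    return np.concatenate(heads, axis=-1) @ Wo

rng = np.random.default_rng(1)
E = rng.normal(size=(5, 8))                        # l = 5 tokens, d_p + d_w = 8
out = multi_head_attention(E, E, E, h=2, rng=rng)  # self-attention: Q = K = V = E
print(out.shape)  # (5, 8)
```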

Selected Sharing Layer
In order to select valuable and appropriate shared features for different tasks, we design a selected sharing layer following the shared layer. The selected sharing layer consists of two cells: a gated sharing cell for filtering out useless features and an attention sharing cell for focusing on shared features valuable to specific tasks. This layer is depicted in Figure 2 and Figure 3. In the following, we introduce the two cells in detail.
Gated Sharing Cell Inspired by the forget gate mechanism of LSTM (Hochreiter and Schmidhuber, 1997) and GRU (Chung et al., 2014), we design a single gated cell to filter useless shared features coming from the shared layer. There are two reasons why we adopt a single-gate mechanism. One is that the transformer encoder in the shared layer can efficiently capture long-range dependency features, so these features do not need to be captured repeatedly by the multiple complex gates of LSTM and GRU. The other is that a single-gate mechanism is more convenient to train (Srivastava et al., 2015). Formally, the gated sharing cell can be expressed as follows:

g_fake = σ(W_fake · H_shared + b_fake)    (4)

where H_shared ∈ R^(1×l(d_p+d_w)) denotes the flattened output of the upstream shared layer, W_fake ∈ R^(l(d_p+d_w)×l(d_p+d_w)) and b_fake ∈ R^(1×l(d_p+d_w)) are trainable parameters, and σ is the non-linear sigmoid activation, which makes the final choice of retaining or discarding each feature of the shared layer. The shared features filtered via the gated sharing cell g_fake for the task of fake news detection are then represented as:

G_fake = g_fake ⊙ H_shared    (5)

where ⊙ denotes element-wise multiplication.
Similarly, for the auxiliary task of stance detection, the filtering process in the gated sharing cell is the same as for fake news detection, so we do not repeat it here.
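A minimal numpy sketch of the gated sharing cell follows; random weights stand in for the trained parameters W_fake and b_fake, and the function name is ours:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_sharing_cell(H_shared, W, b):
    # g = sigmoid(H W + b): a per-dimension gate in (0, 1) deciding how much
    # of each shared feature to keep for this task; output is g ⊙ H.
    g = sigmoid(H_shared @ W + b)
    return g * H_shared, g

rng = np.random.default_rng(2)
d = 6
H = rng.normal(size=(1, d))            # flattened shared-layer output
W = rng.normal(size=(d, d))
b = np.zeros((1, d))
filtered, gate = gated_sharing_cell(H, W, b)
```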
Attention Sharing Cell To focus on shared features from the upstream shared layer that are beneficial to a specific task, we devise an attention sharing cell based on the attention mechanism. Specifically, this cell uses the input embeddings of the specific task to weight the shared features, paying more attention to helpful features. The inputs of this cell are two matrices: the input embeddings of the specific task and the shared features of both tasks. The basic attention architecture of this cell, like the shared-private feature extractor, also adopts the transformer encoder (detailed in subsection 3.2). However, in this architecture, the query matrix and key matrix are not projections of the same matrix: the query matrix E_fake is the input embeddings of the fake news detection task, while the key matrix K_shared and value matrix V_shared are projections of the shared features H_shared. Formally, the attention sharing cell can be formalized as follows:

head_i = Attention(E_fake W_i^Q, H_shared W_i^K, H_shared W_i^V)    (6)
A_fake = Concat(head_1, ..., head_h) W^O    (7)

where the dimensions of E_fake, K_shared, and V_shared are all R^(l×(d_p+d_w)), and the dimensions of the remaining parameters in Eqs. (6, 7) are the same as in Eqs. (2, 3). Moreover, in order to guarantee the diversity of the focused shared features, the number of heads h should not be set too large. Experiments show that our method performs best when h is equal to 2.
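The cross-attention pattern of this cell (queries from the task's own embeddings, keys and values from the shared features) can be sketched as follows; the projection matrices are random stand-ins and the function name is ours:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_sharing_cell(E_task, H_shared, Wk, Wv):
    # Queries come from the task's input embeddings, while keys and values
    # are projections of the shared features, so the task re-weights the
    # shared features toward what is useful for it.
    K = H_shared @ Wk
    V = H_shared @ Wv
    d = E_task.shape[-1]
    scores = E_task @ K.T / np.sqrt(d)
    return softmax(scores) @ V

rng = np.random.default_rng(3)
l, d = 4, 8
E_fake = rng.normal(size=(l, d))       # task-specific input embeddings
H_shared = rng.normal(size=(l, d))     # shared-layer features
Wk, Wv = rng.normal(size=(d, d)), rng.normal(size=(d, d))
A_fake = attention_sharing_cell(E_fake, H_shared, Wk, Wv)
```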
Integration of the Two Cells We first convert the outputs of the two cells into vectors G and A, respectively, and then fully integrate the vectors through their absolute difference and element-wise product (Mou et al., 2016):

SSL = [G; A; |G − A|; G ⊙ A]    (8)

where ⊙ denotes element-wise multiplication and ; denotes concatenation.
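The integration step can be illustrated directly (the function name is ours; the matching heuristics follow Mou et al., 2016):

```python
import numpy as np

def integrate_cells(G, A):
    # concatenation, absolute difference, and element-wise product
    return np.concatenate([G, A, np.abs(G - A), G * A])

G = np.array([1.0, 2.0, 3.0])
A = np.array([0.5, 2.0, -1.0])
F = integrate_cells(G, A)
print(F)  # [ 1.   2.   3.   0.5  2.  -1.   0.5  0.   4.   0.5  4.  -3. ]
```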

The Output Layer
As the last layer, softmax functions are applied to perform the classification of the different tasks, emitting a predicted probability distribution for the specific task i:

ŷ_i = softmax(W_i F_i + b_i)    (9)

where ŷ_i is the predicted result and F_i is the concatenation of the private features H_i of task i and the output SSL_i of the selected sharing layer for task i, i.e., F_i = [H_i; SSL_i]. W_i and b_i are trainable parameters. Given the predictions for all tasks, a global loss function forces the model to minimize the cross-entropy between the predicted and true distributions over all tasks:

L = Σ_{i=1}^{N} λ_i L_i(ŷ_i, y_i)    (10)

where L_i is the cross-entropy loss of task i, λ_i is the weight of task i, and N is the number of tasks. In this paper, N = 2, and we give a larger weight λ to the task of fake news detection.
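The output layer and the weighted global loss can be sketched as follows; shapes, names, and the random stand-in parameters are illustrative, not the authors' code:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def task_output(H_private, SSL_out, W, b):
    # F_i = [H_i; SSL_i], then y_i = softmax(W_i F_i + b_i)
    F = np.concatenate([H_private, SSL_out])
    return softmax(W @ F + b)

def global_loss(preds, trues, lambdas):
    # weighted sum of per-task cross-entropies; in the paper N = 2 and the
    # fake news detection task receives the larger weight
    return sum(-lam * np.sum(t * np.log(p))
               for lam, p, t in zip(lambdas, preds, trues))

rng = np.random.default_rng(4)
H, S = rng.normal(size=3), rng.normal(size=3)    # private and selected-shared
W, b = rng.normal(size=(2, 6)), np.zeros(2)       # binary task head
y_fake = task_output(H, S, W, b)
loss = global_loss([y_fake], [np.array([1.0, 0.0])], [0.6])
```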

Datasets and Evaluation Metrics
We use two public datasets for fake news detection and stance detection, i.e., RumourEval (Derczynski et al., 2017) and PHEME (Zubiaga et al., 2016b). We introduce both datasets in detail from three aspects: content, labels, and distribution.
Content. Both datasets contain Twitter conversation threads associated with different newsworthy events, including the Ferguson unrest, the shooting at Charlie Hebdo, etc. A conversation thread consists of a source tweet making a true or false claim and a series of replies. Labels. Both datasets have the same labels for fake news detection and stance detection. Fake news is labeled as true, false, or unverified. Because we focus on classifying true and false tweets, we filter out the unverified tweets. The stance of tweets is annotated as support, deny, query, or comment. Distribution. RumourEval contains 325 Twitter threads discussing rumours, and PHEME includes 6,425 Twitter threads. The thread, tweet, and class distributions of the two datasets are shown in Table 1.
In consideration of the imbalanced label distributions, in addition to the accuracy (A) metric, we add precision (P), recall (R), and F1-score (F1) as complementary evaluation metrics for the tasks. We hold out 10% of the instances in each dataset for model tuning and perform 5-fold cross-validation on the remaining instances throughout all experiments.
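The evaluation protocol (10% held out for tuning, 5-fold cross-validation on the rest) can be sketched as an index-splitting helper; the function is our illustration, not the authors' code:

```python
import numpy as np

def holdout_and_folds(n, dev_frac=0.1, k=5, seed=0):
    # hold out dev_frac of the instances for tuning, then split the
    # remaining indices into k folds for cross-validation
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_dev = int(round(n * dev_frac))
    dev, rest = idx[:n_dev], idx[n_dev:]
    return dev, np.array_split(rest, k)

dev, folds = holdout_and_folds(100)
print(len(dev), [len(f) for f in folds])  # 10 [18, 18, 18, 18, 18]
```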

Settings
Pre-processing - We process useless and inappropriate information in the text by: (1) removing non-alphabetic characters; (2) removing website links from the text content; (3) converting all words to lower case and tokenizing texts.
Parameters - hyper-parameter configuration of our model: for each task, we strictly tune all hyper-parameters on the validation dataset and achieve the best performance via a small grid search. The sizes of word embeddings and position embeddings are set to 200 and 100, respectively. In the transformer encoder, the numbers of attention heads and blocks are set to 6 and 2, respectively, and the dropout rate of multi-head attention is set to 0.7. Moreover, the mini-batch size is 64; the initial learning rate is set to 0.001, the dropout rate to 0.3, and λ to 0.6 for fake news detection.

Baselines

CNN A convolutional neural network model is applied to the text content to capture features similar to n-grams.

TE Tensor Embeddings (Guacho et al., 2018) leverages tensor decomposition to derive concise claim embeddings, which are used to create a claim-by-claim graph for label propagation.

DeClarE Evidence-Aware Deep Learning (Popat et al., 2018) encodes claims and articles with Bi-LSTMs, makes them attend to each other via an attention mechanism, and then concatenates claim source and article source information.
MTL-LSTM A multi-task learning model based on LSTM networks (Kochkina et al., 2018) jointly trains the tasks of veracity classification, rumour detection, and stance detection.
TRNN Tree-structured RNN (Ma et al., 2018b) comprises bottom-up and top-down tree-structured models based on recursive neural networks.
Bayesian-DL Bayesian Deep Learning (Zhang et al., 2019) first adopts a Bayesian approach to represent both the prediction and the uncertainty of a claim, and then encodes replies with an LSTM to update and generate posterior representations.

Compared with State-of-the-art Methods
We perform experiments on the RumourEval and PHEME datasets to evaluate the performance of our method and the baselines. The experimental results are shown in Table 2. We make the following observations: • On the whole, most well-designed deep learning methods, such as ours, Bayesian-DL, and TRNN, outperform feature engineering-based methods like SVM. This illustrates that deep learning methods can better represent the intrinsic semantics of claims and replies.
• In terms of recall (R), our method and MTL-LSTM, both based on multi-task learning, achieve more competitive performance than the other baselines, which indicates that sufficient features are shared among the multiple tasks. Furthermore, our method shows a more noticeable performance boost than MTL-LSTM on both datasets, which suggests that our method obtains more valuable shared features.
• Although our method shows relatively low performance in terms of precision (P) and recall (R) compared with some specific models, it achieves state-of-the-art performance in terms of accuracy (A) and F1-score (F1) on both datasets. Taking into account the trade-off among the different performance measures, this reveals the effectiveness of our method for fake news detection.

Model Ablation
To evaluate the effectiveness of the different components of our method, we ablate it into several simplified models and compare their performance against related methods. These models are described as follows: Single-task A model with the transformer encoder as its encoder layer, trained on fake news detection alone.
MT-lstm The tasks of fake news detection and stance detection are integrated into a shared-private model whose encoder is implemented with an LSTM.

MT-trans The only difference between MT-trans and MT-lstm is that the encoder of MT-trans is the transformer encoder.
MT-trans-G On the basis of MT-trans, MT-trans-G adds a gated sharing cell after the shared layer of MT-trans to filter shared features.
MT-trans-A Unlike MT-trans-G, MT-trans-A replaces the gated sharing cell with an attention sharing cell for selecting shared features.
MT-trans-G-A The gated sharing cell and attention sharing cell are organically combined into a selected sharing layer after the shared layer of MT-trans, called MT-trans-G-A. Table 3 provides the experimental results of these methods on the RumourEval and PHEME datasets. We have the following observations: • Effectiveness of multi-task learning. MT-trans achieves accuracy improvements of about 9% and 15% on the two datasets compared with Single-task, which indicates that multi-task learning is effective for detecting fake news.
• Effectiveness of the transformer encoder. Compared with MT-lstm, MT-trans obtains better performance, which indicates that the transformer encoder has stronger encoding ability than LSTM for news text on social media.
• Effectiveness of the selected sharing layer. Comparing the results of MT-trans, MT-trans-G, MT-trans-A, and MT-trans-G-A shows that MT-trans-G-A achieves the best performance with the help of the selected sharing layer, which confirms the rationality of selectively sharing different features for different tasks.

Error Analysis
Although the sifted multi-task learning method outperforms previous state-of-the-art methods on the two datasets (see Table 2), we observe that it achieves more remarkable performance boosts on PHEME than on RumourEval. We see two reasons for this based on Table 1 and Table 2. One is that the number of training examples in RumourEval (5,568 tweets) is relatively limited compared with PHEME (105,354 tweets), which may be insufficient to train deep neural networks. The other is that PHEME includes more threads (6,425) than RumourEval (325), so PHEME can offer richer credibility features to the proposed method.

Case Study
In order to obtain deeper insight and detailed interpretability regarding the effectiveness of the selected sharing layer of the sifted multi-task learning method, we devise experiments to explore two questions in depth: 1) For different tasks, what effective features does the selected sharing layer in our method obtain? 2) Within the selected sharing layer, what features are learned by the different cells?
Figure 5: (a) In the fake news detection task, the GSC line denotes the weight values g_fake of the gated sharing cell, while the SL line represents the feature weights of H_shared in the shared layer; two horizontal lines give two different borders for determining the importance of tokens. (b) The red and green heatmaps describe the neuron behaviours of the attention sharing cell A_fake in the fake news detection task and A_stance in the stance detection task, respectively.

The Visualization of Shared Features Learned from Two Tasks

We visualize the shared features learned from the tasks of fake news detection and stance detection. Specifically, we first look up the elements with the largest values in the outputs of the shared layer and the selected sharing layer, respectively. These elements are then mapped back to the corresponding values in the input embeddings so that we can identify the specific tokens. The experimental results are shown in Figure 4. We draw the following observations: • Comparing PL-FND and PL-SD, the private features in the private layer differ across tasks. From PL-FND, PL-SD, and SLT, combining the private features with the shared features from the shared layer increases the diversity of features and helps to promote the performance of both fake news detection and stance detection.
• Comparing SL, SSL-FND, and SSL-SD, the selected sharing layers of the different tasks can not only filter tokens from the shared layer (for instance, 'what', 'scary', and 'fact' are present in SL but not in SSL-SD), but also capture tokens helpful to their own task (like 'false' and 'real' in SSL-FND, and 'confirm' and 'misleading' in SSL-SD).

The Visualization of Different Features Learned from Different Cells

To answer the second question, we examine the neuron behaviours of the gated sharing cell and the attention sharing cell in the selected sharing layer, respectively. More concretely, taking the task of fake news detection as an example, we visualize the feature weights of H_shared in the shared layer and show the weight values g_fake of the gated sharing cell, so that we can see which features are discarded as interference, as shown in Figure 5(a). In addition, for the attention sharing cell, we visualize which tokens it attends to, as shown in Figure 5(b). From Figures 5(a) and 5(b), we obtain the following observations: • In Figure 5(a), only the tokens "gunmen, hostages, Sydney, ISIS" receive more attention compared with the vanilla shared-private model (SP-M). In more detail, 'gunmen' and 'ISIS' obtain the highest weights. This illustrates that the gated sharing cell can effectively capture key tokens.
• In Figure 5(b), "live coverage", a prominent credibility indicator, receives more attention in the task of fake news detection than other tokens. By contrast, when the sentence of Figure 5(b) is applied to the task of stance detection, the tokens "shut down" obtain the maximum weight instead of "live coverage". This suggests that the attention sharing cell focuses on different helpful features from the shared layer for different tasks.

Conclusion
In this paper, we explored a sifted multi-task learning method with a novel selected sharing structure for fake news detection. The selected sharing structure fuses a single-gate mechanism for filtering useless shared features with an attention mechanism for paying close attention to features that are helpful to the target tasks. We demonstrated the effectiveness of the proposed method on two public, challenging datasets and further illustrated it through visualization experiments. Several important directions remain for future research: (1) the fusion mechanism of private and shared features; (2) how to better represent the meta-data of fake news for integration into the inputs.