DTCA: Decision Tree-based Co-Attention Networks for Explainable Claim Verification

Recently, many methods have been widely recognized for discovering effective evidence from reliable sources via appropriate neural networks for explainable claim verification. However, in these methods the discovery process of evidence is nontransparent and unexplained. Moreover, the discovered evidence targets the interpretability of the whole claim sequence but fails to focus on the false parts of claims. In this paper, we propose a Decision Tree-based Co-Attention model (DTCA) to discover evidence for explainable claim verification. Specifically, we first construct a Decision Tree-based Evidence model (DTE) to select comments with high credibility as evidence in a transparent and interpretable way. Then we design Co-attention Self-attention networks (CaSa) to make the selected evidence interact with claims, which serves two purposes: 1) training DTE to determine the optimal decision thresholds and obtain more powerful evidence; and 2) utilizing the evidence to find the false parts of claims. Experiments on two public datasets, RumourEval and PHEME, demonstrate that DTCA not only provides explanations for the results of claim verification but also achieves state-of-the-art performance, boosting the F1-score by 3.11% and 2.41% respectively.


Introduction
The increasing popularity of social media has brought unprecedented challenges to the ecology of information dissemination, causing the rampancy of a large volume of false or unverified claims, like extreme news, hoaxes, rumors, fake news, etc. Research indicates that during the 2016 US presidential election, fake news accounted for nearly 6% of all news consumption, where 1% of users were exposed to 80% of the fake news and 0.1% of users were responsible for sharing 80% of it (Grinberg et al., 2019), and that democratic elections are vulnerable to manipulation by false or unverified claims on social media (Aral and Eckles, 2019). These findings render the automatic verification of claims a crucial problem.
Currently, the methods for automatic claim verification can be divided into two categories. The first relies on deep neural networks to learn credibility indicators from claim content and auxiliary relevant articles or comments (i.e., responses) (Volkova et al., 2017; Rashkin et al., 2017; Dungs et al., 2018). Despite their effectiveness, these methods find it difficult to explain why claims are true or false in practice. To overcome this weakness, a trend in recent studies (the second category) is to explore evidence-based verification solutions, which focus on capturing fragments of evidence obtained from reliable sources by appropriate neural networks (Popat et al., 2018; Hanselowski et al., 2018; Ma et al., 2019; Nie et al., 2019). For instance, Thorne et al. (2018) build a multi-task learning model to extract evidence from Wikipedia and synthesize information from multiple documents to verify claims. Popat et al. (2018) capture signals from external evidence articles and model joint interactions between various factors, like the context of a claim and the trustworthiness of the sources of related articles, for the assessment of claims. Ma et al. (2019) propose hierarchical attention networks to learn sentence-level evidence from claims and their related articles based on coherence modeling and natural language inference for claim verification.
Although these methods provide evidence to address the explainability of claim verification to some extent, there are still several limitations. First, it is generally hard to interpret the discovery process of evidence for claims; that is, the methods themselves lack interpretability because they are all based on neural networks, which belong to nontransparent black-box models. Secondly, the provided evidence only offers a coarse-grained explanation of claims: it is aimed at the interpretability of the whole claim sequence but insufficient to focus on the false parts of claims.
To address the above problems, we design Decision Tree-based Co-Attention networks (DTCA) to discover evidence for explainable claim verification, which contain two stages: 1) a Decision Tree-based Evidence model (DTE) for discovering evidence in a transparent and interpretable way; and 2) Co-attention Self-attention networks (CaSa) using the evidence to explore the false parts of claims. Specifically, DTE is constructed on the basis of the structured, hierarchical comments on the claim; it considers multiple factors as decision conditions from the perspective of the content and meta data of comments and selects high-credibility comments as evidence. CaSa exploits the selected evidence to interact with claims at a deep semantic level, which serves two roles: one is to train DTE to pursue the optimal decision thresholds and finally obtain more powerful evidence; the other is to utilize the evidence to find the false parts of claims. Experimental results reveal that DTCA not only achieves state-of-the-art performance but also provides interpretability of the results of claim verification and of the selection process of evidence. Our contributions are summarized as follows: • We propose a transparent and interpretable scheme that incorporates a decision tree model into co-attention networks, which not only discovers evidence for explainable claim verification (Section 4.4.3) but also provides an interpretation of the discovery process of evidence through the decision conditions (Section 4.4.2).
• The designed co-attention networks promote deep semantic interaction between evidence and claims, which can train DTE to obtain more powerful evidence and effectively focus on the false parts of claims (Section 4.4.3).
• Experiments on two public, widely used fake news datasets demonstrate that DTCA outperforms previous state-of-the-art methods (Section 4.3.2).

Related Work
Claim Verification Many studies on claim verification extract an appreciable quantity of credibility-indicative features around semantics (Ma et al., 2018b; Khattar et al., 2019; Wu et al., 2020), emotions (Ajao et al., 2019), stances (Ma et al., 2018a; Kochkina et al., 2018; Wu et al., 2019), writing styles (Potthast et al., 2018; Gröndahl and Asokan, 2019), and source credibility (Popat et al., 2018; Baly et al., 2018a) from claims and relevant articles (or comments). As a concrete instance, Wu et al. (2019) devise sifted multi-task learning networks to jointly train stance detection and fake news detection, effectively utilizing the common features of the two tasks to improve performance. Despite reliable performance, these methods for claim verification are unexplainable. To address this issue, recent research concentrates on the discovery of evidence for explainable claim verification, mainly by designing different deep models to exploit semantic matching (Nie et al., 2019; Zhou et al., 2019), semantic conflicts (Baly et al., 2018b; Dvořák and Woltran, 2019; Wu and Rao, 2020), and semantic entailments (Hanselowski et al., 2018; Ma et al., 2019) between claims and relevant articles. For instance, Nie et al. (2019) develop neural semantic matching networks that encode, align, and match the semantics of two text sequences to capture evidence for verifying claims. Combining the pros of recent studies, we aim to perceive explainable evidence through semantic interaction for claim verification.
Explainable Machine Learning Our work is also related to explainable machine learning, which can generally be divided into two categories: intrinsic explainability and post-hoc explainability (Du et al., 2018). Intrinsic explainability (Shu et al., 2019; He et al., 2015; Zhang and Chen, 2018) is achieved by constructing self-explanatory models that incorporate explainability directly into their structures, which requires building fully interpretable models that clearly express the explanation process. However, current deep learning models are black-box models, which makes intrinsic explainability difficult to achieve (Gunning, 2017). Post-hoc explainability (Samek et al., 2017; Wang et al., 2018; Chen et al., 2018) requires designing a second model to provide explanations for an existing model. For example, Wang et al. (2018) combine the strengths of an embeddings-based model and a tree-based model to develop explainable recommendation, where the tree-based model obtains evidence and the embeddings-based model improves recommendation performance. In this paper, following post-hoc explainability, we harness a decision tree model to explain the discovery process of evidence and design co-attention networks to boost the task performance.

Decision Tree-based Co-Attention Networks (DTCA)

In this section, we introduce the decision tree-based co-attention networks (DTCA) for explainable claim verification, with the architecture shown in Figure 1. DTCA involves two stages: the decision tree-based evidence model (DTE) and the co-attention self-attention networks (CaSa), where CaSa consists of a 3-level hierarchical structure, i.e., a sequence representation layer, a co-attention layer, and an output layer. Next, we describe each part of DTCA in detail.

Decision Tree-based Evidence Model (DTE)
DTE is based on the tree of comments (including replies) aiming at one claim. We first build a tree network based on the hierarchical comments, as shown in the left of Figure 2. The root node is a claim, and the nodes at the second level and below are users' comments on the claim (R11, ..., Rkn), where k and n denote the depth of the comment tree and the width of the last level respectively. We try to select comments with high credibility as evidence for the claim, so we need to evaluate the credibility of each node (comment) in the network and decide whether to select the comment or not. Three factors from the perspective of the content and meta data of comments are considered, described as follows:

The semantic similarity between comments and claims. It measures the relevancy between comments and claims and aims to filter irrelevant and noisy comments. Specifically, we adopt the soft cosine measure (Sidorov et al., 2014) between the average word embeddings of the claim and a comment as the semantic similarity.
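As a rough illustration of this similarity step, the sketch below computes a plain cosine similarity between averaged word embeddings; the paper uses the soft cosine measure, which additionally weights term pairs by their relatedness, so this is a simplified stand-in with made-up toy vectors.

```python
import math

def avg_vector(token_vectors):
    """Average a list of word-embedding vectors into one sequence vector."""
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / len(token_vectors) for i in range(dim)]

def cosine(u, v):
    """Plain cosine similarity between two vectors (simplified vs. soft cosine)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy 3-dimensional "embeddings" (hypothetical values, for illustration only).
claim_vecs = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]
comment_vecs = [[0.9, 0.1, 0.0]]

simi = cosine(avg_vector(claim_vecs), avg_vector(comment_vecs))
```

In practice the token vectors would come from a pre-trained embedding model, and the resulting `simi` value feeds the decision tree as one of the three decision conditions.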
The credibility of reviewers. It follows the observation that "reviewers with high credibility usually also have high reliability in their comments" (Shan, 2016). Specifically, we utilize multiple meta-data features of reviewers to evaluate reviewer credibility, i.e., whether the following elements exist or not: verified, geo, screen name, and profile image; and the number of followers, friends, and favorites. Examples are shown in Appendix A.
The credibility of comments. It is based on the meta data of comments and roughly measures comment credibility (Shu et al., 2017), i.e., 1) whether the following elements exist or not: geo, source, and favorite the comment; and 2) the number of favorites and the content length. Examples are shown in Appendix A.
In order to integrate these factors in a transparent and interpretable way, we build a decision tree model which takes the factors as decision conditions to measure node credibility of tree comments, as shown in the grey part in Figure 2.
We represent the structure of the decision tree model as Q = {V, E}, where V and E denote nodes and edges respectively. Nodes in V have two types: decision (a.k.a. internal) nodes and leaf nodes. Each decision node splits on a decision condition xi (one of the three factors) with two decision edges (decision results) based on a specific decision threshold ai. The leaf node gives the decision result (the red circle), i.e., whether the comment is selected or not. In our experiments, if any decision node answers yes, the evaluated comment in the tree comment network is selected as a piece of evidence. In this way, each comment is selected as evidence in a transparent and interpretable manner, i.e., interpreted by the decision conditions.
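The selection rule above can be sketched in a few lines. This is an illustrative reading of the decision tree, not the authors' code: the threshold values and the dictionary keys (`simi`, `r_cred`, `c_cred`) are hypothetical, and in DTCA the thresholds are tuned by training CaSa rather than fixed by hand.

```python
# Hypothetical decision thresholds a_i for the three decision conditions;
# in DTCA these are tuned via CaSa training, not hand-set.
THRESHOLDS = {"simi": 0.5, "r_cred": 0.7, "c_cred": 0.6}

def select_comment(comment):
    """A comment is kept as evidence if any decision node answers yes,
    i.e., any one condition reaches its threshold."""
    return any(comment[name] >= a for name, a in THRESHOLDS.items())

comments = [
    {"simi": 0.47, "r_cred": 0.66, "c_cred": 0.71},  # c_cred >= 0.6 -> selected
    {"simi": 0.10, "r_cred": 0.20, "c_cred": 0.15},  # no condition met -> dropped
]
evidence = [c for c in comments if select_comment(c)]
```

The first toy comment mirrors the characteristic value ranges reported later in Section 4.4.3 (moderate similarity, high reviewer and comment credibility).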
After the comment nodes in the tree network are evaluated by the decision tree model, we leverage a post-pruning algorithm to select comment subtrees as the evidence set for training CaSa (Section 3.2).

Co-attention Self-attention Networks (CaSa)
In DTE, the decision thresholds ai are not fixed; that is, different decision thresholds select different numbers of comments as evidence for CaSa training. To train the decision thresholds in DTE so as to obtain more powerful evidence, and then exploit this evidence to explore the false parts of claims, we devise CaSa to promote the interaction between evidence and claims.
The details of CaSa are as follows:

Sequence Representation Layer
The inputs of CaSa include a sequence of evidence (the evidence set obtained by the DTE model is concatenated into one sequence) and a claim sequence. Given a sequence of l tokens X = {x1, x2, ..., xl}, X ∈ R^(l×d), which can be either the claim or the evidence, each token xi ∈ R^d is a d-dimensional vector obtained from the pre-trained BERT model (Devlin et al., 2019). We encode each token into a fixed-sized hidden vector hi and then obtain the sequence representations of the claim Xc and the evidence Xe via two BiLSTM (Graves et al., 2005) networks respectively:

Re = BiLSTM(Xe), Rc = BiLSTM(Xc), hi = [→hi ; ←hi]

where h is the number of hidden units of the LSTM and ; denotes the concatenation operation. Finally, Re ∈ R^(l×2h) and Rc ∈ R^(l×2h) are the representations of the evidence and claim sequences. Additionally, experiments confirm that the BiLSTM in CaSa can be replaced by a BiGRU (Cho et al., 2014) with comparable performance.
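A minimal sketch of this encoding step in PyTorch, assuming the hyper-parameters reported in Section 4.2 (d = 768, h = 120); the sequence length and random input are placeholders standing in for BERT token embeddings.

```python
import torch
import torch.nn as nn

d, h, l = 768, 120, 16   # BERT embedding size, LSTM hidden units, sequence length
bilstm = nn.LSTM(input_size=d, hidden_size=h, bidirectional=True, batch_first=True)

X = torch.randn(1, l, d)   # stand-in for BERT token embeddings of one sequence
R, _ = bilstm(X)           # forward and backward hidden states, concatenated per token
```

Each token's representation is the concatenation of the forward and backward states, so `R` has size 2h in its last dimension, matching Re, Rc ∈ R^(l×2h).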

Co-attention Layer
Co-attention networks are composed of two hierarchical self-attention networks. In our model, the sequence of evidence first leverages one self-attention network to conduct deep semantic interaction with the claim for capturing the false parts of the claim. Then the semantics of the interacted claim attend to the semantics of the evidence sequence via another self-attention network for concentrating on the key parts of the evidence. The two self-attention networks are both based on the multi-head attention mechanism (Vaswani et al., 2017). Given a matrix of l query vectors Q ∈ R^(l×2h), keys K ∈ R^(l×2h), and values V ∈ R^(l×2h), the scaled dot-product attention, the core of the self-attention networks, is described as:

Attention(Q, K, V) = softmax(QK^T / √dk) V

where dk is the dimension of the keys. In particular, to enable the claim and evidence to interact more directly and effectively, in the first self-attention network, Q = Re_pool (Re_pool ∈ R^(2h)) is the max-pooled vector of the sequence representation of the evidence, and K = V = Rc, the sequence representation of the claim. In the second self-attention network, Q = C, i.e., the output vector of the self-attention network for the claim (detailed in Eq. 7), and K = V = Re, the sequence representation of the evidence.
To achieve high parallelizability of attention, multi-head attention first linearly projects the queries, keys, and values j times with different linear projections, and the j projections then perform the scaled dot-product attention in parallel. Finally, the attention results are concatenated and once again projected to obtain the new representation. Formally, multi-head attention can be formulated as:

MultiHead(Q, K, V) = (head1; ...; headj) W^O, headi = Attention(Q Wi^Q, K Wi^K, V Wi^V)

where Wi^Q, Wi^K, Wi^V, and W^O are projection parameter matrices. Subsequently, the co-attention networks pass through a feed-forward network (FFN) that adds non-linear features while preserving scale-invariant features; it contains a single hidden layer with a ReLU. Finally, to fully integrate the evidence and claim, we adopt the absolute difference and element-wise product to fuse the vectors E and C (Wu et al., 2019).
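The scaled dot-product and multi-head steps can be sketched as follows. This is a generic NumPy illustration of the Vaswani et al. (2017) mechanism the paper builds on, not the authors' implementation: the dimensions are toy values, the projections are random placeholders, and the mean-pooled query stands in for the max-pooled evidence vector.

```python
import numpy as np

def scaled_dot_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

l, two_h, j = 5, 8, 2                 # sequence length, 2*hidden size, heads
rng = np.random.default_rng(0)
R_c = rng.standard_normal((l, two_h))        # claim representation (toy)
q = R_c.mean(axis=0, keepdims=True)          # stand-in for the pooled evidence query

d_head = two_h // j
heads = []
for _ in range(j):                    # each head has its own linear projections
    W_q = rng.standard_normal((two_h, d_head))
    W_k = rng.standard_normal((two_h, d_head))
    W_v = rng.standard_normal((two_h, d_head))
    heads.append(scaled_dot_attention(q @ W_q, R_c @ W_k, R_c @ W_v))
out = np.concatenate(heads, axis=-1)  # concatenated head outputs
```

The concatenated output has the same 2h width as the inputs, which is what allows the second self-attention network to reuse it as its query C.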
F = (E; C; |E − C|; E ⊙ C)

where ⊙ denotes the element-wise multiplication operation and ; denotes concatenation.

Output Layer
As the last layer, a softmax function emits the predicted probability distribution:

ŷ = softmax(Wf F + bf)

where Wf and bf are learned parameters. We train the model to minimize the cross-entropy error for a training sample with ground-truth label y:

L = −∑ y log ŷ

The training process of DTCA is presented in Algorithm 1 of Appendix B.
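A small numeric sketch of this output layer: the logits below are hypothetical scores for the two classes {true, false}, chosen only to show how the softmax prediction and the cross-entropy loss are computed.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(p, y):
    """Cross-entropy of prediction p against a one-hot ground-truth label y."""
    return -sum(yi * math.log(pi) for yi, pi in zip(y, p))

logits = [2.0, 0.5]              # hypothetical output scores for {true, false}
p = softmax(logits)              # predicted probability distribution
loss = cross_entropy(p, [1, 0])  # ground truth: the claim is true
```

Minimizing this loss over the training set drives both the CaSa parameters and, through the feedback described in Section 3.2, the decision thresholds of DTE.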

Experiments
As the key contribution of this work is to verify claims accurately and offer evidence as explanations, we design experiments to answer the following questions:

• RQ1: Can DTCA achieve better performance than state-of-the-art models?
• RQ2: How do decision conditions in the decision tree affect model performance (to say, the interpretability of evidence selection process)?
• RQ3: Can DTCA make verification results easy to interpret through evidence and find the false parts of claims?

Datasets
To evaluate our proposed model, we use two widely used datasets, i.e., RumourEval (Derczynski et al., 2017) and PHEME (Zubiaga et al., 2016). Structure. The two datasets contain 325 and 6,425 Twitter conversation threads respectively, associated with different newsworthy events like Charlie Hebdo, the shooting in Ottawa, etc. A thread consists of a claim and a tree of comments (a.k.a. responses) expressing opinions towards the claim. Labels. Both datasets have the same labels, i.e., true, false, and unverified. Since our goal is to verify whether a claim is true or false, we filter out unverified tweets. Considering the imbalanced label distributions, besides accuracy (A), we adopt precision (P), recall (R), and F1-score (F1) as evaluation metrics for DTCA and the baselines. We divide the two datasets into training, validation, and testing subsets with proportions of 70%, 10%, and 20% respectively.

Settings
We tune all hyper-parameters on the validation set and achieve the best performance via a small grid search. For the hyper-parameter configurations: (1) in DTE, the ranges of semantic similarity, reviewer credibility, and comment credibility are [0, 0.8], [0, 0.8], and [0, 0.7] respectively; (2) in CaSa, the word embedding size d is set to 768; the size of the LSTM hidden states h is 120; the numbers of attention heads and blocks are 6 and 4 respectively; the dropout of multi-head attention is set to 0.8; the initial learning rate is set to 0.001; the dropout rate is 0.5; and the mini-batch size is 64.

Performance Evaluation (RQ1)
4.3.1 Baselines

SVM (Derczynski et al., 2017) is used to detect fake news based on manually extracted features.
CNN (Chen et al., 2017) adopts different window sizes to obtain semantic features similar to n-grams for rumor classification.
TE (Guacho et al., 2018) creates article-by-article graphs relying on tensor decomposition and derives article embeddings for rumor detection.
DeClarE (Popat et al., 2018) presents attention networks to aggregate signals from external evidence articles for claim verification.
TRNN (Ma et al., 2018b) proposes two tree-structured RNN models, top-down and bottom-up, integrating the semantics of structure and content to detect rumors. In this work, we adopt the top-down model, which yields better results, as the baseline.

MTL-LSTM (Kochkina et al., 2018) jointly trains the rumor detection, claim verification, and stance detection tasks and learns the correlations among these tasks.
Bayesian-DL (Zhang et al., 2019) uses a Bayesian approach to represent the uncertainty of the predicted veracity of claims and then encodes responses to update the posterior representations.
Sifted-MTL (Wu et al., 2019) is a sifted multi-task learning model that jointly trains fake news detection and stance detection and adopts gate and attention mechanisms to screen shared features.

Results of Comparison
Table 2 shows the experimental results of all compared models on the two datasets. We observe that:

• SVM, which integrates semantics from claim content and comments, outperforms traditional neural networks that only capture semantics from claim content, like CNN and TE, with at least 4.75% and 6.96% boosts in accuracy on the two datasets respectively, which indicates that the semantics of comments are helpful for claim verification.
• On the whole, most neural network models with semantic interaction between comments and claims, such as TRNN and Bayesian-DL, achieve 4.77% to 9.53% improvements in accuracy on the two datasets over SVM, which lacks any such interaction; this reveals the effectiveness of the interaction between comments and claims.
• TRNN, Bayesian-DL, and DTCA all enable claims and comments to interact, but the first two perform worse than DTCA (at least 1.06% and 1.19% degradation in accuracy respectively). That is because they integrate all comments indiscriminately and might introduce noise into their models, while DTCA picks the more valuable comments via DTE.
• Multi-task learning models, e.g., MTL-LSTM and Sifted-MTL, which leverage stance features, show at most 3.29% and 1.86% boosts in recall over DTCA on the two datasets respectively, but they also bring in noise, suffering 1.06% to 21.11% reductions relative to DTCA on the other three metrics. Besides, DTCA achieves 3.11% and 2.41% boosts over the latest baseline (Sifted-MTL) in F1-score on the two datasets respectively. These results confirm the effectiveness of DTCA.

The impact of comments on DTCA
In Section 4.3, we find that the use of comments can improve the performance of models.To further investigate the quantitative impact of comments on our model, we evaluate the performance of DTCA and CaSa with 0%, 50%, and 100% comments.
The experimental results are shown in Table 3. We make the following observations:

• Models without comment features present the lowest performance, with decreases of 5.08% to 9.76% in accuracy on the two datasets, which implies that comments carry a large number of veracity-indicative features.
• As the proportion of comments expands, the performance of the models improves continuously. However, when the proportion of comments for CaSa rises from 50% to 100%, the boost is not significant, only 1.44% in accuracy on RumourEval, while DTCA obtains better performance, with 3.90% and 3.28% boosts in accuracy on the two datasets. This confirms that DTCA can choose valuable comments and ignore unimportant ones with the help of DTE.

The impact of decision conditions of DTE on DTCA (RQ2)
To answer RQ2, we analyze the changes in model performance under different decision conditions. Different decision conditions choose different comments as evidence to participate in model learning. According to the change in verification performance, we can explain the process of evidence selection through the decision conditions. Specifically, we test different values (in the interval [0, 1]) as thresholds of the decision conditions so that DTE screens different comments. Figure 3(a), (b), and (c) respectively present the influence of semantic similarity (simi), reviewer credibility (r_cred), and comment credibility (c_cred) on the performance of DTCA, where the maximum thresholds are set to 0.7, 0.7, and 0.6 respectively because there are few comments when a decision threshold is greater than these values. We observe that:

• When simi is less than 0.4, the model improves continually, with an average performance improvement of about 2% (broken lines) on the two datasets per 0.1 increase in simi. In particular, DTCA earns the best performance when simi is set to 0.5 (<0.5), while it is difficult to improve performance after that. These results exemplify that DTCA can provide more credibility features for verification under appropriate semantic similarity.
• DTCA continues to improve with the increase of r_cred, which is in line with common sense, i.e., the more authoritative people are, the more credible their speech is. Analogously, DTCA improves with the increase of c_cred. These observations show the reasonableness of building both the reviewer credibility and comment credibility conditions from meta data.
• When simi is set to 0.5 (<0.5), r_cred to 0.7 (<0.7), and c_cred to 0.6 (<0.6), DTCA wins the biggest improvements, i.e., at least 3.43%, 2.28%, and 2.41% on the two datasets respectively. At this point, we infer that the comments captured by the model contain the most powerful evidence for claim verification. That is, the optimal evidence is formed under the conditions of moderate semantic similarity, high reviewer credibility, and high comment credibility, which explains the selection process of evidence.

Explainability Analysis (RQ3)
To answer RQ3, we visualize the comments (evidence) captured by DTE and the key semantics learned by CaSa when the training of DTCA is optimized. Figure 4 depicts the results on a specific sample in PHEME, where, at the comment level, red arrows represent the captured evidence and grey arrows denote the unused comments; at the word level, darker shades indicate higher weights given to the corresponding words, representing higher attention. We observe that:

• At the comment level, there are common characteristics in the captured comments, i.e., moderate semantic similarity (interval [0.40, 0.49]), high reviewer credibility (over 0.66), and high comment credibility (over 0.62). For instance, the values of the three characteristics of the evidence '@TimGriffiths85 where was that reported? #dickface' are 0.47, 0.66, and 0.71 respectively. These phenomena show that DTCA can give reasonable explanations for the captured evidence through the decision conditions of DTE, which visually reflects the interpretability of the DTCA method itself.
• At the word level, the evidence-related words 'presumably Muslim', 'made fun of', 'shooter', and 'isn't Islamist' in the comments receive higher weights than the evidence-independent words 'surprised', 'confirmed 1st', and 'speculating', which illustrates that DTCA can capture the key semantics of evidence. Moreover, 'weekly Charlie Hebdo' in the claim and 'Islamist' and 'Muhammad' in the comments are closely attended, which relates to the background knowledge that Charlie Hebdo is a French satirical weekly magazine that often publishes bold satire on religion and politics. 'report say' in the claim is questioned in the comments, e.g., 'How do you arrive at that?' and 'false flag'. These visually demonstrate that DTCA can uncover the questionable and even false parts of claims.

In addition, DTCA performs better on claims with more comments, while on claims with fewer than 8 comments DTCA does not perform well, underperforming its best performance by at least 4.92% and 3.10% in accuracy on the two datasets respectively (Table 4). Two reasons might explain this: 1) a claim with few comments receives limited attention, and its false parts are hard for the public to discover; 2) DTCA is capable of capturing worthwhile semantics from multiple comments, but it is not suitable for verifying claims with few comments.

Conclusion
We proposed a novel framework combining a decision tree and neural co-attention networks to explore a transparent and interpretable way to discover evidence for explainable claim verification. The framework constructs a decision tree model to select comments with high credibility as evidence and then designs co-attention networks to make the evidence and claims interact with each other to unearth the false parts of claims. Results on two public datasets demonstrated the effectiveness and explainability of this framework. In the future, we will extend the proposed framework by considering more context (meta data) information, such as time, storylines, and comment sentiment, to further enrich its explainability.

Figure 1: The architecture of DTCA. DTCA includes two stages, i.e., DTE for discovering evidence and CaSa using the evidence to explore the false parts of claims.

Figure 2: Overview of DTE. DTE consists of two parts: the tree comment network (left) and the decision tree model (right), which is used to evaluate the credibility of each node in the tree comment network for discovering evidence.
where W1, b1, W2, and b2 are the learned parameters. O = C and O = E are the output vectors of the two self-attention networks aiming at the claim and the evidence, respectively.

Figure 4: The visualization of a sample (labeled false) in PHEME by DTCA, where the captured evidence (red arrows) and the specific values of the decision conditions (blue) are presented by DTE, and the attention over different words (red shades) is obtained by CaSa.

Table 1: Statistics of the two datasets.

Table 2: The performance comparison of DTCA against the baselines.
Figure 3: The accuracy comparison of DTCA under different decision conditions. Broken lines represent the performance difference (D-value) between the current decision condition and the previous one.

Table 3: The performance comparison of models with different numbers of comments.

Table 4: The performance comparison of DTCA on claims with different numbers of comments.