Evidence-based Trustworthiness

The information revolution brought with it information pollution. Information retrieval and extraction help us cope with abundant information from diverse sources. But some sources are of anonymous authorship, and some are of uncertain accuracy, so how can we determine what we should actually believe? Not all information sources are equally trustworthy, and simply accepting the majority view is often wrong. This paper develops a general framework for estimating the trustworthiness of information sources in an environment where multiple sources provide claims and supporting evidence, and each claim can potentially be produced by multiple sources. We consider two settings: one in which information sources directly assert claims, and a more realistic and challenging one, in which claims are inferred from evidence provided by sources, via (possibly noisy) NLP techniques. Our key contribution is to develop a family of probabilistic models that jointly estimate the trustworthiness of sources, and the credibility of claims they assert. This is done while accounting for the (possibly noisy) NLP needed to infer claims from evidence supplied by sources. We evaluate our framework on several datasets, showing strong results and significant improvement over baselines.


Introduction
The emergence of social networks and news aggregators, combined with ill-informed posts, deliberate efforts to create and spread sensationalized information, and a strongly polarized political environment, makes it very difficult to establish what is really known. Fact checking therefore seeks to assess whether a claim is true or false, or to provide a confidence level for the claim given textual evidence (Hassan et al., 2017; Wang, 2017; Wang et al., 2018). A typical fact checking pipeline consists of document retrieval, sentence-level evidence selection, and textual entailment stages (Thorne et al., 2018). However, this pipeline is local in that it applies to a single given claim. The missing step is to assess the trustworthiness of the sources producing the claims and evidence. This is a global step that, in principle, accounts for all claims made by a source and all sources making a claim.
Previous work has studied how to estimate the trustworthiness or credibility of information sources for fact-finding (Vydiswaran et al.; Pasternack and Roth, 2013), truth discovery (Dong et al.; Pochampally et al., 2014; Dong et al., 2015; Li et al., 2016) and crowdsourcing (Sabou et al., 2012; Hovy et al., 2013; Gao et al., 2015). Usually, given a list of conflicting facts, e.g. "source s asserts claim c", or "annotator x labels data item t by label y", we detect the true claims or correct labels for the data item by resolving conflicts, and then compute the trustworthiness of sources.
However, many sources do not directly assert claims, but rather generate articles as evidence, expecting readers to infer claims from this evidence. In practice, given a claim of interest, people may search for related articles from multiple sources and collect evidence for the claim; they can then determine the veracity of the claim by deciding whether the evidence found supports or refutes the claim. However, most existing work that attempted to study trustworthiness of sources assumed that sources make assertions directly. Even when intermediate text was accounted for (Vydiswaran et al.; Nakashole and Mitchell, 2014), it was assumed that clean evidence and clear connections between evidence and conflicting claims are provided, disregarding the fact that NLP systems attempting to support these tasks are noisy.
Figure 1: Claim with assertions from multiple sources (from http://www.emergent.info/). Direct assertions specify their stance; indirect assertions provide related articles, and we can leverage (noisy) text entailment tools to collect their stances. We want to assess whether to believe the stance and articles.
This paper considers two situations when evaluating the trustworthiness of information sources: (1) the source directly asserts claims, and (2) the source indirectly asserts claims by proposing evidence. The first case is similar to previous work; the second case is more challenging but more important in practice. Both cases are depicted in Figure 1. A multitude of sources is given and each may assert multiple claims or propose multiple pieces of evidence. At the same time, multiple claims are observed, some of which are directly asserted by sources and some of which are supported by evidence.
Our goals are to identify true claims and to estimate the trustworthiness of each source. The key challenge is that this global inference task is influenced by the knowledge of which claims are made by which sources; however, establishing links from evidence generated by a source to claims requires NLP techniques such as textual entailment (TE) (Dagan et al., 2013). Such TE tools, which assess whether a given piece of textual evidence (premise) entails a given claim (hypothesis), are often noisy, which makes the evaluation of sources more difficult.
The key contributions of this work are as follows: (1) It proposes a probabilistic model, JELTA, which jointly estimates the credibility of claims and the trustworthiness of sources, when claims are made by sources directly, indirectly, or both. (2) Our framework incorporates a TE model as part of the global inference framework as a way to link evidence (and thus, sources) to claims. (3) This is the first work to distinguish between direct and indirect assertions made by information sources.
Our experiments on both synthetic and natural datasets show solid results that are significantly better than baselines.

Trustworthiness Analysis
Our goal is to evaluate the trustworthiness of information sources by detecting the true claims while accounting for noise in the links between claims and evidence for them. While direct assertions are straightforward to deal with (since it is clear which source generates which claim), the challenge is to incorporate "noisy assertions" into our problem formulation. We first describe our setting, and then elaborate on the probabilistic modeling. In this setting, (1) source $s_i$ directly asserts multiple claims $c_j$; (2) the source provides evidence $e_k$ through multiple articles, and the proposed evidence can support or refute claims via some noisy NLP tool.

Noisy Assertions
We are given a set of claims to validate and a text corpus (pieces of evidence) generated by multiple sources that are believed to have generated the claims. Given the claim text, we issue a set of searches over the corpus to find evidence in support of the claims. The result is a set of (noisy) assertions. A (noisy) assertion consists of a claim, a sentence in the corpus, and a label ("entailment", "contradiction", "neutral"). The claim is a real-world input whose truth value we attempt to determine. E.g., in Figure 1, "Tom Brokaw wants Brian Williams fired" is such a claim. An assertion, on the other hand, is an artifact of our framework. As we search the corpus generated by the sources for evidence supporting the claim, we identify candidate sentences ("Related Articles" in the figure) and use a pretrained textual entailment model (e.g. the decomposable attention model (Parikh et al., 2016)) to provide an entailment label and complete the triple (claim, sentence, label). The generation of noisy assertions as described above follows the typical fact-checking pipeline described by Thorne et al. (2018).
Given noisy assertions, Figure 2 illustrates our problem setting. Overall, there are two situations. In the upper part of the figure, we show the case in which information sources make direct assertions: the source directly states that some claims are true or false. The alternative case, indicated in the lower part of the figure, involves the source indirectly asserting claims by making noisy assertions: the source first generates articles that contain sentences, and the sentences may entail or refute related claims. An entailment tool can then be used to assert the claims to be true or false, based on those sentences. A claim can be supported by multiple sources or multiple pieces of evidence from different sources. We now propose our model, JELTA, which handles both cases described above.
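As a small illustration of the (claim, sentence, label) triple described above, the sketch below represents a noisy assertion as a record and maps its label to a stance. The class name, the `source` field, and the `stance` helper are hypothetical names introduced for illustration; they are not part of the paper's framework.

```python
from dataclasses import dataclass

# The three labels produced by the entailment model, as named in the text.
LABELS = ("entailment", "contradiction", "neutral")


@dataclass(frozen=True)
class NoisyAssertion:
    claim: str      # the real-world claim being validated
    sentence: str   # candidate evidence sentence found in the corpus
    label: str      # one of LABELS, as predicted by the entailment model
    source: str     # hypothetical field: the source that published the sentence

    def __post_init__(self):
        if self.label not in LABELS:
            raise ValueError(f"unknown entailment label: {self.label}")


def stance(assertion: NoisyAssertion) -> int:
    """Return +1 if the sentence supports the claim, -1 if it refutes it, 0 otherwise."""
    return {"entailment": 1, "contradiction": -1, "neutral": 0}[assertion.label]
```

A record with label "neutral" carries no stance, so downstream voting or inference can simply skip it.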

Fundamentals
Our probabilistic model denotes an information source as $s \in S$, a claim as $c \in C$, and $m$ as a mutual exclusion set of claims (exactly one of the claims in each mutual exclusion set is true). Here $m$ is a fact to be checked, and $c$ is a statement that $m$ is true or false. $w_{s,m}$, $w_{s,e}$ and $w_{e,m}$ are binary indicators that tell us, respectively, whether $s$ asserts claims of $m$, whether $s$ provides evidence $e$, and whether evidence $e$ supports claims in $m$. We denote evidence as $e \in E$, and for each entailment result, we use $b_{s,c}$ and $b_{s,e,c}$ to represent the observed probability that $s$ asserts $c$ and that $s$ provides $e$ to assert $c$, respectively. Here, $\sum_{c \in m} b_{s,c} = 1$ and $\sum_{c \in m} b_{s,e,c} = 1$. We summarize our notation in Table 1.

JELTA
Our work models a joint distribution that reflects a "story" of how sources generate observations. Intuitively, given an estimation of the verdict of the claims and the factors, including the trustworthiness of sources providing claims and evidence, we want to maximize the probability that we can observe the claims and evidence.
We represent the verdict of a claim $m$ as a latent variable, $y_m$, and associate a parameter $H_s$ with each source $s$, reflecting the probability of $s$ telling the truth, which we use to measure the trustworthiness of $s$. We now describe how $y_m$ and $H_s$ are used to compute the probability of observing the claims and evidence. Starting with the probability that source $s$ makes a direct assertion: intuitively, if $s$ asserts the true claim $c = \hat{c}$ in $m$, then the probability that $s$ asserts $c$ is $H_s$, the probability of $s$ telling the truth. Otherwise, $s$ chooses uniformly from the other claims in $m$, each with probability $\frac{1 - H_s}{|m| - 1}$.
Besides the term $H_s$, we require another (hidden) factor related to $s$, namely, the probability that $s$ tells the truth when providing evidence. We denote this as $P_s$, the precision of $s$ in generating evidence. We allow $P_s$ to differ from $H_s$, since providing true evidence for a true claim is more difficult than just providing a true claim. However, since both reflect the trustworthiness of $s$, we assume they share a similar distribution over sources in our problem. $P_s$ can then be represented by two other parameters, $R_s$ and $Q_s$ (Dong et al., 2015), which are the true- and false-positive rates of $s$ producing evidence, respectively. Denoting by $\gamma$ the probability of a claim being true, $P_s$ can be represented by $Q_s$ and $R_s$ as:
$$P_s = \frac{\gamma R_s}{\gamma R_s + (1 - \gamma) Q_s} \tag{1}$$
We assume that the probability of the claim being true or false is equal, i.e., $\gamma = 0.5$. Since $H_s$ and $P_s$ share similar distributions, $H_s$ relates to $Q_s$ and $R_s$ as follows:
$$H_s \approx P_s = \frac{R_s}{R_s + Q_s} \tag{2}$$
Now we discuss how to compute the probability of observing the noisy assertions. Intuitively, when source $s$ wants to assert a claim with the NLP tool (textual entailment model) by proposing evidence: if $s$ wants to support a true claim indirectly, $s$ will recall true evidence with probability $R_s$. This requires the NLP tool to do textual entailment correctly; otherwise $s$ will also uniformly choose false or unrelated evidence.
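The two quantities introduced above can be made concrete with a minimal sketch. It assumes the standard Bayes-rule form of precision in terms of a true-positive rate, a false-positive rate, and a prior; the function and argument names are illustrative and do not come from the paper.

```python
def precision(recall: float, false_pos: float, gamma: float = 0.5) -> float:
    """Precision P_s from R_s (true-positive rate) and Q_s (false-positive rate).

    Assumes the Bayes-rule form gamma*R / (gamma*R + (1-gamma)*Q), where gamma
    is the prior probability that a claim is true.
    """
    return gamma * recall / (gamma * recall + (1.0 - gamma) * false_pos)


def direct_assertion_prob(h_s: float, asserted_is_true: bool, m_size: int) -> float:
    """Probability that source s asserts a particular claim in a mutual exclusion set.

    With probability H_s the source asserts the true claim; otherwise it picks
    uniformly among the remaining m_size - 1 claims.
    """
    if asserted_is_true:
        return h_s
    return (1.0 - h_s) / (m_size - 1)
```

With equal priors (gamma = 0.5) the precision reduces to R / (R + Q), and the direct-assertion probabilities over all claims in a set sum to one.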
This paper considers the simplest way to generate a false claim or false evidence; in practice, the choice may not always follow random sampling. Our prior work (Pasternack and Roth, 2013) discusses some other options, which could alternatively be used here.
In the remainder of this section, we formally model those processes.
Direct Assertion. Modeling the generative process of direct assertions by sources is very similar to Simple LCA (Pasternack and Roth, 2013). As above, if the claim $c \in m$ asserted by $s$ is the true claim $y_m$, the probability of observing the source $s$ asserting claim $c$ of $m$ is $H_s$. Otherwise, the probability of $s$ asserting a false claim of $m$ is $\frac{1 - H_s}{|m| - 1}$. Therefore, the joint probability of the observation over $X_d$ and $y_m$ can be modeled as follows:
$$P(X_d, y_m \mid H_s) = P(y_m) \left[ H_s^{\,b_{s,y_m}} \left( \frac{1 - H_s}{|m| - 1} \right)^{1 - b_{s,y_m}} \right]^{w_{s,m}} \tag{3}$$
Then, given all sources $S$ and $\theta = \{H_s\}$, and writing $Y = \{y_m\}$, we can write the full joint of direct observations as:
$$P(X_d, Y \mid \theta) = \prod_{m} P(y_m) \prod_{s \in S} \left[ H_s^{\,b_{s,y_m}} \left( \frac{1 - H_s}{|m| - 1} \right)^{1 - b_{s,y_m}} \right]^{w_{s,m}} \tag{4}$$
Note that we simplify the expression by leveraging $\sum_{c \in m \setminus y_m} b_{s,c} = 1 - b_{s,y_m}$.
Indirect Assertion. Here the sources provide articles containing possible evidence rather than making direct assertions. Besides the parameters $Q_s$ and $R_s$, the observation also depends on the noisy entailment results given by the textual entailment model. Therefore, we introduce a function $\phi_w(e, m, c) \in \mathbb{R}^1$ to measure the reliability of an entailment result. Here $\phi_w(e, m, c)$ is a sigmoid function of a linear combination of feature values, so that it is scaled to $[0, 1]$:
$$\phi_w(e, m, c) = \frac{1}{1 + \exp\left(-\sum_i w_i z_i\right)} \tag{5}$$
where $z_i$ is a feature for each given $\langle e, m, c \rangle$, and $w = \{w_i\}$ are the weights of each $z_i$ learned by our model. For each observation $\langle s, e, m, c \rangle$, the source generates true evidence with probability $R_s$, and with probability $\phi_w(e, m, c)$, the proposed evidence $e$ supports claim $c$ of $m$. This means that we have probability $R_s \cdot \phi_w(e, m, c)$ of observing the tuple when $c = y_m$. If $c \neq y_m$, either the source does not provide true evidence, or the entailment model provides an unreliable entailment result, which means we have probability $\frac{1}{N}\left(1 - P_s \cdot \phi_w(e, m, c)\right)$ of observing a false evidence-claim pair. Here $N$ is the total number of such false evidence-claim pairs.
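The indirect-assertion probabilities described above can be sketched as follows. This is a minimal illustration assuming equal priors, so that the precision is R_s / (R_s + Q_s); the function names and argument names are hypothetical, and the feature vector is whatever features the reliability function is given.

```python
import math


def phi(weights, features):
    """Reliability of an entailment result: sigmoid of a weighted feature sum."""
    z = sum(w * x for w, x in zip(weights, features))
    # Numerically stable sigmoid for both signs of z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def indirect_assertion_prob(r_s, q_s, phi_value, claim_is_true, n_false_pairs):
    """Probability of observing one (source, evidence, claim) tuple.

    If the claim is the true one, the tuple is observed with probability
    R_s * phi. Otherwise the remaining mass (1 - P_s * phi) is spread
    uniformly over the N false evidence-claim pairs, with P_s = R_s / (R_s + Q_s)
    under the equal-prior assumption.
    """
    if claim_is_true:
        return r_s * phi_value
    p_s = r_s / (r_s + q_s)
    return (1.0 - p_s * phi_value) / n_false_pairs
```

With all weights at zero the reliability is 0.5 for every entailment result, which is a natural neutral starting point before any weights are learned.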
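Therefore, the joint probability of the observation over $y_m$ and $X_i$ (indirect assertion observations) is as follows:
$$P(X_i, y_m \mid Q_s, R_s, W) = P(y_m) \prod_{e \in E} \left[ \left( R_s \cdot \phi_w(e, m, y_m) \right)^{b_{s,e,y_m}} \left( \frac{1 - \frac{R_s}{R_s + Q_s} \cdot \phi_w(e, m, c)}{N} \right)^{1 - b_{s,e,y_m}} \right]^{w_{s,e} w_{e,m}} \tag{6}$$
Here, we also use $\sum_{c \in m \setminus y_m} b_{s,e,c} = 1 - b_{s,e,y_m}$. Then, given all sources $S$ and $\theta = \{Q_s, R_s, W\}$, the full joint probability of indirect assertions is:
$$P(X_i, Y \mid \theta) = \prod_{m} P(y_m) \prod_{s \in S} \prod_{e \in E} \left[ \left( R_s \cdot \phi_w(e, m, y_m) \right)^{b_{s,e,y_m}} \left( \frac{1 - \frac{R_s}{R_s + Q_s} \cdot \phi_w(e, m, c)}{N} \right)^{1 - b_{s,e,y_m}} \right]^{w_{s,e} w_{e,m}} \tag{7}$$
Joint Modeling. To consider direct and indirect assertions together, we multiply Equations 4 and 7 together with two hyper-parameters, $\eta_d$ and $\eta_i$, which give different weights to direct and indirect assertions. If $\eta_d > \eta_i$, we believe that a direct assertion is more accurate than an indirect assertion, and vice versa. Therefore, observing that all sources propose their evidence and make their assertions independently, and taking $\theta = \{H_s, R_s, Q_s, W\}$, the full joint is the product of the direct and indirect joints, weighted by $\eta_d$ and $\eta_i$.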
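One way to picture the weighted combination is in log space. The sketch below assumes that the two hyper-parameters act as exponents on the direct and indirect likelihoods (equivalently, as weights on their log-likelihoods); this is an assumption made for illustration, since the exact form is not recoverable from the text here, and the function name is hypothetical.

```python
import math


def joint_log_likelihood(direct_probs, indirect_probs, eta_d=1.0, eta_i=1.0):
    """Weighted sum of log-probabilities of direct and indirect observations.

    direct_probs / indirect_probs: per-observation probabilities, each in (0, 1].
    eta_d / eta_i: assumed to weight the two log-likelihood terms.
    """
    log_direct = sum(math.log(p) for p in direct_probs)
    log_indirect = sum(math.log(p) for p in indirect_probs)
    return eta_d * log_direct + eta_i * log_indirect
```

Under this reading, setting eta_d above eta_i makes each direct assertion count for more than each indirect one when comparing candidate verdicts.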

Inference
The true claim, $y_m$, is a latent variable that is unknown in our problem, so we solve this approximately by using the EM algorithm (Dempster et al., 1977) to first estimate the true claim, then find the maximum a posteriori point estimate of the parameters. Therefore, the E-step is, $\forall m$:
$$P(y_m \mid X, \theta^t) = \frac{P(X, y_m \mid \theta^t)}{\sum_{y'_m \in m} P(X, y'_m \mid \theta^t)}$$
In the M-step, besides maximizing the posterior of the parameters, we should also consider the interactions between $H_s$ and $R_s$, $Q_s$. We include these as a regularization term with a parameter $\lambda$ that controls the importance of the interactions. Thus, the M-step maximizes the expected log-posterior of the parameters under the E-step distribution, together with this $\lambda$-weighted regularization term. Since there are no closed-form solutions for those parameters, we use gradient ascent to solve for them parameter by parameter. We leave the computation of derivatives to the appendix.
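The E-step amounts to normalizing the joint over the candidate claims of each mutual exclusion set. A minimal sketch, assuming the per-candidate log-joint values have already been computed under the current parameters (the function name and the dictionary layout are illustrative):

```python
import math


def e_step_posterior(log_joint_by_claim):
    """Posterior over candidate true claims in one mutual exclusion set.

    log_joint_by_claim maps each candidate claim c to log P(X, y_m = c | theta_t).
    The maximum is subtracted before exponentiating to avoid underflow.
    """
    peak = max(log_joint_by_claim.values())
    unnormalized = {c: math.exp(v - peak) for c, v in log_joint_by_claim.items()}
    total = sum(unnormalized.values())
    return {c: u / total for c, u in unnormalized.items()}
```

Working in log space matters in practice because the joint is a product over many sources and pieces of evidence, so raw probabilities quickly underflow.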

Measuring Entailment Results
In our model, $\phi_w(e, m, c)$ evaluates the reliability of an entailment result given by the entailment model. As we described in Section 2.3, $\phi_w(e, m, c)$ is a sigmoid function of a linear combination of feature values, and we include the following features:
Entailment Score. For each prediction, the entailment model outputs a label (entailment, contradiction or neutral) as well as a score supporting its conclusion.
Text Similarity. This feature is computed as the cosine similarity between numerical representations of the evidence and the claim. In this work, we use both tf-idf and GloVe (Pennington et al., 2014) to represent sentences. For the GloVe representation, we combine pre-trained GloVe vectors using the simple method proposed by Arora et al. (2017).
Entity Similarity. We identify entities for each pair of evidence and claim, and use two features: the overlap of entities measured by Jaccard similarity, and entity similarity measured by NESim (Do et al., 2009).
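Two of the similarity features above can be sketched with the standard library. This simplified example uses raw term counts for the cosine feature, which is a stand-in for the tf-idf and GloVe representations named in the text, and it takes entity sets as given rather than running an entity recognizer; the function names are illustrative.

```python
import math
from collections import Counter


def jaccard(entities_a, entities_b):
    """Jaccard overlap between two collections of entity strings."""
    a, b = set(entities_a), set(entities_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cosine(vec_u, vec_v):
    """Cosine similarity between two sparse vectors given as term -> weight maps."""
    dot = sum(weight * vec_v[term] for term, weight in vec_u.items() if term in vec_v)
    norm_u = math.sqrt(sum(x * x for x in vec_u.values()))
    norm_v = math.sqrt(sum(x * x for x in vec_v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def bag_of_words(sentence):
    """Lower-cased whitespace tokenization into term counts (a simplification)."""
    return Counter(sentence.lower().split())
```

For instance, `cosine(bag_of_words(evidence), bag_of_words(claim))` gives one text-similarity feature value for an evidence-claim pair.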

Experimental Evaluation
We evaluate the effectiveness of our joint model JELTA and compare it with baselines. We first describe our datasets and the methods we compare with, then elaborate on the results.

Experimental Settings
Data Sets. We use both synthetic and natural datasets to evaluate our models.
Synthetic Dataset: FEVER. We use the training file of FEVER (see footnote 2) to create the synthetic dataset. FEVER is a dataset for verification of claims. We augment FEVER with sources and other information using the following steps.
Step 1: Assign Veracity for Claims. FEVER provides evidence-claim pairs with their textual entailment labels. Considering our running example, FEVER provides sentence pairs such as "...NBC's Tom Brokaw reportedly..." and "Tom Brokaw wants Brian Williams fired." as evidence and claim. For each experimental round, we sample 200 claims from those pairs, then randomly assign half as true and half as false. These labels serve as the ground truth of the claims' veracity.
Step 2: Create Sources with Accuracy. Next, we create sources with corresponding accuracy as the ground truth of trustworthiness. In each experimental round, we create 200 sources, and for each source $s_i$, we associate an accuracy denoted as $H_{s_i}$. To generate $H_{s_i}$, we sample a decimal number from a normal distribution ($\mu = 0.5$, $\sigma = 1$) in $[0, 1]$. A normal distribution is used here because we assume most sources mix true and false claims, and a few of them are highly trustworthy or totally unbelievable.
Step 3: Associate Sources with Evidence and Claims. The last step is to assign claims and evidence to each source. In our experiments, each source makes 30 assertions. Each source $s_i$, with probability $H_{s_i}$, picks a true claim; otherwise it picks a false claim. For evidence, since we assume that the precision of generating evidence is distributed over sources similarly to $\{H_{s_i}\}$, the source $s_i$ picks a piece of evidence either supporting a true claim or refuting a false claim with probability $H_{s_i} + \epsilon$, where $\epsilon$ is a small Gaussian noise ($\mu = 0$, $\sigma = 1$). Considering the running example, if the claim "Tom Brokaw wants Brian Williams fired." is associated with "True" in Step 1, and FEVER provides the pair with label "Entailment", then "...NBC's Tom Brokaw reportedly..." is a piece of evidence supporting a true claim. Otherwise $s_i$ picks a piece of evidence supporting a false claim or refuting a true claim. Note that we assume that each source provides more pieces of evidence than claims, and set the ratio of direct assertions to indirect assertions to 1/4 in our experiments.
Footnote 2: http://fever.ai/data.html
We run 10 rounds of experiments and report the average performance.
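The source-creation steps above can be sketched as follows. The text does not say how the normal draw is restricted to [0, 1]; this sketch assumes rejection sampling, and the function names and the claim-pool arguments are illustrative.

```python
import random


def sample_accuracy(rng, mu=0.5, sigma=1.0):
    """Draw a source accuracy from a normal distribution restricted to [0, 1].

    Rejection sampling is an assumption; clipping would be another option.
    """
    while True:
        x = rng.gauss(mu, sigma)
        if 0.0 <= x <= 1.0:
            return x


def make_source_assertions(rng, h_s, true_claims, false_claims, n_assertions=30):
    """Pick claims for one source: a true claim with probability h_s, else a false one."""
    picks = []
    for _ in range(n_assertions):
        pool = true_claims if rng.random() < h_s else false_claims
        picks.append(rng.choice(pool))
    return picks
```

Passing a seeded `random.Random` instance keeps each experimental round reproducible, for example `rng = random.Random(0)` followed by `sample_accuracy(rng)`.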
Natural Dataset: Emergent. We use Emergent (Ferreira and Vlachos, 2016) directly; it is derived from a digital journalism project for rumor debunking. It contains 300 rumored claims and 2,595 associated news articles from different websites, collected and labeled by journalists with an estimate of their veracity (true, false or unverified). We eliminated the claims that are unverified, leaving 201 claims and 589 effective sources. For each source, the dataset provides the claims it supports or refutes, which we use as direct assertions. It also provides the articles generated by the source, and we use them as a repository of possible evidence that may support or refute the claims. The ground truth of the trustworthiness is generated by computing the accuracy of sources based on the veracity labels provided.
Entailment Model. We need a textual entailment model to tell us if evidence (a sentence) supports or refutes a claim. We use a pre-trained decomposable attention model (Parikh et al., 2016) with ELMo embeddings (Peters et al., 2018) trained on the SNLI dataset. The model performs poorly on both FEVER and Emergent: when we use majority voting over the evidence to estimate the veracity of a claim, the accuracy is under 40%. To improve the textual entailment model, we adapt the pre-trained model with additional training data. For the experiment on FEVER, we randomly sample 100 training examples from the labeled development dataset of FEVER. (There is no overlap between the additional training data and our created test data.) For the experiments on Emergent, we construct additional training data from the article headlines and headline stances provided by Emergent. Here, each article comes with a headline, and the dataset tells us whether the headline supports or refutes the claim, which is a good source of additional training data.
Metrics. To evaluate the performance of our method as well as the baselines, we evaluate (1) the accuracy of the estimated veracity of the claims, and (2) the accuracy of the estimated trustworthiness of the sources. Here, we evaluate trustworthiness by two typical correlation scores, the Spearman correlation coefficient and the Pearson correlation coefficient (Fieller et al., 1957). Spearman's correlation assesses monotonic relationships, whereas Pearson's correlation is based on the covariance of two random variables; thus, when computing Pearson's correlation, we normalize the estimated accuracy of the sources.
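The two correlation scores can be computed with a short, dependency-free sketch. It uses the textbook definitions (Pearson on the raw values, Spearman as Pearson on average ranks); the function names are illustrative, and a statistics library would normally be used instead.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed values."""
    return pearson(ranks(x), ranks(y))
```

An estimate that preserves the ordering of the sources but distorts the scale scores a perfect Spearman correlation while its Pearson correlation falls below one, which is why both are reported.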

Baselines
MJ-Claim. In this case, we only consider direct assertions made by sources, and for each claim we collect all related assertions and take a majority vote to estimate the veracity of the claim. Once we have an estimate of the claims' veracity, we can compute the accuracy of each source.
MJ-EVI. We only consider indirect assertions in this case. With the textual entailment model output, each piece of evidence provided by an article will either support, refute or abstain on the claim. Here, we also use a majority vote to estimate the veracity of the claims, and estimate the trustworthiness of a source as the mean, over claims, of the ratio between the number of pieces of evidence supporting the true claim and the total number of pieces of evidence.
Sim-LCA. We leverage the model proposed by Pasternack and Roth (2013) to estimate the credibility of the sources. Here, the model only considers direct assertions.
Sim-Com. We propose a simple solution that considers both direct and indirect assertions. Here, we use MJ-EVI to estimate the truth of the claims, based on which we calculate the accuracy of each source. Note that the results are the same as those of MJ-EVI when estimating the veracity of the claims, while the estimation of trustworthiness over sources is different.
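The majority-vote baselines follow a simple two-step recipe, sketched below for direct assertions. The tuple layout, the function names, and the rule that ties resolve to False are assumptions made for this illustration.

```python
from collections import defaultdict


def majority_vote(assertions):
    """Estimate claim veracity from (source, claim, stance) tuples.

    stance is True when the source supports the claim and False when it
    refutes it. Ties resolve to False (an arbitrary choice in this sketch).
    """
    tally = defaultdict(int)
    for _, claim, stance in assertions:
        tally[claim] += 1 if stance else -1
    return {claim: score > 0 for claim, score in tally.items()}


def source_accuracy(assertions, veracity):
    """Fraction of each source's assertions that agree with the estimated veracity."""
    hits, totals = defaultdict(int), defaultdict(int)
    for source, claim, stance in assertions:
        totals[source] += 1
        hits[source] += int(stance == veracity[claim])
    return {source: hits[source] / totals[source] for source in totals}
```

Feeding `source_accuracy` with veracity estimated from evidence rather than from direct assertions gives the flavor of the combined baseline described above.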

Results
Accuracy of Veracity. Figure 4 (a) reports (for each method) the accuracy of estimation for the claims' veracity, over our synthetic dataset. JELTA achieves the highest accuracy, by around 5%, and shows a low standard deviation over the 10 rounds of experiments. Figure 4 (b) reports the accuracy on Emergent. We again observe a 4% improvement in accuracy compared with MJ-EVI, and around 16% improvement vs. the methods considering direct assertions only. It makes sense that evidence-based methods (leveraging indirect assertions) can beat claim-based methods (leveraging direct assertions only) by using more information to reduce potential noise. In contrast, using sources and claims only is noisier, especially with many bad information sources. Figure 4 (a) shows that when the distribution of sources changes, the performance of MJ-Claim and Sim-LCA also varies a lot: their performance greatly depends on the distribution of trustworthiness over sources. Besides offering higher accuracy, evidence-based methods are more robust to varying sources' trustworthiness.
Trustworthiness Estimation. Figure 5 (a) reports the performance by Pearson and Spearman score, for each method's estimate of the trustworthiness of each source on FEVER. JELTA's accuracy is consistently better than the baselines, whether we use the Spearman or Pearson score to compute the correlation between the estimation and the ground truth. JELTA also has a lower standard deviation over different rounds. This result is consistent with the results shown when we are estimating the veracity of claims. It reveals that evidence-based methods are relatively more stable than the methods considering direct assertions only. MJ-Claim and Sim-LCA highly depend on the trustworthiness distribution over sources. If most of the sources are more trustworthy, we can both estimate the true claims more accurately and better estimate the trustworthiness of sources; and vice versa. That is why both MJ-Claim and Sim-LCA have high standard deviations over different rounds. Based on the results of MJ-EVI, we can observe that simply calculating accuracy by estimated "correct" evidence cannot achieve a highly correlated estimation of sources: the entailment tool provides noisy evidence. However, Sim-Com, which directly counts estimated "correct" claims by MJ-EVI, can improve the estimation. Thus, if we can estimate the veracity of claims accurately, estimating the trustworthiness by claims is more accurate than doing so by noisy evidence. This is also why we can significantly improve the performance by joint modeling. Intuitively, we use evidence to better estimate the veracity of claims, and leverage claims to better estimate the trustworthiness of sources, in an iterative fashion. Figure 5 (b) leads to similar conclusions. Since there are more trustworthy sources, the performance of claim-based methods is better than MJ-EVI.
Influence of textual entailment model. Figures 4 and 5 show that our method, which jointly considers direct and indirect assertions, significantly improves the estimation. Among different factors, evidence contributes the most when estimating the veracity of claims, which can also help the estimation of the trustworthiness. However, the usefulness of evidence highly depends on the quality of the NLP tool. To quantify the effect of the noise introduced, we report the Pearson and Spearman scores while varying a noise rate r. Given r, for each entailment result, with probability r, we flip the answer of the textual entailment model. For example, if the result is "entailment", we change it randomly to either "contradiction" or "neutral", and vice versa. The results are shown in Figures 4 (c) and 5 (c). As noise increases, the accuracy and the Pearson and Spearman scores drop. However, JELTA is consistently better than the alternatives. JELTA's accuracy decreases more slowly, and its correlation remains positive even when we flip 95% of the entailment results. This demonstrates that jointly considering direct and indirect assertions can better avoid the skewness caused by either evidence or claims.
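The noise-injection procedure described above can be sketched in a few lines. The function name and the use of a seeded `random.Random` instance are illustrative choices; the flipping rule follows the description in the text.

```python
import random

LABELS = ("entailment", "contradiction", "neutral")


def flip_label(label, noise_rate, rng):
    """With probability noise_rate, replace label by one of the other two labels."""
    if rng.random() < noise_rate:
        return rng.choice([other for other in LABELS if other != label])
    return label


def add_noise(labels, noise_rate, seed=0):
    """Apply flip_label independently to every entailment result."""
    rng = random.Random(seed)
    return [flip_label(label, noise_rate, rng) for label in labels]
```

Sweeping `noise_rate` from 0 to 1 and re-running each method on the perturbed labels would produce robustness curves of the kind discussed above.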

Related Work
Evaluating the trustworthiness of sources has been studied for fact-finding, truth discovery and crowdsourcing. In the context of fact-finding (Vydiswaran et al.; Pasternack and Roth, 2013) and truth discovery (Yin et al., 2008; Zhao et al., 2012; Li et al., 2014; Pochampally et al., 2014; Dong et al., 2015; Li et al., 2016), the solutions estimate the trustworthiness or credibility of sources by resolving the conflicts among claims provided by multiple sources. The claims are usually in structured form, and conflicting values can be easily captured without noise. Some works (Vydiswaran et al.; Nakashole and Mitchell, 2014; Popat et al., 2017) further take text into consideration. However, Vydiswaran et al. and Nakashole and Mitchell (2014) still depend on a structured input form, and thus the connection between evidence and conflicting claims is given, which is usually not practical. Popat et al. (2017) leverage text as evidence to do fact-checking, but their estimation of the credibility of sources neglects the reliability of the sources generating evidence. In crowdsourced labeling (Sabou et al., 2012; Hovy et al., 2013; Gao et al., 2015), the system is given noisy labels which are annotated by different annotators. The input is again in structured form, and there is no evidence to consider. This is a limited setting compared with our problem. Our problem is also related to fact-checking (Wang et al., 2018; Thorne et al., 2018; Yin and Roth, 2018; Zhao et al., 2018); however, these works only consider whether the evidence can support the claim, without tracking the source of the claim and evidence.

Conclusions and Future Work
This paper studied the problem of estimating the trustworthiness of given information sources. The sources make direct claims or indirect claims by generating evidence that implies these claims.
We proposed a probabilistic framework, JELTA, which jointly considers both kinds of assertions to better estimate claims' veracity and sources' trustworthiness. We evaluated JELTA over both synthetic and real datasets, and our results show significant improvements over baselines.
While we presented the framework here as applying to claims with two truth values, we believe that this framework can apply more broadly. For example, rather than considering a claim as being True or False, Chen et al. (2019) suggest that one needs to view a claim from a diverse, yet comprehensive, set of perspectives. Our framework can be extended to deal with sources that generate a spectrum of perspectives, each with a stance relative to the claim and with evidence supporting it. We leave this for future work.

A.1 Inference
To infer the values of the latent variables and parameters in our model, we use the EM algorithm to first estimate the true claim, and then find the maximum a posteriori point estimate of the parameters. As shown in Section 2.4, given parameters $\theta^t$ and $X$, the E-step is easy to compute, while the M-step is more complicated. Since there are no closed-form solutions for those parameters, we use gradient ascent to solve for them parameter by parameter.