Explainable Automated Fact-Checking for Public Health Claims

Fact-checking is the task of verifying the veracity of claims by assessing their assertions against credible evidence. The vast majority of fact-checking studies focus exclusively on political claims. Very little research explores fact-checking for other topics, specifically subject matters for which expertise is required. We present the first study of explainable fact-checking for claims which require specific expertise. For our case study we choose the setting of public health. To support this case study we construct a new dataset PUBHEALTH of 11.8K claims accompanied by journalist crafted, gold standard explanations (i.e., judgments) to support the fact-check labels for claims. We explore two tasks: veracity prediction and explanation generation. We also define and evaluate, with humans and computationally, three coherence properties of explanation quality. Our results indicate that, by training on in-domain data, gains can be made in explainable, automated fact-checking for claims which require specific expertise.


Introduction
A great amount of progress has been made in the area of automated fact-checking. This includes more accurate machine learning models for veracity prediction and datasets of both naturally occurring (Wang, 2017;Augenstein et al., 2019;Hanselowski et al., 2019) and human-crafted (Thorne et al., 2018) fact-checking claims, against which the models can be evaluated. However, a few blind spots exist in the state-of-the-art. In this work we address specifically two shortcomings: the narrow focus on political claims, and the paucity of explainable systems.
One subject area which we believe could benefit from expertise-based fact-checking is public health, including the study of epidemiology, disease prevention in a population, and the formulation of public policies (Turnock, 2012). Recent events, including the COVID-19 pandemic, demonstrate the significant potential harm of misinformation in the public health setting, and the importance of accurately fact-checking claims. Unlike political and general misinformation, specific expertise is required in order to fact-check claims in this domain. Oftentimes this expertise may be limited, and thus claims which surround public health may be inaccessible (e.g., because of the use of jargon and biomedical terminology) in a way political claims are not. Nonetheless, like political misinformation, the public health variety is also potentially very dangerous, because it can put people in imminent danger and risk lives.
Typically, statements which are candidates for fact-checking originate in the political domain (Vlachos and Riedel, 2014;Ferreira and Vlachos, 2016;Wang, 2017), and tend to surround more general topics or be non-subject specific (Thorne et al., 2018). This follows the trend of the rising interest in political fact-checking in the last decade (Graves, 2018). There are on-going efforts with respect to fact-checking scientific claims (Grabitz et al., 2017). Fact-checking in domains where specific subject expertise is required presents an interesting challenge because general purpose fact-checking systems will not necessarily adapt well to these domains.
The second shortcoming we look to address is the paucity of explainable models for fact-checking (of any kind). Explanations have a particularly important role to play in the task of automated fact-checking. The efficacy of journalistic fact-checking hinges on the credibility and reliability of the fact-check, and explanations (e.g., provided by model agnostic tools such as LIME (Ribeiro et al., 2016)) can strengthen this by communicating fidelity in predictive models. Explainable models can also aid the end users' understanding as they further elucidate claims and their context.
In this study we explore the novel case of explainable automated fact-checking for claims for which specialised expertise or in-domain knowledge is essential. For our case study we examine the public health (biomedical) context.
The system for veracity prediction we aim to produce must fulfil two requirements: (1) it should provide a human-understandable explanation (i.e., judgment) for the fact-checking prediction, and (2) that judgment should be understandable for people who do not have expertise in the subject domain. We list the following as our three main contributions in this paper:
1. We present a novel dataset for explainable fact-checking with gold standard fact-checking explanations by journalists. To the best of our knowledge, this is the first dataset specifically for fact-checking in the public health setting.
2. We introduce a framework for generating explanations and veracity prediction specific to public health fact-checking. We show that gains can be made through the use of in-domain data.
3. In order to evaluate the quality of our fact-checking explanations, we define three coherence properties. These can be evaluated by humans as well as computationally, as approximations for human evaluations of fact-checking explanations.
The explanation model trained on in-domain data outperforms the general purpose model on summarization evaluation and also when evaluated for explanation quality.

Related Work
A number of recent works in automated fact-checking look at various formulations of fact-checking and its analogous tasks (Ferreira and Vlachos, 2016; Hassan et al., 2017; Zlatkova et al., 2019). In this paper, we choose to focus on the two specific aspects of concern to us, which have not been thoroughly explored in the literature. These are domain-specific and expertise-based claim verification and explainability for automated fact-checking predictions.

Language Representations for Health
Fewer language resources exist for medical and scientific applications of NLP compared with other NLP application settings, e.g., social media analysis, NLP for law, and computational journalism and fact-checking. We consider the former below.
There are a number of open source pre-trained language models for NLP applications in the scientific and biomedical domains. The most recent of these pre-trained models are based on the BERT language model (Devlin et al., 2019). One example is BIOBERT, which is fine-tuned for the biomedical setting (Lee et al., 2020). BIOBERT is trained on abstracts from PubMed and full article texts from PubMed Central. BIOBERT demonstrates higher accuracies when compared to BERT for named entity recognition, relation extraction and question answering in the biomedical domain.
SCIBERT is another BERT-based pre-trained model (Beltagy et al., 2019). SCIBERT is trained on 1.14M Semantic Scholar articles relating to computer science and biomedical sciences. Similar to BIOBERT, SCIBERT also shows improvements on original BERT for in-domain tasks. SCIBERT outperforms BERT in five NLP tasks including named entity recognition and text classification.
Given that models for applications of NLP tasks in the biomedical domain, e.g., question answering, show marked improvement when domain-specific, we hypothesize that public health fact-checking could also benefit from the language representations suited for that specific domain. We will make use of both SCIBERT and BIOBERT in our framework.

Explainable Fact-Checking
A number of in-roads have been made in developing models to extract explanations from automated fact-checking systems. To our knowledge, the current state of the art in explainable fact-checking mostly looks to produce extractive explanations, i.e., explanations for veracity predictions in relation to inputs to the system. Instead, our focus in this paper is on abstractive explanations. We choose this approach, which aims to distill the explanation into the most salient components which form it, as more amenable to users with limited domain expertise, as we discuss below.
Various methods have been applied to the explainable fact-checking task. These methods span the gamut from logic-based approaches such as probabilistic answer set programming (Ahmadi et al., 2019) and reasoning with Horn rules (Ahmadi et al., 2019; Gad-Elrab et al., 2019) to deep learning and attention-based approaches, e.g., leveraging co-attention networks and human annotations in the form of news article comments (Shu et al., 2019a). The outputs of these systems also take a number of forms including Horn rules (Ahmadi et al., 2019), saliency maps (Shu et al., 2019a; Popat et al., 2018), and natural language generation (Atanasova et al., 2020).
All approaches produce explanations which are a distillation of the most relevant portion of the system input. In this paper we expand on the work by Atanasova et al. as we formulate explanation generation as a summarization exercise. However, our work differs from the existing literature as we construct a framework for joint extractive and abstractive explanation generation, as opposed to a purely extractive model. We choose an abstractive approach as we hypothesize that particularly in the case of public health claims, where specific expertise is required to understand the context, abstractive explanations can make the explanation more accessible, particularly for those with little knowledge of the subject matter. In this way we take into account the nature of the claims, something other explainable fact-checking systems do not consider.

Evaluation of Explanation Quality
Only a few explainable fact-checking systems employ thorough evaluation in order to assess the quality of explanations produced. In the cases where evaluations are provided, these primarily take the form of human evaluation, e.g., enlisting annotators to score the quality of explanations with respect to some properties (Atanasova et al., 2020;Gad-Elrab et al., 2019) or through the use of an established evaluation metric in the case where explanation generation is modelled as another task (Atanasova et al., 2020).
There is also work on the evaluation of explanation quality more broadly, independently of the task for which explanations are sought. Notably, Sokol and Flach (2019) present explainability factsheets for evaluating (machine learning) explanations along five axes, including usability. One of the usability criteria discussed by Sokol and Flach is coherence, which we use to develop our three explanation quality properties (see Section 5.3).
Whereas Sokol and Flach discuss coherence in general, we provide concrete definitions and use them for evaluating our methods for explaining veracity predictions for public health claims.

The PUBHEALTH dataset
We constructed a dataset of 11,832 claims for fact-checking, which are related to a range of health topics including biomedical subjects (e.g., infectious diseases, stem cell research), government healthcare policy (e.g., abortion, mental health, women's health), and other public health-related stories (see unproven, false and mixture examples in Table 1), along with explanations offered by journalists to support veracity labelling of these claims. The claims were collected from two sources: fact-checking websites and news/news review websites. An example dataset entry is shown in Table 1.
To the best of our knowledge, this is the first fact-checking dataset to explicitly include gold standard texts provided by journalists specifically as explanation of the fact-checking judgment. We describe below how the data was collected and processed to obtain the final PUBHEALTH dataset, and provide an analysis of the dataset.

Data collection
Initially, we scraped 39,301 claims, amounting to: 27,578 fact-checked claims from five fact-checking websites (Snopes, Politifact, TruthOrFiction, FactCheck, and FullFact); 9,023 news headline claims from the health section and health tags of the Associated Press and Reuters News websites; and 2,700 claims from the news review site Health News Review (HNR).
We scraped data for two text fields which are essential for fact-checking: 1) the full text of the fact-checking or news article discussing the veracity of the claim, and 2) the fact-checking justification or news summary as explanation for the veracity label of the claim. We also collected the URLs of sources cited by the journalists in the fact-checking and news articles. For each URL, in the case where the referenced sources could be accessed and read, we also scraped the source texts.

Table 1 (columns: Claim, Label, Explanation)

Claim: Blue Buffalo pet food contains unsafe and higher-than-average levels of lead.
Label: UNPROVEN
Explanation: Aside from a single claimant's lawsuit against Blue Buffalo and an unrelated recall on one variety of Blue Buffalo product in March 2017, we found no credible information suggesting that Blue Buffalo dog food was tested and found to have abnormally high levels of lead.

Claim: Children who watch at least 30 minutes of "Peppa Pig" per day have a 56 percent higher probability of developing autism.
Label: FALSE
Explanation: Talk of a Harvard study linking the popular British children's show "Peppa Pig" to autism went viral, but neither the study nor the scientist who allegedly published it exists.

Claim: Expired boxes of cake and pancake mix are dangerously toxic.
Label: MIXTURE
Explanation: What's true: Pancake and cake mixes that contain mold can cause life-threatening allergic reactions. What's false: Pancake and cake mixes that have passed their expiration dates are not inherently dangerous to ordinarily healthy people, and the yeast in packaged baking products does not "over time develops spores."

Claim: Families tell U.S. lawmakers of heparin deaths.
Label: TRUE
Explanation: A man who said he lost his wife and a son to reactions from tainted heparin made with ingredients from China urged U.S. lawmakers on Tuesday to protect patients from other unsafe drugs.

All claims make reference to articles published between October 19, 1995 and May 14, 2020. In addition to the claim, article texts, explanation texts, and the date on which the fact-check or news article was published, we scraped meta-data related to each claim. These meta-data include the tags (single or multiple tokens) which may, for example, categorize the topics of the claim or indicate the source of the claim (see Appendix A.1), and the names of the fact-checkers and news reporters who contributed to the article.

Data processing and analysis
The data processing involved three tasks: standardizing the veracity labels, filtering out nonbiomedical claims from the dataset, and finally removing claims with incomplete and brief explanations.
Labels for news headline claims did not require standardization, as we assumed all news headline claims (coming from reputable sources as they were) to be verified and thus labelled these true. We did, however, filter out of the dataset news entries with the headline prefixes "AP EXCLUSIVE", "Correction", "AP Interview", and "AP FACT CHECK", as it would be difficult to label the veracity of the claim in these types of entries. On the other hand, fact-check and news review claims, which were associated with 141 different veracity labels, did require compression. We standardized the original labels for 4-way classification (see Appendix A.1). The four chosen labels are true, false, mixture, and unproven. We discarded claims with labels that cannot be reduced to one of these four labels. The distribution of labels in the final PUBHEALTH is shown in Table 2. The dataset consists of a majority of false claims. Unproven claims are the least common in the dataset.
The second step in processing the data was to remove claims with no biomedical context. This step was especially crucial for the claims which originated from fact-checking websites, where the bulk of fact-checks concern political and economic claims. Health claims are easier to acquire from news websites, such as Reuters, as they can be quickly identified by the section of the website in which they were located during the data collection process. Although, as mentioned, a sizeable number of claims from fact-checking sources are related to political events, some are connected to both political and health events or other mixed health context, and we collected claims whose subject matter intersects other topics in order to obtain a subject-rich dataset (see Appendix A.1).
Claims in the larger dataset were filtered according to a lexicon of 7,000 unique public health and health policy terms scraped from five health information websites (see Appendix A.1).
Furthermore, we manually added 65 more public health terms that were not retrieved during the initial scraping, but which we determined would positively contribute to the lexicon because of their relevance to the COVID-19 pandemic (see Appendix A.1). These terms were identified through exploratory data analysis of bigram and trigram collocations in PUBHEALTH.
In order to filter out the entries which are not health-related, we kept only claims with main article texts that mentioned more than three unique terms in our lexicon. Specifically, let L be our lexicon, and A_c and T_c, respectively, be the article text and claim text accompanying a candidate dataset entry c. Then, we included in PUBHEALTH only the following set C of claim entries, with accompanying information:
As we already knew that all Reuters health news claims qualify for our dataset, we used the lower bound frequency of words from our lexicon present in these article texts to determine our lower bound of three unique terms. We acknowledge that there might be disparities in the amount of medical information present in entries. However, analysis of the dataset shows, quite promisingly, that on average claims' accompanying article texts have 8.92 ± 5.54 unique health lexicon terms and claim texts carry 4.45 ± 0.88 unique terms from the health lexicon.
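As a minimal sketch of the lexicon filter described in prose above (article texts must mention more than three unique lexicon terms), the check could look like the following. The function names, the whole-word matching strategy, and the toy lexicon are assumptions made for illustration; any condition on the claim text T_c is not modelled here.

```python
import re

def unique_lexicon_terms(text, lexicon):
    """Return the set of lexicon terms that occur in `text` (case-insensitive)."""
    lowered = text.lower()
    found = set()
    for term in lexicon:
        # Whole-word match, so that "flu" does not match inside "fluent".
        if re.search(r"\b" + re.escape(term.lower()) + r"\b", lowered):
            found.add(term.lower())
    return found

def keep_entry(article_text, lexicon, min_terms=3):
    """Keep an entry if its article text has more than `min_terms` unique terms."""
    return len(unique_lexicon_terms(article_text, lexicon)) > min_terms

# Toy lexicon (hypothetical; the real lexicon has roughly 7,000 terms).
toy_lexicon = {"vaccine", "flu", "virus", "infection", "hospital"}
health_article = "The flu vaccine reduces infection; the virus spreads in hospital wards."
political_article = "He is fluent in tax policy."
```

With the toy lexicon, `keep_entry(health_article, toy_lexicon)` holds because five unique terms are found, while the political sentence matches none.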
Claims and explanations in the entries in the dataset were also cleaned. Specifically, we ensured all claims are between 25 and 400 characters in length. We removed claims shorter than 25 characters, as we determined that very few texts below this length were fully formed claims; we removed claims longer than 400 characters to avoid the complexities of dealing with texts containing multiple claims. We also omitted claims and explanations ending in a question mark to ensure that all claims are statements, i.e., clearly defined.
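The cleaning rules above can be sketched as a single predicate. This is only an illustration under stated assumptions: the function name is hypothetical, lengths are measured after stripping surrounding whitespace, and the same check is applied to any text passed in.

```python
def passes_cleaning(text, min_chars=25, max_chars=400):
    """Return True if `text` meets the length bounds and is not a question."""
    stripped = text.strip()
    # Length bounds from the text: between 25 and 400 characters.
    if not (min_chars <= len(stripped) <= max_chars):
        return False
    # Texts ending in a question mark are treated as questions, not statements.
    return not stripped.endswith("?")
```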
Note that one aspect of the explanations' quality which we chose not to control was the intended purpose of the text we labelled as the explanation: as shown in Table 7 in Appendix A.1, there was a wide variation across the websites we crawled.
Table 3 shows the Flesch-Kincaid (Kincaid et al., 1975) and Dale-Chall (Chall and Dale, 1995) readability evaluations of claims from our fact-checking dataset when compared to four other fact-checking datasets. The results show that PUBHEALTH claims are, on average, the most challenging to read. Claims from our dataset have a mean Flesch-Kincaid reading ease score of 59.1, which corresponds to a 10th-12th grade reading level and is fairly difficult to read. The other fact-checking datasets have reading levels which fit into the 6th, 7th and 8th grade categories. Similarly for the Dale-Chall readability metric, on average our claims are more difficult to understand. Our claims have a mean score of 9.5, which is equivalent to the reading age of a college student, whereas all other datasets' claims have an average score which indicates that they are readable by 10th to 12th grade students. Both these results support our earlier assertion about the complexity of public health claims relative to political and more general claims.
In this section we describe in detail the methods we employed for devising automated fact-checking models. We trained two fact-checking models: a classifier for veracity prediction, and a second summarization model for generating fact-checking explanations. The former returns the probability of an input claim text belonging to one of four classes: true, false, unproven, mixture. The latter uses a form of joint extractive and abstractive summarization to generate explanations for the veracity of claims from article text about the claims. Full details of hyperparameters chosen and computer infrastructure which was employed can be found in Appendix A.2.

Veracity Prediction
Veracity prediction is composed of two parts: evidence selection and label prediction (see Figure 1). For evidence selection, within fact-checking and news articles, we employ Sentence-BERT (S-BERT) (Reimers and Gurevych, 2019), a model for sentence-pair regression tasks which is based on the BERT language model (Devlin et al., 2019). We use S-BERT to encode contextualized representations for each of the evidence sentences and then rank these sentences according to their cosine similarity with respect to the contextualized representation of the claim sentence. We then select the top k sentences for veracity prediction. As with sentence selection approaches from the fact-checking literature (Nie et al., 2019; Zhong et al., 2019), we choose k = 5.
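The ranking step can be illustrated with a small sketch. It assumes sentence embeddings have already been computed (in the setting above they would come from S-BERT); here they are plain lists of floats, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def select_evidence(claim_vec, sentence_vecs, sentences, k=5):
    """Return the k sentences whose embeddings are most similar to the claim."""
    scored = [(cosine(claim_vec, vec), i) for i, vec in enumerate(sentence_vecs)]
    # Highest similarity first; ties keep their original article order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sentences[i] for _, i in scored[:k]]
```

The selected sentences, together with the claim, would then be passed to the label prediction model.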
The claim and selected evidence sentences form the inputs for the label prediction part of our model (see Figure 1). We fine-tuned, on the PUBHEALTH dataset, pre-trained models for the downstream task of fact-checking label prediction. We employed four pre-trained models: original BERT uncased, SCIBERT, BIOBERT v1.0, and also BIOBERT v1.1. The two versions of BIOBERT differ slightly in that the earlier version is trained for 470K steps on PubMed abstracts and PubMed Central (PMC) full article texts, whereas BIOBERT v1.1 is trained for 1M steps on PubMed abstracts.

Explanation Generation as Abstractive Summarization
We make use of extractive-abstractive summarization (Liu and Lapata, 2019) in developing the explanation model. We choose this architecture because explanations for claims which concern a specific topic area having a highly complex lexicon can benefit from the ability to articulate judgment in simpler terms. In order to deploy the model proposed by Liu and Lapata (2019) we also implemented an explanation generation model. Just as is the case for the predictor model, the explanation model is fine-tuned for the task on evidence sentences ranked by S-BERT. However, for the explanation model we use all article sentences as well as the claim sentence to fine-tune a BERT-based summarization model pre-trained on the Dailymail/CNN news article and summaries dataset (Hermann et al., 2015). One of our models, EXPLAINERFC, is fine-tuned using non-public health data, which we extract from the portion of the 39.3K originally crawled fact-checks, news reviews, and news articles not included in PUBHEALTH. For fairness, we ensure these data have the same proportion of claims from each website and the number of examples is the same as PUBHEALTH. The second model, EXPLAINERFC-EXPERT, is fine-tuned on PUBHEALTH. We evaluate both models on PUBHEALTH test data. Table 2 shows an example of the explanations generated by the two methods.

Results
We conducted experiments to evaluate the performance of both predictor(s) and explainer(s). The performance of the (various incarnations of the) prediction model is evaluated using an automatic approach, whereas the performance of the (two incarnations of the) explainer is assessed using both automatic and human evaluation.

Gold explanation
Obamacare does not require that patients 76 and older must be admitted to the hospital by their primary care physicians in order to be covered by Medicare.

EXPLAINERFC explanation
What's true: nothing in the Affordable Care Act requires that a primary care physician admit patients 76 or older to a hospital in order for their hospital care to be treated under Medicare. What's false: none of the provisions or rules put an upper age limit on medicare coverage.

EXPLAINERFC-EXPERT explanation
The Affordable Care Act does not require Medicare to admit patients to a hospital after paying the Part B deductible. It's not the same age limit on medicare coverage. But the evidence doesn't specifically set an upper age limit.

Prediction
We split the PUBHEALTH dataset as follows: 9,466 training examples, 1,183 examples for validation and 1,183 examples for testing.
We evaluated veracity prediction using macro-F1, precision, recall and accuracy metrics as shown in Table 4. We employ two baselines: a randomized sentence selection approach with a BERT (bert-base-uncased) classifier, and a BERT model, also using pre-trained uncased BERT, which does not make use of sentence selection and instead makes use of the entire article text to fine-tune for the fact-checking task.
Out of the four BERT-derived models, SCIBERT achieves the highest macro F1, precision and accuracy scores on the test set. BIOBERT v1.1 achieves the second highest scores for F1, precision and accuracy. As expected, BIOBERT v1.1 outperforms BIOBERT v1.0 on all four metrics. The standard BERT model achieves the highest precision score of the four models, however it also achieves the lowest recall and F1 scores. This supports the argument we presented in Section 1 that subject-specific fact-checking can benefit from training on in-domain models.
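For readers unfamiliar with macro-averaging, the following sketch shows how macro-F1 is computed: F1 is calculated per class and then averaged with equal weight, so rare labels such as unproven count as much as frequent ones. The function name is hypothetical, and this is not the evaluation code used for the reported results.

```python
def macro_f1(gold, predicted, labels):
    """Unweighted mean of per-class F1 scores."""
    f1_scores = []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            # A class that is never predicted correctly contributes zero.
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)
```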

Explanations
We use two methods for evaluating the quality of explanations generated by our methods: automated evaluation and qualitative evaluation, in turn amounting to human and computational evaluation of explanation properties.

Automated Evaluation
We make use of ROUGE summarization evaluation metrics (Lin, 2004). Specifically we use the F1 values for ROUGE-1, ROUGE-2, and ROUGE-L, to evaluate the explanations generated by the EXPLAINERFC and EXPLAINERFC-EXPERT models. As in the setup employed by Liu and Lapata (2019), we compare our explanation models to two other methods: a LEAD-3 baseline, which constructs a summary out of the first three sentences of an article, and an extractive summarization-based ORACLE upper bound. The results of this evaluation are shown in Table 5.
Table 5: ROUGE-1 (R1), ROUGE-2 (R2) and ROUGE-L (RL) F1 scores for explanations generated via our two explanation models.
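To make the n-gram metrics and the LEAD-3 baseline concrete, here is a simplified sketch. It assumes whitespace tokenization and lowercasing only; standard ROUGE implementations apply further processing (for example stemming), so scores from this sketch would not match the reported ones. ROUGE-L, which is based on the longest common subsequence, is not covered.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_f1(candidate, reference, n=1):
    """F1 over clipped n-gram overlap between candidate and reference."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # Counter & keeps minimum counts.
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def lead_3(article_sentences):
    """LEAD-3 baseline: the first three sentences of the article."""
    return " ".join(article_sentences[:3])
```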

Evaluation of Explanation Quality
As the explanations we generate are from heterogeneous sources (and therefore not directly comparable), evaluation using ROUGE does not present us with a complete picture of the usefulness or quality of these explanations. For this reason, we adapt to the task of explainable fact-checking three of the desirable usability properties for machine learning explanations offered by Sokol and Flach (2019). We define these properties formally and evaluate the quality of the generated explanations against them. These same properties are also used for our human evaluations and for a comparison between human and computational evaluation of the quality of our explanations. To the best of our knowledge, ours is the first systematic evaluation of the quality of explanations for fact-checking in terms of formal properties. We define the three explanation properties as (two forms of) global coherence and (a form of) local coherence, as follows.
Global Coherence refers to the suitability of a fact-checking explanation with respect to both the claim and the label with which it is associated. We consider two incarnations of global coherence:
• Strong global coherence. Let E be an explanation of the veracity label l for claim C, where e_1, ..., e_N are all the individual sentences which make up E. Then, E satisfies strong global coherence iff ∀e_i ∈ E, e_i ⊨ C. Put simply, for this property to hold for a generated fact-checking explanation, every sentence in the explanatory text must entail (⊨) the claim.
• Weak global coherence. Let E be an explanation of the veracity label l for claim C, where e_1, ..., e_N are all the individual sentences which make up E. Then, E satisfies weak global coherence iff ∀e_i ∈ E, e_i ⊭ ¬C. For this property to hold for a generated fact-checking explanation, no sentence in the explanatory text should contradict the claim (by entailing its negation); from a natural language inference (NLI) perspective, for weak global coherence to hold all explanatory sentences should entail or have a neutral relation with respect to the claim.
When measuring coherence for claims originally labelled as false, we treat the relation as neutral if the claim is contradicted by its explanation. Note that if the false claim is entailed by its explanation we do not reassign the label, because doing so would impose too strong an assumption that the entailment is related to the veracity, which we cannot verify.
Local Coherence. Let E be an explanation of the veracity label l for claim C, where e_1, ..., e_N are all the individual sentences which make up E. Then, E satisfies local coherence iff ∀e_i, e_j ∈ E, e_i ⊭ ¬e_j.
Local coherence is a measure of how cohesive sentences in an explanation are. For local coherence to hold any two sentences in an explanation must not contradict each other, i.e., there is no pairwise disagreement between sentences which make up the explanation.
Note that all three coherence properties relate to the usability property of coherence discussed by Sokol and Flach (2019). Local coherence draws specifically on the idea of avoiding internal inconsistencies in explanations. Figure 3 shows an example of evaluation of the three properties, for a specific claim-explanation pair. Schematic examples of explanations and evidence sentence relations which satisfy these coherence properties are shown in Appendix A.4.
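The three properties can be sketched as checks over the output of an NLI model. In the sketch below, `nli` is a hypothetical callable taking a premise and a hypothesis and returning one of "entailment", "neutral" or "contradiction"; how such a model is obtained is outside the scope of the sketch, and the special handling of false-labelled claims is omitted.

```python
from itertools import permutations

def strong_global_coherence(claim, sentences, nli):
    """Every explanation sentence entails the claim."""
    return all(nli(s, claim) == "entailment" for s in sentences)

def weak_global_coherence(claim, sentences, nli):
    """No explanation sentence contradicts the claim."""
    return all(nli(s, claim) != "contradiction" for s in sentences)

def local_coherence(sentences, nli):
    """No ordered pair of distinct explanation sentences is contradictory."""
    return all(nli(a, b) != "contradiction" for a, b in permutations(sentences, 2))
```

Because entailment rules out contradiction, any explanation that satisfies the strong check in this sketch also satisfies the weak one.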

Human & Computational Evaluations
We employ human evaluation in order to assess the quality of the gold and generated explanations with respect to these properties. Also, we conduct a computational evaluation of the three coherence properties using NLI.
For human evaluation, we randomly sampled 25 entries from the test set of PUBHEALTH, and enlisted 5 annotators to evaluate the quality of the gold explanations and explanations generated by EXPLAINERFC and EXPLAINERFC-EXPERT for these entries. We asked participants to annotate explanations according to the following criteria: 1) the agreement and disagreement between sentences in the explanation, and 2) relevance of the explanation to the claim. Further information, including an example from the questionnaire, can be found in Appendix A.3.
For the human evaluation we computed Randolph's free-marginal κ (Randolph, 2005); see Table 6. Our results suggest that the NLI approximation is a reliable approximation for weak global coherence and local coherence properties. However, entailment appears to be a poor approximation for strong global coherence. Further, a larger human evaluation study would be required in order to verify these results.
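As an illustration of the agreement statistic, Randolph's free-marginal κ compares the observed proportion of agreeing annotator pairs with the chance level 1/q for q categories, i.e. κ = (P_o − 1/q) / (1 − 1/q). The sketch below assumes every item is rated by at least two annotators; the function name and input layout are hypothetical.

```python
def free_marginal_kappa(ratings, num_categories):
    """ratings: one list per item, holding the label chosen by each annotator."""
    agreement_total = 0.0
    for item in ratings:
        n = len(item)
        counts = {}
        for label in item:
            counts[label] = counts.get(label, 0) + 1
        # Proportion of ordered annotator pairs that chose the same label.
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        agreement_total += agreeing_pairs / (n * (n - 1))
    observed = agreement_total / len(ratings)
    expected = 1.0 / num_categories  # Chance agreement with free marginals.
    return (observed - expected) / (1.0 - expected)
```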

Conclusion and Future work
In this paper, we explored fact-checking for claims for which specific expertise is required to produce a veracity prediction and explanations (i.e., judgments used for awarding the label/veracity prediction). To support this exploration we constructed PUBHEALTH, a sizeable dataset for public health fact-checking and the first fact-checking dataset to include explanations as annotations. Our results show that training veracity prediction and explanation generation models on in-domain data improves the accuracy of veracity prediction and the quality of generated explanations compared to training on generic language models without explanation. We hope to explore the topics of explainable fact-checking and specialist fact-checking further. In order to do this, we hope to explore other subjects, in addition to public health, for which fact-checking requires a level of expertise in the subject area. Furthermore, we hope to explore the quality of fact-checking explanations with respect to properties other than coherence, e.g., actionability and impartiality. Lastly, we plan to explore congruity between veracity prediction and explanation generation tasks, i.e., generating explanations which are compatible with the predicted label and vice versa.

Claim: A list of chemicals, written as if they were ingredients on a food label, accurately depicts the chemical composition of a banana.
Label: TRUE
Explanation: In sum, this graphic accurately depicts the chemicals that comprise a banana, using a variety of tactics to make that completely natural food appear to be full of "chemicals", something originally created by a high school chemistry teacher as part of a lesson on chemophobia.

A.2 Reproducibility
Here we provide further information about the experiments described in Section 4.
Prediction models hyperparameters. We perform hyper-parameter grid search as part of validation for batch sizes from {8, 16, 32}, learning rates from {1e-5, 5e-6, 1e-6}, and epochs {2, 3, 4}. We optimize our veracity prediction model on cross entropy loss. The hyper-parameters we selected from this grid search are a batch size of 16, learning rate 1e-6 and 4 epochs for model training.
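The grid above contains 3 × 3 × 3 = 27 configurations. A minimal sketch of such a search follows; `train_and_validate` is a hypothetical callback that trains a model for one configuration and returns a validation score where higher is better (for example, negative validation loss). The selection criterion is an assumption of the sketch.

```python
from itertools import product

BATCH_SIZES = [8, 16, 32]
LEARNING_RATES = [1e-5, 5e-6, 1e-6]
EPOCHS = [2, 3, 4]

def grid_search(train_and_validate):
    """Try every configuration and return the best one with its score."""
    best_config, best_score = None, float("-inf")
    for batch_size, lr, epochs in product(BATCH_SIZES, LEARNING_RATES, EPOCHS):
        config = {"batch_size": batch_size, "learning_rate": lr, "epochs": epochs}
        score = train_and_validate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score
```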
Computing Infrastructure. All experiments were run on a machine with a dual Intel(R) Core(TM) i9-9900X 3.50GHz CPU. The GPU used for experiments is the Nvidia GeForce RTX 2080 Ti model. Additional information about the software packages used in the development of the explanation generation and veracity prediction models can be found in the GitHub repository, the link to which is given in Footnote 1.

A.3 Human Evaluation Questionnaire
The following are example question and response pairs typical of those presented to participants in the human evaluation questionnaire (see Section 5.3). Question and response pairs are related to the claim and explanation presented below.
1. Question: Are there any sentences or phrases in the explanation which disagree with each other?
2. Question: Which veracity label would you give to the claim taking into account the entire explanation?

Claim: State reports new findings of mosquito-borne illnesses.
Explanation: Rhode Island health officials say a second mosquito case tested positive for eastern equine encephalitis has been confirmed in the state, marking the first human case of the equine encephalitis in Rhode Island in more than two years.