Generating Fact Checking Explanations

Most existing work on automated fact checking is concerned with predicting the veracity of claims based on metadata, social network spread, language used in claims, and, more recently, evidence supporting or denying claims. A crucial piece of the puzzle that is still missing is to understand how to automate the most elaborate part of the process – generating justifications for verdicts on claims. This paper provides the first study of how these explanations can be generated automatically based on available claim context, and how this task can be modelled jointly with veracity prediction. Our results indicate that optimising both objectives at the same time, rather than training them separately, improves the performance of a fact checking system. The results of a manual evaluation further suggest that the informativeness, coverage and overall quality of the generated explanations are also improved in the multi-task model.


Introduction
When a potentially viral news item is rapidly or indiscriminately published by a news outlet, the responsibility of verifying the truthfulness of the item is often passed on to the audience. To alleviate this problem, independent teams of professional fact checkers manually verify the veracity and credibility of common or particularly checkworthy statements circulating the web. However, these teams have limited resources to perform manual fact checks, thus creating a need for automating the fact checking process.
The current research landscape in automated fact checking comprises systems that estimate the veracity of claims based on available metadata and evidence pages. Datasets like LIAR (Wang, 2017) and the multi-domain dataset MultiFC (Augenstein et al., 2019) provide real-world benchmarks for evaluation. There are also artificial datasets of a larger scale, e.g., the FEVER (Thorne et al., 2018) dataset based on Wikipedia articles. As evident from the effectiveness of state-of-the-art methods for both real-world data (0.492 macro F1 score; Augenstein et al., 2019) and artificial data (68.46 FEVER score, i.e., label accuracy conditioned on evidence provided for 'supported' and 'refuted' claims; Stammbach and Neumann, 2019), the task of automating fact checking remains a significant research challenge.
A prevalent component of existing fact checking systems is a stance detection or textual entailment model that predicts whether a piece of evidence contradicts or supports a claim (Ma et al., 2018; Mohtarami et al., 2018; Xu et al., 2018). Existing research, however, rarely attempts to directly optimise the selection of relevant evidence, i.e., the self-sufficient explanation for predicting the veracity label (Thorne et al., 2018; Stammbach and Neumann, 2019). On the other hand, Alhindi et al. (2018) have reported a significant performance improvement of over 10% macro F1 score when the system is provided with a short human explanation of the veracity label. Still, there are no attempts at automatically producing explanations, and automating the most elaborate part of the process, producing the justification for the veracity prediction, is an understudied problem.
arXiv:2004.05773v1 [cs.CL] 13 Apr 2020
In the field of NLP as a whole, both explainability and interpretability methods have gained importance recently, because most state-of-the-art models are large, neural black-box models. Interpretability, on one hand, provides an overview of the inner workings of a trained model such that a user could, in principle, follow the same reasoning to come up with predictions for new instances. However, with the increasing number of neural units in published state-of-the-art models, it becomes infeasible for users to track all decisions being made by the models. Explainability, on the other hand, deals with providing local explanations about single data points, which either suggest the most salient areas of the input or are generated textual explanations for a particular prediction.
Saliency explanations have been studied extensively (Adebayo et al., 2018; Arras et al., 2019; Poerner et al., 2018). However, they only uncover regions with high contributions to the final prediction, while the reasoning process remains behind the scenes. An alternative method explored in this paper is to generate textual explanations. In one of the few prior studies on this, the authors find that feeding generated explanations about multiple choice question answers to the answer predicting system improved QA performance (Rajani et al., 2019).
Inspired by this, we research how to generate explanations for veracity prediction. We frame this as a summarisation task, where, provided with elaborate fact checking reports, later referred to as ruling comments, the model has to generate veracity explanations close to the human justifications, as in the example in Table 1. We then explore the benefits of training a joint model that learns to generate veracity explanations while also predicting the veracity of a claim. In summary, our contributions are as follows:
1. We present the first study on generating veracity explanations, showing that they can successfully describe the reasons behind a veracity prediction.
2. We find that the performance of a veracity classification system can leverage information from the elaborate ruling comments, and can be further improved by training veracity prediction and veracity explanation jointly.
3. We show that optimising the joint objective of veracity prediction and veracity explanation produces explanations that achieve better coverage and overall quality and serve better at explaining the correct veracity label than explanations learned solely to mimic human justifications.

Dataset
Existing fact checking websites publish claim veracity verdicts along with ruling comments to support the verdicts. Most ruling comments span long pages and contain redundancies, making them hard to follow. Textual explanations, by contrast, are succinct and provide the main arguments behind the decision. PolitiFact provides a summary of a claim's ruling comments that summarises the whole explanation in just a few sentences. We use the PolitiFact-based dataset LIAR-PLUS (Alhindi et al., 2018), which contains 12,836 statements with their veracity justifications. The justifications are automatically extracted from the long ruling comments, as their location is clearly indicated at the end of the ruling comments. Any sentences containing words that indicate the label, which Alhindi et al. (2018) define as words identical or similar to the label, are removed. We follow the same procedure to also extract the ruling comments without the summary at hand.
We remove instances that contain fewer than three sentences in the ruling comments, as they indicate short veracity reports where no summary is present. The final dataset consists of 10,146 training, 1,278 validation, and 1,255 test data points. A claim's ruling comments in the dataset span 39 sentences or 904 words on average, while the justification fits in four sentences or 89 words on average.

Method
We now describe the models we employ for separately training (1) explanation extraction and (2) veracity prediction, as well as (3) the joint model trained to optimise both.
The models are based on DistilBERT (Sanh et al., 2019), which is a reduced version of BERT (Devlin et al., 2019) performing on par with it, as reported by the authors. For each of the models described below, we take the version of DistilBERT that is pre-trained with a language-modelling objective and further fine-tune its embeddings for the specific task at hand.

Generating Explanations
Our explanation model, shown in Figure 1 (left), is inspired by the recent success of utilising the transformer model architecture for extractive summarisation (Liu and Lapata, 2019). It learns to maximise the similarity of the extracted explanation with the human justification.
We start by greedily selecting the top $k$ sentences from each claim's ruling comments that achieve the highest ROUGE-2 F1 score when compared to the gold justification. We choose $k = 4$, as that is the average number of sentences in veracity justifications. The selected sentences, referred to as oracles, serve as positive gold labels $y^E \in \{0, 1\}^N$, where $N$ is the total number of sentences present in the ruling comments. Appendix A.1 provides an overview of the coverage that the extracted oracles achieve compared to the gold justification. Appendix A.2 further presents examples of the selected oracles, compared to the gold justification.
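The oracle construction can be illustrated with a minimal sketch. It assumes whitespace tokenisation, a simplified ROUGE-2 F1 (clipped bigram overlap, no stemming), and an incremental reading of "greedy" in which each step adds the sentence that gives the highest score for the set selected so far. The helper names `bigrams`, `rouge2_f1`, and `select_oracles` are hypothetical and not taken from the paper.

```python
from collections import Counter

def bigrams(tokens):
    """Multiset of adjacent token pairs."""
    return Counter(zip(tokens, tokens[1:]))

def rouge2_f1(candidate, reference):
    """Simplified ROUGE-2 F1 on bigram multisets (assumption: no stemming)."""
    overlap = sum((candidate & reference).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(candidate.values())
    recall = overlap / sum(reference.values())
    return 2 * precision * recall / (precision + recall)

def select_oracles(sentences, justification, k=4):
    """Return 0/1 labels marking k greedily chosen oracle sentences."""
    reference = bigrams(justification.lower().split())
    sentence_bigrams = [bigrams(s.lower().split()) for s in sentences]
    selected, current = [], Counter()
    for _ in range(min(k, len(sentences))):
        best_score, best_index = -1.0, None
        for i, grams in enumerate(sentence_bigrams):
            if i in selected:
                continue
            # Score the already selected sentences plus this candidate.
            score = rouge2_f1(current + grams, reference)
            if score > best_score:  # ties keep the earlier sentence
                best_score, best_index = score, i
        selected.append(best_index)
        current = current + sentence_bigrams[best_index]
    return [1 if i in selected else 0 for i in range(len(sentences))]

ruling = ["the cat sat on the mat", "dogs bark loudly at night", "the mat was red"]
gold = "the cat sat on the mat and the mat was red"
labels = select_oracles(ruling, gold, k=2)
```

The returned list plays the role of $y^E$: one binary label per sentence of the ruling comments.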
At training time, we learn a function $f(X) = p^E$, $p^E \in \mathbb{R}^{1,N}$, that, based on the input $X$ (the text of the claim and the ruling comments), predicts which sentences should be selected ($\{0, 1\}$) to constitute the explanation. At inference time, we select the top $n = 4$ sentences with the highest confidence scores.
Our extraction model, represented by the function $f(X)$, takes the contextual representations produced by the last layer of DistilBERT and feeds them into a feed-forward task-specific layer $h \in \mathbb{R}^h$. It is followed by the prediction layer $p^E \in \mathbb{R}^{1,N}$ with sigmoid activation. The prediction is used to optimise the cross-entropy loss function $\mathcal{L}_E = \mathcal{H}(p^E, y^E)$.
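As a rough illustration of the prediction layer and its loss, the sketch below applies a sigmoid to per-sentence scores, computes a binary cross-entropy against oracle labels, and picks the top $n = 4$ sentences. The logits are made-up values standing in for the output of the task-specific layer, and averaging the loss over sentences is an assumption.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(probs, labels, eps=1e-12):
    """Cross-entropy between sentence probabilities and 0/1 oracle labels,
    averaged over the N sentences (averaging is an assumption)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)

def top_n_sentences(probs, n=4):
    """Indices of the n most confident sentences, in document order."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return sorted(ranked[:n])

logits = [2.0, -1.0, 0.5, -3.0, 1.5, 0.0]  # hypothetical per-sentence scores
probs = [sigmoid(z) for z in logits]
oracle_labels = [1, 0, 1, 0, 1, 0]
loss = binary_cross_entropy(probs, oracle_labels)
chosen = top_n_sentences(probs, n=4)
```

Returning the chosen indices in document order keeps the extracted explanation readable as a coherent passage.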

Veracity Prediction
For the veracity prediction model, shown in Figure 1 (right), we learn a function $g(X) = p^F$ that, based on the input $X$, predicts the veracity of the claim $y^F \in Y_F$, $Y_F = \{$true, false, half-true, barely-true, mostly-true, pants-on-fire$\}$.
The function $g(X)$ takes the contextual token representations from the last layer of DistilBERT and feeds them to a task-specific feed-forward layer $h \in \mathbb{R}^h$. It is followed by the prediction layer with a softmax activation $p^F \in \mathbb{R}^6$. We use the prediction to optimise a cross-entropy loss function $\mathcal{L}_F = \mathcal{H}(p^F, y^F)$.
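The six-way prediction layer can be sketched as a softmax over six scores followed by a cross-entropy against the gold label. The logits below are invented for illustration; only the label set comes from the text.

```python
import math

LABELS = ["true", "false", "half-true", "barely-true", "mostly-true", "pants-on-fire"]

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, gold_index):
    """Negative log-probability assigned to the gold class."""
    return -math.log(probs[gold_index])

logits = [0.2, 1.3, -0.5, 0.0, 0.1, 2.1]  # hypothetical output of the feed-forward layer
p_f = softmax(logits)
predicted = LABELS[max(range(len(p_f)), key=p_f.__getitem__)]
loss_f = cross_entropy(p_f, LABELS.index("false"))
```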

Joint Training
Finally, we learn a function $h(X) = (p^E, p^F)$ that, given the input $X$ (the text of the claim and the ruling comments), predicts both the veracity explanation $p^E$ and the veracity label $p^F$ of a claim. The model is shown in Figure 2. The function $h(X)$ takes the contextual embeddings $c_E$ and $c_F$ produced by the last layer of DistilBERT and feeds them into a cross-stitch layer (Misra et al., 2016; Ruder et al., 2019), which consists of two layers with two shared subspaces each: $h^1_E$ and $h^2_E$ for the explanation task, and $h^1_F$ and $h^2_F$ for the veracity prediction task. In each of the two layers, there is one subspace for task-specific representations and one that learns cross-task representations. The subspaces and layers interact through $\alpha$ values, creating the linear combinations $h^i_E$ and $h^j_F$, where $i, j \in \{1, 2\}$. We further combine the resulting two subspaces for each task, $h^i_E$ and $h^j_F$, with parameters $\beta$ to produce one representation per task $P \in \{E, F\}$. Finally, we use the produced representations to predict $p^E$ and $p^F$, with feed-forward layers followed by sigmoid and softmax activations, respectively. We use the predictions to optimise the joint loss function $\mathcal{L}_{MT} = \gamma \cdot \mathcal{H}(p^E, y^E) + \eta \cdot \mathcal{H}(p^F, y^F)$, where $\gamma$ and $\eta$ are used for the weighted combination of the individual loss functions.
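To make the cross-stitch idea more concrete, the sketch below mixes an explanation-task vector and a veracity-task vector with a 2x2 matrix of $\alpha$ weights, merges two subspaces with $\beta$ weights, and forms a weighted joint loss. This is a generic cross-stitch unit in the spirit of Misra et al. (2016), not the exact parameterisation used here; the function names, the scalar form of $\alpha$ and $\beta$, and the pairing of $\gamma$ with the explanation loss are all assumptions.

```python
def cross_stitch(h_e, h_f, alpha):
    """Linearly mix two task representations.

    alpha is ((a_EE, a_EF), (a_FE, a_FF)): how much each task keeps of its
    own representation and how much it borrows from the other task.
    """
    (a_ee, a_ef), (a_fe, a_ff) = alpha
    mixed_e = [a_ee * e + a_ef * f for e, f in zip(h_e, h_f)]
    mixed_f = [a_fe * e + a_ff * f for e, f in zip(h_e, h_f)]
    return mixed_e, mixed_f

def combine_subspaces(h1, h2, beta):
    """Weighted sum of a task's two subspaces into one representation."""
    b1, b2 = beta
    return [b1 * x + b2 * y for x, y in zip(h1, h2)]

def joint_loss(loss_e, loss_f, gamma=0.9, eta=0.1):
    """Weighted combination of explanation and veracity losses (assumed pairing)."""
    return gamma * loss_e + eta * loss_f

mixed_e, mixed_f = cross_stitch([1.0, 2.0], [3.0, 4.0], ((0.9, 0.1), (0.1, 0.9)))
h_task = combine_subspaces(mixed_e, mixed_f, (0.5, 0.5))
total = joint_loss(1.0, 2.0)
```

In a trained model the $\alpha$ and $\beta$ values would be learned parameters rather than fixed constants as in this toy example.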

Automatic Evaluation
We first conduct an automatic evaluation of both the veracity prediction and veracity explanation models.

Experiments
In Table 3, we compare the performance of the two proposed models for generating extractive explanations. Explain-MT is trained jointly with a veracity prediction model, and Explain-Extractive is trained separately. We include the Lead-4 system (Nallapati et al., 2017) as a baseline, which selects as a summary the first four sentences from the ruling comments. The Oracle system presents the best greedy approximation of the justification with sentences extracted from the ruling comments. It indicates the upper bound that could be achieved by extracting sentences from the ruling comments as an explanation. The performance of the models is measured using ROUGE-1, ROUGE-2, and ROUGE-L F1 scores.
In Table 2, we again compare two models: one trained jointly with the explanation generation task (MT-Veracity@Rul) and one trained separately (Veracity@Rul). As a baseline, we report the work of Wang (2017), who trains a model based on the metadata available about the claim. It is the best known model that uses only the information available from the LIAR dataset and not the gold justification, which we aim at generating.
We also provide two upper bounds serving as an indication of the approximate best performance that can be achieved given the gold justification. The first is the reported system performance from Alhindi et al. (2018), and the second, Veracity@Just, is our veracity prediction model trained on gold justifications. The Alhindi et al. (2018) system is trained using a BiLSTM, while we train the Veracity@Just model using the same model architecture as for predicting the veracity from the ruling comments with Veracity@Rul.
Lastly, Veracity@RulOracles is the veracity model trained on the gold oracle sentences from the ruling comments. It provides a rough estimate of how much of the important information from the ruling comments is preserved in the oracles. The models are evaluated with a macro F1 score.
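For reference, macro F1 averages the per-class F1 scores with equal weight per class, so rare labels count as much as frequent ones. A minimal sketch with toy labels (the function name `macro_f1` is ours):

```python
def macro_f1(gold, predicted, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, predicted) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, predicted) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, predicted) if g == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

gold = ["true", "false", "true", "false"]
pred = ["true", "true", "true", "false"]
score = macro_f1(gold, pred, ["true", "false"])
```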

Experimental Setup
Our models employ the base, uncased version of the pre-trained DistilBERT model. The models are fed with text depending on the task set-up: claim and ruling sentences for the explanation and joint models; claim and ruling sentences, claim and oracle sentences, or claim and justification for the fact-checking model. We insert a '[CLS]' token before the start of each ruling sentence (explanation model), before the claim (fact-checking model), or at the combination of both for the joint model. The text sequence is passed through a number of Transformer layers from DistilBERT. We use the '[CLS]' embeddings from the final contextual layer of DistilBERT and feed them into task-specific feed-forward layers $h \in \mathbb{R}^h$, where $h$ is 100 for the explanation task, 150 for the veracity prediction one, and 100 for each of the joint cross-stitch subspaces. These are followed by the task-specific prediction layers $p^E$.
The size of $h$ is picked with a grid search over {50, 100, 150, 200, 300}. We also experimented with replacing the feed-forward task-specific layers with an RNN or Transformer layer or including an activation function, which did not improve task performance.
The models are trained for up to 3 epochs, and, following Liu and Lapata (2019), we evaluate the performance of the fine-tuned model on the validation set at every 50 steps, after the first epoch.
We then select the model with the best ROUGE-2 F1 score on the validation set, thus potentially performing early stopping. The learning rate used is 3e-5, which is chosen with a grid search over {3e-5, 4e-5, 5e-5}. We perform 175 warm-up steps (5% of the total number of steps), after also experimenting with 0, 100, and 1000 warm-up steps. Optimisation is performed with AdamW (Loshchilov and Hutter, 2017), and the learning rate is scheduled with a warm-up linear schedule (Goyal et al., 2017). The batch size during training and evaluation is 8.
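The learning-rate schedule can be sketched as follows. The 3,500 total steps are derived from the statement that 175 warm-up steps are 5% of the total; the linear decay to zero after warm-up is an assumption based on the common form of a linear schedule with warm-up, since the text does not spell out the decay.

```python
def linear_warmup_lr(step, base_lr=3e-5, warmup_steps=175, total_steps=3500):
    """Learning rate at a given optimisation step.

    Rises linearly from 0 to base_lr during warm-up, then decays linearly
    to 0 at total_steps (the decay part is an assumption).
    """
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)

schedule = [linear_warmup_lr(s) for s in (0, 175, 3500)]
```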
The maximum input words to DistilBERT are 512, while the average length of the ruling comments is 904 words. To prevent the loss of any sentences from the ruling comments, we apply a sliding window over the input text and then merge the contextual representations of the separate sliding windows, mean averaging the representations in the overlap of the windows. The size of the sliding window is 300, with a stride of 60 tokens, which is the number of overlapping tokens between two successive windows. The maximum length of the encoded sequence is 1200. We find that these hyper-parameters have the best performance after experimenting with different values in a grid search.
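A minimal sketch of the windowing and merging step is given below, assuming a non-empty token sequence and per-token vectors represented as plain Python lists. The function names are hypothetical, and the encoder that would produce the per-window vectors is left out.

```python
def window_spans(num_tokens, window=300, overlap=60):
    """(start, end) spans covering the sequence; successive spans share
    `overlap` tokens."""
    step = window - overlap
    spans, start = [], 0
    while True:
        end = min(start + window, num_tokens)
        spans.append((start, end))
        if end == num_tokens:
            break
        start += step
    return spans

def merge_windows(window_vectors, spans, num_tokens):
    """Merge per-window token vectors, averaging positions that appear in
    more than one window."""
    dim = len(window_vectors[0][0])
    sums = [[0.0] * dim for _ in range(num_tokens)]
    counts = [0] * num_tokens
    for vectors, (start, _end) in zip(window_vectors, spans):
        for offset, vec in enumerate(vectors):
            pos = start + offset
            counts[pos] += 1
            for d in range(dim):
                sums[pos][d] += vec[d]
    return [[v / counts[i] for v in sums[i]] for i in range(num_tokens)]

# Toy example: 10 tokens, windows of 4 with 1 overlapping token.
spans = window_spans(10, window=4, overlap=1)
# Stand-in for encoder output: window w yields the vector [w] for each token.
fake_vectors = [[[float(w)] for _ in range(end - start)]
                for w, (start, end) in enumerate(spans)]
merged = merge_windows(fake_vectors, spans, 10)
```

With the settings from the text (window 300, overlap 60), a sequence of the maximum length 1200 is covered by five windows.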
We also include a dropout layer (with 0.1 rate for the separate and 0.15 for the joint model) after the contextual embedding provided by the transformer models and after the first linear layer as well.
The models optimise cross-entropy loss, and the joint model optimises a weighted combination of both losses. Weights are selected with a grid search: 0.9 for the task of explanation generation and 0.1 for veracity prediction. The best performance is reached with weights that bring the losses of the individual models to roughly the same scale.

Results and Discussion
For each claim, our proposed joint model (see 3.3) provides both (i) a veracity explanation and (ii) a veracity prediction. We compare our model's performance with models that learn to optimise these objectives separately, as no other joint models have been proposed. Table 2 shows the results of veracity prediction, measured in terms of macro F1.
Judging from the performance of both Veracity@Rul and MT-Veracity@Rul, we can assume that the task is very challenging. Even given a gold explanation (Alhindi et al. (2018) and Veracity@Just), the macro F1 remains below 0.5. This can be due to the small size of the dataset and/or the difficulty of the task even for human annotators. We further investigate the difficulty of the task in a human evaluation, presented in Section 5.
Comparing Veracity@RulOracles and Veracity@Rul, we see that the latter achieves a slightly higher macro F1 score, indicating that the extracted ruling oracles, while approximating the gold justification, omit information that is important for veracity prediction. Finally, when the fact checking system is learned jointly with the veracity explanation system (MT-Veracity@Rul), it achieves the best macro F1 score of the three systems. The objective to extract explanations provides information about regions in the ruling comments that are close to the gold explanation, which helps the veracity prediction model to choose the correct piece of evidence.
In Table 3, Explain-Extractive outperforms the Explain-MT system. While we would expect that training jointly with a veracity prediction objective would improve the performance of the explanation model, as it does for the veracity prediction model, we observe the opposite. This indicates a potential mismatch between the ruling oracles and the salient regions for the fact checking model. We also find a potential indication of that in the observed performance decrease when the veracity model is trained solely on the ruling oracles compared to the one trained on all of the ruling comments. We hypothesise that, when trained jointly with the veracity prediction component, the explanation model starts to also take into account the actual knowledge needed to perform the fact check, which might not match the exact wording present in the oracles, thus decreasing the overall performance of the explanation system. We further investigate this in a manual evaluation of which of the systems, Explain-MT or Explain-Extractive, generates explanations of better quality and with more information about the veracity label.
Finally, comparing the performance of the extractive models and the Oracle, we can conclude that there is still room for improvement of explanation systems when only considering extractive summarisation.

A Case Study
Table 4 presents two example explanations generated by the extractive vs. the multi-task model. In the first example, the multi-task explanation achieves higher ROUGE scores than the extractive one. The corresponding extractive summary contains information that is not important for the final veracity label, which also appears to affect the ROUGE scores of the explanation. On the other hand, the multi-task model, trained jointly with a veracity prediction component, selects sentences that are more important for the fact check, which in this case is also beneficial for the final ROUGE score of the explanation.
In the second example, the multi-task explanation has lower ROUGE scores than the extractive one. We observe that the gold justification contains some sentences that are not relevant to the fact check, and the extractive model is misled into selecting explanation sentences that are close to the gold summary. As a result, the explanation does not provide enough information about the chosen veracity label. The multi-task model, on the other hand, selects sentences that also contribute to the prediction of the veracity labels. Thus, its explanation turns out to be more beneficial for the final fact check even though it has a lower ROUGE score compared to the gold justification.

Manual Evaluation
As the ROUGE score only accounts for word-level similarity between gold and predicted justifications, we also conduct a manual evaluation of the quality of the produced veracity explanations.

Experiments
Explanation Quality. We first provide a manual evaluation of the properties of three different types of explanations: the gold justification, the veracity explanation generated by Explain-MT, and the one generated by Explain-Extractive. We ask three annotators to rank these explanations with the ranks 1, 2, 3 (first, second, and third place) according to four different criteria:
1. Coverage. The explanation contains important, salient information and does not miss any important points that contribute to the fact check.
2. Non-redundancy. The summary does not contain any information that is redundant/repeated/not relevant to the claim and the fact check.
3. Non-contradiction. The summary does not contain any pieces of information that are contradictory to the claim and the fact check.
4. Overall. Rank the explanations by their overall quality.
Table 4: Examples of the generated explanation of the extractive (Explain-Extr) and the multi-task model (Explain-MT) compared to the gold justification (Just).
We also allow ties, meaning that two veracity explanations can receive the same rank if they appear the same.
For the annotation task set-up, we randomly select a small set of 40 instances from the test set and collect the three different veracity explanations for each of them. We did not provide the participants with information about the three different explanations and shuffled them randomly to avoid creating a position bias for the explanations. The annotators worked separately without discussing any details about the annotation task.
Explanation Informativeness. In the second manual evaluation task, we study how well the veracity explanations manage to address the information need of the user and whether they sufficiently describe the veracity label. We therefore design the annotation task by asking annotators to provide a veracity label for a claim based on a veracity explanation coming from the justification, the Explain-MT, or the Explain-Extractive system. The annotators have to provide a veracity label on two levels: binary classification (true or false) and six-class classification (true, false, half-true, barely-true, mostly-true, pants-on-fire). Each of them has to provide the label for 80 explanations, and there are two annotators per explanation.

Results and Discussion
Explanation Quality. Table 5 presents the results from the manual evaluation in the first set-up, described in Section 5, where annotators ranked the explanations according to four different criteria.
We compute Krippendorff's α inter-annotator agreement (IAA; Hayes and Krippendorff, 2007), as it is suited for ordinal values. The corresponding α values are 0.26 for Coverage, 0.18 for Non-redundancy, -0.1 for Non-contradiction, and 0.32 for Overall, where 0.67 < α < 0.8 is regarded as significant, although such thresholds vary considerably across domains.
We assume that the low IAA can be attributed to the fact that in ranking/comparison tasks for manual evaluation, the agreement between annotators might be affected by small differences of one rank position in one of the annotators' rankings, as well as by annotator bias towards ranking explanations as ties. Taking this into account, we choose to present the mean average rank for each of the annotators instead. Still, we find that their preferences are not in perfect agreement and report only what the majority agrees upon. We also consider that the low IAA reveals that the task might be "already too difficult for humans". This insight proves to be important on its own, as existing machine summarisation/question answering studies involving human evaluation do not report IAA scores (Liu and Lapata, 2019), thus leaving essential details about the nature of the evaluation tasks ambiguous.
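One plausible reading of the Mean Average Rank (MAR) reported in Table 5 is: average the ranks that a single annotator gave to a system across all instances, then take the mean over annotators. The sketch below follows that reading; it is an assumption about the aggregation, and the rank values are invented.

```python
def mean_average_rank(ranks_by_annotator):
    """ranks_by_annotator: for one system, a list of rank lists (1 = best),
    one list per annotator. Lower values indicate a better-ranked system."""
    per_annotator = [sum(ranks) / len(ranks) for ranks in ranks_by_annotator]
    return sum(per_annotator) / len(per_annotator)

# Three hypothetical annotators ranking one system on four instances.
mar = mean_average_rank([[1, 2, 1, 2], [1, 1, 2, 2], [3, 1, 1, 1]])
```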
We find that the gold explanation is ranked the best for all criteria except for Non-contradiction, where one of the annotators found that it contained more contradictory information than the automatically generated explanations, but Krippendorff's α indicates that there is no agreement between the annotations for this criterion.
Out of the two extractive explanation systems, Explain-MT ranks best in the Coverage and Overall criteria, with corresponding improvements of 0.21 and 0.13 in the ranking position. These results contradict the automatic evaluation in Section 4.3, where the explanations of Explain-MT had lower ROUGE F1 scores. This indicates that an automatic evaluation might be insufficient for estimating the information conveyed by a particular explanation.
On the other hand, Explain-Extr is ranked higher than Explain-MT in terms of Non-redundancy and Non-contradiction, where the last criterion was disagreed upon, and the rank improvement for the first one is only marginal at 0.04.
This implies that a veracity prediction objective is not necessary to produce natural-sounding explanations (Explain-Extr), but that it is useful for generating explanations that are better overall and have higher coverage (Explain-MT).
Table 6 presents the results from the second manual evaluation task, where annotators provided the veracity of a claim based on an explanation from one of the systems. We here show the results for binary labels, as annotators struggled to distinguish between 6 labels. The latter follow the same trends and are shown in Appendix A.3.
The Fleiss' κ IAA for binary prediction is 0.269 for Just, 0.345 for Explain-MT, and 0.399 for Explain-Extr. The highest agreement is achieved for Explain-Extr, which is supported by the highest proportion of agreeing annotations in Table 6. Surprisingly, the gold explanations from Just were most disagreed upon. Apart from that, looking at the agreeing annotations, gold explanations were found most sufficient in providing information about the veracity label and were also found to explain the correct label most of the time. They are followed by the explanations produced by Explain-MT. This supports the findings of the first manual evaluation, where Explain-MT ranked better in coverage and overall quality than Explain-Extr.
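For readers unfamiliar with the agreement statistic, the sketch below computes Fleiss' κ from a table of per-item category counts. The toy data (four items, two annotators, binary labels) are invented and unrelated to the reported values; the function assumes every item has the same number of annotators and that more than one category is actually used.

```python
def fleiss_kappa(ratings):
    """ratings[i][c] = number of annotators who assigned item i to category c."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])
    # Share of all assignments that fall into each category.
    p_cat = [sum(row[c] for row in ratings) / (n_items * n_raters)
             for c in range(n_cats)]
    # Observed agreement per item.
    p_item = [(sum(x * x for x in row) - n_raters) / (n_raters * (n_raters - 1))
              for row in ratings]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    return (p_observed - p_expected) / (1 - p_expected)

# Columns: [votes for "true", votes for "false"].
toy = [[2, 0], [0, 2], [1, 1], [2, 0]]
kappa = fleiss_kappa(toy)
```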

Related Work
Generating Explanations. Generating textual explanations for model predictions is an understudied problem. The first study was Camburu et al. (2018), who generate explanations for the task of natural language inference. The authors explore three different set-ups: prediction pipelines with explanation followed by prediction, and prediction followed by explanation, and a joint multi-task learning setting. They find that first generating the explanation produces better results for the explanation task, but harms classification accuracy. We are the first to provide a study on generating veracity explanations. We show that the generated explanations improve veracity prediction performance, and find that jointly optimising the veracity explanation and veracity prediction objectives improves the coverage and the overall quality of the explanations.
Fact Checking Interpretability. Interpreting fact checking systems has been explored in a few studies. Shu et al. (2019) study the interpretability of a system that fact checks full-length news pages by leveraging user comments from social platforms. They propose a co-attention framework [...] (2019) build an interpretable fact-checking system XFake, where shallow student and self-attention, among others, are used to highlight parts of the input. This is done solely based on the statement without considering any supporting facts. In our work, we research models that generate human-readable explanations, and directly optimise the quality of the produced explanations instead of using attention weights as a proxy. We use the LIAR dataset to train such models, which contains fact checked single-sentence claims that already contain professional justifications. As a result, we make an initial step towards automating the generation of professional fact checking justifications.
Veracity Prediction. Several studies have built fact checking systems for the LIAR dataset (Wang, 2017). The model proposed by Karimi et al. (2018) reaches 0.39 accuracy by using metadata, ruling comments, and justifications. Alhindi et al. (2018) also train a classifier that, based on the statement and the justification, achieves 0.37 accuracy. To the best of our knowledge, Long et al. (2017) is the only system that, without using justifications, achieves a performance above the baseline of Wang (2017), with an accuracy of 0.415, the current state-of-the-art performance on the LIAR dataset. Their model learns a veracity classifier with speaker profiles. While using metadata and external speaker profiles might provide substantial information for fact checking, they also have the potential to introduce biases towards a certain party or a speaker.
In this study, we propose a method to generate veracity explanations that would explain the reasons behind a certain veracity label independently of the speaker profile. Once trained, such methods could then be applied to other fact checking instances without human-provided explanations or even to perform end-to-end veracity prediction and veracity explanation generation given a claim. Substantial research on fact checking methods exists for the FEVER dataset (Thorne et al., 2018), which comprises rewritten claims from Wikipedia. Systems typically perform document retrieval, evidence selection, and veracity prediction. Evidence selection is performed using keyword matching (Malon, 2018; Yoneda et al., 2018), supervised learning (Hanselowski et al., 2018; Chakrabarty et al., 2018), or sentence similarity scoring (Ma et al., 2018; Mohtarami et al., 2018; Xu et al., 2018). More recently, the multi-domain dataset MultiFC (Augenstein et al., 2019) has been proposed, which is also distributed with evidence pages. Unlike FEVER, it contains real-world claims, crawled from different fact checking portals.
While FEVER and MultiFC are larger datasets for fact checking than LIAR-PLUS, they do not contain veracity explanations and can thus not easily be used to train joint veracity prediction and explanation generation models, hence we did not use them in this study.

Conclusions
We presented the first study on generating veracity explanations, and we showed that veracity prediction can be combined with veracity explanation generation and that the multi-task set-up improves the performance of the veracity system. A manual evaluation shows that the coverage and the overall quality of the explanation system are also improved in the multi-task set-up.
For future work, an obvious next step is to investigate the possibility of generating veracity explanations from evidence pages crawled from the Web. Furthermore, other approaches to generating veracity explanations should be investigated, especially as they could improve fluency or decrease the redundancy of the generated text.

Acknowledgments
This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 801199.

A.1 Comparison of different sources of evidence
Table 7 provides an overview of the ruling comments and the ruling oracles compared to the justification. The high recall in both ROUGE-1 and ROUGE-L achieved by the ruling comments indicates that there is substantial coverage, i.e., over 70% of the words and long sequences in the justification can be found in the ruling comments. On the other hand, there is only small coverage for the bigrams. Selecting the oracles from all of the ruling sentences increases ROUGE F1 scores mainly by improving the precision.

A.2 Extractive Gold Oracle Examples
Table 8 presents examples of selected oracles that serve as gold labels when training the extractive summarisation model. The three examples represent oracles with different degrees of matching the gold summary. The first row presents an oracle that matches the gold summary with a ROUGE-L F1 score of 60.40. It contains all of the important information from the gold summary and even reports precise, not rounded, numbers. The next example has a ROUGE-L F1 score of 43.33, which is close to the average ROUGE-L F1 score for the oracles. The oracle again conveys the main points from the gold justification, thus being sufficient for the claim's explanation. Finally, the third example is of an oracle with a ROUGE-L F1 score of 25.59. The selected oracle sentences still succeed in presenting the main points from the gold justification, although at a more detailed level, presenting specific findings. The latter might be seen as a positive consequence, as it presents the particular findings of the journalist that led to selecting the veracity label.

A.3 Manual 6-way Veracity Prediction from explanations
The Fleiss' κ agreement for the 6-label manual annotations is 0.20 on the Just explanations, 0.230 on the Explain-MT explanations, and 0.333 on the Explain-Extr explanations. Table 9 presents the results of the manual veracity prediction with six classes.

Figure 1: Architecture of the Explanation (left) and Fact-Checking (right) models that optimise separate objectives.

Figure 2: Architecture of the Joint model learning Explanation (E) and Fact-Checking (F) at the same time.

Table 1: Example instance from the LIAR-PLUS dataset, with oracle sentences for generating the justification highlighted.
In Table 3, we present an evaluation of the generated explanations, computing ROUGE F1 score.

Table 5: Mean Average Ranks (MAR) of the explanations for each of the four evaluation criteria. The explanations come from the gold justification (Just), the generated explanation (Explain-Extr), and the explanation learned jointly (Explain-MT) with the veracity prediction model. The lower MAR indicates a higher ranking, i.e., a better quality of an explanation. For each row, the best results are in bold, and the best results with automatically generated explanations are in blue.