F1 is Not Enough! Models and Evaluation Towards User-Centered Explainable Question Answering

Explainable question answering systems predict an answer together with an explanation showing why the answer has been selected. The goal is to enable users to assess the correctness of the system and understand its reasoning process. However, we show that current models and evaluation settings have shortcomings regarding the coupling of answer and explanation which might cause serious issues in user experience. As a remedy, we propose a hierarchical model and a new regularization term to strengthen the answer-explanation coupling as well as two evaluation scores to quantify the coupling. We conduct experiments on the HOTPOTQA benchmark data set and perform a user study. The user study shows that our models increase the ability of the users to judge the correctness of the system and that scores like F1 are not enough to estimate the usefulness of a model in a practical setting with human users. Our scores are better aligned with user experience, making them promising candidates for model selection.


Introduction
Understanding the decisions of deep learning models is of utmost importance, especially when they are deployed in critical domains, such as medicine or finance (Ribeiro et al., 2016). In natural language processing (NLP), a variety of tasks have been addressed regarding explainability of neural networks, such as textual entailment (Camburu et al., 2018), sentiment classification (Clos et al., 2017), machine translation (Stahlberg et al., 2018) and question answering. In this paper, we address question answering (QA) due to its proximity to users in real-life settings, for instance, in the context of personal assistants.
Explainable question answering (XQA) is the task of (i) answering a question and (ii) providing an explanation that enables the user to understand why the answer was selected, e.g., by pointing to the facts that are needed for answering the question. Compared to approaches that output importance weights or analyze gradients (Simonyan et al., 2014; Ribeiro et al., 2016; Lundberg and Lee, 2017; Sundararajan et al., 2017), this has the advantage that the explanations are intuitively assessable even by lay users without machine learning background.

Question: What is the area of the desert that Ghanzi is in the middle of?
Answer: 900000 km2, because:
* Ghanzi is a town in the middle of the Kalahari Desert the western part of the Republic of Botswana in southern Africa.
Ghanzi's area is 117,910 km².
Shown in cloud (not part of the explanation): * The Kalahari Desert is a large semi-arid sandy savanna in Southern Africa extending for 900000 km2, covering much of Botswana, parts of Namibia and regions of South Africa.
Figure 1: Example output of a representative XQA system that would receive an answer-F1 of 1 and an explanation-F1 of 0.5 although the explanation provides no value to the user since the actual answer evidence (shown in cloud) is not included in the explanation (asterisks mark ground truth explanation).
A good explanation (i.e., one that is helpful for the user) should therefore satisfy the following requirements: (i) It should contain all information that the model used to predict the answer for the question. This is necessary so that the user can reconstruct the model's reasoning process. (ii) It should not include additional information that the model did not use for predicting the answer. Otherwise, the explanation will confuse the users rather than help them. Note that these requirements do not only hold for correct model decisions but are also valid for explaining wrong model answers so that the user can assess the correctness of the answers.

Previous work on XQA mostly focuses on developing models that predict the correct answer and, independent of this, the correct explanation (Qi et al., 2019; Shao et al., 2020). This can lead to model outputs in which the explanations do not sufficiently relate to the answers. Consider the example provided in Figure 1. The model gives the correct answer to the question and provides an explanation consisting of one out of two relevant facts. However, the most important relevant fact (in which the answer actually appears) is not part of the explanation. As a result, the user cannot assess whether the model answer is correct or not and, thus, cannot trust the system.

To strengthen the coupling of answer and explanation prediction in the model architecture and during training, we propose two novel approaches in this paper: (i) a hierarchical neural network architecture for XQA that ensures that only information included in the explanation is used to predict the answer to the question, and (ii) a regularization term for the loss function that explicitly couples answer and explanation prediction during training.
A good evaluation measure should satisfy the following requirements when scoring explanations: (i) It should reward explanations that are coupled to the answers of the model. (ii) It should punish explanations that are unrelated to the answers of the model. (iii) It should be correlated with user experience. Since explanations can not only empower the user to assess the correctness of a system (Biran and McKeown, 2017; Kim et al., 2016) but also improve user satisfaction and confidence (Sinha and Swearingen, 2002; Biran and McKeown, 2017) and, thus, increase the acceptance of automatic systems (Herlocker et al., 2000; Cramer et al., 2008), this aspect is very important when evaluating models that should be applied in real-life scenarios.

In most recent works, evaluation of XQA models focuses on optimizing F1-scores of answers and explanations (a collection of so-called supporting or relevant facts). However, F1-scores only assess model outputs with respect to ground-truth annotations, which only contain explanations for the correct answer. Thus, they fail to quantify the coherence between answer and explanation, especially when the predicted model answer is wrong. The example model output in Figure 1 leads to an answer-F1-score of 1 and an explanation-F1-score of 0.5 although the explanation is useless for the user as described before. To quantify the model's answer-explanation coupling, we propose two novel evaluation scores: (i) FARM, which tracks prediction changes when removing facts, and (ii) LOCA, which assesses whether the answer is contained in the explanation or not. Neither score requires ground-truth annotations.
To summarize, we make contributions in two directions in this paper: For modeling, (i) we propose a hierarchical neural network architecture as well as (ii) a regularization term for the loss function of XQA systems. For evaluation, (iii) we propose two scores that are able to quantify a model's answer-explanation coupling without relying on ground-truth annotations. (iv) To investigate the relation between different evaluation scores and user experience, we conduct a user study. The results show that our proposed models increase the ability of the user to judge the correctness of an answer and that our scores are stronger predictors of human behavior than standard scores like F1. (v) For reproducibility and future research, we will release code for our methods and for computing the evaluation scores as well as the user study data.

Related Work
In the context of XQA, Yang et al. (2018) present the HOTPOTQA data set, which we also use for the experiments in this paper. In addition to questions and answers, it contains explanations in the form of relevant sentences from Wikipedia articles.
So far, all of the research work on HOTPOTQA focuses on reaching higher F1-scores. In contrast, we question whether this actually aligns with user experience. To the best of our knowledge, only Chen et al. (2019) additionally conduct a human evaluation. This confirms the observation of Adadi and Berrada (2018) that only very few papers related to explainable AI address (human) evaluation of explainability. Despite the large body of research in the field of human computer interaction, Abdul et al. (2018) show that there is a lack of collaboration and transfer of results to machine learning communities.
Another line of research our work relates to is the criticism of automatic evaluation scores. One frequently questioned score is BLEU (Papineni et al., 2002), which was shown to only correlate weakly with human judgements in tasks like machine translation (Callison-Burch et al., 2006), storytelling (Wang et al., 2018) and dialogue response generation (Liu et al., 2016). F1 has been criticized from various perspectives including theoretical considerations and concrete applications (Hand and Christen, 2018; Chicco and Jurman, 2020; Sokolova et al., 2006). Qian et al. (2016) show that modifying F1-scores based on insights from psychometrics improves their correlation with human evaluations. In this paper, we criticize the usage of F1 as a measure of explainability in XQA and show in a user study that it is not related to user experience.

Methods for XQA
We build upon the model by Qi et al. (2019) as it is an improved version of the BiDaf++ model, which is used in numerous state-of-the-art XQA models (Qi et al., 2019; Nishida et al., 2019; Ye et al., 2019; Qiu et al., 2019), including the best-scoring publication (Shao et al., 2020). It consists of a question and context encoding part with self-attention, followed by two prediction heads: a prediction of relevant facts (i.e., the explanation) and a prediction of the answer to the question. The two heads are trained in a multi-task fashion based on the sum of their respective losses. First, we analyze the outputs of the model, revealing severe weaknesses in answer-explanation coupling. To address those weaknesses, we then propose (i) a novel neural network architecture that selects and forgets facts, and (ii) a novel answer-explanation coupling regularization term for the loss function.

Limitations of Current Models
We manually analyze outputs of the models by Qi et al. (2019) and Yang et al. (2018) and identify the following two problems.
Silent Facts. The models make use of facts without including them in their explanations (cf. Figure 1). As a result, the predicted answer does not occur in the explanation, leaving the user uninformed about where it came from.
Unused Facts. The models predict facts to be relevant without any relation to the predicted answer. The second fact of the explanation in Figure 1 is an example of this. We also found examples where the facts predicted to be relevant do not even contain the entities from the question.

Select & Forget Architecture
To explicitly ensure that the model only uses information from facts it predicts to be relevant for the answer selection, we propose a hierarchical model that first selects facts which are relevant to answer the question and then forgets about all other facts (see Figure 2). We use recurrent and self-attention layers to create encodings of the question and the context. In particular, we create two different encodings: one that will be used for predicting the relevance of the facts (fact-specific encoding) and one that will be used for predicting the answer to the question (QA-specific encoding). Based on the fact-specific encoding, the model first predicts which facts are relevant to answer the question. Next, we reduce the QA-specific encoding of the context based on the relevance predictions. In particular, we mask all facts that were not predicted to be relevant by zeroing out their encodings. The reduced context representation is concatenated with the QA-specific question encoding and passed on to the answer prediction, which we implement in the same way as Qi et al. (2019). Thus, the answer prediction now only receives encodings of facts that the model has predicted to be relevant. It predicts the type of the answer (yes/no/text span), as well as the start and end positions of the answer span within the context.
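The forgetting step described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function name `mask_irrelevant_facts`, the per-token fact index array, and the 0.5 decision threshold are hypothetical choices made for this example.

```python
import numpy as np


def mask_irrelevant_facts(context_enc, fact_ids, relevance_probs, threshold=0.5):
    """Zero out the encodings of all tokens whose fact was not selected.

    context_enc:     (num_tokens, hidden) QA-specific context encoding
    fact_ids:        (num_tokens,) index of the fact each token belongs to
    relevance_probs: (num_facts,) predicted relevance probability per fact
    threshold:       hypothetical decision threshold (an assumption)
    """
    context_enc = np.asarray(context_enc, dtype=float)
    fact_ids = np.asarray(fact_ids)
    # Fact-level decision: which facts does the model select as relevant?
    selected = np.asarray(relevance_probs) >= threshold
    # Broadcast the fact-level decision to token level.
    token_mask = selected[fact_ids]
    # Tokens of unselected facts become all-zero vectors ("forgetting").
    return context_enc * token_mask[:, None]
```

With three facts and relevance probabilities `[0.9, 0.2, 0.7]`, only the tokens of the second fact are zeroed, so the answer prediction can no longer draw on it.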

Answer-Fact Coupling Regularizer
Our second method addresses the coupling of answer and explanation prediction by modifying the loss function. The loss function used by Qi et al. (2019) is the sum of four cross entropy losses concerning (i) the answer type (yes/no/span) distribution, (ii) the answer start token distribution, (iii) the answer end token distribution, and (iv) the fact relevance distributions. All terms are optimized to be close to their respective ground truth annotations. This is the desired effect in many, but, as our examples in Section 3.1 show, not in all situations. The loss function especially encourages the model to predict the ground truth explanation rather than an explanation that explains the predicted answer.
In order to reward a coupling between answer and explanation, we propose to add the following regularization term to the loss function:

$$J_{\text{reg}} = c_1 \, p_a (1 - p_e) + c_2 \, (1 - p_a) \, p_e + c_3 \, (1 - p_a)(1 - p_e) \qquad (1)$$

with $p_a$ corresponding to the probability of the model for the correct answer span and $p_e$ denoting the probability of the model for the ground truth relevant facts. The term can be broken down into four cases: (i) correct answer and ground truth explanation, (ii) correct answer but non-ground truth explanation, (iii) incorrect answer but ground truth explanation and (iv) incorrect answer and non-ground truth explanation. Each case corresponds to a constant cost of 0, $c_1$, $c_2$ and $c_3$, respectively, with $c_1$, $c_2$, $c_3$ being hyperparameters. The resulting cost $J_{\text{reg}}$ is the sum of the four individual costs weighted with their respective probabilities.
In particular, $p_a$ is defined as the product of the probabilities assigned to start and end token positions of the answer span. For a data set instance with a context containing $N$ facts, we define $s^t \in \{0, 1\}^N$ as the ground truth annotations for the relevant facts. Accordingly, we denote the model's relevance probability estimates with $s^p \in [0, 1]^N$. Based on this, we define $p_e$ as $p_e = \prod_{i \in F} s^p_i$ with $F = \{i \in \{1, ..., N\} : s^t_i = 1\}$ denoting the indices that correspond to ground truth facts. This corresponds to the joint probability of selecting the ground truth facts assuming the single selection probabilities to be independent. For our experiments, we adapt this definition to a numerically more stable term pè by replacing the product with a sum as this led to slightly better results on the development set.
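The four-case description above can be made concrete with a short sketch. It assumes plain Python floats instead of tensors and uses the product form of the explanation probability; the function names are hypothetical and the code only illustrates the weighting scheme, not the authors' training code.

```python
def answer_prob(p_start, p_end):
    """p_a: product of start and end token probabilities of the correct span."""
    return p_start * p_end


def explanation_prob(relevance_probs, gt_labels):
    """p_e: product of the relevance probabilities of the ground truth facts."""
    p_e = 1.0
    for prob, label in zip(relevance_probs, gt_labels):
        if label == 1:
            p_e *= prob
    return p_e


def coupling_regularizer(p_a, p_e, c1, c2, c3):
    """Expected cost over the four answer/explanation cases."""
    return (
        0.0 * p_a * p_e                      # (i) correct answer, GT explanation
        + c1 * p_a * (1.0 - p_e)             # (ii) correct answer, other explanation
        + c2 * (1.0 - p_a) * p_e             # (iii) wrong answer, GT explanation
        + c3 * (1.0 - p_a) * (1.0 - p_e)     # (iv) wrong answer, other explanation
    )
```

The cost vanishes only when both probabilities are one, and each corner case recovers exactly one of the constants, which makes the role of the three hyperparameters easy to inspect.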

New Evaluation Scores for XQA
In this section, we motivate that standard scores like F1 are not enough to score XQA systems by presenting their limitations in those settings. To be able to quantify the degree to which a model is affected by those limitations, we propose two scores that go beyond standard scores: the fact-removal score and the answer-location score. Both scores can be calculated without any assumptions on model architecture and without the need for ground truth annotations for answers or supporting facts.

Limitations of Current Evaluation Scores
Current evaluation of XQA is focused on three scores: (i) answer-F1, which is based on the token overlap between the predicted and the ground truth answer, (ii) SP-F1, which calculates F1 based on the overlap of predicted and ground truth relevant ("supporting") facts and (iii) joint-F1, which is based on the definitions of joint precision and joint recall as the products of answer and SP precision and recall as described in Yang et al. (2018). For HOTPOTQA, models are ranked based on joint-F1. We argue that this creates a false incentive that potentially hinders the development of truly usable models for the following reasons.
No Empirical Evidence. There is no empirical evidence that joint-F1 is related to user performance or experience regarding XQA.
Rewarding Poor Explanations. Figure 1 shows an example prediction that is rewarded with a joint-F1 of 0.5 although its explanation provides no value to the user. The reward stems from the overlap of the explanation with the ground truth but does not consider that the predicted answer is not contained in any of the predicted relevant facts.
Punishing Good Explanations. Consider a model output in which the predicted answer is wrong but the explanation perfectly explains this wrong answer, showing to the user why the model has selected it. Standard F1-scores compare the model output to the ground truth annotations and will, therefore, score both the answer and the explanation with an F1 of 0. However, we argue that an explanation should be evaluated with a score higher than 0 if it is able to explain the reasoning process of the model to the user and, thus, lets the user identify the failure of the model.
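To make the three standard scores concrete, the following sketch recomputes them for the Figure 1 example. It is a simplified illustration: answer normalization (lowercasing, punctuation and article removal) as done in official evaluation scripts is omitted, and the fact identifiers used in the usage note are hypothetical labels for the facts in Figure 1.

```python
from collections import Counter


def f1(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def token_pr(pred_tokens, gold_tokens):
    """Token-overlap precision and recall for the answer."""
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0, 0.0
    return overlap / len(pred_tokens), overlap / len(gold_tokens)


def set_pr(pred_facts, gold_facts):
    """Precision and recall over predicted vs. ground truth supporting facts."""
    pred_facts, gold_facts = set(pred_facts), set(gold_facts)
    hits = len(pred_facts & gold_facts)
    precision = hits / len(pred_facts) if pred_facts else 0.0
    recall = hits / len(gold_facts) if gold_facts else 0.0
    return precision, recall


def joint_f1(answer_pr, sp_pr):
    """F1 of the products of answer and supporting-fact precision/recall."""
    return f1(answer_pr[0] * sp_pr[0], answer_pr[1] * sp_pr[1])
```

For Figure 1, `token_pr(["900000", "km2"], ["900000", "km2"])` and `set_pr({"ghanzi_town", "ghanzi_area"}, {"ghanzi_town", "kalahari"})` yield an answer-F1 of 1, an SP-F1 of 0.5 and a joint-F1 of 0.5, even though the fact containing the answer is missing from the explanation.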

Fact-Removal Score (FARM)
Ideally, the explanations of the model include all facts that the model uses within its reasoning chain but no additional facts beyond that. Note that even for a wrong model answer, this assumption should hold so that the relevant facts provide explanations for the (wrongly) predicted answer. To quantify the degree of answer-explanation coupling, we propose to iteratively remove parts (individual facts) of the explanation, re-evaluate the model using the reduced context and track how many of the model's answers change. For a model with perfect coupling of answer and explanation, the answer will change with the first fact being removed (assuming no redundancy) but will not change when removing irrelevant facts not belonging to the explanation. We remove facts in order of decreasing predicted relevance as more relevant facts should influence the model's reasoning process the strongest.
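The removal procedure described above can be sketched in a few lines. The sketch only computes the fractions of changed answers when removing relevant or irrelevant facts; the model interface (a callable returning an answer and one relevance score per fact), the 0.5 relevance threshold, and `toy_model` are assumptions made for illustration.

```python
def reduce_context(facts, scores, k, relevant, threshold=0.5):
    """Remove up to k facts predicted as relevant (or irrelevant).

    Facts are removed in order of decreasing predicted relevance.
    The threshold is a hypothetical decision boundary.
    """
    candidates = [i for i, s in enumerate(scores) if (s >= threshold) == relevant]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    removed = set(candidates[:k])
    return [fact for i, fact in enumerate(facts) if i not in removed]


def changed_fraction(model, dataset, k, relevant):
    """Fraction of instances whose answer changes on the reduced context."""
    changed = 0
    for question, facts in dataset:
        answer, scores = model(question, facts)
        reduced = reduce_context(facts, scores, k, relevant)
        new_answer, _ = model(question, reduced)
        if new_answer != answer:
            changed += 1
    return changed / len(dataset)


def toy_model(question, facts):
    """Hypothetical, perfectly coupled stand-in for an XQA model."""
    scores = [0.9 if "900000" in f else 0.8 if "Kalahari" in f else 0.1
              for f in facts]
    answer = "900000 km2" if any("900000" in f for f in facts) else "unknown"
    return answer, scores
```

For this toy model, removing the most relevant fact changes the answer while removing an irrelevant fact does not, which is the behavior expected from a model with perfect answer-explanation coupling.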
In the following, we denote an instance of the data set by $e \in E$ with its corresponding question $e_{\text{ques}}$ and context $e_{\text{con}}$. We use $\text{answer}(\cdot, \cdot)$ to denote the answer that a model predicts for a given question and context. The functions $\text{reduce}_{\text{rel}}(\cdot, k)$ ($\text{reduce}_{\text{irr}}(\cdot, k)$) return a context from which up to $k$ facts the model predicts to be relevant (irrelevant) have been removed. 2 We re-evaluate the model on this reduced context and calculate the fraction of changed answers $c_{\text{rel}}(k)$ and $c_{\text{irr}}(k)$, respectively.

$$a(e) = \text{answer}(e_{\text{ques}}, e_{\text{con}}) \qquad (2)$$
$$a_{\text{rel},k}(e) = \text{answer}(e_{\text{ques}}, \text{reduce}_{\text{rel}}(e_{\text{con}}, k)) \qquad (3)$$
$$a_{\text{irr},k}(e) = \text{answer}(e_{\text{ques}}, \text{reduce}_{\text{irr}}(e_{\text{con}}, k)) \qquad (4)$$
$$c_{\text{rel}}(k) = \frac{|\{e \in E : a(e) \neq a_{\text{rel},k}(e)\}|}{|E|} \qquad (5)$$
$$c_{\text{irr}}(k) = \frac{|\{e \in E : a(e) \neq a_{\text{irr},k}(e)\}|}{|E|}$$

2 If the number of facts predicted as (ir)relevant is less than or equal to $k$, we remove all (ir)relevant facts from the context.

Finally, we condense $c_{\text{rel}}(k)$ and $c_{\text{irr}}(k)$ into a single fact-removal score, FARM(k). FARM(k) ranges between zero and one, and a higher score corresponds to a better explanation.

Answer-Location Score (LOCA)
A second important indicator for the degree of a model's answer-explanation coupling is the location of the answer span: As shown in Figure 1, the models can predict answers that are located outside the facts they predict to be relevant, i.e., outside the explanation. This is confusing for a user. Therefore, we consider the fraction of answer spans that are inside the explanation of the model and the fraction of answer spans that are outside. For an ideal model, all answer spans would be located inside the explanation. We use I and O to denote the number of answers inside/outside of the set of facts predicted as relevant. A denotes the total number of answers. Based on these counts, we propose the answer-location score (LOCA). The LOCA score ranges between zero and one, with larger values indicating better answer-explanation coupling.
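The counts I, O and A can be gathered as in the sketch below, which returns the fractions I/A and O/A. The input format, the 0.5 relevance threshold, and the handling of answers without a span (such as yes/no answers, counted in A but in neither I nor O) are assumptions made for this illustration.

```python
def answer_location_fractions(predictions, threshold=0.5):
    """Return (I / A, O / A) for a list of model predictions.

    Each prediction is a dict with
      'answer_fact': index of the fact containing the answer span,
                     or None if the answer is not a span (assumption),
      'relevance':   predicted relevance probability per fact.
    """
    inside = 0
    outside = 0
    for pred in predictions:
        fact = pred["answer_fact"]
        if fact is None:
            continue  # no span to locate; still counted in the total A
        if pred["relevance"][fact] >= threshold:
            inside += 1
        else:
            outside += 1
    total = len(predictions)
    return inside / total, outside / total
```

A model whose answers always lie inside its predicted explanation would reach an outside fraction of zero, which is the ideal case described above.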

Experiments and Results
In this section, we describe the dataset we used in our experiments as well as our results. More details for reproducibility, including hyperparameters, are provided in the appendix.

Dataset
The HOTPOTQA data set is a multi-hop open-domain explainable question answering data set containing 113k questions with crowd-sourced annotations. Each instance of the training data contains a question, a context consisting of the first paragraphs of ten Wikipedia articles, the annotated answer and an explanation in the form of a selection of relevant sentences from the context. As HOTPOTQA was designed as a multi-hop data set, finding the answer to a question requires combining information from two different articles. The other eight articles act as distractors for the system. 4

Experimental Results
In our experiments, we assess the effects of our Select & Forget architecture (S&F) and the regularization term (reg.). Table 1 shows our approaches in comparison to the model by Qi et al. (2019). 5 While our S&F architecture performs comparably in standard scores like answer-exact-match (Answer-EM), answer-F1, joint-EM and joint-F1 (for some of them slightly better, for some of them slightly lower), the regularization term increases the recall of the relevant fact prediction considerably.
In terms of our proposed scores for measuring answer-explanation coupling, all three of our models clearly outperform the baseline model (lower part of Table 1). In the first three rows, we report the models' FARM scores and the fractions of changed answers for $k = 4$, i.e., when a maximum of four facts are removed. We choose $k = 4$ as this is the highest number of facts within an explanation in the ground truth annotations of the HOTPOTQA data. The last three rows show the LOCA scores and the respective fractions of answers inside and outside facts predicted as relevant.

4 The data set also contains a full wiki test set in which the context spans all collected Wikipedia articles. We focus on the distractor setting in this paper. The data set can be downloaded from https://hotpotqa.github.io/.
5 We retrain their model using the implementation and preprocessing provided at https://github.com/qipeng/golden-retriever.
The different behavior of models regarding joint-F1 vs. FARM and LOCA raises the question of which scores are better suited to quantify explainability in a real-life setting with human users. To answer this, we conduct a user study in Section 6.

Human Evaluation
We conduct a user study to investigate whether standard scores like F1 or our proposed scores are better suited to predict user behavior and performance. Moreover, the study provides another way to compare our proposed methods to the model by Qi et al. (2019) and the ground truth explanations. In contrast to the human evaluation from Chen et al. (2019), we evaluate explanations in the context of the model answer, ask participants to rate the predictions along multiple dimensions and collect responses from 40 instead of 3 subjects.

Choice of Models
We choose to compare the model proposed by Qi et al. (2019) (called "Qi-2019" in the following) as a representative of the commonly used BiDaf++ architecture (Clark and Gardner, 2018) in XQA, our proposed Select & Forget architecture (S&F) and our proposed regularization term (reg.). In addition, we include the ground truth (GT) annotations to set an upper bound. Although the combination of regularization and the S&F architecture reaches promising performance in Table 1, we assess the effects of our methods in isolation here and leave the evaluation of the combination to future work.

Study Design
We make use of a unifactorial between-subject design in which each model constitutes one condition. We randomly sample a set of 25 questions from the HOTPOTQA dev set and collect the model answer and explanation predictions (or annotations for GT) for each condition. For each answer prediction, we manually assess whether it is equivalent to the ground truth answer. Each participant sees the 25 questions and the answers and explanations of one model in a random permutation. For each question, we ask the participants to rate whether the model answer is correct. In addition, we ask for multiple self-reports to assess, e.g., the trust of the user in the system. In particular, we track the variables discussed in the following subsection.

Dependent Variables
We derive multiple dependent variables from the participants' ratings, namely completion time (Lim et al., 2009; Lage et al., 2019), several performance variables indicating how well they judged the correctness of the model (fraction of correct ratings, false positive ratio (FP), false negative ratio (FN), true positive ratio (TP), true negative ratio (TN), precision (P), recall (R) and F1 values), agreement (fraction of model predictions that the users rate as correct (Bussone et al., 2015)), and overestimation (difference between agreement and true model accuracy (Nourani et al., 2019)). Furthermore, we collect the following variables in self-reports with five-point Likert scales: certainty of the participants (Greis et al., 2017a), completeness and helpfulness of the explanations (Nourani et al., 2019), trust of the participants in the model (Bussone et al., 2015), and satisfaction (Kulesza et al., 2012; Greis et al., 2017b).
All the questions and screenshots of our study are given in the appendix.
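As an illustration of how some of these variables follow from the raw ratings, the sketch below derives the fraction of correct ratings, agreement and overestimation for one participant. The function name and the boolean input lists are hypothetical; the remaining performance variables are left out.

```python
def rating_measures(user_says_correct, model_is_correct):
    """Derive rating-based measures from per-question boolean lists.

    user_says_correct: whether the participant rated the answer as correct
    model_is_correct:  whether the model answer actually is correct
    """
    n = len(user_says_correct)
    # Agreement: fraction of model predictions rated as correct.
    agreement = sum(user_says_correct) / n
    # True model accuracy on the same questions.
    accuracy = sum(model_is_correct) / n
    # Fraction of ratings that match the true correctness.
    correct_ratings = sum(
        u == m for u, m in zip(user_says_correct, model_is_correct)
    ) / n
    return {
        "correct_ratings": correct_ratings,
        "agreement": agreement,
        "overestimation": agreement - accuracy,
    }
```

A positive overestimation means the participant trusted more answers than were actually correct, while a negative value corresponds to underestimation.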

Participants and Data Cleaning
We collect the ratings of 40 participants (16 female, 24 male) with a mean age of 26.6 years (SD = 3.4). We filter out all responses with a completion time smaller than 15 seconds or larger than 5 minutes as this indicates that the participant did not read the whole explanation or was interrupted during the study. We further ask the participants whether they already knew the answer and exclude the responses to those questions from our evaluation. In total, we discard 12.10% of the responses.

Results
In this section, we summarize the main results of the user study. For better overview, we do not include evaluations on every variable from Section 6.3 but show them in the appendix. Figure 3a shows the fraction of correct user ratings of model correctness. The correctness of our proposed models can be better judged than the Qi-2019 model: The regularized model and the S&F model increase the fraction by 10.79% and 9.17%, respectively, compared to Qi-2019. The GT comparison shows an upper bound.
Among the performance variables, the false positive ratio deserves particular attention as a false positive corresponds to a user thinking the model answer is correct while it is not. Such an error can be dangerous in safety-critical domains. Figure 3b shows that the fraction of FPs is decreased by 6.43% by the regularized model and by 9.25% by the S&F model compared to Qi-2019. The ground truth has zero false positives by definition.
A similar effect can be seen when evaluating overestimation: Both our models alleviate overestimation as shown in Figure 3c. While participants overestimate the model accuracy of Qi-2019 by 2.93% on average, the regularized model only leads to 0.87% overestimation. The S&F model is even underestimated by 6.40% on average. While an ideal model would lead to neither over- nor underestimation, underestimation can be preferable to overestimation if the model is deployed into high-risk environments, such as medical contexts. In general, a reduction in overestimation can, besides enhanced fact selection, also be linked to an improved answer accuracy as better-performing models naturally leave more room for underestimation. Finally, we consider relations within the variables from Section 6.3. Figure 3d shows that mean completion time monotonically decreases with increasing user certainty (with the exception of "strongly disagree"). 7 This confirms the findings of Greis et al. (2017a) who investigate the effect of user uncertainty on behavioral measurements.

Correlation with Evaluation Scores
Finally, we investigate the correlation of human ratings with model evaluation scores. We rank the models by (i) human measures obtained in the user study and (ii) model evaluation scores. In Table 2, a cell is marked with a "+" if the ranking with respect to the human measure and the model score is identical (e.g., the ranks regarding human-FPs and answer-F1 are identical). If the ranks are exactly reversed, we mark the cell with a "-". All other cells are left empty. "+" and "-" both indicate a perfect correlation and do not imply one being preferable over the other.

Next, we consider whether selecting a model based on the different model scores would result in a desired change in human evaluation scores or not. This depends on whether a high score (e.g., F1) or a low score (e.g., the fraction of answers outside the predicted relevant facts) is aimed for. We indicate desired model selection with green circled cells and undesired model selection with red cells (e.g., choosing a model with a higher answer-F1 would result in a model with more human-FPs, which is not desired). All F1-scores show at least one undesirable rank relation. Notably, joint-F1 is among the least aligned scores. In contrast, our scores have only desirable relations. In particular, FARM(4) and LOCA lead to a model ranking that is inverse to the ranking by human-overestimation and human-FPs. This is also confirmed in Figure 4, which shows how the human-FP ratio varies in comparison to the three F1 scores (upper plot: no correlation) and to our proposed scores (lower plot: correlated). See appendix for other dependent variables.

7 The low completion time for "strongly disagree" could indicate that the users could not find any relation at all between answer and explanation.
To sum up, our results indicate that (i) F1 is not suited to quantify the explanatory power of a model and (ii) our proposed scores predict user behavior better than standard scores, opening the possibility of using them for model selection.
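The rank comparison behind the "+" and "-" marks can be sketched as follows. The helper names are hypothetical, ties between models are not handled, and any scores passed in are placeholders rather than results from the study.

```python
def ranking(scores):
    """Model names ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


def rank_relation(human_measure, model_score):
    """Return '+' for identical rankings, '-' for exactly reversed ones."""
    human_rank = ranking(human_measure)
    score_rank = ranking(model_score)
    if human_rank == score_rank:
        return "+"
    if human_rank == score_rank[::-1]:
        return "-"
    return ""
```

Whether a "+" or a "-" is desirable depends on the pair: for a human measure where lower is better (such as false positives) and a model score where higher is better, the reversed ranking is the desired outcome.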

Conclusion
In this paper, we investigated explainable question answering, revealing that existing models lack an explicit coupling of answers and explanations and that evaluation scores used in related work fail to quantify that. This highly impairs their applicability in real-life scenarios with human users. As a remedy, we addressed both modeling and evaluation, proposing a hierarchical neural architecture, a regularization term, as well as two new evaluation scores. Our user study showed that our models help the users assess their correctness and that our proposed evaluation scores are better correlated with user experience than standard measures like F1.

C Training Details and Hyperparameters
We use the same preprocessing, hyperparameters and early stopping procedure as Qi et al. (2019) to train our models. 9 The Qi-2019 model as well as our regularized adaptation contain 99M parameters, our S&F model and its regularized version contain 100M parameters each. The additional hyperparameters of our regularization term are optimized with random search using 100 runs. In particular, we sample $c_i \sim U([0.0, 5.0])$, $i \in \{1, 2, 3\}$ and select $p_e$ and pè with equal probability. We select the best model based on the percentage of answer spans inside facts predicted as relevant to ensure decent answer-explanation coupling while not directly optimizing on the LOCA. Based on our hyperparameter search, the best regularization parameters for the model proposed by Qi et al. (2019) are $c_1 = 4.96$, $c_2 = 2.02$ and $c_3 = 3.10$. The best parameters for the regularized S&F model are $c_1 = 1.18$, $c_2 = 0.24$ and $c_3 = 1.61$. We trained all models on Nvidia Tesla V100 GPUs.

9 https://github.com/qipeng/golden-retriever
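The sampling scheme of the random search can be sketched with the Python standard library. The dictionary keys, the labels "product" and "sum" for the two explanation-probability variants, and the fixed seed are assumptions for this example; training and model selection are not shown.

```python
import random


def sample_config(rng):
    """Draw one hyperparameter configuration for the regularization term."""
    return {
        "c1": rng.uniform(0.0, 5.0),
        "c2": rng.uniform(0.0, 5.0),
        "c3": rng.uniform(0.0, 5.0),
        # Product or sum variant of the explanation probability,
        # chosen with equal probability.
        "explanation_term": rng.choice(["product", "sum"]),
    }


rng = random.Random(0)  # seed chosen only to make the sketch repeatable
configs = [sample_config(rng) for _ in range(100)]
```

Each of the 100 configurations would then be used for one training run, and the best run would be picked by the percentage of answer spans inside the facts predicted as relevant.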

D Comparison of BiDaf++ Versions
We compare the BiDaf++ model used as the official HOTPOTQA baseline of Yang et al. (2018) to the modified model by Qi et al. (2019) in Table 3. As the modified model outperforms the Yang-2018 model on all metrics, we use the model proposed by Qi et al. (2019) throughout our experiments as well as the user study.

E Hierarchical Error Propagation
As the S&F model first selects the supporting facts and then predicts the answer based on the selected subset, we evaluate how errors in the fact selection affect the answer prediction and compare it to the Qi-2019 model. For both models, we compare the fraction of predictions with correct (exact match) answers among predictions that (i) contain all ground truth facts or (ii) contain no ground truth facts. We observe that for Qi-2019 the fraction of correct answers drops from 51.5% for predictions with all ground truth facts to 25.8% for predictions with no ground truth facts. For the S&F model, we observe a drop from 50.4% to 20.6%, respectively. This confirms our expectations of an increased error propagation in the S&F model. However, only 1.8% of the S&F fact predictions contain no ground truth fact at all, whereas for Qi-2019 this case occurs for 2.1% of the predictions.

F Answer Changes per Removal Step
Figure 6 shows the values of $c_{\text{rel}}(k)$ (top), i.e., the fraction of changed answers when removing $k$ facts predicted as relevant, and $c_{\text{irr}}(k)$ (bottom), i.e., the fraction of changed answers when removing $k$ facts predicted as irrelevant, for $k \in \{0, ..., 4\}$.

G User Study Design Details
All questions and statements that we ask participants to answer/rate are listed in Table 4.

H Detailed User Study Results
Figure 9 shows boxplots per condition for all continuous dependent variables. Figure 10 shows rating distributions per condition for all ordinal dependent variables. Figures 11 and 12 compare model scores and human measures grouped into F1-scores and our proposed FARM(4) and LOCA scores. Rows alternate between F1-scores and our scores.

Question: Are Brainerd Lakes Regional Airport and Sawyer International Airport located in Europe?
System answer: no
System explanation: [Brainerd Lakes Regional Airport]: Brainerd Lakes Regional Airport (IATA: BRD, ICAO: KBRD, FAA LID: BRD) is a public use airport located three nautical miles (6 km
Do you think the system's answer is correct?
Please rate how much you disagree / agree to each of the following statements.
I am confident that my choice is correct.
The given explanation helps me to decide if the answer is correct.
Did you know the answer without the system's answer or explanations?
Figure 7: Screenshot of the question rating interface.

Context | Question/Statement | Answer Range
Each Question | Do you think the system's answer is correct? | yes/no
Each Question | Did you know the answer without the system's answer or explanations? | yes/no
Each Question | I am confident that my choice is correct. | 5-point Likert
Each Question | The given explanation helps me to decide if the answer is correct. | 5-point Likert
Post Survey | I trust the question answering system. | 5-point Likert
Post Survey | The explanations contained relevant information. | 5-point Likert
Post Survey | The explanations also contained irrelevant information. | 5-point Likert
Post Survey | I am satisfied with the question answering system and its explanations. | 5-point Likert

The following questions are asked with regard to all model outputs you saw on the previous pages.
Please rate how much you disagree / agree to each of the following statements.
I trust the question answering system.
The explanations contained relevant information.
The explanations also contained irrelevant information.
I am satisfied with the question answering system and its explanation.

Figure 12: Comparisons between human measures and model scores. All scores are normalized before plotting by subtracting the minimum score and re-scaling the score span to [0, 1]. Human measures for which lower values correspond to better performance are plotted as (1 − score) for convenience of the reader. The figure shows scores for false negatives, true negatives, precision, recall and user F1.

Table 5: Pearson correlations between human and automatized scores.