Interpretable Question Answering on Knowledge Bases and Text

Interpretability of machine learning (ML) models becomes more relevant with their increasing adoption. In this work, we address the interpretability of ML based question answering (QA) models on a combination of knowledge bases (KB) and text documents. We adapt post hoc explanation methods such as LIME and input perturbation (IP) and compare them with the self-explanatory attention mechanism of the model. For this purpose, we propose an automatic evaluation paradigm for explanation methods in the context of QA. We also conduct a study with human annotators to evaluate whether explanations help them identify better QA models. Our results suggest that IP provides better explanations than LIME or attention, according to both automatic and human evaluation. We obtain the same ranking of methods in both experiments, which supports the validity of our automatic evaluation paradigm.


Introduction
Question answering (QA) is an important task in natural language processing and machine learning with a wide range of applications. QA systems typically use either structured information in the form of knowledge bases (KBs), or raw text. Recent systems have successfully combined both types of knowledge.
Nowadays, due to the changing legal situation and growing application in critical domains, ML based systems are increasingly required to provide explanations of their output. Lipton (2018), Poursabzi-Sangdeh et al. (2018) and Doshi-Velez and Kim (2017) point out that there is no complete agreement on the definition, measurability and evaluation of interpretability in ML models. Nevertheless, a number of explanation methods have been proposed in the recent literature, with the aim of making ML models more transparent for humans.
To the best of our knowledge, the problem of explanations for deep learning based QA models working on a combination of structured and unstructured data has not yet been researched. Also, there is a lack of evaluation paradigms to compare different explanation methods in the context of QA.

Contributions
-We explore interpretability in the context of QA on a combination of KB and text. In particular, we apply attention, LIME and input perturbation (IP).
-In order to compare these methods, we propose a novel automatic evaluation scheme based on "fake facts".
-We evaluate whether explanations help humans identify the better out of two QA models.
-We show that the results of automatic and human evaluation agree.
-Our results suggest that IP performs better than attention and LIME in this context.

Question Answering on Knowledge Bases and Text
The combination of knowledge bases and text data is of particular interest in the context of QA. While knowledge bases provide a collection of facts with a rigid structure, the semantic information contained in text documents has the potential to enrich the knowledge base. In order to exploit different information sources within one QA system, Das et al. (2017) propose the TextKBQA model, which operates on a universal schema (Riedel et al., 2013) of a KB and text documents. They state that "individual data sources help fill the weakness of the other, thereby improving overall performance" and conclude that "the amalgam of both text and KB is superior than KB alone." Their model solves the so-called cloze questions task, i.e., filling in blanks in sentences. For example, the answer to "Chicago is the third most populous city in blank." would be the entity the USA. The model has a KB and a number of raw text sentences at its disposal. Das et al. (2017) use Freebase (Bollacker et al., 2008) as KB (8.0M facts) and ClueWeb (Gabrilovich et al., 2013) as raw text source (0.3M sentences). They test on question-answer pairs from SPADES (Bisk et al., 2016) (93K queries). The TextKBQA model (Figure 1) is a key-value memory network that uses distributed representations for all entities, relations, textual facts and input questions. Every memory cell corresponds to one KB fact or one textual fact, which are encoded as key-value pairs (Miller et al., 2016).
Every KB fact is a triple consisting of a subject s, an object o and the relation r between these entities. s, r and o are embedded into real-valued vectors s, r, o ∈ R^d. The memory key is the concatenation of the subject and relation embeddings: k = [s; r] ∈ R^{2d}. The memory value is the embedding of the object: v = o ∈ R^d. Textual facts are sentences that contain at least two entities. They are also represented as triples, where the relation is a token sequence: (s, [w_1, ..., arg1, ..., arg2, ..., w_n], o). To convert the sentence into a vector, arg1 and arg2 are replaced by s and blank, respectively. Then the sequence is processed by a bidirectional LSTM, whose last states are concatenated to form the memory key. A question q = [w_1, ..., e, ..., blank, ..., w_n] is transformed into a distributed representation q ∈ R^{2d} using the same bidirectional LSTM as before. In this way, KB and textual facts as well as queries live in the same space R^{2d}.
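This encoding can be sketched as follows. The embedding tables and the dimension d are toy stand-ins for illustration, not the model's actual parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # toy embedding dimension; the real model uses a larger d

# Hypothetical embedding tables for entities and relations.
entity_emb = {e: rng.standard_normal(d) for e in ["Chicago", "USA"]}
relation_emb = {r: rng.standard_normal(d) for r in ["located_in"]}

def encode_kb_fact(s, r, o):
    """Encode a KB triple (s, r, o) as a key-value memory cell:
    key = [s; r] in R^{2d}, value = o in R^d."""
    key = np.concatenate([entity_emb[s], relation_emb[r]])
    value = entity_emb[o]
    return key, value

key, value = encode_kb_fact("Chicago", "located_in", "USA")
```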
Given q and a set of relevant facts, represented by key-value pairs (k, v), TextKBQA performs multi-hop attention. More specifically, the context vector c_0 is set to q. In every iteration (hop) t, a new context vector c_t is computed as:

c_t = W_t ( c_{t-1} + W_p Σ_{(k,v) ∈ M} softmax(c_{t-1} · k) v )    (1)

where W_p, W_t are weight matrices and M is the set of memory cells. In practice, M contains only facts that share an entity with the query. The result of the last hop is fed into a fully-connected layer to produce a vector b ∈ R^d. Then the inner product between b and all entity embeddings is taken, and the entity with the highest inner product is chosen as the model's answer a_q.
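A minimal numpy sketch of this hop update, under the assumption of toy dimensions and random weights:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def multi_hop(q, K, V, W_p, W_t, hops=3):
    """Multi-hop key-value attention: at each hop,
    c_t = W_t (c_{t-1} + W_p * sum_i softmax(c_{t-1} . k_i) v_i)."""
    c = q
    for _ in range(hops):
        att = softmax(K @ c)            # attention over the |M| memory keys
        summary = att @ V               # weighted sum of values, in R^d
        c = W_t @ (c + W_p @ summary)   # new context vector in R^{2d}
    return c

d, n = 4, 10  # toy dimensions: embedding size and number of facts
rng = np.random.default_rng(0)
c_final = multi_hop(
    q=rng.standard_normal(2 * d),
    K=rng.standard_normal((n, 2 * d)),
    V=rng.standard_normal((n, d)),
    W_p=rng.standard_normal((2 * d, d)),
    W_t=rng.standard_normal((2 * d, 2 * d)),
)
```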
We train the TextKBQA model using the datasets described above. We limit the number of textual facts per query to 500, since only 35 out of 1.8M entities in the dataset have more than 500 textual facts. Apart from this modification, we use the exact same implementation and training setup as Das et al. (2017). Our final model achieves an F1 score of 41.59 on the dev dataset and 40.27 on the test dataset, which is slightly better than the original paper (41.1 and 39.9, respectively).

Explanation methods
We first present some important notation and give a working definition of an explanation method.
Formally, let F be a database consisting of all KB and textual facts: F = F_KB ∪ F_text. Furthermore, let E be the set of entities that appear as subjects and objects in F, and let R be the set of relations from F_KB. In the following we will use a general notation f for a fact from F, distinguishing between KB and textual facts only when necessary.
Let q be a query, and F ⊆ F the corresponding set of facts, such that for every f ∈ F, the subject entity of f occurs in q. Let TextKBQA be the function computed by the TextKBQA model and a_q = TextKBQA(q, F), a_q ∈ E, the predicted answer to the query q. Note that a_q is not necessarily the ground-truth answer for q.
Analogously to Poerner et al. (2018), we give the following definition: an explanation method is a function φ(f, a q , q, F) that assigns real-valued relevance scores to facts f from F given an input query q and a target entity a q . If φ(f 1 , a q , q, F) > φ(f 2 , a q , q, F) then fact f 1 is of a higher relevance for a q given q and F than fact f 2 .

Attention Weights
The attention mechanism provides an explanation method which is an integral part of the TextKBQA architecture.
We formally define the explanation method attention weights as:

φ_aw(f, a_q, q, F) = softmax(K_F · q)_f

where K_F is a matrix whose rows are the key vectors of the facts in F.
Since the TextKBQA model takes three attention hops per query, φ_aw can be extended as follows. On the one hand, we can take attention weights from the first, second or third (= last) hop. Intuitively, attention weights from the first hop reflect the similarity of fact keys with the original query, while attention weights from the last hop reflect the similarity of fact keys with the summarized context from all previous iterations. On the other hand, some aggregation of attention weights could also be a plausible explanation method. For every fact, we take the mean attention weight over hops to be its average relevance in the reasoning process.
Taking into account the above considerations, we redefine φ_aw:

− attention weights at hop j: φ_aw^j(f, a_q, q, F) = softmax(K_F · c_{j-1})_f

− average attention weights: φ_aw^avg(f, a_q, q, F) = (1/h) Σ_{j=1}^{h} softmax(K_F · c_{j-1})_f

where h is the number of hops and c_{j-1} is the context vector entering hop j (recall that c_0 = q).
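The per-hop and averaged attention scores can be sketched as follows, with toy dimensions standing in for the model's:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_relevance(contexts, K_F):
    """Per-hop and averaged attention-weight explanations.
    contexts: context vectors c_0 .. c_{h-1} entering hops 1..h;
    K_F: matrix whose rows are the fact keys. Returns an (h, |F|)
    array of per-hop scores and their mean over hops."""
    per_hop = np.stack([softmax(K_F @ c) for c in contexts])
    return per_hop, per_hop.mean(axis=0)

rng = np.random.default_rng(1)
h, n, dim = 3, 5, 8  # hops, facts, key dimension (toy values)
per_hop, avg = attention_relevance(
    [rng.standard_normal(dim) for _ in range(h)],
    rng.standard_normal((n, dim)),
)
```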

LIME
LIME (Local Interpretable Model-Agnostic Explanations) is a model-agnostic explanation method (Ribeiro et al., 2016). It approximates the behavior of the model in the vicinity of an input sample with the help of a less complex, interpretable model. LIME requires a mapping from the original features (used by TextKBQA) to an interpretable representation (used by LIME). For this purpose we use binary "bag of facts" vectors, analogous to the idea of bag of words: a vector z ∈ {0, 1}^{|F|} indicates the presence or absence of each fact f from F. The reverse mapping is straightforward.
We first turn the original fact set F into an interpretable representation z. Every entry of this vector represents one fact from F. Then we sample vectors z′ of the same length |F| by drawing facts from F using the Bernoulli distribution with p = 0.5. In every z′ vector, the presence or absence of a fact is encoded as 1 or 0, respectively. We set the number of samples to 1000 in our experiments.
For every z′, we obtain the corresponding original representation F′ and give this reduced input to the TextKBQA model. Note that the query q remains unchanged. We are interested in how strongly a_q is still supported as the predicted answer to the query q, given the facts F′ instead of F. In the TextKBQA model, this score is obtained from the inner product of b and the entity embedding matrix E at position a_q. We define this step as a function logit(q, F′, a_q) = (E · b)_{a_q}.
We gather the outputs of logit(q, F′, a_q) for all sampled instances, together with the corresponding binary vectors, into a dataset Z. Then we train a linear model on Z by optimizing:

g* = argmin_{g ∈ G} L(logit, g, Z)

where L is the ordinary least squares loss and G is the class of linear models, such that g(z′) = w_g · z′. From the fitted linear model g* we extract a weight vector w_g ∈ R^{|F|}. This vector contains the LIME relevance scores for the facts in F given a_q and q. We formally define the LIME explanation method for the TextKBQA model as:

φ_lime(f, a_q, q, F) = (w_g)_f
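The sampling-and-fitting loop can be sketched as follows. The stand-in logit function below is a hypothetical substitute for the model's real scoring step, used only to exercise the pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)
n_facts, n_samples = 8, 1000

# Hypothetical stand-in for logit(q, F', a_q): a fixed linear
# function of the fact mask (an illustrative assumption).
true_w = rng.standard_normal(n_facts)
def logit_stub(mask):
    return float(mask @ true_w)

# Sample binary "bag of facts" vectors z' with Bernoulli p = 0.5.
Z = rng.integers(0, 2, size=(n_samples, n_facts)).astype(float)
y = np.array([logit_stub(z) for z in Z])

# Fit the interpretable linear model g(z') = w_g . z' by least
# squares; w_g[f] is the LIME relevance score of fact f.
w_g, *_ = np.linalg.lstsq(Z, y, rcond=None)
most_relevant_fact = int(np.argmax(w_g))
```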

Input Perturbation Method
Another explanation method is input perturbation (IP), originally proposed by Li et al. (2016), who apply it to a sentiment analysis task. They compute relevance scores for every word in a dictionary as the average relative log-likelihood difference that arises when the word is replaced with a baseline value. This method cannot be directly applied to QA, because the same fact can be highly relevant for one query and irrelevant for another. Therefore, we constrain the computation of log-likelihood differences to a single data sample (i.e., a single query). We formally define the input perturbation (IP) explanation method as:

φ_ip(f, a_q, q, F) = logit(q, F, a_q) − logit(q, F \ {f}, a_q)

where logit is the same logit function that we used for LIME. A positive difference means that if we remove fact f when processing query q, the model's hidden vector b is less similar to the entity a_q, suggesting that the fact is relevant.
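The leave-one-out computation is straightforward to sketch; the toy scoring function below, where each fact contributes a fixed amount of evidence, is an illustrative assumption:

```python
def input_perturbation(facts, logit_fn, q, a_q):
    """phi_ip(f) = logit(q, F, a_q) - logit(q, F without f, a_q):
    the drop in the answer's score when fact f is left out."""
    base = logit_fn(q, facts, a_q)
    return {f: base - logit_fn(q, [g for g in facts if g != f], a_q)
            for f in facts}

# Toy scoring function: each fact contributes fixed evidence.
evidence = {"f1": 2.0, "f2": 0.5, "f3": 0.0}
def toy_logit(q, facts, a_q):
    return sum(evidence[f] for f in facts)

scores = input_perturbation(["f1", "f2", "f3"], toy_logit, "q", "a")
```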

Automatic evaluation using fake facts
This section presents our automatic evaluation approach, which is an extension of the hybrid document paradigm (Poerner et al., 2018). The major advantage of automatic evaluation in the context of explanation methods is that it does not require manual annotation. Poerner et al. (2018) create hybrid documents by randomly concatenating fragments of different documents. We adapt this paradigm to our use case in the following way:

Definition of automatic evaluation
Let q be a query and F the corresponding set of facts. We define the corresponding hybrid fact set F̂ as the union of F with another, disjoint fact set F′:

F̂ = F ∪ F′,  F ∩ F′ = ∅

Conceptually, F′ contains "fake facts". We discuss how they are created below; for now, just assume that TextKBQA is unable to correctly answer q using only F′. Note that we only consider queries that are correctly answered by the model based on their hybrid fact set F̂. The next step is to obtain predictions a_q for the hybrid instances and to explain them with the help of an explanation method φ. Recall that φ produces one relevance score per fact. The fact with the highest relevance score,

rmax(F̂, q, φ) = argmax_{f ∈ F̂} φ(f, a_q, q, F̂),

is taken to be the most relevant fact given query q, answer a_q and facts F̂, according to φ. We assume that φ made a reasonable choice if rmax(F̂, q, φ) stems from the original fact set F and not from the set of fake facts F′. Formally, a "hit point" is assigned to φ if:

rmax(F̂, q, φ) ∈ F

The pointing game accuracy of explanation method φ is simply its number of hit points divided by the maximally possible number of hit points.
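The pointing game can be sketched as follows; the toy explainer, which marks real facts by a name prefix, is purely illustrative:

```python
def pointing_game_accuracy(instances, explain):
    """instances: (query, real_facts, fake_facts) triples; explain maps
    (query, facts) to a {fact: relevance} dict. A hit point is scored
    when the highest-scoring fact is a real one."""
    hits = 0
    for q, real, fake in instances:
        scores = explain(q, real + fake)
        top_fact = max(scores, key=scores.get)
        hits += top_fact in set(real)
    return hits / len(instances)

# Toy explainer that happens to prefer real facts (marked by prefix).
def toy_explain(q, facts):
    return {f: (1.0 if f.startswith("real") else 0.0) for f in facts}

acc = pointing_game_accuracy(
    [("q1", ["real1"], ["fake1", "fake2"]),
     ("q2", ["real2"], ["fake3"])],
    toy_explain,
)
```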

Creating Fake Facts
To create fake facts for query q, we randomly sample a different query q′ that has the same number of entities and gather its fact set F′. We then replace the subject entities in the facts from F′ with the subject entities from F. We call these "fake facts" because they do not occur in the fact database, unless by coincidence. For example, let q be "blank was chosen to portray Patrick Bateman, a Wall Street serial killer." and q′ be "This year Philip and blank divided Judea into four kingdoms." We then replace the subject entities Philip and Judea in the facts of F′ by the subject entities Patrick Bateman and Wall Street, respectively. Our assumption is as follows: if the model is still able to predict the correct answer despite these fake facts, then this should be due to a fact contained in F and not in F′. This assumption fails when we accidentally sample a fact that supports the correct answer. Therefore, we validate F′ by testing whether the model is able to predict the correct answer to q using just F′. If this is the case, a different query q′ and a different fake fact set F′ are sampled and the validation step is applied again. This procedure continues until a valid F′ is found. Table 1 contains an example of a query with real and fake facts for which explanations were obtained by average attention weights and IP. IP assigns maximal relevance to a real fact from F, which means that φ_ip receives one hit point for this instance. The average attention weight method considers a fake fact from F′ to be the most important fact and thus does not get a hit point.
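The sample-and-validate loop can be sketched as follows. The toy model, donor sets and relation names are illustrative assumptions, not the paper's data:

```python
import random

def make_fake_facts(q, gold_answer, own_facts, donor_sets, model, rng):
    """Sketch of the fake-fact procedure: graft the subject entities of
    q's own facts onto a randomly sampled donor fact set, then resample
    until the model cannot recover the gold answer from the fake facts
    alone (the validation step)."""
    subjects = [s for (s, r, o) in own_facts]
    while True:
        donor = rng.choice(donor_sets)
        fake = [(subjects[i % len(subjects)], r, o)
                for i, (_, r, o) in enumerate(donor)]
        if model(q, fake) != gold_answer:  # validation passed
            return fake

# Toy model answering with the object of a "capital_of" fact, if any.
def toy_model(q, facts):
    for s, r, o in facts:
        if r == "capital_of":
            return o
    return None

fake = make_fake_facts(
    q="blank is the capital of France.",
    gold_answer="Paris",
    own_facts=[("Paris", "capital_of", "France")],
    donor_sets=[[("Philip", "ruled", "Judea")]],
    model=toy_model,
    rng=random.Random(0),
)
```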

Experiments and results
We perform the automatic evaluation experiment on the test set, which contains 9309 question-answer pairs in total. Recall that we discard queries that cannot be answered correctly, which leaves us with 2661 question-answer pairs. We evaluate the following explanation methods:
• φ_aw^1 − attention weights at the first hop
• φ_aw^3 − attention weights at the third (last) hop
• φ_aw^avg − average attention weights
• φ_lime − LIME with 1000 samples per instance
• φ_ip − input perturbation
A baseline that samples a random fact as rmax(...) is used for reference. Table 2 shows the pointing game accuracies and the absolute number of hit points achieved by all five explanation methods and the baseline. All methods beat the random baseline.
IP is the most successful explanation method with a pointing game accuracy of 0.97, and LIME comes second. Note that we did not tune the number of samples per query drawn by LIME, but set it to 1000. It is possible that as a consequence, queries with large fact sets are not sufficiently explored by LIME. On the other hand, a high number of samples is computationally prohibitive, as TextKBQA has to perform one inference step per sample.
Attention weights at hop 3 perform best among the attention-based methods, but worse than LIME and IP. We suspect that the last hop is especially relevant for selecting the answer entity. The poor performance of attention is in line with recent work by Jain and Wallace (2019), who also question the validity of attention as an explanation method.
We perform significance tests by means of binomial tests (with α = 0.05). Our null hypothesis is that there is no significant difference in hit scores between a given method and the next-highest method in the ranking in Table 2. Differences are statistically significant in all cases, except for the difference between attention weights at hop 3 and average attention weights (p = 0.06).
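One plausible form of such a test (a sketch, not necessarily the exact procedure used) is an exact two-sided sign test on the instances where exactly one of the two methods scores a hit point:

```python
from math import comb

def sign_test_p(wins_a, wins_b):
    """Two-sided exact binomial (sign) test on discordant instances:
    under the null hypothesis, each instance where exactly one method
    scores a hit point favors either method with probability 1/2."""
    n = wins_a + wins_b
    tail = sum(comb(n, i) for i in range(min(wins_a, wins_b) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

p = sign_test_p(8, 2)  # e.g., 8 vs. 2 discordant wins
```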

Evaluation with human annotators
The main goal of explanation methods is to make machine learning models more transparent for humans. That is why we conduct a study with human annotators.
Our experiment is based on the trust evaluation study conducted by Selvaraju et al. (2017) which, in turn, is motivated by the following idea: An important goal of interpretability is increasing users' trust in ML models, and trust is directly impacted by how much a model is understood (Ribeiro et al., 2016). Selvaraju et al. (2017) develop a method to visualize explanations for convolutional neural networks on an image classification task, and evaluate this method in different ways.
One of their experiments is conducted as follows: Given two models, one of which is known to be better (e.g., to have higher accuracy), instances are chosen that are classified correctly by both models. Visual explanations for these instances are produced by the evaluated explanation methods, and human annotators are given the task of rating the reliability of the models relative to each other, based on the predicted label and the visualizations. Since the annotators see only instances where the classifiers agree, judgments are based purely on the visualizations. An explanation method is assumed to be successful if it helps annotators identify the better model. The study confirmed that humans are able to identify the better classifier with the help of good explanations.
We perform a similar study for our use case, but modify it as described below.

Experimental setup
We use two TextKBQA models, which are trained differently:
• model A is the model used above, with a test-set F1 score of 40
• model B is a TextKBQA model with a test-set F1 score of 23. The lower score was obtained by training the model for fewer epochs and without pre-training in ONLYKB mode (see Das et al. (2017)).
We only present annotators with query instances for which both models output the same answer. However, we do not restrict these answers to be the ground truth. We perform the study with three explanation methods: average attention weights, LIME and IP. We apply each of them to the same question-answer pairs, so that the explanation methods are equally distributed among tasks.
Every task contains one query and its predicted answer (which is the same for both models), and explanations for both models by the same explanation method. In contrast to image classification, it would not be human-friendly to show participants all input components (i.e., all facts), since their number can be up to 5500. Hence, we show the top-5 facts with the highest relevance scores. The order in which model A and model B appear on the screen (i.e., which is "left" and which is "right" in Figure 2) is random to avoid biasing annotators.
Annotators are asked to compare both lists of top-5 facts and decide which of them explains the answer better. This decision is not binary; five options are given: definitely left, rather left, difficult to say, rather right and definitely right. The interface is presented in Figure 2.
25 computer science students, researchers and IT professionals took part in our study and annotated 600 tasks in total.

As shown in Table 3, the answer difficult to say is the most frequent one for all explanation methods. For attention weights and LIME there is a clear trend that, against expectations, users found fact lists coming from model B to be a better explanation. The total share of votes for definitely model B and rather model B makes up 49.5% for attention weights and 29% for LIME, while definitely model A and rather model A gain 19.5% and 23.5%, respectively. In contrast, IP achieves a higher share of votes for model A than for model B: 16.5% vs. 10.5%.
Analogously to Selvaraju et al. (2017), we compute an aggregate score that expresses how much an explanation method helps users to identify the better model. Votes are weighted in the following way: definitely model A +1, rather model A +0.75, difficult to say +0.5, rather model B +0.25 and definitely model B +0. We then compute a weighted average of votes for all tasks per explanation method. In this way, scores are bounded in [0, 1] like the values of the hit score function used for the automatic evaluation. Values smaller than 0.5 indicate that the less accurate model B was trusted more, while values larger than 0.5 represent a higher level of trust in the more accurate model A. According to this schema, attention weights achieve a score of 0.386 and LIME achieves a score of 0.476. The score of the IP method is 0.524, which means that participants were able to identify the better model A when explanations were given by IP.
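The aggregation reduces to a simple weighted average:

```python
VOTE_WEIGHTS = {
    "definitely model A": 1.0,
    "rather model A": 0.75,
    "difficult to say": 0.5,
    "rather model B": 0.25,
    "definitely model B": 0.0,
}

def trust_score(votes):
    """Weighted average of annotator votes, bounded in [0, 1]; values
    above 0.5 mean the better model A was trusted more."""
    return sum(VOTE_WEIGHTS[v] for v in votes) / len(votes)

score = trust_score(["definitely model A", "difficult to say",
                     "definitely model B"])
```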
Significance tests show that while attention weights perform significantly worse than other methods, the difference between LIME and IP is insignificant, with p = 0.07. A larger sample of data and/or more human participants may be necessary in this case.
We also collected feedback from participants and performed qualitative analysis on the evaluated fact lists. The preference for the difficult to say option can be explained by the fact that in many cases, both models were explained with the same or very similar fact lists. In particular, we found that IP provided identical top five fact lists in 120 out of 200 tasks. In the case of attention weights and LIME, this occurs only in 9 and 10 cases out of 200 tasks.
Another problem mentioned by annotators was that KB facts are not intuitive or easy to read for humans that have not dealt with such representations before. It would be interesting to explore if some additional preprocessing of facts would lead to different results. For example, KB facts could be converted into natural language sentences, while textual facts could be presented with additional context like the previous and the next sentences from the original document. We leave such preprocessing to future work.
Related work
Rychalska et al. (2018) estimate the relevance of words in queries with LIME to test the robustness of QA models. However, they do not analyze the importance of the facts used by these QA systems. Abujabal et al. (2017) present a QA system called QUINT that provides a visualization of how a natural language query is transformed into formal language and how the answer is derived. However, this system works only with knowledge bases, and the explanatory system is an integral part of it, i.e., it cannot be reused for other models. Zhou et al. (2018) propose an out-of-the-box interpretable QA model that is able to answer multi-relation questions; this model is explicitly designed to work only with KBs. Another approach (2018) relies on attention distributions across reasoning steps, claiming that their transparent nature allows humans to understand the model's behavior.
To the best of our knowledge, the interpretability of QA models that combine structured and unstructured data has not been addressed yet. Even in the context of KB-only QA models, no comprehensive evaluation of different explanation methods has been performed. The above-mentioned approaches also lack empirical evaluation with human annotators, to estimate how useful the explanations are to non-experts.

Conclusions
We performed the first evaluation of different explanation methods for a QA model working on a combination of KB and text. The evaluated methods are attention, LIME and input perturbation. To compare their performance, we introduced an automatic evaluation paradigm with fake facts, which does not require manual annotations. We validated the ranking obtained with this paradigm through an experiment with human participants, where we observed the same ranking. Based on the outcomes of our experiments, we recommend the IP method for the TextKBQA model, rather than the model's self-explanatory attention mechanism or LIME.