Commonsense Justification for Action Explanation

To enable collaboration and communication between humans and agents, this paper investigates learning to acquire commonsense evidence for action justification. In particular, we have developed an approach based on the generative Conditional Variational Autoencoder (CVAE) that models object relations/attributes of the world as latent variables and jointly learns a performer that predicts actions and an explainer that gathers commonsense evidence to justify the action. Our empirical results show that, compared to a typical attention-based model, CVAE achieves significantly higher performance in both action prediction and justification. A human subject study further shows that the commonsense evidence gathered by CVAE can be communicated to humans to achieve significantly higher common ground between humans and agents.


Introduction
To make AI more accessible, transparent, and trustworthy, recent years have seen an increasing effort on Explainable AI (XAI), which develops explainable models that attempt to explain the agent's decision-making behaviors while maintaining a high level of performance. Two types of explanation have been explored by the research community: introspective explanation, which addresses the process of decision making (e.g., how a decision is made), and justification explanation, which gathers evidence to support a certain decision (Park et al., 2018; Biran and McKeown, 2017). In this paper we focus on justification explanation, particularly identifying commonsense evidence to justify the prediction of an action. The key question we are addressing is: when an AI agent makes a prediction about an action in the world, how can the system justify its prediction in a way that makes sense to the human?
Humans have tremendous commonsense knowledge about actions in the world (e.g., key constituents of an action), which allows them to quickly recognize and infer actions in the environment from millions of available features (Rensink, 2000). As a first step in our investigation, we initiated a human study to observe the kind of commonsense reasoning used by humans to justify the prediction of an action. From this study, we identified several dimensions of commonsense evidence that are commonly used to explain an action. Motivated by this study, we frame our task as follows: given all the symbolic descriptions of the perceived physical world (e.g., object relations and attributes as a result of vision or other processing), the goal is to identify a small set of descriptions which can justify an action prediction in line with humans' commonsense knowledge about that action. The lack of commonsense knowledge is a major bottleneck in artificial agents, one that jeopardizes the common ground between humans and agents needed for successful communication. If artificial agents are ever to become partners with humans in joint tasks, the ability to learn and acquire commonsense evidence for action justification is crucial.
To address this problem, we developed an approach based on the generative Conditional Variational Autoencoder (CVAE). This approach models the perceived attributes/relations as latent variables and jointly learns a performer, which predicts actions based on attributes/relations, and an explainer, which selects a subset of attributes/relations as commonsense evidence to justify the action prediction. Our empirical results on a subset of the Visual Genome data (Krishna et al., 2016) have shown that, compared to a typical attention-based model, CVAE has a significantly higher explanation ability in terms of identifying correct commonsense evidence to justify the predicted action. When supervision of commonsense evidence is added during training, both the explainability and the performance (i.e., action prediction) are further improved.
As commonsense evidence is intuitive to humans, the agent's ability to select the right kind of commonsense evidence will allow the human and the agent to come to a common understanding of actions and their justifications, in other words, common ground. To evaluate the role of commonsense evidence in facilitating common ground, we conducted additional human subject studies. In these experiments, the agent is given a set of images and applies our models to predict actions and select commonsense evidence to justify the prediction. For each image, the agent communicates the selected commonsense evidence to the human. The human, who does not have access to the original image, makes a guess on the action only based on the communicated evidence. The agreement between the action guessed by the human and the action predicted by the agent is used to measure how well the selected commonsense evidence serves to bring the human and the agent to a common ground of perceived actions. Our experimental results have shown that the commonsense evidence selected by CVAE leads to a significantly higher common ground.
The contributions of this paper are threefold. First, we identified several key dimensions of commonsense evidence, from a human's perspective, to justify concrete actions in the physical environment. These dimensions provide a basis for justification explanation that is aligned with humans' commonsense knowledge about the action. Second, we proposed a method using CVAE to jointly learn to predict actions and select commonsense evidence as action justification. CVAE naturally models the generation process of both actions and commonsense evidence. Inferring commonsense evidence is equivalent to the posterior inference of the CVAE model, which is flexible and powerful because it incorporates actions as context. Our experimental results have shown a higher explainability of CVAE in action justification without sacrificing performance. Finally, our dataset of commonsense evidence for action explanation is available to the community at https://github.com/yangshao/Commonsense4Action. It can serve as a benchmark for future work on this topic.

Related Work
Advanced machine learning approaches such as deep learning have shown effectiveness in many applications; however, they often lack transparency and interpretability. This makes it difficult for humans to understand the agent's capabilities and limitations. To address this problem, there is a growing interest in Explainable AI. For example, previous work has applied high-precision rules to explain classifiers' decisions (Ribeiro et al., 2016, 2018). For Convolutional Neural Networks (CNNs), recent work attempts to explain model behaviors by mining semantic meanings of filters (Zhang et al., 2017a,b) or by generating language explanations (Hendricks et al., 2016; Park et al., 2018). An increasing amount of work on the Visual Question Answering (VQA) task (Antol et al., 2015; Lu et al., 2016) has also looked into more interpretable approaches, for example, by utilizing attention-based models (Fukui et al., 2016) or reasoning based on explicit evidence (Wang et al., 2017).
Specifically for action understanding, recent work explicitly models commonsense knowledge including causal relations (Gao et al., 2016; Forbes and Choi, 2017; Zellers and Choi, 2017; Gao et al., 2018) related to concrete actions, which can facilitate action explanation. Commonsense knowledge can be acquired from image annotations (Yatskar et al., 2016) or learned from visual abstraction (Vedantam et al., 2015). Different from the above work, our work here focuses on learning to acquire commonsense evidence for action justification.

A Study on Justification Explanation
While there is a rich literature on explanations in Psychology, Philosophy, and Linguistics, particularly for higher-level events and decision making (Thagard, 2000; Lombrozo, 2012; Dennett, 1987), explanations for the recognition of lower-level concrete physical actions (e.g., drink, brush, cook, etc.) that occur in our daily life are rarely studied. One possible reason is that we humans recognize these actions so intuitively that they are often taken for granted without the need for any further explanation. However, despite recent advances, recognizing and understanding actions in the real world remains extremely challenging for artificial agents. Thus it becomes important for the agent to be able to explain and justify its action prediction. What can be used to justify an action prediction, and more importantly, in a human-understandable way? To address this question, we initiated a human study to examine what kind of evidence humans would gather in justifying their recognition of an action perceived from the physical world.
More specifically, we selected a set of 12 short video clips (each about 14 seconds) from the Microsoft Research Video to Text dataset (Xu et al., 2016). For each video clip, we asked human subjects to explain why they think a certain action is happening in the video. The answers were collected via an online interface. A total of about 140 responses from 67 Michigan State University engineering students were collected. From the data, we identified the following categories of evidence commonly used by the subjects in their justifications. Most responses contain multiple categories of explanation.
• Transitive-Relation. This kind of explanation does not directly focus on the structural relations between an action and its participants, but rather transits to the relation between the participant of the action and other related evidence. For example, a woman wears an apron is used to justify the cook action. Of the collected responses, 64% used transitive relations.
• Subaction-Relation. Lower-level subactions are used to justify a higher-level action. For example, the action is cook because there are sub-actions like cutting and heating meat. Almost 75% of the responses used subactions.
• Spatial-Relation. Spatial relations between the participants of the action play an important role. For example, the knife is on the cutting board is used to explain cooking, and the water is in the bottle to explain drinking. Around 15% of responses are in this category.
• Effect-Attribute. A change in the state of an object, in other words the effect state after the action, is often used as evidence. For example, cucumber in small pieces is used as the evidence for chop. Over 28% of the responses are in this category.
• Associated-Attribute. Other attributes associated with the participants of the action, but not the effect state of the participants as a result of the action (20%). While these attributes are not directly related to the action, they are linked to the action by association. For example, banana is sliced is used as evidence to justify blend.
• Other. Participants have also cited other commonsense such as the "definition" of the action (5%), or the manner associated with different sub-actions (12%).
Most of the above categories can be potentially perceived and represented through symbolic descriptions such as logic predicates to capture object attributes and relations between objects. This study has motivated us to collect additional data (Section 4) and formulate the task of commonsense justification as described in Section 5.

Data Collection
Motivated by the human study described above, we created a dataset based on the Visual Genome (VG) data (Krishna et al., 2016) for our investigation. Each image in the VG dataset is annotated with bounding boxes, attributes of the bounding boxes, and relations between the bounding boxes.
The available annotation provides an ideal setup for us to focus on commonsense justification. In this work, we are interested in concrete physical actions that involve physical objects that can be perceived. We selected ten frequently occurring concrete actions (feed, pull, ride, drink, chop, brush, fry, bake, blend, eat) and manually identified a set of images from the VG dataset depicting these actions. This has led to a dataset of 853 images with annotated ground-truth actions.
We conducted a crowd-sourced study to collect commonsense evidence for action justification. As shown in Figure 1, for each image, we showed the crowd workers (through Amazon Mechanical Turk) the image itself, the ground-truth action, and a list of relations/attributes. The workers were instructed to select the relations/attributes that they deemed to justify the corresponding action. We randomly assigned three workers to each image. The relations or attributes that were selected by the majority (two or more) of the workers were considered gold commonsense evidence for action justification. Table 1 shows the average number of relations/attributes available (i.e., Rel# and Att#) for the corresponding images for each verb. It also shows the number of relations/attributes selected by the workers as commonsense evidence (i.e., Gold Rel# and Gold Att#). The average number of relations and attributes in each image varies slightly across actions. However, only a small percentage of them are considered commonsense evidence. Interestingly, the percentage of attributes considered good evidence is significantly lower than the percentage of relations. The sparsity of gold relations/attributes shows that learning an explainer for a target action is a challenging task.
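The majority rule used to derive gold evidence can be illustrated with a minimal sketch. The function name `gold_evidence` and the example strings are hypothetical and are not taken from the released dataset; the sketch only assumes that each worker returns a collection of selected relation/attribute descriptions.

```python
from collections import Counter

def gold_evidence(selections, min_votes=2):
    """Return the items chosen by at least `min_votes` workers.

    `selections` holds one collection of selected relation/attribute
    strings per worker (three workers per image in the study).
    """
    # set(worker) guards against a worker listing the same item twice.
    votes = Counter(item for worker in selections for item in set(worker))
    return {item for item, count in votes.items() if count >= min_votes}

# Hypothetical selections from three workers for one "brush" image.
workers = [
    {"hand holds toothbrush", "toothbrush in mouth"},
    {"toothbrush in mouth", "baby has nose"},
    {"hand holds toothbrush", "toothbrush in mouth"},
]
print(sorted(gold_evidence(workers)))
```

With three workers and `min_votes=2`, an item selected by a single worker (here "baby has nose") is not treated as gold evidence.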
We further inspected the selected gold commonsense relations and attributes. As shown in Table 2, they nicely fall into the categories of commonsense evidence discussed in Section 3. The ratios of Transitive-Relation are similar across different actions.
The ratios of Subaction-Relation and Spatial-Relation vary for different verbs. For instance, ride, bake, blend tend to be justified by spatial relations more often than sub-actions.
In addition, feed, pull, ride are rarely justified by Effect-Attribute while chop is mainly explained by the effect state of its direct object. These results will provide insight for generating justification explanations for a variety of verbs in the future.

Method
Before we formulate the problem, we first give some formal definitions. The set of relations R is defined as {r_1, r_2, ..., r_m}, where each r_i is a tuple (r_i^p, r_i^s, r_i^o) corresponding to the predicate, subject, and object; the set of attributes E is represented as {e_1, e_2, ..., e_n}, where each e_i is a tuple (e_i^o, e_i^p) corresponding to the object and attribute. We introduce z as a discrete vector (z_1, z_2, ..., z_{m+n}), where z_i ∈ {0, 1} represents the hidden explainable variable. z is interpreted as an evidence selector: z_i = 1 means the corresponding relation/attribute justifies the target action a. We define A as the vocabulary of target actions. Based on all these definitions, our goal is to jointly select evidence z and predict the target action a ∈ A, in other words, to learn the probability p(a, z|R, E).
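To make these definitions concrete, the following sketch shows one possible in-memory representation. The class and function names (`Relation`, `Attribute`, `select_evidence`) are illustrative assumptions, not part of the original system.

```python
from typing import List, NamedTuple, Sequence, Union

class Relation(NamedTuple):
    """A relation tuple (predicate, subject, object)."""
    predicate: str
    subject: str
    obj: str

class Attribute(NamedTuple):
    """An attribute tuple (object, attribute)."""
    obj: str
    attribute: str

Evidence = Union[Relation, Attribute]

def select_evidence(relations: Sequence[Relation],
                    attributes: Sequence[Attribute],
                    z: Sequence[int]) -> List[Evidence]:
    """Apply the selector z (length m + n) to the m relations and n attributes."""
    candidates: List[Evidence] = list(relations) + list(attributes)
    if len(z) != len(candidates):
        raise ValueError("z must have one entry per relation and attribute")
    # z_i = 1 marks the i-th candidate as evidence for the action.
    return [c for c, z_i in zip(candidates, z) if z_i == 1]
```

Here the first m entries of z index the relations and the remaining n entries index the attributes, which mirrors the ordering of the vector (z_1, ..., z_{m+n}).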

Conditional Variational Autoencoder
The variational autoencoder (VAE) (Kingma and Welling, 2013) was proposed as a generative model that combines the power of directed continuous or discrete graphical models and neural networks with latent variables. The VAE models the generative process of a random variable x as follows: first the latent variable z is generated from a prior probability distribution p(z), then a data sample x is generated from a conditional probability distribution p(x|z). The CVAE (Zhao et al., 2017) is a natural extension of the VAE: both the prior distribution and the conditional distribution are now conditioned on an additional context c, giving p(z|c) and p(x|z, c).
In our task, we decompose the inference problem p(a, z|R, E) into two smaller problems. The first sub-problem is to infer p(a|R, E), which is a performer. The second sub-problem is to infer p(z|a, R, E), which is an explainer. These two problems are closely coupled, hence we model them jointly. The probability distribution p(a|R, E) can be written as a marginal over the latent variable:
p(a|R, E) = Σ_z p_θ(a|z, R, E) p_θ(z|R, E)
Directly optimizing this conditional probability is not feasible. Usually the Evidence Lower Bound (ELBO) (Sohn et al., 2015) is optimized, which can be derived as follows:
ELBO(a, R, E; θ, φ) = −KL(q_φ(z|a, R, E) || p_θ(z|R, E)) + E_{q_φ(z|a, R, E)}[log p_θ(a|z, R, E)] ≤ log p(a|R, E)
The first term, the KL divergence, minimizes the distance between the posterior distribution and the prior distribution. The second term maximizes the expected log-likelihood of the target action under the posterior latent distribution.
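As a rough numerical illustration of the two ELBO terms, the sketch below assumes that the components z_i are independent two-class variables, so that the KL term is a sum of per-component Bernoulli divergences, and that the expectation is approximated by averaging over sampled values of z. Both the factorization and the function names are assumptions made for illustration; the source does not spell them out.

```python
import math

def bernoulli_kl(q, p, eps=1e-12):
    """KL(Bernoulli(q) || Bernoulli(p)) for a single component z_i."""
    q = min(max(q, eps), 1.0 - eps)
    p = min(max(p, eps), 1.0 - eps)
    return q * math.log(q / p) + (1.0 - q) * math.log((1.0 - q) / (1.0 - p))

def elbo_estimate(q_probs, p_probs, log_lik_samples):
    """Estimate -KL(q || p) + E_q[log p(a | z, R, E)].

    q_probs / p_probs: posterior and prior probabilities that z_i = 1.
    log_lik_samples: log p(a | z_s, R, E) for samples z_s drawn from q.
    """
    kl = sum(bernoulli_kl(q, p) for q, p in zip(q_probs, p_probs))
    expected_log_lik = sum(log_lik_samples) / len(log_lik_samples)
    return -kl + expected_log_lik
```

When the posterior equals the prior, the KL term vanishes and the estimate reduces to the average log-likelihood of the action.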
In most previous work using VAE, there is no explicit meaning for the hidden representation z, thus it's hard for humans to interpret. For example, z is simply assumed as a Gaussian distribution or a categorical distribution. In order to have a more explicit representation for the purpose of explanation, our latent discrete variable z is used to indicate whether the corresponding relation or attribute can be used for justifying the action.
The whole system architecture is shown in Figure 2. From an image, we first extract a candidate relation set R and an attribute set E. Every relation r and attribute e are embedded using a Gated Recurrent Neural Network (Chung et al., 2014).
The action a is represented by a GloVe embedding (Pennington et al., 2014) followed by another non-linear layer, which maps the pre-trained GloVe embedding a^glove ∈ R^k to the action embedding a^emb. The latent variable z is then calculated from [U, a^emb], where U = [r_1^emb, ..., r_m^emb, e_1^emb, ..., e_n^emb] and [U, a^emb] denotes the concatenation of U and a^emb, using a weight matrix W_z ∈ R^{2×2k}, as we assume each z_i belongs to one of the two classes {0, 1}.
The prior distribution p_θ(z|R, E) is calculated without the action context. The KL divergence in the ELBO is taken between the prior random variable z_prior from p_θ(z|R, E) and the posterior random variable z_posterior from q_φ(z|a, R, E).
Another challenge is that z is a discrete variable, which blocks the gradient and makes end-to-end training infeasible. Gumbel-Softmax (Jang et al., 2016) is a re-parameterization trick for dealing with discrete variables in neural networks. We use this trick to sample the discrete z and then apply weighted sum pooling between the discretized z and U. During training, we also add a sparsity regularization on the latent variable z besides the ELBO, which together form our final training objective.
During testing, we have two objectives. First we want to infer the target action a, which can be computed through sampling, where z_s ∼ p(z|R, E) and S is the number of samples. After obtaining the predicted action â, the posterior explanation is inferred as q_φ(z|â, R, E).
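The sampling and pooling steps described above can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: `gumbel_softmax` follows the standard relaxation of Jang et al. (2016), while `weighted_sum_pool` and the averaging rule in `predict_action` are assumed forms, since the exact equations are not recoverable from the text. All function names are hypothetical.

```python
import math
import random

def gumbel_softmax(logits, temperature=1.0, rng=random):
    """Draw one relaxed sample over the classes of a single z_i."""
    noisy = []
    for logit in logits:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / temperature)
    top = max(noisy)
    exps = [math.exp(x - top) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_sum_pool(z_weights, U):
    """Sum the embeddings in U, each weighted by its selection weight."""
    pooled = [0.0] * len(U[0])
    for w, u in zip(z_weights, U):
        for j, value in enumerate(u):
            pooled[j] += w * value
    return pooled

def predict_action(sample_z, action_probs_given_z, actions, num_samples=10):
    """Assumed test-time rule: average p(a | z_s) over S prior samples."""
    totals = {a: 0.0 for a in actions}
    for _ in range(num_samples):
        probs = action_probs_given_z(sample_z())
        for a in actions:
            totals[a] += probs[a] / num_samples
    return max(actions, key=lambda a: totals[a])
```

Lower temperatures push the relaxed sample toward a one-hot vector, which is what allows the selector to behave like a discrete choice while remaining differentiable in a neural implementation.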

Conditional Variational Autoencoder with Supervision (CVAE+SV)
In this setting, we assume we have supervision for the discrete latent variable z, which resembles a multi-task setting. We optimize both the action prediction loss and the evidence selection loss. The final loss function combines these two losses, in which z_k ∈ {0, 1} is the ground-truth label, ẑ_k is the predicted label, and λ is a hyper-parameter.
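One plausible instantiation of such a multi-task loss is sketched below. It assumes a cross-entropy loss for the action and a binary cross-entropy loss over the evidence labels, with λ weighting the evidence term; the exact form used in the paper is not recoverable from the text, and the function name is hypothetical.

```python
import math

def supervised_loss(action_probs, gold_action, z_pred, z_gold, lam=1.0, eps=1e-12):
    """Action cross-entropy plus lam times evidence binary cross-entropy.

    action_probs: dict mapping each action to its predicted probability.
    z_pred: predicted probabilities that each z_k equals 1.
    z_gold: ground-truth labels in {0, 1}.
    """
    action_loss = -math.log(max(action_probs[gold_action], eps))
    evidence_loss = 0.0
    for z_hat, z in zip(z_pred, z_gold):
        z_hat = min(max(z_hat, eps), 1.0 - eps)
        evidence_loss -= z * math.log(z_hat) + (1 - z) * math.log(1.0 - z_hat)
    return action_loss + lam * evidence_loss
```

Setting `lam` to zero recovers a pure action prediction loss, which makes the role of the hyper-parameter easy to inspect.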

Evaluation on Action Explanation
To evaluate our model, we randomly split our dataset (853 images) into 60% for training, 20% for validation, and 20% for test. For all the models we use the Adam optimizer (Kingma and Ba, 2014) with a starting learning rate 1e-4. All other hyperparameters are tuned on the validation set.

Baseline: Attention Model
We use an attention-based model as a baseline, which is similar to the model originally proposed for document classification. The architecture is shown in Figure 3. Different from the CVAE-based method, this model directly learns a context parameter instead of learning from the posterior action context. The attention is calculated from v and u_i, where v is the context parameter and u_i is the GRU embedding of the corresponding relation/attribute. The learned attention weights are used for the selection of commonsense evidence.
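A common way to compute attention from a learned context vector is a softmax over dot products between the context vector and each item embedding. The sketch below assumes this form; the baseline's exact scoring function is not given in the text, and `attention_weights` is a hypothetical name.

```python
import math

def attention_weights(embeddings, context):
    """Softmax over the dot products of each u_i with the context vector v."""
    scores = [sum(u_j * v_j for u_j, v_j in zip(u, context)) for u in embeddings]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]
```

Ranking relations/attributes by these weights yields the ordered evidence list that is later scored with Mean Average Precision.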

Evaluation Metrics and Comparison
Our evaluation compares the performance of the following models:
• Baseline. The attention model presented in Section 6.1.
• CVAE. The CVAE model as presented in Section 5.1.
• CVAE+SV. The CVAE model with supervision as presented in Section 5.2.
• Upper Bound. We also calculate the upper bound of the CVAE model using the human-annotated gold evidence.
For each of the above models, we evaluate performance on both action prediction (i.e., performer) and action justification (i.e., explainer):
• Performer: Accuracy is used to measure the percentage of actions that are correctly predicted by the model.
• Explainer: As discussed in Section 5, the binary random variable z is used to capture commonsense evidence. The probability of each z_i represents the model's belief that the corresponding evidence supports the action decision. As we hope to rank the gold evidence higher, the Mean Average Precision (MAP) metric is calculated for evaluating evidence selection.
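The explainer metric can be illustrated with a short sketch of Mean Average Precision over ranked evidence. The function names are hypothetical, and the handling of images with no gold evidence (scored as 0 here) is an assumption.

```python
def average_precision(scores, gold):
    """Average precision of one ranking.

    scores: model probability for each relation/attribute.
    gold: 1 if the item is gold evidence, else 0.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, idx in enumerate(ranked, start=1):
        if gold[idx]:
            hits += 1
            precision_sum += hits / rank  # precision at this gold item
    return precision_sum / hits if hits else 0.0

def mean_average_precision(examples):
    """Mean of per-image average precision over (scores, gold) pairs."""
    values = [average_precision(scores, gold) for scores, gold in examples]
    return sum(values) / len(values)
```

A ranking that places all gold evidence ahead of every non-gold item receives an average precision of 1.0, which is why the upper bound built from gold evidence always attains a MAP of 1.0.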

Evaluation Results
The results are shown in Table 3. Since the Upper Bound method directly uses the human-annotated gold evidence, its MAP for selecting evidence is always 1.0. The CVAE model outperforms the attention-based model in both action prediction and evidence selection. This indicates that the CVAE model can incorporate better guidance for evidence selection during the training process. One possible explanation is that the CVAE model incorporates the target action as the context during learning instead of directly learning a context parameter. Furthermore, after adding the evidence supervision, the CVAE+SV model gives even better performance in both action prediction and evidence selection. We notice that the action prediction accuracy of the CVAE+SV model approaches the upper bound of 91.8%; however, its evidence selection MAP is still far from the upper bound even with supervision.

Semi-Supervised Learning
Although we have shown that adding supervision on the latent variable z improves the model performance, collecting this label information through human annotation is usually time-consuming and expensive. In this section, we explore how semi-supervised learning can help to alleviate this difficulty. As a generative model, the VAE has shown its advantage in semi-supervised learning. Following this line of work, our semi-supervised learning loss function applies L_SV to labeled samples and L_CVAE to unlabeled samples, where L_SV is defined in Section 5.2 and L_CVAE is detailed in Section 5.1. In other words, a data sample with an evidence label is fed to L_SV; otherwise it is fed to L_CVAE.
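The routing of labeled and unlabeled samples can be sketched as follows. The dictionary key `z_gold`, the function name, and the choice to sum the per-sample losses are all assumptions made for illustration; `loss_sv` and `loss_cvae` stand in for the two loss functions and are supplied by the caller.

```python
def semi_supervised_loss(batch, loss_sv, loss_cvae):
    """Sum per-sample losses, using loss_sv only when evidence labels exist.

    batch: list of dicts; a sample is treated as labeled when it carries
    a non-None "z_gold" entry (a hypothetical field name).
    """
    total = 0.0
    for example in batch:
        if example.get("z_gold") is not None:
            total += loss_sv(example)    # labeled: supervised objective
        else:
            total += loss_cvae(example)  # unlabeled: CVAE objective
    return total
```

Varying the fraction of samples that carry `z_gold` is one simple way to study how much annotation is needed.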
The results are shown in Figure 4 where the

Commonsense Justification towards Common Ground
In human-agent communication, the success of communication is largely dependent on common ground, which captures shared knowledge, beliefs, or past experience (Clark, 1996). As commonsense evidence is what humans use to justify actions, we hypothesize that commonsense evidence selected by the agent can help establish common ground between the human and the agent. To validate this hypothesis, we conducted a human-subject experiment to examine the role of commonsense justification in facilitating common ground. Figure 5 shows the setup of our experiment. The agent is provided with an image and applies various models (e.g., CVAE) to jointly predict the action and identify commonsense evidence. The human is provided with a list of six action choices and does not have access to the image. The agent communicates to the human only the identified commonsense evidence, and the human makes a guess on the action from the candidate list purely based on the communicated evidence. The idea is that, if the human and the agent share the same beliefs about evidence to justify an action, then the action guessed by the human should be the same as the action predicted by the agent.
Generating Distracting Verbs. For each image, the human is provided with a list of six action/verb candidates. To generate this list, we mix four distracting verbs with the ground-truth action verb plus a default Other. Most of the distracting verbs come from the concrete action verbs made available by Gao et al. (2018). We first manually filtered out the verbs which have the same meaning as the ground-truth verb. We then selected two groups of distracting verbs: an easy group (where the distracting verbs have a larger distance from the ground-truth verb in the embedding space, with an average similarity of 0.284) and a hard group (closer to the ground-truth verbs, with an average similarity of 0.479). The temperature-based softmax distribution (Chorowski and Jaitly, 2016) was used to sample the easy and the hard distracting verbs based on the cosine similarity of pre-trained GloVe (Pennington et al., 2014) embeddings.
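The distractor sampling step can be sketched as a temperature-scaled softmax over cosine similarities, sampled without replacement. The temperature value, the use of negated similarity for the easy group, and all names below are assumptions; the text only states that a temperature-based softmax over GloVe cosine similarity was used.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def sample_distractors(target_vec, candidates, k=4, temperature=0.1,
                       hard=True, rng=None):
    """Sample k distinct verbs from `candidates` (verb -> embedding).

    hard=True favours verbs similar to the target; hard=False favours
    dissimilar verbs by negating the similarity (an assumed convention).
    """
    rng = rng or random.Random()
    names = list(candidates)
    sign = 1.0 if hard else -1.0
    logits = [sign * cosine(target_vec, candidates[n]) / temperature for n in names]
    chosen = []
    for _ in range(min(k, len(names))):
        top = max(logits)
        weights = [math.exp(l - top) for l in logits]
        i = rng.choices(range(len(names)), weights=weights, k=1)[0]
        chosen.append(names.pop(i))  # remove so each verb is picked once
        logits.pop(i)
    return chosen
```

A lower temperature concentrates the sampling on the most (or least) similar verbs, while a higher temperature makes the choice closer to uniform.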

Experiment Setup
Process. A total of 170 images were used in this experiment, and 24 workers from AMT participated in our study. For each image, we applied three different models (Attention baseline, CVAE, and CVAE+SV) to generate the commonsense evidence. An upper bound based on gold commonsense evidence was also measured. Note that the agent has no knowledge of the human's action choices when generating the commonsense evidence. Theory of mind is an important aspect of human-agent communication. Incorporating the human's action choices in justifying an action is an interesting but different problem that requires different solutions. In this paper, we only focus on the situation where the mind of the human is opaque to the agent.
For each model and each image under the easy or hard configuration, the top five pieces of predicted commonsense evidence (associated with the predicted action) were shown to a worker. The worker was asked to select the most probable action from the candidate list based only on these five pieces of evidence. We randomly assigned three workers to each image. The majority of the three selections was considered the final answer. If all three selections disagreed, one worker's choice was randomly selected as the final answer.
Metrics for Common Ground. We use the agreement between the action guessed by the human and the action predicted by the agent to measure how well the selected commonsense evidence serves to bring the human and the agent to a common ground of perceived actions. More formally, as shown in Figure 5, given an image, suppose its ground-truth action is A, the action predicted by the agent/machine is A_m, and the action guessed by the human is A_h. Common ground is then defined as A_m = A_h = A. Here we also enforce that the predicted action should be the same as the ground-truth action. The percentage of trials based on different models that have led to a common ground is measured and compared.
Figure 6: Two examples of the common ground study based on different models. In each example, a ranked list of commonsense evidence generated by different models is shown. A_m captures the action predicted by the agent. A_h captures the action guessed by the human based on the selected commonsense evidence.
Table 4 shows the comparison results among various models and the upper bound, where the gold commonsense evidence is provided to the human. It is not surprising that performance on common ground is worse in the hard configuration, as the distracting verbs are more similar to the target action. The CVAE-based method is better than the attention-based method in facilitating common ground. Figure 6 shows two examples of the top five pieces of predicted evidence under different models. For each model, it also shows the agent-predicted action (A_m) and the human-guessed action (A_h). In both examples, all models were able to establish a common ground except for the attention-based model. The evidence selected by the CVAE+SV model is clearly more accurate than that of the CVAE model and is closer to the ground-truth evidence. The second example shows that although the attention-based model predicts a correct target action, it fails to convey correct commonsense evidence to establish a common ground with the human.
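The aggregation of worker guesses and the common ground measure described in this section can be sketched as follows. The function names and the tuple layout of a trial are illustrative assumptions.

```python
import random
from collections import Counter

def aggregate_guess(guesses, rng=None):
    """Return the majority guess, or a random worker's guess if none exists."""
    rng = rng or random.Random()
    action, count = Counter(guesses).most_common(1)[0]
    if count > len(guesses) / 2:
        return action
    return rng.choice(guesses)

def common_ground_rate(trials):
    """Fraction of trials with A_m = A_h = A.

    trials: list of (gold_action, predicted_action, human_guess) tuples.
    """
    hits = sum(1 for gold, pred, guess in trials if pred == guess == gold)
    return hits / len(trials)
```

Note that a trial in which the human agrees with an incorrect agent prediction does not count as common ground under this definition, because the predicted action must also match the ground truth.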

Conclusion
This paper describes an approach for action justification using commonsense evidence. As demonstrated in our experiments, commonsense evidence is selected to align with humans' justification of an action and is therefore critical in establishing a common ground between humans and agents.
For all experiments in this paper, we use the annotated relations/attributes from the original Visual Genome data, as the state-of-the-art recall@50 on relation detection with a limited vocabulary is only around 20% (Liang et al., 2018). Using annotated relations and attributes allows us to focus on the study of commonsense evidence and its role in action justification and common ground. Nevertheless, our proposed method has the potential to handle erroneous relations/entities (e.g., as a result of vision processing), for example by avoiding the selection of erroneous relations, as they do not correlate with actions and other indicative relations/attributes. Our future work will extend the model and findings from this work to vision processing that will not only identify commonsense evidence but also explain where and how in the perceived environment the evidence is gathered.