What Should I Ask? Using Conversationally Informative Rewards for Goal-Oriented Visual Dialog

The ability to engage in goal-oriented conversations has allowed humans to gain knowledge, reduce uncertainty, and perform tasks more efficiently. Artificial agents, however, still lag far behind humans in holding goal-driven conversations. In this work, we focus on the task of goal-oriented visual dialogue: automatically generating a series of questions about an image, all directed toward a single objective. The task is challenging because the questions must not only be consistent with a strategy for achieving the goal, but must also take into account the contextual information in the image. We propose an end-to-end goal-oriented visual dialogue system that combines reinforcement learning with regularized information gain. Unlike previous approaches to the task, our work is motivated by the Rational Speech Act framework, which models the process of human inquiry toward a goal. We test the two versions of our model on the GuessWhat?! dataset, obtaining significant results that outperform the current state-of-the-art models on the task of generating questions to find an undisclosed object in an image.


Introduction
Building natural language models that are able to converse towards a specific goal is an active area of research that has attracted a lot of attention in recent years. These models are vital for efficient human-machine collaboration, such as when interacting with personal assistants. In this paper, we focus on the task of goal-oriented visual dialogue, which requires an agent to engage in a conversation about an image with a predefined objective. The task presents some unique challenges. Firstly, the conversation should be consistent with the goals of the agent. Secondly, the conversation between the two agents must be coherent with the common visual feedback. Finally, the agents should come up with a strategy to achieve the objective in the shortest possible way. This is different from a normal dialogue system, where there is no constraint on the length of a conversation.

Figure 1: An example of goal-oriented visual dialogue for finding an undisclosed object in an image through a series of questions. On the left, we ask a human to guess the unknown object in the image. On the right, we use the baseline model proposed by . While the human is able to narrow down the search space relatively fast, the artificial agent is not able to adopt a clear strategy for guessing the object.
Inspired by the success of Deep Reinforcement Learning, many recent works have used it to build models for goal-oriented visual dialogue (Bordes et al., 2017). The choice makes sense, as reinforcement learning is well suited for tasks that require a sequence of actions to reach a goal. However, the performance of these models has been sub-optimal compared to average human performance on the same task. For example, consider the two conversations shown in Figure 1. The figure compares the questions asked by a human with those asked by the autonomous agent proposed by  to locate an undisclosed object in the image. While humans tend to adopt strategies that narrow down the search space and bring them closer to the goal, it is not clear whether an artificial agent can learn a similar behavior only by looking at a set of examples. This leads us to pose two questions: What strategies do humans adopt when coming up with a series of questions with respect to a goal? And can these strategies be used to build models that are suited for goal-oriented visual dialogue?
With this challenge in mind, we directed our attention to contemporary works in the fields of cognitive science, linguistics, and psychology on modelling human inquiry (Groenendijk et al., 1984; Nelson, 2005; Van Rooy, 2003). More specifically, our focus lies on how humans come up with a series of questions in order to reach a particular goal. One popular theory suggests that humans try to maximize the expected regularized information gain while asking questions (Hawkins et al., 2015; Coenen et al., 2017). Motivated by this, we evaluate the utility of using information gain for goal-oriented visual question generation within a reinforcement learning paradigm. In this paper, we propose two different approaches for training an end-to-end architecture: first, a novel reward function that is a trade-off between the expected information gain of a question and the cost of asking it; and second, a loss function that uses regularized information gain with a step-based reward function. Our architecture is able to generate goal-oriented questions without using any prior templates. Our experiments are performed on the GuessWhat?! dataset, a standard dataset for goal-oriented visual dialogue that focuses on identifying an undisclosed object in an image through a series of questions. Thus, our contribution is threefold:
• An end-to-end architecture for goal-oriented visual dialogue combining Information Gain with Reinforcement Learning.
• A novel reward function for goal-oriented visual question generation to model long-term dependencies in dialogue.
• Both versions of our model outperform the current baselines on the GuessWhat?! dataset for the task of identifying an undisclosed object in an image by asking a series of questions.
Related Work

Models for Human Inquiry
There have been several works in the area of cognitive science that focus on models of question generation. Groenendijk et al. (Groenendijk et al., 1984) proposed a theory stating that meaningful questions are propositions conditioned by the quality of their answers. Van Rooy (Van Rooy, 2003) suggested that the value of a question is proportional to the questioner's interest and the answer that is likely to be provided. Many recent related models build on optimal experimental design (OED) (Nelson, 2005; Gureckis and Markant, 2012), which holds that humans perform intuitive experiments to gain information, while others resort to Bayesian inference. Coenen et al. (Coenen et al., 2017), for instance, came up with nine important questions about human inquiry, while one recent model, the Rational Speech Act (RSA) (Hawkins et al., 2015), treats questions as a distribution that is proportional to the trade-off between the expected information gain and the cost of asking a question.

Dialogue Generation and Visual Dialogue
Dialogue generation is an important research topic in NLP, and many approaches have been proposed to address this task. Most earlier works made use of predefined templates (Lemon et al., 2006; Wang and Lemon, 2013) to generate dialogues. More recently, deep neural networks have been used to build end-to-end architectures capable of generating questions (Vinyals and Le, 2015; Sordoni et al., 2015) and to address the task of goal-oriented dialogue generation (Rajendran et al., 2018; Bordes et al., 2017). Visual dialogue focuses on having a conversation about an image with either one or both of the agents being a machine. Since its inception (Das et al., 2017), different approaches have been proposed to address this problem (Massiceti et al., 2018; Lu et al., 2017; Das et al., 2017). Goal-oriented visual dialogue, on the other hand, is an area that has only been introduced fairly recently. De Vries et al.  proposed the GuessWhat?! dataset for goal-oriented visual dialogue, while Strub et al.  developed a reinforcement learning approach for goal-oriented visual question generation. More recently, Zhang et al.  used intermediate rewards for training a model on this task.

Figure 2: A block diagram of our model. The framework is trained on top of three individual models: the questioner (QGen), the guesser, and the oracle. The guesser returns an object distribution given a history of question-answer pairs that are generated by the questioner and the oracle, respectively. These distributions are used for calculating the information gain of each question-answer pair. The information gain and the distribution of probabilities given by the guesser are used either as a reward or optimized as a loss function with global rewards for training the questioner.

Sampling Questions with Information Gain
Information gain has been used before to build question-asking agents, but most of these models resort to it only to sample questions. Rothe et al. (Rothe et al., 2017) proposed a model that generates questions in a Battleship game scenario; their model uses Expected Information Gain to come up with questions akin to what humans would ask. Lee et al. (Lee et al., 2018) used information gain alone to sample goal-oriented questions on the GuessWhat?! task in a non-generative fashion. The work most similar to ours was proposed by Lipton et al. (Lipton et al., 2017), who used information gain and Q-learning to generate goal-oriented questions for movie recommendations. However, they generated questions using a template-based question generator.

The GuessWhat?! framework
We built our model based on the GuessWhat?! framework . GuessWhat?! is a two-player game in which both players are given access to an image containing multiple objects. One of the players, the oracle, chooses an object in the image. The goal of the other player, the questioner, is to identify this object by asking the oracle a series of questions, to which the oracle can only give three possible answers: "yes," "no," or "not applicable." Once enough evidence is collected, the questioner has to choose the correct object from a set of possibilities, which, in the case of an artificial agent, are evaluated by a guesser module. If this final guess is correct, the questioner is declared the winner. The GuessWhat?! dataset comprises 155,280 games on 66,537 images from the MS-COCO dataset, with 831,889 question-answer pairs. The dataset has 134,074 unique objects and 4,900 words in the vocabulary.
A game comprises an image I with height H and width W and a dialogue D = {(q_1, a_1), (q_2, a_2), …, (q_n, a_n)}, where q_j ∈ Q denotes a question from the list of questions and a_j ∈ A denotes an answer from the list of answers, which can be "yes," "no," or "N/A." The total number of objects in the image is denoted by O and the target is denoted by o*. The term V indicates the vocabulary comprising all the words employed to train the question generation module (QGen). Each question can be represented as q = {w_i}, where w_i denotes the i-th word in the vocabulary. The set of segmentation masks of objects is denoted by S. These notations are similar to those of Strub et al. An example of a game can be seen in Figure 1, where the questioner generates a series of questions to guess the undisclosed object. In the end, the guesser tries to predict the object given the image and the set of question-answer pairs.
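The notation above can be sketched as a small data structure. This is purely illustrative: the `Game` class and its field names are ours and do not correspond to the actual GuessWhat?! JSON schema.

```python
from dataclasses import dataclass, field

@dataclass
class Game:
    """One GuessWhat?! game, following the notation in the text
    (image I of size W x H, objects O, target o*, dialogue D)."""
    image_id: int
    width: int                                    # W
    height: int                                   # H
    objects: list                                 # the candidate objects O
    target_index: int                             # index of o* in `objects`
    dialogue: list = field(default_factory=list)  # D = [(q_1, a_1), ...]

    def add_turn(self, question: str, answer: str):
        # The oracle can only answer "yes", "no", or "n/a".
        assert answer in {"yes", "no", "n/a"}
        self.dialogue.append((question, answer))

# A toy game with three candidate objects:
g = Game(image_id=1, width=640, height=480,
         objects=["cat", "dog", "chair"], target_index=0)
g.add_turn("is it an animal?", "yes")
```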

Learning Environment
We now describe the preliminary models for the questioner, the guesser, and the oracle. Before using them for the GuessWhat?! task, we pre-train all three models in a supervised manner. During the final training on the GuessWhat?! task, our focus is on building a new model for the questioner, and we use the existing pre-trained models for the oracle and the guesser.

The Questioner
The questioner's job is to generate a new question q_{j+1} given the previous j question-answer pairs and the image I. Our model has an architecture similar to the VQG model proposed by . It consists of an LSTM whose inputs are the representation of the corresponding image I and an input sequence corresponding to the previous dialogue history. The representation of the image is extracted from the fc8 layer of the VGG16 network (Simonyan and Zisserman, 2014). The output of the LSTM is a probability distribution over all words in the vocabulary. The questioner is trained in a supervised fashion by minimizing the following negative log-likelihood loss function:

L(θ) = − Σ_i log p_q(w^j_i | w^j_{1:i−1}, (q, a)_{1:j−1}, I)

During testing, samples are generated in the following manner: given an initial state s_0 and a start token w^j_0, a word is sampled from the vocabulary. The sampled word, along with the previous state, is given as input to the next step of the LSTM. The process is repeated until the LSTM outputs the end token.
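The decoding loop just described (feed the sampled word and previous state back into the network until an end token appears) can be sketched as follows. The `toy_step` function is a hand-written stub standing in for one step of the trained, image- and history-conditioned LSTM, and all names here are ours.

```python
import random

END, STOP = "<end>", "<stop>"
VOCAB = ["is", "it", "a", END]   # toy vocabulary, fixed token order

def sample_question(step, vocab, max_len=12, seed=0):
    """Autoregressive sampling loop from the text.

    `step(prev_word, state) -> (probs, state)` plays the role of one
    LSTM step; `probs` is a distribution over `vocab`.
    """
    rng = random.Random(seed)
    state, words = None, []
    w = "<start>"
    for _ in range(max_len):
        probs, state = step(w, state)
        # Sample the next token from the output distribution.
        w = rng.choices(vocab, weights=probs, k=1)[0]
        if w in (END, STOP):      # end of question / end of dialogue
            break
        words.append(w)
    return " ".join(words)

def toy_step(prev, state):
    """Deterministic stub: <start> -> "is" -> "it" -> <end>."""
    nxt = {"<start>": "is", "is": "it"}.get(prev, END)
    return [1.0 if v == nxt else 0.0 for v in VOCAB], None
```

With the stub, `sample_question(toy_step, VOCAB)` produces the two-word question "is it" and stops at the end token.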

The Oracle
The job of the oracle is to come up with an answer to each question that is posed. In our case, the three possible answers are "yes," "no," or "N/A." The architecture of the oracle model is similar to the one proposed by De Vries et al. . The input to the oracle is the image, a category vector, and the question encoded using an LSTM. The model then returns a distribution over the possible set of answers.

The Guesser
The job of the guesser is to return a probability distribution over the set of all objects given the input image and the dialogue history. We convert the entire dialogue history into a single encoded vector using an LSTM. All objects are embedded into vectors, and the dot product of each embedding with the encoded dialogue vector is computed. The dot products are then passed through an MLP layer that returns the distribution over all objects.
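As a rough sketch of the guesser head described above, with random vectors standing in for the learned embeddings and an identity map standing in for the trained MLP (both are assumptions for illustration):

```python
import numpy as np

def guesser_distribution(dialogue_encoding, object_embeddings, mlp):
    """Dot-product scoring of candidate objects, as in the text.

    dialogue_encoding: (d,) LSTM encoding of the full dialogue history.
    object_embeddings: (n_objects, d), one embedding per candidate.
    mlp: maps the per-object scores to logits (identity stub here).
    """
    scores = object_embeddings @ dialogue_encoding   # one dot product per object
    logits = mlp(scores)
    exp = np.exp(logits - logits.max())              # numerically stable softmax
    return exp / exp.sum()

rng = np.random.default_rng(0)
enc = rng.normal(size=8)            # stand-in dialogue encoding
objs = rng.normal(size=(5, 8))      # stand-in embeddings for 5 objects
p = guesser_distribution(enc, objs, mlp=lambda s: s)
```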

Regularized Information Gain
The motivation behind using Regularized Information Gain (RIG) for goal-oriented question asking comes from the Rational Speech Act model (RSA) (Hawkins et al., 2015). RSA tries to mathematically model the process of human questioning and answering. According to this model, when selecting a question from a set of questions Q, the questioner considers a goal g ∈ G with respect to the world state and returns a probability distribution over questions such that:

P(q|g) ∝ exp( D_KL( p̂(q|g) ‖ p̃(q|g) ) − C(q) )   (2)

where P(q|g) represents the probability of selecting a question q from the set of questions Q. The probability is directly proportional to the trade-off between the expected information gain D_KL( p̂(q|g) ‖ p̃(q|g) ) and the cost of asking the question, C(q). The cost may depend on several factors, such as the length of the question, its similarity to previously asked questions, or the number of questions asked before it. The information gain is defined as the KL divergence between the posterior distribution that the questioner would expect after asking the question, p̂(q|g), and the prior distribution of the world with respect to the goal, p̃(q|g).
Similar to Equation 2, our model makes use of the trade-off between the expected information gain and the cost of asking a question for goal-oriented question generation. Since the cost term regularizes the expected information gain, we denote this trade-off as Regularized Information Gain. For a given question q, the Regularized Information Gain is given as:

RIG(q) = τ(q) − C(q)

where τ(q) is the expected information gain associated with asking the question and C(q) is the cost of asking a question q ∈ Q in a given game. The information gain is measured as the KL divergence between the posterior and prior likelihoods of the scene objects, after and before a certain question is asked, weighted by a skewness coefficient β(q) computed over the same posterior:

τ(q) = β(q) · D_KL( p̂(q) ‖ p̃(q) )
The prior distribution before the start of the game is assumed to be uniform, 1/N, where N is the total number of objects in the game. After a question is asked, the prior distribution is updated to the output distribution of the guesser:

p̃(q_j | I, (q, a)_{1:j−1}) = p_guess(· | I, (q, a)_{1:j−1})

We define the posterior to be the output of the guesser once the answer has been given by the oracle, summed over the possible answers:

p̂(q_j | I, (q, a)_{1:j−1}) = Σ_{a∈A} p_guess(· | I, (q, a)_{1:j−1}, (q_j, a))   (6)

The idea behind using skewness is to reward questions that lead to a more skewed distribution at each round. The implication is that a smaller group of objects with higher probabilities lowers the chances of making a wrong guess by the end of the game. Additionally, the measure of skewness also works as a counterweight to certain scenarios where the KL divergence itself should not reward the outcome of a question, such as when there is a significant information gain from a previous state but the distribution of likely objects, according to the guesser, becomes mostly homogeneous after the question.
Since we assume that initially all objects are equally likely to be the target, the skewness approach is only applied after the first question. We use the posterior distribution provided by the guesser to extract Pearson's second skewness coefficient (i.e., the median skewness) and obtain the β component. Assuming a sample mean μ, median m, and standard deviation σ, the skewness coefficient is simply given by:

β(q) = 3(μ − m) / σ

Some questions might have a high information gain, but at a considerable cost. The term C(q) acts as a regularizing component to the information gain and controls what sort of questions should be asked by the questioner. The cost of asking a question can be defined in many ways and may differ from one scenario to another. In our case, we only consider whether a question is being asked more than once, since a repeated question cannot provide any new evidence that will help get closer to the target, even if its information gain from one state to another is high. The cost of a repeated question is defined to be equal to its information gain:

C(q) = τ(q) if q has been asked before, and 0 otherwise

This sets the value of the intermediate reward for a repeated question to 0, ensuring that the net RIG is zero when the question is repeated.
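Putting the pieces of this section together, the computation of RIG for a single question can be sketched as follows. This is a minimal sketch under our reading of the text: the function names are ours, the skewness is Pearson's second coefficient 3(μ − m)/σ over the posterior, and a repeated question is costed so that its net RIG is zero.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence D_KL(p || q) between two discrete distributions."""
    p = np.asarray(p, float) + eps
    q = np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def skewness(posterior):
    """Pearson's second (median) skewness coefficient of the posterior."""
    mu, m, sigma = np.mean(posterior), np.median(posterior), np.std(posterior)
    return 3 * (mu - m) / sigma if sigma > 0 else 0.0

def rig(prior, posterior, asked_before):
    """RIG(q) = tau(q) - C(q): skew-weighted KL gain minus repetition cost."""
    tau = skewness(posterior) * kl(posterior, prior)
    cost = tau if asked_before else 0.0   # repeated question nets zero
    return tau - cost
```

For example, moving from a uniform prior over four objects to a posterior concentrated on one object yields a positive RIG, while asking the same question again yields exactly zero.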

Our Model
We view the task of generating goal-oriented visual questions as a Markov Decision Process (MDP), and we optimize it using the Policy Gradient algorithm. In this section, we describe some of the basic terminology employed in our model before moving into the specific aspects of it.
At any time instant t, the state of the agent can be written as u_t = ((w^j_1, …, w^j_m), (q, a)_{1:j−1}, I), where I is the image of interest, (q, a)_{1:j−1} is the question-answer history, and (w^j_1, …, w^j_m) is the previously generated sequence of words of the current question q_j. The action v_t denotes the selection of the next output token from all the tokens in the vocabulary. Every action can lead to one of the following outcomes:
1. The selected token is the stop token, marking the end of the dialogue. This indicates that it is now the turn of the guesser to make its guess.
2. The selected token is the end token, marking the end of a question.
3. The selected token is another word from the vocabulary. The word is then appended to the current sequence (w^j_1, …, w^j_m), marking the start of the next state.
Our approach models the task of goal-oriented questioning as an optimal stochastic policy π_θ(v|u) over the possible set of state-action pairs.

Algorithm 1 Training the question generator using REINFORCE with the proposed rewards
Require: Pretrained QGen, Oracle, and Guesser
Require: Batch size K
1: for each update do
2:   for k = 1 to K do
3:     Pick an image I_k and the target object o*_k ∈ O_k
4:     N ← |O_k|

Here θ represents the parameters of our question-generation architecture. In this work, we experiment with two different settings to train our model with Regularized Information Gain and policy gradients. In the first setting, we use Regularized Information Gain as an additional term in the loss function of the questioner. We then train it using policy gradients with a 0-1 reward function.
In the second setting, we use Regularized Information Gain to reward our model. Both methods are described below.

Regularized Information Gain loss minimization with 0-1 rewards
During training on the GuessWhat?! game, we introduce Regularized Information Gain as an additional term in the loss function. The goal is to minimize the negative log-likelihood while maximizing the Regularized Information Gain. The loss function for the questioner is given by:

L(θ) = − [ Σ_i log p_q(w^j_i | w^j_{1:i−1}, (q, a)_{1:j−1}, I) + β(q) D_KL( p̂(q_j | I, (q, a)) ‖ p̃(q_j | I, (q, a)) ) ]

We adopt a reinforcement learning paradigm on top of the proposed loss function. We use a zero-one reward function similar to Strub et al.  for training our model. The reward function is given as:

r(u_t, v_t) = 1 if the guesser identifies o*, and 0 otherwise

Thus, we give a reward of 1 if the guesser is able to guess the right object and 0 otherwise.

Using Regularized Information Gain as a reward
Defining a valuable reward function is a crucial aspect for any Reinforcement Learning problem.
There are several factors that should be considered while designing a good reward function for asking goal-oriented questions. First, the reward function should help the questioner achieve its goal.
Second, the reward function should optimize the search space, allowing the questioner to come up with relevant questions. The idea behind using Regularized Information Gain as a reward function is to take into account the long-term dependencies in the dialogue. As a reward, Regularized Information Gain can help the questioner come up with an efficient strategy for narrowing down a large search space. The reward function is given by:

r(u_t, v_t) = Σ_{q∈Q} ( τ(q) − C(q) ) if the guesser identifies o*, and 0 otherwise

Thus, the reward function is the sum of the trade-offs between the information gain τ(q) and the cost of asking a question C(q) over all questions Q in a given game. The function only rewards the agent if it is able to correctly predict the oracle's initial choice.

Policy Gradients
Once the reward function is defined, we train our model using the policy gradient algorithm. For a given policy π_θ, the objective function of the policy gradient is given by:

J(θ) = E_{π_θ} [ Σ_t r(u_t, v_t) ]

According to Sutton et al. (Sutton et al., 2000), the gradient of J(θ) can be written as:

∇J(θ) = E_{π_θ} [ ∇_θ log π_θ(v_t | u_t) ( Q^{π_θ}(u_t, v_t) − b_φ ) ]

Table 1: A comparison of the recognition accuracy of our model with the state-of-the-art model  and other concurrent models on the GuessWhat?! task for guessing an object in the images from the test set.
where Q^{π_θ}(u_t, v_t) is the state-action value function given by the sum of the expected cumulative rewards:

Q^{π_θ}(u_t, v_t) = E [ Σ_{t'≥t} r(u_{t'}, v_{t'}) ]

Here b_φ is the baseline function used for reducing the variance. The baseline function is a single-layered MLP trained by minimizing the squared error between the baseline and the observed cumulative reward:

min_φ Σ_t ( b_φ(u_t) − R_t )²

where R_t denotes the cumulative reward obtained from step t onward.
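A minimal single-sample sketch of the resulting REINFORCE update for a softmax policy, with the baseline subtracted from the return. This is a toy stand-in for the paper's LSTM policy, and the function name is ours.

```python
import numpy as np

def reinforce_grad(logits, action, reward, baseline):
    """One-sample REINFORCE gradient estimate w.r.t. softmax logits.

    grad J ≈ (R - b) * grad log pi(action); the baseline b only
    reduces the variance of the estimate and leaves it unbiased.
    """
    exp = np.exp(logits - logits.max())
    pi = exp / exp.sum()
    # d log pi(action) / d logits for a softmax policy:
    grad_logpi = -pi
    grad_logpi[action] += 1.0
    return (reward - baseline) * grad_logpi

# With uniform logits, a reward of 1 pushes up the chosen action's logit:
g = reinforce_grad(np.zeros(3), action=0, reward=1.0, baseline=0.0)
```

When the baseline exactly equals the observed return, the update vanishes, which is what keeps the baseline from biasing the gradient.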

Results
The model was trained under the same settings as . This was done in order to obtain a more reliable comparison with the pre-existing models in terms of accuracy. After supervised training of the question generator, we ran our reinforcement procedure using the policy gradient for 100 epochs with a batch size of 64 and a learning rate of 0.001. The maximum number of questions was 8. The baseline model, the oracle, and the guesser were also trained with the same settings described by , in order to compare the performance of the two reward functions. The errors obtained by the guesser and the oracle were 35.8% and 21.1%, respectively. 1 Table 1 shows our primary results along with the baseline model trained on the standard cross-entropy loss for the task of guessing a new object in the test dataset. We compare our model with the one presented by  and other concurrent approaches. Table 1 also compares our model with others when objects are sampled using a uniform distribution (right column).
1 In order to have a fair comparison, the results reported for TPG (Zhao and Tresp, 2018) and (Abbasnejad et al., 2018) only take into consideration the performance of the question generator. We do not report the scores generated after adding a memory network to the guesser.

Ablation Study
We performed an ablation analysis of RIG in order to identify its main learning components. The results of the experiments with the reward function based on RIG are presented in Table 2, whereas Table 3 compares the different components of RIG when used in the loss function. The results listed under New Images refer to images in the test set, while the results listed under New Objects refer to images from the training set with undisclosed objects different from those used at training time. For the first set of experiments, we compared the performance of information gain alone against RIG with the skewness coefficient for goal-oriented visual question generation. RIG achieves an absolute improvement of 10.57% over information gain when used as a reward function and a maximum absolute improvement of 2.8% when it is optimized in the loss function. Adding the skewness term results in a maximum absolute improvement of 0.9% in the first case and of 2.3% in the second. Furthermore, we compared the performance of the model when trained using RIG but without policy gradients. In that setting, the model achieves an improvement of 10.35% when information gain is used in the loss function.

Qualitative Analysis
In order to further analyze the performance of our model, we assess it in terms of repetitive questions, since they compromise the framework's efficiency. We compare our model with the one proposed by  and calculate the average number of repeated questions generated per dialogue. The model by Strub et al. scored 0.82, whereas ours scored 0.36 when RIG is optimized in the loss function and 0.27 when RIG is used as a reward function.

Discussion
Our model was able to achieve an accuracy of 67.19% on the task of asking goal-oriented questions on the GuessWhat?! dataset. This result is the highest obtained so far among existing approaches to this problem, albeit still far from human-level performance on the same task, reportedly 84.4%. Our gains can be explained in part by how RIG with the skewness component for goal-oriented VQG constrains the process of generating relevant questions and, at the same time, allows the agent to reduce the search space significantly, similarly to decision trees and reinforcement learning, but in a very challenging scenario, since the search space in generative models can be significantly large.
Our qualitative results also demonstrate that our approach is able to display a certain level of strategic behavior and mutual consistency between questions in this scenario, as shown in Figure 3. The same cannot be said about previous approaches, as the majority of them fail to avoid redundant or otherwise expendable questions. We argue that our cost function and the skewness coefficient both play an important role here, as the former penalizes synonymic questions and the latter narrows down the set of optimal questions. Our ablation analysis showed that information gain alone is not the determinant factor that leads to improved learning, as hypothesized by Lee et al. (Lee et al., 2018). However, Regularized Information Gain does have a significant effect, which indicates that a set of constraints, especially regarding the cost of asking a question, cannot be taken lightly in the context of goal-oriented VQG.

Conclusion
In this paper, we propose a model for goal-oriented visual question generation using two different approaches that leverage information gain with reinforcement learning. Our algorithm achieves improved accuracy and qualitative results in comparison to the existing state-of-the-art models on the GuessWhat?! dataset. We also discuss the innovative aspects of our model and how performance could be further increased. Our results indicate that RIG is a more promising approach for building better-performing agents capable of displaying strategy and coherence in an end-to-end architecture for visual dialogue.