Multimodal Hierarchical Reinforcement Learning Policy for Task-Oriented Visual Dialog

Creating an intelligent conversational system that understands vision and language is one of the ultimate goals in Artificial Intelligence (AI) (Winograd, 1972). Extensive research has focused on vision-to-language generation; however, limited research has touched on combining these two modalities in a goal-driven dialog context. We propose a multimodal hierarchical reinforcement learning framework that dynamically integrates vision and language for task-oriented visual dialog. The framework jointly learns the multimodal dialog state representation and the hierarchical dialog policy to improve both dialog task success and efficiency. We also propose a new technique, state adaptation, to integrate context awareness in the dialog state representation. We evaluate the proposed framework and the state adaptation technique in an image guessing game and achieve promising results.


Introduction
The interplay between vision and language has created a range of interesting applications, including image captioning (Karpathy and Fei-Fei, 2015), visual question generation (VQG) (Mostafazadeh et al., 2016), visual question answering (VQA) (Antol et al., 2015), and reference expressions (Hu et al., 2016). Visual dialog (Das et al., 2017b) extends the VQA problem to multi-turn visually grounded conversations without specific goals. In this paper, we study the task-oriented visual dialog setting, which requires the agent to learn the multimodal representation and dialog policy for decision making. We argue that a task-oriented visual intelligent conversational system should not only acquire vision and language understanding but also make appropriate decisions efficiently in a situated environment. Specifically, we designed a 20-image guessing game using the Visual Dialog dataset (Das et al., 2017a). This game is the visual analog of the popular 20 Questions game. The agent aims to learn a dialog policy that can guess the correct image through question answering using the minimum number of turns.
Previous work on visual dialog (Das et al., 2017a,b; Chattopadhyay et al., 2017) focused mainly on vision-to-language understanding and generation instead of dialog policy learning. These approaches let an agent ask a fixed number of questions to rank the images or let humans make guesses at the end of the conversations. However, such a setting is not realistic for real-world task-oriented applications, where completing the task efficiently matters as much as completing it successfully. In addition, the agent should be informed of its wrong guesses, so that it becomes more aware of the vision context. Solving such a real-world setting is challenging: the system needs to handle a large, dynamically updated multimodal state-action space and also leverage the signals in the feedback loop coming from different sub-tasks.
We propose a multimodal hierarchical reinforcement learning framework that allows learning visual dialog state tracking and dialog policy jointly to complete visual dialog tasks efficiently. The framework takes inspiration from feudal reinforcement learning (FRL) (Dayan and Hinton, 1993), where levels of hierarchy within an agent communicate via explicit goals in a top-down fashion. In our case, it decomposes the decision into two steps: a first step where a master policy selects between the verbal task (information query) and the vision task (image retrieval), and a second step where a primitive action (question or image) is chosen from the selected task. Hierarchical RL that relies on space abstraction, such as FRL, is useful for addressing the challenge of a large discrete action space and has been shown to be effective in dialog systems, especially for large-domain dialog management (Casanueva et al., 2018). In addition, we propose a new technique called state adaptation to make the multimodal dialog state more aware of the constantly changing visual context. We demonstrate the efficacy of this technique through ablation analysis.
Related Work

Visual Dialog
Visual dialog requires the agent to hold a multi-turn conversation about visual content. Several visual dialog tasks have been developed, including image-grounded conversation generation (Mostafazadeh et al., 2017). Guess What?! (De Vries et al., 2017) involves locating visual objects using dialogs. VisDial (Das et al., 2017a) situates an answer-bot (A-Bot) to answer questions from a question-bot (Q-Bot) about an image. Das et al. (2017b) applied reinforcement learning (RL) to the VisDial task to learn the policies for the Q/A-Bots to collaboratively rank the correct image among a set of candidates. However, their Q-Bot can only ask questions and cannot make guesses. Chattopadhyay et al. (2017) further evaluated the pre-trained A-Bot in a similar setting to answer human-generated questions. Since humans are tasked with asking the questions, the policy learning of the Q-Bot is not investigated. Finally, Manuvinakurike et al. (2017) proposed an incremental dialogue policy learning method for image guessing. However, their dialog state only used language information and did not include visual information. We build upon prior work and propose a framework that learns an optimal dialog policy for the Q-Bot to perform both question selection and image guessing by exploiting multimodal information.

Reinforcement Learning
RL is a popular approach to learning an optimal dialog policy for task-oriented dialog systems (Singh et al., 2002; Williams and Young, 2007; Georgila and Traum, 2011; Lee and Eskenazi, 2012; Yu et al., 2017). The deep Q-Network (DQN) introduced by Mnih et al. (2015) achieved human-level performance in Atari games based on deep neural networks. Deep RL was then used to jointly learn the dialog state tracking and policy optimization in an end-to-end manner (Zhao and Eskenazi, 2016). In our framework, we use a DQN to learn the higher-level policy that chooses between question selection and image guessing. Van Hasselt et al. (2016) proposed the double DQN to overcome the overestimation problem in Q-learning, and Schaul et al. (2015) suggested prioritized experience replay to improve the data sampling efficiency for training DQNs. We apply both techniques in our implementation. One limitation of DQNs is that they cannot handle an unbounded action space, which is often the case for natural language interaction. He et al. (2015) proposed the Deep Reinforcement Relevance Network (DRRN), which can handle inherently large discrete natural language action spaces. Specifically, the DRRN takes both the state and the natural language actions as inputs and computes a Q-value for each state-action pair. Thus, we use a DRRN as our question selection policy to approximate the value function for any question candidate.
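To make the double DQN idea mentioned above concrete, the following minimal sketch computes a single bootstrapped target: the online network picks the next action and the target network evaluates it, which is what reduces the overestimation of plain Q-learning. The function name, the list-based Q-value inputs and the discount factor are assumptions made for this illustration, not details of the original implementation.

```python
def double_dqn_target(reward, next_q_online, next_q_target, done, gamma=0.99):
    """Return the double DQN regression target for one transition.

    next_q_online / next_q_target: Q-values of every action in the next
    state, as estimated by the online and the target network (hypothetical
    plain lists here; a real agent would obtain them from two networks).
    """
    if done:
        # No bootstrapping beyond a terminal state.
        return reward
    # Action selection uses the online network ...
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    # ... while action evaluation uses the target network.
    return reward + gamma * next_q_target[best_action]
```

Decoupling selection from evaluation means that an action whose value is overestimated by one network is not automatically scored with the same inflated estimate.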
Our work is also related to hierarchical reinforcement learning (HRL), which often decomposes the problem into several sub-problems and achieves a better learning convergence rate and generalization compared to flat RL (Sutton et al., 1999; Dietterich, 2000). HRL has been applied to dialog management (Lemon et al., 2006; Cuayáhuitl et al., 2010; Budzianowski et al., 2017), where it decomposes the dialog policy with respect to system goals or domains. When the system enters a sub-task, the selected dialog policy is used and continues to operate until the sub-problem is solved; however, the termination condition for a sub-problem has to be predefined. Different from prior work, our proposed architecture uses a hierarchical dialog policy to combine two RL architectures, i.e., DQN and DRRN, within a control flow in order to jointly learn the multimodal dialog state representation and the dialog policy. Note that our HRL framework resembles the FRL hierarchy (Dayan and Hinton, 1993), which exploits space abstraction, state sharing and sequential execution.

Proposed Framework
Figure 2 shows an overview of the multimodal hierarchical reinforcement learning framework and the simulated environment. There are four main modules in the framework. The visual dialog semantic embedding module learns a multimodal dialog state representation to support the visual

Visual Dialog Semantic Embedding
This module learns the multimodal representation for the downstream visual dialog state tracking. Figure 3 shows the network architecture for pretraining the visual dialog semantic embedding. A VGG-19 CNN (Simonyan and Zisserman, 2014) and a multilayer perceptron (MLP) with L2 normalization are used to encode visual information (images) as a vector $I \in \mathbb{R}^k$. We use a dialog-conditioned attentive encoder (Lu et al., 2017) to encode textual information as a vector $T \in \mathbb{R}^k$, where $k$ is the joint embedding size. The image caption ($c$) is encoded with an LSTM to get a vector $m^c$, and each QA pair $(H_0, ..., H_t)$ is encoded separately with another LSTM as $M^h_t \in \mathbb{R}^{d \times t}$, where $t$ is the turn index and $d$ is the LSTM embedding size. Conditioned on the image caption embedding, the model attends to the dialog history, where $\mathbf{1}$ is a vector with all elements set to 1, $W_h, W_c \in \mathbb{R}^{t \times d}$ and $w_a \in \mathbb{R}^k$ are parameters to be learned, and $\alpha \in \mathbb{R}^k$ is the attention weight over the history. The attended history feature $m^h_t$ is the weighted sum of the columns of $M^h_t$ with weights $\alpha^h_t$. Then $m^h_t$ is concatenated with $m^c$ and encoded via an MLP with L2 normalization to get the final textual embedding ($T$).
We train the network with a pairwise ranking loss (Kiros et al., 2014) on cosine similarities between the textual and visual embeddings. The pretraining step allows the module to generalize better and improves convergence in the RL training. Given the QA pairs from the simulated environment, the output of this module can also be used for the image retrieval sub-task. To verify the quality of this module, we perform a sanity check on an image retrieval task, similar to Das et al. (2017b). We used the output of the module to rank the 20 images in the game setting. Among 1000 games, we achieved 96.8% recall@1 (the target image ranked the highest), which means that this embedding module can provide a reliable reward signal in an image retrieval task for the RL training if given the relevant dialog history.
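The caption-conditioned attention over the dialog history can be sketched as follows. This is only one plausible instantiation in the spirit of a dialog-conditioned attentive encoder: the additive tanh scoring, the separate hidden size of the projection matrices and all function and argument names are assumptions made for illustration, and the history is stored as a list of per-turn vectors rather than as matrix columns.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def attend_history(history, caption, W_h, W_c, w_a):
    """Attend over QA-pair encodings, conditioned on the caption encoding.

    history: list of t vectors of size d (one per QA pair)
    caption: vector of size d
    W_h, W_c: n x d projection matrices, w_a: vector of size n
    (n is an assumed hidden size, not a value from the paper)
    """
    caption_proj = matvec(W_c, caption)
    scores = []
    for turn_vec in history:
        hidden = [math.tanh(a + b)
                  for a, b in zip(matvec(W_h, turn_vec), caption_proj)]
        scores.append(sum(w * h for w, h in zip(w_a, hidden)))
    alpha = softmax(scores)  # one attention weight per dialog turn
    dim = len(caption)
    attended = [sum(alpha[j] * history[j][i] for j in range(len(history)))
                for i in range(dim)]
    return alpha, attended
```

With all-zero parameters the weights are uniform and the attended feature reduces to the plain average of the history vectors, which is a convenient sanity check.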

Visual Dialog State Tracking
This module utilizes the output from the visual dialog semantic embedding to formulate the final dialog state representation. We track three types of state information: the dialog meta information (META), the vision belief (VB) and the vision context (VC). The dialog meta information includes the number of questions asked, the number of images guessed and the last action. The vision belief state is the output of the visual dialog semantic embedding module, which captures the internal multimodal information of the agent. We initialize the VB with only the encoding of the image caption and update it with each new incoming QA pair. The vision context state represents the visual information of the environment. In order to make the agent more aware of the dynamic visual context and of which images to attend to more, we introduce a new technique called state adaptation, which updates the vision context state with attention scores. The VC is initialized as the average of the image vectors and then updated at each turn, where r, t and i refer to the episode, dialog turn and image index. The VC is adjusted based on the attention scores (see Equation 4). The attention scores, calculated by dot product in Equation 3, represent the affinity between the current vision belief state and each image vector. In the case of wrong guesses (informed by the simulator), we set the attention score for that wrong image to zero. This method is inspired by Tian et al. (2017), who explicitly weight context vectors by context-query relevance for encoding dialog context. The question selection sub-task takes the vision context state as input, and the vision belief state is used in the image retrieval sub-task.
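A minimal sketch of state adaptation, under stated assumptions, is given below. The dot-product scores and the zeroing of wrongly guessed images follow the description above; the way the scores are folded back into the vision context (a score-weighted average over all images) is an assumption chosen so that uniform scores of 1 recover the initial average, and the function names are hypothetical.

```python
def initial_vision_context(image_vectors):
    """Initial VC: the plain average of the candidate image vectors."""
    n = len(image_vectors)
    dim = len(image_vectors[0])
    return [sum(vec[j] for vec in image_vectors) / n for j in range(dim)]


def attention_scores(vision_belief, image_vectors, wrong_guesses=frozenset()):
    """Dot-product affinity between the vision belief and each image.

    Images already guessed wrongly (as reported by the simulator) get a
    score of zero so that they no longer contribute to the context.
    """
    scores = []
    for idx, image in enumerate(image_vectors):
        affinity = sum(b * x for b, x in zip(vision_belief, image))
        scores.append(0.0 if idx in wrong_guesses else affinity)
    return scores


def adapt_vision_context(image_vectors, scores):
    """Assumed update rule: score-weighted average of the image vectors."""
    n = len(image_vectors)
    dim = len(image_vectors[0])
    return [sum(scores[i] * image_vectors[i][j] for i in range(n)) / n
            for j in range(dim)]
```

For example, with a belief vector aligned with the first of two orthogonal images, the adapted context keeps only that image's direction, and it collapses to zero once that image has been ruled out by a wrong guess.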

Hierarchical Policy Learning
The goal is to learn a dialog policy that makes decisions based on the current visual dialog state, i.e., asking a question about the image or making a guess about the image that the user is thinking of. As the agent is situated in a dynamically changing vision context and updates its internal decision-making model (approximated by the belief state) with each new dialog exchange, we treat the environment as a Partially Observable Markov Decision Process (POMDP) and solve it using deep reinforcement learning. We now describe the key components.
Dialog State: The dialog state comes from the visual dialog state tracking module, as described in Section 3.2.
Policy Learning: Given the above dialog state, we introduce a hierarchical dialog policy that contains a high-level control policy and a low-level question selection policy. We learn the control policy with a Double DQN that decides between "question" and "guess" at each game step.
If the high-level action is "question", then control is passed to the low-level policy, which needs to select a question. One challenge is that the list of candidate questions differs from game to game, and the number of candidate questions also differs across images. This prevents us from using a standard DQN with a fixed number of actions. He et al. (2015) showed that modeling the state embedding and the action embedding separately in a DRRN performs better than a per-action DQN as well as other DQN variants when dealing with natural language action spaces. Therefore, we use the DRRN to solve this problem; it computes a matching score between the shared current vision context state and the embedding of each question candidate. We use a softmax selection strategy as the exploration policy during the learning stage. The hierarchical policy learning algorithm is described in Algorithm 1 in the Appendix.
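The question selection step can be illustrated with the sketch below, which scores a variable-length list of question candidates against the state and samples one with a softmax exploration policy. It assumes that the state and question embeddings have already been produced by their respective networks and uses a plain inner product as the matching score; the function names and the temperature parameter are hypothetical.

```python
import math
import random


def drrn_q_values(state_embedding, question_embeddings):
    """Matching score (inner product) between the state and each candidate."""
    return [sum(s * q for s, q in zip(state_embedding, question))
            for question in question_embeddings]


def softmax_select(q_values, temperature=1.0, rng=random):
    """Sample a candidate index with probability proportional to exp(Q / T)."""
    peak = max(q_values)
    weights = [math.exp((q - peak) / temperature) for q in q_values]
    return rng.choices(range(len(q_values)), weights=weights, k=1)[0]


def greedy_select(q_values):
    """Pick the highest-scoring candidate, as done at evaluation time."""
    return max(range(len(q_values)), key=lambda i: q_values[i])
```

Because the scores are computed per candidate, the same policy handles games with different numbers of candidate questions, which a fixed-output DQN cannot do directly.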
If the high-level action is "guess", then an image is retrieved using the cosine distance between each image vector and the vision belief vector. It is worth mentioning that although the action space of the image retrieval sub-task can be incorporated into a flat DRRN combined with text-based inputs, the training is unstable and does not converge within this flat RL framework. We suspect this is due to the sample efficiency problem with a large multimodal action space, in which question actions and guess actions typically result in different reward signals. Therefore, we did not compare our proposed method against a flat RL model.
Rewards: The reward function is decomposed as $R = R_G + R_Q + R_I$, where $R_G$ is the final game reward (win/loss = ±10) and $R_I$ is the wrong-guess penalty (−3). We define $R_Q$ as the pseudo reward for the question selection sub-task as $R_Q = A_t - A_{t-1}$, where $t$ refers to the dialog turn and the affinity scores ($A_t$ and $A_{t-1}$) are the outputs of the sigmoid function that scales the similarity score of the vision belief state and the target image vector to the range 0-1. The intuition is that different questions provide different information gains for the agent. The integration of $R_Q$ is a reward shaping technique (Ng et al., 1999) that aims to provide immediate rewards to make the RL training more efficient. At each turn, if the verbal task (question selection) is chosen, $R_Q$ serves as the immediate reward for training the DQN and the DRRN, while if the vision task (image retrieval) is chosen, only $R_I$ is available for training the DQN. At the end of a game, the reward function varies based on the primitive action and the final game result.
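As an illustration of how these reward terms could be combined at each turn, consider the following sketch. The constants come from the text (±10 for the game result, −3 per wrong guess); the shaping term is taken to be the change in affinity between consecutive turns, and the way terms are summed on the final turn of a lost game is an assumption, as are all function and argument names.

```python
import math

WIN_REWARD = 10.0
LOSS_REWARD = -10.0
WRONG_GUESS_PENALTY = -3.0


def affinity(vision_belief, target_image):
    """Sigmoid-scaled similarity between the belief and the target image."""
    similarity = sum(b * x for b, x in zip(vision_belief, target_image))
    return 1.0 / (1.0 + math.exp(-similarity))


def turn_reward(action, affinity_prev=0.0, affinity_now=0.0,
                guess_correct=False, game_lost=False):
    """Reward for one step of the game.

    action: "question" or "guess"
    game_lost: True if this step ended the game without a correct guess
    (assumption: the loss reward is added on top of the step's own term).
    """
    reward = 0.0
    if action == "question":
        # Shaping term: information gained about the target image.
        reward += affinity_now - affinity_prev
    elif action == "guess":
        reward += WIN_REWARD if guess_correct else WRONG_GUESS_PENALTY
    if game_lost:
        reward += LOSS_REWARD
    return reward
```

A question that moves the belief closer to the target therefore earns a small positive reward immediately, instead of the agent having to wait for the sparse end-of-game signal.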

Question Selection
The question selection module selects the best question in order to acquire relevant information to update the vision belief state. As discussed in Section 3.3, we use a discriminative approach to select the next question for the agent by learning the policy in a DRRN. It leverages an existing question candidate pool, which is constructed differently for the different experimental settings in Section 4.4. Ideally, we would like to generate realistic questions online towards a specific goal (Zhang et al., 2017); we leave this generative approach for future study.

Experiments
We first describe the simulation of the environment. Then, we describe the different dialog policy models and implementation details. Finally, we discuss three different experimental settings to evaluate the proposed framework.

Simulator Construction
We constructed a simulator for the 20-image guessing game using the VisDial dataset. Each image corresponds to a dialog consisting of ten rounds of question answering generated by humans. To make the task setting meaningful and the training time manageable, we pre-process the data and select 1000 sets of games, each consisting of 20 similar images. The simulator provides the reward signals and the answers related to the target image. It also tracks the internal game state. A game is terminated when one of three conditions is fulfilled: 1) the agent guesses the correct image, 2) the maximum number of guesses is reached (three guesses) or 3) the maximum number of dialog turns is reached. The agent wins the game when it guesses the correct image. If the agent wins the game, it gets a reward of 10, and if the agent loses the game, it gets a reward of −10. The agent also receives a −3 penalty for each wrong guess.
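The termination logic of such a simulator can be sketched as a small state tracker. The class and method names are hypothetical, the default turn limit of ten is taken from Experiment 1, and the choice to count both questions and guesses as dialog turns is an assumption; answers and rewards are omitted to keep the example short.

```python
class GuessingGame:
    """Tracks the game state for one image guessing episode."""

    def __init__(self, target_image, max_guesses=3, max_turns=10):
        self.target_image = target_image
        self.max_guesses = max_guesses
        self.max_turns = max_turns
        self.turns = 0
        self.wrong_guesses = 0
        self.won = False

    @property
    def done(self):
        # The three termination conditions: correct guess, too many
        # wrong guesses, or the turn limit being reached.
        return (self.won
                or self.wrong_guesses >= self.max_guesses
                or self.turns >= self.max_turns)

    def ask(self):
        """Consume one turn for a question (answer generation omitted)."""
        if self.done:
            raise RuntimeError("the game is already over")
        self.turns += 1

    def guess(self, image_id):
        """Consume one turn for a guess and report whether it was correct."""
        if self.done:
            raise RuntimeError("the game is already over")
        self.turns += 1
        if image_id == self.target_image:
            self.won = True
        else:
            self.wrong_guesses += 1
        return self.won
```

Reporting the outcome of each guess back to the agent is what allows wrong guesses to be used as feedback during the dialog.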

Policy Models
To evaluate the contribution of each technique in the multimodal hierarchical framework (the hierarchical policy, the state adaptation, and the reward shaping), we evaluate five different policy models and perform ablation analysis. We describe each model as follows:
-Random Policy (Rnd): The agent randomly selects a question or makes a guess at any step.
-Random Question+DQN (Rnd+DQN): The agent randomly selects a question, but a DQN is used to optimize the hierarchical decision of making a guess or asking a question.
-DRRN+DQN (HRL): Similar to Rnd+DQN, except that a DRRN is used to optimize the question selection process.
-DRRN+DQN+State Adaptation (HRL+SA): Similar to HRL, except that it incorporates state adaptation, which applies an attention re-weighting concept to the vision context state.
-DRRN+DQN+State Adaptation+Reward Shaping (HRL+SAR): Similar to HRL+SA, except that it additionally incorporates reward shaping.

Implementation Details
The details about data pre-processing and training hyper-parameters are described in the Appendix.
During training, the DQN uses an ε-greedy policy and the DRRN uses a softmax policy for exploration, where ε is linearly decreased from 1 to 0.1. The resulting framework was trained for up to 20,000 iterations for Experiment 1 and 95,000 iterations for Experiments 2 and 3, and evaluated every 1000 iterations. At each evaluation, we record the performance of the different models with a greedy policy over 100 independent games. The evaluation metrics are the win rate and the average number of dialog turns.
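The exploration schedule for the control policy can be sketched as below. The start and end values of ε follow the text; the number of steps over which ε is annealed is not stated there, so `decay_steps` is left as a caller-supplied assumption, and both function names are hypothetical.

```python
import random


def epsilon_at(step, decay_steps, start=1.0, end=0.1):
    """Linearly anneal epsilon from `start` to `end` over `decay_steps`."""
    if decay_steps <= 0:
        return end
    fraction = min(max(step / decay_steps, 0.0), 1.0)
    return start + fraction * (end - start)


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

Setting `epsilon` to zero recovers the greedy policy used at evaluation time.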

Experimental Setting
We conduct three sets of experiments to explore, step by step, the effectiveness of the proposed multimodal hierarchical reinforcement learning framework in increasingly realistic scenarios. The first experiment constrains the agent to select among the 10 human-generated question-answer pairs. This setting enables us to assess the effectiveness of the framework in a less error-prone setting. The second experiment does not require a human to generate the answers, in order to emulate a more realistic environment. Specifically, we enlarge the number of questions by including 200 human-generated questions for the 20 images, and use a pre-trained visual question answering model to generate answers with respect to the target image. In the last experiment, we further automate the process by generating questions for the 20 images using a pre-trained visual question generation model. Thus, the agent does not require any human input with respect to any image for training.

Results
We evaluate the models described in Section 4.2 under the settings described in Section 4.4 and report the results as follows.

Experiment 1: Human Generated Question-Answer Pairs
The agent selects the next question among the 10 human-generated question-answer pairs and aims to identify the target image accurately and efficiently through natural language conversation. We terminate the dialog after ten turns. Each model's performance is shown in Table 1.
HRL+SAR achieves the best win rate with statistical significance. The HRL+SAR policy model performs much better than methods without the hierarchical control structure and state adaptation. The learning curves in Figures 4 and 5 reveal that HRL+SAR also converges faster. We further perform bootstrap tests by resampling the game results from each experiment with replacement 1,000 times. We then compute the significance level of the difference in average win rates or average turn lengths to check whether the relative performance improvement over the previous baseline is statistically significant. The results show that question selection (DRRN) and state adaptation bring the most significant performance improvements (p < 0.01), while reward shaping has less impact (p < 0.05). We also observe that the average number of turns with hierarchical policy learning (HRL) is slightly higher than that of Rnd+DQN, although the difference is less statistically significant. This is probably because this setting provides only the 10 predefined question-answer pairs and thus a smaller action space: the DQN model tends to encourage the agent to make guesses more quickly, while policy models with hierarchical structures tend to optimize the overall task completion rate.
We find that the RL methods (DQN and DRRN) significantly improve the win rate, as they learn to select the optimal list of questions to ask. We also observe that our proposed state adaptation method for the vision context state yields the largest performance improvement. The hierarchical control architecture and the state abstraction sharing (Dietterich, 2000) also improve both learning speed and agent performance. This aligns with the observation in Budzianowski et al. (2017).
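A bootstrap comparison of two win-rate samples, of the kind described above, might look like the following sketch. The one-sided formulation (the fraction of resamples in which the candidate model fails to beat the baseline), the fixed seed and the function name are assumptions made for illustration; each input is a list of per-game outcomes with 1 for a win and 0 for a loss.

```python
import random


def bootstrap_p_value(wins_model, wins_baseline, n_resamples=1000, seed=0):
    """Estimate a one-sided p-value for 'model beats baseline' on win rate.

    Both result lists are resampled with replacement; the returned value is
    the fraction of resamples in which the model's mean win rate is not
    strictly higher than the baseline's.
    """
    rng = random.Random(seed)
    failures = 0
    for _ in range(n_resamples):
        sample_model = rng.choices(wins_model, k=len(wins_model))
        sample_base = rng.choices(wins_baseline, k=len(wins_baseline))
        mean_model = sum(sample_model) / len(sample_model)
        mean_base = sum(sample_base) / len(sample_base)
        if mean_model <= mean_base:
            failures += 1
    return failures / n_resamples
```

The same procedure applies to the average number of turns by passing per-game turn counts and flipping the direction of the comparison.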
Moreover, we observe that after seven turns on average, the agent was able to select the target image with a sufficiently high success rate. We further explore whether the proposed hierarchical framework enables efficient decision-making compared to an agent that keeps asking questions and only makes a guess at the end of the dialog. We refer to such models as the oracle baselines. For example, Oracle@7 makes the guess at the 7th turn based on the previous dialog history, with the question-answer pairs in the order given in the dataset. The oracle baselines are strong, since they represent the best performance the model can achieve given the optimal question order provided by humans. Table 2 shows the performance of the oracle baselines with various fixed numbers of turns. We performed significance tests between each oracle baseline and the hierarchical framework. Since our hierarchical framework requires 7.22 turns on average to complete a game, we compared it with Oracle@7 and Oracle@8. We found that the proposed method outperforms Oracle@7 (p-value < 0.01) and performs similarly to Oracle@8 (no significant difference, p-value > 0.1). The reason the hierarchical framework can outperform Oracle@7 is that it learns to make a guess whenever the agent is confident enough, and therefore achieves a better win rate. Oracle@8 in general receives more information, as its dialogs are longer, and therefore has an advantage over the hierarchical method. However, it still performs similarly to the proposed method, which demonstrates that learning the hierarchical decision enables the agent to achieve the goal more efficiently. One thing we need to point out is that the proposed method also receives extra information from the environment about whether each guess is correct. The oracle baselines do not have such information, as they can only make a guess at the end of the dialog. Oracle@9 and Oracle@10 are statistically better than the hierarchical framework, because they acquire much more information by having longer dialogs.

Experiment 2: Questions Generated by Human and Answers Generated Automatically
To make the experimental setting more realistic, we select 200 human-generated questions with respect to the 20 images provided and create a user simulator that generates the answers related to the target image. Here, as the question space is larger, we terminate the dialog after 20 turns. We follow the supervised training scheme discussed in Das et al. (2017b).
Results in Table 3 indicate that HRL+SAR significantly outperforms Rnd and Rnd+DQN in both win rate and average number of dialog turns. The setting in Experiment 2 is more challenging than that of Experiment 1, because the visual question answering module introduces noise that can influence the policy learning. However, the noise also simulates the real-world scenario in which a user might have an implicit goal that may change within the task. A user can also accidentally make errors in answering the questions. The proposed hierarchical framework (HRL+SAR) with state adaptation and reward shaping achieves the best win rate and the lowest number of dialog turns in this noisy experimental setting. In contrast to Experiment 1, the policy models with hierarchical structures can optimize both the overall task completion rate and the number of dialog turns. We do not report oracle baseline results, since the oracle order of all the questions (ideally generated by humans) was not available.

Experiment 3: Question-Answer Pairs Generated Automatically
In this setting, both questions and answers are generated automatically through pre-trained visual question and answer generation models (Das et al., 2017b). Such a setting enables the agent to play the guessing game given any image, as no human input about the image is needed. Note that the answers should be generated with respect to a target image for our task setting. In this setting, we also set the maximum number of dialog turns to 20. The results in Table 4 show that the performance of the three policies dropped significantly compared to Experiment 2. This is expected, as the noise coming from both the visual question generation and the answer generation modules increases the task difficulty. However, the proposed HRL+SAR is still more resilient to the noise and achieves a higher win rate and a lower average number of turns than the other baselines. Figure 5 from the Appendix shows that in Experiment 2 the agent tends to select relevant questions to ask faster, although the answers can be misleading. On the other hand, in Experiment 3, the agent reacts to the generated questions and answers more slowly to complete the task. The model performance decreases when we increase the task difficulty in order to emulate real-world scenarios. This hints at a possible limitation of using the VisDial dataset, because the dialogs are constructed by users who casually talk about MS COCO images (Chen et al., 2015) instead of conversing with an explicit contextual goal.

Discussion and Future Work
We develop a framework for task-oriented visual dialog systems and demonstrate the efficacy of integrating multimodal state representation with hierarchical decision learning in an image guessing game. We also introduce a new technique called state adaptation to further improve the task performance through integrating context awareness. Finally, we test the proposed framework in various noisy settings to simulate real-world scenarios and achieve robust results.
The proposed framework is practical and extensible for real-world applications. For example, the designed system can act as a fashion shopping assistant that helps customers pick clothes by strategically inquiring about their preferences while leveraging vision intelligence. In another application, such as criminology practice, the agent can communicate with witnesses to identify suspects from a large face database.
Although games provide a rich domain for multimodal learning research, it is admittedly challenging to evaluate a multimodal dialog system due to the data scarcity problem. In future work, we would like to extend and apply the proposed framework for human studies in a situated real-world application, such as a shopping scenario. We also plan to incorporate domain knowledge and database interactions into the system framework design, which will make the dialog system more flexible and effective. Another possible extension of the framework is to replace the offline question and answer generation modules with an online generative version and retrain the modules with reinforcement learning.

Figure 1: The information flow of the multimodal hierarchical reinforcement learning framework

Figure 2: Pretraining scheme of the visual dialog semantic embedding module

Figure 3: Learning curves of win rates for five different dialog policies in Experiment 1

Figure 4: Learning curves of final rewards for five different dialog policies in Experiment 1

Table 2: Oracle baseline performance

Table 3: Model performance in Experiment 2

Table 4: Model performance in Experiment 3