Help, Anna! Visual Navigation with Natural Multimodal Assistance via Retrospective Curiosity-Encouraging Imitation Learning

Mobile agents that can leverage help from humans can potentially accomplish more complex tasks than they could entirely on their own. We develop “Help, Anna!” (HANNA), an interactive photo-realistic simulator in which an agent fulfills object-finding tasks by requesting and interpreting natural language-and-vision assistance. An agent solving tasks in a HANNA environment can leverage simulated human assistants, called ANNA (Automatic Natural Navigation Assistants), which, upon request, provide natural language and visual instructions to direct the agent towards the goals. To address the HANNA problem, we develop a memory-augmented neural agent that hierarchically models multiple levels of decision-making, and an imitation learning algorithm that teaches the agent to avoid repeating past mistakes while simultaneously predicting its own chances of making future progress. Empirically, our approach is able to ask for help more effectively than competitive baselines and thus attains a higher task success rate in both previously seen and previously unseen environments.


Introduction
The richness and generalizability of natural language makes it an effective medium for directing mobile agents in navigation tasks, even in environments they have never encountered before (Anderson et al., 2018b; Chen et al., 2019; Misra et al., 2018; de Vries et al., 2018; Qi et al., 2019). Nevertheless, even with language-based instructions, such tasks can be overly difficult for agents on their own, especially in unknown environments. To accomplish tasks that surpass their knowledge and skill levels, agents must be able to actively seek and leverage assistance in the environment. Humans are rich external knowledge sources but, unfortunately, may not be available to provide guidance at all times, or may be unwilling to help too frequently. To reduce the effort required of human assistants, it is essential to design research platforms for teaching agents to request help mindfully.
In natural settings, human assistance is often: derived from interpersonal interaction (a lost tourist asks a local for directions); reactive to the situation of the receiver, based on the assistant's knowledge (the local may guide the tourist to the goal, or may redirect them to a different source of assistance); delivered via a multimodal communication channel (the local uses a combination of language, images, maps, gestures, etc.). We introduce the "Help, Anna!" (HANNA) problem ( § 3), in which a mobile agent has to navigate (without a map) to an object by interpreting its first-person visual perception and requesting help from Automatic Natural Navigation Assistants (ANNA). HANNA models a setting in which a human is not always available to help, but rather that human assistants are scattered throughout the environment and provide help upon request (modeling the interpersonal aspect). The assistants are not omniscient: they are only familiar with certain regions of the environment and, upon request, provide subtasks, expressed in language and images (modeling the multimodal aspect), for getting closer to the goal, not necessarily for fully completing the task (modeling the reactive aspect).
In HANNA, when the agent gets lost and becomes unable to make progress, it has the option of requesting assistance from ANNA. At test time, the agent must decide where to go and whether to request help from ANNA without additional supervision. At training time, we leverage imitation learning to learn an effective agent, both in terms of navigation and in terms of deciding when it is most worthwhile to request assistance.

Figure 1: An example HANNA task. Initially, the agent stands in the bedroom at A and is asked by a human requester to "find a mug." The agent sets off but gets lost somewhere in the bathroom. It reaches the start location of a route (B) and requests help from ANNA. Upon request, ANNA assigns the agent a navigation subtask, described by a natural language instruction that guides the agent to a target location and an image of the view at that location. The agent follows the language instruction and arrives at C, where it observes a match between the target image and its current view, and thus decides to depart the route. After that, it resumes the main task of finding a mug. From this point, the agent gets lost once more and has to query ANNA for another subtask, which helps it follow a second route and enter the kitchen. The agent successfully fulfills the task: it finally stops within the success-threshold distance of an instance of the requested object. Here, the ANNA feedback is simulated using two pre-collected language-assisted routes.

This paper has two primary contributions: 1. Constructing the HANNA simulator by augmenting an indoor photo-realistic simulator with simulated human assistance, mimicking a scenario where a mobile agent finds objects by asking for directions along the way (§3). 2. An effective model and training algorithm for the HANNA problem, which includes a hierarchical memory-augmented recurrent architecture that models human assistance as sub-goals (§5), and an imitation learning objective that enhances exploration of the environment and the interpretability of the agent's help-request decisions (§4).
We embed the HANNA problem in the photo-realistic Matterport3D environments (Chang et al., 2017) with no extra annotation cost by reusing the pre-existing Room-to-Room dataset (Anderson et al., 2018b). Empirical results (§7) show that our agent can effectively learn to request and interpret language and vision instructions, given a training set of 51 environments and fewer than 9,000 language instructions. Even in new environments, where the scenes and the language instructions are previously unseen, the agent successfully accomplishes 47% of its tasks. Our methods for training the navigation and help-request policies outperform competitive baselines by large margins.

Related work
Simulated environments provide an inexpensive platform for fast prototyping and evaluating new ideas before deploying them into the real world. Video-game and physics simulators are standard benchmarks in reinforcement learning (Todorov et al., 2012; Mnih et al., 2013; Kempka et al., 2016; Brockman et al., 2016; Vinyals et al., 2017). Nevertheless, these environments under-represent the complexity of the real world. Realistic simulators play an important role in sim-to-real approaches, in which an agent is trained with arbitrarily many samples provided by the simulators, then transferred to real settings using sample-efficient transfer learning techniques (Kalashnikov et al., 2018; Andrychowicz et al., 2018; Karttunen et al., 2019). While modern techniques are capable of simulating images that can convince human perception (Karras et al., 2017, 2018), simulating language interaction remains challenging. There are efforts to build complex interactive text-based worlds (Côté et al., 2018; Urbanek et al., 2019), but the lack of a graphical component makes them unsuitable for visually grounded learning. On the other hand, experiments with real humans and robots, despite being expensive and time-consuming, are important for understanding the true complexity of real-world scenarios (Chai et al., 2018; Rybski et al., 2007; Mohan and Laird, 2014; She et al., 2014).
Recent navigation tasks in photo-realistic simulators have accelerated research on teaching agents to execute human instructions. Nevertheless, modeling human assistance in these problems remains simplistic (Table 1): they either do not incorporate the ability to request additional help while executing tasks (Misra et al., 2014, 2017; Anderson et al., 2018b; Chen et al., 2019; Das et al., 2018; Misra et al., 2018; Wijmans et al., 2019; Qi et al., 2019), or mimic human verbal assistance with primitive, highly scripted language (Chevalier-Boisvert et al., 2019). HANNA improves the realism of the VNLA setup by using fully natural language instructions.
Imitation learning algorithms are a great fit for training agents in simulated environments: access to ground-truth information about the environments allows optimal actions to be computed in many situations. The "teacher" in standard imitation learning algorithms (Daumé III et al., 2009; Ross et al., 2011; Ross and Bagnell, 2014; Chang et al., 2015; Sun et al., 2017; Sharaf and Daumé III, 2017) does not take into consideration the agent's capability and behavior. He et al. (2012) present a coaching method in which the teacher gradually increases the complexity of its demonstrations over time. Welleck et al. (2019) propose an "unlikelihood" objective which, similar to our curiosity-encouraging objective, penalizes the likelihoods of candidate negative actions to avoid mistake repetition. Our approach takes into account the agent's past and future behavior to determine the actions that are most and least beneficial to it, combining the advantages of both model-based and progress-estimating methods (Wang et al., 2018; Ma et al., 2019a,b).

The HANNA Simulator
Problem. HANNA simulates a scenario where a human requester asks a mobile agent, via language, to find an object in an indoor environment. The task request is only a high-level command ("find [object(s)]"), modeling the general case in which the requester does not need to know how to accomplish a task when requesting it. We assume the task is always feasible: there is at least one instance of the requested object in the environment. Figure 1, to which references in this section will be made, illustrates an example where the agent is asked to "find a mug." The agent starts at a random location (A), is given a task request, and is allotted a budget of T time steps to complete the task. The agent succeeds if its final location is within d_success meters of the location of any instance of the requested object. The agent is not given any sensors that determine its own location or the object's location; it must navigate using only a monocular camera that captures its first-person view as an RGB image (e.g., the image in the upper right of Figure 1). The only source of help the agent can leverage in the environment is the assistants, who are present at both training and evaluation time. The assistants are not aware of the agent unless it enters their zones of attention, which include all locations within d_attn meters of their own locations. When the agent is in one of these zones, it has the option to request help from the corresponding assistant. The assistant helps the agent by giving a subtask, described by a natural language instruction that guides the agent to a specific location, and an image of the view at that location.

Table 1 caption (excerpt): The dataset of Thomason et al. (2019b) contains natural conversations in which a human assistant aids another human in navigation tasks, but it offers limited simulation of language interaction, as language assistance is not available when the agent deviates from the collected trajectories and tasks. HANNA simulates human assistants that provide language-and-vision instructions that adapt to the agent's current position and goal.
In our example, at B , the assistant says "Enter the bedroom and turn left immediately. Walk straight to the carpet in the living room. Turn right, come to the coffee table." and provides an image of the destination in the living room. Executing the subtask may not fulfill the main task, but is guaranteed to get the agent to a location closer to a goal than where it was before (e.g., C ).
Photo-realistic Navigation Simulator. HANNA uses the Matterport3D simulator (Chang et al., 2017; Anderson et al., 2018b) to photo-realistically emulate a first-person view while navigating in indoor environments. HANNA features 68 Matterport3D environments, each of which is a residential building consisting of multiple rooms and floors. Navigation is modeled as traversing an undirected graph G = (V, E), where each location corresponds to a node v ∈ V with 3D coordinates x_v, and edges are weighted by their lengths (in meters). The state of the agent is fully determined by its pose τ = (v, ψ, ω), where v is its location, ψ ∈ (0, 2π] is its heading (horizontal camera angle), and ω ∈ [−π/6, π/6] is its elevation (vertical camera angle). The agent does not know v, and the angles are constrained to multiples of π/6. In each step, the agent can either stay at its current location or rotate toward and go to an adjacent location in the graph (we use the "panoramic action space" of Fried et al., 2018). Every time the agent moves (and thus changes pose), the simulator recalculates the image to reflect the new view.
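The state space above can be made concrete with a small sketch (illustrative Python, not the simulator's API): a pose is a node plus camera angles snapped to multiples of π/6, and the panoramic action space exposes the adjacent graph nodes.

```python
# Minimal sketch (illustrative names) of HANNA's navigation state: an
# undirected weighted graph, and an agent pose (node, heading, elevation)
# with camera angles constrained to multiples of pi/6.
import math

STEP = math.pi / 6  # angular resolution for heading/elevation

def snap_angle(angle):
    """Round an angle to the nearest multiple of pi/6."""
    return round(angle / STEP) * STEP

class Pose:
    def __init__(self, node, heading, elevation):
        self.node = node
        self.heading = snap_angle(heading) % (2 * math.pi)
        # elevation is clamped to [-pi/6, pi/6]
        self.elevation = max(-STEP, min(STEP, snap_angle(elevation)))

def neighbors(graph, node):
    """Adjacent nodes reachable in one move (panoramic action space)."""
    return sorted(graph.get(node, {}).keys())

# toy environment graph: adjacency dict with edge lengths in meters
graph = {"A": {"B": 2.0}, "B": {"A": 2.0, "C": 3.5}, "C": {"B": 3.5}}
pose = Pose("B", heading=1.6, elevation=0.9)
```

The toy graph and `Pose` class are stand-ins; the real simulator additionally re-renders the first-person image after every pose change.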
Automatic Natural Navigation Assistants (ANNA). ANNA is a simulation of human assistants who do not necessarily know how to optimally accomplish the agent's goal themselves: they are only familiar with scenes along certain paths in the environment, and thus give advice that helps the agent make partial progress. Specifically, the assistance from ANNA is modeled by a set of language-assisted routes R = {r_1, r_2, ..., r_|R|}. Each route r = (ψ_r, ω_r, p_r, l_r) is defined by initial camera angles (ψ_r, ω_r), a path p_r in the environment graph, and a natural language instruction l_r. A route becomes enterable when its start location is adjacent to and within d_attn meters of the agent's location. When the agent enters a route, it first adjusts its camera angles to (ψ_r, ω_r), then attempts to interpret the language instruction l_r to traverse along p_r. At any time, the agent can depart the route by ceasing to follow l_r. An example of a route in Figure 1 is the combination of the initial camera angles at B, the corresponding path, and the language instruction "Enter the bedroom and turn left immediately. . . " The set of all routes starting from a location simulates a human assistant who can recall scenes along these routes' paths. The zone of attention of the simulated human is the set of all locations from which the agent can enter one of the routes; when the agent is in this zone, it may ask the human for help. Upon receiving a help request, the human selects a route r for the agent to enter, and a location v_d on the route where it wants the agent to depart (e.g., C). It then replies to the agent with a multimedia message (l_r, I_{v_d}), where l_r is the selected route's language instruction and I_{v_d} is an image of the panoramic view at the departure location.
The message describes a subtask that requires the agent to follow the directions described by l_r and to stop when it reaches the location referenced by I_{v_d}. The route r and the departure node v_d are selected to get the agent as close to a goal location as possible. Concretely, let R_curr be the set of all routes associated with the requested human, let d(·, ·) return the (shortest-path) distance between two locations, and let V_goal be the set of all goal locations. The selected route minimizes the distance to the goal locations among all routes in R_curr:

    r* = argmin_{r ∈ R_curr} min_{v ∈ p_r, g ∈ V_goal} d(v, g)    (2)

The departure location minimizes the distance to the goal locations among all locations on the selected route:

    v_d = argmin_{v ∈ p_{r*}} min_{g ∈ V_goal} d(v, g)    (3)

When the agent chooses to depart the route (not necessarily at the departure node), the human further assists it by providing I_g, an image of the panoramic view at the goal location closest to the departure node.

The way the agent leverages ANNA to accomplish tasks is analogous to how humans travel using public transportation systems (e.g., bus, subway). For example, passengers of a subway system utilize fractions of pre-constructed routes to make progress toward a destination. They execute travel plans consisting of multiple subtasks, each of which requires entering at a start stop, following a route (typically described by its name and last stop), and exiting at a departure stop (e.g., "Enter Penn Station, hop on the Red line in the direction of South Ferry, get off at the World Trade Center"). Occasionally, passengers walk short distances (at a lower speed) to switch routes. Our setup follows the same principle, but instead of physical vehicles and railways, we employ low-level language-and-vision instructions as the "high-speed means" to accelerate travel.

Constructing the ANNA route system. Given a photo-realistic simulator, the primary cost of constructing the HANNA problem comes from crowdsourcing the natural language instructions.
Ideally, we want to collect sufficient instructions to simulate humans at any location in the environment. Let N = |V| be the number of locations in the environment. Since each simulated human is familiar with at most N locations, in the worst case we would need to collect O(N^2) instructions to connect all location pairs. However, we prove that, assuming the agent executes instructions perfectly, it is possible to guide the agent between any location pair by collecting only Θ(N log N) instructions. The key idea is to use O(log N) instructions instead of a single one to connect each pair, and to reuse each instruction across multiple routes.
Lemma 1. (proof in Appendix A) To guide the agent between any two locations using O(log N ) instructions, we need to collect instructions for Θ(N log N ) location pairs.
In our experiments, we leverage the pre-existing Room-to-Room dataset (Anderson et al., 2018b) to construct the route system. This dataset contains 21,567 natural language instructions crowdsourced from humans and was originally intended for the Vision-and-Language Navigation task, in which an agent executes a language instruction (such as those in Figure 1) to reach a location. We exclude the instructions of the test split and their corresponding environments because ground-truth paths are not given. We use (on average) 211 routes to connect (on average) 125 locations per environment. Even though the routes were selected randomly in the original dataset, our experiments show that they are sufficient for completing the tasks (assuming perfect assistance interpretation).
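The assistant's route and departure-location selection described earlier in this section can be sketched as follows. This is a minimal sketch: `dist` stands in for the shortest-path graph distance, and all names are illustrative, not the simulator's API.

```python
# Sketch of ANNA's response to a help request: pick the route whose path
# gets closest to a goal, then the node on that route closest to a goal.
def route_distance(route_path, goals, dist):
    """Smallest distance from any node on the route to any goal."""
    return min(dist(v, g) for v in route_path for g in goals)

def select_route_and_departure(routes, goals, dist):
    """routes: dict name -> path (list of nodes). Returns (route, v_d)."""
    best = min(routes, key=lambda r: route_distance(routes[r], goals, dist))
    v_d = min(routes[best], key=lambda v: min(dist(v, g) for g in goals))
    return best, v_d

# toy 1-D environment: nodes are integers, distance is absolute difference
dist = lambda u, v: abs(u - v)
routes = {"r1": [0, 2, 4], "r2": [0, 5, 9]}
goals = [10]
```

In the toy example, route "r2" reaches node 9, one unit from the goal, so it is selected with departure node 9.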

Retrospective Curiosity-Encouraging Imitation Learning

Agent Policies. Let s be a fully-observed state that contains ground-truth information about the environment and the agent (e.g., object locations, the environment graph, agent parameters). Let o_s be the corresponding observation given to the agent, which encodes only the current view, the current task, and extra information that the agent keeps track of (e.g., time, action history). The agent maintains two stochastic policies: a navigation policy π̂_nav and a help-request policy π̂_ask. Each policy maps an observation to a probability distribution over its action space. Navigation actions are tuples (v, Δψ, Δω), where v is a next location adjacent to the current location and (Δψ, Δω) is the camera angle change. A special stop action is added to the set of navigation actions to signal that the agent wants to terminate the main task or a subtask (by departing a route). The action space of the help-request policy contains two actions: request help and do nothing. The request help action is only available when the agent is in a zone of attention. Algorithm 1 describes the effects of these actions during a task episode.

Algorithm 1: Task episode, given agent help-request policy π̂_ask and navigation policy π̂_nav
 1: agent receives task request e
 2: initialize the agent mode: m ← main task
 3: initialize the language instruction: l_0 ← e
 4: initialize the target image: I_tgt_0 ← None
 5: for t = 1 ... T do
      ... (per-step help-request and navigation decisions; intermediate lines not recovered from the source)
      agent executes â_nav_t to go to the next location
29: end for
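As a rough sketch of the control flow in Algorithm 1, with stubbed policies and a toy environment standing in for the simulator (all names are illustrative):

```python
# Schematic of a task episode: the agent alternates help-request and
# navigation decisions; requesting help swaps in a subtask instruction,
# and `stop` ends either the subtask (depart the route) or the main task.
def run_episode(env, ask_policy, nav_policy, task, T=20):
    mode, instruction, target_image = "main", task, None
    trajectory = []
    for _ in range(T):
        obs = env.observe(instruction, target_image)
        if mode == "main" and env.in_attention_zone() \
                and ask_policy(obs) == "request_help":
            instruction, target_image = env.request_help()
            mode = "subtask"
        action = nav_policy(obs)
        if action == "stop":
            if mode == "subtask":  # depart the route, resume the main task
                mode, instruction, target_image = "main", task, None
                continue
            break                  # terminate the main task
        env.step(action)
        trajectory.append(action)
    return trajectory

class ToyEnv:
    """Trivial 1-D stand-in for the simulator."""
    def __init__(self): self.pos = 0
    def observe(self, instruction, target_image): return self.pos
    def in_attention_zone(self): return False
    def request_help(self): return ("subtask instruction", None)
    def step(self, action): self.pos += 1
```

The real episode additionally tracks camera angles, time, and the budget accounting of Algorithm 1.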
Imitation Learning Objective. The agent is trained with imitation learning to mimic behaviors suggested by a navigation teacher π_nav and a help-request teacher π_ask, who have access to the fully-observed states. In general, imitation learning (Daumé III et al., 2009; Ross et al., 2011; Ross and Bagnell, 2014; Chang et al., 2015; Sun et al., 2017) finds a policy π̂ that minimizes the expected imitation loss L with respect to a teacher policy π under the agent-induced state distribution D_π̂:

    min_π̂ E_{s ∼ D_π̂} [ L(s, π̂, π) ]

We frame the HANNA problem as an instance of Imitation Learning with Indirect Intervention (I3L). Under this framework, assistance is viewed as augmenting the current environment with new information, and interpreting the assistance is cast as finding the optimal acting policy in the augmented environment. Formally, I3L searches for policies that optimize:

    min_{π̂_nav, π̂_ask} E_{E ∼ D^env_{π̂_ask}} E_{s ∼ D^state_{π̂_nav, E}} [ L(s) ]
    L(s) = L_nav(s, π̂_nav, π_nav) + L_ask(s, π̂_ask, π_ask)

where L_nav and L_ask are the navigation and help-request loss functions, respectively, D^env_{π̂_ask} is the environment distribution induced by π̂_ask, and D^state_{π̂_nav, E} is the state distribution induced by π̂_nav in environment E. A common choice for the loss functions is the agent-estimated negative log likelihood of the reference action:

    L_NLL(s, π̂, π) = − log π̂(a* | o_s)    (7)

where a* = π(s) is the reference action suggested by π. We introduce novel loss functions that enforce more complex behaviors than simply mimicking reference actions.
Reference Actions. The navigation teacher suggests a reference action a*_nav that takes the agent to the next location on the shortest path from its current location to the target location. Here, the target location refers to the nearest goal location (if no target image is available) or the location referenced by the target image (provided by ANNA). If the agent is already at the target location, a*_nav = stop. To decide whether the agent should request help, the help-request teacher verifies the following conditions: 1. lost: the agent will not get (strictly) closer to the target location in the future; 2. uncertain-wrong: the entropy of the navigation action distribution is greater than or equal to a threshold γ, and the highest-probability predicted navigation action is not the one suggested by the navigation teacher; 3. never asked: the agent has never previously requested help at the current location. If condition (1) or (2) is satisfied, and condition (3) is also satisfied, we set a*_ask = request help; otherwise, a*_ask = do nothing.
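The three conditions combine into a simple decision rule, sketched below. This is a hypothetical stand-in: the real teacher evaluates the lost condition from the fully-observed state, and the inputs here are illustrative summaries of that information.

```python
# Sketch of the help-request teacher's rule: request help iff
# (lost OR uncertain-wrong) AND never asked at this location.
import math

def entropy(probs):
    """Shannon entropy (nats) of an action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def teacher_ask(will_make_progress, nav_probs, predicted, reference,
                asked_here_before, gamma=1.0):
    lost = not will_make_progress
    uncertain_wrong = (entropy(nav_probs) >= gamma
                       and predicted != reference)
    never_asked = not asked_here_before
    if (lost or uncertain_wrong) and never_asked:
        return "request_help"
    return "do_nothing"
```

The threshold gamma=1.0 is an arbitrary placeholder, not the value used in the paper.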

Curiosity-Encouraging Navigation Teacher. In addition to a reference action, the navigation teacher returns A⊗_nav, the set of all non-reference actions that the agent has taken at the current location while executing the same language instruction (so A⊗_nav ⊆ A_nav \ {a*_nav}, where A_nav is the navigation action space). We devise a curiosity-encouraging loss L_curious, which minimizes the log likelihoods of the actions in A⊗_nav. This loss prevents the agent from repeating past mistakes and motivates it to explore untried actions. The navigation loss is:

    L_nav(s, π̂_nav, π_nav) = L_NLL(s, π̂_nav, π_nav) + α · Σ_{a ∈ A⊗_nav} log π̂_nav(a | o_s)

where α ∈ [0, ∞) is a weight hyperparameter.
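Numerically, the combined navigation loss can be sketched as below (illustrative only; the real loss operates on the model's predicted action distribution, and the exact weighting of the curiosity term is as defined above):

```python
# Sketch of the curiosity-encouraging navigation loss: the usual negative
# log likelihood of the reference action, plus alpha times the summed log
# likelihoods of previously-tried non-reference actions, so that
# minimizing the loss pushes those probabilities down.
import math

def nav_loss(action_probs, reference, tried_bad, alpha=0.5):
    """action_probs: dict action -> probability; tried_bad: A_nav-cross set."""
    nll = -math.log(action_probs[reference])
    curious = sum(math.log(action_probs[a]) for a in tried_bad)
    return nll + alpha * curious
```

With `alpha=0.5` (an illustrative value), adding a previously-tried wrong action to `tried_bad` lowers the loss only if its probability is already small, which is exactly the pressure the objective applies.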

Retrospective Interpretable Help-Request Teacher. In deciding whether the agent should ask for help, the help-request teacher must consider the agent's future situations. Standard imitation learning algorithms (e.g., DAgger) employ an online mode of interaction that queries the teacher at every time step. This mode of interaction is not suitable for our problem: the teacher would have to predict the agent's future actions whenever it is queried before the episode has finished. To overcome this challenge, we introduce a more efficient retrospective mode of interaction, which waits until the agent completes an episode and then queries the teacher for reference actions for all time steps at once. With this approach, because the future actions at each time step are fully observed, they can be taken into consideration when computing the reference action. In fact, we prove that the retrospective teacher is optimal for teaching the agent to determine the lost condition, which is the only condition that requires knowing the agent's future.

Lemma 2. (proof in Appendix B) At any time step, the retrospective help-request teacher suggests the action that results in the agent getting closer to the target location in the future under its current navigation policy (if such an action exists).

Figure 2: Our hierarchical recurrent model architecture (the navigation network), comprising an encoder, inter-task and intra-task modules, and cosine-similarity attention. The help-request network is mostly similar, except that the navigation action distribution is fed as an input to compute the "state features".
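For the lost condition specifically, retrospection reduces to a backward scan over the recorded per-step distances to the target. The sketch below assumes such distances were logged during the episode; names are illustrative.

```python
# Retrospective labeling of the 'lost' condition: after the episode, a
# step is 'lost' iff the agent never gets strictly closer to the target
# at any later step.
def lost_labels(dist_to_target):
    """dist_to_target: list of distances, one per time step."""
    n = len(dist_to_target)
    labels = []
    for t in range(n):
        future_min = min(dist_to_target[t + 1:], default=float("inf"))
        labels.append(future_min >= dist_to_target[t])
    return labels
```

Because the whole trajectory is available, each label is exact rather than a prediction, which is what makes the retrospective teacher optimal for this condition.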
To help the agent better justify its help-request decisions, we train a reason classifier Φ to predict which conditions are satisfied. To train this classifier, the teacher provides a reason vector ρ ∈ {0, 1}^3, where ρ_i = 1 indicates that the i-th condition is met. We formulate this prediction problem as multi-label binary classification and employ a binary logistic loss for each condition. Learning to predict the conditions helps the agent make more accurate and interpretable decisions. The help-request loss is:

    L_ask(s, π̂_ask, π_ask) = L_NLL(s, π̂_ask, π_ask) + L_reason(s, ρ̂, ρ)    (10)

where L_NLL(s, π̂_ask, π_ask) = −log π̂_ask(a*_ask | o_s), L_reason sums the binary logistic losses over the three conditions, (a*_ask, ρ) = π_ask(s), and ρ̂ = Φ(o_s) is the vector of agent-estimated condition likelihoods.
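The reason-prediction term can be sketched as an independent binary logistic (cross-entropy) loss per condition, summed over the three conditions (illustrative; the real model computes this on logits inside the network):

```python
# Sketch of the multi-label reason loss: rho are 0/1 teacher labels for
# (lost, uncertain-wrong, never-asked); rho_hat are predicted likelihoods.
import math

def reason_loss(rho, rho_hat, eps=1e-12):
    total = 0.0
    for y, p in zip(rho, rho_hat):
        p = min(max(p, eps), 1 - eps)  # clamp for numerical stability
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total
```

Treating the three conditions as independent binary targets is exactly the multi-label formulation described above.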

Hierarchical Recurrent Architecture
We model the navigation policy and the help-request policy as two separate neural networks. The two networks have similar architectures, which consist of three main components: a text-encoding component, an inter-task component, and an intra-task component (Figure 2). We use self-attention instead of recurrent neural networks to better capture long-term dependencies, and develop a novel cosine-similarity attention and a ResNet-based time encoding. Details on the computations in each module are in the Appendix.

The text-encoding component computes a text memory M_text, which stores the hidden representation of the current language instruction.

The inter-task module computes a vector h^inter_t representing the state of the current task's execution. During the episode, every time the current task is altered (because the agent requests help or departs a route), the agent re-encodes the new language instruction to generate a new text memory and resets the inter-task state to a zero vector.

The intra-task module computes a vector h^intra_t representing the state of the entire episode. To compute this state, we first calculate h̃^intra_t, a tentative current state, and h̄^intra_t, a weighted combination of the past states at nearly identical situations. h^intra_t is then computed as:

    h^intra_t = h̃^intra_t − β ⊙ h̄^intra_t    (11)

Eq. 11 creates a context-sensitive dissimilarity between the current state and the past states at nearly identical situations. The scale vector β determines how large the dissimilarity is, based on the inputs. This formulation incorporates past related information into the current state, enabling the agent to optimize the curiosity-encouraging loss effectively. Finally, h^intra_t is passed through a softmax layer to produce an action distribution.
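A plain-Python sketch of the intra-task update: in the real model the combination weights come from the cosine-similarity attention and β is produced by the network, but the subtraction itself is simple.

```python
# Sketch of the intra-task state update: subtract a scaled, weighted
# combination of past states (at nearly identical situations) from the
# tentative current state. Vectors are plain lists; weights and beta are
# given here rather than learned.
def weighted_past(past_states, weights):
    """Weighted combination (h-bar) of past state vectors."""
    dim = len(past_states[0])
    return [sum(w * s[i] for w, s in zip(weights, past_states))
            for i in range(dim)]

def intra_state(tentative, past_states, weights, beta):
    """h_t = h-tilde_t - beta (elementwise) h-bar_t."""
    if not past_states:
        return list(tentative)
    hbar = weighted_past(past_states, weights)
    return [h - b * p for h, b, p in zip(tentative, beta, hbar)]
```

Pushing the state away from past states at repeated situations is what lets the softmax over h^intra_t assign lower probability to previously tried actions.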

Experimental Setup
Dataset. We generate a dataset of object-finding tasks in the HANNA environments to train and evaluate our agent. The language instruction vocabulary contains 2,332 words. The numbers of locations on the shortest paths to the requested objects are restricted to be between 5 and 15. With an average edge length of 2.25 meters, the agent has to travel about 9 to 32 meters to reach its goals. We evaluate the agent in environments that are seen during training (SEENENV) and in environments that are not (UNSEENALL). Even in the case of SEENENV, the tasks and the ANNA language instructions given during evaluation were never given in the same environments during training.

Table 3: Results on test splits. The agent with "perfect assistance interpretation" uses the teacher navigation policy (π_nav) to make decisions when executing a subtask from ANNA. Results of our final system are in bold.
Baselines and Skylines. We compare our agent against the following non-learning agents: 1. SHORTEST: uses the teacher navigation policy to make decisions (this is a skyline); 2. RANDOMWALK: randomly chooses a navigation action at every time step; 3. FORWARD10: navigates to the next location closest to the center of the current view, advancing for 10 time steps. We compare our learned help-request policy with the following heuristics: 1. NOASK: never requests help; 2. RANDOMASK: requests help with a probability of 0.2 at each step, which matches the average help-request ratio of our learned agent; 3. ASKEVERY5: requests help as soon as it has walked at least 5 time steps since its last request.
Evaluation metrics. Our main metrics are: success rate (SR), the fraction of examples on which the agent successfully solves the task; navigation error, the average (shortest-path) distance between the agent's final location and the nearest goal from that location; and SPL (Anderson et al., 2018a), which weights task success by travel distance as follows:

    SPL = (1/N) Σ_{i=1}^{N} S_i · L_i / max(P_i, L_i)

where N is the number of tasks, S_i indicates whether task i is successful, P_i is the agent's travel distance, and L_i is the shortest-path distance to the goal nearest to the agent's final location.
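The SPL metric can be computed directly from per-task records (a minimal sketch matching the standard definition of Anderson et al., 2018a):

```python
# Success weighted by (inverse) Path Length: each successful task
# contributes the ratio of shortest-path length to actual path length.
def spl(successes, path_lengths, shortest_lengths):
    """successes: 0/1 flags; path/shortest lengths in meters."""
    n = len(successes)
    total = sum(s * (l / max(p, l))
                for s, p, l in zip(successes, path_lengths, shortest_lengths))
    return total / n
```

A successful task with a path twice as long as optimal contributes 0.5; a failed task contributes 0 regardless of distance.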

Results
Main results. From Table 3, we see that our problem is challenging: simple heuristic-based baselines such as RANDOMWALK and FORWARD10 attain success rates of less than 7%. An agent that learns to accomplish tasks without additional assistance from ANNA succeeds only 17.21% of the time on TEST SEENENV and 8.10% on TEST UNSEENALL. Leveraging help from ANNA dramatically boosts the success rate, by 71.16% on TEST SEENENV and by 39.35% on TEST UNSEENALL, over not requesting help. Given the small size of our dataset (e.g., the agent has fewer than 9,000 subtask instructions to learn from), it is encouraging that our agent is successful in nearly half of its tasks. On average, the agent takes paths that are 1.38 and 1.86 times longer than the optimal paths on TEST SEENENV and TEST UNSEENALL, respectively. In unseen environments, it issues on average twice as many help requests as it does in seen environments. To understand how well the agent interprets the ANNA instructions, we also provide results where our agent uses the optimal navigation policy to make decisions while executing subtasks. The large gaps on TEST SEENENV indicate there is still much room for improvement, purely in learning to execute instructions.

Does understanding language improve generalizability? Our agent is assisted with both language and visual instructions; similar to Thomason et al. (2019a), we disentangle the usefulness of these two modes of assistance. As seen in Table 4, the improvement from language on TEST UNSEENALL (+15.17%) is substantially larger than that on TEST SEENENV (+3.42%), largely because the agent can simply memorize the seen environments. This confirms that understanding language-based assistance effectively enhances the agent's capability to accomplish tasks in novel environments.
Is learning to request help effective? Table 5 compares our learned help-request policies with the baselines. We find that ASKEVERY5 is a surprisingly strong baseline for this problem, yielding an improvement of +26.32% over not requesting help on TEST UNSEENALL. Nevertheless, our learned policy, with the ability to predict the future and access to the agent's uncertainty, outperforms all baselines by at least 10.40% in success rate on TEST UNSEENALL, while making fewer help requests. The small gap between the learned policy and ASKEVERY5 on TEST SEENENV is expected because, on this split, performance is mostly determined by the model's memorization capability and is largely insensitive to the help-request strategy.

Is the proposed model architecture effective?

We implement an LSTM-based encoder-decoder model based on the architecture proposed by Wang et al. (2019). To incorporate the target image, we add an attention layer that uses the image's vector set as the attention memory. We train this model with imitation learning using the standard negative log likelihood loss (Eq. 7), without the curiosity-encouraging and reason-prediction losses. As seen in Table 6, our hierarchical recurrent model outperforms this model by a large margin on TEST UNSEENALL (+28.2%).

Table 6: Results on TEST UNSEENALL of our model, trained with and without the curiosity-encouraging loss, and of an LSTM-based encoder-decoder model (both models have about 15M parameters). "Navigation mistake repeat" is the fraction of time steps on which the agent repeats a non-optimal navigation action at a previously visited location while executing the same task. "Help-request repeat" is the fraction of help requests made at a previously visited location while executing the same task.
Does the proposed imitation learning algorithm achieve its goals? The curiosity-encouraging training objective is proposed to prevent the agent from repeating mistakes in previously encountered situations. Table 6 shows that training with this objective reduces the chance of the agent looping and making the same decisions repeatedly. As a result, its success rate is greatly boosted (+4.33% on TEST UNSEENALL) over training without the curiosity-encouraging loss.

Conclusion
In this work, we present a photo-realistic simulator that mimics primary characteristics of real-life human assistance. We develop effective imitation learning techniques for learning to request and interpret the simulated assistance, coupled with a hierarchical neural network model for representing subtasks. Future work aims to provide more natural, linguistically realistic interaction between the agent and humans (e.g., giving the agent the ability to ask a natural question rather than just signal for help), and to establish a theoretical framework for modeling human assistance. We are also exploring ways to deploy and evaluate our methods on real-world platforms.