BabyWalk: Going Farther in Vision-and-Language Navigation by Taking Baby Steps

Learning to follow instructions is of fundamental importance to autonomous agents for vision-and-language navigation (VLN). In this paper, we study how an agent can navigate long paths when learning from a corpus that consists of shorter ones. We show that existing state-of-the-art agents do not generalize well. To this end, we propose BabyWalk, a new VLN agent that learns to navigate by decomposing long instructions into shorter ones (BabySteps) and completing them sequentially. A specially designed memory buffer is used by the agent to turn its past experiences into contexts for future steps. The learning process is composed of two phases. In the first phase, the agent uses imitation learning from demonstration to accomplish BabySteps. In the second phase, the agent uses curriculum-based reinforcement learning to maximize rewards on navigation tasks with increasingly longer instructions. We create two new benchmark datasets (of long navigation tasks) and use them in conjunction with existing ones to examine BabyWalk's generalization ability. Empirical results show that BabyWalk achieves state-of-the-art results on several metrics and, in particular, is able to follow long instructions better. The code and the datasets are released on our project page: https://github.com/Sha-Lab/babywalk.


Introduction
Autonomous agents such as household robots need to interact with the physical world in multiple modalities. As an example, in vision-and-language navigation (VLN) (Anderson et al., 2018), the agent moves around in a photo-realistic simulated environment (Chang et al., 2017) by following a sequence of natural language instructions. To infer its whereabouts so as to decide its moves, the agent fuses its visual perception, its trajectory and the instructions (Fried et al., 2018; Anderson et al., 2018; Wang et al., 2019; Ma et al., 2019a,b). Arguably, the ability to understand and follow the instructions is one of the most crucial skills for VLN agents to acquire. Jain et al. (2019) show that the VLN agents trained on the originally proposed dataset ROOM2ROOM (i.e., R2R hereafter) do not follow the instructions, despite having achieved high success rates of reaching the navigation goals. They proposed two remedies: a new dataset ROOM4ROOM (or R4R) that doubles the path lengths in R2R, and a new evaluation metric, Coverage weighted by Length Score (CLS), that measures more closely whether the ground-truth paths are followed. They showed that optimizing the fidelity of following instructions leads to agents with desirable behavior. Moreover, the long lengths in R4R are informative in identifying agents that score higher on such a fidelity measure.
* Author contributed equally. † On leave from University of Southern California.
In this paper, we investigate another crucial aspect of following the instructions: can a VLN agent generalize to following longer instructions by learning from shorter ones? This aspect has important implications for real-world applications, as collecting annotated long sequences of instructions and training on them can be costly. Thus, it is highly desirable to have this generalization ability. After all, it seems that humans can achieve this effortlessly.
To this end, we have created several datasets of longer navigation tasks, inspired by R4R (Jain et al., 2019). We train VLN agents on R4R and use the agents to navigate in ROOM6ROOM (i.e., R6R) and ROOM8ROOM (i.e., R8R). We contrast their performance with that of agents trained on those datasets directly ("in-domain"). The results are shown in Fig. 1.
Figure 1: Performance of various VLN agents on generalizing from shorter navigation tasks to longer ones. The vertical axis is the newly proposed path-following metric SDTW (Magalhaes et al., 2019), the higher the better. BABYWALK generalizes better than other approaches across different lengths of navigation tasks. Meanwhile, it gets very close to the performance of the in-domain agents (the dashed line). Please refer to the text for details.
Our findings are that the agents trained on R4R (denoted by the purple and the pink solid lines) perform significantly worse than the in-domain agents (denoted by the light blue dashed line). Also interestingly, when such out-of-domain agents are applied to the dataset R2R with shorter navigation tasks, they also perform significantly worse than the corresponding in-domain agent, despite R4R containing many navigation paths from R2R. Note that the agent trained to optimize the aforementioned fidelity measure (RCM(fidelity)) performs better than the agent trained to reach the goal only (RCM(goal)), supporting the claim by Jain et al. (2019) that following instructions is a more meaningful objective than merely goal-reaching. Yet, the fidelity measure itself is not enough to enable the agent to transfer well to longer navigation tasks.
To address these deficiencies, we propose a new approach for VLN. The agent follows a long navigation instruction by decomposing the instruction into shorter ones ("micro-instructions", i.e., BABY-STEPs), each of which corresponds to an intermediate goal/task to be executed sequentially. To this end, the approach has three components: (a) a memory buffer that summarizes the agent's experiences so that the agent can use them to provide the context for executing the next BABY-STEP; (b) a first learning stage in which the agent learns from human experts in "bite-size" pieces: instead of trying to imitate the ground-truth paths as a whole, the agent is given pairs of a BABY-STEP and the corresponding human expert path so that it can learn policies of actions from shorter instructions; (c) a second learning stage in which the agent refines the policies by curriculum-based reinforcement learning, where the agent is given increasingly longer navigation tasks to achieve. In particular, this curriculum design reflects our desideratum that the agent optimized on shorter tasks should generalize well to slightly longer tasks and then to much longer ones.
While we do not claim that our approach faithfully simulates human learning of navigation, the design is loosely inspired by it. We name our approach BABYWALK and refer to the intermediate navigation goals in (b) as BABY-STEPs. Fig. 1 shows that BABYWALK (the red solid line) significantly outperforms other approaches and, despite being out-of-domain, even reaches the performance of in-domain agents on R6R and R8R.
The effectiveness of BABYWALK also leads to an interesting twist. As mentioned before, one of the most important observations by Jain et al. (2019) is that the original VLN dataset R2R fails to reveal the difference between optimizing goal-reaching (thus ignoring the instructions) and optimizing the fidelity (thus adhering to the instructions). Yet, leaving details to Section 5, we have also shown that applying BABYWALK to R2R can lead to equally strong performance on generalizing from shorter instructions (i.e., R2R) to longer ones.
In summary, in this paper, we have demonstrated empirically that the current VLN agents are ineffective in generalizing from learning on shorter navigation tasks to longer ones. We propose a new approach in addressing this important problem. We validate the approach with extensive benchmarks, including ablation studies to identify the effectiveness of various components in our approach.

Related Work
Vision-and-Language Navigation (VLN) Recent works (Anderson et al., 2018; Thomason et al., 2019; Jain et al., 2019; Chen et al., 2019; Nguyen and Daumé III, 2019) extend the early works of instruction-based navigation (Chen and Mooney, 2011; Kim and Mooney, 2013; Mei et al., 2016) to photo-realistic simulated environments. For instance, Anderson et al. (2018) proposed to learn a multi-modal Sequence-to-Sequence agent (Seq2Seq) by imitating expert demonstration. Fried et al. (2018) developed a method that augments the paired instruction and demonstration data using a learned speaker model, to teach the navigation agent to better understand instructions. Wang et al. (2019) further apply reinforcement learning (RL) and self-imitation learning to improve navigation agents. Ma et al. (2019a,b) designed models that track the execution progress for a sequence of instructions using soft-attention.
Different from them, we focus on transferring an agent's performance on shorter tasks to longer ones. This leads to designs and learning schemes that improve generalization across datasets. We use a memory buffer to prevent mistakes in the distant past from exerting strong influence on the present. In the imitation learning stage, we solve fine-grained subtasks (BABY-STEPs) instead of asking the agent to learn the navigation trajectory as a whole. We then use curriculum-based reinforcement learning by asking the agent to follow increasingly longer instructions.
Transfer and Cross-domain Adaptation There has been a large body of work on transfer learning and generalization across tasks and environments in both computer vision and reinforcement learning (Andreas et al., 2017; Oh et al., 2017; Zhu et al., 2017a,b; Sohn et al., 2018; Hu et al., 2018). Of particular relevance is the recent work on adapting VLN agents to changes in visual environments (Huang et al., 2019; Tan et al., 2019). To the best of our knowledge, this work is the first to focus on adapting to a simple aspect of language variability: the length of the instructions.
Curriculum Learning Since it was proposed in (Bengio et al., 2009), curriculum learning has been successfully used in a range of tasks: training robots for goal reaching (Florensa et al., 2017), visual question answering (Mao et al., 2019), and image generation (Karras et al., 2018). To the best of our knowledge, this work is the first to apply the idea to learning in VLN.

Notation and the Setup of VLN
In the VLN task, the agent receives a natural language instruction $X$ composed of a sequence of sentences. We model the agent with a Markov Decision Process (MDP), which is defined as a tuple of a state space $\mathcal{S}$, an action space $\mathcal{A}$, an initial state $s_1$, a stationary transition dynamics $\rho: \mathcal{S} \times \mathcal{A} \to \mathcal{S}$, a reward function $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, and the discount factor $\gamma$ for weighting future rewards. The agent acts according to a policy $\pi: \mathcal{S} \times \mathcal{A} \to \{0\} \cup \mathbb{R}^{+}$. The state and action spaces are defined the same as in (Fried et al., 2018) (cf. § 4.4 for details).
For each $X$, the sequence of the pairs $(s, a)$ is called a trajectory $Y = s_1, a_1, \ldots, s_{|Y|}, a_{|Y|}$, where $|\cdot|$ denotes the length of the sequence or the size of a set. We use $\hat{a}$ to denote an action taken by the agent according to its policy. Hence, $\hat{Y}$ denotes the agent's trajectory, while $Y$ (or $a$) denotes the human expert's trajectory (or action). The agent is given training examples of $(X, Y)$ to optimize its policy to maximize its expected rewards.
In our work, we introduce the following additional notation. We segment a (long) instruction $X$ into multiple shorter sequences of sentences $\{x_m, m = 1, 2, \ldots, M\}$, to which we refer as BABY-STEPs. Each $x_m$ is interpreted as a micro-instruction that corresponds to a trajectory by the agent, $\hat{y}_m$, and is aligned with a part of the human expert's trajectory, denoted as $y_m$. While the alignment is not available in existing datasets for VLN, we describe how to obtain it in a later section (§ 4.3). Throughout the paper, we also freely interchange the terms "following the $m$th micro-instruction", "executing the BABY-STEP $x_m$", and "completing the $m$th subtask".
We use $t \in [1, |Y|]$ to denote the (discrete) time steps at which the agent takes actions. Additionally, when the agent follows $x_m$, for convenience, we sometimes use $t_m \in [1, |\hat{y}_m|]$ to index the time steps, instead of the "global time" $t = t_m + \sum_{i=1}^{m-1} |\hat{y}_i|$.
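As a small illustration of the indexing convention above, the following sketch converts a local time step within a BABY-STEP into the global time. The function name `global_time` and the list `step_lengths` (holding the lengths $|\hat{y}_i|$ of the agent's sub-trajectories) are hypothetical names introduced only for this example.

```python
def global_time(t_m, m, step_lengths):
    """Convert a local step index t_m (1-based) inside BABY-STEP m (1-based)
    into the global time t = t_m + sum_{i<m} |y_hat_i|.

    step_lengths[i - 1] is assumed to hold |y_hat_i|, the length of the
    agent's i-th sub-trajectory (a hypothetical container for this sketch).
    """
    if not 1 <= m <= len(step_lengths):
        raise ValueError("m must index an existing BABY-STEP")
    if not 1 <= t_m <= step_lengths[m - 1]:
        raise ValueError("t_m must lie within the m-th sub-trajectory")
    # Add the lengths of all sub-trajectories that precede BABY-STEP m.
    return t_m + sum(step_lengths[: m - 1])
```

For instance, with sub-trajectory lengths 4, 5 and 6, the second step of the third BABY-STEP falls at global time 2 + 4 + 5.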

Approach
We describe in detail the 3 key elements in the design of our navigation agent: (i) a memory buffer for storing and recalling past experiences to provide contexts for the current navigation instruction ( § 4.1); (ii) an imitation-learning stage of navigating with short instructions to accomplish a single BABY-STEP ( § 4.2.1); (iii) a curriculum-based reinforcement learning phase where the agent learns with increasingly longer instructions (i.e. multiple BABY-STEPs) ( § 4.2.2). We describe new benchmarks created for learning and evaluation and key implementation details in § 4.3 and § 4.4 (with more details in the Appendix).

The BABYWALK Agent
The basic operating model of our navigation agent BABYWALK is to follow a "micro-instruction" $x_m$ (i.e., a short sequence of instructions, to which we also refer as a BABY-STEP), conditioning on the context $\hat{z}_m$, and to output a trajectory $\hat{y}_m$. A schematic diagram is shown in Fig. 2. What differs in particular from previous approaches is the introduction of a novel memory module. We assume the BABY-STEPs are given at training and inference time; § 4.3 explains how to obtain them if they are not given a priori (readers can move directly to that section and return to this part afterwards). The left of Fig. 3 gives an example of those micro-instructions.
Figure 2: The BABYWALK agent has a memory buffer storing its past experiences of instructions $x_m$ and its trajectory $\hat{y}_m$. When a new BABY-STEP $x_m$ is presented, the agent retrieves from the memory a summary of its experiences as the history context. It takes actions conditioning on the context (as well as its state $s_t$ and the previous action $\hat{a}_t$). Upon finishing following the instruction, the trajectory $\hat{y}_m$ is sent to the memory to be remembered.
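Context The context is a summary of the past experiences of the agent, namely the previous $(m-1)$ micro-instructions and trajectories:
$$\hat{z}_m = g\big(f_{\text{SUMMARY}}(x_1, \ldots, x_{m-1}),\ f_{\text{SUMMARY}}(\hat{y}_1, \ldots, \hat{y}_{m-1})\big) \quad (1)$$
where the function $g$ is implemented with a multi-layer perceptron. The summary function $f_{\text{SUMMARY}}$ is explained below.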
Summary To map variable-length sequences (such as the trajectory and the instructions) to a single vector, we can use various mechanisms such as an LSTM. We report an ablation study on this in § 5.3. In the following, we describe the "forgetting" one, which weighs more heavily towards the most recent experiences and performs the best empirically:
$$f_{\text{SUMMARY}}(x_1, \ldots, x_{m-1}) = \sum_{i=1}^{m-1} \alpha_i \cdot u(x_i) \quad (2)$$
$$f_{\text{SUMMARY}}(\hat{y}_1, \ldots, \hat{y}_{m-1}) = \sum_{i=1}^{m-1} \alpha_i \cdot v(\hat{y}_i) \quad (3)$$
where the weights are normalized to 1 and decrease with how far $i$ is from $m$,
$$\alpha_i \propto \exp\big(-\gamma \cdot \omega(m - 1 - i)\big) \quad (4)$$
Here $\gamma$ is a hyper-parameter (we set it to 1/2) and $\omega(\cdot)$ is a monotonically nondecreasing function, for which we simply choose the identity function. Note that we summarize over representations of the micro-instructions ($x_m$) and of the experiences of executing those micro-instructions ($\hat{y}_m$). The two encoders $u(\cdot)$ and $v(\cdot)$ are described in § 4.4. They are essentially the summaries of "low-level" details, i.e., representations of a sequence of words, or of a sequence of states and actions. While existing work often directly summarizes all the low-level details, we have found that the current form of "hierarchical" summarizing (i.e., first summarizing each BABY-STEP, then summarizing all previous BABY-STEPs) performs better.
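The "forgetting" summary can be illustrated with a short NumPy sketch. It assumes exponentially decaying weights $\alpha_i \propto \exp(-\gamma \cdot \omega(m-1-i))$ with $\omega$ the identity, and it takes the per-BABY-STEP embeddings (the outputs of $u(\cdot)$ or $v(\cdot)$) as given arrays; the function names are hypothetical and the encoders themselves are not modeled here.

```python
import numpy as np


def forgetting_weights(num_past, gamma=0.5, omega=lambda d: d):
    """Normalized weights over the num_past = m - 1 previous BABY-STEPs.

    Assumption for this sketch: alpha_i is proportional to
    exp(-gamma * omega(m - 1 - i)), so the most recent step gets the
    largest weight and gamma = 0 reduces to a plain average.
    """
    if num_past < 1:
        raise ValueError("need at least one past BABY-STEP")
    # Distance of step i = 1..m-1 from the most recent past step.
    distances = np.arange(num_past - 1, -1, -1, dtype=float)
    scores = np.exp(-gamma * omega(distances))
    return scores / scores.sum()


def summarize(embeddings, gamma=0.5):
    """Weighted sum of per-step embeddings, shape (m - 1, d) -> (d,)."""
    embeddings = np.asarray(embeddings, dtype=float)
    alpha = forgetting_weights(len(embeddings), gamma)
    return alpha @ embeddings
```

With `gamma=0.0` the weights become uniform, which matches the observation that the mechanism then degenerates to averaging the memory buffer.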
Policy The agent takes actions, conditioning on the context $\hat{z}_m$ and the current instruction $x_m$:
$$\hat{a}_t \sim \pi\big(\cdot \mid s_t, \hat{a}_{t-1};\ u(x_m), \hat{z}_m\big) \quad (5)$$
where the policy is implemented with an LSTM with the same cross-modal attention between visual states and language as in (Fried et al., 2018).

Learning of the BABYWALK Agent
The agent learns in two phases. In the first one, imitation learning is used where the agent learns to execute BABY-STEPs accurately. In the second one, the agent learns to execute successively longer tasks from a designed curriculum.

Imitation Learning
BABY-STEPs are shorter navigation tasks. With the $m$th instruction $x_m$, the agent is asked to follow the instruction so that its trajectory matches the human expert's $y_m$. To assist the learning, the context is computed from the human expert trajectory up to the $m$th BABY-STEP (i.e., in eq. (1), the $\hat{y}$s are replaced with $y$s). We maximize the objective
$$\ell = \sum_{m=1}^{M} \sum_{t_m=1}^{|y_m|} \log \pi\big(a_{t_m} \mid s_{t_m}, a_{t_m - 1};\ u(x_m), z_m\big)$$
We emphasize here that each BABY-STEP is treated independently of the others in this learning regime. Each time a BABY-STEP is to be executed, we "preset" the agent in the human expert's context and the last visited state. We follow existing literature (Anderson et al., 2018; Fried et al., 2018) and use student-forcing based imitation learning, which uses the agent's predicted action instead of the expert action for the trajectory rollout.
Figure 3: Two-phase learning by BABYWALK. (Left) An example instruction-trajectory pair from the R4R dataset is shown. The long instruction is segmented into four BABY-STEP instructions. We use those BABY-STEPs for imitation learning (§ 4.2.1). (Right) Curriculum-based RL. The BABYWALK agent warm-starts from the imitation learning policy, and incrementally learns to handle longer tasks by executing consecutive BABY-STEPs and getting feedback from external rewards (cf. § 4.2.2). We illustrate two initial RL lectures using the left example.

Curriculum Reinforcement Learning
We want the agent to be able to execute multiple consecutive BABY-STEPs and optimize its performance on following longer navigation instructions (instead of the cross-entropy losses from the imitation learning). However, there is a discrepancy between our goal of training the agent to cope with the uncertainty in a long instruction and the imitation learning agent's ability to accomplish shorter tasks given the human-annotated history. Thus it is challenging to directly optimize the agent with a typical RL learning procedure, even though the imitation learning might have provided a good initialization for the policy; see our ablation study in § 5.3. Inspired by the curriculum learning strategy (Bengio et al., 2009), we design an incremental learning process in which the agent is presented with a curriculum of increasingly longer navigation tasks. Fig. 3 illustrates this idea with two "lectures". Given a long navigation instruction $X$ with $M$ BABY-STEPs, for the $k$th lecture, the agent is given all of the human expert's trajectory up to but not including the $(M-k+1)$th BABY-STEP, as well as the history context $z_{M-k+1}$. The agent is then asked to execute the $k$ micro-instructions from $x_{M-k+1}$ to $x_M$ using reinforcement learning to produce a trajectory that optimizes a task-related metric, for instance the fidelity metric measuring how faithfully the agent follows the instructions. As we increase $k$ from 1 to $M$, the agent faces the challenge of navigating longer and longer tasks with reinforcement learning. However, the agent only needs to improve its skills from its prior exposure to shorter ones. Our ablation studies show this is indeed a highly effective strategy.
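The lecture schedule described above can be sketched as a simple generator. This is only an illustration of how the BABY-STEPs are split between expert-provided history and agent-executed steps in lecture $k$; the function name `curriculum_lectures` and the `max_lectures` cap are hypothetical, and the RL update itself is not shown.

```python
def curriculum_lectures(baby_steps, max_lectures=None):
    """Yield (k, history, to_execute) for lectures k = 1, 2, ...

    In lecture k the expert history covers BABY-STEPs x_1 .. x_{M-k},
    and the agent must execute the last k BABY-STEPs x_{M-k+1} .. x_M.
    """
    num_steps = len(baby_steps)
    num_lectures = num_steps if max_lectures is None else min(num_steps, max_lectures)
    for k in range(1, num_lectures + 1):
        start = num_steps - k  # 0-based index of x_{M-k+1}
        yield k, baby_steps[:start], baby_steps[start:]


if __name__ == "__main__":
    for k, history, todo in curriculum_lectures(["x1", "x2", "x3", "x4"]):
        print(f"lecture {k}: history={history}, execute={todo}")
```

In the final lecture ($k = M$) the history is empty and the agent executes the whole instruction, which is the setting it faces at test time.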

New Datasets for Evaluation & Learning
To our best knowledge, this is the first work studying how well VLN agents generalize to long navigation tasks. To this end, we create the following datasets in the same style as in (Jain et al., 2019).

ROOM6ROOM and ROOM8ROOM
We concatenate the trajectories in the training as well as the validation unseen split of the ROOM2ROOM dataset 3 times and 4 times respectively, thus extending the lengths of navigation tasks to 6 rooms and 8 rooms. To be joined, the end of the former trajectory must be within 0.5 meters of the beginning of the latter trajectory. Table 1 and Fig. 4 contrast the different datasets in the number of instructions, the average length (in words) of instructions and how the distributions vary. Table 1 summarizes the descriptive statistics of BABY-STEPs across all datasets used in this paper. The datasets and the segmentation/alignments are made publicly available.
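The joining condition can be made concrete with a small sketch. It assumes each trajectory is a list of 3D positions in meters and enumerates chains of `n` distinct trajectories whose consecutive end and start points are within 0.5 meters. The names `can_join` and `join_paths`, the point representation, and the restriction to distinct trajectories are assumptions made for illustration, not a description of the released data pipeline.

```python
import math


def can_join(end_xyz, start_xyz, max_gap=0.5):
    """True if the end of one path is within max_gap meters of the next start."""
    return math.dist(end_xyz, start_xyz) <= max_gap


def join_paths(paths, n, max_gap=0.5):
    """Return all index tuples of n distinct paths that can be chained.

    paths: list of trajectories, each a list of (x, y, z) positions.
    """
    chains = []

    def extend(chain):
        if len(chain) == n:
            chains.append(tuple(chain))
            return
        last_end = paths[chain[-1]][-1]
        for j, candidate in enumerate(paths):
            if j not in chain and can_join(last_end, candidate[0], max_gap):
                extend(chain + [j])

    for i in range(len(paths)):
        extend([i])
    return chains
```

Calling `join_paths(paths, 3)` would then correspond to building R6R-style tasks from R2R-style paths, and `join_paths(paths, 4)` to R8R-style tasks.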

Key Implementation Details
In the following, we describe key information for research reproducibility, while the complete details are in the Appendix.

States and Actions
We follow (Fried et al., 2018) to set up the states as the visual features (i.e. ResNet-152 features (He et al., 2016)) from the agent-centric panoramic views in 12 headings × 3 elevations with 30 degree intervals. Likewise, we use the same panoramic action space.
Identifying BABY-STEPs Our learning approach requires an agent to follow micro-instructions (i.e., the BABY-STEPs). Existing datasets (Anderson et al., 2018; Jain et al., 2019; Chen et al., 2019) do not provide fine-grained segmentations of long instructions. Therefore, we use a template matching approach to aggregate consecutive sentences into BABY-STEPs. First, we extract the noun phrases using POS tagging. Then, we employ heuristic rules to chunk a long instruction into shorter segments according to punctuation and landmark phrases (i.e., words for concrete objects). We document the details in the Appendix.

Aligning BABY-STEPs with Expert Trajectory
Without extra annotation, we propose a method to approximately chunk original expert trajectories into sub-trajectories that align with the BABY-STEPs. This is important for imitation learning at the micro-instruction level (§ 4.2.1). Specifically, we learn a multi-label visual landmark classifier to identify concrete objects from the states along expert trajectories, using the landmark phrases extracted from their instructions as weak supervision. For each trajectory-instruction pair, we then extract the visual landmarks of every state as well as the landmark phrases in the BABY-STEP instructions. Next, we perform a dynamic programming procedure to segment the expert trajectories by aligning the visual landmarks and landmark phrases, using the confidence scores of the multi-label visual landmark classifier to form the scoring function.
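To make the dynamic programming step more tangible, the sketch below segments a trajectory of $T$ states into $M$ contiguous, non-empty sub-trajectories. It assumes a precomputed table `score[t][m]` holding the landmark confidence of state `t` for BABY-STEP `m`, and it maximizes the sum of scores at the end state of each segment, with the last segment ending at the last state. This objective and the name `align_steps` are assumptions chosen for illustration; the exact recursion used by the authors may differ.

```python
def align_steps(score):
    """Segment T states into M contiguous non-empty sub-trajectories.

    score[t][m]: landmark score of state t for BABY-STEP m (0-based).
    Returns the inclusive end-state index of every BABY-STEP.
    Assumed objective: maximize the sum of scores at segment end states.
    """
    num_states, num_steps = len(score), len(score[0])
    if num_steps > num_states:
        raise ValueError("need at least one state per BABY-STEP")
    neg_inf = float("-inf")
    best = [[neg_inf] * num_steps for _ in range(num_states)]
    back = [[-1] * num_steps for _ in range(num_states)]
    for t in range(num_states):
        best[t][0] = score[t][0]
        for m in range(1, num_steps):
            # The previous BABY-STEP must end at an earlier state i < t.
            for i in range(m - 1, t):
                candidate = best[i][m - 1] + score[t][m]
                if best[i][m - 1] > neg_inf and candidate > best[t][m]:
                    best[t][m] = candidate
                    back[t][m] = i
    # The final BABY-STEP ends at the last state; follow back-pointers.
    ends = [num_states - 1]
    for m in range(num_steps - 1, 0, -1):
        ends.append(back[ends[-1]][m])
    return ends[::-1]
```

Given the returned end indices, the sub-trajectory of BABY-STEP $m$ consists of the states after the previous end index up to and including its own end index.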

Encoders and Embeddings
The encoder $u(\cdot)$ for the (micro-)instructions is an LSTM. The encoder for the trajectory $y$ contains two separate Bi-LSTMs, one for the state $s_t$ and the other for the action $a_t$. The outputs of the two Bi-LSTMs are then concatenated to form the embedding function $v(\cdot)$. The details of the neural network architectures (i.e., configurations as well as an illustrative figure), optimization hyper-parameters, etc. are included in the Appendix.

Learning Policy with Reinforcement Learning
In the second phase of learning, BABYWALK uses RL to learn a policy that maximizes the fidelity-oriented rewards (CLS) proposed by Jain et al. (2019). We use policy gradient as the optimizer (Sutton et al., 2000). Meanwhile, we set the maximum number of lectures in curriculum RL to be 4, which is studied in Section 5.3.

Experiments
We describe the experimental setup (§ 5.1), followed by the main results in § 5.2, where we show that the proposed BABYWALK agent not only attains competitive results on the in-domain dataset but also generalizes to out-of-domain datasets with varying lengths of navigation tasks. We report results from various ablation studies in § 5.3. While we primarily focus on the ROOM4ROOM dataset, we re-analyze the original ROOM2ROOM dataset in § 5.4 and were surprised to find that the agents trained on it can also generalize.

Experimental Setups.
Datasets We conduct empirical studies on the existing datasets ROOM2ROOM and ROOM4ROOM (Anderson et al., 2018;Jain et al., 2019), and the two newly created benchmark datasets ROOM6ROOM and ROOM8ROOM, described in § 4.3. Table 1 and Fig. 4 contrast their differences.

Agents to Compare to Whenever possible, for all agents we compare to, we either re-run, reimplement or adapt publicly available code from their corresponding authors, following their provided instructions, to ensure a fair comparison. We also "sanity check" by ensuring the results from our implementation and adaptation replicate and are comparable to the reported ones in the literature. We compare our BABYWALK to the following: (1) the SEQ2SEQ agent (Anderson et al., 2018), adapted to the panoramic state and action space used in this work; (2) the Speaker Follower
The last 3 agents are reported to have state-of-the-art results on the benchmark datasets. Except for the SEQ2SEQ agent, all other agents depend on an additional pre-training stage with data augmentation (Fried et al., 2018), which improves results across the board. Thus, we train two BABYWALK agents: one with and the other without the data augmentation.

Main results
In-domain Generalization This is the standard evaluation scenario where a trained agent is assessed on the unseen split from the same dataset as the training data. The leftmost columns in Table 2 report the results where the training data is from R4R. The BABYWALK agents outperform all other agents when evaluated on CLS and SDTW.
When evaluated on SR, FAST performs the best and the BABYWALK agents do not stand out. This is expected: agents that are trained to reach the goal do not necessarily follow instructions better. Note that RCM(FIDELITY) performs well in path-following.
Out-of-domain Generalization While our primary goal is to train agents to generalize well to longer navigation tasks, we are also curious how the agents perform on shorter navigation tasks. The right columns in Table 2 report the comparison. The BABYWALK agents outperform all other agents in all metrics except SR. In particular, on SDTW, the generalization to R6R and R8R is especially encouraging, resulting in almost twice the score of the second-best agent FAST. Moreover, recalling Fig. 1, BABYWALK's generalization to R6R and R8R attains even better performance than the RCM agents that are trained in-domain.
Fig. 5 provides additional evidence of the success of BABYWALK, where we contrast its performance with that of other agents on following instructions of different lengths across all datasets. Clearly, the BABYWALK agent is able to improve very noticeably on longer instructions.
Fig. 6 contrasts visually several agents in executing two (long) navigation tasks. BABYWALK's trajectories are similar to what human experts provide, while other agents' are not.
Table 3 illustrates the importance of having a memory buffer to summarize the agent's past experiences. Without the memory (NULL), generalization to longer tasks is significantly worse. Using an LSTM to summarize is worse than using forgetting to summarize (eqs. (2,3)). Meanwhile, ablating γ of the forgetting mechanism shows that γ = 0.5 is the best value in our hyperparameter search. Note that when γ = 0, this mechanism degenerates to taking the average of the memory buffer, which leads to inferior results.
Table 4 establishes the value of curriculum-based RL (CRL). While imitation learning (IL) provides a good warm-up for SR, significant improvement on the other two metrics comes from the subsequent RL (IL+RL). Furthermore, CRL (with 4 "lectures") provides clear improvements over direct RL on the entire instruction (i.e., learning to execute all BABY-STEPs at once). Each lecture improves over the previous one, especially in terms of the SDTW metric.

Revisiting ROOM2ROOM
Our experimental study has focused on using R4R as the training dataset, as it was established that, as opposed to R2R, R4R distinguishes well an agent that just learns to reach the goal from an agent that learns to follow instructions. Given the encouraging results of generalizing to longer tasks, a natural question to ask is: how well can an agent trained on R2R generalize? Results in Table 5 are interesting. As shown in the top panel, the difference in the averaged performance of generalizing to R6R and R8R is not significant. The agent trained on R4R has a small win on R6R, presumably because R4R is closer to R6R than R2R is. But for even longer tasks in R8R, the win is similar.
In the bottom panel, however, it seems that R2R → R4R is stronger (incurring less loss in performance when compared to the in-domain setting R4R → R4R) than the reverse direction (i.e., comparing R4R → R2R to the in-domain R2R → R2R). This might have been caused by the noisier segmentation of long instructions into BABY-STEPs in R4R. (While R4R is composed of two navigation paths in R2R, the segmentation algorithm is not aware of the "natural" boundaries between the two paths.)

Discussion
There are a few future directions to pursue. First, despite the significant improvement, the gap between short and long tasks is still large and needs to be further reduced. Secondly, richer and more complicated variations between the learning setting and the real physical world need to be tackled. For instance, developing agents that are robust to variations in both visual appearance and instruction descriptions is an important next step.

A Details on BABY-STEP Identification and Trajectory Alignments
In this section, we describe the details of how BABY-STEPs are identified in the annotated natural language instructions and how expert trajectory data are segmented to align with BABY-STEP instructions.

A.1 Identify BABY-STEPs
We identify the navigable BABY-STEPs from the natural language instructions of R2R, R4R, R6R and R8R, based on the following 6 steps:
1. Split sentences and chunk phrases. We split the instructions by periods. For each sentence, we perform POS tagging using the SpaCy (Honnibal and Montani, 2017) package to locate and chunk all plausible noun phrases and verb phrases.
2. Curate noun phrases. We curate noun phrases by removing the stop words (i.e., the, for, from etc.) and isolated punctuation among them, and by lemmatizing each word. The purpose is to collect a concentrated set of semantic noun phrases that contain potential visual objects.
3. Identify "landmark words". Next, given the set of candidate visual object words, we filter out a blacklist of words that either do not correspond to any visual counterpart or are misclassified by the SpaCy package. The word blacklist includes: end, 18 inch, head, inside, forward, position, ground, home, face, walk, feet, way, walking, bit, veer, 've, next, stop, towards, right, direction, thing, facing, side, turn, middle, one, out, piece, left, destination, straight, enter, wait, don't, stand, back, round. We use the remaining noun phrases as the "landmark words" of the sentences. Note that this step identifies the "landmark words" for the later procedure which aligns BABY-STEPs and expert trajectories.
4. Identify verb phrases. Similarly, we use a verb blacklist to filter out verbs that require no navigational actions of the agent. The blacklist includes: make, turn, face, facing, veer.
5. Merge non-actionable sentences. We merge a sentence without landmarks and verbs into the next sentence, as it is likely not actionable.
6. Merge stop sentences. There are sentences that only describe the stop condition of a navigation action, which include verb-noun compositions indicating the stop condition. We detect the sentences starting with wait, stop, there, remain, you will see as the sentences that only describe the stop condition and merge them into the previous sentence. Similarly, we detect sentences starting with with, facing and merge them into the next sentence.
After applying the above 6 heuristic rules to the language instruction, we obtain chunks of sentences that describe the navigable BABY-STEPs of the whole task (i.e., a sequence of navigational sub-goals).
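As a partial illustration of these heuristics, the following sketch implements only the purely string-based rules: splitting by periods (rule 1) and merging stop sentences and "with"/"facing" sentences (rule 6). The POS-tagging based rules (2 to 5) are deliberately omitted, and the function name `segment_instruction` as well as the handling of edge cases (for example, a leading stop sentence) are assumptions of this sketch.

```python
import re

# Sentence openers taken from rule 6; matching is on whole words.
STOP_RE = re.compile(r"(wait|stop|there|remain|you will see)\b")
LEAD_IN_RE = re.compile(r"(with|facing)\b")


def segment_instruction(instruction):
    """Chunk an instruction into BABY-STEP strings using rules 1 and 6 only."""
    sentences = [s.strip() for s in instruction.split(".") if s.strip()]
    steps = []   # each step is a list of sentences
    carry = []   # sentences waiting to be merged into the next sentence
    for sentence in sentences:
        lowered = sentence.lower()
        if LEAD_IN_RE.match(lowered):
            carry.append(sentence)
        elif STOP_RE.match(lowered) and steps and not carry:
            # A stop condition is attached to the previous sentence.
            steps[-1].append(sentence)
        else:
            steps.append(carry + [sentence])
            carry = []
    if carry:  # trailing lead-in sentences have no next sentence
        if steps:
            steps[-1].extend(carry)
        else:
            steps.append(carry)
    return [". ".join(step) + "." for step in steps]
```

For example, "Walk past the sofa. Stop at the door. Facing the stairs. Go up the stairs. Wait there." is chunked into two BABY-STEPs, one ending at the door and one ending at the top of the stairs.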

A.2 Align Expert Trajectories with identified BABY-STEPs
In the previous section, we describe the algorithm for identifying BABY-STEP instructions from the original natural language instructions of the dataset. Now we are going to describe the procedure of aligning BABY-STEPs with the expert trajectories, which segments the expert trajectories according to the BABY-STEPs to create the training data for the learning pipeline of our BABYWALK agent. Note that during the training, our BABYWALK does not rely on the existence of ground-truth alignments between the (micro)instructions and BABY-STEPs trajectories.

Main Idea
The main idea here is to: 1) perform visual landmark classification to produce confidence scores of landmarks for each visual state $s$ along expert trajectories; 2) use the predicted landmark scores and the "landmark words" in BABY-STEPs to guide the alignment between the expert trajectory and BABY-STEPs. To achieve this, we train a visual landmark classifier with weak supervision, namely the trajectory-wise existence of landmark objects. Next, based on the predicted landmark confidence scores, we use dynamic programming (DP) to chunk the expert trajectory into segments and assign the segments to the BABY-STEPs.
Weakly Supervised Learning of the Landmark Classifier. Given the pairs of aligned instructions and trajectories (X, Y) from the original dataset, we train a landmark classifier to detect landmarks mentioned in the instructions. We formulate this as a multi-label classification problem that asks a classifier $f_{\text{LDMK}}(s_t; O)$ to predict all the landmarks $O_X$ of the instruction X given the corresponding trajectory Y. Here, we denote all possible landmarks from the entire dataset by $O$, and the landmarks of a specific instruction X by $O_X$. Concretely, we first train a convolutional neural network (CNN) based on the visual state features $s_t$ to independently predict the existence of landmarks at every time step, then we aggregate the predictions across all time steps to get trajectory-wise logits $\psi$ via max-pooling over all states of the trajectory:
$$\psi = \max_{t = 1, \ldots, |Y|} f_{\text{LDMK}}(s_t; O)$$
Here $f_{\text{LDMK}}$ denotes the independent state-wise landmark classifier, and $\psi$ is the logits before normalization for computing the landmark probability. For the specific details of $f_{\text{LDMK}}$, we input the 6×6 panorama visual feature (i.e., ResNet-152 feature) into a two-layer CNN (with kernel size 3, hidden dimension 128, and ReLU as the non-linearity) to produce feature activations with spatial extents, followed by a global averaging operator over the spatial dimensions and a multi-layer perceptron (2 layers with hidden dimension 512 and ReLU as the non-linearity) that outputs the state-wise logits for all visual landmarks $O$. We then max-pool all the state-wise logits along the trajectory and compute the loss using a trajectory-wise binary cross-entropy between the ground-truth landmark labels (of existence) and the prediction.
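The weak-supervision objective described above (max-pooling of state-wise logits followed by a trajectory-wise binary cross-entropy) can be illustrated with a small standard-library sketch. It assumes the state-wise logits have already been produced by some classifier; the function names are hypothetical, and the CNN and MLP themselves are not modeled here.

```python
import math


def trajectory_logits(state_logits):
    """Max-pool state-wise logits (one list per state) over the trajectory."""
    return [max(column) for column in zip(*state_logits)]


def binary_cross_entropy(logits, labels):
    """Mean BCE between sigmoid(logits) and 0/1 landmark-existence labels."""
    total = 0.0
    for z, y in zip(logits, labels):
        # Numerically stable form: max(z, 0) - z * y + log(1 + exp(-|z|)).
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


# Two states, three candidate landmarks; landmarks 0 and 2 appear in the instruction.
psi = trajectory_logits([[2.0, -3.0, -1.0], [-1.0, -2.0, 4.0]])
loss = binary_cross_entropy(psi, [1, 0, 1])
```

Because the label only says whether a landmark appears somewhere along the trajectory, max-pooling lets a single confident state account for it, which is what makes the supervision weak.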
Aligning BABY-STEPs and Trajectories with Visual Landmarks. Now, suppose we have a sequence of BABY-STEP instructions $X = \{x_m, m = 1, \ldots, M\}$ and its expert trajectory $Y = \{s_t, t = 1, \ldots, |Y|\}$. We can compute the averaged landmark score for the landmarks $O_{x_m}$ that exist in the sub-task instruction $x_m$ on a single state $s_t$:
$$\Psi(t, m) = \frac{\mathbb{1}[O_{x_m}]^\top f_{\text{LDMK}}(s_t; O)}{|O_{x_m}|}$$
where $\mathbb{1}[O_{x_m}]$ represents the one-hot encoding of the landmarks that exist in the BABY-STEP $x_m$, and $|O_{x_m}|$ is the total number of existing landmarks. We then apply dynamic programming (DP) to solve the trajectory segmentation specified by a Bellman equation in recursive form.
Encoders. The instruction encoder u(·) is a unidirectional LSTM with hidden size 512 and a word embedding layer of size 300 (initialized with GloVe embeddings (Pennington et al., 2014)). We use the same encoder for encoding both the previously experienced instructions and the currently executing instruction. The trajectory encoder v(·) contains two separate bidirectional LSTMs (Bi-LSTMs), both with hidden size 512. The first Bi-LSTM encodes $a_{t_i}$ and outputs a hidden state for each time step $t_i$. We then attend the hidden state to the panoramic view $s_{t_i}$ to get a state feature of size 2176 for each time step. The second Bi-LSTM encodes the state feature. We use the trajectory encoder only for encoding the previously experienced trajectories.

BABYWALK Policy
The BABYWALK policy network consists of one LSTM with two attention layers and an action predictor. First, we attend the hidden state to the panoramic view $s_t$ to get a state feature of size 2176. The state feature is concatenated with the previous action embedding and used to update the hidden state with an LSTM of hidden size 512. The updated hidden state is then attended to the context variables (the output of u(·)). For the action predictor module, we concatenate the output of the text attention layer with the summarized past context $\hat{z}_m$ to get an action prediction variable. We then pass the action prediction variable through a 2-layer MLP and take a dot product with the navigable action embeddings to obtain the probability of the next action.
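The final scoring step of the action predictor can be sketched as below. This is an illustrative example only: it assumes the dot-product scores are normalized with a softmax (the text does not name the normalization), and the function name and toy vectors are hypothetical.

```python
import math


def action_probabilities(query, action_embeddings):
    """Softmax over dot products between a query vector and each action embedding."""
    scores = [sum(q * a for q, a in zip(query, emb)) for emb in action_embeddings]
    peak = max(scores)                        # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


# Toy example: the query (MLP output) is most aligned with the first navigable action.
probs = action_probabilities([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
```

Scoring against the embeddings of the currently navigable actions lets the number of candidate actions vary from state to state without changing the network's output layer.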
Model Inference. At inference time, the BABYWALK policy only requires running the heuristic BABY-STEP identification on the test-time instruction. No oracle BABY-STEP trajectories are needed, as the BABYWALK agent rolls out each BABY-STEP by itself.
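The inference procedure can be pictured with the following sketch. It is a simplified illustration under assumptions: `policy` is a hypothetical callable that returns the next state or `None` to stop, the memory buffer is modeled as a plain list of (instruction, trajectory) pairs, and the per-BABY-STEP step limit is exposed as a parameter.

```python
def navigate(baby_steps, policy, start_state, max_steps=10):
    """Roll out each BABY-STEP in order, storing past experience in a memory list."""
    memory = []                       # (instruction, trajectory) pairs seen so far
    state = start_state
    for instruction in baby_steps:
        trajectory = [state]
        for _ in range(max_steps):
            next_state = policy(instruction, state, memory)
            if next_state is None:    # the policy chose to stop this BABY-STEP
                break
            state = next_state
            trajectory.append(state)
        memory.append((instruction, trajectory))
    return memory


# Toy usage: states are integers and each "instruction" is a target position.
def toy_policy(instruction, state, memory):
    return state + 1 if state < instruction else None


history = navigate([2, 5], toy_policy, start_state=0)
```

Each BABY-STEP starts from the state where the previous one ended, so no ground-truth segmentation of the trajectory is needed at test time.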

B.2 Details of Reward Shaping for RL
As mentioned in the main text, we learn the policy by optimizing the fidelity-oriented reward (Jain et al., 2019). We now give the complete details of this reward function. Suppose the total number of roll-out steps is $T = \sum_{i=1}^{M} |\hat{y}_i|$, we would have the following form of reward function: Here, $\hat{Y} = \hat{y}_1 \oplus \ldots \oplus \hat{y}_M$ represents the concatenation of the BABY-STEP trajectories produced by the navigation agent (and we denote by $\oplus$ the concatenation operation).
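To make the notation concrete, the sketch below concatenates the BABY-STEP trajectories and assigns per-step rewards. It is an assumption-laden illustration: it supposes that the fidelity score is given only at the final step T and that all earlier steps receive zero reward, and `fidelity_score` is a hypothetical callable standing in for the fidelity metric.

```python
def shaped_rewards(step_trajectories, expert_path, fidelity_score):
    """Concatenate BABY-STEP trajectories and assign a (assumed) terminal reward."""
    rollout = [state for traj in step_trajectories for state in traj]
    total_steps = len(rollout)        # T = sum of the BABY-STEP trajectory lengths
    rewards = [0.0] * total_steps
    if total_steps:
        # Assumption: only the last step is rewarded, with the fidelity score.
        rewards[-1] = fidelity_score(rollout, expert_path)
    return rollout, rewards


# Toy usage with an exact-match score as a stand-in for a real fidelity metric.
rollout, rewards = shaped_rewards(
    [["a", "b"], ["c"]], ["a", "b", "c"], lambda p, q: float(p == q)
)
```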

B.3 Optimization Hyper-parameters
For each BABY-STEP task, we set the maximal number of steps to 10 and truncate the corresponding BABY-STEP instruction to a length of 100. During both the imitation learning and the curriculum reinforcement learning procedures, we fix the learning rate at 1e-4. In imitation learning, the mini-batch size is set to 100. In curriculum learning, we reduce the mini-batch size as the curriculum advances to reduce memory consumption. For the 1st, 2nd, 3rd and 4th curriculum, the mini-batch size is set to 50, 32, 20, and 20, respectively. During learning, we pre-train our BABYWALK model for 50000 iterations using imitation learning as a warm-up stage. Next, in each lecture (up to 4) of the reinforcement learning (RL), we train the BABYWALK agent for an additional 10000 iterations, and select the best-performing model in terms of SDTW as the starting point for the next lecture. For executing each instruction during RL, we sample 8 navigation episodes before performing any back-propagation. For each learning stage, we use a separate Adam optimizer to optimize all the parameters. We also use L2 weight decay as the regularizer, with its coefficient set to 0.0005. In reinforcement learning, the discount factor γ is set to 0.95.
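For reference, the hyper-parameters listed above can be collected in a single configuration object, as in the sketch below. The values are those stated in this section; the class, field, and function names are hypothetical, and the helper for discounted returns is a generic illustration of how γ is typically applied rather than the authors' code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    max_steps_per_baby_step: int = 10
    max_instruction_length: int = 100
    learning_rate: float = 1e-4
    il_batch_size: int = 100
    il_iterations: int = 50_000
    rl_iterations_per_lecture: int = 10_000
    rl_batch_sizes: tuple = (50, 32, 20, 20)   # lectures 1 to 4
    rl_episodes_per_instruction: int = 8
    weight_decay: float = 5e-4
    discount: float = 0.95

    def rl_batch_size(self, lecture):
        """Mini-batch size for a 1-indexed lecture of the curriculum."""
        return self.rl_batch_sizes[lecture - 1]


def discounted_returns(rewards, gamma):
    """Return-to-go for each step: G_t = r_t + gamma * G_{t+1}."""
    returns, running = [], 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    return returns[::-1]


config = TrainingConfig()
```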

C Additional Experimental Results
In this section, we describe a comprehensive set of evaluation metrics and then show the transfer results of models trained on each dataset, with all metrics. We provide an additional analysis studying the effectiveness of template-based BABY-STEP identification. Finally, we present additional qualitative results.
Complete set of Evaluation Metrics. We adopt the following set of metrics:
• Path Length (PL) is the length of the agent's navigation path.
• Navigation Error (NE) measures the distance between the goal location and the final location of the agent's path.
• Success Rate (SR) measures the average rate of the agent stopping within a specified distance of the goal location (Anderson et al., 2018).
Table 6: Sanity check of model trained on R2R and evaluated on its validation unseen split ( + : pre-trained with data augmentation; :reimplemented or readapted from the original authors' released code).
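The three metrics defined in the list above (PL, NE, SR) can be illustrated with a short sketch. It makes simplifying assumptions: positions are 2D points and distances are Euclidean, whereas a simulator would normally measure distances along its navigation graph; the default success threshold of 3.0 is also an assumption of this example, as are the function names.

```python
import math


def path_length(path):
    """PL: total length of the agent's path (sum of consecutive distances)."""
    return sum(math.dist(a, b) for a, b in zip(path, path[1:]))


def navigation_error(path, goal):
    """NE: distance between the final location of the path and the goal."""
    return math.dist(path[-1], goal)


def success_rate(paths, goals, threshold=3.0):
    """SR: fraction of paths that end within `threshold` of their goal."""
    hits = sum(navigation_error(p, g) < threshold for p, g in zip(paths, goals))
    return hits / len(paths)
```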
As mentioned in the main text, we compare our re-implementations with the originally reported results of the baseline methods on the R2R dataset in Table 6. We found that the results are mostly very similar, indicating that our re-implementations are reliable.

C.2 Complete Curriculum Learning Results
We present the curriculum learning results with all evaluation metrics in Table 7.

C.3 Results of BABY-STEP Identification
We present an additional analysis comparing different BABY-STEP identification methods. We compare our template-based BABY-STEP identification with a simple method that treats each sentence as a BABY-STEP (referred to as sentence-wise), both using the complete BABYWALK model with the same training routine. The results are shown in Table 8. Generally speaking, the template-based BABY-STEP identification provides better performance.

C.4 In-domain Results of Models Trained on Instructions with Different lengths
As mentioned in the main text, we display all the in-domain results of navigation agents trained on R2R, R4R, R6R, and R8R, respectively. The complete results of all different metrics are included in Table 9. We note that our BABYWALK agent consistently outperforms baseline methods on each dataset. It is worth noting that on the R4R, R6R and R8R datasets, RCM(GOAL)+ achieves better results in SPL. This is due to the aforementioned fact that it often takes short-cuts to directly reach the goal, with a significantly shorter trajectory. As a consequence, its success rate weighted by inverse path length is high.
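The effect of short-cuts on SPL can be seen from the standard definition of the metric, in which each successful episode contributes the ratio of the shortest-path distance to the larger of that distance and the length of the path actually taken. The sketch below is a generic illustration of that definition with hypothetical names and toy numbers, not results from the paper.

```python
def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by (normalized inverse) Path Length, averaged over episodes."""
    terms = [
        s * l / max(p, l)
        for s, l, p in zip(successes, shortest_lengths, path_lengths)
    ]
    return sum(terms) / len(terms)


# Toy comparison: both agents succeed, but one takes the short-cut (length 10)
# while the other follows a longer instructed route (length 40).
shortcut_agent = spl([1], [10.0], [10.0])
faithful_agent = spl([1], [10.0], [40.0])
```

In this toy case the short-cut agent scores 1.0 and the route-following agent 0.25, even though only the latter follows the instructed path, which is why SPL alone does not reflect instruction fidelity on long-path datasets.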

C.5 Transfer Results of Models Trained on Instructions with Different lengths
For completeness, we also include all the transfer results of navigation agents trained on R2R, R4R, R6R, and R8R, respectively. The complete results of all different metrics are included in the [...] interpolating to shorter ones, rather than extrapolating to longer instructions, which is intuitively an easier direction.

C.6 Additional Qualitative Results
We present more qualitative results of various VLN agents in Fig. 8. BABYWALK appears to produce trajectories that align better with the human expert trajectories.