Learning to Stop: A Simple yet Effective Approach to Urban Vision-Language Navigation

Vision-and-Language Navigation (VLN) is a natural language grounding task in which an agent learns to follow language instructions and navigate to specified destinations in real-world environments. A key challenge is to recognize and stop at the correct location, especially in complicated outdoor environments. Existing methods treat the STOP action the same as all other actions, which results in undesirable behavior: the agent often fails to stop at the destination even though it may be on the right path. Therefore, we propose Learning to Stop (L2Stop), a simple yet effective policy module that differentiates STOP from other actions. Our approach achieves a new state of the art on the challenging urban VLN dataset Touchdown, outperforming the baseline by 6.89% (absolute improvement) on Success weighted by Edit Distance (SED).


Introduction
Vision-and-language navigation (VLN) aims at training an agent to navigate in real environments by following natural language instructions. Compared to indoor VLN (Anderson et al., 2018), navigation in urban environments (Chen et al., 2019) is particularly challenging, since urban environments are often more diverse and complex. Several research studies (Mirowski et al., 2018; Li et al., 2019; Bruce et al., 2018) have been conducted to address the problem. In this paper, we also focus on the urban VLN task. As shown in Fig. 1, given a natural language instruction, the agent perceives the local visual scene and chooses an action at every time step, learning to match the instruction with the produced trajectory and navigate to the destination. Existing VLN models (Wang et al., 2019; Tan et al., 2019; Ke et al., 2019; Ma et al., 2019a,b; Fried et al., 2018; Wang et al., 2018) neglect the importance of the STOP action and treat all actions equally. However, this can lead to undesirable behavior, also noticed in Cirik et al. (2018) and Blukis et al. (2018), in which the agent fails to stop at the target although it might be on the right path, because the STOP action is severely underestimated.
We argue that the STOP action in the urban VLN task is crucially important and deserves special treatment. First, in contrast to errors on other actions, which are likely to be fixed later in the journey, the price of stopping at a wrong location is higher: producing STOP terminates the episode, leaving no chance to fix a wrong stop. Second, the occurrence count of STOP is much lower than that of other actions, as it appears only once per episode. Thus STOP will receive less attention if we treat all actions equally and ignore the difference in occurrence frequency. Moreover, STOP and the other actions require different understandings of the dynamics between the instruction and the visual scene. Both require aligning trajectories with instructions, but STOP emphasizes the completeness of the instruction and the match between the inferred target and the surrounding scene, while choosing directions requires a planning ability to imagine the future trajectory.
Therefore, we introduce a Learning to Stop (L2STOP) module to address these issues. L2STOP is a simple and model-agnostic module that can be easily plugged into VLN models to improve their navigation performance. As we demonstrate in Fig. 1, the L2STOP module consists of a Stop Indicator that determines whether to stop and a Direction Decider that chooses directions at key points. In addition, we weigh the STOP action more heavily than other actions in the loss function, forcing the agent to pay more attention to it. We conduct experiments on the language-grounded street-view navigation dataset TOUCHDOWN (Chen et al., 2019). Extensive results show that our proposed approach significantly improves performance over the baseline model on all metrics and achieves new state-of-the-art results on the TOUCHDOWN dataset.

Approach
Fig. 2 illustrates the framework of our L2STOP model. Specifically, a text encoder and a visual encoder are used to obtain the text and visual representations. The trajectory encoder then uses these representations to compute the hidden context state, which is the input of the policy module. Unlike previous VLN models, which use a single-branch policy module, we use our proposed L2STOP module, a two-branch policy module that separates the policies for STOP and the other actions. We detail each component below.

Visual and Text Encoder
As shown in Fig. 2, we use the two encoders from Chen et al. (2019) to encode the visual scene and the language instruction, respectively. For the visual input, we apply a CNN (Krizhevsky et al., 2012) as the visual encoder to extract a visual representation $v_t$ from the current visual scene at time step $t$. For the text, we adopt an LSTM (Hochreiter and Schmidhuber, 1997) as the text encoder to obtain the instruction representation $X = \{x_1, x_2, \ldots, x_l\}$. We then use soft-attention (Vaswani et al., 2017) to compute the grounded textual feature $\tilde{x}_t$ at time step $t$:

$$\alpha_{t,l} = \mathrm{softmax}_l\big(x_l^{\top} W_x h_{t-1}\big), \qquad \tilde{x}_t = \sum_{l} \alpha_{t,l}\, x_l \qquad (1)$$

where $W_x$ denotes parameters to be learned, $\alpha_{t,l}$ denotes the attention weight over the $l$-th feature vector at time $t$, and $h_{t-1}$ denotes the hidden context at the previous time step. The agent then produces the hidden context at the current step:

$$h_t = \mathrm{RNN}\big([\tilde{x}_t ; v_t ; e(a_{t-1})],\, h_{t-1}\big) \qquad (2)$$

where $e(a_{t-1})$ is the embedding of the previous action.
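As a concrete illustration, the soft-attention grounding in Eq. (1) can be sketched in a few lines of NumPy. This is a minimal sketch under our own naming (`ground_instruction`, `W_x` as a plain matrix); the paper's actual implementation details may differ.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def ground_instruction(X, h_prev, W_x):
    """Soft-attention over instruction features X (l x d), conditioned on
    the previous hidden context h_prev (d,).
    Returns the grounded textual feature and the attention weights."""
    scores = X @ (W_x @ h_prev)   # one score x_l^T W_x h_{t-1} per feature
    alpha = softmax(scores)       # attention weights alpha_{t,l}
    x_t = alpha @ X               # grounded textual feature, a weighted sum
    return x_t, alpha
```

The grounded feature is simply a convex combination of the instruction feature vectors, re-weighted at every time step by the current navigation context.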

Learning to Stop Policy Module
Unlike existing methods, which view all actions as equally important, we propose the L2STOP module, which helps the agent learn whether to stop and where to go next with separate policy branches, the Stop Indicator and the Direction Decider.

Stop Indicator
The stop indicator produces a stop or non-stop signal at every time step. At time step $t$, it takes the hidden context $h_t$ and a learned time embedding $t$ as input and outputs the probabilities of the non-stop and stop signals:

$$[s_{t,1}, s_{t,2}] = \mathrm{softmax}\big(g_2([h_t ; t])\big)$$

where $g_2$ is a linear layer, and $s_{t,1}$ and $s_{t,2}$ are the probabilities of the non-stop and stop signals at time step $t$, respectively. If the stop indicator produces a stop signal, the agent stops immediately. Otherwise, the direction decider chooses a direction to go next.
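A minimal NumPy sketch of the stop indicator follows; the linear layer $g_2$ is represented by a weight matrix `W` and bias `b`, and the `threshold` on the stop probability is our own illustrative parameter (a threshold on the stop signal is analyzed in Appendix A.2).

```python
import numpy as np

def stop_indicator(h_t, t_emb, W, b, threshold=0.5):
    """Compute [s_{t,1}, s_{t,2}] = softmax(g_2([h_t; t_emb])) and decide
    whether to stop, by thresholding the stop probability s_{t,2}."""
    z = W @ np.concatenate([h_t, t_emb]) + b  # linear layer g_2
    z = z - z.max()                           # numerically stable softmax
    probs = np.exp(z) / np.exp(z).sum()
    return probs, bool(probs[1] > threshold)
```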

Direction Decider
The direction decider selects actions from a subset of the original action space; this subset includes all actions except STOP (go forward, turn left, and turn right). Empirically, we observe that when navigating in urban environments, the agent only needs to choose directions at the intersections (nodes with more than two neighbors) it encounters along the journey. Therefore, we view these intersections as key points on the road and assume that the direction decider only needs to choose directions at key points, always going forward otherwise. At time step $t$, if the agent is at a key point, the direction decider is activated; it takes the hidden context $h_t$ and a learned time embedding $t$ as input and outputs the probability of each action in its action space:

$$p_{t,k} = \mathrm{softmax}_k\big(g_1([h_t ; t])\big)$$

where $g_1$ is a linear layer and $p_{t,k}$ is the probability of the $k$-th action at time step $t$.
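The key-point gating described above can be sketched as follows (our own function and variable names; $g_1$ is again a plain weight matrix and bias):

```python
import numpy as np

# STOP is handled separately by the stop indicator, so it is absent here.
ACTIONS = ["forward", "left", "right"]

def choose_direction(h_t, t_emb, W, b, at_key_point):
    """Direction decider: only active at key points (intersections);
    off key points the agent always goes forward."""
    if not at_key_point:
        return "forward", None
    z = W @ np.concatenate([h_t, t_emb]) + b  # linear layer g_1
    z = z - z.max()
    p = np.exp(z) / np.exp(z).sum()           # action probabilities p_{t,k}
    return ACTIONS[int(p.argmax())], p
```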

Learning
We use Teacher-Forcing (Luong et al., 2015) to train the model. We have two loss functions, $L_{direction}$ and $L_{stop}$, for the direction decider and the stop indicator, respectively. $L_{direction}$ is a standard cross-entropy loss:

$$L_{direction} = -\sum_t \sum_k q_{t,k} \log p_{t,k}$$

where $q_{t,k}$ denotes the ground-truth label for each action at time step $t$. For the stop indicator, we use a weighted cross-entropy loss, in which we assign a greater weight to the stop signal, forcing the agent to pay more attention to the stop action:

$$L_{stop} = -\sum_t \big( o_t \log s_{t,1} + \lambda (1 - o_t) \log s_{t,2} \big)$$

where $o_t$ is the ground-truth non-stop signal and $\lambda$ is the weight for the stop signal. Finally, the agent is optimized with a weighted sum of the two losses:

$$L = \gamma L_{direction} + (1 - \gamma) L_{stop}$$

where $\gamma$ is the weight balancing the two losses.

Implementation Details

The visual encoder is a three-layer CNN. The first layer uses 32 8 × 8 kernels with stride 4, and the second layer uses 64 4 × 4 kernels with stride 4; a ReLU nonlinearity follows each convolutional operation. A single fully-connected layer (with biases) of size 256 follows. An action embedding layer of size 16 maps the previous action at every time step. We then concatenate the text representation, the visual representation, and the action embedding to form the input of the trajectory encoder, a single-layer RNN with 256 hidden units. The time embedding layer is a single fully-connected layer (with biases) of size 32. Both the stop indicator and the direction decider consist of a single-layer perceptron with biases followed by a softmax operation to compute the action probabilities.
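The weighted stop loss and the combined training objective described in the Learning subsection can be sketched as follows (a minimal NumPy sketch; the function names are ours, and we use summed rather than averaged losses purely for illustration):

```python
import numpy as np

def stop_loss(s, o, lam):
    """Weighted cross-entropy for the stop indicator.
    s:   (T, 2) probabilities [non-stop, stop] per time step;
    o:   (T,) ground-truth non-stop signals (1 = non-stop, 0 = stop);
    lam: extra weight lambda on the (rare) stop signal."""
    eps = 1e-12  # avoid log(0)
    per_step = -(o * np.log(s[:, 0] + eps)
                 + lam * (1 - o) * np.log(s[:, 1] + eps))
    return per_step.sum()

def total_loss(l_direction, l_stop, gamma):
    """Weighted sum of the two branch losses."""
    return gamma * l_direction + (1 - gamma) * l_stop
```

Because the stop signal occurs only once per episode, scaling its log-likelihood term by `lam` > 1 is what keeps the STOP action from being drowned out by the far more frequent non-stop steps.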

Experimental Results
We compare the performance of our approach with the following baselines: (1) Random: takes a random action at each time step.
(2) GA and RCONCAT: the baseline models reported in the original dataset paper (Chen et al., 2019). We adapt the RCONCAT model by equipping it with an attention mechanism over the instruction representation to obtain our Attention-RConcat (ARC) model, which outperforms RCONCAT. We then integrate ARC with the proposed L2STOP module, which further boosts performance on all metrics and achieves the best results on both the development and test sets.
In Table 1, our approach substantially outperforms the baseline models, improving SED from 9.45% to 16.34%. Significant improvements on both goal-oriented metrics (TC, SED) and path-alignment metrics (CLS, SDTW) demonstrate the effectiveness of the L2STOP model in instruction following and goal achievement, which also validates that L2STOP learns not only where to go but also where to stop.

Modularity
We compare the baseline models with and without the L2STOP module; the results are shown in Table 1. Integrated with the L2STOP module, both baseline models improve on all metrics. This demonstrates that our approach is model-agnostic and generalizable: the L2STOP module can be plugged into other VLN models and enhance their navigation performance in urban environments.

Ablation Study
Effect of Individual Components We conduct an ablation study on the development set to illustrate the effect of each component in Table 2. Rows 2-4 show the influence of each component, removed in turn from the final model (ARC with the L2STOP module). Removing any of the components results in worse performance, confirming that all components of our model are indispensable. Row 2 shows the results of ARC with only one policy module, which disables the turn-left and turn-right actions when the agent is not at a key point. These results demonstrate the effectiveness of the two-branch structure in providing separate sub-policies for STOP and the other actions. Row 3 shows the results of the model whose Direction Decider makes decisions at every time step instead of only at key points; these results validate the effectiveness of choosing directions only at key points. Row 4 shows the results when the stop signal's weight equals the non-stop signal's weight in the Stop Indicator's loss function. These results, the worst of all, validate the importance of the STOP action: when stop and non-stop signals are treated equally, the agent prefers non-stop because of its higher occurrence frequency.
Which Is More Important, Stop or Direction?
In Table 3, we analyze the stop branch: in Row 3, the predicted stop signals are replaced with ground-truth stop signals whenever the agent reaches the target. First, Row 2 shows that the stop branch has about a 30% chance of stopping at the right position when the agent is on the right path. Second, the performance in Row 3 is much greater than that in Row 1, indicating that although our approach improves the agent's stopping ability, performance is still seriously limited by the wrong-stop problem. This suggests that the wrong-stop problem in VLN deserves more attention and further study.

Case Study
We provide visualizations of two qualitative examples in Figure 3 to further illustrate how our L2STOP model learns to stop better. In both cases, our model and the baseline model are on the right path to the target, but the baseline stops either too late or too early. Specifically, in (a), the baseline agent fails to recognize the black fire hydrant at the target and instead stops at a place where another black fire hydrant is visible. In (b), the baseline agent successfully recognizes the parking pay station on the right, but it ignores the instruction "slightly past it" and stops immediately. In contrast, our agent stops at the right place in both cases.

Conclusion
We investigate the importance of the STOP action and study how to learn a policy that can not only make better decisions on where to go but also stop more accurately. We propose the L2STOP module for the vision-language navigation task situated in urban environments. Experiments illustrate that L2STOP is modular and can be plugged into other VLN models to further boost their performance in urban environments.

A.1 Analysis of Model Structure
In Fig. 4, we examine four model structures to evaluate the interactions between the two branches: (1) Separate Enc-Dec model, where two encoder-decoder models are trained separately for the two branches.
(2) Shared Enc model, which has a shared encoder but uses two different decoders for the two branches.
(3) Shared Dec model, which has separate encoders for the linguistic and visual inputs but a shared trajectory decoder. (4) Shared Enc-Dec model, which shares both the encoder and the decoder. Note that this is the final architecture we use, as described in Sec. 2.

A.2 Hyper-Parameters Sensitivity Analysis
Threshold for Stop Signal We study the sensitivity of the threshold for stop signals on the development set. The result is shown in Fig. 5 (a). Task Completion (TC) is consistent across a large range of thresholds, with a slight drop when the threshold exceeds 0.7 and sharp decreases when the threshold is close to 0 or 1. These results demonstrate that our approach is insensitive to the choice of threshold for stop signals. The consistency of the performance implies that the scores of stop signals are either low or high, rarely intermediate. This shows that our approach enables the agent to pay more attention to STOP; that is, the agent is cautious about deciding to stop and only stops when it is highly confident that it has reached the goal.

Direction Branch Weight
We study the sensitivity of the direction branch weight γ on the development set. As depicted in Fig. 5 (b), the optimal value for γ is 0.6, which demonstrates that balancing the loss functions of the two branches enables the agent to not only select correct directions at key points but also stop at the right place. As shown in the figure, smaller γ (0-0.5) results in relatively worse performance than higher γ, indicating that a small γ forces the agent to concentrate too much on STOP and neglect direction selection. The consistently good performance with larger γ (0.6-0.85) shows that only a small weight on the stop branch is needed to significantly improve the agent's stopping ability.

Stop Signal Weight
We study the sensitivity of the stop signal weight λ on the development set. As shown in Fig. 5 (c), the optimal value for λ is 20. When λ = 0, our model's performance is similar to that of the ARC model (15.53, as shown in Table 1). When λ is set higher, TC fluctuates but is consistently better than ARC's performance; only when λ increases to as much as 80 does performance decline sharply. This demonstrates the effectiveness of our proposed weighted cross-entropy loss function, which consistently improves the agent's stopping ability across a large range of λ.