RUN through the Streets: A New Dataset and Baseline Models for Realistic Urban Navigation

Following navigation instructions in natural language (NL) requires a composition of language, action, and knowledge of the environment. Knowledge of the environment may be provided via visual sensors or as a symbolic world representation referred to as a map. Previous work on map-based NL navigation relied on small artificial worlds with a fixed set of entities known in advance. Here we introduce the Realistic Urban Navigation (RUN) task, aimed at interpreting NL navigation instructions based on a real, dense, urban map. Using Amazon Mechanical Turk, we collected a dataset of 2515 instructions aligned with actual routes over three regions of Manhattan. We then study which aspects of a neural architecture are important for success on RUN, and empirically show that entity abstraction, attention over words and worlds, and a constantly updating world-state significantly contribute to task accuracy.


Introduction and Background
The task of interpreting and following natural language (NL) navigation instructions involves interleaving different signals, at the very least the linguistic utterance and the representation of the world. For example, in "turn right on the first intersection", the instruction needs to be interpreted, and a specific object in the world (the intersection) needs to be located in order to execute the instruction. In NL navigation studies, the representation of the world may be provided via visual sensors (Nguyen et al., 2018; Yan et al., 2018; Anderson et al., 2018) or as a symbolic world representation. This work focuses on navigation based on a symbolic world representation (referred to as a map).
Previous datasets for NL navigation based on a symbolic world representation, HCRC (Anderson et al., 1991;Vogel and Jurafsky, 2010;Levit and Roy, 2007) and SAIL (MacMahon et al., 2006;Chen and Mooney, 2011;Mooney, 2012, 2013;Artzi and Zettlemoyer, 2013;Artzi et al., 2014;Fried et al., 2017;Andreas and Klein, 2015) present relatively simple worlds, with a small fixed set of entities known to the navigator in advance. Such representations bypass the great complexity of real urban navigation, which consists of long paths and an abundance of previously unseen entities of different types.
In this work we introduce Realistic Urban Navigation (RUN), where we aim to interpret navigation instructions relative to a rich symbolic representation of the world, given by a real dense urban map. To address RUN, we designed and collected a new dataset based on OpenStreetMap, in which we align NL instructions to their corresponding routes. Using Amazon Mechanical Turk, we collected 2515 instructions over three regions of Manhattan, all specified (and verified) by (respective) sets of human workers. This task raises several challenges. First, we assume a large world, providing long routes that are vulnerable to error propagation; second, we assume a rich environment, with entities of various types, most of which are unseen during training and are not known in advance; finally, we evaluate on the full intended route, rather than on the last position only.
We then propose a strong neural baseline for RUN where we augment a standard encoder-decoder architecture with an entity abstraction layer, attention over words and worlds, and a constantly updating world-state. Our experimental results and ablation study show that this architecture is indeed better equipped to treat grounding in realistic urban settings than standard sequence-to-sequence architectures. Given this RUN benchmark, empirical results, and evaluation procedure, we hope to encourage further investigation into the topic of interpreting NL instructions in realistic and previously unseen urban domains.

The RUN Task and Dataset
In this work we address the task of following a sequence of NL navigation instructions given in colloquial language based on a dense urban map. The input to the RUN task is as follows: (i) a map with rich details divided into tiles, (ii) an explicit starting point, and (iii) a sequence of navigation instructions, which we henceforth refer to as a navigation paragraph. We refer to each sentence as an instruction, and we assume that following the individual instructions in the paragraph one by one will lead the agent to the intended end-point. The output of RUN is the entire route described by the paragraph, i.e., all coordinates up to and including its end-point, pinned on the map.
In order to address RUN we designed and collected a novel dataset, henceforth the RUN dataset, which is based on OpenStreetMap (OSM). The map contains rich layers and an abundance of entities of different types. Each entity is complex and can contain (at least) four labels: name, type, is building=y/n, and house number. An entity can spread over several tiles. As the maps do not overlap, only very few entities are shared among them. The RUN dataset aligns NL navigation instructions to coordinates of their corresponding route on the OSM map.
We collected the RUN dataset using Amazon Mechanical Turk (MTurk), allowing only native English speakers to perform the task. We collected instructions based on three different areas, all in dense urban parts of Manhattan. The size of each map is 0.5 km². The dataset contains 2515 navigation instructions (equal to 389 complete paragraphs) paired with their routes. The paragraphs are crowd-sourced from 389 different instructors, whose style and language vary (Geva et al., 2019).
Our data collection protocol is as follows. First, we asked the MTurk worker to describe a route between two landmarks of their choice. After having described the complete route in NL, the same worker was instructed to pin their described route on the map. This was moderated by showing them the paragraph they narrated, sentence by sentence, so that they had to pin each instruction on the map separately. A worker could only pin routes on street paths. Furthermore, on every turn the worker had to mark an explicit point on the map which marked the direction in which the worker needs to move next. An example of simple individual instructions and their respective route is given in Figure 1.
Figure 1: In the full map, many entities are not seen until zoom-in is applied. The navigation paragraph is divided into four sentences: sentence (1) requires a turn action; (2) requires implicit walk actions and an explicit turn; (3) requires walk actions; the last sentence (4) is a verification only and no action is required.
We then asked a disjoint set of workers (testers) to verify the routes by displaying the starting point of the route, and displaying the instructions in the paragraph sentence by sentence. The tester had to pin the final point of each sentence. Each route was tested by three different workers. Testing the routes allowed us to find incorrect routes (paragraphs that do not match an actual path) and discard them. The tests also provide an estimate of the human performance on the task (reported in Section 4, Experiments).
Having collected the data, we divided the map into tiles, each measuring 11.132 m × 11.132 m. Each tile contains the labels of the entities it displays on the map, such as restaurants, traffic-lights, etc., and the walkable streets in it. Each walkable street is composed of an ordered list of tiles, including a starting tile and an end tile. Table 1 shows statistics over the dataset. Table 2 characterizes linguistic phenomena in RUN, categorized according to the catalogue of Chen et al. (2018). Table 3 shows a quantitative comparison of the RUN dataset to previous datasets of map-based navigation. The table underscores some key features of RUN relative to the previous tasks. RUN contains longer paths and many more entities that are unique, thus appearing for the first time during testing; the size of the map is on a different scale than in previous tasks, thus amplifying the grounding challenge; the number of tiles moved is accordingly larger than in previous datasets, hence increasing the vulnerability to error propagation.
Overall, RUN contains linguistic phenomena at least as challenging as those in previous work, and a rich environment, with more realistic paths than in previous tasks.
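The tiled map described above can be pictured with a small data-structure sketch. This is only an illustration under stated assumptions: the class names (`Entity`, `Tile`, `Street`), the field names, and the helper `tile_of` with its metre-based map origin are hypothetical and are not taken from the RUN dataset release; only the tile size and the four entity labels come from the text.

```python
from dataclasses import dataclass, field

TILE_SIZE_M = 11.132  # tile edge length in metres, as reported in the text


@dataclass(frozen=True)
class Entity:
    """A map entity with the four labels listed in the dataset description."""
    name: str
    type: str
    is_building: bool
    house_number: str = ""


@dataclass
class Tile:
    """A grid cell holding the entities displayed on it (hypothetical layout)."""
    row: int
    col: int
    entities: list = field(default_factory=list)


@dataclass
class Street:
    """A walkable street: an ordered list of (row, col) tile coordinates."""
    name: str
    tiles: list

    @property
    def start(self):
        return self.tiles[0]

    @property
    def end(self):
        return self.tiles[-1]


def tile_of(x_m: float, y_m: float) -> tuple:
    """Map a position in metres from an assumed map origin to a (row, col) tile."""
    return (int(y_m // TILE_SIZE_M), int(x_m // TILE_SIZE_M))
```

Since an entity can spread over several tiles, the same `Entity` object may be listed in more than one `Tile` in this sketch.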

Models for RUN
We model RUN as a sequence-to-sequence learning problem, where we map a sequence of instructions to a sequence of actions that should be performed in order to pin the actual path on the map. The execution system we provide for the task defines three types of actions: 'TURN', 'WALK', and 'END'. 'TURN' is one of the following: right-turn, left-turn, turn-around. The turning move is not necessarily an exact 90-degree turn; the execution system looks for the closest turn option. 'WALK' is a change of position in the direction the agent is facing. The streets can be curved, so 'WALK' is relative to the street that the agent is on. Each street is an ordered list of tiles, so walking two steps amounts to two 'WALK' actions in the direction the agent is facing. The 'END' action marks the end of each route. The input consists of an instruction sequence $x_{1:N}$, a map $M$, and a starting point $p_0$ on the map. The output is a sequence of actions $a^*_{1:T}$ to be executed:
$$a^*_{1:T} = \arg\max_{a_{1:T}} P(a_{1:T} \mid x_{1:N}, M, p_0)$$
where $x_i$ denotes a sentence, $a_i$ denotes an action, $M$ is the map, and $p_0$ is the starting point. Our basic model for RUN is a sequence-to-sequence model similar to the work of Mei et al. (2015) on SAIL, and inspired by Xu et al. (2015). It is based on Conditioned Generation with Attention (CGA). To this model we added an Entity abstraction layer (CGAE) and a World-state representation (CGAEW). It thus consists of six components that we describe in turn: Encoder, Decoder, Attention, Entity Abstraction, World-State Processor, and Execution-System. The complete architecture is depicted in Fig. 2.
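The action semantics of the execution system can be illustrated with a toy executor. This is a simplified sketch, not the execution system used in the paper: it assumes a straight grid with four compass headings, whereas the text states that turns snap to the closest turn option and that 'WALK' follows possibly curved streets. The action names `TURN_RIGHT`, `TURN_LEFT`, and `TURN_AROUND`, the `walkable` set, and the choice to ignore an invalid 'WALK' are all assumptions made for illustration.

```python
# Headings as (d_row, d_col): north, east, south, west (assumed grid layout).
HEADINGS = [(-1, 0), (0, 1), (1, 0), (0, -1)]


def execute(actions, start, heading, walkable):
    """Run a list of actions and return (route, final_heading).

    actions  -- sequence of 'WALK', 'TURN_RIGHT', 'TURN_LEFT', 'TURN_AROUND', 'END'
    start    -- (row, col) tile of the starting point
    heading  -- index into HEADINGS
    walkable -- set of (row, col) tiles that lie on a street
    """
    pos = start
    route = [pos]
    for action in actions:
        if action == "END":
            break  # 'END' marks the end of the route
        if action == "WALK":
            d_row, d_col = HEADINGS[heading]
            nxt = (pos[0] + d_row, pos[1] + d_col)
            if nxt in walkable:  # assumption: an invalid walk is ignored
                pos = nxt
                route.append(pos)
        elif action == "TURN_RIGHT":
            heading = (heading + 1) % 4
        elif action == "TURN_LEFT":
            heading = (heading - 1) % 4
        elif action == "TURN_AROUND":
            heading = (heading + 2) % 4
        else:
            raise ValueError(f"unknown action: {action}")
    return route, heading
```

Because each action is executed separately, walking two tiles is expressed as two consecutive 'WALK' actions, matching the description in the text.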
The Encoder takes the sequence of words that make up a single sentence and encodes it as a vector using a biLSTM (Graves and Schmidhuber, 2005). The Decoder is an LSTM generating a sequence of actions that the execution-system can perform, according to weights defined by an Attention layer. The Entity Abstraction component deals with out-of-vocabulary (OOV) words. We adopt a similar approach to Iyer et al. (2017) and Suhr et al. (2018), replacing phrases in the sentences which refer to previously unseen entities with variables, prior to delivering the sentence to the Encoder. For example, "Walk from Macy's to 7th street" turns into "Walk from X1 to Y1". Variables are typed (streets, restaurants, etc.) and are numbered based on their order of occurrence in the sentence. The numbering resets after every utterance, so the model is left with only a handful of typed entity-variables. The World-State Processor maps variables to the entities on the map which are mentioned in the sentence. The world-state representation consists of two vectors, one representing the entities at the current position, and one representing the entities in the path ahead. The Attention layer considers the sequence of encoded words as well as the current world-state, and provides weights on the words for each of the decoder steps. In both training and testing, the Execution-System executes each action separately to produce the next position.
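The entity abstraction step can be sketched as a substitution over a lexicon of known entity mentions. This is a minimal illustration rather than the authors' implementation: the function name `abstract_entities`, the `lexicon` argument (surface phrase mapped to a type prefix), and the decision to reuse a variable when the same phrase recurs within a sentence are assumptions. How mentions are detected in the first place is not covered here.

```python
import re


def abstract_entities(sentence, lexicon):
    """Replace entity mentions with typed, numbered variables.

    lexicon -- dict mapping a surface phrase to a type prefix, e.g.
               {"Macy's": "X", "7th street": "Y"} (hypothetical prefixes)
    Returns the abstracted sentence and a variable-to-phrase binding.
    Counters start fresh on every call, so numbering resets per sentence.
    """
    if not lexicon:
        return sentence, {}
    counters = {}   # type prefix -> number of variables issued so far
    seen = {}       # phrase -> variable already assigned in this sentence
    bindings = {}   # variable -> phrase
    # Try longer phrases first so that a long name wins over its substring.
    phrases = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(p) for p in phrases) + r")(?!\w)")

    def substitute(match):
        phrase = match.group(0)
        if phrase not in seen:
            prefix = lexicon[phrase]
            counters[prefix] = counters.get(prefix, 0) + 1
            seen[phrase] = f"{prefix}{counters[prefix]}"
            bindings[seen[phrase]] = phrase
        return seen[phrase]

    return pattern.sub(substitute, sentence), bindings
```

With the lexicon `{"Macy's": "X", "7th street": "Y"}`, the sentence "Walk from Macy's to 7th street" becomes "Walk from X1 to Y1", and the returned bindings are what a world-state processor would need in order to map each variable back to an entity on the map.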

Experiments
We evaluate our model on RUN and assess the contribution of the particular components that we added on top of the standard CGA model. We trained the model using a negative log-likelihood loss and Adam optimization (Kingma and Ba, 2014). For weight initialization we rely on Glorot and Bengio (2010). We used a grid search to validate the hyper-parameters. The model converged at around 30 epochs and produced good results with 0.9 dropout and a beam of size 4. During inference, we seek the best candidate path using beam search and normalize the scores of the sequences according to Wu et al. (2016).
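The score normalization can be illustrated with the length penalty from Wu et al. (2016), lp(Y) = ((5 + |Y|) / 6)^α, which divides the summed log-probability of a candidate. The sketch below is an assumption about how this would be applied to action sequences: the value α = 0.6 is a placeholder, not a setting reported in this paper, and the coverage penalty that Wu et al. also define is omitted.

```python
def length_penalty(length, alpha=0.6):
    """Length penalty of Wu et al. (2016): ((5 + |Y|) / (5 + 1)) ** alpha."""
    return ((5.0 + length) / 6.0) ** alpha


def normalized_score(log_prob, length, alpha=0.6):
    """Summed log-probability of a sequence divided by its length penalty."""
    return log_prob / length_penalty(length, alpha)


def best_candidate(candidates, alpha=0.6):
    """Pick the best beam candidate.

    candidates -- list of (actions, summed_log_prob) pairs from beam search
    """
    return max(candidates,
               key=lambda c: normalized_score(c[1], len(c[0]), alpha))
```

Without normalization (α = 0), beam search tends to prefer short sequences because every extra action adds a negative log-probability term; dividing by the penalty lets a longer route compete with one that ends early.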
We follow the evaluation methodology defined by Chen and Mooney (2011) for SAIL, where we use three-fold validation, and in each fold we use two maps for training (90%) and validation (10%) and test on the third one. We report a size-weighted average test result. For all models we report the accuracy on single sentences and on full paragraphs. Success is measured by generating an exact route, without straying from the path. The last position on the path should be within a Euclidean distance of five tiles from the intended destination, as the position described in the instruction might not be specific enough to identify a single tile. In single sentences, the last position should also be facing the correct direction. We provide three simple baselines for the RUN task: (1) NO-MOVE: the only position considered is the starting point; (2) RANDOM: as in Anderson et al. (2018), turn to a randomly selected heading, then execute a number of 'WALK' actions of an average route; (3) JUMP: at each sentence, extract entities from the map and move between them in the order they appear. If the 'WALK' action is invalid we take a random 'TURN' action. Table 4 shows the results for the baseline models as well as the HUMAN measured performance on the task. The human performance provides an upper bound for the RUN task performance, while the simple baselines provide lower bounds. The best baseline model is NO-MOVE, reaching an accuracy of 30.3% on single sentences and 0.3% on complete paragraphs. For the HUMAN case, paragraph accuracy reaches above 80%. Table 4 also shows the results of our model as an ablation study, and Table 5 shows typical errors of each variant. We see that CGAE outperforms CGA, as the swap of entities with variables lowers the complexity of the language that the model needs to learn, allowing the model to effectively cope with unseen entities at test time.
We further found that, in many cases, CGAE produces the right type of action, but it does not produce enough of it to reach the intended destination. We attribute these errors to the absence of a world-state representation, resulting in an inability to ground instructions to specific locations. CGAEW improves upon CGAE, as the presence of the world-state in the score of the attention layer allows the model to better learn the grounding of entities in the instruction to the map. However, our best model still fails on features not captured by our world-state: abstract unmarked entities such as blocks, intersections, etc., and generic entities such as traffic-lights (Table 5).
Note on the five-tile threshold: we selected this distance as it was the average distance between the point reached by our successful tester-workers and the intended pinned point.
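Two pieces of the evaluation protocol in this section, the five-tile end-point tolerance and the size-weighted average over folds, can be sketched as follows. This is an illustrative reading, not the authors' evaluation script: the function names and argument layout are hypothetical, the exact-route requirement is not modeled, and positions are assumed to be (row, col) tile coordinates.

```python
import math


def endpoint_ok(predicted_end, gold_end, max_tiles=5.0,
                predicted_heading=None, gold_heading=None):
    """Check the end-point criterion.

    The final tile must lie within `max_tiles` (Euclidean, in tile units)
    of the intended destination. When headings are supplied, as for
    single-sentence evaluation, they must also match.
    """
    close = math.dist(predicted_end, gold_end) <= max_tiles
    if gold_heading is None:
        return close
    return close and predicted_heading == gold_heading


def size_weighted_accuracy(folds):
    """Average per-fold accuracy weighted by the number of test examples.

    folds -- list of (accuracy, num_test_examples) pairs, one per test map
    """
    total = sum(n for _, n in folds)
    return sum(acc * n for acc, n in folds) / total
```

Weighting by test-set size keeps a map with few paragraphs from counting as much as a map with many when the three folds are combined.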

Conclusion
We introduce RUN, a new task and dataset for NL navigation in realistic urban environments. We collected (and verified) NL navigation instructions aligned with actual paths, and proposed a strong neural baseline for the task. Our ablation studies show the significant contribution of each of the components we propose. In the future we plan to extend the world-state representation, and to enable the model to ground generic and abstract concepts as well. We further intend to add additional signals, for instance from vision (cf. Chen et al. (2018)), for more accurate localization.