Room-Across-Room: Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding

We introduce Room-Across-Room (RxR), a new Vision-and-Language Navigation (VLN) dataset. RxR is multilingual (English, Hindi, and Telugu) and larger (more paths and instructions) than other VLN datasets. It emphasizes the role of language in VLN by addressing known biases in paths and eliciting more references to visible entities. Furthermore, each word in an instruction is time-aligned to the virtual poses of instruction creators and validators. We establish baseline scores for monolingual and multilingual settings, and for multitask learning when Room-to-Room annotations are included. We also provide results for a model that learns from synchronized pose traces by focusing only on portions of the panorama attended to in human demonstrations. The size, scope and detail of RxR dramatically expands the frontier for research on embodied language agents in simulated, photo-realistic environments.


Introduction
Vision-and-Language Navigation (VLN) tasks require computational agents to mediate the relationship between language, visual scenes and movement. Datasets have been collected for both indoor (Anderson et al., 2018b; Thomason et al., 2019b; Qi et al., 2020) and outdoor (Chen et al., 2019; Mehta et al., 2020) environments; success in these is based on clearly-defined, objective task completion rather than language or vision specific annotations. These VLN tasks fall in the Goldilocks zone: they can be tackled, but not solved, with current methods, and progress on them makes headway on real world grounded language understanding.
We introduce Room-Across-Room (RxR), a VLN dataset that addresses gaps in existing ones by (1) including more paths that (2) counter known biases in existing datasets, and (3) collecting an order of magnitude more instructions for (4) three languages (English, Hindi and Telugu) while (5) capturing annotators' 3D pose sequences. As such, RxR includes dense spatiotemporal grounding for every instruction, as illustrated in Figure 1.
We provide monolingual and multilingual baseline experiments using a variant of the Reinforced Cross-Modal Matching agent (Wang et al., 2019). Performance generally improves with monolingual training, and with using RxR's follower paths in addition to its guide paths. We also concatenate R2R and RxR annotations as a simple multitask strategy (Wang et al., 2020): the agent trained on both datasets obtains across-the-board improvements.
RxR contains 126K instructions covering 16.5K sampled guide paths and 126K human follower demonstration paths. The dataset is available at https://github.com/google-research-datasets/RxR. We plan to release a test evaluation server, our annotation tool, and code for all experiments.

Motivation
A number of VLN datasets situated in photorealistic 3D reconstructions of real locations contain human instructions or dialogue: R2R (Anderson et al., 2018b), Touchdown (Chen et al., 2019; Mehta et al., 2020), CVDN (Thomason et al., 2019b) and REVERIE (Qi et al., 2020). RxR addresses shortcomings of these datasets, in particular multilinguality, scale, fine-grained word grounding, and human follower demonstrations (Table 1). It also addresses path biases in R2R. More broadly, our work is related to instruction-guided household task benchmarks such as ALFRED (Shridhar et al., 2020) and CHAI (Misra et al., 2018). These synthetic environments provide interactivity but are generally less diverse, less visually realistic and less faithful to real world structures than the 3D reconstructions used in VLN.
Multilinguality. The dominance of high resource languages is a pervasive problem, as it is unclear whether research findings generalize to other languages (Bender, 2009). The issue is particularly severe for VLN. Chen and Mooney (2011) translated ∼1K English navigation instructions into Chinese for a game-like simulated 3D environment. Otherwise, all publicly available VLN datasets we are aware of have English instructions.
To enable multilingual progress on VLN, RxR includes instructions for three typologically diverse languages: English (en), Hindi (hi), and Telugu (te). The English portion includes instructions by speakers in the USA (en-US) and India (en-IN). Unlike Chen and Mooney (2011), and like the TyDi QA multilingual question answering dataset (Clark et al., 2020), RxR's instructions are not translations: all instructions are created from scratch by native speakers. This especially matters for VLN, as different languages encode spatial and temporal information in idiosyncratic ways: e.g., how contact/support relationships are expressed (Munnich et al., 2001), the frame of reference used (Haun et al., 2011), and how temporal relations are expressed (Bender and Beller, 2014).
Scale. Embodied language tasks suffer from a relative paucity of training data; for VLN, this has led to a focus on data augmentation (Fried et al., 2018; Tan et al., 2019), pre-training (Wang et al., 2019; Huang et al., 2019; Li et al., 2019), multi-task learning (Wang et al., 2020) and better generalization through piece-wise curriculum design (Zhu et al., 2020). To address this shortage, for each language RxR contains 14K paths with 3 instructions per path, for a total of 126K instructions and 10M words (based on whitespace tokenization). As illustrated in Table 1, this is an order of magnitude larger than previous datasets.

Fine-Grained Grounding. Like R2R, RxR's instructions are collected by immersing Guide annotators in a simulated first-person environment backed by the Matterport3D dataset (Chang et al., 2017) and asking them to describe predefined paths. RxR also enhances each instruction with dense spatiotemporal groundings. Guides speak as they move and later transcribe their audio; our annotation tool records their 3D poses and time-aligns the entire pose trace with words in the transcription. Instructions and pose traces can thus be aligned with any Matterport data, including surface reconstructions (Figure 1), RGB-D panoramas (Figure 4), and 2D and 3D semantic segmentations.
Follower Demonstrations. Annotators also act as Followers who listen to a Guide's instructions and attempt to follow the path. In addition to verifying instruction quality, this allows us to collect a play-by-play account of how a human interpreted the instructions, represented as a pose trace. Guide and Follower pose traces provide dense spatiotemporal alignments between instructions, visual percepts and actions, and both perspectives are useful for agent training.
Path Desiderata. R2R paths span 4-6 edges and are the shortest paths from start to goal. Thomason et al. (2019a) showed that agents can exploit effective priors over R2R paths, and Jain et al. (2019) showed that R2R paths encourage goal seeking over path adherence. These matter both for generalization to new environments and for fidelity to the descriptions given in the instruction; otherwise, strong performance might be achieved by agents that mostly ignore the language. RxR addresses these biases by satisfying four path desiderata:

1. High variance in path length, such that agents cannot simply exploit a strong length prior.
2. Paths may approach their goal indirectly, so agents cannot simply go straight to the goal.
3. Naturalness: paths should not enter cycles or make continual direction changes that would be difficult for people to describe and follow.
4. Uniform coverage of environment viewpoints, to maximize the diversity of references to visual landmarks and objects over all paths.

The last desideratum increases RxR's utility for testing agents' ability to ground language. It also makes RxR a more challenging VLN dataset, but one for which human followers still achieve a 93.9% success rate.

[Figure 2: Given the panorama navigation graph P with room graph R in Figure 2a, we sample a simple room path (r_0, r_2, r_3) inducing the subgraph in Figure 2b. The generated panorama path is the shortest path in the subgraph linking sampled panoramas p_8 and p_6.]

Two-Level Path Sampling
We satisfy desiderata 1-3 using a two-level procedure. At the high level, each path visits a sequence of rooms; these are simple paths with no repeated (room) vertices. Such room paths are not necessarily shortest paths. The low-level sequence is then the shortest panorama path constrained by the room sequence. Given the set of all such paths across all houses, the fourth desideratum is satisfied by iteratively selecting the path that most improves coverage while maintaining a bias against shortest paths.
Preliminaries Movement in the simulator is based on a navigation graph. Vertices correspond to 360-degree panoramic images, captured at approximately 2.2m intervals throughout 90 indoor environments. Edges are navigable links between panoramas. Chang et al. (2017) also partition panoramas via human-defined room annotations.
Let P be an undirected graph of interconnected panoramas, with vertices p_i ∈ V(P) and edges (p_i, p_j) ∈ E(P). Let A_R be a set of disjoint room annotations; each room r_i ∈ A_R is a non-overlapping subset of panoramas r_i ⊆ V(P), as shown in Figure 2a. We abbreviate (p_1, ..., p_m) as p_{1:m}.
We create R, an undirected room graph whose vertices are the connected components ∪_{r_i ∈ A_R} C(P[r_i]), where P[r_i] is the subgraph of P induced by room annotation r_i and C returns a graph's connected components. Simply put, each vertex in R encompasses a subgraph of interconnected panoramas that share a room annotation.

Path Generation We generate the set of all simple paths in R that traverse at most 5 rooms and two building levels. Let r_{p_i} ∈ V(R) be the room containing panorama p_i. As shown in Figure 2b, for each room path r_{1:n} we construct a directed graph P[r_{1:n}] in which an edge (p_i, p_j) exists if r_{p_i} = r_{p_j} (p_i and p_j are in the same room) or (r_{p_i}, r_{p_j}) is an edge in the room path. Given P[r_{1:n}], we sample the start p_1 and goal p_m uniformly from r_1 and r_n, respectively. The full panorama path p_{1:m} is then the shortest path between p_1 and p_m in P[r_{1:n}].
Room size varies greatly, so this approach produces high path length variance. It also satisfies naturalness because people tend to ground instructions at the room level (e.g., Exit through the carved wooden door on the other side of the room). We find such paths easy to describe even with as many as 20 edges. Finally, these paths can approach their goal indirectly, as exemplified in Figure 2b.
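To make the two-level procedure concrete, here is a minimal sketch in Python using networkx. The names (pano_graph, room_of, sample_panorama_path) are ours for illustration, not from the RxR tooling, and the sketch assumes the induced subgraph connects the first and last rooms.

```python
# A minimal sketch of the room-constrained panorama path sampling described
# above, using networkx. Names are illustrative; not the RxR generation code.
import random
import networkx as nx

def sample_panorama_path(pano_graph: nx.Graph, room_of: dict, room_path: list):
    """Return the shortest panorama path constrained by a simple room path.

    An edge (p_i, p_j) is kept in the directed subgraph P[r_1:n] if both
    panoramas are in the same room, or if their rooms are consecutive in
    the room path. Assumes the subgraph connects r_1 to r_n.
    """
    order = {r: i for i, r in enumerate(room_path)}
    sub = nx.DiGraph()
    for p_i, p_j in pano_graph.edges:
        for u, v in ((p_i, p_j), (p_j, p_i)):
            r_u, r_v = room_of[u], room_of[v]
            if r_u in order and r_v in order:
                if r_u == r_v or order[r_v] == order[r_u] + 1:
                    sub.add_edge(u, v)
    # Sample start and goal uniformly from the first and last rooms.
    start = random.choice([p for p in sub if room_of[p] == room_path[0]])
    goal = random.choice([p for p in sub if room_of[p] == room_path[-1]])
    return nx.shortest_path(sub, start, goal)
```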
Greedy Selection for Coverage The final path dataset D is constructed by repeatedly selecting a panorama path p_{1:m} from all sampled paths (without replacement) until a desired size is reached. After selecting k paths, let O(p_i, D_k) be the number of occurrences of panorama p_i in the paths in D_k. At step k+1, we select the path with the minimum value of

d(p_1, p_m) / L(p_{1:m}) + (1/m) Σ_{i=1}^{m} O(p_i, D_k),

where L is path length in P and d(p_1, p_m) is the shortest path distance between p_1 and p_m in P. The first term prefers non-shortest paths, while the second encourages selection of paths that cover panoramas with low coverage in D_k. This selection step is also subject to a maximum path length of 40m and a maximum of 500 paths per building environment.
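A hedged sketch of the greedy selection loop follows, under the same assumptions as above (a networkx graph with a "distance" edge attribute; helper names are ours). It scores each remaining candidate with the two-term objective; the per-building cap is omitted for brevity.

```python
# Sketch of the greedy coverage-driven selection, matching the score
# d(p_1, p_m)/L(p_1:m) + (1/m) * sum_i O(p_i, D_k). The "distance" edge
# attribute and helper names are assumptions, not the released code.
from collections import Counter
import networkx as nx

def path_length(g: nx.Graph, path: list) -> float:
    return sum(g.edges[u, v]["distance"] for u, v in zip(path, path[1:]))

def select_paths(candidates, pano_graph, num_paths, max_len=40.0):
    coverage = Counter()                       # O(p_i, D_k)
    pool = [p for p in candidates if path_length(pano_graph, p) <= max_len]
    dataset = []
    while pool and len(dataset) < num_paths:   # O(n^2); fine for a sketch
        def score(path):
            d = nx.shortest_path_length(pano_graph, path[0], path[-1],
                                        weight="distance")
            l = path_length(pano_graph, path)
            mean_cov = sum(coverage[p] for p in path) / len(path)
            return d / l + mean_cov            # prefer non-shortest, low-coverage
        best = min(pool, key=score)
        pool.remove(best)
        dataset.append(best)
        coverage.update(best)
    return dataset
```

Because candidates are re-scored after every selection, panoramas picked early raise the coverage penalty on overlapping paths, pushing later selections toward under-covered regions.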

Path Statistics
In total, we sample 16,522 paths, split into 11,089 train, 1,232 val-seen (train environments), 1,517 val-unseen (val environments), and 2,684 test, following the same environment splits as Matterport3D and R2R. Compared to R2R, RxR paths are longer, spanning 8 edges and 14.9m on average vs. 5 edges and 9.4m in R2R.
More importantly, as shown in Figure 3, RxR paths exhibit much greater variation in length while also achieving more uniform coverage of the panoramas (and edges). Furthermore, unlike R2R, 44.5% of RxR paths are not the shortest path from the start to the goal location; RxR paths are on average 27.4% longer than the shortest path.

Data Collection and Metrics
We immerse annotators in our own web-based version of the Matterport3D simulator using the panoramic images and the navigation graph. Compared to Anderson et al. (2018b), our annotation tool has additional capabilities, including speech collection, virtual pose tracking, and time-alignment between transcript and pose. Figure 4 gives an example instruction with accompanying Guide and Follower pose traces. Here, we describe our collection process, analysis of the data, path evaluation metrics and simple baselines.
Guide Task Like R2R, our simulator has camera controls allowing continuous heading and elevation changes and movement between panoramas. Guides look around and move to explore a provided path and attempt to create an instruction others can follow. R2R's Guides create written instructions. In contrast, RxR's Guides speak, and the tool logs their entire virtual camera pose sequence. We use a 640 × 480 pixel viewing canvas and a camera vertical field of view of 75 degrees. This process is inspired by Localized Narratives (Pont-Tuset et al., 2020), an image captioning dataset for which annotators move mouse pointers around images while talking about them. As with Localized Narratives, RxR Guides transcribe their own recordings; this produces high quality text versions of the instructions. To align text and pose traces, we generate a time-stamped transcription using automatic speech recognition (Google Cloud Speech-to-Text, https://cloud.google.com/speech-to-text). The transcription and ASR output are aligned using dynamic time warping. The output of the Guide task is an audio file, a tokenized, timestamped, manually-transcribed instruction, and a pose trace (a series of timestamped 6-DOF camera poses). On average, Guide task annotations (including both steps, performed back-to-back) take 458 seconds.
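The alignment step can be sketched as a standard dynamic-programming DTW over word-match costs. This is illustrative rather than the production tooling; word_cost based on difflib is our stand-in for whatever similarity the real pipeline uses.

```python
# Illustrative DTW alignment of a manual transcript to timestamped ASR
# output. Each transcript word inherits the timestamp of its matched ASR
# word (the earliest, if matched to several).
import difflib

def word_cost(a: str, b: str) -> float:
    return 1.0 - difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

def dtw_align(transcript, asr_words):
    """transcript: list[str]; asr_words: list of (word, start_time).
    Returns a start time for each transcript word."""
    n, m = len(transcript), len(asr_words)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    back = {}
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = word_cost(transcript[i - 1], asr_words[j - 1][0])
            prev = min((cost[i - 1][j - 1], (i - 1, j - 1)),
                       (cost[i - 1][j], (i - 1, j)),
                       (cost[i][j - 1], (i, j - 1)))
            cost[i][j] = c + prev[0]
            back[(i, j)] = prev[1]
    # Trace back, assigning each transcript word the time of its match.
    times, i, j = [None] * n, n, m
    while i > 0 and j > 0:
        times[i - 1] = asr_words[j - 1][1]
        i, j = back[(i, j)]
    return times
```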
For each language (English, Hindi and Telugu) we annotate 14K paths with three instructions each. In the English dataset, each path gets one US English instruction and two Indian English instructions. Of the 14K paths per language, 12.8K paths are common across all three languages, and 1.2K paths in each language are unique (equaling 16.5K paths in total). The fact that most paths are annotated 9 times (3 per language) creates interesting opportunities to study aligned instructions across languages. Unique paths add variety and coverage.
Follower Task As Followers, annotators begin at the start of an unknown path and try to follow the Guide's instruction. They observe the environment and navigate in the simulator as the Guide's audio plays. They can pause, rewind and skip forward in the instruction. If they believe they have reached the end of the path, or give up, they indicate they are done and rate the instruction's clarity and their confidence in their own navigation. On average, Follower tasks take 132 seconds.
The Follower tasks objectively validate the quality of Guide instructions based on whether the Follower can succeed (i.e., reach within 3m of the last panorama in the path). If the Follower doesn't succeed, the Guide instruction is paired with a second Follower. If the second Follower succeeds, the first Follower annotation is discarded and replaced. If the second Follower also fails, the path is re-enqueued to generate another Guide and Follower annotation. The most successful of the three resulting Guide-Follower pairs is selected for inclusion in RxR and the others are discarded.
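In code form, the validation loop might look like the following sketch; collect_guide, collect_follower and success are hypothetical placeholders for the crowd tasks and the 3m success check.

```python
# Illustrative sketch of the Guide/Follower validation protocol described
# above; all function names are hypothetical placeholders.
def annotate_instruction(path, collect_guide, collect_follower, success):
    pairs = []
    while len(pairs) < 3:                     # re-enqueue at most twice
        guide = collect_guide(path)
        follower = collect_follower(guide)
        if not success(follower, path):       # within 3m of the last pano?
            retry = collect_follower(guide)   # pair with a second Follower
            if success(retry, path):
                follower = retry              # replace the failed attempt
        pairs.append((guide, follower))
        if success(follower, path):
            break                             # no need to re-enqueue
    # Keep the most successful Guide-Follower pair; discard the others.
    return max(pairs, key=lambda gf: success(gf[1], path))
```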
In addition to validating data quality, the Follower task also trains annotators to be better Guides: following bad instructions often helps one see how to produce better instructions. Most importantly, we collect the pose trace of the Follower as they execute the instruction. This provides an alternative path with dense grounding that we can compare to the Guide's pose trace and use as an additional training signal.

Table 2 provides summary statistics for RxR. The average number of words per instruction (using whitespace tokenization) is 78, vs. R2R's 29. US English instructions are the longest on average; we attribute this to conventions developed by each annotator pool rather than language specific properties. On average, Guide tasks take much longer than Follower tasks (458 vs. 132 seconds); most of the Guide's time is spent transcribing audio (Guide audio recordings average 60 seconds).

Dataset Analysis
Following a similar analysis to Chen et al. (2019), Table 3 gives examples and statistics for linguistic phenomena, based on manual analysis of instructions for 25 paths. All RxR subsets produce a higher rate of entity references compared to R2R. This is consistent with the extra challenge of RxR's paths and our annotation guidance that instructions should help followers stay on the path as well as reach the goal; doing so requires more extensive use of objects in the environment. RxR's higher rate of both coreference and sequencing indicates that its instructions have greater discourse coherence and connection than R2R's. RxR also includes a far higher proportion of allocentric relations and state verification compared to R2R, and matches Touchdown (navigation instructions). Hindi contains less coreference, sequencing, and temporal conditions than the other languages. That said, it is not clear how much the differences within RxR exhibited in Table 3 can be attributed to language, dialect, annotator pools, or other factors.

[Table 3: Linguistic phenomena in a manually annotated random sample of 25 paths from RxR and R2R. p is the % of sentences that contain the phenomenon, while µ is the average number of times it occurs within each sentence.]

Figure 5 (top) illustrates the close alignment between instruction progress (measured in words) and path progress (measured in steps). Figure 5 (bottom) indicates that both Guide and Follower annotators orient themselves by looking around at the first panoramic viewpoint, after which they maintain a narrower focus. On average, Guides / Followers observe 43% / 44% of the available spherical visual signal at the first viewpoint, and 27% / 28% at subsequent viewpoints. These findings stand in contrast to standard VLN agents, which routinely consume the entire panoramic image and attend over the entire instruction sequence at each step. Inputs that the Guide / Follower have not observed cannot influence their utterances / actions, so pose traces offer rich opportunities for agent supervision.
Evaluation We use the following standard path evaluation metrics (with arrows indicating the direction of improvement): navigation error (NE ↓), success rate (SR ↑), normalized dynamic time warping (NDTW ↑), and success-weighted DTW (SDTW ↑).

Agent Architecture At each time step t, the agent receives a panoramic encoding of its viewpoint v_t ∈ R^{k×d} (where k = 36 is the number of 30° intervals that span the panorama) along with a visual encoding of navigable directions a_t ∈ R^{n×d} (where n is the number of navigable directions). Each feature of dimension d is a pre-trained CNN feature concatenated with an angle encoding (Fried et al., 2018). The LSTM decoder computes an updated hidden state h_t by conditioning on the previously selected action a_{t−1} and attending over the panoramic encoding v_t and the instruction x using dot-product attention (Luong et al., 2015). The distribution over next actions is computed via a similarity ranking h_t · a_{t,i} between the hidden state h_t and each direction encoding in a_t.
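To make the action-selection mechanics concrete, a schematic numpy sketch follows. The attention parameterization (bilinear maps W_v, W_x) and the callable lstm are our assumptions rather than the released model's exact wiring.

```python
# Schematic, numpy-only sketch of the decoder step described above.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decoder_step(h_prev, a_prev, v_t, a_t, x_enc, lstm, W_v, W_x):
    """One action-selection step.

    h_prev: (d,) previous hidden state; a_prev: (d,) last action encoding;
    v_t: (k, d) panorama features; a_t: (n, d) navigable-direction features;
    x_enc: (l, d) encoded instruction words;
    lstm: callable mapping a (3d,) input and h_prev to a new (d,) state.
    """
    alpha_v = softmax(v_t @ W_v @ h_prev)     # (k,) visual attention
    alpha_x = softmax(x_enc @ W_x @ h_prev)   # (l,) textual attention
    ctx = np.concatenate([alpha_v @ v_t, alpha_x @ x_enc, a_prev])
    h_t = lstm(ctx, h_prev)                   # updated hidden state
    logits = a_t @ h_t                        # similarity ranking h_t · a_t,i
    return h_t, softmax(logits), alpha_v, alpha_x
```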
For the image features we use an EfficientNet-B4 CNN (Tan and Le, 2019). Following Parekh et al. (2020), we pretrain the CNN in an image-text dual encoder setting using the Conceptual Captions dataset (Sharma et al., 2018). In preliminary experiments, we found that pretraining the CNN in this way gave noticeable improvements over the same CNN pretrained for image classification on ImageNet (Russakovsky et al., 2015).
Grounding Supervision To incorporate spatiotemporal groundings into agent training, for each Guide path (G-path) and Follower path (F-path) we convert the corresponding pose trace into: (1) a sequence of text masks b_t ∈ {0, 1}^l indicating which words in instruction x the Guide spoke / Follower heard at or prior to step t, and (2) a sequence of visual masks M_t ∈ {0, 1}^{h×w} indicating which pixels were observed in the panoramic image at t (as in Figure 5, bottom). We then project and max-pool M_t to a vector mask m_t ∈ {0, 1}^k aligned with the agent's visual input features v_t. Zeros in b_t and m_t indicate irrelevant textual and visual inputs that were not observed by the annotators, and are therefore not related to their utterances and actions.
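The projection from pixel mask M_t to feature mask m_t can be sketched as a tiled max-pool; the 3 × 12 tiling of the equirectangular panorama is our assumption about how the k = 36 views are laid out.

```python
# Sketch: convert a panoramic pixel mask M_t (h x w) into the k-dim feature
# mask m_t by max-pooling over tiles. Assumes the k = 36 views tile the
# panorama in 3 elevation rows x 12 headings (our assumption).
import numpy as np

def pool_visual_mask(M_t: np.ndarray, k: int = 36) -> np.ndarray:
    h, w = M_t.shape
    rows, cols = 3, k // 3
    m_t = np.zeros(k, dtype=M_t.dtype)
    for r in range(rows):
        for c in range(cols):
            tile = M_t[r * h // rows:(r + 1) * h // rows,
                       c * w // cols:(c + 1) * w // cols]
            m_t[r * cols + c] = tile.max()  # 1 if any pixel was observed
    return m_t
```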
To help prevent the agent from overfitting to superficial correlations in the training data, we use b_t and m_t to supervise the normalized textual and visual attention weights in the model. Specifically, during training, whenever the agent is on the gold path we apply a cross-entropy loss to the visual attention weights:

L_t = − Σ_{i=1}^{k} (m_{t,i} / ||m_t||_1) log softmax(z)_i,

where z is the vector of unnormalized logits determining attention weights via a softmax. This loss forces the attention weights on irrelevant input features towards zero. The textual version is analogous.
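One plausible concrete form of this loss treats the normalized observation mask as the target distribution; the exact formulation may differ in details such as normalization.

```python
# Hedged sketch of the grounding loss: cross-entropy between the attention
# distribution softmax(z) and the observation mask m_t normalized to sum
# to one. This drives attention on unobserved features toward zero.
import numpy as np

def grounding_loss(z: np.ndarray, m_t: np.ndarray, eps: float = 1e-8) -> float:
    """z: (k,) unnormalized attention logits; m_t: (k,) binary mask."""
    attn = np.exp(z - z.max())
    attn = attn / attn.sum()
    target = m_t / max(m_t.sum(), 1)  # uniform over observed features
    return float(-(target * np.log(attn + eps)).sum())
```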
Implementation Details Agents are implemented in VALAN (Lansing et al., 2019), a distributed reinforcement learning framework designed for VLN. We use a mix of supervised learning and policy gradients. Each minibatch is constructed from 50% behavioural cloning roll-outs (following the gold paths while minimizing cross-entropy loss) and 50% policy gradient roll-outs with reward (following paths sampled from the agent's policy). As in Ilharco et al. (2019), the reward at each step is the incremental difference in NDTW, plus a linear function of navigation error after stopping. All agents are trained with Adam (Kingma and Ba, 2014) to convergence (100K iterations with batch size 32 and initial learning rate 1e-4).
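The reward can be sketched as below, with ndtw assumed to be an externally supplied implementation of normalized dynamic time warping (following Ilharco et al., 2019); the slope and offset of the terminal term are illustrative, since the exact coefficients are not given here.

```python
# Sketch of the dense reward: the per-step increment in NDTW toward the
# reference path, plus a linear function of navigation error on stopping.
# `ndtw(path_a, path_b) -> float in [0, 1]` is assumed, not shown.
def step_reward(agent_path, gold_path, done, nav_error, ndtw,
                err_scale=1.0, success_radius=3.0):
    r = ndtw(agent_path, gold_path) - ndtw(agent_path[:-1], gold_path)
    if done:
        r += err_scale * (success_radius - nav_error)  # illustrative terminal term
    return r
```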

Monolingual Results

Table 5 provides results on the val-unseen split for several training settings, as well as human performance from Follower annotations. We report en-US and en-IN results together as en. Experiments 1-3 compare agents trained (1) only on G-paths, (2) only on F-paths, and (3) on both. In contrast to algorithmically generated G-paths, each F-path reflects a grounded human interpretation of an instruction, which may deviate from the G-path because multiple correct interpretations are possible (e.g., Figure 4). For training, we do not differentiate F-paths from G-paths, and each instruction-path pair is treated as an independent example. Experiment (3) shows that including both G- and F-paths in training benefits every metric. Given the overall positive impact of F-paths, we use both path types in our further experiments.
Multilinguality For experiment (4) in Table 5, we train a single multilingual agent on all three languages simultaneously. While the multilingual agent sees substantially more instructions than each monolingual agent, performance is worse across all metrics. This is consistent with results in multilingual machine translation (MT) and automatic speech recognition (ASR), where adding more languages can also lead to degradation for high-resource languages (Aharoni et al., 2019; Pratap et al., 2020). Experiment (5) takes this one step further by obtaining translations of every instruction into the two other languages (e.g., en → hi, te) using an MT service (Google Cloud Translation, https://cloud.google.com/translate); these translations are included in the RxR data release. Including these translations hurts performance for all languages. The fact that most G-paths are shared across languages may limit the value of automatic cross-translations. Notwithstanding the higher performance of the monolingual approaches, in the remaining experiments we focus on multilingual agents for greater scalability.
Table 5, experiment (6), incorporates a loss for spatiotemporal grounding over visual attention, which gives mixed results on val-unseen compared to (4): better on NDTW and NE, worse on success-based metrics. Applying the same approach to textual attention did not improve performance. However, we stress that this is only a preliminary investigation. Using human demonstrations to supervise visual groundings is an active area of research (Wu and Mooney, 2019; Selvaraju et al., 2019). As one of the first large-scale spatiotemporally aligned language datasets, RxR offers new opportunities to extend this work from images to environments.

Multitask and Transfer Learning As a simple multitask strategy, we also train on the concatenation of R2R and RxR annotations (Wang et al., 2020); the agent trained on both datasets obtains across-the-board improvements. Zero-shot transfer between the datasets is harder: for RxR → R2R, performance is hurt by R2R's shortest path bias, and for R2R → RxR, the much longer paths and richer language are out-of-domain.

Unimodal Ablations

Table 7 reports the performance of the multilingual agent under settings in which we ablate either the vision or the language inputs during both training and evaluation, as advocated by Thomason et al. (2019a). The multimodal agent (4) outperforms both the language-only agent (9) and the vision-only agent (10), indicating that both modalities contribute to performance. The language-only agent performs better than the vision-only agent, likely because even without vision, parts of the instructions such as 'turn left' and 'go upstairs' still have meaning in the context of the navigation graph. In contrast, the vision-only model has no access to the instructions, without which the paths are highly random.
Test Set RxR includes a heldout test set, which we divide into two splits: test-standard and test-challenge. These splits will remain sequestered to support a public leaderboard and a challenge so the community can track progress and evaluate agents fairly. Table 8 provides test-standard performance of the monolingual and multilingual agents using Guide and Follower paths, along with random and human Follower scores. While the learned agent is clearly much better than a random agent, there is a great deal of headroom to reach human performance.

Conclusion
RxR represents a significant evolution in the scale, scope and possibilities for research on embodied language agents in simulated, photo-realistic 3D environments. RxR's paths better ensure that language itself plays a fundamental role in agent performance. Evaluating on three typologically diverse languages will help the community avoid overfitting to a particular language and dataset.
We have only begun to explore the possibilities opened up by pose traces. Whereas others have retroactively refined R2R's annotations to obtain alignments between sub-instructions and panorama sequences (Hong et al., 2020), RxR provides word-level alignments to specific pixels in panoramas. This is obtained as a by-product of significant work on the annotation tooling itself and of designing the process to be more natural for Guides. Finally, every instruction is accompanied by a Follower demonstration, including a perspective camera pose trace that gives a play-by-play account of how a human interpreted the instructions given their position and progress through the path. We have shown that these can help with agent training, but they also open up new possibilities for studying grounded language pragmatics in the VLN setting, and for training VLN agents with perspective cameras, either in the graph-based simulator or by lifting RxR into a continuous simulator (Krantz et al., 2020).

Guide Alignment
(Instruction segments ordered left-to-right, aligned with the Guide's pose trace.)

You're starting in a closet, facing an abstract painting on your right. Just slightly to your left will be an open, wooden door next to an amp. Walk through that wooden door. This will take you to a hallway with stairs going up on the right hand side. Once you get to the... Jimi Hendrix painting, turn to your right and...
...and walk between the stair railing and the white kitchen cabinet toward the refrigerator.
Take a step in front of the refrigerator. Take another step toward the windows...
...windows overlooking the trees. Then take a right at the end of the refrigerator. You'll take three steps toward the fireplace.
...fireplace. Once you get to the fireplace, it will be on your right hand side.
...side. This is where you stop.

You begin in a large wooden room with a dining table, an immense fireplace, and a lovely carpet. turn to your right, and move along the edge of that carpet you're nearest to, towards the wooden doorway into another interior room. You should see a large circular table with an urn in the center of it when you enter that room. Skirt the edge of that table to the left, moving towards the staircase. Don't go to the staircase, but instead proceed to the left of it, down the large rectangular rug. Continue through the open glass door,, and the second glass door across the small hallway from it.
Step inside this small... Dining area? If you are just inside the room with the circular table in the middle of it, a couch on the left hand wall, two armchairs across from the entrance, and one armchair, striped, just next to you, you're in the right place, and you are done.
Starting facing a large ornate vase with gold leaf on it as well as a curtained window, we are going to turn towards the dinning room table we are face. We are going to hop around it and come to just beside the painting in the background. Once we're behind the head of the table chair, we are going to face forward and notice that there is a marble staircase before us. Head towards that, but don't head up the stairs and don't exit the room, instead we're going to turn to the left and you should see a kitchen before you. Let's go ahead and enter the kitchen through the archway, and here walk to the right of the China cabinet, and towards the island with the dark cabinets and the granite countertop. Once we've turned the corner, and we're beside the large gas range and the stainless steel hood, we're going to walk between the stove and the kitchen island, towards the refrigerator, and you should see an open doorway before you, to the right of the fridge. Go ahead and walk towards this open door and through it. Walk all the way down and turn to the right, passed the closed door, until you're faced with another flight of stairs. Let's go ahead and move up them. When you've ascended the stairs, turn and face your right, and walk towards the music room that we can see in the distance with its grande piano. We are going to come to a stop right at the base of another small flight of stairs, and looking into the sitting room with a grande piano and marble mantle over a fireplace.
You are beside the bed in your bed room, turn towards your left and keep moving forward. Go near the stair case support and turn towards your right, keep moving forwards and you can find a long corridor on your left. Go through the corridor and the opposite end you can find a gaming room. Go through that gaming room and opposite end of a room, towards your slight right, you can find a air hockey table. Go and stand near that table and you reached your destination.
You are facing towards the white door. Turn left and walk towards the swimming pool. Turn left and walk towards the gym equipment. Turn right. Walk a few steps ahead and stand beside the swimming pool. There is a window towards your left side. You have reached your point.
now you are on a stair case facing the stairs, climb up the stair case, now you will enter a big hall, now walk to the other end of the hall and now you will see two doors which are wide opened, exit through the doors and take a right turn and walk on the corridor, to the send window from the right is your destination.
Right now you're facing towards a curtain. Now turn behind and move towards the wall which is in front of you. Now turn left and exit the room, there are portraits to your left. Now turn right and move forward in the walkway. You can see an open door to your left, move towards the door and turn left. Now enter in to the room, there are two washing machines to your right and you can see shelves in front of you, move towards the shelves and stand in front of it and it is your end point.
You are in a living area, facing towards the corner of a door. Turn towards your slight right and keep moving forward. In front, towards your slight right, you find an other section. Go near that section and turn towards your left. You find a brown door, go pass through the door and move forward. You enter into your bedroom. In front, you find a bed, walk towards the bed and stand near it. You reached your destination.
Right now you're facing towards a bed. Now slightly turn right, there is an open door in front of you, move towards the door and exit the room. There is a walkway in front of you and some portraits on the wall to your right and a staircase to your left, move forward in the walkway, continue moving forward in the walkway, until you reach an open door in front of you, there is an open door to your right, move towards the door and turn right. Now enter in to the room, there is a portrait in between two windows in front of you. Now slightly turn left, there is a sliding door in front of you, which guides to the balcony, move towards the door and enter in to the balcony. Now turn left, you can see a sliding door in front of you, move towards the sliding door and enter in to the room and this is your end point.

MOTIVATION
For what purpose was the dataset created? RxR was created to advance progress on vision-and-language navigation (VLN) in multiple languages (English, Hindi, Telugu). It addresses gaps in existing datasets by including more paths that counter known biases, an order of magnitude more navigation instructions in three languages, and annotators' 3D virtual pose sequences.
Who created this dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)? This dataset was created by Alexander Ku, Peter Anderson, Roma Patel, Eugene Ie, Jason Baldridge and the Google Data Compute team on behalf of Google Research.
What support was needed to make this dataset? Funding was provided by Google Research.

COMPOSITION
What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)?
The instances in RxR are natural language navigation instructions paired with trajectories in reconstructed 3D buildings. Each navigation instruction has been recorded as speech and transcribed by the speaker. The dataset includes the text transcriptions, but not the audio files, although they may be released in future. The trajectories are provided as paths, consisting of sequences of viewpoint ids corresponding to navigation graphs from Anderson et al. (2018b), and pose traces, consisting of sequences of virtual camera poses situated in the underlying building reconstructions which are from the Matterport3D dataset (Chang et al., 2017). Pose traces and text transcriptions are timestamped and aligned. Pose traces are provided for both the instruction annotator (the Guide), and a second annotator charged with following the Guide's instructions (the Follower).
How many instances are there in total (of each type, if appropriate)? RxR contains 126K Guide instructions covering 16.5K sampled paths and 126K human Follower demonstration paths. Annotations are split equally across the three languages in the dataset. Refer to Table 1 for a comparison of the number of instances to previous datasets and Table 2 for summary statistics.
Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set? Refer to Section 3 for a detailed description of the sampling procedure used to select the paths for annotation.
What data does each instance consist of? Each instance consists of a trajectory through a building from the Matterport3D dataset (Chang et al., 2017) paired with a natural language navigation instruction. A trajectory can be visualized as a sequence of 360-degree panoramic images, or as a path traversing a 3D reconstruction of the building represented as a textured mesh. Refer to Table 3 for an analysis of linguistic phenomena in the instructions and Figures 9, 10 and 11 for instruction examples in English, Hindi and Telugu respectively.
Is there a label or target associated with each instance? When training wayfinding agents to navigate from natural language instructions, the trajectory is the target. Instructions and paths are annotated with unique identifiers.

Is any information missing from individual instances?
We do not provide the Guide audio recordings, for reasons outlined in Appendix A.
Are relationships between individual instances made explicit (e.g., users' movie ratings, social network links)? Trajectories may belong to the same building or different buildings; each instance is annotated with a scan (building) identifier.