Where Are You? Localization from Embodied Dialog

We present Where Are You? (WAY), a dataset of ~6k dialogs in which two humans -- an Observer and a Locator -- complete a cooperative localization task. The Observer is spawned at random in a 3D environment and can navigate from first-person views while answering questions from the Locator. The Locator must localize the Observer in a detailed top-down map by asking questions and giving instructions. Based on this dataset, we define three challenging tasks: Localization from Embodied Dialog or LED (localizing the Observer from dialog history), Embodied Visual Dialog (modeling the Observer), and Cooperative Localization (modeling both agents). In this paper, we focus on the LED task -- providing a strong baseline model with detailed ablations characterizing both dataset biases and the importance of various modeling choices. Our best model achieves 32.7% success at identifying the Observer's location within 3m in unseen buildings, vs. 70.4% for human Locators.


Introduction
Imagine getting lost in a new building while trying to visit a friend who lives or works there. Unsure of exactly where you are, you call your friend and start describing your surroundings ('I'm standing near a big blue couch in what looks like a lounge. There are a set of wooden double doors opposite the entrance.') and navigating in response to their questions ('If you go through those doors, are you in a hallway with a workout room to the right?'). After a few rounds of dialog, your friend who is familiar with the building will hopefully know your location. Success at this cooperative task requires goal-driven questioning based on your friend's understanding of the environment, unambiguous answers communicating observations via language, and active perception and navigation to investigate the environment and seek out discriminative observations.
Figure 1: LED Task: The Locator has a top-down map of the building and is trying to localize the Observer by asking questions and giving instructions. The Observer has a first person view and may navigate while responding to the Locator. The turn-taking dialog ends when the Locator predicts the Observer's position.
In this work we present WHERE ARE YOU? (WAY), a new dataset based on this scenario. As shown in Fig. 1, during data collection we pair two annotators: an Observer who is spawned at random in a novel environment, and a Locator who must precisely localize the Observer in a provided top-down map. The map can be seen as a proxy for familiarity with the environment -it is highly detailed, often including multiple floors, but does not show the Observer's current or initial location. In contrast to the "remote" Locator, the Observer navigates within the environment from a first-person view but without access to the map. To resolve this information asymmetry and complete the task, the Observer and the Locator communicate in a live two-person chat. The task concludes when the Locator makes a prediction about the current location of the Observer. For the environments we use the Matterport3D dataset (Chang et al., 2017) of 90 reconstructed indoor environments. In total, we collect ∼6K English dialogs of humans completing this task from over 2K unique starting locations.
The combination of localization, navigation, and dialog in WAY provides for a variety of modeling possibilities. We identify three compelling tasks encapsulating significant research challenges:
- Localization from Embodied Dialog. LED, which is the main focus of this paper, is the state estimation problem of localizing the Observer given a map and a partial or complete dialog between the Locator and the Observer. Although localization from dialog has not been widely studied, we note that indoor localization plays a critical role during calls to emergency services (Falcon and Schulzrinne, 2018). As 3D models and detailed maps of indoor spaces become increasingly available through indoor scanners (Chang et al., 2017), LED models could help emergency responders localize emergency callers more quickly by identifying locations in a building that match the caller's description.
- Embodied Visual Dialog. EVD is the navigation and language generation task of fulfilling the Observer role. This involves using actions and language to respond to questions such as 'If you walk out of the bedroom is there a kitchen on your left?' In future work we hope to encourage the transfer of existing image-based conversational agents (Das et al., 2017a) to more complex 3D environments additionally requiring navigation and active vision, in a step closer to physical robotics. The WAY dataset provides a testbed for this.
- Cooperative Localization. In the CL task, both the Observer and the Locator are modeled agents. Recent position papers (Baldridge et al., 2018; McClelland et al., 2019; Bisk et al., 2020) have called for a closer connection between language models and the physical world. However, most reinforcement learning for dialog systems is still text-based (Li et al., 2016) or restricted to static images (Das et al., 2017b; De Vries et al., 2017). Here, we provide a dataset to warm-start and evaluate goal-driven dialog in a realistic embodied setting.
Our main modeling contribution is a strong baseline model for the LED task based on LingUNet. In previously unseen test environments, our model successfully predicts the Observer's location within 3 meters 32.7% of the time, vs. 70.4% for the human Locators using the same map input, with random chance accuracy at 6.6%. We include detailed studies highlighting the importance of data augmentation and residual connections. Additionally, we characterize the biases of the dataset via unimodal (dialog-only, map-only) baselines and experiments with shuffled and ablated dialog inputs, finding limited potential for models to exploit unimodal priors.
Contributions. To summarize:
1. We present WAY, a dataset of ∼6k dialogs in which two humans with asymmetric information complete a cooperative localization task in reconstructed 3D buildings.
2. We define three challenging tasks: Localization from Embodied Dialog (LED), Embodied Visual Dialog, and Cooperative Localization.
3. Focusing on LED, we present a strong baseline model with detailed ablations characterizing both modeling choices and dataset biases.

Related Work
Image-based Dialog. Several datasets grounding goal-oriented dialog in natural images have been proposed. The most similar settings to ours are Cooperative Visual Dialog (Das et al., 2017a,b), in which a question agent (Q-bot) attempts to guess which image from a provided set the answer agent (A-bot) is looking at, and GuessWhat?! (De Vries et al., 2017), in which the state estimation problem is to locate an unknown object in the image. Our dataset extends these settings to a situated 3D environment allowing for active perception and navigation on behalf of the A-bot (Observer), and offering a whole-building state space for the Q-bot (Locator) to reason about.
Embodied Language Tasks. A number of 'Embodied AI' tasks combining language, visual perception, and navigation in realistic 3D environments have recently gained prominence, including Interactive and Embodied Question Answering (Das et al., 2018; Gordon et al., 2018), Vision-and-Language Navigation or VLN (Anderson et al., 2018; Chen et al., 2019; Mehta et al., 2020; Qi et al., 2020), and challenges based on household tasks (Puig et al., 2018; Shridhar et al., 2020). While these tasks utilize only a single question or instruction input, several papers have extended the VLN task, in which an agent must follow natural language instructions to traverse a path in the environment, to dialog settings. Nguyen and Daumé III (2019) consider a scenario in which the agent can query an oracle for help while completing the navigation task. However, the closest work to ours is Cooperative Vision-and-Dialog Navigation (CVDN) (Thomason et al., 2019). CVDN is a dataset of dialogs in which a human assistant with access to visual observations from an oracle planner helps another human complete a navigation task. CVDN dialogs are set in the same Matterport3D buildings (Chang et al., 2017) and like ours they are goal-oriented and easily evaluated.
The main difference is that we focus on localization rather than navigation. Qualitatively, this encourages more descriptive utterances from the first-person agent (rather than eliciting short questions). Our work is also related to Talk the Walk (de Vries et al., 2018), which presented a dataset for a similar task in an outdoor setting using a restricted, highly abstracted map that encouraged language grounded in the semantics of building types rather than visual descriptions of the environment. Table 1 compares the language in WAY against existing embodied perception datasets. Specifically, size, length, and the density of different parts of speech (POS) are shown. Vocabulary size is the total number of unique words. We used the NLTK POS tagger (Loper and Bird, 2002) to calculate the POS densities over the text in each dataset. We find that WAY has a higher density of adjectives, nouns, and prepositions than related datasets, suggesting the dialog is more descriptive than the text in existing datasets.
Localization from Language. While localization from dialog has not been intensively studied, localization from language has been studied as a sub-component of instruction-following navigation agents (Anderson et al., 2019; Blukis et al., 2019). The LingUNet model, a generic language-conditioned image-to-image network that we use as the basis of our LED model in Section 4, was first proposed in the context of predicting visual goals in images. This also illustrates the somewhat close connection between grounding language to a map and grounding referring expressions to an image (Kazemzadeh et al., 2014; Mao et al., 2016).
It is important to note that localization is often a precursor to navigation, one which has not been addressed in existing work in language-based navigation. In both VLN and CVDN, the instructions are conditioned on specific start locations, assuming the speaker knows the navigator's location prior to giving directions. The localization tasks of the WAY dataset fill this gap by introducing a dialog-based means to localize the navigator. This requires capabilities such as describing a scene, answering questions, and reasoning about how discriminative potential statements will be to the other agent.

WHERE ARE YOU? Dataset
We present the WHERE ARE YOU? (WAY) dataset consisting of 6,134 human embodied localization dialogs across 87 unique indoor environments.
Environments. We build WAY on Matterport3D (Chang et al., 2017), which contains 90 buildings captured in 10,800 panoramic images. Each building is also provided as a reconstructed 3D textured mesh. This dataset provides high-fidelity visual environments in diverse settings including offices, homes, and museums -offering numerous objects to reference in localization dialogs. We use the Matterport3D simulator (Anderson et al., 2018) to enable first-person navigation between panoramas.

Task. A WAY episode is defined by a starting location (i.e. a panorama $p_0$) in an environment $e$. The Observer is spawned at $p_0$ in $e$ and the Locator is provided a top-down map of $e$ (see Fig. 1). Starting with the Locator, the two engage in a turn-based dialog where each can pass one message per turn. The Observer may move around in the environment during their turn, resulting in a trajectory $(p_0, p_1, \ldots, p_T)$ over the dialog. The Locator is not embodied and does not move but can look at the different floors of the house at multiple angles. The dialog continues until the Locator uses their turn to make a prediction $\hat{p}_T$ of the Observer's current location $p_T$. The episode is successful if the prediction is within $k$ meters of the true final position, i.e. $\|\hat{p}_T - p_T\|_2 < k$ meters. This does not depend on the initial position, encouraging movement to easily-discriminable locations.
Map Representation. The Locator is shown top-down views of Matterport textured meshes as environment maps. In order to increase the visibility of walls in the map (which may be mentioned by the Observer), we render views using perspective rather than orthographic projections (see left in Fig. 1). We set the camera near and far clipping planes to render single floors such that multi-story buildings contain an image for each floor.

Collecting Human Localization Dialogs
To provide a human-performance baseline and gather training data for agents, we collect human localization dialogs in these environments.
Episodes. We generate 2020 episodes across 87 environments by rejection sampling to avoid spatial redundancy. For each environment, we iteratively sample start locations, rejecting ones that are within 5m of already-sampled positions. Three environments were excluded due to their size (too large or small) or poor reconstruction quality.
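The rejection sampling described above can be sketched as follows. This is a minimal illustration, not the authors' collection code: the function name, the random visiting order, and the example coordinates are assumptions, and only the 5m minimum spacing comes from the text.

```python
import math
import random

def sample_start_locations(candidates, min_dist=5.0, seed=0):
    """Keep a candidate only if it is at least min_dist meters away
    from every start location that has already been accepted."""
    rng = random.Random(seed)
    order = list(candidates)
    rng.shuffle(order)  # assumed: candidates are visited in random order
    kept = []
    for p in order:
        if all(math.dist(p, q) >= min_dist for q in kept):
            kept.append(p)
    return kept

# Hypothetical panorama positions (x, y, z) in meters.
panoramas = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (6.0, 0.0, 0.0),
             (6.0, 7.0, 0.0), (7.0, 7.0, 0.0)]
starts = sample_start_locations(panoramas)
```

Every rejected candidate lies within 5m of some accepted start, so the accepted set covers the environment without spatial redundancy.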
Data Collection. We collect dialogs on Amazon Mechanical Turk (AMT), randomly pairing workers into Observer or Locator roles for each episode. The Observer interface includes a first-person view of the environment and workers can pan/tilt the camera in the current position or click to navigate to adjacent panoramas. The Locator interface shows the top-down map of the building, which can be zoomed and tilted to better display the walls. Views for each floor can be selected for multi-story environments. Both interfaces include a chat window where workers can send their message and end their dialog turn. The Locator interface also includes the option to make their prediction by clicking a spot on the top-down map, which terminates the dialog. Note this option is only available after two rounds of dialog. Refer to the appendix for further details on the AMT interfaces. Before starting, workers were given written instructions and a walk-through video on how to perform their role. We restricted access to US workers with at least a 98% success rate over 5,000 previous tasks. Further, we restricted workers from repeating tasks on the same building floor. In order to filter bad actors, we monitored worker performance based on a running average of localization error in meters and the number of times they disconnected from dialogs, removing workers who exceeded a 10m threshold and discarding their data.
Dataset Splits. We follow the standard splits for the Matterport3D dataset (Chang et al., 2017), dividing along environments. We construct four splits: train, val-seen, val-unseen, and test comprising 3,967/299/561/1,165 dialogs from 58/55/11/18 environments respectively. Val-seen contains new start locations for environments seen in train. Both val-unseen and test contain new environments. This allows us to assess generalization to new dialogs and to new environments separately in validation. Following best practices, the final locations of the Observer for the test set will not be released but we will provide an evaluation server where predicted localizations can be uploaded for scoring. WAY includes dialogs in which the human Locator failed to accurately localize the Observer. In reviewing failed dialogs, we found human failures are often due to visual aliasing (e.g., across multiple floors), or are relatively close to the 3m threshold. We therefore expect that these dialogs still contain valid descriptions, especially when paired with the Observer's true location during training. In experiments when removing failed dialogs from the train set, accuracy did not significantly change.

Dataset Analysis
Data Collection and Human Performance. In total, 174 unique workers participated in our tasks. On average each episode took 4 minutes and the average localization error is 3.17 meters. Overall, 72.5% of episodes were considered successful localizations at an error threshold of 3 meters. Each starting location has 3 annotations by separate randomly-paired Observer-Locator teams. All 3 teams succeeded in 40.9% of start locations, 2 teams in 36.3%, 1 team in 18.5%, and no team in 4.3%. Fig. 2 left shows a histogram of localization errors.
Why is it Difficult? Localization through dialog is a challenging task, even for humans. The team's success depends on the uniqueness of the starting position, if and where the Observer chooses to navigate, and how discriminative the Locator's questions are. Additionally, people vary greatly in their ability to interpret maps, particularly when performing mental rotations and shifting perspective (Kozhevnikov et al., 2006), which are both skills required to solve this task. We also observe that individual environments play a significant role in human error: as illustrated in Fig. 2 right, larger buildings and buildings with multiple floors tend to have larger localization errors, as do buildings with multiple similar looking rooms (e.g., multiple bedrooms with similar decorations or office spaces with multiple conference rooms). The buildings with the highest and lowest error are shown in Fig. 3.
Characterizing WAY Dialogs. Fig. 4 shows two example dialogs from WAY. These demonstrate a common trend -the Observer provides descriptions of their surroundings and then the Locator asks clarifying questions to refine the position. More difficult episodes require multiple rounds to narrow down the correct location and the Locator may ask the Observer to move or look for landmarks. On average, dialogs contain 5 messages and 61 words.
The Observer writes longer messages on average (19 words) compared to the Locator (9 words). This asymmetry follows from their respective roles. The Observer has first-person access to high-fidelity visual inputs and must describe their surroundings, 'In a kitchen with a long semicircular black countertop along one wall. There is a black kind of rectangular table and greenish tiled floor.'. Meanwhile, the Locator sees a top-down view and uses messages to probe for discriminative details, 'Is it a round or rectangle table between the chairs?', or to prompt movement towards easier to discriminate spaces, 'Can you go to another main space?'.
As the Locator has no information at the start of the episode, their first message is often a short prompt for the Observer to describe their surroundings, further lowering the average word count. Conversely, the Observer's reply is longer on average at 24 words. Both agents have similar word counts for further messages as they refine the location. See the appendix for details on common utterances for both roles in the first two rounds of dialog.
Role of Navigation. Often the localization task can be made easier by having the Observer move to reduce uncertainty (see bottom example of Fig. 4). This includes moving away from nondescript areas like hallways and moving to unambiguous locations. We observe at least one navigation step in 62.6% of episodes and an average of 2.12 steps. Episodes containing navigation have a significantly lower average localization error (2.70m) compared to those that did not (3.98m). We also observe the intuitive trend that larger environments elicit more navigation. The distributions of start and end locations for the most and least navigated environments are shown in the appendix.

WHERE ARE YOU? Tasks
We now formalize the LED, EVD and CL tasks to provide a clear orientation for future work.
Localization from Embodied Dialog. The LED task is the following: given an episode comprising an environment and a human dialog, $(e, L_0, O_0, \ldots, L_{T-1}, O_{T-1})$, predict the Observer's final location $p_T$. This is a grounded natural language understanding task with pragmatic evaluations: localization error and accuracy at a variable threshold, which in this paper is set to 3 meters. This task does not require navigation or text generation; instead, it mirrors AI-augmented localization applications. An example would be a system that listens to emergency services calls and provides a real time estimate of the caller's indoor location to aid the operator.
Embodied Visual Dialog. This task is to replace the Observer by an AI agent. Given an embodied first-person view of a 3D environment (see Observer view in Fig. 1), the task is to predict the Observer agent's next navigational action and natural language message to the Locator. To evaluate the agent's navigation path, the error in the final location can be used along with path metrics such as nDTW (Ilharco et al., 2019). Generated text can be evaluated against human responses using existing text similarity metrics.
Figure 3: Observers tend to navigate more in featureless areas, such as the long corridor in (a). Localization error is highest in buildings with many repeated indistinguishable features, such as the cathedral with rows of pews in (c).
Figure 4: Examples from the dataset illustrating the Observer's location on the top-down map vs. the Locator's estimate (left) and the associated dialog (right). In the bottom example the Observer navigates to find a more discriminative location, which is a common feature of the dataset. The Observer navigates in 63% of episodes and the average navigation distance for these episodes is 3.4 steps (7.45 meters).
Cooperative Localization. In this task, both the Observer and the Locator are modeled agents. Modeling the Locator agent requires goal-oriented dialog generation and confidence estimation to determine when to end the task by predicting the location of the Observer. Observer and Locator agents can be trained and evaluated independently using strategies similar to the EVD task, or evaluated as a team using localization accuracy as in LED.

Modeling Localization From Embodied Dialog
While the WAY dataset supports multiple tasks, we focus on Localization from Embodied Dialog as a first step. In LED, the goal is to predict the location of the Observer given a dialog exchange.

LED Model from Top-down Views
We model localization as a language-conditioned pixel-to-pixel prediction task -producing a probability distribution over positions in a top-down view of the environment. This choice mirrors the environment observations human Locators had during data collection, allowing straightforward comparison. However, future work need not be restricted to this choice and may leverage the panoramas or 3D reconstructions that Matterport3D provides.
Dialog Representation. Locator and Observer messages are tokenized using a standard toolkit (Loper and Bird, 2002). The dialog is represented as a single sequence with identical 'start' and 'stop' tokens surrounding each message, and then encoded using a single-layer bidirectional LSTM with a 300 dimension hidden state. Word embeddings are initialized using GloVe (Pennington et al., 2014) and finetuned end-to-end.
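As a minimal sketch of how a dialog might be flattened into the single sequence described above, the snippet below wraps every message in the same boundary tokens. The token strings, the function name, and the whitespace split (standing in for the toolkit tokenizer) are assumptions for illustration; the LSTM encoder and GloVe embeddings are not shown.

```python
START, STOP = "<start>", "<stop>"  # assumed token strings

def flatten_dialog(messages):
    """messages: list of (role, text) pairs in turn order.
    Roles are ignored because every message uses identical boundary tokens."""
    tokens = []
    for _role, text in messages:
        tokens.append(START)
        tokens.extend(text.split())  # whitespace split stands in for the tokenizer
        tokens.append(STOP)
    return tokens

# Hypothetical two-message dialog.
dialog = [("locator", "Can you go to another main space?"),
          ("observer", "Yes, I am now in a hallway.")]
tokens = flatten_dialog(dialog)
```

Because the boundary tokens are identical for both roles, the encoder has to infer the speaker from position and content alone.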
Environment Representation. The visual input to our model is the environment map which we scale to 780×455 pixels. We encode this map using a ResNet18 CNN (He et al., 2016) pretrained on ImageNet (Russakovsky et al., 2015), discarding the 3 final conv layers and final fully-connected layer in order to output a 98×57 spatial map with feature dimension 128. Although the environment map is a top-down view which does not closely resemble ImageNet images, in initial experiments we found that using a pretrained and fixed CNN improved over training from scratch.
Figure 5: The 3-layer LingUNet-Skip architecture used to model the Localization from Embodied Dialog task.
Table 2: Comparison of our model with baselines and human performance on the LED task. We report average localization error (LE) and accuracy at 3 and 5 meters (all ± standard error) on the val-seen, val-unseen, and test splits. * denotes oracle access to Matterport3D node locations.
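The reported 98×57 feature map can be checked with simple shape arithmetic. The sketch below assumes the standard ResNet18 layer settings (7×7 stride-2 stem, 3×3 stride-2 max pooling, stride-2 first convolution in the second residual stage) and assumes the network is cut after the second stage, which is an inference from the 128-channel output rather than something the text states.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a convolution or pooling layer along one axis."""
    return (size + 2 * padding - kernel) // stride + 1

def resnet18_stage2_size(size):
    size = conv_out(size, kernel=7, stride=2, padding=3)  # stem convolution
    size = conv_out(size, kernel=3, stride=2, padding=1)  # max pooling
    # stage 1 uses stride 1 and keeps the resolution
    size = conv_out(size, kernel=3, stride=2, padding=1)  # first conv of stage 2
    return size

width, height = resnet18_stage2_size(780), resnet18_stage2_size(455)
print(width, height)  # 98 57
```

Under these assumptions the overall downsampling factor is 8, which matches the stated map size.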

Language-Conditioned Pixel-to-Pixel Model. We adapt a language-conditioned pixel-to-pixel LingUNet to fuse the dialog and environment representations. We refer to the adapted architecture as LingUNet-Skip. As illustrated in Fig. 5, LingUNet is a convolutional encoder-decoder architecture. Additionally we introduce language-modulated skip-connections between corresponding convolution and deconvolution layers. Formally, the convolutional encoder produces feature maps $F_l = \mathrm{Conv}(F_{l-1})$ beginning with the initial input $F_0$. Each feature map $F_l$ is transformed by a 1×1 convolution with weights $K_l$ predicted from the dialog encoding, i.e. $G_l = \mathrm{Conv}_{K_l}(F_l)$. The language kernels $K_l$ are linear transforms from components of the dialog representation split along the feature dimension. Finally, the deconvolution layers combine these transformed skip-connections and the output of the previous layer, such that $H_l = \mathrm{Deconv}([H_{l+1}; (G_l + F_l)])$. There are three layers and the output of the final deconvolution layer is processed by an MLP and a softmax to output a distribution over pixels.
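To make the language-modulated skip connection concrete, here is a small pure-Python sketch of the $G_l + F_l$ term. It is illustrative only: the helper names, the toy sizes, and the random weights are assumptions, and a real implementation would use tensor operations in a deep learning framework.

```python
import random

def linear(weights, vec):
    """Matrix-vector product; weights is a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def predict_kernel(chunk, weights, c_out, c_in):
    """Linearly map one chunk of the dialog encoding to a 1x1 kernel K_l."""
    flat = linear(weights, chunk)
    return [flat[i * c_in:(i + 1) * c_in] for i in range(c_out)]

def conv1x1(feature_map, kernel):
    """G_l = Conv_{K_l}(F_l): the same channel-mixing matrix at every pixel."""
    return [[linear(kernel, pixel) for pixel in row] for row in feature_map]

def skip_input(feature_map, kernel):
    """G_l + F_l, the skip-connection term passed to the decoder."""
    g = conv1x1(feature_map, kernel)
    return [[[a + b for a, b in zip(gp, fp)] for gp, fp in zip(grow, frow)]
            for grow, frow in zip(g, feature_map)]

rng = random.Random(0)
channels, chunk_len, layers = 2, 3, 3  # toy sizes, not the paper's
encoding = [rng.uniform(-1, 1) for _ in range(layers * chunk_len)]
chunks = [encoding[i * chunk_len:(i + 1) * chunk_len] for i in range(layers)]
weights = [[rng.uniform(-1, 1) for _ in range(chunk_len)]
           for _ in range(channels * channels)]
F1 = [[[1.0, 2.0], [0.0, 1.0]]]  # 1x2 map with 2 channels
K1 = predict_kernel(chunks[0], weights, channels, channels)
S1 = skip_input(F1, K1)
```

With a zero kernel the skip term reduces to $F_l$, so the residual addition lets visual features reach the decoder even when the language kernel contributes little.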
Loss Function. We train the model to minimize the KL-divergence between the predicted location distribution and the ground-truth location, which we smooth by applying a Gaussian with standard deviation of 3m (matching the success criterion). During inference, the pixel with highest probability is selected as the final predicted location. For multi-story environments, each floor is processed independently during training. During inference only the ground truth final floor is processed. This is done to maintain accurate Euclidean distance measurements for localization error, as Euclidean distance is not meaningful when measuring between points on different floors of a multi-story environment. This scheme is used for all baseline experiments except for human Locators, who select from all floors.
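The training target and loss can be sketched as below. The map scale (0.5 meters per pixel), the grid size, and the direction KL(target || predicted) are assumptions made for illustration; the text specifies only the 3m Gaussian smoothing and the argmax at inference.

```python
import math

def gaussian_target(height, width, center, sigma_px):
    """Normalized 2D Gaussian over pixels, centered at the true (row, col)."""
    cy, cx = center
    raw = [[math.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2 * sigma_px ** 2))
            for x in range(width)] for y in range(height)]
    total = sum(map(sum, raw))
    return [[v / total for v in row] for row in raw]

def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted), summed over all pixels."""
    return sum(t * math.log((t + eps) / (p + eps))
               for trow, prow in zip(target, predicted)
               for t, p in zip(trow, prow))

def argmax_pixel(dist):
    """Pixel with the highest probability, used as the predicted location."""
    return max(((y, x) for y in range(len(dist)) for x in range(len(dist[0]))),
               key=lambda yx: dist[yx[0]][yx[1]])

meters_per_pixel = 0.5            # hypothetical map scale
sigma_px = 3.0 / meters_per_pixel  # 3m standard deviation in pixels
target = gaussian_target(20, 30, (5, 12), sigma_px)
uniform = [[1.0 / (20 * 30)] * 30 for _ in range(20)]
loss_uniform = kl_divergence(target, uniform)
loss_perfect = kl_divergence(target, target)
```

A uniform prediction incurs a positive loss while a prediction equal to the smoothed target incurs none, which is the behavior the training objective relies on.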

Experimental Setup
Metrics. We evaluate performance using localization error (LE), defined as the Euclidean distance in meters between the predicted Observer location $\hat{p}_T$ and the Observer's actual terminal location $p_T$: $\mathrm{LE} = \|\hat{p}_T - p_T\|_2$. We also report a binary success metric that places a threshold $k$ on the localization error, $\mathbb{1}(\mathrm{LE} \le k)$, for 3m and 5m. The 3m threshold allows for about one viewpoint of error since viewpoints are on average 2.25m apart. We use Euclidean distance for LE because localization predictions are not constrained to the navigation graph. Matterport building meshes contain holes and other errors around windows, mirrors and glass walls, which can be problematic when computing geodesic distances for points off the navigation graph.
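These metrics are straightforward to compute. The sketch below is one possible implementation; the function names and example positions are hypothetical, and the standard error uses the sample standard deviation, which is an assumption about how the reported ± values were obtained.

```python
import math

def localization_error(predicted, actual):
    """Euclidean distance in meters between predicted and true final positions."""
    return math.dist(predicted, actual)

def accuracy_at(errors, k):
    """Fraction of episodes with LE <= k."""
    return sum(e <= k for e in errors) / len(errors)

def standard_error(values):
    """Standard error of the mean, using the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return math.sqrt(variance / n)

# Hypothetical (predicted, actual) map positions in meters.
episodes = [((0.0, 0.0), (3.0, 3.0)), ((1.0, 1.0), (1.0, 2.0)),
            ((0.0, 0.0), (0.0, 2.0)), ((2.0, 2.0), (8.0, 10.0))]
errors = [localization_error(p, a) for p, a in episodes]
acc3, acc5 = accuracy_at(errors, 3.0), accuracy_at(errors, 5.0)
```

In this toy example one episode has an error of 10m, which raises the mean LE far more than it lowers accuracy; the same effect explains how a model can have better accuracy but worse mean error than a baseline.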
Training and Implementation Details. Our LingUNet-Skip model is implemented in PyTorch (Paszke et al., 2019). Training the model involves optimizing around 16M parameters for 15-30 epochs, requiring ∼8 hours on a single GPU. We use the Adam optimizer (Kingma and Ba, 2014) with a batch size of 10 and an initial learning rate of 0.001 and apply Dropout (Srivastava et al., 2014) in non-convolutional layers with p = 0.5. We tune hyperparameters based on val-unseen performance and report the checkpoint with the highest val-unseen Acc@3m. To reduce overfitting we apply color jitter, 180° rotation, and random cropping by 5% to the map during training.
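One practical detail of the 180° rotation augmentation is that the target location has to be transformed together with the map. The text does not describe how labels are handled, so the following is only a sketch of the geometry, with hypothetical helper names and a toy grid.

```python
def rotate_180(grid):
    """Rotate a 2D grid (list of rows) by 180 degrees."""
    return [row[::-1] for row in grid[::-1]]

def rotate_location_180(location, height, width):
    """Map a (row, col) pixel to its position after a 180-degree rotation."""
    row, col = location
    return (height - 1 - row, width - 1 - col)

height, width = 3, 4
location = (0, 1)
grid = [[0] * width for _ in range(height)]
grid[location[0]][location[1]] = 1  # mark the Observer's pixel
rotated = rotate_180(grid)
new_location = rotate_location_180(location, height, width)
```

After the rotation the marked pixel sits at `new_location`, so the map and its label stay aligned.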
Baselines. We consider a number of baselines and human performance to contextualize our results and analyze WAY:
- Human Locator. The average performance of AMT Locator workers as described in Sec. 3.
- Random. Uniform random pixel selection.
- Center. Always selects the center coordinate.
- Random Node. Uniformly samples from Matterport3D node locations. This uses oracle knowledge about the test environments. While not a fair comparison, we include this to show the structural prior of the navigation graph which reduces the space of candidate locations.
- Heuristic Driven. For each dialog $D_t$ in the validation splits we find the most similar dialog $D_g$ in the training dataset based on BLEU score (Papineni et al., 2002). From the top-down map associated with $D_g$, a 3m × 3m patch is taken around the ground truth Observer location. We predict the location for $D_t$ by convolving this patch with the top-down maps associated with $D_t$ and selecting the most similar patch (according to Structural Similarity). The results (below) are only slightly better than random.
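The three no-learning baselines that select a location directly can be sketched in a few lines. The function names and the example node list are hypothetical, and the map is assumed to be 780 pixels wide and 455 pixels high.

```python
import random

def random_pixel(height, width, rng):
    """Random baseline: uniform choice over all map pixels."""
    return (rng.randrange(height), rng.randrange(width))

def center_pixel(height, width):
    """Center baseline: always the middle of the map."""
    return (height // 2, width // 2)

def random_node(node_pixels, rng):
    """Random Node baseline: uniform choice over known node locations."""
    return rng.choice(node_pixels)

rng = random.Random(0)
height, width = 455, 780
guess = random_pixel(height, width, rng)
center = center_pixel(height, width)
node = random_node([(100, 200), (300, 650)], rng)  # hypothetical node pixels
```

Random Node draws from a much smaller candidate set than Random, which is why it needs oracle access to the navigation graph.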

Results
Tab. 2 shows the performance of our LingUNet-Skip model and relevant baselines on the val-seen, val-unseen, and test splits of the WAY dataset.
Human and No-learning Baselines. Humans succeed 70.4% of the time in test environments. Notably, val-unseen environments are easier for humans (79.7%); see the appendix for details. The Random Node baseline outperforms the pixel-wise Random setting (Acc@3m and Acc@5m for all splits) and this gap quantifies the bias in nav-graph positions. We find the Center baseline to be rather strong in terms of localization error but not accuracy, where it lags behind our learned model significantly (Acc@3m and Acc@5m for all splits).

LingUNet-Skip outperforms baselines. Our LingUNet-Skip significantly outperforms the hand-crafted baselines in terms of accuracy at 3m, improving the best baseline, Center, by an absolute 10% (test) to 30% (val-seen and val-unseen) across splits (a 45-130% relative improvement). Despite this, it achieves higher localization error than the Center model for val-unseen and test. This is a consequence of our model occasionally being quite wrong despite its overall stronger localization performance. There remains a significant gap between our model and human performance, especially on novel environments (70.4% vs 32.7% on test).

Ablations and Analysis
Tab. 3 reports detailed ablations of our LingUNet-Skip model. Following standard practice, we report performance on val-seen and val-unseen.
Navigation Nodes Prior. We do not observe significant differences between val-seen (train environments) and val-unseen (new environments), which suggests the model is not memorizing the node locations. Even if the model did, learning this distribution would likely amount to free-space prediction which is a useful prior in localization.

Input Modality Ablations. No Vision explores the extent to which linguistic priors can be exploited by LingUNet-Skip, while No Dialog does the same for visual priors. No Dialog beats the Center baseline (32.1% vs. 29.8% val-unseen Acc@3m), indicating that it has learned a visual centrality prior that is stronger than the center coordinate. This makes sense because some visual regions like nondescript hallways are less likely to contain terminal Observer locations. Both No Vision and No Dialog perform much worse than our full model (7.8% and 32.1% val-unseen Acc@3m vs. 45.6%), suggesting that the task is strongly multimodal.
Dialog Halves. First-half Dialog uses only the first half of dialog pairs, while Second-half Dialog uses just the second half. Together, these examine whether the start or the end of a dialog is more salient to our model. We find that First-half Dialog performs marginally better than using the full dialog (46.2% vs 45.6% val-unseen Acc@3m) which we suspect is due to our model's failure to generalize second half dialog to unseen environments and problems handling long sequences. Further intuition for these results is that the first-half of the dialog contains coarser grained descriptions and discriminative statements ("I am in a kitchen"). The second-half of the dialog contains more fine grained descriptions (relative to individual referents in a room). Without the initial coarse localization, the second-half dialog is ungrounded and references to initial statements are not understood, therefore leading to poor performance.
Observer dialog is more influential. Observer-only ablates Locator dialog and Locator-only ablates Observer dialog. We find that Observer-only significantly outperforms Locator-only (44.9% vs. 33.3% val-unseen Acc@3m). This is an intuitive result as Locators in the WAY dataset commonly query the Observer for new information. We note that Observers were guided by the Locators in the collection process (e.g. 'What room are you in?'), and that ablating the Locator dialog does not remove this causal influence.
Shuffling Dialog Rounds. Shuffle Rounds considers the importance of the order of Locator-Observer dialog pairs by shuffling the rounds. Shuffling causes our LingUNet-Skip model to drop by just 0.7% absolute val-unseen Acc@3m (2% relative).
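The dialog ablations above (First-half, Second-half, Observer-only, Locator-only, and Shuffle Rounds) can be sketched as simple transformations of the dialog before it is fed to the model. The sketch below assumes a dialog is a list of (locator_message, observer_message) rounds. The function names, the empty-string masking, and the handling of odd-length dialogs are hypothetical choices for illustration, not the authors' implementation.

```python
import random


def _midpoint(rounds):
    # Assumption: with an odd number of rounds, the extra round goes
    # to the first half.
    return (len(rounds) + 1) // 2


def first_half(rounds):
    """Keep only the first half of the Locator-Observer rounds."""
    return rounds[:_midpoint(rounds)]


def second_half(rounds):
    """Keep only the second half of the Locator-Observer rounds."""
    return rounds[_midpoint(rounds):]


def observer_only(rounds):
    """Ablate Locator messages (assumed here to be replaced by empty strings)."""
    return [("", observer_msg) for _, observer_msg in rounds]


def locator_only(rounds):
    """Ablate Observer messages (assumed here to be replaced by empty strings)."""
    return [(locator_msg, "") for locator_msg, _ in rounds]


def shuffle_rounds(rounds, seed=0):
    """Return a shuffled copy of the rounds; each pair stays intact."""
    shuffled = list(rounds)
    random.Random(seed).shuffle(shuffled)
    return shuffled


# Toy dialog with three rounds.
toy_dialog = [
    ("Where are you?", "I am in a kitchen."),
    ("Is there an island?", "Yes, with three stools."),
    ("Are you near the fridge?", "I am right next to it."),
]
```

Note that `shuffle_rounds` permutes whole rounds rather than individual messages, matching the description of shuffling Locator-Observer pairs.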

Conclusion and Future Work
In summary, we propose a new set of embodied localization tasks: Localization from Embodied Dialog (LED; localizing the Observer from dialog history), Embodied Visual Dialog (EVD; modeling the Observer), and Cooperative Localization (CL; modeling both agents). To support these tasks, we introduce WHERE ARE YOU? (WAY), a dataset containing ∼6k human dialogs from a cooperative localization scenario in a 3D environment. WAY is the first dataset to present extensive human dialog for an embodied localization task. On the LED task, we show that a LingUNet-Skip model improves over simple baselines and model ablations, but does not take full advantage of the second half of the dialog. Since WAY encapsulates multiple embodied localization tasks, there remains much to be explored.
Val-Unseen has higher accuracy than other splits. Human localization Acc@3m is 79.4% on val-unseen, which is higher than on all other splits, including test, which has an Acc@3m of 70.4%. Following standard practice, our splits follow Chang et al. (2017). The val-unseen split is notably smaller than the other splits, and through qualitative analysis we found that its environments are generally smaller and have discriminative features, to which we attribute the high localization performance on this split. Our LingUNet-Skip model has lower performance on test than on val-unseen, which we reason is due to the nature of the environments in the two splits. Additionally, the LingUNet-Skip model has lower performance on test than on val-seen, which is expected because test environments are unseen, whereas val-seen environments are contained in the training set.
Navigation differs between environments. As previously discussed, different environments in the WAY dataset exhibit varying levels of navigation. This is likely attributable to a few factors, such as the size of the building and its discriminative features (e.g. decorations). Additionally, we see that features like long hallways frequently lead to long navigational paths. The variation in navigation between environments is further illustrated in Fig. 7. While the distribution of positions barely changes between the start and final locations for the environment on the left, we see a significant change in the environment on the right. Most noticeably, there are no final positions in the long corridor of the right environment, despite it containing several start locations.

Data Collection
Interface. Fig. 8 shows the data collection interface for the Observer and Locator human annotators. The two annotators were able to chat with each other via a message box that also displayed the chat history. The Locator had a top-down map of the environment and buttons to switch between floors. The Observer was given a first-person view of the environment and could navigate by clicking on the blue cylinders shown in Fig. 8.
Closer Look at Dialog. Fig. 9 further breaks this down by looking at the average length of specific messages of the two agents. The Locator's first message is short in comparison to the average number of words per message for that agent. This is expected, as this message is always some variation of asking the Observer to describe their location, and it follows that the message has a low number of unique words. The Observer's first message is by far their longest, at 23.9 words, which is logical since in this message the Observer is trying to give the most unique description possible with no constraint on length. The distributions become more uniform in the 2nd messages from both the Locator and the Observer. While the first message of the Observer has a large number of unique words, the distribution over those words is not uniform, leading to the conclusion that the message has a common structure but that the underlying content is still discriminative for modeling the location. The word distributions of messages further along in the dialog sequence are largely conditioned on the previous message from the other agent, which means that accurately encoding the dialog history is important for accurate location estimation.
Distribution of Localization Error. In order to better understand the distribution of the LingUNet-Skip model's predictions, we visualize the distributions in Fig. 10.
Success and Failure Examples. To qualitatively evaluate our model, we visualize the predicted distributions, the true location over the top-down map, and the dialog in Fig. 11. We also show two failure cases in which the model predicts the wrong location.
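The per-message length statistics discussed under Closer Look at Dialog (for example, the mean length of the Observer's first message) could be computed along the following lines. This is a minimal sketch that assumes each dialog is a list of (role, text) tuples in chronological order and that words are split on whitespace; both the data layout and the function name `mean_words_by_position` are hypothetical.

```python
from collections import defaultdict


def mean_words_by_position(dialogs):
    """Mean word count keyed by (role, n), where n is the 1-based index
    of that role's message within its dialog.

    Assumption: whitespace tokenization; the original analysis may have
    tokenized differently.
    """
    totals = defaultdict(lambda: [0, 0])  # key -> [word_sum, message_count]
    for dialog in dialogs:
        seen = defaultdict(int)  # messages seen so far for each role
        for role, text in dialog:
            seen[role] += 1
            key = (role, seen[role])
            totals[key][0] += len(text.split())
            totals[key][1] += 1
    return {key: word_sum / count for key, (word_sum, count) in totals.items()}


# Toy data: two short dialogs.
toy_dialogs = [
    [("locator", "where are you"),
     ("observer", "I am in a kitchen near the sink")],
    [("locator", "describe your location please"),
     ("observer", "hallway")],
]
toy_stats = mean_words_by_position(toy_dialogs)
```

On the toy data, the mean length of the first Locator message is 3.5 words and that of the first Observer message is 4.5 words.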

[Figure panel labels: Least Navigation; Most Navigation]
Figure 9: Distributions of the first four words for each of the first four messages of the dialogs in the WAY dataset, separated by message number and role type (panel label recovered: Locator 2nd Message). The ordering of the words starts in the center and radiates outwards. The arc length is proportional to the number of messages containing the word. The white areas are words that had too low a count to illustrate.