ArraMon: A Joint Navigation-Assembly Instruction Interpretation Task in Dynamic Environments

For embodied agents, navigation is an important ability but not an isolated goal. Agents are also expected to perform specific tasks after reaching the target location, such as picking up objects and assembling them into a particular arrangement. We combine Vision-and-Language Navigation, assembly of collected objects, and object referring expression comprehension to create a novel joint navigation-and-assembly task, named ARRAMON. During this task, the agent (similar to a PokéMON GO player) is asked to find and collect different target objects one-by-one by navigating based on natural language (English) instructions in a complex, realistic outdoor environment, and to then ARRAnge the collected objects part-by-part in an egocentric grid-layout environment. To support this task, we implement a 3D dynamic environment simulator and collect a dataset with human-written navigation and assembly instructions and the corresponding ground truth trajectories. We also filter the collected instructions via a verification stage, leading to a total of 7.7K task instances (30.8K instructions and paths). We present results for several baseline models (integrated and bias-checking) and metrics (nDTW, CTC, rPOD, and PTC), and the large model-human performance gap demonstrates that our task is challenging and presents a wide scope for future work.


Introduction
Navigation guided via flexible natural language (NL) instructions is a crucial capability for robotic and embodied agents. Such systems should be capable of interpreting human instructions to correctly navigate realistic, complex environments and reach destinations by understanding the environment and associating referring expressions in the instructions with the corresponding visual cues in the environment. Many research efforts have focused on this important vision-and-language navigation task (MacMahon et al., 2006; Mooney, 2008; Chen and Mooney, 2011; Tellex et al., 2011; Mei et al., 2016; Hermann et al., 2017; Anderson et al., 2018; Misra et al., 2018; Das et al., 2018; Thomason et al., 2019; Chen et al., 2019; Jain et al., 2019; Shridhar et al., 2020; Qi et al., 2020; Hermann et al., 2020). However, in real-world applications, navigation alone is rarely the exclusive goal. In most cases, agents will navigate to perform another task at their destination, and also repeat subtasks; e.g., a warehouse robot may be asked to pick up several objects from different locations and then assemble the objects into a desired arrangement. When these additional tasks are interweaved with navigation, the degree of complexity increases substantially due to cascading errors. Relatively few studies have focused on this idea of combining navigation with other tasks. Touchdown (Chen et al., 2019) combines navigation and object referring expression resolution, REVERIE (Qi et al., 2020) performs remote referring expression comprehension, and ALFRED (Shridhar et al., 2020) combines indoor navigation and household manipulation. However, there has been no task that integrates navigation in complex outdoor spaces with an assembly task (and object referring expression comprehension), requiring spatial relation understanding in an interweaved temporal way, in which the two tasks alternate for multiple turns with cascading error effects (see Figure 1).

Figure 1: Navigation and assembly phases (2 turns), via NL (English) instructions in a dynamic 3D environment. In the navigation phase, agents are asked to find and collect a target object. In the assembly phase, agents have to egocentrically place the collected object at a relative location (navigation turn 2 starts where turn 1 ends; we only show 3 snapshots here for space reasons, but the full simulator and its image set will be made available).
Thus, we introduce a new task that combines the navigation, assembly, and referring expression comprehension subtasks. This new task can be explained as an intuitive combination of the navigation and collection aspects of PokéMON GO and an ARRAnging (assembling) aspect, hence we call it 'ARRAMON'. In this task, an agent needs to follow navigational NL instructions through a complex, fine-grained outdoor city environment to collect diverse target objects via referring expression comprehension and dynamic 3D visuospatial relationship understanding w.r.t. other distracter objects. Next, the agent is asked to place those objects at specific locations (relative to other objects) in a grid environment based on an assembly NL instruction. These two phases are performed repeatedly in an interweaved manner to create an overall configuration of the set of collected objects. To enable the ARRAMON task, we also implement a simulator built in the Unity game engine to collect the dataset (see Appendix B.2 for the simulator interface). This simulator features a 3D synthetic city environment based on real-world street layouts with realistic buildings and textures (backed by Mapbox) and a dynamic grid-floor assembly room (Figure 1), both from an egocentric view (the full simulator and its image set will be made available). We take 7 disjoint sub-sections from the city map and collect instructions from workers within each section. Workers had to write instructions based on ground truth trajectories (represented as path lines in navigation, and location highlighting during assembly). We placed diverse background objects as well as target objects so that the rich collected instructions require agents to utilize strong linguistic understanding. The instructions were then executed by a new set of annotators in a second verification stage and were filtered based on low match w.r.t. the original ground truth trajectory and the accuracy of assembly placement.
Overall, this resulted in a dataset of 7,692 task instances with multiple phases and turns (a total of 30,768 instructions and paths). To evaluate performance in our ARRAMON task, we employ both the existing metric of nDTW (Normalized Dynamic Time Warping) (Ilharco et al., 2019) and our newly-designed metrics: CTC-k (Collected Target Correctness), rPOD (Reciprocal Placed Object Distance), and PTC (Placed Target Correctness). In the navigation phase, nDTW measures how similar generated paths are to the ground truth paths, while CTC-k computes how closely agents reach the targets. In the assembly phase, rPOD calculates the reciprocal distance between the target and the agent's placement locations, and PTC counts the correspondence between those locations. Due to the interweaving property of our task, with multiple navigation and assembly phases and turns, performance in the previous turn and phase cascades into the metric scoring of the next turn and phase (Section 3.2).
Lastly, we implement multiple baselines as good starting points and to verify that our task is challenging and the dataset is unbiased. We present integrated vision-and-language, vision-only, language-only, and random-walk baselines. Our vision-and-language model shows better performance than the other baselines, which implies that our ARRAMON dataset is not skewed; moreover, there exists a very large gap between this model and human performance, implying that our ARRAMON task is challenging and that there is substantial room for improvement by future work. We will publicly release the ARRAMON simulator, dataset, and code, along with a leaderboard, to encourage further community research on this realistic and challenging joint navigation-assembly task.

Figure 2: Illustration of the basic object types that the agent must collect, and which will also appear as distracter objects during both navigation and assembly phases.

Related Work
Vision-and-Language Navigation.
Recently, Vision-and-Language Navigation (VLN) tasks, in which agents follow NL instructions to navigate through an environment, have been actively studied in research communities (MacMahon et al., 2006; Mooney, 2008; Chen and Mooney, 2011; Tellex et al., 2011; Mei et al., 2016; Hermann et al., 2017; Anderson et al., 2018; Misra et al., 2018; Das et al., 2018; Thomason et al., 2019; Chen et al., 2019; Jain et al., 2019; Shridhar et al., 2020; Qi et al., 2020; Hermann et al., 2020). To encourage the exploration of this challenging research topic, multiple simulated environments have been introduced. Synthetic (Kempka et al., 2016; Beattie et al., 2016; Kolve et al., 2017; Brodeur et al., 2017; Wu et al., 2018; Savva et al., 2017; Zhu et al., 2017; Yan et al., 2018; Shah et al., 2018; Puig et al., 2018) as well as real-world, image-based environments (Anderson et al., 2018; Xia et al., 2018; Chen et al., 2019) have been used to provide agents with diverse and complementary training environments.
Referring Expression Comprehension. The ability to make connections between objects or spatial regions and the natural language expressions that describe them has been a focus of many studies. Given that humans regularly carry out complex symbolic-spatial reasoning, there has been much effort to improve the capability of referring expression comprehension (including for remote objects) in agents (Kazemzadeh et al., 2014; Mao et al., 2016; Hu et al., 2016; Yu et al., 2018; Chen et al., 2019; Qi et al., 2020), but such reasoning remains challenging for current models. Our ARRAMON task integrates substantial usage of referring expression comprehension as a requirement, as it is necessary for the successful completion of both the navigation and assembly phases.
Assembling Task. Object manipulation and configuration is another subject that has been studied along with language and vision grounding (Bisk et al., 2016; Wang et al., 2016; Li et al., 2016; Bisk et al., 2018).
However, most studies focus on addressing the problem in relatively simple environments from a third-person view. Our ARRAMON task, on the other hand, provides a challenging dynamic, multi-step egocentric viewpoint within a more realistic and interactive 3D, depth-based environment. Moreover, the spatial relationships in ARRAMON dynamically change every time the agent moves, making 'spatial-action' reasoning more challenging. We believe that an egocentric viewpoint is a key part of how humans perform spatial reasoning, and that such an approach is therefore vital to producing high-quality models and datasets.
These three directions of research are typically pursued independently (esp. navigation and assembling), and there have been only a few recent efforts to combine the traditional navigation task with other tasks. Touchdown (Chen et al., 2019) combines navigation and object referring expression resolution, REVERIE (Qi et al., 2020) performs remote referring expression comprehension, while ALFRED (Shridhar et al., 2020) combines indoor navigation and household manipulation. Our new complementary task merges navigation in a complex outdoor space with object referring expression comprehension and assembling tasks that require spatial relation understanding in an interweaved temporal style, in which the two tasks alternate for multiple turns leading to cascading error effects. This will allow development of agents with more integrated, human-like abilities that are essential in real-world applications such as moving and arranging items in warehouses; collecting material and assembling structures in construction sites; finding and rearranging household objects in homes.

Task
The ARRAMON task consists of two phases: navigation and assembly. We define one turn as one navigation phase plus one assembly phase (see Figure 1). Both phases are repeated twice (i.e., 2 turns), starting with the navigation phase. During the navigation phase, an agent is asked to navigate a rich outdoor city environment by following NL instructions, and then collect the target object identified in the instructions via diverse referring expressions. During the assembly phase, the agent is asked to place the collected object (from the previous navigation phase) at a target location on a grid layout, using a different NL instruction with relative spatial referring expressions. Target objects and distracter objects are selected from one of seven objects shown in Figure 2 and then are given one of two different patterns and one of seven different colors (see Figure 11 in Appendices). In both phases, the agent can take 4 actions: forward, left, right, and an end pickup/place action. Forward moves the agent 1 step ahead, and left/right makes the agent rotate 30° in the respective direction. 6
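As a concrete illustration, this discrete action space can be modeled as a simple pose update. The following is a minimal sketch assuming unit-length forward steps and headings in degrees with 0° facing the +y direction; the function name and conventions are illustrative, not the simulator's actual API:

```python
import math

def step(x, y, heading_deg, action):
    """Apply one discrete ARRAMON-style action to an agent pose.

    forward: move 1 unit in the facing direction;
    left/right: rotate 30 degrees;
    any other action (pickup/place) ends the phase, pose unchanged.
    """
    if action == "forward":
        rad = math.radians(heading_deg)
        return x + math.sin(rad), y + math.cos(rad), heading_deg
    if action == "left":
        return x, y, (heading_deg - 30) % 360
    if action == "right":
        return x, y, (heading_deg + 30) % 360
    return x, y, heading_deg  # end action: pickup or place
```

Because rotations are quantized to 30°, an agent returns to its original heading after twelve turns in the same direction.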

Environment
Navigation Phase. In this phase, agents are placed at a random spot in one of the seven disjoint sub-sections of the city environment (see Figure 3), provided with an NL instruction, and asked to find the target object. The city environment is filled with background objects: buildings and various objects found on streets (see Figure 4). There are also a few distracter objects in the city that are similar to target objects (in object type, pattern, and color). During this phase, the agent's end action is 'pickup'. The pick-up action allows agents to pick up any collectible object within range (a rectangular area: 0.5 unit distance from the agent toward both their left and right hand side and 3 unit distance forward).
Assembly Phase. Once the agent picks up the collectible object in the navigation phase, they enter the assembly phase. In this phase, agents are again provided with an NL instruction, but they are now asked to place the target object they collected in the previous phase at the target location identified in the instruction. When the assembly phase begins, 8 decoy basic-type objects (Figure 2) with random patterns and colors are placed for use as distractions. In this phase, agents can only move on a 4-by-5 grid layout. The grid is bordered by 4 walls, each with a different texture/pattern (wood, brick, spotted, striped) to allow for more diverse expressions in the assembly phase. The agent's end action is 'place', which puts the collected object onto the grid one step ahead. Agents cannot place objects diagonally and, unlike in the navigation phase, cannot move forward diagonally.
6 In our task environment, holistically, the configuration of the set of objects dynamically changes as agents pick up and place or stack them relative to the other objects, which is one challenging interaction between the objects.
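The pickup range and the one-step-ahead placement rule described above can be sketched as simple geometric checks. These are illustrative helpers only: we assume an egocentric offset frame for the pickup test and cardinal facings on the 4-by-5 grid, which the simulator may represent differently.

```python
def in_pickup_range(dx_right, dz_forward):
    """Whether an object at an agent-relative offset is collectible.

    The pickup zone is described as a rectangle extending 0.5 units
    to either side of the agent and 3 units ahead; offsets here are
    in the agent's egocentric frame.
    """
    return abs(dx_right) <= 0.5 and 0 <= dz_forward <= 3

def place_cell(agent_row, agent_col, facing):
    """Target cell for the 'place' end action: one step ahead of the
    agent on the 4-by-5 assembly grid, or None if that cell is off-grid."""
    drow, dcol = {"north": (-1, 0), "south": (1, 0),
                  "west": (0, -1), "east": (0, 1)}[facing]
    r, c = agent_row + drow, agent_col + dcol
    return (r, c) if 0 <= r < 4 and 0 <= c < 5 else None
```

The off-grid check mirrors the walled border of the assembly room: an agent facing a wall has no valid placement cell.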
Hence, to accomplish the overall joint navigation-assembly task, agents are required to have integrated abilities. During navigation, they must take actions based on understanding the egocentric view and aligning the NL instructions with the dynamic visual environment to successfully find the target objects (relevant metrics: nDTW and CTC-k, see Section 3.2). During assembly, from an egocentric view, they must understand 3D spatial relations among objects identified by referring expressions in order to place the target objects at the right relative location (relevant metrics: PTC and rPOD, see Section 3.2). 7

Metrics
Normalized Dynamic Time Warping (nDTW). To encourage the agent to follow the paths closely during the navigation task, we employ nDTW (Ilharco et al., 2019) as our task metric. nDTW measures the similarity between a ground-truth path and a predicted trajectory of an agent, thus penalizing randomly walking around to find and pick up the target object.
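A minimal sketch of this metric, following the formulation of Ilharco et al. (2019), nDTW = exp(-DTW(R, Q) / (|R| * d_th)), where d_th is the success-threshold distance; the value of d_th below is illustrative:

```python
import math

def ndtw(ref, pred, d_th=3.0):
    """Normalized Dynamic Time Warping between a reference path and a
    predicted path (each a list of (x, y) points), per Ilharco et al.
    (2019): nDTW = exp(-DTW(ref, pred) / (len(ref) * d_th)).
    """
    n, m = len(ref), len(pred)
    INF = float("inf")
    # Standard DTW dynamic program over pairwise Euclidean distances.
    dtw = [[INF] * (m + 1) for _ in range(n + 1)]
    dtw[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(ref[i - 1], pred[j - 1])
            dtw[i][j] = cost + min(dtw[i - 1][j], dtw[i][j - 1], dtw[i - 1][j - 1])
    return math.exp(-dtw[n][m] / (n * d_th))
```

An agent that reproduces the ground-truth path exactly scores 1.0, and the score decays toward 0 as the warped distance between the two paths grows.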
Collected Target Correctness (CTC). An agent that understands the given NL instructions well should find and pick up the correct target object at the end of the navigation task. Therefore, we evaluate the agent's ability with CTC, which has a value of 1 if the agent picks up a correct object, and a value of 0 if they pick up an incorrect object or do not pick up any object. Since collecting the correct object is a difficult task, we also implement the CTC-k metric, which measures the CTC score at distance k. If the agent is within distance k of the target object, then the value is 1, otherwise it is 0 (CTC-0 indicates the original CTC).
Placed Target Correctness (PTC). In the assembly task, placing the collected object at the exact target position is most important. The PTC metric counts the correspondence between the target location and the placed location. If the placed and target locations match, then PTC is 1, otherwise it is 0. If the collected object is not correct, then the score is also 0.
Reciprocal Placed Object Distance (rPOD). We also consider the distance between the target position and the position where the collected object is eventually placed in the assembly task (Bisk et al., 2018). The distance is squared to penalize the agent more for placing the object far from the target position. Then 1 is added and the reciprocal is taken to normalize the final metric value: rPOD = 1 / (1 + D_a^2), where D_a is the Manhattan distance between the target and placed object positions. If the collected object is not correct, then the score is 0 (see Figure 9 in Appendices).
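Under the definitions above, rPOD and PTC can be sketched as follows (assuming grid coordinates and Manhattan distance; the function names and argument conventions are illustrative):

```python
def rpod(placed, target, collected_correct=True):
    """rPOD = 1 / (1 + D_a^2), where D_a is the Manhattan distance
    between the target and placed positions; 0 if the wrong object
    (or no object) was collected."""
    if not collected_correct:
        return 0.0
    d = abs(placed[0] - target[0]) + abs(placed[1] - target[1])
    return 1.0 / (1.0 + d ** 2)

def ptc(placed, target, collected_correct=True):
    """PTC: 1 only if the correct object was collected and placed
    exactly at the target cell, else 0."""
    return int(collected_correct and placed == target)
```

Note that rPOD upper-bounds PTC: a perfect placement yields rPOD = 1 (and PTC = 1), while a placement one cell away already halves the rPOD score.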
Overall, our metrics reflect the interweaving property of our task. For example, if agents show poor performance in the first turn navigation phase (i.e., low nDTW and CTC-k scores), they will not obtain high scores in the continuing assembly phase (i.e., low PTC and rPOD scores), also leading to lower scores in the second turn navigation phase.

ARRAMON Dataset
Our ARRAMON navigation-assembly dataset is a collection of rich human-written NL (English) instructions. The navigation instructions explain how to navigate the large outdoor environments and describe which target objects to collect. The assembly instructions provide the desired target locations for placement relative to other objects. Each instruction set in the dataset is accompanied by ground truth (GT) trajectories and placement locations. Data was collected via the online crowd-sourcing platform Amazon Mechanical Turk (AMT).

Data Collection
The data collection process was broken into two stages: Stage 1: Writing Instructions, and Stage 2: Following/Verifying Instructions. Within each stage, there are two phases: Navigation and Assembly (see Figure 15 in Appendices for the interface of each stage and each phase). During the first stage's navigation phase, a crowdworker is placed in the city environment as described in Section 3.1 and moves along a blue navigation line (representing the GT path) that will lead them to a target object (see Appendix B.1 for the exact route generation details). While the worker travels this line, they write instructions describing their path (e.g., "Turn to face the building with the green triangle on a blue ... Walk past the bench to the dotted brown TV and pick it up."). Workers were bound to this navigation line to ensure that they wrote instructions only based on what they could see from the GT path. Next, the worker starts the first stage's assembly phase and is placed in a small assembly room, where they must place the object they just collected in a predetermined location (indicated by a transparent black outline of the object they just collected) and write instructions on where to place the object relative to other objects from an egocentric viewpoint (e.g., "Place the dotted brown TV in front of the striped white hourglass."). The worker is then returned to the city environment and repeats both phases once more.
A natural way of verifying the instruction sets from Stage 1 is to have new workers follow them (Chen et al., 2019). Thus, during Stage 2 Verification, a new worker is placed in the environment encountered by the Stage 1 worker and is provided with the NL instructions that were written by that Stage 1 worker. The new worker has to follow the instructions to find the target objects in the city and place them in the correct positions in the assembly environment. Each instruction set from Stage 1 is verified by three unique crowdworkers to ensure that instructions are correctly verified. Next, the Stage 2 workers' performance was evaluated using the nDTW and PTC metrics. If at least one of the three Stage 2 workers scored higher than 0.2 on nDTW in both navigation turns and had a score of 1 on PTC in both assembly turns, then the corresponding Stage 1 instruction set was considered high quality and kept in the dataset; otherwise, it was discarded. The remaining dataset has a high average nDTW score of 0.66 and an even higher expert score of 0.81 (see Sec. 8). 8
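This filtering rule can be sketched as a small predicate over a set of verification runs; the record format for a worker's run is an assumption for illustration, not the dataset's release format:

```python
def keep_instruction_set(worker_runs):
    """Stage-2 filtering rule: keep a Stage-1 instruction set if at
    least one verifying worker scored nDTW > 0.2 in BOTH navigation
    turns AND PTC == 1 in BOTH assembly turns.

    Each run is a dict with 'ndtw' and 'ptc' lists, one entry per turn.
    """
    return any(
        all(s > 0.2 for s in run["ndtw"]) and all(p == 1 for p in run["ptc"])
        for run in worker_runs
    )
```

Requiring only one of the three verifiers to succeed keeps instructions that are followable in principle while discarding those no verifier could execute.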

Data Quality Control
Instructions written by the Stage 1 workers needed to be clear and understandable. Workers were encouraged to follow certain rules and guidelines so that the resulting instructions would be of high quality and make proper use of the environment.
Guidelines, Automated Checks, and Qualification Tests. Detailed guidelines were put in place to help ensure that the instructions written contained as few errors as possible. Rules were shown to workers before the start of the task, and active automated checks took place as the workers wrote. These active checks helped prevent poor instructions (such as those including certain symbols) from being submitted, requiring workers to fix them before submitting. In cases where instruction quality was questionable, an email notification was sent (see Appendix B.1 for the exact guidelines and checks that were implemented, as well as details regarding the email notifications). A screening test was also required at the start of both stages to test the crowdworkers' understanding of the task.
If a wrong answer was chosen, an explanation was displayed and the crowdworker was allowed to try again (see Figures 13 and 14 in Appendices for the screening tests). To help workers place the object in the right location during Stage 2, we use a simple placement test, which they pass by placing an object at the correct place during a mock assembly phase (see Appendix B.1 for details).
Worker Qualifications. Workers completing the task were required to pass certain qualifications before they could begin. As the Stage 1 and 2 tasks require reading English instructions (Stage 1 also involves writing), we required workers to be from native English-speaking countries. Workers were also required to have at least 1000 approved tasks and a 95% or higher approval rating.
The instructions frequently rely on 3D depth to guide the agent, implying that the combined navigation and assembly task requires that agents possess a full understanding of object relations in a 3D environment. Our analysis showed other linguistic properties, such as frequent directional references, egocentric and allocentric spatial relations, temporal conditions, and sequencing (see Appendix C.1 for the details and examples).
Dataset Statistics. Figure 5 shows the most frequently occurring words in our dataset. These words are primarily directional or spatial relations. This implies that agents should be able to understand the concept of direction and the spatial relations between objects, especially as they change with movement. Table 1 and Figure 6 show that navigation tends to have longer instructions and path lengths. Assembly occurs in a smaller environment, requiring agents to focus less on understanding paths than in navigation and more on understanding the 3D spatial relations of objects from the limited egocentric viewpoint.
8 Workers were allowed to repeat both tasks; however, they were prevented from encountering an identical map setting that already had instructions during Stage 1, or their own instructions during Stage 2.

Models
We train an integrated vision-and-language model as a good starting-point baseline for our task. To verify that our dataset is not biased toward specific factors, we also train ablated and random-walk models and evaluate them on the dataset.
Vision-and-Language Baseline. This model uses vision and NL instruction features together to predict the next actions (Figure 7). We implement each module for the navigation/assembly phases over Img_t, the view of the agent at time step t, and Inst., the natural language instruction given to the agent. We train the model with the teacher-forcing approach (Lamb et al., 2016) and a cross-entropy loss L = -Σ_t log p(a*_t), where a*_t is the ground truth action at time step t.
Vision/Language-Only Baselines. To check for unimodal bias, we evaluate vision-only and language-only baselines on our dataset. These exploit only a single modality (visual or language) to predict the appropriate next action. Specifically, they use the same architecture as the Vision-and-Language baseline, except for the Cross-Attn module.
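The teacher-forcing objective L = -Σ_t log p(a*_t) can be sketched as a generic cross-entropy over per-step action logits; the model producing those logits (the modules in Figure 7) is not reproduced here, so this only illustrates the loss computation itself:

```python
import numpy as np

def teacher_forcing_loss(logits, gt_actions):
    """Mean cross-entropy teacher-forcing loss,
    L = -(1/T) * sum_t log p(a*_t),
    where logits has shape (T, n_actions) and gt_actions holds the
    ground-truth action index a*_t for each time step t.
    """
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(gt_actions)), gt_actions].mean()
```

Under teacher forcing, the agent is conditioned on the ground-truth action history at training time, so the loss decomposes into independent per-step terms as above.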
Random Walk. Agents take a random action at each time step without considering instruction and environment information.
Shortest Path. This baseline simulates an agent that follows the shortest path provided by the A* algorithm (Hart et al., 1968), to show that the GT paths are optimal in terms of trajectory distance.
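For illustration, a minimal grid-based A* with a Manhattan heuristic (a simplified stand-in for the route generation described in Appendix B.1; the simulator's actual map representation may differ):

```python
import heapq

def astar(grid, start, goal):
    """A* (Hart et al., 1968) on a 4-connected grid with a Manhattan
    heuristic. grid[r][c] == 1 marks an obstacle; start/goal are
    (row, col) tuples. Returns the shortest path length in steps,
    or None if the goal is unreachable.
    """
    rows, cols = len(grid), len(grid[0])
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])
    frontier = [(h(start), 0, start)]  # (f = g + h, g, position)
    best = {start: 0}
    while frontier:
        _, g, cur = heapq.heappop(frontier)
        if cur == goal:
            return g
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cur[0] + dr, cur[1] + dc)
            if 0 <= nxt[0] < rows and 0 <= nxt[1] < cols and not grid[nxt[0]][nxt[1]]:
                if g + 1 < best.get(nxt, float("inf")):
                    best[nxt] = g + 1
                    heapq.heappush(frontier, (g + 1 + h(nxt), g + 1, nxt))
    return None
```

Because the Manhattan heuristic is admissible on a 4-connected grid, the first time the goal is popped its g value is the optimal path length, which is what makes A*-generated GT paths distance-optimal.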

Experiments
We split the dataset into train/val-seen/val-unseen/test-unseen. We assign city sub-sections 1 to 5 to the train and val-seen splits, sub-section 6 to the val-unseen split, and sub-section 7 to the test-unseen split. We randomly split the data from sub-sections 1 to 5 in an 80/20 ratio to get the train and val-seen splits, respectively. Thus, the final number of task samples for each split is 4,267/1,065/1,155/1,205 (total: 17,068/4,260/4,620/4,820 instructions and paths). The Stage 1 workers are equally distributed across the city sub-sections, so the dataset splits are not biased toward specific workers. We also keep 2 separate sections (i.e., sections 6 and 7) for the unseen datasets, following Anderson et al. (2018), which allows evaluation of the models' ability to generalize to new environments. Note that, for agents to proceed to the next phase, we allow them to pick up the closest target object (in the navigation phase) or place the collected object at the closest location (in the assembly phase) when they do not perform the required actions.
Training Details: We use a hidden size of 128. For the word and action embedding sizes, we use 300 and 64, respectively. We use Adam (Kingma and Ba, 2015) as the optimizer and set the learning rate to 0.001 (see Appendix E.2 for details).
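The split construction described above can be sketched as follows; the seed and the per-section record format are illustrative only (the released splits are fixed):

```python
import random

def make_splits(samples_by_section, seed=0):
    """Sketch of the ARRAMON dataset split: sections 1-5 are pooled,
    shuffled, and split 80/20 into train/val-seen; section 6 forms
    val-unseen and section 7 forms test-unseen.

    samples_by_section maps a section id (1-7) to a list of task samples.
    """
    seen = [s for sec in range(1, 6) for s in samples_by_section[sec]]
    random.Random(seed).shuffle(seen)  # deterministic shuffle for the sketch
    cut = int(0.8 * len(seen))
    return {
        "train": seen[:cut],
        "val-seen": seen[cut:],
        "val-unseen": samples_by_section[6],
        "test-unseen": samples_by_section[7],
    }
```

Holding out whole sections (rather than random samples) for the unseen splits is what forces models to generalize to genuinely new environments.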

Results and Analysis
As shown in Table 2, overall, there is a large human-model performance gap, indicating that our ARRAMON task is very challenging and there is much room for model improvement. Performance in the navigation and assembly phases is directly related. If perfect performance is assumed in the navigation phase, rPOD and PTC are higher than if there were low CTC-k scores in navigation (e.g., 0.382 vs. 0.044 for PTC of the Vision-and-Language model on val-seen; see Appendix F for the comparison). This scoring behavior demonstrates that the phases in our ARRAMON task are interweaved. Also, comparing scores from turns 1 and 2, all turn 2 scores are lower than their turn 1 counterparts (e.g., 0.222 vs. 0.049 nDTW for the Vision-and-Language model on the val-seen split; see Appendix F for the detailed turn-wise results). This shows that the performance of the previous turn strongly affects the next turn's result. Note that to relax the difficulty of the task, we consider CTC-3 (instead of CTC-0; see Section 3.2) as successfully picking up the target object, and we then calculate the assembly metrics under this assumption. If this were not done, then almost all the assembly metrics would be nearly zero.

Model Ablations
Vision/Language Only Baseline. As shown in Table 2, our Vision-and-Language baseline shows better performance than both the vision-only and language-only models, implying that our dataset is not biased toward a single modality and requires multimodal understanding to get high scores.
Random Walk. The Random-Walk baseline shows poor performance on our task, implying that the task cannot be solved through random chance.
Human Evaluation. We conducted human evaluations with workers (Tables 2 and 3) as well as an expert (Table 3). For the workers' evaluations, we averaged all the workers' scores for the verified dataset (from Stage 2: verification/following, see Sec. 4.1).
For expert evaluation, we took 50 random samples from test-unseen and asked our simulator developer to blindly complete the task. Both workers and the expert show very high performance on our task (0.66 nDTW and 0.87 PTC for workers; 0.81 nDTW and 0.99 PTC for expert), demonstrating a large model-human performance gap and allowing much room for further improvements by the community on our challenging ARRAMON dataset.

Output Examples
As shown in an output example in Figure 8, our model navigates quite well and reaches very close to the target in the 1st turn and then places the target object in the right place in the assembly phase. However, in the 2nd turn, our model fails to find the "striped red mug" by missing the left turn around the "yellow and white banner". In the next assembling phase, the model cannot identify the exact location ("in front of the spotted yellow mug") to place the collected object (assuming the model picked up the correct object in the previous phase) possibly being distracted by another mug and misunderstanding the spatial relation. See Appendix G for more output examples.

Conclusion
We introduced ARRAMON, a new joint navigation+assembling instruction-following task in which agents collect target objects in a large, realistic outdoor city environment and arrange them in a dynamic grid space from an egocentric view. We collected a challenging dataset via a 3D synthetic simulator with diverse object referring expressions, environments, and visuospatial relationships. We also provided several baseline models, which have a large performance gap compared to humans, implying substantial room for improvements by future work.

Navigation (turn 1): Turn around and walk to the traffic signal. Take a right and walk past the orange cone in the middle of the road. Pick up the dotted red bucket in the middle of the road.
Assembly (turn 1): Turn right and place the dotted red bucket on top of the brown striped bowl.
Navigation (turn 2): Turn around, go forward, and take a left turn at the intersection. Keep going until you see the yellow and white banner, then turn left. Behind a phone booth on your right you will find a striped red mug. Pick it up.
Assembly (turn 2): Place the striped red mug in front of the spotted yellow mug.

Appendices A Task and Metrics
As shown in Figure 9, the rPOD score decreases quadratically with the placement error (the Manhattan distance). Thus, to score high on the rPOD metric, agents should place the target objects as close to the target location as possible.

B Dataset
To support the ARRAMON task, we collected a dataset. Our dataset is based on a large dynamic outdoor environment from which diverse instructions with interesting linguistic properties are derived.

B.1 Data Collection
Route Generation. The ground truth trajectories are determined by the A* shortest-path algorithm (Hart et al., 1968). Using the shortest-path algorithm allows the resulting Ground Truth (GT) path to be straightforward and reach the target while avoiding unnecessary detours. The blue navigation guideline provided to the Stage 1 workers mimics this GT path (Figure 15a).
Qualification Tests. When placing an object in the assembly phase, the item is placed 1 space in front of where the agent stands. To ensure that the workers who would be following instructions in Stage 2 fully understood this concept, at the start of Stage 2 they were presented with a small test (Figure 10) that showed them how to correctly move and place objects and required that they demonstrate that they could do so. Both Stage 1 and 2 workers were also required to pass a short screening test before they could begin their respective tasks. The tests are shown in Figure 13 (Stage 1) and Figure 14 (Stage 2).
Worker Bonus Criteria and Rates. For Stage 1 workers who did the instruction writing task correctly {5, 20, 50} times, a bonus of {$0.10, $0.90, $4.00} respectively was awarded. Stage 1 workers were also provided a $0.10 bonus for every instruction they wrote that was able to successfully pass Stage 2 verification with high nDTW and perfect assembling scores.
Instruction Rules and Guidelines. Rules and guidelines were put into place to help ensure that the instructions written by the Stage 1 workers were high quality and contained as few errors as possible. In particular, the guidelines prevent workers from referencing, in their instructions, elements of the UI and tools we provided, such as the blue navigation line or guiding arrow (see Figure 15) and other elements that were not part of the true environment.
Figure 11: Illustration of the colors and patterns that collectable and distracter objects can have.
Figure 12: Illustration of the assembly grid with the starting position marked.
• Instructions must be written relative to objects and the environment and must not contain exact counts of movements (e.g., "Go forward 10 times and then turn left 2 times" is bad).
• Instructions must be clear, concise, and descriptive.
• Do not write more than the text field can hold.
• At the end of an instruction for the navigation phase, be sure to include something similar to "pick up" or "collect" the object.
• At the end of an instruction for the assembly phase, be sure to include something similar to "place" or "put" the object you collected before.
• Do not reference the navigation line, the blue balls on the navigation line, the floating arrows above the objects, or any of the interface elements when writing instructions.
• Do not reference any buildings that are a solid gray color.
• Do not reference the transparent black outline or the white grid tiles on the floor (Figure 12 and Figure 15b) during the assembly phase.
• Do not write vague or potentially misleading instructions, and do not create any instructions that reference previous instructions, such as "Go back to" or "Return to".
• Avoid spelling and grammar mistakes.
• When writing instructions for the assembly phase, do not write movement instructions. Make sure to use object references (e.g., "the red dotted ball").

Quiz: You must pass the quiz before you can continue to the task.
What is a good example of a navigation instruction? a: Go forward and turn left. b: Go forward 5 times to reach the red TV. Then turn 4 times left and continue to the yellow building. c: Turn to face the purple bowl to your right. Continue forward till you reach a lamp post. Pick up the yellow bowl near the red traffic cone.
What is a good example of a navigation instruction? a: Go forward to the intersection and then turn right. Go forward till you reach the green traffic cone. Collect the green ball next to the lamp post. b: Go forward to the intersection and then turn left. Go forward following the blue guideline till you reach the red book. c: Turn around and go forward till you reach the floating arrow. Pick up the green ball underneath.
Which of the following is true? a: All the objects will be dotted. b: Objects will always be the same color. c: Objects will always be a book, hourglass, mug, bucket, ball, tv, or bowl, but may vary in color and texture.
Which of the following is true? a: During the Navigation phase, instruction writing is not required. Instruction writing is only required in the Assembly phase. b: Both Navigation and Assembly phases require instructions to be written. c: Writing instructions is optional and should only be done if you feel like it.
Which of the following is a good example of an Assembly instruction? a: Turn to face the left wall. Then place the dotted yellow TV on top of the striped red book. b: Place the object. c: Move forward. Turn right and then put down the green book.
Figure 13: Screening test that must be taken prior to starting Stage 1.

Quiz: You must pass the quiz before you can continue to the task.
What is the overall goal of this task? a: Roam aimlessly until you are done. b: Follow the provided instructions as accurately as possible. c: Pick up random things.
Which of the following is true? a: All the objects will be dotted. b: Objects will always be the same color. c: Objects will always be a book, hourglass, mug, bucket, ball, tv, or bowl, but may vary in color and texture.
Figure 14: Screening test that must be taken prior to starting Stage 2.
During the navigation phase, the instruction-writing worker cannot stray from the navigation line, ensuring that they collect the objects in the correct order. During the assembly phase, regardless of where the instruction-writing worker places the collected object, it moves into the correct position (workers are not informed of this), ensuring that the objects are always in the correct formation for the next phase and that future instructions do not become invalid. Additionally, we have implemented active quality checks that prevent a worker from submitting their instructions if certain criteria are not met. If a worker is blocked by one of these checks, they are shown which check failed so that they can easily correct the error.
General Active Quality Checks.
• Each instruction must contain at least 6 words.
• Less than 40% of the characters in the instruction can be spaces.
• The symbols (, [, ], ), &, *, ^, %, $, #, @, !, =, and + cannot be included.
• Single-letter words other than "a" cannot be included.
• A single letter cannot be repeated 3 consecutive times (e.g., "sss").
• The same word cannot be repeated twice in a row.
• At least 40% of the words in the instruction must be unique.
• The term "key" cannot be included.
• The term "step" cannot be included.
• The term "time" cannot be included.
• The term "go back" cannot be included.
• The term "return" cannot be included.
• The term "came" cannot be included.
• The term "item" cannot be included.
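A sketch of how the general checks above might be implemented is shown below. This is an illustration, not the actual validation code from the collection interface; in particular, the banned-term test uses coarse substring matching, and the function names are hypothetical.

```python
BANNED_TERMS = ["key", "step", "time", "go back", "return", "came", "item"]
BANNED_SYMBOLS = set("([])&*^%$#@!=+")

def general_checks(instruction):
    """Return a list of failed general quality checks (empty = pass)."""
    failures = []
    words = instruction.lower().split()
    lowered = instruction.lower()
    if len(words) < 6:
        failures.append("fewer than 6 words")
    if instruction and instruction.count(" ") / len(instruction) >= 0.40:
        failures.append("40% or more of the characters are spaces")
    if BANNED_SYMBOLS & set(instruction):
        failures.append("contains a banned symbol")
    if any(len(w) == 1 and w != "a" for w in words):
        failures.append("single-letter word other than 'a'")
    if any(ch * 3 in lowered for ch in set(lowered) if ch.isalpha()):
        failures.append("a letter repeated 3 consecutive times")
    if any(w1 == w2 for w1, w2 in zip(words, words[1:])):
        failures.append("same word repeated twice in a row")
    if words and len(set(words)) / len(words) < 0.40:
        failures.append("fewer than 40% unique words")
    for term in BANNED_TERMS:
        if term in lowered:  # coarse substring match (catches "times", etc.)
            failures.append("contains banned term '%s'" % term)
    return failures
```

Running such checks client-side lets the interface block a submission and, as described above, show the worker exactly which check failed.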
Navigation Active Quality Checks.
• If the ground truth path requires turning at the beginning of the path, the term "turn" must be included.
• The term "arrow" cannot be included.
Assembly Active Quality Checks.
• The terms "tile" or "grid" cannot be included.
• The term "space" cannot be included.
• The term "go" cannot be included.
• The term "corner" cannot be included.
• The term "move" cannot be included.
• The black outline cannot be referenced.
Review Notifications. It is possible for instructions to pass all automated checks and still be of poor quality. There is no quick and reliable way to automatically determine whether an instruction that passes the checks is nonetheless vague or misleading. Additional active checks could be added; however, in ambiguous cases, more active checks would result in potentially good instructions being blocked. Instead of blocking submission, checks that could have been incorrectly triggered send a notification email, allowing us to take quick action by manually reviewing the instruction in question and determining whether the worker who wrote it needs feedback on writing better instructions.

B.2 Interface
Stage 1: Instruction Writing. The goal of this stage is to write instructions on how to navigate and place objects. The provided interface was designed to make this process easier for the workers completing the task. In both phases, the interface provides an arrow on the bottom left that points to the target destination or target location (depending on the active phase: navigation or assembly, respectively).
• Navigation Phase: (Figure 15a) The workers follow the provided navigation line and, as they follow it, write instructions on how to reach the destination. Additionally, the workers are provided with the controls and a few tips that they should keep in mind while completing the navigation phase. A small preview of the next phase (Assembly) is shown in the lower right.
• Assembly Phase: (Figure 15b) The interface is similar to that of the navigation phase. During this phase, the assembly preview, which previously occupied the lower right corner, comes into focus, and the navigation phase preview now occupies that space. In this phase, no navigation line is provided, as the entire assembly area is visible from the starting position. The controls and tip information are updated with information about the assembly phase.
Stage 2: Instruction Following. The goal of this stage is to validate the instructions written in the previous stage. Again, the interface was designed to make completing the task easier for the workers. Workers are also provided with check boxes, which they can use to flag an instruction for certain issues so that we can more easily identify poor instructions.
• Navigation Phase: (Figure 15c) Workers are placed in an exact copy of the environment that a Stage 1 worker used, as well as given the instructions they wrote on how to accomplish the task, which are visible in the top right corner. This new worker is not provided the blue guideline and the indicating arrow, and must now navigate using the instructions alone.
• Assembly Phase: (Figure 15d) The worker is again shifted into the assembly room, but will no longer see the transparent outline that indicates where the object should be placed. They must instead rely on the instructions written by a Stage 1 worker. The worker is also provided a realtime diagram indicating where they will place the object given the position they currently stand. The object is always placed 1 space directly in front of the worker's location. The worker is also provided with some tips that might help them.
C Data Analysis
C.1 Linguistic Properties
As shown in Table 4, our instruction sets have diverse linguistic features that make our task more challenging. Our ARRAMON task requires that the agent be able to understand and distinguish between both egocentric and allocentric spatial relations, necessitating that it comprehend the relations between entities in the environment according to its location and orientation. The instructions contain many directional words and phrases, which require that agents have strong navigational skills. Additionally, due to the large scale of the environment, temporal condition expressions are crucial for agents to navigate effectively, as they are useful for describing long-distance travel.

D Model
Cross Attention. We employ the bidirectional attention mechanism (Seo et al., 2017) to align the visual feature V and the instruction feature L. We calculate a similarity matrix S ∈ R^{w×l} between the visual and instruction features, where W_s ∈ R^{d×1} is a trainable parameter and ∘ denotes the element-wise product. From the similarity matrix, we compute the new fused instruction feature and, similarly, the new fused visual feature, where W_L and W_V are trainable parameters.
General Attention. We employ a basic attention mechanism to align the action feature, h, with each of the visual and instruction features.

Table 5: Performance of the Vision-and-Language (V/L) baseline for turns T1 and T2, plus overall scores on the Val-Seen/Unseen splits.

Model: Vision-and-Language — Navigation CTC (k=3): 1.000; Assembly rPOD: 0.539; Assembly PTC: 0.382
Table 6: Scores in the assembly phase, calculated under the assumption of perfect performance in the navigation phase, on the Val-Seen split.

Left example instructions:
Turn 1: Turn slightly left as you move ahead past the traffic light. Go toward the speed limit sign, and move past the dotted white barrier. Head to the left to the lamp post, and fetch the dotted brown tv past a blue cone.
Turn 2: Turn around and pass the blue and orange cones. Keep going straight for a long way passing the speed limit sign. Head toward the two striped yellow barriers ahead, but pick up the striped yellow book before you reach them.
Right example instructions:
Turn 1: Turn right until you see the green banner. Go towards the tire stack to the right of it and take a left down the street behind it. Go forward and pass the barrel. In the intersection there is a dotted white bucket. Pick up the dotted white bucket.
Turn 2: Turn right until you see the green cone. Go forward and take a left at the first street. Go towards the trash bags and take a left at the street. Pass the black barrel and go towards the dotted blue bucket. Pick up the dotted blue bucket.
Figure 16: Navigation paths of the ground truth, human evaluation, random walk, and our model. Pink is the GT path and the other paths are shown in green (turn 1 starts from the black dot and goes to the white dot; turn 2 starts from the white dot and goes to the end of the path).
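The equations for the similarity matrix and the fused features are not reproduced in this excerpt. The sketch below is one plausible instantiation of bidirectional cross attention in the style of Seo et al. (2017), assuming S_ij is computed from the element-wise product of the i-th visual and j-th language features with the trainable vector W_s; the shapes and the exact similarity form are assumptions, and NumPy is used here for brevity rather than the PyTorch used in the paper.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(V, L, W_s):
    """Bidirectional cross-attention sketch (assumed instantiation).
    V: (w, d) visual features; L: (l, d) instruction features;
    W_s: (d,) trainable vector. S[i, j] = W_s . (V[i] * L[j])."""
    # Similarity matrix S in R^{w x l} via the element-wise product.
    S = np.einsum("id,jd,d->ij", V, L, W_s)
    # Vision-to-language: each visual position attends over tokens.
    L_fused = softmax(S, axis=1) @ L          # (w, d)
    # Language-to-vision: each token attends over visual positions.
    V_fused = softmax(S, axis=0).T @ V        # (l, d)
    return S, L_fused, V_fused
```

Each row of the fused instruction feature is a token mixture tailored to one visual position, and vice versa for the fused visual feature, which is what lets the downstream action predictor condition jointly on both modalities.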
E Experiments

E.1 Simulator Setup
Our task is quite challenging. In many cases, agents may not even be able to pick up an object in the navigation phase (the agent must be close enough to the object and facing the correct direction to pick it up; these factors, along with the size of the environment, make this difficult). To decrease the difficulty of the task, if an agent does not successfully pick up an object, we allow it to continue to the assembly phase with whatever object is closest to its final location. Likewise, in the assembly phase, if the time step limit is reached before the agent places the object down, the object is placed in front of the agent (if "in front" is out of bounds, it is placed at the agent's feet). Note that either of these fallbacks results in PTC and rPOD scores of 0.

Figure 17 instruction examples (left set):
Turn 1, Navigation: Turn left, go to the mailbox and turn right. Go past the dumpster then right at the next intersection. Go to the phone booth and collect the striped purple bowl.
Turn 1, Assembly: Place bowl in front of the striped blue hourglass.
Turn 2, Navigation: Turn around then go left between the blue and brown buildings. Go past the silver dumpster and collect the striped yellow ball next to the mailbox.
Turn 2, Assembly: Place the ball on top of the striped purple bowl.
Figure 17 instruction examples (right set):
Turn 1, Navigation: Turn left to face the short traffic light. Walk to it and turn right. Walk to the orange barrels and turn left. Walk past the barricade to the mailbox and pick up the striped blue hourglass.
Turn 1, Assembly: Place the striped blue hourglass against the brick wall and aligned with the purple bucket.
Turn 2, Navigation: Turn to face the yellow and white flag. Walk to the orange barrels and turn left. Walk to the short traffic light and pick up the dotted purple mug.
Turn 2, Assembly: Place the dotted purple mug in between the blue hourglass and the purple bucket.
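The time-out placement fallback described in the simulator setup above can be sketched as follows; the grid size and coordinate conventions are assumptions for illustration, not the simulator's actual ones.

```python
def forced_placement(agent_pos, facing, grid_size):
    """If the time step limit expires, the object is placed 1 cell
    directly in front of the agent; if that cell is out of bounds,
    it is placed at the agent's feet instead. Coordinates are
    (row, col); `facing` is a (drow, dcol) unit step (illustrative
    conventions only)."""
    r, c = agent_pos
    fr, fc = r + facing[0], c + facing[1]
    if 0 <= fr < grid_size and 0 <= fc < grid_size:
        return (fr, fc)   # one space directly in front of the agent
    return (r, c)         # out of bounds: place at the agent's feet
```

Either branch corresponds to a forced placement, which, as noted above, yields PTC and rPOD scores of 0 for that object.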

E.2 Training Details
We use PyTorch (Paszke et al., 2017) to build our model. We take the average of the losses from the navigation and assembly phase modules to calculate the final loss. We use 128 as the hidden size of the linear layers and the LSTM. For the word and action embedding sizes, we use 300 and 64, respectively. The visual feature map size is 7 × 7 with a channel size of 2048. For the dropout p value, 0.3 is used. We use Adam (Kingma and Ba, 2015) as the optimizer and set the learning rate to 0.001. The number of trainable parameters of our Vision-and-Language model is 1.83M (Language-only: 1.11M, Vision-only: 0.73M). We use an NVIDIA RTX 2080 Ti and a TITAN Xp for training and evaluation, respectively.
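The training configuration above can be summarized as a plain-Python sketch: the hyperparameter values are taken directly from the text, while the dictionary layout and the loss-averaging helper are illustrative (in practice the losses would be framework tensors).

```python
# Hyperparameters reported in the training details above.
HPARAMS = {
    "hidden_size": 128,                  # linear layers and LSTM
    "word_embed_size": 300,
    "action_embed_size": 64,
    "visual_feature_map": (7, 7, 2048),  # spatial size x channels
    "dropout_p": 0.3,
    "optimizer": "Adam",
    "learning_rate": 0.001,
}

def final_loss(nav_loss, asm_loss):
    """Final training loss: the average of the navigation-phase and
    assembly-phase module losses (plain floats for illustration)."""
    return (nav_loss + asm_loss) / 2.0
```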

F Results and Analysis
As shown in Table 5, almost all scores are higher in turn 1 than in turn 2. Scores on the rPOD and PTC metrics in the assembly phase depend largely on the CTC-k score in the navigation phase. Comparing the rPOD and PTC scores of the Vision-and-Language model on the Val-Seen split (Table 5) with those in Table 6: when CTC-k decreases to about 1/10 (1.000 to 0.098), PTC also decreases to about 1/10 (0.382 to 0.044). This demonstrates that the phases of our ARRAMON task are tightly interwoven and that the task is challenging to complete.
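The proportional drop discussed above can be checked directly against the reported numbers (CTC-k and PTC values quoted from the comparison of Tables 5 and 6):

```python
# Val-Seen scores: with perfect navigation (Table 6) vs. the model's
# actual navigation results (Table 5), as quoted in the text above.
ctc_perfect, ptc_perfect = 1.000, 0.382
ctc_actual, ptc_actual = 0.098, 0.044

ctc_ratio = ctc_actual / ctc_perfect  # fraction of CTC-k retained
ptc_ratio = ptc_actual / ptc_perfect  # fraction of PTC retained
```

Both ratios come out near 0.1, which is the quantitative sense in which a tenfold drop in navigation accuracy propagates into a roughly tenfold drop in assembly success.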

G Output Examples
In the left path set of Figure 16, our model follows the instructions well at the beginning. However, the model goes a little too far and fails to find the target object (the dotted brown tv). In the second turn, the model turns around, but not fully, so it heads in a different direction and fails to reach the goal position. In the example on the right, the model performs very well in the first turn, but in the second turn it fails to find the target object, although it gets very close before backtracking out of the alley. As the figure also shows, the human performs the navigation almost perfectly, indicating there is significant room for improvement by future work, while the random walk performs quite poorly, implying that our ARRAMON task cannot be completed by random chance.
Figure 17 compares the model against the GT in both turns and phases. In the left set, the model almost reaches the target object (the striped purple bowl) but cannot find it and goes a little past it. In the corresponding assembly phase, the model places the collected object (assuming it picked up the correct object in the previous navigation phase) 1 space to the right of the target location. In the next navigation turn, due to the error in the previous turn, the model's path starts a bit further from the GT; however, it realigns itself towards the end, around the corner. The model is able to locate the target object and stop to pick it up. In the next assembly phase, the model fails to place the collected object at the right location. In the right set, the model performs worse: it misses all of the turns needed to reach the target. In the assembly phase, the model misses the target location by 1 space, likely due to misunderstanding the complex spatial relations in the instructions. In the next navigation phase, the model starts in the wrong place and ends up at a location far from the target position. In the next assembly phase, the errors of the previous turn have affected the object configuration, so the model cannot find the place "between the blue hourglass and the purple bucket".