Collaborative Dialogue in Minecraft

We wish to develop interactive agents that can communicate with humans to collaboratively solve tasks in grounded scenarios. Since computer games allow us to simulate such tasks without the need for physical robots, we define a Minecraft-based collaborative building task in which one player (A, the Architect) is shown a target structure and needs to instruct the other player (B, the Builder) to build this structure. Both players interact via a chat interface. A can observe B but cannot place blocks. We present the Minecraft Dialogue Corpus, a collection of 509 conversations and game logs. As a first step towards our goal of developing fully interactive agents for this task, we consider the subtask of Architect utterance generation, and show how challenging it is.


Introduction
Building interactive agents that can successfully communicate with humans about the physical world around them to collaboratively solve tasks in this environment is a long-sought goal of AI (e.g. Winograd, 1971). Such situated dialogue poses challenges that go beyond what is required for the slot-value filling tasks performed by standard dialogue systems (e.g. Kim et al., 2016, 2017; Budzianowski et al., 2018) or chatbots (e.g. Ritter et al., 2010; Schrading et al., 2015; Lowe et al., 2015), as well as for so-called visual dialogue, where users talk about a static image (Das et al., 2017), or video-context dialogue, where users interact in a chat room while viewing a live-streamed video (Pasunuru and Bansal, 2018). It requires the ability to refer to real-world objects and spatial relations that depend on the current position of the speakers as well as changes in the environment. Due to the expense of actual human-robot communication (e.g. Tellex et al., 2011; Thomason et al.; Misra et al., 2016; Chai et al., 2018), simulated environments that allow easier experimentation are commonly used (Koller et al., 2010; Chen and Mooney, 2011; Janarthanam et al., 2012).
* Both authors equally contributed to the paper.
In this paper, we therefore introduce the Minecraft Collaborative Building Task, in which pairs of users control avatars in the Minecraft virtual environment and collaboratively build 3D structures in a Blocks World-like scenario while communicating solely via text chat (Section 3). We have built a data collection platform and have used it to collect the Minecraft Dialogue Corpus, consisting of 509 human-human written dialogues, screenshots and complete game logs for this task (Section 4). While our ultimate goal is to develop fully interactive agents that can collaborate with humans successfully on this task, we first consider the subtask of Architect utterance generation (Section 5) and describe a set of baseline models that encode both the dialogue history (Section 6) and the world state (Section 7). Section 8 describes our experiments. Our analysis (Section 9) highlights the challenges of this task. The corpus and platform as well as our models are available for download. 1

Related Work
Our work is partly inspired by the HCRC Map Task Corpus (Anderson et al., 1991), which consists of route-following dialogues between an Instruction Giver and a Follower who are given maps of an environment that differ in significant details. Our task also features asymmetric roles and levels of information between the two speakers, but operates in 3D space and focuses on the creation of structures rather than navigation around existing ones. Koller et al. (2010) design a challenge where systems with access to symbolic world representations and a route planner generate real-time instructions to guide users through a treasure hunt in a virtual 3D world.
There is a resurgence of interest in Blocks World-like scenarios. Wang et al. (2017) let users define 3D voxel structures via a highly programmatic natural language. The interface learns to understand descriptions of increasing complexity, but does not engage in a back-and-forth dialogue with the user. Most closely related to our work are the corpora of Bisk et al. (2016a, 2018), which feature pairs of scenes involving simulated, uniquely labeled, 3D blocks annotated with single-shot instructions aimed at guiding an (imaginary) partner on how to transform an input scene into the target. In their scenario, the building area is always viewed from a fixed bird's-eye perspective. Simpler versions of the data retain the grid-based assumption over blocks, and structures consist solely of numeric digits procedurally reconstructed along the horizontal plane. Later versions increase the task complexity significantly by incorporating human-generated, truly 3D structures, removing the grid assumption, and allowing for rotations of individual blocks. Their blocks behave like physical blocks, disallowing structures with floating blocks that are prevalent in our data. Our work differs considerably in a few other aspects: our corpus features two-way dialogue between an instructor and a real human partner; it also includes a wide range of perspectives as a result of using Minecraft avatars, rather than a fixed bird's-eye perspective; and we utilize blocks of different colors, allowing for entire substructures to be identified (e.g., "the red pillar").

Minecraft Collaborative Building Task
Minecraft (https://minecraft.net/) is a popular multi-player game in which players control avatars to navigate in a 3D world and manipulate inherently block-like materials in order to build structures. Players can freely move, jump and fly, and they can choose between first- or third-person perspectives. Camera angles can be smoothly rotated by moving around or turning one's avatar's head up, down, and side-to-side, resulting in a wide range of possible viewpoints.

Blocks World in Minecraft
Minecraft provides an ideal setting to simulate Blocks World, although there are two key differences to physical toy blocks: Minecraft blocks can only be placed on a discrete 3D grid, and they do not need to obey gravity. That is, they do not need to be placed on the ground or on top of another block, but can be put anywhere as long as one of their sides touches another block. That neighboring block can later be removed, allowing the second block (and any structure supported by it) to "float". Players need to identify when such supporting blocks need to be added or removed.
Collaborative Building Task We define the Collaborative Building Task as a two-player game between an Architect (A) and a Builder (B). A is given a target structure (Target) and has to instruct B via a text chat interface to build a copy of Target on a given build region. A and B can communicate back and forth via chat throughout the game (e.g. to resolve confusions or to correct B's mistakes). B is given access to an inventory of 120 blocks of six given colors that it can place and remove. A can observe B and move around in its world, allowing it to provide instructions from varying perspectives. But A cannot move blocks, and remains invisible to B. The task is complete when the structure built by B (Built) matches Target, invariant to translations within the horizontal plane and rotations about the vertical axis. Built also needs to lie completely within the boundaries of the predefined build region.
Although human players were able to complete each structure successfully, this task is not trivial. Figure 1 shows the perspectives seen by each player in the Minecraft client. This example from our corpus shows some of the challenges of this task. A often provides instructions that they think are sufficient, but leave B still clearly confused, indicated either by B's lack of initiative to start building or a confused response. Once a multistep instruction is understood, B also needs to plan a sequence of steps to follow that instruction; in many cases, B chooses clearly suboptimal solutions, resulting in large amounts of redundancy in block movements. A misinterpreted instruction may also lead to a whole sequence of blocks being misplaced by B (either due to miscommunication, or because B made an educated guess on how to proceed) until A decides to intervene (in the example, this can be seen with the built yellow 6). A could also misinterpret the target structure, giving B incorrect instructions that would later need to be rectified. This illustrates the challenges involved in designing an interactive agent for this task: the Architect needs to provide clear instructions; the Builder needs to identify when more information is required, and both agents may need to design efficient plans to construct complex structures.

The Minecraft Dialogue Corpus
The Minecraft Dialogue Corpus consists of 509 human-human dialogues and game logs for the Collaborative Building Task. This section describes this corpus and our data collection process. Further details are in the supplementary materials.

Data Collection Procedure
Data was collected over the course of 3 weeks (approx. 62 hours overall). 40 volunteers, both undergraduate and graduate students with varying levels of proficiency with Minecraft, participated in 1.5 hour sessions in which they were paired up and asked to build various predefined structures within an 11 × 11 × 9 build region. Builders began with an inventory of 6 colors of blocks and 20 blocks of each color. After a brief warm-up round to become familiar with the interface, participants were asked to successfully build as many structures as they could manage within this time frame. On average, each game took 8.55 minutes.
Architects were encouraged not to overwhelm the Builder with instructions and to allow their partner a chance to respond or act before moving on. Builders were instructed not to place blocks outside the specified build region and to stay as faithful as possible to the Architect's instructions. Both players were asked to communicate as naturally as possible while avoiding idle chit-chat.
Participants were allowed to complete multiple sessions if desired; we ensured that an individual never saw the same target structure twice, and attempted as much as possible to pair them with a previously unseen partner. While some individuals indicated a preference towards either the Architect or Builder roles, roles were, for the most part, assigned in such a way that each individual who participated in repeat sessions played both roles equally often. Each participant is assigned a unique anonymous ID across sessions.

Data Structures and Collection Platform
Microsoft's Project Malmo (Johnson et al., 2016) is an AI research platform that provides an API for Minecraft agents and the ability to log, save, and load game states. We have extended Malmo into a data collection platform. We represent the progression of each game (involving the construction of a single target structure by an Architect and Builder pair) as a discrete sequence of game states. Although Malmo continuously monitors the game, we selectively discretize this data by only saving snapshots, or "observations," of the game state at certain triggering moments (whenever B picks up or puts down a block or when either player sends a chat message). This allows us to reduce the amount of (redundant) data to be logged while preserving significant game state changes. Each observation is a JSON object that contains the following information: 1) a time stamp, 2) the chat history up until that point in time, 3) B's position (a tuple of real-valued x, y, z coordinates as well as pitch and yaw angles, representing the orientation of their camera), 4) B's block inventory, 5) the locations of the blocks in the build region, 6) screenshots taken from A's and B's perspectives. Whenever B manipulates a block, we also capture screenshots from four invisible "Fixed Viewer" clients hovering around the build region at fixed angles.
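As a rough illustration, one such observation might be represented as follows. This is a hypothetical sketch in Python; the exact field names and layout of the corpus JSON are assumptions, not the published schema.

```python
# Sketch of a single logged observation (field names are illustrative).
observation = {
    "timestamp": "2019-01-15T14:32:07",          # 1) time stamp
    "chat_history": [                            # 2) chat up to this point
        "<A> place a red block in the middle",
        "<B> like this?",
    ],
    "builder_position": {                        # 3) B's avatar pose
        "x": 2.5, "y": 1.0, "z": -3.5,           #    real-valued coordinates
        "pitch": 10.0, "yaw": -90.0,             #    camera orientation
    },
    "builder_inventory": {                       # 4) remaining blocks per color
        "red": 18, "blue": 20, "orange": 20,
        "purple": 20, "yellow": 20, "green": 20,
    },
    "blocks_in_grid": [                          # 5) blocks in the build region
        {"color": "red", "x": 0, "y": 1, "z": 0},
    ],
    "screenshots": {                             # 6) per-perspective screenshots
        "architect": "a-0042.png",
        "builder": "b-0042.png",
    },
}
```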

Data Statistics and Analysis
Overall statistics The Minecraft Dialogue Corpus contains 509 human-human dialogues (15,926 utterances, 113,116 tokens) and game logs for 150 target structures of varying complexity (min. 6 blocks, max. 68 blocks, avg. 23.5 blocks). We collected a minimum of three dialogues per structure. The training, test and development sets consist of 85 structures (281 dialogues), 39 structures (137 dialogues), and 29 structures (101 dialogues) respectively. Dialogues for the same structure are fully contained within a single split; structures in training are thus guaranteed to be unseen in test.
On average, dialogues contain 30.7 utterances: 22.5 Architect utterances (avg. length 7.9 tokens), 8.2 Builder utterances (avg. length 2.9 tokens), and 49.5 Builder block movements. Dialogue length varies greatly with the complexity of the target structure (not just the number of blocks, but whether it requires floating blocks or contains recognizable substructures).
Floating blocks Blocks in Minecraft can be placed anywhere as long as they touch an existing block (or the ground). If such a supporting block is later removed, the remaining block (and any structure supported by it) will continue to "float" in place. This makes it possible to produce complex designs. 53.6% of our target structures contain such floating blocks. Instructions for these structures varied greatly, ranging from step-by-step instructions involving temporary supporting blocks to single-shot descriptions such as, simply, "build a floating yellow block" (sufficient for a veteran Minecraft player, but not necessarily for a novice).
Referring expressions and ellipsis Architects made frequent use of implicit arguments and references, relying heavily on the Builder's current perspective and their most recent actions for reference resolution. For instance, Architect instructions could include references such as "two more in the same direction," "one up," "two towards you," and "one right from the last thing you built."

Recognizable shapes and sub-structures Some target structures were designed with commonplace objects in mind. Some Architects took advantage of this in their instructions, ranging from straightforward ('L'-shapes, "staircases") to more eccentric descriptions ("either a chicken or a gun turret," "a heart that looks diseased," "a silly multicolored worm"). To avoid slogging through block-by-block instructions, Architects frequently used such names to refer to sub-elements of the target structure. Some even defined new terms that get re-used across utterances: A: i will refer to this shape as r-windows from here on out... B: okay A: please place the first green block in the right open space of the blue r-window.
Builder utterances Even though the Architect shouldered the large responsibility of describing the unseen structure, the Builder played an active role in continuing and clarifying the dialogue, especially for more complex structures. Builders regularly took initiative during the course of a dialogue in a variety of ways, including verification questions ("is this ok?"), clarification questions ("is it flat?" or "did I clean it up correctly?"), status updates ("i'm out of red blocks"), suggestions ("feel free to give more than one direction at a time if you're comfortable," "i'll stay in a fixed position so it's easier to give me directions with respect to what i'm looking at"), or extrapolation ("I think I know what you want. Let me try," then continuing to build without explicit instruction).

Architect Utterance Generation Task
Although the Minecraft Dialogue Corpus was motivated by our ultimate goal of building agents that can successfully play an entire collaborative building game as Architect or Builder, we first consider the task of Architect utterance generation: given access to the entire game state context leading up to a certain point in a human-human game at which the human Architect spoke next, we aim to generate a suitable Architect utterance.
Architect utterance generation is a much simpler task than developing a fully interactive Architect or Builder, but it still captures some of the essential difficulties of the Architect's role. Since Architects need to be able to give instructions, correct Builders' mistakes and answer their questions, they need the ability to compare the built structure against the target structure, and to understand the preceding dialogue. We also believe that the models developed for this task could be leveraged to at least bootstrap a fully interactive Architect (which will also need to decide when to speak, as well as deal with potentially much noisier dialogue histories than those we are considering here).
Although future work should consider the task of Builder utterance generation, the challenges in creating a fully interactive Builder lie less in generating complex utterances than in understanding and executing complex instructions in a discourse and game context, knowing when it is appropriate to ask clarification questions, and understanding the Architect's answers.

Seq2Seq Architect Utterance Model
We define a sequence of models for Architect utterance generation. Our most basic variant is a sequence-to-sequence model that conditions the next utterance on the preceding dialogue. Since Architects need to compare the current state of the build region against the target structure, we augment this model in the next section with world state information.
Dialogue History Encoder We encode the entire dialogue history as a sequence of tokens in which each player's utterances are contained within speaker-specific start and end tokens (<A>...</A> or <B>...</B>). Each utterance corresponds to a single chat message, and may consist of multiple sentences. These tokens are fed through a word embedding layer and subsequently passed through a bidirectional RNN (Schuster and Paliwal, 1997) to produce an embedding of the entire dialogue history in the encoder RNN's final hidden state.
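The construction of this flattened token sequence can be sketched as follows (a minimal illustration; whitespace tokenization and lowercasing are simplifying assumptions, not the paper's exact preprocessing):

```python
def encode_dialogue_history(history):
    """Flatten a dialogue history into one token sequence, wrapping each
    utterance in speaker-specific start/end tokens.

    `history` is a list of (speaker, utterance) pairs, where speaker is
    "A" (Architect) or "B" (Builder). Whitespace tokenization and
    lowercasing are simplifications.
    """
    tokens = []
    for speaker, utterance in history:
        tokens.append(f"<{speaker}>")             # speaker-specific start token
        tokens.extend(utterance.lower().split())  # utterance tokens
        tokens.append(f"</{speaker}>")            # speaker-specific end token
    return tokens
```

The resulting sequence is what would be fed through the word embedding layer and the bidirectional encoder RNN.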
Output Utterance Decoder The output utterance is generated by a decoder RNN conditioned on the discourse context. In standard fashion, the final hidden state of the encoder RNN is used to initialize the hidden state of the decoder RNN.

World State Representations
To be able to give accurate instructions, the Architect requires a mental model of how the target structure can be constructed successfully given the current state of the built structure. Since the Builder's world is not explicitly aligned to the target structure (our space does not contain any markers that would indicate cardinal directions or other landmarks, and we consider any built structure a success as long as it matches the target structure and fits completely into the Builder's build region), this model must consider all possible translational and rotational alignment variants, although we assume it can ignore any sub-optimal alignments. For any given alignment, we compute the Hamming distance between the built structure and the target (the total number of blocks of each color to be placed and removed), and only retain those alignments that have the smallest distance to the target. Once the game has progressed sufficiently far, there is often only one optimal alignment between built and target structures, but in the early stages, a number of different optimal alignments may be possible. Our world state representation captures this uncertainty. Figure 3 depicts a target structure (left) and a point in the game at which a single red block has been placed (right). We can identify three potential paths (left, up, and down) to continue the structure by extending it along the four cardinal directions. A permissibility check disqualifies the option of extending to the right, as blocks would end up placed outside the build region. These remaining paths, considered equally likely, indicate the colors and locations of blocks to be placed (or removed). A summary of this information forms the basis of the input to our model.

Computing the distance between structures
Computing the Hamming distance between the built and target structures under a given alignment also tells us which blocks need to be placed or removed. A structure S is a set of blocks (c, x, y, z). Each block has a color c and occupies a location (x, y, z) in absolute coordinate space (i.e., the coordinate system defined by the Minecraft client). A structure's position and orientation can be mutated by an alignment A in which S undergoes a translation A_T (shift) followed by a rotation A_R, denoted A(S) = A_R(A_T(S)). We only consider rotations about the vertical axis in 90-degree intervals, but allow all possible translations along the horizontal plane. The symmetric difference between the target T and a built structure S w.r.t. an alignment A, diff(T, S, A), consists of the set of blocks to be placed, B_p = A(T) − S, and the set of blocks to be removed from S, B_r = S − A(T):

diff(T, S, A) = B_p ∪ B_r
The cardinality |diff(T, S, A)| is the Hamming distance between A(T ) and S.
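These definitions can be sketched directly in code, representing a structure as a set of (c, x, y, z) tuples (a data layout we assume for illustration; the paper does not prescribe one):

```python
def rotate_y(block, quarter_turns):
    """Rotate a (c, x, y, z) block about the vertical (y) axis in 90° steps."""
    c, x, y, z = block
    for _ in range(quarter_turns % 4):
        x, z = z, -x
    return (c, x, y, z)

def apply_alignment(structure, shift, quarter_turns):
    """A(S) = A_R(A_T(S)): translate in the horizontal plane, then rotate."""
    dx, dz = shift
    return {rotate_y((c, x + dx, y, z + dz), quarter_turns)
            for (c, x, y, z) in structure}

def diff(target, built, shift, quarter_turns):
    """Return (B_p, B_r): blocks to place and blocks to remove under A."""
    aligned_target = apply_alignment(target, shift, quarter_turns)
    b_p = aligned_target - built   # in the aligned target but not yet built
    b_r = built - aligned_target   # built but not in the aligned target
    return b_p, b_r

def hamming(target, built, shift, quarter_turns):
    """|diff(T, S, A)|, the Hamming distance between A(T) and S."""
    b_p, b_r = diff(target, built, shift, quarter_turns)
    return len(b_p) + len(b_r)
```

Enumerating all horizontal shifts and the four rotations and keeping the alignments with minimal `hamming` value yields the set of optimal alignments described above.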
Feasible next placements Architects' instructions often concern the immediate next blocks to be placed. Since new blocks can only be feasibly placed if one of their faces touches the ground or another block, we also wish to capture which blocks can be placed in the immediate next action. B_n, the set of blocks that can be feasibly placed, is a subset of B_p.
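This feasibility check can be sketched as follows (the set-of-tuples block representation and the `GROUND_Y` ground level are assumptions made for illustration):

```python
GROUND_Y = 0  # assumed y-coordinate of blocks resting on the ground

def feasible_next_placements(b_p, built):
    """Return B_n, the subset of B_p whose blocks can be placed right now:
    those resting on the ground or sharing a face with an existing block."""
    occupied = {(x, y, z) for (_, x, y, z) in built}
    faces = [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
             (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    b_n = set()
    for (c, x, y, z) in b_p:
        on_ground = (y == GROUND_Y)
        touching = any((x + dx, y + dy, z + dz) in occupied
                       for dx, dy, dz in faces)
        if on_ground or touching:
            b_n.add((c, x, y, z))
    return b_n
```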
Block counters To obtain a summary representation of the optimal alignments (without detailed spatial information), we represent each of the sets B_p and B_r (as well as B_n) of diff(T, S, A) = B_p ∪ B_r as counters over block colors (indicating how many blocks of each color remain to be placed [next] and to be removed). We compute the set of expected block counters for each color c ∈ {red, blue, orange, purple, yellow, green} and action a ∈ {p, r, n} as the average over all k optimal alignments A* = arg min_A |diff(T, S, A)|.
With six colors, and three sets of blocks (all placements, next placements, removals), we obtain an 18-dimensional vector of expected block counts.
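A sketch of this counter computation, under an assumed set-of-tuples block representation where each alignment is summarized by its (B_p, B_r, B_n) triple:

```python
from collections import Counter

COLORS = ["red", "blue", "orange", "purple", "yellow", "green"]

def block_counters(b_p, b_r, b_n):
    """Reduce one alignment's sets of blocks to per-color counts."""
    return {a: Counter(c for c, *_ in blocks)
            for a, blocks in [("p", b_p), ("r", b_r), ("n", b_n)]}

def expected_block_counters(optimal_alignments):
    """Average per-color counters over all k optimal alignments and
    flatten them into an 18-dimensional vector (3 actions x 6 colors:
    all placements, next placements, removals). The ordering of actions
    and colors here is an assumption."""
    k = len(optimal_alignments)
    vec = []
    for action in ["p", "n", "r"]:
        for color in COLORS:
            total = sum(block_counters(*al)[action][color]
                        for al in optimal_alignments)
            vec.append(total / k)
    return vec
```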

Block Counter Models
We augment our basic seq2seq model with two variants of block counters that capture the current state of the built structure: Global block counters are 18-dimensional vectors (capturing expected overall placements, next placements, and removals for each of the six colors) that are computed over the whole build region.
Local block counters Since many Builder actions involve locations immediately adjacent to their last action, we construct local block counters that focus on and encode spatial information of this concentrated region. Here, we consider a 3 × 3 × 3 cube of block locations: those directly surrounding the location of the last Builder action as well as the last action itself. We compute a separate set of block counters for each of these 27 locations. Using the Builder's position and gaze, we deterministically assign a relative direction for each location that indicates its position relative to the last action in the Builder's perspective, e.g., "left", "top", "back-right," etc. The 27 block counters (one 18-dimensional vector per location) are concatenated, using a fixed canonical ordering of the assigned directions.
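One plausible way to assign such relative directions is sketched below. It makes strong simplifying assumptions: only the Builder's yaw is used (snapped to the nearest 90 degrees), pitch is ignored, and the axis sign conventions are illustrative rather than Minecraft's exact ones.

```python
def relative_direction(offset, yaw_degrees):
    """Label a (dx, dy, dz) offset (components in {-1, 0, 1}) around the
    last placed block with a direction in the Builder's perspective."""
    dx, dy, dz = offset
    q = round(yaw_degrees / 90.0) % 4
    for _ in range(q):              # rotate horizontal axes into B's frame
        dx, dz = dz, -dx
    vert = {1: "top", -1: "bottom", 0: ""}[dy]
    depth = {1: "front", -1: "back", 0: ""}[dz]
    side = {1: "right", -1: "left", 0: ""}[dx]
    label = "-".join(p for p in (vert, depth, side) if p)
    return label or "center"
```

Sorting the 27 labels in a fixed canonical order then determines how the per-location counter vectors are concatenated.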
Adding block counters to the model To add block counters to our models, we found the best results by feeding the concatenated global and local counter vectors through a single fully-connected layer before concatenating the result with the word embedding vector that is fed into the decoder at each time step (Figure 2).

Experimental Setup
Data Our training, test and dev splits contain 6,548, 2,855, and 2,251 Architect utterances, respectively.
Training We trained for a maximum of 40 epochs using the Adam optimizer (Kingma and Ba, 2015), minimizing the sum of the cross-entropy losses between each predicted and ground-truth token. We stopped training early when perplexity on the held-out validation set increased monotonically for two epochs. All word embeddings were initialized with pretrained GloVe vectors (Pennington et al., 2014). We first performed grid search over model architecture hyperparameters (embedding layer sizes and RNN layer depths). Once the best-performing architecture was found, we then varied dropout parameters (Srivastava et al., 2014). More details can be found in the supplementary materials.
Decoding We use beam search decoding (beam size = 10) to generate the utterance with the maximum log-likelihood score according to our model, normalized by utterance length. In order to promote diversity of generated utterances, we use a γ penalty (Li et al., 2016) of γ = 0.8. These parameters were found by a grid search on the validation set for our best model.

Results and Analysis
We evaluate our models in three ways. First, we use automated metrics to assess how closely the generated utterances match the human utterances. Second, for a random sample of 100 utterances per model, we use human evaluators to identify dialogue acts and to judge whether the generated utterances are correct in the given game context. Finally, we perform a qualitative analysis of our best model.

Automated Evaluation
Metrics To evaluate how closely the generated utterances resemble the human utterances, we report standard BLEU scores (Papineni et al., 2002). We also compute (modified) precision and recall over a number of lists of domain-specific keywords that are instrumental to task success: colors, spatial relations, and other words that are highly indicative of dialogue acts (e.g., responding "yes" vs. "no", instructing to "place" vs. "remove", etc.). These lists also capture synonyms that are common in our data (e.g. "yes"/"yeah"), and were obtained by curating non-overlapping lists of words (with a frequency ≥ 10 across all data splits) that are appropriate to each category. 2 We report precision and recall scores per category, and for an "all keywords" list consisting of the union of all category word lists. For each category, we reduce both human and generated utterances to those tokens that occur in the corresponding keyword list: "place another red left of the green" reduces to "red green" for color, to "left" for spatial relations, and to "place" for dialogue-act keywords.
For a given (reduced) generated sentence S_g and its associated (reduced) human utterance S_h, we calculate term-specific precision (and recall) as follows. Any token t_g in S_g matches a token t_h in S_h if t_g and t_h are identical or synonyms. Similar to BLEU's modified unigram precision, once t_g is matched to one token t_h, it cannot be used for further matches to other tokens within S_h. Counts are accumulated over the entire corpus to compute the ratio of matched to total tokens in S_g (for precision) or S_h (for recall).

Ablation study Table 1 shows the results of an ablation study on the validation set. All model variants here share the same RNN parameters. While the individual addition of global and local block counters yields a slight boost in precision and recall respectively, combining them as in our final model shows a significant performance increase, especially on colors.
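The keyword reduction and matching described above can be sketched as follows (the keyword lists and synonym sets shown are illustrative, not the curated ones used in the paper):

```python
def reduce_to_keywords(tokens, keywords):
    """Keep only the tokens that occur in the category's keyword list."""
    return [t for t in tokens if t in keywords]

def keyword_matches(generated, human, synonyms=None):
    """Modified-unigram-style matching over reduced token lists: tokens
    match if identical or listed as synonyms, and each human token can
    be matched at most once. Precision is matched / len(generated),
    recall is matched / len(human), accumulated over the corpus."""
    synonyms = synonyms or {}
    unmatched = list(human)
    matched = 0
    for t in generated:
        for i, h in enumerate(unmatched):
            if t == h or h in synonyms.get(t, ()):
                matched += 1
                del unmatched[i]   # each human token matches at most once
                break
    return matched
```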

Test set results We finetune our most basic and most complex model via a grid search over all architectural parameters and dropout values on the validation set. The best model's results on the test set are shown in Table 2. Our full model shows noticeable improvements on each of our metrics over the baseline. Most promising is again the significant increase in performance on colors, indicating that the block counters capture necessary information about next Builder actions.

Human Evaluation
In order to better evaluate the quality of generated utterances as well as to benchmark human performance, we performed a small-scale human evaluation of Architect utterances. We asked 3 human evaluators to judge candidate utterances (those generated by the models in Table 2 as well as the original human utterance), presented in randomized order.
Here, we analyze a subset of results on coarse annotation of dialogue acts and utterance correctness. More details on the full evaluation framework, including descriptions of evaluation criteria and inter-annotator agreement statistics, are included in the supplementary materials.
Dialogue acts Given a list of six predefined coarse-grained dialogue acts (including Instruct B, Describe Target, etc.; see the supplementary material for full details), evaluators were asked to choose all dialogue acts that categorized a candidate utterance. An utterance could belong to any number of categories; e.g., "great! now place a red block" is both a confirmation as well as an instruction. Results can be found in Table 3. These results show a significantly higher diversity of utterance types generated by humans. Humans provided instructions only about half of the time, and devoted more energy to providing higher-level descriptions of the target, responding to the Builder's actions and queries, and rectifying mistakes. On the other hand, even the improved model failed to capture this, mainly generating instructions even when it was inappropriate or unhelpful to do so.
Utterance correctness Given a window of game context (consisting of at least the last seven Builder and Architect actions, but always including the previous Architect utterance) and access to the target structure to be built, evaluators were asked to rate the correctness of an utterance immediately following that context with respect to task completion. For an utterance to be fully correct, information contained within it must both be consistent with the current state of the world and not lead the Builder off-course from the target. Utterances could be considered partially correct if some described elements (e.g. colors) were accurate, but other incorrect elements precluded full correctness. Otherwise, utterances could be deemed incorrect (if wildly off-course) or N/A (if there was not enough information). Results can be found in Table 4. Unsurprisingly, without access to world state information, the baseline model performs poorly, conveying incorrect information about half of the time. With access to a simple world representation, our full model shows marked improvement on generating both fully and partially correct utterances. Finally, human performance sets a high bar; when not engaging in chit-chat or correcting typos, humans consistently produce fully correct utterances constructive towards task completion.

Qualitative Analysis
Here, we use examples to illustrate different aspects of our best model's utterances.
Identifying the game state In the course of a game, players progress through different states. In the human-human data, dialogue is peppered with context cues (greetings, questions, apologies, instructions to move or place blocks) that indicate the flow of a game. Our model is able to capture some of these aspects. It often begins games with an instruction like "we'll start with blue", and may end them with "ok we're done!" (although it occasionally continues with further instructions, e.g. "great! now we'll do the same thing on the other side"). It often says "perfect!" immediately followed by a new instruction, which indicates the model's ability to acknowledge a Builder's previous actions before continuing. The model often describes the type of the next required action correctly (even if it makes mistakes in the specifics of that action): it generated "remove the bottom row" when the ground truth was "okay so now get rid of the inner most layer of purple in the square".
Predicting block colors and spatial relations Generated utterances often identify the correct color of blocks, e.g. "then place a red block on top of that" in a context where the next placements include a layer of red blocks (ground truth utterance: "the second level of the structure consists wholly of red blocks. start by putting a red block on each orange block"). Less frequently, the model is also able to predict accurate spatial relations ("perfect! now place a red block to the left of that") for referent blocks.
Utterance diversity and repetition Generated utterances lack diversity: the pattern "a x b" (for a rectangle of size a × b) is almost exclusively used to describe squares (an extremely common shape in our data). Utterances are mostly fluent, but sometimes contain repeats: "okay, on top of the blue block, put a blue block on top of the blue" or "yes, now, purple, purple, purple, ..."

Conclusion and Future Work
The Minecraft Collaborative Building Task provides interesting challenges for interactive agents: they must understand and generate spatially-aware dialogue, execute instructions, and identify and recover from mistakes. As a first step towards the goal of developing fully interactive agents for this task, we considered the subtask of Architect utterance generation. To give accurate, high-level instructions, Architects need to align the Builder's world state to the target structure and identify complex substructures. We show that models that capture some world state information improve over naive baselines. Richer models (e.g. CNNs over world states, attention mechanisms (Bahdanau et al., 2015), memory networks (Bordes et al., 2017)) and/or explicit semantic representations should be able to generate better utterances. Clearly, much work remains to be done to create actual agents that can play either role interactively against a human. The Minecraft Dialogue Corpus as well as the Malmo platform and our extension of it enable many such future directions. Our platform can also be extended to support fully interactive scenarios that may involve a human player, measure task completion, or support other training regimes (e.g. reinforcement learning).