Reading and Acting while Blindfolded: The Need for Semantics in Text Game Agents

Text-based games simulate worlds and interact with players using natural language. Recent work has used them as a testbed for autonomous language-understanding agents, with the motivation being that understanding the meanings of words or semantics is a key component of how humans understand, reason, and act in these worlds. However, it remains unclear to what extent artificial agents utilize semantic understanding of the text. To this end, we perform experiments to systematically reduce the amount of semantic information available to a learning agent. Surprisingly, we find that an agent is capable of achieving high scores even in the complete absence of language semantics, indicating that the currently popular experimental setup and models may be poorly designed to understand and leverage game texts. To remedy this deficiency, we propose an inverse dynamics decoder to regularize the representation space and encourage exploration, which shows improved performance on several games including Zork I. We discuss the implications of our findings for designing future agents with stronger semantic understanding.


Introduction
Text adventure games such as ZORK I (Figure 1 (a)) have been a testbed for developing autonomous agents that operate using natural language. Since interactions in these games (input observations, action commands) are through text, the ability to understand and use language is deemed necessary and critical to progress through such games. Previous work has deployed a spectrum of methods for language processing in this domain, including word vectors (Fulda et al., 2017), recurrent neural networks (Narasimhan et al., 2015), pre-trained language models (Yao et al., 2020), open-domain question answering systems, knowledge graphs (Adhikari et al., 2020), and reading comprehension systems.
* Work partly done during internship at Microsoft Research. Project page: https://blindfolded.cs.princeton.edu.
Meanwhile, most of these models operate under the reinforcement learning (RL) framework, where the agent explores the same environment in repeated episodes, learning a value function or policy to maximize game score. From this perspective, text games are just special instances of a partially observable Markov decision process (POMDP) (S, T, A, O, R, γ), where players issue text actions a ∈ A, receive text observations o ∈ O and scalar rewards r = R(s, a), and the underlying game state s ∈ S is updated by the transition s' = T(s, a).
However, what distinguishes these games from other POMDPs is the fact that the actions and observations are in language space L. Therefore, a certain level of decipherable semantics is attached to text observations o ∈ O ⊂ L and actions a ∈ A ⊂ L. Ideally, these texts not only serve as observation or action identifiers, but also provide clues about the latent transition function T and reward function R. For example, issuing an action "jump" based on an observation "on the cliff" would likely yield a subsequent observation such as "you are killed" along with a negative reward. Human players often rely on their understanding of language semantics to inform their choices, even on games they have never played before, while replacing texts with non-semantic identifiers such as their corresponding hashes (Figure 1 (c)) would likely render games unplayable for people. However, would this type of transformation affect current RL agents for such games? In this paper, we ask the following question: To what extent do current reinforcement learning agents leverage semantics in text-based games?
To shed light on this question, we investigate the Deep Reinforcement Relevance Network (DRRN) (He et al., 2016), a top-performing RL model that uses gated recurrent units (GRU) (Cho et al., 2014) to encode texts. We conduct three experiments on a set of interactive fiction games from the Jericho benchmark to probe the effect of different semantic representations on the functioning of DRRN. These include (1) using just a location phrase as the input observation (Figure 1 (b)), (2) hashing text observations and actions (Figure 1 (c)), and (3) regularizing vector representations using an auxiliary inverse dynamics loss. While reducing observations to location phrases leads to decreased scores and enforcing inverse dynamics decoding leads to increased scores on some games, hashing texts to break semantics surprisingly matches or even outperforms the baseline DRRN on almost all games considered. This implies current RL agents for text-based games might not be sufficiently leveraging the semantic structure of game texts to learn good policies, and points to the need for developing better experiment setups and agents that have a finer grasp of natural language.

[Figure 1 (a): ZORK I]
Observation 21: You are in the living room. There is a doorway to the east, a wooden door with strange gothic lettering to the west, which appears to be nailed shut, a trophy case, and a large oriental rug in the center of the room. You are carrying: A brass lantern . . .
Action 21: move rug
Observation 22: With a great effort, the rug is moved to one side of the room, revealing the dusty cover of a closed trap door... Living room... You are carrying: ...
Models
DRRN Baseline Our baseline RL agent DRRN (He et al., 2016) learns a Q-network Q_φ(o, a) parametrized by φ. The model encodes the observation o and each action candidate a using two separate GRU encoders f_o and f_a, and then aggregates the representations to derive the Q-value through an MLP decoder g:

$$Q_\phi(o, a) = g(f_o(o), f_a(a)) \tag{1}$$

For learning φ, tuples (o, a, r, o') of observation, action, reward and the next observation are sampled from an experience replay buffer and the following temporal difference (TD) loss is minimized:

$$L_{TD}(\phi) = \left(r + \gamma \max_{a' \in A} Q_\phi(o', a') - Q_\phi(o, a)\right)^2 \tag{2}$$

During gameplay, a softmax exploration policy is used to sample an action:

$$\pi_\phi(a \mid o) = \frac{\exp(Q_\phi(o, a))}{\sum_{a' \in A} \exp(Q_\phi(o, a'))} \tag{3}$$

Note that when the action space A is large, (2) and (3) become intractable. A valid action handicap or a language model (Yao et al., 2020) can be used to generate a reduced action space for efficient exploration. For all the modifications below, we use the DRRN with the valid action handicap as our base model.
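As a rough illustration of the TD loss (2) and the softmax exploration policy (3), the following sketch works on plain Python floats rather than a neural network. The function names, the discount value, and the treatment of terminal states (an empty list of next Q-values) are assumptions made for this example and are not taken from the DRRN implementation.

```python
import math
import random

def td_loss(q_sa, reward, next_q_values, gamma=0.9):
    """Squared TD error for one (o, a, r, o') tuple.

    q_sa          -- current estimate Q(o, a)
    next_q_values -- Q(o', a') for every valid action a' in o'
                     (empty list if o' is terminal; an assumption here)
    gamma         -- discount factor (0.9 is an illustrative value)
    """
    bootstrap = max(next_q_values) if next_q_values else 0.0
    target = reward + gamma * bootstrap
    return (target - q_sa) ** 2

def softmax_policy(q_values):
    """Probabilities proportional to exp(Q), one per valid action."""
    m = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp(q - m) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(actions, q_values, rng):
    """Sample one action from the softmax exploration policy."""
    probs = softmax_policy(q_values)
    return rng.choices(actions, weights=probs, k=1)[0]

if __name__ == "__main__":
    valid_actions = ["move rug", "open case", "east"]
    q = [2.0, 0.5, 0.1]  # made-up Q-values for the example
    print(softmax_policy(q))
    print(sample_action(valid_actions, q, random.Random(0)))
```

Both the maximum in the TD target and the normalizing sum in the policy iterate over the whole action set, which is why a reduced set of valid actions keeps these computations tractable.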

Reducing Semantics via Minimizing Observation (MIN-OB)
Unlike other RL domains such as video games or robotics control, the (valid) action space in text games changes at every step, and it reveals useful information about the current state. For example, knowing that "unlock box" is valid leaks the existence of a locked box. Moreover, action semantics sometimes indicate an action's value regardless of the state, e.g. "pick gold" usually seems good. Given these, we minimize the observation to only a location phrase o → loc(o) (Figure 1 (b)) to isolate the action semantics:

$$Q^{loc}_\phi(o, a) = g(f_o(loc(o)), f_a(a))$$

Breaking Semantics via Hashing (HASH)
Given a hash function from strings to integers h : L → Z, and a pseudo-random generator G : Z → R^d that turns an integer seed into a random Gaussian vector, a hashing encoder f̂ = G ∘ h : L → R^d can be composed. While f_o and f_a in (1) are trainable, f̂ is fixed throughout RL, and ensures that two texts differing by only a word have completely different representations. In this sense, hashing breaks semantics and only serves to identify different observations and actions in an abstract MDP problem (Figure 1 (c)):

$$Q^{hash}_\phi(o, a) = g(\hat{f}(o), \hat{f}(a))$$
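The composition of a hash function h with a seeded Gaussian generator G can be sketched as follows. This is a minimal stand-in and not the paper's implementation: it uses SHA-256 from `hashlib` for h and `random.Random` for G, and the dimension `DIM` and the name `hash_encode` are hypothetical.

```python
import hashlib
import random
from typing import List

DIM = 8  # hypothetical embedding size d

def h(text: str) -> int:
    """Map a string to an integer (SHA-256 is a stand-in choice)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def G(seed: int, dim: int = DIM) -> List[float]:
    """Turn an integer seed into a pseudo-random Gaussian vector."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def hash_encode(text: str) -> List[float]:
    """Fixed, non-trainable encoder: G composed with h."""
    return G(h(text))

if __name__ == "__main__":
    a = hash_encode("open door")
    b = hash_encode("open the door")
    # Identical texts get identical vectors; a one-word change
    # yields an unrelated vector.
    print(a == hash_encode("open door"), a == b)
```

Because the vector depends only on the seed, the encoder acts as a lookup key for each distinct text and carries no information about how similar two texts are.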

Regularizing Semantics via Inverse Dynamics Decoding (INV-DY)
The GRU representations in DRRN, f_o(o) and f_a(a), are only optimized for the TD loss (2). As a result, text semantics can degenerate during encoding, and the text representation might arbitrarily overfit to the Q-values. To regularize and encourage more game-related semantics to be encoded, we take inspiration from Pathak et al. (2017) and propose an inverse dynamics auxiliary task during RL. Given representations of current and next observations f_o(o), f_o(o'), we use an MLP g_inv to predict the action representation, and a GRU decoder d to decode the action back to text.* The inverse dynamics loss is defined as

$$L_{inv}(\phi, \theta) = -\log p_d\left(a \mid g_{inv}(\mathrm{concat}(f_o(o), f_o(o')))\right)$$

where θ denotes the weights of g_inv and d, and p_d(a|x) is the probability of decoding token sequence a using GRU decoder d with initial hidden state x. To also regularize the action encoding, action reconstruction from f_a is used as a loss term:

$$L_{dec}(\phi, \theta) = -\log p_d(a \mid f_a(a))$$

During experience replay, these two losses are optimized along with the TD loss:

$$L(\phi, \theta) = L_{TD}(\phi) + \lambda_1 L_{inv}(\phi, \theta) + \lambda_2 L_{dec}(\phi, \theta)$$

An intrinsic reward r^+ = L_inv(φ, θ) is also used to explore toward where the inverse dynamics is not learned well yet. All in all, the aim of INV-DY is threefold: (1) regularize both action and observation representations to avoid degeneration by decoding back to the textual domain, (2) encourage f_o to encode action-relevant parts of observations, and (3) provide intrinsic motivation for exploration.

* Directly defining an L1/L2 loss between f_a(a) and g_inv(concat(f_o(o), f_o(o'))) in the representation space will collapse text representations together.
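To make the loss bookkeeping concrete, the sketch below computes a sequence negative log-likelihood from per-token decoder probabilities and combines the three loss terms. The per-token probabilities are made-up numbers standing in for a GRU decoder's output, and adding the intrinsic reward directly to the game reward is an assumption of this example, since the text does not specify how the two are combined.

```python
import math

def sequence_nll(token_probs):
    """-log p_d(a | x): negative log-likelihood of the action's tokens,
    given the probability the decoder assigns to each correct token."""
    return -sum(math.log(p) for p in token_probs)

def total_loss(td, inv, dec, lambda1=1.0, lambda2=1.0):
    """TD loss plus weighted inverse dynamics and action decoding losses."""
    return td + lambda1 * inv + lambda2 * dec

def shaped_reward(game_reward, inv_loss):
    """Game reward plus intrinsic bonus r+ = L_inv (additive combination
    is an assumption made for this sketch)."""
    return game_reward + inv_loss

if __name__ == "__main__":
    # Hypothetical decoder probabilities for the tokens of "move rug".
    well_learned = sequence_nll([0.9, 0.9])
    poorly_learned = sequence_nll([0.1, 0.1])
    # Transitions the inverse model predicts badly earn a larger bonus.
    print(well_learned < poorly_learned)
    print(total_loss(td=0.5, inv=well_learned, dec=0.2))
```

The intrinsic bonus shrinks as the inverse dynamics model improves on a transition, so exploration is pushed toward parts of the game where the model still predicts poorly.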

Results
Setup We train on 12 games† from the Jericho benchmark. These human-written interactive fictions are rich, complex, and diverse in semantics.‡ For each game, we train DRRN asynchronously on 8 parallel instances of the game environment for 10^5 steps, using a prioritized replay buffer. Following prior practice, we augment observations with location and inventory descriptions by issuing the 'look' and 'inventory' commands. We train three independent runs for each game and report their average score. For HASH, we use the Python built-in hash function to process text as a tuple of token IDs, and implement the random vector generator G by seeding PyTorch with the hash value. For INV-DY, we use λ_1 = λ_2 = 1.
Scores Table 1 reports the final score (the average score of the final 100 episodes during training) and the maximum score seen in each game for different models. The average normalized score (raw score divided by game total score) over all games is also reported. Compared to the base DRRN, MIN-OB explores similar maximum scores on most games (except DEEPHOME and DRAGON), but fails to memorize the good experience and reach high episodic scores, which suggests the importance of identifying different observations using language details. Most surprisingly, HASH has a lower final score than DRRN on only one game (ZORK I), while on PENTARI it almost doubles the DRRN final score. It is also the model with the best average normalized final score across games, which indicates that the DRRN model can perform as well without leveraging any language semantics, simply by identifying different observations and actions with random vectors and memorizing the Q-values. Lastly, we observe that on some games (DRAGON, OMNIQUEST, ZORK I) INV-DY can explore high scores that other models cannot. Most notably, on ZORK I the maximum score seen is 87 (average of 54, 94, 113 over the three runs), while no run of any other model explores a score above 55. This might indicate a potential benefit of developing RL agents with more semantic representations.
† We omit games where DRRN cannot score.
‡ Please refer to the Jericho benchmark for more details about these games.
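The score arithmetic can be checked with a few lines. The helper names are hypothetical, and the ZORK I total of 350 is taken to be the game's maximum attainable score, which matches the top walkthrough score quoted in the Visualizations paragraph.

```python
from statistics import mean

ZORK1_TOTAL_SCORE = 350  # maximum attainable score in ZORK I

def normalized_score(raw, total):
    """Raw score divided by the game's total score."""
    return raw / total

def average_normalized(raw_by_game, total_by_game):
    """Mean of per-game normalized scores (both dicts keyed by game)."""
    return mean(raw_by_game[g] / total_by_game[g] for g in raw_by_game)

# Maximum scores seen by INV-DY on ZORK I in the three runs.
inv_dy_max_per_run = [54, 94, 113]
avg_max = mean(inv_dy_max_per_run)

if __name__ == "__main__":
    print(avg_max)  # 87
    print(normalized_score(avg_max, ZORK1_TOTAL_SCORE))
```

Normalizing each game before averaging keeps games with large total scores from dominating the cross-game comparison.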
Transfer We also investigate if representations of different models can transfer to a new language environment, which is a potential benefit of learning natural language semantics. So we consider the two most similar games in Jericho, ZORK I and ZORK III, fix the language encoders of different ZORK I models, and re-train the Q-network on ZORK III for 10,000 steps. As shown in Figure 2, INV-DY representations can achieve a score around 1, which surpasses the best result of models trained from scratch on ZORK III for 100,000 steps (around 0.4), showing great promise in better gameplay by leveraging language understanding from other games. HASH transfer is equivalent to training from scratch as the representations are not learnt, and a score around 0.3 is achieved. Finally, DRRN representations transfer worse than HASH, possibly due to overfitting to the TD loss (2).
Visualizations Finally, we use t-SNE (Maaten and Hinton, 2008) to visualize representations of some ZORK I walkthrough states in Figure 3. The first 30 walkthrough states (red, score 0-45) are well experienced by the models during exploration, whereas the last 170 states (blue, score 157-350) are unseen.§ We also encircle the subset of states at location 'living room' for their shared semantics.
First, we note that the HASH representations for living room states are scattered randomly, unlike the other two models with GRU language encoders. Further, the base DRRN overfits to the TD loss (2), representing unseen states (blue) in a different subspace from seen states (red) regardless of their semantic similarity. INV-DY is able to extrapolate to unseen states and represent them similarly to seen states based on their shared semantics, which may explain its better performance on this game.
Game stochasticity All the above experiments were performed using a fixed game random seed for each game, following prior work. To investigate whether randomness in games affects our conclusions, we run one trial of each game with episode-varying random seeds.¶ We find the average normalized scores for the base DRRN, HASH, and INV-DY to all be around 17%, with performance drops mainly on three stochastic games (DRAGON, ZORK I, ZORK III). Notably, the core finding that the base DRRN and HASH perform similarly still holds. Intuitively, even though the Q-values would be lower overall with unexpected transitions, RL would still memorize observations and actions that lead to high Q-values.

Discussion
At a high level, RL agents for text-based games succeed by (1) exploring trajectories that lead to high scores, and (2) learning representations to stably reach high scores. Our experiments show that a semantics-regularized INV-DY model manages to explore higher scores on some games (DRAGON, OMNIQUEST, ZORK I), while the HASH model manages to memorize scores better on other games (LIBRARY, LUDICORP, PENTARI) using just a fixed, random, non-semantic representation. This leads us to hypothesize two things. First, fixed, stable representations might make Q-learning easier. Second, it might be desirable to represent similar texts very differently for better gameplay, e.g. the Q-value can be much higher when a key object is mentioned, even if it only adds a few words to a long observation text. This motivates future thought into the structural vs. functional use of language semantics in these games.
Our findings also urge a re-thinking of the popular 'RL + valid action handicap' setup for these games. On one hand, RL sets training and evaluation in the same environment, with limited text corpora, and sparse, mostly deterministic rewards as the only optimization objective. Such a combination easily results in overfitting to the reward system of a specific game (Figure 2), or even just a specific stage of the game (Figure 3). On the other hand, the valid action handicap reduces the action set to a small size tractable for memorization, and reduces the language understanding challenge for the RL agent. Thus for future research on text-based games, we advocate for more attention towards alternative setups without RL or handicaps (Hausknecht et al., 2019; Yao et al., 2020). Particularly, in an 'RL + no valid action handicap' setting, generating action candidates rather than simply choosing from a set entails more opportunities and challenges with respect to learning grounded language semantics (Yao et al., 2020). Additionally, training agents on a distribution of games and evaluating them on a separate set of unseen games would require more general semantic understanding. Semantic evaluation of these proposed paradigms is outside the scope of this paper, but we hope it will spark a productive discussion on the next steps toward building agents with stronger semantic understanding.

Ethical Considerations
Autonomous decision-making agents are potentially impactful in our society, and it is of great ethical consideration to make sure their understanding of the world and their objectives align with humans. Humans use natural language to convey and understand concepts as well as inform decisions, and in this work we investigate whether autonomous agents leverage language semantics similarly to humans in the environment of text-based games. Our findings suggest that the current generation of agents optimized for reinforcement learning objectives might not exhibit human-like language understanding, a phenomenon we should pay attention to and further study.