Experience Grounds Language

Successful linguistic communication relies on a shared experience of the world, and it is this shared experience that makes utterances meaningful. Despite the incredible effectiveness of language processing models trained on text alone, today's best systems still make mistakes that arise from a failure to relate language to the physical world it describes and to the social interactions it facilitates. Natural Language Processing is a diverse field, and progress throughout its development has come from new representational theories, modeling techniques, data collection paradigms, and tasks. We posit that the present success of representation learning approaches trained on large text corpora can be deeply enriched from the parallel tradition of research on the contextual and social nature of language. In this article, we consider work on the contextual foundations of language: grounding, embodiment, and social interaction. We describe a brief history and possible progression of how contextual information can factor into our representations, with an eye towards how this integration can move the field forward and where it is currently being pioneered. We believe this framing will serve as a roadmap for truly contextual language understanding.

Improvements in hardware and data collection have galvanized progress in NLP. Performance peaks have been reached in tasks such as language modeling (Radford et al., 2019; Zellers et al., 2019c; Keskar et al., 2019) and span-selection question answering (Devlin et al., 2019a; Yang et al., 2019) through massive data and massive models. Now is an excellent time to reflect on the direction of our field, and on the relationship of language to the broader AI, Cognitive Science, and Linguistics communities.
We are interested in how the data and world a language learner is exposed to defines and potentially constrains the scope of the learner's semantics.

"Meaning is not a unique property of language, but a general characteristic of human activity ... We cannot say that each morpheme or word has a single or central meaning, or even that it has a continuous or coherent range of meanings ... there are two separate uses and meanings of language - the concrete ... and the abstract."
Zellig S. Harris (Distributional Structure, 1954)

Meaning is not a unique property of language; it is a universal property of intelligence. Consequently, we must consider what knowledge or concepts are missing from models trained solely on text corpora, even when those corpora are large scale or meticulously annotated.
Large, generative language models emit utterances which violate visual, physical, and social commonsense. Perhaps this is to be expected, since they are tested on independent and identically distributed (IID) held-out sets, rather than queried on points meant to probe the granularity of the distinctions they make (Kaushik et al., 2020; Gardner et al., 2020). We draw on previous work in NLP, Cognitive Science, and Linguistics to provide a roadmap towards addressing these gaps. We posit that the universes of knowledge and experience available to NLP models can be defined by successively larger world scopes: from a single corpus to a fully embodied and social context.
We propose the notion of a World Scope (WS) as a lens through which to view progress in NLP. From the beginning of corpus linguistics (Zipf, 1932;Harris, 1954), to the formation of the Penn Treebank (Marcus et al., 1993), NLP researchers have consistently recognized the limitations of corpora in terms of coverage of language and experience. To organize this intuition, we propose five WSs, and note that most current work in NLP operates in the second (internet-scale data).

WS1: Corpora and Representations
The story of computer-aided, data-driven language research begins with the corpus. While by no means the only one of its kind, the Penn Treebank (Marcus et al., 1993) is the canonical example of a sterilized subset of naturally generated language, processed and annotated for the purpose of studying representations, and a perfect example of WS1. While much early energy was directed at finding structure (e.g., syntax trees), this has been largely overshadowed by the runaway success of unstructured, fuzzy representations, from dense word vectors (Mikolov et al., 2013) to contextualized pretrained representations (Peters et al., 2018b; Devlin et al., 2019b). Yet fuzzy, conceptual word representations have a long history that predates the recent success of deep learning methods. Philosophy (Lakoff, 1973) and linguistics (Coleman and Kay, 1981) recognized that meaning is flexible yet structured. Early experiments on neural networks trained with sequences of words (Elman, 1990) suggested that vector representations could capture both syntax and semantics. Subsequent experiments with larger models, document contexts, and corpora have demonstrated that representations learned from text capture a great deal of information about word meanings in and out of context (Bengio et al., 2003; Collobert and Weston, 2008; Turian et al., 2010; Mikolov et al., 2013; McCann et al., 2017). It has long been acknowledged that context lends meaning (Firth, 1957; Turney and Pantel, 2010). Local contexts proved powerful when combined with agglomerative clustering guided by mutual information (Brown et al., 1992); a word's position in the resulting hierarchy captures semantic and syntactic distinctions. When the Baum-Welch algorithm (Welch, 2003) was applied to unsupervised Hidden Markov Models, it assigned a class distribution to every word, and that distribution was considered a partial representation of a word's "meaning." If the set of classes was small, syntax-like classes were induced; if the set was large, classes became more semantic. Over the years this "search for meaning" has played out again and again, often with similar themes.
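As a rough illustration of this distributional tradition (not a reproduction of any cited system), the sketch below learns dense word vectors from a toy corpus with a skip-gram model; the corpus, hyperparameters, and use of the gensim library (parameter names assume gensim 4.x) are assumptions for illustration only.

# Minimal sketch: learning distributional word vectors from a toy corpus.
# Assumes gensim 4.x; corpus and hyperparameters are purely illustrative.
from gensim.models import Word2Vec

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["a", "cat", "chased", "a", "mouse"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=50,   # dimensionality of the dense representation
    window=2,         # local context window, as in early skip-gram work
    min_count=1,
    sg=1,             # skip-gram rather than CBOW
    epochs=50,
)

# Words that share contexts end up with similar vectors.
print(model.wv.most_similar("cat", topn=3))

Words appearing in similar local contexts ("cat" and "dog" above) receive nearby vectors, which is the sense in which context lends meaning in these models.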
[Figure: Percentage of 2019 citations to Harris (1954), Firth (1957), and Chomsky (1957) by year of citing publication. Academic interest in Firth and Harris increases dramatically around 2010, perhaps due to the popularization of Firth (1957): "You shall know a word by the company it keeps."]

Later approaches discarded structure in favor of larger context windows for better semantics. The most popular generative approach came in the form of Latent Dirichlet Allocation (Blei et al., 2003): sentence structure was discarded and a document was generated as a bag-of-words conditioned on topics. However, the most successful vector representations came from Latent Semantic Indexing/Analysis (Deerwester et al., 1988, 1990; Dumais, 2004), which represents words by their co-occurrence with other words. Via singular value decomposition, these (essentially bag-of-words) matrices are reduced to low-dimensional projections.
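To make the LSA-style construction concrete, here is a minimal sketch of deriving low-dimensional word representations by truncated SVD of a term-document count matrix; the toy documents, the dimensionality, and the scikit-learn vectorizer (assuming scikit-learn >= 1.0) are illustrative assumptions rather than the setups of the cited works.

# Minimal LSA-style sketch: words represented via a truncated SVD of
# a term-document count matrix. Toy data; illustrative only.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "stocks fell as markets reacted to the report",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs).toarray().T   # shape: (vocab, docs)

# Truncated SVD gives each word a low-dimensional "latent semantic" vector.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
word_vectors = U[:, :k] * S[:k]                  # project onto top-k dimensions

vocab = list(vectorizer.get_feature_names_out())
cat_vec = word_vectors[vocab.index("cat")]
dog_vec = word_vectors[vocab.index("dog")]
print(np.dot(cat_vec, dog_vec))                  # words from similar documents align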
A related question also existed in connectionism (Pollack, 1987; James and Miikkulainen, 1995): are concepts distributed through edges or local to units in an artificial neural network?

"... there has been a long and unresolved debate between those who favor localist representations in which each processing element corresponds to a meaningful concept and those who favor distributed representations."
Hinton (1990), Special Issue on Connectionist Symbol Processing

Unlike work that defined words as distributions over clusters, which were perceived as having intrinsic meaning to the user of a system, in connectionism there was the question of where meaning resides. The tension between modeling symbols and distributed representations was nicely articulated by Smolensky (1990), and alternative representations (Kohonen, 1984; Hinton et al., 1986; Barlow, 1989) and approaches to structure and composition (Erk and Padó, 2008; Socher et al., 2012) span decades of research. All of these works rely on corpora. The Brown Corpus (Francis, 1964) and Penn Treebank (Marcus et al., 1993) were colossal undertakings that defined linguistic context for decades. Only relatively recently (Baroni et al., 2009) has the cost of annotation decreased enough to enable the introduction of new tasks, and have web-crawls become viable for researchers to collect and process (WS2).[1]

[1] An important parallel discussion centers on the hardware required to enable advances that move us from one world scope to the next. Playstations (Pinto et al., 2009) and GPUs (Krizhevsky et al., 2012) made much of WS2 advances possible, but perception, interaction, and social robotics bring new challenges which our current tools are not equipped to handle.

WS2: The Written World
Corpora in NLP have broadened to include large web-crawls. The use of unstructured, unlabeled, multi-domain, and multilingual data broadens our world scope, in the limit, to everything humanity has ever written. We are no longer constrained to a single author or source, and the temptation for NLP is to believe everything that needs knowing can be learned from the written world.
This move towards using large-scale data (whether mono- or multilingual) has led to substantial advances in performance on existing and novel community benchmarks (Wang et al., 2019a). These advances have come from the transfer learning enabled by representations in deep models. Traditionally, transfer learning relied on our understanding of model classes, such as English grammar. Domain adaptation could proceed by simply providing a model with sufficient data to capture lexical variation, while assuming most higher-level structure would remain the same. Embeddings, lexical representations built from massive corpora, encompass multiple domains and lexical senses, violating this structural assumption.
These representations require scale, both in terms of data and parameters. Concretely, recent models are Transformer architectures (Vaswani et al., 2017) trained on over 120GB of text. Our interest in these approaches is two-fold:

1. Larger models see diminishing returns (especially with respect to computational cost) despite the massive availability of data.
2. These models have expanded the notion of local context to include multiple sentences.
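For a concrete picture of what such pretrained contextual representations look like in practice, the sketch below extracts one vector per token for a sentence; the Hugging Face transformers library and the public bert-base-uncased checkpoint are illustrative assumptions, not a claim about the specific systems discussed above.

# Minimal sketch: extracting contextual token representations from a
# pretrained model for transfer. Assumes the Hugging Face `transformers`
# library and the publicly released "bert-base-uncased" checkpoint.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentence = "The bank raised interest rates."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One vector per (sub)word token, conditioned on the full sentence context;
# these are the representations typically reused by downstream tasks.
token_vectors = outputs.last_hidden_state   # shape: (1, num_tokens, hidden_size)
print(token_vectors.shape)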
Current models are the next (impressive) step in modeling lexical distributions, a line which started with Good (1953), the weights of Kneser and Ney (1995) and Chen and Goodman (1996), and the power-law distributions of Teh (2006). Modern approaches to learning dense representations allow us to better estimate these distributions from massive corpora. However, modeling lexical co-occurrence, no matter the scale, is still "modeling the written world." Models constructed this way blindly search for symbolic co-occurrences void of meaning.
Regarding the apparent paradox of "impressive results" vs. "diminishing returns," language modeling, the modern workhorse of neural NLP systems, provides a poignant example. Recent pretraining literature has produced results that few could have predicted, crowding leaderboards with results that claim super-human accuracy (Rajpurkar et al., 2018). However, there are diminishing returns. For example, on the LAMBADA dataset (Paperno et al., 2016), designed to capture human intuition, GPT-2 (Radford et al., 2019) (1.5B parameters), Megatron-LM (Shoeybi et al., 2019) (8.3B), and Turing-NLG (Rosset, 2020) (17B) perform within a few points of each other and very far from perfect (<68%). We argue that continuing to expand hardware and data sizes by orders of magnitude is not the path forward.
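For concreteness, a minimal sketch of the kind of evaluation at issue, scoring a final word given a long preceding context with a pretrained language model, is shown below; the GPT-2 checkpoint, the Hugging Face transformers API, and the example sentence are illustrative assumptions and not the exact protocol of the cited papers.

# Minimal sketch of LAMBADA-style evaluation: ask a pretrained language
# model for the log-probability of the final word given its context.
# Assumes Hugging Face `transformers` and the public "gpt2" checkpoint;
# illustrative only, not the cited papers' protocol.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

context = "She unlocked the door, stepped inside, and switched on the"
target = " light"   # leading space matters for GPT-2's BPE tokenization

ids_context = tokenizer(context, return_tensors="pt").input_ids
ids_target = tokenizer(target, return_tensors="pt").input_ids
ids = torch.cat([ids_context, ids_target], dim=1)

with torch.no_grad():
    logits = model(ids).logits

# Log-probability of the target token(s), conditioned on everything before them.
log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
target_positions = range(ids_context.shape[1] - 1, ids.shape[1] - 1)
target_ids = ids[0, ids_context.shape[1]:]
score = sum(log_probs[pos, tok] for pos, tok in zip(target_positions, target_ids))
print(float(score))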
The aforementioned approaches for learning transferable representations demonstrate that sentence and document context provide powerful signals for learning aspects of meaning, especially semantic relations among words (Fu et al., 2014) and inferential relationships among sentences. The extent to which they capture deeper notions of contextual meaning, however, remains an open question. On the one hand, state-of-the-art, pre-trained language models are capable of generating locally coherent narratives (Radford et al., 2019). On the other hand, past work has found that pretrained word and sentence representations fail to capture many grounded features of words (Lucy and Gauthier, 2017) and sentences, while current NLU systems fail on the thick tail of experience-informed inferences, such as hard coreference problems (Peng et al., 2015). As pretraining schemes seem to be reaching the point of diminishing returns from data, even for some syntactic phenomena (van Schijndel et al., 2019), we posit that other forms of supervision, such as multimodal perception, will be necessary to learn the remaining aspects of meaning in context.

WS3: The World of Sights and Sounds
Language learning needs perception. An agent observing the real world can build axioms from which to extrapolate. Learned physical heuristics, such as that a falling cat will land quietly, are generalized and abstracted into language metaphors like "as nimble as a cat" (Lakoff, 1980). World knowledge forms the basis for how people make entailment and reasoning decisions. There exists strong evidence that children require grounded sensory input, not just speech, to learn language (Sachs et al., 1981; O'Grady, 2005; Vigliocco et al., 2014).
Perception includes auditory, tactile, and visual input. Even restricted to purely linguistic signals, auditory input is necessary for understanding sarcasm, stress, and meaning implied through prosody. Further, tactile senses give meaning, both physical (Sinapov et al., 2014; Thomason et al., 2016) and abstract, to concepts like heavy and soft. Visual perception is a powerful signal for modeling the vast range of experiences in the world that cannot be documented by text alone (Harnad, 1990).
For example, frames and scripts (Schank and Abelson, 1977; Charniak, 1977; Dejong, 1981; Mooney and Dejong, 1985) require understanding often unstated sets of pre- and post-conditions about the world. To borrow from Charniak (1977), how should we learn the meaning, method, and implications of painting? A web crawl of knowledge from an exponential number of possible how-to, text-only guides and manuals (Bisk et al., 2020) is misdirected without some fundamental referents to which symbols can be grounded. We posit that models must be able to watch and recognize objects, people, and activities to understand the language describing them (Krishna et al., 2017; Yatskar et al., 2016; Perlis, 2016).
While the NLP community has played an important role in the history of scripts and grounding (Mooney, 2008), recently remarkable progress in grounding has taken place in the Computer Vision community. In the last decade, CV researchers have refined and codified the "backbone" of computer vision architectures. These advances have led to parameter-efficient models (Tan and Le, 2019) and real-time object detection on embedded devices and phones (Redmon and Farhadi, 2018).
We find anecdotally that many researchers in the NLP community still assume that vision models tested on the 1,000-class ImageNet challenge are limited to extracting a bag of visual words. In reality, these architectures are mature and capable of generalizing, both from the perspective of engineering infrastructure and in that the backbones have been tested against other technologies like self-attention (Ramachandran et al., 2019). The stability of these architectures allows for new research into more challenging world modeling (e.g., Mottaghi et al., 2016). While a minority of language researchers make forays into computer vision, researchers in CV are often willing to incorporate language signals. Advances in computer vision architectures and modeling have enabled building semantic representations rich enough to interact with natural language. In the last decade of work descendant from image captioning (Farhadi et al., 2010; Mitchell et al., 2012), a myriad of tasks on visual question answering (Antol et al., 2015; Das et al., 2018), natural language and visual reasoning (Suhr et al., 2019b), visual commonsense (Zellers et al., 2019a), and multilingual captioning and translation via video have emerged. These combined text and vision benchmarks are rich enough to train large-scale multimodal transformers (Li et al., 2019a; Lu et al., 2019; Zhou et al., 2019) without language pretraining, for example via Conceptual Captions (Sharma et al., 2018), or further broadened to include audio (Tsai et al., 2019).
Semantic-level representations emerge from ImageNet classification pretraining partially due to class hypernyms. For example, the person class sub-divides into many professions and hobbies, like firefighter, gymnast, and doctor. To differentiate such sibling classes, learned vectors can also encode lower-level characteristics like clothing, hair, and typical surrounding scenes. These representations allow for pixel-level masks and skeletal modeling, and can be extended to zero-shot settings targeting all 20,000 ImageNet categories (Chao et al., 2016; Changpinyo et al., 2017). Modern architectures are flexible enough to learn to differentiate instances within a general class, such as face: facial recognition benchmarks require distinguishing over 10,000 unique faces. While vision is by no means "solved," vision benchmarks have led to off-the-shelf tools for building representations rich enough for tens of thousands of objects, scenes, and individuals.
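The off-the-shelf character of these tools can be made concrete with a short sketch that turns an image into a fixed-size visual representation using a pretrained ImageNet backbone; the torchvision API (assuming torchvision >= 0.13) and the image path are assumptions for illustration only.

# Minimal sketch: off-the-shelf visual representations from a pretrained
# ImageNet backbone. Assumes torchvision >= 0.13; the image path is a
# placeholder.
import torch
from torchvision import models
from torchvision.models import ResNet50_Weights
from PIL import Image

weights = ResNet50_Weights.DEFAULT
resnet = models.resnet50(weights=weights)
resnet.eval()

# Drop the final classification layer to expose the 2048-d pooled features.
backbone = torch.nn.Sequential(*list(resnet.children())[:-1])

preprocess = weights.transforms()                 # matching resize/normalize pipeline
image = Image.open("example.jpg").convert("RGB")  # placeholder path
x = preprocess(image).unsqueeze(0)                # shape: (1, 3, H, W)

with torch.no_grad():
    features = backbone(x).flatten(1)             # shape: (1, 2048)

print(features.shape)

Feature vectors of this kind are what the multimodal transformers cited above consume alongside text.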
That the combination of language and vision supervision leads to models that produce clear and coherent sentences, or that their attention can translate parts of an image to words, indicates they truly are learning about language. Moving forward, we believe benchmarks incorporating auditory, tactile, and visual sensory information together with language will be crucial. Such benchmarks will facilitate modeling language meaning with respect to an experienced world.

WS4: Embodiment and Action
Many of the phenomena in Winograd (1972)'s Understanding Natural Language require representation of concepts derivable from perception, such as color, counting, size, shape, spatial reasoning, and stability. Recent work (Gardner et al., 2020) argues that these same dimensions can serve as meaningful perturbations of inputs to evaluate models' reasoning consistency. However, their presence is in service of actual interactions with the world. Action taking is a natural next rung on the ladder to broader context.
An embodied agent, whether in a virtual world, such as a 2D maze (MacMahon et al., 2006), a Baby AI grid world (Chevalier-Boisvert et al., 2019), or Vision-and-Language Navigation (Anderson et al., 2018; Thomason et al., 2019b), or in the real world (Tellex et al., 2011; Matuszek, 2018; Thomason et al., 2020; Tellex et al., 2020), must translate from language to control.

"If A and B have some environments in common and some not ... we say that they have different meanings, the amount of meaning difference corresponding roughly to the amount of difference in their environments ..."
Zellig S. Harris (Distributional Structure, 1954)

Control and action taking open several new dimensions to understanding and actively learning about the world. Queries can be resolved via dialog-based exploration with a human interlocutor (Liu and Chai, 2015), even as new object properties like texture, weight, and sound become available (Thomason et al., 2017). We see the need for embodied language with complex meaning when thinking deeply about even the most innocuous of questions:

Is an orange more like a baseball or a banana?
WS1 is likely not to have an answer beyond that the objects are common nouns that can both be held. WS2 may also capture that oranges and baseballs both roll, but is unlikely to capture the deformation strength, surface texture, or relative sizes of these objects (Elazar et al., 2019). WS3 can begin to understand the relative deformability of these objects, but is likely to confuse how much force is necessary given that baseballs are used much more roughly than oranges in widely distributed media. WS4 can appreciate the nuanced nature of the question: the orange and baseball share a similar texture and weight, allowing for similar manipulation, while the orange and banana both have peels, deform under stress, and are edible. We as humans have rich representations of these items. The words evoke a myriad of properties, and that richness allows us to reason.
Control is where people first learn abstraction and simple examples of post-conditions through trial and error. The most basic scripts humans learn start with moving our own bodies and achieving simple goals as children, such as stacking blocks. In this space, we have unlimited supervision from the environment and can learn to generalize across plans and actions. In general, simple worlds do not entail simple concepts: even in block worlds (Bisk et al., 2018), concepts like "mirroring" appear, one of a massive set of physical phenomena that we generalize and apply to abstract concepts.
In addition to learning basic physical properties of the world from interaction, WS4 also allows the agent to construct rich pre-linguistic representations from which to generalize. Hespos and Spelke (2004) show pre-linguistic category formation in children that is later codified by social constructs. Mounting evidence seems to indicate that children have trouble transferring knowledge from the 2D world of books (Barr, 2013) and iPads to the physical 3D world. So while we might choose to believe that we can encode parameters (Chomsky, 1981) more effectively and efficiently than evolution provided us, developmental experiments indicate that doing so without 3D interaction may prove insurmountable.
Part of the problem is that much of the knowledge humans hold about the world is intuitive, possibly incommunicable by language, but required to understand that language. This intuitive knowledge could be acquired by embodied agents interacting with their environment, even before words are grounded to meanings.
Robotics and embodiment are not available in the same off-the-shelf manner as computer vision models. However, there is rapid progress in simulators and commercial robotics, and as language researchers we should match these advances at every step. As action spaces grow, we can study complex language instructions in simulated homes (Shridhar et al., 2019) or map language to physical robot control (Blukis et al., 2019; Chai et al., 2018). The last few years have seen massive advances in both high-fidelity simulators for robotics (Todorov et al., 2012; Coumans and Bai, 2016-2019; NVIDIA, 2019; Xiang et al., 2020) and the cost and availability of commodity hardware (Fitzgerald, 2013; Campeau-Lecours et al., 2019; Murali et al., 2019).
As computers transition from desktops to pervasive mobile and edge devices, we must make and meet the expectation that NLP can be deployed in any of these contexts. Current representations have very limited utility in even the most basic robotic settings (Scalise et al., 2019), making collaborative robotics (Rosenthal et al., 2010) largely a domain of custom engineering rather than science.

WS5: The Social World
Natural language arose to enable interpersonal communication (Dunbar, 1993). Since then, it has been used in everything from laundry instructions to personal diaries, which has given us data and situated scenarios that are the raw material for the previous levels of contextualization.

"In order to talk about concepts, we must understand the importance of mental models... we set up a model of the world which serves as a framework in which to organize our thoughts. We abstract the presence of particular objects, having properties, and entering into events and relationships."
Terry Winograd (1971)

Interpersonal communication in service of real-world cooperation is the prototypical use of language, and the ability to facilitate such cooperation remains the final test of a learned agent.
The acknowledgement that interpersonal communication is a necessary property of a linguistic intelligence is older than the terms "Computational Linguistics" or "Artificial Intelligence." It was launched into computing when Alan Turing suggested the "Imitation Game" (Turing, 1950), and the first illustrative example of the test already brings to bear an issue that has haunted generative NLP ever since.
Smoke-and-mirrors cleverness, in which situations are framed to the advantage of the model, can create the appearance of genuine content where there is none. This phenomenon has been noted countless times, from Pierce (1969) criticizing speech recognition as "deceit and glamour" to Marcus and Davis (2019), who complain of humanity's "gullibility gap".
Interpersonal dialogue is canonically framed as a grand test for AI (Norvig and Russell, 2002), but there are few if any examples of artificial agents one could imagine socializing with in anything more than transactional (Bordes et al., 2016) or extremely limited game scenarios (Lewis et al., 2017), at least not agents that are not purposefully hollow and fixed, onto which people can project whatever they wish (e.g., ELIZA; Weizenbaum, 1966). More important to our discussion is why socialization is required as the next rung on the context ladder in order to fully ground meaning.
Language that Does Something
Work in the philosophy of language has long suggested that function is the source of meaning, as famously illustrated through Wittgenstein's "language games" (Wittgenstein, 1953, 1958) and J.L. Austin's "performative speech acts" (Austin, 1975). That function is the source of meaning is echoed in linguistics' usage-based theory of language acquisition, which suggests that constructions which are useful are the building blocks for everything else (Langacker, 1987, 1991). In recent years, this theory has begun to shed light on what use-cases language serves both in acquisition and in its initial origins in our species (Tomasello, 2009; Barsalou, 2008). WS1, WS2, WS3, and WS4 lend extra depth to the interpretation of language through context, because they expand the factorizations of information available to define meaning. Understanding that what one says can change what people do allows language to take on its most active role. This is the ultimate goal for natural language generation: language that does something to the world.
Despite the current, passive way generation is created and evaluated (Liu et al., 2016), the ability to change the world through other people is the rich learning signal that natural language generation offers. In order to learn the effects language has on the world, an agent must participate in linguistic activity, such as negotiation (Lewis et al., 2017). As an example, what "hate" means in terms of discriminative information is always at question: it can be defined as "strong dislike," but what it tells one about the processes operating in the environment requires social context to determine (Bloom, 2002). It is the toddler's social experimentation with "I hate you" that gives the word weight and definite intent (Ornaghi et al., 2011). In other words, the discriminative signal for the most foundational part of a word's meaning can only be observed through its effect on the world, and active experimentation seems to be key to learning that effect. This is in stark contrast to the disembodied chat bots that are the focus of the current dialogue community (Adiwardana et al., 2020; Zhou et al., 2020; Chen et al., 2018; Serban et al., 2017), which often do not learn from individual experiences and are not given enough of a persistent environment to learn about the effects of actions.
Theory of Mind
By attempting to get what we want, we confront people with their own desires and identities. Premack and Woodruff (1978) began to formalize how much the ability to consider the feelings and knowledge of others is specific to humans and how it actually functions, now commonly referred to as "Theory of Mind." In the language community, this paradigm has been described under the "Speaker-Listener" model (Stephens et al., 2010), and a rich theory to describe it computationally is being actively developed under the Rational Speech Act model (Frank and Goodman, 2012; Bergen et al., 2016).
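To make the Rational Speech Act framing concrete, here is a minimal numerical sketch of a one-step RSA reference game; the objects, utterances, and rationality parameter are invented for illustration and follow the general recursion of Frank and Goodman (2012) rather than any specific published implementation.

# Minimal Rational Speech Act (RSA) sketch for a toy reference game.
# Objects, utterances, and the rationality parameter alpha are illustrative.
import numpy as np

objects = ["blue square", "blue circle", "green square"]
utterances = ["blue", "green", "square", "circle"]

# Literal semantics: truth of each utterance for each object.
truth = np.array([
    #  blue sq  blue circ  green sq
    [1, 1, 0],   # "blue"
    [0, 0, 1],   # "green"
    [1, 0, 1],   # "square"
    [0, 1, 0],   # "circle"
], dtype=float)

prior = np.ones(len(objects)) / len(objects)
alpha = 1.0   # speaker rationality

# Literal listener: L0(o | u) proportional to truth(u, o) * prior(o).
L0 = truth * prior
L0 = L0 / L0.sum(axis=1, keepdims=True)

# Pragmatic speaker: S1(u | o) proportional to exp(alpha * log L0(o | u)).
with np.errstate(divide="ignore"):
    S1 = np.exp(alpha * np.log(L0))
S1 = S1 / S1.sum(axis=0, keepdims=True)

# Pragmatic listener: L1(o | u) proportional to S1(u | o) * prior(o).
L1 = S1 * prior
L1 = L1 / L1.sum(axis=1, keepdims=True)

# Hearing "blue", the pragmatic listener favors the blue square, because a
# speaker who meant the blue circle could have used the unambiguous "circle".
print(dict(zip(objects, L1[utterances.index("blue")])))

The recursion models a listener reasoning about a speaker who in turn reasons about a literal listener, which is the sense in which the Speaker-Listener paradigm treats meaning as a product of social inference.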
Recently, a series of challenges that attempt to address this fundamental aspect of communication (Nematzadeh et al., 2018; Sap et al., 2019) have been introduced. Using traditional-style datasets can be problematic due to the risk of embedding spurious patterns. Despite increased scores on datasets due to larger models and more complex pretraining curricula, it seems unlikely that models understand their listener any more than they understand the physical world in which they exist. This disconnect is driven home by studies of bias (Gururangan et al., 2018; Glockner et al., 2018) and techniques like Adversarial Filtering (Zellers et al., 2018, 2019b), which elucidate the biases models exploit in lieu of understanding. Our training data, complex and large as it is, does not offer the discriminatory signals that make the hypothesizing of consistent identity or mental states an efficient path towards lowering perplexity or raising accuracy. Firstly, there is a lack of inductive bias (Martin et al., 2018): it is not clear that any learned model, not just a neural network, would from reading alone be able to posit that people exist, that they arrange narratives along an abstract single dimension called time, and that they describe things in terms of causality. Models learn what they need to discriminate, and so inductive bias is not just about efficiency; it is about making sure what is learned will generalize (Mitchell, 1980). Secondly, current cross-entropy training losses actively discourage learning the tail of the distribution properly, as events that are statistically infrequent are drowned out (Pennington et al., 2014). Researchers have just begun to scratch the surface by considering more dynamic evaluations (Zellers et al., 2020; Dinan et al., 2019), but true persistence of identity and adaptation to change are still a long way off.
Language in a Social Context
Whenever language is used between two people, it exists in a concrete social context: status, role, intention, and countless other variables intersect at a specific point (Wardhaugh, 2011). Currently, these complexities are overlooked by selecting labels that crowd workers agree on. That is, our notion of ground truth in dataset construction is based on crowd consensus bereft of social context (Zellers et al., 2020).
In a real-world interaction, even one as anonymous as interacting with a cashier at a store, notions of social cognition (Fiske and Taylor, 1991;Carpenter et al., 1998) are present in the style of utterances and in the social script of the exchange. The true evaluation of generative models will require the construction of situations where artificial agents are considered to have enough identity to be granted social standing for these interactions.
Social interaction is a precious signal, and some NLP researchers have begun to scratch its surface, especially for persuasion-related tasks where interpersonal relationships are key (Wang et al., 2019c; Tan et al., 2016). These initial studies have been strained by the training-validation-test set scenario, as collecting static datasets of this kind is difficult for most use-cases and often impossible, since natural situations cannot be properly facilitated. Learning by participation is a necessary step in the ultimately social venture of communication. By exhibiting different attributes and sending varying signals, the sociolinguistic construction of identity (Ochs, 1993) could be examined more deeply than ever before, yielding an understanding of social intelligence that simply isn't possible with a fixed corpus. This social layer is the foundation upon which language is situated (Tomasello, 2009). Understanding this social layer is not only necessary, but will also clarify complexities around implicature and commonsense that so obviously arise in corpora, yet with such sparsity that they cannot properly be learned (Gordon et al., 2018). Once models are commonly expected to be interacted with in order to be tested, probing their decision boundaries for simplifications of reality, as in Kaushik et al. (2020), will become trivial and natural.

Self-Evaluation
You can't learn language from the radio.
Nearly every NLP course will at some point make this claim. While the futility of learning language from an ungrounded linguistic signal is intuitive, those NLP courses go on to attempt precisely that learning. In fact, our entire field follows this pattern: trying to learn language from the internet, standing in as the modern radio to deliver limitless language. The need for language to attach to "extralinguistic events" (Ervin-Tripp, 1973) and the requirement for social context (Baldwin et al., 1996) should guide our research.
We use our notion of World Scopes to make the following concrete claims: You can't learn language ...

... from the radio (internet). WS2 ⊂ WS3
A learner cannot be said to be in WS3 if it can perform its task without sensory perception such as visual, auditory, or tactile information.
... from a television. WS3 ⊂ WS4
A learner cannot be said to be in WS4 if the space of actions and consequences of its environment can be enumerated.

... by yourself. WS4 ⊂ WS5
A learner cannot be said to be in WS5 if its cooperators can be replaced with cleverly pre-programmed agents to achieve the same goals.
By these definitions, most of NLP research still resides in WS2, and we believe this is at odds with the goals and origins of our science. This does not invalidate the utility or need for any of the research within NLP, but it is to say that much of that existing research targets a different goal than language learning.
Where Should We Start?
Concretely, many in our community are already pursuing the move from WS2 to WS3 by rethinking our existing tasks and investigating whether their semantics can be expanded and grounded. Importantly, this is not a new trend (Chen and Mooney, 2008; Feng and Lapata, 2010; Bruni et al., 2014), and task and model design can fail to require sensory perception (Thomason et al., 2019a).
However, research and progress in multimodal, grounded language understanding has accelerated dramatically in the last few years. Elliott et al. (2016) take the classic problem of machine translation and reframe it in the context of visual observations (Elliott and Kádár, 2017). Later work extends this paradigm by exploring machine translation with videos serving as an interlingua, and Regneri et al. (2013) introduced a foundational dataset which aligns text descriptions and semantic annotations of actions with videos. Even core tasks like syntax can be informed by visual information: Shi et al. (2019) investigate the role of visual information in syntax acquisition, which may prove necessary to learn headedness (e.g., choosing the main verb rather than the more common auxiliary as the root of a sentence) (Bisk and Hockenmaier, 2015). Relatedly, Ororbia et al. (2019) investigate the benefits of visual context for language modeling.
Collaborative games have long served as a testbed for studying language (Werner and Dyer, 1991). Recently, Suhr et al. (2019a) introduced a testbed for evaluating language understanding in the service of a shared goal, and Andreas and Klein (2016) use a simpler visual paradigm for studying pragmatics. A communicative goal can also be used to study the emergence of language. Lazaridou et al. (2018) evolves agents with language that aligns to perceptual features of the environment. These paradigms can help us examine how inductive biases and environmental pressures can build towards socialization (WS5).
Most of this research provides resources such as data, code, simulators, and methodology for evaluating the multimodal content of linguistic representations (Silberer and Lapata, 2014; Bruni et al., 2012). Moving forward, we would like to encourage a broader re-examination of how NLP researchers frame the relationship between meaning and context. We believe that the time is ripe to begin a deeper dive into grounded learning, embodied learning, and ultimately social participation. With recent deep learning representations, researchers can more easily integrate additional signals into the learned meanings of language than word tokens alone. We particularly advocate for the homegrown creation of such merged representations, rather than waiting for the representations and data to come to NLP from other fields.

"These problems include the need to bring meaning and reasoning into systems that perform natural language processing, the need to infer and represent causality, the need to develop computationally-tractable representations of uncertainty and the need to develop systems that formulate and pursue long-term goals."
Michael Jordan (Artificial Intelligence: The Revolution Hasn't Happened Yet, 2019)

Conclusions
Our proposed World Scopes are steep steps, and it is possible that WS5 is AI-complete. WS5 implies a persistent agent experiencing time and a personalized set of experiences. With few exceptions (Carlson et al., 2010), machine learning models have been confined to IID datasets that lack the temporal structure from which humans draw correlations about long-range causal dependencies. What happens if a machine is allowed to participate consistently? This is difficult to imagine testing under current evaluation paradigms for generalization. Yet this is the structure of generalization in human development: drawing analogies to episodic memories and gathering understanding through non-independent experiments.
While it is fascinating to broadly consider such far-reaching futures, our goal in this call to action is more pedestrian. As with many who have analyzed the history of Natural Language Processing, its trends (Church, 2007), its maturation toward a science (Steedman, 2008), and its major challenges (Hirschberg and Manning, 2015; McClelland et al., 2019), we hope to provide momentum for a direction many are already heading. We call for and embrace the incremental, but purposeful, contextualization of language in human experience. With all that we have learned about what words can tell us and what they keep implicit, now is the time to ask: what tasks, representations, and inductive biases will fill the gaps?
Computer vision and speech recognition are mature enough for novel investigation of broader linguistic contexts (Section 3). The robotics industry is rapidly developing commodity hardware and sophisticated software that both facilitate new research and expect to incorporate language technologies (Section 4). Our call to action is to encourage the community to lean in to trends prioritizing grounding and agency (Section 5), and explicitly aim to broaden the corresponding World Scopes available to our models.

A Invitation to Reply
What we have introduced here is not comprehensive and you may not agree with our arguments. Perhaps they do not go far enough, they miss a relevant literature, or you feel they do not represent the goals of NLP/CL. To this end, we welcome both suggestions for an updated version of this manuscript, as well as response papers on alternate foci and directions for the field.