Like a Baby: Visually Situated Neural Language Acquisition

We examine the benefits of visual context in training neural language models to perform next-word prediction. A multi-modal neural architecture is introduced that outperforms its equivalent trained on language alone with a 2% decrease in perplexity, even when no visual context is available at test time. Fine-tuning the embeddings of a pre-trained state-of-the-art bidirectional language model (BERT) in the language modeling framework yields a 3.5% improvement. The advantage of training with visual context when testing without it is robust across different languages (English, German and Spanish) and different models (GRU, LSTM, ∆-RNN, as well as those that use BERT embeddings). Thus, language models perform better when they learn like a baby, i.e., in a multi-modal environment. This finding is compatible with the theory of situated cognition: language is inseparable from its physical context.


Introduction
The theory of situated cognition postulates that a person's knowledge is inseparable from the physical or social context in which it is learned and used (Greeno and Moore, 1993). Similarly, Perceptual Symbol Systems theory holds that all of cognition, thought, language, reasoning, and memory, is grounded in perceptual features (Barsalou, 1999). Knowledge of language cannot be separated from its physical context, which allows words and sentences to be learned by grounding them in reference to objects or natural concepts on hand (see Roy and Reiter, 2005, for a review). Nor can knowledge of language be separated from its social context, where language is learned interactively through communicating with others to facilitate problem-solving. Simply put, language does not occur in a vacuum.
Yet, statistical language models, typically connectionist systems, are often trained in such a vacuum. Sequences of symbols, such as sentences or phrases composed of words in a language such as English or German, are often fed into the model independently of any real-world context they might describe. In the classical language modeling framework, a model learns to predict a word based on the history of words it has seen so far. While these models learn a great deal of linguistic structure from these symbol sequences alone, acquiring the essence of basic syntax, it is highly unlikely that this approach can create models that acquire much in terms of semantics or pragmatics, which are integral to the human experience of language. How might one build neural language models that "understand" the semantic content held within the symbol sequences, of any language, presented to them?
In this paper, we take a small step towards a model that understands language as a human does by training a neural model jointly on corresponding linguistic and visual data. From an image-captioning dataset, we create a multi-lingual corpus where sentences are mapped to the real-world images they describe. We ask how adding such real-world context at training can improve language model performance. We create a unified multi-modal connectionist architecture that incorporates visual context and uses either ∆-RNN (Ororbia II et al., 2017), Long Short Term Memory (LSTM; Hochreiter and Schmidhuber, 1997) or Gated Recurrent Unit (GRU; Cho et al., 2014) units. We find that the models acquire more knowledge of language than if they were trained without corresponding, real-world visual context.

Related Work
Both behavioral and neuroimaging studies have found considerable evidence for the contribution of perceptual information to linguistic tasks (Barsalou, 2008). It has long been held that language is acquired jointly with perception through interaction with the environment (e.g. Frank et al., 2008). Eye-tracking studies show that visual context influences word recognition and syntactic parsing from even the earliest moments of comprehension (Tanenhaus et al., 1995).
Computational cognitive models can account for bootstrapped learning of word meaning and syntax when language is paired with perceptual experience (Abend et al., 2017) and for the ability of children to rapidly acquire new words by inferring the referent from their physical environment (Alishahi et al., 2008). Some distributional semantics models integrate word co-occurrence data with perceptual data, either to achieve a better model of language as it exists in the minds of humans (Baroni, 2016; Johns and Jones, 2012; Kievit-Kylar and Jones, 2011; Lazaridou et al., 2014) or to improve performance on machine learning tasks such as object recognition (Frome et al., 2013; Lazaridou et al., 2015a), image captioning (Kiros et al., 2014; Lazaridou et al., 2015b), or image search (Socher et al., 2014).
Integrating language and perception can facilitate language acquisition by allowing models to infer how a new word is used from the perceptual features of its referent (Johns and Jones, 2012) or to allow for fast mapping between a new word and a new object in the environment (Lazaridou et al., 2014). Likewise, this integration allows models to infer the perceptual features of an unobserved referent from how a word is used in language (Johns and Jones, 2012; Lazaridou et al., 2015b). As a result, language data can be used to improve object recognition by providing information about unobserved or infrequently observed objects (Frome et al., 2013) or for differentiating objects that often co-occur in photos (e.g., cats and sofas; Lazaridou et al., 2015a).
By representing the referents of concrete nouns as arrangements of elementary visual features (Biederman, 1987), Kievit-Kylar and Jones (2011) found that the visual features of nouns capture semantic typicality effects, and that a combined representation, consisting of both visual features and word co-occurrence data, more strongly correlates with human judgments of semantic similarity than representations extracted from a corpus alone. While modeling similarity judgments is distinct from the problem of predictive language modeling, we take this finding as evidence that visual perception informs semantics, which suggests there are gains to be had integrating perception with predictive language models.
In contrast to prior work in machine learning, where mappings between vision and language have been examined (Kiros et al., 2014; Vinyals et al., 2015; Xu et al., 2015), our goal in integrating visual and linguistic data is not to accomplish a task such as image search or captioning that inherently requires a mapping between these modalities. Rather, our goal is to show that, since perceptual information is intrinsic to how humans process language, a language model that is trained on both visual and linguistic data will be a better model, consistently across languages, than one trained on linguistic data alone.
Due to the ability of language models to constrain predictions on the basis of preceding context, language models play a central role in natural-language and speech processing applications. However, the psycholinguistic questions surrounding how people acquire and use linguistic knowledge are fundamentally different from the aims of machine learning. Using NLP language models to address psycholinguistic questions is a new approach that integrates well with the theory of predictive coding in cognitive psychology (Clark, 2013; Rao and Ballard, 1999). For language processing, this means that when reading text or comprehending speech, humans constantly anticipate what will be said next. Predictive coding in humans is a fast, implicit cognitive process similar to the kind of sequence learning that recurrent neural models excel at. We do not propose recurrent neural models as direct accounts of human language processing. Instead, our intent is to use a general-purpose machine learning algorithm as a tool to investigate the informational characteristics of the language learning task. More specifically, we use machine learning to explore the question of whether natural languages are most easily learned when situated in an environmental context and grounded in perception.

The Multi-modal Neural Architecture
We will evaluate the multi-modal training approach on several well-known complex architectures, including the LSTM, and further examine the effect of using pre-trained BERT embeddings. However, to describe the neural model simply, we start from the Differential State Framework (DSF; Ororbia II et al., 2017), which unifies gated recurrent architectures under the general view that state memory is a simple parametrized mixture of "fast" and "slow" states. Our aim is to model sequences of symbols, such as the words that compose sentences, where at each time step we process $x_t$, the one-hot encoding of a token.
One of the simplest models that can be derived from the DSF is the ∆-RNN (Ororbia II et al., 2017). A ∆-RNN is a simple gated RNN that captures longer-term dependencies in sequences through the use of a parametrized, flexible state "mixing" function. The model computes a new state at a given time step by comparing a fast state (which is proposed after accounting for the current token) and a slow state (a form of long-term memory). The model is defined by parameters $\Theta = \{W, V, b_r, \beta_1, \beta_2, \alpha\}$ (input-to-hidden weights $W$, recurrent weights $V$, gating-control coefficients $\beta_1, \beta_2, \alpha$, and the rate-gate bias $b_r$). Inference follows the ∆-RNN state-update equations, in which $e_{w,t}$ is the 1-of-k encoding of the word $w$ at time $t$. Note that $\{\alpha, \beta_1, \beta_2\}$ are learnable bias vectors that modulate internal multiplicative interactions. The rate gate $r$ controls how slow and fast-moving memory states are mixed inside the model. In contrast to the model originally trained in Ororbia II et al. (2017), the outer activation is the linear rectifier, $\Phi(v) = \max(0, v)$, instead of the identity or hyperbolic tangent, because we found that it worked much better. The inner activation [...]
To integrate visual context information into the ∆-RNN, we fuse the model with a neural vision system, motivated by work done in automated image captioning (Xu et al., 2015). We adopt a transfer learning approach and incorporate a state-of-the-art convolutional neural network into the ∆-RNN model, namely the Inception-v3 network (Szegedy et al., 2016), in order to create a multi-modal ∆-RNN model (MM-∆-RNN; see Figure 1). Since our focus is on language modeling, the parameters of the vision network are fixed.
To obtain a distributed representation of an image from the Inception-v3 network, we extract the vector $c$ produced by the final max-pooling layer after running an image through the model (note that this operation occurs right before the final, fully-connected processing layers, which usually hold task-specific parameters, such as in object classification). The ∆-RNN can make use of the information in this visual context vector if we modify its state computation in one of two ways. The first way is to modify the inner state to be a linear combination of the data-dependent pre-activation, the filtration, and a learned linear mapping of $c$, namely $Mc$, where $M$ is a learnable matrix of synaptic connections that connects the visual context representation with the inner state. The second way is to change the outer mixing function instead. In that variant (Equation 8), the linearly-mapped visual context embedding interacts with the currently computed state through a multiplicative operation, allowing the visual context to persist and work in a longer-term capacity. In either situation, using a parameter matrix $M$ frees us from having to set the dimensionality of the hidden state to be the same as that of the context vector produced by the Inception-v3 network.
We do not use regularization techniques with this model. The application of regularization techniques is, in principle, possible (and typically improves performance of the ∆-RNN), but it is damaging to performance in this particular case, where an already compressed and regularized representation of the images from Inception-v3 serves as input to the multi-modal language modeling network.
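The first (additive) way of injecting the visual context can be illustrated with a small sketch. This is not the authors' implementation: the exact form of the fast-state proposal and of the rate gate is an assumption modeled loosely on the ∆-RNN description above, the inner activation is assumed to be tanh, and the names `matvec`, `mm_delta_rnn_step`, and `params` are hypothetical. Only the rectified outer activation and the added term $Mc$ come directly from the text.

```python
import math

def matvec(A, x):
    """Multiply matrix A (a list of rows) by vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def mm_delta_rnn_step(e_t, h_prev, c, p):
    """One additive-variant state update (assumed form, see lead-in).

    e_t: one-hot token vector; h_prev: previous state; c: visual context;
    p: dict with matrices W, V, M and vectors alpha, beta1, beta2, b_r.
    """
    d_dat = matvec(p["W"], e_t)      # data-dependent pre-activation
    d_rec = matvec(p["V"], h_prev)   # recurrent pre-activation
    m = matvec(p["M"], c)            # learned linear mapping of c
    h_new = []
    for i in range(len(h_prev)):
        # Fast state: second-order and first-order terms plus the context term.
        pre = (p["alpha"][i] * d_rec[i] * d_dat[i]
               + p["beta1"][i] * d_rec[i]
               + p["beta2"][i] * d_dat[i]
               + m[i])
        z = math.tanh(pre)           # inner activation: tanh is an assumption
        # Rate gate mixing fast and slow states (assumed sigmoid form).
        r = 1.0 / (1.0 + math.exp(-(d_dat[i] + p["b_r"][i])))
        # Outer activation is the linear rectifier, as stated in the text.
        h_new.append(max(0.0, (1.0 - r) * z + r * h_prev[i]))
    return h_new

# Toy parameters: 2 hidden units, 3-word vocabulary, 2-dimensional context.
params = {
    "W": [[0.1, -0.2, 0.3], [0.0, 0.2, -0.1]],
    "V": [[0.05, 0.1], [-0.1, 0.05]],
    "M": [[0.2, 0.0], [0.0, 0.2]],
    "alpha": [1.0, 1.0], "beta1": [1.0, 1.0], "beta2": [1.0, 1.0],
    "b_r": [0.0, 0.0],
}
h_with = mm_delta_rnn_step([0, 0, 1], [0.5, 0.5], [1.0, 1.0], params)
h_null = mm_delta_rnn_step([0, 0, 1], [0.5, 0.5], [0.0, 0.0], params)
```

In this additive sketch, an all-zero context vector makes the term $Mc$ vanish, so the update reduces to a text-only step. The multiplicative variant is not shown here.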
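Let $w_1, \dots, w_N$ be a variable-length sequence of $N$ words corresponding to an image $I$. In general, the distribution over the variables follows the graphical model $P(w_1, \dots, w_N \mid I) = \prod_{t=1}^{N} P(w_t \mid w_{<t}, I)$. For all model variants, the state $h_t$ calculated at any time step is fed into a maximum-entropy classifier (bias term omitted for clarity), defined as $P(w_t = k \mid h_t) = \exp(U_k h_t) / \sum_{j} \exp(U_j h_t)$, where $U_k$ is the row of the decoder matrix $U$ for word $k$. The model parameters $\Theta$ are optimized with respect to the sequence negative log likelihood, $\mathcal{L} = -\sum_{t=1}^{N} \log P(w_t \mid h_t)$. We differentiate with respect to this cost function to calculate gradients.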
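The classifier and loss can be sketched in a few lines. This is a minimal illustration that assumes a softmax over the logits $U h_t$ with no bias term; the function names and the toy inputs are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def sequence_nll(U, states, targets):
    """Sum of -log P(w_t | h_t) over time steps.

    U has one row per vocabulary word; states is a list of hidden-state
    vectors h_t; targets holds the index of the correct word at each step.
    """
    total = 0.0
    for h, w in zip(states, targets):
        logits = [sum(u * x for u, x in zip(row, h)) for row in U]
        total -= log_softmax(logits)[w]
    return total

# With an all-zero decoder every word is equally likely, so each step
# contributes log(vocabulary size) to the loss.
U_zero = [[0.0, 0.0] for _ in range(4)]
nll = sequence_nll(U_zero, [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]], [0, 1, 3])
```

Subtracting the maximum logit before exponentiating avoids overflow without changing the resulting probabilities.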

GRU, LSTM and BERT variants
Does visually situated language learning benefit from the specific architecture of the ∆-RNN, or does the proposal work with state-of-the-art language models? We applied the same architecture to Gated Recurrent Units (GRU; Cho et al., 2014), Long Short Term Memory (LSTM; Hochreiter and Schmidhuber, 1997), and BERT (Devlin et al., 2018). We train these models on text alone and compare to the two variations of the multi-modal ∆-RNN, as described in the previous section. In the multi-modal GRU, the context information is integrated directly: the parameter matrix $M$ that maps the visual context $c$ into the GRU state effectively gates the outer function. We also define a multi-modal variant of the LSTM (with peephole connections).
We furthermore created one more variant of each multi-modal RNN by initializing a portion of their input-to-hidden weights with embeddings extracted from the Bidirectional Encoder Representations from Transformers (BERT) model (Devlin et al., 2018). This corresponds to initializing $W$ in the ∆-RNN, $W_i$ in the LSTM, and $W_{\hat{h}}$ in the GRU. Note that in our results, we only report the best-performing model, which turned out to be the LSTM variant.
Since the models in this work are at the word level and BERT operates at the subword level, we create initial word embeddings by first decomposing each word into its appropriate subword components, according to the WordPiece model (Wu et al., 2016), and then extracting the relevant BERT representation for each. For each subword token, a representation is created by summing together a specific learned token embedding, a segment embedding, and a position embedding. For a target word, we linearly combine the subword input representations and initialize the relevant weight with this final embedding.
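The word-level initialization from subword representations can be sketched as follows. The text does not state which linear combination is used, so uniform averaging is an assumption here; the toy vectors and function names are hypothetical, and real vectors would come from the input layer of a pre-trained BERT model.

```python
def input_representation(token_vec, segment_vec, position_vec):
    """Sum of token, segment, and position embeddings for one subword."""
    return [t + s + p for t, s, p in zip(token_vec, segment_vec, position_vec)]

def word_embedding_from_pieces(pieces, piece_vectors):
    """Combine subword vectors into one word vector.

    Uniform averaging is an assumed choice of linear combination.
    """
    vecs = [piece_vectors[p] for p in pieces]
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

# Hypothetical two-dimensional table for the subwords of "surfing".
toy_vectors = {
    "surf": input_representation([1.0, 0.0], [0.0, 0.0], [0.0, 0.0]),
    "##ing": input_representation([0.0, 0.5], [0.0, 0.0], [0.0, 0.5]),
}
emb = word_embedding_from_pieces(["surf", "##ing"], toy_vectors)
```

The resulting vector would then be copied into the column or row of the input-to-hidden weight matrix that corresponds to the word.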

Experiments
The experiments in this paper were conducted using the MS-COCO image-captioning dataset (https://competitions.codalab.org/competitions/3221). Each image in the dataset has five captions provided by human annotators. We use the captions to create five different ground truth splits. We translated each ground truth split into German and Spanish using the Google Translation API, which was chosen as a state-of-the-art, independently evaluated MT tool that produces, according to our inspection of the results, idiomatic translations that are syntactically and semantically faithful. To our knowledge, this represents the first multi-lingual MS-COCO dataset on situated learning. We tokenize the corpus and obtain a 16.6K vocabulary for English, 33.2K for German and 18.2K for Spanish.
As our primary concern is the next-step prediction of words or tokens, we use negative log likelihood and perplexity to evaluate the models. This is different from the goals of machine translation or image captioning, which, in most cases, are concerned with a ranking of possible captions where one measures how similar the model's generated sequences are to ground-truth target phrases.
Baseline results were obtained with neural language models trained on text alone. For the ∆-RNN, this meant implementing a model using only Equations 1-7. The best results were achieved using the BERT Large model (bidirectional Transformer, 24 layers, 1024 dimensions, 16 attention heads; Devlin et al., 2018). We used the large pre-trained model and then trained with visual context.
All models were trained to minimize the sequence loss of the sentences in the training split. The weight matrices of all models were initialized from a uniform distribution, $U(-0.1, 0.1)$, biases were initialized to zero, and the ∆-RNN-specific biases $\{\alpha, \beta_1, \beta_2\}$ were all initialized to one. Parameter updates calculated through backpropagation through time required unrolling the model over 49 steps in time (this length was determined based on validation set likelihood). All symbol sequences were zero-padded and appropriately masked to ensure efficient mini-batching. Gradients were hard-clipped at a magnitude bound of $l = 2.0$. Over mini-batches of 32 samples, model parameters were optimized using simple stochastic gradient descent (learning rate $\lambda = 1.0$, which was halved if the perplexity, measured at the end of each epoch, went up three or more times).
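The optimization settings above can be sketched as follows. This is an illustration under stated assumptions, not the authors' training code: hard clipping is taken to be element-wise, and the halving rule is read as "halve the learning rate once validation perplexity has increased three times, then reset the count". All names are hypothetical.

```python
import random

def init_matrix(rows, cols, rng, bound=0.1):
    """Weights drawn from the uniform distribution U(-bound, bound)."""
    return [[rng.uniform(-bound, bound) for _ in range(cols)] for _ in range(rows)]

def hard_clip(grad, bound=2.0):
    """Clip each gradient component to [-bound, bound] (element-wise: assumed)."""
    return [max(-bound, min(bound, g)) for g in grad]

def sgd_step(params, grad, lr):
    """Plain stochastic gradient descent on a flat parameter list."""
    return [p - lr * g for p, g in zip(params, hard_clip(grad))]

class HalvingSchedule:
    """Halve the learning rate after `patience` perplexity increases (assumed reading)."""

    def __init__(self, lr=1.0, patience=3):
        self.lr = lr
        self.patience = patience
        self.prev = None
        self.increases = 0

    def update(self, val_ppl):
        # Called once per epoch with the validation perplexity.
        if self.prev is not None and val_ppl > self.prev:
            self.increases += 1
            if self.increases >= self.patience:
                self.lr *= 0.5
                self.increases = 0
        self.prev = val_ppl
        return self.lr

W = init_matrix(2, 3, random.Random(0))
schedule = HalvingSchedule()
rates = [schedule.update(ppl) for ppl in [10.0, 11.0, 12.0, 13.0]]
```

Other readings of the halving rule are possible, for example requiring three consecutive increases; the sketch fixes one of them only to make the schedule concrete.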
To determine if our multi-modal language models capture knowledge that is different from a text-only language model, we evaluate each model twice. First, we compute the model perplexity on the test set using the sentences' visual context vectors. Next, we compute model perplexity on test sentences by feeding in a null vector to the multi-modal model as the visual context. If the model did truly pick up some semantic knowledge that is not exclusively dependent on the context vector, its perplexity in the second setting, while naturally worse than in the first setting, should still outperform text-only baselines.
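The two-pass evaluation can be sketched as a small harness. The scoring function, the dataset layout, and the choice of an all-zero vector as the null context are assumptions made for illustration; `toy_score` is a stand-in and not a real language model.

```python
def evaluate_twice(score_fn, dataset, context_dim):
    """Return (NLL with visual context, NLL with a null context vector).

    score_fn(words, context) is assumed to return the sentence NLL;
    dataset is a list of (words, context_vector) pairs.
    """
    null = [0.0] * context_dim   # null vector assumed to be all zeros
    with_ctx = sum(score_fn(words, ctx) for words, ctx in dataset)
    without_ctx = sum(score_fn(words, null) for words, _ in dataset)
    return with_ctx, without_ctx

def toy_score(words, context):
    """Hypothetical scorer: context slightly lowers the per-word NLL."""
    return len(words) * (2.0 - 0.1 * sum(context))

data = [(["a", "dog"], [1.0, 1.0])]
nll_with, nll_without = evaluate_twice(toy_score, data, context_dim=2)
```

Both totals would then be compared against the NLL of a text-only baseline evaluated on the same sentences.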
In Table 1, we report each model's negative log likelihood (NLL) and per-word perplexity (PPL). PPL is calculated as $\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log P(w_i \mid w_{<i})\right)$, where $N$ is the number of words. We observe that in all cases the multi-modal models outperform their respective text-only baselines. More importantly, the multi-modal models, when evaluated without the Inception-v3 representations on holdout samples, still perform better than the text-only baselines. The improvement in language generalization can be attributed to the visual context information provided during training, which enriches the model's representations over word sequences with knowledge of actual objects and actions.
Figure 2 shows the validation perplexity of the ∆-RNN on each language as a function of the first 15 epochs of learning. We observe that throughout learning, the improvement in generalization afforded by the visual context $c$ is persistent. Validation performance was also tracked for the various GRU and LSTM models, where the same trend was observed (see supplementary material).
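As a worked illustration of the metric, per-word perplexity is the exponential of the average per-word NLL. The sketch below assumes the NLL is measured in nats; the helper names are hypothetical.

```python
import math

def perplexity(total_nll, n_words):
    """Per-word perplexity from a summed NLL in nats."""
    return math.exp(total_nll / n_words)

def relative_decrease(baseline_ppl, model_ppl):
    """Percentage decrease in perplexity relative to a baseline."""
    return 100.0 * (baseline_ppl - model_ppl) / baseline_ppl

# A model that assigns probability 1/4 to every word of a 3-word
# sequence has a total NLL of 3 * log(4) and a perplexity of 4.
ppl_uniform = perplexity(3 * math.log(4), 3)
```

Because perplexity is an exponential of the average NLL, small differences in NLL translate into proportionally larger differences in perplexity.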

Model Analysis
We analyze the decoders of text-only and multi-modal models. We examine the parameter matrix $U$, which is directly involved in calculating the predictions of the underlying generative model. $U$ can be thought of as "transposed embeddings", an idea that has also been exploited to introduce further regularization into the neural language model learning process (Press and Wolf, 2016; Inan et al., 2016). If we treat each row of this matrix as the learned embedding for a particular word (we assume column-major orientation in implementation), we can calculate its proximity to other embeddings using cosine similarity. Table 3 shows the top ten words for several randomly selected query terms using the decoder parameter matrix. By observing the different sets of nearest neighbors produced by the ∆-RNN and the multi-modal ∆-RNN (MM-∆-RNN), we can see that the MM-∆-RNN appears to have learned to combine the information from the visual context with the token sequence in its representations. For example, for the query "ocean", we see that while the ∆-RNN does associate some relevant terms, such as "surfing" and "beach", it also associates terms with marginal relevance to "ocean", such as "market" and "plays". Conversely, nearly all of the terms the MM-∆-RNN associates with "ocean" are relevant to the query. The same is true for "kite" and "subway". For "racket", while the text-only baseline mostly associates the query with sports terms, especially sports equipment like "bat", the MM-∆-RNN is able to relate the query to the correct sport, "tennis".
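The nearest-neighbor analysis can be sketched as follows, treating each row of the decoder matrix as a word embedding. The vocabulary and matrix below are toy values chosen for illustration, not learned parameters, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbors(query, vocab, U, k=10):
    """Return the k words whose rows of U are most similar to the query's row."""
    q = U[vocab.index(query)]
    scored = [(cosine(q, U[i]), w) for i, w in enumerate(vocab) if w != query]
    scored.sort(reverse=True)
    return [w for _, w in scored[:k]]

# Toy decoder matrix with one row per vocabulary word.
vocab = ["ocean", "beach", "market"]
U_toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
neighbors = nearest_neighbors("ocean", vocab, U_toy, k=2)
```

Cosine similarity ignores vector length, so words are compared by the direction of their decoder rows rather than by their norm.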

Conditional Sampling
To see how visual context influences the language model, we sample the conditional generative model. Beam search (size 13) allows us to generate full sentences, with candidates ranked based on model probabilities. Example sentences:
a bowl full of food on the table; a green and red bowl on the table; a salad bowl with chicken
a dog on blue bed with blanket.; a dog sleeps near wooden table.; a dog sleeps on a bed.; a dog on some blue blankets.
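Beam search over a next-word distribution can be sketched as below. The toy distribution stands in for the conditional language model and is an assumption for illustration; in the setting described here, the distribution would additionally depend on the visual context vector. The end-of-sentence symbol `</s>` and all names are hypothetical.

```python
import math

def beam_search(next_log_probs, beam_size, max_len, eos="</s>"):
    """Return hypotheses as (log-probability, tokens), best first."""
    beams = [(0.0, [])]          # (cumulative log-prob, tokens so far)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, tokens in beams:
            for word, lp in next_log_probs(tokens).items():
                candidates.append((score + lp, tokens + [word]))
        candidates.sort(key=lambda cand: cand[0], reverse=True)
        beams = []
        for score, tokens in candidates[:beam_size]:
            if tokens[-1] == eos:
                finished.append((score, tokens))
            else:
                beams.append((score, tokens))
        if not beams:
            break
    finished.extend(beams)       # keep unfinished hypotheses at max_len
    return sorted(finished, key=lambda cand: cand[0], reverse=True)

def toy_model(prefix):
    """Hypothetical next-word distribution keyed on the prefix."""
    table = {
        (): {"a": 0.9, "the": 0.1},
        ("a",): {"dog": 0.6, "bowl": 0.4},
        ("the",): {"dog": 0.5, "bowl": 0.5},
    }
    dist = table.get(tuple(prefix), {"</s>": 1.0})
    return {w: math.log(p) for w, p in dist.items()}

results = beam_search(toy_model, beam_size=13, max_len=4)
best_score, best_tokens = results[0]
```

With a beam wider than the number of candidates, as in this toy case, the search is exhaustive; with a real vocabulary the beam prunes all but the most probable partial sentences at each step.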

Discussion and Conclusions
Training with perceptual context improves multimodal neural models compared to training on language alone. Specifically, augmenting a predictive language model with images that illustrate the sentences being learned enhances its next-word or masked-word prediction ability. The performance improvement persists even in situations devoid of visual input, when the model is used as a pure language model.
The near state-of-the-art language model, using BERT, reflects the case of human language acquisition less than do the other models, which were trained "ab initio" in a situated context. BERT is pre-trained on a very large corpus, but it still picked up a performance improvement when fine-tuned on the visual context and language, as compared to the corpus language signal alone. We do not expect this to be a ceiling for visual augmentation: in the world of training LMs, the MS-COCO corpus is, of course, a small dataset.
Neural language models, as used here, are contenders as cognitive and psycholinguistic models of the non-symbolic, implicit aspects of language representation. There is a great deal of evidence that something like a predictive language model exists in the human mind. The surprisal of a word or phrase refers to the degree of mismatch between what a human listener expected to be said next and what is actually said, for example, when a garden path sentence forces the listener to abandon a partial, incremental parse (Ferreira and Henderson, 1991; Hale, 2001). In the garden path sentence "The horse raced past the barn fell", the final word "fell" forces the reader to revise their initial interpretation of "raced" as the active verb (Bever, 1970).
Table 3: The ten words most closely related to the bolded query word, rank ordered, trained without (∆-RNN) and with (+MM) visual input.
More generally, the idea of predictive coding holds that the mind forms expectations before perception occurs (see Clark, 2013, for a review). How these predictions are formed is unclear. Predictive language models trained with a generic neural architecture, without specific linguistic universals, are a reasonable candidate for a model of predictive coding in language. This does not imply neuropsychological realism of the low-level representations or learning algorithms, and we cannot advocate for a specific neural architecture as being most plausible. However, we can show that an architecture that predicts linguistic input well learns better when its input mimics that of a human language learner.
A theory of human language processing might distinguish between symbolic language knowledge and processes that implement compositionality to produce semantics on the one hand, and implicit processes that leverage sequences and associations to produce expectations on the other. With respect to acquiring the latter, the implicit and predictive model, we note that children are exposed to a rich sensory environment, one more detailed than what is provided to our model here. If even static visual input alone improves language acquisition, then what could a sensorily rich environment achieve? When a multi-modal learner is considered, then, perhaps, the language acquisition stimulus that has been famously labeled as rather poor (Chomsky, 1959; Berwick et al., 2013) is quite rich after all.