Associative Multichannel Autoencoder for Multimodal Word Representation

In this paper we address the problem of learning multimodal word representations by integrating textual, visual and auditory inputs. Inspired by the re-constructive and associative nature of human memory, we propose a novel associative multichannel autoencoder (AMA). Our model first learns the associations between textual and perceptual modalities, so as to predict the missing perceptual information of concepts. The textual and predicted perceptual representations are then fused by reconstructing their original and associated embeddings. Using a gating mechanism, our model assigns different weights to each modality depending on the concept. Results on six benchmark concept similarity tests show that the proposed method significantly outperforms strong unimodal baselines and state-of-the-art multimodal models.


Introduction
Representing the meaning of a word is a prerequisite for solving many linguistic and non-linguistic problems, such as retrieving words with the same meaning or finding the most relevant images or sounds for a word. In recent years we have seen a surge of interest in building computational models that represent word meanings from patterns of word co-occurrence in corpora (Turney and Pantel, 2010; Mikolov et al., 2013; Pennington et al., 2014; Clark, 2015; Wang et al., 2018b). However, word meaning is also tied to the physical world. Many behavioral studies suggest that human semantic representation is grounded in the external environment and in sensorimotor experience (Landau et al., 1998; Barsalou, 2008). This has led to the development of multimodal representation models that utilize both textual and perceptual information (e.g., images, sounds).
As evidenced by a range of evaluations (Andrews et al., 2009; Bruni et al., 2014; Silberer et al., 2016), multimodal models can learn better semantic word representations (a.k.a. embeddings) than text-based models. However, most existing models still have a number of drawbacks. First, they ignore the associations between modalities and thus lack the ability to transfer information between modalities; consequently, they cannot handle words without perceptual information. Second, they integrate textual and perceptual representations by simple concatenation, which is insufficient to effectively fuse information from different modalities. Third, they typically treat the representations from different modalities equally, which is inconsistent with many psychological findings that information from different modalities contributes differently to the meaning of words (Paivio, 1990; Anderson et al., 2017).
In this work, we introduce the associative multichannel autoencoder (AMA), a novel multimodal word representation model that addresses all the above issues. Our model is built upon the stacked autoencoder (Bengio et al., 2007) to learn semantic representations by integrating textual and perceptual inputs. Inspired by the re-constructive and associative nature of human memory, we propose two associative memory modules as extensions. One is to learn associations between modalities (e.g., associations between textual and visual features), so as to reconstruct corresponding perceptual information of concepts. The other is to learn associations between related concepts, by reconstructing embeddings of both target words and their associated words. Furthermore, we propose a gating mechanism to learn the importance weights of different modalities to each word.
To summarize, our main contributions in this work are two-fold:
• We present a novel associative multichannel autoencoder for multimodal word representation, which is capable of utilizing associations between different modalities and between related concepts, and of assigning different importance weights to each modality depending on the word. Results on six standard benchmarks demonstrate that our method outperforms strong unimodal baselines and state-of-the-art multimodal models.
• Our model successfully integrates cognitive insights about the re-constructive and associative nature of semantic memory in humans, suggesting that the rich information contained in human cognitive processing can be used to enhance NLP models. Furthermore, our results shed light on fundamental questions about how to learn semantic representations, such as the plausibility of reconstructing perceptual information, associating related concepts and grounding word symbols in the external environment.
Background and Related Work

Cognitive Grounding
A large body of research provides evidence that human semantic memory is inherently re-constructive and associative (Collins and Loftus, 1975; Anderson and Bower, 2014). That is, memories are not exact static copies of reality, but are rather reconstructed from their stimuli and associated concepts each time they are retrieved. For example, when we see a dog, not only the concept itself, but also the corresponding perceptual information and associated words are jointly activated and reconstructed. Moreover, various theories state that different sources of information contribute differently to the semantic representation of a concept (Wang et al., 2010; Ralph et al., 2017). For instance, Dual Coding Theory (Hiscock, 1974) posits that concrete words are represented in the brain in terms of both a perceptual and a linguistic code, whereas abstract words are encoded only in the linguistic modality. In these respects, our method employs a retrieval and representation process analogous to that of humans, in which the retrieval of perceptual information and associated words is triggered and mediated by a linguistic input. The learned cross-modality mapping and the reconstruction of associated words are inspired by the human mental model of associations between different modalities and related concepts. Moreover, word meaning is tied to both the linguistic and the physical environment, and relies to different degrees on each modality's input (Wang et al., 2018a). These properties are also captured by our multimodal representation model.

Multimodal Models
The existing multimodal representation models can be generally classified into two groups: 1) Jointly training models build multimodal representations with raw inputs of textual and perceptual resources. 2) Separate training models independently learn textual and perceptual representations and integrate them afterwards.

Jointly training models
A class of models extends Latent Dirichlet Allocation (Blei et al., 2003) to jointly learn topic distributions from words and perceptual units (Andrews et al., 2009; Silberer and Lapata, 2012; Roller and Schulte im Walde, 2013). More recent work extends the Skip-gram model (Mikolov et al., 2013). For instance, one line of work proposes a corpus fusion method that inserts the perceptual features of concepts into the training corpus, which is then used to train the Skip-gram model. Lazaridou et al. (2015) propose the MMSkip model, which injects visual information into the learning of textual representations by adding a max-margin objective that minimizes the distance between textual and visual vectors. Kiela and Clark (2015) adopt MMSkip to learn multimodal vectors with auditory perceptual inputs.
These methods can implicitly propagate perceptual information to word representations while learning multimodal representations. However, they utilize raw text corpora in which words with perceptual information account for only a small portion. This weakens the effect of introducing perceptual information and consequently yields only slight improvements over purely textual vectors.

Separate training models
The simplest approach fuses textual and visual vectors by concatenating them, which has proven effective for learning multimodal representations (Bruni et al., 2014; Collell et al., 2017). Variations of this method apply transformation and dimensionality reduction to the concatenated vectors, such as singular value decomposition (SVD) (Bruni et al., 2014) or canonical correlation analysis (CCA). There is also work using deep learning methods to project different modality inputs into a common space, including restricted Boltzmann machines (Ngiam et al., 2011; Srivastava and Salakhutdinov, 2012), autoencoders (Silberer and Lapata, 2014; Silberer et al., 2016), and recursive neural networks (Socher et al., 2013). However, these methods can only generate multimodal vectors for words that have perceptual information, which drastically reduces the multimodal vocabulary. Empirically superior models address this problem by first predicting the missing perceptual information. This includes work that uses ridge regression to learn a mapping matrix from the textual to the visual modality, and Collell et al. (2017), who employ a feed-forward neural network to learn the mapping between textual and visual vectors. Applying the mapping function to the textual representations, they obtain predicted visual vectors for all words in the textual vocabulary, and then compute multimodal representations by concatenating the textual and predicted visual vectors. However, these methods learn separate mapping functions and fusion models, which is somewhat inelegant. In this paper we employ a neural-network mapping function to integrate the two processes into a unified multimodal model.
According to this classification, our method falls into the second group. However, existing models ignore the associative relations among modalities, the associative relations among related words, or the different contributions of each modality. This paper aims to integrate richer perceptual information and a human-like associative memory into a unified multimodal model to learn better word representations.

Associative Multichannel Autoencoder
We first provide a brief description of the basic multichannel autoencoder for learning multimodal word representations ( Figure 1). Then we extend the model with two associative memory modules and a gating mechanism (Figure 2) in the next sections.

Basic Multichannel Autoencoder
An autoencoder is an unsupervised neural network which is trained to reconstruct a given input from its latent representation (Bengio, 2009). In this work, we propose a variant of the autoencoder called the multichannel autoencoder, which maps multimodal inputs into a common space. Our model extends the unimodal and bimodal autoencoders (Ngiam et al., 2011; Silberer and Lapata, 2014) to induce semantic representations integrating textual, visual and auditory information. As shown in Figure 1, our model first transforms the input textual vector $x_t$, visual vector $x_v$ and auditory vector $x_a$ into hidden representations:

$$h_t = g(W_t x_t + b_t), \quad h_v = g(W_v x_v + b_v), \quad h_a = g(W_a x_a + b_a) \quad (1)$$

Then the hidden representations are concatenated and mapped to a common space:

$$h_m = g(W_m [h_t; h_v; h_a] + b_m) \quad (2)$$

The model is trained to reconstruct the hidden representations of the three modalities from the multimodal representation $h_m$:

$$[\hat{h}_t; \hat{h}_v; \hat{h}_a] = g(\hat{W}_m h_m + \hat{b}_m) \quad (3)$$

and finally to reconstruct the original embeddings of the textual, visual and auditory inputs:

$$\hat{x}_t = g(\hat{W}_t \hat{h}_t + \hat{b}_t), \quad \hat{x}_v = g(\hat{W}_v \hat{h}_v + \hat{b}_v), \quad \hat{x}_a = g(\hat{W}_a \hat{h}_a + \hat{b}_a) \quad (4)$$

where $\{W_t, W_v, W_a, W_m, \hat{W}_m, \hat{W}_t, \hat{W}_v, \hat{W}_a\}$ are weight matrices and $\{b_t, b_v, b_a, b_m, \hat{b}_m, \hat{b}_t, \hat{b}_v, \hat{b}_a\}$ are bias vectors. Here $[\cdot\,;\,\cdot]$ denotes vector concatenation, and $g$ denotes the non-linear activation function, for which we use $\tanh(\cdot)$.
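To make the encoding and decoding steps concrete, below is a minimal PyTorch sketch of a single-layer multichannel autoencoder following equations (1)-(4). The layer sizes, the shared hidden dimension across channels and the single decoder matrix used for equation (3) are illustrative assumptions; the stacked version used in the paper adds further hidden layers.

```python
import torch
import torch.nn as nn

class MultichannelAE(nn.Module):
    """Single-layer multichannel autoencoder (sketch of equations 1-4)."""
    def __init__(self, d_t=300, d_v=128, d_a=128, d_h=150, d_m=300):
        super().__init__()
        # Eq. (1): per-modality encoders
        self.enc_t = nn.Linear(d_t, d_h)
        self.enc_v = nn.Linear(d_v, d_h)
        self.enc_a = nn.Linear(d_a, d_h)
        # Eq. (2): fuse the concatenated hidden states into a multimodal code
        self.enc_m = nn.Linear(3 * d_h, d_m)
        # Eq. (3): reconstruct the three hidden states from the multimodal code
        self.dec_m = nn.Linear(d_m, 3 * d_h)
        # Eq. (4): per-modality decoders back to the original embeddings
        self.dec_t = nn.Linear(d_h, d_t)
        self.dec_v = nn.Linear(d_h, d_v)
        self.dec_a = nn.Linear(d_h, d_a)

    def forward(self, x_t, x_v, x_a):
        g = torch.tanh
        h_t, h_v, h_a = g(self.enc_t(x_t)), g(self.enc_v(x_v)), g(self.enc_a(x_a))
        h_m = g(self.enc_m(torch.cat([h_t, h_v, h_a], dim=-1)))          # multimodal representation
        h_t_hat, h_v_hat, h_a_hat = g(self.dec_m(h_m)).chunk(3, dim=-1)  # eq. (3)
        x_t_hat = g(self.dec_t(h_t_hat))                                  # eq. (4)
        x_v_hat = g(self.dec_v(h_v_hat))
        x_a_hat = g(self.dec_a(h_a_hat))
        return h_m, (x_t_hat, x_v_hat, x_a_hat)
```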
Training a single-layer autoencoder corresponds to optimizing the learning parameters to minimize the overall loss between the inputs and their reconstructions. Following Vincent et al. (2010), we use the squared loss:

$$L_1(\theta_1) = \sum_i \left( \|x_t^{(i)} - \hat{x}_t^{(i)}\|^2 + \|x_v^{(i)} - \hat{x}_v^{(i)}\|^2 + \|x_a^{(i)} - \hat{x}_a^{(i)}\|^2 \right) \quad (5)$$

where $i$ denotes the $i$-th word, and the model parameters $\theta_1$ are the weight matrices and bias vectors listed above. Autoencoders can be stacked to create deep networks. To enhance the quality of the semantic representations, we employ a stacked multichannel autoencoder, which is composed of multiple hidden layers stacked together.
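A corresponding sketch of the reconstruction objective in equation (5), assuming the MultichannelAE module from the previous sketch (the squared errors are summed over a batch of words):

```python
def reconstruction_loss(model, x_t, x_v, x_a):
    """Squared error between the inputs and their reconstructions (eq. 5)."""
    _, (x_t_hat, x_v_hat, x_a_hat) = model(x_t, x_v, x_a)
    return (((x_t - x_t_hat) ** 2).sum()
            + ((x_v - x_v_hat) ** 2).sum()
            + ((x_a - x_a_hat) ** 2).sum())
```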

Integrating Modality Associations
In reality, the words that have corresponding images or sounds are only a small subset of the textual vocabulary. To obtain perceptual vectors for every word, we need associations between modalities (i.e., text-to-vision and text-to-audition mapping functions) that transform textual vectors into visual and auditory ones. Previous methods learn separate mapping functions and fusion models, which is somewhat inelegant. Here we employ a neural-network mapping function to incorporate this modality association module into the multimodal model.
Take text-to-vision mapping as an example. Suppose that $T \in \mathbb{R}^{m_t \times n_t}$ is the textual representation containing $m_t$ words and $V \in \mathbb{R}^{m_v \times n_v}$ is the visual representation containing $m_v$ ($m_v \ll m_t$) words, where $n_t$ and $n_v$ are the dimensions of the textual and visual representations respectively. The textual and visual representations of the $i$-th concept are denoted as $T_i$ and $V_i$ respectively. Our goal is to learn a mapping function $f: T_i \mapsto g(W_p T_i + b_p)$ from the textual to the visual space such that the prediction $f(T_i)$ is similar to the actual visual vector $V_i$. The words that have both textual and visual vectors are used to learn the mapping function. To train the model, we employ a squared loss:

$$L_2(\theta_2) = \sum_i \| f(T_i) - V_i \|^2 \quad (6)$$

where the training parameters are $\theta_2 = \{W_p, b_p\}$. We adopt the same method to learn the text-to-audition mapping function.
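A minimal PyTorch sketch of the text-to-vision mapping and the squared loss of equation (6); the text-to-audition mapping is analogous. The dimensions (300-d textual, 128-d visual) follow the dataset description later in the paper; everything else is an illustrative assumption.

```python
import torch
import torch.nn as nn

class TextToVision(nn.Module):
    """One-layer mapping f(T_i) = g(W_p T_i + b_p) from textual to visual space."""
    def __init__(self, n_t=300, n_v=128):
        super().__init__()
        self.proj = nn.Linear(n_t, n_v)

    def forward(self, t):
        return torch.tanh(self.proj(t))

def mapping_loss(mapper, T, V):
    """Squared loss of eq. (6), computed over the words that have visual vectors."""
    return ((mapper(T) - V) ** 2).sum()
```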

Integrating Word Associations
Word associations are a proxy for an aspect of human semantic memory that is not sufficiently captured by the usual training objectives of multimodal models. We therefore assume that incorporating a word association objective helps to learn better semantic representations. To achieve this, we propose to reconstruct the vector of the associated word from the corresponding multimodal semantic representation. Specifically, in the decoding process we change equation (3) to:

$$[\hat{h}_t; \hat{h}_v; \hat{h}_a; \hat{h}_{asc}] = g(\hat{W}_m h_m + \hat{b}_m) \quad (7)$$

and equation (4) to:

$$\hat{x}_t = g(\hat{W}_t \hat{h}_t + \hat{b}_t), \quad \hat{x}_v = g(\hat{W}_v \hat{h}_v + \hat{b}_v), \quad \hat{x}_a = g(\hat{W}_a \hat{h}_a + \hat{b}_a), \quad \hat{x}_{asc} = g(\hat{W}_{asc} \hat{h}_{asc} + \hat{b}_{asc}) \quad (8)$$
To train the model, we add an additional objective function, the mean squared error between the embedding of the associated word $y$ and its reconstruction $\hat{x}_{asc}$:

$$L_3(\theta_3) = \sum_i \| y^{(i)} - \hat{x}_{asc}^{(i)} \|^2 \quad (9)$$

where $x^{(i)}$ and $y^{(i)}$ are the embeddings of a pair of associated words. Here, $y$ is the concatenation of the three unimodal vectors $[y_t; y_v; y_a]$. The parameters of the word association module are $\theta_3 = \{\hat{W}_{asc}, \hat{b}_{asc}\}$. This additional criterion drives the learning towards a semantic representation capable of reconstructing its associated representation.
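A short sketch of the association objective of equation (9), assuming the decoder has been extended with the fourth channel of equations (7)-(8) so that it also outputs the reconstruction of the associated word:

```python
import torch

def association_loss(x_asc_hat, y_t, y_v, y_a):
    """Squared error between the associated word's embedding y = [y_t; y_v; y_a]
    and its reconstruction (eq. 9)."""
    y = torch.cat([y_t, y_v, y_a], dim=-1)
    return ((y - x_asc_hat) ** 2).sum()
```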

Integrating a Gating Mechanism
Considering that the meaning of each word depends differently on textual and perceptual information, we propose a sample-specific gate that assigns different weights to each modality depending on the word. The gate values are computed by the following feed-forward networks:

$$g_t = \sigma(W_{gt} x_t + b_{gt}), \quad g_v = \sigma(W_{gv} x_v + b_{gv}), \quad g_a = \sigma(W_{ga} x_a + b_{ga})$$

where $\sigma(\cdot)$ is the gating non-linearity, and $g_t$, $g_v$ and $g_a$ are the value or vector gates of the textual, visual and auditory representations respectively. For the value gate, $W_{gt}$, $W_{gv}$ and $W_{ga}$ are vectors and $b_{gt}$, $b_{gv}$ and $b_{ga}$ are scalars; for the vector gate, $W_{gt}$, $W_{gv}$ and $W_{ga}$ are matrices and $b_{gt}$, $b_{gv}$ and $b_{ga}$ are vectors. The value gate controls the importance weight of each input representation as a whole, whereas the vector gate can adjust the importance weight of each dimension of an input representation. Finally, we compute the element-wise multiplication of the textual, visual and auditory representations with their corresponding gates:

$$x_{gt} = g_t \odot x_t, \quad x_{gv} = g_v \odot x_v, \quad x_{ga} = g_a \odot x_a$$

The $x_{gt}$, $x_{gv}$ and $x_{ga}$ can be seen as the weighted textual, visual and auditory representations. The parameters of the gating mechanism are trained together with those of the proposed model.
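The following is a sketch of the two gate variants; the sigmoid activation and the choice of feeding each modality's own input vector to its gate are assumptions rather than details stated above.

```python
import torch
import torch.nn as nn

class ModalityGate(nn.Module):
    """Sample-specific gate for one modality: value gate (one scalar per word)
    or vector gate (one weight per dimension)."""
    def __init__(self, d_in, vector_gate=False):
        super().__init__()
        self.proj = nn.Linear(d_in, d_in if vector_gate else 1)

    def forward(self, x):
        g = torch.sigmoid(self.proj(x))  # assumed gate activation
        return g * x                     # weighted representation, e.g. x_gt = g_t * x_t
```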

Model Training
To train the AMA model, we use an overall objective function that is the sum of equations (5), (6) and (9). In the training phase, the model inputs are the textual vectors, the corresponding visual and auditory vectors, and the associated words (Figure 2). In the testing phase, we only need textual inputs to generate multimodal word representations.
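Putting the modules together, one training step under the summed objective of equations (5), (6) and (9) might look like the sketch below. It assumes the decoder has the extra association channel of equations (7)-(8) and that the autoencoder consumes the textual vector together with the predicted perceptual vectors, as described in the abstract; the names come from the earlier sketches, not from the released code.

```python
import torch

def training_step(model, mapper_v, mapper_a, optimizer, batch):
    """One optimization step over L1 + L2 + L3 (eqs. 5, 6 and 9) -- a sketch."""
    x_t, x_v, x_a, y_t, y_v, y_a = batch            # y_*: embeddings of the associated word
    optimizer.zero_grad()
    pred_v, pred_a = mapper_v(x_t), mapper_a(x_t)    # predicted perceptual vectors
    _, (x_t_hat, x_v_hat, x_a_hat, x_asc_hat) = model(x_t, pred_v, pred_a)
    y = torch.cat([y_t, y_v, y_a], dim=-1)
    loss = (((x_t - x_t_hat) ** 2).sum()             # eq. (5): reconstruct original embeddings
            + ((x_v - x_v_hat) ** 2).sum()
            + ((x_a - x_a_hat) ** 2).sum()
            + ((pred_v - x_v) ** 2).sum()            # eq. (6): text-to-vision mapping loss
            + ((pred_a - x_a) ** 2).sum()            # eq. (6): text-to-audition mapping loss
            + ((y - x_asc_hat) ** 2).sum())          # eq. (9): word association loss
    loss.backward()
    optimizer.step()
    return loss.item()
```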

Datasets
Textual vectors. We use 300-dimensional GloVe vectors trained on the Common Crawl corpus, which consists of 840B tokens and a vocabulary of 2.2M words.
Visual vectors. Our visual vectors are collected from ImageNet (Russakovsky et al., 2015), which covers a total of 21,841 WordNet synsets (Fellbaum, 1998) with 14,197,122 images. For our experiments, we discard words with fewer than 50 images or words not in the GloVe vocabulary, and sample at most 100 images for each word. To generate a visual vector for each word, we run a forward pass of a pre-trained VGG-net model and extract the hidden representation of the last layer as the feature vector. We then average the feature vectors of the multiple images corresponding to the same word. Finally, we obtain 8,048 visual vectors of 128 dimensions.
Auditory vectors. For auditory data, we gather audio files from Freesound, selecting words with more than 10 audio files and sampling at most 50 sounds per word. To extract auditory features, we use a VGG-net model pre-trained on AudioSet. The final auditory vectors are the averaged feature vectors of the multiple audio files of the same word; this yields 9,988 auditory vectors of 128 dimensions.
Word associations. We use the word association data collected by De Deyne et al. (2016), in which each word pair is generated by at least one subject. This dataset mostly includes words with similar meanings (e.g., occasionally & sometimes, adored & loved, supervisor & boss) and related words (e.g., eruption & volcano, cortex & brain, umbrella & rain). We calculate the association score for each word pair (cue word + target word) as the number of subjects who generated the word pair divided by the total number of subjects who were presented with the cue word. For training, we select pairs of associated words with scores above a threshold of 0.15 and delete those that are not in the GloVe vocabulary, which results in 7,674 word associations. For the development set, we randomly sample 5,000 word associations together with their association scores.
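As an illustration, the association scores and the filtered training pairs could be derived from the raw norms with something like the sketch below; it assumes one (cue, target) tuple per subject response and approximates the number of people presented with a cue by the number of responses recorded for it.

```python
from collections import Counter

def association_pairs(responses, glove_vocab, threshold=0.15):
    """responses: list of (cue, target) tuples, one per subject response.
    Returns (cue, target, score) triples kept for training."""
    pair_counts = Counter(responses)
    cue_counts = Counter(cue for cue, _ in responses)   # people shown the cue (approximation)
    pairs = []
    for (cue, target), n in pair_counts.items():
        score = n / cue_counts[cue]
        if score >= threshold and cue in glove_vocab and target in glove_vocab:
            pairs.append((cue, target, score))
    return pairs
```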

Model Settings
Our models are implemented with PyTorch (Paszke et al., 2017) and optimized with Adam (Kingma and Ba, 2014). We set the initial learning rate to 0.05 and the batch size to 64. We tune the number of layers over {1, 2, 3}, the size of the multimodal vectors over {100, 200, 300}, the size of each layer in the textual channel over {300, 250, 200, 150, 100}, and the size of each layer in the visual/auditory channels over {128, 120, 90, 60}. We train the model for 500 epochs and select the best parameters on the development set. All models are trained 3 times and the average results are reported in Table 1.
To test the effect of each module, we separately train the following models: the multichannel autoencoder with modality associations (AMA-M), with modality and word associations (AMA-MW), and with modality and word associations plus a value or vector gate (AMA-MW-Gval/Gvec).
For the AMA-M model, we initialize the text-to-vision and text-to-audition mapping functions with pre-trained mapping matrices, which are the parameters of one-layer feed-forward neural networks. Each network takes the textual vectors as input and the visual or auditory vectors as output, and is trained with SGD for 100 epochs. We initialize the network biases to zero and the network weights with He-initialization (He et al., 2015). The best parameters of the AMA-M model are 2 hidden layers, with textual channel sizes of 300, 250 and 150, and visual/auditory channel sizes of 128, 90 and 60. For the AMA-MW model, we use the best AMA-M parameters as initialization and train the model with the word association data. The optimal association channel sizes are 300, 350 and 556 (or 428 for bimodal inputs). For AMA-MW-Gval and AMA-MW-Gvec, we adopt the same training strategy as for the AMA-MW model. The code for training and evaluation can be found at: https://github.com/wangshaonan/Associative-multichannel-autoencoder.
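For reference, pre-training one mapping function as described above could be sketched as follows, reusing the TextToVision module from the earlier sketch; the learning rate is not reported above and is an arbitrary placeholder.

```python
import torch
import torch.nn as nn

def pretrain_mapping(mapper, T, V, epochs=100, lr=0.01):
    """Pre-train a one-layer text-to-vision (or text-to-audition) mapping with SGD,
    zero biases and He-initialised weights."""
    nn.init.kaiming_uniform_(mapper.proj.weight)   # He initialisation
    nn.init.zeros_(mapper.proj.bias)               # zero biases
    opt = torch.optim.SGD(mapper.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = ((mapper(T) - V) ** 2).sum()        # squared mapping loss, eq. (6)
        loss.backward()
        opt.step()
    return mapper
```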
We employ Spearman's correlation to evaluate the performance of our models: we compute the correlation coefficient between model predictions and human similarity ratings, where the model prediction is the cosine similarity between the semantic representations of two words.
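Concretely, the evaluation reduces to something like the following, assuming a word-to-vector dictionary and a list of (word1, word2, human_rating) triples from a benchmark:

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate(vectors, word_pairs):
    """Spearman correlation between cosine similarities and human similarity ratings."""
    preds, golds = [], []
    for w1, w2, rating in word_pairs:
        v1, v2 = vectors[w1], vectors[w2]
        preds.append(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
        golds.append(rating)
    return spearmanr(preds, golds).correlation
```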

Baseline Multimodal Models
Most existing multimodal models only utilize the textual and visual modalities. For a fair comparison, we re-implement several representative systems with our own textual and visual vectors. The Concatenation (CONC) model (Kiela and Bottou, 2014) is the simple concatenation of normalized textual and visual vectors. The Mapping (Collell et al., 2017) and Ridge models first learn a mapping from the textual to the visual modality using a feed-forward neural network and ridge regression respectively. After applying the mapping function to the textual vectors, they obtain predicted visual vectors for all words in the textual vocabulary, and then concatenate the normalized textual and predicted visual vectors to obtain multimodal word representations. The SVD (Bruni et al., 2014) and CCA models first concatenate the normalized textual and visual vectors, and then apply an SVD or CCA transformation to the concatenated vectors.
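As a point of reference, the Ridge baseline can be sketched in a few lines of scikit-learn (the 0.6 regularization strength is the optimal value reported below; the rest is an illustrative assumption):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import normalize

def ridge_baseline(T, V, T_all, alpha=0.6):
    """Learn a text-to-vision mapping on the words that have visual vectors (rows of T and V),
    predict visual vectors for all words, and concatenate the normalized textual and
    predicted-visual vectors into multimodal representations."""
    mapper = Ridge(alpha=alpha).fit(T, V)
    return np.hstack([normalize(T_all), normalize(mapper.predict(T_all))])
```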
(Table 1 caption: T, V and A denote textual, visual and auditory; TV denotes bimodal inputs of textual and visual, and TVA trimodal inputs of textual, visual and auditory. The bold scores are the best results per column among the bimodal and the trimodal models respectively. For each test, ALL corresponds to the whole test set, V/A to word pairs for which we have textual & visual vectors in bimodal models or textual & visual & auditory vectors in trimodal models, and ZS (zero-shot) to word pairs for which we have only textual vectors. #inst. denotes the number of word pairs.)

For multimodal models with textual, visual and auditory inputs, we implement CONC and Ridge as baseline models. The trimodal CONC model simply concatenates the normalized textual, visual and auditory vectors. The trimodal Ridge model first learns text-to-vision and text-to-audition mapping matrices with ridge regression. It then applies the mapping functions to the textual vectors to obtain predicted visual and auditory vectors. Finally, the normalized textual, predicted-visual and predicted-auditory vectors are concatenated to obtain the multimodal representations.
All of the above baseline models are implemented with Sklearn (http://scikit-learn.org/). As for the proposed AMA model, the hyper-parameters of the baseline models are tuned on the development set using Spearman's correlation. In the Ridge model, the optimal regularization parameter is 0.6. The Mapping model is trained with SGD for at most 100 epochs with early stopping, and the optimal learning rate is 0.001. The output dimension of the SVD and CCA models is 300.

Results and Discussion
As shown in Table 1, we divide all models into six groups: (1) existing multimodal models (with textual and visual inputs), whose results are reprinted from Collell et al. (2017); (2) unimodal models with textual, (predicted) visual or (predicted) auditory inputs; (3) our re-implementation of baseline bimodal models with textual and visual inputs (TV); (4) our AMA models with textual and visual inputs; (5) our implementation of trimodal baseline models with textual, visual and auditory inputs (TVA); and (6) our AMA models with textual, visual and auditory inputs.

Overall performance
Our AMA models (groups 4 and 6) clearly outperform the baseline unimodal and multimodal models (groups 2, 3 and 5). We use the Wilcoxon signed-rank test to check whether a significant difference exists between two models. The results show that our multimodal models perform significantly better (p < 0.05) than all baseline models.
As the table shows, our bimodal and trimodal AMA models achieve better performance than the baselines in both the V/A region (visual or auditory: test word pairs that have associated visual or auditory vectors) and the ZS region (zero-shot: test word pairs that do not have associated visual or auditory vectors). In other words, our models outperform the baseline models on words both with and without perceptual information. The good results in the ZS region also indicate that our models have good generalization capacity.

Unimodal baselines
As shown in group 2, the GloVe vectors are much better than the CNN-visual and CNN-auditory vectors, with CNN-auditory performing worst at capturing concept similarities. Compared with the original visual and auditory vectors, the predicted visual and auditory vectors achieve much better performance. This indicates that the predicted vectors contain richer information than purely perceptual representations and are more useful for building semantic representations.

Multimodal baselines
For the bimodal models (group 3), the CONC model that combines GloVe and visual vectors performs worse than GloVe on four out of six datasets, suggesting that simple concatenation might be suboptimal. The Mapping and Ridge models, which combine GloVe and predicted visual vectors, improve over GloVe on five out of six datasets in the ALL region. This reinforces the conclusion that the predicted visual vectors are more useful for building multimodal models. The SVD model obtains similar results to the Ridge model. The CCA model maps the different modality inputs into a common space, achieving better results on some datasets and worse results on others.
The improvement on three benchmark tests shows the potential of mapping multimodal inputs into a common space.
The above observations also hold for the trimodal CONC and Ridge models (group 5). Overall, the trimodal models, which utilize additional auditory inputs, perform slightly worse than the bimodal models. This is partly caused by the concatenation-based fusion method. Note that our proposed AMA models are more effective with trimodal inputs, as shown in group 6.

Our multimodal models
With either bimodal or trimodal inputs, the proposed AMA-M model outperforms all baseline models by a large margin. Specifically, our AMA-M model achieves a relative improvement of 4.1% on average (4.5% with trimodal inputs) over the state-of-the-art Ridge model. This illustrates that our AMA models can productively combine textual and perceptual representations. Moreover, our AMA-MW model, which employs word associations, achieves an average improvement of 1.5% (2.7% with trimodal inputs) over the AMA-M model. That is to say, the representation ability of multimodal models can be clearly improved by learning associative relations between words. Furthermore, the AMA-MW-Gval model improves over the AMA-MW model by 1.3% (0.3% with trimodal inputs) on average, illustrating that the gating mechanism (especially the value gate) helps to learn better semantic representations.
In addition, we explore the effect of the word association data size. We find that reducing the amount of association data has no discernible effect on model performance: when using 100%, 80%, 60%, 40% and 20% of the data, the average results of the bimodal model are 0.6479, 0.6409, 0.6361, 0.6430 and 0.6458 respectively. The same trend is observed in the trimodal models.

Conclusions and Future Work
We have proposed a cognitively inspired multimodal model, the associative multichannel autoencoder, which utilizes the associations between modalities and between related words to learn multimodal word representations. The performance improvement on six benchmark tests shows that our models can efficiently fuse different modality inputs and build better semantic representations.
Ultimately, the present paper sheds light on fundamental questions about how to learn word meanings, such as the plausibility of reconstructing perceptual information, associating related concepts and grounding word symbols in the external environment. We believe that a promising future direction is to learn from how humans acquire and store semantic word representations in order to build more effective computational models.