Encoding Lexico-Semantic Knowledge using Ensembles of Feature Maps from Deep Convolutional Neural Networks

Semantic models derived from visual information have helped to overcome some of the limitations of solely text-based distributional semantic models. Researchers have demonstrated that text and image-based representations encode complementary semantic information, which when combined provide a more complete representation of word meaning, in particular when compared with data on human conceptual knowledge. In this work, we reveal that these vision-based representations, whilst quite effective, do not make use of all the semantic information available in the neural network that could be used to inform vector-based models of semantic representation. Instead, we build image-based meta-embeddings from computer vision models, which can incorporate information from all layers of the network, and show that they encode a richer set of semantic attributes and yield a more complete representation of human conceptual knowledge.


Introduction
Many approaches to representing the meaning of imageable, concrete concepts (e.g. FROG, APPLE, CAR, GUITAR) have been developed in the fields of cognitive science, computational linguistics and computer vision. Most explicitly, property listing studies have been used in cognitive psychology and cognitive neuroscience to characterise word meaning in terms of discrete semantic properties (McRae et al., 2005;Devereux et al., 2014;Buchanan et al., 2019). In property listing studies, human participants enumerate as many features as they can for each concept word, and these responses are then aggregated and normalised to a set of verbal semantic descriptors that correspond to elements of concept meaning (e.g. does-croak for FROG). This gives a representation of each concept as a sparse vector which encodes the semantic properties that occur for that concept. The resulting properties can then be applied to research investigating the organisation of semantic processing across the cortex, and to studies of the speed or ease of semantic processing for different concepts and different types of concept knowledge (Fieder et al., 2019;Evans et al., 2019;Kivisaari et al., 2019a;Bruffaerts et al., 2019).
A desirable trait of semantic property norms is their interpretability, since this interpretability facilitates the design of cognitive experiments on conceptual semantics (Murphy, 2004). This interpretability has also allowed researchers in NLP interested in distributional lexical semantics to gain better insights into the kinds of information that dense vector space models attain from pretraining. Even though many state-of-the-art vector space models of word meaning perform well when evaluated on both intrinsic and downstream tasks (Mikolov et al., 2013;Pennington et al., 2014;Bojanowski et al., 2017), researchers have demonstrated that such models often fail to fully encode certain facets of conceptual meaning (Li and Gauthier, 2017). For example, taxonomic properties (i.e. properties describing the object category, such as is-an-amphibian for FROG) and properties reflecting encyclopedic information or information about object function tend to be well-represented in the vector space, but properties that correspond to other kinds of attributes, such as colour, form, and modes of motion, tend to be poorly encoded (Rubinstein et al., 2015;Collell and Moens, 2016). These insights have motivated the development of multimodal semantic representations, which learn from multiple sources of information to ground these representations in the real world, an approach most successfully demonstrated by models that combine data from text and images (Kiela and Bottou, 2014;Lazaridou et al., 2015b;Silberer, 2017).
Generally, when image data is incorporated into the construction of word embedding spaces, vectors are derived from the penultimate layer of a deep convolutional neural network trained on an image classification task. Although not usually explicitly stated, the rationale for using the penultimate layer of the network is that this layer should be the layer that is maximally relevant to the predicted label for the objects, whilst also being the layer that is least influenced by the low-level visual noise due to the details of the sampled training images. However, it is possible that meaningful object knowledge is present in lower layers of the network even if such information is not directly relevant to the object labelling task that the computer vision model is trained on. For example, in human property listing data, the property is-green is reliably given for FROG. In a computer vision model trained to discriminate between object classes, however, the greenness of frogs may not be strongly relevant to discriminating between images of frogs and images of other amphibians or reptiles (which also tend to be green), and so this information, although highly relevant to human semantic representations, may not be strongly represented in the penultimate layer.
In this work, we demonstrate that whilst using the penultimate layer of a computer vision model is an effective way to capture concept semantics, this approach is not optimal with respect to the goal of producing representations that encode information about cognitively-relevant semantic attributes of concepts. Using human property norm data, we demonstrate that certain features and feature types are more decodable at particular layers of the network than others. To produce representations that can make use of all the available semantics-relevant information from the network, we merge the feature map information by constructing meta-embeddings using information from all convolutional layers. We demonstrate that these ensembles of distributional models produce more complete representations of conceptual meaning, when evaluated against human conceptual knowledge. To our knowledge, ours is the first work to consider and evaluate the use of all layers of computer vision models for constructing semantic models, and is the first to consider a range of recent computer vision architectures in building ensembled and multi-modal representations. Finally, we demonstrate how these meta-embeddings can be used in a zero-shot property mapping task, which allows us to automate the generation of interpretable semantic properties for unseen concepts. We make our code, embeddings, and analysis pipeline openly available 1.

Related Work
Whilst word embedding vectors derived from text data have a long history of proven utility on a wide range of downstream tasks, they have been shown to struggle with encoding certain types of semantically relevant information. Directly analysing the relationship between explicit property knowledge found in property norm data and the information in pretrained distributional models shows that particular properties relating to more sensory or perceptual information about object semantics can be poorly captured, compared with more associative and encyclopedic knowledge (Rubinstein et al., 2015;Collell and Moens, 2016;Li and Gauthier, 2017). Sommerauer and Fokkens (2018), using probing classifiers, showed that many of these properties may not be decodable at all from text-based embedding spaces. Findings such as these have motivated researchers to incorporate multimodal information into representations of word meaning (Bulat et al., 2016).
Meta-embeddings have emerged as a useful method for combining information from different word embeddings models (Yin and Schütze, 2016). Different embeddings may be trained on various corpora of text, with different sizes, vocabularies, learning methods or model architecture. Meta-embeddings can then be created that combine the complementary information from all sources (Muromägi et al., 2017). In the context of language modelling, different layers of a pre-trained language model may be sensitive to different kinds of linguistic information, and effectively combining embedding information across layers has been shown to improve performance on different tasks sensitive to different kinds of information (such as POS-tagging and word sense disambiguation) (Peters et al., 2018). A number of successful methods have emerged for combining word embedding spaces from different sources. Coates and Bollegala (2018) demonstrate that combining vectors using element-wise addition can be just as effective as concatenating, given that the embeddings are orthogonal. Bollegala and Bao (2018) propose a number of autoencoder type networks to combine one or more vector space models. Finally, Neill and  give a comprehensive set of empirical results for a number of models and loss functions for learning complex meta-embeddings, demonstrating that loss functions that focus on vector direction such as cosine or KL-divergence based losses give the best performance on intrinsic benchmarks.
In distributional semantics, powerful approaches have been developed for building functions that can map from one semantic space to another (Lazaridou et al., 2014), compelling researchers to construct cross-modal mappings from dense distributional models with much larger vocabularies onto property norm data (Fagarasan et al., 2015). For example, Derby et al. (2019) constructed distributed semantic representations for each property dimension used in a large set of property norms, whilst Li and Summers-Stay (2019) showed that deep neural networks provide the best performance for zero-shot mapping between semantic spaces.
Motivating our work on ensembling over layers of deep convolutional neural network vision models, it has been shown that different layers of such networks learn features that reflect different kinds of visual properties. The lower layers of vision networks tend to capture visually basic features such as Gabor filters and colour gradients, which are then combined at later layers to construct task-specific high-level visual features that are relevant to object classification (Yosinski et al., 2014;Zeiler and Fergus, 2014).

Approach
The most prominent method for building image-based semantic models involves using pretrained deep convolutional neural networks (DCNNs) to extract visual information from image data by retrieving the output vectors from the penultimate layer of the network. In this work, we utilise DCNNs trained for the ImageNet LSVRC competition (Deng et al., 2009), which aim to predict the correct object in an image from a set of 1000 possible labels. In general, these networks use convolutional layers to extract visual information and build increasingly complex features, before one or many fully-connected layers are used to compute the probability distribution over the classes. Distributional semantic spaces can then be constructed from the penultimate fully-connected layer. Here, we instead focus our analysis on representational spaces generated at all convolutional layers of the network. Our goal is to demonstrate that the types of semantic attributes encoded in these feature maps depend on their depth in the network, and thus using only the penultimate layer may be suboptimal for representing concept meaning.

Visual Stimulus
We make use of the CSLB property norm data (Devereux et al., 2014), which includes 638 concept words together with 2725 human-elicited semantic properties 2. We used a script to web scrape ten representative images for each of these concepts from a Google image search. We manually reviewed the images to check that they were appropriate and representative (for example, to ensure images for the search term APPLE do not include the logo of Apple Inc.). For each word, we feed the corresponding images into a DCNN and extract the feature map outputs at every layer of the network. Once we have retrieved the feature maps for each concept, we perform additional preprocessing steps in order to create embedding vectors at each layer. Each convolutional filter should activate if it receives certain visual patterns from the stimulus, with the resulting feature map representing the activity value of each filter at each spatial location. To obtain an overall measure of the presence of each feature in each image, we perform global max pooling across the feature maps, which takes the highest activity value across all spatial locations for each filter. We then average the max-pooled responses for each filter across our ten images to get the final concept representation, and finally normalise the vectors to unit L2 norm.
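The pooling pipeline described above can be sketched as follows. This is a minimal illustration in plain Python rather than our released code: the function names `global_max_pool` and `concept_embedding` and the toy feature maps are assumptions made for the example, and a real pipeline would operate on tensors extracted from the DCNN.

```python
import math


def global_max_pool(feature_maps):
    """Reduce one image's feature maps (filters x height x width) to one value per filter."""
    return [max(max(row) for row in fmap) for fmap in feature_maps]


def concept_embedding(images):
    """Average max-pooled filter responses over a concept's images, then L2-normalise."""
    pooled = [global_max_pool(feature_maps) for feature_maps in images]
    n_images = len(pooled)
    mean = [sum(values) / n_images for values in zip(*pooled)]
    norm = math.sqrt(sum(v * v for v in mean))
    # Leave an all-zero vector untouched to avoid dividing by zero.
    return [v / norm for v in mean] if norm > 0 else mean


# Toy input (hypothetical): 2 images, each with 2 filters producing 2x2 feature maps.
image_a = [[[0.0, 1.0], [2.0, 3.0]], [[4.0, 0.0], [0.0, 0.0]]]
image_b = [[[5.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 2.0]]]
embedding = concept_embedding([image_a, image_b])
print(embedding)
```

One vector of this kind is produced per concept and per layer, so the dimensionality of each layer's embedding equals the number of filters in that layer.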

Image Classification Models
For our analysis, we evaluate a number of standard implementations of networks trained for image classification 3. The first model is AlexNet, containing five convolutional layers followed by three feed-forward layers. AlexNet is the most widely used pretrained DCNN in work on multimodal distributional semantics. Next, we chose the VGG16 model, which has a similar architecture to AlexNet but contains 16 convolutional layers (Simonyan and Zisserman, 2014). By comparing AlexNet and VGG16, we aim to investigate how depth affects the models' ability to encode human-relevant semantic property knowledge. For our third model, we chose ResNet34, which is not only deeper, with 34 convolutional layers, but also has residual connections between layers. These connections allow information from lower layers to more easily flow to higher layers. The final model we consider is DenseNet; it further extends the objective of feeding low-level information as inputs to the later layers of the network by using dense skip connections (Huang et al., 2017). As DenseNet has a vast number of layers (169), we only take the output of the first convolutional layer, each block layer in the network, and each transition block (which performs downsampling of the feature maps). ResNet and DenseNet have been shown to be amongst the most "brainlike" DCNNs, insofar as their internal representations correlate well with neuroimaging data from the human visual processing stream (José Meijer and Visser, 2019;Wen et al., 2018).

Analysis
To measure how well each layer of a DCNN encodes salient properties of human conceptual knowledge, we train supervised models to predict the presence of a property for a given concept. We note that while a supervised classifier's ability to identify the presence of a property indicates that the property is encoded in the representations, the converse is not always true (Collell and Moens, 2016).

Decoding Property Norms Data from DCNN layers
We first preprocessed the property norm data so as to retain only semantic properties that occur for at least 5 concepts (leaving 638 concepts and 390 properties). For each property, we perform 5-fold cross-validation with stratified sampling so that at least one positive case occurs in each test fold, and train a logistic regression classifier to predict the presence of that property for each embedding, recording the average F1 score over all the folds. Since the dataset is highly imbalanced, we down-weight the negative classes in the loss function and regularise by adding the L2 norm of the weights. After obtaining the classification result for each property, we use the partition of properties into distinct feature classes in the CSLB data to aggregate the results by property class. These property classes are Taxonomic (e.g. is-an-amphibian), Encyclopaedic (e.g. lays-spawn), Functional (e.g. hops), Visual Perceptual (e.g. is-green) and Other Perceptual (e.g. does-croak).
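The evaluation scaffolding around the classifier can be sketched as below. This is an illustration under stated assumptions, not our exact implementation: the round-robin fold assignment, the inverse-frequency class weights and the function names are choices made for the example, and the logistic regression itself is omitted (any L2-regularised implementation that accepts per-class weights could be plugged in).

```python
def stratified_folds(labels, n_folds=5):
    """Assign items to folds round-robin within each class, spreading positives evenly."""
    folds = [0] * len(labels)
    for target in (1, 0):
        indices = [i for i, y in enumerate(labels) if y == target]
        for position, index in enumerate(indices):
            folds[index] = position % n_folds
    return folds


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency (assumed scheme).

    With few positives this gives the negative class the smaller weight.
    Assumes both classes are present, which holds after the >= 5 concept filter.
    """
    n_total = len(labels)
    n_pos = sum(labels)
    n_neg = n_total - n_pos
    return {1: n_total / (2 * n_pos), 0: n_total / (2 * n_neg)}


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy property (hypothetical): present for 5 of 20 concepts.
labels = [1] * 5 + [0] * 15
folds = stratified_folds(labels)
weights = balanced_class_weights(labels)
```

Because only properties with at least five positive concepts are retained, round-robin assignment within the positive class places at least one positive case in each of the five test folds.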
The results are consistent across all models, with Taxonomic being the most decodable followed by Visual Perceptual, Other Perceptual, Functional and finally Encyclopaedic (Fig. 1). These results follow previous work where taxonomic features tend to be more decodable than other attributes, for many vector space models. Furthermore, all types of concept properties seem to be more decodable as we move to later layers of the network. Such results could be down to the supervised classifier being unable to make use of the more image-specific information in the lower layers, or the fact that layers in the latter part of the network tend to be better for classification tasks. The most notable exception to this is DenseNet, which has two spikes in performance at the convolutional bottleneck layers.

Fine-Grained Analysis of the Decoding Results
Whilst the five property classes provide a useful distinction between types of semantic information, they tend to include a broad range of semantic attributes. For example, the visual-perceptual class includes visual features relating to colour and texture (is-green, has-smooth-skin) as well as more complex information about form (has-four-legs). To gain a deeper understanding of the kinds of semantic knowledge encoded in the DCNN layers, we divided the properties into more fine-grained classes. First, we split  the Visual Perceptual features into two basic types of visual information, Colour and Shape/Size. We expect the lower layers to perform well at decoding these properties since previous research has shown that the lower layers tend to learn colour gradients and Gabor filters (Zeiler and Fergus, 2014). In the CSLB property norm data, semantic properties always consist of a relation term and an attribute value. For example, FROG has the feature "has-legs", where "has" is the relation. The property listing task prompted participants to use four such relations: "is", "has", "does" and "made-of" (Devereux et al., 2014). These relations relate to the type of property being described; for example, "does" relates to action or function, while "has" corresponds to object parts. We therefore use these four relations to build the other fine-grained categories, for a total of six categories. Based on the previous results, we expect the later DCNN layers to have higher average F1 scores for all properties, but here we are interested in which type of property is most decodable at each layer.
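The assignment of properties to the six fine-grained categories can be illustrated with a short sketch. The Colour and Shape/Size membership sets below are hypothetical examples rather than the actual lists, and the rule that these two categories take precedence over the relation prefix is an assumption made for illustration.

```python
# Relation terms from the property listing task (Devereux et al., 2014).
RELATIONS = ("made-of", "does", "has", "is")

# Hypothetical membership sets, for illustration only.
COLOUR_PROPERTIES = {"is-green", "is-red", "is-yellow"}
SHAPE_SIZE_PROPERTIES = {"is-small", "is-big-large", "is-circular-round"}


def fine_grained_category(prop):
    """Map a property string such as 'has-legs' to one of the six categories."""
    if prop in COLOUR_PROPERTIES:
        return "Colour"
    if prop in SHAPE_SIZE_PROPERTIES:
        return "Shape/Size"
    for relation in RELATIONS:
        if prop.startswith(relation + "-"):
            return relation
    return "unknown"


for example in ("is-green", "has-legs", "does-croak", "made-of-metal"):
    print(example, "->", fine_grained_category(example))
```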
In the lower layers of the DCNNs, we see that Colour tends to be the most decodable followed by "made-of" and Shape/Size, but as we move through to the middle sections of the networks "made-of" properties become the most decodable, for all DCNNs (Fig. 2). As we move further, "has" and "is-a" property decoding improves, and by the end either "is-a" or "made-of" becomes the most decodable property type. As we move through the networks, features related to colour or shape become relatively less decodable, compared to other feature types. Overall, the results support the rationale that the penultimate layer (as is commonly used) should give a good correspondence to object semantics for the purpose of building distributional semantic models. But would such models also benefit from having direct information about different features from earlier stages of the DCNNs?

Improving Distributional Semantic Models with Visual Meta-Embeddings
We have seen that particular layers of DCNNs best capture different types of semantic information. Here we investigate whether we can use this insight to obtain improved image-based semantic embedding spaces and thus build more faithful representations of conceptual meaning. (In these experiments, we focus on ResNet, since it gave the best performance on the decoding task; for results with all four models, see Supplementary Table 1.)

Convolutional Meta-Embeddings
In order to make full use of the information generated by the network, we require a method that can effectively combine the feature maps from each layer, retaining only the most relevant information from each. We construct semantic representations by aggregating the outputs of every convolutional layer into a single set of representations known as a meta-embedding. Meta-embeddings are vector representations that incorporate information from a set of word embeddings that can differ in a range of aspects such as training data and training methods (Peters et al., 2018;Coates and Bollegala, 2018). Most importantly, they aim to combine complementary knowledge from each embedding, and do not require that the vectors have the same dimensionality. Here we apply two common approaches. The first approach is a simple concatenation technique to combine all embeddings. Following previous work, we also up-weight the best embeddings; in this case, we multiply the second-to-last layer by 5 and the last layer by 10, keeping the other layers the same before concatenating. To reduce dimensionality, we also apply Singular Value Decomposition (SVD) to fix the dimensionality to 300 while preserving information from the most important features. We refer to these embeddings as SVD Meta ResNet.
Table 2: Results for zero-shot cross-modal mapping task using several predictive models. The Hit@K tells us the percentage of test features which appear in the top K neighbours with the ground truth representations.
The second approach uses a method known as 1ToN (Yin and Schütze, 2016), which learns a set of meta-embeddings using a neural network. For each word, we have a meta-embedding vector, from which the network predicts the associated word embedding in each of our vector space models using several linear layers. A network which combines the information from $N$ embeddings contains $N$ linear layers, which map the meta-embedding into the original constituent word embedding spaces. Suppose we have $N$ distributional models $\{W_1, W_2, \ldots, W_N\}$ with a shared vocabulary $V$ and vector lengths $(a_i)_{i=1}^{N} \subset \mathbb{N}$. We define the 1ToN neural network with an embedding matrix $E \in \mathbb{R}^{|V| \times k}$ for some size $k \in \mathbb{N}$, together with $N$ linear projections with weights $M_i \in \mathbb{R}^{k \times a_i}$ and corresponding biases $b_i$, $1 \le i \le N$. For each word $w \in V$, let $w_i \in W_i$ be its associated word vector for each $1 \le i \le N$, with meta-embedding $E(w)$. We train the network, parameterized by $\theta$, to minimize a loss that accumulates, over all words and all component embeddings, the error between the projection of $E(w)$ through the $i$-th linear layer and the original vector $w_i$, where $[\beta_1, \beta_2, \ldots, \beta_N]$ are the scalar weightings for each component embedding, though we set these values to one. We instead up-weight the embeddings from the last two convolutional layers, which we multiply by 5 and 10 respectively. We call this the 1ToN Meta ResNet embedding.
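Both combination strategies can be sketched in a few lines of plain Python. This is a hedged illustration rather than our training code: the function names are hypothetical, the SVD reduction step is omitted because it needs a linear algebra library, and the use of squared Euclidean error inside `one_to_n_loss` is an assumption in the spirit of Yin and Schütze (2016), since other distances could fill the same role.

```python
def weighted_concat(layer_vectors, weights):
    """Scale each layer's vector by its weight and concatenate into one long vector."""
    combined = []
    for vector, weight in zip(layer_vectors, weights):
        combined.extend(weight * value for value in vector)
    return combined


def project(meta, matrix, bias):
    """Linear layer: meta (length k) times matrix (k x a_i), plus bias (length a_i)."""
    return [
        sum(m * row[j] for m, row in zip(meta, matrix)) + bias[j]
        for j in range(len(bias))
    ]


def one_to_n_loss(meta, targets, matrices, biases, betas):
    """Beta-weighted sum of squared errors between each projection and its target.

    Squared Euclidean error is an assumed choice of distance for this sketch.
    """
    total = 0.0
    for target, matrix, bias, beta in zip(targets, matrices, biases, betas):
        predicted = project(meta, matrix, bias)
        total += beta * sum((p - t) ** 2 for p, t in zip(predicted, target))
    return total


# Concatenation: up-weight the last two layers by 5 and 10 (toy vectors).
layers = [[1.0, 2.0], [3.0], [4.0, 5.0]]
concatenated = weighted_concat(layers, [1, 5, 10])

# 1ToN loss for one word with k = 2 and two component embeddings (toy values).
meta_vector = [1.0, 0.0]
component_targets = [[1.0, 0.0], [1.0]]
projection_matrices = [[[1.0, 0.0], [0.0, 1.0]], [[2.0], [0.0]]]
projection_biases = [[0.0, 0.0], [0.0]]
loss = one_to_n_loss(meta_vector, component_targets, projection_matrices,
                     projection_biases, betas=[1.0, 1.0])
```

In training, the meta-embedding and the projection parameters would be updated jointly by gradient descent on this loss summed over the whole vocabulary; the sketch only evaluates the loss for a single word.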

Experiments
To evaluate these two meta-embedding models, we repeat the decoding experiment with the SVD and 1ToN meta-embeddings, both of size 300, built using the feature maps from all of the ResNet convolutional layers. We compare the meta-embeddings with embeddings constructed from the penultimate layer of ResNet (i.e. the traditional approach) to see how well they decode each property type.

Property Decodability
The results are displayed in Table 4. We see that both meta-embedding approaches, ensembling over the convolutional layers of ResNet, are better representations for decoding human property knowledge than the traditional approach of using the penultimate layer of ResNet alone. We see that there is no real change in how decodable Encyclopaedic or Functional properties are in the meta-embeddings, which is to be expected, as this is where text-based word embeddings have been shown to perform strongest. Furthermore, Taxonomic and Visual Perceptual properties are more decodable since certain layers more strongly encode different types of visual information depending on their location in the DCNN. Surprisingly, Other Perceptual information, such as olfactory or taste-based features, is also more decodable in the meta-embeddings. Overall, Taxonomic, Visual Perceptual and Other Perceptual have the most significant improvement when using all layers, with the overall F1 score increasing by 5 on average for these three categories, compared with using the penultimate layer of ResNet alone.

Cross-Modal Embedding-to-Property Mapping
While the sparse property vectors obtained from norming studies are useful in cognitive science due to their interpretability, as a lexical resource they are very limited in size, due to being created manually. This has driven recent work aiming to learn zero-shot cross-modal mappings between a pretrained semantic embedding space and these property vectors, so that property-norm information can be generated automatically from word embeddings (Derby et al., 2019;Fagarasan et al., 2015). Furthermore, we can extend our analysis of the convolutional layers from global to local interpretations based on some sample images (see Appendix B). Here we evaluate how accurately our models predict semantic properties for unseen concept words in a zero-shot set-up using several regression models. We predict the property vector for each test concept and find the top T nearest neighbours of the predicted vector, to determine whether the concept word for the ground-truth vector is retrieved within that set, which we refer to as a hit. We perform repeated 10-fold cross-validation on the concepts due to the small number of training samples and average the number of hits across all test folds at each T, for T ∈ {1, 5, 10, 20} in our evaluations. We perform 5 repeats of each cross-validation. To learn a cross-modal mapping, we report the results for three different models: a k-nearest-neighbours model with k = 5, ridge regression, and a neural network with one hidden layer. The neural network had a hidden layer of size 1200, ReLU activations and used the Adam optimiser. We include a neural network as previous work has shown that they give the strongest performance on this zero-shot cross-modal mapping task (Li and Summers-Stay, 2019). The loss function we use is based on the cosine similarity function from Lazaridou et al. (2014).
For each ground-truth property norm representation $y \in G$, with corresponding predicted vector $\hat{y}$ from the network parameterised by $\theta$, the loss is computed from the cosine similarity between $y$ and $\hat{y}$. Training neural networks on such a small set of data points for zero-shot cross-modal mapping can be difficult, as several problems arise such as "hubness" (Radovanović et al., 2010), "pollution" (Lazaridou et al., 2015a) and neighbourhood structures resembling the input space more than the output (Collell and Moens, 2018). Hence, we perform a hyperparameter search using a small grid of values with 5-fold cross-validation to determine the best set of training parameters. To avoid over-fitting, we determine cross-validation performance using Mean Average Precision (MAP). To illustrate this cross-modal mapping approach, examples of zero-shot property predictions for held-out images are presented in Figure 3. We also combined the image embeddings with text embeddings to create multimodal distributional models that have been shown to give better performance on cross-modal mapping (Bulat et al., 2016).
Table 3: Spearman ρ correlation with MEN and SimLex999 human similarity benchmarks.
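The cosine-based loss for the cross-modal mapping network can be sketched as follows. The exact form is an assumption made for illustration: here each pair of ground-truth and predicted vectors contributes one minus their cosine similarity, which is a common way of turning cosine similarity into a quantity to minimise, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def cosine_loss(gold_vectors, predicted_vectors):
    """Sum of (1 - cosine similarity) over paired gold and predicted vectors (assumed form)."""
    return sum(
        1.0 - cosine_similarity(y, y_hat)
        for y, y_hat in zip(gold_vectors, predicted_vectors)
    )


# Toy property vectors (hypothetical): one perfect direction match, one orthogonal miss.
gold = [[1.0, 0.0], [0.0, 2.0]]
predicted = [[3.0, 0.0], [1.0, 0.0]]
print(cosine_loss(gold, predicted))
```

A loss of this kind depends only on the direction of the predicted property vector and not on its length.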
For the text embeddings, we use spaCy's GloVe vectors (Pennington et al., 2014), from the large English language model. To build the text+image multimodal models, we concatenate the L2-normalized vectors from the GloVe embeddings with each of our image-based embeddings, which gives us three multimodal models in total. The results are presented in Table 2. We see that in all cases, the meta-embeddings outperform the embeddings from the penultimate layer of ResNet, and in particular, the 1ToN embeddings show the best performance. As the number of models being ensembled increases, information can get lost when concatenating to high dimensionality (Neill and ), but with the 1ToN method the network effectively retains the important information from the ensembled component embeddings because of the learning objective.
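The retrieval score reported in Table 2 can be illustrated with a small sketch. It assumes that neighbours are ranked by cosine similarity against the ground-truth property vectors of all candidate concepts; the candidate set, the function names and the toy vectors are assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity, returning 0.0 when either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def is_hit(predicted, true_concept, gold_vectors, top_t):
    """True if the true concept's gold vector is among the top_t neighbours of predicted."""
    ranked = sorted(
        gold_vectors,
        key=lambda concept: cosine(predicted, gold_vectors[concept]),
        reverse=True,
    )
    return true_concept in ranked[:top_t]


def hit_rate(predictions, gold_vectors, top_t):
    """Percentage of test concepts retrieved within the top_t neighbours."""
    hits = sum(
        is_hit(vector, concept, gold_vectors, top_t)
        for concept, vector in predictions.items()
    )
    return 100.0 * hits / len(predictions)


# Toy gold property vectors and zero-shot predictions (hypothetical values).
gold_vectors = {"frog": [1.0, 0.0, 0.0], "apple": [0.0, 1.0, 0.0], "car": [0.0, 0.0, 1.0]}
predictions = {"frog": [0.9, 0.1, 0.0], "apple": [0.6, 0.5, 0.0]}
for top_t in (1, 2):
    print(top_t, hit_rate(predictions, gold_vectors, top_t))
```

In the full evaluation this percentage would be averaged over the test folds and repeats of the cross-validation.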

Semantic Similarity Task
A common benchmark to evaluate distributional semantic models is to directly compare word similarity scores with human annotator similarity ratings for word pairs. We utilize MEN (Bruni et al., 2012) and SimLex999 (Hill et al., 2015), for which we have 104 and 48 word pair ratings respectively. In this final evaluation of the models, we use cosine similarity to score word-pair similarity and then use Spearman ρ to measure the correlation between embedding word similarities and the human annotator ratings. As we can see in Table 3, the results again show the same pattern, with the meta-embeddings outperforming the penultimate ResNet layer for both the unimodal and multimodal (text+image) embeddings.
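The similarity evaluation can be sketched in plain Python as below. The embeddings and ratings are toy values invented for illustration, the function names are hypothetical, and a statistics library would normally supply the Spearman computation.

```python
import math


def cosine(u, v):
    """Cosine similarity, returning 0.0 when either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def spearman(x, y):
    """Spearman rho: Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def evaluate(embeddings, rated_pairs):
    """Correlate model cosine similarities with human ratings over word pairs."""
    model_scores = [cosine(embeddings[a], embeddings[b]) for a, b, _ in rated_pairs]
    human_scores = [rating for _, _, rating in rated_pairs]
    return spearman(model_scores, human_scores)


# Toy embeddings and ratings (hypothetical values, not taken from MEN or SimLex999).
toy_embeddings = {"frog": [1.0, 0.0], "toad": [0.9, 0.1], "car": [0.0, 1.0]}
toy_pairs = [("frog", "toad", 9.0), ("frog", "car", 1.0), ("toad", "car", 2.0)]
rho = evaluate(toy_embeddings, toy_pairs)
```

Only word pairs for which both words have an embedding can be scored, which is why the evaluation is restricted to the subset of pairs covered by the concept vocabulary.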

Conclusion
We have demonstrated the potential of utilizing interpretable semantic primitives derived from human property norm data as a tool for investigating the information captured in the latent representations of deep convolutional neural networks. We reveal that, whilst the widely accepted approach for extracting visual semantic representations, using the penultimate layer of DCNNs, yields strong representations of conceptual meaning, it overlooks key information generated by the neural network. Instead, we develop meta-embeddings that encompass all the salient feature information encoded in the representations produced at all layers of several DCNNs. These new vector space models are not only closer representations of human conceptual knowledge, but can also be used to build multimodal semantic models that improve performance on a zero-shot cross-modal mapping task and give a better fit to human semantic similarity benchmarks. Furthermore, the field of meta-embeddings is rich in potential methods for combining vector space models from different semantic domains, while our research offers empirical evidence that supports our method for constructing meta-embeddings to improve image-based and multi-modal distributional semantic models.
In Section 3, we extracted convolutional feature maps from the layers of deep convolutional neural networks for a set of images representing several concepts. We then pooled these features based on the concept each image represents, so that we could construct semantic representations of word meaning from each convolutional layer of the network. By performing a property decoding task on these embeddings, we could then infer what semantic knowledge the model captures at particular layers of the network. Such an approach reflects a global interpretation of what information the network captures at each convolutional layer, and is not based on any particular sample we gave to the network.
Thankfully, cross-modal mapping provides us with a simple method for interpreting local instances from our visual data.
Mapping Images to Semantic Primitives. When we learn a cross-modal mapping from a distributional feature space onto the conceptual feature space, the model must learn to map common features related to a particular concept onto some plausible semantic properties. Because of this, we can use our trained cross-modal map to predict semantic properties for other instances of the concept, since it has been trained to map common feature onto some associated conceptual knowledge. For example, if the model learns to map features it associates as is-red based on images from some concept such as ROSE, then a new image of a ROSE should still produce features in the convolutional layers that the cross-modal map similarly identifies as is-red. Furthermore, images of other concepts that also have the property is-red, such as STRAWBERRY, should be accurately inferred from the model. For our analysis, we train a ridge regression to map from the convolution layer embeddings of ResNet onto the conceptual space. After training, we apply the appropriate image preprocessing to the sample images that we wish to analyse and extract features across all convolutional layers. Since we frame the task as a regression problem, each predicted conceptual mapping should not only predict the correct properties for a concept, but also the strength of the production frequency for each concept. Production frequencies are count-based statistics that reflect the number of times human annotators express that property as true for a particular concept. For our analysis, we predict the conceptual representation for each image and take the highest valued dimensions which correspond to some conceptual property. As there are a large number of layers, we focuses on features at particular intervals of the network, in this case, layers 3, 15, 27 and 35 of ResNet. Concept: Car. We first chose an image of the concept CAR (displayed in Figure 4). 
As we can see, the lower layers are dominated by visual perceptual properties, with more high-level properties eventually emerging in the upper layers of the network. Furthermore, the lower layers of the network tend to focus on shape, colour and form across the entire image. We can observe this in the fact that the cross-modal map detects properties such as made-of-wood, is-green and has-leaves, a consequence of the model detecting the trees in the background. Notice also that the dimensions of the object become more precise as we move to the middle layers, with features such as is-small and is-circular-round appearing at the start, while is-big-large appears in the later layers. Not only do the upper layers predict more complex notions about the concept, such as made-of-metal or is-expensive, but the attention of these features tends to be solely on the central object in the image, in this case a car.
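To make the procedure concrete, the following is a minimal sketch of a cross-modal map of the kind described in "Mapping Images to Semantic Primitives": a ridge regression from one layer's embeddings onto property production frequencies, followed by reading off the highest-valued predicted properties for a new image. It is an illustration under stated assumptions rather than the implementation used in this work. The data are synthetic, and the property list, the dimensionalities, the regularisation strength `alpha` and the helper names `fit_ridge` and `top_properties` are all hypothetical. The ridge solution is computed in closed form, without an intercept, using NumPy.

```python
import numpy as np


def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression without intercept.

    X: (n_concepts, n_features) layer embeddings.
    Y: (n_concepts, n_properties) production frequencies.
    Returns W of shape (n_features, n_properties).
    """
    n_features = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ Y)


def top_properties(x, W, names, k=3):
    """Return the k highest-scoring (property, score) pairs for one embedding."""
    scores = x @ W
    order = np.argsort(scores)[::-1][:k]
    return [(names[i], float(scores[i])) for i in order]


# Synthetic stand-in data (assumption): real inputs would be embeddings
# extracted from one convolutional layer and property-norm vectors.
rng = np.random.default_rng(0)
property_names = ["is-red", "has-wheels", "is-a-fruit", "has-fur-hair"]
n_concepts, n_features = 50, 16

true_W = rng.normal(size=(n_features, len(property_names)))
X_train = rng.normal(size=(n_concepts, n_features))
noise = 0.01 * rng.normal(size=(n_concepts, len(property_names)))
Y_train = X_train @ true_W + noise

# Fit the cross-modal map for this (single, hypothetical) layer.
W = fit_ridge(X_train, Y_train, alpha=0.1)

# Predict properties for an unseen embedding and keep the top two.
x_new = rng.normal(size=n_features)
predicted = top_properties(x_new, W, property_names, k=2)
print(predicted)
```

To compare layers, the same two steps would be repeated with embeddings taken from each layer of interest, and the ranked property lists inspected side by side.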
Concept: Guitar. Next, we chose an image of the concept GUITAR (displayed in Figure 5), which is another one of the concepts in our lexicon, though the image is not directly taken from the training data. We see that the model detects some visual properties in the lower layers, which it assumes are related to another high-level concept, an animal. We can see this from the top prediction being is-an-animal, along with other animal-related features such as has-a-tail and has-fur-hair. It is not surprising that there is a high degree of association between certain properties in the norming study, since many are related to particular taxonomies, in this case is-an-animal. Nevertheless, as we move through the network, the predictions quickly become more related to a guitar, though is-an-animal is still predicted in the top ten features. As we can see, these models generalise quite well to other images and can still decode complex features related to conceptual categories.
Concept: Fruit. Next, we chose an image of the concept FRUIT (displayed in Figure 6), which is taken from neither the training data nor the lexicon, but instead contains many concepts from the data, such as APPLE, BANANA and KIWI. Here we want to analyse how our cross-modal model copes with multiple instances of the concepts. We see that the model quickly detects visual properties such as is-yellow, is-red and is-small, while high-level properties such as is-eaten-edible and is-a-fruit also emerge. Again, is-an-animal emerges as a property, which may be due to a bias in the model towards frequently occurring properties.
Concept: Wampimuk. Finally, we chose an imagined concept, known as a WAMPIMUK (displayed in Figure 7). A wampimuk is a fictitious concept proposed by Lazaridou et al. (2014) to convey how context can shape our perception of a concept, even if we have never heard of it before. Humans are capable of building complete semantic representations for concepts, even when the information is fragmented (Kivisaari et al., 2019b). Hence, a sentence like "We found a cute, hairy wampimuk sleeping behind the tree" can communicate a lot of information about what a wampimuk might be, in this case a small furry animal. The authors create a potential image of such an animal; although the animal does not exist, we can still extract properties about the concept. We therefore examine the convolutional layers of the network when given such a creature, to determine whether reasonable semantic properties can be captured by our cross-modal model. We see that the network produces features that the cross-modal map detects as salient aspects of the image, such as is-small and has-fur-hair. Furthermore, the model can detect conceptual knowledge related to this imaginary creature based on the context of this information. For example, in the lower layers, has-a-tail is predicted by the cross-modal map even though there is no evidence of a tail in the picture, yet it would make sense for a small creature. As we move to the final layer of the network, we can even see complex taxonomic properties emerge, such as is-a-mammal, which is quite plausible.