Deconstructing multimodality: visual properties and visual context in human semantic processing

Multimodal semantic models that extend linguistic representations with additional perceptual input have proved successful in a range of natural language processing (NLP) tasks. Recent research has successfully used neural methods to automatically create visual representations for words. However, these works have extracted visual features from complete images, and have not examined how different kinds of visual information impact performance. In contrast, we construct multimodal models that differentiate between internal visual properties of the objects and their external visual context. We evaluate the models on the task of decoding brain activity associated with the meanings of nouns, demonstrating their advantage over those based on complete images.


Introduction
Multimodal models combining linguistic and visual information have attracted growing interest in the field of semantics. Recent research has shown that such models outperform purely linguistic models on a range of NLP tasks, including modelling semantic similarity (Silberer and Lapata, 2014), lexical entailment (Kiela et al., 2015), and metaphor identification (Shutova et al., 2016). Despite this success, little is known about the nature of semantic information learned from images and why it is useful. For instance, some concepts may be better characterised by their own (internal) visual properties and others by the (external) visual context in which they appear. However, existing neural multimodal semantic approaches use entire images to learn visual word representations, without differentiating between these two kinds of visual information. In contrast, we investigate whether differentiating between internal visual properties and external visual context is beneficial compared to learning visual representations from complete images. We construct three multimodal models combining linguistic and visual information: using (1) internal visual features extracted from an object's bounding box, (2) external visual features outside the bounding box, i.e. the visual context, and (3) visual features extracted from complete images. Figure 1 visualises the different visual information extracted from an image. We use skip-gram (Mikolov et al., 2013) as our linguistic model and extract visual representations from a convolutional neural network (CNN) pretrained on the ImageNet classification task (Fei-Fei, 2010).
We evaluate the models on their ability to decode patterns of brain activity associated with the meanings of nouns, obtained via brain imaging. This choice of task allows us to assess the importance of each type of visual information in human semantic processing. Specifically, we perform two experiments: (1) using the Visual Genome (Krishna et al., 2016) dataset of images where objects are manually annotated with bounding boxes, and (2) using images retrieved from Google Image Search and automatically segmenting them using a Faster R-CNN (FRCNN) model (Ren et al., 2015). We find that all of our multimodal models are able to decode brain activity patterns and that the models relying on internal visual properties are superior to all others.

Related Work
Multimodal Semantics Multimodal models are inspired by cognitive science research suggesting that human semantic knowledge relies on perceptual and sensori-motor experience (Louwerse, 2011). Contemporary approaches use deep CNNs trained on image classification tasks to extract visual representations of words. Kiela and Bottou (2014) extract visual word representations from feature extraction layers in CNNs and concatenate them with linguistic representations obtained from a skip-gram model. Their results showed empirical improvements over the previous bag-of-visual-words method (Bruni et al., 2012). Other approaches use restricted Boltzmann machines (Srivastava and Salakhutdinov, 2012), recursive neural networks (Socher et al., 2014) and autoencoders (Silberer and Lapata, 2014).
Decoding Brain Activity Research in neuroscience supports the view that concepts are represented as patterns of neural activation and, similarly to distributed semantic representations, are naturally encoded in a neural semantic vector space (Haxby et al., 2001; Huth et al., 2012; Anderson et al., 2013). Mitchell et al. (2008) were the first to employ distributional semantic models to predict neural activation in the human brain using data obtained via functional Magnetic Resonance Imaging (fMRI). Murphy et al. (2012), Devereux et al. (2010) and Pereira et al. (2013) have since successfully tested a wider range of distributional models in this task.
Recent research shows that multimodal models grounded in the visual modality strongly correlate with neural activation patterns associated with word meaning. Anderson et al. (2013) construct semantic models using visual data and show a high correlation to brain activation patterns from fMRI. Anderson et al. (2015), in turn, find that linguistic-only semantic models better predict brain activity associated with linguistic processing, whereas image-based semantic models better predict similarity within the visual processing portions of the brain. Bulat et al. (2017) compare and evaluate a range of distributional semantic models in their ability to predict brain activity associated with concepts. Two key differences between our work and both Anderson et al. (2013) and Anderson et al. (2015) are that (1) we make use of neural-network-based visual features as opposed to SIFT features (Lowe, 2004), and (2) we perform a word-level decoding analysis as opposed to representational similarity analysis (Kriegeskorte et al., 2008).
We aim to further our understanding of the role of vision in semantic processing by evaluating our models on the task of decoding brain activity associated with the meanings of nouns.

Data
Visual Data In the first experiment, we used the Visual Genome (Krishna et al., 2016) dataset of images manually annotated with objects and their bounding boxes. In the second experiment, we trained Faster R-CNN networks on manually annotated images from ImageNet (Deng et al., 2009; Fei-Fei, 2010), and then processed images retrieved from Google Images to construct a dataset of automatically annotated images. Both Visual Genome and ImageNet were selected because they contain bounding box annotations around objects.
Brain Imaging Data We used a dataset of brain activity patterns associated with the meanings of nouns created by Mitchell et al. (2008) (MITCHELL). The dataset includes 60 concrete nouns from 12 semantic categories, such as vehicles or vegetables. fMRI images were recorded when participants were presented with line drawings of the objects and the corresponding nouns. We use 50 nouns from the dataset in our experiments, since 10 of the nouns were not covered by the Visual Genome and ImageNet datasets.
Following Mitchell et al. (2008), we select the 500 voxels with the most stable activation profile across concepts. We perform leave-two-out cross validation and select voxels independently for each of the cross validation folds during training. The stability score for a voxel is measured across six presentations of a word and is approximated as the average pairwise Pearson correlation among activation profiles over the training words in a cross-validation fold. The 500 voxels with the highest stability score are chosen and combined into a vector, used to evaluate how well the multimodal models can decode brain activity patterns.
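The voxel selection step described above can be illustrated with a minimal NumPy sketch. The array layout (presentations x words x voxels) and the function names `stability_scores` and `select_stable_voxels` are assumptions made for illustration and are not taken from the original implementation.

```python
from itertools import combinations

import numpy as np


def stability_scores(data):
    """Average pairwise Pearson correlation per voxel.

    data: array of shape (n_presentations, n_words, n_voxels).
    (Hypothetical layout chosen for this sketch.)
    """
    # Centre and unit-normalise each presentation's profile over words.
    centred = data - data.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(centred, axis=1, keepdims=True)
    z = centred / np.where(norms == 0, 1.0, norms)
    # For centred, unit-norm profiles, Pearson r is the dot product.
    pairs = combinations(range(data.shape[0]), 2)
    corrs = [np.sum(z[a] * z[b], axis=0) for a, b in pairs]
    return np.mean(corrs, axis=0)


def select_stable_voxels(data, train_idx, k=500):
    """Indices of the k most stable voxels, using training words only."""
    scores = stability_scores(data[:, train_idx, :])
    return np.argsort(scores)[::-1][:k]
```

In a leave-two-out setting, `train_idx` would hold the indices of the 48 nouns not held out in the current fold, so that the two test nouns never influence which voxels are selected.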

Methods
We construct three visual models using three types of visual information: the internal features of the object, the external context surrounding it, and the whole image. These representations are then combined with linguistic representations to create the multimodal models.

Learning linguistic representations
We use the skip-gram model with negative sampling (Mikolov et al., 2013) to learn 100-dimensional word embeddings from a lemmatized 2015 copy of Wikipedia (Rimell et al., 2016).

Learning visual representations
Object detection and segmentation We use the FRCNN unified object detection model (Ren et al., 2015) to automatically detect objects and their bounding boxes in images associated with our nouns. FRCNN combines a region proposal network (RPN) with Fast R-CNN, an object detection network, and minimizes computational cost during training and testing by sharing convolutional layers between the networks. To maximize accuracy, we train an FRCNN network for each semantic class in the MITCHELL dataset, starting from a VGG16 network (Simonyan and Zisserman, 2014) pre-trained on the PASCAL VOC 2007 data set.
The pre-trained model contains many useful lower level features and therefore we expect fine-tuning a pre-trained model to yield optimal results. We train the networks using ImageNet images annotated with bounding boxes. We collected an average of 303 images per concept, with the following nouns lacking annotated images: foot, arm, eye, igloo, pliers and carrot. Images were split into 10% test, 40% train, and 50% train-validation sets. We trained the networks using approximate joint training. We tuned the step-size to 3000 and used the following default hyperparameter values: learning rate policy: "step"; base learning rate: 0.001; average loss: 100; momentum: 0.9; weight decay: 0.0005; gamma: 0.1. After training, the mean average precision (mAP) score across all semantic classes was 0.73.
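As an illustration of how the hyperparameters above interact, the "step" learning rate policy multiplies the base learning rate by gamma once every step-size iterations. The sketch below assumes the usual Caffe definition of this policy; the function name `step_learning_rate` is hypothetical.

```python
def step_learning_rate(iteration, base_lr=0.001, gamma=0.1, step_size=3000):
    """Learning rate under a "step" policy (assumed definition):
    base_lr * gamma ** floor(iteration / step_size)."""
    return base_lr * gamma ** (iteration // step_size)
```

With the values reported above, the rate stays at 0.001 for the first 3000 iterations and is divided by ten at each subsequent multiple of 3000.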

Extracting visual features
We retrieve 60 images per word using Google Image Search. We then create three sets of images for every word: the INTERNAL image (containing the object denoted by the word), an EXTERNAL image (containing its visual context), and the original WHOLE image. To generate the internal images, we crop and extract each object from within the annotated bounding boxes. To generate external images, we fill in the annotated bounding box area with black pixels, leaving only the visual context (black pixels are used as a simple way to represent no information). All images are re-scaled to 256x256 and the original aspect ratios are maintained, padding any remaining area with black pixels.
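The cropping and masking steps can be sketched as follows. This is a minimal NumPy illustration rather than the original code: the box format (x_min, y_min, x_max, y_max) with exclusive maxima, the centred placement of the padded image, and the function names are all assumptions. The final resize to 256x256 is omitted, since it would normally be delegated to an image library.

```python
import numpy as np


def split_by_bounding_box(image, box):
    """Return (internal, external) views of an image.

    image: (H, W, C) array; box: (x_min, y_min, x_max, y_max),
    with exclusive maxima (an assumed convention).
    """
    x0, y0, x1, y1 = box
    # INTERNAL: the cropped object.
    internal = image[y0:y1, x0:x1].copy()
    # EXTERNAL: the context, with the object area blacked out.
    external = image.copy()
    external[y0:y1, x0:x1] = 0
    return internal, external


def pad_to_square(image):
    """Pad with black pixels to a square, preserving aspect ratio.

    The image is centred on the canvas (an assumption of this sketch).
    """
    h, w = image.shape[:2]
    side = max(h, w)
    canvas = np.zeros((side, side) + image.shape[2:], dtype=image.dtype)
    top, left = (side - h) // 2, (side - w) // 2
    canvas[top:top + h, left:left + w] = image
    return canvas
```

Padding to a square before resizing is one simple way to keep the original aspect ratio when the network expects a fixed square input.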
We use a Caffe (Jia et al., 2014) implementation of a pre-trained AlexNet model (Krizhevsky et al., 2012) to extract a visual representation for each of the images. We first take an image as input to the network, perform a forward pass, and extract the pre-softmax layer in the network (FC7) as a representation of the image. We use the MMfeat toolkit (Kiela, 2016) [...] to the nouns in our data set.
Table 2: Average decoding accuracies over the nine participants for the semantic models trained on the automatically annotated images. Naming convention follows Table 1.

Multimodal Models
We construct multimodal models by concatenating L2-normalised linguistic and visual representations. This strategy, known as middle fusion, has been shown to be successful in previous multimodal semantics research (Kiela and Bottou, 2014). We combine the linguistic model with each of our visual models, resulting in three kinds of multimodal models: INTERNAL, EXTERNAL and WHOLE. Furthermore, we construct two combined models: a COMBINED visual-only model concatenating the internal and external models, and a COMBINED multimodal model concatenating the internal, external, and linguistic models.
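The fusion step amounts to normalising each modality to unit length and concatenating the results. A minimal sketch, with the hypothetical helper names `l2_normalise` and `fuse`:

```python
import numpy as np


def l2_normalise(vector):
    """Scale a vector to unit L2 norm (zero vectors are returned as-is)."""
    vector = np.asarray(vector, dtype=float)
    norm = np.linalg.norm(vector)
    return vector / norm if norm > 0 else vector


def fuse(*modalities):
    """Concatenate L2-normalised representations from each modality."""
    return np.concatenate([l2_normalise(m) for m in modalities])
```

For example, `fuse(linguistic, internal)` would give an INTERNAL multimodal vector, and `fuse(internal, external, linguistic)` the COMBINED multimodal vector. Normalising before concatenation keeps a modality with larger raw magnitudes from dominating similarity computations.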

Decoding Brain Activity
We evaluate our models in their ability to decode brain activity associated with unseen words, i.e. to predict the correct label associated with their fMRI patterns. We follow the same procedure as Anderson et al. (2016), computing a semantic model similarity matrix consisting of semantic model similarity codes for each of the 50 nouns from the Mitchell et al. (2008) dataset. Similarly, we construct a brain activity similarity matrix consisting of brain activity similarity codes of the 50 nouns. This process is visualised in Figure 2, where the coloured columns represent semantic model vectors for each word in the dataset, and the bottom row represents the resulting similarity codes for the concept "Leg".
We perform leave-two-out cross validation, selecting the semantic model similarity codes $(\vec{s}_i, \vec{s}_j)$ and brain activity similarity codes $(\vec{a}_i, \vec{a}_j)$ for two nouns. We remove the i-th and j-th elements from each of the similarity codes as these entries correspond to the nouns being tested. Figure 3 visualises an example of the decoding procedure. Decoding is successful if the sum of Pearson correlations for the correct pairings is greater than the sum of Pearson correlations for the incorrect pairings, resulting in decoding accuracy of 1 for this pair and 0 otherwise. The expected chance-level decoding accuracy is 50% if a model were to match word labels with similarity vectors at random.
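The decoding procedure can be sketched in a few lines of NumPy. This is an illustrative reading of the description above, not the original code: it assumes Pearson correlation as the similarity measure used to build the similarity codes, omits the per-fold voxel selection, and uses hypothetical function names.

```python
from itertools import combinations

import numpy as np


def similarity_codes(vectors):
    """Row i holds the similarity of item i to every item.

    Pearson correlation is an assumption of this sketch.
    """
    return np.corrcoef(vectors)


def pearson(a, b):
    return np.corrcoef(a, b)[0, 1]


def decode_pair(model_codes, brain_codes, i, j):
    """Return 1 if the held-out pair (i, j) is matched correctly, else 0."""
    # Drop the i-th and j-th entries, which refer to the test nouns.
    keep = np.ones(model_codes.shape[0], dtype=bool)
    keep[[i, j]] = False
    s_i, s_j = model_codes[i, keep], model_codes[j, keep]
    a_i, a_j = brain_codes[i, keep], brain_codes[j, keep]
    correct = pearson(s_i, a_i) + pearson(s_j, a_j)
    incorrect = pearson(s_i, a_j) + pearson(s_j, a_i)
    return int(correct > incorrect)


def decoding_accuracy(model_vectors, brain_vectors):
    """Mean leave-two-out decoding accuracy over all noun pairs."""
    model_codes = similarity_codes(model_vectors)
    brain_codes = similarity_codes(brain_vectors)
    n = model_codes.shape[0]
    hits = [decode_pair(model_codes, brain_codes, i, j)
            for i, j in combinations(range(n), 2)]
    return float(np.mean(hits))
```

With 50 nouns, the loop in `decoding_accuracy` visits 1225 pairs, one per cross-validation fold.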

Experiments
We first experiment with a set of manually annotated images from Visual Genome and then with images where objects and their bounding boxes have been automatically detected using FRCNN networks.

Manually annotated images
Experimental Setup We use 50 nouns from the MITCHELL dataset and assess each model's ability to decode brain activity vectors using leave-two-out cross validation, resulting in 1225 (50 choose 2) cross-validation folds per model.

Results
The results, presented in Table 1, demonstrate that all semantic models decode brain activity patterns significantly above chance levels. We investigated the errors produced during the cross-validation folds and found that the INTERNAL visual-only model systematically outperforms its EXTERNAL and WHOLE counterparts for all but one semantic class: kitchen utensils, where the EXTERNAL visual-only model obtains the fewest errors. Overall, these results suggest that internal visual features are superior in this task and correlate strongly with the patterns of human semantic representation.

Automatically annotated images
Experimental Setup For each of our 50 nouns from the MITCHELL dataset, we retrieve 60 images using Google Image Search. The images are annotated using FRCNNs and then processed to create INTERNAL, EXTERNAL and WHOLE models. We follow the same evaluation procedure as in the previous experiment, performing 1225 (50 choose 2) cross-validation folds.

Results
The results, presented in Table 2, demonstrate that all models decode brain activity vectors significantly above chance level. They also show that multimodal models constructed with automatic object detection perform on par with representations learned from manually annotated images. Overall, we observe a similar trend, i.e. the INTERNAL visual-only model significantly outperforms (V=43, p<0.015) the EXTERNAL visual-only model (mean accuracies of 0.80 and 0.74).
Our qualitative analysis has shown that the INTERNAL visual model outperforms the others for the following semantic classes, in both experiments: building, furniture and insect. We find that the WHOLE visual-only model has fewer class-level errors in this experiment. We believe this is due to the quality of the images; the Visual Genome images contain more objects per image on average, making the external visual context more variable compared to images from Google Images.
Besides corroborating the findings of the previous experiment on the importance of the internal visual features, these results show that high quality visual representations capturing the objects' internal properties and their visual context can be learned through automatic object detection techniques. This decreases the reliance on human-annotated datasets (although some annotated data is still required to train the object detection system) and allows for greater scalability of the models.

Conclusion
Our results show that multimodal semantic models correlate with human neural semantic representations associated with concrete concepts, and that the visual-only model using internal visual features outperforms the other visual-only models in most cases. Similar performance across models using manually and automatically annotated images demonstrates progress in object detection systems, presenting opportunities to expand to other tasks where evaluation datasets may not be covered by manually annotated image datasets.