Comparing Data Sources and Architectures for Deep Visual Representation Learning in Semantics

Multi-modal distributional models learn grounded representations for improved performance in semantics. Deep visual representations, learned using convolutional neural networks, have been shown to achieve particularly high performance. In this study, we systematically compare deep visual representation learning techniques, experimenting with three well-known network architectures. In addition, we explore the various data sources that can be used for retrieving relevant images, showing that images from search engines perform as well as, or better than, those from manually crafted resources such as ImageNet. Furthermore, we explore the optimal number of images and the multi-lingual applicability of multi-modal semantics. We hope that these findings can serve as a guide for future research in the field.

Most multi-modal semantic models tend to rely on raw images as the source of perceptual input. Many data sources have been tried, ranging from image search engines to photo sharing websites to manually crafted resources. Images are retrieved for a given target word if they are ranked highly, have been tagged, or are otherwise associated with the target word(s) in the data source.
Traditionally, representations for images were learned through bag-of-visual-words (Sivic and Zisserman, 2003), using SIFT-based local feature descriptors (Lowe, 2004). Kiela and Bottou (2014) showed that transferring representations from deep convolutional neural networks (ConvNets) yields much better performance than bag-of-visual-words in multi-modal semantics. ConvNets (LeCun et al., 1998) have become very popular in recent years: they are now the dominant approach for almost all recognition and detection tasks in the computer vision community (LeCun et al., 2015), approaching or even exceeding human performance in some cases (Weyand et al., 2016). The work by Krizhevsky et al. (2012), which won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) (Russakovsky et al., 2015) in 2012, has played an important role in bringing convolutional networks (back) to prominence. A similar network was used by Kiela and Bottou (2014) to obtain high-quality image embeddings for semantics. This work aims to provide a systematic comparison of such deep visual representation learning techniques and data sources; that is, we aim to answer the following open questions in multi-modal semantics:
• Does the improved performance over bag-of-visual-words extend to different convolutional network architectures, or is it specific to Krizhevsky's AlexNet? Do others work even better?
• How important is the source of images? Is there a difference between search engines and manually annotated data sources? Does the number of images obtained for each word matter?
• Do these findings extend to different languages beyond English?
We evaluate semantic representation quality by examining how well a system's similarity scores correlate with human similarity and relatedness judgments. We examine both the visual representations themselves and the multi-modal representations that fuse visual representations with linguistic input, in this case using middle fusion (i.e., concatenation). To the best of our knowledge, this work is the first to systematically compare these aspects of visual representation learning.

Architectures
In this section, we describe the network architectures and their properties. We use the MMFeat toolkit (Kiela, 2016) to extract image representations from three well-known architectures: AlexNet (Krizhevsky et al., 2012), GoogLeNet (Szegedy et al., 2015) and VGGNet (Simonyan and Zisserman, 2014). Image representations are turned into an overall word-level visual representation by taking either the mean or the elementwise maximum of the relevant image representations. All three networks are trained to maximize the multinomial logistic regression objective using mini-batch gradient descent with momentum:

J = (1/D) ∑_{i=1}^{D} ∑_{k=1}^{K} 1{y^(i) = k} log p(y^(i) = k | x^(i))

where 1{·} is the indicator function, and x^(i) and y^(i) are the input and output of the i-th training example, respectively. D is the number of training examples and K is the number of classes. The networks are trained on the ImageNet classification task and we transfer layers from the pre-trained networks. See Table 1 for an overview.
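The objective above can be sketched in a few lines of numpy. This is a toy illustration of the quantity being maximized, not the actual training code (which runs mini-batch gradient descent with momentum over ImageNet); the logits and labels below are made up for demonstration:

```python
import numpy as np

def softmax(z):
    """Row-wise softmax over class logits, shifted for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def log_likelihood(logits, labels):
    """Multinomial logistic regression objective:
    J = (1/D) sum_i log p(y^(i) | x^(i)), i.e. the mean log-probability
    assigned to the correct class. Training maximizes this quantity."""
    probs = softmax(logits)                       # (D, K) class distributions
    d = logits.shape[0]
    return float(np.mean(np.log(probs[np.arange(d), labels])))

# Toy example: D = 4 examples, K = 3 classes.
logits = np.array([[2.0, 0.1, 0.1],
                   [0.1, 2.0, 0.1],
                   [0.1, 0.1, 2.0],
                   [2.0, 0.1, 0.1]])
labels = np.array([0, 1, 2, 0])
J = log_likelihood(logits, labels)
```

Since each term is a log-probability, J is at most zero, and assigning higher probability to the correct classes moves it toward zero.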
AlexNet The network by Krizhevsky et al. (2012) introduces the following architecture: five convolutional layers, followed by two fully connected layers, where the final layer is fed into a softmax which produces a distribution over the class labels. All layers apply rectified linear units (ReLUs) (Nair and Hinton, 2010) and use dropout for regularization. This network won the ILSVRC 2012 ImageNet classification challenge. In our case, we actually use the CaffeNet reference model, which is a replication of AlexNet, with the difference that it is not trained with relighting data augmentation, and that the order of pooling and normalization layers is switched (in CaffeNet, pooling is done before normalization, instead of the other way around). While it uses an almost identical architecture, the performance of CaffeNet is slightly better than that of the original AlexNet.
GoogLeNet The ILSVRC 2014 challenge-winning GoogLeNet (Szegedy et al., 2015) uses "inception modules" as a network-in-network method (Lin et al., 2013) for enhancing model discriminability for local patches within the receptive field. It uses much smaller receptive fields and explicitly focuses on efficiency: while it is much deeper than AlexNet, it has far fewer parameters. Its architecture consists of two convolutional layers, followed by inception layers that culminate in an average pooling layer feeding into the softmax decision (so it has no fully connected layers). Dropout is only applied to the final layer. All connections use rectifiers.
VGGNet VGGNet (Simonyan and Zisserman, 2014) won the localization task of the ILSVRC 2014 challenge and was the runner-up in classification. Like GoogLeNet, it is much deeper than AlexNet and uses smaller receptive fields. It has many more parameters than the other networks. It consists of a series of convolutional layers followed by fully connected ones. All layers use rectified linear units, and dropout is applied to the first two fully connected layers.
These networks were selected because they are very well known in the computer vision community. They exhibit interesting qualitative differences in terms of their depth (i.e., the number of layers), the number of parameters, regularization methods and the use of fully connected layers. All three were top-performing architectures in the ILSVRC ImageNet challenges.

Sources of Image Data
Some systematic studies of parameters for text-based distributional methods have found that the source corpus has a large impact on representational quality (Bullinaria and Levy, 2007; Kiela and Clark, 2014). The same is likely to hold for images: Kiela and Bottou (2014) compare ImageNet (Deng et al., 2009) and the ESP Game dataset (von Ahn and Dabbish, 2004), but most works use a single data source. In this study, one of our objectives is to assess the quality of various sources of image data. Table 2 provides an overview of the data sources, and Figure 1 shows some example images. We examine the following corpora:

Google Images Google's image search results have been found to be comparable to hand-crafted image datasets (Fergus et al., 2005).
Bing Images An alternative image search engine is Bing Images. It uses different underlying technology from Google Images, but offers the same functionality as an image search engine.
Flickr Although Bergsma and Goebel (2011) found that Google Images works better in one experiment, the photo sharing service Flickr is an interesting data source because its images are tagged by human annotators.

ImageNet ImageNet (Deng et al., 2009) is a large-scale image database structured according to the WordNet hierarchy (Miller, 1995), by attaching images to the corresponding synset (synonym set).

ESP Game
The ESP Game dataset (von Ahn and Dabbish, 2004) was constructed through a so-called "game with a purpose". Players were matched online and had to agree on an appropriate word label for a randomly selected image within a time limit. Once a word has been mentioned a certain number of times, it becomes a taboo word and can no longer be used as a label.

These data sources have interesting qualitative differences. Online services return images for almost any query, with much better coverage than the fixed-size ImageNet and ESP Game datasets. Search engines annotate automatically, while the others are human-annotated: through a strict annotation procedure in the case of ImageNet, or by letting users tag images, as in the case of Flickr and ESP. Automatic systems sort images by relevance, while the others are unsorted; the relevance ranking method is not accessible, however, and so has to be treated as a black box. Search results can be language-specific, while the human-annotated data sources are restricted to English. Google and Bing return images that are ranked highly, while Flickr contains photos rather than just any kind of image. ImageNet contains high-quality images descriptive of a given synset, meaning that the tagged object is likely to be centered in the image, while the ESP Game and Flickr images may have tags describing events happening in the background as well.

Selecting and processing images
Selecting images for Google, Bing and Flickr is straightforward: using their respective APIs, the desired word is given as the search query and we obtain the top N returned images (unless otherwise indicated, we use N=10). In the case of ImageNet and ESP, images are not ranked and vary greatly in number: for some words there is only a single image, while others have thousands. With ImageNet, we face the additional problem that images tend to be associated only with leaf nodes in the hierarchy. For example, dog has no directly associated images, while its hyponyms (e.g. golden retriever, labrador) have many. If a word has no associated images in its subtree, we go up one level and check whether the parent node's subtree yields any images. We subsequently randomly sample 100 images associated with the word and obtain semi-ranked results by selecting the 10 images whose representations are closest to the median of the sampled image representations. We use the same method for the ESP Game dataset. In all cases, images are resized and center-cropped to match the input size expected by the networks.
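The median-based selection step can be sketched as follows. This is a minimal illustration with synthetic vectors; the distance metric is not specified in the text, so Euclidean distance to the elementwise median is assumed here:

```python
import numpy as np

def select_near_median(image_reprs, n=10):
    """Semi-rank unranked images: keep the indices of the n images whose
    representations lie closest to the elementwise median representation."""
    reprs = np.asarray(image_reprs, dtype=float)   # shape: (num_images, dim)
    median = np.median(reprs, axis=0)              # elementwise median vector
    dists = np.linalg.norm(reprs - median, axis=1) # Euclidean distance (assumed)
    return np.argsort(dists)[:n]                   # n closest, most "typical" images

# Toy usage: sample 100 synthetic "image representations", keep the top 10.
rng = np.random.default_rng(0)
reprs = rng.normal(size=(100, 128))
selected = select_near_median(reprs, n=10)
```

Selecting images near the median acts as outlier removal: mistagged or atypical images tend to sit far from the bulk of the sample and are discarded.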

Evaluation
Representation quality in semantics is usually evaluated using intrinsic datasets of human similarity and relatedness judgments. Model performance is assessed through the Spearman ρ_s rank correlation between the system's similarity scores for the word pairs and the corresponding human judgments. Here, we evaluate on two well-known similarity and relatedness judgment datasets: MEN (Bruni et al., 2012) and SimLex-999 (Hill et al., 2015). MEN focuses explicitly on relatedness (i.e. coffee-tea and coffee-mug get high scores, while bakery-zebra gets a low score), while SimLex-999 focuses on what it calls "genuine" similarity (i.e., coffee-tea gets a high score, while both coffee-mug and bakery-zebra get low scores). Both are standard benchmarks for evaluating representational quality in semantics.
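This evaluation metric can be sketched as below: Spearman ρ_s is simply the Pearson correlation of the rank-transformed scores. This minimal version assumes no tied scores (standard implementations, such as scipy.stats.spearmanr, resolve ties with average ranks):

```python
import numpy as np

def spearman_rho(system_scores, human_scores):
    """Spearman rank correlation between system similarity scores and
    human judgments: Pearson correlation of the ranks (no tie handling)."""
    def ranks(x):
        order = np.argsort(x)
        r = np.empty(len(x))
        r[order] = np.arange(len(x))   # rank of each score in sorted order
        return r
    rs = ranks(np.asarray(system_scores, dtype=float))
    rh = ranks(np.asarray(human_scores, dtype=float))
    return float(np.corrcoef(rs, rh)[0, 1])

# Toy usage: a perfectly monotone relationship gives rho = 1.0.
rho = spearman_rho([0.1, 0.4, 0.2, 0.9], [1.0, 3.0, 2.0, 4.0])
```

Because only ranks matter, ρ_s rewards any monotone agreement with the human ordering, regardless of the scale of the system's similarity scores.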
In each experiment, we examine the performance of the visual representations compared to text-based representations, as well as the performance of the multi-modal representation that fuses the two. In this case, we apply mid-level fusion, concatenating the L2-normalized representations (Bruni et al., 2014). Middle fusion is a popular technique in multi-modal semantics that has several benefits: 1) it allows for drawing from different data sources for each modality, that is, it does not require joint data; 2) concatenation is less susceptible to noise, since it preserves the information in the individual modalities; and 3) it is straightforward to apply and computationally inexpensive. Linguistic representations are 300-dimensional and are obtained by applying skip-gram with negative sampling (Mikolov et al., 2013) to a recent dump of Wikipedia. The normalization step performed before fusion ensures that both modalities contribute equally to the overall multi-modal representation.

Results
As Table 3 shows, the data sources vary in coverage: it would be unfair to compare data sources on the different subsets of the evaluation datasets that they cover. That is, when comparing data sources we want to make sure we evaluate on images for the exact same word pairs. When comparing network architectures, however, we are less interested in the relative coverage between datasets and more interested in overall performance, in such a way that it can be compared to other work evaluated on the fully covered datasets. Hence, we report results on the maximally covered subset per data source, which we refer to as MEN and SimLex, as well as on the common subset of word pairs that have images in all of the sources, which we refer to as MEN* and SimLex*.

Table 4 shows the results on the maximally covered datasets. This means we cannot directly compare between data sources, because they have different coverage, but we can look at absolute performance and compare network architectures. The first row reports results for the text-based linguistic representations obtained from Wikipedia (repeated across columns for convenience). For each of the three architectures, we evaluate on SimLex (SL) and MEN, using either the mean (Mean) or elementwise maximum (Max) method for aggregating image representations into visual ones (see Section 2). For each data source, we report results for the visual representations, as well as for the multi-modal representations that fuse the visual and textual ones together. Performance across architectures is remarkably stable: we have had to report results up to three decimal places to show the difference in performance in some cases. For each of the network architectures, we see a marked improvement of multi-modal representations over uni-modal linguistic representations. In many cases, we also see visual representations outperforming linguistic ones, especially on SimLex.
This is interesting, because e.g. Google and Bing have full coverage over the datasets, so their visual representations include highly abstract words, which does not appear to have an adverse impact on the method's performance. For the ESP Game dataset (on which performance is quite low) and ImageNet, we observe an increase in performance as we move to the right in the table. Interestingly, VGGNet on ImageNet scores very highly, which seems to indicate that VGGNet is somehow more "specialized" on ImageNet than the others. The difference between mean and max aggregation is relatively small, although the former seems to work better for SimLex while the latter does slightly better for MEN.

Table 5 shows the results on the common subset of the evaluation datasets, where all word pairs have images in each of the data sources. First, note the same patterns as before: multi-modal representations perform better than linguistic ones. Even for the poorly performing ESP Game dataset, the VGGNet representations perform better on both SimLex and MEN (bottom right of the table). Visual representations from Google, Bing, Flickr and ImageNet all perform much better than ESP Game on this commonly covered subset. In a sense, the full-coverage datasets were "punished" for their ability to return images for abstract words in the previous experiment: on this subset, which is more concrete, the search engines do much better. To a certain extent, including linguistic information is actually detrimental to performance, with multi-modal representations performing worse than purely visual ones. Again, we see the marked improvement with VGGNet for ImageNet, while Google, Bing and Flickr all do very well, regardless of the architecture.

Common subset comparison
These numbers indicate the robustness of the approach: multi-modal representation learning yields better performance across the board, for different network architectures, different data sources and different aggregation methods. If computational efficiency or memory usage is a concern, then GoogLeNet or AlexNet are the best choices. The ESP Game dataset does not appear to work very well and is best avoided. Given the right coverage, ImageNet gives good results, especially in combination with VGGNet. However, coverage is often the main issue, in which case Google and Bing yield images that are comparable to, or even better than, images from the carefully annotated ImageNet.

Number of images
Another question is how many images to use: does performance increase with more images? Is it always better to have seen 100 cats instead of only 10, or do we have enough information after having seen one or two already? There is an obvious trade-off here, since downloading and processing images takes time (and may incur financial costs). This experiment only applies to relevance-sorted data sources: the image selection procedure for ImageNet and ESP Game is more about removing outliers than about finding the best possible images.
As Figure 2 shows, performance stabilizes surprisingly quickly: around 10-20 images appears to be enough, and in some cases already too many. Performance across networks does not vary dramatically when using more images, but in the case of Flickr images on the MEN dataset, performance drops significantly as the number of images increases.

Multi- and cross-lingual applicability
Although there are some indicators that visual representation learning extends to other languages, particularly in the case of bilingual lexicon learning (Bergsma and Van Durme, 2011; Kiela et al., 2015b; Vulić et al., 2016), this has not been shown directly on the same set of human similarity and relatedness judgments. In order to examine the multi-lingual applicability of our findings, we train linguistic representations on recent dumps of the English and Italian Wikipedia. We then search for 10 images per word on Google and Bing, while setting the language to English or Italian. We compare the results on the original SimLex, and the Italian version from Leviant and Reichart (2015).
Similarly, we examine a cross-lingual scenario, where we translate Italian words into English using Google Translate. We then obtain images for the translated words and extract visual representations, which are subsequently evaluated in the same way as in the multi-lingual setting.

The results can be found in Table 6. We find the same pattern: in all cases, visual and multi-modal representations outperform linguistic ones. The Italian version of SimLex appears to be more difficult than the English version. Google performs better than Bing, especially on the Italian evaluations. For Google, the cross-lingual scenario works better, while Bing yields better results in the multi-lingual setting, where we use the language itself instead of mapping to English. Although somewhat preliminary, these results clearly indicate that multi-modal semantics can be fruitfully applied to languages other than English.

Conclusion and future work
The objective of this study has been to systematically compare network architectures and data sources for multi-modal systems. In particular, we focused on the capabilities of deep visual representations in capturing semantics, as measured by correlation with human similarity and relatedness judgments. Our findings can be summarized as follows:
• We examined AlexNet, GoogLeNet and VGGNet, all three top performers in recent ILSVRC ImageNet challenges (Russakovsky et al., 2015), and found that they perform very similarly. If efficiency or memory are issues, AlexNet or GoogLeNet are the most suitable architectures. For overall best performance, AlexNet and VGGNet are the best choices.
• The choice of data source appeared to have a bigger impact: Google, Bing, Flickr and ImageNet were much better than the ESP Game dataset. Google, Flickr and Bing have the advantage of potentially unlimited coverage. Google and Bing are particularly suited to full-coverage experiments, even when these include abstract words.
• We found that the number of images has an impact on performance, but that it stabilizes at around 10-20 images, indicating that it is usually not necessary to obtain more than 10 images per word. For Flickr, obtaining more images is detrimental to performance.
• Lastly, we established that these findings extend to other languages beyond English, obtaining the same findings on an Italian version of SimLex using the Italian Wikipedia. We examined both the multi-lingual setting where we obtain search results using the Italian language and a cross-lingual setting where we mapped Italian words to English and retrieved images for those.
This work answers several open questions in multi-modal semantics and we hope that it will serve as a guide for future research in the field. It is important to note that the multi-modal results only apply to the mid-level fusion method of concatenating normalized vectors: although these findings are indicative of performance for other fusion methods, different architectures or data sources may be more suitable for different fusion methods.
In future work, downstream tasks should be addressed: it is good that multi-modal semantics improves performance on intrinsic evaluations, but it is important to show its practical benefits in more applied tasks as well. Understanding what it is that makes these representations perform so well is another important and yet unanswered question. We hope that this work may be used as a reference in determining some of the choices that can be made when developing multi-modal models.