Spectral Graph-Based Method of Multimodal Word Embedding

In this paper, we propose a novel method for multimodal word embedding, which exploits a generalized framework of multi-view spectral graph embedding to take into account the visual appearances or scenes denoted by words in a corpus. We evaluated our method through word similarity tasks and a concept-to-image search task, and found that it provides word representations that reflect visual information, at some cost in performance on the word similarity tasks. Moreover, we demonstrate that our method captures multimodal linguistic regularities, which enable recovering relational similarities between words and images by vector arithmetic.


Introduction
Word embedding plays an important role in the field of Natural Language Processing (NLP). Many existing studies use word vectors for various downstream NLP tasks, such as text classification, part-of-speech tagging, and machine translation. One of the most famous approaches is the skip-gram model (Mikolov et al., 2013), which is based on a neural network, and its extensions have also been widely studied.
There are alternative approaches to word embedding that rely on a spectral graph embedding framework (Yan et al., 2007; Huang et al., 2012). For example, Dhillon et al. (2015) proposed a method based on Canonical Correlation Analysis (CCA) (Hotelling, 1936), while a PCA-based word embedding method was proposed in Lebret and Collobert (2014).
In recent years, many researchers have been actively studying the use of multiple modalities in the fields of both NLP and computer vision. Those studies combine textual and visual information to propose methods for image-caption matching (Yan and Mikolajczyk, 2015), caption generation (Kiros et al., 2014), visual question answering (Antol et al., 2015), quantifying abstractness of words, and so on.
* This work was partially supported by grants from Japan Society for the Promotion of Science KAKENHI (16H02789) to HS.
As for word embedding, multimodal versions of word2vec (Mikolov et al., 2013) have been proposed in Lazaridou et al. (2015) and Kottur et al. (2016). The former jointly optimizes the skip-gram objective and a cross-modal objective across texts and images, and the latter uses abstract scenes as surrogate labels for capturing visually grounded semantic relatedness. More recently, Mao et al. (2016) proposed a multimodal word embedding method based on a recurrent neural network to learn word vectors from their newly proposed large-scale image caption dataset.
In this paper, we introduce a new spectral graph-based method of multimodal word embedding. Specifically, we extend Eigenwords (Dhillon et al., 2015), a CCA-based method for word embedding, by applying a generalized framework of spectral graph embedding (Nori et al., 2012; Shimodaira, 2016). Figure 1 shows a schematic diagram of our method.
In the rest of this paper, we call our method Multimodal Eigenwords (MM-Eigenwords). The most similar existing method is the Multimodal Skip-gram model (MMskip-gram) (Lazaridou et al., 2015). Our model differs from it in that it can easily deal with many-to-many relationships between words in a corpus and their relevant images, while MMskip-gram only considers one-to-one relationships between concrete words and images.
Using a corpus and datasets of image-word relationships, which are available in common benchmark datasets or on online photo sharing services, MM-Eigenwords jointly learns word vectors on a common multimodal space and a linear mapping from a visual feature space to the multimodal space. Those word vectors also reflect similarities between words and images. We evaluated the multimodal word representations obtained by our model through word similarity tasks and concept-to-image search, and found that our model is able to capture both semantic and word-to-image similarities. We also found that our model captures multimodal linguistic regularities (Kiros et al., 2014), examples of which are shown in Figure 2b.

Figure 1: Our proposed method extends a CCA-based method of word embedding by means of multi-view spectral graph embedding frameworks of dimensionality reduction to deal with visual information associated with words in a corpus.

Multi-view Spectral Graph Embedding
A spectral graph perspective of dimensionality reduction was first proposed in Yan et al. (2007), which showed that several major statistical methods for dimensionality reduction, such as PCA and Eigenmap (Belkin and Niyogi, 2003), can be written in the form of a graph embedding framework, where data points are nodes connected to other points by weighted links. Huang et al. (2012) extended this work to two-view data with many-to-many relationships (or links) and showed that their two-view graph embedding framework includes CCA, one of the most popular methods for multi-view data analysis, as a special case. However, available datasets may have more than two views with complex graph structures, which are unmanageable for CCA or Multiset CCA (Kettenring, 1971), whose inputs must be fed in the form of n-tuples. Shimodaira (2016) further generalized the graph embedding frameworks to deal with many-to-many relationships between any number of views, and Nori et al. (2012) also proposed an equivalent method for multimodal relation prediction in social data. This generalized framework has been used to extend Eigenwords to cross-lingual word embedding (Oshikiri et al., 2016), where vocabularies and contexts of multiple languages are linked through sentence-level alignment. Our proposed method also makes use of the framework of Shimodaira (2016) to extend Eigenwords to multimodal word embedding.

Eigenwords (One Step CCA)
Canonical Correlation Analysis (Hotelling, 1936) is a multivariate analysis method for finding optimal projections of two sets of data vectors by maximizing the correlations. Applying CCA to pairs of raw word vectors and raw context vectors, Eigenwords algorithms attempt to find low-dimensional vector representations of words (Dhillon et al., 2015). Here we explain the simplest version of Eigenwords called One Step CCA (OSCCA).
We have a corpus consisting of $T$ tokens, $(t_i)_{i=1,\dots,T}$, and a vocabulary consisting of $V$ word types, $\{v_i\}_{i=1,\dots,V}$. Each token $t_i$ is drawn from this vocabulary. We define a word matrix $\mathbf{V} \in \{0, 1\}^{T \times V}$ whose $i$-th row encodes the token $t_i$ by a 1-of-$V$ representation: the $j$-th element is 1 if the word type of $t_i$ is $v_j$, and 0 otherwise.
Let $h$ be the size of the context window. We define a context matrix $\mathbf{C} \in \{0, 1\}^{T \times 2hV}$ whose $i$-th row represents the surrounding context of the token $t_i$.
We apply CCA to the $T$ pairs of row vectors of $\mathbf{V}$ and $\mathbf{C}$. The objective function of CCA is constructed using $\mathbf{V}^\top \mathbf{V}$, $\mathbf{V}^\top \mathbf{C}$, and $\mathbf{C}^\top \mathbf{C}$, which represent occurrence and co-occurrence counts of words and contexts. In Eigenwords, however, we apply the following preprocessing to these matrices before constructing the objective function. First, the centering of $\mathbf{V}$ and $\mathbf{C}$ is omitted, and the off-diagonal elements of $\mathbf{C}^\top \mathbf{C}$ are ignored to simplify the computation of inverse matrices. Second, we take the square root of the elements of these matrices to "squash" the heavy-tailed word count distributions. Finally, we obtain vector representations of words from the CCA solution computed on these preprocessed matrices.

Figure 2: (a) Word-to-image search, showing each query with its top-match images. Queries: "bird", "bird" + "white", "bird" + "flying". Words shown in the panel: "birds", "feathers", "bird watcher", "avain", "aves", "raptor", "perch", "hawk". (b) Multimodal linguistic regularities, with image queries modified by - "day" + "night", - "brown" + "white", and - "yellow" + "red".
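The OSCCA procedure described in this section can be sketched in a few lines of NumPy. This is a minimal dense illustration, not the Eigenwords implementation: the function names are hypothetical, the context layout (one 1-of-V block per window offset) is an assumption suggested by the $T \times 2hV$ shape, and the last step assumes the standard CCA solution via an SVD of the whitened cross-count matrix. A corpus of realistic size would require sparse matrices and a randomized SVD.

```python
import numpy as np

def word_matrix(token_ids, vocab_size):
    """T x V matrix whose i-th row is the 1-of-V encoding of token i."""
    T = len(token_ids)
    V = np.zeros((T, vocab_size))
    V[np.arange(T), token_ids] = 1.0
    return V

def context_matrix(token_ids, vocab_size, h):
    """T x 2hV matrix; assumed layout: one 1-of-V block per window offset."""
    T = len(token_ids)
    C = np.zeros((T, 2 * h * vocab_size))
    offsets = [o for o in range(-h, h + 1) if o != 0]
    for slot, off in enumerate(offsets):
        for i in range(T):
            j = i + off
            if 0 <= j < T:  # positions outside the corpus stay all-zero
                C[i, slot * vocab_size + token_ids[j]] = 1.0
    return C

def inv_sqrt(x):
    """Element-wise x^(-1/2), leaving zero entries at zero."""
    out = np.zeros_like(x)
    pos = x > 0
    out[pos] = x[pos] ** -0.5
    return out

def oscca(token_ids, vocab_size, h, K):
    """Word vectors from CCA without centering, with diagonal covariances."""
    V = word_matrix(token_ids, vocab_size)
    C = context_matrix(token_ids, vocab_size, h)
    # Square roots "squash" the heavy-tailed counts; only the diagonals of
    # V^T V and C^T C are kept (V^T V is already diagonal for 1-of-V rows).
    c_vv = np.sqrt(np.diag(V.T @ V))
    c_vc = np.sqrt(V.T @ C)
    c_cc = np.sqrt(np.diag(C.T @ C))
    dv, dc = inv_sqrt(c_vv), inv_sqrt(c_cc)
    # Whitened cross-count matrix; its top-K left singular vectors give
    # the canonical directions for the word view (assumed final step).
    M = dv[:, None] * c_vc * dc[None, :]
    U, _, _ = np.linalg.svd(M, full_matrices=False)
    return dv[:, None] * U[:, :K]
```

For example, `oscca([0, 1, 2, 0, 1, 2, 0, 1], vocab_size=3, h=1, K=2)` returns a 3 x 2 array with one row per word type.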

Multimodal Eigenwords
In this section, we introduce Multimodal Eigenwords (MM-Eigenwords) by extending the CCA-based model of Eigenwords to obtain multimodal representations across words and images.
Suppose we have $N_{\mathrm{vis}}$ images, and each image is associated with multiple tags (or words). These associations are denoted by $\tilde{w}_{ij} \geq 0$ ($1 \leq i \leq V$, $1 \leq j \leq N_{\mathrm{vis}}$), whose value represents the strength of the relationship between the $i$-th word and the $j$-th image. In this study, for example, $\tilde{w}_{ij} = 1$ if the $j$-th image has the $i$-th word as its tag, whereas $\tilde{w}_{ij} = 0$ otherwise, and we define a matrix $W_{VX} = (\tilde{w}_{ij})$. In addition, we denote an image feature matrix by $X_{\mathrm{vis}} \in \mathbb{R}^{N_{\mathrm{vis}} \times p_{\mathrm{vis}}}$ and its $i$-th row vector by $x_i$, and likewise the row vectors of $\mathbf{V}$ and $\mathbf{C}$ by $v_i$ and $c_i$, respectively. Here, the goal of MM-Eigenwords is to obtain multimodal representations by extending the CCA in Eigenwords with generalized frameworks of multi-view spectral graph embedding (Nori et al., 2012; Shimodaira, 2016), which include CCA as a special case. In these frameworks, our goal can be attained by finding optimal linear mappings $A_V$, $A_C$, $A_{\mathrm{vis}}$ to the $K$-dimensional multimodal space that minimize the following objective with a scale constraint.
where $w_{ij} = (\mathbf{V} W_{VX})_{ij}$, and the multimodal term coefficient $\eta \geq 0$ determines the extent to which the model reflects the visual information. Considering a scale constraint, Eq. (1) can be reformulated as follows. We first define some matrices; then the optimization problem of Eq. (1) can be written as Eq. (2). Similar to Eigenwords, we squash $X^\top W X$ and $X^\top M X$ in Eq. (2) by replacing them with $H$ and $G$, respectively, which are defined as follows.
where $\mathrm{diag}(v)$ is a diagonal matrix with $v$ as its diagonal elements, $\mathrm{sqrt}(\cdot)$ represents the element-wise square root, the vectors $m$, $n$ are defined as $m = \mathrm{sqrt}(\mathbf{V}^\top \mathbf{1})$, $n = \eta W_{VX} \mathbf{1}$, and $\circ$ represents the element-wise product. Consequently, our final goal here is to find an optimal linear mapping $A$ that maximizes $\mathrm{Tr}(A^\top H A)$ subject to $A^\top G A = I_K$, and this problem reduces to the generalized eigenvalue problem $Ha = \lambda Ga$. Hence, we can obtain the optimal solution as $\hat{A} = G^{-1/2}(u_1, \dots, u_K)$, where $u_1, \dots, u_K$ are eigenvectors of $(G^{-1/2})^\top H G^{-1/2}$ for the $K$ largest eigenvalues. Note that we obtain the word representations as the rows of $\hat{A}_V$, as well as a linear mapping $\hat{A}_{\mathrm{vis}}$ from the visual space to the common multimodal space, and that when the visual data $X_{\mathrm{vis}}$ is omitted from the model, Eq. (2) is equivalent to CCA, namely, the ordinary Eigenwords.
There are several ways to solve a generalized eigenvalue problem. In this study, we employed a randomized method for generalized Hermitian eigenvalue problems proposed in Saibaba et al. (2016).
Silberer and Lapata (2012) also use CCA to obtain multimodal representations, associating a term-document matrix representing word occurrences in documents with a perceptual matrix containing scores on feature norms (or attributes) such as "is brown" and "has fangs". Their model does not take recent developments in word embedding into account. In addition, feature norms are expensive to obtain, and hence we cannot expect them to be available for a large vocabulary. By contrast, images relevant to a given word are easier to collect.
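The reduction to a generalized eigenvalue problem described above can be illustrated with a small dense example. The sketch below assumes that $H$ is symmetric and $G$ is symmetric positive definite, and it uses SciPy's dense solver `scipy.linalg.eigh` in place of the randomized method of Saibaba et al. (2016), so it is only suitable for small matrices. The function name is hypothetical.

```python
import numpy as np
from scipy.linalg import eigh

def top_k_generalized_eig(H, G, K):
    """Maximize Tr(A^T H A) subject to A^T G A = I_K.

    Assumes H is symmetric and G is symmetric positive definite.
    Returns the K largest generalized eigenvalues (descending) and
    the corresponding eigenvectors as the columns of A.
    """
    # eigh solves H a = lambda G a, returns eigenvalues in ascending order,
    # and normalizes the eigenvectors so that A^T G A = I.
    vals, vecs = eigh(H, G)
    top = np.argsort(vals)[::-1][:K]
    return vals[top], vecs[:, top]
```

In the notation of this section, the rows of the returned matrix would then be split into the blocks corresponding to $\hat{A}_V$, $\hat{A}_C$, and $\hat{A}_{\mathrm{vis}}$.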

Dataset
In our experiment, we used the English Wikipedia corpus (2016 dump), which consists of approximately 3.9 billion tokens. We first used the script provided by Mahoney to clean up the original dump.
Afterward, we applied word2phrase (Mikolov et al., 2013) to the original corpus twice with a threshold value of 500 to obtain multi-term phrases.
As for visual data, we downloaded images from the URLs in the NUS-WIDE image dataset (Chua et al., 2009), which also provides the Flickr tags of each image. Although the Flickr tags associated with each image can be very noisy and vary in abstractness, they provide a rich source of many-to-many relationships between images and words. Since we were interested in investigating whether large but noisy web data can serve as a helpful source for multimodal word representations, we omitted preprocessing such as manually removing noisy or highly abstract tags.
The images were converted to 4096-dimensional feature vectors using the Caffe toolkit (Jia et al., 2014), together with a pre-trained AlexNet model (Krizhevsky et al., 2012). These feature vectors are the output of the fc7 layer of AlexNet. We randomly selected 100k images as a training set.

Word Similarity Task
We compared MM-Eigenwords against Eigenwords and the skip-gram model through word similarity tasks, a common evaluation method for vector word representations. In our experiments, we used MEN (Bruni et al., 2014), SimLex (Hill et al., 2015), and another semantic similarity dataset (Silberer and Lapata, 2014), denoted as SemSim, which provide 3000, 999, and 7576 word pairs, respectively. These datasets provide manually scored word similarities, and the last one also provides visual similarity scores of word pairs, denoted as VisSim. As for model-generated word vectors, the semantic similarity between two word vectors was measured by cosine similarity, and we quantitatively evaluated each embedding method by calculating the Spearman correlation between the model-based and human-annotated scores.
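The evaluation protocol above (cosine similarity between word vectors, then Spearman correlation against human scores) can be sketched as follows. The function names and the input format are illustrative assumptions, as is the choice to skip pairs containing out-of-vocabulary words; the text does not state how such pairs were handled.

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def evaluate_word_similarity(vectors, scored_pairs):
    """Spearman correlation between model and human similarity scores.

    vectors: dict mapping a word to its np.ndarray embedding.
    scored_pairs: iterable of (word1, word2, human_score) triples.
    Pairs with an out-of-vocabulary word are skipped (an assumption).
    """
    model_scores, human_scores = [], []
    for w1, w2, score in scored_pairs:
        if w1 in vectors and w2 in vectors:
            model_scores.append(cosine(vectors[w1], vectors[w2]))
            human_scores.append(score)
    rho, _ = spearmanr(model_scores, human_scores)
    return rho
```

Because Spearman correlation depends only on ranks, the scale of the human scores (which differs across MEN, SimLex, and SemSim) does not need to be normalized.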

Concept-to-Image Search
We also evaluated the accuracy of concept-to-image search to investigate the extent to which our multimodal word representations reflect visual information. In this experiment, we used the 81 manually annotated concepts provided in the NUS-WIDE dataset as queries. In addition, we randomly selected 10k images that were absent during the training phase as test images and used $\hat{A}_{\mathrm{vis}}$ to project them to the textual space, on which top-match images were found by cosine similarities with the query vectors. We evaluated the accuracies of image search by precision at 1, 5, and 10, averaged over all query concepts, while varying the value of the multimodal term coefficient $\eta$ in Eq. (1).

Table 1: Spearman correlations between word similarities based on the word vectors and those of the human annotations. The right part shows the accuracies of concept-to-image search evaluated by precision@k.
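The retrieval and scoring steps of this experiment can be sketched as below. The function names are hypothetical, and `A_vis` stands for the learned linear mapping from the visual feature space; the relevance sets would come from the NUS-WIDE concept annotations.

```python
import numpy as np

def rank_images(query_vec, image_feats, A_vis):
    """Indices of images sorted by decreasing cosine similarity to the query."""
    projected = image_feats @ A_vis  # map images into the word-vector space
    norms = np.linalg.norm(projected, axis=1)
    q_norm = np.linalg.norm(query_vec)
    sims = (projected @ query_vec) / np.maximum(norms * q_norm, 1e-12)
    return np.argsort(-sims)

def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k ranked images that are relevant."""
    return sum(1 for idx in ranking[:k] if int(idx) in relevant) / k

def mean_precision_at_k(queries, image_feats, A_vis, k):
    """Average precision@k over (query_vec, relevant_index_set) pairs."""
    scores = [
        precision_at_k(rank_images(q, image_feats, A_vis), rel, k)
        for q, rel in queries
    ]
    return float(np.mean(scores))
```

Calling `mean_precision_at_k` with k set to 1, 5, and 10 over the 81 concept queries would correspond to the three precision figures reported for each value of $\eta$.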

Results
For Eigenwords and MM-Eigenwords, we set the number of word types to V ≈ 140k, including the 30k most frequent words, the words in the benchmarks, and the Flickr tags associated with the training images, and we set the number of power iterations to 3. As for the skip-gram model, we set the subsampling threshold to $10^{-5}$, the number of negative examples to 5, and the number of training iterations to 5. In addition, we fixed the dimensionality of word vectors to K = 500 and the context window size to h = 4 for every method.
As mentioned in Section 1, one of the most closely related methods is MMSkip-gram, against which we should compare MM-Eigenwords. However, since we could neither find its code nor implement it ourselves, a comparative study with MMSkip-gram is not included in this paper.
Table 1 shows the results of the word similarity tasks. As we can see in the table, with smaller η, the performance of MM-Eigenwords on the word similarity tasks is similar to that of Eigenwords or the skip-gram model, whereas its results on the concept-to-image search task are poor. On the other hand, larger η helps improve the performance on concept-to-image search while sacrificing performance on the word similarity tasks. These results imply that too strongly associated visual information can distort the semantic structure obtained from textual data. Although some similar existing studies showed positive results with auxiliary visual features (Lazaridou et al., 2015; Kiela and Bottou, 2014), our method achieved smaller improvements on the word similarity tasks, indicating negative transfer of learning. However, the visually informative word vectors obtained by our method enable not only word-to-word but also word-to-image search, as shown in Figure 2a, and the many-to-many relationships between images and a wide variety of tags fed to our model contributed to plausible retrieval results when the sum of two word vectors is used as the query (e.g. "bird" + "flying" ≈ images of flying birds).
Moreover, the word vectors learned with our model capture multimodal linguistic regularities (Kiros et al., 2014). We show some examples of our model in Figure 2b.
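A query of the kind shown in Figure 2b can be sketched as vector arithmetic in the common space: project an image, subtract one word vector, add another, and retrieve the nearest images. The function names are hypothetical, and whether vectors are normalized before the arithmetic is not stated in the text; this sketch applies no normalization before the arithmetic and uses cosine similarity for retrieval.

```python
import numpy as np

def regularity_query(image_feat, minus_word, plus_word, word_vecs, A_vis):
    """Build the query: projected image - minus_word + plus_word."""
    return image_feat @ A_vis - word_vecs[minus_word] + word_vecs[plus_word]

def nearest_images(query, image_feats, A_vis, top_n=5):
    """Indices of the top_n images closest to the query by cosine similarity."""
    projected = image_feats @ A_vis
    denom = np.linalg.norm(projected, axis=1) * np.linalg.norm(query)
    sims = (projected @ query) / np.maximum(denom, 1e-12)
    return np.argsort(-sims)[:top_n]
```

With a daytime photograph as `image_feat`, the call `regularity_query(image_feat, "day", "night", word_vecs, A_vis)` would build the query for the - "day" + "night" example.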

Conclusion
In this paper, we proposed a spectral graph-based method of multimodal word embedding. Our experimental results showed that MM-Eigenwords captures both semantic and text-to-image similarities, and we found that there is a trade-off between these two similarities.
Since the framework we used can be applied to any number of views, in future work we could further extend our method to image caption datasets by employing document IDs, as in Oshikiri et al. (2016).