Multi- and Cross-Modal Semantics Beyond Vision: Grounding in Auditory Perception

Multi-modal semantics has relied on feature norms or raw image data for perceptual input. In this paper we examine grounding semantic representations in raw auditory data, using standard evaluations for multi-modal semantics, including measuring conceptual similarity and relatedness. We also evaluate cross-modal mappings, through a zero-shot learning task mapping between linguistic and auditory modalities. In addition, we evaluate multi-modal representations on an unsupervised musical instrument clustering task. To our knowledge, this is the first work to combine linguistic and auditory information into multi-modal representations.


Introduction
Although distributional models (Turney and Pantel, 2010; Clark, 2015) have proved useful for a variety of NLP tasks, the fact that the meaning of a word is represented as a distribution over other words implies that they suffer from the grounding problem (Harnad, 1990); i.e. they do not account for the fact that human semantic knowledge is grounded in the perceptual system (Louwerse, 2008). Motivated by human concept acquisition, multi-modal semantics enhances linguistic representations with extra-linguistic perceptual input. These models outperform language-only models on a range of tasks, including modelling semantic similarity and relatedness, and predicting compositionality (Silberer and Lapata, 2012; Roller and Schulte im Walde, 2013). Although feature norms have also been used, raw image data has become the de-facto perceptual modality in multi-modal models. However, if the objective is to ground semantic representations in perceptual information, why stop at image data? The meaning of violin is surely not only grounded in its visual properties, such as shape, color and texture, but also in its sound, pitch and timbre. To understand how perceptual input leads to conceptual representation, we should use as many perceptual modalities as possible. A recent preliminary study by Lopopolo and van Miltenburg (2015) found that it is possible to derive uni-modal semantic representations from sound data. Here, we explore taking multi-modal semantics beyond its current reliance on image data and experiment with grounding semantic representations in the auditory perceptual modality.
Multi-modal models that rely on raw image data have typically used "bag of visual words" (BoVW) representations (Sivic and Zisserman, 2003). We follow a similar approach for the auditory modality and construct bag of audio words (BoAW) representations. Following previous work in multi-modal semantics, we evaluate these models on measuring conceptual similarity and relatedness, and inducing cross-modal mappings between modalities to perform zero-shot learning. In addition, we evaluate on an unsupervised musical instrument clustering task. Our findings indicate that multi-modal representations enriched with auditory information perform well on relatedness and similarity tasks, particularly on words that have auditory associations. To our knowledge, this is the first work to combine linguistic and auditory representations in multi-modal semantics.

Related Work
Information processing in the brain can be roughly described as occurring at three levels: perceptual input, conceptual representation and symbolic reasoning (Gazzaniga, 1995). While research in AI has made great progress in understanding the first and last of these, understanding the middle level is still more of an open problem: how is it that perceptual input leads to conceptual representations that can be processed and reasoned with?
A key observation is that concepts are, through perception, grounded in physical reality and sensorimotor experience (Harnad, 1990; Louwerse, 2008), and there has been a surge of recent work on perceptually grounded semantic models that try to account for this fact. These models learn semantic representations from both textual and perceptual input, using either feature norms (Silberer and Lapata, 2012; Roller and Schulte im Walde, 2013) or raw image data (Feng and Lapata, 2010; Leong and Mihalcea, 2011) as the source of perceptual information. A popular approach in the latter case is to collect images associated with a concept, and then lay out each image as a set of keypoints on a dense grid, where each keypoint is represented by a robust local feature descriptor such as SIFT (Lowe, 2004). These local descriptors are subsequently clustered into a set of "visual words" using a standard clustering algorithm such as k-means and then quantized into vector representations by comparing the descriptors with the centroids. An alternative to this bag of visual words (BoVW) approach is transferring features from convolutional neural networks (Kiela and Bottou, 2014).
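The quantization step described above can be sketched as follows. This is an illustrative sketch, not the pipeline used in any particular paper: random vectors stand in for real SIFT descriptors, and the number of "visual words" is kept small.

```python
import numpy as np
from sklearn.cluster import KMeans

def quantize(descriptors, kmeans):
    """Map each local descriptor to its nearest centroid ("visual word")
    and return a normalized histogram over the k words."""
    words = kmeans.predict(descriptors)
    counts = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
    return counts / counts.sum()

rng = np.random.default_rng(0)
# Stand-ins for 128-dim SIFT descriptors extracted from many images.
all_descriptors = rng.normal(size=(1000, 128))
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(all_descriptors)

# One image's descriptors -> one bag-of-visual-words vector.
image_descriptors = rng.normal(size=(50, 128))
bovw = quantize(image_descriptors, kmeans)
```

The same two steps (cluster a pool of local descriptors, then histogram each item's descriptors against the centroids) carry over directly to the auditory case below.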
Various ways of aggregating images into visual representations have been proposed, such as taking the mean or the elementwise maximum. Ideally, one would jointly learn multi-modal representations from parallel multi-modal data, such as text containing images (Silberer and Lapata, 2014) or images described with speech (Synnaeve et al., 2014), but such data is hard to obtain, has limited coverage and can be noisy. Hence, image representations are often learned independently. Aggregated visual representations are subsequently combined with a traditional linguistic space to form a multi-modal model. This mixing can be done in a variety of ways, ranging from simple concatenation to more sophisticated fusion methods.
Cross-modal semantics, instead of being concerned with improving semantic representations through grounding, focuses on the problem of reference. Using, for instance, mappings between visual and textual space, the objective is to learn which words refer to which objects (Lazaridou et al., 2014). This problem is closely related to the object recognition task in computer vision, but instead of using just visual data and labels, these cross-modal models also utilize textual information (Socher et al., 2014; Frome et al., 2013). This allows for zero-shot learning, where the model can predict how an object relates to other concepts just from seeing an image of the object, but without ever having previously encountered an image of that particular object (Lazaridou et al., 2014). Multi-modal and cross-modal approaches have outperformed state-of-the-art text-based methods on a variety of tasks (Silberer and Lapata, 2014).

Evaluations
Following previous work in multi-modal semantics, we evaluate on two standard similarity and relatedness datasets: SimLex-999 and the MEN test collection. These datasets consist of concept pairs together with a human-annotated similarity or relatedness score, where the former dataset focuses on genuine similarity (e.g., teacher-instructor) and the latter focuses more on relatedness (e.g., river-water). In addition, following previous work in cross-modal semantics, we evaluate on the zero-shot learning task of inducing a cross-modal mapping to the correct label in the auditory modality from the linguistic one and vice-versa.

Multi-modal Semantics
Evidence suggests that the inclusion of visual representations only improves performance for certain concepts, and that in some cases the introduction of visual information is detrimental to performance on similarity and relatedness tasks. The same is likely to be true for other perceptual modalities: in the case of comparisons such as guitar-piano, the auditory modality is certainly meaningful, whereas in the case of democracy-anarchism it is probably less so. Therefore, we had two graduate students annotate the datasets according to whether auditory perception is relevant to the pairwise comparison. The annotation criterion was as follows: if both concepts in a pairwise comparison have a distinctive associated sound, the modality is deemed relevant. Inter-annotator agreement was high: κ = 0.93 for MEN and κ = 0.92 for SimLex-999. Some examples of relevant pairs can be found in Table 1. Hence, we now have four evaluation datasets: the MEN test collection MEN and its auditory-relevant subset AMEN; and the SimLex-999 dataset SLex and its auditory-relevant subset ASLex. Due to the nature of the auditory data sources, it is not possible to build auditory representations for all concepts in the test sets. Hence, unless stated otherwise, we report results for the covered subsets (using the same subsets when comparing across modalities, to ensure a fair comparison). Table 2 shows how much of the test sets are covered for each modality.

Cross-modal Semantics
In addition to evaluating our models on the MEN and SimLex tasks, we evaluate on the cross-modal task of zero-shot learning. In the case of vision, Lazaridou et al. (2014) studied the possibility of predicting from "we found a cute, hairy wampimuk sleeping behind the tree" that a "wampimuk" will probably look like a small furry animal, even though a wampimuk has never been seen before. We evaluate zero-shot learning using partial least squares regression (PLSR) to obtain cross-modal mappings from the linguistic to the auditory space and vice versa. Thus, given a linguistic representation for e.g. guitar, the task is to map it to the appropriate place in auditory space without ever having heard a guitar; or map it to the appropriate place in linguistic space without ever having read about a guitar (having only heard it).

Approach
One reason for using raw image data in multi-modal models is that there is a wide variety of resources that contain tagged images, such as ImageNet (Deng et al., 2009) and the ESP Game dataset (Von Ahn and Dabbish, 2004). However, such resources do not exist for audio files, and so we follow a similar approach to Fergus et al. (2005) and Bergsma and Goebel (2011), who use Google Images to obtain images. We use the online search engine Freesound to obtain audio files. Freesound is a collaborative database released under Creative Commons licenses, in the form of snippets, samples and recordings, that is aimed at sound artists. The Freesound API allows users to easily search for audio files that have been tagged using certain keywords.
For each of the concepts in the evaluation datasets, we used the Freesound API to obtain samples encoded in the standard open source OGG format. Because the database contains variable numbers of files, with varying duration per individual file, we restrict the search to a maximum of 50 files and a maximum of 1 minute duration per file. The Freesound API allows for various degrees of keyword matching: we opted for the strictest keyword matching, in that the audio file needs to have been purposely tagged with the given word (the alternative includes searching the text description for matching keywords). For example, if we are searching for audio files of cars, we retrieve up to 50 files with a maximum duration of 1 minute per file that have been tagged with the label "car".
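The retrieval step might look like the sketch below. The endpoint and parameter names (query, filter, page_size, fields) follow the public Freesound API but should be checked against its documentation; authentication is omitted, and the tag/duration filter syntax is an assumption:

```python
from urllib.parse import urlencode

def build_freesound_query(word, max_files=50, max_duration=60,
                          base="https://freesound.org/apiv2/search/text/"):
    """Assemble a Freesound text-search URL for files tagged with `word`,
    restricted to at most `max_duration` seconds per file. Parameter and
    filter names are illustrative; consult the Freesound API docs."""
    params = {
        "query": word,
        "filter": f'tag:"{word}" duration:[0 TO {max_duration}]',
        "page_size": max_files,
        "fields": "id,name,previews",
    }
    return base + "?" + urlencode(params)

url = build_freesound_query("car")
```

Issuing the request and downloading the returned OGG previews is then a matter of paging through the JSON response.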

Linguistic Representations
For the linguistic representations we use continuous vector representations from a log-linear skip-gram model, as implemented in word2vec. Specifically, we trained 300-dimensional vector representations on a dump of the English Wikipedia plus newswire (8 billion words in total). These types of representations have been found to yield the highest performance on a variety of semantic similarity tasks.

Auditory Representations
A common approach to obtaining acoustic features of audio files is the Mel-scale Frequency Cepstral Coefficient (MFCC) (O'Shaughnessy, 1987). MFCC features are used in a wide variety of audio signal processing applications, ranging from audio information retrieval, to speech and speaker recognition, and music analysis (Eronen, 2003). Such features are derived from the mel-frequency cepstrum representation of an audio fragment (Stevens et al., 1937). In MFCC, frequency bands are spaced along the mel scale, which has the advantage that it approximates human auditory perception more closely than e.g. linearly-spaced frequency bands. Hence, MFCC takes human perceptual sensitivity to audio frequencies into consideration, which makes it suitable for e.g. compression and recognition tasks, but also for our current objective of modelling auditory perception. We obtain MFCC descriptors for frames of audio files using librosa, a popular library for audio and music analysis written in Python. After having obtained the descriptors, we cluster them using mini-batch k-means (Sculley, 2010) and quantize the descriptors into a "bag of audio words" (BoAW) (Foote, 1997) representation by comparing the MFCC descriptors to the cluster centroids. This gives us BoAW representations for each of the audio files. Auditory representations are obtained by taking the mean of the BoAW representations of the relevant audio files, and finally weighting them using positive pointwise mutual information (PPMI), a standard weighting scheme for improving vector representation quality (Bullinaria and Levy, 2007). We set k = 300, which equals the number of dimensions of the linguistic representations.
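The BoAW pipeline can be sketched as below. Random matrices stand in for real MFCC frame descriptors (which would come from, e.g., librosa.feature.mfcc), k is kept small for illustration, and the concept names are invented:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def ppmi(counts):
    """Positive pointwise mutual information weighting of a count matrix
    (rows: concepts, columns: audio words)."""
    p = counts / counts.sum()
    row = p.sum(axis=1, keepdims=True)
    col = p.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p / (row * col))
    pmi[~np.isfinite(pmi)] = 0.0
    return np.maximum(pmi, 0.0)

rng = np.random.default_rng(2)
k = 8  # number of "audio words" (the paper uses k = 300)
# Stand-ins for per-file MFCC frames: (frames x 13) per file, per concept.
files_per_concept = {c: [rng.normal(size=(100, 13)) for _ in range(3)]
                     for c in ["guitar", "piano", "car"]}

# Cluster the pooled frames into k audio words.
all_frames = np.vstack([f for fs in files_per_concept.values() for f in fs])
km = MiniBatchKMeans(n_clusters=k, n_init=3, random_state=0).fit(all_frames)

def file_histogram(frames):
    """Quantize one file's frames against the audio-word centroids."""
    return np.bincount(km.predict(frames), minlength=k).astype(float)

# Mean of per-file BoAW histograms per concept, then PPMI weighting.
counts = np.vstack([np.mean([file_histogram(f) for f in fs], axis=0)
                    for fs in files_per_concept.values()])
auditory_vectors = ppmi(counts)
```

Each row of auditory_vectors is then the auditory representation of one concept.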

Multi-modal Fusion Strategies
Since multi-modal semantics relies on two or more modalities, there are several ways of combining or fusing linguistic and perceptual cues. When computing similarity scores, for instance, we can either jointly learn the representations; learn them independently, combine (e.g. concatenate) them and compute similarity scores; or learn them independently, compute similarity scores independently and combine the scores. We call these possibilities early, middle and late fusion, respectively, and evaluate multi-modal models in each category.

Early Fusion
A good example of early fusion is the recently introduced multi-modal skip-gram model (Lazaridou et al., 2015). This model behaves like a normal skip-gram, but instead of only having a training objective for the linguistic representation, it includes an additional training objective for the visual context, which consists of an aggregated representation of images associated with the given target word. The skip-gram training objective for a sequence of training words $w_1, w_2, \ldots, w_T$ and a context size $c$ is:

$$\frac{1}{T} \sum_{t=1}^{T} J_\theta(w_t)$$

where $J_\theta(w_t)$ is the log-likelihood $\sum_{-c \leq j \leq c,\, j \neq 0} \log p(w_{t+j} \mid w_t)$ and $p(w_{t+j} \mid w_t)$ is obtained via the softmax:

$$p(w_{t+j} \mid w_t) = \frac{\exp(u_{w_{t+j}}^{\top} v_{w_t})}{\sum_{w'=1}^{W} \exp(u_{w'}^{\top} v_{w_t})}$$

where $u_w$ and $v_w$ are the context and target vector representations for the word $w$ respectively, and $W$ is the vocabulary size. The objective for the multi-modal skip-gram has an additional visual objective $J_{vis}$ (in this case a margin criterion):

$$\frac{1}{T} \sum_{t=1}^{T} \left( J_\theta(w_t) + J_{vis}(w_t) \right)$$

Here, we take a similar but more straightforward approach by making the auditory context a part of the initial training objective, which is possible because linguistic and auditory representations have the same dimensionality. That is, we modified word2vec to predict additional auditory contexts that have been set to the corresponding BoAW representation. We jointly learn linguistic and audio representations by taking the aggregated mean $\mu^a_w$ of the auditory vectors for a given word $w$, and adding this mean vector to the context:

$$J_\theta(w_t) = \sum_{-c \leq j \leq c,\, j \neq 0} \log p(w_{t+j} \mid w_t) + \log p(\mu^a_{w_t} \mid w_t)$$

The intuition is that the induced vector for the target word now has to predict an auditory vector as part of its context, as well as the linguistic ones. As an alternative, we also investigate replacing the mean $\mu^a_{w_t}$ with an auditory vector obtained by uniformly sampling from the set of auditory representations for the target word. We refer to these two alternatives as MMSG-MEAN and MMSG-SAMPLED, respectively.
For this model, auditory BoAW representations are built for the ten thousand most frequent words in our corpus, based on 10 audio files retrieved from FreeSound for each word (or fewer when 10 are not available).
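As an illustrative sketch only (not the actual modified word2vec training code), the per-word objective of the MMSG-MEAN variant, with the mean auditory vector scored like one extra context vector, can be written as:

```python
import numpy as np

rng = np.random.default_rng(3)
W, dim = 20, 8                              # toy vocabulary size, dimensionality
V = rng.normal(scale=0.1, size=(W, dim))    # target vectors v_w
U = rng.normal(scale=0.1, size=(W, dim))    # context vectors u_w

def mmsg_log_likelihood(target, context_ids, auditory_mean):
    """Skip-gram log-likelihood for one target word, with the mean
    auditory (BoAW) vector scored as an additional context vector.
    Purely illustrative: a real implementation would train with SGD
    and negative sampling rather than the full softmax."""
    logits = U @ V[target]                   # u_{w'} . v_{w_t} for all w'
    log_Z = np.log(np.exp(logits).sum())     # softmax normalizer
    ll = sum(logits[c] - log_Z for c in context_ids)
    # auditory context: treat mu^a_{w_t} like another context vector
    ll += float(auditory_mean @ V[target]) - log_Z
    return ll

mu_a = rng.normal(scale=0.1, size=dim)       # stand-in BoAW mean for the target
ll = mmsg_log_likelihood(target=5, context_ids=[3, 4, 6, 7], auditory_mean=mu_a)
```

MMSG-SAMPLED would simply pass a single sampled BoAW vector in place of mu_a.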

Middle and Late Fusion
Whereas early fusion requires a joint training objective that takes into account both modalities, middle fusion allows for individual training objectives and independent training data. Similarity between two multi-modal representations is calculated as follows:

$$\text{sim}(u, v) = g(f(u_l, u_a), f(v_l, v_a))$$

where $g$ is some similarity function, $u_l$ and $v_l$ are linguistic representations, and $u_a$ and $v_a$ are the auditory representations. A typical formulation in multi-modal semantics for $f(x, y)$ is $\alpha x \,\|\, (1-\alpha) y$, where $\|$ is concatenation (see e.g. Kiela and Bottou (2014)).
Late fusion can be seen as the converse of middle fusion, in that the similarity function is computed first, before the similarity scores are combined:

$$\text{sim}(u, v) = h(g(u_l, v_l), g(u_a, v_a))$$

where $g$ is some similarity function and $h$ is a way of combining similarities, in our case a weighted average $h(x, y) = \frac{1}{2}(\alpha x + (1-\alpha) y)$; and we use cosine similarity, $g(x, y) = \frac{x \cdot y}{|x||y|}$. Since cosine similarity is the normalized dot-product, and the uni-modal representations are themselves normalized, middle and late fusion produce equivalent rankings if $\alpha = 0.5$, in which case we call the model MM. However, when $\alpha \neq 0.5$, we distinguish between the two models, calling them MM-MIDDLE and MM-LATE respectively.
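Both fusion strategies are easy to state in code. With unit-length uni-modal vectors and α = 0.5, the two scores come out proportional (middle = 2 × late), which is why they induce identical rankings; the vectors below are random stand-ins:

```python
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x)

def cosine(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def middle_fusion_sim(u_l, u_a, v_l, v_a, alpha=0.5):
    """Concatenate alpha-weighted modalities, then compare."""
    f = lambda l, a: np.concatenate([alpha * l, (1 - alpha) * a])
    return cosine(f(u_l, u_a), f(v_l, v_a))

def late_fusion_sim(u_l, u_a, v_l, v_a, alpha=0.5):
    """Compare per modality, then average the similarity scores."""
    return 0.5 * (alpha * cosine(u_l, v_l) + (1 - alpha) * cosine(u_a, v_a))

rng = np.random.default_rng(4)
u_l, u_a, v_l, v_a = (normalize(rng.normal(size=50)) for _ in range(4))

m = middle_fusion_sim(u_l, u_a, v_l, v_a)
l = late_fusion_sim(u_l, u_a, v_l, v_a)
# With unit vectors and alpha = 0.5: m = 0.5*(s_l + s_a), l = 0.25*(s_l + s_a).
```

The proportionality breaks once α moves away from 0.5, which is exactly when MM-MIDDLE and MM-LATE diverge.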

Conceptual Similarity and Relatedness
We evaluate performance by calculating the Spearman ρ correlation between the ranking of the concept pairs produced by the automatic similarity metric (cosine between the derived vectors) and that produced by the gold-standard similarity scores. To ensure a fair comparison, we evaluate on the common subsets for which there are representations in both modalities (see Table 2). The results are reported in Table 3. We find that, while performance decreases for linguistic representations on the auditory-relevant subsets of the two datasets, performance increases for the uni-modal auditory representations on those subsets. This indicates that our auditory representations are better at judging auditory-relevant comparisons than they are at non-auditory ones, as we might expect.
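The evaluation itself is a one-liner with scipy; the gold and model scores below are invented toy values for five concept pairs:

```python
from scipy.stats import spearmanr

# Toy stand-ins: human-annotated gold scores and model cosine scores
# for the same five concept pairs.
gold = [9.2, 7.5, 6.1, 3.3, 1.0]
model = [0.81, 0.64, 0.70, 0.22, 0.05]

# Spearman's rho compares the two rankings, not the raw magnitudes,
# so the differing scales of the two score types do not matter.
rho, _ = spearmanr(gold, model)
```

Here one pair of adjacent ranks is swapped, giving ρ = 0.9.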
For all datasets, the correlation scores for multi-modal models are at least as high as those for the purely linguistic representations. In the case of the full datasets this difference is only marginal, which is to be expected given how few of the words in the datasets are auditory-relevant. However, the results indicate that adding auditory input even for words that are not directly auditory-relevant is not detrimental to overall performance.
In the case of the auditory-relevant subsets, we see a large increase in performance when using multi-modal representations. It is also interesting that this performance increase is found in the simple MM model rather than the more complicated MMSG models, which seems to indicate that the latter are still too reliant on linguistic information, harming their performance on auditory-specific comparisons. The model that performs consistently well across all four datasets is MM, the middle/late fusion model with α = 0.5.

Cross-modal Zero-shot Learning
We learn a cross-modal mapping between the linguistic and auditory spaces using partial least squares regression, taking out each concept, training on the others, and then projecting from one space into the other. Zero-shot performance is evaluated using the average percentage correct at N (P@N), which measures how many of the test instances were ranked within the top N highest-ranked nearest neighbors. Results are shown in Table 4, with the chance baseline obtained by randomly ranking a concept's nearest neighbors. Insofar as it is possible to make a direct comparison with linguistic-visual zero-shot learning (which uses entirely different data), it appears that the current task may be more difficult: Lazaridou et al. (2014) report a P@1 of 2.4 and P@20 of 33.0 for their linguistic-visual model.

Table 4: Cross-modal zero-shot learning accuracy.
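A minimal implementation of the P@N metric, on invented toy rankings (the instrument labels are illustrative only):

```python
def precision_at_n(ranked_neighbors, targets, n):
    """Percentage of test instances whose gold label appears among the
    top-n nearest neighbours of the cross-modal prediction."""
    hits = sum(target in ranking[:n]
               for ranking, target in zip(ranked_neighbors, targets))
    return 100.0 * hits / len(targets)

# Toy rankings: each row lists neighbour labels from nearest to farthest.
rankings = [["guitar", "piano", "violin"],
            ["drum", "car", "piano"],
            ["violin", "guitar", "piano"]]
targets = ["guitar", "piano", "piano"]

p_at_1 = precision_at_n(rankings, targets, 1)
p_at_3 = precision_at_n(rankings, targets, 3)
```

Only the first instance is a top-1 hit, so P@1 is 33.3 while P@3 reaches 100.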

Qualitative Analysis
We also performed a small qualitative analysis of the BoAW representations for the words in MEN and SLex. As Table 5 shows, the nearest neighbors are remarkably semantically coherent. For example, the model groups together sounds produced by machines, or by water. It even finds that dinner, meal, lunch and breakfast are closely related. In contrast, nearest neighbors for the linguistic model tend to be of a more abstract nature: where we find mouth and throat as auditory neighbors for language, the linguistic model gives us concepts like word and dictionary; while auditory gossip sounds like maids and is something you might do in the corridor, it is linguistically associated with more abstract concepts like news and newspaper.

Parameter Tuning
There are many parameters that were left fixed in the main results that could have been adjusted to improve performance, particularly in the middle and late fusion models. It is useful to investigate some of the parameters that are likely to have an impact on performance: what the effect of the α mixing parameter is, whether a different k would have yielded better auditory representations, and whether the number and duration of the audio files from FreeSound have any effect.

Figure 1: Performance of middle and late multi-modal fusion models compared to linguistic representations on the four datasets when varying the α mixing parameter on the x-axis.

Mixing with α
The mixing parameter α plays an important role in the middle and late fusion models. We kept it fixed at 0.5 for the MM model above, but here we experiment with varying the parameter, yielding results for two different models, MM-MIDDLE and MM-LATE. The results are shown in Figure 1, where moving to the right on the x-axis uses more linguistic input and moving to the left uses more auditory input. The late fusion model consistently outperforms the middle fusion model, which is probably because it is less susceptible to any noise in the auditory representation. Optimal performance seems to be around α = 0.6 for both fusion strategies on all four datasets, indicating that it is better to include a little more linguistic than auditory input. It appears that any 0.5 ≤ α < 1 (i.e., where we have more linguistic input but still some auditory signal) outperforms the purely linguistic representation, substantially so in the case of the auditory-relevant subsets.

Number of Auditory Dimensions
We experimented with different values for the number of audio words k (i.e. the number of clusters in the k-means clustering that determines the number of "audio words"). As Figure 2 shows, the quality of the uni-modal auditory representations is highly robust to the number of dimensions. In fact, any choice of k in the range shown provides similar results across the datasets.

Number and Duration of Audio Files
We experimented with the number of audio files by querying FreeSound for up to 100 audio files per search word, while keeping k = 300. The results are shown in Figure 3. It appears that "the more the better", although performance does not increase significantly after around 40 audio files. In order to examine the effect of audio file duration, we experimented with specifying the duration of audio files when querying the database, either taking very short (up to 5 seconds), medium length (up to 1 minute) or files of any duration. The results can be found in Figure 4, showing that performance generally increases as the files get longer (except on AMEN where a duration of 1 minute provides optimal performance).

Case Study: Musical Instruments
To strengthen the finding that multi-modal representations perform well on the auditory-relevant subsets of the datasets, we evaluate on an altogether different task, namely that of musical instrument classification. We used Wikipedia to collect a total of 52 instruments and divided them into 5 classes: brass, percussion, piano-based, string and woodwind instruments. For each of the instruments, we collected as many audio files from FreeSound as possible, and used the MM-MIDDLE model with parameter settings that yielded good results in the previous experiments (k = 300 and α = 0.6). We then performed k-means clustering with five cluster centroids and compared results between auditory, linguistic and multi-modal representations, evaluating clustering quality using the standard V-measure clustering evaluation metric (Rosenberg and Hirschberg, 2007). This is an interesting problem because instrument classes are determined somewhat by convention (is a saxophone a brass or a woodwind instrument?). What is more, how instruments actually sound is rarely described in detail in text, so corpus-based linguistic representations cannot take this information into account. The results are in Table 6, clearly showing that the multi-modal representation which utilizes both linguistic information and auditory input performs much better on this task than the uni-modal representations. It is interesting to observe that the linguistic representations perform better than the auditory ones: a possible explanation for this is that audio files in FreeSound are rarely samples of a single individual instrument, so if a bass is often accompanied by a drum this will affect the overall representation.

Figure 4: Performance of uni-modal auditory representations on the four datasets when varying the maximum duration.
The table also shows, for the 5 clusters under both models, the nearest instruments to the cluster centroids, qualitatively demonstrating the greater cluster coherence for the multi-modal model.
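The clustering evaluation can be reproduced in miniature with scikit-learn. The Gaussian blobs below are stand-ins for real fused instrument vectors; the class structure (5 classes, 10 items each) is invented for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score

rng = np.random.default_rng(5)
# Toy multi-modal instrument vectors: 5 classes of 10 instruments each,
# drawn as well-separated Gaussian blobs in a 20-dim space.
true_labels = np.repeat(np.arange(5), 10)
centers = rng.normal(scale=5.0, size=(5, 20))
X = centers[true_labels] + rng.normal(size=(50, 20))

# Cluster with five centroids and score against the gold classes.
pred = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
score = v_measure_score(true_labels, pred)
```

V-measure is the harmonic mean of homogeneity and completeness, so it is invariant to the arbitrary numbering of the predicted clusters.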

Conclusions
We have studied grounding semantic representations in raw auditory perceptual information, using a bag of audio words model to obtain auditory representations, and combining them into multi-modal representations using a variety of fusion strategies. Following previous work in multi-modal semantics, we evaluated on conceptual similarity and relatedness datasets, and on the cross-modal task of zero-shot learning. We presented a short case study showing that multi-modal representations perform much better than auditory or linguistic representations on a musical instrument clustering task. It may well be the case that the auditory modality is better suited for other evaluations, but we have chosen to follow standard evaluations in multi-modal semantics to allow for a direct comparison.

Table 6: V-measure performance for clustering musical instruments, together with instruments closest to cluster centroid for linguistic and multi-modal.
In future work, it would be interesting to investigate different sampling strategies for the early fusion joint-learning approach and to investigate more sophisticated mixing strategies for the middle and late fusion models, e.g. using the "audio dispersion" of a word to determine how much auditory input should be included in the multi-modal representation. Another interesting possibility is to improve auditory representations by training a neural network classifier on the audio files and subsequently transferring the hidden representations to tasks in semantics. Lastly, now that the perceptual modalities of vision, audio and even olfaction (Kiela et al., 2015) have been investigated in the context of distributional semantics, the logical next step for future work is to explore different fusion strategies for multi-modal models that combine various sources of perceptual input into a single grounded model.

We are grateful for useful suggestions and thank the anonymous reviewers for their helpful comments.