Sound-Word2Vec: Learning Word Representations Grounded in Sounds

To be able to interact better with humans, it is crucial for machines to understand sound – a primary modality of human perception. Previous works have used sound to learn embeddings for improved generic semantic similarity assessment. In this work, we treat sound as a first-class citizen, studying downstream 6textual tasks which require aural grounding. To this end, we propose sound-word2vec – a new embedding scheme that learns specialized word embeddings grounded in sounds. For example, we learn that two seemingly (semantically) unrelated concepts, like leaves and paper are similar due to the similar rustling sounds they make. Our embeddings prove useful in textual tasks requiring aural reasoning like text-based sound retrieval and discovering Foley sound effects (used in movies). Moreover, our embedding space captures interesting dependencies between words and onomatopoeia and outperforms prior work on aurally-relevant word relatedness datasets such as AMEN and ASLex.


Introduction
Sound and vision are the dominant perceptual signals, while language helps us communicate complex experiences via rich abstractions. For example, a novel can stimulate us to mentally construct the image of the scene despite having never physically perceived it. Indeed, language has evolved to contain numerous constructs that help depict visual concepts. For example, we can easily form the picture of a white, furry cat with blue eyes via. a description of the cat in terms of its visual attributes (Lampert et al., 2009;Parikh and Grauman, 2011).
Need for Onomatopoeia. However, how would one describe the auditory instantiation of cats? While a first thought might be to use audio descriptors like loud, shrill, husky etc. as mid-level constructs or "attributes", arguably, it is difficult to precisely convey and comprehend sound through such language. Indeed, Wake and Asahi (1998) find that humans first communicate sounds using "onomatopoeia" -words that are suggestive of the phonetics of sounds while having no explicit meaning e.g. meow, tic-toc. When asked for further explanation of sounds, humans provide descriptions of potential sound sources or impressions created by the sound (pleasant, annoying, etc.) Need for Grounding in Sound. While onomatopoeic words exist for commonly found concepts, a vast majority of concepts are not as perceptually striking or sufficiently frequent for us to come up with dedicated words describing their sounds. Even worse, some sounds, say, musical instruments, might be difficult to mimic using speech. Thus, for a large number of concepts there seems to be a gap between sound and its counterpart in language (Sundaram and Narayanan, 2006). This becomes problematic in specific situations where we want to talk about the heavy tail of concepts and their sounds, or while describing a particular sound we want to create as an effect (say in movies). To alleviate this, a common literary strategy is to provide metaphors to more relatable exemplars. For example, when we say, "He thundered angrily", we compare the person's angry speech to the sound of thunder to convey the seriousness of the situation. However, without this grounding in sound, thunder and anger both appear to be seemingly unrelated concepts in terms of semantics.
Contributions. In this work, we learn embeddings to bridge the gap between sound and its counterpart in language. We follow a retrofitting strategy, capturing similarity in sounds associated with words, while using distributional semantics (from word2vec) to provide smoothness to the embeddings. Note that we are not interested in capturing phonetic similarity, but the grounding in sound of the concept associated with the word (say "rustling" of leaves and paper.) We demonstrate the effectiveness of our embeddings on three downstream tasks that require reasoning about related aural cues: 1. Text-based sound retrieval -Given a textual query describing the sound and a database containing sounds and associated textual tags, we retrieve sound samples by matching text (Sec. 5.1) 2. Foley Sound Discovery -Given a short phrase that outlines the technique of producing Foley sounds 1 , we discover other relevant words (objects or actions) which can produce similar sound effects (Sec. 5.2) 3. Aurally-relevant word relatedness assessment on AMEN and ASLex (Kiela and Clark, 2015) (Sec. 5.3) We also qualitatively compare with word2vec to highlight the unique notions of word relatedness captured by imposing auditory grounding.

Related Work
Audio and Word Embeddings. Multiple works in the recent past (Bruni et al., 2014;Lazaridou et al., 2015;Lopopolo and van Miltenburg, 2015;Kiela and Clark, 2015;Kottur et al., 2016) have explored using perceptual modalities like vision and sound to learn language embeddings. While Lopopolo and van Miltenburg (2015) show preliminary results on using sound to learn distributional representations, Kiela and Clark (2015) build on ideas from Bruni et al. (2014) to learn word embeddings that respect both linguistic and auditory relationships by optimizing a joint objective. Further, they propose various fusion strategies to combine knowledge from both the modalities. Instead, we "specialize" embeddings to exclusively respect relationships defined by sounds, while initializing with word2vec embeddings for smoothness. Similar to previous findings (Melamud et al., 2016), we observe that our specialized embeddings outperform language-only as well as other multi-modal embeddings in the downstream tasks of interest. In an orthogonal and interesting direction, other recent works (Chung et al., 2016;He et al., 2016;Settle and Livescu, 2016) learn word representations based on similarity in their pronunciation and not the sounds associated with them. In other words, phonetically similar words that have near identical pronunciations are brought closer in the embedding space (e.g., flower and flour). Sundaram and Narayanan (2006) study the applicability of onomatopoeia to obtain semantically meaningful representations of audio.
Using a novel word-similarity metric and principal component analysis, they find representations for sounds and cluster them in this derived space to reason about similarities. In contrast, we are interested in learning word representations that respect aural-similarity. More importantly, our approach learns word representations for in a data-driven manner without having to first map the sound or its tags to corresponding onomatopoeic words.
Multimodal Learning with Surrogate Supervision. Kottur et al. (2016) and Owens et al. (2016) use a surrogate modality to induce supervision to learn representations for a desired modality. While the former learns word embeddings grounded in cartoon images, the latter learns visual features grounded in sound. In contrast, we use sound as the surrogate modality to supervise representation learning for words.

Datasets
Freesound. We use the freesound database (Font et al., 2013), also used in prior work (Kiela and Clark, 2015;Lopopolo and van Miltenburg, 2015) to learn the proposed sound-word2vec embeddings. Freesound is a freely available, collaborative dataset consisting of user uploaded sounds permitting reuse. All uploaded sounds have human descriptions in the form of tags and captions in natural language. The tags contain a broad set of relevant topics for a sound (e.g., ambience, electronic, birds, city, reverb) and captions describing the content of the sound, in addition to details pertaining to audio quality. For the text-based sound retrieval task, we use a subset of 234,120 sounds from this database and divide it into training (80%), validation (10%) and testing splits (10%). Further, for foley sound discovery, we aggregate descriptions of foley sound production provided by sound engineers (epicsound, accessed 23-Jan-2017; Singer, accessed 23-Jan-2017) to create a list of 30 foley sound pairs, which forms our ground truth for the task. For example, the description to produce a foley "driving on gravel" sound is to record the "crunching sound of plastic or polyethene bags".
AMEN and ASLex. AMEN and ASLex (Kiela and Clark, 2015) are subsets of the standard MEN (Bruni et al., 2014) and SimLex (Hill et al., 2015) word similarity datasets consisting of word-pairs that "can be associated with a distinctive associated sound". We evaluate on this dataset for completeness to benchmark our approach against previous work. However, we are primarily interested in the slightly different problem of relating words with similar auditory instantions that may or may not be semantically related as opposed to relating semantically similar words that can be associated with some common auditory signal.

Approach
We use the Freesound database to construct a dataset of tuples {s, T }, where s is a sound and T is the set of associated user-provided tags. We then aim to learn an embedding space for the tags that respects auditory grounding using sound information as cross-modal context -similar to word2vec (Mikolov et al., 2013) that uses neighboring words as context / supervision. We now explain our approach in detail.

Audio Features and Clustering.
We represent each sound s by a feature vector consisting of the mean and variance of the following audio descriptors that are readily available as part of Freesound database: • Mel-Frequency Cepstral Co-efficients: This feature represents the short-term power spectrum of an audio and closely approximates the response of the human auditory system -computed as given in (Ganchev et al., 2005). • Spectral Contrast: It is the magnitude difference Figure 1: The model used to learn the proposed sound-word2vec embeddings. The projection matrix WP containing that is used as the sound-word2vec embedding is learned by training the model to accurately predict the cluster assignment of the sound.
in the peaks and valleys of the spectrum -computed according to (Akkermans et al., 2009).
• Dissonance: It measures the perceptual roughness of the sound (Plomp and Levelt, 1965 (Ricard, 2004). We then use K-Means algorithm to cluster the sounds in this feature space to assign each sound to a cluster C(s) ∈ {1, . . . K}. We set K to 30 by evaluating the performance of the embeddings on text-based audio-retrieval on the held out validation set. Note that the clustering is only performed once, prior to representation learning described below.
Representation Learning. We represent each tag t ∈ T using a |V| dimensional one-hot encoding denoted by v t , where V is the set of all unique tags in the training set (the size of our dictionary). This one-hot vector v t is projected into a Ddimensional vector space via W P ∈ R |V|×D , the projection matrix. This projection matrix computes the representation for each word in V. The idea of our approach is to use W P to accurately predict cluster assignments (for sounds associated with words), which enforces grounding in sound. For each data-point, we obtain the summary of the tags T , by averaging the projections of all tags in the set as 1 |T | t∈T W P v t . We then transform the so obtained summary representation via a linear layer (with parameters W O ) and pass the output through the softmax function to obtain a distribution, p(c|T ) over the K sound clusters. We perform maximum-likelihood training for the correct cluster assignment C(s) 2 , optimizing for parameters W P and W O : We use SGD with momentum to optimize this objective which essentially is the cross-entropy between cluster assignments and p(c|T ). We set D to 300 to be consistent with the publicly available word2vec embeddings.
Initialization. We initialize W P with word2vec embeddings (Mikolov et al., 2013) trained on the Google news corpus dataset with ∼3M words. We fine-tune on a subset of 9578 tags which are present in both Freesound as well as Google news corpus datasets, which is 55.68% of the original tags in the Freesound dataset. This helps us remove noisy tags unrelated to the content of the sound.
In addition to enlarging the vocabulary, the pretraining helps induce smoothness in the sound-word2vec embeddings -allowing us to transfer semantics learnt from sounds to words that were not present as tags in the Freesound database. Indeed, we find that word2vec pre-training helps improve performance (Sec. 5.3). Our use of language embeddings as an initialization to fine-tune (specialize) from, as opposed to formulating a joint objective with language and audio context (Kiela and Clark, 2015) is driven by the fact that we are interested in embeddings for words grounded in sounds, and not better generic word similarity.

Results
Ablations. In addition to the language-only baseline word2vec (Mikolov et al., 2013), we compare against tag-word2vec -that predicts a tag using other tags of the sound as context, inspired by (Font et al., 2014). We also report results with a randomly initialized projection matrix (sound-word2vec(r) to evaluate the effectiveness of pretraining with word2vec. Prior work. We compare against previous works Lopopolo and van Miltenburg (2015) and Kiela and Clark (2015). While the former uses a standard bag of words and SVD pipeline to arrive at distributional representations for words, the latter trains under a joint objective that respects both linguistic and auditory similarity. We use the openly available implementation for Lopopolo and van Miltenburg (2015) and re-implement Kiela and Clark (2015) and train them on our dataset for a fair comparison of the methods. In addition, we show a comparison to word-vectors released by (Kiela and Clark, 2015) in the supplementary material. All approaches use an embedding size of 300 for consistency.

Text-based Sound Retrieval
Given a textual description of a sound as query, we compare it with tags associated with sounds in the database to retrieve the sound with the closest matching tags. Note that this is a purely textual task, albeit one that needs awareness of sound. In a sense, this task exactly captures what we want our model to be able to do -bridge the semantic gap between language and sound. We use the training split (Sec. 3) to learn the sound-word2vec vectors, validation to pick the number of clusters (K), and report results on the test split. For retrieval, we represent sounds by averaging the learnt embeddings for the associated tags. We embed the caption provided for the sound (in the Freesound database) in a similar manner, and use it as the query. We then rank sounds based on the cosine similarity between the tag and query representations for retrieval. We evaluate using standard retrieval metrics -Recall@{1,10,50,100}. Note that the entire testing set (≈10k sounds) is present in the retrieval pool. So, recall@100 corresponds to obtaining the correct result in the top 1% of the search results, which is a relatively stringent evaluation criterion.
Results. Table. 1 shows that our sound-word2vec embeddings outperform the baselines. We see that specializing the embeddings for sound using our two-stage training outperforms prior work (Kiela and Clark (2015) and Lopopolo and van Miltenburg (2015)), which did not do specialization. Among our approaches, tag-word2vec performs second best -this is intuitive since the tag distributions implicitly capture auditory relatedness (a sound may have tags cat and meow), while word2vec and sound-word2vec(r) have the lowest performance.  Table 1: Text-based sound retrieval (higher is better). We find that our sound-word2vec model outperforms all baselines.

Foley Sound Discovery
In this task, we evaluate how well embeddings identify matching pairs of target sounds (flapping bird wings) and descriptions of Foley sound production techniques (rubbing a pair of gloves).
Intuitively, one expects sound-aware word embeddings to do better at this task than sound-agnostic ones. We setup a ranking task by constructing a set of original Foley sound pairs and decoy pairs formed by pairing the target description with every word from the vocabulary. We rank using cosine similarity between the average word-vectors in each member of the pair. A good embedding is one in which the original Foley sound pair has the lowest rank. We use the mean rank of the Foley sound in the dataset for evaluation. We transfer the embeddings from Sec. 5.1 to this task, without additional training.

Results.
We find that Sound-word2vec performs the best with a mean rank of 34.6 compared to other baselines tag-word2vec (38.9), sound-word2vec(r) (114.3) and word2vec (189.45). As observed previously, the second best performing approach is tag-word2vec. Lopopolo and van Miltenburg (2015) and Kiela and Clark (2015) perform worse than tag-word2vec with a mean rank of 48.4 and 42.1 respectively. Note that random chance gets a rank of (|V| + 1)/2 = 4789.5.

Evaluation on AMEN and ASLex
AMEN and ASLex (Kiela and Clark, 2015) are subsets of the MEN and SimLex-999 datasets for word relatedness grounded in sound. From Table 2, we can see that our embeddings outperform (Kiela and Clark, 2015) on both AMEN and ASLex. These datasets were curated by annotating  concepts related by sound; however we observe that relatedness is often confounded. For example, (river, water), (automobile, car) are marked as aurally related however they do not stand out as aurally-related examples as they are already semantically related. In contrast, we are interested in how onomatopoeic words relate to regular words (Table 3), which we study by explicit grounding in sound. Thus while we show competitive performance on this dataset, it might not be best suited for studying the benefits of our approach.

Discussion and Conclusion
We show nearest neighbors in both sound-word2vec and word2vec space (Table 3) to qualitatively demonstrate the unique dependencies captured due to auditory grounding. While word2vec maps a word (say, apple) to other semantically similar words (other fruits), similar 'sounding' words (chips) or onomatopoeia (munch) are closer in our embedding space. Moreover, onomatopoeic words (say, boom and slam) are mapped to relevant objects (explosion and door). Interestingly, parts (e.g., lock, latch) and actions (closing) are also closer to the onomatopoeic query -exhibiting an understanding of the auditory scene.
Conclusion. In this work we introduce a novel word embedding scheme that respects auditory grounding. We show that our embeddings provide strong performance on text-based sound retrieval, Foley sound discovery along with intuitive nearest neighbors for onomatopoeia that are tasks in text requiting auditory reasoning. We hope our work motivates further efforts on understanding and relating onomatopoeia words to "regular" words.