Modeling the Music Genre Perception across Language-Bound Cultures

Music genre perception, as expressed through human annotations of artists or albums, varies significantly across language-bound cultures. These variations cannot be modeled as mere translations since we also need to account for cultural differences in music genre perception. In this work, we study the feasibility of obtaining relevant cross-lingual, culture-specific music genre annotations based only on language-specific semantic representations, namely distributed concept embeddings and ontologies. Our study, focused on six languages, shows that unsupervised cross-lingual music genre annotation is feasible with high accuracy, especially when combining both types of representations. This approach of studying music genres is the most extensive to date and has many implications in musicology and music information retrieval. Besides, we introduce a new, domain-dependent cross-lingual corpus to benchmark state-of-the-art multilingual pre-trained embedding models.


Introduction
A prevalent approach to studying music genres culturally starts with a common set of music items, e.g. artists, albums, or tracks, and assumes that the same music genres would be associated with the items in all cultures (Ferwerda and Schedl, 2016; Skowron et al., 2017). However, music genres are subjective. Cultures themselves and individual musicological backgrounds influence music genre perception, which can differ among individuals (Sordo et al., 2008; Lee et al., 2013). For instance, a Westerner may relate funk to soul and jazz, while a Brazilian may relate it to baile funk, a type of rap (Hennequin et al., 2018). Thus, accounting for cultural differences in the perception of music genres could give a more grounded basis for such cultural studies. However, ensuring both a common set of music items and culture-sensitive annotations with broad coverage of music genres is strenuous (Bogdanov et al., 2019).
To address this challenge, we study the feasibility of cross-culturally annotating music items with music genres, without relying on a parallel corpus. In this work, culture is related to a community speaking the same language (Kramsch and Widdowson, 1998). The specific research question we build upon is: assuming consistent patterns of music genres association with music items within cultures, can a mapping between these patterns be learned by relying on language-specific semantic representations? It is worth noting that, since music genres fall within the class of Culture-Specific Items (Aixelá, 1996;Newmark, 1988), cross-lingual annotation, in this case, cannot be framed as standard translation, as one also needs to model the dissimilar perception of music genres across cultures.
Our work focuses on six languages from four language families, Germanic (English-en and Dutch-nl), Romance (Spanish-es and French-fr), Japonic (Japanese-ja) and Slavic (Czech-cs), and on two types of language-specific semantic representations, ontologies and multi-word expression embeddings.
First, ontologies are often used to represent music genres, showing how they relate conceptually (Schreiber, 2016). We identify Wikipedia, the online multilingual encyclopedia, as particularly relevant to our study. It extensively documents worldwide music genres, relating them through a coherent set of relation types across languages (e.g. derivative genre, sub-genre). Though the relation types are the same per language, the actual music genres and the way they are related can differ. Indeed, Pfeil et al. (2006) have shown that Wikipedia contributions expose cultural differences aligned with the ones in the physical world.
Second, music genres can be represented from a distributional semantics perspective. Word vector spaces are generated from large corpora following the distributional hypothesis, i.e. words with similar contexts have akin meanings. As languages are passed on culturally, we assume that the language-specific corpora used to create these spaces are sufficient to convey concepts' cultural specificity into their vector representations. In our study, we focus on multiple recent multilingual pre-trained models to generate word or sentence embeddings (Arora et al., 2017; Grave et al., 2018; Artetxe and Schwenk, 2019; Devlin et al., 2019; Lample and Conneau, 2019), to account for variances in the corpora used or in the model designs.
Lastly, we combine the semantic representations by retrofitting distributed music genre embeddings to music genre ontologies. Retrofitting (Faruqui et al., 2015) modifies each concept embedding such that the representation is still close to the distributed one, but also encodes ontology information. Initially, we retrofit music genres per language, using monolingual ontologies. Then, by partially aligning these ontologies, we apply retrofitting to learn multilingual embeddings from scratch.
The results show that we can model cross-lingual music genre annotation with high accuracy by combining both types of language-specific semantic representations. When comparing the representations derived from multilingual pre-trained models, the smooth inverse frequency averaging (Arora et al., 2017) of aligned word embeddings outperforms the state-of-the-art approaches. To our knowledge, this simple method has rarely been used to embed multilingual sentences (Vargas et al., 2019), and we hypothesize its potential as a strong baseline on other cross-lingual datasets and tasks too. Finally, embedding learning based on retrofitting leads to better multilingual music genre representations than those inferred with pre-trained embedding models. This opens the possibility of learning embeddings for rare music genres or languages when aligned music genre ontologies are available.
Summing up, our contributions are: 1) a study of how effective language-specific semantic representations of music genres are for modeling cross-lingual annotation, without relying on a parallel music item corpus; 2) an extensive evaluation of multilingual pre-trained embedding models to derive representations for multi-word concepts in the music domain. Our study can enable more complete musicological research, but also localized music information retrieval. This latter application is crucial for online music streaming platforms that leverage music genre annotations to provide worldwide, user-personalized music recommendations.
Our domain-specific study complements other works benchmarking general-language sentence representations (Conneau et al., 2018). Finally, we provide an in-depth formal analysis of the retrofitting part of our method. We prove the strict convexity of retrofitting and show that the ontology concepts' final embeddings converge to the same values regardless of the order in which concepts are iteratively updated, provided that at least one initial node embedding is known in each connected component of the ontology.

Related Work
Music genres are conceptual representations encompassing a set of conventions between the music industry, artists, and listeners about individual music styles (Lena, 2012). From a cultural perspective, it has been shown that there are differences in how people listen to music genres (Ferwerda and Schedl, 2016; Skowron et al., 2017). Average listening habits in some countries span many music genres, while in other countries they are less diverse (Ferwerda and Schedl, 2016). Also, cultural dimensions proved strong predictors of the popularity of specific music genres (Skowron et al., 2017).
Despite the apparent agreement on the music style for which the music genres stand, conveyed in the earlier definition and implied in the related works too, music genres are subjective concepts (Sordo et al., 2008;Lee et al., 2013). To address this subjectivity, Bogdanov et al. (2019) proposed a dataset of music items annotated with English music genres by different sources. In this line of work, we address the divergent perception of music genres. Still, we focus on multilingual, unsupervised music genre annotation without relying on content features, i.e. audio or lyrics. We also complement similar studies in other domains (art: Eleta and Golbeck, 2012) with another research method.
Other works in the literature benchmark pre-trained word and sentence embedding models. van der Heijden et al. (2019) compare multilingual contextual language models for named-entity recognition and part-of-speech tagging. Shwartz and Dagan (2019b) use multiple static and contextual word embeddings to represent multi-word expressions and assess their capacity to capture meaning shift and implicit meaning in compositionality. Conneau et al. (2018) formulate a new task to evaluate cross-lingual sentence representations centered on natural language inference.
Compared to these works, our benchmark is aimed at cross-lingual annotation; we target a specific domain, music, for which we try concept embedding adaptation with retrofitting; and we also test a multilingual sentence representation obtained with smooth inverse frequency averaging of multilingual word embeddings. As discussed in a recent survey on cross-lingual word embedding models (Ruder et al., 2019), there is a need to unlock domain-specific data to assess if general-language sentence representations are also accurate across domains. Our work builds towards this goal.

Cross-lingual Music Genre Annotation
In the following, we formalize the cross-lingual annotation task and the strategy to evaluate it in Section 3.1. We describe the test corpus used in this work, together with its collection procedure, in Section 3.2.

Problem Formalization
Cross-lingual music genre annotation consists of inferring, for music items, tags in a target language $L_t$, knowing tags in a source language $L_s$. For instance, knowing the English music genres of Fatboy Slim (big beat, electronica, alternative rock), the goal is to predict rave and rock alternativo in Spanish. As shown in this example, but also in Section 1, the problem goes beyond translation and instead targets a model able to map concepts, potentially dissimilar, across languages and cultures.
Formally, given $S$ a set of tags in language $L_s$, $\mathcal{P}(S)$ the power set of $S$, and $T$ a set of tags in language $L_t$, a mapping scoring function $f : \mathcal{P}(S) \to \mathbb{R}^{|T|}$ attributes a prediction score to each target tag, relying on subsets of source tags drawn from $S$ (Hennequin et al., 2018; Epure et al., 2019, 2020). The produced score incorporates the degree of relatedness of each particular input source tag to the target tag. A common approach to compute relatedness in distributional semantics relies on cosine similarity. Thus, for source tags $\{s_1, \ldots, s_K\}$ and any target tag $t$, $f$ can be defined as:

$$f(\{s_1, \ldots, s_K\})_t = \sum_{k=1}^{K} \frac{s_k \cdot t}{\|s_k\|_2 \, \|t\|_2} \quad (1)$$

where $s_k$ and $t$ also denote the corresponding tag embeddings and $\| \cdot \|_2$ is the Euclidean norm.
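For concreteness, the scoring function $f$ above can be sketched in a few lines of Python (a minimal illustration with our own variable names, assuming tag embeddings are given as NumPy vectors):

```python
import numpy as np

def annotation_scores(source_vecs, target_vecs):
    """Score every target-language tag given the source-language tags of
    one music item, by summing cosine similarities between source and
    target tag embeddings."""
    S = np.vstack(source_vecs)                       # (K, d) source tag embeddings
    T = np.vstack(target_vecs)                       # (|T|, d) target tag embeddings
    S = S / np.linalg.norm(S, axis=1, keepdims=True)
    T = T / np.linalg.norm(T, axis=1, keepdims=True)
    return (S @ T.T).sum(axis=0)                     # one score per target tag
```

Target tags can then be ranked by decreasing score to produce the cross-lingual annotation.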

Test Corpus
Wikipedia records worldwide music artists and their discographies, frequently mentioning their music genres. By manually checking the Wikipedia pages of miscellaneous music items, we observed that their music genres vary significantly across languages. For instance, Knights of Cydonia, a single by Muse, was annotated in Spanish as progressive rock, while in Dutch as progressive metal and alternative rock. In Figure 1, we show another example of different annotations in English, Spanish, and Japanese from Wikipedia infoboxes. As Wikipedia writing is localized, contributors' culture can lead to differences in the multilingual content on the same topic (Pfeil et al., 2006), particularly for subjective matters. Thus, Wikipedia was a suitable source for assembling the test corpus. Using DBpedia (Auer et al., 2007) as a proxy to Wikipedia, we collected music items such as artists and albums, annotated with music genres in at least two of the six languages (en, nl, fr, es, cs and ja). We targeted the MusicalWork, MusicalArtist and Band DBpedia resource types, and we only kept music items annotated with music genres that appeared at least 15 times in the corpus. Our final corpus includes 63,246 music items. The number of annotations for each language pair is presented in Table 1. We also show in Table 2 the number of unique music genres per language in the corpus and the average number of tags per music item.
The en and es languages use the most diverse tags. This can be because more annotations exist in these languages, in comparison to cs, which has the least annotations and least diverse tags. However, the mean number of tags per item appears relatively high for cs, while ja has the smallest mean number of tags per item.

Language-specific Semantic Representations for Music Genres
This work aims to assess the possibility of obtaining relevant cross-lingual music genre annotations, able to capture cultural differences too, by relying on language-specific semantic representations. Two types of semantic representations are investigated given their popularity: ontologies to represent music genre relations (presented in Section 4.1) and distributed embeddings to represent multi-word expressions in general (presented in Section 4.2). In contrast to this unsupervised approach, mapping patterns of associating music genres with music items across cultures could also have been enabled with a parallel corpus. However, gathering a corpus that includes all music genres for each pair of languages is challenging.

Music Genre Ontology
Conceptually, music genres are interconnected entities. For example, rap west coast is a sub-genre of hip hop, and música electrónica is the origin of synthpunk. Academic and practitioner communities often use ontologies or knowledge graphs to represent music genre relations and enrich music genre definitions (Schreiber, 2016; Lisena et al., 2018). As mentioned in Section 1, we use in this study Wikipedia-based music genre ontologies because the multilingual Wikipedia contributions on the same topic can differ, and these differences have been proven aligned with the ones in the physical world (Pfeil et al., 2006). We further describe how we crawl the Wikipedia-based music genre ontologies for the six languages by relying on DBpedia. For each language, we first constitute the seed list using two sources: the DBpedia resources of type MusicGenre and their aliases linked through the wikiPageRedirects relation; and the music genres discovered when collecting the test corpus (introduced in Section 3.2), together with their aliases. Then, music genres are fetched by visiting the DBpedia resources linked to the seeds through the relations wikiPageRedirects, musicSubgenre, stylisticOrigin, musicFusionGenre and derivative. The seed list is updated each time, allowing the crawling to continue until no new resource is found.
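The crawling procedure above is essentially a breadth-first traversal of the genre graph. The sketch below illustrates it under the assumption that DBpedia lookups are abstracted away: `linked_genres` is a hypothetical callable standing in for an actual SPARQL or HTTP client, not a real DBpedia API.

```python
from collections import deque

# Relations followed during the crawl, as listed in the text.
RELATIONS = ("wikiPageRedirects", "musicSubgenre", "stylisticOrigin",
             "musicFusionGenre", "derivative")

def crawl_genres(seeds, linked_genres):
    """Breadth-first crawl of a music genre ontology. `linked_genres`
    takes a resource and returns the resources reached through RELATIONS;
    crawling stops once no new resource is found."""
    seen, frontier = set(seeds), deque(seeds)
    while frontier:
        genre = frontier.popleft()
        for neighbour in linked_genres(genre):
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append(neighbour)
    return seen
```

On a toy adjacency table, `crawl_genres(["rock"], ...)` returns every genre reachable from the seed, leaving unconnected genres out.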
In DBpedia, resources are sometimes linked to their equivalents in other languages through the relation sameAs. For most experiments, we rely on monolingual music genre ontologies. However, we also collect the cross-lingual links between music genres to include a translation baseline for cross-lingual annotation, i.e. for each music genre in a source language, we predict its equivalent in a target language using DBpedia. Besides, we try to learn aligned embeddings from scratch by relying on these partially aligned music genre ontologies, as will be discussed in Section 4.3.
The number of unique Wikipedia music genres discovered in each language is presented in Table 2. Let us note that the graph numbers are much larger than the test corpus numbers, emphasizing the challenge of constituting a parallel corpus that covers all language-specific music genres.

Music Genre Distributed Representations
As music genres are multi-word expressions, we make use of existing sentence representation models. We also inquire into word vector spaces and obtain sentence embeddings by hypothesizing that music genres are generally compositional, i.e. the sense of a multi-word expression is conveyed by the sense of each composing word (e.g. West Coast rap, jazz blues; there are also non-compositional examples like hard rock). We set our investigation scope to multilingual pre-trained embedding models, and we consider both static and contextual word/sentence representations as described next.
Multilingual Static Word Embeddings. The classical word embeddings we study are the multilingual fastText word vectors trained on Wikipedia and Common Crawl (Grave et al., 2018). The model is an extension of the Continuous Bag-of-Words model (CBOW, Mikolov et al., 2013) which includes subword and word position information. The fastText word vectors are trained separately per language. Thus, we must ensure that the monolingual word vectors are projected into the same space for cross-lingual annotation. We perform the alignment with the method proposed by Joulin et al. (2018), which treats word translation as a retrieval task and introduces a new loss relying on a relaxed cross-domain similarity local scaling criterion.
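To illustrate what projecting two monolingual spaces into one means, the sketch below uses the simpler orthogonal Procrustes solution over a seed dictionary. Note that this is only an illustrative stand-in, not the relaxed CSLS criterion of Joulin et al. (2018) actually used in our experiments.

```python
import numpy as np

def procrustes_align(X_src, Y_tgt):
    """Orthogonal map W minimising ||X_src @ W - Y_tgt||_F, where matching
    rows of X_src and Y_tgt are the embeddings of seed translation pairs.
    Solved in closed form via the SVD of the cross-covariance matrix."""
    U, _, Vt = np.linalg.svd(X_src.T @ Y_tgt)
    return U @ Vt  # apply as x @ W to map a source vector into the target space
```

If the target space is an exact rotation of the source space, this recovers the rotation from the seed pairs alone.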
Multilingual Contextual Word Embeddings. Contextual word embeddings (Peters et al., 2017; Devlin et al., 2019), in contrast to the classical ones, are dynamically inferred based on the given context sentence. This type of embedding can address polysemy, as the word sense is disambiguated through the surrounding text. In our work, we include two recent contextualized language models compatible with the multilingual scope: multilingual Bidirectional Encoder Representations from Transformers (BERT, Devlin et al., 2019) and the Cross-lingual Language Model (XLM, Lample and Conneau, 2019).
BERT (Devlin et al., 2019) is trained to jointly predict a masked word in a sentence and whether two sentences are successive text segments. Similar to fastText (Grave et al., 2018), subword and word position information is also used. An input sentence is tokenized against a limited token vocabulary with a modified version of the byte pair encoding algorithm (BPE, Sennrich et al., 2016). Multilingual BERT is trained as a single model fed with the concatenation of 104 monolingual Wikipedias (Pires et al., 2019; Wu and Dredze, 2019).
XLM (Lample and Conneau, 2019) has a similar architecture to BERT. It shares with BERT one training objective, masked word prediction, as well as BPE tokenization, but applied to sentences sampled from each monolingual Common Crawl corpus. Compared to BERT, two additional objectives are introduced: predicting a word from the previous words, and predicting a masked word by leveraging two parallel sentences. Thus, several multilingual aligned corpora are used to train XLM (Lample and Conneau, 2019).
Multilingual Sentence Embeddings. Contextualized language models can be exploited in multiple ways. First, as Lample and Conneau (2019) show, by training the transformers on multilingual data, cross-lingual word vectors are obtained in an unsupervised way. These word vectors can be accessed through the model lookup table. They are aligned but not contextual, thus directly comparable to fastText. For these three types of cross-lingual non-contextual word embeddings, fastText (FT), the multilingual BERT lookup table (mBERT) and the XLM lookup table (XLM), we compute the sentence embedding using the standard average (avg) or the smooth inverse frequency averaging (sif) introduced by Arora et al. (2017).
Formally, let $c$ denote a music genre composed of multiple tokens $\{t_1, t_2, \ldots, t_M\}$, $t_m \in \mathbb{R}^d$ the embedding of each token $t_m$, initialized from a given pre-trained embedding model or set to the $d$-dimensional null vector $0_d$ if $t_m$ is absent from the model vocabulary, and $\hat{q}_c \in \mathbb{R}^d$ the representation of $c$ which we want to infer. The avg strategy computes $\hat{q}_c$ as $\frac{1}{M} \sum_{m=1}^{M} t_m$. The sif strategy first computes:

$$\tilde{q}_c = \frac{1}{M} \sum_{m=1}^{M} \frac{a}{a + f_{t_m}} t_m \quad (2)$$

and then removes the projection on the first singular vector $u$, i.e. $\hat{q}_c = \tilde{q}_c - u u^\top \tilde{q}_c$, where $f_{t_m}$ is the frequency of $t_m$, $a$ is a hyperparameter usually fixed to $10^{-3}$ (Arora et al., 2017), and $u$ is the first singular vector obtained through the singular value decomposition (Golub and Reinsch, 1971) of $Q$, the matrix whose rows are computed with Equation 2 for all music genres. Vocabulary tokens of pre-trained embedding models are usually sorted by decreasing frequency in the training corpus. Thus, based on Zipf's law (Zipf, 1949), $f_{t_m}$ can be approximated by $1/z_{t_m}$, $z_{t_m}$ being the rank of $t_m$. The intuition behind this simple sentence embedding method is that uncommon words are semantically more informative.

Second, contextualized language models can be used as feature extractors, representing sentences from the contextual embeddings of the associated tokens. Multiple strategies exist to retrieve contextual token embeddings: using the embeddings layer, using the last hidden layer, or applying min or max pooling over time (Devlin et al., 2019). To infer a fixed-length representation of a multi-word music genre, we try max and mean pooling over token embeddings (Lample and Conneau, 2019; Reimers and Gurevych, 2019), obtained with the diverse strategies mentioned before. We denote these sentence embeddings XLM Ctxt and mBERT Ctxt.
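The sif strategy of Equation 2, together with the removal of the first singular direction and the Zipfian frequency proxy $f_{t_m} \approx 1/z_{t_m}$, can be sketched as follows. The `word_vec` interface, mapping a token to a (vector, rank) pair, is our own simplification of a pre-trained model lookup.

```python
import numpy as np

def sif_embeddings(genres_tokens, word_vec, a=1e-3):
    """Smooth inverse frequency averaging (Arora et al., 2017). Each token
    frequency is approximated by 1/rank; unknown tokens get a null vector.
    The shared first singular direction is removed at the end."""
    d = next(iter(word_vec.values()))[0].shape[0]
    Q = []
    for tokens in genres_tokens:
        vecs = []
        for t in tokens:
            if t in word_vec:
                v, rank = word_vec[t]
                vecs.append(a / (a + 1.0 / rank) * v)  # weight a / (a + f_t)
            else:
                vecs.append(np.zeros(d))
        Q.append(np.mean(vecs, axis=0))
    Q = np.vstack(Q)
    # Remove the projection on the first right singular vector of Q.
    u = np.linalg.svd(Q, full_matrices=False)[2][0]
    return Q - np.outer(Q @ u, u)
```

After the removal step, every genre embedding is orthogonal to the dominant common direction, which is what down-weights uninformative shared components.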
The contextualized language models can be further fine-tuned for particular downstream tasks, yielding better sentence representations (Eisenschlos et al., 2019;Lample and Conneau, 2019). Existing evaluations of cross-lingual sentence representations are centered on natural language inference (XNLI, Conneau et al., 2018) or classification (Eisenschlos et al., 2019). The cross-lingual music genre annotation would be closer to the XNLI task; hence we could fine-tune the pre-trained models on a parallel corpus of music genres translations or music genre annotations. However, our research investigates language-specific semantic representations. Also, using translated music genres would not model their different perception across cultures while obtaining an exhaustive corpus of cross-lingual annotations is challenging.
Last, we explore LASER, a universal language-agnostic sentence embedding model (Artetxe and Schwenk, 2019). The model is based on a BiLSTM encoder trained on corpora in 93 languages to learn multilingual fixed-length sentence embeddings. As in other models, sentences are tokenized against a fixed vocabulary, obtained with BPE from the concatenated multilingual corpora. LASER appears highly effective without requiring task-specific fine-tuning (Artetxe and Schwenk, 2019).

Retrofitting Music Genre Distributed Representations to Ontologies

Retrofitting (Faruqui et al., 2015) is a method to refine vector space word representations by considering the relations between words as defined in semantic lexicons such as WordNet (Miller, 1995). The intuition is to modify the distributed embeddings to become closer to the representations of the concepts to which they are related. Ever since the original work, many uses of retrofitting have been explored to semantically specialize word embeddings in relations such as synonyms or antonyms (Kiela et al., 2015; Kim et al., 2016), in languages other than a source one (Ponti et al., 2019), or in specific domains (Hangya et al., 2018). Enhanced extensions of retrofitting exist, but they require supervision (Lengerich et al., 2018). The original method (Faruqui et al., 2015) is unsupervised and can simply yet effectively leverage distributed embeddings and ontologies for improved representations. Thus, we mainly rely on it, but we apply some changes as further described.
Let $\Omega = (C, E)$ be an ontology including the concepts $C$ and the semantic relations between these concepts, $E \subseteq C \times C$. The retrofitting goal is to learn new concept embeddings $Q \in \mathbb{R}^{n \times d}$, with $n = |C|$ and $d$ the embedding dimension. The learning starts by initializing each $q_i \in \mathbb{R}^d$, the new embedding for concept $i \in C$, to $\hat{q}_i$, the initial distributed embedding, and then iteratively updates $q_i$ until convergence as follows:

$$q_i = \frac{\alpha_i \hat{q}_i + \sum_{j : (i,j) \in E \lor (j,i) \in E} (\beta_{ij} + \beta_{ji}) \, q_j}{\alpha_i + \sum_{j : (i,j) \in E \lor (j,i) \in E} (\beta_{ij} + \beta_{ji})} \quad (4)$$

$\alpha$ and $\beta$ are positive scalars weighting the importance of the initial, respectively the related, concept embeddings in the computation. The formula is reached by optimizing the retrofitting objective $\sum_{i \in C} \big( \alpha_i \|q_i - \hat{q}_i\|^2 + \sum_{j : (i,j) \in E} \beta_{ij} \|q_i - q_j\|^2 \big)$ with the Jacobi method (Saad, 2003). Equation 4 is a corrected version of the original work: for a concept $i$, not only $\beta_{ij}$ appears in it, but also $\beta_{ji}$. That is, when computing the partial derivative of the retrofitting objective with respect to $q_i$, there are two non-zero terms corresponding to a related concept $j$: one when $i$ is the source and $j$ is the target, and vice versa (Bengio et al., 2006; Saha et al., 2016). The further modifications that we make regard the parameters $\alpha$ and $\beta$. For each $i \in C$, Faruqui et al. (2015) fix $\alpha_i$ to 1, and $\beta_{ij}$ to $1/\mathrm{degree}(i)$ for $(i,j) \in E$ or 0 otherwise, where $\mathrm{degree}(i)$ is the number of related concepts $i$ has in $\Omega$.

Table 3: Macro-AUC scores (in %, best overall in bold, best locally underlined). The first part corresponds to the translation baselines; the second to the best distributed representations; the last to the retrofitted FT sif vectors.
While many embedding models can handle unknown words nowadays, concepts may still have unknown initial distributed vectors, depending on the chosen model. For this case, expanded retrofitting (Speer and Chin, 2016) has been proposed, setting $\alpha_i = 0$ for each concept $i$ with an unknown initial distributed vector, and $\alpha_i = 1$ for the rest. Thus, $q_i$ is initialized to $0_d$ and updated by averaging the embeddings of its related concepts at each iteration. Let us notice that, through retrofitting, representations are not only modified but also learned from scratch for some concepts.
We adopt the same approach to initialize $\alpha$. Moreover, we also adjust the parameters $\beta$ to weight the importance of each related concept embedding depending on the relation semantics in our music genre ontology (Epure et al., 2020). Specifically, we distinguish between equivalence and relatedness as follows:

$$\beta_{ij} = \begin{cases} 1 & \text{if } (i,j) \in \bar{E} \\ 1/\mathrm{degree}(i) & \text{if } (i,j) \in E - \bar{E} \\ 0 & \text{otherwise} \end{cases}$$

where $\bar{E} \subseteq E$ contains the equivalence relation types (wikiPageRedirects, sameAs) and $E - \bar{E}$ contains the relatedness relation types (stylisticOrigin, musicSubgenre, derivative, musicFusionGenre). We label this modified version of retrofitting Rfit.
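A minimal sketch of the Jacobi-style updates of Equation 4, including the symmetric $\beta_{ij} + \beta_{ji}$ correction, could look as follows. We assume, as in the convergence condition discussed below, that every connected component contains at least one concept with a known initial vector, so no denominator vanishes; all names are ours.

```python
import numpy as np

def retrofit(q_hat, edges, alpha, beta, n_iter=100):
    """Jacobi-style retrofitting updates. `q_hat` holds the initial
    distributed vectors (one row per concept), `edges` is a set of
    directed (i, j) pairs listed once per pair, `alpha` is 0 for concepts
    without a known initial vector and `beta` the relation weights."""
    n, d = q_hat.shape
    q = q_hat.copy()
    # Symmetric neighbour weights: both beta[i, j] and beta[j, i] contribute.
    w = np.zeros((n, n))
    for i, j in edges:
        w[i, j] += beta[i, j] + beta[j, i]
        w[j, i] += beta[i, j] + beta[j, i]
    for _ in range(n_iter):
        q_new = np.empty_like(q)
        for i in range(n):
            denom = alpha[i] + w[i].sum()
            q_new[i] = (alpha[i] * q_hat[i] + w[i] @ q) / denom
        q = q_new
    return q
```

With $\alpha_i = 0$ for a concept lacking an initial vector, its embedding is indeed learned from scratch as a weighted average of its neighbours.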
Finally, we want to highlight a crucial aspect of retrofitting. Previous works (Speer and Chin, 2016; Hayes, 2019; Fang et al., 2019) claim that, while the retrofitting updating procedure converges, the results depend on the order in which the updates are made. We prove in Appendix A that the retrofitting objective is strictly convex when at least one initial concept vector is known in each connected component. Hence, with this condition satisfied, retrofitting always converges to the same solution, independently of the update order.
Cross-lingual music genre annotation, as formalized in Section 3, is a typical multi-label prediction task. For evaluation, we use the Area Under the receiver operating characteristic Curve (AUC, Bradley, 1997), macro-averaged. We report the means and standard deviations of the macro-AUC scores using 3-fold cross-validation. For each language, we apply an iterative split (Sechidis et al., 2011) of the test corpus that balances the number of samples and the tag distributions across the folds. We pre-process the music genres by either replacing special characters with space ( -/,) or removing them (()':.!$). For Japanese, we introduce spaces between tokens obtained with MeCab. Embeddings are then computed from the pre-processed tags.
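Since macro-AUC drives all reported results, here is a self-contained sketch of its computation via the Mann-Whitney statistic, averaged over target tags; the data in the usage test are illustrative only, and in practice an off-the-shelf implementation would be used.

```python
import numpy as np

def macro_auc(y_true, y_score):
    """Macro-averaged AUC over tags (columns). For each tag, the AUC is
    the fraction of (positive, negative) item pairs ranked correctly by
    the scores, with ties counted as half; single-class tags are skipped."""
    aucs = []
    for t in range(y_true.shape[1]):
        pos = y_score[y_true[:, t] == 1, t]
        neg = y_score[y_true[:, t] == 0, t]
        if len(pos) == 0 or len(neg) == 0:
            continue
        wins = (pos[:, None] > neg[None, :]).sum() \
             + 0.5 * (pos[:, None] == neg[None, :]).sum()
        aucs.append(wins / (len(pos) * len(neg)))
    return float(np.mean(aucs))
```

Rows are music items, columns are target-language tags; `y_true` holds the ground-truth annotations and `y_score` the outputs of the mapping function.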
We test two translation baselines, one based on Google Translate (GTrans) and one on the DBpedia sameAs relation (DBpSameAs). In this case, a source music genre is mapped to a single target music genre or none, its prediction vector being of the form $\{0,1\}^{|T|}$. For XLM Ctxt, we compute the sentence embedding by averaging the token embeddings obtained with mean pooling across all layers. For mBERT Ctxt, we apply the same strategy, but with max pooling of the token embeddings instead. We chose these representations as they experimentally showed the best performance compared to the other strategies described in Section 4.2.
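The two pooling strategies used for XLM Ctxt and mBERT Ctxt reduce, per music genre, to the following sketch (the token embeddings are assumed to be already extracted from the chosen layers; the function name is ours):

```python
import numpy as np

def pool_sentence(token_embs, strategy="mean"):
    """Pool contextual token embeddings into one fixed-length genre
    embedding: element-wise mean (as for XLM Ctxt) or element-wise max
    (as for mBERT Ctxt)."""
    E = np.vstack(token_embs)          # (num_tokens, d)
    return E.mean(axis=0) if strategy == "mean" else E.max(axis=0)
```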
When retrofitting language-specific music genre embeddings, we use the corresponding monolingual ontology (Rfit uΩ). When we learn multilingual embeddings from scratch with retrofitting, knowing only the music genre embeddings in one language (la), we use the partially aligned DBpedia ontologies which contain the sameAs relations (Rfit la aΩ). For this case, we also propose a baseline representing a source concept embedding as a vector of geodesic distances in the partially aligned ontologies to each target concept (DBp aΩ NNDist).

Table 3 shows the cross-lingual annotation results. The standard translation, GTrans, leads to the lowest results, being outperformed by a knowledge-based translation more adapted to this domain (DBpSameAs). These results show that translation methods fail to capture the dissimilar cross-cultural music genre perception.
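The DBp aΩ NNDist baseline can be sketched as a breadth-first computation of geodesic distances over the partially aligned ontologies; the adjacency-dict interface below is our simplification of the actual graph storage.

```python
from collections import deque

def geodesic_scores(adj, source, targets):
    """Score each target concept by the negated BFS geodesic distance
    from a source concept in the (partially aligned) ontology `adj`,
    a dict mapping a node to its neighbours. Closer concepts score
    higher; unreachable targets get -inf."""
    dist = {source: 0}
    frontier = deque([source])
    while frontier:
        node = frontier.popleft()
        for nb in adj.get(node, ()):
            if nb not in dist:
                dist[nb] = dist[node] + 1
                frontier.append(nb)
    return [-dist[t] if t in dist else float("-inf") for t in targets]
```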

Results.
Table 4: Macro-AUC scores (in %; those larger than Rfit uΩ FT sif in Table 3 in bold) with vectors learned by retrofitting to aligned monolingual ontologies.

The second part of Table 3 corresponds to the best music genre embeddings computed with each word/sentence pre-trained model or method. When averaging static multilingual word embeddings, those from mBERT often yield the most relevant cross-lingual annotations, while when applying the sif averaging, the aligned FT word vectors are the best choice. Between the two contextual word embedding models, XLM Ctxt significantly outperforms mBERT Ctxt, thus we report only the former. We can notice that all distributed representations of music genres can model quite well the varying music genre annotation across languages. FT sif results in the most relevant cross-lingual annotations consistently for 5 out of 6 languages as a source. For cs though, the embeddings from XLM Ctxt are sometimes slightly better. LASER under-performs for most languages but ja, for which the vectors obtained with mBERT avg are less suitable.
The last column of Table 3 shows the results of cross-lingual annotation when using the FT sif vectors retrofitted to monolingual music genre ontologies. The domain adaptation of concept embeddings, inferred with general-language pre-trained models, significantly improves music genre annotation modeling across all pairs of languages.

Table 4 shows the results when using retrofitting to learn music genre embeddings from scratch. Here, distributed vectors are known for one language (en and ja, respectively) and the monolingual ontologies are partially aligned. Even though not all music genres are necessarily linked to their equivalents in the other language, the concept representations learned in this way are more relevant for cross-lingual annotation, for all pairs involving en as the source and for ja-cs and ja-en. In fact, the baseline (DBp aΩ NNDist) reveals that the aligned ontologies alone can model cross-lingual annotation quite well, in particular for en.
Discussion. The results show that using translation to produce cross-lingual annotations is limited as it does not consider the culturally divergent perception of music genres. Instead, monolingual semantic representations can model this phenomenon rather well. For instance, from Milton Cardona's music genres in es, salsa and jazz, it correctly predicts the Japanese equivalent of fusion (フュー ジョン) in ja. Yet, while a thorough qualitative analysis requires more work, preliminary exploration suggests that larger gaps in perception might still be inadequately modeled. For instance, for Santana's album Welcome tagged with jazz in es, it does not predict pop in fr.
When comparing the distributed embeddings, a simple method that relies on a weighted average of multilingual aligned word vectors significantly outperforms the others. Although rarely used before, it remains to be tested whether such high performance can be observed on other multilingual datasets. The cross-lingual annotations are further improved by retrofitting the distributed embeddings to monolingual ontologies. Interestingly, the vector alignment does not appear degraded by retrofitting to disjoint graphs; or, if it is, the negative impact is limited and exceeded by the benefit of introducing domain knowledge into the representations. Further, as shown in Table 4, joining semantic representations in this way proves very suitable for learning music genre vectors from scratch.
Regarding the scores per language, we obtained the lowest ones with ja as the source. We could explain this by either a more challenging test corpus or still-incompatible embeddings in ja, possibly because of the quality of the individual embedding models for this language and the incompleteness of the Japanese music genre ontology. Also, we did not notice any particular improvement for pairs of languages from the same language family, e.g. fr and es. However, a sufficiently sizeable parallel corpus, exhaustively annotated in all languages, would be needed to reliably compare performance between pairs of languages from the same family and pairs from different ones.
Finally, by closely analysing the results in Table 3, we noticed that given two languages L1 and L2, with more music genre embeddings in L1 than in L2 (from both ontology and corpus), the results of mapping annotations from L1 to L2 seem always better than the results of mapping from L2 to L1. This observation explains two trends in Table 3. First, the scores achieved with en or es as the source, the languages with the largest numbers of music genres, are the best. Second, the results for the same pair of languages can vary considerably depending on the role each language plays, source or target.
One possible explanation is that predicting from a language with fewer music genre tags (L2) towards a language with more music genre tags (L1) is more challenging because the target language contains more specific or rare annotations. For instance, when checking the results per tag from cs to en, we observed that the tags with the lowest scores included moombahton, zeuhl, and candombe. However, common music genres, such as latin music or hard rock, were also poorly predicted, showing that other causes exist too. Is the unbalanced number of music genres used in annotations a cultural consequence? Related work (Ferwerda and Schedl, 2016) seems to support this hypothesis. Could we then design a better mapping function that leverages the unbalanced numbers of music genres in cross-cultural annotations? We leave a thorough investigation of these questions for future work.

Conclusion
We have presented an extensive investigation of cross-lingual modeling of music genre annotation, focused on six languages and on two common approaches to semantically represent concepts: ontologies and distributed embeddings9.
Our work provides a methodological framework to study annotation behavior across language-bound cultures in other domains too. Hence, the effectiveness of language-specific concept representations in modeling culturally diverse perception could be probed further. In addition, we combined the semantic representations only through retrofitting; inspired by paraphrastic sentence embedding learning, one could also consider the music genre relations as paraphrasing forms of different strengths (Wieting et al., 2016). Finally, the models that generate cross-lingual annotations should be thoroughly evaluated in downstream music retrieval and recommendation tasks.

A Strict Convexity of Retrofitting

Theorem. Let V be a finite vocabulary with |V| = n. Let Ω = (V, E) be an ontology represented as a directed graph which encodes semantic relationships between vocabulary words. Further, let V̂ ⊆ V be the subset of words which have non-zero initial distributed representations q̂_i. The goal of retrofitting is to learn the matrix Q ∈ ℝ^{n×d}, stacking up the new embeddings q_i ∈ ℝ^d for each i ∈ V. The objective function to be minimized is:

Φ(Q) = Σ_{i∈V̂} α_i ||q_i − q̂_i||² + Σ_{(i,j)∈E} β_ij ||q_i − q_j||²,

where the α_i and β_ij are positive scalars. Assuming that each connected component of Ω includes at least one word from V̂, the objective function Φ is strictly convex w.r.t. Q.
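For concreteness, the retrofitting objective can be written directly in code. The following is a minimal numpy sketch with illustrative names; alpha[i] is taken to be 0 for words outside V̂, and beta maps directed ontology edges to positive weights:

```python
import numpy as np

def retrofit_objective(Q, Q_hat, alpha, beta):
    """Phi(Q) = sum_i alpha[i] * ||Q[i] - Q_hat[i]||^2
              + sum over edges (i, j) of beta[(i, j)] * ||Q[i] - Q[j]||^2
    """
    # Fit term: stay close to the initial embeddings (where they exist).
    fit = sum(alpha[i] * np.sum((Q[i] - Q_hat[i]) ** 2)
              for i in range(len(Q)))
    # Smoothness term: neighbors in the ontology stay close to each other.
    smoothness = sum(w * np.sum((Q[i] - Q[j]) ** 2)
                     for (i, j), w in beta.items())
    return fit + smoothness
```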
Proof. First of all, let Q̂ denote the n × d matrix whose i-th row corresponds to q̂_i if i ∈ V̂, and to the d-dimensional null vector 0_d otherwise.
Let A denote the n × n diagonal matrix verifying A_ii = α_i if i ∈ V̂ and A_ii = 0 otherwise. Let B denote the n × n symmetric matrix such that, for all i, j ∈ {1, ..., n} with i ≠ j, B_ij = B_ji = −(1/2)(β_ij + β_ji) and B_ii = Σ_{j=1, j≠i}^{n} |B_ij|. With these notations, and with Tr(·) the trace operator for square matrices, we have:

Φ(Q) = Tr((Q − Q̂)^T A (Q − Q̂)) + Tr(Q^T B Q).

Therefore, as the trace is a linear mapping, we have:

Φ(Q) = Tr(Q^T (A + B) Q) − 2 Tr(Q̂^T A Q) + Tr(Q̂^T A Q̂).

Then, we note that A + B is a weakly diagonally dominant (WDD) matrix as, by construction, ∀i ∈ {1, ..., n}, |(A + B)_ii| ≥ Σ_{j≠i} |(A + B)_ij|. Also, for all i ∈ V̂, the inequality is strict, as |(A + B)_ii| = α_i + Σ_{j≠i} |B_ij| > Σ_{j≠i} |(A + B)_ij| = Σ_{j≠i} |B_ij|, which means that, for all i ∈ V̂, row i of A + B is strictly diagonally dominant (SDD). Assuming that each connected component of the graph Ω includes at least one node from V̂, we conclude that A + B is a weakly chained diagonally dominant matrix (Azimzadeh and Forsyth, 2016), i.e. that:

• A + B is WDD;
• for each i ∈ V such that row i is not SDD, there exists a walk in the graph whose adjacency matrix is A + B (two nodes i and j being connected if (A + B)_ij = (A + B)_ji ≠ 0), starting from i and ending at a node associated with an SDD row.
Such matrices are nonsingular (Azimzadeh and Forsyth, 2016), which implies that the quadratic form Q ↦ Tr(Q^T (A + B) Q) is positive definite. As A + B is a symmetric positive-definite matrix, there exists a matrix M such that A + B = M^T M. Therefore, denoting by ||·||²_F the squared Frobenius matrix norm:

Tr(Q^T (A + B) Q) = Tr((MQ)^T (MQ)) = ||MQ||²_F,

which is strictly convex w.r.t. Q due to the strict convexity of the squared Frobenius norm (see e.g. Dattorro (2005)). Since the sum of strictly convex functions of Q (first trace in Φ(Q)) and linear functions of Q (second trace in Φ(Q)) is still strictly convex w.r.t. Q, we conclude that the objective function Φ is strictly convex w.r.t. Q.

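The construction above can also be checked numerically on a toy example. The sketch below builds A and B for a three-node chain in which only node 0 belongs to V̂ (so the single connected component meets V̂, as the theorem requires) and verifies that A + B is symmetric positive definite; all weights are illustrative:

```python
import numpy as np

n = 3
alpha = np.array([1.0, 0.0, 0.0])   # alpha_i > 0 only for i in V-hat (node 0)
beta = {(0, 1): 1.0, (1, 2): 1.0}   # directed ontology edges of the chain

A = np.diag(alpha)
B = np.zeros((n, n))
for (i, j), w in beta.items():      # off-diagonal: B_ij = B_ji = -(beta_ij + beta_ji) / 2
    B[i, j] -= w / 2.0
    B[j, i] -= w / 2.0
np.fill_diagonal(B, np.abs(B).sum(axis=1))  # diagonal: B_ii = sum_{j != i} |B_ij|

eigvals = np.linalg.eigvalsh(A + B)
assert np.all(eigvals > 0)          # A + B is symmetric positive definite
```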
Corollary. The retrofitting update procedure is insensitive to the order in which nodes are updated.

The aforementioned updating procedure for Q (Faruqui et al., 2015) is derived from the Jacobi iteration procedure (Saad, 2003; Bengio et al., 2006) and converges for any initialization. Such a convergence result is discussed in Bengio et al. (2006). It can also be verified directly in our specific setting by checking that each irreducible diagonal block of A + B, i.e. each connected component of the underlying graph constructed from this matrix, is irreducibly diagonally dominant (see Section 4.2.3 in Saad (2003)) and then by applying Theorem 4.9 from Saad (2003) to each of these components. Besides, due to its strict convexity w.r.t. Q, the objective function Φ admits a unique global minimum. Consequently, the retrofitting update procedure converges to the same embedding matrix regardless of the order in which nodes are updated.
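This corollary can be illustrated empirically. The sketch below (illustrative names; sequential Gauss-Seidel-style sweeps rather than synchronous Jacobi updates, which is precisely the setting where node order could matter) runs the Faruqui et al. (2015)-style update q_i ← (α_i q̂_i + Σ_j β_ij q_j) / (α_i + Σ_j β_ij) with two opposite node orders and checks that both converge to the same matrix:

```python
import numpy as np

def retrofit(Q_hat, alpha, neighbors, order, iters=200):
    """Sequential retrofitting sweeps, visiting nodes in the given order."""
    Q = Q_hat.copy()
    for _ in range(iters):
        for i in order:
            num, den = alpha[i] * Q_hat[i], alpha[i]
            for j, w in neighbors[i]:
                num = num + w * Q[j]
                den += w
            Q[i] = num / den
    return Q

# Chain 0 - 1 - 2; only node 0 has an initial embedding.
Q_hat = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 0.0]])
alpha = [1.0, 0.0, 0.0]
neighbors = {0: [(1, 1.0)], 1: [(0, 1.0), (2, 1.0)], 2: [(1, 1.0)]}

Q_fwd = retrofit(Q_hat, alpha, neighbors, order=[0, 1, 2])
Q_bwd = retrofit(Q_hat, alpha, neighbors, order=[2, 1, 0])
assert np.allclose(Q_fwd, Q_bwd)  # same fixed point for both orders
```

On this toy chain the unique minimizer propagates node 0's initial vector to the whole component, so both orders end at the same embedding matrix.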

B Extended Results
The following Tables 5 and 6 provide more complete results from our experiments.

Table 5: Macro-AUC scores (in %, best locally underlined). The first two parts correspond to applying plain averaging or sif averaging to static multilingual word embeddings; the third part corresponds to the contextual sentence embeddings.