Manual Clustering and Spatial Arrangement of Verbs for Multilingual Evaluation and Typology Analysis

We present the first evaluation of the applicability of a spatial arrangement method (SpAM) to a typologically diverse language sample, and of its potential to produce semantic evaluation resources to support multilingual NLP, with a focus on verb semantics. We demonstrate SpAM’s utility for quick, bottom-up creation of large-scale evaluation datasets that balance cross-lingual alignment with language specificity. Starting from a shared sample of 825 English verbs, translated into Chinese, Japanese, Finnish, Polish, and Italian, we apply a two-phase annotation process which produces (i) semantic verb classes and (ii) fine-grained similarity scores for nearly 130 thousand verb pairs. We use the two types of verb data to (a) examine cross-lingual similarities and variation, and (b) evaluate the capacity of static and contextualised representation models to accurately reflect verb semantics, contrasting the performance of large language-specific pretrained models with their multilingual equivalents on semantic clustering and lexical similarity, across different domains of verb meaning. We release the data from both phases as a large-scale multilingual resource, comprising 85 verb classes and nearly 130k pairwise similarity scores, offering a wealth of possibilities for further evaluation and research on multilingual verb semantics.


Introduction
Many recent efforts in semantic modeling have focused on unsupervised pretraining to extend the benefits offered by recently proposed text encoders (Devlin et al., 2019) to new languages and domains. In these approaches, general language representations are learned from large volumes of unlabeled text, and subsequently leveraged in downstream systems by means of fine-tuning on a given supervised task. The release of large multilingual pretrained encoders (Devlin et al., 2019; Conneau and Lample, 2019) boosted the state of the art on a range of multilingual tasks (Kondratyuk and Straka, 2019; Wang et al., 2019; Pires et al., 2019; Wu and Dredze, 2019; Hu et al., 2020; Artetxe et al., 2020; Qiu et al., 2020; Mueller et al., 2020). In parallel, the number of language-specific pretrained architectures available has also been steadily growing, with the advantage of being more attuned to the properties of the language in question (Virtanen et al., 2019; Nozza et al., 2020). The ease of incorporating these powerful encoders into downstream task pipelines has made them widely popular. However, there is a disproportionate shortage of resources allowing for probing of the learned representations in most languages. The aim of this work is to address this deficit by releasing a multilingual resource targeting verb semantics in a typologically diverse selection of languages where no such datasets have hitherto been available. The motivation behind the specific focus on verbs is twofold: (i) the importance of accurate and nuanced representation of verb meaning, in light of verbs' pivotal role in sentence structure and the still subpar verbal reasoning ability of SOTA models (Rogers et al., 2020), and (ii) the scarcity of verb data in currently available evaluation datasets.
To this end, we employ a recently proposed two-phase data collection method combining semantic clustering (Phase 1) and finer-grained spatial arrangements of words based on their similarity (Phase 2), and evaluate its cross-lingual applicability.

Table 1: Data statistics, including the number of unique verbs in each sample (translated from English) (N verbs), the number of Phase 1 classes (N classes), the total number of pairwise scores in the final dataset (N pairs), and the thresholded subset of each dataset (THR pairs) (see §4.2).

Using cross-lingual mappings,
we carry out analyses of cross-language overlap in the semantic classes created in Phase 1, as well as quantitative and qualitative comparisons of the semantic distance matrices from Phase 2. Subsequently, we perform evaluation of static and contextualised representation models on the tasks of lexical similarity and semantic clustering using the data from both phases. This allows us to identify models' strengths and shortcomings, as well as specific challenges posed by the languages' properties and different domains of verb meaning. The collected data, comprising semantic classes and fine-grained pairwise similarity scores for Chinese, Japanese, Finnish, Italian, and Polish, are made freely available with this paper at https://github.com/om304/Multi-SpA-Verb.

Background and Design Motivation
Word similarity has been widely used as a go-to intrinsic evaluation task, in which rankings of similarity scores computed between word embeddings produced by representation models are compared against ranked human similarity judgments. The dataset design involving sets of word pairs and their associated ratings on a discrete scale has been particularly common, due to its reliance on non-expert native-speaker judgments, quicker and cheaper to obtain than large expert-curated lexical-semantic or semantic-syntactic resources such as WordNet (Fellbaum, 1998) or VerbNet (Kipper Schuler, 2005; Kipper et al., 2006). In English, examples include WordSim-353 (Finkelstein et al., 2002; Agirre et al., 2009), MEN (Bruni et al., 2014) and SimLex-999 (Hill et al., 2015). Analogous datasets have been created in other languages, either through translation from an existing English dataset (e.g., from SimLex: German, Italian, and Russian (Leviant and Reichart, 2015), Hebrew and Croatian, and Polish (Mykowiecka et al., 2018)), or from a new set of concept pairs (e.g., Turkish (Ercan and Yıldız, 2018), Mandarin Chinese (Huang et al., 2019), Japanese (Sakaizawa and Komachi, 2018)). While these datasets are dominated by nouns (e.g., SimLex includes 222 verb pairs), verb-oriented datasets are harder to come by; in English, they include the dataset of Yang and Powers (2006).

We start from the English SpA-Verb sample (Table 1), translated into five target languages, and apply the recently proposed two-phase annotation method combining semantic clustering and spatial arrangements based on semantic similarity. The method adapts a SpAM approach previously used in cognitive science and psychology in behavioural studies of visual similarity between concrete objects (Kriegeskorte and Mur, 2012; Mur et al., 2013; Cichy et al., 2019) to lexical stimuli. In Phase 1, a large word sample is divided into a number of broad categories of similar and related items.
Each of these classes is then used as input in Phase 2, where the related class members are arranged in a 2D space based on their semantic similarity, with similar words placed closer together.

Figure 1: Consecutive Phase 2 trials on a class of Polish emotion verbs. In the first trial (1-2), the whole class is displayed around the arena and word labels are placed one by one based on the similarity of their meaning. Words put closer together in the first trial (2) are subsampled for the subsequent trial (3) and arranged again in a less crowded space (annotators are asked to use the entire space available in each trial; the relative inter-item distances, not the absolute on-screen distances, represent the dissimilarities).

Each item placement simultaneously communicates
its semantic distance to all other items present and the inter-stimulus Euclidean distances represent pairwise dissimilarities between words in the sample. The arrangements are performed repeatedly over numerous trials first on the entire word set and subsequently on subsets of items, selected by an adaptive algorithm which optimises the evidence collected for the dissimilarity estimates (see Figure 1). The final representational dissimilarity matrix (RDM) estimate is produced by statistically combining the evidence from multiple subsequent 2D arrangements and contains a dissimilarity estimate for each pairing of words in the set (see Kriegeskorte and Mur (2012) for the details). The dissimilarities collected for each Phase 1 class are then normalised to ensure inter-class consistency in the final dataset.
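The core geometry of this procedure can be sketched in a few lines. The snippet below is a deliberately simplified illustration, assuming only full-set trials and plain averaging plus RMS scaling, rather than the adaptive inverse-MDS evidence-weighting procedure of Kriegeskorte and Mur (2012); function names are ours.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def trial_rdm(coords):
    """Pairwise Euclidean distances between the 2D item placements of one trial."""
    return squareform(pdist(np.asarray(coords, dtype=float)))

def combine_trials(rdms):
    """Combine several full-set trials into one RDM estimate by averaging,
    then scale the result so its off-diagonal entries have an RMS of 1.
    (The actual procedure statistically combines evidence from adaptive
    subset trials; this is a simplified sketch.)"""
    mean_rdm = np.mean(rdms, axis=0)
    off_diag = mean_rdm[~np.eye(mean_rdm.shape[0], dtype=bool)]
    return mean_rdm / np.sqrt(np.mean(off_diag ** 2))
```

The RMS scaling at the end is the same normalisation later applied to ensure inter-class consistency in the final dataset.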
The main advantages of the spatial arrangement method lie in its intuitiveness, rooted in psychology (Lakoff and Johnson, 1999; Gärdenfors, 2004; Casasanto, 2008), and its flexibility, due to the reliance on fluid item placements simultaneously expressing multi-way similarity judgments, rather than discrete numerical scores. By repeatedly considering subsets of items, annotators reflect on relative differences in meaning between different configurations of words, which decreases bias from placement error, order of presentation and judgment context. The two-phase design offers a practical advantage for porting the method to other languages. The approach starts from a verb sample, rather than a set of word pairs, which allows for easy translation into the target language, avoiding many of the complications encountered in the translation of pairs, including cases where both words in the source language pair translate into the same word (e.g., cup - mug → Italian tazza - tazza), or several pairs in the source language translate into identical target pairs (e.g., easy - hard, easy - difficult → Polish łatwy - trudny). 1

Data Collection and Analysis
We sampled languages from 5 different language families to ensure typological diversity: Sino-Tibetan (Mandarin Chinese ZH), Japonic (Japanese JA), Uralic (Finnish FI), Slavic (Polish PL) and Romance (Italian IT). Following translation from English (EN), the two data collection phases were set up on an online platform (meadows-research.com) as two separate studies for each language. Recruitment was carried out on a crowd-sourcing website, prolific.co. Participants were native speakers of the target language with at least undergraduate education level and at least a 90% approval rating. Each phase featured a short qualification task testing the participants' understanding of the guidelines.

Word Sample Translation
Translation was carried out by one native speaker translator per language. In case several equally suitable candidates were identified for one source word, all of them were kept. This was especially true for polysemous English verbs which translated to more than one target verb, each expressing a distinct sense of the source word (e.g., bear → Finnish (1) kantaa, 'carry', (2) sietää, 'endure'). On the other hand, if two English words had only one adequate translation equivalent, the two-to-one mapping was kept where unavoidable (e.g., restrict, limit → Mandarin Chinese 限制). Table 1 shows the number of unique verbs in the final target sample for each language. Additional design choices concerned the following: (i) multi-word expressions, which we permitted if they were the natural translation choice, so as to accurately reflect target language lexical semantics; (ii) intransitive and transitive translation variants of the same English verb (both variants were kept only if they captured important meaning distinctions beyond valency, e.g., in Polish: impose→(1) narzucać, 'to force someone to accept something', (2) narzucać się, 'to cause inconvenience to someone by demanding their attention'); (iii) verbal aspect (translators selected the variant most closely capturing source word meaning, e.g., in Finnish: jump→hypätä but bounce→hyppiä (continuative aspect)).

Phase 1: Semantic Clustering
Five native speakers per language independently performed a rough clustering of the initial verb sample into broad semantic classes. Users dragged words one by one from a queue and placed them in circles representing broad semantic groupings (see Figure 2). The annotators were instructed to create groupings of similar and related verbs, each containing roughly 30-50 words. This rule of thumb, applied previously, ensures similar granularity across languages.
To ensure annotation quality, the produced classifications were manually reviewed to identify rogue annotators and low-effort responses (e.g., multiple consecutive words in the queue were placed in the same class indiscriminately or large numbers of words were placed in the trash circle and missing from the final classification), which were subsequently discarded. The final sets of classes for Phase 2 were produced in each language by first identifying the overlap in Phase 1 classifications (i.e., all the verb pairs put in the same class by all annotators in a given language), which determined the class structure and broad semantics of each class (e.g., movement, emotion, communication), and then populating the classes based on majority decisions. Finally, for each language, the cross-subject classes were reviewed manually by a native speaker adjudicator; in the process, the verbs missing from the intersection of individual clusterings were added to valid classes of related verbs (based on the criterion of semantic similarity and relatedness, ensuring semantic coherence of the resultant classes). Phase 1 produced 16-18 classes in each language (Table 1) and took between 2.5 (Finnish) and 3.5 hours (Mandarin Chinese) to complete.
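The intersection-plus-majority construction described above can be sketched as follows. This is a simplified illustration with names of our own choosing: seeds come from verb pairs co-clustered by all annotators, leftover verbs are attached by a majority-style score, and the final manual adjudication pass is necessarily omitted.

```python
from collections import Counter
from itertools import combinations

def consensus_classes(annotations, majority):
    """annotations: one {verb: class_id} dict per annotator.
    Verb pairs co-clustered by ALL annotators seed the class structure
    (via transitive closure); each remaining verb joins the seed class it
    was most often co-clustered with, provided its average co-clustering
    count per class member reaches `majority`."""
    verbs = sorted(set().union(*annotations))
    together = Counter()
    for ann in annotations:
        for u, v in combinations(sorted(ann), 2):
            if ann[u] == ann[v]:
                together[(u, v)] += 1
    # Union-find over unanimous pairs to get the seed structure.
    parent = {v: v for v in verbs}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for (u, v), n in together.items():
        if n == len(annotations):
            parent[find(u)] = find(v)
    groups = {}
    for v in verbs:
        groups.setdefault(find(v), set()).add(v)
    seeds = [g for g in groups.values() if len(g) > 1]
    def score(v, seed):
        return sum(together.get(tuple(sorted((v, m))), 0) for m in seed)
    # Majority-based attachment of the remaining singleton verbs.
    for g in groups.values():
        if len(g) == 1 and seeds:
            (v,) = g
            best = max(seeds, key=lambda s: score(v, s))
            if score(v, best) >= majority * len(best):
                best.add(v)
    return seeds
```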
Cross-lingual Overlap. Table 3 summarises Phase 1 output. Given the similar granularity of classifications, we aligned classes with most overlap (via English mappings) and shared broad semantics for easier comparison. We observe substantial high-level category overlap (e.g., 'possession', 'motion', 'cognition'). To measure the degree of alignment, we calculate pairwise item-level overlap using the B-Cubed metric (Jurgens and Klapaftis, 2013; Amigó et al., 2009) between all language pairs. We examine whether stronger alignment corresponds to greater typological affinity by confronting the results with the degree of overlap in syntactic, morphological and lexical typological features from the WALS database (Dryer and Haspelmath, 2013) (Table 2). 2 The two languages with the strongest class alignment, Italian and Polish (0.565), also share the most structural properties.

Table 3: Semantic classes produced in Phase 1, aligned cross-lingually based on member overlap (size = number of verbs in class, ρ = Spearman's IAA); English labels serve to identify broad semantic categories. ↑/↓ indicate a category is subsumed by the one above or below. S/A/P labels signal arguments typically selected by class members (agent-like (A), patient-like (P), or sole argument of an intransitive verb (S)).

Japanese, the only SOV language in the selection, has
the lowest average pairwise overlap with other languages both in terms of features and Phase 1 classes. Manual examination of the classes provides additional insights into the factors (beyond purely semantic ones) impacting classification decisions across languages. For instance, in both Polish and Italian we observe a class split corresponding to the reflexive vs. non-reflexive distinction: reflexive motion verbs (e.g., PL kołysać się, 'to sway', obracać się, 'to spin'; IT abbassarsi, 'to lower', ritirarsi, 'to retreat') end up separated from their non-reflexive counterparts (PL kołysać, obracać; IT abbassare, ritirare). In Chinese, by contrast, we observe complex causative verbs (formed with the causative 使 shǐ, 'to make, cause') clustered together (e.g., 使失望 shǐ shīwàng, 'disappoint', 使心烦 shǐ xīnfán, 'upset', 使厌恶 shǐ yànwù, 'disgust'), forming a grouping of verbs denoting causing negative emotions. In §3.3, we zoom into specific semantic classes to analyse patterns of similarity and variation in depth.
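The B-Cubed comparison between two clusterings can be sketched as follows; this is an illustrative implementation of the standard definition (Amigó et al., 2009), not the exact evaluation script used here.

```python
def b_cubed_f1(c1, c2):
    """B-Cubed F1 between two clusterings of the same item set, each given
    as an {item: class_id} dict. Per-item precision is the fraction of the
    item's cluster-mates under c1 that are also its mates under c2 (each
    item counts as its own mate, so the denominator is never zero); recall
    swaps the roles. Scores are averaged over items."""
    items = list(c1)
    def mates(cl, x):
        return {y for y in items if cl[y] == cl[x]}
    p = r = 0.0
    for x in items:
        m1, m2 = mates(c1, x), mates(c2, x)
        p += len(m1 & m2) / len(m1)
        r += len(m1 & m2) / len(m2)
    p, r = p / len(items), r / len(items)
    return 2 * p * r / (p + r)
```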

Phase 2: Similarity Multi-Arrangement
The classes from Phase 1 were fed into Phase 2, divided into 5-6 batches of 3-4 classes each. Verbs from one class are annotated independently from all others. Annotators are instructed to arrange the presented verbs in a circular arena based on the similarity of their meaning, putting similar words closer together and disregarding similarity of sound, letters or simple association. For each batch, the aim was to obtain at least 5 valid sets of annotations, and recruitment continued until this condition was satisfied. We employed the following post-processing quality assurance protocol: first, we filtered out annotators who performed the first arrangement too quickly (i.e., averaging less than 1 second per word placement upon first seeing the sample); next, for each class, we filtered out annotators for whom the average pairwise Spearman's correlation of arena dissimilarities with those of all other annotators was more than one standard deviation below the mean of all such average correlations (as done by Hill et al. (2015)). To produce the final dataset and ensure consistency between differently sized classes, we calculated the average of the Euclidean distances from all accepted annotators for each verb pair and then normalised them, following previous work (Kriegeskorte and Mur, 2012), scaling each dissimilarity matrix to have a root mean square (RMS) of 1.
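The correlation-based filtering criterion above can be sketched as follows; the function name and input layout are illustrative.

```python
import numpy as np
from scipy.stats import spearmanr

def keep_annotators(arrangements):
    """arrangements: (n_annotators, n_pairs) array of flattened
    upper-triangle arena dissimilarities for one class. Returns indices of
    annotators whose average pairwise Spearman correlation with all other
    annotators is no more than one standard deviation below the mean of
    those averages."""
    arr = np.asarray(arrangements, dtype=float)
    n = len(arr)
    avg = np.array([
        np.mean([spearmanr(arr[i], arr[j])[0] for j in range(n) if j != i])
        for i in range(n)
    ])
    return [i for i in range(n) if avg[i] >= avg.mean() - avg.std()]
```

For example, an annotator whose distances rank-reverse everyone else's falls far below the threshold and is dropped, while mutually consistent annotators are kept.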
Cross-lingual Comparisons. We compute inter-annotator agreement (IAA) in Phase 2 using Spearman's rank correlation coefficient (ρ), as the average correlation of an individual annotator with the average of all other annotators for each class in each language (Table 3). We observe that certain classes proved consistently easier to judge across languages ('emotion', 'change', 'cooking'), while some were consistently more challenging ('motion', 'handicraft', 'law/crime'). The availability of complete dissimilarity matrices enables analyses of cross-lingual similarities in how concepts pertaining to a given domain are organised. To illustrate this, we focus on two semantic areas, verbs of motion (#4) and change (#9), and compute the correlation between the intersection of distance matrices for all language pairs, and additionally the English SpA-Verb data, using the non-parametric Mantel test (Mantel, 1967). We find statistically significant correlations between all pairings of languages (p ≤ .001), but the results show cross-lingual and cross-domain differences (Table 4). Overall, we observe substantially higher correlations on verbs of change than on movement verbs, mirroring the intralingual IAA patterns (Table 3): while there is more room for variation in pairwise distances in the more populated 'motion' class, the alignment on verbs of change is also due to the nature of the class, dominated by antonymous verb pairs of opposite polarity (e.g., increase-decrease, grow-shrink), which are consistently spread out in the arena. The moderate to strong correlations recorded indicate that the dimensions which underlie the organisation of concepts in this class, especially the polarity dimension, are cross-lingually shared. As observed in Phase 1 (Table 2), Italian and Polish correlate the most, while Japanese is the least aligned with other languages.
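A minimal sketch of the Mantel test used for these comparisons, assuming two aligned square distance matrices; this permutation version is illustrative and not a substitute for a statistics-package implementation.

```python
import numpy as np

def mantel(d1, d2, n_perm=999, seed=0):
    """Simple Mantel test (Mantel, 1967): Pearson correlation between the
    upper triangles of two square distance matrices, with a permutation
    p-value obtained by jointly shuffling the rows and columns of the
    second matrix."""
    d1, d2 = np.asarray(d1, dtype=float), np.asarray(d2, dtype=float)
    iu = np.triu_indices_from(d1, k=1)
    def corr(a, b):
        return np.corrcoef(a[iu], b[iu])[0, 1]
    r_obs = corr(d1, d2)
    rng = np.random.default_rng(seed)
    n = d1.shape[0]
    hits = sum(corr(d1, d2[np.ix_(p, p)]) >= r_obs
               for p in (rng.permutation(n) for _ in range(n_perm)))
    return r_obs, (hits + 1) / (n_perm + 1)
```

Permuting rows and columns jointly (rather than shuffling distances independently) preserves the metric structure of the matrix, which is what makes the test valid for distance data.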
Comparison with the 'motion' class illustrates that there is variation in patterns of cross-lingual affinities across different semantic domains. While Italian correlates the most with English, the correlation with Polish motion verbs is weak. Running agglomerative clustering on top of the distance matrices revealed that in all three languages subclusters emerge corresponding to the medium of movement ([dive, swim, flow], [run, walk, crawl]), along with a separation between static and dynamic verbs ([lounge, poise, remain], [chase, dance, dash]). However, Polish makes some additional fine-grained distinctions based on manner and speed of movement (e.g., jumping, fast vs. slow movement, motion with a change of direction), whereas in Italian and English, verbs describing motion towards the speaker/listener form a distinct cluster. These preliminary analyses suggest that the collected semantic multi-arrangement data may support many other fine-grained and in-depth lexical-typological analyses in future work, e.g., focusing on cross-lingual comparisons of the organisation of different semantic fields and examination of the most salient meaning dimensions underlying a given conceptual space.

Evaluation
Evaluation is focused on two types of representation architectures: static word embeddings (Bojanowski et al., 2017) and more recently proposed large pretrained encoders (Devlin et al., 2019). We compare their ability to capture word-level semantics across languages and domains of verb meaning. We also contrast the performance of language-specific BERT models with their massively multilingual counterpart (Devlin et al., 2019), and examine the impact of computing word-level representations in context, rather than by feeding items to a pretrained model in isolation.
Representation Models. We evaluate FASTTEXT (FT) as a representative non-contextualised word embedding model with proven representation capabilities on diverse NLP tasks (Mikolov et al., 2018) and coverage of 157 languages. For multi-word expressions, we compute their representations by averaging the vectors of their constituent words. We contrast the performance of FT vectors with the ubiquitous state-of-the-art BERT model (Devlin et al., 2019). We derive word-level BERT representations of words and multi-word expressions in two different ways: (a) in isolation and (b) in context. In method (a), we (1) feed each item to the pretrained model in isolation, (2) extract the H hidden representations for each of the subword tokens constituting the item, and finally (3) average over the subword tokens and hidden layers to obtain the item's vector. In method (b), we encode word meaning in the context of other words using external corpora 4 in the following way. First, we randomly sample N sentences containing each item in the corpus; then, we compute the item's representation in each of the N sentential contexts (averaging over constituent subword representations and hidden layers as in steps (2)-(3) above); finally, we average over the N sentential representations to obtain the final representation for each item. 5 We evaluate the uncased multilingual BERT model (M-BERT) (Devlin et al., 2019), pretrained on monolingual Wikipedia corpora of 102 languages, as well as language-specific pretrained BERT encoders released for ZH, JA (BERT-BASE with and without whole word masking (+WWM)), PL, FI, and IT (BERT-BASE and BERT-BASE-XXL trained on a larger Italian corpus), available in the Transformers repository (Wolf et al., 2019).

Table 5: Clustering results (F1 score) on Phase 1 classes, for the optimal clustering solution (highest F1 score) and with k clusters equal to the number of gold classes in each language (see Table 1). We report scores for (M-)BERT embeddings computed in isolation (ISO) and in context (CTX) (see §4).
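The averaging steps in both extraction methods can be sketched as follows. We assume the per-layer hidden states have already been obtained upstream (e.g., by running a BERT model with output_hidden_states=True in Transformers); function names, array shapes, and the layer-selection argument are illustrative.

```python
import numpy as np

def word_vector(hidden_states, subword_positions, layers=None):
    """hidden_states: (n_layers, seq_len, dim) array of encoder activations
    for one input; subword_positions: indices of the subword tokens that
    make up the target item. The item vector is the average over those
    subword positions and over the selected layers, mirroring steps
    (2)-(3) of method (a)."""
    h = np.asarray(hidden_states, dtype=float)
    if layers is not None:
        h = h[layers]
    return h[:, subword_positions, :].mean(axis=(0, 1))

def word_vector_in_context(sampled):
    """Method (b): average the per-sentence item vectors computed from N
    sampled sentential contexts; `sampled` is a list of
    (hidden_states, subword_positions) pairs, one per sentence."""
    return np.mean([word_vector(h, pos) for h, pos in sampled], axis=0)
```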

Semantic Verb Clustering
First, we evaluate the models on semantic clustering, where the task is to group the starting verb sample (Table 1, N verbs) into clusters based on semantic similarity. For each vector collection, we apply the spectral clustering algorithm (Meila and Shi, 2001; Yu and Shi, 2003), shown to produce strong results in previous work on verb clustering (Sun et al., 2010; Scarton et al., 2014), and evaluate the produced groupings against the Phase 1 classes in each language using standard clustering evaluation metrics: modified purity (MPUR), i.e., the mean precision of induced verb clusters, and weighted class accuracy (WACC), calculated as follows:

MPUR = ( Σ_{C ∈ Clust : n_prev(C) > 1} n_prev(C) ) / n_test_verbs   (1)

WACC = ( Σ_{C ∈ Gold} n_dom(C) ) / n_test_verbs   (2)

In Eq. (1), each cluster C from the set of all K_Clust automatically induced clusters Clust is associated with its prevalent Phase 1 class, and n_prev(C) is the number of verbs in an induced cluster C appearing in that class (all other verbs are considered errors). n_test_verbs is the total number of test verbs, and singleton clusters (n_prev(C) = 1) are not counted. In Eq. (2), for each C from the set of Phase 1 classes Gold, we identify the dominant cluster from the set of induced clusters which has most verbs in common with C (n_dom(C)). The metrics are combined into an F1 score, the balanced harmonic mean of MPUR and WACC.

Table 5 includes the results for the optimal number of clusters (highest F1), and for a fixed k equal to the number of gold classes. We observe several interesting patterns in the F1 scores. First, we note that FT vectors clearly outperform the BERT models in the languages using the Latin script (IT, FI, PL), achieving the top three F1 scores overall (0.389, 0.386, 0.377). 6 In Chinese and Japanese, FT vectors surpass BERT embeddings computed in isolation, but are outperformed by BERT vectors computed in context (in ZH) and by the multilingual BERT in Japanese. The stronger performance of the massively multilingual model in Japanese and Chinese contrasts with the results in PL, FI and IT, where it mostly lags behind the language-specific counterparts. In terms of relative scores, we see that BERT and M-BERT embeddings computed over a number of sentential contexts consistently outperform their in-isolation counterparts across all languages. On the other hand, we observe that whole word masking does not reliably improve clustering performance in Japanese, nor does using a larger training corpus in Italian (BERT-XXL).
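The two metrics and their F1 combination can be sketched directly from the definitions above; the function name is ours.

```python
from collections import Counter

def clustering_f1(induced, gold):
    """induced and gold map each test verb to a cluster/class id.
    Modified purity credits, for each induced cluster, the verbs belonging
    to its prevalent gold class (singleton overlaps, n_prev(C) = 1, are not
    counted); weighted class accuracy credits, for each gold class, the
    verbs captured by its dominant induced cluster. Both are normalised by
    the number of test verbs and combined into a harmonic-mean F1."""
    n = len(gold)
    clusters, classes = {}, {}
    for v, c in induced.items():
        clusters.setdefault(c, []).append(v)
    for v, c in gold.items():
        classes.setdefault(c, []).append(v)
    prev = [Counter(gold[v] for v in C).most_common(1)[0][1]
            for C in clusters.values()]
    mpur = sum(m for m in prev if m > 1) / n
    wacc = sum(Counter(induced[v] for v in C).most_common(1)[0][1]
               for C in classes.values()) / n
    return mpur, wacc, 2 * mpur * wacc / (mpur + wacc)
```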
Error Analysis. Manual inspection of the induced clusters reveals some common pitfalls and areas of difficulty. First, the evaluated models are largely oblivious to idiomatic meaning. In Polish and Italian, the FT model produces a cluster of 'possession' verbs (EN have, give, lend, buy), including the verbs mieć (PL, 'to have'), dać and dare (PL/IT, 'to give'). However, it also incorporates all phrasal verbs and multi-word expressions featuring these words, with meanings unrelated to the rest of the class: PL mieć coś przeciw ('to mind/object to'), mieć nadzieję ('to hope'), mieć wpływ ('to influence'), dać klapsa ('to spank'); IT dare un'occhiata ('to glance'). This is even more evident in Finnish, where a separate cluster of phrasal verbs with olla ('to be/have') emerges (e.g., olla varuillaan, 'beware', olla peräisin, 'originate', olla samaa mieltä, 'agree'). Similarly, all Polish models produce a cluster of just reflexive verbs (e.g., ślizgać się ('to slide'), cieszyć się ('to rejoice'), zdarzyć się ('to happen')), regardless of discrepancies in meaning. In Italian, BERT models fall into the same trap, clustering reflexives regardless of their meaning (informarsi 'to inquire', precipitarsi 'to rush', abbronzarsi 'to tan'); however, FT vectors are more robust: we observe a separate cluster of movement verbs, with both reflexives and non-reflexives (saltare 'to jump', precipitarsi 'to rush', andare 'to go'), and of knowledge-related verbs (informarsi 'to inquire', studiare 'to study', istruire 'to instruct'). The attention to subword signal is apparent in clusters produced by BERT models. In languages using logographic scripts, this yields valid groupings, e.g., Japanese 再現する saigen suru 'to reproduce/reappear', 再生する saisei suru 'to reproduce', 再生利用する saisei riyō suru 'to recycle'. In Polish, however, narzucać się ('to impose') and podrzucać ('to toss'), and polować ('to hunt') and malować ('to paint') end up clustered together.
While it could be argued that a weak semantic link (apart from the etymological one) exists between the first pair, the second pair has only coincidental orthographic overlap. Similarly, a semantically heterogeneous cluster of Italian verbs ending in -lare is produced (coccolare 'cuddle', capitolare 'capitulate', scongelare 'defrost'). Whether computed in context or in isolation, BERT word-level representations capture a lot of subword- and surface-level information without fully capturing higher-level semantic signal, which negatively affects cluster quality.

Word Similarity
We compute Spearman's ρ correlation between the ranks of models' similarity scores and those of human judgments from Phase 2. To ensure reliability of the results, we perform evaluation on a thresholded subset of each dataset, focusing on classes with IAA above ρ = 0.3 (THR) (Table 1). We also compare the models' capacity to discriminate between related concepts within a narrow semantic domain and report scores on three semantic classes: 'emotion' (#1), 'change' (#9), and 'cooking' (#2) 7 (Table 6). The primacy of FT vectors in Polish, Finnish, and Italian is again conspicuous, while in Chinese and Japanese the pretrained encoders are in the lead, with noticeably lower FT performance recorded for Japanese than for the other languages. Results achieved on the THR sets repeat the patterns seen in the clustering task: contextualised variants of BERT embeddings (CTX) outperform those computed in isolation (ISO), and the language-specific encoders prove to capture richer semantics than the massively multilingual model, with the exception of Japanese, where contextualised M-BERT again achieves the top result (albeit noticeably lower than the top THR scores in other languages). The relatively stronger M-BERT results on Japanese, as well as Chinese, illustrate a known characteristic of multilingual pretraining with a subword vocabulary shared by 102 languages. Languages with scripts distinct from those of the majority of languages covered by M-BERT do not share their subwords with a large number of other languages, and their language-specific subwords constitute a large proportion of the total subword vocabulary; in consequence, the model can capitalise on this proliferation to produce higher-quality representations. Conversely, the representation quality is degraded for languages with very rich and productive morphology like Finnish or Polish, despite the availability of training data.
This also applies to language-specific BERT models: given the same vocabulary capacity, morphologically rich languages have many words split into subwords and fewer full words represented in the vocabulary than analytic languages like Chinese or English.
To explore the potential of generating stronger word-level embeddings from BERT models, we investigated the impact of two parameters on lexical representation extraction: the number of hidden layers we average over (all 12 or the first 8) and the inclusion of the special classification token ([CLS]) in the subword averaging step. Table 7 summarises the results for Polish and Finnish. We note that including the [CLS] token yields better lexical embeddings in both languages; however, whether averaging over 12 or 8 layers produces the strongest results is language-specific. Notably, the top-performing Finnish BERT configuration (ρ = 0.250) outperforms FT embeddings (ρ = 0.248) on the THR set. While a full evaluation of all parameter configurations is beyond the scope of this paper, these findings suggest that careful language-specific tuning of the extraction configuration is crucial to achieve optimal performance.

Although more variation in terms of the primacy of one model variant over the other is expected on the individual semantic classes due to the smaller dataset size, the general pattern whereby computing BERT embeddings by averaging N contextualised representations boosts performance applies in 72% of cases. Additional observations can be drawn from the results on individual domains. Class #9, including verbs describing change in size or speed (e.g., accelerate, increase, shrink), is especially rich in synonyms and antonyms. Due to antonyms' high semantic overlap, they are often confused with synonyms by distributional models learning purely from patterns of occurrence in raw text. This effect also emerges in our results, where performance on this class is the lowest for most model configurations and languages. Finnish is the exception, possibly due to the slightly broader coverage of this class, which also incorporates verbs of being and existence (Table 3), with a smaller proportion of antonymous pairs. Interestingly, this class is where the multilingual model outperforms the language-specific counterparts in ZH, JA, and PL. In Italian, where class #9 has only 23 members, most of which stand in antonymous relations (e.g., crescere - decrescere 'increase - decrease', aumentare - diminuire 'rise - drop', iniziare - finire 'begin - finish'), the BERT model trained on the larger corpus is the most robust. Results on this class illustrate that semantic areas which are easier for humans to reason about are not necessarily less challenging for models.
An area where greater ease of human judgment is reflected in relatively higher model performance is the domain of cooking verbs in ZH, FI, and JA, where the highest overall scores are recorded (>0.4 for FT in ZH and FI), along with the top model scores in Japanese (BERT). While we do not report all class-specific correlations for brevity, they reveal further interesting patterns regarding the semantic domains which prove easiest for models to capture accurately. In IT, we record the highest model correlations for verbs of communication and destruction (top scores >0.4, FT), while verbs of physical violence are the domain with the highest correlations in PL (>0.3, FT), FI (>0.4, FT), and ZH (>0.4, BERT). In Japanese, the best result overall is achieved on verbs of cognition (0.276, BERT), followed by verbs of physiological processes, with correlations >0.2 scored by the contextualised BERT models. Such analyses of specific semantic domains can help identify the strengths and deficiencies of different types of embeddings and highlight the areas of meaning which pose challenges across languages, guiding further developments in representation learning.
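The per-domain correlations discussed above are Spearman ρ values between model-predicted similarities and human judgments. Schematically, with hypothetical verbs, toy vectors, and invented ratings:

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate_similarity(embeddings, pairs, gold_scores):
    """Spearman rho between cosine similarities of embedded verb pairs
    and human similarity judgments for the same pairs."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    model_scores = [cosine(embeddings[w1], embeddings[w2]) for w1, w2 in pairs]
    rho, _ = spearmanr(model_scores, gold_scores)
    return rho

# toy example: three hypothetical verbs with 3-d vectors
emb = {
    "boil":   np.array([1.0, 0.1, 0.0]),
    "simmer": np.array([0.9, 0.2, 0.0]),
    "shout":  np.array([0.0, 1.0, 0.3]),
}
pairs = [("boil", "simmer"), ("boil", "shout"), ("simmer", "shout")]
gold = [5.5, 0.5, 0.7]  # invented human ratings on a 0-6 scale
print(round(evaluate_similarity(emb, pairs, gold), 3))  # → 1.0
```

Since Spearman's ρ compares rankings rather than raw values, it rewards models that order pairs by similarity as humans do, regardless of the scale of the cosine scores.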

Main Observations
Our evaluation revealed the dataset to be a challenging benchmark, and provided a number of insights into the potential of the evaluated models to capture verbal lexical semantics across languages.
• Overall, model performance across tasks shows a split pattern: the pretrained encoders surpass static word embeddings in Chinese and Japanese, but are outperformed by FASTTEXT by a significant margin in the languages using the Latin script (Polish, Finnish, and Italian). There is potential to derive higher-quality word-level BERT embeddings in these languages through careful selection of language-specific lexical representation extraction parameters.
• BERT word-level embeddings derived by averaging over N occurrences in context prove predominantly stronger than those obtained by feeding words into the pretrained model in isolation, with more variation observed in the case of the smaller semantic sets.
• The results achieved on the thresholded datasets show a clear advantage of monolingual pretraining over massively multilingual pretraining, with the exception of Japanese, where M-BERT achieves the top results, as is also the case in semantic clustering.
• Error analysis revealed that clustering performance of the pretrained encoders suffers due to the primacy given to low-level subword signal over high-level semantic information, while an important area of difficulty for all models in the lexical similarity task is teasing apart antonymous and synonymous word pairs.
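The clustering side of the evaluation can also be illustrated schematically. The snippet below uses scikit-learn's KMeans and V-measure as an illustrative setup, not necessarily the algorithm and metric used in our experiments, with two well-separated synthetic "classes" standing in for real verb embeddings:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score

def cluster_and_score(embeddings, gold_labels, n_clusters, seed=0):
    """Cluster word vectors and score agreement with gold semantic classes."""
    X = np.vstack(embeddings)
    pred = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
    return v_measure_score(gold_labels, pred)

# synthetic data: two well-separated "verb classes" of 10 items each
rng = np.random.default_rng(0)
class_a = rng.normal(loc=0.0, scale=0.1, size=(10, 8))
class_b = rng.normal(loc=5.0, scale=0.1, size=(10, 8))
emb = np.vstack([class_a, class_b])
gold = [0] * 10 + [1] * 10
score = cluster_and_score(emb, gold, n_clusters=2)
print(round(score, 2))  # well-separated classes should give V-measure ~1.0
```

On real verb embeddings the classes are far less separable, which is where the subword-driven errors noted above depress clustering quality.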

Conclusion and Future Work
We presented the first large-scale multilingual evaluation resource constructed via spatial arrangement and targeting verb semantics in Chinese, Japanese, Polish, Finnish, and Italian. It includes semantic classes and fine-grained pairwise lexical similarity scores, which we release with this paper. The dual nature and vast coverage of the dataset enable evaluation of representation models on two tasks, semantic clustering and word similarity, as well as focused probing analyses on specific semantic domains, revealing aspects of verbal meaning which elude models' representation capacity. The low overall model performance indicates that estimating similarity between a large number of semantically proximate concepts linked by fine-grained relations is a challenging task. In future work, we will use the spatial arrangement data for in-depth analyses of cross-lingual typological variation, and we will evaluate models on the fine-grained semantic clusters from Phase 2 to explore the potential for (semi-)automatic creation of verb classes and semantic resources in languages where these are still lacking. We will also evaluate cross-lingual representation learning algorithms on the mapped cross-lingual verb similarity datasets created in this project for all language pairs.