Probing Pretrained Language Models for Lexical Semantics

The success of large pretrained language models (LMs) such as BERT and RoBERTa has sparked interest in probing their representations, in order to unveil what types of knowledge they implicitly capture. While prior research focused on morphosyntactic, semantic, and world knowledge, it remains unclear to what extent LMs also derive lexical type-level knowledge from words in context. In this work, we present a systematic empirical analysis across six typologically diverse languages and five different lexical tasks, addressing the following questions: 1) How do different lexical knowledge extraction strategies (monolingual versus multilingual source LM, out-of-context versus in-context encoding, inclusion of special tokens, and layer-wise averaging) impact performance? How consistent are the observed effects across tasks and languages? 2) Is lexical knowledge stored in a few parameters, or is it scattered throughout the network? 3) How do these representations fare against traditional static word vectors in lexical tasks? 4) Does the lexical information emerging from independently trained monolingual LMs display latent similarities? Our main results indicate patterns and best practices that hold universally, but also point to prominent variations across languages and tasks. Moreover, we validate the claim that lower Transformer layers carry more type-level lexical knowledge, but also show that this knowledge is distributed across multiple layers.

While there is a clear consensus on the effectiveness of pretrained LMs, a body of recent research has aspired to understand why they work (Rogers et al., 2020). State-of-the-art models are "probed" to shed light on whether they capture task-agnostic linguistic knowledge and structures (Liu et al., 2019a; Belinkov and Glass, 2019; Tenney et al., 2019); e.g., they have been extensively probed for syntactic knowledge (Hewitt and Manning, 2019; Jawahar et al., 2019; Kulmizev et al., 2020; Chi et al., 2020, inter alia) and morphology (Edmiston, 2020; Hofmann et al., 2020).
In this work, we focus on uncovering and understanding how and where lexical semantic knowledge is coded in state-of-the-art LMs. While preliminary findings from Ethayarajh (2019) and Vulić et al. (2020) suggest that there is a wealth of lexical knowledge available within the parameters of BERT and other LMs, a systematic empirical study across different languages is currently lacking.
We present such a study, spanning six typologically diverse languages for which comparable pretrained BERT models and evaluation data are readily available. We dissect the pipeline for extracting lexical representations, and divide it into crucial components, including: the underlying source LM, the selection of subword tokens, external corpora, and which Transformer layers to average over. Different choices give rise to different extraction configurations (see Table 1) which, as we empirically verify, lead to large variations in task performance.
We run experiments and analyses on five diverse lexical tasks using standard evaluation benchmarks: lexical semantic similarity (LSIM), word analogy resolution (WA), bilingual lexicon induction (BLI), cross-lingual information retrieval (CLIR), and lexical relation prediction (RELP). The main idea is to aggregate lexical information into static type-level "BERT-based" word embeddings and plug them into "the classical NLP pipeline" (Tenney et al., 2019), similar to traditional static word vectors. The chosen tasks can be seen as "lexico-semantic probes" providing an opportunity to simultaneously 1) evaluate the richness of lexical information extracted from different parameters of the underlying pretrained LM on intrinsic (e.g., LSIM, WA) and extrinsic lexical tasks (e.g., RELP); 2) compare different type-level representation extraction strategies; and 3) benchmark "BERT-based" static vectors against traditional static word embeddings such as fastText (Bojanowski et al., 2017).
Our study aims at providing answers to the following key questions: Q1) Do lexical extraction strategies generalise across different languages and tasks, or do they rather require language- and task-specific adjustments?; Q2) Is lexical information concentrated in a small number of parameters and layers, or scattered throughout the encoder?; Q3) Are "BERT-based" static word embeddings competitive with traditional word embeddings such as fastText?; Q4) Do monolingual LMs independently trained in multiple languages learn structurally similar representations for words denoting similar concepts (i.e., translation pairs)?
We observe that different languages and tasks indeed require distinct configurations to reach peak performance, which calls for a careful tuning of configuration components according to the specific task-language combination at hand (Q1). However, several universal patterns emerge across languages and tasks. For instance, lexical information is predominantly concentrated in lower Transformer layers, hence excluding higher layers from the extraction achieves superior scores (Q1 and Q2). Further, representations extracted from single layers do not match in accuracy those extracted by averaging over several layers (Q2). While static word representations obtained from monolingual LMs are competitive with or even outperform static fastText embeddings in tasks such as LSIM, WA, and RELP, lexical representations from massively multilingual models such as multilingual BERT (mBERT) are substantially worse (Q1 and Q3). We also demonstrate that translation pairs indeed obtain similar representations (Q4), but the similarity depends on the extraction configuration, as well as on the typological distance between the two languages.

Lexical Representations from Pretrained Language Models
Classical static word embeddings (Bengio et al., 2003; Mikolov et al., 2013b; Pennington et al., 2014) are grounded in distributional semantics, as they infer the meaning of each word type from its co-occurrence patterns. However, LM-pretrained Transformer encoders have introduced at least two levels of misalignment with the classical approach (Peters et al., 2018; Devlin et al., 2019). First, representations are assigned to word tokens and are affected by the current context and position within a sentence (Mickus et al., 2020). Second, tokens may correspond to subword strings rather than complete word forms. This raises the question: do pretrained encoders still retain a notion of lexical concepts, abstracted from their instances in texts? Analyses of lexical semantic information in large pretrained LMs have been limited so far, focusing only on the English language and on the task of word sense disambiguation. Reif et al. (2019) showed that word senses are encoded with finer-grained precision in higher layers, to the extent that representations of the same token tend not to be self-similar across different contexts (Ethayarajh, 2019; Mickus et al., 2020). As a consequence, we hypothesise that abstract, type-level information could instead be codified in lower layers. However, given the absence of a direct equivalent to a static word type embedding, we still need to establish how to extract such type-level information.
In prior work, contextualised representations (and attention weights) have been interpreted in the light of linguistic knowledge mostly through probes. These consist of learned classifiers predicting annotations such as POS tags (Pimentel et al., 2020) and word senses (Peters et al., 2018; Reif et al., 2019; Chang and Chen, 2019), or linear transformations to a space where distances mirror dependency tree structures (Hewitt and Manning, 2019). In this work, we explore several unsupervised word-level representation extraction strategies and configurations for lexico-semantic tasks (i.e., probes), stemming from different combinations of the components detailed in Table 1 and illustrated in Figure 1. In particular, we assess the impact of: 1) encoding tokens with monolingual LM-pretrained Transformers vs. with their massively multilingual counterparts; 2) encoding words in isolation (out of context) vs. averaging their contextualised representations over multiple occurrences in external corpora; 3) including vs. excluding special tokens; and 4) averaging over different subsets of Transformer layers.
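For concreteness, the core aggregation behind these configurations, averaging over a word's subword tokens, over the bottom-most Transformer layers, and over multiple contextual occurrences (AOC; a single occurrence reduces to ISO), can be sketched as follows. This is a minimal numpy sketch operating on mock hidden states; the function and variable names are our own illustration, not part of any released implementation:

```python
import numpy as np

def type_level_vector(occurrence_layers, n_layers=6):
    """Aggregate contextual occurrences of one word type into a static vector.

    occurrence_layers: one entry per occurrence of the word; each entry is an
    array of shape (13, n_subwords, dim) holding hidden states for the
    embedding layer L0 plus Transformer layers L1..L12, restricted to the
    word's subword positions (special tokens assumed already excluded).
    """
    per_occurrence = []
    for layers in occurrence_layers:
        subword_avg = layers.mean(axis=1)                     # (13, dim): average subwords
        layer_avg = subword_avg[: n_layers + 1].mean(axis=0)  # AVG(L<=n)
        per_occurrence.append(layer_avg)
    # AOC: average over contexts; with a single occurrence this reduces to ISO
    return np.mean(per_occurrence, axis=0)

# toy check: 3 occurrences, 2 subwords each, 8-dim hidden states
rng = np.random.default_rng(0)
occ = [rng.normal(size=(13, 2, 8)) for _ in range(3)]
vec = type_level_vector(occ, n_layers=6)
```

In practice the per-occurrence hidden states would come from a pretrained encoder run with hidden-state outputs enabled; the sketch only shows the aggregation step.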

Experimental Setup
Pretrained LMs and Languages. Our selection of test languages is guided by the following constraints: a) availability of comparable pretrained (language-specific) monolingual LMs; b) availability of evaluation data; and c) typological diversity of the sample, along the lines of recent initiatives in multilingual NLP (Gerz et al., 2018; Hu et al., 2020; Ponti et al., 2020, inter alia). All monolingual LMs are BERT-Base models comprising 12 Transformer layers (L1-L12) on top of an embedding layer (L0), and 12 attention heads. We also experiment with multilingual BERT (mBERT) (Devlin et al., 2019) as the underlying LM, aiming to measure the performance difference between language-specific and massively multilingual LMs in our lexical probing tasks.
Word Vocabularies and External Corpora. We extract type-level representations in each language for the top 100K most frequent words represented in the respective fastText (FT) vectors, which were trained on lowercased monolingual Wikipedias by Bojanowski et al. (2017). The equivalent vocabulary coverage allows a direct comparison to fastText vectors, which we use as a baseline static WE method in all evaluation tasks. To retain the same vocabulary across all configurations, in AOC variants we back off to the related ISO variant for words that have zero occurrences in external corpora.
For all AOC vector variants, we leverage 1M sentences of maximum sequence length 512, randomly sampled from external corpora such as Europarl (Koehn, 2005).

Evaluation Tasks. We carry out the evaluation on five standard and diverse lexical semantic tasks:

Task 1: Lexical semantic similarity (LSIM) is the most widespread intrinsic task for evaluation of traditional word embeddings (Hill et al., 2015). The evaluation metric is the Spearman's rank correlation between the average of human-elicited semantic similarity scores for word pairs and the cosine similarity between the respective type-level word vectors. We rely on the recent comprehensive multilingual LSIM benchmark Multi-SimLex (Vulić et al., 2020), which covers 1,888 word pairs in each of 13 languages. We focus on EN, FI, ZH, and RU, all of which are represented in Multi-SimLex.
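The LSIM protocol can be sketched in a few lines; the tiny vocabulary, the hand-crafted vectors, and the numpy-based Spearman computation (no tie handling) are purely illustrative:

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman correlation via rank transform (assumes no ties)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean(); ry -= ry.mean()
    return float(rx @ ry / np.sqrt((rx @ rx) * (ry @ ry)))

def lsim_score(pairs, gold, vectors):
    """Correlate human similarity ratings with cosine similarities
    of the corresponding type-level word vectors."""
    cosines = [float(vectors[a] @ vectors[b]
                     / (np.linalg.norm(vectors[a]) * np.linalg.norm(vectors[b])))
               for a, b in pairs]
    return spearman_rho(np.array(gold), np.array(cosines))

# toy vectors and ratings (illustrative only, not Multi-SimLex data)
vectors = {"car": np.array([1.0, 0.0]),
           "auto": np.array([1.0, 0.1]),
           "banana": np.array([0.0, 1.0])}
pairs = [("car", "auto"), ("car", "banana"), ("auto", "banana")]
gold = [9.0, 1.0, 1.5]
rho = lsim_score(pairs, gold, vectors)
```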
Task 2: Word Analogy (WA) is another common intrinsic task. We evaluate our models on the Bigger Analogy Test Set (BATS) (Drozd et al., 2016) with 99,200 analogy questions. We resort to the standard vector offset analogy resolution method, searching for the vocabulary word w_d ∈ V whose vector d maximises cos(d, c − a + b), where a, b, and c are the vectors of words w_a, w_b, and w_c from the analogy w_a : w_b = w_c : x. The search space comprises vectors of all words from the vocabulary V, excluding a, b, and c. This task is limited to EN, and we report Precision@1 scores.
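A minimal sketch of this offset method follows; the toy vectors are hand-crafted for illustration and are not BATS data:

```python
import numpy as np

def analogy(a, b, c, words, matrix):
    """Offset method: return the word maximising cos(d, c - a + b),
    excluding the three query words from the search space."""
    norm = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    idx = {w: i for i, w in enumerate(words)}
    target = norm[idx[c]] - norm[idx[a]] + norm[idx[b]]
    scores = norm @ (target / np.linalg.norm(target))
    for w in (a, b, c):                 # exclude a, b, c, as in standard setups
        scores[idx[w]] = -np.inf
    return words[int(np.argmax(scores))]

# hand-crafted toy space (illustrative only)
words = ["man", "woman", "king", "queen", "apple"]
M = np.array([[1.0, 0, 0], [0, 1, 0], [1, 0, 1], [0, 1, 1], [1, 1, 1]])
```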
Task 3: Bilingual Lexicon Induction (BLI) is a standard task to evaluate the "semantic quality" of static cross-lingual word embeddings (CLWEs) (Gouws et al., 2015; Ruder et al., 2019). We learn "BERT-based" CLWEs using a standard mapping-based approach (Mikolov et al., 2013a; Smith et al., 2017) with VECMAP (Artetxe et al., 2018). BLI evaluation allows us to investigate the "alignability" of monolingual type-level representations extracted for different languages. We adopt the standard BLI evaluation setup from Glavaš et al. (2019): 5K training word pairs are used to learn the mapping, and another 2K pairs as test data. We report standard Mean Reciprocal Rank (MRR) scores for 10 language pairs spanning EN, DE, RU, FI, TR.
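A standard mapping-based BLI pipeline reduces to solving the orthogonal Procrustes problem on the training dictionary and scoring test pairs with MRR. The sketch below is a simplified stand-in for the full VECMAP procedure, assuming pre-extracted embedding matrices whose rows are aligned with a translation dictionary:

```python
import numpy as np

def procrustes(X, Y):
    """Orthogonal map W minimising ||XW - Y||_F, the core of
    mapping-based CLWE induction (simplified; not full VECMAP)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def mrr(X_test, Y_all, gold_idx, W):
    """Mean Reciprocal Rank of gold translations under cosine similarity."""
    Xm = X_test @ W
    Xm = Xm / np.linalg.norm(Xm, axis=1, keepdims=True)
    Yn = Y_all / np.linalg.norm(Y_all, axis=1, keepdims=True)
    sims = Xm @ Yn.T
    gold_sims = sims[np.arange(len(gold_idx)), gold_idx][:, None]
    ranks = (sims >= gold_sims).sum(axis=1)   # rank of each gold translation
    return float(np.mean(1.0 / ranks))
```

A sanity check: if the target space is an exact rotation of the source space, the mapping is recovered perfectly and MRR is 1.0.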
Task 4: Cross-Lingual Information Retrieval (CLIR). We follow the setup of Litschko et al. (2018, 2019) and evaluate mapping-based CLWEs (the same ones as in BLI) in a document-level retrieval task on the CLEF 2003 benchmark. We use a simple CLIR model which showed competitive performance in the comparative studies of Litschko et al. (2019) and Glavaš et al. (2019). It embeds queries and documents as IDF-weighted sums of their corresponding WEs from the CLWE space, and uses cosine similarity as the ranking function. We report Mean Average Precision (MAP) scores for 6 language pairs covering EN, DE, RU, FI.
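This retrieval model can be sketched as follows; the vectors and IDF weights below are toy stand-ins for the CLWE space and collection statistics:

```python
import numpy as np

def embed_text(tokens, vectors, idf):
    """IDF-weighted sum of word embeddings for a query or document
    (tokens without a vector are skipped; assumes at least one hit)."""
    return np.sum([idf.get(t, 0.0) * vectors[t] for t in tokens if t in vectors],
                  axis=0)

def rank_documents(query, docs, vectors, idf):
    """Rank documents by cosine similarity to the query embedding."""
    q = embed_text(query, vectors, idf)
    q = q / np.linalg.norm(q)
    scores = []
    for d in docs:
        v = embed_text(d, vectors, idf)
        scores.append(float(q @ v / np.linalg.norm(v)))
    return sorted(range(len(docs)), key=lambda i: -scores[i])
```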

Task 5: Lexical Relation Prediction (RELP).
We probe whether we can recover standard lexical relations (i.e., synonymy, antonymy, hypernymy, meronymy, plus no relation) from input type-level vectors. We rely on a state-of-the-art neural model for RELP operating on type-level embeddings (Glavaš and Vulić, 2018): the Specialization Tensor Model (STM) predicts lexical relations for pairs of input word vectors based on multi-view projections of those vectors. We use the WordNet-based (Fellbaum, 1998) evaluation data of Glavaš and Vulić (2018), which contain 10K annotated word pairs balanced by class. Micro-averaged F1 scores, averaged across 5 runs for each input vector space (default STM setting), are reported for EN and DE.
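As an illustration of the RELP setup (a simplified linear probe, not the STM), the sketch below trains a minimal multinomial logistic-regression classifier on hand-crafted pair features over synthetic "relation" data; all names and data are our own toy assumptions:

```python
import numpy as np

def pair_features(v1, v2):
    """Simple pair encoding: concatenation plus difference and
    elementwise product (a common baseline, not the STM's projections)."""
    return np.concatenate([v1, v2, v1 - v2, v1 * v2])

def train_softmax(X, y, n_classes, lr=0.5, steps=300):
    """Minimal multinomial logistic-regression probe, full-batch gradient descent."""
    W = np.zeros((X.shape[1], n_classes))
    onehot = np.eye(n_classes)[y]
    for _ in range(steps):
        logits = X @ W
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        W -= lr * X.T @ (p - onehot) / len(X)
    return W

# synthetic data: class 0 ~ "synonym-like" (v2 = v1), class 1 ~ "antonym-like" (v2 = -v1)
rng = np.random.default_rng(3)
base = rng.normal(size=(40, 4))
X = np.array([pair_features(v, v) for v in base[:20]]
             + [pair_features(v, -v) for v in base[20:]])
y = np.array([0] * 20 + [1] * 20)
W = train_softmax(X, y, n_classes=2)
acc = float(np.mean(np.argmax(X @ W, axis=1) == y))
```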

Results and Discussion
A summary of the results is shown in Figure 2 for LSIM, in Figure 3a for BLI, in Figure 3b for CLIR, in Figure 4a and Figure 4b for RELP, and in Figure 4c for WA. These results offer multiple axes of comparison, and the ensuing discussion focuses on the central questions Q1-Q3 posed in §1.

Monolingual versus Multilingual LMs. Results across all tasks validate the intuition that language-specific monolingual LMs contain much more lexical information for a particular target language than massively multilingual models such as mBERT or XLM-R (Artetxe et al., 2020). We see large drops between MONO.* and MULTI.* configurations even for very high-resource languages (EN and DE), and they are even more prominent for FI and TR.
Encompassing 100+ training languages with limited model capacity, multilingual models suffer from the "curse of multilinguality" (Conneau et al., 2020): they must trade off monolingual lexical information coverage (and consequently monolingual performance) for wider language coverage.

How Important is Context? Another observation that holds across all configurations concerns the usefulness of providing contexts drawn from external corpora, and corroborates findings from prior work (Liu et al., 2019b): ISO configurations cannot match configurations that average subword embeddings from multiple contexts (AOC-10 and AOC-100). However, it is worth noting that 1) performance gains with AOC-100 over AOC-10, although consistent, are quite marginal across all tasks: this suggests that a handful of a word's occurrences in vivo is already sufficient to accurately capture its type-level representation; 2) in some tasks, ISO configurations are only marginally outscored by their AOC counterparts: e.g., for MONO.*.NOSPEC.AVG(L≤8) on EN-FI BLI or DE-TR BLI, the respective scores are 0.486 and 0.315 with ISO, and 0.503 and 0.334 with AOC-10. Similar observations hold for FI and ZH LSIM, and also in the RELP task.
In RELP, it is notable that 'BERT-based' embeddings can recover more lexical relation knowledge than standard FT vectors. These findings reveal that pretrained LMs indeed implicitly capture plenty of lexical type-level knowledge (which needs to be 'recovered' from the models); this may also explain why pretrained LMs have been successful in tasks where this knowledge is directly useful, such as NER and POS tagging (Tenney et al., 2019; Tsai et al., 2019). Finally, we also note that gains with AOC over ISO are much more pronounced for the under-performing MULTI.* configurations: this indicates that MONO models store more lexical information even in the absence of context.

How Important is Layer-wise Averaging? Averaging across layers bottom-to-top (i.e., from L0 to L12) is beneficial across the board, but we notice that scores typically saturate or even decrease in some tasks and languages when we include higher layers in the average: see the scores with *.AVG(L≤10) and *.AVG(L≤12) configurations, e.g., for FI LSIM, EN/DE RELP, and the summary BLI and CLIR scores. This hints at the fact that two strategies typically used in prior work, taking vectors only from the embedding layer L0 (Wu et al., 2020; Wang et al., 2019) or averaging across all layers (Liu et al., 2019b), extract suboptimal word representations for a wide range of setups and languages.
The sweet spot for n in *.AVG(L≤n) configurations seems largely task- and language-dependent, as peak scores are obtained with different values of n. Whereas averaging across all layers generally hurts performance, the results strongly suggest that averaging across layer subsets (rather than selecting a single layer) is widely useful, especially across the bottom-most layers: e.g., L ≤ 6 with MONO.ISO.NOSPEC yields an average score of 0.561 in LSIM, 0.076 in CLIR, and 0.432 in BLI; the respective scores when averaging over the 6 top layers are 0.218, 0.008, and 0.230. This evidence implies that, although scattered across multiple layers, type-level lexical information seems to be concentrated in lower Transformer layers. We investigate these conjectures further in §4.1.
Comparison to Static Word Embeddings. The results also offer a comparison to static FT vectors across languages. The best-performing extraction configurations (e.g., MONO.AOC-100.NOSPEC) outperform FT in monolingual evaluations on LSIM (for EN, FI, ZH), WA, and they also display much stronger performance in the RELP task for both evaluation languages. While the comparison is not strictly apples-to-apples, as FT and LMs were trained on different (Wikipedia) corpora, these findings leave open a provocative question for future work: Given that static type-level word representations can be recovered from large pretrained LMs, does this make standard static WEs obsolete, or are there applications where they are still useful?
The trend is opposite in the two cross-lingual tasks: BLI and CLIR. While there are language pairs for which 'BERT-based' WEs outperform FT (i.e., EN-FI in BLI, EN-RU and FI-RU in CLIR) or are very competitive with FT's performance (e.g., EN-TR in BLI, DE-RU in CLIR), FT provides higher scores overall in both tasks. The discrepancy between results in monolingual versus cross-lingual tasks warrants further investigation in future work. For instance, is using linear maps, as in standard mapping approaches to CLWE induction, suboptimal for 'BERT-based' word vectors?

The results also show some variation across different languages and tasks. For instance, while EN LSIM performance declines modestly but steadily when averaging over higher-level layers (AVG(L≤n), where n > 4), performance on EN WA consistently increases for the same configurations. The BLI and CLIR scores in Figures 3a and 3b also show slightly different patterns across layers. Overall, this suggests that 1) lexical information extraction must be guided by task requirements, and 2) configuration components must be carefully tuned to maximise performance for a particular task-language combination.

Lexical Information in Individual Layers
Evaluation Setup. To better understand which layers contribute the most to the final performance in our lexical tasks, we also probe type-level representations emerging from each individual layer of pretrained LMs. For brevity, we focus on the best performing configurations from previous experiments: {MONO, MBERT}.{ISO, AOC-100}.NOSPEC. In addition, tackling Q4 from §1, we analyse the similarity of representations extracted from monolingual and multilingual BERT models using centered kernel alignment (CKA), as proposed by Kornblith et al. (2019). Linear CKA computes a similarity that is invariant to isotropic scaling and orthogonal transformations; for column-centered representation matrices X ∈ R^{n×d1} and Y ∈ R^{n×d2}, it is defined as CKA(X, Y) = ||Y^T X||_F^2 / (||X^T X||_F ||Y^T Y||_F).

Discussion. Per-layer CKA similarities are provided in Figure 7 (self-similarity) and Figure 5 (bilingual), and we show results of representations extracted from individual layers for selected evaluation setups and languages in Table 2. We also plot bilingual layer correspondence of true word translations versus randomly paired words for EN-RU in Figure 6. Figure 7 reveals very similar patterns for both EN and FI, and we also observe that self-similarity scores decrease for more distant layers (cf. the similarity of L1 and L2 versus that of L1 and L12). However, despite structural similarities identified by linear CKA, the scores in Table 2 demonstrate that structurally similar layers might encode different amounts of lexical information: e.g., compare the performance drops between L5 and L8 in all evaluation tasks. The results in Table 2 further suggest that more type-level lexical information is available in lower layers, as all peak scores in the table are achieved with representations extracted from layers L1-L5. Much lower scores in type-level semantic tasks for higher layers also empirically validate a recent hypothesis of Ethayarajh (2019) "that contextualised word representations are more context-specific in higher layers."
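Linear CKA, as defined by Kornblith et al. (2019), is straightforward to implement; a minimal numpy sketch:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between representation matrices X (n x d1) and Y (n x d2);
    invariant to isotropic scaling and orthogonal transformations."""
    X = X - X.mean(axis=0)   # column-centre both views
    Y = Y - Y.mean(axis=0)
    return float(np.linalg.norm(Y.T @ X, "fro") ** 2
                 / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")))
```

As a sanity check, CKA of a matrix with itself is 1.0, and it remains 1.0 under any rotation and rescaling of one of the two views.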
We also note that none of the results with L=n configurations from Table 1 can match the best-performing AVG(L≤n) configurations with layer-wise averaging. This confirms our hypothesis that type-level lexical knowledge, although predominantly captured by lower layers, is disseminated across multiple layers, and that layer-wise averaging is crucial to uncover it. Further, Figure 5 and Figure 6 reveal that even LMs trained on monolingual data learn similar representations in corresponding layers for word translations (see the MONO.AOC columns). Intuitively, this similarity is much more pronounced with AOC configurations and with mBERT. The comparison of scores in Figure 6 also reveals much higher correspondence scores for true translation pairs than for randomly paired words (i.e., the correspondence scores for random pairings are, as expected, random). Moreover, MULTI CKA similarity scores turn out to be higher for more similar language pairs (cf. EN-DE versus EN-TR MULTI.AOC columns). This suggests that, similar to static WEs, type-level 'BERT-based' WEs of different languages also display topological similarity, often termed approximate isomorphism (Søgaard et al., 2018), but its degree depends on language proximity. This also clarifies why representations extracted from two independently trained monolingual LMs can be linearly aligned, as validated by the BLI and CLIR evaluations (Table 2 and Figure 3). Relatedly, previous work has empirically validated that sentence representations for semantically similar inputs from different languages are less similar in higher Transformer layers (Singh et al., 2019; Wu and Dredze, 2019). In Figure 5, we demonstrate that this is also the case for type-level lexical information; however, unlike sentence representations, where the highest similarity is reported in the lowest layers, Figure 5 suggests that the highest CKA similarities are achieved in intermediate layers L5-L8.

We also calculated the Spearman's correlation between CKA similarity scores for configurations MONO.AOC-100.NOSPEC.AVG(L≤n), for all n = 0, ..., 12, and their corresponding BLI scores on EN-FI, EN-DE, and DE-FI. The correlations are very high: ρ = 1.0, 0.83, 0.99, respectively. This further confirms the approximate isomorphism hypothesis: it seems that higher structural similarities of representations extracted from monolingual pretrained LMs facilitate their cross-lingual alignment.

Further Discussion and Conclusion

What about Larger LMs and Corpora? Aspects of LM pretraining, such as the number of model parameters or the size of the pretraining data, also impact the lexical knowledge stored in the LM's parameters. Our preliminary experiments have verified that EN BERT-Large yields slight gains over the EN BERT-Base architecture used in our work (e.g., peak EN LSIM scores rise from 0.518 to 0.531). In a similar vein, we have run additional experiments with two available Italian (IT) BERT-Base models with identical parameter setups, where one was trained on 13GB of IT text, and the other on 81GB. In EN (BERT-Base)-IT BLI and CLIR evaluations we measure improvements from 0.548 to 0.572 (BLI), and from 0.148 to 0.160 (CLIR) with the 81GB IT model. In-depth analyses of these factors are out of the scope of this work, but they warrant further investigation.
Opening Future Research Avenues. Our study has empirically validated that (monolingually) pretrained LMs store a wealth of type-level lexical knowledge, but effectively uncovering and extracting such knowledge from the LMs' parameters depends on several crucial components (see §2). In particular, some universal choices of configuration can be recommended: i) choosing monolingual LMs; ii) encoding words with multiple contexts; iii) excluding special tokens; iv) averaging over lower layers. Moreover, we found that type-level WEs extracted from pretrained LMs can surpass static WEs like fastText (Bojanowski et al., 2017).
This study has only scratched the surface of this research avenue. In future work, we plan to investigate how domains of external corpora affect AOC configurations, and how to sample representative contexts from the corpora. We will also extend the study to more languages, more lexical semantic probes, and other larger underlying LMs. The difference in performance across layers also calls for more sophisticated lexical representation extraction methods (e.g., through layer weighting or attention) similar to meta-embedding approaches (Yin and Schütze, 2016; Bollegala and Bao, 2018; Kiela et al., 2018). Given the current large gaps between monolingual and multilingual LMs, we will also focus on lightweight methods to enrich lexical content in multilingual LMs (Wang et al., 2020; Pfeiffer et al., 2020).

URLs to the models and external corpora used in our study are provided in Table 3 and Table 4, respectively. URLs to the evaluation data and task architectures for each evaluation task are provided in Table 5. We also report additional and more detailed sets of results across different tasks, word embedding extraction configurations/variants, and language pairs:

• In Table 6 and Table 7, we provide full BLI results per language pair. All scores are Mean Reciprocal Rank (MRR) scores (in the standard scoring interval, 0.0-1.0).
• In Table 8, we provide full CLIR results per language pair. All scores are Mean Average Precision (MAP) scores (in the standard scoring interval, 0.0-1.0).
• In Table 9, we provide full relation prediction (RELP) results for EN and DE. All scores are micro-averaged F 1 scores over 5 runs of the relation predictor (Glavaš and Vulić, 2018). We also report standard deviation for each configuration.
Finally, in Figures 8-10, we also provide heatmaps denoting bilingual layer correspondence, computed via linear CKA similarity (Kornblith et al., 2019), for several EN-Lt language pairs (see §4.1), which are not provided in the main paper.

Table 7: Results in the bilingual lexicon induction (BLI) task across different language pairs and word vector extraction configurations: Part II. MRR scores reported. For clarity of presentation, a subset of results is presented in this table, while the rest (also used to calculate the averages) is provided in Table 6 on the previous page. AVG(L≤n) means that we average representations over all Transformer layers up to the n-th layer (included), where L=0 refers to the embedding layer, L=1 to the bottom layer, and L=12 to the final (top) layer. Different configurations are described in §2 and Table 1.