An analysis of language models for metaphor recognition

We conduct a linguistic analysis of recent metaphor recognition systems, all of which are based on language models. We show that their performance, although reaching high F-scores, has considerable gaps from a linguistic perspective. First, they perform substantially worse on unconventional metaphors than on conventional ones. Second, they struggle with handling rarer word types. These two findings together suggest that a large part of the systems’ success is due to optimising the disambiguation of conventionalised, metaphoric word senses for specific words instead of modelling general properties of metaphors. As a positive result, the systems show increasing capabilities to recognise metaphoric readings of unseen words if synonyms or morphological variations of these words have been seen before, leading to enhanced generalisation beyond word sense disambiguation.


Introduction
Metaphor is a type of figurative language where meaning transfer occurs via similarity between two conceptual domains. In Examples 1 to 3, the metaphors attacked, bashed and ceasefire stem from a transfer from the domain WAR (or FIGHT) to the domain ARGUMENT (Lakoff and Johnson, 1980). 1
(1) He attacked my argument.
(3) We declared a ceasefire during dinner.
Metaphoric instances can stem from such regular metaphoric patterns equating two domains habitually or conventionally. 2 However, they can also be novel/unconventional such as the famous Emily Dickinson metaphor Hope is the thing with feathers, which does not correspond to a well-known metaphorical pattern. Even within a metaphorical pattern such as Argument is War, there are degrees of conventionality with Example 1 being more conventional than Examples 2 and 3. To a human, unconventional metaphors tend to be more noticeable.
Metaphor detection has been studied extensively in NLP in recent years (see Veale et al., 2016, for overviews). State-of-the-art approaches in metaphor detection build strongly on language models and word embeddings, with more than half of the participants in the 2020 Shared Task on Metaphor Detection (Leong et al., 2020) using a variant of BERT language models (Devlin et al., 2019). Evaluations on the standard metaphor recognition test sets report scores that creep up steadily with such methods. We investigate whether these models really are able to learn general properties of metaphor. To do so and to go beyond word sense disambiguation, they should be able to (i) recognise both conventional and unconventional metaphors, (ii) perform well on rarer word types that often follow the same metaphoric patterns as frequently seen ones and (iii) generalise across synonyms and morphological variations of word types (such as making the inference from Example 1 to 2).
1 In our examples, metaphoric words are marked in italics.
2 Lakoff and Johnson (1980) call these patterns conceptual metaphors. We will mainly use the term metaphoric patterns to distinguish those clearly from specific metaphoric instances, which we simply call metaphors.
This work is licensed under a Creative Commons Attribution 4.0 International Licence. Licence details: http://creativecommons.org/licenses/by/4.0/.
Our contributions are as follows:
• We conduct a systematic comparison of different sequential metaphor recognition systems on the two most frequently used datasets. Although the two datasets contain different token sets of the same underlying corpus (Steen, 2010), we show that one is substantially easier to do well on than the other. We therefore call on future research to stop comparing results across these two datasets, as this leads to unfair system comparisons.
• We show that the systems behave counter-intuitively by having lower performance on unconventional metaphors than on conventional ones. However, unconventional metaphors are the ones that are particularly relevant as conventional ones can potentially be interpreted with standard word sense disambiguation techniques.
• We show that metaphor recognition systems are strongly dependent on the frequency of word types in the training data.
• As a positive result, we show that the systems have increasing generalisation capabilities in that they perform better on unknown word types if synonyms or morphological variations have been seen in the training data.

Models
We report on the following models, all except the baseline being based on a sequence of progressively stronger language models. Lex-BL is a baseline suggested by Gao et al. (2018) that assigns metaphoric if the word has been annotated as metaphoric more often than literal in the training set, and literal otherwise (including for word types unseen in training).
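As an illustration, the decision rule of Lex-BL can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the implementation of Gao et al. (2018): the function names, the lowercasing of word types and the toy training pairs are hypothetical and serve only to make the rule concrete.

```python
from collections import Counter


def train_lex_baseline(train_tokens):
    """Collect the word types labeled metaphoric more often than literal.

    train_tokens: iterable of (word, label) pairs, label 1 = metaphoric, 0 = literal.
    Lowercasing word types is an assumption of this sketch.
    """
    metaphoric = Counter()
    literal = Counter()
    for word, label in train_tokens:
        if label == 1:
            metaphoric[word.lower()] += 1
        else:
            literal[word.lower()] += 1
    # Ties and literal-majority types fall back to literal.
    return {w for w in metaphoric if metaphoric[w] > literal[w]}


def predict_lex_baseline(metaphoric_types, words):
    """Label a word metaphoric (1) only if its type is in the learned set.

    Unseen word types are therefore always labeled literal (0).
    """
    return [1 if w.lower() in metaphoric_types else 0 for w in words]


# Toy data, invented for illustration only.
train = [("attacked", 1), ("attacked", 1), ("attacked", 0),
         ("in", 1), ("in", 0), ("dinner", 0)]
model = train_lex_baseline(train)
preds = predict_lex_baseline(model, ["attacked", "in", "dinner", "ceasefire"])
print(preds)
```

In the toy data, in is annotated metaphoric and literal equally often, so it is labeled literal, as is the unseen type ceasefire.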
Wu (Wu et al., 2018) is a system based on skip-gram word2vec (Mikolov et al., 2013), POS tags and word clusters with a CNN and BiLSTM plus ensemble learning, and is the winner of the 2018 Metaphor Detection Shared Task (Leong et al., 2018). As code or system output is not available, we report only the results in their paper and leave it out of fine-grained analysis.
Mao (Mao et al., 2019) build on Gao et al. (2018) but explicitly model two linguistically-motivated factors that might indicate metaphoricity: firstly, the potential clash between contextual and literal meaning of the word to be labeled, and secondly, the possible conflict between the literal meaning of the word to be labeled and its context.
Dankers (Dankers et al., 2019) enhance a fine-tuned BERT model (Dankers-BERT) with a multitask setup that learns metaphor and emotion labels jointly (Dankers). As code or system output is not available, we report only the results in their paper and leave it out of fine-grained analysis.
Stowe (Stowe et al., 2019) use the ELMO model of Gao et al. (2018) but show that additional, linguistically motivated training data enhances performance. As code or system output is not available, we report only the results in their paper and leave it out of fine-grained analysis.
BERT is a fine-tuned BERT model we implemented. Parameter details are in the Supplement.
ILLI (Gong et al., 2020) is one of the 3 best-performing systems on the 2020 Metaphor Detection Shared Task (Leong et al., 2020). Its most basic form is a simple fine-tuned RoBERTa (Liu et al., 2019) language model (ILLI-ROB). Its most sophisticated version (ILLI-F-ENS) adds a wide variety of linguistic features and an ensemble based on 3 different runs on different train/dev splits. The system code is available but at too short notice for us to conduct a fine-grained analysis of this system yet.
DM is the 2020 Shared Task winner (Su et al., 2020). It uses RoBERTa enriched with POS features and two transformers, one focusing on the whole sentence context and one on a more local context. DM-ENS builds an ensemble across nine different runs of DM. Their system output is available. 3 Our analysis is based on the DM outputs in their submit folder, more specifically answer9 for DM as well as ensemble3 for DM-ENS (both for the VUA-ALL-POS task). The results vary only marginally from the reported best results in their paper.

Datasets
The VUA Metaphor Corpus 4 (Steen, 2010) consists of 115 texts of four different genres: academic, conversation, fiction and news. Each word, including function words, is annotated as metaphoric or literal, using guidelines based on literal meanings being the more basic or concrete meanings of a word (Pragglejaz Group, 2007). Metaphoric readings can still be highly frequent. Example 4 from the corpus contains three conventional, frequent metaphoric readings, including the non-spatial meaning of in.
(4) But Nicholas's grand design collapsed in 1918
The corpus was used in the 2018 and 2020 VUA Metaphor Detection Shared Tasks (Leong et al., 2018; Leong et al., 2020). 5 The shared tasks include the VUA-ALL-POS task where all content words in a sentence (adjectives, verbs without have, do, be, nouns, adverbs) have to be labeled. Although the original corpus also labels function words for metaphoricity, the VUA-ALL-POS task does not evaluate systems on function words. Therefore in Example 4, four tokens (Nicholas, grand, design, collapsed) would have to be labeled as metaphoric or literal. 6
Four other papers that did not participate in the shared task (Gao et al., 2018; Mao et al., 2019; Dankers et al., 2019; Stowe et al., 2019) also use the VUA corpus but use quite different subsets of the corpus than the shared task VUA-ALL-POS does. In the VUA-ALL-POS task all sentences in the VUA texts are used whereas Gao et al. (2018), Mao et al. (2019), Dankers et al. (2019) and Stowe et al. (2019) use a much smaller subset of sentences, for reasons unknown. In addition, these four papers also evaluate on function words in this smaller subset, which makes a substantial difference.
We handle these two setups in two separate tasks: firstly, the original VUA-ALL-POS Shared Task data 7 and secondly, VUA-SEQ, which uses the data of Gao et al. (2018), 8 subsequently used by Mao et al. (2019), Dankers et al. (2019) and Stowe et al. (2019). Statistics on these datasets are given in Table 1. Using only content words means that VUA-ALL-POS evaluates on fewer tokens although it has more sentences and that it contains fewer metaphors per sentence.

Results
We use the standard VUA-ALL-POS and VUA-SEQ training/test splits. Evaluation measures are precision, recall and F1 for the metaphoric class as well as accuracy on all target tokens. Table 2 shows overall results. For Lex-BL, Gao, Mao and BERT we had working code and ran it on both datasets; we also report the original results from the Gao and Mao papers. For DM and DM-ENS we report and analyse output from their GitHub repository. For all other systems we report only the original results from their papers.
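For concreteness, the four evaluation measures can be computed from token-level gold and predicted labels as sketched below. This is a minimal sketch: the function name and the toy label sequences are assumptions made for the example and are not taken from any of the evaluated systems.

```python
def metaphor_scores(gold, pred):
    """Precision, recall and F1 for the metaphoric class (label 1),
    plus accuracy over all target tokens.

    gold, pred: equally long lists of 0/1 labels (1 = metaphoric).
    """
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    correct = sum(1 for g, p in zip(gold, pred) if g == p)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = correct / len(gold) if gold else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Toy labels, invented for illustration only.
gold = [1, 0, 0, 1, 1, 0, 0, 0]
pred = [1, 0, 1, 0, 1, 0, 0, 0]
scores = metaphor_scores(gold, pred)
print(scores)
```

Because literal tokens dominate the data, accuracy can stay high even when F1 for the metaphoric class is low, which is why both are reported.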
Dataset comparison. Results on VUA-ALL-POS are overall considerably lower than on VUA-SEQ for equivalent models. For example, our BERT model achieves an F1 of 77.5 on VUA-SEQ but only 69.7 on VUA-ALL-POS. Similarly, our rerun of Mao et al. (2019) achieves 74.3 on VUA-SEQ (identical to their reported results) but only 65.5 on VUA-ALL-POS. This is because VUA-SEQ also evaluates on function word metaphors, which are easier to classify. Therefore, comparisons in various papers that do not distinguish the two setups are inherently unfair and this widespread practice should not be continued. Thus, for example, the Gao model is not better than the 2018 Shared Task winner Wu, as claimed in Gao et al. (2018), when compared on the same dataset and the same parts of speech; and the ILLINIMET paper (Gong et al., 2020) disadvantages itself by comparing its results on VUA-ALL-POS negatively to Mao et al. (2019), which it clearly beats when the right dataset comparison is made.
Language model improvements. The gains achieved by exploiting state-of-the-art language models are usually higher than the ones achieved by additional linguistic modeling or insights. For example, on VUA-SEQ, the gain in moving from ELMO (Gao et al., 2018) to a fine-tuned BERT model with an otherwise similar setup was 4.9 F-measure points, whereas the gain from ELMO (Gao et al., 2018) to the inclusion of more complex linguistic modelling (Mao et al., 2019) is only 1.7 F-measure points. The gain when moving from Dankers-BERT to a multi-task model on top of BERT is only 0.6 F-measure points (Dankers et al., 2019). On VUA-ALL-POS, we again see a steady improvement with better language models, from ELMO in Gao/Mao (F1 65.5) to BERT (F1 69.7) to RoBERTa in ILLI-ROB (F1 72.0).
Here again more sophisticated features (from ILLI-ROB's F1 72.0 to ILLI-F-Ens F1 73.0) yield less of an improvement than better language models, although part-of-speech features play a positive role in both ILLI-F-ENS and in DM. Especially important is reducing performance variation by extensive ensemble modeling (from DM's 72.7 F1 to DM-ENS 76.6 F1).
Does this mean that standard language models indeed learn metaphor properties and generalise within metaphorical patterns? We will now conduct further linguistic analysis to address this question.

Analysis
We will now investigate (i) how well the current systems handle conventional vs. novel metaphors, (ii) if they can handle frequent and less frequent word types and (iii) what the influence of morphology as well as semantic similarity is on their capability to handle metaphoric usage of word types not seen in training. In our opinion, frequency and conventionality analysis are crucial to test whether the system mainly recognises frequently seen, word-specific meanings that might also be specified in dictionary entries (see the metaphors in Example 4 in Section 2.2) or whether it is able to generalise the concept of metaphor to word types not seen in training or newly occurring metaphoric transfers.
We conduct the analysis on our reruns of Gao and Mao as well as BERT on VUA-SEQ and extend the analysis with the DM models on VUA-ALL-POS. This includes the state-of-the-art systems on both datasets as well as 3 different language models (and extensions).

Novel vs. Conventional Metaphors.
Metaphors can be conventional (Example 4) or novel (see goose-step in Example 5 from the VUA corpus).
(5) Ron Todd [...] warned that party leaders could not expect everybody to 'goose-step' in the same direction [...]
Conventional metaphors are frequent word usages that often have their own dictionary entries whereas novel readings are rare and cannot be found in standard lexical resources. Other aspects also contribute to a metaphor's conventionality, such as whether it follows a metaphoric pattern. Recognising novel metaphors is important: Shutova (2015) argues "that NLP applications do not necessarily need to address highly conventional and lexicalized metaphors that can be interpreted using standard word sense disambiguation techniques".
Do Dinh et al. (2018) have extended the VUA corpus with reliable novelty scores for content word metaphors. Their annotation guidelines define conventionality and novelty based on frequency of use (often used in everyday language vs. not usually used in everyday language). The scores range from −1 indicating conventional metaphors to 1 for the most novel metaphors. For example, the metaphor in Example 5 has the score 0.765.
Whereas Do Dinh et al. (2018) and Simpson et al. (2019) tackle novelty scoring given gold standard metaphoric/literal information, we investigate how the novelty of a metaphor affects automatic methods for finding metaphors in the first place. Figure 1a and 1b show performance on conventionalised vs. novel metaphors for all systems on metaphoric content words with novelty scores. The x-axis shows the conventionality threshold t and the y-axis shows accuracy/recall. The graphs depict results for conventional metaphors with a novelty score below t and for novel metaphors with a novelty score above t. For example, on VUA-SEQ, our BERT model achieves an accuracy just below 0.6 on the 828 metaphoric content words with a novelty score higher than 0.2. In contrast, it achieves an accuracy of over 70% on the 2876 metaphoric content words with a novelty score lower than 0.2, indicating a substantially better performance on conventional metaphors.
Within each model, the curve for conventionalised metaphors is consistently above the curve for novel metaphors as long as the buckets have a reasonable size. 9 Conventionalised metaphors are therefore recognized much more easily than novel ones. This is interesting as, from a human perspective, novel metaphors are easier to "notice", and suggests that the algorithms might mainly learn different word senses instead of general properties of metaphor, such as the fact that many metaphoric readings show a contrast to their dictionary sense(s) or a contrast to the surrounding context.
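The threshold-based split behind these curves can be sketched as follows. This is a minimal sketch, not the code used for Figure 1: the function name and the toy scores are assumptions. Since every item is a gold metaphor, accuracy on a bucket equals recall on that bucket.

```python
def recall_by_novelty(items, threshold):
    """Split gold-metaphoric tokens at a novelty threshold t and score each side.

    items: list of (novelty_score, predicted_label) pairs for tokens whose
           gold label is metaphoric; predicted_label is 1 if the system
           recognised the metaphor, else 0.
    Tokens with a score exactly equal to t fall in neither bucket here,
    following the "below t" / "above t" wording.
    """
    conventional = [p for score, p in items if score < threshold]
    novel = [p for score, p in items if score > threshold]

    def recall(preds):
        # None marks an empty bucket, where recall is undefined.
        return sum(preds) / len(preds) if preds else None

    return {
        "conventional": (len(conventional), recall(conventional)),
        "novel": (len(novel), recall(novel)),
    }


# Toy novelty scores and predictions, invented for illustration only.
items = [(-0.6, 1), (-0.3, 1), (0.0, 1), (0.1, 0),
         (0.3, 1), (0.5, 0), (0.765, 0)]
result = recall_by_novelty(items, threshold=0.2)
print(result)
```

Returning the bucket size next to the recall makes it easy to disregard thresholds at which one of the buckets becomes too small to be reliable.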

Frequent vs Infrequent Word Types.
To further investigate how far the algorithms generalise across different word types and their specific meanings, we show the performance of the systems on word types grouped by frequency in the training set in Tables 3 and 4.
Table 3: F-measure (accuracy in parentheses) on different frequency buckets in VUA-SEQ. The frequency buckets are given in the first column, the number of tokens in the test set that belong to each bucket in the second column and the number of types in the test set belonging to each bucket in the third column. For example, there are 2807 word types in the test set that have never been seen in the training set. 125 word types in the test set have been seen over 100 times in training, making up 27,997 tokens of all test tokens.
Overall, F-measure and accuracy increase with the number of times the word type has been seen in training for all models. For example, the best-performing model on VUA-SEQ, BERT (Table 3), achieves an F-measure of 53.5 on words whose type has not been seen in training, but already 70.3 on words whose type has been seen 1-10 times in training. The one exception is a drop in F-measure for all models on the highest frequency bucket in VUA-ALL-POS (Table 4). Investigation showed that this bucket included only 46 word types, including word types such as Yes, Mm, er, also, which were rarely used metaphorically. 10 Thus, the percentage of metaphors in this bucket is much smaller than in the remainder of the corpus, making F-measure more volatile. This is not true for the high frequency bucket in VUA-SEQ, which includes many prepositions that are frequently annotated as metaphors (see the non-spatial meaning of in in Example 4).
Mao et al. (2019) explicitly encode clashes between literal word meaning and contextual meaning as a metaphoricity indicator on top of Gao's ELMO model, leading to some performance improvements also for word types not seen in training when compared to Gao et al. (2018) (improving from an F1 of 45.1 to 48.3 on unseen word types on VUA-SEQ, Table 3). These improvements are dwarfed by just moving to a stronger language model such as BERT (F1 53.5 on unseen types in VUA-SEQ), but it is possible that the improvements would also carry over if the additional linguistic modelling were stacked on top of BERT. It might not seem surprising that all algorithms perform better on word types more often seen in training, but we believe that this type of analysis should be reported routinely to check a model's dependence on word-specific labeled data and its ability to generalise.
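A frequency-bucket analysis of this kind can be sketched as follows. This is a minimal sketch under stated assumptions: only the buckets "unseen", "1-10" and "over 100" are named in the text, so the intermediate bucket "11-100" is a placeholder, and the function names, the lowercasing and the toy data are likewise hypothetical.

```python
from collections import Counter, defaultdict

# (name, lowest count, highest count); None = no upper bound.
# The "11-100" bucket is an assumption made for this sketch.
BUCKETS = [("0", 0, 0), ("1-10", 1, 10), ("11-100", 11, 100), (">100", 101, None)]


def bucket_name(count):
    """Map a training-set frequency to the name of its bucket."""
    for name, low, high in BUCKETS:
        if count >= low and (high is None or count <= high):
            return name
    raise ValueError(f"no bucket for count {count}")


def f1_by_frequency(train_words, test_items):
    """F1 for the metaphoric class per training-frequency bucket.

    train_words: list of word tokens in the training set.
    test_items:  list of (word, gold_label, predicted_label), labels 0/1.
    """
    freq = Counter(w.lower() for w in train_words)
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for word, gold, pred in test_items:
        cell = counts[bucket_name(freq[word.lower()])]
        if gold == 1 and pred == 1:
            cell["tp"] += 1
        elif gold == 0 and pred == 1:
            cell["fp"] += 1
        elif gold == 1 and pred == 0:
            cell["fn"] += 1
    result = {}
    for name, c in counts.items():
        denom = 2 * c["tp"] + c["fp"] + c["fn"]
        result[name] = 2 * c["tp"] / denom if denom else 0.0
    return result


# Toy data, invented for illustration only.
train_words = ["in", "in", "in", "attack"]
test_items = [("in", 1, 1), ("in", 0, 1), ("attack", 1, 1),
              ("tussle", 1, 0), ("tussle", 1, 1)]
result = f1_by_frequency(train_words, test_items)
print(result)
```

Only buckets that actually occur in the test items appear in the result, so empty buckets have to be handled by the caller.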

The impact of morphology and lexical relations
All models are still able to recognise some metaphors for word types not seen in training (henceforth, unseen types). We now investigate when the models are able to generalise to such unseen types.
First, we look at whether performance on unseen word types whose morphological variants have been seen in training is higher than on other unseen word types. This would be plausible as morphological variants will often be close in embedding space and also often undergo the same metaphoric pattern shifts. For example, the AFFECTION IS WARMTH metaphoric pattern is instantiated by warm greeting, warmer greeting as well as the warmth of his greeting. We also hypothesized that inflectional variations probably behave more similarly than derivational variations. We therefore distinguished between three cases: exact word type seen; word type not seen but an inflectional variation seen; and neither word type nor inflectional variation seen but a derivational variant seen. Potential derivational variations were extracted via WordNet (Miller et al., 1990). Tables 5 and 6 show that unseen types where morphological variations had been seen are indeed easier than other unseen types for all systems. For example, on VUA-SEQ, F-measure for BERT gradually gets worse from 80.3 for seen types to 63.8 for word types that have only an inflectional variant seen in training, to 54.9 for word types that have only a derivational variant seen in training, to 47.4 for word types for which neither the type itself nor an inflectional or derivational variant has been seen in training (Table 5). For all systems but DM-ENS, performance on word types where inflectional variations have been seen is higher than if only derivational variations have been seen. For Gao, Mao and BERT, performance on types where no variation has been seen in training might actually not be better than just assigning literal, as Lex-BL does for the unseen cases: we see a drop in accuracy compared to Lex-BL for these models (last column in Tables 5 and 6) as well as low precision for metaphor recognition (precision not shown in the tables).
Table 5 (VUA-SEQ) columns: type seen, infl. var. seen, deriv. var. seen, no var. seen. Lex-BL: 56.9 (90.9), -(76.0), -(80 .
Table 6: F-measure (accuracy) on VUA-ALL-POS with regard to morphological variants seen in training
In a second study, we look at the performance on unseen word types when synonyms have been seen in training. Synonyms are also close in embedding space and also often share metaphorical patterns (see attack and bash in Examples 1 and 2 in the Introduction). Therefore, language models might fare better when a synonym has been seen. We extract synonyms of a word from WordNet. Table 7 shows that unseen word types where synonyms have been seen before are indeed easier than unseen word types where synonyms have not been seen. This holds for all systems without exception. For example, on VUA-ALL-POS, the Shared Task winner DM-ENS has an F1 of 78.8 for seen word types, falling to 68.3 for unseen word types where a synonym has been seen and to 56.8 for unseen word types where no synonym has been seen.
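The three-way grouping used in the synonym study can be sketched as follows. This is a minimal sketch: the function name and the synonym dictionary are hypothetical, and the toy entries stand in for a lookup that would be built from WordNet and are not actual WordNet output.

```python
def relation_to_training(word, train_types, synonyms):
    """Assign a test word type to one of three groups.

    word:        the test word type.
    train_types: set of word types seen in training.
    synonyms:    dict mapping a word type to a set of its synonyms
                 (a hypothetical stand-in for a WordNet lookup).
    """
    if word in train_types:
        return "type seen"
    if synonyms.get(word, set()) & train_types:
        return "synonym seen"
    return "no synonym seen"


# Toy training vocabulary and synonym table, invented for illustration only.
train_types = {"attack", "rich"}
synonyms = {"bash": {"attack", "hit"}, "ceasefire": {"truce"}}

groups = {w: relation_to_training(w, train_types, synonyms)
          for w in ["attack", "bash", "ceasefire"]}
print(groups)
```

Per-group F-measure and accuracy can then be obtained by scoring the test tokens of each group separately.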
In conclusion, current models seem to be able to generalise to a certain degree to unseen word types as long as they are synonyms or morphological variations of seen ones. We give two examples of metaphors in the test set of VUA-ALL-POS (i) that all or most systems identified correctly, (ii) the type of which has not been seen in training and (iii) for which morphological variations or synonyms have been seen. The test example comes first and a similar metaphor from the training set second after an arrow.
(6) . . . to . . . punctuate aspects of Holly's life ← stressed different facets of Kahlo's public persona
(7) the richness of their exquisitely-sculpted decoration ← the colours were rich
Of course, due to the black box nature of the language models, the extensive pretraining they undergo before fine-tuning and other interferences such as context similarity, we cannot claim that these were the actual examples that the models generalised from. However, the quantitative data shows that some generalisations do take place.

The interaction of word frequency and unconventionality
There is a moderate inverse correlation between metaphoric novelty and word frequency (Do Dinh et al., 2018). High frequency words tend to have many conventionalised metaphoric senses; however, low frequency is not necessarily an indication of novel metaphor usage as low frequency words can also follow the metaphoric patterns of their high frequency synonyms (such as tussle being used for non-physical arguments just like attack). We therefore investigate the interaction between novelty and word frequency, in particular whether for low frequency word types performance still depends on novelty/conventionality. Figure 2 shows a heatmap that displays the interaction between frequency count in training on the x-axis and conventionality scores on the y-axis for the 3704 VUA-SEQ content word metaphors with novelty scores in the test set. Similar to the analysis in Do Dinh et al. (2018), we see that metaphors using high frequency words normally do not have high novelty scores (right-most column): of 262 metaphoric test tokens whose type has been seen more than 100 times in training, 222 have a novelty score equal or below zero.
We enhance this analysis by showing the accuracy/recall of the BERT model on these subgroups in the fields of the heatmap. We see that even for low frequency words (two left-most columns), conventionality still matters and unconventional metaphors tend to be harder to recognise than conventional ones. For example, even for unseen word types (left-most column), performance on conventional metaphors with novelty scores below 0 is 66%, and then gradually decreases to 53%, 52%, 50% and 38% for less conventional metaphors. Therefore word type frequency does not account for all variation in classifier performance.
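The cells of such a heatmap can be computed as sketched below. This is a minimal sketch, not the code behind Figure 2: the function name, the bucket edges and the toy items are assumptions, since the exact novelty bucket boundaries are not given in the text.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def heatmap_cells(items, freq_edges, novelty_edges):
    """Group gold-metaphoric test tokens by training frequency and novelty.

    items:         list of (train_frequency, novelty_score, predicted_label),
                   with predicted_label 1 if the metaphor was recognised.
    freq_edges:    sorted lower bounds of the frequency buckets after the first.
    novelty_edges: sorted upper bounds of the novelty buckets before the last.
    Returns {(freq_bucket, novelty_bucket): (number of tokens, recall)}.
    """
    cells = defaultdict(list)
    for freq, novelty, pred in items:
        # bisect_right: a frequency equal to an edge opens the higher bucket.
        f_idx = bisect_right(freq_edges, freq)
        # bisect_left: a novelty score equal to an edge stays in the lower bucket.
        n_idx = bisect_left(novelty_edges, novelty)
        cells[(f_idx, n_idx)].append(pred)
    return {key: (len(preds), sum(preds) / len(preds))
            for key, preds in cells.items()}


# Frequency buckets: 0, 1-10, 11-100, >100 (the middle bucket is an assumption).
freq_edges = [1, 11, 101]
# Novelty buckets: <=0, (0, 0.25], (0.25, 0.5], >0.5 (edges are assumptions).
novelty_edges = [0.0, 0.25, 0.5]

# Toy items, invented for illustration only.
items = [(0, -0.4, 1), (0, -0.1, 1), (0, -0.2, 0), (0, 0.6, 0), (150, 0.0, 1)]
cells = heatmap_cells(items, freq_edges, novelty_edges)
print(cells)
```

Keeping the token count in every cell matters for interpretation, because cells with very few tokens do not support firm conclusions.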
The picture is not always completely clear for all models and across both datasets. In particular, small bucket sizes for some fields in the heatmap do not allow firm conclusions. However, the general message holds. Further heatmap examples for other systems on VUA-ALL-POS can be found in the supplementary material.
Figure 2: Heatmap showing the interaction between frequency and conventionality as well as classifier performance for the 3704 metaphors with a conventionality score in the VUA-SEQ test set. On the x-axis, we find how often a word type was seen in training. On the y-axis, we have buckets of conventionality scores. In the fields we see the number of test tokens in the bucket as well as accuracy/recall of the BERT model on this bucket.

Related Work
Datasets. In some datasets, each word is labeled for metaphoricity (VUA Metaphor corpus, (Steen, 2010)) whereas in others only one target word in a bigram or a sentence is labeled (Mohammad et al., 2016;Birke and Sarkar, 2006;Tsvetkov et al., 2014;Turney et al., 2011, among others). We concentrate on datasets where each word is labeled as (i) these are highly appropriate for the sequence labeling tasks that language models excel at and (ii) the 2018 and 2020 Metaphor Shared Tasks (Leong et al., 2018;Leong et al., 2020) use such corpora. We have shown that it matters substantially which dataset partition and setup within the VUA corpus you use and encourage future work to not compare systems working on the two different setups anymore.
Most datasets include only a binary metaphor/literal annotation per word, making it hard to assess system capabilities for the recognition of various metaphor types, such as conventional vs. novel metaphors, deliberately used vs. unintentional metaphors (Steen, 2008) or different domain mappings. Some exceptions exist, such as the conventionality annotation in (Do Dinh et al., 2018; Dunn, 2014), an annotation akin to deliberateness in (Klebanov and Flor, 2013) and annotated domain mappings in (Shutova and Teufel, 2010). However, most of these are small-scale and/or not publicly available, the exception being the conventionality ratings by Do Dinh et al. (2018), which we use in this paper.
Metaphor recognition. Data-driven approaches to metaphor recognition (Turney et al., 2011;Tsvetkov et al., 2014;Rei et al., 2017;Köper and im Walde, 2017;Wu et al., 2018;Gao et al., 2018;Gutierrez et al., 2016;Mao et al., 2018;Mao et al., 2019;Dankers et al., 2019;Stowe et al., 2019;Su et al., 2020;Gong et al., 2020, among others) use a variety of information sources such as abstractness/concreteness features, semantic class information, part-of-speech tags, property norms and outside lexical databases as well as multimodal and multilingual information. The recent state of the art models we discuss (Gao et al., 2018;Wu et al., 2018;Mao et al., 2019;Dankers et al., 2019;Stowe et al., 2019;Gong et al., 2020;Su et al., 2020) use sequence labeling and build on embeddings and/or language models. Leong et al. (2020) state that more than half of participants in the 2020 Shared Task use a variation of BERT. We investigate their properties and performance levels in more detail than previously done, including analysis for conventionality, frequency and generalisation via morphology and semantic similarity.
Novel vs. conventionalized metaphors. We investigated how conventionality impacts metaphor recognition. Recent work (Dunn, 2014;Do Dinh et al., 2018;Parde and Nielsen, 2018;Simpson et al., 2019) has assigned novelty scores to (given) metaphors. However, they have either not investigated the influence of novelty on metaphor detection per se or not worked in a sequence labeling, full-text paradigm. We have shown that assigning metaphor novelty scores assuming that metaphors have already been reliably detected is currently somewhat unrealistic as a metaphor's novelty has a strong influence on being detected in the first place by current models.

Results by POS Tag and genre
It is standard to give the results for the 4 genres in the corpus as well as for different POS. Results for VUA-ALL-POS can be found in the 2018 and 2020 Shared Task reports. The corresponding table for VUA-SEQ is given below. We do not observe any differences in tendencies from what has been previously reported: adjectives and nouns are harder than verbs and adverbs; conversational texts are the most difficult genre.

Heatmap Examples for VUA-ALL-POS
All heatmaps below show the interaction between frequency and novelty for the 3862 metaphors with a novelty score in VUA-ALL-POS. On the x-axis we find how often a word type was seen in training, on the y-axis we have buckets of novelty scores. In the fields we see the number of test tokens in the bucket as well as accuracy/recall on this bucket. We show heatmaps for the three best-performing models.