MelBERT: Metaphor Detection via Contextualized Late Interaction using Metaphorical Identification Theories

Automated metaphor detection is a challenging task of identifying metaphorical expressions of words in a sentence. To tackle this problem, we adopt pre-trained contextualized models, e.g., BERT and RoBERTa. To this end, we propose a novel metaphor detection model, namely metaphor-aware late interaction over BERT (MelBERT). Our model not only leverages contextualized word representations but also benefits from linguistic metaphor identification theories to detect whether a target word is metaphorical. Our empirical results demonstrate that MelBERT outperforms several strong baselines on four benchmark datasets, i.e., VUA-18, VUA-20, MOH-X, and TroFi.


Introduction
As the conceptual and cognitive mapping of words, a metaphor is a common language expression representing other concepts rather than taking literal meanings of words in context (Lakoff and Johnson, 1980; Lagerwerf and Meijers, 2008). For instance, in the sentence "hope is on the horizon," the word "horizon" does not literally mean the line at the earth's surface. It is a metaphorical expression to describe a positive situation. Therefore, the meaning of "horizon" is context-specific and different from its literal definition.
As the metaphor plays a key role in cognitive and communicative functions, it is essential to understand contextualized and unusual meanings of words (e.g., metaphor, metonymy, and personification) in various natural language processing (NLP) tasks, e.g., machine translation (Shi et al., 2014), sentiment analysis (Cambria et al., 2017), and dialogue systems (Dybala and Sayama, 2012). A lot of existing studies have developed various computational models to recognize metaphorical words in a sentence.
Automated metaphor detection aims at identifying metaphorical expressions using computational models. Existing studies can be categorized into three pillars. First, feature-based models employ various hand-crafted features (Shutova et al., 2010; Turney et al., 2011; Shutova and Sun, 2013; Broadwell et al., 2013; Tsvetkov et al., 2014). Although simple and intuitive, they are highly sensitive to the quality of a corpus. Second, some studies (Wu et al., 2018; Gao et al., 2018; Mao et al., 2019) utilize recurrent neural networks (RNNs), which are suitable for analyzing the sequential structure of words. However, they are limited in understanding the diverse meanings of words in context. Lastly, pre-trained contextualized models, e.g., BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), have been used for detecting metaphors (Gong et al., 2020; Su et al., 2020). Owing to their powerful representation capacity, such models have been successful in addressing various NLP tasks (Wang et al., 2019) and document ranking in IR (Mitra and Craswell, 2018).
Based on such an advancement, we utilize a contextualized model using two metaphor identification theories, i.e., Metaphor Identification Procedure (MIP) (Pragglejaz Group, 2007; Steen et al., 2010) and Selectional Preference Violation (SPV) (Wilks, 1975, 1978). For MIP, a metaphorical word is recognized if the literal meaning of a word is different from its contextual meaning (Haagsma and Bjerva, 2016). For instance, in the sentence "Don't twist my words", the contextual meaning of "twist" is "to distort the intended meaning", different from its literal meaning, "to form into a bent, curling, or distorted shape." For SPV, a metaphorical word is identified if the target word is unusual in the context of its surrounding words. That is, "twist" is metaphorical because it is unusual in the context of "words." Although the key ideas of the two strategies are similar, they have different procedures for detecting metaphorical words and their contexts in the sentence.
To this end, we propose a novel metaphor detection model using metaphor identification theories over a pre-trained contextualized model, namely metaphor-aware late interaction over BERT (MelBERT). MelBERT addresses a classification task: identifying whether a target word in a sentence is metaphorical or not. As depicted in Figure 2, MelBERT is based on a siamese architecture that takes two inputs: a sentence S containing a target word w_t, and the target word w_t itself. MelBERT independently encodes S and w_t into embedding vectors, which avoids unnecessary interactions between S and w_t. Inspired by MIP, MelBERT then employs the contextualized and isolated representations of w_t to distinguish between its contextual and literal meanings. To utilize SPV, MelBERT employs the sentence embedding vector and the contextualized target word embedding vector to identify how much the surrounding words mismatch the target word. Lastly, MelBERT combines the two metaphor identification strategies to predict whether a target word is metaphorical. Because each theory alone is insufficient for capturing complicated and vague metaphorical words, we incorporate both linguistic theories into a pre-trained contextualized model and additionally utilize linguistic features such as POS tags.
To summarize, MelBERT has two key advantages. First, MelBERT effectively employs the contextualized representation to understand various aspects of words in context. Because MelBERT is particularly based on a late interaction over contextualized models, it can prevent unnecessary interactions between two inputs and effectively distinguish the contextualized meaning and the isolated meaning of a word. Second, MelBERT utilizes two metaphor identification theories to detect whether the target word is metaphorical. Experimental results show that MelBERT consistently outperforms state-of-the-art metaphor detection models in terms of F1-score on several benchmark datasets, such as VUA-18, VUA-20, and VUA-Verb datasets.
Related Work

Metaphor Detection
Feature-based approach. Various linguistic features are used to understand metaphorical expressions. Representative hand-engineered features include word abstractness and concreteness (Turney et al., 2011), word imageability (Broadwell et al., 2013), semantic supersenses (Tsvetkov et al., 2014), and property norms. However, these features have difficulty handling rare usages of metaphors because they rely on manually annotated resources. To address this problem, sparse distributional features (Shutova et al., 2010; Shutova and Sun, 2013) and dense word embeddings (Rei et al., 2017), e.g., Word2Vec (Mikolov et al., 2013), are used as better linguistic features. For details, refer to the survey (Veale et al., 2016).
RNN-based approach. Several studies proposed neural metaphor detection models using recurrent neural networks (RNNs). (Wu et al., 2018) adopts a bidirectional-LSTM (BiLSTM) (Graves and Schmidhuber, 2005) and a convolutional neural network (CNN) using Word2Vec (Mikolov et al., 2013) as text features in addition to part-of-speech (POS) and word clustering information as linguistic features. (Gao et al., 2018) employs BiLSTM as an encoder using GloVe (Pennington et al., 2014) and ELMo (Peters et al., 2018) as text input representation. (Mao et al., 2019) makes use of the metaphor identification theory on top of the architecture of (Gao et al., 2018). Despite their success, the shallow neural networks (e.g., BiLSTM and CNN) have limitations on representing various aspects of words in context.
Contextualization-based approach. Recent studies utilize pre-trained contextualized language models, e.g., BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), for metaphor detection. Because a pre-trained model can encode rich semantic and contextual information, it is useful for detecting metaphors after fine-tuning. DeepMet (Su et al., 2020) utilizes RoBERTa with various linguistic features, i.e., global text context, local text context, and POS features. IlliniMet (Gong et al., 2020) combines RoBERTa with linguistic information obtained from external resources. Other work formulates metaphor detection as a multitask learning problem, and the results of these models are reported in the VUA 2020 shared task.

Semantic Matching over BERT
The key idea of neural semantic matching is that neural models encode a query-document pair into two embedding vectors and compute a relevance score between the query and the document (Mitra and Craswell, 2018). The simple approach is to feed a query-document pair to BERT (Devlin et al., 2019) and compute a relevance score, where the query and the document fully interact (Nogueira et al., 2019; Dai and Callan, 2020). In contrast, SBERT (Reimers and Gurevych, 2019), TwinBERT (Lu et al., 2020), and ColBERT (Khattab and Zaharia, 2020) adopt late interaction architectures using siamese BERT, where the query and the document are encoded independently. Our work is based on the late interaction architecture: the sentence containing the target word and the target word itself are encoded separately to represent the contextualized and isolated meanings of the target word.

MelBERT
In this section, we propose a novel metaphor detection model over a pre-trained contextualized model. To design our model, we consider two metaphor detection tasks. Given a sentence S = {w_1, . . . , w_n} with n words and a target word w_t ∈ S, the classification task predicts the metaphoricity (i.e., metaphorical or literal) of w_t. Given a sentence S, the sequence labeling task predicts the metaphoricity of each word w_t (1 ≤ t ≤ n) in S.
We aim at developing a metaphor detection model for the classification task. Our model returns a binary output, i.e., 1 if the target word w t in S is metaphorical or 0 otherwise. By sequentially changing the target word w t , our model can be generalized to classify the metaphoricity of each word in a sentence, as in sequence labeling.
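The generalization from classification to sequence labeling can be sketched in a few lines of plain Python. Here `classify` is a hypothetical stand-in for the binary classifier (a toy vocabulary lookup, for illustration only); the point is only the control flow of calling it once per target word.

```python
def classify(sentence, target_index):
    """Stand-in for a per-target binary metaphor classifier: returns 1 if the
    target word is predicted metaphorical, 0 otherwise. The toy rule below is
    an assumption for illustration, not the actual model."""
    metaphorical_vocab = {"horizon", "twist"}
    return 1 if sentence[target_index] in metaphorical_vocab else 0

def label_sequence(sentence):
    """Sequence labeling via repeated classification: one call per word,
    sequentially changing which word is treated as the target."""
    return [classify(sentence, t) for t in range(len(sentence))]

labels = label_sequence(["hope", "is", "on", "the", "horizon"])  # [0, 0, 0, 0, 1]
```

This per-word invocation is how a classification model covers the sequence labeling task without architectural changes.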

Motivation
The pre-trained language models, e.g., BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), usually take two sentences as input and return output to predict the relevance between two input sentences. We adopt RoBERTa as the contextualized backbone model because RoBERTa is known to outperform BERT (Liu et al., 2019). To design a metaphor detection model, we treat one input sentence as a single word (or a phrase).
As depicted in Figure 1, there are two paradigms for representing the interaction between two input sentences: all-to-all interaction and late interaction, as discussed in the document ranking problem (Khattab and Zaharia, 2020). While all-to-all interaction takes two input sentences together as an input, late interaction encodes two sentences separately over a siamese architecture. Given a sentence S and a target word w t , all-to-all interaction can capture all possible interactions within and across w t and S, which incurs high computational cost. Moreover, when some interactions across w t and S are useless, it may learn noisy information.
In contrast, because late interaction encodes w t and S independently, it naturally avoids unnecessary intervention across w t and S. The sentence embedding vector also can be easily reused in computing the interaction with the target word. In other words, the cost of encoding the sentence vector can be amortized for that of encoding different target words. Because our goal is to identify whether the contextualized meaning of the target word w t is different from its isolated meaning, we adopt the late interaction paradigm for metaphor detection. Our model encodes a sentence S with a target word and a target word w t into embedding vectors, respectively, and computes the metaphoricity score of the target word. (In Section 4, it is found that our model using late interaction outperforms a baseline model using all-to-all interaction.)
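The amortization argument above can be made concrete with a small sketch: the sentence is encoded once and the cached vectors are reused for every target word, whereas all-to-all interaction would re-encode each (sentence, target) pair. `encode_sentence` is a hypothetical stand-in for the transformer encoder (here a toy per-token feature), not the actual model.

```python
calls = {"sentence_encodings": 0}

def encode_sentence(sentence):
    """Toy stand-in for Enc(S); counts invocations to show amortization."""
    calls["sentence_encodings"] += 1
    return [len(w) for w in sentence]  # toy "embedding": token length

def score_all_targets(sentence):
    cached = encode_sentence(sentence)                 # encoded once
    return [cached[t] for t in range(len(sentence))]   # reused per target

scores = score_all_targets(["don't", "twist", "my", "words"])
```

With all-to-all interaction, the encoder would run once per target word (four times here) instead of once.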

Model Architecture
We propose a novel metaphor detection model, namely, metaphor-aware late interaction over BERT (MelBERT) using metaphor identification theories, i.e., Metaphor Identification Procedure (MIP) (Pragglejaz Group, 2007;Steen et al., 2010) and Selectional Preference Violation (SPV) (Wilks, 1975(Wilks, , 1978. Figure 2 illustrates the overall architecture of MelBERT, which consists of three components: a sentence encoder Enc(S), a target word encoder Enc(w t ), and a late interaction mechanism to compute a score.
We first explain the input layer for the two encoders Enc(S) and Enc(w_t). Each word in the sentence is converted to tokens using an improved implementation of byte-pair encoding (BPE) (Radford et al., 2019). When the sentence is a composite sentence, the local context indicates a clause including the target tokens; for simplicity, we represent the local context using the comma separator (,) in the sentence. Besides, we add a special classification token [CLS] before the first token and a segment separation token [SEP] after the last token. To make use of the POS feature of the target word, we append the POS tag of the target word after [SEP], as used in (Su et al., 2020). The input representation is finally computed by the element-wise addition of token, position, and segment embeddings. For Enc(w_t), the target word is converted to tokens using BPE, but position and segment embeddings are not used. Given a sentence S = {w_1, . . . , w_n}, Enc(S) encodes each word into a set of contextualized embedding vectors {v_S, v_S,1, . . . , v_S,n} using the transformer encoder (Vaswani et al., 2017), where v_S is the embedding vector corresponding to the [CLS] token and v_S,i is the i-th embedding vector for w_i in S. Similarly, Enc(w_t) encodes the target word w_t into v_t without context.
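The input layout described above can be sketched as follows, using whitespace tokens in place of real BPE pieces. The exact token arrangement for the target encoder is a simplified assumption for illustration; the POS tag appended after [SEP] follows (Su et al., 2020).

```python
def build_sentence_input(sentence_tokens, target_pos_tag):
    """Input for Enc(S): [CLS] + sentence tokens + [SEP] + POS tag of the
    target word (appended after [SEP], per Su et al., 2020)."""
    return ["[CLS]"] + sentence_tokens + ["[SEP]", target_pos_tag]

def build_target_input(target_tokens):
    """Input for Enc(w_t): the target word alone, without position or
    segment embeddings (special-token framing here is an assumption)."""
    return ["[CLS]"] + target_tokens + ["[SEP]"]

s_input = build_sentence_input(["hope", "is", "on", "the", "horizon"], "NOUN")
t_input = build_target_input(["horizon"])
```

In the real model each token sequence is then mapped to token, position, and segment embeddings and summed element-wise before entering the transformer encoder.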
While v_S reflects the interaction across all words in S, v_S,t considers the interaction between w_t and the other words in S. Therefore, v_S,t and v_t can be interpreted as different meanings of w_t: v_S,t is the contextualized representation of w_t, and v_t is the isolated representation of w_t.
Then, we utilize the two metaphor identification theories using contextualized embedding vectors.

MelBERT using MIP. The basic idea of MIP is that a metaphorical word is identified by the gap between the contextual and literal meanings of a word. To incorporate MIP into MelBERT, we employ the two embedding vectors v_S,t and v_t, representing a contextualized and an isolated embedding vector for w_t, respectively. Using these vectors, we identify the semantic gap for the target word in context and in isolation.

MelBERT using SPV. The idea of SPV is that a metaphorical word is identified by its semantic difference from the surrounding words. Unlike MIP, we utilize only the sentence encoder. Given a target word w_t in S, our key assumption is that v_S and v_S,t show a semantic gap if w_t is metaphorical. Although both v_S and v_S,t are contextualized, their meanings are different: v_S represents the interaction across all pair-wise words in S, whereas v_S,t represents the interaction between w_t and the other words in S. In this sense, when w_t is metaphorical, v_S,t can differ from v_S by the surrounding words of w_t.

Late interaction over MelBERT. Using the two strategies, MelBERT predicts whether a target word w_t ∈ S is metaphorical or not. We can compute a hidden vector h_MIP by concatenating v_S,t and v_t for MIP:
h_MIP = f(v_S,t ⊕ v_t),

where h_MIP ∈ R^{h×1}, ⊕ denotes vector concatenation, and f(·) is an MLP layer that learns the gap between the two vectors v_S,t and v_t. We can also compute a hidden vector h_SPV using v_S and v_S,t for SPV:

h_SPV = g(v_S ⊕ v_S,t),

where h_SPV ∈ R^{h×1} and g(·) is an MLP layer that learns the semantic difference between v_S and v_S,t. We combine the two hidden vectors h_MIP and h_SPV to compute a prediction score:

ŷ = σ(W^⊤(h_MIP ⊕ h_SPV) + b),

where σ(·) is the sigmoid function, W ∈ R^{2h×1} is a learnable parameter, and b is a bias. Finally, to train MelBERT, we use the cross-entropy loss for binary classification:

L = −(1/N) Σ_{i=1}^{N} [y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i)],

where N is the number of samples in the training set, and y_i and ŷ_i are the true and predicted labels for the i-th sample.
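Putting the pieces together, the late-interaction head can be sketched in PyTorch. This is our own illustration, not the authors' released code: the layer widths, ReLU activation, and single-linear-layer MLPs are assumptions, but the wiring follows the description above (f over v_S,t ⊕ v_t for MIP, g over v_S ⊕ v_S,t for SPV, then a sigmoid over the concatenated hidden vectors).

```python
import torch
import torch.nn as nn

class MelBERTHead(nn.Module):
    """Sketch of the MIP/SPV late-interaction head (illustrative sizes)."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(2 * dim, hidden), nn.ReLU())  # MIP MLP
        self.g = nn.Sequential(nn.Linear(2 * dim, hidden), nn.ReLU())  # SPV MLP
        self.out = nn.Linear(2 * hidden, 1)  # W and b over [h_MIP; h_SPV]

    def forward(self, v_S, v_St, v_t):
        h_mip = self.f(torch.cat([v_St, v_t], dim=-1))   # gap: context vs. isolation
        h_spv = self.g(torch.cat([v_S, v_St], dim=-1))   # gap: sentence vs. target
        return torch.sigmoid(self.out(torch.cat([h_mip, h_spv], dim=-1)))

head = MelBERTHead(dim=768, hidden=768)
v_S, v_St, v_t = (torch.randn(1, 768) for _ in range(3))
score = head(v_S, v_St, v_t)  # prediction score in (0, 1)
```

Training would then minimize `nn.BCELoss()` between `score` and the binary metaphoricity label, matching the cross-entropy objective above.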

Evaluation
In this section, we first present the experimental setup and then report empirical results comparing our model against strong baselines.

Datasets. We use the VUA datasets for training and testing, and MOH-X and TroFi (Birke and Sarkar, 2006) for testing purposes only. MOH-X is a verb metaphor detection dataset with sentences from WordNet, and TroFi is also a verb metaphor detection dataset, with sentences from the 1987-89 Wall Street Journal Corpus Release 1. These datasets are considerably smaller than the VUA datasets, and more than 40% of their words are metaphorical, whereas the VUA-18 and VUA-20 datasets contain about 10% metaphorical words. Moreover, while MOH-X and TroFi annotate only verbs as metaphorical, the VUA datasets annotate words of all POS tags. In this sense, we believe that the VUA datasets are more appropriate for training and testing models. Table 1 summarizes detailed statistics of the benchmark datasets.

Baselines. We compare our model with several strong baselines, including RNN-based and contextualization-based models.
• RNN_ELMo and RNN_BERT (Gao et al., 2018): They employ the concatenation of the pre-trained ELMo/BERT and the GloVe (Pennington et al., 2014) embedding vectors as an input, and use BiLSTM as a backbone model. Note that they use contextualized models only for input vector representation.

• RNN_HG and RNN_MHCA (Mao et al., 2019): They incorporate MIP and SPV into RNN_ELMo (Gao et al., 2018). RNN_HG compares an input embedding vector (literal) with its hidden state (contextual) through BiLSTM. RNN_MHCA utilizes multi-head attention to capture the contextual feature within the window size.
• RoBERTa_BASE: It is a simple adoption of RoBERTa for metaphor detection. It takes a target word and a sentence as two input sentences and computes a prediction score. It can be viewed as a metaphor detection model over an all-to-all interaction architecture.
• RoBERTa_SEQ (Leong et al., 2020): It takes a single sentence as input, marks the target word via the input embedding, and predicts the metaphoricity of the target word using the embedding vector of the target word. This architecture is used as the BERT-based baseline in the VUA 2020 shared task.
• DeepMet (Su et al., 2020): It is the winning model in the VUA 2020 shared task. It also utilizes RoBERTa as a backbone model and incorporates it with various linguistic features, such as global context, local context, POS tags, and fine-grained POS tags.

Evaluation protocol.
Because the ratio of metaphorical words is relatively small, we adopt three metrics, i.e., precision, recall, and F1-score, denoted by Prec, Rec, and F1. The MOH-X and TroFi datasets are much smaller than the VUA datasets; thus, we use them only as test datasets: metaphor detection models are trained only on the VUA datasets, and zero-shot transfer is conducted to evaluate the generalization of the models.

Implementation details. For the four baselines, we used the same hyperparameter settings as in (Gao et al., 2018; Mao et al., 2019; Su et al., 2020). For DeepMet, we evaluated it with and without the bagging technique. While DeepMet (Su et al., 2020) exploits two optimization techniques, bagging and ensembling, we used only the bagging technique for both MelBERT and DeepMet, because we want to evaluate the effectiveness of the model designs themselves; the performance difference for DeepMet between the original paper and ours thus comes from the ensemble method. For contextualized models, we used pre-trained RoBERTa with 12 layers, 12 attention heads per layer, and a 768-dimensional hidden state. For contextualized baselines, we set the same hyperparameters as MelBERT, tuned on the VUA-18 dev set based on F1-score. The batch size and maximum sequence length were set to 32 and 150, respectively. For training, we used three epochs with the Adam optimizer.

Table 2: Performance comparison of MelBERT with baselines on VUA-18 and VUA-Verb (best in bold, second best in italic underlined). -CV denotes the bagging technique for the base model (best in bold-italic). * denotes p < 0.05 for a two-tailed t-test against the best competing model.

Table 3: Performance comparison of MelBERT with baselines on VUA-20 (best in bold, second best in italic underlined). -CV denotes the bagging technique for the base model (best in bold-italic). * denotes p < 0.05 for a two-tailed t-test against the best competing model.
We increased the learning rate from 0 to 3e-5 during the first two epochs and then linearly decreased it during the last epoch. We set the dropout ratio to 0.2. All experimental results were averaged over five runs with different random seeds. We conducted all experiments on a desktop with two NVIDIA TITAN RTX GPUs, 256 GB memory, and two Intel Xeon E5-2695 v4 processors (2.10 GHz, 45M cache). We implemented our model using PyTorch, and all source code is available at our website.

Table 4: Model performance on different genres in VUA-18 (best in bold, second best in italic underlined).
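The three evaluation metrics above, computed over the (minority) metaphorical class, can be sketched as:

```python
def precision_recall_f1(y_true, y_pred):
    """Prec, Rec, and F1 for the positive (metaphorical, label 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

# toy labels: 2 of 5 words metaphorical, model finds one of them
prec, rec, f1 = precision_recall_f1([1, 0, 0, 1, 0], [1, 1, 0, 0, 0])
```

Because metaphorical words are only about 10% of the VUA datasets, F1 over the positive class is far more informative here than plain accuracy.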

Empirical Results
Overall results. Tables 2 and 3 report the comparison of MelBERT against baselines using RNNs and contextualized models on VUA-18, VUA-20, and VUA-Verb. MelBERT is consistently better than the strong baselines in terms of F1-score. MelBERT (F1 = 78.5, 75.7, and 72.3) outperforms DeepMet (Su et al., 2020) with 2.8%, 1.0%, and 1.9% performance gains on the three datasets. MelBERT also outperforms the contextualized baseline models (i.e., RoBERTa_BASE and RoBERTa_SEQ) by up to 1.2-1.5% on the three datasets, indicating that MelBERT effectively utilizes the metaphor identification theories.

When combined with the bagging technique, both MelBERT-CV and DeepMet-CV show better performance than their original models by aggregating multiple models trained with a 10-fold cross-validation process, as used in (Su et al., 2020). MelBERT-CV still shows better performance for all metrics than DeepMet-CV on VUA-18 and VUA-20. Also, MelBERT-CV (Recall = 73.7) significantly improves over the original MelBERT (Recall = 68.6) in terms of recall, implying that MelBERT-CV can capture various metaphorical expressions by combining multiple models.

Besides, contextualization-based models show better performance than RNN-based models on VUA-18 and VUA-Verb. While RNN-based models show a 71-74% F1-score, contextualization-based models show a 76-78% F1-score on VUA-18, revealing that RNN-based models are limited in capturing various aspects of words in context. Compared to RNN_ELMo and RNN_BERT, this also indicates that utilizing contextualization-based models as backbone models is more effective than simply using them as extra input embedding vectors, as in (Gao et al., 2018; Mao et al., 2019).

VUA-18 breakdown analysis. Table 4 reports the comparison results for four genres in the VUA-18 dataset. MelBERT performs better than or comparably to all competing models across the breakdown datasets.
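The -CV aggregation step can be sketched as follows: prediction scores from k models trained on k-fold splits are averaged, and the mean score is thresholded. The per-model scores and the 0.5 threshold below are illustrative assumptions, not values from the paper.

```python
def bagged_predict(per_model_scores, threshold=0.5):
    """Average prediction scores across k fold-trained models, then
    threshold the mean score to get a binary label per example."""
    k = len(per_model_scores)
    n = len(per_model_scores[0])
    means = [sum(model[i] for model in per_model_scores) / k for i in range(n)]
    return [1 if s >= threshold else 0 for s in means]

# three fold-trained models scoring two target words (made-up scores)
preds = bagged_predict([[0.9, 0.2], [0.7, 0.4], [0.8, 0.3]])
```

Averaging over models trained on different folds smooths individual-model errors, which is consistent with the recall improvement observed for MelBERT-CV.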
Compared to RNN-based models, MelBERT achieves substantial improvements, as high as 4.9% (Academic), 4.4% (Conversation), 10.2% (Fiction), and 2.8% (News) in terms of F1-score. All models show the lowest accuracy on Conversation and Fiction, which contain more complicated or rare expressions than the other genres. For example, Conversation contains colloquial expressions or fragmented sentences such as "ah", "cos", and "yeah", and Fiction often contains names of fictional characters such as "Tepilit" and "Laibon", which do not appear in other genres. Nonetheless, MelBERT shows comparable or the best performance in all genres; for Academic and Fiction, MelBERT outperforms all models in terms of F1-score.

Table 5 reports the comparison results for four POS tags in the VUA-18 dataset. For all POS tags, MelBERT consistently shows the best performance in terms of F1-score. Compared to RNN-based models, MelBERT achieves as much as 5.9% (Verb), 3.4% (Adjective), 14.5% (Adverb), and 10.3% (Noun) gains in terms of F1-score. For all POS tags, MelBERT also outperforms DeepMet. This indicates that MelBERT, using the metaphor identification theories, achieves consistent improvements regardless of the POS tags of target words.
Zero-shot transfer on MOH-X and TroFi. We evaluate zero-shot transfer across different datasets, where the models are trained with the VUA-20 training dataset, and MOH-X and TroFi are used as test datasets. Although this is a challenging task, it is useful for evaluating the generalization power of trained models. Table 6 reports the comparison of MelBERT against other contextualization-based models. On the MOH-X dataset, MelBERT (F1 = 79.2) shows the best performance in terms of F1-score, with 0.6-1.6% performance gains, indicating that MelBERT generalizes effectively. On the TroFi dataset, the overall performance of all models is much lower than on MOH-X, because the average sentence length in TroFi is much longer and its sentences are more complicated than those in MOH-X. Also, note that we trained DeepMet with the VUA-20 training dataset to evaluate zero-shot transfer, while (Su et al., 2020) reported results for DeepMet trained and tested on the MOH-X and TroFi datasets. While the performance gap between models is small in terms of precision, MelBERT is better than DeepMet in terms of recall, meaning that MelBERT can capture more complicated metaphorical expressions than DeepMet.
Ablation study of MelBERT. Table 7 compares the effectiveness of the metaphor identification theories. MelBERT using both strategies consistently shows the best performance. Also, MelBERT without SPV shows better performance than MelBERT without MIP, indicating that MelBERT using late interaction is more effective for capturing the difference between the contextualized and isolated meanings of target words. Nonetheless, MelBERT achieves the best performance by synergizing both metaphor identification strategies.
Error analysis. Table 8 reports qualitative evaluation results of MelBERT. Based on the original annotation guideline, we analyze several failure cases of MelBERT. For MelBERT without MIP, it is difficult to detect common words with multiple meanings, e.g., go and feel. Also, when a sentence includes multiple metaphorical words, it mostly fails to detect the metaphorical words; in this case, the surrounding words of a target word are not a cue for detecting metaphors using SPV. Meanwhile, MelBERT without SPV fails when target words are metaphorical via personification; that is, using MIP only, the target word can be interpreted close to its literal meaning. As the most difficult case, MelBERT often fails to identify metaphorical words in borderline or implicit metaphors, e.g., the poetic Way of the World.

(Example sentences from Table 8: "That's an old trick." / "Oh you rotten old pig, you've been sick." / "Are the twins trash?" / "I know, what is going on!" / "So who's covering tomorrow?" / "Do you feel better now?" / "The day thrift turned into a nightmare." / "Way of the World: Farming notes" / "So many places Barry are going down" / "Sensitivity, though, is not enough.")

Conclusion
In this work, we proposed a novel metaphor detection model, namely metaphor-aware late interaction over BERT (MelBERT), marrying pre-trained contextualized models with metaphor identification theories. To the best of our knowledge, this is the first work that takes full advantage of both contextualized models and metaphor identification theories. Comprehensive experimental results demonstrated that MelBERT achieves state-of-the-art performance on several benchmark datasets.