End-to-End Sequential Metaphor Identification Inspired by Linguistic Theories

End-to-end training with Deep Neural Networks (DNN) is a currently popular method for metaphor identification. However, standard sequence tagging models do not explicitly take advantage of linguistic theories of metaphor identification. We experiment with two DNN models which are inspired by two human metaphor identification procedures. By testing on three public datasets, we find that our models achieve state-of-the-art performance in end-to-end metaphor identification.


Introduction
Metaphoric expressions are common in everyday language, attracting attention from both linguists and psycho-linguists (Wilks, 1975; Glucksberg, 2003; Group, 2007; Holyoak and Stamenković, 2018). Computationally, metaphor identification is the task of detecting metaphors in text. Traditional approaches, such as phrase-level metaphor identification, detect metaphors in word pairs (Tsvetkov et al., 2014; Rei et al., 2017), where the target word whose metaphoricity is to be identified is given in advance. However, such target words are not highlighted in real-world text. A newer approach is sequential metaphor identification, where the metaphoricity of a target word is identified without knowing its position in a sentence. It is therefore more readily applied to support Natural Language Processing tasks.
The most recent approaches (Wu et al., 2018; Gao et al., 2018) treat this as a sequence tagging task: the classified labels are conditioned only on the BiLSTM (Graves and Schmidhuber, 2005) hidden states of target words. This approach is not tailor-made for metaphors; it is the same procedure as that used in other sequence tagging tasks, such as Part-of-Speech (PoS) tagging (Plank et al., 2016) and Named Entity Recognition (NER) (Lample et al., 2016). However, linguistic theories of metaphor identification are available and have not yet been exploited with Deep Neural Network (DNN) models. We hypothesise that by exploiting linguistic theories of metaphor identification in the design of a DNN architecture, the model performance can be further improved.
Linguistic theories suggest that a metaphor is identified by noticing a semantic contrast between a target word and its context. This is the basis of Selectional Preference Violation (SPV) (Wilks, 1975, 1978). E.g., in the sentence my car drinks gasoline (Wilks, 1978), 'drinks' is identified as metaphoric, because 'drinks' is unusual in the context of 'car' and 'gasoline'; a car cannot drink, nor is gasoline drinkable. Formally, a label is predicted, conditioned on a target word and its context. An alternative approach by Group (2007) and Steen et al. (2010) is the Metaphor Identification Procedure (MIP): a metaphor is identified if the literal meaning of a word contrasts with the meaning that word takes in this context. E.g., in my car drinks gasoline, the contextual meaning of 'drink' is 'consuming too much', which contrasts with its literal meaning of 'taking a liquid into the mouth'. Formally, a label is predicted, conditioned on literal and contextual meanings. Fundamentally, the two models are similar, as both MIP and SPV analyse the relations between metaphors and their contexts, but with different procedures.
We propose two end-to-end metaphor identification models, detecting metaphors based on MIP and SPV, respectively. The experimental results show that both of our models perform better than the state-of-the-art baseline (Gao et al., 2018) across three benchmark datasets. In particular, our MIP based model, with a simple DNN architecture, outperforms the baseline by an average of 2.2% in F1 score, whereas the SPV based model, with a novel multi-head contextual attention mechanism, achieves an even higher average gain of 2.5% over the baseline.
The contributions of our work can be summarized as follows: (1) To the best of our knowledge, we are the first to explore using linguistic theories (MIP and SPV) to directly inform the design of Deep Neural Networks (DNN) for end-to-end sequential metaphor identification; (2) Our first DNN model is based on MIP, which encapsulates the idea that a metaphor is classified by the contrast between its contextual and literal meanings. The second model is inspired by SPV, for which we propose a novel window-based contextual attentive method, allowing the model to attend to important fragments of BiLSTM hidden states and hence better capture the context of a text; (3) We conducted extensive experiments on three public datasets for end-to-end metaphor identification, where both of our models outperform the state-of-the-art DNN models.

Related Work
Metaphor identification is a linguistic metaphor processing task that identifies metaphors in textual data, which is different from conceptual metaphor processing that maps concepts between source and target domains (Shutova, 2016), based on Conceptual Metaphor Theory (Lakoff and Johnson, 1980). In linguistic metaphor processing, a metaphor is identified when the contextual meaning of a word contrasts with its literal meaning (summarised as MIP by Group (2007) and Steen et al. (2010)). Many metaphor dataset annotations were guided by MIP, e.g., the VU Amsterdam Metaphor Corpus (Steen et al., 2010), and a verbal metaphor dataset formed by Mohammad et al. (2016). Another hypothesis for linguistic metaphor identification, SPV, was proposed by Wilks (1975, 1978), who argued that a metaphoric word could violate the selectional preferences of an agent. E.g., 'drinks' violates the selectional preferences of the agent 'car' in the sentence my car drinks gasoline. Ortony (1979) further claimed that metaphoric words, phrases and sentences are contextually anomalous.
Recently, metaphor identification has been treated as a sequence tagging task. Wu et al. (2018) proposed a model based on word2vec (Mikolov et al., 2013), PoS tags and word clusters, which were encoded by a Convolutional Neural Network (CNN) and BiLSTM. The encoded information was directly fed into a softmax classifier. This model performed best on the NAACL-2018 Metaphor Shared Task (Leong et al., 2018) with an ensemble learning strategy. Gao et al. (2018) proposed a model that concatenated GloVe (Pennington et al., 2014) and ELMo (Peters et al., 2018) representations which were then encoded by BiLSTM. Hidden states of the BiLSTM were classified by a softmax classifier. These sequential metaphor identification models classify labels, conditioned on encoder hidden states. However, we expect that explicit modelling of interactions between either contextual and literal meanings (MIP) or target words and their contexts (SPV) may further boost performance.

Methodology
Here we detail our two models, inspired by MIP and SPV respectively, and systematically compare the differences between them.

MIP based model
Our first model (Figure 1) is built upon MIP: a metaphor is classified by the contrast between a word's contextual and literal meanings. To help the classifier make this comparison, we concatenate the contextual meaning representation with the literal meaning representation.

[Figure 1: RNN HG (Recurrent Neural Network Hidden-GloVe).]

Humans infer the contextual meaning of a word conditioned on its context. We use BiLSTM hidden states as our contextual meaning representations, where the hidden state of a word is encoded by its forward and backward contexts and itself (Graves and Schmidhuber, 2005). Pre-trained GloVe (Pennington et al., 2014) is considered as our literal meaning representation, as words have been embedded with their most common senses (trained on Web crawled data; see footnote 3). The most common senses are likely literal, as literals occur more often than metaphors in typical corpora (Cameron, 2003; Martin, 2006; Steen et al., 2010; Shutova, 2016). The comparison of literal and contextual meanings can be seen at the top of Figure 1 (comparison stage): the GloVe embedding (literal) from below joins the hidden state from the BiLSTM (contextual). The probability of a label prediction ($\hat{y}_t$) for a target word at position $t$ is conditioned on the contextual and literal meaning representations of the target word:

$$p(\hat{y}_t \mid h_t, g_t) = \sigma\left(w^{\top}[h_t; g_t] + b\right) \qquad (1)$$

where $\sigma$ is the softmax function, $h_t$ is a BiLSTM hidden state, $g_t$ is a GloVe embedding, $w$ denotes the trained parameters, $b$ is a bias, and $[\,;\,]$ denotes concatenating tensors along the last dimension. Similar to Gao et al. (2018), we use GloVe and ELMo (Embeddings from Language Models) as input features for the BiLSTM. The recommended way of using ELMo is to concatenate ELMo ($e$) with GloVe ($g$), e.g., $[g_t; e_t]$ (Peters et al., 2018). Thus, the BiLSTM hidden state $h_t$ is

$$h_t = f_{\mathrm{BiLSTM}}\left([g_t; e_t], \overrightarrow{h}_{t-1}, \overleftarrow{h}_{t+1}\right) \qquad (2)$$

Footnote 3: Note that our results are likely to improve if the pre-trained GloVe is trained on a cleaner set of purely literal data.
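The comparison stage of RNN HG described above can be illustrated with a minimal sketch in plain Python. It assumes that the BiLSTM hidden state and the GloVe vector are already available as lists of floats; the function names, the toy dimensions and the toy weights are hypothetical and are not taken from the authors' implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_hg_predict(h_t, g_t, W, b):
    """Label distribution computed from the concatenation [h_t; g_t].

    h_t: contextual representation (BiLSTM hidden state), list of floats
    g_t: literal representation (GloVe embedding), list of floats
    W:   one weight row per label, each of length len(h_t) + len(g_t)
    b:   one bias per label
    """
    x = h_t + g_t  # list concatenation plays the role of [h_t; g_t]
    logits = [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b_k
              for row, b_k in zip(W, b)]
    return softmax(logits)

# Toy sizes for readability (hypothetical); the setup in the paper implies
# a 2 x 256 dimensional h_t and a 300 dimensional g_t.
h_t = [0.2, -0.1, 0.4]
g_t = [0.5, 0.3]
W = [[0.1, 0.0, -0.2, 0.3, 0.1],   # row for the "literal" label
     [-0.1, 0.2, 0.4, -0.3, 0.0]]  # row for the "metaphor" label
b = [0.0, 0.1]
probs = rnn_hg_predict(h_t, g_t, W, b)
print(probs)
```

Because the two representations are concatenated rather than summed, the classifier weights can treat the literal and contextual parts separately, which is what lets it learn a contrast between them.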

SPV based model
The intuition behind SPV is that metaphoricity is identified by detecting the incongruity between a target word and its context.

[Figure 2: RNN MHCA (Recurrent Neural Network Multi-Head Contextual Attention).]

Our second model (Figure 2) compares a target word representation $h_t$ with its context $c_t$. This is achieved by concatenating these two representations (see top of Figure 2). The target word representation $h_t$ is a BiLSTM hidden state. The context is composed of left-side ($\overrightarrow{c}^{\,n}_t$) and right-side ($\overleftarrow{c}^{\,n}_t$) attentive context representations, where $n$ is the window size in context words. We adopt a multi-head contextual attention (MHCA) mechanism to compute $c^n_t$. The BiLSTM hidden state matrix ($H$, where $h \in H$) is split into $N$ equal pieces, where $N$ is the number of heads. Irrelevant context hidden states, $h_j \notin [h_{t \pm 1}, h_{t \pm n}]$, are masked out. We apply a window of $n$ context words, as $h_j$ only encodes words that are outside the window. In computing a context representation, $h_j$ may bring in noise, and it may miss important context information provided by the close context words, while the distant context information can already be memorized by the BiLSTM.

Notably, MHCA is similar to dot-product attention (Luong et al., 2015) if $N = 1$. Using $N > 1$ heads allows the model to attend to different parts of the hidden states of context words and to recall important earlier context information that has been forgotten at the current point. Unlike multi-head self-attention (Vaswani et al., 2017), which encodes a target word by its context, MHCA computes the context representation by attending to a target word. The query of MHCA is the hidden state of a target word, while the key and value are the hidden states of its context. We do not employ trainable parameters, non-linear operations or positional encoding in MHCA, because performance is better (compared with MHA in Figure 4) when we model the context (via attention) and the target word (via BiLSTM) in the same space (see § 3.3). Besides, extra positional encoding is unnecessary in our model, as the BiLSTM has already encoded the input sentences in temporal order. The probability of a label prediction given by RNN MHCA is

$$p(\hat{y}_t \mid h_t, c^n_t) = \sigma\left(w^{\top}[h_t; c^n_t] + b\right) \qquad (9)$$

where a label prediction is conditioned on the hidden state of a target word ($h_t$) and its attentive context representation ($c^n_t$). The input feature of word $t$ is also $[g_t; e_t]$, so $h_t$ is given by Equation 2.
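A rough sketch of a windowed, parameter-free multi-head contextual attention is given below in plain Python. It follows only what the prose states: hidden states are split into N pieces, the query is the target hidden state, keys and values are the context hidden states inside the window, and left and right contexts are handled separately. Everything else is an assumption made for illustration, including the unscaled dot product, the zero vector used when one side has no context words, the order in which heads are concatenated, and all function names.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def split_heads(h, n_heads):
    """Split one hidden state into n_heads equal-sized pieces."""
    size = len(h) // n_heads
    return [h[i * size:(i + 1) * size] for i in range(n_heads)]

def attend(query, context):
    """Dot-product attention for one head, with no trainable parameters.

    query:   one piece of the target hidden state
    context: matching pieces of the context hidden states (keys == values)
    """
    if not context:  # assumption: no words on this side gives a zero vector
        return [0.0] * len(query)
    scores = [sum(q * k for q, k in zip(query, key)) for key in context]
    weights = softmax(scores)
    return [sum(w * v[d] for w, v in zip(weights, context))
            for d in range(len(query))]

def mhca(H, t, window, n_heads):
    """Context representation [left; right] for the word at position t."""
    left = H[max(0, t - window):t]       # h_{t-n} .. h_{t-1}
    right = H[t + 1:t + 1 + window]      # h_{t+1} .. h_{t+n}
    q_heads = split_heads(H[t], n_heads)
    out = []
    for side in (left, right):
        side_heads = [split_heads(h, n_heads) for h in side]
        for i, q in enumerate(q_heads):
            out.extend(attend(q, [sh[i] for sh in side_heads]))
    return out

# Three toy hidden states of size 4 (hypothetical values).
H = [[1.0, 0.0, 0.5, 0.5],
     [0.0, 1.0, 0.2, 0.8],
     [0.3, 0.3, 0.9, 0.1]]
c = mhca(H, t=1, window=3, n_heads=2)
print(len(c), c)
```

Since no projection or non-linearity is applied, each context vector is a convex combination of pieces of BiLSTM hidden states, so it stays in the same space as the target representation it is concatenated with.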
[Figure 3: A comparison between RNN HG and RNN MHCA. C is car. D is drinks. G is gasoline. A is animal. W is water. emb is GloVe embedding. enc is BiLSTM encoding. att is an attention mechanism. In embedding space, the lighter part of a node is the ELMo embedding, while the darker part is the GloVe embedding.]

Figure 3 gives an overview of the two models and how they process the example of 'drinks' in the sequence car drinks gasoline. We use different coloured nodes to indicate that words are distant from each other in vector space. E.g., red 'drinks' (D) is distant from blue 'car' (C) and 'gasoline' (G), because they are from non-literally related domains (Mao et al., 2018). Note that there is no external knowledge base for domain knowledge. 'Drinks' (D) is distant because of the statistics of the corpus; it occurs in contexts relating to humans and other animals consuming liquids such as water.

Model comparison
Our MIP based RNN HG model is on the left. In the leftmost part of the figure, we have the literal embedding of 'drinks' (D), which is embedded by words in the domains of 'animal' (A) and 'water' (W). To the right of this, the green 'drinks' ($\overleftrightarrow{D}$) captures the meaning of 'drinks' in context via BiLSTM encoding; it is encoded by 'car' (C), 'gasoline' (G) and itself (D). These two different vectors for 'drinks' are concatenated. Classifier 1 (RNN HG) learns to recognise if the two vectors represent similar meanings (indicating literal) or different meanings (indicating metaphor), which is $p(\hat{y}_t \mid h_t, g_t)$ in Equation 1. In the case illustrated, the meaning of 'drinks' (green $\overleftrightarrow{D}$) from the encoding is very different from its word embedding meaning (red D).
The right part of Figure 3 is our SPV based RNN MHCA model. Blue 'car' (C) and 'gasoline' (G) are encoded by themselves from left to right and right to left, respectively. Purple 'car' ($\overrightarrow{C}$) and 'gasoline' ($\overleftarrow{G}$) are still closer to each other than to green 'drinks' ($\overleftrightarrow{D}$) in encoding space, because the green 'drinks' ($\overleftrightarrow{D}$) has a component of literal meaning from red 'drinks' (D). Our attention mechanism does not employ non-linear transformations. Thus, the attentive context ($[\overrightarrow{C}; \overleftarrow{G}]$) does not significantly change its colour from the context word encoding ($\overrightarrow{C}$ and $\overleftarrow{G}$). Classifier 2 (RNN MHCA) learns to recognise the contrast between green 'drinks' ($\overleftrightarrow{D}$) and its purple context ($[\overrightarrow{C}; \overleftarrow{G}]$). In RNN MHCA, we use the BiLSTM green 'drinks' ($\overleftrightarrow{D}$) as the target word representation, rather than the word embedding red 'drinks' (D). This is necessary because it will be concatenated with the purple attentive context representation in encoding space; we found that performance is better when both meanings are in the encoding space. On the other hand, RNN HG does concatenate vectors from two different spaces; this works because they are representations of the same word, rather than word versus context.
In Figure 3, it appears that both models use the same BiLSTM-encoded green 'drinks' ($\overleftrightarrow{D}$); however, the two models have different objective functions (Equations 1 and 9) [...]

MOH-X [...] sample sentences are from WordNet (Fellbaum, 1998). Only a single target verb in each sentence is annotated. The average sentence length is the shortest of our three datasets.

TroFi (Birke and Sarkar, 2006) The dataset consists of sentences from the 1987-89 Wall Street Journal Corpus Release 1 (Charniak et al., 2000). The average sentence length is the longest of our datasets. Each sentence has a single annotated target verb.

Baselines
CNN+RNN ensmb (Wu et al., 2018) This is the best model at the NAACL-2018 Metaphor Shared Task, which encodes three concatenated input features (word2vec, PoS tags, and word2vec clusters) with CNN and BiLSTM. The label prediction is conditioned on BiLSTM hidden states, $p(\hat{y}_t \mid h_t)$, with a weighted softmax classifier. The performance is further boosted by ensemble learning.

RNN ELMo (Gao et al., 2018) This is a model that uses GloVe and ELMo as features for sequential metaphor identification. GloVe and ELMo are concatenated and encoded by BiLSTM, then classified by a softmax classifier, which is also conditioned on BiLSTM hidden states, $p(\hat{y}_t \mid h_t)$. RNN ELMo is the strongest baseline to our knowledge.

RNN BERT (Devlin et al., 2018) We introduce feature-based BERT (cased, large) as a baseline, as it has shown strong performance on the NER task, which is also a sequence tagging task. We use the same framework as RNN ELMo. The inputs are the concatenation of the hidden states of the last four BERT layers, as recommended by Devlin et al. (2018). Hyperparameters are fine-tuned on each dataset.

Setup
The inputs are 300 dimension pre-trained GloVe embeddings (see footnote 7), concatenated with 1024 dimension pre-trained ELMo (Peters et al., 2018). We adopt a batch size of 2, a BiLSTM with 2 × 256 dimension hidden states, an SGD optimiser, and a weighted cross entropy loss

$$\mathcal{L} = -\sum_{i} w_{y_i}\, y_i \log(\hat{y}_i)$$

where $y_i$ is the ground truth label for a word at position $i$ and $\hat{y}_i$ is its prediction. The weight $w_{y_i} = 1$ if $y_i$ is literal, otherwise $w_{y_i} = 2$, which is in line with Wu et al. (2018). In RNN MHCA, the window size ($n$) is 3 on VUA and MOH-X, while $n$ is 5 on TroFi. The number of attention heads ($N$) is 16, which is in line with Vaswani et al. (2017). Training, development and testing sets of VUA ALL POS are built in line with the NAACL-2018 Metaphor Shared Task (see Table 1). Since the examined models predict labels for all words in a sentence, the outputs already cover the target verbs in VUA VERB. So, we simply evaluate on the verb track without training models separately. As annotations of the MOH-X and TroFi datasets only cover target verbs, we consider the remaining words as literal for training, but only evaluate on the target words. We adopt 10-fold cross validation on the MOH-X and TroFi datasets, since these two datasets are small. Our hyperparameters are tuned on each dataset.
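The class-weighted cross entropy described above can be sketched in a few lines of plain Python. The weights 1 (literal) and 2 (metaphor) come from the setup; summing over the words of one sequence rather than averaging, the label ids and the function name are assumptions made for this illustration.

```python
import math

def weighted_cross_entropy(gold, probs, weights):
    """Sum of -w[y_i] * log p_i(y_i) over the words of one sequence.

    gold:    gold label ids, here 0 = literal and 1 = metaphor (assumed ids)
    probs:   one predicted label distribution per word
    weights: mapping from label id to its loss weight
    """
    return -sum(weights[y] * math.log(p[y]) for y, p in zip(gold, probs))

# Toy sequence of three words with hypothetical predictions.
gold = [0, 1, 0]
probs = [[0.9, 0.1], [0.4, 0.6], [0.8, 0.2]]
weights = {0: 1.0, 1: 2.0}  # metaphors count twice as much as literals
loss = weighted_cross_entropy(gold, probs, weights)
print(loss)
```

Doubling the weight of the metaphor class makes an error on a metaphoric word cost twice as much as the same error on a literal word, which counteracts the dominance of literal labels in the training data.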

Results
F1 score is the main measurement of model performance. Metaphors are positive labels. The accuracy is measured by the number of correct target token predictions over the total number of target tokens. For the VUA ALL POS dataset, we consider all tokens as the target tokens. For VUA VERB, MOH-X and TroFi, we consider target verbs as target tokens.

Footnote 7: http://nlp.stanford.edu/data/glove.840B.300d.zip
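The evaluation protocol above, with metaphor as the positive label and scores computed over target tokens only, can be sketched as follows. The label ids, the `is_target` mask and the function name are hypothetical conventions chosen for this illustration.

```python
def evaluate(gold, pred, is_target):
    """Precision, recall, F1 (metaphor = 1 is positive) and accuracy,
    computed only over tokens flagged as targets."""
    pairs = [(g, p) for g, p, t in zip(gold, pred, is_target) if t]
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)
    correct = sum(1 for g, p in pairs if g == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = correct / len(pairs) if pairs else 0.0
    return precision, recall, f1, accuracy

# Toy example: the last token is not a target and is ignored.
gold      = [0, 1, 0, 1, 1, 0]
pred      = [0, 1, 1, 0, 1, 0]
is_target = [1, 1, 1, 1, 1, 0]
precision, recall, f1, accuracy = evaluate(gold, pred, is_target)
print(precision, recall, f1, accuracy)
```

For VUA ALL POS the mask would mark every token, whereas for VUA VERB, MOH-X and TroFi it would mark only the annotated target verbs.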
As shown in Table 2, our two proposed models are consistently the top two for F1 on the four evaluation tasks, where the improvements against the third best model (F1 with an underline) are statistically significant (two-tailed t-test, p < 0.01). RNN MHCA achieves state-of-the-art performance in VUA ALL POS (F1=74.3%), MOH-X (F1=80.0%) and TroFi (F1=72.4%). RNN HG performs slightly worse than RNN MHCA. However, it exceeds RNN MHCA by 0.3% on the VUA VERB track (F1=70.8%).
Compared with RNN ELMo, the biggest improvements of RNN HG and RNN MHCA appear on the MOH-X dataset, gaining 4.2% and 4.4%, respectively. Our models also outperform RNN BERT by at least 1.6% on MOH-X. In contrast with VUA ALL POS, which has an average of 3.4 metaphors per metaphoric sentence (see Table 1), each metaphoric sentence in MOH-X contains a single metaphor. We observed that in MOH-X most non-target words are literal, so that a metaphor can be better identified by RNN MHCA via modelling the contrast between the metaphor and its context in a single-metaphor sentence. Furthermore, the average length of MOH-X sentences is the shortest, so the context of a target word is cleaner. MOH-X source sentences are from WordNet sample sentences, where the language is straightforward because the writer designed it to illustrate the meaning of a word, e.g., Don't abuse the system. Similarly, the straightforward contexts also help RNN HG to infer contextual meanings of words. The anomalies that MIP and SPV are designed to detect are very clear in MOH-X, so that our models improve the most against RNN ELMo. VUA, in contrast, is more complex (see examples in VUA Breakdown Analysis and Error Analysis below).
(1.1% and 1.3%). We have observed that many of the non-target words in TroFi are metaphoric (but not labeled), as the sample sentences are from financial news, where word play is common (e.g., VUA news contains the largest percentage of metaphors in Table 4). Our system considers TroFi non-target words as literal without knowing their ground truth labels during training. Additionally, the average length of sequences of TroFi is the longest among the datasets, at 28.3 tokens.
Although RNN MHCA slightly outperforms RNN HG, the difference is small. This is because modelling the contrast between contextual and literal meanings of metaphors in MIP is theoretically similar to modelling in SPV (see §1).
Variations of RNN HG An alternative way of encapsulating contextual and literal meanings in RNN HG is taking the sum of $h_t$ and $g_t$ ($h_t + g_t$) instead of their concatenation ($[h_t; g_t]$) in Equation 1. This idea is inspired by the residual connection (He et al., 2016). In this approach, we take 2 × 150 dimension BiLSTM hidden states so that $h_t$ and $g_t$ are aligned in dimensionality. However, such an approach yields 73.7%, 70.0%, 78.9% and 71.8% F1 scores on the VUA ALL POS, VUA VERB, MOH-X and TroFi datasets, respectively, which is worse than the concatenation approach (RNN HG) in Table 2. This is because the concatenation approach highlights the contrast between GloVe and BiLSTM hidden states of metaphors.

Variations of RNN MHCA We examined the impact of different window sizes and attention mechanisms of RNN MHCA. All these baselines are fine-tuned on each dataset. Given a window size of 1, bi-directional hidden states of a target [...]. The context2vec model (Melamud et al., 2016) used $\overrightarrow{h}_{t-1}$ and $\overleftarrow{h}_{t+1}$ as its context representations, with Multilayer Perceptron tuning.
As shown in Figure 4, a window size of 3 surpasses other sizes on 3 out of 4 datasets. The attentive context representation with a window size larger than 1 can better represent a context than the hidden states of adjacent words (window = 1). The average length of TroFi sequences is the longest, so that a larger window size, e.g., window = 5, performs better. Given a window size of 3, MHCA outperforms multi-head attention (Vaswani et al., 2017), which employs trainable parameters and non-linear operations. This shows that modelling the contrast between a target word and its context in the same space performs better than doing so in different spaces. MHCA also exceeds dot-product attention (Luong et al., 2015), which demonstrates the utility of multiple heads that attend to different fragments of hidden states. We also examined other variations, e.g., an infinite window size and a different number of heads, but the performance did not improve.

Variations of Feature Selection
We examine the concatenation of hidden states of the last four BERT large model layers ($B_l$) instead of ELMo on RNN HG and RNN MHCA. Our models with the combination of BERT and GloVe ($B_l$+G) perform better than the BERT baseline model (RNN BERT) with $B_l$ on the VUA ALL POS development set by at least 2.9% in terms of F1 score (see Table 3). However, $B_l$+G does not improve performance further compared with the combination of ELMo and GloVe (E+G) on either of our models.

VUA Breakdown Analysis We report the model performance on different types of articles and words based on the VUA ALL POS test set. We analyse all four genres and four types of open class words (verbs, adjectives, nouns and adverbs), which is in line with Leong et al. (2018). The verbal statistics in Table 5 are different from VUA VERB in Table 2, as they are different tracks in the Metaphor Shared Task: not all verbs in VUA ALL POS are included in VUA VERB. In Table 5, all models perform better on academic articles than on the other genres, with RNN MHCA yielding the highest F1 (79.8%). Intuitively, metaphor identification is easier as the style of English is more formal. E.g., (using underlines for metaphors) This mixture, heated by recession and high unemployment, inevitably generates a high level of crime. (VUA ID: as6-fragment01-30). Identifying metaphors in conversation is the hardest for our baselines, probably due to its fragmented language. E.g., Drawing, oh well! (VUA ID: kbp-fragment09-4105). However, RNN HG achieves large improvements against RNN ELMo (3.8%) and RNN BERT (3.4%) on conversation. The improvements of our models against RNN ELMo on news are larger than on TroFi, although the source sentences of both datasets are from news. This supports our argument that the noise of treating non-target words as literal in TroFi negatively impacts our models' ability to learn the difference between literals and metaphors. In contrast, all words in VUA news are annotated, so that the advantages of our models are more obvious.
In the PoS breakdown analysis, verbal metaphors are identified better than those of other PoS, as verbal metaphors make up the largest share among all PoS. RNN HG achieves its biggest improvement (4.1%) against RNN ELMo on adverbs, whereas RNN BERT also shows strong performance. On adjectives, CNN+RNN ensmb surpasses the second best model, RNN HG, by 2.9%. The use of word embedding clusters, PoS tags and ensemble learning may contribute to identifying adjective metaphors.
Error Analysis Comparing our two models, we find that 96.3% of their predictions are the same on the VUA ALL POS testing set. For these shared predictions, precision, recall, F1 and accuracy are 80.2%, 77.2%, 78.7% and 95.3%, respectively, which is better than each of our models on the full dataset. False negatives are common in sentences with multiple metaphors, e.g., Or: 'When Cupid shot his dart He shot it at your heart.' (VUA ID: a5e-fragment06-187), where 10 out of 12 words are labelled as metaphors. However, our models only classify 'heart' as metaphoric in this sentence. Ambiguous contexts are also challenging, e.g., I'm gonna play with that and see what (VUA ID: kbd-fragment21-8037), where the referent of 'that' is not in the context, so that 'play with' are also false negatives.
For the samples where our models predict different labels, the main errors of RNN HG are false negatives, while the main errors of RNN MHCA are false positives. This is likely due to the fact that some conventional metaphors frequently appear in typical corpora, so that GloVe embeddings of metaphors are not distinct from their contextual meaning encodings. Metaphors may be misclassified as literal by RNN HG. On the other hand, RNN MHCA may flag the clash between literals and their contexts, if there are many metaphors in the contexts, so that literal target words may be misclassified as metaphoric.
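The agreement analysis above can be reproduced in outline with the following plain-Python sketch, which keeps only the tokens on which two models predict the same label and scores that subset. The toy predictions and all names are hypothetical and do not reproduce the reported figures.

```python
def agreed_predictions(pred_a, pred_b, gold):
    """Keep (gold, prediction) pairs for tokens on which both models agree."""
    return [(g, a) for a, b, g in zip(pred_a, pred_b, gold) if a == b]

def precision_recall(pairs):
    """Precision and recall with metaphor (label 1) as the positive class."""
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical token-level predictions of the two models.
pred_hg   = [1, 0, 1, 0, 1, 0, 0, 1]
pred_mhca = [1, 0, 0, 0, 1, 1, 0, 1]
gold      = [1, 0, 1, 0, 0, 1, 0, 1]

agreed = agreed_predictions(pred_hg, pred_mhca, gold)
agreement_rate = len(agreed) / len(gold)
precision, recall = precision_recall(agreed)
print(agreement_rate, precision, recall)
```

The tokens left out of `agreed` are exactly the disagreement cases discussed above, where one model tends towards false negatives and the other towards false positives.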

Conclusion
We proposed two metaphor identification models based on the Metaphor Identification Procedure (Group, 2007; Steen et al., 2010) and Selectional Preference Violation (Wilks, 1975, 1978). Our models achieve state-of-the-art performance on three public datasets. The performances of the two models are close in terms of F1 score, as their linguistic foundations, MIP and SPV, are similar in principle. The breakdown analysis of VUA demonstrates that the improvements of our models derive from instances that are problematic for our baselines, e.g., conversation articles and adverb metaphors. In future work, we will explore ensemble learning: our error analysis demonstrates that when the predictions of our two models agree, the prediction is more accurate, with high precision, which suggests combining them. Another interesting direction is to explore combining different semantic similarity measures (Lin et al., 2015) for our task.