Neural Metaphor Detection in Context

We present end-to-end neural models for detecting metaphorical word use in context. We show that relatively standard BiLSTM models which operate on complete sentences work well in this setting, in comparison to previous work that used more restricted forms of linguistic context. These models establish a new state-of-the-art on existing verb metaphor detection benchmarks, and show strong performance on jointly predicting the metaphoricity of all words in a running text.


Introduction
Metaphors are pervasive in natural language, and detecting them requires challenging contextual reasoning about whether specific situations can actually happen (Lakoff and Johnson, 1980). For example, in Table 1, "examining" is metaphorical because it is impossible to literally use a "microscope" to examine an entire country. In this paper, we present end-to-end neural models for metaphor detection, which can learn rich contextual word representations that are crucial for accurate interpretation of figurative language.
In contrast, most previous approaches focused on limited forms of linguistic context, for example by only providing SVO triples such as (car, drink, gasoline) to the model (Tsvetkov et al., 2013; Rei et al., 2017). While the verbal arguments provide strong cues, providing the full sentential context supports more accurate prediction, as seen in Table 1. Even in the few cases when the full sentence is used (Köper and im Walde, 2017; Turney et al., 2011; Jang et al., 2016), existing models have used unigram-based features with limited expressivity.
Table 1 (example sentences): "The experts started examining the Soviet Union with a microscope to study perceived changes." "Rockford teachers are honored for saving a drowning student." "You're drowning in student loan debt."
We investigate two common task formulations: (1) given a target verb in a sentence, classifying whether it is metaphorical or not, and (2) given a sentence, detecting all of the metaphorical words (independent of their POS tags). We find that relatively standard architectures based on bi-directional LSTMs (Hochreiter and Schmidhuber, 1997) augmented with contextualized word embeddings (Peters et al., 2018) perform surprisingly well on both tasks, even with a modest amount of training data. We improve the previous state-of-the-art by 7.5 F1 on the VU Amsterdam Metaphor Corpus (VUA) for the sequence labeling task (Steen et al., 2010), by 2.5 F1 on the VUA verb classification dataset, and by 4.9 F1 on the MOH-X dataset (Mohammad et al., 2016). Our code is publicly available at https://github.com/gao-g/metaphor-in-context.

Task
We study two task formulations.
Sequence Labeling: Given a sentence $x_1, \ldots, x_n$, predict a sequence of binary labels $l_1, \ldots, l_n$ to indicate the metaphoricity of each word.
Classification: Given a sentence $x_1, \ldots, x_n$ and a target verb index $i$, predict a binary label $l$ to indicate the metaphoricity of the target $x_i$.
While both formulations have been studied in previous work, it is worth noting that the sequence labeling task generalizes the classification task in that the prediction for the target verb can be extracted from the full sentence predictions. In addition, as will be shown in Section 5, we find that given accurate annotations for all words in a sentence, the sequence labeling model outperforms the classification model even when the evaluation is set up as a classification task.
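The relationship between the two formulations can be illustrated with a minimal sketch: once per-token labels are available, the classification answer is simply the label at the target verb index. The function name and the example predictions below are hypothetical and are not taken from the released code.

```python
from typing import List


def classify_from_sequence(labels: List[int], target_index: int) -> int:
    """Read the target verb's label out of per-token predictions.

    labels: one binary label per token (1 = metaphorical, 0 = literal).
    target_index: position of the target verb in the sentence.
    """
    if not 0 <= target_index < len(labels):
        raise IndexError("target_index is outside the sentence")
    return labels[target_index]


# Hypothetical per-token predictions for one sentence (illustration only).
tokens = ["You", "'re", "drowning", "in", "student", "loan", "debt", "."]
predicted = [0, 0, 1, 1, 0, 0, 0, 0]
verb_label = classify_from_sequence(predicted, tokens.index("drowning"))
```

The reverse direction does not hold: a single label for the target verb says nothing about the remaining tokens, which is why sequence labeling is the more general formulation.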

Model
Our models use a bidirectional LSTM to encode a sentence, and a feedforward neural network for classification, optimized for the log-likelihood of gold labels.
Sentence encoding For both sequence labeling and classification, we represent each token $x_i$ in the input sentence with a pre-trained word embedding $w_i$. To further encode contextual information, we also concatenate ELMo (Embeddings from Language Models) vectors $e_i$ from Peters et al. (2018). These vectors have been shown to be useful for word sense disambiguation, a task closely related to metaphor detection (Birke and Sarkar, 2006).

Sequence Labeling Model
Figure 1 shows the model architecture. We input the word representation $[w_i; e_i]$ to a bidirectional LSTM, producing a contextualized representation $h_i$ for each token. Then we use a feedforward neural network that takes $h_i$ to predict a label $l_i$ for each word $x_i$. When the dataset does not contain annotations for every word, we make the simplifying assumption that every unannotated word is used literally.
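The simplifying assumption about unannotated words can be sketched as a small preprocessing step that turns sparse annotations into a full label sequence. This is an illustrative sketch only; the function name, the label constants, and the dictionary-based annotation format are assumptions and not the authors' data format.

```python
from typing import Dict, List

LITERAL, METAPHOR = 0, 1  # assumed label encoding for this sketch


def build_token_labels(num_tokens: int, annotations: Dict[int, int]) -> List[int]:
    """Build one label per token, defaulting unannotated tokens to literal.

    annotations maps a token index to its gold label for the tokens
    that were actually annotated (hypothetical format).
    """
    labels = [LITERAL] * num_tokens
    for index, label in annotations.items():
        if not 0 <= index < num_tokens:
            raise IndexError(f"annotation index {index} is outside the sentence")
        labels[index] = label
    return labels


# A sentence of four tokens where only the verb at position 2 was annotated.
full_labels = build_token_labels(4, {2: METAPHOR})
```

Under this assumption, a dataset that only labels target verbs can still be used to train the sequence labeling model, at the cost of possibly mislabeling metaphorical context words as literal.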

Classification Model
Figure 2 shows the model architecture. We concatenate an index embedding $n_i$, which indicates whether $x_i$ is the target verb. We use $[w_i; e_i; n_i]$ as an input to a bidirectional LSTM, producing a contextualized representation $h_i$. We add an attention layer by computing the attention weight $a_i$ for token $x_i$, and compute the representation $c$ as a weighted sum of LSTM output states:
$a_i = \mathrm{softmax}_i(W_a h_i + b_a)$
$c = \sum_{i=1}^{n} a_i h_i$
where $W_a$ and $b_a$ are learned parameters.
Finally, we feed $c$ to a feedforward network to compute the label scores for the target verb.
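The attention pooling step can be sketched in plain Python. The sketch assumes that each attention score is an affine function of the LSTM state (a weight vector standing in for W_a and a scalar standing in for b_a) and that the weights are obtained with a softmax over tokens; the function name and the toy inputs are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def attention_pool(
    hidden: Sequence[Sequence[float]], w_a: Sequence[float], b_a: float
) -> Tuple[List[float], List[float]]:
    """Return (attention weights, context vector) for a sequence of states.

    hidden: one vector h_i per token.
    w_a, b_a: assumed learned parameters mapping each h_i to a scalar score.
    """
    scores = [sum(w * h for w, h in zip(w_a, h_i)) + b_a for h_i in hidden]
    # Subtract the maximum score before exponentiating for numerical stability.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hidden[0])
    # Context vector: attention-weighted sum of the hidden states.
    context = [sum(a * h_i[d] for a, h_i in zip(weights, hidden)) for d in range(dim)]
    return weights, context


# Toy example: with zero parameters every token receives the same weight,
# so the context vector is the mean of the hidden states.
weights, context = attention_pool([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], 0.0)
```

The context vector has the same dimensionality as a single LSTM state, so the feedforward classifier sees a fixed-size input regardless of sentence length.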

Dataset
We evaluate performance on a number of benchmark datasets, including two for classification (TroFi and MOH/MOH-X) and one for tagging (VUA). Table 2 shows statistics for the verb classification datasets. Despite being two times larger than the MOH dataset, the TroFi dataset contains only 50 unique verbs, whereas the larger VUA dataset contains over 2K unique verbs. The MOH dataset contains shorter and simpler sentences (example sentences in WordNet) than the other datasets.
Sequence Labeling Experiment Setup The VUA dataset contains annotations for all words in each sentence. We divide the data into training, development, and test sets, following the same split as the VUA verb classification task. While the label classes are less balanced (only 11% metaphors at the token level), this dataset is much bigger. Table 3 shows the data statistics.

Experiments
Evaluation Metric We report precision, recall and F1 measure for the metaphor class as well as the overall accuracy. For the VUA dataset, we also report macro-averaged F1 score across four genres (conversation, academic writing, fiction and news).
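A minimal sketch of these metrics follows. It assumes that the macro-averaged score is the unweighted mean of per-genre F1 values for the metaphor class; the function names and the tuple-based input format are hypothetical.

```python
from typing import Dict, Iterable, List, Sequence, Tuple


def precision_recall_f1(
    gold: Sequence[int], pred: Sequence[int], positive: int = 1
) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for the positive (metaphor) class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def macro_f1_by_genre(examples: Iterable[Tuple[str, int, int]]) -> float:
    """Average the metaphor-class F1 over genres.

    examples: (genre, gold label, predicted label) triples.
    """
    by_genre: Dict[str, Tuple[List[int], List[int]]] = {}
    for genre, gold, pred in examples:
        golds, preds = by_genre.setdefault(genre, ([], []))
        golds.append(gold)
        preds.append(pred)
    scores = [precision_recall_f1(g, p)[2] for g, p in by_genre.values()]
    return sum(scores) / len(scores)


p, r, f = precision_recall_f1([1, 1, 0, 0], [1, 0, 1, 0])
macro = macro_f1_by_genre(
    [("news", 1, 1), ("news", 0, 0), ("fiction", 1, 0), ("fiction", 0, 0)]
)
```

Because only about one token in nine is metaphorical in VUA, accuracy alone would reward a model that labels everything literal, which is why the metaphor-class F1 is the more informative number.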

Comparison Systems
We propose a simple yet effective lexical baseline. It assigns the metaphor label if the word is annotated as metaphorical more frequently than as literal in the training set, and the literal label otherwise. We also compare our models against prior systems, including a neural similarity network with skip-gram word embeddings (Rei et al., 2017), a balanced logistic regression classifier on target verb lemma that uses a set of features based on multisense abstractness rating (Köper and im Walde, 2017), and a CNN-LSTM ensemble model with weighted-softmax classifier which incorporates pre-trained word2vec, POS tags, and word cluster features (Wu et al., 2018). We experiment with both the sequence labeling model (SEQ) and the classification model (CLS) for the verb classification task, and with the sequence labeling model (SEQ) for the sequence labeling task.
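The lexical baseline can be sketched as a majority-vote lookup table. This is an illustrative sketch, not the released implementation: the function names are hypothetical, and the choice of key (surface form or lemma) is left to the caller because the description above does not specify it.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

LITERAL, METAPHOR = 0, 1  # assumed label encoding for this sketch


def train_lexical_baseline(examples: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Map each training word to its majority label.

    A word gets the metaphor label only if it is annotated as metaphorical
    strictly more often than as literal; ties resolve to literal.
    """
    counts: Dict[str, Counter] = defaultdict(Counter)
    for word, label in examples:
        counts[word][label] += 1
    return {
        word: METAPHOR if c[METAPHOR] > c[LITERAL] else LITERAL
        for word, c in counts.items()
    }


def predict(model: Dict[str, int], word: str) -> int:
    """Words unseen in training fall under 'otherwise' and are labeled literal."""
    return model.get(word, LITERAL)


# Hypothetical training annotations (illustration only).
model = train_lexical_baseline(
    [("drown", 1), ("drown", 1), ("drown", 0), ("save", 0), ("tie", 1), ("tie", 0)]
)
```

Since the baseline ignores context entirely, any gain a contextual model shows over it can be attributed to information beyond the word's own label distribution.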

Implementation Details
We used 300d GloVe vectors (Pennington et al., 2014) and 1024d ELMo vectors. We used an additional 50d index embedding for the classification task. The LSTM module has a 300d hidden state. We applied dropout on the input to the LSTM and on the input to the feedforward layer. We tuned the learning rate and dropout rate for each model on each dataset. We used SGD to optimize the CLS model and Adam (Kingma and Ba, 2013) for the SEQ model. We used spaCy (Honnibal and Montani, 2017) for lemmatization, tokenization, and part-of-speech tagging.

Sequence Labeling Results
Performance on the sequence labeling task is reported in Table 4. While prior work (Klebanov et al., 2014; Özbal et al., 2016) reported on the same dataset, the experiment setting is not comparable (they did cross validation on a smaller training set). Our lexical baseline performs strongly in terms of precision, as some words and POS tags are almost exclusively annotated as literal. Our sequence labeling model mainly improves recall. Table 5 reports the breakdown of performance by POS tags. Not surprisingly, tags with more data are easier to classify. Adposition is the easiest to identify as metaphorical and is also the most frequently metaphorical class (28%). On the other hand, particles are challenging to identify, since they are often associated with multi-word expressions, such as "put down the disturbances".

Verb Classification Results
Table 6 shows performance on the verb classification task for three datasets (MOH-X, TroFi and VUA). Our models achieve strong performance on all datasets, outperforming existing models on the MOH-X and VUA datasets. On the MOH-X dataset, the CLS model outperforms the SEQ model, likely due to the simpler overall sentence structure and the fact that the target verbs are the only words annotated for metaphoricity. For the VUA dataset, where we have annotations for all words in a sentence, the SEQ model significantly outperforms the CLS model. This result shows that predicting metaphor labels of context words helps to predict the target verb. We hypothesize that Köper and im Walde (2017) outperforms our models on the TroFi dataset for a similar reason: their work uses concreteness labels, which correlate highly with the metaphor labels of neighboring words in the sentence. Also, their best model uses the verb lemma as a feature, which itself provides a strong clue in a dataset of only 50 verbs (see lexical baseline). Table 7 shows an ablation study on input representations (with or without ELMo vectors). Contextualized word vectors improve the performance of both models by a large margin.
Error Analysis We sampled 100 errors of our best model from the VUA verb classification development set for analysis. Table 8 shows examples. Following the original annotation guideline, we classify metaphors into five categories: direct metaphor, indirect metaphor, implicit metaphor, personification, and borderline case. Indirect metaphor, the most common type for verbs, means that the basic meaning of a word is different from its contextual meaning. Implicit metaphor occurs due to an underlying link which points to a recoverable metaphorical concept.
About half of the errors were false positives, and the other half were false negatives. Among the false negatives, 33% are indirect metaphors, 18% are personifications, and 2% are direct metaphors. Among 55 false positives, 31% of verbs have im-

Sentence | Metaphor Type
To throw up an impenetrable Berlin Wall between you and them could be tactless. | -
In reality you just invent a tale, as if you were sitting round a fire in a cave. | direct metaphor
So they bought immunity. | indirect metaphor
During the early states of the phased evacuation the logistical problem facing the police was the street-by-street warning of the population to make ready for evacuation. | indirect metaphor
There are few things worse than being bludgeoned into reading a book you hate. | indirect metaphor
He thought of thick, fat, hot motorways carving up that land. | personification
One might ask whether motorists are ever justified in knowingly taking risks with other people's lives. | -
The abstract talk of commuting by rail or road being replaced by information technology finds a concrete expression in the idea of telecottages. | -
A fly landed on the empty, staring vizor, and crawled across it. | -
Many neural models with various features and architectures were introduced in the 2018 VUA Metaphor Detection Shared Task. They include LSTM-based models and CRFs augmented by linguistic features, such as WordNet, POS tags, concreteness score, unigrams, lemmas, verb clusters, and sentence-length manipulation (Swarnkar and Singh, 2018; Pramanick et al., 2018; Mosolova et al., 2018; Bizzoni and Ghanimifard, 2018; Wu et al., 2018). Researchers also studied different word embeddings, such as embeddings trained from corpora representing different levels of language mastery (Stemle and Onysko, 2018) and binarized vectors that reflect the General Inquirer dictionary category of a word (Mykowiecka et al., 2018). We show that contextualized word embeddings significantly improve metaphor detection. We also study both sequence labeling and classification approaches, suggesting that the sequence labeling approach enhances performance when used to jointly predict the metaphoricity of all words in a sentence.

Conclusion
In this paper, we present simple BiLSTM models augmented with contextualized word representations for metaphor detection. Our models establish a new state of the art across multiple existing benchmarks, and our error analysis shows remaining challenges for metaphor detection.