A Report on the 2018 VUA Metaphor Detection Shared Task

As the community working on computational approaches to figurative language is growing and as methods and data become increasingly diverse, it is important to create widely shared empirical knowledge of the level of system performance in a range of contexts, thus facilitating progress in this area. One way of creating such shared knowledge is through benchmarking multiple systems on a common dataset. We report on the shared task on metaphor identification on the VU Amsterdam Metaphor Corpus conducted at the NAACL 2018 Workshop on Figurative Language Processing.


Introduction
Metaphor use in everyday language is a way to relate our physical and familiar social experiences to a multitude of other subjects and contexts (Lakoff and Johnson, 2008); it is a fundamental way to structure our understanding of the world, often without our conscious awareness of its presence as we speak and write. It highlights the unknown using the known, explains the complex using the simple, and helps us emphasize the relevant aspects of meaning, resulting in effective communication. Consider the examples of metaphor use in Table 1.
Table 1: Examples of metaphorical (M) use and its intended (I) meaning.

M: In Washington, people change dance partners frequently, but not the dance.
I: In Washington, people work with one another opportunistically.

M: Robert Muller is like a bulldog; he will get what he wants.
I: Robert Muller will work in a determined and aggressive manner to get what he wants.

In this paper, we report on the first shared task on automatic metaphor detection. By making available an easily accessible common dataset and framework for evaluation, we hope to contribute to the consolidation and strengthening of the growing community of researchers working on computational approaches to figurative language. By engaging a variety of teams to test their systems within a common evaluation framework and to share their findings about more and less effective architectures, features, and data sources, we hope to create a shared understanding of the current state of the art, laying a foundation for further work.
This report provides a description of the shared task, dataset and metrics, a brief description of each of the participating systems, a comparative evaluation of the systems, and our observations about trends in designs and performance of the 56 systems that participated in the shared task.

Related Work
Over the last decade, automated detection of metaphor has become an increasingly popular topic, manifesting itself both in a variety of approaches and in an increasing variety of data to which the methods are applied. In terms of methods, approaches based on feature engineering in a supervised machine learning paradigm have explored features based on concreteness and imageability; semantic classification using WordNet, FrameNet, VerbNet, the SUMO ontology, property norms, and distributional semantic models; syntactic dependency patterns; and sensorial and vision-based features (Köper and im Walde, 2017; Tekiroglu et al., 2015; Tsvetkov et al., 2014; Beigman Klebanov et al., 2014; Dunn, 2013; Neuman et al., 2013; Mohler et al., 2013; Hovy et al., 2013; Tsvetkov et al., 2013; Turney et al., 2011; Shutova et al., 2010; Gedigian et al., 2006); see Veale et al. (2016) for reviews of supervised as well as semi-supervised and unsupervised approaches.
Recently, deep learning methods have been explored for token-level metaphor detection (Rei et al., 2017; Gutierrez et al., 2017; Do Dinh and Gurevych, 2016). As discussed later in the paper, the fact that all but one of the teams participating in the shared task experimented with neural network architectures testifies to the increasing popularity of this modeling approach.
In terms of data used for evaluating metaphor detection systems, researchers have used specially constructed or selected sets, ranging from adjective-noun pairs (Tsvetkov et al., 2014), WordNet synsets and glosses (Mohammad et al., 2016), and annotated lexical items (from a range of word classes) in sentences sampled from corpora (Özbal et al., 2016; Jang et al., 2015; Hovy et al., 2013; Birke and Sarkar, 2006), all the way to annotation of all words in running text for metaphoricity (Beigman Klebanov et al., 2018; Steen et al., 2010); Veale et al. (2016) review additional annotated datasets. By far the largest annotated dataset is the VU Amsterdam Metaphor Corpus; it has also been used for evaluating many of the cited supervised learning-based systems. Due to its size, availability, reliability of annotation, and popularity in current research, we decided to use it to benchmark the current field of supervised metaphor detection approaches.

Task Description
The goal of this shared task is to detect, at the word level, all metaphors in a given text. There are two tracks: All Part-Of-Speech (POS) and Verbs. The former track is concerned with detecting all content words (nouns, verbs, adverbs, and adjectives) that are labeled as metaphorical, while the latter is concerned only with verbs that are metaphorical. We excluded all forms of be, do, and have from both tracks. Each participating individual or team can elect to compete in the All POS track, the Verbs track, or both. The competition is organized into two phases: training and testing.
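As an illustrative sketch (not the official task tooling), the instance selection for the two tracks can be expressed as a simple filter over POS-tagged tokens. The simplified POS labels and the example sentence below are invented for illustration; the corpus itself uses BNC-style tags.

```python
CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV"}
EXCLUDED_LEMMAS = {"be", "do", "have"}  # all forms of be/do/have are excluded

def task_instances(tagged_tokens, track="allpos"):
    """Yield the tokens that must be labeled for the given track."""
    for token, lemma, pos in tagged_tokens:
        if lemma in EXCLUDED_LEMMAS:
            continue
        if track == "verbs" and pos == "VERB":
            yield token
        elif track == "allpos" and pos in CONTENT_POS:
            yield token

sent = [("He", "he", "PRON"), ("did", "do", "VERB"),
        ("attack", "attack", "VERB"), ("the", "the", "DET"),
        ("plan", "plan", "NOUN"), ("fiercely", "fiercely", "ADV")]
print(list(task_instances(sent, "verbs")))   # ['attack']
print(list(task_instances(sent, "allpos")))  # ['attack', 'plan', 'fiercely']
```

Note that the auxiliary "did" is skipped in both tracks because its lemma is "do", while the content verb "attack" is scored in both.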

Dataset
We use the VU Amsterdam Metaphor Corpus (VUA) (Steen et al., 2010) as the dataset for our shared task. The dataset consists of 117 fragments sampled across four genres from the British National Corpus: Academic, News, Conversation, and Fiction. Each genre is represented by approximately the same number of tokens, although the number of texts differs greatly, with News containing the largest number of texts. We randomly sampled 23% of the texts from each genre to set aside for testing, retaining the rest for training. The data is annotated using the MIP-VU procedure, with strong inter-annotator reliability (κ > 0.8). MIP-VU is based on the MIP procedure (Pragglejaz Group, 2007), extending it to handle metaphoricity through reference (such as marking did as a metaphor in As the weather broke up, so did their friendship) and to allow explicit coding of difficult cases where a group of annotators could not arrive at a consensus. The tagset is rich and organized hierarchically, distinguishing various types of metaphor, words that flag the presence of metaphors, etc. In this paper, we consider only the top-level partition: all content words with the tag "function=mrw" (metaphor-related word) are labeled as metaphors, while all other content words are labeled as non-metaphors. Table 2 shows the overall statistics of our training and testing sets.
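The per-genre split described above can be sketched as follows. The file IDs, the seed, and the use of fresh random sampling are illustrative assumptions; the official split is fixed and distributed with the task materials.

```python
import random

def genre_stratified_split(texts_by_genre, test_fraction=0.23, seed=0):
    """Hold out roughly `test_fraction` of the texts within each genre."""
    rng = random.Random(seed)
    train, test = [], []
    for genre in sorted(texts_by_genre):
        texts = sorted(texts_by_genre[genre])
        rng.shuffle(texts)
        n_test = int(len(texts) * test_fraction)
        test.extend(texts[:n_test])
        train.extend(texts[n_test:])
    return train, test

# Hypothetical corpus: genre sizes are invented for illustration.
corpus = {"news": [f"news-{i}" for i in range(50)],
          "academic": [f"acad-{i}" for i in range(12)]}
train_texts, test_texts = genre_stratified_split(corpus)
print(len(train_texts), len(test_texts))  # 49 13
```

Sampling at the text level (rather than the sentence or token level) avoids leaking material from the same document into both partitions.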
To facilitate the use of the datasets and evaluation scripts in future research beyond this shared task, the complete set of task instructions and scripts is published on GitHub.[1] Specifically, we provide a script to parse the original VUAMC.xml, which is not included in our download bundle due to licensing restrictions, to extract the verbs and other content words required for the shared task.
We also provide a set of features used to construct the baseline classification model for prediction of metaphor/non-metaphor classes at the word level, and instructions on how to replicate the baselines.

Training phase
In this first phase, data is released for training and/or development of metaphor detection models. Participants can elect to perform cross-validation on the training data, partition the training data further to obtain a held-out set for preliminary evaluations, and/or set apart a subset of the data for development/tuning of hyperparameters. However the training data is used, the goal is to have N final systems (or versions of a system) ready for evaluation when the test data is released.

Testing phase
In this phase, instances for evaluation are released.[2] Each participating system generated predictions for the test instances, for up to N models.[3] Predictions are submitted to CodaLab[4] and evaluated automatically against the true labels. We selected CodaLab as the platform for organizing the task due to its ease of use, its availability of communication tools such as mass-emailing and an online forum for clarification of task issues, and its real-time tracking of submissions. Submissions were anonymized; hence, the only statistics displayed were the highest score across all systems per day and the total number of system submissions per day. The metric used for evaluation is the F1 score for the least frequent class ("metaphor"), with Precision and Recall also available via the detailed results link in CodaLab.

[1] https://github.com/EducationalTestingService/metaphor/tree/master/NAACL-FLP-shared-task
[2] In principle, participants could have had access to the test data by independently obtaining the VUA corpus. The shared task was based on a presumption of fair play by participants.
[3] We set N = 12.
[4] https://competitions.codalab.org/competitions/17805
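As a minimal re-implementation sketch (not the official evaluation script), the precision, recall, and F1 for the positive ("metaphor") class over a set of labeled instances can be computed as follows:

```python
def metaphor_f1(gold, pred):
    """Precision, recall, and F1 for the positive ('metaphor' = 1) class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = [1, 0, 0, 1, 1, 0, 0, 0]  # 1 = metaphor, 0 = non-metaphor
pred = [1, 1, 0, 1, 0, 0, 0, 0]
p, r, f = metaphor_f1(gold, pred)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.67 0.67 0.67
```

Scoring the minority class directly (rather than accuracy) matters here because non-metaphors dominate the data, so a trivial all-negative system would otherwise look deceptively strong.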

Systems
The shared task started on January 12, 2018, when the training data was made available to registered participants. On February 12, 2018, the testing data was released. Submissions were accepted until March 8, 2018. Overall, there were a total of 32 submissions by 8 unique individuals/teams for the Verbs track, and 100 submissions by 11 individuals/teams for the All POS track. All participants in the Verbs track also participated in the All POS track. In total, 8 system papers were submitted describing the algorithms and methodology used to generate the metaphor predictions. In the following sections, we first describe the baseline classification models and their feature sets. Next, we report the performance results and rankings of the best systems for each of the 8 teams, and briefly describe the best-performing system of every team. Interested readers can refer to the teams' papers for more details.

Baseline Classifiers
We make available to shared task participants a number of features from prior published work on metaphor detection, including unigram features; features based on WordNet and VerbNet; features derived from a distributional semantic model; POS-based features; concreteness and difference-in-concreteness features; and topic models. As baselines, we train two logistic regression classifiers for each track (Verbs and All-POS), with instance weights inversely proportional to class frequencies. Baseline 1 uses lemmatized unigrams (UL), a simple yet fairly strong feature set. This feature is produced using NLTK (Bird and Loper, 2004) to generate the lemma of each word according to its tagged POS. As Baseline 2, we use the best system from Beigman Klebanov et al. Its features are lemmatized unigrams, generalized WordNet semantic classes, and the difference in concreteness ratings between verbs/adjectives and nouns (UL + WordNet + CCDB).[5]
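The instance weighting used for the baselines can be sketched as follows. This is one common formulation, in which each class contributes equal total weight (matching scikit-learn's "balanced" heuristic); the exact constants in the baseline scripts may differ.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each instance by n / (n_classes * count(its class))."""
    counts = Counter(labels)
    n = len(labels)
    return [n / (len(counts) * counts[y]) for y in labels]

labels = [0, 0, 0, 0, 0, 0, 1, 1]  # 6 non-metaphor, 2 metaphor instances
weights = inverse_frequency_weights(labels)
# Each class now contributes the same total weight (4.0 each),
# so the rare "metaphor" class is not drowned out during training.
print(weights[0], weights[-1])  # 0.6666666666666666 2.0
```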

System Descriptions
The best-performing system from each participant is described below, in alphabetic order.
bot.zen (Stemle and Onysko, 2018) used word embeddings from different standard corpora representing different levels of language mastery, encoding each word in a sentence into multiple vector-based embeddings, which are then fed into an LSTM RNN architecture. Specifically, backpropagation was performed using class weightings computed as the logarithm of the inverse of the metaphor and non-metaphor counts. Their implementation is hosted on GitHub[6] under the Apache License, Version 2.0.
DeepReader (Swarnkar and Singh, 2018) presented a neural network architecture that concatenates the hidden states of forward and backward LSTMs, followed by feature selection and classification. The authors also showed that reweighting examples and adding linguistic features (WordNet, POS, concreteness) further improves performance.
MAP (Pramanick et al., 2018) used a hybrid architecture of Bi-directional LSTM and Conditional Random Fields (CRF) for metaphor detection, relying on features such as token, lemma and POS, and using word2vec embeddings trained on English Wikipedia. Specifically, the authors considered contextual information within a sentence for generating predictions.
nsu ai (Mosolova et al., 2018) used linguistic features based on unigrams, lemmas, POS tags, topical LDAs, concreteness, WordNet, VerbNet and verb clusters and trained a Conditional Random Field (CRF) model with gradient descent using the L-BFGS method to generate predictions.
OCOTA (Bizzoni and Ghanimifard, 2018) experimented with a deep neural network composed of a Bi-LSTM preceded and followed by fully connected layers, as well as a simpler model that has a sequence of fully connected neural networks. The authors also experiment with word embeddings trained on various data, with explicit features based on concreteness, and with preprocessing that addresses variability in sentence length. The authors observe that a model that combines Bi-LSTM with the explicit features and sentence-length manipulation shows the best performance. The authors also show that an ensemble of the two types of neural models works even better, due to a substantial increase in recall over single models.
Samsung RD PL (Skurniak et al., 2018) explored the use of several orthogonal resources in a cascading manner to predict metaphoricity. For a given word in a sentence, they extracted three feature sets: a concreteness score from the Brysbaert database, the intermediate hidden vector representing the word in a neural translation framework, and the logits of a CRF sequence tagging model trained using word embeddings and contextual information. When trained on the VUA data, the CRF model alone outperformed a GRU that combined all three features.
THU NGN (Wu et al., 2018) created word embeddings using a pre-trained word2vec model and added features such as embedding clusters and POS tags before using a CNN and Bi-LSTM to capture local and long-range dependencies for generating metaphoricity labels. Specifically, they used an ensemble strategy in which models are trained iteratively on randomly selected training data and their predictions are averaged to produce the final outputs. At the inference layer, the authors showed that their best-performing system uses a weighted-softmax classifier rather than a Conditional Random Field predictor, as it significantly improves recall.
ZIL IPIPAN (Mykowiecka et al., 2018) used word2vec embeddings over orthographic word forms (no lemmatization) as input to an LSTM network for generating predictions. They explored augmenting the word embeddings with binarized vectors reflecting a word's General Inquirer dictionary category and its POS. Experiments were also carried out with different parametrizations of the LSTM (type of unit, number of layers, dropout rate, number of epochs, etc.), though vectors enriched with POS information did not yield any improvement.

Results
Tables 3 and 4 show the performance and ranking of all the systems, including the baseline systems. For overall results on the All-POS track, three out of the seven systems outperformed the stronger of the two baselines, with the best submitted system gaining 6 F1-score points over the best baseline (0.65 vs. 0.59). We note that the best system outperformed the baseline through improved precision (by 10 points), while recall remained about the same, around 0.7.

For the Verbs track, four out of the five systems outperformed both baselines. The best system posted an improvement of 7 F1-score points over the best baseline (0.67 vs. 0.60), achieved through improvements of about the same magnitude in both recall and precision.
In the following section, we inspect the performance of the different systems more closely.

Trends in system design
All the submitted systems but one are based on a neural network architecture. Of the top three systems that outperform the baseline on All-POS, two introduce explicit linguistic features into the architecture alongside the more standard word-embedding-based representations, while the third experiments with using a variety of corpora, including corpora produced by English language learners, to compute word embeddings. Tables 3 and 4 show the overall performance for the best submission per team, as well as the performance of these systems by genre. It is clear that the overall F1 scores of 0.62-0.65 for the top three systems mask substantial variation in performance across genres: Academic is the easiest genre, with a best performance of 0.74, followed by News (0.66), with comparable scores for Fiction (0.57) and Conversation (0.55). In fact, this trend holds not only for the top systems but for all systems, including the baselines, apart from the lowest-performing system, which showed somewhat better results on News than on Academic. The same observations hold for the Verbs data. The large discrepancies in performance across genres underscore the need for wide genre coverage when evaluating metaphor detection systems, as the patterns of metaphor use differ considerably across genres and present tasks of varying difficulty to machine learning systems across the board.

Performance across genres
Furthermore, we note that the best overall system, which is the only system that improves upon the baseline for every single genre in the All-POS evaluation, improved over the baseline much more substantially in the lower-performance genres. For Academic and News, the increase is 1.4 and 5.2 F1 points, respectively, while the improvements for Conversation and Fiction are 8.1 and 11.1 points, respectively. The best-performing system thus exhibits more stable performance across genres than the baseline, though genre discrepancies are still substantial, as described above.

AllPOS vs Verbs
We observe that for the four teams who improved upon the baseline on the Verbs-only track, their best performance on Verbs was better than on the All-POS track, by 2.1 to 5 F1-score points. This could be related to the larger preponderance of metaphors among verbs, which, in turn, leads to a more balanced class distribution in the Verbs data.

Table 5: Performance (F-score) of the best systems submitted to the All-POS track, by POS subsets of the test data. In parentheses, we show the rank of the given POS within all POS for the system. The last column shows the overall drop in performance from the best POS (ranked 1) to the worst (ranked 4).

AllPOS by POS
To better understand performance patterns across various parts of speech, we break down the All-POS test set by POS and report the performance of each of the best systems submitted to the All-POS track on each POS-based subset of the test data; Table 5 shows the results. First, we observe that the average difference in performance between the best and worst POS is 9 points (see the Best to Worst column in the table), with individual systems ranging from 3 to 14 points. We note that the baseline systems are relatively more robust in this respect (3-7 points), while the top 3 systems exhibit a 9-12 point range of variation in performance by POS. While this gap is substantial, it is much smaller than the 20-point gap observed in the by-genre breakdown.
Second, we note that without exception all systems performed best on verbs, and for all but one system performance was worst on adverbs (see "Av. rank among POS" row in Table 5). Performance on adjectives and nouns was comparable for most systems, with slightly better results for adjectives for 7 out of 10 systems. These trends closely follow the proportions of metaphors within each POS: While 30% of verbs are marked as metaphorical, only 8% of adverbs are thus marked, with nouns and adjectives occupying the middle ground with 13% and 18% metaphors, respectively.
Third, we observe that the relative performance of the systems is quite consistent across POS. Thus, the rank order correlation between systems' overall performance (AllPOS) and their performance on Verbs is 0.94; it is 0.98 for nouns and 0.92 for Adjectives (see the last row of Table 5). In fact, the top three ranks are occupied by the same systems in AllPOS, Verbs, Adjectives, and Nouns categories. The somewhat lower rank order correlation for Adverbs (0.81) reflects Baseline 1 (which ranks 6th overall) posting a relatively strong performance for Adverbs (ranks 3rd), while the ZIL IPIPAN system (ranks 5th overall) shows relatively weak performance on Adverbs (ranks 9th). Overall, the systems' relative standings are not much affected when parceled out by POS-based subsets.
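For rankings without ties, the rank-order (Spearman) correlation used above reduces to a simple closed form over the squared rank differences. A minimal sketch, with hypothetical ranks for illustration:

```python
def spearman(rank_a, rank_b):
    """Spearman rank correlation for two rankings without ties."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

overall = [1, 2, 3, 4, 5]  # hypothetical overall ranks of five systems
verbs = [1, 2, 4, 3, 5]    # the same systems' ranks on the Verbs subset
print(spearman(overall, verbs))  # 0.9
```

A single swap of two adjacent ranks, as in the example, already drops the correlation to 0.9, which gives a sense of the scale of the 0.81-0.98 values reported above.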

Conclusion
This paper summarized the results of the 2018 shared task on metaphor identification in the VUA corpus, held as part of the 2018 NAACL Workshop on Figurative Language Processing. We provided brief descriptions of the participating systems for which detailed papers were submitted; systems' performance in terms of precision, recall, and F-score; and breakdowns of systems' performance by POS and genre. We observed that the task of metaphor detection seems to be somewhat easier for verbs than for other parts of speech, consistently across participating systems. Across genres, we observed a large discrepancy between best and worst performance, with results in the 0.70s for Academic and in the 0.50s for Conversation data. Clearly, understanding and bridging the genre-based gap in performance is an important avenue for future work.
While most systems employed a deep learning architecture effectively, the baselines that use a traditional feature-engineering design were not far behind in terms of performance; the stronger baseline came 4th overall. Indeed, some of the contributions explored a combination of a DNN architecture with explicit linguistic features; this seems to be a promising direction for future work. Some of the teams made their implementations publicly available, which should facilitate further work on improving performance on this task.