Metaphor Detection using Context and Concreteness

We report the results of our system on the Metaphor Detection Shared Task at the Second Workshop on Figurative Language Processing 2020. Our model is an ensemble, utilising contextualised and static distributional semantic representations, along with word-type concreteness ratings. Using these features, it predicts word metaphoricity with a deep multi-layer perceptron. We are able to best the state-of-the-art from the 2018 Shared Task by an average of 8.0% F1, and finish fourth in both sub-tasks in which we participate.


Introduction
Metaphor detection is the task of assigning a label to a word (or sometimes a sentence) in a piece of text, to indicate whether or not it is metaphorical. Some metaphors occur so frequently as to be considered word senses in their own right (so-called conventional metaphors), whilst others are creative, and involve the use of words in unexpected ways (novel metaphors). Sometimes whole phrases or even sentences can lend themselves to metaphorical or literal interpretations. Consider drowning student, which could refer to students submerged in water, or students struggling with coursework (Tsvetkov et al., 2014), or the more idiomatic phrase, they stabbed him in the back, which could be taken literally or (more likely) metaphorically, depending on its context. For these reasons and others, human annotators might disagree about what constitutes a metaphor; computational metaphor detection is no doubt a challenging problem.
In this work, we participate in the 2020 Metaphor Detection Shared Task (Leong et al., 2020). First, we offer a description of metaphoricity, framing it in terms of the concreteness of a word in different contexts. Concreteness of a word in context is not a quantity for which there exists large-scale annotated data. In lieu of this, we train a metaphor detection model using input features which we expect to contain the information needed to derive this contextual concreteness. This model outperforms the highest performing system of the previous shared task (Leong et al., 2018), and finishes 4th in the two subtasks in which we participate.

Figure 1: Examples of verbs with varying levels of concreteness; metaphors are in green. Columns give the maximum concreteness of the word over any sense; rows give the concreteness of the sense used in context.
Low (this sense), low (any sense): "she considered the problem", "they hated the film"
Low (this sense), high (any sense): "she attacked the problem", "he showered her with love" (metaphors, in green)
High (this sense), low (any sense): greyed out
High (this sense), high (any sense): "she attacked the soldier", "he showered at 8am"

Concreteness and Context
Metaphor is a device which allows one to project structure from a source domain to a target domain (Lakoff and Johnson, 1980). For instance, in the sentence "he attacked the government", attacked can be seen as a conventional metaphor, which applies structure from the source domain of war to the target domain of argument. Intuitively, it seems that the context in which a word appears tells us about the target domain, whilst the word itself (and some knowledge about how it is used non-metaphorically) tells us about the source. Several existing models have exploited this difference (e.g. Mao et al., 2019; Gao et al., 2018). Usually, the target domain is something intangible, whilst the source domain relates more closely to our real-world experience. Concreteness refers to the extent to which a word denotes something that can be experienced by the senses, and is generally measured by asking annotators to rate words on a numeric scale (Paivio et al., 1968; Spreen and Schulz, 1966); abstractness is then the inverse of concreteness. Using concreteness ratings for metaphor identification is well motivated, as evidenced by previous work (e.g. Tsvetkov et al., 2013, 2014; Beigman Klebanov et al., 2015).
For a word to be metaphorical in a particular context, then, it needs to have a concrete sense and an abstract sense, with the abstract sense activated in that context. The concrete sense would belong to the source domain, and the abstract sense to the target domain. For instance, the meaning of the word attacked in "she attacked the soldier" is concrete, but in "she attacked the problem" it is abstract, and thus that usage is metaphorical. Polysemy of the word is a necessary condition; the existence of an abstract sense is not enough, otherwise a monosemously abstract word such as considered in "he considered the problem" would be metaphorical. Figure 1 shows examples of words with different maximum concreteness levels elicited by certain senses (columns) appearing in contexts which result in different values of concreteness for that particular sense (rows). The most concrete sense of a word is a lexical property, and thus context independent. The metaphors (green) are found in the top right quadrant: they have an abstract meaning in context, but a concrete sense exists (as evidenced by the examples in the bottom right quadrant). The bottom left quadrant is greyed out, since it is conceptually impossible for a word to exist there: the concreteness of one sense of a word cannot be greater than the concreteness of its most concrete sense.
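The quadrant reasoning above can be made concrete with a small sketch. It assumes, purely for illustration, that both the concreteness of the sense in context and the maximum concreteness over all senses are available as numbers in [0, 1], and that a single cut-off separates "low" from "high"; the function name, the scores, and the threshold are hypothetical and are not part of our system.

```python
def quadrant(sense_concreteness, max_concreteness, threshold=0.5):
    """Place one word usage in a cell of the 2x2 grid of Figure 1.

    Both scores and the threshold are illustrative assumptions.
    """
    if sense_concreteness > max_concreteness:
        # The greyed-out cell: a sense cannot be more concrete than
        # the most concrete sense of the same word.
        raise ValueError("sense concreteness exceeds the word's maximum")
    if sense_concreteness >= threshold:
        # Concrete in context (bottom right): literal usage.
        return "literal (concrete)"
    if max_concreteness >= threshold:
        # Abstract in context, but a concrete sense exists (top right).
        return "metaphorical"
    # Abstract in context and no concrete sense exists (top left).
    return "literal (abstract)"


# Made-up scores for "she attacked the problem": abstract here, but
# the word has a concrete sense.
print(quadrant(0.2, 0.9))
```

The sketch treats metaphoricity as a hard rule over two known quantities; the model described below instead has to infer both quantities from its input features.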

Model Architecture
We now describe a model which uses semantic representations of a word in and out of context to predict metaphoricity. Ideally, we would only provide the model with a representation of the concreteness of a word in context (since we believe that would do most of the lifting), but to our knowledge, no large-scale annotated datasets exist for context-dependent concreteness. In most popular datasets of concreteness annotation (e.g. Coltheart, 1981; Brysbaert et al., 2014), concreteness is a property assigned to each word type, but we would need the concreteness of a word instance. In this respect, our work resembles the abstractness classifier in Turney et al. (2011), although that work operates on word senses, whereas we operate on word instances. Because contextualised concreteness data is unavailable, we instead choose features which, when given to a multi-layer perceptron (MLP), should provide enough information about a word for the MLP to be able to differentiate between each cell in Figure 1.
We first provide the model with contextualised word embeddings, which we expect will provide some information about the target domain of the metaphor. In the contextualised representations, we expect there to exist a space of concrete meanings and a space of abstract meanings, which would help the network differentiate between the top and bottom rows of Figure 1. Along with this, we provide static word embeddings, to provide information about the source domain. Since these static type-level embeddings will clearly contain information about both source and target, we complement them with type-level concreteness ratings. Such ratings should reflect the concreteness of the most concrete sense of the word, thus allowing the network to differentiate between the left and right columns of Figure 1. Figure 2 shows an overview of our architecture. In the following paragraphs, we detail each individual component of the model.

Contextual Word Embedding
For contextualised embeddings, we fine-tune BERT (Devlin et al., 2019). BERT is a sentence encoder which utilises a transformer architecture (Vaswani et al., 2017), and is trained with two separate tasks: masked language modelling (a cloze task) and next-sentence prediction. The latent space (the final hidden state of the encoder) contains vector representations of each input token, which change in different contexts. Several pre-trained BERT models are available; we use BERT large.

Concreteness Model
We define a simple model which represents the concreteness of a word as a linear interpolation between two vectors, v_con and v_abs, representing maximal concreteness and maximal abstractness respectively. For each word w we obtain a real-number estimate of its concreteness, c, from type-level concreteness ratings, and use it to interpolate between v_abs and v_con. Out-of-vocabulary words have their own vector, v_unk. All of v_abs, v_con, and v_unk are randomly initialised and learned from data. The dimensionality of these vectors is a hyperparameter which is tuned; a higher dimensionality will likely place more emphasis on this feature when it is fed into the MLP as part of the ensemble model.
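As a minimal sketch of the interpolation described above, the following assumes that the rating c has already been rescaled to lie in [0, 1], so that c = 1 yields v_con and c = 0 yields v_abs. The class name, the rescaling, and the Gaussian initialisation are assumptions made for illustration; in the actual model the three vectors are trained, which is omitted here.

```python
import random


class ConcretenessModel:
    """Illustrative sketch: interpolate between two vectors by a rating."""

    def __init__(self, ratings, dim, seed=0):
        rng = random.Random(seed)

        def init():
            # Random initialisation; the real vectors would be learned.
            return [rng.gauss(0.0, 1.0) for _ in range(dim)]

        self.ratings = ratings  # word -> c, assumed rescaled to [0, 1]
        self.v_abs = init()     # maximal abstractness
        self.v_con = init()     # maximal concreteness
        self.v_unk = init()     # shared out-of-vocabulary vector

    def embed(self, word):
        c = self.ratings.get(word)
        if c is None:
            return list(self.v_unk)
        # Linear interpolation: c * v_con + (1 - c) * v_abs.
        return [c * con + (1.0 - c) * ab
                for con, ab in zip(self.v_con, self.v_abs)]


# Hypothetical ratings, for illustration only.
model = ConcretenessModel({"soldier": 1.0, "problem": 0.0}, dim=4)
vector = model.embed("soldier")
```

Because the output is always a point on the line segment between v_abs and v_con, the dimensionality changes only how much of the MLP input this feature occupies, not the amount of information it carries.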

Static Word Embedding
We initialise a matrix of static 300-dimensional word embeddings from the Word2Vec Google News pretrained model (Mikolov et al., 2013), then fine-tune it to the data. Out-of-vocabulary words are given their lemma's embedding, if present, otherwise they are initialised randomly.
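The out-of-vocabulary fallback above might be sketched as follows. The function name, the `pretrained` dictionary standing in for the Word2Vec vectors, the source of the lemma, and the uniform initialisation range are all assumptions for illustration; the text only specifies the order of the fallbacks.

```python
import random


def init_static_embedding(word, lemma, pretrained, dim=300, rng=None):
    """Pick the initial static embedding for one vocabulary item.

    `pretrained` maps words to vectors (a stand-in for Word2Vec).
    """
    rng = rng or random.Random(0)
    if word in pretrained:
        return list(pretrained[word])
    if lemma in pretrained:
        # Out-of-vocabulary word: fall back to its lemma's embedding.
        return list(pretrained[lemma])
    # Neither form is known: random initialisation (range is assumed).
    return [rng.uniform(-0.1, 0.1) for _ in range(dim)]
```

All three cases yield vectors that are subsequently fine-tuned along with the rest of the embedding matrix.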

Multi-Layer Perceptron
We define a deep multi-layer perceptron (MLP), which at each layer has four components: a linear transformation, layer normalisation, a ReLU activation function, and finally dropout (Srivastava et al., 2014). The structure is parameterised with three parameters: the input size k, the number of layers n, and the first hidden layer size h. The first linear layer of the network has an input of size k, and an output of size h. Each successive layer halves the size of the hidden state. After n layers, a final linear layer converts to a single output, which is then passed through a sigmoid to yield the prediction. Based on this design, we have the constraint that 2^n ≤ h ≤ k, which imposes that (1) the first hidden layer is not larger than the input, and (2) the size of the hidden layer does not reach 1 before the final layer.
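To make the parameterisation above easier to follow, the sketch below computes the (input, output) size of every linear layer from k, n, and h. The function name is hypothetical, and rounding down when h is not a power of two is an assumption; normalisation, activation, and dropout are left out because they do not change the sizes.

```python
def mlp_layer_shapes(k, n, h):
    """Return the (in, out) size of each linear layer in the MLP."""
    if not (2 ** n <= h <= k):
        raise ValueError("the sizes must satisfy 2**n <= h <= k")
    shapes = []
    in_size, out_size = k, h
    for _ in range(n):
        shapes.append((in_size, out_size))
        # Each successive layer halves the hidden size (rounding down).
        in_size, out_size = out_size, out_size // 2
    # Final linear layer maps to the single pre-sigmoid output.
    shapes.append((in_size, 1))
    return shapes


# Illustrative sizes, not the tuned hyperparameters.
print(mlp_layer_shapes(1024, 3, 256))
```

With h = 2^n, the smallest value the constraint allows, the last hidden layer has size 2, so the hidden state narrows to a single unit only at the output layer.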

Ensemble Model
Tying all of the aforementioned models together is an ensemble model. First, it passes each input sentence w_1, ..., w_n through the contextual embedding component to yield the embeddings h_1, ..., h_n. To reduce their dimensionality, these are each passed through a simple linear transform, yielding h'_1, ..., h'_n. Each word is then passed through the static embedding component, yielding embeddings m_1, ..., m_n, which are also projected down to m'_1, ..., m'_n. Each word is also fed to the concreteness model, yielding concreteness vectors c_1, ..., c_n. For each word, the three representations (h'_i, m'_i, and c_i) are concatenated, and passed into a deep multi-layer perceptron which makes the final metaphor prediction (per-word). This model is depicted in Figure 2. Crucially, though this model accepts sentences (needed to process the contextualised word representations), it makes predictions using the MLP on a per-word basis, but back-propagates through BERT for all annotated words in each sentence.

Experiments
Data. We use the VUA corpus (Steen et al., 2010) which was made available for the shared task (Leong et al., 2020). We train a model on all the available training data, not just those marked for the verbs or all-pos subtasks, because we found this improved performance. We split the data in an 8:1 ratio, ensuring that the split puts 1/9 of each subtask's data in the development set; details of the splits are shown in Table 1.
Training Details. We train in batches of 32 sentences, and employ early stopping after 20 stable steps (based on F1 on dev). As an optimiser, we use AdamW (Loshchilov and Hutter, 2017). We experimented with three fine-tuning options: (1) unfreezing the whole network and training it all at once, (2) freezing BERT and training until early stopping activates, then unfreezing BERT and training until early stopping again, and (3) freezing BERT and training until early stopping, then sequentially unfreezing and training a single layer of BERT at a time, and finally the whole model at once (inspired by Felbo et al., 2017). We used option (2) in the end, since it offered a large improvement over (1) when we used a lower learning rate for the second phase. We found that (3) offered no additional advantage. To find hyperparameters, we performed a random search over the parameter space; final hyperparameters are reported in Table 2.
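The early stopping criterion above could be sketched as below. The sketch reads "20 stable steps" as 20 consecutive evaluations without an improvement in development F1, which is our interpretation for the purposes of illustration; `step_fn` and `eval_fn` are hypothetical callables standing in for one training step and one development-set evaluation.

```python
def train_with_early_stopping(step_fn, eval_fn, patience=20, max_steps=10_000):
    """Train until dev F1 has not improved for `patience` evaluations."""
    best_f1, best_step, stale = float("-inf"), -1, 0
    for step in range(max_steps):
        step_fn()       # one training update (placeholder)
        f1 = eval_fn()  # F1 on the development set (placeholder)
        if f1 > best_f1:
            best_f1, best_step, stale = f1, step, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_f1, best_step
```

Under fine-tuning option (2), a loop of this kind would run twice: once with BERT frozen, and once more after unfreezing it with a lower learning rate.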
Threshold Shifting. The ratio of metaphors to non-metaphors in the entire VUA dataset was not the same as that of the verb and all-pos subsets used by the Shared Task. Having trained the model on all the data, we then adjust it to each different distribution. To do this, we find the threshold for the sigmoid output that maximises the F1 score on each particular development set.
Ablation Study. To verify that each feature contributes useful information over just using a contextualised representation, we first conduct a simple ablation study, to see the performance impact of removing either the static word embeddings or concreteness ratings. We train four models (with the same hyperparameters as in Table 2), with different combinations of the concreteness model (CM) and static word embedding (SWE) model removed. Table 3 shows the results on the development set. The contextualised word embeddings (CWE) on their own perform the worst. Adding the static word embeddings in particular bolsters the performance (increasing it from 0.574 to 0.636 F1). The type-level concreteness annotation also helps, but not quite as much. The combination of all three features achieves the highest F1 score.
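The threshold shifting step described above amounts to a one-dimensional search over the decision threshold. The sketch below uses a fixed grid of candidate thresholds, which is an assumption (the search procedure is not specified in the text), and the function names are hypothetical.

```python
def f1_score(gold, pred):
    """F1 of the positive (metaphor) class for boolean labels."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(gold, probs, candidates=None):
    """Pick the sigmoid threshold that maximises F1 on a dev set."""
    if candidates is None:
        # Assumed grid: 0.01, 0.02, ..., 0.99.
        candidates = [i / 100 for i in range(1, 100)]
    return max(candidates,
               key=lambda t: f1_score(gold, [p >= t for p in probs]))


# Toy development set with made-up probabilities.
gold = [True, True, False, False]
probs = [0.40, 0.35, 0.20, 0.10]
threshold = best_threshold(gold, probs)
```

Running the search separately on the verbs and all-pos development sets gives one threshold per subtask while the underlying model stays fixed.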

Shared Task Performance
The Shared Task results are computed as the F1 score on held-out test data. Our results are presented in Table 4, alongside the results from the previous highest-performing system (Wu et al., 2018) from the 2018 Shared Task (Leong et al., 2018), and the highest-performing system on this shared task. Through the use of contextualised representations and concreteness ratings, we are able to improve substantially over the best submission to the 2018 shared task (Wu et al., 2018) for metaphor detection on the VUA corpus (Steen et al., 2010), by 8.0% F1. We trail the winner of the 2020 task by an average of 5.0% F1.

Conclusion
We participated in the 2020 Metaphor Detection Shared Task (Leong et al., 2020). Our model was designed to exploit knowledge of lexical concreteness and contextual meaning to identify metaphors. Our results improved over the previous best performing system by an average of 8.0% F1, but trailed behind the leader of the task by 5.0%. In future work, we are keen to explore first training a model to identify concreteness in context, then fine-tuning this to metaphor identification, based on the reasoning presented in §2.