Modality Enriched Neural Network for Metaphor Detection

Metaphor, as a cognitive mechanism in the human conceptual system, manifests itself as an effective means of language communication. Although intuitively sensible for humans, metaphor detection is still a challenging task due to the subtle ontological differences between metaphorical and non-metaphorical expressions. This work proposes a modality-enriched deep learning model for tackling this unsolved issue. It provides a new perspective for understanding metaphor as a modality shift, as in 'sweet voice'. It also attempts to enhance metaphor detection by combining deep learning with effective linguistic insight. Extending the work of Wan et al. (2020), we concatenate word sensorimotor scores (Lynott et al., 2019) with word vectors as the input to an attention-based Bi-LSTM, using a benchmark dataset, the VUA corpus. The experimental results show a clear F1 improvement (above 0.5%) of the proposed model over other methods on record, demonstrating the usefulness of leveraging modality norms for metaphor detection.


Introduction
Metaphor is a common linguistic mechanism that encodes versatile ontological information, accompanied by phenomena such as domain transfer, concreteness change (Maudslay et al., 2020) and semantic surprise (Zhang and Barnden, 2013). In general, metaphor usually involves a certain concept transfer from one domain (Source) to another (Target), as demonstrated in the example 'sweet smile' (using the taste modality to describe the vision modality). It is a fundamental way for people to relate our physical and familiar social experiences to a multitude of other subjects and contexts that are more complex, implicit or less known (Lakoff and Johnson, 1980; Lakoff and Johnson, 2008).
Detecting metaphors in texts is essential for capturing the authentic meaning of a language unit (e.g. a word, phrase or even sentence) when the language goes beyond literal interpretation, which can benefit many language processing applications, such as machine translation, dialogue systems and sentiment analysis (Tsvetkov et al., 2014). Although detecting metaphors is subconsciously effortless for humans, it is actually a great challenge for existing technologies due to the subtle semantic and ontological differences between metaphorical and non-metaphorical expressions.
This work proposes a modality-enriched deep learning model for metaphor detection based on the idea that the basic senses (touch, hearing, smell, taste, vision and interoception) and action effectors (mouth/throat, hand/arm, foot/leg, head, torso) indicated by the sensorimotor norms (Lynott et al., 2019) of words provide crucial information for metaphoricity inference. As a next step beyond Wan et al. (2020), which only adopts simple statistical models, we build a more complex model combining neural networks with modality information. For standard reference, we adopt the benchmark dataset of the first and second shared tasks of metaphor detection, the VUA corpus.

Related Work
Automatic detection of metaphors has been intensively studied over the past decade in the Natural Language Processing community. Many approaches have been proposed, with systems based on traditional machine learning, deep neural networks, transformers and sequential models. Commonly utilized features include n-grams, word2vec embeddings, parts of speech, semantic classes, lexical concreteness, and constructions and frames, among others (Bizzoni and Ghanimifard, 2018; Hong, 2016; Klebanov et al., 2014; Klebanov et al., 2015; Wilks et al., 2013). The primary methods in metaphor detection, including the earliest attempts, mainly adopt the supervised machine learning paradigm, training on pre-labelled datasets (Blanchard et al., 2014; Steen, 2010) with the various features mentioned above. This can be attested by the recent system papers for the 1st and 2nd shared tasks of Metaphor Detection. Although 90% of the systems adopt deep learning architectures (Dankers et al., 2019; Gao et al., 2018; Gutierrez et al., 2017), substantial linguistic resources are employed for system refinement, such as data with concreteness and imageability information, semantic encodings from WordNet, FrameNet, VerbNet and the SUMO ontology, property norms, syntactic dependency patterns, and sensorial and vision-based information. Such external resources provide different linguistic cues for metaphoricity inference, demonstrating their effectiveness for metaphor detection to various degrees (Alnafesah et al., 2020; Klebanov et al., 2016). To name a few advances, Brooks and Youssef (2020) build an ensemble model of RNNs together with attention-based Bi-LSTMs. Others adopt BERT to obtain sentence embeddings and then apply a linear layer with softmax to each token for metaphoricity predictions. Gong et al. (2020) use RoBERTa to obtain word embeddings, concatenate them with linguistic features (e.g. WordNet, VerbNet, POS, topicality, concreteness), and feed them into a fully-connected feedforward network to make predictions. Maudslay et al. (2020) combine the concreteness of a word with its static and contextual embeddings as inputs to a deep multi-layer perceptron network for predicting metaphoricity.
Despite the above advances, metaphor detection remains a difficult task because the semantic and ontological differences between metaphorical and non-metaphorical expressions are often subtle, and their perception may vary from person to person. Existing methods show different strengths in detecting metaphors, yet each has its respective disadvantages, such as generalization problems or a lack of association between their results and the intrinsic properties of metaphors. Besides, the performance of metaphor detection on record is still unsatisfactory (0.6-0.8 F1). In this work, we propose an innovative deep learning model for metaphor detection combining the strengths of neural networks with sensorimotor norms. Our work provides a new perspective for understanding metaphors as a linguistic mechanism accompanied by concept shifts indicated by sensorimotor norm scores. It also aims to test the effectiveness of such scores encoded in words for the token-level metaphor detection task. Ultimately, we hope to improve the task of metaphor detection in a consistent way.

The VUA Corpus
We adopt the VU Amsterdam Metaphor Corpus (VUA) (Tekiroglu et al., 2015) for the experiments. This corpus is a benchmark dataset released for the shared tasks of metaphor detection, which is publicly available for standard reference. It is a subcorpus of the British National Corpus with manually annotated labels indicating the metaphoricity of each token in the corpus. It consists of 117 text fragments sampled across four genres: Academic, News, Conversation, and Fiction. The data has been annotated using the MIPVU procedure (Steen, 2010) with strong inter-annotator agreement (k > 0.8). Examples of the sentences in the corpus are demonstrated in Figure 1 below. Each token in the corpus is encoded with a unique id composed of the text fragment id, the sentence id and the token sequence number within the sentence. Thus, the word 'corporate' in the first sentence of Figure 1 has the id 'a1e-fragment01 1 2'. Each metaphorical expression is marked with the label 'M' for distinction. Based on these gold labels, we are able to conduct supervised training and testing.

The Sensorimotor Norms
One key innovation of this study is that we propose to leverage modality information for metaphor detection, motivated by a common phenomenon in metaphorical expressions: we observe many mismatched modalities between a metaphorical word and its neighbouring words. For instance, the word 'head' in the 5th line of Figure 1 has dominant modalities of 'Vision' and 'Haptic', while its following token 'Investments' has no obvious modality senses and is thought to be only weakly stimulated in terms of perception. The apparent modality difference between the two words implies a possible metaphorical usage, as confirmed by the annotated label in the corpus.
Thus, the Lancaster Sensorimotor Norms collected by Lynott et al. (2019) are used to enrich the deep learning model with modality scores. The data include measures of sensorimotor strength (on a 0-5 scale indicating different degrees of sense modalities and action effectors) for around 40K English words across six perceptual modalities: touch, hearing, smell, taste, vision and interoception, as well as five action effectors: mouth/throat, hand/arm, foot/leg, head (excluding mouth/throat), and torso.
To demonstrate the structure of the sensorimotor norms, we provide five example words and their six sensory scores in Table 1. The modality with the highest score (highlighted) among the six senses marks the dominant sense modality for each word, such as 'Visual' for the word 'Big' and 'Gustatory' for the word 'Eat'. The six basic senses as well as the action effectors are directly linked to human experience, reflecting the two basic infrastructures in the human perceptual and conceptual system, and can be used as a very useful resource for linguistic modelling (Wan et al., 2020).
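The dominant-modality reading described above can be sketched as a simple lookup: pick the perceptual dimension with the highest strength rating. This is an illustrative sketch only; the scores below are invented for demonstration and do not come from the Lancaster norms themselves.

```python
# Hypothetical sensorimotor-style scores (0-5 scale) for two words,
# keyed by the six perceptual modalities used in the Lancaster norms.
NORMS = {
    "sweet": {"auditory": 0.9, "gustatory": 4.2, "haptic": 0.7,
              "interoceptive": 1.1, "olfactory": 2.4, "visual": 1.5},
    "voice": {"auditory": 4.6, "gustatory": 0.2, "haptic": 0.3,
              "interoceptive": 0.8, "olfactory": 0.1, "visual": 1.0},
}

def dominant_modality(word):
    """Return the perceptual modality with the highest strength rating."""
    scores = NORMS[word]
    return max(scores, key=scores.get)
```

Under these toy scores, 'sweet' is dominantly gustatory while 'voice' is dominantly auditory, which is exactly the kind of modality mismatch that flags 'sweet voice' as a candidate metaphor.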

The Modality Enriched Model
In the model, words are processed with the integration of modality scores and word embeddings, as demonstrated in Figure 2 below. We map the words to the sensorimotor norms and obtain the modality representations (64 dimensions for each word). At the same time, we obtain the word vectors (300 dimensions for each word) using GloVe (Pennington et al., 2014) and then concatenate the two as inputs to the neural network. For words not covered by the sensorimotor norms, we experimented with three kinds of fallback values: all zeros, random values drawn from a normal distribution, and the average scores of the sensory words in the corpus. In the end, we chose to use the average score for each dimension of the out-of-dictionary words, for optimization reasons.
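The input construction described above can be sketched as follows. This is a minimal sketch, not the authors' code: the dimensions (here an 11-dimensional modality vector, one per modality and effector, alongside 300-dimensional GloVe vectors) and the function names are our assumptions, and the out-of-dictionary fallback uses the per-dimension average described in the text.

```python
import numpy as np

EMB_DIM, MOD_DIM = 300, 11  # 6 perceptual modalities + 5 action effectors (assumed sizes)

def build_inputs(tokens, glove, norms):
    """Concatenate a GloVe vector and a modality vector for each token."""
    in_dict = [norms[t] for t in tokens if t in norms]
    # Fallback for out-of-dictionary words: per-dimension average over
    # the in-dictionary words (zeros if none are covered).
    avg = np.mean(in_dict, axis=0) if in_dict else np.zeros(MOD_DIM)
    rows = []
    for t in tokens:
        emb = glove.get(t, np.zeros(EMB_DIM))  # unknown words get a zero embedding
        mod = norms.get(t, avg)
        rows.append(np.concatenate([emb, mod]))
    return np.stack(rows)  # shape: (len(tokens), EMB_DIM + MOD_DIM)
```

Each row of the resulting matrix is one token's enriched representation, ready to be fed to the Bi-LSTM.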

Figure 2: The Modality Enriched Model
The Bi-LSTM layer produces a hidden state for each word in a sentence. We use these states to calculate attention weights, which are multiplied with the output of the Bi-LSTM layer. Let H ∈ R^{d×N} be a matrix of hidden vectors [h_1, h_2, ..., h_N] produced by the LSTM, where d is the size of the hidden layers and N is the length of the given sentence. The attention mechanism generates an attention weight vector α = softmax(w^T tanh(H)), and the final sentence representation is given by r = Hα^T. We also add an additional linear layer, so the final probability distribution is ŷ = softmax(Wr + b). Let y be the target distribution for a sentence and ŷ the predicted metaphoricity distribution. The model is trained to minimize the cross-entropy error between y and ŷ over all sentences.
Finally, we obtain a probability distribution over the 0/1 labels, which is used both to train the model and as the prediction result.
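The attention pooling and output layer described above can be sketched in numpy. This is one common formulation of attention over recurrent states (a weight vector w scoring tanh-transformed hidden states); the exact parameterization used in the paper is not specified, so the details here are our assumption, not the authors' implementation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(H, w):
    """Attention-weighted sentence representation over Bi-LSTM states.

    H: (d, N) matrix of hidden vectors, one column per token.
    w: (d,) learned attention weight vector (assumed parameterization).
    """
    alpha = softmax(w @ np.tanh(H))  # (N,) attention weights over tokens
    r = H @ alpha                    # (d,) sentence representation
    return r, alpha

def predict(r, W, b):
    """Linear layer + softmax giving the 0/1 metaphoricity distribution."""
    return softmax(W @ r + b)
```

The attention weights sum to one over the tokens, so r is a convex combination of the hidden states; training would minimize the cross-entropy between this output distribution and the gold labels.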

Evaluation Results
In order to evaluate the effectiveness of the proposed model for metaphor detection, we randomly select a development set (4,380 tokens) from the training set (17,240 tokens) in proportion to the Train/Test ratio of the shared task. The evaluation results are summarized in Table 2. In Table 2, the baseline of using unigrams as features and logistic regression (LR) as the classifier is implemented for a basic comparison; it is a commonly adopted baseline in metaphor detection tasks. We also implement several sub-categories of approaches before trying the enriched model, including the linguistic and neural network components both separately and in combination. The results show an 18% F1 improvement of the enriched model over the baseline, a 7% F1 improvement over the pure linguistic model, and a 1.5% F1 improvement over the pure neural network model, and this superiority is salient and consistent in terms of both P (Precision) and R (Recall).

Peer Comparisons
To further demonstrate the effectiveness of our method, the following table presents comparisons of our system with highly related recent works on the same task. All the results are publicly available. The detailed results are displayed in Table 3. Our method obtains very promising results: it outperforms 6/7 highly related works, in most cases to a great extent (0.5%-11% F1 gain), and comes within reach (a 4% F1 gap) of the top-ranked work on record. Moreover, our results are consistently superior to the top baseline and other linguistically-based or deep learning approaches. This suggests the effectiveness of leveraging modality norms in neural networks for metaphor detection, echoing the hypothesis in Wan et al. (2020) that metaphor manifests a concept mismatch (a modality shift in particular) between source and target.

Conclusion and Future Work
We presented a linguistically enhanced method for metaphor detection using modality features plus word embeddings on the basis of an attention-based neural network, as further evidence for Wan et al. (2020)'s proposal of utilizing conceptual norms for metaphor detection. The proposed model with access to modality scores achieves results that 1) are significantly higher than those of a model without such information, and 2) lead most related works on record. The results of the study are in line with previous work and strengthen the conclusion that conceptual modality is crucial to metaphors (relevant both to our understanding of metaphors and to NLP). In the future, we will adopt more advanced deep learning architectures and introduce modality information through paradigms other than embedding concatenation, with the hope of further enriching this line of work.