Using Conceptual Norms for Metaphor Detection

This paper reports a linguistically-enriched method for detecting token-level metaphors in the second shared task on Metaphor Detection. We participate in all four phases of the competition, i.e. the Verbs and AllPOS tracks on both the VUA and TOEFL datasets. We use the modality exclusivity and embodiment norms to construct a conceptual representation of the nodes and their context. Our system obtains an F-score of 0.652 on the VUA Verbs track, which is 5% higher than the strong baselines. The experimental results across models and datasets indicate the salient contribution of modality exclusivity and modality shift information for predicting metaphoricity.


Introduction
Metaphors are a kind of figurative language that uses conceptual mapping to represent one thing (the target domain) as another (the source domain). As proposed by Lakoff and Johnson (1980) in Conceptual Metaphor Theory (CMT), metaphor is not only a property of language but also a cognitive mechanism that describes our conceptual system. Metaphors are thus devices that transfer the properties of one domain to another, unrelated or different, domain, as in 'sweet voice' (using taste to describe sound).
Metaphors are prevalent in daily life and play a significant role in how people interpret and understand complex concepts. As a popular linguistic device, metaphors encode versatile ontological information, usually involving, e.g., domain transfer (Ahrens et al., 2003; Ahrens, 2010; Ahrens and Jiang, 2020), sentiment reversal (Steen et al., 2010) or modality shift (Winter, 2019). Therefore, detecting metaphors in texts is essential for capturing their authentic meaning, which can benefit many natural language processing applications, such as machine translation, dialogue systems and sentiment analysis (Tsvetkov et al., 2014). In this shared task, we aim to detect token-level metaphors in plain texts by focusing on content words (verbs, nouns, adjectives and adverbs) in two corpora: VUA 1 and TOEFL 2 . To better understand the intrinsic properties of metaphors and to provide an in-depth analysis of this phenomenon, we propose a linguistically-enriched model that tackles this task with the use of modality exclusivity and embodiment norms (see details in Section 3).

Related Work
Many approaches have been proposed for the automatic detection of metaphors, using features such as lexical information (Klebanov et al., 2014; Wilks et al., 2013), semantic classes (Klebanov et al., 2016), concreteness (Klebanov et al., 2015), word associations (Xiao et al., 2016), and constructions and frames (Hong, 2016), and systems such as traditional machine learning classifiers (Rai et al., 2016), deep neural networks (Do Dinh and Gurevych, 2016) and sequential models (Bizzoni and Ghanimifard, 2018). Despite many advances in the above work, metaphor detection remains a challenging task. The semantic and ontological differences between metaphorical and non-metaphorical expressions are often subtle, and their perception may vary from person to person. To tackle such problems, researchers resort to specific domain knowledge (Tsvetkov et al., 2014), lexicons (Mohler et al., 2013; Dodge et al., 2015), supervised methods (Klebanov et al., 2014, 2015, 2016) or attention-based deep learning models that capture latent patterns (Igamberdiev and Shin, 2018). These methods show different strengths in detecting metaphors, yet each has its respective disadvantages, such as generalization problems or a lack of association between their results and the intrinsic properties of metaphors. In addition, the performances reported for metaphor detection so far (around 0.6 F1 in the last shared task) (Leong et al., 2018) are still not promising. This calls for further endeavours in all aspects.
In this work, we adopt supervised machine learning algorithms based on four categories of features: linguistic norms; n-gram collocations of words, lemmas and POS tags; word embeddings and cosine similarity between the target nodes and their neighboring words; as well as the strong baselines provided by the organizers of the shared task (Leong et al., 2018; Klebanov et al., 2014, 2015, 2016). Moreover, we use several statistical models and ensemble learning strategies during training and testing so as to test the cross-model consistency of the improvement from the various features. The methods are described in detail in the following sections.

Feature Sets
This work uses four categories of features (16 subsets in all) to represent the nodes and contextual information at hierarchical levels, which include the lexical and syntactic-to-semantic information, sensory modality scales, embodiment ratings (of verbs only), as well as word vectors of the nodes and cosine similarity of node-neighbor pairs, as detailed below.
• Linguistic Norms: Two linguistic norms are used to construct four linguistically-enriched feature sets in the jsonlines format: 3
  - ME (modality exclusivity): 42-dimension representation of the target nodes, containing the mapped sensorimotor values in the modality norms;
  - DM (dominant modality): 1 × 5-dimension node-neighbor pair information (five lexical neighboring words), representing the dominant modality of the target node and the surrounding lexical words;
  - EB (embodiment): 2-dimension node representation, including the embodiment rating and its standard deviation;
  - EB-diff (embodiment differences): 2 × 5-dimension node-neighbor pair information (five lexical neighboring words).
The ME and DM feature sets are constructed using the Lancaster Sensorimotor Norms collected by Lynott et al. (2019). The data include measures of sensorimotor strength (a 0-5 scale indicating different degrees of sense modalities/action effectors) for 39,707 English words across six perceptual modalities: touch, hearing, smell, taste, vision and interoception, and five action effectors: mouth/throat, hand/arm, foot/leg, head (excluding mouth/throat) and torso. 4 As sensorimotor information plays a fundamental role in cognition, these norms provide a valuable knowledge representation of the conceptual categories of the target and neighboring words, which serve as salient features for inferring metaphors.
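The ME and DM features described above can be sketched as a simple lookup over the norms. The snippet below is an illustrative sketch, not the paper's actual implementation: the file name, column layout and the -1 fallback for uncovered words are assumptions for demonstration.

```python
# Sketch of the ME and DM feature construction, assuming the Lancaster
# Sensorimotor Norms have been loaded into a dict mapping each word to its
# 11 mean strength ratings (6 perceptual modalities + 5 action effectors).
import csv

MODALITIES = ["touch", "hearing", "smell", "taste", "vision", "interoception",
              "mouth", "hand", "foot", "head", "torso"]

def load_norms(path):
    """Load a CSV of norms (hypothetical layout: one 'Word' column plus
    one column per modality) into a word -> ratings-list dict."""
    norms = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            norms[row["Word"].lower()] = [float(row[m]) for m in MODALITIES]
    return norms

def dominant_modality(word, norms):
    """Index of the strongest modality, or -1 if the word is uncovered."""
    profile = norms.get(word.lower())
    if profile is None:
        return -1
    return max(range(len(profile)), key=lambda i: profile[i])

def dm_features(node, neighbors, norms):
    """DM feature: dominant modality of the node and its five neighbors."""
    return [dominant_modality(w, norms) for w in [node] + neighbors]
```

For example, a word whose strongest rating is on the taste dimension yields index 3, so a taste word modifying a hearing word (as in 'sweet voice') produces a visible modality shift in the DM vector.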
The EB and EB-diff feature sets are constructed using the embodiment norms for 687 English verbs collected by Sidhu et al. (2014). Research examining semantic richness effects has shown that multiple dimensions of meaning are activated in the process of word recognition (Yap et al., 2011). These norms apply the semantic richness approach (Sidhu et al., 2014, 2016) to verb stimuli in order to investigate how verb meanings are represented. The relative embodiment ratings (a 1-7 scale indicating different degrees of bodily involvement) revealed that bodily experience was judged to be more important to the meanings of some verbs (e.g., dance, breathe) than to others (e.g., evaporate, expect), suggesting that relative embodiment is an important aspect of verb meaning and can be a useful indicator of the meaning mismatch in figurative uses of verbs.
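A minimal sketch of the EB and EB-diff features, under the assumption that the embodiment norms are loaded as a `{verb: (mean_rating, sd)}` dict; backing off to zeros for uncovered words is an illustrative choice that mirrors the sparseness of these norms discussed in the evaluation:

```python
# Illustrative EB / EB-diff feature construction (hypothetical data layout).
def eb_feature(word, emb_norms):
    """2 values: embodiment rating and its standard deviation;
    words absent from the 687-verb norms fall back to zeros."""
    mean, sd = emb_norms.get(word.lower(), (0.0, 0.0))
    return [mean, sd]

def eb_diff_features(node, neighbors, emb_norms):
    """2 x 5 values: per-neighbor differences in rating and SD
    between the target node and its five lexical neighbors."""
    n_mean, n_sd = eb_feature(node, emb_norms)
    feats = []
    for w in neighbors:
        m, s = eb_feature(w, emb_norms)
        feats.extend([n_mean - m, n_sd - s])
    return feats
```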
• Collocations: Three sets of collocational features are constructed to represent the lexical, syntactic and grammatical information of the nodes and their neighbors: Trigram, FL (Fivegram Lemma) and FPOS (Fivegram POS tags). The two corpora are lemmatized using the nltk WordNetLemmatizer 5 and POS tagged using the nltk averaged perceptron tagger 6 before constructing these features.
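The collocational features above can be sketched as follows; the window helper and the exact windowing convention (symmetric, centred on the target node) are illustrative assumptions, while the nltk lemmatizer and tagger are those named above:

```python
# Sketch of the Trigram / FL / FPOS collocation features using nltk.
import nltk
from nltk.stem import WordNetLemmatizer

def window(tokens, idx, size):
    """Symmetric context window of up to `size` items centred on idx."""
    half = size // 2
    lo, hi = max(0, idx - half), idx + half + 1
    return tokens[lo:hi]

def collocation_features(tokens, idx):
    """Word trigram, lemma fivegram and POS fivegram around token `idx`.
    Requires the 'wordnet' and 'averaged_perceptron_tagger' nltk data."""
    lemmatizer = WordNetLemmatizer()
    lemmas = [lemmatizer.lemmatize(t.lower()) for t in tokens]
    pos = [tag for _, tag in nltk.pos_tag(tokens)]
    return {
        "Trigram": window(tokens, idx, 3),
        "FL": window(lemmas, idx, 5),
        "FPOS": window(pos, idx, 5),
    }
```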
• Word Embeddings: For comparison, we utilise distributional vector representations of word meaning for the nodes, based on the distributional hypothesis (Firth, 1957; Lenci, 2018). Two pre-trained Word2Vec models (GoogleNews.300d and Internal-W2V.300d, pre-trained on the VUA and TOEFL corpora) and GloVe vectors are used. GoogleNews 7 is pre-trained using the continuous bag-of-words architecture for computing vector representations of words (Church, 2017). GloVe 8 is an unsupervised learning algorithm for obtaining vector representations of words; we use the 300d vectors pre-trained on Wikipedia 2014 + Gigaword 5 (Pennington et al., 2014).
• Cosine Similarity: We also investigate cosine similarity (CS) measures for computing word sense distances between the nodes and their neighboring lexical words, based on the hypothesis that word pairs with distant meanings are more likely to signal metaphorical use. Three sets of CS features are constructed using the three word embedding models above: CS-Google, CS-GloVe and CS-Internal.
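The CS features can be sketched with numpy; the `vectors` lookup stands in for any of the three embedding models, and falling back to a zero vector (similarity 0) for out-of-vocabulary words is an illustrative assumption:

```python
# Sketch of the cosine-similarity (CS) features between a target node
# and its neighboring lexical words, given a word -> vector dict.
import numpy as np

def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is zero."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

def cs_features(node, neighbors, vectors, dim=300):
    """One similarity value per neighboring lexical word."""
    zero = np.zeros(dim)
    nv = vectors.get(node, zero)
    return [cosine(nv, vectors.get(w, zero)) for w in neighbors]
```

Under the hypothesis above, a metaphorical node should tend to produce low similarities with its literal neighbors, since its meaning is drawn from a distant source domain.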
These features constitute a rather comprehensive representation of the mismatch of the nodes and their neighbors in terms of senses, domains, modalities, agentivity and concreteness etc, which are highly indicative of metaphorical uses and are hence hypothesized as more distinctive features than the strong baselines in Leong et al. (2018).

Classifiers and Experimental Setup
Three traditional classifiers are used for predicting the metaphoricity of tokens: Logistic Regression (LR), Linear SVC and a Random Forest classifier. The machine learning experiments are run through utilities in the SciKit-Learn Laboratory (SKLL) (Pedregosa et al., 2011). 9 For parameter tuning, we use grid search to find optimal parameters for the learners. Finally, we set up the following optimized parameters for the three classifiers:
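The tuning procedure above can be sketched directly in scikit-learn; the parameter grids below are placeholders for illustration, since the final optimized values are not reproduced here, and the synthetic data merely stands in for the feature matrices:

```python
# Illustrative grid search over the three classifiers named above.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Stand-in for the real feature matrix and metaphoricity labels.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

grids = {
    "LR": (LogisticRegression(max_iter=1000), {"C": [0.1, 1, 10]}),
    "LinearSVC": (LinearSVC(max_iter=5000), {"C": [0.1, 1, 10]}),
    "RF": (RandomForestClassifier(random_state=0), {"n_estimators": [50, 100]}),
}

best = {}
for name, (clf, grid) in grids.items():
    # F1 is the shared task's evaluation metric, so tune against it.
    search = GridSearchCV(clf, grid, scoring="f1", cv=5)
    search.fit(X, y)
    best[name] = search.best_params_
```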

Evaluation Results
In order to evaluate the discriminative power of the various features for metaphor detection and their fitness to the three classifiers, we focus on the VUA Verbs phase and randomly select a development set (4,380 tokens) from the training set in proportion to the Train/Test ratio. Experiments are run using the three classifiers and the setup in Section 4.
The evaluation results for the individual features in terms of F1-score are summarized in Table 1, where the top five features with the LR classifier are highlighted in bold. Results show that the best individual feature is ME, followed by B1, W2V.GloVe, Trigram and FL. Among the conceptual representations, the modality exclusivity features demonstrate outstanding performance, while the embodiment features perform quite poorly. This is due to the data sparseness of the embodiment feature representations: as the embodiment norms contain only 687 English verbs, they cannot cover most of the words in the two corpora of the shared task, which leaves many empty values in the feature matrix and results in poor performance on the task. Despite this, the embodiment features still help the overall performance when concatenated with other features, as shown in the next section. The performances of the three classifiers are quite close for all features, with LR performing slightly better.
To test the combined power of these features for metaphor detection, we also conduct an evaluation on fused features, as shown in Table 2. The results show that B2 is a stronger baseline than B3, so we use B2 as the comparison basis. Among the four categories of features, the linguistic and collocational features in combination with B2 achieve the greatest improvement, around 1.5% F1-score. The top three to five features also improve the performance by 1-2% F1-score. However, the word embedding and cosine similarity features show no improvement over baseline 2. Finally, we selected 12 features (excluding the W2V features) using the automatic feature selection algorithm and achieved the best evaluation results (.672 F1 for LR).

Results on Test Sets
We use the best feature sets and classifier (LR) from the above evaluation for the final submission. The released results of our system on the test sets of the four phases in terms of F1-score are summarized in Table 3, where 'L+B2' stands for 'linguistic features fused with baseline 2' and the best results are highlighted in bold. In addition to the best methods, we also submit the Top5 features and the 'L+B2' features, which all show consistent improvement (1-5% F1) over baseline 2. The evaluation results demonstrate the effectiveness of using the linguistic features, especially the modality exclusivity representations, for metaphor detection.

Comparison to other Works
To demonstrate the effectiveness of our method, this section compares our system to closely related systems that participated in the 2018 edition of the shared task on the VUA corpus. All the results are publicly available, as reported in Leong et al. (2018). We compare our results on the VUA-Verbs and VUA-AllPOS phases to the top three teams (T1-3), baseline 2 (B2) and the only team using linguistic features (Ling) in 2018; the detailed results are displayed in Table 4. Our method obtains very promising results: it beats the Top2 team in the Verbs phase and is close to Top3 in the AllPOS phase; moreover, our results are significantly superior to both the baseline and the other linguistically-based approach. This suggests the effectiveness of using conceptual features for metaphor detection, echoing the hypothesis that metaphor involves a concept mismatch between the source and target domains.

Conclusion
We presented a linguistically enhanced method for word-level metaphor detection using conceptual features of modality and embodiment with traditional classifiers. As the results suggest, the modality exclusivity and embodiment norms provide conceptual and bodily information for representing the nodes and the context, which substantially improves metaphor detection performance over the three strong baselines. It is noteworthy that our system did not employ any deep learning architectures, offering simplicity and model efficiency, yet it outperforms many sophisticated neural networks. In future work, we will combine the current feature sets with state-of-the-art deep learning models to further examine the effectiveness of this method for metaphor detection.