Contextual Modulation for Relation-Level Metaphor Identification

Identifying metaphors in text is a challenging task that requires comprehending the underlying comparison. The automation of this cognitive process has gained wide attention lately. However, the majority of existing approaches concentrate on word-level identification by treating the task as either single-word classification or sequential labelling, without explicitly modelling the interaction between the metaphor components. On the other hand, while existing relation-level approaches implicitly model this interaction, they ignore the context where the metaphor occurs. In this work, we address these limitations by introducing a novel architecture for identifying relation-level metaphoric expressions of certain grammatical relations based on contextual modulation. In a methodology inspired by works in visual reasoning, our approach conditions the neural network computation on the deep contextualised features of the candidate expressions using feature-wise linear modulation. We demonstrate that the proposed architecture achieves state-of-the-art results on benchmark datasets. The proposed methodology is generic and can be applied to other textual classification problems that benefit from contextual interaction.


Introduction
Despite its fuzziness, metaphor is a fundamental feature of language that defines the relation between how we understand things and how we express them (Cameron and Low, 1999). A metaphor is a figurative device containing an implied mapping between two conceptual domains. These domains are represented by its two main components, namely the tenor (target domain) and the vehicle (source domain) (End, 1986). According to the conceptual metaphor theory (CMT) of Lakoff and Johnson (1980), which we adopt in this work, a concept such as "liquids" (source domain/vehicle) can be borrowed to express another such as "emotions" (target domain/tenor) by exploiting single or common properties. Therefore, the conceptual metaphor "Emotions are Liquids" can be manifested through the use of linguistic metaphors such as "pure love", "stir excitement" and "contain your anger". The interaction between the target and the source concepts of the expression is important to fully comprehend its metaphoricity.
Over the last couple of years, there has been an increasing interest towards metaphor processing and its applications, either as part of natural language processing (NLP) tasks such as machine translation (Koglin and Cunha, 2019), text simplification (Wolska and Clausen, 2017; Clausen and Nastase, 2019) and sentiment analysis (Rentoumi et al., 2012) or in more general discourse analysis use cases such as in analysing political discourse (Charteris-Black, 2011), financial reporting (Ho and Cheng, 2016) and health communication (Semino et al., 2018).
Metaphor processing comprises several tasks including identification, interpretation and cross-domain mappings. Metaphor identification is the most studied among these tasks. It is concerned with detecting the metaphoric words or expressions in the input text and could be done either on the sentence, relation or word levels. The difference between these levels of processing is extensively studied in (Zayed et al., 2020). Identifying metaphors on the word-level could be treated as either sequence labelling, by deciding the metaphoricity of each word in a sentence given the context, or single-word classification, by deciding the metaphoricity of a targeted word. On the other hand, relation-level identification looks at specific grammatical relations such as the dobj or amod dependencies and checks the metaphoricity of the verb or the adjective given its association with the noun. In relation-level identification, both the source and target domain words (the vehicle and tenor) are classified either as a metaphoric or literal expression, whereas in word-level identification only the source domain words (vehicle) are labelled. These levels of analysis (paradigms) are already established in the literature and adopted by previous research in this area, as will be explained in Section 2. The majority of existing approaches, as well as the available datasets, pertaining to metaphor processing focus on the metaphorical usage of verbs and adjectives either on the word or relation levels. This is because these syntactic types exhibit metaphoricity more frequently than others according to corpus-based analysis (Cameron, 2003; Shutova and Teufel, 2010).
Although the main focus of both relation-level and word-level metaphor identification is discerning the metaphoricity of the vehicle (source domain words), the interaction between the metaphor components is less explicit in word-level analysis, whether the task is treated as sequence labelling or single-word classification. Relation-level analysis can be viewed as a deeper level of analysis that captures information not captured on the word-level, by modelling the influence of the tenor (e.g. noun) on the vehicle (e.g. verb/adjective). Some downstream tasks benefit from such information (i.e. explicitly marked relations); among these tasks are metaphor interpretation and cross-domain mapping. Moreover, employing the wider context around the expression is essential to improve the identification process.
This work focuses on relation-level metaphor identification represented by verb-noun and adjective-noun grammar relations. We propose a novel approach for context-based textual classification that utilises affine transformations. In order to integrate the interaction of the metaphor components in the identification process, we utilise affine transformation in a novel way to condition the neural network computation on the contextualised features of the given expression. The idea of affine transformations has been used in NLP-related and multimodal tasks such as visual question-answering (de Vries et al., 2017), dependency parsing (Dozat and Manning, 2017), semantic role labelling (Cai et al., 2018), coreference resolution (Zhang et al., 2018), visual reasoning (Perez et al., 2018) and lexicon features integration (Margatina et al., 2019). Inspired by the works on visual reasoning, we use the candidate expression of certain grammatical relations, represented by deep contextualised features, as an auxiliary input to modulate our computational model. Affine transformations can be utilised to process one source of information in the context of another. In our case, we want to integrate: 1) the deep contextualised features of the candidate expression (represented by ELMo sentence embeddings) with 2) the syntactic/semantic features of a given sentence. In this task, affine transformations play a similar role to attention but with more parameters, which allows the model to better exploit context; they can therefore be regarded as a more sophisticated form of attention. Whereas standard attention models are comparatively simple, our model prioritises the contextual information of the candidate to discern its metaphoricity in a given sentence.
Our proposed model consists of an affine transform coefficients generator that captures the meaning of the candidate to be classified, and a neural network that encodes the full text in which the candidate needs to be classified. We demonstrate that our model significantly outperforms the state-of-the-art approaches on existing relation-level benchmark datasets. The unique characteristics of tweets and the availability of Twitter data motivated us to identify metaphors in such content. Therefore, we evaluate our proposed model on a newly introduced dataset of tweets (Zayed et al., 2019) annotated for relation-level metaphors.

Related Work
Word-Level Processing: Do Dinh and Gurevych (2016) were the first to utilise a neural architecture to identify metaphors. They approached the problem as sequence labelling, where a traditional fully-connected feed-forward neural network is trained using pre-trained word embeddings. The authors highlighted the limitation of this approach when dealing with short and noisy conversational texts. As part of the NAACL 2018 Metaphor Shared Task (Leong et al., 2018), many researchers proposed neural models that mainly employ LSTMs (Hochreiter and Schmidhuber, 1997) with pre-trained word embeddings to identify metaphors on the word-level. The best performing systems are: THU NGN (Wu et al., 2018), OCOTA (Bizzoni and Ghanimifard, 2018) and bot.zen (Stemle and Onysko, 2018). Gao et al. (2018) were the first to employ the deep contextualised word representation ELMo (Peters et al., 2018), combined with pre-trained GloVe (Pennington et al., 2014) embeddings, to train bidirectional LSTM-based models. The authors introduced a sequence labelling model and a single-word classification model for verbs. They showed that incorporating the context-dependent representation of ELMo with context-independent word embeddings improved metaphor identification. Mu et al. (2019) proposed a system that utilises a gradient boosting decision tree classifier. Document embeddings were employed in an attempt to exploit wider context to improve metaphor detection, in addition to other word representations including GloVe, ELMo and skip-thought (Kiros et al., 2015). Mao et al. (2018, 2019) explored the idea of selectional preference violation (Wilks, 1978) in a neural architecture to identify metaphoric words. Mao's proposed approaches emphasised the importance of the context to identify metaphoricity by employing context-dependent and context-independent word embeddings. Mao et al. (2019) also proposed employing multi-head attention to compare the targeted word representation with its context. An interesting approach was introduced by Dankers et al. (2019) to model the interplay between metaphor identification and emotion regression. The authors introduced multiple multi-task learning techniques that employ hard and soft parameter sharing methods to optimise LSTM-based and BERT-based models.
Relation-Level Processing: Shutova et al. (2016) focused on identifying the metaphoricity of adjective/verb-noun pairs. This work employed multimodal embeddings of visual and linguistic features. Their model employs the cosine similarity of the candidate expression components, based on word embeddings, to classify metaphors using an optimised similarity threshold. Rei et al. (2017) introduced a supervised similarity network to detect adjective/verb-noun metaphoric expressions. Their system utilises word gating, vector representation mapping and a weighted similarity function. Pre-trained word embeddings and attribute-based embeddings (Bulat et al., 2017) were employed as features. This work explicitly models the interaction between the metaphor components. Gating is used to modify the vector of the verb/adjective based on the noun; however, the surrounding context is ignored since only the candidates are fed as input to the neural model, which might lead to losing important contextual information.
Limitations: As discussed, the majority of previous works adopted the word-level paradigm to identify metaphors in text. The main distinction between the relation-level and the word-level paradigms is that the former makes the context more explicit than the latter, by providing information not only about where the metaphor is in the sentence but also about how its components come together, hinting at the relation between the tenor and the vehicle. Stowe and Palmer (2018) showed that the type of syntactic construction a verb occurs in influences its metaphoricity. On the other hand, existing relation-level approaches (Tsvetkov et al., 2014; Shutova et al., 2016; Bulat et al., 2017; Rei et al., 2017) ignore the context where the expression appears and only classify a given syntactic construction as metaphorical or literal. Studies showed that the context surrounding a targeted expression is important to discern its metaphoricity and fully grasp its meaning (Mao et al., 2018; Mu et al., 2019). Therefore, current relation-level approaches will only be able to capture commonly used conventionalised metaphors. In this work, we address these limitations by introducing a novel approach to textual classification which employs contextual information from both the targeted expression under study and the wider context surrounding it.

Proposed Approach
Feature-wise transformation techniques such as feature-wise linear modulation (FiLM) have recently been employed in many applications, showing improved performance. They became popular in image processing applications such as image style transfer (Dumoulin et al., 2017); then they found their way into multi-modal tasks, specifically visual question-answering (de Vries et al., 2017; Perez et al., 2018). They have also been shown to be effective for relational problems, as mentioned in Section 1. The idea behind FiLM is to condition the computation carried out by a neural model on the information extracted from an auxiliary input in order to capture the relationship between multiple sources of information (Dumoulin et al., 2018).
Our approach adopts Perez's (2018) formulation of FiLM on visual reasoning for metaphor identification. In visual reasoning, image-related questions are answered by conditioning the image-based neural network (visual pipeline) on the question context via a linguistic pipeline. In metaphor identification, we can consider that the image in our case is the sentence that has a metaphoric candidate and the auxiliary input is the linguistic interaction between the components of the candidate itself. This will allow us to condition the computation of a sequential neural model on the contextual information of the candidate and leverage the feature-wise interactions between the conditioning representation and the conditioned network. To the best of our knowledge, we are the first to propose such contextual modulation for textual classification in general and for metaphor identification specifically.
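The core FiLM operation can be sketched as follows (a minimal NumPy illustration; the use of a single linear layer per coefficient vector and all dimensions are illustrative assumptions, not our exact parameterisation):

```python
import numpy as np

rng = np.random.default_rng(0)

d_cond, d_feat, n_steps = 8, 6, 5   # illustrative dimensions

# Conditioning input, e.g. the representation of the candidate expression.
c_vn = rng.standard_normal(d_cond)

# The modulator: one linear map per coefficient vector (illustrative).
W_gamma, b_gamma = rng.standard_normal((d_feat, d_cond)), np.zeros(d_feat)
W_beta,  b_beta  = rng.standard_normal((d_feat, d_cond)), np.zeros(d_feat)
gamma = W_gamma @ c_vn + b_gamma    # scaling coefficients
beta  = W_beta  @ c_vn + b_beta     # shifting coefficients

# Features to be modulated, e.g. encoder hidden states, one row per time-step.
H = rng.standard_normal((n_steps, d_feat))

# Feature-wise linear modulation: scale and shift every feature of every step.
H_mod = gamma * H + beta            # broadcasting applies FiLM per time-step
```

Because the coefficients are functions of the auxiliary input, the same features are transformed differently for every candidate expression.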
Our proposed architecture consists of a contextual modulation pipeline and a metaphor identification linguistic pipeline as shown in Figure 1. The input to the contextual modulator is the deep contextualised representation of the candidate expression under study (which we will refer to as targeted expression 1 ) to capture the interaction between its components. The linguistic pipeline employs an LSTM encoder which produces a contextual representation of the provided sentence where the targeted expression appeared. The model is trained end-to-end to identify relation-level metaphoric expressions focusing on verb-noun and adjective-noun grammatical relations. Our model takes as input a sentence (or a tweet) and a targeted expression of a certain syntactic construction and identifies whether the candidate in question is used metaphorically or literally by going through the following steps:

Condition: In this step the targeted expression is used as the auxiliary input to produce a conditioning representation. We first embed each candidate of verb-direct object pairs 2 (v, n) using ELMo sentence embeddings to learn context-dependent aspects of word meanings, c_vn. We used the 1,024-dimensional ELMo embeddings pre-trained on the One Billion Word benchmark corpus (Chelba et al., 2014). The sentence embeddings of the targeted expression are then prepared by implementing an embeddings layer that loads these pre-trained ELMo embeddings from the TensorFlow Hub 3 . The layer takes in the raw text of the targeted expression and outputs a fixed mean-pooled vector representation of the input as the contextualised representation. This representation is then used as an input to the main component of this step, namely a contextual modulator. The contextual modulator consists of a fully-connected feed-forward neural network (FFNN) that produces the conditioning parameters (i.e. the shifting and scaling coefficients) that will later modulate the linguistic pipeline computations.
Given that c_vn is the conditioning input, the contextual modulator outputs γ and β, the context-dependent scaling and shifting vectors, as follows: γ = f_γ(c_vn) and β = f_β(c_vn), where f_γ and f_β are the learned feed-forward mappings of the contextual modulator.

Embed: Given a labelled dataset of sentences, the model begins by embedding the tokenised sentence S of words w_1, w_2, ..., w_n, where n is the number of words in S, into vector representations using GloVe embeddings. We used the uncased 200-dimensional GloVe embeddings pre-trained on ∼2 billion tweets, containing 1.2 million words.
Encode: The next step is to train a neural network with the obtained embeddings. Since context is important for identifying metaphoricity, a sentence encoder is a sensible choice. We use an LSTM sequence model to obtain a contextual representation which summarises the syntactic and semantic features of the whole sentence. The output of the LSTM is a sequence of hidden states h_1, h_2, ..., h_n, where h_i is the hidden state at the i-th time-step.
Feature-wise Transformation: In this step, an affine transformation layer, hereafter the AffineTrans layer, applies a feature-wise linear modulation to its inputs, which are: 1) the hidden states from the encoding step; 2) the scaling and shifting parameters from the conditioning step. By feature-wise, we mean that scaling and shifting are applied to each encoded vector for each word in the sentence, i.e. each hidden state h_i is modulated as γ ⊙ h_i + β, where ⊙ denotes element-wise multiplication.
Attend: Attention mechanisms are useful for selecting the most important elements in a given representation while minimising information loss. In this work, we employ an attention layer based on the mechanism presented by Lin et al. (2017). It takes the output from the AffineTrans layer as an input, in addition to a randomly initialised weight matrix W, a bias vector b and a learnable context vector u, to produce the attended output as follows: u_i = tanh(W h_i + b), α_i = softmax(u_i⊤ u), and the attended output is s = Σ_i α_i h_i. Our model is trained and evaluated with and without the attention mechanism in order to differentiate between the effect of the feature modulation and the attention on the model performance.
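The attention step can be sketched as follows (a minimal NumPy illustration of a Lin et al. (2017)-style mechanism with a learnable context vector; all sizes and weights are illustrative, not trained values):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())          # shift for numerical stability
    return e / e.sum()

rng = np.random.default_rng(1)
n_steps, d = 5, 6                    # illustrative sizes

H = rng.standard_normal((n_steps, d))  # AffineTrans output, one row per step
W = rng.standard_normal((d, d))        # weight matrix W
b = np.zeros(d)                        # bias vector b
u = rng.standard_normal(d)             # learnable context vector u

U = np.tanh(H @ W + b)               # hidden representation of each time-step
alpha = softmax(U @ u)               # attention weights over the time-steps
s = alpha @ H                        # attended summary vector
```

The weights alpha sum to one, so s is a convex combination of the modulated hidden states.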
Predict: The last step is to make the final prediction using the output from the previous step (the attended output in the case of using attention, or the AffineTrans layer output in the case of skipping it). We use a fully-connected feed-forward layer with a sigmoid activation that returns a single (binary) class label to identify whether the targeted expression is metaphoric or not.
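Putting the steps together, the forward pass can be sketched end-to-end as follows (a minimal NumPy illustration without attention; a plain tanh RNN stands in for the LSTM encoder, and all names, sizes and weights are illustrative assumptions rather than our actual implementation):

```python
import numpy as np

rng = np.random.default_rng(2)
d_emb, d_hid, d_cond = 6, 4, 8       # illustrative dimensions

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(sent_emb, cand_emb, params):
    # Condition: the modulator maps the candidate embedding to gamma/beta.
    gamma = params["Wg"] @ cand_emb
    beta  = params["Wb"] @ cand_emb
    # Encode: a plain tanh RNN stands in for the LSTM encoder.
    h, H = np.zeros(d_hid), []
    for x in sent_emb:
        h = np.tanh(params["Wx"] @ x + params["Wh"] @ h)
        H.append(h)
    H = np.stack(H)
    # Feature-wise transformation (AffineTrans): scale and shift each state.
    H = gamma * H + beta
    # Predict: pooled representation through a sigmoid output unit.
    return sigmoid(params["w_out"] @ H.mean(axis=0))

params = {
    "Wg": rng.standard_normal((d_hid, d_cond)),
    "Wb": rng.standard_normal((d_hid, d_cond)),
    "Wx": rng.standard_normal((d_hid, d_emb)),
    "Wh": rng.standard_normal((d_hid, d_hid)),
    "w_out": rng.standard_normal(d_hid),
}
sent = rng.standard_normal((7, d_emb))   # stand-in for GloVe word embeddings
cand = rng.standard_normal(d_cond)       # stand-in for the ELMo candidate vector
p = forward(sent, cand, params)          # probability of being metaphoric
```

In the trained model the parameters are learned end-to-end, and mean-pooling is replaced by the attention layer when attention is used.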

Datasets
The choice of annotated dataset for training the model and evaluating its performance is determined by the level of metaphor identification. Given the distinction between the levels of analysis, approaches addressing the task on the word-level are not fairly comparable to relation-level approaches since each task addresses metaphor identification differently. Therefore, the tradition of previous work in this area is to compare approaches addressing the task on the same level against each other on level-specific annotated benchmark datasets (Zayed et al., 2020). Following prior work in this area and in order to compare the performance of our proposed approach with other relation-level metaphor identification approaches, we utilise available annotated datasets that support this level of processing. The existing datasets are either originally prepared to directly support relation-level processing such as the TSV (Tsvetkov et al., 2014) dataset and the Tweets dataset by Zayed et al. (2019) or adapted from other word-level benchmark datasets to suit relation-level processing such as the adaptation of the benchmark datasets TroFi (Birke and Sarkar, 2006) and VU Amsterdam metaphor corpus (VUAMC) (Steen et al., 2010) by Zayed et al. (2020) and the adaptation of the MOH (Mohammad et al., 2016) dataset by Shutova et al. (2016). Due to space limitation, we include in Appendix A: 1) examples of annotated instances from these datasets showing their format as: sentence, targeted expression and the provided label; 2) the statistics of these datasets including their size and percentage of metaphors.
Relation-Level Datasets: These datasets focus on expressions of certain grammatical relations. Obtaining these relations could be done either automatically by employing a dependency parser or manually by highlighting targeted expressions in a specific corpus. Then, these expressions are manually annotated for metaphoricity given the surrounding context. There exist two benchmark datasets of this kind, namely the TSV dataset and Zayed et al. (2019) Tweets dataset, hereafter ZayTw dataset. The former focuses on discerning the metaphoricity of adjective-noun expressions in sentences collected from the Web and Twitter while the latter focuses on verb-direct object expressions in tweets.
Adapted Word-Level Datasets: Annotated datasets that support word-level metaphor identification are not suitable to support relation-level processing due to the annotation difference (Shutova, 2015; Zayed et al., 2020). To overcome the limited availability of relation-level datasets, there has been a growing effort to enrich and extend benchmark datasets annotated on the word-level to suit relation-level metaphor identification. Although it is non-trivial and requires extra annotation effort, Shutova et al. (2016) and Zayed et al. (2020) introduced adapted versions of the MOH, TroFi and VUAMC datasets to train and evaluate models that identify metaphors on the relation-level. Since the MOH dataset was originally created to identify metaphoric verbs on the word-level, its adaptation by Shutova et al. (2016), also referred to as MOH-X in several papers, focused on extracting the verb-noun grammar relations using a dependency parser. The dataset is relatively small and contains short and simple sentences that are originally sampled from the example sentences of each verb in WordNet (Fellbaum, 1998). The TroFi dataset was designed to identify the metaphoricity of 50 selected verbs on the word-level from the 1987-1989 Wall Street Journal (WSJ) corpus. The VUAMC (Steen et al., 2010) is the largest corpus annotated for metaphors and has been employed extensively by models developed to identify metaphors on the word-level. However, models designed to support relation-level metaphor identification cannot use it in its current state. Therefore, previous research focusing on relation-level processing (Rei et al., 2017; Bulat et al., 2017; Shutova et al., 2016; Tsvetkov et al., 2014) did not train, evaluate or compare their approaches using it. Recently, a subset of the VUAMC was adapted to suit relation-level analysis by focusing on the training and test splits provided by the NAACL metaphor shared task. This corpus subset as well as the TroFi dataset were adapted by Zayed et al. (2020) to suit identifying metaphoric expressions on the relation-level, focusing on verb-direct object grammar relations (i.e. dobj dependencies). The Stanford dependency parser was utilised to extract these relations, which were then filtered to ensure quality.

Experimental Setup
We employ a single-layer LSTM model with 512 hidden units. The Adadelta algorithm (Zeiler, 2012) is used for optimisation during the training phase and the binary cross-entropy is used as the loss function to fine-tune the network. The reported results are obtained using a batch size of 256 instances for the ZayTw dataset and 128 instances for the other employed datasets. An L2-regularisation weight of 0.01 is used to constrain the weights of the contextual modulator. In all experiments, we zero-pad the input sentences to the longest sentence length in the dataset. All the hyper-parameters were optimised on a randomly separated development set (validation set) by assessing the accuracy. We present here the best performing design choices based on experimental results but we highlight some other attempted considerations in Appendix B. We implemented our models using Keras (Chollet et al., 2015) with the TensorFlow backend. We are making the source code and best models publicly available 4 . To ensure reproducibility, we include the sizes of the training, validation and test sets in Appendix B as well as the best validation accuracy obtained on each validation set. All the results presented in this paper are obtained after running the experiments five times with different random seeds and taking the average.
In this work, we selected the following state-of-the-art models pertaining to relation-level metaphor identification for comparisons: the cross-lingual model by Tsvetkov et al. (2014), the multimodal system of linguistic and visual features by Shutova et al. (2016), the ATTR-EMBED model by Bulat et al. (2017) and the supervised similarity network (SSN) by Rei et al. (2017). We consider the SSN system as our baseline. For fair comparisons, we utilised the same data splits on the five employed benchmark datasets described in Section 4.

Excluding AffineTrans
We implemented a simple LSTM model to study the effect of employing affine transformations on the system performance. The input to this model is the tokenised sentence S which is embedded as a sequence of vector representations using GloVe. These sequences of word embeddings are then encoded using the LSTM layer to compute a contextual representation. Finally, this representation is fed to a feed-forward layer with a sigmoid activation to predict the class label. We used this model with and without the attention mechanism.
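This ablated baseline can be sketched as follows (a minimal NumPy illustration; a plain tanh RNN stands in for the LSTM, and all sizes and weights are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
d_emb, d_hid = 6, 4                        # illustrative dimensions

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

Wx = rng.standard_normal((d_hid, d_emb))   # input-to-hidden weights
Wh = rng.standard_normal((d_hid, d_hid))   # hidden-to-hidden weights
w_out = rng.standard_normal(d_hid)         # output layer weights

sent = rng.standard_normal((7, d_emb))     # stand-in for GloVe word embeddings

# Encode: run the recurrent encoder over the sentence; no modulation applied.
h = np.zeros(d_hid)
for x in sent:
    h = np.tanh(Wx @ x + Wh @ h)

# Predict: the final hidden state goes straight to a sigmoid output unit.
p = sigmoid(w_out @ h)
```

Comparing this baseline against the full model isolates the contribution of the AffineTrans layer.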

Results
We conduct several experiments to better understand our proposed model. First, we experiment with the simple model introduced in Section 5.2.
Then, we train the proposed models on the benchmark datasets discussed in Section 4. We experiment with and without the attention layer to assess its effect on the model performance. Furthermore, we compare our model to the current work that addresses the task on the relation-level, in line with our peers in this area. Tables 1 and 2 show our model performance in terms of precision, recall, F1-score and accuracy.
Since the source code of Rei's (2017) system is available online 5 , we trained and tested their model using the ZayTw dataset as well as the adapted VUAMC and TroFi datasets, in an attempt to study the ability of their model to generalise when applied to a corpus of a different text genre with wider metaphoric coverage, including less conventionalised metaphors.

Discussion
Overall performance. We analysed the model performance by inspecting the classified instances. We noticed that it did a good job identifying conventionalised metaphors as well as uncommon ones. Appendix A shows examples of classified instances by our system from the employed benchmark datasets. Our model achieves significantly better F1-score over the state-of-the-art SSN system (Rei et al., 2017) under the one-tailed paired t-test (Yeh, 2000) at p-value < 0.01 on three of the five employed benchmark datasets. Moreover, our architecture showed improved performance over the state-of-the-art approaches on the TSV and MOH datasets. It is worth mentioning that the sizes of their test sets are relatively small; therefore any change in a single annotated instance drastically affects the results. Moreover, the approach proposed by Tsvetkov et al. (2014) relies on hand-coded lexical features, which explains its high F1-score.
The effect of contextual modulation. When excluding the AffineTrans layer and only using the simple LSTM model, we observe a significant performance drop that shows the effectiveness of leveraging linear modulation. This layer adaptively influences the output of the model by conditioning the identification process on the contextual information of the targeted expression itself, which significantly improved the system performance, as observed from the results. Moreover, employing the contextualised representation of the targeted expression, through ELMo sentence embeddings, was essential to explicitly capture the interaction between the verb/adjective and its accompanying noun. Then, the AffineTrans layer was able to modulate the network based on this interaction.
The effect of attention. It is worth noting that the attention mechanism did not help much in our AffineTrans model because affine transformation itself could be seen as playing a similar role to attention, as discussed in Section 1. In attention mechanisms, important elements are given higher weights through scaling, whereas linear affine transformation performs shifting in addition to scaling, which gives prior importance (probability) to particular features. We are planning to perform an in-depth comparison of using affine transformation versus attention in our future work.
Error analysis. An error analysis was performed to determine the model's flaws by analysing its predictions. We examined the false positives and false negatives obtained by the best performing model, namely AffineTrans (without attention). Interestingly, the majority of false negatives are from the political tweets in the ZayTw dataset. Table 3 lists some examples of misclassified instances in the TSV and ZayTw datasets. Some instances could be argued as being correctly classified by the model. For instance, "spend capital" could be seen as a metaphor in that the noun is an abstract concept referring to actual money. Examples of misclassified instances from the other employed datasets are presented in Appendix A. Interestingly, we noticed that the model was able to spot mistakenly annotated instances. Although the adapted VUAMC subset contains various expressions which should help the model perform better, we noticed annotation inconsistency in some of them. For example, the verb "choose" associated with the noun "science" is annotated once as a metaphor and twice as literal in very similar contexts. This aligns well with the findings of Zayed et al. (2020), who questioned the annotation of around 5% of the instances in this subset mainly due to annotation inconsistency.
Analysis of some misclassified verbs. We noticed that sometimes the model got confused while identifying the metaphoricity of expressions where the verb is related to emotion and cognition, such as: "accept, believe, discuss, explain, experience, need, recognise, and want". Our model tends to classify them as not metaphors. We include different examples from the ZayTw dataset of the verbs "experience" and "explain" with different associated nouns along with their gold and predicted classifications in Appendix A. Our model's prediction seems reasonable given that the instances in the training set were labelled as not metaphors. It is not clear why the gold label for "explain this mess" is not a metaphor while it is a metaphor for "explain implications"; similarly, the nouns "inspirations" and "emotions" with the verb "experience".

Conclusions
In this paper, we introduced a novel architecture to identify metaphors by utilising feature-wise affine transformation and deep contextual modulation. Our approach employs a contextual modulation pipeline to capture the interaction between the metaphor components. This interaction is then used as an auxiliary input to modulate a metaphor identification linguistic pipeline. We showed that such modulation allowed the model to dynamically highlight the key contextual features to identify the metaphoricity of a given expression. We applied our approach to relation-level metaphor identification to classify expressions of certain syntactic constructions for metaphoricity as they occur in context. We significantly outperform the state-of-the-art approaches for this level of analysis on benchmark datasets. Our experiments also show that our contextual modulation-based model can generalise well to identify the metaphoricity of unseen instances in different text types, including the noisy user-generated text of tweets. Our model was able to identify both conventionalised common metaphoric expressions and less common ones. To the best of our knowledge, this is the first attempt to computationally identify metaphors in tweets and the first approach to study the employment of feature-wise linear modulation on metaphor identification in general. The proposed methodology is generic and can be applied to a wide variety of text classification approaches including sentiment analysis or term extraction.

A.1 Datasets Statistics

Table 4 shows the statistics of the benchmark datasets employed in this work, namely the relation-level datasets TSV 6 and ZayTw in addition to the adapted TroFi 7 , VUAMC 8 and MOH 9 datasets. Table 5 shows examples of annotated instances from each dataset.

A.2 Datasets Analysis
Examples of correctly classified instances from the employed datasets: We show examples of instances correctly classified by our best-performing model. Table 6 comprises examples from the relation-level datasets TSV and ZayTw. Table 7 lists examples from the adapted MOH and TroFi datasets as well as the adapted VUAMC.
Examples of misclassified instances by our model in the tweets dataset: Examples of misclassified instances from the TSV and ZayTw datasets as well as the adapted MOH, TroFi and VUAMC datasets are given in Table 8. Our model spotted some instances that are mistakenly annotated in the original datasets.
Misclassified Verbs: Table 9 shows examples from the ZayTw dataset of the verbs "experience" and "explain" with different associated nouns, along with their gold and predicted classifications.

B.1 Experimental Settings
The word embeddings layer is initialised with pre-trained GloVe embeddings. We used the uncased 200-dimensional GloVe embeddings pre-trained on ∼2 billion tweets, with a vocabulary of 1.2 million words. We did not update the weights of these embeddings during training. Table 10 shows the sizes of the training, validation and test sets of each employed dataset, as well as the corresponding best validation accuracy obtained by the Affine-Trans model (without attention). All experiments were run on an NVIDIA Quadro M2000M GPU; the average running time for the proposed models is around 1 hour for a maximum of 100 epochs.
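For reference, parsing GloVe's plain-text format into an embedding matrix can be sketched as follows. This is a minimal sketch: the helper name and the toy vectors are illustrative, and in our setup the resulting matrix simply initialises the embedding layer and is kept frozen during training.

```python
import numpy as np

def load_glove(lines, dim):
    """Parse GloVe's plain-text format: one token per line followed by
    `dim` whitespace-separated float components."""
    vocab, vectors = {}, []
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        vocab[parts[0]] = len(vectors)
        vectors.append(np.asarray(parts[1:], dtype=np.float32))
    return vocab, np.stack(vectors)

# Tiny inline example standing in for a real GloVe file
# (e.g. the uncased 200-dimensional Twitter embeddings).
sample = ["love 0.1 0.2 0.3", "anger -0.4 0.5 -0.6"]
vocab, emb = load_glove(sample, dim=3)
print(emb.shape)  # (2, 3)
```

Out-of-vocabulary handling (e.g. a zero or randomly initialised row for unknown tokens) is framework-specific and omitted here.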

B.2 Other Trials
Sentence Embedding: We experimented with representations other than GloVe to embed the input sentence. We tried the contextualised pre-trained embeddings ELMo and BERT, either instead of the GloVe embeddings or as additional features, but observed no further improvements on either the validation or the test sets over the best performance obtained. Furthermore, we experimented with different pre-trained GloVe embeddings, including the uncased 300-dimensional vectors pre-trained on the Common Crawl dataset, but did not notice any significant improvements.
Sentence Encoding: The choice of a simple LSTM to encode the input was based on several experiments on the validation set. We tried a bidirectional LSTM but observed no further improvement. This is due to the nature of the relation-level metaphor identification task itself: since the tenor (e.g. a noun) affects the metaphoricity of the vehicle (e.g. a verb or adjective), single-direction processing was sufficient.

Table 7: Examples of correctly classified instances from the adapted TroFi, VUAMC and MOH-X datasets (1 = metaphor, 0 = not metaphor).

Sentence | Expression | Label
The adapted TroFi
"In addition, the eight-warhead missiles carry guidance systems allowing them to strike Soviet targets precisely." | strike Soviet targets | 0
"He now says that specialty retailing fills the bill, but he made a number of profitable forays in the meantime." | fills the bill | 1
"A survey of U.K. institutional fund managers found most expect London stocks to be flat after the fiscal 1989 budget is announced, as Chancellor of the Exchequer Nigel Lawson strikes a careful balance between cutting taxes and not overstimulating the economy." | strikes a careful balance | 1
"Among the rich and famous who had come to the salon to have their hair cut, tinted and set, Paula recognised Dusty Springfield, the pop singer, her eyes big and sooty, her lips pearly pink, and was unable to suppress the thrill of excitement which ran through her." | recognised Dusty Springfield | 0
The adapted VUAMC (NAACL Shared Task)
"But until they get any money back, the Tysons find themselves in the position of the gambler who gambled all and lost." | get any money | 0
"The Labour Party Conference: Policy review throws a spanner in the Whitehall machinery" | throws a spanner | 1
"Otherwise Congress would have to face the consequences of automatic across-the-board cuts under the Gramm-Rudman-Hollings budget deficit reduction law." | face the consequences | 1
MOH-X
"commit a random act of kindness." | commit a random act | 0
"The smoke clouded above the houses." | smoke clouded | 0
"His political ideas color his lectures." | ideas color | 1
"flood the market with tennis shoes." | flood the market | 1

Table 8: Examples of misclassified instances, with the model's predicted probability of metaphoricity.

False Negative
Dataset | Sentence | Prob.
TroFi | "Unself-consciously, the littlest cast member with the big voice steps into the audience in one number to open her wide cat-eyes and throat to melt the heart of one lucky patron each night." | 0.295
TroFi | "Lillian Vernon Corp., a mail-order company, said it is experiencing delays in filling orders at its new national distribution center in Virginia Beach, Va." | 0.006
VUAMC | "It is a curiously paradoxical foundation upon which to build a theory of autonomy." | 0.410
VUAMC | "It has turned up in Canberra with Japan to develop Asia Pacific Economic Cooperation (APEC) and a new 12-nation organisation which will mimic the role of the Organisation for Economic Co-operation and Development in Europe." |
MOH | "When does the court of law sit?" | 0.499
MOH | "The rooms communicated." | 0.000
TSV | "It was great to see a warm reception for it on twitter." | 0.488
TSV | "An honest meal at a reasonable price is a rarity in Milan." | 0.000
ZayTw | "#brexit? we explain likely implications for business insurances on topic of #eureferendum" | 0.2863
ZayTw | "@abpi uk: need #euref final facts? read why if you care about uk life sciences we're #strongerin." | 0.0797

False Positive
Dataset | Sentence | Prob.
TroFi | "As the struggle enters its final weekend, any one of the top contenders could grasp his way to the top of the greasy pole." | 0.998*
TroFi | "Southeastern poultry producers fear withering soybean supplies will force up prices on other commodities." | 0.507
VUAMC | "Or after we followed the duff advice of a legal journalist in a newspaper?" | 0.999*
VUAMC | "Aristotle said something very interesting in that extract from the Politics which I quoted earlier; he said that women have a deliberative faculty but that it lacks full authority." |

Table 10: Experimental information of the five benchmark datasets, including the best validation accuracy obtained by the Affine-Trans model (without attention). We preserved the splits used in the literature for the VUAMC and TSV datasets.