Neural Metaphor Detection with a Residual biLSTM-CRF Model

In this paper we present a novel resource-inexpensive architecture for metaphor detection based on a residual bidirectional long short-term memory network and conditional random fields. Current approaches to this task rely on deep neural networks to identify metaphorical words, using additional linguistic features or word embeddings. We evaluate our proposed approach using different model configurations that combine embeddings, part-of-speech tags, and semantically disambiguated synonym sets. The evaluation was performed using the training and testing partitions of the VU Amsterdam Metaphor Corpus. We use this evaluation as a reference to compare our results with other current approaches to this task that implement similar neural architectures and features, and that were evaluated on this corpus. Results show that our system achieves competitive results with a simpler architecture compared to previous approaches.


Introduction
This paper presents a new model for automatic metaphor detection which participated in the FigLang 2020 metaphor detection shared task (Leong et al., 2020). Our approach, which is based on neural networks, has been developed in the framework of the research project MOMENT (Coll-Florit et al., 2018), a project devoted to the analysis of metaphors in mental health discourses.
As is well known in Cognitive Linguistics, a conceptual metaphor (CM) is a cognitive process that allows us to understand and communicate an abstract or diffuse concept in terms of a more concrete one (cf. e.g. Lakoff and Johnson (1980)). This process is expressed linguistically through metaphorically used words (MUW).
The study of metaphor is a prolific area of research in Cognitive Linguistics, where the Metaphor Identification Procedure (MIP) (Pragglejaz Group, 2007) and its derivative MIPVU (Steen et al., 2019) are the most standard methods for manual MUW detection. MIPVU is the method that was used to annotate the VU Amsterdam Metaphor Corpus (VUA corpus), used in FigLang 2020. Moreover, in the area of Corpus Linguistics, some methods have been developed for a richer annotation of metaphor in corpora (Ogarkova and Soriano Salinas, 2014; Shutova, 2017; Coll-Florit and Climent, 2019).
CM is pervasive in natural language text and is therefore crucial for automatic text understanding (Shutova, 2010). For this reason, automated metaphor processing has become an increasingly important concern in natural language processing, as shown by the Metaphor in NLP workshop series (at NAACL-HLT 2013, ACL 2014, NAACL-HLT 2015, NAACL-HLT 2016 and NAACL-HLT 2018) and a growing body of research; see Veale et al. (2016) and Shutova (2017) for recent reviews.
Automatic metaphor processing involves two main tasks: identifying MUW (metaphor detection or recognition) and attempting to provide a semantic interpretation for the utterance containing them (metaphor interpretation). This work deals with metaphor detection. Over the last decade this problem has mainly been approached with supervised and semi-supervised machine learning techniques, but recently the paradigm has largely shifted to deep learning algorithms, such as neural networks. Leong et al. (2018) report that all but one of the participating teams in the 2018 VUA Metaphor Detection Shared Task used this kind of architecture. Our system follows this trend by trying to improve on previous neural network methods.
Below we describe the main related works (section 2). Next we present our methodology and model (section 3), experiments (section 4) and results (section 5). We finish with the discussion and our overall conclusions (sections 6 and 7).

Background
Research on metaphor recognition and interpretation is moving away from the use of features (linguistic and concreteness features), classical methods (such as generalization, classification and word associations) and theoretical principles (construction grammar, frame semantics and conceptual metaphor theory) towards neural networks and other deep learning techniques.
Concreteness features are used by Klebanov et al. (2015) along with re-weighting of the training examples to train a supervised machine learning system. The trained system is able to classify all content words of a text in two groups: metaphorical and non-metaphorical.  study the metaphoricity of verbs using semantic generalization and classification using word forms, lemmas and several other linguistic features. They demonstrated the effectiveness of the generalization from orthographic unigrams to lemmas and the combination of lemmas and semantic classes based on WordNet. They also used automatically generated clusters to combine with unigram lemmas getting a competitive performance.
The Meta4meaning (Xiao et al., 2016) metaphor interpretation method uses word associations extracted from a corpus to retrieve approximate properties of concepts and provide interpretations for nominal metaphors of the form NOUN 1 is [a] NOUN 2 (where NOUN 1 is the tenor and NOUN 2 the vehicle). Metaphor interpretation is obtained as a combination of the saliences of the properties to the tenor and the vehicle. Combinations can be aggregations (the product or sum of saliences), salience difference or a combination of the results of the two. As an output, Meta4meaning provides a list of interpretations with weights.
The automatic metaphor detection system MetaNet (Hong, 2016) was designed by applying theoretical principles from construction grammar, frame semantics, and conceptual metaphor theory. The system relies on a conceptual network of frames and metaphors. Rosen (2018) developed an algorithm based on deep learning techniques that uses a representation of metaphorical constructions at the argument-structure level. The algorithm allows for the identification of source-level mappings of metaphors. The author concludes that using deep learning algorithms with the addition of construction grammatical relations in the feature set improves the accuracy of the prediction of metaphorical source domains. Wu et al. (2018) propose a combined Convolutional Neural Network and Long Short-Term Memory (CNN-LSTM) model with a Conditional Random Field (CRF) or Softmax layer for metaphor detection in texts. They combine CNN and LSTM to capture both local and long-distance contextual information when representing the input sentences. Meanwhile, Mu et al. (2019) argue that using broader discourse features can have a substantial positive impact on the task of metaphor identification. They obtain significant results using document embedding methods to represent an utterance and its surrounding discourse. A gradient boosting classifier is then trained on these representations.
Other works address specific tasks within the scope of metaphor recognition, such as detecting the metaphoricity of adjective-noun (AN) pairs in English as isolated units.
We propose a model that uses a residual bidirectional long short-term memory (biLSTM) with a CRF, using ELMo embeddings along with additional linguistic features, such as part-of-speech (POS) tags and semantically disambiguated WordNet synonym sets (synsets) (Fellbaum and Miller, 1998). Our model could be grouped in the same category as the aforementioned approaches: deep neural network models for metaphor detection.

Model Description
Most of the approaches mentioned in section 2 used the VUA corpus (Steen et al., 2010) in order to carry out model training and testing. They divided the training and test sets according to the VUA Metaphor Detection Shared Task specifications. To train and test our model we used the VUA corpus partitions, using ELMo embeddings to represent words and lemmas, and POS and synsets as additional linguistic features. ELMo (Embeddings from Language Models) embeddings (Peters et al., 2018) are derived from a bidirectional language model (biLM) and they are contextualized, deep and character based. ELMo embeddings have been successfully used in several NLP tasks.
To process the VUA corpus we used the Natural Language Toolkit (NLTK) (Loper and Bird, 2002) for Python; with this tool we performed tokenization, lemmatization, and POS tagging. Then we used Freeling (Padró and Stanilovsky, 2012) to obtain the respective synset of each token. Although NLTK provides a method for obtaining synsets (using POS tags or Lesk's algorithm), Freeling implements UKB (Agirre et al., 2014), a graph-based word sense disambiguation (WSD) algorithm that is used to obtain semantically disambiguated synsets. These features, along with the ELMo embeddings, were used in different configurations as input for our model. We set the sequence padding length to 116, which is the maximum sentence length observed in the corpus. This process normalizes the input so that the model can be trained in batches, but might contribute to sparsity in the training data.
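The padding step can be sketched as follows. This is a minimal illustration rather than the preprocessing code used for the experiments: the function name `pad_sequence_to`, the padding token `"<PAD>"` and the choice to pad at the end of the sentence are assumptions; only the target length of 116 comes from the text.

```python
MAX_LEN = 116  # maximum sentence length reported for the corpus


def pad_sequence_to(tokens, max_len=MAX_LEN, pad_token="<PAD>"):
    """Return a copy of `tokens` padded (or truncated) to exactly `max_len` items."""
    # Truncation is only a safeguard: no corpus sentence should exceed max_len.
    clipped = list(tokens[:max_len])
    return clipped + [pad_token] * (max_len - len(clipped))


# Hypothetical example sentence, not taken from the corpus.
sentence = ["He", "attacked", "every", "weak", "point", "in", "my", "argument"]
padded = pad_sequence_to(sentence)
```

In this example, 108 of the 116 positions carry no information, which illustrates how padding short sentences can make the input sparse.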
We used a one-hot encoded representation for POS, and computed local 100-dimensional embeddings for synsets. In the case of POS, the tagset is small (43 tags), which results in low-dimensional one-hot vectors. For synsets, the computation of local embeddings provides the semantically disambiguated relations that exist between the units that compose the training data. These embeddings, together with their ELMo counterparts, should provide enough contextual and semantic information to identify metaphorical instances of words.
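As an illustration of the POS representation, the following sketch encodes a tag as a one-hot vector. The three-tag inventory and the helper name `one_hot` are hypothetical and chosen only to keep the example short; with the 43-tag inventory described above, each vector would have 43 positions.

```python
def one_hot(tag, tagset):
    """Encode `tag` as a list with a single 1.0 at its index in `tagset`."""
    vector = [0.0] * len(tagset)
    vector[tagset.index(tag)] = 1.0
    return vector


# Hypothetical three-tag inventory; the tagset used in the paper has 43 tags.
TAGSET = ["NN", "VB", "JJ"]
encoded = one_hot("VB", TAGSET)
```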
The main architecture of our model (shown in Figure 1) consists of a residual biLSTM (Kim et al., 2017; Tran et al., 2017) for sequence labeling. One of the particularities of this architecture is an additive operation that takes the outputs of each biLSTM layer and combines them to compute the residual connection between them, so that previously seen information from both layers is retained.
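The additive combination of two stacked bidirectional layers can be sketched as below. This is a toy illustration under stated assumptions, not the model itself: a plain tanh recurrent cell stands in for the LSTM cell, the weights are random, and all names (`simple_rnn`, `bidirectional`, `residual_stack`, the layer sizes) are hypothetical.

```python
import math
import random


def simple_rnn(inputs, w_in, w_rec, reverse=False):
    """Run a tanh recurrent layer over a sequence (stand-in for one LSTM direction)."""
    hidden_size = len(w_rec)
    state = [0.0] * hidden_size
    outputs = []
    steps = reversed(inputs) if reverse else inputs
    for x in steps:
        state = [
            math.tanh(
                sum(w * xi for w, xi in zip(w_in[j], x))
                + sum(w * hi for w, hi in zip(w_rec[j], state))
            )
            for j in range(hidden_size)
        ]
        outputs.append(state)
    # Restore the original time order for the backward direction.
    return outputs[::-1] if reverse else outputs


def bidirectional(inputs, params):
    """Concatenate forward and backward hidden states at each timestep."""
    fwd = simple_rnn(inputs, params["w_in_f"], params["w_rec_f"])
    bwd = simple_rnn(inputs, params["w_in_b"], params["w_rec_b"], reverse=True)
    return [f + b for f, b in zip(fwd, bwd)]  # list concatenation per timestep


def init_layer(input_size, hidden_size, rng):
    """Create random weights for one bidirectional layer."""
    def matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {
        "w_in_f": matrix(hidden_size, input_size),
        "w_rec_f": matrix(hidden_size, hidden_size),
        "w_in_b": matrix(hidden_size, input_size),
        "w_rec_b": matrix(hidden_size, hidden_size),
    }


def residual_stack(inputs, layer1, layer2):
    """Stack two bidirectional layers and add their outputs element-wise."""
    h1 = bidirectional(inputs, layer1)
    h2 = bidirectional(h1, layer2)
    return [[a + b for a, b in zip(s1, s2)] for s1, s2 in zip(h1, h2)]


rng = random.Random(0)
HIDDEN = 4  # hypothetical size per direction
layer1 = init_layer(input_size=3, hidden_size=HIDDEN, rng=rng)
layer2 = init_layer(input_size=2 * HIDDEN, hidden_size=HIDDEN, rng=rng)
sequence = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(5)]
combined = residual_stack(sequence, layer1, layer2)
```

Because both layers produce outputs of the same width (twice the per-direction hidden size), they can be summed directly without a projection.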
After computing the residual connection from both biLSTM layers, our model includes a dropout layer, followed by a time-distributed layer that applies a dense layer with 2 hidden units to each timestep. We used ReLU (Nair and Hinton, 2010) as the activation function in combination with a He-normal (He et al., 2015) kernel initialization function for the time-distributed layer, which draws the weights from a zero-mean Gaussian distribution with a standard deviation equal to $\sqrt{2/n_l}$, where $n_l$ is the number of input units of the layer. Finally, after the time-distributed layer we used a conditional random field (CRF) implemented for sequence labeling (Lafferty et al., 2001).
Given that the VUA corpus contains more negative (literal) labels than positive (metaphoric) labels, and that the sequence padding process added non-informative features to the input array, we opted to treat the training partition as an imbalanced dataset. We selected the Nadam optimizer (Dozat, 2016), which is based on Adam (Kingma and Ba, 2014) and tends to perform better with sparse data. Adam has two main components: a momentum component and an adaptive learning rate component. Nadam modifies the momentum component of Adam using Nesterov's accelerated gradient (NAG). The Nadam update rule can be written as follows:
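A commonly cited simplified form of this rule, assuming a constant momentum coefficient, is given below. The notation is an assumption, since the symbols are not defined in the text: $g_t$ is the gradient at step $t$, $m_t$ and $v_t$ are the first and second moment estimates, $\beta_1$ and $\beta_2$ their decay rates, $\eta$ the learning rate, $\epsilon$ a small constant and $\theta_t$ the parameters.

```latex
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \\
\theta_{t+1} &= \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}
  \left( \beta_1 \hat{m}_t + \frac{(1 - \beta_1)\, g_t}{1 - \beta_1^t} \right)
\end{aligned}
```

Compared with Adam, whose update uses only $\hat{m}_t$, the term in parentheses adds a contribution from the current gradient, which is the Nesterov look-ahead component.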

Experiments
To evaluate our model we used the train and test splits provided in the VUA shared task partitions (Shutova, 2017). To obtain a validation split, we divided the training partition into 80% for training and 20% for validation. With these partitions, we trained a total of 6 different model configurations: words and POS (W+POS); lemmas and POS (L+POS); words, POS and synsets (W+POS+SS); lemmas, POS and synsets (L+POS+SS); words, lemmas and POS (WL+POS); and words, lemmas, POS and synsets (WL+POS+SS). In all cases we used the same training parameters: all model configurations were trained in batches for 5 epochs with a learning rate of 0.0025. The resulting models were then evaluated, using precision, recall and F1 score, on both the all POS metaphor detection task and the metaphoric verbs detection task.
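The experimental setup can be summarized in a short sketch. The six configurations, the 80/20 split, the number of epochs and the learning rate come from the text; the shuffling, the fixed seed and all identifiers (`CONFIGURATIONS`, `train_validation_split`) are assumptions made for illustration, since the text does not say how the validation sentences were selected.

```python
import random

# Feature sets of the six configurations described in the text.
CONFIGURATIONS = {
    "W+POS": ("words", "pos"),
    "L+POS": ("lemmas", "pos"),
    "W+POS+SS": ("words", "pos", "synsets"),
    "L+POS+SS": ("lemmas", "pos", "synsets"),
    "WL+POS": ("words", "lemmas", "pos"),
    "WL+POS+SS": ("words", "lemmas", "pos", "synsets"),
}

EPOCHS = 5
LEARNING_RATE = 0.0025


def train_validation_split(sentences, validation_fraction=0.2, seed=0):
    """Shuffle `sentences` and return (training, validation) lists."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)  # random selection is an assumption
    n_validation = round(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


# Toy stand-in for the training partition.
toy_partition = [f"sentence-{i}" for i in range(10)]
train, validation = train_validation_split(toy_partition)
```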

Results
Regarding the all POS prediction task (Table 1), the L+POS+SS model had the best performance, with a precision of 0.5729, a recall of 0.6027 and an F1 score of 0.5874. Overall, all configurations obtained a mean F1 score of 0.58, with the WL+POS model obtaining the lowest score (0.5615). Regarding recall, the highest observed value was obtained by the W+POS+SS model, with a recall of 0.6438. It could be said that the less diverse lexicon obtained by using lemmas instead of words to compute embeddings helped to improve the performance of the L+POS+SS model. Nevertheless, when comparing the W+POS and L+POS configurations, both obtained similar results, with less than 1% difference in performance between them. Meanwhile, when comparing the W+POS+SS and L+POS+SS models, it can be observed that both models obtained similar F1 scores, but with a variation of 4% between precision and recall that favours precision in the L+POS+SS model and recall in the W+POS+SS model.
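The F1 score is the harmonic mean of precision and recall. The following minimal sketch (the function name `f1_score` is our own) recomputes it from the precision and recall reported for the L+POS+SS model.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported for L+POS+SS on the all POS task.
best_all_pos = f1_score(0.5729, 0.6027)
```

Rounded to four decimal places, `best_all_pos` matches the reported F1 score of 0.5874.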

In the case of the metaphoric verb labeling task (Table 2), the W+POS model obtained the best precision and F1 score (0.6695 and 0.6543 respectively), while the W+POS+SS model obtained the highest recall (0.7032). Overall, the mean F1 score of all configurations was 0.6411, with WL+POS being the poorest performing configuration with an F1 score of 0.6101. As in the all POS task, the W+POS+SS and L+POS+SS configurations obtained precision and recall scores with a difference of 6% in both metrics.
Unlike in the all POS task, combining features did not improve the performance of the models for verb labeling. Although synsets disambiguate the meaning of the words or lemmas that were fed to the model, using only ELMo embeddings and POS tags yielded better results in this task. One possible explanation for this behavior is that verbs tend to be more polysemous than nouns and, therefore, obtain greater benefit from this feature. According to WordNet statistics, verbs have an average polysemy index of 2.17, while nouns have an average of 1.24.
It can be observed that, in the all POS model set, the W+POS architecture has a higher precision than the W+POS+SS configuration. This behaviour can also be observed in the verbs task model set, where these two configurations obtained the highest values for precision and recall, respectively. On the one hand, the W+POS classifier captures fewer instances of metaphoric words, but most of the metaphors it classifies are true positives; on the other hand, W+POS+SS is a greedier model that correctly classifies more metaphors, but its predictions tend to include false positives.

Such variation might be caused by the inclusion of synsets as a training feature: when additional senses are linked to each training word, they provide a polysemous representation of words and increase the number of semantic patterns for both metaphoric and literal tokens. These semantically disambiguated patterns broaden the prediction scope of the model, as words with similar senses might occur in similar contexts. While the W+POS architecture correctly predicts metaphors to a certain degree, its scope is more precise but narrower than that of the W+POS+SS architecture, in which words (particularly verbs) have a variety of senses that improve recall at the expense of predicting more literal tokens as metaphoric than the W+POS model does.

Discussion
Our proposed architecture has similarities to other current approaches such as Wu et al. (2018). Regarding the all POS labeling task, the model presented by Wu et al. (2018) performs better in all metrics, with a difference of 3% in precision, 10% in recall and 8% in F1 score. It has to be noted that our model has a simpler architecture (as shown in section 3). Wu et al. (2018) trained their model using 200 biLSTM hidden states and 100 CNN units for 15 epochs, and trained it 20 times using an ensemble method. In contrast, the simplest architecture that we presented, W+POS, takes an average of 5 minutes per epoch to train and validate, thus producing a less complex model that is faster and less expensive to train.
On both tasks the poorest performing configuration was WL+POS: combining these features improved recall but lowered both precision and F1. Combining words and lemmas might create redundancy in certain features that POS tags cannot compensate for. On the other hand, although adding synsets in the WL+POS+SS architecture raises the feature dimensionality by 100 over the WL+POS configuration (1024 + 1024 + 43), the performance of the model improves in both precision and recall on the all POS task, and in all metrics on the verbs task.
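The per-token input dimensionality of each configuration follows from the sizes given in the text. The sketch below simply adds them up; the dictionary and function names are our own.

```python
# Per-token feature sizes stated in the text.
FEATURE_DIMS = {"words": 1024, "lemmas": 1024, "pos": 43, "synsets": 100}


def input_dim(features):
    """Total per-token dimensionality of a feature combination."""
    return sum(FEATURE_DIMS[name] for name in features)


wl_pos = input_dim(["words", "lemmas", "pos"])                # 1024 + 1024 + 43
wl_pos_ss = input_dim(["words", "lemmas", "pos", "synsets"])  # adds 100
w_pos = input_dim(["words", "pos"])                           # smallest configuration
```

Under these sizes, WL+POS has 2091 dimensions per token and WL+POS+SS has 2191, compared with 1067 for W+POS.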
One of the strategies that we implemented to mitigate the imbalance of the training data was the use of a kernel initialization function. The He-normal function uses the size of the previous layer to set the range of the generated weights. In this case, the time-distributed layer is activated using ReLU, and its weights are initialized from a He-normal distribution scaled by the size of the preceding dropout layer.
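He-normal initialization (He et al., 2015) draws weights from a zero-mean Gaussian whose standard deviation is the square root of 2 divided by the number of input units. The sketch below illustrates this with a plain Gaussian; library implementations may use a truncated variant, and the input size of 512 is a hypothetical value, not one reported here.

```python
import math
import random


def he_normal(fan_in, fan_out, rng):
    """Sample a fan_in x fan_out weight matrix with std = sqrt(2 / fan_in)."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]


rng = random.Random(0)
# Hypothetical input size of 512; the 2 output units match the dense layer in the text.
weights = he_normal(fan_in=512, fan_out=2, rng=rng)

# Empirical check: the root mean square of the samples should be close to sqrt(2 / 512).
flat = [w for row in weights for w in row]
sample_std = math.sqrt(sum(w * w for w in flat) / len(flat))
```

The larger the preceding layer, the smaller the initial weights, which keeps the variance of the ReLU activations roughly stable across layers.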

Conclusions and further work
In this paper we have described the system we presented at the FigLang 2020 metaphor detection shared task. Our approach is based on neural networks, using a residual biLSTM with a CRF and ELMo embeddings along with several combinations of words, lemmas and linguistic features such as POS and WordNet synsets. The system achieves competitive results with a simpler architecture compared to systems found in the literature. Such systems implement similar elements, such as bidirectional LSTMs, CRFs and ELMo embeddings, in different configurations and with different combinations of linguistic features.
As future work, we plan to further analyse which POS benefit most from the inclusion of synset information. Another aspect we want to explore is how to deal with imbalanced data, i.e. how to leverage a dataset with only two classes (metaphoric/literal) where most of the samples are literal. Another interesting question that deserves more research is the effect of adding linguistic information on the optimal dimensionality. Other features that could be implemented are the concreteness values of certain words, either as an input feature or as a strategy to balance classes according to the influence that this feature has on the literal and metaphoric classes.
Other future lines of work might include the implementation of this type of model for the detection of metaphors and source domain identification in Spanish. Current developments in metaphor detection are being carried out mainly in English. While this is a great resource, it would be interesting to create resources in other languages to broaden the scope of metaphor detection and interpretation. A possible pipeline could be configured with two separate models: one that performs the detection of metaphorical words, followed by another classifier that predicts the domain of those metaphors.