Di-LSTM Contrast : A Deep Neural Network for Metaphor Detection

The contrast between the contextual and the general meaning of a word serves as an important clue for detecting its metaphoricity. In this paper, we present a deep neural architecture for metaphor detection that exploits this contrast. We additionally use cost-sensitive learning by re-weighting examples, together with baseline features such as concreteness ratings, POS tags, and WordNet-based features. Our best performing system achieves an overall F1 score of 0.570 on the All POS category and 0.605 on the Verbs category at the Metaphor Shared Task 2018.

1 Introduction
Lakoff (1993) defines a metaphorical expression as a linguistic expression that is the surface realization of a cross-domain mapping in a conceptual system. On the one hand, metaphors play a significant role in making language more creative; on the other, they make language understanding difficult for artificial systems.
Metaphor Shared Task 2018 (Leong et al., 2018) aims to explore various approaches for word-level metaphor detection in sentences. The task is to predict whether the target word in the given sentence is metaphoric or not. There are two categories for this shared task. The first one, All POS, tests the models for content words from all types of POS among nouns, adjectives, adverbs and verbs, while the second category, Verbs, tests the models only for verbs.

2 Related Work
Various attempts have been made at metaphor detection in recent years, but only a few of them combine distributed representations of words (Bengio et al., 2003) with deep neural networks. Rei et al. (2017) proposed and evaluated the first deep neural network for metaphor identification on two datasets, Mohammad et al. (2016) and Tsvetkov et al. (2014). Do Dinh and Gurevych (2016) explore an MLP classifier with trainable word embeddings on the VUAMC corpus and achieve results comparable to other systems based on corpus statistics or handcrafted features.
Other supervised learning approaches to metaphor detection on the VUAMC corpus include a logistic regression classifier (Beigman Klebanov et al., 2014) over a set of features comprising unigrams, topic models, POS, and concreteness features. Later, Beigman Klebanov et al. (2015) showed a significant improvement by re-weighting examples for cost-sensitive learning and experimenting with concreteness information. Gargett and Barnden (2015) focused on utilizing the interactions between concreteness, imageability, and affective meaning for metaphor detection. Rai et al. (2016) explored Conditional Random Fields with syntactic, conceptual, affective, and contextual (word embedding) features. Beigman Klebanov et al. (2016) experimented with unigrams and with WordNet (Miller, 1995) and VerbNet (Schuler, 2006) based features for the detection of verb metaphors.

3 Data
The dataset provided for this task is the VU Amsterdam Metaphor Corpus (VUAMC). VUAMC is extracted from the British National Corpus (BNC Baby) and is annotated using the MIPVU procedure (Steen, 2010). It contains examples from four genres of text: Academic, News, Fiction, and Conversation.

4 System Description
This section describes our proposed system for this shared task, which we call Di-LSTM Contrast (illustrated in Figure 1). It is divided into three modules trained in an end-to-end setting. The input to the model is given as pre-trained word embeddings. An encoder uses these word embeddings to encode the context of the sentence with respect to the target word using forward and backward LSTMs (Hochreiter and Schmidhuber, 1997). The output from the encoder is fed to the feature selection module (section 4.2), which generates contrast-based features for the target word. The classifier module (section 4.3) then predicts the probability of the target word being metaphoric.

4.1 Context Encoder
The context encoder is inspired by the Bidirectional LSTM (BLSTM; Graves and Schmidhuber, 2005). Given an input sentence S = {w_1, w_2, ..., w_n}, with n the number of tokens in the sentence and i the index of the target token, we form two sequences A = {w_1, w_2, ..., w_i} and B = {w_n, w_(n-1), ..., w_i} and feed them into the forward and backward LSTMs respectively. The motivation for this split is to produce the context with respect to the target word w_i.
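The construction of the two sequences can be sketched in plain Python (token lists stand in for embedding sequences; the function name is ours):

```python
def split_context(tokens, i):
    """Split a sentence around the 0-based target index i.

    A runs left-to-right up to and including the target; B runs
    right-to-left from the end down to and including the target,
    mirroring the inputs of the forward and backward LSTMs.
    """
    A = tokens[:i + 1]        # w_1 ... w_i
    B = tokens[i:][::-1]      # w_n ... w_i
    return A, B

# For "the cat sat on the mat" with target "sat" (i = 2):
# A = ['the', 'cat', 'sat'], B = ['mat', 'the', 'on', 'sat']
```

Both sequences end at the target word, so each LSTM's final hidden state summarizes one side of the target's context.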
The hidden states h_f ∈ IR^d and h_b ∈ IR^d obtained from the forward and backward LSTMs are combined by concatenation or averaging, followed by a fully connected layer to produce the context encoding v ∈ IR^d. For the concatenation variant,

v = W^(1) [h_f ; h_b] + b^(1),

where W^(1) ∈ IR^(d×2d) is the transformation weight matrix, and b^(1) ∈ IR^d is the bias.
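A minimal NumPy sketch of this combination step (the tanh nonlinearity and the random weights are illustrative assumptions, not taken from the paper):

```python
import numpy as np

d = 4                                    # hidden state size
rng = np.random.default_rng(0)
h_f = rng.standard_normal(d)             # forward LSTM hidden state
h_b = rng.standard_normal(d)             # backward LSTM hidden state

# Concatenation variant: [h_f ; h_b] in R^{2d}, projected back to R^d
W1 = rng.standard_normal((d, 2 * d))     # transformation weight matrix
b1 = np.zeros(d)                         # bias
v = np.tanh(W1 @ np.concatenate([h_f, h_b]) + b1)   # context encoding

# Averaging variant: same dimension, no extra parameters
v_avg = (h_f + h_b) / 2.0
```

The concatenation variant carries a d×2d parameter matrix; the averaging variant has none, which is relevant to the comparison in section 6.1.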

4.2 Feature Selection
A combination of the context encoding v and the word vector of the target word, u = w_i, is then fed to the classification module as

g = [u ; v − u].

The intuition behind this feature set g ∈ IR^(2d) is that the properties of the word itself and the difference between its general and contextual meanings play a major role in determining the metaphoricity of a word (Steen, 2010).
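A sketch of the contrast feature in NumPy, assuming g concatenates the target embedding with the difference between the context encoding and the embedding (our reading of the description; the toy vectors are illustrative):

```python
import numpy as np

d = 4
u = np.ones(d)           # target word embedding: general meaning
v = np.full(d, 0.5)      # context encoding: contextual meaning

# Word properties plus the general-vs-contextual difference
g = np.concatenate([u, v - u])    # g in R^{2d}
```

A literal usage should leave v − u small, while a metaphorical usage pulls the contextual meaning away from the general one.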

4.3 Classification
The vector g from the previous module is transformed to a hidden layer and then to the output layer to obtain the softmax probabilities p ∈ IR^2 for metaphoricity.
To enable the use of some additional binary baseline features (section 6.3), we modify the equations as

h = f(W^(2) g + b^(2)),
p = softmax(α (W^(3) g_baseline + b^(3)) + (1 − α) (W^(4) h + b^(4))),

where f is a nonlinearity; W^(2) ∈ IR^(m×2d), W^(3) ∈ IR^(2×k), and W^(4) ∈ IR^(2×m) are the corresponding weight matrices; b^(2) ∈ IR^m and b^(3), b^(4) ∈ IR^2 are the corresponding biases; g_baseline ∈ IR^k is the baseline feature vector; and α is a trainable variable that determines the weights given to the baseline features and the contrast features.
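The forward pass of the classifier can be sketched in NumPy (the α-weighted mixing of the two branches follows our reading of the description; the tanh activation, fixed α, and random weights are illustrative assumptions):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - z.max())
    return e / e.sum()

d, m, k = 4, 8, 5
rng = np.random.default_rng(1)
g = rng.standard_normal(2 * d)        # contrast features from section 4.2
g_baseline = rng.standard_normal(k)   # baseline feature vector

W2, b2 = rng.standard_normal((m, 2 * d)), np.zeros(m)   # hidden layer
W3, b3 = rng.standard_normal((2, k)), np.zeros(2)       # baseline branch
W4, b4 = rng.standard_normal((2, m)), np.zeros(2)       # contrast branch
alpha = 0.3   # trainable in the model; a constant here

h = np.tanh(W2 @ g + b2)
logits = alpha * (W3 @ g_baseline + b3) + (1 - alpha) * (W4 @ h + b4)
p = softmax(logits)   # probabilities for literal vs. metaphoric
```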

5 Implementation Details
We split the provided training data in a 90:10 ratio into a training set and a development set, and use the development set to tune the hyperparameters of the different variations of our model. We use 300-dimensional GloVe vectors (Pennington et al., 2014) trained on the 6B corpus as word embeddings, setting the embeddings of out-of-vocabulary words to zero. To prevent overfitting on the training set, we use dropout regularization (Srivastava et al., 2014) and early stopping (Yao et al., 2007). We set the minibatch size to 50 examples and zero-pad the A and B split sets (as defined in section 4.1). We implement our model in Python using the TensorFlow library (Abadi et al., 2015) and optimize it with AdaGrad (Duchi et al., 2011).
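The zero-padding of the variable-length A and B sets within a minibatch can be sketched as follows (the function name is ours):

```python
def zero_pad(sequences, pad_value=0.0):
    """Right-pad variable-length sequences to the longest in the batch."""
    max_len = max(len(s) for s in sequences)
    return [list(s) + [pad_value] * (max_len - len(s)) for s in sequences]

# A batch of two split sets of lengths 3 and 1:
# zero_pad([[1, 2, 3], [4]]) -> [[1, 2, 3], [4, 0.0, 0.0]]
```

Padding to a common length lets the LSTMs process a whole minibatch as one tensor.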
We train our models only on the All POS category training set, and evaluate them on the test sets of both the All POS and Verbs categories, since the Verbs training set is a subset of the All POS training set.

6 Experiments and Evaluation
In this section, we present evaluation results for our model. Table 3 compares the variants of our model on the test set using F1 score as the evaluation metric. Experimental results indicate that our model generalizes well on the tests for both task categories, and the performance trends on the test sets are consistent with those on validation. Table 3 also compares the variants of our model with the baseline results for the shared task provided by the organizers. Our best performing model surpasses the baseline results on the Verbs category, while it achieves a slightly lower but comparable performance with the baseline on the All POS category.

Table 5: Analysis of our best performing system on the test sets (both categories). P = Precision, R = Recall, F = F1 score.

6.1 Experiment with the Encoder
We experiment with the function for combining the hidden states of the forward and backward LSTMs (section 4.1), using both averaging and concatenation. The validation results on both categories show that concatenation performs considerably better than averaging. This observation is supported by the fact that concatenation followed by a fully connected layer allows more parameterized interaction between the two states than averaging does.

6.2 Re-weighting of Training Examples
We employ cost-sensitive learning (Yang et al., 2014) by re-weighting examples for our model. This brings an appreciable improvement in performance: a 1.6% F1 gain on validation, 2.0% on the All POS category (test), and 0.6% on the Verbs category (test). This improvement agrees with previous work on metaphor detection (Beigman Klebanov et al., 2015), which shows the effectiveness of re-weighting training examples on the VUAMC corpus.
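One common way to realize such re-weighting is to weight each class inversely to its frequency, so the rarer metaphoric class contributes more to the loss; a minimal sketch under that assumption (the paper does not specify its exact scheme):

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency weights: each class contributes equally overall."""
    counts = Counter(labels)
    total = len(labels)
    return {c: total / (len(counts) * n) for c, n in counts.items()}

# With 3 literal (0) and 1 metaphoric (1) example,
# the metaphoric class receives the larger weight.
```

Each example's loss term is then multiplied by the weight of its class during training.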

6.3 Additional Baseline Features
The use of baseline features such as WordNet features (Miller, 1995), part-of-speech tags, and concreteness features (Brysbaert et al., 2014) further improves the F1 score of our model by 0.8% on the All POS category (test) and 1.5% on the Verbs category (test), though it shows a relatively smaller improvement on the validation set.
To obtain the POS-tag-based features, we encode the POS tag of the target token as a one-hot vector. By WordNet features, we refer to a one-hot encoding of the 26-class classification of words based on their general meaning. The concreteness features are the concatenation of the one-hot representations of the concreteness-mean-binning-BiasDown and concreteness-mean-binning-BiasUp features (as in Beigman Klebanov et al. (2015, 2016)).
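Assembling the binary baseline vector from one-hot encodings can be sketched as below (the 26 WordNet classes are from the text; the POS tagset size and the concreteness bin counts are illustrative assumptions):

```python
def one_hot(index, size):
    """Return a binary list with a single 1 at the given index."""
    vec = [0] * size
    vec[index] = 1
    return vec

pos_feat = one_hot(2, 12)                   # POS tag id (tagset size illustrative)
wn_feat = one_hot(7, 26)                    # WordNet 26-class general meaning
conc_feat = one_hot(1, 5) + one_hot(3, 5)   # BiasDown and BiasUp bins (sizes illustrative)

g_baseline = pos_feat + wn_feat + conc_feat  # binary feature vector of length k
```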

7 Analysis
After the completion of the shared task, we downloaded the publicly available labels of the test data to analyze the results of our best performing model across all four genres of text (section 3) on both categories (as shown in Table 5). Our system performs comparatively better on academic and news texts than on conversation and fiction texts.

8 Conclusion and Future Work
We described a deep neural architecture, the Di-LSTM Contrast network, for metaphor detection, which we submitted to the Metaphor Shared Task 2018 (Leong et al., 2018). We showed that our system achieves appreciable performance using only the contrast features generated by our model from pre-trained word embeddings. Additionally, our model gets a significant performance boost from the use of extra baseline features and from re-weighting of training examples.
For future work, we plan to experiment with CNNs alongside LSTMs for capturing the context representation of the sentence in light of the target word. Another interesting direction is the use of attention mechanisms (Mnih et al., 2014), which have proven effective in many NLP tasks.