Evaluating a Bi-LSTM Model for Metaphor Detection in TOEFL Essays

This paper describes systems submitted to the Metaphor Shared Task at the Second Workshop on Figurative Language Processing. In this submission, we replicate the evaluation of the Bi-LSTM model introduced by Gao et al. (2018) on the VUA corpus in a new setting: TOEFL essays written by non-native English speakers. Our results show that Bi-LSTM models outperform feature-rich linear models on this challenging task, which is consistent with prior findings on the VUA dataset. However, the Bi-LSTM models lag behind the best performing systems in the shared task.


Introduction
In today's globalized world, text in a given language is not always written by native speakers. It is therefore important to evaluate to what degree NLP models and tools, developed and evaluated primarily on edited text written by and aimed at native speakers, port to non-native language. The Metaphor Detection Shared Task at the Second Workshop on Figurative Language Processing offers the opportunity to perform such an evaluation on a challenging genre: argumentative essays written by non-native speakers of English as part of the Test of English as a Foreign Language (TOEFL).
We participate in the TOEFL ALLPOS task, a sequence labeling task where each word in running text is labeled with one of two tags: metaphorical (M) or literal (L). While the best-performing system described in this paper was submitted to other sections of the shared task as well, we focus on reporting a wider range of results for the TOEFL ALLPOS task.
Context determines whether a word or phrase is being used in a metaphorical sense. Consider an example from the TOEFL dataset: "The world is a huge stage and nearly everybody is an actor." The words "stage" and "actor" are used metaphorically to analogize the world to a stage and individuals to actors on that stage. A literal usage of these two words would be "The actor walked across the stage," because "actor" and "stage" both occur within the context of a theatrical performance, which matches the literal context of the sentence.
Beigman Klebanov et al. (2018) establish baselines for metaphor detection on TOEFL essays using feature-rich logistic regression classifiers, and show that use of metaphors is a strong predictor of the quality of the essay. The same year, Gao et al. (2018) establish a new state-of-the-art with a simple Bi-LSTM model on the VUA dataset drawn from multiple genres in the British National Corpus (BNC). Their approach departed from prior models built on linguistically motivated features (Turney et al., 2011; Hovy et al., 2013; Tsvetkov et al., 2014), visual features (Shutova et al., 2016) or learning custom word embeddings (Stemle and Onysko, 2018; Mykowiecka et al., 2018), and showed that contextualized word representations from a Bi-LSTM can be more effective.
In this work, we investigate whether Gao et al. (2018)'s findings can be replicated when detecting metaphors in TOEFL essays rather than the BNC. In addition, we attempt to answer the following question: do contextualized word representations from a Bi-LSTM model detect metaphorical word use more accurately than feature-rich linear models? On the one hand, Bi-LSTM sequence labelers have proven quite successful at learning task-specific representations for many NLP problems. On the other hand, text written by non-native speakers of varying proficiency might include more variability that harms the model's ability to learn useful contextual representations.
Our results show that Bi-LSTMs with word embedding inputs outperform feature-rich linear classifiers as in prior work, but their performance lags behind that of the top performing submissions in the shared task.

Task Overview
The goal of the task is to accurately predict whether words are used in a literal or metaphorical sense in a sequence labeling setting. As shown in Table 1, literal tokens heavily outnumber metaphorical ones. To account for this imbalance, submissions are evaluated using the F1 score for the positive class (metaphorical). In the table, a "token" refers to a labeled word in the data (not all words are assigned labels/features). We refer the reader to the shared task description paper for a detailed description of the task. In addition to metaphor annotations, the corpus comes with pre-extracted features from Klebanov et al. (2015), labeled as Provided features in Table 2. These features include unigrams, Stanford POS tags, binned mean concreteness values (Brysbaert et al., 2013), and topics from Latent Dirichlet Allocation (Blei et al., 2003). Unlabeled tokens are assigned a literal classification and values of zero for all non-word-embedding features.

Classifiers
We ran our internal experiments using a simple baseline and two classifier architectures. The implementation, written in Python, will be made publicly available on Github. 1
Baseline As a baseline (BL), we predict the probability p(w) of a word lemma w being positive (metaphorical) as m_w / c_w, where m_w and c_w are the number of positive occurrences and total occurrences of w, respectively, in the training data. If c_w = 0 (the word was not encountered during training), we automatically assign a negative (literal) prediction.
Linear Classifier (LR) We train a logistic regression classifier (LBFGS solver with L2 penalization). We predict a binary classification for each token independently, ignoring other predictions and features in the sequence.
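The frequency baseline admits a very small implementation. The sketch below is illustrative (the lemma and label lists are toy examples, not our data); it estimates p(w) = m_w / c_w from training counts and backs off to a literal prediction for unseen lemmas:

```python
from collections import Counter

def train_bl(lemmas, labels):
    """Count, per lemma w, total occurrences c_w and metaphorical ones m_w."""
    c, m = Counter(), Counter()
    for w, y in zip(lemmas, labels):
        c[w] += 1
        m[w] += y  # y is 1 for metaphorical (M), 0 for literal (L)
    return c, m

def predict_bl(w, c, m, threshold=0.5):
    """Predict M iff m_w / c_w exceeds the threshold; unseen lemmas get L."""
    if c[w] == 0:
        return 0  # word not encountered during training: literal
    return int(m[w] / c[w] > threshold)
```

For example, a lemma labeled metaphorical in two of its three training occurrences has p(w) = 2/3 and is predicted metaphorical at the default threshold.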
Bi-LSTM Following Gao et al. (2018), we use a bidirectional LSTM as a sequence labeler: a feed-forward neural network makes a binary prediction at each time step, taking the contextualized representations learned by the Bi-LSTM as input. Predictions are made for each sentence in an essay, independently of the document context. Our experiments are based on the implementation by Gao et al., with modifications to apply their model to the TOEFL data and to incorporate different combinations of features. The LSTMs have a hidden size of 300 units for each direction; concatenating the forward and backward representations yields a 600-dimensional output. We feed this output through a single-layer (2-unit) feed-forward neural network and apply a softmax function, which outputs a probability distribution over the two output classes. Dropout is applied to the LSTM input (p = 0.5) and between the LSTM output and the linear layer (p = 0.1). The models are trained using the Adam algorithm, with learning rates of η = 0.005 for epochs 0–10 and η = 0.001 for epochs 11–20.
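A minimal PyTorch sketch of this architecture follows, using the hyperparameters stated above (hidden size 300, 600-dimensional concatenated output, 2-unit output layer, dropout of 0.5 on the input and 0.1 before the linear layer). It is an illustration of the design, not the exact implementation of Gao et al.; the class name and the 1324-dimensional input (GloVe + ELMo, see Features) are ours:

```python
import torch
import torch.nn as nn

class MetaphorBiLSTM(nn.Module):
    """Bi-LSTM sequence labeler in the style of Gao et al. (2018)."""
    def __init__(self, input_size=1324, hidden_size=300):
        super().__init__()
        self.in_drop = nn.Dropout(0.5)        # dropout on the LSTM input
        self.lstm = nn.LSTM(input_size, hidden_size,
                            bidirectional=True, batch_first=True)
        self.out_drop = nn.Dropout(0.1)       # dropout before the linear layer
        self.proj = nn.Linear(2 * hidden_size, 2)  # 600 -> 2 classes

    def forward(self, x):                     # x: (batch, seq_len, input_size)
        h, _ = self.lstm(self.in_drop(x))     # h: (batch, seq_len, 600)
        return torch.softmax(self.proj(self.out_drop(h)), dim=-1)
```

In training, such a model would be paired with `torch.optim.Adam(model.parameters(), lr=0.005)`, lowering the learning rate to 0.001 after epoch 10.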

Features
We experimented with different input features within each model architecture, which are summarized in Table 2.
We obtain word embedding features for each word type by concatenating GloVe (Pennington et al., 2014) and ELMo (Peters et al., 2018) word embeddings into a 1324-dimensional vector, shown as "GE" in Table 2.
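The dimensionality follows from simple concatenation: 300 (GloVe) + 1024 (ELMo) = 1324. As a sketch with placeholder vectors:

```python
import numpy as np

glove = np.zeros(300)    # placeholder 300-d GloVe vector for one token
elmo = np.zeros(1024)    # placeholder 1024-d ELMo vector for the same token
ge = np.concatenate([glove, elmo])  # the concatenated "GE" input feature
assert ge.shape == (1324,)
```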
All the other features were provided with the TOEFL ALLPOS dataset, which we will refer to as 'provided' features. With the exception of Topic-LDA (T), all of them are represented with one-hot encodings (UL, P) or a vector of binary values (WN, C, CD). Various combinations of all these features were concatenated together to form the input data on which we trained and evaluated the classifiers described above.

Data Versions
Default Data We first build classifiers on the data as processed by the organizers, with the provided tokenization and no additional processing.
Since the TOEFL essays are written by non-native English speakers, many sentences contain misspellings or grammatical errors, such as "The problems of the pollution is one of the most ones of this century." We experiment with two strategies to address these sources of variability.

Spelling Correction
We created a cleaned version of the dataset using the Python pyspellchecker library, which finds a given word's minimum Levenshtein distance neighbor in the OpenSubtitles corpus. In total, we replaced 1536 (train) and 492 (test) misspelled tokens in the data.

Error Injection
Anastasopoulos et al. (2018) showed that adding synthetic grammatical errors to training data improves neural machine translation of non-native English to Spanish text. To investigate the effect of such methods on metaphor detection, we separately inject the following errors (if applicable) into three copies of each training sentence and append them to the training set:
• RT: Missing determiner (includes articles)
• PREP: Missing preposition
• NN: Flipped noun number
For simplicity, unlike Anastasopoulos et al. (2018), we did not randomly replace determiners or prepositions with another member of their confusion set. Instead, we simply removed the word from the sentence.
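The removal-based RT and PREP injections can be sketched as follows. The word lists below are illustrative only, not the confusion sets used in our experiments:

```python
DETERMINERS = {"a", "an", "the"}                        # illustrative set
PREPOSITIONS = {"in", "on", "at", "of", "to", "for"}    # illustrative set

def drop_first(tokens, targets):
    """Simulate a missing-determiner (RT) or missing-preposition (PREP)
    error by deleting the first matching token, if any."""
    for i, tok in enumerate(tokens):
        if tok.lower() in targets:
            return tokens[:i] + tokens[i + 1:]
    return tokens  # error not applicable: sentence unchanged

sentence = "The actor walked across the stage".split()
```

Applied to the example sentence, the RT injection removes the initial "The", while the PREP injection leaves the sentence unchanged because no listed preposition occurs.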

Evaluation Settings
When training the logistic regression and Bi-LSTM classifiers, we ran cross-validation (k = 5) and used early stopping to select a final test model based on validation loss. We then selected a probability threshold that maximized our F1 score on the validation data before finally making predictions on the test set. For our baseline model, we used the same model selection technique without early stopping, as there is no 'training' iteration involved in the baseline.
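The threshold-selection step can be implemented as a grid search over candidate thresholds, keeping the one that maximizes F1 of the positive (metaphorical) class on the validation predictions. This is a sketch under our description above, not our exact code:

```python
import numpy as np

def best_threshold(probs, gold, grid=np.linspace(0.05, 0.95, 19)):
    """Return the threshold in `grid` maximizing positive-class F1.

    probs: predicted probabilities of the positive class, shape (n,)
    gold:  boolean gold labels, shape (n,)
    """
    def f1(th):
        pred = probs >= th
        tp = np.sum(pred & gold)       # true positives at this threshold
        if tp == 0:
            return 0.0
        p = tp / pred.sum()            # precision
        r = tp / gold.sum()            # recall
        return 2 * p * r / (p + r)
    return max(grid, key=f1)
```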

Impact of Classifier and Feature Choice
We first compare classifiers and features when training on the default data. Table 3 includes our internal results averaged across 5-fold cross-validation on the training set, and, for a subset of the models, results on the blind evaluation test set taken from the official leaderboard on CodaLab.
The baseline model performs well on both the testing and validation sets, which suggests that the identity of the word is a strong indicator of metaphorical use even before taking context into account, for the TOEFL data as for other genres. Surprisingly, the linear classifiers that did not use word embedding features did not improve over the baseline, despite the fact that they include the identity of the current lemma (UL). The only models that produced improvements over the baseline on average used GloVe and ELMo embeddings. Additionally, the effect of adding the provided features is inconsistent: in some cases, performance degrades, but in others, it improves.
The difference in F1 score between Bi-LSTM and LR models is primarily due to precision: the Bi-LSTM models that use word embeddings achieve higher precision than the logistic regression models, while the differences in recall are small. This contrasts with the findings of Gao et al. (2018) on the VUA dataset, where the Bi-LSTM model primarily benefited recall over precision.
The best results overall are obtained with the Bi-LSTM models that use GloVe and ELMo input. Interestingly, adding unigram lemma features (UL) further improves precision at the expense of a small decrease in recall, and overall yields the best F1 both by cross-validation and on the official test set. As expected, Bi-LSTM performance degrades heavily when trained on only the dataset-provided features. Investigating better ways to incorporate these features would be a useful direction for future research. Finally, Table 5 shows our best model's performance, broken down by Penn Treebank POS tags: F1 scores are the highest for verbs and lowest for nouns, mostly due to worse recall for nouns than for verbs.

Impact of Addressing Spelling and Grammatical Errors
Spell-checking and error injection experiments have an inconsistent impact. As shown in Table 4, this additional data processing improves the F1 score of the logistic regression model the most. For the Bi-LSTM, spell-checking the data yields a small F1 improvement when using cross-validation, and no significant difference on the official test set (60.9 vs. 61.0). Injecting artificial errors leads to a small F1 decrease with cross-validation and was therefore not tested on the official test set.

Official Submission
Our best submission on the leaderboard is a Bi-LSTM network trained on a spell-checked dataset, with GloVe, ELMo, and one-hot unigram lemma vectors as input. This model yields an F1 score of 0.610, which is slightly below the median score of 0.653.

Table 5: Evaluation of the best Bi-LSTM model per POS tag via cross-validation. We show statistics (count, % metaphoric) for the training set. Only POS tags with more than 1000 occurrences are displayed.

Conclusion
In summary, our experiments replicate existing metaphor detection models in the new setting provided by the TOEFL ALLPOS task. Adding GloVe vectors and ELMo contextual embeddings helped push the performance of the logistic regression model over a simple frequency baseline. The use of a Bi-LSTM network in combination with GloVe, ELMo, and one-hot unigram lemma vectors yields the highest performance out of all the models tested. This confirms the benefits of contextual representations learned by the Bi-LSTM for metaphor detection highlighted by Gao et al. (2018) on the VUA dataset. However, the more challenging TOEFL ALLPOS data also shows the limitations of the Bi-LSTM model, which yields smaller improvements over the baseline than on VUA, and lags behind the best systems on the shared task leaderboard.