Using Language Learner Data for Metaphor Detection

This article describes the system that participated in the shared task (ST) on metaphor detection on the Vrije Universiteit Amsterdam Metaphor Corpus (VUA). The ST was part of the workshop on processing figurative language at the 16th annual conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2018). The system combines a small assortment of trending techniques, which implement matured methods from NLP and ML; in particular, the system uses word embeddings from standard corpora and from corpora representing different proficiency levels of language learners in an LSTM BiRNN architecture. The system is available under the APLv2 open-source license.


Introduction
Ever since conceptual metaphor theory was laid out in Lakoff and Johnson (1980), the most vexing question has remained a methodological one: how can conceptual metaphors be reliably identified in language use? Although manual identification was put on a stronger methodological footing with the Metaphor Identification Procedure (MIP) (Pragglejaz Group, 2007) and its elaboration into MIPVU (Steen et al., 2010), fuzzy areas remain because conceptual metaphors range from primary metaphors to complex metaphors (cf. Grady, 1997). Furthermore, highly conventionalized metaphorical expressions might not be processed in the same way as novel metaphors. The core process of manual metaphor identification is not unproblematic either, since it can be difficult to establish whether the meaning of a lexical unit in its context deviates from its basic meaning. Given this slippery terrain, automatic metaphor identification is an extremely challenging task. A growing body of research since the start of the annual workshops at NAACL in 2013 has shown promising first results using different methods of automated metaphor identification (see, for example, Shutova et al. (2015) and Klebanov et al. (2016) for previous events). The current shared task on metaphor identification provided a further opportunity to put the computational spotting of metaphors to the test. Our bid for this task combines (cf. Section 2) fastText word embeddings (WEs) with a single-layer long short-term memory (LSTM) bidirectional recurrent neural network (BiRNN) architecture. The input, a sequence of WE representations of words, is fed into the BiRNN, which predicts metaphorical usage for each word.
The WEs were trained (cf. Section 4.2) on different large corpora (BNC, Wikipedia, enTenTen13, ukWaC) and on the Vienna-Oxford International Corpus of English (VOICE) as well as on the TOEFL11 Corpus of Non-Native English. The latter corpus was used, among others, in the First Native Language Identification Shared Task held at the 8th Workshop on Innovative Use of NLP for Building Educational Applications as part of NAACL-HLT 2013.
We were led by the idea (cf. Section 2.3) that metaphorical language use changes while gaining proficiency in a language, and so we hoped to be able to utilise the information contained in corpora of different proficiency levels.
The paper is organised as follows: We present our system design with related work in Section 2, the implementation in Section 3, and the experimental setup with an evaluation in Section 4. Section 5 concludes with an outlook on possible next steps.

Design
Generally, our design builds upon the foundation laid out by Collobert et al. (2011) for a neural network (NN) architecture and learning algorithm that can be applied to various natural language processing tasks. The most closely related task-specific design is given in Do Dinh and Gurevych (2016), who used a NN in combination with WEs to detect metaphors. In contrast to our study, they used a dense multi-layer NN, while we adapted the design of Stemle (2016a,b), who combined WEs with a recurrent NN (RNN) to predict part-of-speech (PoS) tags of computer-mediated communication (CMC) and Web corpora for German and Italian. RNNs are usually considered to be more suitable for labelling sequential data such as text.

Word Embeddings
Recently, state-of-the-art results on various linguistic tasks were accomplished by architectures using neural-network-based WEs. Baroni et al. (2014) conducted a set of experiments comparing the popular word2vec (Mikolov et al., 2013a,b) implementation for creating WEs with other well-known distributional methods across various (semantic) tasks. Their results suggest that the WEs substantially outperform the other architectures on semantic similarity and analogy detection tasks. Subsequently, Levy et al. (2015) conducted a comprehensive set of experiments suggesting that much of the improvement is due to system design and parameter optimizations rather than the selected method. They conclude that "there does not seem to be a consistent significant advantage to one approach over the other".
WEs provide high-quality low dimensional vector representations of words from large corpora of unlabelled data. The representations, typically computed using NNs, encode many linguistic regularities and patterns (Mikolov et al., 2013b).

Bidirectional Recurrent Neural Network
NNs consist of a large number of simple, highly interconnected processing nodes in an architecture loosely inspired by the structure of the cerebral cortex of the brain (O'Reilly and Munakata, 2000). The nodes receive weighted inputs through their connections on one side and fire according to the individual thresholds of their shared activation function. A firing node passes on an activation to all connected nodes on the other side. During learning, the input is propagated through the network and the actual output is compared to the desired output. Then, the weights of the connections (and the thresholds) are adjusted step-wise so as to more closely resemble a configuration that would produce the desired output. After all training data have been presented, the process typically starts over, and the learned output values will usually be closer to the desired values.
Recurrent NNs (RNNs), introduced by Elman (1990), are NNs in which the connections between the elements form directed cycles, i.e. the networks have loops, which enables the NN to model sequential dependencies of the input. However, regular RNNs have fundamental difficulties learning long-term dependencies, and special kinds of RNNs need to be used (Hochreiter, 1991); a very popular one is the so-called long short-term memory (LSTM) network proposed by Hochreiter and Schmidhuber (1997).
Bidirectional RNNs (BiRNNs), introduced by Schuster and Paliwal (1997), extend unidirectional RNNs by introducing a second layer in which the directed cycles let the input flow in the opposite sequential order. When processing text, this means that for any given word the network considers not only the text leading up to the word but also the text that follows it.
Overall, we benefit from available labelled data with this design but also from large amounts of available unlabelled data.

Language Learner Data
Our experimental design also utilizes data from language learner corpora. This is based on the intuition that metaphor use might vary depending on learner proficiency. Beigman Klebanov and Flor (2013) indeed found a correlation between higher proficiency ratings of learner texts and a higher density of metaphors in these texts. Their study is also one of the few in the field of automated metaphor detection that are concerned with learner language. Their aim, however, is quite different from that of the current study, as they try to establish annotations for metaphoric language use that can help to train an automated classifier of metaphors in test-taker essays. The current study, by contrast, utilizes learner corpus data to build WEs alongside WEs built from corpora representing written standard language. Learner language could be a particularly helpful source of information for automated metaphor detection via WEs, as WEs derived from learner language capture usage patterns that differ from those in WEs derived from standard language corpora.

Implementation
We maintain the implementation in a source code repository 1. Our system uses sequences of word features as input to a BiRNN with an LSTM architecture.

Word Embeddings
We use gensim 2, a Python tool for unsupervised semantic modelling from plain text, to load precomputed WE models and to compute embedding vector representations of words. Words missing from a WE model, i.e. out-of-vocabulary (OOV) words, are first estimated from the non-OOV words within a fixed context around them. If this fails, each OOV word is mapped to its own randomly generated vector representation.
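The two-step OOV handling described above can be illustrated with a minimal sketch. It is not the system's code: the function name `embed_token`, the plain dictionary standing in for a loaded WE model, the symmetric window, and the use of a component-wise mean as the estimate are all assumptions made for illustration.

```python
import random


def embed_token(tokens, index, vectors, dim, oov_cache, window=10, rng=random):
    """Return a vector for tokens[index].

    vectors   -- dict mapping known words to lists of floats (stand-in for a WE model)
    oov_cache -- dict that remembers the random vector assigned to each OOV word
    window    -- number of neighbours inspected on each side (assumed symmetric)
    """
    word = tokens[index]
    if word in vectors:
        return vectors[word]

    # Step 1: estimate the OOV word from in-vocabulary neighbours in the window.
    lo = max(0, index - window)
    hi = min(len(tokens), index + window + 1)
    context = [
        vectors[tokens[i]]
        for i in range(lo, hi)
        if i != index and tokens[i] in vectors
    ]
    if context:
        # Assumption: the estimate is the component-wise mean of the neighbours.
        return [sum(column) / len(context) for column in zip(*context)]

    # Step 2: no usable context, so fall back to one random vector per OOV word.
    if word not in oov_cache:
        oov_cache[word] = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    return oov_cache[word]
```

Caching the random vectors keeps the mapping stable, so the same OOV word receives the same representation every time the fallback is used.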

Neural Network
Our implementation uses Keras (Chollet, 2015), a high-level NN library written in Python, on top of TensorFlow (Abadi et al., 2016), an open source software library for numerical computation.
The number of input layers corresponds to the number of employed feature sets. For multiple feature sets, e.g. multiple WE models or additional PoS tags, sequences are concatenated on the word level such that the number of features for an individual word grows.
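As a rough illustration of word-level concatenation, the sketch below joins the per-word vectors of several feature sets into one longer vector per word. The function name `concat_features` and the nested-list data layout are hypothetical and chosen only to keep the example self-contained.

```python
def concat_features(feature_sets):
    """Concatenate several feature sets word by word.

    feature_sets -- list with one entry per feature set (e.g. one per WE model);
                    each entry is a list holding one vector per word.
    Returns one list per word whose length is the sum of all feature dimensions.
    """
    n_words = len(feature_sets[0])
    if any(len(sequence) != n_words for sequence in feature_sets):
        raise ValueError("all feature sets must cover the same words")

    combined = []
    for i in range(n_words):
        word_vector = []
        for sequence in feature_sets:
            word_vector.extend(sequence[i])  # append this set's features for word i
        combined.append(word_vector)
    return combined
```

For two words with a 2-dimensional and a 1-dimensional feature set, each word ends up with 3 features.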
Input sequences have a pre-defined length and represent original textual sentence segments. In case a sentence is longer than the sequence length, the input is split into multiple segments. And if a segment is shorter than the sequence length, the remaining slots are padded, i.e. they are filled with identical dummy information.
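The splitting and padding step can be sketched as follows. The helper name `segment_and_pad` and the choice of padding value are assumptions; the source only states that over-long sentences are split and short segments are filled with identical dummy information.

```python
def segment_and_pad(sentence, seq_len, pad):
    """Split a sentence into segments of at most seq_len items and pad the last one.

    sentence -- list of per-word items (tokens, vectors, or labels)
    seq_len  -- the pre-defined sequence length
    pad      -- the dummy item used to fill the remaining slots
    """
    segments = [
        sentence[start:start + seq_len]
        for start in range(0, len(sentence), seq_len)
    ]
    # Every segment is brought to exactly seq_len items.
    return [segment + [pad] * (seq_len - len(segment)) for segment in segments]
```

A five-word sentence with a sequence length of three therefore yields two segments, the second of which carries one padded slot.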
Each input layer feeds into a masking layer such that the padded values from the input sequence will be skipped in all downstream layers. 3 The masked input is fed into a bidirectional LSTM layer that, in turn, projects to a fully connected output layer that is activated by a softmax function.
The output is a single sequence of matching length with labels indicating whether the corresponding word is used metaphorically or not.
During training, we use dropout for the linear transformation of the recurrent state, i.e. the network drops a fraction of recurrent connections, which helps prevent overfitting (Srivastava et al., 2014); and we use a weighted categorical crossentropy loss function to counteract the fact that far fewer words in our sequences are labelled as metaphorical than non-metaphorical, which usually hampers classification performance (cf. Kotsiantis et al., 2006).
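The layer stack described in this section could look roughly like the following Keras sketch. It is a simplified illustration under stated assumptions, not the system's implementation: it uses a single input layer rather than one per feature set, and the function name `build_tagger`, the number of LSTM units, the dropout rate, and the padding value are hypothetical. The weighted loss is left out here.

```python
from tensorflow.keras import layers, models


def build_tagger(seq_len, n_features, lstm_units=50, recurrent_dropout=0.25,
                 pad_value=0.0):
    """Masking -> bidirectional LSTM -> per-word softmax over two labels."""
    inputs = layers.Input(shape=(seq_len, n_features))
    # Time steps whose features all equal pad_value are skipped downstream.
    masked = layers.Masking(mask_value=pad_value)(inputs)
    # return_sequences=True keeps one output per word; recurrent_dropout drops
    # a fraction of the recurrent connections during training.
    hidden = layers.Bidirectional(
        layers.LSTM(lstm_units, return_sequences=True,
                    recurrent_dropout=recurrent_dropout)
    )(masked)
    # Two classes per word: non-metaphorical vs. metaphorical.
    outputs = layers.TimeDistributed(
        layers.Dense(2, activation="softmax")
    )(hidden)
    return models.Model(inputs=inputs, outputs=outputs)
```

The model maps an input of shape (batch, seq_len, n_features) to label probabilities of shape (batch, seq_len, 2), i.e. one prediction per word slot.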

Experiments and Results
Participants of the ST could take part in the metaphor prediction track for verbs only, the track for all content parts of speech, or both. For a given text in VUA, and for each sentence, the task was to predict metaphoricity for each verb or content word, respectively, and to submit the result to CodaLab 4 for evaluation. Results were calculated as the harmonic mean of the precision and recall (F1-score) of the metaphoricity label. We participated with our system in both tracks.
The remainder of this section introduces the official data set and our WE models, and describes our fixed hyper-parameters. The results of different combinations of WE models are shown in Table 1. Note that all results in this paper refer only to the all content part-of-speech track.
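For reference, the F1-score of the metaphoricity label can be computed as in the sketch below. This is a generic illustration and not the organisers' evaluation script; the function name `f1_score` and the encoding of the metaphorical label as 1 are assumptions.

```python
def f1_score(gold, predicted, positive=1):
    """Harmonic mean of precision and recall for the positive (metaphorical) label."""
    pairs = list(zip(gold, predicted))
    true_pos = sum(1 for g, p in pairs if g == positive and p == positive)
    false_pos = sum(1 for g, p in pairs if g != positive and p == positive)
    false_neg = sum(1 for g, p in pairs if g == positive and p != positive)

    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because only the metaphorical label counts as positive, a system that labels every word as non-metaphorical scores 0 regardless of its accuracy.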

Shared Task Data
The VUA, the corpus used in the shared task, originates from the British National Corpus (BNC). Altogether, it comprises 117 texts covering four genres (academic, conversation, fiction, news). For the ST, VUA was pre-divided by the organisers into a training and a test set. The training set was labelled and could be used to train classifiers, while the participants were supposed to label the test set and submit it. The distribution of metaphorical vs. non-metaphorical labels was imbalanced with a ratio of roughly 1:6 (11044 : 61567).

Word Embedding Models
We use pre-built WE models of the following corpora: BNC and enTenTen13 web cor-
Table 1: Overview of the word embedding models we used, and evaluation results for individual models and some combinations on the metaphor prediction track for all content part-of-speech. Columns: number of tokens in the original corpus; the parameters minCount and dim for fastText during training of the models; our calculated F1-scores on the official labelled test set (they should coincide with the organisers' results); the mean accuracy as well as the standard deviation in the accuracy for 10-fold cross validation runs on the training set. Only one table row is recoverable: X X X X X X X 0.597 0.952 0.003.
Three individual models were trained for the different proficiency levels low, medium and high of the training subset of the TOEFL11; another model was trained for the full training set comprising all three proficiency levels. One model was trained for VOICE (Seidlhofer et al., 2013), a corpus of English as it is spoken by a non-native speaking majority of users in different contexts.
Two models were trained for ukWaC (Baroni et al., 2009), a corpus constructed from the Web using medium-frequency words from the BNC as seeds. The first model was trained on the full corpus and the second on a random sample of documents approximating the token count of the full TOEFL11 training set.
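One simple way to draw a random sample of documents that approximates a target token count is sketched below. The source does not describe the sampling procedure, so the function name `sample_documents`, the fixed seed, and the stop-at-budget strategy are all assumptions.

```python
import random


def sample_documents(documents, target_tokens, seed=0):
    """Randomly pick documents until their combined size reaches target_tokens.

    documents     -- list of documents, each a list of tokens
    target_tokens -- token count to approximate (e.g. that of a reference corpus)
    Returns the sampled documents and their total token count.
    """
    rng = random.Random(seed)
    shuffled = list(documents)  # copy so the caller's list keeps its order
    rng.shuffle(shuffled)

    sample, total = [], 0
    for document in shuffled:
        if total >= target_tokens:
            break
        sample.append(document)
        total += len(document)
    return sample, total
```

Sampling whole documents keeps the text contiguous, at the cost of overshooting the target by at most one document.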

Hyper-Parameter Tuning
Hyper-parameter tuning is important for good performance. The parameters of our system were optimised via an ad-hoc grid search in 3-fold cross validation (CV) runs.
The weight for the categorical cross-entropy loss function is calculated as the logarithm of the ratio of the number of words to the number of metaphorical labels. The context for estimating OOV words was set to 10.
Once set, we used the same configuration for all experiments.
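The loss weighting can be made concrete with a small sketch. Several points are assumptions rather than facts from the source: the natural logarithm is used, "number of words" is read as all labelled words (metaphorical plus non-metaphorical), the non-metaphorical class keeps a weight of 1.0, and the function names are hypothetical.

```python
import math


def metaphor_class_weight(n_words, n_metaphorical):
    """Log of the ratio of all labelled words to metaphorical labels (natural log assumed)."""
    return math.log(n_words / n_metaphorical)


def weighted_cross_entropy(probabilities, gold, class_weights):
    """Mean per-word cross-entropy, scaled by the weight of each word's gold class.

    probabilities -- one [p_non_metaphorical, p_metaphorical] pair per word
    gold          -- gold class index per word (0 or 1)
    class_weights -- dict mapping class index to its weight
    """
    total = 0.0
    for word_probs, label in zip(probabilities, gold):
        total += -class_weights[label] * math.log(word_probs[label])
    return total / len(gold)


# Counts taken from the shared task training data described in this paper.
weight = metaphor_class_weight(11044 + 61567, 11044)  # roughly 1.88
class_weights = {0: 1.0, 1: weight}
```

With this weighting, a misclassified metaphorical word contributes almost twice as much to the loss as a misclassified non-metaphorical one.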

Conclusion & Outlook
The combination of WEs with a BiRNN is capable of recognizing metaphorical usage of words better than many other already tested approaches. More importantly, our design does not rely on WordNet or VerbNet information, and does not need concreteness or abstractness information like many successful architectures from previous annual workshops at NAACL. Besides VUA, our system only needs running text.
The best result on the test set was achieved with a combination of TOEFL11 learner data and data from the BNC. So far, the results are encouraging, but also mixed, regarding our initial idea that metaphorical language use at different proficiency levels could be utilised to recognise metaphorical usage of words. To this end, we are looking forward to output from the European Network for Combining Language Learning with Crowdsourcing Techniques 8, where poten-