Target Word Masking for Location Metonymy Resolution

Existing metonymy resolution approaches rely on features extracted from external resources like dictionaries and hand-crafted lexical resources. In this paper, we propose an end-to-end word-level classification approach based only on BERT, without dependencies on taggers, parsers, curated dictionaries of place names, or other external resources. We show that our approach achieves state-of-the-art results on five datasets, surpassing conventional BERT models and benchmarks by a large margin. We also show that our approach generalises well to unseen data.


Introduction
Metonymy is a widespread linguistic phenomenon in which a thing or concept is referred to by the name of something closely associated with it. It is an instance of figurative language that can be easily understood by humans through association, but is hard for machines to interpret. For example, in They read Shakespeare, it is "the works of Shakespeare" that we are referring to, not the playwright himself.
Existing named entity recognition (NER) and word sense disambiguation (WSD) systems have no explicit metonymy detection. This is an issue, as named entities and other lexical items are often used metonymically. For instance, Germany in the context of Germany lost in the semi-final refers to "the national German sports team", unlike the context I live in Germany, in which the term is used literally. NER systems generally tag both as location without recognising the metonymic usage in the first, and WSD systems are tied to word sense inventories and generally don't handle metonyms and other sense extensions well (Markert and Nissim, 2009). Intuitively, metonymy resolution should improve NER and WSD, something we explore in this paper. Metonymy resolution is the task of determining whether a potentially metonymic word ("PMW") in a given context is used metonymically or not. It has been shown to be an important component of many NLP tasks, including machine translation (Kamei and Wakao, 1992), question answering (Stallard, 1993), anaphora resolution (Markert and Hahn, 2002), geographical information retrieval (Leveling and Hartrumpf, 2008), and geo-tagging (Monteiro et al., 2016; Gritta et al., 2018).
Conventional approaches to metonymy resolution have made extensive use of taggers, parsers, lexicons, and corpus-derived or hand-crafted features (Nissim and Markert, 2003;Farkas et al., 2007;Nastase et al., 2012). These either rely on NLP pre-processors that potentially introduce errors, or require external domain-specific resources. Recently, deep contextualised word embeddings (Peters et al., 2018) and pre-trained language models (Devlin et al., 2019) have been shown to benefit many NLP tasks, and part of our interest in this work is how to best apply these approaches to metonymy resolution.
While we include experiments for other types of metonymy, a particular focus of this work is locative metonymy. Previous work has suggested that around 13-20% of toponyms are metonymic (Markert and Nissim, 2007;Leveling and Hartrumpf, 2008;Gritta et al., 2019), such as in Vancouver welcomes you, where Vancouver refers to "the people of Vancouver" rather than the literal place.
Our contributions are as follows. First, we propose a word masking approach, which when paired with fine-tuned BERT (Devlin et al., 2019), achieves state-of-the-art accuracy over a number of benchmark metonymy datasets: our method outperforms the previous state-of-the-art by 5.1%, 12.2% and 4.8%, on SEMEVAL (Markert and Nissim, 2007), RELOCAR (Gritta et al., 2017) and WIMCOR (Kevin and Michael, 2020), respectively, and also outperforms a conventional fine-tuned BERT model by a large margin. Second, in addition to intrinsic evaluation of location metonymy resolution, we include an extrinsic evaluation, where we incorporate a locative metonymy resolver into a geoparser, and show that it boosts geoparsing performance. Third, we demonstrate that our method generalises better cross-domain, while being more data efficient. Finally, we conduct a detailed error analysis from the task rather than model perspective. Our code is available at: https://github.com/haonan-li/TWM-metonymy-resolution.

Related Work
In early symbolic work, metonymy was treated as a syntactico-semantic violation (Hobbs Sr and Martin, 1987;Pustejovsky, 1991). As such, the resolution of metonymy was based on constraint violation, usually based on the selectional preferences of verbs (Fass, 1991;Hobbs et al., 1993). Markert and Nissim (2002) were the first to treat metonymy resolution as a classification task, based on corpus and linguistic analysis. They demonstrated that grammatical roles and syntactic associations are high-utility features, which they subsequently extended to include syntactic head-modifier relations and grammatical roles (Nissim and Markert, 2003). To tackle data sparseness, they further introduced simpler grammatical features by integrating a thesaurus. Much of this legacy persists in more recent work, in the form of hand-engineered features and external resources.
SemEval 2007 Task 8 on metonymy resolution (Markert and Nissim, 2007) further catalyzed interest in the task by releasing a metonymy dataset with syntactic and grammatical annotations, and fine-tuning the task definition and evaluation metrics. A range of learning paradigms (including maximum entropy, decision trees, and naive Bayes) were applied to the task. Top-ranking systems (Nicolae et al., 2007;Farkas et al., 2007;Brun et al., 2007) used features provided by the organisers, such as syntactic roles and morphological features. Most systems also used features from external resources such as WordNet, FrameNet, VerbNet, and the British National Corpus (BNC).
Later work (Nastase and Strube, 2009;Nastase et al., 2012;Nastase and Strube, 2013) used the Wikipedia category network to capture the global context of PMWs, to complement local context features.
All the above-mentioned approaches resolve metonymy by enriching the information about PMWs, in particular via resources. In contrast, our approach is end-to-end: information is contained in the pretrained embeddings and language models only. Another difference is that we focus on the context of the PMW only, and not the PMW itself.
More recently, in a departure from using ever-more hand-crafted features, Gritta et al. (2017) proposed a metonymy resolution approach based on basic parsing features and word embeddings. The main idea is to eliminate words that are superfluous to the task and keep only relevant words, by constructing a "predicate window" from the target word via a syntactic dependency graph. The classification of the target word is then based on the "predicate window". Similar to us, they do not take the identity of the target word into consideration. However, we remove the dependency on a dependency parser, and more systematically generate a context representation by masking the target word within a pretrained language model.
Researchers have released several datasets for metonymy resolution, including SEMEVAL (Nissim and Markert, 2003), RELOCAR and CONLL (Gritta et al., 2017), GWN (Gritta et al., 2019), and WIMCOR (Kevin and Michael, 2020). However, none of this work has analysed the data distributions of the datasets or generalisation across them. In this paper, we train our model on different datasets, and evaluate its transfer learning abilities.

Motivation
Due to the relatively small size of most existing metonymy resolution datasets, researchers have explored ways to compensate for the sparse training data, e.g. through data augmentation (Kobayashi, 2018;Wei and Zou, 2019). Modern pre-trained language models offer an alternative approach which performs well when fine-tuned over even small, task-specific datasets (Houlsby et al., 2019;Porada et al., 2019). Data sparseness may, however, still lead to overfitting. For example, in our metonymy resolution task, if the target word Vancouver appears only once during training, in the form of a metonymy, the model might overfit and always predict that Vancouver is a metonymy regardless of context. Intuitively, masking the target word during training can eliminate lexical bias and force the model to learn to classify based on the context of use rather than the target word.

BERT for Word-level Classification
For a tokenised sentence S = t_1, t_2, ..., t_n and target word t_i, ..., t_j with position pair (i, j), we form the input to the BERT encoder as [CLS] t_1 ... t_n [SEP]. We extract the representation of the target word from the last hidden layer as T ∈ R^{d×h}, where d = j − i + 1 denotes the length of the target word and h is the hidden layer size. Element-wise averaging is applied over the word span, compressing the extracted matrix into a vector in R^{1×h}. Finally, we feed this vector into a linear classifier to obtain the output label.
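As a minimal numpy sketch of the span-extraction arithmetic (the BERT encoder itself is omitted: the hidden states and classifier weights below are random stand-ins, purely for illustration):

```python
import numpy as np

np.random.seed(0)

# Toy stand-in for BERT's last hidden layer: n tokens, hidden size h.
# In the real model these come from the fine-tuned encoder.
n, h = 8, 16
hidden_states = np.random.randn(n, h)

def classify_span(hidden_states, i, j, W, b):
    """Average the target-word span (rows i..j, inclusive) and apply a
    linear classifier, mirroring the element-wise averaging step."""
    T = hidden_states[i:j + 1]      # T in R^{d x h}, with d = j - i + 1
    t_avg = T.mean(axis=0)          # element-wise average -> R^{1 x h}
    logits = t_avg @ W + b          # linear classifier over two classes
    return int(np.argmax(logits))   # 0 = literal, 1 = metonymic

W = np.random.randn(h, 2)           # hypothetical classifier weights
b = np.zeros(2)
label = classify_span(hidden_states, 2, 4, W, b)
```

In the real system, W and b are trained jointly with the encoder during fine-tuning rather than fixed after the fact.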

Data Augmentation
One method to combat lexical overfitting on the target word in the small-data setting is data augmentation. We first extract all target words from the training set to form a target word pool. Then, for each training sample, a fresh sample is constructed by replacing the target word with a random word from the pre-built pool. We repeat this 10 times, expanding the training set 10-fold in the process. We train the models on the augmented training set and evaluate on the original test set.
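A sketch of this augmentation step, assuming training samples are stored as (tokens, target span, label) triples; the sample format and example data here are hypothetical:

```python
import random

random.seed(0)

# Hypothetical training samples: (tokens, (i, j) target span, label).
train = [
    (["Vancouver", "welcomes", "you"], (0, 0), "metonymic"),
    (["I", "live", "in", "Germany"], (3, 3), "literal"),
]

# Pool of all target words extracted from the training set.
pool = [" ".join(toks[i:j + 1]) for toks, (i, j), _ in train]

def augment(sample, pool, copies=10):
    """Build `copies` fresh samples by swapping in random target words;
    the label is inherited from the original sample."""
    toks, (i, j), label = sample
    out = []
    for _ in range(copies):
        repl = random.choice(pool).split()
        new_toks = toks[:i] + repl + toks[j + 1:]
        out.append((new_toks, (i, i + len(repl) - 1), label))
    return out

augmented = train + [s for smp in train for s in augment(smp, pool)]
```

Note that the label is kept unchanged even though the substituted target word may not make semantic sense in context, which is precisely why this method can only reduce, not eliminate, lexical bias.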

Target Word Masking
An alternative approach is to mask the target word, and base the classification exclusively on the context. We claim that the interpretation (metonym or literal) of a target word relies more on the context of use than the word itself. To test this claim, we force the model to predict whether the target word is metonymic based only on context. Here, we replace the input target word with the single token X during training and evaluation. Note that this is not compatible with the data augmentation method described in Section 3.3, as the target word (either original or replaced through data augmentation) is masked out.
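The masking step itself is simple; a sketch, again assuming (tokens, span) inputs (the token X matches the paper's mask token, but the helper and data format are our own):

```python
def mask_target(tokens, span, mask_token="X"):
    """Replace the target-word span with a single mask token, so that
    classification is based exclusively on the surrounding context."""
    i, j = span
    return tokens[:i] + [mask_token] + tokens[j + 1:], (i, i)

# Single-token and multi-token targets both collapse to one mask token.
masked1, span1 = mask_target(["Vancouver", "welcomes", "you"], (0, 0))
masked2, span2 = mask_target(["the", "New", "York", "Times", "said"], (1, 3))
```

Because multi-token targets collapse to a single token, the downstream span-averaging step reduces to reading off one hidden-state row.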

End-to-end Metonymy Resolution
Given a raw sentence, an end-to-end metonymy resolution system must be able to detect PMWs and predict the correct class of each. Most existing metonymy resolution methods focus on named entities (e.g. locations and organizations), which we detect by training a BERT named entity recogniser on the CoNLL 2003 data. The detected locations and organizations are then masked one at a time, and fed into the word-level BERT classifier.

Experimental Details
In this section, we detail the five datasets used in our experiments, and then provide details of the models used in this research.

Datasets
SEMEVAL was first introduced by Nissim and Markert (2003) and subsequently used in SemEval 2007 Task 8 (Markert and Nissim, 2007). It contains about 3800 sentences from the BNC across two types of entities: organizations and locations. In addition to coarse-level labels of metonym or literal, it contains finer-grained labels of metonymic patterns, such as place-for-people, place-for-event, or place-for-product. This is the only dataset with such fine-grained labels of metonymy. We use the dataset in two forms: (1) location metonymies ("SEMEVAL LOC"); and (2) organization metonymies ("SEMEVAL ORG").
RELOCAR (Gritta et al., 2017) is a Wikipedia-based dataset. Compared with SEMEVAL LOC, it is intended to have better label balance and annotation quality, but lacks the fine-grained analysis of metonymic patterns. It contains 2026 sentences, and is focused on locations only. It is important to note that the class definitions for RELOCAR are somewhat different from those for SEMEVAL LOC. The main difference is in the interpretation of political entities (e.g. Moscow opposed the sanctions), which are considered to be literal readings in SEMEVAL, but metonymic in RELOCAR. The argument is that governments/nations/political entities (in the case of our example, "the government of Russia") are semantically much closer to organizations or people, and thus metonymic.
CONLL was released together with RELOCAR (Gritta et al., 2017) and is also focused on locations. It contains about 7000 sentences taken from the CoNLL 2003 shared task on NER. It was annotated by a single annotator, with no quantification of label quality, and is thus potentially noisy.
GWN (Gritta et al., 2019) is a fine-grained labelled dataset of toponyms consisting of around 4000 sentences. It contains not only metonymic usages of locations, but also demonyms, homonyms, and noun modifiers, from which we extract instances labelled as literal, metonymic, or mixed for our experiments. We merge the mixed instances (which account for around 2% of the data) with the metonymic class, creating a binary classification task.
WIMCOR (Kevin and Michael, 2020) is a semi-automatically annotated dataset from English Wikipedia, based on the observation that Wikipedia disambiguation pages list different senses of ambiguous entities. The authors use disambiguation pages to identify literal and metonymic entities, and extract Wikipedia article pairs with the same natural title which refer to different but related concepts, like Delft and Delft University of Technology. Sentences are then extracted from the backlinks of the respective articles, which point to the articles that contain the target mentions. The dataset contains 206K samples, of which about one-third are metonyms. Although the dataset is large-scale, it contains only 1029 unique PMWs, which means that in the standard data split there are few unseen PMWs in the test data. To make the task more difficult, and avoid possible lexical memorization (Levy et al., 2015;Vylomova et al., 2016), we employ a different split, to ensure no PMWs occur in both the training and test splits.
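The lexical split can be sketched as follows: partition the unique PMWs (rather than the sentences) between train and test, so that no PMW occurs on both sides. The sample data and split ratio below are hypothetical:

```python
import random

random.seed(0)

# Hypothetical samples: (sentence, pmw, label). The same PMW recurs
# across many sentences, so a random sentence-level split would leak
# target words between train and test.
samples = [
    ("Delft lies in South Holland.", "Delft", "literal"),
    ("Delft announced a new curriculum.", "Delft", "metonymic"),
    ("Oxford won the boat race.", "Oxford", "metonymic"),
    ("She moved to Oxford in 2001.", "Oxford", "literal"),
]

def lexical_split(samples, test_ratio=0.5):
    """Split so that no PMW appears in both train and test."""
    pmws = sorted({pmw for _, pmw, _ in samples})
    random.shuffle(pmws)
    cut = int(len(pmws) * (1 - test_ratio))
    train_pmws = set(pmws[:cut])
    train_set = [s for s in samples if s[1] in train_pmws]
    test_set = [s for s in samples if s[1] not in train_pmws]
    return train_set, test_set

train_set, test_set = lexical_split(samples)
```

Under this split, every test-time PMW is unseen at training time, which is what forces the model to rely on context rather than lexical memorization.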
A statistical breakdown of the five datasets is provided in Table 1 (noting that SEMEVAL LOC and SEMEVAL ORG are listed separately, making a total of six listings for our five datasets). Note that all datasets are in English.

Baselines
The first two baselines are GYDER (Farkas et al., 2007), the best model in the SemEval-2007 shared task, and the results reported by Nastase and Strube (2009) and Nastase et al. (2012) in subsequent years. We simply report their published results without reimplementing the models. We reimplement the state-of-the-art PreWin model (Gritta et al., 2017) as another baseline. To do this, we first generate the dependency structures and labels using SpaCy, and index the predicate window by the dependency head of the PMW. The output of the predicate window is then fed into two LSTM layers, one for the left context and one for the right context. The dependency relation labels of the content of the predicate window are represented as one-hot vectors and fed into two dense layers, for the left and right contexts separately. By concatenating the four layers' outputs and feeding them to a multi-layer perceptron, we get the final label. In line with the original paper, we use GloVe embeddings to represent the words (Pennington et al., 2014), set the window size to 5, use a dropout rate of 0.2, and train the model for 5 epochs.
To make the baseline model more competitive with our approach, we additionally experiment with a variant of the baseline where we replace the original GloVe embeddings with BERT embeddings (Devlin et al., 2019). We experimented with both BERT-base and BERT-large, but present results for BERT-base as we observed no improvement using the larger model.

Our Model
We use BERT in three settings, for both the BERT-base ("BERT-BASE") and BERT-large ("BERT-LG") models: (1) fine-tuned over a given dataset, with no masking; (2) fine-tuned with data augmentation (see Section 3.3); and (3) fine-tuned using target word masking (see Section 3.4). We use the uncased model with a learning rate of 5e-5, and max sequence length of 256. For WIMCOR, we fine-tune for 1 epoch with a batch size of 64, and dropout rate of 0.2. For the other datasets, we fine-tune for 10 epochs with a batch size of 64 and dropout rate of 0.1.

BERT Ensemble
Due to the large number of parameters in BERT and small size of the training datasets (with the exception of WIMCOR), the models tend to overfit or be impacted by bad initialisations. To counter this, we experimented with ensembling different runs of a given BERT model, specifically, the BERT-large model with word masking.

Cross-domain Transfer
In Section 4.1 we noticed that the different metonymy datasets were created with different annotation guidelines and over different data sources. To study the ability of the different models to generalise across datasets, we train models on one dataset and evaluate on a second, in the following six configurations: train on SEMEVAL LOC and test on RELOCAR (and vice versa); and train on CONLL and WIMCOR separately and test on either SEMEVAL LOC or RELOCAR. For all 6 settings, we compare the PreWin model of Gritta et al. (2017) with the three BERT settings (basic; with data augmentation; and with target word masking), all based on the BERT-large-uncased model.

Extrinsic Evaluation
To extrinsically evaluate our proposed method, we combine different metonymy resolution methods with a state-of-the-art geoparser, and evaluate over the GWN dataset. The task is to detect the locations with a literal reading only and ignore all other possible readings. Following Gritta et al. (2019), we classify toponyms as either literal or associative. We simply pipeline the Edinburgh Geoparser (without fine-tuning) (Grover et al., 2010) with our metonymy resolver as a baseline. The Edinburgh Geoparser detects all toponyms through an NER sequence supported by the Geonames gazetteer, but does not indicate metonymic usage. After this toponym detection, our metonymy resolver filters out non-literal uses of toponyms. The other two baselines used here are a reimplementation of the NCRF++ tagger of Gritta et al. (2019), and BERT-LG fine-tuned on the geoparsing task. For our end-to-end model, we separate geoparsing into the toponym detection and metonymy resolution subtasks, and fine-tune the NER part on toponym detection, and the masked model on metonymy resolution.
Table 3: Accuracy (%) of metonymy resolution for the non-locative dataset, averaged over 10 runs with standard deviation; the best result is indicated in boldface, and "*" denotes the result published by the authors.
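The filtering stage of this pipeline can be sketched as below. Both components are stubs (in the real system, toponym detection comes from the geoparser's NER output and the literal/metonymic decision from the masked BERT classifier); the toponym list and decision rule here are purely illustrative:

```python
# Hypothetical pipeline: a geoparser stub returns toponym spans, and a
# metonymy-resolver stub labels each span literal or non-literal.

def detect_toponyms(sentence):
    """Stub NER step: return (start, end, text) spans of toponyms."""
    spans = []
    for i, tok in enumerate(sentence.split()):
        word = tok.rstrip(".,")
        if word in {"Germany", "Vancouver"}:   # toy gazetteer
            spans.append((i, i, word))
    return spans

def is_literal(sentence, span):
    """Stub resolver: the real system runs the masked BERT classifier
    on the span's context; here one decision is hard-coded."""
    return "live in" in sentence

def literal_toponyms(sentence):
    """Keep only toponyms with a literal reading, as required by GWN."""
    return [s for s in detect_toponyms(sentence) if is_literal(sentence, s)]

kept = literal_toponyms("I live in Germany.")
dropped = literal_toponyms("Germany lost in the semi-final.")
```

The point of the sketch is the composition: detection and resolution are separate stages, so either stub can be swapped for a trained model without touching the other.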

Results
To evaluate metonymy resolution, we train each model over 10 runs and report the average accuracy and standard deviation. For geoparsing, we use precision, recall, and F1-score, based on 5-fold cross-validation.
Table 2 shows the results of metonymy resolution across the five locative datasets. For all datasets, both BERT-BASE and BERT-LG outperform the previous best-published results. The use of BERT in PREWIN BERT clearly improves over the original PREWIN GV, but the results are consistently lower than BERT-LG+MASK. Data augmentation ("aug") sometimes improves and sometimes degrades performance, whereas target word masking ("mask") consistently improves performance. The best single-model results for all datasets are achieved with our BERT-LG+MASK model, and the ensemble version improves results slightly further. We did not apply data augmentation to CONLL and WIMCOR, as the two datasets are sufficiently large without it.
Table 5: Geoparsing results (averaged over 5 folds of cross-validation, with standard deviation). E-Geoparser is the Edinburgh Geoparser.
Comparing the different datasets, the relative accuracies vary substantially: SEMEVAL LOC is the most difficult, while WIMCOR is the simplest, even with the lexically-split training and test data. With the original data split for WIMCOR, accuracy is over 99.5% even for BERT-BASE without masking (and even higher for the other BERT-based methods).
Although it is not the main focus of this paper, we also report the results for the non-locative dataset in Table 3. Once again, our masked model attains consistent improvements over the unmasked model. The best results are achieved by PREWIN BERT and the ensembled version of BERT-LG+MASK.
Table 4 shows the results of the cross-domain experiments. From the first two rows we see that generalisation between SEMEVAL LOC and RELOCAR is poor, which we hypothesise is due to the differences in the annotation schemes and label distributions. In contrast, models trained on CONLL transfer better to RELOCAR and SEMEVAL LOC. The last two rows show the cross-domain results from WIMCOR: despite WIMCOR containing orders of magnitude more data than the other datasets, it transfers poorly to both RELOCAR and SEMEVAL LOC. This supports our conjecture that WIMCOR is more one-dimensional than the other datasets, making it hard to generalise from, even with the additional training data. Overall, the models using either BERT embeddings or fine-tuned BERT perform better than PreWin with GloVe embeddings, and our masking approach consistently gives the models better generalisation ability.
Table 5 shows the geoparsing results on GWN. The Edinburgh Geoparser does not perform well, as it is not fine-tuned to the dataset. The BERT tagger outperforms NCRF++, and our end-to-end model beats BERT. This is evidence that incorporating explicit metonymy resolution into a geoparser improves its performance, and also that our metonymy resolution method is sufficiently accurate to improve over a comparable model without metonymy resolution.
We further analysed the attention weights of the different fine-tuned BERT models with and without target word masking. We compare the attention weights for each layer separately (12 vs. 24 layers for BERT-BASE and BERT-LG, respectively): we take the attention weight of each head on the target word, and average the heads' weights to generate a single sample point. We found that, for both models, attention on the target word is substantially higher for the last 4-5 layers, as shown in Figure 1. Moreover, target word masking makes the model attend more to the target word. Figure 2 shows the training curves for the BERT models over RELOCAR. We find that, generally, BERT-LG converges somewhat slower than BERT-BASE, but in each case, the masked model performs substantially better than the unmasked model, and is more data efficient. While we do not include them in the paper, the plots for the other datasets show a similar trend.
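A sketch of how such per-layer attention statistics can be computed, using a random attention tensor in place of real BERT attention outputs; averaging over query positions as well as heads is our own simplification of the procedure described above:

```python
import numpy as np

np.random.seed(0)

# Toy attention tensor for one sentence: (layers, heads, query, key).
# Real values would come from BERT's attention outputs; random here.
L, H, n = 12, 12, 8
attn = np.random.rand(L, H, n, n)
attn /= attn.sum(axis=-1, keepdims=True)   # normalise over key positions

def target_attention(attn, target_idx):
    """Per layer: take each head's attention paid to the target token,
    average over query positions and heads, yielding one sample point
    per layer."""
    per_head = attn[:, :, :, target_idx].mean(axis=2)   # (L, H)
    return per_head.mean(axis=1)                         # (L,)

per_layer = target_attention(attn, target_idx=3)
```

Plotting `per_layer` for the masked vs. unmasked models is what reveals the concentration of target-word attention in the final layers.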

Error Analysis
To further understand the task of metonymy resolution and why the model fails in some cases, we conducted a manual error analysis over a random sample of 150 errors from SEMEVAL LOC and RELOCAR. We roughly categorise the errors into 6 types, with each instance assigned a single error type. Some instances exhibited multiple errors among types 4, 5, and 6, in which case we classified them with the priority 6 > 5 > 4.
1. Data quality: Although the two datasets are labeled by experts, the inter-annotator agreement is not perfect, and the annotation guidelines are not exactly the same. For example, location-for-government is literal in SEMEVAL LOC but metonymic in RELOCAR. Even based on the annotation guidelines used to generate the datasets, there are some labels that we do not agree with, such as the example for Type 1 in Table 6: in our judgement, England is the literal place or country here, but the gold label is metonym.
2. Insufficient context: For model capacity and efficiency reasons, we restrict the context to the sentence containing the PMW. This removes useful context in some cases, such as the example for Type 2 in Table 6, where the preceding sentence is: He will forsake China production schedules for fine tuning the first of many travel itineraries planned for "my new career of retirement", as he put it. With this context, it is easier to recognise the PMW Canada as an event, and hence a metonym. These errors can be resolved by including more context.
Table 6: Error analysis. "Label" is the gold label for the example, which our model was not able to predict.
3. Mixed meaning: Some PMWs have both a literal and metonymic reading. We follow Gritta et al. (2017) in treating them as metonymies, but the dominant reading is sometimes literal. In the example for Type 3 in Table 6, Malaysia can either be the geographic place or the government of Malaysia. Such errors can be better handled with a more fine-grained classification schema.
4. Long distance implications or complex syntactic structure: The models struggle when the sentence structure becomes complex. For example, in the example for Type 4 in Table 6, the immediate context suggests the PMW is literal, but the broader context suggests the opposite. Such issues can be addressed by using models with richer representations of sentence structure.

5. Misleading wording and complex sentence semantics: Complex semantics and grammatical phenomena like noun possessives confuse the model. America in the phrase America's nuclear stockpile has a literal reading, while Lebanon in the phrase Lebanon's long ordeal has a metonymic reading. A particularly subtle example is that for Type 5 in Table 6, which requires the model to have near-human comprehension of semantics.
6. Missing world knowledge: Some examples require background knowledge to be understood, such as the example for Type 6 in Table 6, where Vietnam refers to an event (a war that happened there, and Bill Clinton's actions related to that war). To deal with this, the model needs to have access to world knowledge, either implicitly or explicitly.
These 6 error types vary in difficulty. From Table 6, we see that 62% of current errors are caused by Types 5-6, namely the model lacking an understanding of complex sentence semantics or world knowledge, which are hard to solve. Possibly the only case with a clear resolution is Type 2, where larger-context models may perform better.

Conclusions and Future Work
In this paper, we proposed a word masking approach to metonymy resolution based on pre-trained BERT, which substantially outperforms existing methods over a broad range of datasets. We also evaluated the ability of different models in a cross-domain setting, and showed our proposed method to generalise the best. We further demonstrated that an end-to-end metonymy resolution model can improve the performance of a downstream geoparsing task, and conducted a systematic error analysis of our model.
The proposed target word masking method can be applied to tasks beyond metonymy resolution. Numerous word-level classification tasks lack large-scale, high-quality, balanced datasets. We plan to apply the proposed word masking approach to these tasks to investigate whether it can lead to similar gains over other tasks.