Metaphors in Text Simplification: To change or not to change, that is the question

We present an analysis of metaphors in news text simplification. Using features that capture general and metaphor-specific characteristics, we test whether we can automatically identify which metaphors will be changed or preserved, and whether there are features that have different predictive power for metaphors and for literal words. The experiments show that the Age of Acquisition is the most distinctive feature for both metaphors and literal words. Features that capture Imageability and Concreteness are useful when used alone, but within the full set of features they lose their impact. Frequency of use seems to be the best feature to differentiate metaphors that should be changed from those to be preserved.


Introduction
Metaphor is ubiquitous in everyday language and central to human thought (Lakoff and Johnson, 1980). We find manifestations of it in colloquial and academic discourse, newspapers, school textbooks, political discourse and probably anywhere language is used. There are conflicting views, though, on whether metaphors are a useful communication device. Golden (2010) found that metaphors present in school textbooks can make the overall content comprehension more difficult. On the other hand, the essence of metaphor is to make abstract concepts, which are often hard to grasp, more easily understandable through concrete descriptions (e.g. Kövecses, 2017).
In this paper we investigate metaphors in the context of news text simplification. On a corpus of parallel sentences from news articles and their simplified versions (to grade 4 level, which corresponds to 9-10 years of age), we analyze metaphors that are kept, changed or added in the simplified version. Our aim is to verify whether we can characterize and automatically detect metaphors that help or do not help text understanding in the context of news articles.
The task of automatic text simplification has received a considerable amount of attention within NLP research. The proposed systems have, for the most part, addressed lexical and syntactic transformations, such as substitution of difficult words with simpler equivalents or altering the structure of sentences to make them more easily understandable (e.g. Barlacchi and Tonelli, 2013; De Belder and Moens, 2010; Drndarević and Saggion, 2012; Torunoglu-Selamet et al., 2016; Vu et al., 2014).
The automatic handling of metaphorical language has also been researched extensively. However, the studies have mainly investigated the possibilities of automatic metaphor identification. Simplification of metaphorical language has not been explicitly addressed yet. This could be attributed to the fact that metaphor simplification is a challenging task for automatic implementation (cf. Drndarević and Saggion, 2012). Some approaches have considered the problem of automatic metaphor interpretation (e.g. Bollegala and Shutova, 2013; Shutova, 2013), which aims to find literal paraphrases for metaphorical expressions. It is not clear though whether the literal version is easier to understand than the original metaphor. Sometimes lexical simplifications for complex words can be too basic to convey the original meaning (cf. Vu et al., 2014).
We take a step towards filling the gap in metaphor simplification research. We combine information (in the form of features) from text simplification and characteristics of metaphors to investigate whether there are specific features that can predict whether metaphors should be changed, and whether these differ from features that are predictive of lexical simplification in general. We use parallel versions ("raw" and simplified) of news data from Newsela 1, a company that produces professionally simplified news texts, in which we annotate metaphors. In experiments that predict whether a target word will be changed or not, we analyze the performance of the features used w.r.t. the type of the target word, metaphoric or literal (the full set of features is described in Section 4.2). We find that the Age of Acquisition is the strongest feature overall, for both metaphoric and literal words. Imageability, Familiarity and Concreteness are useful when used alone, but within the context of the full set of features they lose their impact. Frequency of use is an important feature for distinguishing metaphors that should be changed from those to be preserved.

Related work
Text simplification has numerous facets, and can be approached from different angles. The general need for simplification can be predicted based on the readability of a text, from the point of view of sentence complexity (Štajner et al., 2017) or a combination of lexical, syntactic and semantic text characteristics (De Clercq and Hoste, 2016). Simplification can be targeted by identifying complex words (e.g. Paetzold and Specia, 2016; Yimam et al., 2018), and then performing lexical simplification (e.g. Glavaš and Štajner, 2015; Glavaš and Vulić, 2018; Horn et al., 2014; Kriz et al., 2018).
Lexical simplification systems often build on sentence-aligned simplification corpora and propose substitutes for complex words from a number of synonyms based on the words' frequency, length and suitability for the original context (De Belder and Moens, 2010; Drndarević and Saggion, 2012; Vu et al., 2014). Approaches influenced by machine translation have also been explored, as lexical simplification can be viewed as monolingual translation (e.g. Nisioi et al., 2017; Xu et al., 2016; Zhu et al., 2010). Other neural based models have also been developed, which exploit word embeddings and their closeness in the vector space as clues for substitution candidates. Glavaš and Štajner (2015) produce word simplifications in a large regular corpus using word embeddings to perform lexical substitution tasks. The simplification candidates are ranked based on features such as semantic and context similarity, and information load.

1 https://newsela.com/
Our focus in this paper is narrower. We aim to explore metaphors in text simplification, and check if there are specific features that predict whether a metaphor should be changed or not. To represent the instances in our data we use features that previous work on text simplification has shown to be beneficial, as well as features useful in metaphor identification tasks.

Data
The data for this study comes from Newsela, a company that provides professionally simplified news texts for school reading activities. The editors follow simplification guidelines and are assisted by a tool in detecting difficult words. There is no description of the criteria used by the tool to detect such words. Regarding metaphors, the instructions are brief and seem to draw attention to idioms rather than metaphors: "be literal in lower versions. No straight out metaphors, as in no 'paint into a corner' in 5th grade or below."
Each Newsela article has five versions of different difficulty levels determined based on the Lexile 2 readability scores, which are used to measure the complexity of texts and assign them to appropriate grade levels. Using these parallel news texts allows for the quick identification of changed items to produce a dataset to which metaphor information is then added (Wolska and Clausen, 2017).

The dataset
We use a parallel corpus of 1,130 Newsela articles by Xu et al. (2015), where each original article has been aligned with its four simplified versions at the sentence level based on Jaccard similarity. For our study we look at the original (V0) and the most simplified (V4) versions, as between them we expect the most differences w.r.t. simplification strategies. From this corpus, we automatically sampled original sentences along with their equivalents from the chosen simplified version.
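As a rough illustration of similarity-based sentence alignment, the sketch below scores sentence pairs with Jaccard similarity over token sets. It is not the alignment procedure of Xu et al. (2015), whose details are not given here; the whitespace tokenization, the lowercasing and the helper names `jaccard` and `best_match` are assumptions made for the example.

```python
from __future__ import annotations


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the lowercased token sets of two sentences."""
    tokens_a = set(a.lower().split())  # assumption: whitespace tokenization
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def best_match(original: str, candidates: list[str]) -> tuple[int, float]:
    """Index and score of the candidate sentence most similar to the original."""
    scores = [jaccard(original, c) for c in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Toy sentences adapted for illustration only.
v0 = "Shaw suffered a concussion in a game a year ago"
v4 = [
    "Shaw could not walk or speak for a while",
    "Shaw suffered a concussion in a game last year",
]
index, score = best_match(v0, v4)
```

A purely lexical score of this kind favours sentences that reuse the original wording, which is one reason heavily rephrased sentences can be missed by an automatic alignment and need manual correction.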
Each Newsela version covers several, unevenly distributed, grade levels. Because of the potential differences between the grade levels within versions, we sampled only articles at grade level 12 from the original version and grade level 4 from the most simplified version. The selected grade levels correspond to the largest subsets within the respective versions. The sampling was randomized across documents to counter author and editor bias. The final dataset contains 582 documents, each consisting of one original sentence and one or more sentences in the corresponding simplified version. 3 All alignments were manually checked and corrected where necessary either by inserting missing sentences or by replacing wrong alignments with the correct ones. This resulted in 278 corrections, exemplified below ("m" indicates the manually inserted sentence, initially missing from the alignment):
V0 A year ago, [Shaw] Mychal suffered a concussion in a game that rendered him temporarily unable to walk or speak.
V4 Shaw suffered a concussion in a game last year.
V4-m Shaw could not walk or speak for a while.

Metaphor annotation
We focus on the two most common word classes: nouns and verbs. In the sampled documents, we annotated their occurrences in the original sentences as either metaphoric or not by following the guidelines of the metaphor identification procedure MIPVU (Steen et al., 2010). 4 The annotation in this study builds on Wolska and Clausen (2017), where it was carried out as follows: one author initially identified metaphoric items in a smaller subset of the data. All unclear cases were then discussed with the second author and either resolved or left unannotated. The annotation was completed by the initial annotator. In this study, we use a version of the dataset with expanded annotations: every noun/verb left unannotated in the previous study was annotated for metaphoricity by the same annotator as in the initial study.
In MIPVU, metaphoricity is identified by examining a text on a word-for-word basis and determining the context and the basic senses of each word. "Words" are considered to be lexical units provided with separate part of speech tags. 5 A word is used metaphorically if its context sense can be sufficiently contrasted to and understood in comparison with its basic sense. The context sense of a word is "the meaning it has in the situation in which it is used", whereas the basic sense is taken to be "more concrete, specific, and human-oriented" (Steen et al., 2010, p. 33-35).
The senses are determined by means of a dictionary; we consult the Macmillan Dictionary 6 , which is a standard reference used by the authors of the procedure. Different senses of a word correspond to separate, numbered descriptions within its grammatical category in a dictionary.
In an example from our dataset, given in (1), the verb struggling is used metaphorically, as there exists a more basic sense ("to use your strength to fight against someone or something"), which is contrasted to and compared with the contextual sense ("to try hard to do something that you find very difficult"). 7
(1) But now she's struggling to obtain documents required by the new law.
The quantitative information on the annotated dataset is summarized in Table 1.

Simplification types
For each annotated word we marked its equivalent in the simplified version and determined the simplification type chosen by an editor. 8 There are six simplification options that were identified for metaphoric items in Wolska and Clausen (2017), which we now apply to non-metaphoric items as well. A word can be preserved (same metaphor/same non-metaphor) 9, replaced by another word of the same metaphorical status (other metaphor for metaphoric items and other nonmetaphor for literal items), replaced by a word of opposite metaphorical status (changed to nonmetaphor for metaphors and changed to metaphor for literal items), rephrased with metaphorical language (phrase with metaphors) or without (phrase without metaphors), or removed (word removed). See Table 2 for an overview with examples.

Table 2: Overview of the simplification types with examples (V0: original sentence, V4: simplified version).

metaphoric

(type label missing)
V0 In exchange for a 4 percent piece of their companies, entrepreneurs in the program will gain access . . .
V4 . . . people in the program will give up a 4 percent share of their companies. In exchange they will get . . .

phrase with metaphor(s)
V0 But now she's struggling to obtain documents required by the new law.
V4 But now she's having a hard time getting the papers that the new law requires.

phrase w/o metaphor(s)
V0 Utah officials say that since 2008, highway crashes have dropped annually on stretches of rural Interstate . . .
V4 They say there have been fewer accidents where the speed limit was raised.

word removed
V0 Our goal is to provide Internet service to people in areas that can't afford to throw down fiber lines . . .
V4 Our goal is to provide Internet service to people in areas that can't afford Ø usual Internet lines . . .

non-metaphoric

same nonmetaphor
V0 "In the past several hundred years, people have cultivated the habit of smoking wherever they want," she said.
V4 "In the past several hundred years, people have "gotten used to" smoking wherever they want," she said.

other nonmetaphor
V0 With nothing less at stake than the future of planet Earth, NASA has decided to crowdsource ideas to detect and track asteroids . . .
V4 NASA wants to find and track asteroids, but it needs help. It is asking people around the world for ideas . . .

changed to metaphor
V0 That information could help the team's trainers implement practice plans that keep him spry the rest of the season.
V4 That could help the team's trainers make plans that keep him healthy for the season.

phrase with metaphor(s)
V0 "Even after the Holocaust, our minority still encounters racism and discrimination," he said, noting that they are Europe's last hired, first fired.
V4 His people still suffer unfair and insulting treatment, he said. They are the last in Europe to get jobs. They are also the first to be fired.

phrase w/o metaphor(s)
V0 On Thursday, the snowpack was a paltry 25 percent of average for this time of year.
V4 The snowpack was just one-quarter of what it usually is for this time of year.

word removed
V0 SnapDragon is a cross of Honeycrisp with a Jonagold-like hybrid that's easier for farmers to manage.
V4 SnapDragon is a cross of the tasty Honeycrisp apple and another kind that's easier Ø to grow.

8 . . . due to various reasons, e.g. complex syntactic structure. This is to be differentiated from the option word removed, where the changes are performed on the word level and which we annotate.
9 Morphological deviations are considered the same word.

The annotation of the simplification types in Wolska and Clausen (2017) was done as follows: on a smaller subset of sentences annotated for metaphoricity, two authors identified and discussed the simplification choices. Once these were finalized, one author annotated the remainder of the dataset and the second author 99 instances. 10 Inter-annotator agreement on the common subset was κ = .87. In the present study, one author extended the annotations. The quantitative information on the annotated simplification types is summarized in Table 3. 11 The statistics show that metaphors can be both useful and confusing for communication: 62% of the phrases that contained metaphors in the original article version contain a metaphor (the same or another one) in the simplified version. A small number of non-metaphors (2.3%) were replaced with metaphors in the simplified version.
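For readers unfamiliar with chance-corrected agreement, the following sketch computes Cohen's kappa for two annotators. That the reported κ is Cohen's kappa is an assumption (the text only gives the symbol), and the labels in the usage lines are made up for illustration; they are not taken from the annotated data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from each annotator's label counts.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical annotations of four items by two annotators.
annotator_1 = ["same metaphor", "same metaphor", "other metaphor", "word removed"]
annotator_2 = ["same metaphor", "other metaphor", "other metaphor", "word removed"]
kappa = cohens_kappa(annotator_1, annotator_2)
```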
With respect to the two word classes (nouns and verbs), we note considerable variation in the dataset (see Table 4). 93% of the verbs (186 metaphoric and 368 literal) appear fewer than five times; 67% (143 metaphoric and 256 literal) only once. The most frequent verbs annotated as metaphoric are have (22), make (18) and take

Experiments
The purpose of these experiments is to test whether there are distinguishable characteristics that indicate whether a metaphoric/literal word should/should not be changed to make the text easier to understand, and also whether there are features that are particular to metaphoric or literal words with respect to simplification. We conducted two sets of experiments: on the full dataset (metaphoric and literal items), and on the metaphoric part of the data. Through the experiments on the full dataset we investigate whether there are different features indicative of metaphor and literal word simplification, respectively. In the second set of experiments we perform a more in-depth exploration of the metaphoric part of the data and look at the changes within the fine-grained simplification types.

Experimental setup
For the first set of experiments, we group the simplification types in two classes: preserved and changed. Unchanged items (i.e. same metaphor and same non-metaphor) were assigned the preserved class. All other simplification types were combined as changed. The quantitative information on the items used in the experiments is provided in Table 5. The experiments were done with a Linear Support Vector Machine classifier using a 10-fold cross-validation strategy. 12 The feature values were standardized prior to the experiments. 13 We report the results of the random baseline, and the distribution of the different phenomena in the data.
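The class grouping and the standardization step can be sketched in plain Python. This is a stand-in for illustration, not the scikit-learn pipeline used in the experiments: the helper names `to_binary_class` and `standardize` are hypothetical, the type labels follow the names used in the text, and the population standard deviation is used on the assumption that it mirrors the default behaviour of a standard z-score scaler.

```python
from statistics import mean, pstdev

# Simplification types that count as "preserved"; everything else is "changed".
PRESERVED_TYPES = {"same metaphor", "same non-metaphor"}


def to_binary_class(simplification_type: str) -> str:
    """Map a fine-grained simplification type to preserved/changed."""
    return "preserved" if simplification_type in PRESERVED_TYPES else "changed"


def standardize(column):
    """Z-score a feature column: subtract the mean, divide by the std."""
    mu = mean(column)
    sigma = pstdev(column)  # population standard deviation (assumption)
    if sigma == 0:
        return [0.0 for _ in column]  # constant feature carries no information
    return [(x - mu) / sigma for x in column]


classes = [to_binary_class(t) for t in ("same metaphor", "word removed")]
scaled = standardize([1.0, 2.0, 3.0])
```

In a cross-validation setting, the mean and standard deviation would normally be estimated on the training folds only; the text does not say how this was handled, so the sketch standardizes a single column.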

Features
Data analysis has shown that both metaphors and literal words can be changed to help comprehension, and either can be replaced with metaphoric or literal expressions. To determine whether there are identifiable characteristics that could make this distinction automatable, we compile a number of features that have been shown to be useful for text simplification and metaphor identification. The metaphor-sensitive features are Imageability, Concreteness, WordNet senses and word's context; the general features are part of speech, vector space word representations, Age of Acquisition, word frequency and Familiarity. The feature types used and their coverage in our dataset are described below.

12 We use the SVM implementation in scikit-learn (Pedregosa et al., 2011): https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html
13 Standardization was performed with the StandardScaler in scikit-learn: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
Part of speech: The part of speech (POS) tagging was done using the NLTK toolkit 14 (Bird et al., 2009). The POS tags were then manually corrected where necessary. The two possible values are noun and verb.
Vector space word representations: We obtained vector space representations for each annotated word using Google's pre-trained word2vec model (Mikolov et al., 2013). 15 Word embeddings have been successfully used in metaphor identification (e.g. . . .).
Age of Acquisition: Age of Acquisition (AoA) ratings were obtained from the AoA norms database of 51,715 English words (Kuperman et al., 2012). AoA denotes the approximate age at which a word is learned. The simplified news articles used in this study are intended for classroom use by 9-10 year old children. Words usually acquired after this age should be more readily changed/removed in the simplified version.
We extracted the AoA ratings by matching both word forms and lemmas (e.g. noun testing/testing vs. verb testing/test).

Imageability, Familiarity and Concreteness: Imageability stands for the ability of a word to evoke mental images; Familiarity refers to the frequency of exposure to a word; Concreteness describes the level of abstraction associated with the concept a word represents. The connection of these variables to metaphor comprehension has been shown in multiple studies (e.g. Marschark et al. 1983; Paivio et al. 1968; Ureña and Faber 2010). Concrete words are more easily learned, processed and remembered than the abstract ones (Paivio et al., 1968). It is quite likely then that abstract words will be discarded during simplification. Marschark et al. (1983) found a link between high imageability and easier processing for certain metaphor types. These features were successfully used in lexical simplification (e.g. Jauhar and Specia 2012; Vajjala and Meurers 2014).
Imageability and Familiarity ratings were obtained from the MRC Psycholinguistic Database (Wilson, 1988). This database contains up to 26 (psycho)linguistic attributes for 150,837 words. Concreteness ratings were extracted from a collection of English Abstractness/Concreteness ratings (Köper and Schulte im Walde, 2017).
We extracted the values for the word forms if present in the databases and for the respective lemmas otherwise. For a number of words, the values are missing (see Table 6). De Hertog and Tack (2018) use the third and first quartile values for Imageability and Concreteness, respectively, following an assumption that rarer words tend to have lower imageability and concreteness, while Gooding and Kochmar (2018) use the null value. We decided to assign instead a "neutral" value: the median value for each feature based on the ratings in the MRC.
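The lookup order (word form first, lemma second) and the "neutral" fallback can be sketched as follows. The ratings dictionary holds invented toy values, not entries from the MRC database, and `lookup_rating` is a hypothetical helper name introduced for this example.

```python
from statistics import median


def lookup_rating(word_form, lemma, ratings, neutral):
    """Return the rating for the word form, else for the lemma, else the neutral value."""
    if word_form in ratings:
        return ratings[word_form]
    if lemma in ratings:
        return ratings[lemma]
    return neutral


# Toy ratings (hypothetical values, for illustration only).
toy_ratings = {"struggle": 400.0, "law": 500.0, "document": 560.0}

# "Neutral" value: the median over all ratings available in the resource.
neutral_value = median(toy_ratings.values())

via_lemma = lookup_rating("struggling", "struggle", toy_ratings, neutral_value)
imputed = lookup_rating("spry", "spry", toy_ratings, neutral_value)
```

Using the median rather than a quartile or a null value keeps imputed items in the middle of the scale, so that a missing rating does not by itself push a word towards either class.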

Table 6: Available and missing values for Imageability (Imag), Familiarity (Fam) and Concreteness (Concr).

        Available   Missing
Imag    2,288       1,221 (35%)
Fam     2,293       1,216 (35%)
Concr   3,509       0 (0%)

Word frequency: In lexical simplification systems, it is common to substitute infrequent words with their more frequent synonyms (e.g. De Belder and Moens, 2010). As Kriz et al. (2018), we assume that highly frequent words are easier to understand, whereas infrequent words are more difficult and therefore will be removed/changed in the process of simplification. We use word frequency counts from the SUBTLEX US database (Brysbaert and New, 2009), a corpus of subtitles for American English of 51M words. The frequencies are given per million words. We extracted the values based on the word forms in our data (3,503 words); 6 words (.2%) have frequency 0.
WordNet sense: The WordNet (Fellbaum, 1998) sense feature approximates a word's meaning in context. The values are the synset numbers representing the sense of a word in the original sentence. MIPVU uses sense information and comparison with a "basic" sense of a word to assign metaphoricity. The WordNet sense number could be an indication whether a word is metaphoric or not: the first sense is the more frequent, and could thus be considered basic, while the higher the sense number, the more likely it could be that the word is used metaphorically.
We use a Lesk-like (Lesk, 1986) method to disambiguate a target word relative to WordNet: a vector representation of the context of an annotated word (i.e. V0 sentence) is compared to a representation for each of the word's definitions in WordNet. The representations are generated using Google's pre-trained word2vec model. The context and each definition are compared using Word Mover's Distance (Kusner et al., 2015). 16 We chose the synset number whose definition is most similar to the word's context. The lookup in WordNet was done based on the word forms and matching POS tags. For 9 words (.3%) not found in WordNet the values are missing.
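The selection step of such a Lesk-like procedure can be sketched as below. To keep the example self-contained, a simple token-overlap distance stands in for Word Mover's Distance over word2vec vectors, so this is not the method used in the study; the glosses are invented and are not WordNet's actual definitions, and `overlap_distance` and `pick_sense` are hypothetical names.

```python
def overlap_distance(context_tokens, definition_tokens):
    """Stand-in distance: 1 minus the Jaccard overlap of the two token sets."""
    a, b = set(context_tokens), set(definition_tokens)
    if not a or not b:
        return 1.0
    return 1.0 - len(a & b) / len(a | b)


def pick_sense(context, definitions, distance=overlap_distance):
    """Return the 1-based number of the definition closest to the context."""
    ctx = context.lower().split()
    distances = [distance(ctx, d.lower().split()) for d in definitions]
    return min(range(len(distances)), key=distances.__getitem__) + 1


# Invented glosses for illustration only.
glosses = [
    "sloping land beside a river",
    "an institution that accepts deposits and makes a loan",
]
sense_number = pick_sense("the bank approved the loan", glosses)
```

Because the distance function is passed in as a parameter, an embedding-based distance could be substituted without changing the selection logic.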
Word's context: This feature reflects the discrepancy between the level of abstractness of a metaphoric word and its context. It was operationalized with ratings of Concreteness (Köper and Schulte im Walde, 2017) and Imageability from the MRC database. Turney et al. (2011) have shown that a word's degree of abstractness, relative to the context it appears in, can be successfully used to distinguish between literal and metaphoric meanings. Broadwell et al. (2013) used Imageability ratings to discover metaphors based on the assumption that they stand out of their context as being highly imageable.
We considered a symmetrical seven-word window centered on the target word. A word w's Concreteness context (CC) value is computed as: where n is the size of the window. The Imageability context (IC) is calculated in the same way.
In the computation we used the context words with available Concreteness and Imageability scores in the database. If ratings for the target word itself or for all context words were not found, the value for the feature was set to missing. The overview of the value counts is given in Table 7.

16 We used the implementation in Gensim Python library (Řehůřek and Sojka, 2010).
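The windowing and missing-value handling can be illustrated with a short sketch. The exact formula for the context value is not reproduced in this text, so the computation below (the target word's rating minus the mean rating of the rated context words) is only an assumption about how a discrepancy between a word and its context might be measured; the ratings are toy values and `context_feature` is a hypothetical name.

```python
from typing import Optional


def context_feature(tokens, target_index, ratings, window=7) -> Optional[float]:
    """Discrepancy between a target word's rating and its context's ratings.

    Returns None (missing) if the target or all context words lack a rating.
    """
    half = window // 2  # symmetrical window: three words on each side for window=7
    target = tokens[target_index]
    if target not in ratings:
        return None
    left = tokens[max(0, target_index - half):target_index]
    right = tokens[target_index + 1:target_index + 1 + half]
    rated = [ratings[t] for t in left + right if t in ratings]
    if not rated:
        return None
    # ASSUMPTION: difference between target rating and mean context rating.
    return ratings[target] - sum(rated) / len(rated)


# Toy ratings (hypothetical values, for illustration only).
toy = {"struggling": 3.0, "obtain": 2.0, "documents": 6.0}
sentence = "she is struggling to obtain documents today".split()
value = context_feature(sentence, 2, toy)
```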

Experiment 1: Metaphoric vs. literal words
To assess the impact of the different features on predicting whether a word should or should not be changed, we group the features based on the type of information they capture:
• IFC (Imageability + Familiarity + Concreteness): informative for metaphoric words
• WN+IC/CC (WordNet sense + word's context): different aspect of metaphor relevance
• Freq+AoA (word frequency + Age of Acquisition): relevant for both metaphoric and literal items
The F-score results on the full dataset (1,277 changed, 2,232 preserved instances) for different feature combinations are presented in Figure 1. 17 AoA has the highest Precision for the class changed in both metaphoric and literal cases. This shows that whereas this feature might be good in accurately detecting items that need simplification, it does not differentiate between metaphoric and literal usages in the current setting. Previous studies have shown that some correlation exists between the AoA and frequency of usage (e.g. Ghyselinck et al., 2004), but in this case the AoA feature and the Frequency feature have different effects when used alone (see Figure 1). In particular, the Frequency feature is not useful to determine whether a word should be changed or not, contrary to our expectations.
We expected the "metaphor-specific" features (IFC) to have a higher impact on the metaphoric than on the literal words. When used alone they do lead to better prediction for changing metaphoric words compared to literal ones, but within the context of the full feature set, their impact is minimal (all, all-IFC/I/F/C). The imbalance in the data set could explain why, when using other features which can pick up on characteristics of literal words (or both), the effect of Familiarity, Concreteness and Imageability is lost.
The WN and context features also behave in an interesting manner. Alone, neither of these has much impact on distinguishing words that should be changed or not (WN/CC/IC). But when combined, their predictive power grows, particularly considering the approximate 1:5 ratio between metaphoric and literal target words. We tested a binary representation of the WN feature: is the disambiguated sense the first one (the "basic" one) or not. This set-up led to worse results. This could mean that assuming that the first sense in WordNet is the "basic" sense is erroneous, even though it is the most frequent one.
Looking at the results on the subsets corresponding to nouns and verbs (see Table 8), we note that there are differences in terms of the useful features. Predicting that nouns should be preserved is consistent w.r.t. the features used, and close to the majority baseline. Using all features leads to the best results overall, for both nouns and verbs, whether they should be changed or preserved. Metaphor-relevant features (IFC and contextual information) are not helpful in predicting verbs and nouns that need to be changed. However, they appear to be more relevant for verbs. The Frequency and Age of Acquisition combination seems to be more important for verbs than for nouns.

Experiment 2: Metaphoric words
We use the subset of 285 changed and 299 preserved metaphors to test the impact of different subsets of features for predicting change/preserve for metaphoric target words. The results are given in Table 9 for the complete metaphoric dataset.
We further analyze the results of classifying originally metaphoric words as changed or preserved in the simplified texts. We look into the data subsets corresponding to the different metaphor simplification phenomena, and produce the recall results shown in Table 10. We cannot compute precision because all instances in each subset belong to one class (i.e. either changed or preserved).
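A small sketch of the per-subset evaluation: when every gold label in a subset belongs to one class, recall for that class is simply the share of items predicted correctly. The predictions in the usage lines are invented for illustration, and `recall` is a hypothetical helper rather than the evaluation code used in the study.

```python
def recall(gold, predicted, target_class):
    """Share of gold items of target_class that were predicted as target_class."""
    relevant = [p for g, p in zip(gold, predicted) if g == target_class]
    if not relevant:
        raise ValueError("no gold instances of the target class")
    return sum(p == target_class for p in relevant) / len(relevant)


# Hypothetical subset in which every item was changed during simplification.
gold_labels = ["changed", "changed", "changed", "changed"]
predictions = ["changed", "preserved", "changed", "changed"]
subset_recall = recall(gold_labels, predictions, "changed")
```

Since such a subset contains no instances of the other class, there is nothing that could be wrongly assigned to the subset's class, which is why recall is the informative measure at this level.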
The results for the metaphoric data preserve some of the tendencies seen on the complete dataset, and they also reveal some new insights. AoA leads to the highest Precision score for the class changed and has high Recall and F-score for the class preserved. Frequency of use appears to be the most useful in distinguishing between metaphors that should be changed or preserved. This is quite intuitive, as metaphors that are less common are more difficult to understand. Contrary to its impact in the first experiment (classifying whether a word should be changed or not, regardless of whether it is metaphoric or literal), when analyzing metaphoric words and classifying them into changed/preserved, Frequency is the best feature. This effect is apparent also when looking at the subsets corresponding to the different simplification types (see Table 10).
The word's context features (IC/CC) have the highest Recall scores for the preserved cases, but in combination with the WordNet senses feature they stop being useful for differentiating between the two classes. Just as in the first experiment, when used alone the IFC features are clearly useful, but within the full set of features they lose their predictive power. For the preserved items, the context features (IC/CC) show the best results. Those metaphors that were rephrased with metaphorical content are best described with the IFC features, whereas the WN senses feature is good when identifying paraphrases without metaphors.

Conclusion
The analysis of metaphor usage in original and simplified versions of the same news texts has shown that not all metaphors are alike, from the point of view of text comprehension. A large percentage of metaphors in our dataset were either preserved or replaced using metaphorical language, while a (much) smaller number of literally used words was replaced with a metaphoric expression.
The evaluation of the features most frequently used in the literature for text simplification and metaphor identification has shown that for both metaphors and literal words, the most informative feature is the Age of Acquisition. Features that capture the imageability, familiarity and concreteness of a word have similar performance in predicting change/no change for both metaphorical and literal words when used alone. When used together with our other features, their predictive power diminishes. While not useful for separating changed and preserved words in the full dataset, the frequency of usage is a telling feature for metaphoric words, even at a fine-grained level.
One factor that could have influenced the results of these experiments is the incomplete coverage provided by the Imageability and Familiarity features. In future work we plan to improve the assignment of missing values by deriving a value using the scores assigned to the most similar words. We will further explore features that capture the interaction between a target word and its context, including contextual embeddings and the word's syntactic role.