Automatically Inferring Implicit Properties in Similes

A simile is a figure of speech comparing two fundamentally different things. Sometimes, a simile will explain the basis of a comparison by explicitly mentioning a shared property. For example, "my room is as cold as Antarctica" gives "cold" as the property shared by the room and Antarctica. But most similes do not give an explicit property (e.g., "my room feels like Antarctica"), leaving the reader to infer that the room is cold. We tackle the problem of automatically inferring implicit properties evoked by similes. Our approach involves three steps: (1) generating candidate properties from different sources, (2) evaluating properties based on the influence of multiple simile components, and (3) aggregated ranking of the properties. We also present an analysis showing that the difficulty of inferring an implicit property for a simile correlates with its interpretive diversity.


Introduction
A simile is a figure of speech comparing two essentially unlike things, typically using "like" or "as" (Paul, 1970). Comparing fundamentally different types of entities is what makes a simile figurative (Israel et al., 2004). Similes may be closed or open (Beardsley, 1981). A closed simile explains the basis for a comparison by explicitly mentioning a shared property. For example, the simile "my room is as cold as Antarctica" gives "cold" as the property shared by both the room and Antarctica. But most similes do not explicitly mention the basis for comparison, leaving people to infer what the entities have in common. An open simile expressing the same comparison is "my room feels like Antarctica", where the shared property of being cold is left implicit. In our study of similes in tweets, we found that 92% of similes are open similes so the property must be inferred. Our research tackles this problem of inferring the implicit property evoked by an open simile.
Inferring the basis of comparison in a simile is central to natural language understanding and metaphor interpretation. For example, "John was like a lion in battle" is probably a statement about John's bravery or courage, not a description of John's physical appearance. Methods to understand figurative similes could also be valuable to understand metaphor in other linguistic constructions, such as predicate nominals (e.g., "he is a lion"). Furthermore, identifying the implicit property of a simile could be useful for sentiment analysis, because similes are often used to express positive and negative feelings (Li et al., 2012). For example, "John was like a lion in battle" contains only neutral words, but inferring "bravery" as the implicit property suggests that the simile has positive polarity.
We designed a three-step process to infer the implicit properties of open similes. First, we generate candidate properties for a simile by harvesting words that are associated with its verb ("event") or object of comparison ("vehicle") using a variety of methods, including syntactic patterns, dictionary definitions, and word embeddings. Each candidate property is generated from just one component of the simile. The second step of the process then evaluates each property's compatibility with the complementary component of the simile (event or vehicle). Finally, the third step of the process aggregates all of the candidates generated by different methods and ranks them based on collective evidence from the different sources. We evaluate the performance of our approach using gold standard properties provided by seven human annotators. We also present an analysis of the similes in our data set with respect to their interpretive diversity (intuitively, a measure of how many plausible interpretations a simile has). We show that our method performs best on similes with low diversity, as one would expect since their implicit properties are most clear to humans.

Problem Description and Data
A simile typically consists of four key components: the topic or tenor (subject of the comparison), the vehicle (object of the comparison), the event (act or state), and a comparator (usually "as", "like", or "than") (Niculae and Danescu-Niculescu-Mizil, 2014). For the simile "the room feels like Antarctica", "room" is the tenor, "feels" is the event, and "Antarctica" is the vehicle. A property (shared attribute) can optionally be included to explicitly state how the tenor is being compared with the vehicle (e.g., "the room is as cold as Antarctica"). Table 1 shows examples of open similes from our Twitter data set, along with several properties inferred by our human annotators (our data set is described in Section 2.1). We represent each simile using just the head noun of the tenor and vehicle, and the lemma of the event. Veale and Hao (2007) observed that when a property is explicitly given, it is usually a salient property of the vehicle. Table 1 illustrates some examples of inferred properties that are strongly associated with the vehicle (e.g., "melodic" and "dulcet" are musical attributes).
We observed that implicit properties can be strongly evoked from the event as well. For example, most inferred properties for "person buzz like fridge" emanate from the word "buzz", such as "humming", "vibrating", "distracting", and "annoying". Similarly, the tenor can also evoke properties, as we see with the inferred property "squinty" for the simile "eye feel like clam", although our observation is that this is less common. The event and the tenor need to be semantically rich to evoke implicit properties. The event in many similes is a form of "to be" or a perception verb (e.g., "feels"), which are semantically weak and contribute little. A tenor provides limited information when it is a pronoun or unknown entity (e.g., "John drives like a snail" is understandable without knowing who John is). Ultimately, an implicit property must be compatible with the vehicle, the event, and the tenor in order for a simile to make sense. For example, Antarctica is strongly associated with the color "white", but it would not make sense to infer the property "white" for the simile "my room feels like Antarctica" because of the verb "feel". Although in this example the tenor "room" is still compatible with "white" and will not help to eliminate "white" as a property, in other similes it may (e.g., rivers can be "wide", but time cannot be, so "wide" can be eliminated as an implicit property in the simile "time be like river").
A novel aspect of our work is that our architecture is designed to consider a property's compatibility with multiple components. In this research, for generating candidate properties and utilizing their influence for compatibility, we particularly focus on the vehicle and event terms. Initially, we generate candidate properties from the vehicle and the event separately. But the second step then evaluates each candidate property's compatibility with the complementary simile component. If a property was initially generated from the vehicle, then we evaluate its compatibility with the event; if a property was initially generated from the event, then we evaluate its compatibility with the vehicle. This approach emphasizes the need to consider multiple components of a simile when inferring implicit properties.

Collecting Similes with Implicit Properties
For our research, we created a new data set of open similes, where the property is implicit. Similes are common on Twitter, so we extracted similes from roughly 140 million English tweets collected during the time period 2/13/2013 - 4/15/2014. To identify similes, we applied a part-of-speech tagger designed for Twitter (Owoputi et al., 2013) to tweets containing the word "like" and applied rules to recognize simple noun phrases and verb phrases. We then selected tweets matching the syntactic pattern: NP1 VERB like NP2, where NP2 can contain only a noun and an optional indefinite article. We required similes to have a vehicle term with no premodifiers to avoid problems associated with coreference (e.g., "the man" or "that man") and to focus on vehicles that represent general concepts. We leave for future work the challenge of tackling multi-word vehicle phrases (e.g., "my room is like stepping into a hurricane" or "my room is like a boots store").
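The selection pattern above can be illustrated with a minimal sketch over POS-tagged tokens. The tag names (`N`, `^`, `O`, `V`), the function name `match_open_simile`, and the "nearest noun to the left" approximation of the tenor head are assumptions made for illustration; they are not the rules used in the study, and no lemmatization is performed here.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # (word, coarse POS tag); tag names are illustrative

TENOR_TAGS = {"N", "^", "O"}   # common noun, proper noun, pronoun (assumed tags)
VEHICLE_TAGS = {"N", "^"}      # the vehicle must be a single bare noun


def match_open_simile(tokens: List[Token]) -> Optional[Tuple[str, str, str]]:
    """Return (tenor_head, event, vehicle) if tokens match NP1 VERB like NP2."""
    for i, (word, _) in enumerate(tokens):
        if word.lower() != "like" or i == 0 or i + 1 >= len(tokens):
            continue
        event_word, event_tag = tokens[i - 1]
        if event_tag != "V":
            continue
        # NP1: approximate the head noun by the nearest noun-like token
        # to the left of the verb.
        tenor = next(
            (w for w, t in reversed(tokens[: i - 1]) if t in TENOR_TAGS), None
        )
        if tenor is None:
            continue
        # NP2: an optional indefinite article followed by one noun.
        j = i + 1
        if tokens[j][0].lower() in {"a", "an"}:
            j += 1
        if j >= len(tokens) or tokens[j][1] not in VEHICLE_TAGS:
            continue
        # Reject multi-word vehicles such as "a boots store".
        if j + 1 < len(tokens) and tokens[j + 1][1] in VEHICLE_TAGS:
            continue
        return tenor.lower(), event_word.lower(), tokens[j][0].lower()
    return None


simple = [("my", "D"), ("room", "N"), ("feels", "V"), ("like", "P"),
          ("Antarctica", "^")]
multiword = [("my", "D"), ("room", "N"), ("is", "V"), ("like", "P"),
             ("a", "D"), ("boots", "N"), ("store", "N")]
print(match_open_simile(simple))
print(match_open_simile(multiword))
```

In this sketch the first example yields a (tenor, event, vehicle) triple, while the multi-word vehicle is rejected, mirroring the restriction on NP2.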
This selection process extracted many similes, but it also extracted literal comparisons with no apparent property (e.g., "this flower smells like a rose") and statements that are not comparisons (e.g., "I called like five times"). To focus on figurative similes with an implicit property, we further filtered the collection to only retain similes with vehicle terms that had occurred in comparisons with an explicit property. Using the same Twitter data, we extracted nouns that appeared in the following syntactic patterns, which represent comparison constructions with an adjectival property: ADJ like [a, an] NOUN (e.g., "red like a tomato") and ADJ as [a, an] NOUN (e.g., "red as a tomato"). We only kept similes whose vehicle occurred in these patterns. Finally, we filtered similes that contain a pronoun (except personal pronouns in the tenor, which we generalized to a "person" token), common person first names (1), profanity (2), or words not in a dictionary (3) to avoid issues with Twitter language such as misspellings, elongated words, etc.
(1) http://deron.meranda.us/data/census-derived-all-first.txt
(2) http://www.bannedwordlist.com/lists/swearWords.txt
(3) Using Wordnik: https://www.wordnik.com/

Gold Standard Implicit Properties
We developed a gold standard set of implicit properties for each simile using Mechanical Turk. We prequalified 7 workers, who each annotated 700 similes with frequency ≥ 3 randomly selected from our collection. Each annotator was asked to provide up to 2 properties that best captured the most likely basis for comparison between the tenor and vehicle. We also provided the annotators with the option to label a simile as Invalid if it was not a simile at all (most commonly due to parse errors, such as "he looks like ran") or label a simile as having No Property (often due to literal or underspecified comparisons, such as "she looks like my aunt"). The annotators were asked to give adjectives, adverbs, or verbs but occasionally they provided a noun. Table 1 presents sample annotated simile properties.
Among the 700 similes, a majority of the annotators labeled 59 of them as either Invalid or No Property, so we did not use these. We set aside 183 similes (29%) as a development set and the remaining 458 similes (71%) as a test set.

Inferring Implicit Properties
Our research tackles the problem of inferring properties in open similes by decomposing the problem into three subtasks: (1) generating candidate properties, (2) evaluating the candidate properties with respect to multiple simile components, and (3) aggregated ranking of the properties. Figure 1 illustrates our approach. First, the vehicle and event components of a simile are used individually to generate candidate properties. We investigate a variety of candidate generation methods, including harvesting properties from syntactic structures and dictionary definitions, identifying relevant properties using statistical cooccurrence, and assessing similarity between word embedding vectors.
Second, the candidates generated by each method are evaluated based on their strength of association with the complementary component of the simile. For candidates generated from the vehicle term, we evaluate them based on their association with the event term, and vice versa. We explore three association measures: point-wise mutual information to measure statistical co-occurrence, and vector similarity using single and composite word embeddings.
Third, we produce an aggregate ranking over the entire set of properties hypothesized by all of the candidate generation methods. Intuitively, we view each candidate generation method as an independent source, and look at the aggregate evidence across the set of different candidate generation methods (similar to an ensemble). Each property is scored based on its average rank across the different methods, so that properties highly ranked by multiple methods are preferred.

Candidate Property Generation
We generate candidate properties from the vehicle and event words of a simile. However when the event is a form of "to be" or a perception verb (taste, smell, feel, sound, look), we do not generate candidate properties from the event because the verb is too general. Only 73 (16%) of the similes in our evaluation data have a verb other than "to be" or a perception verb. We restrict properties to be adjectives, adverbs, or verb forms that can function as nominal premodifiers (e.g., "crying baby", "wilted lettuce"). We explore a total of seven methods for generating candidate properties and generate candidates using our entire Twitter corpus.
Modifying ADJ: Given a vehicle term, we extract pre-modifying adjectives. For example, "ripe" is extracted for the vehicle "tomato" from the phrase "ripe tomato".
Predicate ADJ: Given a vehicle term, we extract adjectives in predicate adjective constructions with the vehicle. For example, "red" is extracted for the vehicle "tomato" from the phrase "tomato is red".
Modifying ADV: Given an event term (verb), we extract adverbs that precede or follow the verb. For example, "immaturely" is extracted for the event "act" due to the phrase "acts immaturely".
Explicit Property: We extract properties mentioned explicitly in comparison phrases. For vehicle terms, we extract properties from phrases of the form: "ADJ/ADV like NP" (e.g., "cold like Antarctica") and "ADJ/ADV as NP" (e.g., "cold as Antarctica"). For event terms, we extract properties from phrases of the form: "VERB ADJ/ADV like" and "VERB as ADJ/ADV as" (e.g., "feels as cold as").
Dictionary Definition: Dictionary definitions often mention salient properties associated with a word. We harvest adjectives, adverbs and verbs (functioning as premodifiers) as candidate properties from the dictionary definitions of the vehicle and event terms. For the definitions, we use Wordnik 4 , which contains 5 source dictionaries: Heritage Dictionary of the English Language, Wiktionary, the Collaborative International Dictionary of English, The Century Dictionary and Cyclopedia, and WordNet 3.0 (Miller, 1995).
PMI: Given a vehicle or event term, we compute point-wise mutual information (PMI) between that term and candidate properties (appearing in ≥ 100 tweets) in our Twitter corpus.
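As a rough illustration of the PMI method, the sketch below scores candidate properties from tweet-level counts. The function names, the count dictionaries, the base-2 logarithm, and the choice to skip pairs that never co-occur are assumptions for this example; only the frequency threshold of 100 tweets comes from the description above.

```python
import math
from typing import Dict, Iterable, List, Tuple


def pmi(joint: int, count_x: int, count_y: int, total: int) -> float:
    """PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ) from raw tweet counts."""
    if min(joint, count_x, count_y) <= 0:
        return float("-inf")  # undefined when the pair never co-occurs
    return math.log2((joint * total) / (count_x * count_y))


def rank_by_pmi(
    term: str,
    candidates: Iterable[str],
    tweet_freq: Dict[str, int],             # tweets containing each word
    co_freq: Dict[Tuple[str, str], int],    # tweets containing both words
    total: int,                             # tweets in the corpus
    min_freq: int = 100,
    top_k: int = 20,
) -> List[Tuple[str, float]]:
    scored = []
    for prop in candidates:
        if tweet_freq.get(prop, 0) < min_freq:
            continue  # candidate properties must appear in >= 100 tweets
        score = pmi(co_freq.get((term, prop), 0),
                    tweet_freq.get(term, 0), tweet_freq[prop], total)
        if score != float("-inf"):
            scored.append((prop, score))
    scored.sort(key=lambda ps: ps[1], reverse=True)
    return scored[:top_k]


# Toy counts, invented purely for illustration.
tweet_freq = {"turtle": 1000, "slow": 5000, "green": 20000, "rare": 50}
co_freq = {("turtle", "slow"): 200, ("turtle", "green"): 100}
ranked = rank_by_pmi("turtle", ["slow", "green", "rare"],
                     tweet_freq, co_freq, total=1_000_000)
print(ranked)
```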
Word Embedding: We train a word embedding model using our tweet collection, limiting the vocabulary to nouns, verbs, adjectives and adverbs that occurred in ≥ 100 tweets. For training, we use word2vecf 5 (Levy and Goldberg, 2014) which allows training for arbitrary context using the skipgram model. We use 300 dimensions for the output word and context vectors. Candidate properties are generated by selecting the words whose context vector 6 is most similar to the vehicle or event's word vector using cosine similarity. To control for noisy candidates, we require that the property occurred with the vehicle (or event) as a bigram with frequency ≥ 10 in the Twitter corpus.
For each generation method, we rank the candidates and select the top 20 properties. For the four methods that use syntactic patterns, we calculate P(property | vehicle) as the number of times the property and the vehicle appear together in that syntactic construction, divided by the number of times the vehicle appears in that syntactic construction. We use this probability to rank the candidates. For the dictionary definition method, we sort the properties based on how many of the 5 dictionaries mention the property in the word's definition. We break ties based on the frequency of the property in the definitions. For the word embedding-based method, we use cosine similarity scores.
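The conditional-probability ranking used for the pattern-based methods can be sketched as follows. The input format (a list of (vehicle, property) pairs extracted by one syntactic construction), the function name, and the alphabetical tie-breaking are assumptions made for this example.

```python
from collections import Counter
from typing import List, Tuple


def rank_by_conditional_probability(
    pairs: List[Tuple[str, str]], vehicle: str, top_k: int = 20
) -> List[Tuple[str, float]]:
    """Rank properties by P(property | vehicle) within one construction."""
    counts = Counter(prop for v, prop in pairs if v == vehicle)
    total = sum(counts.values())  # times the vehicle occurs in the construction
    if total == 0:
        return []
    # Sort by descending count; ties are broken alphabetically (an assumption).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(prop, count / total) for prop, count in ranked[:top_k]]


# Toy extractions for a pattern such as "ADJ like NP"; counts are invented.
pairs = [("tomato", "red")] * 3 + [("tomato", "ripe")] + [("sky", "blue")]
print(rank_by_conditional_probability(pairs, "tomato"))
```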

Productivity of the Candidate Generation Methods
First we investigate how many candidates each method is able to generate. If a method generates too few candidates, it will not be very useful. Conversely, if a method generates a large number of candidates, then our ranking framework needs to be robust to rank the plausible properties higher than the properties that do not fit.  Figure 2 presents statistics about the candidate properties generated by different methods. The PMI and Word Embedding-based methods were excluded here as these methods evaluate all words in the corpus. The methods that used the explicit property extraction patterns and dictionary definitions generate fewer candidates than the methods that used general syntactic structures. The trend lines in Figure 2 show that these methods do not generate more than 20 candidate properties for most similes.

Coverage of the Generated Candidates
Next, we investigate the effectiveness of our candidate generation methods. The last column of Table 2 shows candidate ranking results based on Mean Reciprocal Rank (MRR) for the top 20 properties produced by each candidate generation method. MRR is calculated by:
\[ \mathrm{MRR} = \frac{1}{|S|} \sum_{i=1}^{|S|} \frac{1}{\mathrm{rank}_i} \]
where S is the set of similes and rank_i is the rank of the highest-ranked gold standard property for the i-th simile. We observe that the PMI method (for both vehicles and events) and the Dictionary Definition method (for events) produced low MRR scores < 0.10. Therefore we decided not to use these candidate generation methods. 7
One of our primary concerns is assessing the ability of our candidate generation methods to generate at least some acceptable properties. We expect them to over-generate, but they need to produce at least one acceptable property or the downstream components will be helpless. To assess this, we evaluated the coverage of each candidate generation method based on the Top 10, Top 20, and Top 30 properties that it produced. Coverage is the percentage of similes for which the method generates at least one gold standard property (from the human annotators). Table 2 shows that the Dictionary Definition method for vehicles was the best performing method for the Top 10 candidates, generating at least one acceptable property for 40% of the similes. The Modifying ADJ method performed best for the Top 30 candidates, generating an acceptable property for 63% of similes. Note that the Explicit Property method performs reasonably well (40% coverage for Top 30 properties generated from vehicles and 6% coverage for properties generated from events), but clearly is not sufficient on its own, showing the limitation of harvesting explicitly stated properties.
Table 2 (caption, partially recovered): ... property within top 10, 20, 30 ranked properties. Methods excluded in "ALL" and "TOTAL" rows are marked with (*). In the MRR calculation when the event component is source, similes with a "to be" or a perception verb were excluded.
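The MRR and coverage measures described in this section can be sketched as below. The dictionary-based data layout and function names are assumptions for illustration; scoring a simile as 0 when no gold property appears within the cutoff is the usual convention and is assumed here rather than taken from the text.

```python
from typing import Dict, List, Set


def reciprocal_rank(ranked: List[str], gold: Set[str], top_k: int = 20) -> float:
    """1/rank of the first gold property in the top k, or 0.0 if none (assumed)."""
    for rank, prop in enumerate(ranked[:top_k], start=1):
        if prop in gold:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(
    rankings: Dict[str, List[str]], gold: Dict[str, Set[str]], top_k: int = 20
) -> float:
    """Average reciprocal rank over the set of similes S."""
    return sum(
        reciprocal_rank(rankings.get(s, []), gold[s], top_k) for s in gold
    ) / len(gold)


def coverage(
    rankings: Dict[str, List[str]], gold: Dict[str, Set[str]], top_k: int
) -> float:
    """Fraction of similes with at least one gold property in the top k."""
    hits = sum(
        1 for s in gold if set(rankings.get(s, [])[:top_k]) & gold[s]
    )
    return hits / len(gold)


# Toy data, invented for illustration.
rankings = {"dad drive like turtle": ["green", "slow", "old"],
            "person act like mom": ["tall", "busy"]}
gold = {"dad drive like turtle": {"slow"},
        "person act like mom": {"caring"}}
print(mean_reciprocal_rank(rankings, gold))
print(coverage(rankings, gold, top_k=1), coverage(rankings, gold, top_k=2))
```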
The ALL rows show the coverage obtained by combining the property lists from all generation methods listed above in the table. The combined set of properties (Top 30) generated from vehicles yields 86% coverage, while the combined set of properties generated from events yields only 10% coverage (partly because these methods apply to only 16% of the similes), showing that vehicles are more effective for candidate generation. However, the TOTAL row shows that combining properties generated from both vehicles and events yields 88% coverage using the Top 30 candidates. The Top 20 candidates provide coverage that is nearly as good (86%) with substantially fewer properties to process downstream, so we use the Top 20 candidates for all of our experiments. 8

Ranking the Candidate Properties Using Influence from the Second Component
Next, we investigate whether the initial ranking results in the previous step can be improved by considering the second component of the simile. Intuitively, suppose that "green", "slow", and "endangered" are generated as candidate properties from the vehicle "turtle" (e.g., for "dad drives like a turtle"). Taking the event verb "drive" into account can help to rank "slow" more highly than the other candidates. We explore three criteria to rank candidates generated from one simile component based on their association with the second component (unless the event is "to be", in which case we retain the original candidate ranking because the verb is too general).
PMI with second component (PMI): We calculate Pointwise Mutual Information between a candidate property and the second component of a simile.
Embedding word vector similarity with the second component (EMB 1 ): We use our trained word embeddings model to calculate cosine similarity between a candidate property and the second component of the simile. As before, for properties we use the context vectors.
Embedding word vector similarity with composite simile vector (EMB 2 ): For a given event and vehicle, we create a composite simile vector by performing element-wise addition of the vectors for the event and the vehicle, and calculate cosine similarity with the candidate properties. For example, for "person talks like robot", the vectors for "talk" and "robot" are used to create a composite vector, and the similarity of the resulting vector with a candidate property's context vector is used as the ranking criterion. The intuition here is to capture what is common in the context distribution (Mikolov et al., 2013) of "robot" and "talk", and the context vector of a suitable property should have strong similarity with the resulting vector.
The original MRR scores from Table 2 are also presented in the first column (Orig). Influence from the second simile component assessed with PMI and EMB 1 improved the MRR scores for some candidate generation methods (e.g., Predicate ADJ), but did not for others (e.g., Modifying ADV). However, using the composite word embedding vector (EMB 2 ) to capture the common aspects in the context distributions of the event and vehicle consistently improved MRR for all candidate generation methods. Consequently, we use the composite word embedding vector as the ranking method for each set of candidate properties.
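A minimal sketch of the composite-vector ranking (EMB 2) follows. It assumes word vectors and context vectors are available as plain Python lists keyed by word; the toy 3-dimensional vectors and the function names are invented for illustration, whereas the model described above uses 300 dimensions.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rerank_with_composite(
    candidates: List[str],
    event: str,
    vehicle: str,
    word_vecs: Dict[str, List[float]],
    context_vecs: Dict[str, List[float]],
) -> List[Tuple[str, float]]:
    """Rank candidates by cosine similarity to (event vector + vehicle vector)."""
    # Element-wise addition of the event and vehicle word vectors.
    composite = [a + b for a, b in zip(word_vecs[event], word_vecs[vehicle])]
    # Properties are represented by their context vectors.
    scored = [(p, cosine(composite, context_vecs[p]))
              for p in candidates if p in context_vecs]
    return sorted(scored, key=lambda ps: ps[1], reverse=True)


# Toy vectors, invented for illustration only.
word_vecs = {"drive": [1.0, 0.0, 0.0], "turtle": [0.0, 1.0, 0.0]}
context_vecs = {"slow": [1.0, 1.0, 0.0],
                "green": [0.0, 1.0, 1.0],
                "endangered": [0.0, 0.0, 1.0]}
reranked = rerank_with_composite(["green", "slow", "endangered"],
                                 "drive", "turtle", word_vecs, context_vecs)
print([p for p, _ in reranked])
```

With these toy vectors, "slow" shares context with both "drive" and "turtle" and therefore moves ahead of "green", which is associated only with the vehicle.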

Aggregated Ranking
Finally, we need to consider all of the properties produced by the various candidate generation methods. As we saw in Table 2, they produce complementary sets of properties and coverage is highest when we use all of them together. To produce an aggregated ranking of all candidate properties, we calculate the harmonic mean of the rank for each individual candidate generation method. This approach rewards properties that have a consistently high ranking across different methods.
For comparison, we also show results for a voting method where a candidate property is ranked based on how many different methods generated it. To break ties, we used the frequency of the candidate in our Twitter corpus.
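The two aggregation schemes can be sketched as follows. The text does not say how a property is scored for a method that did not generate it; this sketch assumes a penalty rank of 21 (one past the Top 20 cutoff), which is an illustrative choice only, as are the function names and the alphabetical final tie-break.

```python
from collections import Counter
from typing import Dict, List


def harmonic_mean_ranking(
    method_rankings: Dict[str, List[str]], missing_rank: int = 21
) -> List[str]:
    """Order properties by the harmonic mean of their ranks (lower is better)."""
    properties = {p for ranked in method_rankings.values() for p in ranked}
    scores = {}
    for prop in properties:
        # Rank under each method; missing_rank is an assumed penalty.
        ranks = [ranked.index(prop) + 1 if prop in ranked else missing_rank
                 for ranked in method_rankings.values()]
        scores[prop] = len(ranks) / sum(1.0 / r for r in ranks)
    return sorted(properties, key=lambda p: (scores[p], p))


def voted_ranking(
    method_rankings: Dict[str, List[str]], corpus_freq: Dict[str, int]
) -> List[str]:
    """Order properties by number of generating methods, then corpus frequency."""
    votes = Counter(p for ranked in method_rankings.values() for p in set(ranked))
    return sorted(votes, key=lambda p: (-votes[p], -corpus_freq.get(p, 0), p))


# Toy rankings from three hypothetical generation methods.
method_rankings = {
    "modifying_adj": ["slow", "green"],
    "explicit_property": ["slow", "old"],
    "embedding": ["green", "slow"],
}
print(harmonic_mean_ranking(method_rankings))
print(voted_ranking(method_rankings, {"slow": 900, "green": 1200, "old": 800}))
```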

Results for Aggregated Ranking
Our final results use two gold standard property sets: (1) Gd (Gold): uses the set of properties from the human annotators, and (2) Gd+WN expands Gold with WordNet synsets (words in the same synset of a gold property are added) and WordNet's "similar to" relation (words that are connected to a gold property by the relation are added). The reason for using Gd+WN is to include synonyms of a gold property that would otherwise be considered wrong (e.g., if a human annotator said "beautiful" and our system said "pretty").
The first two columns in Table 4 show MRR results for our final ranking. The results show that with both Gd and Gd+WN, our aggregated ranking using harmonic mean yields much better MRR results than the individual methods and better than the Voted method, yielding our highest MRR: .33 and .41. The last 4 columns of Table 4 present the percentage of similes for which an acceptable property was ranked #1 (Top 1) or within the Top 5. Our aggregate ranking scheme ranks an acceptable property in the Top 1 position for 27% of similes based on Gd+WN, and infers an acceptable property within the Top 5 positions for 58% of all similes.
For the above evaluations, any property given by the annotators is deemed correct, and any consensus that the annotators may have had is not accounted for. To address this, we retained properties with different degrees of consensus, and subdivided the evaluation data set. Each subset of the data kept similes that have properties from a minimum number of annotators, and only those properties are used as the gold standard. WordNet synsets and "similar to" relations are also used in determining consensus. Figure 3, which uses the Gd+WN gold standard and reports the corresponding data set sizes, shows that for all degrees of consensus, the aggregated ranking is consistently better than the method that uses the explicit property extraction patterns, which was the best individual candidate generation method. When properties given by at least 2 annotators are considered as the gold standard, MRR is lower than when properties given by any annotator are used. With higher consensus, MRR gradually increases, which is probably because the properties with high consensus have stronger association with the simile components, so are easier to infer.

Analysis and Discussion
Our gold standard property collection confirmed our intuition that some similes have many plausible interpretations while others do not. We hypothesized that this should contribute to the difficulty of implicit property inference. Utsumi and Kuwabara (2005) introduced "interpretive diversity" with the hypothesis that similes with more diversity in the inferred property tend to be more metaphorical, and the values of salience of the properties are more uniform. They used Shannon's entropy to measure the interpretive diversity of a simile.
To explore our hypothesis regarding difficulties associated with property inference, we first cluster our gold-standard annotated properties. When a property appears in the WordNet synset of another property, or if two properties are connected by the WordNet "similar to" relation, we group the properties to form property clusters. So each property cluster represents a set of words that are synonyms of each other. We aggregate frequency statistics of individual words in a cluster and measure interpretive diversity of a simile using Shannon's entropy (here, X is the random variable representing property clusters of a simile):
\[ H(X) = -\sum_{x} P(x) \log P(x) \]
Figure 4 shows the entropy curve after the 641 similes are sorted by the entropy values of their property clusters. Based on changes in the slope of the curve, we then divided the data into 3 subsets: similes with high (1-100 similes), medium (101-500 similes), and low (501-641 similes) interpretive diversity. Table 5 presents examples of similes in each category.
Table 5 (fragment; the rest of the table was lost in extraction): ... (9), coarse (2), raspy, sore, dry
High interpretive diversity is clearly demonstrated by "person act like mom", showing properties with many different characteristics attributed to mom. Note that the properties contain both positive (e.g., friendly, loving) and negative (scolding, annoying) attributes. On the other side of the spectrum are similes with low interpretive diversity, as exemplified by "throat feel like sandpaper" where the vocabulary of the property set is more limited. Table 6 shows that it is much harder to infer the implicit property in similes with high interpretive diversity, demonstrated by a .19 difference in MRR score from high to low. This trend is also consistent when we see the percentage of similes for which the system ranks a plausible property at the topmost position (Top 1) or within the Top 5.
It is possible that with low interpretive diversity, when the property distribution is unimodal or bimodal, statistical associations between a property and simile components are stronger, and so more easily discovered by our candidate generation and ranking methods.
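The entropy-based measure of interpretive diversity can be sketched as follows. The mapping from each property to its cluster is assumed to have been built beforehand (for example from WordNet synsets and "similar to" links); the function name, the toy annotations, and the base-2 logarithm are assumptions for this example.

```python
import math
from collections import Counter
from typing import Dict, List


def interpretive_diversity(
    annotated_properties: List[str], cluster_of: Dict[str, str]
) -> float:
    """Shannon entropy (in bits, an assumed base) over property clusters."""
    # Aggregate annotator counts per cluster; unclustered words stand alone.
    cluster_counts = Counter(cluster_of.get(p, p) for p in annotated_properties)
    total = sum(cluster_counts.values())
    if total == 0:
        return 0.0
    return sum((c / total) * math.log2(total / c)
               for c in cluster_counts.values())


# Toy annotations, invented for illustration.
cluster_of = {"coarse": "rough"}  # "coarse" and "rough" share a cluster
low = interpretive_diversity(["rough", "coarse", "rough", "dry"], cluster_of)
high = interpretive_diversity(["friendly", "loving", "scolding", "annoying"], {})
print(round(low, 3), round(high, 3))
```

A simile whose annotations concentrate in one cluster receives a low score, while one with evenly spread annotations receives a high score, matching the intuition that it has more plausible interpretations.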

Related Work
Similes have been studied in linguistics and psycholinguistics to understand how humans process similes, comparisons, and metaphors, and the interplay among different components of these linguistic forms. Glucksberg et al. (1997) presented a property attribution model of metaphor comprehension where the candidate properties are selected from a vehicle and applied to a topic. Chiappe and Kennedy (2000) investigated if the number of properties varies between a metaphor and its simile form. The impacts of semantic dimensions of tenor and property salience have been compared by Gagné (2002). Fishelov (2007) experimented with affective connotation and degrees of difficulty associated with understanding a simile when a simile property is conventional or unconventional, or no property is given. Hanks (2005) manually categorized vehicle nouns of similes into semantic categories. Automatic approaches that use computational models for similes are relatively rare. Veale and Hao (2007) extracted salient properties of vehicles from the web using "as ADJ as a/an NOUN" extraction pattern to acquire knowledge for concept categories. Veale (2012) built a knowledge-base of affective stereotypes by characterizing simile vehicles with salient properties. Li et al. (2012) used explicit property extraction patterns to determine the sentiment that properties convey toward simile vehicles. Niculae and Yaneva (2013) and Niculae (2013) used constituency and dependency parsing-based techniques to identify similes in text. Qadir et al. (2015) classified similes into positive and negative affective polarities using supervised classification, with features derived from simile components. Niculae and Danescu-Niculescu-Mizil (2014) designed a classifier with domain specific, domain agnostic, and metaphor inspired features to determine when comparisons are figurative.
Computational work on figurative language also includes figurative language identification using word sense disambiguation (Rentoumi et al., 2009), harvesting metaphors using noun and verb clustering-based techniques, and interpreting metaphors by generating literal paraphrases (Shutova, 2010).
Although previous research has extensively used explicit property extraction patterns for various tasks, none has explored the impact of multiple simile components for inferring properties. To our knowledge, we are the first to introduce the task of automatically inferring the implicit properties in open similes, which is fundamental to automatic understanding of similes.

Conclusion
In this work, we addressed the problem of inferring implicit properties in open similes. We showed that acceptable properties for most similes can be identified by harvesting properties using syntactic structures, dictionary definitions, statistical cooccurrence, and word embedding vectors. We then demonstrated that capturing the combined influence of a simile's event and vehicle terms using a composite word embedding vector improved our ability to rank candidate properties. Finally, we showed that properties harvested by different methods can be aggregated and effectively ranked using the harmonic mean of rankings from the individual methods. Our method for inferring implicit properties performed best on similes with low interpretive diversity. In future work, we plan to use the inferred properties to improve affective polarity recognition in similes.