Generating similes effortlessly like a Pro: A Style Transfer Approach for Simile Generation

Literary tropes, from poetry to stories, are at the crux of human imagination and communication. Figurative language such as a simile goes beyond plain expressions to give readers new insights and inspirations. In this paper, we tackle the problem of simile generation. Generating a simile requires proper understanding for effective mapping of properties between two concepts. To this end, we first propose a method to automatically construct a parallel corpus by transforming a large number of similes collected from Reddit to their literal counterparts using structured common sense knowledge. We then propose to fine-tune a pretrained sequence-to-sequence model, BART~\cite{lewis2019bart}, on the literal-simile pairs to gain generalizability, so that we can generate novel similes given a literal sentence. Experiments show that our approach generates $88\%$ novel similes that do not share properties with the training data. Human evaluation on an independent set of literal statements shows that our model generates similes better than two literary experts \textit{37\%}\footnote{We average 32.6\% and 41.3\% for the two humans.} of the time, and better than three baseline systems, including a recent metaphor generation model, \textit{71\%}\footnote{We average 82\%, 63\% and 68\% for the three baselines.} of the time when compared pairwise.\footnote{The simile in the title is generated by our best model. Input: Generating similes effortlessly, output: Generating similes \textit{like a Pro}.} We also show how replacing literal sentences with similes from our best model in machine-generated stories improves evocativeness and leads to better acceptance by human judges.


Introduction
Comparisons are inherent linguistic devices that express the likeness of two entities, concepts or ideas. When used in a figurative sense, these comparisons are called similes. They are a figure of speech that compares two different kinds of things, usually with the intent to make the description more emphatic or vivid, and are often used in literature and poetry to spark the reader's imagination (Paul et al., 1970). Take the following two examples: "The city was like a painting", and "If it falls into the wrong hands it would be as catastrophic as a nuclear bomb." In the first example, the comparison draws on the implicit "beauty" property being shared by the two very different entities, city and painting, while in the second the "catastrophic" property is shared by falling into the wrong hands and nuclear bomb.
While most computational work has focused on simile detection (Niculae and Danescu-Niculescu-Mizil, 2014; Mpouli, 2017; Qadir et al., 2015, 2016; Zeng et al., 2019; Liu et al., 2018), research on simile generation is under-explored. Generating similes could impact many downstream applications such as creative writing assistance, and literary or poetic content creation. To tackle the generation problem, we take advantage of the relatively simple structure of similes, which consists of five elements (Hanks, 2013; Niculae and Danescu-Niculescu-Mizil, 2014): the TOPIC (usually a noun phrase that acts as logical subject), the VEHICLE (the logical object of the comparison, usually a noun phrase), the PROPERTY (what the two things being compared have in common, usually an adjective), the EVENT (eventuality or state, usually a verb), and the COMPARATOR (the trigger word or phrase that marks the presence of a comparison, usually the preposition "like" or "as...as"). All elements of a simile are explicit, with the exception of the PROPERTY, which can be either implicit or explicit. If we take the first example above, its structure is: "[The city/TOPIC] [was/EVENT] [like/COMPARATOR] [a painting/VEHICLE]" (the PROPERTY is implicit). Unlike metaphors, the semantic context of similes tends to be very shallow, transferring a single property (Hanks, 2013). Moreover, the explicit syntactic structure of similes allows, in exchange, for more lexical creativity (Niculae and Danescu-Niculescu-Mizil, 2014).
We focus on the task of generating a simile starting from a literal utterance that contains the TOPIC, EVENT and PROPERTY. We frame this task as a style-transfer problem (Shen et al., 2017; Fu et al., 2017; Li et al., 2018; Sudhakar et al., 2019), where the author's intent is to make the description of the TOPIC more emphatic by introducing a comparison with the VEHICLE via a shared PROPERTY (see Figure 1 for examples of literal descriptive sentences and the generated similes). We call our approach SCOPE (Style transfer through COmmonsense PropErty). There are two main challenges we need to address: 1) the lack of training data consisting of pairs of literal utterances and their equivalent similes needed to train a supervised model; 2) ensuring that the generated simile makes a meaningful comparison between the TOPIC and the VEHICLE via the shared PROPERTY, expressed explicitly or implicitly (e.g., Figure 1 GenSimile1 and GenSimile2, respectively). To the best of our knowledge, this is the first work attempting to generate similes. By framing the task as a style-transfer problem we make three contributions:
Automatic creation of a parallel corpus of [literal sentence, simile] pairs. Our constructed corpus contains 87,843 such pairs. As a first step, we use distant supervision to automatically collect a set of self-labeled similes using the phrase like a. We then convert these similes to their literal versions by removing the COMPARATOR and replacing the VEHICLE with the associated PROPERTY, leveraging the structured common sense knowledge obtained from COMET (Bosselut et al., 2019), a language model fine-tuned on ConceptNet (Speer et al., 2017). For example, for the simile "Love is like a unicorn" our method will generate "Love is rare" (Section 2.1).
Transfer learning from a pre-trained model for generating high-quality similes. Our system SCOPE fine-tunes BART (Lewis et al., 2019), a state-of-the-art pre-trained denoising autoencoder built as a sequence-to-sequence model, on our automatically collected parallel corpus of [literal sentence, simile] pairs (Section 2.2) to generate similes. Human evaluations show that this approach generates similes that are better 37% of the time on average compared to two literary experts, 82% and 63% of the time compared to two well-crafted baselines, and 68% of the time compared to a state-of-the-art system for metaphor generation (Stowe et al., 2020) (Section 4).
A task-based evaluation. We show the effectiveness of the generated similes as a tool for enhancing creativity and evocativeness in machine-generated stories. Evaluation via Amazon Mechanical Turk shows that stories containing similes generated by SCOPE are preferred by Turkers 42% of the time, compared to stories without similes, which are preferred 25% of the time (Section 6).

SCOPE: Style Transfer through COmmonsense PropErty
Our style transfer approach for simile generation from literal descriptive sentences has two steps: 1) convert self-labeled similes into literal sentences using structured common sense knowledge (Section 2.1); and 2) given the [literal sentence, simile] pairs, fine-tune a seq2seq model on these pairs to generate a simile from a literal sentence (Section 2.2). This two-step approach is shown in the upper half of Figure 2.

Automatic Parallel Corpus Creation
One of the requirements for training a supervised generative model for text style transfer is the presence of a large-scale parallel corpus. We use distant supervision to collect self-labeled similes using the phrase like a from Reddit (e.g., the rows labeled Simile in Table 1). For fine-tuning, the similes form the "target" side of our parallel data. For the "source" side, we use commonsense knowledge to transform the similes into their literal versions (e.g., the rows labeled Best Literal in Table 1). One possible way to collect similes would be to train a supervised model using existing data and methods for simile detection, but most such datasets are very small (on the order of a few hundred examples). The only large-scale dataset is that of Niculae and Danescu-Niculescu-Mizil (2014); however, their data comes from the rather restricted domain of product reviews on Amazon, which often lacks the variety, diversity and creativity needed for this task.
Simile Dataset Collection. We hypothesize that similes are used frequently in creative writing or humorous content on social media (Veale, 2013). Hence, we obtain training data by scraping the subreddits WRITINGPROMPTS and FUNNY on the social media site Reddit for comments containing the phrase like a. Similes can be both Open and Closed. For example, the Closed Simile "The boy was as strong as an ox" gives strong as the PROPERTY shared by the boy and the ox. But most similes do not give an explicit PROPERTY, such as the Open Simile "The boy was like an ox", leaving the reader to infer that the boy is strong/large/fast (Qadir et al., 2016). Due to their implicit nature, generating open similes is often more challenging, and hence we restrict ourselves to like as a comparator instead of as...as. We use the API provided by pushshift.io to mine comments.
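The collection step can be sketched as follows. This is a minimal pure-Python illustration; the regex and the helper name are our own, not the authors' pipeline, which additionally goes through the pushshift.io API.

```python
import re

# Hypothetical filter: keep comments containing the comparator "like a"
# and split them into the TOPIC side and the VEHICLE phrase.
SIMILE_RE = re.compile(r"^(?P<topic>.+?)\s+like an?\s+(?P<vehicle>[\w' -]+)",
                       re.IGNORECASE)

def extract_simile(comment):
    """Return (topic_side, vehicle) for a closed 'like a' simile, else None."""
    m = SIMILE_RE.match(comment.strip())
    if m is None:
        return None
    return m.group("topic").strip(), m.group("vehicle").strip(" .!?")

comments = [
    "Love is like a unicorn.",
    "I went to the store yesterday.",
]
pairs = [p for p in map(extract_simile, comments) if p is not None]
# pairs == [("Love is", "unicorn")]
```

As the appendix notes, such distant supervision inevitably admits some noise (e.g., "I feel like a fool"), which a real pipeline would have to tolerate or filter.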

Through this process we collect 87,843 self-labeled human-written similes, from which we use 82,697 samples for training and 5,146 for validation.
Simile to Literal Transformation via Commonsense Property. From a theoretical perspective, similes are created by making a comparison between the TOPIC and the VEHICLE through a shared PROPERTY. While this property is naturally known to humans through common sense and connotative knowledge, computers still struggle to perform well on such tasks when the PROPERTY is not expressed. Hence we use structured common sense knowledge to derive properties that let us transform similes into their literal versions.
To generate the common sense PROPERTY implied by the VEHICLE in the simile, we take advantage of the simple syntactic structure of a simile. We extract the VEHICLE by extracting the phrase after like a and feed it as input to COMET (Bosselut et al., 2019). COMET is an adaptation framework for constructing commonsense knowledge based on pre-trained language models. Our work only leverages the HasProperty relation from COMET (https://mosaickg.apps.allenai.org/comet_conceptnet).
For the simile "Love is like a unicorn.", the TOPIC Love is compared to the VEHICLE unicorn. As shown in Table 1, COMET tells us that the top 5 properties associated with the VEHICLE are very rare, rare, beautiful, beautiful and smart, and color.
COMET gives us the properties sorted by probability in isolation, relying only on the VEHICLE. While in most situations all of the properties are apt, we want the literal sentence to be as meaningful as possible. To do this, we append the common sense property to the portion of the simile before like a, which typically consists of the TOPIC, the EVENT, and a PROPERTY if stated explicitly. We take the top 5 properties from COMET to form 5 possible literal versions for a particular simile. To rank these literal versions and select the best one, we rely on perplexity scores obtained from the pre-trained language model GPT (Radford et al., 2018). Table 1 shows human-written similes collected from Reddit, the top 5 common sense properties associated with the VEHICLE, and the literal version created by taking the best PROPERTY. To correct any grammatical error introduced by this manipulation, we rely on a grammatical error correction model (Zhao et al., 2019).
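The transformation and ranking steps can be sketched as follows. The helper names are ours, and the toy scoring function stands in for GPT perplexity (and the subsequent grammar-correction step is omitted).

```python
def literal_candidates(simile, properties):
    """Drop 'like a VEHICLE' and append each COMET HasProperty suggestion."""
    prefix = simile.split("like a")[0].rstrip()
    return ["{} {}.".format(prefix, p) for p in properties]

def best_literal(simile, properties, score):
    """`score` stands in for GPT perplexity: lower is better."""
    return min(literal_candidates(simile, properties), key=score)

props = ["very rare", "rare", "beautiful", "smart"]
cands = literal_candidates("Love is like a unicorn.", props)
# cands[1] == "Love is rare."
chosen = best_literal("Love is like a unicorn.", props, score=len)
```

In the real pipeline, `score` would be the perplexity assigned by GPT, so that the most fluent and plausible literal paraphrase wins.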
Test Data Collection. Our task is to generate a simile given a literal input. The automatically generated parallel data might contain stylistic biases. To truly measure the effectiveness of our approach, we need to evaluate on a dataset independent of our training and validation data. Towards this end, we again scrape the WRITINGPROMPTS subreddit, this time for sentences that are literal in nature (without any comparators like, as). Since literal utterances contain the description of a TOPIC via a PROPERTY, and usually the PROPERTY is an adjective or adverb, we restrict the last word of our literal sentences to adverbs or adjectives. We crawl 500 such sentences and randomly sample 150 literal utterances. We used two literary experts, a student in creative writing and a student in comparative literature who is the author of a novel, to write corresponding similes for each of these 150 inputs for evaluation and comparison.
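The last-word restriction might look roughly like this. The toy lexicon stands in for a real POS tagger, which the paper does not specify.

```python
# Keep only sentences whose final word is an adjective or adverb,
# so the literal PROPERTY is explicit at the end of the sentence.
def ends_with_property(sentence, pos_of):
    last = sentence.rstrip(" .!?").split()[-1].lower()
    return pos_of(last) in {"ADJ", "ADV"}

TOY_LEXICON = {"beautiful": "ADJ", "heartily": "ADV", "sunset": "NOUN"}
candidates = ["The city was beautiful.", "Jane wanted to see the sunset."]
kept = [s for s in candidates if ends_with_property(s, TOY_LEXICON.get)]
# kept == ["The city was beautiful."]
```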

Seq2Seq Model for Simile Generation
Our goal of generating similes can be broken down into two primary tasks: 1) identifying the words in the literal sentence that should be removed or replaced, and 2) generating the appropriate substitutions while staying pertinent to the context. Sequence-to-sequence (seq2seq) neural network models (Sutskever et al., 2014) have demonstrated great success in many text generation tasks, such as machine translation, dialogue systems and image captioning, given a considerable amount of parallel data. Hence we use seq2seq models for simile generation. BART (Lewis et al., 2019) is a pre-trained model combining bidirectional and auto-regressive transformers. It is implemented as a sequence-to-sequence model with a bidirectional encoder over corrupted text and a left-to-right autoregressive decoder. In principle, the pre-training procedure has two stages: (1) text is corrupted with an arbitrary noising function, and (2) a transformer-to-transformer model is learned to reconstruct the original text. Because BART has an autoregressive decoder, it can be directly fine-tuned for most sequence generation tasks. Here, the encoder input is a sequence of words, and the decoder generates outputs autoregressively, as shown in Figure 3. BART achieves new state-of-the-art results on a number of text generation tasks, making it an ideal choice for generating similes. We refer the reader to Lewis et al. (2019) for further details.
For our task, we fine-tune BART by treating the literal input as the encoder source and the simile as the decoder target. After fine-tuning, at the inference step we use a top-k sampling strategy (Fan et al., 2018) to generate similes conditioned on a test literal input.
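Top-k sampling itself can be illustrated over a toy next-token distribution. The helper is our own sketch; in SCOPE the scores come from the fine-tuned BART decoder, with k = 5 and a softmax temperature of 0.7 (Appendix A.1).

```python
import math
import random

def top_k_sample(logprobs, k=5, temperature=0.7, rng=None):
    """Sample the next token from the k most likely candidates
    under a temperature-scaled softmax (Fan et al., 2018)."""
    rng = rng or random.Random(0)
    top = sorted(logprobs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    weights = [(tok, math.exp(lp / temperature)) for tok, lp in top]
    z = sum(w for _, w in weights)
    r, acc = rng.random() * z, 0.0
    for tok, w in weights:
        acc += w
        if r <= acc:
            return tok
    return weights[-1][0]

next_logprobs = {"painting": -0.2, "cave": -1.5, "dream": -2.0, "dog": -4.0}
token = top_k_sample(next_logprobs, k=1)  # k=1 reduces to greedy decoding
```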
Implementation details. Hyper-parameters and other details needed for reproducing the experiments are given in Appendix A.1.

Experimental Setup
To compare the quality of the generated similes, we benchmark the SCOPE model and human generations (HUMAN1 & HUMAN2), described in Section 2.1, against the three baseline systems described below.

Baseline Systems
Simile generation is a new task. The baselines outlined below have been used for other generation tasks; we adapt them to generate similes.
1. BART: This is the pre-trained BART model.
Since BART is a pre-trained sequence-to-sequence model, it can still be used for conditional text generation. To this end, we use the same literal sentence (for example, The city was beautiful) as input to the encoder and force the decoder to begin with the same prefix, obtained by removing the adjective/adverb at the end and appending the comparator and the article (The city was like a), and then generate a simile.
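Building the forced decoder prefix can be sketched as follows (the helper name is our own illustration):

```python
# Drop the trailing adjective/adverb and append the comparator and
# article; BART's decoder is then forced to continue from this prefix.
def decoder_prefix(literal):
    words = literal.rstrip(" .!?").split()
    return " ".join(words[:-1] + ["like", "a"])

prefix = decoder_prefix("The city was beautiful.")
# prefix == "The city was like a"
```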

2. Retrieval (RTRVL):
We also experiment with a retrieval approach where we retrieve a VEHICLE from ConceptNet (Speer et al., 2017) having the highest-weight HasProperty edge with respect to our input property (i.e., the adjective or adverb at the end of the literal sentence). ConceptNet is a weighted graph with multiple relations (http://conceptnet.io/); we use only HasProperty, and when an object has multiple property edges we choose the one with the highest weight. For the input The city was beautiful we query ConceptNet with beautiful, and it returns sunset as the VEHICLE with the highest weight for HasProperty beautiful. We take this retrieved VEHICLE and append it to the prefix ending in like a. If the word is not in ConceptNet, we fall back to its synonyms obtained from WordNet (Miller, 1995).
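A toy version of this retrieval logic is sketched below; the edge table and synonym map are made-up stand-ins for ConceptNet and WordNet.

```python
def retrieve_vehicle(prop, edges, synonyms=None):
    """Pick the object with the highest-weight HasProperty edge for
    `prop`, falling back to synonyms when the word is absent.
    `edges` maps a property to a list of (vehicle, weight) pairs."""
    for word in [prop] + (synonyms or {}).get(prop, []):
        if word in edges:
            return max(edges[word], key=lambda vw: vw[1])[0]
    return None

EDGES = {"beautiful": [("sunset", 4.2), ("flower", 2.0)]}
SYNONYMS = {"gorgeous": ["beautiful"]}
vehicle = retrieve_vehicle("beautiful", EDGES)            # "sunset"
fallback = retrieve_vehicle("gorgeous", EDGES, SYNONYMS)  # "sunset" via synonym
```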

3. Metaphor Masking (META M):
The third baseline is the metaphor generation model of Stowe et al. (2020), which generates from a literal sentence. Following their approach, we fine-tune BART after masking the adjective or adverb at the end of the literal sentence.
The input is the masked text with the hidden adjective or adverb (The city was <MASK>), and the output is the original simile (The city was like a painting). Through this learning paradigm, the model learns that it needs to generate a simile when it encounters the mask token. At test time, we provide the model with the literal input, mask the adjective/adverb, and the model produces an output conditioned on this masking scheme.
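Constructing one metaphor-masking training pair can be sketched as follows (the helper name is ours):

```python
# Hide the final adjective/adverb of the literal sentence and pair the
# masked text with the original simile, per the Stowe et al. (2020) setup.
def mask_last_property(literal):
    words = literal.rstrip(" .!?").split()
    return " ".join(words[:-1] + ["<MASK>"])

pair = (mask_last_property("The city was beautiful."),
        "The city was like a painting.")
# pair[0] == "The city was <MASK>"
```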

Evaluation Criteria
Automatic evaluation. BLEU (Papineni et al., 2002) is one of the most widely used automatic evaluation metrics for generation tasks such as machine translation. However, for creative text generation, it is not reasonable to expect significant n-gram overlap between the machine-generated and gold-standard sentences. We still report BLEU scores for the generated VEHICLE after discarding the prefix shared with the gold. BERTScore (Zhang et al., 2019) has recently been used to evaluate text generation using contextualized embeddings and is said to somewhat ameliorate the problems with BLEU. It computes a similarity score using contextual embeddings for each token in the candidate (here, the VEHICLE in the generated simile) against each token in the reference (the VEHICLE in the human-written simile). To compute an F1 score it uses recall (matching each token in the reference to a token in the candidate) and precision (matching each token in the candidate to a token in the reference). We report the F1 score of BERTScore.
Novelty. To measure the model's generalization capability, we also want to test how well our models can generate novel content. We compute the proportion of generated VEHICLE, conditioned on an adverb/adjective literal PROPERTY, that does not appear in the training set.
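The novelty metric can be sketched as follows. Here we read the metric as counting a generation as novel when its (PROPERTY, VEHICLE) pairing never co-occurs in training; the exact bookkeeping is our assumption.

```python
def novelty(generated, train_pairs):
    """Fraction of generated (PROPERTY, VEHICLE) pairings unseen in training."""
    seen = set(train_pairs)
    return sum(1 for p in generated if p not in seen) / len(generated)

train = [("rare", "unicorn"), ("beautiful", "sunset")]
generated = [("beautiful", "painting"), ("rare", "unicorn")]
score = novelty(generated, train)  # 0.5: one of the two pairings is novel
```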
Human evaluation. Automated metrics are not adequate on their own for evaluating methods that generate creative text, so we present a human-based evaluation as well. We evaluate a total of 900 utterances: 600 generated by the 4 systems and 300 written by humans. We propose a set of 4 criteria to evaluate the generated output: (1) Creativity (C) ("How creative are the utterances?"); (2) Overall Quality (OQ) ("How good is the simile overall?"; Turkers were instructed to score based on how creative, well formed, meaningful and relevant it is with respect to the literal utterance); (3) Relevance1 (R1) ("How relevant is the generated VEHICLE in terms of portraying the PROPERTY?"); and (4) Relevance2 (R2) ("How relevant is the VEHICLE to the TOPIC in the generation?"). As we evaluate 900 utterances on 4 separate dimensions, we have a total of 3,600 evaluations. We hired Turkers on MTurk to rate the outputs from the 4 systems and 2 humans. Each Turker was given the literal utterance as well as the 6 generated similes (randomly shuffled). Each criterion was rated on a scale from 1 (not at all) to 5 (very), and each utterance was rated by three separate Turkers. We hired 86, 48, 42, and 46 Turkers for the tasks of Creativity, Overall Quality, Relevance1 and Relevance2, respectively. Turkers were paid at the rate of 20 dollars an hour. Further details are in Appendix A.4.

Automatic Evaluation
Table 2 shows BLEU-1, BLEU-2 and BERTScore for our system compared to the three baselines. The low scores can be attributed to the nature of creative NLG tasks. To validate this further, we also compute the BLEU-1 and BLEU-2 scores between the two literary experts, treating one as reference and the other as candidate, and get scores of 4.12 and 0.52, respectively. BERTScore is often a better metric as it uses contextualized embeddings. For example, for the candidate [desert] with the multi-reference [[sandy death trap], [wasteland]], we get a BERTScore of 0.99 while the BLEU score is 0.0. Our best model SCOPE emerges as the winner on both BLEU and BERTScore. For novelty, SCOPE can still generate novel content 88% of the time, proving it generalizes to unseen test data. Further, there are 5,558 unique PROPERTY values in the training data, and 41% of the PROPERTY values in the test data do not appear in training, showing our model also generalizes to unseen PROPERTY values.

Human Evaluation Scores
Table 3 presents the scores of the aforementioned evaluation criteria for our model and the baselines on the test set. The results show that SCOPE is significantly better than the baselines on all four criteria (p < .001 according to an approximate randomization test). For all metrics, our best system is comparable to humans. We also computed Pearson's correlation between OQ and the other metrics and observed that R1 and R2 had moderate correlations of 0.54 and 0.52 with OQ, while C was fairly correlated (0.31) with OQ, suggesting that relevance matters when judging the quality of a simile.
Pairwise comparison between systems. Table 4 shows the pairwise comparisons between SCOPE and the human-generated similes (HUMAN1 and HUMAN2), and META M (Stowe et al., 2020), respectively. Given a pair of outputs, we decide win/lose/tie by comparing the average scores (over three Turkers) of both outputs. We see that SCOPE outperforms META M on all metrics. For overall quality, although it is a given that literary experts are better, the SCOPE model still has winning rates of 32.6% and 41.3%, respectively. Table 9 shows several generation outputs from the different systems along with human judgements on the individual criteria. We observe that our model is often better than at least one human on a certain criterion while outperforming the baselines by a large margin.
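The win/lose/tie bookkeeping can be sketched as follows (our illustration of the comparison just described):

```python
# For each item, average the three annotator scores per system and
# count wins, losses, and ties for system A against system B.
def pairwise(scores_a, scores_b):
    win = lose = tie = 0
    for a, b in zip(scores_a, scores_b):
        ma, mb = sum(a) / len(a), sum(b) / len(b)
        if ma > mb:
            win += 1
        elif ma < mb:
            lose += 1
        else:
            tie += 1
    n = len(scores_a)
    return win / n, lose / n, tie / n

w, l, t = pairwise([[5, 4, 4], [3, 3, 3], [2, 2, 2]],
                   [[3, 3, 3], [3, 3, 3], [4, 4, 4]])
# one win, one tie, one loss
```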

Role of Relevance
While conditioning on the context of literal sentences might lead to grammatically correct similes, the results are often not meaningful or relevant to the PROPERTY in consideration. META M generates similes by fine-tuning BART on literal sentences where the common sense PROPERTY is masked.
The lack of relevance mapping during fine-tuning often leads to improper generations. For instance, referring to Table 9, the context of 'falling into the wrong hands' is likely to lead to something bad, and hence 'gift' is not appropriate here while 'nuclear bomb' is. One possible way of incorporating relevance is through common sense knowledge.

Role of Context
Context is essential for simile generation. For example, given the literal input 'But times are hard, and silver bullets are expensive', even though ConceptNet tells us diamonds are objects with HasProperty expensive, the simile generated by the RTRVL model, 'But times are hard, and silver bullets are like a diamond', seems inappropriate, suggesting that context leads to better generation. Our SCOPE model generates 'But times are hard, and silver bullets are like a luxury item'.


Task-based Evaluation: Simile for Story Generation
Similes are often used to evoke imagery. Generating or transforming text to be evocative can be useful for computational journalism (Spangher et al.), poetry generation (Ghazvininejad et al., 2017; Van de Cruys, 2020) and story writing (Goldfarb-Tarrant et al., 2020; Yao et al., 2019).
Table 10 shows how we can use our simile generation module as a post-processing step to replace literal sentences, leading to more expressive and creative stories. To test this hypothesis we conduct the experiment outlined below.
Table 7: Win% (in terms of average score over three annotators) of stories generated with GPT2 only, GPT2 with META M post-processing, and GPT2 with SCOPE post-processing; the rest are ties.
GPT2: 23% | GPT2+META M: 25% | GPT2+SCOPE: 42%

Story Generation
We use the ROCStories (Mostafazadeh et al., 2016) dataset to generate stories using the Plan and Write setup outlined by Yao et al. (2019). We introduce a two-step pipeline where we fine-tune a pre-trained GPT2 (Radford et al., 2018) model on titles and storylines from the training set to generate a storyline given a title (Row 1, Table 10).
In parallel, we also fine-tune GPT2 on storylines and stories from the training set to generate a story given a storyline (Row 2, Table 10). At test time, we first generate a storyline from an input title and then use the generated storyline to generate a story.

Post Processing
A story can contain multiple sentences ending with an adjective or adverb, and replacing each of them with a simile might lead to over-embellishment. We therefore feed only one randomly selected such sentence to the SCOPE and META M modules and replace that sentence in the GPT2-generated story with the output of SCOPE or META M, respectively.
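The post-processing step can be sketched as follows. The sentence splitter and the predicate for "ends in an adjective/adverb" are our own stand-ins, and `simile_fn` abstracts a call to SCOPE or META M.

```python
import random
import re

def embellish(story, simile_fn, is_property, rng=None):
    """Replace one randomly chosen literal sentence (ending in an
    adjective/adverb, per `is_property`) with its generated simile."""
    rng = rng or random.Random(0)
    sents = re.split(r"(?<=[.!?])\s+", story.strip())
    idx = [i for i, s in enumerate(sents)
           if is_property(s.rstrip(" .!?").split()[-1].lower())]
    if not idx:
        return story
    i = rng.choice(idx)
    sents[i] = simile_fn(sents[i])
    return " ".join(sents)

story = "Jane went for a walk. The sunset was beautiful."
out = embellish(story,
                simile_fn=lambda s: "The sunset was like a painting.",
                is_property=lambda w: w == "beautiful")
# out == "Jane went for a walk. The sunset was like a painting."
```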

Human evaluation.
We randomly select 50 titles from the ROCStories dataset and generate stories as described above. We post-process each story using SCOPE and META M separately. Thus for each title we have 3 stories: 1) the original GPT2 story, 2) the GPT2 story post-processed with SCOPE, and 3) the GPT2 story post-processed with META M. For each title, we present these 3 stories to workers on AMT and ask them to score each in a range of 1 (poor) to 5 (excellent) based on creativity and evocativeness. The results in Table 7 show that effective usage of similes can improve the evocativeness and reception of machine-generated stories.

Related Work
Simile generation is a relatively new task; most prior work has focused on the detection of similes. The closest NLP task to simile generation is metaphor generation. However, it should be noted that the overlap between the expressive ranges of similes and metaphors is only partial: there are similes that cannot be rephrased as metaphors, and vice versa (Israel et al., 2004).

Simile Detection and Analysis
Niculae and Danescu-Niculescu-Mizil (2014) proposed frameworks for annotating similes from product reviews by considering their semantic and syntactic characteristics as well as the challenges inherent to the automatic detection of similes. Qadir et al. (2015, 2016) built computational models to recognize affective polarity and implicit properties in similes. Unlike these works, we focus on generating similes by transforming a literal sentence while remaining faithful to the property in context.

Metaphor Generation
Earlier works in metaphor generation (Abe et al., 2006; Terai and Nakagawa, 2010) operated at the lexical or phrase level, using template- and heuristic-based methods. Gero and Chilton (2019) presented an interactive system for collaboratively writing metaphors with a computer; they use an open-source knowledge graph and a modified Word Mover's Distance algorithm to find a large, ranked list of suggested metaphorical connections. Word embedding approaches (Gagliano et al., 2016) have also been used for metaphor generation. Young (1987) presents a relational database method for automatic metaphor generation. However, the metaphors generated through these methods do not take semantic context into consideration and lack the flexibility and creativity necessary to instantiate similes in a natural language sentence.
Yu and Wan (2019) use neural models to generate metaphoric expressions given a literal input in an unsupervised manner. Stowe et al. (2020) develop a framework dubbed 'metaphor masking' where they train a supervised seq2seq model whose input is the masked text, masking the metaphorical verb while preserving the original text as the output. However, both of these works hinge on metaphoric verbs, unlike similes, where we not only need to replace the literal property with a vehicle but the vehicle also needs to be relevant to the context and the tenor. Additionally, we use Stowe et al. (2020) as a baseline and show that their approach may not be the best way of generating similes.

Conclusion
We establish a new task for NLG: simile generation from literal sentences. We propose a novel way of creating parallel corpora and a transfer-learning approach for generating similes. Human and automatic evaluations show that our best model is successful at generating similes. Our experimental results further show that to truly generate similes based on actual metaphoric or conceptual mappings, it is important to incorporate some common sense knowledge about the topics and their properties. Future directions include the exploration of other knowledge bases to help the inference and the application of our simile generation approach to other creative NLG tasks such as sarcasm (Chakrabarty et al., 2020) and hyperbole (Troiano et al., 2018), etc.


A.2 Decoding and Training Data Noise

For decoding, we generate similes from our models using a top-k random sampling scheme (Fan et al., 2018). At each timestep, the model generates the probability of each word in the vocabulary being the likely next word. We randomly sample from the k = 5 most likely candidates from this distribution, using a softmax temperature of 0.7. While distant supervision lets us collect a lot of data without human or expert annotation, this process introduces noise into our self-labeled similes. For example, the sentence I feel like a fool is ideally not a simile. We notice that 1.1% of the training data has a PNP in the TOPIC and is typically <= 6 tokens long, such as I would like a, I don't like a, I feel like a, I think like a. However, our transformation method still works here. From Figure 5, we see that the common sense properties associated with fool are sneaky, stupid, funny, dangerous and bad. Our best literal transformation for I feel like a fool is then I feel stupid. So even though there is some noise, this method still benefits our training procedure.

A.3 Examples
Table 9 shows generations from all 4 systems along with gold similes and how Turkers scored them on a scale of 1 to 5 for C, R1, R2 and OQ.

A.4 Amazon Mechanical Turk Settings
The 2nd column of Table 8 shows the number of distinct workers employed for each task, and column 3 shows the inter-rater agreement between workers. Except for Creativity, where workers are fairly correlated, workers on the other 3 tasks are moderately correlated.
Figures 6, 7, 8 and 9 show the Amazon Mechanical Turk interfaces for the tasks of Creativity (C) ("How creative are the utterances?"), Relevance1 (R1) ("How relevant is the generated VEHICLE in terms of portraying the PROPERTY?"), Relevance2 (R2) ("How relevant is the VEHICLE to the TOPIC in the generation?"), and Overall Quality (OQ) ("How good is the simile overall?"). As can be seen, we provide Turkers with explicit examples and a clear description of the task. We also highlight the importance of evaluating similes along with their input and not in isolation.
A.5 GPT2-generated stories post-processed with SCOPE

Figure 1: Examples of two generated similes, GenSimile1 and GenSimile2, from their literal inputs.

Figure 2: A schematic illustration of our system. The top block shows our training process, where we use COMET to transform similes into literal sentences and use them to fine-tune BART. The block below shows the inference step, where we use the fine-tuned BART to generate novel similes conditioned on a literal sentence.

Figure 3: The backbone of SCOPE: fine-tuning BART on literal-to-simile pairs.

Figure 4: Bar chart showing the percentage of times each individual system won in terms of Overall Quality.

Figure 5: Properties associated with fool.

Table 9: Examples of generated outputs from different systems (with human-written similes as references). We show average scores (over three annotators) on a 1-5 scale, where 1 denotes the worst and 5 the best. The italicized text in the Literal column represents the PROPERTY, while that in the Simile column represents the generated VEHICLE. Boldface indicates the best results.

Table 3: Human evaluation on several criteria of simile quality for the different systems' outputs and the human-written similes. We show average scores on a 1-5 scale, where 1 denotes the worst and 5 the best; the corresponding inter-annotator agreement (IAA) is in parentheses. Boldface denotes the best results and underscore the second best.

Table 4: Pairwise comparison between SCOPE and HUMAN1 (H1), HUMAN2 (H2), and META M. Win[w]% (lose[l]%) is the percentage of cases where SCOPE gets a higher (lower) average score compared to HUMAN1, HUMAN2 and META M. The rest are ties.
Jane wanted to see the sunset. She decided to go for a walk. She walked for a long time. When she was done she saw the sunset was beautiful.
Storyline: sky → sunset → walk → walked → beautiful
The sky was beautiful [like a blue canvas].

Table 6: An example of GPT-2 generated short stories with the titles Sunset and Car Accident. We replace the literal sentences with generated similes from SCOPE.
Table 10 shows several example stories where a literal sentence has been replaced by a simile.
Title: a gift from the mentor
Storyline: loved → playing → promised → tried → surprised
Harry loved playing tennis. One day while playing he broke his racket. His coach had promised to buy him a new racket if he practiced. Harry tried hard to practice and was confident in his new racket. To his surprise his coach bought him a racket for his birthday and he was ecstatic [like a child on Christmas day].

Title: The pet bug
Storyline: playing → caught → bug → hoped → release
Oliver was playing in his yard. Suddenly he spotted a bug he hadn't caught. The bug was a big beetle. He hoped it would be there forever [like a shadow]. But unfortunately it was too late to release it.

Title: fishing
Storyline: fish → lake → kids → caught → home
The kids were great at catching fish. They woke up early and packed up their tackle box and hiked to the lake. The kids set up their lures and caught as many as they could. The fish were all caught and the kids laughed heartily [like a group of hyenas]. They went home and had a great day fishing.

Table 10: Examples of GPT-2 generated short stories on the respective titles and storylines. We replace the first literal sentence with a generated simile from SCOPE.