A Theme-Rewriting Approach for Generating Algebra Word Problems

Texts present coherent stories that have a particular theme or overall setting, for example science fiction or western. In this paper, we present a text generation method called rewriting that edits existing human-authored narratives to change their theme without changing the underlying story. We apply the approach to math word problems, where it might help students stay more engaged by quickly transforming all of their homework assignments to the theme of their favorite movie without changing the math concepts that are being taught. Our rewriting method uses a two-stage decoding process, which proposes new words from the target theme and scores the resulting stories according to a number of factors defining aspects of syntactic, semantic, and thematic coherence. Experiments demonstrate that the final stories typically represent the new theme well while still testing the original math concepts, outperforming a number of baselines. We also release a new dataset of human-authored rewrites of math word problems in several themes.


Introduction
Storytelling is the complex activity of expressing a plot, its events and participants in words meaningful to an audience. Automatic storytelling systems can be used for customized sport commentaries, enriching video games with personalized or dynamic plot-lines (Barros and Musse, 2007), or providing customized learning materials which meet each individual student's needs and interests (Bartlett, 2004). In this paper, we focus on generating narrative-style math word problems (Figure 1) and demonstrate that it is possible to design an algorithm that can automatically change the overall theme of a text without changing its underlying story, for example to create more engaging homework that is in the theme of a student's favorite movie.

Original story
Jim walked 0.2 of a mile from school to David's house and 0.7 of a mile from David's house to his own house. How many miles did Jim walk in all?

Star Wars
Uncle Owen walked 0.2 of a mile from hangar to Luke Skywalker's room and 0.7 of a mile from Luke Skywalker's room to his own room. How many miles did Uncle Owen walk in all?

Cartoon
Finn squished 0.2 of a mile from cupboard to Melissa's dock and 0.7 of a mile from Melissa's dock to his own dock. How many miles did Finn squish in all?

Western
Duane strolled 0.2 of a mile from barn to Madeline's camp and 0.7 of a mile from Madeline's camp to his own camp. How many miles did Duane stroll in all?
A math word problem is a coherent story that provides the student with good clues to the correct mathematical operations between the numerical quantities described therein. However, the particular theme of a problem, whether it be about collecting apples or traveling distances through space, can vary significantly so long as the correlation between the story and underlying equation is maintained. Students' success at solving a word problem is tied to their interest in the problem's theme (Renninger et al., 2002), and personalizing word problems increases student understanding, engagement, and performance in the problem solving process (Hart, 1996; Davis-Dorsey et al., 1991).
Motivated by this need for thematically diverse, highly coherent stories, we address the problem of story rewriting, or transforming human-authored stories into novel, coherent stories in a new theme. Rather than first synthesizing a story plot (McIntyre and Lapata, 2009; McIntyre and Lapata, 2010) or script (Chambers and Jurafsky, 2009; Pichotta and Mooney, 2016; Granroth-Wilding and Clark, 2016) from scratch, we instead begin from an existing story and iteratively edit it towards a thematically novel but, most crucially, semantically compatible story. This approach allows us to reuse much, but not all, of the syntactic and semantic structure of the original text, resulting in the creation of more coherent and solvable math word problems.
We define a theme to be a collection of reference texts, such as a movie script or series of books. Given a theme, the rewrite algorithm constructs new texts by substituting thematically appropriate words and phrases, as measured with automatic metrics over the theme text collection, for parts of the original texts. This process optimizes for a number of metrics of overall text quality, including syntactic, semantic, and discourse scores. It uses no hand-crafted templates and requires no theme-specific tuning data, making it easy to apply to new themes in practice. Tables 4-6 show example stories generated from the rewrite system.
To evaluate performance, we collected a corpus of 450 rewrites of math word problems in Star Wars and Children's Cartoon themes via crowdsourcing (see footnote 1). Experiments with automated metrics and human evaluations demonstrate that the approach described here outperforms a number of baselines and can produce solvable problems in multiple different themes, even with no in-domain tuning.

Related Work
Our approach is related to the previous work in story generation (e.g., McIntyre and Lapata (2010)) and sentence rewriting (e.g., text simplification (Xu et al., 2016)), as reviewed in this section. It has three major differences from all these approaches. First, we focus on multi-sentence stories where preserving the coherence, discourse relations, and solvability is essential. Previous work mainly focuses on rewriting single sentences. Second, we build a theme from a text corpus and show how the stories can be adapted to new themes. Third, our method leverages the human-authored story to capture the semantic skeleton and the plot of the current story, rather than synthesizing the story plot. To our knowledge, we are the first to introduce a text rewriting formulation for story generation.
Story generation has long been of interest to AI researchers (Meehan, 1976; Lebowitz, 1987; Turner, 1993; Liu and Singh, 2002; Mostafazadeh et al., 2016). Recent methods in story generation first synthesize candidate plots for a story and then compile those plots into text. Li et al. (2013) use crowdsourcing to build plot graphs. McIntyre and Lapata (2009; 2010) address story generation through the automatic deduction and reassembly of scripts (Schank and Abelson, 1977), or structured representations of events, their participants, and the causal relationships involved. Leveraging the automatic script learning methods of Chambers and Jurafsky (2009), McIntyre and Lapata (2010) learn candidate entity-centered plot graphs, or possible events involving the entity and an ordering between these events, with the use of a genetic algorithm. Then plots are compiled into stories through the use of a rule-based text surface realizer (Lavoie and Rambow, 1997) and reranked using a language model. Polozov et al. (2015) automatically generate math word problems tailored to a student's interest using Answer Set Programming to satisfy a collection of pedagogical and narrative requirements. This method naturally produces highly coherent, personalized story problems that meet pedagogical requirements, at the expense of building the thematic ontologies and discourse constraints by hand.
Additionally, there is related work in text simplification (Wubben et al., 2012; Kauchak, 2013; Zhu et al., 2010; Vanderwende et al., 2007; Woodsend and Lapata, 2011b; Hwang et al., 2015), sentence compression (Filippova and Strube, 2008; Rush et al., 2015), and paraphrasing (Ganitkevitch et al., 2013; Chen and Dolan, 2011; Ganitkevitch et al., 2011). All these tasks are focused on rewriting sentences under a predefined set of constraints, such as simplicity. Different rule-based and data-driven approaches are introduced by Petersen and Ostendorf (2007), Vickrey and Koller (2008), and Siddharthan (2004). Most data-driven approaches take advantage of machine translation techniques, use source-target sentence pairs, and learn rewrite operations (Yatskar et al., 2010; Woodsend and Lapata, 2011a), or use additional external paraphrasing resources (Xu et al., 2016).

Footnote 1: washington.edu/kedzior/Rewriter/

Problem Formulation
Our system takes as input a story s and a theme t, and outputs the best rewrite s* from generated candidates S.
A theme t is defined as a textual corpus that describes a topic or a domain. This is an intentionally broad definition that allows a variety of textual resources to serve as themes. For example, the collection of all Science Fiction stories from Project Gutenberg can be a theme, or the script of a single movie, or a sampling of fan fiction from the Internet. This flexibility adds to the utility of our work, as varying amounts of thematic text may be available.
The generated candidate s* is the most thematically fit problem that is syntactically and semantically coherent given the original problem s and the new theme t. We represent a story in terms of the words it contains, so that s = {w_1, w_2, ..., w_n} and

$$s' = \{f(w_1), f(w_2), \ldots, f(w_n)\}$$

where the function f(w) : V_o → V_t^K ∪ ∅ rewrites a word from the vocabulary of the original problem V_o to either a word or a trivial noun compound of length K (e.g., a multi-word named entity) from the thematic vocabulary V_t, or reduces it to the empty symbol, i.e., omits the input word entirely; hence the length of s' can differ from that of the original problem.

[Figure 2 example: the original story "Sam had 2 dogs. Each had 3 puppies." and the candidate rewrite "Luke Skywalker had 2 ships. Each had 3 droids.", annotated with semantic relations and the scores Syn(s'|s), Sem_Pair(s'|s), Th(s'|t), and Sem_Lex(s'|s).]
Formally, our goal is to select the candidate s' ∈ S by maximizing a scoring function R over thematic, syntactic and semantic constraints, subject to a set of parameters θ:

$$s^{*} = \arg\max_{s' \in S} R(s' \mid s, t; \theta) \tag{1}$$

In order to find the best story s*, our problem reduces to generating candidate stories s' from the space of possible rewrites of the human-authored story s in a new theme t (Section 5). Since there are exponentially many rewrites, we follow a two-stage decoding approach: first we identify only the content words w_i in the input problem, and provide for each a list of the top-k most salient thematic candidate words and trivial noun compounds. We then search the space by progressively introducing more rewrites in the beam, and scoring them according to R (Section 4). Figure 2 shows the overview of the scoring function for a candidate sentence s'.
The scoring function R decomposes into three components, capturing aspects of syntactic compatibility, semantic coherence, and thematicity:

$$R(s' \mid s, t; \theta) = \alpha \times Sem(s' \mid s) + \beta \times Syn(s' \mid s) + \gamma \times Th(s' \mid t) \tag{2}$$

The syntactic (Syn) and semantic (Sem) coherence components measure the coherence of the words in the new story s', as well as their compatibility with the syntactic and semantic relations in the original story s. Thematicity (Th), on the other hand, scores the relevance and importance of words in the new story with respect to theme t.
We describe each of these components and the decoding process in the following sections.
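As a rough illustration of the selection step, the sketch below picks the highest-scoring candidate under a weighted sum of three component scores. It is a minimal sketch and not the authors' implementation: the `Weights` container, the weight names `beta` and `gamma`, and the three toy scorers are assumptions introduced here purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Story = Sequence[str]


@dataclass(frozen=True)
class Weights:
    """Hypothetical container for the parameters theta."""
    alpha: float  # weight on semantic coherence
    beta: float   # weight on syntactic compatibility (name assumed)
    gamma: float  # weight on thematicity (name assumed)


def combined_score(candidate: Story, original: Story,
                   sem: Callable[[Story, Story], float],
                   syn: Callable[[Story, Story], float],
                   th: Callable[[Story], float],
                   w: Weights) -> float:
    """Weighted sum of the three component scores for one candidate."""
    return (w.alpha * sem(candidate, original)
            + w.beta * syn(candidate, original)
            + w.gamma * th(candidate))


def best_rewrite(candidates: List[Story], original: Story,
                 sem, syn, th, w: Weights) -> Story:
    """Return the candidate that maximizes the combined score."""
    return max(candidates,
               key=lambda c: combined_score(c, original, sem, syn, th, w))


# Toy stand-ins for the three components (not the scorers used in the paper).
original = ["Sam", "had", "2", "dogs"]
candidates = [["Sam", "had", "2", "dogs"], ["Luke", "had", "2", "ships"]]
toy_sem = lambda c, o: sum(a == b for a, b in zip(c, o)) / len(o)
toy_syn = lambda c, o: 1.0
toy_th = lambda c: float(sum(tok in {"Luke", "ships"} for tok in c))
weights = Weights(alpha=0.5, beta=0.25, gamma=0.25)

best = best_rewrite(candidates, original, toy_sem, toy_syn, toy_th, weights)
```

With these toy scorers the themed candidate wins because its thematicity gain outweighs the loss in word overlap; real component scores would of course behave differently.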

Thematicity
Recall that a theme t is defined as a collection of documents that share a common topic, such as books in the science fiction genre, or scripts of horror movies. We define thematicity of a word w as the measure of salience, or how discriminative that word is to a given theme. For example, robot and spaceship are expected to be highly thematic with respect to Star Wars. In our setting we extend this definition to a candidate problem s' and a theme t as:

$$Th(s' \mid t) = \sum_{w'_i \in s'} Sal(w'_i) \tag{3}$$

where w'_i is a word from the candidate problem, and Sal is its salience score with respect to the theme.
In the context of this work we argue that the thematic adaptation of the content words, i.e., nouns, verbs, named entities, and adjectives, plays the most important role in forming a new thematic problem. Therefore, we define their salience (except for named entities) based on their tf-idf score over the theme t, and set it to zero for function words. Since named entities have relatively low frequencies in the theme corpus, we set their salience to 1 − 1/c(w'_i), where c(w'_i) is the number of times w'_i occurs in the theme. In the example story in Figure 2 the thematicity score is derived as Sal(Luke Skywalker) + Sal(ships) + Sal(droids).
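The salience definitions above can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the text does not specify the tf-idf variant, so the raw-count tf and smoothed idf used here, the function names, and the toy counts are all hypothetical. Only the named-entity rule 1 − 1/c and the summation over words follow the description directly.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List, Set


def tfidf_salience(word: str, theme_counts: Counter,
                   background_docs: List[Set[str]]) -> float:
    """tf-idf of a word: count in the theme times a smoothed idf (assumed variant)."""
    tf = theme_counts[word]  # Counter returns 0 for unseen words
    df = sum(1 for doc in background_docs if word in doc)
    idf = math.log((1 + len(background_docs)) / (1 + df))
    return tf * idf


def entity_salience(entity: str, theme_counts: Counter) -> float:
    """Named-entity salience 1 - 1/c, with 0 for entities absent from the theme."""
    count = theme_counts[entity]
    return 1.0 - 1.0 / count if count > 0 else 0.0


def thematicity(candidate: Iterable[str], salience: Dict[str, float]) -> float:
    """Sum of salience over the candidate's tokens; unlisted tokens score 0."""
    return sum(salience.get(token, 0.0) for token in candidate)


# Toy counts, invented for illustration only.
theme_counts = Counter({"droids": 2, "ships": 1, "the": 3, "Luke Skywalker": 4})
background_docs = [{"the", "dog"}, {"the", "ships"}, {"the", "cat"}]

salience = {
    "Luke Skywalker": entity_salience("Luke Skywalker", theme_counts),
    "ships": tfidf_salience("ships", theme_counts, background_docs),
    "droids": tfidf_salience("droids", theme_counts, background_docs),
}
candidate = ["Luke Skywalker", "had", "2", "ships", "each", "had", "3", "droids"]
score = thematicity(candidate, salience)
```

In this toy setup a word such as "the" that occurs in every background document receives an idf of zero, which loosely mirrors setting the salience of function words to zero.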

Syntactic compatibility
This work offers a new method for syntactic and discourse coherence based on preserving human-authored syntactic structure in generated text (hence our use of the term rewriting). The syntactic constructs in a document play a distinctive role in maintaining cohesion across sentences. We consider the human-authored syntax of the original story s as the gold standard, and use it to score a candidate problem s' by considering how well the syntactic relations of s apply to s'. Formally, given a dependency triple (w_i, w_j, l) from a parse of a sentence in s, we compute the likelihood for the corresponding triple (w'_i, w'_j, l) for w'_i, w'_j in s'. We define the syntactic score for all sentences in s' as:

$$Syn(s' \mid s) = \sum_{(w_i, w_j, l) \in Dep(s)} L_{Dep}(w'_i, w'_j, l) \tag{4}$$

where Dep(s) are the dependency parse trees for all sentences in s, and L_Dep is a 3-gram language model over dependency triples which gives the likelihood of an arc label l being used between a pair of words (w'_i, w'_j). For example in Figure 2, the syntactic compatibility score includes the dependency likelihoods L_Dep(ship, 2, num) and L_Dep(had, ship, dobj).
Therefore, the Syn function prefers stories s' that (a) have similar dependency structure to the original story s and (b) make use of a common syntactic configuration.
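The idea of reusing the original parse to score a candidate can be sketched as follows. This is a simplified sketch, not the paper's scorer: it assumes a one-to-one token alignment between original and candidate (no deletions or noun compounds), adds the triple scores together, and replaces the dependency language model with a hypothetical lookup table `triple_score` whose values are invented.

```python
from typing import Dict, List, Sequence, Tuple

# (head index, dependent index, arc label), indices into the original tokens
Arc = Tuple[int, int, str]


def syntactic_score(candidate: Sequence[str],
                    original_arcs: List[Arc],
                    triple_score: Dict[Tuple[str, str, str], float],
                    floor: float = 0.0) -> float:
    """Score a candidate by looking up each original arc with the new words."""
    total = 0.0
    for head, dependent, label in original_arcs:
        key = (candidate[head], candidate[dependent], label)
        total += triple_score.get(key, floor)  # unseen triples get the floor
    return total


# Arcs of the original sentence "Sam had 2 dogs" (written by hand here).
original_arcs = [(1, 0, "nsubj"), (1, 3, "dobj"), (3, 2, "num")]

# Hypothetical triple likelihoods standing in for the dependency model.
triple_score = {
    ("had", "Luke", "nsubj"): 0.1,
    ("had", "ships", "dobj"): 0.4,
    ("ships", "2", "num"): 0.5,
}

good = syntactic_score(["Luke", "had", "2", "ships"], original_arcs, triple_score)
odd = syntactic_score(["Luke", "had", "2", "blast"], original_arcs, triple_score)
```

The candidate whose new words fit the original arcs ("had ships", "2 ships") scores higher than one that forces a verb into a noun slot, which is the preference the Syn component is meant to express.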

Semantic Coherence
The semantic coherence component expresses how well a candidate s' rewrites individual words and realizes the semantic relationships that exist in the human-authored story s. Ideally, we would like to preserve enough of the semantics of s in order to produce a coherent story s', yet we are populating s' with words taken from an unrelated theme. Therefore, we model the semantics of a story s in terms of the lexical semantics contributed by individual words as well as the semantic relationships that exist between its elements. Note that the relationships can cross sentence boundaries, promoting discourse coherence.
We decompose semantic relations in a story into a set of local, lexical relationships between pairs of words. Specifically, we consider semantic relations for noun-noun and verb-verb pairs as provided by WordNet (Miller, 1995). Since some relations are not directly outlined in these resources (e.g., the selectional preferences of nouns with regard to their adjectival modifiers), we also consider the word-embedding similarity between words. For example in Figure 2 the semantic relationships are denoted with blue arrows between pairs of content words in the story (e.g., {Sam, dogs}, {dogs, puppies}, etc.).
More formally, the semantic coherence of s' with respect to s, Sem(s'|s) (Equation 5), aggregates two scores over CW, the set of pairs of indices of content words (nouns, verbs, adjectives, and named entities) from s. We focus on the content words of the original problem, as they carry most of the semantic information. The Sem_Lex and Sem_Pair functions are semantic adaptation scores for individual words and semantic relations respectively, described below. Semantic compatibility between words (Sem_Lex, Equation 6) combines two terms: cos(w_i, w'_i), the cosine similarity between the vector space embeddings of two words w_i and w'_i, and Resnik(w_i, w'_i), the information content of the lowest subsumer of {w_i, w'_i} in WordNet. For example in Figure 2, the semantic compatibility score incorporates lexical similarities such as Sem_Lex(dog, ship).
The compatibility score between semantic relations (Sem_Pair) is defined by adding two components, PairSim and Analogy, that compute how semantic relations between pairs of words are preserved in the new story:

$$PairSim = \cos(w_i, w_j) \cdot \cos(w'_i, w'_j) \tag{7}$$

PairSim preserves the similarity between pairs of words {w_i, w_j} in s and the corresponding pair {w'_i, w'_j} in the new story s'. Intuitively, if w_i and w_j are semantically close to each other, we would like the corresponding words to be close in the new story as well. For example in Figure 2, 'dog' and 'puppy' are similar in the original story, so we expect the corresponding words 'ship' and 'droid' to be similar in the new story. The Analogy function (Equation 8), inspired by Mikolov et al. (2013), computes the analogy of w'_j from w'_i given the relationship that holds between w_i and w_j in the vector space. For example in Figure 2, the relation between 'Sam' and 'dog' is similar to the relation between 'Luke Skywalker' and 'ship'.
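The pairwise scores can be illustrated with toy vectors. In the sketch below `pair_sim` follows the PairSim product given in the text, while `analogy` uses one common vector-offset formulation in the spirit of Mikolov et al. (2013); that exact form is an assumption made here, as are the two-dimensional embeddings, which were invented so that the example is easy to follow.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity, defined as 0 when either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pair_sim(wi: str, wj: str, wi_new: str, wj_new: str,
             emb: Dict[str, Vector]) -> float:
    """Product of the original pair's similarity and the new pair's similarity."""
    return cosine(emb[wi], emb[wj]) * cosine(emb[wi_new], emb[wj_new])


def analogy(wi: str, wj: str, wi_new: str, wj_new: str,
            emb: Dict[str, Vector]) -> float:
    """Assumed offset form: compare (w_j - w_i + w'_i) with w'_j."""
    predicted = [b - a + c for a, b, c in zip(emb[wi], emb[wj], emb[wi_new])]
    return cosine(predicted, emb[wj_new])


# Toy 2-d embeddings, invented for illustration.
emb = {
    "dog": [1.0, 0.2], "puppy": [0.9, 0.4],
    "ship": [0.2, 1.0], "droid": [0.4, 0.9],
    "serum": [-1.0, 0.1],
}

droid_pair = pair_sim("dog", "puppy", "ship", "droid", emb)
serum_pair = pair_sim("dog", "puppy", "ship", "serum", emb)
droid_analogy = analogy("dog", "puppy", "ship", "droid", emb)
serum_analogy = analogy("dog", "puppy", "ship", "serum", emb)
```

Under both scores, rewriting 'puppy' as 'droid' next to 'ship' is preferred over an unrelated word such as 'serum', because the new pair stays close in the vector space just as the original pair did.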

Decoding
Our decoding process begins by first identifying the content words w_i (nouns, verbs, adjectives and named entities) in the original problem s that will be considered as initial points for rewriting. For each of these lexical classes we extract the top-k most thematic words and trivial noun compounds from the theme t. For example, in Figure 2, candidate nouns are 'ships', 'robots', 'droids', etc., and candidate verbs are 'blast', 'soar', 'command', etc. Recall that the space of candidate rewrites is large, prohibiting an exhaustive enumeration. We therefore perform approximate search with a beam, considering simultaneously all possible paths that start at the different initial points. At each step the decoder considers an additional rewrite from the list of candidates, adds it to the existing hypothesis path, and scores it according to function R (Equation 2).
All the counterpart scores are locally optimal, as they factor over each new word w'_i or pair {w'_i, w'_j}, where w'_j is a rewrite already existing in the hypothesis path. At any given step we may recombine hypotheses that share the same prefix hypothesis path, and keep the top scoring one. The process terminates when there are no more rewrites left. We also experimented with decoding under a variety of orderings of the text in the original problem s, including left-to-right, and head-first following the dependency tree of each sentence and then concatenating these linearizations; we observed that considering multiple paths achieves the best performance.
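A stripped-down version of the beam search can make the procedure concrete. The sketch below is a simplification under stated assumptions: it visits content-word positions in a single left-to-right order rather than exploring several starting points, it performs no hypothesis recombination, and the `score` callable is a toy stand-in for R. The function name `beam_decode` and the candidate lists are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Hypothesis = Tuple[str, ...]


def beam_decode(original: Sequence[str],
                candidates: Dict[int, List[str]],
                score: Callable[[Hypothesis], float],
                beam_size: int = 3) -> List[str]:
    """Rewrite one content-word position per step, keeping the best hypotheses."""
    beam: List[Hypothesis] = [tuple(original)]
    for position in sorted(candidates):
        expanded: Dict[Hypothesis, None] = {}  # dict keeps insertion order
        for hyp in beam:
            expanded[hyp] = None  # option: leave this word unchanged
            for word in candidates[position]:
                rewritten = hyp[:position] + (word,) + hyp[position + 1:]
                expanded[rewritten] = None
        # Stable sort: ties keep insertion order, so the result is deterministic.
        beam = sorted(expanded, key=score, reverse=True)[:beam_size]
    return list(beam[0])


# Toy scoring function: sum of invented per-word thematic weights.
toy_weights = {"Luke": 2.0, "Leia": 1.0, "ships": 1.5, "droids": 1.0}
toy_score = lambda hyp: sum(toy_weights.get(tok, 0.0) for tok in hyp)

original = ["Sam", "had", "2", "dogs"]
candidates = {0: ["Luke", "Leia"], 3: ["ships", "droids"]}
result = beam_decode(original, candidates, toy_score, beam_size=2)
```

Because each step scores whole partial rewrites, a scoring function that includes pairwise terms could penalize combinations that look good word by word but clash with rewrites already on the hypothesis path.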

Data Collection
For the set of human-authored stories {s}, we use a corpus of math word problems described in Koncel-Kedziorski et al. (2016). We select a subset of 150 problems targeting 5th and 6th grade levels, all of which involve a single equation in one variable. These problems have 2.7 sentences and 29.4 words on average, 12.6 of which are considered content words by our system. In order to tune and evaluate our model, we collect a corpus of human-authored rewrites produced by workers from Amazon Mechanical Turk based on two themes: Star Wars, and Adventure Time (a children's cartoon).
We experimented with different ways of helping to define the theme for the workers, including offering automatically generated word clouds or enforcing that a response includes one of several keywords. In practice, we have found that using specific cultural elements as themes (such as famous movie or cartoon franchises) attracts workers who already have a strong knowledge of the theme, resulting in higher quality work.
To help explain the rewriting process, we show workers three examples of thematic rewrites with varying degrees of correlation to the original problems. We then show workers a random problem from the original set {s} and a corresponding equation for that problem. We instruct the workers to "rewrite" the problem according to the theme, ensuring that their rewritten problem can be solved by the provided equation. The final dataset comprises 450 human-authored rewrites. We collect 3 rewrites for each of 100 original problems for the Star Wars theme (based on the popular Star Wars sequel movies), and 3 rewrites for each of the remaining 50 original problems for the Children's Cartoon theme (CARTOON), based on the Adventure Time TV show. We keep 150 examples from the Star Wars theme for development (STAR_dev), and the remaining 150 for testing (STAR_test).
We collected the STAR_dev and CARTOON data from workers with the "master" designation and at least a 95% approval rating. We then collected STAR_test from a subset of the authors of STAR_dev who self-identify as theme experts and whose quality of work was manually confirmed.

Setup
Implementation Details. We pre-process the themes using the Stanford CoreNLP tools (Manning et al., 2014) for tokenization, Named Entity Recognition (Finkel et al., 2005), and dependency parsing (Chen and Manning, 2014). For calculating salience scores, we use the ScriptBase dataset of movie scripts (Gorinski and Lapata, 2015). The Star Wars theme is constructed from the available script, roughly 7300 words. The Cartoon theme is constructed from fan-authored scripts of the first 10 episodes of the show (Springfield, 2016), totaling 1370 words. Since our thematic options are taken from arbitrary text, we use the lists of offensive terms published by The Racial Slur database (Database, 2016) and FrontGate Media (Media, 2016) to filter out offensive content. To prohibit overgeneration, we forbid the transformation of stop words or math-specific words (Survivors, 2013; Koncel-Kedziorski et al., 2015b).
For the syntactic compatibility score Syn (Equation 4) we use the English Fiction subset of the Google Syntactic N-grams corpus (Goldberg and Orwant, 2013) and train a 3-gram language model using KenLM (Heafield, 2011). For Sem_Lex, PairSim and Analogy (Equations 6-8) we use the pretrained word embeddings of Levy and Goldberg (2014). These embeddings are trained using dependency contexts rather than windows of adjacent words, allowing them to capture functional word similarity. Finally, we tune the parameters of our model (Equation 2) on the development set STAR_dev and pick those values that maximize the METEOR score (Denkowski and Lavie, 2014) against 3 human references.
Evaluation. We compare two ablated configurations of our method against our full model (FULL): -SYN, which uses only the semantic and thematicity components and does not incorporate the syntactic compatibility score, and -SEM, which replaces the semantic coherence score with the simpler cos(w_i, w'_i), effectively rewriting only single words, and not pairs. We refrained from ablating the thematicity score as it is the core part of our model that drives the rewriting process into a new theme. We evaluate our method using an automatic metric, and via eliciting human judgments on Amazon Mechanical Turk. For automatic evaluation, we compute the METEOR score, comparing the output of each model for a given problem and theme to the 3 human rewrites we collected, on STAR_dev, STAR_test and CARTOON. METEOR is a recall-oriented metric, widely used in the MT community; the additional stemming, synonym and paraphrase matching modules make it more applicable for our use, given the nature of our rewriting task. For human evaluation, we conduct pairwise comparison tests, pairing FULL against a human rewrite (HUMAN), FULL against -SYN, and FULL against -SEM. Participants were given a short description of the theme, and the output of each system. For each test we asked 40 subjects to select which problem they preferred over 5 pairs of outputs; we obtained a total of 200 (5 × 40) responses for STAR_test and CARTOON.
In order to better understand the strengths and weaknesses of the generated stories, we conducted a more detailed human evaluation. Eight participants were presented with the output of the three automatic systems, human rewrites (HUMAN), and a theme. The participants were asked to rate the stories across three dimensions: coherence (how coherent is the text of the problem?), solvability (can elementary school students solve it?), and thematicity (how well does the problem express the theme?) on a scale from 1 to 5. We collected ratings over 16 outputs from STAR_test, resulting in 128 responses.

Results
Table 1 reports METEOR; we notice that removing the semantic coherence scores in -SEM hurts the performance compared to FULL; this confirms our claim that semantic compatibility is crucial for building coherent stories. On the other hand, -SYN performs similarly to FULL. Closer inspection of the -SYN system's output reveals a greater diversity in thematic elements as a result of the relaxed syntactic compatibility constraints. Hence it is more likely to have greater overlap with any of the reference rewrites, resulting in higher METEOR scores. However, a pairwise comparison between FULL and -SYN (Table 2) reveals that human subjects consistently prefer the output of FULL over that of -SYN, both for STAR_test and CARTOON. Table 2 also reports that HUMAN outperforms the output of the FULL model, and a pairwise comparison of FULL and -SEM which yields a result in line with the METEOR scores.
Table 3 shows the results of the detailed comparison of Thematicity, Coherence, and Solvability. This table clearly shows the strong contribution of the semantic component of our system. The specific contribution of the syntactic component is to produce overall more solvable and thematically satisfying problems, although it can slightly affect coherence especially when automatic parses fail. Finally, the overall high ratings for human-authored stories across all three dimensions confirm the high quality of the crowd-sourced stories.

Star Wars
s_1. Wendy bought 4 new chairs and 4 new tables for her house. If she spent 6 minutes on each piece furniture putting it together, how may minutes did it take her to finish?
s'_1. Leia bought 4 new ships and 4 new guns for her room. If she spent 6 minutes on each wasteland weapon putting it together, how many minutes did it take her to terminate?
s_2. My car gets 20 miles per gallon of gas. How many miles can I drive on 5 gallons of gas?
s'_2. My cruiser gets 20 miles per gallon of light. How many miles can I drive on 5 gallons of light?
s_3. Tyler had 15 dogs. Each dog had 5 puppies. How many puppies does Tyler now have?
s'_3. Biggs had 15 creatures. Each creature had 5 creatures. How many creatures does Biggs now have?

Qualitative Examples
Tables 4-6 show some problems generated by our method. Recall that since our system needs no annotated thematic training data, we can easily generate from any theme where thematic text is available. To demonstrate this fact, we include generated examples in a Western theme built from novels in the Project Gutenberg corpus. Many of the results of our system are very legible, with only minor agreement errors. Coherent, thematic semantic relations are evident in problems such as s'_1, where ships, guns, and weapons combine to effect the Star Wars theme; this is also evident in s'_5, where people with western sounding names like Kurt and Madeline trade in cigarettes, an old-fashioned precursor to e-cigarettes.
In some cases, semantic inconsistencies result in weird sounding problems, such as in s'_6 where the main character receives "wheat of grub". However, owing to the syntactic compatibility component, our model scores this candidate higher because of the connection between "wheat" and "graze".

Cartoon
s_7. Dave was helping the cafeteria workers pick up lunch trays, but he could only carry 9 trays at a time. If he had to pick up 17 trays from one table and 55 trays from another, how many trips will he make?
s'_7. Finn was helping the cupboard men pick up candy bottles, but he could only carry 9 bottles at a time. If he had to pick up 17 bottles from one ring and 55 bottles from another, how many swords will he make?
s_8. If books came from all the 4 continents that Bryan had been into and he collected 122 books per continent, how many books does he have from all 4 continents combined?
s'_8. If dances came from all the 4 mountains that Finn had been into and he collected 122 dances per mountain, how many dances does he have from all 4 mountains combined?
s_9. A bucket contains 3 gallons of water. If Derek adds 6.8 gallons more, how many gallons will there be in all?
s'_9. A bottle makes 3 gallons of serum. If Finn adds 6.8 gallons more, how many gallons will there be in all?
Table 5: Examples of the original stories s_i and rewritten math word problems s'_i in the Cartoon theme.

Semantic incoherence is less of a problem in the cartoon theme, where absurd interactions between characters are expected. However, a difficulty for our system is demonstrated in s'_7, where the physical entity "swords" is substituted for the nominalization of an event "trips". Improvements to the semantic coherence component could resolve such issues.
Table 7 shows some instances where the rewrite algorithm produces unusable results. An example of under-generation is s'_10. Here, too many words are left untouched, resulting in both ungrammaticality and semantic incoherence. In s'_11, we witness some limitations of using word vectors. The rare word "Ferris" is not close to anything in the Star Wars theme, and is thus mapped almost arbitrarily to "int" (movie script shorthand for an interior shot). Better treatment of noun compounds and the use of phrase vectors would reduce such errors.

Conclusion
We formalized the problem of story rewriting as automatically changing the theme of a text without altering the underlying story and developed an approach for rewriting algebra word problems where the rewriting model optimized for a number of measures of overall text coherence. Experiments on a newly gathered dataset demonstrated that our model can produce themed texts that are usually solvable.
Future work could improve the thematicity and solvability components by incorporating domain-specific and commonsense knowledge, leveraging information extraction. Additionally, neural network architectures (e.g., LSTMs, seq2seq) can be trained to rewrite coherently with less reliance on brittle syntactic parses. We also plan to study rewriting in other domains such as children's short stories and to extend the model to generate math word problems directly from equations. Finally, we intend to incorporate the generated problems in educational technology and tutoring systems.

[...] award. We thank the anonymous reviewers for their helpful comments.

Western
s_4. Christians father and the senior ranger gathered firewood as they walked towards the lake in the park ...
s'_4. Christian 's partner and the lone sheriff harvested barley as they strolled towards the hip in the orchard ...
s_5. Sally had 27 cards. Dan gave her 41 new cards. Sally bought 20 cards. How many cards does Sally have now?
s'_5. Madeline had 27 cigarettes. Kurt gave her 41 new cigarettes. Madeline bought 20 cigarettes. How many cigarettes does madeline have now?
s_6. For Halloween Megan received 11 pieces of candy from neighbors and 5 pieces from her older sister. If she only ate 8 pieces a day, how long would the candy last her?
s'_6. For Halloween Madeline received 11 wheat of grub from proprietors and 5 wheat from her nameless partner. If she only grazed 8 wheat a day, how long would the grub last her?

Poor Rewrites
s_10. It rained 0.9 inches on Monday. On Tuesday, it rained 0.7 inches less than on Monday. How much did it rain on Tuesday?
s'_10. It blasted 0.9 inches on Monday. On Tuesday, it blasted 0.7 inches less than on Monday. How much did it light on Tuesday?
s_11. The Ferris wheel in Paradise Park has 14 seats. Each seat can hold 6 people. How many people can ride the Ferris wheel at the same time?
s'_11. The int grab in chewbacca mesa has 14 areas. Each area can hold 6 troops. How many troops can ride the int grab at the same time?

Figure 1: An example story and rewrites in 3 themes.

Figure 2: An overview of our method for scoring a candidate story s' given a human-authored story s and a theme t. Syn(s'|s): compatibility of syntactic relations (purple arrows), Sem_Pair(s'|s): coherence of semantic relations (blue arrows), Sem_Lex(s'|s): semantic mapping of individual words, and Th(s'|t): thematicity.

Table 1: METEOR results for different configurations of our model on the STAR_dev, STAR_test and CARTOON datasets (columns: Model, STAR_dev, STAR_test, CARTOON).

Table 2: Human evaluation results on pairwise comparisons between FULL and -SYN, and FULL and HUMAN, on the STAR_test and CARTOON datasets.

Table 3: Human evaluation results for FULL, -SYN, -SEM and HUMAN on thematicity, coherence and solvability on STAR_test.

Table 4: Examples of the original stories s_i and rewritten math word problems s'_i in the Star Wars theme.

Table 6: Examples of the original stories s_i and rewritten math word problems s'_i in the Western theme.

Table 7: Examples of the original stories s_i and poorer rewrites s'_i in the Star Wars theme.