Towards Generating Math Word Problems from Equations and Topics

A math word problem is a narrative with a specific topic that provides clues to the correct equation with numerical quantities and variables therein. In this paper, we focus on the task of generating math word problems. Previous works are mainly template-based with pre-defined rules. We propose a novel neural network model to generate math word problems from the given equations and topics. First, we design a fusion mechanism to incorporate the information of both equations and topics. Second, an entity-enforced loss is introduced to ensure the relevance between the generated math problem and the equation. Automatic evaluation results show that the proposed model significantly outperforms the baseline models. In human evaluations, the math word problems generated by our model are rated as more relevant (in terms of solvability of the given equations and relevance to topics) and more natural (i.e., grammaticality, fluency) than those generated by the baseline models.


Introduction
A math word problem is a narrative which describes a story under a specific topic. Moreover, it provides clues to the correct equation interpreting mathematical relations of numerical quantities and variables. The two example problems in Table 1 belong to two different topics (ticket selling, land purchase) respectively. Meanwhile they share the same equation template interpreting the underlying mathematical relations between numbers and variables. To generate a math word problem, a system needs to produce a topic-specific story while maintaining the underlying equation.
There is a surge of interest in automatic math word problem generation (K. and Elliot, 2002; Deane and Sheehan, 2013; Polozov et al., 2015; Koncel-Kedziorski et al., 2016). Previous attempts are mainly based on templates. Polozov et al. (2015) consider several components (e.g., event graph construction, surface text realization), each with manually defined templates and rules. Koncel-Kedziorski et al. (2016) generate math problems by rewriting existing problems into a new topic. They use a problem as the verbal template and simply replace nouns and verbs with suitable words from the new topic. Because template-based systems generate directly from existing items, their outputs are highly coherent. However, they have clear limitations. As templates are fixed, the possible outputs are limited to the template patterns, with few grammatical and lexical options. Additionally, they require manual effort to construct domain-specific templates.
* Equal contribution.
Recently, neural network approaches to automatic generation of questions (Du et al., 2017; Zhou et al., 2017) and stories (Fan et al., 2018) have shown promising results. Despite their success, they cannot be directly applied to math word problem generation, since the generation of math word problems needs to maintain the underlying mathematical operations between quantities and variables, while at the same time ensuring the relevance of the output problem to a given topic.
In this paper, we propose a novel neural network model for Math word Problem Generation from Equations and Topics (MAGNET). The proposed model consists of three main components: an equation encoder, a topic encoder and a math problem decoder. The equation encoder is implemented with a bidirectional recurrent neural network (RNN) which takes the equation tokens as input and produces a sequence of hidden vectors. The topic encoder maps the given topic words into continuous word representations. The decoder is a unidirectional RNN with a dual-attention mechanism, which can dynamically extract information from equations and topic words. To leverage both the equation and topic information, we design an equation-topic fusion mechanism to enable the decoder to choose which information to use. Furthermore, to ensure that the generated math word problem is highly related to the given equations, we introduce a novel entity-enforced loss function which considers the correspondence between variables in the given equations and entities in the output problem.
Table 1:
Problem 1: Topic: ticket selling. Equation: x + y = 267, 4 * x + 2.5 * y = 1042.5
Problem 2: A farmer bought 100 acres of land, part at $300 an acre and part at $450, paying for the whole $42,200. How much land was there in each part? Topic: land purchase. Equation: x + y = 100, 370 * x + 450 * y = 42200
Large-scale annotated math problem datasets play a crucial role in developing neural math problem generation systems. We propose to adapt Dolphin18K (Huang et al., 2016) as the training, development and test sets, since it is one of the largest math problem datasets currently available and covers diverse problem types. It contains 18,460 elementary math problems from Yahoo! Answers, with annotations of equations and answers.
Extensive experiments are conducted on the Dolphin18K dataset. We first propose three baseline methods: 1) a retrieval-based model that finds the closest math problem in the training set; 2) a sequence-to-sequence model which takes only the equation as input (Equ2Math); 3) a neural decoder model conditioned on topic words (Topic2Math). We use three automatic evaluation metrics commonly used in recent text generation works, i.e., BLEU (Papineni et al., 2002), ROUGE (Lin, 2004) and METEOR (Denkowski and Lavie, 2014). Evaluation results on all three metrics show that our MAGNET model outperforms the baseline methods. To further examine the quality of the generated math word problems, we also conduct human evaluations. Human evaluation results show that our MAGNET model performs better than the baseline systems on three aspects, i.e., 1) solvability with the given equation; 2) relevance to the given topic and 3) grammaticality and fluency of language.
Our contributions are threefold: 1. We propose a novel end-to-end neural network model MAGNET to generate math word problems based on given equations and topics. 2. We introduce an Equation-Topic Fusion mechanism which helps the decoder incorporate information from both the equation and the topic. 3. We design an entity-enforced loss function to improve the relevance between the generated math word problem and the given equations.

Related Work
Automatic question generation from text aims to generate questions taking text as input, which has potential value for educational purposes (Heilman, 2011). Previous question generation works focus on generating natural language questions from a given piece of text. Heilman (2011) employs a syntactic parser to parse the input text into a tree and extract answer candidates. Then a rule-based system transforms the tree into the corresponding question. Recently, generative neural network methods have also been applied to this area since large-scale manually annotated passage-question pairs became available. Du et al. (2017) and Zhou et al. (2017) propose to use the SQuAD (Rajpurkar et al., 2016) question answering dataset as the training data for question generation. In the SQuAD dataset, the given passage is a piece of text from Wikipedia and the answer is a sub-span of it. Du et al. (2017) use a sequence-to-sequence model on the passage-question pairs, taking the passage text as input to generate a question from it. Different from Du et al. (2017), Zhou et al. (2017) add the answer position to the model input as BIO tagging features. However, these methods cannot be directly applied to math word problem generation.
There are previous approaches specifically targeting math problem generation. Most of them are template-based, such as natural language schemas (K. and Elliot, 2002) and semantic frames of conceptual structures (Deane and Sheehan, 2013). Polozov et al. (2015) propose a pipeline including equation generation, plot generation and surface text realization, which requires a manually defined ontology and templates. These approaches are guaranteed to maintain a highly coherent story, but at the manual cost of template construction, which makes them difficult to extend to more domains. Recently, Koncel-Kedziorski et al. (2016) propose a rewrite-based approach. They generate new problems by simply replacing noun phrases and verbs in existing math problems with words from the target topic. However, they do not consider global optimization of the whole problem, which results in semantic incoherence.
Math problem solving, which can be formulated as learning the mapping from math problems to equations, is also related to our work. In this paper, we adapt the math problem dataset Dolphin18K for development. Dolphin18K (Huang et al., 2016) is constructed from Yahoo! Answers and contains over 18,000 math problems. Prior to that, there were several datasets with fewer than 2,000 problems, such as VERB-375 (Hosseini et al., 2014), ALG514 and Dolphin1878 (Shi et al., 2015).

Problem Statement
Given an equation template and a target topic, our goal is to generate a math word problem in natural language. In this section, we first define the equation template and the topic, and then formally define our task.

Equation Template
An equation template, introduced in prior work, is a unique form of an equation system. For example, consider the following equation system: x + y = 20; x − 4 = y. We replace the numbers with placeholder tokens (such as [num0]) and generalize the equations into a template. An equation is a solution for a specific math problem, while an equation template can correspond to several math problems. Therefore, an equation template can be seen as an abstraction of a set of equations.
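The number-to-token generalization can be illustrated with a minimal sketch. The helper name `to_template`, the regular expression, and the choice to number placeholders in order of appearance are assumptions made for illustration; the original preprocessing may treat repeated numbers or variable names differently.

```python
import re

# Matches integers and decimals such as 20, 4, or 2.5 (an illustrative pattern).
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def to_template(equations):
    """Replace each number with a placeholder token, numbered in order of appearance."""
    counter = 0

    def substitute(match):
        nonlocal counter
        token = f"[num{counter}]"
        counter += 1
        return token

    return NUMBER.sub(substitute, equations)


template = to_template("x + y = 20; x - 4 = y")
print(template)  # x + y = [num0]; x - [num1] = y
```

Under this sketch, two equation systems that differ only in their constants map to the same template string, which is what allows one template to correspond to several problems.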

Topic
As pointed out in Koncel-Kedziorski et al. (2016), math problems are coherent stories with different topics (e.g., ticket selling or land purchase). In one math problem, there are words that act as topic indicators. For the problems in Table 1, the corresponding topic indicators are:
Problem 1: {tickets, movie, adults, students, sold}
Problem 2: {farmer, bought, dollar, land, pay}
Therefore, we extract the keywords of a math problem as its topic words for representing the topic. The details of topic word extraction are described in Section 3.4.

Math Problem Generation
Now we can formally define the task of math word problem generation. Given an equation template E and a set of topic words T as input, the goal is to generate a math word problem P satisfying: (1) P is a piece of natural language text whose topic is T; (2) P maintains the mathematical operations between numerical quantities and variables in the equation template E.

Dataset Creation
We create the math word problem generation dataset based on the Dolphin18K (Huang et al., 2016) dataset. Specifically, we construct (E, T, P) triples where E is an equation, T is a set of topic words, and P is the corresponding math word problem. In the Dolphin18K dataset, the equation E and the math word problem P are given. Therefore, we only need to extract the topic words from the text of P.
Previous studies on topic word extraction include simple word-frequency counting and the LDA topic model (Blei et al., 2003). In practice, we observe that the TF-IDF method is effective and satisfies our needs. We calculate the score of each word as follows:
$$\mathrm{score}_{ij} = \mathrm{tf}_{ij} \times \mathrm{idf}_i$$
where $\mathrm{tf}_{ij}$ is the term frequency of word $i$ in problem $P_j$, and $\mathrm{idf}_i$ is the inverse document frequency of word $i$. We sort the words in $P_j$ by score and keep the top $n_{tp}$ words as the problem's topic words.
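The extraction step can be sketched in a few lines of Python. The function name `topic_words`, the unsmoothed `log(N / df)` form of the inverse document frequency, the alphabetical tie-breaking, and the toy tokenized problems are all assumptions for illustration, not the exact configuration used in the experiments.

```python
import math
from collections import Counter


def topic_words(problems, n_tp=10):
    """Return the top-n_tp TF-IDF words for each tokenized problem."""
    n_docs = len(problems)
    # Document frequency: number of problems that contain each word.
    df = Counter(word for tokens in problems for word in set(tokens))
    result = []
    for tokens in problems:
        tf = Counter(tokens)
        # Assumed IDF variant: log(N / df), without smoothing.
        scores = {w: tf[w] * math.log(n_docs / df[w]) for w in tf}
        # Sort by descending score; break ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result.append(ranked[:n_tp])
    return result


problems = [
    ["farmer", "bought", "land", "land", "acre"],
    ["theater", "sold", "tickets", "tickets", "movie"],
    ["farmer", "sold", "apples"],
]
print(topic_words(problems, n_tp=2))
# [['land', 'acre'], ['tickets', 'movie'], ['apples', 'farmer']]
```

Words that occur in many problems (here "farmer" and "sold") receive a low score, so the words that remain tend to be specific to one problem's story.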

MAGNET
As shown in Figure 1, our MAGNET model consists of three main parts, namely, the topic encoder, the equation encoder and the math word problem decoder. The topic encoder and the equation encoder map topic words and equations to continuous vectors. The decoder is a unidirectional recurrent neural network equipped with a dual-attention mechanism, whose outputs are combined by the equation-topic fusion mechanism.

Topic Encoder
The input topic $T$ contains a set of keywords $t_1, t_2, \ldots, t_{n_{tp}}$. Since these topic words do not have sequential or temporal relationships, we represent them as a set of word embeddings $tp_1, tp_2, \ldots, tp_{n_{tp}}$, as shown in the upper-left part of Figure 1. Specifically, the topic encoder is a lookup table which maps input topic words to the corresponding real-valued vectors.

Equation Encoder
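The equation encoder is implemented as a single-layer bidirectional GRU (BiGRU) (Cho et al., 2014). We concatenate all the equations together, separated by a special delimiter "," that indicates the end of an equation. The BiGRU reads the input equation tokens one by one, producing a sequence of forward hidden states $(\overrightarrow{h}_1, \ldots, \overrightarrow{h}_n)$ and backward hidden states $(\overleftarrow{h}_1, \ldots, \overleftarrow{h}_n)$; each token is represented by the concatenation $h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$. The initial states of the BiGRU are set to zero vectors, i.e., $\overrightarrow{h}_1 = 0$ and $\overleftarrow{h}_n = 0$.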

Math Word Problem Decoder
At each time step $t$, the decoder GRU holds its previous hidden state $s_{t-1}$, the embedding $w_{t-1}$ of the previous output word $y_{t-1}$, and the previous context vector $c_{t-1}$. With these previous states, the decoder GRU updates its state as given by Equation 6:
$$s_t = \mathrm{GRU}(w_{t-1}, c_{t-1}, s_{t-1}) \quad (6)$$
To initialize the GRU hidden state, we apply a linear layer to the last backward encoder hidden state $\overleftarrow{h}_1$ of the equation. The decoder then generates a readout state $r_t$ and passes it through a maxout hidden layer (Goodfellow et al., 2013) to predict the next word with a softmax layer over the output vocabulary:
$$r_t = W_r w_{t-1} + U_r c_t + V_r s_t \quad (8)$$
$$m_t = \left[\max\{r_{t,2j-1}, r_{t,2j}\}\right]^{\top}_{j=1,\ldots,d} \quad (9)$$
$$p(y_t \mid y_1, \ldots, y_{t-1}) = \mathrm{softmax}(W_o m_t)$$
where $W_r$, $U_r$, $V_r$ and $W_o$ are weight matrices, and $w_{t-1}$ is the word embedding of the previously generated word $y_{t-1}$. The readout state $r_t$ is a $2d$-dimensional vector, and the maxout layer (Equation 9) picks the maximum of every two consecutive values in $r_t$ to produce a $d$-dimensional maxout vector $m_t$. We then apply a linear transformation to $m_t$ to obtain a vector of target-vocabulary size and predict the next word $y_t$ with the softmax operation.
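The maxout and softmax steps described above can be sketched in plain Python. The function names, the toy readout vector, and the toy output matrix `w_out` are hypothetical; a real implementation would use a tensor library and learned weights.

```python
import math


def maxout(readout):
    """Take the max over each consecutive pair of a 2d-dimensional readout vector."""
    if len(readout) % 2 != 0:
        raise ValueError("readout length must be even")
    return [max(readout[j], readout[j + 1]) for j in range(0, len(readout), 2)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_word_distribution(readout, w_out):
    """Apply maxout, a linear map w_out (vocab_size rows, d columns), then softmax."""
    m = maxout(readout)
    logits = [sum(w * x for w, x in zip(row, m)) for row in w_out]
    return softmax(logits)


readout = [0.2, -1.0, 0.5, 0.9]                # 2d = 4, so d = 2
w_out = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy 3-word vocabulary
print(maxout(readout))                          # [0.2, 0.9]
probs = next_word_distribution(readout, w_out)
print(max(range(len(probs)), key=probs.__getitem__))  # 2
```

Maxout halves the readout dimension while keeping the stronger activation of each pair, so the output projection only needs d input columns.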

Equation-Topic Fusion
To incorporate information from both the equation and the topic, we propose the Equation-Topic Fusion mechanism. Intuitively, this mechanism enables the decoder to pay different amounts of attention to the equation template and the topic words. For instance, when the decoder is generating descriptive words about the story, it should pay more attention to the topic words. Conversely, the decoder should pay more attention to the equation when it is generating numbers or variables from the equation. In detail, the context vector c t in Equations 6 and 8 is a fused vector of equation and topic. We employ two attention modules to produce the corresponding context vectors of the equation and the topic. In each module, repr(·) i denotes the vector of an encoded equation token or topic word, i.e., h i or tp i , and v a , W a and U a are learnable parameters. Since the equation and topic information are of different types, we use two separate sets of these parameters for the equation and topic attention modules. We denote the equation context vector c(equation) t and the topic context vector c(topic) t as EC t and T C t respectively. To fuse EC t and T C t together, we predict a fusion coefficient g t , the fusion gate, using an MLP. The context vector c t is therefore a combination of the equation template and topic information, determined by the current decoding state s t .
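The following sketch shows one way the two context vectors could be fused. It rests on several assumptions that the text does not confirm: the attention energies are taken as given inputs rather than computed from v a , W a and U a ; the gate is a scalar obtained by applying a sigmoid to an MLP output (`gate_logit`); and the fused vector is the convex combination g * EC + (1 - g) * TC. All function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of attention energies."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context(energies, vectors):
    """Attention-weighted sum of vectors, given unnormalized energies."""
    weights = softmax(energies)
    dim = len(vectors[0])
    return [sum(w * v[k] for w, v in zip(weights, vectors)) for k in range(dim)]


def fuse(ec, tc, gate_logit):
    """Assumed fusion: c = g * EC + (1 - g) * TC, with g = sigmoid(gate_logit)."""
    g = 1.0 / (1.0 + math.exp(-gate_logit))
    return g, [g * e + (1.0 - g) * t for e, t in zip(ec, tc)]


# Toy encoded equation tokens (h_i) and topic embeddings (tp_i).
ec = context([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])  # [0.5, 0.5]
tc = context([0.3], [[2.0, 2.0]])                    # [2.0, 2.0]
g, c = fuse(ec, tc, gate_logit=0.0)
print(g, c)  # 0.5 [1.25, 1.25]
```

Under this assumed form, a gate value near 1 lets the equation context dominate and a value near 0 lets the topic context dominate, which matches the intuition that the decoder shifts its focus depending on the word being generated.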

Entity-Enforced Loss
As mentioned before, the generated math problems should be highly relevant to the given equation template. The entities in the generated math problem should correspond to the variables in the equations (e.g., m, n, [num0]). To ensure high relevance between the equation template and the generated math problem, we propose an entity-enforced loss, where L is the length of the output problem and ReLU is the rectifier function, ReLU(x) = max(0, x). The intuition behind the entity-enforced loss is that the model needs to attend to the entities in the given equations. In Equation 16, we accumulate the attention scores of each variable in the equation over all decoding time steps to obtain acc e . Then a ReLU function is applied to (1 − acc e ) to encourage the entity e to be attended to at least once during decoding.
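A minimal sketch of this loss, under stated assumptions: the equation attention weights are available as a step-by-token matrix, the entity tokens are identified by their positions in the equation, and the per-entity terms are summed (the source does not say whether they are summed or averaged). The function name and toy values are hypothetical.

```python
def entity_enforced_loss(attention, entity_positions):
    """Sum of ReLU(1 - acc_e) over entities.

    attention[t][i] is the equation attention weight at decoding step t on
    equation token i; entity_positions lists the token indices treated as entities.
    """
    loss = 0.0
    for i in entity_positions:
        acc = sum(step[i] for step in attention)  # accumulated attention for entity i
        loss += max(0.0, 1.0 - acc)               # ReLU(1 - acc_e)
    return loss


# Two decoding steps over three equation tokens; tokens 0 and 2 are entities.
attention = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
]
loss = entity_enforced_loss(attention, entity_positions=[0, 2])
print(round(loss, 3))  # 0.8
```

Token 0 accumulates 1.3 attention and contributes nothing, while token 2 accumulates only 0.2 and is penalized by 0.8. The penalty vanishes once an entity has received a total attention mass of at least 1, so it acts as a soft constraint rather than forcing attention at any particular step.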

Objective Function
Given a training dataset with n equation-topic-problem triples D = {(E (1) , T (1) , P (1) ), . . . , (E (n) , T (n) , P (n) )}, the training objective is to minimize, with respect to the model parameters θ, the negative log likelihood loss L combined with the entity-enforced loss, where λ is a hyper-parameter that controls the contribution of the entity-enforced loss.
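For a single training example, the combined objective might look like the sketch below. It assumes that the entity-enforced term is simply added to the negative log likelihood with weight λ; the function name and the toy probabilities are hypothetical, and the default of 0.7 is the value reported in the implementation details.

```python
import math


def training_loss(token_probs, entity_loss, lam=0.7):
    """Negative log likelihood of the reference tokens plus lam * entity loss.

    token_probs holds the model probability assigned to each reference token.
    """
    nll = -sum(math.log(p) for p in token_probs)
    return nll + lam * entity_loss


loss = training_loss([0.5, 0.25], entity_loss=0.8)
print(round(loss, 4))  # 2.6394
```

With λ = 0 the objective reduces to the standard sequence-to-sequence likelihood, which corresponds to the ablation without the entity-enforced loss.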

Experiment
In this section, we evaluate our model with both automatic and human evaluations.

Datasets
We conduct our experiments on the Dolphin18K dataset (footnote 2). Since we need equation templates as input to generate math problems, we use its subset with equation annotation, which amounts to 10,644 problems with 5,738 equation templates. … Huang et al. (2016).
As pre-processing, we obtain the equation template and topic words of each problem to serve as the model input. We extract at most $n_{tp} = 10$ words with the highest TF-IDF scores as the topic words in our experiments. The average numbers of extracted topic words are 7.7 and 7.5 in the training and test sets respectively.

Baselines
We provide three baselines for math word problem generation, which take the equation template and the topic words as input in different ways (footnote 3).
KNN finds the closest problem in the training set given the input topic words. It first narrows down the training problems to those with the same equation template as the input. Then a TF-IDF vector is created for the topic words and KNN is applied to retrieve the nearest training problem.
Topic2Math generates math problems given only the topic words as input. Topic words are encoded by the Topic Encoder component described in Section 4.1.
Equ2Math generates math problems given only the equation template as input. The equation template is encoded by the Equation Encoder component described in Section 4.2.
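The KNN baseline described above can be sketched as a two-stage retrieval. Several details are assumptions: the similarity measure is cosine similarity, the IDF is smoothed and computed over the candidate problems only, and the training set is a list of hypothetical (template, topic words, problem text) tuples.

```python
import math
from collections import Counter


def tfidf_vector(words, df, n_docs):
    """Sparse TF-IDF vector with a smoothed IDF (an assumed variant)."""
    tf = Counter(words)
    return {w: tf[w] * (math.log((1 + n_docs) / (1 + df[w])) + 1.0) for w in tf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(word, 0.0) for word, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(template, topics, training_set):
    """Return the text of the nearest training problem sharing the template."""
    # Stage 1: keep only training problems with the same equation template.
    candidates = [ex for ex in training_set if ex[0] == template]
    if not candidates:
        return None
    # Stage 2: rank the candidates by TF-IDF cosine similarity of topic words.
    df = Counter(w for _, words, _ in candidates for w in set(words))
    n_docs = len(candidates)
    query = tfidf_vector(topics, df, n_docs)
    best = max(candidates, key=lambda ex: cosine(query, tfidf_vector(ex[1], df, n_docs)))
    return best[2]


training_set = [
    ("x + y = [num0]", ["tickets", "movie", "sold"], "Ticket problem"),
    ("x + y = [num0]", ["farmer", "land", "bought"], "Land problem"),
    ("x * y = [num0]", ["farmer", "land"], "Other template"),
]
print(retrieve("x + y = [num0]", ["farmer", "acre", "land"], training_set))
# Land problem
```

Because the output is always an existing training problem, such a retrieval baseline is fluent by construction but cannot produce a problem for an unseen equation template.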
2 Other datasets are small and biased on problem types, which can be seen as subsets of the dataset we used.
3 Previous works are not comparable: 1) The rules used in Polozov et al. (2015) are not publicly available; 2) Koncel-Kedziorski et al. (2016) use a different input from ours: their system needs a full math word problem and then rewrites it, while our system tries to generate a problem from scratch.

Implementation Details
The dimensions of the encoder/decoder hidden states and the embeddings are set to 512. The hyper-parameter λ in Equation 19 is 0.7. The dropout rate is set to 0.6. All model parameters are initialized using a Gaussian distribution with the Xavier scheme (Glorot and Bengio, 2010). We use the Adam (Kingma and Ba, 2015) optimizer with its hyper-parameters set as follows: learning rate $\alpha = 0.001$, momentum parameters $\beta_1 = 0.9$ and $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$. We also apply gradient clipping (Pascanu et al., 2013) with range [−5, 5]. The beam size is set to 3 in the decoding stage. We release the source code at an anonymous URL for blind review.

Automatic Evaluation
Though the automatic evaluation methods have their limitations in natural language generation evaluation, we use them as important evaluation methods since they are easily reproducible. Furthermore, in the task of math word problem generation, retaining some key information such as the quantities and entities can be well measured by the automatic evaluation methods.

Evaluation Metrics
We evaluate the performance of our model using three evaluation metrics following recent text generation works (Du et al., 2017; Zhou et al., 2017; Fan et al., 2018):
BLEU (Papineni et al., 2002) is a widely used evaluation method in machine translation and text generation.
ROUGE (Lin, 2004) is commonly used to evaluate the n-gram overlap of summaries with gold-standard sentences.
METEOR (Denkowski and Lavie, 2014) is provided following previous work (Koncel-Kedziorski et al., 2016).
… ROUGE-1, +1.23 ROUGE-2, +1.39 ROUGE-L) relative gain respectively. The improvement over the baselines demonstrates the usefulness of both the equation and the topic input. Moreover, without the entity-enforced loss, the performance drops on all metrics, which shows its effectiveness.

Human Evaluation
To better evaluate the performance of our system, we recruit three human annotators to judge the quality of the generated math problems, in addition to the automatic metrics. We randomly select 50 instances from the test set, and show the equation template and topic words together with the math problems generated by the different models. We then ask the annotators to rate the outputs with scores ranging from 1 to 3 on the following four aspects (detailed guidelines are attached in the supplementary material): 1. Equation Relevance: the generated problem is relevant to the given equation; 2. Topic Relevance: the generated problem is relevant to the given topic words; 3. Solvability: the generated problem can be solved by the given equation; 4. Language Fluency: the generated problem is grammatical and fluent. Table 3 reports the human evaluation results. As we can see, MAGNET has the highest scores on the three criteria of equation relevance (2.08), topic relevance (2.9) and solvability (2.07), outperforming all the baselines and the ablation test. KNN performs the best in terms of language fluency, since as a retrieval-based method its outputs are existing problems in the training data. The Kappa (Randolph, 2005) values on all models range from 0.4 to 0.9, indicating moderate to excellent agreement among annotators. The human evaluation result is consistent with the automatic evaluation.
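The agreement statistic can be sketched as follows, assuming that the Kappa cited above is the free-marginal multirater kappa, in which chance agreement is fixed at 1/k for k rating categories. The function name and the toy ratings are hypothetical and are not taken from the annotation data.

```python
from collections import Counter


def free_marginal_kappa(ratings, num_categories):
    """Free-marginal multirater kappa.

    ratings holds one list per item with each rater's label; every item must
    be rated by the same number of raters (at least two).
    """
    n_raters = len(ratings[0])
    agreement = 0.0
    for item in ratings:
        counts = Counter(item)
        # Fraction of rater pairs that agree on this item.
        agreement += sum(c * (c - 1) for c in counts.values()) / (n_raters * (n_raters - 1))
    p_observed = agreement / len(ratings)
    p_expected = 1.0 / num_categories
    return (p_observed - p_expected) / (1.0 - p_expected)


# Three annotators scoring four outputs on a 1-to-3 scale (toy data).
ratings = [[3, 3, 3], [2, 2, 3], [1, 1, 1], [2, 3, 3]]
print(round(free_marginal_kappa(ratings, num_categories=3), 3))  # 0.5
```

In the toy data, two items have full agreement and two have one agreeing pair out of three, giving an observed agreement of 2/3 against a chance level of 1/3.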

Discussion
To better understand the model, we show the attention visualization and qualitative analysis with some examples.

Effect of Model Fusion
To illustrate how MAGNET leverages both inputs, we visualize the output fusion coefficient g (top), and attention of topic words (middle) and equation template (bottom) in Figure 2.
We can see that MAGNET generates a reasonable math problem. When generating words such as "product" and "decreased" in the output, the fusion module concentrates on the topic words; …
Table 4 shows an example case from the test set. We can see that MAGNET generates a more reasonable math problem with respect to equation solvability and topic relevance. Please note that the equation template does not exist in the training data. Surprisingly, MAGNET has captured the constant "60" in the equation template and generates "kilometer per h" as a unit conversion from minutes to hours. Neither the ablation model MAGNET-Entity nor the baselines generate a reasonable problem, while MAGNET generates a "speed" word problem, perfectly describing the division of two numbers. This further demonstrates the effectiveness of the entity-enforced loss, which encourages relevance between the equation template and the output problem. Due to space limits, we attach more examples in the supplementary material.

Error Analysis
Furthermore, we observe two main types of errors made by our model (examples are shown in the supplementary material): (1) Problem soundness. The generated problem lacks semantic coherence. For example, the model generates "plants [num0] feet of fence to build a fence", which is incomprehensible; (2) Equation matchness. The input equation template is partially correlated with the output, but is not an exact solution of it. This is somewhat expected, since we use the entity-enforced loss only as a soft constraint to encourage relevance to the equation.

Conclusion
In this work, we present MAGNET, a novel model for math word problem generation. It considers the input of both equations and topics using a fusion module. Additionally, an entity-enforced loss is introduced to ensure the relevance between the equation and the problem during training. Experiments on a large-scale math problem dataset demonstrate that our model can produce fluent math word problems that are highly relevant to the given equations and topics.
Future work could incorporate language models to improve language fluency, and design more fine-grained models that improve semantic coherence by imposing harder constraints from the equation template. Furthermore, we would like to extend the model to more diverse topics with external resources.