Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems

Solving algebraic word problems requires executing a series of arithmetic operations—a program—to obtain a final answer. However, since programs can be arbitrarily complicated, inducing them directly from question-answer pairs is a formidable challenge. To make this task more feasible, we solve these problems by generating answer rationales, sequences of natural language and human-readable mathematical expressions that derive the final answer through a series of small steps. Although rationales do not explicitly specify programs, they provide a scaffolding for their structure via intermediate milestones. To evaluate our approach, we have created a new 100,000-sample dataset of questions, answers and rationales. Experimental results show that indirect supervision of program learning via answer rationales is a promising strategy for inducing arithmetic programs.


Introduction
Behaving intelligently often requires mathematical reasoning. Shopkeepers calculate change, tax, and sale prices; agriculturists calculate the proper amounts of fertilizers, pesticides, and water for their crops; and managers analyze productivity. Even determining whether you have enough money to pay for a list of items requires applying addition, multiplication, and comparison. Solving these tasks is challenging as it involves recognizing how goals, entities, and quantities in the real world map onto a mathematical formalization, computing the solution, and mapping the solution back onto the world. As a proxy for the richness of the real world, a series of papers have used natural language specifications of algebraic word problems, and solved these by either learning to fill in templates that can be solved with equation solvers (Hosseini et al., 2014) or inferring and modeling operation sequences (programs) that lead to the final answer (Roy and Roth, 2015).
In this paper, we learn to solve algebraic word problems by inducing and modeling programs that generate not only the answer, but an answer rationale, a natural language explanation interspersed with algebraic expressions justifying the overall solution. Such rationales are what examiners require from students in order to demonstrate understanding of the problem solution; they play the very same role in our task. Not only do natural language rationales enhance model interpretability, but they provide a coarse guide to the structure of the arithmetic programs that must be executed. In fact, the learner we propose (which relies on a heuristic search; §4) fails to solve this task without modeling the rationales, because the search space is too unconstrained.
This work is thus related to models that can explain or rationalize their decisions (Hendricks et al., 2016; Harrison et al., 2017). However, the use of rationales in this work is quite different from the role they play in most prior work, where interpretation models are trained to generate plausible-sounding (but not necessarily accurate) post-hoc descriptions of the decision-making process they used. In this work, the rationale is generated as a latent variable that gives rise to the answer; it is thus a more faithful representation of the steps used in computing the answer.
This paper makes three contributions. First, we have created a new dataset with more than 100,000 algebraic word problems that includes both answers and natural language answer rationales (§2). Figure 1 illustrates three representative instances from the dataset. Second, we propose a sequence to sequence model that generates a sequence of instructions that, when executed, generates the rationale; only after this is the answer chosen (§3). Third, since the target program is not given in the training data (most obviously, its specific form will depend on the operations that are supported by the program interpreter), we contribute a technique for inferring programs that generate a rationale and, ultimately, the answer. Even constrained by a text rationale, the search space of possible programs is quite large, and we employ a heuristic search to find plausible next steps to guide the search for programs (§4). Empirically, we are able to show that state-of-the-art sequence to sequence models are unable to perform above chance on this task, but that our model doubles the accuracy of the baseline (§6).

Dataset
We built a dataset with 100,000 problems with the annotations shown in Figure 1. Each question is decomposed into four parts, two inputs and two outputs. The inputs are the description of the problem, which we will denote as the question, and the possible (multiple choice) answer options, denoted as options.
The outputs, which our goal is to generate, are the description of the rationale used to reach the correct answer, denoted as rationale, and the correct option label. Problem 1 illustrates an example of an algebra problem, which must be translated into an expression (i.e., (27x + 17y)/(x + y) = 23) and then the desired quantity (x/y) solved for. Problem 2 is an example that could be solved by the multi-step arithmetic operations proposed by Roy and Roth (2015). Finally, Problem 3 describes a problem that is solved by testing each of the options, which has not been addressed in the past.

Construction
We first collect a set of 34,202 seed problems that consist of multiple option math questions covering a broad range of topics and difficulty levels. Examples of exams with such problems include the GMAT (Graduate Management Admission Test) and GRE (General Test). Many websites contain example math questions in such exams, where the answer is supported by a rationale.
Next, we turned to crowdsourcing to generate new questions. We create a task where users are presented with a set of 5 questions from our seed dataset. Then, we ask the Turker to choose one of the questions and write a similar question. We also force the answers and rationale to differ from the original question in order to avoid paraphrases of the original question. We manually check a subset of the jobs for each Turker for quality control. The types of questions generated using this method vary. Some Turkers propose small changes in the values of the questions (e.g., changing the equality p(a) − p(b) = p(a − b) in Problem 3 to a different equality is a valid question, as long as the rationale and options are rewritten to reflect the change). We designate these as replica problems, as the natural language used in the question and rationales tends to be only minimally altered. Others propose new problems in the same topic, where the generated questions tend to differ more radically from existing ones. Some Turkers also copy math problems available on the web; the instructions state that this is not allowed, as it will generate multiple copies of the same problem in the dataset if two or more Turkers copy from the same resource. These Turkers can be detected by checking the nearest neighbours within the collected datasets, as problems obtained from online resources are frequently submitted by more than one Turker. Using this method, we obtained 70,318 additional questions.

Statistics
Descriptive statistics of the dataset are shown in Table 1. In total, we collected 104,519 problems (34,202 seed problems and 70,318 crowdsourced problems). We removed 500 problems as a held-out set (250 for development and 250 for testing). As replicas of the held-out problems may be present in the training set, these were removed manually by listing, for each held-out instance, the closest problems in the training set in terms of character-based Levenshtein distance. After filtering, 100,949 problems remained in the training set. We also show the average number of tokens (total number of tokens in the question, options and rationale) and the vocabulary size of the questions and rationales. Finally, we provide the same statistics exclusively for tokens that are numeric values and tokens that are not. Figure 2 shows the distribution of examples based on the total number of tokens. We can see that most examples consist of 30 to 500 tokens, but there are also extremely long examples with more than 1000 tokens in our dataset.
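The replica filtering step can be illustrated with a minimal sketch. It assumes a plain dynamic-programming edit distance and a hypothetical helper, closest_training_problems, that lists the k nearest training problems for manual inspection; neither the function names nor the value of k come from our pipeline.

```python
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute all cost 1)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def closest_training_problems(heldout: str, training: List[str],
                              k: int = 3) -> List[Tuple[int, str]]:
    """Hypothetical helper: the k training problems nearest to a held-out one."""
    return sorted((levenshtein(heldout, t), t) for t in training)[:k]
```

A human annotator would then read the returned neighbours and decide which of them are replicas to be removed from the training set.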

Model
Generating rationales for math problems is challenging as it requires models that learn to perform math operations at a finer granularity, since each step within the solution must be explained. For instance, in Problem 1, the equation (27x + 17y)/(x + y) = 23 must be solved to obtain the answer. In previous work, this could be done by feeding the equation into an expression solver to obtain x/y = 3/2. However, this would skip the intermediate steps 27x + 17y = 23x + 23y and 4x = 6y, which must also be generated in our problem. We propose a model that jointly learns to generate the text in the rationale, and to perform the math operations required to solve the problem. This is done by generating a program, containing both instructions that generate output and instructions that simply generate intermediate values used by following instructions.
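For reference, the chain of intermediate steps for Problem 1 can be written out in full. Only cross-multiplication and collection of like terms are used:

```latex
\begin{align*}
\frac{27x + 17y}{x + y} &= 23 \\
27x + 17y &= 23x + 23y \\
4x &= 6y \\
\frac{x}{y} &= \frac{6}{4} = \frac{3}{2}
\end{align*}
```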
In our particular problem, we are given the problem and the set of options, and wish to predict the rationale and the correct option. We set x as the sequence of words in the problem, concatenated with words in each of the options separated by a special tag. Note that knowledge about the possible options is required as some problems are solved by the process of elimination or by testing each of the options (e.g. Problem 3). We wish to generate y, which is the sequence of words in the rationale. We also append the correct option as the last word in y, which is interpreted as the chosen option. For example, y in Problem 1 is "Let the . . . = 3/2 . EOR B EOS", whereas in Problem 2 it is "Let s be . . . Answer is C EOR C EOS", where "EOS" is the end of sentence symbol and "EOR" is the end of rationale symbol.
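As a rough sketch of how the two sequences might be assembled, assuming whitespace tokenization and a hypothetical separator tag "<opt>" (the actual tag and tokenizer are not specified here):

```python
from typing import List

OPTION_TAG = "<opt>"  # hypothetical separator tag, chosen for illustration
EOR, EOS = "EOR", "EOS"


def build_input(question: str, options: List[str]) -> List[str]:
    """x: question words followed by each option, separated by the tag."""
    tokens = question.split()
    for option in options:
        tokens.append(OPTION_TAG)
        tokens.extend(option.split())
    return tokens


def build_output(rationale: str, correct_label: str) -> List[str]:
    """y: rationale words, end-of-rationale symbol, chosen option, end symbol."""
    return rationale.split() + [EOR, correct_label, EOS]
```

For example, build_output("Answer is C", "C") yields the tokens Answer, is, C, EOR, C, EOS, matching the tail of the Problem 2 target above.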

Generating Programs to Generate Rationales
We wish to generate a latent sequence of program instructions, $z = z_1, \ldots, z_{|z|}$, with length $|z|$, that will generate y when executed. We express z as a program that can access x, y, and the memory buffer m. Upon finishing execution, we expect the sequence of output tokens to have been placed in the output vector y. Table 2 illustrates an example of a sequence of instructions that would generate an excerpt from Problem 2, where columns x, z, v, and r denote the input sequence, the instruction sequence (program), the values of executing the instruction, and where each value $v_i$ is written (i.e., either to the output or to the memory). In this example, instructions from indexes 1 to 14 simply fill each position of the observed output $y_1, \ldots, y_{14}$ with a string, where the Id operation simply returns its parameter without applying any operation. As such, running this operation is analogous to generating a word by sampling from a softmax over a vocabulary. However, instruction $z_{15}$ reads the input word $x_5$, 52, and applies the operation Str to Float, which converts the word 52 into a floating point number, and the same is done for instruction $z_{20}$, which reads a previously generated output word $y_{17}$. Unlike instructions $z_1, \ldots, z_{14}$, these operations write to the external memory m, which stores intermediate values. A more sophisticated instruction, which shows some of the power of our model, is $z_{21} = \text{Choose}(m_1, m_2) \to m_3$, which evaluates $\binom{m_1}{m_2}$ and stores the result in $m_3$. This process repeats until the model generates the end-of-sentence symbol. The last token of the program, as said previously, must generate the correct option value, from "A" to "E". By training a model to generate instructions that can manipulate existing tokens, the model benefits from the additional expressiveness needed to solve math problems within the generation process.
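To make the execution model concrete, the following is a minimal interpreter sketch. It covers only four operations, uses 0-indexed pointers (unlike the 1-indexed notation in the text), and the input tokens, the tuple encoding of instructions, and the identifiers Str_to_Float and Float_to_Str are assumptions made for illustration rather than our implementation.

```python
from math import comb
from typing import Callable, Dict, List, Tuple, Union

# Illustrative subset of the operation set; the Python bodies are assumptions.
OPS: Dict[str, Callable] = {
    "Id": lambda s: s,
    "Str_to_Float": lambda s: float(s),
    "Float_to_Str": lambda v: str(int(v)) if float(v).is_integer() else str(v),
    "Choose": lambda n, k: float(comb(int(n), int(k))),
}

# An argument is a literal string or a pointer such as ("x", 3),
# meaning "element 3 of buffer x".
Arg = Union[str, Tuple[str, int]]
Instruction = Tuple[str, List[Arg], str]  # (operation, arguments, destination)


def run_program(x: List[str], program: List[Instruction]) -> Tuple[list, list]:
    """Execute instructions in order, appending each value to y or to m."""
    y: list = []
    m: list = []
    buffers = {"x": x, "y": y, "m": m}
    for op, args, dest in program:
        values = [buffers[a[0]][a[1]] if isinstance(a, tuple) else a for a in args]
        value = OPS[op](*values)
        (y if dest == "OUTPUT" else m).append(value)
    return y, m


x = ["a", "deck", "of", "52", "cards"]  # made-up input tokens
program: List[Instruction] = [
    ("Id", ["Total"], "OUTPUT"),
    ("Id", ["="], "OUTPUT"),
    ("Str_to_Float", [("x", 3)], "MEMORY"),      # m[0] = 52.0
    ("Str_to_Float", ["2"], "MEMORY"),           # m[1] = 2.0
    ("Choose", [("m", 0), ("m", 1)], "MEMORY"),  # m[2] = 1326.0
    ("Float_to_Str", [("m", 2)], "OUTPUT"),      # writes "1326" to y
]
y, m = run_program(x, program)
```

Only the three instructions with destination OUTPUT contribute tokens to y; the remaining three leave their values in memory for later instructions to read.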
In total we define 22 different operations, 13 of which are frequently used operations when solving math problems. These are: Id, Add, Subtract, Multiply, Divide, Power, Log, Sqrt, Sine, Cosine, Tangent, Factorial, and Choose (number of combinations). We also provide 2 operations to convert between Radians and Degrees, as these are needed for the sine, cosine and tangent operations. There are 6 operations that convert floating point numbers into strings and vice-versa. These include the Str to Float and Float to Str operations described previously, as well as operations which convert between floating point numbers and fractions, since in many math problems the answers are in the form "3/4". For the same reason, an operation to convert between a floating point number and a number grouped in thousands is also used (e.g. 1000000 to "1,000,000" or "1.000.000"). Finally, we define an operation (Check) that, given the input string, searches through the list of options and returns a string with the option index in {"A", "B", "C", "D", "E"}. If the input value does not match any of the options, or more than one option contains that value, it cannot be applied. For instance, in Problem 2, once the correct probability "1/221" is generated, by applying the Check operation to this number we can obtain the correct option "C".
Figure 3: Illustration of the generation process of a single instruction tuple at timestamp i.
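A possible reading of the Check operation and of the thousands-grouping conversion is sketched below. The function names, the token-level matching rule, and the option texts in the usage note are assumptions for illustration; returning None stands in for "the operation cannot be applied".

```python
from typing import Dict, Optional


def check(value: str, options: Dict[str, str]) -> Optional[str]:
    """Return the label of the unique option containing value, else None."""
    matches = [label for label, text in options.items() if value in text.split()]
    return matches[0] if len(matches) == 1 else None


def group_thousands(number: float, separator: str = ",") -> str:
    """Render an integer-valued number with grouped thousands."""
    return f"{int(number):,}".replace(",", separator)
```

With made-up options {"A": "1/13", "B": "2/221", "C": "1/221"}, check("1/221", options) selects "C", while a value that appears in no option, or in two of them, yields None.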

Generating and Executing Instructions
In our model, programs consist of sequences of instructions, z. We turn now to how we model each $z_i$, conditional on the text program specification and the program's history. The instruction $z_i$ is a tuple consisting of an operation ($o_i$), an ordered sequence of its arguments ($a_i$), a decision about where its result will be placed ($r_i$; is it appended to the output y or to the memory buffer m?), and the result of applying the operation to its arguments ($v_i$). That is,
\[ z_i = (o_i, r_i, a_i, v_i). \]
Formally, $o_i$ is an element of the pre-specified set of operations O, which contains, for example, add, div, Str to Float, etc. The number of arguments required by $o_i$ is given by argc($o_i$), e.g., argc(add) = 2 and argc(log) = 1. The arguments are $a_i = a_{i,1}, \ldots, a_{i,\mathrm{argc}(o_i)}$. An instruction will generate a return value $v_i$ upon execution, which will either be placed in the output y or hidden. This decision is controlled by $r_i$. We define the instruction probability as:
\[ p(o_i, a_i, r_i, v_i \mid z_{<i}, x, y, m) = p(o_i \mid z_{<i}, x) \times p(r_i \mid z_{<i}, x, o_i) \times \prod_{j=1}^{\mathrm{argc}(o_i)} p(a_{i,j} \mid z_{<i}, x, o_i, m, y) \times [v_i = \mathrm{apply}(o_i, a_i)], \]
where [p] evaluates to 1 if p is true and 0 otherwise, and apply(f, x) evaluates the operation f with arguments x. Note that the apply function is not learned, but pre-defined.
The network used to generate an instruction at a given timestamp i is illustrated in Figure 3. We first use the recurrent state $h_i$ to generate $p(o_i \mid z_{<i}, x) = \operatorname{softmax}_{o_i \in O}(h_i)$, using a softmax over the set of available operations O.
In order to predict $r_i$, we generate a new hidden state $\mathbf{r}_i$, which is a function of the current program context $h_i$ and an embedding of the current predicted operation, $o_i$. As the output can either be placed in the memory m or the output y, we compute the probability $p(r_i = \text{OUTPUT} \mid z_{<i}, x, o_i) = \sigma(\mathbf{r}_i)$, where σ is the logistic sigmoid function. If $r_i = \text{OUTPUT}$, $v_i$ is appended to the output y; otherwise it is appended to the memory m. Once we generate $r_i$, we must predict $a_i$, the argc($o_i$)-length sequence of arguments that operation $o_i$ requires. The jth argument $a_{i,j}$ can be either generated from a softmax over the vocabulary, copied from the input vector x, or copied from previously generated values in the output y or memory m. This decision is modeled using a latent predictor network (Ling et al., 2016), where the choice of method used to generate $a_{i,j}$ is governed by a latent variable $q_{i,j} \in \{\text{SOFTMAX}, \text{COPY-INPUT}, \text{COPY-OUTPUT}\}$. Similar to when predicting $r_i$, in order to make this choice, we also generate a new hidden state for each argument slot j, denoted by $\mathbf{q}_{i,j}$, with an LSTM, feeding the previous argument in at each time step, and initializing it with $\mathbf{r}_i$ and by reading the predicted value of the output $r_i$.
• If $q_{i,j} = \text{SOFTMAX}$, $a_{i,j}$ is generated by sampling from a softmax over the vocabulary Y:
\[ p(a_{i,j} \mid q_{i,j} = \text{SOFTMAX}) = \operatorname{softmax}_{a_{i,j} \in Y}(\mathbf{q}_{i,j}). \]
This corresponds to a case where a string is used as argument (e.g. $y_1$ = "Let").
• If $q_{i,j} = \text{COPY-INPUT}$, $a_{i,j}$ is obtained by copying an element from the input vector with a pointer network (Vinyals et al., 2015) over input words $x_1, \ldots, x_{|x|}$, represented by their encoder LSTM states $u_1, \ldots, u_{|x|}$. As such, we compute the probability distribution over input words as:
\[ p(a_{i,j} \mid q_{i,j} = \text{COPY-INPUT}) = \operatorname{softmax}_{a_{i,j} \in x_1, \ldots, x_{|x|}}\big(f(u_{a_{i,j}}, \mathbf{q}_{i,j})\big). \qquad (1) \]
Function f computes the affinity of each token $x_{a_{i,j}}$ and the current output context $\mathbf{q}_{i,j}$. A common implementation of f, which we follow, is to apply a linear projection from $[u_{a_{i,j}}; \mathbf{q}_{i,j}]$ into a fixed size vector (where [u; v] is vector concatenation), followed by a tanh and a linear projection into a single value.
• If $q_{i,j} = \text{COPY-OUTPUT}$, the model copies from either the output y or the memory m. This is equivalent to finding the instruction $z_i$ where the value was generated. Once again, we define a pointer network that points to the output instructions and define the distribution over previously generated instructions as:
\[ p(a_{i,j} \mid q_{i,j} = \text{COPY-OUTPUT}) = \operatorname{softmax}_{a_{i,j} \in z_1, \ldots, z_{i-1}}\big(f(h_{a_{i,j}}, \mathbf{q}_{i,j})\big). \]
Here, the affinity is computed using the decoder state $h_{a_{i,j}}$ and the current state $\mathbf{q}_{i,j}$.
Finally, we embed the argument $a_{i,j}$ and the state $\mathbf{q}_{i,j}$ to generate the next state $\mathbf{q}_{i,j+1}$. Once all arguments for $o_i$ are generated, the operation is executed to obtain $v_i$. Then, the embedding of $v_i$, the final state of the instruction $\mathbf{q}_{i,|a_i|}$ and the previous state $h_i$ are used to generate the state at the next timestamp $h_{i+1}$.
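The affinity function used by both copy mechanisms (a linear projection of the concatenated states, a tanh, and a projection to a scalar, followed by a softmax over the candidates) can be sketched in plain Python. The parameter names W and w, and the use of nested lists instead of a tensor library, are assumptions made to keep the example self-contained.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def affinity(u: Vector, q: Vector, W: Matrix, w: Vector) -> float:
    """f(u, q) = w . tanh(W [u; q]); W and w are assumed learned parameters."""
    concat = u + q  # vector concatenation [u; q]
    hidden = [math.tanh(sum(wij * cj for wij, cj in zip(row, concat))) for row in W]
    return sum(wi * hi for wi, hi in zip(w, hidden))


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def copy_distribution(states: List[Vector], q: Vector, W: Matrix, w: Vector) -> Vector:
    """Pointer distribution over candidate states (encoder or decoder states)."""
    return softmax([affinity(u, q, W, w) for u in states])
```

Passing encoder states gives the input copy distribution, and passing the decoder states of earlier instructions gives the output copy distribution; in a real model the two would likely use separate parameters.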

Inducing Programs while Learning
The set of instructions z that will generate y is unobserved. Thus, given x we optimize the marginal probability function:
\[ p(y \mid x) = \sum_{z \in Z} p(y \mid z)\, p(z \mid x) = \sum_{z \in Z(y)} p(z \mid x), \]
where $p(y \mid z)$ is the Kronecker delta function $\delta_{e(z),y}$, which is 1 if the execution of z, denoted as e(z), generates y and 0 otherwise. Thus, we can redefine $p(y \mid x)$, the marginal over all programs Z, as a marginal over programs that would generate y, defined as Z(y). As marginalizing over $z \in Z(y)$ is intractable, we approximate the marginal by generating samples from our model. Denote the set of samples that are generated by $\hat{Z}(y)$. We maximize $\sum_{z \in \hat{Z}(y)} p(z \mid x)$.
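In practice, an objective of this form is usually computed in log space. The sketch below assumes that each sampled program comes with its log-probability under the model and that duplicate samples have already been removed; the function name is hypothetical.

```python
import math
from typing import List


def log_marginal(sample_log_probs: List[float]) -> float:
    """log of the summed probability of sampled programs that execute to y.

    Uses the log-sum-exp trick so that very small probabilities do not
    underflow. Assumes the samples are distinct programs.
    """
    top = max(sample_log_probs)
    return top + math.log(sum(math.exp(lp - top) for lp in sample_log_probs))
```

Maximizing this quantity with respect to the model parameters raises the total probability mass assigned to programs known to generate the rationale.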
However, generating programs that generate y is not trivial, as randomly sampling from the RNN distribution over instructions at each timestamp is unlikely to generate a sequence z ∈ Z(y). This is analogous to the question answering work in Liang et al. (2016), where the query that generates the correct answer must be found during inference, and training proved to be difficult without supervision. In Roy and Roth (2015) this problem is also addressed by adding prior knowledge to constrain the exponential space.
In our work, we leverage the fact that we are generating rationales, where there is a sense of progression within the rationale. That is, we assume that the rationale solves the problem step by step. For instance, in Problem 2, the rationale first describes the number of combinations of two cards in a deck of 52 cards, then describes the number of combinations of two kings, and finally computes the probability of drawing two kings. Thus, while generating the final answer without the rationale requires a long sequence of latent instructions, generating each of the tokens of the rationale requires far fewer operations.
More formally, given the sequence $z_1, \ldots, z_{i-1}$ generated so far, and the possible values for $z_i$ given by the network, denoted $Z_i$, we wish to filter $Z_i$ to $Z_i(y_k)$, which denotes a set of possible options that contain at least one path capable of generating the next token at index k. Finding the set $Z_i(y_k)$ is achieved by testing all combinations of instructions that are possible with at most one level of indirection, and keeping those that can generate $y_k$. This means that the model can only generate one intermediate value in memory (not including the operations that convert strings into floating point values and vice-versa).
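The filtering idea can be illustrated with a small brute-force search. This sketch is restricted to four binary operations over numeric values that are already available, and it enumerates instruction paths of length one (direct) or two (one intermediate value); the function names and the rendering of floats as strings are assumptions, not the search used in our system.

```python
import itertools
import operator
from typing import Iterable, Iterator, List, Tuple

BINARY_OPS = {"Add": operator.add, "Subtract": operator.sub,
              "Multiply": operator.mul, "Divide": operator.truediv}
Step = Tuple[str, float, float]


def render(value: float) -> str:
    """Assumed Float-to-Str behaviour: drop the fraction of whole numbers."""
    return str(int(value)) if float(value).is_integer() else str(value)


def apply_pairs(pairs: Iterable[Tuple[float, float]]) -> Iterator[Tuple[Step, float]]:
    """Apply every operation to every argument pair, skipping failures."""
    pairs = list(pairs)
    for name, fn in BINARY_OPS.items():
        for a, b in pairs:
            try:
                yield (name, a, b), fn(a, b)
            except ZeroDivisionError:
                continue


def paths_to_token(values: List[float], target: str) -> List[List[Step]]:
    """All paths with at most one intermediate value that produce target."""
    paths = []
    for step, mid in apply_pairs(itertools.permutations(values, 2)):
        if render(mid) == target:
            paths.append([step])
        chained = [(mid, v) for v in values] + [(v, mid) for v in values]
        for step2, out in apply_pairs(chained):
            if render(out) == target:
                paths.append([step, step2])
    return paths
```

A candidate instruction would be kept in the filtered set only if it appears as the first step of at least one returned path.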
Decoding. During decoding we find the most likely sequence of instructions z given x, which can be performed with a stack-based decoder. However, it is important to note that each generated instruction $z_i = (o_i, r_i, a_{i,1}, \ldots, a_{i,|a_i|}, v_i)$ must be executed to obtain $v_i$. To avoid generating unexecutable code (e.g., log(0)), each hypothesis instruction is executed and removed if an error occurs. Finally, once the "EOR" tag is generated, we only allow instructions that would generate one of the options "A" to "E" to be generated, which guarantees that one of the options is chosen.
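The execute-and-prune step during decoding can be sketched as follows. The hypothesis format (name, function, arguments) and the set of exceptions treated as execution errors are assumptions chosen for illustration.

```python
import math
import operator
from typing import Callable, List, Tuple

Hypothesis = Tuple[str, Callable, tuple]


def prune(hypotheses: List[Hypothesis]) -> List[Tuple[str, tuple, float]]:
    """Execute each hypothesis instruction and drop those that raise an error."""
    kept = []
    for name, fn, args in hypotheses:
        try:
            value = fn(*args)
        except (ValueError, ZeroDivisionError, OverflowError):
            continue  # unexecutable instruction, e.g. log(0): discard it
        kept.append((name, args, value))
    return kept
```

Here a hypothesis such as ("Log", math.log, (0.0,)) is discarded because math.log raises a ValueError for a zero argument, while executable hypotheses are kept together with their computed value.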
For both the attention and copy mechanisms, for each instruction $z_i$, the model needs to compute the probability distribution over all the attendable units c conditioned on the previous state $h_{i-1}$. For the attention model and input copy mechanisms, $c = x_{0,i-1}$, and for the output copy mechanism $c = z$. The number of matrix multiplications involved in these operations grows with the product of the sizes of c and z. For instance, during the computation of the probabilities for the input copy mechanism in Equation 1, the affinity function f between the current context q and a given input $u_k$ is generally implemented by projecting u and q into a single vector followed by a non-linearity, which is projected into a single affinity value. Thus, for each possible input u, 3 matrix multiplications must be performed. Furthermore, for RNN unrolling, parameters and intermediate outputs for these operations must be replicated for each timestamp. Thus, as z becomes larger, the attention and copy mechanisms quickly become a memory bottleneck, as the computation graph becomes too large to fit on the GPU. In contrast, the sequence-to-sequence model proposed by Sutskever et al. (2014) does not suffer from these issues, as each timestamp is dependent only on the previous state $h_{i-1}$.
To deal with this, we use a training method we call staged back-propagation, which saves memory by considering slices of K tokens in z, rather than the full sequence. That is, to train on a mini-batch where $|z| = 300$ with $K = 100$, we would actually train on 3 mini-batches, where the first batch would optimize for $z_{1:100}$, the second for $z_{101:200}$ and the third for $z_{201:300}$. The advantage of this method is that memory intensive operations, such as attention and the copy mechanism, only need to be unrolled for K steps, and K can be adjusted so that the computation graph fits in memory.
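The slicing itself is simple to state in code. The helper below is a hypothetical illustration that only computes the slice boundaries (1-indexed and inclusive, as in the text); it does not perform any training.

```python
from typing import Iterator, Tuple


def stage_slices(length: int, k: int) -> Iterator[Tuple[int, int]]:
    """Yield 1-indexed inclusive (start, end) slices of at most k instructions."""
    for start in range(1, length + 1, k):
        yield start, min(start + k - 1, length)
```

With length 300 and k = 100 this produces the three slices from the example above; when the length is not a multiple of k, the last slice is simply shorter.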
However, unlike truncated back-propagation for language modeling, where context outside the scope of K is ignored, sequence-to-sequence models require global context. Thus, the sequence of states h is still built for the whole sequence z. Afterwards, we obtain a slice $h_{j:j+K}$, and compute the attention vector. Finally, the prediction of the instruction is conditioned on the LSTM state and the attention vector.

Experiments
We apply our model to the task of generating rationales for solutions to math problems, evaluating it on both the quality of the rationale and the ability of the model to obtain correct answers.

Baselines
As the baseline we use the attention-based sequence to sequence model proposed by Bahdanau et al. (2014), and proposed augmentations, allowing it to copy from the input (Ling et al., 2016) and from the output (Merity et al., 2016).

Hyperparameters
We used a two-layer LSTM with a hidden size of H = 200, and word embeddings with size 200. The number of levels that the graph G is expanded during sampling D is set to 5. Decoding is performed with a beam of 200. As for the vocabulary of the softmax and embeddings, we keep the most frequent 20,000 word types, and replace the rest of the words with an unknown token. During training, the model only learns to predict a word as an unknown token, when there is no other alternative to generate the word.

Evaluation Metrics
The evaluation of the rationales is performed with average sentence-level perplexity and BLEU-4 (Papineni et al., 2002). When a model cannot generate a token for perplexity computation, we predict the unknown token. This benefits the baselines as they are less expressive. As the perplexity of our model is dependent on the latent program that is generated, we force-decode our model to generate the rationale, while maximizing the probability of the program. This is analogous to the method used to obtain sample programs described in Section 4, but we choose the most likely instructions at each timestamp instead of sampling. Finally, the correctness of the answer is evaluated by computing the percentage of the questions where the chosen option matches the correct one.
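For concreteness, the accuracy and perplexity computations can be sketched as below. The averaging scheme for perplexity (one value per rationale, then a mean over rationales) is one plausible reading of "average sentence-level perplexity" and is an assumption, as are the function names.

```python
import math
from typing import List


def accuracy(predicted: List[str], gold: List[str]) -> float:
    """Percentage of questions whose chosen option matches the correct one."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * correct / len(gold)


def perplexity(token_log_probs: List[float]) -> float:
    """Perplexity of one rationale from its per-token natural-log probabilities."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def average_perplexity(rationales: List[List[float]]) -> float:
    """Assumed aggregation: mean of the per-rationale perplexities."""
    return sum(perplexity(r) for r in rationales) / len(rationales)
```

With five options per question, a model that guesses uniformly at random would score about 20% under this accuracy measure.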

Results
The test set results, evaluated on perplexity, BLEU, and accuracy, are presented in Table 3.
Perplexity. In terms of perplexity, we observe that the regular sequence to sequence model fares poorly on this dataset, as the model requires the generation of many values that tend to be sparse. Adding an input copy mechanism greatly improves the perplexity as it allows the generation process to use values that were mentioned in the question. The output copying mechanism improves perplexity slightly over the input copy mechanism, as many values are repeated after their first occurrence. For instance, in Problem 2, the value "1326" is used twice, so even though the model cannot generate it easily in the first occurrence, the second one can simply be generated by copying the first one. We can observe that our model yields significant improvements over the baselines, demonstrating that the ability to generate new values by algebraic manipulation is essential in this task. An example of a program that is inferred is shown in Figure 4. The graph was generated by finding the most likely program z that generates y. Each node isolates a value in x, m, or y, where arrows indicate an operation executed with the outgoing nodes as arguments and incoming node as the return of the operation. For simplicity, operations that copy or convert values (e.g. from string to float) were not included, but nodes that were copied/converted share the same color.
Examples of tokens where our model can obtain the perplexity reduction are the values "0.025", "0.023", "0.002" and finally the answer "E", as these cannot be copied from the input or output.
BLEU. We observe that the regular sequence to sequence model achieves a low BLEU score. In fact, due to the high perplexities, the model generates very short rationales, which frequently consist of segments similar to "Answer should be D", as most rationales end with similar statements. By applying the copy mechanism, the BLEU score improves substantially, as the model can define the variables that are used in the rationale. Interestingly, the output copy mechanism adds no further improvement in the BLEU evaluation. This is because during decoding all values that can be copied from the output are values that could have been generated by the model either from the softmax or the input copy mechanism. As such, adding an output copying mechanism adds little to the expressiveness of the model during decoding. Finally, our model achieves the highest BLEU score, as it has the mechanism to generate the intermediate and final values in the rationale.
Accuracy. In terms of accuracy, we see that all baseline models obtain values close to chance (20%), indicating that they are completely unable to solve the problem. In contrast, we see that our model can solve problems at a rate that is significantly higher than chance, demonstrating the value of our program-driven approach, and its ability to learn to generate programs.
In general, the problems we solve correctly correspond to simple problems that can be solved in one or two operations. Examples include questions such as "Billy cut up each cake into 10 slices, and ended up with 120 slices altogether. How many cakes did she cut up? A) 9 B) 7 C) 12 D) 14 E) 16", which can be solved in a single step. In this case, our model predicts "120 / 10 = 12 cakes. Answer is C" as the rationale, which is reasonable.

Discussion.
While we show that our model can outperform existing models, correctly generating complex rationales such as those shown in Figure 1 is still an unsolved problem, as each additional step adds complexity to the problem both during inference and decoding. Yet, this is the first result showing that it is possible to solve math problems in such a manner, and we believe this modeling approach and dataset will drive work on this problem.

Related Work
Extensive efforts have been made in the domain of math problem solving (Hosseini et al., 2014; Roy and Roth, 2015), which aim at obtaining the correct answer to a given math problem. Other work has focused on learning to map math expressions into formal languages (Roy et al., 2016). We aim to generate natural language rationales, where the bindings between variables and the problem solving approach are mixed into a single generative model that attempts to solve the problem while explaining the approach taken. Our approach is strongly tied with the work on sequence to sequence transduction using the encoder-decoder paradigm (Sutskever et al., 2014; Bahdanau et al., 2014; Kalchbrenner and Blunsom, 2013), and inherits ideas from the extensive literature on semantic parsing (Jones et al., 2012; Berant et al., 2013; Andreas et al., 2013; Quirk et al., 2015; Liang et al., 2016; Neelakantan et al., 2016) and program generation (Reed and de Freitas, 2016; Graves et al., 2016), namely, the usage of an external memory, the application of different operators over values in the memory and the copying of stored values into the output sequence.
Figure 4: Illustration of the most likely latent program inferred by our algorithm to explain a held-out question-rationale pair.
Providing textual explanations for classification decisions has begun to receive attention, as part of increased interest in creating models whose decisions can be interpreted. Lei et al. (2016) jointly modeled both a classification decision and the selection of the most relevant subsection of a document for making the classification decision. Hendricks et al. (2016) generate textual explanations for visual classification problems, but in contrast to our model, they first generate an answer, and then, conditional on the answer, generate an explanation. This effectively creates a post-hoc justification for a classification decision rather than a program for deducing an answer. These papers, like ours, have jointly modeled rationales and answer predictions; however, we are the first to use rationales to guide program induction.

Conclusion
In this work, we addressed the problem of generating rationales for math problems, where the task is to not only obtain the correct answer of the problem, but also generate a description of the method used to solve the problem. To this end, we collect 100,000 question and rationale pairs, and propose a model that can generate natural language and perform arithmetic operations in the same decoding process. Experiments show that our method outperforms existing neural models, in both the fluency of the rationales that are generated and the ability to solve the problem.