Using Intermediate Representations to Solve Math Word Problems

To solve math word problems, previous statistical approaches attempt to learn a direct mapping from a problem description to its corresponding equation system. However, such mappings do not include the information of a few higher-order operations that cannot be explicitly represented in equations but are required to solve the problem. The gap between natural language and equations makes it difficult for a learned model to generalize from limited data. In this work we present an intermediate meaning representation scheme that tries to reduce this gap. We use a sequence-to-sequence model with a novel attention regularization term to generate the intermediate forms, then execute them to obtain the final answers. Since the intermediate forms are latent, we propose an iterative labeling framework for learning by leveraging supervision signals from both equations and answers. Our experiments show that using intermediate forms outperforms directly predicting equations.


Introduction
There is a growing interest in math word problem solving (Kushman et al., 2014; Koncel-Kedziorski et al., 2015; Huang et al., 2017; Roy and Roth, 2018). It requires reasoning with respect to sets of numbers or variables, which is an essential capability in many other natural language understanding tasks. Consider the math problems shown in Table 1. To solve the problems, one needs to know how many numbers are to be summed up (e.g. "2 numbers/3 numbers"), and the relation between variables ("the first/second number"). However, an equation system does not encode this information explicitly. For example, an equation represents "the sum of 2 numbers" as (x + y) and "the sum of 3 numbers" as (x + y + z). This makes it difficult to generalize to cases unseen in the data (e.g. "the sum of 100 numbers").
* Work done while this author was an intern at Microsoft Research.

Table 1: Example math word problems and their equation systems.
1) The sum of 2 numbers is 18. The first number is 4 more than the second number. Find the two numbers.
   Equations: x + y = 18, x = y + 4
2) The sum of 3 numbers is 15. The larger number is 4 times the smallest and the middle number is 5. What are the numbers?
   Equations: x + y + z = 15, x = 4 * z, y = 5

This paper presents a new intermediate meaning representation scheme for solving math problems, aiming at closing the semantic gap between natural language and equations. To generate the intermediate forms, we adapt a sequence-to-sequence (seq2seq) network following recent work that tries to generate equations from problem descriptions for this task. Wang et al. (2017) have shown that seq2seq models have the power to generate equations for problem types that do not exist in the training data. In this paper, we propose a new method which adds an extra meaning representation and generates an intermediate form as output. Additionally, we observe that the attention weights of the seq2seq model repeatedly concentrate on numbers in the problem description. To address the issue, we further propose to use a form of attention regularization.
To train the model without explicit annotations of intermediate forms, we propose an iterative labeling framework to leverage signals from both equations and their solutions. We first derive possible intermediate forms with ambiguity using the gold-standard equation systems, and use these forms for training to get a pre-trained model. Then we iteratively refine the intermediate forms using the learned model and the signals from the gold-standard answers.
We conduct experiments on two publicly available math problem datasets. Our experimental results show that using the intermediate forms for training performs significantly better than directly mapping problems to equation systems. Furthermore, our iterative labeling framework creates better labeled data with intermediate forms for training, which leads to improved performance.
To summarize, our contributions include:
• We present a new intermediate meaning representation scheme for solving math problems.
• We design an iterative labeling framework to automatically augment training data with intermediate meaning representation.
• We propose using attention regularization in training to address the issue of incorrect attention in the seq2seq model.
• We verify the effectiveness of our proposed solutions by conducting experiments and analysis on real-world datasets.

Meaning Representation
In this section, we will compare meaning representations for solving math problems and introduce the proposed intermediate meaning representation.

Meaning Representations for Math Problem Solving
We first discuss two meaning representation schemes for math problem solving.
An equation system is a collection of one or more equations involving the same set of variables, and should be considered a highly abstract symbolic representation. The Dolphin Language was introduced by Shi et al. (2015). It contains about 35 math-related classes and over 200 math-related functions, with additional classes and functions automatically mined from Freebase.
Unfortunately, these representation schemes do not generalize well. Consider the two problems listed in Table 2. They belong to the same type of problem, asking about the summation of consecutive integers. However, their meaning representations are very different in the Dolphin language and in equations. On one hand, the Dolphin language aligns too closely with natural utterances. Since math problem descriptions are diverse in their use of nouns and verbs, the Dolphin language may represent the same type of problem differently. On the other hand, an equation system does not explicitly represent useful problem-solving information such as "number of variables" and "numbers are consecutive".

Intermediate Meaning Representation
To bridge the semantic gap between the two meaning representations, we present a new intermediate meaning representation scheme for math problem solving. It consists of 6 classes and 23 functions. Here a class is a set of entities with the same semantic properties, and classes can be inherited (e.g. 2 ∈ int, int ⊑ num). A function comprises a name, a list of arguments with corresponding types, and a return type. For example, there are two overloaded definitions for the function math#sum (Table 3). Intermediate forms can be constructed by recursively applying joint operations on functions with class type constraints. Our representation scheme borrows the explicit use of higher-order functions from the Dolphin language while avoiding being too specific. Meanwhile, the intermediate forms are not as concise as the equation systems (Table 2). We leave more detailed definitions to the supplementary material due to space limits.
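To make the notions of class inheritance, typed function signatures and overloading concrete, the following is a minimal Python sketch. It is an illustration under assumptions, not the scheme's actual definition: the `numlist` class, both `math#sum` signatures and the helper names `is_subclass` and `resolve` are hypothetical, since the real definitions are given in Table 3 and the supplementary material.

```python
from dataclasses import dataclass

# Hypothetical fragment of the class hierarchy: child class -> parent class.
PARENT = {"int": "num"}


def is_subclass(cls, target):
    """Return True if cls equals target or inherits from it."""
    while cls is not None:
        if cls == target:
            return True
        cls = PARENT.get(cls)
    return False


@dataclass(frozen=True)
class Signature:
    name: str          # function name
    arg_types: tuple   # expected class of each argument
    return_type: str   # class of the returned value


# Two illustrative overloads of math#sum (assumed, not copied from Table 3).
SIGNATURES = [
    Signature("math#sum", ("num", "num"), "num"),
    Signature("math#sum", ("numlist",), "num"),
]


def resolve(name, arg_types):
    """Pick the first overload whose argument classes accept arg_types."""
    for sig in SIGNATURES:
        if (
            sig.name == name
            and len(sig.arg_types) == len(arg_types)
            and all(is_subclass(a, t) for a, t in zip(arg_types, sig.arg_types))
        ):
            return sig
    raise TypeError(f"no overload of {name} accepts {arg_types}")
```

Because `int` inherits from `num`, a call such as `resolve("math#sum", ("int", "int"))` is accepted by the two-argument overload, while an argument of an unrelated class is rejected. This is the kind of type constraint that limits which functions may be composed.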

Problem Statement
Given a math word problem $p$, our goal is to predict its answer $A_p$. For each problem we have annotations of both the equation system $E_p$ and the answer $A_p$ available for training. The latent intermediate form is denoted as $LF_p$.

Model
In this section, we describe (1) the basic sequence-to-sequence model, and (2) attention regularization.

Sequence-to-Sequence RNN Model
Our baseline model is based on sequence-to-sequence learning (Sutskever et al., 2014) with attention (Bahdanau et al., 2015) and a copy mechanism (Gulcehre et al., 2016; Gu et al., 2016).
Encoder: The encoder is implemented as a single-layer bidirectional RNN with gated recurrent units (GRUs). It reads words one by one from the input problem, producing a sequence of hidden states $h_i$, where $\phi^{in}$ maps each input word $x_i$ to a fixed-dimensional vector.
Decoder with Copying: At each decoding step $j$, the decoder receives the word embedding of the previous word, and an attention function is applied to attend over the input words. Here $s_j$ is the decoder hidden state. Intuitively, $a_{ji}$ defines the probability distribution of attention over the input words; these weights are computed from the unnormalized attention scores $e_{ji}$. $c_j$ is the context vector, which is the weighted sum of the encoder hidden states.
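The encoder and attention computations described above can be written out as follows. This is a sketch under assumptions rather than the exact formulation: the additive scoring function and the parameters $v$, $W_h$, $W_s$ and $b_{attn}$ follow the common formulation of Bahdanau et al. (2015) and are not named in the text, as are the arrows marking the forward and backward GRU states. Only the normalization of $e_{ji}$ into $a_{ji}$ and the weighted-sum context vector $c_j$ are stated directly.

```latex
\begin{align}
\overrightarrow{h}_i &= \mathrm{GRU}\big(\phi^{in}(x_i),\ \overrightarrow{h}_{i-1}\big) \\
\overleftarrow{h}_i &= \mathrm{GRU}\big(\phi^{in}(x_i),\ \overleftarrow{h}_{i+1}\big) \\
h_i &= [\overrightarrow{h}_i ; \overleftarrow{h}_i] \\
e_{ji} &= v^{\top} \tanh\big(W_h h_i + W_s s_j + b_{attn}\big) \\
a_{ji} &= \frac{\exp(e_{ji})}{\sum_{i'} \exp(e_{ji'})} \\
c_j &= \sum_i a_{ji}\, h_i
\end{align}
```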
At each step, the model has to decide whether to generate a word from the target vocabulary or to copy a number from the problem description. The generation probability $p_{gen}$ is computed from the context vector $c_j$ and the decoder state $s_j$, where $w_c$, $w_s$ and $b_{ptr}$ are model parameters. Next, $p_{gen}$ is used as a soft switch: with probability $p_{gen}$ the model generates from the decoder state, using a probability distribution over all words in the vocabulary; with probability $1 - p_{gen}$ the model directly copies an input word according to its attention weight. Combining the two cases gives the final distribution of decoder outputs.
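The soft switch between generating and copying can be sketched in NumPy as below. Several details are assumptions made for illustration: the sigmoid form of `gen_prob`, the use of a plain softmax over `vocab_logits` for the vocabulary distribution, and the function and variable names are hypothetical. The mixture itself, $p_{gen}$ times the vocabulary distribution plus $1 - p_{gen}$ times the attention mass on matching input tokens, follows the description above.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()


def gen_prob(c_j, s_j, w_c, w_s, b_ptr):
    """Assumed form: sigmoid of a linear function of context and state."""
    return 1.0 / (1.0 + np.exp(-(w_c @ c_j + w_s @ s_j + b_ptr)))


def copy_mixture(vocab_logits, attention, src_tokens, vocab, p_gen):
    """Blend generation and copying into one distribution over tokens."""
    # Extended vocabulary: target vocabulary plus unseen source tokens.
    extra = [t for t in dict.fromkeys(src_tokens) if t not in vocab]
    extended = list(vocab) + extra
    index = {tok: k for k, tok in enumerate(extended)}
    final = np.zeros(len(extended))
    # Generation branch, weighted by p_gen.
    final[: len(vocab)] = p_gen * softmax(vocab_logits)
    # Copy branch: attention mass of every position holding the same token.
    for weight, tok in zip(attention, src_tokens):
        final[index[tok]] += (1.0 - p_gen) * weight
    return extended, final


# Toy example: the source contains the number "3" twice.
extended, final = copy_mixture(
    vocab_logits=np.zeros(3),
    attention=[0.2, 0.1, 0.3, 0.4],
    src_tokens=["3", "times", "3", "79"],
    vocab=["math#sum", "math#product", "x"],
    p_gen=0.6,
)
```

Note that a token appearing at several input positions accumulates the attention weight of all of them, so the result remains a valid probability distribution.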

Attention Regularization
In preliminary experiments, we observed that the attention weights in the baseline model repeatedly concentrate on the numbers in the math problem description (discussed in later sections with Figure 1(a)). To address this issue, we regularize the accumulated attention weights for each input token using a rectified linear unit (ReLU) layer, leading to the regularization term:
$$\mathrm{AttReg} = \sum_i \mathrm{ReLU}\Big(\sum_j a_{ji} - 1\Big)$$
where ReLU(x) = max(x, 0). This term penalizes the accumulated attention weight on a specific location if it exceeds 1. Adding this term to the primary loss gives the final objective function:
$$\mathcal{L} = \mathcal{L}_{\mathrm{primary}} + \lambda \cdot \mathrm{AttReg} \qquad (10)$$
where λ is a hyper-parameter that controls the contribution of attention regularization to the loss. The form of our attention regularization term resembles the coverage mechanism used in neural machine translation (Tu et al., 2016; Cohn et al., 2016), which encourages coverage or fertility control for input tokens.
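As a minimal NumPy sketch of the regularizer described above: attention weights are summed over decoding steps for each input token, and only the excess over 1 is penalized. The array layout (`attn[j, i]` for step j and input token i) and the function names are assumptions for illustration, and `nll` stands in for whatever primary loss is used.

```python
import numpy as np


def attention_regularizer(attn):
    """Sum over input tokens of ReLU(accumulated attention - 1).

    attn[j, i] is the attention weight on input token i at decoding step j.
    """
    accumulated = np.asarray(attn, dtype=float).sum(axis=0)
    return np.maximum(accumulated - 1.0, 0.0).sum()


def total_loss(nll, attn, lam):
    """Primary loss plus the regularizer weighted by lambda."""
    return nll + lam * attention_regularizer(attn)


# Three decoding steps that keep attending to the first input token.
attn = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
])
reg = attention_regularizer(attn)
loss = total_loss(2.0, attn, lam=0.4)
```

In this toy case only the first token exceeds an accumulated weight of 1, so it alone contributes to the penalty; tokens that are attended to at most once in total are left unpenalized.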

Iterative Labeling
Since explicit annotations of our intermediate forms do not exist, we propose an iterative labeling framework for training.

Deriving Latent Forms From Equations
We use the annotated equation systems to derive possible latent forms. First we define some simple rules that map an expression to our intermediate form. For example, we use regular expressions to match numbers and unknown variables. Example rules are shown in Table 4 (see Section 2 of the supplementary material for all rules).

Table 4: Example rules for deriving intermediate forms from equations (columns: Regex/Rules, Class).
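The idea of matching numbers and unknown variables with regular expressions can be sketched as follows. The patterns, the labels `number`, `unknown` and `operator`, and the helper names are hypothetical and chosen only for illustration; the actual rules are those listed in the supplementary material.

```python
import re

# Hypothetical rules: (pattern, label). Not the rule set used in the paper.
RULES = [
    (re.compile(r"^-?\d+(\.\d+)?$"), "number"),
    (re.compile(r"^[a-z]$"), "unknown"),
    (re.compile(r"^[+\-*/=()]$"), "operator"),
]

# Tokenizer for simple equation strings such as "3 * (x + (x + 2)) = 79".
TOKEN = re.compile(r"\d+(?:\.\d+)?|[a-z]|[+\-*/=()]")


def classify(token):
    """Return the label of the first rule that matches the token."""
    for pattern, label in RULES:
        if pattern.match(token):
            return label
    return "other"


def tag_equation(equation):
    """Split an equation into tokens and label each one."""
    return [(tok, classify(tok)) for tok in TOKEN.findall(equation)]


tags = tag_equation("3 * (x + (x + 2)) = 79")
```

A derivation step would then map labeled sub-expressions to candidate functions of the intermediate representation, which is where several candidates can arise for the same equation.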

Ambiguity in Derivation
For one equation system, several latent form derivations are possible. Take the following math problem as an example: Find 3 consecutive integers that 3 times the sum of the first and the third is 79.
Given the annotation of its equation 3 * (x + (x + 2)) = 79, there are two possible latent intermediate forms. There exist two types of ambiguity: (a) operator ambiguity: (x + 2) may correspond to the operator "ordinal(3)" or "max()"; (b) alignment ambiguity: for each "3" in the intermediate form, it is unclear which "3" in the input is to be copied. Therefore, we may derive multiple intermediate forms, including spurious ones, for a training problem.
We can see from Table 5 that both datasets we used have the issue of ambiguity, containing about 20% of problems with operator ambiguity and 10% of problems with alignment ambiguity.
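Alignment ambiguity can be illustrated with a short enumeration: every number in a candidate form may be copied from any input position that holds the same token. The sketch below is an assumption-laden illustration; whitespace tokenization, the helper name `candidate_alignments` and the list of numbers passed in are hypothetical and not taken from the derivation rules.

```python
from itertools import product


def candidate_alignments(problem_tokens, form_numbers):
    """List every way to align each number in a form with an input position."""
    positions = [
        [i for i, tok in enumerate(problem_tokens) if tok == num]
        for num in form_numbers
    ]
    # One alignment picks one input position per number in the form.
    return list(product(*positions))


problem = ("Find 3 consecutive integers that 3 times the sum of "
           "the first and the third is 79")
alignments = candidate_alignments(problem.split(), ["3", "79"])
```

For the example problem, the token "3" occurs at two input positions, so a form that copies a single "3" already has two possible alignments, only one of which matches the intended reading.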

Iterative Labeling
To address the issue of ambiguity, we perform an iterative procedure where we search for correct intermediate forms to refine the training data. The intuition is that a better model will lead to more correct latent form outputs, and more correct latent forms in training data will lead to a better model.

Table 5: Statistics of latent forms on two datasets. The percentage of problems with operator and alignment ambiguity is shown in the 2nd and 3rd columns respectively. We also show the average number of intermediate forms of problems with derivation ambiguity in the rightmost column.

Algorithm 1 Iterative Labeling
Require: (1) tuples of (math problem description, equation system, answer); (2) P_LF, the set of (problem, latent form) pairs derived from the equations; (3) beam size B; (4) training iterations N_iter, pre-training iterations N_pre
Procedure:
for iter = 1 to N_iter do
    if iter < N_pre then
        θ ← MLE with P_LF
    else
        for (p, LF) in P_LF do
            C = Decode B latent forms given p
            for j in 1...B do
                if Ans(C_j) is correct then
                    LF ⇐ C_j
                    break
        θ ← MLE with relabeled P_LF

Algorithm 1 describes our training procedure. As pre-training, we first update our model by maximum likelihood estimation (MLE) with all possible latent forms for $N_{pre}$ iterations. Ambiguous and wrong latent forms may appear at this stage. This pre-training ensures faster convergence and a more stable model. After $N_{pre}$ iterations, iterative labeling starts. We decode each training instance with beam search. We declare $C_j$ to be the consistent form in the beam if it can be executed to yield the correct answer. We can then relabel the latent form LF with $C_j$ for problem p and use the new pairs for training. If there is no consistent form in the beam, we keep the latent form unchanged. With iterative labeling, we update our model by MLE with the relabeled latent forms. There are two special cases of $N_{pre}$ to consider:
(1) $N_{pre} = 0$: the training starts iterative labeling without pre-training.
(2) $N_{pre} = N_{iter}$: the training is pure MLE without iterative labeling.
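The control flow of Algorithm 1 can be sketched in Python as follows. This is not the authors' implementation: `train_mle`, `decode_beam` and `execute` are hypothetical callables supplied by the caller, standing in for one MLE update, beam-search decoding and execution of a latent form. The toy stand-ins at the bottom exist only to make the sketch runnable.

```python
def iterative_labeling(model, pairs, answers, train_mle, decode_beam, execute,
                       n_iter, n_pre, beam_size):
    """Pre-train with MLE, then alternate relabeling and MLE updates.

    pairs   -- list of (problem, latent_form) training pairs
    answers -- dict mapping each problem to its gold answer
    """
    for it in range(1, n_iter + 1):
        if it >= n_pre:
            relabeled = []
            for problem, form in pairs:
                for candidate in decode_beam(model, problem, beam_size):
                    if execute(candidate) == answers[problem]:
                        form = candidate  # first consistent form in the beam
                        break
                # With no consistent candidate, the old form is kept.
                relabeled.append((problem, form))
            pairs = relabeled
        model = train_mle(model, pairs)
    return model, pairs


# Toy stand-ins: the "model" is an update counter, forms are strings.
toy_values = {"2+2": 4, "2*3": 6, "1+1": 2}
model, pairs = iterative_labeling(
    model=0,
    pairs=[("p1", "2+2"), ("p1", "2*3")],
    answers={"p1": 6},
    train_mle=lambda m, data: m + 1,
    decode_beam=lambda m, p, b: ["1+1", "2*3"][:b],
    execute=toy_values.get,
    n_iter=3, n_pre=2, beam_size=20,
)
```

In the toy run the spurious form "2+2" is replaced by the consistent form "2*3" once relabeling starts, which mirrors how an ambiguous derivation is resolved by the answer signal.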

Experiments
In this section, we compare our method against several strong baseline systems.

Experiment Setting
Following previous work, experiments are done in 5-fold cross validation: in each run, 20% is used for testing, 70% for training and 10% for validation.
Representation: To make the task easier with fewer auxiliary nuisances (e.g. bracket pairs), we represent the intermediate forms in Polish notation.
Implementation details: The dimension of the encoder hidden state, decoder hidden state and embeddings is 100 on NumWord and 512 on Dolphin18K. All model parameters are initialized randomly from a Gaussian distribution. The hyper-parameter λ for the weight of attention regularization is set to 1.0 on NumWord and 0.4 on Dolphin18K. We use the SGD optimizer with a decaying learning rate initialized to 0.5. The dropout rate is set to 0.5. The stopping criterion for training is validation accuracy, with a maximum of 150 iterations. The vocabulary consists of words observed at least N times in the training set. We set N = 1 for NumWord and N = 5 for Dolphin18K. The beam size is set to 20 in the decoding stage. For iterative training, we first pre-train a model for $N_{pre}$ = 50 iterations. We tune the hyper-parameters on a separate dev set.
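The benefit of Polish (prefix) notation mentioned under Representation is that, when every function has a fixed number of arguments, a form can be written and executed without bracket pairs. The sketch below illustrates this with ordinary arithmetic operators only; the operator table and the function names are assumptions, and the functions of the intermediate representation are not reproduced here.

```python
import operator

# Illustrative binary operators; each has a fixed arity of 2.
OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}
ARITY = {name: 2 for name in OPS}


def to_prefix(tree):
    """Flatten a nested (op, arg, arg) tuple into a bracket-free token list."""
    if not isinstance(tree, tuple):
        return [str(tree)]
    op, *args = tree
    tokens = [op]
    for arg in args:
        tokens.extend(to_prefix(arg))
    return tokens


def eval_prefix(tokens):
    """Evaluate a prefix token list using the fixed arities."""
    def parse(pos):
        tok = tokens[pos]
        if tok in ARITY:
            args, pos = [], pos + 1
            for _ in range(ARITY[tok]):
                value, pos = parse(pos)
                args.append(value)
            return OPS[tok](*args), pos
        return float(tok), pos + 1

    value, end = parse(0)
    if end != len(tokens):
        raise ValueError("trailing tokens after a complete expression")
    return value


tokens = to_prefix(("*", 3, ("+", 4, 5)))
result = eval_prefix(tokens)
```

A decoder that emits tokens in this order never has to balance brackets, which removes one source of malformed outputs.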
We consider the following models for comparison:
• Wang et al. (2017): a seq2seq model with attention mechanism. As preprocessing, it replaces numbers in the math problem with tokens {n_1, n_2, ...}. It generates an equation as output and maps {n_1, n_2, ...} back to the corresponding numbers in post-processing.
• Seq2Seq Equ: we implement a seq2seq model with attention and copy mechanism. Different from Wang et al. (2017), it has the ability to copy numbers from the problem description.
• Shi et al. (2015): a rule-based system. It parses math problems into Dolphin language trees with predefined grammars and reasons across trees to get the equations with rules. We report numbers from their paper as the Dolphin language is not publicly available.
• Huang et al. (2017): the current state-of-the-art model on Dolphin18K. It is a feature-based model. It generates candidate equations and finds the most probable equation by ranking with predefined features.
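The number-token preprocessing and post-processing described for the Wang et al. (2017) baseline can be sketched as below. This is a simplified illustration, not that system's code: the regular expression, the token spelling `n1`, `n2`, ... and the function names are assumptions.

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def mask_numbers(problem):
    """Replace each number with a token n1, n2, ... and record the mapping."""
    mapping = {}

    def repl(match):
        token = f"n{len(mapping) + 1}"
        mapping[token] = match.group(0)
        return token

    return NUMBER.sub(repl, problem), mapping


def unmask(equation, mapping):
    """Map number tokens in a generated equation back to the original numbers."""
    return re.sub(r"n\d+", lambda m: mapping[m.group(0)], equation)


masked, mapping = mask_numbers("The sum of 2 numbers is 18.")
restored = unmask("x + y = n2", mapping)
```

A copy mechanism avoids this indirection by letting the decoder point at the number in the input directly.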

Results
Overall results are shown in Table 6. This result is expected as the Dolphin18K dataset is more challenging, containing many other types of difficulties discussed in Section 6.3.
Effect of Attention Regularization: Attention regularization improves the seq2seq model on the two datasets as expected. Figure 1 shows an example. The attention regularization does meet the expectation: the alignments in Figure 1(b) are less concentrated on the numbers in the input and, more importantly, the alignments are more reasonable. For example, when generating "math#product" in the output, the attention is now correctly focused on the input token "times".
Effect of Iterative Labeling: We can see from Table 6 that iterative labeling clearly contributes to the accuracy increase on the two datasets. Now we compare the performance with and without pre-training in Table 7. When $N_{pre}$ = 0 in Algorithm 1, the model starts iterative labeling from the first iteration without pre-training. We find that training with pre-training is substantially better, as the model without pre-training can be unstable and may generate misleading spurious candidate forms.
Next, we compare the performance with pure MLE training on NumWord (Linear) in Figure 2. The difference is that after 50 iterations of MLE training, iterative labeling refines the latent forms of the training data. In pure MLE training, the accuracy converges after 130 iterations. With iterative labeling, the model achieves an accuracy of 61.6% at the 110th iteration, converging faster and reaching better performance.
Furthermore, to check whether iterative labeling actually resolves ambiguities in the intermediate forms of the training data, we manually sample 100 math problems with derivation ambiguity. 78% of them are relabeled with correct latent forms as we have checked. From Table 8, we can see the latent form of one training problem is iteratively refined to the correct one.

Model Comparisons
To explore the generalization ability of the neural approach and better guide our future work, we compare the problems solved by our neural-based model with the rule-based model (Shi et al., 2015) and the feature-based model (Huang et al., 2017).
Neural-based v. Rule-based: On NumWord (ALL), 41.6% of problems can be solved by both models. 15.5% can only be solved by our neural model, while the rule-based model generates an empty or a wrong semantic tree due to the limitations of the predefined grammar. The neural model is more consistent with flexible word order and insertion of lexical items (e.g. rule-based model cannot handle the extra word 'whole' in "Find two consecutive whole numbers").
Neural-based v. Feature-based: On Dolphin18K, 9.2% of problems can be solved by both models. 7.6% can only be solved by our neural model, which indicates that the neural model can capture novel features that the feature-based model is missing.
While our neural model is complementary to the above-mentioned models, we observe two main types of errors (more examples are shown in the supplementary material):
1. Natural language variations: The same type of problem can be described in different scenarios. The two problems (1) "What is 10 minus 2?" and (2) "John has 10 apples. How many apples does John have after giving Mary 2 apples" lead to the same equation x = 10 − 2 but have very different descriptions. With a limited amount of data, the training set cannot be expected to cover all possible ways to ask the same underlying math problem. Although the feature-based model has considered this with some features (e.g. POS tags), the challenge is not well addressed.
2. Nested operations: Some problems require multiple nested operations (e.g. "I think of a number, double it, add 3, multiply the answer by 3 and then add on the original number"). The rule-based model performs more consistently on this.

Table 6: Performances on two datasets. "LF" means that the model generates latent intermediate forms instead of equation systems. "AttReg" means attention regularization. "Iter" means iterative labeling. "n/a" means that the model does not run on the dataset.
Shi et al. (2015)      63.6%    60.2%    n/a
Huang et al. (2017)    20.8%    n/a      28.4%

Related Work
Our work is related to two research areas: math word problem solving and semantic parsing.

Math Word Problem Solving
There are two major components in this task: (1) meaning representation; (2) learning framework.
Semantic Representation: With the annotation of equation systems, most approaches attempt to learn a direct mapping from a math problem description to an equation system. Other approaches consider an intermediate representation that bridges the semantic gap between natural language and equation systems. Bakman (2007) defines a table of schemas (e.g. Transfer-In-Place, Transfer-In-Ownership) with associated formulas in natural utterance. A math problem can be mapped into a list of schema instantiations, then converted to equations. Liguda and Pfeiffer (2012) use an augmented semantic network to represent math problems, where nodes represent concepts of quantities and edges represent transition states. Shi et al. (2015) design a new meaning representation language called Dolphin Language (DOL) with over 200 math-related functions and additional noun functions. Relying on predefined rules, these approaches accept only a limited range of well-formatted input sentences. Inspired by these representations, our work describes a new formal language which is more compact and is effective in facilitating better machine learning performance.
Learning Framework: Rule-based approaches (Bakman, 2007; Liguda and Pfeiffer, 2012; Shi et al., 2015) map math problem descriptions into structures with predefined grammars and rules.
Feature-based approaches contain two stages: (1) generate equation candidates: they either replace numbers of existing equations in the training data to form new equations (Kushman et al., 2014; Zhou et al., 2015; Upadhyay et al., 2016), or enumerate possible combinations of math operators, numbers and variables (Koncel-Kedziorski et al., 2015), which leads to an intractably huge search space; (2)

Semantic Parsing
Our work is also related to the classic setting of learning executable semantic parsers from indirect supervision (Clarke et al., 2010; Liang et al., 2011; Zettlemoyer, 2011, 2013; Berant et al., 2013; Pasupat and Liang, 2016). Maximum marginal likelihood with beam search (Kwiatkowski et al., 2013; Pasupat and Liang, 2016; Ling et al., 2017) is traditionally used. It maximizes the marginal likelihood of all consistent logical forms being observed. Recently, reinforcement learning (Guu et al., 2017; Liang et al., 2017) has also been considered, which maximizes the expected reward over all possible logical forms. Unlike these approaches, we consider only a single consistent latent form per training instance by leveraging training signals from both the answer and the equation system, which should be more efficient for our task.
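The three training objectives contrasted above can be written side by side. The notation is assumed for illustration and does not come from the cited papers: $x$ is the input, $y$ the gold answer, $z$ a logical (latent) form, $\mathrm{exec}(z)$ its execution result, $R$ a reward function, and $z^{*}$ the single consistent form kept for a training instance.

```latex
\begin{align}
J_{\mathrm{MML}}(\theta) &= \sum_{(x,y)} \log \sum_{z \,:\, \mathrm{exec}(z) = y} p_\theta(z \mid x) \\
J_{\mathrm{RL}}(\theta) &= \sum_{(x,y)} \mathbb{E}_{z \sim p_\theta(\cdot \mid x)} \big[ R(z, y) \big] \\
J_{\mathrm{single}}(\theta) &= \sum_{(x,y)} \log p_\theta(z^{*} \mid x)
\end{align}
```

Under this reading, the last objective is ordinary maximum likelihood on relabeled data, so each update needs only one target sequence per instance instead of a sum or an expectation over many candidate forms.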

Conclusion
This paper presents an intermediate meaning representation scheme for math problem solving that bridges the semantic gap between natural language and equation systems. To generate intermediate forms, we propose a seq2seq model with novel attention regularization. Without explicit annotations of latent forms, we design an iterative labeling framework for training. Experimental results show that using intermediate forms is more effective than directly using equations. Furthermore, our iterative labeling effectively resolves ambiguities and leads to better performance.
As shown in the error analysis, the same types of problems can have different natural language expressions. In the future, we will focus on tackling this challenge. In addition, we plan to expand the coverage of our meaning representation to support more mathematical concepts.