Text2Math: End-to-end Parsing Text into Math Expressions

We propose Text2Math, a model for semantically parsing text into math expressions. The model can be used to solve different math-related problems, including arithmetic word problems and equation parsing problems. Unlike previous approaches, we tackle the problem from an end-to-end structured prediction perspective: our algorithm predicts the complete math expression at once as a tree structure, with minimal manual effort involved in the process. Empirical results on benchmark datasets demonstrate the efficacy of our approach.


Introduction
Designing computer algorithms that can automatically solve math word problems is a challenge for the AI research community (Bobrow, 1964). Two representative tasks have been proposed and studied recently: solving arithmetic word problems (Roy and Roth, 2018; Zou and Lu, 2019b) and equation parsing (Roy et al., 2016), as illustrated in Figure 1. The former task focuses on mapping the input paragraph (which may involve multiple sentences) into a target math expression, from which an answer can be calculated. The latter task focuses on mapping a description (usually a single sentence) into a math equation that typically involves one or more unknowns. As we can observe from Figure 1, in both cases the output can be represented as a tree structure.
Earlier approaches to solving arithmetic word problems focused on rule-based methods built on hand-crafted rules (Mukherjee and Garain, 2008; Hosseini et al., 2014). Recently, learning-based approaches based on statistical classifiers (Roy and Roth, 2015; Roy et al., 2016; Liang et al., 2018) or neural networks (Wang et al., 2018b) have been used for making decisions in the expression construction process. However, these models do not predict the target tree as a complete structure at once; instead, locally trained classifiers are used and their local decisions are then combined. Such local classifiers often predict the underlying operator between two operands (e.g., numbers) appearing in the text in a particular order. As a result, special treatment of non-commutative operators such as subtraction (−) and division (÷) is often involved, typically requiring the introduction of inverse operators. Shi et al. (2015) tackled the problem from a structured prediction perspective, using a semantic parsing algorithm based on context-free grammars (CFG). However, their approach relies on semi-automatically generated rules and involves a manual step for interpreting the semantic representation they used. While all these approaches focused on solving arithmetic word problems only, separate models have been developed for the task of equation parsing (Roy et al., 2016). It is not clear how easily a model specifically designed for one task can be adapted to the other.
Figure 1: Problem 1 (arithmetic word problem): "Mike picked 7 apples. Nancy picked 3 apples and Keith picked 6 apples at the farm. In total, how many apples were picked?" Expression: (7 + (3 + 6)). Answer: 16. Problem 2 (equation parsing): "3 times one of the numbers is 11 less than 5 times the other."
Motivated by the observation that both problems involve mapping a text sequence to a tree-structured representation, we propose Text2Math, which regards both tasks as a class of structured prediction problems and tackles them from a semantic parsing perspective. We make use of an end-to-end latent-variable approach to automatically produce the target math expression at once as a complete structure, where no prior knowledge of the operators (such as whether an operator is non-commutative) is required. Our model outperforms all baselines on two benchmark datasets. To the best of our knowledge, this is the first approach based on semantic parsing that tackles both arithmetic word problems and equation parsing with a single model. Our code is available at http://statnlp.org/research/ta.

Expression Tree
We first define tree representations for math expressions, which will then be regarded as the semantic representations used in the standard semantic parsing setup.
The nodes involved in the math expression trees can be classified into two categories: operator nodes and quantity nodes. Operator nodes define the types of operations involved in expressions. In this work we consider ADD (addition, +), SUB (subtraction, −), MUL (multiplication, ×) and DIV (division, ÷). We also regard the equation sign (=) as an operation involved in math expressions and use EQU to denote it. We consider two types of quantity nodes: CON denoting constants, and VAR for unknown variables. Table 1 lists the above nodes. Each tree node comes with an arity, which specifies the number of direct child nodes that should appear below the given node. For example, the operator node SUB with arity 2 expects two child nodes below it in the expression tree, while CON with arity 0 is a leaf node. The two math expressions in Figure 1 can be equivalently represented by expression trees consisting of such nodes, as illustrated in Figure 2.
Table 1: Expression tree nodes.
Category   Node   Interpretation       Arity
Operator   ADD    addition (+)         2
Operator   SUB    subtraction (−)      2
Operator   MUL    multiplication (×)   2
Operator   DIV    division (÷)         2
Operator   EQU    equation sign (=)    2
Quantity   CON    constant             0
Quantity   VAR    unknown variable     0
Figure 2: Expression trees for the two math expressions in Figure 1. For Problem 1, the math expression is (7 + (3 + 6)) and the order in text is (7, 3, 6). "Order in text" refers to the order in which the textual expressions of operands appear in the problem text. We use subscripts to indicate the actual semantic interpretations.
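The node inventory and arity constraint described above can be made concrete with a small data structure. The following is a minimal sketch, not the authors' implementation: the class name `Node`, its `value` field and the helper functions are illustrative assumptions, and only the node labels and arities come from the text.

```python
from dataclasses import dataclass
from typing import Optional

# Arity per node type, as stated in the text: operators take 2 children, quantities 0.
ARITY = {"ADD": 2, "SUB": 2, "MUL": 2, "DIV": 2, "EQU": 2, "CON": 0, "VAR": 0}


@dataclass
class Node:
    label: str                       # one of the keys of ARITY
    value: Optional[float] = None    # number carried by a CON leaf (illustrative field)
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def is_well_formed(node: Node) -> bool:
    """True if every node has exactly as many children as its arity."""
    children = [c for c in (node.left, node.right) if c is not None]
    return len(children) == ARITY[node.label] and all(is_well_formed(c) for c in children)


def evaluate(node: Node) -> float:
    """Evaluate a tree made of CON leaves and the four arithmetic operators."""
    if node.label == "CON":
        return node.value
    ops = {
        "ADD": lambda a, b: a + b,
        "SUB": lambda a, b: a - b,
        "MUL": lambda a, b: a * b,
        "DIV": lambda a, b: a / b,
    }
    if node.label not in ops:
        raise ValueError(f"cannot evaluate node of type {node.label}")
    return ops[node.label](evaluate(node.left), evaluate(node.right))


# Problem 1 from Figure 1: (7 + (3 + 6))
tree = Node("ADD", left=Node("CON", 7),
            right=Node("ADD", left=Node("CON", 3), right=Node("CON", 6)))
print(is_well_formed(tree), evaluate(tree))  # True 16
```

Trees containing VAR or EQU nodes describe equations rather than values, so this sketch deliberately refuses to evaluate them.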

Latent Text-Math Tree
With the specifically designed expression trees for representing the math expressions, we are now able to design a model for parsing the text into the expression tree. This is essentially a semantic parsing task. One of the key assumptions made by the various semantic parsing algorithms is the intermediate joint representation used for connecting the words and semantics (Wong and Mooney, 2006; Zettlemoyer and Collins, 2007; Lu et al., 2008; Artzi and Zettlemoyer, 2013b). In this work, we adopt an approach inspired by Lu et al. (2008) and Lu (2014), which learns a latent joint representation for words and semantics in the form of hybrid trees where word-semantics correspondence information is captured. Specifically, we introduce a text-math tree representation that jointly encodes both the text and the math expression tree.
Figure 3: Example text-math trees for the arithmetic word problem example in Figure 1 ("Mike picked 7 apples. Nancy picked 3 apples and Keith picked 6 apples at the farm. In total, how many apples were picked?"). The left tree captures the semantic correspondence well, while the right tree fails to capture the correct correspondence.
Such joint representations can be understood as a modified expression tree where each semantic node is now augmented with additional word information from the corresponding text.
From the joint representations we can recover the semantic-level correspondence information between words and math expressions. Two examples are shown in Figure 3. Each node in such a joint representation is essentially a node in the original expression tree augmented with words from the problem text. For example, in the left tree in Figure 3, the root node is an operator (the root node of the original expression tree) paired with the discontiguous sequence of words "{Mike picked} ... {.} ... {In total, how many apples were picked?}" that appear in the problem text. In this way, a text-math tree is able to capture the semantic correspondence between words and the basic units involved in the math expressions (i.e., operators and quantities). However, the text-math trees are not explicitly given during the training phase. For example, the right side of Figure 3 gives an alternative text-math tree that can also serve as a joint representation of both the text and the expression tree. Comparing the two trees, the one on the left appears to be better at capturing the true semantic-level correspondence between words and math expressions. Since no gold text-math tree is explicitly given, we model it with a latent-variable approach.
Formally, given a text x paired with the expression y (or equivalently, the expression tree), we assume there exists a latent joint text-math representation in the form of a text-math tree, denoted as t, that comprises exactly x and y. Each node of t is a word-semantics association ⟨x, y, p⟩, where, at the node level, x is a (possibly discontiguous) word sequence drawn from the input text, y is an individual node of the expression tree, and p is the word association pattern that specifies how words interact with the expression tree (further details are provided in Sec. 2.3). Intuitively, such a joint text-math representation should contain exactly the information associated with the text and its corresponding math expression, and nothing else. We defer the discussion of how to construct such joint representations to Sec. 2.3.
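The training corpus provides both the problem text x and its math expression, which we represent with an expression tree y. The joint representation t is not available in the training data, so we model it as a latent variable. Conditional random fields (CRFs) (Lafferty et al., 2001) have been successfully applied to many tasks in the NLP community (Lample et al., 2016; Lu, 2018, 2019a). In this work, we also apply a CRF to model the conditional probability of the latent variable t and the output expression y, conditioned on the input x. The objective is defined as follows:
$$P_{\Lambda,\Theta}(\mathbf{y} \mid \mathbf{x}) = \sum_{\mathbf{t} \in T(\mathbf{x},\mathbf{y})} P_{\Lambda,\Theta}(\mathbf{y},\mathbf{t} \mid \mathbf{x}) = \frac{\sum_{\mathbf{t} \in T(\mathbf{x},\mathbf{y})} \exp\big(\Lambda \cdot \Phi(\mathbf{x},\mathbf{y},\mathbf{t}) + G_\Theta(\mathbf{x},\mathbf{y},\mathbf{t})\big)}{\sum_{\mathbf{y}'} \sum_{\mathbf{t}' \in T(\mathbf{x},\mathbf{y}')} \exp\big(\Lambda \cdot \Phi(\mathbf{x},\mathbf{y}',\mathbf{t}') + G_\Theta(\mathbf{x},\mathbf{y}',\mathbf{t}')\big)}$$
where Φ(x, y, t) returns a list of discrete features defined over the tuple (x, y, t), Λ is the feature weight vector, G_Θ is a neural scoring function parameterized by Θ, and T(x, y) is the set of possible joint representations (i.e., text-math trees) for the pair (x, y).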
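To make the role of the latent variable concrete, the following minimal sketch computes the conditional probability of an expression by summing over its text-math trees. It assumes, purely for illustration, that the candidate trees can be enumerated and that each already has a total score (discrete plus neural); the candidate expressions and scores are made up, and the actual model computes these sums with dynamic programming rather than enumeration.

```python
import math


def log_sum_exp(scores):
    """Numerically stable log(sum(exp(s) for s in scores))."""
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))


def conditional_prob(tree_scores, y):
    """P(y | x) for a toy model.

    tree_scores maps each candidate expression y' to the list of scores of
    its text-math trees, i.e. one score per t in T(x, y').
    """
    log_num = log_sum_exp(tree_scores[y])
    log_den = log_sum_exp([s for scores in tree_scores.values() for s in scores])
    return math.exp(log_num - log_den)


# Hypothetical candidates for Problem 1 in Figure 1, with made-up scores.
toy_scores = {
    "(7 + (3 + 6))": [2.0, 0.5],   # two latent trees yield this expression
    "(7 - (3 + 6))": [1.0],
    "((7 + 3) * 6)": [0.0, -1.0],
}
for expr in toy_scores:
    print(expr, round(conditional_prob(toy_scores, expr), 3))
```

An expression can gain probability either from one high-scoring tree or from several moderately scoring trees, which is the practical consequence of marginalizing over t.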
Figure 4: Left: example text-math tree for the equation parsing example in Figure 1 ("3 times one of the numbers is 11 less than 5 times the other."), where the word association pattern BwA is used for modeling reordering. Right: another example for a slightly different instance ("3 times one of the numbers is 11 minus 5 times the other."), where reordering is not required.

Inference
One challenge associated with the inference procedure in both training and decoding is how to handle the large space of latent structures defined by T(x, y) and T(x). Without any constraints, searching or calculation that involves all possible structures within this space may be intractable. We therefore introduce some assumptions on the set of allowable structures, such that tractable inference can be applied. We first introduce three symbols: A, B and w. The symbol A is a placeholder for the left sub-tree (rooted by the left child node), and similarly B is a placeholder for the right sub-tree. The symbol w refers to a contiguous sequence of (one or more) words. We then use these three symbols to define the set of word association patterns, which specify how the words interact with the sub-trees of the current node. Specifically, for expression tree nodes with arity 0 (i.e., quantity nodes), only one pattern, w, is allowed, indicating that a contiguous word sequence from the problem text is associated with such a node. For expression tree nodes with arity 2 (i.e., operator nodes), we define 16 allowable patterns, denoted [w]A[w]B[w] and [w]B[w]A[w], where [w] denotes an optional w. Based on such word association patterns, we can define the set of text-math trees for a particular text-expression pair that we regard as valid.
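The count of 16 operator patterns can be checked by enumeration. The sketch below assumes that the patterns are obtained by optionally placing w before, between and after the two placeholders, in either order (A before B, or B before A); this is consistent with the patterns ABw, AwB and BwA that appear in the examples, but the function name and the string encoding are illustrative choices.

```python
from itertools import product


def operator_patterns():
    """Enumerate word association patterns for arity-2 nodes.

    Each pattern is a string over {A, B, w}: A and B each occur once, and a
    word chunk w may optionally appear in each of the three slots around them.
    """
    patterns = []
    for first, second in (("A", "B"), ("B", "A")):
        for w0, w1, w2 in product(("", "w"), repeat=3):
            patterns.append(w0 + first + w1 + second + w2)
    return patterns


PATTERNS = operator_patterns()
print(len(PATTERNS))        # 16
print(sorted(PATTERNS))
```

The first eight patterns keep the operands in text order, and the remaining eight swap them, which is what later allows reordering without inverse operators.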
Before we formally define a valid text-math tree, let us look at the example in Figure 3, which shows two valid trees. First, we can verify that, if we exclude the words from both trees, we arrive at the math expression that corresponds to the text. Second, we can also recover the text from such a joint representation. Consider the right tree in Figure 3, and in particular the right sub-tree of the complete tree, rooted by the node ⟨x, y, p⟩ = ⟨{at the farm. In total, how many}, ADD, ABw⟩. If we replace the placeholders A and B with the word sequences associated with its left and right sub-trees, respectively, we arrive at the word sequence "3 apples and Keith picked 6 apples at the farm. In total, how many". Recursively performing this rewriting procedure in a bottom-up manner, we end up with a word sequence that is exactly the original input text, as illustrated at the top of Figure 3.
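The recursive rewriting procedure can be sketched in a few lines. In the example below, only the node ⟨{at the farm. In total, how many}, ADD, ABw⟩ is taken from the text; the rest of the segmentation, the class name `TMNode` and its fields are hypothetical and serve only to show that rewriting recovers the input text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TMNode:
    label: str                 # expression tree node, e.g. "ADD" or "CON"
    pattern: str               # word association pattern, e.g. "ABw" or "w"
    chunks: List[str]          # one contiguous word chunk per 'w' in the pattern
    left: Optional["TMNode"] = None
    right: Optional["TMNode"] = None


def rewrite(node: TMNode) -> str:
    """Replace A and B by the text of the sub-trees and each w by its chunk."""
    chunks = iter(node.chunks)
    parts = []
    for symbol in node.pattern:
        if symbol == "A":
            parts.append(rewrite(node.left))
        elif symbol == "B":
            parts.append(rewrite(node.right))
        else:  # symbol == "w"
            parts.append(next(chunks))
    return " ".join(parts)


TEXT = ("Mike picked 7 apples. Nancy picked 3 apples and Keith picked "
        "6 apples at the farm. In total, how many apples were picked?")

# Sub-tree quoted in the text; the split of its children is an assumption.
inner = TMNode("ADD", "ABw", ["at the farm. In total, how many"],
               left=TMNode("CON", "w", ["3 apples and Keith picked"]),
               right=TMNode("CON", "w", ["6 apples"]))
# Hypothetical root that completes the tree.
root = TMNode("ADD", "ABw", ["apples were picked?"],
              left=TMNode("CON", "w", ["Mike picked 7 apples. Nancy picked"]),
              right=inner)

print(rewrite(inner))
print(rewrite(root) == TEXT)
```

Dropping the word chunks from `root` leaves ADD(CON, ADD(CON, CON)), so the same structure yields both the text and the expression tree, which is the property required of a valid text-math tree.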
Based on the above discussion, we can define T(x, y) as the set of valid trees:
Definition 2.1. For a given text x and an expression tree y, a valid text-math tree satisfies the following two properties: 1) the semantics portion of the tree gives exactly y, and 2) the text obtained through the recursive rewriting procedure discussed above gives exactly x.
Given the definition of valid text-math trees, we can use a bottom-up procedure to construct the set T(x, y). Similarly, we can construct the set T(x) by considering a forest-structured semantic representation that encodes all possible expression trees, following Lu (2015). One nice property of considering only such joint representations is that there are known algorithms for performing efficient inference. Indeed, the resulting text-math trees are similar to the hybrid tree representations used in Lu et al. (2008) and Lu (2014) (see footnote 4), for which dynamic programming based inference algorithms have been developed. Such algorithms allow O(n^3 m) time complexity for inference, where n is the text length and m is the number of grammar rules (see footnote 5) associated with the latent text-math trees.
We note that some prior systems (Roy and Roth, 2017, 2018) require extra inverse operators, inverse subtraction "−_r" and inverse division "÷_r", to handle the scenarios where the order of quantities appearing in the text is not consistent with the order in which they appear in the expression. For the example on the left of Figure 4, by introducing two operators −_r and ÷_r that take their operands in reverse order, the equation is represented as "(3 × X_1) = 11 −_r (5 × X_2)". However, we do not need such inverse operators. A pattern from the first group (A before B) handles the order that is consistent with the problem text, while a pattern from the second group (B before A) captures the reordering of operands below an operator. As Figure 4 shows, reordering is required for the first example but not for the second, though their texts differ only slightly. Instead of the pattern AwB used in the second example, the first joint representation adopts the pattern BwA for the SUB expression node. Thus, our model is able to work without the underlying knowledge of whether an operator is commutative or not.
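The effect of the two pattern groups on operand order can be illustrated with a small helper. This is a sketch under the assumption that a pattern is encoded as a string over A, B and w; the function name `resolve_operands` and the string form of the operands are illustrative.

```python
def resolve_operands(pattern, spans_in_text_order):
    """Map two operand spans, given in text order, to (left, right) children.

    The order in which A and B occur in the pattern says which child each
    span belongs to: A is the left sub-tree, B is the right sub-tree.
    """
    order = [symbol for symbol in pattern if symbol in "AB"]  # e.g. ['B', 'A']
    assignment = dict(zip(order, spans_in_text_order))
    return assignment["A"], assignment["B"]


# "... 11 less than 5 times the other": the operands appear as 11, then 5 * X2.
left, right = resolve_operands("BwA", ("11", "(5 * X2)"))
print(f"SUB -> {left} - {right}")    # SUB -> (5 * X2) - 11

# "... 11 minus 5 times the other": same operand order in the text, no reordering.
left, right = resolve_operands("AwB", ("11", "(5 * X2)"))
print(f"SUB -> {left} - {right}")    # SUB -> 11 - (5 * X2)
```

Because the reordering is carried by the pattern rather than by the operator, the same SUB node serves both sentences and no "−_r" node is needed.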

Features
Discrete Features. The feature function Φ(x, y, t) is defined over each node ⟨x, y, p⟩ in the joint tree as well as the complete expression tree y. For each node ⟨x, y, p⟩, we extract word n-grams, the word association pattern, and POS tags for words (Manning et al., 2014). Whether a number is relevant to the question (if available in the annotated data) is also taken as a binary feature. To assess the quality of the structure associated with the expression tree (i.e., features defined over y), we extract parent-child relational information (y_a, y_b) from y, where y_a is the parent of y_b, as features. Following previous works (Roy et al., 2016; Liang et al., 2018), we also incorporate a lexicon in our model so as to make a fair comparison with such works, although we would like to stress that our model does not strictly require such lexicons for learning. More details are in the supplementary material.
Footnote 4: They need to handle semantic nodes with arity 1 (which requires special constraints for properly defining T(x) (Lu, 2015)), and their semantic nodes are also assumed to convey semantic type information for guiding the expression tree construction process, while we do not need to consider them.
Footnote 5: The grammars are related to the word association patterns. The possible latent text-math trees are constructed based on such grammar rules.
Neural Features. We design neural features over the pair of the L-sized window surrounding the target word x_i in x and an expression tree node y_j. The network takes as input the contiguous word sequence (x_{i−L}, ..., x_i, ..., x_{i+L}), whose distributed representation is a simple concatenation of the embeddings of each word. The hidden layer applies an affine transformation followed by an element-wise nonlinear activation function, such as tanh or ReLU. The final output layer contains as many nodes as there are expression tree nodes in the training set.
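The output is a score vector that gives a score for the input word sequence (x_{i−L}, ..., x_i, ..., x_{i+L}) and each expression tree node y_j. The neural scoring function is defined as follows:
$$G_\Theta(\mathbf{x},\mathbf{y},\mathbf{t}) = \sum_{(x,y) \in W(\mathbf{x},\mathbf{y},\mathbf{t})} c(x, y, \mathbf{x},\mathbf{y},\mathbf{t}) \cdot \psi(x, y)$$
where bold symbols denote the complete text, expression tree and text-math tree, $W(\mathbf{x},\mathbf{y},\mathbf{t})$ is the set of (x, y) pairs extracted from $(\mathbf{x},\mathbf{y},\mathbf{t})$, c returns the number of occurrences of the pair (x, y) in $(\mathbf{x},\mathbf{y},\mathbf{t})$, and ψ(x, y) is the score, returned by the neural network, of the target word x with its L-sized window and the expression tree node y. We regard L as a hyperparameter.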
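The window-based scorer described above can be sketched in plain Python. Everything beyond the architecture stated in the text (concatenated embeddings, one hidden layer with tanh, one output per expression tree node) is an assumption made for illustration: the padding token, the random initialization, the dimensions and the class and function names are all hypothetical.

```python
import math
import random


def window(words, i, L, pad="<pad>"):
    """Words at positions i-L .. i+L, padded at the text boundaries (assumed)."""
    return [words[j] if 0 <= j < len(words) else pad for j in range(i - L, i + L + 1)]


class WindowScorer:
    """One-hidden-layer feed-forward network scoring (word window, node) pairs."""

    def __init__(self, vocab, labels, L, emb_dim=4, hidden_dim=8, seed=0):
        rng = random.Random(seed)
        rand = lambda: rng.uniform(-0.1, 0.1)
        self.L, self.labels = L, list(labels)
        self.emb = {w: [rand() for _ in range(emb_dim)]
                    for w in list(vocab) + ["<pad>", "<unk>"]}
        in_dim = emb_dim * (2 * L + 1)
        self.W1 = [[rand() for _ in range(in_dim)] for _ in range(hidden_dim)]
        self.b1 = [0.0] * hidden_dim
        self.W2 = [[rand() for _ in range(hidden_dim)] for _ in self.labels]
        self.b2 = [0.0] * len(self.labels)

    def scores(self, words, i):
        """Return {node label: psi(window around word i, label)}."""
        x = [v for w in window(words, i, self.L)
             for v in self.emb.get(w, self.emb["<unk>"])]      # concatenation
        h = [math.tanh(sum(a * b for a, b in zip(row, x)) + bias)
             for row, bias in zip(self.W1, self.b1)]           # affine + tanh
        out = [sum(a * b for a, b in zip(row, h)) + bias
               for row, bias in zip(self.W2, self.b2)]         # one score per label
        return dict(zip(self.labels, out))


def neural_score(scorer, words, pair_counts):
    """Sum of count * psi over (word index, label) pairs of one text-math tree."""
    return sum(count * scorer.scores(words, i)[label]
               for (i, label), count in pair_counts.items())


words = "Mike picked 7 apples .".split()
scorer = WindowScorer(vocab=words, labels=["ADD", "SUB", "CON"], L=1)
print(window(words, 0, 1))                      # ['<pad>', 'Mike', 'picked']
print(neural_score(scorer, words, {(2, "CON"): 1, (1, "ADD"): 1}))
```

Setting L = 0 reduces the input to the target word alone, which corresponds to the smallest window size considered in the experiments.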

Algorithms
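Given the complete training set, the log-likelihood is calculated as:
$$\mathcal{L}(\Lambda,\Theta) = \sum_i \log \sum_{\mathbf{t} \in T(\mathbf{x}_i,\mathbf{y}_i)} \exp\big(\Lambda \cdot \Phi(\mathbf{x}_i,\mathbf{y}_i,\mathbf{t}) + G_\Theta(\mathbf{x}_i,\mathbf{y}_i,\mathbf{t})\big) - \sum_i \log \sum_{\mathbf{y}'} \sum_{\mathbf{t}' \in T(\mathbf{x}_i,\mathbf{y}')} \exp\big(\Lambda \cdot \Phi(\mathbf{x}_i,\mathbf{y}',\mathbf{t}') + G_\Theta(\mathbf{x}_i,\mathbf{y}',\mathbf{t}')\big)$$
where $(\mathbf{x}_i, \mathbf{y}_i)$ refers to the i-th instance in the training set. An additional L2 regularization term can be introduced to avoid over-fitting; we omit it here for brevity. The goal is to find the optimal model parameters Λ and Θ that maximize the objective. We first consider the computation of gradients for Λ. Assuming Λ = ⟨λ_1, λ_2, ..., λ_N⟩, we can calculate the gradient for each λ_k in Λ as:
$$\frac{\partial \mathcal{L}}{\partial \lambda_k} = \sum_i \mathbb{E}_{P(\mathbf{t} \mid \mathbf{x}_i,\mathbf{y}_i)}\big[\phi_k(\mathbf{x}_i,\mathbf{y}_i,\mathbf{t})\big] - \sum_i \mathbb{E}_{P(\mathbf{y},\mathbf{t} \mid \mathbf{x}_i)}\big[\phi_k(\mathbf{x}_i,\mathbf{y},\mathbf{t})\big] \tag{3}$$
where $\phi_k(\mathbf{x},\mathbf{y},\mathbf{t})$ is the number of occurrences of the k-th feature extracted from $(\mathbf{x},\mathbf{y},\mathbf{t})$.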
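We then compute the gradient for the neural network parameters Θ. For an input word window x and a semantic unit y, the gradient with respect to the network output ψ(x, y) is defined as:
$$\frac{\partial \mathcal{L}}{\partial \psi(x,y)} = \sum_i \mathbb{E}_{P(\mathbf{t} \mid \mathbf{x}_i,\mathbf{y}_i)}\big[c(x,y,\mathbf{x}_i,\mathbf{y}_i,\mathbf{t})\big] - \sum_i \mathbb{E}_{P(\mathbf{y},\mathbf{t} \mid \mathbf{x}_i)}\big[c(x,y,\mathbf{x}_i,\mathbf{y},\mathbf{t})\big] \tag{4}$$
where c returns the number of occurrences of the pair (x, y) in the given structure; this quantity is then back-propagated through the network to obtain the gradient for Θ. The gradients (3) and (4) can be efficiently calculated by applying a generalized forward-backward algorithm, which allows us to conduct exact inference using the dynamic programming algorithm described in Lu (2014). Standard methods such as gradient descent or L-BFGS (Liu and Nocedal, 1989) can then be used to find optimal values for the model parameters.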
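The gradient of a latent-variable CRF log-likelihood with respect to a feature weight is the difference between two expected feature counts: one over trees consistent with the gold expression, and one over all trees. The sketch below checks this numerically on a toy candidate set. It uses enumeration instead of the forward-backward algorithm, covers discrete features only, and all feature vectors and weights are made up.

```python
import math


def log_sum_exp(scores):
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))


def score(weights, feats):
    return sum(w * f for w, f in zip(weights, feats))


def log_likelihood(weights, candidates, gold):
    """candidates: list of (expression, feature counts), one entry per tree."""
    num = log_sum_exp([score(weights, f) for y, f in candidates if y == gold])
    den = log_sum_exp([score(weights, f) for _, f in candidates])
    return num - den


def expected_features(weights, candidates):
    """Expected feature counts under the softmax over the given trees."""
    scores = [score(weights, f) for _, f in candidates]
    log_z = log_sum_exp(scores)
    probs = [math.exp(s - log_z) for s in scores]
    return [sum(p * f[k] for p, (_, f) in zip(probs, candidates))
            for k in range(len(weights))]


def gradient(weights, candidates, gold):
    """E[phi | x, gold y] - E[phi | x], the per-instance gradient."""
    constrained = [c for c in candidates if c[0] == gold]
    e_gold = expected_features(weights, constrained)
    e_all = expected_features(weights, candidates)
    return [a - b for a, b in zip(e_gold, e_all)]


# Made-up candidate trees for one training instance.
CANDIDATES = [
    ("(7 + (3 + 6))", [1, 0, 2]),
    ("(7 + (3 + 6))", [0, 1, 1]),
    ("(7 - (3 + 6))", [1, 1, 0]),
    ("((7 + 3) * 6)", [0, 0, 1]),
]
WEIGHTS = [0.3, -0.2, 0.5]
GOLD = "(7 + (3 + 6))"
print(gradient(WEIGHTS, CANDIDATES, GOLD))
```

A finite-difference comparison against `log_likelihood` is a cheap way to validate such a gradient before replacing the enumeration by dynamic programming.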
During decoding, the optimal expression tree y* for a new input x can be calculated by:
$$\mathbf{y}^* = \arg\max_{\mathbf{y}} P(\mathbf{y} \mid \mathbf{x}) = \arg\max_{\mathbf{y}} \sum_{\mathbf{t} \in T(\mathbf{x},\mathbf{y})} P(\mathbf{y},\mathbf{t} \mid \mathbf{x})$$
where T(x, y) refers to the set of all possible text-math trees that contain x and y. Instead of directly computing the summation over all possible latent text-math structures, we replace the summation (Σ) by the max operation inside the arg max. In other words, we first find the latent text-math tree t* that yields the highest score and contains the input text x. The optimal expression tree y* can then be automatically extracted from t*.
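Replacing the sum by a max is an approximation, and the two decoders can disagree. The toy sketch below contrasts them on enumerated candidates with made-up scores; the function names are illustrative, and the actual system finds the best tree with dynamic programming rather than enumeration.

```python
import math


def log_sum_exp(scores):
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))


def decode_max(candidates):
    """Return the expression of the single highest-scoring text-math tree."""
    best_y, _ = max(candidates, key=lambda c: c[1])
    return best_y


def decode_sum(candidates):
    """Return the expression with the largest total mass over its trees."""
    by_expression = {}
    for y, s in candidates:
        by_expression.setdefault(y, []).append(s)
    return max(by_expression, key=lambda y: log_sum_exp(by_expression[y]))


# (expression, score of one latent tree); scores are hypothetical.
CANDS = [("14 / 2", 2.0), ("14 * 2", 1.5), ("14 * 2", 1.5), ("14 * 2", 1.5)]
print(decode_max(CANDS))   # 14 / 2
print(decode_sum(CANDS))   # 14 * 2
```

The max-based decoder needs only one best-tree search, which is why it is the practical choice even though it ignores how many trees support each expression.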
An efficient dynamic programming based inference algorithm similar to that of Lu (2014) is used to find the optimal latent structure t*. We then obtain the optimal expression tree y* from t*, which is the output of our system for the input problem text x.
It is worth noting that Roy et al. (2016) also proposed a system that maps text into an equation tree. Unlike this work, which maps math problem texts into math expressions in an end-to-end fashion, Roy et al. (2016) designed three classifiers that sequentially make local decisions, namely identifying relevant numbers, recognizing possible variables and producing equations. They also require extra inverse operators to handle non-commutative operations, which is not necessary for our model. Generating equations via a sequence of local classification decisions may propagate errors and may even limit the ability to holistically understand the underlying semantics of problem texts, which is important for predicting correct mathematical operations. In this work, we regard equation parsing as a structured prediction task, which allows us to parse the text into equations from a semantic parsing perspective. Moreover, Text2Math is capable of handling both equation parsing and arithmetic word problems, while the system of Roy et al. (2016) is specific to equation parsing.

Datasets.
Following prior works (Roy and Roth, 2015; Liang et al., 2018), we focus on two commonly used benchmark datasets for arithmetic word problems, AI2 (Hosseini et al., 2014) and IL (Roy and Roth, 2015). We consider mathematical relations among numbers and calculate the numerical values of the predicted expressions. For equation parsing, we evaluate our model on the data released by Roy et al. (2016). A predicted equation is regarded as correct if it is mathematically equivalent to the gold equation.

Empirical Results
Arithmetic Word Problem. Following previous work (Liang et al., 2018), we conduct 3-fold and 5-fold cross-validation on AI2 and IL, respectively, and report the accuracy scores, as shown in Table 2. Our method achieves competitive results on AI2 and IL. Overall, it performs better than previous systems in terms of average scores.
Ablation tests have been done to investigate the effectiveness of different components, such as POS tags, the lexicon and number relevance, as indicated by "-POS", "-LEX" and "-ID" in Table 2. By eliminating POS tag features, we achieve new state-of-the-art results on two datasets, which shows that POS tag features do not appear to be helpful in this case. Without the lexicon, the performance drops substantially, as expected, but the results are still comparable with most previous systems. These figures demonstrate the effectiveness of the lexicon. It is worth noting that the work of Liang et al. (2018), which achieves the previous state-of-the-art results, leverages inference rules during the inference phase. Their approach can be regarded as a different way of using a lexicon similar to ours. We also consider the effects of neural features (see Sec. 2.4) with different window sizes L ∈ {0, 1, 2, 3, 4, 5, 6}. According to the empirical results, a larger window size tends to give better results. One possible reason is that an arithmetic word problem often consists of several sentences, so a large word window is required to capture the mathematical semantics.

Equation Parsing. We compare our model with previous work (Roy et al., 2016) on the equation parsing dataset, as shown in Table 3. Our method yields competitive results. Unlike the work of Roy et al. (2016), annotations of unknown variables are not required in our model. As reported by Roy et al. (2016), they trained SPF (Artzi and Zettlemoyer, 2013a), a publicly available semantic parser, with sentence-equation pairs and a seed lexicon for mathematical terms, but it obtained only 3.1% accuracy. This result, taken from Roy et al. (2016), shows that it might be difficult for such a semantic parser to handle the equation parsing task even with a high-precision lexicon. One possible reason is that mapping text into a math equation is essentially a structure prediction problem, and our model makes its decisions from a structure prediction perspective. Unlike arithmetic word problems, where numbers are explicitly given in the form of digits, some texts from the equation parsing corpus describe numbers in string form. Hence, a structured predictor is used to identify the numbers in the sentence, which achieves 95.3% accuracy. The identified numbers are taken as features. We also consider the gold labels of numbers, indicated by (+G). The performance improves substantially, which shows that accurate identification of numbers is necessary in order to obtain good performance. Removing POS tag features leads to a slight drop in accuracy. On the other hand, it is worth noting that even without the high-precision lexicon, our model can still achieve new state-of-the-art accuracy on this task, while the previous work (Roy et al., 2016) always requires a high-precision lexicon to boost performance. Incorporating neural features leads to a new state-of-the-art accuracy of 74.5% when L = 4.
Expression Construction. In arithmetic word problems, the expression consists of several numbers only, as exemplified by Problem 1 in Figure 1.
In practice, an unknown variable X, representing the goal that the problem aims to calculate, can be appended to the expression to form an equation. We further investigate two constructions: appending the unknown variable X to the beginning or to the end of an expression. Results are listed in the first block of Table 4. For an expression such as (29 + (16 + 20)), the two new constructions are X = (29 + (16 + 20)), indicated by "Prefix X", and (29 + (16 + 20)) = X, reported as "Suffix X". Interestingly, both the inclusion of X and its position influence the performance. Overall, excluding X works best, and this is the setting adopted in this work.
Inverse Operators. As we discussed in Sec. 2.3, one distinct advantage of our approach, as compared to others, is that we do not need inverse operators such as "−_r" and "÷_r". Our word association patterns are capable of handling the reordering issue. Here, we consider model variants that introduce two inverse expression tree nodes, SUB_r and DIV_r, to represent "−_r" and "÷_r", respectively. Empirical results, reported in the second block of Table 4, show that Text2Math (without inverse operators) obtains results comparable to those of the model variants with inverse operators. These results confirm that our model does not require additional knowledge of the semantics of operands, which is a unique property of our approach.

Qualitative Analysis
Output Comparisons. Equation parsing is more challenging than arithmetic word problems, since it requires generating unknown variables mapped to phrases residing in the text. We analyze the outputs on this task to investigate what leads to better performance. Comparing the predictions made by Pipeline (Roy et al., 2016) and by our approach, we found that Text2Math can better capture the meaning of the problem text. We illustrate two examples in Table 5. The Pipeline approach fails to capture the meaning of "rises to 3.6% from 3.4%", which implies the subtraction of two numbers, while our model is capable of capturing such knowledge.
In the second example, Pipeline misunderstands the meaning of "five more than three", although its output seems correct in a local context. However, an equation should be mapped from the complete sentence, capturing mathematical relations from a global perspective. Our model has this capability and makes more reliable predictions, which supports the efficacy of solving math problems from a structure prediction perspective.
Table 5: Two example outputs of Pipeline (Roy et al., 2016) and Text2Math.
Input: Japan January jobless rate rises to 3.6% from 3.4%. Gold: X_1 = 0.036 − 0.034. Pipeline: X_1 × 0.036 = 0.034. Text2Math: X_1 = 0.036 − 0.034.
Input: The number of baseball cards he has is five more than three times the number of football cards.
Robustness. To further investigate the properties of our model, we studied its outputs and found that our method is able to perform self-correction. Consider Example 3 in Table 6: for the sentence "Germany's DAX opens 0.7% lower at 10,842." with annotated equation X_1 + (0.007 × X_1) = 10842, the prediction made by our method is X_1 − (0.007 × X_1) = 10842. The prediction made by our method is the correct one, while the annotation is actually wrong. To make a fair comparison with previous works, we did not count such cases as correct during evaluation, which implies that the actual accuracy is higher than that reported in Table 3.
Error Analysis. For arithmetic word problems, a typical error is that the operator between two operands should be addition/subtraction (multiplication/division), while the prediction is subtraction/addition (division/multiplication). Consider Examples 4 and 5 in Table 6. The descriptions of these two problems share many words, such as "each", "how many" and "there are". A slight difference in problem descriptions may lead to different results, which makes the task challenging.
As for equation parsing, the work of Roy et al. (2016) requires annotations of which phrases should be mapped to unknowns during the training phase. Such supervision is not required by our method. In our setup, we did not impose a hard constraint that each prediction must contain one or two variables. Therefore, missing or redundant variables in the predicted equations are one of the major error sources. Examples 6 and 7 in Table 6 illustrate such cases. On the other hand, the lack of professional background information also leads to missing variables. Consider Example 7. Without world knowledge, it might be difficult for the algorithm to recognize that "Flying with the wind" implies the speed of the wind, which should be considered as a variable of the equation.
Table 6: Examples with wrong predictions. Gold denotes the annotated correct equations and Text2Math refers to output equations generated by our method.
Example 3: Germany's DAX opens 0.7% lower at 10,842. Gold: X_1 + (0.007 × X_1) = 10842. Text2Math: X_1 − (0.007 × X_1) = 10842.
Example 4: Each child has 5 bottle caps. If there are 9 children, how many bottle caps are there in total? Gold: 5 × 9. Text2Math: 5 ÷ 9.
Example 5: The school is planning a field trip. There are 14 students and 2 seats on each school bus. How many buses are needed to take the trip? Gold: 14 ÷ 2. Text2Math: 14 × 2.
Example 6: 530 pesos can buy 4 kilograms of fish and 2 kilograms of pork. Gold: 530 = (4 × X_1) + (2 × X_2). Text2Math: 530 × X_3 = (4 × X_1) + (2 × X_2).
Example 7: Flying with the wind, a bird was able to make 150 kilometers per hour. Gold: X_1 + X_2 = 150. Text2Math: X_1 = 150.

Related Work
Math Word Problems. Mukherjee and Garain (2008) surveyed related approaches to this task in the literature. Hosseini et al. (2014) and Mitra and Baral (2016) solved the task by categorizing verbs or problems. The first method that can handle general arithmetic problems with multiple steps was proposed by Roy and Roth (2015), and was further extended in follow-up work (Roy and Roth, 2017, 2018). Zou and Lu (2019b,c) were the first to propose a sequence labelling approach to solving arithmetic word problems, focusing on addition-subtraction word problems. Other systems include semantic parsing based approaches (Liang et al., 2018) and neural methods (Wang et al., 2018a). Unlike arithmetic word problems, the goal of algebra word problems is to map the text to an equation set (Shi et al., 2015). Other types of problems have also been investigated, including probability problems (Dries et al., 2017), logic puzzle problems (Mitra and Baral, 2015; Chesani et al., 2017) and geometry problems (Seo et al., 2014, 2015). Besides the benchmark datasets used in this work, other popular datasets include Dolphin18K (Shi et al., 2015) and AQuA (Ling et al., 2017) for algebra word problems, which are not the focus of this work. Roy et al. (2016) first proposed the equation parsing task and designed a pipeline method with three structured predictors.
Semantic Parsing. Another line of related work is semantic parsing (Wong and Mooney, 2006; Zettlemoyer and Collins, 2007; Kwiatkowski et al., 2010; Liang et al., 2011; Dong and Lapata, 2018; Zou and Lu, 2018), which aims to map sentences into logic forms, including CCG-based lambda calculus expressions (Zettlemoyer and Collins, 2007; Kwiatkowski et al., 2010; Artzi and Zettlemoyer, 2013b; Lapata, 2016), FunQL (Kate et al., 2005; Wong and Mooney, 2006; Jones et al., 2012), lambda-DCS (Liang et al., 2011; Berant et al., 2013; Jia and Liang, 2016), graph queries (Harris et al., 2013; Holzschuher and Peinl, 2013) and SQL (Yin et al., 2015; Sun et al., 2018). In this work, we adopt a text-math semantic representation encoding words and the expression tree.

Conclusion
In this work, we propose a unified structured prediction approach, Text2Math, to solving both arithmetic word problems and equation parsing tasks. We leverage a novel joint representation to automatically learn the correspondence between words and math expressions which reflects semantic closeness. Different from many existing models, Text2Math is agnostic of the semantics of operands and learns to map from text to math expressions in an end-to-end manner based on a data-driven approach. Experiments demonstrate the efficacy of our model. In the future, we would like to investigate how such an approach can be applied to more complicated math word problems, like algebra word problems where a problem usually maps to an equation set. Another interesting direction is to investigate how to incorporate world knowledge into the graph-based approach to boost the performance.