Solving Math Word Problems with Multi-Encoders and Multi-Decoders

Solving math word problems remains a challenging task in which latent semantics and mathematical logic must be mined from natural language. Although previous studies employ the Seq2Seq technique to transform text descriptions into equation expressions, most of them achieve inferior performance due to insufficient consideration in the design of the encoder and decoder. Specifically, these models treat input/output objects only as sequences, ignoring the important structural information contained in text descriptions and equation expressions. To overcome these defects, we propose a model with multi-encoders and multi-decoders, which combines a sequence-based encoder and a graph-based encoder to enhance the representation of text descriptions, and generates different equation expressions via a sequence-based decoder and a tree-based decoder. Experimental results on the Math23K dataset show that our model outperforms existing state-of-the-art methods.


Introduction
Solving math word problems (MWPs), a task that transforms text descriptions into solvable equation expressions, is considered a crucial step towards general AI (Wang et al., 2018b). Since both semantic understanding and mathematical reasoning contribute to correct answers, MWP solving remains a challenging topic in NLP. Table 1 shows a typical example of MWPs.
Table 1: A typical example of MWPs.
Problem: A slow car drives 58 (n_1) km/h, and a fast car drives 85 (n_2) km/h. The two cars drive at the same time in opposite directions, and they meet after 5 (n_3) hours. How many kilometers does the fast car drive more than the slow car when they meet?
AST: [tree figure: root ×, left subtree − over n_2 and n_1, right leaf n_3]
Equation: (n_2 − n_1) × n_3
Prefix: × − n_2 n_1 n_3
Suffix: n_2 n_1 − n_3 ×
Answer: 135

Research on MWP solving has a long history. Early studies focused on rule-based methods (Fletcher, 1985; Bakman, 2007; Yuhui et al., 2010) and statistical machine learning methods (Hosseini et al., 2014; Mitra and Baral, 2016) that map problems onto predefined templates. The main drawbacks of these methods lie in their heavy dependency on manual features and their incapacity to generate new templates for new problems. Consequently, they can only achieve satisfactory results on small-scale datasets (Zhang et al., 2018). Recently, researchers have introduced Seq2Seq models, which are capable of generating new equation expressions that do not exist in the training set (Wang et al., 2018a). However, these models may generate invalid expressions, since the sequence-based decoder cannot control the generation process. Based on the fact that each equation expression can be transformed into an abstract syntax tree (AST), some studies (Liu et al., 2019; Xie and Sun, 2019) changed the left-to-right pattern of sequence generation and followed a top-down decoding process. Such tree-based decoders match the prefix order of the AST. Although these models consider the structural information of equation expressions, they ignore the fact that text descriptions also contain rich structural information, such as the dependency parse tree and numerical comparison information.
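To make the prefix and suffix orders in Table 1 concrete, the following minimal sketch (our own illustration, not code from any cited work) builds the AST of (n_2 − n_1) × n_3 and prints its pre-order (prefix) and post-order (suffix) traversals:

```python
# Illustrative sketch: prefix/suffix orders of an expression AST.

class Node:
    def __init__(self, token, left=None, right=None):
        self.token, self.left, self.right = token, left, right

def prefix(node):
    """Pre-order traversal: root, then left and right subtrees."""
    if node is None:
        return []
    return [node.token] + prefix(node.left) + prefix(node.right)

def suffix(node):
    """Post-order traversal: left and right subtrees, then root."""
    if node is None:
        return []
    return suffix(node.left) + suffix(node.right) + [node.token]

# AST of (n_2 - n_1) * n_3 from Table 1.
ast = Node('*', Node('-', Node('n2'), Node('n1')), Node('n3'))
print(' '.join(prefix(ast)))  # * - n2 n1 n3
print(' '.join(suffix(ast)))  # n2 n1 - n3 *
```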
The dependency parse tree represents grammatical relationships between pairs of text words; for example, nouns are usually matched with verbs, and numerals are usually matched with quantifiers. In Table 1, n_1 can be subtracted from n_2 because n_1 and n_2 share the same quantifier. Therefore, considering the dependency parse tree can reduce the occurrence of unreasonable operators between number pairs. In addition, most MWP solvers replace numbers with special tokens (i.e., n_1, n_2), which loses the important numerical comparison information contained in text descriptions. For example, in Table 1, the underlined words 'slow car' and 'fast car' imply the fact that n_1 < n_2. Similarly, we are inclined to ask 'How many kilometers does the fast car drive more than the slow car?' rather than 'How many kilometers does the slow car drive more than the fast car?'. In other words, text descriptions match the numerical comparison information. Provided that a model knows numerical comparison information in advance, it can better understand the underlying semantics without spending a lot of time mining these established facts from a large corpus.

Turning to the design of the decoder, existing methods adopt only one decoder, which limits the generation ability of the model. Wang et al. (2018a) provided an ensemble model that selects the result according to the generation probabilities of various models; however, each single model still has only one decoder. Meng and Rumshisky (2019) integrated two decoders in one model, but both are sequence-based, and decoders of the same type cannot significantly improve generalization performance.
To address the aforementioned challenges, we propose a novel model with multi-encoders and multi-decoders, which combines a sequence-based encoder and a graph-based encoder to enhance the representation of text descriptions, and obtains different equation expressions via a sequence-based decoder and a tree-based decoder. Specifically, we leverage a sequence-based encoder to obtain the context representation of text descriptions, and integrate the dependency parse tree and numerical comparison information via a graph-based encoder. In the decoding stage, a sequence-based decoder is used to generate the suffix order of the AST, and a tree-based decoder is used to generate the prefix order. The final result is selected according to the generation probabilities of the different decoders. The main contributions of this paper are summarized as follows:
• We integrate the dependency parse tree and numerical comparison information into the model, which enhances the representation of text descriptions.
• We use two types of decoders to generate different equation expressions, which strengthens the generation ability of the model.
• We evaluate our model on the large-scale dataset Math23K. The experimental results show that our model outperforms all existing state-of-the-art methods.

Related Work
MWP solving dates back to the 1960s and continues to attract NLP researchers. Here we introduce recent studies based on the Seq2Seq framework; the work presented in (Zhang et al., 2018) reviews earlier approaches. Wang et al. (2017) made the first attempt to directly generate equation expressions using a Seq2Seq model and published a high-quality Chinese dataset, Math23K. Wang et al. (2018a) found that using the suffix order of the AST can eliminate brackets in the original expressions, and proposed an equation normalization method to reduce the number of duplicated equations. A later two-stage model first used a Seq2Seq model to generate expressions without operators, and then used a recursive neural network to predict the operators between numbers. Chiang and Chen (2019) adopted a stack to track the semantic meanings of numbers. Other work added different functional multi-head attentions to the Seq2Seq framework. Meng and Rumshisky (2019) applied double sequence-based decoders in one model. However, these Seq2Seq models treat input/output objects only as sequences, ignoring the important structural information of equation expressions. Consequently, they cannot guarantee the generation of valid equation expressions.
The idea of the tree-based decoder was proposed in (Liu et al., 2019; Xie and Sun, 2019), which changed the left-to-right pattern of sequence generation and followed a top-down decoding process. However, these methods ignored the rich structural information contained in text descriptions. (Li et al., 2020; Zhang et al., 2020) proposed graph-based encoders: Li et al. (2020) integrated the dependency parse tree and constituency tree of text descriptions, while Zhang et al. (2020) constructed a quantity cell graph and a quantity comparison graph. Since these methods consider the structural information of text descriptions, they represent the current state of the art.
The encoders and decoders designed by these Seq2Seq models are summarized in Table 2. As the table shows, our model is the first to adopt both multi-encoders and multi-decoders.

Methodology
The framework of our model is shown in Figure 1 and consists of four components: the sequence-based encoder obtains the context representation of text descriptions; the graph-based encoder integrates the dependency parse tree and numerical comparison information; the sequence-based decoder generates the suffix order of the AST; and the tree-based decoder generates the prefix order. The final result is selected according to the generation probabilities of the different decoders.

Sequence-Based Encoder
The goal of the sequence-based encoder is to obtain the context representation of text descriptions. Without loss of generality, we use a BiGRU to encode the text words. Formally, given the text words P = {x_1, ..., x_n}, we first embed each word token x_i into a word embedding vector e_i, and then feed these embedding vectors into a BiGRU to produce the hidden state sequence H = {h_1, ..., h_n}.
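As an illustration, a minimal PyTorch sketch of this encoder (our own; class and argument names are not from the paper) could look as follows:

```python
import torch.nn as nn

class SeqEncoder(nn.Module):
    """BiGRU encoder: word ids -> contextual hidden states H = {h_1, ..., h_n}."""
    def __init__(self, vocab_size, emb_dim=128, hidden=512, layers=2, dropout=0.5):
        super().__init__()
        self.hidden = hidden
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden, num_layers=layers, dropout=dropout,
                          bidirectional=True, batch_first=True)

    def forward(self, token_ids):
        e = self.embedding(token_ids)   # (batch, n, emb_dim)
        h, _ = self.gru(e)              # (batch, n, 2 * hidden)
        # Merging the two directions by summation is one common choice
        # (used by GTS); the paper does not specify its merging scheme.
        return h[..., :self.hidden] + h[..., self.hidden:]
```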

Dependency Parse Tree
As discussed in Section 1, the dependency parse tree represents grammatical relationships between pairs of text words, which helps to find reasonable operators between number pairs.

Figure 1: The framework of our model. We first exploit a sequence-based encoder to obtain the context representation of text descriptions. A graph-based encoder is then used to integrate the dependency parse tree and numerical comparison information. In the decoding process, the sequence-based decoder and the tree-based decoder generate different equation expressions.

We can easily obtain the graph-based structure of the dependency parse tree by using the dependency relationships in the parse tree. Hence, we consider the following parse graph.
• Parse Graph (G): For two words x_i, x_j ∈ P, there is an edge e_ij = (x_i, x_j) ∈ G if the pair has a dependency relationship in the dependency parse tree, as shown in the table in Figure 1.
Note that the parse graph is an undirected graph. After building the graph-based structure of the dependency parse tree, we need an effective way to learn the graph representation. Here we introduce GraphSAGE (Hamilton et al., 2017), a flexible graph neural network. Specifically, we first use the sequence H = {h_1, ..., h_n} obtained by the sequence-based encoder as the initial embedding of each node. Each node then updates its embedding vector by aggregating information from its neighboring nodes. In this update, P^k_N denotes the information aggregated from neighboring nodes, and P^k denotes the updated node embeddings, with P^0 = H. The degree matrix D and adjacency matrix A of the parse graph are augmented with self-loops (D + I and A + I). k ∈ {1, ..., K} is the iteration index and {W, W_P} are parameter matrices.
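The displayed update equations do not survive in the text above; the following sketch shows one standard mean-aggregator GraphSAGE step that is consistent with the symbols described, offered as an assumption rather than the paper's exact formulation:

```python
import torch

def graphsage_step(P, A, W, W_P):
    """One GraphSAGE iteration on the parse graph (mean aggregator, assumed form).

    P:      (n, d) node embeddings of the previous step, with P^0 = H.
    A:      (n, n) adjacency matrix of the undirected parse graph.
    W, W_P: parameter matrices, as named in the text.
    """
    A_hat = A + torch.eye(A.size(0))        # self-loops: A + I
    deg = A_hat.sum(dim=1, keepdim=True)    # degrees of A + I (the D + I diagonal)
    P_N = ((A_hat @ P) / deg) @ W           # mean over neighbors, then project: P^k_N
    # Combine each node's own embedding with its aggregated neighborhood
    # (ReLU is our assumption for the nonlinearity).
    return torch.relu(torch.cat([P, P_N], dim=-1) @ W_P)
```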

Numerical Comparison Information
Numerical comparison information also plays an important role in enhancing text descriptions, and we likewise use a graph-based structure to represent it. We denote the numbers in the text as V_n = {n_1, ..., n_l} and consider the following two types of numerical graphs (a construction sketch follows the list).
• Greater Graph (G_g): For two numbers n_i, n_j ∈ V_n, there is an edge e_ij = (n_i, n_j) ∈ G_g if n_i > n_j, shown as the red solid lines in Figure 1.
• Lower Graph (G_l): For two numbers n_i, n_j ∈ V_n, there is an edge e_ij = (n_i, n_j) ∈ G_l if n_i ≤ n_j, shown as the red dashed lines in Figure 1.
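As a concrete illustration (our own sketch), the two directed graphs can be constructed directly from the numeric values behind the tokens:

```python
def build_numerical_graphs(values):
    """Adjacency matrices of the greater graph G_g and lower graph G_l.

    values: the actual numbers behind n_1, ..., n_l, e.g. [58, 85, 5] in Table 1.
    A_g[i][j] = 1 iff values[i] > values[j]; A_l[i][j] = 1 iff values[i] <= values[j].
    Note the '<=' makes G_l include self-edges; drop the diagonal if self-loops
    are added separately in the aggregation step.
    """
    l = len(values)
    A_g = [[int(values[i] > values[j]) for j in range(l)] for i in range(l)]
    A_l = [[int(values[i] <= values[j]) for j in range(l)] for i in range(l)]
    return A_g, A_l

A_g, A_l = build_numerical_graphs([58, 85, 5])  # numbers from Table 1
```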
Unlike the parse graph, there are two types of numerical graphs, and they are directed. Hence we extend GraphSAGE to integrate the numerical comparison information. In the updating rule of each number, {Q^k_Ng, Q^k_Nl} represent the information aggregated from neighboring nodes in the two graphs, and Q^k represents the updated node embeddings, with Q^0 = P^K. M_a controls the weight of the two graphs, '*' denotes element-wise multiplication and σ denotes the sigmoid function. k ∈ {1, ..., K} is the iteration index and {W_g, W_l, W_a, W_Q} are parameter matrices.
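Since the displayed equations are missing here as well, the following is a plausible gated formulation consistent with the symbols above (our reconstruction, not necessarily the paper's exact equations):

```latex
\begin{aligned}
Q_{N_g}^{k} &= \tilde{D}_g^{-1}\tilde{A}_g\,Q^{k-1}W_g, \qquad
Q_{N_l}^{k} = \tilde{D}_l^{-1}\tilde{A}_l\,Q^{k-1}W_l,\\
M_a &= \sigma\big([Q_{N_g}^{k};\,Q_{N_l}^{k}]\,W_a\big),\\
Q^{k} &= \big[\,Q^{k-1};\;M_a * Q_{N_g}^{k} + (1-M_a) * Q_{N_l}^{k}\,\big]\,W_Q,
\end{aligned}
```

where \tilde{A}_g, \tilde{D}_g (resp. \tilde{A}_l, \tilde{D}_l) denote the self-loop-augmented adjacency and degree matrices of G_g (resp. G_l).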
The final encoder vectors of the text description incorporate the node embedding vectors from the parse graph and the numerical graphs, where Z = {z_1, ..., z_n} denotes the final encoder vector of each word and g represents the global vector of the text description used for decoding.
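One natural instantiation consistent with this description (our assumption) takes the parse-graph embedding for ordinary words, the numerical-graph embedding for number tokens, and mean pooling for the global vector:

```latex
z_i =
\begin{cases}
Q_i^{K}, & x_i \in V_n,\\
P_i^{K}, & \text{otherwise},
\end{cases}
\qquad
g = \frac{1}{n}\sum_{i=1}^{n} z_i.
```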

Sequence-Based Decoder
The sequence-based decoder is used to generate the suffix order of the AST. We use a GRU with an attention layer to generate the sequence. In this decoder, s_i denotes the hidden state vector, c_i denotes the context vector, α_ij controls the attention weight on each encoder vector, ŷ_i is the output token, and {v_s, W_s} are parameter matrices.
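The displayed equations are missing; a standard attention-decoder formulation matching these symbols (our reconstruction; the output projection W_o is our addition) is:

```latex
\begin{aligned}
\alpha_{ij} &= \frac{\exp\big(v_s^{\top}\tanh(W_s[s_{i-1}; z_j])\big)}
                   {\sum_{t=1}^{n}\exp\big(v_s^{\top}\tanh(W_s[s_{i-1}; z_t])\big)},\qquad
c_i = \sum_{j=1}^{n}\alpha_{ij}z_j,\\
s_i &= \mathrm{GRU}\big(s_{i-1},\,[e(\hat{y}_{i-1}); c_i]\big),\qquad
\hat{y}_i = \arg\max\,\mathrm{softmax}\big(W_o[s_i; c_i]\big).
\end{aligned}
```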

Tree-Based Decoder
The tree-based decoder is used to generate the prefix order of the AST. We follow the Goal-driven Tree Structure (GTS) proposed in (Xie and Sun, 2019), which not only realizes a top-down decoding process but also uses bottom-up subtree embeddings. Here we briefly introduce the decoding process; a sketch of the decoding loop follows the steps.
• Step 1 (Root Goal Generation): GTS follows a pre-order traversal, so the primary goal is to generate the root node. We use g as the initial goal vector of the root node, and apply the same attention mechanism as in the sequence-based decoder to obtain the context vector c_1.
Note that the algorithm terminates immediately if ŷ_1 is a number; otherwise, we go to Step 2.
• Step 2 (Left Goal Generation): The left goal g_l is generated from the goal vector and the predicted token of its parent node, where ŷ_p, g_p and c_p stand for the predicted token, goal vector and context vector of the parent node, respectively. Left goals are generated repeatedly until ŷ_l is a number, referring to the red dashed lines in Figure 1. We then go to Step 3.
• Step 3 (Right Goal Generation): When a right goal node is generated, its left sibling node has already been completed. Therefore, GTS uses the subtree embedding of the sibling node to generate the right goal g_r. Here, t_l is the tree embedding of the left subtree, as illustrated by the blue solid lines in Figure 1. We go back to Step 2 if ŷ_r is an operator. If ŷ_r is a number, the algorithm backtracks to check whether any right goals in the tree still need to be generated; if no generation goal remains, the algorithm terminates, otherwise we continue with Step 3.
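The control flow of this goal-driven decoding can be sketched as follows (an illustrative simplification of GTS in which predict, make_left_goal and make_right_goal stand in for the undisplayed scoring equations, and the subtree embeddings t_l are omitted for brevity):

```python
def decode_tree(g, predict, make_left_goal, make_right_goal):
    """Goal-driven prefix decoding, simplified from GTS (Xie and Sun, 2019).

    predict(goal) -> token; make_left_goal/make_right_goal build child goals.
    Returns the generated prefix token sequence.
    """
    OPERATORS = {'+', '-', '*', '/', '^'}
    output, pending = [], []        # pending holds parents awaiting a right goal
    goal = g                        # Step 1: root goal
    while goal is not None:
        token = predict(goal)
        output.append(token)
        if token in OPERATORS:
            pending.append((goal, token))             # come back for the right child
            goal = make_left_goal(goal, token)        # Step 2: descend left first
        elif pending:
            parent_goal, parent_token = pending.pop() # backtrack to nearest parent
            goal = make_right_goal(parent_goal, parent_token)  # Step 3
        else:
            goal = None                               # no goals left: terminate
    return output
```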

Model Training
Since our model integrates two types of decoders, we combine the loss functions of the sequence-based decoder and the tree-based decoder. For each problem-expression sample (P, T), the optimization objective of our model sums the negative log-likelihoods of the two decoders, where m denotes the number of tokens in the equation expression, T_s represents the suffix order and T_t represents the prefix order. {W_1, W_2, W_3, W_4} are parameter matrices. Finally, we use the log-probability scores to perform a beam search. After obtaining the top equation expression from each decoder, we select the one with the higher score as the final result.
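A plausible form of this objective (our reconstruction of the missing equations) is:

```latex
\mathcal{L}(P, T) = -\sum_{i=1}^{m}\log p\big(y_i^{s}\mid y_{<i}^{s},\,P\big)
                    \;-\;\sum_{i=1}^{m}\log p\big(y_i^{t}\mid \mathrm{goal}_i,\,P\big),
```

where y^s and y^t denote the tokens of the suffix order T_s and the prefix order T_t, respectively.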

Experiments
In this section, we evaluate our model on the large-scale dataset Math23K. We compare our model with several state-of-the-art methods and demonstrate its effectiveness via a series of controlled experiments. Our code can be downloaded at https://github.com/YibinShen/MultiMath. Although there exist other datasets such as Dolphin18K (Huang et al., 2016) (with 18,460 MWPs) and AQuA (Ling et al., 2017) (with 100,000 MWPs), they contain either some unlabeled problems or informal equation expressions (mixed with text). Therefore, Math23K is still the most suitable large-scale, high-quality public dataset.

Hyperparameters
In the sequence-based encoder, we use a two-layer BiGRU with 512 hidden units, and the dimension of the word embeddings is set to 128. In the graph-based encoder, we set the number of iteration steps to K = 2. In the sequence-based decoder, we use a two-layer GRU with 512 hidden units. The hyperparameters of the tree-based decoder are consistent with GTS. As the optimizer, we use Adam with an initial learning rate of 0.001, halving the learning rate every 20 epochs. The number of epochs, batch size and dropout rate are set to 80, 64 and 0.5, respectively. Finally, we use beam search with beam size 5 in both the sequence-based decoder and the tree-based decoder. Our model is implemented in PyTorch 1.4.0 and runs on a server with one NVIDIA Tesla V100. We use pyltp 0.2.1 to perform dependency parsing and POS tagging.
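For reference, these settings can be collected in one place as follows (a plain summary of the text above; the key names are ours):

```python
# Hyperparameters as stated in this section (key names are ours).
HPARAMS = {
    "embedding_dim": 128,
    "hidden_size": 512,
    "encoder_layers": 2,      # two-layer BiGRU
    "decoder_layers": 2,      # two-layer GRU in the sequence-based decoder
    "graph_iterations": 2,    # K
    "optimizer": "Adam",
    "learning_rate": 1e-3,    # halved every 20 epochs
    "epochs": 80,
    "batch_size": 64,
    "dropout": 0.5,
    "beam_size": 5,
}
```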

Metric
Since a math word problem can be solved by multiple equation expressions, we use answer accuracy as the evaluation metric. For Math23K, some previous studies were evaluated on the public test set, while others used 5-fold cross-validation. We evaluate our model in both settings.
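Concretely, answer accuracy can be computed by evaluating each predicted expression and comparing its value with the labeled answer, as in this illustrative sketch:

```python
def answer_accuracy(predictions, answers, tol=1e-4):
    """Fraction of problems whose predicted expression evaluates to the gold answer.

    predictions: infix expression strings with the number tokens substituted
                 back, e.g. "(85 - 58) * 5"; answers: gold numeric answers.
    """
    correct = 0
    for expr, gold in zip(predictions, answers):
        try:
            value = eval(expr, {"__builtins__": None}, {})  # sketch only
            if abs(value - gold) < tol:
                correct += 1
        except Exception:
            pass  # invalid or unevaluable expressions count as wrong
    return correct / len(answers)

print(answer_accuracy(["(85 - 58) * 5"], [135.0]))  # 1.0
```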

Baselines
We compare our model with several state-of-the-art methods, including DNS (Wang et al., 2017), the first Seq2Seq solver for MWPs; GTS (Xie and Sun, 2019), which decodes with a goal-driven tree structure; and Graph2Tree (Zhang et al., 2020), which integrated the quantity cell graph and quantity comparison graph.

Table 3 depicts the performance comparison of different models on Math23K. As we can see, Seq2Seq models cannot exceed 70% accuracy because they ignore the structural information of text descriptions and equation expressions. Seq2Tree models make full use of tree-structured expressions and follow a top-down decoding process, which allows them to outperform most Seq2Seq models. In particular, GTS also realizes bottom-up subtree embeddings and performs well on Math23K. Graph2Tree considers the structural information of text descriptions by integrating the quantity cell graph and quantity comparison graph, so it achieves the second-best performance among all models. Our model not only uses multi-encoders to integrate the structural information of the dependency parse tree and the numerical comparison graphs, but also enhances its generation ability via multi-decoders, and thus outperforms all the aforementioned models.

Experimental Analysis
In Table 4, we show the accuracy on the top-5 most frequent expression templates of Math23K. Our model achieves more than 90% accuracy in all situations and outperforms the other two models in most cases. Note that our model improves significantly over GTS on expressions containing '÷' or '−'. This is because the division and subtraction operators do not satisfy the commutative law, which requires the model to learn the correct operand order. Since GTS does not integrate numerical comparison information, it cannot handle these expressions well. We also analyze accuracy with respect to expression length. The results show that our model outperforms GTS and Graph2Tree in all situations. However, the performance of our model drops rapidly as expressions become longer. There are two reasons for this phenomenon: (1) longer expressions contain more operators, and the neural network cannot store the results of intermediate variables well; (2) longer expressions account for only a small part of the dataset (e.g., each expression of length greater than 9 matches only 1.67 problems on average), so the model lacks training samples. In future work, we will consider question generation technology to generate more MWPs, which may alleviate this problem.

Case Study
To demonstrate the effectiveness of our model, we conduct a case study in Table 5. Test 1 exchanges the order of the text descriptions, and Test 2 changes the form of the question. These two simple tests investigate whether the model can mine the correct mathematical logic from natural language.
In the original problem, GTS obtains a negative answer, which conflicts with the problem. Interestingly, GTS obtains the correct answer when we exchange the order of the text descriptions. Note that GTS generates the same expression in Test 1, which implies that GTS only remembers the positions of the numbers instead of the real mathematical logic within the problem.
Table 5: Case study on the example in Table 1.
Problem: A slow car drives 58 (n_1) km/h, and a fast car drives 85 (n_2) km/h. The two cars drive at the same time in opposite directions, and they meet after 5 (n_3) hours. How many kilometers does the fast car drive more than the slow car when they meet?
Result: GTS: × − n_1 n_2 n_3 = −135 (error); Ours: × − n_2 n_1 n_3 = 135 (correct)
Test 1: A fast car drives 85 (n_1) km/h, and a slow car drives 58 (n_2) km/h. The two cars drive at the same time in opposite directions, and they meet after 5 (n_3) hours. How many kilometers does the fast car drive more than the slow car when they meet?
Result: GTS: × − n_1 n_2 n_3 = 135 (correct); Ours: × − n_1 n_2 n_3 = 135 (correct)
Test 2: A slow car drives 58 (n_1) km/h, and a fast car drives 85 (n_2) km/h. The two cars drive at the same time in opposite directions, and they meet after 5 (n_3) hours. How many kilometers does the slow car drive less than the fast car when they meet?
Result: GTS: × − n_1 n_2 n_3 = −135 (error); Ours: × − n_2 n_1 n_3 = 135 (correct)

In Test 2, we change the form of the question, and GTS and our model each generate the same expression as for the original problem. This is because we use the attention mechanism in the model; changing the form of the question has no impact on generating correct expressions.
Since Graph2Tree also considers the quantity comparison graph, it obtains the same results as our model in this case.

Ablation Study
Last but not least, we conduct an ablation study to better understand the effect of the encoders and decoders in the model, as shown in Table 6. When we replace the sequence-based encoder with a fully connected layer, the performance of our model drops noticeably. This is because the other encoders and decoders depend on the context representation obtained by the sequence-based encoder. We also find that performance drops if we discard any type of graph-based structure, which proves the importance of considering the structural information in text descriptions. When the model has only one decoder, its generation ability is limited, which indicates the necessity of designing multi-decoders.

Conclusion and Future Work
Inspired by the fact that both text descriptions and equation expressions contain structural information, this paper proposes a model with multi-encoders and multi-decoders. Specifically, we use the sequence-based encoder to obtain the context representation, and the graph-based encoder to integrate the structural information of text descriptions. Two types of decoders generate different expressions, which strengthens the generation ability of the model. Experimental results on Math23K prove the advantages of our model over existing state-of-the-art methods, and the experimental analysis shows its effectiveness in mining mathematical logic from problems. In future work, we will explore question generation techniques to increase the number of training samples and to solve problems with complex expressions.