Modeling Intra-Relation in Math Word Problems with Different Functional Multi-Head Attentions

Several deep learning models have been proposed for solving math word problems (MWPs) automatically. Although these models can capture features without manual effort, their feature-extraction approaches are not specifically designed for MWPs. To utilize the merits of deep learning models while simultaneously considering MWPs' specific features, we propose a group attention mechanism to extract global features, quantity-related features, quantity-pair features and question-related features in MWPs respectively. The experimental results show that the proposed approach performs significantly better than previous state-of-the-art methods, boosting accuracy from 66.9% to 69.5% on Math23K with the training-test split, from 65.8% to 66.9% on Math23K with 5-fold cross-validation, and from 69.2% to 76.1% on MAWPS.


Introduction
Computer systems, dating back to the 1960s, have been developed to automatically solve math word problems (MWPs) (Feigenbaum and Feldman, 1963; Bobrow, 1964). As illustrated in Table 1, when solving this problem, machines are asked to infer "how many shelves would Tom fill up" based on the textual problem description. This requires systems to map the natural language text into a machine-understandable form, reason over sets of numbers or unknown variables, and then derive the numeric answer.
In recent years, a growing number of deep learning models for MWPs (Ling et al., 2017; Wang et al., 2018b,a; Huang et al., 2018a,b) have drawn inspiration from advances in machine translation.

Table 1: An example math word problem.
Problem: For a birthday party Tom bought 4 regular sodas and 52 diet sodas. If his fridge would only hold 7 on each shelf, how many shelves would he fill up?
Equation: x = (4.0 + 52.0)/7.0
Solution: 8

The core idea is to leverage the immense capacity of neural networks to strengthen the process of equation generation. Compared to statistical machine learning-based methods (Kushman et al., 2014; Mitra and Baral, 2016; Roy and Roth, 2018; Zhou et al., 2015; Huang et al., 2016) and semantic parsing-based methods (Shi et al., 2015; Koncel-Kedziorski et al., 2015; Roy and Roth, 2015; Huang et al., 2017), these methods need no hand-crafted features and achieve high performance on large datasets. However, they fall short in capturing MWP-specific features, which are an evidently vital component in solving MWPs. More related work and feature-related information can be found in .
Inspired by recent work on modeling locality using multi-head attention (Li et al., 2018; Yang et al., 2019), we introduce a group attention that contains different attention mechanisms to extract various types of MWP features. More explicitly, there are four kinds of attention mechanisms: 1) Global attention to capture global information; 2) Quantity-related attention to model the relations between the current quantity and its neighboring words; 3) Quantity-pair attention to acquire the relations between quantities; 4) Question-related attention to capture the connections between the question and the quantities. The experimental results show that the proposed model establishes state-of-the-art performance on both the Math23K and MAWPS datasets. In addition, we release the source code of our model on GitHub 1 .
Background: Self-Attention Network

Self-attention networks have shown impressive results in various natural language processing tasks, such as machine translation (Vaswani et al., 2017; Shaw et al., 2018) and natural language inference, owing to their flexibility in parallel computation and their power in modeling long-range dependencies. A self-attention network models pairwise relevance by calculating attention weights between pairs of elements of an input sequence. Vaswani et al. (2017) propose a self-attention computation module known as "Scaled Dot-Product Attention" (SDPA), which is used as the basic unit of multi-head attention. This module's input contains a query matrix Q ∈ R^{m×d_k}, a key matrix K ∈ R^{m×d_k} and a value matrix V ∈ R^{m×d_v}, where m is the number of input tokens, d_k is the dimension of the query or key vectors, and d_v is the dimension of the value vectors. The output can be computed by:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

As Vaswani et al. (2017) found, performing attention by projecting the queries, keys, and values into subspaces with different learnable projection functions, instead of using a single attention, can enhance the capacity to capture various kinds of context information. More specifically, this attention model first transforms Q, K, and V with different learned projections and then obtains the output features {head_1, head_2, · · ·, head_k} by SDPA, where k is the number of SDPA modules. Finally, these output features are concatenated and projected to produce the final output state O.
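The SDPA unit and its multi-head composition described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the projection matrices here are random stand-ins for the learned parameters.

```python
import numpy as np

def sdpa(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (m, m) pairwise relevance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # (m, d_v)

def multi_head(Q, K, V, num_heads, rng):
    """Project Q, K, V per head, run SDPA, concatenate, project back."""
    d_model = Q.shape[-1]
    d_k = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        # per-head learnable projections (random placeholders here)
        Wq, Wk, Wv = (rng.standard_normal((d_model, d_k)) for _ in range(3))
        heads.append(sdpa(Q @ Wq, K @ Wk, V @ Wv))
    concat = np.concatenate(heads, axis=-1)         # (m, num_heads * d_k)
    Wo = rng.standard_normal((concat.shape[-1], d_model))
    return concat @ Wo                              # final output state O
```

With k heads each of dimension d_model/k, the computational cost stays close to that of a single full-dimension attention while each head can specialize, which is what the group attention below exploits.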

Approach
In this section, we introduce how the proposed framework works and the four different types of attention we designed.

Overview
We propose a sequence-to-sequence (SEQ2SEQ) model with group attention to capture different types of features in MWPs. The SEQ2SEQ model takes the text of the whole problem as input and the corresponding equation as output. Specifically, the group attention consists of four different types of multi-head attention modules. As illustrated in Figure 1, the pre-processed input is first encoded into hidden states H_e, which are fed into the group attention. Following the same paradigm as (Vaswani et al., 2017), we add a fully-connected feed-forward layer after the multi-head attention layer (i.e., the group attention), and each layer is followed by a residual connection and layer normalization. Consequently, the output O of the group attention block is obtained. During decoding, we employ the pipeline of (Wang et al., 2018a): the output Y is obtained through attention over O, where h^d_t is the decoder hidden state at the t-th step and o_j is the j-th state vector from the output O of the group attention block.
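The post-attention sub-layers of the encoder block (feed-forward layer, residual connections, layer normalization) follow the standard Transformer recipe. The NumPy sketch below illustrates that wiring under our own simplifications: random weights stand in for learned parameters, and the learnable gain/bias of layer normalization are omitted.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's feature vector to zero mean, unit scale."""
    mu = x.mean(-1, keepdims=True)
    sigma = x.std(-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def encoder_block(x, attn_fn, rng):
    """One block: attention sub-layer, then position-wise FFN,
    each wrapped in a residual connection + layer normalization."""
    d = x.shape[-1]
    x = layer_norm(x + attn_fn(x))          # residual around (group) attention
    W1 = rng.standard_normal((d, 4 * d))    # random stand-ins for learned FFN
    W2 = rng.standard_normal((4 * d, d))
    ffn = np.maximum(x @ W1, 0.0) @ W2      # position-wise FFN with ReLU
    return layer_norm(x + ffn)              # residual around FFN
```

Here `attn_fn` would be the group attention described in the next section; passing the identity function already exercises the residual/normalization wiring.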

Pre-Processing of MWPs
Given an MWP P and its corresponding ground-truth equation, we project the words of the MWP through a word embedding matrix E, i.e., e^P_i = E w^P_i. Considering the diversity of quantities in natural language, we follow the work of Wang et al. (2017), which maps quantities to special tokens in the problem text by the following two rules: 1) Each quantity that appears in the MWP is checked with Significant Number Identification (SNI) to determine whether it is a significant quantity that will be used in the equation; 2) All recognized significant quantities in the MWP P are mapped to a list of quantity tokens {n_1, ..., n_l} in order of their appearance in the problem text, where l is the number of quantities. Through the above rules, the mapped MWP text X = {x_1, · · ·, x_m}, which will be used as the input of the SEQ2SEQ model, can be acquired.
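The mapping step of rule 2 can be sketched as follows. This is a toy version under our own assumptions: it treats every bare number as significant, whereas real SNI (Wang et al., 2017) additionally filters out numbers that do not enter the equation.

```python
import re

def map_quantities(text):
    """Replace each numeric token with n1, n2, ... in order of appearance.

    Returns the mapped token list and the token -> value mapping.
    """
    mapping = {}
    tokens = []
    for w in text.split():
        if re.fullmatch(r"\d+(\.\d+)?", w):     # a bare quantity
            token = "n%d" % (len(mapping) + 1)  # next token in appearance order
            mapping[token] = float(w)
            tokens.append(token)
        else:
            tokens.append(w)
    return tokens, mapping
```

For the problem in Table 1, this yields the mapping {n1: 4.0, n2: 52.0, n3: 7.0} and a problem text in which the literals are replaced by n1, n2, n3.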
In addition, the quantity tokens in the equation are substituted according to the corresponding mapping in the problem text. For example, the mapped quantity tokens and the mapped equation of the problem in Table 1 are {n_1 = 4, n_2 = 52, n_3 = 7} and (n_1 + n_2) ÷ n_3 respectively. To address the issue that an MWP may have more than one correct solution equation (e.g., 3×2 and 2×3 are both correct equations for the problem "How many apples will Tom eat after 3 days if he eats 2 apples per day?"), we normalize the equations to postfix expressions following the rules in Wang et al. (2018a), ensuring that every problem corresponds to a unique equation. Thus, we can obtain the mapped equation E_q, which will be regarded as the target sequence.
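The infix-to-postfix part of this normalization can be sketched with the standard shunting-yard algorithm. Note this only covers the conversion; the additional operand-order rules of Wang et al. (2018a) that collapse, e.g., 3×2 and 2×3 into one canonical form are omitted here.

```python
def to_postfix(tokens):
    """Convert an infix token list to postfix via shunting-yard."""
    prec = {"+": 1, "-": 1, "*": 2, "/": 2}
    out, stack = [], []
    for t in tokens:
        if t == "(":
            stack.append(t)
        elif t == ")":
            while stack[-1] != "(":   # flush operators back to the "("
                out.append(stack.pop())
            stack.pop()               # discard the "("
        elif t in prec:
            # pop operators of equal or higher precedence first
            while stack and stack[-1] != "(" and prec[stack[-1]] >= prec[t]:
                out.append(stack.pop())
            stack.append(t)
        else:                         # operand (quantity token)
            out.append(t)
    while stack:
        out.append(stack.pop())
    return out
```

For the Table 1 equation, `to_postfix("( n1 + n2 ) / n3".split())` gives `['n1', 'n2', '+', 'n3', '/']`, the postfix target sequence.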

Group Attention
With the aim of implementing group attention, as illustrated in Figure 2, we separate the problem text X = {x_1, · · ·, x_m} into quantity spans X_quant = {X_quant,1, · · ·, X_quant,l} and a question span X_quest. A quantity span includes one or more quantities and their neighboring words, and the question span consists of the words of the question. For simplicity, spans are separated at commas and periods, which naturally segment the sentence semantically; each span often contains one quantity. Spans containing a quantity (except the last one) are treated as quantity spans, while the last span is treated as the question span since it always contains the question. By this construction, spans do not overlap with each other.
As illustrated in Figure 3, following how the problem text is divided, {Q, K, V} are masked to form the inputs of the group attention. We describe the four types of group attention in detail below.
Global Attention: Document-level features play an important role in distinguishing the category of an MWP and the order of quantities in its equation. To capture these features from a global perspective, we introduce a type of attention named global attention, which computes the attention vector based on the whole input sequence.
For Q g , K g , and V g , we set them to H e . The output O g can be obtained by SDPA modules belonging to global attention. For example, the word "apple" illustrated in Figure 2 will attend to the words in the whole problem text from "Janet" to "?".
Quantity-Related Attention: The words around a quantity usually provide beneficial clues for solving MWPs. Hence, we introduce quantity-related attention, which focuses on the question span or the quantity span where the current quantity resides.
For the i-th span, its Q_c, K_c, and V_c are all derived from X_quant,i within its own part. For example, as illustrated in Figure 2, the word "she" only attends to the words in the 2nd quantity span "She finds another 95,".
Quantity-Pair Attention: The relationship between two quantities is of great importance in determining their associated operator. We design an attention module called quantity-pair attention, which is used to model this relationship between quantities.
The question span can be viewed as a quantity span containing an unknown quantity. Thus, the computation process consists of two parts: 1) Attention between quantities: the query Q_p is derived from X_quant,i, and the corresponding K_p and V_p stem from X_quant,j (j ≠ i). For example, as illustrated in Figure 2, the word "has" in the 1st quantity span can only attend to words from the 2nd quantity span; 2) Attention between quantities and the question: the query Q_p originates from X_quest within the question span, and the corresponding K_p and V_p are derived from X_quant. For example, as illustrated in Figure 2, the word "How" attends to the words in the quantity spans from "Janet" to "95,".
Question-Related Attention: The question also conveys distinguishing information, such as whether the answer value is positive. Thus, we propose question-related attention, which is utilized to model the connections between the question and the problem description stem.
There are also two parts when modeling this type of relation: 1) Attention for the quantity spans: the query Q_q is derived from X_quant,i, and the corresponding K_q and V_q stem from X_quest. For example, as illustrated in Figure 2, the word "apples" in a quantity span only attends to the words in the question span; 2) Attention for the question span: for the query Q_q corresponding to X_quest, the corresponding K_q and V_q are extracted from X_quant. For example, as illustrated in Figure 2, the word "does" in the question span attends to the words in all the quantity spans.
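The four attention patterns above differ only in their masks. The sketch below constructs them from token-level span labels, under an assumed labeling scheme of our own: quantity spans are numbered 0..l-1 and question-span tokens are labeled -1 (the paper specifies the masks only at span level). A True entry at (i, j) means token i may attend to token j.

```python
import numpy as np

def group_masks(span_ids):
    """Build the four boolean attention masks from per-token span labels."""
    ids = np.asarray(span_ids)
    m = len(ids)
    same = ids[:, None] == ids[None, :]   # tokens share a span
    q_row = (ids == -1)[:, None]          # query token is in the question span
    q_col = (ids == -1)[None, :]          # key token is in the question span

    global_mask = np.ones((m, m), bool)   # everyone attends to everyone
    quantity_related = same               # attend only within own span
    # pair: quantity spans attend to *other* quantity spans,
    # and question tokens attend to all quantity tokens
    quantity_pair = (~same & ~q_row & ~q_col) | (q_row & ~q_col)
    # question-related: quantity tokens <-> question tokens
    question_related = (~q_row & q_col) | (q_row & ~q_col)
    return global_mask, quantity_related, quantity_pair, question_related
```

Each mask would then zero out (e.g., via a large negative bias before the softmax) the disallowed positions in the corresponding heads' SDPA scores.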
Experiments

Datasets: Math23K is collected from multiple online educational websites and contains 23,162 Chinese elementary-school-level MWPs. MAWPS is another large-scale dataset, containing 2,373 arithmetic word problems after harvesting the ones with a single unknown variable.
Evaluation Metrics: We use answer accuracy to evaluate our model: a generated equation is considered correct if it produces an answer equal to the corresponding ground-truth answer.
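Concretely, answer accuracy can be computed as below. The function names and the tolerance-based comparison are illustrative choices of ours, not from the paper's released code; equations are assumed to be in the postfix form described earlier.

```python
def eval_postfix(tokens, values):
    """Evaluate a postfix equation after substituting quantity values."""
    ops = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
           "*": lambda a, b: a * b, "/": lambda a, b: a / b}
    stack = []
    for t in tokens:
        if t in ops:
            b, a = stack.pop(), stack.pop()   # right operand popped first
            stack.append(ops[t](a, b))
        else:
            stack.append(values[t])           # quantity token -> its value
    return stack[0]

def answer_accuracy(predictions, problems, tol=1e-4):
    """Fraction of predicted equations whose value matches the gold answer."""
    correct = sum(
        abs(eval_postfix(eq, p["values"]) - p["answer"]) < tol
        for eq, p in zip(predictions, problems))
    return correct / len(problems)
```

For the Table 1 problem, the predicted postfix equation ['n1', 'n2', '+', 'n3', '/'] evaluates to 8.0 and thus counts as correct.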
Implementation details: For Math23K, we follow the training and test split released by , and we also evaluate our proposed method with 5-fold cross-validation in the main results table. We adopt pre-trained word embeddings with dimension 128, and use a two-layer Bi-LSTM with 256 hidden units together with a group attention composed of four different functional 2-head attentions as the encoder, and a two-layer LSTM with 512 hidden units as the decoder. Dropout probabilities for the word embeddings, the LSTMs and the group attention are all set to 0.3. The number of epochs and the mini-batch size are set to 300 and 128 respectively. As the optimizer, we use Adam with β_1 = 0.9, β_2 = 0.98 and ε = 10^-9. Following Vaswani et al. (2017), we vary the learning rate with the same policy, with warmup steps set to 2000. For MAWPS, we use 5-fold cross-validation, and the parameter settings are similar to those on Math23K.
Baselines: We compare our approach with retrieval models and deep learning based solvers. The retrieval models Jaccard and Cosine in (Robaidek et al., 2018) find the most similar math word problem in the training set under a distance metric and use its equation template to compute the result. DNS first applies a vanilla SEQ2SEQ model with a GRU encoder and an LSTM decoder to solve MWPs. In (Wang et al., 2018a), the authors apply a Bi-LSTM with equation normalization to reinforce the vanilla SEQ2SEQ model. T-RNN is a two-stage system that first predicts a tree-structured template to be filled, and then completes the template with operators predicted by a recursive neural network. In S-Aligned (Chiang and Chen, 2019), the encoder is designed to understand the semantics of problems, and the decoder focuses on deciding which symbol to generate next based on the semantic meanings of the generated symbols.
As illustrated in Table 2, retrieval approaches work poorly on both datasets. Our method, named GROUP-ATT, performs substantially better than existing deep learning based methods, increasing the accuracy from 66.9% to 69.5% on Math23K with the training-test split, from 65.8% to 66.9% on Math23K with 5-fold cross-validation, and from 69.2% to 76.1% on MAWPS. DNS and T-RNN also boost performance by integrating with retrieval methods, while (Wang et al., 2018a) improves performance by combining different SEQ2SEQ models; in contrast, we focus only on improving the performance of a single model. It is worth noting that GROUP-ATT still achieves higher accuracy than these state-of-the-art ensemble models.
In addition, we perform an ablation study to empirically examine the ability of the designed group attentions. We adopt the same parameter settings as GROUP-ATT while applying a single kind of attention with 8 heads. Table 3 shows the results of the ablation study on Math23K.
Although each individual attention captures only its own type of information, it still outperforms the Bi-LSTM baseline by a margin of 1.0% to 1.5%, showing its effectiveness.
In a parking lot, there are n_1 cars and motorcycles in total, each car has n_2 wheels, and each motorcycle has n_3 wheels. These vehicles have n_4 wheels in total, so how many motorcycles are there in the parking lot?

Visualization Analysis of Attention
To better understand how the group attention mechanism works, we visualize the attention on a typical example from Math23K. As shown in Figure 4, n_3 describes how many wheels a motorcycle has. Through the quantity-pair and quantity-related attention heads, n_3 pays attention to all quantities that describe numbers of wheels. Question-related attention helps n_3 attend to "motorcycle" in the question. In addition, and somewhat surprisingly, in the quantity-pair heads the attention of n_3 concentrates on the words "These" and "in total" from "These vehicles have n_4 wheels in total". This indicates a part-whole relation (i.e., one quantity is part of a larger quantity), mentioned in (Mitra and Baral, 2016; Roy and Roth, 2018), which is of great importance in solving MWPs. Our analysis illustrates that the hand-crafted grouping can force the model to utilize distinct information and relations conducive to solving MWPs.

Conclusion
In this paper, we introduce a group attention method that reinforces the model's capacity to capture various types of MWP-specific features.
We conduct experiments on two benchmarks and show significant improvements over a collection of competitive baselines, verifying the value of our model. In addition, our ablation study demonstrates the effectiveness of each group attention mechanism.