Dependency-based Hybrid Trees for Semantic Parsing

We propose a novel dependency-based hybrid tree model for semantic parsing, which converts natural language utterances into machine-interpretable meaning representations. Unlike previous state-of-the-art models, the semantic information is interpreted as the latent dependency between the natural language words in our joint representation. Such dependency information can capture the interactions between the semantics and natural language words. We integrate a neural component into our model and propose an efficient dynamic-programming algorithm to perform tractable inference. Through extensive experiments on the standard multilingual GeoQuery dataset with eight languages, we demonstrate that our proposed approach is able to achieve state-of-the-art performance across several languages. Analysis also justifies the effectiveness of using our new dependency-based representation.


Introduction
Semantic parsing is a fundamental task within the field of natural language processing (NLP). Consider a natural language (NL) sentence and its corresponding meaning representation (MR) as illustrated in Figure 1. Semantic parsing aims to transform natural language sentences into machine-interpretable meaning representations automatically. The task has been popular for decades and keeps receiving significant attention from the NLP community. Various systems (Zelle and Mooney, 1996; Kate et al., 2005; Zettlemoyer and Collins, 2005; Liang et al., 2011) were proposed over the years to deal with different types of semantic representations. Such models include structure-based models (Wong and Mooney, 2006; Lu et al., 2008; Kwiatkowski et al., 2010; Jones et al., 2012) and neural network based models (Dong and Lapata, 2016; Cheng et al., 2017).
[Figure 1: NL: What rivers do not run through Tennessee ? MR: answer(exclude(river(all), traverse(stateid(tn)))), shown together with the tree form of the MR and the dependency-based hybrid tree representation.]
Following various previous research efforts (Wong and Mooney, 2006; Lu et al., 2008; Jones et al., 2012), in this work we adopt a popular class of semantic formalisms: logical forms that can be equivalently represented as tree structures. The tree representation of an example MR is shown in the middle of Figure 1.
One challenge associated with building a semantic parser is that the exact correspondence between the words and atomic semantic units is not explicitly given during the training phase. The key to building a successful semantic parsing model lies in the identification of a good joint latent representation of both the sentence and its corresponding semantics. Example joint representations proposed in the literature include a chart used in phrase-based translation (Wong and Mooney, 2006), a constituency tree-like representation known as the hybrid tree (Lu et al., 2008), and a CCG-based derivation tree (Kwiatkowski et al., 2010).
Previous research efforts have shown the effectiveness of using dependency structures to extract semantic representations (Debusmann et al., 2004; Cimiano, 2009; Bédaride and Gardent, 2011; Stanovsky et al., 2016). Recently, Reddy et al. (2016) proposed a model to construct logical representations from sentences that are parsed into dependency structures. Their work demonstrates the connection between the dependency structures of a sentence and its underlying semantics. Their setup and objectives differ from ours: they assume that externally trained dependency parsers are available, and their system was trained to use the semantics for a specific downstream task. Nevertheless, the success of their work motivates us to propose a novel joint representation that can explicitly capture dependency structures among words for the semantic parsing task.
In this work, we propose a new joint representation for both semantics and words, presenting a new model for semantic parsing. Our main contributions can be summarized as follows:
• We present a novel dependency-based hybrid tree representation that captures both words and semantics in a joint manner. Such a dependency tree reveals semantic dependencies between words which are easily interpretable.
• We show that exact dynamic-programming algorithms for inference can be designed on top of our new representation. We further show that the model can be integrated with neural networks for improved effectiveness.
• Extensive experiments conducted on the standard multilingual GeoQuery dataset show that our model outperforms the state-of-the-art models on 7 out of 8 languages. Further analysis confirms the effectiveness of our dependency-based representation.
To the best of our knowledge, this is the first work that models the semantics as latent dependencies between words for semantic parsing.

Related Work
The literature on semantic parsing has focused on various types of semantic formalisms. The λ-calculus expressions (Zettlemoyer and Collins, 2005) have been popular and widely used in semantic parsing tasks over recent years (Dong and Lapata, 2016; Gardner and Krishnamurthy, 2017; Reddy et al., 2016; Susanto and Lu, 2017a; Cheng et al., 2017). Dependency-based compositional semantics (DCS) was introduced by Liang et al. (2011), whose extension, λ-DCS, was later proposed by Liang (2013). Various models (Berant et al., 2013; Wang et al., 2015; Jia and Liang, 2016) on semantic parsing with the λ-DCS formalism were proposed. In this work, we focus on the tree-structured semantic formalism which has been examined by various research efforts (Wong and Mooney, 2006; Kate and Mooney, 2006; Lu et al., 2008; Kwiatkowski et al., 2010; Jones et al., 2012; Lu, 2014; Zou and Lu, 2018).
Wong and Mooney (2006) proposed the WASP semantic parser that regards the task as a phrase-based machine translation problem. Lu et al. (2008) proposed a generative process to generate natural language words and semantic units in a joint model. The resulting representation is called a hybrid tree, where both natural language words and semantics are encoded into a joint representation. The UBL-s parser (Kwiatkowski et al., 2010) applied the CCG grammar (Steedman, 1996) to model the joint representation of both semantic units and contiguous word sequences which do not overlap with one another. Jones et al. (2012) applied a generative process with a Bayesian tree transducer, and their model also simultaneously generates the meaning representations and natural language words. Lu (2014, 2015) proposed a discriminative version of the hybrid tree model of Lu et al. (2008) where richer features can be captured. Dong and Lapata (2016) proposed a sequence-to-tree model using recurrent neural networks where the decoder can branch out to produce tree structures.
Susanto and Lu (2017b) augmented the discriminative hybrid tree model with multilayer perceptron and achieved state-of-the-art performance.
There exists another line of work that applies given syntactic dependency information to semantic parsing. Titov and Klementiev (2011) decomposed a syntactic dependency tree into fragments and modeled the semantics as relations between the fragments. Poon (2013) learned to derive semantic structures based on syntactic dependency trees predicted by the Stanford dependency parser. Reddy et al. (2016) proposed a linguistically motivated procedure to transform syntactic dependencies into logical forms. Their semantic parsing performance relies on the quality of the syntactic dependencies. Unlike such efforts, we do not require externally provided syntactic dependencies.

Approach

Variable-free Semantics
The variable-free semantic representations in the form of FunQL (Kate et al., 2005) used by the de facto GeoQuery dataset (Zelle and Mooney, 1996) encode semantic compositionality of the logical forms (Cheng et al., 2017). In the tree-structured semantic representations as illustrated in Figure 1, each tree node is a semantic unit of the following form:
$$m_i \equiv \tau_\alpha : p_\alpha(\tau_\beta^*)$$
where $m_i$ denotes the complete semantic unit, which consists of a semantic type $\tau_\alpha$, a function symbol $p_\alpha$ and an argument list of semantic types $\tau_\beta^*$. Here $*$ denotes that there can be 0, 1, or 2 semantic types in the argument list; this number is known as the arity of $m_i$. Each semantic unit can be regarded as a function that takes in other (partial) semantic representations of certain types as arguments and returns a semantic representation of a specific type. For example, in Figure 1, the root unit is represented by $m_1$: the type of this unit is QUERY, the function name is answer, and it has a single argument of semantic type RIVER. With recursive function composition, we can obtain a complete MR as shown in Figure 1.
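The recursive composition described above can be sketched in a few lines of Python. This is only an illustration: the class and function names are hypothetical, and the semantic types assigned to $m_3$ through $m_6$ are assumptions, since the text states only the types of $m_1$ and $m_2$ and the arities of the other units.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SemanticUnit:
    """One tree node: semantic type, function symbol, argument types."""
    sem_type: str
    symbol: str
    arg_types: Tuple[str, ...] = ()

    @property
    def arity(self) -> int:
        return len(self.arg_types)


@dataclass
class Node:
    unit: SemanticUnit
    children: List["Node"] = field(default_factory=list)


def compose(node: Node) -> str:
    """Recursively compose semantic units into a FunQL-style string."""
    if len(node.children) != node.unit.arity:
        raise ValueError(f"{node.unit.symbol}: expected {node.unit.arity} children")
    for child, expected in zip(node.children, node.unit.arg_types):
        if child.unit.sem_type != expected:
            raise ValueError(f"{node.unit.symbol}: argument type mismatch")
    if not node.children:
        return node.unit.symbol
    return f"{node.unit.symbol}({', '.join(compose(c) for c in node.children)})"


# m1 and m2 follow the text; the types of m3..m6 are assumed for illustration.
m1 = SemanticUnit("QUERY", "answer", ("RIVER",))
m2 = SemanticUnit("RIVER", "exclude", ("RIVER", "RIVER"))
m3 = SemanticUnit("RIVER", "river(all)")
m4 = SemanticUnit("RIVER", "traverse", ("STATE",))
m5 = SemanticUnit("STATE", "stateid", ("STATENAME",))
m6 = SemanticUnit("STATENAME", "tn")

tree = Node(m1, [Node(m2, [Node(m3), Node(m4, [Node(m5, [Node(m6)])])])])
print(compose(tree))
```

The arity and type checks mirror the idea that each unit is a typed function over the representations returned by its children.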

Dependency-based Hybrid Trees
To jointly encode the tree-structured semantics $m$ and a natural language sentence $n$, we introduce our novel dependency-based hybrid tree. Figure 2 (right) shows the two equivalent ways of visualizing the dependency-based hybrid tree based on the example given in Figure 1. In this example, $m$ is the tree-structured semantics $m_1(m_2(m_3, m_4(m_5(m_6))))$ and $n$ is the sentence $\{w_1, w_2, \cdots, w_8\}$. Our dependency-based hybrid tree $t$ consists of a set of dependencies between the natural language words, each of which is labeled with a semantic unit. Formally, a dependency arc is represented as $(w_p, w_c, m_i)$, where $w_p$ is the parent of this dependency, $w_c$ is the child, and $m_i$ is the semantic unit that serves as the label for the dependency arc. A valid dependency-based hybrid tree (with respect to a given semantic representation) allows one to recover the correct semantics from it. Thus, one constraint is that for any two adjacent dependencies $(w_p, w_c, m_i)$ and $(w'_p, w'_c, m_j)$, where $w_c \equiv w'_p$, $m_i$ must be the parent of $m_j$ in the tree-structured representation $m$. For example, in Figure 2, the dependencies (not, through, $m_4$) and (through, Tennessee, $m_5$) satisfy the above condition. However, we cannot replace (through, Tennessee, $m_5$) with, for example, (through, Tennessee, $m_6$), since $m_6$ is not a child of $m_4$. Furthermore, the number of children for a word in the dependency tree should be consistent with the arity of the corresponding semantic unit that points to it. For example, "not" has 2 children in our dependency-based hybrid tree representation because the semantic unit $m_2$ (i.e., RIVER : exclude (RIVER, RIVER)) has arity 2. Also, "rivers" is a leaf, as $m_3$, which points to it, has arity 0. We will discuss in Section 3.3 how to derive the set of allowable dependency-based hybrid trees for a given $(m, n)$ pair.
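The two constraints above (parent-child consistency of adjacent arcs, and agreement between the number of children and the arity) can be checked mechanically. The following is a simplified sketch, not the paper's procedure: the full arc set is an assumed reconstruction of Figure 2 (only some arcs are stated in the text, and the ROOT arc and the self-loop on "Tennessee" are assumptions), words are identified by surface form for brevity, and at most one self-loop per word is handled.

```python
from typing import Dict, List, Optional, Tuple

# Parent of each unit in m1(m2(m3, m4(m5(m6)))); None marks the root unit.
PARENT: Dict[int, Optional[int]] = {1: None, 2: 1, 3: 2, 4: 2, 5: 4, 6: 5}
ARITY: Dict[int, int] = {1: 1, 2: 2, 3: 0, 4: 1, 5: 1, 6: 0}

Arc = Tuple[str, str, int]  # (parent word, child word, semantic unit index)


def is_valid(arcs: List[Arc]) -> bool:
    """Check the adjacency and arity constraints for every arc."""
    for idx, (_, child_word, unit) in enumerate(arcs):
        # Arcs leaving the word this arc points to; an arc is not its own successor.
        outgoing = [a for k, a in enumerate(arcs) if a[0] == child_word and k != idx]
        # Every following arc must be labeled with a child of this unit.
        if any(PARENT[a[2]] != unit for a in outgoing):
            return False
        # The number of following arcs must equal the arity of this unit.
        if len(outgoing) != ARITY[unit]:
            return False
    return True


# Assumed reconstruction of the tree in Figure 2.
valid_tree: List[Arc] = [
    ("ROOT", "What", 1),
    ("What", "not", 2),
    ("not", "rivers", 3),
    ("not", "through", 4),
    ("through", "Tennessee", 5),
    ("Tennessee", "Tennessee", 6),  # self-loop
]

# Relabeling (through, Tennessee, m5) with m6 breaks the parent-child constraint.
invalid_tree = [a if a != ("through", "Tennessee", 5) else ("through", "Tennessee", 6)
                for a in valid_tree]

print(is_valid(valid_tree), is_valid(invalid_tree))
```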
To understand the potential advantages of our new joint representation, we compare it with the relaxed hybrid tree representation (Lu, 2014), which is illustrated on the left of Figure 2. We highlight some similarities and differences between the two representations from the span level and word level perspectives.
In a relaxed hybrid tree representation, words and semantic units jointly form a constituency tree-like structure, where the former are leaves and the latter are internal nodes of such a joint representation. Such a representation is able to capture alignment between the natural language words and semantics at the span level (we refer readers to Lu (2014) for more details). For example, $m_2$ covers the span from "rivers" to "Tennessee", which allows the interactions between the semantic unit and the span to be captured. Similarly, in our dependency-based hybrid tree, such span-level word-semantics correspondence can also be captured. For example, the arc between "not" and "through" is labeled by the semantic unit $m_4$. This also allows the interactions between $m_4$ and words within the span from "not" to "through" to be captured.
While both models are able to capture the span-level correspondence between words and semantics, we can observe that in the relaxed hybrid tree, some words within the span are more directly related to the semantic unit (e.g., "do not" are more related to $m_2$) and some are not. Specifically, in their representation, the span-level information assigned to the parent semantic unit always contains the span-level information assigned to all its child semantic units. This may not always be desirable and may lead to irrelevant features. In fact, Lu (2014) also empirically showed that the span-level features may not always be helpful in their representation. In contrast, in our dependency-based hybrid tree, the span covered by $m_2$ is from "What" to "not", which only contains the span-level information associated with its first child semantic unit. Therefore, our representation is more flexible in capturing the correspondence between words and semantics at the span level, allowing the model to choose the relevant span for features. Furthermore, our representation can also capture precise interactions between words through dependency arcs labeled with semantic units. For example, the semantic unit $m_4$ on the dependency arc from "not" to "through" in our representation can be used to capture their interactions. However, such information could not be straightforwardly captured in a relaxed hybrid tree, which is essentially a constituency tree-like representation. In the same example, consider the word "not" that bridges two arcs labeled by $m_2$ and $m_4$. Lexical features defined over such arcs can be used to indirectly capture the interactions between semantic units and guide the tree construction process. We believe such properties can be beneficial in practice, especially for certain languages. We will examine their significance in our experiments later.

Dependency Patterns
To define the set of allowable dependency-based hybrid tree representations so as to allow us to perform exact inference later, we introduce the dependency patterns as shown in Table 1. We use A, B or C to denote the abstract semantic units with arity 0, 1, and 2, respectively. We use W to denote a contiguous word span, and X and Y to denote the first and second child semantic unit, respectively.
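Table 1 is not reproduced in this text, so the following is only a rough sketch of how the patterns named in this section (WW, WX, XY, YX) might be organized by arity. The pattern XW is an assumption (the mirror image of WX), and all names in the code are hypothetical.

```python
# Abstract semantic unit for each arity, as named in the text.
ABSTRACT_UNIT = {0: "A", 1: "B", 2: "C"}

# Patterns per arity. WW, WX, XY and YX appear in the text;
# XW is an assumed mirror image of WX and may not match Table 1.
PATTERNS = {0: ["WW"], 1: ["WX", "XW"], 2: ["XY", "YX"]}

# What each pattern symbol stands for.
ROLE = {"W": "word span", "X": "first child unit", "Y": "second child unit"}


def describe(arity: int, pattern: str) -> dict:
    """Say what handles the left-hand and right-hand side under an arc."""
    if pattern not in PATTERNS[arity]:
        raise ValueError(f"pattern {pattern} is not allowed for arity {arity}")
    left, right = pattern
    return {"unit": ABSTRACT_UNIT[arity], "left": ROLE[left], "right": ROLE[right]}


print(describe(1, "WX"))
print(describe(2, "XY"))
```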
We explain these patterns with concrete cases in Figure 3 based on the example in Figure 2. For the first case, the semantic unit $m_3$ has arity 0 and the pattern involved is WW, indicating that both the left-hand and right-hand sides of "rivers" (under the dependency arc with semantic unit $m_3$) are just word spans (W, whose length could be zero). In the second case, the semantic unit $m_4$ has arity 1 and the pattern involved is WX, indicating that the left-hand side of "through" (under the arc of semantic unit $m_4$) is a word span and the right-hand side should be handled by the first child of $m_4$ in the semantic tree, which is $m_5$ in this case. In the third case, the semantic unit $m_2$ has two arguments and the pattern involved in the example is XY, meaning that the left-hand and right-hand sides should be handled by the first and second child semantic units (i.e., $m_3$ and $m_4$), respectively. (Analogously, the pattern YX would mean that $m_4$ handles the left-hand side and $m_3$ the right-hand side.) The final case illustrates that we also allow self-loops on our dependency-based hybrid trees, where an arc can be attached to a single word. (The limitations associated with disallowing such a pattern have been discussed by Lu (2015).) To avoid an infinite number of self-loops over a word, we set a maximum depth $c$ to restrict the maximum number of recurrences, which is similar to the method introduced by Lu (2015).
Based on the dependency patterns, we are able to define the set of all possible allowable dependency-based hybrid tree representations. Each representation essentially belongs to a class of projective dependency trees where semantic units appear on the dependency arcs and (some of the) words are selected as nodes. The semantic tree can be constructed by following the arcs while referring to the dependency patterns involved.

Model
Given the natural language words $n$, our task is to predict $m$, which is a tree-structured meaning representation, consisting of a set of semantic units as the nodes in the semantic tree. We use $t$ to denote a dependency-based hybrid tree (as shown in Figure 2), which jointly encodes both natural language words and the gold meaning representation. Let $T(n, m)$ denote all the possible dependency-based hybrid trees that contain the natural language words $n$ and the meaning representation $m$. We adopt the widely used structured prediction model, conditional random fields (CRF) (Lafferty et al., 2001). The probability of a possible meaning representation $m$ and dependency-based hybrid tree $t$ for a sentence $n$ is given by:
$$P_{\mathbf{w}}(m, t \mid n) = \frac{e^{\mathbf{w} \cdot \mathbf{f}(n, m, t)}}{\sum_{m',\, t' \in T(n, m')} e^{\mathbf{w} \cdot \mathbf{f}(n, m', t')}} \qquad (1)$$
where $\mathbf{f}(n, m, t)$ is the feature vector defined over the $(n, m, t)$ tuple, and $\mathbf{w}$ is the parameter vector. Since we do not have the knowledge of the "true" dependencies during training, $t$ is regarded as a latent variable in our model. We marginalize $t$ in the above equation and the resulting model is a latent-variable CRF (Quattoni et al., 2005):
$$P_{\mathbf{w}}(m \mid n) = \sum_{t \in T(n, m)} P_{\mathbf{w}}(m, t \mid n) = \frac{\sum_{t \in T(n, m)} e^{\mathbf{w} \cdot \mathbf{f}(n, m, t)}}{\sum_{m',\, t' \in T(n, m')} e^{\mathbf{w} \cdot \mathbf{f}(n, m', t')}} \qquad (2)$$
Given a dataset $D$ of $(n, m)$ pairs, our objective is to minimize the negative log-likelihood (the regularization term is excluded for brevity):
$$\mathcal{L}(\mathbf{w}) = -\sum_{(n, m) \in D} \log \sum_{t \in T(n, m)} P_{\mathbf{w}}(m, t \mid n) \qquad (3)$$
The gradient for model parameter $w_k$ is:
$$\frac{\partial \mathcal{L}(\mathbf{w})}{\partial w_k} = \sum_{(n, m) \in D} \Big( \mathbf{E}_{P_{\mathbf{w}}(m', t' \mid n)}\big[f_k(n, m', t')\big] - \mathbf{E}_{P_{\mathbf{w}}(t \mid n, m)}\big[f_k(n, m, t)\big] \Big)$$
where $f_k(n, m, t)$ represents the number of occurrences of the $k$-th feature. With both the objective and gradient above, we can minimize the objective function with standard optimizers, such as L-BFGS (Liu and Nocedal, 1989) and stochastic gradient descent. Calculation of these expectations involves all possible dependency-based hybrid trees. As there are exponentially many such trees, an efficient inference procedure is required. We will present our efficient algorithm to perform exact inference for learning and decoding in the next section.
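To make the latent-variable objective concrete, the sketch below enumerates a tiny, entirely hypothetical candidate space by brute force and computes the negative log-likelihood of one training pair together with its gradient as a difference of two feature expectations. The candidate names, feature names and weights are made up for illustration; the actual model never enumerates trees explicitly but relies on dynamic programming.

```python
import math
from collections import defaultdict

# Hypothetical toy space for one sentence n: (meaning m, latent tree t, feature counts).
CANDIDATES = [
    ("m_a", "t1", {"f0": 1.0, "f1": 0.0}),
    ("m_a", "t2", {"f0": 0.0, "f1": 1.0}),
    ("m_b", "t3", {"f0": 1.0, "f1": 1.0}),
]


def score(w, feats):
    """Linear score w . f(n, m, t)."""
    return sum(w.get(k, 0.0) * v for k, v in feats.items())


def nll_and_grad(w, gold_m):
    """Negative log-likelihood of gold_m with t marginalized out, and its gradient."""
    exp_all = [math.exp(score(w, f)) for _, _, f in CANDIDATES]
    z_all = sum(exp_all)  # sums over every (m', t')
    z_gold = sum(e for (m, _, _), e in zip(CANDIDATES, exp_all) if m == gold_m)
    grad = defaultdict(float)
    for (m, _, feats), e in zip(CANDIDATES, exp_all):
        for k, v in feats.items():
            grad[k] += (e / z_all) * v       # expectation under P(m', t' | n)
            if m == gold_m:
                grad[k] -= (e / z_gold) * v  # expectation under P(t | n, m)
    return -math.log(z_gold / z_all), dict(grad)


loss, grad = nll_and_grad({"f0": 0.3, "f1": -0.2}, "m_a")
print(loss, grad)
```

With all weights at zero, two of the three candidates carry the gold meaning, so the loss reduces to $-\log(2/3)$, which is a convenient sanity check.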

Learning and Decoding
We propose dynamic-programming algorithms to perform efficient and exact inference, which will be used for calculating the objective and gradients discussed in the previous section. The algorithms are inspired by the inside-outside style algorithm (Baker, 1979), graph-based dependency parsing (Eisner, 2000; Koo and Collins, 2010; Shi et al., 2017), and the relaxed hybrid tree model (Lu, 2014, 2015). As discussed in Section 3.3, our latent dependency trees are projective as in traditional dependency parsing (Eisner, 1996; Nivre and Scholz, 2004). The negative log-likelihood objective can be decomposed into two terms:
$$\mathcal{L}(\mathbf{w}) = \sum_{(n, m) \in D} \Big( -\log \sum_{t \in T(n, m)} e^{\mathbf{w} \cdot \mathbf{f}(n, m, t)} + \log \sum_{m',\, t' \in T(n, m')} e^{\mathbf{w} \cdot \mathbf{f}(n, m', t')} \Big)$$
We can see the first term is essentially the combined score of all the possible latent structures containing the pair $(n, m)$. The second term is the combined score for all the possible latent structures containing $n$. We show how such scores can be calculated in a factorized manner, based on the fact that we can recursively decompose a dependency-based hybrid tree based on the dependency patterns we introduced.
Formally, we introduce two interrelated dynamic-programming structures that are similar to those used in graph-based dependency parsing (Eisner, 2000; Koo and Collins, 2010; Shi et al., 2017), namely the complete span and the complete arc span. Figure 4a shows an example of a complete span (left) and a complete arc span (right). The complete span (over $[i, j]$) consists of a headword (at $i$) and its descendants on one side (they altogether form a subtree), a dependency pattern and a semantic unit. The complete arc span is a span (over $[i, j]$) with a dependency between the headword (at $i$) and the modifier (at $k$). We use $C_{i,j,p,m}$ to denote a complete span, where $i$ and $j$ represent the indices of the headword and endpoint, $p$ is the dependency pattern and $m$ is the semantic unit. Analogously, we use $A_{i,k,j,p,m}$ to denote a complete arc span, where $i$ and $k$ are used to denote the additional dependency from the word at the $i$-th position as headword to the word at the $k$-th position as modifier.
As we can see from the derivation in Figure 4, each type of span can be constructed from smaller spans in a bottom-up manner. Figure 4a shows that a complete span is constructed from a complete arc span following the dependency patterns in Table 1. Figure 4b shows that a complete arc span can be simply constructed from two smaller complete spans based on the dependency pattern. In Figures 4c and 4d, we further show how such two complete spans with pattern X (or Y) and W can be constructed. Figure 4c illustrates how to model a transition from one semantic unit to another, where the parent is $m_1$ and the child is $m_2$ in the semantic tree. If $m_2$ has arity 1, then the pattern is B following the dependency patterns in Table 1. For spans with a single word, we use the lowercase w as the pattern to indicate this fact, as shown in Figure 4d. They are the atomic spans used for building larger spans. As the complete span in Figure 4d is associated with pattern W, which means the words within this span are under the semantic unit $m_1$, we can incrementally construct this span with atomic spans. We illustrate the construction of a complete dependency-based hybrid tree in the supplementary material.
Our final goal during training for a sentence $n = \{w_0, w_1, \cdots, w_N\}$ is to construct all the possible complete spans that cover the interval $[0, N]$, which can be represented as $C_{0,N,\cdot,\cdot}$. Similar to the chart-based dependency parsing algorithms (Eisner, 1996, 2000; Koo and Collins, 2010), we can obtain the inside and outside scores using our dynamic-programming derivation in Figure 4 during the inference process, which can then be used to calculate the objective and feature expectations. Since the spans are defined by at most three free indices, the dependency pattern and the semantic unit, our dynamic-programming algorithm requires $O(N^3 M)$ time, where $M$ is the number of semantic units. The resulting complexity is the same as that of the relaxed hybrid tree model (Lu, 2014).
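For intuition about where the cubic factor comes from, here is a sketch of an Eisner-style inside recursion for plain projective dependency trees. It is not the algorithm of this paper: it omits dependency patterns, semantic units and self-loops, and its two chart types only loosely correspond to the complete span and complete arc span. The function name and the multiplicative `weight` callback are assumptions made for illustration.

```python
def projective_partition(n_words, weight=lambda head, mod: 1):
    """Sum of products of arc weights over all projective trees (position 0 is ROOT)."""
    n = n_words
    zeros = lambda: [[0] * (n + 1) for _ in range(n + 1)]
    comp_r, comp_l = zeros(), zeros()  # complete spans, head at left / right end
    inc_r, inc_l = zeros(), zeros()    # spans closed by an arc between i and j
    for i in range(n + 1):
        comp_r[i][i] = comp_l[i][i] = 1
    for width in range(1, n + 1):
        for i in range(0, n + 1 - width):
            j = i + width
            # Join two facing complete spans, then add an arc between i and j.
            joined = sum(comp_r[i][k] * comp_l[k + 1][j] for k in range(i, j))
            inc_r[i][j] = joined * weight(i, j)
            inc_l[i][j] = joined * weight(j, i) if i > 0 else 0  # ROOT takes no head
            # Extend an arc span with a complete span of the modifier.
            comp_r[i][j] = sum(inc_r[i][k] * comp_r[k][j] for k in range(i + 1, j + 1))
            comp_l[i][j] = sum(comp_l[i][k] * inc_l[k][j] for k in range(i, j))
    return comp_r[0][n]


# With unit weights the result is simply the number of projective trees.
print([projective_partition(n) for n in (1, 2, 3)])
```

Each chart cell is filled by a loop over a split point, giving $O(N^3)$ work for a sentence of length $N$; indexing every cell additionally by a semantic unit is what would introduce the extra factor $M$.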
During decoding, we can find the optimal (tree-structured) meaning representation $m^*$ for a given sentence $n$ with the same dynamic-programming structures. A similar decoding procedure has been used in previous work (Lu, 2014; Durrett and Klein, 2015) with CKY-based parsing algorithms.

Features
As shown in Equation 1, the features are defined on the tuple $(n, m, t)$. With the dynamic-programming procedure, we can define the features over the structures in Figure 2. Our feature design is inspired by the hybrid tree model (Lu, 2015) and graph-based dependency parsing (McDonald et al., 2005). Table 2 shows the feature templates for the example in Figure 2. Specifically, we define simple unigram features (concatenation of a semantic unit and a word that directly appears under the unit), pattern features (concatenation of the semantic unit and the child pattern) and transition features (concatenation of the parent and child semantic units). They form our basic feature set. Additionally, with the structured properties of dependencies, we can define dependency-related features (McDonald et al., 2005). We use the parent (head) and child (modifier) words of the dependency as features. We also use the bag-of-words covered under a dependency as features. The dependency features are useful in helping improve the performance, as we can see in the experiments section.
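The feature templates described above might be instantiated for a single arc as in the sketch below. Table 2 is not reproduced here, so the string formats, the dictionary keys and the example values (in particular which words count as "directly under" $m_4$, and whether the dependency features are conjoined with the semantic unit) are assumptions for illustration only.

```python
from typing import Dict, List


def arc_features(arc: Dict[str, object]) -> List[str]:
    """Build string-valued features for one labeled dependency arc."""
    unit = arc["unit"]
    # Unigram: semantic unit + a word directly under it.
    feats = [f"unigram:{unit}|{w}" for w in arc["words_under_unit"]]
    # Pattern: semantic unit + child pattern.
    feats.append(f"pattern:{unit}|{arc['pattern']}")
    # Transition: parent unit + child unit.
    feats.append(f"transition:{arc['parent_unit']}|{unit}")
    # Dependency features: head/modifier words and bag-of-words under the arc.
    feats.append(f"head-mod:{unit}|{arc['head']}|{arc['modifier']}")
    feats.extend(f"bow:{unit}|{w}" for w in sorted(set(arc["covered_words"])))
    return feats


# Hypothetical instantiation for the arc (not, through, m4).
example_arc = {
    "unit": "m4",
    "parent_unit": "m2",
    "pattern": "WX",
    "head": "not",
    "modifier": "through",
    "words_under_unit": ["run"],
    "covered_words": ["not", "run", "through"],
}
for feature in arc_features(example_arc):
    print(feature)
```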

Neural Component
Following the approach used in Susanto and Lu (2017b), we could further incorporate neural networks into our latent-variable graphical model. The integration is analogous to the approaches described in the neural CRF models (Do and Artieres, 2010;Durrett and Klein, 2015;Gormley, 2015;Lample et al., 2016), where we use neural networks to learn distributed feature representations within our graphical model.
We employ a neural architecture to calculate the score associated with each dependency arc $(w_p, w_c, m)$ (here $w_p$ and $w_c$ are the parent and child words in the dependency and $m$ is the semantic unit over the arc), where the input to the neural network consists of the words (i.e., $(w_p, w_c)$) associated with this dependency, and the neural network calculates a score for each possible semantic unit, including $m$. The two words are first mapped to word embeddings $e_p$ and $e_c$ (both of dimension $d$). Next, we use a bilinear layer (Socher et al., 2013; Chen et al., 2016) to capture the interaction between the parent and the child in a dependency:
$$r_i = e_p^{\top} U_i \, e_c$$
where $r_i$ represents the score for the $i$-th semantic unit and $U_i \in \mathbb{R}^{d \times d}$. The scores are then incorporated into the probability expression in Equation 1 during learning and decoding. As a comparison, we also implemented a variant where our model directly takes in the average embedding of $e_p$ and $e_c$ as additional features, without using our neural component.
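The bilinear score $r_i = e_p^{\top} U_i e_c$ can be written out directly. The sketch below uses plain Python lists and made-up two-dimensional values; it is not the Torch7 implementation used in the experiments, and the function name is hypothetical.

```python
def bilinear_scores(e_p, e_c, U):
    """Return r_i = e_p^T U_i e_c for every semantic unit i.

    e_p, e_c: parent and child word embeddings (lists of length d).
    U: one d x d matrix (list of rows) per semantic unit.
    """
    scores = []
    for U_i in U:
        # First compute the vector U_i e_c, then take its dot product with e_p.
        projected = [sum(row[k] * e_c[k] for k in range(len(e_c))) for row in U_i]
        scores.append(sum(p * x for p, x in zip(e_p, projected)))
    return scores


# Toy example with d = 2 and two semantic units.
e_p = [1, 2]
e_c = [3, 4]
U = [
    [[1, 0], [0, 1]],  # identity: reduces to the dot product of e_p and e_c
    [[0, 1], [0, 0]],  # picks out e_p[0] * e_c[1]
]
print(bilinear_scores(e_p, e_c, U))
```

Because each semantic unit has its own matrix $U_i$, the layer can learn a different notion of head-modifier compatibility per unit.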

Experiments
Data and evaluation methodology We conduct experiments on the publicly available variable-free version of the GeoQuery dataset, which has been widely used for semantic parsing (Wong and Mooney, 2006; Lu et al., 2008; Jones et al., 2012). The dataset consists of 880 pairs of natural language sentences and the corresponding tree-structured semantic representations. This dataset is annotated with eight languages. The original annotation of this dataset is English (Zelle and Mooney, 1996). We follow the standard evaluation procedure used in various previous works (Wong and Mooney, 2006; Lu et al., 2008; Jones et al., 2012; Lu, 2015) to construct the Prolog query from the tree-structured semantic representation using a standard and publicly available script. The queries are then used to retrieve the answers from the GeoQuery database, and we report accuracy and $F_1$ scores.
Table 3: Performance comparison with state-of-the-art models on the GeoQuery dataset († indicates that the system uses lambda-calculus expressions as meaning representations).
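The text does not spell out how accuracy and $F_1$ are computed. The sketch below assumes a convention commonly used for this dataset: accuracy (which then coincides with recall) is the fraction of all test sentences whose retrieved answer is correct, while precision is computed only over sentences for which the parser produced an output. The function name and the use of `None` for a missing parse are hypothetical.

```python
def evaluate(predictions, gold):
    """predictions[i] is the retrieved answer, or None if no parse was produced."""
    total = len(gold)
    answered = sum(p is not None for p in predictions)
    correct = sum(p is not None and p == g for p, g in zip(predictions, gold))
    accuracy = correct / total if total else 0.0  # also serves as recall here
    precision = correct / answered if answered else 0.0
    denom = precision + accuracy
    f1 = 2 * precision * accuracy / denom if denom else 0.0
    return accuracy, f1


# Toy example: four test sentences, one without output, one wrong answer.
preds = [{"a"}, None, {"b"}, {"c"}]
golds = [{"a"}, {"x"}, {"b"}, {"d"}]
print(evaluate(preds, golds))
```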
Hyperparameters We set the maximum depth $c$ of the semantic tree to 20, following Lu (2015). The $L_2$ regularization coefficient is tuned from 0.01 to 0.05 using 5-fold cross-validation on the training set. The Polyglot (Al-Rfou et al., 2013) multilingual word embeddings (with 64 dimensions) are used for all languages. We use L-BFGS (Liu and Nocedal, 1989) to optimize the DEPHT model until convergence and stochastic gradient descent (SGD) with a learning rate of 0.05 to optimize the neural DEPHT model. We implemented our neural component with the Torch7 library (Collobert et al., 2011). Our complete implementation is based on the StatNLP structured prediction framework (Lu, 2017).

Results and Discussion
Table 3 (top) shows the results of our dependency-based hybrid tree model compared with non-neural models which achieve state-of-the-art performance on the GeoQuery dataset. Our model DEPHT achieves competitive performance and outperforms the previous best system RHT on 6 languages. Improvements on the Indonesian dataset are particularly striking (+11.8 absolute points in $F_1$). We further investigated the outputs from both systems on Indonesian through error analysis. We found 40 instances that are incorrectly predicted by RHT but correctly predicted by DEPHT, and 77.5% of these errors are due to incorrect alignment between words and semantic units. Figure 5 shows an example of such errors, where the relaxed hybrid tree fails to capture the correct alignment. We can see the question is asking "What state is San Antonio located in?". However, the natural language word order in Indonesian is different from English: the phrase "berada di" that corresponds to $m_2$ (i.e., loc) appears between "San Antonio" (which corresponds to $m_5$, i.e., san antonio) and "what" (which corresponds to $m_1$, i.e., answer). Such a structural non-isomorphism between the sentence and the semantic tree makes the relaxed hybrid tree parser unable to produce a joint representation with valid word-semantics alignment. This issue makes the RHT model unable to predict the semantic unit $m_2$ (i.e., loc), as RHT has to align the words "San Antonio", which should be aligned to $m_5$, before aligning "berada di". However, $m_5$ has arity 0 and cannot have $m_2$ as its child. Thus, it would be impossible for the RHT model to predict such a meaning representation as output. In contrast, we can see that our dependency-based hybrid tree representation appears to be more flexible in handling such cases. The dependency between the two words "di" (in) and "berada" (located) is also well captured by the arc between them that is labeled with $m_2$.
The error analysis reveals the flexibility of our joint representation in different languages in terms of the word ordering, indicating that the novel dependency-based joint representation is more robust and suffers less from language-specific characteristics associated with the data.
Effectiveness of dependency To investigate the helpfulness of the features defined over latent dependencies, we conduct ablation tests by removing the dependency-related features. Table 4 shows the performance of augmenting different dependency features in our DEPHT model with basic features. Specifically, we investigate the performance of head word and modifier word features (HM) and also the bag-of-words features (BOW) that can be extracted based on dependencies. It can be observed that dependency features associated with the words are crucial for all languages, especially the BOW features.
Effectiveness of neural component The bottom part of Table 3 shows the performance comparison among models that involve neural networks. Our DEPHT model with embeddings as features can outperform neural baselines across several languages (i.e., Chinese, Indonesian and Swedish). From the table, we can see the neural component is effective, consistently giving better results than DEPHT and the approach that uses word embedding features only. Susanto and Lu (2017b) presented the NEURAL HT model with different window sizes J for their multilayer perceptron. Their performance differs with different window sizes, which need to be tuned for each language. In our neural component, we do not require such a language-specific hyperparameter, yet our neural approach consistently achieves the highest performance on 7 out of 8 languages compared with all previous approaches. As both the embeddings and the neural component are defined on the dependency arcs, the superior results also reveal the effectiveness of our dependency-based hybrid tree representation.

Conclusions and Future Work
In this work, we present a novel dependency-based hybrid tree model for semantic parsing. The model captures the underlying semantic information of a sentence as latent dependencies between the natural language words. We develop an efficient algorithm for exact inference based on dynamic programming. Extensive experiments on a benchmark dataset across 8 different languages demonstrate the effectiveness of our newly proposed representation for semantic parsing. Future work includes exploring alternative approaches such as transition-based methods (Nivre et al., 2006; Chen and Manning, 2014) for semantic parsing with latent dependencies, and applying our dependency-based hybrid trees to other types of logical representations (e.g., lambda calculus expressions and SQL (Finegan-Dollak et al., 2018)) as well as to multilingual semantic parsing (Jie and Lu, 2014; Susanto and Lu, 2017a).