Learning Cross-lingual Distributed Logical Representations for Semantic Parsing

With the development of several multilingual datasets used for semantic parsing, recent research efforts have looked into the problem of learning semantic parsers in a multilingual setup. However, how to improve the performance of a monolingual semantic parser for a specific language by leveraging data annotated in different languages remains a research question that is under-explored. In this work, we present a study to show how learning distributed representations of the logical forms from data annotated in different languages can be used for improving the performance of a monolingual semantic parser. We extend two existing monolingual semantic parsers to incorporate such cross-lingual distributed logical representations as features. Experiments show that our proposed approach is able to yield improved semantic parsing results on the standard multilingual GeoQuery dataset.


Introduction
Semantic parsing, one of the classic tasks in natural language processing (NLP), has been extensively studied in the past few years (Zettlemoyer and Collins, 2005; Wong and Mooney, 2006, 2007; Liang et al., 2011; Kwiatkowski et al., 2011; Artzi et al., 2015). With the development of datasets annotated in different languages, learning semantic parsers from such multilingual datasets has also attracted the attention of researchers (Susanto and Lu, 2017a). However, how to make use of such cross-lingual data to perform cross-lingual semantic parsing, that is, using data annotated for one language to help improve the performance of another language, remains a research question that is largely under-explored.

English: which rivers do not run through texas ?
German: welche flüsse fliessen nicht durch texas ?
Figure 1: An example of two semantically equivalent sentences (below) and their tree-shaped semantic representation (above).
Prior work (Chan et al., 2007) shows that semantically equivalent words from different languages may share semantic-level information, which is helpful for certain semantic processing tasks. In this work, we propose a simple method for learning distributed representations of structured output semantic representations, which allows us to capture cross-lingual features. Specifically, following previous work (Wong and Mooney, 2006; Jones et al., 2012; Susanto and Lu, 2017b), we adopt a commonly used tree-shaped form as the underlying meaning representation, where each tree node is a semantic unit. Our objective is to learn, for each semantic unit, a distributed representation useful for semantic parsing, based on multilingual datasets. Figure 1 depicts an instance of such a tree-shaped semantic representation, which corresponds to the two semantically equivalent sentences in English and German below it.
For such structured semantics, we consider each semantic unit separately. We learn distributed representations for individual semantic units based on multilingual datasets in which the semantic representations are annotated with sentences in different languages. Such distributed representations capture information shared across different languages. We extend two existing monolingual semantic parsers (Lu, 2015; Susanto and Lu, 2017b) to incorporate such cross-lingual features. To the best of our knowledge, this is the first work that exploits cross-lingual embeddings of logical representations for semantic parsing. Our system is publicly available at http://statnlp.org/research/sp/.

arXiv:1806.05461v1 [cs.CL] 14 Jun 2018

Related Work
Many research efforts on semantic parsing have been made, such as mapping sentences into lambda calculus forms based on CCG (Artzi and Zettlemoyer, 2011; Artzi et al., 2014; Kwiatkowski et al., 2011), modeling dependency-based compositional semantics (Liang et al., 2011; Zhang et al., 2017), or transforming sentences into tree-structured semantic representations (Lu, 2015; Susanto and Lu, 2017b). With the development of multilingual datasets, systems for multilingual semantic parsing have also been developed. Jie and Lu (2014) employed majority voting to combine the outputs of parsers trained for different languages to perform multilingual semantic parsing. Susanto and Lu (2017a) presented an extension of an existing neural parser, SEQ2TREE (Dong and Lapata, 2016), by developing a shared attention mechanism for different languages to conduct multilingual semantic parsing. Such a model allows two types of input signals: single-source (SL-SINGLE) and multi-source (SL-MULTI). However, semantic parsing with cross-lingual features has not been explored, even though many recent works on various NLP tasks show the effectiveness of information shared across different languages. Examples include semantic role labeling (Kozhevnikov and Titov, 2013), information extraction (Wang et al., 2013; Pan et al., 2017; Ni et al., 2017), and question answering (Joty et al., 2017), which motivate this work.
Our work involves exploiting distributed output representations for improved structured prediction, which is in line with the work of Srikumar and Manning (2014), Rocktäschel et al. (2014), and Xiao and Guo (2015). The work of Rocktäschel et al. (2014) is perhaps the most closely related to ours. The authors first map first-order logical statements produced by a semantic parser or an information extraction system into expressions in tensor calculus. They then learn low-dimensional embeddings of such statements with the help of a given logical knowledge base consisting of first-order rules, so that the learned representations are consistent with these rules. They adopt stochastic gradient descent (SGD) for optimization. In contrast, our work learns distributed representations of logical forms from cross-lingual data based on co-occurrence information, without relying on external knowledge bases.

Semantic Parser
In this work, we build our model and conduct experiments on top of the discriminative hybrid tree semantic parser (Lu, 2014, 2015). The parser was designed based on the hybrid tree representation (HT-G) originally introduced by Lu et al. (2008). The hybrid tree is a joint representation encoding both the sentence and its semantics, and it aims to capture the interactions between words and semantic units. A discriminative hybrid tree (HT-D) (Lu, 2014, 2015) learns the optimal latent word-semantics correspondence, where every word in the input sentence is associated with a semantic unit. Such a model allows us to incorporate rich features and long-range dependencies. Recently, Susanto and Lu (2017b) extended HT-D by attaching neural architectures, resulting in their neural hybrid tree (HT-D (NN)).
Since the correct correspondence between words and semantics is not explicitly given in the training data, we regard the hybrid tree representation as a latent variable. Formally, for each sentence n with its semantic representation m from the training set, we assume the joint representation (a hybrid tree) is h. We can then define a discriminative log-linear model as follows:

$$P_{\Lambda}(m \mid n) = \frac{\sum_{h \in H(n, m)} e^{F_{\Lambda}(n, m, h)}}{\sum_{m', h' \in H(n, m')} e^{F_{\Lambda}(n, m', h')}}$$ (1)

where H(n, m) returns the set of all possible joint representations that contain both n and m exactly, and $F_{\Lambda}$ is a scoring function calculated as the dot product between a feature function Φ defined over the tuple (n, m, h) and a weight vector Λ, that is, $F_{\Lambda}(n, m, h) = \Lambda \cdot \Phi(n, m, h)$.
To incorporate neural features, HT-D (NN) defines the following scoring function:

$$F_{\Lambda,\Theta}(n, m, h) = \Lambda \cdot \Phi(n, m, h) + G_{\Theta}(n, m, h)$$

where Θ is the set of parameters of the neural networks and $G_{\Theta}$ is the neural scoring function over the (n, m, h) tuple (Susanto and Lu, 2017b). Specifically, the neural features are defined over a fixed-size window surrounding a word in n paired with its immediately associated semantic unit. Following Susanto and Lu (2017b), we denote the window size as J ∈ {0, 1, 2}.
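The log-linear model above can be illustrated on a toy example in which the set of hybrid trees is small enough to enumerate. The following is a minimal sketch, not the parser's implementation: the feature names, weights, and candidate list are hypothetical, and the actual parser never enumerates H(n, m) explicitly but sums over it with dynamic programming.

```python
import math
from collections import defaultdict


def dot(weights, features):
    """Sparse dot product Lambda . Phi(n, m, h)."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def log_sum_exp(scores):
    """Numerically stable log(sum(exp(s))) over a non-empty list."""
    top = max(scores)
    return top + math.log(sum(math.exp(s - top) for s in scores))


def posterior(weights, candidates):
    """Return P(m | n) for every candidate semantic representation m.

    candidates holds one (m, features) pair per hybrid tree h, so the same m
    may appear several times; its probability mass is summed over those trees.
    """
    scores = [dot(weights, feats) for _, feats in candidates]
    log_z = log_sum_exp(scores)
    probs = defaultdict(float)
    for (m, _), score in zip(candidates, scores):
        probs[m] += math.exp(score - log_z)
    return dict(probs)


# Hypothetical weights and features, for illustration only.
weights = {"loc~have": 1.0, "traverse~have": 0.5, "state~states": 2.0}
candidates = [
    ("answer(state(loc(river(all))))", {"loc~have": 1.0, "state~states": 1.0}),
    ("answer(state(loc(river(all))))", {"loc~have": 1.0}),
    ("answer(state(traverse(river(all))))", {"traverse~have": 1.0, "state~states": 1.0}),
]
probs = posterior(weights, candidates)
```

Because two hybrid trees share the first semantic representation, its probability is the sum of their contributions, which mirrors the sum over h in the numerator of Equation (1).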

Cross-lingual Distributed Semantic Representations
A multilingual dataset used in semantic parsing comes with instances consisting of logical forms annotated with sentences from multiple different languages. In this work, we aim to learn one monolingual semantic parser for each language, while leveraging useful information that can be extracted from other languages. Our setup is as follows. Each time, we train the parser for a target language and regard the other languages as auxiliary languages. To learn cross-lingual distributed semantic representations from such data, we first combine all data involving all auxiliary languages to form a large dataset. Next, for each target language, we construct a semantics-word co-occurrence matrix $M \in \mathbb{R}^{m \times n}$ (where m is the number of unique semantic units and n is the number of unique words in the combined dataset). Each entry is the number of co-occurrences for a particular (semantic unit, word) pair. We use this matrix to learn a low-dimensional cross-lingual representation for each semantic unit. To do so, we first apply singular value decomposition (SVD) to this matrix, leading to:

$$M = U \Sigma V^{*}$$

where $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times m}$ are unitary matrices, $V^{*}$ is the conjugate transpose of V, and $\Sigma \in \mathbb{R}^{m \times m}$ is a diagonal matrix. We truncate the diagonal matrix Σ and left multiply it with U:

$$U \tilde{\Sigma}$$ (5)

where $\tilde{\Sigma} \in \mathbb{R}^{m \times d}$ is a matrix that consists of only the left d columns of Σ, containing the d largest singular values. We leave the rank d as a hyperparameter. Each row in the above matrix is a d-dimensional vector, giving a low-dimensional representation for one semantic unit. Such distributed output representations can be readily used as continuous features in Φ(n, m, h).
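The construction described above can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions rather than the released system: the toy instances and semantic unit names are made up, and a co-occurrence is assumed to be counted once for every word token in a sentence whose logical form contains the semantic unit (the text does not spell out the counting scheme).

```python
import numpy as np


def cooccurrence_matrix(instances):
    """Build the semantics-word count matrix M (units x words).

    instances: list of (semantic_units, words) pairs pooled over the
    auxiliary languages. Counting per word token is an assumption.
    """
    units = sorted({u for us, _ in instances for u in us})
    words = sorted({w for _, ws in instances for w in ws})
    unit_index = {u: i for i, u in enumerate(units)}
    word_index = {w: j for j, w in enumerate(words)}
    M = np.zeros((len(units), len(words)))
    for us, ws in instances:
        for u in us:
            for w in ws:
                M[unit_index[u], word_index[w]] += 1.0
    return M, units, words


def unit_embeddings(M, d):
    """Return U * Sigma truncated to the d largest singular values (one row per unit)."""
    U, s, _ = np.linalg.svd(M, full_matrices=False)  # s is sorted in descending order
    return U[:, :d] * s[:d]


# Toy data; the second sentence and all unit names are hypothetical.
instances = [
    (["answer", "exclude", "river", "traverse", "texas"],
     "welche flüsse fliessen nicht durch texas ?".split()),
    (["answer", "state", "loc", "river"],
     "welche staaten haben einen fluss ?".split()),
]
M, units, words = cooccurrence_matrix(instances)
E = unit_embeddings(M, d=2)  # E[i] is the 2-dimensional vector for units[i]
```

Each row of `E` could then be appended to the feature vector of the corresponding semantic unit. Semantic units that always occur with the same words, such as `exclude` and `traverse` in this toy data, receive identical vectors, which hints at why a purely co-occurrence-based method cannot separate every pair of units.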

Training and Decoding
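During training, we optimize the following objective function defined over the training set:

$$\mathcal{L}(\Lambda, \Theta) = \sum_{i} \log \sum_{m', h' \in H(n_i, m')} e^{F_{\Lambda,\Theta}(n_i, m', h')} - \sum_{i} \log \sum_{h \in H(n_i, m_i)} e^{F_{\Lambda,\Theta}(n_i, m_i, h)}$$

where $(n_i, m_i)$ is the i-th training instance. We follow the dynamic programming approach used in (Susanto and Lu, 2017b) to perform efficient inference, and follow the same optimization strategy as described there.
In the decoding phase, we are given a new input sentence n and find the optimal semantic tree $m^{*}$:

$$m^{*} = \arg\max_{m} \max_{h \in H(n, m)} F_{\Lambda,\Theta}(n, m, h)$$

Again, the above equation can be efficiently computed by dynamic programming (Susanto and Lu, 2017b).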
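To make the training loss and the decoding rule concrete, the sketch below evaluates both on an explicitly enumerated candidate list. It assumes the loss for one instance is the negative conditional log-likelihood under the log-linear model, and the scores are hypothetical values of the scoring function F; the real parser obtains the same quantities with dynamic programming rather than enumeration.

```python
import math


def log_sum_exp(scores):
    """Numerically stable log(sum(exp(s))) over a non-empty list."""
    top = max(scores)
    return top + math.log(sum(math.exp(s - top) for s in scores))


def instance_loss(scored, gold):
    """Negative log-likelihood of the gold semantics for one sentence.

    scored: list of (m, score) pairs, one per hybrid tree h, where score
    stands for F(n, m, h). Trees whose m equals gold form H(n, gold).
    """
    all_scores = [score for _, score in scored]
    gold_scores = [score for m, score in scored if m == gold]
    return log_sum_exp(all_scores) - log_sum_exp(gold_scores)


def decode(scored):
    """Return the m of the single highest-scoring hybrid tree (max over m and h)."""
    return max(scored, key=lambda pair: pair[1])[0]


# Hypothetical scores for three hybrid trees of one input sentence.
scored = [
    ("answer(state(loc(river(all))))", 3.0),
    ("answer(state(loc(river(all))))", 1.0),
    ("answer(state(traverse(river(all))))", 2.5),
]
loss = instance_loss(scored, gold="answer(state(loc(river(all))))")
best = decode(scored)
```

Note the asymmetry between the two functions: training sums over all hybrid trees compatible with the gold semantics, whereas decoding keeps only the single best tree.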

Datasets and Settings
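We evaluate our approach on the standard GeoQuery dataset annotated in eight languages (Wong and Mooney, 2006; Jones et al., 2012; Susanto and Lu, 2017b). We follow a standard evaluation practice that has been adopted in the literature (Lu, 2014, 2015; Susanto and Lu, 2017b). Specifically, to evaluate the proposed model, predicted outputs are transformed into Prolog queries. An output is considered correct if the answers that the query retrieves from the GeoQuery database are the same as the gold ones. The dataset consists of 880 instances. In all experiments, we follow the experimental settings and procedures in (Lu, 2014, 2015; Susanto and Lu, 2017b). In particular, we use 600 instances for training and 280 for testing, and set the maximum number of optimization iterations to 150. To tune the rank d, we randomly select 80% of the training instances for learning and use the remaining 20% for development. We report the value of d for each language, together with the F1 scores on the development set, in Table 1.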
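The data split and the reported metrics can be sketched as follows. This is a hedged illustration: the function names and the fixed random seed are hypothetical, and the F1 definition is an assumption (precision over the sentences for which the parser produced an output, recall over all sentences), since the text does not define it.

```python
import random


def split_train_dev(instances, dev_fraction=0.2, seed=0):
    """Randomly hold out a fraction of the training instances for tuning d."""
    rng = random.Random(seed)
    shuffled = list(instances)
    rng.shuffle(shuffled)
    n_dev = int(round(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]


def accuracy_and_f1(predicted_answers, gold_answers):
    """Compare retrieved answers with gold answers.

    predicted_answers[i] is None when no output was produced for sentence i.
    Assumed definitions: precision = correct / produced, recall = correct / total.
    """
    total = len(gold_answers)
    produced = sum(p is not None for p in predicted_answers)
    correct = sum(p is not None and p == g
                  for p, g in zip(predicted_answers, gold_answers))
    accuracy = correct / total
    precision = correct / produced if produced else 0.0
    recall = correct / total
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, f1


# 600 training instances split 80/20 for learning and development.
learn_ids, dev_ids = split_train_dev(range(600))

# Toy answers: two correct, one missing output, one wrong.
acc, f1 = accuracy_and_f1(["a", None, "c", "x"], ["a", "b", "c", "d"])
```

Under these assumed definitions, accuracy and F1 differ only when the parser fails to produce an output for some sentences.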

Results
We compare our models against different existing systems, especially the two baselines HT-D (Lu, 2015) and HT-D (NN) (Susanto and Lu, 2017b) with different word window sizes J ∈ {0, 1, 2}. WASP (Wong and Mooney, 2006) is a semantic parser based on statistical phrase-based machine translation. UBL-S (Kwiatkowski et al., 2010) induces probabilistic CCG grammars with higher-order unification, which allows general logical forms to be constructed for input sentences. TREETRANS (Jones et al., 2012) is built on a Bayesian inference framework. We run WASP, UBL-S, HT-G, SEQ2TREE and SL-SINGLE for comparison. Note that there exist multiple versions of the logical representations used in the GeoQuery dataset. Specifically, one version is based on lambda calculus expressions, and the other is based on the variable-free tree-shaped representation. We use the latter representation in this work, while SEQ2TREE and SL-SINGLE employ lambda calculus expressions. It was noted by Kwiatkowski et al. (2010) and Lu (2014) that evaluations based on these two versions are not directly comparable: the version that uses tree-shaped representations appears to be more challenging. We do not compare against Jie and Lu (2014) because their setup differs from ours.

Table 2 shows the results of the experiments we conducted on eight different languages. The highest scores are highlighted. We observe that when distributed logical representations are included, both HT-D and HT-D (NN) lead to competitive results. Specifically, when such features are included, the evaluation results for 5 out of 8 languages improve.
We found that the information shared across different languages could guide the model to make more accurate predictions, eliminating certain semantic-level ambiguities associated with the semantic units. This is exemplified by a real instance from the English portion of the dataset:

Input: Which states have a river?
Gold: answer(state(loc(river(all))))
Output: answer(state(traverse(river(all))))
Output (+O): answer(state(loc(river(all))))

Here the input sentence in English is "Which states have a river?", and the correct output is shown below the sentence. Output is the prediction from HT-D (NN), and Output (+O) is the parsing result given by HT-D (NN+O), where the learned cross-lingual representations of semantics are included. We observe that, by introducing our learned cross-lingual semantic information, the system is able to distinguish the two semantically related concepts, loc (located in) and traverse (traverse), and thus make a better prediction.
Interestingly, for German, the results become much lower when such features are included, indicating that such features are not helpful in the learning process for this language. The reasons for this need further investigation. We note, however, that it was previously reported in the literature that, in the presence of additional features, performance on this language behaves differently from performance on other languages (Lu, 2014).

Visualizing Output Representations
To qualitatively understand how good the learned distributed representations are, we also visualize the learned distributed representations for semantic units. In Figure 2, we plot the embeddings of a small set of semantic units, which are learned from all languages other than English. Each representation is a 30-dimensional vector and is projected into a 2-dimensional space using Barnes-Hut-SNE (Maaten, 2013) for visualization. In general, we found that semantic units expressing similar meanings tend to appear together. For example, the two semantic units STATE : smallest one ( density (STATE)) and STATE : one ( population (STATE)) share similar representations. However, we also found that semantic units conveying opposite meanings are occasionally grouped together. This reveals the limitations of such a simple co-occurrence-based approach to learning distributed representations for logical expressions.
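As a complement to the 2-D projection, closeness between semantic units can also be inspected directly in the original embedding space. The sketch below ranks units by cosine similarity; it is an illustrative check and not part of the reported experiments, and the unit names and 2-dimensional vectors are made up for the example.

```python
import numpy as np


def nearest_units(embeddings, names, query, k=3):
    """Return the k semantic units most cosine-similar to the query unit."""
    E = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(E, axis=1, keepdims=True)
    unit_vectors = E / np.where(norms == 0.0, 1.0, norms)  # guard against zero rows
    sims = unit_vectors @ unit_vectors[names.index(query)]
    order = [i for i in np.argsort(-sims) if names[i] != query]
    return [(names[i], float(sims[i])) for i in order[:k]]


# Hypothetical unit names and embeddings, for illustration only.
names = ["smallest_one(density)", "largest_one(population)", "loc", "traverse"]
embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
neighbours = nearest_units(embeddings, names, "smallest_one(density)", k=2)
```

A ranking of this kind makes it easy to spot the failure case noted above, where units with opposite meanings end up as close neighbours because they co-occur with the same words.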

Conclusions
In this paper, we empirically show that distributed representations of logical expressions learned from multilingual datasets for semantic parsing can be exploited to improve the performance of a monolingual semantic parser. Our approach is simple, relying on an SVD over a semantics-word co-occurrence matrix to find such distributed representations for semantic units. Future directions include investigating better ways of learning such distributed representations, as well as learning such distributed representations and semantic parsers in a joint manner.

Figure 2: 2-D projection of learned distributed representations for semantics.

Table 2: Performance on multilingual datasets. Acc.: accuracy (%), F: F1-measure (%). +O: including distributed representations for semantic units as features. († indicates systems that make use of lambda calculus expressions as meaning representations.)