Neural Architectures for Multilingual Semantic Parsing

In this paper, we address semantic parsing in a multilingual context. We train a single multilingual model that parses natural language sentences from multiple languages into their corresponding formal semantic representations. We extend an existing sequence-to-tree model to a multi-task learning framework that shares the decoder for generating semantic representations. We report evaluation results on the multilingual GeoQuery corpus and introduce a new multilingual version of the ATIS corpus.


Introduction
In this work, we address multilingual semantic parsing: the task of mapping natural language sentences from multiple languages into their corresponding formal semantic representations. We consider two multilingual scenarios: 1) the single-source setting, where the input consists of a single sentence in a single language, and 2) the multi-source setting, where the input consists of parallel sentences in multiple languages. Previous work handled the former by means of monolingual models (Wong and Mooney, 2006; Lu et al., 2008; Jones et al., 2012), while the latter has only been explored by Jie and Lu (2014), who ensembled multiple monolingual models. Unfortunately, training a model for each language separately ignores the information shared among the source languages, which can be beneficial for typologically related languages. Practically, it is also inconvenient to train, tune, and configure a new model for each language, which can be a laborious process.
In this work, we propose a parsing architecture that accepts as input sentences in several languages. We extend an existing sequence-to-tree model (Dong and Lapata, 2016) to a multi-task learning framework, motivated by its success in other fields, e.g., neural machine translation (MT) (Dong et al., 2015; Firat et al., 2016). Our model consists of multiple encoders, one for each language, and one decoder that is shared across source languages for generating semantic representations. In this way, the proposed model potentially benefits from having a generic decoder that works well across languages. Intuitively, the model encourages each source language encoder to find a common structured representation for the decoder. We further modify the attention mechanism (Bahdanau et al., 2015) to integrate multi-source information, such that it can learn where to focus during parsing, i.e., which input positions in which languages.
Our contributions are as follows:
• We investigate semantic parsing in two multilingual scenarios that are relatively unexplored in past research;
• We present novel extensions to the sequence-to-tree architecture that integrate multilingual information for semantic parsing; and
• We release a new ATIS semantic dataset annotated in two new languages.

Related Work
In this section, we summarize semantic parsing approaches from previous work. Wong and Mooney (2006) created WASP, a semantic parser based on statistical machine translation. Lu et al. (2008) proposed generative hybrid tree structures, which were augmented with a discriminative reranker. CCG-based semantic parsing systems have also been developed, such as ZC07 (Zettlemoyer and Collins, 2007) and UBL (Kwiatkowski et al., 2010). Researchers have proposed sequence-to-sequence parsing models (Jia and Liang, 2016; Dong and Lapata, 2016; Kočiský et al., 2016). Recently, Susanto and Lu (2017) extended the hybrid tree with neural features. Recent progress in multilingual NLP has moved towards building a unified model that can work across different languages, such as in multilingual dependency parsing (Ammar et al., 2016), multilingual MT (Firat et al., 2016), and multilingual word embeddings (Guo et al., 2016). Nonetheless, multilingual approaches to semantic parsing are relatively unexplored, which motivates this work. Jones et al. (2012) evaluated an individually trained tree transducer on a multilingual semantic dataset. Jie and Lu (2014) ensembled monolingual hybrid tree models on the same dataset.

Model
In this section, we describe our approach to multilingual semantic parsing, which extends the sequence-to-tree model by Dong and Lapata (2016). Unlike the mainstream approach that trains one monolingual parser per source language, our approach integrates N encoders, one for each language, into a single model. This model encodes a sentence from the n-th language, $X = x_1, x_2, \ldots, x_{|X|}$, as a vector and then uses a shared decoder to decode the encoded vector into its corresponding logical form $Y = y_1, y_2, \ldots, y_{|Y|}$. We consider two types of input: 1) a single sentence in one of the N languages in the single-source setting and 2) parallel sentences in all N languages in the multi-source setting. We elaborate on each setting in Sections 3.1 and 3.2, respectively.
The encoder is implemented as a unidirectional RNN with long short-term memory (LSTM) units (Hochreiter and Schmidhuber, 1997), which takes a sequence of natural language tokens as input. Similar to previous multi-task frameworks, e.g., in neural MT (Firat et al., 2016; Zoph and Knight, 2016), we create one encoder per source language, i.e., $\{\Psi^n_{\mathrm{enc}}\}_{n=1}^{N}$. The n-th encoder updates its hidden vector at time step $t$ by:

$h^n_t = \Psi^n_{\mathrm{enc}}(h^n_{t-1}, E^n_x[x_t])$ (1)

where $\Psi^n_{\mathrm{enc}}$ is the LSTM function and $E^n_x \in \mathbb{R}^{|V| \times d}$ is an embedding matrix containing the row vectors of the source tokens in the n-th language. Each encoder may be configured differently, such as by the number of hidden units and the embedding dimension for the source symbols.

Figure 1: Illustration of the model with three language encoders and a shared logical form decoder (in λ-calculus). Two scenarios are considered: (a) single-source and (b) multi-source with a combiner module (in grey).
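To make the per-language encoders concrete, the following is a minimal PyTorch sketch of one embedding matrix and one LSTM per source language. The class name `EncoderBank`, the dimensions, and the batching details are illustrative assumptions, not the authors' implementation.

```python
import torch.nn as nn

class EncoderBank(nn.Module):
    """One unidirectional LSTM encoder per source language (hypothetical names)."""
    def __init__(self, vocab_sizes, emb_dim=200, hidden_dim=200):
        super().__init__()
        # E^n_x: one embedding matrix per language n
        self.embeddings = nn.ModuleList(nn.Embedding(v, emb_dim) for v in vocab_sizes)
        # Psi^n_enc: one LSTM per language n (each could use different sizes)
        self.encoders = nn.ModuleList(
            nn.LSTM(emb_dim, hidden_dim, batch_first=True) for _ in vocab_sizes)

    def forward(self, token_ids, lang):
        """token_ids: (batch, |X|) token ids in the lang-th language."""
        emb = self.embeddings[lang](token_ids)
        states, (h_last, _) = self.encoders[lang](emb)
        # states holds h^n_1 .. h^n_|X|; h_last is h^n_|X|, later used
        # to initialise the shared decoder.
        return states, h_last.squeeze(0)
```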
In the basic sequence-to-sequence model, the decoder generates each target token in a linear fashion. However, in semantic parsing, such a model ignores the hierarchical structure of logical forms. To alleviate this issue, Dong and Lapata (2016) proposed a decoder that generates logical forms in a top-down manner, defining a "non-terminal" token <n> to indicate subtrees. At each depth in the tree, the tokens of the logical form are generated sequentially until the end-of-sequence token is output.
Unlike in the single-language setting, here we define a single, shared decoder $\Psi_{\mathrm{dec}}$ as opposed to one decoder per source language. We augment the decoder input with the parent non-terminal's information $p$ when computing the decoder state $z_t$, as follows:

$z_t = \Psi_{\mathrm{dec}}(z_{t-1}, [\tilde{y}_{t-1}; p])$ (2)

where $\Psi_{\mathrm{dec}}$ is the LSTM function and $\tilde{y}_{t-1}$ is the previous target symbol. The attention mechanism (Bahdanau et al., 2015; Luong et al., 2015) computes a time-dependent context vector $c_t$ (defined later in Sections 3.1 and 3.2), which is subsequently used to compute the probability distribution over the next symbol, as follows:

$d_t = \tanh(U z_t + V c_t)$ (3)

$p(y_t \mid y_{<t}, X) = \mathrm{softmax}(W d_t)$ (4)

where $U$, $V$, and $W$ are weight matrices. Finally, the model is trained to maximize the following conditional log-likelihood:

$\sum_{(X,Y) \in D} \sum_{t=1}^{|Y|} \log p(y_t \mid y_{<t}, X)$ (5)

where $(X, Y)$ refers to a ground-truth sentence-semantics pair in the training data $D$.

We use the same formulation above for the encoders and the decoder in both multilingual settings. Each setting differs in terms of: 1) the decoder state initialization, 2) the computation of the context vector $c_t$, and 3) the training procedure, which are described in the following sections.
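The PyTorch sketch below illustrates a single decoding step with parent feeding and the attentional output layer of Equations (2)-(4). The class and variable names are hypothetical, and the tree-structured control flow (emitting <n> and recursing into subtrees) is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedDecoderStep(nn.Module):
    """One step of the shared decoder: parent feeding plus the attentional
    output layer of Eqs. (2)-(4). Names and sizes are illustrative."""
    def __init__(self, target_vocab, emb_dim=200, hidden_dim=200):
        super().__init__()
        self.embed = nn.Embedding(target_vocab, emb_dim)
        # Psi_dec: the LSTM input is [e(y_{t-1}); p] (previous symbol + parent state)
        self.cell = nn.LSTMCell(emb_dim + hidden_dim, hidden_dim)
        self.U = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.V = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.W = nn.Linear(hidden_dim, target_vocab, bias=False)

    def forward(self, prev_y, parent, state, c_t):
        """prev_y: (batch,) ids of y_{t-1}; parent: (batch, hidden) parent
        non-terminal state p; state: (z_{t-1}, cell_{t-1}); c_t: context vector."""
        lstm_in = torch.cat([self.embed(prev_y), parent], dim=-1)
        z_t, cell_t = self.cell(lstm_in, state)
        d_t = torch.tanh(self.U(z_t) + self.V(c_t))   # Eq. (3)
        log_p = F.log_softmax(self.W(d_t), dim=-1)    # log of Eq. (4), summed in Eq. (5)
        return log_p, (z_t, cell_t)
```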

Single-Source Setting
In this setting, the input is a source sentence from the n-th language. Figure 1 (a) depicts a scenario where the model is parsing Indonesian input, while the English and Chinese encoders are inactive.
The last state of the n-th encoder is used to initialize the first state of the decoder. We may first need to project the encoder vector into a suitable dimension for the decoder, i.e., $z_0 = \phi^n_{\mathrm{dec}}(h^n_{|X|})$, where $\phi^n_{\mathrm{dec}}$ can be an affine transformation. Similarly, we may do so before computing the attention scores, i.e., $\tilde{h}^n_k = \phi^n_{\mathrm{att}}(h^n_k)$. We then compute the context vector $c^n_t$ as a weighted sum of the hidden vectors of the n-th encoder:

$\alpha^n_{k,t} = \frac{\exp(z_t \cdot \tilde{h}^n_k)}{\sum_j \exp(z_t \cdot \tilde{h}^n_j)}$ (6)

$c^n_t = \sum_k \alpha^n_{k,t} \tilde{h}^n_k$ (7)

We set $c_t = c^n_t$ when computing Equation 3.

We propose two variants of the model under this setting. In the first version, we define separate output weight matrices for each language, i.e., $\{U^n, V^n, W^n\}_{n=1}^{N}$. In the second version, the three weight matrices are shared across languages, essentially reducing the number of these parameters by a factor of N.
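The per-language attention of Equations (6)-(7) can be sketched as follows; the dot-product score between the decoder state and the projected encoder states, as well as the function name, are illustrative choices.

```python
import torch
import torch.nn.functional as F

def single_source_context(z_t, enc_states, att_proj):
    """Per-language attention (sketch of Eqs. 6-7, assuming a dot-product score).

    z_t:        (batch, hidden) current decoder state
    enc_states: (batch, |X|, hidden) states of the active n-th encoder
    att_proj:   phi^n_att, e.g. an nn.Linear projecting the encoder states
    """
    h_tilde = att_proj(enc_states)                               # (batch, |X|, hidden)
    scores = torch.bmm(h_tilde, z_t.unsqueeze(-1)).squeeze(-1)   # (batch, |X|)
    alpha = F.softmax(scores, dim=-1)                            # alpha^n_{k,t}
    c_t = torch.bmm(alpha.unsqueeze(1), h_tilde).squeeze(1)      # (batch, hidden)
    return c_t, alpha
```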
The training data consists of the union of sentence-semantics pairs in the N languages, where the source sentences are not necessarily parallel. We implement a scheduling mechanism that cycles through all languages during training, one language at a time: model parameters are updated on one batch from one language before moving on to the next. Similar to Firat et al. (2016), this mechanism prevents excessive updates towards a specific language.
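A minimal sketch of this scheduling, assuming one mini-batch iterator per language; the function name and loop structure are illustrative.

```python
import itertools

def language_schedule(batch_iters, num_updates):
    """Round-robin scheduler over per-language batch iterators (sketch):
    take one mini-batch from one language, update, then move to the next."""
    langs = itertools.cycle(range(len(batch_iters)))
    for _ in range(num_updates):
        lang = next(langs)
        yield lang, next(batch_iters[lang])

# Usage sketch: for lang, batch in language_schedule(iters, steps):
#     run encoder `lang`, the shared decoder, and one optimizer step on `batch`.
```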

Multi-Source Setting
In this setting, the input consists of semantically equivalent sentences in N languages. Figure 1 (b) depicts a scenario where the model parses English, Indonesian, and Chinese simultaneously. It includes a combiner module (the grey box), which we explain next.
The decoder state at the first time step is initialized by first combining the N final states of the encoders, i.e., $z_0 = \phi_{\mathrm{init}}(h^1_{|X|}, \cdots, h^N_{|X|})$, where we implement $\phi_{\mathrm{init}}$ by max-pooling.
We propose two ways of computing $c_t$ that integrate source-side information from multiple encoders. First, we consider word-level combination, where we combine the N encoders' hidden states at every time step and compute the attention over the combined states. Alternatively, in sentence-level combination, we first compute the context vector for each language in the same way as Equations 6 and 7, and then perform a simple concatenation of the N context vectors: $c_t = [c^1_t; \cdots; c^N_t]$.

Unlike the single-source setting, the training data consists of N-way parallel sentence-semantics pairs. That is, each training instance consists of N semantically equivalent sentences and their corresponding logical form.
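A sketch of the max-pooling initializer and the sentence-level concatenation follows; the function names are illustrative, and the word-level variant is not shown since it reuses the attention sketch above over the combined encoder states.

```python
import torch

def init_decoder_state(final_states):
    """phi_init: element-wise max-pooling over the N encoders' final states."""
    # final_states: list of N tensors, each of shape (batch, hidden)
    return torch.stack(final_states, dim=0).max(dim=0).values

def sentence_level_context(per_language_contexts):
    """Sentence-level combination: concatenate the per-language context
    vectors c^1_t .. c^N_t; the V matrix must then accept the wider input."""
    return torch.cat(per_language_contexts, dim=-1)
```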

Datasets and Settings
We conduct our experiments on two multilingual benchmark datasets, which we describe below. Both datasets use a meaning representation based on lambda calculus.
The GeoQuery (GEO) dataset is a standard benchmark for semantic parsing evaluation. The ATIS dataset contains natural language queries to a flight database. The data is split into 4,434 instances for training, 491 for development, and 448 for evaluation, following Zettlemoyer and Collins (2007). The original version includes English only. In this work, we annotate the corpus in Indonesian and Chinese. The Chinese corpus was annotated (with word segmentation) by a professional translation service, and the Indonesian corpus was annotated by a native Indonesian speaker.
We use the same pre-processing as Dong and Lapata (2016), where entities and numbers are replaced with their type names and unique IDs (see Section 3.6 of Dong and Lapata, 2016). English words are stemmed using NLTK (Bird et al., 2009). Each query is paired with its corresponding semantic representation in lambda calculus (Zettlemoyer and Collins, 2005).
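For illustration, a toy version of this pre-processing for English might look as follows; the `entity_types` lookup and the id scheme are assumptions, not the actual argument-identification procedure.

```python
from nltk.stem import PorterStemmer

_stemmer = PorterStemmer()

def preprocess_english(tokens, entity_types):
    """Toy sketch of the pre-processing: replace known entities with a type
    name plus a running id (e.g. 'boston' -> 'ci0') and stem other words.
    `entity_types` (token -> type name) is an assumed lookup table."""
    replacements, out = {}, []
    for tok in tokens:
        if tok in entity_types:
            if tok not in replacements:
                replacements[tok] = f"{entity_types[tok]}{len(replacements)}"
            out.append(replacements[tok])
        else:
            out.append(_stemmer.stem(tok))
    return out

# preprocess_english(["flights", "from", "boston"], {"boston": "ci"})
# -> ['flight', 'from', 'ci0']
```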
In all experiments, following Dong and Lapata (2016), we use a one-layer LSTM with 200-dimensional cells and embeddings. We use a mini-batch size of 20 with RMSProp updates (Tieleman and Hinton, 2012) for a fixed number of epochs, with gradient clipping at 5. Parameters are uniformly initialized in [−0.08, 0.08] and regularized using dropout (Srivastava et al., 2014). Input sequences are reversed. See Appendix A for detailed experimental settings.
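A sketch of one training update under this regime; whether clipping is by norm or by value, as well as the learning rate, are assumptions here (see Appendix A for the actual settings).

```python
import torch

def train_step(model, optimizer, batch, clip=5.0):
    """One update with RMSProp and gradient clipping (norm clipping assumed;
    `model(batch)` is taken to return the negative log-likelihood of Eq. 5)."""
    optimizer.zero_grad()
    loss = model(batch)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
    optimizer.step()
    return loss.item()

# optimizer = torch.optim.RMSprop(model.parameters())  # learning rate per Appendix A
```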
For each model configuration, we repeat all experiments 3 times with different random seeds to ensure that our findings are reliable; we found empirically that the random seed can affect SEQ2TREE performance, which is especially important given the relatively small size of the datasets. Following previous work on multi-task sequence-to-sequence learning (Luong et al., 2016), we report the average performance of the baseline and of our model. The evaluation metric is exact-match accuracy against the ground-truth logical forms. See Appendix B for the accuracy of individual runs.

Results
Table 1 compares the performance of the monolingual sequence-to-tree model (Dong and Lapata, 2016), SINGLE, with that of our multilingual model, MULTI, with separate and shared output parameters under the single-source setting described in Section 3.1. On average, both variants of the multilingual model outperform the monolingual model, by up to 1.34% average accuracy on GEO. Parameter sharing is shown to be helpful, in particular for GEO. We observe that the average performance increase on ATIS mainly comes from Chinese and Indonesian. We also find that although including English is often helpful for the other languages, it may hurt the performance on English itself.

Table 2 shows the average performance on multi-source parsing when combining 3 to 4 languages for GEO and 2 to 3 languages for ATIS. For RANKING, we combine the predictions of the monolingual models by selecting the one with the highest probability. We observe that system combination at the model level gives better average performance (by up to 4.29% on GEO) than combination at the output level. Word-level and sentence-level combination show comparable performance on both datasets. The benefit is more apparent when English is included in the combination.
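For reference, the RANKING combination reduces to picking the highest-scoring monolingual prediction, as in this small sketch (names are illustrative).

```python
def ranking_combine(predictions):
    """RANKING baseline sketch: each monolingual model proposes its best logical
    form with a (log-)probability; keep the most probable one across languages."""
    # predictions: list of (logical_form, log_prob) pairs, one per language
    best_form, _ = max(predictions, key=lambda p: p[1])
    return best_form
```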

Analysis
In this section, we report a qualitative analysis of our multilingual model. Table 3 shows example output from the monolingual models, SINGLE, trained on the three languages in ATIS, and from the multilingual model, MULTI, with sentence-level combination. This example demonstrates a scenario in which the multilingual model successfully parses the three input sentences into the correct logical form, whereas the individual models are unable to do so.

Figure 2 shows the alignments produced by MULTI (sentence) when parsing ATIS in the multi-source setting. Each cell in the alignment matrix corresponds to $\alpha^n_{k,t}$, computed by Equation 6. Semantically related words are strongly aligned, such as ground (en), darat (id), and 地面 (zh) with ground transport. This shows that such cross-lingual correspondences can be jointly learned by our multilingual model.
In Table 4, we summarize the number of parameters in the baseline and in our multilingual model. The number of parameters in SINGLE and RANKING equals the sum of the numbers of parameters of their monolingual components. It can be seen that our multilingual model is about 50-60% smaller than the baseline.

Conclusion
We have presented a multilingual semantic parser that extends the sequence-to-tree model to a multi-task learning framework. Through experiments, we show that our multilingual model performs better on average than 1) monolingual models in the single-source setting and 2) ensemble ranking in the multi-source setting. We hope that this work will stimulate further research in multilingual semantic parsing. Our code and data are available at http://statnlp.org/research/sp/.


A Detailed Experimental Settings

For all multilingual models, we initialize the encoders using the encoder weights learned by the monolingual models. For the multi-source setting, we also initialize the decoder using the decoder weights learned for the first language in the list of the combined languages.

B Additional Experimental Results
In Tables 6 and 7, we report the accuracy of the 3 runs of each model on each dataset. In both settings, we observe that the best accuracy on both datasets is often achieved by MULTI, consistent with the conclusion reached when averaging the results over all runs.