Multi-level Alignment Pretraining for Multi-lingual Semantic Parsing

In this paper, we present a multi-level alignment pretraining method in a unified architecture for multi-lingual semantic parsing. In this architecture, we use an adversarial training method to align the space of different languages and use sentence-level and word-level parallel corpora as supervision information to align the semantics of different languages. Finally, we jointly train the multi-level alignment and semantic parsing tasks. We conduct experiments on a publicly available multi-lingual semantic parsing dataset, ATIS, and a newly constructed dataset. Experimental results show that our model outperforms state-of-the-art methods on both datasets.


Introduction
The goal of semantic parsing is to convert a natural language sentence into an executable logical form. The task has been studied extensively in the past few years and used in various applications, such as question answering (Kwiatkowski et al., 2011), task-oriented dialog systems (Yih et al., 2015) and interpreting instructions (Artzi and Zettlemoyer, 2013).
Due to the importance of semantic parsing, various approaches have been proposed for this task (Kwiatkowski et al., 2011; Jia and Liang, 2016; Dong and Lapata, 2018; Chen et al., 2018). However, most existing methods only handle monolingual semantic parsing, while real-world applications such as chatbots and search engines generally need to handle multi-lingual semantic parsing. Table 1 shows an example of the multi-lingual semantic parsing task, which aims to convert questions in different languages into the corresponding lambda calculus. For multi-lingual semantic parsing, previous works such as Jie and Lu (2014) and Susanto and Lu (2017) study the task from different perspectives. Jie and Lu (2014) train a model for each language separately and use an ensemble method to combine the models on a multi-lingual semantic parsing dataset. Susanto and Lu (2017) propose a hybrid combination method to model multi-source input. Both approaches need enough multi-lingual semantic parsing data for training. However, it is very hard to collect enough multi-lingual semantic parsing data.
EN: Who is the director of Inception?
ZH: 谁是电影Inception的导演
LF: λx.film film director(Inception, x)
Table 1: An example of our multi-lingual semantic parsing dataset, including a lambda calculus (LF) with the English (EN) and Chinese (ZH) question.
Recently, various pretraining methods have been successfully used to address the problem of insufficient labeled data in different tasks. In these methods, unsupervised data (Peters et al., 2017; Radford et al., 2018; Devlin et al., 2018) or richly supervised data from other tasks (McCann et al., 2017; Lample and Conneau, 2019) are used to pretrain the models, which yields significant performance improvements in different tasks.
In this paper, we propose a multi-level alignment pretraining method to align the space-level, word-level and sentence-level semantic representations of different languages. We design an adversarial training method to align the space-level representation using unsupervised data. To align the semantic-level representations of parallel corpora in different languages, we use a machine translation corpus and bilingual tokens to learn a shared cross-lingual encoder for our semantic parsing model. To better evaluate our method, we construct an open-domain multi-lingual semantic parsing dataset, since most existing multi-lingual semantic parsing datasets (Hemphill et al., 1990; Zettlemoyer and Collins, 2012) are domain-specific and relatively small in scale.
The main contributions of this paper are:
• We design a multi-level alignment pretraining method to pretrain the multi-lingual semantic parsing model.
• We construct a new open-domain multi-lingual semantic parsing dataset. We will release this dataset to support research on multi-lingual semantic parsing tasks.
• We conduct experiments on ATIS and our dataset. Experimental results show that our model achieves new state-of-the-art results on both datasets.

Model
In this section, we will first briefly introduce the basic sequence-to-sequence (S2S) model as our baseline model. Then, we introduce the architecture of our Multi-level Alignment pretraining for multi-lingual Semantic Parsing (MASP) model.

S2S Model for Semantic Parsing
The S2S model has been successfully used in recent semantic parsing tasks (Dong and Lapata, 2016).
The input of the model is a natural language question q = [x_1, x_2, ..., x_|q|] and the output is a logical form sequence l = [y_1, y_2, ..., y_|l|]. The tokens of the question q are fed one by one into the encoder, producing a sequence of encoder hidden states h = [h_1, h_2, ..., h_|q|]. In the decoding process, at each time step t, the decoder computes an attention distribution over the encoder hidden states to obtain a context vector c_t. The attention scores are computed with a non-linear function f, for which we use tanh, and the parameters u, W_e and b_e; s_t is the decoder hidden state at step t. The context vector c_t is used together with the hidden state s_t to compute the generation distribution P_v over the target vocabulary, where W_p, W, b and b_p are parameters.
In particular, to tackle out-of-vocabulary words, we incorporate the same copy mechanism as See et al. (2017) in our decoder. The attention score a_i is used as the probability of copying the i-th source word, which defines the copy distribution P_c. To combine the copy distribution with the generation distribution, we use a gate g_c to choose whether to copy from q or generate from the target vocabulary. The gate is computed with the parameter vectors W_* and b_*, and z_{t−1} is the word embedding of the previous word. Combining the two distributions through the gate gives the final vocabulary distribution P_f(y_t) for step t.
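The decoding step described above can be illustrated with a small sketch. It is not the authors' implementation: the mixing rule P_f(y) = g_c · P_c(y) + (1 − g_c) · P_v(y) follows the pointer-generator form of See et al. (2017) and is an assumption here, as are all function names and the toy values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(attn, hidden_states):
    """c_t: attention-weighted sum of the encoder hidden states."""
    dim = len(hidden_states[0])
    return [sum(a * h[k] for a, h in zip(attn, hidden_states)) for k in range(dim)]


def copy_distribution(attn, source_tokens):
    """P_c: attention mass accumulated per source word (repeated words add up)."""
    p_c = {}
    for a, tok in zip(attn, source_tokens):
        p_c[tok] = p_c.get(tok, 0.0) + a
    return p_c


def final_distribution(p_v, p_c, g_c):
    """Assumed mixture: P_f(y) = g_c * P_c(y) + (1 - g_c) * P_v(y)."""
    words = set(p_v) | set(p_c)
    return {w: g_c * p_c.get(w, 0.0) + (1.0 - g_c) * p_v.get(w, 0.0) for w in words}


# Toy example: "Inception" is absent from the target vocabulary but can be copied.
source_tokens = ["who", "directed", "Inception"]
attn = softmax([0.1, 0.2, 2.0])            # hypothetical attention scores
p_v = {"lambda": 0.6, "director": 0.4}     # hypothetical generation distribution
p_f = final_distribution(p_v, copy_distribution(attn, source_tokens), g_c=0.5)
```

Because both input distributions sum to one, the mixture is again a valid distribution, and an out-of-vocabulary source word such as "Inception" receives non-zero probability only through the copy term.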
We compute the overall loss over all decoding steps.
In this section, we introduce our Multi-level Alignment pretraining for multi-lingual Semantic Parsing (MASP) model. Our model uses pretraining to incorporate rich unsupervised and supervised corpora and to align sentences in different languages into a shared multi-lingual space. We then apply our model to the multi-lingual semantic parsing task.

Multi-level Alignment
In this section, we introduce our alignment strategies. Our model integrates three alignment strategies, at the space, word and sentence levels, to learn shared semantic information during training. Space-level alignment only uses monolingual corpora, word-level alignment needs a bilingual dictionary, and sentence-level alignment learns shared semantic information from a parallel corpus. The input of the multi-lingual model is pairwise and consists of two sentences, Q_E = {x_1, x_2, ..., x_n} in English and Q_C = {c_1, c_2, ..., c_m} in Chinese, where n and m are the lengths of Q_E and Q_C.

Space-level Alignment
In this section, we design an adversarial learning method to maximize the confusion between the representations of the two languages, an approach that has been successfully used in domain adaptation (Tzeng et al., 2017). Questions from the two languages can be considered as two special domains. The distributions of their representations are quite different, which harms the performance of a shared semantic parsing model. To align the distribution space of the two languages, we use adversarial learning to maximize the confusion between them, which aligns the distributions of the sentence representations.
The discriminator D aims to distinguish whether the input representation is from English or Chinese. In our model, the discriminator D is a binary classifier with a standard softmax layer. The input of D is all hidden states from the shared RNN encoder, e.g., h^e for Q_E. Furthermore, we give an extra label y_l ∈ {0, 1} to the discriminator D to indicate which language its input belongs to.
The discriminator D sums up all hidden states as input features and predicts which language the encoded sentence belongs to. For the English question Q_E, the discriminator applies the softmax layer with trainable parameters M_a and b_a to the summed hidden states, producing P_ad, the probability distribution over the labels that indicate the language type. The distribution for Q_C is computed in the same way as for Q_E. We then compute the cross-entropy loss L_ad of the discriminator D. For our multi-lingual model, we maximize the reversal classification loss to optimize the parameters, which aims to confuse the discriminator; we denote this reversal loss by L_g. This strategy aligns the sentence representation space of different languages and helps our model learn shared semantic information.
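A rough sketch of the language discriminator may help make the two objectives concrete. The sum pooling, the softmax layer with M_a and b_a, and the cross-entropy loss L_ad follow the description above. The form of the reversal loss is an assumption: here it is written as the cross entropy against the flipped language label, a common choice in adversarial domain adaptation, and the function names and toy parameters are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def discriminator_probs(hidden_states, M_a, b_a):
    """P_ad = softmax(M_a * sum_i h_i + b_a) over the two language labels."""
    dim = len(hidden_states[0])
    pooled = [sum(h[k] for h in hidden_states) for k in range(dim)]
    logits = [sum(w * x for w, x in zip(row, pooled)) + b for row, b in zip(M_a, b_a)]
    return softmax(logits)


def discriminator_loss(probs, y_l):
    """L_ad: cross entropy of the discriminator against the true language label."""
    return -math.log(probs[y_l])


def confusion_loss(probs, y_l):
    """Assumed reversal loss: cross entropy against the flipped label (1 - y_l)."""
    return -math.log(probs[1 - y_l])


# Toy example with two 2-dimensional hidden states of an English question (label 0).
h_e = [[1.0, 0.0], [1.0, 0.0]]
M_a = [[1.0, 0.0], [-1.0, 0.0]]  # hypothetical discriminator weights
b_a = [0.0, 0.0]
p_ad = discriminator_probs(h_e, M_a, b_a)
l_ad = discriminator_loss(p_ad, y_l=0)  # small when D recognizes the language
l_g = confusion_loss(p_ad, y_l=0)       # large when D is not yet confused
```

In this formulation the discriminator is updated to reduce `l_ad`, while the encoder is updated so that the discriminator can no longer tell the two languages apart.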

Word-Level Alignment
The space-level alignment strategy can align the distribution space of the two languages. However, the shared semantic information is not aligned. In this section, we introduce our word-level alignment strategy, which maps monolingual word embeddings into a shared cross-lingual semantic space with a dictionary of bilingual lexicons. The model is first initialized with pretrained word embedding matrices, trained by word2vec-based methods (Mikolov et al., 2013; Bojanowski et al., 2017) in the two languages. We define the two word embedding matrices as X_E ∈ R^{|X_E| × d} in English and X_C ∈ R^{|X_C| × d} in Chinese, where d is the dimension of the word embeddings. Because the word embedding matrix of each language is pretrained separately, the embeddings of words that have the same meaning are unaligned, which makes it harder for our model to encode sentences. Thus, in order to map X_E and X_C into a shared semantic space, we define two linear transformation matrices W_E and W_C as the multi-lingual projection. The matrices apply a linear transformation to X_E and X_C to align their embeddings in each dimension. Then, we optimize our model on the monolingual corpus with an extra bilingual lexicon dictionary B.
Formally, we compute the multi-lingual representations and add a word alignment loss defined in terms of cos(·, ·), the function that computes the cosine distance of two vectors. Through Eq. 10, we align the word embeddings in the pretraining process to improve the encoder, which shares its parameters between the two languages.
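The word-level objective can be sketched as follows. The projections W_E and W_C and the dictionary B come from the description above; the concrete loss, a sum of 1 − cosine similarity over the dictionary pairs, is an assumption made for illustration, and the function names and toy embeddings are hypothetical.

```python
import math


def project(W, x):
    """Apply the linear projection W (list of rows) to the embedding x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def cosine_similarity(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def word_alignment_loss(X_E, X_C, W_E, W_C, B):
    """Assumed loss: sum over (e, c) in B of 1 - cos(W_E x_e, W_C x_c)."""
    return sum(
        1.0 - cosine_similarity(project(W_E, X_E[e]), project(W_C, X_C[c]))
        for e, c in B
    )


# Toy 2-dimensional embeddings that are orthogonal before projection.
X_E = {"flight": [1.0, 0.0]}
X_C = {"航班": [0.0, 1.0]}
B = [("flight", "航班")]
identity = [[1.0, 0.0], [0.0, 1.0]]
swap = [[0.0, 1.0], [1.0, 0.0]]  # a projection that maps the Chinese axis onto the English one

loss_unaligned = word_alignment_loss(X_E, X_C, identity, identity, B)
loss_aligned = word_alignment_loss(X_E, X_C, identity, swap, B)
```

In the toy example, the loss drops from `loss_unaligned` to `loss_aligned` once W_C maps the Chinese embedding onto the direction of its English translation, which is the behavior the projection matrices are trained to produce.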

Sentence-Level Alignment
Word-level alignment aligns the monolingual embeddings, but it cannot cover more complex semantic information shared between languages, since semantically equivalent sentences in the two languages differ in structure. Aligning the two representations by training a semantic parser end to end would require a large amount of multi-lingual semantic parsing data, which is costly to annotate. However, there is abundant machine translation data that contains semantic alignment information across languages. Thus, we pretrain our model on these data to align the representations of semantically equivalent sentences. In our model, for each sentence pair Q_E and Q_C from the two languages, the shared BiLSTM encoder computes the contextual representations [h^e_1, h^e_2, ..., h^e_n] of Q_E and [h^c_1, h^c_2, ..., h^c_m] of Q_C respectively. We use the final hidden states h^e_n and h^c_m as the sentence representations. Sentence pairs with equivalent semantics should have similar sentence representations, which are used in the decoder to generate the same logical form. We also construct negative sentence pairs with different meanings by randomly pairing sentences from the two languages. Sentence-level alignment can thus be considered an auxiliary task that predicts whether a sentence pair from different languages has the same meaning, and we use an extra label y_s ∈ {0, 1} to indicate whether the sentence pair is semantically equivalent.
We then compute the sentence-level alignment loss L_s from these labeled sentence pairs.
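The construction of positive and negative sentence pairs described above can be sketched as follows. The sketch only covers the data side: translation pairs receive y_s = 1, and each English sentence is additionally paired with a randomly chosen Chinese sentence that is not its own translation, labeled y_s = 0. The function name and the sampling details (one negative per positive, fixed seed) are assumptions for illustration.

```python
import random


def build_sentence_pairs(parallel_pairs, seed=0):
    """Return (english, chinese, y_s) triples for sentence-level alignment.

    y_s = 1 marks a translation pair; y_s = 0 marks a randomly mismatched pair.
    One negative is sampled per positive, so the two classes are balanced.
    """
    rng = random.Random(seed)
    examples = [(en, zh, 1) for en, zh in parallel_pairs]
    n = len(parallel_pairs)
    if n < 2:
        return examples  # no mismatched partner is available
    for i, (en, _) in enumerate(parallel_pairs):
        j = rng.randrange(n - 1)
        if j >= i:
            j += 1  # skip index i so a sentence is never paired with its own translation
        # Note: duplicate sentences in the corpus could still yield a true translation.
        examples.append((en, parallel_pairs[j][1], 0))
    return examples


corpus = [
    ("Who is the director of Inception?", "谁是电影Inception的导演"),
    ("Show me flights to Boston.", "给我看去波士顿的航班"),
    ("What time is it?", "现在几点"),
]
examples = build_sentence_pairs(corpus)
```

Each element of `examples` can then be encoded by the shared encoder, and the label y_s supervises whether the two sentence representations should be close.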

Training Process
The full training process contains two steps: first, we pretrain our model with the three alignment strategies; then, we jointly train the model on multi-lingual semantic parsing datasets together with the pretraining corpus.

Pretraining
We pretrain our multi-lingual model with a machine translation parallel corpus (for the experiment without sentence alignment, we use a monolingual corpus instead) and a bilingual dictionary. The pretraining model contains the multi-lingual encoder and the discriminator D. We feed each pair of sentences in the two languages into our model together with the bilingual lexicon vocabulary B. The overall loss of the multi-lingual model contains the word-level alignment loss, the sentence-level alignment loss and the language confusion loss. Simultaneously, we alternately optimize the discriminator D with the loss L_ad until both losses converge. The pretrained model is then saved for multi-lingual semantic parsing tasks.

Joint Training
We initialize the parameters with the pretrained multi-lingual model and then finetune on the corresponding semantic parsing dataset. Each input sample in a multi-lingual semantic parsing dataset contains questions in different languages and the corresponding logical form. We train our model on these samples with Eq. 6. In order to keep the alignment property, we use a joint training method with the pretraining corpus and optimize the model with the loss L_ft, which combines the generation loss L_s2s and the alignment loss L_pre; α is used to control the weight of the alignment loss.
Table 2: Accuracy on the ATIS and MLSP datasets. "EN" represents the accuracy on English, and "ZH" represents the accuracy on Chinese. In our methods, "w/o" means that the corresponding alignment strategy is ablated.

Dataset Construction
In this section, we introduce a new multi-lingual semantic parsing (MLSP) dataset based on Satori. It contains a set of nodes and edges that are represented by triples {s, p, o}. Each triple denotes two nodes, a subject entity s and an object entity o, and the directed edge p between them as a predicate. We collect our dataset by crowdsourcing, which involves two steps: 1) First, we randomly collect connected triples in Freebase. Second, we annotate a simple question for each triple as a seed question. Third, we automatically generate complex questions for the connected triples from the simple questions of the selected triples using a template, following the procedure of ComplexWebQuestions (CWQ) (Talmor and Berant, 2018). Fourth, we ask native speakers to paraphrase the questions generated from the template. Fifth, three other annotators verify the quality of the paraphrased results and annotate three additional labels to indicate whether the paraphrased questions are semantically equivalent to the automatically generated questions. We obtained a two-vote consensus on 97% of the samples and dropped the remaining 3%.
2) To generate Chinese questions for our dataset, we first use Microsoft's translator to translate the English questions into Chinese. Then we ask annotators to translate the English questions into Chinese, given the machine-translated questions as a reference. Questions that are difficult to translate are labeled "None" and dropped from our experiment. After this step, about 92% of the questions are retained.

Dataset Analysis
In total, MLSP contains 15,991 samples. Each sample in our dataset has an English question, a Chinese question and a corresponding lambda calculus, which uses primary functions such as Argmax, Argmin, Argmore, Argless, Max and Min to denote basic operations. We also count the number of question patterns and logical form patterns, in which entity names are replaced with a placeholder: our dataset contains 7,482 question patterns and 3,429 logical form patterns. Compared with the existing datasets GEO and ATIS, which contain 880 and 5,410 samples respectively, MLSP is a large-scale open-domain dataset. We will release this dataset to advance research in multi-lingual semantic parsing.
To evaluate the quality of the dataset, we randomly select 1% of the annotated samples for a second check and find that 95% of these samples are correct. We will publish this dataset with more detailed instructions.

Experiment
We conduct our experiments on two datasets, ATIS and MLSP.

Datasets
ATIS contains 5410 queries from a flight booking system (Hemphill et al., 1990). The data samples have been split into 4348 training instances, 491 validation instances, and 448 test instances. Each pair contains a question and the lambda-calculus expression with the identified values for the variables of date, time, city, aircraft code, airport, airline, and number. The corpus was translated into Chinese with segmentation from (Susanto and Lu, 2017).
For our MLSP dataset introduced in Section 3, we randomly split the data into train/dev/test sets with a ratio of 0.8/0.1/0.1.
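A random 0.8/0.1/0.1 split of this kind can be sketched as below. The function name, the fixed seed and the rounding convention (train and dev sizes are rounded down, and the test set takes the remainder) are assumptions; the exact split sizes used in our experiments may differ by a few samples depending on these choices.

```python
import random


def split_dataset(samples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle the samples and split them into train/dev/test lists."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * ratios[0])
    n_dev = int(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # remainder, so no sample is lost to rounding
    return train, dev, test


# Sample ids stand in for (English question, Chinese question, logical form) triples.
train, dev, test = split_dataset(range(15991))
```

Fixing the seed keeps the split reproducible across runs, which matters when early stopping is driven by development-set accuracy.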
In pretraining, we use the English-Chinese translation corpus News Commentary v12 of WMT 2017 (Bojar et al., 2017). The English corpus is tokenized by NLTK (Bird and Loper, 2004) and the Chinese corpus is tokenized by the Jieba segmenter. For space-level and word-level alignment, we use the unsupervised corpus of Wikipedia. For the sentence-level alignment experiment, we also randomly sample the same number of sentence pairs as in the MT dataset and use them as negative samples. For the word-level alignment experiment, we construct a simple bilingual lexicon dictionary by translating the words contained in the English version into Chinese, and randomly collect 1k word pairs as bilingual lexicons. If a word pair appears in a sentence pair, we mark its positions with labels for the word-level alignment experiment. For ATIS, the pre-processing is the same as in Dong and Lapata (2016), which replaces entities with their type names. To evaluate our method in situations where no annotated multi-lingual semantic parsing dataset is available, we translate the English semantic parsing corpus with the open translation service of Microsoft. This is expected to be a common scenario in practice.

Settings
We set the vocabulary size to 50k for both languages in our model. We use GloVe 6B (Pennington et al., 2014) and the pretrained Chinese fastText vectors (Bojanowski et al., 2017) as the English and Chinese pretrained word embeddings. Words in the vocabulary that do not have pretrained embeddings are assigned uniformly random values. The size of the word embeddings is set to 300. During training, we update all word embeddings. We use accuracy on the development set to implement early stopping. Parameters are randomly initialized from a uniform distribution over (-0.01, 0.01). For regularization, we use dropout and set the dropout rate to 0.5. The dimensions of the hidden vectors in the encoder and decoder are 300. α in joint training is set to 0.1. Adagrad (Duchi et al., 2011) is used in training with an initial accumulator value of 0.1.
Table 2 shows the results of our model and the state-of-the-art methods on the multi-lingual ATIS and MLSP datasets. We report exact-match accuracy to evaluate our model. "SL-SINGLE" represents applying SEQ2TREE (Dong and Lapata, 2016) to each language separately, and "SL-SHARED/SEPARATE" denotes training the model with the shared/separate encoder of Susanto and Lu (2017). "SS-SINGLE" represents training the seq2seq model (described in Section 2.1) for each language separately. "SS-SHARED" denotes using both English and Chinese data to train the model. "MASP" is the model proposed in this paper. In addition, we report the baseline "Translated Test", in which we translate the test questions of one language and evaluate them on the baseline model trained with data in the other language.
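For reference, exact-match accuracy over predicted logical forms can be computed as in the following sketch. The comparison after whitespace tokenization and the function name are assumptions; any normalization applied before matching (for example, of variable names) is not covered here.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predicted logical forms that equal the reference token by token."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(
        1 for pred, ref in zip(predictions, references) if pred.split() == ref.split()
    )
    return correct / len(references)


# Hypothetical logical forms, written only to exercise the metric.
references = ["( lambda x ( director Inception x ) )", "( argmax x ( flight x ) )"]
predictions = ["( lambda x ( director Inception x ) )", "( argmin x ( flight x ) )"]
accuracy = exact_match_accuracy(predictions, references)
```

Under this metric a prediction receives credit only if the whole logical form is correct, so a single wrong token, as in the second toy example, counts as an error.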

Results
From the results, we observe that our model achieves new state-of-the-art results on all datasets. Comparing "SS-SHARED" with "SS-SINGLE", we see that merging the data of different languages does not bring a clear improvement, because Chinese and English differ at both the word and sentence levels. Compared with "SS-SHARED", the proposed model "MASP" achieves significant improvements in both languages, which illustrates the effectiveness of the multi-level alignment method.
We conduct an ablation study on the variants of "MASP" to investigate the effect of each alignment strategy. The last four lines in Table 2 show the results of ablating each alignment strategy in turn. "w/o SENT" represents the model without sentence-level alignment, and "w/o SPACE" and "w/o WORD" denote the models without space-level alignment and word-level alignment respectively. From the results, we observe that "MASP" outperforms "w/o SPACE", "w/o WORD" and "w/o SENT" in all settings, which illustrates that removing any of the alignments harms the performance of our model.

Alignment Analysis
We analyze the space level, word level and sentence level alignments in this section.

Space-level Alignment
We evaluate the performance of the discriminator in the space-level alignment. We use questions in the machine translation dataset with their language labels as input and feed the encoded representations into the discriminator. Figure 2 shows the discriminator results during pretraining. "MASP w/o CL" represents the model without the confusion loss in space alignment. The results show that the discriminator achieves high accuracy in discriminating the representations in "MASP w/o CL", while it is hard to discriminate the representations in "MASP" after 5 epochs. The results illustrate that our space alignment method successfully aligns the representation distributions of different languages and confuses the discriminator, which helps our model handle multi-lingual questions.

Word-Level Alignment
To evaluate the word alignment results, we compute the cosine distance between the tokens of the questions in the two languages. Figure 3(a) shows the results without word-level alignment and Figure 3(b) shows the results with word-level alignment. From the results, we observe that most words from different languages with the same meaning have been aligned by our word alignment method. For example, "flight" and "航班" are semantically equivalent in English and Chinese, and their embeddings are close in cosine similarity, whereas the model without our alignment strategy does not show the same trend. This shows that our word alignment layer can successfully transform the word embeddings into a shared multi-lingual space and thus helps to improve the model performance.

Sentence-Level Alignment
We also evaluate our sentence alignment results with an auxiliary classification task. We use the pretrained model to encode sentence pairs in different languages and use the cosine similarity of the sentence representations to predict whether the two sentences have the same meaning. We assume that if the cosine similarity is greater than 0, the two sentences are semantically equivalent. We evaluate our model on the multi-lingual ATIS datasets. We use their question pairs as positive samples and randomly pair the same number of Chinese and English questions as negative samples to compute the classification accuracy. We find that the accuracy is up to 97% on both datasets with our sentence-level alignment method, whereas without sentence-level alignment the accuracy is 61.3%. This experiment shows that our sentence-level alignment method successfully aligns sentence representations semantically.
In real-world scenarios, we usually only have a monolingual semantic parsing dataset instead of a multi-lingual one. In this section, we use Microsoft Translator to generate a Chinese semantic parsing corpus from the English corpus and use these data to evaluate our model. Table 3 shows the results of our model and the baseline methods. From the results in Table 3 and Table 2, we find that the performance with the translated corpus is lower than with annotated data on both English and Chinese. However, with the translated corpus, our method can still improve the target language performance, which shows its robustness. The results demonstrate that the multi-lingual data on both ATIS and MLSP effectively improve the semantic parsing performance. We also see that our model achieves state-of-the-art results in all settings, which shows the effectiveness and robustness of our method. This experiment illustrates that our method can be used in real-world scenarios with the help of an existing machine translator.

Related Work
Semantic parsing, as an important task in natural language understanding, has attracted significant attention in research and industry. Recently, various semantic parsing models have been proposed (Kwiatkowski et al., 2011; Xiao et al., 2017; Yin and Neubig, 2017; Fan et al., 2017; Dong and Lapata, 2016; Chen et al., 2018; Dong and Lapata, 2018). Kwiatkowski et al. (2011) propose a combinatory categorial grammar induction technique for semantic parsing. Xiao et al. (2017) and Yin and Neubig (2017) use grammar and syntax information to improve semantic parsing models. Fan et al. (2017) apply a transfer learning method to semantic parsing. Dong and Lapata (2016) propose a tree-based decoder to model the structure of logical forms. Chen et al. (2018) cast the decoding process as a sequence of actions with a sequence-to-sequence model. Recently, Dong and Lapata (2018) propose a two-stage model that decodes the logical form with the help of sketches, which contain the structure and predicates of logical forms.
In multi-lingual semantic parsing, Jie and Lu (2014) use a majority-voting ensemble method to combine the outputs of parsers for individual languages. Zhang et al. (2018) use a sequence-to-sequence model to map questions in the source language into decompositional semantic representations in the target language. Susanto and Lu (2017) propose a combination method that combines questions in different languages simultaneously as multi-source input and achieve promising improvements on ATIS (Hemphill et al., 1990). They also explore different architectures for single-source input without their combination mechanism. Zou and Lu (2018) propose a method to learn a cross-lingual representation and use it in their semantic parsing model (Zettlemoyer and Collins, 2012).

Conclusion
In this paper, we propose a multi-lingual semantic parsing model, which is first pretrained using a multi-level alignment mechanism and then jointly trained on the multi-lingual semantic parsing and multi-level alignment tasks. Because most existing multi-lingual semantic parsing datasets are domain-specific, we annotate a relatively large-scale open-domain multi-lingual semantic parsing dataset to better evaluate our method. Experimental results on ATIS and our dataset show the effectiveness and robustness of our model.