Improving Zero-Shot Cross-lingual Transfer for Multilingual Question Answering over Knowledge Graph

Multilingual question answering over knowledge graph (KGQA) aims to derive answers from a knowledge graph (KG) for questions in multiple languages. To be widely applicable, we focus on its zero-shot transfer setting. That is, we can only access training data in a high-resource language, while needing to answer multilingual questions without any labeled data in target languages. A straightforward approach is to resort to pre-trained multilingual models (e.g., mBERT) for cross-lingual transfer, but there is still a significant gap in KGQA performance between source and target languages. In this paper, we exploit unsupervised bilingual lexicon induction (BLI) to map training questions in the source language into those in the target language as augmented training data, which circumvents language inconsistency between training and inference. Furthermore, we propose an adversarial learning strategy to alleviate the syntax disorder of the augmented data, pushing the model toward both language- and syntax-independence. Consequently, our model narrows the gap in zero-shot cross-lingual transfer. Experiments on two multilingual KGQA datasets with 11 zero-resource languages verify its effectiveness.


Introduction
With the advance of large-scale human-curated knowledge graphs (KG), e.g., DBpedia (Auer et al., 2007) and Freebase (Bollacker et al., 2008), question answering over knowledge graph (KGQA) has become a crucial natural language processing (NLP) task to answer factoid questions. It has been integrated into real-world applications like search engines and personal assistants, so it attracts growing attention from both academia and industry (Liang et al., 2017; Hu et al., 2018; Shen et al., 2019).
Recently, a rising demand on KGQA systems is to answer multilingual questions, motivating us to focus on multilingual KGQA. However, building a large-scale KG, as well as annotating QA data, is costly for each new language, not to mention the many minority languages with few native annotators. Therefore, we adopt a zero-shot cross-lingual transfer setting: a KGQA model is developed to perform inference on multilingual questions with access only to training data and the associated KG in a high-resource language (e.g., English).

(* Work done during an internship at Microsoft. † Corresponding authors.)
Given the success of pre-trained monolingual encoders (Peters et al., 2018), some works (e.g., mBERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020)) pre-train a Transformer encoder (Vaswani et al., 2017) on large-scale non-parallel multilingual corpora in a self-supervised manner. Given an NLP task, a general paradigm for zero-shot cross-lingual transfer is then to fine-tune a pre-trained multilingual encoder on data in a data-rich (source) language; the fine-tuned model is generalizable enough to perform inference in other low-resource (target) languages with surprising prediction quality. This paradigm can be adapted to KGQA to build symbolic logical forms (e.g., query graphs (Yih et al., 2015)) for KG queries. However, a considerable KGQA performance gap remains between source and target languages, consistent with empirical results on a wide range of other tasks in prior works (Conneau et al., 2020).
To bridge the gap, translation approaches have proven effective on multilingual benchmarks (Hu et al., 2020; Liang et al., 2020). As a form of data augmentation, they perform source-to-target translation to obtain multilingual training data. Combined with advanced techniques (Cui et al., 2019; Fang et al., 2020), they achieve state-of-the-art effectiveness. But these approaches rely heavily on a well-performing translator. Such a translator is not always available, especially for a minority language, since its training requires a large volume of parallel bilingual corpora. Therefore, to be applicable to more languages, we assume in this work that neither translators nor parallel corpora are available.
In this paper, to adapt the translation approaches to our zero-resource scenario, we naturally propose to replace the fully supervised machine translator with unsupervised bilingual lexicon induction (BLI) for word-level translation. Specifically, as in prior works (Lample et al., 2018b; Artetxe et al., 2018), a BLI model is first trained on non-parallel bilingual corpora. Then, via the bilingual word alignments in BLI, we map the training questions in the source language into those in target languages to obtain augmented multilingual training data. Consequently, even simply learning a KGQA model on the augmented data can circumvent language inconsistency between training and inference and thus bridge the performance gap in zero-shot cross-lingual transfer. To explain why BLI is competent here, it is observed that KGQA mainly involves phrase-level semantics (Berant et al., 2013). Compared to other tasks that depend on sentence-level contextualization, KGQA is insensitive to long-term dependencies but benefits from language consistency.
Moreover, we propose an adversarial strategy to mitigate the syntax disorder caused by BLI. Specifically, we place a discriminator on top of the encoder, trained to distinguish whether the input is a grammatical question in the source language or a BLI-translated one in the target language. Meanwhile, jointly with the KGQA objective, the encoder is fine-tuned to fool the discriminator so that question representations are both language- and syntax-agnostic. The trained KGQA model is thus robust to syntax disorder and insensitive to the question's language, leading to superior performance on multilingual KGQA.
Experiments conducted on two multilingual KGQA datasets with 11 zero-resource languages verify the effectiveness of our approach.

KGQA Task Definition
We give a background of monolingual KGQA, followed by multilingual KGQA and its data format.
Monolingual KGQA. A knowledge graph G is comprised of a set of directed triples (h, p, t), where h ∈ E denotes a head entity, t ∈ E ∪ L denotes a tail entity or a literal value, and p ∈ P denotes a predicate between h and t. KGQA aims at generating answers for a natural language question q based on G. Usually a model M first parses the question q into an intermediate logical form, which is then transformed into a SPARQL query, and the answer is derived by executing the SPARQL query on G. An example is shown in Figure 1: the question at the bottom, the intermediate logical form in the upper right, and the corresponding SPARQL query at the top. Following Maheshwari et al. (2019), we take a restricted subset of λ-calculus, the query graph, as the intermediate logical form. Typically, a query graph consists of four types of nodes: grounded entity(s) (in rounded rectangles), existential variable(s) "?y" (in circles), a lambda variable "?x" (in a shaded circle), and an aggregation function (in a diamond).
Considering that entity linking is a standalone system with many available tools, we assume the grounded entities in a question are given. This avoids the uncertainty caused by entity linking and lets us focus on the query graph construction process.
Multilingual KGQA. We focus on a zero-shot cross-lingual transfer setting of KGQA. That is, we only have a labeled dataset D^src = {(q^src_l, s^src_l)}_{l=1}^N, as well as the associated knowledge graph G, in a high-resource language src, where q^src_l and s^src_l denote a natural language question and a formal query, respectively. We omit the example-index subscript l in D^src hereafter. Multilingual KGQA is to learn a model M which can answer questions q^tgt in multiple target languages tgt. A recent baseline is to fine-tune pre-trained multilingual models (e.g., mBERT) in src and directly perform inference in tgt.

Methodology
This section starts with a base framework for monolingual KGQA, followed by our proposed multilingual solutions. Lastly, details about training and inference are elaborated.

Base Monolingual Framework
Following Maheshwari et al. (2019), we present a base pipeline framework as in Figure 1 to construct query graphs. It consists of three modules: 1) inferential chain ranking, 2) type constraint ranking, and 3) aggregator classification.

Inferential Chain Ranking. An inferential chain (IC) refers to a sequence of directed predicates from a grounded entity to the lambda variable ?x. Given an entity e grounded from the question q, we first search its chain candidates C^e = (c^e_1, ..., c^e_n) by exploring legitimate predicate sequences starting from e in G. Following previous works (Yih et al., 2015), each candidate is scored by

a^e_i = SemMatch(q, c^e_i; θ^(IC)),   (1)

where a^e_i is a score for their relatedness, and the θ^(IC)-parameterized SemMatch(·) can be any model for pairwise relatedness, such as a Co-Attention network or BERT-based matching (Devlin et al., 2019). Finally, the result of this module is the top-1 ranked inferential chain, i.e., ĉ^e = arg max_{c^e_i} (a^e_i, ∀i = 1, ..., n).
Note, if there are multiple grounded entities in q, we will predict an inferential chain for each entity.
Type Constraint Ranking. Type constraints (TC) refer to the entity types specified in the question for each variable on an inferential chain. They can be used to disambiguate the entities and thus boost KGQA performance. For example, the answer entity(s) to the example question in Figure 1 are constrained by the type Scientist. Hence, type constraint ranking is proposed to capture such information, which is also achieved by a semantic matching model. Specifically, given the resulting inferential chain ĉ^e, we first enumerate type candidates T^e_y = {t^e_y1, ...} for the existential variable and T^e_x = {t^e_x1, ...} for the lambda variable. Then, because the gold type constraints of the two variables scarcely overlap, a single semantic matching model is adequate for both. Thus, we define the model to derive relatedness scores as

a^e_{*j} = SemMatch(q, t^e_{*j}; θ^(TC)),   (3)

where * ∈ {y, x} and j = 1, ..., |T^e_*|.
Finally, we get the type constraints for the existential and lambda variables by keeping candidates whose scores exceed a threshold γ^(thresh).

Aggregator Classification. Given the several answer formats in the dataset, aggregator classification (AC) is presented to distinguish the format among Bool, Count, and Entity(s). The principle of each is detailed in the middle right of Figure 1. Formally, a simple text classifier suffices:

p^(AC) = Classifier(q; θ^(AC)),   (5)

where Classifier(·) is composed of a contextualized encoder, a pooler, and an MLP with softmax.
Once the above is completed, their results can compose a query graph, which is transformed into SPARQL and then executed on G for the answer.

Proposed Multilingual KGQA Approach
Built upon the base framework detailed above, we extend it with multilingual inference capability, i.e., multilingual KGQA. We follow a recently popular zero-shot transfer paradigm (Conneau et al., 2020; Fang et al., 2020): a pre-trained multilingual encoder is fine-tuned only in src, and a translation-based data augmentation technique is integrated to narrow the performance gap between src and tgt. To illustrate the gap in KGQA, a 65% F1 score in English (src) vs. 54% in Italian (tgt) is observed with mBERT zero-shot transfer in our pipeline without any multilingual augmentation.
Distinct from prior works in this paradigm that require well-trained translators, we propose a fully unsupervised way for wide applicability, requiring neither tgt KGQA data nor src-tgt parallel corpora. It is natural to resort to bilingual lexicon induction (BLI), which offers unsupervised training and acceptable word-level translation quality. In the following, we first present a BLI-based augmentation for multilingual training data, followed by our adaptation of the monolingual base framework (§3.1) to the augmented data. Finally, we propose an adversarial learning strategy coupled with BLI-based augmentation for robust cross-lingual transfer. An illustration of our proposed semantic matching model with symbolic candidates is in Figure 2.

BLI-based Multilingual Augmentation
We leverage the BLI model by Lample et al. (2018b). First, it pre-trains monolingual word embeddings U^src ∈ R^{d×|V^src|} and U^tgt ∈ R^{d×|V^tgt|} in src and tgt respectively. Then, it learns a linear transformation to unsupervisedly align the word embeddings of the two languages into one space, i.e.,

Ŵ = arg min_W Σ_{(k,l)} Distance(W u^src_k, u^tgt_l),

where u^src_k and u^tgt_l denote the embeddings of the k-th src word and the l-th tgt word. The unsupervised alignment between the k-th src word and the l-th tgt word is captured by adversarial learning, and Distance(·) is implemented by cross-domain similarity local scaling (CSLS). Please refer to Lample et al. (2018b) for details.
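For reference, Lample et al. (2018b) refine the adversarially learned mapping with a closed-form Procrustes step over pseudo-aligned pairs; a minimal NumPy sketch of that solution (assuming the rows of X and Y are the paired src/tgt word embeddings; the iterative refinement loop is elided):

```python
import numpy as np

def procrustes(X, Y):
    """Closed-form orthogonal map W* = argmin_W ||X W^T - Y||_F with W
    orthogonal, i.e., W x_i ~ y_i for pseudo-aligned rows x_i, y_i.
    The solution is W = U V^T from the SVD of Y^T X (Schoenemann, 1966)."""
    U, _, Vt = np.linalg.svd(Y.T @ X)
    return U @ Vt
```

In Lample et al. (2018b), the pairs come from the adversarially induced dictionary, and alignment and dictionary extraction are alternated for several iterations.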
Based on the BLI model, we can build a word-by-word translator BLI^(trans)_{src→tgt} from src to an arbitrary tgt, as long as a monolingual corpus for tgt is available. Note that when performing word-level translation, we also employ CSLS to mitigate the hubness problem and find the most likely alignment. Then, we translate each question q^src in D^src into other languages:

q^tgt = BLI^(trans)_{src→tgt}(q^src),

where src denotes English (en) in our experiments while tgt can be one of 11 other languages, such as Farsi (fa), Italian (it), etc. Consequently, the q^tgt form the augmented multilingual data for model training.
Remark: Although BLI provides multilingual data, open questions remain. 1) Why is BLI competent here? KGQA mainly involves word-/phrase-level semantics of symbolic candidates, rather than the sentence-level semantics of most other NLP tasks. As shown for Modules 1 and 2 in Figure 1, the matching only involves morphological similarity (e.g., scientist vs. <dbo:Scientist>), synonymy (e.g., won an award vs. <dbp:prizes>), etc. Thus, KGQA is less sensitive to long-term context than other tasks. This was leveraged by Berant et al. (2013) to propose a phrase matching model for monolingual KGQA. 2) Will BLI lead to error propagation? Since the BLI model achieves a high Precision@10 but a relatively low Precision@1, a wrong translation and its corresponding ground truth are semantically similar. Intuitively, their word embeddings are spatially close to each other, so a wrong word-level translation is equivalent to applying tiny noise to the word embeddings, which hardly leads to error propagation when a robust pre-trained Transformer-based encoder is used.
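The CSLS retrieval used for word-by-word translation can be sketched in NumPy (a simplified version, assuming the embeddings are already mapped into the shared space and row-normalized; k = 10 is the neighborhood size used by Lample et al. (2018b)):

```python
import numpy as np

def csls_translate(src_vecs, tgt_vecs, k=10):
    """Word-by-word translation via CSLS (Lample et al., 2018b).
    src_vecs: (n_src, d) mapped, row-normalized source embeddings W @ u_src.
    tgt_vecs: (n_tgt, d) row-normalized target embeddings.
    Returns the index of the best target word for each source word."""
    sims = src_vecs @ tgt_vecs.T                              # cosine similarities
    # r_T: mean similarity of each source word to its k nearest target words
    r_src = np.sort(sims, axis=1)[:, -k:].mean(axis=1, keepdims=True)
    # r_S: mean similarity of each target word to its k nearest source words
    r_tgt = np.sort(sims, axis=0)[-k:, :].mean(axis=0, keepdims=True)
    csls = 2 * sims - r_src - r_tgt                           # penalize "hub" targets
    return csls.argmax(axis=1)
```

The two penalty terms down-weight target words that are near neighbors of many source words, which is exactly the hubness problem mentioned above.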

Multilingual Models
Symbolic Candidate Processing. For an inferential chain, we enrich each predicate on the chain by 1) transforming each camel-cased phrase into space-separated words, 2) prefixing +/- for directional information, and 3) concatenating the top-frequent types under the local closed-world assumption (Krompaß et al., 2015). For a type constraint, we simply transform each camel-cased phrase into space-separated words. In the following, we denote the text of a processed symbolic candidate as z, whether it is a chain or a type.
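To make the processing concrete, here is a sketch of the camel-case verbalization and +/- prefixing (the URI format and helper names are assumptions for illustration; the type-concatenation step is elided):

```python
import re

def verbalize_type(uri):
    """'<dbo:PrimeMinister>' -> 'prime minister'."""
    name = re.sub(r"^<\w+:|>$", "", uri)                        # strip '<dbo:' and '>'
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", name).lower()    # camelCase -> words

def verbalize_predicate(uri, direction="+"):
    """'<dbp:birthPlace>' -> '+ birth place'; direction '-' marks a
    reversed predicate on the chain."""
    return f"{direction} {verbalize_type(uri)}"
```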
Multilingual Semantic Matching Model. As detailed in §3.1, both the inferential chain ranking and type constraint ranking modules are built upon a semantic matching model between the question q and a symbolic candidate z. Note that z is always in src, while q can be in either src or BLI-translated tgt. Following common practice, we first concatenate q and z with special tokens (Devlin et al., 2019) and pass the result into a pre-trained multilingual Transformer encoder, i.e., v = Pool(Transformer(text)), where Pool(·) uses the contextualized embedding of [CLS] to represent the entire input. In this paper, the encoder is either mBERT (Devlin et al., 2019) or XLM-R (Conneau et al., 2020). Lastly, a 1-way multi-layer perceptron (MLP) built upon v calculates the matching score in Eq.(1) or Eq.(3).
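A sketch of this matching model in PyTorch (illustrative only: a small randomly initialized Transformer encoder stands in for the pre-trained mBERT/XLM-R, and a single linear layer for the 1-way MLP; tokenization of '[CLS] q [SEP] z' is assumed to happen upstream):

```python
import torch
import torch.nn as nn

class SemMatch(nn.Module):
    """Pairwise matching scorer: encode the concatenated question-candidate
    tokens, pool the [CLS] position, and score with a 1-way head."""
    def __init__(self, vocab_size=1000, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.scorer = nn.Linear(d_model, 1)       # 1-way head for the match score

    def forward(self, token_ids):                 # (batch, seq_len) token ids
        h = self.encoder(self.embed(token_ids))   # (batch, seq_len, d_model)
        v = h[:, 0]                               # Pool(.): take the [CLS] position
        return self.scorer(v).squeeze(-1)         # (batch,) relatedness scores
```

The same scorer is shared by Eq.(1) and Eq.(3), differing only in the candidate text z fed in.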
Multilingual Classification Model. As detailed in §3.1, a text classification model is required to identify the aggregator. To fit our zero-resource multilingual scenario, the model, consisting of a pre-trained multilingual encoder and an MLP-based prediction layer, can be directly fine-tuned on the augmented questions, i.e., q^src and q^tgt.

Syntax-agnostic Adversarial Strategy
Although training the KGQA model on BLI-augmented multilingual data circumvents language inconsistency, it inevitably introduces syntax disorder and grammatical problems, which could hurt performance. We thus present an adversarial strategy, paired with the BLI-augmented data, to push the Transformer encoder toward language- and syntax-independent representations. Formally, a discriminator is built upon the single vector representation v produced by the Transformer encoder:

p^(src) = Discriminator(v; θ^(dis)),

where p^(src) is the probability that the question is in the source language. The discriminator is trained to minimize

L^(adv)_{θ^(dis)} = −I^(src) log p^(src) − I^(tgt) log(1 − p^(src)).   (10)

On the contrary, the Transformer encoder is trained to fool the discriminator by minimizing the adversarial loss with flipped labels, i.e.,

L^(adv)_{θ^(enc)} = −I^(tgt) log p^(src) − I^(src) log(1 − p^(src)),

where I^(src) (resp. I^(tgt)) indicates whether the question is a grammatical src one (resp. a BLI-translated tgt one), and θ^(enc) denotes the encoder's parameters in each module.
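A minimal PyTorch sketch of the two adversarial objectives, under our reading of Eq.(10) and its label-flipped counterpart (the discriminator architecture and hidden size here are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# v: pooled [CLS] representation; is_src: 1.0 for grammatical src questions,
# 0.0 for BLI-translated tgt questions.
discriminator = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1))

def disc_loss(v, is_src):
    """Train the discriminator to tell src from BLI-translated tgt.
    v is detached so this loss updates only the discriminator."""
    logit = discriminator(v.detach()).squeeze(-1)
    return F.binary_cross_entropy_with_logits(logit, is_src)

def adv_loss(v, is_src):
    """Fool the discriminator: flipping the labels pushes the encoder
    toward language- and syntax-agnostic representations of v."""
    logit = discriminator(v).squeeze(-1)
    return F.binary_cross_entropy_with_logits(logit, 1.0 - is_src)
```

In training, `disc_loss` updates θ^(dis) while `adv_loss` is added (scaled by α) to each module's KGQA loss to update θ^(enc).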

Training
Before constructing the objectives, we conduct uniform negative sampling for the two ranking models, with the number of negatives capped at 100. The gold labels of a question q for the three modules stem from the formal query s^src. A margin-based hinge loss is defined for inferential chain ranking:

L^(IC) = Σ_D Σ_{i∈N} max(0, λ − ã^e + â^e_i),

where D is the augmented dataset, N is the set of negative chains, λ is the margin, ã^e is the score of the gold chain, and â^e_i is the score of a negative chain. The loss L^(TC) for type constraint ranking is defined analogously. Lastly, the loss of aggregator classification is

L^(AC) = −log p^(AC)_{[i=g]},

where p^(AC)_{[i=g]} denotes the probability corresponding to the gold aggregator class.
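The margin-based hinge loss can be sketched in PyTorch as follows (the margin value and batching are illustrative assumptions; the paper does not specify them):

```python
import torch

def hinge_ranking_loss(gold_score, neg_scores, margin=1.0):
    """Margin-based hinge loss for the ranking modules: push the gold
    candidate's score above each negative's by at least `margin`.
    gold_score: (batch,); neg_scores: (batch, n_neg)."""
    gaps = margin - gold_score.unsqueeze(1) + neg_scores
    return torch.clamp(gaps, min=0.0).sum(dim=1).mean()
```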
During training, the adversarial loss is added to each module's loss to compose the final training objective, i.e., L̂^(*) = L^(*) + α L^(adv)_{θ^(enc)}, where * ∈ {IC, TC, AC} and α balances the two terms.

Inference Algorithm
As in Algorithm 1, we provide a detailed procedure for model inference in target language.
We also provide an explanation of the query graph in Figure 1. In the example query graph shown on the right of the figure: a topic entity is first grounded as e = "<dbr:Ven.-Ram>" in a rounded rectangle; an existential variable in a circle denotes the intermediate entity set ?y = {h | (h, leaderName, e)}; a lambda variable in a shaded circle denotes the answer entity set ?x = {h | (h, prizes, e′) ∧ e′ ∈ ?y}; and an aggregator COUNT is finally applied to ?x, which is constrained by the entity type "<dbo:Scientist>". Note that the existential variable does not exist if only a 1-hop relation is expressed in a question, and if multiple topic entities are grounded, the multiple "?x" sets are merged by intersection.

Algorithm 1 Inference in Target Language.
Require: a question q in tgt and its grounded topic entities E_q; KG G; models θ^(IC), θ^(TC), θ^(AC)
1: Search the chain candidates C^e on G, ∀e ∈ E_q
2: Rank each C^e by Eq.(1), and keep the top-3 in C^e
3: C^e ← {c^e | c^e ∈ C^e ∧ Size(?x ∈ c^e) > 0}
4: ĉ^e ← Null
5: if Size(C^e) > 0 then ĉ^e ← the top-1 inferential chain in C^e
6: end if
7: Merge the chains {ĉ^e | ∀e ∈ E_q ∧ ĉ^e is not Null}
8: Rank type constraint candidates by Eq.(3) and apply the top-1 constraint w/ score > γ^(thresh)
9: Generate SPARQL and execute it on G for the answer entity set A
10: Identify the aggregator for q by Eq.(5)
11: A ← Aggregate(A) following Figure 1
12: return A

Datasets and Evaluation Metrics
We evaluate the proposed approach on two datasets, LC-QuAD (Trivedi et al., 2017) and QALD-multilingual (Usbeck et al., 2018), both of which contain questions with corresponding SPARQL queries over DBpedia. DBpedia is a large-scale knowledge graph extracted from Wikipedia pages, with 6 million entities, 60 thousand predicates, and 13 billion triples in the English edition.
LC-QuAD. LC-QuAD is a large-scale complex question answering dataset, which contains 5,000 English question-SPARQL pairs. We follow the official split with 1,000 questions in the test set, and further split the original training set into training/validation sets of 3,500/500 questions. To evaluate the effectiveness of multilingual KGQA, questions in the test set are translated into 10 languages (fa, de, ro, it, ru, fr, nl, es, hi, pt) using Google Translate.
QALD-multilingual. QALD is a series of evaluation campaigns on question answering over linked data. We collect all multilingual questions along with their SPARQL queries from QALD-4 to QALD-9 and filter out out-of-scope ones. There are 429 distinct question-SPARQL pairs overall, and most are expressed in 12 languages (en, fa, de, ro, it, ru, fr, nl, es, hi_IN, pt, pt_BR). Considering the small size of this dataset, we take all QALD-multilingual questions as the test set and use the training data of LC-QuAD for model training.
Evaluation Metrics. We adopt two widely used metrics following Maheshwari et al. (2019): inferential chain accuracy (ICA) and macro F1 score. The former measures the accuracy (i.e., Precision@1) of the inferential chain model, defined as the percentage of correctly predicted inferential chains. The macro F1 score measures the quality of the final answers. Please refer to Maheshwari et al. (2019) for details.

Experimental Setting
We evaluate our approach with two multilingual encoders, mBERT-base and XLM-R-base. The embedding and hidden sizes in both models are 768. We use the Adam optimizer (Kingma and Ba, 2015) to optimize the KGQA loss with a learning rate of 5×10^-5 and linear warmup (Vaswani et al., 2017). The maximum number of training epochs, the number of warmup epochs, and the batch size are set to 35, 3, and 32, respectively. The discriminator is trained along with each module's objective, with α set to 5×10^-4 for learning to fool. The discriminator is optimized via Adam with a learning rate of 5×10^-5. γ^(thresh) for the type constraint model is set to 0.7. We follow Maheshwari et al. (2019) for the other training parameters.

Main Results
We compare our approach with a natural, widely used baseline, which fine-tunes a pre-trained multilingual model (e.g., mBERT, XLM-R) on the source language and then directly applies it to target languages. The comparisons on QALD-multilingual and LC-QuAD with mBERT are reported in Tables 1 and 2, respectively. Our approach outperforms the baseline significantly on both datasets for all languages. ICA is improved by 1%-4%, and by 2.9% on average, on the QALD dataset. The improvement on LC-QuAD is even larger: the averaged ICA and F1 score across all languages increase by around 7% and 4%, respectively. Notably, with the BLI-augmented data and syntax-agnostic adversarial learning, the performance on source-language (i.e., English) questions also increases by a large margin: the F1 score increases from 65% to 66.7% on QALD, and from 80% to 85% on LC-QuAD. We also evaluate the proposed approach using XLM-R as the multilingual encoder. The comparison on QALD-multilingual is shown in Table 3. We observe similar improvements as with mBERT, where both the averaged ICA and F1 score increase by around 1%, verifying the effectiveness of our proposed approach.

Ablation Study
Our approach consists of two important components, BLI-based data augmentation and a syntax-agnostic adversarial learning strategy. We conduct an ablation study to investigate the effect of each component. Table 4 reports the averaged results over all target languages on QALD-multilingual and LC-QuAD-multilingual. With BLI-based data augmentation alone, our approach increases the ICA score on QALD by 1.7%, and syntax-agnostic adversarial learning further improves it by 1.2%. Similar improvements are observed on LC-QuAD, verifying the effectiveness of both components of our approach.

Analysis
Impact of BLI Accuracy. We assess the impact of BLI accuracy on five Romance languages (it, fr, es, pt, and ro) by injecting noise into the BLI results. Specifically, when mapping source-language words into a target language via BLI, we randomly replace translated words with wrong ones with a probability p ∈ {10%, 20%, 30%, 40%, 50%}. The averaged performance of our approach on the five languages is reported in Figure 3. As more noise is added, the performance of our approach drops, in accordance with intuition. But even when 50% of the translated words are noisy, our method still outperforms the baseline; for example, it is superior to the baseline by 1% in ICA with 50% noise, showing the robustness of our approach.
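The noise-injection probe can be sketched as follows (the `inject_bli_noise` helper and the uniform-vocabulary replacement are illustrative assumptions):

```python
import random

def inject_bli_noise(tokens, vocab, p):
    """Robustness probe: replace each BLI-translated word with a random
    (wrong) vocabulary word with probability p."""
    return [random.choice(vocab) if random.random() < p else t for t in tokens]
```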
Deep Dive into Adversarial Learning. We take the inferential chain ranking model as an example and take a deep dive into the impact of syntax-agnostic adversarial learning. The adversarial learning involves a discriminator that distinguishes whether a question is grammatical or syntax-disordered, and an inferential chain ranking model that identifies the gold chain. Their loss values, i.e., L^(adv)_{θ^(dis)} and L^(IC), are plotted in Figure 4. We can see that the classification loss of the discriminator quickly drops and then slowly goes up, indicating that the discriminator first performs well and is later fooled by the language-/syntax-agnostic embeddings generated by mBERT. Meanwhile, the inferential chain ranking loss drops quickly and stays very small in the following epochs, showing that even while generating syntax-agnostic embeddings, mBERT still supports inferential chain ranking well.

Case Study
We take several examples of inferential chain ranking to show how our approach works. We use t-SNE (Maaten and Hinton, 2008) to map the embedding of a question-chain pair to a two-dimensional point. A question in a specific language is paired with its gold inferential chain and its top-1 ranked negative candidate. Figure 5 compares the baseline with our approach on two questions. Positive and negative examples of the same question in different languages are plotted in the same figure. We can see that the baseline model cannot distinguish positive inferential chains from negative ones well, while our approach learns a language-agnostic representation that focuses more on ranking inferential chain candidates.

Related Work
There are mainly two categories of approaches to the monolingual question answering over knowledge graph (KGQA) task. (1) Information retrieval-based approaches align a question with its answer candidates in the same semantic space, where the candidates usually stem from KG neighbors of the topic entity detected in the question (Bordes et al., 2014a,b; Dong et al., 2015; Jain, 2016; Xu et al., 2016; Hao et al., 2017). (2) Semantic parsing-based approaches first translate a question into a corresponding logical form, e.g., a program (Guo et al., 2018; Shen et al., 2019) or a query graph (Yih et al., 2015; Jia and Liang, 2016; Xiao et al., 2016; Dong and Lapata, 2016; Liang et al., 2017; Dong and Lapata, 2018; Maheshwari et al., 2019), and then execute the logical form over the KG to derive the final answer. Note that a logical form is usually composed of a series of grammars or operators pre-defined by experts. This paper is in line with the second category and generates query graphs for KG execution. To the best of our knowledge, there are only a few works targeting multilingual KGQA (Hakimov et al., 2017; Veyseh, 2016), which rely on extensive multilingual training data with hand-crafted features and are inapplicable to the zero-shot transfer scenario. We therefore adopt the pipeline by Maheshwari et al. (2019) for the monolingual scenario as our base model, but update the encoders with the Transformer (Vaswani et al., 2017) to strengthen their expressive power and to accommodate recent pre-trained multilingual initializations.
Given task-specific data in a source language, cross-lingual models are trained to perform inference in target languages in a low- or zero-resource scenario. Typically, cross-lingual models are proposed in two paradigms. 1) The universal encoding-based paradigm represents multilingual natural language text as language-agnostic embeddings in the same semantic space. Early works focus on aligning multilingual word embeddings (Mikolov et al., 2013; Faruqui and Dyer, 2014; Xu et al., 2018), while recent efforts are mainly made on large-scale pre-trained multilingual encoders, such as mBERT (Devlin et al., 2019), XLM (Conneau and Lample, 2019), Unicoder (Huang et al., 2019a), XLM-R (Conneau et al., 2020), InfoXLM (Chi et al., 2020), and ALM. They can perform zero-shot cross-lingual transfer by training in the source language while directly performing inference in target languages. 2) The translation-based paradigm employs well-trained machine translators to map the training or test examples in the source language into the target language. Recent common practice tends to leverage the second paradigm to generate multilingual data to narrow the zero-shot cross-lingual performance gap of the first paradigm, which leads to state-of-the-art results on several cross-lingual benchmarks. In contrast, we consider a zero-resource scenario where translators are unavailable and thus resort to unsupervised BLI in light of KGQA's characteristics.
As a branch of universal encoding at the word level, bilingual lexicon induction (BLI) (a.k.a. cross-lingual word embedding, CLWE) is learned to align bilingual word embeddings in the same space, where the embeddings are pre-trained on monolingual corpora and the alignment is trained in either a (semi-)supervised or unsupervised manner (Smith et al., 2017; Lample et al., 2018b; Artetxe et al., 2018, 2019; Huang et al., 2019b; Patra et al., 2019; Karan et al., 2020; Zhao et al., 2020; Ren et al., 2020). To alleviate the "hubness" problem (Dinu and Baroni, 2015) in BLI, alternatives to the distance measurement have been proposed to substitute nearest neighbor (NN) during alignment, such as inverted-softmax (Smith et al., 2017) and CSLS (Lample et al., 2018b). In addition to building a bilingual dictionary via word-level translation, a well-trained BLI model can serve as a weak baseline for sentence-level translation (Lample et al., 2018a), a seed model for unsupervised translation (Lample et al., 2018a), or a bilingual variant of the copy mechanism in summarization.
Moreover, adversarial training is usually integrated into cross-lingual models for language-agnostic representation learning, such as unsupervised BLI (Lample et al., 2018b), unsupervised translation (Lample et al., 2018a), cross-lingual sequence labeling (Kim et al., 2017; Huang et al., 2019c), and cross-lingual classification. In contrast, our adversarial strategy not only derives language-agnostic representations but also makes the model insensitive to syntax disorder, and is thus competent in the zero-resource scenario.

Conclusion
We propose a novel approach for zero-shot cross-lingual transfer in multilingual KGQA, which augments training data via bilingual lexicon induction, and leverages a syntax-agnostic adversarial learning strategy to alleviate the syntax-disorder problem caused by BLI. Experimental results on two multilingual KGQA datasets with 11 zero-resource languages verify its effectiveness.