Cross-Lingual Dependency Parsing by POS-Guided Word Reordering

We propose a novel approach to cross-lingual dependency parsing based on word reordering. The words in each sentence of a source-language corpus are rearranged to match the word order of a target language under the guidance of a part-of-speech (POS) based language model (LM). To obtain the highest reordering score under the LM, we design a population-based optimization algorithm with dedicated genetic operators to deal with the combinatorial nature of word reordering. A parser trained on the reordered corpus can then be used to parse sentences in the target language. We demonstrate through extensive experimentation that our approach achieves better or comparable results across 25 target languages (a 1.73% increase on average) and outperforms a baseline by a significant margin on languages that differ greatly from the source. For example, when transferring the English parser to Hindi and Latin, our approach outperforms the baseline by 15.3% and 6.7%, respectively.


Introduction
The rise of machine learning (ML) methods and the availability of treebanks (Buchholz and Marsi, 2006) for a wide variety of languages have led to a rapid increase in research on data-driven dependency parsing (McDonald and Pereira, 2006; Nivre, 2008; Kiperwasser and Goldberg, 2016). However, the performance of dependency parsers relies heavily on the size of the training corpus. Due to the great cost and difficulty of acquiring sufficient training data, ML-based methods cannot be trivially applied to low-resource languages.
Cross-lingual transfer is a promising approach to tackle the lack of sufficient data. The idea is to train a cross-lingual model that transfers knowledge learned in one or multiple high-resource source languages to target ones. This approach has been successfully applied in various tasks, including part-of-speech (POS) tagging (Kim et al., 2017), dependency parsing (McDonald et al., 2011), named entity recognition (Xie et al., 2018), entity linking, question answering (Joty et al., 2017), and coreference resolution.
A key challenge for cross-lingual parsing is handling the word order differences between source and target languages, which often cause a significant drop in performance (Rasooli and Collins, 2017; Ahmad et al., 2019). Inspired by the idea that POS sequences often reflect the syntactic structure of a language, we propose CURSOR (Cross lingUal paRSing by wOrd Reordering) to overcome the word order difference issue in cross-lingual transfer. Specifically, we assume we have a treebank in the source language and an annotated POS corpus in the target language. We first train a POS-based language model on a corpus in the target language. Then, guided by this language model, we reorder the words in each sentence of the source corpus to create pseudo sentences with target-language word order. The resulting reordered treebank can be used to train a cross-lingual parser with multilingual word embeddings.
We formalize word reordering as a combinatorial optimization problem to find the permutation with the highest probability estimated by a POS-based language model. However, it is computationally difficult to obtain the optimal word order. To find a near-optimal result, we develop a population-based optimization algorithm. The algorithm is initialized with a population of feasible solutions and iteratively produces new generations by specially designed genetic operators. At each iteration, better solutions are generated by applying selection, crossover, and mutation subroutines to individuals in the previous iteration.
Our contributions are summarized as follows: (i) We propose a novel cross-lingual parsing approach, called CURSOR, to overcome the word order difference issue in cross-lingual transfer by POS-guided word reordering. We formalize word reordering as a combinatorial optimization problem and develop a population-based optimization algorithm to find a near-optimal reordering result.
(ii) Extensive experimentation with different neural network architectures and the two dominant parsing paradigms (graph-based and transition-based) shows that our approach achieves an increase of 1.73% in average UAS when English is taken as the source language and performance is evaluated on the other 25 target languages. For the RNN-Graph model in particular, our approach gains an increase of 2.5% in average UAS, and the improvement rises to 4.12% when combined with our data augmentation and ensemble methods.
(iii) Our approach performs exceptionally well when the target languages are quite different from the source one in their word orders. For example, when transferring the English RNN-Graph parser to Hindi and Latin, our approach outperforms a baseline by 15.3% and 6.7%, respectively.

Related Work
Many efforts (Zeman and Resnik, 2008; Cohen et al., 2011; Rosa and Žabokrtský, 2015) have been devoted to cross-lingual dependency parsing via transfer learning, in which manually annotated corpora are no longer required for low-resource languages. One of the challenges is that the word orders in source and target languages might be different (e.g., some languages are prepositional and some are postpositional). Various studies have been dedicated to addressing this issue (Naseem et al., 2012; Zhang and Barzilay, 2015; Wang and Eisner, 2017).
In particular, some studies propose to bypass the word order issue by selecting source languages that have word orders similar to the target language (Naseem et al., 2012; Rosa and Žabokrtský, 2015). Good source languages can be selected by measuring the similarity of POS sequences between the source and target languages (Agić, 2017), querying the information stored in typological databases (Deri and Knight, 2016), or formalizing the selection as a ranking problem (Lin et al., 2019).
Treebank translation (Tiedemann et al., 2014; Tiedemann and Agić, 2016) tackles this problem by transforming an annotated source treebank into instances with target-language grammar through machine translation. However, this method may suffer from imperfect word alignment between the two languages. Another line of work performs such syntactic transfer by code mixing, in which only the confident words in a source treebank are transformed.
Another interesting solution to cross-lingual transfer is annotation projection (Hwa et al., 2005; Ganchev et al., 2009; Ma and Xia, 2014). In this approach, the source-side sentences of a parallel corpus are parsed by a parser trained on the source treebank, and the source dependencies are then projected onto the target sentences using word alignments. However, the resulting treebank can be highly noisy because the source dependency trees are constructed automatically and cannot be taken as ground truth. Lacroix et al. (2016) considered removing poorly aligned sentences to obtain high-quality data. Meng et al. (2019) embraced the linguistic knowledge of target languages to guide inference. Some researchers also exploit lexical features to enhance parsing models: cross-lingual word clusters (Täckström et al., 2012), word embeddings (Guo et al., 2015, 2016; Ammar et al., 2016), and dictionaries (Durrett et al., 2012; Rasooli and Collins, 2017) are used as features to better transfer linguistic knowledge among languages.
Our work is in line with a recently proposed solution, namely treebank reordering (Wang and Eisner, 2016, 2018; Rasooli and Collins, 2019), which aims to rearrange the word order of source sentences to make them more similar to the target language. Wang and Eisner (2018) proposed to permute the constituents of an existing dependency treebank so that its surface POS statistics approximately match those of the target language. However, they used POS bigrams to measure the surface closeness between two languages, which cannot capture global information. Rasooli and Collins (2019) proposed two different syntactic reordering methods, one based on the dominant dependency direction in the target language and the other learning a reordering classifier, but both methods rely on a parallel corpus.
In this study, we explore the feasibility of utilizing a POS-based neural language model to guide treebank reordering. Our approach does not require any parallel corpus and can be applied to any pair of source and target languages as long as their POS tags are available. We design a population-based optimization algorithm to deal with the combinatorial nature of word reordering. This algorithm is able to find close-to-optimal reordering results, which yields a new state of the art for cross-lingual parsing in various languages.

Approach
In this section, we first formalize word reordering as a combinatorial optimization problem, and then present our method for solving it.

Problem Definition
Given a sentence x = {x_1, x_2, ..., x_n} in the source dataset S, we aim to permute the words in the sentence to mimic the order of the target language. To measure the goodness of a permutation, we train a POS-based language model p_T on the target corpus T using a multi-layer LSTM. The log-likelihood of a sentence under p_T can be formulated as follows:

\log p_T(x) = \sum_{i=1}^{n} \log p_T(t_i \mid t_1, \ldots, t_{i-1})    (1)

where t_i denotes the POS tag of word x_i. The objective is to find the permutation x* whose reordered sentence achieves the highest probability under the language model:

x^* = \arg\max_{x' \in R(x)} \log p_T(x')    (2)

where R(x) is the set of all possible permutations of the words in x. In theory, the number of feasible candidates is n!, but most permutations would be radically different from the original sentence and break its meaning. To avoid this, we apply a syntactic constraint when generating R(x): a sub-sequence that forms a constituent in the original sentence must remain a sub-sequence after reordering, while the inner order of words within the sub-sequence may change.
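To make the objective concrete, the following sketch scores candidate orders under a pluggable autoregressive POS language model and, for very short sentences, recovers x* by brute force. The `log_prob` callback is our own illustrative interface, and the constituent constraint on R(x) is omitted here; this is not the paper's LSTM implementation.

```python
from itertools import permutations

def sentence_log_likelihood(tags, log_prob):
    """Equation (1): sum of log p_T(t_i | t_1..t_{i-1}) over all positions.
    `log_prob(prefix, tag)` can be any autoregressive POS LM."""
    return sum(log_prob(tags[:i], tags[i]) for i in range(len(tags)))

def best_reordering(words, tags, log_prob, max_len=8):
    """Equation (2) by brute force: argmax over all n! permutations.
    Only feasible for very short sentences, hence the genetic algorithm."""
    assert len(words) <= max_len, "use the genetic algorithm for longer inputs"
    best = max(permutations(range(len(words))),
               key=lambda p: sentence_log_likelihood([tags[i] for i in p], log_prob))
    return [words[i] for i in best]
```

With a toy LM that rewards DET→ADJ and ADJ→NOUN bigrams, `best_reordering(["dog", "the", "big"], ["NOUN", "DET", "ADJ"], lp)` returns the target-like order `["the", "big", "dog"]`.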

Population-based Optimization
Finding the optimal x* in Equation (2) can be reduced to the well-known travelling salesperson problem, which is NP-hard. The optimal reordering is therefore computationally difficult to obtain, and we instead design a genetic algorithm to find near-optimal results. A genetic algorithm is a heuristic search method inspired by the process of natural selection that iteratively evolves a population of candidate solutions towards better ones. The population at each iteration is called a generation. The algorithm starts by executing an initialization operator to create the initial generation. At each generation, the fitness of every individual in the population is evaluated, and individuals with higher fitness scores have a greater chance of breeding the next generation via the selection operator. The next generation is produced through a combination of two genetic operators: crossover and mutation. The crossover operator combines the genetic information of two parents to generate new offspring, while the mutation operator introduces diversity into the sampled population. Genetic algorithms are known to perform well on combinatorial optimization problems (Anderson and Ferris, 1994; Mühlenbein, 1989) and are well suited to the word reordering problem.
In order to meet the syntactic constraint, we design the crossover and mutation operators at the subtree level: whenever a word is moved elsewhere, its subtree is moved along with it. We describe each component of the proposed genetic algorithm below.
Fitness: The fitness score of an individual is its log-likelihood under the target language model, as in Equation (1).
Selection: Within a generation, "fitter" solutions are more likely to be selected for breeding the next generation. We normalize the fitness scores of the sentences in the generation and use them as the probabilities with which each sentence is randomly selected.
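A minimal sketch of the selection step follows. The text only states that fitness scores are normalized into sampling probabilities; the softmax used here is an assumption on our part, chosen because the fitness scores are log-likelihoods.

```python
import math
import random

def selection_probs(log_fitness):
    """Map log-likelihood fitness scores to a sampling distribution.
    A softmax is one natural normalization (the exact scheme is not
    pinned down in the text); subtracting the max avoids underflow."""
    m = max(log_fitness)
    weights = [math.exp(f - m) for f in log_fitness]
    z = sum(weights)
    return [w / z for w in weights]

def select_parent(population, probs, rng=random):
    """Roulette-wheel selection: fitter sentences are drawn more often."""
    return rng.choices(population, weights=probs, k=1)[0]
```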
Crossover: We use the example shown in Figure 1a to describe the crossover operator. Given two parents parent_1 and parent_2 chosen randomly by the selection operator, we randomly pick a word ("surgery" in the example) as the crossover point. We copy the entire inside tree ("a surgery routine" in parent_1) and then merge it with the remaining words in the order they occur in parent_2 to produce an offspring sentence.

Algorithm 1 Genetic algorithm-based reordering
Input: S: source treebank; N_g: the number of generations; N_p: the population size; α: mutation probability
Output: reordered treebank S'
1: for x_orig ∈ S do
2:   for i = 1, ..., N_p do
3:     add Initialize(x_orig) to P_0
4:   end for
5:   for g = 1, ..., N_g do
6:     P_g = P_{g-1}
7:     for i = 1, ..., N_p do
8:       F_{g-1}[i] = log p_T(P_{g-1}[i])
9:     end for
10:    p_selection = Normalize(F_{g-1})
11:    for i = 1, ..., N_p do
12:      sample parent_1 from P_{g-1} with p_selection
13:      sample parent_2 from P_{g-1} with p_selection
14:      child = Crossover(parent_1, parent_2)
15:      if UniformSampling(0, 1) < α then
16:        child = Mutation(child)
17:      end if
18:      add child to P_g
19:    end for
20:    P_g = top-N_p elements of P_g with largest fitness
21:  end for
22:  x* = argmax_{x ∈ P_{N_g}} p_T(x)
23:  add x* to S'
24: end for
Mutation: We move a child node (along with its subtree) from one side of its head node to the opposite side. An example is shown in Figure 1b: we first randomly select a head-modifier pair ("had" → "surgery"), and then move the word "surgery" and its subtree to the left side of the head word "had".
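The mutation operator can be sketched on a small tree representation in which each node keeps ordered lists of its left and right dependents; linearization is an in-order traversal, so moving a child between the two lists moves its whole subtree, as the syntactic constraint requires. The `Node` class and function names are illustrative, not the paper's data structures.

```python
import random

class Node:
    """A dependency-tree node: `left` and `right` hold the ordered
    subtrees of dependents on each side of the head word."""
    def __init__(self, word, left=None, right=None):
        self.word, self.left, self.right = word, left or [], right or []

def linearize(node):
    """In-order traversal: left dependents, head word, right dependents."""
    out = []
    for c in node.left:
        out += linearize(c)
    out.append(node.word)
    for c in node.right:
        out += linearize(c)
    return out

def all_nodes(node):
    yield node
    for c in node.left + node.right:
        yield from all_nodes(c)

def mutate(tree, rng=random):
    """Move one randomly chosen dependent (with its entire subtree) to the
    opposite side of its head, mirroring the "had" -> "surgery" example."""
    heads = [n for n in all_nodes(tree) if n.left or n.right]
    h = rng.choice(heads)
    if h.left and (not h.right or rng.random() < 0.5):
        h.right.append(h.left.pop(rng.randrange(len(h.left))))
    else:
        h.left.insert(0, h.right.pop(rng.randrange(len(h.right))))
```

Because a dependent is only ever moved between its head's two child lists, every mutated sentence is a permutation of the original words that respects the subtree constraint.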
Initialization: We repeatedly apply the mutation operator (discussed above) to the original sentence to generate an initial generation.
The overall procedure is listed in Algorithm 1. For each sentence in S, the descendant with the highest fitness score is added to the reordered treebank S'. After reordering the corpus, a parser trained on S' can be used to analyse the target language, since the instances in S' conform to the grammar of the target language.
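A compact Python rendering of Algorithm 1 for a single sentence might look as follows; the concrete `init`, `fitness`, `crossover`, and `mutate` callables are left abstract, and the softmax-style weighting stands in for the unspecified normalization step. This is a sketch of the control flow, not the paper's implementation.

```python
import math
import random

def genetic_reorder(x_orig, init, fitness, crossover, mutate,
                    n_pop=10, n_gen=10, alpha=0.5, rng=None):
    """Evolve reorderings of one sentence and return the fittest found.
    `init`, `fitness`, `crossover`, and `mutate` are supplied by the caller."""
    rng = rng or random.Random(0)
    pop = [init(x_orig, rng) for _ in range(n_pop)]       # lines 2-4
    for _ in range(n_gen):                                # line 5
        scores = [fitness(x) for x in pop]                # lines 7-9
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]       # line 10 (softmax-style)
        new_pop = list(pop)                               # line 6
        for _ in range(n_pop):                            # lines 11-19
            p1 = rng.choices(pop, weights=weights, k=1)[0]
            p2 = rng.choices(pop, weights=weights, k=1)[0]
            child = crossover(p1, p2, rng)
            if rng.random() < alpha:
                child = mutate(child, rng)
            new_pop.append(child)
        new_pop.sort(key=fitness, reverse=True)           # line 20: elitist truncation
        pop = new_pop[:n_pop]
    return max(pop, key=fitness)                          # line 22
```

Keeping the previous generation in `new_pop` before truncating to the top N_p makes the search elitist: the best solution found so far is never lost.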

Experiments
We evaluate CURSOR by transferring four different parsing models trained on English corpus to 30 target languages. We first introduce the experimental setup, then discuss the results as well as in-depth analysis, and finally, we propose a combined approach to further improve the performance.

Setup
Data We conduct experiments on the Universal Dependencies (UD) Treebanks (v2.2) (Nivre et al., 2018), from which 31 languages are selected for evaluation; each selected language has more than 100K tokens. We take English as the source language and the 30 other languages as targets. Five target languages are used to tune the hyperparameters, and the remaining 25 are held out for final evaluation.
Parsing Models We evaluate CURSOR with four different parsing models described by Ahmad et al. (2019): SelfAtt-Graph, RNN-Graph, SelfAtt-Stack, and RNN-Stack. These models are built upon two encoders (SelfAtt/RNN) and two decoders (Graph/Stack). The RNN encoder uses bidirectional LSTMs, while the SelfAtt encoder uses a transformer (Vaswani et al., 2017) instead. The Graph decoder utilizes the deep biaffine attentional scorer proposed by Dozat and Manning (2017), and the Stack decoder is the top-down transition-based decoder proposed by Ma et al. (2018). Following Ahmad et al. (2019), all the parsing models take words as well as their gold POS tags as input. We also leverage pre-trained multilingual embeddings from FastText (Bojanowski et al., 2017), projecting the word embeddings of different languages into the same space with an offline transformation method (Smith et al., 2017; Conneau et al., 2018).

Training Details For a fair comparison, we use the same hyper-parameter settings and training strategy as Ahmad et al. (2019) to train the parsing models. Each POS-based language model for word reordering is trained on the training set of the corresponding target language; the POS tag dimension is set to 50 (the same as in the parsing models), and the hidden size h ∈ {50, 100} and the number of layers l ∈ {1, 2, 3} are tuned on the development sets of the 5 non-held-out languages. Algorithm 1 introduces three hyperparameters N_p, N_g, and α, whose values are tuned from a few choices: N_p ∈ {5, 10, 20}, N_g ∈ {5, 10, 20}, α ∈ {0.5, 0.8, 1.0}. On the five non-held-out target languages, the best performance is obtained with h = 100, l = 2, N_p = 10, N_g = 10, and α = 0.5.

Methods for Comparison
We mainly compare CURSOR to the models described by Ahmad et al. (2019), denoted as "Baseline", which differ from CURSOR in that the words of the source-language sentences are not reordered. We also compare CURSOR to two models proposed by Wang and Eisner (2018) and Meng et al. (2019), denoted as MiniDiver and LagraRelax, respectively. MiniDiver is also based on word reordering: it reorders the words of the source sentences to minimize the difference in POS sequence distribution between the source and target languages. LagraRelax tackles the word order difference problem by using Lagrangian relaxation to enforce constraints derived from corpus statistics at inference time, which yields a significant improvement in transfer parsing. These approaches use different external resources: MiniDiver assumes that a target POS corpus is available, like CURSOR, while LagraRelax utilizes linguistic features from the World Atlas of Language Structures (WALS) (Dryer and Haspelmath, 2013).

Results
We report in Table 1 the results of Baseline and CURSOR on the test sets of 30 different languages, sorted in ascending order by their typology distance to English as reported by Ahmad et al. (2019). Following their recommendation, we use delexicalized models, in which only POS tags are used as inputs, for Chinese (zh) and Japanese (ja), since their word embeddings were found to be poorly aligned with those of the other languages.
As Table 1 shows, compared to the baseline, cross-lingual transfer performance improves for all four parsing models trained on the word-reordered corpora. The models using the RNN encoder benefit more than the others, probably because they are more sensitive to word order than those using the SelfAtt encoder. The RNN-Graph model enhanced by our treebank reordering achieves the best average UAS of 66.6%, beating the baseline by 2.5%. The improvements are especially significant for languages whose word orders differ greatly from English, such as Hindi (hi) and Latin (la).
We report in Figure 3 the comparison of our approach with the other competitors. The CURSOR results are those achieved by the RNN-Graph model. For MiniDiver, we use the code released by Wang and Eisner (2018) to reorder the source treebanks and then train an RNN-Graph parser on the reordered treebank. The LagraRelax results are excerpted from Meng et al. (2019). CURSOR performs better than MiniDiver in almost all languages, which demonstrates that a POS-based neural language model leads to better word reordering than a bigram language model. Besides, CURSOR achieves slightly better results than LagraRelax (an average UAS of 66.6% versus 66.3%). Moreover, our reordering method can be applied to both the graph-based and transition-based parsing paradigms, whereas LagraRelax is restricted to graph-based parsing. Furthermore, the performance of CURSOR can be further improved to 68.21% by combining our data augmentation and ensemble methods (see Section 4.4).
Although all the experimental results reported so far take English as the source language, our approach can be applied to the case where any language is chosen as the source language without any additional effort. We also run experiments in which Hebrew (he) is taken as the source language. Experimental results with four different parsing models show that CURSOR can consistently improve the average UAS across 30 target languages by 4.23%, 6.48%, 2.91%, and 5.52% respectively.

Analysis
In this section, we study the relationship between cross-lingual transfer parsing performance and the similarity of the source and target languages, and how differences in arc directionality and arc distance impact performance.

Performance versus Similarity between Languages
We first validate our hypothesis that if two languages are more similar, the transfer performance will be better. We then demonstrate that our word reordering method can make two different languages "closer" in their typology distance, which usually leads to an improvement in cross-lingual transfer.
We define a metric M to measure how similar a source language S is to a target language T with the help of the POS-based language model p_T:

M(S, T) = \frac{1}{|S|} \sum_{x \in S} \frac{\log p_T(x)}{|x|}    (3)

i.e., the average per-token log-likelihood of the source sentences' POS sequences under the target language model. We show the correlation between the transfer performance and the source-target similarity in Figure 2a; they are correlated in general, especially when the value of M is less than −8. Figure 2b shows that after reordering S, its similarity to T increases, and the corresponding cross-lingual parsing performance improves. In particular, target languages that differ more from the source in word order benefit more from our reordering method.
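Assuming M averages per-token log-likelihoods of the source corpus under p_T (our reading of Equation (3)), it can be computed as below; `pos_log_likelihood` is an illustrative stand-in for the trained POS LM.

```python
def language_similarity(source_pos_sentences, pos_log_likelihood):
    """Metric M: average per-token log-likelihood of the source corpus's
    POS sequences under the target-language POS LM. The per-token
    normalization is our assumption about Equation (3)."""
    per_sentence = [pos_log_likelihood(s) / len(s) for s in source_pos_sentences]
    return sum(per_sentence) / len(per_sentence)
```

Larger (less negative) values of M mean the source POS sequences look more natural to the target-language model, i.e., the two languages are closer in word order.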

Performance versus Difference in Arc Directionality
We show that, for a given arc label, the transfer performance is significantly affected by the difference in directionality (Wang and Eisner, 2017) between the source and target languages, and demonstrate that CURSOR can reduce such differences and thereby improve performance. Given a label l, we define the directionality α(l) ∈ [0, 1] as the probability that a modifier appears to the right of its head. For the label l, the difference in directionality between the source language (English) and a target language T is:

\delta_T(l) = |\alpha_S(l) - \alpha_T(l)|

In Figure 4, we sort the arc labels by their corresponding δ_T(l) in ascending order. As shown in Figures 4b and 4d, a large δ_T(l) leads to poor transfer performance. We also observe that our word reordering method can effectively reduce such directionality differences, which usually improves the performance of cross-lingual transfer. For example (see Figure 4a), δ_ja(cop) and δ_ja(aux) are greatly reduced after reordering; as a result, the parsing UAS of these two labels improves significantly, as shown in Figure 4b (from 10.12% to 44.64% and from 13.84% to 64.09%, respectively).
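Both α(l) and δ_T(l) are simple corpus statistics; the sketch below computes them from treebanks given as (modifier_index, head_index, label) triples, a representation chosen purely for illustration.

```python
from collections import defaultdict

def directionality(treebank):
    """alpha(l): fraction of arcs labeled l whose modifier lies to the
    right of its head. Each sentence is a list of (modifier_index,
    head_index, label) triples."""
    right, total = defaultdict(int), defaultdict(int)
    for sentence in treebank:
        for m, h, label in sentence:
            total[label] += 1
            right[label] += int(m > h)
    return {label: right[label] / total[label] for label in total}

def direction_gap(source_bank, target_bank):
    """delta_T(l) = |alpha_S(l) - alpha_T(l)| for labels seen in both banks."""
    a_s, a_t = directionality(source_bank), directionality(target_bank)
    return {l: abs(a_s[l] - a_t[l]) for l in a_s.keys() & a_t.keys()}
```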

Performance versus Arc Distance
We show in Figure 5 the parsing performance versus arc distance for German (de). The arc distance between a modifier and its head is the number of words between them. CURSOR outperforms the baseline by a significant margin in all cases, and the margin increases as the arc distance grows, indicating that the model is more sensitive to the correctness of word order when predicting long-distance dependencies.

Combined Approach
We here explore the feasibility of further improving cross-lingual parsing with the RNN-Graph model by data augmentation and a model ensemble.

Data Augmentation
In Algorithm 1, we only add the reordering result with the highest fitness score to the reordered training treebank S'. However, the fitness scores of the top-k results are normally very close, so we try using all of them to train the parsing model. As shown in Table 2, increasing the number k of word reordering results used for training improves transfer parsing performance, with the best performance achieved at k = 3.

Model Ensemble
Although the population-based optimization can reduce the difference in word order between two languages, it may change the well-formed syntactic structure of a source language. For a pair of similar languages, such changes may cause a drop in performance. We thus propose an inference-time ensemble method that combines the outputs of CURSOR and Baseline:

w(m, h) = \gamma_T \, w_C(m, h) + (1 - \gamma_T) \, w_B(m, h), \quad \gamma_T = \frac{\max M(S, \cdot) - M(S, T)}{\max M(S, \cdot) - \min M(S, \cdot)}

where w(m, h) denotes the score that h is the head of m (with subscripts C and B for CURSOR and Baseline), γ_T governs the relative importance of the two models, and max M(S, ·) and min M(S, ·) are the highest and lowest similarity scores computed as in Equation (3) among the 25 target languages. If the target language is more similar to the source language, we put more weight on Baseline.

Table 2: Results of the RNN-Graph parser across 25 target languages in average UAS and LAS. Generally, the larger the number k of word reordering results used to train the model, the better the performance. Ensembling CURSOR (k = 3) with the baseline achieves the highest accuracy in both UAS and LAS.
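The interpolation can be sketched as follows. Note that taking γ_T to be a min-max normalization of the similarity M(S, T) is one plausible reading consistent with the surrounding description (dissimilar targets weight CURSOR more, similar targets weight Baseline more), not a quoted formula.

```python
def ensemble_arc_score(w_cursor, w_baseline, m_st, m_min, m_max):
    """Interpolate the arc scores of CURSOR and Baseline. gamma is 1 when
    M(S, T) is the lowest similarity observed (use CURSOR only) and 0 when
    it is the highest (use Baseline only)."""
    gamma = (m_max - m_st) / (m_max - m_min)
    return gamma * w_cursor + (1.0 - gamma) * w_baseline
```

At inference time this score would replace the single-model arc score inside the parser's maximum-spanning-tree (or transition) decoding.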
We show in Figure 6 that the ensemble method further improves the transfer performance of CURSOR and outperforms Baseline in all languages. Ensembling CURSOR (k = 3) with Baseline achieves the best performance (68.21% UAS and 58.04% LAS), establishing a new state-of-the-art as shown in Table 2.

Conclusion
We propose a treebank reordering approach for cross-lingual dependency parsing. Our approach does not require any parallel corpus and can be applied to any pair of source and target languages as long as their POS tags are available. Extensive experimentation with different network architectures across 30 languages demonstrates that our approach can substantially improve the performance of cross-lingual parsing.