Cross-Lingual Syntactic Transfer through Unsupervised Adaptation of Invertible Projections

Cross-lingual transfer is an effective way to build syntactic analysis tools for low-resource languages. However, transfer is difficult for typologically distant languages, especially when neither annotated target data nor parallel corpora are available. In this paper, we focus on methods for cross-lingual transfer to distant languages and propose to learn a generative model with a structured prior that utilizes labeled source data and unlabeled target data jointly. The parameters of the source and target models are softly shared through a regularized log likelihood objective. An invertible projection is employed to learn a new interlingual latent embedding space that compensates for imperfect cross-lingual word embedding input. We evaluate our method on two syntactic tasks: part-of-speech (POS) tagging and dependency parsing. On the Universal Dependency Treebanks, we use English as the only source corpus and transfer to a wide range of target languages. On the 10 languages in this dataset that are distant from English, our method yields an average of 5.2% absolute improvement on POS tagging and 8.3% absolute improvement on dependency parsing over a direct transfer method using state-of-the-art discriminative models.


Introduction
Current top performing systems on syntactic analysis tasks such as part-of-speech (POS) tagging and dependency parsing rely heavily on large-scale annotated data (Huang et al., 2015; Dozat and Manning, 2017; Ma et al., 2018). However, because creating syntactic treebanks is an expensive and time-consuming task, annotated data is scarce for many languages. Prior work has
demonstrated the efficacy of cross-lingual learning methods (Guo et al., 2015; Tiedemann, 2015; Guo et al., 2016; Zhang et al., 2016; Ammar et al., 2016; Ahmad et al., 2019; Schuster et al., 2019), which transfer models between languages through shared features such as cross-lingual word embeddings (Smith et al., 2017; Conneau et al., 2018) or universal part-of-speech tags (Petrov et al., 2012). In the case of zero-shot transfer (i.e. with no target-side supervision), a common practice is to train a strong supervised system on the source language and directly apply it to the target language over these shared embedding or POS spaces. This method has demonstrated promising results, particularly for transfer to closely related target languages (Ahmad et al., 2019; Schuster et al., 2019). However, this direct transfer approach often performs poorly when transferring to more distant languages that are less similar to the source. For example, in Figure 1 we show the results of direct transfer of POS taggers and dependency parsers trained only on English and evaluated on 20 target languages using pre-trained cross-lingual word embeddings, where the x-axis shows linguistic distance from English calculated according to the URIEL linguistic database (Littell et al., 2017) (more details in Section 2). As we can see, these systems suffer a large performance drop when applied to distant languages. The reasons are two-fold: (1) cross-lingual word embeddings of distant language pairs are often poorly aligned by current methods, which make strong assumptions about the orthogonality of embedding spaces (Smith et al., 2017; Conneau et al., 2018); (2) divergent syntactic characteristics make the model trained on the source language non-ideal, even if the cross-lingual word embeddings are of high quality.
In this paper we take a drastically different approach from most previous work: instead of directly transferring a discriminative model trained only on labeled source-language data, we use a generative model that can be trained in a supervised fashion on labeled source-language data, but can also be trained in an unsupervised fashion to directly maximize the likelihood of the target language. This makes it possible to specifically adapt to the language we would like to analyze, with respect to both the cross-lingual word embeddings and the syntactic parameters of the model itself.
Specifically, our approach builds on two previous works. We follow a training strategy similar to Zhang et al. (2016), who previously demonstrated that this sort of cross-lingual unsupervised adaptation is possible, although limited to linear projections, which we argue are too simple for mapping between embeddings of distant languages. To relax this limitation, we follow He et al. (2018), who, in the context of fully unsupervised learning, propose a method using invertible projections (also called flows) to learn more expressive transformation functions while nonetheless maintaining the ability to train in an unsupervised manner to maximize likelihood. We learn this structured flow model (detailed in Section 3.1) on both labeled source data and unlabeled target data through a soft parameter sharing scheme. We describe how to apply this method to two syntactic analysis tasks: POS tagging with a hidden Markov model (HMM) prior and dependency parsing with a dependency model
with valence (DMV; Klein and Manning (2004)) prior (Section 4.3). We evaluate our method on the Universal Dependencies Treebanks (v2.2) (Nivre et al., 2018), where English is used as the only labeled source data. 10 distant languages and 10 nearby languages are selected as unlabeled targets. On the 10 distant transfer cases, which we focus on in this paper, our approach achieves an average of 5.2% absolute improvement on POS tagging and 8.3% absolute improvement on dependency parsing over strong discriminative baselines. We also analyze the performance difference between systems as a function of language distance, and provide preliminary guidance on when to use generative models for cross-lingual transfer.

Difficulties of Cross-Lingual Transfer on Distant Languages
In this section, we demonstrate the difficulties involved in cross-lingual transfer to distant languages. Specifically, we investigate direct transfer performance as a function of language distance by training a high-performing system on English and then applying it to target languages. We first introduce the measurement of language distance and the selection of 20 target languages, then study how transfer performance changes on the POS tagging and dependency parsing tasks.

Language Distance
To quantify language distances, we make use of the URIEL database (Littell et al., 2017), which represents over 8,000 languages as information-rich typological, phylogenetic, and geographical vectors. These vectors are sourced and predicted from a variety of linguistic resources such as WALS (Dryer, 2013), PHOIBLE (Moran et al., 2014), Ethnologue (Lewis et al., 2015), and Glottolog (Hammarström et al., 2015). Based on these vectors, the database provides ready-to-use distance statistics between any pair of included languages in terms of various metrics, including genetic distance, geographical distance, syntactic distance, phonological distance, and phonetic inventory distance. These distances are represented by values between 0 and 1. Since phonological and inventory distances mainly characterize intra-word phonetic/phonological features that have little effect on word-level composition rules, we remove these two and take the average of the genetic, geographic, and syntactic distances as our distance measure. We rank all languages in the Universal Dependencies (UD) Treebanks (v2.2) (Nivre et al., 2018) according to their distance to English, with the most distant at the top. We then select 10 languages from the top to represent the distant language group, and 10 languages from the bottom to represent the nearby language group. The selected languages are required to meet two conditions: (1) at least 1,000 unlabeled training sentences are present in the treebank, since a reasonably large amount of unlabeled data is needed to study the effect of unsupervised adaptation, and (2) an offline pre-trained word embedding alignment matrix is available. The 20 selected target languages are shown in Table 1, which contains distant languages like Persian and Arabic, but also closely related languages like Spanish and French. Detailed statistics of the selected languages and the corresponding treebanks can be found in Appendix A.
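The distance computation described above can be sketched as follows. This is a minimal illustration: the per-metric distances would come from the URIEL database (e.g. via a library such as lang2vec), and the numbers below are illustrative placeholders, not real URIEL values.

```python
# Sketch of the combined-distance measure: drop the two phonology-related
# metrics and average the genetic, geographic, and syntactic distances.

def combined_distance(uriel_distances):
    """Average genetic, geographic, and syntactic distance to English."""
    kept = ("genetic", "geographic", "syntactic")
    return sum(uriel_distances[k] for k in kept) / len(kept)

# Hypothetical per-metric distances to English for two languages.
distances = {
    "ja": {"genetic": 1.0, "geographic": 0.9, "syntactic": 0.6,
           "inventory": 0.5, "phonological": 0.4},
    "es": {"genetic": 0.6, "geographic": 0.2, "syntactic": 0.2,
           "inventory": 0.3, "phonological": 0.3},
}

# Rank languages with the most distant first, as done for Table 1.
ranked = sorted(distances,
                key=lambda lang: combined_distance(distances[lang]),
                reverse=True)
```

Selecting the top and bottom of such a ranking then yields the distant and nearby language groups.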

Observations
In the direct transfer experiments, we use the pre-trained cross-lingual fastText word embeddings (Bojanowski et al., 2017), aligned with the method of Smith et al. (2017). These embeddings are fixed during training, since fine-tuning them would break the alignment. We employ a bidirectional LSTM-CRF model (Huang et al., 2015) for POS tagging using the NCRF++ toolkit (Yang and Zhang, 2018) and the "SelfAtt-Graph" model (Ahmad et al., 2019) for dependency parsing; for dependency parsing, gold POS tags are also used to learn POS tag embeddings as universal features. (Following Ahmad et al. (2019), we use the offline pre-trained alignment matrices available at https://github.com/Babylonpartners/fastText_multilingual, which cover 78 languages; this also allows comparison with their numbers in Section 4.3.) We train the systems on English and directly evaluate them on the target languages. Results are shown in Figure 1. While these systems achieve quite accurate results on closely related languages, we observe large performance drops on both tasks as distance to English increases. These results motivate our proposed approach, which aims to close this gap by directly adapting to the target language through unsupervised learning over unlabeled text.

Proposed Method
In this section, we first introduce the unsupervised monolingual models presented in He et al. (2018), which we refer to as structured flow models, and then propose our approach, which extends the structured flow models to cross-lingual settings.

Unsupervised Training of Structured Flow Models
The structured flow generative model, proposed by He et al. (2018), is a state-of-the-art technique for inducing syntactic structure in a monolingual setting without supervision. This model cascades a structured generative prior p_syntax(z; θ) with an invertible neural network f_φ(z) to generate pre-trained word embeddings x = f_φ(z), which correspond to the words in the training sentences. Here z represents latent syntax variables that are not observed during training. The structured prior defines a probability over syntactic structures, and can be a Markov prior to induce POS tags or a DMV prior (Klein and Manning, 2004) to induce dependency structures. Notably, the model side-steps discrete words and instead uses pre-trained word embeddings as observations, which allows it to be employed directly in a cross-lingual transfer setting by using cross-lingual word embeddings as the observations. A graphical illustration of this model is shown in Figure 2. Given a sentence of length l, we denote z = {z_k}_{k=1}^{K} as the set of discrete latent variables of the structured prior, e = {e_i}_{i=1}^{l} as the latent embeddings, and x = {x_i}_{i=1}^{l} as the observed word embeddings. Note that the number of latent syntax variables K is no smaller than the sentence length l, and we assume x_i is generated (indirectly) conditioned on z_i for notational simplicity. The model is trained by maximizing the marginal data likelihood (Eq. 1), where p_η(·|z_i) is defined to be a conditional Gaussian distribution that emits the latent embedding e_i, and the projection function f_φ projects the latent embedding e_i to the observed embedding x_i.
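Concretely, a reconstruction of the marginal data likelihood (Eq. 1), following the description above and He et al. (2018), is (the exact typesetting here is ours):

```latex
p(\mathbf{x}) \;=\; \sum_{\mathbf{z}} p_{\mathrm{syntax}}(\mathbf{z};\theta)\,
\prod_{i=1}^{l} p_{\eta}\!\left(f_{\phi}^{-1}(x_i)\,\middle|\,z_i\right)
\left|\det\frac{\partial f_{\phi}^{-1}}{\partial x_i}\right|
\qquad \text{(Eq. 1)}
```

The sum over z can be computed exactly with dynamic programming: the forward algorithm for the Markov prior and the inside algorithm for the DMV prior.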
Here ∂f_φ^{-1}/∂x_i is the Jacobian matrix of the function f_φ^{-1} at x_i, and det represents the absolute value of its determinant.
To understand the intuition behind Eq. 1, first denote the log likelihood over the latent embeddings e as log p_gaus(·); then the log of Eq. 1 can be equivalently rewritten as Eq. 2, which shows that f_φ^{-1}(x) inversely projects x into a new latent embedding space, on which the unsupervised training objective is simply the Gaussian log likelihood plus an additional Jacobian regularization term. The Jacobian term accounts for the volume expansion or contraction behavior of the projection, so maximizing it can be thought of as preventing information loss. This projection scheme can flexibly transform the embedding space to fit the task at hand, yet avoids trivial solutions by preserving information.
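Writing p_gaus(·) for the marginal likelihood of the latent embeddings under the structured prior with Gaussian emissions, the rewritten objective (Eq. 2) described above plausibly takes the form (our reconstruction):

```latex
\log p(\mathbf{x}) \;=\; \log p_{\mathrm{gaus}}\!\left(f_{\phi}^{-1}(\mathbf{x});\theta,\eta\right)
\;+\; \sum_{i=1}^{l} \log\left|\det\frac{\partial f_{\phi}^{-1}}{\partial x_i}\right|
\qquad \text{(Eq. 2)}
```

Note that the Jacobian terms do not depend on z, which is why they factor out of the marginalization over syntactic structures.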
While f_φ^{-1}(x) can be any invertible function, He et al. (2018) use a version of the NICE architecture (Dinh et al., 2014) to construct f_φ^{-1}, which has the advantage that the determinant term is identically equal to one. This structured flow model allows exact marginal data likelihood computation and exact inference by using dynamic programs to marginalize out z. More details about this model can be found in He et al. (2018).
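The key property of the NICE-style coupling layer can be sketched in a few lines. This is an illustrative toy (a single additive coupling layer on 8-dimensional vectors with a randomly initialized MLP as the coupling network), not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# One additive coupling layer in the style of NICE (Dinh et al., 2014):
# half the dimensions pass through unchanged, the other half is shifted
# by an arbitrary function of the first half. The Jacobian is triangular
# with a unit diagonal, so |det| = 1 exactly, and inversion is trivial.
W1, b1 = rng.normal(size=(4, 4)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 4)), rng.normal(size=4)

def m(h):
    """Coupling network: a tiny MLP. It need not be invertible itself."""
    return np.tanh(h @ W1 + b1) @ W2 + b2

def forward(x):
    xa, xb = x[:4], x[4:]
    return np.concatenate([xa, xb + m(xa)])

def inverse(y):
    ya, yb = y[:4], y[4:]
    return np.concatenate([ya, yb - m(ya)])

x = rng.normal(size=8)
assert np.allclose(inverse(forward(x)), x)  # exactly invertible
# log|det J| = 0 for every input, so the Jacobian term in Eq. 2 vanishes.
```

Stacking several such layers (alternating which half is held fixed) gives an expressive yet exactly invertible projection with a constant unit determinant.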

Supervised Training of Structured Flow Models
While He et al. (2018) train the structured flow model in an unsupervised fashion, the model can also be trained on supervised data when z is observed. Supervised training is required in cross-lingual transfer, where we train a model on the high-resource source language. The supervised objective is given in Eq. 3.
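With z observed, the marginalization disappears; a reconstruction of the supervised objective (Eq. 3), term-by-term consistent with Eq. 1, is:

```latex
\log p(\mathbf{x},\mathbf{z}) \;=\; \log p_{\mathrm{syntax}}(\mathbf{z};\theta)
\;+\; \sum_{i=1}^{l}\left[\log p_{\eta}\!\left(f_{\phi}^{-1}(x_i)\,\middle|\,z_i\right)
+ \log\left|\det\frac{\partial f_{\phi}^{-1}}{\partial x_i}\right|\right]
\qquad \text{(Eq. 3)}
```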

Multilingual Training through Parameter Sharing
In this paper, we focus on the zero-shot cross-lingual transfer setting, where the syntactic structure z is observed for the source language but unavailable for the target languages. Eq. 2 is an unsupervised objective which is optimized on the target languages, and Eq. 3 is optimized on the source language. To establish connections between the source and target languages, we employ two instances of the structured flow model, a source model and a target model, and share parameters between them. The source model uses the supervised objective (Eq. 3), the target model uses the unsupervised objective (Eq. 2), and both are optimized jointly. Instead of tying their parameters in a hard way, we share them softly through an L2 regularizer that encourages similarity. We use subscript p to represent variables of the source model and q to represent variables of the target model. Together this gives our joint training objective (Eq. 4), where β = {β_1, β_2, β_3} are regularization parameters. Introducing hyperparameters is concerning because in the zero-shot transfer setting we have no annotated data with which to select them for each target language, but in experiments we found it unnecessary to tune β for each target language separately; it is possible to use the same β within a language category (i.e. distant or nearby). Under this parameter sharing scheme, the projected latent embedding space e can be understood as the new interlingual embedding space from which we learn the syntactic structures. The expressivity of the flow model used to learn this latent embedding space is expected to compensate for the imperfect orthogonality between the two embedding spaces. Further, jointly training both models with Eq. 4 is more expensive than typical cross-lingual transfer setups: it would require re-training both models for each language pair. To improve efficiency and memory utilization, in practice we use a simple pipelined approach: (1) We pre-train parameters for the source model only once, in isolation.
(2) We use these parameters to initialize each target model, and regularize all target parameters towards this initializer via the L2 terms in Eq. 4. In this way, we only need to save the pre-trained parameters of a single source model, and target-side fine-tuning converges much faster than training each pair from scratch. This training approximation has been used before by Zhang et al. (2016).
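For reference, the joint objective (Eq. 4) combines the supervised source term (Eq. 3), the unsupervised target term (Eq. 2), and the soft-sharing penalties. A reconstruction is given below; the assignment of β_1, β_2, β_3 to the specific parameter groups θ, η, φ is our assumption:

```latex
\mathcal{L} \;=\; \sum_{(\mathbf{x},\mathbf{z})\in\mathcal{D}_p} \log p(\mathbf{x},\mathbf{z})
\;+\; \sum_{\mathbf{x}\in\mathcal{D}_q} \log p(\mathbf{x})
\;-\; \beta_1\|\theta_p-\theta_q\|_2^2
\;-\; \beta_2\|\eta_p-\eta_q\|_2^2
\;-\; \beta_3\|\phi_p-\phi_q\|_2^2
\qquad \text{(Eq. 4)}
```

where D_p denotes the labeled source corpus and D_q the unlabeled target corpus.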

Experiments
In this section, we first describe the dataset and experimental setup, and then report the cross-lingual transfer results for POS tagging and dependency parsing on distant target languages. Lastly, we include an analysis of the different systems.

Experimental Setup
Across both POS tagging and dependency parsing tasks, we run experiments on the Universal Dependency Treebanks (v2.2) (Nivre et al., 2018). Specifically, we train the proposed model on the annotated English corpus and fine-tune it on target languages in an unsupervised way. In the rest of the paper we use Flow-FT to refer to our proposed method. We use the aligned cross-lingual word embeddings described in Section 2.2 as the observations of our model. To compare with Ahmad et al. (2019), on the dependency parsing task we also use universal gold POS tags to index tag embeddings as part of the observations. Specifically, the tag embeddings are concatenated with word embeddings to form x; they are updated when training on the source language and fixed at the fine-tuning stage. We implement the structured flow model based on the public code from He et al. (2018), which contains models with a Markov prior for POS tagging and a DMV prior for dependency parsing. Detailed hyperparameters can be found in Appendix B. Both the source and target models are optimized with Adam (Kingma and Ba, 2014). Training on the English source corpus is run 5 times with different random restarts for all models, and the source model with the best English test accuracy is selected to perform transfer.
We compare our method with a direct transfer approach based on the state-of-the-art discriminative models described in Section 2.2. The pre-trained cross-lingual word embeddings are frozen for all models, since fine-tuning them would break the multilingual alignment. In addition, to demonstrate the efficacy of unsupervised adaptation, we also include direct transfer results of our model without fine-tuning, which we denote Flow-Fix. On the POS tagging task we reimplement the generative baseline of Zhang et al. (2016) that employs a linear projection (Linear-FT). We present results on 20 target languages in the "distant languages" and "nearby languages" categories to analyze the differences between the systems and the scenarios to which each is applicable.

Part-Of-Speech Tagging
Setup. Our method aims to predict coarse universal POS tags, as fine-grained tags are language-dependent. The discriminative baseline with the NCRF++ toolkit (Yang and Zhang, 2018) achieves a supervised test accuracy on English of 94.02%, which is competitive (rank 12) on the CoNLL 2018 Shared Task scoreboard that uses the same dataset. The regularization parameters β in all generative models are tuned on the Arabic development data and kept the same for all target languages. Our running setting is β_1 = 0, β_2 = 500, β_3 = 80. Unsupervised fine-tuning is run for 10 epochs.
Results. We show our results in Table 2, where unsupervised fine-tuning achieves considerable and consistent performance improvements over the Flow-Fix baseline in both language categories. Compared with the discriminative LSTM-CRF baseline, our approach outperforms it on 8 out of 10 distant languages, with an average of 5.2% absolute improvement. Unsurprisingly, however, it also underperforms the expressive LSTM-CRF on 8 out of 10 nearby languages. The reasons for this phenomenon are two-fold. First, the flexible LSTM-CRF model is better able to fit the source English corpus than our method (94.02% vs. 87.03% accuracy), and is thus also capable of fitting similar input when transferring. Second, unsupervised adaptation helps less when transferring to nearby languages (5.9% improvement over Flow-Fix, versus 11.3% on distant languages); we posit that this is because a large portion of linguistic knowledge is shared between similar languages and the cross-lingual word embeddings are of better quality in this case, so unsupervised adaptation becomes less necessary. While the Linear-FT baseline is comparable to our method on nearby languages, its performance on distant languages is much worse, which confirms the importance of the invertible projection, especially when language typologies are divergent.

Dependency Parsing
Setup. In preliminary parsing experiments we found that transfer to the "nearby languages" group is prone to catastrophic forgetting (McCloskey and Cohen, 1989) and thus requires stronger regularization towards the source model. This also makes sense intuitively, since nearby languages should stay closer to the source model than distant languages. Therefore, we use two different sets of regularization parameters for nearby and distant languages. Specifically, β for the "distant languages" group is set to β_1 = β_2 = β_3 = 0.1, tuned on the Arabic development set, and for the "nearby languages" group to β_1 = β_2 = β_3 = 1, tuned on the Spanish development set. Unsupervised adaptation is performed on sentences of length less than 40 due to memory constraints, but we test on sentences of all lengths. We run unsupervised fine-tuning for 5 epochs, and evaluate using unlabeled attachment score (UAS) with punctuation excluded.
Results. We show our results in Table 3. While unsupervised fine-tuning improves performance on the distant languages, it has only a minimal effect on nearby languages, which is consistent with our observations in the POS tagging experiment and implies that unsupervised adaptation helps more for distant transfer. Similar to the POS tagging results, our method outperforms the state-of-the-art "SelfAtt-Graph" model on 8 out of 10 distant languages, with an average of 8.3% absolute improvement, but the strong discriminative baseline performs better when transferring to nearby languages. Note that the supervised performance of our method on English is poor. This is mainly because the DMV prior is too simple and limits the capacity of the model. While the model still achieves good performance on distant transfer, incorporating more complex DMV variants (Jiang et al., 2016) might lead to further improvement.
Analysis of Dependency Relations. We further perform a breakdown analysis on dependency relations to see how unsupervised adaptation helps the model learn new dependency rules. We select three typical distant languages with different orders of Subject, Object and Verb (Dryer, 2013): Arabic (Modern Standard, VSO), Indonesian (SVO) and Japanese (SOV). We investigate the unlabeled accuracy (recall) on the gold dependency labels. We especially explore four typical dependency relations: case (case marking), nmod (nominal modifier), obj (object) and nsubj (nominal subject). The first two are "nominal dependents" (modifiers of nouns) and the other two are the main nominal "core arguments" (arguments of the predicate). Although languages vary, these four types are representative relations and account for 25% to 40% by frequency among all 37 UD dependency types.
We compare our fine-tuned model with the baseline "SelfAtt-Graph" model and with our basic model without fine-tuning. As shown in Figure 3, although our direct transfer model obtains results similar to the baseline, the fine-tuning method brings large improvements on most of these dependency relations. Among the three languages, Japanese benefits from fine-tuning the most, probably because its word order is quite different from English and the baseline may overfit to the English order. For example, in Japanese almost all "case" relations are head-first and "obj" relations are modifier-first, patterns that are exactly opposite to those in English, our source language. As a result, direct transfer models fail on most of these relations, since they only learn the English patterns. With fine-tuning on unlabeled data, the model can become familiar with the unusual word-order patterns and predict more correct attachment decisions (around 0.4 absolute improvement in recall). In Arabic and Indonesian, although not as pronounced as in Japanese, the improvements are still consistent, especially on the core-argument relations.

When to Use Generative Models?
In the unsupervised cross-lingual transfer setting, it is hard to find a single system that achieves state-of-the-art performance on all languages. As reflected by our experiments, there is a tradeoff between fitting the source language and generalizing to the target language: the flexibility of discriminative models results in overfitting and poor performance when transferring to distant languages. Unfortunately, the world's few high-resource languages are mostly distant from its many low-resource languages, which means distant transfer is a practical challenge we face when dealing with low-resource languages. Next, we offer preliminary guidance on which system to use in specific transfer scenarios.
As discussed in Section 2.1, there are several types of distance metrics. Here we aim to test how significantly the performance difference between our method and the discriminative baseline correlates with each distance feature. We use five input distance features: geographic, genetic, syntactic, inventory, and phonological. Specifically, we fit a generalized linear model (GLM) on the accuracy difference and the five features over all 20 target languages, then perform a hypothesis test to compute a p-value that reflects the significance of each feature. Results are shown in Table 4, from which we conclude that the genetic distance feature is significantly correlated with POS tagging performance, while the geographic distance feature is significantly correlated with dependency parsing performance. As assumed, inventory and phonological distance do not have much influence on transfer. Interestingly, syntactic distance is not a significant term for either task; we posit that this is because transfer performance is affected by both cross-lingual word embedding quality and linguistic features, so genetic/geographic distance may be a better overall indicator. The results suggest that our method may be more suitable than the discriminative approach for genetically distant transfer on POS tagging and for geographically distant transfer on parsing.
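The significance test described above can be sketched as follows. This is a minimal illustration of a Gaussian GLM (i.e. ordinary least squares) with t-statistics, from which p-values follow; the feature matrix and responses here are synthetic stand-ins, not the paper's actual 20-language data, and a library such as statsmodels would report p-values directly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: 20 "languages" with five distance features.
n = 20
features = ["geographic", "genetic", "syntactic", "inventory", "phonological"]
X = rng.uniform(0.0, 1.0, size=(n, len(features)))
# Synthetic response: accuracy difference driven mostly by genetic distance.
y = 10.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

Xd = np.column_stack([np.ones(n), X])          # add an intercept column
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)  # coefficient estimates
resid = y - Xd @ beta
dof = n - Xd.shape[1]
sigma2 = resid @ resid / dof                   # residual variance estimate
cov = sigma2 * np.linalg.inv(Xd.T @ Xd)        # coefficient covariance
t_stats = beta / np.sqrt(np.diag(cov))         # large |t| => small p-value
```

Comparing each |t| against a Student-t distribution with `dof` degrees of freedom yields the per-feature p-values reported in the paper's Table 4.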

Effect of Multilingual-BERT
So far the analysis and experiments in this paper have focused on non-contextualized fastText word embeddings. We note that, concurrently to this work, Wu and Dredze (2019) found that the recently released multilingual BERT (mBERT; Devlin et al. (2019)) achieves impressive performance on various cross-lingual transfer tasks. To study the effect of contextualized mBERT word embeddings on our proposed method, we report the average POS tagging and dependency parsing results in Table 5, while detailed numbers on each language are included in Appendix C. In the mBERT experiments, all settings and hyperparameters are the same as in Section 4.2 and Section 4.3, but the aligned fastText embeddings are replaced with the mBERT embeddings. We also include the average results from fastText embeddings for comparison.
On the POS tagging task, all models benefit greatly from the mBERT embeddings, especially our method on nearby languages, where mBERT outperforms fastText by an average of 16 absolute points. Moreover, unsupervised adaptation still considerably improves over the Flow-Fix baseline, and surpasses the LSTM-CRF baseline on 9 out of 10 distant languages with an average of 6% absolute improvement. Unlike the fastText setting, where our method underperforms the discriminative baseline on the nearby language group, with mBERT embeddings our method also beats the discriminative baseline on 7 out of 10 nearby languages with an average of 3% absolute improvement. A major limitation of our method lies in its strong independence assumptions, which prevent it from modeling long-range context. We posit that contextualized word embeddings like mBERT compensate for exactly this drawback by incorporating context information into the observed word embeddings, so that our method is able to outperform the discriminative baseline on both distant and nearby language groups.
On the dependency parsing task, however, our method does not demonstrate significant improvement from mBERT, while mBERT greatly helps the discriminative baseline. Therefore, although our method still outperforms the discriminative baseline on four very distant languages, the baseline demonstrates superior performance on the other languages when using mBERT. Interestingly, we find that the performance of the flow-based models with mBERT is similar to their performance with fastText word embeddings. Based on this, better generative models for unsupervised dependency parsing that can take advantage of contextualized embeddings seem a promising direction for future work.

Related Work
Cross-lingual transfer learning has been widely studied as a way to induce syntactic structures in low-resource languages (McDonald et al., 2011; Täckström et al., 2013a; Agić et al., 2014; Tiedemann, 2015; Kim et al., 2017; Schuster et al., 2019; Ahmad et al., 2019). When no target annotations are available, unsupervised cross-lingual transfer can be performed by directly applying a pre-trained source model to the target language (Guo et al., 2015; Schuster et al., 2019; Ahmad et al., 2019). The challenge of the direct transfer method lies in the divergent linguistic rules between the source and distant target languages. Utilizing multiple source resources can mitigate this issue and has been actively studied in past years (Cohen et al., 2011; Naseem et al., 2012; Täckström et al., 2013b; Zhang and Barzilay, 2015; Aufrant et al., 2015; Ammar et al., 2016; Wang and Eisner, 2018, 2019). Other approaches that try to overcome the lack of annotations include annotation projection using bitext supervision or bilingual lexicons (Hwa et al., 2005; Smith and Eisner, 2009; Wisniewski et al., 2014) and source data point selection (Søgaard, 2011; Täckström et al., 2013b).
Learning from both labeled source data and unlabeled target data has been explored before. Cohen et al. (2011) learn a generative target-language parser as a linear interpolation of multiple source-language parameters; Naseem et al. (2012) and Täckström et al. (2013b) rely on additional language typological features to guide selective model parameter sharing in a multi-source transfer setting; and Wang and Eisner (2018, 2019) extract linguistic features from target languages by training a feature extractor on multiple source languages.

Conclusion
In this work, we focus on transfer to distant languages for POS tagging and dependency parsing, and propose to learn a structured flow model in a cross-lingual setting. By learning a new latent embedding space as well as language-specific knowledge from unlabeled target data, our method proves effective at transferring to distant languages.

Table 7: POS tagging accuracy (%) and dependency parsing UAS (%) results when using mBERT as the aligned embeddings. Numbers next to language names are their distances to English. Supervised accuracy on English (*) is included for reference.

Figure 1: Left: POS tagging transfer accuracy of the bidirectional LSTM-CRF model. Right: dependency parsing transfer UAS of the "SelfAtt-Graph" model (Ahmad et al., 2019). These models are trained on the labeled English corpus and directly evaluated on different target languages. The x-axis represents language distance to English (details in Section 2.1). Both models take pre-trained cross-lingual word embeddings as input. The parsing model also uses gold universal POS tags.

Figure 2: Graphical representation of the structured flow model. We denote discrete syntactic variables as z, latent embedding variables as e, and observed pre-trained word embeddings as x. f_φ is the invertible projection function.

Figure 3: Results (UAS %) on typical dependency relations for Arabic, Indonesian and Japanese, respectively. "Baseline" denotes the "SelfAtt-Graph" model, and "Direct-Transfer" denotes our source model without fine-tuning. The number in parentheses after each dependency label indicates the relative frequency of this type.

Table 3: Dependency parsing UAS (%) on sentences of all lengths. Numbers next to language names are their distances to English. Supervised accuracy on English (*) is included for reference.


Table 4: p-values of different distance features on the POS tagging and dependency parsing tasks. A lower p-value indicates a stronger association between the feature and the response, which is the performance difference between our method and the discriminative baselines.