Unsupervised Cross-Lingual Adaptation of Dependency Parsers Using CRF Autoencoders

We consider the task of cross-lingual adaptation of dependency parsers without annotated target corpora or parallel corpora. Previous work either directly applies a discriminative source parser to the target language, ignoring unannotated target corpora, or employs an unsupervised generative parser that can leverage unannotated target data but has weaker representational power than discriminative parsers. In this paper, we propose to utilize unsupervised discriminative parsers based on the CRF autoencoder framework for this task. We train a source parser and use it to initialize and regularize a target parser that is trained on unannotated target data. We conduct experiments that transfer an English parser to 20 target languages. The results show that our method significantly outperforms previous methods.


Introduction
Supervised learning of dependency parsing is difficult for low-resource languages because of the lack of large treebanks. On the other hand, cross-lingual adaptation of dependency parsers from rich-resource languages to low-resource languages has shown a lot of promise (Hwa et al., 2005; Zeman and Resnik, 2008; McDonald et al., 2011; Xiao and Guo, 2014; Tiedemann, 2015; Schlichtkrull and Søgaard, 2017; Ahmad et al., 2019), especially with the help of cross-lingual word representations (Wu and Dredze, 2019) or part-of-speech (POS) tags (Guo et al., 2015).
In this paper, we consider the scenario in which there is only unannotated data for the target language that is not parallel to the source language treebank. A simple strategy is zero-shot transfer or direct transfer, which trains a parser on the source treebank and then directly applies it to the target language (Schuster et al., 2019; Wang et al., 2019). In order to leverage unannotated target data, He et al. (2019) propose to employ an unsupervised generative parser that can be trained on the target data while being regularized by a source parser via soft parameter tying. However, generative parsers are known to underperform discriminative parsers in rich-resource scenarios, mostly because of the unrealistic independence assumptions typically made by generative parsers. In fact, He et al. (2019) show that when they use multilingual BERT (Kenton and Toutanova, 2019) as the cross-lingual word representation, their method underperforms direct transfer of a strong discriminative parser.
In this paper, we propose to instead use an unsupervised discriminative parser based on the CRF autoencoder framework (Ammar et al., 2014; Cai et al., 2017) for cross-lingual parser adaptation. We perform supervised training of the source parser on the source treebank and then use it to initialize the target parser. The target parser is then trained on the unannotated target data in an unsupervised way while being regularized by the source parser. We employ three regularization methods from prior work that encourage similarity between the two parsers' model parameters, edge scores, and predicted parse trees, respectively. Our experiments of transferring from English to 20 target languages show that our method significantly outperforms previous methods.

CRF Autoencoder
The CRF autoencoder is a framework for unsupervised structured prediction (Ammar et al., 2014) and has been applied to unsupervised parsing (Cai et al., 2017) and POS induction (Lin et al., 2015). It consists of an encoder that predicts a structure (in our case, a dependency parse tree) from the input sentence and a decoder that reconstructs the sentence from the structure. Let x = (x_1, x_2, ..., x_n) be the input sentence, where x_i is the i-th word; let y = (y_1, y_2, ..., y_n) be the dependency parse tree, where y_i is a tuple ⟨h_i, p_i⟩ in which h_i is the index of the dependency head of x_i and p_i is the POS tag of that head; and finally let x̂ = (x̂_1, x̂_2, ..., x̂_n) be the reconstructed sentence. We would like a perfect reconstruction, so we set x̂ = x.

Encoder
The encoder with parameters Θ computes P_Θ(y|x). We use the deep biaffine model (Dozat and Manning, 2017), a widely used dependency parser, as our encoder. For each word x_i of the input sentence, its word and POS tag embeddings are concatenated and input into a multilayer BiLSTM to produce a contextual representation r_i of the word. Then r_i is fed into two MLPs to produce h^(dep)_i and h^(head)_i, vector representations of the word as a dependent and as a dependency head, respectively.
We use a biaffine function to compute a score matrix s^Enc, in which each element s^Enc_{i,j} is the score of x_j being the dependency head of x_i:

s^Enc_{i,j} = (h^(head)_j)⊤ W h^(dep)_i + b⊤ h^(head)_j    (1)

where W and b are the parameters of the biaffine function.
We follow the head-selection formulation of Dozat and Manning (2017) to compute P_Θ(y|x):

P_Θ(y|x) = ∏_{i=1}^{n} P(h_i|x)    (2)

where P(h_i|x) is computed by a softmax over row i of s^Enc:

P(h_i = j|x) = exp(s^Enc_{i,j}) / ∑_{j'} exp(s^Enc_{i,j'})    (3)
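To make the scoring concrete, here is a minimal pure-Python sketch of the biaffine score (Eq. 1) and the head-selection softmax (Eq. 3). In practice these run as batched tensor operations inside the parser; the function names and the loop-based implementation here are ours, for illustration only.

```python
import math

def biaffine_scores(h_dep, h_head, W, b):
    """Score matrix of Eq. 1: scores[i][j] is the score of word j
    being the dependency head of word i."""
    n, d = len(h_dep), len(h_dep[0])
    scores = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # bilinear term: (h_head[j])^T W h_dep[i]
            bilinear = sum(h_head[j][a] * W[a][c] * h_dep[i][c]
                           for a in range(d) for c in range(d))
            # bias term: b^T h_head[j]
            bias = sum(b[a] * h_head[j][a] for a in range(d))
            scores[i][j] = bilinear + bias
    return scores

def head_distribution(scores, i):
    """P(h_i = j | x): softmax over row i of the score matrix (Eq. 3)."""
    m = max(scores[i])  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores[i]]
    z = sum(exps)
    return [e / z for e in exps]
```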

Decoder
The decoder with parameters Λ computes P_Λ(x̂|y). Following Cai et al. (2017), we represent x̂ as a sequence of POS tags instead of words and make the decoder independently predict each POS tag p̂_i in the reconstructed sentence conditioned only on p_i, the true POS tag of its dependency head. Our decoder simply specifies a categorical distribution P(p̂_i|p_i) for each possible head POS tag and computes the reconstruction probability as follows:

P_Λ(x̂|y) = ∏_{i=1}^{n} P(p̂_i | p_i)    (4)
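The decoder is just a lookup table of head-tag-to-dependent-tag probabilities. A tiny sketch of Eq. 4 follows; the table values below are made-up illustrations, not probabilities estimated from any corpus.

```python
import math

# Hypothetical categorical table P(p_hat | p_head); the tags and
# probabilities are illustrative only.
P_DEC = {
    "VERB": {"NOUN": 0.6, "VERB": 0.3, "ADJ": 0.1},
    "NOUN": {"NOUN": 0.2, "VERB": 0.1, "ADJ": 0.7},
}

def reconstruction_log_prob(head_tags, dep_tags):
    """log P_Lambda(x_hat | y) = sum_i log P(p_hat_i | p_i) (Eq. 4).
    head_tags[i] is the POS tag of word i's head; dep_tags[i] is the
    tag to reconstruct (the word's own tag, since x_hat = x)."""
    return sum(math.log(P_DEC[h][d]) for h, d in zip(head_tags, dep_tags))
```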

Parsing
Given encoder parameters Θ and decoder parameters Λ, we obtain the best parse tree by maximizing the probability

P_{Θ,Λ}(y, x̂|x) = P_Θ(y|x) P_Λ(x̂|y)    (5)

We can use Eisner's algorithm (Eisner, 1996) to find the best projective dependency parse tree in O(n³) time, or Chu-Liu/Edmonds' algorithm (Chu, 1965; Edmonds, 1967; Tarjan, 1977) to find the best non-projective dependency parse tree in O(n²) time. Alternatively, we can use the head-selection method (Zhang et al., 2017), which runs in O(n²) time but often, though not always, produces a tree structure.
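The head-selection decoder is the simplest of the three: each word independently picks its best-scoring head. A minimal sketch (our own illustration, not the authors' code):

```python
def head_selection_parse(combined_scores):
    """O(n^2) head-selection decoding (Zhang et al., 2017):
    combined_scores[i][j] is the combined encoder + decoder score of
    word j heading word i; each word greedily picks its best head.
    The resulting head vector usually, but not necessarily, forms a
    tree -- guaranteed trees require Eisner or Chu-Liu/Edmonds."""
    return [max(range(len(row)), key=lambda j: row[j])
            for row in combined_scores]
```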

Monolingual Learning
In the unsupervised setting, the parse tree y is unknown. We follow Cai et al. (2017) and minimize the negative conditional Viterbi log likelihood as the training loss function:

L(Θ, Λ) = −∑_{t=1}^{N} max_y log P_{Θ,Λ}(y, x̂^(t) | x^(t))    (6)

where N is the number of training sentences. Since both the encoding and the decoding probabilities factorize over words (Eq. 2 and 4), we can rewrite Eq. 6 as follows to make it tractable:

L(Θ, Λ) = −∑_{t=1}^{N} ∑_i max_{h_i} [ log P(h_i | x^(t)) + log P(p̂_i | p_{h_i}) ]    (7)
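The factorized Viterbi loss reduces to a per-word max over candidate heads. A tiny runnable sketch under our own naming conventions (the real model computes these log-probabilities with the neural encoder and the decoder table):

```python
def viterbi_loss(enc_log_probs, dec_log_probs):
    """Factorized conditional Viterbi loss: for each word i, take the
    best head j under the sum of encoder log P(h_i = j | x) and
    decoder log P(p_hat_i | p_j), then negate the total.
    enc_log_probs[i][j] and dec_log_probs[i][j] are those two terms."""
    total = 0.0
    for enc_row, dec_row in zip(enc_log_probs, dec_log_probs):
        total += max(e + d for e, d in zip(enc_row, dec_row))
    return -total
```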
In the supervised setting, the gold parse tree y* is known and the loss function becomes:

L(Θ, Λ) = −∑_{t=1}^{N} log P_{Θ,Λ}(y*^(t), x̂^(t) | x^(t))    (8)

In both settings, we optimize the encoder parameters Θ and the decoder parameters Λ with stochastic gradient descent.

Cross-lingual Adaptation
To enable cross-lingual adaptation, we employ multilingual BERT (m-BERT; Kenton and Toutanova, 2019) and universal POS tags as the word and tag representations. We first train a CRF autoencoder (the source model) in a supervised way on the source language treebank. We then use the source model to initialize a second CRF autoencoder (the target model) and train it in an unsupervised way on the unannotated target language corpus. We stop the training after K epochs, where K is a hyper-parameter. During training of the target model, we encourage it to remain similar to the source model via regularization. We consider three forms of regularization proposed in prior work.

Regularization of Model Parameters (W)
The parameter regularization encourages similarity between the source model parameters and the target model parameters. We add the following regularization term Ω to the training loss (Eq. 6):

Ω_W = λ_W ‖Θ − Θ_s‖²_2    (9)

where Θ_s denotes the (frozen) source model parameters and the hyper-parameter λ_W controls the regularization strength.
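A plausible concrete form of this penalty is the squared L2 distance between flattened parameter vectors, sketched below (our illustration; the paper does not spell out the exact norm):

```python
def param_regularizer(theta_tgt, theta_src, lam_w):
    """W regularizer: lam_w * ||theta_tgt - theta_src||_2^2, tying the
    target parameters to the frozen source parameters. Both arguments
    are flat lists of floats here for simplicity."""
    return lam_w * sum((t - s) ** 2 for t, s in zip(theta_tgt, theta_src))
```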
Regularization on Edge Scores (E)
The regularization on edge scores encourages the source and target models to produce similar scores for each potential dependency in every training sentence x^(t):

Ω_E = λ_E ∑_{t=1}^{N} ‖s(x^(t)) − s_s(x^(t))‖²_F    (10)

where s(x^(t)) is the edge score matrix on sentence x^(t), computed by summing the encoder score s^Enc_{i,j} (Eq. 1) and the decoder score log P(p̂_i|p_j) for each possible dependency edge, and s_s(·) is the same matrix computed by the source model. The hyper-parameter λ_E controls the strength of the edge regularization.
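For one sentence, this penalty is just the squared Frobenius distance between the two models' edge score matrices; a minimal sketch (our own naming, not the authors' code):

```python
def edge_regularizer(s_tgt, s_src, lam_e):
    """E regularizer for one sentence: lam_e times the squared
    Frobenius distance between the target and source edge score
    matrices (lists of lists of floats)."""
    return lam_e * sum((a - b) ** 2
                       for row_t, row_s in zip(s_tgt, s_src)
                       for a, b in zip(row_t, row_s))
```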
Regularization on Parse Trees (T)
The regularization on parse trees encourages similarity between the parse trees predicted by the source and target models. To achieve this, we change the training loss (Eq. 6) into the following form:

L_T(Θ, Λ) = −∑_{t=1}^{N} max_y [ log P_{Θ,Λ}(y, x̂^(t) | x^(t)) + λ_T log P_{Θ_s}(y | x^(t)) ]    (11)

where λ_T is a hyper-parameter that controls the strength of the tree regularization.
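Under the factorized head-selection training above, one plausible reading of the tree regularizer adds the source model's per-head log-probabilities into the per-word max, so that heads favored by the source model are preferred during target training. This sketch is our interpretation, not the authors' released implementation:

```python
def tree_regularized_loss(enc_tgt, dec_tgt, enc_src, lam_t):
    """Viterbi loss with the source encoder's log-probabilities mixed
    into the per-word max. enc_tgt[i][j] and dec_tgt[i][j] are the
    target model's encoder/decoder log terms for head j of word i;
    enc_src[i][j] is the source model's log P(h_i = j | x)."""
    total = 0.0
    for e_row, d_row, s_row in zip(enc_tgt, dec_tgt, enc_src):
        total += max(e + d + lam_t * s
                     for e, d, s in zip(e_row, d_row, s_row))
    return -total
```

With lam_t = 0 this reduces to the unregularized Viterbi loss of Eq. 7.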

Data and Setup
Our experimental setup is the same as that of He et al. (2019). We evaluate all the methods on transferring an English parser to 10 nearby languages and 10 distant languages selected from the Universal Dependencies (UD) project, version 2.2 (Nivre et al., 2018). We use two sets of hyper-parameters: those for distant languages are tuned on the Arabic development set and those for nearby languages on the Spanish development set.
For supervised learning of the source model, we train on sentences of all lengths. For unsupervised learning of the target model, we train on sentences of length ≤ 40. We test the target model on sentences of all lengths and use Eisner's algorithm for parsing.
We run each experiment five times with different random seeds on a Tesla P40 GPU and report the average unlabeled attachment score (UAS) with punctuation excluded.

Results
We compare our method with a previous state-of-the-art approach (He et al., 2019) and several baselines in Table 1. The three generative methods are from He et al. (2019): F-Fix is their Flow-Fix model that directly transfers the generative source model, F-N is their Flow-FT model that trains on the target corpus without source regularization, and F-FT is their best-performing Flow-FT model that trains on the target corpus with source regularization. We rerun their source code (https://github.com/jxhe/cross-lingual-struct-flow) in our experiments. Among the discriminative models, DT is the direct transfer baseline and S-T is the self-training baseline, both of which use the biaffine parser (Dozat and Manning, 2017). S-T follows Rybak and Wróblewska (2018), who use the source model to predict parse trees on the target data and then perform supervised training of the target model on the predictions. The last eight methods are ours: Fix is direct transfer of the CRF autoencoder; N is our method without any regularization; W, E and T are our method with weight, edge and tree regularization respectively; and W+E, W+T and E+T are our method with two forms of regularization combined. As shown in Table 1, all the discriminative methods outperform the three generative methods on average, and the performance gap is especially large on nearby languages. This is consistent with the findings of He et al. (2019) when using m-BERT.
Comparing the discriminative methods, we find that our methods clearly outperform the DT, S-T and Fix baselines on both distant languages and nearby languages, showing the advantage of unsupervised training on target data. However, the improvements produced by our methods on nearby languages are much smaller than those on distant languages. This is not surprising considering that nearby languages share similar syntactic behaviors and direct transfer can already produce strong parsers.
Comparing our methods with and without regularization, we see that regularization helps in most cases. The usefulness of regularization is more prominent on nearby languages, probably because of the better performance of the source model on nearby languages.

Analysis
We evaluate our model with varying sizes of the target/source data and fixed source/target data in Figure 1. It can be seen that more target data can boost the accuracy on the distant language (Arabic), but hurt the accuracy on the nearby language (Spanish) unless alleviated by regularization. On the other hand, more source data is always helpful, especially on the distant language.

Conclusion
In this paper, we employ unsupervised discriminative parsers based on the CRF autoencoder framework for unsupervised cross-lingual adaptation of dependency parsers. We initialize the target model using the source model and train it on unannotated target data in an unsupervised way, with three forms of regularization that encourage its similarity to the source model. Our experiments show the advantage of our methods over previous generative methods and discriminative baselines.