A Simple yet Effective Joint Training Method for Cross-Lingual Universal Dependency Parsing

This paper describes Fudan’s submission to the CoNLL 2018 Universal Dependency Parsing shared task. We jointly train models when two languages are similar according to linguistic typology, and then ensemble the models using a simple re-parsing algorithm. We outperform the baseline method by 4.4% (2.1%) on average on the development (test) set of the CoNLL 2018 UD Shared Task.


Introduction
Dependency parsing is a fundamental task in Natural Language Processing (NLP). Recently, universal dependency parsing (Zeman et al., 2018a,b) has unified the annotations of different languages and thus made transfer learning among languages possible. Several works using cross-lingual embeddings (Duong et al., 2015; Guo et al., 2015) have successfully increased the accuracy of cross-lingual parsing. Beyond embedding-based methods, a natural question is whether we can utilize the universal information in a simpler way. Previous research either regarded the universal information as extra training signals (e.g., delexicalized embeddings (Dehouck and Denis, 2017)) or implicitly trained a network with all features (e.g., adversarial training for parsing in Sato et al. (2017)). In our system, we manually and explicitly share the universal annotations via a shared LSTM component. Similar to Vania et al. (2017), languages are first grouped based on typology, as shown in Table 1. Then we train a shared model for each pair of languages within the same group and apply a simple ensemble method over all trained models. Note that our method is orthogonal to other cross-lingual approaches to universal parsing such as cross-lingual embeddings.
In the following parts, we first describe the baseline method (Section 2) and our system (Section 3). We report results on both the development set and the test set in Section 4 and provide some analysis of the model in Section 5.

Baseline
In this section, we briefly introduce the baseline system, UDPipe 1.2 (Straka and Straková, 2017), which is an improved version of the original UDPipe (Straka et al., 2016). The tokenization, POS tags and lemmas produced by UDPipe are used by FudanParser.
UDPipe employs a GRU network during the inference of segmentation and tokenization. The tagger uses character features to predict the POS and lemma tags. Finally, a transition-based neural dependency parser with one hidden layer predicts the transition actions. The parser also makes use of the information from lemmas, POS tags and dependency relations through a group of embeddings precomputed by word2vec.
In the later discussion, we take the baseline performance results from the web page of the shared task for comparison.

System Description
In this submission, we only consider parsing in an end-to-end manner and handle each treebank separately. We first train a monolingual model for every "big" treebank. In addition, for each language, there are N − 1 models fine-tuned from jointly trained models (see Figure 2), where N is the number of languages in the same language group.
For small treebanks whose training set contains fewer than 50 sentences, we use the same delexicalized method as Shi et al. (2017)'s approach for the surprise languages: delexicalized features (morphology and POS tags) are taken as input, with a 50% dropout rate applied to the input. In practice, we found that the baseline method performs much better than ours on "fi_pud", "br_keb", "ja_modern" and "th_pud", so we use the baseline method instead for these languages.
Our whole system needs about 90 hours for the inference of all models on TIRA and requires no more than 560 MB of main memory.

Architecture
Features. We use words and characters as the lexical information, and morphological features and POS tags as the delexicalized information. We also tried subword embeddings, but they mostly did not help. More precisely, the character-level features are treated as a bag of characters. Similarly, we use a bag of morphological features (one can view a feature such as number=single as a "character"). We first assign embedding vectors to characters and morphological features, and then, for each word, apply a Convolutional Neural Network (CNN) to encode the variable-length sequence of embeddings into one fixed-length feature.
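As a minimal sketch of this step (not the actual implementation; the filter width, filter count, and all names are illustrative assumptions), a 1-D convolution over character embeddings followed by max-over-time pooling yields a fixed-length word feature regardless of word length:

```python
import numpy as np

def cnn_char_encoder(char_embs, filters, width=3):
    """Encode a variable-length sequence of character embeddings into one
    fixed-length vector: 1-D convolution + max-over-time pooling.
    char_embs: (L, d) array, one row per character.
    filters:   (k, width * d) array of k convolution filters.
    Returns a (k,) vector regardless of L."""
    L, d = char_embs.shape
    pad = np.zeros((width - 1, d))
    padded = np.vstack([pad, char_embs, pad])  # zero-pad so any L works
    windows = np.stack([padded[i:i + width].ravel()
                        for i in range(padded.shape[0] - width + 1)])
    conv = np.tanh(windows @ filters.T)        # (positions, k)
    return conv.max(axis=0)                    # max-over-time pooling

rng = np.random.default_rng(0)
emb = rng.standard_normal((7, 10))             # a 7-character word, d = 10
filt = rng.standard_normal((50, 3 * 10))       # 50 filters of width 3
vec = cnn_char_encoder(emb, filt)
print(vec.shape)                               # (50,)
```

The same encoder applies unchanged to the bag of morphological features, since each feature is embedded and pooled the same way.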
Biaffine BiLSTM. Similar to Shi et al. (2017), Sato et al. (2017) and Vania et al. (2017), we use last year's first-place model (Dozat et al., 2017), the graph-based biaffine BiLSTM model, as our backbone. Given a sentence of N words, the input is first fed to a bi-directional LSTM to obtain the feature h_i of each word w_i. A head MLP and a dependent MLP transform the features, which are then fed into a hidden layer to calculate the biaffine attention. Finally, the score of an arc from head i to dependent j is computed as s(i, j) = h_i^T U_1 h_j + u_2^T h_i, where U_1 ∈ R^{d×d} and u_2 ∈ R^d are trainable parameters; label scores are computed analogously with label-specific parameters.
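As an illustration, the biaffine arc scorer can be sketched as follows. The exact functional form is a plausible reconstruction consistent with Dozat et al. (2017), and all variable names here (H_head, H_dep, etc.) are our own:

```python
import numpy as np

def biaffine_arc_scores(H_head, H_dep, U1, u2):
    """Score every (head, dependent) pair at once.
    H_head, H_dep: (N, d) MLP-transformed BiLSTM features.
    U1: (d, d) and u2: (d,) are the trainable parameters from the paper.
    Returns S with S[i, j] = score of word i heading word j."""
    return H_head @ U1 @ H_dep.T + (H_head @ u2)[:, None]

rng = np.random.default_rng(1)
N, d = 5, 8
Hh = rng.standard_normal((N, d))
Hd = rng.standard_normal((N, d))
U1 = rng.standard_normal((d, d))
u2 = rng.standard_normal(d)
S = biaffine_arc_scores(Hh, Hd, U1, u2)
print(S.shape)   # (5, 5); column j holds head scores for word j
```

Computing the full N × N score matrix in one matrix product is what makes graph-based biaffine parsing efficient in practice.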

Joint Training
For a joint training model of N languages, we have N + 1 biaffine BiLSTMs (called LSTMs); see Figure 1. For each language, we have a language-specific LSTM that processes the lexical information such as word- or character-level embeddings, with output w^l_{i,j}. For all languages, we have a shared LSTM that takes delexicalized information such as morphology and POS tags as input, with output w^d_{i,j}. Inspired by Sato et al. (2017), we use a gating mechanism to combine these two sets of features. Formally, g = σ(W_g [w^l; w^d]) and w = g ⊙ w^l + (1 − g) ⊙ w^d, where w^l indicates the lexical feature, w^d indicates the delexicalized feature, and ⊙ is element-wise multiplication.
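A minimal sketch of one plausible form of this gating mechanism follows; the gate parameterization (a single linear layer W_g, b_g over the concatenated features) is an assumption, not necessarily the exact form used:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gate_features(w_lex, w_delex, W_g, b_g):
    """Combine the lexical feature (language-specific LSTM output) and the
    delexicalized feature (shared LSTM output) with a learned gate:
        g = sigmoid(W_g [w_lex; w_delex] + b_g)
        w = g * w_lex + (1 - g) * w_delex    (element-wise)
    W_g (d, 2d) and b_g (d,) are hypothetical trainable parameters."""
    g = sigmoid(W_g @ np.concatenate([w_lex, w_delex]) + b_g)
    return g * w_lex + (1.0 - g) * w_delex

rng = np.random.default_rng(2)
d = 4
w_l = rng.standard_normal(d)
w_d = rng.standard_normal(d)
W_g = rng.standard_normal((d, 2 * d))
b_g = rng.standard_normal(d)
w = gate_features(w_l, w_d, W_g, b_g)
print(w.shape)   # (4,)
```

When the gate saturates at 1, the combined feature reduces to the purely lexical one; at 0, to the purely delexicalized one, so the model can interpolate per dimension.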
The difference between Sato et al. (2017) and our model is that we remove the adversarial training loss, because we already use the universal information in the shared network.

Fine-tuning
We fine-tune each jointly trained model for 100 steps (see Figure 2).

Tree Ensemble
We follow the re-parsing method proposed by Sagae and Lavie (2006) to perform model ensemble. Suppose k parse trees T_1, T_2, ..., T_k have been obtained. A new graph is constructed by setting the score of each edge [u → v] to the number of trees T_i that contain it. This graph is fed to an MST algorithm to obtain the ensemble parse tree T_e. The relation label of each edge [u → v] in T_e is then decided by majority vote among the inputs T_i that contain [u → v].
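A minimal sketch of the vote-matrix construction and label voting is shown below. For brevity, the MST step is stubbed with a greedy per-word head choice (adequate only when no cycle arises); a full re-parser would substitute a maximum-spanning-tree routine such as Chu-Liu/Edmonds. All names and the toy data are illustrative:

```python
import numpy as np
from collections import Counter

def reparse_ensemble(heads_list, labels_list, n):
    """Re-parsing ensemble in the spirit of Sagae and Lavie (2006).
    heads_list[t][d-1] = head of word d (1..n) in tree t, 0 = root;
    labels_list[t][d-1] = relation label of that arc.
    Builds score[h][d] = number of trees containing arc h -> d, picks
    each word's highest-voted head (greedy stub for the MST step), and
    takes the majority label among trees proposing the chosen arc."""
    score = np.zeros((n + 1, n + 1))
    for heads in heads_list:
        for d, h in enumerate(heads, start=1):
            score[h][d] += 1
    out_heads, out_labels = [], []
    for d in range(1, n + 1):
        h = int(np.argmax(score[:, d]))
        out_heads.append(h)
        votes = Counter(labels[d - 1]
                        for heads, labels in zip(heads_list, labels_list)
                        if heads[d - 1] == h)
        out_labels.append(votes.most_common(1)[0][0])
    return out_heads, out_labels

# three parses of a 3-word toy sentence (two agree, one dissents)
trees = [[2, 0, 2], [2, 0, 2], [3, 0, 2]]
labels = [["nsubj", "root", "obj"], ["nsubj", "root", "obj"],
          ["nmod", "root", "obj"]]
print(reparse_ensemble(trees, labels, 3))
# ([2, 0, 2], ['nsubj', 'root', 'obj'])
```

Because every edge weight counts agreeing parsers, the ensemble tree can differ from every input tree while still maximizing total agreement.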

Hyper-parameters
We followed the hyper-parameter settings of Dozat et al. (2017). We train 30,000 steps for each model and then optionally fine-tune for 100 steps for the given language. All input features have dimension 100. The LSTM has a hidden size of 400 and 3 layers. A dropout rate of 0.33 is applied to the input and the LSTM hidden layers. We use Bayesian dropout (Gal and Ghahramani, 2016) in the LSTM layers. We also use word dropout (dropping a whole word with some probability) in the input layer.
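Word dropout at the input layer can be sketched as follows; the unknown-token id and function names are hypothetical, not taken from the actual implementation:

```python
import numpy as np

def word_dropout(token_ids, unk_id, p, rng):
    """Word dropout: each token id is replaced by the unknown-word id
    with probability p, so the model cannot over-rely on any single
    lexical item. `unk_id` is a hypothetical placeholder index."""
    mask = rng.random(len(token_ids)) < p
    return np.where(mask, unk_id, np.asarray(token_ids))

rng = np.random.default_rng(3)
ids = np.array([5, 9, 2, 7])
print(word_dropout(ids, 0, 0.5, rng))  # roughly half the ids become 0
```

Unlike standard dropout on embedding dimensions, this zeroes out whole words, which acts as a regularizer toward the delexicalized features.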

Results
The results on the test and development sets are shown in Table 5 and Table 6, respectively. The first three columns are the baseline results and the next three columns are the results of our submission. We also list the performance improvement of Fudan Parser over the baseline system in the last three columns.

Figure 2: An example with four languages. We aim at parsing sentences in language 1. We first jointly train language 1 with each of the other three languages in three separate networks. Then we keep only LSTM 1 and the shared LSTM to fine-tune the models for language 1. Finally, we re-parse the outputs as an ensemble to obtain the final parse tree for a given sentence in language 1.

Comparing Tables 5 and 6, we find that our system achieves higher improvements on the datasets with a large amount of training data. This is reasonable since our model contains an enormous number of parameters, which easily overfit if the training set is too small. More analysis is included in Section 5.

Language similarity
The accuracy of the joint training model actually reveals the syntactic similarity between two languages. The accuracies of three language groups, Slavic (Table 2), Romance (Table 3) and Germanic (Table 4), are reported. The number in row i, column j is the accuracy of language i tested on the model jointly trained on languages i and j. Bold font indicates the best model for language i. We can see that for every language, jointly trained models consistently beat the single model (the number on the diagonal), which shows the efficacy of the proposed approach.

Morphology
Morphology is extremely helpful when predicting the dependencies between words, especially for morphology-rich languages. However, since the UD Parsing task is done in an end-to-end fashion (i.e., the input morphological features are predicted rather than ground-truth labels), the morphology information is noisy. Performance suffers greatly because of these noisy predicted morphological features. A significant accuracy gain could be obtained if a better morphology prediction model were used.

Conclusion
Our system provides a simple yet effective method, sharing the universal features through a common part of the neural network, to boost the accuracy of syntactic parsing. We also demonstrated that morphological features play an important role in syntactic parsing, which is a promising direction for future work.
In the future, we can investigate a better way to do the ensemble or apply a multi-model compression method (e.g. knowledge distillation) to reduce the computational cost. Also, we can explore