Enhanced Universal Dependency Parsing with Second-Order Inference and Mixture of Training Data

This paper presents the system we submitted to the IWPT 2020 Shared Task. Our system is a graph-based parser with second-order inference. For the low-resource Tamil corpus, we mixed the Tamil training data with data from other languages and significantly improved the performance on Tamil. Due to a misunderstanding of the submission requirements, we submitted graphs that are not connected, which left our system ranked only 6th of 10 teams. However, after fixing this problem, our system scores 0.6 ELAS higher than the team ranked 1st in the official results.


Introduction
Based on Universal Dependencies (UD) (Nivre et al., 2016), Enhanced Universal Dependencies (EUD) (Bouma et al., 2020) 1 are non-tree graphs with reentrancies, cycles, and empty nodes, introduced because purely rooted trees cannot adequately represent grammatical relations. We found that parsing such graphs can be reduced to parsing bi-lexical structures, as in semantic dependency parsing (SDP) (Oepen et al., 2015), by encoding reentrancies and empty nodes as new labels. Wang et al. (2019) propose a state-of-the-art approach for SDP that uses second-order inference with Mean-Field Variational Inference. We adopt their approach for decoding and encode sentences with strong pretrained token representations: XLMR (Conneau et al., 2019), Flair (Akbik et al., 2018) and FastText (Bojanowski et al., 2017). Among the datasets, Tamil contains only 400 labeled sentences for training, which keeps the model's performance on Tamil low. To improve performance on this low-resource language, we propose to train the Tamil model on a mixture of the Tamil dataset and a high-resource language. Empirical results show that this approach improves ELAS by 2.44 on the Tamil test set. Due to our misconception of the submission format, we submitted invalid unconnected graphs to the submission site. The organizers kindly fixed these graphs with simple scripts, and our system ranked 6th of 10 teams in the official results. However, the submitted graphs can easily be made connected by applying tree algorithms during decoding. In the post-evaluation, we submitted our system outputs again and found that our system is 0.56 ELAS higher than the team ranked 1st in the official results.
System Description

Data Pre-processing

There are two features of EUD graphs that do not appear in SDP graphs. One is reentrancy: multiple arcs between the same head and dependent with different labels. We combined such arcs into one and concatenated their labels with the symbol '+' to mark the combination. In post-processing, we split arcs whose labels contain '+' back into multiple arcs. The other is the empty nodes introduced in the shared task (for example, nodes with id 1.1). We used the official script to collapse the graphs, reducing such empty nodes into non-empty nodes and introducing new dependency labels 2 . In post-processing, we add the empty nodes back according to the dependency labels. As the official evaluation only scores the collapsed graphs, this process does not impact system performance.
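The arc-combination step and its inverse can be sketched as follows (function names and example labels are illustrative, not the authors' actual code):

```python
def combine_multi_arcs(arcs):
    """Merge arcs sharing the same (head, dep) pair into one arc
    whose label joins the original labels with '+'."""
    merged = {}
    for head, dep, label in arcs:
        merged.setdefault((head, dep), []).append(label)
    return [(h, d, "+".join(labels)) for (h, d), labels in sorted(merged.items())]

def split_multi_arcs(arcs):
    """Inverse post-processing step: split '+'-joined labels back into arcs."""
    return [(h, d, lab) for h, d, label in arcs for lab in label.split("+")]

# Example: two enhanced arcs between the same pair of tokens.
arcs = [(2, 5, "nsubj"), (2, 5, "nsubj:xsubj")]
combined = combine_multi_arcs(arcs)       # [(2, 5, 'nsubj+nsubj:xsubj')]
assert split_multi_arcs(combined) == arcs  # round-trips losslessly
```

Because the transformation is a lossless round-trip, it affects only the label inventory the parser sees, not the evaluated graph.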

Approach
We follow the approach of Wang et al. (2019) 3 to build our system, which uses a second-order inference algorithm for arc prediction. Given a sentence with n words w = [w_1, w_2, ..., w_n], we feed their token representations into a three-layer BiLSTM:

R = BiLSTM(E)
where E = [e_1, . . . , e_n] is the concatenation of various embeddings of the tokens (we use a different combination of XLMR, Flair and FastText for each language as the token representation) and R = [r_1, . . . , r_n] is the output of the BiLSTM. For arc prediction, we use feedforward networks (FNN), Biaffine and Trilinear functions to compute unary potentials ψ(u) and binary potentials ψ(b):

ψ(u)_{ij} = FNN_Biaffine(r_i, r_j)
ψ(b)_{ijk} = FNN_Trilinear(r_i, r_j, r_k)

where FNN_Biaffine and FNN_Trilinear denote the composition of an FNN with a Biaffine/Trilinear function. Then we feed these potentials into a Mean-Field Variational Inference network for second-order inference:

P(Y|w) = MFVI(ψ(u), ψ(b))
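As a concrete illustration, the potential computation and a simplified mean-field update can be sketched in NumPy (toy dimensions, random weights, and a single sibling-style second-order factor; the actual system uses learned PyTorch modules and several factor types):

```python
import numpy as np

rng = np.random.default_rng(0)
n, h = 5, 8                              # toy sentence length and hidden size
R = rng.normal(size=(n, h))              # stand-in for BiLSTM outputs r_1..r_n

# Single-layer FNNs projecting tokens into head/dependent spaces.
W_head, W_dep = rng.normal(size=(h, h)), rng.normal(size=(h, h))
H, D = np.tanh(R @ W_head), np.tanh(R @ W_dep)

# Unary potential psi_u[i, j]: biaffine score of the arc head i -> dependent j.
U = rng.normal(size=(h, h)) * 0.1
psi_u = H @ U @ D.T                      # shape (n, n)

# Binary potential psi_b[i, j, k]: trilinear score of the arc (i, j) jointly
# with a second arc (i, k) sharing the same head (a sibling-style factor).
T = rng.normal(size=(h, h, h)) * 0.01
psi_b = np.einsum('ia,jb,kc,abc->ijk', H, D, D, T)   # shape (n, n, n)

# Mean-Field Variational Inference: iteratively refine arc beliefs q[i, j].
q = 1.0 / (1.0 + np.exp(-psi_u))         # initialise from unary scores
for _ in range(3):
    msg = np.einsum('ik,ijk->ij', q, psi_b)          # second-order messages
    q = 1.0 / (1.0 + np.exp(-(psi_u + msg)))
# q now plays the role of P(Y | w), the arc probability matrix.
```

Each iteration reweights every arc by messages from the current beliefs about neighbouring arcs, which is what makes the inference second-order.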
where P(Y|w) is a probability matrix representing the probabilities of all potential arcs. We first use tree algorithms, Eisner's algorithm (Eisner, 2000) or the MST algorithm (McDonald et al., 2005), to ensure the connectivity of the graph. Then we additionally add arcs at the positions where P(Y|w) > 0.5. For label prediction, we use FNN_Biaffine to score the labels of each potential arc.
We select the label with the highest score of each potential arc.
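The connectivity-first arc decoding described above can be sketched as follows; for brevity the sketch grows a spanning structure greedily rather than running exact MST or Eisner decoding, so it is an assumption-laden stand-in, not the authors' decoder:

```python
def decode_graph(prob, root=0, threshold=0.5):
    """Decode a connected graph from an arc probability matrix.

    1. Greedily grow a spanning arborescence from the root, always attaching
       the not-yet-attached token via the highest-probability arc
       (a simplified stand-in for exact MST/Eisner decoding).
    2. Add every remaining arc whose probability exceeds the threshold,
       which is what turns the tree into a (possibly reentrant) graph.
    """
    n = len(prob)
    attached, arcs = {root}, set()
    while len(attached) < n:
        h, d = max(((h, d) for h in attached for d in range(n)
                    if d not in attached),
                   key=lambda hd: prob[hd[0]][hd[1]])
        attached.add(d)
        arcs.add((h, d))
    arcs |= {(h, d) for h in range(n) for d in range(n)
             if h != d and prob[h][d] > threshold}
    return sorted(arcs)

# Toy 3-token example: the tree uses arcs (0,1) and (1,2); the extra
# high-probability arc (2,1) creates a reentrancy, as EUD allows.
prob = [[0.0, 0.9, 0.1],
        [0.0, 0.0, 0.2],
        [0.0, 0.8, 0.0]]
print(decode_graph(prob))   # [(0, 1), (1, 2), (2, 1)]
```

The two-stage design guarantees every token is reachable from the root even when no arc into it clears the 0.5 threshold.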
To train the system, we follow the approach of Wang et al. (2019) with the cross entropy loss:

L(arc)(θ) = − Σ_{i,j} [ 1(y(arc)_{ij}) log P(y(arc)_{ij} | w) + (1 − 1(y(arc)_{ij})) log(1 − P(y(arc)_{ij} | w)) ]

where θ denotes the parameters of our system, 1(y(arc)_{ij}) is the indicator function that equals 1 when edge (i, j) exists in the gold parse and 0 otherwise, and i, j range over all tokens w in the sentence. An analogous cross-entropy loss is used for the label predictions, and the two losses are combined by a weighted average.

3 https://github.com/wangxinyu0922/Second_Order_SDP
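A toy NumPy rendering of this training objective (helper names and the interpolation weight lam are illustrative assumptions, not the paper's values):

```python
import numpy as np

EPS = 1e-12  # numerical guard inside the logarithms

def arc_loss(q, gold):
    """Binary cross entropy over all arcs; q is P(y_ij = 1 | w),
    gold is the 0/1 indicator matrix of gold arcs."""
    return -np.sum(gold * np.log(q + EPS)
                   + (1 - gold) * np.log(1 - q + EPS))

def label_loss(label_probs, gold_arcs, gold_labels):
    """Cross entropy of the gold label, summed over gold arcs only."""
    return -sum(np.log(label_probs[i, j, gold_labels[i, j]] + EPS)
                for i, j in zip(*np.nonzero(gold_arcs)))

def total_loss(q, label_probs, gold_arcs, gold_labels, lam=0.5):
    """Weighted average of the arc and label losses."""
    return (lam * arc_loss(q, gold_arcs)
            + (1 - lam) * label_loss(label_probs, gold_arcs, gold_labels))
```

A perfect prediction drives both terms to (numerically) zero, while every wrongly present or absent arc contributes a positive penalty.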

Mixture of Datasets for Tamil Parser Training
The Tamil dataset has the fewest training and development sentences among all languages: 400 sentences for training and 80 for development. We therefore believe the Tamil parser can easily be improved with more training data.
With the emergence of multilingual contextual embeddings such as multilingual BERT (Devlin et al., 2019) and XLMR, training a unified multilingual model that performs well across all languages becomes possible by mixing the training data of multiple languages. However, this does not directly apply to the shared task, as the EUD label sets differ across languages. Nevertheless, the arc annotations of the other languages are still helpful for training the Tamil parser. We therefore removed the label annotations from the datasets of other languages so that no label loss is back-propagated from these data. We then mixed one of these languages with the fully annotated Tamil dataset. To address data imbalance in the mixed training set, we upsampled the Tamil training set to match the data size of the other language.
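The mixing scheme can be sketched as follows (the `(arcs, labels)` tuple format and function name are illustrative assumptions about the data representation):

```python
import random

def mix_datasets(tamil, other, seed=0):
    """Mix a fully annotated Tamil dataset with a high-resource one.

    Labels of the high-resource language are masked with None so that only
    its arc annotations contribute to the loss; the Tamil data are
    upsampled to roughly the size of the other dataset before shuffling.
    """
    unlabeled = [(arcs, None) for arcs, _labels in other]   # mask labels
    factor = max(1, len(other) // max(1, len(tamil)))
    upsampled = tamil * factor                              # upsample Tamil
    mixed = upsampled + unlabeled
    random.Random(seed).shuffle(mixed)
    return mixed

# Toy sizes: 4 Tamil sentences mixed with 12 high-resource sentences.
tamil = [("ta_arcs", "ta_labels")] * 4
other = [("hr_arcs", "hr_labels")] * 12
mixed = mix_datasets(tamil, other)      # 12 Tamil copies + 12 unlabeled
```

The upsampling factor keeps each training batch roughly balanced between Tamil and the high-resource language.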

Experimental Settings
In training, we split the official development set into halves, used as our development set and test set. We used the development set to select the model based on the labeled F1 score, the metric used in the SDP task, which evaluates the accuracy of predicted labeled arcs. We used the test set to choose the best model architecture. We use a batch size of 2000 tokens with the Adam optimizer (Kingma and Ba, 2015). The hyper-parameters of our system, mostly adopted from previous work on dependency parsing, are shown in Table 1. We use only the tokenized words as model input.
For the Tamil parser, we tried mixing the English or Czech dataset with the Tamil dataset. For most languages, we used frozen XLMR embeddings.

Table 2 shows the results of the official evaluation of all teams, as well as the post-evaluation of our system. In the official submission, we trained the Tamil parser on a mixture of the English and Tamil datasets ('Ours+en+MST' in the table); in the post-evaluation, we also tried a mixture of the Czech and Tamil datasets ('Ours+cs+MST' in the table), because the Czech dataset contains the largest training set of all languages. In the official results, our system outputs were fixed by the organizers with their simple scripts to ensure graph connectivity, which significantly reduced our system's performance. In the post-evaluation, we fixed this issue with the MST or Eisner's algorithm and show that our system performs 0.6 ELAS higher than the best team. For the Tamil parser, mixing the Tamil dataset with the Czech dataset performs 1.7 ELAS better than mixing with the English dataset, which shows that a larger dataset gives better results than a smaller one. Our system with the MST algorithm is 0.2 ELAS stronger than with Eisner's algorithm, which suggests that the non-projective tree algorithm (MST) suits the EUD task better than the projective one (Eisner's). We built our code on PyTorch (Paszke et al., 2019) and ran our experiments on a single Tesla V100 GPU.

Table 3 compares two embedding choices, XLMR+Flair+FastText and XLMR only, under first-order and second-order inference. The results show that second-order inference is stronger than first-order inference for all languages, and that XLMR embeddings alone usually perform better than XLMR+Flair+FastText. However, the Flair+FastText embeddings are helpful for Tamil. We therefore use XLMR+Flair+FastText embeddings for training the Tamil parser and XLMR embeddings alone for the other languages.

Performance Comparison between Connected Graphs and Non-Connected Graphs
Before the deadline of the shared task, the submission site showed the scores of each treebank separately even when the submitted graphs were not connected, which unfortunately led us to believe that non-connected graphs were also acceptable for the task. In fact, they are not; the organizers fixed the issue with some simple scripts, which resulted in a significant reduction of our final scores. In Section 3.2, we show that appending a tree-parsing algorithm to our system produces connected graphs with high scores.
Here we also evaluate the non-connected graphs produced by our original system. We think this evaluation is informative for two reasons. First, the results help us understand how differently connected and non-connected graphs perform. Second, in practice non-connected graphs can be predicted considerably faster: the MST and Eisner's algorithms are slow, whereas non-connected graphs can be obtained through simple argmax operations. We compare the performance of connected and non-connected graphs for each treebank and each language in Tables 4 and 5. The results show that the non-connected graphs perform slightly better than the graphs produced with the tree algorithms. Generating non-connected graphs is therefore more practical if connectivity is not required.

Table 2: Official evaluation of all systems and post-evaluation of our team in ELAS. We use ISO 639-1 language codes to represent the languages. 'MST' and 'Eis' denote the MST and Eisner's algorithms used for decoding. 'en' and 'cs' indicate which dataset was mixed with the Tamil dataset for training the Tamil parser. Note that 'Ours+en+MST' corresponds to the parser outputs used in the official submission.

Analysis of Mixture of Training Data
For a more in-depth analysis of how combining different language datasets affects the performance of the Tamil parser, Table 6 shows that more training data significantly improves the performance of the parser. We leave other language combinations, as well as similar studies for other parsers, to future work.

Table 5: A performance comparison in ELAS between non-connected graphs, connected graphs with the MST algorithm, and the best system in the official results for each language. 'Ours+en' represents our official submission, evaluated with the official evaluation script.

Conclusion
Our system is a parser with strong contextual embeddings and second-order inference. For the low-resource language, we propose training the model on a mixture of datasets. Empirical results show that second-order inference is stronger than first-order inference, and that mixing data significantly improves the parser's performance for the low-resource language. After fixing the graph connectivity issue, our system outperforms the system ranked 1st in the official results by 0.56 ELAS. We also show that non-connected graphs are practically useful given their higher performance and faster decoding. Our code is available at https://github.com/Alibaba-NLP/MultilangStructureKD.