AntNLP at CoNLL 2018 Shared Task: A Graph-Based Parser for Universal Dependency Parsing

We describe the graph-based dependency parser in our system (AntNLP) submitted to the CoNLL 2018 UD Shared Task. We use bidirectional lstm to get the word representation, then a bi-affine pointer networks to compute scores of candidate dependency edges and the MST algorithm to get the final dependency tree. From the official testing results, our system gets 70.90 LAS F1 score (rank 9/26), 55.92 MLAS (10/26) and 60.91 BLEX (8/26).


Introduction
The focus of the CoNLL 2018 UD Shared Task is learning syntactic dependency parsers that can work over many typologically different languages, even low-resource languages for which there is little or no training data. The Universal Dependencies (Nivre et al., 2017a,b) treebank collection has 82 treebanks over 57 kinds of languages.
In this paper we describe our system (AntNLP) submitted to the CoNLL 2018 UD Shared Task. Our system is based on the deep biaffine neural dependency parser (Dozat and Manning, 2016). The system contains a BiLSTM feature extractor for getting context-aware word representation and two biaffine classifiers to predict the head token of each word and the label between a head and its dependent.
There are three main metrics for this task, LAS (labeled attachment score), MLAS (morphologyaware labeled attachment score) and BLEX (bilexical dependency score). From the official testing results, our system gets 70.90 LAS F1 score (rank 9/26), 55.92 MLAS (10/26) and 60.91 BLEX (8/26). In a word, Our system is ranked top 10 according to the three metrics described above. Additionally, in the categories of small treebanks, our system obtains the sixth place with a MLAS score of 63.73. Besides that, our system ranked tenth in the EPE 2018 campaign with a 55.71 F1 score.
The rest of this paper is organized as follows. Section 2 gives a brief description of our overall system, including the system framework and parser architecture. In Section 3, 4 we describe our monolingual model and multilingual model. In Section 5, we briefly list our experimental results.

System Overview
The CoNLL 2018 UD Shared Task aims to construct dependency trees based on raw texts, which means that the participants should not only build Figure 2: The architecture of our parser system. the parsing model, but also preprocess systems of the sentence segmentation, tokenization, POStagging and morphological analysis. We use preprocessors from the official UDPipe tool in our submission. The structure of the entire system is shown in Figure 1. Our main focus is on building a graph-based parser.
We implement a graph-based bi-affine parser following Dozat and Manning (2016). The parser architecture is shown in Figure 2, which consists of the following components: • Token representation, which produces the context independent representation of each token in the sentence.
• Deep Bi-LSTM Encoder, which produces the context-aware representation of each token in the sentence based on context.
• Bi-affine Pointer Networks, which assign probabilities to all possible candidate edges.
We describe the three sub-modules in the following sections in detail.

Token representation
Recent studies on dependency parsing show that densely embedded word representation could help to improve empirical parser performance. For example: Chen and Manning (2014) map words and POS tags to a d-dimensional vector space. Dozat and Manning (2016) use the pre-trained GloVe embeddings as an extra representation of the word. Ma et al. (2018) use Convolutional Neural Networks (CNNs) to encode character-level information of a word. The token representation module of our parser also uses dense embedding representations. Details on token embeddings are given in the following.
• Word e w i : The word embedding is randomly initialized from the normal distribution N (0, 1) (e w i ∈ R 100 ).
• Lemma e l i : The lemma embedding is randomly initialized from the normal distribution N (0, 1) (e l i ∈ R 100 ).
• Pre-trained Word e pw i : The FastText pretrained word embedding (e pw i ∈ R 300 ). We will not update e pw i during the training process.
• UPOS e u i : The UPOS-tag embedding is randomly initialized from the normal distribution N (0, 1) (e u i ∈ R 100 ).
• XPOS e x i : The XPOS-tag embedding is randomly initialized from the normal distribution N (0, 1) (e x i ∈ R 100 ).
• Char e c i : The character-level embedding is obtained by the character-level CNNs (e c i ∈ R 64 ).
Our parser uses two kinds of token representations, one is a lexicalized representation of the monolingual model, another one is the delexicalized representation of the multilingual model.
The lexicalized representation x l i of token w i is defined as: and the delexicalized representation x d i of token w i is defined as: In the following sections, we uses x i to represent x l i or x d i when the context is clear.

Deep Bi-LSTM Encoder
Generally, the token embeddings defined above are context independent, which means that the sentence-level information is ignored. In recent years, some work shows that the deep BiLSTM can effectively capture the contextual information of words (Dyer et al., 2015;Kiperwasser and Goldberg, 2016;Dozat and Manning, 2016;Ma et al., 2018). In order to encode context features, we use a 3layer sentence level BiLSTM on top of x 1:n : θ are the model parameters of the forward hidden sequence h. θ are the model parameters of the backward hidden sequence h. The vector v i is our final vector representation of ith token in s, which takes into account both the entire history h i and the entire future h i by concatenating h i and h i .

Biaffine Pointer Networks
How to determine the probability of each dependency edge is an important part of the graphbased parser. The work of Dozat and Manning (2016) shows that the biaffine pointer (attention) networks can calculate the probability of each dependency edge well. Here we used a similar biaffine pointer network structure.
In order to better represent the direction of the dependency edges, we use multi-layer perceptron (MLP) networks to learn each word as the representation of head and dependent words, rather than simply exchanging feature vectors. And we also separate the predictions of dependent edges and their labels. First, for each v i , we use two MLPs to define two pointers h arc i and s arc i , which is the representation of v i with respect to whether it is seen as a head or a modifier of an candidate edge.
Similarly, we use h rel i and s rel to describe x i when determine the relation label of a candidate edge.
We first use the arc-biaffine pointer networks to predict the probability of a dependency edge between any two words. For any two words w i and w j in a sentence, the probability p (arc) i→j that they form a dependency edge w i → w j is as follows: where θ arc = {W (arc) , u (arc) } are the model parameters, a i→j is the computed score of the dependency edge. a (arc) i→ * is a vector, and the k th dimension is the score of the dependency edge a (arc) i→j is the j th dimension of normalization of the vector a (arc) i→ * , meaning the probability of dependency edge w i → w j .
We obtain a dependency tree representation T of a complete graph p (arc) * → * using the maximum spanning tree (MST) algorithm. The probability p of the relation r of each dependency edge where θ rel = {W (rel) , V (rel) , U (rel) } are the model parameters, W (rel) is a 3-dimensional tensor. a

Training Details
We train our model by minimizing the negative log likelihood of the gold standard (w i ) arcs in all training sentences: where τ is the training set, h(w i ) and r(w i ) is w i 's gold standard head and relation within sentence S, and N s is the number of words in S.
251 Figure 3: The architecture of our parser system.

Monolingual Model
There are 82 treebanks in the CoNLL 2018 UD Shared Task, including 61 big treebanks, 5 PUD treebanks (additional parallel test sets), 7 small treebanks and 9 low-resource language treebanks. There are several languages in which there are many treebanks, such as en ewt, en gum and en lines in English. We combine training sets and development sets for multiple treebanks of the same language. And then just train a model for the language and make predictions on its different treebanks.
For each language of UD version 2.2 sets (Nivre et al., 2018;Zeman et al., 2018) with both a training set and a development set, we train a parser using lexicalized token representation and only using its monolingual training set (no cross-lingual features) 1 . The architecture of the monolingual model is shown in Figure 3.

Multilingual Model
For 7 languages without a development set, we divide them into two classes based on the size of their training set, which can be fine-tuned (ga, sme) and can not be fine-tuned (bxr, hsb, hy, kk, kmr).
For each language of UD version 2.2 sets (Nivre 1 In total, we trained 46 monolingual models. Zeman et al., 2018) with both a training set and a development set, we train a parser using delexicalized token representation as a crosslanguage model. The architecture of the multilingual model is shown in Figure 3. The training set of these 5 languages are then used as a development set to validate the performance of each cross-language model (see Table 1). We select the best performance model as a cross-language model for the corresponding language. For both ga and sme, we manually divide the development set from the training set and fine-tune the crosslanguage model. Prediction and fine-tuning results are shown in the Table 2.

Experimental Results
We trained our system based on a Nvidia GeForce GTX Titan X. We used the official TIRA (Potthast et al., 2014) to evaluate the system. We used Dynet neural network library to build our system (Neubig et al., 2017). The hyperparameters of the final system used for all the reported experiments are detailed in Table 5.

Overall Results
The main official evaluation results are given in Table 4. And the Table 6 shows the per-treebank LAS F1 results. Our system achieved 70.90 F1 (LAS) on the overall 82 tree banks, ranked 9 th out  Table 1: Corpus listed above are languages that don't have development set and the training set size is too small to be fine-tuned. "#Train" means the number of sentences. "Cross Language" means the language with the highest LAS score for corresponding origin language in our delexicalized cross-language model.     (Straka et al., 2016), our system gained 5.10 LAS improvement on average. Our system shows better results on 7 small treebanks. Performance improvement are more obvious when considering only small treebanks(for example, our system ranked fourth best on ru taiga and sl sst).
Besides that, our system ranked tenth in the EPE 3 2018 campaign with a 55.71 F1 score.

Discussion on Multilingual Model
As described in section 4, we trained 46 crosslanguage models and selected the corresponding cross-language model for 7 languages that did not have a development set. Generally, cross-language models are trained in the language of the same family. However, apart from grammatical similarity, the language family division also considers the linguistic history, geographical location and other factors. We want to select a language's crosslanguage model to consider only grammatical similarity. So we use a cross-language model to predict the results in this language as a basis for selection.
In table 3, the experimental results show that the Cross-language model with the best performance in hsb, hy, kk, and kmr languages comes from the same language family, while the Cross-language model with the best performance in bxr, ga, and sme is not from the same language family. Therefore, constructing a cross-language model according to the language of the same family is only applicable to some languages, not all of them. We've only chosen the best performing cross-language model at the moment. In the future, we will try to select the top-k cross-language model.

Conclusions
In this paper, we present a graph-based dependency parsing system for the CoNLL 2018 UD Shared Task, which composed of a BiLSTMs feature extractor and a bi-affine pointer networks. The results suggests that a deep BiLSTM extractor and a bi-affine pointer networks is a way to achieve competitive parsing performances. We will continue to improve our system in our future work.