Joint Learning of POS and Dependencies for Multilingual Universal Dependency Parsing

This paper describes the system of team LeisureX in the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. Our system predicts part-of-speech tags and dependency trees jointly. For the basic tasks, including tokenization, lemmatization and morphology prediction, we employ the official baseline model (UDPipe). To train models for low-resource languages, we adopt a sampling method based on other rich-resource languages. Our system achieves a macro-averaged LAS F1 score of 68.31%, an improvement of 2.51% over the UDPipe baseline.


Introduction
The goal of Universal Dependencies (UD) (Nivre et al., 2016; Zeman et al., 2017) is to develop multilingual treebanks whose annotations of morphology and syntax are cross-linguistically consistent. In this paper, we describe our system for the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies (Zeman et al., 2018), and we focus only on the subtasks of part-of-speech (POS) tagging and dependency parsing. For the intermediate steps, including tokenization, lemmatization and morphology prediction, we use the official baseline model (UDPipe).
Dependency parsing, which aims to predict the existence and type of linguistic dependency relations between words, is a fundamental part of natural language processing (NLP) (Li et al., 2018c; He et al., 2018). Many related NLP studies (Zhang et al., 2018; Bai and Zhao, 2018; Cai et al., 2018; Li et al., 2018b; Wang et al., 2018; Qin et al., 2017) can also contribute to a universal dependency parsing system. Universal dependency parsing focuses on learning syntactic dependency structure over many typologically different languages, including low-resource languages in a real-world setting. Within the dependency parsing literature, there are two dominant techniques: graph-based parsing (McDonald et al., 2005; Ma and Zhao, 2012; Kiperwasser and Goldberg, 2016; Dozat and Manning, 2017) and transition-based parsing (Nivre, 2003; Dyer et al., 2015). Graph-based dependency parsers enjoy the advantage of global search: they learn scoring functions over all possible parse trees and find the globally highest-scoring one. Transition-based dependency parsers build dependency trees incrementally from left to right, making a series of local multiple-choice decisions. In our system, we adopt transition-based dependency parsing in view of its relatively lower time complexity. Our system implements universal dependency parsing based on the stack-pointer networks (STACKPTR) parser introduced by Ma et al. (2018). Furthermore, previous work (Straka et al., 2016; Nguyen et al., 2017) showed that POS tags are helpful to dependency parsing. In particular, Nguyen et al. (2017) pointed out that accurate POS tags improve parsing performance and that the context of the syntactic parse tree can help resolve POS ambiguities. Therefore, we seek to jointly learn POS tagging and dependency parsing.
As long short-term memory (LSTM) networks (Hochreiter and Schmidhuber, 1997) have shown significant representational effectiveness on a wide range of NLP tasks, we leverage bidirectional LSTMs (BiLSTM) to learn shared representations for both POS tagging and dependency parsing. In addition, to train models for low-resource languages, we adopt a sampling method based on other rich-resource languages.
With these model improvements, our system achieves a macro-averaged LAS F1 score of 68.31% in this task, an improvement of 2.51% over the UDPipe baseline.

Our Model
In this section, we describe our joint model for POS tagging and dependency parsing in the CoNLL 2018 Shared Task, which is built on the STACKPTR parser introduced by Ma et al. (2018). Our model is mainly composed of three components: the representation (Section 2.1), the POS tagger (Section 2.2) and the dependency parser (Section 2.3). Figure 1 illustrates the overall model.

Representation
Representation is a key component in various NLP models, and good representations should ideally model both complex characteristics and linguistic contexts. In our system, we follow the bidirectional LSTM-CNN architecture (BiLSTM-CNNs) (Chiu and Nichols, 2016; Ma and Hovy, 2016), where CNNs encode word information into character-level representation and BiLSTM models context information of each word.
Character Level Representation Though word embeddings are popular in many existing parsers, they are not ideal for languages with high out-of-vocabulary (OOV) ratios. Hence, our system introduces a character-level representation (Li et al., 2018a) to address this challenge. Formally, given a word $w = \{BOW, c_1, c_2, \ldots, c_n, EOW\}$, where the two special tags BOW (begin-of-word) and EOW (end-of-word) mark the start and end positions respectively, we use a CNN to extract the character-level representation as follows:
$$e_c = \mathrm{MaxPool}(\mathrm{Conv}(w))$$
where the CNN is similar to the one in Chiu and Nichols (2016), but we use only characters as inputs to the CNN, without character type features.
Our code will be available here: https://github.com/bcmi220/joint_stackptr.
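The character-level extraction can be illustrated with a minimal pure-Python sketch. It is not the system's implementation: the marker symbols, the toy one-dimensional character table and the two filters are hypothetical values chosen only to make the convolution and max-over-time pooling concrete.

```python
BOW, EOW = "<bow>", "<eow>"  # hypothetical marker symbols


def embed_chars(word, char_table):
    """Look up one vector per symbol of BOW + characters + EOW."""
    symbols = [BOW] + list(word) + [EOW]
    return [char_table[s] for s in symbols]


def char_cnn(char_vectors, filters):
    """Return MaxPool(Conv(w)): one max-over-time value per filter.

    Each filter is a list of `width` weight vectors, one per window position.
    """
    pooled = []
    for f in filters:
        width = len(f)
        if len(char_vectors) < width:
            raise ValueError("word shorter than the filter width")
        activations = []
        for start in range(len(char_vectors) - width + 1):
            window = char_vectors[start:start + width]
            # Dot product between the filter and the current window.
            act = sum(w * x
                      for w_vec, x_vec in zip(f, window)
                      for w, x in zip(w_vec, x_vec))
            activations.append(act)
        pooled.append(max(activations))
    return pooled


# Toy example: 1-dimensional character vectors and two width-3 filters.
table = {BOW: [0.0], "a": [1.0], "b": [2.0], EOW: [0.0]}
filters = [[[1.0], [1.0], [1.0]], [[1.0], [0.0], [-1.0]]]
e_c = char_cnn(embed_chars("ab", table), filters)
```

Because the pooling takes the maximum over all window positions, the resulting vector has one entry per filter regardless of the word length.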
Word Level Representation Word embeddings are a standard component of most state-of-the-art NLP architectures. Because they capture syntactic and semantic information about words from large-scale unlabeled text, we pre-train word embeddings on the given training dataset with the word2vec toolkit (Mikolov et al., 2013). For low-resource languages without available training data, we sample training data from similar languages to generate a mixed dataset.
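One way to build such a mixed dataset is sketched below. The exact sampling procedure is not specified in the text, so the function name, the similarity weights and the choice of sampling with replacement are all assumptions made for illustration.

```python
import random


def sample_mixed_dataset(treebanks, weights, size, seed=0):
    """Draw `size` sentences from several source treebanks.

    treebanks: dict mapping a language name to its list of training sentences.
    weights:   dict mapping language names to non-negative similarity weights
               (hypothetical values; they are not taken from the paper).
    Sentences are drawn with replacement, which is an assumption.
    """
    rng = random.Random(seed)
    names = [n for n in treebanks if weights.get(n, 0) > 0 and treebanks[n]]
    if not names:
        raise ValueError("no source treebank with a positive weight")
    source_weights = [weights[n] for n in names]
    mixed = []
    for _ in range(size):
        # Pick a source language in proportion to its similarity weight,
        # then pick one of its sentences uniformly at random.
        lang = rng.choices(names, weights=source_weights, k=1)[0]
        mixed.append(rng.choice(treebanks[lang]))
    return mixed
```

A language with weight zero never contributes sentences, so dissimilar treebanks can be excluded simply by leaving them out of `weights`.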

POS Tagger
To enrich morphological information, we also incorporate UPOS tag embeddings into the representation. Therefore, we jointly predict the UPOS tag in our system. The architecture of the POS tagger in our model is almost identical to that of the parser (Dozat et al., 2017). The tagger uses a BiLSTM over the concatenation of word embeddings and character embeddings, and then calculates the probability of each tag type using affine classifiers. The tag classifier is trained jointly, using cross-entropy losses that are summed together with the dependency parser loss during optimization.
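The classification step can be sketched as follows. This is a simplified illustration that assumes a single affine layer followed by a softmax applied to one BiLSTM output vector; the function names and the toy weights are hypothetical, and the actual tagger may include additional layers.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def tag_distribution(hidden, weight, bias):
    """Affine layer plus softmax: one probability per UPOS tag.

    hidden: BiLSTM output vector for one word.
    weight: one row of weights per tag; bias: one value per tag.
    """
    scores = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(weight, bias)]
    return softmax(scores)


def predict_tag(hidden, weight, bias):
    """Index of the most probable tag."""
    probs = tag_distribution(hidden, weight, bias)
    return max(range(len(probs)), key=probs.__getitem__)
```

The predicted tag index is what would be looked up in the tag embedding table before the parser's encoder is run.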

Context-sensitive Representation
In order to integrate contextual information, we concatenate the character embedding $e_c$, the pre-trained word embedding $e_w$ and the UPOS tag embedding $e_{pos}$, and feed them into the BiLSTM. We take the bidirectional vectors at the final layer as the context-sensitive representation. Notably, we use the UPOS tag from the output of our POS tagging model.

Dependency Parsing
The universal dependency parsing component of our system is built on the current state-of-the-art approach STACKPTR, which combines pointer networks (Vinyals et al., 2015) with an internal stack for tracking the status of depth-first search. It benefits from the global information of the sentence and all previously derived subtree structures, and removes the left-to-right restriction in classical transition-based parsers.
The STACKPTR parser mainly consists of two parts: an encoder and a decoder. The encoder, based on the BiLSTM-CNNs architecture, takes the sequence of tokens and their POS tags as input and encodes it into encoder hidden states $s_i$. The internal stack $\sigma$ is initialized with a dummy ROOT. The decoder (a uni-directional RNN) receives the input from the last step and outputs the decoder hidden state $h_t$. At each timestep $t$, the pointer network takes the top element $w_h$ of the stack $\sigma$ as the current head and selects a specific child $w_c$, using the biaffine attention mechanism (Dozat and Manning, 2017) as the attention score function over all possible head-dependent pairs. If $c \neq h$, the child $w_c$ is pushed onto the stack $\sigma$ for the next step; otherwise, all children of the current head $h$ have been selected, and the head $w_h$ is popped off the stack $\sigma$. The pointer network uses the attention scores $a^t$ as a pointer to select the child element. More specifically, the decoder maintains a list of available words in the test phase. For each head $h$ at each decoding step, the selected child is removed from the list so that it cannot be selected as a child of other head words.
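The decoding procedure can be sketched in a few lines of Python. This is an illustrative greedy decoder, not the system's code. The `score` callback is a hypothetical stand-in for the attention score of a head and a candidate, and `biaffine_score` assumes the standard biaffine form (a bilinear term, two linear terms and a bias). The rule that ROOT may not stop while words remain unattached is also an assumption, added here so that every word receives a head.

```python
def biaffine_score(h_t, s_i, W, U, V, b):
    """Assumed biaffine form: h_t^T W s_i + U^T h_t + V^T s_i + b."""
    bilinear = sum(h_t[p] * W[p][q] * s_i[q]
                   for p in range(len(h_t)) for q in range(len(s_i)))
    linear = (sum(u * x for u, x in zip(U, h_t))
              + sum(v * x for v, x in zip(V, s_i)))
    return bilinear + linear + b


def stackptr_decode(n_words, score):
    """Greedy stack-pointer decoding.

    Words are indexed 1..n_words; index 0 is the dummy ROOT.
    score(head, candidate) returns a number; candidate == head means
    "this head has no more children".
    Returns a list `heads` where heads[c] is the head of word c.
    """
    heads = [None] * (n_words + 1)
    available = set(range(1, n_words + 1))
    stack = [0]
    while stack:
        h = stack[-1]
        candidates = sorted(available)
        # Assumption: ROOT cannot stop while words are still unattached.
        if not (h == 0 and available):
            candidates.append(h)
        c = max(candidates, key=lambda cand: score(h, cand))
        if c != h:
            heads[c] = h          # attach the selected child
            available.remove(c)   # it cannot be selected again
            stack.append(c)       # continue depth-first from the child
        else:
            stack.pop()           # all children of h have been selected
    return heads
```

Each step either attaches one word or pops one stack element, so the loop runs at most 2n + 1 times for a sentence of n words.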
Given a dependency tree, a specific head may have multiple children. This results in more than one valid selection at each time step, which might confuse the decoder. To address this problem, the parser introduces an inside-outside order to utilize second-order sibling information, which has been proven to be an important feature for parsing (McDonald and Pereira, 2006; Koo and Collins, 2010). To utilize the second-order information, the parser replaces the decoder input $s_i$ with $s_s \circ s_h \circ s_i$, where $s$ and $h$ indicate the sibling and head index of node $i$, and $\circ$ is the element-wise sum operation, which ensures that no additional model parameters are needed.
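Both ideas in this paragraph can be sketched briefly. The helper names are hypothetical. Ordering the children by their distance from the head, with ties broken towards the left, is one plausible reading of the inside-outside order and is an assumption, since the exact convention is not spelled out here.

```python
def inside_out_order(head, children):
    """Order children from nearest to farthest from the head.

    Assumption: ties between a left and a right child at the same
    distance are broken in favour of the left child.
    """
    return sorted(children, key=lambda c: (abs(c - head), c))


def decoder_input(enc_states, node, head=None, sibling=None):
    """Element-wise sum of the encoder states of the node, head and sibling.

    Summing (rather than concatenating) keeps the input dimension fixed,
    so no additional parameters are needed.
    """
    parts = [enc_states[node]]
    if head is not None:
        parts.append(enc_states[head])
    if sibling is not None:
        parts.append(enc_states[sibling])
    return [sum(values) for values in zip(*parts)]
```

Fixing a single order over the children leaves exactly one correct selection per decoding step during training.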

Loss Function
The training objective of our system is to learn the probability of the UPOS tags $P_{\theta_{pos}}(y_{pos} \mid x)$ and of the dependency tree $P_{\theta_{dep}}(y_{dep} \mid x, y_{pos})$. Given a sentence $x$, the probabilities are factorized, with $\theta_{pos}$ and $\theta_{dep}$ representing the respective model parameters. In the factorization, $p_{<i}$ denotes the preceding dependency paths that have already been generated, $c_{i,j}$ represents the $j$-th word in path $p_i$, and $c_{i,<j}$ denotes all the preceding words on the path $p_i$. The whole loss is therefore the sum of three objectives:
$$\mathrm{Loss} = \mathrm{Loss}_{pos} + \mathrm{Loss}_{arc} + \mathrm{Loss}_{label}$$
where $\mathrm{Loss}_{pos}$, $\mathrm{Loss}_{arc}$ and $\mathrm{Loss}_{label}$ are based on the conditional likelihood of their corresponding targets, computed with the cross-entropy loss. Specifically, we train a dependency label classifier following Dozat and Manning (2017), which takes the dependency head-child pair as input features.
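The summed objective can be made concrete with a small sketch. It assumes that each component already produces a probability distribution per prediction and that the three losses are added with equal weight, as in the formula above; the function names are hypothetical.

```python
import math


def cross_entropy(distributions, gold):
    """Sum of negative log-probabilities assigned to the gold targets."""
    return -sum(math.log(probs[g]) for probs, g in zip(distributions, gold))


def joint_loss(pos_probs, pos_gold, arc_probs, arc_gold,
               label_probs, label_gold):
    """Loss = Loss_pos + Loss_arc + Loss_label."""
    loss_pos = cross_entropy(pos_probs, pos_gold)
    loss_arc = cross_entropy(arc_probs, arc_gold)
    loss_label = cross_entropy(label_probs, label_gold)
    return loss_pos + loss_arc + loss_label
```

Because the three terms are simply added, gradients from the parsing losses also flow into the shared representation used by the tagger.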

System Implementation
Our system focuses on three targets: the UPOS tag, the dependency arc and the dependency relation. Therefore, we rely on the UDPipe model for the intermediate steps. For treebanks with a non-empty training dataset (including treebanks whose training set is very small), we use the baseline UDPipe model trained on the corresponding treebank, which has been provided by the organizers. For the remaining nine treebanks without training data, we construct the training dataset by sampling from the other training datasets according to language similarity, inspired by (Zhao et al., 2009, 2010; Wang et al., 2015, 2016), as detailed in Table 1.
Our system adopts the hyper-parameter configuration of Ma et al. (2018), with a few exceptions. We use 50-dimensional pretrained word embeddings, 100-dimensional tag embeddings and 512-dimensional recurrent states (in each direction). Our system drops embeddings and hidden states independently with 33% probability. We optimize with Adam (Kingma and Ba, 2015), setting the learning rate to 1e-3 and $\beta_1 = \beta_2 = 0.9$. Moreover, we train models for up to 100 epochs with batch size 32 on 3 NVIDIA GeForce GTX 1080Ti GPUs, at 200 to 500 sentences per second, with each model occupying 2 to 3 GB of GPU memory. A full run over the test datasets on the TIRA virtual machine (Potthast et al., 2014) takes about 12 hours.
Furthermore, we compare the performance of our system with the baseline and the best scorer on big treebanks (Table 3), PUD treebanks (Table 4) and low-resource languages (Table 5).

Results
Since our model uses the baseline model for tokenization and segmentation, we report the results only for the metrics our system focuses on, for each treebank, in Table 6. In addition, we compare our model with the best and the average results of the top ten models on each treebank, using LAS F1 as the evaluation metric, as shown in Figure 2.

Conclusion
In this paper, we describe our system in the CoNLL 2018 shared task on UD parsing. Our system uses a transition-based neural network architecture for dependency parsing, which predicts the UPOS tag and dependencies jointly. Combining pointer networks with an internal stack to track the status of the top-down, depth-first search in the parsing decoding procedure, the STACKPTR parser is able to capture information from the whole sentence and all the previously derived subtrees, removing the left-to-right restriction in classical transition-based parsers, while maintaining