A Novel Neural Network Model for Joint POS Tagging and Graph-based Dependency Parsing

We present a novel neural network model that learns POS tagging and graph-based dependency parsing jointly. Our model uses bidirectional LSTMs to learn feature representations shared for both POS tagging and dependency parsing tasks, thus handling the feature-engineering problem. Our extensive experiments, on 19 languages from the Universal Dependencies project, show that our model outperforms the state-of-the-art neural network-based Stack-propagation model for joint POS tagging and transition-based dependency parsing, resulting in a new state of the art. Our code is open-source and available together with pre-trained models at: https://github.com/datquocnguyen/jPTDP

Part-of-speech (POS) tags are essential features used in most dependency parsers. In real-world parsing, those dependency parsers rely heavily on automatically predicted POS tags, and thus suffer from error propagation. Li et al. (2011), Straka et al. (2016) and Nguyen et al. (2016) show that parsing accuracies drop by 5+% when utilizing automatic POS tags instead of gold ones. Some attempts have been made to avoid using POS tags during dependency parsing (Dyer et al., 2015; Ballesteros et al., 2015); however, these approaches still additionally use the automatic POS tags to achieve the best accuracy. Alternatively, jointly learning POS tagging and dependency parsing has gained more attention because: i) more accurate POS tags could lead to improved parsing performance and ii) the syntactic context of a parse tree could help resolve POS ambiguities (Li et al., 2011; Hatori et al., 2011; Lee et al., 2011; Bohnet and Nivre, 2012; Qian and Liu, 2012; Wang and Xue, 2014; Zhang et al., 2015; Alberti et al., 2015; Johannsen et al., 2016; Zhang and Weiss, 2016).
In this paper, we propose a novel neural architecture for joint POS tagging and graph-based dependency parsing. Our model learns latent feature representations shared for both POS tagging and dependency parsing tasks by using BiLSTM, the bidirectional LSTM (Schuster and Paliwal, 1997; Hochreiter and Schmidhuber, 1997). Without using any external resources such as pre-trained word embeddings, experimental results on 19 languages from the Universal Dependencies project show that our joint model performs better than strong baselines and, in particular, outperforms the neural network-based Stack-propagation model for joint POS tagging and transition-based dependency parsing (Zhang and Weiss, 2016), achieving a new state of the art.

Our joint model
In this section, we describe our new model for joint POS tagging and dependency parsing, which we call jPTDP. Figure 1 illustrates the architecture of our new model. We learn shared latent feature vectors representing word tokens in an input sentence by using BiLSTMs. These shared feature vectors are then used to predict POS tags, and are also fed into a multi-layer perceptron with one hidden layer (MLP) to decode dependency arcs and into another MLP to predict relation types for labeling the predicted arcs.

BiLSTM-based latent feature representations:
Given an input sentence $s$ consisting of $n$ word tokens $w_1, w_2, ..., w_n$, we represent each word $w_i$ in $s$ by an embedding $e^{(\bullet)}_{w_i}$. Plank et al. (2016) and Ballesteros et al. (2015) show that character-based representations of words help improve POS tagging and dependency parsing performances. So, we also use a sequence BiLSTM ($\mathrm{BiLSTM}_{seq}$) to compute a character-based vector representation for each word $w_i$ in $s$. For a word type $w$ consisting of $k$ characters $w = c_1 c_2 ... c_k$, the input to the sequence BiLSTM consists of $k$ character embeddings $c_{1:k}$ in which each embedding vector $c_j$ represents the $j$th character $c_j$ in $w$; and the output is the character-based embedding $e^{(*)}_{w}$ of the word type $w$, computed as:

$$e^{(*)}_{w} = \mathrm{BiLSTM}_{seq}(c_{1:k})$$

For the $i$th word $w_i$ in the input sentence $s$, we create an input vector $e_i$ which is a concatenation ($\circ$) of the corresponding word embedding and character-based embedding vectors:

$$e_i = e^{(\bullet)}_{w_i} \circ e^{(*)}_{w_i}$$

Then, we feed the sequence of input vectors $e_{1:n}$ with an additional index $i$ corresponding to a context position into another BiLSTM ($\mathrm{BiLSTM}_{ctx}$), resulting in shared feature vectors $v_i$ representing the $i$th words $w_i$ in the sentence $s$:

$$v_i = \mathrm{BiLSTM}_{ctx}(e_{1:n}, i)$$

POS tagging: Using the shared BiLSTM-based latent feature vector representations, we follow a common approach to compute the cross-entropy objective loss $L_{POS}(\hat{t}, t)$, in which $\hat{t}$ and $t$ are the sequence of predicted POS tags and the sequence of gold POS tags of words in the input sentence $s$, respectively (Goldberg, 2016; Plank et al., 2016).
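The representation steps above (character-based embedding, concatenation with the word embedding, context BiLSTM) can be sketched in plain Python as follows. This is only an illustrative sketch, not our DyNet implementation: the `LSTM` and `BiLSTM` classes, the tiny dimensions, the single layer, the random untrained weights, and the choice of taking the final forward and backward states as the character-based vector are all assumptions made for the example.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class LSTM:
    """Single-layer LSTM with random, untrained weights (illustration only)."""

    def __init__(self, input_dim, hidden_dim, rng):
        self.hidden_dim = hidden_dim
        width = input_dim + hidden_dim
        # One weight matrix per gate: input, forget, output, candidate.
        self.gates = [
            [[rng.uniform(-0.1, 0.1) for _ in range(width)] for _ in range(hidden_dim)]
            for _ in range(4)
        ]

    def run(self, inputs):
        """Return the hidden state produced after each input vector."""
        h = [0.0] * self.hidden_dim
        c = [0.0] * self.hidden_dim
        states = []
        for x in inputs:
            # x + h is list concatenation: the gates see [input; previous state].
            i, f, o, g = (matvec(gate, x + h) for gate in self.gates)
            c = [sigmoid(fj) * cj + sigmoid(ij) * math.tanh(gj)
                 for fj, cj, ij, gj in zip(f, c, i, g)]
            h = [sigmoid(oj) * math.tanh(cj) for oj, cj in zip(o, c)]
            states.append(h)
        return states


class BiLSTM:
    """A forward and a backward LSTM over the same sequence."""

    def __init__(self, input_dim, hidden_dim, rng):
        self.fwd = LSTM(input_dim, hidden_dim, rng)
        self.bwd = LSTM(input_dim, hidden_dim, rng)

    def run(self, inputs):
        """Per-position vectors: forward state concatenated with backward state."""
        forward = self.fwd.run(inputs)
        backward = self.bwd.run(inputs[::-1])[::-1]
        return [f + b for f, b in zip(forward, backward)]

    def encode(self, inputs):
        """Whole-sequence vector (assumed here: final forward + final backward state)."""
        return self.fwd.run(inputs)[-1] + self.bwd.run(inputs[::-1])[-1]


def shared_feature_vectors(words, word_emb, char_emb, bilstm_seq, bilstm_ctx):
    inputs = []
    for w in words:
        char_vec = bilstm_seq.encode([char_emb[ch] for ch in w])  # e^(*)_w
        inputs.append(word_emb[w] + char_vec)                     # e_i (concatenation)
    return bilstm_ctx.run(inputs)                                 # v_1, ..., v_n


rng = random.Random(0)
words = ["the", "cat", "sleeps"]
WORD_DIM, CHAR_DIM, SEQ_DIM, CTX_DIM = 8, 4, 3, 5  # toy sizes, not the paper's
word_emb = {w: [rng.uniform(-1, 1) for _ in range(WORD_DIM)] for w in words}
char_emb = {ch: [rng.uniform(-1, 1) for _ in range(CHAR_DIM)]
            for ch in sorted(set("".join(words)))}
bilstm_seq = BiLSTM(CHAR_DIM, SEQ_DIM, rng)
bilstm_ctx = BiLSTM(WORD_DIM + 2 * SEQ_DIM, CTX_DIM, rng)
vectors = shared_feature_vectors(words, word_emb, char_emb, bilstm_seq, bilstm_ctx)
```

Each entry of `vectors` plays the role of one shared feature vector $v_i$; in the sketch it has `2 * CTX_DIM` components because the forward and backward states are concatenated.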
Arc-factored graph-based parsing: Dependency trees can be formalized as directed graphs. An arc-factored parsing approach learns the scores of the arcs in a graph (Kübler et al., 2009). Then, using an efficient decoding algorithm (in particular, we use the algorithm of Eisner (1996)), we can find a maximum spanning tree (the highest scoring parse tree) of the graph from those arc scores:

$$\mathrm{parse}(s) = \operatorname*{argmax}_{\hat{y} \in \mathcal{Y}(s)} \sum_{(h,m) \in \hat{y}} \mathrm{score}_{arc}(h, m)$$

where $\mathcal{Y}(s)$ is the set of all possible dependency trees for the input sentence $s$ while $\mathrm{score}_{arc}(h, m)$ measures the score of the arc between the head $h$th word and the modifier $m$th word in $s$. Following Kiperwasser and Goldberg (2016b), we score an arc by using an MLP with a one-node output layer ($\mathrm{MLP}_{arc}$) on top of the $\mathrm{BiLSTM}_{ctx}$:

$$\mathrm{score}_{arc}(h, m) = \mathrm{MLP}_{arc}(v_h \circ v_m)$$

where $v_h$ and $v_m$ are the shared BiLSTM-based feature vectors representing the $h$th and $m$th words in $s$, respectively. We then compute a margin-based hinge loss $L_{arc}$ with loss-augmented inference to maximize the margin between the gold unlabeled parse tree and the highest scoring incorrect tree (Kiperwasser and Goldberg, 2016b).
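To make the decoding step concrete, the following is a minimal sketch of first-order projective decoding in the style of Eisner (1996), written in plain Python rather than taken from our implementation. It assumes a dense score matrix `scores[h][m]` in which index 0 is an artificial ROOT token; the function name `eisner` and the toy score matrices are illustrative only, and the sketch does not restrict ROOT to a single child.

```python
def best_split(candidates):
    """Return the (split point, score) pair with the highest score."""
    return max(candidates, key=lambda pair: pair[1])


def eisner(scores):
    """Highest-scoring projective tree for an arc-score matrix.

    scores[h][m] is the score of the arc from head h to modifier m;
    index 0 is an artificial ROOT. Returns heads, where heads[m] is the
    head of token m and heads[0] is -1.
    """
    n = len(scores)
    # Tables are indexed [s][t][d] for the span from s to t.
    # d = 1: the head is the left end s; d = 0: the head is the right end t.
    complete = [[[0.0, 0.0] for _ in range(n)] for _ in range(n)]
    incomplete = [[[0.0, 0.0] for _ in range(n)] for _ in range(n)]
    complete_bp = [[[0, 0] for _ in range(n)] for _ in range(n)]
    incomplete_bp = [[[0, 0] for _ in range(n)] for _ in range(n)]

    for length in range(1, n):
        for s in range(n - length):
            t = s + length
            # Incomplete spans: join two complete halves and add one arc.
            r, best = best_split(
                (k, complete[s][k][1] + complete[k + 1][t][0]) for k in range(s, t))
            incomplete[s][t][0] = best + scores[t][s]  # arc t -> s
            incomplete[s][t][1] = best + scores[s][t]  # arc s -> t
            incomplete_bp[s][t][0] = incomplete_bp[s][t][1] = r
            # Complete span headed by t (left-pointing).
            r, best = best_split(
                (k, complete[s][k][0] + incomplete[k][t][0]) for k in range(s, t))
            complete[s][t][0], complete_bp[s][t][0] = best, r
            # Complete span headed by s (right-pointing).
            r, best = best_split(
                (k, incomplete[s][k][1] + complete[k][t][1]) for k in range(s + 1, t + 1))
            complete[s][t][1], complete_bp[s][t][1] = best, r

    heads = [-1] * n

    def backtrack(s, t, d, is_complete):
        if s == t:
            return
        if is_complete:
            r = complete_bp[s][t][d]
            if d == 0:
                backtrack(s, r, 0, True)
                backtrack(r, t, 0, False)
            else:
                backtrack(s, r, 1, False)
                backtrack(r, t, 1, True)
        else:
            r = incomplete_bp[s][t][d]
            if d == 0:
                heads[s] = t
            else:
                heads[t] = s
            backtrack(s, r, 1, True)
            backtrack(r + 1, t, 0, True)

    backtrack(0, n - 1, 1, True)
    return heads


# Toy example: ROOT(0) She(1) reads(2) books(3).
toy_scores = [
    [0, 0, 10, 0],
    [0, 0, 0, 0],
    [0, 9, 0, 8],
    [0, 0, 0, 0],
]
toy_heads = eisner(toy_scores)
```

In the toy example the only high-scoring arcs are ROOT to "reads", "reads" to "She" and "reads" to "books", so the decoder should attach tokens 1 and 3 to token 2 and token 2 to ROOT. The dynamic program runs in cubic time in the sentence length and can only return projective trees.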
Dependency relation types are predicted in a similar manner. We use another MLP on top of the $\mathrm{BiLSTM}_{ctx}$ for predicting the relation type of a head-modifier arc. Here, the number of nodes in the output layer of this MLP ($\mathrm{MLP}_{rel}$) is the number of relation types. Given an arc $(h, m)$, we compute a corresponding output vector as:

$$v_{(h,m)} = \mathrm{MLP}_{rel}(v_h \circ v_m)$$

Then, based on the MLP output vectors $v_{(h,m)}$, we also compute another margin-based hinge loss $L_{rel}$ for relation type prediction, using only the gold labeled parse tree.
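As an illustration of a margin-based hinge loss over relation types, the sketch below scores one gold arc. The exact form of the loss (a margin of 1 against the highest-scoring incorrect relation) is an assumption made for this example, and `relation_hinge_loss` is a hypothetical helper name rather than a function from our code.

```python
def relation_hinge_loss(rel_scores, gold_index, margin=1.0):
    """Hinge loss for one gold arc.

    rel_scores is the MLP output vector for the arc (one score per
    relation type) and gold_index is the position of the gold relation.
    The loss is zero once the gold relation outscores every other
    relation by at least the margin.
    """
    gold_score = rel_scores[gold_index]
    best_wrong = max(score for index, score in enumerate(rel_scores)
                     if index != gold_index)
    return max(0.0, margin - gold_score + best_wrong)


# Gold relation wins, but by less than the margin: a positive loss remains.
loss_close = relation_hinge_loss([2.0, 0.5, 1.5], gold_index=0)
# Gold relation wins by more than the margin: the loss is zero.
loss_far = relation_hinge_loss([3.0, 0.5, 1.5], gold_index=0)
```

Summing this quantity over the arcs of the gold tree gives a sentence-level relation loss under the same assumptions.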

Joint model training:
The final training objective function of our joint model is the sum of the POS tagging loss $L_{POS}$, the structure loss $L_{arc}$ and the relation labeling loss $L_{rel}$. The model parameters, including word embeddings, character embeddings, two BiLSTMs and two MLPs, are learned to minimize the sum of the losses.
Discussion: Prior neural network-based joint models for POS tagging and dependency parsing are feed-forward network- and transition-based approaches (Alberti et al., 2015; Zhang and Weiss, 2016), while our model is a BiLSTM- and graph-based method. Our model can be considered as a two-component mixture of a tagging component and a parsing component. Here, the tagging component can be viewed as a simplified version, without the additional auxiliary loss for rare words, of the BiLSTM-based POS tagging model proposed by Plank et al. (2016). The parsing component can be viewed as an extension of the graph-based dependency model proposed by Kiperwasser and Goldberg (2016b), where we replace the input POS tag embeddings by the character-based representations of words.
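The sum of the three losses can be sketched as below. The per-token softmax cross-entropy used for the tagging term is an assumed, standard formulation, and the helper names (`cross_entropy`, `sentence_pos_loss`, `joint_loss`) and toy numbers are hypothetical.

```python
import math


def cross_entropy(logits, gold_index):
    """Negative log-probability of the gold tag under a softmax over logits."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_normaliser = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_normaliser - logits[gold_index]


def sentence_pos_loss(logits_per_token, gold_tags):
    """Tagging loss of a sentence: sum of per-token cross-entropies."""
    return sum(cross_entropy(logits, gold)
               for logits, gold in zip(logits_per_token, gold_tags))


def joint_loss(pos_loss, arc_loss, rel_loss):
    """Unweighted sum of the three losses, as in the objective described above."""
    return pos_loss + arc_loss + rel_loss


# Two tokens, three candidate tags each (toy scores).
pos_loss = sentence_pos_loss([[2.0, 0.1, 0.1], [0.0, 0.0, 0.0]], gold_tags=[0, 2])
total = joint_loss(pos_loss, arc_loss=1.5, rel_loss=0.5)
```

Because the three terms share the embeddings and BiLSTMs, a gradient step on `total` updates the shared representation from both the tagging and the parsing signals.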

Experimental setup
Following Zhang and Weiss (2016) and Plank et al. (2016), we conduct multilingual experiments on 19 languages from the Universal Dependencies (UD) treebanks v1.2 (Nivre et al., 2015), using the universal POS tagset (Petrov et al., 2012) instead of the language-specific POS tagset. For dependency parsing, the evaluation metric is the labeled attachment score (LAS). LAS is the percentage of words which are correctly assigned both dependency arc and relation type.
In Table 1, UDPipe is the trainable pipeline for processing CoNLL-U files (Straka et al., 2016). TnT denotes the second-order HMM-based TnT tagger (Brants, 2000). CRF denotes the conditional random fields-based tagger presented in Plank et al. (2014). BiLSTM-aux refers to the state-of-the-art (SOTA) BiLSTM-based POS tagging model with an additional auxiliary loss for rare words (Plank et al., 2016). Note that the (old) language code for Hebrew "iw" is referred to as "he" as in Plank et al. (2016). Results marked with [⊕] are reported in Plank et al. (2016). Stack-prop refers to the SOTA Stack-propagation model for joint POS tagging and transition-based dependency parsing (Zhang and Weiss, 2016). -Chars denotes the absolute accuracy decrease of our jPTDP when the character-based representations of words are not taken into account. B'15 denotes the character-based stack LSTM model for transition-based dependency parsing (Ballesteros et al., 2015). Pipeline $P_{tag}$ refers to a greedy version of the approach proposed by Alberti et al. (2015). RBGParser refers to the graph-based dependency parser with tensor decomposition presented in Lei et al. (2014).
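The attachment scores can be computed as in the following sketch, which counts every token (including punctuation). The function name and the toy annotations are illustrative; the unlabeled attachment score (UAS) is included alongside LAS for comparison.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) in percent.

    gold and predicted are equally long lists with one (head, relation)
    pair per token. UAS counts tokens with the correct head; LAS counts
    tokens with both the correct head and the correct relation type.
    """
    if len(gold) != len(predicted) or not gold:
        raise ValueError("need two non-empty sequences of equal length")
    correct_heads = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    correct_labeled = sum(g == p for g, p in zip(gold, predicted))
    total = len(gold)
    return 100.0 * correct_heads / total, 100.0 * correct_labeled / total


gold = [(2, "nsubj"), (0, "root"), (2, "dobj"), (2, "punct")]
pred = [(2, "nsubj"), (0, "root"), (2, "iobj"), (3, "punct")]
uas, las = attachment_scores(gold, pred)
```

In the toy example three of the four heads are correct and two tokens have both head and relation correct, so `uas` is 75.0 and `las` is 50.0.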

Implementation details
Our jPTDP is implemented using DYNET v2.0 (Neubig et al., 2017). We optimize the objective function using Adam (Kingma and Ba, 2014) with default DYNET parameter settings and no mini-batches. We use a fixed random seed, and we do not utilize pre-trained embeddings in any experiment. Following Kiperwasser and Goldberg (2016b) and Plank et al. (2016), we apply a word dropout rate of 0.25 and Gaussian noise with σ = 0.2. For training, we run for 30 epochs, and evaluate the mixed accuracy of correctly assigning POS tag together with dependency arc and relation type on the development set after each training epoch. We perform a minimal grid search of hyper-parameters on English. We find that the highest mixed accuracy on the English development set is obtained when using 64-dimensional character embeddings, 128-dimensional word embeddings, 128-dimensional BiLSTM states, 2 BiLSTM layers and 100 hidden nodes in MLPs with one hidden layer. We then apply those hyper-parameters to all 18 remaining languages.
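The selected settings and the per-epoch development evaluation can be summarized in a short sketch. The dictionary keys and function names are hypothetical and do not correspond to command-line options of our released code; the values are the ones stated above, and the development accuracies in the usage lines are made-up toy numbers.

```python
# Hyper-parameters reported above; the key names are illustrative only.
HYPER_PARAMETERS = {
    "char_embedding_dim": 64,
    "word_embedding_dim": 128,
    "bilstm_state_dim": 128,
    "bilstm_layers": 2,
    "mlp_hidden_nodes": 100,
    "word_dropout": 0.25,
    "gaussian_noise_sigma": 0.2,
    "epochs": 30,
}


def mixed_accuracy(gold, predicted):
    """Percentage of tokens whose POS tag, head and relation are all correct.

    gold and predicted hold one (pos_tag, head, relation) triple per token.
    """
    correct = sum(g == p for g, p in zip(gold, predicted))
    return 100.0 * correct / len(gold)


def best_epoch(dev_accuracies):
    """1-based index of the epoch with the highest development accuracy.

    Ties are resolved in favour of the earliest epoch (an assumption).
    """
    best_index = max(range(len(dev_accuracies)), key=lambda i: dev_accuracies[i])
    return best_index + 1


gold = [("PRON", 2, "nsubj"), ("VERB", 0, "root"), ("NOUN", 2, "dobj"), ("PUNCT", 2, "punct")]
pred = [("PRON", 2, "nsubj"), ("VERB", 0, "root"), ("VERB", 2, "dobj"), ("PUNCT", 2, "punct")]
accuracy = mixed_accuracy(gold, pred)
chosen = best_epoch([70.0, 74.5, 73.9, 74.5])
```

A token with a correct head and relation but a wrong POS tag, like the third token in the toy example, does not count toward the mixed accuracy.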

Main results
Table 1 compares the POS tagging and dependency parsing results of our model jPTDP with results reported in prior work, using the same experimental setup.
Regarding POS tagging, our joint model jPTDP generally obtains similar POS tagging accuracies to the BiLSTM-aux model (Plank et al., 2016). Our model also achieves higher averaged POS tagging accuracy than the joint model Stack-propagation (Zhang and Weiss, 2016). Slightly higher tagging results are obtained by BiLSTM-aux when utilizing pre-trained word embeddings for initialization, as presented in Plank et al. (2016). However, for a fair comparison to both Stack-propagation and our jPTDP, we only compare to the results reported without using the pre-trained word embeddings.
In terms of dependency parsing, in most cases, our model jPTDP outperforms Stack-propagation. It is somewhat unexpected that our model produces about 7% absolute lower LAS score than Stack-propagation on Dutch (nl). A possible reason is that the hyper-parameters we selected on English are not optimal for Dutch. Another possible reason is the large number of non-projective trees in the Dutch test set (106/386 ≈ 27.5%), while we use Eisner's decoding algorithm, which produces only projective trees (Eisner, 1996). Without taking "nl" into account, our averaged LAS score over all remaining languages is 1.1% absolute higher than Stack-propagation's.
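Whether a tree is projective can be checked by looking for crossing arcs, as in the sketch below. The function name and the toy trees are illustrative; token positions are 1-based and 0 denotes the artificial ROOT.

```python
def is_projective(heads):
    """Return True when no two dependency arcs cross.

    heads[i] is the head position of token i + 1 (tokens are 1-based)
    and 0 stands for ROOT. Two arcs cross when exactly one end point of
    one arc lies strictly inside the span of the other.
    """
    arcs = [(min(head, modifier), max(head, modifier))
            for modifier, head in enumerate(heads, start=1)]
    for left_a, right_a in arcs:
        for left_b, right_b in arcs:
            if left_a < left_b < right_a < right_b:
                return False
    return True


projective_tree = [2, 0, 2]         # tokens 1 and 3 attach to token 2
non_projective_tree = [3, 4, 0, 3]  # arcs 3 -> 1 and 4 -> 2 cross

# Share of non-projective trees quoted above for the Dutch test set.
dutch_non_projective_rate = round(100.0 * 106 / 386, 1)
```

A decoder restricted to projective trees cannot reproduce a gold tree such as `non_projective_tree` exactly, so at least one of its arcs is necessarily counted as an error.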
One probable reason for our better LAS is that jPTDP uses character-based representations of words, while Stack-propagation uses feature representations for suffixes and prefixes, which might not be as useful as character-based representations for capturing unknown words. The last row in Table 1 shows an absolute LAS improvement of 4.4% on average when comparing our jPTDP with its simplified version that does not use character-based representations: specifically, morphologically rich languages get an averaged improvement of 9.3%, versus 2.6% for the others. So, our jPTDP is particularly good for morphologically rich languages, with 1.7% higher averaged LAS than Stack-propagation over these languages.

MQuni at the CoNLL 2017 shared task
Our team MQuni participated with jPTDP in the CoNLL 2017 shared task on multilingual parsing from raw text to universal dependencies (Zeman et al., 2017). Training data are 60+ universal dependency treebanks for 40+ languages from UD v2.0 (Nivre et al., 2017a). We do not use any external resource, and we use a fixed random seed and a fixed set of hyper-parameters as presented in Section 3.2 for all treebanks. For each treebank, we train a joint model for universal POS tagging and dependency parsing. We evaluate the mixed accuracy on the development set after each training epoch, and select the model with the highest mixed accuracy. Note that for each "surprise" language where there are only a few sample sentences with gold-standard annotation, or for a "small" treebank whose development set is not available, we simply split its sample or training set into two parts with a ratio of 4:1, and then use the larger part for training and the smaller part for development.
For parsing from raw text to universal dependencies, we utilize CoNLL-U test files pre-processed by the baseline UDPipe 1.1 (Straka et al., 2016). These pre-processed CoNLL-U test files are available to all participants who do not want to train their own models for any steps preceding the dependency analysis, including tokenization, word segmentation, sentence segmentation, POS tagging and morphological analysis. Note that we only employ the tokenization, word segmentation and sentence segmentation, and we do not use the POS tagging and morphological analysis produced by UDPipe 1.1. Recall that we perform universal POS tagging and dependency parsing jointly. In addition, when we encounter an additional parallel test set in a language where multiple training treebanks exist, i.e. a parallel test set marked with the language code suffix "_pud" such as "ar_pud", "cs_pud" and "de_pud", we simply use the model trained for its corresponding language code prefix, e.g., "ar", "cs" and "de".
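The 4:1 split can be sketched as follows. The text above does not specify whether the split is contiguous or random, so the contiguous split used here is an assumption, and `split_train_dev` is a hypothetical helper name.

```python
def split_train_dev(sentences, train_parts=4, dev_parts=1):
    """Split a list of sentences into training and development parts.

    With the default arguments the first four fifths are used for
    training and the remaining fifth for development.
    """
    cut = len(sentences) * train_parts // (train_parts + dev_parts)
    return sentences[:cut], sentences[cut:]


sample = ["sentence %d" % index for index in range(20)]
train_part, dev_part = split_train_dev(sample)
```

For a sample of 20 sentences this keeps 16 for training and 4 for development.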
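The choice of model for a parallel test set amounts to stripping the suffix from the test set code, as in this small sketch. It assumes that codes are written with an underscore (for example "ar_pud"), as in UD treebank naming; the function name and the set of trained codes are illustrative.

```python
PUD_SUFFIX = "_pud"


def model_code_for(test_code, trained_codes):
    """Return the code of the trained model to apply to a test set.

    A test set with its own model uses that model; a parallel test set
    falls back to the model of its language code prefix.
    """
    if test_code in trained_codes:
        return test_code
    if test_code.endswith(PUD_SUFFIX):
        prefix = test_code[:-len(PUD_SUFFIX)]
        if prefix in trained_codes:
            return prefix
    raise KeyError("no trained model for test set %r" % test_code)


trained = {"ar", "cs", "de"}
chosen_models = [model_code_for(code, trained) for code in ("ar_pud", "cs", "de_pud")]
```

A test set code without a matching model raises a `KeyError` instead of silently picking another treebank.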
Table 2 presents our official parsing results from the CoNLL 2017 shared task on UD parsing (Zeman et al., 2017). In Table 2, the subscript denotes the official rank out of 33 participating systems, and R-S is the system rank where the 4 surprise language test sets are not taken into account. We obtain 1% absolute higher averaged scores than the baseline UDPipe 1.1 (Straka et al., 2016) in both categories: big treebank test sets (denoted as Big in Table 2) and parallel test sets (denoted as PUD in Table 2). Specifically, our highest rank is 8th place, in the PUD category, suggesting that our parsing model jPTDP is particularly good when applied in a practical setting to out-of-domain data. Unlike the baseline UDPipe 1.1 and others, for each surprise language, we simply train a joint model just on the sample data of a few sentences with gold-standard annotation provided before the test phase, i.e., we utilize neither external resources nor a cross-lingual technique nor a delexicalized parser. So, it is not surprising that we obtain a very low averaged score over the 4 surprise language test sets. When the 4 surprise language test sets are not taken into account, we rank among the top 10 participating systems.
In fact, it is hard to make a clear comparison between our jPTDP and the parsing models used in other top participating systems. This is because other systems use various external resources and/or better pre-processing modules and/or construct ensemble models for dependency parsing. For example, UDPipe 1.2 only extends the word and sentence segmenters of the baseline UDPipe 1.1. Consequently, UDPipe 1.2 obtains a 0.1% absolute higher macro-averaged word segmentation score and a 0.2% higher macro-averaged sentence segmentation score than the baseline UDPipe 1.1, resulting in a 1+% better macro-averaged LAS F1 score even though they use exactly the same parsing model. See Zeman et al. (2017) for an overview of the methods, algorithms, resources and software used for all other participating systems. It is worth noting that for universal POS tagging, our highest rank is 4th place, in the Big category (i.e., 4th on average over 55 big treebank test sets). In this Big category, we also obtain a better rank than both UDPipe 1.2 and 1.1.

Conclusion
In this paper, we describe our novel model for joint POS tagging and graph-based dependency parsing, using bidirectional LSTM-based feature representations. Experiments on 19 languages from the Universal Dependencies (UD) v1.2 show that our model obtains state-of-the-art results in both POS tagging and dependency parsing.
With our joint model, we participated in the CoNLL 2017 shared task on UD parsing (Zeman et al., 2017). Even though we followed a strict closed setting while other top participating systems did not, we still obtained very competitive results. So, we believe our joint model can serve as a new strong baseline for further models in both POS tagging and dependency parsing tasks.
For future comparison, we provide in Table 3 the POS tagging, UAS and LAS accuracies with respect to gold-standard segmentation on the UD v2.0-CoNLL 2017 shared task test sets (Nivre et al., 2017b).

Figure 1: Illustration of our jPTDP for joint POS tagging and graph-based dependency parsing.

Table 1: Universal POS tagging accuracies and LAS scores computed on all tokens (including punctuation) on test sets for 19 languages in UD v1.2. The language codes with • refer to morphologically rich languages. Numbers (in the second top row) right below language codes are out-of-vocabulary rates.

Method  ar    bg    da    de•   en    es    eu•   fa    fi•   fr    hi    id    it    iw    nl    no    pl•   pt    sl•   AVG
        10.3  12.3  15.6  11.9  9.1   7.3   17.8  8.2   24.4  5.7   4.6   13.8  5.7   10.9  18.8  11.2  23.1  10.0  19.9  12.7

Table 3: Universal POS tagging accuracies (labeled as UPOS), UAS and LAS scores of our jPTDP model with respect to gold-standard segmentation on the UD v2.0-CoNLL 2017 shared task test sets (Nivre et al., 2017b). UAS refers to the unlabeled attachment score. ltcode denotes the language treebank code. The 4 surprise language tests are bxr, hsb, kmr and sme. The 8 small treebank tests are fr_partut, ga, gl_treegal, kk, la, sl_sst, ug and uk. The 14 parallel test sets are marked with the language code suffix "_pud". The 55 remaining test sets are for big treebanks.

Our code is open-source and available at: https://github.com/datquocnguyen/jPTDP.

[...] was also supported by NICTA, funded by the Australian Government through the Department of Communications and the Australian Research Council through the ICT Centre of Excellence Program. The first author was supported by an International Postgraduate Research Scholarship, which is an Australian Government Research Training Program Scholarship, and a NICTA NRPA Top-Up Scholarship.