An Improved Neural Network Model for Joint POS Tagging and Dependency Parsing

We propose a novel neural network model for joint part-of-speech (POS) tagging and dependency parsing. Our model extends the well-known BIST graph-based dependency parser (Kiperwasser and Goldberg, 2016) by incorporating a BiLSTM-based tagging component to produce automatically predicted POS tags for the parser. On the benchmark English Penn treebank, our model obtains strong UAS and LAS scores of 94.51% and 92.87%, respectively, a 1.5+% absolute improvement over the BIST graph-based parser, and also obtains a state-of-the-art POS tagging accuracy of 97.97%. Furthermore, experimental results on parsing 61 "big" Universal Dependencies treebanks from raw texts show that our model outperforms the baseline UDPipe (Straka and Straková, 2017) with a 0.8% higher average POS tagging score and a 3.6% higher average LAS score. In addition, with our model, we also obtain state-of-the-art downstream task scores for biomedical event extraction and opinion analysis applications. Our code is available together with all pre-trained models at: https://github.com/datquocnguyen/jPTDP


Introduction
Dependency parsing, a key research topic in natural language processing (NLP) over the last decade (Buchholz and Marsi, 2006; Nivre et al., 2007a; Kübler et al., 2009), has also been demonstrated to be extremely useful in many applications such as relation extraction (Culotta and Sorensen, 2004; Bunescu and Mooney, 2005), semantic parsing (Reddy et al., 2016) and machine translation (Galley and Manning, 2009). In general, dependency parsing models can be categorized as graph-based (McDonald et al., 2005) or transition-based (Yamada and Matsumoto, 2003; Nivre, 2003). Most traditional graph- or transition-based models define a set of core and combined features (McDonald and Pereira, 2006; Nivre et al., 2007b; Bohnet, 2010; Zhang and Nivre, 2011), while recent state-of-the-art models propose neural network architectures to handle feature engineering (Kiperwasser and Goldberg, 2016; Ma and Hovy, 2017).
Most traditional and neural network-based parsing models use automatically predicted POS tags as essential features. However, POS taggers are not perfect, resulting in error propagation problems. Some work has attempted to avoid using POS tags for dependency parsing (de Lhoneux et al., 2017); however, to achieve the strongest parsing scores these methods still require automatically assigned POS tags. Alternatively, joint POS tagging and dependency parsing has also attracted a lot of attention in the NLP community, as it can improve both tagging and parsing results over independent modeling (Li et al., 2011; Hatori et al., 2011; Lee et al., 2011; Bohnet and Nivre, 2012; Zhang et al., 2015; Zhang and Weiss, 2016; Yang et al., 2018).
In this paper, we present a novel neural network-based model for jointly learning POS tagging and dependency parsing. Our joint model extends the well-known BIST graph-based dependency parser (Kiperwasser and Goldberg, 2016) with an additional lower-level BiLSTM-based tagging component; in particular, this tagging component generates predicted POS tags for the parser component. Evaluated on the benchmark English Penn treebank, our model produces a 1.5+% absolute improvement over the BIST graph-based parser, with a strong UAS score of 94.51% and LAS score of 92.87%, and also obtains a state-of-the-art POS tagging accuracy of 97.97%. In addition, multilingual parsing experiments from raw texts on 61 "big" Universal Dependencies treebanks (Zeman et al., 2018) show that our model outperforms the baseline UDPipe (Straka and Straková, 2017) with a 0.8% higher average POS tagging score, 3.1% higher UAS and 3.6% higher LAS. Furthermore, experimental results on downstream task applications (Fares et al., 2018) show that our joint model helps produce state-of-the-art scores for biomedical event extraction and opinion analysis.

Our joint model
This section presents our model for joint POS tagging and graph-based dependency parsing. Figure 1 illustrates the architecture of our joint model, which can be viewed as a two-component mixture of a tagging component and a parsing component. Given the word tokens in an input sentence, the tagging component uses a BiLSTM to learn "latent" feature vectors representing these word tokens, and then feeds these feature vectors into a multilayer perceptron with one hidden layer (MLP) to predict POS tags. The parsing component then uses another BiLSTM to learn another set of latent feature representations, based on both the input word tokens and the predicted POS tags. These latent feature representations are fed into an MLP to decode dependency arcs and into another MLP to label the predicted dependency arcs.

Word vector representation
Given an input sentence s consisting of n word tokens w_1, w_2, ..., w_n, we represent each i-th word w_i in s by a vector e_i, obtained by concatenating the word embedding e^(W)_{w_i} and the character-level word embedding e^(C)_{w_i}:

e_i = e^(W)_{w_i} ∘ e^(C)_{w_i}     (1)

Here, each word type w in the training data is represented by a real-valued word embedding e^(W)_w. Given a word type w consisting of k characters, w = c_1 c_2 ... c_k, where each j-th character in w is represented by a character embedding c_j, we use a sequence BiLSTM (BiLSTM_seq) to learn its character-level vector representation (Plank et al., 2016). The input to BiLSTM_seq is the sequence of k character embeddings c_{1:k}, and the output is the concatenation of the output of a forward LSTM (LSTM_f) reading the input in its regular order and of a reverse LSTM (LSTM_r) reading the input in reverse:

e^(C)_w = BiLSTM_seq(c_{1:k}) = LSTM_f(c_{1:k}) ∘ LSTM_r(c_{k:1})     (2)

Tagging component

The tagging component feeds the sequence of word vectors e_{1:n} into a BiLSTM (BiLSTM_pos) to extract a latent feature vector for each i-th word:

v^(pos)_i = BiLSTM_pos(e_{1:n}, i)     (3)

We use an MLP with softmax output (MLP_pos) on top of BiLSTM_pos to predict the POS tag of each word in s; the number of nodes in the output layer of MLP_pos is the number of POS tags. Given v^(pos)_i, we compute an output vector as:

ϑ_i = softmax(MLP_pos(v^(pos)_i))     (4)

Based on the output vectors ϑ_i, we then compute the cross-entropy objective loss L_POS(t̂, t), in which t̂ and t are the sequence of predicted POS tags and the sequence of gold POS tags of the words in the input sentence s, respectively (Goldberg, 2016). Our tagging component can thus be viewed as a simplified version of the POS tagging model proposed by Plank et al. (2016), without their additional auxiliary loss for rare words.
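The tag-prediction step and the cross-entropy loss L_POS can be sketched in plain Python (a minimal illustrative sketch; the real model is implemented in DyNet, and the MLP producing the per-word tag logits is omitted here):

```python
import math

def softmax(logits):
    # Numerically stable softmax over the POS tag scores of one word.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pos_loss(logits_per_word, gold_tag_ids):
    """Cross-entropy loss L_POS: -sum_i log p(gold tag of word i).
    logits_per_word[i] plays the role of MLP_pos's output for the i-th word."""
    loss = 0.0
    for logits, gold in zip(logits_per_word, gold_tag_ids):
        loss -= math.log(softmax(logits)[gold])
    return loss
```

With uniform logits over two tags, each word contributes log 2 to the loss, matching the usual cross-entropy behavior.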

Parsing component
Assume that p_1, p_2, ..., p_n are the predicted POS tags produced by the tagging component for the input words. We represent each i-th predicted POS tag by a vector embedding e^(P)_{p_i}. We then create a sequence of vectors x_{1:n}, in which each x_i is produced by concatenating the POS tag embedding e^(P)_{p_i} and the word vector representation e_i:

x_i = e_i ∘ e^(P)_{p_i}

We feed the sequence of vectors x_{1:n}, together with an index i, into a BiLSTM (BiLSTM_dep), resulting in latent feature vectors v_i:

v_i = BiLSTM_dep(x_{1:n}, i)     (5)

Based on the latent feature vectors v_i, we follow a common arc-factored parsing approach to decode dependency arcs (McDonald et al., 2005). In particular, a dependency tree can be formalized as a directed graph, and an arc-factored parsing approach learns the scores of the arcs in the graph (Kübler et al., 2009). Here, we score an arc by using an MLP with a one-node output layer (MLP_arc) on top of BiLSTM_dep:

score_arc(i, j) = MLP_arc(v_i ∘ v_j ∘ (v_i ∗ v_j) ∘ |v_i − v_j|)     (6)

where ∗ and |v_i − v_j| denote the element-wise product and the absolute element-wise difference, respectively, and v_i and v_j are the latent feature vectors associated with the i-th and j-th words in s, computed by Equation 5.
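The arc-scoring feature construction and the one-node MLP can be sketched as follows (a plain-Python sketch with illustrative weight shapes; the actual model operates on DyNet expressions):

```python
import math

def arc_features(v_i, v_j):
    """Input to the arc scorer: [v_i ; v_j ; v_i * v_j ; |v_i - v_j|],
    i.e. the two latent vectors, their element-wise product, and their
    absolute element-wise difference, all concatenated."""
    prod = [a * b for a, b in zip(v_i, v_j)]
    diff = [abs(a - b) for a, b in zip(v_i, v_j)]
    return list(v_i) + list(v_j) + prod + diff

def mlp_arc(features, W1, b1, w2, b2):
    """One-hidden-layer MLP with a single output node: the arc score.
    W1 (hidden x input) and b1 are the hidden layer, w2 and b2 the output node;
    the tanh non-linearity is an illustrative choice."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, features)) + b)
              for row, b in zip(W1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2
```

For 2-dimensional latent vectors the feature vector has 8 dimensions; one such vector is scored per candidate (head, modifier) pair.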
Given the arc scores, we use Eisner's (1996) decoding algorithm to find the highest-scoring projective parse tree:

ŷ = argmax_{y ∈ Y(s)} Σ_{(h,m) ∈ y} score_arc(h, m)     (7)

where Y(s) is the set of all possible dependency trees for the input sentence s, and score_arc(h, m) measures the score of the arc between the head h-th word and the modifier m-th word in s.
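Eisner's first-order projective decoder can be sketched in pure Python (an illustrative implementation; `score[h][m]` is assumed to hold score_arc(h, m), with token 0 as the artificial ROOT):

```python
NEG_INF = float("-inf")

def eisner(score):
    """First-order projective decoding (Eisner, 1996). Returns a list heads
    where heads[m] is the predicted head of word m (heads[0] is None)."""
    n = len(score)
    # C[s][t][d][c]: best score of span (s, t); d=1 head at s, d=0 head at t;
    # c=1 complete span, c=0 incomplete span (the arc between s and t pending).
    C = [[[[NEG_INF, NEG_INF], [NEG_INF, NEG_INF]] for _ in range(n)]
         for _ in range(n)]
    bp = [[[[None, None], [None, None]] for _ in range(n)] for _ in range(n)]
    for s in range(n):
        C[s][s] = [[0.0, 0.0], [0.0, 0.0]]
    for k in range(1, n):
        for s in range(n - k):
            t = s + k
            # Incomplete spans: choose a split r, then add the arc s<->t.
            best, arg = max((C[s][r][1][1] + C[r + 1][t][0][1], r)
                            for r in range(s, t))
            C[s][t][0][0] = best + score[t][s]  # arc t -> s
            C[s][t][1][0] = best + score[s][t]  # arc s -> t
            bp[s][t][0][0] = bp[s][t][1][0] = arg
            # Complete span headed at t (collects left dependents of t).
            C[s][t][0][1], bp[s][t][0][1] = max(
                (C[s][r][0][1] + C[r][t][0][0], r) for r in range(s, t))
            # Complete span headed at s (collects right dependents of s).
            C[s][t][1][1], bp[s][t][1][1] = max(
                (C[s][r][1][0] + C[r][t][1][1], r) for r in range(s + 1, t + 1))
    heads = [None] * n

    def backtrack(s, t, d, c):
        if s == t:
            return
        r = bp[s][t][d][c]
        if c == 0:  # the arc between s and t belongs to the tree
            if d == 1:
                heads[t] = s
            else:
                heads[s] = t
            backtrack(s, r, 1, 1)
            backtrack(r + 1, t, 0, 1)
        elif d == 1:
            backtrack(s, r, 1, 0)
            backtrack(r, t, 1, 1)
        else:
            backtrack(s, r, 0, 1)
            backtrack(r, t, 0, 0)

    backtrack(0, n - 1, 1, 1)
    return heads
```

Decoding from the complete span C[0][n-1] headed at ROOT guarantees that no arc ever points into token 0, so no explicit root constraint is needed.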
Following Kiperwasser and Goldberg (2016), we compute a margin-based hinge loss L ARC with loss-augmented inference to maximize the margin between the gold unlabeled parse tree and the highest scoring incorrect tree.
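The loss-augmented inference scheme can be sketched as follows (a sketch under the common convention of adding a margin of 1 to the score of every non-gold arc before decoding; function names are illustrative):

```python
def augment_scores(score, gold_heads):
    """Add a margin of 1 to every arc that is NOT in the gold tree, so the
    decoder is biased towards high-scoring *incorrect* trees."""
    n = len(score)
    return [[score[h][m] + (0.0 if gold_heads[m] == h else 1.0)
             for m in range(n)]
            for h in range(n)]

def hinge_loss(gold_tree_score, best_augmented_score):
    """L_ARC = max(0, score of best cost-augmented tree - score of gold tree);
    zero when the gold tree beats every incorrect tree by the margin."""
    return max(0.0, best_augmented_score - gold_tree_score)
```

In training, the decoder is run on the augmented scores, and the loss is the (clamped) gap between the returned tree's score and the gold tree's score.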
For predicting the dependency relation type of a head-modifier arc, we use another MLP with softmax output (MLP_rel) on top of BiLSTM_dep. Here, the number of nodes in the output layer of MLP_rel is the number of dependency relation types. Given an arc (h, m), we compute an output vector as:

v^(h,m) = softmax(MLP_rel(v_h ∘ v_m ∘ (v_h ∗ v_m) ∘ |v_h − v_m|))     (8)

Based on the output vectors v^(h,m), we also compute another cross-entropy objective loss L_REL for relation type prediction, using only the gold labeled parse tree.
Our parsing component can be viewed as an extension of the BIST graph-based dependency model (Kiperwasser and Goldberg, 2016), where we additionally incorporate the character-level vector representations of words.

Joint model training
The training objective loss of our joint model is the sum of the POS tagging loss L_POS, the structure loss L_ARC and the relation labeling loss L_REL:

L = L_POS + L_ARC + L_REL     (9)

The model parameters, including word embeddings, character embeddings, POS tag embeddings, three one-hidden-layer MLPs and three BiLSTMs, are learned to minimize the sum L of the losses. Most neural network-based joint models for POS tagging and dependency parsing are transition-based approaches (Zhang and Weiss, 2016; Yang et al., 2018), while our model is a graph-based method. In addition, the joint model JMT (Hashimoto et al., 2017) formulates its dependency parsing task as a head selection task, which produces a probability distribution over possible heads for each word (Zhang et al., 2017).
Our model is the successor of the joint model jPTDP v1.0 (Nguyen et al., 2017), which is also a graph-based method. However, unlike our model, jPTDP v1.0 uses a single BiLSTM to learn "shared" latent feature vectors which are then used for both the POS tagging and dependency parsing tasks, rather than two separate BiLSTM layers. As mentioned in Section 4, our model generally outperforms jPTDP v1.0, with 2.5+% LAS improvements on Universal Dependencies (UD) treebanks.

Implementation details
Our model is released as jPTDP v2.0, available at https://github.com/datquocnguyen/jPTDP. jPTDP v2.0 is implemented using DyNet v2.0 (Neubig et al., 2017) with a fixed random seed. Word embeddings are initialized either randomly or by pre-trained word vectors, while character and POS tag embeddings are randomly initialized. For learning character-level word embeddings, we use a one-layer BiLSTM_seq and set the size of the LSTM hidden states to be equal to the vector size of the character embeddings.
We apply dropout (Srivastava et al., 2014) with a 67% keep probability to the inputs of the BiLSTMs and MLPs. Following Iyyer et al. (2015) and Kiperwasser and Goldberg (2016), we also apply word dropout to learn an embedding for unknown words: we replace each word token w appearing #(w) times in the training set with a special "unk" symbol with probability p_unk(w) = 0.25 / (0.25 + #(w)). This procedure only affects the word embedding part of the input word vector representation.
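The word-dropout probability can be sketched as (illustrative; `counts` is assumed to map each word type to its training-set frequency #(w)):

```python
import random

def unk_probability(count):
    """p_unk(w) = 0.25 / (0.25 + #(w)); words unseen in training are always
    replaced, frequent words almost never."""
    return 0.25 / (0.25 + count)

def word_dropout(word, counts, rng):
    """Replace the token by the special 'unk' symbol with probability p_unk(w)."""
    return "unk" if rng.random() < unk_probability(counts.get(word, 0)) else word
```

For a word seen 3 times, p_unk = 0.25 / 3.25 ≈ 0.077, so the "unk" embedding is mostly trained on rare words.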
We optimize the objective loss using Adam (Kingma and Ba, 2014) with an initial learning rate of 0.001 and no mini-batches.

In our English results table, the first 11 rows present scores of dependency parsers in which POS tags were predicted by an external POS tagger, such as the Stanford tagger (Toutanova et al., 2003); the last 6 rows present scores of joint models. Clearly, our model produces very competitive parsing results. In particular, our model obtains a UAS score of 94.51% and a LAS score of 92.87%, which are about 1.4% and 1.9% absolute higher than the UAS and LAS scores of the BIST graph-based model (Kiperwasser and Goldberg, 2016), respectively. Our model also does better than the previous transition-based joint models of Zhang and Weiss (2016) and Yang et al. (2018), while obtaining similar UAS and LAS scores to the joint model JMT proposed by Hashimoto et al. (2017). We achieve 0.9% lower parsing scores than the state-of-the-art biaffine dependency parser; while that parser is also a BiLSTM- and graph-based model, it uses a more sophisticated "biaffine" attention mechanism for better decoding of dependency arcs and relation types. In future work, we will extend our model with the biaffine attention mechanism to investigate its benefit for our model. Other differences are that it uses higher-dimensional representations than ours, but relies on predicted POS tags.
We also obtain a state-of-the-art POS tagging accuracy of 97.97% on the test Section 23, which is about 0.4+% higher than those reported by Bohnet and Nivre (2012) and Yang et al. (2018). Other previous joint models did not report their specific POS tagging accuracies.

UniMelb in the CoNLL 2018 shared task on UD parsing
Our UniMelb team participated with jPTDP v2.0 in the CoNLL 2018 shared task on parsing 82 treebank test sets (in 57 languages) from raw text to universal dependencies (Zeman et al., 2018). The 82 treebanks are taken from UD v2.2; 61/82 test sets are for "big" UD treebanks, for which both training and development data sets are available, and 5/82 test sets are extra "parallel" test sets in languages where another big treebank exists. In addition, 7/82 test sets are for "small" UD treebanks, for which development data is not available. The remaining 9/82 test sets are in low-resource languages, without training data or with only a few gold-annotated sample sentences. For the 7 small treebanks without available development data, we split the training data into two parts with a 9:1 ratio, and then use the larger part for training and the smaller part for development. For each big or small treebank, we train a joint model for universal POS tagging and dependency parsing, using a fixed random seed and a fixed set of hyper-parameters as mentioned in Section 2.5. We evaluate the mixed accuracy on the development set after each training epoch, and select the model with the highest mixed accuracy.

Table 3: Results in the CoNLL 2018 shared task (Zeman et al., 2018). "UPOS" denotes the universal POS tagging score. "All", "Big", "PUD", "Small" and "Low" refer to the macro-average scores over all 82, 61 big treebank, 5 parallel, 7 small treebank and 9 low-resource treebank test sets, respectively. "gold seg." denotes the scores of our jPTDP v2.0 model with respect to gold segmentation, detailed in Table 4.

For parsing from raw text to universal dependencies, we employ the CoNLL-U test files pre-processed by the baseline UDPipe 1.2 (Straka and Straková, 2017); that is, we utilize the tokenization, word and sentence segmentation predicted by UDPipe 1.2. For the 68 big and small treebank test files, we use the corresponding trained joint models. We use the joint models trained for cs_pdt, en_ewt, fi_tdt, ja_gsd and sv_talbanken to process the 5 parallel test files cs_pud, en_pud, fi_pud, ja_modern and sv_pud, respectively. Since we do not focus on low-resource languages, we employ the baseline UDPipe 1.2 to process the 9 low-resource treebank test files. The final test runs are carried out on the TIRA platform (Potthast et al., 2014). Table 3 presents our results in the CoNLL 2018 shared task on multilingual parsing from raw texts to universal dependencies (Zeman et al., 2018). Over all 82 test sets, we outperform the baseline UDPipe 1.2 with a 0.6% absolute higher average UPOS F1 score and 2.5+% higher average UAS and LAS F1 scores. In particular, for the "big" category consisting of 61 treebank test sets, we obtain 0.8% higher UPOS, 3.1% higher UAS and 3.6% higher LAS than UDPipe 1.2.
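The 9:1 training/development split used for the small treebanks can be sketched as (illustrative; the text does not specify whether sentences are shuffled first, so corpus order is assumed):

```python
def split_treebank(sentences, dev_fraction=0.1):
    """Split a treebank without a development set into training and
    development parts with a 9:1 ratio; the larger part is used for training."""
    dev_size = int(len(sentences) * dev_fraction)
    cut = len(sentences) - dev_size
    return sentences[:cut], sentences[cut:]
```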
Our (UniMelb) official LAS-based rank is 14th, while the baseline UDPipe 1.2 ranks 18th, out of the 26 participating systems. However, it is difficult to make a clear comparison between our jPTDP v2.0 and the parsing models used in the other top systems. Several better-performing participating systems simply reuse the state-of-the-art biaffine dependency parser, construct ensemble models, or develop treebank concatenation strategies to obtain larger training data, which is likely to produce better scores than ours (Zeman et al., 2018).
Recall that the shared task focuses on parsing from raw texts. Most higher-ranking systems aim to improve the pre-processing steps of tokenization, word and sentence segmentation, resulting in significant improvements in final parsing scores. For example, in the CoNLL 2017 shared task on UD parsing, UDPipe 1.2 obtained 0.1+% higher average tokenization and word segmentation scores and a 0.2% higher average sentence segmentation score than UDPipe 1.1, resulting in a 1+% improvement in the final average LAS F1 score, while UDPipe 1.2 and UDPipe 1.1 shared exactly the same remaining components. Utilizing better pre-processors, as used in other participating systems, should likewise improve our final parsing scores.
In Table 3, we also present our average UPOS, UAS and LAS accuracies with respect to (w.r.t.) gold-standard tokenization, word and sentence segmentation. For more details and future comparison, Table 4 presents the UPOS, UAS and LAS scores w.r.t. gold-standard segmentation obtained by jPTDP v2.0 on each UD v2.2 test set of the CoNLL 2018 shared task.

Table 4: UPOS, UAS and LAS scores (computed on all tokens) of our jPTDP v2.0 model w.r.t. gold-standard segmentation on the 73 "Big", "PUD" and "Small" CoNLL 2018 shared task test sets from UD v2.2. [p] and [s] denote the "PUD" extra parallel and small test sets, respectively. For each treebank, a joint model is trained using a fixed set of hyper-parameters as mentioned in Section 2.5.

UniMelb in the EPE 2018 campaign
Our UniMelb team also participated with jPTDP v2.0 in the 2018 Extrinsic Parser Evaluation (EPE) campaign (Fares et al., 2018). The EPE 2018 campaign ran in collaboration with the CoNLL 2018 shared task, and aims to evaluate dependency parsers by comparing their performance on three downstream tasks: biomedical event extraction, negation resolution and opinion analysis (Johansson, 2017). Here, participants only need to provide parsing outputs for the English raw texts used in these downstream tasks; the campaign organizers then compute the end-to-end downstream task scores (Fares et al., 2018).

Table 5: Downstream task results at EPE 2018 (Fares et al., 2018). "SP17" denotes the F1 scores obtained by the EPE 2017 system Stanford-Paris (Schuster et al., 2017) with respect to (w.r.t.) the Stanford basic dependencies. The subscript in the SP17 column denotes the F1 scores obtained by Stanford-Paris w.r.t. the UD-v1-enhanced type of dependency representations, in which the average F1 score of 60.51 is the highest at EPE 2017.

We produced parsing outputs by running our trained model on the pre-processed tokenized and sentence-segmented data provided by the campaign, on the TIRA platform. Table 5 presents the results we obtained for the three downstream tasks at EPE 2018 (Fares et al., 2018). Since we employed external training data, our scores are not officially ranked. Among the 17 participating teams in total, we obtained the highest average F1 score over the three downstream tasks (i.e., we ranked first, unofficially). In particular, we achieved the highest F1 scores for both biomedical event extraction and opinion analysis. Our results may be high because the training data we used is larger than the English UD treebanks used by other teams. Table 5 also presents the scores of the Stanford-Paris team (Schuster et al., 2017), the first-ranked team at EPE 2017. Both the EPE 2017 and 2018 campaigns use the same downstream task setups; therefore, the downstream task scores are directly comparable.
Note that Stanford-Paris employed the state-of-the-art biaffine dependency parser with larger training data. In particular, Stanford-Paris not only used the WSJ sections 02-21 and the training split of the GENIA treebank (as we did), but also included the Brown corpus. The downstream application of negation resolution requires parsing of fiction, which is one of the genres included in the Brown corpus. Hence it is reasonable that the Stanford-Paris team produced better negation resolution scores than we did.
However, in terms of the Stanford basic dependencies, while we employ a less accurate parsing model with smaller training data, we obtain higher downstream task scores for event extraction and opinion analysis than the Stanford-Paris team. Consequently, better intrinsic parsing performance does not always imply better extrinsic downstream application performance. Similar observations on the biomedical event extraction and opinion analysis tasks can also be found in Nguyen and Verspoor (2018) and Gómez-Rodríguez et al. (2017), respectively. Further investigation of this pattern requires a much deeper understanding of the architecture of the downstream task systems, which is left for future work.

Conclusion
In this paper, we have presented a novel neural network model for joint POS tagging and graph-based dependency parsing. On the benchmark English WSJ Penn treebank, our model obtains a strong UAS of 94.51% and LAS of 92.87%, and a state-of-the-art POS tagging accuracy of 97.97%.
We also participated with our joint model in the CoNLL 2018 shared task on multilingual parsing from raw texts to universal dependencies, and obtained very competitive results. Specifically, using the same CoNLL-U files pre-processed by UDPipe (Straka and Straková, 2017), our model produced 0.8% higher POS tagging, 3.1% higher UAS and 3.6% higher LAS scores on average than UDPipe on 61 big UD treebank test sets. Furthermore, our model also helps obtain state-of-the-art downstream task scores for the biomedical event extraction and opinion analysis applications.
We believe our joint model can serve as a new strong baseline for both intrinsic POS tagging and dependency parsing tasks, as well as for extrinsic downstream applications. Our code and pre-trained models are available at: https://github.com/datquocnguyen/jPTDP.