Making Dependency Labeling Simple, Fast and Accurate

This work addresses the task of dependency labeling, that is, assigning labels to an (unlabeled) dependency tree. We employ and extend a feature representation learning approach, optimizing it for both high speed and accuracy. We apply our labeling model on top of state-of-the-art parsers and evaluate its performance on standard benchmarks including the CoNLL-2009 and the English PTB datasets. Our model processes over 1,700 English sentences per second, which is 30 times faster than the sparse-feature method. It improves labeling accuracy over the outputs of top parsers, achieving the best LAS on 5 out of 7 datasets.


Introduction
Traditionally in dependency parsing, the tasks of finding the tree structure and labeling the dependency arcs are coupled in a joint architecture. While this has the potential to eliminate errors propagated through a separated procedure, joint decoding introduces other issues that can also lead to non-optimal labeling assignments. One issue arises from the inexact algorithms adopted to solve the hard joint search problem. For instance, many parsers (Nivre et al., 2007; Titov and Henderson, 2007; Zhang et al., 2013; Dyer et al., 2015) adopt greedy decoding such as beam search, which may prune away the correct labeling hypothesis in an early decoding stage. Another issue is caused by the absence of rich label features. Adding dependency labels to the combinatorial space significantly slows down the search procedure. As a trade-off, many parsers such as MSTParser, TurboParser and RBGParser (McDonald et al., 2005; Martins et al., 2010) incorporate only single-arc label features to reduce the processing time. This restriction greatly limits the labeling accuracy.
In this work, we explore an alternative approach where dependency labeling is applied as a separate procedure, alleviating the issues described above. The potential of this approach has been explored in early work. For instance, McDonald et al. (2006) applied a separate labeling step on top of the first-order MSTParser. The benefit of such an approach is two-fold. First, once the tree structure is produced, the optimal labeling assignment can be found via an exact dynamic programming algorithm. Second, it becomes relatively cheap to add rich label features given a fixed tree, and the exact algorithm still applies when high-order label features are included. However, due to performance issues, such an approach has not been adopted by the top-performing parsers.
In this work, we show that the labeling procedure, when optimized with recent advanced techniques in parsing, can achieve very high speed and accuracy.
Specifically, our approach employs the recent distributional representation learning technique for parsing. We apply the low-rank tensor factorization method and extend it to the second-order case to learn a joint scoring function over grand-head, head, modifier and their labels. Unlike the prior work, which additionally requires traditional sparse features to achieve state-of-the-art performance, our extension alone delivers the same level of accuracy while being substantially faster. As a consequence, the labeling model can be applied either as a refinement (re-labeling) step on top of existing parsers with negligible computational cost, or as part of a decoupled procedure to simplify and speed up dependency parsing decoding.
We evaluate on all datasets in the CoNLL-2009 shared task as well as the English Penn Treebank dataset, applying our labeling model on top of state-of-the-art dependency parsers. Our labeling model processes over 1,700 English sentences per second, which is 30 times faster than the sparse-feature method. As a refinement (re-labeling) model, it achieves the best LAS on 5 out of 7 datasets.

Task Formulation
Given an unlabeled dependency parsing tree $y$ of sentence $x$, where $y$ can be obtained using existing (non-labeling) parsers, we classify each head-modifier dependency arc $h \to m \in y$ with a particular label $l_{h \to m}$. Let $l = \bigcup_{h \to m \in y} \{ l_{h \to m} \}$; our goal is to find the assignment with the highest score:
$$ l^{*} = \arg\max_{l} S(x, y, l) $$
For simplicity, we omit $x$ and $y$ in the following discussion, since they remain the same during the labeling process. We assume that the score $S(l)$ decomposes into a sum of local scores of single arcs or pairs of arcs (in the form of grand-head, head and modifier), i.e.
$$ S(l) = \sum_{h \to m \in y} s_1(h \xrightarrow{l_{h \to m}} m) + \sum_{g \to h \to m \in y} s_2(g \xrightarrow{l_{g \to h}} h \xrightarrow{l_{h \to m}} m) $$
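To make the decomposition concrete, the following minimal sketch enumerates the single-arc and grand-head-head-modifier parts of a fixed tree. The head-array encoding, the artificial root at index 0 and the function names are illustrative assumptions, not part of the original formulation.

```python
from typing import List, Tuple

ROOT = 0  # assumed convention: token 0 is an artificial root


def first_order_parts(heads: List[int]) -> List[Tuple[int, int]]:
    """Return all (head, modifier) arcs; heads[m] is the head of token m."""
    return [(heads[m], m) for m in range(1, len(heads))]


def second_order_parts(heads: List[int]) -> List[Tuple[int, int, int]]:
    """Return all (grand-head, head, modifier) chains in the tree."""
    return [(heads[h], h, m) for h, m in first_order_parts(heads) if h != ROOT]


# Toy tree for "She reads books": token 2 attaches to the root,
# tokens 1 and 3 attach to token 2.
heads = [-1, 2, 0, 2]
arcs = first_order_parts(heads)
chains = second_order_parts(heads)
```

Each arc contributes one first-order score and each chain contributes one second-order score, so the number of local scores grows linearly with sentence length.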
Parameterizing the scoring function is a key challenge. Following prior work on low-rank tensor factorization for parsing, we learn dense representations of features, which have been shown to generalize the scoring function better.

Scoring
The representation-based approach requires little feature engineering. Concretely, let $\phi_g, \phi_h, \phi_m \in \mathbb{R}^{n}$ be the atomic feature vectors of the grand-head, head and modifier word respectively, and let $\phi_{g \to h, p}, \phi_{h \to m, q} \in \mathbb{R}^{d}$ be the atomic feature vectors of the two dependency arcs respectively. It is easy to define and compute these vectors. For instance, $\phi_g$ (as well as $\phi_h$ and $\phi_m$) can incorporate binary features which indicate the word and POS tag of the current token (and its local context), while $\phi_{g \to h, p}$ (and $\phi_{h \to m, q}$) can indicate the label, direction and length of the arc between the two words.
Table 1: Word atomic features used by our model. POS, form, lemma and morph stand for the POS tag, word form, word lemma and morphology features respectively. The suffix -p refers to the previous token, and -n refers to the next.
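As an illustration of how such binary atomic vectors might be assembled, consider the sketch below. The feature strings, the length cap `max_len` and the helper names are hypothetical choices made for the example; the actual templates are those listed in Table 1.

```python
import numpy as np


def word_features(tokens, pos_tags, i):
    """Atomic features of token i: its form and POS plus neighbouring POS tags."""
    feats = [f"form={tokens[i]}", f"pos={pos_tags[i]}"]
    if i > 0:
        feats.append(f"pos-p={pos_tags[i - 1]}")
    if i + 1 < len(tokens):
        feats.append(f"pos-n={pos_tags[i + 1]}")
    return feats


def arc_features(h, m, label, max_len=5):
    """Atomic features of a labeled arc: label, direction and (capped) length."""
    direction = "R" if h < m else "L"
    length = min(abs(h - m), max_len)
    return [
        f"label={label}",
        f"label={label}|dir={direction}",
        f"label={label}|dir={direction}|len={length}",
    ]


def to_vector(feats, index):
    """Map feature strings to a binary indicator vector."""
    vec = np.zeros(len(index))
    for feat in feats:
        if feat in index:
            vec[index[feat]] = 1.0
    return vec


tokens = ["She", "reads", "books"]
pos_tags = ["PRP", "VBZ", "NNS"]
all_word_feats = {f for i in range(len(tokens)) for f in word_features(tokens, pos_tags, i)}
word_index = {f: k for k, f in enumerate(sorted(all_word_feats))}
all_arc_feats = set(arc_features(1, 0, "SBJ")) | set(arc_features(1, 2, "OBJ"))
arc_index = {f: k for k, f in enumerate(sorted(all_arc_feats))}

phi_h = to_vector(word_features(tokens, pos_tags, 1), word_index)  # head "reads"
phi_arc = to_vector(arc_features(1, 2, "OBJ"), arc_index)          # arc reads -> books
```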
The scores of the arcs are computed by (1) projecting the atomic vectors into low-dimensional spaces; and (2) summing up the element-wise products of the resulting dense vectors:
$$ s_1(h \xrightarrow{q} m) = \sum_{i=1}^{r_1} [U_1 \phi_h]_i \, [V_1 \phi_m]_i \, [W_1 \phi_{h \to m, q}]_i $$
where $r_1$ is a hyper-parameter denoting the dimension after projection, and $U_1, V_1 \in \mathbb{R}^{r_1 \times n}$, $W_1 \in \mathbb{R}^{r_1 \times d}$ are projection matrices to be learned.
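The two steps above can be sketched in a few lines of NumPy. The dimensions and the random matrices below are placeholders for illustration only; in the model the projection matrices are learned.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r1 = 10, 6, 4  # illustrative sizes, not the paper's settings

# Projection matrices (random here; learned in the actual model).
U1 = rng.normal(size=(r1, n))
V1 = rng.normal(size=(r1, n))
W1 = rng.normal(size=(r1, d))


def score_first_order(phi_h, phi_m, phi_arc):
    """Project the atomic vectors, then sum the element-wise products."""
    return float(np.sum((U1 @ phi_h) * (V1 @ phi_m) * (W1 @ phi_arc)))


# Random binary atomic vectors standing in for real features.
phi_h = rng.integers(0, 2, size=n).astype(float)
phi_m = rng.integers(0, 2, size=n).astype(float)
phi_arc = rng.integers(0, 2, size=d).astype(float)

s1 = score_first_order(phi_h, phi_m, phi_arc)
```

The cost of one score is dominated by the three projections; once these are cached, only an inner product over `r1` values remains.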
The above formulation can be shown to be equivalent to factorizing a huge score table $T_1(\cdot, \cdot, \cdot)$ into the product of three matrices $U_1$, $V_1$ and $W_1$, where $T_1$ is a 3-way array (tensor) storing the weights of all possible features involving three components: the head, the modifier and the arc between the two. Accordingly, the formula to calculate $s_1(\cdot)$ is equivalent to summing up all feature weights (from $T_1$) over the structure $h \xrightarrow{q} m$.
We depart from the prior work in the following aspects. First, we naturally extend the factorization approach to score second-order structures of grand-head, head and modifier:
$$ s_2(g \xrightarrow{p} h \xrightarrow{q} m) = \sum_{i=1}^{r_2} [U_2 \phi_g]_i \, [V_2 \phi_h]_i \, [W_2 \phi_m]_i \, [X_2 \phi_{g \to h, p}]_i \, [Y_2 \phi_{h \to m, q}]_i $$
Here $r_2$ is a hyper-parameter denoting the dimension, and $U_2, V_2, W_2 \in \mathbb{R}^{r_2 \times n}$, $X_2, Y_2 \in \mathbb{R}^{r_2 \times d}$ are additional parameter matrices to be learned. Second, in order to achieve state-of-the-art parsing accuracy, prior work combines the single-arc score $s_1(h \xrightarrow{q} m)$ with an extensive set of sparse features which go beyond single-arc structures. However, we find that this combination is a huge impediment to decoding speed. Since our extension already captures high-order structures, it readily delivers state-of-the-art accuracy without the combination. This change results in a speed-up of an order of magnitude (see the Speed-up section for further discussion).
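The equivalence between the low-rank score and the full tensor can be checked numerically on a toy problem, and the second-order score follows the same pattern with five projections. The sketch below uses small random matrices and made-up dimensions; materializing $T_1$ is only feasible at this toy scale.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, r1, r2 = 5, 3, 4, 2  # toy sizes chosen for the example

U1, V1 = rng.normal(size=(r1, n)), rng.normal(size=(r1, n))
W1 = rng.normal(size=(r1, d))

# Explicit 3-way tensor: T1[a, b, c] = sum_i U1[i, a] * V1[i, b] * W1[i, c].
T1 = np.einsum("ia,ib,ic->abc", U1, V1, W1)

phi_h, phi_m = rng.normal(size=n), rng.normal(size=n)
phi_arc = rng.normal(size=d)

low_rank = np.sum((U1 @ phi_h) * (V1 @ phi_m) * (W1 @ phi_arc))
full = np.einsum("abc,a,b,c->", T1, phi_h, phi_m, phi_arc)

# Second-order parameters: three word projections and two arc projections.
U2, V2, W2 = (rng.normal(size=(r2, n)) for _ in range(3))
X2, Y2 = (rng.normal(size=(r2, d)) for _ in range(2))


def score_second_order(phi_g, phi_h, phi_m, phi_gh, phi_hm):
    """Sum of element-wise products of the five projected vectors."""
    return float(
        np.sum((U2 @ phi_g) * (V2 @ phi_h) * (W2 @ phi_m) * (X2 @ phi_gh) * (Y2 @ phi_hm))
    )


s2 = score_second_order(rng.normal(size=n), phi_h, phi_m, rng.normal(size=d), phi_arc)
```

The explicit tensor has n * n * d entries, whereas the factorized form stores only r1 * (2n + d) parameters, which is what makes the approach practical for large feature spaces.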

Viterbi Labeling
We use a dynamic programming algorithm to find the labeling assignment with the highest score. Suppose $h$ is any node apart from the root, and $g$ is the parent of $h$. Let $f(h, p)$ denote the highest score of the subtree of $h$ with $l_{g \to h}$ fixed to be $p$. We compute $f(\cdot, \cdot)$ bottom-up, from the leaves to the root: for each child $m$ of $h$, the transition function maximizes, over the label $q$ of the arc $h \to m$, the sum of the local scores involving that arc and $f(m, q)$, and then adds up these maxima over all children. The highest score of the whole tree is obtained in the same way at the root. Once we have $f(\cdot, \cdot)$, we determine the labels backward, in a top-down manner. The time complexity of our algorithm is $O(N L^2 \cdot T)$, where $N$ is the number of words in a sentence, $L$ is the total number of labels, and $T$ is the time needed to compute features and scores.
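One way to write the recurrence consistent with the definitions above is $f(h, p) = \sum_{m} \max_{q} [\, s_1(h \xrightarrow{q} m) + s_2(g \xrightarrow{p} h \xrightarrow{q} m) + f(m, q) \,]$, with the second-order term dropped at the root. The sketch below implements this form; the head-array encoding, the callables `s1` and `s2` and the toy scores are assumptions made for the example rather than the authors' implementation.

```python
from collections import defaultdict


def viterbi_label(heads, labels, s1, s2):
    """Exact labeling of a fixed tree.

    heads[m] is the head of token m; token 0 is an artificial root.
    s1(h, m, q) and s2(g, h, m, p, q) are hypothetical local score functions.
    Returns (best_score, {modifier: label}).
    """
    children = defaultdict(list)
    for m in range(1, len(heads)):
        children[heads[m]].append(m)

    f = {}     # f[(h, p)]: best score below h when arc g -> h has label p
    back = {}  # back[(h, p, m)]: best label for arc h -> m in that context

    def fill(h, g):
        for m in children[h]:
            fill(m, h)
        for p in (labels if g is not None else [None]):
            total = 0.0
            for m in children[h]:
                best_q, best = None, float("-inf")
                for q in labels:
                    value = s1(h, m, q) + f[(m, q)]
                    if g is not None:
                        value += s2(g, h, m, p, q)
                    if value > best:
                        best_q, best = q, value
                back[(h, p, m)] = best_q
                total += best
            f[(h, p)] = total

    fill(0, None)

    # Top-down pass: read off the labels from the back-pointers.
    assignment, stack = {}, [(0, None)]
    while stack:
        h, p = stack.pop()
        for m in children[h]:
            assignment[m] = back[(h, p, m)]
            stack.append((m, assignment[m]))
    return f[(0, None)], assignment


# Toy example: scores that reward a fixed target labeling.
heads = [-1, 2, 0, 2]
labels = ["SBJ", "OBJ", "ROOT"]
target = {1: "SBJ", 2: "ROOT", 3: "OBJ"}
toy_s1 = lambda h, m, q: 1.0 if target[m] == q else 0.0
toy_s2 = lambda g, h, m, p, q: 0.5 if (p == "ROOT" and q != "ROOT") else 0.0
best_score, assignment = viterbi_label(heads, labels, toy_s1, toy_s2)
```

The three nested loops over children, incoming labels and outgoing labels give the $O(N L^2)$ number of score evaluations stated above. The recursive helper is adequate for sentence-length inputs; an explicit post-order traversal would avoid recursion limits on very deep trees.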

Speed-up
In this section, we discuss two simple but effective strategies to speed up the labeling procedure.
Pruning. We prune unlikely labels by simply exploiting the part-of-speech (POS) tags of the head and the modifier. Specifically, let $\mathbb{1}(pos_h, pos_m, l)$ denote whether there is an arc $h \xrightarrow{l} m$ in the training data such that the POS tag of $h$ is $pos_h$ and the POS tag of $m$ is $pos_m$. In the labeling process, we only consider the labels that occur with the corresponding POS tags. Let $K$ be the average number of possible labels per arc; the time complexity then drops to approximately $O(N K^2 \cdot T)$. In practice, $K \approx L/4$. Since the cost is quadratic in the number of candidate labels, this pruning step makes our labeler about 16 times faster.
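The pruning table can be built in a single pass over the training arcs, as in the sketch below. The data layout and function names are illustrative, and the fallback to the full label set for POS pairs never seen in training is an assumption of this example, since the text does not specify that case.

```python
from collections import defaultdict


def build_label_filter(training_arcs):
    """Collect, for each (head POS, modifier POS) pair, the labels seen in training.

    training_arcs: iterable of (pos_h, pos_m, label) triples.
    """
    allowed = defaultdict(set)
    for pos_h, pos_m, label in training_arcs:
        allowed[(pos_h, pos_m)].add(label)
    return allowed


def candidate_labels(allowed, pos_h, pos_m, all_labels):
    """Labels to consider for an arc; unseen POS pairs keep every label (assumption)."""
    return sorted(allowed.get((pos_h, pos_m), all_labels))


training_arcs = [
    ("VBZ", "PRP", "SBJ"),
    ("VBZ", "NNS", "OBJ"),
    ("VBZ", "NNS", "PRD"),
    ("NNS", "DT", "NMOD"),
]
all_labels = {"SBJ", "OBJ", "PRD", "NMOD", "ROOT"}
allowed = build_label_filter(training_arcs)
verb_noun = candidate_labels(allowed, "VBZ", "NNS", all_labels)
unseen = candidate_labels(allowed, "JJ", "RB", all_labels)
```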
Using Representation-based Scoring Only. The time to compute scores, i.e. $T$, consists of building the features and fetching the corresponding feature weights. For traditional methods, this requires enumerating feature templates, constructing feature IDs and searching for the feature weights in a look-up table. For representation-based scoring, the dense word representations (e.g. $U_1 \phi_h$) can be pre-computed, and the scores are obtained by simple inner products of small vectors. We choose to use representation-based scoring only, thereby reducing the time to $O(N K^2 \cdot (r_1 + r_2) + N T)$. In practice, we find the labeling process becomes about 30 times faster.
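The pre-computation can be sketched as follows: every token is projected once per sentence, and each candidate score then reduces to a product over `r1` values. The matrix sizes, the random features and the idea of indexing arc feature vectors by an integer arc type are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r1 = 12, 6, 4
num_tokens, num_arc_types = 5, 8  # toy sentence length and arc-type count

U1, V1 = rng.normal(size=(r1, n)), rng.normal(size=(r1, n))
W1 = rng.normal(size=(r1, d))

# One atomic vector per token, and one per (label, direction, length) arc type.
Phi = rng.integers(0, 2, size=(num_tokens, n)).astype(float)
Phi_arc = rng.integers(0, 2, size=(num_arc_types, d)).astype(float)

# Projections computed once and reused for every candidate label.
head_proj = Phi @ U1.T     # shape (num_tokens, r1)
mod_proj = Phi @ V1.T      # shape (num_tokens, r1)
arc_proj = Phi_arc @ W1.T  # shape (num_arc_types, r1)


def fast_first_order(h, m, arc_type):
    """Score an arc using only the cached r1-dimensional vectors."""
    return float(np.sum(head_proj[h] * mod_proj[m] * arc_proj[arc_type]))


score = fast_first_order(2, 4, 3)
```

The projections account for the $N T$ term in the complexity above, while the per-candidate work is the cheap product over cached vectors.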

Learning
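Let $D = \{(x_i, y_i, l_i)\}_{i=1}^{M}$ be the collection of $M$ training samples. Our goal is to learn the values of the set of parameters $\Theta = \{U_1, V_1, W_1, U_2, V_2, W_2, X_2, Y_2\}$ based on $D$. Following standard practice, we optimize the parameter values in an online maximum soft-margin framework, minimizing the structural hinge loss:
$$ loss(\Theta) = \max_{\tilde{l}} \{ S(\tilde{l}) + \| l_i - \tilde{l} \|_1 \} - S(l_i) $$
where $\| l_i - \tilde{l} \|_1$ is the number of different labels between $l_i$ and $\tilde{l}$. We adjust the parameters $\Theta$ by $\Delta\Theta$ via a passive-aggressive update:
$$ \Delta\Theta = -\min\left\{ C, \frac{loss(\Theta)}{\| \delta\Theta \|^2} \right\} \delta\Theta $$
where $\delta\Theta = \frac{d\, loss(\Theta)}{d\Theta}$ denotes the derivatives and $C$ is a regularization hyper-parameter controlling the maximum step size of each update.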
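A single online update of this kind can be sketched as below, assuming the standard passive-aggressive step size $\min\{C, loss / \|\delta\Theta\|^2\}$ and parameters stored as a dictionary of NumPy arrays. The function names and the dictionary layout are illustrative, and the gradient is taken as given rather than derived from the scoring function.

```python
import numpy as np


def hamming(gold_labels, predicted_labels):
    """Number of arcs whose labels differ between two assignments."""
    return sum(g != p for g, p in zip(gold_labels, predicted_labels))


def hinge_loss(score_gold, score_pred, gold_labels, predicted_labels):
    """Structural hinge loss for one cost-augmented prediction."""
    return max(0.0, score_pred + hamming(gold_labels, predicted_labels) - score_gold)


def passive_aggressive_step(theta, grad, loss, C=0.01):
    """Return updated parameters; the step size is capped at C.

    theta, grad: dicts mapping parameter names to arrays of equal shape.
    """
    sq_norm = sum(float(np.sum(g * g)) for g in grad.values())
    if loss <= 0.0 or sq_norm == 0.0:
        return {name: value.copy() for name, value in theta.items()}
    step = min(C, loss / sq_norm)
    return {name: theta[name] - step * grad[name] for name in theta}


theta = {"U1": np.ones((2, 2))}
grad = {"U1": np.full((2, 2), 2.0)}
loss = hinge_loss(1.0, 1.0, ["SBJ", "OBJ", "NMOD"], ["SBJ", "PRD", "NMOD"])
updated = passive_aggressive_step(theta, grad, loss, C=0.01)
```

With a small $C$ the cap is usually active, so each update moves the parameters by at most $C$ times the gradient.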
To counteract over-fitting, we follow the common practice of averaging parameters over all iterations.

Results
Experimental Setup. We test our model on the CoNLL-2009 shared task benchmark with 7 different languages as well as the English Penn Treebank dataset. Whenever available, we use the predicted POS tags, word lemmas and morphological information provided in the datasets as atomic features. Following standard practice, we use unlabeled attachment scores (UAS) and labeled attachment scores (LAS) as evaluation measures. To compare with previously reported numbers, we exclude punctuation for PTB in the evaluation, and include punctuation for CoNLL-2009 for consistency. We use RBGParser, a state-of-the-art graph-based parser, to predict dependency trees, and then apply our labeling model to obtain the dependency label assignments. To demonstrate the effectiveness of our model on other systems, we also apply it to two additional parsers: the Stanford Neural Shift-reduce Parser (Chen and Manning, 2014) and TurboParser (Martins et al., 2010). In all reported experiments, we run these parsers with their default suggested settings. The hyper-parameters of our labeling model are set as follows: $r_1 = 50$, $r_2 = 30$, $C = 0.01$.
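For reference, the two attachment scores can be computed as in the following sketch. It is a simplified stand-in for illustration, not the official evaluation script; the per-token `(head, label)` encoding and the `skip` argument for excluded tokens such as punctuation are assumptions of this example.

```python
def attachment_scores(gold, pred, skip=None):
    """Return (UAS, LAS) as fractions of the evaluated tokens.

    gold, pred: lists of (head, label) pairs, one per token.
    skip: optional set of token positions to exclude (e.g. punctuation).
    """
    skip = skip or set()
    positions = [i for i in range(len(gold)) if i not in skip]
    if not positions:
        raise ValueError("no tokens left to evaluate")
    uas = sum(gold[i][0] == pred[i][0] for i in positions) / len(positions)
    las = sum(gold[i] == pred[i] for i in positions) / len(positions)
    return uas, las


gold = [(2, "SBJ"), (0, "ROOT"), (2, "OBJ"), (2, "P")]
pred = [(2, "SBJ"), (0, "ROOT"), (2, "NMOD"), (3, "P")]
uas_all, las_all = attachment_scores(gold, pred)
uas_no_punct, las_no_punct = attachment_scores(gold, pred, skip={3})
```

Because a token counts toward LAS only when both its head and its label are correct, re-labeling a fixed tree can raise LAS while leaving UAS unchanged.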
Labeling Performance. To test the performance of our labeling method, we first train our model using the gold unlabeled dependency trees and evaluate the labeling accuracy on CoNLL-2009. Table 3 presents the results. For comparison, we implement a combined system which adds a rich set of traditional, sparse features into the scoring function and jointly trains the feature weights. As shown in the table, using our representation-based method alone is very fast: it is 30 times faster than the implementation with traditional feature computation and is able to process over 1,700 English sentences per second. It does not affect the LAS accuracy except for Chinese.
Table 4 shows the performance on the English PTB dataset. We use RBGParser to predict both labeled and unlabeled trees, and there is no significant difference between their UAS. This finding lays the foundation for a separate procedure: the tree structure does not vary much compared to the joint procedure, so we can exploit rich label features and sophisticated algorithms to improve the LAS. Our re-labeling model improves over the predictions generated by the three different parsers, with LAS gains ranging from 0.2% to 0.4%. Moreover, the labeling procedure runs in only 1.5 seconds on the test set. If we use the existing parsers to predict only unlabeled trees, we also obtain a speed improvement, even for the highly speed-optimized Stanford Neural Parser.
Attachment scores are computed with the official evaluation script from CoNLL-X.

CoNLL-2009 Results
In Table 2, we compare our model with the best systems of the CoNLL-2009 shared task, Bohnet (2010), Zhang and McDonald (2014), as well as the most recent neural network parser. Despite the simplicity of the decoupled parsing procedure, our labeling model achieves LAS performance on par with the state-of-the-art neural network parser. Specifically, our model obtains the best LAS on 5 out of 7 languages, while the neural parser outperforms ours on Catalan and Chinese.

Conclusion
The most common method for dependency parsing couples the structure search and label search. We demonstrate that decoupling these two steps yields both computational gains and improvement in labeling accuracy. Specifically, we demonstrate that our labeling model can be used as a post-processing step to improve the accuracy of state-of-the-art parsers. Moreover, by employing dense feature representations and a simple pruning strategy, we can significantly speed up the labeling procedure and reduce the total decoding time of dependency parsing.