Efficient Second-Order TreeCRF for Neural Dependency Parsing

In the deep learning (DL) era, parsing models have been greatly simplified with little loss in performance, thanks to the remarkable capability of multi-layer BiLSTMs in context representation. As the most popular graph-based dependency parser due to its high efficiency and performance, the biaffine parser directly scores single dependencies under the arc-factorization assumption, and adopts a very simple local token-wise cross-entropy training loss. This paper for the first time presents a second-order TreeCRF extension to the biaffine parser. For a long time, the complexity and inefficiency of the inside-outside algorithm have hindered the popularity of TreeCRF. To address this issue, we propose an effective way to batchify the inside and Viterbi algorithms for direct large matrix operations on GPUs, and to avoid the complex outside algorithm via efficient back-propagation. Experiments and analysis on 27 datasets from 13 languages clearly show that techniques developed before the DL era, such as structural learning (global TreeCRF loss) and high-order modeling, are still useful, and can further boost parsing performance over the state-of-the-art biaffine parser, especially for partially annotated training data. We release our code at https://github.com/yzhangcs/crfpar.


Introduction
As a fundamental task in NLP, dependency parsing has attracted a lot of research interest due to its simplicity and multilingual applicability in capturing both syntactic and semantic information (Nivre et al., 2016). Given an input sentence $x = w_0 w_1 \dots w_n$, a dependency tree, as depicted in Figure 1, is defined as $y = \{(i, j, l)\}$, $0 \le i \le n$, $1 \le j \le n$, $l \in \mathcal{L}$, where $(i, j, l)$ is a dependency from the head word $w_i$ to the modifier word $w_j$ with the relation label $l \in \mathcal{L}$.

[Figure 1: An example full dependency tree over the sentence "$ I saw Sarah with a telescope". In the case of partial annotation, only some (not all) dependencies are annotated, for example, the two thick (blue) arcs.]

Between the two mainstream approaches, this work focuses on the graph-based paradigm (vs. transition-based). Before the deep learning (DL) era, graph-based parsing relied on many hand-crafted features and differed from its neural counterpart in two major aspects. First, structural learning, i.e., explicit awareness of tree structure constraints during training, is indispensable. Most non-neural graph-based parsers adopt the max-margin training algorithm, which first predicts a highest-scoring tree with the current model, and then updates feature weights so that the correct tree has a higher score than the predicted tree.
Second, high-order modeling brings significant accuracy gains. The basic first-order model factors the score of a tree into independent scores of single dependencies (McDonald et al., 2005a). Second-order models were soon proposed to incorporate scores of dependency pairs, such as adjacent-siblings (McDonald and Pereira, 2006) and grand-parent-child (Carreras, 2007; Koo and Collins, 2010), showing significant accuracy improvements yet at the cost of lower efficiency and more complex decoding algorithms. In contrast, neural graph-based dependency parsing exhibits an opposite development trend. Pei et al. (2015) propose to use feed-forward neural networks for automatically learning combinations of dozens of atomic features, similar to Chen and Manning (2014), and for computing subtree scores. They show that incorporating second-order scores of adjacent-sibling subtrees significantly improves performance. Then, both Wang and Chang (2016) and Kiperwasser and Goldberg (2016) propose to utilize BiLSTMs as encoders and use minimal feature sets for scoring single dependencies in a first-order parser. These three representative works all employ global max-margin training. Dozat and Manning (2017) propose a strong and efficient biaffine parser and obtain state-of-the-art accuracy on a variety of datasets and languages. The biaffine parser is also first-order and employs simpler and more efficient non-structural training via local head selection for each token (Zhang et al., 2017).
Observing such contrasting development trends, we try to make a connection between pre-DL and DL techniques for graph-based parsing. Specifically, the first question to be addressed in this work is: can previously useful techniques such as structural learning and high-order modeling further improve the state-of-the-art biaffine parser, and if so, in which aspects are they helpful? (We still refer to the biaffine parser as state-of-the-art: though many recent works report higher performance with extra resources, for example contextualized word representations learned from large-scale unlabeled texts under language model loss, they either adopt the same architecture or achieve similar performance under fair comparison.)
For structural learning, we focus on the more complex and less popular TreeCRF instead of max-margin training. The reason is two-fold. First, estimating probability distributions is the core issue in modern data-driven NLP methods (Le and Zuidema, 2014). The probability of a tree, i.e., p(y | x), is potentially more useful than an unbounded score s(x, y) for high-level NLP tasks when utilizing parsing outputs. Second, as a theoretically sound way to measure the model's confidence in subtrees, marginal probabilities can support Minimum Bayes Risk (MBR) decoding (Smith and Smith, 2007), and are also proven to be crucial for the important research line of token-level active learning based on partial trees (Li et al., 2016).
One probable reason why TreeCRF has been less popular, despite its usefulness, is the complexity and inefficiency of the inside-outside algorithm, especially the outside algorithm. As far as we know, all existing works compute the inside and outside algorithms on CPUs. The inefficiency issue becomes more severe in the DL era, due to the unmatched speed of CPU and GPU computation. This leads to the second question: can we batchify the inside-outside algorithm and perform computation directly on GPUs? If so, we could employ efficient TreeCRF as a built-in component in DL toolkits such as PyTorch for wider applications (Cai et al., 2017; Le and Zuidema, 2014).
Overall, targeted at the above two questions, this work makes the following contributions.
• We for the first time propose second-order TreeCRF for neural dependency parsing. We also propose an efficient and effective triaffine operation for scoring second-order subtrees.
• We propose to batchify the inside algorithm via direct large tensor computation on GPUs, leading to very efficient TreeCRF loss computation. We show that the complex outside algorithm is no longer needed for the computation of gradients and marginal probabilities, and can be replaced by the equally efficient back-propagation process.
• We conduct experiments on 27 datasets from 13 languages. The results and analysis show that both structural learning and high-order modeling are still beneficial to the state-of-the-art biaffine parser in many ways in the DL era.

The Basic Biaffine Parser
We re-implement the state-of-the-art biaffine parser (Dozat and Manning, 2017) with two modifications, i.e., using CharLSTM word representation vectors instead of POS tag embeddings, and using the first-order Eisner algorithm (Eisner, 2000) for projective decoding instead of the non-projective MST algorithm.
Scoring architecture. Figure 2 shows the scoring architecture, consisting of four components.
Input vectors. The $i$th input vector is composed of two parts: the word embedding and the CharLSTM word representation vector of $w_i$:

$e_i = \mathrm{emb}(w_i) \oplus \mathrm{CharLSTM}(w_i)$

where $\mathrm{CharLSTM}(w_i)$ is obtained by feeding $w_i$ into a BiLSTM and then concatenating the two last hidden vectors (Lample et al., 2016). We find that replacing POS tag embeddings with $\mathrm{CharLSTM}(w_i)$ leads to consistent improvement, and also simplifies the multilingual experiments by avoiding POS tag generation (especially n-fold jackknifing on training data).
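To make the CharLSTM component concrete, here is a minimal PyTorch sketch, assuming the 50-dimensional char embeddings and 100-dimensional output mentioned later in the parameter settings; the class and argument names are ours, not the released code's, and padding handling is omitted.

```python
import torch
import torch.nn as nn

class CharLSTM(nn.Module):
    """Sketch of a character-level BiLSTM word representation (Lample et al., 2016)."""
    def __init__(self, n_chars, char_dim=50, out_dim=100):
        super().__init__()
        self.embed = nn.Embedding(n_chars, char_dim)
        self.lstm = nn.LSTM(char_dim, out_dim // 2, batch_first=True, bidirectional=True)

    def forward(self, char_ids):
        # char_ids: (n_words, max_word_len) character indices of each word
        x = self.embed(char_ids)
        _, (h, _) = self.lstm(x)                   # h: (2, n_words, out_dim // 2)
        return torch.cat((h[0], h[1]), dim=-1)     # concatenate the two last hidden vectors
```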
BiLSTM encoder. To encode the sentential contexts, the parser applies three BiLSTM layers over $e_0 \dots e_n$. The output vector of the top-layer BiLSTM for the $i$th word is denoted as $h_i$.
MLP feature extraction. Two shared MLPs are applied to $h_i$, obtaining two lower-dimensional vectors that retain only syntax-related features:

$r^h_i = \mathrm{MLP}^h(h_i); \quad r^m_i = \mathrm{MLP}^m(h_i)$

where $r^h_i$ and $r^m_i$ are the representation vectors of $w_i$ as a head word and a modifier word, respectively.
Biaffine scorer. Dozat and Manning (2017) for the first time propose to compute the score of a dependency $i \rightarrow j$ via biaffine attention over the head and modifier representations $r^h_i$ and $r^m_j$, parameterized by $W^{biaffine} \in \mathbb{R}^{d \times d}$. The computation is extremely efficient on GPUs.
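As a rough illustration (a sketch under our own assumptions: we omit the bias augmentation that some implementations add to the input vectors, and the dimension is illustrative), the arc scores for a whole batch can be computed with a single einsum:

```python
import torch
import torch.nn as nn

class BiaffineArcScorer(nn.Module):
    """Sketch: s[b, i, j] is the score of the arc i -> j."""
    def __init__(self, d=500):
        super().__init__()
        self.W = nn.Parameter(torch.zeros(d, d))

    def forward(self, r_h, r_m):
        # r_h, r_m: (batch, seq_len, d) head/modifier representations from the MLPs
        return torch.einsum('bia,ae,bje->bij', r_h, self.W, r_m)
```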
Local token-wise training loss. The biaffine parser adopts a simple non-structural training loss, trying to independently maximize the local probability of the correct head word for each word. For a gold-standard head-modifier pair $(w_i, w_j)$ in a training instance, the cross-entropy loss is

$L(i, j) = -\log \frac{e^{s(i, j)}}{\sum_{0 \le k \le n} e^{s(k, j)}}$

In other words, the model is trained based on simple head selection, without considering the tree structure at all, and the losses of all words in a mini-batch are accumulated.
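This head-selection loss amounts to an ordinary cross-entropy over candidate heads; a minimal sketch (the padding convention with -1 is our own assumption):

```python
import torch.nn.functional as F

def head_selection_loss(s_arc, heads):
    # s_arc: (batch, seq_len, seq_len), s_arc[b, i, j] = score of arc i -> j
    # heads: (batch, seq_len) LongTensor, gold head of each word, -1 for padded/ignored positions
    batch, seq_len, _ = s_arc.shape
    logits = s_arc.transpose(1, 2).reshape(-1, seq_len)   # one softmax over candidate heads per modifier
    return F.cross_entropy(logits, heads.reshape(-1), ignore_index=-1)
```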
Decoding. Given the scores of all dependencies, we adopt the first-order Eisner algorithm with $O(n^3)$ time complexity to find the optimal tree.
Handling dependency labels. The biaffine parser treats skeletal tree searching and labeling as two independent (training phase) and cascaded (parsing phase) tasks. This work follows the same strategy for simplicity. Please refer to Dozat and Manning (2017) for details.

Second-order TreeCRF
This work substantially extends the biaffine parser in two closely related aspects: using probabilistic TreeCRF for structural training and explicitly incorporating high-order subtree scores. Specifically, we further incorporate adjacent-sibling subtree scores into the basic first-order model:

$s(x, y) = \sum_{(i, j) \in y} s(i, j) + \sum_{(i, k, j) \in y} s(i, k, j)$

where $k$ and $j$ are two adjacent modifiers of $i$ and satisfy either $i < k < j$ or $j < k < i$. As a probabilistic model, TreeCRF computes the conditional probability of a tree as

$p(y \mid x) = \frac{e^{s(x, y)}}{Z(x)}, \quad Z(x) = \sum_{y' \in \mathcal{Y}(x)} e^{s(x, y')}$

where $\mathcal{Y}(x)$ is the set of all legal (projective) trees for $x$, and $Z(x)$ is commonly referred to as the normalization (or partition) term. During training, TreeCRF employs the following structural training loss to maximize the conditional probability of the gold-standard tree $y$ given $x$:

$L(x, y) = -\log p(y \mid x) = \log Z(x) - s(x, y)$
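To make the score decomposition and the loss concrete, here is a single-sentence sketch, assuming the arc and adjacent-sibling scores are already computed and log Z(x) comes from the inside algorithm described later; the function name and index conventions are ours.

```python
def crf2o_loss(s_arc, s_sib, log_z, heads):
    """-log p(y|x) = log Z(x) - s(x, y) for one sentence (sketch).

    s_arc[h, m]   : score of the arc h -> m (position 0 is the pseudo-root)
    s_sib[h, k, m]: score of the adjacent-sibling subtree (h, k, m)
    heads[m-1]    : gold head of word m, for m = 1..n
    log_z         : partition term log Z(x)
    """
    score, children = 0., {}
    for m, h in enumerate(heads, start=1):
        score = score + s_arc[h, m]
        children.setdefault(h, []).append(m)
    for h, mods in children.items():
        for side in ([m for m in mods if m < h], [m for m in mods if m > h]):
            for a, b in zip(side, side[1:]):        # adjacent modifiers on the same side of h
                k, j = (b, a) if b < h else (a, b)  # k is the modifier closer to the head
                score = score + s_sib[h, k, j]
    return log_z - score
```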

Scoring Second-order Subtrees
To avoid major modification to the original scoring architecture, we take a straightforward extension to obtain scores of adjacent-sibling subtrees. First, we employ three extra MLPs to perform similar feature extraction:

$r^h_i; \; r^s_i; \; r^m_i = \mathrm{MLP}^h(h_i); \; \mathrm{MLP}^s(h_i); \; \mathrm{MLP}^m(h_i)$

where $r^h_i$, $r^s_i$, and $r^m_i$ are the representation vectors of $w_i$ as head, sibling, and modifier, respectively. (Another way is to use one extra MLP for the sibling representation and re-use the head and modifier representations from the basic first-order components, which however leads to inferior performance in our preliminary experiments.) Then, we propose a natural extension to the biaffine equation, and employ a triaffine operation for score computation over the three vectors. (We have also tried an approximate method that uses three biaffine operations to simulate the interactions of the three input vectors, but observed inferior performance; we omit the results due to space limitations.) The triaffine computation can be performed quite efficiently with the einsum function in PyTorch.
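One possible einsum formulation of a triaffine scorer (a sketch; the actual parser may additionally augment some vectors with a bias dimension, and all names are ours):

```python
import torch
import torch.nn as nn

class TriaffineSibScorer(nn.Module):
    """Sketch: s[b, i, k, j] is the score of the adjacent-sibling subtree (i, k, j)."""
    def __init__(self, d=100):
        super().__init__()
        self.W = nn.Parameter(torch.zeros(d, d, d))

    def forward(self, r_h, r_s, r_m):
        # r_h, r_s, r_m: (batch, seq_len, d) head/sibling/modifier representations
        return torch.einsum('bia,ace,bkc,bje->bikj', r_h, self.W, r_s, r_m)
```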

Computing TreeCRF Loss Efficiently
The key to the TreeCRF loss is how to efficiently compute $\log Z(x)$. This problem was well solved long before the DL era for non-neural dependency parsing. Straightforwardly, we can directly extend the Viterbi decoding algorithm by replacing max product with sum product, and naturally obtain $\log Z(x)$ in the same polynomial time complexity. However, it is not enough to solely perform the inside algorithm for non-neural parsing, due to the inapplicability of the automatic differentiation mechanism. In order to obtain marginal probabilities and then feature weight gradients, we have to implement the more sophisticated outside algorithm, which is usually at least twice as slow as the inside algorithm. This may be the major reason why TreeCRF was less popular than max-margin training before the DL era.

[Algorithm 1 (excerpt; see Figure 3): $C_{i,j} = \log \sum_{i < r \le j} e^{I_{i,r} + C_{r,j}}$; return $C_{0,n} \equiv \log Z$.]
As far as we know, all previous works on neural TreeCRF parsing explicitly implement the inside-outside algorithm for gradient computation (Jiang et al., 2018). To improve efficiency, computation is transferred from GPUs to CPUs with Cython programming.
This work shows that the inside algorithm can be effectively batchified to fully utilize the power of GPUs. Figure 3 and Algorithm 1 together illustrate the batchified version of the second-order inside algorithm, which is a direct extension of the second-order Eisner algorithm in McDonald and Pereira (2006) by replacing max product with sum product. We omit the generation of incomplete, complete, and sibling spans in the opposite direction (from j to i) for brevity.
Basically, we first pack the scores of same-width spans at different positions (i, j) for all B sentences in the data batch into large tensors. Then we can do computation and aggregation simultaneously on GPUs via efficient large tensor operations.
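For reference, below is a single-sentence, first-order sketch of the inside algorithm in the log-sum-exp semiring (our own simplification: the paper's Algorithm 1 is second-order and additionally batches same-width spans across sentences into large tensors, whereas here only the split points are vectorized, and the single-root constraint is ignored). Because it is written with differentiable tensor operations, back-propagating through it yields marginals, as discussed in the next subsection.

```python
import torch

def inside_first_order(s_arc):
    """log Z(x) over projective trees; s_arc[h, m] is the score of arc h -> m, position 0 is the pseudo-root."""
    n = s_arc.size(0) - 1
    C_l = s_arc.new_full((n + 1, n + 1), float('-inf'))  # complete span, head at the right end j
    C_r = s_arc.new_full((n + 1, n + 1), float('-inf'))  # complete span, head at the left end i
    I_l = s_arc.new_full((n + 1, n + 1), float('-inf'))  # incomplete span with arc j -> i
    I_r = s_arc.new_full((n + 1, n + 1), float('-inf'))  # incomplete span with arc i -> j
    diag = torch.arange(n + 1)
    C_l[diag, diag] = 0.
    C_r[diag, diag] = 0.
    for w in range(1, n + 1):                 # span width
        for i in range(n + 1 - w):            # span [i, j]
            j = i + w
            # incomplete spans: combine C_r[i, r] and C_l[r+1, j] over split points i <= r < j
            inner = torch.logsumexp(C_r[i, i:j] + C_l[i + 1:j + 1, j], dim=0)
            I_l[i, j] = inner + s_arc[j, i]
            I_r[i, j] = inner + s_arc[i, j]
            # complete spans
            C_l[i, j] = torch.logsumexp(C_l[i, i:j] + I_l[i:j, j], dim=0)
            C_r[i, j] = torch.logsumexp(I_r[i, i + 1:j + 1] + C_r[i + 1:j + 1, j], dim=0)
    return C_r[0, n]
```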
Similarly, we also batchify the decoding algorithm. Due to space limitation, we omit the details.
It is noteworthy that the techniques described here are also applicable to other grammar formulations such as CKY-style constituency parsing (Finkel et al., 2008;Drozdov et al., 2019).

Outside via Back-propagation
Eisner (2016) presents a theoretical proof of the equivalence between the back-propagation mechanism and the outside algorithm in the case of constituency (phrase-structure) parsing. This work empirically verifies this equivalence for dependency parsing.
Moreover, we also find that marginal probabilities $p(i \rightarrow j \mid x)$ directly correspond to gradients after back-propagation with $\log Z(x)$ as the loss:

$\frac{\partial \log Z(x)}{\partial s(i, j)} = p(i \rightarrow j \mid x)$

which can be easily proved. For TreeCRF parsers, we perform MBR decoding (Smith and Smith, 2007) by replacing scores with marginal probabilities in the decoding algorithm, leading to a slight but consistent accuracy increase.
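With the differentiable inside sketch above, this identity can be checked directly: the gradient of log Z(x) with respect to each arc score is that arc's marginal probability.

```python
import torch

s_arc = torch.randn(7, 7, requires_grad=True)      # toy scores: 6 words plus the pseudo-root
log_z = inside_first_order(s_arc)                   # from the sketch above
marginals, = torch.autograd.grad(log_z, s_arc)      # marginals[i, j] = p(i -> j | x)
```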

Handling Partial Annotation
As an attractive research direction, studies show that it can be more effective to construct or even collect partially labeled data (Nivre et al., 2014; Hwa, 1999; Pereira and Schabes, 1992), where a sentence may correspond to a partial tree $y^p$ with $|y^p| < n$ in the case of dependency parsing. Partial annotation can be very powerful when combined with active learning, because the annotation cost can be greatly reduced if annotators only need to annotate sub-structures that are difficult for models. Li et al. (2016) present a detailed survey on this topic. Moreover, Peng et al. (2019) recently released a partially labeled multi-domain Chinese dependency treebank based on this idea. Then, the question is how to train models on partially labeled data. Li et al. (2016) propose to extend TreeCRF for this purpose and obtain promising results in the case of non-neural dependency parsing. This work applies their approach to the neural biaffine parser. We are particularly concerned with the influence of structural learning and high-order modeling on the utilization of partially labeled training data.
For the basic biaffine parser based on first-order local training, it seems the only choice is omitting losses of unannotated words. In contrast, tree constraints allow annotated dependencies to influence the probability distributions of unannotated words, and high-order modeling further helps by promoting inter-token interaction. Therefore, both structural learning and high-order modeling are intuitively very beneficial.
Under partial annotation, we follow Li et al. (2016) and define the training loss as

$L(x, y^p) = -\log \frac{Z(x, y^p)}{Z(x)}, \quad Z(x, y^p) = \sum_{y \in \mathcal{Y}(x);\, y \supseteq y^p} e^{s(x, y)}$

where $Z(x, y^p)$ only considers the legal trees that are compatible with the given partial tree, and can also be efficiently computed in the same way as $Z(x)$.
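A minimal first-order sketch of this constrained loss, reusing the inside function from the earlier sketch: for every annotated modifier we mask out all candidate heads except the gold one before re-running the inside algorithm (the convention that -1 marks an unannotated word is our own).

```python
def partial_crf_loss(s_arc, heads):
    # heads[m-1]: gold head of word m (1..n), or -1 if word m is unannotated
    n1 = s_arc.size(0)
    mask = s_arc.new_zeros((n1, n1))
    for m, h in enumerate(heads, start=1):
        if h >= 0:                       # annotated modifier: only its gold head remains legal
            mask[:, m] = float('-inf')
            mask[h, m] = 0.
    # log Z(x) - log Z(x, y^p)
    return inside_first_order(s_arc) - inside_first_order(s_arc + mask)
```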

Experiments
Data.
We conduct experiments and analysis on 27 datasets from 13 languages, including two widely used datasets: the English Penn Treebank (PTB) data with Stanford dependencies (Chen and Manning, 2014), and the Chinese data of the CoNLL09 shared task (Hajič et al., 2009). We also adopt the Chinese dataset released for the NLPCC19 cross-domain dependency parsing shared task (Peng et al., 2019), containing one source domain and three target domains. For simplicity, we directly merge the train/dev/test data of the four domains into larger ones, respectively. One characteristic of this data is that most sentences are partially annotated based on active learning.
Finally, we conduct experiments on Universal Dependencies (UD) v2.2 and v2.3 following Ji et al. (2019) and  respectively. We adopt the 300d multilingual pretrained word embeddings used in Zeman et al. (2018) and take the CharLSTM representations as input. For UD2.2, to compare with Ji et al. (2019), we follow the raw text setting of the CoNLL18 shared task (Zeman et al., 2018), and directly use their sentence segmentation and tokenization results. For UD2.3, we also report the results of using gold-standard POS tags to compare with .
Evaluation metrics. We use unlabeled and labeled attachment scores (UAS/LAS) as the main metrics. Punctuation is omitted for PTB. For the partially labeled NLPCC19 data, we adopt the official evaluation script, which simply omits the words without gold-standard heads to accommodate partial annotation. We adopt Dan Bikel's randomized parsing evaluation comparator for significance tests.

Parameter settings. We directly adopt most parameter settings of Dozat and Manning (2017), including the dropout and initialization strategies. For CharLSTM, the dimension of the input char embeddings is 50, and the dimension of the output vector is 100, following Lample et al. (2016). For the second-order model, we set the dimensions of $r^{h/s/m}_i$ to 100, and find little accuracy improvement when increasing them to 300. We train each model for at most 1,000 iterations, and stop training if the peak performance on the dev data does not increase in 100 consecutive epochs.
Models. LOC uses the local cross-entropy training loss and employs the Eisner algorithm to find the optimal projective tree. CRF and CRF2O denote the first-order and second-order TreeCRF models, respectively. LOC MST denotes the basic local model that directly produces non-projective trees with the MST decoding algorithm of Dozat and Manning (2017).

Efficiency Comparison
Figure 4 compares the parsing speed of different models on PTB-test. For a fair comparison, we run all models on the same machine with Intel Xeon CPU (E5-2650v4, 2.20GHz) and GeForce GTX 1080 Ti GPU. "CRF (CPU)" refers to the model that explicitly performs the inside-outside algorithm using Cython on CPUs. Multi-threading is employed since sentences are mutually independent. However, we find that using more than 4 threads does not further improve the speed.
We can see that the efficiency of TreeCRF is greatly improved by batchifying the inside algorithm and implicitly realizing the outside algorithm via back-propagation on GPUs. For the first-order CRF model, our implementation can parse about 500 sentences per second, over 10 times faster than the multi-thread "CRF (CPU)". For the second-order CRF2O, our parser achieves a speed of 400 sentences per second, which is able to meet the requirements of a real-time system. More discussions on efficiency are presented in Appendix A.

Main Results
On PTB, [...] the parsing accuracy further, probably because the performance is already very high. However, as shown by further analysis in Section 4.3, the positive effect is actually introduced by structural learning and high-order modeling. On CoNLL09, CRF significantly outperforms LOC, and CRF2O can further improve the performance.
On the partially annotated NLPCC19 data, CRF outperforms LOC by a very large margin, indicating the usefulness of structural learning in the scenario of partial annotation. CRF2O further improves the parsing performance by explicitly modeling second-order subtree features. These results confirm our intuitions discussed in Section 3.4. Please note that the parsing accuracy looks very low because the partially annotated tokens are usually difficult for models.

Analysis
Impact of MBR decoding. For CRF and CRF2O, we perform MBR decoding by default, which employs the Eisner algorithm over marginal probabilities (Smith and Smith, 2007) to find the best tree. Table 1 also reports the results of directly finding 1-best trees according to dependency scores. Except for PTB, probably due to the already high accuracy, MBR decoding brings small yet consistent improvements for both CRF and CRF2O.
Convergence behavior. Figure 5 compares the convergence curves. For clarity, we plot one data point corresponding to the peak LAS every 20 epochs. We can clearly see that both structural learning and high-order modeling consistently improve the model. CRF2O achieves steadily higher accuracy and converges much faster than the basic LOC.
Performance at sub- and full-tree levels. Beyond the dependency-wise accuracy (UAS/LAS), we would like to evaluate the models regarding performance at sub-tree and full-tree levels. Table 2 shows the results. We skip the partially labeled NLPCC19 data. UCM means unlabeled complete matching rate, i.e., the percentage of sentences whose whole skeletal trees are correct, while LCM further requires that all labels are also correct.
For SIB, we evaluate the models regarding unlabeled adjacent-sibling subtrees (system outputs vs. gold-standard references). By definition, $(i, k, j)$ is an adjacent-sibling subtree if and only if $w_k$ and $w_j$ are both children of $w_i$ on the same side, and there are no other children of $w_i$ between them. Given two trees, we can collect all adjacent-sibling subtrees and compose two sets of triples. Then we evaluate the P/R/F values. Please note that it is impossible to evaluate SIB for partially annotated references.
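For completeness, the SIB metric boils down to set-level P/R/F over the adjacent-sibling triples of the predicted and gold trees; the triples can be collected exactly as in the crf2o_loss sketch earlier. A tiny sketch:

```python
def sib_prf(pred_triples, gold_triples):
    # pred_triples, gold_triples: sets of unlabeled (h, k, j) adjacent-sibling triples
    tp = len(pred_triples & gold_triples)
    p = tp / len(pred_triples) if pred_triples else 0.
    r = tp / len(gold_triples) if gold_triples else 0.
    f = 2 * p * r / (p + r) if p + r else 0.
    return p, r, f
```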
We can clearly see that by modeling adjacent-sibling subtree scores, CRF2O achieves a larger improvement on SIB than both CRF and LOC, and this further contributes to the large improvement in full-tree matching rates (UCM/LCM).
Capability to learn from partial trees. To better understand why CRF2O performs very well on partially annotated NLPCC19, we design more comparative experiments by retaining either a proportion of random training sentences (full trees) or a proportion of random dependencies for each sentence (partial trees). Figure 6 shows the results.
We can see that the performance gap is quite steady when we gradually reduce the number of training sentences. In contrast, the gap clearly becomes larger when each training sentence has fewer annotated dependencies. This shows that CRF2O is superior to the basic LOC in utilizing partially annotated data for model training.

Results on Universal Dependencies
Table 3 compares different models on the UD datasets, which contain a lot of non-projective trees. We adopt the pseudo-projective approach (Nivre and Nilsson, 2005) to handle the ubiquitous non-projective trees of most languages. Basically, the idea is to transform non-projective trees into projective ones by using more complex labels, which enable post-processing recovery.

We can see that for the basic local parsers, the direct non-projective LOC MST and the pseudo-projective LOC achieve very similar performance.
More importantly, both CRF and CRF2O produce consistent improvements over the baseline on many languages. On both UD2.2 and UD2.3, our proposed CRF2O model achieves the highest accuracy on 10 out of 12 languages, and obtains significant improvements on more than 7 languages. Overall, the average improvement is 0.45 and 0.29 on UD2.2 and UD2.3 respectively, which is also significant at p < 0.005.

Related Works
Batchification has been widely used in linear-chain CRF, but is rather complicated for tree structures. Eisner (2016) presents a theoretical proof of the equivalence between the outside algorithm and back-propagation for constituent tree parsing, and also briefly discusses other formalisms such as dependency grammar. Unfortunately, we were unaware of Eisner's great work until we were surveying the literature for paper writing. As an empirical study, we believe this work is valuable and makes it practical to deploy TreeCRF models in real-life systems.
Falenska and Kuhn (2019) present a nice analytical work on dependency parsing, similar to Gaddy et al. (2018) on constituency parsing. By extending the first-order graph-based parser of Kiperwasser and Goldberg (2016) into second-order, they try to find out how much structural context is implicitly captured by the BiLSTM encoder. They concatenate three BiLSTM output vectors (for $i$, $k$, and $j$) for scoring adjacent-sibling subtrees, and adopt max-margin loss and the second-order Eisner decoding algorithm (McDonald and Pereira, 2006). Based on their negative results and analysis, they draw the conclusion that high-order modeling is redundant because BiLSTMs can implicitly and effectively encode enough structural context. They also present a nice survey on the relationship between RNNs and syntax. In this work, we use a much stronger basic parser and observe more significant UAS/LAS improvements than theirs. In particular, we present an in-depth analysis showing that explicit high-order modeling certainly helps the parsing model and thus is complementary to the BiLSTM encoder.

Ji et al. (2019) employ graph neural networks to implicitly incorporate high-order structural information into the biaffine parser. They add a three-layer graph attention network (GAT) component (Veličković et al., 2018) between the MLP and Biaffine layers. The first GAT layer takes $r^h_i$ and $r^m_i$ from the MLPs as inputs and produces new representations $r^{h_1}_i$ and $r^{m_1}_i$ by aggregating neighboring nodes. Similarly, the second GAT layer operates on $r^{h_1}_i$ and $r^{m_1}_i$, and produces $r^{h_2}_i$ and $r^{m_2}_i$. In this way, a node gradually collects multi-hop high-order information as global evidence for scoring single dependencies. They follow the original local head-selection training loss. In contrast, this work adopts a global TreeCRF loss and explicitly incorporates high-order scores into the biaffine parser.

Other researchers investigate the usefulness of structural training for the first-order biaffine parser, comparing the performance of local head-selection loss, global max-margin loss, and TreeCRF loss on multilingual datasets. They show that TreeCRF loss is overall slightly superior to max-margin loss, and that the LAS improvement from structural learning is modest but significant for some languages. They also show that structural learning (especially TreeCRF) substantially improves the sentence-level complete matching rate, which is consistent with our findings. Moreover, they explicitly compute the inside and outside algorithms on CPUs via Cython programming. In contrast, this work proposes an efficient second-order TreeCRF extension to the biaffine parser, and presents much more in-depth analysis to show the effect of both structural learning and high-order modeling.

Conclusions
This paper for the first time presents a second-order TreeCRF for neural dependency parsing, using a triaffine operation to explicitly score second-order subtrees. We propose to batchify the inside algorithm to accommodate GPUs. We also empirically verify that the complex outside algorithm can be implicitly performed via efficient back-propagation, which naturally produces gradients and marginal probabilities. We conduct experiments and detailed analysis on 27 datasets from 13 languages, and find that structural learning and high-order modeling can further enhance the state-of-the-art biaffine parser in various aspects: 1) better convergence behavior; 2) higher performance at sub- and full-tree levels; 3) better utilization of partially annotated data.