Revisiting Higher-Order Dependency Parsers

Neural encoders have allowed dependency parsers to shift from higher-order structured models to simpler first-order ones, making decoding faster and still achieving better accuracy than non-neural parsers. This has led to a belief that neural encoders can implicitly encode structural constraints, such as siblings and grandparents in a tree. We tested this hypothesis and found that neural parsers may benefit from higher-order features, even when employing a powerful pre-trained encoder, such as BERT. While the gains of higher-order features are small in the presence of a powerful encoder, they are consistent for long-range dependencies and long sentences. In particular, higher-order models are more accurate on full sentence parses and on the exact match of modifier lists, indicating that they deal better with larger, more complex structures.


Introduction
Before the advent of neural networks in NLP, dependency parsers relied on higher-order features to better model sentence structure (McDonald and Pereira, 2006; Carreras, 2007; Koo and Collins, 2010; Martins et al., 2013, inter alia). Common choices for such features were siblings (a head word and two modifiers) and grandparents (a head word, its own head and a modifier). Kiperwasser and Goldberg (2016) showed that even without higher-order features, a parser with an RNN encoder could achieve state-of-the-art results. This led folk wisdom to suggest that modeling higher-order features in a neural parser brings no additional advantage, and nearly all recent research on dependency parsing has been restricted to first-order models (Dozat and Manning, 2016; Smith et al., 2018a). Kulmizev et al. (2019) further reinforced this belief by comparing transition-based and graph-based decoders (though neither of them higher-order); Falenska and Kuhn (2019) suggested that higher-order features become redundant because parsing models encode them implicitly.
However, there is some evidence that neural parsers still benefit from structure modeling. Zhang et al. (2019) showed that a parser trained with a global structure loss function has higher accuracy than when trained with a local objective (i.e., learning the head of each word independently). Falenska and Kuhn (2019) examined the impact of consecutive sibling features in a neural dependency parser. While they found mostly negative results in a transition-based setting, a graph-based parser still showed significant gains on two out of 10 treebanks.
In this paper, we rigorously test the hypothesis that second-order features are useful. In particular, we experiment with consecutive sibling and grandparent features in a non-projective, graph-based dependency parser. We find that without a pretrained encoder, these features are only useful for large treebanks; when using BERT, however, they improve performance on most treebanks we tested, especially on longer sentences, long-distance dependencies, and full sentence parses. This challenges the hypothesis that encoders can single-handedly capture structure, in parsers and perhaps in structured models more generally.

Notation
We use x to refer to a sentence with tokens (x_1, x_2, ..., x_n), plus the ROOT pseudo-token, and y to refer to a valid tree composed of n arcs (h, m), where h is the head and m the modifier.
We overload the notation s_θ(·) to indicate the model score for a part or for a complete sentence, depending on its arguments.

Encoding
We encode a sentence x with a bidirectional LSTM, producing hidden states (h_0, h_1, ..., h_n), with h_0 corresponding to ROOT. Each token is represented by the concatenation of its pretrained word embedding, the output of a character-level left-to-right LSTM and, optionally, BERT embeddings.
Similar to Straka et al. (2019), when using BERT, we take the mean of its last four layers. When the BERT tokenizer splits a token into multiple wordpieces, we keep the first and ignore the rest, and we use the special token [CLS] to represent ROOT. The word embeddings we use are the ones provided in the CoNLL 2018 shared task.
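This layer pooling and first-wordpiece selection is simple to sketch. The snippet below fakes the hidden states with random arrays (the function name, shapes and example tokenization are ours, for illustration only):

```python
import numpy as np

def pool_bert_layers(hidden_states, first_subtoken_idx):
    """Average the last four BERT layers and keep one vector per word.

    hidden_states: list of [seq_len, dim] arrays, one per layer (as a
        real BERT would expose them; faked below).
    first_subtoken_idx: position of the first wordpiece of each word;
        position 0 is [CLS], which stands in for ROOT.
    """
    pooled = np.stack(hidden_states[-4:]).mean(axis=0)  # [seq_len, dim]
    return pooled[first_subtoken_idx]                   # [n_words + 1, dim]

# Toy example: 13 "layers" of 7 wordpieces with dimension 8.
layers = [np.random.randn(7, 8) for _ in range(13)]
# [CLS] re ##visit ##ing higher order [SEP] -> ROOT, revisiting, higher, order
idx = [0, 1, 4, 5]
vecs = pool_bert_layers(layers, idx)
```

Wordpieces 2-3 and the final [SEP] are simply dropped; only the first wordpiece of each word (and [CLS] for ROOT) survives the indexing.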

First-Order Model
We start with a first-order model, which is used as a pruner before running the second-order parser, as in Martins et al. (2013). It uses biaffine attention to compute arc and label scores (Dozat and Manning, 2016), and similarly to Qi et al. (2018), we also add distance and linearization terms. We want our pruner to be capable of estimating arc probabilities, and thus we train it with a marginal inference loss, maximizing the log probability of the correct parse tree y:

L(θ) = −s_θ(x, y) + log Σ_{y′ ∈ Y(x)} exp s_θ(x, y′),

where Y(x) is the set of valid trees for x. We can compute the partition function over all possible trees efficiently using the Matrix-Tree Theorem (Koo et al., 2007), which also gives us arc marginal probabilities. The sentence score s_θ(x, y) is computed as the sum of the scores of its parts.
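The Matrix-Tree computation is compact enough to sketch. The toy implementation below (ours; unlabeled arcs, multiple ROOT attachments allowed) computes the log-partition as a determinant and checks it against brute-force enumeration:

```python
import numpy as np
from itertools import product

def log_partition(scores):
    """Log-partition over all trees via the Matrix-Tree Theorem
    (Koo et al., 2007). scores[h, m] is the arc score for head h and
    modifier m; index 0 is ROOT."""
    n = scores.shape[0] - 1
    w = np.exp(scores)
    # Weighted Laplacian restricted to the non-ROOT nodes:
    # L[m, m] = total arc weight into m; L[h, m] = -w[h, m] otherwise.
    L = -w[1:, 1:].copy()
    np.fill_diagonal(L, 0.0)
    L[np.arange(n), np.arange(n)] = w[:, 1:].sum(axis=0) - np.diag(w[1:, 1:])
    sign, logdet = np.linalg.slogdet(L)
    return logdet

def brute_force(scores):
    """Same quantity by explicit enumeration, for checking."""
    n = scores.shape[0] - 1
    total = 0.0
    for heads in product(range(n + 1), repeat=n):
        if all(h != m and reaches_root(heads, m)
               for m, h in enumerate(heads, start=1)):
            total += np.exp(sum(scores[h, m] for m, h in enumerate(heads, 1)))
    return np.log(total)

def reaches_root(heads, m):
    seen = set()
    while m != 0:
        if m in seen:   # cycle not passing through ROOT
            return False
        seen.add(m)
        m = heads[m - 1]
    return True
```

For a three-token sentence there are only a few dozen head assignments, so the determinant and the enumeration agree exactly; for real sentences only the determinant is tractable.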
Additionally, we try first-order models trained with a hinge loss, as in Zhang et al. (2019) (also used with our second-order models; see §2.4), maximizing the margin between the correct parse tree y and any other tree ŷ:

L(θ) = max_{ŷ ∈ Y(x)} [s_θ(x, ŷ) + Δ(y, ŷ)] − s_θ(x, y),

where Δ(y, ŷ) is the Hamming cost between y and ŷ, i.e., the number of arcs in which they differ.
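The inner maximization amounts to decoding with cost-augmented scores: every non-gold arc gets +1 added before decoding. A minimal sketch (ours; a per-token argmax stands in for the actual tree decoder, purely for illustration):

```python
import numpy as np

def hinge_loss(scores, gold_heads):
    """Structured hinge loss with Hamming cost.

    scores: [n+1, n+1] arc scores, scores[h, m], index 0 = ROOT;
    gold_heads[m-1] is the gold head of token m.
    """
    n = len(gold_heads)
    augmented = scores.copy()
    # Cost augmentation: +1 on every arc, then -1 on gold arcs,
    # so each wrong arc carries its Hamming cost of 1.
    augmented[:, 1:] += 1.0
    for m, h in enumerate(gold_heads, start=1):
        augmented[h, m] -= 1.0
    np.fill_diagonal(augmented, -np.inf)   # a token cannot head itself
    # Simplified "decoding": independent argmax per token.
    pred_heads = augmented[:, 1:].argmax(axis=0)
    gold_score = sum(scores[h, m] for m, h in enumerate(gold_heads, 1))
    pred_score = sum(augmented[h, m] for m, h in enumerate(pred_heads, 1))
    return max(0.0, pred_score - gold_score)
```

When the gold arcs win by a margin of at least 1 the loss is zero; with uninformative (all-zero) scores the loss equals the number of tokens whose predicted head differs from gold.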

Second-Order Model
We train second-order models with a hinge loss. It is computed in the same way as in the first-order case, except that the sentence scores now include second-order parts. Notice that the Hamming cost still only considers differing arcs.
Consecutive siblings A consecutive sibling part is a tuple (h, m, s) such that h is the parent of both m and s, which are both to the left or to the right of h, and no other child of h exists between them. Additionally, we consider tuples (h, m, ∅) to indicate that m is the first child (if to the left of h) or the last child (if to the right).
Grandparents A grandparent part is a tuple (h, m, g) such that g is the parent of h and h is the parent of m. There are no grandparent parts such that h is ROOT.
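Both part types can be enumerated from a tree in a few lines. A minimal sketch (ours: None stands in for the empty symbol ∅, and we read "first/last child" as the outermost child on each side):

```python
def second_order_parts(heads):
    """Enumerate consecutive-sibling and grandparent parts of a tree.

    heads[m-1] is the head of token m (tokens are 1-based, 0 is ROOT).
    Returns (siblings, grandparents).
    """
    n = len(heads)
    children = {h: [] for h in range(n + 1)}
    for m, h in enumerate(heads, start=1):
        children[h].append(m)          # collected in left-to-right order
    siblings, grandparents = [], []
    for h in range(n + 1):
        left = [m for m in children[h] if m < h]
        right = [m for m in children[h] if m > h]
        if left:
            siblings.append((h, left[0], None))    # first (leftmost) child
        if right:
            siblings.append((h, right[-1], None))  # last (rightmost) child
        for side in (left, right):
            # consecutive pairs: no other child of h between m and s
            for m, s in zip(side, side[1:]):
                siblings.append((h, m, s))
    for m, h in enumerate(heads, start=1):
        if h != 0:                     # no grandparent parts when h is ROOT
            grandparents.append((h, m, heads[h - 1]))
    return siblings, grandparents
```

For heads = [2, 0, 2] (token 2 is the root's child, heading tokens 1 and 3), this yields the boundary sibling parts (2, 1, ∅) and (2, 3, ∅) plus the grandparent parts (2, 1, 0) and (2, 3, 0).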
Scoring The score for a higher-order part (h, m, r) of type ρ (in our case, either grandparent or consecutive sibling) is computed as:

s_θ(h, m, r) = λ^ρ_1 w_ρ · tanh(f^ρ_h(h_h) + f^ρ_r(h_r)) + λ^ρ_2 w_ρ · tanh(f^ρ_m(h_m) + f^ρ_r(h_r)) + λ^ρ_3 w_ρ · tanh(f^ρ_h(h_h) + f^ρ_m(h_m) + f^ρ_r(h_r)),

where λ^ρ_1, λ^ρ_2 and λ^ρ_3 are learnable scalars, w_ρ is a learnable vector, and f^ρ_h(·), f^ρ_m(·) and f^ρ_r(·) are learnable affine transforms applied to the encoder states. There is one set of these parameters for consecutive siblings and another for grandparents. The three factors that compose the score combine the second-order part r with h, with m, or with both. There is no factor combining h and m only, since that combination is already present in the first-order scoring. We also introduce a parameter vector h_∅ to account for ∅.
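The three-factor scorer can be sketched directly from the formula. The snippet below uses random parameters and toy dimensions; the tanh nonlinearity and all names are our assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, d2 = 8, 4  # encoder state size and second-order layer size (toy values)

# One parameter set per part type rho (siblings / grandparents):
W_h, W_m, W_r = (rng.normal(size=(d2, d)) for _ in range(3))
b_h, b_m, b_r = (rng.normal(size=d2) for _ in range(3))
w = rng.normal(size=d2)        # w_rho
lam = rng.normal(size=3)       # lambda_1..3
h_empty = rng.normal(size=d)   # learned vector standing in for the empty symbol

def part_score(v_h, v_m, v_r):
    """Score of a part (h, m, r): three factors pairing r with h,
    with m, and with both."""
    f_h = W_h @ v_h + b_h       # affine transforms f_h, f_m, f_r
    f_m = W_m @ v_m + b_m
    f_r = W_r @ v_r + b_r
    return (lam[0] * w @ np.tanh(f_h + f_r)
            + lam[1] * w @ np.tanh(f_m + f_r)
            + lam[2] * w @ np.tanh(f_h + f_m + f_r))

# A first-child part (h, m, <empty>) reuses h_empty in place of a sibling state.
score = part_score(rng.normal(size=d), rng.normal(size=d), h_empty)
```

Since each factor collapses to a scalar through w_ρ, the part score adds only a handful of parameters on top of the first-order model.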
Decoding The drawback of higher-order feature templates is that exact decoding is intractable in the non-projective case. Classically, researchers have resorted to approximate decoding, as well as to using a first-order parser to eliminate unlikely arcs and their respective higher-order parts. We employ both of these techniques: specifically, we use the dual decomposition algorithm AD3 (Martins et al., 2011, 2013) for decoding, which often arrives at the exact solution. We use head automata factors to handle sibling and grandparent structures (Koo et al., 2010), and the traditional Chu-Liu-Edmonds algorithm to handle the tree constraint factor (McDonald et al., 2005).

Additional Training Details
Multitask Learning Our models also predict UPOS, XPOS and morphology tags (UFeats), as training for these additional objectives increases parsing performance. They are implemented via softmax layers on top of the BiLSTM output, and have a cross-entropy loss. Parser and tagger share two BiLSTM layers, with an additional layer for each one (similar to Straka, 2018). We only consider UFeats singletons in the training data, i.e., we do not decompose them into individual features.
Perturb and MAP During training with a hinge loss, we add noise sampled from a standard Gumbel distribution to the arc scores, as in Papandreou and Yuille (2011). This effectively makes decoding behave as sampling from the tree space.
Data In all cases, we use gold tokenization. The treebanks we use represent varied language families, writing systems and typology, inspired by Smith et al. (2018b).
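The perturb-and-MAP noise injection is a one-liner; a minimal sketch (function name is ours):

```python
import numpy as np

def perturb_scores(arc_scores, rng):
    """Perturb-and-MAP (Papandreou and Yuille, 2011): adding i.i.d.
    standard Gumbel noise to the arc scores before decoding makes the
    MAP decoder behave as an (approximate) sampler over trees."""
    noise = rng.gumbel(loc=0.0, scale=1.0, size=arc_scores.shape)
    return arc_scores + noise

rng = np.random.default_rng(0)
perturbed = perturb_scores(np.zeros((4, 4)), rng)
```

Decoding the perturbed scores then yields a different tree on each draw, concentrated around the MAP tree.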
Hyperparameters All LSTM cells have 400 units in each direction, as do the arc and label biaffine projections. Second-order layers have 200 units, and character embeddings have 250. We apply dropout with p = 0.5 to all linear layers, and we use word dropout (replacing an encoded word vector with a trainable vector) with p = 0.33 in models without BERT and p = 0.2 in the ones with it. We use Adam with β1 = 0.9, β2 = 0.99, and a constant learning rate of 10^-3 for the first-order models without BERT and 5·10^-4 for all others. We use bert-chinese for Chinese and Japanese, and bert-base-multilingual-cased for the other languages, and do not fine-tune their weights. We run the AD3 decoder for up to 500 iterations with a step size of 0.05. We use batches of 1,000 tokens for first-order models and 800 for second-order ones, and train for up to 100k batches. We evaluate on the dev set every 200 batches and stop early after 50 evaluations without improvement.
Pruning Before training or evaluating a second-order parser, we run a first-order model trained with marginal inference to prune unlikely arcs and any second-order parts including them. When using BERT in the main parser, we also use a pruner trained with BERT. We keep up to 10 candidate heads for each token, and further prune arcs with posterior probability lower than a threshold t times the probability of the most likely head. Without BERT, t = 10^-6, and with it t = 10^-8, as we found BERT makes the pruner overconfident. The lowest pruner recall on the dev set was 98.91% (on Turkish); all other treebanks are above 99%. During training, we never prune out gold arcs.

Results
Table 1 shows the test set UAS and LAS for our models. Parsers with BERT and hinge loss achieve the best performance on most datasets; second-order models are generally better at UAS. An interesting case is Ancient Greek, which is not in BERT's pretraining data. First-order models with BERT perform worse than the ones without it in UAS and LAS, but the second-order model achieves the highest UAS.
Without BERT, second-order features are only beneficial in some medium-to-large treebanks. In the smallest ones, such as Turkish and Hungarian, they actually lead to a performance drop; when using BERT, however, they increase accuracy on these datasets. On the other hand, large treebanks such as Russian and Czech show improvements from second-order features even without BERT. This suggests that for second-order features to be beneficial, either large amounts of annotated training data are needed (which not all UD treebanks have) or a powerful encoder such as BERT.
Considering first-order models, Zhang et al. (2019) found no particular advantage of a hinge loss objective over a cross-entropy one or vice-versa. In our experiments, this is mostly the case for models trained on small-to-medium treebanks without BERT. When more training data or a pretrained encoder is available, the hinge loss objective tends to reach higher accuracy than the cross-entropy one.

Figures 1, 2 and 3 show LAS by sentence length, dependency length and depth in the tree (distance to root). While BERT reduces the gap between first- and second-order models, the latter are consistently more accurate on sentences longer than 10 tokens and on dependencies longer than four tokens. Varying the distance to root shows a somewhat irregular pattern (similar to what Kulmizev et al., 2019 found); the three BERT models are close to each other, but among the other three, the second-order parser is clearly the best for depths 2-9.

Table 2 shows complete sentence matches and head words with an exact match of their modifier set, over all treebanks. Second-order models are better on both metrics.

Table 3 shows development-set results for models that do not employ multitask learning (in our case, jointly learning UPOS, XPOS and morphological features) on a subset of the treebanks, alongside the results of the models that employ it on the same data. All models are first-order with a probabilistic loss function. MTL parsers performed better except for Arabic UAS, and even then the difference was small, which motivated us to use MTL in all our experiments.
Runtime Our first-order parsers without BERT process 2,000 tokens per second on average, and the second-order ones around 600 (averaged across all treebanks). For models with BERT, the figures are 1,600 and 460, respectively. This slowdown of roughly 3.5x for second-order models is even smaller than the ones reported by Martins et al. (2013).

Conclusion
We compared second-order dependency parsers to their more common first-order counterparts. While their overall performance gain is small, they are distinctly better on longer sentences and long-range dependencies. When considering the exact match of complete parse trees or of all modifiers of a word, second-order models exhibit a clear advantage over first-order ones. Our results indicate that even a powerful encoder such as BERT can still benefit from explicit output structure modeling; it would be interesting to explore this in other NLP tasks as well. Another interesting line of research would be to evaluate the contribution of higher-order features in a cross-lingual setting, leveraging structure learned from larger treebanks for under-resourced languages.