Multilingual Back-and-Forth Conversion between Content and Function Head for Easy Dependency Parsing

Universal Dependencies (UD) is becoming a standard annotation scheme cross-linguistically, but it is argued that this scheme centering on content words is harder to parse than the conventional one centering on function words. To improve the parsability of UD, we propose a back-and-forth conversion algorithm, in which we preprocess the training treebank to increase parsability, and reconvert the parser outputs to follow the UD scheme as a postprocess. We show that this technique consistently improves LAS across languages even with a state-of-the-art parser, in particular on core dependency arcs such as nominal modifier. We also provide an in-depth analysis to understand why our method increases parsability.


Introduction
As shown in Figure 1, there are several variations in the annotation of dependencies. A well-known example is the choice of head in a prepositional phrase (e.g., to a bar), which differs between the two trees. Though various annotation schemes have been proposed so far (Hajic et al., 2001; Johansson and Nugues, 2007; de Marneffe and Manning, 2008; McDonald et al., 2013), Universal Dependencies (UD) (de Marneffe et al., 2014) has recently gained much popularity and is becoming the annotation standard across languages. The upper tree in Figure 1 is annotated in UD.
Practically, however, UD may not be the optimal choice. In UD a content word consistently dominates a function word, but past work points out that this makes some parser decisions more difficult than in the conventional style centering on function words, e.g., the tree in the lower part of Figure 1 (Schwartz et al., 2012; Ivanova et al., 2013).
To overcome this issue, we show the effectiveness of a back-and-forth conversion approach, in which we train a model and parse sentences in an annotation format with higher parsability, and then reconvert the parser output into the UD scheme. Figure 1 shows an example of our conversion. We use the function head trees (the lower tree) as an intermediate representation.
This is not the first attempt to improve dependency parsing accuracy with tree conversions. A positive result is reported in Nilsson et al. (2006) using the Prague Dependency Treebank. For the conversion between content and function heads in UD, however, the effect is still inconclusive. Using English UD data, Silveira and Manning (2015) report a negative result, which they argue is due to error propagation in the backward conversion, in particular in copula constructions, which often incur drastic changes of the structure. Rosa (2015) reports an advantage of function heads in the adposition construction, but the data is HamleDT (Zeman et al., 2012) rather than UD, and the conversion target is conversely too restrictive.
Our main contribution is to show that the back-and-forth conversion can bring consistent accuracy improvements across languages in UD, by limiting the conversion targets to simpler ones around function words while still covering many linguistic phenomena. Another limitation of previous work is the parsers: MSTParser or MaltParser is often used, but they are no longer state-of-the-art. We complement this by showing the effectiveness of our approach even with a modern parser with rich features. We also provide an in-depth analysis to explore when and why our conversion brings higher parsability than the original UD.

Conversion method
Let us define notations first. For the $i$-th word $w_i$ in a sentence, $p_i$ denotes its POS tag, $h_i$ the head index, $l_i$ the dependency label, and $\mathrm{left}_i$ ($\mathrm{right}_i$) the list of indexes of the left (right) children of $w_i$. For instance, in the upper tree in Figure 1, $w_5 = \text{went}$, $p_5 = \text{VERB}$, $h_5 = 2$, $l_5 = \text{ccomp}$, and $\mathrm{left}_5 = [3, 4]$.
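As an illustration of this notation, the following minimal sketch derives the left and right child lists from a list of head indexes. The function name `children_lists` and the 8-word head list are hypothetical; only the values for word 5 (head 2, left children 3 and 4) come from the example above.

```python
def children_lists(heads):
    """Return (left, right): dicts mapping each word index to its child indexes.

    `heads` is a Python list of 1-based head indexes (0 marks the root word),
    so heads[i - 1] plays the role of h_i in the notation above.
    """
    n = len(heads)
    left = {i: [] for i in range(1, n + 1)}
    right = {i: [] for i in range(1, n + 1)}
    for i in range(1, n + 1):          # ascending order keeps each list sorted
        h = heads[i - 1]
        if h == 0:
            continue                   # the root word has no head
        (left if i < h else right)[h].append(i)
    return left, right


# Hypothetical 8-word tree; only the entries around word 5 follow the text.
heads = [2, 0, 5, 5, 2, 8, 8, 5]
left, right = children_lists(heads)
print(left[5], right[5])  # prints [3, 4] [8]
```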

Forward Conversion
The forward algorithm receives the original UD tree and converts it to a function head tree by modifying $h_i$. Figure 1 is an example, and Algorithm 1 is the pseudo-code; root(y) returns the root word index of tree y. The algorithm descends the tree recursively from the root and applies modifications to the deepest nodes first. A modification, such as changing the mark arc from went to that in Figure 1, occurs when the algorithm detects a word $w_i$ (that, in this case) for which the pair $(p_i, l_i)$ exists in the set of conversion targets, which is listed in Table 1 and denoted by $T$ in Algorithm 1. Let $w_j$ be the head of the detected word $w_i$. Then, we reattach the arcs so that $w_i$'s head becomes $w_j$'s former head and $w_j$'s new head becomes $w_i$. Note that we modify heads ($h_i$) only and keep labels ($l_i$). We skip the children of the root word (line 13); otherwise, an arc with the root label would appear at an intermediate node. We operate only on the outermost child when multiple candidates are found (line 11).
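The forward conversion can be sketched in Python as follows. This is a minimal sketch rather than the authors' implementation: the function name, the parallel-list input format, the example sentence, and the two-element target set are all assumptions (Table 1 is not reproduced here), and the child lists are snapshotted from the original tree before any head is changed, which is one possible reading of Algorithm 1.

```python
def forward_convert(heads, pos, labels, targets):
    """Return new head indexes after a function-head conversion (labels are kept).

    heads: 1-based head index per word, 0 for the root word.
    targets: set of (POS, label) pairs, standing in for the set T.
    """
    n = len(heads)
    h = [0] + list(heads)                      # pad so h[i] is the head of word i
    # Assumption: child lists are taken from the ORIGINAL tree and never updated.
    left = [[] for _ in range(n + 1)]
    right = [[] for _ in range(n + 1)]
    for i in range(1, n + 1):
        if heads[i - 1] != 0:
            (left if i < heads[i - 1] else right)[heads[i - 1]].append(i)

    def search(children):
        for i in children:                     # the first hit is the outermost child
            if (pos[i - 1], labels[i - 1]) in targets:
                return i
        return None

    def change_dep(i, j):
        if i is None or labels[j - 1] == "root":
            return                             # nothing found, or j is the root word
        h[i], h[j] = h[j], i                   # w_i takes w_j's head; w_j hangs under w_i

    def conv(j):
        for i in left[j]:
            conv(i)
        change_dep(search(left[j]), j)
        for i in right[j]:
            conv(i)
        change_dep(search(reversed(right[j])), j)

    conv(heads.index(0) + 1)                   # start from the root word
    return h[1:]


# Hypothetical example data (not Table 1 and not necessarily the Figure 1 sentence).
TARGETS = {("ADP", "case"), ("SCONJ", "mark")}
pos = ["PRON", "VERB", "SCONJ", "PRON", "VERB", "ADP", "DET", "NOUN"]
labels = ["nsubj", "root", "mark", "nsubj", "ccomp", "case", "det", "nmod"]
heads = [2, 0, 5, 5, 2, 8, 8, 5]
print(forward_convert(heads, pos, labels, TARGETS))  # prints [2, 0, 2, 5, 3, 5, 8, 6]
```

In this toy tree the mark word (3) takes over the former head of word 5, and the case word (6) takes over the former head of word 8, while all labels stay unchanged.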
Algorithm 1 Forward conversion
Input: a dependency tree y and the set of targets T.
Output: modified y after applying CONV(root(y)).
 1: procedure CONV(j)
 2:   for i in left_j do
 3:     CONV(i)
 4:   CHANGEDEP(SEARCH(left_j), j)
 5:   for i in right_j do
 6:     CONV(i)
 7:   CHANGEDEP(SEARCH(reverse(right_j)), j)
 8: procedure SEARCH(children)
 9:   for i in children do
10:     if (p_i, l_i) ∈ T then      ▷ T is the set of targets.
11:       return i                  ▷ The first found candidate is outermost. We only change this.
12: procedure CHANGEDEP(i, j)
13:   if l_j ≠ root then            ▷ We skip the root.
14:     h_i ← h_j; h_j ← i

Backward Conversion
In contrast, the backward algorithm receives a function head tree and reconverts it to a UD-style tree. Algorithm 2 is the pseudo-code. There are two main differences between the forward and backward algorithms. The first is the relative position of a target node (one of those in Table 1) among the operated nodes: in the forward algorithm they are the target node, its parent (head), and its grandparent, while in the backward algorithm they are the target node, its head, and its children. The second is how we reattach the nodes in the CHANGEDEP operation, in particular when the target node has multiple children. While the forward algorithm modifies only two arcs at once, the backward algorithm may modify more than two arcs to account for possible parse errors at prediction time. Specifically, when we find a target node having multiple children, we change the head of all these children to the head of the target (excluding those with the mwe label). We choose the innermost child as the new head of the target word (line 17 of Algorithm 2).

Table 1 was developed to cover the main constructions in English and Japanese while keeping the backward conversion accuracy high. We do not argue that this list is perfect, and seeking a better one is important future work. Note also that we use this list across all languages.
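The backward step described above can be sketched as follows. Algorithm 2 is not reproduced in this text, so several choices here are assumptions rather than the authors' method: targets are processed deepest-first, "innermost" is read as the child closest to the target in word order, targets attached to the artificial root are skipped, and the target set is the same hypothetical subset as in the toy example for the forward step.

```python
def backward_convert(heads, pos, labels, targets):
    """Map a function-head tree back to content-head (UD-style) head indexes.

    heads: 1-based head index per word, 0 for the root word.
    targets: set of (POS, label) pairs, standing in for the set T.
    """
    n = len(heads)
    h = [0] + list(heads)                      # pad so h[i] is the head of word i

    def depth(i):                              # distance from the artificial root
        d = 0
        while h[i] != 0:
            i, d = h[i], d + 1
        return d

    # Assumption: visit target nodes deepest-first; depths are taken on the
    # input tree because sorted() evaluates all keys before the loop runs.
    order = sorted(
        (i for i in range(1, n + 1) if (pos[i - 1], labels[i - 1]) in targets),
        key=depth,
        reverse=True,
    )
    for t in order:
        if h[t] == 0:
            continue                           # assumption: leave a target at the root alone
        kids = [c for c in range(1, n + 1)
                if h[c] == t and labels[c - 1] != "mwe"]
        if not kids:
            continue                           # nothing was converted under this word
        g = h[t]
        for c in kids:
            h[c] = g                           # all children move up to the target's head
        # Assumption: "innermost" means closest to the target in word order.
        h[t] = min(kids, key=lambda c: abs(c - t))
    return h[1:]


# Hypothetical example: the converted toy tree and an assumed target set.
TARGETS = {("ADP", "case"), ("SCONJ", "mark")}
pos = ["PRON", "VERB", "SCONJ", "PRON", "VERB", "ADP", "DET", "NOUN"]
labels = ["nsubj", "root", "mark", "nsubj", "ccomp", "case", "det", "nmod"]
converted = [2, 0, 2, 5, 3, 5, 8, 6]
print(backward_convert(converted, pos, labels, TARGETS))  # prints [2, 0, 5, 5, 2, 8, 8, 5]
```

When a target word has several children because of a parse error, every child is reattached to the target's former head, so a misattached core arc can also change which word the functional arc ends up pointing from.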
erage. We argue this is not a serious restriction since UD already contains a moderate amount of non-projective arcs and the parser should be able to handle them. In practice, this complication does not lead to performance degradation; when we employ non-projective parsers, the scores increase despite the increased non-projectivity.

Experimental Setting
For each treebank and parser, we train two different models: one with the original trees (UD) and another with the converted trees (CONV). By reconverting CONV's output into the UD scheme with the backward algorithm, we can evaluate the outputs of both models against the same UD test set. For parsers, we use two non-projective parsers: second-order MSTParser (MST) (McDonald et al., 2005) and RBGParser (RBG) (Lei et al., 2014) with the default settings, which utilizes third-order features and is much stronger.
We choose 19 languages from UD ver. 1.3, considering the sizes and typological variations. The ratio of converted tokens is 6.3% on average (2.3%-15.6%). Failed backward conversions are rare: at most 0.01% (0.002% on average) in the training data. We use gold POS tags, and exclude punctuation from evaluation.
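For concreteness, the attachment scores used in the evaluation can be computed along the following lines. This is a minimal sketch with hypothetical names; it assumes that punctuation tokens are identified by the gold label `punct`, which the text does not specify.

```python
def attachment_scores(gold, pred, skip_labels=("punct",)):
    """Return (UAS, LAS) in percent.

    gold, pred: lists of (head, label) pairs, one per token, in the same order.
    Tokens whose GOLD label is in skip_labels are ignored (an assumption about
    how punctuation is excluded).
    """
    total = uas = las = 0
    for (gold_head, gold_label), (pred_head, pred_label) in zip(gold, pred):
        if gold_label in skip_labels:
            continue
        total += 1
        if gold_head == pred_head:
            uas += 1                       # head is correct
            if gold_label == pred_label:
                las += 1                   # head and label are both correct
    if total == 0:
        return 0.0, 0.0
    return 100.0 * uas / total, 100.0 * las / total
```

Because the CONV model's output is converted back to UD before scoring, the same function applies to both models.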

Result
Attachment scores
Table 2 shows the main results. The improvements in the labeled attachment score (LAS) are remarkable: for MST, the scores increase by more than 1.0 point in many languages (11 out of 19), and for RBG, though the changes are smaller, improvements of more than 0.5 points are still observed in 10 languages. The differences in the unlabeled attachment score (UAS) are modest, implying that our conversion contributes in particular to finding correct arc labels rather than head words themselves. On the other hand, the LAS of Hindi decreases with RBG. One possible explanation is that the score with the original UD is already sufficiently high (91.74) and our conversion may impede parsability in such cases.
These overall improvements were not observed in past work (Silveira and Manning, 2015). One reason for our success seems to be that we restrict our conversion to simpler constructions and operations. We do not modify copula and auxiliary constructions, which involve more complex changes that amplify error propagation in the backward conversion. Our conversion also suffers from such propagation (see below) but to a lesser extent, suggesting that it may achieve a good balance between parsability and simplicity.
As the overall trends of the two parsers are similar, we mainly focus on RBG in the analysis below.
What kinds of errors are reduced by our conversion? To inspect this, we compare the F1-scores of each arc label. Table 3 summarizes the results for the frequent labels. Interestingly, the improvements are observed for the more semantically crucial, core relations such as dobj (+0.81), nmod (+2.34), and nsubj (+2.01). This is not surprising, as these relations are involved in most of our conversions. In the original tree in Figure 1, the nmod arc directly connects two content words (went and bar), while in the converted tree they are connected via the function word to. The result suggests that this latter structure is more parsable than the original one, possibly because directly connecting content words is harder due to sparsity. We further investigate this hypothesis quantitatively later.
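A per-label F1-score of the kind compared here can be computed as in the sketch below. The exact definition used for Table 3 is not given in the text, so this assumes the common one: an arc counts as correct for a label only when both its head and its label match the gold arc. The function name is hypothetical.

```python
from collections import Counter


def label_f1(gold, pred):
    """Return {label: F1} for lists of (head, label) pairs, one pair per token."""
    gold_n, pred_n, correct = Counter(), Counter(), Counter()
    for (gold_head, gold_label), (pred_head, pred_label) in zip(gold, pred):
        gold_n[gold_label] += 1
        pred_n[pred_label] += 1
        if gold_head == pred_head and gold_label == pred_label:
            correct[gold_label] += 1
    scores = {}
    for label in set(gold_n) | set(pred_n):
        precision = correct[label] / pred_n[label] if pred_n[label] else 0.0
        recall = correct[label] / gold_n[label] if gold_n[label] else 0.0
        denom = precision + recall
        scores[label] = 2 * precision * recall / denom if denom else 0.0
    return scores
```

Under this definition, an arc with the right label but the wrong head lowers both precision and recall for that label.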
Table 3: F1-scores (UD and CONV) and the average ratio in the test set (Ratio) of the frequent labels.

The F1-scores degrade for some functional labels, such as mark (-2.74) and case (-0.85). Inspecting the outputs, we find that this essentially arises in our backward conversion, which induces errors on these arcs even when they are correctly attached in the (CONV) parser output, if another core-label arc following them, such as nmod, is attached wrongly. Figure 2 describes the situation.
In the initial parser output (above), the case arc to in is correct, although the parser misattaches groups as a child of in (the correct head is provides). The backward conversion then induces a wrong case arc from groups to in, which hurts both precision and recall. In summary, predicting correct functional arcs (e.g., case) is equally easy for both representations, but our method needs a correct analysis of both the functional and the core arcs to recover the true functional arcs. Although this additional complexity seems to be a deficiency, the overall scores (LAS) increase, which suggests that in the majority of cases both arcs are predicted successfully thanks to our conversion. In other words, though our method slightly lowers the scores of functional arcs, it saves many more arcs of core relations, which are generally harder.
CNC
To further verify the intuition above, we now introduce another metric called the CNC score, which was recently proposed in Nivre (2016) for UD evaluation and calculates LAS excluding functional arcs (see footnote 7). The last column in Table 2 shows the results, where the improvements are clearer than for LAS: +0.9 points on average. The results confirm the above observation that our method helps find core grammatical arcs at a slight sacrifice of functional arcs.
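A score of this kind can be sketched as LAS restricted to non-functional arcs. The relation list below follows the one given in this paper's footnote on functional arcs; filtering by the gold label and additionally skipping `punct` are assumptions, and the function name is hypothetical.

```python
# Functional relations as listed in the paper's footnote.
FUNCTIONAL = {"aux", "auxpass", "case", "cc", "cop", "det", "mark", "neg"}


def cnc_score(gold, pred, functional=FUNCTIONAL, skip_labels=("punct",)):
    """LAS in percent over tokens whose GOLD label is neither functional nor skipped.

    gold, pred: lists of (head, label) pairs, one per token, in the same order.
    """
    total = correct = 0
    for (gold_head, gold_label), (pred_head, pred_label) in zip(gold, pred):
        if gold_label in functional or gold_label in skip_labels:
            continue
        total += 1
        if gold_head == pred_head and gold_label == pred_label:
            correct += 1
    return 100.0 * correct / total if total else 0.0
```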
Head word vocabulary entropy
Finally, we provide an analysis to answer the question of why our method improves the scores of core dependency arcs. As mentioned above, this may be related to the easing of sparseness achieved by placing function words between two content words. We verify this intuition quantitatively in terms of the entropy reduction of the head word vocabulary. Schwartz et al. (2012) hypothesize such a correlation between entropy and parsability, although no quantitative verification has been carried out yet.
For each dependency $h \xrightarrow{l} w$ from $h$ to $w$ with label $l$ in the training data, we extract a pair $((p, l, w), h)$, where $p$ is the POS tag of $w$. We then discard the pairs for which the tuple $(p, l, w)$ appears fewer than five times, and calculate the entropy of the head word, $H_l(h)$, from the conditional probability $P(h \mid p, l, w)$. We perform this for both the original UD and the converted data, and calculate the difference for each label, $H^{\mathrm{orig}}_l(h) - H^{\mathrm{conv}}_l(h)$.
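One way to compute such a per-label head entropy is sketched below. The text does not state how the entropies of individual (p, l, w) contexts are combined into a single value per label, nor the logarithm base, so this sketch assumes a count-weighted average over contexts (a conditional entropy) measured in bits. The function name and input format are hypothetical.

```python
import math
from collections import Counter, defaultdict


def head_entropy_by_label(arcs, min_count=5):
    """Return {label: entropy of the head word in bits}.

    arcs: iterable of (head_word, label, word, pos) tuples from a treebank.
    Contexts (pos, label, word) seen fewer than min_count times are discarded.
    """
    heads_by_context = defaultdict(Counter)    # (p, l, w) -> counts of head words
    for head, label, word, pos in arcs:
        heads_by_context[(pos, label, word)][head] += 1

    weighted = defaultdict(float)              # label -> sum of count * H(h | context)
    kept = Counter()                           # label -> number of arcs kept
    for (pos, label, word), counts in heads_by_context.items():
        n = sum(counts.values())
        if n < min_count:
            continue
        entropy = -sum(c / n * math.log2(c / n) for c in counts.values())
        weighted[label] += n * entropy
        kept[label] += n
    # Assumption: average the context entropies weighted by context frequency.
    return {label: weighted[label] / kept[label] for label in kept}
```

Running the function once on the original arcs and once on the converted arcs, then subtracting the two values per label, gives the difference described above under these assumptions.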
See Figure 3 (top), where many nmod arcs appear in the upper left, meaning that the reduction of entropy contributes to larger improvements cross-linguistically. Other points in this area include dobj in Japanese and Persian, both of which employ case constructions for expressing objects.
We also explore the correlation between LAS and the averaged reduction of entropy per token in each language. Figure 3 (bottom) shows a negative correlation, which means that the overall reduction of entropy by the conversion is related to the overall improvement. In particular, for MST we find a strong negative correlation (r = −.75; p < .01). RBG, on the other hand, has a weaker, non-significant negative correlation (r = −.35; p = .14) when excluding Hindi, which seems to be an outlier. These correlations imply that the variation of entropy can serve as a metric for assessing an annotation framework or a conversion method.
7 Arcs with the following relations: aux, auxpass, case, cc, cop, det, mark, and neg.
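The correlation coefficients discussed in this section can be reproduced from per-language values with a function such as the one below, assuming r denotes Pearson's correlation coefficient (the text does not name the statistic). The function name is hypothetical, and no per-language numbers are given here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers.

    For example, xs could hold the per-language entropy change and ys the
    per-language change in LAS.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```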

Conclusion and Future Work
We have shown that our back-and-forth conversion around function words reduces the entropy of the head word vocabulary, leading to improvements in parsability and labeled attachment scores. This is the first empirical result on UD showing the parser's preference for the function head scheme across languages. The method is modular, and can be combined with any parsing system as pre- and post-processing steps.
Recently there has been great success with transition-based neural dependency parsers, which we have not tested mainly because most such systems currently available, such as SyntaxNet (Andor et al., 2016) and LSTMParser (Dyer et al., 2015), do not support non-projective parsing. Neural parsers are advantageous in that the bilexical sparsity problem, the main challenge in UD parsing for ordinary feature-based systems, might be alleviated thanks to word embeddings. It is thus interesting and important future work to develop a neural dependency parser designed for non-projective parsing and to see whether our conversion is still effective for such a stronger system.