Should Have, Would Have, Could Have. Investigating Verb Group Representations for Parsing with Universal Dependencies.

Treebanks have recently been released for a number of languages with the harmonized annotation created by the Universal Dependencies project. The representation of certain constructions in UD is known to be suboptimal for parsing and may be worth transforming for the purpose of parsing. In this paper, we focus on the representation of verb groups. Several studies have shown that parsing works better when auxiliaries are the head of auxiliary dependency relations, which is not the case in UD. We therefore transformed verb groups in UD treebanks, parsed the test set and transformed it back, and contrary to expectations, observed significant decreases in accuracy. We provide suggestive evidence that improvements in previous studies were obtained because the transformation helps disambiguate the POS tags of main verbs and auxiliaries. The question of why parsing accuracy decreases with this approach in the case of UD is left open.


Introduction
Universal Dependencies (henceforth UD) (Nivre, 2015) is a recent project that attempts to harmonize syntactic annotation in dependency treebanks across languages. This is done through the development of annotation guidelines. Some guidelines have been hypothesized to be suboptimal for parsing. In the literature, certain representations of certain constructions have been shown to be better than their alternatives for parsing, for example in Schwartz et al. (2012). The UD guidelines, however, have been written with the intent to maximize crosslinguistic parallelism, and this constraint has sometimes forced the guideline developers to choose representations that are known to be worse for parsing (de Marneffe et al., 2014). For that reason, de Marneffe et al. (2014) suggest that those representations could be modified for the purpose of parsing, thus creating a parsing representation. Transforming tree representations for the purpose of parsing is not a new idea. It has been done for constituency parsing, for example by Collins (1999), but also for dependency parsing, for example by Nilsson et al. (2007). Nilsson et al. (2007) modified the representation of several constructions in several languages and obtained a consistent improvement in parsing accuracy. In this paper, we investigate the case of the verb group construction and attempt to reproduce the study by Nilsson et al. (2007) on UD treebanks to find out whether or not the alternative representation is useful for parsing with UD.

Background

Tree Transformations for Parsing
Previous work has shown that modifying coordination constructions and verb groups from their representation in the Prague Dependency Treebank (henceforth PDT) to a representation described in Mel'čuk (1988) (Mel'čuk style, henceforth MS) improves dependency parsing for Czech. The procedure is as follows:
1. Transform the training data to the new representation.
2. Train a model on that transformed data.
3. Parse the test data.
4. Transform the parsed data back to the original representation (for comparison with the original gold standard).
Nilsson et al. (2007) have shown that these same modifications, as well as the modification of non-projective structures, help parsing in four languages. Schwartz et al. (2012) conducted a study over the alternative representations of 6 constructions across 5 parsing models for English and found that some of them are easier to parse than others. Their results were consistent across parsing models.
The motivations behind those two types of studies are different. The tree transformation studies start from a representation that is more semantically oriented and potentially useful for NLP applications, the PDT style, which they therefore want their output to have, and change it to a representation that is more syntactically oriented, the MS style, because it is easier to parse. By contrast, Schwartz et al. (2012) have no a priori preference for any of the alternative representations of the constructions they study. Instead, they study the effect of the different representations on parsing for the purpose of choosing one representation over the other. Their methodology is therefore different: they evaluate the different representations on their respective gold standards. They argue that accuracy within a representation is a good indicator of the learnability of that representation, and that learnability is a good criterion for selecting a syntactic representation among alternatives. In any case, these studies seem to show that such transformations can affect parsing for various languages and for various parsing models.
Silveira and Manning (2015) were the first to obtain negative results from such transformations. They attempted to modify certain constructions in a UD treebank to improve parsing for English but failed to show any improvement. Some transformations even decreased parsing accuracy.
They observe that transforming the parsed data back to the original representation can amplify parser errors. A transformation can be prompted by the presence of only one dependency relation but involve changes to many surrounding dependency relations. The verb group transformation is such an example and will be described in Section 3. If a wrong dependency relation prompts a transformation in the parsed data, the surrounding relations, which might have been correct, become wrong. A wrong parse can then become worse. They take this as a partial explanation for the results that are inconsistent with the literature. However, the same problem may have arisen in the earlier studies and may have dampened the effects that those studies observed. It therefore seems that this explanation is not enough to account for those results.
This raises the question of whether this phenomenon actually happened in the study by Nilsson et al. (2007). It would be interesting to know if the effects they observed were affected by this kind of error amplification. Much remains to be done to understand the impact of different representations on parsing with UD, as well as on dependency parsing more generally. We propose to take one step in that direction in this paper.

Error Analysis for Dependency Parsing
McDonald and  conducted an extensive error analysis on two parsers in order to compare them. They compare the effect of sentence length on the two models, the effect of the structure of the graph (i.e. how close to the root individual arcs are) on the two models as well as the accuracy of the models on different POS tags and on different dependency relations. These comparisons allow them to provide insights into the strengths and weaknesses of each model. Conducting such an error analysis that compares baseline models with their transformed version could provide some further insights into the effects obtained with tree transformations. Attempting such a detailed error analysis is beyond the scope of this project but some steps will be taken in that direction and are described in Section 4.

Verb Groups
In the PDT, main verbs are the head of auxiliary dependencies, as in Figure 1. Nilsson et al. (2007) show that making the auxiliary the head of the dependency, as in Figure 2, is useful for parsing Czech and Slovenian. Schwartz et al. (2012) also report that, in English, verb groups are easier to parse when the auxiliary is the head (as in MS) than when the verb is the head (as in PDT). Since UD adopts the PDT style representation of verb groups, it would be interesting to find out whether or not transforming them to MS could also improve parsing. This is what will be attempted in this study. Algorithms for such a transformation and for its back transformation have been described in previous work. However, the back transformation algorithm assumes that the auxiliary appears to the left of the verb, which is not always the case. In addition, it is unclear how cases with two auxiliaries in a verb group are handled. For these reasons, we will use a slightly modified version of this algorithm that we describe in Section 3.

General Approach
We will follow the methodology from Nilsson et al. (2007), that is, transform, parse and then detransform the data so as to compare the original and the transformed model on the original gold standard. The method from Schwartz et al. (2012), which consists in comparing the baseline and the transformed data on their respective gold standards, is less relevant here because UD is believed to be a useful representation and the aim is to improve parsing within that representation. However, as was argued in that study, their method can give an indication of the learnability of a construction and can potentially be used to understand the results obtained by the parse-transform-detransform method. For this reason, this method will also be attempted. In addition, the original parsed data will also be transformed into the MS representation for comparison with the MS parsed data on the MS gold standard. Comparing the two can potentially help find out if the error amplifications described in the background section are strongly influencing the results. If the transformed model is penalized by error amplifications on the original gold standard, it is expected that the original model will be penalized in the same way on the transformed gold standard.

Transformation Algorithm
The transformation algorithm is illustrated by Figures 3, 4 and 5, which represent the transformation of a sentence with a verb group given in Example (1). Figure 3 is the original UD representation of this example, Figure 4 is an intermediate representation and Figure 5 is the final MS representation.
(1) I could easily have done this
The transformation first looks for verb groups in a dependency graph. Those verb groups are collected in the set $V$. A verb group $V_i$ has a main verb $V_i^{mv}$ (done in the example) and a set of auxiliaries $V_i^{aux}$ with at least one element (could and have in the example). Verb groups are collected by traversing the sentence from left to right, looking at auxiliary dependency relations. An auxiliary dependency relation $w_{aux} \xleftarrow{aux} w_{mv}$ is a relation where the main verb is the head and the auxiliary is the dependent. Only auxiliary dependency relations between two verbal forms are considered. When such a dependency relation is found, if there is a $V_i$ in $V$ that has the head of the dependency relation ($w_{mv}$) as main verb $V_i^{mv}$, $w_{aux}$ is added to that $V_i$'s set of auxiliaries $V_i^{aux}$. Otherwise, a new $V_i$ is created and added to $V$. After that, for each $V_i$ in $V$, if there is only one auxiliary in $V_i^{aux}$, the direction of the dependency relation between that auxiliary and the main verb $V_i^{mv}$ is inverted and the head of $V_i^{mv}$ becomes the head of the auxiliary. When there are several auxiliaries (as in Example (1)), the algorithm attaches $V_i^{mv}$ to the closest one and the head of $V_i^{mv}$ becomes the head of the outermost one. Any auxiliary in between is attached in a chain from the outermost to the one that is closest to the verb. In the example, the main verb done gets attached to the closest auxiliary have, and the head of the main verb done, which was the root, becomes the head of the outermost auxiliary, could.
Next, dependents of the main verb are dealt with to make sure projectivity is maintained. As can be seen from Figure 4, the previous changes can introduce non-projectivity in an otherwise projective tree, which is undesirable. Dependents to the left of the leftmost verb of the whole verb group (i.e. including the auxiliaries and the main verb) get attached to the leftmost verb. In the example, I gets attached to could. Dependents to the right of the rightmost verb of the verb group get attached to the rightmost verb. In the example, this remains attached to the main verb done. Any remaining dependent gets attached to the auxiliary that is closest to the verb. In the example, easily gets attached to have.
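The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the released implementation: the function names `to_ms` and `is_projective`, the list-based tree encoding (1-based head indices, 0 for the root), the use of the UD tags VERB and AUX to identify verbal forms, and the choice to keep the label aux on the inverted arcs are all assumptions made for this example.

```python
VERBAL = {"VERB", "AUX"}  # assumption: UD tags treated as "verbal forms"

def to_ms(heads, deprels, upos):
    """Return (heads, deprels) with auxiliaries made heads of verb groups.

    heads[i] is the 1-based head of token i+1 (0 = root).
    """
    n = len(heads)
    heads, deprels = list(heads), list(deprels)
    # 1. Collect verb groups: main verb index -> list of auxiliary indices.
    groups = {}
    for d in range(1, n + 1):
        h = heads[d - 1]
        if (deprels[d - 1] == "aux" and h > 0
                and upos[d - 1] in VERBAL and upos[h - 1] in VERBAL):
            groups.setdefault(h, []).append(d)
    for mv, auxes in groups.items():
        # Auxiliaries ordered from the outermost to the closest to the main verb.
        chain = sorted(auxes, key=lambda a: abs(a - mv), reverse=True)
        outer, closest = chain[0], chain[-1]
        left, right = min(chain + [mv]), max(chain + [mv])
        # 2. Re-attach the other dependents of the main verb.
        for d in range(1, n + 1):
            if heads[d - 1] == mv and d not in auxes:
                if d < left:
                    heads[d - 1] = left
                elif d > right:
                    heads[d - 1] = right
                else:
                    heads[d - 1] = closest
        # 3. The outermost auxiliary takes over the main verb's head and label.
        heads[outer - 1], deprels[outer - 1] = heads[mv - 1], deprels[mv - 1]
        # 4. Chain the remaining auxiliaries and the main verb below it.
        for prev, cur in zip(chain, chain[1:] + [mv]):
            heads[cur - 1], deprels[cur - 1] = prev, "aux"
    return heads, deprels

def is_projective(heads):
    """True if no two arcs cross (the arc from the artificial root is included)."""
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    return not any(a < c < b < e for a, b in arcs for c, e in arcs)

# Example (1): I could easily have done this
heads = [5, 5, 5, 5, 0, 5]
deprels = ["nsubj", "aux", "advmod", "aux", "root", "dobj"]
upos = ["PRON", "AUX", "ADV", "AUX", "VERB", "PRON"]
ms_heads, ms_deprels = to_ms(heads, deprels, upos)
```

On Example (1), the sketch attaches I to could, easily to have and done to have, and makes could the root, which matches the description of the final MS representation. Inverting the auxiliary arcs without re-attaching the other dependents would give the heads `[5, 0, 5, 2, 4, 5]`, which `is_projective` reports as non-projective.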

Back Transformation Algorithm
The back transformation algorithm works similarly to the transformation algorithm. A set of verb groups $V$ is first collected by traversing the sentence from left to right, looking at auxiliary dependency relations. An auxiliary dependency relation $w_d \xleftarrow{aux} w_h$ between a dependent $w_d$ and a head $w_h$ in MS can be between an auxiliary and the main verb or between two auxiliaries. When one such relation is found, if its head $w_h$ is not already in a $V_i^{aux}$ in $V$, a new verb group $V_i$ is created and $w_h$ is added to $V_i^{aux}$. What the algorithm does next depends on the direction of that dependency relation. If it is right-headed, the dependent $w_d$ of that dependency relation is the main verb and the algorithm recurses the chain of auxiliary dependency relations through heads: it looks at the head $w_h$ of dependency relations and adds them to $V_i^{aux}$ until it finds a head that is not itself the dependent of an auxiliary dependency relation. If it is left-headed, the algorithm recurses the chain of auxiliary dependency relations through the dependents. It looks at dependents of dependency relations until it finds the main verb $V_i^{mv}$, i.e. a $w_d^i$ that is not the head of an auxiliary dependency relation, each time adding the head of the relation $w_h^i$ to $V_i^{aux}$. After that, for each $V_i$ in $V$, the head of the auxiliary that is furthest from the main verb becomes the head of the main verb. The main verb becomes the head of all auxiliaries and their dependents.
In the previous example, Figure 5 can be transformed back to Figure 3 in this way: done is identified as the main verb of the verb group and could as its furthest auxiliary. The head of could therefore becomes the head of done and the two auxiliaries of the sentence as well as their dependents get attached to the main verb.
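The back transformation can be sketched in the same spirit. The version below is a simplified variant rather than the algorithm as described: instead of recursing differently depending on the direction of the relation, it starts from each outermost auxiliary and follows the chain of aux-labelled verbal dependents down to the main verb. The function name `to_ud`, the tree encoding (1-based head indices, 0 for the root), the VERB/AUX test and the assumption that inverted arcs carry the label aux are all choices made for this illustration.

```python
VERBAL = {"VERB", "AUX"}  # assumption: UD tags treated as "verbal forms"

def to_ud(heads, deprels, upos):
    """Return (heads, deprels) with main verbs restored as heads of verb groups.

    heads[i] is the 1-based head of token i+1 (0 = root).
    """
    n = len(heads)
    heads, deprels = list(heads), list(deprels)

    def is_aux_dep(d):
        # d is the dependent of an auxiliary relation between two verbal forms
        h = heads[d - 1]
        return (deprels[d - 1] == "aux" and h > 0
                and upos[d - 1] in VERBAL and upos[h - 1] in VERBAL)

    # Map each head of an auxiliary relation to its (first) aux-labelled dependent.
    aux_child = {}
    for d in range(1, n + 1):
        if is_aux_dep(d):
            aux_child.setdefault(heads[d - 1], d)
    # Outermost auxiliaries: heads of an aux relation that are not aux dependents.
    tops = [h for h in aux_child if not is_aux_dep(h)]
    for top in tops:
        chain = [top]
        while chain[-1] in aux_child:
            chain.append(aux_child[chain[-1]])
        mv, auxes = chain[-1], chain[:-1]
        # The head (and label) of the furthest auxiliary go to the main verb.
        heads[mv - 1], deprels[mv - 1] = heads[top - 1], deprels[top - 1]
        # Dependents of the auxiliaries are re-attached to the main verb.
        for d in range(1, n + 1):
            if heads[d - 1] in auxes and d != mv:
                heads[d - 1] = mv
        # The main verb becomes the head of all auxiliaries.
        for a in auxes:
            heads[a - 1], deprels[a - 1] = mv, "aux"
    return heads, deprels

# MS representation of Example (1): I could easily have done this
ms_heads = [2, 0, 4, 2, 4, 5]
ms_deprels = ["nsubj", "root", "advmod", "aux", "aux", "dobj"]
upos = ["PRON", "AUX", "ADV", "AUX", "VERB", "PRON"]
ud_heads, ud_deprels = to_ud(ms_heads, ms_deprels, upos)
```

On the example, every token except the root done ends up attached to done, as in the original UD tree. Because all dependents of the auxiliaries are moved to the main verb, a tree in which an auxiliary originally had dependents of its own would not be restored exactly by this sketch.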

Data
We ran all experiments on UD 1.2 (Nivre et al., 2015). Treebanks that had 0.1% or less of auxiliary dependency relations were discarded. Japanese was also discarded because the Japanese treebank is not open source. Dutch was discarded because the back transformation accuracy was low (90%). This is due to inconsistencies in the annotation: verb groups are annotated as a chain of dependency relations. This leaves us with a total of 25 out of the 37 treebanks. For comparability with the study in Nilsson et al. (2007), and because we used a slightly modified version of their algorithm, we also tested the approach on the versions of the Czech and Slovenian treebanks that they worked on, respectively version 1.0 of the PDT (Hajič et al., 2001) and the 2006 version of SDT (Džeroski et al., 2006). Table 1 gives an overview of the data used for the experiments.

Software
For comparability with previous studies, we used MaltParser with default settings, training on the training set and parsing on the development set for all the languages that we investigated. For enhanced comparability of the results, we used the UD POS tags instead of the language-specific POS tags. MaltEval (Nilsson and Nivre, 2008) was used for evaluation. The transformation code has been released as part of the Python package oDETTE version 1.0 (DEpendency Treebank Transformation and Evaluation). The package can be used to run the whole pipeline, from transformation to evaluation. It can work on several treebanks in parallel, which enables quick experiments (we trained and parsed the data for the 25 treebanks in 9 minutes on an 8-core machine).

Effect of VG Transformation on Parsing
As mentioned before, we converted the training data in all treebanks involved, trained a parser on that transformed training set, parsed the test data and transformed the parsed data back to the original representation. The parsing accuracy of that back-transformed parsed data can then be compared with that of the parsed data obtained from the baseline, the unmodified model. Results are given in Table 2.
As mentioned in Section 2.1, results on the original representation are the ones that we care about because it is the UD representation that we are interested in and because those results are directly comparable with each other. However, as was also said, results on the transformed gold standard can give an indication of the learnability of a construction. For this reason, they are reported in Table 3. Table 3 also reports results of the parsing model trained on UD representations where the parsed data have been transformed to the MS representation. As was said in Section 3.1, this is to find out if error amplifications have a strong influence on the results: if error amplifications were the main source of added errors from the baseline on UD to the back-transformed UD, it would be expected that the original parsed test set transformed into MS would perform worse on the MS gold standard than the test set parsed by the model trained on MS. As can be seen from Table 3, however, this is not the case: the original model generally beats the transformed model even on the transformed gold standard.
As can also be seen from the table, the scores are overall higher for the UD parsing model on the UD gold standard than for the transformed parsing model on the transformed gold standard. This potentially indicates that the verb group transformation makes the UD representation harder to learn and might help give a partial explanation of why it decreases parsing accuracy on the original gold standard. This is not entirely surprising since, as can be seen from the figures illustrating the transformation above, the original representation is flatter than the transformed representation. Further work is needed to explore this in more depth. In any case, the original model beats the transformed model on several metrics and it seems safe to conclude that the verb group transformation hurts UD parsing, at least with MaltParser.
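The comparisons in this section rely on attachment scores computed against a gold standard. As a reminder of what is being compared, the following is a minimal sketch of unlabeled and labeled attachment scores (UAS and LAS). The function name `attachment_scores` and the input encoding are assumptions made for this example, and the sketch does not model any of the evaluation options of MaltEval.

```python
def attachment_scores(gold, pred):
    """Return (UAS, LAS) for two aligned lists of (head, deprel) pairs.

    UAS counts tokens with the correct head; LAS additionally
    requires the correct dependency label.
    """
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    total = len(gold)
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / total
    las = sum(g == p for g, p in zip(gold, pred)) / total
    return uas, las

# Toy example: all heads are right, one label is wrong.
gold = [(2, "nsubj"), (0, "root"), (2, "dobj")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod")]
uas, las = attachment_scores(gold, pred)
```

In the setup described above, the same function would be applied twice on the original gold standard, once to the baseline output and once to the back-transformed output, and again on the MS gold standard for the comparison reported in Table 3.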

Comparing Dependency Relations
Turning to the error analysis, one thing that is striking when looking at the performance of different dependency relations is that punctuation performs consistently worse in the transformed version of the parsed data than in the baseline, as can be seen in Figure 6. Because punctuation is most often attached to the main verb, it can be hypothesized that identifying the main verb of the sentence is crucial for avoiding this kind of error and that the transformation hurts the identification of the main verb in the case of UD. A close examination of about a third of the errors containing an auxiliary dependency relation in English further reinforced that hypothesis.

Comparison with SDT and PDT
What is noticeable in the results we have seen so far is that the accuracy decreased for languages for which accuracy has been shown to increase in the past: Czech, Slovenian and English. This indicates that the UD style is making a difference. For that reason, we now compare the effect of the approach on SDT and on UD Slovenian, as well as its effect on PDT and UD Czech. As shown in Table 4, improvements similar to those of the original study were obtained on SDT and PDT. As was just mentioned, it can be hypothesized that identifying the main verb is crucial for avoiding the kind of errors that were observed in the UD transformed version. It can then be hypothesized that the transformation helps to identify the main verb in PDT and SDT whereas it makes it harder in UD. When observing some examples in SDT, the transformation seems to help disambiguate POS tags. More than 90% of auxiliaries in SDT have the tag Verb-copula, but more than 20% of the main verbs involved in auxiliary dependency relations have that same POS tag. POS tags therefore do not give enough information to distinguish between the main verb and an auxiliary.
The experiment we are now turning to suggests that this is a reasonable hypothesis. We tested the approach on three different versions of PDT and SDT that differ in how ambiguous the POS tags of main verbs and auxiliaries are: fully ambiguous, partially ambiguous (the original tags) and disambiguated. As can be seen from Table 5, the results in SDT support the hypothesis: when verbs are made fully ambiguous, the transformation improves the results more than when they are partially ambiguous. When they are disambiguated, the approach does not work and the accuracy even decreases. The picture is slightly less clear with PDT, where disambiguating the POS tags makes the approach ineffective but making them ambiguous does not make the approach more useful. Ambiguating the tags seems to affect PDT less than it affects SDT, however, which might indicate that PDT suffers from ambiguity even more than SDT in the original treebank. This might be due to the fact that the POS tags used in the PDT experiments are automatically predicted whereas the tags used for SDT are gold tags. This idea is further explored in Section 4.4.
We tested the same approach on the UD treebanks for Czech and Slovenian to see if they can also be affected by ambiguity in some way. In the case of UD, $\tau_d$ is the same as $\tau_o$ since the tags are already disambiguated. As can be seen from the top part of Table 6, the opposite effect is found: the transformation hurts accuracy more when the tags are ambiguous than when they are not. However, because of the similarity between copulas and auxiliaries in UD, representing them differently might make it confusing for the parser. It would be interesting to try the approach and change the representation of copulas as well as auxiliaries. We tested something simpler: we ran the same experiment on the treebanks without copulas, i.e. we removed all sentences that have a copula dependency relation from both the training and the test sets. As can be seen from the bottom of Table 6, doing so gives the expected results: the transformation affects accuracy less when the tags are ambiguous than when they are not. The transformation still does not help parsing accuracy, however.
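Removing the sentences that contain a copula relation is a simple filtering step. The following minimal sketch shows one way it could be done, assuming that the treebank is stored in CoNLL-U format (sentences separated by blank lines, ten tab-separated columns, DEPREL in the eighth column) and that the copula relation is labelled cop. The function name `drop_copula_sentences` is hypothetical and the sketch is not taken from the released package.

```python
def drop_copula_sentences(conllu_text):
    """Return the CoNLL-U text without the sentences that contain a cop relation."""
    kept = []
    for block in conllu_text.strip().split("\n\n"):
        # Token lines only: comment lines start with '#'.
        rows = [line.split("\t") for line in block.splitlines()
                if line and not line.startswith("#")]
        # DEPREL is the eighth column; language-specific subtypes
        # such as "cop:x" are matched on their universal part.
        has_cop = any(len(row) > 7 and row[7].split(":")[0] == "cop"
                      for row in rows)
        if not has_cop:
            kept.append(block)
    return "\n\n".join(kept) + "\n\n" if kept else ""
```

The same function would be applied to both the training and the test files so that the two sets remain comparable.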

Predicted vs gold POS tags
An issue that has been ignored so far is that in the PDT, the parser used predicted POS tags for parsing the test sets, whereas in UD (and in SDT), we have been using gold POS tags. It was said in the previous section that the experiment about ambiguity on the PDT seems to indicate that tags are of poorer quality in the original experiment. It is possible that this is due to the fact that they are predicted rather than gold tags. It would be interesting to find out if the transformation approach works on UD parsing using predicted tags. This is slightly difficult to test as taggers do not yet exist for all UD treebanks. There is one for Swedish, however, which is why we tested this hypothesis on UD Swedish. As can be seen from Table 7, using predicted POS tags does have an impact on the effect of the transformation: the transformation hurts parsing accuracy less than it does on data with gold POS tags. The transformation still does not help parsing accuracy, however.
Overall then, the results suggest that there is something about the UD representation that makes this transformation infelicitous. It seems then that in the case of UD, it is better to keep the main verbs as heads of auxiliary dependency relations. There are other factors that may play a role in the results. For example, as appears from Table 1, the original SDT has a much higher percentage of auxiliary dependencies. This could be caused by the domain of the treebank.

POS tag      Orig    Transf    ∆
gold         76.8    75.7**    -1.1
predicted    76.4    75.6**    -0.8

Table 7: LAS on the original and transformed UD Swedish treebank with predicted and gold POS tags. ∆ = Transf - Orig.

Conclusion and Future Work
In this paper, we have attempted to reproduce a study by Nilsson et al. (2007), which showed that making auxiliaries heads in verb groups improves parsing, but we failed to show that those results port to parsing with Universal Dependencies. Contrary to expectations, the study has given evidence that main verbs should stay heads of auxiliary dependency relations for parsing with UD. The benefits of error analyses for such a study have been highlighted because they allow us to shed more light on the different ways in which the transformations affect the parsing output. Experiments suggest that gains from verb group transformations in previous studies were obtained mainly because those transformations help disambiguate between main verbs and auxiliaries. It is, however, still an open question why the VG transformation hurts parsing accuracy in the case of UD. It seems that the transformation makes the construction harder to learn, which might be because it makes it less flat. Future work could carry out a more detailed error analysis than was possible in this study. Repeating those experiments with other tree transformations that have been shown to be successful in the past, such as making prepositions the head of prepositional phrases, as well as looking at other parsing models, would provide more insight into the relationship between tree transformations and parsing.