Experiments with Generative Models for Dependency Tree Linearization

We present experiments with generative models for linearization of unordered labeled syntactic dependency trees (Belz et al., 2011; Rajkumar and White, 2014). Our linearization models are derived from generative models for dependency structure (Eisner, 1996). We present a series of generative dependency models designed to capture successively more information about ordering constraints among sister dependents. We give a dynamic programming algorithm for computing the conditional probability of word orders given tree structures under these models. The models are tested on corpora of 11 languages using test-set likelihood, and human ratings for generated forms are collected for English. Our models benefit from representing local order constraints among sisters and from backing off to less sparse distributions, including distributions not conditioned on the head.


Introduction
We explore generative models for producing linearizations of unordered labeled syntactic dependency trees. This specific task has attracted attention in recent years (Filippova and Strube, 2009; He et al., 2009; Belz et al., 2011; Bohnet et al., 2012; Zhang, 2013) because it forms a useful part of a natural language generation pipeline, especially in machine translation (Chang and Toutanova, 2007) and summarization (Barzilay and McKeown, 2005). Closely related tasks are generation of sentences given CCG parses (White and Rajkumar, 2012), bags of words (Liu et al., 2015), and semantic graphs (Braune et al., 2014).
Here we focus narrowly on testing probabilistic generative models for dependency tree linearization. In contrast, the approach in most previous work is to apply a variety of scoring functions to trees and linearizations and search for an optimally-scoring tree among some set. The probabilistic linearization models we investigate are derived from generative models for dependency trees (Eisner, 1996), as most commonly used in unsupervised grammar induction (Klein and Manning, 2004; Gelling et al., 2012). Generative dependency models have typically been evaluated in a parsing task (Eisner, 1997). Here, we are interested in the inverse task: inferring a distribution over linear orders given unordered dependency trees. This is the first work to consider generative dependency models from the perspective of word ordering. The results can potentially shed light on how ordering constraints are best represented in such models. In addition, the use of probabilistic models means that we can easily define well-motivated normalized probability distributions over orders of dependency trees. These distributions are useful for answering scientific questions about crosslinguistic word order in quantitative linguistics, where obtaining robust estimates has proven challenging due to data sparsity (Futrell et al., 2015).
The remainder of the work is organized as follows. In Section 2 we present a set of generative linearization models. In Section 3 we compare the performance of the different models as measured by test-set probability and human acceptability ratings. We also compare our performance with other systems from the literature. Section 4 concludes.

Generative Models for Projective Dependency Tree Linearization
We investigate head-outward projective generative dependency models. In these models, an ordered dependency tree is generated by the following kind of procedure. Given a head node, we use some generative process G to generate a depth-1 subtree rooted in that head node. Then we apply the procedure recursively to each of the dependent nodes. By applying the procedure starting at a ROOT node, we generate a dependency tree. For example, to generate the dependency tree in Figure 1, we first generate the depth-1 subtree of the root, then the depth-1 subtree of each of its dependents, and so on. In this work, we experiment with different specific generative processes G which generate a local subtree conditioned on a head.
Figure 1: Example dependency tree with two possible linearizations: (1) This story comes from the AP. (2) From the AP comes this story. Order 2 is the original order in the corpus, but order 1 is much more likely under our models.
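The recursive procedure can be sketched in a few lines of Python. This is an illustrative sketch only: the function `generate`, the callable `g` standing in for G, and the toy table of subtrees are all hypothetical names and data, and the attachment choices in the table are an assumption made for the example rather than the tree from the figure.

```python
from typing import Callable, Dict, List, Tuple

# A depth-1 subtree: (left dependents, right dependents), each listed head-outward.
Subtree = Tuple[List[str], List[str]]

def generate(head: str, g: Callable[[str], Subtree]) -> List[str]:
    """Expand `head` recursively and return the resulting word order."""
    left, right = g(head)
    words: List[str] = []
    # Left dependents are generated from the head outwards,
    # so the last one generated ends up leftmost.
    for dep in reversed(left):
        words.extend(generate(dep, g))
    words.append(head)
    for dep in right:
        words.extend(generate(dep, g))
    return words

# Hypothetical deterministic stand-in for G (a real G would sample).
TOY_SUBTREES: Dict[str, Subtree] = {
    "comes": (["story"], ["AP"]),
    "story": (["this"], []),
    "AP": (["the", "from"], []),
}

def toy_g(head: str) -> Subtree:
    return TOY_SUBTREES.get(head, ([], []))

print(" ".join(generate("comes", toy_g)))  # this story comes from the AP
```

Because each call only ever places a head between its own left and right dependents, every tree produced this way is projective.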

Model Types
Here we describe some possible generative processes G which generate subtrees conditioned on a head. These models contain progressively more information about ordering relations among sister dependents.
A common starting point for G is Eisner Model C (Eisner, 1996). In this model, dependents on one side of the head are generated by repeatedly sampling from a categorical distribution until a special stop-symbol is generated. The model only captures the propensity of dependents to appear on the left or right of the head, and does not capture any order constraints between sister dependents on one side of the head.
We consider a generalization of Eisner Model C which we call Dependent N-gram models. In a Dependent N-gram model, we generate dependents on each side of the head by sampling a sequence of dependents from an N-gram model. Each dependent is generated conditional on the N − 1 previously generated dependents from the head outwards. We have two separate N-gram sequence distributions for left and right dependents. Eisner Model C can be seen as a Dependent N-gram model with N = 1.
We also consider a model which can capture many more ordering relations among sister dependents: given a head h, sample a subtree whose head is h from a categorical distribution over subtrees. We call this the Observed Orders model because in practice we are simply sampling one of the observed orders from the training data. Of the models considered here, this generative process has the capacity to capture the most ordering relations between sister dependents.
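To make the head-outward N-gram factorization concrete, here is a minimal Python sketch of how the probability of one configuration of dependents could be computed. The names `sequence_prob`, `configuration_prob`, the `START` and `STOP` symbols, and the convention of padding the initial history with `START` are assumptions made for illustration; conditional distributions are passed in as plain callables `cond_prob(symbol, history)`.

```python
from typing import Callable, Sequence, Tuple

START, STOP = "<start>", "<stop>"
CondProb = Callable[[str, Tuple[str, ...]], float]

def sequence_prob(deps: Sequence[str], n: int, cond_prob: CondProb) -> float:
    """Probability of generating `deps` head-outward, then STOP, under an order-n model."""
    prob = 1.0
    history: Tuple[str, ...] = (START,) * (n - 1)  # empty tuple when n == 1
    for sym in list(deps) + [STOP]:
        prob *= cond_prob(sym, history)
        if n > 1:
            # Keep only the n - 1 most recently generated dependents.
            history = (history + (sym,))[1:]
    return prob

def configuration_prob(left: Sequence[str], right: Sequence[str], n: int,
                       p_left: CondProb, p_right: CondProb) -> float:
    """Probability of a configuration (left, right); each side has its own model."""
    return sequence_prob(left, n, p_left) * sequence_prob(right, n, p_right)
```

With `n = 1` the history is always empty, so each dependent is drawn independently of its sisters, which is the Eisner Model C case described above.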

Distributions over Permutations of Dependents
We have discussed generative models for ordered dependency trees. Here we discuss how to use them to make generative models for word orders conditional on unordered dependency trees.
Suppose we have a generative process G for dependency trees which takes a head h and generates a sequence of dependents $w_l$ to the left of h and a sequence of dependents $w_r$ to the right of h. Let $w$ denote the pair $(w_l, w_r)$, which we call the configuration of dependents. To get the probability of some $w$ given an unordered subtree u, we want to calculate the probability of $w$ given that G has generated the particular multiset $W$ of dependents corresponding to u. To do this, we calculate:
$$p(w \mid W) = \frac{p(w)}{Z}, \qquad Z = \sum_{w' \in \mathcal{W}} p(w'),$$
where $\mathcal{W}$ is the set of all possible configurations $(w_l, w_r)$ compatible with multiset $W$. That is, $\mathcal{W}$ is the set of pairs of permutations of multisets $W_l$ and $W_r$ for all possible partitions of $W$ into $W_l$ and $W_r$. The generative dependency model gives us the probability $p(w)$. It remains to calculate the normalizing constant $Z$, the sum of probabilities of possible configurations. For the Observed Orders model, $Z$ is the sum of probabilities of subtrees with the same dependents as subtree u. For the Dependent N-gram models of order N, we calculate $Z$ using a dynamic programming algorithm, presented in Algorithm 1 as memoized recursive functions. When N = 1 (Eisner Model C), $Z$ is more simply:
$$Z = p_L(\text{stop})\, p_R(\text{stop}) \sum_{(W_l, W_r) \in \text{PARTS}(W)} |W_l|!\, |W_r|! \prod_{w \in W_l} p_L(w) \prod_{w \in W_r} p_R(w),$$
where PARTS($W$) is the set of all partitions of multiset $W$ into two multisets $W_l$ and $W_r$, $p_L$ is the probability mass function for a dependent to the left of the head, $p_R$ is the function for a dependent to the right, and stop is a special symbol in the support of $p_L$ and $p_R$ which indicates that generation of dependents should halt. The probability mass functions may be conditional on the head h. These methods for calculating $Z$ make it possible to transform a generative dependency model into a model of dependency tree ordering conditional on local subtree structure.
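As a worked illustration of the N = 1 case, the following Python sketch computes Z by summing over left/right partitions and then normalizes the probability of one configuration. It is a sketch under stated assumptions: dependents are treated as distinct tokens (so a side with k dependents contributes k! orderings, even if some labels repeat), the distributions are plain dictionaries, and the names `z_unigram` and `order_prob` as well as the toy probabilities are hypothetical.

```python
from itertools import combinations
from math import factorial, prod

STOP = "<stop>"

def z_unigram(deps, p_left, p_right):
    """Z for N = 1: sum over all left/right partitions of the dependents."""
    idx = range(len(deps))
    total = 0.0
    for k in range(len(deps) + 1):
        for left in combinations(idx, k):
            left_set = set(left)
            mass = prod(p_left[deps[i]] for i in left)
            mass *= prod(p_right[deps[i]] for i in idx if i not in left_set)
            # Every ordering within a side has the same probability when N = 1.
            total += factorial(k) * factorial(len(deps) - k) * mass
    return p_left[STOP] * p_right[STOP] * total

def order_prob(left, right, p_left, p_right):
    """p(w | W) = p(w) / Z for the configuration w = (left, right)."""
    p_w = p_left[STOP] * p_right[STOP]
    p_w *= prod(p_left[d] for d in left) * prod(p_right[d] for d in right)
    return p_w / z_unigram(list(left) + list(right), p_left, p_right)

# Toy distributions (made-up numbers, for illustration only).
P_LEFT = {"det": 0.5, "nsubj": 0.3, STOP: 0.2}
P_RIGHT = {"det": 0.1, "nsubj": 0.2, STOP: 0.7}
print(order_prob(["det"], ["nsubj"], P_LEFT, P_RIGHT))
```

Summing `order_prob` over all six configurations of the two toy dependents gives 1, which is the point of dividing by Z.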
Algorithm 1: Compute the sum of probabilities of all configurations of dependents W under a Dependent N-gram model with two component N-gram models of order N: p_R for sequences to the right of the head and p_L for sequences to the left.
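One way to organize such a computation with memoized recursive functions is sketched below in Python. It is an illustrative reconstruction under stated assumptions, not the listing of Algorithm 1: dependents are treated as distinct tokens, histories are padded with a hypothetical `START` symbol, the conditional distributions are callables `p(symbol, history)`, and the name `make_z` is invented for this example. The sketch still enumerates left/right partitions explicitly; the memoization only avoids re-summing over the orderings of sub-multisets that share the same history.

```python
from functools import lru_cache
from itertools import combinations

START, STOP = "<start>", "<stop>"

def make_z(n, p_left, p_right):
    """Return a function z(deps) giving the sum of probabilities of all configurations."""

    def push(history, sym):
        # Keep the n - 1 most recent dependents (no history at all when n == 1).
        return (history + (sym,))[1:] if n > 1 else ()

    def side_sum(p):
        @lru_cache(maxsize=None)
        def total(remaining, history):
            # Sum over all orderings of `remaining` (a sorted tuple of labels).
            if not remaining:
                return p(STOP, history)
            acc = 0.0
            for i, sym in enumerate(remaining):
                rest = remaining[:i] + remaining[i + 1:]
                acc += p(sym, history) * total(rest, push(history, sym))
            return acc
        return total

    left_sum, right_sum = side_sum(p_left), side_sum(p_right)

    def z(deps):
        deps = tuple(sorted(deps))
        idx = range(len(deps))
        init = (START,) * (n - 1)
        result = 0.0
        for k in range(len(deps) + 1):
            for left in combinations(idx, k):
                chosen = set(left)
                w_l = tuple(deps[i] for i in left)
                w_r = tuple(deps[i] for i in idx if i not in chosen)
                result += left_sum(w_l, init) * right_sum(w_r, init)
        return result

    return z
```

A call such as `make_z(2, p_left, p_right)(["det", "amod", "amod"])` returns the normalizer for a head with those three dependents; dividing the probability of an observed configuration by this value gives its conditional probability.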

Labelling
The previous section discussed the question of the structure of the generative process for dependency trees. Here we discuss an orthogonal modeling question, which we call labelling: what information about the labels on dependency tree nodes and edges should be included in our models. Dependency tree nodes are labeled with wordforms, lemmas, and part-of-speech (POS) tags; dependency tree edges are labeled with relation types. A model might generate orders of dependents conditioned on all of these labels, or a subset of them. For example, a generative dependency model might generate (relation type, dependent POS tag) tuples conditioned on the POS tag of the head of the phrase. When we use such a model for dependency linearization, we would say the model's labelling is relation type, dependent POS, and head POS. In this study, we do not include wordforms or lemmas in the labelling, to avoid data sparsity issues.

Model Estimation and Smoothing
In order to alleviate data sparsity in fitting our models, we adopt two smoothing methods from the language modelling literature.
All categorical distributions are estimated using add-k smoothing where k = 0.01. For the Dependent N-gram models, this means adding k pseudocounts for each possible dependent in each context. For the Observed Orders model, this means adding k pseudocounts for each possible permutation of the head and its dependents.
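Add-k smoothing of a single categorical distribution can be sketched as follows. This is a minimal illustration, assuming the support (the set of possible outcomes in a given context) is known in advance; the function name `add_k_estimate` and the toy labels are hypothetical.

```python
from collections import Counter

def add_k_estimate(observations, support, k=0.01):
    """Add-k estimate of a categorical distribution over `support`."""
    counts = Counter(observations)
    # Each outcome in the support receives k pseudocounts.
    denom = sum(counts[x] for x in support) + k * len(support)
    return {x: (counts[x] + k) / denom for x in support}

probs = add_k_estimate(["nsubj", "nsubj", "det"], ["nsubj", "det", "amod"])
print(probs["amod"])  # small but nonzero probability for an unseen outcome
```

With a small k such as 0.01, observed outcomes keep almost all of the probability mass while unseen outcomes are never assigned probability zero.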
We also experiment with combining our models into mixture distributions. This can be viewed as a kind of back-off smoothing (Katz, 1987), where the Observed Orders model is the model with the most context, and Dependent N-grams and Eisner Model C are backoff distributions with successively less context. Similarly, models with less information in the labelling can serve as backoff distributions for models with more information in the labelling. For example, a model which is conditioned on the POS of the head can be backed off to a model which does not condition on the head at all. We find optimal mixture weights using the Baum-Welch algorithm tuned on a held-out development set.
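Fitting mixture weights on held-out data can be sketched with a plain expectation-maximization loop. This is an assumption-laden illustration rather than the authors' implementation: the text names the Baum-Welch algorithm, and the code below is the simpler EM update for a mixture of fixed component models, where `component_probs[i][j]` is the probability that (already trained) component j assigns to development item i. The function name and the toy numbers are hypothetical.

```python
def em_mixture_weights(component_probs, iterations=100):
    """Estimate mixture weights for fixed components by EM on held-out items."""
    m = len(component_probs[0])
    weights = [1.0 / m] * m
    for _ in range(iterations):
        totals = [0.0] * m
        for probs in component_probs:
            joint = [w * p for w, p in zip(weights, probs)]
            norm = sum(joint)
            if norm == 0.0:
                continue  # item has zero probability under every component
            # E-step: responsibility of each component for this item.
            for j in range(m):
                totals[j] += joint[j] / norm
        s = sum(totals)
        if s == 0.0:
            break
        # M-step: new weights are the normalized expected counts.
        weights = [t / s for t in totals]
    return weights

# Toy example: two components scored on three development items.
dev_probs = [[0.20, 0.05], [0.10, 0.10], [0.01, 0.30]]
print(em_mixture_weights(dev_probs))
```

Each iteration cannot decrease the held-out likelihood of the mixture, so the loop can be stopped early once the weights stop changing.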

Evaluation
Here we empirically evaluate some options for model type and model labelling as described above. We are interested in how many of the possible orders of a sentence our model can generate (recall), and in how many of our generated orders really are acceptable (precision). As a recall-like measure, we quantify the probability of the word orders of held-out test sentences. Low probabilities assigned to held-out sentences indicate that there are possible orders which our model is missing. As a precision-like measure, we get human acceptability ratings for sentence reorderings generated by our model. We carry out our evaluations using the dependency corpora of the Universal Dependencies project (v1.1) (Agić et al., 2015), with the train/dev/test splits provided in that dataset. We remove nodes and edges dealing with punctuation. Due to space constraints, we only present results from 11 languages here.

Test-Set Probability
Here we calculate average log probabilities of word orders per sentence in the test set. This number can be interpreted as the (negative) average amount of information contained in the word order of a sentence beyond information about dependency relations.
The results for selected languages are shown in Table 1. The biggest gains come from using Dependent N-gram models with N > 1, and from backing off the model labelling. The Observed Orders model does poorly on its own, likely due to data sparsity; its performance is much improved when backing off from conditioning on the head. Eisner Model C (n1) also performs poorly, likely because it cannot represent any ordering constraints among sister dependents. The fact that it helps to back off to distributions not conditioned on the head suggests that there are commonalities among distributions of dependents of different heads, which could be exploited in further generative dependency models.
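The per-sentence measure can be sketched as follows, assuming (as the information-theoretic reading suggests) that it is computed in log space, and assuming that the order probability of a sentence is the product of the conditional order probabilities of its local subtrees. The function names and toy numbers are hypothetical; natural logarithms are used, so values are in nats.

```python
import math

def sentence_logprob(local_probs):
    """Log probability of a sentence's word order from its local subtree probabilities."""
    return sum(math.log(p) for p in local_probs)

def average_logprob(test_set):
    """Average log probability of word order per sentence."""
    return sum(sentence_logprob(s) for s in test_set) / len(test_set)

# Toy test set: each sentence is a list of conditional order probabilities, one per head.
toy_test_set = [[0.5, 0.5], [0.25]]
print(average_logprob(toy_test_set))
```

Values closer to zero mean the model assigns more probability to the attested orders.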

Human Evaluation
We collected human ratings for sentence reorderings sampled from the English models from 54 native American English speakers on Amazon Mechanical Turk. We randomly selected a set of 90 sentences from the test set of the English Universal Dependencies corpus. We generated a reordering of each sentence according to each of 12 model configurations in Table 1. Each participant saw an original sentence and a reordering of it, and was asked to rate how natural each version of the sentence sounded, on a scale of 1 to 5. The order of presentation of the original and reordered forms was randomized, so that participants were not aware of which form was the original and which was a reordering. Each participant rated 56 sentence pairs. Participants were also asked whether the two sentences in a pair meant the same thing, with "can't tell" as a possible answer. Table 2 shows average human acceptability ratings for reorderings, and the proportion of sentence pairs judged to mean the same thing. The original sentences have an average acceptability rating of 4.48/5. The very best performing models are those which do not back off to a distribution not conditioned on the head. However, in the case of the Observed Orders and other sparse models, we see consistent improvement from this backoff. Figure 2 shows the acceptability ratings (out of 5) plotted against test set probability. We see that the models which yield poor test set probability also have poor acceptability ratings.
Figure 2: Comparison of test set probability (Table 1) and acceptability ratings (Table 2) for English across models. A least-squares linear regression line is shown. Labels as in Table 1.

Comparison with other systems
Previous work has focused on the ability to correctly reconstruct the word order of an observed dependency tree. Our goal is to explicitly model a distribution over possible orders, rather than to recover a single correct order, because many orders are often possible, and the particular order that a dependency tree originally appeared in might not be the most natural. For example, our models typically reorder the sentence "From the AP comes this story" (in Figure 1) as "This story comes from the AP"; the second order is arguably more natural, though the first is idiomatic for this particular phrase. So we do not believe that BLEU scores and other metrics of similarity to a "correct" ordering are particularly relevant for our task. Previous work uses BLEU scores (Papineni et al., 2002) and human ratings to evaluate generation of word orders. To provide some comparability with previous work, we report BLEU scores on the 2011 Shared Task data here. The systems reported in Belz et al. (2011) achieve BLEU scores ranging from 23 to 89 for English; subsequent work achieves a BLEU score of 91.6 on the same data (Bohnet et al., 2012). Drawing the highest-probability orderings from our models, we achieve a top BLEU score of 57.7 using the model configuration hdr/oo. Curiously, hdr/oo is typically the worst model configuration in the test set probability evaluation (Section 3.1). The BLEU performance is in the middle range of the Shared Task systems. The human evaluation of our models is more optimistic: the best score for Meaning Similarity in the Shared Task was 84/100 (Bohnet et al., 2011), while sentences ordered according to our models were judged to have the same meaning as the original in 85% of cases (Table 2), though these figures are based on different data. These comparisons suggest that these generative models do not provide state-of-the-art performance, but do capture some of the same information as previous models.

Discussion
Overall, the most effective models are the Dependent N-gram models. The naive approach to modeling order relations among sister dependents, as embodied in the Observed Orders model, does not generalize well. The result suggests that models like the Dependent N-gram model might be effective as generative dependency models.

Conclusion
We have discussed generative models for dependency tree linearization, exploring a less-traveled path in the dependency linearization literature. We believe this approach has value for answering scientific questions in quantitative linguistics and for better understanding the linguistic adequacy of generative dependency models.