Synthetic Data Made to Order: The Case of Parsing

To approximately parse an unfamiliar language, it helps to have a treebank of a similar language. But what if the closest available treebank still has the wrong word order? We show how to (stochastically) permute the constituents of an existing dependency treebank so that its surface part-of-speech statistics approximately match those of the target language. The parameters of the permutation model can be evaluated for quality by dynamic programming and tuned by gradient descent (up to a local optimum). This optimization procedure yields trees for a new artificial language that resembles the target language. We show that delexicalized parsers for the target language can be successfully trained using such “made to order” artificial languages.


Introduction
Dependency parsing is a core task in natural language processing (NLP). Given a sentence, a dependency parser produces a dependency tree, which specifies the typed head-modifier relations between pairs of words. While supervised dependency parsing has been successful (McDonald and Pereira, 2006; Nivre, 2008; Kiperwasser and Goldberg, 2016), unsupervised parsing can hardly produce useful parses (Mareček, 2016). So it is extremely helpful to have some treebank of supervised parses for training purposes.

Past work: Cross-lingual transfer
Unfortunately, manually constructing a treebank for a new target language is expensive (Böhmová et al., 2003). As an alternative, cross-lingual transfer parsing (McDonald et al., 2011) is sometimes possible, thanks to the recent development of multi-lingual treebanks (Nivre et al., 2015; Nivre et al., 2017). The idea is to parse the sentences of the target language with a supervised parser trained on the treebanks of one or more source languages. Although the parser cannot be expected to know the words of the target language, it can make do with parts of speech (POS) (McDonald et al., 2011; Zhang and Barzilay, 2015) or cross-lingual word embeddings (Duong et al., 2015; Guo et al., 2016; Ammar et al., 2016). A more serious challenge is that the parser may not know how to handle the word order of the target language, unless the source treebank comes from a closely related language (e.g., using German to parse Luxembourgish). Training the parser on trees from multiple source languages may mitigate this issue (McDonald et al., 2011) because the parser is more likely to have seen target part-of-speech sequences somewhere in the training data. Some authors (Rosa and Žabokrtský, 2015a,b; Wang and Eisner, 2016) have shown additional improvements by preferring source languages that are "close" to the target language, where closeness is measured by the distance between POS language models trained on the source and target corpora.

This paper: Tailored synthetic data
We will focus on delexicalized dependency parsing, which maps an input POS tag sequence to a dependency tree. We evaluate single-source transfer: train a parser on a single source language, and evaluate it on the target language. This is the setup of Zeman and Resnik (2008) and Søgaard (2011a).
Our novel ingredient is that rather than seek a close source language that already exists, we create one. How? Given a dependency treebank of a possibly distant source language, we stochastically permute the children of each node, according to some distribution that makes the permuted language close to the target language.
And how do we find this distribution? We adopt the tree-permutation model of Wang and Eisner (2016). We design a dynamic programming algorithm which, for any given distribution p in Wang and Eisner's family, can compute the expected counts of all POS bigrams in the permuted source treebank. This allows us to evaluate p by computing the divergence between the bigram POS language model formed by these expected counts, and the one formed by the observed counts of POS bigrams in the unparsed target language. In order to find a p that locally minimizes this divergence, we adjust the model parameters by stochastic gradient descent (SGD).

Key limitations in this paper
Better measures of surface closeness between two languages might be devised. However, even counting the expected POS N-grams is moderately expensive, taking time exponential in N if done exactly. So we compute only these local statistics, and only for N = 2. We certainly need N > 1 because the 1-gram distribution is not affected by permutation at all. N = 2 captures useful bigram statistics: for example, to mimic a verb-final language with prenominal modifiers, we would seek constituent permutations that match its relatively high rate of VERB-PUNCT and ADJ-NOUN bigrams. While N > 2 might have improved the results, it was too slow for our large-scale experimental design. §7 discusses how richer measures could be used in the future.
We caution that throughout this paper, we assume that our corpora are annotated with gold POS tags, even in the target language (which lacks any gold training trees). This is an idealized setting that has often been adopted in work on unsupervised and cross-lingual transfer. §7 discusses a possible avenue for doing without gold tags.

Modeling Surface Realization
We begin by motivating the idea of tree permutation. Let us suppose that the dependency tree for a sentence starts as a labeled graph: a tree in which siblings are not yet ordered with respect to their parent or one another. Each language has some systematic way to realize its unordered trees as surface strings (footnote 1): it imposes a particular order on the tree's word tokens. More precisely, a language specifies a distribution p(string | unordered tree) over a tree's possible realizations.
As an engineering matter, we now make the strong assumption that the unordered dependency trees are similar across languages. That is, we suppose that different languages use similar underlying syntactic/semantic graphs, but differ in how they realize this graph structure on the surface. Thus, given a gold POS corpus u of the unknown target language, we may hope to explain its distribution of surface POS bigrams as the result of applying some target-language surface realization model to the distribution of cross-linguistically "typical" unordered trees. To obtain samples of the latter distribution, we use the treebanks of one or more other languages. The present paper evaluates our method when only a single source treebank is used. In the future, we could try tuning a mixture of all available source treebanks.

Footnote 1: Modeling this process was the topic of the recent Surface Realization Shared Task (Mille et al., 2018). Most relevant is work on tree linearization (Filippova and Strube, 2009; Futrell and Gibson, 2015; Puzikov and Gurevych, 2018).

Realization is systematic
We presume that the target language applies the same stochastic realization model to all trees. All that we can optimize is the parameter vector of this model. Thus, we deny ourselves the freedom to realize each individual tree in an ad hoc way. To see why this is important, suppose the target language is French, whose corpus u contains many NOUN-ADJ bigrams. We could achieve such a bigram from the unordered source tree in the original figure (tree diagram omitted here). However, that realization is not in fact appropriate for French, so that ordered tree would not be a useful training tree for French. Our approach should disprefer this tempting but incorrect realization, because any model with a high probability of this realization would, if applied systematically over the whole corpus, also yield sentences like He sleepy made Sue, with unwanted PRON-ADJ bigrams that would not match the surface statistics of French. We hope our approach will instead choose the realization model that is correct for French, in which the NOUN-ADJ bigrams arise instead from source trees where the ADJ is a dependent of the NOUN, yielding (e.g.) the realization in the original figure (tree diagram omitted here). This has the same POS sequence as the example above (as it happens), but now assigns the correct tree to it.

A parametric realization model
As our family of realization distributions, we adopt the log-linear model used for this purpose by Wang and Eisner (2016). The model assumes that the root node a of the unordered dependency tree selects an ordering π(a) of the n_a nodes consisting of a and its n_a − 1 dependent children. The procedure is repeated recursively at the child nodes. This method can produce only projective trees.
Each node a draws its ordering π(a) independently according to

p_θ(π | a) = (1 / Z(a)) exp( Σ_{1 ≤ i < j ≤ n_a} θ · f(π_i, π_j) )    (1)

which is a distribution over the n_a! possible orderings. Z(a) is a normalizing constant, f is a feature vector extracted from the ordered pair of nodes π_i, π_j, and θ is the model's parameter vector of feature weights. See Appendix A for the feature templates, which are a subset of those used by Wang and Eisner (2016). These features are able to examine the tree's node labels (POS tags) and edge labels (dependency relations). Thus, when a is a verb, the model can assign a positive weight to "subject precedes verb" or "subject precedes object," thus preferring orderings with these features. Following Wang and Eisner (2016, §3.1), we choose new orderings for the noun and verb nodes only (footnote 2), preserving the source treebank's order at all other nodes a.
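To make the model family concrete, here is a brute-force sketch of such a log-linear distribution over orderings. The function names, the toy feature function, and the weight dictionary are our own illustrative stand-ins (the paper's actual feature templates are in Appendix A), and a real implementation would not materialize all n_a! orderings at once.

```python
import itertools
import math

def order_distribution(nodes, theta, features):
    """Distribution over all orderings pi of `nodes` (a head plus its
    children).  The score of pi sums a weight for each feature fired by
    each ordered pair (pi[i], pi[j]) with i < j, as in a log-linear model."""
    scores = {}
    for pi in itertools.permutations(nodes):
        s = 0.0
        for i in range(len(pi)):
            for j in range(i + 1, len(pi)):
                for f in features(pi[i], pi[j]):
                    s += theta.get(f, 0.0)
        scores[pi] = math.exp(s)
    Z = sum(scores.values())            # normalizing constant Z(a)
    return {pi: v / Z for pi, v in scores.items()}

# Toy example: a verb head with a subject and an object dependent.
def feats(u, v):
    return ["%s<%s" % (u, v)]           # one feature per ordered pair

p = order_distribution(("subj", "head:VERB", "obj"),
                       {"subj<head:VERB": 2.0, "head:VERB<obj": 2.0},
                       feats)
best = max(p, key=p.get)                # the SVO order gets highest probability
```

With these toy weights, orderings that place the subject before the verb and the verb before the object score highest, illustrating how pairwise features steer the permutation distribution.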

Generating training data
Given a source treebank B and some parameters θ, we can use equation (1) to randomly sample realizations of the trees in B. The effect is to reorder dependent phrases within those trees. The resulting permuted treebank B′ can be used to train a parser for the target language.

Choosing parameters θ
So how do we choose a θ that works for the target language? Suppose u is a corpus of target-language POS sequences, using the same set of POS tags as B. We evaluate parameters θ according to whether POS tag sequences in B′ will be distributed like POS tag sequences in u.
To do this, first we estimate a bigram language model q̂ from the actual distribution q of POS sequences observed in u. Second, let p_θ denote the distribution of POS sequences that we expect to see in B′, that is, POS sequences obtained by stochastically realizing observed trees in B according to θ. We estimate another bigram model p̂_θ from this distribution p_θ.

Footnote 2: Specifically, the 93% of nodes tagged with NOUN, PROPN, PRON or VERB in Universal Dependencies format. In retrospect, this restriction was unnecessary in our setting, but it skipped only 4.4% of nodes on average (from 2% to 11% depending on language). The remaining nodes were nouns, verbs, or childless.
We then try to set θ, using SGD, to minimize a divergence D(p̂_θ, q̂) that we will define below.

Estimation of bigram models
We define

q̂(t | s) = c_q(st) / c_q(s)    (2)

where c_q(st) is the count of POS bigram st in the average sentence of u (footnote 3) and c_q(s) = Σ_t′ c_q(st′). We estimate p̂_θ in the same way, where c_p(st) denotes the expected count of st in a random POS sequence y ∼ p_θ. This is equivalent to choosing q̂ and p̂_θ to minimize the KL divergences KL(q || q̂) and KL(p_θ || p̂_θ). It ensures that each model's expected bigram counts match those in the POS sequences.
However, these maximum-likelihood estimates might overfit on our finite data, u and B. We therefore smooth both models by first adding λ = 0.1 to all bigram counts c_q(st) and c_p(st) (footnote 4).
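The estimation step amounts to normalizing add-λ-smoothed bigram counts row by row. A minimal sketch, with illustrative function and variable names of our own; note the counts may be fractional, as with the expected counts behind p̂_θ:

```python
def bigram_model(counts, tagset, lam=0.1):
    """Smoothed conditional bigram probabilities p(t | s) from (possibly
    fractional, expected) bigram counts.  `counts` maps (s, t) pairs to
    counts; add-lambda smoothing is applied over `tagset`, which should
    include EOS as a possible successor."""
    p = {}
    for s in list(tagset) + ["BOS"]:
        total = (sum(counts.get((s, t), 0.0) for t in tagset)
                 + lam * len(tagset))
        for t in tagset:
            p[(s, t)] = (counts.get((s, t), 0.0) + lam) / total
    return p

# Toy expected counts per average sentence.
counts = {("BOS", "NOUN"): 1.0, ("NOUN", "VERB"): 0.7,
          ("NOUN", "EOS"): 0.3, ("VERB", "EOS"): 0.7}
p = bigram_model(counts, ["NOUN", "VERB", "EOS"])
```

Each row p(· | s) sums to one, so the smoothed model is a proper conditional distribution even for bigrams never observed.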

Divergence of bigram models
We need a metric to evaluate θ. If p and q are bigram language models over POS sequences y (sentences), their Kullback-Leibler divergence is

KL(p || q) = E_{y∼p}[ log (p(y) / q(y)) ] = Σ_{st} c_p(st) log (p(t|s) / q(t|s))    (3)

where y ranges over POS sequences and st ranges over POS bigrams. These include bigrams where s = BOS ("beginning of sequence") or t = EOS ("end of sequence"), which are boundary tags that we take to surround y. All quantities in equation (3) can be determined directly from the (expected) bigram counts given by c_p and c_q. No other model estimation is needed.
A concern about equation (3) is that a single bigram st that is badly underrepresented in q may contribute an arbitrarily large term log (p(t|s) / q(t|s)). To limit this contribution to at most log (1/α), for some small α ∈ (0, 1), we define KL_α(p || q) by a variant of equation (3) in which q(t|s) is replaced by α p(t|s) + (1 − α) q(t|s) (footnote 5).

Our final divergence metric D(p̂_θ, q̂) is a linear combination of exclusive and inclusive KL_α divergences, which respectively emphasize p̂_θ's precision and recall at matching q̂'s bigrams:

D(p̂_θ, q̂) = β · KL_α1(p̂_θ || q̂) / E_{y∼p̂_θ}[|y|] + (1 − β) · KL_α2(q̂ || p̂_θ) / E_{y∼q̂}[|y|]    (4)

where β, α1, α2 are tuned by cross-validation to maximize the downstream parsing performance. The division by average sentence length converts KL from nats per sentence to nats per word (footnote 6), so that the KL values have comparable scale even if B has much longer or shorter sentences than u.

Footnote 3: A more familiar definition of c_q would use the total count in u. Our definition, which yields the same bigram probabilities, is analogous to our definition of c_p. This c_p is needed for KL(p || q) in (3), and c_q symmetrically for KL(q || p).

Footnote 4: Ideally one should tune λ to minimize the language model perplexity on held-out data (e.g., by cross-validation).

Footnote 5: This is inspired by the α-skew divergence of Lee (1999, 2001).
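Both divergence terms are simple sums over the (expected) bigram counts. The sketch below follows the skewed-KL construction just described and the per-word normalization of the final metric; the function names and argument layout are our own, not the paper's code:

```python
import math

def kl_alpha(c_p, p, q, alpha):
    """Skewed KL between bigram models: the sum over bigrams st of
    c_p(st) * log( p(t|s) / (alpha*p(t|s) + (1-alpha)*q(t|s)) ).
    Each term is capped at log(1/alpha), even if q(t|s) = 0."""
    return sum(c * math.log(p[st] / (alpha * p[st] + (1 - alpha) * q[st]))
               for st, c in c_p.items() if c > 0)

def divergence(c_p, p_hat, c_q, q_hat, beta, a1, a2, len_p, len_q):
    """Linear combination of the exclusive (precision) and inclusive
    (recall) skewed KLs, each normalized from nats per sentence to nats
    per word by the average sentence length under its model."""
    return (beta * kl_alpha(c_p, p_hat, q_hat, a1) / len_p
            + (1 - beta) * kl_alpha(c_q, q_hat, p_hat, a2) / len_q)
```

When the two models agree, both terms vanish; when q assigns a bigram zero probability, the cap keeps the objective finite, which matters for gradient-based tuning.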

Efficiently computing expected counts
We now present a polynomial-time algorithm for computing the expected bigram counts c_p under p_θ (or equivalently p̂_θ), for use above. This averages expected counts from each unordered tree x ∈ B. Algorithm 1 in the supplement gives pseudocode.
The insight is that rather than sampling a single realization of x (as constructing B′ does), we can use dynamic programming to sum efficiently over all of its exponentially many realizations. This gives an exact answer. It algorithmically resembles tree-to-string machine translation, which likewise considers the possible reorderings of a source tree and incorporates a language model by similarly tracking surface N-grams (Chiang, 2007, §5.3.2).
For each node a of the tree x, let the POS string y_a be the realization of the subtree rooted at a. Let c_a(st) be the expected count of bigram st in y_a, whose distribution is governed by equation (1). We allow s = BOS or t = EOS as defined in §2.4.2.
The c_a function can be represented as a sparse map from POS bigrams to reals. We compute c_a at each node a of x in a bottom-up order. The final step computes c_root, giving the expected bigram counts in x's realization y (that is, c_p in §2.4).
We find c_a as follows. Let n = n_a and recall from §2.2 that π(a) is an ordering of a_1, …, a_n, where a_1, …, a_{n−1} are the child nodes of a, and a_n is a dummy node representing a's head token. Also, let a_0 and a_{n+1} be dummy nodes that always appear at the start and end of any ordering.

Footnote 5 (continued): Indeed, we may regard KL_α(p || q) as the α-skew divergence between the unigram distributions p(· | s) and q(· | s), averaged over all s in proportion to c_p(s). In principle, we could have used the α-skew divergence between the distributions p(·) and q(·) over POS sequences y, but computing that would have required a sampling-based approximation (§7).

Footnote 6: Recall that the units of negated log-probability are called bits for log base 2, but nats for log base e.
For all 0 ≤ i ≤ n and 1 ≤ j ≤ n + 1, let p_a(i, j) denote the expected count of the node bigram a_i a_j, i.e., the probability that π(a) places node a_i immediately before node a_j. These node bigram probabilities can be obtained by enumerating all possible orderings π, a matter we return to below.
It is now easy to compute c_a:

c_a(st) = c_a^within(st) + c_a^across(st)    (5)

where c_a^within(st) = Σ_{i=1}^{n} c_{a_i}(st) for non-boundary bigrams (s ≠ BOS and t ≠ EOS), and c_a^across(st) = Σ_{0 ≤ i ≤ n} Σ_{1 ≤ j ≤ n+1} p_a(i, j) · c_{a_i}(s EOS) · c_{a_j}(BOS t). That is, c_a inherits all non-boundary bigrams st that fall within its child constituents (via c_a^within). It also counts bigrams st that cross the boundary between consecutive nodes (via c_a^across), where nodes a_i and a_j are consecutive with probability p_a(i, j).
When computing c_a via (5), we will have already computed c_{a_1}, …, c_{a_{n−1}} bottom-up. As for the dummy nodes, a_n is realized by the length-1 string h, where h is the head token of node a, while a_0 and a_{n+1} are each realized by the empty string. Thus, c_{a_n} simply assigns count 1 to the bigrams BOS h and h EOS, and c_{a_0} and c_{a_{n+1}} each assign expected count 1 to BOS EOS. (Notice that thus, c_a^across(st) counts y_a's boundary bigrams, i.e., the bigrams st where s = BOS or t = EOS, when i = 0 or j = n + 1 respectively.)
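One step of this bottom-up combination can be sketched as follows. The data layout is our own, not the paper's Algorithm 1: expected counts are dicts keyed by bigram, and each node's distribution over its first and last surface symbols is passed in explicitly (for a child, its "ends with s" probability is exactly its expected count of the bigram s EOS, and symmetrically for BOS t).

```python
def combine(child_counts, p_node, ends_with, starts_with):
    """Expected bigram counts c_a for a node a, given:
    child_counts -- list of expected-count dicts c_{a_i} for the real nodes,
    p_node       -- dict (i, j) -> probability that node i immediately
                    precedes node j in the chosen ordering,
    ends_with    -- dict i -> {symbol: prob that y_{a_i} ends with symbol},
    starts_with  -- dict j -> {symbol: prob that y_{a_j} starts with symbol}.
    The start dummy ends with BOS and the end dummy starts with EOS, so
    boundary bigrams of y_a are produced automatically."""
    c = {}
    # "within": non-boundary bigrams inherited from each constituent
    for cc in child_counts:
        for (s, t), v in cc.items():
            if s != "BOS" and t != "EOS":
                c[(s, t)] = c.get((s, t), 0.0) + v
    # "across": bigrams spanning two adjacent constituents
    for (i, j), pij in p_node.items():
        for s, ps in ends_with[i].items():
            for t, pt in starts_with[j].items():
                c[(s, t)] = c.get((s, t), 0.0) + pij * ps * pt
    return c
```

Because the across term multiplies an adjacency probability by two end-symbol probabilities, the result is an exact expectation, not a sample, of the bigram counts under the ordering distribution.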

Efficient enumeration over permutations
The main challenge above is computing the node bigram probabilities p_a(i, j). These are marginals of p(π | a) as defined by (1), which unfortunately is intractable to marginalize: there is no better way than enumerating all n! permutations.
That said, there is a particularly efficient way to enumerate the permutations. The Steinhaus-Johnson-Trotter (SJT) algorithm (Sedgewick, 1977) does so in O(1) time per permutation, obtaining each permutation by applying a single swap to the previous one. Only the features that are affected by this swap need to be recomputed. For our features (Appendix A), this cuts the runtime per permutation from O(n^2) to O(n).
Furthermore, the single swap of adjacent nodes only changes 3 bigrams (possibly including boundary bigrams). As a result, it is possible to obtain the marginal probabilities with O(1) additional work per permutation. When a node bigram is destroyed, we increment its marginal probability by the total probability of permutations encountered since the node bigram was last created. This can be found as a difference of partial sums. The final partial sum is the normalizing constant Z(a), which can be applied at the end. Pseudocode is given in supplementary material as Algorithm 2.
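For checking such an implementation on small n, the marginals can also be computed by brute force. The sketch below enumerates orderings with itertools rather than in SJT order, so it pays O(n) per permutation instead of the amortized O(1) bookkeeping described above; the indexing convention (start dummy = n, end dummy = n + 1) and the function name are ours:

```python
import itertools
import math

def node_bigram_marginals(nodes, score):
    """Adjacency marginals p_a(i, j): the probability that node i
    immediately precedes node j under the Gibbs distribution
    p(pi) proportional to exp(score(pi)).  Index len(nodes) stands for
    the start dummy and len(nodes) + 1 for the end dummy."""
    n = len(nodes)
    start, end = n, n + 1
    marg, Z = {}, 0.0
    for pi in itertools.permutations(range(n)):
        w = math.exp(score([nodes[i] for i in pi]))
        Z += w
        seq = [start] + list(pi) + [end]
        for i, j in zip(seq, seq[1:]):
            marg[(i, j)] = marg.get((i, j), 0.0) + w
    return {ij: w / Z for ij, w in marg.items()}

# Uniform scores over the 2! orderings of two nodes.
m = node_bigram_marginals(["A", "B"], lambda pi: 0.0)
# m[(2, 0)] == 0.5: node 0 follows the start dummy in half the orderings
```

A brute-force reference like this is also useful for calibrating the faster incremental computation, since both must agree exactly on any tractable n.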
When we train the parameters θ (§2.4), we must back-propagate through the whole computation of equation (4), which depends on tag bigram counts c_a(st), which depend via (5) on expected node bigram counts p_a(i, j), which depend via Algorithm 2 on the permutation probabilities p(π | a), which depend via (1) on the feature weights θ.

Pruning high-degree trees
As a further speedup, we train only on trees with fewer than 40 words and max_a n_a ≤ 5, so that n_a! ≤ 120 (footnote 7). We then produce the synthetic treebank B′ (§2.3) by drawing a single realization of each tree in B for which max_a n_a ≤ 7. This requires sampling from up to 7! = 5040 candidate orderings per node, again using SJT (footnote 8). That is, in this paper we run exact algorithms (§3), but only on a subset of B. The subset is not necessarily representative. An improvement would use importance sampling, with a proposal distribution that samples the slower trees less often during SGD but upweights them to compensate.
§7 suggests a future strategy that would run on all trees in B via approximate, sampling-based algorithms. The exact methods would remain useful for calibrating the approximation quality.

Minibatch estimation of c p
To minimize (4), we use the Adam variant of SGD (Kingma and Ba, 2014), with learning rate 0.01 chosen by cross-validation (§5.1).
SGD requires a stochastic estimate of the gradient of the training objective. Ordinarily this is done by replacing an expectation over the entire training set with an expectation over a minibatch. Equation (2) with p = p̂_θ is indeed an expectation over sentences of B. It can be stochastically estimated by the same formula, where c_p now gives the expected bigram counts averaged over only the sentences in a minibatch of B. These are found using §3's algorithms with the current θ. Unfortunately, the term log p(t | s) depends on bigram counts that should be derived from the entire corpus B in the same way. Our solution is simply to reuse the minibatch estimate of c_p for the latter counts. We use a large minibatch of 500 sentences from B so that this drop-in estimate does not introduce too much bias into the stochastic gradient: after all, we only need to estimate bigram statistics on 17 POS types (footnote 9). By contrast, the c_q values that are used for the expectation in the second term of (4) and in log q(t | s) do not change during optimization, so we simply compute them once from all of u.

Footnote 7: We found that this threshold worked much better than ≤ 4 and about as well as the much slower ≤ 6.

Footnote 8: This pruning heuristic retains 36.1% of the trees (averaging over the 20 development treebanks, §5.1) for training, and 66.6% for actual realization. The latter restriction follows Wang and Eisner (2016, §4.2): they too discarded trees with nodes having n_a ≥ 8.

Informed initialization
Unfortunately the objective (4) is not convex, so the optimizer is sensitive to initialization (see §5.3 below for empirical discussion). Initializing θ = 0 (so that p(π | a) is uniform) gave poor results in pilot experiments. Instead, we initially choose θ to be the realization parameters of the source language, as estimated from the source treebank B. This is at least a linguistically realistic θ, although it may not be close to the target language (footnote 10). For this initial estimation, we follow Wang and Eisner (2016) and perform supervised training on B of the log-linear realization model (1), by maximizing the conditional log-likelihood of B, namely Σ_{(x,t)∈B} log p_θ(t | x), where (x, t) are an unordered tree and its observed ordering in B. This initial objective is convex (footnote 11).

Experiments
We performed a large-scale experiment requiring hundreds of thousands of CPU-hours. To our knowledge, this is the largest study of parsing transfer yet attempted.

Footnote 9: We also used the minibatch to estimate the average sentence length E_{y∼p}[|y|] in (4), although here we could have simply used all of B, since this value does not change.

Footnote 10: As an improvement, one could also try initial realization parameters for B that are estimated from treebanks of other languages. Concretely, the optimizer could start by selecting a "galactic" treebank from Wang and Eisner (2016) that is already close to the target language, according to (4), and try to make it even closer. We leave this to future work.

Footnote 11: Unfortunately, we did not regularize it, which probably resulted in initializing some parameters too close to ±∞ for the optimizer to change them meaningfully.

Data and setup
As our main dataset, we use Universal Dependencies version 1.2 (Nivre et al., 2015): a set of 37 dependency treebanks for 33 languages, with a unified POS tag set and relation label set.
Our evaluation metric was unlabeled attachment score (UAS) when parsing a target treebank with a parser trained on a (possibly permuted) source treebank. For both evaluation and training, we used only the training portion of each treebank.
Our parser was Yara (Rasooli and Tetreault, 2015), a fast and accurate transition-based dependency parser that can be rapidly retrained. We modified Yara to ignore the input words and use only the input gold POS tags (see §1.3). To train the Yara parser on a (possibly permuted) source treebank, we first train on 80% of the trees and use the remaining 20% to tune Yara's hyperparameters. We then retrain Yara on 100% of the source trees and evaluate it on the target treebank.
Similar to Wang and Eisner (2017), we use 20 treebanks (18 distinct languages) as development data, and hold out the remaining 17 treebanks for the final evaluation. We chose the hyperparameters (α1, α2, β) of (4) to maximize the target-language UAS, averaged over all 376 transfer experiments in which the source and target treebanks were development treebanks of different languages (footnote 12). (See Appendix C for details.) The next few sections perform some exploratory analysis on these 376 experiments. Then, for the final test in §5.4, we will evaluate UAS on all 337 transfer experiments in which the source is a development treebank and the target is a test treebank of a different language (footnote 13).

Exploratory analysis
We have assumed that a smaller divergence between source and target treebanks results in better transfer parsing accuracy. Figure 1 shows that these quantities are indeed correlated, both for the original source treebanks and for their "made to order" permuted versions.

Footnote 12: We have 19*20 = 380 pairs in total, minus the four excluded pairs (grc, grc proiel), (grc proiel, grc), (la proiel, la itt) and (la itt, la proiel). Unlike Wang and Eisner (2017), we exclude duplicated languages in development and testing.

Footnote 13: Specifically, there are 3 duplicated sets: {grc, grc proiel}, {la, la proiel, la itt}, and {fi, fi ftb}. Whenever one treebank is used as the target language, we exclude the other treebanks in the same set.

Footnote 15: According to the family (and sub-family) information at http://universaldependencies.org.

Figure 1: UAS is higher when divergence is lower. Each point represents a pair of source and target languages, whose shape and color identify the treebank of the target language (see legend). The marker is solid if the source and target languages belong to the same language family (footnote 15). The left graph uses the original source treebank (Kendall's τ = −0.41), while the right graph uses its permuted version (τ = −0.39).
Thus, we hope that the optimizer will find a systematic permutation that reduces the divergence. Does it? Yes: Figures 5 and 6 in the supplementary material show that the optimizer almost always manages to reduce the objective on training data, as expected.
One concern is that our divergence metric might misguide us into producing dysfunctional languages whose trees cannot be easily recovered from their surface strings, i.e., languages that have no good parser. In such a language, the word order might be extremely free (e.g., θ = 0), or common constructions might be syntactically ambiguous. Fortunately, Appendix D shows that our synthetic languages appear natural with respect to their parsability.
The above findings are promising. So does permuting the source language in fact result in better transfer parsing of the target language? We experiment on the 376 development pairs. The solid lines in Figure 2 show our improvements on the dev data, with a simpler scatterplot given in Figure 7 in the supplementary material. The upshot is that the synthetic source treebanks yield a transfer UAS of 52.92 on average. This is not yet a result on held-out test data: recall that 52.92 was the best transfer UAS achieved by any hyperparameter setting. That said, it is 1.00 points better than transferring from the original source treebanks, a significant difference (paired permutation test by language pair, p < 0.01). Figure 2 shows that this average improvement is mainly due to the many cases where the source and target languages come from different families.
Permutation tends to improve source languages that were doing badly to start with. However, it tends to hurt a source language that is already in the target language family.
A hypothetical experiment shows that permuting the source does have good potential to help (or at least not hurt) in both cases. The dashed lines in Figure 2, and the scatterplot in Figure 8, show the potential of the method: they show the improvement we would get from permuting each source treebank using an "oracle" realization policy, namely the supervised realization parameters θ that are estimated from the actual target treebank. The usefulness of this oracle-permuted source varies depending on the source language, but it is usually much better than the automatically permuted version of the same source.
This shows that large improvements would be possible if we could only find the best permutation policy allowed by our model family. The question for future work is whether such gains can be achieved by a more sensitive permutation model than (1), a better divergence objective than (4), or a better search algorithm than §4.2. Identifying the best available source treebank, or the best mixture of all source treebanks, would also help greatly. Even with oracle permutation in Figure 8, the correlation remains strong (τ = 0.59), suggesting that the choice of source treebank is important even beyond its effect on search initialization.

Sensitivity to initializer
We suspected that when "made to order" source treebanks perform close to their original versions (more often than the oracle versions do), this is in part because the optimizer can get stuck near the initializer (§4.3). To examine this, we experimented with random restarts, as follows. In addition to informed initialization (§4.3), we optimized from 5 other starting points θ ∼ N(0, I). From these 6 runs, we selected the final parameters that achieved the best divergence (4). As shown by Figure 9 in the supplement, greater gains appear to be possible with more aggressive search methods of this sort, which we leave to future work. We could also try non-random restarts based on the realization parameters of other languages, as suggested in footnote 10.

Final evaluation on the test languages
For our final evaluation (§5.1), we use the same hyperparameters (Appendix C) and report on single-source transfer to the 17 held-out treebanks.
The development results hold up in Figure 3. Using the synthetic languages yields 50.36 UAS on average, 1.75 points over the baseline, which is significant (paired permutation test, p < 0.01).
In the supplementary material (Appendix E), we include some auxiliary experiments on multisource transfer.

Related Work

The alternative of cross-lingual transfer has recently flourished thanks to the development of consistent cross-lingual datasets of POS-tagged (Petrov et al., 2012) and dependency-parsed (McDonald et al., 2013) sentences. McDonald et al. (2011) showed a significant improvement over grammar induction by simply using a delexicalized parser trained on other language(s). Subsequent improvements have come from re-weighting source languages (Søgaard, 2011b; Rosa and Žabokrtský, 2015a,b; Wang and Eisner, 2016), adapting the model to the target language using WALS (Dryer and Haspelmath, 2013) features (Naseem et al., 2012; Zhang and Barzilay, 2015; Ammar et al., 2016), and improving the lexical representations via multilingual word embeddings (Duong et al., 2015; Guo et al., 2016; Ammar et al., 2016) and synthetic data generation (§6.2).

Figure 2 (columns: All (376), in-family (46)): The three points on the polyline (from left to right) represent the target UAS for parsers trained on three sources: the original source treebank, the "made to order" permutation that attempts to match surface statistics of the target treebank, and an oracle permutation that uses a realization model trained on the target language. We use solid markers and purple lines if the transfer is within-family (source and target treebank from the same language family), and hollow and olive for cross-family transfer. The black polyline in each column is the mean of the others.

Synthetic data generation
Our novel proposal ties into the recent interest in data augmentation in supervised machine learning. In unsupervised parsing, the most widely adopted synthetic data method has been annotation projection, which generates synthetic analyses of target-language sentences by "projecting" the analysis from a source-language translation. Of course, this requires bilingual corpora as an additional resource. Annotation projection was proposed by Yarowsky et al. (2001), achieved promising results on sequence labeling tasks, and was later developed for unsupervised parsing (Hwa et al., 2005; Ganchev et al., 2009; Smith and Eisner, 2009; Tiedemann, 2014; Ma and Xia, 2014). Recent work in this vein has mainly focused on improving the synthetic data, including reweighting the training trees (Agić et al., 2016) or pruning those that cannot be aligned well (Rasooli and Collins, 2015, 2017; Lacroix et al., 2016). On the other hand, Wang and Eisner (2016) proposed to permute source-language treebanks using word order realization models trained on other source languages. They generated on the order of 50,000 synthetic languages by "mixing and matching" a few dozen source languages. Their idea was that with a large set of synthetic languages, they could use them as supervised examples to train an unsupervised structure discovery system that could analyze any new language. Systems built with this dataset were competitive in single-source parser transfer (Wang and Eisner, 2016), typology prediction (Wang and Eisner, 2017), and parsing unknown languages (Wang and Eisner, 2018).
Our work in this paper differs in that our synthetic treebanks are "made to order." Rather than combine aspects of different treebanks and hope to get at least one combination that is close to the target language, we "combine" the source treebank with a POS corpus of the target language, which guides our customized permutation of the source.
Beyond unsupervised parsing, synthetic data has been used for several other tasks. In NLP, it has been used for complex tasks such as question answering (QA) (Serban et al., 2016) and machine reading comprehension (Weston et al., 2016; Hermann et al., 2015; Rajpurkar et al., 2016), where highly expressive neural models are used and not enough real data is available to train them. In supervised parsing, Gulordava and Merlo (2016) conduct a controlled study on the parsability of languages by generating treebanks with short dependency lengths and low variability of word order.

Conclusion & Future Work
We have shown how cross-lingual transfer parsing can be improved by permuting the source treebank to better resemble the target language on the surface (in its distribution of gold POS bigrams). The code is available at https://github.com/wddabc/ordersynthetic. Our work is grounded in the notion that by trying to explain the POS bigram counts in a target corpus, we can discover a stochastic realization policy for the target language, which correctly "translates" the source trees into appropriate target trees.
We formulated an objective for evaluating such a policy, based on KL-divergence between bigram models. We showed that the objective could be computed efficiently by dynamic programming, thanks to the limitation to bigram statistics.
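To make the bigram-matching idea concrete, the toy sketch below compares the empirical POS-bigram distribution of a synthetic corpus against that of a target corpus via KL divergence. The helper names (`bigram_dist`, `kl`) and the epsilon smoothing are our own; the actual objective is computed over the *expected* bigram distribution of the permutation model, in closed form by dynamic programming, rather than from sampled corpora as here.

```python
import math
from collections import Counter

def bigram_dist(sentences):
    """Empirical distribution over POS bigrams, with boundary markers."""
    counts = Counter()
    for tags in sentences:
        padded = ["<s>"] + tags + ["</s>"]
        counts.update(zip(padded, padded[1:]))
    total = sum(counts.values())
    return {bg: c / total for bg, c in counts.items()}

def kl(q, p, eps=1e-9):
    """KL(q || p), smoothing bigrams that q has but p lacks with eps."""
    return sum(qb * math.log(qb / p.get(bg, eps)) for bg, qb in q.items())

target_pos    = [["NOUN", "VERB", "NOUN"], ["NOUN", "VERB"]]
synthetic_pos = [["NOUN", "VERB", "NOUN"], ["VERB", "NOUN"]]
q = bigram_dist(target_pos)
p = bigram_dist(synthetic_pos)
print(kl(q, p))  # positive: the synthetic corpus does not yet match the target
```

Gradient descent on the permutation parameters would then drive this divergence down.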
Experimenting on the Universal Dependencies treebanks v1.2, we showed that the synthetic treebanks were, on average, modestly but significantly better than the corresponding real treebanks for single-source transfer (and, in Appendix E, on multi-source transfer).
On the downside, Figure 7 shows that with our current method, permuting the source language to be more like the target language is helpful (on average) only when the source language is from a different language family. This contrast would be even more striking if we had a better optimizer: Figure 9 shows that SGD's initialization bias limits permutation's benefit for cross-family training, as well as its harm for within-family training.
Several opportunities for future work have already been mentioned throughout the paper. We are also interested in experimenting with richer families of permutation distributions, as well as "conservative" distributions that tend to prefer the original source order. We could use entropy regularization (Grandvalet and Bengio, 2005) to encourage more "deterministic" patterns of realization in the synthetic languages.
We would also like to consider more sensitive divergence measures that go beyond bigrams, for example using recurrent neural network language models (RNNLMs) for q and p_θ. This means abandoning our exact dynamic programming methods; we would also like to abandon exact exhaustive enumeration in order to drop §4.1's bounds on n. Fortunately, there exist powerful MCMC methods (Eisner and Tromble, 2006) that can sample from interesting distributions over the space of n! permutations, even for large n. Thus, we could approximately sample from p_θ by drawing permuted versions of each tree in B.
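As an illustration of the sampling idea (far simpler than the large-neighborhood MCMC methods cited above), the sketch below runs a Metropolis-Hastings chain over permutations of a tag sequence, proposing adjacent swaps and scoring states with a bigram log-probability table. The helpers `mh_permute` and `lm_logprob` are invented for this sketch, and it ignores tree constraints; a real sampler would restrict moves to tree-respecting reorderings.

```python
import math
import random

def lm_logprob(tags, bigram_logp, floor=math.log(1e-6)):
    """Score a tag sequence under a bigram log-probability table."""
    padded = ["<s>"] + list(tags) + ["</s>"]
    return sum(bigram_logp.get(bg, floor) for bg in zip(padded, padded[1:]))

def mh_permute(tags, bigram_logp, steps, rng):
    """Metropolis-Hastings over permutations via adjacent swaps.

    The adjacent-swap proposal is symmetric, so the acceptance
    probability reduces to the ratio of the two states' (exponentiated)
    scores; the chain targets a distribution proportional to
    exp(lm_logprob(...)).
    """
    cur = list(tags)
    cur_lp = lm_logprob(cur, bigram_logp)
    for _ in range(steps):
        i = rng.randrange(len(cur) - 1)          # pick an adjacent pair
        prop = cur[:]
        prop[i], prop[i + 1] = prop[i + 1], prop[i]
        prop_lp = lm_logprob(prop, bigram_logp)
        if rng.random() < math.exp(min(0.0, prop_lp - cur_lp)):
            cur, cur_lp = prop, prop_lp          # accept the swap
    return cur
```

For example, with a table that strongly prefers "NOUN VERB", a short chain started at `["VERB", "NOUN"]` quickly mixes into the preferred order.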
Given this change, a very interesting direction would be to graduate from POS language models to word language models, using cross-lingual unsupervised word embeddings (Ruder et al., 2017). This would eliminate the need for the gold POS tags that we unrealistically assumed in this paper (and which are typically unavailable for a low-resource target language). Furthermore, it would enable us to harness richer lexical information beyond the 17 UD POS tags. After all, even a (gold) POS corpus might not be sufficient to determine the word order of the target language: "NOUN VERB NOUN" could be either subject-verb-object or object-verb-subject. However, "water drink boy" is presumably object-verb-subject. Thus, using cross-lingual embeddings, we would try to realize the unordered source trees so that their word strings, with few edits, can achieve high probability under a neural language model of the target.
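The disambiguating power of word identities can be mimicked with a toy scorer: given an unordered bag of words, pick the order that maximizes a stand-in for a target-language language model. Here `best_order` and `toy_score` are invented for illustration; a real system would score candidates with a neural LM over cross-lingual embeddings and, for larger n, sample (as above) rather than enumerate all n! orders.

```python
from itertools import permutations

def best_order(words, score):
    """Choose the word order maximizing a target-language score.

    `score` stands in for a neural language model of the target; here it
    is any callable on a tuple of words.  Exhaustive enumeration only
    works for small n, which is why sampling is needed in general.
    """
    return max(permutations(words), key=score)

def toy_score(order):
    """Toy bigram preferences: an animate agent before the verb."""
    prefs = {("boy", "drink"): 2.0, ("drink", "water"): 2.0,
             ("water", "drink"): -1.0}
    return sum(prefs.get(bg, 0.0) for bg in zip(order, order[1:]))

print(best_order(["water", "drink", "boy"], toy_score))  # ('boy', 'drink', 'water')
```

The POS sequence alone ("NOUN VERB NOUN") cannot distinguish the two readings, but the lexical scorer does.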