Target Language-Aware Constrained Inference for Cross-lingual Dependency Parsing

Prior work on cross-lingual dependency parsing often focuses on capturing the commonalities between source and target languages and overlooks the potential of leveraging the linguistic properties of the target languages to facilitate the transfer. In this paper, we show that weak supervision in the form of linguistic knowledge about the target languages can substantially improve a cross-lingual graph-based dependency parser. Specifically, we explore several types of corpus linguistic statistics and compile them into corpus-statistics constraints to facilitate the inference procedure. We propose new algorithms that adapt two techniques, Lagrangian relaxation and posterior regularization, to conduct inference with corpus-statistics constraints. Experiments show that the Lagrangian relaxation and posterior regularization techniques improve the performance on 15 and 17 out of 19 target languages, respectively. The improvements are especially large for target languages whose word order features differ from those of the source language.


Introduction
Natural language processing (NLP) techniques have achieved remarkable performance in a variety of tasks when sufficient training data is available. However, obtaining high-quality annotations for low-resource languages is challenging, which makes these languages difficult to process. To bridge the gap, cross-lingual transfer has been proposed to transfer models trained on high-resource languages (e.g., English) to low-resource languages (e.g., Tamil) to combat the resource scarcity problem. Recent studies have demonstrated success in transferring models across languages without retraining for NLP tasks such as named entity recognition (Xie et al., 2018), dependency parsing (Tiedemann, 2015; Agić et al., 2014), and question answering (Joty et al., 2017), using a shared multi-lingual word embedding space (Smith et al., 2017) or delexicalization approaches (Zeman and Resnik, 2008). One key challenge for cross-lingual transfer is the differences among languages; for example, languages may have different word orders. When a model learned from a source language is transferred to target languages, the performance may drop significantly due to these differences. To tackle this problem, various approaches have been proposed to better capture the commonalities between the source and the target languages (McDonald et al., 2011; Guo et al., 2016; Agić, 2017; Ahmad et al., 2019); however, they overlook the potential of leveraging linguistic knowledge about the target language to account for the differences between the source and the target languages and thereby facilitate the transfer.
In this paper, we propose a complementary approach that studies how to leverage linguistic knowledge about the target languages to help the transfer. Specifically, we use corpus linguistic statistics of the target languages as weak supervision signals to guide the test-time inference process when parsing with a graph-based parser. This approach is effective because the model only needs to be trained once on the source language and can then be applied to many target languages with different constraints, without retraining.
We argue that certain corpus linguistic statistics, such as word order (e.g., how often an adjective appears before or after a noun), can be easily obtained from available resources such as the World Atlas of Language Structures (WALS) (Dryer and Haspelmath, 2013). To incorporate the corpus linguistic statistics into a cross-lingual parser, we compile them into corpus-wise constraints and adopt two families of methods to solve the constrained inference problem: 1) Lagrangian relaxation (LR) and 2) posterior regularization (PR). Both algorithms take the original graph-based parsing inference as a sub-routine. LR iteratively adjusts the pair-wise potentials until the constraints are (loosely) satisfied, while PR finds a feasible distribution and does inference based on it. The constrained inference framework is general and supports any knowledge that can be formulated in first-order logic (Roth and Yih, 2004).
We evaluate the proposed approach under the single-source transfer setting using English as the source language and test on 19 target languages covering a broad range of language families, including low-resource languages such as Tamil and Welsh. We demonstrate that by adding three simple corpus-wise constraints derived from WALS features, the performance improves on 15 and 17 out of 19 languages when using the Lagrangian relaxation and posterior regularization techniques, respectively. The improvements are especially substantial when the word order features of the target language are distant from those of the source language. For example, our framework improves the UAS score of Urdu by 15.7%, and that of Tamil by 7.3%. 1

Constrained Cross-Lingual Parsing
Our work focuses on the graph-based dependency parser (McDonald et al., 2005) in the zero-shot single-source transfer setting, as in Ahmad et al. (2019). However, the proposed algorithms can be extended to other transfer settings. Given a trained model, we derive corpus-statistics constraints and apply them at inference time to correct errors caused by word order differences between the source and the target language. Figure 1 shows an example of how constraints can influence the inference results.
In this section, we first give a quick review of the graph-based parser and introduce the notation. We then discuss how to formulate corpus-wise constraints based on corpus linguistic statistics for guiding the graph-based parser.

Background: Graph-Based Parser
A graph-based parser learns a scoring function for every pair of words in a sentence and conducts inference to derive a directed spanning tree with the highest accumulated score. Formally, given the k-th sentence $w_k = (w_{k1}, \ldots, w_{kL(k)})$, where $L(k)$ denotes the length of the k-th sentence, a graph-based parser learns a score matrix $S^{(k)}$, where $S^{(k)}_{ij}$ denotes the score of forming an arc from word $w_{ki}$ to word $w_{kj}$. Let $y_k$ be an indicator function such that $y_k(i, j) \in \{0, 1\}$ indicates whether there is an arc from $w_{ki}$ to $w_{kj}$. The maximum directed spanning tree inference can be formulated as an integer linear programming (ILP) problem:

$$\max_{y_k \in \mathcal{Y}_k} \sum_{i,j} S^{(k)}_{ij} \, y_k(i, j), \quad (1)$$

where $\mathcal{Y}_k$ is the set of legal dependency trees of sentence k. In recent years, neural network approaches (Kiperwasser and Goldberg, 2016; Wang and Chang, 2016; Kuncoro et al., 2016; Dozat and Manning, 2017) have been applied to modeling the scoring matrix $S^{(k)}$ and have achieved great performance in dependency parsing.
From the probabilistic point of view, if we assume that for different i, j the edge probabilities $P(y_k(i, j) = 1 \mid w_k)$ are mutually conditionally independent, the probability of a whole parse tree can be written as

$$P(y_k \mid w_k) = \prod_{i,j} P(y_k(i, j) = 1 \mid w_k)^{y_k(i, j)}. \quad (2)$$

If we set $S^{(k)}_{ij} = \log P(y_k(i, j) = 1 \mid w_k) + Z_j$, where $Z_j$ is a constant term, then Eq. (1) can be regarded as the following maximum a posteriori (MAP) inference problem:

$$\max_{y_k \in \mathcal{Y}_k} P(y_k \mid w_k). \quad (3)$$

1 The code and data are available at https://github.com/MtSomeThree/CrossLingualDependencyParsing.
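As a minimal illustration of the arc-factored objective in Eq. (1), the sketch below enumerates every head assignment for a tiny sentence, keeps those that form a tree hanging from an artificial ROOT token (index 0), and returns the highest-scoring one. The function names and the toy score matrix are assumptions made for illustration; an actual parser would use the maximum directed spanning tree algorithm rather than enumeration, and this sketch does not enforce a single root.

```python
from itertools import product

def is_tree(heads):
    """heads[j-1] is the head of word j (1-based); 0 is the artificial ROOT.
    The assignment is a tree if every word reaches ROOT without a cycle."""
    for j in range(1, len(heads) + 1):
        seen, node = set(), j
        while node != 0:
            if node in seen:
                return False  # cycle (this also rejects self-loops)
            seen.add(node)
            node = heads[node - 1]
    return True

def tree_score(scores, heads):
    """Sum of scores[i][j] over the arcs i -> j chosen by the head assignment."""
    return sum(scores[h][j] for j, h in enumerate(heads, start=1))

def brute_force_parse(scores):
    """Exhaustive arg max of Eq. (1); exponential, so only for tiny inputs."""
    n = len(scores) - 1  # number of real words (index 0 is ROOT)
    best_heads, best_score = None, float("-inf")
    for heads in product(range(n + 1), repeat=n):
        if not is_tree(heads):
            continue
        s = tree_score(scores, heads)
        if s > best_score:
            best_heads, best_score = heads, s
    return best_heads, best_score

# Hypothetical score matrix: toy_scores[i][j] is the score of the arc i -> j.
toy_scores = [
    [0.0, 1.0, 5.0, 1.0],  # arcs leaving ROOT
    [0.0, 0.0, 1.0, 1.0],  # arcs leaving word 1
    [0.0, 4.0, 0.0, 3.0],  # arcs leaving word 2
    [0.0, 1.0, 2.0, 0.0],  # arcs leaving word 3
]
heads, score = brute_force_parse(toy_scores)
print(heads, score)
```

In this toy input, word 2 attaches to ROOT and words 1 and 3 attach to word 2, which is the assignment that maximizes the summed arc scores.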

Corpus-Wise Constraints
Given the inference problems in Eqs. (1) and (3), additional constraints can be imposed to incorporate expert knowledge about the languages and thereby yield a better parser. Instance-level constraints have been explored in the dependency parsing literature, in both the monolingual (Dryer, 2007) and cross-lingual transfer settings. However, most word order features of a language are non-deterministic and cannot be compiled into instance-wise constraints.
In this work, we introduce corpus-wise constraints to leverage the non-deterministic features for cross-lingual parsing. We compile the following two types of corpus-wise constraints based on corpus linguistic statistics:
• Unary constraints consider statistics regarding a particular POS tag (POS).
• Binary constraints consider statistics regarding a pair of POS tags (POS_1, POS_2).
Specifically, a unary constraint specifies the ratio r of words with a particular POS tag whose heads appear on their left. 2 Similarly, a binary constraint specifies the ratio r of POS_1 being on the left of POS_2 when there is an arc between POS_1 and POS_2.
The ratios r for the constraints are called corpus statistics, which can be estimated in one of the following ways: a) leveraging existing linguistic resources or consulting linguists; b) leveraging a higher-resource language that is similar to the target language (e.g., Finnish and Estonian) to collect the statistics. In this paper, we explore the first option and leverage the WALS features, which provide a reference for word order typology, to estimate the ratios.

Compile Constraints From WALS Features.
For a particular language, once we collect the corpus statistics of a pair of POS tags, we can formulate a binary constraint. There are different ways to estimate the corpus statistics. For example, Östling (2015) utilizes a small amount of parallel data to estimate the dominant word orders. In this paper, we simply utilize a small subset of WALS features that give the dominant order of some POS pairs (e.g., adjective and noun) in a language. They can be directly compiled into binary constraints.
Similarly, we can estimate the ratio for unary constraints based on WALS features. For a particular POS tag, we choose all WALS features related to it to form a feature vector $f$. The mapping from the vector $f$ to the unary constraint ratio $r$ is learnable: for each language with annotated data, we can get a WALS feature vector $f_{lang}$ and a ratio $r_{lang}$ from the annotation. We only need a small amount of data to estimate $r_{lang}$ well. Given a set of languages with feature vectors and estimated ratios, we can learn the mapping by a simple linear regression and apply it to estimate the ratio of any target language, which is then compiled into a unary constraint.
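The regression step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the WALS features are assumed to be encoded as small numeric vectors, the language names and all numbers are made up, and the helper names (`fit_ratio_regressor`, `predict_ratio`) are hypothetical. The target language is held out when fitting.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_ratio_regressor(features, ratios, ridge=1e-6):
    """Least-squares fit of ratio ~ w . [1, f]; a tiny ridge term keeps
    the normal equations solvable when features are collinear."""
    rows = [[1.0] + list(f) for f in features]
    d = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) + (ridge if i == j else 0.0)
            for j in range(d)] for i in range(d)]
    xty = [sum(r[i] * y for r, y in zip(rows, ratios)) for i in range(d)]
    return solve_linear_system(xtx, xty)

def predict_ratio(weights, feature):
    raw = weights[0] + sum(w * f for w, f in zip(weights[1:], feature))
    return min(1.0, max(0.0, raw))  # a ratio must stay in [0, 1]

# Hypothetical languages: (WALS feature vector, ratio estimated from annotation).
languages = {
    "lang_a": ((0.0, 1.0), 0.4),
    "lang_b": ((1.0, 1.0), 0.9),
    "lang_c": ((0.0, 0.0), 0.2),
    "target": ((1.0, 0.0), None),  # ratio unknown; never used for fitting
}
train = [v for name, v in languages.items() if name != "target"]
weights = fit_ratio_regressor([f for f, _ in train], [r for _, r in train])
estimated_ratio = predict_ratio(weights, languages["target"][0])
print(round(estimated_ratio, 3))
```

The clipping in `predict_ratio` is a design choice of this sketch, since an unconstrained linear model can predict values outside the valid range of a ratio.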

Formulate Constraints
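In the following, we mathematically formulate the corpus-wise constraints. Note that these constraints are based on the statistics over the entire corpus. For a unary constraint $C_u^{(POS)}$, let $P$ denote the set of words with part-of-speech tag POS. We define $C_u^+ := \{(k, i, j) \mid w_{kj} \in P \wedge i < j\}$ as the set of arcs where the head of a word in $P$ is on its left, and $C_u^- := \{(k, i, j) \mid w_{kj} \in P \wedge i > j\}$ as the set of arcs where the head is on its right. For a binary constraint $C_b^{(POS_1, POS_2)}$, we denote by $P_1$ and $P_2$ the sets of words with part-of-speech tags POS_1 and POS_2, respectively. We then define $C_b^+$ as the set of arcs with two ends $w_{ki} \in P_1$ and $w_{kj} \in P_2$ where $w_{ki}$ is on the left of $w_{kj}$. We define $C_b^-$ similarly, with $w_{ki}$ on the right of $w_{kj}$. Formally,

$$C_b^+ := \{(k, i, j) \mid w_{ki} \in P_1 \wedge w_{kj} \in P_2 \wedge i < j\}, \qquad C_b^- := \{(k, i, j) \mid w_{ki} \in P_1 \wedge w_{kj} \in P_2 \wedge i > j\}.$$

For notational simplicity, we use $C$ to denote all constraints, including both the unary and the binary ones. The ratio function $R(C, Y)$ for a constraint $C$ given the parse trees $Y$ can be defined as:

$$R(C, Y) = \frac{\sum_{(k,i,j) \in C^+} y_k(i, j)}{\sum_{(k,i,j) \in C^+} y_k(i, j) + \sum_{(k,i,j) \in C^-} y_k(i, j)}.$$

2 The ratio for the head being on the right of that POS is thereby 1 − r.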
We want to enforce the ratio $R(C, Y)$ estimated from $Y$ to be consistent with a value $r$ (see Sec. 2.2), which formulates the constraint

$$r - \theta \le R(C, Y) \le r + \theta,$$

where $\theta$ is a tolerance margin. Note that the instance-level hard constraint is a special case of the corpus-statistics constraint when r = 0 or r = 1.

Figure 2: The pipelines of the baseline method (left), Lagrangian relaxation (right), and posterior regularization (middle); the panels are labeled Graph-Based Parser, Original Inference, and Constrained Inference. Lagrangian relaxation converts constrained inference into an unconstrained optimization problem using Lagrange's method. Posterior regularization works in the distribution space: for a given distribution, PR finds the closest feasible distribution and conducts MAP inference on it.

Given a set of corpus-statistics constraints $C = \{C_1, C_2, \ldots, C_n\}$ with corresponding corpus statistics $r = \{r_1, r_2, \ldots, r_n\}$, the objective of the constrained inference is:

$$\max_{Y \in \mathcal{Y}} \sum_{k} \sum_{i,j} S^{(k)}_{ij} \, y_k(i, j) \quad \text{s.t.} \quad r_i - \theta \le R(C_i, Y) \le r_i + \theta, \quad i = 1, \ldots, n, \quad (4)$$

where $\mathcal{Y}$ denotes the set of all possible dependency trees. As all the constraints can be written as linear inequalities with respect to $y_k(i, j)$, Eq. (4) is an ILP.
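The ratio function and the margin check can be sketched over a small parsed corpus as follows. This is an illustrative sketch only: the corpus layout (a list of sentences with `pos` and `heads`, where head 0 denotes an artificial ROOT), the helper names, and the decision to skip ROOT arcs are assumptions, and the binary case reads POS_1 as the head tag and POS_2 as the dependent tag.

```python
def unary_counts(corpus, pos):
    """Count arcs in C+ (head on the left) and C- (head on the right)
    for dependents tagged `pos`. Arcs from ROOT are skipped (assumption)."""
    plus = minus = 0
    for sent in corpus:
        for j, (tag, head) in enumerate(zip(sent["pos"], sent["heads"]), start=1):
            if tag != pos or head == 0:
                continue
            if head < j:
                plus += 1
            else:
                minus += 1
    return plus, minus

def binary_counts(corpus, pos1, pos2):
    """Count arcs whose head is tagged pos1 and whose dependent is tagged pos2,
    split by whether the pos1 word precedes (C+) or follows (C-) the pos2 word."""
    plus = minus = 0
    for sent in corpus:
        for j, (tag, head) in enumerate(zip(sent["pos"], sent["heads"]), start=1):
            if tag != pos2 or head == 0 or sent["pos"][head - 1] != pos1:
                continue
            if head < j:
                plus += 1
            else:
                minus += 1
    return plus, minus

def ratio(plus, minus):
    """R(C, Y): share of C+ arcs among all arcs covered by the constraint."""
    total = plus + minus
    return plus / total if total else None

def satisfied(value, r, margin):
    """Check r - margin <= R(C, Y) <= r + margin."""
    return r - margin <= value <= r + margin

# Toy parsed corpus; heads are 1-based positions, 0 means ROOT.
toy_corpus = [
    {"pos": ["ADJ", "NOUN", "VERB"], "heads": [2, 3, 0]},
    {"pos": ["NOUN", "ADJ", "VERB"], "heads": [3, 1, 0]},
    {"pos": ["VERB", "ADJ", "NOUN"], "heads": [0, 3, 1]},
]
noun_ratio = ratio(*unary_counts(toy_corpus, "NOUN"))
noun_adj_ratio = ratio(*binary_counts(toy_corpus, "NOUN", "ADJ"))
print(noun_ratio, noun_adj_ratio)
```

Because the counts are accumulated over the whole corpus, a single sentence can violate the dominant order without violating the constraint, which is what separates these constraints from instance-level ones.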

Inference with Corpus-Statistics Constraints
The ILP problem in Eq. (4) is in general NP-hard; in particular, it involves variables associated with the entire corpus. Without constraints, Eq. (4) can be decoupled into K sub-problems, and the inference with respect to each sentence can be solved independently as in Eq. (1). In this way, an efficient inference algorithm such as the maximum directed spanning tree algorithm (Chu and Liu, 1965) can be used. However, with the corpus-wise constraints, directly solving Eq. (4) is infeasible. Therefore, we explore two algorithms for inference with corpus-statistics constraints: Lagrangian relaxation and posterior regularization (Ganchev et al., 2010). The Lagrangian relaxation algorithm introduces Lagrangian multipliers to relax the constrained optimization problem into an unconstrained optimization problem, and estimates the Lagrangian multipliers with gradient-based methods. The posterior regularization algorithm uses the constraints of the target language to define a feasible set of parse tree distributions, and finds the feasible distribution that is closest to the parse tree distribution trained on the source language by minimizing the KL-divergence. The constrained inference problem can then be converted into a MAP inference problem on the best feasible distribution. Figure 2 illustrates the procedures of the original inference, Lagrangian relaxation, and posterior regularization.

Lagrangian Relaxation
Lagrangian relaxation has been applied in various NLP applications (Rush and Collins, 2012, 2011). In Eq. (4), each constraint $C_i$ involves two inequality constraints: $R(C_i, Y) \ge r_i - \theta$ and $R(C_i, Y) \le r_i + \theta$. Instead of treating these two constraints separately, we consider a heuristic that optimizes with equality constraints and terminates early once the constraints in Eq. (4) are satisfied. Although this approach, unlike the original Lagrangian relaxation algorithm, does not guarantee that the solution is optimal when all the constraints are satisfied, in practice the inference converges faster (as the number of Lagrangian multipliers is halved) and the parsing performance is maintained.
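In the following, we derive the constrained inference algorithm for corpus-statistics constraints. First, we rewrite the equality constraint $R(C, Y) = r$ by substituting $R(C, Y)$ with its definition in Sec. 2.3 and clearing the denominator:

$$(1 - r) \sum_{(k,i,j) \in C^+} y_k(i, j) \; - \; r \sum_{(k,i,j) \in C^-} y_k(i, j) = 0. \quad (5)$$

We use $F(C)$ to denote the left-hand side of Eq. (5), which is linear w.r.t. $y_k$. Then, the Lagrangian relaxation of the constrained inference problem can be written as:

$$L(Y, \lambda) = \sum_{k} \sum_{i,j} S^{(k)}_{ij} \, y_k(i, j) + \sum_{i} \lambda_i F(C_i),$$

where $\lambda_i$ is called the Lagrangian multiplier. It is well known that we can solve the dual form of the constrained inference problem:

$$\min_{\lambda} \max_{Y} L(Y, \lambda).$$

To solve the dual form, we initialize $\lambda_i$ to 0. At iteration t, we first conduct a constraint-augmented inference with a fixed $\lambda^{(t)}$:

$$\hat{Y}^{(t)} = \arg\max_{Y} L(Y, \lambda^{(t)}). \quad (6)$$

As $F(C)$ is a linear function w.r.t. $y_k$, we combine it with $S^{(k)}_{ij} y_k(i, j)$. In this way, the inference problem in Eq. (6) can be treated as a special case of Eq. (1) with a modified scoring matrix, so we can treat the inference on every sentence independently and leverage existing inference techniques.

Algorithm 1: Lagrangian Relaxation for Constrained Inference. Input: learning rate decay $\eta$, initial learning rate $\alpha_0$. Output: parse trees $\hat{Y}$. Each iteration runs the constraint-augmented inference and the multiplier update, then decays the learning rate ($\alpha \leftarrow \eta\alpha$); the loop repeats up to MAX_ITER times and returns $\hat{Y}$.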
After solving the constraint-augmented inference, we compute the ratio of every constraint, $r_i^{(t)} = R(C_i, \hat{Y}^{(t)})$, and use the gradient ascent algorithm to update the Lagrangian multipliers $\lambda_i$. Here $\alpha^{(t)}$ denotes the step size at iteration t. The algorithm is shown in Algorithm 1.
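The loop can be sketched for a single unary constraint as follows. Several parts are assumptions made to keep the sketch self-contained: the decoder lets every word pick its best head independently (a stand-in for the maximum directed spanning tree algorithm), the multiplier enters the scores as $+\lambda(1 - r)$ on $C^+$ arcs and $-\lambda r$ on $C^-$ arcs, the multiplier is moved by the gap between the target ratio and the current ratio, and all names and toy scores are hypothetical.

```python
def decode(scores):
    """Stand-in decoder: each word j takes its best-scoring head (0 is ROOT)."""
    n = len(scores) - 1
    return [max((h for h in range(n + 1) if h != j), key=lambda h: scores[h][j])
            for j in range(1, n + 1)]

def augment(scores, pos, target_pos, r, lam):
    """Fold lam * F(C) into the arc scores of dependents tagged target_pos."""
    n = len(scores) - 1
    new = [row[:] for row in scores]
    for j in range(1, n + 1):
        if pos[j - 1] != target_pos:
            continue
        for h in range(1, n + 1):  # ROOT arcs are left untouched (assumption)
            if h < j:
                new[h][j] += lam * (1.0 - r)  # arc in C+
            elif h > j:
                new[h][j] -= lam * r          # arc in C-
    return new

def head_left_ratio(corpus, parses, target_pos):
    plus = minus = 0
    for sent, heads in zip(corpus, parses):
        for j, h in enumerate(heads, start=1):
            if sent["pos"][j - 1] != target_pos or h == 0:
                continue
            if h < j:
                plus += 1
            else:
                minus += 1
    total = plus + minus
    return plus / total if total else None

def lagrangian_relaxation(corpus, target_pos, r, margin,
                          alpha0=1.0, decay=0.9, max_iter=50):
    lam, alpha, parses = 0.0, alpha0, []
    for _ in range(max_iter):
        parses = [decode(augment(s["scores"], s["pos"], target_pos, r, lam))
                  for s in corpus]
        current = head_left_ratio(corpus, parses, target_pos)
        if current is None or abs(current - r) <= margin:
            break                        # constraint (loosely) satisfied
        lam += alpha * (r - current)     # assumed update direction
        alpha *= decay                   # learning rate decay
    return parses, lam

def toy_sentence(left_score):
    """Words: 1=VERB, 2=NOUN, 3=VERB; scores[i][j] scores the arc i -> j."""
    neg = -5.0
    scores = [
        [0.0, 3.0, neg, neg],
        [0.0, 0.0, left_score, 2.0],
        [0.0, neg, 0.0, neg],
        [0.0, neg, 2.0, 0.0],
    ]
    return {"pos": ["VERB", "NOUN", "VERB"], "scores": scores}

corpus = [toy_sentence(s) for s in (1.5, 1.0, 0.5, 0.0)]
baseline = [decode(s["scores"]) for s in corpus]
parses, lam = lagrangian_relaxation(corpus, "NOUN", r=0.75, margin=0.125)
print(head_left_ratio(corpus, baseline, "NOUN"),
      head_left_ratio(corpus, parses, "NOUN"), lam)
```

In the toy corpus, every noun initially prefers a head on its right; as the multiplier grows, the nouns with the smallest score gap flip first, so the constraint is met while changing the least confident decisions.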

Posterior Regularization
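From a probabilistic point of view, the parser model learns parameters $\theta$ to realize Eq. (2). During inference, the model predicts a probability distribution $p_\theta(Y \mid W)$ over possible parse trees given a sentence. The posterior regularization algorithm first defines a feasible set of probability distributions w.r.t. the given constraints, and looks for the feasible distribution $q^*(Y)$ that is closest to the model distribution $p_\theta(Y \mid W)$. The best parse tree is given by $\arg\max_Y q^*(Y)$. Specifically, we define the feasible set as:

$$Q = \{\, q(Y) \mid r_i - \theta \le R(C_i, q(Y)) \le r_i + \theta, \; i = 1, \ldots, n \,\},$$

where $\theta$ here is the tolerance margin. To measure the distance between two distributions, we use the KL-divergence, and find the best feasible distribution $q^*(Y)$:

$$q^*(Y) = \arg\min_{q \in Q} \mathrm{KL}\big(q(Y) \,\|\, p_\theta(Y \mid W)\big). \quad (7)$$

If the feasible set has the expectation form

$$Q = \{\, q(Y) \mid \mathbb{E}_{Y \sim q}[\phi(Y)] \le b \,\}, \quad (8)$$

Eq. (7) has a simple closed-form solution (Ganchev et al., 2010):

$$q^*(Y) = \frac{p_\theta(Y \mid W) \exp(-\lambda^* \cdot \phi(Y))}{Z(\lambda^*)}, \quad (9)$$

where $\lambda^*$ is the solution of

$$\lambda^* = \arg\max_{\lambda \ge 0} \; -b \cdot \lambda - \log Z(\lambda), \qquad Z(\lambda) = \sum_{Y} p_\theta(Y \mid W) \exp(-\lambda \cdot \phi(Y)). \quad (10)$$

In the rest of this section, we first show that the feasible set $Q$ considered above can be reformulated in the form of Eq. (8), and then discuss how to solve Eq. (10). To show that the inequality $R(C, q(Y)) \le r + \theta$ in $Q$ can be formulated in the form of Eq. (8), we multiply both sides by the non-negative denominator of the ratio and set $b = 0$ and

$$\phi_C(Y) = (1 - r - \theta) \sum_{(k,i,j) \in C^+} y_k(i, j) \; - \; (r + \theta) \sum_{(k,i,j) \in C^-} y_k(i, j).$$

Similarly, we can bring $R(C, q(Y)) \ge r - \theta$ into the same form and rewrite $Q$ as

$$Q = \{\, q(Y) \mid \mathbb{E}_{Y \sim q}[\phi(Y)] \le 0 \,\},$$

where $\phi = (\phi_{C_1}, \phi_{C_2}, \ldots, \phi_{C_N})$ is a collection of the constraints. The detailed derivations can be found in Appendix A.

Algorithm 2: Posterior regularization inference. Output: parse trees $\hat{Y}$. The multipliers $\lambda$ are updated based on the gradient $g$ for up to MAX_ITER iterations; then, for each $(k, i, j)$, the arc-level distribution $q_k(i, j)$ is computed as defined by Eq. (12), $\hat{Y}$ is obtained by MAP inference based on $q$, and $\hat{Y}$ is returned.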
We solve Eq. (10) by sub-gradient descent 3 . Since there can be an exponential number of terms in $Z(\lambda)$, we first need to factorize $Z(\lambda)$ from the corpus level to the instance level and the arc level, and then compute the gradient. The technical details are in Appendix A. With the optimal $\lambda^*$, we can compute the feasible distribution $q^*(Y)$ given $p_\theta$ and $W$. Note that the solution in Eq. (9) can also be factorized to the arc level (Eq. (12)), where $q^*_k(i, j)$ denotes the arc-level distribution $q^*(y_k(i, j) = 1)$ satisfying Eq. (3). We then do MAP inference based on $q$, which is again a maximum directed spanning tree problem, the same as before. Algorithm 2 summarizes the process.
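The procedure can be sketched for a single unary constraint under a simplified model. The assumptions are stated up front: each constrained dependent is given an independent distribution over candidate heads (a head-selection stand-in that ignores the tree constraint, so $Z(\lambda)$ factorizes per dependent), the two arc features are obtained by clearing the denominator of the upper and lower ratio bounds with $b = 0$, ROOT arcs are not counted, and all names and probabilities are hypothetical.

```python
import math

def arc_features(h, j, r, margin):
    """(phi_up, phi_low) of the arc h -> j for a constrained dependent j."""
    if h == 0:
        return 0.0, 0.0                       # ROOT arcs are not counted
    if h < j:                                 # arc in C+: head on the left
        return 1.0 - r - margin, -(1.0 - r + margin)
    return -(r + margin), r - margin          # arc in C-: head on the right

def reweight(head_probs, j, lam, r, margin):
    """Arc-level q_j(h) proportional to p_j(h) * exp(-lam . phi(h, j))."""
    weights = {}
    for h, p in head_probs.items():
        up, low = arc_features(h, j, r, margin)
        weights[h] = p * math.exp(-(lam[0] * up + lam[1] * low))
    z = sum(weights.values())
    return {h: w / z for h, w in weights.items()}

def expected_features(dependents, lam, r, margin):
    """E_q[phi], which is the gradient of -log Z(lam) w.r.t. lam."""
    exp_up = exp_low = 0.0
    for j, head_probs in dependents:
        for h, qh in reweight(head_probs, j, lam, r, margin).items():
            up, low = arc_features(h, j, r, margin)
            exp_up += qh * up
            exp_low += qh * low
    return exp_up, exp_low

def posterior_regularization(dependents, r, margin, step=0.5, max_iter=500):
    lam = [0.0, 0.0]
    for _ in range(max_iter):
        g = expected_features(dependents, lam, r, margin)
        # Projected gradient step on the dual; multipliers stay non-negative.
        lam = [max(0.0, l + step * gi) for l, gi in zip(lam, g)]
    return lam

def expected_left_ratio(dependents, lam, r, margin):
    left = right = 0.0
    for j, head_probs in dependents:
        for h, qh in reweight(head_probs, j, lam, r, margin).items():
            if h == 0:
                continue
            if h < j:
                left += qh
            else:
                right += qh
    return left / (left + right)

# Each entry: (position j of a noun, {candidate head position: probability}).
dependents = [
    (2, {1: 0.2, 3: 0.8}),
    (2, {1: 0.3, 3: 0.7}),
    (2, {1: 0.1, 3: 0.9}),
    (2, {1: 0.4, 3: 0.6}),
]
r, margin = 0.75, 0.125
lam = posterior_regularization(dependents, r, margin)
before = expected_left_ratio(dependents, [0.0, 0.0], r, margin)
after = expected_left_ratio(dependents, lam, r, margin)
map_heads = [max(q, key=q.get) for q in
             (reweight(p, j, lam, r, margin) for j, p in dependents)]
print(before, after, map_heads)
```

In this sketch the projected distribution lands on the nearest boundary of the feasible interval rather than at the target ratio itself, which reflects that posterior regularization moves the model distribution only as far as the constraints require.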

Experiments
In this section, we evaluate the proposed algorithms by transferring an English dependency parser to 19 target languages covering 13 language families, including real low-resource languages. We first introduce the experimental setup, including data selection and constraint details, and then discuss the results along with an in-depth analysis.

Setup
Model and Data We train the best-performing SelfAtt-Graph parser proposed in Ahmad et al. (2019) on English and transfer it to 19 target languages in UD Treebank v2.2 (Nivre et al., 2018). 4 The model takes words and predicted POS tags 5 as input, and achieves transfer by leveraging pre-trained multi-lingual FastText (Bojanowski et al., 2017) embeddings that project the word embeddings from different languages into the same space using an offline transformation method (Smith et al., 2017; Conneau et al., 2018). The SelfAtt-Graph model uses a Transformer (Vaswani et al., 2017) with relative position embeddings as the encoder and a deep biaffine scorer (Dozat and Manning, 2017) as the decoder. We follow the setting in Ahmad et al. (2019) to train and tune only on the source language (English) and directly transfer to all the target languages. We modify their decoder to incorporate constraints with the proposed constrained inference algorithms during the transfer phase, without retraining the model. All the model hyper-parameters are specified in Appendix Table 4, and the hyper-parameters for the inference algorithms are in Appendix Table 5.
Constraints We consider two types of constraints: 1) instance-level projective constraints for avoiding crossing arcs in the dependency trees, and 2) corpus-statistics constraints constructed by the process described in Section 2.2. We consider the following three corpus-statistics constraints: C1 = (NOUN), C2 = (NOUN, ADP), C3 = (NOUN, ADJ). Intuitively, C1 concerns the ratio of nouns being on the right of their heads; C2 concerns the ratio of nouns being on the left of adpositions among all noun-adposition arcs; C3 concerns the ratio of nouns being on the left of adjectives among all noun-adjective arcs.
The binary constraints C2 and C3 can be directly compiled from WALS features 85A and 87A, respectively. We encode a "dominant order" specified in WALS as the ratio always being greater than 0.75 (i.e., r = 0.875 and θ = 0.125). If there is no dominant order or the feature is missing, we set r = 0.5 and θ = 0.25. Some WALS features, such as 82A and 83A, are also about word order, but we would need to specify the arc types to utilize them. For simplicity, we only consider forming constraints from the POS tags in this paper. To estimate the ratio for the unary constraint C1, we use the WALS features 82A, 83A, 85A, 86A, 87A, 88A, and 89A, which are related to NOUN, to form feature vectors, and fit a regression on the languages in the test set other than the target language to predict the constraint ratio. This process guarantees that the target language remains unseen during ratio estimation. The ratios of the regression training languages are estimated by sampling 100 sentences from the training set of each language. We also consider an oracle setting in which we collect a "ground-truth" ratio of each constraint for the target language to estimate an upper bound for our inference algorithms. In the oracle setting, we estimate the ratio on the whole training corpus of the target language and set the margin to θ = 0.01.
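The encoding of a WALS value into a ratio and a margin can be sketched as below. The values for a dominant order (r = 0.875, θ = 0.125) and for a missing or non-dominant order (r = 0.5, θ = 0.25) come from the text; the mirrored case (r = 0.125 when the opposite order dominates), the function name, and the two lookup tables are assumptions of this sketch.

```python
# "first" means POS_1 (NOUN) dominantly precedes POS_2; "second" means the reverse.
WALS_85A_TO_ORDER = {"Postpositions": "first", "Prepositions": "second"}     # C2 = (NOUN, ADP)
WALS_87A_TO_ORDER = {"Noun-Adjective": "first", "Adjective-Noun": "second"}  # C3 = (NOUN, ADJ)

def compile_binary_constraint(wals_value, value_to_order):
    """Return (r, margin) for a binary constraint.
    Any value not in the table (no dominant order, missing feature) is undecided."""
    order = value_to_order.get(wals_value)
    if order == "first":
        return 0.875, 0.125   # ratio constrained to [0.75, 1.0]
    if order == "second":
        return 0.125, 0.125   # mirrored case (assumption): [0.0, 0.25]
    return 0.5, 0.25          # undecided: [0.25, 0.75]

for value in ("Postpositions", "Prepositions", "No dominant order", None):
    print(value, compile_binary_constraint(value, WALS_85A_TO_ORDER))
```

The wide margin in the undecided case keeps the constraint loose, so it only rules out parses in which one order is overwhelmingly dominant.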

Parsing Performances
We first compare the performance of the cross-lingual dependency parser with and without constraints. Table 1 shows the results for the 19 target languages we selected, 6 along with the performance on the source language (English). The performance on English is not as high as that of dependency parsers specialized for English because, to achieve transfer, we have to freeze the pre-trained multi-lingual word embeddings. Yet this parser achieved the best single-source transfer performance according to Ahmad et al. (2019).
As shown in Table 1, the improvements from our constrained inference algorithms are dramatic in a few languages whose word order features are very distinct from those of the source language. For example, the parsing performance on Hindi (hi) improves by about 15% in UAS with WALS features under both Lagrangian relaxation and posterior regularization inference. The improvements are less pronounced for languages in the same family as English, such as Danish (da) and Dutch (nl). This is expected, as the corpus linguistic statistics of these languages are similar to those of English, and thus the constraints are mostly satisfied by the baseline parser. Comparing Lagrangian relaxation and posterior regularization, we find that posterior regularization is more robust and less sensitive to errors in the corpus-statistics estimation, while Lagrangian relaxation gives a higher improvement on average. Overall, the two proposed constrained inference algorithms improve the transfer performance by 3.5% and 3.1% UAS on average over the 19 target languages.
For languages like Finnish (fi) and Estonian (et), the WALS setting works even better than the oracle. We suspect that the reason is the large margin we set in the WALS setting. When the estimated corpus statistic differs from the real ratio in the test set, the large margin relaxes the constraints and could thus result in better performance.
Discussion. Although the main experiments and analysis are conducted using English as the only source language, our approach is general and places no restriction on the choice of the source language(s). To verify this claim, we run experiments with Hebrew as the source language. Under the oracle setting, Lagrangian relaxation and posterior regularization improve the baseline by 4.4% and 4.1%, respectively.
We observed that if we compile WALS features into hard constraints (i.e., set r = 0 or r = 1), the constrained inference framework improves performance on only half of the languages. For example, in Estonian (et), the performance drops by about 3%. This is because WALS only provides the dominant order. Therefore, treating WALS features as hard constraints introduces errors into the inference.
Finally, we assume that if we have access to native speakers, the corpus statistics can be estimated from a few partial annotations of parse trees. In our simulation, using fewer than 300 arcs, we can achieve the same performance as with the oracle.

Contributions of Individual Constraints
We analyze the contribution of each constraint in Table 2. Here we use the oracle setting to reduce the noise introduced by corpus-statistics estimation errors. The results are based on Lagrangian relaxation inference. As shown in Table 2, despite the presence of non-projective dependencies, we observed performance improvements on almost all the languages when the projective constraint is enforced. All the constraints we formulated contribute positively to the performance improvements. C1 = (NOUN) brings the largest gain, probably because it has the widest coverage.
Table 1 shows that the performance on Hindi improves from 34% to over 51% in UAS for both inference algorithms. To better understand where the improvements come from, we conduct an analysis to break down the contribution of each individual constraint for Hindi. Table 3 shows the results. We can see that since the corpus linguistic statistics of Hindi and English are distinct, the baseline model achieves only low performance. With the constrained inference, especially the postposition constraint (C2), the proposed inference algorithm brings significant improvement.
Table 3: Contribution of individual constraints and their statistics in Hindi. The second column lists the ratios estimated from the oracle in English / the baseline in Hindi / the oracle in Hindi, respectively. The improvement is measured in UAS. The improvement of constraints is computed in the same way as in Table 2.
To verify the effectiveness of the constraints, we analyze the relation between the performance improvements and the gaps in corpus-statistics ratios between the source and the target languages. To quantify the ratio gap, we weight the constraints by their coverage rate and compute the weighted average of the ratio difference between the source and target languages. The results show that the performance improvement is highly correlated with the ratio gap: the Pearson correlation coefficient is 0.938. The figure showing the correlation between the performance gap (in UAS) and the corpus-statistics ratio gap is in Appendix Figure 3.
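The coverage-weighted ratio gap and the correlation can be computed as in the sketch below. The numbers are made up purely to exercise the functions and are not the paper's data; using the absolute difference as the per-constraint gap and the helper names are assumptions.

```python
import math

def weighted_ratio_gap(source_ratios, target_ratios, coverage):
    """Coverage-weighted average of |r_source - r_target| over the constraints."""
    total = sum(coverage)
    return sum(c * abs(s - t)
               for s, t, c in zip(source_ratios, target_ratios, coverage)) / total

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

# Hypothetical ratios for two constraints with coverage weights 3 and 1.
gap = weighted_ratio_gap([0.9, 0.1], [0.3, 0.1], [3.0, 1.0])

# Hypothetical (ratio gap, UAS improvement) pairs for a few target languages.
gaps = [0.05, 0.10, 0.30, 0.45]
improvements = [0.2, 1.0, 6.5, 12.0]
print(gap, pearson(gaps, improvements))
```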

Related Work
Cross-Lingual Transfer for Parsing Many approaches have been developed to transfer a dependency parser. However, they mainly focus on better capturing information from the source language(s).
Constrained Inference for Parsing Several previous studies show that adding constraints at inference time improves the performance of models. Grave and Elhadad (2015) incorporate constraints to promote popular types of arcs in an unsupervised setting. Naseem et al. (2010) train a parser with constraints compiled from the frequency of particular arcs. Compared with the previous work, we focus on cross-lingual transfer with word order constraints.
Finally, prior studies have noticed that word order information is significant for parsing and use it as features (Ammar et al., 2016; Naseem et al., 2012; Rasooli and Collins, 2017; Zhang and Barzilay, 2015; Dryer, 2007). It has further been proposed to decompose these features from models for adapting to target languages. Wang and Eisner (2018a) use the statistics of surface part-of-speech (POS) tags of target languages to learn the word order. Wang and Eisner (2018b) use POS tags of target languages together with a similar language, and design a stochastic permutation process to synthesize the word order. However, none of them consider using the word order features as constraints.

Incorporating Constraints In NLP Tasks
Constraints are widely incorporated in a variety of NLP tasks. To name a few, Roth and Yih (2004) propose to formulate constrained inference in NLP as integer linear programming problems. To solve the intractable structure, Rush and Collins (2012) decompose the structure and incorporate constraints on some composite tasks. To improve the performance of a model, Chang and Collins (2011) and Peng et al. (2015) incorporate constraints on exact decoding tasks and inference tasks on graphical models, and Chang et al. (2013), Dalvi (2015), and Martins (2015) incorporate corpus-level constraints on semi-supervised multilabel classification and coreference resolution. Zhao et al. (2017) incorporate corpus-level constraints to avoid amplifying gender bias in visual semantic role labeling and multilabel classification. In contrast to previous work, we incorporate corpus-level constraints to facilitate dependency parsing in the cross-lingual transfer setting.

Conclusion
We propose to leverage corpus linguistic statistics to guide the inference of cross-lingual dependency parsing. We compile these statistics into corpus-statistics constraints and design two inference algorithms on top of a graph-based parser, based on Lagrangian relaxation and posterior regularization. Experiments on 19 languages show that our approach substantially improves the performance of the cross-lingual parser. In the future, we plan to study the design and incorporation of fine-grained constraints that consider multiple languages for cross-lingual transfer. We also plan to adapt this constrained inference framework to other cross-lingual structured prediction problems, such as semantic role labeling.