A Regularization-based Framework for Bilingual Grammar Induction

Grammar induction aims to discover syntactic structures from unannotated sentences. In this paper, we propose a framework in which the learning process of the grammar model of one language is influenced by knowledge from the model of another language. Unlike previous work on multilingual grammar induction, our approach does not rely on any external resources, such as parallel corpora, word alignments, or linguistic phylogenetic trees. We propose three regularization methods that encourage similarity between model parameters, dependency edge scores, and parse trees, respectively. We deploy our methods on a state-of-the-art unsupervised discriminative parser and evaluate them on both transfer grammar induction and bilingual grammar induction. Empirical results on multiple languages show that our methods outperform strong baselines.


Introduction
Syntactic parsing is an important task in natural language processing. Supervised parsing requires manual labeling of gold parse trees, which is very labor-intensive. On the other hand, unsupervised parsing (a.k.a. grammar induction) does not require labeled data and can make use of the large amounts of unlabeled data that are freely available. However, grammar induction is very challenging and its accuracy is still far below that of supervised parsing. To compensate for the lack of supervision in grammar induction, some previous work considers multilingual grammar induction, i.e., simultaneously learning grammars of multiple languages (Snyder et al., 2009; Berg-Kirkpatrick and Klein, 2010; Liu et al., 2013). Existing multilingual approaches require external resources such as parallel corpora, word alignments, and linguistic phylogenetic trees.

In this paper, we aim at bilingual grammar induction without external resources. We are motivated by our observation that learning the unsupervised Convex-MST model (Grave and Elhadad, 2015) on the English corpus and then directly applying it to parse other languages produces surprisingly good results (Table 1). From the table, we can see that even with this simplistic method (which we call direct transfer), the dependency accuracy on each language is often very close to the accuracy of the model specifically trained on the corpus of that language. For Swedish, the accuracy of direct transfer is even better than that of the specifically trained model. This surprising result suggests that grammars of different languages, even those from different language families (e.g., English and Japanese), may have non-trivial similarity that can be helpful in bilingual grammar induction.

* Yong Jiang contributed to this work when at ShanghaiTech University. Kewei Tu is the corresponding author.
Inspired by this observation, we propose a regularization-based framework for bilingual grammar induction that encourages knowledge sharing between models learned on a language pair. We build our framework on top of Convex-MST, a state-of-the-art unsupervised dependency parser, and propose three regularization terms that encourage similarity between model parameters, edge scores, and parse trees, respectively. We test our methods on ten languages on the tasks of transfer grammar induction and bilingual grammar induction and show that they achieve a significant boost over strong baselines.

Unsupervised Dependency Parsing
Dependency parsing is the task of mapping an input sentence x = x_1, x_2, ..., x_n of length n to an output dependency structure y. A dummy root x_0 is typically added at the beginning of the sentence to serve as the head of the dependency tree. There are several ways to represent the parse tree y. In transition-based dependency parsers, the dependency tree is regarded as a sequence of actions. In graph-based dependency parsers, the dependency tree is represented as a spanning tree of a graph. In chart-based parsers (a.k.a. grammar-based parsers), the dependency tree is denoted as a set of grammar rules. In unsupervised graph-based dependency parsing, since gold trees are not available, carefully designed models and objective functions are required for learning. Regardless of the model architecture, current unsupervised dependency models usually use an objective function of the following form:

min_w Σ_{x ∈ X} O_{y ∈ Y} D(y, x; w) + R(w)

where Y is the set of all possible dependency trees, w is the model parameter vector, X is the unlabeled training corpus, D measures the discrepancy between the parse y and the model prediction on sentence x, R(w) is the regularization term on w, and O ∈ {min, Σ} is an operator. Table 2 shows the choices of O, D and R for several widely used models.
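As a toy illustration of this objective form, the sketch below evaluates the inner term O_{y ∈ Y} D(y, x; w) + R(w) for a single sentence under both choices of O. It enumerates candidate trees explicitly, which real parsers replace with dynamic programming; the function and its inputs are illustrative, not part of any of the cited models.

```python
import math

def objective(tree_losses, reg, mode="min"):
    """Inner objective for one sentence: O_{y in Y} D(y, x; w) + R(w).

    tree_losses: one D value per candidate tree y in Y.
    mode: "min" keeps only the best tree (Viterbi-style); "sum" treats each
    D as a negative log probability and marginalizes over all trees.
    """
    if mode == "min":
        inner = min(tree_losses)
    else:
        # -log sum_y exp(-D(y, x; w)), i.e. a negative log marginal likelihood
        inner = -math.log(sum(math.exp(-d) for d in tree_losses))
    return inner + reg
```

Note that the "sum" operator always yields a value no larger than "min", since marginalizing over trees can only lower the negative log likelihood.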

Graph based Dependency Parsing
In this paper, we focus on graph-based dependency parsers, though we believe that our approaches can be generalized to other types of parsers. Previous work on unsupervised graph-based dependency parsing utilizes autoencoder structures (Cai et al., 2017) or discriminative clustering techniques (Grave and Elhadad, 2015).
Following McDonald et al. (2005), we use a discriminative model for dependency parsing with first-order factorization, such that the score of a dependency tree y is the sum of the scores of its dependency edges. The score of an edge from word x_h to word x_m, s_w(x, h, m), is computed as the inner product of a feature vector f(x, h, m) and a parameter vector w. The optimal dependency tree for sentence x can be found in polynomial time (Eisner, 1996; McDonald et al., 2005).
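The first-order factorization above can be sketched as follows; the feature dictionary and head map are hypothetical stand-ins for the model's actual feature function and tree representation:

```python
import numpy as np

def edge_score(w, f):
    """s_w(x, h, m): inner product of feature vector f(x, h, m) and weights w."""
    return float(np.dot(w, f))

def tree_score(w, features, heads):
    """First-order factorization: a tree's score is the sum of its edge scores.
    heads[m] = h means word m is attached to head h (0 is the dummy root);
    features[(h, m)] stands in for the feature vector f(x, h, m)."""
    return sum(edge_score(w, features[(h, m)]) for m, h in heads.items())
```

The highest-scoring tree under such edge scores can then be recovered in polynomial time, e.g., with Eisner's algorithm for projective trees or the Chu-Liu-Edmonds maximum spanning tree algorithm in the non-projective case.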

Bilingual Knowledge Sharing
Given non-parallel corpora X_s and X_t of two languages, our goal is to learn two models with parameters w_s and w_t for the two languages. The simplest learning objective is the sum of the two monolingual objectives, J(w_s; X_s) + J(w_t; X_t), which contains no interaction between the two models.
As suggested by our empirical observation in Table 1, the model of one language may provide a useful inductive bias in learning the model of another language. Note that given a sentence, a graph-based dependency parser has three levels of representations: the model parameters, the scores of dependency edges computed from the parameters, and the parse tree computed from the edge scores. Therefore, we propose three different regularization terms to effectively encourage similarity of the two models. An example is shown in Figure 1.

Regularization of Weight Parameters (W-Reg)
Motivated by the approach of Berg-Kirkpatrick and Klein (2010), we encourage similarity between the two weight vectors w_s and w_t, measured by the squared l2 distance:

R_W(w_s, w_t) = λ ||w_s - w_t||^2

Table 2: Choices of O, D and R for several widely used models.

Parsers                                  O       D                                       R
DMV (Klein and Manning, 2004)            Σ       negative log likelihood                 -
Convex-MST (Grave and Elhadad, 2015)     min     l2 distance                             l2 norm
LC-DMV (Noji et al., 2016)               Σ       negative log likelihood                 l2 norm
NDMV (Jiang et al., 2016)                Σ, min  negative log likelihood                 -
CRFAE (Cai et al., 2017)                 min     negative conditional log likelihood     l1 norm
D-NDMV (Han et al., 2019)                Σ, min  negative (conditional) log likelihood   -

Regularization on Edge Scores (E-Reg)

Directly encouraging weight similarity might impose an inductive bias that is too strong, because differences between the two languages (e.g., different word orders) may give each feature dimension a different meaning. Therefore, we propose to encourage similarity between the scores computed by the two models for each dependency edge of each sentence, which can be seen as a soft version of weight regularization:

R_E(w_s, w_t) = λ Σ_x ||G_{w_s}(x) - G_{w_t}(x)||^2

where G_w(x) = [s_w(x, h, m)]_{h,m} is the weighted dependency graph of sentence x.
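A minimal numpy sketch of the two regularizers, assuming each row of `feats` holds a feature vector f(x, h, m) for one candidate edge of a sentence (the names and shapes are illustrative):

```python
import numpy as np

def w_reg(ws, wt, lam):
    """W-Reg: penalize the squared l2 distance between the two weight vectors."""
    return lam * float(np.sum((ws - wt) ** 2))

def e_reg(ws, wt, feats, lam):
    """E-Reg: for one sentence, penalize the difference between the edge
    scores the two models assign to every candidate edge, i.e. between the
    weighted dependency graphs G_ws(x) and G_wt(x)."""
    gs = feats @ ws  # edge scores under the model of language s
    gt = feats @ wt  # edge scores under the model of language t
    return lam * float(np.sum((gs - gt) ** 2))
```

Because E-Reg compares edge scores rather than raw weights, two models can disagree on individual feature weights and still incur no penalty as long as their edge scores match, which is why it acts as a softer constraint than W-Reg.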

Regularization on Parse Trees (T-Reg)
Another alternative is to encourage similarity between the parse trees predicted by the two models. Motivated by the idea of knowledge distillation (Kim and Rush, 2016), in the learning objective of each model we add a fourth term that encourages the parse tree of each sentence to be close to the prediction of the other model. The objective function for w_s is thus the original Convex-MST objective augmented with a term, weighted by λ, that penalizes the distance between the parse y of each sentence and the tree predicted by the model with parameters w_t; the objective for w_t is symmetric.
We apply these regularization methods to the Convex-MST model (Grave and Elhadad, 2015). Our three objective functions can be optimized with coordinate descent in a similar way to Convex-MST: in each iteration, we first fix the parse y of each training sentence and update the parameters w_s and w_t by stochastic gradient descent; we then fix w_s and w_t and update the parse y of each sentence with the Frank-Wolfe algorithm. While all three methods are applicable to any pair of languages, intuitively one may use weight regularization only for similar languages, and edge regularization or tree regularization for an arbitrary language pair.
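To make the alternating scheme concrete, here is a sketch of T-Reg (using a simple Hamming-style tree distance, one possible instantiation of a distance between parses) and of one coordinate-descent iteration; `parse` and `grad` are hypothetical stand-ins for the model's Frank-Wolfe parsing step and objective gradient:

```python
import numpy as np

def t_reg(heads, heads_other, lam):
    """T-Reg for one sentence: count head choices that disagree with the
    parse predicted by the other language's model."""
    return lam * sum(1 for m in heads if heads[m] != heads_other[m])

def coordinate_descent_step(ws, wt, corpus_s, corpus_t, parse, grad, lr=0.1):
    """One iteration of the alternation: with the weights fixed, update the
    parse of every sentence, then with the parses fixed, take stochastic
    gradient steps on the weights of both models."""
    trees_s = [parse(ws, x) for x in corpus_s]  # fix w, update y
    trees_t = [parse(wt, x) for x in corpus_t]
    for x, y in zip(corpus_s, trees_s):         # fix y, update w
        ws = ws - lr * grad(ws, x, y)
    for x, y in zip(corpus_t, trees_t):
        wt = wt - lr * grad(wt, x, y)
    return ws, wt
```

In the actual model the parsing step solves a relaxed tree optimization with Frank-Wolfe rather than an exact search, but the overall alternation has this shape.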

Experiments
To enable direct comparison with the Convex-MST model, we use the dataset used in its paper (Grave and Elhadad, 2015), the Universal Treebanks version 2.0, introduced by McDonald et al. (2013). The dataset contains ten different languages belonging to five diverse families. In addition, we test our methods on twelve languages from the more recent UD Treebank 1.4, which is also used in previous grammar induction work (Li et al., 2019). Following previous work, we train all the models on the gold POS tags of sentences no longer than ten words. We tune hyper-parameters on the development set and report the directed dependency accuracy (DDA) on test sentences no longer than ten words and on all test sentences. As our goal is to investigate the benefits of our regularization methods, the two hyper-parameters µ and β of Convex-MST are tuned on the English development set and then fixed (µ = 0.1, β = 0.001), while λ is selected from {10, 5, 1, 5e-1, 1e-1, 5e-2, 1e-2, 5e-3, 1e-3, 1e-4} for each language pair.

Experiments on Transfer Grammar Induction
In transfer grammar induction, we train the first model on the first language independently of the second language; then, with the first model fixed, we optimize our knowledge sharing objective with respect to the second model; finally, we evaluate the second model on the test set of the second language. In this way, we test whether our methods can transfer useful linguistic knowledge from the first language to the second. We report the results of transfer grammar induction from English to the other nine languages in Table 3. Our edge regularization and tree regularization methods outperform the Convex-MST baseline in almost all cases. The weight regularization method achieves worse results than Convex-MST except for Swedish, which demonstrates that directly regularizing weight parameters may not work well in the transfer grammar induction task. For Swedish, although direct transfer already achieves better performance than Convex-MST, our regularization methods can further boost the performance by a large margin. Indonesian is the only language for which transfer grammar induction provides no benefit, possibly because of its significant syntactic difference from English. Our additional experimental results on UD Treebank 1.4 show a similar trend.

Table 4: Results of bilingual grammar induction on test sentences no longer than 10 (except the Avg-All row, which shows the average accuracies on all sentences). BASE refers to the individually trained baseline. COMB refers to learning a single model from the combined training set of the two languages. *: for the UD 1.4 dataset we show the average results.
We perform transfer grammar induction from English to Swedish with different values of λ and show the results in Figure 2. We can see that the impact of different hyper-parameter values on the accuracy generally follows the same tendency for our three methods.

Experiments on Bilingual Grammar Induction
In bilingual grammar induction, we jointly train the two models on the two languages. In our experiments, we pair English with each of the other nine languages. The results are reported in Table 4. In most cases, joint training leads to better accuracies than both the individually trained models and the single model learned from the combined training set. Comparing Table 4 with Table 3, we can also see that bilingual joint training leads to better accuracies than transfer grammar induction, which shows the benefit of training the two models simultaneously rather than sequentially. Again, our additional experimental results on UD Treebank 1.4 show a similar trend.

Related Work
Our work is related to several lines of previous work.
Unsupervised Transfer Learning There has been previous work on solving an unsupervised learning task in a target domain with the help of knowledge learned from a source domain (Dai et al., 2008; Wang et al., 2008; Pan and Yang, 2010). In this setting, there is no labeled data in either the source or the target domain during training. Our transfer grammar induction setting can be seen as an instance of unsupervised transfer learning.

Cross-lingual Supervised Dependency Parsing
This task focuses on learning a parser from unlabeled training data with the help of additional labeled training data of a second language (McDonald et al., 2011; Naseem et al., 2012; Guo et al., 2015). The main difference is that our approach is fully unsupervised and does not utilize external information such as word alignments or cross-lingual word embeddings.
Other Approaches to Multilingual Grammar Induction To the best of our knowledge, this task was first proposed by Kuhn (2004), who assumed that the syntax trees induced from parallel sentences share structural regularities and utilized word alignments to guide parsing. Since then, many approaches have been proposed for both constituency grammar induction and dependency grammar induction (Snyder et al., 2009; Berg-Kirkpatrick and Klein, 2010). We differ from these approaches in that we do not make use of any external rules or knowledge.

Conclusion
In this paper, we propose three regularization-based knowledge sharing methods for bilingual grammar induction. We test our methods on transfer grammar induction and bilingual grammar induction and show that they achieve better performance than the baselines. In future work, we plan to investigate the effectiveness of our approach in other types of induction tasks.