Combining Generative and Discriminative Approaches to Unsupervised Dependency Parsing via Dual Decomposition

Unsupervised dependency parsing aims to learn a dependency parser from unannotated sentences. Existing work focuses on either learning generative models using the expectation-maximization algorithm and its variants, or learning discriminative models using discriminative clustering. In this paper, we propose a new learning strategy that trains a generative model and a discriminative model jointly based on the dual decomposition method. Our method is simple and general, yet effective at capturing the advantages of both models and improving their learning results. We tested our method on the UD treebank and achieved state-of-the-art performance on thirty languages.


Introduction
Dependency parsing is an important task in natural language processing. It identifies dependencies between words in a sentence, which have been shown to benefit other tasks such as semantic role labeling (Lei et al., 2015) and sentence classification (Ma et al., 2015). Supervised learning of a dependency parser requires annotation of a training corpus by linguistic experts, which can be time- and resource-consuming. Unsupervised dependency parsing eliminates the need for dependency annotation by learning directly from unparsed text.
Previous work on unsupervised dependency parsing mainly focuses on learning generative models, such as the dependency model with valence (DMV) (Klein and Manning, 2004) and combinatory categorial grammars (CCG) (Bisk and Hockenmaier, 2012). Generative models have many advantages. For example, the learning objective function can be defined as the marginal likelihood of the training data, which is typically easy to compute in a generative model. In addition, many types of inductive bias, such as those favoring short dependency arcs (Smith and Eisner, 2006), encouraging correlations between POS tags (Cohen et al., 2008; Cohen and Smith, 2009; Berg-Kirkpatrick et al., 2010; Jiang et al., 2016), and limiting center embedding (Noji et al., 2016), can be incorporated into generative models to achieve better parsing accuracy. However, because of the strong independence assumptions in most generative models, it is difficult for these models to utilize the context information that has been shown to benefit supervised parsing.
Recently, a feature-rich discriminative model for unsupervised parsing was proposed that captures global context information of sentences (Grave and Elhadad, 2015). Inspired by discriminative clustering, learning of the model is formulated as convex optimization over both the model parameters and the parses of the training sentences. By utilizing language-independent rules between pairs of POS tags to guide learning, the model achieves state-of-the-art performance on the UD treebank dataset.
In this paper we propose to jointly train two state-of-the-art models of unsupervised dependency parsing: a generative model, LC-DMV (Noji et al., 2016), and a discriminative model, Convex-MST (Grave and Elhadad, 2015). We employ a learning algorithm based on the dual decomposition (Dantzig and Wolfe, 1960) inference algorithm, which encourages the two models to influence each other during training.
We evaluated our method on thirty languages and found that the jointly trained models surpass their separately trained counterparts in parsing accuracy. Further analysis shows that the two models positively influence each other during joint training by implicitly sharing inductive biases.

DMV
The dependency model with valence (DMV) (Klein and Manning, 2004) is the first generative model to outperform the left-branching baseline in unsupervised dependency parsing. In DMV, a sentence is generated by recursively applying three types of grammar rules to construct a parse tree from the top down. The probability of the generated sentence and parse tree is the product of the probabilities of all the rules used in the generation process. To learn the parameters (rule probabilities) of DMV, the expectation-maximization algorithm is often used. Noji et al. (2016) exploited two universal syntactic biases in learning DMV: restricting the center-embedding depth and encouraging short dependencies. They achieved performance comparable to state-of-the-art approaches.
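As a concrete illustration of the rule-product scoring just described, the following Python sketch scores a two-arc parse. The tags and probabilities are toy values rather than learned DMV parameters, and DMV's stop/continue rules are omitted for brevity, so only attachment rules contribute to the score.

```python
import math

# Toy attachment-rule table: (head tag, child tag, direction) -> probability.
# These numbers are illustrative, not learned parameters.
attach_prob = {
    ("ROOT", "VERB", "right"): 0.9,
    ("VERB", "NOUN", "left"): 0.7,
    ("VERB", "NOUN", "right"): 0.6,
}

def tree_log_prob(arcs):
    """Log-probability of a parse: sum of log rule probabilities over its arcs."""
    return sum(math.log(attach_prob[arc]) for arc in arcs)

# ROOT generates a VERB to its right; the VERB attaches a NOUN to its left.
arcs = [("ROOT", "VERB", "right"), ("VERB", "NOUN", "left")]
score = tree_log_prob(arcs)  # log(0.9 * 0.7)
```

Because the model multiplies rule probabilities, working in log space as above avoids underflow on longer sentences.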

Convex-MST
Convex-MST (Grave and Elhadad, 2015) is a discriminative model for unsupervised dependency parsing based on the first-order maximum spanning tree dependency parser (McDonald et al., 2005). Given a sentence, whether each possible dependency exists is predicted based on a set of handcrafted features, and a valid parse tree closest to the prediction is identified by the minimum spanning tree algorithm.
For each sentence x, a first-order dependency graph is built over the words of the sentence. The weight of each edge is w^T f(x, i, j), where w is the parameter vector and f(x, i, j) is the handcrafted feature vector of the dependency from the i-th word to the j-th word of sentence x. For a sentence x of length n, we can represent it as a matrix X in which each row is a feature vector. A parse tree y is a spanning tree of the graph and can be represented as a binary vector of length n × n in which each element is 1 if the corresponding arc is in the tree and 0 otherwise.
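The edge-weighting step above can be sketched as follows. The feature vectors here are random stand-ins for the handcrafted features of Grave and Elhadad (2015), so only the shapes and the w^T f(x, i, j) computation reflect the model:

```python
import numpy as np

n, d = 4, 8                          # sentence length, feature dimension
rng = np.random.default_rng(0)

# feats[i, j] stands in for the handcrafted feature vector f(x, i, j)
# of the candidate arc from word i to word j.
feats = rng.normal(size=(n, n, d))
w = rng.normal(size=d)               # model parameters

# Weight matrix of the first-order dependency graph: W[i, j] = w^T f(x, i, j).
W = feats @ w
np.fill_diagonal(W, -np.inf)         # a word cannot head itself
```

A spanning-tree decoder then selects, from this weight matrix, the highest-scoring valid tree over the n words.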
Learning is based on discriminative clustering with the following objective function:

$$\min_{w,\; y_\alpha \in \mathcal{Y}_\alpha} \; \frac{1}{N} \sum_{\alpha=1}^{N} \left( \frac{1}{2} \left\| y_\alpha - X_\alpha w \right\|^2 - \mu\, v^T y_\alpha \right) + \frac{\lambda}{2} \left\| w \right\|^2$$

where X_α is a matrix in which each row is the feature representation f(x_α, i, j) of an edge in the dependency graph of sentence x_α, v indicates whether each dependency arc in y_α satisfies a set of pre-specified linguistic rules, and λ and μ are hyperparameters. The Frank-Wolfe algorithm is employed to optimize this objective.
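The Frank-Wolfe algorithm only requires a linear minimization oracle over the feasible set, which is what makes it suitable here. The following toy sketch, which is not the paper's implementation, minimizes a quadratic over the probability simplex, where the oracle is a vertex pick with exact line search; for Convex-MST the oracle would instead be a spanning-tree solver over the relaxed parse polytope:

```python
import numpy as np

def frank_wolfe(t, iters=200):
    """Minimize f(p) = ||p - t||^2 over the probability simplex (toy example)."""
    d = len(t)
    p = np.ones(d) / d                       # start at the simplex center
    for _ in range(iters):
        grad = 2.0 * (p - t)
        s = np.zeros(d)
        s[np.argmin(grad)] = 1.0             # linear oracle: best simplex vertex
        diff = p - s
        denom = diff @ diff
        if denom < 1e-12:                    # already at the oracle vertex
            break
        # Exact line search for the quadratic, clipped to a convex step.
        gamma = np.clip(((p - t) @ diff) / denom, 0.0, 1.0)
        p = (1.0 - gamma) * p + gamma * s
    return p

t = np.array([0.2, 0.3, 0.5])
p = frank_wolfe(t)
```

Each iterate is a convex combination of feasible points, so feasibility is maintained for free; this is the property that lets Frank-Wolfe handle the combinatorial parse polytope without projection steps.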

Dual Decomposition
Dual decomposition (Dantzig and Wolfe, 1960), a special case of Lagrangian relaxation, is an optimization method that decomposes a hard problem into several smaller sub-problems. It has been widely used in machine learning (Komodakis et al., 2007) and natural language processing (Koo et al., 2010; Rush and Collins, 2012). Komodakis et al. (2007) proposed using dual decomposition for MAP inference in Markov random fields. Koo et al. (2010) proposed a new dependency parser based on dual decomposition that combines a graph-based dependency model with non-projective head automata. Rush et al. (2010) showed that dual decomposition can effectively integrate two lexicalized parsing models or two correlated tasks.

Agreement based Learning
Liang et al. (2008) proposed agreement-based learning, which trains several tractable generative models jointly and encourages them to agree on certain latent variables. To effectively train the system, a product EM algorithm is used. They showed that the joint model can outperform each independent model in accuracy or convergence speed. They also showed that the objective function of Klein and Manning (2004) is a special case of the product EM algorithm for grammar induction. Our approach has a similar motivation to agreement-based learning but has two important advantages. First, while their approach only combines generative models, ours can make use of both generative and discriminative models. Second, while their approach requires the sub-models to share the same dynamic programming structure when performing decoding, ours has no such restriction.

Joint Training
We minimize the following objective function, which combines two different models of unsupervised dependency parsing:

$$\min_{M_F,\, M_G} \; \sum_{\alpha=1}^{N} \min_{y_\alpha \in \mathcal{Y}_\alpha} \Big( F(x_\alpha, y_\alpha; M_F) + G(x_\alpha, y_\alpha; M_G) \Big)$$

where N is the size of the training data, M_F and M_G are the parameters of the first and second model respectively, F and G are their respective learning objectives, and Y_α is the set of valid dependency parses of sentence x_α. While in principle this objective can be used to combine many different types of models, here we consider two state-of-the-art models of unsupervised dependency parsing: a generative model, LC-DMV (Noji et al., 2016), and a discriminative model, Convex-MST (Grave and Elhadad, 2015). We denote the parameters of LC-DMV by Θ and the parameters of Convex-MST by w. Their respective objective functions are

$$F(x_\alpha, y_\alpha; \Theta) = -\log \big( P_\Theta(x_\alpha, y_\alpha)\, f(x_\alpha, y_\alpha) \big)$$

$$G(x_\alpha, y_\alpha; w) = \frac{1}{2} \left\| y_\alpha - X_\alpha w \right\|^2 + \frac{\lambda}{2} \left\| w \right\|^2 - \mu\, v^T y_\alpha$$

where P_Θ(x_α, y_α) is the joint probability of sentence x_α and parse y_α, f is a constraint factor, and the notation of the second objective function is explained in Section 2.2.

Learning
We use coordinate descent to optimize the parameters of the two models. In each iteration, we first fix the parameters and find the best dependency parses of the training sentences (see Section 3.2); we then fix the parses and optimize the parameters. The detailed procedure is shown in Algorithm 1.
Pretraining of the two models is done by running their original learning algorithms separately. When the parses of the training sentences are fixed, it is easy to show that the parameters of the two models can be optimized separately. Updating the parameters Θ of LC-DMV can be done by simply counting the number of times each rule is used in the parse trees and then normalizing the counts to obtain the maximum-likelihood probabilities. The parameters w of Convex-MST can be updated by stochastic gradient descent. After updating Θ and w at each iteration, we additionally train each model separately for three iterations, which we find further improves learning. To solve the decoding step, we employ the dual decomposition algorithm (shown in Algorithm 2), where τ represents the step size.
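The count-and-normalize update for Θ can be sketched as follows; the rule representation here (a context string paired with an outcome) is a hypothetical simplification of DMV's rule inventory:

```python
from collections import Counter

# M-step sketch: with the parses fixed, the maximum-likelihood estimate of
# each rule probability is its relative frequency among all rules that share
# the same conditioning context.
def mle_rule_probs(rule_counts):
    """rule_counts maps (context, outcome) pairs to usage counts in the parses."""
    totals = Counter()
    for (context, _outcome), count in rule_counts.items():
        totals[context] += count
    return {rule: count / totals[rule[0]] for rule, count in rule_counts.items()}

# Toy counts: a VERB attached a NOUN to its right 3 times and an ADJ once.
counts = Counter({("VERB-right", "NOUN"): 3, ("VERB-right", "ADJ"): 1})
probs = mle_rule_probs(counts)
```

Each conditioning context ends up with a proper probability distribution over outcomes, which is exactly what the generative model needs for the next decoding step.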
The most important part of the algorithm is solving the two separate decoding problems. The first decoding problem can be solved by a modified CYK parsing algorithm that takes into account the information in the vector u. The second decoding problem can be solved using the same algorithm as Grave and Elhadad (2015) (we use the projective version in our approach).

Setup
We use UD Treebank 1.4 as our dataset. We sorted the treebanks by the number of training sentences of length ≤ 15 and selected the top thirty, which is similar to the setup of Noji et al. (2016). For each dataset, we trained our method on the training sentences of length ≤ 15 and tested on the test sentences of length ≤ 40. We tuned the hyperparameters of our method on the English dataset and report the results on all thirty datasets without any further parameter tuning. We compared our method with four baselines. The first two baselines are Convex-MST and LC-DMV trained independently. To construct the third baseline, we used the independently trained Convex-MST baseline to parse all the training sentences and then used the parses to initialize the training of LC-DMV. This can be seen as a simple method of combining the two approaches. We did not use the LC-DMV baseline to initialize Convex-MST training in the same way because the objective function of Convex-MST is convex and therefore initialization does not matter.

Results
In Table 1, we compare our jointly trained models with the four baselines. With joint training and independent decoding, LC-DMV and Convex-MST achieve better overall performance than when they are trained separately, with or without mutual initialization. Joint decoding with our jointly trained models performs worse than independent decoding. We made the same observation when applying joint decoding to the separately trained models (not shown in the table). We believe this is because unsupervised parsers have relatively low accuracy, and forcing them to reconcile does not lead to better parses. On the other hand, joint decoding during training helps propagate useful inductive biases between the models and thus leads to better-trained models.

Analysis of Parsing Results
We analyze the parsing results of the two models to see how they benefit each other through joint training. Note that LC-DMV limits the depth of center embedding and encourages shorter dependencies, while Convex-MST encourages dependencies that satisfy pre-specified linguistic rules. Therefore, we would like to see whether the jointly trained LC-DMV produces more dependencies satisfying the linguistic rules than its separately trained counterpart, and whether the jointly trained Convex-MST produces parse trees with less center embedding and shorter dependencies than its separately trained counterpart.
Figure 1 shows the percentages of dependencies satisfying linguistic rules when using the separately and jointly trained LC-DMV to parse the test sentences in the English dataset.As we can see, with joint training, LC-DMV is indeed influenced by Convex-MST and produces more dependencies satisfying linguistic rules.
Table 2 shows the average dependency length when using the separately and jointly trained Convex-MST to parse the English test dataset. The dependency length decreases with joint training, showing the influence of LC-DMV. As for center-embedding depth, we find that the separately trained Convex-MST already produces very few center embeddings of depth 2 or more, so the influence of the center-embedding constraint of LC-DMV during joint training is not obvious. We note that the influence of LC-DMV on Convex-MST during joint training is relatively small, which may account for the much smaller accuracy improvement of Convex-MST with joint training (1.5%) compared with the 9.3% improvement of LC-DMV. We conducted an additional experiment that scaled down the Convex-MST objective during joint training in order to increase the influence of LC-DMV.
The results show that LC-DMV indeed influences Convex-MST to a greater degree, but the parsing accuracies of the two models decrease.

Conclusion
In this paper, we proposed a new learning strategy for unsupervised dependency parsing that learns a generative model and a discriminative model jointly based on dual decomposition. We showed that with joint training, two state-of-the-art models can positively influence each other and achieve better performance than their separately trained counterparts.

Algorithm 1
Parameter Learning
Input: training sentences x_1, x_2, ..., x_N
Pre-train Θ and w
repeat
    Fix Θ and w and solve the decoding problem to obtain y_α, α = 1, 2, ..., N
    Fix the parses and update Θ and w
until convergence

Algorithm 2 Decoding via Dual Decomposition
Input: sentence x, fixed parameters w and Θ
Initialize the vector u of size n × n to 0
repeat
    ŷ = argmin_{y ∈ Y} F(x, y; Θ) + u^T y
    ẑ = argmin_{z ∈ Y} G(x, z; w) − u^T z
    if ŷ = ẑ then
        return ŷ
    else
        u = u + τ(ŷ − ẑ)
    end if
until convergence

Joint Decoding
Given a training sentence x and fixed parameters w and Θ, the goal of decoding is to find the best parse tree:

ŷ = argmin_{y ∈ Y} F(x, y; Θ) + G(x, y; w)
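Algorithm 2's agreement loop can be sketched with brute-force stand-ins for the two decoders; in the actual method ŷ would come from the constrained CYK parser and ẑ from the MST-based decoder, and the candidate set would never be enumerated explicitly. Parses are represented here as 0/1 arc-indicator tuples, and the toy score functions are hypothetical:

```python
import numpy as np

def dd_decode(cands, score_F, score_G, tau=0.5, max_iter=100):
    """Dual decomposition over an enumerated candidate set of parses."""
    u = np.zeros(len(cands[0]))          # dual vector, one entry per arc
    y = cands[0]
    for _ in range(max_iter):
        # Each sub-problem sees its own score plus/minus the dual adjustment.
        y = min(cands, key=lambda c: score_F(c) + u @ np.array(c))
        z = min(cands, key=lambda c: score_G(c) - u @ np.array(c))
        if y == z:                       # the two models agree: done
            return y
        u = u + tau * (np.array(y) - np.array(z))
    return y  # fall back to the first model's parse on non-convergence

# Two candidate parses; the models initially disagree on the best one.
cands = [(1, 0), (0, 1)]
best = dd_decode(cands,
                 lambda c: {(1, 0): -3.0, (0, 1): 0.0}[c],
                 lambda c: {(1, 0): -1.0, (0, 1): -2.0}[c])
```

In this toy run the dual update shifts the second model's adjusted scores until both sub-problems pick the parse with the lowest combined score.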

Figure 1: Percentages of dependencies satisfying linguistic rules in the LC-DMV parses of the English test dataset. Noun and Verb denote dependencies headed by nouns and verbs.

Table 2: Average dependency length in the Convex-MST parses of the English test dataset.