CRF Autoencoder for Unsupervised Dependency Parsing

Unsupervised dependency parsing, which tries to discover linguistic dependency structures from unannotated data, is a very challenging task. Almost all previous work on this task focuses on learning generative models. In this paper, we develop an unsupervised dependency parsing model based on the CRF autoencoder. The encoder part of our model is discriminative and globally normalized, which allows us to use rich features as well as universal linguistic priors. We propose an exact algorithm for parsing as well as a tractable learning algorithm. We evaluated our model on eight multilingual treebanks and found that its performance is comparable to that of state-of-the-art approaches.


Introduction
Unsupervised dependency parsing, which aims to discover syntactic structures in sentences from unlabeled data, is a very challenging task in natural language processing. Most of the previous work on unsupervised dependency parsing is based on generative models such as the dependency model with valence (DMV) introduced by Klein and Manning (2004). Many approaches have been proposed to enhance these generative models, for example, by designing advanced Bayesian priors (Cohen et al., 2008), representing dependencies with features (Berg-Kirkpatrick et al., 2010), and representing discrete tokens with continuous vectors (Jiang et al., 2016).
Besides generative approaches, Grave and Elhadad (2015) proposed an unsupervised discriminative parser. They designed a convex quadratic objective function under the discriminative clustering framework. By utilizing global features and linguistic priors, their approach achieves state-of-the-art performance. However, their approach uses an approximate parsing algorithm, which has no theoretical guarantee. In addition, the performance of the approach depends on a set of manually specified linguistic priors.
* This work was supported by the National Natural Science Foundation of China (61503248).
The conditional random field (CRF) autoencoder (Ammar et al., 2014) is a new framework for unsupervised structured prediction. The model has two components: an encoder and a decoder. The encoder is a globally normalized, feature-rich CRF model that predicts the conditional distribution of the latent structure given the observed structured input. The decoder is a generative model that generates a transformation of the structured input from the latent structure. Ammar et al. (2014) applied the model to two sequential structured prediction tasks, part-of-speech induction and word alignment, and showed that by utilizing context information the model can achieve better performance than previous generative models and locally normalized models. However, to the best of our knowledge, there is no previous work applying the CRF autoencoder to tasks with more complicated outputs such as tree structures.
In this paper, we propose an unsupervised discriminative dependency parser based on the CRF autoencoder framework and provide tractable algorithms for learning and parsing. We performed experiments in eight languages and show that our approach achieves results comparable to previous state-of-the-art models.
Figure 1: The CRF autoencoder for the input sentence "These stocks eventually reopened" and its corresponding parse tree (shown at the top). $x$ and $\hat{x}$ are the original and reconstructed sentences. $y$ is the dependency parse tree represented by a sequence where $y_i$ contains the token and index of the parent of token $x_i$ in the parse tree, e.g., $y_1 = \langle \text{stocks}, 2 \rangle$ and $y_2 = \langle \text{reopened}, 4 \rangle$. The encoder is represented by a factor graph (with a global factor specifying valid parse trees) and the decoder is represented by a Bayesian net.

Model
Figure 1 shows our model with an example input sentence. Given an input sentence $x = (x_1, x_2, \ldots, x_n)$, we regard its parse tree as the latent structure represented by a sequence $y = (y_1, y_2, \ldots, y_n)$, where $y_i$ is a pair $\langle t_i, h_i \rangle$, $t_i$ is the head token of the dependency connecting to token $x_i$ in the parse tree, and $h_i$ is the index of this head token in the sentence. The model also contains a reconstruction output, which is a token sequence $\hat{x} = (\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_n)$. Throughout this paper, we set $\hat{x} = x$.
The encoder in our model is a log-linear model represented by a first-order dependency parser. The score of a dependency tree can be factorized as the sum of scores of its dependencies. For each dependency arc $(x, i, j)$, where $i$ and $j$ are the indices of the head and child of the dependency, a feature vector $f(x, i, j)$ is specified. The score of a dependency is defined as the inner product of the feature vector and a weight vector $w$,
\[ \phi(x, i, j) = w^{\top} f(x, i, j) \]
The score of a dependency tree $y$ of sentence $x$ is
\[ \phi(x, y) = \sum_{i=1}^{n} \phi(x, h_i, i) \]
We define the probability of parse tree $y$ given sentence $x$ as
\[ P(y \mid x) = \frac{\exp(\phi(x, y))}{Z(x)}, \qquad Z(x) = \sum_{y' \in \mathcal{Y}(x)} \exp(\phi(x, y')) \]
where $Z(x)$ is the partition function and $\mathcal{Y}(x)$ is the set of all valid parse trees of $x$.
The partition function can be efficiently computed in $O(n^3)$ time using a variant of the inside-outside algorithm (Paskin, 2001) for projective tree structures, or using the Matrix-Tree Theorem for non-projective tree structures (Koo et al., 2007).
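As an illustration of the non-projective case, the following is a minimal sketch of computing $\log Z(x)$ with the Matrix-Tree Theorem and checking it against brute-force enumeration. It is not the authors' implementation. It assumes an artificial ROOT node at index 0 that may have several children, and an arc-score matrix `scores[h, m]` holding $\phi(x, h, m)$; the function names are hypothetical.

```python
import itertools
import numpy as np

def log_partition(scores):
    """log Z over non-projective trees; scores[h, m] scores arc h -> m.

    Assumption: index 0 is an artificial ROOT that may have several children.
    """
    n = scores.shape[0] - 1
    shift = scores.max()                  # every tree has n arcs, so Z scales by exp(n * shift)
    weights = np.exp(scores - shift)
    np.fill_diagonal(weights, 0.0)        # no self-loops
    weights[:, 0] = 0.0                   # no arcs into ROOT
    laplacian = -weights[1:, 1:].copy()   # off-diagonal entries: -A[h, m]
    # Diagonal entries: total weight of all arcs entering word m.
    laplacian[np.diag_indices(n)] = weights[:, 1:].sum(axis=0)
    _, logdet = np.linalg.slogdet(laplacian)
    return logdet + n * shift

def is_tree(heads):
    """heads[m - 1] is the head of word m; every word must reach ROOT (0)."""
    n = len(heads)
    for m in range(1, n + 1):
        node, steps = m, 0
        while node != 0:
            node = heads[node - 1]
            steps += 1
            if steps > n:                 # cycle or self-loop
                return False
    return True

def brute_force_log_partition(scores):
    """Enumerate every head assignment; only feasible for tiny sentences."""
    n = scores.shape[0] - 1
    total = 0.0
    for heads in itertools.product(range(n + 1), repeat=n):
        if is_tree(heads):
            total += np.exp(sum(scores[h, m] for m, h in enumerate(heads, start=1)))
    return np.log(total)

def tree_log_prob(scores, heads):
    """log P(y | x) = phi(x, y) - log Z(x) for an edge-factored tree."""
    tree_score = sum(scores[h, m] for m, h in enumerate(heads, start=1))
    return tree_score - log_partition(scores)
```

The determinant costs $O(n^3)$, whereas the enumeration is exponential and serves only as a sanity check on short sentences.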
The decoder of our model consists of a set of categorical conditional distributions $\theta_{x|t}$, each of which represents the probability of generating token $x$ conditioned on token $t$. So the probability of the reconstruction output $\hat{x}$ given the parse tree $y$ is
\[ P(\hat{x} \mid y) = \prod_{i=1}^{n} \theta_{\hat{x}_i \mid t_i} \]
The conditional distribution of $\hat{x}, y$ given $x$ is
\[ P(y, \hat{x} \mid x) = P(y \mid x)\, P(\hat{x} \mid y) \]
Following McDonald et al. (2005) and Grave and Elhadad (2015), we define the feature vector of a dependency based on the part-of-speech (POS) tags of the head, child, and context words, the direction, and the distance between the head and child of the dependency. The feature template used in our parser is shown in Table 1.
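The decoder probability $P(\hat{x} \mid y)$ defined above can be sketched as follows. This is only an illustration: the POS tags, the head indices for the example sentence, the values in `theta`, and the use of a `ROOT` token as the head of the root word are all assumptions made for the example, not values from the paper.

```python
import math

def decoder_log_prob(tokens, heads, theta, root="ROOT"):
    """log P(x_hat | y) = sum_i log theta[(x_hat_i, t_i)], with x_hat = x.

    tokens: tokens of the sentence (POS tags in the experimental setup).
    heads:  heads[i - 1] is the 1-based head index of token i, 0 for ROOT.
    theta:  dict keyed by (child_token, head_token); hypothetical layout.
    """
    log_prob = 0.0
    for child_pos, head_pos in enumerate(heads, start=1):
        child = tokens[child_pos - 1]
        head = root if head_pos == 0 else tokens[head_pos - 1]
        log_prob += math.log(theta[(child, head)])
    return log_prob

# "These stocks eventually reopened" with assumed POS tags and heads.
tokens = ["DT", "NNS", "RB", "VBD"]
heads = [2, 4, 4, 0]
theta = {  # made-up probabilities, for illustration only
    ("DT", "NNS"): 0.5,
    ("NNS", "VBD"): 0.4,
    ("RB", "VBD"): 0.2,
    ("VBD", "ROOT"): 0.8,
}
example_log_prob = decoder_log_prob(tokens, heads, theta)
```

Working in log space keeps the product of many small probabilities numerically stable and lets the decoder term be added directly to the encoder's arc scores.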

Parsing
Given parameters $w$ and $\theta$, we can parse a sentence $x$ by searching for a dependency tree $y$ which has the highest probability $P(\hat{x}, y \mid x)$.
Table 1: Feature template of a dependency, where $i$ is the index of the head, $j$ is the index of the child, $dis = |i - j|$, and $dir$ is the direction of the dependency.
For projective dependency parsing, we can use Eisner's algorithm (Eisner, 1996) to find the best parse in $O(n^3)$ time. For non-projective dependency parsing, we can use the Chu-Liu/Edmonds algorithm (Chu and Liu, 1965; Edmonds, 1967; Tarjan, 1977) to find the best parse in $O(n^2)$ time.
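For the projective case, a compact sketch of first-order Eisner decoding is given below. It is a generic textbook-style implementation rather than the authors' code. It assumes a matrix `scores[h, m]` of combined arc scores, an artificial ROOT at index 0, and that ROOT may take several children; the function name `eisner` is hypothetical.

```python
import numpy as np

def eisner(scores):
    """Highest-scoring projective tree for arc scores[h, m] (0 is ROOT).

    Returns heads, where heads[m - 1] is the head index of word m.
    """
    scores = np.array(scores, dtype=float)  # work on a copy
    scores[:, 0] = -np.inf                  # ROOT never takes a head
    size = scores.shape[0]
    # Tables are indexed [start, end, direction]: direction 1 means the head
    # is the left end of the span, direction 0 means it is the right end.
    complete = np.zeros((size, size, 2))
    incomplete = np.zeros((size, size, 2))
    complete_bp = np.zeros((size, size, 2), dtype=int)
    incomplete_bp = np.zeros((size, size, 2), dtype=int)

    for width in range(1, size):
        for s in range(size - width):
            t = s + width
            # Incomplete span: join two complete halves, then add one arc.
            joined = complete[s, s:t, 1] + complete[s + 1:t + 1, t, 0]
            split = int(np.argmax(joined))
            incomplete[s, t, 0] = joined[split] + scores[t, s]  # arc t -> s
            incomplete[s, t, 1] = joined[split] + scores[s, t]  # arc s -> t
            incomplete_bp[s, t, :] = s + split
            # Complete span headed by t.
            left = complete[s, s:t, 0] + incomplete[s:t, t, 0]
            complete_bp[s, t, 0] = s + int(np.argmax(left))
            complete[s, t, 0] = left.max()
            # Complete span headed by s.
            right = incomplete[s, s + 1:t + 1, 1] + complete[s + 1:t + 1, t, 1]
            complete_bp[s, t, 1] = s + 1 + int(np.argmax(right))
            complete[s, t, 1] = right.max()

    heads = [-1] * size

    def backtrack(s, t, direction, is_complete):
        if s == t:
            return
        if is_complete:
            r = int(complete_bp[s, t, direction])
            if direction == 0:
                backtrack(s, r, 0, True)
                backtrack(r, t, 0, False)
            else:
                backtrack(s, r, 1, False)
                backtrack(r, t, 1, True)
        else:
            r = int(incomplete_bp[s, t, direction])
            if direction == 0:
                heads[s] = t
            else:
                heads[t] = s
            backtrack(s, r, 1, True)
            backtrack(r + 1, t, 0, True)

    backtrack(0, size - 1, 1, True)
    return heads[1:]
```

The two nested loops over spans plus the inner maximization over split points give the $O(n^3)$ running time stated above.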

Objective Function
Spitkovsky et al. (2010) show that Viterbi EM can improve the performance of unsupervised dependency parsing in comparison with EM. Therefore, instead of using negative conditional log likelihood as our objective function, we choose to use negative conditional Viterbi log likelihood,
\[ -\sum_{k=1}^{N} \log \max_{y \in \mathcal{Y}(x^{(k)})} P(\hat{x}^{(k)}, y \mid x^{(k)}) + \lambda \Omega(w) \tag{1} \]
where $x^{(1)}, \ldots, x^{(N)}$ are the training sentences, $\Omega(w)$ is an L1 regularization term of the encoder parameter $w$, and $\lambda$ is a hyper-parameter controlling the strength of regularization.
To encourage learning of dependency relations that satisfy universal linguistic knowledge, we add a soft constraint on the parse tree based on the universal syntactic rules following Naseem et al. (2010) and Grave and Elhadad (2015). Hence our objective function over the $N$ training sentences $x^{(k)}$ becomes
\[ -\sum_{k=1}^{N} \log \max_{y \in \mathcal{Y}(x^{(k)})} \left( P(\hat{x}^{(k)}, y \mid x^{(k)})\, Q^{\alpha}(x^{(k)}, y) \right) + \lambda \Omega(w) \]
where $Q(x, y)$ is a soft constraint factor over the parse tree, and $\alpha$ is a hyper-parameter controlling the strength of the constraint factor. The factor $Q$ is also decomposable by edges in the same way as the encoder and the decoder, and therefore our parsing algorithm can still be used with this factor:
\[ Q(x, y) = \exp\left( \sum_{i=1}^{n} 1[(t_i \to x_i) \in R] \right) \]
where $1[(t_i \to x_i) \in R]$ is an indicator function of whether dependency $t_i \to x_i$ satisfies one of the universal linguistic rules in $R$. The universal linguistic rules that we use are shown in Table 2 (Naseem et al., 2010).
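Because the encoder, the decoder, and $Q$ all decompose by edges, the quantity maximized during parsing can be collected into a single arc-score matrix, since $\log\left(P(\hat{x}, y \mid x)\, Q^{\alpha}(x, y)\right)$ equals a sum of per-arc terms minus $\log Z(x)$, which does not depend on $y$. The sketch below illustrates this under stated assumptions: the two-rule set, the tag names, and the dictionary layout of `log_theta` are made up for the example and are not the rules of Table 2.

```python
import math
import numpy as np

def combined_arc_scores(tags, encoder_scores, log_theta, rules, alpha):
    """Per-arc score: phi(x, h, m) + log theta[child | head] + alpha * 1[rule].

    tags[0] must be the artificial "ROOT" token (assumption); log_theta is
    keyed by (child_tag, head_tag); rules is a set of (head_tag, child_tag).
    """
    size = len(tags)
    combined = np.full((size, size), -np.inf)   # invalid arcs stay at -inf
    for h in range(size):
        for m in range(1, size):                # ROOT is never a child
            if h == m:
                continue
            bonus = alpha if (tags[h], tags[m]) in rules else 0.0
            combined[h, m] = encoder_scores[h, m] + log_theta[(tags[m], tags[h])] + bonus
    return combined

# Tiny illustration with made-up values.
tags = ["ROOT", "NOUN", "VERB"]
encoder_scores = np.zeros((3, 3))
log_theta = {
    ("NOUN", "ROOT"): math.log(0.5),
    ("NOUN", "VERB"): math.log(0.5),
    ("VERB", "ROOT"): math.log(0.5),
    ("VERB", "NOUN"): math.log(0.5),
}
rules = {("VERB", "NOUN")}                      # hypothetical rule: VERB -> NOUN
example_scores = combined_arc_scores(tags, encoder_scores, log_theta, rules, alpha=2.0)
```

Any edge-factored decoder, projective or non-projective, can then be run on this matrix unchanged.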

Algorithm
We apply coordinate descent to minimize the objective function, which alternately updates w and θ. In each optimization step of w, we run two epochs of stochastic gradient descent, and in each optimization step of θ, we run two iterations of the Viterbi EM algorithm.
To update $w$ using stochastic gradient descent, for each sentence $x$, we first run the parsing algorithm to find the best parse tree $y^* = \arg\max_{y \in \mathcal{Y}(x)} \left( P(\hat{x}, y \mid x)\, Q^{\alpha}(x, y) \right)$; then we can calculate the gradient of the objective function (here for the loss term of sentence $x$, without the regularization term) based on the following derivation,
\[ \frac{\partial}{\partial w} \left( -\log \left( P(\hat{x}, y^* \mid x)\, Q^{\alpha}(x, y^*) \right) \right) = \sum_{(i,j) \in D(x)} f(x, i, j) \left( \mu(x, i, j) - 1[y^*_j = \langle x_i, i \rangle] \right) \]
where $D(x)$ is the set of all possible dependency arcs of sentence $x$, $1[\cdot]$ is the indicator function, and $\mu(x, i, j)$ is the expected count defined as follows,
\[ \mu(x, i, j) = \sum_{y \in \mathcal{Y}(x)} 1[y_j = \langle x_i, i \rangle]\, P(y \mid x) \]

Methods                                          WSJ10   WSJ
Basic Setup
Feature DMV (Berg-Kirkpatrick et al., 2010)      63.0    -
UR-A E-DMV (Tu and Honavar, 2012)                71.4    57.0
Neural E-DMV (Jiang et al., 2016)                69.7    52.5
Neural E-DMV (Good Init) (Jiang et al., 2016)    72.5    57.6
Basic Setup + Universal Linguistic Prior
Convex-MST (Grave and Elhadad, 2015)             60.8    48.6
HDP-DEP (Naseem et al., 2010)                    71.9    -
CRFAE                                            71.7    55.7
Systems Using Extra Info
LexTSG-DMV (Blunsom and Cohn, 2010)              67.7    55.7
CS (Spitkovsky et al., 2013)                     72.0    64.4
MaxEnc (Le and Zuidema, 2015)                    73.2    65.8
Table 3: Comparison of recent unsupervised dependency parsing systems on English. Basic setup is the same as our setup except that linguistic prior is not used. Extra info includes lexicalization, longer training sentences, etc.

The expected count can be efficiently computed using the Matrix-Tree Theorem (Koo et al., 2007) for non-projective tree structures or using a variant of the inside-outside algorithm for projective tree structures (Paskin, 2001).
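For the non-projective case, the expected counts can be read off the inverse of the Laplacian used in the Matrix-Tree Theorem, and the per-sentence gradient is then a weighted sum of feature vectors. The sketch below assumes an artificial ROOT at index 0 that may have several children, a dense feature tensor `features[h, m]` holding $f(x, h, m)$, and hypothetical function names; it omits the regularization term.

```python
import numpy as np

def arc_marginals(scores):
    """mu[h, m] = P(arc h -> m is in the tree | x) under the encoder.

    scores[h, m] scores arc h -> m; index 0 is an artificial ROOT (assumption).
    """
    n = scores.shape[0] - 1
    weights = np.exp(scores - scores.max())   # marginals are invariant to the shift
    np.fill_diagonal(weights, 0.0)            # no self-loops
    weights[:, 0] = 0.0                       # no arcs into ROOT
    laplacian = -weights[1:, 1:].copy()
    laplacian[np.diag_indices(n)] = weights[:, 1:].sum(axis=0)
    inverse = np.linalg.inv(laplacian)
    diagonal = np.diag(inverse)               # inverse[m, m] for words m = 1..n
    mu = np.zeros_like(weights)
    mu[0, 1:] = weights[0, 1:] * diagonal     # arcs leaving ROOT
    # Word-to-word arcs: A[h, m] * (inverse[m, m] - inverse[m, h]).
    mu[1:, 1:] = weights[1:, 1:] * (diagonal[None, :] - inverse.T)
    return mu

def loss_gradient_wrt_w(features, mu, best_heads):
    """Sum over arcs of f(x, h, m) * (mu[h, m] - 1[arc h -> m is in y*]).

    features has shape (size, size, num_features); best_heads[m - 1] is the
    head of word m in the Viterbi parse y*.
    """
    indicator = np.zeros_like(mu)
    for m, h in enumerate(best_heads, start=1):
        indicator[h, m] = 1.0
    return np.einsum("hm,hmk->k", mu - indicator, features)
```

Each column of `mu` for a real word sums to one, since every word has exactly one head in every tree, which gives a cheap consistency check during development.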
To update $\theta$ using Viterbi EM, in the E-step we again use the parsing algorithm to find the best parse tree $y^*$ for each sentence $x$; then in the M-step we update $\theta$ by maximum likelihood estimation.
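The M-step amounts to counting (head token, child token) pairs in the Viterbi parses and normalizing per head token. A minimal sketch follows; the `ROOT` head token, the absence of smoothing, and the dictionary layout are assumptions made for the example.

```python
from collections import Counter, defaultdict

def m_step(corpus, parses, root="ROOT"):
    """Maximum likelihood estimate of theta from Viterbi parses.

    corpus: list of token sequences (POS tags in the experimental setup).
    parses: list of head sequences; heads[i - 1] is the 1-based head index
            of token i, with 0 denoting ROOT (assumption).
    Returns a dict keyed by (child_token, head_token).
    """
    counts = defaultdict(Counter)
    for tokens, heads in zip(corpus, parses):
        for child_pos, head_pos in enumerate(heads, start=1):
            head = root if head_pos == 0 else tokens[head_pos - 1]
            counts[head][tokens[child_pos - 1]] += 1
    theta = {}
    for head, child_counts in counts.items():
        total = sum(child_counts.values())
        for child, count in child_counts.items():
            theta[(child, head)] = count / total   # relative frequency
    return theta

# One-sentence illustration with assumed POS tags and heads.
example_theta = m_step([["DT", "NNS", "RB", "VBD"]], [[2, 4, 4, 0]])
```

In this one-sentence example, `VBD` heads two different children, so each receives probability 0.5 under `VBD`.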

Setup
We experimented with projective parsing and used the informed initialization method proposed by Klein and Manning (2004) to initialize our model before learning. We tested our model both with and without using the universal linguistic rules. We used AdaGrad to optimize w. We used POS tags of the input sentence as the tokens in our model. We learned our model on training sentences of length ≤ 10 and reported the directed dependency accuracy on test sentences of length ≤ 10 and on all test sentences.

Results on English
We evaluated our model on the Wall Street Journal corpus. We trained our model on sections 2-21, tuned the hyperparameters on section 22, and tested our model on section 23. Table 3 shows the directed dependency accuracy of our model (CRFAE) compared with recently published results. It can be seen that our method achieves performance comparable to state-of-the-art systems.
We also compared the performance of the CRF autoencoder using an objective function with negative log likelihood vs. using our Viterbi version of the objective function (Eq. 1). We find that the Viterbi version leads to much better performance (55.7 vs. 41.8 in parsing accuracy on WSJ), which echoes the findings of Spitkovsky et al. (2010) on Viterbi EM.

Multilingual Results
We evaluated our model on seven languages from the PASCAL Challenge on Grammar Induction (Gelling et al., 2012). We did not use the Arabic corpus because the number of training sentences with length ≤ 10 is less than 1000. The results are shown in Table 4. The accuracies of DMV and Neural DMV are from Jiang et al. (2016). Both our model (CRFAE) and Convex-MST were tuned on the validation set of each corpus. It can be seen that our method achieves the best results on average. In addition, our method outperforms Convex-MST both with and without the linguistic prior. From the results we can also see that utilizing the universal linguistic prior greatly improves the performance of both Convex-MST and our model.

Conclusion
In this paper, we propose a new discriminative model for unsupervised dependency parsing based on the CRF autoencoder. Both learning and inference of our model are tractable. We tested our method on eight languages and show that our model is competitive with state-of-the-art systems.