Second-Order Unsupervised Neural Dependency Parsing

Most unsupervised dependency parsers are based on first-order probabilistic generative models that only consider local parent-child information. Inspired by second-order supervised dependency parsing, we propose a second-order extension of unsupervised neural dependency models that incorporates grandparent-child or sibling information. We also propose a novel design for the neural parameterization and optimization methods of the dependency models. In second-order models, the number of grammar rules grows cubically with the vocabulary size, making it difficult to train lexicalized models that may contain thousands of words. To circumvent this problem while still benefiting from both second-order parsing and lexicalization, we use the agreement-based learning framework to jointly train a second-order unlexicalized model and a first-order lexicalized model. Experiments on multiple datasets show the effectiveness of our second-order models compared with recent state-of-the-art methods. Our joint model achieves a 10% improvement over the previous state-of-the-art parser on the full WSJ test set.


Introduction
Dependency parsing is a classical task in natural language processing. The head-dependent relations produced by dependency parsing provide an approximation to the semantic relationships between words, which is useful in many downstream NLP tasks such as machine translation, information extraction and question answering. Nowadays, supervised dependency parsers can reach very high accuracy (Dozat and Manning, 2017; Zhang et al., 2020). Unfortunately, supervised parsing requires treebanks (annotated parse trees) for training, which are very expensive and time-consuming to build. On the other hand, unsupervised dependency parsing requires only unannotated corpora for training, though the accuracy of unsupervised parsing still lags far behind that of supervised parsing. We focus on unsupervised dependency parsing in this paper.
Most methods in the literature on unsupervised dependency parsing are based on the Dependency Model with Valence (DMV) (Klein and Manning, 2004), which is a probabilistic generative model. A main disadvantage of the DMV and many of its extensions is their lack of expressiveness. The generation of a dependent token is conditioned only on its parent, the relative direction of the token to its parent, and whether its parent has already generated any child in this direction, hence ignoring other contextual information. To improve model expressiveness, researchers often turn to discriminative methods, which can incorporate more contextual information into the scoring or prediction of dependency arcs. For example, Grave and Elhadad (2015) use the idea of discriminative clustering, Cai et al. (2017) use a discriminative parser in the CRF-autoencoder framework, and Li et al. (2018) use an encoder-decoder framework that contains a discriminative transition-based parser. For the DMV, Han et al. (2019) propose the discriminative neural DMV, which uses a global sentence embedding to introduce contextual information into the calculation of grammar rule probabilities. In the literature on supervised graph-based dependency parsing, however, there exists another technique for incorporating contextual information and increasing expressiveness, namely high-order parsing (Koo and Collins, 2010; Ma and Zhao, 2012). A first-order parser, such as the DMV, only considers local parent-child information. In comparison, a high-order parser takes into account the interaction between multiple dependency arcs.
In this work, we propose the second-order neural DMV, which incorporates second-order information (e.g., sibling or grandparent) into the original (neural) DMV. To achieve better learning accuracy, we design a new neural architecture for rule probability computation and promote direct marginal likelihood optimization (Salakhutdinov et al., 2003; Tran et al., 2016) over the widely used expectation-maximization algorithm for training. One particular challenge faced by second-order neural DMVs is that the number of grammar rules grows cubically with the vocabulary size, making it difficult to store and train a lexicalized model containing thousands of words. Therefore, instead of learning a second-order lexicalized model, we propose to jointly learn a second-order unlexicalized model (whose vocabulary consists of POS tags instead of words) and a first-order lexicalized model based on the agreement-based learning framework (Liang et al., 2007). The jointly learned models have a manageable number of grammar rules while still benefiting from both second-order parsing and lexicalization.
We conduct experiments on the Wall Street Journal (WSJ) dataset and on seven languages from the Universal Dependencies (UD) dataset. The experimental results demonstrate that our models achieve state-of-the-art accuracy in unsupervised dependency parsing.

Dependency Model With Valence
The Dependency Model with Valence (DMV) (Klein and Manning, 2004) is a probabilistic generative model of a sentence and its parse tree. It generates a dependency parse tree from an imaginary root node in a recursive top-down manner. There are three types of probabilistic grammar rules in a DMV, namely ROOT, CHILD and DECISION rules, each associated with a set of multinomial distributions P_ROOT(c), P_CHILD(c|p, dir, val) and P_DECISION(dec|p, dir, val), where p is the parent token, c is the child token, dec is the continue/stop decision, dir indicates the direction of generation, and val indicates whether parent p has already generated any child in direction dir. To generate a sequence of tokens along with its dependency parse tree, the DMV first generates a token c from the ROOT distribution P_ROOT(c). Then, for each token p that has already been generated, it generates a decision from the DECISION distribution P_DECISION(dec|p, dir, val) to determine whether to generate a new child in direction dir. If dec is CONTINUE, a new child c is generated from the CHILD distribution P_CHILD(c|p, dir, val). If dec is STOP, p stops generating children in direction dir. The joint probability of the sequence and its dependency parse tree is the product of the probabilities of all the generation steps.
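As a concrete illustration, the generative story above can be sketched in a few lines of Python. The tag set and all probability values below are invented for illustration, not learned values:

```python
# Toy DMV with a two-tag vocabulary. All probability tables are invented
# for illustration; a real model learns them.
P_ROOT = {"V": 0.7, "N": 0.3}
# P_CHILD[(parent, direction, valence)] -> {child: probability}
P_CHILD = {
    ("V", "L", "NOCHILD"): {"N": 0.8, "V": 0.2},
    ("V", "R", "NOCHILD"): {"N": 0.6, "V": 0.4},
}
# P_DECISION[(parent, direction, valence)] -> {decision: probability}
P_DECISION = {
    ("V", "L", "NOCHILD"): {"CONTINUE": 0.5, "STOP": 0.5},
    ("V", "L", "HASCHILD"): {"CONTINUE": 0.1, "STOP": 0.9},
    ("V", "R", "NOCHILD"): {"CONTINUE": 0.4, "STOP": 0.6},
    ("V", "R", "HASCHILD"): {"CONTINUE": 0.1, "STOP": 0.9},
    ("N", "L", "NOCHILD"): {"CONTINUE": 0.0, "STOP": 1.0},
    ("N", "R", "NOCHILD"): {"CONTINUE": 0.0, "STOP": 1.0},
}

def joint_prob(root, parents):
    """Joint probability of a tree, given the root tag and, for each
    token, its ordered children in each direction."""
    p = P_ROOT[root]
    for parent, direction, kids in parents:
        val = "NOCHILD"
        for kid in kids:  # one CONTINUE decision + one CHILD rule per kid
            p *= P_DECISION[(parent, direction, val)]["CONTINUE"]
            p *= P_CHILD[(parent, direction, val)][kid]
            val = "HASCHILD"
        p *= P_DECISION[(parent, direction, val)]["STOP"]  # final STOP
    return p

# Tree for "N V N": root -> V; V has one left child N and one right child N;
# both N tokens immediately STOP in both directions.
prob = joint_prob("V", [("V", "L", ["N"]), ("V", "R", ["N"]),
                        ("N", "L", []), ("N", "R", []),
                        ("N", "L", []), ("N", "R", [])])
# prob = 0.7 * (0.5*0.8*0.9) * (0.4*0.6*0.9) = 0.054432
```

The joint probability is exactly the product of one ROOT rule, one DECISION rule per generation attempt, and one CHILD rule per generated child.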

Neuralized DMV Models
Neural DMV One limitation of the DMV is that it does not consider the correlation between tokens. Jiang et al. (2016) propose the Neural DMV (NDMV), which uses continuous POS embeddings to represent discrete POS tags and calculates rule probabilities through neural networks based on these embeddings. In this way, the model can learn the correlation between POS tags and smooth the grammar rule probabilities accordingly.
Lexicalized NDMV The neural DMV is still an unlexicalized model: it is based on POS tags and does not use word information. Han et al. (2017) propose the Lexicalized NDMV (L-NDMV), in which each token is a POS/word pair. The neural network that computes rule probabilities takes both the POS embedding and the word embedding as input. To reduce the vocabulary size, they replace low-frequency words with their POS tags.

Second-Order Parsing
In our proposed second-order NDMV, we calculate each rule probability additionally conditioned on the sibling or grandparent. We take the sibling-NDMV as an example to demonstrate the generative story.
• We start with the imaginary root token, generating its only child c with probability P_ROOT(c).
• For each token p, we decide whether to generate a new child with probability P_DECISION(dec|p, s, dir, val), where s is the previous child token generated by p in direction dir. If p has not generated any child in direction dir yet, we use a special symbol NULL to represent s.
• If decision dec is CONTINUE, p generates a new child c with probability P_CHILD(c|p, s, dir, val). If decision dec is STOP, p stops generating children in direction dir.
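The sibling-conditioned generation in one direction can be sketched as follows. Compared with the first-order DMV, the DECISION and CHILD lookups additionally key on the previously generated child ("NULL" before any child exists). The probability tables are invented for illustration, and only the keys needed for the example are filled in:

```python
# Sketch of sibling-NDMV conditioning. All probability values are invented.
P_DECISION = {
    ("V", "NULL", "R", "NOCHILD"): {"CONTINUE": 0.6, "STOP": 0.4},
    ("V", "N",    "R", "HASCHILD"): {"CONTINUE": 0.2, "STOP": 0.8},
}
P_CHILD = {
    ("V", "NULL", "R", "NOCHILD"): {"N": 0.9, "V": 0.1},
}

def right_children_prob(parent, kids):
    """Probability of generating the ordered right children `kids`."""
    sib, val, p = "NULL", "NOCHILD", 1.0
    for kid in kids:
        p *= P_DECISION[(parent, sib, "R", val)]["CONTINUE"]
        p *= P_CHILD[(parent, sib, "R", val)][kid]
        sib, val = kid, "HASCHILD"  # the new child becomes the sibling context
    return p * P_DECISION[(parent, sib, "R", val)]["STOP"]

p = right_children_prob("V", ["N"])  # 0.6 * 0.9 * 0.8 = 0.432
```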
For parsing, we design dynamic programming algorithms adapted from Koo and Collins (2010). Since the grandparent of each token is uniquely determined, the parsing algorithm of our grand-NDMV model is similar to theirs. There are two options for determining the sibling token, since the generation of child tokens can proceed either from the inside out or from the outside in. Koo and Collins (2010) make the inside-out assumption, but in this paper we make the outside-in assumption because it is easier to implement and achieves better performance empirically. We provide the pseudocode of the second-order inside algorithm and the second-order parsing algorithm in the appendix.

Parameterization
In a neural DMV, we compute the probability of a grammar rule using a neural network. Below we formulate the computation of CHILD rule probabilities. The full architecture of the neural network is shown in Figure 1. ROOT and DECISION rule probabilities are computed in a similar way.
In our second-order neural DMV, each CHILD rule P_CHILD(c|p, s, dir, val) involves three tokens: parent p, child c, and sibling (or grandparent) s. Denote the embeddings of the parent, child and sibling (or grandparent) by x_p, x_c, x_s ∈ R^d, which are retrieved from a shared token embedding layer. We use three different linear transformations to produce the representations of a token as a parent, child, and sibling (or grandparent):

e_c = W_c x_c,  e_p = W_p x_p,  e_s = W_s x_s

We feed e_c, e_p, e_s to the same neural network, which consists of three consecutive MLPs. The first and second MLPs respectively insert valence and direction information into the representations, and the last MLP produces the final hidden representations h_c, h_p, h_s (see the appendix for the complete formulation). We use different parameters of the first and second MLPs for different values of the valence val and direction dir. We add skip-connections to the first and second MLPs because skip-connections have been found very useful in unsupervised neural parsing (Kim et al., 2019). We then follow Wang et al. (2019) and use a decomposed trilinear function to compute the unnormalized rule score from the three vectors h_c, h_p, h_s:
s(c, p, s) = Σ_{i=1}^{m} (U h_p)_i × (V h_s)_i × (W h_c)_i

where U, V, W ∈ R^{m×d} are the parameters of the decomposed trilinear function and × is scalar multiplication. We then apply a softmax function over all candidate children to produce the final rule probability:

P_CHILD(c|p, s, dir, val) = exp(s(c, p, s)) / Σ_{c'∈C} exp(s(c', p, s))

where C is the vocabulary.
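The decomposed trilinear scoring followed by a softmax can be sketched in pure Python. The projection matrices, hidden vectors and dimensions below are invented for illustration; in the actual model the hidden vectors come from the MLPs described above:

```python
import math

def matvec(W, x):
    # Plain matrix-vector product over nested lists.
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def trilinear_scores(h_p, h_s, cand_h_c, U, V, W):
    """Decomposed trilinear scoring: project each vector to a rank-m
    space and sum the elementwise triple products."""
    up = matvec(U, h_p)
    vs = matvec(V, h_s)
    scores = []
    for h_c in cand_h_c:
        wc = matvec(W, h_c)
        scores.append(sum(a * b * c for a, b, c in zip(up, vs, wc)))
    return scores

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Tiny example: hidden size 2, rank 2, three candidate children.
U = V = W = [[1.0, 0.0], [0.0, 1.0]]  # identity projections for illustration
h_p, h_s = [0.5, 1.0], [1.0, 0.5]
cand = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
probs = softmax(trilinear_scores(h_p, h_s, cand, U, V, W))
```

The decomposition avoids materializing a full d×d×d tensor: the score is a sum over m rank-one interactions, so the parameter count is linear rather than cubic in the hidden size.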

Learning
The learning objective function L(θ) = Σ_x log p_θ(x) is the log-likelihood of the training sentences, where θ denotes the parameters of the neural networks. The probability of each sentence x is defined as

p_θ(x) = Σ_{z∈T(x)} p_θ(x, z)

where T(x) is the set of all possible dependency parse trees of sentence x. We use c(r, x, z) to denote the number of times rule r is used in dependency parse tree z of sentence x. Then we have

p_θ(x, z) = Π_{r∈R} p_r^{c(r, x, z)}

where R is the collection of all DECISION, CHILD and ROOT rules and p_r is the probability of rule r.
Learning via EM algorithm We can rewrite the log-likelihood of sentence x as follows:

log p_θ(x) = E_{z∼q}[log p_θ(x, z)] + H(q) + KL(q(z) ‖ p_θ(z|x))

where q(z) is an arbitrary distribution over parse trees, H is the entropy function, and KL is the Kullback–Leibler divergence. In the E-step, we fix θ and set q(z) = p_θ(z|x), which makes the KL term zero. In the M-step, we fix q(z) and update θ with the following objective:

Q(θ) = Σ_{r∈R} e(r, x) log p_r

where e(r, x) = Σ_z q(z) c(r, x, z) is the expected count of grammar rule r in sentence x based on q(z), which can be obtained using the inside-outside algorithm. We can use gradient descent on Q(θ) to update θ.
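To build intuition for the M-step objective, the sketch below shows the closed-form M-step of a tabular (non-neural) DMV: maximizing Σ_r e(r, x) log p_r under the normalization constraints simply renormalizes the expected counts within each multinomial. In our neural models the M-step is instead a gradient update; the counts below are invented for illustration:

```python
# Expected rule counts e(r, x), summed over sentences; values are invented.
expected_counts = {
    ("V", "R", "NOCHILD"): {"N": 3.0, "V": 1.0},
    ("V", "L", "NOCHILD"): {"N": 2.0, "V": 2.0},
}

def m_step(counts):
    """Closed-form M-step for a tabular model: renormalize the expected
    counts of each conditional multinomial."""
    probs = {}
    for cond, dist in counts.items():
        z = sum(dist.values())
        probs[cond] = {c: v / z for c, v in dist.items()}
    return probs

probs = m_step(expected_counts)
# probs[("V", "R", "NOCHILD")] == {"N": 0.75, "V": 0.25}
```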
Learning via direct marginal likelihood optimization We can also use gradient descent to maximize log p_θ(x) directly. Based on the derivation of Salakhutdinov et al. (2003),

∂ log p_θ(x) / ∂θ = Σ_{r∈R} e(r, x) ∂ log p_r / ∂θ

where e(r, x) is the expected count of grammar rule r in sentence x based on p_θ(z|x). Traditionally, the inside-outside algorithm is used to obtain the expected count e(r, x). Eisner (2016) points out that it can instead be computed with back-propagation, since e(r, x) = ∂ log p_θ(x) / ∂ log p_r.
So we only need to run the inside algorithm to calculate log p_θ(x) and then use back-propagation to update the parameters directly, without the need for the outside algorithm.
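Eisner's observation can be checked on a toy example. For a tiny, invented set of "trees" described only by their rule counts, the expected count of a rule equals the derivative of log p_θ(x) with respect to that rule's log-probability (verified here with a numerical derivative):

```python
import math

# Invented toy: three candidate "trees", each defined by how many times it
# uses each of two rules, plus the rules' log-probabilities.
trees = [{"r1": 2, "r2": 0}, {"r1": 1, "r2": 1}, {"r1": 0, "r2": 2}]
logp = {"r1": math.log(0.3), "r2": math.log(0.7)}

def log_px(logp):
    # log p(x) = log sum over trees of exp(sum_r c(r, z) * log p_r)
    return math.log(sum(math.exp(sum(c * logp[r] for r, c in z.items()))
                        for z in trees))

# Expected count of r1 via direct enumeration of the posterior.
px = math.exp(log_px(logp))
e_r1 = sum(z["r1"] * math.exp(sum(c * logp[r] for r, c in z.items())) / px
           for z in trees)

# The same quantity as a numerical derivative of log p(x) w.r.t. log p_r1.
eps = 1e-6
bumped = dict(logp, r1=logp["r1"] + eps)
deriv = (log_px(bumped) - log_px(logp)) / eps
```

In practice, automatic differentiation through the inside algorithm plays the role of the numerical derivative here, yielding all expected counts in one backward pass.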
Mini-batch gradient descent as online EM Note that the gradient of log p_θ(x) contains the term e(r, x). If we use mini-batch gradient descent to optimize log p_θ(x), it is analogous to the online EM algorithm (Liang and Klein, 2009). To compute the gradient for each mini-batch, we first compute the expected counts of the training sentences in the mini-batch, which is exactly what the online E-step does; we then use the expected counts to compute the gradient and update the model parameters, which is similar to the M-step, except that here we perform only one update step, while in the EM algorithm multiple update steps may be taken based on the same expected counts. According to Liang and Klein (2009), online EM has faster convergence and can even find better solutions. Empirically, we do find that direct marginal likelihood optimization outperforms the EM algorithm.

Agreement-Based Learning
In our second-order DMV model, the number of grammar rules is 4|V|^3 + 4|V|^2 + |V|, which is cubic in the vocabulary size |V|. When our model is lexicalized, the vocabulary may contain thousands of words or more, making the model size unmanageable. Instead of learning a second-order lexicalized model, we propose to jointly learn a second-order unlexicalized model (whose vocabulary consists of POS tags instead of words) and a first-order lexicalized model based on the agreement-based learning framework (Liang et al., 2007). The jointly learned models have a manageable number of grammar rules while still benefiting from both second-order parsing and lexicalization. Empirically, we do find that the jointly trained models outperform lexicalized second-order models. Following Liang et al. (2007), we define the objective function for our jointly trained first-order L-NDMV and second-order NDMV as

O_agree = Σ_x log Σ_{z∈T(x)} p_{θ_0}(x, z) p_{θ_1}(x, z)

where θ_0 denotes the parameters of the L-NDMV and θ_1 the parameters of the second-order NDMV. Intuitively, the objective requires the two models to reach agreement on the probability distribution over dependency parse trees z. We use joint decoding (parsing) to predict the dependency parse tree z_predict for sentence x.
The inside and parsing algorithms for jointly trained models can be found in the appendix.
Learning via product EM algorithm Liang et al. (2007) propose to optimize the objective using the product EM algorithm, based on the following lower bound of the objective:

L(θ, q) = Σ_x ( E_{z∼q}[log p_{θ_0}(x, z) + log p_{θ_1}(x, z)] + H(q) )
The product EM algorithm performs coordinate-wise ascent on L(θ, q). In the product E-step, we fix θ and optimize L(θ, q) with respect to q; the maximum is attained by setting

q(z) ∝ p_{θ_0}(x, z) p_{θ_1}(x, z)

In the product M-step, we fix q and optimize L(θ, q) with respect to θ. Up to a constant that does not depend on θ, the M-step objective decomposes into one term for each model, so we update the parameters of each model separately based on the expected counts obtained from the product E-step, which can be calculated with the inside-outside algorithm.
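The product E-step can be sketched on a toy example. The two models' joint tree probabilities below are invented for illustration:

```python
# Product E-step on a toy pair of models over the same candidate trees.
# All numbers are invented for illustration.
p0 = {"z1": 0.10, "z2": 0.30, "z3": 0.05}  # p_{theta_0}(x, z)
p1 = {"z1": 0.20, "z2": 0.10, "z3": 0.40}  # p_{theta_1}(x, z)

def product_e_step(p0, p1):
    """q(z) is the normalized product of the two models' probabilities."""
    prod = {z: p0[z] * p1[z] for z in p0}
    z_sum = sum(prod.values())  # for one sentence, log(z_sum) is its O_agree term
    return {z: v / z_sum for z, v in prod.items()}

q = product_e_step(p0, p1)
# products: 0.02, 0.03, 0.02 -> q = {"z1": 2/7, "z2": 3/7, "z3": 2/7}
```

Note how agreement emerges: a tree ranked highly by only one model (z2 for p0, z3 for p1) gets less posterior mass relative to its best single-model rank than it would under either model alone.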
Learning via direct marginal likelihood optimization O_agree can be calculated with the inside algorithm alone. Similar to Section 3.3, we can benefit from both agreement-based learning and the online-EM algorithm if we use gradient descent to optimize O_agree instead of using the product EM algorithm.

Setting
On the WSJ dataset, for fair comparison, we follow Han et al. (2017) and Han et al. (2019) and use HDP-DEP (Naseem et al., 2010) to initialize our models. Specifically, we train the unsupervised HDP-DEP model on WSJ, use it to parse the training corpus, and then use the predicted parse trees to perform supervised training of our model for several epochs. On the UD dataset, we use the K&M initialization (Klein and Manning, 2004). We use direct marginal likelihood optimization (DMO) as the training method and Adam (Kingma and Ba, 2015) as the optimizer with learning rate 0.001. The batch size is set to 64 for WSJ and 100 for UD. The hyperparameters of the neural networks, the setting of the L-NDMV and more details can be found in the appendix. We apply early stopping based on the log-likelihood of the development data and report the mean accuracy over 5 random restarts.

Result
Result on WSJ In Table 1, we compare our methods with previous unsupervised dependency parsers.
Our sibling-NDMV model outperforms the previous state-of-the-art parser by 1.9 points on WSJ10 and 3.1 points on WSJ in the unlexicalized setting. Our lexicalized sibling-NDMV achieves a further improvement over the unlexicalized sibling-NDMV. On the other hand, our grand-NDMV performs significantly worse than the sibling-NDMV, and lexicalization hurts its performance. Why grandparent information is less useful than sibling information in unsupervised parsing is an intriguing question that we leave for future research. Joint training with a first-order L-NDMV increases the performance of the unlexicalized sibling-NDMV from 77.5 to 79.9 and that of the unlexicalized grand-NDMV from 71.4 to 76.0 on WSJ10. The jointly trained models also outperform the lexicalized second-order models.

[Table 1, partial (WSJ10 / WSJ accuracy): (Berg-Kirkpatrick et al., 2010) 63.0 / -; PR-S (Gillenwater et al., 2011) 64.3 / 53.3; E-DMV (Headden et al., 2009) 65.0 / -; TSG-DMV (Blunsom and Cohn, 2010) 65.9 / 53.1; UR-A E-DMV (Tu and Honavar, 2012) 71.4 / 57.0; CRFAE (Cai et al., 2017) 71.7 / 55.7; Neural DMV (Jiang et al., 2016) 72.5 / 57.6; HDP-DEP (Naseem et al., 2010) 73.8 / -; NVTP (Li et al., 2018) 54]

Result on UD In Table 2, we first compare our models with models that do not use the universal linguistic prior (UP). The variational variant of D-NDMV (Han et al., 2019) is the recent state-of-the-art model without UP. Our method outperforms theirs on six of the eight languages and also on average. We then compare our second-order models with recent state-of-the-art discriminative models, which rely heavily on the universal linguistic prior to achieve good performance (for example, Li et al. (2018) report bad results when they do not use the universal linguistic prior). We find that sibling-NDMV outperforms these discriminative models while grand-NDMV achieves comparable results, even though we do not utilize the universal linguistic prior.

Effect of Skip-Connections
From Tables 3 and 4, we find that using skip-connections achieves higher log-likelihood and better parsing accuracy in most cases. On UD, performance is much better with skip-connections except on Basque.

Comparison of Training Methods
In Table 3, we find that the EM algorithm significantly underperforms DMO. On the other hand, Table 4 shows that the EM algorithm performs comparably to DMO on WSJ. We also compare the learning curves of the two methods. For fair comparison, we use the same batch size for both. First, we conduct an experiment using the joint L-NDMV and sibling-NDMV model on WSJ. In Figure 2, we find that DMO converges to a higher log-likelihood than EM while the convergence speed is roughly the same. In Figure 3, we find that DMO finds a slightly better model than EM. Second, we conduct an experiment using the sibling-NDMV model on the UD French dataset. In Figure 4, we find that DMO converges faster than EM and converges to a higher log-likelihood. In Figure 5, we find that the model accuracy of DMO is much higher than that of EM at the beginning, but it drops significantly after epoch 23, suggesting that early stopping is necessary. We also find similar phenomena for other languages on UD.

[Table 2 legend, partial: ... (Noji and Miyao, 2015). DV, VV: the deterministic and variational variants of D-NDMV (Han et al., 2019). +sibling: our second-order sibling-NDMV. +grand: our second-order grand-NDMV. NVTP: neural variational transition-based parser (Li et al., 2018). CM: Convex-MST (Grave and Elhadad, 2015).]
It should be noted that we use HDP-DEP (Naseem et al., 2010) for initialization on WSJ and the K&M initialization (Klein and Manning, 2004) on UD. HDP-DEP initialization leads to a very high initial UAS of 75% (Figure 3), while the K&M initialization leads to a low initial UAS of 38.5% (Figure 5). It can be seen that EM is more sensitive to the initialization, while DMO can achieve good results even when the initialization is bad.

Effect of Joint Training and Parsing
In Table 5, we compare the performance with different training and parsing settings. We find that joint parsing is better than separate parsing in both training settings. With joint training, each individual model can achieve better performance compared with separate training, which shows the effectiveness of agreement-based joint learning.

Limitations
Our second-order NDMV model is more sensitive to the initialization than the first-order NDMV model. We fail to produce a good result under the K&M initialization on WSJ: sibling-NDMV achieves only 58.5% UAS on WSJ10, while the first-order NDMV model achieves 69.7% UAS. We rely on the parsing results of HDP-DEP to initialize our model in order to reach the state-of-the-art result on WSJ. This is similar to the case of the L-NDMV, which performs badly when using the K&M initialization according to Han et al. (2017). Because of the bad performance of the L-NDMV with the K&M initialization, as well as the time constraint that prevents us from running HDP-DEP on UD, we did not conduct experiments of agreement-based learning with the L-NDMV on the UD datasets. We leave this for future work.
Our second-order model is also quite sensitive to the design of the neural architecture, similar to the case of unsupervised constituency parsing reported by Kim et al. (2019). We also try third-order NDMV models (grand-sibling or tri-sibling) but are not able to obtain better results than with the sibling-NDMV.
Our second-order parsing algorithm has a theoretical time complexity of O(n^4), which is higher than the O(n) time complexity of transition-based unsupervised parsers (Li et al., 2018) and the O(n^3) time complexity of first-order NDMV models, where n is the sentence length. However, transition-based parsers are hard to batchify, while our model can be parallelized efficiently following the methods introduced by Torch-Struct (Rush, 2020). In practice, our second-order parser runs very fast on GPU, requiring only several minutes to train.

[Table 5: The effect of joint training and joint parsing]

Conclusion
We propose second-order NDMV models, which incorporate sibling or grandparent information. We find that sibling information is very useful in unsupervised dependency parsing. We use agreement-based learning to combine the benefits of second-order parsing and lexicalization, achieving state-of-the-art results on the WSJ dataset. We also show the effectiveness of our neural parameterization architecture with skip-connections and the direct marginal likelihood optimization method.

A.1 Inside Algorithm and Parsing Algorithm
We use the dynamic programming substructure proposed for second-order supervised dependency parsing. For the grandparent-child model, Koo and Collins (2010) augment both complete and incomplete spans with grandparent indices, and call the augmented spans g-spans. Formally, they denote a complete g-span as C^g_{h,e}, where C_{h,e} is a normal complete span in the Eisner algorithm and g is the grandparent's index, with the implication that (g, h) is a dependency. Incomplete g-spans are defined similarly.
For the second-order NDMV, we further augment incomplete and complete g-spans with valence information. We distinguish the direction of a span explicitly, denoting our augmented complete v-span as C^{g,v}_{h,e,d}, where d is the direction, v is the valence, and h and e are the start and end indices of the span. Incomplete v-spans are defined similarly.
For grand-NDMV, given sentence x, we suppose that x_0 is the imaginary root token and x_1, ..., x_n are the tokens. We denote D[i, g, d, v, a] = log(P_DECISION(decision = a | parent = x_i, grand = x_g, direction = d, valence = v)), S[i, c, g, d, v] = log(P_CHILD(child = x_c | parent = x_i, grand = x_g, direction = d, valence = v)), and R[i] = log(P_ROOT(child = x_i)). Given these definitions, the inside algorithm of grand-NDMV is shown in Algorithm 1.
For sibling-NDMV, g in C^g_{h,e} stands for the index of the sibling instead of the grandparent. Given sentence x, we suppose that x_0 is a special NULL token which stands for no sibling and x_1, ..., x_n are the tokens. We denote D[i, g, d, v, a] = log(P_DECISION(decision = a | parent = x_i, sibling = x_g, direction = d, valence = v)), S[i, c, g, d, v] = log(P_CHILD(child = x_c | parent = x_i, sibling = x_g, direction = d, valence = v)), and R[i] = log(P_ROOT(child = x_i)). Given these definitions, the inside algorithm of sibling-NDMV is shown in Algorithm 2.
For the jointly trained L-NDMV and second-order NDMV models, we take the jointly trained L-NDMV and sibling-NDMV as an example. We denote by x the sequence of word/POS pairs, whose indexing starts at 1. The inside algorithm of the jointly trained L-NDMV and sibling-NDMV model is shown in Algorithm 3.
Following Eisner (2016), we use back-propagation to obtain the expected counts of grammar rules. For the parsing algorithm, we can replace logsumexp with max in Algorithms 1, 2 and 3 to obtain the Viterbi log-likelihood of the sentence, then use back-propagation to recover the grammar rules used in the Viterbi parse tree, and finally reconstruct the parse tree from these rules.

A.2 Full Parameterization
Denote the embeddings of the parent, child and sibling (or grandparent) by x_p, x_c, x_s ∈ R^d. We use three different linear transformations to produce the representations of each token as a parent, child, and sibling (or grandparent):

e_c = W_c x_c,  e_p = W_p x_p,  e_s = W_s x_s

We feed e_s, e_c, e_p to the same neural network, which consists of three MLPs with skip-connections. The first MLP encodes the valence information:

e'_t = MLP_1^val(e_t) + e_t,  t ∈ {c, p, s}

where val ∈ {HASCHILD, NOCHILD} selects the parameters of the MLP.

A.3 Hyperparameters
We set the dimension of the POS embeddings to 100. The dimension of all linear layers used to calculate the hidden representations is set to 100. We set the size of the decomposed trilinear function parameters to 30 for CHILD and ROOT rules and 10 for DECISION rules in the unlexicalized setting.
For the lexicalized model, we set the dimension of the word embeddings to 100. We concatenate the POS embedding and the word embedding as input. The dimension of all linear layers used to calculate the hidden representations is set to 200. We set the size of the decomposed trilinear function parameters to 150 for CHILD and ROOT rules and 50 for DECISION rules. We use an additional dropout layer after the embedding layer to avoid over-fitting, since the vocabulary of the lexicalized model is much larger than that of the unlexicalized model. The dropout rate is set to 0.5.

A.4 Setting of L-NDMV
The vocabulary consists of word/POS pairs that appear at least twice in the WSJ10 dataset. We use random initialization for the POS embeddings and FastText embeddings to initialize the word embeddings, which differs from the setting in the original paper (Han et al., 2017). We train FastText on the whole WSJ dataset for 100 epochs with window size 3 and embedding dimension 100.