Enhancing Unsupervised Generative Dependency Parser with Contextual Information

Most unsupervised dependency parsers are based on probabilistic generative models that learn the joint distribution of the given sentence and its parse. Probabilistic generative models usually explicitly decompose the desired dependency tree into factorized grammar rules, which lack the global features of the entire sentence. In this paper, we propose a novel probabilistic model called discriminative neural dependency model with valence (D-NDMV) that generates a sentence and its parse from a continuous latent representation, which encodes global contextual information of the generated sentence. We propose two approaches to model the latent representation: the first deterministically summarizes the representation from the sentence and the second probabilistically models the representation conditioned on the sentence. Our approach can be regarded as a new type of autoencoder model for unsupervised dependency parsing that combines the benefits of both generative and discriminative techniques. In particular, our approach breaks the context-free independence assumption in previous generative approaches and therefore becomes more expressive. Our extensive experimental results on seventeen datasets from various sources show that our approach achieves competitive accuracy compared with both generative and discriminative state-of-the-art unsupervised dependency parsers.


Introduction
Dependency parsing is a very important task in natural language processing. The dependency relations identified by dependency parsing convey syntactic information useful in subsequent applications such as semantic parsing, information extraction, and question answering. In this paper, we focus on unsupervised dependency parsing, which aims to induce a dependency parser from training sentences without gold parse annotations.
Most previous approaches to unsupervised dependency parsing are based on probabilistic generative models, for example, the Dependency Model with Valence (DMV) (Klein and Manning, 2004) and its extensions (Cohen and Smith, 2009; Headden III et al., 2009; Cohen and Smith, 2010; Berg-Kirkpatrick et al., 2010; Gillenwater et al., 2010; Jiang et al., 2016). A disadvantage of such approaches comes from the context-freeness of dependency grammars, a strong independence assumption that limits the information available in determining how likely a dependency is between two words in a sentence. In DMV, the probability of a dependency is computed from only the head and child tokens, the dependency direction, and the number of dependencies already connected from the head token. Additional information used for computing dependency probabilities in later work is also limited to local morpho-syntactic features such as word forms, lemmas and categories (Berg-Kirkpatrick et al., 2010), which does not break the context-free assumption.
More recently, researchers have started to utilize discriminative methods in unsupervised dependency parsing, based on the idea of discriminative clustering (Grave and Elhadad, 2015), the CRFAE framework (Cai et al., 2017), or the neural variational transition-based parser (Li et al., 2019). By conditioning dependency prediction on the whole input sentence, discriminative methods are capable of utilizing not only local information but also global and contextual information of a dependency in determining its strength. Specifically, both Grave and Elhadad (2015) and Cai et al. (2017) include in the feature set of a dependency the information of the tokens around the head or child token of the dependency. In this way, they break the context-free independence assumption, because the same dependency can have different strengths in different contexts. In addition, Li et al. (2019) propose a variational autoencoder approach based on Recurrent Neural Network Grammars.
In this paper, we propose a novel approach to unsupervised dependency parsing in the middle between generative and discriminative approaches. Our approach is based on neural DMV (Jiang et al., 2016), an extension of DMV that employs a neural network to predict dependency probabilities. Unlike neural DMV, however, when computing the probability of a dependency, we rely on not only local information as in DMV, but also global and contextual information from a compressed representation of the input sentence produced by neural networks. In other words, instead of modeling the joint probability of the input sentence and its dependency parse as in a generative model, we model the conditional probability of the sentence and parse given global information of the sentence. Therefore, our approach breaks the context-free assumption in a similar way to discriminative approaches, while it is still able to utilize many previous techniques (e.g., initialization and regularization techniques) of generative approaches.
Our approach can be seen as an autoencoder. The decoder is a conditional generative neural DMV that generates the sentence as well as its parse from a continuous representation that captures the global features of the sentence. To model such global information, we propose two types of encoders, one deterministically summarizes the sentence with a continuous vector while the other probabilistically models the continuous vector conditioned on the sentence. Since the neural DMV can act as a fully-fledged unsupervised dependency parser, the encoder can be seen as a supplementary module that injects contextual information into the neural DMV for contextspecific prediction of dependency probabilities. This is very different from the previous unsupervised parsing approach based on the autoencoder framework (Cai et al., 2017;Li et al., 2019), in which the encoder is a discriminative parser and the decoder is a generative model, both of which are required for performing unsupervised parsing.
Our experiments verify that our approach achieves results comparable with recent state-of-the-art approaches on a wide range of datasets from various sources.

Dependency Model with Valence
The Dependency Model with Valence (DMV) (Klein and Manning, 2004) is an extension of an earlier dependency model (Carroll and Charniak, 1992) for grammar induction. Different from the earlier model, there are three types of probabilistic grammar rules in DMV, namely ROOT, DECISION and CHILD rules. To generate a token sequence and its corresponding dependency parse tree, the DMV model first generates a token c from the ROOT distribution p(c|root). Then the generation continues in a recursive procedure. At each generation step, it decides whether a new token needs to be generated from the current head token h in direction dir by sampling a STOP or CONTINUE symbol dec from the DECISION distribution p(dec|h, dir, val), where the valence val indicates whether token h has already generated a token in direction dir before. If dec is CONTINUE, a new token is generated from the CHILD distribution p(c|h, dir, val). If dec is STOP, the generation process switches to a new direction or a new head token. DMV can be trained from an unannotated corpus using the expectation-maximization algorithm.
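To make the factorization concrete, the sketch below scores a toy parse under hand-written DMV probability tables; all tags, rules and probability values are hypothetical, not learned from data.

```python
# Toy DMV with hypothetical probability tables (not learned from data).
# P_ROOT[c]: probability that the root generates token c.
# P_CONT[(h, dir, val)]: probability of CONTINUE (vs. STOP).
# P_CHILD[(h, dir, val)][c]: probability that h generates child c.
P_ROOT = {"VBZ": 0.6, "NN": 0.4}
P_CONT = {("VBZ", "L", 0): 0.9, ("VBZ", "L", 1): 0.1,
          ("VBZ", "R", 0): 0.8, ("VBZ", "R", 1): 0.1,
          ("NN", "L", 0): 0.2, ("NN", "R", 0): 0.1,
          ("JJ", "L", 0): 0.05, ("JJ", "R", 0): 0.05}
P_CHILD = {("VBZ", "L", 0): {"NN": 0.7, "JJ": 0.3},
           ("VBZ", "R", 0): {"JJ": 0.8, "NN": 0.2}}

def parse_prob(root, deps):
    """Joint probability of a sentence and its parse: a product over one
    ROOT rule and the DECISION/CHILD rules of every head token.
    `deps` maps each head to its left/right children, in generation order."""
    p = P_ROOT[root]
    for head, sides in deps.items():
        for d in ("L", "R"):
            val = 0
            for child in sides.get(d, []):
                p *= P_CONT[(head, d, val)]          # DECISION: CONTINUE
                p *= P_CHILD[(head, d, val)][child]  # CHILD: generate child
                val = 1                              # head now has a child here
            p *= 1.0 - P_CONT[(head, d, val)]        # DECISION: STOP
    return p

# Parse of "NN VBZ JJ": VBZ is the root, NN its left child, JJ its right child.
p = parse_prob("VBZ", {"VBZ": {"L": ["NN"], "R": ["JJ"]}, "NN": {}, "JJ": {}})
```

Note how every factor depends only on the head, direction and valence of one rule; this locality is exactly the context-free assumption the paper sets out to break.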

Neural DMV
The DMV model is very effective in inducing syntactic dependency relations between tokens in a sentence. One limitation of DMV is that correlation between similar tokens (such as different verb POS tags) is not taken into account during learning and hence rules involving similar tokens have to be learned independently. Berg-Kirkpatrick et al. (2010) proposed a feature-based DMV model in which the grammar rule probabilities are computed by a log-linear model with manually designed features that reflect token similarity. Jiang et al. (2016) proposed the neural DMV model which learns token embeddings to better capture correlations between tokens and utilizes a neural network to calculate grammar rule probabilities from the embeddings. Both approaches significantly outperform the original DMV. However, because of the strong independence assumption in such generative models, they can only utilize local information of a grammar rule (e.g., the head and child tokens, direction, and valence) when computing its probability.

Discriminative Neural DMV
We extend the neural DMV such that when predicting the probability of a grammar rule in parsing a sentence, the model incorporates not only local information of the rule but also global information of the sentence. Specifically, we model each grammar rule probability conditioned on a continuous vector. We therefore call our model the discriminative neural DMV (D-NDMV). In this way, the probability of a dependency rule becomes sensitive to the input sentence, which breaks the context-free assumption in the neural DMV. Here, we provide two approaches to model this global continuous vector.

Model
Suppose we have a sentence (i.e., a word sequence) w, the corresponding POS tag sequence x, and the dependency parse z, which is hidden in unsupervised parsing. DMV and its variants model the joint probability of the POS tag sequence and the parse, P(x, z), and, because of the context-free assumption, factorize the probability based on the grammar rules used in the parse. In contrast, to exploit the global features of the sentence, we model the conditional probability of the POS tag sequence and the parse given the sentence w: P(x, z|w). We assume conditional context-freeness and factorize the conditional probability based on the grammar rules.
P(x, z | w; Θ) = ∏_{r ∈ (x, z)} p(r | w; Θ)

where r ranges over all the grammar rules used in the parse z of tag sequence x, and Θ is the set of parameters used to compute the rule probabilities. Since one can reliably predict the POS tags x from the words w without considering the parse z (as most POS taggers do), to avoid degeneration of the model we compute p(r|w) based on global information of w produced by a long short-term memory network (LSTM). Figure 1 shows the neural network structure for parameterizing p(chd|head, dir, val, w), the probabilities of CHILD rules given the input sentence w. The structure is similar to the one used in neural DMV, except that an LSTM sentence encoder is used to obtain the representation s from the sentence w. The embeddings of the head POS tag and the valence are denoted by v_h and v_val. The concatenation [v_val; v_h; s] is fed into a fully-connected layer with a direction-specific weight matrix W_dir and the ReLU activation function to produce the hidden layer g. All possible child POS tags are represented by the matrix W_chd, whose i-th row is the output embedding of the i-th POS tag. We take the product of the hidden layer g and the child matrix W_chd and apply a softmax function to obtain the CHILD rule probabilities. ROOT and DECISION rule probabilities are computed in a similar way.
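A minimal NumPy sketch of this computation follows. The parameters and the sentence representation s are random stand-ins (a real model would train them and produce s with the LSTM encoder); only the shapes and the ReLU-then-softmax structure mirror the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
D_EMB, D_SENT, D_HID, N_TAGS = 10, 10, 32, 5

# Hypothetical, untrained parameters; s stands in for the LSTM output.
v_val = rng.normal(size=D_EMB)    # valence embedding
v_h   = rng.normal(size=D_EMB)    # head POS tag embedding
s     = rng.normal(size=D_SENT)   # sentence representation (stand-in)
W_dir = rng.normal(size=(D_HID, 2 * D_EMB + D_SENT))  # direction-specific
W_chd = rng.normal(size=(N_TAGS, D_HID))  # row i = output emb. of tag i

def child_rule_probs(v_val, v_h, s):
    """p(chd | head, dir, val, w): concatenate [v_val; v_h; s], apply a
    ReLU hidden layer, then softmax over the rows of W_chd."""
    g = np.maximum(0.0, W_dir @ np.concatenate([v_val, v_h, s]))
    logits = W_chd @ g
    e = np.exp(logits - logits.max())  # numerically stable softmax
    return e / e.sum()

probs = child_rule_probs(v_val, v_h, s)  # one probability per child tag
```

Because s changes with each input sentence, the same (head, dir, val) configuration can yield different CHILD distributions in different sentences.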
Since the mapping from w to s is deterministic, we call this the deterministic variant of D-NDMV. To make the notation consistent with subsequent sections, we introduce an auxiliary random variable s to represent the global information of sentence w. Its distribution is defined as

P_Φ(s|w) = δ(s − f_Φ(w))

where f_Φ denotes the LSTM encoder, δ(·) is the Dirac delta function, and Φ is the set of parameters of the LSTM neural network. Figure 2 (left) shows the directed graphical representation of this model. If we diminish the capacity of s (e.g., by shrinking its dimension), our model gradually reduces to the neural DMV.

Parsing
Given the deterministic variant with fixed parameters Φ and Θ, we can parse a sentence, represented by POS tag sequence x and word sequence w, by searching for the dependency tree z* with the highest probability p(x, z|w) among the set of valid parse trees Z(x).

Figure 2: Left: the illustration of the deterministic variant of D-NDMV as a directed graph. The deterministic variant forms an autoencoder with P_Φ(s|w) as the encoder and P_Θ(x, z|s) as the decoder. Right: the illustration of the variational variant of D-NDMV as a directed graph. We use dashed lines to denote the variational approximation q_Φ(s|x) to the intractable posterior P_Φ(s|x), and solid lines to denote the generative model P(s)P_Θ(x, z|s).
Note that once we compute all the grammar rule probabilities based on w, our model becomes a standard DMV and therefore dynamic programming can be used to parse each sentence efficiently (Klein and Manning, 2004).

Unsupervised Learning
Objective Function: In a typical unsupervised dependency parsing setting, we are given a set of training sentences with POS tags but without parse annotations. The objective function for learning the deterministic variant is the log conditional likelihood of the training corpus:

L(Θ, Φ) = Σ_i log P(x_i | w_i; Θ, Φ)

where the log conditional likelihood of each sentence is defined as

log P(x | w; Θ, Φ) = log Σ_{z ∈ Z(x)} P(x, z | w; Θ, Φ)

We may replace the summation over parses with maximization, in which case the objective becomes the conditional Viterbi likelihood.
Learning Algorithm: We optimize our objective function using the expectation-maximization (EM) algorithm. Specifically, the EM algorithm alternates between E-steps and M-steps to maximize a lower bound of the objective function. For each training sentence, the lower bound is defined as:

Q(q, Θ, Φ) = Σ_z q(z) log [P(x, z | w; Θ, Φ) / q(z)]

where q(z) is an auxiliary distribution over the latent parse z.
In the E-step, we fix Θ, Φ and maximize Q(q, Θ, Φ) with respect to q. The maximum is reached when the Kullback-Leibler divergence between q(z) and the posterior over parses is zero, i.e.,

q(z) = P(z | x, w; Θ, Φ)

Based on the optimal q, we compute the expected counts E_{q(z)}[c(r, x, z)] using the inside-outside algorithm, where c(r, x, z) is the number of times rule r is used in producing parse z of tag sequence x.
In the M-step, we fix q and maximize Q(q, Θ, Φ) with respect to Θ, Φ. The lower bound now takes the following form:

Q(Θ, Φ) = Σ_r E_{q(z)}[c(r, x, z)] log p(r | w; Θ, Φ) + Constant

where r ranges over all the grammar rules and Constant does not depend on Θ or Φ. The probabilities p(r|w; Θ, Φ) are computed by the neural networks, so we can back-propagate the objective Q(Θ, Φ) into the network parameters. We initialize the model either heuristically (Klein and Manning, 2004) or using a pre-trained unsupervised parser (Jiang et al., 2016); we then alternate between E-steps and M-steps until convergence.
Note that if we require q(z) to be a delta function, the algorithm becomes hard-EM, which computes the best parse of each training sentence in the E-step and sets the expected count of a rule to 1 if the rule is used in that parse and 0 otherwise. Hard-EM has been found to outperform EM in unsupervised dependency parsing (Spitkovsky et al., 2010; Tu and Honavar, 2012), so we use hard-EM in our experiments.
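The hard-EM loop described above can be sketched as follows. To keep the example self-contained, each toy sentence comes with a hand-enumerated list of candidate parses written as rule lists, standing in for the Viterbi dynamic program; all rules and probabilities are hypothetical.

```python
from collections import Counter

# Each "sentence" is a list of candidate parses; each parse is the list of
# grammar rules it uses.  A real implementation would score all projective
# trees with dynamic programming instead of enumerating candidates.
corpus = [
    [["root->V", "V->N/L"], ["root->N", "N->V/R"]],
    [["root->V", "V->N/L"], ["root->N", "N->V/R"]],
    [["root->V", "V->N/L"], ["root->N", "N->V/R"]],
]
probs = {"root->V": 0.6, "root->N": 0.4, "V->N/L": 1.0, "N->V/R": 1.0}

def score(parse):
    p = 1.0
    for r in parse:
        p *= probs[r]
    return p

def lhs(rule):
    return rule.split("->")[0]  # conditioning context of the rule

for _ in range(5):
    # E-step (hard): take the single best parse; its rules get count 1.
    counts = Counter()
    for candidates in corpus:
        counts.update(max(candidates, key=score))
    # M-step: renormalize within each conditioning context, with add-0.1
    # smoothing so unused rules keep nonzero probability.
    new_probs = {}
    for r in probs:
        group = [r2 for r2 in probs if lhs(r2) == lhs(r)]
        total = sum(counts.get(r2, 0) for r2 in group)
        new_probs[r] = (counts.get(r, 0) + 0.1) / (total + 0.1 * len(group))
    probs = new_probs
```

In D-NDMV the M-step instead back-propagates the weighted log-probabilities into the neural network rather than renormalizing counts directly, but the alternation is the same.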

Variational Variant for D-NDMV
Motivated by Bowman et al. (2016), we propose to model the global representation s as drawn from a prior distribution, typically a standard Gaussian. We introduce a variational distribution q_Φ(s|x) to approximate the intractable posterior over s, which places the model in the variational inference framework. We call this model the variational variant and illustrate its graphical model in Figure 2 (right). It can be seen from Figure 2 (right) that the variational variant shares the formulation of its encoder with the variational autoencoder (VAE). Different from the vanilla VAE, whose decoder is a simple multi-layer feedforward neural network, our decoder is a generative latent-variable model with the structured hidden variable z.
For learning the variational variant, we use the log likelihood as the objective function and optimize its lower bound, derived as follows:

log P(x) ≥ −KL(q_Φ(s|x) ∥ P(s)) + E_{q_Φ(s|x)}[log P_Θ(x|s)]

Estimating the expectation w.r.t. q_Φ(s|x) by the Monte Carlo method and setting the number of samples L to 1, we rewrite the second term as:

E_{q_Φ(s|x)}[log P_Θ(x|s)] ≈ (1/L) Σ_{l=1}^{L} log P_Θ(x|s^(l))

where s^(l) is computed with the reparameterization trick (Kingma and Welling, 2014), which yields low gradient variance and stabilizes training.
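A minimal sketch of the reparameterized sample and the closed-form Gaussian KL term follows; the encoder outputs mu and log-variance are random stand-ins for what a network over x would produce.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10  # sentence embedding dimension

# Hypothetical encoder outputs for one sentence (stand-ins for a network).
mu      = rng.normal(size=D)
log_var = rng.normal(size=D) * 0.1

def sample_s(mu, log_var):
    """Reparameterization trick: s = mu + sigma * eps with eps ~ N(0, I),
    so gradients can flow through mu and log_var."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * float(np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var))

s = sample_s(mu, log_var)          # one Monte Carlo sample (L = 1)
kl = kl_to_standard_normal(mu, log_var)  # extra term optimized in the M-step
```

The KL term is what distinguishes the variational variant's M-step from the deterministic one: it is added to the reconstruction objective and optimized by back-propagation.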
Because this formula is similar to Eq. 5, we can follow the subsequent derivation for the deterministic variant and learn the variational variant using EM. It is worth noting that, different from the deterministic variant, in the M-step an additional KL divergence term in Eq. 9 must be optimized by back-propagation.

Experiments
We tested our methods on seventeen treebanks from various sources. For each dataset, we compared with current state-of-the-art approaches on the specific dataset.

Dataset and Setup
English Penn Treebank We conducted experiments on the Wall Street Journal corpus (WSJ) with sections 2-21 for training, section 22 for validation and section 23 for testing. We trained our model on training sentences of length ≤ 10, tuned the hyper-parameters on validation sentences of length ≤ 10, and evaluated on testing sentences of length ≤ 10 (WSJ10) and on all testing sentences (WSJ). We report the directed dependency accuracy (DDA) of the learned grammars on the test sentences.

Universal Dependency Treebank We also conducted experiments on the Universal Dependency Treebank (Nivre et al., 2016). We trained our model on training sentences of length ≤ 15 and report the DDA on testing sentences of length ≤ 15 and ≤ 40.

Datasets from PASCAL Challenge on Grammar Induction We conducted experiments on corpora of eight languages from the PASCAL Challenge on Grammar Induction (Gelling et al., 2012). We trained our model with training sentences of length ≤ 10 and evaluated on testing sentences of length ≤ 10 and on all testing sentences.
Note that on the UD Treebanks and PASCAL datasets, we used the same hyper-parameters as in the WSJ experiments without further tuning.
Setup Following previous work, we conducted experiments under the unlexicalized setting, where a sentence is represented as a sequence of gold part-of-speech tags with punctuation removed. The embedding length was set to 10 for the head and child tokens and the valence. The sentence embedding length was also set to 10. We trained the neural networks using stochastic gradient descent with batch size 10 and learning rate 0.01. We used the change of the loss on the validation set as the stopping criterion. In the WSJ experiments, we followed prior practice and initialized our model using the pre-trained model of Naseem et al. (2010), which significantly increased the accuracy and decreased the variance. For the other experiments, we used a pre-trained NDMV model to initialize our method. We ran our model 5 times and report the average DDA.

Results on English Penn Treebank
In Table 1, we compare our method with a large number of previous approaches to unsupervised dependency parsing. Both the variational and the deterministic variants outperform recent approaches in the basic setup, which demonstrates the benefit of utilizing contextual information in dependency strength prediction. The deterministic variant has slightly better parsing accuracy, but the variational variant is more stable: over 5 runs, the standard deviations of the deterministic and variational variants are 0.530 and 0.402 respectively.

Results on Universal Dependency Treebank
We compare our model with several state-of-the-art models on the UD Treebanks and report the results in Table 2.
We first compare our model with two generative models: NDMV and the left-corner DMV (LC-DMV) (Noji et al., 2016). LC-DMV is the recent state-of-the-art generative approach on the Universal Dependency Treebank. Our variational variant of D-NDMV outperforms both LC-DMV and NDMV on average.
Furthermore, we compare our model with current state-of-the-art discriminative models, the neural variational transition-based parser (NVTP) (Li et al., 2019) and Convex-MST (Grave and Elhadad, 2015). Note that current discriminative approaches usually rely on a strong universal linguistic prior to get better performance, so the comparisons may not be fair to our model. Despite this, we find that our model achieves competitive accuracies compared with these approaches.

Results on Datasets from PASCAL Challenge
We also perform experiments on the datasets from the PASCAL Challenge (Gelling et al., 2012), which cover eight languages: Arabic, Basque, Czech, Danish, Dutch, Portuguese, Slovene and Swedish. We compare our approaches with NDMV (Jiang et al., 2016), Convex-MST (Grave and Elhadad, 2015) and CRFAE (Cai et al., 2017). NDMV and CRFAE are two state-of-the-art approaches on the PASCAL Challenge datasets. We show the directed dependency accuracy on the testing sentences no longer than 10 (Table 3) and on all the testing sentences (Table 4). It can be seen that on average our models outperform the other state-of-the-art approaches, including those utilizing the universal linguistic prior.

Analysis
In this section, we study what information is captured in the sentence embeddings and which configurations our model is sensitive to. We use the deterministic variant of D-NDMV in the following analysis; the variational variant performs similarly.

Rule Probabilities in Different Sentences
The motivation behind D-NDMV is to break the independence assumption and utilize global information in predicting grammar rule probabilities.
Here we conduct a few case studies of what information is captured in the sentence embedding and how it influences grammar rule probabilities. We train a D-NDMV on WSJ and extract all the embeddings of the training sentences. We then focus on the following two sentences: "What 's next" and "He has n't been able to replace the M'Bow cabal". We now examine the dependency rule probability of VBZ generating JJ to the right with valence 0 in these two sentences (illustrated in Figure 3). In the first sentence, this rule is used in the gold parse ("'s" is the head of "next"); but in the second sentence, this rule is not used in the gold parse (the head of "able" is "been" instead of "has"). We observe that the rule probability predicted by D-NDMV given the first sentence is indeed significantly larger than that given the second sentence, which demonstrates the positive impact of conditioning rule probability prediction on the sentence embedding.
To obtain a more holistic view of how rule probabilities change across sentences, we collect the probabilities of a particular rule ("IN" generating "CD" to the right with valence 1) predicted by our model for all the sentences of WSJ. Figure 4 shows two distributions over the rule probability: when the rule is used in the gold parse vs. when the rule is applicable to parsing the sentence but is not used in the gold parse. It can be seen that when the rule appears in the gold parse, its probability is clearly boosted in our model. Finally, for every sentence of WSJ, we collect the probabilities predicted by our model for all the rules that are applicable to parsing the sentence. We then calculate the average probability 1) when the rule is used in the gold parse, 2) when the rule is not used in the gold parse, and 3) regardless of whether the rule is used in the gold parse or not. We use the E-DMV model as the baseline, in which rule probabilities do not change with sentences. The results are shown in Table 5. We observe that compared with the E-DMV baseline, the rule probabilities predicted by our model are increased by 14.0% on average, probably because our model assigns higher probabilities to rules applicable to parsing the input sentence than to rules not applicable (e.g., when the head or child of the rule does not appear in the sentence). The increase of the average probability when the rule is used in the gold parse (15.7%) is higher than when the rule is not used (13.7%), which again demonstrates the advantage of our model.

Choice of Sentence Encoder
Besides LSTM, there are a few other methods of producing the sentence representation. Table 6 compares the experimental results of these methods. The bag-of-tags method simply computes the average of all the POS tag embeddings and has the lowest accuracy, showing that word order is informative for sentence encoding in D-NDMV. The anchored-words method replaces the POS tag embeddings used in the neural network of the neural DMV with the corresponding hidden vectors produced by an LSTM over the input sentence, which leads to better accuracy than bag-of-tags but is still worse than LSTM. Replacing the LSTM with a Bi-LSTM or an attention-based LSTM also does not lead to better performance, probably because these models are more powerful and hence more likely to result in degeneration and overfitting.
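The weakness of the bag-of-tags encoder noted above is easy to see in code: averaging tag embeddings is invariant to word order, so any permutation of the same tags yields the same sentence representation. The embeddings below are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical POS tag embeddings (stand-ins for learned parameters).
emb = {t: rng.normal(size=10) for t in ["DT", "NN", "VBZ", "JJ"]}

def bag_of_tags(tags):
    """Order-invariant sentence representation: the mean of tag embeddings."""
    return np.mean([emb[t] for t in tags], axis=0)

a = bag_of_tags(["DT", "NN", "VBZ", "JJ"])
b = bag_of_tags(["JJ", "VBZ", "NN", "DT"])  # same tags, reversed order
```

Since a and b are identical, the encoder cannot distinguish sentences that differ only in word order, which is consistent with its lower accuracy in Table 6.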

Impact of Genres
All the sentences in WSJ come from newswire, which conform to very similar syntactic styles.
Here we study whether our method can capture different syntactic styles by learning from Chinese Treebank 9.0 (2005), which contains sentences of two different genres: informal and formal. The experimental setup is the same as that in section 4. We pick the rule of "CD" (number) generating "AD" (adverb) to the left with valence 0 and collect the rule probability in sentences from the two genres. In informal sentences our model assigns smaller probabilities to the rule than in formal sentences. This may reflect the fact that formal texts are more precise than informal texts when presenting numbers, which is captured by our model.

Impact of Sentence Embedding Dimension
The dimension of sentence embeddings in our model is an important hyper-parameter. If the dimension is too large, the sentence embedding may capture too much information of the sentence, and the model is very likely to degenerate or overfit as discussed in section 3.1. If the dimension is too small, the model loses the benefit of sentence information and becomes similar to neural DMV. As Figure 5 illustrates, dimension 10 leads to the best parsing accuracy, while dimension 20 produces lower parsing accuracy, probably because of a combination of degeneration and overfitting. The conditional log Viterbi likelihood curves on the training set and the validation set in Figure 5 confirm that overfitting indeed occurs with dimension 20.

Conclusion
We propose D-NDMV, a novel unsupervised parser with characteristics from both generative and discriminative approaches to unsupervised parsing. D-NDMV extends neural DMV by parsing a sentence using grammar rule probabilities that are computed based on global information of the sentence. In this way, D-NDMV breaks the context-free independence assumption in generative dependency grammars and is therefore more expressive. Our extensive experimental results show that our approach achieves competitive accuracy compared with state-of-the-art parsers.

A Impact of Genres
All the sentences in WSJ come from newswire, which conform to very similar syntactic styles.
Here we study whether our method can capture different syntactic styles by learning from Chinese Treebank 9.0 (2005), which contains sentences of two different genres: the informal genre (chat messages and transcribed conversational telephone speech) and the formal genre (newswire, broadcast and so on). The experimental setup is the same as that in section 4. We extract the embeddings of the training sentences from the learned model and map them onto a 3D space via the t-SNE algorithm (Van der Maaten and Hinton, 2008) (Figure 6). It can be seen that although the two types of sentences are mixed together overall, many regions are clearly dominated by one type or the other. This verifies that the sentence embeddings learned by our approach capture some genre information.
We pick the rule of "CD" (number) generating "AD" (adverb) to the left with valence 0 and illustrate the distributions of the rule probability in sentences from the two genres in Figure 7. It can be seen that in informal sentences our model assigns smaller probabilities to the rule than in formal sentences. This may reflect the fact that formal texts are more precise than informal texts when presenting numbers, which is captured by our model.

Table 7: Sentences closest to the two example sentences in terms of the L2 distance between their learned embeddings. Both the word sequence and the POS tag sequence are shown for each sentence.

B Nearby Sentences in Embedding Space
We train our model on WSJ and extract all the embeddings of the training sentences. We then focus on the following two sentences: "What 's next" and "He has n't been able to replace the M'Bow cabal". Table 7 shows the two sentences as well as a few other sentences closest to them, measured by the L2 distance between their embeddings. It can be seen that most sentences close to the first sentence contain a copula followed by a predicative adjective, while most sentences close to the second sentence end with a noun phrase where the noun has a preceding modifier. These two examples show that the sentence embeddings learned by our approach encode syntactic information that can be useful in parsing.