Semi-supervised Parsing with a Variational Autoencoding Parser

We propose an end-to-end variational autoencoding parsing (VAP) model for semi-supervised graph-based projective dependency parsing. It encodes the input into continuous latent variables in a sequential manner using deep neural networks (DNNs) that can exploit contextual information, and reconstructs the input with a generative model. The VAP model admits a unified structure with different loss functions for labeled and unlabeled data, with shared parameters. We conducted experiments on the WSJ data sets, showing that the proposed model can use unlabeled data to improve performance over a limited amount of labeled data, on a par with a recently proposed semi-supervised parser while offering faster inference.

While supervised approaches have been very successful, they require large amounts of labeled data, particularly when neural architectures, which are usually over-parameterized, are used. Syntactic annotation is notoriously difficult and requires specialized linguistic expertise, posing a serious challenge for low-resource languages. Semi-supervised parsing aims to alleviate this problem by combining a small amount of labeled data with a large amount of unlabeled data, to improve parsing performance over using labeled data alone. Traditional semi-supervised parsers use unlabeled data to generate additional features that assist the learning process (Koo et al., 2008), together with different variants of self-training (Søgaard, 2010). However, these approaches are usually pipelined, and error propagation may occur.
In this paper, we propose the Variational Autoencoding Parser, or VAP, which extends the idea of the VAE, as illustrated in Figure 3. The VAP model uses unlabeled examples to learn continuous latent variables of the sentence, which can be used to support tree inference by providing an enriched representation.
We summarize our contributions as follows: 1. We propose a Variational Autoencoding Parser (VAP) for semi-supervised dependency parsing; 2. We design a unified loss function for the proposed parser to deal with both labeled and unlabeled data; 3. We show improved performance of the proposed model with unlabeled data on the WSJ data sets, and the performance is on a par with a recently proposed semi-supervised parser (Corro and Titov, 2019), with faster inference.

Related Work
Most dependency parsing studies fall into two major groups: graph-based and transition-based (Kübler et al., 2009). Graph-based parsers (McDonald, 2006) regard parsing as a structured prediction problem of finding the most probable tree, while transition-based parsers (Nivre, 2004, 2008) treat parsing as a sequence of actions at different stages leading to a dependency tree. While earlier work relied on manual feature engineering, in recent years hand-crafted features have been replaced by embeddings, and deep neural network architectures have been used to learn representations for scoring structural decisions, leading to improved performance in both graph-based and transition-based parsing (Nivre, 2014; Pei et al., 2015; Chen and Manning, 2014; Dyer et al., 2015; Weiss et al., 2015; Andor et al., 2016; Kiperwasser and Goldberg, 2016; Wiseman and Rush, 2016).
The annotation difficulty of this task has also motivated work on unsupervised (grammar induction) and semi-supervised approaches to parsing (Tu and Honavar, 2012; Jiang et al., 2016; Koo et al., 2008; Li et al., 2014; Kiperwasser and Goldberg, 2015; Cai et al., 2017; Corro and Titov, 2019), and has led to advances in using unlabeled data for constituent grammar (Shen et al., 2018b,a). As in other structured prediction tasks, directly optimizing the objective is difficult when the underlying probabilistic model requires marginalizing over the dependency trees. Variational approaches are a natural way to alleviate this difficulty, as they improve a lower bound of the original objective, and have been applied in several recent NLP works (Stratos, 2019; Chen et al., 2018; Kim et al., 2019b,a). The Variational Autoencoder (VAE) (Kingma and Welling, 2014) is particularly useful for latent representation learning, and has been studied in the semi-supervised context as the Conditional VAE (CVAE) (Sohn et al., 2015). Note that our work differs from the VAE, which is designed for tabular data rather than for structured prediction: the input to VAP is the sequence of sentential tokens and the output is a dependency tree.

Graph-based Dependency Parsing
A dependency graph of a sentence can be regarded as a directed tree spanning all the words of the sentence, including a special "word", the ROOT, from which the tree originates. Assuming a sentence of length $l$, a dependency tree can be denoted as $T = (\langle h_1, m_1 \rangle, \ldots, \langle h_{l-1}, m_{l-1} \rangle)$, where $h_t$ is the index of the head word of the dependency whose modifier is the $t$-th word $m_t$.
Our graph-based VAP parser is constructed based on the following standard structured prediction paradigm (McDonald et al., 2005; Taskar et al., 2005). At inference time, based on a scoring function $S_\Lambda$ with parameters $\Lambda$, the parsing problem is formulated as finding the most probable directed spanning tree for a given sentence $x$:
$$T^* = \operatorname*{arg\,max}_{\hat{T} \in \mathcal{T}(x)} S_\Lambda(x, \hat{T}),$$
where $T^*$ is the highest-scoring parse tree and $\mathcal{T}(x)$ is the set of all valid trees for the sentence $x$.
It is common to factorize the score of the entire graph into the sum of the scores of its substructures, the individual arcs (McDonald et al., 2005):
$$S_\Lambda(x, \hat{T}) = \sum_{(h, m) \in \hat{T}} s_\Lambda(h, m),$$
where $\hat{T}$ represents a candidate parse tree and $s_\Lambda$ is a function scoring individual arcs: $s_\Lambda(h, m)$ describes the likelihood of an arc from the head $h$ to its modifier $m$ in the tree. Throughout this paper, the scoring is based on individual arcs, as we focus on first-order parsing.
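To make the arc-factored scoring concrete, the short sketch below scores one candidate parse of a toy sentence by summing its individual arc scores. It is a minimal illustration, not the authors' code: the sentence, the score values, and the candidate tree are invented for the example.

```python
import numpy as np

# Toy sentence: index 0 is the artificial ROOT token.
tokens = ["<ROOT>", "She", "reads", "books"]

# Hypothetical arc-score matrix s[h, m]: score of an arc from head h to modifier m.
# In the model these scores come from the neural scorer; here they are made up.
s = np.array([
    #  ROOT   She  reads  books   (modifier)
    [  0.0,  1.0,   4.0,   0.5],  # head = ROOT
    [  0.0,  0.0,   1.5,   0.2],  # head = She
    [  0.0,  3.0,   0.0,   2.5],  # head = reads
    [  0.0,  0.1,   0.3,   0.0],  # head = books
])

# Candidate tree T = (<h_1, m_1>, ..., <h_{l-1}, m_{l-1}>):
# "reads" heads "She" and "books", and ROOT heads "reads".
tree = [(2, 1), (0, 2), (2, 3)]

# Arc-factored score of the whole tree: the sum of its arc scores.
tree_score = sum(s[h, m] for h, m in tree)
print(tree_score)  # 4.0 + 3.0 + 2.5 = 9.5
```

The inference problem above then amounts to finding, among all valid projective trees, the one with the highest such sum.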

Scoring Function Using Neural Architecture
We used the same neural architecture as in Kiperwasser and Goldberg (2016). We first use a bi-LSTM that takes as input $u_t = [p_t; e_t]$ at position $t$ to incorporate contextual information, feeding the word embedding $e_t$ concatenated with the POS tag embedding $p_t$ of each word. The bi-LSTM projects $u_t$ to $o_t$, and a nonlinear transformation is then applied to these projections.
Suppose the hidden states generated by the bi-LSTM for a sentence of length $l$ are $o_1, \ldots, o_l$. We compute the arc scores by introducing parameters $W_h$, $W_m$, $w$ and $b$, and transforming the hidden states as follows:
$$s_\Lambda(h, m) = w^\top \tanh(W_h o_h + W_m o_m + b).$$
In this formulation, we first use two parameter matrices to extract two different representations that carry two different types of information: a head seeking its modifier (h-arc) and a modifier seeking its head (m-arc). A nonlinear function then maps them to an arc score. For a single sentence, we can form a scoring matrix, as shown in Figure 2, by filling each entry of the matrix with the score we obtained. The scoring matrix thus represents the head-modifier arc scores for all possible arcs connecting two tokens in a sentence (Zheng, 2017). Using this scoring matrix, we build our graph-based parser.
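A minimal PyTorch sketch of this scorer is shown below. It follows the description above (bi-LSTM over concatenated word and POS embeddings, head and modifier projections $W_h$ and $W_m$, and a nonlinearity mapped to a scalar by $w$ and $b$); the default dimensions follow the experimental settings reported later, but the class and variable names are our own assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ArcScorer(nn.Module):
    """Bi-LSTM arc scorer in the spirit of Kiperwasser and Goldberg (2016).
    A simplified sketch; names and wiring are illustrative assumptions."""

    def __init__(self, n_words, n_tags, d_word=100, d_tag=25, d_lstm=125, d_arc=100):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, d_word)
        self.tag_emb = nn.Embedding(n_tags, d_tag)
        self.bilstm = nn.LSTM(d_word + d_tag, d_lstm,
                              batch_first=True, bidirectional=True)
        # Two projections: one for a token acting as a head (h-arc),
        # one for a token acting as a modifier (m-arc).
        self.W_h = nn.Linear(2 * d_lstm, d_arc)          # includes the bias b
        self.W_m = nn.Linear(2 * d_lstm, d_arc, bias=False)
        self.w = nn.Linear(d_arc, 1, bias=False)

    def forward(self, word_ids, tag_ids):
        # word_ids, tag_ids: (batch, length), with the ROOT token at position 0.
        u = torch.cat([self.word_emb(word_ids), self.tag_emb(tag_ids)], dim=-1)
        o, _ = self.bilstm(u)                    # (batch, length, 2 * d_lstm)
        h = self.W_h(o).unsqueeze(2)             # (batch, length, 1, d_arc)
        m = self.W_m(o).unsqueeze(1)             # (batch, 1, length, d_arc)
        # scores[b, i, j] = score of an arc with head i and modifier j.
        return self.w(torch.tanh(h + m)).squeeze(-1)
```

Each entry of the returned matrix corresponds to one cell of the scoring matrix in Figure 2.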

Variational Autoencoding Parser
VAP (illustrated in Figure 3b) is a semi-supervised parser able to make use of unlabeled data in addition to labeled data, extending the idea of the variational autoencoder (VAE, illustrated in Figure 3a) to dependency parsing.
VAP learns, using both labeled and unlabeled data, a continuous latent variable representation designed to support the parsing task by creating contextualized token representations that capture properties of the full sentence. Specifically, each token in the sentence is represented by its latent variable $z_t$, a high-dimensional Gaussian variable, and these are aggregated into a group of latent variables $z$. This configuration ensures that the continuous latent variables retain the contextual information from the lower-level neural models to assist in finding each token's head or modifier, while also forcing the representations of similar tokens to be closer. The latent variable group $z$ is modeled via $P(z|x)$. In addition, we model the process of reconstructing the input sentence from the latent variables through a generative story $P(x|z)$.
We adjust the original VAE setup to our semi-supervised task by considering examples with labels, similar to recent conditional variational formulations (Sohn et al., 2015; Miao and Blunsom, 2016; Zhou and Neubig, 2017). We propose a full probabilistic model for a given sentence $x$, with a unified objective to maximize for both supervised and unsupervised parsing:
$$\mathcal{J} = \begin{cases} \log P_{\theta,\omega}(T, x) & \text{if the gold tree } T \text{ is observed,} \\ \log P_{\theta}(x) & \text{otherwise.} \end{cases}$$
This objective can be interpreted as follows: if a training example comes with a gold tree $T$, the objective is the log joint probability $P_{\theta,\omega}(T, x)$; if the gold tree is missing, the objective is the log marginal probability $P_\theta(x)$. The probability of a tree is modeled by a tree-CRF with parameters $\omega$ as $P_\omega(T|x)$. Given the assumed generative process $P_\theta(x|z)$, directly optimizing this objective is intractable, so we instead optimize its Evidence Lower BOund (ELBO):
$$\mathcal{J}_{lap} = \mathbb{E}_{Q_\phi(z|x)}\left[\log P_\theta(x|z)\right] - \mathrm{KL}\left(Q_\phi(z|x)\,\|\,P(z)\right) + \log P_\omega(T|x),$$
where the last term, the tree-CRF log-likelihood, is included only when the gold tree $T$ is observed, $Q_\phi(z|x)$ is the variational posterior with parameters $\phi$, and $P(z)$ is the prior over the latent variables. We show $\mathcal{J}_{lap}$ is the ELBO of $\mathcal{J}$ in Appendix A.1.
In practice, as in other VAE-style models, the expectation is approximated with Monte Carlo samples:
$$\mathbb{E}_{Q_\phi(z|x)}\left[\log P_\theta(x|z)\right] \approx \frac{1}{N}\sum_{j=1}^{N} \log P_\theta(x|z_j),$$
where $z_j$ is the $j$-th of $N$ samples drawn from $Q_\phi(z|x)$. At prediction time, we simply use $\mu_z$ rather than sampling $z$.
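To make the variational machinery concrete, the sketch below shows an encoder $Q_\phi(z|x)$ producing per-token Gaussian parameters, reparameterized sampling of $z$, and a per-sentence objective combining the ELBO terms with the tree-CRF log-likelihood when a gold tree is available. It is a simplified illustration under our own assumptions about shapes and module names, not the authors' implementation; `decoder_log_prob` and `tree_crf_log_prob` are hypothetical helpers standing in for $\log P_\theta(x|z)$ and $\log P_\omega(T|x)$.

```python
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    """Q_phi(z|x): maps bi-LSTM states to per-token Gaussian parameters."""
    def __init__(self, d_in, d_z):
        super().__init__()
        self.mu = nn.Linear(d_in, d_z)
        self.logvar = nn.Linear(d_in, d_z)

    def forward(self, o):                        # o: (batch, length, d_in)
        return self.mu(o), self.logvar(o)

def sample_z(mu, logvar):
    """Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * eps

def vap_objective(o, encoder, decoder_log_prob, tree_crf_log_prob=None, n_samples=1):
    """Per-sentence objective (to be maximized).
    decoder_log_prob(z) -> log P_theta(x|z); tree_crf_log_prob = log P_omega(T|x),
    supplied only for labeled sentences. Both are hypothetical callables here."""
    mu, logvar = encoder(o)
    # Monte Carlo estimate of E_{Q_phi(z|x)}[log P_theta(x|z)] with N samples.
    recon = sum(decoder_log_prob(sample_z(mu, logvar))
                for _ in range(n_samples)) / n_samples
    # KL(Q_phi(z|x) || N(0, I)) in closed form, summed over tokens and dimensions.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    elbo = recon - kl
    if tree_crf_log_prob is not None:            # labeled sentence: add log P_omega(T|x)
        elbo = elbo + tree_crf_log_prob
    return elbo
```

At prediction time one would pass `mu` directly to the downstream components instead of calling `sample_z`, mirroring the use of $\mu_z$ described above.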

Incorporating POS and External Embeddings
In previous studies exploring parsing with neural architectures (Chen and Manning, 2014; Dozat and Manning, 2017; Dozat et al., 2017; Kiperwasser and Goldberg, 2016), POS tags and external embeddings have been shown to contain important information characterizing the dependency relationship between a head and a child. Therefore, in addition to the variational autoencoding framework taking the randomly initialized word embeddings as input, we can optionally build the same structure for POS tags (reconstructing tags) and for external embeddings (reconstructing words), whose variational objectives are $\mathcal{U}_p$ and $\mathcal{U}_e$, respectively. Hence, the final variational objective can be a combination of the three: $\mathcal{U} = \mathcal{U}_w + \mathcal{U}_p + \mathcal{U}_e$, where $\mathcal{U}_w$ is the original $\mathcal{U}$ in Lemma A.1 (or just $\mathcal{U} = \mathcal{U}_w + \mathcal{U}_p$ if external embeddings are not used).

Experimental Settings
Data sets We compared our models' performance with strong baselines on the WSJ data set, the Stanford Dependency conversion (De Marneffe and Manning, 2008) of the Penn Treebank (Marcus et al., 1993). We used the standard section split: 2-21 for training, 22 for development, and 23 for testing. To simulate a low-resource setting, we used 10% of the training set as annotated data and the remaining 90% as unlabeled data.
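A small sketch of how such a low-resource split could be produced is shown below; the random sampling with a fixed seed is our assumption, as the paper does not state how the 10% labeled subset is selected.

```python
import random

def split_labeled_unlabeled(train_sentences, labeled_fraction=0.1, seed=0):
    """Simulate a low-resource setting: keep a fraction of the training
    treebank as labeled data and treat the rest as unlabeled.
    Illustrative sketch only; the exact selection procedure is an assumption."""
    rng = random.Random(seed)
    sentences = list(train_sentences)
    rng.shuffle(sentences)
    n_labeled = int(labeled_fraction * len(sentences))
    return sentences[:n_labeled], sentences[n_labeled:]
```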

Input Representation and Architecture
We describe the details of the architecture as follows. The internal word embeddings have dimension 100 and the POS embeddings have dimension 25. The hidden layer of the bi-LSTM has dimension 125. The nonlinear layers used to form the head and modifier representations both have dimension 100. We also used separate bi-LSTMs for words and POS tags.
Training In the training phase, we used Adam (Kingma and Ba, 2014) to learn all the parameters of the VAP model. We made no effort to tune the models' hyper-parameters, which remained the same across all experiments. To prevent over-fitting, we applied early stopping based on the development set.
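A schematic training loop under these settings might look as follows. It is a sketch, not the authors' code: `compute_loss` and `evaluate_uas` are hypothetical helpers, and the learning rate, epoch budget, and patience are our assumptions rather than reported values; only the use of Adam and early stopping on the development set comes from the text.

```python
import torch

def train(model, train_batches, dev_data, compute_loss, evaluate_uas,
          lr=1e-3, max_epochs=100, patience=5):
    """Adam optimization with early stopping on development-set accuracy.
    compute_loss returns the negative VAP objective for a batch;
    evaluate_uas scores the model on the development set."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    best_uas, epochs_without_improvement = 0.0, 0
    best_state = {k: v.clone() for k, v in model.state_dict().items()}
    for epoch in range(max_epochs):
        model.train()
        for batch in train_batches:
            optimizer.zero_grad()
            loss = compute_loss(model, batch)
            loss.backward()
            optimizer.step()
        model.eval()
        uas = evaluate_uas(model, dev_data)
        if uas > best_uas:
            best_uas, epochs_without_improvement = uas, 0
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:   # early stop
                break
    model.load_state_dict(best_state)
    return model
```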

Semi-Supervised Dependency Parsing on WSJ Data Set
We evaluated our VAP model on the WSJ data set and compared its performance with other semi-supervised parsing models, including CRFAE (Cai et al., 2017), which was originally designed for dependency grammar induction but can be adapted for semi-supervised parsing, and the "differentiable Perturb-and-Parse" parser (DPPP) (Corro and Titov, 2019). To contextualize the results, we also experimented with the supervised neural margin-based parser (NMP) (Kiperwasser and Goldberg, 2016), a neural tree-CRF parser (NTP), and the supervised version of VAP, using only the labeled data.
To ensure a fair comparison, our experimental setup on the WSJ is identical to that of DPPP, using the same 100-dimensional skip-gram word embeddings employed in an earlier transition-based system (Dyer et al., 2015). We show our experimental results in Table 1. As shown in this table, our VAP model is able to utilize the unlabeled data to improve overall performance over using the labeled data alone. Our VAP model performs slightly worse than the NMP model, which we attribute to the increased model complexity introduced by the extra encoder and decoders dealing with the latent variables. However, our VAP model achieves results on semi-supervised parsing comparable to the DPPP model, while being simpler and more direct: it does not need to infer the parse tree when it is unknown. In contrast, the DPPP model has to apply Monte Carlo sampling from the posterior over structures, using a Gumbel-Max trick to approximate the categorical distribution at each step, which is computationally expensive, in order to form a high-probability dependency tree. Self-training using NMP with both labeled and unlabeled data is also included as a baseline; its performance deteriorates because the unlabeled data is not used appropriately.

Conclusion
In this study, we presented the Variational Autoencoding Parser (VAP), an end-to-end parser capable of utilizing unlabeled data together with labeled data to improve parsing performance, without any external resources. The proposed VAP model performs on a par with a recently published semi-supervised parsing system (Corro and Titov, 2019) on the WSJ data set, with faster inference, showing its potential for low-resource languages.