Dependency Link Embeddings: Continuous Representations of Syntactic Substructures

We present a simple method to learn continuous representations of dependency substructures (links), with the motivation of directly working with higher-order, structured embeddings and their hidden relationships, and also to avoid the millions of sparse, template-based word-cluster features in dependency parsing. These link embeddings allow a significantly smaller and simpler set of unary features for dependency parsing, while maintaining improvements similar to state-of-the-art, n-ary word-cluster features, and also stacking over them. Moreover, these link vectors (made publicly available) are directly portable as off-the-shelf, dense, syntactic features in various NLP tasks. As one example, we incorporate them into constituent parse reranking, where their small feature set again matches the performance of standard non-local, manually-defined features, and also stacks over them.


Introduction
Word representations and more recently, word embeddings, learned from large amounts of text have been quite successful as features in various NLP tasks (Koo et al., 2008; Turian et al., 2010; Collobert et al., 2011; Dhillon et al., 2012; Al-Rfou' et al., 2013; Bansal et al., 2014; Guo et al., 2014; Pennington et al., 2014; Yu and Dredze, 2014; Faruqui et al., 2014; Wang et al., 2015). While these word representations do capture useful, dense relationships among known and unknown words, one still has to work with sparse conjunctions of features on the multiple words involved in the substructure that a task factors on, e.g., head-argument links in dependency parsing. Therefore, most statistical dependency parsers still suffer from millions of such conjoined, template-based, n-ary features on word clusters or embeddings (Koo et al., 2008; Bansal et al., 2014). Some recent work has addressed this issue, via low-rank tensor mappings (Lei et al., 2014), feature embeddings, or neural network parsers (Chen and Manning, 2014).
Secondly, it would also be useful to learn dense representations directly for the higher-order substructures (that structured NLP tasks factor on) so as to explicitly capture the useful, hidden relationships among these substructures, instead of relying on the sparse word-conjoined relationships.
In this work, we propose to address both these issues by learning simple dependency link embeddings on 'head-argument' pairs (as a single concatenated unit), which allows us to work directly with linguistically-intuitive, higher-order substructures, and also fire significantly fewer and simpler features in dependency parsing, as opposed to word cluster and embedding features in previous work (Koo et al., 2008;Bansal et al., 2014), while still maintaining their strong accuracies.
Trained using appropriate dependency-based context in word2vec, the fast neural language model of Mikolov et al. (2013a), these link vectors allow a substantially smaller set of unary link features (as opposed to n-ary, conjoined features), which provides savings in parsing time and memory. Moreover, unlike conjoined features, link embeddings allow a tractable set of accurate per-dimension features, making the feature set even smaller and the feature-generation process orders of magnitude faster (than hierarchical clustering features).
At the same time, these link embedding features maintain dependency parsing improvements similar to the complex, template-based features on word clusters and embeddings by previous work (Koo et al., 2008;Bansal et al., 2014) (up to 9% relative error reduction), and also stack statistically significantly over them (up to an additional 5% relative error reduction).
Another advantage of this approach (versus previous work on feature embeddings or special neural networks for parsing) is that these link embeddings can be imported as off-the-shelf, dense, syntactic features into various other NLP tasks, similar to word embedding features, but now with richer, structured information, and in tasks where plain word embeddings have not proven useful. As an example, we incorporate them into a constituent parse reranker and see improvements that again match state-of-the-art, manually-defined, non-local reranking features and stack over them statistically significantly. We make our link embeddings publicly available1 and hope that they will prove useful in various other NLP tasks in future work, e.g., as dense, syntactic features in sentence classification or as linguistically-intuitive, initial units in vector-space composition.

Dependency Link Embeddings
To train the link embeddings, we use the speedy, skip-gram neural language model of Mikolov et al. (2013a;2013b) via their toolkit word2vec. 2 We use the original skip-gram model and simply change the context tuple data on which the model is trained, similar to Bansal et al. (2014) and Levy and Goldberg (2014). The goal is to learn similar embeddings for links with similar syntactic contextual properties like label, signed distance, ancestors, etc.
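Concretely, retasking the skip-gram model for links amounts to replacing word2vec's (word, context-word) training pairs with (link, syntactic-context) pairs. Below is a minimal sketch of this tuple construction; the token formats (`head_argument` targets, `dist…`/`gp_…` context tokens) are hypothetical illustrations, not the paper's exact scheme:

```python
# Hypothetical sketch: build skip-gram (target, context) training pairs where
# the target is a concatenated head_argument link unit and the context tokens
# encode syntactic properties of that link (label, signed bucketed distance,
# grandparent word). These pairs could then be fed to any skip-gram trainer.

def signed_bucket(dist):
    """Bucket a signed head-argument distance: keep the sign, cap the magnitude."""
    sign = '-' if dist < 0 else '+'
    mag = abs(dist)
    for edge in (1, 2, 3, 5, 10):
        if mag <= edge:
            return sign + str(edge)
    return sign + '>10'

def link_context_pairs(sentence, heads, labels):
    """sentence: list of words; heads[i]: index of word i's head (-1 = root);
    labels[i]: label of the link into word i.  Returns (target, context) pairs."""
    pairs = []
    for arg, head in enumerate(heads):
        if head < 0:                                     # skip the root's incoming "link"
            continue
        target = sentence[head] + '_' + sentence[arg]    # one concatenated link unit
        grand = sentence[heads[head]] if heads[head] >= 0 else '<ROOT>'
        context = [labels[arg],
                   'dist' + signed_bucket(head - arg),
                   'gp_' + grand]
        pairs.extend((target, c) for c in context)
    return pairs

words = ['the', 'dog', 'barked']
heads = [1, 2, -1]          # the <- dog <- barked (root)
labels = ['det', 'nsubj', 'root']
pairs = link_context_pairs(words, heads, labels)
```

Links sharing labels, distances, and ancestors then end up with similar embeddings, which is exactly the similarity the clusters in Table 1 reflect.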
Clusters: Table 1 shows example clusters obtained by clustering link embeddings via MATLAB's linkage + cluster commands, with 1000 clusters. 5 We can see that these link embeddings are able to capture useful groups and subtle distinctions directly at the link level (without having to work with all pairs of word types), e.g., based on syntactic properties like capitalization, verb form, position in sentence; and based on topics like location, time, finance, etc.
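A tiny pure-Python stand-in for the linkage + cluster step can illustrate the cluster-then-cut workflow (using average linkage for brevity, rather than the Ward linkage used here, and toy 2-D "link vectors"):

```python
# Illustrative agglomerative clustering (average linkage, euclidean metric):
# repeatedly merge the two closest clusters until k clusters remain, mirroring
# the linkage-then-cut pattern of MATLAB's linkage + cluster commands.

import math

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerative(vectors, k):
    """Returns a list of k clusters, each a list of item indices."""
    clusters = [[i] for i in range(len(vectors))]

    def dist(c1, c2):  # average-linkage distance between two clusters
        return sum(euclid(vectors[i], vectors[j])
                   for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > k:
        _, a, b = min((dist(clusters[a], clusters[b]), a, b)
                      for a in range(len(clusters))
                      for b in range(a + 1, len(clusters)))
        clusters[a] = clusters[a] + clusters[b]   # merge the closest pair
        del clusters[b]
    return clusters

# toy link vectors forming two obvious groups
vecs = [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.1, 5.0)]
groups = agglomerative(vecs, 2)
```

In practice one would run this (or scipy's `scipy.cluster.hierarchy`) over the full set of link vectors with the Ward criterion; the sketch only shows the shape of the computation.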

Dependency Parsing Experiments
In this section, we will first discuss how we use the link embeddings as features in dependency parsing. Next, we will present empirical results on feature space reduction and on parsing performance on both in-domain and out-of-domain datasets.

Features
The BROWN cluster features are based on Bansal et al. (2014), who follow Koo et al. (2008). Bucket features: for our link embeddings, we fire a small set of unary indicator features on the bucketed values of the link vector's dimensions (one feature per dimension). We have another feature that additionally includes the signed, bucketed distance of the particular link in the given sentence. Also note the difference of our unary bucket features from the binary bucket features of Bansal et al. (2014), who had to work with pairwise, conjoined features of the head and the argument. Hence, they used features on conjunctions of the two bucket values from the head and argument word vectors, firing one pairwise feature per dimension, because firing features on all dimension pairs (corresponding to an outer product) led to an infeasible number of features. The results of these feature differences are discussed in §3.2.
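For illustration, here is a minimal sketch of such unary per-dimension bucket features; the bucket edges and feature-string formats are hypothetical, chosen only to show the shape of the feature set:

```python
# Hypothetical sketch of unary bucket features: each dimension of a link's
# embedding is bucketed, and one indicator feature fires per dimension,
# optionally conjoined with the link's signed, bucketed distance -- never
# with another word's vector, so the feature count stays linear in D.

def bucket(value, edges=(-0.5, -0.1, 0.0, 0.1, 0.5)):
    """Map a real value to the index of the first bucket edge it falls under."""
    for i, edge in enumerate(edges):
        if value <= edge:
            return i
    return len(edges)

def bucket_features(link_vec, signed_dist_bucket):
    feats = []
    for d, v in enumerate(link_vec):
        b = bucket(v)
        feats.append('dim%d=%d' % (d, b))                        # unary indicator
        feats.append('dim%d=%d,dist=%s' % (d, b, signed_dist_bucket))
    return feats

f = bucket_features([-0.3, 0.07], '+1')   # a toy 2-dimensional link vector
```

A D-dimensional link vector thus yields only 2D active features, versus the conjoined head-argument templates of prior work.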
Bit-string features: We first hierarchically cluster the link vectors via MATLAB's linkage function with {method=ward, metric=euclidean} to get 0-1 bit-strings (similar to BROWN). Next, we again fire a small set of unary indicator features that simply consist of the link's bit-string prefix, the prefix-length, and another feature that adds the signed, bucketed distance of that link in the sentence. 6
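A minimal sketch of these unary bit-string features (the table contents and feature-string formats are hypothetical; prefix lengths 4, 6, 8, 12 follow the BROWN setting described in footnote 6):

```python
# Hypothetical sketch of unary bit-string features: given a link's
# hierarchical-cluster bit-string, fire one indicator per prefix length,
# plus a variant conjoined with the signed, bucketed link distance.
# Unknown links fall back to a special 'UNK' string.

PREFIX_LENGTHS = (4, 6, 8, 12)

def bitstring_features(link, bitstrings, signed_dist_bucket):
    feats = []
    for n in PREFIX_LENGTHS:
        prefix = bitstrings[link][:n] if link in bitstrings else 'UNK'
        feats.append('bits%d=%s' % (n, prefix))
        feats.append('bits%d=%s,dist=%s' % (n, prefix, signed_dist_bucket))
    return feats

table = {'dog_the': '010110101101'}        # toy link -> bit-string mapping
known = bitstring_features('dog_the', table, '+1')
unk = bitstring_features('cat_a', table, '-2')
```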

Setup and Results
For all experiments (unless otherwise noted), we follow the 2nd-order MSTParser setup of Bansal et al. (2014), in terms of data splits, parameters, preprocessing, and feature thresholding. Statistical significance is reported based on the bootstrap test (Efron and Tibshirani, 1994) with 1 million samples.
First, we compare the number of features in Table 2. Our dense, unary, link-embedding-based Bucket and Bit-string features are substantially fewer than the sparse, n-ary, template-based features used in the MSTParser baseline, in BROWN, and in the word embedding SKIP DEP result of Bansal et al. (2014). This in turn also improves our parsing speed and memory. Moreover, regarding the preprocessing time taken to generate these various feature types, our Bucket features, which just need the fast word2vec training, take 2-3 orders of magnitude less time than the BROWN features (15 mins. versus 2.5 days);7 this is also advantageous when [...]. [Footnote 6: We again used prefixes of length 4, 6, 8, 12, same as the BROWN feature setting. For unknown links' features, we replace the bucket or bit-string prefix with a special 'UNK' string. Footnote 7: Based on a modern 3.50 GHz desktop and 1 thread. The Bit-string features additionally need hierarchical clustering, but are still at least twice as fast as BROWN features.] Therefore, the main contribution of these link embeddings is that their significantly simpler, smaller, and faster set of unary features can match the performance of complex, template-based BROWN features (and of the dependency-based word embedding features of Bansal et al. (2014)), and also stack over them. We also get similar trends of improvements on the labeled attachment score (LAS) metric. 8 Moreover, unlike Bansal et al. (2014), our Bucket features achieve statistically significant improvements, most likely because their features were D pairwise, conjoined features, one per dimension d, consisting of the two bucket values from the head and argument word vectors; this prevents the classifier from learning useful linear combinations of the various dimensions, while firing D^2 features on all dimension pairs (corresponding to an outer product) would lead to an infeasible number of features.
On the other hand, we have a single vector for head+argument, allowing us to fire just D features (one per dimension) and still learn useful dimension combinations in linear space. We also report out-of-domain performance, in Table 4, on the Web treebank (Petrov and McDonald, 2012) test sets, directly using the WSJ-trained models. Again, both our Bucket and Bit-string link-embedding features achieve decent improvements over Baseline and they stack over BROWN, while using far fewer features. Moreover, one can hopefully achieve bigger gains by training link embeddings on Web or Wikipedia data (since BLLIP is news-domain).

Off-the-shelf: Constituent Parsing
Finally, these link embeddings are also portable as off-the-shelf, dense, syntactic features into other NLP tasks, either to incorporate missing syntactic information, or to replace sparse (n-ary lexicalized or template-based) parsing features, or where word embedding features are not appropriate and one needs higher-order embeddings, e.g., in constituent parsing (see Andreas and Klein (2014)).
Therefore, as a first example, we import our link embedding features into a constituent parse reranker. We follow Bansal and Klein (2011), reranking 50-best lists of the Berkeley parser (Petrov et al., 2006). We first extract dependency links in each candidate constituent tree based on the head-modifier rules of Collins (2000). Next, we simply fire our Bit-string features on each link, where the feature again consists of just the prefix bit-string, the prefix length, and the signed, bucketed link distance. 9 Table 5 shows these reranking results, where 1-best and log p(t|w) are the two Berkeley parser baselines, and where Config is the state-of-the-art, non-local, configurational feature set of Huang (2008), which in turn is a simplified merge of Charniak and Johnson (2005) and Collins (2000) (referred to here as configurational). Again, all our test improvements are statistically significant at p < 0.01: Bit-string (90.9) over both the baselines (90.2, 89.9); and Config + Bit-string (91.4) over Config (91.1). Moreover, the Bit-string result (90.9) is statistically indistinguishable from the Config result (91.1). Therefore, we can again match the improvements of complex, manually-defined, non-local reranking features with a much smaller set of simple, dense, off-the-shelf, link-embedding features, and also complement them statistically significantly.
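The reranking step itself is just a linear model over each candidate's features; a minimal hypothetical sketch (the feature names and weights are illustrative, and in practice the weights would be learned discriminatively):

```python
# Hypothetical sketch of k-best reranking: score each candidate tree by a
# linear combination of the base parser's log-probability and its (here,
# bit-string link) features, then return the top-scoring candidate's index.

def rerank(candidates, weights):
    """candidates: list of (logprob, feature_list); returns index of the best."""
    def score(cand):
        logprob, feats = cand
        return (weights.get('logprob', 0.0) * logprob
                + sum(weights.get(f, 0.0) for f in feats))
    return max(range(len(candidates)), key=lambda i: score(candidates[i]))

# two toy candidates: the second has lower base log-prob but a favored feature
cands = [(-10.0, ['bits4=0101']),
         (-11.0, ['bits4=0011'])]
w = {'logprob': 1.0, 'bits4=0011': 5.0}
best = rerank(cands, w)
```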

Related Work
Bansal et al. (2014) also use dependency context to tailor word embeddings to dependency parsing. However, their embedding features are still based on the sparse set of n-ary, word-based templates from previous work (McDonald et al., 2005a; Koo et al., 2008). Our structured link embeddings achieve improvements similar to theirs (and better in the case of direct, per-dimension bucket features) with a substantially smaller and simpler (unary) set of features that aim to directly capture hidden relationships between the substructures that dependency parsing factors on. Moreover, we hope that, similar to word embeddings, these link embeddings will also prove useful when imported into various other NLP tasks as dense, continuous features, but now with additional syntactic information.
There has also been some recent, useful work on reducing the sparsity of features in dependency parsing, e.g., via low-rank tensors (Lei et al., 2014) and via neural network parsers that learn tag and label embeddings (Chen and Manning, 2014). Other related work learns dense feature embeddings for dependency parsing; however, such approaches still work with the large number of manually-defined feature templates from previous work and train embeddings for all those templates, with an aim to discover hidden, shared information among the large set of sparse features. We get similar improvements with a much smaller and simpler set of unary link features; also, our link embeddings are more portable to other NLP tasks than template-based embeddings specific to dependency parsing.
Other work includes learning distributed structured output via dense label vectors (Srikumar and Manning, 2014), learning bilexical operator embeddings (Madhyastha et al., 2014), and learning joint word embeddings and composition functions based on predicate-argument compositionality (Hashimoto et al., 2014).
Our main goal is to directly learn embeddings on linguistically-intuitive units like dependency links, so that they can be used as non-sparse, unary features in dependency parsing, and also as off-the-shelf, dense, syntactic features in other NLP tasks (versus more intrinsic approaches based on feature embeddings or neural network parsers, which are harder to export).

Conclusion and Future Work
We presented dependency link embeddings, which provide a small, simple set of unary features for dependency parsing, while maintaining statistically significant improvements, similar and complementary to sparse, n-ary, word-cluster features. These link vectors are also portable as off-the-shelf syntactic features in other NLP tasks; we import them into constituent parse reranking, where they again match and stack over state-of-the-art, non-local reranking features. We release our link embeddings (available at ttic.edu/bansal) and hope that these will prove useful in various other NLP tasks, e.g., as dense, syntactic features in sentence classification or as linguistically-intuitive, initial units in vector-space composition.
In future work, it will be useful to try obtaining stronger parsing accuracies via newer, better representation learning tools, e.g., GloVe (Pennington et al., 2014), and by training on larger quantities of automatically-parsed data. It will also be useful to perform intrinsic evaluation of these link embeddings on appropriate syntactic datasets and metrics, and extrinsic evaluation via various other NLP tasks such as sentence classification. Finally, it will be interesting to try parsers or frameworks where we can directly employ the embeddings as features, instead of bucketing or clustering them.