Semi-supervised Dependency Parsing using Bilexical Contextual Features from Auto-Parsed Data

We present a semi-supervised approach to improving dependency parsing accuracy by using bilexical statistics derived from auto-parsed data. The method is based on estimating the attachment potential of head-modifier word pairs, taking into account not only the head and modifier words themselves but also the words surrounding the head and the modifier. When integrating the learned statistics as features in a graph-based parsing model, we observe consistent improvements in accuracy when parsing various English datasets.


Introduction
We are concerned with semi-supervised dependency parsing, namely how to leverage large amounts of unannotated data, in addition to annotated treebank data, to improve dependency parsing accuracy. Our method (Section 2) is based on parsing large amounts of unannotated text with a baseline parser, extracting word-interaction statistics from the automatically parsed corpus, and using these statistics as the basis of additional parser features. The automatically parsed data is used to acquire statistics about lexical interactions, which are too sparse to estimate well from any realistically-sized treebank. Specifically, we attempt to infer a function assoc(head, modifier) measuring the "goodness" of head-modifier relations ("how good is an arc in which black is a modifier of jump?"). A similar approach was taken by Chen et al. (2009) and Van Noord (2007). We depart from their work by extending the scoring to include a wider lexical context. That is, given the sentence fragment in Figure 1, we score the (incorrect) dependency arc (black, jump) based on the triplets (the black fox, will jump over). Learning a function between word triplets raises an extreme data-sparsity issue, which we deal with by decomposing the interaction between triplets into a sum of interactions between word pairs. The decomposition we use is inspired by recent work on word embeddings and dense vector representations (Mikolov et al., 2013a; Mnih and Kavukcuoglu, 2013). Indeed, we initially hoped to leverage the generalization abilities associated with vector-based representations. However, we find that in our setup, reverting to direct count-based statistics achieves roughly the same results (Section 3).
Our derived features improve the accuracy of a first-order dependency parser by 0.75 UAS points (absolute) when evaluated on the in-domain WSJ test set, obtaining a final accuracy of 92.32 UAS for a first-order parser. When comparing to the strong baseline of Brown-cluster-based features (Koo et al., 2008), we find that our triplets-based method outperforms it by over 0.27 UAS points. This is in contrast to previous work (e.g. Bansal et al., 2014) in which improvements over Brown-cluster features were achieved only by adding to the cluster-based features, not by replacing them. As expected, combining both our features and the Brown-cluster features results in some additional gains.

Our Approach
Our departure point is a graph-based parsing model (McDonald et al., 2005): given a sentence x, we look for the highest-scoring parse tree y in the space Y(x) of valid dependency trees over x. The score of a tree is determined by a linear model parameterized by a weight vector w and a feature function Φ(x, y). To make the search tractable, the feature function is decomposed into local feature functions over tree parts, φ(x, part). The features in φ are standard graph-based dependency parsing features, capturing mostly structural information from the parse tree.

Table 1: Features in φ^{ij}_lex(x, y):
  bin(S_ij(x, y))
  bin(S_ij(x, y)) • dist(x, y)
  bin(S_ij(x, y)) • pos(x) • pos(y)
  bin(S_ij(x, y)) • pos(x) • pos(y) • dist(x, y)
All features are binary indicators. x and y are token indices. S_ij is estimated from auto-parsed corpora as described in the text. The values of S(·, ·) are in the range (0, 1), which bin splits into 10 equally-spaced intervals. dist is the signed and binned sentence-distance between x and y. pos(x) is the part of speech of token x. • indicates a concatenation of features.
We extend the scoring function by adding an additional term capturing the strength of lexical association between the head word h and the modifier word m in each dependency arc:

score(x, y) = w · Φ(x, y) + Σ_{(h,m) ∈ y} assoc(h, m)

The association function assoc is also modeled as a linear model, assoc(h, m) = w_lex · φ_lex(h, m). While the weights w_lex are trained jointly with w on supervised training data, the features in φ_lex do not look at h and m directly; instead they rely on a quantity S(h, m) that reflects the "goodness" of the arc (h, m). The quantity S(h, m) ranges between 0 and 1 and is estimated from large quantities of auto-parsed data. Given a value for S(h, m), φ_lex is composed of indicator functions over the binned ranges of S(h, m), possibly conjoined with information such as the binned surface distance between the tokens h and m and their parts of speech. The complete specification of φ_lex we use is shown in Table 1 (the meaning of the ij indices is discussed in Section 2.2).
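As a concrete illustration of the feature templates in Table 1, a minimal sketch of the indicator-feature extraction might look as follows. The function names, the string encoding of the indicators, and the particular distance bucketing are our own assumptions; only the 10-way binning of S and the template combinations are stated in the text.

```python
def bin_score(s, n_bins=10):
    """Map a score in (0, 1) to one of 10 equally-spaced intervals."""
    return min(int(s * n_bins), n_bins - 1)

def bin_distance(head_idx, mod_idx):
    """Signed, binned surface distance between head and modifier.
    The bucketing scheme here is illustrative, not from the paper."""
    d = head_idx - mod_idx
    sign = "+" if d > 0 else "-"
    d = abs(d)
    bucket = d if d <= 5 else (10 if d <= 10 else 11)
    return f"{sign}{bucket}"

def lex_features(s, head_idx, mod_idx, head_pos, mod_pos):
    """Binary indicator features (Table 1 templates) for one candidate arc."""
    b = bin_score(s)
    dist = bin_distance(head_idx, mod_idx)
    return [
        f"BIN={b}",
        f"BIN={b}|DIST={dist}",
        f"BIN={b}|HP={head_pos}|MP={mod_pos}",
        f"BIN={b}|HP={head_pos}|MP={mod_pos}|DIST={dist}",
    ]
```

Each returned string names one active binary feature; the parser's linear model would hash or index these strings into w_lex.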

Estimating S(h, m)
One way of estimating S(h, m), which was also used by Chen et al. (2009), is via rank statistics. Let D be the list of observed (h, m) pairs sorted by their frequencies, and let rank(h, m) be the index of the pair (h, m) in D. We then set:

S_RANK(h, m) = 1 − rank(h, m) / |D|

While effective, this approach has two related shortcomings. First, it requires storing counts for all the pairs (h, m) appearing in the auto-parsed data, resulting in a memory requirement that scales quadratically with the vocabulary size. Second, even with very large auto-parsed corpora, many plausible head-modifier pairs are likely to be unobserved.
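A minimal sketch of the rank-based estimate follows. The exact rank-to-score mapping is our assumption (a normalized rank, which keeps S in the stated (0, 1) range); the helper names are illustrative.

```python
from collections import Counter

def build_rank_score(pair_counts):
    """Given a Counter of observed (head, modifier) pairs from
    auto-parsed data, return an S_RANK function mapping each pair
    to a score in [0, 1]: higher-frequency pairs score closer to 1."""
    ranked = [p for p, _ in pair_counts.most_common()]  # rank 0 = most frequent
    n = len(ranked)
    rank = {p: i for i, p in enumerate(ranked)}

    def s_rank(h, m):
        if (h, m) not in rank:
            return 0.0  # unseen pair: no evidence for the arc
        return 1.0 - rank[(h, m)] / n

    return s_rank
```

Note that the `rank` dictionary stores every observed pair, which is exactly the quadratic memory cost discussed above.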
An alternative way of estimating S(h, m), which does not require storing all the observed pairs and has the potential to generalize beyond seen examples, is a log-bilinear embedding model similar to the skip-gram model presented by Mikolov et al. (2013b), embedding word pairs such that compatible pairs receive high scores. The model assigns two disjoint sets of d-dimensional continuous latent vectors, u and v, where u_h ∈ R^d is an embedding of a head word h and v_m ∈ R^d is an embedding of a modifier word m. The embedding is learned by optimizing a corpus-wide objective that maximizes the dot product of the vectors of observed (h, m) pairs while minimizing the dot product of vectors of random (h, m) pairs. Formally:

Σ_{(h,m) ∈ D} ( log σ(u_h · v_m) + k · E_{m′ ∼ P_m} [ log σ(−u_h · v_{m′}) ] )

where σ(x) = 1/(1 + e^{−x}) and k is the number of negative samples, drawn from the corpus-based unigram distribution P_m. For further details, see Mikolov et al. (2013b) and Goldberg and Levy (2014). We then take:

S_EMBED(h, m) = σ(u_h · v_m)

In contrast to the counts-based method, this model is able to estimate the strength of a pair of words even if the pair did not appear in the corpus due to sparsity.
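The negative-sampling objective and the resulting S_EMBED can be sketched with plain stochastic gradient steps. This is a toy illustration, not the word2vecf implementation used in the paper; the learning rate and update scheme are our assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(U, V, h, m, negatives, lr=0.025):
    """One SGD step on the negative-sampling objective for an observed
    (head, modifier) pair: raise u_h . v_m, lower u_h . v_m' for each
    sampled negative modifier m'."""
    s = sigmoid(U[h] @ V[m])
    grad_u = (1.0 - s) * V[m]          # pull u_h toward v_m
    grad_vm = (1.0 - s) * U[h]         # pull v_m toward u_h
    neg_updates = []
    for mn in negatives:
        sn = sigmoid(U[h] @ V[mn])
        grad_u = grad_u - sn * V[mn]   # push u_h away from negatives
        neg_updates.append((mn, -sn * U[h]))
    U[h] += lr * grad_u
    V[m] += lr * grad_vm
    for mn, g in neg_updates:
        V[mn] += lr * g

def s_embed(U, V, h, m):
    """Goodness of the arc (h, m) under the embedding model."""
    return sigmoid(U[h] @ V[m])
```

Repeated steps on an observed pair drive its S_EMBED score upward, while random pairs are pushed toward low scores.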
Finally, Levy and Goldberg (2014b) show that the skip-gram with negative-sampling model described above achieves its optimal solution when u_h · v_m = PMI(h, m) − log k. This gives rise to another natural way of estimating S:

S_PMI(h, m) = σ(PMI(h, m) − log k),   PMI(h, m) = log [ p(h, m) / (p(h) p(m)) ]

[Figure 1: Illustration of the bilexical information including context. When scoring the (incorrect) arc between h_0 and m_0, we take into account also the surrounding words h_{−1}, h_{+1}, m_{−1} and m_{+1}.]
where p(h, m), p(h) and p(m) are unsmoothed maximum-likelihood estimates based on the auto-parsed corpus.
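A count-based sketch of this estimate follows; the sigmoid-of-shifted-PMI form mirrors the S_EMBED definition above, and the helper names are ours.

```python
import math
from collections import Counter

def build_pmi_score(pairs, k=15):
    """S_PMI from a list of observed (head, modifier) pairs: sigmoid of
    the shifted PMI, matching the SGNS optimum u_h . v_m = PMI - log k.
    Probabilities are unsmoothed maximum-likelihood estimates."""
    pair_c = Counter(pairs)
    head_c = Counter(h for h, _ in pairs)
    mod_c = Counter(m for _, m in pairs)
    n = len(pairs)

    def s_pmi(h, m):
        if pair_c[(h, m)] == 0:
            return 0.0  # unseen pair: no estimate available
        # PMI = log [ p(h,m) / (p(h) p(m)) ] with p(.) = count / n
        pmi = math.log(pair_c[(h, m)] * n / (head_c[h] * mod_c[m]))
        return 1.0 / (1.0 + math.exp(-(pmi - math.log(k))))

    return s_pmi
```

Like S_RANK, this requires storing all observed pair counts, but involves no training.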
Like S_RANK and unlike S_EMBED, S_PMI requires storing statistics for all observed word pairs and is not able to generalize beyond the (h, m) pairs seen in the auto-parsed data. However, as we show in Section 3, this method performs similarly in practice, suggesting that the generalization capabilities of the embedding-based method do not benefit the parsing task.

Adding additional context
Estimating the association between a pair of words is effective. However, we would like to go a step further and also take into account the context in which these words occur. Specifically, our complete model attempts to estimate the association between word trigrams centered around the head and the modifier words.
A naive solution that defines each trigram as its own vocabulary item would increase the vocabulary size by two orders of magnitude and result in severe data sparsity. An alternative solution is to associate each word in the triplet (h_{−1}, h_0, h_{+1}) with its own position-specific vocabulary item, and likewise for the modifier words. In the embeddings-based model, this results in six vectors per word: v^{−1}_{dog}, for example, represents the word "dog" when it appears to the left of the modifier word, and u^{+1}_{walk} the word "walk" when it appears to the right of the head word. This amounts to only a 3-fold increase in the required vocabulary size. We then model the strength of association between h_{−1}h_0h_{+1} and m_{−1}m_0m_{+1} as a weighted sum of pairwise interactions:

assoc(h_{−1}h_0h_{+1}, m_{−1}m_0m_{+1}) = Σ_{i,j ∈ {−1,0,+1}} α_{ij} · assoc_{ij}(h_i, m_j)

As before, each pairwise association measure assoc_{ij}(h_i, m_j) is modeled as a linear model:

assoc_{ij}(h_i, m_j) = w_lex · φ^{ij}_lex(h_i, m_j)

where φ^{ij}_lex is again defined in terms of a goodness function S_{ij}(x, y). For example, S_{−1,+1}(the, over) corresponds to the goodness of a head-modifier arc in which the word to the left of the head word is "the" and the word to the right of the modifier word is "over". S_{ij}(h_i, m_j), the goodness of the arc induced by the pair (h_i, m_j), can be estimated by S_RANK, S_EMBED or S_PMI as before; in the embeddings model, for example, we set S_{ij}(x, y) = σ(u^i_x · v^j_y). We update the bilexical features to include context as explained above. Instead of learning the α and w_lex coefficients separately, we absorb the α_{ij} terms into w_lex, learning both at the same time. Finally, the parser selects the dependency tree maximizing:

score(x, y) = w · Φ(x, y) + Σ_{(h,m) ∈ y} Σ_{i,j ∈ {−1,0,+1}} w^{ij}_lex · φ^{ij}_lex(h_i, m_j)

In the word-embeddings literature, it is common to represent a word triplet as the sum of its individual component vectors, resulting in u_{x,y,z} · v_{a,b,c} = (u^{−1}_x + u^0_y + u^{+1}_z) · (v^{−1}_a + v^0_b + v^{+1}_c). Expanding the terms results in a formulation very similar to our proposal, but we allow the extra flexibility of associating a different strength α_{ij} with each pairwise term.
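The weighted pairwise decomposition can be sketched as follows. This is a toy rendering that keeps the α_ij weights explicit rather than absorbed into w_lex; all names are illustrative, and the position-specific goodness functions s_ij are assumed given.

```python
def triplet_assoc(head_ctx, mod_ctx, s_ij, alpha):
    """Score the association between the head triplet (h-1, h0, h+1)
    and the modifier triplet (m-1, m0, m+1) as a weighted sum of the
    nine pairwise goodness terms.  s_ij[(i, j)] is a position-specific
    goodness function and alpha[(i, j)] its weight."""
    positions = (-1, 0, +1)
    total = 0.0
    for i in positions:
        for j in positions:
            # head_ctx and mod_ctx are 3-tuples indexed 0..2,
            # so position i maps to tuple index i + 1
            total += alpha[(i, j)] * s_ij[(i, j)](head_ctx[i + 1], mod_ctx[j + 1])
    return total
```

In the full model each of the nine terms contributes its own binned indicator features, so the parser learns a separate effective weight per (i, j) position pair.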

Experiments and Results
Data. Our experiments are based on the Penn Treebank (PTB) (Marcus et al., 1993) as well as the Google Web Treebanks (LDC2012T13), covering both in-domain and out-of-domain scenarios. We use the Stanford-dependencies representation (de Marneffe and Manning, 2008). All constituent trees are converted to Stanford dependencies using the settings of Version 1.0 of the Universal Treebank (McDonald et al., 2013). These are based on the Stanford Dependencies converter but use some non-default flags and change some of the dependency labels. All of the models are trained on sections 2-21 of the WSJ portion of the PTB. For in-domain data, we evaluate on sections 22 (Dev) and 23 (Test). All parameter tuning was performed on the Dev set, and we report test-set numbers only for the "most interesting" configurations. For out-of-domain data, we use the Brown portion of the PTB (Brown), as well as the test sets of the different domains available in the Google Web Treebank: Answers, Blogs, Emails, Reviews and Newsgroups. All trees have automatically assigned part-of-speech tags, assigned by the TurboTagger POS-tagger. The train-set POS-tags were derived by 10-fold jackknifing, and the different test datasets receive tags from a tagger trained on sections 2-21.
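The 10-fold jackknifing scheme for producing train-set POS tags can be sketched as follows. This is a generic illustration, not the TurboTagger pipeline; the fold-splitting policy is our assumption.

```python
def jackknife_folds(sentences, k=10):
    """Yield (train, held_out) splits: each fold is tagged by a model
    trained on the other k-1 folds, so the training set receives tags
    with test-time-like (rather than oracle) accuracy."""
    folds = [sentences[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, held_out
```

A tagger would be trained on each `train` split and used to tag the corresponding `held_out` fold; concatenating the tagged folds yields the jackknifed training set.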
For auto-parsed data, we parse the text of the BLLIP corpus (Charniak, 2000) using our baseline parser. This is the same corpus used for deriving the Brown clusters used as features by Koo et al. (2008); we use the clusters provided by Terry Koo. Parsing accuracy is measured by unlabeled attachment score (UAS), excluding punctuation.

Implementation Details. We focus on first-order parsers, as they are the most practical graph-based parsers in terms of running time in realistic parsing scenarios. Our base model is a reimplementation of a first-order projective graph-based parser (McDonald et al., 2005), which we extend to support the semi-supervised φ_lex features. The parser is trained for 10 iterations of online training with passive-aggressive updates (Crammer et al., 2006). For the Brown-cluster features, we use the feature templates described by Koo et al. (2008) and Bansal et al. (2014).
The embedding vectors are trained using the freely available word2vecf software, by conjoining each word with its relative position (−1, 0 or +1) and treating the head words as "words" and the modifier words as "contexts". The words are embedded into 300-dimensional vectors. All code and vectors will be available on the first author's website.

Results
The results are shown in Table 2. The second block (HM) compares the baseline parser to a parser that includes the assoc(h, m) lexical component, using the various ways of computing S(h, m). We observe a clear improvement over the baseline from using the lexical component across all domains. The different estimation methods perform very similarly to each other.
In the third block (TRIP) we switch to the triplet-based lexical association. With S_RANK, there is very little advantage over looking at just pairs. However, with S_EMBED or S_PMI, the improvement of the triplet-based method over using just the head-modifier pairs is clear. The counting-based PMI method performs on par with its embedding-based approximation.
The second line of the first block (Base+Brown) represents the current state of the art in semi-supervised training of graph-based parsers: using Brown-cluster-derived features (Koo et al., 2008; Bansal et al., 2014). The Brown-derived features provide gains similar to (sometimes larger than) our HM method, and substantially smaller than our TRIP method. To the best of our knowledge, we are the first to show a semi-supervised method that significantly outperforms the use of Brown clusters without using Brown clusters as a component.
As expected, combining our features with the Brown-based features provides an additional improvement, as can be seen in the last block of Table 2.

Related Work
Semi-supervised approaches to dependency parsing can be roughly categorized into two groups: those that derive information from raw, unparsed data and those that use automatically parsed data. Our proposed method falls in the second group.
Among the works that use unparsed data, the dominant approach is to derive either word clusters (Koo et al., 2008) or word vectors (Chen and Manning, 2014) from the raw data and use these as additional features in a supervised parsing model. While the word representations used in such methods are not specifically designed for the parsing task, they do provide useful features for parsing; in particular the method of Koo et al. (2008), relying on features derived with the Brown clustering algorithm, provides very competitive state-of-the-art results. To the best of our knowledge, we are the first to show a substantial improvement over using Brown-clustering-derived features without using Brown-cluster features as a component.
Among the works that use auto-parsed data, a dominant approach is self-training (McClosky et al., 2006), in which a parser A (possibly an ensemble) is used to parse large amounts of data, and a parser B is then trained on the union of the gold data and the auto-parsed data produced by parser A. In the context of dependency parsing, successful uses of self-training require parser A to be stronger than parser B (Petrov et al., 2010) or use a selection criterion to train only on high-quality parses produced by parser A (Sagae and Tsujii, 2007; Weiss et al., 2015). In contrast, our work uses the same parser (modulo the feature set) for producing the auto-parsed data and for training the final model, and does not employ a high-quality parse selection criterion when creating the auto-parsed corpus. It is possible that high-quality parse selection could improve our proposed method even further.
Works that derive features from auto-parsed data include Sagae and Gordon (2009) and Bansal et al. (2014). Such works assign a representation (either a cluster or a vector) to individual words in the vocabulary based on their syntactic behavior. In contrast, our learned features are designed to capture interactions between words. As discussed in Sections 1 and 2, most similar to ours is the work of Chen et al. (2009) and Van Noord (2007). We extend their approach to take into account not only direct word-word interactions but also the lexical surroundings in which these interactions occur.
Another approach that takes various syntactic interactions into account was recently introduced by , who propose to learn to embed the complex features used in a graph-based parser based on the other features they co-occur with in auto-parsed data. Similar to our approach, the embedded features are then used as additional features in a conventional graph-based model. The two approaches are to a large extent complementary and could be combined.
Finally, our work adds additional features to a graph-based parser based on a linear model. Recently, progress in dependency parsing has been made by introducing non-linear, neural-network-based models (Pei et al., 2015; Chen and Manning, 2014; Weiss et al., 2015; Dyer et al., 2015; Zhou et al., 2015). Adapting our approach to work with such models is an interesting research direction.

Conclusions
We presented a semi-supervised method for dependency parsing and demonstrated its effectiveness on a first-order graph-based parser. Taking into account not only the (head, modifier) word pair but also their immediately surrounding words adds a clear benefit to parsing accuracy.