Dependency Based Embeddings for Sentence Classification Tasks

We compare different word embeddings from a standard window based skipgram model, a skipgram model trained using dependency context features and a novel skipgram variant that utilizes additional information from dependency graphs. We explore the effectiveness of the different types of word embeddings for word similarity and sentence classification tasks. We consider three common sentence classification tasks: question type classification on the TREC dataset, binary sentiment classification on Stanford’s Sentiment Treebank and semantic relation classification on the SemEval 2010 dataset. For each task we use three different classification methods: a Support Vector Machine, a Convolutional Neural Network and a Long Short Term Memory Network. Our experiments show that dependency based embeddings outperform standard window based embeddings in most of the settings, while using dependency context embeddings as additional features improves performance in all tasks regardless of the classification method. Our embeddings and code are available at https:


Introduction
Representing words as low dimensional vectors (also known as word embeddings) is a widely adopted technique in NLP. Word representations can be used as features for classification tasks such as named entity recognition or chunking (Turian et al., 2010), and as a pretraining method for initializing deep neural network representations (Collobert et al., 2011; Kim, 2014). Word embeddings provide better generalization to unseen examples since they can capture general semantic and syntactic properties of words. One of the most popular methods of learning word embeddings is the skipgram model of Mikolov et al. (2013a; 2013b), where embeddings are trained by making predictions of context words appearing in a window around a target word. The standard skipgram model ignores syntax and only partially takes into consideration the sequential structure of text, but still captures certain syntactic properties of words. A significant amount of previous research has explored methods for directly taking syntax into account for word embedding learning (Pham et al., 2015; Cheng and Kartsaklis, 2015; Hashimoto et al., 2014). One simple method is based on traditional count-based distributional semantic spaces and utilizes words with syntactic types from a dependency parse graph as context features (Padó and Lapata, 2007; Baroni and Lenci, 2010). This method has also been applied to skipgram models, where word embeddings are optimized to predict dependency context features instead of other words (Levy and Goldberg, 2014).
Syntax-based embeddings have been shown to have different properties in word similarity evaluations than their window based counterparts, better capturing the functional properties of words. However, it is not clear if they provide any advantage for NLP tasks. We show that using dependency context features can be a general method of providing syntactic information for several sentence classification tasks. Furthermore, the dependency context embeddings improve performance with all classifiers we tested.
We consider the usage of word and dependency context features for three common sentence classification tasks: TREC question type classification, binary sentiment prediction on Stanford Sentiment Treebank, and SemEval 2010 relation identification. We evaluate different methods of using the dependency context embeddings as extra features besides word embeddings to inject information into sentence classifiers about the syntactic structure of a sentence. The advantage of such a method is that it can be applied to any classifier that utilizes standard word embeddings. We evaluate the usefulness of syntax-based word embeddings and dependency context embeddings with three different sentence classification methods: a Support Vector Machine (SVM), a Convolutional Neural Network (CNN) and a Long Short Term Memory network (LSTM).
In order to better utilize the structure of dependency graphs, we propose an extended version of the simple dependency based skipgram of Levy and Goldberg (2014). This extended version considers co-occurrences in a dependency graph between pairs of words, between words and dependency context features, and between different dependency context features. This scheme results in word embeddings that share properties between window based models and dependency graph based ones. More importantly, it provides additional structural information for the dependency context feature embeddings, making them more effective when used in sentence classification tasks.
Our evaluation provides several insights on the role of syntax for embeddings and how they can be used for sentence classification. First, we confirm past claims about the different properties of dependency and window based skipgram embeddings in word similarity tasks. Second, we show that dependency based embeddings perform better in question classification and relation identification than window based ones. These results are robust across multiple classification methods. Third, we show that combining dependency context feature embeddings with word embeddings provides a simple and effective way to improve sentence classification performance. Finally, the performance gain is higher for the extended dependency based skipgram developed in this paper.

Related Work
Estimating word representations from text has been the focus of a lot of research in NLP. Traditional count-based models learn representations by applying SVD to a word-word co-occurrence matrix (Turney et al., 2010). More recently, neural models have been used to learn word embeddings by optimizing for a word prediction task (Collobert et al., 2011; Mnih and Teh, 2012; Mikolov et al., 2013a). However, the most commonly used word representation techniques, like word2vec's skipgram and CBoW, take little consideration of syntactic structure.
Several modifications have been proposed so that word embedding learning algorithms can better utilize syntax or the sequence structure of sentences. One such model is the dependency based skipgram of Levy and Goldberg (2014) which we further extend in this paper. Evaluation of this model is limited to word similarity or lexical substitution in context (Melamud et al., 2015), and little is known about performance within other NLP tasks. Hashimoto et al. (2014) proposed a log-bilinear language model based on predicate-argument structures and report improvements on phrase similarity tasks compared to standard skipgram. In Ling et al. (2015), skipgram and CBoW models are adapted to include position specific weights for the words inside the cooccurrence window and the resulting embeddings provide slight improvements for parsing and POS tagging tasks. The C-PHRASE model (Pham et al., 2015) is another modification of the CBoW model that uses an external parser to replace windows with syntactic constituents. In Cheng and Kartsaklis (2015), a recursive neural network structured according to a sentence's parse learns word embeddings by composing into valid sentences rather than distorted ones.
Structured skipgram models (Levy and Goldberg, 2014; Ling et al., 2015) differ notably from other approaches of incorporating structural information into embeddings (e.g. C-PHRASE), since they also produce embeddings of the structural context features at the prediction layer. We show that in the case of dependency contexts, these structural features can provide valuable information to sentence classifiers. In our extended dependency based skipgram, we do not make a distinction between words and structural features in the training process, which results in better performing dependency context embeddings when used in sentence classification. Another difference between our skipgram model and other structured skipgram variants is that we keep the long distance word contexts used in standard window based skipgram training, with the purpose of capturing both functional and topic related semantic properties of words.
Our work is also related to methods of providing explicit syntactic information to sentence classifiers. Most of the previously proposed approaches rely on tree-structured neural architectures to drive composition of word embeddings to a sentence representation (Socher et al., 2012; Tai et al., 2015). We use a different approach where syntactic information is provided only through embeddings. Our approach is compatible with tree-structured models, and the two could be applied together. An advantage of providing syntactic information through embeddings is that large amounts of automatically parsed textual data can be utilized in order to learn representations of dependency types.

Embedding Models
The skipgram model of Mikolov et al. (2013a; 2013b) optimizes vector representations of words (word embeddings) such that they can predict other context words occurring in a small window. The architecture consists of a single hidden layer feedforward network without any non-linearity applied on the hidden layer. The input to the network is the index of a target word (a one-hot vector) and the output is a vector of probabilities of appearance for context words. The network learns word embeddings by maximizing the log probability of a context word c given a target word t observed in a large corpus of textual data D. To avoid the large computational cost of applying a softmax over the whole vocabulary, a commonly used strategy is to train with negative sampling. For each target-context pair (t, c) coming from the observed data D, a small number of context words c' is sampled from unobserved data D' according to a simple distribution and then used as the negative classes.
The probability of the target-context pair (t, c) being observed in the data is given by:
$$p(D = 1 \mid t, c) = \sigma(v_t \cdot v_c) \qquad (1)$$
where $v_t$ and $v_c$ are target and context word embeddings, and $\sigma$ is the sigmoid function. For a negative sampled pair (t, c'), the probability of the pair not being observed in the data is given by:
$$p(D = 0 \mid t, c') = 1 - \sigma(v_t \cdot v_{c'}) = \sigma(-v_t \cdot v_{c'}) \qquad (2)$$
Writing D for the observed pairs and D' for the negative sampled pairs, the objective becomes:
$$\arg\max \sum_{(t, c) \in D} \log \sigma(v_t \cdot v_c) + \sum_{(t, c') \in D'} \log \sigma(-v_t \cdot v_{c'}) \qquad (3)$$
The network learns two sets of weights for each word: one for embedding words to a low dimensional representation in the hidden layer, which we will refer to as the embedding layer weights, and one for assigning a probability to context words, which we will refer to as the prediction layer weights. Both sets of weights assign representations to words such that words that have similar co-occurrence patterns with other words are closer in the embedding space. Typically, the embedding layer weights are used as feature representations of words for other tasks. Due to its scalability to large corpora and the good performance of its derived word embeddings in several NLP tasks, the skipgram model has become a standard solution for unsupervised learning of word representations.
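The negative sampling objective described above can be sketched for a single observed pair and its negative samples. This is a minimal illustration in plain Python, not the authors' training code; the function names and the list-based vector representation are assumptions made for the example.

```python
import math


def sigmoid(x):
    """Logistic function used for both observed and negative pairs."""
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    """Dot product of two equally sized vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def pair_log_likelihood(v_t, v_c, negatives):
    """Contribution of one observed (t, c) pair to the objective.

    v_t: target word embedding (embedding layer weights)
    v_c: context embedding (prediction layer weights)
    negatives: context embeddings of the negative samples drawn for this pair
    """
    observed = math.log(sigmoid(dot(v_t, v_c)))
    unobserved = sum(math.log(sigmoid(-dot(v_t, v_n))) for v_n in negatives)
    return observed + unobserved
```

Training would maximize the sum of this quantity over all observed pairs, for example with stochastic gradient ascent on both sets of weights.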
While typical training of skipgram is performed by optimizing for the prediction of other words in a window around the target word, it is possible to use other contextual features, such as contexts from dependency graphs of sentences.
We consider three variations of skipgram based on different target-context pairs:

Window-5 based skipgram (Win5)
This is a standard skipgram model that considers target-context word pairs inside a window of 5 words to the right and to the left of the target word. The window size for every target instance in the corpus is uniformly sampled from the [1,5] range, effectively providing a weighting scheme for context words according to their distance from the target word.
Figure 1: A sentence and its dependency parse graph. The contexts of the word "cup" are shown for each model. In addition, for the EXT model the contexts of the "of:nmod_coffee" dependency context feature are shown.
Win5 "cup" contexts: She, asked, for, a, of, coffee
LG "cup" contexts: case_for, det_a, of:nmod_coffee, for:nmod⁻¹_asked
EXT "cup" contexts: She, asked, for, a, of, coffee, case_for, det_a, of:
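The Win5 pair extraction with a sampled window size can be sketched as follows. The function name and the seeded random generator are assumptions for illustration. Because the window size is drawn uniformly from 1 to 5, a context word at distance d from the target is kept with probability (5 - d + 1) / 5, which is the distance weighting mentioned above.

```python
import random


def window_pairs(tokens, max_window=5, rng=None):
    """Return (target, context) pairs using a per-target sampled window size."""
    rng = rng or random.Random(0)  # seeded only to make the sketch reproducible
    pairs = []
    for i, target in enumerate(tokens):
        size = rng.randint(1, max_window)  # inclusive on both ends
        start = max(0, i - size)
        end = min(len(tokens), i + size + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


# Illustrative sentence chosen for this example.
tokens = "She asked for a cup of coffee".split()
pairs = window_pairs(tokens)
```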

Skipgram with dependency contexts (LG)
Levy and Goldberg's (2014) modification to the skipgram model replaces context words in a window by dependency contexts. A dependency context is a discrete symbol denoting a word and its syntactic role in a dependency parse graph (e.g. nsubj_she, of:nmod_coffee, of:nmod⁻¹_cup). The directionality of dependency edges is encoded by introducing features with inverse relations. Training of this skipgram variant is similar to window based approaches, but each word is considered as a node in a dependency graph obtained by a parser, and embeddings are optimized to predict their corresponding word's immediate syntactic contexts (Figure 1). The network's weight matrices have different shapes: representations coming from the embedding layer weights correspond to word embeddings, while representations coming from the prediction layer weights correspond to dependency context embeddings.
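A simplified sketch of how dependency contexts could be read off a parse is given below. It assumes the parse is available as (head, relation, dependent) triples and writes inverse relations with a "-1" suffix; both the input format and the feature spelling are assumptions for illustration. The collapsing of prepositions into the relation label (as in of:nmod_coffee) is not reproduced here.

```python
def dependency_contexts(edges):
    """Map each word to its dependency context features.

    edges: iterable of (head, relation, dependent) triples from a parser.
    The head receives "relation_dependent" and the dependent receives the
    inverse feature "relation-1_head", encoding edge directionality.
    """
    contexts = {}
    for head, relation, dependent in edges:
        contexts.setdefault(head, []).append(f"{relation}_{dependent}")
        contexts.setdefault(dependent, []).append(f"{relation}-1_{head}")
    return contexts


# Hypothetical parse used only for this example.
edges = [
    ("asked", "nsubj", "She"),
    ("cup", "det", "a"),
    ("cup", "case", "for"),
    ("asked", "nmod", "cup"),
    ("cup", "nmod", "coffee"),
    ("coffee", "case", "of"),
]
cup_contexts = dependency_contexts(edges)["cup"]
```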

Extended Dependency Skipgram (EXT)
We propose another variation of skipgram based on dependency graphs that utilizes additional co-occurrences compared to the LG variant. Each target word is taken as a node in the dependency graph, and word embeddings are optimized to maximize the probability of other words within distance one and two in the graph. As with the Win5 model, we apply a weighting according to distance, with words at distance one from the target counted twice. This word-word prediction behaves similarly to the Win5 model, but uses the dependency parse to filter coincidental co-occurrences. The second type of prediction that embeddings are optimized for is similar to the LG model, where each word predicts its dependency contexts. We also optimize for a third type of context prediction where, for each node, dependency contexts become the targets and predict the rest of the dependency contexts of the same node. An example of the different target-context pairs that each skipgram variant utilizes can be seen in Figure 1. The three types of target-context pairs for the extended dependency skipgram are interleaved during training. The weight matrices of this network are symmetric, resulting in two embeddings per word and dependency context feature.
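The three kinds of target-context pairs used by the EXT variant can be sketched as below. This is an illustration under stated assumptions rather than the authors' implementation: the parse is given as (head, relation, dependent) triples, nodes are identified by their word form (so repeated words in a sentence would be merged), and the feature spelling with a "-1" suffix is made up for the example.

```python
from collections import defaultdict
from itertools import permutations


def ext_pairs(edges):
    """Generate EXT-style (target, context) pairs from dependency edges."""
    neighbours = defaultdict(set)
    contexts = defaultdict(list)
    for head, relation, dependent in edges:
        neighbours[head].add(dependent)
        neighbours[dependent].add(head)
        contexts[head].append(f"{relation}_{dependent}")
        contexts[dependent].append(f"{relation}-1_{head}")

    pairs = []
    for word in list(neighbours):
        # Type 1: word-word pairs; distance-one words are counted twice.
        for near in neighbours[word]:
            pairs.extend([(word, near)] * 2)
        two_hops = set()
        for near in neighbours[word]:
            two_hops |= neighbours[near]
        two_hops -= neighbours[word] | {word}
        pairs.extend((word, far) for far in two_hops)
        # Type 2: the word predicts its own dependency contexts (as in LG).
        pairs.extend((word, ctx) for ctx in contexts[word])
        # Type 3: each dependency context predicts the node's other contexts.
        pairs.extend(permutations(contexts[word], 2))
    return pairs
```

In actual training the three pair types would be interleaved rather than emitted node by node; the grouping here is only for readability.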

Implementation Details
We trained 300 dimensional versions of the above skipgram variants on the August 2015 dump of English Wikipedia, which contains 2 billion words. Vocabularies consist of words and dependency contexts that appear more than 100 times (approximately 220k words and 1.3M dependency contexts). Training was done by applying negative sampling with 15 negative samples per target-context pair for 10 iterations over the entire corpus using stochastic gradient descent. The following commonly used methods (Mikolov et al., 2013b; Levy et al., 2015) were applied during training: drawing negative samples according to their unigram distribution raised to the power of 0.75, linear decay of the learning rate with initial α = 0.25, and subsampling of target words with probability $p = 1 - \sqrt{t / f}$, where f is the word's frequency and t is a fixed threshold. Dependency parsing for LG and EXT training was done with the Stanford Neural Network dependency parser (Chen and Manning, 2014) using Universal Dependency tags (De Marneffe et al., 2014).
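The training heuristics listed above can be sketched in a few lines. The subsampling rule follows the form given by Mikolov et al. (2013b); the threshold value of 1e-5 used as a default here is an assumption, as are all function names.

```python
import math


def negative_sampling_distribution(counts, power=0.75):
    """Unigram distribution raised to `power` and renormalized."""
    weights = {word: count ** power for word, count in counts.items()}
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}


def discard_probability(frequency, threshold=1e-5):
    """Probability of skipping a target word with relative frequency `frequency`.

    The threshold default is an assumed value, not taken from the paper.
    """
    return max(0.0, 1.0 - math.sqrt(threshold / frequency))


def learning_rate(progress, initial=0.25):
    """Linearly decayed learning rate; `progress` runs from 0.0 to 1.0."""
    return initial * (1.0 - progress)
```

Raising counts to the power 0.75 flattens the distribution, so rare contexts are drawn as negatives more often than their raw frequency would suggest.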

Word Similarity Evaluation
We evaluate the effect of the different contextual features for skipgram word embeddings on two word similarity datasets: WordSim-353 (Finkelstein et al., 2001) and SimLex-999 (Hill et al., 2015). For both datasets, we compare the cosine similarity of word embeddings for a pair of words to human judgements and report Spearman's correlation in Table 1. The two datasets use a different notion of word similarity for scoring. WordSim-353 mostly captures topical similarity (or relatedness), giving high similarity to pairs of words like clothes-closet. SimLex-999 uses a stricter version of similarity, often called substitutional similarity, where the pair clothes-closet has a low similarity score and pairs like shore-coast have high similarity. The Win5 skipgram version achieves a higher correlation for WordSim-353 compared to LG, but the results are reversed for SimLex-999. This agrees with previous research showing that syntactic contexts correlate better with substitutional similarity judgements than words in a window used as contexts (Levy and Goldberg, 2014). As expected, the extended model represents a middle ground between the two. While similarity based evaluation makes it obvious that different contextual features capture different properties of words, it is not clear which notion of similarity is more useful when word representations are used as features for NLP tasks. We answer this question for sentence level classification tasks in the next section.
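The evaluation protocol described above, cosine similarity of embedding pairs compared to human scores by Spearman's correlation, can be sketched with the standard library alone. The helper names are assumptions; tied values receive average ranks, which is the usual convention for Spearman's coefficient.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def ranks(values):
    """1-based ranks, with ties receiving the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman's correlation: Pearson correlation of the rank vectors."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)
```

For a dataset of word pairs, one would compute `cosine` for each pair and pass the resulting list together with the human scores to `spearman`.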

Sentence Classification
We consider three common sentence classification tasks: TREC question type classification (QC), binary sentiment classification on Stanford's Sentiment Treebank (SST), and relation identification between pairs of nominals (RI) using the SemEval 2010 dataset. The experiments aim to answer two questions. First, to assess the effect of different context features for word embeddings when used in sentence classification tasks, given their different behaviour on word similarity evaluation. Second, to experiment with methods of using the dependency context embeddings themselves as a way to provide classifiers with dependency syntactic information. We carry out experiments with three different classification methods: SVMs with averaged embeddings, the Convolutional Neural Network of Kim (2014), and a Long Short Term Memory recurrent neural network (Hochreiter and Schmidhuber, 1997). These classifiers have some distinct characteristics. The SVM does not take into account the structure of the sentence, nor does it build any internal representations. On the other hand, both the CNN and LSTM networks operate on sequences of words and build internal representations before predicting the class label distribution. However, they do not have access to explicit syntactic information. We first give a description of the classification methods and the way embeddings are used as features, followed by the description of the tasks and results.

Classification Methods
SVM with averaged embeddings. We create a sentence representation by averaging embeddings of sentence features (words and dependency contexts). This can be considered the equivalent of a Bag-of-Words sentence representation in the embedding space, hence called Bag-of-Embeddings (BoE). We then train a classifier by applying a Support Vector Machine with a Gaussian kernel:
$$K(x, x') = \exp(-\gamma \lVert x - x' \rVert^2)$$
For hyperparameter tuning, we set the parameter γ of the kernel to 1/k, where k is the number of features (dimensionality of embeddings), and then perform cross validation for the C parameter using the standard Win5 word embeddings in the question classification task.
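The Bag-of-Embeddings representation and the Gaussian kernel with γ = 1/k can be sketched as follows. The function names are assumptions, and the SVM training itself (for which a library implementation would normally be used) is left out.

```python
import math


def bag_of_embeddings(vectors):
    """Average a non-empty list of equally sized embedding vectors."""
    size = len(vectors)
    return [sum(column) / size for column in zip(*vectors)]


def gaussian_kernel(x, y, gamma=None):
    """Gaussian (RBF) kernel; gamma defaults to 1/k with k the dimensionality."""
    if gamma is None:
        gamma = 1.0 / len(x)
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Two toy "sentences", each a list of 2-dimensional embeddings.
sentence_a = bag_of_embeddings([[1.0, 3.0], [3.0, 5.0]])
sentence_b = bag_of_embeddings([[0.0, 0.0], [2.0, 2.0]])
similarity = gaussian_kernel(sentence_a, sentence_b)
```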
Convolutional Neural Network (CNN). We use the simple Convolutional Neural Network of Kim (2014) that has been shown to perform well in multiple sentence classification tasks. The network's input is a sentence matrix X formed by concatenating k-dimensional word embeddings. Then a convolutional filter $W \in \mathbb{R}^{h \times k}$ is applied to every possible sequence of length h to get a feature map:
$$c_i = f(W \cdot X_{i:i+h-1} + b)$$
where f is a non-linear function and b is a bias term, followed by a max-over-time pooling operation to get the feature with the highest value:
$$\hat{c} = \max_i c_i$$
The pooled features of different filters are then concatenated and passed to a fully connected softmax layer to perform the classification. The network uses multiple filters with different sequence sizes, covering windows of different sizes in the sentence. All hyperparameters of the network are the same as in the original paper (Kim, 2014): stochastic dropout (Srivastava et al., 2014) with p = 0.5 on the penultimate layer, and 100 filters for each filter region with filter regions of width 2, 3 and 4. Optimization is performed with Adadelta (Zeiler, 2012) on minibatches of size 50.
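A single convolutional filter followed by max-over-time pooling can be sketched as below. The sentence matrix and filter are plain nested lists, the non-linearity is taken to be a rectifier, and the function names are assumptions made for this illustration.

```python
def feature_map(X, W, b=0.0):
    """Apply one filter W (h rows, k columns) to every window of h rows of X."""
    h = len(W)
    features = []
    for i in range(len(X) - h + 1):
        activation = b + sum(
            W[r][c] * X[i + r][c] for r in range(h) for c in range(len(W[r]))
        )
        features.append(max(0.0, activation))  # rectifier, an assumed choice of f
    return features


def max_over_time(features):
    """Keep only the strongest response of the filter over the sentence."""
    return max(features)


# Toy sentence of three 2-dimensional embeddings and a filter of width 2.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 1.0], [1.0, 1.0]]
pooled = max_over_time(feature_map(X, W))
```

In the full network, one pooled value per filter is collected and the resulting vector is passed to the softmax layer.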
Long Short Term Memory (LSTM). LSTM networks (Hochreiter and Schmidhuber, 1997) are recurrent neural networks where recurrent units consist of a memory cell c and three gates i, o and f. Given a sequence of input embeddings x, the LSTM outputs a sequence of states h given by the following:
$$\begin{pmatrix} i_t \\ f_t \\ o_t \\ \tilde{c}_t \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} W \begin{pmatrix} x_t \\ h_{t-1} \end{pmatrix}$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$h_t = o_t \odot \tanh(c_t)$$
where $W \in \mathbb{R}^{4k \times 2k}$, $\tilde{c}_t$ is a candidate state for the memory cell and $\odot$ is element-wise vector multiplication. The distribution of labels for the whole sentence is computed by a fully connected softmax layer on top of the final hidden state after applying stochastic dropout with p = 0.25. We use 150 dimensions for the size of h, Adagrad (Duchi et al., 2011) for optimization and a mini-batch size of 100.
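One step of an LSTM recurrence of this kind can be sketched as follows. The sketch assumes a single weight matrix W applied to the concatenation of the input and the previous hidden state, no bias terms, and a gate order of input, forget, output, candidate in the rows of W; these layout choices are assumptions for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(W, x, h_prev, c_prev):
    """Compute (h_t, c_t) from the input x_t and the previous state.

    W has 4k rows (k = len(h_prev)) and len(x) + k columns.
    """
    k = len(h_prev)
    z = list(x) + list(h_prev)  # concatenation of x_t and h_{t-1}
    pre = [sum(w * v for w, v in zip(row, z)) for row in W]
    i_gate = [sigmoid(a) for a in pre[0:k]]
    f_gate = [sigmoid(a) for a in pre[k:2 * k]]
    o_gate = [sigmoid(a) for a in pre[2 * k:3 * k]]
    candidate = [math.tanh(a) for a in pre[3 * k:4 * k]]
    c = [f * cp + i * g for f, cp, i, g in zip(f_gate, c_prev, i_gate, candidate)]
    h = [o * math.tanh(cj) for o, cj in zip(o_gate, c)]
    return h, c
```

Running the step over all node representations of a sentence and keeping the last `h` gives the vector that is fed to the softmax layer.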

Sentence Feature Representations
We provide syntactic information to each classifier in the following manner. First we parse each sentence to get a dependency graph. Each node in the graph is associated with a word w having an embedding $v_w$ and a set of dependency context features $d_1, d_2, ..., d_C$ with embeddings $v_{d_1}, v_{d_2}, ..., v_{d_C}$, exactly as during the dependency based skipgram training process. We then create a representation x of that node using different combinations of its associated word and dependency context embeddings:
• Words: Using only word embeddings: $x = v_w$
• Dep: A node's representation becomes the average of its associated dependency context embeddings: $x = \frac{1}{C} \sum_{i=1}^{C} v_{d_i}$
• Wavg: Combination of the word and dependency context embeddings by a weighted average scheme that assigns equal contribution to the word and dependency context part: $x = \frac{1}{2} v_w + \frac{1}{2C} \sum_{i=1}^{C} v_{d_i}$
• Conc: Similar to Wavg, but dependency context embeddings are first averaged and then concatenated to the word embedding to form a single vector: $x = v_w \oplus \frac{1}{C} \sum_{i=1}^{C} v_{d_i}$
where ⊕ is the concatenation operator. This method keeps the word and syntactic part separate at the expense of doubling the dimensionality.
The above methods are used with the LG and EXT variants to create context specific node representations. For the EXT model, both the word and dependency context embeddings used come from the embedding layer weights. The Words method is the only one that can be applied to the Win5 model. It is the most commonly used method to utilize word representations as features and our baseline. To make the comparison fairer for the Win5 model, we include two additional variations that utilize both the embedding and prediction layer weights as an ensemble method for creating a word's representation. Writing $v_w^{E}$ and $v_w^{P}$ for the embeddings of word w from the embedding and prediction layer weights respectively:
• Win5 AvgE: Ensemble made by averaging word embeddings from the embedding and prediction layer weights of Win5 skipgram: $x = \frac{1}{2} (v_w^{E} + v_w^{P})$
• Win5 ConcE: Another ensemble made by concatenating word embeddings from the embedding and prediction layer weights of Win5 skipgram: $x = v_w^{E} \oplus v_w^{P}$
Ensemble techniques have been reported to outperform simple word representations in some word similarity tasks (Levy et al., 2015). Since the EXT skipgram version uses symmetric weight matrices for the embedding and prediction layer, ensemble methods like the above could also be applied, but are not considered for these experiments. Note that contrary to the dependency based models, these ensemble methods do not create context specific representations.
The dependency graph's node representations are used as a sequence of embeddings respecting the order of the sentence to become the input for the CNN and LSTM. For the SVM BoE, word and dependency contexts of the whole sentence are averaged separately for the Words and Dep method, and then averaged again for the Wavg method or concatenated for the Conc method. As we are evaluating performance of embeddings, we do not perform updates during training of CNNs and LSTMs.
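The Words, Dep, Wavg and Conc node representations described in this section can be sketched as a single function. Vectors are plain lists and the function names are assumptions made for the example.

```python
def mean(vectors):
    """Element-wise average of a non-empty list of equally sized vectors."""
    size = len(vectors)
    return [sum(column) / size for column in zip(*vectors)]


def node_representation(v_w, dep_vectors, method):
    """Build the representation of one dependency graph node.

    v_w: embedding of the node's word
    dep_vectors: embeddings of the node's dependency context features
    method: one of "Words", "Dep", "Wavg", "Conc"
    """
    if method == "Words":
        return list(v_w)
    dep_average = mean(dep_vectors)
    if method == "Dep":
        return dep_average
    if method == "Wavg":
        # Equal weight for the word part and the averaged dependency part.
        return [(a + b) / 2 for a, b in zip(v_w, dep_average)]
    if method == "Conc":
        # Concatenation doubles the dimensionality.
        return list(v_w) + dep_average
    raise ValueError(f"unknown method: {method}")
```

For the CNN and LSTM, one such vector per word, in sentence order, would form the input sequence.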

TREC Question Classification
The TREC Question Classification dataset (Li and Roth, 2002) consists of 5452 training questions and 500 test questions. The task is to classify each question with one of six labels (e.g. location, definition) depending on the answer it seeks. For CNNs and LSTMs, 10% of the training data was used as the dev set to pick the best model among different iterations. Classification accuracy results for each input representation and classification method can be seen in Table 2. We also report the state of the art result by the dependency convolutional neural network of Mou et al. (2015). Their model consists of a convolutional neural network that takes a dependency tree at the input layer instead of a sequence, and uses heuristics to choose the subset of nodes where pooling is applied.

SST-2
The Stanford Sentiment Treebank dataset (Socher et al., 2013) has fine grained sentiment polarity scores for movie reviews on the phrasal and sentence level. The binary version of the task considers only positive and negative sentiment labels, resulting in a 6920/872/1821 split for training/dev/testing sets. All the models were trained using only the sentence level annotations. Classification accuracies for all models are reported in Table 3. The state of the art for this dataset comes from Kim (2014) using the same convolutional neural network as we do, but also utilizing the phrasal level annotations, which provide a training set about an order of magnitude larger. In addition, this specific configuration of the network (multichannel) uses two channels at the input layer, one updating the word embeddings during training and one that keeps them static as we do in our experiments.
For relation identification on the SemEval 2010 dataset, we only used the shortest dependency path between the two nominals as the input to classifiers. In Table 4, we report results using the official SemEval metric of macro-averaged F1-score for (9+1)-way classification, taking directionality into account. The best reported result for this dataset is an F1-score of 85.6, obtained with a convolutional network applied to a sequence of word embeddings from the shortest dependency path between the pair of nominals. That model also introduces negative samples during training by reversing the subject and object of the relation, and uses WordNet features. Without WordNet features it achieves an F1-score of 84.0.

Discussion
Our evaluation shows that dependency context embeddings can provide valuable syntactic information for sentence classification tasks using the three classification methods described. Out of the three tasks, Question Classification and Relation Identification showed large improvements when using dependency context embeddings compared to the baseline, while sentiment classification only showed moderate improvements. This is in agreement with previous research where explicit syntactic information was provided to classifiers through tree-structured networks, which showed that syntax provides small improvements for binary sentiment classification on Stanford's Sentiment Treebank. It is notable that for QC and RI, using only word embeddings that are trained with syntactic information (LG and EXT Words models) still outperforms the baseline window based skipgram. Using the dependency context embeddings as a means to represent the dependency parse of sentences consistently outperforms the baseline method across the three tasks and for every classification method. This indicates that this additional syntactic information cannot be recovered by the CNN and LSTM even though they have access to the sequential structure of sentences, at least when trained on datasets of this size. As expected, the SVM BoE benefits the most from the addition of dependency context embeddings, since these are its only source of structural information.
The dependency context embeddings from the EXT model outperform the LG model, both when used alone and when in combination with the word embeddings. This can be attributed to the additional information they are exposed to during training.
The effectiveness of the Wavg method compared to the Conc method for combining word and dependency context embeddings seems to depend on the classification method. In general, we observe that the CNN performs better with Wavg, while the SVM and LSTM perform better with Conc. On the other hand, the ensemble methods of the Win5 model (AvgE and ConcE) do not provide any consistent advantage over the baseline. In most cases, AvgE slightly hurts performance while ConcE slightly improves it.
Our evaluation also suggests that best performing models in word similarity tasks do not necessarily achieve the best performance in other NLP tasks. When considering only word embeddings as features for sentence classification (Words method), we observe that the EXT model on average performs better than the Win5 and LG models, while the opposite is true for word similarity evaluation. This indicates that providing additional contextual information for training embeddings results in less specialized embeddings for particular types of semantic similarity evaluations, but can be useful for a wide range of sentence level classification tasks.
While the purpose of our experiments is a comparison of embeddings and little hyperparameter tuning was done for the classifiers, results of the CNN using EXT Wavg representations for QC (95.0) and RI (84.31) are close to the best reported results with specifically engineered systems for these tasks: 96.0 for QC (Mou et al., 2015) and 85.6 for RI. As our method does not depend on a specific classification setting, it would be interesting to see if those approaches can be further improved using dependency based representations.

Conclusions
We compare a window based, a dependency based and an extended dependency based skipgram model in word similarity and sentence classification tasks of question classification, binary sentiment prediction and semantic relation identification. For the sentence classification, we use three classifiers (SVM, CNN, LSTM) and experiment with several methods of utilizing dependency context feature embeddings to create representations that capture the syntactic role of words in dependency graphs. We reaffirm that dependency based models produce word embeddings that better capture functional properties of words and that window based models better capture topical similarity. The dependency based word embeddings largely improved the performance of the three classifiers for question classification and semantic relation identification, but only marginally for sentiment prediction. Finally, using dependency context features along with the word embeddings we observed better performance for the three classifiers in each task.