Convolutional Sentence Kernel from Word Embeddings for Short Text Categorization

This paper introduces a convolutional sentence kernel based on word embeddings. Our kernel overcomes the sparsity issue that arises when classifying short documents or when little training data is available. Experiments on six sentence datasets showed statistically significantly higher accuracy than the standard linear kernel with n-gram features and other proposed models.


Introduction
With the proliferation of text data available online, text categorization has emerged as a prominent research topic. Traditionally, words (unigrams) and phrases (n-grams) have been considered as document features and subsequently fed to a classifier such as an SVM (Joachims, 1998). In the SVM dual formulation that relies on kernels, i.e., similarity measures between documents, a linear kernel can be interpreted as the number of exactly matching n-grams between two documents. Consequently, for short documents or when little training data is available, sparsity issues due to word synonymy arise, e.g., the sentences 'John likes hot beverages' and 'John loves warm drinks' have little overlap and therefore a low linear kernel value (only 1) in the n-gram feature space, even with dependency tree representations and downward paths for n-grams as illustrated in Figure 1.
We propose to relax the exact matching between words by capitalizing on distances in word embeddings. We smooth the implicit delta word kernel, i.e., a Dirac similarity function between unigrams, behind the traditional linear document kernel to capture the similarity between words that are different, yet semantically close. We then aggregate these word and phrase kernels into sentence and document kernels through convolution, resulting in higher kernel values between semantically related sentences (e.g., close to 7 compared to 1 with bigram downward paths in Figure 1). Experiments on six standard datasets for sentiment analysis, subjectivity detection and topic spotting showed statistically significantly higher accuracy for our proposed kernel over the bigram approaches. Our main goal is to demonstrate empirically that word distances from a given word vector space can easily be incorporated in the standard kernel between documents for higher effectiveness and little additional cost in efficiency.
The rest of this paper is structured as follows. Section 2 reviews the related work. Section 3 gives the detailed formulation of our kernel. Section 4 describes the experimental settings and the results we obtained on several datasets. Finally, Section 5 concludes our paper and mentions future work.

Related work
Siolas and d'Alché Buc (2000) pioneered the idea of semantic kernels for text categorization, capitalizing on WordNet (Miller, 1995) to propose continuous word kernels based on the inverse of the path lengths in the tree rather than the common delta word kernel used so far, i.e., exact matching between unigrams. Bloehdorn et al. (2006) extended it later to other tree-based similarity measures from WordNet while Mavroeidis et al. (2005) exploited its hierarchical structure to define a Generalized Vector Space Model kernel.
In parallel, Collins and Duffy (2001) developed the first tree kernels to compare trees based on their topology (e.g., shared subtrees) rather than the similarity between their nodes. Culotta and Sorensen (2004) used them as Dependency Tree Kernels (DTK) to capture syntactic similarities while Bloehdorn and Moschitti (2007) and Croce et al. (2011) used them on parse trees with respectively Semantic Syntactic Tree Kernels (SSTK) and Smoothing Partial Tree Kernels (SPTK), adding node similarity based on WordNet to capture semantic similarities but limiting comparisons to words with the same POS tag.
Similarly, Gärtner et al. (2003) developed graph kernels based on random walks and Srivastava et al. (2013) used them on dependency trees with Vector Tree Kernels (VTK), adding node similarity based on word embeddings from SENNA (Collobert et al., 2011) and reporting improvements over SSTK. The change from WordNet to SENNA was supported by the recent progress in low-dimensional Euclidean vector space representations of words, which are better suited for computing distances between words. In our experiments, word2vec by Mikolov et al. (2013a) led to better results than SENNA for both VTK and our kernels. Moreover, it possesses an additive compositionality property obtained from the Skip-gram training setting (Mikolov et al., 2013b), e.g., the closest word to 'Germany' + 'capital' in the vector space is found to be 'Berlin'.
More recently, for short text similarity, Song and Roth (2015) and Kenter and de Rijke (2015) proposed additional semantic meta-features based on word embeddings to enhance classification.

Formulation
We denote the embedding of a word $w$ by $\mathbf{w}$.

Word Kernel (WK)
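We define a kernel between two words as a polynomial kernel over a cosine similarity in the word embedding space:

$$\mathrm{WK}(w_1, w_2) = \left[ \frac{1}{2} \left( 1 + \frac{\langle \mathbf{w}_1, \mathbf{w}_2 \rangle}{\lVert \mathbf{w}_1 \rVert \, \lVert \mathbf{w}_2 \rVert} \right) \right]^{\alpha}$$

where $\alpha$ is a scaling factor. We also tried Gaussian, Laplacian and sigmoid kernels but they led to poorer results in our experiments. Note that a delta word kernel, i.e., the Dirac function $\mathbb{1}_{w_1 = w_2}$, leads to a document kernel corresponding to the standard linear kernel over n-grams.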
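As a rough illustration, the word kernel can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the cosine is taken to be rescaled from [-1, 1] to [0, 1] before being raised to the power α, the default α = 5 is the value tuned in the experimental settings, and the 3-dimensional `toy` embeddings and all function names are made up for the example (they are not word2vec vectors).

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def word_kernel(u, v, alpha=5.0):
    """Smoothed word kernel on embeddings u and v.

    Assumed form: the cosine is mapped to [0, 1] and raised to alpha.
    """
    return (0.5 * (1.0 + cosine(u, v))) ** alpha

def delta_word_kernel(w1, w2):
    """Dirac word kernel on surface forms: 1 for identical words, else 0."""
    return 1.0 if w1 == w2 else 0.0

# Hypothetical toy embeddings, for illustration only.
toy = {
    "hot": [0.9, 0.1, 0.0],
    "warm": [0.8, 0.3, 0.1],
    "john": [0.0, 0.2, 0.9],
}

print(delta_word_kernel("hot", "warm"))        # exact matching sees no similarity
print(word_kernel(toy["hot"], toy["warm"]))    # close in the toy space
print(word_kernel(toy["hot"], toy["john"]))    # far apart in the toy space
```

With the delta kernel, 'hot' and 'warm' contribute nothing to the document kernel, whereas the smoothed kernel assigns the pair a value that grows with their cosine similarity.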

Phrase Kernel (PhK)
Next we define a kernel between phrases consisting of several words. In our work, we considered two types of phrases: (1) co-occurrence phrases defined as contiguous sequences of words in the text; and (2) syntactic phrases defined as downward paths in the dependency tree representation, e.g., respectively 'hot beverages' and 'beverages hot' in Figure 1. With the dependency tree involved, we expect to have phrases that are syntactically more meaningful. Note that VTK considers random walks in dependency trees instead of downward paths, i.e., potentially taking into account the same nodes multiple times for phrase lengths greater than two, a phenomenon known as tottering.
Once we have phrases to compare, we may construct a kernel between them as the product of word kernels if they are of the same length $l$. That is, we define the Product Kernel (PK) as:

$$\mathrm{PK}(p_1, p_2) = \prod_{i=1}^{l} \mathrm{WK}(w_i^1, w_i^2)$$

where $w_i^j$ is the $i$-th word in phrase $p_j$ of length $l$. Alternatively, in particular for phrases of different lengths, we may embed phrases into the embedding space by taking a composition operation on the constituent word embeddings. We considered two common forms of composition (Blacoe and Lapata, 2012): vector addition ($+$) and element-wise multiplication ($\odot$). Then we define the Composition Kernel (CK) between phrases as:

$$\mathrm{CK}(p_1, p_2) = \mathrm{WK}(\mathbf{p}_1, \mathbf{p}_2)$$

where $\mathbf{p}_j$, the embedding of the phrase $p_j$, can be obtained either by addition ($\mathbf{p}_j = \sum_{i=1}^{l} \mathbf{w}_i^j$) or by element-wise multiplication ($\mathbf{p}_j = \bigodot_{i=1}^{l} \mathbf{w}_i^j$) of its word embeddings. For CK, we do not require the two phrases to be of the same length, so the kernel has the desirable property of being able to compare 'Berlin' with 'capital of Germany', for instance.
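The two phrase kernels can be sketched as follows. This is a minimal illustration, not a reference implementation: phrases are represented as lists of embedding vectors, the word kernel is assumed to be the rescaled cosine raised to α, the zero-norm guard in `cosine` is a choice made for this sketch (element-wise products can vanish), and the toy vectors are constructed so that 'Berlin' equals 'capital' + 'Germany' exactly.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # sketch-level guard: treat a zero vector as unrelated
    return dot / (norm_u * norm_v)

def word_kernel(u, v, alpha=5.0):
    # Assumed form of WK: cosine mapped to [0, 1], raised to alpha.
    return (0.5 * (1.0 + cosine(u, v))) ** alpha

def product_kernel(p1, p2, alpha=5.0):
    """PK: product of word kernels over aligned positions (equal lengths only)."""
    if len(p1) != len(p2):
        raise ValueError("PK is only defined for phrases of the same length")
    value = 1.0
    for u, v in zip(p1, p2):
        value *= word_kernel(u, v, alpha)
    return value

def compose(phrase, mode="add"):
    """Phrase embedding by vector addition or element-wise multiplication."""
    dim = len(phrase[0])
    if mode == "add":
        return [sum(w[k] for w in phrase) for k in range(dim)]
    if mode == "mul":
        return [math.prod(w[k] for w in phrase) for k in range(dim)]
    raise ValueError("mode must be 'add' or 'mul'")

def composition_kernel(p1, p2, mode="add", alpha=5.0):
    """CK: word kernel applied to the composed phrase embeddings."""
    return word_kernel(compose(p1, mode), compose(p2, mode), alpha)

# Hypothetical toy embeddings with berlin = capital + germany.
capital = [1.0, 0.0, 1.0]
germany = [0.0, 1.0, 1.0]
berlin = [1.0, 1.0, 2.0]

# CK can compare phrases of different lengths; PK cannot.
print(composition_kernel([berlin], [capital, germany], mode="add"))
```

Since PK multiplies values in [0, 1], a single dissimilar word pair pulls the whole phrase similarity down, whereas CK first merges the words and compares the phrases as single points in the embedding space.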

Sentence Kernel (SK)
We can then formulate a sentence kernel in a similar way to Zelenko et al. (2003). It is defined through convolution as the sum of all local phrasal similarities, i.e., kernel values between phrases contained in the sentences:

$$\mathrm{SK}(s_1, s_2) = \sum_{p_1 \in \phi(s_1),\; p_2 \in \phi(s_2)} \lambda_1^{\epsilon} \, \lambda_2^{\eta} \, \mathrm{PhK}(p_1, p_2)$$

where $\phi(s_k)$ is the set of either statistical or syntactic phrases (or set of random walks for VTK) in sentence $s_k$, $\lambda_1$ is a decaying factor penalizing longer phrases, $\epsilon = \max\{|p_1|, |p_2|\}$ is the maximum length of the two phrases, $\lambda_2$ is a distortion parameter controlling the length difference $\eta$ between the two phrases ($\eta = \big| |p_1| - |p_2| \big|$) and PhK is a phrase kernel, either PK, $\mathrm{CK}_{+}$ or $\mathrm{CK}_{\odot}$.
Since the composition methods we consider are associative, we employed a dynamic programming approach in a similar fashion to Zelenko et al. (2003) to avoid duplicate computations.
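The convolution can be illustrated with a naive enumeration over all phrase pairs. This sketch deliberately skips the dynamic programming mentioned above and makes several assumptions: sentences are lists of embedding vectors, φ enumerates contiguous (co-occurrence) phrases up to `max_len` words, the phrase kernel is the additive composition kernel, and each pair is weighted by λ1 raised to the longer phrase length times λ2 raised to the length difference. The function names and default values are illustrative.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # sketch-level guard against zero vectors
    return dot / (norm_u * norm_v)

def word_kernel(u, v, alpha=5.0):
    # Assumed form of WK: cosine mapped to [0, 1], raised to alpha.
    return (0.5 * (1.0 + cosine(u, v))) ** alpha

def add_vectors(vectors):
    """Additive composition of a phrase's word embeddings."""
    return [sum(column) for column in zip(*vectors)]

def phrases(sentence, max_len):
    """All contiguous phrases of 1..max_len words (co-occurrence phrases)."""
    return [
        sentence[i:i + length]
        for length in range(1, max_len + 1)
        for i in range(len(sentence) - length + 1)
    ]

def sentence_kernel(s1, s2, max_len=2, lam1=0.5, lam2=0.5, alpha=5.0):
    """Sum of weighted phrase-kernel values over all phrase pairs."""
    total = 0.0
    for p1 in phrases(s1, max_len):
        for p2 in phrases(s2, max_len):
            longest = max(len(p1), len(p2))        # exponent of lam1
            distortion = abs(len(p1) - len(p2))    # exponent of lam2
            phrase_value = word_kernel(add_vectors(p1), add_vectors(p2), alpha)
            total += (lam1 ** longest) * (lam2 ** distortion) * phrase_value
    return total

# Hypothetical toy sentences of two and three words.
s_a = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.2]]
s_b = [[0.8, 0.3, 0.1], [0.0, 0.9, 0.1], [0.1, 0.1, 0.9]]
print(sentence_kernel(s_a, s_b))
```

The double loop makes the quadratic cost in the number of phrases explicit, which is why caching composed vectors matters for longer sentences.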

Document Kernel
Finally, we sum sentence kernel values for all pairs of sentences between two documents to get the document kernel. Once we have obtained all document kernel values $K_{ij}$ between documents $i$ and $j$, we may normalize them by $\sqrt{K_{ii} K_{jj}}$ as the length of input documents might not be uniform.
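A minimal sketch of this last step, assuming the usual cosine-style normalization in which each entry is divided by the square root of the product of the two corresponding diagonal entries. The sentence kernel is passed in as a callable, and the function names are illustrative.

```python
import math

def document_kernel(doc1, doc2, sentence_kernel):
    """Sum of sentence-kernel values over all sentence pairs of two documents."""
    return sum(sentence_kernel(s1, s2) for s1 in doc1 for s2 in doc2)

def normalize_gram(K):
    """Return the Gram matrix with entries K[i][j] / sqrt(K[i][i] * K[j][j])."""
    n = len(K)
    return [
        [K[i][j] / math.sqrt(K[i][i] * K[j][j]) for j in range(n)]
        for i in range(n)
    ]

# Toy Gram matrix over two documents: the second one is longer, hence the
# larger self-similarity on the diagonal.
K = [[4.0, 2.0],
     [2.0, 9.0]]
print(normalize_gram(K))
```

After normalization every document has self-similarity 1, so long documents no longer dominate simply because they contain more phrase pairs.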

Experiments
We evaluated our kernel with co-occurrence and syntactic phrases on several standard text categorization tasks.

Experimental settings
In all our experiments, we used the FANSE parser (Tratz and Hovy, 2011) to generate dependency trees and the pre-trained version of word2vec (https://code.google.com/p/word2vec), a 300-dimensional representation of 3 million English words trained over a Google News dataset of 100 billion words using the Skip-gram model and a context size of 5. While fine-tuning the embeddings to a specific task or on a given dataset may improve the result for that particular task or dataset (Levy et al., 2015), it makes the expected results less generalizable and the method harder to use as an off-the-shelf solution: re-training the neural network to obtain task-specific embeddings requires a certain amount of training data, admittedly unlabeled, which is still not optimal under our scenario with short documents and little task-specific training data available. Moreover, tuning the hyperparameters to maximize the classification accuracy needs to be carried out on a validation set and therefore requires additional labeled data. Here, we are more interested in showing that distances in a given word vector space can enhance classification in general. As for the dependency-based word embeddings proposed by Levy and Goldberg (2014), we do not think they are better suited for the problem we are tackling. As we will see in the results, we do benefit from the dependency tree structure in the phrase kernel but we still want the word kernel to be based on topical similarity rather than functional similarity.
To train and test the SVM classifier, we used the LibSVM library (Chang and Lin, 2011) and employed the one-vs-one strategy for multi-class tasks. To prevent overfitting, we tuned the parameters using cross-validation on 80% of the PL05 dataset ($\alpha = 5$, $\lambda_1 = 1$ for PK since there is no need for distortion as the phrases are of the same length by definition, and $\lambda_1 = \lambda_2 = 0.5$ for CK) and used the same set of parameters on the remaining datasets. We performed normalization for our kernel and baselines only when it led to performance improvements on the training set (PL05, News, PL04 and MPQA).
We report accuracy on the remaining 20% for PL05, on the standard test split for Twitter (25%) and News (50%) and from 5-fold cross-validation for the other datasets (Amazon, PL04 and MPQA). We only report accuracy as the macro-averaged F1 scores led to similar conclusions (and except for Twitter and News, the class label distributions are balanced). Results for phrase lengths longer than two were omitted since they were marginally different at best. Statistical significance of improvement over the bigram baseline with the same phrase definition was assessed using the micro sign test (p < 0.01) (Yang and Liu, 1999).

Results
Table 1: Accuracy results on the test set for PL05 (20%), standard test split for Twitter (25%) and News (50%) and from 5-fold CV for the other datasets (Amazon, PL04 and MPQA). Bold font marks the best performance in the column. * indicates statistical significance at p < 0.01 using micro sign test against the bigram baseline (delta word kernel) of the same column and with the same phrase definition.
Table 1 presents results from our convolutional sentence kernel and the baseline approaches. Note again that a delta word kernel leads to the typical unigram and bigram baseline approaches (first three rows). The 3rd row corresponds to DTK (Culotta and Sorensen, 2004) and the 4th one to VTK (Srivastava et al., 2013); the difference with our model on the 9th row lies in the function $\phi(\cdot)$ that enumerates all random walks in the dependency tree representation following Gärtner et al. (2003) whereas we only consider the downward paths. Overall, we obtained better results than the n-gram baselines, DTK and VTK, especially with syntactic phrases. VTK shows good performance across all datasets but its computation was more than 700% slower than that of our kernel.
Regarding the phrase kernels, PK generally produced better results than CK, implying that the semantic linearity and ontological relation encoded in the embedding are not sufficient and that treating the words separately is more beneficial. However, we believe CK has more room for improvement with the use of more accurate phrase embeddings such as the ones from Le and Mikolov (2014), Yin and Schütze (2014) and Yu and Dredze (2015).
There was little contribution to the accuracy from non-unigram features, indicating that a large part of the performance improvement can be credited to the word embeddings resolving the sparsity issue. The learning-curve figure shows the accuracy on the same test set (20% of the dataset) when the learning was done on 1% to 100% of the training set (80% of the dataset) for the bigram baseline and our bigram PK phrase kernel, both with dependency tree representation, on PL04. We see that our kernel starts to plateau earlier in the learning curve than the baseline and also reaches the maximum baseline accuracy with only about 1,500 training examples.
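The significance marks reported in Table 1 rely on a micro sign test. The following is a hedged sketch of a sign test in that spirit rather than the exact procedure used in the experiments: it assumes per-document correctness flags for two systems, keeps only the documents on which the systems disagree, and computes a one-sided exact binomial p-value under the null hypothesis that either system is equally likely to win a disagreement. The function name and the toy data are illustrative.

```python
from math import comb

def sign_test_p_value(correct_a, correct_b):
    """One-sided sign test: is system A better than system B?

    correct_a and correct_b are equal-length sequences of booleans, one
    flag per test document. Only disagreements carry information.
    """
    pairs = list(zip(correct_a, correct_b))
    n = sum(1 for a, b in pairs if a != b)          # disagreements
    k = sum(1 for a, b in pairs if a and not b)     # disagreements won by A
    if n == 0:
        return 1.0
    # P(X >= k) for X ~ Binomial(n, 0.5)
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

# Toy example: A wins 9 of the 10 disagreements, the systems agree elsewhere.
a = [True] * 9 + [False] + [True] * 40
b = [False] * 9 + [True] + [True] * 40
print(sign_test_p_value(a, b))
```

An improvement would be flagged at the level used in the paper when the returned value falls below 0.01.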

Computational complexity
Solving the SVM in the primal for the baselines requires $O(NnL)$ time where $N$ is the number of training documents, $n$ is the number of words in the document and $L$ is the maximum phrase length considered. The computation of VTK reduces to a power series computation of the adjacency matrix of the product graph, and since we require kernel values between all documents, it requires $O(N^2 (n^2 d + n^4 L))$ time where $d$ is the dimension of the word embedding space. Our kernel is the sum of phrase kernels (PhK) starting from every pair of nodes between two sentences, for all phrase lengths ($l$) and distortions ($\lambda_2$) under consideration. By storing intermediate values of composite vectors, a phrase kernel can be computed in $O(d)$ time regardless of the phrase length, therefore the whole computation process has $O(N^2 n^2 L^2 d)$ complexity. Although our kernel has the squared terms of the baseline's complexity, we are tackling the sparsity issue that arises with short text (small $n$) or when little training data is available (small $N$). Moreover, we were able to get better results with only bigrams (small $L$). Hence, the loss in efficiency is acceptable considering the significant gains in effectiveness.
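One way to read the remark about storing intermediate composite vectors is sketched below for additive composition over contiguous phrases. This is an assumption about how such caching might look, not the implementation used in the experiments: the embedding of the phrase of length l starting at position i is obtained from the cached phrase of length l − 1 with a single O(d) vector addition. The function name and the toy sentence are illustrative.

```python
def composite_vectors(sentence, max_len):
    """Cache additive phrase embeddings for all contiguous phrases.

    Returns a dict mapping (start, length) to the sum of the embeddings of
    the `length` words starting at `start`. Each entry reuses the previous
    one, so it costs one O(d) addition whatever the phrase length.
    """
    table = {}
    for start in range(len(sentence)):
        current = None
        for length in range(1, max_len + 1):
            if start + length > len(sentence):
                break
            word = sentence[start + length - 1]
            if current is None:
                current = list(word)
            else:
                current = [a + b for a, b in zip(current, word)]
            table[(start, length)] = current
    return table

# Hypothetical 2-dimensional embeddings for a three-word sentence.
sentence = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
table = composite_vectors(sentence, max_len=2)
print(table[(1, 2)])  # embedding of the bigram starting at the second word
```

With the table in place, each of the phrase pairs between two sentences needs only one cosine computation, which is where the factor $d$ in the overall complexity comes from.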

Conclusion
In this paper, we proposed a novel convolutional sentence kernel based on word embeddings that overcomes the sparsity issue, which arises when classifying short documents or when little training data is available. We described a general framework that can encompass the standard n-gram baseline approach as well as more relaxed versions with smoother word and phrase kernels. It achieved significant improvements over the baselines across all datasets when taking into account the additional information from the latent word similarity (word embeddings) and the syntactic structure (dependency tree).
Future work might involve designing new kernels for syntactic parse trees with appropriate similarity measures between non-terminal nodes as well as exploring recently proposed phrase embeddings for more accurate phrase kernels.