Distributional Representations of Words for Short Text Classification

Traditional supervised learning approaches to common NLP tasks depend heavily on manual annotation, which is labor intensive and time consuming, and often suffer from data sparseness. In this paper we show how to mitigate the problems in short text classification (STC) through word embeddings – distributional representations of words learned from large unlabeled data. The word embeddings are trained from the entire English Wikipedia text. We assume that a short text document is a specific sample of one distribution in a Bayesian framework. A Gaussian process approach is used to model the distribution of words. The task of classification becomes a simple problem of selecting the most probable Gaussian distribution. This approach is compared with those based on the classical maximum entropy (MaxEnt) model and the Latent Dirichlet Allocation (LDA) approach. Our approach achieved better performance and also showed advantages in dealing with unseen words.


Introduction
With the boom of e-commerce and social media, short texts such as instant messages, microblogs and product reviews have become more widely available, and in more diverse forms, than before. These short documents have become a convenient way of presenting information. It is increasingly important to understand them and to detect efficiently what users are interested in. Unlike long documents such as news articles and blogs, short texts do not share much in common, so it is hard to measure similarities among them (Phan et al., 2008). This poses a great challenge to short text classification (STC).
The task of short text classification can be described as follows: given a short text S, the aim is to identify its target theme T. Several supervised learning approaches have been proposed for short text classification. They have been shown to be effective and yielded good performance. These approaches are effective because they leverage a large body of linguistic knowledge and related corpora. However, the supervised learning approaches depend heavily on manual annotation, which is labor intensive and time consuming, and often suffer from data sparseness.
To tackle the above problems, we exploit word embeddings. A word embedding $W: \text{words} \to \mathbb{R}^n$ is a distributed representation of a word, usually learned from a large corpus. Many studies have found that the learned word vectors capture linguistic regularities and collapse similar words into groups (Mikolov et al., 2013b).
In this paper, we apply an information theoretic approach which assumes that the short text is generated from a predefined parametric model, and we estimate the optimal parameters of that model from training data. We use Gaussian models to describe the distribution of word embeddings, since Gaussians can describe the continuous distributions encountered in common practice. We then classify new short texts using Bayes' rule to obtain the posterior probability (Baker and McCallum, 1998).
The paper is organized as follows. Some related work is presented in Section 2. The word embedding based approach to short text classification is presented in Section 3. The dataset and evaluation metrics are described in Section 4. Experimental results on short text classification are given in Section 5. Some conclusions are drawn in Section 6.

Related Work
Learning to identify the theme of a short text document has been extensively studied during the past decade. Because the text length is short, data sparseness is an outstanding issue. Several approaches have been explored to overcome the data sparseness in order to get better performance.
Some approaches compute the similarity between short texts. For example, Zelikovitz and Hirsh (2000) use a corpus of unlabeled longer documents to compute the similarity between a test sample and a training sample. To avoid collecting such longer documents, Web search engines (e.g. Google) have been used to measure the similarity score (Bollegala et al., 2007; Yih and Meek, 2007). However, efficiency is a severe problem for these approaches because they query search engines repeatedly.
Other approaches select useful contextual information to expand and enrich the original text, e.g. using large unlabeled corpora, such as Wikipedia (Banerjee et al., 2007) and WordNet (Hu et al., 2009). A disadvantage of these approaches is that they adapt poorly to languages for which some of those external resources are unavailable. Another approach is to integrate the context data with a set of hidden topics discovered from related corpora. For example, Phan et al. (2008) and Chen et al. (2011) manually built a large and rich universal dataset, and derived a set of hidden topics from these corpora through topic models such as Latent Dirichlet Allocation (LDA) (Blei et al., 2003). This approach has achieved satisfactory results, but it requires manual collection of the corpora. These studies have shown good improvement, but they rely heavily on external resources which are difficult to obtain in some cases.
With the recent revival of interest in deep neural networks, many researchers have concentrated on learning a real-valued vector representation in a continuous space, where similar words are likely to have similar vectors. This is called word embedding (Turian et al., 2010). The learned word vectors capture linguistic regularities in a very simple way. In the embedding space, vector offsets can measure specific relationships: for example, the offset between vec("King") and vec("Man") is very close to that between vec("Queen") and vec("Woman") (Mikolov et al., 2013b).
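The offset regularity can be illustrated with a small sketch. The vectors below are invented 3-dimensional toy values, not learned embeddings, and the names `toy_vec`, `offset` and `cosine` are hypothetical helpers introduced only for this illustration.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only; real word
# embeddings are learned from a corpus and have hundreds of dimensions.
toy_vec = {
    "king": [0.8, 0.9, 0.1],
    "man": [0.7, 0.1, 0.1],
    "queen": [0.2, 0.9, 0.8],
    "woman": [0.1, 0.1, 0.8],
}


def offset(a, b):
    """Return the vector offset vec(a) - vec(b)."""
    return [x - y for x, y in zip(toy_vec[a], toy_vec[b])]


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


# If the two offsets encode the same relationship, they point in nearly
# the same direction, so their cosine similarity is close to 1.
similarity = cosine(offset("king", "man"), offset("queen", "woman"))
print(round(similarity, 3))
```

With learned embeddings the two offsets would only be approximately parallel; the toy values here were chosen so that they coincide.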

Methodology
This section describes the proposed Gaussian classification approach, which uses the learned word embeddings to build a classifier for the task of short text classification.

Word Representation
To obtain word representations, each input word token is transformed into a vector by looking up word embeddings learned from a language model (Zeng et al., 2014). Distributed representations of words in the embedding space have been shown to explicitly encode many syntactic and semantic regularities. Word embeddings have helped to achieve better performance in several NLP tasks (Collobert et al., 2011). There are some free tools for training word embeddings (Turian et al., 2010). We directly use the Word2Vec tool provided by Mikolov et al. (2013a) to train word embeddings on the Wikipedia corpus.
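The lookup step can be sketched as follows. This is a minimal illustration under stated assumptions: the embedding table is a toy dictionary with invented 2-dimensional values, tokenization is lowercasing plus whitespace splitting, and tokens missing from the table are simply skipped. None of these choices is specified in the text, and `embed_text` is a hypothetical helper name.

```python
# Toy embedding table (invented values); in practice this would be loaded
# from vectors trained with a tool such as Word2Vec.
embedding_table = {
    "cheap": [0.2, 0.7],
    "flight": [0.9, 0.1],
    "tickets": [0.8, 0.2],
}


def embed_text(text, table):
    """Map a short text to the list of vectors of its known tokens.

    Assumption: lowercase + whitespace tokenization; tokens that are not
    in the table are skipped rather than mapped to a special vector.
    """
    tokens = text.lower().split()
    return [table[token] for token in tokens if token in table]


vectors = embed_text("Cheap flight tickets online", embedding_table)
print(len(vectors))  # "online" is not in the toy table and is skipped
```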

Our Approach
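As mentioned in Section 3.1, all of the words are represented as word vectors. Word embeddings can be taken as observations from an unsupervised generative model. We assume that a short text $d_j$ is generated by theme $t_k$ (parameterized by $\lambda_k$) according to the domain prior $p(t_k \mid \lambda_k)$. Similar to language modeling, we assume that a word embedding $w_i^j$ for the $i$-th word in short text $d_j$ depends only on the preceding words. Under this assumption, the probability of a document given theme $t_k$ is

$$p(d_j \mid t_k; \lambda_k) = \prod_{i=1}^{|d_j|} p(w_i^j \mid t_k; w_1^j, \ldots, w_{i-1}^j; \lambda_k) \tag{1}$$

Next we assume that each word in a document is independent of its context, which is the same assumption as that of a uni-gram language model. Then we rewrite equation 1 as

$$p(d_j \mid t_k; \lambda_k) = \prod_{i=1}^{|d_j|} p(w_i^j \mid t_k; \lambda_k) \tag{2}$$

A Gaussian model is used to describe the distribution $p(w_i^j \mid t_k; \lambda_k)$. We use the training data to estimate the parameters $\lambda_k = \{\mu_k, \Sigma_k\}$, where $\mu_k$ and $\Sigma_k$ denote the mean vector and covariance matrix. We also assume that the covariance matrix of the Gaussian is diagonal. $\lambda_k$ can be estimated through Maximum Likelihood (ML) estimation as $\hat{\lambda}_k$:

$$\hat{\mu}_k = \frac{1}{|w_k|} \sum_{i=1}^{|w_k|} w_i^k \tag{3}$$

$$\hat{\Sigma}_k = \operatorname{diag}\left( \frac{1}{|w_k|} \sum_{i=1}^{|w_k|} (w_i^k - \hat{\mu}_k)(w_i^k - \hat{\mu}_k)^{\top} \right) \tag{4}$$

where $|w_k|$ is the total number of words in theme $t_k$ on the training set and $w_i^k$ is the $i$-th word.

Given estimates of the model parameters, new test data can be classified using Bayes' theorem. A new short test text can be assigned the most likely theme as follows,

$$\hat{t}(d_j) = \arg\max_{k} p(t_k \mid d_j; \hat{\lambda}) = \arg\max_{k} \frac{p(t_k \mid \hat{\lambda}_k)\, p(d_j \mid t_k; \hat{\lambda}_k)}{p(d_j \mid \hat{\lambda})} \tag{5}$$

A uniform prior is used to choose the most probable theme, which minimizes cross entropy on the test document. In equation 5, we drop the denominator (which is the same constant across all domains), and take the log of the entire expression. This results in

$$\hat{t}(d_j) = \arg\max_{k} \left[ \log p(t_k \mid \hat{\lambda}_k) + \sum_{i=1}^{|d_j|} \log p(w_i^j \mid t_k; \hat{\lambda}_k) \right] \tag{6}$$

where the prior term is the same for every theme under the uniform prior.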
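The procedure described above (one diagonal-covariance Gaussian per theme, ML estimates of its mean and variances, and selection of the theme with the highest summed log-density under a uniform prior) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the 2-dimensional vectors and theme names are invented, and the variance floor `var_floor` is an added assumption to avoid division by zero.

```python
import math


def fit_theme(vectors, var_floor=1e-6):
    """ML estimates of the mean and per-dimension variance for one theme.

    `vectors` holds the embeddings of all training words of the theme.
    `var_floor` is an assumption added here for numerical safety.
    """
    n = len(vectors)
    dim = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / n for d in range(dim)]
    var = [
        max(sum((v[d] - mean[d]) ** 2 for v in vectors) / n, var_floor)
        for d in range(dim)
    ]
    return mean, var


def log_density(vec, mean, var):
    """Log-density of a diagonal-covariance Gaussian evaluated at `vec`."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * s) + (x - m) ** 2 / s)
        for x, m, s in zip(vec, mean, var)
    )


def classify(doc_vectors, models):
    """Return the theme maximizing the summed word log-densities.

    With a uniform prior the prior term is constant and is omitted.
    """
    scores = {
        theme: sum(log_density(v, mean, var) for v in doc_vectors)
        for theme, (mean, var) in models.items()
    }
    return max(scores, key=scores.get)


# Toy training data: invented 2-dimensional word vectors per theme.
train = {
    "sports": [[1.0, 0.1], [0.9, 0.2], [1.1, 0.0]],
    "health": [[0.0, 1.0], [0.1, 0.9], [0.2, 1.1]],
}
models = {theme: fit_theme(vecs) for theme, vecs in train.items()}
predicted = classify([[0.95, 0.15], [1.05, 0.05]], models)
print(predicted)
```

Because the covariance is diagonal, the log-density decomposes into a sum over dimensions, so both training and classification cost time linear in the embedding dimension.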

Dataset and Evaluation Metrics
To evaluate the performance of the above approach, we use the Web snippet dataset used in (Phan et al., 2008; Chen et al., 2011; Sun, 2012). The dataset has an average of 18 words in each snippet. Column 2 of Table 2 shows that the test data include 4,378 words (about 43.62%) which do not appear in the training data. Column 3 shows the number of unseen words after Porter stemming (Sparck Jones, 1997). This table shows that more than 40% of the words in the test data are unseen.
We downloaded the English Wikipedia dump of October 8, 2014, which was used for training word embeddings. After removing all the non-Roman characters and MediaWiki markup, we had 14,941,377 articles. The hyper-parameters used in Word2Vec are the same as those in (Mikolov et al., 2013a). To compare our results with the previous studies, we adopt accuracy as the performance metric, which is the proportion of test samples that are classified correctly.
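A count of unseen words of this kind can be computed as in the sketch below. It assumes that the statistic is taken over distinct word types and that texts are lowercased and split on whitespace; the text does not state either detail, and the sample snippets are invented.

```python
def unseen_word_stats(train_texts, test_texts):
    """Count distinct test words that never occur in the training texts.

    Assumptions: lowercase + whitespace tokenization, counting word types
    (distinct words) rather than tokens.
    """
    train_vocab = {w for text in train_texts for w in text.lower().split()}
    test_vocab = {w for text in test_texts for w in text.lower().split()}
    unseen = test_vocab - train_vocab
    return len(unseen), len(unseen) / len(test_vocab)


# Invented toy snippets for illustration.
train_snippets = ["cheap flight tickets", "football match results"]
test_snippets = ["cheap hotel deals", "football transfer news"]
unseen_count, unseen_ratio = unseen_word_stats(train_snippets, test_snippets)
print(unseen_count, unseen_ratio)
```

Applying a stemmer to both vocabularies before taking the set difference would give the corresponding figure after stemming.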
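For concreteness, the accuracy metric can be sketched as below; the label values are invented, and `accuracy` is a hypothetical helper name.

```python
def accuracy(gold_labels, predicted_labels):
    """Proportion of test samples whose predicted theme matches the gold theme."""
    if len(gold_labels) != len(predicted_labels):
        raise ValueError("gold and predicted label lists must have equal length")
    if not gold_labels:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(g == p for g, p in zip(gold_labels, predicted_labels))
    return correct / len(gold_labels)


# Invented labels for illustration: 3 of 4 predictions are correct.
gold = ["sports", "health", "sports", "business"]
pred = ["sports", "health", "business", "business"]
score = accuracy(gold, pred)
print(score)
```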

Experiments
We conducted three sets of experiments. In the first, we compare the performance of our approach with that of previous studies. The second tests the capability of our approach in dealing with unseen words using different sizes of training data. The third investigates the effect of the word representation dimension on STC.

Comparison with Previous Work
For comparison, we select two approaches from (Phan et al., 2008); the results are given in Table 3. The first method treats the short text document as a bag of words (Salton, 1989) and uses classical TF/IDF to represent the contribution of each term to its theme. In the second method, topic models are estimated from a related corpus using LDA, and the topics of the short text are then inferred from those models. Thus, the features in method 2 contain topic distributions and bag-of-word vectors. Both approaches employ MaxEnt classifiers. Table 3 shows the results for the three approaches. The best result is obtained by our proposed method, with an absolute gain of 3.3 percent. Using word embeddings trained on a universal dataset clearly mitigates the problem of unseen words. Unlike the simple representations based on word frequencies (with some simplifications) (Clinchant and Perronnin, 2013) used in the previous studies, our approach has the important advantage of making better use of the semantics of all the words in the short text document.

Dealing with Unseen Words
To validate the importance and influence of the size of the training data in our approach, we increase the size of the training data from 1,000 to 10,000 and measure the performance on the same test set. Since less training data leads to more unseen words in the test phase, this experiment shows the capability of our approach in coping with unseen words, as shown by the lines labeled Original and After Stemming in Figure 1. We cite the results of (Phan et al., 2008) directly because we could not crawl the related corpora, which contain 3.5GB of Wikipedia documents, to re-implement their work.
The results of this experiment are shown in Figure 1. It can be seen that our approach based on the Gaussian process with word embeddings achieves good performance with relatively little training data, which reduces the cost of collecting and annotating training data.

The Effect of Word Representation Dimensions on STC
In our method, there is a free parameter in building word embeddings, i.e., the dimension of word representations. We empirically show the effect on the test data. Figure 2 presents the short text classification performance obtained with different dimensions of word embeddings. In this section, we used all the training data as our experimental data. The best performance is about 85.83% when the size of word embedding space is 550 dimensions. The system achieves 7.23% absolute improvement when the di-

Conclusion
In this paper, we proposed using a Gaussian process with continuous word embeddings for short text classification. The experimental results show that our approach is effective and that word embeddings, which capture syntactic and semantic relationships between words, contribute substantially to handling unseen data. For future work, we would like to investigate how continuous word embeddings work on other genres of short texts such as microblogs, or on conventional (long) texts, in topic and sentiment classification.