Initializing Convolutional Filters with Semantic Features for Text Classification

Convolutional Neural Networks (CNNs) are widely used in NLP tasks. This paper presents a novel weight initialization method to improve CNNs for text classification. Instead of randomly initializing the convolutional filters, we encode semantic features into them, which helps the model focus on learning useful features from the beginning of training. Experiments demonstrate the effectiveness of the initialization technique on seven text classification tasks, including sentiment analysis and topic classification.


Introduction
Recently, neural networks (NNs) have come to dominate the state-of-the-art results on a wide range of natural language processing (NLP) tasks. The neural networks commonly used in NLP include recurrent NNs, convolutional NNs, recursive NNs, and their combinations. NNs are known for their strong ability to learn features automatically. However, a lack of data or inappropriate parameter settings can greatly limit the generalization ability of a model (Bengio et al., 2009; LeCun et al., 2015; Krizhevsky et al., 2012; Srivastava et al., 2014). To enhance performance, many improved methods have been proposed, e.g., developing advanced structures (Zhao et al., 2015; Zhang et al., 2016a), introducing prior knowledge (Hu et al., 2016), and utilizing external resources (Xie et al., 2016; Qian et al., 2016).
It is also noteworthy that neural networks' performance is sensitive to weight initialization because their objectives are non-convex (Glorot and Bengio, 2010; Saxe et al., 2013; Mishkin and Matas, 2015). In fact, initialization techniques even played the role of a catalyst in the revival of neural networks (Hinton et al., 2006; LeCun et al., 2015). Most improvements in weight initialization are based on mathematical methods, e.g., Xavier initialization (Glorot and Bengio, 2010) and orthogonal initialization (Saxe et al., 2013). For NLP tasks, an influential technique is to use pre-trained word vectors to initialize embedding layers (Kim, 2014; Chen and Manning, 2014). Given that embedding layers can be initialized with pre-trained word vectors, what about the weights in the other layers, which are still randomly initialized?
Inspired by this question, we propose a simple yet effective method to improve CNNs by initializing the convolutional layers (filters). Unlike previous weight initialization methods based on mathematical techniques, we encode semantic features into the filters instead of initializing them randomly. As CNNs exploit 1-D convolutional filters to extract n-gram features, our method aims at helping the filters focus on learning useful n-grams, e.g., "not bad", which is more useful than "watch a movie" for determining a review's polarity. Specifically, we select n-grams from the training data via a Naive Bayes (NB) weighting technique, and then cluster the n-gram embeddings with the K-means algorithm. After that, we use the centroid vectors of the clusters to initialize the filters.
With this initialization method, CNN filters tend to extract important n-gram features from the beginning of the training process. By integrating our method into a classic CNN model for text classification (Kim, 2014), we observe significant improvements in sentiment analysis and topic classification tasks. The advantages of our approach are as follows:
• Features are directly extracted from the training data without involving any external resources;
• The computation introduced by our method is relatively small, resulting in little additional training cost;
• The filter initialization is task-independent and could be easily applied to other NLP tasks.
We also further analyze the filters, shedding some light on the mechanism by which our method influences the training process.

Related Work
Most recently, CNNs have become increasingly popular in a variety of NLP tasks. An influential example is the work of Kim (2014), where a simple CNN with a single convolutional layer is used for feature extraction. Despite its simple structure, the model achieves strong baselines on many sentence classification datasets. Following this work, several improved models have been proposed. Zhang and Wallace (2015) improve the model by optimizing hyper-parameters and provide a detailed analysis of the CNN (Kim, 2014). Yin and Schütze (2016) and Zhang et al. (2016b) exploit different pre-trained word embeddings (e.g., word2vec and GloVe) to enhance the model.
In addition to initializing embedding layers with pre-trained word vectors, other pre-designed features have also proven very effective in assisting the training of neural models. For example, in (Hu et al., 2016), neural models are harnessed by logic rules. Li et al. (2016) propose using pre-calculated word weights to guide the Paragraph Vector model. Dai and Le (2015) combine the hidden layers of RNNs with linearly increasing weights. Xie et al. (2016) use entity descriptions from knowledge bases (e.g., Freebase) to learn knowledge representations for entity classification and knowledge graph completion. Qian et al. (2016) propose linguistically regularized LSTMs for sentiment analysis with sentiment lexicons, negation words, and intensity words. In this work, we encode semantic features into convolutional layers by initializing them with important n-grams. Being aware of which n-grams are important, the CNN is able to extract more discriminative features for text classification.

Our Method
The intuition behind our method is simple: since CNNs essentially capture semantic features of n-grams, we can use important n-grams to initialize the filters. As a result, the filters are able to focus on extracting those important n-gram features from the beginning of training. As shown in Figure 1, we use the embeddings of "not" and "bad" to initialize a filter. A larger score is obtained when this "not bad" filter matches the bigram "not bad" in the text; otherwise, a relatively smaller score is returned.
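This scoring behavior can be illustrated with a toy sketch. The vectors here are random stand-ins for real word embeddings, and all names and dimensions are illustrative, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50  # toy embedding dimension

# Hypothetical word vectors; real ones would come from pre-trained embeddings.
emb = {w: rng.normal(size=d) for w in ["not", "bad", "watch", "a", "movie"]}

# A bigram filter initialized with the concatenated "not bad" embedding.
filt = np.concatenate([emb["not"], emb["bad"]])

def score(bigram):
    """One convolution step: dot product between the filter and a bigram window."""
    return float(filt @ np.concatenate([emb[w] for w in bigram]))

# The matching bigram scores higher than an unrelated one.
assert score(("not", "bad")) > score(("watch", "a"))
```

The matching window scores roughly the squared norm of the filter, while unrelated windows score near zero, which is why such a filter responds selectively early in training.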

N-gram Selection
Firstly, we extract important n-grams from the training data. Intuitively, the n-gram "not bad" is much more important than "watch a movie" for determining a review's polarity. Naive Bayes (NB) weighting is an effective technique for determining the importance of words (Martineau and Finin, 2009; Wang and Manning, 2012). The NB weight r of an n-gram w in class c is calculated as follows:

r_c^w = [(p_c^w + α) / (||p_c||_1 + α)] / [(p̄_c^w + α) / (||p̄_c||_1 + α)]

where p_c^w is the number of texts that contain n-gram w in class c, p̄_c^w is the number of texts that contain n-gram w in the other classes, ||p_c||_1 is the number of texts in class c, ||p̄_c||_1 is the number of texts in the other classes, and α is a smoothing parameter. For the positive class in a movie review dataset, the ratios of n-grams like "amazing" and "not bad" should be large, since they appear much more frequently in positive texts than in negative texts. For neutral n-grams like "of the" and "movie", the ratios should be around 1. For each class, we select the n-grams whose ratios are much higher than 1 for filter initialization. Examples of n-grams selected by our method are given in the Appendix.
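The selection step can be sketched as follows. The toy corpus and helper names are our own illustration; the ratio mirrors the NB weighting described above (smoothed document frequency in a class divided by smoothed document frequency in the other classes):

```python
# Toy corpus: (tokens, label). Real data would be the training set.
docs = [("not bad at all".split(), "pos"),
        ("an amazing movie".split(), "pos"),
        ("not good at all".split(), "neg"),
        ("a boring movie".split(), "neg")]

def bigrams(tokens):
    return list(zip(tokens, tokens[1:]))

def nb_ratio(gram, cls, alpha=1.0):
    """NB weight of an n-gram: its smoothed document frequency in class `cls`
    divided by its smoothed document frequency in the other classes."""
    in_cls = sum(1 for toks, y in docs if y == cls and gram in bigrams(toks))
    out_cls = sum(1 for toks, y in docs if y != cls and gram in bigrams(toks))
    n_in = sum(1 for _, y in docs if y == cls)
    n_out = sum(1 for _, y in docs if y != cls)
    return ((in_cls + alpha) / (n_in + alpha)) / ((out_cls + alpha) / (n_out + alpha))

# "not bad" is discriminative for the positive class; "at all" is neutral.
print(nb_ratio(("not", "bad"), "pos"))  # 2.0, i.e. ratio > 1
print(nb_ratio(("at", "all"), "pos"))   # 1.0, i.e. ratio ≈ 1
```

N-grams with ratios well above 1 for some class would then be kept for initialization.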

Filter Initialization
We concatenate word embeddings to construct n-gram embeddings. For example, a tri-gram embedding has 3×100 dimensions when the word embedding has 100 dimensions. This concatenation follows the mechanism of convolutional filters, where a filter with n×d dimensions is able to capture n-gram features (d is the dimension of the word embedding). Because the number of filters in CNNs is much smaller than the number of n-grams, a filter tends to extract the features of a class of n-grams rather than an individual n-gram. Based on this observation, we do not use n-gram embeddings to initialize the filters directly. Instead, we first use K-means to cluster the features of the selected n-grams, and then use the clusters' centroid vectors to initialize the filters. In this work, we consider clustering uni-gram (word), bi-gram, and tri-gram features. Figure 2 shows two uni-gram cluster examples extracted from the location questions in the TREC dataset (Li and Roth, 2002).
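A minimal sketch of the clustering step, assuming n-gram embeddings are already built (we hand-roll K-means here for self-containment; an off-the-shelf implementation such as scikit-learn's KMeans would do equally well, and all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def kmeans(X, k, iters=20):
    """Minimal K-means over n-gram embedding rows of X; returns k centroids."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each n-gram embedding to its nearest centroid.
        assign = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # Move each centroid to the mean of its cluster.
        for j in range(k):
            if (assign == j).any():
                centers[j] = X[assign == j].mean(axis=0)
    return centers

# 40 toy tri-gram embeddings (3 words x 10 dims, concatenated), clustered into
# 4 centroids that will seed 4 filters. Real embeddings would be built from
# pre-trained word vectors as described above.
ngram_embs = rng.normal(size=(40, 30))
centroids = kmeans(ngram_embs, k=4)
assert centroids.shape == (4, 30)
```

Each centroid summarizes a class of similar n-grams, which matches the observation that one filter captures a group of n-grams rather than a single one.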
After obtaining the n-gram clusters, we feed their centroid vectors into the center of the filters; the remaining positions are still initialized randomly. Taking filters of sizes 3, 4, and 5 as examples, Figure 3 shows how we fill uni-, bi-, and tri-gram features into the filters. By doing this, we encode semantic features into the filters. For example, in the TREC question classification task, the initialization results in six types of filters, sensitive to abbreviation, entity, description, human, location, and number questions respectively.
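The filling step can be sketched as follows. The dimensions and the random-init range are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

rng = np.random.default_rng(2)
d, filter_size = 10, 5  # toy word dimension; the filter spans 5 words

def init_filter(centroid, n):
    """Place an n-gram centroid (n*d values) in the middle rows of a
    filter_size x d filter; the remaining rows stay randomly initialized."""
    filt = rng.uniform(-0.01, 0.01, size=(filter_size, d))
    start = (filter_size - n) // 2
    filt[start:start + n] = centroid.reshape(n, d)
    return filt

# A bi-gram centroid fills rows 1-2 of a size-5 filter; rows 0, 3, 4 stay random.
bigram_centroid = rng.normal(size=2 * d)
f = init_filter(bigram_centroid, n=2)
assert np.allclose(f[1:3].ravel(), bigram_centroid)
```

Centering the centroid lets the same centroid seed filters of several sizes, with the random border rows free to adapt during training.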

Datasets and Hyper-parameter Settings
CNN-non-static (short for CNN) proposed by Kim (2014) is used as our baseline; it consists of one embedding layer, one convolutional layer, one max-pooling layer, and one fully connected layer, and is a strong baseline for sentence classification. For details of the model, see (Kim, 2014; Zhang and Wallace, 2015). Word embeddings pre-trained on Google News with the word2vec toolkit are used to initialize the convolutional filters, in addition to initializing the embedding layer of the CNN as in (Kim, 2014). For a fair comparison, we use the same seven datasets and hyper-parameter settings as Kim (2014) for training and testing. Uni-, bi-, and tri-gram features are used to initialize the filters. For a K-way classification problem, we select the top 10% of n-grams in each class according to NB weighting. Since 300 filters are used in Kim (2014)'s work, we follow this setting and aggregate the n-grams into 300/K clusters for each class, whose centroid vectors are used to fill the filters. Taking the binary classification dataset MR as an example, 150 "positive" filters and 150 "negative" filters are obtained after initialization.
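The per-class filter budget follows directly from this setting; a small sketch (class labels are placeholders of our own):

```python
def filter_allocation(num_filters=300, num_classes=2):
    """Split the filter budget evenly across classes: each of the K classes
    gets num_filters / K centroid-initialized filters (paper: 300 filters)."""
    per_class = num_filters // num_classes
    return {f"class_{c}": per_class for c in range(num_classes)}

# Binary MR dataset: 150 "positive" and 150 "negative" filters.
print(filter_allocation(300, 2))  # {'class_0': 150, 'class_1': 150}
```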

Effectiveness of Filter Initialization
In this section, we demonstrate the effectiveness of our initialization technique. We use uni-, bi-, and tri-gram centroid vectors, respectively, to fill the filters. As Table 1 shows, the initialization further improves the accuracies significantly on all datasets except MPQA. This result is consistent with (Wang and Manning, 2012), where NB weighting produces little improvement on MPQA. We can also observe that the performances of uni-, bi-, and tri-grams are comparable; none of them outperforms the others on all datasets.

Comparisons with State-of-the-arts
Table 2 lists the results of our model and other state-of-the-art models. Models in the first group are improved CNNs based on (Kim, 2014). Among them, MV-CNN and MGNC-CNN utilize multiple pre-trained embeddings as inputs, and CNN-Rule integrates logic rules. Our model achieves the best performance on three tasks without requiring any extra training costs or resources. With this simple initialization method, our model also gives competitive results against the more sophisticated deep learning models in the second group, e.g., Adasent (Zhao et al., 2015) and DSCNN (Zhang et al., 2016a), which have complex structures or combine multiple NNs.

These experiments show that our n-gram features make a great contribution to both two-class and multi-class classification. Essentially, our method enables CNNs to obtain better generalization ability. Furthermore, as the initialization does not rely on any external prior knowledge or resources, it could be easily applied to other NLP tasks or other languages.

Table 3: "+" and "-" denote the numbers of positive and negative weights respectively. The data in the table are obtained from MR by averaging over 10 training runs. Each time we select 100 filters: 50 of them are initialized with positive n-grams and the rest with negative n-grams.

Further Analysis of Filters
We further analyze the filter initialization with an example of binary sentiment classification. Through the initialization, we have determined in advance which filters are positive and which are negative. The neurons corresponding to positive filters above the max-pooling layer are supposed to be activated by positive samples. Since positive (negative) samples have labels of 1 (0), the corresponding weights (in the logistic regression) of those "positive" neurons tend to be positive. For the same reason, the negative filters tend to have negative weights. The results shown in Table 3 confirm our hypothesis: positive/negative filters respectively tend to have positive (+)/negative (-) weights. The difference between positive and negative filters is more obvious in the bi-gram and tri-gram cases, because bi- and tri-gram centroid vectors initialize more parameters of the filters than uni-grams do.
In Table 1, the experiments show that the choice among uni-, bi-, and tri-grams has little influence on the results. Our assumption is the following: compared to uni-grams (words), bi- and tri-grams cover more of the filter space and introduce more NB information into the filters; according to Table 3, filters initialized with them thus pay more attention to NB information than filters initialized with uni-grams. However, bi- and tri-grams are also sparser in the data than uni-grams, so their NB weights are not as accurate as those of uni-grams, even with smoothing applied. As the NB weight of an n-gram denotes its contribution to classification, a model initialized with tri-grams does not always perform best.

Conclusion
This paper proposes a novel weight initialization technique for CNNs. We find that convolutional filters that encode semantic features from the beginning of training tend to produce better results than randomly initialized ones. This has an effect similar to initializing embedding layers with pre-trained word vectors. Experimental results demonstrate the effectiveness of the initialization technique on multiple text classification tasks. In addition, our method requires few external resources and relatively little computation, making it attractive in scenarios where training costs are an issue.
In textual data, the features extracted by CNNs are n-grams. In fields like computer vision, however, the features extracted by filters are more difficult to interpret, so applying our method beyond NLP still requires further exploration.