Multichannel Variable-Size Convolution for Sentence Classification

We propose MVCNN, a convolution neural network (CNN) architecture for sentence classification. It (i) combines diverse versions of pretrained word embeddings and (ii) extracts features of multigranular phrases with variable-size convolution filters. We also show that pretraining MVCNN is critical for good performance. MVCNN achieves state-of-the-art performance on four tasks: small-scale binary, small-scale multi-class and large-scale Twitter sentiment prediction, and subjectivity classification.


Introduction
Different sentence classification tasks are crucial for many Natural Language Processing (NLP) applications. Natural language sentences have complicated structures, both sequential and hierarchical, that are essential for understanding them. In addition, how to decode and compose the features of component units, including single words and variable-size phrases, is central to the sentence classification problem.
In recent years, deep learning models have achieved remarkable results in computer vision, speech recognition (Graves et al., 2013) and NLP (Collobert and Weston, 2008). A problem largely specific to NLP is how to detect features of linguistic units, how to conduct composition over variable-size sequences and how to use them for NLP tasks (Collobert et al., 2011; Kim, 2014). Socher et al. (2011a) proposed recursive neural networks to form phrases based on parse trees. This approach depends on the availability of a well-performing parser; for many languages and domains, especially noisy domains, reliable parsing is difficult. Hence, convolution neural networks (CNN) are getting increasing attention, for they are able to model long-range dependencies in sentences via hierarchical structures (Dos Santos and Gatti, 2014; Kim, 2014; Denil et al., 2014). Current CNN systems usually implement a convolution layer with fixed-size filters (i.e., feature detectors), in which the concrete filter size is a hyperparameter. They essentially split a sentence into multiple sub-sentences by a sliding window, then determine the sentence label by using the dominant label across all sub-sentences. The underlying assumption is that a sub-sentence of that granularity is potentially good enough to represent the whole sentence. However, it is hard to find the granularity of a "good sub-sentence" that works well across sentences. This motivates us to implement variable-size filters in a convolution layer in order to extract features of multigranular phrases.
Breakthroughs of deep learning in NLP are also based on learning distributed word representations (also called "word embeddings") by neural language models (Bengio et al., 2003; Mnih and Hinton, 2009; Mikolov et al., 2010; Mikolov, 2012; Mikolov et al., 2013a). Word embeddings are derived by projecting words from a sparse, 1-of-V encoding (V: vocabulary size) onto a lower dimensional and dense vector space via hidden layers and can be interpreted as feature extractors that encode semantic and syntactic features of words.
Many papers study the comparative performance of different versions of word embeddings, usually learned by different neural network (NN) architectures. For example, Chen et al. (2013) compared HLBL (Mnih and Hinton, 2009), SENNA (Collobert and Weston, 2008), Turian (Turian et al., 2010) and Huang (Huang et al., 2012), showing great variance in quality and characteristics of the semantics captured by the tested embedding versions. Hill et al. (2014) showed that embeddings learned by neural machine translation models outperform three representative monolingual embedding versions: skip-gram (Mikolov et al., 2013b), GloVe (Pennington et al., 2014) and C&W (Collobert et al., 2011) in some cases. These prior studies motivate us to explore combining multiple versions of word embeddings, treating each of them as a distinct description of words. Our expectation is that the combination of these embedding versions, trained by different NNs on different corpora, should contain more information than each version individually. We want to leverage this diversity of different embedding versions to extract higher quality sentence features and thereby improve sentence classification performance.
The letters "M" and "V" in the name "MVCNN" of our architecture denote the multichannel and variable-size convolution filters, respectively. "Multichannel" employs language from computer vision where a color image has red, green and blue channels. Here, a channel is a description by an embedding version.
For many sentence classification tasks, only relatively small training sets are available. MVCNN has a large number of parameters, so that overfitting is a danger when they are trained on small training sets. We address this problem by pretraining MVCNN on unlabeled data. These pretrained weights can then be fine-tuned for the specific classification task.
In sum, we attribute the success of MVCNN to: (i) designing variable-size convolution filters to extract variable-range features of sentences and (ii) exploring the combination of multiple public embedding versions to initialize words in sentences. We also employ two "tricks" to further enhance system performance: mutual learning and pretraining.
In the remaining parts, Section 2 presents related work. Section 3 gives details of our classification model. Section 4 introduces two tricks that enhance system performance: mutual-learning and pretraining. Section 5 reports experimental results. Section 6 concludes this work.

Related Work
Much prior work has exploited deep neural networks to model sentences. Blacoe and Lapata (2012) represented a sentence by element-wise addition, multiplication, or recursive autoencoder over embeddings of component single words. Yin and Schütze (2014) extended this approach by composing on words and phrases instead of only single words. Collobert and Weston (2008) and Yu et al. (2014) used one layer of convolution over phrases detected by a sliding window on a target sentence, then used max- or average-pooling to form a sentence representation. Kalchbrenner et al. (2014) stacked multiple layers of one-dimensional convolution by dynamic k-max pooling to model sentences. We also adopt dynamic k-max pooling, while our convolution layer has variable-size filters. Kim (2014) also studied multichannel representation and variable-size filters. In contrast, the multichannel input in that work relies on a single version of pretrained embeddings (i.e., pretrained Word2Vec embeddings) with two copies: one is kept stable and the other one is fine-tuned by backpropagation. We develop this insight by incorporating diverse embedding versions. Additionally, we further develop the idea of variable-size filters.
Le and Mikolov (2014) initialized the representation of a sentence as a parameter vector, treating it as a global feature and combining this vector with the representations of context words to do word prediction. Finally, this fine-tuned vector is used as the representation of this sentence. Apparently, this method can only produce generic sentence representations which encode no task-specific features.
Our work is also inspired by studies that compared the performance of different word embedding versions or investigated the combination of them. For example, Turian et al. (2010) compared Brown clusters, C&W embeddings and HLBL embeddings in NER and chunking tasks. They found that Brown clusters and word embeddings both can improve the accuracy of supervised NLP systems; and demonstrated empirically that combining different word representations is beneficial. Luo et al. (2014) adapted CBOW (Mikolov et al., 2013a) to train word embeddings on different datasets: free text documents from Wikipedia, search click-through data and user query data, showing that combining them gets stronger results than using individual word embeddings in web search ranking and word similarity task. However, these two papers either learned word representations on the same corpus (Turian et al., 2010) or enhanced the embedding quality by extending training corpora, not learning algorithms (Luo et al., 2014). In our work, there is no limit to the type of embedding versions we can use and they leverage not only the diversity of corpora, but also the different principles of learning algorithms.

Model Description
We now describe the architecture of our model MVCNN, illustrated in Figure 1.
Multichannel Input. The input of MVCNN includes multichannel feature maps of a considered sentence, each of which is a matrix initialized by a different embedding version. Let s be the sentence length, d the dimension of word embeddings and c the total number of different embedding versions (i.e., channels). Hence, the whole initialized input is a three-dimensional array of size c × d × s. Figure 1 depicts a sentence with s = 12 words. Each word is initialized by c = 5 embeddings, each coming from a different channel. In the implementation, sentences in a mini-batch are padded to the same length, and words unknown in a given channel are randomly initialized or can acquire a good initialization from the mutual-learning phase described in the next section.
Multichannel initialization brings two advantages: 1) a frequent word can have c representations in the beginning (instead of only one), which means it has more available information to leverage; 2) a rare word missed in some embedding versions can be "made up" by others (we call it a "partially known word"). Therefore, this kind of initialization is able to make use of information about partially known words, without having to employ full random initialization or removal of unknown words. The vocabulary of the binary sentiment prediction task described in the experimental part contains 5232 words unknown in HLBL embeddings, 4273 in Huang embeddings, 3299 in GloVe embeddings, 4136 in SENNA embeddings and 2257 in Word2Vec embeddings. But only 1824 words find no embedding from any channel! Hence, multichannel initialization can considerably reduce the number of unknown words.
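The coverage categories and the c × d × s input layout described above can be illustrated with a minimal sketch. The embedding tables, the helper names `coverage` and `multichannel_input`, and the zero placeholder for missing words are all assumptions made for illustration; the paper itself uses random or mutual-learning initialization for missing entries.

```python
def coverage(vocab, channels):
    """Split words into full-hit, partial-hit and no-hit sets."""
    full_hit, partial_hit, no_hit = set(), set(), set()
    for word in vocab:
        hits = sum(1 for table in channels.values() if word in table)
        if hits == len(channels):
            full_hit.add(word)
        elif hits > 0:
            partial_hit.add(word)      # a "partially known word"
        else:
            no_hit.add(word)
    return full_hit, partial_hit, no_hit


def multichannel_input(sentence, channels, d):
    """Build a c x d x s nested list for one sentence.

    Words missing from a channel get a zero vector here (assumption);
    the paper initializes them randomly or via mutual-learning.
    """
    tensor = []
    for table in channels.values():
        vectors = [table.get(w, [0.0] * d) for w in sentence]
        # Rows are embedding dimensions, columns are word positions.
        tensor.append([[vec[k] for vec in vectors] for k in range(d)])
    return tensor


# Toy, made-up embedding tables with d = 2 (hypothetical data).
channels = {
    "A": {"the": [0.1, 0.2], "cat": [0.3, 0.4]},
    "B": {"the": [0.5, 0.6], "sat": [0.7, 0.8]},
}
sentence = ["the", "cat", "sat", "zzz"]
full_hit, partial_hit, no_hit = coverage(sentence, channels)
x = multichannel_input(sentence, channels, d=2)
print(len(x), len(x[0]), len(x[0][0]))  # prints: 2 2 4  (c, d, s)
```

In this toy case only "zzz" is unknown in every channel, which mirrors how the union of channels leaves far fewer fully unknown words than any single channel.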
Convolution Layer (Conv). For convenience, we first introduce how this work uses a convolution layer on one input feature map to generate one higher-level feature map. Given a sentence of length $s$: $w_1, w_2, \ldots, w_s$, where $\mathbf{w}_i \in \mathbb{R}^d$ denotes the embedding of word $w_i$, a convolution layer uses sliding filters to extract local features of that sentence. The filter width $l$ is a parameter. We first concatenate the initialized embeddings of $l$ consecutive words $(\mathbf{w}_{i-l+1}, \ldots, \mathbf{w}_i)$ as $\mathbf{c}_i \in \mathbb{R}^{ld}$ $(1 \le i < s + l)$, then generate the feature value of this phrase as $p_i$ (the whole vector $\mathbf{p} \in \mathbb{R}^{s+l-1}$ contains all the local features) using a tanh activation function and a linear projection vector $\mathbf{v} \in \mathbb{R}^{ld}$ as:
$$p_i = \tanh(\mathbf{v}^{\top} \mathbf{c}_i + b)$$
where $b$ is a bias term.
More generally, the convolution operation can deal with multiple input feature maps and can be stacked to yield feature maps of increasing layers. In each layer, there are usually multiple filters of the same size, but with different weights. We refer to a filter with a specific set of weights as a kernel. The goal is often to train a model in which different kernels detect different kinds of features of a local region. However, this traditional approach cannot detect the features of regions of different granularity. Hence we keep the multi-kernel property while extending it to variable sizes in the same layer.
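As a minimal sketch of the single-filter case, the function below computes the feature vector p of length s + l − 1 by sliding one filter over a zero-padded sentence. The function name `wide_convolution` and the toy inputs are hypothetical, and the scalar bias `b` is treated as an assumption (set it to 0 to drop it).

```python
import math


def wide_convolution(words, v, b, l):
    """Return p with p_i = tanh(v . c_i + b) for i = 1 .. s + l - 1.

    words: list of s embeddings, each a list of d floats
    v:     projection vector of length l * d
    b:     scalar bias (assumed; not spelled out in the text)
    l:     filter width
    """
    s, d = len(words), len(words[0])
    zero = [0.0] * d
    # Wide convolution: positions outside the sentence are zero embeddings.
    padded = [zero] * (l - 1) + words + [zero] * (l - 1)
    p = []
    for start in range(s + l - 1):
        # c_i is the concatenation of l consecutive word embeddings.
        c_i = [x for w in padded[start:start + l] for x in w]
        p.append(math.tanh(sum(vj * cj for vj, cj in zip(v, c_i)) + b))
    return p


# Toy sentence with s = 3 words and d = 2 (made-up values).
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
p = wide_convolution(words, v=[0.5, 0.5, 0.5, 0.5], b=0.0, l=2)
print(len(p))  # prints: 4  (s + l - 1)
```

Because of the zero padding, the first and last words appear in every position of the filter window, which is the property wide convolution is meant to provide.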
As in CNNs for object recognition, to increase the number of kernels of a certain layer, multiple feature maps may be computed in parallel at the same layer. Further, to increase the size diversity of kernels in the same layer, more feature maps containing various-range dependency features can be learned. We denote a feature map of the $i$-th layer by $\mathbf{F}_i$, and assume that $n$ feature maps in total exist in layer $i-1$: $\mathbf{F}^1_{i-1}, \ldots, \mathbf{F}^n_{i-1}$. Considering a specific filter size $l$ in layer $i$, each feature map $\mathbf{F}^j_{i,l}$ is computed by convolving a distinct set of filters of size $l$, arranged in a matrix $\mathbf{V}^{j,k}_{i,l}$, with each feature map $\mathbf{F}^k_{i-1}$ and summing the results:
$$\mathbf{F}^j_{i,l} = \sum_{k=1}^{n} \mathbf{V}^{j,k}_{i,l} * \mathbf{F}^k_{i-1}$$
where $*$ indicates the convolution operation and $j$ is the index of a feature map in layer $i$. The weights in $\mathbf{V}$ form a rank 4 tensor. Note that we use wide convolution in this work: word representations $\mathbf{w}_g$ for $g \le 0$ or $g \ge s+1$ are zero embeddings. Wide convolution ensures that each word can be detected by all filter weights in $\mathbf{V}$.
In Figure 1, the first convolution layer deals with an input with n = 5 feature maps.¹ Its filters have sizes 3 and 5 respectively (i.e., l = 3, 5), and each filter has j = 3 kernels. This means this convolution layer can detect three kinds of features of phrases with length 3 and 5, respectively. DCNN (Kalchbrenner et al., 2014) used one-dimensional convolution: each higher-order feature is produced from values of a single dimension in the lower-layer feature map. Even though that work proposed a folding operation to model the dependencies between adjacent dimensions, this type of dependency modeling is still limited. In contrast, convolution in the present work is able to model dependency across dimensions as well as adjacent words, which obviates the need for a folding step. This change also means our model has substantially fewer parameters than the DCNN since the output of each convolution layer is smaller by a factor of d.
¹ A reviewer expresses surprise at such a small number of maps. However, we will use four variable sizes (see below), so that the overall number of maps is 20. We use a small number of maps partly because training times for a network are on the order of days, so limiting the number of parameters is important.
Dynamic k-max Pooling. Kalchbrenner et al. (2014) pool the k most active features, in contrast to simple max (1-max) pooling (Collobert and Weston, 2008). This property makes it possible to connect multiple convolution layers to form a deep architecture that extracts high-level abstract features. In this work, we directly use it to extract features for variable-size feature maps. For a given feature map in layer $i$, dynamic k-max pooling extracts the $k_i$ top values from each dimension and the $k_{top}$ top values in the top layer. We set
$$k_i = \max\left(k_{top}, \left\lceil \frac{L-i}{L}\, s \right\rceil\right)$$
where $i \in \{1, 2, \ldots, L\}$ is the order of the convolution layer from bottom to top in Figure 1; $L$ is the total number of convolution layers; $k_{top}$ is a constant determined empirically, and we set it to 4 as in Kalchbrenner et al. (2014). As a result, the second convolution layer in Figure 1 has an input with two same-size feature maps, one resulting from filter size 3 and one from filter size 5. The values in the two feature maps are for phrases with different granularity. The motivation for this convolution layer is that a feature reflected by a short phrase may not be trustworthy while the longer phrase containing it is, or the long phrase may have no trustworthy feature while its component short phrase is more reliable. This and even higher-order convolution layers can therefore make a trade-off between the features of different granularity.
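The pooling step can be sketched as follows, assuming the rule k_i = max(k_top, ⌈(L − i)/L · s⌉) from the dynamic k-max pooling work that this section builds on, and assuming that the selected values keep their original order in the sequence. The function names are hypothetical.

```python
import math


def dynamic_k(i, L, s, k_top=4):
    """Pooling size for convolution layer i (1-based) out of L layers,
    for a sentence of length s (assumed rule, see lead-in)."""
    return max(k_top, math.ceil((L - i) * s / L))


def k_max_pool(values, k):
    """Keep the k largest values of one dimension, preserving their order."""
    if k >= len(values):
        return list(values)
    # Indices of the k largest values, then restored to sequence order.
    top = sorted(range(len(values)), key=lambda j: values[j], reverse=True)[:k]
    return [values[j] for j in sorted(top)]


print(dynamic_k(1, L=2, s=12))                      # prints: 6
print(dynamic_k(2, L=2, s=12))                      # prints: 4
print(k_max_pool([0.1, 0.9, 0.3, 0.7, 0.5], k=3))   # prints: [0.9, 0.7, 0.5]
```

Because every filter size is pooled to the same k_i, feature maps produced by filters of different widths end up with the same length, which is what allows them to be fed jointly into the next convolution layer.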
Hidden Layer. On top of the final k-max pooling, we stack a fully connected layer to learn a sentence representation with a given dimension (e.g., d).
Logistic Regression Layer. Finally, the sentence representation is forwarded into a logistic regression layer for classification.
In brief, our MVCNN model learns from Kalchbrenner et al. (2014) to use dynamic k-max pooling to stack multiple convolution layers, and gets insight from Kim (2014) to investigate variable-size filters in a convolution layer. Compared to Kalchbrenner et al. (2014), MVCNN has rich feature maps as input and as output of each convolution layer. Its convolution operation is not only more flexible in extracting features of variable-range phrases, but also able to model dependency among all dimensions of representations. MVCNN extends the network of Kim (2014) by a hierarchical convolution architecture and further exploration of multichannel and variable-size feature detectors.

Model Enhancements
This part introduces two training tricks that enhance the performance of MVCNN in practice.
Mutual-Learning of Embedding Versions. One observation in using multiple embedding versions is that they have different vocabulary coverage. An unknown word in an embedding version may be a known word in another version. Thus, there exists a proportion of words that can only be partially initialized by certain versions of word embeddings, which means these words lack the description from other versions.
To alleviate this problem, we design a mutual-learning regime to predict representations of unknown words for each embedding version by learning projections between versions. As a result, all embedding versions have the same vocabulary. This processing ensures that more words in each embedding version receive a good representation, and is expected to give most words occurring in a classification dataset more comprehensive initialization (as opposed to just being randomly initialized).
Let $c$ be the number of embedding versions in consideration, $V_1, V_2, \ldots, V_i, \ldots, V_c$ their vocabularies, $V^* = \cup_{i=1}^{c} V_i$ their union, and $V_i^- = V^* \setminus V_i$ ($i = 1, \ldots, c$) the vocabulary of unknown words for embedding version $i$. Our goal is to learn embeddings for the words in $V_i^-$ by knowledge from the other $c - 1$ embedding versions.
We use the overlapping vocabulary between $V_i$ and $V_j$, denoted as $V_{ij}$, as training set, formalizing a projection $f_{ij}$ from space $V_i$ to space $V_j$ ($i \neq j$; $i, j \in \{1, 2, \ldots, c\}$) as follows:
$$\hat{\mathbf{w}}_j = \mathbf{M}_{ij} \mathbf{w}_i \qquad (4)$$
where $\mathbf{M}_{ij} \in \mathbb{R}^{d \times d}$, $\mathbf{w}_i \in \mathbb{R}^d$ denotes the representation of word $w$ in space $V_i$ and $\hat{\mathbf{w}}_j$ is the projected (or learned) representation of word $w$ in space $V_j$. The squared error between $\mathbf{w}_j$ and $\hat{\mathbf{w}}_j$ is the training loss to minimize. We use $\hat{\mathbf{w}}_j = f_{ij}(\mathbf{w}_i)$ to reformat Equation 4. Totally $c(c-1)/2$ projections $f_{ij}$ are trained, each on the vocabulary intersection $V_{ij}$. Let $w$ be a word that is unknown in $V_i$, but is known in $V_1, V_2, \ldots, V_k$. To compute an embedding for $w$ in $V_i$, we first compute the $k$ projections $f_{1i}(\mathbf{w}_1), f_{2i}(\mathbf{w}_2), \ldots, f_{ki}(\mathbf{w}_k)$ from the source spaces $V_1, V_2, \ldots, V_k$ to the target space $V_i$. Then, the element-wise average of $f_{1i}(\mathbf{w}_1), f_{2i}(\mathbf{w}_2), \ldots, f_{ki}(\mathbf{w}_k)$ is treated as the representation of $w$ in $V_i$. Our motivation is that, assuming there is a true representation of $w$ in $V_i$ (e.g., the one we would have obtained by training embeddings on a much larger corpus) and assuming the projections were learned well, we would expect all the projected vectors to be close to the true representation. Also, each source space contributes potentially complementary information. Hence averaging them is a balance of knowledge from all source spaces.
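The projection-and-average procedure can be sketched on toy data as follows. This is only an illustration under stated assumptions: the linear projection is fit by plain stochastic gradient descent on the squared error (the optimizer is not specified in the text), and the embedding spaces, the helper names `train_projection` and `fill_unknown`, and the word "x" are made up.

```python
def matvec(M, w):
    """Multiply a d x d matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, w)) for row in M]


def train_projection(pairs, d, lr=0.1, epochs=500):
    """Fit M so that M w_i is close to w_j for all (w_i, w_j) pairs,
    minimizing squared error with simple SGD (optimizer is an assumption)."""
    M = [[0.0] * d for _ in range(d)]
    for _ in range(epochs):
        for w_i, w_j in pairs:
            err = [p - t for p, t in zip(matvec(M, w_i), w_j)]
            for r in range(d):
                for c in range(d):
                    M[r][c] -= lr * err[r] * w_i[c]
    return M


def fill_unknown(word_vectors, projections):
    """Element-wise average of the projections from every source space."""
    projected = [matvec(projections[k], w_k) for k, w_k in word_vectors.items()]
    d = len(projected[0])
    return [sum(p[n] for p in projected) / len(projected) for n in range(d)]


# Toy spaces with d = 2 (hypothetical). Word "x" is unknown in space0.
space0 = {"a": [2.0, 0.0], "b": [0.0, 2.0], "c": [2.0, 2.0]}
space1 = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0], "x": [3.0, 1.0]}
space2 = {"a": [0.0, 2.0], "b": [2.0, 0.0], "c": [2.0, 2.0], "x": [2.0, 6.0]}
sources = {1: space1, 2: space2}

projections = {}
for k, src in sources.items():
    shared = sorted(set(src) & set(space0))   # overlapping vocabulary
    projections[k] = train_projection([(src[w], space0[w]) for w in shared], d=2)

x_hat = fill_unknown({k: src["x"] for k, src in sources.items()}, projections)
print([round(v, 3) for v in x_hat])  # prints: [6.0, 2.0]
```

In this constructed example both source spaces are exact linear images of the target space, so the two projections agree; with real embeddings the projected vectors would differ and the average acts as a compromise between them.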
As discussed in Section 3, we found that for the binary sentiment classification dataset, many words were unknown in at least one embedding version. But of these words, a total of 5022 words did have coverage in another embedding version and so will benefit from mutual-learning. In the experiments, we will show that this is a very effective method to learn representations for unknown words that increases system performance if learned representations are used for initialization.
Pretraining. Sentence classification systems are usually implemented as supervised training regimes where the training loss is between the true label distribution and the predicted label distribution. In this work, we use pretraining on the unlabeled data of each task and show that it can increase the performance of classification systems. Figure 1 shows our pretraining setup. The "sentence representation" (the output of the "Fully connected" hidden layer) is used to predict the component words ("on" in the figure) in the sentence (instead of predicting the sentence label Y/N as in supervised learning). Concretely, the sentence representation is averaged with representations of some surrounding words ("the", "cat", "sat", "the", "mat", "," in the figure) to predict the middle word ("on").
Given the sentence representation $\mathbf{s} \in \mathbb{R}^d$ and initialized representations of $2t$ context words ($t$ left words and $t$ right words): $\mathbf{w}_{i-t}, \ldots, \mathbf{w}_{i-1}, \mathbf{w}_{i+1}, \ldots, \mathbf{w}_{i+t}$; $\mathbf{w}_i \in \mathbb{R}^d$, we average the total $2t + 1$ vectors element-wise, depicted as the "Average" operation in Figure 1. Then, this resulting vector is treated as a predicted representation of the middle word and is used to find the true middle word by means of noise-contrastive estimation (NCE) (Mnih and Teh, 2012). For each true example, 10 noise words are sampled.
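A minimal sketch of the averaging step and of drawing noise words is given below. The NCE loss itself is omitted. The uniform noise distribution, the fixed seed, and the function names are assumptions made for illustration; the text only states that 10 noise words are sampled per true example.

```python
import random


def predict_middle(sentence_vec, context_vecs):
    """Element-wise average of the sentence vector and 2t context vectors."""
    vectors = [sentence_vec] + list(context_vecs)
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def sample_noise(vocab, target, n=10, seed=0):
    """Draw n noise words different from the target word.

    Uniform sampling is an assumption; the noise distribution is not
    specified in the text.
    """
    rng = random.Random(seed)
    candidates = [w for w in vocab if w != target]
    return [rng.choice(candidates) for _ in range(n)]


# Toy vectors with d = 2 and t = 1 (made-up values).
pred = predict_middle([1.0, 2.0], [[3.0, 4.0], [5.0, 6.0]])
print(pred)  # prints: [3.0, 4.0]

noise = sample_noise(["the", "cat", "sat", "on", "mat"], target="on")
print(len(noise))  # prints: 10
```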
Note that in pretraining, there are three places where each word needs initialization. (i) Each word in the sentence is initialized in the "Multichannel input" layer to the whole network. (ii) Each context word is initialized as input to the average layer ("Average" in the figure). (iii) Each target word is initialized as the output of the "NCE" layer ("on" in the figure). In this work, we use multichannel initialization for case (i) and random initialization for cases (ii) and (iii). Only fine-tuned multichannel representations (case (i)) are kept for subsequent supervised training. The rationale for this pretraining is similar to that of an auto-encoder: for an object composed of smaller-granular elements, the representations of the whole object and its components can learn from each other. The CNN architecture learns sentence features layer by layer, then those features are justified by all constituent words.
During pretraining, all the model parameters, including the multichannel input, convolution parameters and fully connected layer, are updated until they are mature enough to extract the sentence features. Subsequently, the same sets of parameters are fine-tuned for supervised classification tasks.
In sum, this pretraining is designed to produce good initial values for both model parameters and word embeddings. It is especially helpful for pretraining the embeddings of unknown words.

Experiments
We test the network on four classification tasks. We begin by specifying aspects of the implementation and the training of the network. We then report the results of the experiments.

Hyperparameters and Training
In each of the experiments, the top of the network is a logistic regression that predicts the probability distribution over classes given the input sentence. The network is trained to minimize cross-entropy of predicted and true distributions; the objective includes an $L_2$ regularization term over the parameters. The set of parameters comprises the word embeddings, all filter weights and the weights in fully connected layers. A dropout operation is put before the logistic regression layer. The network is trained by back-propagation in mini-batches and the gradient-based optimization is performed using the AdaGrad update rule (Duchi et al., 2011). In all data sets, the initial learning rate is 0.01, dropout probability is 0.8, $L_2$ weight is $5 \cdot 10^{-3}$, and batch size is 50. In each convolution layer, filter sizes are {3, 5, 7, 9} and each filter has five kernels (independent of filter size).
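For readers unfamiliar with AdaGrad, one update step with the hyperparameters listed above can be sketched as follows. The function name, the small constant `eps`, and the convention that the regularizer is (l2 / 2) · ‖θ‖² (so that its gradient is l2 · θ) are assumptions; the paper does not state these details.

```python
import math


def adagrad_step(params, grads, cache, lr=0.01, l2=5e-3, eps=1e-8):
    """One in-place AdaGrad update over flat parameter lists.

    cache holds the running sum of squared gradients per parameter.
    The L2 term contributes l2 * p to each gradient (assumed convention).
    """
    for n, (p, g) in enumerate(zip(params, grads)):
        g = g + l2 * p                       # add regularization gradient
        cache[n] += g * g                    # accumulate squared gradient
        params[n] = p - lr * g / (math.sqrt(cache[n]) + eps)


params, cache = [1.0, -2.0], [0.0, 0.0]
adagrad_step(params, grads=[0.5, 0.0], cache=cache)
print([round(p, 4) for p in params])  # prints: [0.99, -1.99]
```

On the first step the per-parameter scaling makes every nonzero update have magnitude close to the learning rate, which is why both parameters move by roughly 0.01 in this toy example.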

Datasets and Experimental Setup
Stanford Sentiment Treebank (Socher et al., 2013). This small-scale dataset includes two tasks predicting the sentiment of movie reviews. The output variable is binary in one experiment and can have five possible outcomes in the other: {negative, somewhat negative, neutral, somewhat positive, positive}. In the binary case, we use the given split of 6920 training, 872 development and 1821 test sentences. Likewise, in the fine-grained case, we use the standard 8544/1101/2210 split. Socher et al. (2013) used the Stanford Parser (Klein and Manning, 2003) to parse each sentence into subphrases. The subphrases were then labeled by human annotators in the same way as the sentences were labeled. Labeled phrases that occur as subparts of the training sentences are treated as independent training instances as in (Le and Mikolov, 2014).
Sentiment140 (Go et al., 2009). This is a large-scale dataset of tweets for sentiment classification, where a tweet is automatically labeled as positive or negative depending on the emoticon that occurs in it. The training set consists of 1.6 million tweets with emoticon-based labels and the test set of about 400 hand-annotated tweets. We preprocess the tweets minimally as follows. 1) The equivalence class symbol "url" (resp. "username") replaces all URLs (resp. all words that start with the @ symbol, e.g., @thomasss). 2) A sequence of k > 2 repetitions of a letter c (e.g., "cooooooool") is replaced by two occurrences of c (e.g., "cool"). 3) All tokens are lowercased.
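The three tweet preprocessing steps can be sketched with regular expressions as below. The exact patterns (what counts as a URL, and treating a username as "@" followed by word characters at the start of a token) and the function name are assumptions, since the text does not specify them.

```python
import re

# Assumed patterns; the paper does not give the exact rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
USER_RE = re.compile(r"(?<!\S)@\w+")
REPEAT_RE = re.compile(r"([A-Za-z])\1{2,}")   # a letter repeated 3+ times


def preprocess_tweet(text):
    """Apply the three minimal preprocessing steps in the listed order."""
    text = URL_RE.sub("url", text)            # 1) URLs -> "url"
    text = USER_RE.sub("username", text)      # 1) @mentions -> "username"
    text = REPEAT_RE.sub(r"\1\1", text)       # 2) "cooooool" -> "cool"
    return text.lower()                       # 3) lowercase everything


print(preprocess_tweet("@thomasss this is Cooooooool http://t.co/abc"))
# prints: username this is cool url
```

URLs and usernames are replaced before the repeated-letter rule so that letters inside them (such as the "sss" in "@thomasss") are not altered first.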
Subj. The subjectivity classification dataset released by Pang and Lee (2004) has 5000 subjective sentences and 5000 objective sentences. We report the result of 10-fold cross validation, as the baseline systems did.

Pretrained Word Vectors
In this work, we use five embedding versions, as shown in Table 1. The Word2Vec embeddings were trained on the English Gigaword corpus (Parker et al., 2009) with setup: window size 5, negative sampling, sampling rate $10^{-3}$, threads 12. It is worth emphasizing that the above embedding sets are derived from different corpora with different algorithms. This is the very property that we want to make use of to improve system performance.
Table 2 shows the number of unknown words in each task when using the corresponding embedding version for initialization (rows "HLBL", "Huang", "Glove", "SENNA", "W2V"), the number of words fully initialized by all five embedding versions ("Full hit" row), the number of words partially initialized ("Partial hit" row) and the number of words that cannot be initialized by any of the embedding versions ("No hit" row).
About 30% of words in each task have partially initialized embeddings and our mutual-learning is able to initialize the missing embeddings through projections. Pretraining is expected to learn good representations for all words, but pretraining is especially important for words without initialization ("no hit"); a particularly clear example of this is the Senti140 task: 236,484 of 387,877 words, or 61%, are in the "no hit" category.
Table 3 compares test results of MVCNN and its variants with other baselines in the four sentence classification tasks. Row 34, "MVCNN (overall)", shows the performance of the best configuration of MVCNN, optimized on dev. This version uses five versions of word embeddings, four filter sizes (3, 5, 7, 9), both mutual-learning and pretraining, three convolution layers for the Senti140 task and two convolution layers for the other tasks. Overall, our system gets the best results, beating all baselines.
The ablation rows of Table 3 each remove one component from the configuration in row 34: for example, one row shows what happens when HLBL is removed from row 34, row 28 shows what happens when mutual-learning is removed from row 34, etc.
The block "baselines" (1-18) lists some systems representative of previous work on the corresponding datasets, including the state-of-the-art systems (marked in italics). The block "versions" (19-23) shows the results of our system when one of the embedding versions was not used during training. We want to explore to what extent different embedding versions contribute to performance. The block "filters" (24-27) gives the results when an individual filter width is discarded. It also tells us how much a filter of a specific size contributes. The block "tricks" (28-29) shows the system performance when no mutual-learning or no pretraining is used. The block "layers" (30-33) demonstrates how the system performs when it has different numbers of convolution layers.
From the "layers" block, we can see that our system performs best with two layers of convolution in the Stanford Sentiment Treebank and Subjectivity Classification tasks (row 31), but with three layers of convolution in Sentiment140 (row 32). This is probably due to Sentiment140 being a much larger dataset; in such a case deeper neural networks are beneficial.
The block "tricks" demonstrates the effect of mutual-learning and pretraining. Apparently, pretraining has a bigger impact on performance than mutual-learning. We speculate that it is because pretraining can influence more words and all learned word embeddings are tuned on the dataset after pretraining.
The block "filters" indicates the contribution of each filter size. The system benefits from filters of each size. Sizes 5 and 7 are most important for high performance, especially 7 (rows 25 and 26).
In the block "versions", we see that each embedding version is crucial for good performance: performance drops in every single case. Though it is not easy to fairly compare different embedding versions in NLP tasks, especially when those embeddings were trained on different corpora of different sizes using different algorithms, our results are potentially instructive for researchers deciding which embeddings to use for their own tasks.

Conclusion
This work presented MVCNN, a novel CNN architecture for sentence classification. It combines multichannel initialization (diverse versions of pretrained word embeddings are used) and variable-size filters (features of multigranular phrases are extracted with variable-size convolution filters). We demonstrated that multichannel initialization and variable-size filters enhance system performance on sentiment classification and subjectivity classification tasks.

Future Work
As pointed out by the reviewers, the success of the multichannel approach is likely due to a combination of several quite different effects.
First, there is the effect of the embedding learning algorithm. These algorithms differ in many aspects, including in sensitivity to word order (e.g., SENNA: yes, word2vec: no), in objective function and in their treatment of ambiguity (explicitly modeled only by Huang et al. (2012)).
Second, there is the effect of the corpus. We would expect the size and genre of the corpus to have a big effect even though we did not analyze this effect in this paper.
Third, complementarity of word embeddings is likely to be more useful for some tasks than for others. Sentiment is a good application for complementary word embeddings because solving this task requires drawing on heterogeneous sources of information, including syntax, semantics and genre as well as the core polarity of a word. Other tasks like part of speech (POS) tagging may benefit less from heterogeneity since the benefit of embeddings in POS often comes down to making a correct choice between two alternatives; a single embedding version may be sufficient for this.
We plan to pursue these questions in future work.