Molding CNNs for text: non-linear, non-consecutive convolutions

The success of deep learning often derives from well-chosen operational building blocks. In this work, we revise the temporal convolution operation in CNNs to better adapt it to text processing. Instead of concatenating word representations, we appeal to tensor algebra and use low-rank n-gram tensors to directly exploit interactions between words already at the convolution stage. Moreover, we extend the n-gram convolution to non-consecutive words to recognize patterns with intervening words. Through a combination of low-rank tensors and pattern weighting, we can efficiently evaluate the resulting convolution operation via dynamic programming. We test the resulting architecture on standard sentiment classification and news categorization tasks. Our model achieves state-of-the-art performance in terms of both accuracy and training speed. For instance, we obtain 51.2% accuracy on the fine-grained sentiment classification task.


Introduction
Deep learning methods, and convolutional neural networks (CNNs) among them, have become the de facto top-performing techniques across a range of NLP tasks such as sentiment classification, question-answering, and semantic parsing. As methods, they require only limited domain knowledge to reach respectable performance with increasing data and computation, yet permit easy architectural and operational variations so as to fine-tune them to specific applications to reach top performance. Indeed, their success is often contingent on specific architectural and operational choices.

CNNs for text applications make use of temporal convolution operators or filters. Similar to image processing, they are applied at multiple resolutions, interspersed with non-linearities and pooling. The convolution operation itself is a linear mapping over "n-gram vectors" obtained by concatenating consecutive word (or character) representations. We argue that this basic building block can be improved in two important respects. First, the power of n-grams derives precisely from multi-way interactions, and these are clearly missed (initially) with linear operations on stacked n-gram vectors. Non-linear interactions within a local context have been shown to improve empirical performance in various tasks (Mitchell and Lapata, 2008; Kartsaklis et al., 2012; Socher et al., 2013). Second, many useful patterns are expressed as non-consecutive phrases, such as semantically close multi-word expressions (e.g., "not that good", "not nearly as good"). In typical CNNs, such expressions would have to come together and emerge as useful patterns after several layers of processing.

Our code and data are available at https://github.com/taolei87/text_convnet
We propose to use a feature mapping operation based on tensor products instead of linear operations on stacked vectors. This enables us to directly tap into non-linear interactions between adjacent word feature vectors (Socher et al., 2013; Lei et al., 2014). To offset the accompanying parametric explosion, we maintain a low-rank representation of the tensor parameters. Moreover, we show that this feature mapping can be applied to all possible non-consecutive n-grams in the sequence with an exponentially decaying weight depending on the length of the span. Owing to the low-rank representation of the tensor, this operation can be performed efficiently in linear time with respect to the sequence length via dynamic programming. Similar to traditional convolution operations, our non-linear feature mapping can be applied successively at multiple levels.
We evaluate the proposed architecture in the context of sentence sentiment classification and news categorization. On the Stanford Sentiment Treebank dataset, our model obtains state-of-the-art performance among a variety of neural networks in terms of both accuracy and training cost. Our model achieves 51.2% accuracy on fine-grained classification and 88.6% on binary classification, outperforming the best published numbers obtained by a deep recursive model (Tai et al., 2015) and a convolutional model (Kim, 2014). On the Chinese news categorization task, our model achieves 80.0% accuracy, while the closest baseline achieves 79.2%.
Our model most closely relates to convolutional neural networks. Since these models were originally developed for computer vision (LeCun et al., 1998), their application to NLP tasks introduced a number of modifications. For instance, Collobert et al. (2011) use the max-over-time pooling operation to aggregate the features over the input sequence. This variant has been successfully applied to semantic parsing (Yih et al., 2014) and information retrieval (Shen et al., 2014; Gao et al., 2014). Kalchbrenner et al. (2014) instead propose a (dynamic) k-max pooling operation for modeling sentences. In addition, Kim (2014) combines CNNs of different filter widths and either static or fine-tuned word vectors. In contrast to the traditional CNN models, our method considers non-consecutive n-grams, thereby expanding the representation capacity of the model. Moreover, our model captures non-linear interactions within n-gram snippets through the use of tensors, moving beyond the direct linear projection operator used in standard CNNs. As our experiments demonstrate, these advancements result in improved performance.

Background
Let $x \in \mathbb{R}^{L \times d}$ be the input sequence, such as a document or sentence. Here $L$ is the length of the sequence and each $x_i \in \mathbb{R}^d$ is a vector representing the $i$-th word. The (consecutive) n-gram vector ending at position $j$ is obtained by simply concatenating the corresponding word vectors,

$$v_j = [x_{j-n+1}; x_{j-n+2}; \cdots; x_j]$$

Out-of-index words are simply set to all zeros.
The traditional convolution operator is parameterized by a filter matrix $m \in \mathbb{R}^{nd \times h}$, which can be thought of as $n$ smaller filter matrices applied to each $x_i$ in vector $v_j$. The operator maps each n-gram vector $v_j$ in the input sequence to $m^\top v_j \in \mathbb{R}^h$, so that the input sequence $x$ is transformed into a sequence of feature representations,

$$[m^\top v_1, \cdots, m^\top v_L]$$

that is, $L$ vectors in $\mathbb{R}^h$. The resulting feature values are often passed through non-linearities such as the hyperbolic tangent (element-wise), as well as aggregated or reduced by "sum-over" or "max-pooling" operations for later (similar) stages of processing.
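The baseline operation described above can be sketched in a few lines of NumPy. This is only an illustration under assumed toy dimensions; the helper names `ngram_vectors` and `linear_convolution` are hypothetical and do not come from the released code.

```python
import numpy as np

def ngram_vectors(x, n):
    """Return v_j = [x_{j-n+1}; ...; x_j] for every position j (zero-padded)."""
    L, d = x.shape
    # Out-of-index words are set to all zeros, as in the text.
    padded = np.vstack([np.zeros((n - 1, d)), x])
    return np.stack([padded[j:j + n].reshape(-1) for j in range(L)])

def linear_convolution(x, m, n):
    """Map each n-gram vector v_j to m^T v_j, giving an L x h feature sequence."""
    return ngram_vectors(x, n) @ m

rng = np.random.default_rng(0)
L, d, n, h = 5, 4, 3, 6           # toy sizes chosen for illustration
x = rng.normal(size=(L, d))       # input sequence of word vectors
m = rng.normal(size=(n * d, h))   # filter matrix

features = np.tanh(linear_convolution(x, m, n))  # element-wise non-linearity
pooled = features.max(axis=0)                    # max-pooling over positions
```

Each row of `features` depends only linearly (before the `tanh`) on the stacked word vectors, which is the limitation the tensor-based mapping is meant to address.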
The overall architecture can be easily modified by replacing the basic n-gram vectors and the convolution operation with other feature mappings. Indeed, we appeal to tensor algebra to introduce a non-linear feature mapping that operates on non-consecutive n-grams.

Model
N-gram tensor Typical n-gram feature mappings, where concatenated word vectors are mapped linearly to feature coordinates, may be insufficient to directly capture relevant information in the n-gram. As a remedy, we replace concatenation with a tensor product. Consider a 3-gram $(x_1, x_2, x_3)$ and the corresponding tensor product $x_1 \otimes x_2 \otimes x_3$. The tensor product is a 3-way array of coordinate interactions such that each $ijk$ entry of the tensor is given by the product of the corresponding coordinates of the word vectors,

$$(x_1 \otimes x_2 \otimes x_3)_{ijk} = x_{1i} \cdot x_{2j} \cdot x_{3k}$$

Here $\otimes$ denotes the tensor product operator and $x_{1i}$ is the $i$-th coordinate of $x_1$. The tensor product of a 2-gram analogously gives a two-way array or matrix $x_1 \otimes x_2 \in \mathbb{R}^{d \times d}$. The n-gram tensor can be seen as a direct generalization of the typical concatenated vector (see footnote 2).
Tensor-based feature mapping Since each n-gram in the sequence is now expanded into a high-dimensional tensor using tensor products, the set of filters is analogously maintained as high-order tensors. In other words, our filters are linear mappings over the higher dimensional interaction terms rather than the original word coordinates.
Consider again mapping a 3-gram $(x_1, x_2, x_3)$ into a feature representation. Each filter is a 3-way tensor with dimensions $d \times d \times d$. The set of $h$ filters, denoted as $T$, is a 4-way tensor of dimension $d \times d \times d \times h$, where each $d^3$ slice of $T$ represents a single filter and $h$ is the number of such filters, i.e., the feature dimension. The resulting $h$-dimensional feature representation $z \in \mathbb{R}^h$ for the 3-gram $(x_1, x_2, x_3)$ is obtained by multiplying the filter $T$ and the 3-gram tensor as follows. The $l$-th coordinate of $z$ is given by

$$z_l = \sum_{i,j,k} T_{ijkl} \cdot (x_1 \otimes x_2 \otimes x_3)_{ijk} = \sum_{i,j,k} T_{ijkl} \cdot x_{1i} \cdot x_{2j} \cdot x_{3k} \qquad (1)$$

The formula is equivalent to summing over all the third-order polynomial interaction terms, where tensor $T$ stores the coefficients.

Directly maintaining the filters as full tensors leads to parametric explosion. Indeed, the size of the tensor $T$ (i.e., $h \times d^n$) would be too large even for typical low-dimensional word vectors where, e.g., $d = 300$. To address this, we assume a low-rank factorization of the tensor $T$, represented in the Kruskal form. Specifically, $T$ is decomposed into a sum of $h$ rank-1 tensors

$$T = \sum_{i=1}^{h} P_i \otimes Q_i \otimes R_i \otimes O_i$$

where $P, Q, R \in \mathbb{R}^{h \times d}$ and $O \in \mathbb{R}^{h \times h}$ are four smaller parameter matrices. $P_i$ (similarly $Q_i$, $R_i$ and $O_i$) denotes the $i$-th row of the matrix. Note that, for simplicity, we have assumed that the number of rank-1 components in the decomposition is equal to the feature dimension $h$. Plugging the low-rank factorization into Eq. (1), the feature mapping can be rewritten in a vector form as

$$z = O^\top (P x_1 \odot Q x_2 \odot R x_3) \qquad (2)$$

where $\odot$ is the element-wise product such that, e.g., $(a \odot b)_k = a_k \times b_k$ for $a, b \in \mathbb{R}^m$. Note that while $P x_1$ (similarly $Q x_2$ and $R x_3$) is a linear mapping from each word $x_1$ (similarly $x_2$ and $x_3$) into an $h$-dimensional feature space, higher-order terms arise from the element-wise products.

Footnote 2: To see that the n-gram tensor generalizes the concatenated vector, consider word vectors with a "bias" term, $\hat{x}_i = [x_i; 1]$. The tensor product of $n$ such vectors includes the concatenated vector as a subset of tensor entries but, in addition, contains all interaction terms up to $n$-th order.
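The equivalence between the full-tensor mapping of Eq. (1) and the low-rank form of Eq. (2) can be checked numerically. The following is a small sketch with assumed toy dimensions; it builds the full tensor explicitly only because `d` is tiny here, which is exactly what the factorization avoids in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 4, 3                                     # toy sizes chosen for illustration
P, Q, R = (rng.normal(size=(h, d)) for _ in range(3))
O = rng.normal(size=(h, h))
x1, x2, x3 = rng.normal(size=(3, d))            # one 3-gram of word vectors

# Full filter tensor T = sum_r P_r (x) Q_r (x) R_r (x) O_r, shape d x d x d x h.
T = np.einsum("ra,rb,rc,rl->abcl", P, Q, R, O)

# Eq. (1): z_l = sum_{ijk} T_{ijkl} * x1_i * x2_j * x3_k
z_full = np.einsum("abcl,a,b,c->l", T, x1, x2, x3)

# Eq. (2): z = O^T (P x1 * Q x2 * R x3), with * the element-wise product
z_low_rank = O.T @ ((P @ x1) * (Q @ x2) * (R @ x3))

# Parameter counts: full tensor versus the four factor matrices.
full_params = h * d ** 3
low_rank_params = 3 * h * d + h * h
```

For `d = 300` the full tensor would need `h * 300**3` entries per layer, whereas the factor matrices need only `3 * h * 300 + h * h`.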
Non-consecutive n-gram features Traditional convolution uses consecutive n-grams in the feature map. Non-consecutive n-grams may nevertheless be helpful, since phrases such as "not good", "not so good" and "not nearly as good" express similar sentiments but involve variable spacings between the key words. Variable spacings are not effectively captured by fixed n-grams.
We apply the feature mapping in a weighted manner to all n-grams, thereby gaining access to patterns such as "not ... good". Let $z[i, j, k] \in \mathbb{R}^h$ denote the feature representation corresponding to a 3-gram $(x_i, x_j, x_k)$ of words in positions $i$, $j$, and $k$ along the sequence. This vector is calculated analogously to Eq. (2),

$$z[i, j, k] = O^\top (P x_i \odot Q x_j \odot R x_k)$$

We will aggregate these vectors into an $h$-dimensional feature representation at each position in the sequence. The idea is similar to neural bag-of-words models, where the feature representation for a document or sentence is obtained by averaging (or summing) all the word vectors. In our case, we define the aggregate representation $z_3[k]$ in position $k$ as the weighted sum of all 3-gram feature representations ending at position $k$, i.e.,

$$z_3[k] = \sum_{i<j<k} z[i, j, k] \cdot \lambda^{(k-j-1)+(j-i-1)} \qquad (3)$$

where $\lambda \in [0, 1)$ is a decay factor that down-weights 3-grams with longer spans (i.e., 3-grams that skip more in-between words). As $\lambda \to 0$, all non-consecutive 3-grams are omitted, and the model acts like a traditional model with only consecutive n-grams. When $\lambda > 0$, however, $z_3[k]$ is a weighted sum of many 3-grams with variable spans.
Aggregating features via dynamic programming Directly calculating $z_3[\cdot]$ according to Eq. (3) by enumerating all 3-grams would require $O(L^3)$ feature-mapping operations. We can, however, evaluate the features more efficiently by relying on the associative and distributive properties of the feature operation in Eq. (2). Let $f_3[k]$ be a dynamic programming table representing the sum of 3-gram feature representations before multiplying with matrix $O$. That is,

$$f_3[k] = \sum_{i<j<k} \lambda^{(k-i-2)} \cdot (P x_i \odot Q x_j \odot R x_k)$$

We can analogously define $f_1[i]$ and $f_2[j]$ for 1-grams and 2-grams,

$$f_1[i] = P x_i, \qquad f_2[j] = \sum_{i<j} \lambda^{(j-i-1)} \cdot (P x_i \odot Q x_j)$$

These dynamic programming tables can be calculated recursively according to the following formulas:

$$f_1[i] = P x_i$$
$$s_1[i] = \lambda \cdot s_1[i-1] + f_1[i]$$
$$f_2[j] = s_1[j-1] \odot Q x_j$$
$$s_2[j] = \lambda \cdot s_2[j-1] + f_2[j]$$
$$f_3[k] = s_2[k-1] \odot R x_k$$
$$z[k] = O^\top (f_1[k] + f_2[k] + f_3[k])$$

where $s_1[\cdot]$ and $s_2[\cdot]$ are two auxiliary tables. The resulting $z[\cdot]$ is the sum of 1-, 2-, and 3-gram features. We found that aggregating the 1-, 2- and 3-gram features in this manner works better than using 3-gram features alone. Overall, the n-gram feature aggregation can be performed in $O(Ln)$ matrix multiplication/addition operations, and remains linear in the sequence length.
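To make the recursion concrete, the sketch below implements the dynamic program next to a brute-force enumeration of all 1-, 2- and 3-grams with decay weights, so the two can be compared. It is an illustrative NumPy version under assumed toy sizes, not the released Theano code; the function names are hypothetical.

```python
import numpy as np

def aggregate_dp(x, P, Q, R, O, lam):
    """Linear-time aggregation using the auxiliary tables s1 and s2."""
    L, h = x.shape[0], P.shape[0]
    s1, s2 = np.zeros(h), np.zeros(h)     # s_1[k-1] and s_2[k-1]
    z = np.zeros((L, h))
    for k in range(L):
        f1 = P @ x[k]
        f2 = s1 * (Q @ x[k])              # all 2-grams ending at k
        f3 = s2 * (R @ x[k])              # all 3-grams ending at k
        s1 = lam * s1 + f1                # advance tables to position k
        s2 = lam * s2 + f2
        z[k] = O.T @ (f1 + f2 + f3)
    return z

def aggregate_brute_force(x, P, Q, R, O, lam):
    """Reference version that enumerates every n-gram explicitly."""
    L, h = x.shape[0], P.shape[0]
    z = np.zeros((L, h))
    for k in range(L):
        total = P @ x[k]
        for j in range(k):
            total += lam ** (k - j - 1) * (P @ x[j]) * (Q @ x[k])
            for i in range(j):
                total += (lam ** (k - i - 2)
                          * (P @ x[i]) * (Q @ x[j]) * (R @ x[k]))
        z[k] = O.T @ total
    return z

rng = np.random.default_rng(0)
L, d, h = 6, 4, 3                         # toy sizes chosen for illustration
x = rng.normal(size=(L, d))
P, Q, R = (rng.normal(size=(h, d)) for _ in range(3))
O = rng.normal(size=(h, h))

z_dp = aggregate_dp(x, P, Q, R, O, lam=0.5)
z_bf = aggregate_brute_force(x, P, Q, R, O, lam=0.5)
```

The brute-force version performs a cubic number of feature mappings per sequence, while the dynamic program touches each position once.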
The overall architecture The dynamic programming algorithm described above maps the original input sequence to a sequence of feature representations $z = z[1 : L] \in \mathbb{R}^{L \times h}$. As in standard convolutional architectures, the resulting sequence can be used in multiple ways. One can aggregate it directly and feed it to a classifier, or expose it to non-linear element-wise transformations and use it as an input to another sequence-to-sequence feature mapping.
The simplest strategy (adopted in neural bag-of-words models) would be to average the feature representations and pass the resulting averaged vector directly to a softmax output unit,

$$\tilde{y} = \mathrm{softmax}\left( W^\top \frac{1}{L} \sum_{i=1}^{L} z[i] \right)$$

Our architecture, as illustrated in Figure 1, includes two additional refinements. First, we add a non-linear activation function after each feature representation, i.e., $z'[i] = \mathrm{ReLU}(z[i] + b)$, where $b$ is a bias vector and ReLU is the rectified linear unit function. Second, we stack multiple tensor-based feature mapping layers. That is, the input sequence $x$ is first processed into a feature sequence and passed through the non-linear transformation to obtain $z^{(1)}$. The resulting feature sequence $z^{(1)}$ is then analogously processed by another layer, parameterized by a different set of feature-mapping matrices $P, \cdots, O$, to obtain a higher-level feature sequence $z^{(2)}$, and so on. The output feature representations of all these layers are averaged within each layer and concatenated as shown in Figure 1. The final prediction is therefore obtained on the basis of features across the levels.

The model is trained by gradient descent using the AdaGrad algorithm (Duchi et al., 2011).
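The stacked architecture (feature mapping, ReLU, per-layer averaging, concatenation, softmax) can be sketched as a forward pass. This is a hypothetical NumPy illustration: the helper names, the toy sizes, the value of `lam`, and the random parameter values are all assumptions made for the example, and training is not shown.

```python
import numpy as np

def feature_layer(x, P, Q, R, O, b, lam):
    """One tensor-based layer: aggregate 1/2/3-gram features, then ReLU."""
    h = P.shape[0]
    s1, s2 = np.zeros(h), np.zeros(h)
    out = []
    for x_k in x:
        f1 = P @ x_k
        f2 = s1 * (Q @ x_k)
        f3 = s2 * (R @ x_k)
        s1, s2 = lam * s1 + f1, lam * s2 + f2
        out.append(np.maximum(O.T @ (f1 + f2 + f3) + b, 0.0))
    return np.array(out)

def softmax(a):
    e = np.exp(a - a.max())               # shift for numerical stability
    return e / e.sum()

def predict(x, layers, W, lam=0.5):
    """Stack layers, average features within each layer, concatenate, classify."""
    pooled = []
    for params in layers:
        x = feature_layer(x, *params, lam)   # output feeds the next layer
        pooled.append(x.mean(axis=0))
    return softmax(W.T @ np.concatenate(pooled))

def make_layer(rng, d_in, h):
    """Random toy parameters (illustrative values only)."""
    P, Q, R = (rng.uniform(-1, 1, size=(h, d_in)) for _ in range(3))
    O = rng.uniform(-1, 1, size=(h, h))
    return P, Q, R, O, np.full(h, 0.01)

rng = np.random.default_rng(0)
L, d, h, n_classes = 7, 4, 3, 5
layers = [make_layer(rng, d, h), make_layer(rng, h, h)]
W = 0.1 * rng.normal(size=(2 * h, n_classes))   # non-zero so output is not uniform
probs = predict(rng.normal(size=(L, d)), layers, W)
```

Because each layer's averaged features are concatenated before the softmax, the classifier sees both low-level and higher-level features at once.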

Learning the Model
Initialization We initialize matrices $P, Q, R$ from the uniform distribution $U\left[-\sqrt{3/d}, \sqrt{3/d}\right]$ and similarly $O \sim U\left[-\sqrt{3/h}, \sqrt{3/h}\right]$. In this way, each row of the matrices is a unit vector in expectation, and each rank-1 filter slice has unit variance as well,

$$\mathbb{E}\left[ \| P_i \otimes Q_i \otimes R_i \otimes O_i \|^2 \right] = 1$$

In addition, the parameter matrix $W$ in the softmax output layer is initialized as zeros, and the bias vectors $b$ for ReLU activation units are initialized to a small positive constant 0.01.
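The unit-norm property follows from the variance of a uniform variable on $[-a, a]$, which is $a^2/3$: with $a = \sqrt{3/d}$ each entry has variance $1/d$, so a row of $d$ entries has expected squared norm 1. The sketch below checks this empirically; the helper name `init_uniform`, the number of classes, and the sizes are assumptions for illustration.

```python
import numpy as np

def init_uniform(rng, rows, cols):
    """Sample a rows x cols matrix from U[-sqrt(3/cols), sqrt(3/cols)]."""
    bound = np.sqrt(3.0 / cols)
    return rng.uniform(-bound, bound, size=(rows, cols))

rng = np.random.default_rng(0)
d, h, n_classes = 300, 200, 5        # sizes chosen for illustration

P = init_uniform(rng, h, d)          # Q and R would be sampled the same way
O = init_uniform(rng, h, h)
W = np.zeros((h, n_classes))         # softmax weights start at zero
b = np.full(h, 0.01)                 # small positive ReLU bias

# Average squared row norm; should be close to 1 for both matrices.
mean_sq_norm_P = np.mean(np.sum(P ** 2, axis=1))
mean_sq_norm_O = np.mean(np.sum(O ** 2, axis=1))
```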

Regularization
We apply two common techniques to avoid overfitting during training. First, we add L2 regularization to all parameter values with the same regularization weight. In addition, we apply dropout (Hinton et al., 2012) to units of the output feature representations $z^{(i)}$ at each level.

Experimental Setup
Datasets We evaluate our model on a sentence sentiment classification task and a news categorization task. For sentiment classification, we use the Stanford Sentiment Treebank benchmark (Socher et al., 2013). The dataset consists of 11855 parsed English sentences annotated at both the root (i.e., sentence) level and the phrase level using 5-class fine-grained labels. We use the standard 8544/1101/2210 split for training, development and testing respectively. Following previous work, we also evaluate our model on the binary classification variant of this benchmark, ignoring all neutral sentences. The binary version has 6920/872/1821 sentences for training, development and testing.
For the news categorization task, we evaluate on the Sogou Chinese news corpora. The dataset contains 10 different news categories in total, such as Finance, Sports, Technology and Automobile. We use 79520 documents for training, 9940 for development and 9940 for testing. To obtain Chinese word boundaries, we use LTP-Cloud, an open-source Chinese NLP platform.
Baselines We implement the standard SVM method and the neural bag-of-words model NBoW as baseline methods in both tasks. To assess the proposed tensor-based feature map, we also implement a convolutional neural network model CNN by replacing our filter with a traditional linear filter. The rest of the framework (such as feature averaging and concatenation) remains the same.
In addition, we compare our model with a wide range of top-performing models on the sentence sentiment classification task. Most of these models fall into either the category of recursive neural networks (RNNs) or the category of convolutional neural networks (CNNs). In Table 1, the top block lists recursive neural network models, the second block lists convolutional network models, and the third block contains other baseline methods, including the paragraph-vector model (Le and Mikolov, 2014), the deep averaging network model (Iyyer et al., 2015) and our implementation of neural bag-of-words. The training time of baseline methods is taken from Iyyer et al. (2015) or directly from the authors. For our implementations, timings were performed on a single core of a 2.6GHz Intel i7 processor.
The recursive neural network baselines include the standard RNN (Socher et al., 2011b), RNTN with a small core tensor in the composition function (Socher et al., 2013), the deep recursive model DRNN (Irsoy and Cardie, 2014) and the most recent recursive model using long-short-term-memory units, RLSTM (Tai et al., 2015). These recursive models assume the input sentences are represented as parse trees. As a benefit, they can readily utilize annotations at the phrase level. In contrast, convolutional neural networks are trained at the sequence level, taking the original sequence and its label as training input. Such convolutional baselines include the dynamic CNN with k-max pooling, DCNN (Kalchbrenner et al., 2014), and the multi-channel convolutional model CNN-MC by Kim (2014). To leverage the phrase-level annotations in the Stanford Sentiment Treebank, all phrases and the corresponding labels are added as separate instances when training the sequence models. We follow this strategy and report results with and without phrase annotations.

Word vectors
The word vectors are pre-trained on much larger unannotated corpora to achieve better generalization given the limited amount of training data (Turian et al., 2010). In particular, for the English sentiment classification task, we use the publicly available 300-dimensional GloVe word vectors trained on the Common Crawl with 840B tokens (Pennington et al., 2014). This choice of word vectors follows most recent work, such as DAN (Iyyer et al., 2015) and RLSTM (Tai et al., 2015). For Chinese news categorization, there are no widely used publicly available word vectors. Therefore, we run word2vec (Mikolov et al., 2013) to train 200-dimensional word vectors on 1.6 million Chinese news articles. Both sets of word vectors are normalized to unit norm (i.e., $\|w\|_2^2 = 1$) and are fixed in the experiments without fine-tuning.

Implementation details
The source code is implemented in Python using the Theano library (Bergstra et al., 2010), a flexible linear algebra compiler that can optimize user-specified computations (models) with efficient automatic low-level implementations, including (back-propagated) gradient calculation.

Overall Performance
Table 1 presents the performance of our model and other baseline methods on the Stanford Sentiment Treebank benchmark. Our full model obtains the highest accuracy on both the development and test sets. Specifically, it achieves 51.2% and 88.6% test accuracies on the fine-grained and binary tasks respectively. As shown in Table 2, our model performance is relatively stable: it maintains high accuracy, with around 0.5% standard deviation, under different initializations and dropout rates.
Our full model is also several times faster than other top-performing models. For example, the multi-channel convolutional model (CNN-MC) takes over 2400 seconds per training epoch. In contrast, our full model (with 3 feature layers) takes on average 28 seconds per epoch with only root labels and on average 445 seconds with all labels.
Our results also show that the CNN model, where our feature map is replaced with a traditional linear map, performs worse than our full model. This observation confirms the importance of the proposed non-linear, tensor-based feature mapping. The CNN model also lags behind the DCNN and CNN-MC baselines, since the latter two propose several advancements over the standard CNN.
Table 3 reports the results of SVM, NBoW and our model on the news categorization task. Since the dataset is much larger compared to the sentiment dataset (80K documents vs. 8.5K sentences), the SVM method is a competitive baseline. It achieves 78.5% accuracy compared to 74.4% and [...]. We hypothesize that the difference is due to the nature of the two tasks: the document classification task requires handling fewer compositions or context interactions than sentiment analysis.

Hyperparameter Analysis
We next investigate the impact of hyperparameters on our model performance. We use the models trained on the fine-grained sentiment classification task with only root labels.

Number of layers
We plot the fine-grained sentiment classification accuracies obtained during hyperparameter grid search. Figure 2 illustrates how the number of feature layers impacts the model performance. As shown in the figure, adding higher-level features clearly improves the classification accuracy across various hyperparameter settings and initializations.

Non-consecutive n-gram features
We also analyze the effect of modeling non-consecutive n-grams. Figure 3 splits the model accuracies according to the choice of span decaying factor λ.
Note that when λ = 0, the model applies feature extraction to consecutive n-grams only. As shown in Figure 3, this setting leads to a consistent performance drop. This result confirms the importance of handling non-consecutive n-gram patterns.
Non-linear activation Finally, we verify the effectiveness of the rectified linear unit activation function (ReLU) by comparing it with no activation (or identity activation $f(x) = x$). As shown in Figure 4, our model with ReLU activation generally outperforms its variant without ReLU. The observation is consistent with previous work on convolutional neural networks and other neural network models.

Figure 3: Comparison of our model variations in sentiment classification task when considering consecutive n-grams only (decaying factor λ = 0) and when considering non-consecutive n-grams (λ > 0). Modeling non-consecutive n-gram features leads to better performance.

Example Predictions
Figure 5 gives examples of input sentences and the corresponding predictions of our model in fine-grained sentiment classification. To see how our model captures the sentiment at different local contexts, we apply the learned softmax activation to the extracted features at each position without taking the average. That is, for each index $i$, we obtain the local sentiment $p = \mathrm{softmax}\left( W^\top \left( z^{(1)}[i] \oplus z^{(2)}[i] \oplus z^{(3)}[i] \right) \right)$. We plot the expected sentiment scores $\sum_{s=-2}^{2} s \cdot p(s)$, where a score of 2 means "very positive", 0 means "neutral" and -2 means "very negative". As shown in the figure, our model successfully learns negation and double negation. The model also identifies positive and negative segments appearing in the sentence.

Conclusion
We proposed a feature mapping operator for convolutional neural networks by modeling n-gram interactions based on tensor product and evaluating all non-consecutive n-gram vectors. The associated parameters are maintained as a low-rank tensor, which leads to efficient feature extraction via dynamic programming. The model achieves top performance on standard sentiment classification and document categorization tasks.

Figure 1: Illustration of the model architecture. The input is represented as a matrix where each row is a d-dimensional word vector. Several feature map layers (as described in Section 4) are stacked, mapping the input into different levels of feature representations. The features are averaged within each layer and then concatenated. Finally a softmax layer is applied to obtain the prediction output.

Figure 2: Dev accuracy (x-axis) and test accuracy (y-axis) of independent runs of our model on fine-grained sentiment classification task. Deeper architectures achieve better accuracies.

Table 1: Comparison between our model and other baseline methods on Stanford Sentiment Treebank.

Figure 4: Applying ReLU activation on top of tensor-based feature mapping leads to better performance in sentiment classification task. Runs with no activation are plotted as blue circles.

Figure 5: Example sentences and their sentiments predicted by our model trained with root labels. The predicted sentiment scores at each word position are plotted. Examples (1)-(5) are synthetic inputs, (6) and (7) are two real inputs from the test set. Our model successfully identifies negation, double negation and phrases with different sentiment in one sentence.