Dependency Sensitive Convolutional Neural Networks for Modeling Sentences and Documents

The goal of sentence and document modeling is to accurately represent the meaning of sentences and documents for various Natural Language Processing tasks. In this work, we present Dependency Sensitive Convolutional Neural Networks (DSCNN) as a general-purpose classification system for both sentences and documents. DSCNN hierarchically builds textual representations by processing pretrained word embeddings via Long Short-Term Memory networks and subsequently extracting features with convolution operators. Compared with existing recursive neural models with tree structures, DSCNN does not rely on parsers and expensive phrase labeling, and thus is not restricted to sentence-level tasks. Moreover, unlike other CNN-based models that analyze sentences locally by sliding windows, our system captures both the dependency information within each sentence and relationships across sentences in the same document. Experiment results demonstrate that our approach is achieving state-of-the-art performance on several tasks, including sentiment analysis, question type classification, and subjectivity classification.


Introduction
Sentence and document modeling systems are important for many Natural Language Processing (NLP) applications. The challenge for textual modeling is to capture features for different text units and to perform compositions over variable-length sequences (e.g., phrases, sentences, documents). As a traditional method, the bag-of-words model treats sentences and documents as unordered collections of words. In this way, however, the bag-of-words model fails to encode word orders and syntactic structures.
By contrast, order-sensitive models based on neural networks are becoming increasingly popular thanks to their ability to capture word order information. Many prevalent order-sensitive neural models can be categorized into two classes: Recursive models and Convolutional Neural Networks (CNN) models. Recursive models can be considered as generalizations of traditional sequence-modeling neural networks to tree structures. For example, (Socher et al., 2013) uses Recursive Neural Networks to build representations of phrases and sentences by combining neighboring constituents based on the parse tree. In their model, the composition is performed in a bottom-up way from leaf nodes of tokens until the root node of the parsing tree is reached. CNN based models, as the second category, utilize convolutional filters to extract local features (Kalchbrenner et al., 2014;Kim, 2014) over embedding matrices consisting of pretrained word vectors. Therefore, the model actually splits the sentence locally into n-grams by sliding windows.
However, despite their ability to account for word orders, order-sensitive models based on neural networks still suffer from several disadvantages. First, recursive models depend on well-performing parsers, which can be difficult for many languages or noisy domains (Iyyer et al., 2015;Ma et al., 2015). Besides, since tree-structured neural networks are vulnerable to the vanishing gradient problem (Iyyer et al., 2015), recursive models require heavy label-ing on phrases to add supervisions on internal nodes. Furthermore, parsing is restricted to sentences and it is unclear how to model paragraphs and documents using recursive neural networks. In CNN models, convolutional operators process word vectors sequentially using small windows. Thus sentences are essentially treated as a bag of n-grams, and the long dependency information spanning sliding windows is lost.
These observations motivate us to construct a textual modeling architecture that captures long-term dependencies without relying on parsing for both sentence and document inputs. Specifically, we propose Dependency Sensitive Convolutional Neural Networks (DSCNN), an end-to-end classification system that hierarchically builds textual representations with only root-level labels.
DSCNN consists of a convolutional layer built on top of Long Short-Term Memory (LSTM) networks. DSCNN takes slightly different forms depending on its input. For a single sentence (Figure 1), the LSTM network processes the sequence of word embeddings to capture long-distance dependencies within the sentence. The hidden states of the LSTM are extracted to form the low-level representation, and a convolutional layer with variable-size filters and max-pooling operators follows to extract task-specific features for classification purposes. As for document modeling (Figure 2), DSCNN first applies independent LSTM networks to each subsentence. Then a second LSTM layer is added between the first LSTM layer and the convolutional layer to encode the dependency across different sentences.
We evaluate DSCNN on several sentence-level and document-level tasks including sentiment analysis, question type classification, and subjectivity classification. Experimental results demonstrate the effectiveness of our approach comparable with the state-of-the-art. In particular, our method achieves highest accuracies on MR sentiment analysis (Pang and Lee, 2005), TREC question classification (Li and Roth, 2002), and subjectivity classification task SUBJ (Pang and Lee, 2004) compared with several competitive baselines.
The remaining part of this paper is the following. Section 2 discusses related work. Section 3 presents the background including LSTM networks and convolution operators. We then describe our architec-tures for sentence modeling and document modeling in Section 4, and report experimental results in Section 5.

Related Work
The success of deep learning architectures for NLP is first based on the progress in learning distributed word representations in semantic vector space (Bengio et al., 2003;Mikolov et al., 2013;Pennington et al., 2014), where each word is modeled with a realvalued vector called a word embedding. In this formulation, instead of using one-hot vectors by indexing words into a vocabulary, word embeddings are learned by projecting words onto a low dimensional and dense vector space that encodes both semantic and syntactic features of words.
Given word embeddings, different models have been proposed to learn the composition of words to build up phrase and sentence representations. Most methods fall into three types: unordered models, sequence models, and Convolutional Neural Networks models.
In unordered models, textual representations are independent of the word order. Specifically, ignoring the token order in the phrase and sentence, the bag-of-words model produces the representation by averaging the constituting word embeddings (Landauer and Dumais, 1997). Besides, a neural-bagof-words model described in (Kalchbrenner et al., 2014) adds an additional hidden layer on top of the averaged word embeddings before the softmax layer for classification purposes.
In contrast, sequence models, such as standard Recurrent Neural Networks (RNN) and Long Short-Term Memory networks, construct phrase and sentence representations in an order-sensitive way. For example, thanks to its ability to capture longdistance dependencies, LSTM has re-emerged as a popular choice for many sequence-modeling tasks, including machine translation (Bahdanau et al., 2014), image caption generation (Vinyals et al., 2014), and natural language generation (Wen et al., 2015). Besides, RNN and LSTM can be both converted to tree-structured networks by using parsing information. For example, (Socher et al., 2013) applied Recursive Neural Networks as a variant of the standard RNN structured by syntactic trees to the sentiment analysis task. (Tai et al., 2015) also generalizes LSTM to Tree-LSTM where each LSTM unit combines information from its children units.
Recently, CNN-based models have demonstrated remarkable performances on sentence modeling and classification tasks. Leveraging convolution operators, these models can extract features from variablelength phrases corresponding to different filters. For example, DCNN in (Kalchbrenner et al., 2014) constructs hierarchical features of sentences by onedimensional convolution and dynamic k-max pooling. (Yin and Schütze, 2015) further utilizes multichannel embeddings and unsupervised pretraining to improve classification results.

Preliminaries
In this section, we describe two building blocks for our system. We first discuss Long Short-Term Memory as a powerful network for modeling sequence data, and then formulate convolution and max-overtime pooling operators for the feature extraction over sequence inputs.

Long Short-Term Memory
Recurrent Neural Network (RNN) is a class of models to process arbitrary-length input sequences by recursively constructing hidden state vectors h t . At each time step t, the hidden state h t is an affine function of the input vector x t at time t and its previous hidden state h t−1 , followed by a non-linearity such as the hyperbolic tangent function: where W, U and b are parameters of the model. However, traditional RNN suffers from the exploding or vanishing gradient problems, where the gradient vectors can grow or decay exponentially as they propagate to earlier time steps. This problem makes it difficult to train RNN to capture longdistance dependencies in a sequence (Bengio et al., 1994;Hochreiter, 1998).
To address this problem of capturing long-term relations, Long Short-Term Memory (LSTM) networks, proposed by (Hochreiter and Schmidhuber, 1997) introduce a vector of memory cells and a set of gates to control how the information flows through the network. We thus have the input gate i t , the forget gate f t , the output gate o t , the memory cell c t , the input at the current step t as x t , and the hidden state h t , which are all in R d . Denote the sigmoid function as σ, and the element-wise multiplication as . At each time step t, the LSTM unit manipulates a collection of vectors described by the following equations: Note that the gates i t , f t , o t ∈ [0, 1] d and they control at time step t how the input is updated, how much the previous memory cell is forgotten, and the exposure of the memory to form the hidden state vector respectively.

Convolution and Max-over-time Pooling
Convolution operators have been extensively used in object recognition (LeCun et al., 1998), phoneme recognition (Waibel et al., 1989), sentence modeling and classification (Kalchbrenner et al., 2014;Kim, 2014), and other traditional NLP tasks (Collobert and Weston, 2008). Given an input sentence of length s: [w 1 , w 2 , ..., w s ], convolution operators apply a number of filters to extract local features of the sentence.
In this work, we employ one-dimensional wide convolution described in (Kalchbrenner et al., 2014). Let h t ∈ R d denote the representation of w t , and F ∈ R d×l be a filter where l is the window size. One-dimensional wide convolution computes the feature map c of length (s + l − 1) for the input sentence. Specifically, in wide convolution, we stack h t column by column, and add (l − 1) zero vectors to both ends of the sentence respectively. This formulates an input feature map X ∈ R d×(s+2l−2) . Thereafter, one-dimensional convolution applies the filter F to each set of consecutive l columns in X to produce (s − l − 1) activations. The k-th activation is produced by where X k:k+l−1 ∈ R d×l is the k-th sliding window in X, and b is the bias term.
performs elementwise multiplications and f is an nonlinear function such as Rectified Linear Units (ReLU) or the hyperbolic tangent.
Then, the max-over-time pooling selects the maximum value in the feature map as the feature corresponding to the filter F.
In practice, we apply many filters with different window sizes l to capture features encoded in llength windows of the input.

Model Architectures
Convolutional Neural Networks have demonstrated state-of-the-art performances in sentence modeling and classification. Despite the fact that CNN is an order-sensitive model, traditional convolution operators extract local features from each possible window of words through filters with predefined sizes. Therefore, sentences are effectively processed like a bag of n-grams, and long-distance dependencies can be only captured if we have long enough filters.
To capture long-distance dependencies, much recent effort has been dedicated to building treestructured models from the syntactic parsing information. However, we observe that these methods suffer from three problems. First, they require an external parser and are vulnerable to parsing errors (Iyyer et al., 2015). Besides, tree-structured models need heavy supervisions to overcome vanishing gradient problems. For example, in (Socher et al., 2013), input sentences are labeled for each subphrase, and softmax layers are applied at each internal node. Finally, tree-structured models are restricted to sentence level, and cannot be generalized to model documents.
In this work, we propose a novel architecture to address these three problems. Our model hierarchically builds text representations from input words without parsing information. Only labels at the root level are required at the top softmax layer, so there is no need for labeling subphrases in the text. The system is not restricted to sentence-level inputs: the architecture can be restructured based on the sentence tokenization for modeling documents. Let the input of our model be a sentence of length s: [w 1 , w 2 , ..., w s ], and c be the total number of word embedding versions. Different versions come from pre-trained word vectors such as word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014).

Sentence Modeling
The first layer of our model consists of LSTM networks processing multiple versions of word embedding. For each version of word embedding, we construct an LSTM network where the input x t ∈ R d is the d-dimensional word embedding vector for w t . As described in the previous section, the LSTM layer will produce a hidden state representation h t ∈ R d at each time step. We collect hidden state representations as the output of LSTM layers: for i = 1, 2, ..., c.
A convolution neural network follows as the second layer. To deal with multiple word embeddings, we use filter F ∈ R c×d×l , where l is the window size. Each hidden state sequence h (i) produced by the i-th version of word embeddings forms one channel of the feature map. These feature maps are stacked as c-channel feature maps X ∈ R c×d×(s+2(l−1)) .
Similar to the single channel case, activations are computed as a slight modification of equation 4: A max-over-time pooling layer is then added on top of the convolution neural network. Finally, the pooled features are used in a softmax layer for classification. A sentence modeling example is illustrated in Figure 1.

Document Modeling
Figure 2: A schematic for document modeling hierarchy, which can be viewed as a variant of the one for sentence modeling. Independent LSTM networks process subsentences separated by punctuation. Hidden states of LSTM networks are averaged as the sentence representations, from which the high-level LSTM layer creates the joint meaning of sentences.
Our model is not restricted to sentences; it can be restructured to model documents. The intuition comes from the fact that as the composition of words builds up the semantic meaning for sentences, the composition of sentences establishes the semantic meaning for documents (Li et al., 2015). Now suppose that the input of our model is a document consisting of n subsentences: [s 1 , s 2 , ..., s n ]. Subsentences can be obtained by splitting the document using punctuation (comma, period, question mark, and exclamation point) as delimiters.
We employ independent LSTM networks for each subsentence in the same way as the first layer of the sentence modeling architecture. For each subsentence we feed-forward the hidden states of the corresponding LSTM network to the average pooling layer. Take the first sentence of the document as an example, where h (i) s1,j is the hidden state of the first sentence at time step j, and len(s1) denotes the length of the first sentence. In this way, after the averaging pooling layers, we have a representation sequence consisting of averaged hidden states for subsentences, for i = 1, 2, ..., c. Thereafter, a high-level LSTM network comes into play to capture the joint meaning created by the sentences.
Similar as sentence modeling, a convolutional layer is placed on top of the high-level LSTM for feature extraction. Finally, a max-over-time pooling layer and a softmax layer follow to pool features and perform the classification task. Figure 2 gives the schematic for the hierarchy.
We also benchmark our system on the subjectivity classification dataset (SUBJ) released by (Pang and Lee, 2004). The dataset contains 5,000 subjective sentences and 5,000 objective sentences. We report 10-fold cross validation results as the baseline does.
For document-level dataset, we use Large Movie Review (IMDB) created by (Maas et al., 2011). There are 25,000 training and 25,000 testing examples with binary sentiment polarity labels, and 50,000 unlabeled examples. Different from Stanford Sentiment Treebank and Movie Review dataset, every example in this dataset has several sentences.

Training Details and Implementation
We use two sets of 300-dimensional pre-trained embeddings, word2vec 1 and GloVe 2 , forming two channels for our network. For all datasets, we use 100 convolution filters each for window sizes of 3, 4, 5. Rectified Linear Units (ReLU) is chosen as the nonlinear function in the convolutional layer.
For regularization, before the softmax layers, we employ Dropout operation (Hinton et al., 2012) with dropout rate 0.5, and we do not perform any l 2 constraints over the parameters. We use the gradientbased optimizer Adadelta (Zeiler, 2012) to minimize cross-entropy loss between the predicted and true distributions, and the training is early stopped when the accuracy on validation set starts to drop.
As for training cost, our system processes around 4000 tokens per second on a single GTX 670 GPU.

Pretraining of LSTM
We experiment with two variants of parameter initialization of sentence level LSTMs. The first variant (DSCNN in Table 1) initializes the weight matrices in LSTMs as random orthogonal matrices. In the second variant (DSCNN-Pretrain in Table 1), we first train sequence autoencoders (Dai and Le, 2015) which read input sentences at the encoder and reconstruct the input at the decoder. We pretrain separately on each task based on the same train/valid/test splits. The pretrained encoders are used to be the start points of LSTM layers for later supervised classification tasks.  Table 1 reports the results of DSCNN on different datasets, demonstrating its effectiveness in comparison with other state-of-the-art methods.

Sentence Modeling
For sentence modeling tasks, DSCNN beats all baselines on MR and TREC, and achieves the same best result on SUBJ as MVCNN. In SST-2, DSCNN only reports a slightly lower accuracy than MVCNN. In MVCNN, however, the author uses more resources including five versions of word embeddings. For SST-5, DSCNN is second only to (a) description → entity  Tree-LSTM, which nonetheless relies on parsers to build tree-structured neural models.
The benefit of DSCNN is illustrated by its consistently better results over the sequential CNN models including DCNN and CNN-MC. The superiority of DSCNN is mainly attributed to its ability to maintain long-term dependencies. Figure 3 depicts the correlation between the dependency length and the classification accuracy. While CNN-MC and DSCNN are similar when the sum of dependency arc lengths is below 15, DSCNN gains obvious ad-vantages when dependency lengths grow for long and complex sentences. Dep-CNN is also more robust than CNN-MC, but it relies on the dependency parser and predefined patterns to model longer linguistic structures. Figure 4 gives some examples where DSCNN makes correct predictions while CNN-MC fails. In the first example, CNN-MC classifies the question as entity due to its focus on the noun phrase "worn or outdated flags", while DSCNN captures the long dependency between "done with" and "flags", and assigns the correct label description. Similarly in the second case, due to "Nile", CNN-MC labels the question as location, while the dependency between "depth of" and "river" is ignored. As for the third example, the question involves a complicated and long attributive clause for the subject "artery". CNN-MC gets easily confused and predicts the type as location due to words "from" and "to", while DSCNN keeps correct. Finally, "Lindbergh" in the last example make CNN-MC bias to human.
We also sample some misclassified examples of DSCNN in Figure 5. Example (a) fails because the numeric meaning of "point" is not captured by the word embedding. Similarly, in the second example, the error is due to the out-of-vocabulary word "TMJ" and it is thus apparently difficult for DSCNN to figure out that it is an abbreviation. Example (c) is likely to be an ambiguous or mistaken annotation. The finding here agrees with the discussion in Dep-CNN work (Ma et al., 2015).

Document Modeling
For document modeling, the result of DSCNN on IMDB against other baselines is listed on the last column of Table 1. Documents in IMDB consist of several sentences and thus very long: the average length is 241 tokens per document and the maximum length is 2526 words (Dai and Le, 2015). As a result, there is no result reported using CNN-based models due to prohibited computation time, and most previous works are unordered models including variations of bag-of-words.
DSCNN outperforms bag-of-words model (Maas et al., 2011), Deep Averaging Network (Iyyer et al., 2015), and word representation Restricted Boltzmann Machine model combined with bag-of-words features (Dahl et al., 2012). The key weakness of bag-of-words prevents those models from capturing long-term dependencies.
Besides, Paragraph Vector (Le and Mikolov, 2014) and SA-LSTM (Dai and Le, 2015) achieve better results than DSCNN. It is worth mentioning that both methods, as unsupervised learning algorithms, can gain much positive effects from unlabeled data (they are using 50,000 unlabeled examples in IMDB). For example in (Dai and Le, 2015), with additional data from Amazon reviews, the error rate of SA-LSTM on MR dataset drops by 3.6%.

Conclusion
In this work, we present DSCNN, Dependency Sensitive Convolutional Neural Networks for purpose of text modeling at both sentence and document levels. DSCNN captures long-term inter-sentence and intra-sentence dependencies by processing word vectors through layers of LSTM networks, and extracts features by convolutional operators for classification. Experiments show that DSCNN consistently outperforms traditional CNNs, and achieves state-of-the-art results on several sentiment analysis, question type classification and subjectivity classification datasets.