Learning Generic Sentence Representations Using Convolutional Neural Networks

We propose a new encoder-decoder approach to learn distributed sentence representations that are applicable to multiple purposes. The model is learned by using a convolutional neural network as an encoder to map an input sentence into a continuous vector, and using a long short-term memory recurrent neural network as a decoder. Several tasks are considered, including sentence reconstruction and future sentence prediction. Further, a hierarchical encoder-decoder model is proposed to encode a sentence to predict multiple future sentences. By training our models on a large collection of novels, we obtain a highly generic convolutional sentence encoder that performs well in practice. Experimental results on several benchmark datasets, and across a broad range of applications, demonstrate the superiority of the proposed model over competing methods.


Introduction
Learning sentence representations is central to many natural language modeling applications. The aim of a model for this task is to learn fixed-length feature vectors that encode the semantic and syntactic properties of sentences. Deep learning techniques have shown promising performance on sentence modeling, via feedforward neural networks (Huang et al., 2013), recurrent neural networks (RNNs) (Hochreiter and Schmidhuber, 1997), convolutional neural networks (CNNs) (Kalchbrenner et al., 2014; Kim, 2014; Shen et al., 2014), and recursive neural networks (Socher et al., 2013). Most of these models are task-dependent: they are trained specifically for a certain task. However, these methods may become inefficient when we need to repeatedly learn sentence representations for a large number of different tasks, because they may require retraining a new model for each individual task. In this paper, in contrast, we are primarily interested in learning generic sentence representations that can be used across domains.
Several approaches have been proposed for learning generic sentence embeddings. The paragraph vector model of Le and Mikolov (2014) incorporates a global context vector into the log-linear neural language model (Mikolov et al., 2013) to learn the sentence representation; however, at prediction time, one needs to perform gradient descent to compute a new vector. The sequence autoencoder of Dai and Le (2015) describes an encoder-decoder model to reconstruct the input sentence, while the skip-thought model of  extends the encoder-decoder model to reconstruct the surrounding sentences of an input sentence. Both the encoder and decoder of the methods above are modeled as RNNs.
CNNs have recently achieved excellent results in various task-dependent natural language applications as the sentence encoder (Kalchbrenner et al., 2014; Kim, 2014; Hu et al., 2014). This motivates us to propose a CNN encoder for learning generic sentence representations within the framework of encoder-decoder models proposed by ; Cho et al. (2014). Specifically, a CNN encoder performs convolution and pooling operations on an input sentence, then uses a fully connected layer to produce a fixed-length encoding of the sentence. This encoding vector is then fed into a long short-term memory (LSTM) recurrent network to produce a target sentence. Depending on the task, we propose three models: (i) CNN-LSTM autoencoder: this model seeks to reconstruct the original input sentence, by capturing the intra-sentence information; (ii) CNN-LSTM future predictor: this model aims to predict a future sentence, by leveraging inter-sentence information; (iii) CNN-LSTM composite model: in this case, there are two LSTMs, decoding the representation to the input sentence itself and a future sentence. This composite model aims to learn a sentence encoder that captures both intra- and inter-sentence information.
Figure 1: Illustration of the CNN-LSTM encoder-decoder models. The sentence encoder is a CNN, the sentence decoder is an LSTM, and the paragraph generator is another LSTM. (Left) (a)+(c) represents the autoencoder; (b)+(c) represents the future predictor; (a)+(b)+(c) represents the composite model. (Right) hierarchical model. In this example, the input contiguous sentences are: this is great. you will love it! i promise.
The proposed CNN-LSTM future predictor model only considers the immediately subsequent sentence as context. In order to capture longer-term dependencies between sentences, we further introduce a hierarchical encoder-decoder model. This model abstracts the RNN language model of Mikolov et al. (2010) to the sentence level. That is, instead of using the current word in a sentence to predict future words (sentence continuation), we encode a sentence to predict multiple future sentences (paragraph continuation). This model is termed the hierarchical CNN-LSTM model.
As in , we first train our proposed models on a large collection of novels. We then evaluate the CNN sentence encoder as a generic feature extractor for 8 tasks: semantic relatedness, paraphrase detection, image-sentence ranking and 5 standard classification benchmarks. In these experiments, we train a linear classifier on top of the extracted sentence features, without additional fine-tuning of the CNN. We show that our trained sentence encoder yields generic representations that perform as well as, or better than, those of ; Hill et al. (2016), in all the tasks considered.
Summarizing, the main contribution of this paper is a new class of CNN-LSTM encoder-decoder models that is able to leverage the vast quantity of unlabeled text for learning generic sentence representations. Inspired by the skip-thought model , we have further explored different variants: (i) CNN is used as the sentence encoder rather than RNN; (ii) larger context windows are considered: we propose the hierarchical CNN-LSTM model to encode a sentence for predicting multiple future sentences.

Model description

2.1 CNN-LSTM model
Consider the sentence pair $(s_x, s_y)$. The encoder, a CNN, encodes the first sentence $s_x$ into a feature vector $z$, which is then fed into an LSTM decoder that predicts the second sentence $s_y$. Let $w_x^t \in \{1, \dots, V\}$ represent the $t$-th word in sentence $s_x$, where $w_x^t$ indexes one element in a $V$-dimensional set (vocabulary); $w_y^t$ is defined similarly w.r.t. $s_y$. Each word $w_x^t$ is embedded into a $k$-dimensional vector $x_t = W_e[w_x^t]$, where $W_e \in \mathbb{R}^{k \times V}$ is a (learned) word embedding matrix, and the notation $W_e[v]$ denotes the $v$-th column of $W_e$. Similarly, we let $y_t = W_e[w_y^t]$.
CNN encoder. The CNN architecture of Kim (2014) and Collobert et al. (2011) is used for sentence encoding; it consists of a convolution layer and a max-pooling operation over the entire sentence for each feature map. A sentence of length $T$ (padded where necessary) is represented as a matrix $X \in \mathbb{R}^{k \times T}$, by concatenating its word embeddings as columns, i.e., the $t$-th column of $X$ is $x_t$.
A convolution operation involves a filter $W_c \in \mathbb{R}^{k \times h}$, applied to a window of $h$ words to produce a new feature. According to Collobert et al. (2011), we can induce one feature map $c = f(X * W_c + b) \in \mathbb{R}^{T-h+1}$, where $f(\cdot)$ is a nonlinear activation function such as the hyperbolic tangent used in our experiments, $b \in \mathbb{R}^{T-h+1}$ is a bias vector, and $*$ denotes the convolutional operator. Convolving the same filter with the $h$-gram at every position in the sentence allows the features to be extracted independently of their position in the sentence. We then apply a max-over-time pooling operation (Collobert et al., 2011) to the feature map and take its maximum value, i.e., $\hat{c} = \max\{c\}$, as the feature corresponding to this particular filter. This pooling scheme tries to capture the most important feature, i.e., the one with the highest value, for each feature map, effectively filtering out less informative compositions of words. Further, pooling also guarantees that the extracted features are independent of the length of the input sentence.
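The single-filter computation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' Theano implementation: the function names `feature_map` and `max_over_time` and the toy values are assumptions, and the convolution is written as a sliding inner product between the filter and each h-gram.

```python
import math

def feature_map(X, W_c, b):
    """Feature map c = tanh(X * W_c + b) for a single filter.

    X   : list of T word vectors (the columns of the k x T sentence matrix)
    W_c : list of h filter columns, each of length k (the k x h filter)
    b   : list of T - h + 1 bias values
    """
    T, h = len(X), len(W_c)
    c = []
    for t in range(T - h + 1):
        # Inner product of the filter with the h-gram starting at position t.
        s = sum(w * x for j in range(h) for w, x in zip(W_c[j], X[t + j]))
        c.append(math.tanh(s + b[t]))
    return c

def max_over_time(c):
    """Keep only the largest activation of the feature map."""
    return max(c)

# Toy sentence (hypothetical values): T = 4 words, k = 2, window size h = 2.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
W_c = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0, 0.0]
c = feature_map(X, W_c, b)   # one value per h-gram position
c_hat = max_over_time(c)     # the single feature kept for this filter
```

The feature map has T - h + 1 entries, one per window position, while the pooled value is a single scalar regardless of sentence length.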
The above process describes how one feature is extracted from one filter. In practice, the model uses multiple filters with varying window sizes (Kim, 2014). Each filter can be considered as a linguistic feature detector that learns to recognize a specific class of n-grams (or h-grams, in the above notation). However, since the h-grams are computed in the embedding space, the model naturally handles similar h-grams composed of synonyms. Assume we have m window sizes, and for each window size, we use d filters; then we obtain an md-dimensional vector to represent a sentence.
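To illustrate how m window sizes with d filters each give an md-dimensional sentence vector, the following sketch stacks the pooled features of a small filter bank. The names `pooled_feature`, `encode_sentence` and `random_vectors` are hypothetical, the filters are random rather than trained, and biases are omitted for brevity.

```python
import math
import random

def pooled_feature(X, W_c):
    """Max-over-time of the tanh responses of one filter (biases omitted)."""
    T, h = len(X), len(W_c)
    return max(
        math.tanh(sum(w * x for j in range(h) for w, x in zip(W_c[j], X[t + j])))
        for t in range(T - h + 1)
    )

def encode_sentence(X, filter_bank):
    """filter_bank maps each window size h to a list of d filters (each h x k)."""
    z = []
    for h in sorted(filter_bank):
        z.extend(pooled_feature(X, W_c) for W_c in filter_bank[h])
    return z

def random_vectors(n, k, rng):
    """n random k-dimensional vectors (stand-ins for embeddings or filter columns)."""
    return [[rng.uniform(-0.01, 0.01) for _ in range(k)] for _ in range(n)]

rng = random.Random(0)
k, d, window_sizes = 3, 2, (3, 4, 5)
filter_bank = {h: [random_vectors(h, k, rng) for _ in range(d)] for h in window_sizes}

# Sentences of different lengths map to vectors of the same size m * d.
short = encode_sentence(random_vectors(6, k, rng), filter_bank)
long = encode_sentence(random_vectors(11, k, rng), filter_bank)
```

With m = 3 and d = 2 both vectors have 6 entries; a sentence must contain at least as many (possibly padded) words as the largest window.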
Compared with the LSTM encoders used in ; Dai and Le (2015); Hill et al. (2016), a CNN encoder may have the following advantages. First, the sparse connectivity of a CNN, which indicates fewer parameters are required, typically improves its statistical efficiency as well as reduces memory requirements (Goodfellow et al., 2016). For example, excluding the number of parameters used in the word embeddings, our trained CNN sentence encoder has 3 million parameters, while the skip-thought vector of  contains 40 million parameters. Second, a CNN is easy to implement in parallel over the whole sentence, while an LSTM needs sequential computation.
LSTM decoder. The CNN encoder maps sentence $s_x$ into a vector $z$. The probability of a length-$T$ sentence $s_y$ given the encoded feature vector $z$ is defined as
$$p(s_y \mid z) = \prod_{t=1}^{T} p(w_y^t \mid w_y^0, \dots, w_y^{t-1}, z), \tag{1}$$
where $w_y^0$ is defined as a special start-of-the-sentence token. All the words in the sentence are sequentially generated using the RNN, until the end-of-the-sentence symbol is generated. Specifically, each conditional $p(w_y^t \mid w_y^{<t}, z)$, where $w_y^{<t} = \{w_y^0, \dots, w_y^{t-1}\}$, is specified as $\mathrm{softmax}(\mathbf{V} h_t)$, where $h_t$, the hidden units, are recursively updated through $h_t = \mathcal{H}(y_{t-1}, h_{t-1}, z)$, and $h_0$ is defined as a zero vector ($h_0$ is not updated during training). $\mathbf{V}$ is a weight matrix used for computing a distribution over words. Bias terms are omitted for simplicity throughout the paper. The transition function $\mathcal{H}(\cdot)$ is implemented with an LSTM (Hochreiter and Schmidhuber, 1997).
Given the sentence pair $(s_x, s_y)$, the objective function is the sum of the log-probabilities of the target sentence conditioned on the encoder representation in (1): $\sum_{t=1}^{T} \log p(w_y^t \mid w_y^{<t}, z)$. The total objective is the above objective summed over all the sentence pairs.
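As a small worked example of this objective, the sketch below sums the log-probabilities of the target words given per-step score vectors. It assumes the decoder has already produced, for each step t, the unnormalized scores over the vocabulary; the function names are hypothetical, and word indices are 0-based here, whereas the text indexes words from 1 to V.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax over one vector of vocabulary scores."""
    m = max(scores)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_norm for s in scores]

def sentence_log_prob(step_scores, target_ids):
    """Sum over t of log p(w_t | w_<t, z), given the scores at each step."""
    return sum(log_softmax(scores)[w] for scores, w in zip(step_scores, target_ids))

# Toy case: vocabulary of 4 words, sentence of 3 words, uninformative scores.
step_scores = [[0.0, 0.0, 0.0, 0.0]] * 3
objective = sentence_log_prob(step_scores, [2, 0, 3])
```

With uniform scores every word has probability 1/4, so the objective equals 3 log(1/4); training increases this quantity by raising the scores of the observed words.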
Applications. Inspired by Srivastava et al. (2015), we propose three models: (i) an autoencoder, (ii) a future predictor, and (iii) the composite model. These models share the same CNN-LSTM model architecture, but are different in terms of the choices of the target sentence. An illustration of the proposed encoder-decoder models is shown in Figure 1 (left).
The autoencoder (i) aims to reconstruct the same sentence as the input. The intuition behind this is that an autoencoder learns to represent the data using features that explain its own important factors of variation, and hence model the internal structure of sentences, effectively capturing the intra-sentence information. Another natural task is encoding an input sentence to predict the subsequent sentence. The future predictor (ii) achieves this, effectively capturing the inter-sentence information, which has been shown to be useful to learn the semantics of a sentence . These two tasks can be combined to create a composite model (iii), where the CNN encoder is asked to learn a feature vector that is useful to simultaneously reconstruct the input sentence and predict a future sentence. This composite model encourages the sentence encoder to incorporate contextual information both within and beyond the sentence.
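Since the three models differ only in the choice of target sentence, the difference can be made concrete by listing the (input, targets) pairs each one would be trained on. The helper `make_training_pairs` below is a hypothetical illustration of that bookkeeping, not part of the described implementation.

```python
def make_training_pairs(sentences, task):
    """Return (input, targets) pairs for an ordered list of contiguous sentences."""
    if task not in ("autoencoder", "future", "composite"):
        raise ValueError(f"unknown task: {task}")
    pairs = []
    for i, s_x in enumerate(sentences):
        if task == "autoencoder":
            pairs.append((s_x, [s_x]))      # reconstruct the input itself
            continue
        if i + 1 == len(sentences):
            break                           # the last sentence has no successor
        s_y = sentences[i + 1]
        # The future predictor decodes s_y; the composite model decodes both.
        targets = [s_y] if task == "future" else [s_x, s_y]
        pairs.append((s_x, targets))
    return pairs

sentences = ["this is great.", "you will love it!", "i promise."]
autoencoder_pairs = make_training_pairs(sentences, "autoencoder")
future_pairs = make_training_pairs(sentences, "future")
composite_pairs = make_training_pairs(sentences, "composite")
```

In the composite case each input is paired with two targets, one for each of the two LSTM decoders.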

Hierarchical CNN-LSTM model
The future predictor described in Section 2.1 only considers the immediately subsequent sentence as context. By utilizing a larger surrounding context, it is likely that we can learn even higher-quality sentence representations. Inspired by the standard RNN-based language model (Mikolov et al., 2010) that uses the current word to predict future words, we propose a hierarchical encoder-decoder model that encodes the current sentence to predict multiple future sentences. An illustration of the hierarchical model is shown in Figure 1(right), with details provided in Figure 2.
Our proposed hierarchical model characterizes the word-sentence-paragraph hierarchy. A paragraph is modeled as a sequence of sentences, and each sentence is modeled as a sequence of words. Specifically, assume we are given a paragraph $D = (s_1, \dots, s_L)$ that consists of $L$ sentences. The probability for paragraph $D$ is then defined as
$$p(D) = \prod_{\ell=1}^{L} p(s_\ell \mid s_{<\ell}), \tag{2}$$
where $s_0$ is defined as a special start-of-the-paragraph token. As shown in Figure 2 (left), each $p(s_\ell \mid s_{<\ell})$ in (2) is calculated as
$$p(s_\ell \mid s_{<\ell}) = p(s_\ell \mid h_\ell^{(p)}), \tag{3}$$
$$h_\ell^{(p)} = \mathrm{LSTM}_p(h_{\ell-1}^{(p)}, z_{\ell-1}), \tag{4}$$
$$z_{\ell-1} = \mathrm{CNN}(s_{\ell-1}), \tag{5}$$
where $h_\ell^{(p)}$ denotes the $\ell$-th hidden state of the LSTM paragraph generator, and $h_0^{(p)}$ is fixed as a zero vector. The CNN in (5) is as described in Section 2.1, encoding the sentence $s_{\ell-1}$ into a vector representation $z_{\ell-1}$.
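Equation (4) serves as the paragraph-level language model (Mikolov et al., 2010), which encodes all the previous sentence representations $z_{<\ell}$ into a vector representation $h_\ell^{(p)}$. This hidden state $h_\ell^{(p)}$ is used to guide the generation of the $\ell$-th sentence through the decoder (3), which is defined as
$$p(s_\ell \mid h_\ell^{(p)}) = \prod_{t=1}^{T_\ell} p(w_{\ell,t} \mid w_{\ell,<t}, h_\ell^{(p)}), \tag{6}$$
where $w_{\ell,0}$ is defined as a special start-of-the-sentence token, $T_\ell$ is the length of sentence $\ell$, and $w_{\ell,t}$ denotes the $t$-th word in sentence $\ell$. As shown in Figure 2, each conditional in (6) is calculated as
$$p(w_{\ell,t} \mid w_{\ell,<t}, h_\ell^{(p)}) = \mathrm{softmax}(\mathbf{V} h_{\ell,t}^{(s)}), \tag{7}$$
$$h_{\ell,t}^{(s)} = \mathrm{LSTM}_s(h_{\ell,t-1}^{(s)}, x_{\ell,t-1}, h_\ell^{(p)}), \tag{8}$$
where $h_{\ell,t}^{(s)}$ denotes the $t$-th hidden state of the LSTM decoder for sentence $\ell$, $x_{\ell,t-1}$ denotes the word embedding for $w_{\ell,t-1}$, and $h_{\ell,0}^{(s)}$ is fixed as a zero vector for all $\ell = 1, \dots, L$. $\mathbf{V}$ is a weight matrix used for computing a distribution over words.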
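The order of operations in the hierarchical model (encode the previous sentence, update the paragraph state, then score the current sentence given that state) can be sketched as a plain loop. Everything below is a hypothetical skeleton: the CNN, the paragraph-level LSTM and the sentence decoder are passed in as callables, and the toy stand-ins at the bottom only exercise the control flow.

```python
def paragraph_log_prob(paragraph, start_token, encode, paragraph_step,
                       sentence_log_prob, h0):
    """Sum of log p(s_l | s_<l) over the sentences of one paragraph."""
    h_p = h0                    # paragraph-level hidden state, initially fixed
    previous = start_token      # s_0, the start-of-the-paragraph token
    total = 0.0
    for sentence in paragraph:
        z_prev = encode(previous)                  # sentence encoder on s_{l-1}
        h_p = paragraph_step(h_p, z_prev)          # paragraph generator update
        total += sentence_log_prob(sentence, h_p)  # decoder scores s_l given h_p
        previous = sentence
    return total

# Toy stand-ins (not neural networks), used only to trace the recursion.
toy = paragraph_log_prob(
    paragraph=["ab", "cde", "f"],
    start_token="^",
    encode=len,
    paragraph_step=lambda h, z: h + z,
    sentence_log_prob=lambda s, h: -float(h),
    h0=0,
)
```

Note that a sentence is only encoded after it has been scored, so the state used to predict sentence l depends on sentences 0 to l-1 alone.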

Related work
Various methods have been proposed for sentence modeling, which generally fall into two categories. The first consists of models trained specifically for a certain task, typically combined with downstream applications. Several models have been proposed along this line, ranging from simple additive composition of the word vectors (Mitchell and Lapata, 2010; Yu and Dredze, 2015; Iyyer et al., 2015) to those based on complex nonlinear functions like recursive neural networks (Socher et al., 2011, 2013), convolutional neural networks (Kalchbrenner et al., 2014; Hu et al., 2014; Johnson and Zhang, 2015; Gan et al., 2017), and recurrent neural networks (Tai et al., 2015; Lin et al., 2017).
The other category consists of methods aiming to learn generic sentence representations that can be used across domains. This includes the paragraph vector (Le and Mikolov, 2014), skip-thought vector , and the sequential denoising autoencoders (Hill et al., 2016). Hill et al. (2016) also proposed a sentence-level log-linear bag-of-words (BoW) model, where a BoW representation of an input sentence is used to predict adjacent sentences that are also represented as BoW. Most recently, Wieting et al. (2016); Arora et al. (2017); Pagliardini et al. (2017) proposed methods in which sentences are represented as a weighted average of fixed (pre-trained) word vectors. Our model falls into this category, and is most related to .
However, there are two key aspects that make our model different from . First, we use CNN as the sentence encoder. The combination of CNN and LSTM has been considered in image captioning (Karpathy and Fei-Fei, 2015), and in some recent work on machine translation (Kalchbrenner and Blunsom, 2013;Meng et al., 2015;Gehring et al., 2016). Our utilization of a CNN is different, and more importantly, the ultimate goal of our model is different. Our work aims to use a CNN to learn generic sentence embeddings.
Second, we use the hierarchical CNN-LSTM model to predict multiple future sentences, rather than the surrounding two sentences as in . Utilizing a larger context window aids our model to learn better sentence representations, capturing longer-term dependencies between sentences. Similar work to this hierarchical language modeling can be found in ; Sordoni et al. (2015); ; Wang and Cho (2016). Specifically, ; Sordoni et al. (2015) uses an LSTM for the sentence encoder, while  uses a bag-of-words to represent sentences.

Experiments
We first provide qualitative analysis of our CNN encoder, and then present experimental results on 8 tasks: 5 classification benchmarks, paraphrase detection, semantic relatedness and image-sentence ranking. As in , we evaluate the capabilities of our encoder as a generic feature extractor. To further demonstrate the advantage of our learned generic sentence representations, we also fine-tune our trained sentence encoder on the 5 classification benchmarks. All the CNN-LSTM models are trained using the BookCorpus dataset , which consists of 70 million sentences from over 7000 books.
We train four models in total: (i) an autoencoder, (ii) a future predictor, (iii) the composite model, and (iv) the hierarchical model. For the CNN encoder, we employ filter windows (h) of sizes {3,4,5} with 800 feature maps each, hence each sentence is represented as a 2400-dimensional vector. For both the LSTM sentence decoder and the paragraph generator, we use one hidden layer of 600 units.
The CNN-LSTM models are trained with a vocabulary size of 22,154 words. In order to learn a generic sentence encoder that can encode a large number of possible words, we use two methods of considering words not in the training set. Suppose we have a large pretrained word embedding matrix, such as the publicly available word2vec vectors (Mikolov et al., 2013), in which all test words are assumed to reside.
The first method learns a linear mapping between the word2vec embedding space V w2v and the learned word embedding space V cnn by solving a linear regression problem . Thus, any word from V w2v can be mapped into V cnn for encoding sentences. The second method fixes the word vectors in V cnn as the corresponding word vectors in V w2v , and we do not update the word embedding parameters during training. Thus, any word vector from V w2v can be naturally used to encode sentences. By doing this, our trained sentence encoder can successfully encode 931,331 words.
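The first method amounts to fitting a matrix on the words shared by both vocabularies and then applying it to unseen words. The sketch below fits such a map by plain batch gradient descent on the mean squared error; the solver, the function names `fit_linear_map` and `apply_map`, and the two-dimensional toy vectors are all assumptions made for illustration, since the text only states that a linear regression problem is solved.

```python
def apply_map(W, u):
    """Map a source-space vector u into the target space: W u."""
    return [sum(w * x for w, x in zip(row, u)) for row in W]

def fit_linear_map(src, tgt, lr=0.1, steps=500):
    """Fit W minimizing the mean of ||W src[i] - tgt[i]||^2 by gradient descent."""
    d_in, d_out, n = len(src[0]), len(tgt[0]), len(src)
    W = [[0.0] * d_in for _ in range(d_out)]
    for _ in range(steps):
        grad = [[0.0] * d_in for _ in range(d_out)]
        for u, v in zip(src, tgt):
            pred = apply_map(W, u)
            for r in range(d_out):
                err = pred[r] - v[r]
                for c in range(d_in):
                    grad[r][c] += 2.0 * err * u[c] / n
        for r in range(d_out):
            for c in range(d_in):
                W[r][c] -= lr * grad[r][c]
    return W

# Toy shared vocabulary: source vectors and their counterparts in the target
# space, generated from a known map so the fit can be checked.
W_true = [[2.0, 0.0], [1.0, -1.0]]
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
tgt = [apply_map(W_true, u) for u in src]
W = fit_linear_map(src, tgt)
unseen = apply_map(W, [2.0, 3.0])   # a word outside the shared vocabulary
```

In practice the source and target spaces have different, much larger dimensions, and a closed-form or regularized least-squares solver would be a natural alternative.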
For training, all weights in the CNN and non-recurrent weights in the LSTM are initialized from a uniform distribution in [-0.01, 0.01]. Orthogonal initialization is employed on the recurrent matrices in the LSTM. All bias terms are initialized to zero. The initial forget gate bias for LSTM is set to 3. Gradients are clipped if the norm of the parameter vector exceeds 5. The Adam algorithm (Kingma and Ba, 2015) with learning rate $2 \times 10^{-4}$ is utilized for optimization. For all the CNN-LSTM models, we use mini-batches of size 64. For the hierarchical CNN-LSTM model, we use mini-batches of size 8, and each paragraph is composed of 8 sentences. We do not perform any regularization other than dropout (Srivastava et al., 2014). All experiments are implemented in Theano (Bastien et al., 2012), using an NVIDIA GeForce GTX TITAN X GPU with 12GB memory.

A | you needed me? | this is great. | its lovely to see you. | he had thought he was going crazy.
B | you got me? | this is awesome. | its great to meet you. | i felt like i was going crazy.
C | i got you. | you are awesome. | its great to meet him. | i felt like to say the right thing.
D | i needed you. | you are great. | its lovely to see him. | he had thought to say the right thing.
Table 1: Vector "compositionality" using element-wise addition and subtraction. Let z(s) denote the vector representation z of a given sentence s. We first calculate z = z(A) - z(B) + z(C). The resulting vector is then sent to the LSTM to generate sentence D.

johnny nodded his curly head , and then his breath eased into an even rhythm .
aiden looked at my face for a second , and then his eyes trailed to my extended hand .

i yelled in frustration , throwing my hands in the air .
i stand up , holding my hands in the air .

i loved sydney , but i was feeling all sorts of homesickness .
i loved timmy , but i thought i was a self-sufficient person .

" i brought sad news to mistress betty , " he said quickly , taking back his hand .
" i really appreciate you taking care of lilly for me , " he said sincerely , handing me the money .

" i am going to tell you a secret , " she said quietly , and he leaned closer .
" you are very beautiful , " he said , and he leaned in .

she kept glancing out the window at every sound , hoping it was jackson coming back .
i kept checking the time every few minutes , hoping it would be five oclock .

leaning forward , he rested his elbows on his knees and let his hands dangle between his legs .
stepping forward , i slid my arms around his neck and then pressed my body flush against his .

i take tris 's hand and lead her to the other side of the car , so we can watch the city disappear behind us .
i take emma 's hand and lead her to the first taxi , everyone else taking the two remaining cars .
Table 2: Query and nearest sentence.
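The clipping rule in the training setup above is stated only briefly. One common reading, assumed here and not confirmed by the text, is that the gradient vector is rescaled whenever its Euclidean norm exceeds the threshold of 5. The function name `clip_by_norm` is hypothetical.

```python
import math

def clip_by_norm(grads, max_norm=5.0):
    """Rescale the gradient so that its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)      # within the threshold: left unchanged
    scale = max_norm / norm
    return [g * scale for g in grads]

unchanged = clip_by_norm([3.0, 4.0])   # norm 5, at the threshold
clipped = clip_by_norm([6.0, 8.0])     # norm 10, rescaled to norm 5
```

Rescaling keeps the direction of the update while bounding its size, which is the usual motivation for clipping in recurrent models.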

Qualitative analysis
We first demonstrate that the sentence representation learned by our model exhibits a structure that makes it possible to perform analogical reasoning using simple vector arithmetic, as illustrated in Table 1. It demonstrates that the arithmetic operations on the sentence representations correspond to word-level additions and subtractions. For instance, in the 3rd example, our encoder captures that the difference between sentence B and C is "you" and "him", so that the former word in sentence A is replaced by the latter (i.e., "you"-"you"+"him"="him"), resulting in sentence D.
Table 2 shows nearest neighbors of sentences from a CNN-LSTM autoencoder trained on the BookCorpus dataset. Nearest neighbors are scored by cosine similarity from a random sample of 1 million sentences from the BookCorpus dataset. As can be seen, our encoder learns to accurately capture the semantics and syntax of the sentences.
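Both qualitative probes reduce to simple vector operations on the encoder output. The sketch below shows the element-wise arithmetic used for Table 1 and a cosine-similarity nearest-neighbor lookup of the kind used for Table 2. The function names and the two-dimensional toy vectors are assumptions; decoding the resulting vector back into a sentence with the LSTM is not shown.

```python
import math

def analogy_vector(z_a, z_b, z_c):
    """Element-wise z(A) - z(B) + z(C)."""
    return [a - b + c for a, b, c in zip(z_a, z_b, z_c)]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(query, candidates):
    """Name of the candidate vector with the highest cosine similarity."""
    return max(candidates, key=lambda name: cosine(query, candidates[name]))

z = analogy_vector([1.0, 2.0], [1.0, 0.0], [0.0, 1.0])
best = nearest([1.0, 0.0], {"first": [2.0, 0.1], "second": [0.0, 1.0]})
```

Because cosine similarity ignores vector length, the candidate `[2.0, 0.1]` is preferred even though it is twice as long as the query.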

Quantitative evaluations
Classification benchmarks. We first study the task of sentence classification on 5 datasets: MR (Pang and Lee, 2005), CR (Hu and Liu, 2004), SUBJ (Pang and Lee, 2004), MPQA (Wiebe et al., 2005) and TREC (Li and Roth, 2002). On all the datasets, we separately train a logistic regression model on top of the extracted sentence features. For fair comparison, we restrict our comparison to methods that also aim to learn generic sentence embeddings. We also provide the state-of-the-art results using task-dependent learning methods for reference. Results are summarized in Table 3. Our CNN encoder provides better results than the combine-skip model of  on all the 5 datasets.
We highlight some observations. First, the autoencoder performs better than the future predictor, indicating that the intra-sentence information may be more important for classification than the inter-sentence information. Second, the hierarchical model performs better than the future predictor, demonstrating the importance of capturing long-term dependencies across multiple sentences. Our combined model, which concatenates the feature vectors learned from both the hierarchical model and the composite model, performs the best. This may be because (i) both intra- and long-term inter-sentence information are leveraged, and (ii) it is easier to linearly separate the feature vectors in higher-dimensional spaces. Further, using (fixed) pre-trained word embeddings consistently provides better performance than using the learned word embeddings. This may be because word2vec provides more generic word representations, since it is trained on the large Google News dataset (containing 100 billion words) (Mikolov et al., 2013).
Table 3: Classification accuracies on several standard benchmarks. The last column shows results on the task of paraphrase detection, where the evaluation metrics are classification accuracy and F1 score. † The first and second block in our results were obtained using the first and second method of considering words not in the training set, respectively. ‡ "combine" means concatenating the feature vectors learned from both the hierarchical model and the composite model.
To further demonstrate the advantage of the learned generic representations, we train a CNN classifier (i.e., a CNN encoder with a logistic regression model on top) with two different initialization strategies: random initialization and initialization with trained parameters from the CNN-LSTM composite model. Results are shown in Figure 3: initialization with the trained parameters yields an improvement (3.52% on average) over random initialization of CNN parameters. Figure 3( … learning, with the autoencoder on all the data (labeled and unlabeled), and the classifier only on the labeled data.
Table 4: Results for image-sentence ranking experiments on the COCO dataset. R@K denotes Recall@K (higher is better) and Med r is the median rank (lower is better). (†) taken from . (*) taken from Karpathy and Fei-Fei (2015). (‡) taken from Mao et al. (2015).
Paraphrase detection. Now we consider paraphrase detection on the MSRP dataset (Dolan et al., 2004). On this task, one needs to predict whether or not two sentences are paraphrases. The training set consists of 4076 sentence pairs, and the test set has 1725 pairs. As in Tai et al. (2015), given two sentence representations $z_x$ and $z_y$, we first compute their element-wise product $z_x \odot z_y$ and their absolute difference $|z_x - z_y|$, and then concatenate them together. A logistic regression model is trained on top of the concatenated features to predict whether two sentences are paraphrases. We present our results in the last column of Table 3. Our best result is better than the other results that use task-independent methods.
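The pair representation fed to the logistic regression model is straightforward to write down. The helper name `pair_features` and the toy vectors below are illustrative assumptions.

```python
def pair_features(z_x, z_y):
    """Concatenate the element-wise product and the absolute difference."""
    product = [a * b for a, b in zip(z_x, z_y)]
    abs_diff = [abs(a - b) for a, b in zip(z_x, z_y)]
    return product + abs_diff

features = pair_features([1.0, 2.0], [3.0, -1.0])
swapped = pair_features([3.0, -1.0], [1.0, 2.0])
```

Both components are symmetric in their arguments, so the features do not depend on the order in which the two sentences are presented, and the feature vector has twice the dimension of a single sentence vector.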
Image-sentence ranking. We consider the task of image-sentence ranking, which aims to retrieve items in one modality given a query from the other. We use the COCO dataset (Lin et al., 2014), which contains 123,287 images each with 5 captions. For development and testing we use the same splits as Karpathy and Fei-Fei (2015). The development and test sets each contain 5000 images. We further split them into 5 random sets of 1000 images, and report the average performance over the 5 splits.
Performance is evaluated using Recall@K, which measures the fraction of queries for which a correct item is found within the top-K retrieved results. We also report the median rank of the closest ground truth result in the ranked list. We represent images using 4096-dimensional feature vectors from VggNet (Simonyan and Zisserman, 2015). Each caption is encoded using our trained CNN encoder. The training objective is the same pairwise ranking loss as used in , which takes the form of $\max(0, \alpha - f(x_n, y_n) + f(x_n, y_m))$, where $f(\cdot, \cdot)$ is the image-sentence score, $(x_n, y_n)$ denotes the related image-sentence pair, and $(x_n, y_m)$ is the randomly sampled unrelated image-sentence pair with $n \neq m$. For image retrieval from sentences, $x$ denotes the caption and $y$ denotes the image, and vice versa for sentence retrieval from images. The objective is to force the matching score of the related pair $(x_n, y_n)$ to be greater than that of the unrelated pair $(x_n, y_m)$ by a margin $\alpha$, which is set to 0.1 in our experiments.
Table 4 shows our results. Consistent with previous experiments, we empirically found that the encoder trained using the fixed word embedding performed better on this task, hence only results using this method are reported. As can be seen, we obtain the same median rank as in , indicating that our encoder is as competitive as the skip-thought vectors . The performance gain between our encoder and the combine-skip model of  on the R@1 score is significant, which shows that the CNN encoder has more discriminative power on retrieving the most correct item than the skip-thought vector.
(‡) taken from Tai et al. (2015).
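The ranking objective and the two evaluation measures can be written compactly. The sketch below is a minimal illustration with hypothetical function names; it takes precomputed matching scores and ranks as input, and does not specify how the image-sentence score f itself is computed.

```python
from statistics import median

def ranking_loss(score_related, score_unrelated, margin=0.1):
    """Hinge loss max(0, margin - f(related pair) + f(unrelated pair))."""
    return max(0.0, margin - score_related + score_unrelated)

def recall_at_k(ranks, k):
    """Fraction of queries whose closest correct item is ranked within the top k.

    ranks holds, for each query, the 1-based rank of its closest ground truth.
    """
    return sum(1 for r in ranks if r <= k) / len(ranks)

def median_rank(ranks):
    """Median rank of the closest ground truth over all queries."""
    return median(ranks)

satisfied = ranking_loss(0.9, 0.5)    # related pair wins by more than the margin
violated = ranking_loss(0.9, 0.85)    # margin of 0.1 is not met
ranks = [1, 3, 7, 12]
r_at_5 = recall_at_k(ranks, 5)
med_r = median_rank(ranks)
```

The loss is zero once the related pair outscores the unrelated one by at least the margin, so only violating pairs contribute to the gradient.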
Semantic relatedness. For our final experiment, we consider the task of semantic relatedness on the SICK dataset (Marelli et al., 2014), consisting of 9927 sentence pairs. Given two sentences, our goal is to produce a real-valued score in the range [1, 5] that indicates how semantically related the two sentences are, based on human-generated scores. We compute a feature vector representing the pair of sentences in the same way as on the MSRP dataset. We follow the method in Tai et al. (2015), and use the cross-entropy loss for training. Results are summarized in Table 5. Our result is better than the combine-skip model of . This suggests that the CNN encoder also provides competitive performance at matching human relatedness judgements.
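The text does not spell out how a real-valued score is turned into a target for a cross-entropy loss. In the method of Tai et al. (2015), as recalled here and stated as an assumption, a score y in [1, 5] is converted into a sparse distribution over the integers 1 to 5 whose expected value is y. The function names below are hypothetical.

```python
import math

def score_to_distribution(y, num_classes=5):
    """Sparse distribution over {1, ..., num_classes} with expected value y."""
    p = [0.0] * num_classes
    floor_y = math.floor(y)
    if floor_y == y:
        p[floor_y - 1] = 1.0                # integer score: all mass on one class
    else:
        p[floor_y - 1] = floor_y - y + 1    # mass on the class just below y
        p[floor_y] = y - floor_y            # mass on the class just above y
    return p

def expected_score(p):
    """Predicted score: expectation of the class labels 1..K under p."""
    return sum((i + 1) * p_i for i, p_i in enumerate(p))

def cross_entropy(p, q, eps=1e-12):
    """Cross-entropy between target p and prediction q."""
    return -sum(p_i * math.log(q_i + eps) for p_i, q_i in zip(p, q))

target = score_to_distribution(3.6)
uniform = [0.2] * 5
```

At test time the predicted relatedness score is read off as the expected value of the predicted distribution.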

Conclusion
We presented a new class of CNN-LSTM encoder-decoder models to learn sentence representations from unlabeled text. Our trained convolutional encoder is highly generic, and can be an alternative to the skip-thought vectors of . Compelling experimental results on several tasks demonstrated the advantages of our approach. In future work, we aim to use more advanced CNN architectures (Conneau et al., 2016) for learning generic sentence embeddings.