Connecting Supervised and Unsupervised Sentence Embeddings

Representing sentences as numerical vectors while capturing their semantic context is an important and useful intermediate step in natural language processing. Representations that are both general and discriminative can serve as a tool for tackling various NLP tasks. While common sentence representation methods are unsupervised in nature, an approach for learning universal sentence representations in a supervised setting was recently presented in (Conneau et al., 2017). We argue that although promising results were obtained, an improvement can be reached by adding various unsupervised constraints that are motivated by auto-encoders and by language models. We show that by adding such constraints, superior sentence embeddings can be achieved. We compare our method with the original implementation and show improvements in several tasks.


Introduction
Word embeddings are considered one of the key building blocks in natural language processing and are widely used for various applications (Mikolov et al., 2013; Pennington et al., 2014). While word representations have been used successfully, representing the more complicated and nuanced nature of the next element in the hierarchy, a full sentence, is still considered a challenge. Once trained, universal sentence representations can be used as an out-of-the-box tool for solving various NLP and computer vision problems. Even though their importance is unquestionable, it seems that current results are still far from satisfactory.
More concretely, given a set of sentences $\{s_i\}_{i=1}^{n}$, sentence embedding methods are designed to map them to some feature space $F$ along with a distance metric $M$ such that, given two sentences $s_i$ and $s_j$ that have similar semantic meaning, their distance $M(s_i, s_j)$ is small. The challenge is learning a mapping $T: \{s_i\}_{i=1}^{n} \to F$ that manages to capture the semantics of each $s_i$. While sentence embeddings are not always used in similarity probing, we find this formulation useful, as the similarity assumption is implicitly made when training classifiers on top of the embeddings in downstream tasks.
Sentence embedding methods were mostly trained in an unsupervised setting. In (Le and Mikolov, 2014) the ParagraphVector model was proposed, which is trained to predict words in the document. SkipThought vectors (Kiros et al., 2015) rely on the continuity of text to train an encoder-decoder model that tries to reconstruct the surrounding sentences of a given passage. In Sequential Denoising Autoencoders (SDAE) (Hill et al., 2016), high-dimensional input data is corrupted according to some noise function, and the model is trained to recover the original data from the corrupted version. FastSent (Hill et al., 2016) learns to predict a Bag-of-Words (BOW) representation of adjacent sentences given a BOW representation of some sentence. In (Klein et al., 2015) a Hybrid Gaussian Laplacian density function is fitted to the sentence to derive Fisher Vectors.
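The similarity assumption in this formulation can be illustrated with a minimal sketch. It assumes that $M$ is instantiated as cosine distance (the text does not fix a particular $M$) and uses made-up three-dimensional vectors in place of real embeddings; the function name and the example sentences are hypothetical.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cos(u, v); small when the two vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical embeddings T(s_i); the values are invented for illustration.
emb_a = [0.9, 0.1, 0.3]    # "A man is playing a guitar."
emb_b = [0.8, 0.2, 0.4]    # "Someone plays the guitar."
emb_c = [-0.5, 0.9, 0.0]   # "The stock market fell today."

close = cosine_distance(emb_a, emb_b)  # semantically similar pair
far = cosine_distance(emb_a, emb_c)    # unrelated pair
```

Under a good mapping, `close` should be smaller than `far`, which is what a classifier trained on top of the embeddings implicitly relies on.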
While previous methods train sentence embeddings in an unsupervised manner, recent work (Conneau et al., 2017) argued that better representations can be achieved via supervised training on a general sentence inference dataset (Bowman et al., 2015). To this end, the authors use the Stanford Natural Language Inference (SNLI) dataset (Bowman et al., 2015) to train different sentence embedding methods and compare them on various benchmarks. The SNLI dataset is composed of 570K pairs of sentences with a label depicting the relationship between them, which can be either 'neutral', 'contradiction' or 'entailment'. The authors show that by leveraging the dataset, state-of-the-art representations can be obtained which are universal and general enough for solving various NLP tasks.

Table 1: [...] (Conneau et al., 2017), which is the baseline for our work. AE Reg and LM Reg refer to the Auto-Encoder and Language-Model regularization terms described in 2.1, and Combined refers to optimizing with both terms. Bi-AE Reg and Bi-LM Reg refer to the bi-directional Auto-Encoder and bi-directional Language-Model regularization terms described in 2.2. As evident from the results, adding simple unsupervised regularization terms improves the results of the model on almost all the evaluated tasks.

A different, unsupervised task in NLP is estimating the probability of word sequences. A family of algorithms for this task, termed word language models, models the problem as estimating the probability of a word given the previous words in the text. In (Bengio et al., 2003) neural networks were employed, and (Mikolov et al., 2010) was among the first methods to use recurrent neural networks (RNNs) for modeling the problem, where the probability of a word is estimated based on the previous words fed to the RNN. A variant of the RNN, the Long Short-Term Memory (LSTM) network (Hochreiter and Schmidhuber, 1997), was used in (Sundermeyer et al., 2012). Following that, (Zaremba et al., 2014) proposed a dropout-augmented LSTM.
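The word language modeling objective described above corresponds to the standard chain-rule factorization of the probability of a sequence. It is written here with $w_1, \dots, w_T$ denoting the words of a text of length $T$; the notation is chosen for illustration.

```latex
P(w_1, \dots, w_T) \;=\; \prod_{t=1}^{T} P\left(w_t \mid w_1, \dots, w_{t-1}\right)
```

An RNN language model approximates each factor by conditioning on a hidden state that summarizes the words $w_1, \dots, w_{t-1}$ read so far.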
We note that there exists a connection between those two problems and try to model it more explicitly. Recently, the incorporation of the hidden states of neural language models in downstream supervised-learning models has been shown to improve the results of the latter (e.g. ELMo, Peters et al. (2018) (2017)). In this work, we jointly train the unsupervised and supervised tasks. To this end, we incorporate unsupervised regularization terms motivated by language modeling and auto-encoders in the training framework proposed by (Conneau et al., 2017). We test our proposed model on a set of NLP tasks and show improved results over the baseline framework of (Conneau et al., 2017).

Method
Our approach builds upon the previous work of (Conneau et al., 2017). Specifically, we use their BiLSTM model with max pooling.
More concretely, given a sequence of $T$ words $\{w_t\}_{t=1,\dots,T}$ with given word embeddings (Mikolov et al., 2013; Pennington et al., 2014) $\{v_t\}_{t=1,\dots,T}$, a bidirectional LSTM computes a set of $T$ vectors $\{h_t\}_{t=1,\dots,T}$, where each $h_t$ is the concatenation of the hidden states of a forward LSTM and a backward LSTM that read the sentence in opposite directions. We denote by $\{\overrightarrow{h}_t\}$ and $\{\overleftarrow{h}_t\}$ the hidden states of the forward and backward LSTMs respectively, where $t = 1, \dots, T$. The final sentence representation is obtained by taking the maximal value of each dimension of the hidden units $\{h_t\}$ (i.e., max pooling). The original model of (Conneau et al., 2017) was trained on the SNLI dataset in a supervised fashion: given a pair of sentences $s_1$ and $s_2$, denote their representations by $\bar{s}_1$ and $\bar{s}_2$. During training, the concatenation of $\bar{s}_1$, $\bar{s}_2$, $|\bar{s}_1 - \bar{s}_2|$ and $\bar{s}_1 * \bar{s}_2$ is fed to a three-layer fully connected network followed by a softmax classifier.
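The pooling and feature-construction steps can be sketched in a few lines of dependency-free Python. This is only an illustration under stated assumptions: the hidden states are made-up toy values rather than BiLSTM outputs, the helper names `max_pool` and `pair_features` are hypothetical, and `*` is taken to be the element-wise product.

```python
def max_pool(hidden_states):
    """Element-wise max over time: T vectors of size l -> one vector of size l."""
    return [max(dim_values) for dim_values in zip(*hidden_states)]

def pair_features(s1, s2):
    """Concatenate s1, s2, |s1 - s2| and the element-wise product s1 * s2."""
    abs_diff = [abs(a - b) for a, b in zip(s1, s2)]
    product = [a * b for a, b in zip(s1, s2)]
    return s1 + s2 + abs_diff + product

# Toy hidden states: T = 3 time steps, 2 dimensions each (values are invented).
h_first = [[0.1, -0.4], [0.7, 0.2], [0.3, 0.5]]
h_second = [[0.0, 0.6], [-0.2, 0.1], [0.4, -0.3]]

s1 = max_pool(h_first)    # sentence representation of the first sentence
s2 = max_pool(h_second)   # sentence representation of the second sentence
features = pair_features(s1, s2)  # input to the fully connected classifier
```

The feature vector is four times the size of a single sentence representation, so the classifier sees both sentences as well as two simple measures of how they differ.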

Regularization terms
We note that by training only on SNLI, the model might overfit and would not be general enough to provide universal sentence embeddings. We devise several regularization criteria that incentivize the hidden states to maintain more information about the input sequence.
Specifically, denote the dimension of the word embedding by $d$ and the dimension of the hidden state by $l$. We add a linear transformation layer $L_{l \times d}: H \to W$ on top of the BiLSTM to transform the hidden states back to the dimension of the word embeddings and denote its output by $\{\hat{w}_t\}_{t=1,\dots,T}$. Recall that in the training process, we minimize the negative log-likelihood loss of the fully connected network predictions, which we denote by $y_i$, where $y_{gt}$ is the prediction score given to the correct ground truth class. The total loss criterion with our regularization term can then be written in two variants, (1) and (2), where the first term in both is the original classification loss. We call the second term in (1) an auto-encoder regularization term and in (2) a language model regularization term. Intuitively, since each $\hat{w}_t$ is obtained by a linear transformation of $h_t$, the regularization forces the hidden state $h_t$ to maintain enough information on each $w_t$ such that it can be reconstructed from $h_t$, or such that the following word $w_{t+1}$ can be predicted from $h_t$. This aids in obtaining a more general sentence representation and mitigates the risk of overfitting to the SNLI training set. The constant $\lambda$ in (1) and (2) is a hyper-parameter that controls the amount of regularization and was set to 1 in our experiments.
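One form of (1) and (2) that is consistent with the definitions above is sketched here. It assumes that the reconstruction error is a squared Euclidean distance to the word embeddings $v_t$ and that it is averaged over the time steps; the choice of norm and the normalization are assumptions, as is the use of $\hat{w}_t$ for the output of the linear layer.

```latex
\begin{align}
\mathcal{L}_{\mathrm{AE}} &= -\log\frac{e^{y_{gt}}}{\sum_i e^{y_i}}
  + \frac{\lambda}{T}\sum_{t=1}^{T}\left\lVert \hat{w}_t - v_t \right\rVert_2^2 \tag{1}\\
\mathcal{L}_{\mathrm{LM}} &= -\log\frac{e^{y_{gt}}}{\sum_i e^{y_i}}
  + \frac{\lambda}{T-1}\sum_{t=1}^{T-1}\left\lVert \hat{w}_t - v_{t+1} \right\rVert_2^2 \tag{2}
\end{align}
```

In both cases the first term is the softmax classification loss over the scores $y_i$, and the second term vanishes only when the hidden state determines the current (1) or the next (2) word embedding through the linear layer.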
We have also experimented with combining the two terms, giving equal weight to each of them in optimizing the model.
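The two regularization terms can be made concrete with a small dependency-free sketch. It assumes a squared Euclidean error averaged over time steps, which is a choice made for illustration; the helper names are hypothetical, and a real implementation would operate on tensors inside the training loop.

```python
def linear(matrix, vec):
    """Apply a d x l matrix to a length-l hidden state, giving a length-d vector."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]

def sq_dist(a, b):
    """Squared Euclidean distance (an assumed choice of reconstruction error)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def ae_reg(hidden, embeddings, matrix):
    """Auto-encoder term: reconstruct the current embedding v_t from h_t."""
    preds = [linear(matrix, h) for h in hidden]
    return sum(sq_dist(p, v) for p, v in zip(preds, embeddings)) / len(preds)

def lm_reg(hidden, embeddings, matrix):
    """Language-model term: predict the next embedding v_{t+1} from h_t."""
    preds = [linear(matrix, h) for h in hidden[:-1]]
    targets = embeddings[1:]
    return sum(sq_dist(p, v) for p, v in zip(preds, targets)) / len(preds)

def total_loss(class_loss, reg, lam=1.0):
    """Classification loss plus a lambda-weighted regularization term."""
    return class_loss + lam * reg

# Toy example with l = d = 2, an identity linear layer and invented values.
identity = [[1, 0], [0, 1]]
hidden = [[1, 0], [0, 1], [1, 1]]
embeddings = [[1, 0], [0, 1], [1, 1]]

ae = ae_reg(hidden, embeddings, identity)
lm = lm_reg(hidden, embeddings, identity)
combined = total_loss(0.5, ae) + lm  # equal weight on both terms
```

Here the hidden states already equal the embeddings, so the auto-encoder term is zero while the language-model term is not, which shows that the two criteria ask the hidden state to retain different information.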

Bi-directional Regularization terms
Similarly to the regularization terms described in 2.1, we devise bi-directional variants of (1) and (2), re-written as (3) and (4). We call the second regularization term in (3) a bi-directional auto-encoder regularization term and in (4) a bi-directional language model regularization term. Again, $\lambda_1$ and $\lambda_2$ are hyper-parameters controlling the amount of regularization and were set to 0.5 in our experiments.
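A possible form of (3) and (4) is sketched here under several assumptions that the text does not confirm: separate linear layers are applied to the forward and backward hidden states, giving $\overrightarrow{\hat{w}}_t$ and $\overleftarrow{\hat{w}}_t$; the error is a squared Euclidean distance averaged over time steps; and in the language model variant the backward direction predicts the previous word.

```latex
\begin{align}
\mathcal{L}_{\mathrm{Bi\text{-}AE}} &= -\log\frac{e^{y_{gt}}}{\sum_i e^{y_i}}
  + \frac{\lambda_1}{T}\sum_{t=1}^{T}\left\lVert \overrightarrow{\hat{w}}_t - v_t \right\rVert_2^2
  + \frac{\lambda_2}{T}\sum_{t=1}^{T}\left\lVert \overleftarrow{\hat{w}}_t - v_t \right\rVert_2^2 \tag{3}\\
\mathcal{L}_{\mathrm{Bi\text{-}LM}} &= -\log\frac{e^{y_{gt}}}{\sum_i e^{y_i}}
  + \frac{\lambda_1}{T-1}\sum_{t=1}^{T-1}\left\lVert \overrightarrow{\hat{w}}_t - v_{t+1} \right\rVert_2^2
  + \frac{\lambda_2}{T-1}\sum_{t=2}^{T}\left\lVert \overleftarrow{\hat{w}}_t - v_{t-1} \right\rVert_2^2 \tag{4}
\end{align}
```

With $\lambda_1 = \lambda_2 = 0.5$, the two directions contribute equally and the total weight of the regularization matches the $\lambda = 1$ used in the uni-directional variants.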
Our results are summarized in Table 1. We compared our method against the baseline BiLSTM implementation of (Conneau et al., 2017) and included FastSent (Hill et al., 2016) and SkipThought vectors (Kiros et al., 2015) as a reference.
As evident from Table 1, adding the proposed regularization terms improves performance in almost all the tasks evaluated. This serves to show that in a supervised learning setting, additional information on the input sequence can be leveraged and injected into the model by adding simple unsupervised loss criteria.

Conclusions
In our work, we have sought to connect unsupervised and supervised learning in the context of sentence embeddings. Leveraging supervision given by some general task aided in obtaining state-of-the-art sentence representations (Conneau et al., 2017). However, every supervised learning task is prone to overfitting. In this context, overfitting to the learning task will result in a model that generalizes less well to new tasks.
We alleviate this problem by incorporating into the model's loss function unsupervised regularization criteria that are motivated by auto-encoders and language models. We note that the added regularization terms do come at the price of increasing the model size by $ld$ parameters (where $d$ and $l$ are the dimensions of the word embedding and the LSTM hidden state, respectively) due to the added linear transformation (see 2.1). However, as evident from our results, this does not hinder the model performance, even though we did not increase the amount of training data. Moreover, since those terms are unsupervised in nature, it is possible to pre-train the model on unlabeled data and then fine-tune it on the SNLI dataset.
In conclusion, our experiments show that adding the proposed regularization terms results in a more general model and superior sentence embeddings. This validates our assumption that while a supervised signal is general enough for learning sentence embeddings, it can be further improved by incorporating a second, unsupervised signal.