InferLite: Simple Universal Sentence Representations from Natural Language Inference Data

Natural language inference has been shown to be an effective supervised task for learning generic sentence embeddings. In order to better understand the components that lead to effective representations, we propose a lightweight version of InferSent, called InferLite, that does not use any recurrent layers and operates on a collection of pre-trained word embeddings. We show that a simple instance of our model that makes no use of context, word ordering or position can still obtain competitive performance on the majority of downstream prediction tasks, with most performance gaps being filled by adding local contextual information through temporal convolutions. Our models can be trained in under 1 hour on a single GPU and allow for fast inference of new representations. Finally, we describe a semantic hashing layer that allows our model to learn generic binary codes for sentences.


Introduction
Distributed representations of words have become immensely successful as the building blocks for deep neural networks applied to a wide range of natural language processing tasks (Pennington et al., 2014). Learning representations of sentences, however, has largely been done in a task-dependent way. In recent years, a growing body of research has emerged for learning general purpose sentence embeddings. These methods aim to learn a universal encoding function that can map arbitrary sentences into vectors, which can then be applied to downstream prediction tasks without fine-tuning. Much of the motivation behind this work is to mimic the successful use of feature transfer in computer vision.
Recently, Conneau et al. (2017) showed that a bidirectional LSTM with max pooling trained to perform Natural Language Inference (NLI), called InferSent, outperforms several other encoding functions on a suite of downstream prediction tasks. This method could match or outperform existing models that learn generic embeddings in an unsupervised setting, which often require several days or weeks to train. However, a better understanding of what properties induce a useful generic embedding remains elusive.
In this work we propose a lightweight version of InferSent, called InferLite. InferLite deviates from InferSent in that it does not use any recurrent connections and can generalize to multiple pre-trained word embeddings. Our method uses a controller to dynamically weight embeddings for each word, followed by max pooling over components to obtain the final sentence representation. Despite its simplicity, our method obtains performance on par with InferSent (Conneau et al., 2017) when using GloVe representations (Pennington et al., 2014) as the source of pre-trained word vectors. To our surprise, competitive results can be obtained on the majority of evaluations without any notion of context, word ordering or position. For tasks where such information is useful, much of the performance gap can be made up through a stack of convolutional layers that incorporate local context. Finally, we describe a semantic hashing layer that allows our model to be extended to learning generic binary vectors. The final result is a method that is fast at both training and inference and offers a strong baseline for future research on general purpose embeddings.

Why learn lightweight encoders?
Our proposed model naturally raises a question: why consider lightweight sentence encoders? If a generic encoder only needs to be trained once, why would training times be relevant? We argue our direction is important for two reasons. The first is inference speed. With a lightweight encoder, we can encode millions of sentences efficiently without requiring extensive computational resources. The appendix includes inference speeds of our models. The second, and perhaps more important, is to gain a better understanding of what properties lead to high quality generic embeddings. When models take several days or weeks to train, an ablation analysis becomes prohibitively costly. Since our models can be trained quickly, they allow for a more extensive analysis of architectural and data necessities. Moreover, we include an ablation study in the appendix that shows even innocent or seemingly irrelevant model decisions can have a drastic effect on performance. Such observations could not be made when models take orders of magnitude longer to train.

Related Work
A large body of work on distributional semantics has considered encoding phrase and sentence meaning into vectors, e.g. (Mitchell and Lapata, 2008; Grefenstette et al., 2013; Paperno et al., 2014). The first attempt at using neural networks for learning generic sentence embeddings was skip-thoughts, a sequence-to-sequence extension of the skip-gram model applied at the sentence level. This method was trained to encode a sentence and predict its neighbours, harnessing a large collection of books for training. A similar approach, FastSent, was proposed by Hill et al. (2016), which replaced the RNN encoder of skip-thoughts with word embedding summation. Methods using RNN encoders tend to perform poorly on STS evaluations, as shown by Wieting et al. (2015). Arora et al. (2017) showed that a simple weighted bag of words with the first principal component subtracted can be competitive on many sentence encoding tasks.
Attempts to learn generic encoders with discriminative objectives were considered by Nie et al. (2017) and Logeswaran and Lee (2018), who replaced the decoder of skip-thoughts with classification tasks based on discourse relations and prediction of target sentences from an encoded candidate. All of the above methods relied on a large corpus of unlabelled data. Conneau et al. (2017) showed that similar or improved performance can be obtained using NLI datasets as a source of supervisory information. State-of-the-art sentence encoders utilize multi-task learning (Subramanian et al., 2018), training an encoder to simultaneously do well on a collection of tasks such as NLI, next sentence prediction and translation.
The use of gating for selecting word representations has been considered in previous work. Yang et al. (2017) introduced a method for choosing between word and character embeddings, while other work introduced a method for word embedding selection. Gating has also been widely applied to multimodal fusion (Arevalo et al., 2017; Wang et al., 2018b; Kiros et al., 2018). Our work is also related to recent methods that induce contextualized word representations (McCann et al., 2017; Peters et al., 2018) as well as pre-training language models for task-dependent fine-tuning (Dai and Le, 2015; Howard and Ruder, 2018; Radford et al., 2018). We differ from these approaches in that we aim to infer a transferable sentence vector without any additional fine-tuning.

Method
Our method operates on a collection of pre-trained word representations and is then trained on the concatenation of the SNLI (Bowman et al., 2015) and MultiNLI (Williams et al., 2018) datasets, as in Conneau et al. (2017). Table 1 summarizes the properties of the embeddings we consider. At a high level, our method takes as input a collection of embeddings for each word and learns a gated controller to decide how to weight each representation. After encoding each word in a sentence, the sentence embedding is obtained by max pooling the transformed word representations. Unlike Subramanian et al. (2018), who learn a shared encoder in a multi-task setting, we instead fix the prediction task to NLI but use embeddings obtained from alternative tasks. Figure 1 illustrates our model.
We begin by defining notation. Suppose we are given a sentence of words $S = w_1, \ldots, w_T$ which we would like to encode into a vector. Let $K$ be the number of embedding types (e.g. GloVe, News, Query) and let $E^k$ denote the word embedding matrix for type $k$. Define $E^c = [E^1; \ldots; E^K]$ to be the concatenation of word embedding matrices of all $K$ types. We break our model description into four modules: Encoder, Controller, Fusion and Reduction. In the appendix we include an ablation study that analyzes the effect of our design choices.
Encoder. The encoder computes $M + 1$ layers $H^k_0, \ldots, H^k_M$ for each embedding type $k$. The first layer is computed as:
$$H^k_0 = \phi_{0,h}(W^k_{0,h} E^k)$$
where $W^k_{0,h} E^k$ is a time distributed matrix multiply and $\phi_{0,h}$ is the activation function. Each subsequent layer is given by:
$$H^k_i = \phi_{i,h}(W^k_{i,h} * H^k_{i-1}), \quad i = 1, \ldots, M$$
where $*$ denotes the 1-D convolution operator that preserves dimensions. Note that if the convolutional filter length is 1, the model reduces to a bag-of-words encoder. We use ReLU activation functions for $\phi_{i,h}$ where $i = 0, \ldots, M - 1$ and a tanh activation for the last layer $\phi_{M,h}$.
Controller. The controller first computes a shared layer $G^c_0$ along with $M$ heads $G^k_1, \ldots, G^k_M$ for each of the $k = 1, \ldots, K$ embedding types.
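The first layer is computed as:
$$G^c_0 = \phi_{0,g}(W^c_{0,g} E^c)$$
where $W^c_{0,g} E^c$ is a time distributed matrix multiply and $\phi_{0,g}$ is the activation function. Define $G^k_0 = G^c_0$ for each embedding type $k$. Each subsequent layer is given by:
$$G^k_i = \phi_{i,g}(W^k_{i,g} * G^k_{i-1}), \quad i = 1, \ldots, M$$
We use ReLU activation functions for $\phi_{i,g}$ where $i = 0, \ldots, M - 1$ and a sigmoid activation for the last layer $\phi_{M,g}$.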
Fusion. The fusion layer combines the encoder and controller layers as follows:
$$F_0 = \sum_{k=1}^{K} H^k_M \odot G^k_M, \qquad F = \phi_f(W_f F_0) + G^c_0$$
where $\odot$ denotes a component-wise product, $W_f F_0$ is a time distributed matrix multiply, $\phi_f$ is a ReLU activation function and $G^c_0$ is added as a skip connection. In the appendix we demonstrate that the added skip connection is crucial to the success of the model.
Reduction. The final reduction operation simply applies max pooling across tokens:
$$s = \max_{t = 1, \ldots, T} F_t$$
resulting in a sentence vector $s$. This vector is the embedding with which we evaluate all downstream tasks.
For training on NLI, we follow existing work and compute the concatenation of the embeddings of the premise and hypothesis sentences along with their component-wise product and absolute difference (Conneau et al., 2017). This joint vector is fed into a feedforward network with 2 hidden layers and ReLU activations, followed by a softmax layer to predict whether the sentence pair is neutral, entailed or contradictory. After training on NLI, the weights of the model are frozen and used for encoding new sentences.
We note that our encoder module is closely related to the gated convolutional encoder in van den Oord et al. (2016), to which it reduces if we use one embedding type, remove the time distributed layers and only use a single convolutional layer. There are three main differences: 1) we generalize to multiple embedding types, 2) we only apply gating at the end of the last layer as a way of weighting all embedding types (instead of at each layer) and 3) we use a skip connection from the controller's transformed input to the fusion layer.
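The four modules and the NLI feature vector can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the weights are random rather than trained, biases are omitted, the gated outputs of the embedding types are assumed to be summed in the fusion step, and all names (`conv1d_same`, `init_params`, `inferlite_encode`, `nli_features`) and sizes are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv1d_same(x, w):
    """Length-preserving 1-D convolution over time (cross-correlation form).

    x: (T, d_in) token features; w: (width, d_in, d_out) filter with odd width.
    """
    width, T = w.shape[0], x.shape[0]
    pad = width // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))  # zero-pad the time axis
    return sum(xp[j:j + T] @ w[j] for j in range(width))

def init_params(rng, dims, d, M, width, scale=0.1):
    """Random weights for K embedding types with input sizes `dims` (assumption)."""
    def conv_stack():
        return [scale * rng.standard_normal((width, d, d)) for _ in range(M)]
    return {
        "enc_in": [scale * rng.standard_normal((dk, d)) for dk in dims],
        "enc_conv": [conv_stack() for _ in dims],
        "ctl_in": scale * rng.standard_normal((sum(dims), d)),
        "ctl_conv": [conv_stack() for _ in dims],
        "fuse": scale * rng.standard_normal((d, d)),
    }

def inferlite_encode(embs, p):
    """embs: list of K arrays, each (T, d_k). Returns a sentence vector of size d."""
    M = len(p["enc_conv"][0])
    # Controller layer 0: shared across types, computed from concatenated embeddings.
    g0 = relu(np.concatenate(embs, axis=1) @ p["ctl_in"])
    fused = 0.0
    for k, e in enumerate(embs):
        h = relu(e @ p["enc_in"][k])  # encoder layer 0 (time distributed multiply)
        g = g0                        # every controller head starts from the shared layer
        for i in range(M):
            last = i == M - 1
            h_pre = conv1d_same(h, p["enc_conv"][k][i])
            g_pre = conv1d_same(g, p["ctl_conv"][k][i])
            h = np.tanh(h_pre) if last else relu(h_pre)  # tanh on the last encoder layer
            g = sigmoid(g_pre) if last else relu(g_pre)  # sigmoid gate on the last head
        fused = fused + h * g         # assumed: sum of gated outputs over types
    f = relu(fused @ p["fuse"]) + g0  # fusion with skip connection
    return f.max(axis=0)              # reduction: max pool across tokens

def nli_features(u, v):
    """Premise/hypothesis features: [u, v, u * v, |u - v|]."""
    return np.concatenate([u, v, u * v, np.abs(u - v)])

rng = np.random.default_rng(0)
dims = [300, 500]  # hypothetical embedding sizes for two types
p = init_params(rng, dims, d=64, M=3, width=3)
embs = [rng.standard_normal((7, dk)) for dk in dims]  # a 7-token sentence
s = inferlite_encode(embs, p)
print(s.shape)  # (64,)
```

With `width=1` every token is transformed independently, so permuting the tokens leaves the pooled vector unchanged. This is the bag-of-words behaviour noted for filters of length 1.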

Semantic hashing
We can augment InferLite with a semantic hashing (Salakhutdinov and Hinton, 2009) layer as a way of learning binary codes for sentences. Binary codes allow for efficient storage and retrieval over massive corpora. To do this, we append a layer of the form $\sigma(\mathrm{LN}(\cdot) / \tau)$, where LN is Layer Normalization (Ba et al., 2016), $\sigma$ is the sigmoid activation and $\tau$ is a temperature hyperparameter. We initialize $\tau = 1$ at the beginning of training and exponentially decay $\tau$ towards 0 over the course of training. At inference time, we threshold at 0.5 to obtain codes. We found Layer Normalization was important for obtaining good codes, as otherwise many dead units would form. In the appendix we include downstream performance results for 256, 1024 and 4096-bit codes. The combination of fast inference and efficient storage allows InferLite to be an effective generic encoder for large-scale retrieval and similarity search.
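A temperature-scaled hashing layer of this kind can be sketched as below. The sketch rests on several assumptions that the text does not confirm: a linear projection `w` maps the sentence vector to the code size before normalization, Layer Normalization is used without learned gain and bias, and the decay schedule with its `tau_end` floor is hypothetical. All function names are illustrative.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Layer normalization over the last axis, without learned gain and bias."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def soft_codes(s, w, tau):
    """Training-time relaxation: sigmoid(LN(s @ w) / tau), values in (0, 1)."""
    z = layer_norm(s @ w) / tau
    return 1.0 / (1.0 + np.exp(-z))

def temperature(step, total_steps, tau_start=1.0, tau_end=0.01):
    """Exponential decay from tau_start towards tau_end (tau_end is an assumption)."""
    return tau_start * (tau_end / tau_start) ** (step / total_steps)

def binary_codes(s, w, tau=1.0):
    """Inference: threshold the sigmoid output at 0.5 to obtain bits."""
    return (soft_codes(s, w, tau) > 0.5).astype(np.uint8)

rng = np.random.default_rng(0)
s = rng.standard_normal((4, 64))         # 4 sentence vectors (toy size)
w = 0.1 * rng.standard_normal((64, 16))  # assumed projection to 16-bit codes
codes = binary_codes(s, w)
packed = np.packbits(codes, axis=1)      # 16 bits stored in 2 bytes per sentence
print(codes.shape, packed.shape)         # (4, 16) (4, 2)
```

As the temperature shrinks, the sigmoid outputs saturate towards 0 or 1, so the relaxed codes seen during training approach the thresholded codes used at inference. Thresholding at 0.5 depends only on the sign of the normalized pre-activation, so the bits themselves do not change with the temperature.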

Experiments
We use the SentEval toolkit (Conneau and Kiela, 2018) for evaluating our sentence embeddings. All of our models are trained to optimize performance on the concatenation of SNLI and MultiNLI, using the concatenated development sets for early stopping. We use 4096-dimensional embeddings as in Conneau et al. (2017). We consider encoders that use convolutional filters of length 1 (no context) or length 3 (local context), with a stack of M = 3 convolutional layers. All word embeddings are pre-trained, normalized to unit length and held fixed during training. Full hyperparameter details are included in the appendix, including an ablation study comparing the effect of the choice of M.
We first analyze the performance of our model on NLI prior to evaluating our models on downstream tasks. Figure 2 shows development set accuracy on NLI for models with and without context, using various feature combinations. Here we observe that a) using local context improves NLI performance and b) adding additional embedding types leads to improved performance. Tables 2 and 3 show results on downstream evaluation tasks. Here several observations can be made. First, note the effectiveness of the basic (glove,1) model, which is essentially a deep bag-of-unigram encoder. We also observe that our models outperform all previous bag of words baselines. Next, we observe that adding local context helps significantly on the MR, CR, SST2 and TREC tasks. Furthermore, fusing embeddings from query and news models matches or improves performance over a glove-only model on 12 out of 15 tasks. Our (glove+news+query,3) model is best on 5 tasks and is a generally strong performer across all evaluations. Finally, observe that our models significantly improve over previous work on STS tasks.
Next we compare training times of our models to previous work. All of our models can be trained in one GPU hour or less. QuickThoughts and InferSent can be trained on the order of a day while Multitask requires 1 week of training. This demonstrates the trade-off of these approaches.
In the appendix we include results from several other experiments, including COCO image-sentence retrieval, downstream performance of InferLite with semantic hashing and results on 10 previously introduced probing tasks. We also do an extensive ablation study of model components and illustrate gate activation values qualitatively for sentences from the (glove+news+query,3) model.

Limitations
We also experimented with additional embedding types, including Picturebook (Kiros et al., 2018), knowledge graph and neural machine translation based embeddings. While adding these embeddings improved performance on NLI, they did not lead to any performance gains on downstream tasks. This is in contrast to Subramanian et al. (2018) who showed adding additional tasks in a multi-task objective led to better downstream performance. This demonstrates the limitations of solely using NLI as an objective, even if we transfer embeddings from additional tasks.

Future Work
In future work, we would like to explore using contextualized word embeddings, such as CoVe (McCann et al., 2017) and ELMo (Peters et al., 2018), as input to our models as opposed to non-contextualized representations. We also intend to evaluate on additional benchmark tasks such as GLUE (Wang et al., 2018a), explore using the learned word representations as contextualized embeddings and perform downstream fine-tuning.