Universal Sentence Encoder for English

We present easy-to-use TensorFlow Hub sentence embedding models having good task transfer performance. Model variants allow for trade-offs between accuracy and compute resources. We report the relationship between model complexity, resources, and transfer performance. Comparisons are made with baselines without transfer learning and to baselines that incorporate word-level transfer. Transfer learning using sentence-level embeddings is shown to outperform models without transfer learning and often those that use only word-level transfer. We show good transfer task performance with minimal training data and obtain encouraging results on word embedding association tests (WEAT) of model bias.


Introduction
We present easy-to-use sentence-level embedding models with good transfer task performance even when using remarkably little training data. Model engineering characteristics allow for trade-offs between accuracy and memory and compute resource consumption.

Model Toolkit
Models are implemented in TensorFlow (Abadi et al., 2016) and are made publicly available on TensorFlow Hub. Listing 1 provides an example code snippet that computes a sentence-level embedding from a raw untokenized input string. The resulting embedding can be used directly or incorporated into a downstream model for a specific task.
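The listing itself was lost in extraction; a minimal sketch consistent with its description might look like the following. The module URL and handle-based hub.Module API reflect TF Hub 1.x conventions and should be verified against the current TensorFlow Hub module page.

```python
import tensorflow as tf
import tensorflow_hub as hub

# Load the Universal Sentence Encoder module from TensorFlow Hub.
# (TF 1.x hub.Module API; newer TF Hub releases use hub.load instead.)
embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/1")

# Compute a fixed-size embedding for a raw, untokenized input string.
embeddings = embed(["The quick brown fox jumps over the lazy dog."])

with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    vec = sess.run(embeddings)  # one embedding row per input sentence
```

The module handles tokenization internally, which is why a raw string can be passed directly.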

Encoders
Two sentence encoding models are provided: (i) transformer (Vaswani et al., 2017), which achieves high accuracy at the cost of greater resource consumption; (ii) deep averaging network (DAN) (Iyyer et al., 2015), which performs efficient inference but with reduced accuracy.

Transformer
The transformer sentence encoding model constructs sentence embeddings using the encoding sub-graph of the transformer architecture (Vaswani et al., 2017). The encoder uses attention to compute context-aware representations of words in a sentence that take into account both the ordering and identity of the other words. The context-aware word representations are averaged together to obtain a sentence-level embedding.
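The final pooling step can be sketched as follows: per-token transformer outputs are reduced to a fixed-size sentence vector by element-wise averaging. This is an illustrative sketch, not the released implementation; the toy vectors and dimensionality are invented.

```python
import numpy as np

def pool_sentence_embedding(token_reprs):
    """Average context-aware token representations (n_tokens x d)
    into a single sentence-level embedding of dimension d."""
    token_reprs = np.asarray(token_reprs, dtype=float)
    return token_reprs.mean(axis=0)

# Three toy 4-dimensional "context-aware" token vectors.
tokens = [[1.0, 0.0, 2.0, 0.0],
          [3.0, 0.0, 0.0, 2.0],
          [2.0, 3.0, 1.0, 1.0]]
sentence_vec = pool_sentence_embedding(tokens)  # -> [2.0, 1.0, 1.0, 1.0]
```

Averaging makes the output dimensionality independent of sentence length, which is what lets a single downstream classifier consume variable-length inputs.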
We train for broad coverage using multi-task learning, with the same encoding model supporting multiple downstream tasks. The task types include: a Skip-Thought-like task (Kiros et al., 2015); conversational response prediction (Henderson et al., 2017); and a supervised classification task that improves sentence embeddings. The transformer encoder achieves the best transfer performance. However, this comes at the cost of compute time and memory usage scaling dramatically with sentence length.

Deep Averaging Network (DAN)
The DAN sentence encoding model begins by averaging together word and bi-gram level embeddings. Sentence embeddings are then obtained by passing the averaged representation through a feedforward deep neural network (DNN). The DAN encoder is trained similarly to the transformer encoder: multi-task learning trains a single DAN encoder to support multiple downstream tasks. An advantage of the DAN encoder is that compute time is linear in the length of the input sequence. Similar to Iyyer et al. (2015), our results demonstrate that DANs achieve strong baseline performance on text classification tasks.
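The DAN forward pass described above can be sketched in a few lines. This is a hedged illustration under invented dimensions and random weights, not the trained encoder: average the unigram and bi-gram embeddings, then apply feed-forward layers.

```python
import numpy as np

def dan_encode(word_vecs, bigram_vecs, layers):
    """Sketch of a DAN forward pass: average unigram and bi-gram
    embeddings, then apply feed-forward layers with tanh activations."""
    inputs = np.vstack([word_vecs, bigram_vecs])
    h = inputs.mean(axis=0)          # averaged input representation
    for w, b in layers:              # feedforward DNN
        h = np.tanh(w @ h + b)
    return h

rng = np.random.default_rng(0)
d = 8
words = rng.normal(size=(5, d))      # 5 toy word embeddings
bigrams = rng.normal(size=(4, d))    # 4 toy bi-gram embeddings
layers = [(rng.normal(size=(d, d)), np.zeros(d)) for _ in range(2)]

emb = dan_encode(words, bigrams, layers)
assert emb.shape == (d,)
```

Because the averaging step is a single pass over the tokens and the DNN cost is fixed, total compute is linear in sentence length, which is the efficiency advantage noted above.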

Encoder Training Data
Unsupervised training data are drawn from a variety of web sources. The sources are Wikipedia, web news, web question-answer pages and discussion forums. We augment unsupervised learning with training on supervised data from the Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015) in order to further improve our representations (Conneau et al., 2017). Since the only supervised training data is SNLI, the models can be used for a wide range of downstream supervised tasks that do not overlap with this dataset.

Transfer Tasks
This section presents the data used for the transfer learning experiments and word embedding association tests (WEAT): (MR) Movie review sentiment on a five star scale (Pang and Lee, 2005); (CR) Sentiment of customer reviews (Hu and Liu, 2004); (SUBJ) Subjectivity of movie reviews and plot summaries (Pang and Lee, 2004); (MPQA) Phrase opinion polarity from news data (Wiebe et al., 2005); (TREC) Fine grained question classification sourced from TREC (Li and Roth, 2002); (SST) Binary phrase sentiment classification (Socher et al., 2013); (STS Benchmark) Semantic textual similarity (STS) between sentence pairs scored by Pearson r with human judgments (Cer et al., 2017); (WEAT) Word pairs from the psychology literature on implicit association tests (IAT) that are used to characterize model bias (Caliskan et al., 2017). Table 1 gives the number of samples for each transfer task.

Transfer Learning Models
For sentence classification transfer tasks, the output of the sentence encoders is provided to a task-specific DNN. For the pairwise semantic similarity task, the similarity of sentence embeddings u and v is assessed using the angular similarity sim(u, v) = 1 − arccos( u · v / (||u|| ||v||) ) / π.
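A minimal sketch of this angular-similarity computation (plain NumPy, illustrative only):

```python
import numpy as np

def angular_similarity(u, v):
    """Angular similarity: 1 - arccos(cosine(u, v)) / pi.
    Returns 1.0 for identical directions and 0.0 for opposite ones."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    cos = np.clip(cos, -1.0, 1.0)  # guard against floating-point drift
    return 1.0 - np.arccos(cos) / np.pi

u = np.array([1.0, 0.0])
print(angular_similarity(u, np.array([1.0, 0.0])))   # identical   -> 1.0
print(angular_similarity(u, np.array([0.0, 1.0])))   # orthogonal -> 0.5
print(angular_similarity(u, np.array([-1.0, 0.0])))  # opposite   -> 0.0
```

Using the angle rather than raw cosine maps similarity onto [0, 1] and is more sensitive to small angular differences between nearly parallel embeddings.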

Baselines
For each transfer task, we include baselines that only make use of word-level transfer and baselines that make use of no transfer learning at all. For word-level transfer, we incorporate word embeddings from a word2vec skip-gram model trained on a corpus of news data (Mikolov et al., 2013). The pretrained word embeddings are included as input to two model types: a convolutional neural network (CNN) model (Kim, 2014) and a DAN. The baselines that use pretrained word embeddings allow us to contrast word- vs. sentence-level transfer. Additional baseline CNN and DAN models are trained without using any pretrained word or sentence embeddings. For reference, we compare with InferSent (Conneau et al., 2017) and Skip-Thought with layer normalization (Ba et al., 2016) on sentence-classification tasks. On the STS Benchmark, we compare with InferSent and the state-of-the-art neural STS systems CNN (HCTI) (Shao, 2017) and gConv (Yang et al., 2018).

Combined Transfer Models
We explore combining the sentence- and word-level transfer models by concatenating their representations prior to the classification layers. For completeness, we also report results of providing the classification layers with the concatenation of the sentence-level embeddings and the representations produced by baseline models that do not make use of word-level transfer learning.
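As a toy illustration of this combination step (the dimensions here are hypothetical, not those of the released models), the two representations are simply concatenated before the task-specific classification layers:

```python
import numpy as np

# Hypothetical dimensions: a 512-d sentence-level embedding and a
# 128-d feature vector from a word-level (e.g. CNN) baseline model.
sentence_emb = np.ones(512)
word_level_repr = np.zeros(128)

# Concatenate prior to the task-specific classification layers,
# which then see a single 640-d input vector.
combined = np.concatenate([sentence_emb, word_level_repr])
assert combined.shape == (640,)
```

The classifier can then weight the two sources of evidence independently, which is what lets the combined models outperform either representation alone.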

Experiments
Experiments use our most recent transformer and DAN encoding models. Transfer task model hyperparameters are tuned using a combination of Vizier (Golovin et al., 2017) and light manual tuning. When available, model hyperparameters are tuned using task dev sets. Otherwise, hyperparameters are tuned by cross-validation on task training data, or on the evaluation test data when neither training nor dev data are provided. Training repeats ten times for each task with randomly initialized weights, and we report results averaged across runs.

Transfer learning is particularly important when training data is limited. We explore using varying amounts of training data for SST. Contrasting the transformer and DAN encoders demonstrates trade-offs between model complexity and the amount of training data required to reach a desired level of task accuracy. Finally, to assess bias in our encoders, we evaluate the strength of biased model associations on WEAT. We compare to Caliskan et al. (2017), who found that word embeddings reproduce human-like biases on implicit association tasks.

Table 3 compares our models to strong baselines on the STS Benchmark. Our transformer embeddings outperform the sentence representations produced by InferSent. Moreover, computing similarity scores by directly comparing the representations produced by our encoders approaches the performance of state-of-the-art neural models whose representations are fit to the STS task.

Table 4 illustrates transfer task performance for varying amounts of training data. With small quantities of training data, sentence-level transfer achieves surprisingly good performance. Using only 1k labeled examples, transfer learning with the transformer embeddings surpasses the performance of transfer learning using InferSent on the full training set of 67.3k examples.
Training with 1k labeled examples and the transformer sentence embeddings also surpasses word-level transfer using the full training set (CNN w2v) and approaches the performance of the best model without transfer learning trained on the complete dataset (CNN rnd at 67.3k examples). Transfer learning is not always helpful when there is enough task training data. However, we observe that our best performing model still makes use of transformer sentence-level transfer, combined with a CNN with no word-level transfer (U T + CNN rnd).

Table 5 contrasts Caliskan et al. (2017)'s findings on bias within GloVe embeddings with results from the transformer and DAN encoders. Similar to GloVe, our models reproduce human associations between flowers vs. insects and pleasantness vs. unpleasantness. However, our models demonstrate weaker associations than GloVe for probes targeted at revealing ageism, racism and sexism. 11 Differences in word association patterns can be attributed to training data composition and the mixture of tasks used to train the representations.

Resource Usage
This section describes memory and compute resource usage for the transformer and DAN sentence encoding models over different batch sizes and sentence lengths. Figure 1 plots model resource consumption against sentence length. 12

Compute Usage: The transformer model time complexity is O(n^2) in sentence length, while the DAN model's is O(n).

Memory Usage: For very short sequences, the transformer requires almost half as much memory as the DAN model.

11 The development of our models did not target reducing bias. Researchers and developers are strongly encouraged to independently verify whether biases in their overall model or model components impact their use case. For resources on ML fairness visit https://developers.google.com/machine-learning/fairness-overview/.

12 All benchmark values are averaged over 25 runs that follow 5 priming runs. CPU and memory benchmarks are performed on a machine with an Intel(R) Xeon(R) Platinum P-8136 CPU @ 2.00GHz. GPU benchmarks use an Intel(R) Xeon(R) CPU E5-2696 v4 @ 2.20GHz and an NVIDIA Tesla P100 GPU.

Conclusion
Our encoding models provide sentence-level embeddings that demonstrate strong transfer performance on a number of NLP tasks. The encoding models make different trade-offs regarding accuracy and model complexity that should be considered when choosing the best one for a particular application. Overall, our sentence-level embeddings tend to surpass the performance of transfer using word-level embeddings alone. Models that make use of sentence- and word-level transfer often achieve the best performance. Sentence-level transfer using our models can be exceptionally helpful when limited training data is available. The pre-trained encoding models are publicly available for research and use in industry applications that can benefit from a better understanding of natural language.