Multilingual Universal Sentence Encoder for Semantic Retrieval

We present easy-to-use, retrieval-focused multilingual sentence embedding models, made available on TensorFlow Hub. The models embed text from 16 languages into a shared semantic space using a multi-task trained dual encoder that learns tied cross-lingual representations via translation bridge tasks (Chidambaram et al., 2018). The models achieve a new state of the art in performance on monolingual and cross-lingual semantic retrieval (SR). Competitive performance is obtained on the related tasks of translation pair bitext retrieval (BR) and retrieval question answering (ReQA). On transfer learning tasks, our multilingual embeddings approach, and in some cases exceed, the performance of English-only sentence embeddings.


Introduction
We introduce three new members of the universal sentence encoder (USE) (Cer et al., 2018) family of sentence embedding models. Two are multilingual models, one based on a CNN architecture (Kim, 2014) and the other on the Transformer architecture (Vaswani et al., 2017), targeting performance on tasks that require models to capture multilingual semantic similarity. The third member is an alternative interface to our multilingual Transformer model for use in retrieval question answering (ReQA). The 16 languages supported by our multilingual models are given in Table 1.
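As an illustration of the intended usage, the snippet below loads one of the released modules from TensorFlow Hub and embeds sentences from several languages into the shared space. The module handle and version shown reflect modules published on TF Hub rather than anything stated in this paper, and the printed similarity matrix is only meant to show that translations land near each other.

```python
# Minimal usage sketch (TF2). Requires tensorflow, tensorflow_hub, and
# tensorflow_text (which registers the SentencePiece ops the module uses).
import numpy as np
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401  (registers ops needed by the module)

# Illustrative module handle; CNN and QA variants are published under
# similar "universal-sentence-encoder-multilingual*" names on TF Hub.
embed = hub.load(
    "https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3")

sentences = [
    "How do I reset my password?",          # English
    "¿Cómo restablezco mi contraseña?",     # Spanish
    "Wie setze ich mein Passwort zurück?",  # German
]
embeddings = np.asarray(embed(sentences))   # shape: (3, 512)

# Cosine similarities: translated sentences should score highly with each other.
normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
print(normed @ normed.T)
```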

Multi-task Dual Encoder Training
Similar to Cer et al. (2018) and Chidambaram et al. (2018), we target broad coverage using a multi-task dual-encoder training framework, with a single shared encoder supporting multiple downstream tasks. The training tasks include: a multi-feature question-answer prediction task, a translation ranking task, and a natural language inference (NLI) task. Additional task-specific hidden layers for the question-answering and NLI tasks are added after the shared encoder to provide representational specialization for each type of task.
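To make the dual-encoder setup concrete, here is a minimal sketch of a translation ranking task trained with in-batch negatives: each source sentence is scored against every target sentence in the batch, and the aligned translation is treated as the positive. The toy encoder and all sizes are illustrative stand-ins for the shared encoder, not the paper's configuration.

```python
import tensorflow as tf

# Stand-in shared encoder: in the real models this is the CNN or Transformer
# encoder; a tiny embedding-average encoder keeps the sketch self-contained.
class ToyEncoder(tf.keras.Model):
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.emb = tf.keras.layers.Embedding(vocab_size, dim)
        self.proj = tf.keras.layers.Dense(dim)

    def call(self, token_ids):
        pooled = tf.reduce_mean(self.emb(token_ids), axis=1)
        return tf.nn.l2_normalize(self.proj(pooled), axis=1)

def ranking_loss(src_emb, tgt_emb):
    """Dual-encoder ranking loss with in-batch negatives: each source
    should score its aligned target higher than all other targets."""
    scores = tf.matmul(src_emb, tgt_emb, transpose_b=True)  # [B, B]
    labels = tf.range(tf.shape(scores)[0])                  # diagonal = positives
    return tf.reduce_mean(
        tf.keras.losses.sparse_categorical_crossentropy(
            labels, scores, from_logits=True))

encoder = ToyEncoder()
src = tf.random.uniform([8, 12], maxval=1000, dtype=tf.int32)  # source sentences
tgt = tf.random.uniform([8, 12], maxval=1000, dtype=tf.int32)  # their translations
loss = ranking_loss(encoder(src), encoder(tgt))
print(float(loss))
```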

SentencePiece
SentencePiece tokenization (Kudo and Richardson, 2018) is used for all 16 languages supported by our models. A single 128k SentencePiece vocabulary is trained from 8 million sentences sampled from our training corpus and balanced across the 16 languages. For validation, the trained vocabulary is used to process a separate development set, also sampled from the sentence encoding model training corpus. We find the character coverage is higher than 99% for all languages, meaning that less than 1% of output tokens are out of vocabulary. Each token in the vocabulary is mapped to a fixed-length embedding vector.
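For illustration, the sketch below trains a comparable shared subword vocabulary with the sentencepiece library and checks the unknown-token rate on held-out text. The file path, unigram model type, and character-coverage value are assumptions for the sketch; only the 128k vocabulary size and the idea of a single shared multilingual vocabulary come from the description above.

```python
import sentencepiece as spm

# Train a single shared subword vocabulary from a multilingual sample of
# sentences (one sentence per line, all 16 languages mixed).
# "multilingual_sample.txt" is a placeholder path; the sample must be large
# enough to support a 128k vocabulary.
spm.SentencePieceTrainer.train(
    input="multilingual_sample.txt",
    model_prefix="use_multilingual",
    vocab_size=128000,
    character_coverage=0.9995,   # illustrative; chosen to keep OOV tokens rare
    model_type="unigram",
)

# Tokenize held-out text and check the out-of-vocabulary (unknown) rate.
sp = spm.SentencePieceProcessor(model_file="use_multilingual.model")
ids = sp.encode("Universal sentence encoders map text to vectors.")
oov = sum(1 for i in ids if i == sp.unk_id()) / max(len(ids), 1)
print(ids, f"OOV rate: {oov:.2%}")
```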

Shared Encoder
Two distinct architectures are provided for the sentence encoding models: (i) a transformer (Vaswani et al., 2017), targeting higher accuracy at the cost of greater resource consumption; and (ii) a convolutional neural network (CNN) (Kim, 2014), designed for efficient inference at the cost of reduced accuracy.
Transformer The transformer encoding model embeds sentences using the encoder component of the transformer architecture (Vaswani et al., 2017). Bi-directional self-attention is used to compute context-aware representations of the tokens in a sentence, taking into account both the ordering and the identity of the tokens. The context-aware token representations are then averaged to obtain a sentence-level embedding.
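In other words, the sentence embedding is simply a mask-aware mean of the transformer's contextual token vectors. A small sketch of that pooling step is shown below, with a random tensor standing in for the transformer encoder's outputs.

```python
import tensorflow as tf

def masked_average(token_embeddings, mask):
    """Average context-aware token vectors into one sentence vector,
    ignoring padding positions indicated by `mask` (1 = real token)."""
    mask = tf.cast(mask, token_embeddings.dtype)[..., tf.newaxis]  # [B, T, 1]
    summed = tf.reduce_sum(token_embeddings * mask, axis=1)        # [B, D]
    counts = tf.maximum(tf.reduce_sum(mask, axis=1), 1.0)          # [B, 1]
    return summed / counts

# Placeholder for transformer encoder outputs: batch of 2 sentences,
# 5 token positions, 8-dimensional contextual embeddings.
tokens = tf.random.normal([2, 5, 8])
mask = tf.constant([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
sentence_embeddings = masked_average(tokens, mask)   # shape [2, 8]
print(sentence_embeddings.shape)
```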
CNN The CNN sentence encoding model feeds the input token sequence embeddings into a convolutional neural network (Kim, 2014). As with the transformer encoder, average pooling is used to turn the token-level embeddings into a fixed-length representation. Sentence embeddings are then obtained by passing the averaged representation through additional feedforward layers.
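The following is a minimal sketch of a CNN sentence encoder of this general shape (token embeddings, convolutions, average pooling, then feedforward layers). The filter widths [1, 2, 3, 5] and 256 filters echo the configuration reported later, but the parallel-branch layout, embedding size, and output dimension here are illustrative choices rather than the released model's architecture.

```python
import tensorflow as tf

def build_cnn_encoder(vocab_size=128000, embed_dim=128, out_dim=512):
    """Toy CNN sentence encoder: token embeddings -> parallel convolutions
    with widths [1, 2, 3, 5] -> average pooling -> feedforward layers."""
    token_ids = tf.keras.Input(shape=(None,), dtype=tf.int32)
    x = tf.keras.layers.Embedding(vocab_size, embed_dim)(token_ids)
    convs = [tf.keras.layers.Conv1D(256, width, padding="same",
                                    activation="relu")(x)
             for width in (1, 2, 3, 5)]
    x = tf.keras.layers.Concatenate()(convs)
    x = tf.keras.layers.GlobalAveragePooling1D()(x)      # average pooling
    x = tf.keras.layers.Dense(out_dim, activation="relu")(x)
    sentence_embedding = tf.keras.layers.Dense(out_dim)(x)
    return tf.keras.Model(token_ids, sentence_embedding)

encoder = build_cnn_encoder()
print(encoder(tf.constant([[5, 42, 7, 0, 0]])).shape)  # (1, 512)
```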

Training Corpus
Training data consists of mined question-answer pairs, mined translation pairs, and the Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015). SNLI contains only English data. The number of mined question-answer pairs also varies across languages, with a bias toward a handful of top-tier languages. To balance training across languages, we use Google's translation system to translate SNLI into the other 15 languages. We also translate a portion of the question-answer pairs to ensure each language has a minimum of 60M training pairs. For each of our datasets, we use 90% of the data for training and the remaining 10% for development/validation.

Model Configuration
Input sentences are truncated to 256 tokens for the CNN model and 100 tokens for the transformer. The CNN encoder uses 2 CNN layers with filter widths of [1, 2, 3, 5] and a filter size of 256. The resulting sentence encoders are exported as USE Trans and USE CNN. We also export a larger graph for QA tasks from our Transformer-based model that includes QA-specific layers and supports providing context information from the larger document, exported as USE QA Trans+Cxt.

Experiments on Retrieval Tasks
In this section we evaluate our multilingual encoding models on semantic retrieval, bitext retrieval, and retrieval question answering tasks.

Semantic Retrieval (SR)
Following Gillick et al. (2018), we construct semantic retrieval (SR) tasks from the Quora question-pairs (Hoogeveen et al., 2015) and AskUbuntu (Lei et al., 2016) datasets. The SR task is to identify all sentences in the retrieval corpus that are semantically similar to a query sentence. For each dataset, we first build a graph connecting each of the positive pairs and then compute its transitive closure. Each sentence then serves as a test query that should retrieve all of the other sentences it is connected to within the transitive closure. Mean average precision (MAP) is employed to evaluate the models. More details on the constructed datasets can be found in Gillick et al. (2018). Both datasets are English only.
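As a concrete illustration of this construction, the sketch below builds the transitive closure of a toy positive-pair graph and computes average precision for one query; the pairs and the fake ranking are placeholders, and with a real model the ranking would come from embedding similarity.

```python
import networkx as nx

# Toy positive pairs; in the real tasks these come from Quora/AskUbuntu.
positive_pairs = [("q1", "q2"), ("q2", "q3"), ("q4", "q5")]

# Build the positive-pair graph and take its transitive closure: every
# sentence should retrieve all others in its connected component.
graph = nx.Graph(positive_pairs)
relevant = {s: nx.node_connected_component(graph, s) - {s} for s in graph}

def average_precision(ranked, gold):
    """Average precision of a ranked candidate list against the gold set."""
    hits, score = 0, 0.0
    for rank, cand in enumerate(ranked, start=1):
        if cand in gold:
            hits += 1
            score += hits / rank
    return score / max(len(gold), 1)

# With a real model, `ranked` would be candidates sorted by embedding
# similarity to the query; here we fake a ranking for illustration.
ranked_for_q1 = ["q2", "q4", "q3", "q5"]
print(average_precision(ranked_for_q1, relevant["q1"]))  # relevant: {q2, q3}
```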
Table 2 shows the MAP@100 on the Quora/AskUbuntu retrieval tasks. We use Gillick et al. (2018) as the baseline, which is trained using a similar dual-encoder architecture. The numbers listed here are from the models without in-domain training data.

Bitext Retrieval (BR)
Bitext retrieval performance is evaluated on the United Nations (UN) Parallel Corpus (Ziemski et al., 2016), containing 86,000 bilingual document pairs matching English (en) documents with their translations in five other languages: French (fr), Spanish (es), Russian (ru), Arabic (ar), and Chinese (zh). Document pairs are aligned at the sentence level, which results in 11.3 million aligned sentence pairs for each language pair. Table 3 shows precision@1 (P@1) for the proposed models as well as the current state-of-the-art results from Yang et al. (2019), which uses a dual-encoder architecture trained on mined bilingual data. USE Trans is generally better than USE CNN, and performs only modestly below the state of the art, with the exception of en-zh.

Retrieval Question Answering (ReQA)
Similar to the dataset construction used for the SR tasks, the SQuAD v1.0 dataset (Rajpurkar et al., 2016) is transformed into a retrieval question answering (ReQA) task. We first break all documents in the dataset into sentences using an off-the-shelf sentence splitter. Each question of the (question, answer span) tuples in the dataset is treated as a query, and the task is to retrieve the sentence designated by the tuple's answer span. Search is performed over a retrieval corpus consisting of all sentences in the dataset. We contrast sentence- and paragraph-level retrieval using our models, with the latter allowing comparison against a BM25 baseline (Jones et al., 2000). BM25 is a strong baseline for text retrieval tasks; our paragraph-level experiments use the BM25 implementation at https://github.com/nhirakawa/BM25 with default parameters. We exclude sentence-level BM25, as BM25 generally performs poorly at this granularity. The use of sampling in prior work makes it difficult to compare directly with their results, so we provide our own BM25 baseline.
We evaluate ReQA using the SQuAD dev and train sets, without training on the SQuAD data. For sentence retrieval, the dev set yields 11,425 questions and 10,248 candidates, and the train set yields 87,599 questions and 91,703 candidates; for paragraph retrieval, there are 2,067 retrieval candidates in the dev set and 18,896 in the train set. To retrieve paragraphs with our model, we first run sentence retrieval and use the nearest retrieved sentence to select its enclosing paragraph. The sentence and paragraph retrieval P@1 are shown in Table 4. For sentence retrieval, we compare encodings produced using context from the text surrounding the retrieval candidate, USE QA Trans+Cxt, to sentence encodings produced without contextual cues, USE Trans. Paragraph retrieval contrasts USE QA Trans+Cxt with BM25.

Cross-lingual Retrieval
Our earlier experiments are extended to explore cross-lingual semantic retrieval (cl-SR) and cross-lingual retrieval question answering (cl-ReQA). SR queries and ReQA questions are machine translated into the other languages, while the retrieval candidates remain in English. Table 5 provides our cross-lingual retrieval results. For all languages, USE Trans outperforms USE CNN. While cross-lingual performance lags behind the English-only tasks, it is surprisingly close given the added difficulty of the cross-lingual setting.
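A minimal scoring sketch for this setup is shown below: translated-query embeddings are matched against English candidate embeddings by cosine similarity, and P@1 counts how often the aligned candidate is ranked first. The random placeholder embeddings and the convention that query i's correct candidate is candidate i are assumptions of the sketch.

```python
import numpy as np

def precision_at_1(query_emb, candidate_emb):
    """P@1 when query i's correct candidate is candidate i.
    Embeddings are L2-normalized so dot product equals cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    c = candidate_emb / np.linalg.norm(candidate_emb, axis=1, keepdims=True)
    best = np.argmax(q @ c.T, axis=1)           # nearest candidate per query
    return float(np.mean(best == np.arange(len(q))))

# Placeholder embeddings standing in for encoded translated queries and
# encoded English candidates (e.g., produced by the multilingual encoder).
rng = np.random.default_rng(0)
queries = rng.normal(size=(100, 512))
candidates = queries + 0.1 * rng.normal(size=(100, 512))  # noisy stand-ins
print(precision_at_1(queries, candidates))
```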

Experiments on Transfer Tasks
For comparison with prior USE models, English task transfer performance is evaluated on SentEval (Conneau and Kiela, 2018). For the sentence classification transfer tasks, the output of the sentence encoders is provided to a task-specific DNN. For the pairwise semantic similarity task, the similarity of sentence embeddings u and v is assessed using the angular similarity sim(u, v) = 1 - arccos(u · v / (‖u‖ ‖v‖)) / π, following Yang et al. (2018). As shown in Table 6, our multilingual models show competitive transfer performance compared with state-of-the-art sentence embedding models. USE Trans performs better than USE CNN on all tasks. Our new multilingual USE Trans even outperforms our best previously released English-only model, USE Trans for English (Cer et al., 2018), on some tasks.
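For reference, a direct implementation of that arccos-based similarity is sketched below; the clipping guard is a numerical-safety detail added here, not something specified in the papers.

```python
import numpy as np

def angular_similarity(u, v):
    """Similarity based on angular distance: 1 - arccos(cos(u, v)) / pi.
    Compared to raw cosine, arccos spreads out scores for highly similar pairs."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    cos = np.clip(cos, -1.0, 1.0)   # guard against floating-point overshoot
    return 1.0 - np.arccos(cos) / np.pi

print(angular_similarity(np.array([1.0, 0.0]), np.array([1.0, 1.0])))  # ~0.75
```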

Conclusion
We present two multilingual models for embedding sentence-length text. Our models embed text from 16 languages into a shared semantic embedding space and achieve performance on transfer tasks that approaches that of monolingual sentence embedding models. The models achieve good performance on semantic retrieval (SR), bitext retrieval (BR), and retrieval question answering (ReQA). They achieve performance on cross-lingual semantic retrieval (cl-SR) and cross-lingual retrieval question answering (cl-ReQA) that approaches monolingual SR and ReQA performance for many language pairs. Our models are made freely available with additional documentation and tutorial Colaboratory notebooks at: https://tfhub.dev/s?q=universal-sentence-encoder-multilingual.

Resource Usage
Figure 1 provides compute and memory usage benchmarks for our models. Inference times on GPU are 2 to 3 times faster than on CPU. Our CNN models have the smallest memory footprint and are the fastest on both CPU and GPU. Memory requirements increase with sentence length, with the Transformer model's requirements growing more than twice as fast as the CNN model's. Transformer models are ultimately governed by time and space complexity that is quadratic in the sequence length; the benchmarks show that for shorter sequence lengths, the time and space requirements are dominated by computations that scale linearly with length and have a larger constant factor than the quadratic terms. While this makes CNNs an attractive choice for efficiently encoding longer texts, it comes with a corresponding drop in accuracy on many retrieval and transfer tasks.

Figure 1 :
Figure 1: Resource usage for the multilingual Transformer and CNN encoding models.

Table 2 :
MAP@100 on SR (English). Models are compared with the best models from Gillick et al. (2018).

Table 3 :
P@1 on the UN bitext retrieval task.

Table 4 :
P@1 for SQuAD ReQA. Models are not trained on SQuAD. Dev and Train refer only to the respective sections of the SQuAD dataset.

Table 5 :
Cross-lingual performance on Quora/AskUbuntu cl-SR (MAP) and SQuAD cl-ReQA (P@1). Queries/questions are machine translated to the other languages, while retrieval candidates remain in English.