Cross-Thought for Sentence Encoder Pre-training

In this paper, we propose Cross-Thought, a novel approach to pre-training sequence encoders, which is instrumental in building reusable sequence embeddings for large-scale NLP tasks such as question answering. Instead of using the original signals of full sentences, we train a Transformer-based sequence encoder over a large set of short sequences, which allows the model to automatically select the most useful information for predicting masked words. Experiments on question answering and textual entailment tasks demonstrate that our pre-trained encoder outperforms state-of-the-art encoders trained with continuous sentence signals as well as traditional masked language modeling baselines. Our approach also achieves new state of the art on HotpotQA (full-wiki setting) by improving intermediate information retrieval performance.


Introduction
Encoding sentences into embeddings (Kiros et al., 2015; Subramanian et al., 2018; Reimers and Gurevych, 2019) is a critical step in many Natural Language Processing (NLP) tasks. The benefit of using sentence embeddings is that the representations of all encoded sentences can be reused at the chunk level (rather than the word level), which can significantly accelerate inference. For example, in question answering (QA), inference time can be sharply reduced by pre-caching the embeddings of all candidate paragraphs in memory and matching only the question embedding against them at inference time.
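This caching pattern can be sketched as follows. This is a toy illustration: `encode` is a hypothetical stand-in for any sentence encoder that maps text to a fixed-size vector (here a normalized bag-of-words over a tiny hand-picked vocabulary); only the cache-then-match logic mirrors the setup described above.

```python
import numpy as np

# Toy vocabulary-based "encoder" standing in for a real sentence encoder.
VOCAB = {"washington": 0, "president": 1, "united": 2, "states": 3,
         "eiffel": 4, "tower": 5, "paris": 6}

def encode(text: str) -> np.ndarray:
    """Map text to a normalized bag-of-words vector (placeholder encoder)."""
    vec = np.zeros(len(VOCAB))
    for raw in text.lower().split():
        tok = raw.strip(".,!?")
        if tok in VOCAB:
            vec[VOCAB[tok]] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

# Offline: encode every candidate paragraph once and cache the embeddings.
paragraphs = [
    "George Washington was the first president of the United States.",
    "The Eiffel Tower is located in Paris.",
]
cache = np.stack([encode(p) for p in paragraphs])  # (num_paragraphs, dim)

# Online: a new question costs one encoder call plus one matrix product.
question_emb = encode("Who was the first president of the United States?")
scores = cache @ question_emb        # dot-product similarity to every cached paragraph
best = int(np.argmax(scores))        # index of the best-matching paragraph
```

With a real encoder the cache would hold millions of paragraph vectors, but the online cost per question stays at one forward pass plus one matrix-vector product.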
There have been several models specifically designed to pre-train sentence encoders on large-scale unlabeled corpora. For example, Skip-thought (Kiros et al., 2015) uses encoded sentence embeddings to generate the next sentence (Figure 2(a)). The Inverse Cloze Task defines pseudo labels to pre-train a sentence encoder (Figure 2(b)). However, pseudo labels may bear low accuracy, and the rich linguistic information that can be well learned through generic language modeling is often lost in these unsupervised methods. In this paper, we propose a novel unsupervised approach that fully exploits the strength of language modeling for sentence encoder pre-training.
Popular pre-training tasks such as language modeling (Peters et al., 2018; Radford et al., 2018), masked language modeling (Devlin et al., 2019) and sequence generation (Dong et al., 2019) are not directly applicable to sentence encoder training: only the hidden state of the first token (a special token) (Reimers and Gurevych, 2019; Devlin et al., 2019) can be used as the sentence embedding, but no loss or gradient is specifically designed for that first special token, so sentence embeddings learned in such settings contain limited useful information.
Another limitation of existing masked language modeling methods (Devlin et al., 2019) is that they focus on long sequences (512 words), where masked tokens can be recovered from the context within the same sequence. This is useful for learning word dependencies within a sequence, but less effective for learning sentence embeddings.
In this paper, we propose Cross-Thought, which segments input text into shorter sequences, so that masked words in one sequence are less likely to be recoverable from the sequence itself and must instead rely on the embeddings of surrounding sequences. For example, in Figure 1, the masked words "George Washington" and "United States" in the third sequence can only be correctly predicted by considering the context from the first sequence. Thus, instead of performing self-attention over all the words in all sentences, our pre-training method forces the model to learn from mutually relevant sequences and to automatically select the most relevant neighbors for recovering masked words.
The proposed Cross-Thought architecture is illustrated in Figure 2(c). Specifically, we pre-append multiple special tokens to each sequence, and use their final hidden states as the sentence embedding. Then, we train multiple cross-sequence Transformers over the hidden states of the different special tokens independently, to retrieve relevant sequence embeddings for masked word prediction. After pre-training, the attention weights of the cross-sequence Transformers can be directly applied to downstream tasks (e.g., in QA tasks, candidate answers can be ranked by the similarity scores between the question embedding and their sentence embeddings).
Our contributions are summarized as follows. (i) We propose the Cross-Thought model to pre-train a sentence encoder with a novel pre-training task: recovering masked words in a short sequence by taking into consideration the embeddings of surrounding sequences. (ii) Our model can be easily finetuned on diverse downstream tasks, and the attention weights of the pre-trained cross-sequence Transformers can also be used directly for ranking tasks. (iii) Our model achieves the best performance on multiple sequence-pair classification and answer-selection tasks compared to state-of-the-art baselines. In addition, it further boosts the recall of information retrieval (IR) models in open-domain QA, and achieves new state of the art on the HotpotQA benchmark (full-wiki setting).

Related Work
Sequence Encoder Many studies have explored different ways to improve sequence embeddings. Huang et al. (2013) propose deep structured semantic encoders for web search. Tan et al. (2015) use an LSTM encoder for non-factoid answer selection, and Tai et al. (2015) propose tree-LSTM to compute semantic relatedness between sentences. Mou et al. (2016) use a tree-based CNN encoder for textual entailment tasks. Cheng et al. (2016) propose Long Short-Term Memory-Networks (LSTMN) for inferring the relation between sentences, and Lin et al. (2017) combine LSTM with a self-attention mechanism to improve sentence embeddings. Multi-task learning (Subramanian et al., 2018; Cer et al., 2018) has also been applied to train better sentence embeddings. Recently, in addition to supervised learning, models pre-trained with unsupervised methods have begun to dominate the field.
Pre-training Several methods have been proposed to directly pre-train sentence embeddings, such as Skip-thought (Kiros et al., 2015), FastSent (Hill et al., 2016), and the Inverse Cloze Task. Although these methods can obtain better sentence embeddings in an unsupervised way, they cannot achieve state-of-the-art performance on downstream tasks even with further finetuning. More recently, Peters et al. (2018) propose to pre-train an LSTM with a language modeling (LM) task, and Radford et al. (2018) pre-train a Transformer with LM as well. Instead of generating words sequentially in a single direction, Devlin et al. (2019) propose the masked language modeling task to pre-train a bidirectional Transformer. Most recently, Guu et al. (2020) and Lewis et al. (2020) propose to jointly train a sentence-embedding-based information retriever and a Transformer to reconstruct documents. However, their methods are difficult to train, as reinforcement learning is involved, and they need to periodically re-index the whole corpus (e.g., Wikipedia). In this paper, we propose a new model, Cross-Thought, to pre-train a sentence encoder by recovering masked information across sequences. We exploit the heuristic that nearby sequences in a document contain the information most important for recovering the masked words. The challenging retrieval step can therefore be replaced by a soft-attention mechanism, making our model much easier to train.

Cross-Thought
In this section, we introduce our proposed pre-training model, Cross-Thought, and describe how to finetune the pre-trained model on downstream tasks. Most parameters in downstream tasks can be initialized by pre-trained Cross-Thought, and for certain tasks (e.g., ranking) the cross-sequence attention weights can be used directly without additional parameters (Figure 3).

Pre-training Data Construction
Our pre-training task is inspired by Masked Language Modeling (Devlin et al., 2019); the key difference is how the sequences for pre-training are constructed. As our goal is sentence embedding learning, the pre-training task is designed to encourage the model to recover masked words based on sentence-level global context from other sequences, instead of word-level local context within the same sequence (Figure 1). Therefore, unlike previous work that segments raw text into long sequences and shuffles them for pre-training, we create shorter text sequences and do not shuffle them. In this way, a short sequence may not contain all the information necessary for recovering its masked words, forcing the model to probe surrounding sequences for the missing information.
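A minimal sketch of this construction follows. The sequence length of 64 and mask rate of 15% match the experimental setup described later in the paper; the token stream, mask symbol, and seeding are illustrative placeholders.

```python
import random

MASK, SEQ_LEN, MASK_RATE = "[MASK]", 64, 0.15

def build_pretraining_sequences(tokens, seq_len=SEQ_LEN, mask_rate=MASK_RATE, seed=0):
    """Segment a token stream into short sequences *in document order* and mask tokens.

    No shuffling: keeping neighboring sequences adjacent is what allows masked
    words to be recovered from surrounding sequence embeddings.
    Returns the masked sequences and, per sequence, a {position: original_token} map.
    """
    rng = random.Random(seed)
    sequences = [tokens[i:i + seq_len] for i in range(0, len(tokens), seq_len)]
    masked, targets = [], []
    for seq in sequences:
        out, tgt = [], {}
        for pos, tok in enumerate(seq):
            if rng.random() < mask_rate:
                tgt[pos] = tok      # remember the original token as the target
                out.append(MASK)
            else:
                out.append(tok)
        masked.append(out)
        targets.append(tgt)
    return masked, targets

stream = [f"w{i}" for i in range(200)]          # placeholder token stream
masked_seqs, mask_targets = build_pretraining_sequences(stream)
```

Each training example then packs many such consecutive sequences together, so the encoder sees a window of mutually relevant neighbors.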

Cross-Thought Pre-training
The pre-training model is illustrated in Figure 3(a). As aforementioned, each pre-training example consists of $M$ continuous sequences $[X_0, X_1, \ldots, X_{M-1}]$. Similar to BERT (Devlin et al., 2019), we use the hidden states of special tokens as the final sentence embedding. To encode richer semantic information, we pre-append $N$ special tokens $S$ to each sequence $X$, instead of a single one.
We first use a Transformer to encode the segmented short sequences:
$$H_m = \mathrm{Transformer}([S; X_m]), \qquad (1)$$
$$E_m = H_m[0:N], \qquad (2)$$
where $S \in \mathbb{R}^{N \times d}$ are the embeddings of the $N$ special tokens, $X_m \in \mathbb{R}^{l_m \times d}$ are the contextualized word embeddings of the $m$-th sequence, $l_m$ is the sequence length, $d$ is the embedding dimension, and $[\cdot\,;\cdot]$ denotes matrix concatenation. $H_m \in \mathbb{R}^{(N+l_m) \times d}$ are all the hidden states of the Transformer, and $E_m \in \mathbb{R}^{N \times d}$ are the hidden states at the special tokens, used as the final sequence embedding.

Next, we build cross-sequence Transformers on top of these sequence embeddings, so that each sequence can distill information from the others. As the embeddings of different special tokens encode different information, we run a separate Transformer over the embeddings of each special token:
$$F_n = [E_0[n]; E_1[n]; \ldots; E_{M-1}[n]], \qquad (3)$$
$$C_n = \mathrm{Transformer}_n(F_n), \qquad (4)$$
where $E_m[n] \in \mathbb{R}^d$ is the $n$-th row of $E_m$, $F_n \in \mathbb{R}^{M \times d}$ is the concatenation of the embeddings of the $n$-th special token from every sequence, and $C_n \in \mathbb{R}^{M \times d}$ is the output of the cross-sequence Transformer, in which information from all sequences is fused together.

As the multi-head attention weights of the cross-sequence Transformer will be reused in downstream tasks, we write out the attention weights of one head of the cross-sequence Transformer on the $n$-th special tokens:
$$Q = F_n W^Q, \qquad (5)$$
$$K = F_n W^K, \qquad (6)$$
$$\alpha = \mathrm{softmax}\big(QK^{\top}/\sqrt{d}\big), \qquad (7)$$
where $W^Q, W^K \in \mathbb{R}^{d \times d}$ are parameters to learn, and $\alpha \in \mathbb{R}^{M \times M}$ are the attention weights that can be used directly in downstream tasks (e.g., to measure the similarity between a question and candidate answers in QA).

Finally, to encourage the sequence embeddings retrieved by the cross-sequence Transformers to help generate the masked words in the current sequence, we apply another Transformer layer on top of the merged sequence embeddings:
$$O_m = \mathrm{Transformer}([C_0[m]; C_1[m]; \ldots; C_{N-1}[m]; H_m[N:]]), \qquad (8)$$
where $C_n[m] \in \mathbb{R}^d$ is the hidden state of the cross-sequence Transformer on the $n$-th special token of the $m$-th sequence, and $H_m[N:]$ from Eqn. (1) are the hidden states of the non-special words in $X_m$. $O_m \in \mathbb{R}^{(N+l_m) \times d}$ is then used to generate the masked words:
$$P(a_{m,i} \mid O_m) = \mathrm{softmax}(O_m[i]\, W^V), \qquad (9)$$
where $P(a_{m,i} \mid O_m)$ is the probability of generating the $i$-th masked word in the $m$-th sequence, $O_m[i]$ is the hidden state at that masked position, and $W^V$ projects hidden states to vocabulary logits.

Figure 3: Illustration of Cross-Thought pre-training and finetuning. Circles in red are sentence embeddings. Lines in blue are the cross-sequence Transformers, whose attention weights are $\alpha$ in Eqn. (7). Words in red are special tokens, whose hidden states are the sentence embeddings; multiple special tokens are used to enrich the sentence embeddings. In (a), the sentence embedding of the third sequence provides context that helps generate the masked word in the first sequence. In (b), the model is initialized with pre-trained Cross-Thought for answer selection or textual entailment tasks; the attention weights $\alpha$ and the hidden states of the cross-sequence Transformers can be used directly for ranking and classification tasks.
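Eqn. (7) is standard scaled dot-product attention, applied over one embedding per sequence rather than per token. A minimal numpy sketch, in which randomly initialized `W_Q` and `W_K` stand in for the learned parameters:

```python
import numpy as np

def cross_sequence_attention(F, W_Q, W_K):
    """Attention weights alpha of Eqn. (7): an (M, M) matrix over *sequences*.

    F is (M, d): the embeddings of the n-th special token of each of M sequences.
    Row m of the result is a distribution telling sequence m which other
    sequences to read from.
    """
    d = F.shape[1]
    scores = (F @ W_Q) @ (F @ W_K).T / np.sqrt(d)   # (M, M) similarity logits
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)     # row-wise softmax

rng = np.random.default_rng(0)
M, d = 5, 16                                         # 5 sequences, toy dimension
F = rng.normal(size=(M, d))                          # placeholder special-token embeddings
W_Q, W_K = rng.normal(size=(d, d)), rng.normal(size=(d, d))
alpha = cross_sequence_attention(F, W_Q, W_K)
```

Because the attention operates over $M$ sequence embeddings instead of hundreds of tokens, the matrix $\alpha$ is cheap to compute and directly interpretable as sequence-to-sequence relevance.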

Cross-Thought Finetuning
To demonstrate how pre-trained Cross-Thought can initialize models for downstream tasks, we take two tasks as examples: answer selection and sequence-pair classification (under the setting of using sentence embeddings only, without word-level cross-sequence attention (Devlin et al., 2019)). The procedure is illustrated in Figure 3(b).

Answer Selection
The goal is to select one answer from a candidate pool $\{X_1, X_2, \ldots, X_{M-1}\}$ based on question $X_0$. The representations of the candidate answers are cached, and are matched against the question embedding when a new question comes in. Based on the pre-trained model, the attention weights in Eqn. (7) from the different heads of the cross-sequence Transformers can be applied directly to rank the answer candidates. For further finetuning, we define a loss over the attention weights:
$$\mathcal{L} = -\log \alpha[0][m],$$
where $\alpha[0]$ are the attention weights between question $X_0$ and all the answer candidates, and $m$ is the index of the correct answer in $\{X_1, X_2, \ldots, X_{M-1}\}$. Note that we have multiple cross-sequence Transformers on different special tokens, and each Transformer has multiple heads; we therefore use the mean of all the attention matrices as the final weights.
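Under the setup above, the ranking loss and the averaging over heads can be sketched as follows; the attention matrices here are hand-made row-stochastic placeholders, not model outputs.

```python
import numpy as np

def answer_selection_loss(attention_matrices, gold_index):
    """-log alpha[0][m]: negative log-likelihood of the question row attending
    to the gold answer, after averaging the attention matrices from all
    cross-sequence Transformers and all heads."""
    alpha = np.mean(attention_matrices, axis=0)   # (M, M) averaged weights
    question_row = alpha[0]                       # weights from question X_0
    return -np.log(question_row[gold_index])

# Two toy row-stochastic attention matrices (e.g., from two heads).
head_a = np.array([[0.1, 0.1, 0.8],
                   [1/3, 1/3, 1/3],
                   [1/3, 1/3, 1/3]])
head_b = np.array([[0.2, 0.2, 0.6],
                   [1/3, 1/3, 1/3],
                   [1/3, 1/3, 1/3]])
loss = answer_selection_loss([head_a, head_b], gold_index=2)  # gold answer is X_2
```

Minimizing this loss pushes the question's attention mass onto the correct candidate, which is exactly the quantity used for ranking at inference time.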

Sentence Pair Classification
The goal is to identify the relation between two sequences, $X_0$ and $X_1$. Reusable sequence embeddings are very useful in tasks such as finding the most similar pair of sentences in a large candidate pool, which would require large-scale repetitive encoding and matching without pre-computed sentence embeddings. As the pre-training of the cross-sequence Transformers is designed to fuse the embeddings of different sequences, the merged representations in Eqn. (4) can be used for downstream classification:
$$\bar{C} = [C_0; C_1; \ldots; C_{N-1}], \qquad c = \mathrm{vec}(\bar{C}),$$
where $\bar{C} \in \mathbb{R}^{2N \times d}$ is the concatenation of the hidden states of all the cross-sequence Transformers on the $N$ different special tokens (note that there are only two sequences here, so each $C_n \in \mathbb{R}^{2 \times d}$), and $c \in \mathbb{R}^{2Nd}$ is the reshaped vector fed to the final classifier. Cross-entropy loss is used for optimization.
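A sketch of this classification head, with shapes following the text; the classifier weights `W` are a learned parameter, randomly initialized here for illustration.

```python
import numpy as np

def pair_classification_logits(C_list, W):
    """Classification from the fused embeddings of Eqn. (4).

    C_list: N arrays of shape (2, d) -- the cross-sequence Transformer output
            for the two input sequences, one array per special token.
    W:      (num_classes, 2*N*d) classifier weights.
    """
    C_bar = np.concatenate(C_list, axis=0)   # (2N, d) stacked hidden states
    c = C_bar.reshape(-1)                    # (2N*d,) flattened feature vector
    return W @ c                             # class logits (softmax + CE in training)

rng = np.random.default_rng(1)
N, d, num_classes = 3, 8, 2                  # toy sizes
C_list = [rng.normal(size=(2, d)) for _ in range(N)]
W = rng.normal(size=(num_classes, 2 * N * d))
logits = pair_classification_logits(C_list, W)
```

Only the final linear layer is new; everything feeding into `C_list` is initialized from the pre-trained model.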

Experiments
In this section, we describe experiments with our pre-trained models and provide additional detailed analysis.

Datasets
We conduct experiments on five datasets.
QQP (Wang et al., 2018): Quora Question Pairs, where the task is to identify whether two questions are duplicates.
Quasar-T (Dhingra et al., 2017): A question answering dataset where related passages are retrieved and then read to extract the answer. On this dataset, we evaluate whether a model can correctly select the sentence containing the gold answer from the candidate pool.
HotpotQA: A dataset of diverse and explainable multi-hop question answering. We focus on the full-wiki setting, where the model needs to extract the answer and the related supporting sentences from all the abstracts in Wikipedia.
Note that for MNLI, QQP and HotpotQA, the test sets are hidden and the number of submissions is limited. For a fair comparison between our models and the baselines, we split off 5% of the training data as a validation set and use the original validation set as the test set.

Implementation Details
All models are pre-trained on Wikipedia and finetuned on downstream tasks. We also evaluate whether the pre-trained models can perform unsupervised paragraph selection, and whether the improvement in paragraph ranking leads to better answer prediction on HotpotQA. As our experiments are designed to evaluate the sentence encoder itself, we build only a light layer on top of the sentence embeddings for the classification tasks, and use only the dot product (Cross-Thought, ICT) or cosine similarity (Skip-thought, LM, MLM) between sentence embeddings for the ranking tasks. For fair comparison, all encoders in our experiments have the same structure as RoBERTa-base (12 layers, 12 self-attention heads, hidden size 768). For all experiments, we use Adam (Kingma and Ba, 2015) as the optimizer and the GPT-2 tokenizer (Radford et al., 2018). All models, including the baselines we re-implement, are pre-trained on Wikipedia pages. We use 16 NVIDIA V100 GPUs for model training. Our code is based on the RoBERTa codebase, and we use hyperparameters similar to those of RoBERTa-base training. Each training sample contains 500 short sequences of 64 tokens, and we randomly mask 15% of the tokens in the sequences. During training, we fix the position embeddings of the pre-appended special tokens, and randomly select 64 continuous positions from 0 to 564 for the other words; the model can thus encode longer sequences in downstream tasks. The batch size is set to 128 (4 million tokens). We use 10,000 warm-up steps, 125,000 maximal update steps, learning rate 0.0005, and dropout 0.1 for pre-training. Each model is pre-trained for around 4 days.
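The position-assignment trick can be sketched as follows. This is a simplified reading of the description above; the exact sampling boundaries used in the original implementation are an assumption.

```python
import random

def assign_positions(n_special, n_words, max_pos=564, seed=0):
    """Fixed positions for special tokens, a random contiguous window for words.

    Special tokens always receive positions 0..n_special-1. The n_words word
    tokens receive consecutive positions with a start sampled so the window
    stays within [0, max_pos]; over many samples every position embedding gets
    trained, letting the encoder handle longer sequences downstream.
    (The boundary convention here is an assumption based on the text.)
    """
    rng = random.Random(seed)
    special_positions = list(range(n_special))
    start = rng.randint(0, max_pos - n_words)           # randint is inclusive
    word_positions = list(range(start, start + n_words))
    return special_positions, word_positions

special_pos, word_pos = assign_positions(n_special=3, n_words=64)
```

Fixing the special-token positions keeps the sentence-embedding slots stable, while the sliding word window decouples the pre-training sequence length (64) from the maximum length usable at finetuning time.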
For finetuning on MNLI, SNLI and QQP, we use batch size 32, 7,432 warmup steps, 123,873 maximal update steps, and learning rate 0.00001. For Quasar-T and HotpotQA, we use batch size 80, 2,220 warmup steps, 20,360 maximal update steps, and learning rate 0.00005. Dropout is the only hyper-parameter we tuned; 0.1 was the best among [0.1, 0.2, 0.3]. As HotpotQA in the full-wiki setting does not provide answer candidates, during training we randomly sample 100 negative paragraphs from the top 1,000 paragraphs ranked by BM25 score. During inference, we use sentence embeddings to re-rank the top 1,000 paragraphs. For the unsupervised experiment on HotpotQA, we only rank the top 200 paragraphs.

Baselines
Existing baseline methods are mostly trained with different encoders or on different datasets. For fair comparison, we re-implement all baselines using a 12-layer Transformer as the sentence encoder and Wikipedia as the pre-training corpus. We consider three groups of baselines:

Pre-trained Sentence Embedding
• ICT: The inverse cloze task treats a sentence and its context as a positive pair, and other combinations as negative pairs. Sentences are masked from the context 10% of the time. This model is trained with a ranking loss based on the dot product between sequence embeddings.
• Skip-Thought (Kiros et al., 2015): The task is to encode a sentence into an embedding that is used to reconstruct the previous and next sentences. The model uses an encoder-decoder structure without attention across sequences (Cho et al., 2014). We use a 6-layer Transformer as the decoder for reconstruction.
Language Modeling In addition, we re-implement baselines for the classic Language Modeling (LM) and Masked Language Modeling tasks, as most existing models are pre-trained on different unlabeled datasets:
• Language Model (LM) (Radford et al., 2018): The task is to predict the probability of the next word given the preceding context. As words are encoded sequentially, for the unsupervised evaluation on HotpotQA we use the last hidden state as the sentence embedding (instead of the first one, as for ICT, MLM, and Skip-Thought).
• Masked Language Model (MLM) (Devlin et al., 2019): The task is to generate randomly masked words in a sequence. We explore different training-data settings. "Masked LM-1-512" trains a Transformer on sequences of 512 tokens with 1 special token pre-appended to each sequence; "Masked LM-1-64" is trained on sequences of 64 tokens. Both are trained on Wikipedia text only. "Masked LM (160G)" is the RoBERTa model pre-trained on a much larger corpus.

Multi-hop Question Answering
To further evaluate our model on multi-hop question answering in the open-domain setting, we compare our framework with several strong baselines on HotpotQA:
• Cognitive Graph (Ding et al., 2019): An iterative process of answer extraction and further reasoning over graphs built upon the extracted answers.
• Semantic Retrieval (Nie et al., 2019): A semantic retriever at both the paragraph and sentence level that retrieves question-related information.
• Recurrent Retriever (Asai et al., 2020): It uses a recurrent retriever to collect useful information from Wikipedia graphs for question answering.

Experimental Results
Results on the classification and ranking tasks are summarized in Table 2. Results of our pipeline on HotpotQA (full-wiki) are shown in Table 3.
Effect of Pre-training Tasks Among all the pre-training tasks, our proposed Cross-Thought achieves the best performance. With finetuning, LM pre-training works better than Skip-Thought and ICT, which are specifically designed for learning sentence embeddings. Moreover, "Cross-Thought-1-64" and "Masked LM-1-64" offer a direct comparison: both segment Wikipedia text into short sequences of 64 tokens for pre-training and use only the hidden state of the first special token as the sentence embedding. Cross-Thought achieves much better performance than Masked LM-1-64, and even outperforms the Transformer pre-trained on 160G of data (10 times larger than Wikipedia).

Effect of Training on Short Sequences
Results on "Cross-Thought-1-512" and "Cross-Thought-1-64" (using sequences of 512 and 64 tokens, respectively) clearly show that shorter sequences lead to more effective pre-training. We also observe that "Cross-Thought-1-512" and "Masked LM-1-512" achieve almost the same performance. This indicates that Cross-Thought has to be trained on short sequences (64 tokens); otherwise, it learns word dependencies within a sequence rather than sequence embeddings. The benefit of short sequences is also supported by Skip-Thought, which generates text at the sentence level, though Cross-Thought achieves better performance.
Effect of Sentence Embedding Size As we keep the number of encoder parameters fixed across pre-training tasks, increasing the hidden-state dimension would introduce more parameters to train. Instead, we pre-append more special tokens to each sequence and concatenate their hidden states as the final sentence embedding. "Cross-Thought-1-64", "Cross-Thought-3-64" and "Cross-Thought-5-64" compare pre-appending 1, 3, and 5 special tokens during pre-training. A larger sentence embedding size significantly improves performance on the ranking tasks but not on the classification tasks. We hypothesize that this is because ranking tasks are more challenging, with many different pairs to compare, for which the larger contextual sentence embeddings provide additional information.
Effect of Paragraph Ranking without Finetuning We also analyze whether pre-trained sentence embeddings can be used directly for downstream tasks without finetuning. Although performance without finetuning is generally worse than with supervised training, the experiments in column "HotpotQA(u)" further validate the three conclusions discussed above. We also observe that although the model pre-trained with masked language modeling performs better after finetuning, it is not designed to train sentence embeddings and thus cannot be used for passage ranking directly. All the other methods achieve much better ranking performance than masked language modeling, and our "Cross-Thought-5-64", with the largest embedding size, performs best.
Effect of Cross-Thought as Information Retriever (IR) on QA Task Our pipeline for HotpotQA (full-wiki) consists of three steps: (i) fast candidate paragraph retrieval; (ii) multi-hop paragraph re-ranking by a more complex model; and (iii) answer and supporting-fact extraction. We evaluate how well the finetuned sentence embeddings serve the first IR step, with the re-ranker and answer extractor fixed. "Masked LM-1-64" and "Cross-Thought-1-64" in Table 3 show that our pre-trained model achieves better retrieval performance than the baseline pre-trained on single sequences. Moreover, the pipeline integrating our sentence embeddings achieves new state of the art on HotpotQA (full-wiki).

Case Study
Table 4 provides a case study on the unsupervised passage ranking and masked language modeling tasks. For the case from HotpotQA, the attention weights from the cross-sequence Transformer in Cross-Thought rank the paragraph containing the gold answer first among the 200 candidate paragraphs.
For the case from Masked Language Modeling, we also observe that the sentence that can be used to recover the masked words receives much higher attention weight than the others, validating our motivation of retrieving useful sentence embeddings from other sequences to enhance masked word recovery in the current sequence.

Conclusion
We propose a novel approach, Cross-Thought, for pre-training sentence encoders. Experiments demonstrate that Cross-Thought trained on short sequences can effectively improve sentence embeddings, and that our pre-trained sentence encoder with further finetuning outperforms strong baselines on many NLP tasks.