Fast and Accurate Deep Bidirectional Language Representations for Unsupervised Learning

Even though BERT has achieved successful performance improvements in various supervised learning tasks, BERT is still limited by repetitive inferences on unsupervised tasks for the computation of contextual language representations. To resolve this limitation, we propose a novel deep bidirectional language model called a Transformer-based Text Autoencoder (T-TA). The T-TA computes contextual language representations without repetition and displays the benefits of a deep bidirectional architecture, such as that of BERT. In computation time experiments in a CPU environment, the proposed T-TA performs over six times faster than the BERT-like model on a reranking task and twelve times faster on a semantic similarity task. Furthermore, the T-TA shows competitive or even better accuracies than those of BERT on the above tasks. Code is available at https://github.com/joongbo/tta.


Introduction
A language model is an essential component in many NLP applications, ranging from automatic speech recognition (ASR) (Chan et al., 2016; Panayotov et al., 2015) to neural machine translation (NMT) (Sutskever et al., 2014; Sennrich et al., 2016; Vaswani et al., 2017). Recently, BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2019) and its variations have brought significant improvements in learning natural language representations, and they have achieved state-of-the-art performance on various downstream tasks such as the GLUE benchmark (Wang et al., 2019) and question answering (Rajpurkar et al., 2016). This success of BERT continues in various unsupervised tasks such as N-best list reranking for ASR and NMT (Shin et al., 2019; Salazar et al., 2019), showing that deep bidirectional language models are useful in unsupervised applications as well.
However, when applying BERT to unsupervised learning tasks, there is significant inefficiency in computing language representations at the inference stage (Salazar et al., 2019). During training, BERT uses the masked language modeling (MLM) objective, which is to predict the original ids of explicitly masked words from the input. Due to the MLM objective, each contextual word representation must be computed by a two-step process of masking a word in the input and feeding the result into BERT. During inference, therefore, this process repeats $n$ times to obtain the representations of all the words of a text sequence (Wang and Cho, 2019; Shin et al., 2019; Salazar et al., 2019), resulting in a computational complexity of $O(n^3)$ in terms of the number $n$ of words. Hence, it is necessary to reduce the computational complexity when we apply the model to cases where inference time is critical, e.g., mobile environments and real-time systems (Sanh et al., 2019; Lan et al., 2019). Faced with this limitation of BERT, we raise a new research question: "Can we make a deep bidirectional language model that has minimal inference time while maintaining the accuracy of BERT?"
In this paper, we answer "YES" to the above question by proposing a novel bidirectional language model named T-TA (Transformer-based Text Autoencoder), which has a reduced computational complexity of $O(n^2)$ when applied to unsupervised applications. The proposed model is trained with a new learning objective named language autoencoding (LAE). The LAE sets the target labels to be the same as the text input, and its objective is to predict every token in the input sequence at once without merely copying the input to the output. To learn the proposed objective, we devise a diagonal masking operation and an input isolation mechanism inside the T-TA, based on the Transformer encoder (Vaswani et al., 2017). These components enable the proposed T-TA to compute contextualized language representations at once while maintaining the benefits of the deep bidirectional architecture of BERT.
We conduct a series of experiments on two unsupervised tasks: N-best list reranking and unsupervised semantic textual similarity. First, in the runtime experiments in CPU environments, we show that the proposed T-TA is 6.35 times faster than the BERT-based model in the reranking task, and 12.7 times faster in the semantic similarity task. Second, even with this faster inference, the T-TA achieves performance competitive with BERT on reranking tasks. Furthermore, the T-TA outperforms BERT by up to 8 points in Pearson's r on unsupervised semantic textual similarity tasks.

Related Works
When referring to autoencoders for language modeling, sequence-to-sequence learning approaches have been commonly used. These approaches encode a given sentence into a compressed vector representation, followed by a decoder that reconstructs the original sentence from the sentence-level representation (Sutskever et al., 2014; Cho et al., 2014; Dai and Le, 2015). To the best of our knowledge, however, none of them considered an autoencoder that encodes word-level representations like BERT without the autoregressive decoding process.
There have been many studies on neural network-based language models for word-level representations. Distributed word representations were proposed and gained wide interest, as they were considered to be fundamental building blocks for natural language processing tasks (Rumelhart et al., 1986; Bengio et al., 2003; Mikolov et al., 2013b). Recently, researchers explored contextualized representations of text, where each word has different representations depending on the context (Peters et al., 2018; Radford et al., 2018). More recently, the Transformer-based deep bidirectional model was proposed and applied to various supervised learning tasks with great success (Devlin et al., 2019).
For unsupervised tasks, researchers adopted the recent language representation models and investigated their effectiveness. One typical example is N-best list reranking for ASR and NMT tasks. In particular, there have been studies integrating left-to-right and right-to-left language models (Arisoy et al., 2015; Chen et al., 2017; Peris and Casacuberta, 2015) so as to outperform conventional unidirectional language models (Mikolov et al., 2010; Sundermeyer et al., 2012) in these tasks. Furthermore, BERT-based approaches have been explored and have achieved significant performance improvements on these tasks, because bidirectional language models yield the pseudo-log-likelihood of a given sentence and this score is useful in ranking the n-best hypotheses (Wang and Cho, 2019; Shin et al., 2019; Salazar et al., 2019).
Another line of research aims at reducing the computation time and memory consumption of BERT. Lan et al. (2019) proposed parameter-reduction techniques, namely factorized embedding parameterization and cross-layer parameter sharing, and achieved 18 times fewer parameters and 1.7 times faster training. In a similar research direction, Sanh et al. (2019) presented a method to pre-train a smaller model that can be fine-tuned for the downstream task, and achieved a 1.4 times lower parameter count with 1.6 times faster inference. However, none of these studies presented methods that directly revise the BERT architecture to decrease the computational complexity during inference.

Language Model Baselines
Conventional language modeling is the task of predicting the $i$-th token $x_i$ using its preceding context $x_{<i} = [x_1, \ldots, x_{i-1}]$, and we call this objective causal language modeling (CLM) throughout this paper, following Conneau and Lample (2019). As shown in Figure 1a, we can obtain (left-to-right) contextualized language representations $H^C = [H^C_1, \ldots, H^C_n]$ by feeding the input sequence to the CLM-trained language model once, where $H^C_i = h^C(x_{<i})$ is the hidden representation of the $i$-th token. This paper takes this unidirectional language model (uniLM) as our speed baseline. However, contextualized language representations obtained from the uniLM are insufficient to accurately encode a given text because future contexts cannot be used to understand the current tokens during inference.
Recently, BERT (Devlin et al., 2019) enabled the full contextualization of language representations by using the masked language modeling (MLM) objective. In the MLM, some tokens from the input sequence are randomly masked, and the objective is to predict the original tokens at the masked positions using only their context. As in Figure 1b, we can obtain a contextualized representation of the $i$-th token, $H^M_i = h^M(M_i(x))$, by masking the token in the input sequence and feeding it into the MLM-trained model, where $M_i(x) = [x_1, \ldots, x_{i-1}, \text{[MASK]}, x_{i+1}, \ldots, x_n]$ is an external masking operation. This paper takes this bidirectional language model (biLM) as our performance baseline. However, this mask-and-predict approach must be repeated $n$ times to obtain the representations of the whole sequence, because learning occurs only at the masked positions during MLM training. Although the language representations are robust and accurate, this repetition causes significant inefficiency in unsupervised applications such as N-best list reranking (Wang and Cho, 2019; Shin et al., 2019; Salazar et al., 2019).
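To make the repetition concrete, the following minimal sketch builds the $n$ masked copies that a mask-and-predict approach would feed to an MLM-trained model, one copy per position. The function name `masked_copies` and the literal `[MASK]` string are illustrative assumptions, not part of the released code.

```python
from typing import List

MASK = "[MASK]"  # assumed mask symbol, used here only for illustration


def masked_copies(tokens: List[str]) -> List[List[str]]:
    """Return n copies of the sequence; the i-th copy has only token i masked."""
    return [tokens[:i] + [MASK] + tokens[i + 1:] for i in range(len(tokens))]


tokens = ["the", "cat", "sat"]
batch = masked_copies(tokens)
for row in batch:
    print(row)
```

Each of the $n$ rows requires its own forward pass, which is the source of the inefficiency discussed above.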

Language Autoencoding
In this paper, we propose a new learning objective named language autoencoding (LAE) for obtaining fully contextualized language representations without repetition. The LAE sets the output to be the same as the input, and the objective is to predict every token in a text sequence at once without merely copying the input to the output. For the proposed task, a language model should reproduce the whole input at once while avoiding over-fitting. Otherwise, the model only outputs the representation copied from the input representation without learning any statistics of the language. To this end, the information flow from the $i$-th input to the $i$-th output should be blocked inside the model, as shown in Figure 1c. From this LAE objective, we can obtain fully contextualized language representations $H^L = [H^L_1, \ldots, H^L_n]$ at once, where each $H^L_i$ is computed without access to the $i$-th token itself. The way of blocking the information flow is described in the next section.

Transformer-based Text Autoencoder
In this section, we introduce a novel architecture of a deep bidirectional language model named T-TA, which stands for Transformer-based Text Autoencoder. The overall architecture of the T-TA is shown in Figure 2. As its name suggests, the model architecture is based on the Transformer encoder (Vaswani et al., 2017). To learn the proposed LAE, we develop a diagonal masking operation and an input isolation mechanism inside the T-TA. Both are designed to let the language model predict every token at once while maintaining the deep bidirectional property (see the descriptions in the following subsections). Due to the space limit, we refer to the original Transformer paper (Vaswani et al., 2017) for other details of the standard functions such as multi-head attention, scaled dot-product attention, layer normalization, and the position-wise fully connected feed-forward network.

Diagonal Masking
As shown in Figure 3, a diagonal masking operation is placed inside the scaled dot-product attention so that each position is "self-unknown" during inference. This operation prevents information from flowing to the same position in the next layer by masking out the diagonal values in the input of the softmax. Specifically, the output vector at each position is the weighted sum of the values V at the other positions, where the attention weights come from the query Q and the key K.
The diagonal mask becomes meaningless when we use it together with the residual connection or in a multi-layer architecture. To keep the self-unknown property functional, we could remove the residual connection and adopt a single-layer architecture. However, it is essential to utilize a deep architecture to understand the intricate patterns of natural language. To this end, we further develop an architecture described in the next section.
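The operation can be sketched in a few lines of NumPy. This is a simplified, single-head illustration that omits the learned projection matrices of the full multi-head attention; the function name `diag_masked_attention` and the random inputs are assumptions made only for the example.

```python
import numpy as np


def diag_masked_attention(Q, K, V):
    """Single-head scaled dot-product attention with the diagonal masked out."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    np.fill_diagonal(scores, -np.inf)            # position i may not attend to itself
    scores -= scores.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(scores)                     # exp(-inf) = 0 on the diagonal
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V, weights


rng = np.random.default_rng(0)
n, d = 4, 8
Q = rng.normal(size=(n, d))
KV = rng.normal(size=(n, d))
Z, weights = diag_masked_attention(Q, KV, KV)
print(np.round(weights, 3))
```

The printed attention matrix has zeros on its diagonal, so row $i$ of the output is a weighted average of the values at the other positions only.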

Input Isolation
We now propose an input isolation mechanism in order to make the residual connection and the multi-layer architecture compatible with the diagonal masking operation. In the input isolation, the key-value inputs (K-V) of all encoding layers are isolated from the network flow, and they are fixed to the sum of the token embeddings and the position embeddings. Only the query inputs (Q) are updated across the layers during inference by referring to the fixed output of the embedding layer. Additionally, we input the position embeddings to the Q of the very first encoding layer in order to make the self-attention mechanism effective. Otherwise, the attention weights would be the same at all positions, so that the first self-attention would simply average the input representations except at the "self" position. Finally, we apply the residual connection only to the query in order to maintain the unawareness completely. The dashed arrows in Figure 2 show this input isolation mechanism inside the T-TA.
By using the diagonal masking and input isolation together, the T-TA can have multiple encoder layers. They enable the T-TA to obtain high-quality contextual language representations with only a single feeding of a sequence.
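The combination of the two mechanisms can be checked numerically with a small sketch. The code below is a simplified stand-in, not the released implementation: it uses a single head, no learned attention projections, parameter-free layer normalization, and ReLU instead of gelu, and all names (`tta_forward`, `dmsa`, `layers`) are hypothetical. It perturbs one token embedding and confirms that the output at that position does not change.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    """Parameter-free layer normalization over the feature dimension."""
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)


def dmsa(Q, K, V):
    """Diagonal-masked self-attention (single head, no projections)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    np.fill_diagonal(scores, -np.inf)
    scores -= scores.max(axis=1, keepdims=True)
    w = np.exp(scores)
    return (w / w.sum(axis=1, keepdims=True)) @ V


def tta_forward(X, P, layers):
    KV = X + P  # keys and values stay fixed in every layer (input isolation)
    Q = P       # the first-layer query is the position embeddings only
    for W1, W2 in layers:
        A = layer_norm(Q + dmsa(Q, KV, KV))               # residual on the query only
        Q = layer_norm(A + np.maximum(A @ W1, 0.0) @ W2)  # position-wise feed-forward
    return Q


rng = np.random.default_rng(0)
n, d, d_ff, num_layers = 5, 8, 16, 3
X = rng.normal(size=(n, d))
P = rng.normal(size=(n, d))
layers = [(rng.normal(size=(d, d_ff)), rng.normal(size=(d_ff, d))) for _ in range(num_layers)]

H = tta_forward(X, P, layers)
X2 = X.copy()
X2[2] += 10.0  # change only the embedding of token 2
H2 = tta_forward(X2, P, layers)
print(np.allclose(H[2], H2[2]), np.allclose(H[0], H2[0]))
```

Under these assumptions the representation at position 2 is unaffected by its own token, while the representations at other positions change, which is the behavior the input isolation is meant to guarantee.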

Discussion and Analysis
Until now, we have introduced the new learning objective, language autoencoding (LAE), and the novel deep bidirectional language model, Transformer-based Text Autoencoder (T-TA). We will verify the model architecture of the proposed T-TA in Section 4.3.1, and compare our model with the recent bidirectional language model BERT in Section 4.3.2.

Verification of the Architecture
We here discuss in detail how the diagonal masking with input isolation preserves the "self-unknown" property.
As in Figure 2, we have two embeddings, token embeddings $X = [X_1, \ldots, X_n]^T \in \mathbb{R}^{n \times d}$ and position embeddings $P = [P_1, \ldots, P_n]^T \in \mathbb{R}^{n \times d}$, where $d$ is the embedding dimension. Under input isolation, the key and value $K = V = X + P$ carry the information of the input tokens and are fixed in all layers, whereas the query $Q^l$ is updated across the layers during inference, starting from the position embeddings $Q^1 = P$ at the first layer.
Let us consider the $l$-th encoding layer's query input $Q^l$ and its output $H^l = Q^{l+1}$. Then,
$$H^l = \mathrm{SMSAN}(Q^l, K, V) = g(\mathrm{Norm}(\mathrm{Add}(Q^l, f(Q^l, K, V)))),$$
where $\mathrm{SMSAN}(\cdot)$ represents the Self-Masked Self-Attention Network, i.e., the encoding layer of the T-TA; $g(x) = \mathrm{Norm}(\mathrm{Add}(x, \mathrm{FeedForward}(x)))$ corresponds to the two upper sub-boxes of the encoding layer; and $f(\cdot)$ is the (multi-head) diagonal-masked self-attention (DMSA) mechanism shown in Figure 2.
As in Figure 3, the DMSA module computes $Z^l$ as follows:
$$Z^l = f(Q^l, K, V) = \mathrm{SoftMax}\left(\mathrm{DiagMask}\left(\frac{Q^l K^T}{\sqrt{d}}\right)\right) V.$$
In the DMSA module, the $i$-th element of $Z^l = [Z^l_1, \ldots, Z^l_n]^T$ is always computed as a weighted average of the fixed $V$ that discards the information of the $i$-th token $X_i$ in $V_i$. To be more specific, $Z^l_i$ is the weighted average of the rows of $V$ with the attention weight vector $s^l_i$, i.e., $Z^l_i = s^l_i V = \sum_{j \neq i} s^l_{i,j} V_j$, since the diagonal mask forces $s^l_{i,i} = 0$. We note that only the DMSA is related to the "self-unknown" property, since no token representation refers to any other in the subsequent transformations from $Z^l$ to $H^l$. Therefore, it is guaranteed that the $i$-th element of the query representation in any layer, $Q^l_i$, never sees the corresponding token representation, starting from $Q^1_i = P_i$. Consequently, the T-TA preserves the "self-unknown" property during inference while maintaining the residual connection and the multi-layer architecture.

Comparison with BERT
There are several differences between the strong baseline BERT (Devlin et al., 2019) and the proposed model T-TA, while both models learn deep bidirectional language representations.
• While BERT uses an external masking operation on the input, the T-TA has an internal masking operation inside the model, as intended. Also, while BERT is based on a denoising autoencoder, the T-TA is based on an autoencoder. Due to this novel approach, the T-TA does not need mask-and-predict repetition when computing contextual language representations. Consequently, we reduce the computational complexity from the $O(n^3)$ of BERT to the $O(n^2)$ of the T-TA when applying the language models to unsupervised learning tasks.
• As with the T-TA, feeding an intact input (without masks) into BERT is also possible. However, we argue that it will significantly hurt the model performance on unsupervised applications, since the MLM objective does not give much consideration to intact tokens. We include experiments that show model performance with the intact input (described in Tables 1, 3, and 4). We also suggest reading previous research that reported the same opinion (Salazar et al., 2019).
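The complexity gap in the first point can be illustrated with a rough count. The sketch below counts only pairwise token interactions in self-attention ($n^2$ per forward pass) and ignores constants, the model dimension, and the feed-forward layers, so it is an idealized estimate under those assumptions rather than a runtime prediction; `attention_cost` is a hypothetical helper.

```python
def attention_cost(n: int, passes: int) -> int:
    """Idealized count of pairwise token interactions: n * n per forward pass."""
    return passes * n * n


n = 20
bert_cost = attention_cost(n, passes=n)  # mask-and-predict: one pass per token
tta_cost = attention_cost(n, passes=1)   # T-TA: a single pass
print(bert_cost, tta_cost, bert_cost // tta_cost)  # 8000 400 20
```

For a 20-word sentence this idealized ratio is 20; measured speedups are expected to be smaller because of the terms the count leaves out.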

Experiments
To evaluate the proposed method, we conduct a series of experiments. We first evaluate the contextual language representations obtained from the Transformer-based Text Autoencoder (T-TA) on N-best list reranking tasks. We then apply our method to unsupervised semantic textual similarity (STS) tasks. The following sections demonstrate that the proposed model is much faster than BERT during inference (Section 5.2) while showing competitive or even better accuracies than those of BERT on reranking tasks (Section 5.3) and STS tasks (Section 5.4).

Language Model Setups
This paper mainly compares the proposed T-TA with the bidirectional language model (biLM), which is trained with the masked language modeling (MLM) objective, like BERT. For a fair comparison, each model has the same number of parameters based on the Transformer, as follows: $|L| = 3$ self-attention layers with $d = 512$ input and output dimensions, $h = 8$ attention heads, and $d_f = 2048$ hidden units for the position-wise feed-forward layers. We use a gelu activation (Hendrycks and Gimpel, 2016) rather than the standard relu, following OpenAI GPT (Radford et al., 2018) and BERT (Devlin et al., 2019). We set the position embeddings to be trainable, following BERT (Devlin et al., 2019), rather than fixed sinusoids (Vaswani et al., 2017), with supported sequence lengths of up to 128 tokens in our experiments. We use WordPiece embeddings (Wu et al., 2016) with a vocabulary of about $|V| = 30{,}000$ tokens. The weights of the embedding layer and the last softmax layer of the Transformer are shared. For the speed baseline, we also implement a unidirectional language model (uniLM), which has the same number of parameters as the T-TA and the biLM.
For training, we make each training instance consist of a single sentence with [BOS] and [EOS] tokens at the beginning and the end of the sentence. We use 64 sentences as a training batch, and train the language models for 1M steps for ASR and 2M steps for NMT. We train the language models with Adam (Kingma and Ba, 2014) with an initial learning rate of 1e-4, $\beta_1 = 0.9$, $\beta_2 = 0.999$, learning rate warmup over the first 50k steps, and linear decay of the learning rate. We use a dropout probability of 0.1 on all layers. Our implementation is based on Google's official code for BERT (footnote 3).
To train the language models that we implement, we use an English Wikipedia dump of about 13GB that has about 120M sentences. The trained models are used for reranking in neural machine translation (NMT) and for unsupervised semantic textual similarity tasks. For reranking in automatic speech recognition (ASR), we use additional in-domain training data, namely the 4.0GB of normalized text data of the official LibriSpeech corpus, which has about 40M sentences.
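The learning-rate schedule described above (linear warmup followed by linear decay) might look like the following sketch. The text does not state the decay endpoint, so decaying to zero at the final step, and the continuous peak at the end of warmup, are assumptions; `learning_rate` is a hypothetical helper, not a function from the BERT code base.

```python
def learning_rate(step: int,
                  peak_lr: float = 1e-4,
                  warmup_steps: int = 50_000,
                  total_steps: int = 1_000_000) -> float:
    """Linear warmup to peak_lr, then linear decay to zero (assumed endpoint)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


for step in (0, 25_000, 50_000, 525_000, 1_000_000):
    print(step, learning_rate(step))
```

With the defaults, the rate rises linearly to 1e-4 at step 50k and falls linearly to zero at step 1M; for the 2M-step NMT setting, `total_steps` would be changed accordingly.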
One of the strong baseline language models, the pre-trained BERT-base-uncased (Devlin et al., 2019), is used for reranking and STS.We also include the reranking results from the traditional count-based 5-gram language models that are trained on each dataset using the KenLM library (Heafield, 2011).

Running Time Analysis
We first measure the running time of each language model for computing the contextual language representation $H^L \in \mathbb{R}^{n \times d}$ of a given text sequence. In the unsupervised STS tasks, we directly use $H^L$ for the analysis. In the case of the reranking task, we need further computation: we compute $\mathrm{Softmax}(H^L E^T)$ to obtain the likelihood of each token, where $E \in \mathbb{R}^{|V| \times d}$ is the weight matrix of the softmax layer. Therefore, the computational cost of reranking is higher than that of STS.

Footnote 3: https://github.com/google-research/bert

In the running time measurement, we use an Intel(R) Core(TM) i7-6850K CPU (3.60GHz) with the TensorFlow 1.12.0 library and Python 3.6.8 on Ubuntu 16.04.06 LTS. In each experiment, we measured the run-time 50 times and averaged the results. Figure 4 shows that the T-TA runs faster than the biLM, and that the gap becomes more significant as the sentence gets longer. For numerical comparison, we set the standard number of words to 20, since the average number of words in English sentences today is about 20 (DuBay, 2006). In this setup, the T-TA takes about 9.85 ms while the biLM takes about 125 ms in the STS task, showing that the T-TA is 12.7 times faster than the biLM. In the reranking task, the ratio between the T-TA and the biLM is reduced to 6.35 times (still significant), because the repetition of the biLM affects only the computation of $H^L$, not that of $\mathrm{Softmax}(H^L E^T)$.
For the readability of Figure 4, we omit the runtime results of the uniLM, which is as fast as the T-TA (Appendix A.3). Given this fast inference, we show in the next section that the T-TA is as accurate as BERT.
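The extra computation needed for reranking can be sketched as follows: a softmax over $H^L E^T$ gives a distribution over the vocabulary at each position, from which the log-probability of each observed token is read off. The shapes, random values, and the helper name `token_log_probs` are assumptions made for illustration.

```python
import numpy as np


def token_log_probs(H, E, token_ids):
    """Log-probability of each observed token under Softmax(H E^T)."""
    logits = H @ E.T                                # shape (n, |V|)
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return log_probs[np.arange(len(token_ids)), token_ids]


rng = np.random.default_rng(0)
n, d, vocab = 4, 8, 10
H = rng.normal(size=(n, d))      # contextual representations, one row per token
E = rng.normal(size=(vocab, d))  # softmax (and shared embedding) weights
ids = np.array([1, 3, 5, 7])     # observed token ids
lp = token_log_probs(H, E, ids)
score_lm = lp.sum()              # sentence score as the sum of token log-probabilities
print(lp, score_lm)
```

The $n \times |V|$ matrix product is what makes reranking more expensive than STS, which stops at $H^L$.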
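A measurement loop of the kind described above might be sketched as follows, together with the arithmetic behind the reported STS speedup. The helper `average_runtime_ms` and the dummy workload are illustrative assumptions; they are not the authors' benchmarking script.

```python
import time
from statistics import mean


def average_runtime_ms(fn, repeats: int = 50) -> float:
    """Run fn `repeats` times and return the mean wall-clock time in milliseconds."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        times.append((time.perf_counter() - start) * 1000.0)
    return mean(times)


# Dummy workload standing in for one model inference call.
elapsed = average_runtime_ms(lambda: sum(range(1000)))

speedup = 125 / 9.85  # reported biLM and T-TA times (ms) in the STS task
print(round(speedup, 1))  # 12.7
```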

Reranking the N-best List
To evaluate language models, we conduct experiments on the unsupervised task of reranking the N-best list. In the experiments, we apply each language model to rerank the 50-best candidate sentences, which are obtained in advance using each sequence-to-sequence model on ASR and NMT. The ASR and NMT models we implement are detailed in Appendix A.1 and A.2.
We rescore the sentences by linearly interpolating two scores from a sequence-to-sequence model and each language model as follows:
$$\text{score} = (1 - \lambda) \cdot \text{score}_{\text{s2s}} + \lambda \cdot \text{score}_{\text{lm}},$$
where $\text{score}_{\text{s2s}}$ is the score from the sequence-to-sequence model, $\text{score}_{\text{lm}}$ is the score from the language model, calculated as the sum (or mean) of the log-likelihood of each token, and the interpolation weight $\lambda$ is set to the value that shows the best performance on the development set.
We note that the T-TA and biLM (also BERT) assign the pseudo-log-likelihood as the score of a given sentence, while the uniLM assigns the log-likelihood. Because the reranking task is based on relative scores of the n-best hypotheses, the fact that bidirectional language models yield the pseudo-log-likelihood of a given sentence does not matter in this task (Wang and Cho, 2019; Shin et al., 2019; Salazar et al., 2019).
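The rescoring step can be sketched as below, assuming the interpolation takes the form $(1 - \lambda) \cdot \text{score}_{\text{s2s}} + \lambda \cdot \text{score}_{\text{lm}}$. The hypotheses and scores are made-up values, and `rerank` is a hypothetical helper.

```python
from typing import List, Tuple

Hypothesis = Tuple[str, float, float]  # (text, score_s2s, score_lm)


def rerank(hypotheses: List[Hypothesis], lam: float) -> str:
    """Return the hypothesis text with the highest interpolated score."""
    def combined(h: Hypothesis) -> float:
        return (1.0 - lam) * h[1] + lam * h[2]
    return max(hypotheses, key=combined)[0]


nbest = [
    ("a cat sat", -1.0, -9.0),   # slightly worse s2s score, better LM score
    ("a cat sad", -0.9, -14.0),  # best s2s score, worse LM score
]
print(rerank(nbest, lam=0.0))  # s2s score only: "a cat sad"
print(rerank(nbest, lam=0.3))  # with the language model: "a cat sat"
```

In practice $\lambda$ would be chosen by trying a small grid of values and keeping the one with the best development-set metric.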

Results on Speech Recognition
For reranking in ASR, we use prepared N-best lists obtained from the dev and test sets using Seq2Seq$_{\text{ASR}}$, which we train on the LibriSpeech ASR corpus. Additionally, we use the N-best lists obtained from Shin et al. (2019) in order to see the robustness of the language models across testing environments. Table 1 shows the word error rates (WERs) for each method after reranking. The interpolation weights $\lambda$ were 0.3 or 0.4 in all N-best lists for ASR. We observe that the bidirectional language models trained with the LAE (T-TA) and MLM (biLM) outperform the unidirectional language model (uniLM) trained with the CLM. Performance gains from reranking are much lower with the better base system Seq2Seq$_{\text{ASR}}$, which suggests that it is challenging to rerank the N-best list using a language model if the speech recognition model performs well enough. Interestingly, the T-TA is competitive with and even better than the biLM, which may come from the gap between training and testing of the biLM: the biLM predicts multiple masks at a time when training, but predicts only one mask at a time when testing. Moreover, the 3-layer T-TA is better than the 12-layer BERT-base, showing that in-domain data is critical to language model applications. Finally, we note that feeding an intact input to BERT, denoted as "w/ BERT$_{\backslash M}$" in Table 1, underperforms the others, which shows that mask-and-predict is necessary for effective reranking.

Results on Machine Translation
To see the reranking performance in another domain, NMT, we prepare the N-best lists using Seq2Seq$_{\text{NMT}}$ from the WMT13 German-to-English and French-to-English test sets. Table 2 shows the BLEU scores for each method after reranking. Each interpolation weight is set to the value that shows the best performance on each test set with each method in NMT. The interpolation weights $\lambda$ were 0.4 or 0.5 in the N-best lists for NMT.
We observe again that the bidirectional language models trained with the LAE and MLM perform better than the unidirectional language model trained with the CLM. Also, reranking has less effect on the Fr→En translation than on the De→En translation because the base NMT system for Fr→En is better than that for De→En.
Seeing that the 12-layer BERT is much better than the others in reranking on NMT, it seems that the N-best hypotheses of the NMT model are harder to distinguish than those of the ASR model from the language model perspective. All reranking results in ASR and NMT demonstrate that the proposed T-TA performs efficiently like the uniLM and effectively like the biLM.

Unsupervised Semantic Textual Similarity
In addition to the reranking task, we apply language models to semantic textual similarity (STS), which is the task of measuring the meaning similarity of sentence pairs. We use the STS Benchmark (Cer et al., 2017) and SICK (Marelli et al., 2014), where both datasets have a set of sentence pairs with corresponding similarity scores. The evaluation metric of STS is Pearson's r between the predicted similarity scores and the reference scores of the given sentence pairs. In this section, we address the task of unsupervised STS to examine the inherent ability of each language model to obtain contextual language representations, and we mainly compare language models that are trained on the English Wikipedia dump. To compute a similarity score of a given sentence pair, we use the cosine similarity of two sentence representations, where each representation is obtained by averaging each language model's contextual language representations. Specifically, contextual representations of a given sentence are the outputs of the final encoding layer of each model, denoted as context in Tables 3 and 4. For comparison, we use non-contextual representations, which are obtained from the outputs of the embedding layer, denoted as embed in Tables 3 and 4. As a strong baseline for unsupervised STS tasks, we also include the 12-layer BERT model (Devlin et al., 2019), and we use BERT in the mask-and-predict approach for computing contextual representations of each sentence. Note that we use the most straightforward approach for unsupervised STS in order to focus on comparing token-level language representations.
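The scoring pipeline described above (mean pooling, cosine similarity, Pearson's r) can be sketched with NumPy. The random matrices stand in for contextual representations and the score lists are made-up values; the helper names are assumptions for illustration.

```python
import numpy as np


def sentence_vector(H):
    """Average the token-level representations (rows of H) into one vector."""
    return H.mean(axis=0)


def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def pearson_r(x, y):
    return float(np.corrcoef(x, y)[0, 1])


rng = np.random.default_rng(0)
H_a = rng.normal(size=(6, 8))  # sentence A: 6 tokens, 8-dimensional representations
H_b = rng.normal(size=(4, 8))  # sentence B: 4 tokens
sim = cosine(sentence_vector(H_a), sentence_vector(H_b))

predicted = [0.9, 0.2, 0.5, 0.7]  # made-up cosine similarities for four pairs
reference = [4.5, 1.0, 2.5, 3.8]  # made-up gold similarity scores
r = pearson_r(predicted, reference)
print(sim, r)
```

Because Pearson's r is invariant to linear rescaling, the cosine similarities do not need to be mapped to the gold score range before evaluation.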

Results on STS Benchmark
The STS Benchmark (STSb) has 5749/1500/1379 sentence pairs for the train/dev/test splits, with corresponding scores ranging from 0 to 5. We test language models on STSb-dev and STSb-test using the simplest approach to unsupervised STS. As additional baselines, we include the results of GloVe (Pennington et al., 2014) and Word2Vec (Mikolov et al., 2013a) from the official site of the STS Benchmark. Table 3 shows that our T-TA trained with the LAE best captures the semantics of a sentence among the Transformer-based language models. It is remarkable that our 3-layer T-TA trained on relatively small data outperforms the 12-layer BERT trained on large data (Wikipedia + BookCorpus). Another interesting point is that embedding representations are trained better by the CLM than by the other language modeling objectives, and we conjecture that the uniLM depends heavily on the embedding layer due to its constraint of unidirectional context.
Since the uniLM encodes all contexts in the last token [EOS], we also use the last representation as the sentence representation, but it does not outperform the averaged sentence representation. Similarly, BERT has a special token [CLS], which is trained with the "next sentence prediction" objective, so we also use it to see how well [CLS] learns a sentence representation, but it significantly underperforms the others.

Table 3 (caption, partially recovered): - denotes the infeasible value. Bolds are for the top-2 performances on each sub-task.

Results on SICK
We further evaluate language models on the SICK data, which consists of 4934/4906 sentence pairs for the train/test splits, with scores ranging from 1 to 5. The results are in Table 4, and we make the same observations as on STSb.
All results on unsupervised STS tasks demonstrate that the T-TA learns textual semantics best using the token-level language modeling, LAE.

Conclusion
In this work, we propose a novel deep bidirectional language model named Transformer-based Text Autoencoder (T-TA) in order to eliminate the computational overload of applying BERT to unsupervised applications. Experimental results on N-best list reranking and unsupervised semantic textual similarity tasks demonstrate that the proposed T-TA is significantly faster than the BERT-based approach, while its encoding ability is competitive with or even better than that of BERT.

A Appendices
A.1 Setups for ASR systems
This section introduces our implementation of the speech recognition system. For the input features, we use an 80-band Mel-scale spectrogram derived from the speech signal. The target sequence is processed in 5K case-insensitive sub-word units created via unigram byte-pair encoding (Shibata et al., 1999). We use an attention-based encoder-decoder model as our acoustic model. The encoder is a 5-layer bidirectional LSTM, and there are bottleneck layers that conduct a linear transformation between every pair of LSTM layers. Also, there is a VGG module before the encoder, and it reduces the encoding time steps by a quarter through two max-pooling layers. The decoder is a 2-layer bidirectional LSTM with a location-aware attention mechanism (Chorowski et al., 2015). All the layers have 1024 hidden units. The model is trained with an additional CTC objective function because the left-to-right constraint of CTC helps learn alignments between speech-text pairs (Hori et al., 2017).
Our model is trained for 20 epochs on 960h of LibriSpeech training data using the Adadelta optimizer (Zeiler, 2012). Using this acoustic model, we obtain the 50-best decoded sentences for each input audio through the hybrid CTC-attention based scoring method (Hori et al., 2017). For Seq2Seq$_{\text{ASR}}$, we additionally use a pre-trained RNNLM and combine the log-probability $p_{\text{lm}}$ of the RNNLM during decoding as follows: $\log p(y_n \mid y_{1:n-1}) = \log p_{\text{am}}(y_n \mid y_{1:n-1}) + \beta \log p_{\text{lm}}(y_n \mid y_{1:n-1})$, where $\beta$ is set to 0.7. We use the ESPNet toolkit (Watanabe et al., 2018) for this implementation.
Table 5 shows the oracle word error rates (WERs) of the 50-best lists, which are measured assuming that the best sentence is always picked from the candidates. We also include the oracle WERs from the 50-best lists of Shin et al. (2019).
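As an illustration of the oracle metric, the sketch below picks, for every reference, the candidate with the fewest word errors and reports the resulting WER. It uses a plain word-level Levenshtein distance; the function names and the toy sentences are assumptions, and real evaluations would also apply the corpus's text normalization.

```python
from typing import List


def word_errors(ref: str, hyp: str) -> int:
    """Word-level edit distance (substitutions, insertions, deletions)."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        cur = [i]
        for j, hw in enumerate(h, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (rw != hw)))  # substitution or match
        prev = cur
    return prev[-1]


def oracle_wer(refs: List[str], nbest_lists: List[List[str]]) -> float:
    """WER (%) when the best candidate is always picked from each N-best list."""
    errors = sum(min(word_errors(ref, hyp) for hyp in nbest)
                 for ref, nbest in zip(refs, nbest_lists))
    words = sum(len(ref.split()) for ref in refs)
    return 100.0 * errors / words


refs = ["a b c d"]
nbest_lists = [["a b c", "a x c d e"]]
print(oracle_wer(refs, nbest_lists))  # 25.0
```

The oracle WER is a lower bound on what any reranker can achieve on a given N-best list.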

A.2 Setups for NMT systems
We implement the standard Transformer model (Vaswani et al., 2017) using the Tensor2Tensor library (Vaswani et al., 2018) for machine translation. Both the encoder and the decoder of the Transformer consist of 6 layers with 512 hidden units, and the number of self-attention heads is 8. The maximum number of input tokens is set to 256. We use a shared vocabulary of size 32k. For effective training, we let the token embedding layer and the last softmax layer share their weights. The other hyperparameters of our translation system follow the standard transformer_base_single_gpu setting in Google's official Tensor2Tensor repository.
We train the baseline model on the standard WMT18 French-English and German-English datasets for 250k steps using the Adam optimizer (Kingma and Ba, 2014). We use linear-warmup-square-root-decay learning rate scheduling with the default learning rate of 2.5e-4 and 16k warmup steps. Using this baseline translation model, we obtain the 50-best decoded sentences for each source sentence through beam search. The oracle BLEU scores for the NMT system are shown in Table 6.

A.3 Running time of uniLM and T-TA
As mentioned in Section 5.2, we also measure the execution times of the uniLM we implement. Figure 5 shows the averaged run-times of the uniLM and the T-TA as a function of the number of words in a sentence. Since we use subword tokens, the number $n_w$ of words and the number $n$ of tokens can differ, with $n_w \leq n$.

A.4 Running time on GPU
Additionally, we measure execution times in a GPU-augmented environment (using a GeForce GTX 1080 Ti). Figure 6 shows the averaged run-times of the biLM and the T-TA as a function of the number of words in a sentence. In our 20-word standard, the T-TA takes about 2.51 ms and the biLM takes about 4.72 ms in the STS task, showing that the T-TA is 1.88 times faster than the biLM. Compared to the CPU-only environment, the speed difference is reduced thanks to the GPU. Comparing with Figure 4, however, the CPU-only environment and the GPU-augmented environment show a similar tendency: the longer the sentence, the more significant the difference between the T-TA and the biLM.

A.5 Perplexity and Reranking
In general, perplexity (PPL) is a measure of how well a language model is trained. To see the alignment between PPL and reranking, we compute the PPL of reference sentences from the LibriSpeech dev-clean and test-clean sets using each language model. We can get only a pseudo-perplexity (pPPL) from the biLM and the T-TA, since they do not follow the product rule, unlike the uniLM. We note that we compute subword-level (p)PPL (not word-level); these values are valid only for our vocabulary. We find that WERs are better aligned with the median pPPL$_m$ than with the averaged pPPL$_a$. Interestingly, the pPPL$_a$ of the T-TA is similar to the PPL$_a$ of the uniLM, but the pPPL$_m$ of the T-TA is similar to that of the biLM. We additionally find that if a sentence is short, the T-TA shows a very high perplexity, even higher than that of the uniLM.
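The difference between the averaged and median statistics can be illustrated with a small sketch. It assumes that (p)PPL is first computed per sentence as the exponential of the negative mean token log-probability and then aggregated over sentences; the text does not spell out this procedure, so it is an assumption, and the log-probabilities are made-up values.

```python
import math
from statistics import mean, median
from typing import List


def perplexity(token_log_probs: List[float]) -> float:
    """exp of the negative mean (pseudo-)log-probability over subword tokens."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


sentences = [
    [-1.0, -1.0],
    [-2.0, -2.0, -2.0],
    [-6.0],  # a short sentence with one poorly predicted token
]
ppls = [perplexity(s) for s in sentences]
ppl_a, ppl_m = mean(ppls), median(ppls)
print(ppl_a, ppl_m)
```

A single short, poorly predicted sentence inflates the average far more than the median, which is one plausible reading of why the median aligns better with WER.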

Figure 2: Architecture of our T-TA. The highlighted box and dashed arrows are newly introduced in this paper.

Figure 3: Diagonal masking of the scaled dot-product attention mechanism. The highlighted box and dashed arrow are newly introduced in this paper.

Figure 4: Average running times of each model according to the number of words on STS and reranking tasks, subscripted as sts and rrk, respectively.

Figure 5: Running times according to the number of words for uniLM and T-TA.

Figure 6: Running times according to the number of words for biLM and T-TA in a GPU-augmented environment.

Table 2: BLEU scores after reranking with each language model on WMT13. Bolds are for the best performance on each sub-task. Underlines are for the best in our implementations.

Table 4: Pearson's r × 100 results on SICK data. - denotes the infeasible value. Bolds are for the best performance on each sub-task.

Table 5: Oracle WERs of the 50-best lists on LibriSpeech from each ASR system.

Table 6: Oracle BLEUs of the 50-best lists on WMT

Table 7: (Pseudo-)perplexities and corresponding WERs of language models on LibriSpeech.