Neural Machine Translation without Embeddings

Many NLP models operate over sequences of subword tokens produced by hand-crafted tokenization rules and heuristic subword induction algorithms. A simple universal alternative is to represent every computerized text as a sequence of bytes via UTF-8, obviating the need for an embedding layer since there are fewer token types (256) than dimensions. Surprisingly, replacing the ubiquitous embedding layer with one-hot representations of each byte does not hurt performance; experiments on byte-to-byte machine translation from English to 10 different languages show a consistent improvement in BLEU, rivaling character-level and even standard subword-level models. A deeper investigation reveals that the combination of embeddingless models with decoder-input dropout amounts to token dropout, which benefits byte-to-byte models in particular.


Introduction
Neural NLP models often operate on the subword level, which requires language-specific tokenizers (Koehn et al., 2007; Adler and Elhadad, 2006) and subword induction algorithms such as BPE (Sennrich et al., 2016; Kudo, 2018). Working at the byte level instead, by representing each character as a variable number of Unicode (UTF-8) bytes, requires no preprocessing at all, allowing the model to read and predict every computerized text using a single vocabulary of 256 types. While previous work found that byte-level models tend to underperform models based on subword tokens (Wang et al., 2019), byte-based models exhibit an interesting property: their vocabulary is smaller than the number of latent dimensions (256 < d).
In this work, we demonstrate that this property allows us to remove the input and output embedding layers from byte-to-byte translation models, and in doing so, improve the models' performance consistently.
We replace the dense trainable embedding matrix with a fixed one-hot encoding of the vocabulary as the first and last layers of a standard transformer model. Machine translation experiments on 10 language pairs show that byte-to-byte models without an embedding layer achieve higher BLEU scores than byte-based models with parameterized embeddings (+0.5 on average), thus closing the performance gap with subword and character models. We observe this result consistently throughout a wide variety of target languages and writing systems.
The fact that removing parameters improves performance is counter-intuitive, especially given recent trends in machine learning that advocate for increasingly larger networks. We further investigate why embeddingless models yield better results and identify implicit token dropout (commonly referred to as "word dropout") as the main source of the boost. While prior work shows that randomly masking tokens from the decoder input can improve the performance of language generation models (Bowman et al., 2016), we find that this effect is amplified when operating at the byte level. Overall, our results suggest that, even without additional parameters, byte-based models can compete with and potentially outperform subword models, but that they may require alternative optimization techniques to achieve that goal.

Byte Tokenization
Modern software typically represents text as Unicode strings encoded in UTF-8, which allows one to encode virtually any writing system using a variable number of bytes per character; English characters are typically represented by a single byte, with other writing systems taking two (e.g. Arabic), three (e.g. Chinese), or four (e.g. emoji) bytes per character. By treating each byte as a separate token, we can encode any natural language text using a single universal vocabulary of only 256 token types. Moreover, byte tokenization obviates the need for any heuristic preprocessing, such as splitting on spaces, punctuation, and contractions. Figure 1 illustrates subword, character, and byte tokenization.

Figure 1: Subword (BPE), character, and byte tokens of the string "Будь здоров." UTF-8 uses two bytes to represent each character in the Cyrillic script, making the byte sequence longer than the number of characters.
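The mapping from text to byte tokens is exactly UTF-8 encoding; a minimal Python sketch, using the string from Figure 1:

```python
# UTF-8 byte tokenization: every string maps to ids in a fixed 256-type vocabulary.
text = "Будь здоров."

char_tokens = list(text)                  # character tokenization
byte_tokens = list(text.encode("utf-8"))  # byte tokenization (ints in 0..255)

# Each Cyrillic character takes two bytes, so the byte sequence is longer
# than the character sequence.
print(len(char_tokens))  # 12
print(len(byte_tokens))  # 22
assert all(0 <= b <= 255 for b in byte_tokens)

# Decoding recovers the original text exactly -- no information is lost.
assert bytes(byte_tokens).decode("utf-8") == text
```

Because encoding is lossless and invertible, no detokenization heuristics are needed at generation time either.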

Embeddingless Model
Our model is based on the original transformer encoder-decoder (Vaswani et al., 2017) with one main difference: we eliminate the input and output token embedding layers. These layers typically use a common parameter matrix E ∈ R^{|V|×d} that contains a d-dimensional embedding vector for each source and target vocabulary item in V. Instead, we use a fixed one-hot representation of our byte vocabulary. For instance, the character "R" (byte value 82) is represented as a vector with 1 at dimension 82 and 0 elsewhere. Since it is standard practice to use representations of more than 256 dimensions, every possible byte can be represented by such a one-hot vector. To predict the next token given a decoder input of n tokens, we take the output of the last transformer decoder layer, Y ∈ R^{n×d}, and apply a softmax across each vector's dimensions. Formal expressions of the input and output of our model are detailed in Figure 2.
Omitting the embedding layer reduces the number of parameters by O(|V| · d). We do add a total of 3 scalar parameters to scale the encoder's and decoder's (one-hot) inputs and the decoder's output (before the softmax). We initialize all three to √d, akin to the constant scaling factor typically applied to the input embedding layer in transformers.

Figure 2: The main differences between the original encoder-decoder model and the new embeddingless model. X ∈ R^{n×|V|} is the one-hot representation of n input tokens (bytes); P_n are the positional embeddings up to length n.

Despite the reduction in model size, memory consumption increases when working on longer sequences, since the space complexity of transformers is O(n^2 + n · d). In our case, d (512) is typically larger than n (see Table 1), entailing an increase in memory consumption that is roughly linear in the sequence length n, and a similar decrease in processing speed compared to character and subword models. In addition to replacing the embedding layers, we also remove the dropout layers on the encoder input and decoder output, since zeroing out entries of one-hot vectors is equivalent to randomly masking input tokens or deleting significant parts of the model's predicted distribution. The dropout on the decoder input (the prefix of the target fed in with teacher forcing) remains intact at this point and is applied throughout our main experiments. Further analysis shows that decoder-input dropout is in fact a significant source of performance gains, which we investigate in Section 6.
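The fixed input and output mappings can be sketched in numpy. This is a simplified illustration, not the authors' code: the transformer stack itself is elided, positional embeddings are omitted, the function names are ours, and the three learned scalars are shown at their √d initialization:

```python
import numpy as np

V, d = 256, 512          # byte vocabulary size < model dimension
scale = np.sqrt(d)       # learned scalar, initialized to sqrt(d)

def one_hot_inputs(byte_ids):
    """Fixed (non-learned) input representation: one-hot bytes embedded in R^d."""
    n = len(byte_ids)
    X = np.zeros((n, d))
    X[np.arange(n), byte_ids] = 1.0   # bytes occupy the first 256 of d dimensions
    return scale * X                  # positional embeddings would be added here

def output_distribution(Y):
    """Softmax over the scaled decoder output itself -- no output embedding matrix."""
    Z = scale * Y
    e = np.exp(Z - Z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

X = one_hot_inputs(list("hello".encode("utf-8")))
probs = output_distribution(np.zeros((3, d)))
assert X.shape == (5, d)
assert np.allclose(probs.sum(axis=-1), 1.0)
```

Since 256 < d, the one-hot vectors fit inside the model dimension with no projection, which is precisely the property that makes the embedding matrix removable.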

Experiments
We train byte-tokenized embeddingless models for machine translation and compare them to standard byte, character, and subword-based models on a diverse set of languages. We adopt a standard experimental setup that was designed and tuned for the subword baseline, limiting our hyperparameter tuning to dropout probabilities.

Datasets We use the IWSLT dataset (Cettolo et al., 2014), selecting 10 languages with varying characteristics (see Table 1). For each one, we train translation models from English to the target language (the original direction of translation), and also in the opposite direction for completeness. We clean the training data for every language pair by first removing sentences longer than 800 bytes, and then removing the sentences with the largest byte-length ratio between source and target, such that a total of 5% of the training examples are removed.

Baselines In addition to the byte-based embeddingless transformer, we train standard transformer encoder-decoder models as baselines, each using a different tokenization scheme: subword, character, and byte. For subword tokenization, we apply the Moses tokenizer (Koehn et al., 2007) followed by BPE (Sennrich et al., 2016). Character and byte tokenization apply no additional preprocessing and treat whitespace as valid tokens.

Hyperparameters
The code for our model and baselines is based on the Fairseq (Ott et al., 2019) implementation of the transformer encoder-decoder model. During preprocessing, we use 10,000 merge operations when building the BPE vocabulary for every language pair. The vocabularies and embeddings are always shared between the source and target languages. In every transformer, we use 6 encoder and decoder layers, 4 attention heads, a hidden dimension of 512, and a feed-forward dimension of 1024. We optimize with Adam (Kingma and Ba, 2014), using the inverse square root learning rate scheduler with 4000 warmup steps, a peak learning rate of 5 × 10^-4, label smoothing of 0.1, and weight decay of 1 × 10^-4. We train each model for 50k steps and average the top 5 checkpoints according to the validation loss. We tune dropout (0.2 or 0.3) on the validation set. We set the batch size to a maximum of 64,000 bytes per batch, which controls for the number of batches per epoch across different tokenization methods.
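Under these settings, a training run might look roughly like the following fairseq-train invocation. This is an approximation for illustration, not the authors' script: the data path is a placeholder and the flag set only mirrors the hyperparameters reported above.

```shell
fairseq-train data-bin/iwslt-bytes \
    --arch transformer \
    --encoder-layers 6 --decoder-layers 6 \
    --encoder-attention-heads 4 --decoder-attention-heads 4 \
    --encoder-embed-dim 512 --decoder-embed-dim 512 \
    --encoder-ffn-embed-dim 1024 --decoder-ffn-embed-dim 1024 \
    --share-all-embeddings \
    --optimizer adam --lr 5e-4 --lr-scheduler inverse_sqrt \
    --warmup-updates 4000 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --weight-decay 1e-4 --dropout 0.2 \
    --max-tokens 64000 --max-update 50000
```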
Evaluation We evaluate our models using SacreBLEU, case-sensitive, with the 13a tokenizer for all languages except Chinese (ZH tokenizer) and Japanese (MeCab tokenizer). We use the raw text as the reference for all of our experiments, instead of using the default tokenized-detokenized version, which normalizes the text and gives an artificial advantage to text processed with Moses. Table 2 shows our experiments' results. Every row describes the test BLEU scores of our model and the three baselines trained on a different language pair. We discuss the implications of these below.

Results
Are embeddings essential? The results show that it is indeed possible to train embeddingless machine translation models that perform competitively. The performance gaps between models with different tokenization schemes are relatively small. Except for Vietnamese, the difference between the embeddingless model and the best embedding-based model is always under 1 BLEU.
In the most controlled setting, where we compare byte-based models with and without learnable embeddings, models without embeddings consistently achieve higher BLEU scores in 19 of 20 cases (and an equal score for ru-en), with a boost of about 0.5 BLEU on average. When compared to models based on character embeddings, the embeddingless byte-to-byte approach yields higher BLEU scores in 17 out of 20 cases, though the average difference is quite small in practice (0.3 BLEU).
Is subword tokenization superior to bytes or characters? Previous work in machine translation shows that subword models consistently outperform character or byte-based models (Gupta et al., 2019; Wang et al., 2019; Gao et al., 2020). However, our results indicate that this is not necessarily the case. When translating from English to a foreign language, the original direction of the IWSLT dataset, embeddingless byte-to-byte models achieve performance that is equal to or better than that of subword embedding models in 8 out of 10 cases. We observe a different trend when translating into English, where subword models surpass the other models for every source language; the fact that Moses is a particularly good tokenizer for English (and less so for other languages) may be related to this phenomenon. Whereas prior work proposed closing the performance gap by adding layers to the basic architecture, under the assumption that character-based models lack capacity or expressiveness, our results show that removing a component from the model can actually improve performance under certain conditions. It is possible that character and byte-based transformer models encounter an optimization issue rather than one of capacity or expressivity.

Analysis
Why does removing the embedding matrix improve the performance of byte-based models? As mentioned in Section 3, the embeddingless models do not use dropout on the encoder input and decoder output, but do apply dropout on the decoder input during training. Since the embeddingless decoder's inputs are fixed one-hot vectors, applying dropout implicitly drops out complete tokens. In prior work, token dropout ("word dropout") has been shown to have a consistently positive effect (Bowman et al., 2016). We therefore rerun our experiments while controlling for token dropout (p = 0.2) to determine its effect on our results. Table 3 shows that decoder-side token dropout improves the performance of all models, with a larger impact on byte-based models and embeddingless models in particular. This effect is largely consistent, with only 7 out of 160 cases in which token dropout decreased performance on the validation set. We suspect that dropping out target tokens softens the effects of exposure bias by injecting noise into the ground-truth prefix.
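The equivalence between element-wise dropout on one-hot vectors and token dropout can be checked directly. A small numpy sketch with illustrative values, using the inverted-dropout scaling found in standard implementations:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.2                                    # dropout probability

byte_ids = list("hello world".encode("utf-8"))
X = np.eye(256)[byte_ids]                  # one-hot decoder inputs, shape (n, 256)

# Standard element-wise dropout with inverted scaling.
mask = rng.random(X.shape) >= p
dropped = X * mask / (1 - p)

# Each row has a single nonzero entry, so zeroing entries independently with
# probability p either leaves the token intact (rescaled) or masks it entirely.
for row in dropped:
    nonzero = row[row != 0]
    assert len(nonzero) == 0 or (len(nonzero) == 1 and np.isclose(nonzero[0], 1 / (1 - p)))
```

With dense learned embeddings, by contrast, the same dropout zeroes a random subset of each embedding's d coordinates and almost never erases a whole token, which is why the effect is specific to one-hot inputs.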
Given the benefits of token dropout on the baseline models, we re-evaluate the results from Section 5, while allowing for token dropout as a potential hyperparameter. Table 4 shows that, when translating from the original English text to a foreign language, the different models perform roughly on par, with no single tokenization method dominating the others. Furthermore, byte-level models with and without embeddings achieve almost identical results. In contrast, when translating in the opposite direction, subword models consistently outperform the other methods with an average gap of 0.76 BLEU from the next best model. Also, removing the embeddings from byte-based models decreases performance by an average of 0.45 BLEU when generating English. This discrepancy might stem from artifacts of reverse translation, or perhaps from the English-centric nature of subword tokenization, which is based on Moses preprocessing and BPE. Overall, these results suggest that despite the greater number of parameters in subword models, character and byte models can perform competitively, but may require slightly different optimization techniques to do so.

Related Work
There is prior work on replacing language-specific tokenizers with more universal tokenization approaches. Schütze (2017) shows how character n-gram embeddings can be effectively trained by segmenting text using a stochastic process. SentencePiece (Kudo and Richardson, 2018) tokenizes raw Unicode strings into subwords using BPE (Sennrich et al., 2016) or a unigram LM (Kudo, 2018). Byte BPE (Wang et al., 2019) extends SentencePiece to operate at the byte level. While this approach is indeed more language-agnostic than heuristic tokenizers, it does suffer from performance degradation when no pre-tokenization (e.g. splitting on whitespace) is applied. Moreover, the assumption that subword units must be contiguous segments does not hold for languages with non-concatenative morphology, such as Arabic and Hebrew.
Character and byte-based language models (Lee et al., 2017) treat the raw text as a sequence of tokens (characters or bytes) and do not require any form of preprocessing or word tokenization; Choe et al. (2019) even demonstrate that byte-based language models can perform comparably to word-based language models on the billion-word benchmark (Chelba et al., 2013). Although earlier results on LSTM-based machine translation models show that character tokenization can outperform subword tokenization (Cherry et al., 2018), recent literature shows that the same does not hold for transformers (Gupta et al., 2019; Wang et al., 2019; Gao et al., 2020). To narrow the gap, recent work suggests using deeper models (Gupta et al., 2019) or specialized architectures (Gao et al., 2020). Our work deviates from this trend by removing layers to improve the model. This observation contests the leading hypothesis in the existing literature, namely that the performance gap results from reduced model capacity, and suggests that the problem may be one of optimization.

Conclusions
This work challenges two key assumptions in neural machine translation models: the necessity of embedding layers, and the superiority of subword tokenization. Experiments on 10 different languages show that, despite the ubiquity of embedding layers, competitive models can be trained without any embeddings by treating text as a sequence of bytes. Our investigation suggests that different tokenization methods may require revisiting the standard optimization techniques used with transformers, which are primarily geared towards sequences of English subwords.