Towards Reasonably-Sized Character-Level Transformer NMT by Finetuning Subword Systems

Applying the Transformer architecture at the character level usually requires very deep models that are difficult and slow to train. These problems can be partially overcome by incorporating token segmentation into the model. We show that by first training a subword model and then finetuning it on characters, we can obtain a neural machine translation model that works at the character level without requiring token segmentation. We use only the vanilla 6-layer Transformer Base architecture. Our character-level models better capture morphological phenomena and are more robust to noise, at the expense of somewhat worse overall translation quality. Our study is a significant step towards high-performance, easy-to-train character-based models that are not extremely large.


Introduction
State-of-the-art neural machine translation (NMT) models operate almost end-to-end except for input and output text segmentation. The segmentation is done by first employing rule-based tokenization and then splitting into subword units using statistical heuristics such as byte-pair encoding (BPE; Sennrich et al., 2016) or SentencePiece (Kudo and Richardson, 2018).
Recurrent sequence-to-sequence (S2S) models can learn translation end-to-end (at the character level) without changes in the architecture (Cherry et al., 2018), given sufficient model depth. Training character-level Transformer S2S models (Vaswani et al., 2017) is more complicated because the cost of self-attention is quadratic in the sequence length.
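The quadratic cost is easy to quantify: character sequences are roughly 4× longer than subword sequences (cf. Table 2), so a sentence costs roughly 16× more attention computation. A back-of-the-envelope sketch (the sequence lengths here are illustrative, not measured):

```python
def self_attention_madds(seq_len: int, d_model: int = 512) -> int:
    """Rough multiply-add count of one self-attention pass:
    computing Q.K^T and the attention-weighted values each costs
    seq_len^2 * d_model operations (linear projections ignored)."""
    return 2 * seq_len * seq_len * d_model

# the same sentence as ~25 subwords vs. ~100 characters (illustrative)
ratio = self_attention_madds(100) / self_attention_madds(25)  # 16.0
```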
In this paper, we empirically evaluate Transformer S2S models. We observe that training a character-level model directly from random initialization suffers from instabilities, often preventing it from converging. Instead, we propose finetuning subword-based models to get a model without explicit segmentation. Our character-level models show slightly worse translation quality, but have better robustness towards input noise and better capture morphological phenomena. Our approach is important because previous approaches have relied on very large transformers, which are out of reach for much of the research community.

Related Work
Character-level decoding was relatively easy to achieve with recurrent S2S models (Chung et al., 2016), but early attempts at segmentation-free NMT with recurrent networks used input hidden states covering a constant character span (Lee et al., 2017). Cherry et al. (2018) showed that with a sufficiently deep recurrent model, no changes in the architecture are necessary, and translation quality remains on par with subword models. The models of Luong and Manning (2016) and Ataman et al. (2019) can leverage character-level information, but they require tokenized text as input and only have access to the character-level embeddings of predefined tokens.
Training character-level transformers is more challenging. Choe et al. (2019) successfully trained a character-level left-to-right Transformer language model that performs on par with a subword-level model. However, they needed a large model with 40 layers trained on a billion-word corpus, with prohibitive computational costs.
In the work most closely related to ours, Gupta et al. (2019) managed to train a character-level NMT model with the Transformer using Transparent Attention (Bapna et al., 2018). Transparent Attention attends to all encoder layers simultaneously, making the model more densely connected but also more computationally expensive; during training, this improves the gradient flow from the decoder to the encoder. Gupta et al. (2019) claim that Transparent Attention is crucial for training character-level models and show results on very deep networks, with translation quality and model robustness similar to ours. In contrast, our model, which is not very deep, trains quickly. It also supports fast inference and uses less RAM, both of which are important for deployment. Gao et al. (2020) recently proposed adding a convolutional sub-layer to the Transformer layers. At the cost of a 30% increase in model parameter count, they managed to halve the gap between subword- and character-based models. Similar results were reported by Banar et al. (2020), who reused the convolutional preprocessing layer with constant-span segments of Lee et al. (2017) in a Transformer model.

Table 1: Examples of segmentation of the same sentence with decreasing vocabulary size (spaces separate tokens).

input           The cat sleeps on a mat.
tokenization    The cat sleeps on a mat .
32k             The cat sle eps on a mat .
8k              The c at s le eps on a m at .
500             The c at s le ep s on a m at .
0               T h e c a t s l e e p s o n a m a t .

Our Method
We train our character-level models by finetuning subword models, which does not increase the number of model parameters. Similar to the transfer learning experiments of Kocmi and Bojar (2018), we start with a fully trained subword model and continue training with the same data segmented using only a subset of the original vocabulary.
To prevent the initial subword models from relying on sophisticated tokenization rules, we opt for the lossless tokenization algorithm from SentencePiece (Kudo and Richardson, 2018). First, we replace all spaces with the ▁ sign and split before all non-alphanumeric characters (first line of Table 1). In further segmentation, the special space sign is treated identically to other characters.
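A minimal sketch of this lossless pre-tokenization (simplified to ASCII letters and digits; the actual SentencePiece implementation handles full Unicode and further normalization):

```python
import re

SPACE = "\u2581"  # the ▁ sign used as an explicit space mark

def pretokenize(text: str) -> list:
    """Replace spaces with ▁ and split before every non-alphanumeric
    character, so the original text can be restored exactly."""
    marked = text.replace(" ", SPACE)
    return [t for t in re.split(r"(?=[^0-9A-Za-z])", marked) if t]

def detokenize(tokens: list) -> str:
    """Exact inverse of pretokenize."""
    return "".join(tokens).replace(SPACE, " ")
```

For the sentence from Table 1, `pretokenize("The cat sleeps on a mat.")` yields `['The', '▁cat', '▁sleeps', '▁on', '▁a', '▁mat', '.']`, and `detokenize` restores the input verbatim.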
We use BPE (Sennrich et al., 2016) for subword segmentation because it generates the merge operations in a deterministic order. Therefore, a vocabulary based on a smaller number of merges is a subset of a vocabulary based on more merges estimated from the same training data. Examples of the segmentation are provided in Table 1. Quantitative effects of the different segmentations on the data are presented in Table 2, showing that character sequences are on average more than 4 times longer than subword sequences with a 32k vocabulary.

We experiment both with deterministic segmentation and with stochastic segmentation using BPE Dropout (Provilkov et al., 2020). At training time, BPE Dropout randomly discards BPE merges with probability p, a hyperparameter of the method. As a result, the text gets stochastically segmented into smaller units. BPE Dropout increases translation robustness on the source side but typically has a negative effect when used on the target side. In our experiments, we use BPE Dropout on both the source and target side. In this way, character-segmented inputs already appear at training time, making the transfer learning easier.
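The following sketch illustrates both mechanisms on a single word: deterministic BPE applies the ranked merges greedily, and BPE Dropout skips each applicable merge with probability p. The merge list is a toy example, not a learned one, and reference implementations apply merges by priority across the whole word rather than in per-merge passes:

```python
import random

def bpe_segment(word, merges, p=0.0, rng=random):
    """Apply ranked BPE merges left to right; with BPE Dropout,
    each applicable merge is skipped with probability p."""
    symbols = list(word)
    for a, b in merges:  # merges are ranked: earlier = learned first
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b and rng.random() >= p:
                symbols[i:i + 2] = [a + b]  # merge the adjacent pair
            else:
                i += 1
    return symbols

merges = [("a", "t"), ("at", "s"), ("c", "ats")]
bpe_segment("cats", merges)         # deterministic: ["cats"]
bpe_segment("cats", merges, p=1.0)  # all merges dropped: ["c", "a", "t", "s"]
```

With p = 1.0 every merge is discarded and the output is exactly the character segmentation, which is why character-level inputs are already seen during subword training with BPE Dropout.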
We test two methods for finetuning subword models into character-level models: first, directly finetuning the subword model on character-segmented data, and second, iteratively removing BPE merges in several steps in a curriculum learning setup (Bengio et al., 2009). In both cases, we always finetune the models until they are fully converged, using early stopping.
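Because each merge adds exactly one symbol to the vocabulary, truncating the ranked merge list from the end yields a chain of nested vocabularies, which is what makes the step-wise curriculum well defined. A toy illustration (hypothetical alphabet and merges):

```python
def bpe_vocab(alphabet, merges):
    """Vocabulary induced by a ranked BPE merge list: the character
    alphabet plus one new symbol per merge."""
    vocab = set(alphabet)
    for a, b in merges:
        vocab.add(a + b)
    return vocab

alphabet = "acst"
merges = [("a", "t"), ("at", "s"), ("c", "ats")]
# dropping merges from the end only ever removes vocabulary entries,
# down to the pure character vocabulary at zero merges
vocabs = [bpe_vocab(alphabet, merges[:k]) for k in range(len(merges), -1, -1)]
all(small <= big for small, big in zip(vocabs[1:], vocabs))  # True
```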

Experiments
To cover target languages of various morphological complexity, we conduct our main experiments on two resource-rich language pairs, English-German and English-Czech, and on a low-resource pair, English-Turkish. Rich inflection in Czech, compounding in German, and agglutination in Turkish are examples of phenomena interesting for character models. We train and evaluate the English-German translation using 4.5M parallel sentences.

We follow the original hyperparameters of the Transformer Base model (Vaswani et al., 2017), including the learning rate schedule. For finetuning, we use Adam (Kingma and Ba, 2015) with a constant learning rate of 10^-5. All models are trained using Marian (Junczys-Dowmunt et al., 2018). We also present results for character-level English-German models having about the same number of parameters as the best-performing subword models. In experiments with BPE Dropout, we set the dropout probability p = 0.1.
We evaluate the translation quality using BLEU (Papineni et al., 2002), chrF (Popović, 2015), and METEOR 1.5 (Denkowski and Lavie, 2014). Following Gupta et al. (2019), we also conduct a noise-sensitivity evaluation with natural noise as introduced by Belinkov and Bisk (2018): with probability p, words are replaced with their variants from a misspelling corpus. Following Gupta et al. (2019), we assume the BLEU scores measured on noisy input can be explained by a linear approximation with intercept α and slope β in the noise probability p: BLEU ≈ βp + α. However, unlike them, we report the relative translation quality degradation β/α instead of only β. Parameter β corresponds to the absolute BLEU score degradation, which tends to be smaller in magnitude for lower-quality systems (they have less to lose), making them seem more robust.
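The fit and the relative-degradation statistic can be computed directly; the scores below are made up for illustration, not measured results:

```python
import numpy as np

def relative_degradation(noise_probs, bleu_scores):
    """Fit BLEU = beta * p + alpha by least squares and return
    beta / alpha, the slope normalized by clean-input quality."""
    beta, alpha = np.polyfit(noise_probs, bleu_scores, deg=1)
    return beta / alpha

# hypothetical system: 30 BLEU on clean input, losing 5 BLEU per 10% noise
p = [0.0, 0.05, 0.10, 0.20]
bleu = [30.0, 27.5, 25.0, 20.0]
relative_degradation(p, bleu)  # ≈ -1.67
```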
To look at morphological generalization, we evaluate translation into Czech and German using MorphEval (Burlot and Yvon, 2017). MorphEval consists of 13k sentence pairs that differ in exactly one morphological category. The score is the percentage of pairs where the correct sentence is preferred.

Results
The results of the experiments are presented in Table 3. The translation quality decreases only slightly when the vocabulary is drastically reduced. However, there remains a gap of 1-2 BLEU points between the character-level and subword-level models.
With the exception of Turkish, models trained by finetuning reach better translation quality, by a large margin, than character-level models trained from scratch.
In accordance with Provilkov et al. (2020), we found that BPE Dropout applied on both the source and target side leads to slightly worse translation quality, presumably because the stochastic segmentation leads to multimodal target distributions. Detailed results are presented in Appendix A. However, for most language pairs, we found a small positive effect of BPE Dropout on the finetuned systems (see Table 4).
For English-to-Czech translation, we observe a large drop in BLEU score with decreasing vocabulary size, but almost no drop in METEOR score, whereas for the other language pairs, all metrics are in agreement. The differences between the subword and character-level models are less pronounced in the low-resource English-to-Turkish translation.
Whereas the number of parameters in the Transformer layers is constant at 35 million across all models, the number of parameters in the embeddings decreases 30×, from over 15M to only slightly over 0.5M, an overall parameter count reduction of 30%. However, matching the number of parameters by increasing the model capacity nearly closes the performance gap, as shown in Table 5.
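For intuition, the embedding sizes above are consistent with rough arithmetic at the Transformer Base width; the vocabulary sizes below are illustrative assumptions, not the exact counts from our configurations:

```python
d_model = 512                # Transformer Base hidden size
subword_vocab = 30_000       # on the order of the 32k BPE setup
char_vocab = 1_000           # characters plus special symbols

emb_subword = subword_vocab * d_model  # 15_360_000, i.e. over 15M
emb_char = char_vocab * d_model        # 512_000, i.e. ~0.5M
emb_subword // emb_char                # 30x reduction
```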
In our first set of experiments, we finetuned the model using the character-level input directly. Experiments with parent models of various vocabulary sizes (column "Direct finetuning" in Table 3) suggest that the larger the parent vocabulary, the worse the character-level translation quality. This result led us to hypothesize that gradually decreasing the vocabulary size in several steps might lead to better translation quality. In the follow-up experiment, we gradually reduced the vocabulary size by 500 merges at a time and always finetuned until convergence. However, we observed a small drop in translation quality at every step, and the overall translation quality was slightly worse than with direct finetuning (column "In steps" in Table 3).

Figure 1: Degradation of the translation quality of the subword (gray, the darker the color, the smaller the vocabulary) and character-based systems (red) for English-German translation with increasing noise.
With our character-level models, we achieved higher robustness towards source-side noise (Figure 1). Models trained with a smaller vocabulary tend to be more robust towards source-side noise.
Character-level models tend to perform slightly better on the MorphEval benchmark. Detailed results are shown in Table 6. In German, this is due to better capturing of agreement in coordination and of the future tense. This result is unexpected because these phenomena involve long-distance dependencies. On the other hand, the character-level models perform worse on compounds, which are a local phenomenon. Ataman et al. (2019) observed similar results on compounds with their hybrid character-word method. We suspect this might be caused by poor memorization of some compounds in the character models.
In Czech, models with a smaller vocabulary better capture gender and number agreement in pronouns, probably due to direct access to inflectional endings. Unlike in German, character-level models capture agreement in coordination worse, presumably because the distances are longer when measured in characters.

Training and inference times are shown in Table 7. The significantly longer sequences also manifest in slower training and inference: our character-level models are 5-6× slower than subword models with 32k units. Doubling the number of layers, which had a similar effect on translation quality as the proposed finetuning (Gupta et al., 2019), increases the inference time approximately 2-3× in our setup.

Conclusions
We presented a simple approach for training character-level models by finetuning subword models. Our approach requires neither computationally expensive architecture changes nor dramatically increased model depth. Subword-based models can be finetuned to work at the character level without explicit segmentation, at the cost of a modest drop in translation quality. The resulting models are robust to input noise and better capture some morphological phenomena. This is important for research groups that need to train and deploy character-level Transformer models without access to very large computational resources.

A Effect of BPE Dropout
We discussed the effect of BPE dropout in Section 3. Table 8 shows the comparison of the main quantitative results with and without BPE dropout.

B Notes on Reproducibility
The training times were measured on machines with GeForce GTX 1080 Ti GPUs and Intel Xeon E5-2630v4 CPUs (2.20GHz). The parent models were trained on 4 GPUs simultaneously; the finetuning experiments were done on a single GPU. We used the model hyperparameters from previous work and did not experiment with the hyperparameters of the architecture or the training of the initial models. The only hyperparameter that we tuned was the learning rate of the finetuning. We set the value to 10^-5 after several experiments on English-to-German translation with values between 10^-7 and 10^-3, based on the BLEU score on validation data.
Validation BLEU scores are tabulated in Table 9.