Character-Level Translation with Self-attention

We explore the suitability of self-attention models for character-level neural machine translation. We test the standard transformer model, as well as a novel variant in which the encoder block combines information from nearby characters using convolutions. We perform extensive experiments on WMT and UN datasets, testing both bilingual and multilingual translation to English using up to three input languages (French, Spanish, and Chinese). Our transformer variant consistently outperforms the standard transformer at the character-level and converges faster while learning more robust character-level alignments.


Introduction
Most existing Neural Machine Translation (NMT) models operate on the word or subword level, which tends to make these models memory inefficient because of large vocabulary sizes. Character-level models (Lee et al., 2017; Cherry et al., 2018) instead work directly on raw characters, resulting in a more compact language representation while mitigating out-of-vocabulary (OOV) problems (Luong and Manning, 2016). Character-level models are also well suited for multilingual translation, since multiple languages can be modeled using the same character vocabulary. Multilingual training can lead to improvements in overall performance without an increase in model complexity (Lee et al., 2017), while also circumventing the need to train separate models for each language pair.
Models based on self-attention have achieved excellent performance on a number of tasks, including machine translation (Vaswani et al., 2017) and representation learning (Devlin et al., 2019; Yang et al., 2019). Despite the success of these models, their suitability for character-level translation remains largely unexplored, with most efforts having focused on recurrent models (e.g., Lee et al. (2017); Cherry et al. (2018)).

Figure 1: A comparison of the encoder blocks in the standard transformer (a) and our novel modification, the convtransformer (b), which uses 1D convolutions to facilitate character interactions.
In this work, we perform an in-depth investigation of the suitability of self-attention models for character-level translation. We consider two models: the standard transformer from Vaswani et al. (2017) and a novel variant that we call the convtransformer (Figure 1, Section 3). The convtransformer uses convolutions to facilitate interactions among nearby character representations.
We evaluate these models on both bilingual and multilingual translation to English, using up to three input languages: French (FR), Spanish (ES), and Chinese (ZH). We compare the performance when translating from close (e.g., FR and ES) and distant (e.g., FR and ZH) input languages (Section 5.1), and we analyze the learned character alignments (Section 5.2). We find that self-attention models work surprisingly well for character-level translation, achieving performance competitive with equivalent subword-level models while requiring up to 60% fewer parameters (under the same model configuration). At the character level, the convtransformer outperforms the standard transformer, converges faster, and produces more robust alignments.

Character-level NMT
Fully character-level translation was first tackled by Lee et al. (2017), who proposed a recurrent encoder-decoder model. Their encoder combines convolutional layers with max-pooling and highway layers to construct intermediate representations of segments of nearby characters. Their decoder network autoregressively generates the output translation one character at a time, utilizing attention on the encoded representations.
Lee et al. (2017)'s approach showed promising results on multilingual translation in particular. Without any architectural modifications or changes to the character vocabularies, training on multiple source languages yielded performance improvements while also acting as a regularizer. Multilingual training of character-level models is possible not only for languages that have almost identical character vocabularies, such as French and Spanish, but even for distant languages that can be mapped to a common character-level vocabulary, for example, through latinizing Russian (Lee et al., 2017) or Chinese (Nikolov et al., 2018).
More recently, Cherry et al. (2018) performed an in-depth comparison between different character- and subword-level models. They showed that, given sufficient computational time and model capacity, character-level models can outperform subword-level models, owing to their greater flexibility in processing and segmenting the input and output sequences.

The Transformer
The transformer (Vaswani et al., 2017) is an attention-driven encoder-decoder model that has achieved state-of-the-art performance on a number of sequence modeling tasks in NLP. Instead of using recurrence, the transformer uses only feedforward layers based on self-attention. The standard transformer architecture consists of six stacked encoder layers that process the input using self-attention and six decoder layers that autoregressively generate the output sequence.
The original transformer (Vaswani et al., 2017) computes a scaled dot-product attention by taking as input query Q, key K, and value V matrices:

Attention(Q, K, V) = softmax(QK^T / √d_k) V,

where √d_k is a scaling factor. For the encoder, Q, K, and V are equivalent; thus, given an input sequence of length N, Attention performs N² comparisons, relating each word position with the rest of the words in the input sequence. In practice, Q, K, and V are projected into different representation subspaces (called heads) to perform Multi-Head Attention, with each head learning different word relations, some of which might be interpretable (Vaswani et al., 2017; Voita et al., 2019).
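As a point of reference, the attention computation can be sketched in a few lines of numpy. This is a minimal illustration under our own naming, not code from any particular implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (N, N): every position is compared to every other
    return softmax(scores, axis=-1) @ V

# Encoder self-attention over a toy sequence of N = 4 embeddings of size d_k = 8:
X = np.random.default_rng(0).normal(size=(4, 8))
out = attention(X, X, X)              # Q = K = V for the encoder
```

For character-level inputs, N is the number of characters rather than words, so the N² score matrix grows quickly with sentence length; this is one reason character-level training is slower than subword-level training.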
Intuitively, attention as an operation might not be as meaningful for encoding individual characters as it is for words, because individual character representations might provide limited semantic information for learning meaningful relations on the sentence level. However, recent work on language modeling (Al-Rfou et al., 2019) has surprisingly shown that attention can be very effective for modeling characters, raising the question of how well the transformer would work on character-level bilingual and multilingual translation, and what architectures would be suitable for this task. These are the questions this paper sets out to investigate.

Convolutional Transformer
To facilitate character-level interactions in the transformer, we propose a modification of the standard architecture, which we call the convtransformer. In this architecture, we use the same decoder as the standard transformer, but we adapt each encoder block to include an additional sub-block. The sub-block (Figure 1, b), inspired by Lee et al. (2017), is applied to the input representations M before applying self-attention. It consists of 1D convolutional layers C_w with different context window sizes w; to maintain the temporal resolution of the convolutions, the padding is set to (w − 1)/2. We apply three separate convolutional layers, C_3, C_5, and C_7, in parallel, using context window sizes of 3, 5, and 7, respectively. The different context window sizes aim to capture character-level interactions at different levels of granularity, such as the subword or word level. To compute the final output of the convolutional sub-block, the outputs of the three layers are concatenated and passed through an additional 1D convolutional layer with context window size 3, which fuses the representations:

M' = C_3([C_3(M); C_5(M); C_7(M)]),

where [;] denotes concatenation along the embedding dimension. For all convolutional layers, we set the number of filters equal to the embedding dimension d_model, which results in an output of the same dimension as the input M. Therefore, in contrast to Lee et al. (2017), who use max-pooling to compress the input character sequence into segments of characters, here we leave the resolution unchanged for both the transformer and convtransformer models. Finally, for additional flexibility, we add a residual connection (He et al., 2016) from the input to the output of the convolutional sub-block.
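The sub-block can be illustrated with a minimal numpy sketch. The helper names and the random placeholder weights are ours; a real implementation would use trained filters and batched tensor operations:

```python
import numpy as np

def conv1d_same(M, W):
    """1D convolution over a (N, d_in) sequence with filters W of shape
    (w, d_in, d_out); zero-padding of (w - 1) / 2 keeps the length N."""
    w, d_in, d_out = W.shape
    pad = (w - 1) // 2
    Mp = np.pad(M, ((pad, pad), (0, 0)))
    N = M.shape[0]
    out = np.zeros((N, d_out))
    for t in range(N):
        window = Mp[t:t + w]  # (w, d_in) context centered on position t
        out[t] = np.tensordot(window, W, axes=([0, 1], [0, 1]))
    return out

def conv_block(M, rng):
    """Sketch of the convtransformer encoder sub-block: three parallel
    convolutions (w = 3, 5, 7), concatenation, a fusing convolution
    (w = 3), and a residual connection. Weights are random placeholders."""
    d = M.shape[1]
    outs = [conv1d_same(M, rng.normal(size=(w, d, d)) * 0.01) for w in (3, 5, 7)]
    cat = np.concatenate(outs, axis=1)  # (N, 3 * d)
    fused = conv1d_same(cat, rng.normal(size=(3, 3 * d, d)) * 0.01)
    return M + fused  # residual connection; output keeps shape (N, d)

rng = np.random.default_rng(0)
M = rng.normal(size=(10, 16))  # a toy sequence of 10 characters, d_model = 16
out = conv_block(M, rng)
```

Because the number of filters equals d_model and the padding preserves the sequence length, the output has the same shape as the input, so the sub-block slots into the encoder block without changing the character-level resolution.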

Experimental Set-up
Datasets. We conduct experiments on two datasets. First, we use the WMT15 DE→EN dataset, on which we test different model configurations and compare our results to previous work on character-level translation; we follow the preprocessing in Lee et al. (2017). Second, we use the United Nations (UN) parallel corpus, which is suitable for our purposes because (i) it includes linguistically distant languages, and (ii) all sentences in the corpus are from the same domain. We construct our training corpora by randomly sampling one million sentence pairs from the FR, ES, and ZH parts of the UN dataset, targeting translation to English. To construct multilingual datasets, we combine the respective bilingual datasets (e.g., FR→EN and ES→EN) and shuffle them. To ensure all languages share the same character vocabulary, we latinize the Chinese dataset using the Wubi encoding method, following Nikolov et al. (2018). For testing, we use the original UN test sets provided for each pair.
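The multilingual corpus construction step (sampling bilingual pairs and shuffling them into one training set, with no language identifiers) can be sketched as follows. The `build_multilingual` helper and the toy data are hypothetical illustrations, not part of our pipeline:

```python
import random

def build_multilingual(corpora, n_per_lang, seed=0):
    """Combine bilingual (source, target) corpora into one shuffled
    multilingual training set. No language identifier is attached:
    the model must cope with mixed-language input on its own.
    `corpora` maps a language code to a list of (src, tgt) pairs."""
    rng = random.Random(seed)
    pairs = []
    for lang, data in corpora.items():
        # Sample up to n_per_lang pairs per input language.
        pairs.extend(rng.sample(data, min(n_per_lang, len(data))))
    rng.shuffle(pairs)
    return pairs

# Toy bilingual corpora (illustrative placeholders only):
fr = [("bonjour .", "hello .")] * 5
es = [("hola .", "hello .")] * 5
train = build_multilingual({"fr": fr, "es": es}, n_per_lang=3)
```

For distant language pairs, the source side would additionally be mapped to a shared latin character vocabulary (e.g., Wubi-encoded Chinese) before being combined.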
Tasks. Our experiments are designed as follows: (i) a bilingual scenario, in which we train a model on a single input language; and (ii) a multilingual scenario, in which we input two or three languages at the same time, without providing any language identifiers to the models and without increasing the number of parameters. We test combinations of input languages that are more similar in terms of syntax and vocabulary (e.g., FR and ES) as well as more distant ones (e.g., ES and ZH).

Automatic evaluation
Model comparison. In Table 1, we compare different model configurations on the WMT15 DE→EN dataset. We find character-level training to be 3 to 5 times slower than subword-level training due to much longer sequence lengths. However, the standard transformer trained at the character level already achieves very good performance, outperforming the recurrent model from Lee et al. (2017). On this dataset, our convtransformer variant performs on par with the character-level transformer. Character-level transformers also perform competitively with equivalent BPE models while requiring up to 60% fewer parameters. Furthermore, our 12-layer convtransformer model matches the performance of the 6-layer BPE transformer, which has a comparable number of parameters.
Multilingual experiments. In Table 2, we report our BLEU results on the UN dataset using the 6-layer transformer/convtransformer character-level models (Appendix A contains example model outputs). All of our models were trained for 30 epochs. Multilingual models are evaluated on translation from all possible input languages to English.

Table 2: BLEU scores on the UN dataset, for different input training languages (first column), evaluated on three different test sets (t-FR, t-ES and t-ZH). The target language is always English. #P is the number of training pairs. The best overall results for each language are in bold.
Although multilingual translation can be realized using subword-level models through extracting a joint segmentation for all input languages (e.g., as in Firat et al. (2016); Johnson et al. (2017)), here we do not include any subword-level multilingual baselines, for two reasons. First, extracting a good multilingual segmentation is much more challenging for our choice of input languages, which includes distant languages such as Chinese and Spanish. Second, as discussed previously, subword-level models have a much larger number of parameters, making a balanced comparison with character-level models difficult.
The convtransformer consistently outperforms the character-level transformer on this dataset, with a gap of up to 2.3 BLEU on bilingual translation (ZH→EN) and up to 2.6 BLEU on multilingual translation (FR+ZH→EN). Training multilingual models on similar input languages (FR+ES→EN) leads to improved performance for both languages, which is consistent with Lee et al. (2017). Training on distant languages is surprisingly still effective in some cases. For example, the models trained on FR+ZH→EN outperform the models trained just on FR→EN; however, they perform worse than the bilingual models trained on ZH→EN. Thus, distant-language training seems to be helpful mainly when the input language is closer to the target translation language (which is English here).
The convtransformer is about 30% slower to train than the transformer (see Figure 2). Nevertheless, the convtransformer reaches comparable performance in less than half of the number of epochs, leading to an overall training speedup compared to the transformer.

Analysis of Learned Alignments
To gain a better understanding of the multilingual models, we analyze their learned character alignments as inferred from the model attention probabilities. For each input language (e.g., FR), we compare the alignments learned by each of our multilingual models (e.g., the FR+ES→EN model) to the alignments learned by the corresponding bilingual model (e.g., FR→EN). Our intuition is that the bilingual models have the greatest flexibility to learn high-quality alignments because they are not distracted by other input languages. Multilingual models, by contrast, might learn lower-quality alignments because either (i) the architecture is not robust enough for multilingual training; or (ii) the languages are too dissimilar to allow for effective joint training, prompting the model to learn alternative alignment strategies to accommodate all languages.

We quantify the alignments using canonical correlation analysis (CCA) (Morcos et al., 2018). First, we sample 500 random sentences from each of our UN test sets (FR, ES, or ZH) and produce alignment matrices by extracting the encoder-decoder attention from the last layer of each model. We then use CCA to project each alignment matrix to a common vector space and infer the correlation. We analyze our transformer and convtransformer models separately. Our results are in Figure 3, while Appendix B contains example alignment visualizations.
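The correlation measurement can be sketched as follows. This is a self-contained numpy illustration of canonical correlations between two sets of flattened alignment features, computed in closed form via whitening and SVD; it is not our exact analysis code, and the toy data stand in for real attention matrices:

```python
import numpy as np

def cca_correlations(X, Y, eps=1e-8):
    """Canonical correlations between two views X (n, p) and Y (n, q):
    whiten each view, then take the singular values of the
    cross-covariance in the whitened space."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    def inv_sqrt(S):
        # Inverse matrix square root of a symmetric PSD matrix.
        vals, vecs = np.linalg.eigh(S)
        return vecs @ np.diag(1.0 / np.sqrt(np.maximum(vals, eps))) @ vecs.T
    M = inv_sqrt(X.T @ X) @ (X.T @ Y) @ inv_sqrt(Y.T @ Y)
    return np.linalg.svd(M, compute_uv=False)  # sorted canonical correlations

# Toy "alignment features" for the same 50 sentences under two models,
# flattened to 6 features each; B is a noisy copy of A, so the
# correlations should be high:
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 6))
B = A + 0.1 * rng.normal(size=(50, 6))
corrs = cca_correlations(A, B)
```

In our analysis, the rows would correspond to sampled test sentences and the columns to a fixed-size projection of each sentence's encoder-decoder attention matrix.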
For similar source languages (e.g., the FR+ES→EN model), we observe a strong positive correlation to the bilingual models, indicating that alignments can be learned simultaneously. When introducing a distant source language (ZH) in the training, we observe a drop in correlation for FR and ES, and an even larger drop for ZH. This result is in line with our BLEU results from Section 5.1, suggesting that multilingual training on distant input languages is more challenging than multilingual training on similar input languages. The convtransformer is more robust to the introduction of a distant language than the transformer (p < 0.005 for FR and ES inputs, according to a one-way ANOVA test). Our results also suggest that more sophisticated attention architectures might need to be developed when training multilingual models on several distant input languages.

Conclusion
We performed a detailed investigation of the utility of self-attention models for character-level translation. We tested the standard transformer architecture and introduced a novel variant that augments the transformer encoder with convolutions, to facilitate information propagation across nearby characters. Our experiments show that self-attention performs very well on character-level translation, with character-level architectures performing competitively against equivalent subword-level architectures while requiring fewer parameters. Training on multiple input languages is also effective and leads to improvements across all languages when the source and target languages are similar. When the languages are different, we observe a drop in performance, in particular for the distant language.
In future work, we will extend our analysis to include additional source and target languages from different language families, such as more Asian languages. We will also work towards improving the training efficiency of character-level models, which is one of their main bottlenecks, as well as towards improving their effectiveness in multilingual training.

A Example model outputs
Tables 3, 4, and 5 contain example translations produced by our different bilingual and multilingual models trained on the UN datasets.

B Visualization of Attention
In Figures 4, 5, 6, and 7, we plot example alignments produced by our different bilingual and multilingual models trained on the UN datasets, always testing on translation from FR to EN. The alignments are produced by extracting the encoder-decoder attention of the last decoder layer of our transformer/convtransformer models. We observe the following patterns: (i) for bilingual translation (Figure 4), the convtransformer has a sharper weight distribution on the matching characters and words than the transformer; (ii) for multilingual translation of close languages (FR+ES→EN, Figure 5), both the transformer and the convtransformer are able to preserve the word alignments, but the alignments produced by the convtransformer appear to be slightly less noisy; (iii) for multilingual translation of distant languages (FR+ZH→EN, Figure 6), the character alignments of the transformer become visually much noisier and concentrate on a few individual characters, with many word alignments dissolving, while the character alignments of the convtransformer remain more spread out and its word alignments appear to be better preserved, which is another indication that the convtransformer is more robust for multilingual translation of distant languages; (iv) for multilingual translation with three inputs, where two of the three languages are close (FR+ES+ZH→EN, Figure 7), we observe a similar pattern, with word alignments being better preserved by the convtransformer.

Table 3: Example translations from the UN test set (FR source).

source (FR): Pour que ce cadre institutionnel soit efficace, il devra remédier aux lacunes en matière de réglementation et de mise en oeuvre qui caractérisent à ce jour la gouvernance dans le domaine du développement durable.
reference (EN): For this institutional framework to be effective, it will need to fill the regulatory and implementation deficit that has thus far characterized governance in the area of sustainable development.
FR→EN
transformer: To ensure that this institutional framework is effective, it will need to address regulatory and implementation gaps that characterize governance in sustainable development.
convtransformer: In order to ensure that this institutional framework is effective, it will have to address regulatory and implementation gaps that characterize governance in the area of sustainable development.

FR+ES→EN
transformer: To ensure that this institutional framework is effective, it will need to address gaps in regulatory and implementation that characterize governance in the area of sustainable development.
convtransformer: In order to ensure that this institutional framework is effective, it will be necessary to address regulatory and implementation gaps that characterize governance in sustainable development so far.

FR+WB→EN
transformer: To ensure that this institutional framework is effective, gaps in regulatory and implementation that have characterized governance in sustainable development to date.
convtransformer: For this institutional framework to be effective, it will need to address gaps in regulatory and implementation that characterize governance in the area of sustainable development.

FR+ES+WB→EN
transformer: To ensure that this institutional framework is effective, it will need to address regulatory and implementation gaps that are characterized by governance in the area of sustainable development.
convtransformer: If this institutional framework is to be effective, it will need to address gaps in regulatory and implementation that are characterized by governance in the area of sustainable development.

Table 4: Example translations from the UN test set (ES source).

source (ES): Estamos convencidos de que el futuro de la humanidad en condiciones de seguridad, la coexistencia pacífica, la tolerancia y la reconciliación entre las naciones se verán reforzados por el reconocimiento de los hechos del pasado.
reference (EN): We strongly believe that the secure future of humanity, peaceful coexistence, tolerance and reconciliation between nations will be reinforced by the acknowledgement of the past.

ES→EN
transformer: We are convinced that the future of humanity in conditions of security, peaceful coexistence, tolerance and reconciliation among nations will be strengthened by recognition of the facts of the past.
convtransformer: We are convinced that the future of humanity under conditions of safe, peaceful coexistence, tolerance and reconciliation among nations will be reinforced by the recognition of the facts of the past.

FR+ES→EN
transformer: We are convinced that the future of mankind under security, peaceful coexistence, tolerance and reconciliation among nations will be strengthened by the recognition of the facts of the past.
convtransformer: We are convinced that the future of humanity in safety, peaceful coexistence, tolerance and reconciliation among nations will be reinforced by the recognition of the facts of the past.

ES+WB→EN
transformer: We are convinced that the future of humanity in safety, peaceful coexistence, tolerance and reconciliation among nations will be strengthened by the recognition of the facts of the past.
convtransformer: We are convinced that the future of humanity in safety, peaceful coexistence, tolerance and reconciliation among nations will be strengthened by the recognition of the facts of the past.

FR+ES+WB→EN
transformer: We are convinced that the future of mankind in safety, peaceful coexistence, tolerance and reconciliation among nations will be strengthened by the recognition of the facts of the past.
convtransformer: We are convinced that the future of mankind in security, peaceful coexistence, tolerance and reconciliation among nations will be strengthened by the recognition of the facts of the past.

Table 5: Example translations from the UN test set (ZH source; the second source line shows the latinized Wubi encoding).

source (ZH): 利用专家管理农场对于最大限度提高生产率和灌溉水使用效率也是重要的。
source (ZH, Wubi): tjh|et fny|pe tp|gj pei|fnrt cf|gf jb|dd bv|ya rj|ym tg|u|yx t iak|ivc|ii wgkq0|et uqt|yx bn j tgj|s r .
reference (EN): The use of expert farm management is also important to maximize land productivity and efficiency in the use of irrigation water.

ZH→EN
transformer: The use of expert management farms is also important for maximizing productivity and irrigation use.
convtransformer: The use of experts to manage farms is also important for maximizing efficiency in productivity and irrigation water use.

FR+ZH→EN
transformer: The use of expert management farms is also important for maximizing productivity and efficiency in irrigation water use.
convtransformer: The use of expert management farms is also important for maximizing productivity and irrigation water efficiency.

ES+ZH→EN
transformer: The use of expert farm management is also important for maximizing productivity and irrigation water use efficiency.
convtransformer: The use of expert management farms to maximize efficiency in productivity and irrigation water use is also important.

FR+ES+ZH→EN
transformer: The use of expert management farms is also important for maximizing productivity and irrigation water use.
convtransformer: It is also important that expert management farms be used to maximize efficiency in productivity and irrigation use.