Character-level Chinese-English Translation through ASCII Encoding

Character-level Neural Machine Translation (NMT) models have recently achieved impressive results on many language pairs. They mainly do well for Indo-European language pairs, where the languages share the same writing system. However, for translating between Chinese and English, the gap between the two different writing systems poses a major challenge because of a lack of systematic correspondence between the individual linguistic units. In this paper, we enable character-level NMT for Chinese by breaking down Chinese characters into linguistic units similar to those of Indo-European languages. We use the Wubi encoding scheme, which preserves the original shape and semantic information of the characters, while also being reversible. We show promising results from training Wubi-based models on the character- and subword-level with recurrent as well as convolutional models.


Introduction
Character-level sequence-to-sequence (Seq2Seq) models for machine translation can perform comparably to subword-to-subword or subword-to-character models when dealing with Indo-European language pairs, such as German-English or Czech-English (Lee et al., 2017). Such language pairs benefit from having a common Latin character representation, which allows suitable character-to-character mappings to be learned. This method, however, is more difficult to apply to non-Latin language pairs, such as Chinese-English. Chinese characters differ from English characters in that they carry more meaning and resemble subword units in English. For example, the Chinese character '人' corresponds to the word 'human' in English. This lack of correspondence makes the problem more demanding for a Chinese-English character-to-character model, as it would be forced to map higher-level linguistic units in Chinese to individual Latin characters in English. Good performance on this task may, therefore, require specific architectural decisions.

Figure 1: Chinese-to-English translation. A raw Chinese word ('承诺') is encoded into ASCII characters ('bd|yad'), using the Wubi encoding method, before passing it to a Seq2Seq network. The network generates the English translation 'commitment', processing one ASCII character at a time.

Code and data available at https://github.com/duguyue100/wmt-en2wubi.

In this paper, we propose a simple solution to this challenge: encode Chinese into a meaningful string of ASCII characters, using the Wubi method (Lunde, 2009) (Section 3). This encoding enables efficient and accurate character-level prediction applications in Chinese, with no changes required to the model architecture (see Figure 1). Our approach significantly reduces the character vocabulary size of a Chinese text, while preserving the shape and semantic information encoded in the Chinese characters.
We demonstrate the utility of the Wubi encoding on subword- and character-level Chinese NMT, comparing the performance of systems trained on Wubi vs. raw Chinese characters (Section 4). We test three types of Seq2Seq models: recurrent (Cho et al., 2014), convolutional (Gehring et al., 2017), and hybrid (Lee et al., 2017). Our results demonstrate the utility of Wubi as a preprocessing step for Chinese translation tasks, showing promising performance.

Background

Sequence-to-sequence models for NMT
Neural networks with Encoder-Decoder architectures have recently achieved impressive performance on many language pairs in Machine Translation, such as English-German and English-French (Wu et al., 2016). Recurrent Neural Networks (RNNs) (Cho et al., 2014) process and encode the input sequentially, mapping each word onto a vector representation of fixed dimensionality. The representations are used to condition a decoder RNN which generates the output sequence.
Recent studies have shown that Convolutional Neural Networks (CNNs) (LeCun et al., 1998) can perform better on Seq2Seq tasks than RNNs (Gehring et al., 2017; Chen and Wu, 2017; Kalchbrenner et al., 2016). CNNs process all positions of a sequence simultaneously, which is more efficient, especially on parallel GPU hardware. Successive layers in CNN models have an increasing receptive field for modeling long-term dependencies in candidate languages.
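The growth of the receptive field with depth can be made concrete with a short calculation. The sketch below assumes stride-1, undilated convolutions with a single kernel width for all layers; the function name and the example values are illustrative and are not taken from any of the cited systems.

```python
def receptive_field(num_layers: int, kernel_size: int) -> int:
    """Number of input positions visible to one output position after
    stacking `num_layers` stride-1, undilated convolutions."""
    # Each additional layer widens the visible window by (kernel_size - 1).
    return 1 + num_layers * (kernel_size - 1)


# Illustrative values: kernel width 3, one to four stacked layers.
for layers in range(1, 5):
    print(layers, receptive_field(layers, kernel_size=3))
```

Under these assumptions the window grows only linearly with depth, which is why deeper stacks are needed to cover long character-level sequences.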

Chinese-English translation
Recent large-scale benchmarks of RNN encoder-decoder models (Wu et al., 2016) have shown that translation pairs involving Chinese are among the most challenging for NMT systems. For instance, in Wu et al. (2016), an NMT system trained on English-to-Chinese showed the smallest relative improvement over a phrase-based machine translation baseline, compared with five other language pairs.
While it is known that the quality of a Chinese translation system can be significantly impacted by the choice of word segmentation (Wang et al., 2015), there has been little work on improving the representation medium for Chinese translation. Wang et al. (2017) perform an empirical comparison of various translation granularities for the Chinese-English task. They find that adding information about the segmentation of the Chinese characters, such as marking the start and the end of each word, leads to improved performance over raw character or word translation.
The work most closely related to ours is that of Du and Way (2017), who use Pinyin to romanize raw Chinese characters based on their pronunciation. This method, however, adds ambiguity to the data, because many Chinese characters share the same pronunciation.

Encoding Chinese characters with Wubi
Wubi (Lunde, 2009) is a shape-based encoding method for inputting Chinese characters on a computer QWERTY keyboard. The encoding is based on the structure of the characters rather than on their pronunciation. Using the method, each raw Chinese character (e.g., "设") can be efficiently mapped to a unique sequence of 1 to 5 ASCII characters (e.g., "ymc"). This feature greatly reduces the ambiguity brought by other phonetic input methods, such as Pinyin.
As an input method, Wubi uses 25 key caps from the QWERTY keyboard, where each key cap is assigned to one of five categories based on the character's first stroke (when written by hand). Each of the key caps is associated with different character roots. A Chinese character is broken down into its character roots, and the QWERTY keys associated with those roots are used to encode it. For example, the Wubi encoding of '哈' is 'kwgk', and the character roots of this character are 口(k), 人(w), 王(g) and 口(k).
To create a one-to-one mapping of every Chinese character to a Wubi encoding during translation, we append numbers to the encodings, whenever one code maps to multiple Chinese characters. Applying Wubi significantly reduces the character-level vocabulary size of a Chinese text (from > 5,000 commonly used Chinese characters, to 128 ASCII characters), while preserving its shape and semantic information. Table 1 contains examples of Wubi, along with the corresponding words in Chinese and English.

Table 1 (only one row recovered). English: Step aside; Chinese: 让开; Wubi: yh|ga
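The disambiguation step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name is hypothetical, the numbering rule (every character sharing a code receives a 0-based index) is an assumption, and the colliding pair '甲'/'乙' with the code 'zz' is made up purely to trigger a collision. Only the codes for '设' and '哈' come from the text.

```python
from collections import defaultdict


def build_unique_codes(raw_codes):
    """Turn (character, code) pairs into a one-to-one character <-> code mapping.

    Assumption: when several characters share a code, each of them gets a
    0-based index appended, in order of appearance.
    """
    by_code = defaultdict(list)
    for char, code in raw_codes:
        by_code[code].append(char)

    char_to_code = {}
    for code, chars in by_code.items():
        if len(chars) == 1:
            char_to_code[chars[0]] = code
        else:
            for index, char in enumerate(chars):
                char_to_code[char] = f"{code}{index}"

    # Every code is now unique, so the mapping can be inverted safely.
    code_to_char = {code: char for char, code in char_to_code.items()}
    return char_to_code, code_to_char


RAW_CODES = [
    ("设", "ymc"),   # from the text
    ("哈", "kwgk"),  # from the text
    ("甲", "zz"),    # made-up code, for illustration only
    ("乙", "zz"),    # made-up collision with the entry above
]
CHAR_TO_CODE, CODE_TO_CHAR = build_unique_codes(RAW_CODES)
```

Because the letter part of a code never contains digits, the appended index cannot clash with an unnumbered code, which is what keeps the mapping reversible in this sketch.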

Dataset
In this work, we use a subset of the English and Chinese parts of the United Nations Parallel Corpus (Ziemski et al., 2016). We choose the UN corpus because of its high-quality, human-produced translations. The dataset is sufficient for our purpose: our aim here is not to reach state-of-the-art performance on Chinese-English translation, but to demonstrate the potential of the Wubi encoding on the character level. We preprocess the UN dataset with the MOSES tokenizer, and use Jieba to segment the Chinese sentences into words, after which we encode the texts into Wubi. We use the '|' character as a subword separator for Wubi, in order to ensure that the mapping from Chinese to Wubi is unique. We also convert all Chinese punctuation marks (e.g. '。、《》') to their ASCII counterparts (e.g. '.,<>') because they play linguistic roles similar to those of English punctuation. This conversion additionally decreases the size of the Wubi character vocabulary.
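The encoding step of this pipeline can be sketched in a few lines. The example assumes the sentence has already been segmented into words (for instance by Jieba, which is not called here), and it uses a tiny lookup table built only from the character codes quoted in this paper; the function names and the table are illustrative, not the released preprocessing code.

```python
# Punctuation pairs taken from the examples in the text.
PUNCTUATION = str.maketrans({"。": ".", "、": ",", "《": "<", "》": ">"})

# Tiny illustrative table; a real run would need a full Wubi dictionary.
WUBI = {"承": "bd", "诺": "yad", "让": "yh", "开": "ga"}
REVERSE_WUBI = {code: char for char, code in WUBI.items()}


def encode_words(words):
    """Encode pre-segmented Chinese words into a Wubi string.

    Characters inside a word are joined with '|', words with a space.
    Symbols without an entry in the table are passed through unchanged.
    """
    encoded = []
    for word in words:
        word = word.translate(PUNCTUATION)
        encoded.append("|".join(WUBI.get(char, char) for char in word))
    return " ".join(encoded)


def decode_words(text):
    """Invert encode_words; ASCII punctuation is left as ASCII in this sketch."""
    return [
        "".join(REVERSE_WUBI.get(code, code) for code in token.split("|"))
        for token in text.split(" ")
    ]


print(encode_words(["承诺", "。"]))  # bd|yad .
print(decode_words("yh|ga"))        # ['让开']
```

The '|' separator is what makes decoding unambiguous here: without it, the boundary between 'bd' and 'yad' could not be recovered from the string 'bdyad'.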
Our final dataset contains 2.1M sentence pairs for training, and 55k pairs each for validation and testing (Table 2 contains additional statistics). Note that our procedures are entirely reversible.
To investigate the utility of the Wubi encoding, we compare the performance of NMT models on four training pairs: raw Chinese-to-English (cn2en) versus Wubi-to-English (wubi2en), and English-to-raw Chinese (en2cn) versus English-to-Wubi (en2wubi). For each pair, we investigate three levels of sequence granularity: word-level, subword-level, and character-level. The word-level operates on individual English words (e.g. walk) and either raw-Chinese words (e.g. 编 设) or Wubi words (e.g. sh|wy). We limit all word-level vocabularies to the 50k most frequent words for each language. The subword-level is produced using the byte pair encoding (BPE) scheme (Sennrich et al., 2016), capping the vocabulary size at 10k for each language. The character-level operates on individual raw-Chinese characters (e.g. '重'), or individual ASCII characters.
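The word-level and character-level views of a Wubi sentence, together with the frequency cap on the vocabulary, can be illustrated as follows. This is a sketch under assumptions: the helper names are hypothetical, spaces are treated as ordinary symbols on the character level, and the BPE subword level is omitted because it requires a learned merge table.

```python
from collections import Counter


def word_tokens(sentence):
    """Word-level view: split on whitespace, so 'yh|ga' stays one token."""
    return sentence.split()


def char_tokens(sentence):
    """Character-level view: every ASCII symbol, including '|' and spaces."""
    return list(sentence)


def build_vocab(sentences, tokenizer, max_size):
    """Keep the `max_size` most frequent tokens, as done for the 50k word cap."""
    counts = Counter(tok for sent in sentences for tok in tokenizer(sent))
    return [tok for tok, _ in counts.most_common(max_size)]


print(word_tokens("yh|ga ."))  # ['yh|ga', '.']
print(char_tokens("yh|ga"))    # ['y', 'h', '|', 'g', 'a']
```

Tokens that fall outside the capped vocabulary would be replaced by an unknown-word symbol at training time, which is the effect discussed for the subword-level results.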

Model descriptions and training details
Our models are summarized in Table 3, including the number of parameters and vocabulary sizes used for each pair. For the subword- and word-level experiments, we use two systems. The first, LSTM, is an LSTM Seq2Seq model (Cho et al., 2014) with an attention mechanism (Bahdanau et al., 2015). We use a single layer of 512 hidden units for the encoder and decoder, and set 512 as the embedding dimensionality. The second system, FConv, is a smaller version of the convolutional Seq2Seq model with an attention mechanism from Gehring et al. (2017). We use word embeddings with dimension 256 for this model. The encoder and the decoder of FConv share the same convolutional design, with 4 convolution layers for the encoder and 3 for the decoder, each layer having filters with dimension 256 and size 3.
For all character-level experiments, we use the fully character-level model, char2char, from Lee et al. (2017). The encoder of this model consists of 8 convolutional layers with max pooling, which produce intermediate representations of segments of the input characters. Following this, a 4-layer highway network (Srivastava et al., 2015) is applied, as well as a single-layer recurrent network with gated recurrent units (GRUs) (Cho et al., 2014). The decoder consists of an attention mechanism and a two-layer GRU, which predicts the output one character at a time. The character embedding dimensionality is 128 for the encoder and […].
We train all models for 25 epochs using the Adam optimizer (Kingma and Ba, 2014). We use four NVIDIA Titan X GPUs for conducting the experiments, and use beam search with a beam size of 20 to generate all final outputs.

Quantitative evaluation
In Table 4, we present the BLEU scores for all the previously described experiments. Before computing BLEU, we convert all Chinese outputs to Wubi to ensure a consistent comparison. This conversion has a one-to-one mapping between Chinese and Wubi, whereas, in the reverse direction, ill-formed Wubi output on the character-level might not be reversible to Chinese.
On the word-level, the Wubi-based models achieve comparable results to their counterparts in Chinese, in both translation directions. LSTM significantly outperforms FConv across all experiments here, most likely due to its much larger size (see Table 3).
On the subword-level, we observe a slight increase of about 0.5 BLEU when translating from English to Wubi instead of raw Chinese. This increase is most likely due to the difference in the BPE vocabularies: while the English and Wubi BPE rules that were learned cover 100% of the dataset, for Chinese the coverage is 98.7%; the remaining 1.3% had to be replaced by the unk symbol under our vocabulary constraints. While the models were capable of compensating for this gap when translating to English, in the reverse direction it resulted in a loss of performance. This highlights one benefit of Wubi on the subword-level: the Latin encoding seems to give greater flexibility for extracting suitable BPE rules. It would be interesting to repeat this comparison using much larger datasets and larger BPE vocabularies.
Character-level translation is more difficult than word-level, since the models are expected to not only predict sentence-level semantics, but also to generate the correct spelling of each word. Our char2char Wubi models outperformed the raw Chinese models by 0.95 BLEU points when translating to English, and by 0.65 BLEU points when translating from English. The differences are statistically significant (p = 0.001 and p = 0.034 respectively) according to bootstrap resampling (Koehn, 2004) with 1500 samples. The results demonstrate the advantage of Wubi on the character-level: it outperforms raw Chinese even though it has fewer parameters dedicated to character embeddings (Table 3) and has to deal with substantially longer input or output sequences (see Table 2).
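The significance test can be sketched as a paired bootstrap over the test set. The code below is a simplified illustration of the general idea, not the exact procedure used for the reported p-values: the function names are hypothetical, and exact-match accuracy stands in for corpus-level BLEU so that the example stays self-contained. Any function with the same signature, such as a real BLEU implementation, could be passed as `metric`.

```python
import random


def exact_match(hypotheses, references):
    """Stand-in corpus metric (NOT BLEU): fraction of exact matches."""
    hits = sum(h == r for h, r in zip(hypotheses, references))
    return hits / len(references)


def paired_bootstrap(hyps_a, hyps_b, refs, metric, n_samples=1500, seed=0):
    """Estimate how often system A fails to beat system B on resampled test sets.

    Each sample draws len(refs) sentence indices with replacement and scores
    both systems on the same resampled set.
    """
    rng = random.Random(seed)
    n = len(refs)
    not_better = 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]
        score_a = metric([hyps_a[i] for i in idx], [refs[i] for i in idx])
        score_b = metric([hyps_b[i] for i in idx], [refs[i] for i in idx])
        if score_a <= score_b:
            not_better += 1
    return not_better / n_samples


refs = ["a b", "c d", "e f", "g h"]
system_a = ["a b", "c d", "e f", "g h"]  # toy outputs, all correct
system_b = ["x", "x", "x", "x"]          # toy outputs, all wrong
print(paired_bootstrap(system_a, system_b, refs, exact_match))  # 0.0
```

A small returned fraction indicates that the advantage of system A is stable across resampled test sets; sampling the same indices for both systems is what makes the comparison paired.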
In Figure 2, we plot the sentence-level BLEU scores obtained by the char2char models on our test set, with respect to the length of the input sentences. When translating from Chinese to English (Figure 2a), the Wubi-based model consistently outperforms the raw Chinese model, for all input lengths. Interestingly, the gap between the two systems increases for longer Chinese inputs of over 20 words, indicating that Wubi is more robust for such examples. This result could be explained by the fact that the encoder of the char2char model is more suitable for modeling languages with a higher level of granularity such as English and German. When translating from English to Chinese (Figure 2b), Wubi still has a small edge; however, in this case we see the reverse trend: it performs much better on shorter sentences of up to 12 English words. Perhaps the increased granularity of the output sequence led to an advantage during decoding with beam search.
Interestingly, all the char2char models use only a tiny fraction of their parameters as embeddings, due to the much smaller size of their vocabularies. The best-performing LSTM word-level model has the majority of its parameters, 61% or over 50M, dedicated to word embeddings. For the Wubi-based character-level models, the number is only 0.3% or 0.21M. There is even a significant difference between Wubi and Chinese on the character-level: for example, en2wubi has 12 times fewer embedding parameters than en2cn. Thus, although char2char performed worse than LSTM in our experiments, these results highlight the potential of character-level prediction for developing compact yet performant translation systems, for Latin as well as non-Latin languages.
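A rough calculation shows where an embedding count of this size can come from. The sketch assumes one source and one target embedding table for the word-level LSTM, each with the 50k vocabulary and 512 dimensions stated earlier; whether Table 3 counts parameters in exactly this way is an assumption. The character-level line is an upper bound for the encoder table only, assuming all 128 ASCII symbols are used.

```python
def embedding_params(vocab_size, dim, num_tables=1):
    """Parameters held in `num_tables` embedding tables of shape (vocab_size, dim)."""
    return vocab_size * dim * num_tables


# Word-level LSTM: 50k words per language, 512-dimensional embeddings,
# one table for the source side and one for the target side (assumed).
word_level = embedding_params(50_000, 512, num_tables=2)
print(f"{word_level / 1e6:.1f}M")  # 51.2M

# Character-level Wubi encoder: at most 128 ASCII symbols, 128 dimensions.
wubi_encoder = embedding_params(128, 128)
print(wubi_encoder)  # 16384
```

The first figure is consistent with the "over 50M" quoted for the word-level model, while the second shows how little of a character-level model is spent on its input vocabulary.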

Qualitative evaluation
In Table 5, we present four examples from our test dataset that cover short as well as long sentences. We also include the translations produced by the character-level char2char systems, which are the main focus of this paper. Full examples from the additional systems are available in the supplementary material.
In the first example, which is a short sentence resembling the headline of a document, both the wubi2en and cn2en models produced correct translations. When translating from English to Chinese, however, the en2wubi model produced the word '与' (highlighted in red), which matches the ground truth text more closely. In contrast, the en2cn model produced the synonym '和'. In the second example, the en2wubi output completely matches the ground truth and is superior to the en2cn output. The latter failed to correctly translate 'the' to '这次' (marked in green).
The wubi2en translation in the third example accurately translated the word 'believe' (marked in blue) and the full form of the abbreviation 'ldcs', 'the least developed countries' (highlighted in green), whereas the cn2en model chooses 'are convinced' and ignores 'ldcs' in its output sentence. Interestingly, although the ground truth text maps the word 'essential' (marked in red) to three Chinese words '至 为 重 要', both en2wubi and en2cn use only a single word to interpret it. Arguably, en2wubi's translation '至关重要' is closer to the ground truth than en2cn's translation '必不可少'.
The fourth example is more challenging. There, the English ground truth 'requested' (highlighted in blue) maps to two different parts of the Chinese ground truth: '提出' (in blue) and '要求' (in green). This one-to-many mapping confuses both translation models. The wubi2en model tries to match the Chinese text by translating '提出' into 'proposed' and '要求' into 'requirements': this model may have been misled by the word '时' (which can be translated to 'when'), as the output contains an adverbial clause. While the wubi2en output is closer to the ground truth, the two have little overlap. For the English-to-Chinese task, the en2cn translation is better than the one produced by en2wubi: while en2cn successfully translated 'without explanation' (in red), the en2wubi model ignored this part of the sentence. The Wubi-based models tend to produce slightly shorter translations for both directions (see Table 6). Overall, the Wubi-based outputs appear to be visibly better than the raw Chinese-based outputs, in both directions.

Table 5 (fragment; only the rows below were recovered).
Example 3 (row label not recovered): we are convinced that increased trade growth and development is essential .
Example 4
English ground truth: in some cases , additional posts were requested without explanation .
Wubi ground truth: d afs|hxf nge|ukq k , rj|bm fu|lk km|ptkm0 s|fiy jf , ua|fii wt|bm yu|je .
wubi2en: in some cases , no indication was made when additional staffing requirements were proposed .
cn2en: in some cases , there was no indication of the request for additional posts .

Conclusion
We demonstrated that an intermediate encoding step to ASCII characters is suitable for the character-level Chinese-English translation task, and can even lead to performance improvements. All of our models trained using the Wubi encoding achieve performance comparable to or better than the baselines trained directly on raw Chinese. On the character-level, using Wubi yields BLEU improvements when translating both to and from English, despite the increased length of the input or output sequences, and the smaller number of embedding parameters used. Furthermore, there are also improvements on the subword-level, when translating from English. Future work will focus on making use of the semantic structure of the Wubi encoding scheme, to develop architectures tailored to utilize it. Another exciting future direction is multilingual many-to-one character-level translation from Chinese and several Latin languages simultaneously, which becomes possible using encodings such as Wubi. This has previously been successfully realized for Latin and Cyrillic languages (Lee et al., 2017).