A Comparison on Fine-grained Pre-trained Embeddings for the WMT19 Chinese-English News Translation Task

This paper describes our submission to the WMT 2019 Chinese-English (zh-en) news translation shared task. Our systems are based on RNN architectures with pre-trained embeddings that utilize character and sub-character information. We compare models at these different granularity levels using different evaluation metrics. We find that finer-granularity embeddings can help the model according to character-level evaluation, and that pre-trained embeddings can also marginally benefit model performance when the training data is limited.


Introduction
Neural Machine Translation (NMT) systems are mostly based on an encoder-decoder architecture with attention. Given a sentence x in the source language, the model predicts a corresponding output sentence y in the target language that maximizes the conditional probability p(y|x). The attention-based Recurrent Neural Network (RNN) version of this architecture has been a very popular approach to NMT (Bahdanau et al., 2015; Luong et al., 2015). Despite the success of these models, they still suffer from problems such as out-of-vocabulary (OOV) words, i.e. words that have not been seen during training. To alleviate the OOV problem, we follow the methods used in word representation and segment words into smaller units. In some morphologically rich languages such as Chinese, a word can be divided into characters and the characters can be further divided into smaller components called glyphs. Both characters and glyphs might contain semantic information, and therefore utilizing such information might help alleviate the OOV problem.
Based on the RNN attention-based model (Bahdanau et al., 2015), we experiment with different granularity levels on the WMT19 Chinese-English (zh-en) news translation shared task. This paper describes our submitted systems with embeddings pre-trained on monolingual corpora. The two submitted systems use pre-trained embeddings enhanced by character and sub-character information respectively. The preprocessing methods include Chinese word segmentation, tokenization, data filtering based on rules and Byte Pair Encoding (BPE). Our baseline model is based on RNNSearch (Bahdanau et al., 2015) operating on word level and we use Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) as encoder and decoder. For character level word embeddings, we use the Character-Enhanced Word Embedding (CWE) proposed by Chen et al. (2015). For the sub-character level embeddings, we use the Joint Learning Word Embedding (JWE) proposed by Yu et al. (2017). We use various metrics, namely BLEU (Papineni et al., 2002), METEOR (Denkowski and Lavie, 2011), TER (Snover et al., 2006) and CharacTER (Wang et al., 2016) for evaluation.
When compared with our baseline model, the models with sub-character level embeddings pre-trained on a monolingual corpus show better performance, achieving an increase of +0.53 BLEU. We ran additional experiments on the character and sub-character level pre-trained embeddings and found that the use of these embeddings can benefit the model when the training corpus size is limited.
This paper is structured as follows: Section 2 introduces the related work including the model architecture and pre-trained embeddings used in our experiment. In Section 3, data selection and preprocessing methods are described. Section 4 introduces the model architectures and hyperparameter settings. Section 5 shows the evaluation results on models with different granularity levels. Section 6 shows additional experiments to better understand our models.

Related Work
NMT has been an important task in Natural Language Processing. A translation system aims to find the corresponding target sentence $y = \{y_1, y_2, ..., y_m\}$ given a sentence $x = \{x_1, x_2, ..., x_n\}$ in the source language, in a probabilistic manner, represented as $\max_y P(y|x)$. Most NMT models are based on the sequence-to-sequence approach, and the RNN-based architecture with attention (Bahdanau et al., 2015) is a popular version of such an approach. The attention mechanism functions as a dynamic calculation of the context vector. At each decoding step, a probability distribution is calculated based on the current decoder hidden state and all encoder hidden states. This distribution is defined as the attention score, representing the importance of each input token at the current decoding time step. The context vector is calculated as a weighted average of all encoder hidden state vectors, where the attention scores are the weights. With the introduction of attention, the model does not need to rely on a single context vector to represent the whole sentence and thus can better handle long sentences.
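The attention step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used in our systems: the scoring function is an additive (MLP-style) one in the spirit of Bahdanau et al. (2015), and the parameter names `W_dec`, `W_enc` and `v`, as well as all dimensions, are hypothetical.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a 1-D array."""
    shifted = scores - np.max(scores)
    exp = np.exp(shifted)
    return exp / exp.sum()

def mlp_attention(decoder_state, encoder_states, W_dec, W_enc, v):
    """Return (attention weights, context vector) for one decoding step.

    decoder_state:  shape (d_dec,), current decoder hidden state
    encoder_states: shape (n, d_enc), one row per source token
    W_dec, W_enc, v: hypothetical learned parameters of the scoring MLP
    """
    # Score every encoder state against the current decoder state.
    hidden = np.tanh(encoder_states @ W_enc + decoder_state @ W_dec)  # (n, d_att)
    scores = hidden @ v                                               # (n,)
    # Normalise the scores into a probability distribution over source tokens.
    weights = softmax(scores)
    # Context vector: weighted average of the encoder hidden states.
    context = weights @ encoder_states                                # (d_enc,)
    return weights, context

# Toy example with random values (illustrative only).
rng = np.random.default_rng(0)
n, d_enc, d_dec, d_att = 5, 8, 8, 6
encoder_states = rng.normal(size=(n, d_enc))
decoder_state = rng.normal(size=d_dec)
W_enc = rng.normal(size=(d_enc, d_att))
W_dec = rng.normal(size=(d_dec, d_att))
v = rng.normal(size=d_att)

weights, context = mlp_attention(decoder_state, encoder_states, W_dec, W_enc, v)
```

Because the weights are recomputed at every decoding step, each target token can draw on a different mixture of source positions.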
In recent years, model architectures based on convolutional neural networks (Gehring et al., 2017) and transformers (Vaswani et al., 2017) have shown competitive or better performance than RNN-based architectures. In addition, strategies such as back translation (Sennrich et al., 2016a), reranking (Neubig et al., 2015) and model ensembling have led to improvements in translation quality. Here, we only experiment with RNN architectures, focus on the effect of using character and sub-character level embeddings, and use ensembling only for comparison purposes.
We use the CWE model proposed by Chen et al. (2015) and the JWE model proposed by Yu et al. (2017) to train the pre-trained embeddings. Both models are based on the word2vec model proposed by Mikolov et al. (2013). Based on Continuous Bag-of-Words (CBOW), the CWE model constructs a new word representation by combining the word embeddings with character embeddings (see Eq 1). Chen et al. also proposed multi-prototype character embeddings, where characters are tagged with additional factors, such as position and context cluster, for character disambiguation.
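Following the CWE formulation in Chen et al. (2015), and assuming $N_j$ denotes the number of characters in the word $x_j$ (a symbol not defined in the surrounding text), Eq 1 can be written as:

```latex
x_j = w_j \oplus \frac{1}{N_j} \sum_{k=1}^{N_j} c_k \tag{1}
```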
where $w_j$ is the word embedding and $c_k$ is the embedding of the k-th character in $x_j$. $\oplus$ is the composition operator (either addition or concatenation). The JWE model proposed by Yu et al. (2017) is also based on CBOW and utilizes character and sub-character level information. They construct a dictionary that maps each Chinese character to its sub-character components. As Figure 1 shows, words together with the characters and sub-character components within the context window are all used to predict the target word. The additional semantic information provided by characters and sub-characters is shown to improve word representations, especially in addressing out-of-vocabulary words.
Figure 1: JWE (Yu et al., 2017). $w_{i-1}$ and $w_{i+1}$ are context words. $c_{i-1}$ and $c_{i+1}$ represent characters in context words. $s_{i-1}$ and $s_{i+1}$ represent sub-characters of context characters and $s_i$ is the sub-character of target word $w_i$.

Data and Preprocessing
We use all the parallel data provided by WMT for the zh-en translation task, including News Commentary v14, UN Parallel Corpus V1.0 and the CWMT corpora. In addition, the Common Crawl corpus from WMT is used as monolingual data to pre-train the embeddings. We use newsdev2018 and newsdev2017 as validation sets and newstest2019 as our test data. We tokenize English sentences with the Moses tokenizer (Koehn et al., 2007). On the Chinese side we use Jieba for Chinese word segmentation.¹
The data preprocessing consists of filtering sentences to be added to the parallel training corpus by rules and by alignment score. Following the preprocessing criteria from submissions in previous years (Xu and Carpuat, 2018; Stahlberg et al., 2018; Haddow et al., 2018), we filter the training data based on the following criteria:
• The length of sentences in both languages must be between 4 and 50.
• The maximum length ratio of sentence pairs is 1.3.
• Chinese sentences with no Chinese character are filtered out.
• English sentences with no English character are filtered out.
• Sentence pairs with identical source and target sentences are removed.
• Sentences should not contain HTML tags.
• Sentence pairs with alignment score above -65 are removed.²
The fast_align toolkit³ is used to calculate the alignment score for the parallel data. After the filtering, 10.38M sentence pairs are used as training data. We apply Byte Pair Encoding (BPE) (Sennrich et al., 2016b) with 30,000 merge operations on the English sentences. For Chinese sentences, we segment them into different granularity levels, including words, subwords via BPE and characters. In the character level setting, only Chinese words are separated and each character is treated as a single token. The training text for models with pre-trained embeddings is the same as for the baseline, which uses words as basic units.
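The rule-based criteria above can be sketched as a single predicate over a tokenized sentence pair. This is a minimal illustration rather than the filtering script used for the submission: it assumes lengths are counted in tokens, takes the length ratio as longer over shorter, approximates "Chinese character" with the CJK Unified Ideographs block, and leaves out the alignment-score rule because that depends on fast_align output. All names (`keep_pair` and the constants) are hypothetical.

```python
import re

# Assumption: "Chinese character" means a code point in the CJK Unified Ideographs block.
CHINESE_CHAR = re.compile(r"[\u4e00-\u9fff]")
ENGLISH_CHAR = re.compile(r"[A-Za-z]")
HTML_TAG = re.compile(r"<[^>]+>")

MIN_LEN, MAX_LEN, MAX_RATIO = 4, 50, 1.3

def keep_pair(zh_tokens, en_tokens):
    """Return True if a (Chinese, English) token-list pair passes the rule-based filters."""
    zh_len, en_len = len(zh_tokens), len(en_tokens)
    # Both sides must have between 4 and 50 tokens.
    if not (MIN_LEN <= zh_len <= MAX_LEN and MIN_LEN <= en_len <= MAX_LEN):
        return False
    # Assumption: the ratio is longer side over shorter side.
    if max(zh_len, en_len) / min(zh_len, en_len) > MAX_RATIO:
        return False
    zh_text, en_text = " ".join(zh_tokens), " ".join(en_tokens)
    # The Chinese side needs at least one Chinese character, the English side one Latin letter.
    if not CHINESE_CHAR.search(zh_text) or not ENGLISH_CHAR.search(en_text):
        return False
    # Drop pairs whose source and target are identical.
    if zh_text == en_text:
        return False
    # Drop pairs containing HTML tags on either side.
    if HTML_TAG.search(zh_text) or HTML_TAG.search(en_text):
        return False
    return True

# Toy examples (illustrative only).
ok = keep_pair("我 喜欢 机器 翻译".split(), "I like machine translation".split())
too_short = keep_pair("你 好".split(), "hello".split())
bad_ratio = keep_pair("我 喜欢 机器 翻译".split(), "I really do like machine translation".split())
has_html = keep_pair("我 喜欢 <b> 机器 翻译".split(), "I like <b> machine translation".split())
no_chinese = keep_pair("a b c d".split(), "a b c e".split())
```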

Baseline
The baseline model is based on the bidirectional RNN architecture with attention (Bahdanau et al., 2015). Our models are built with OpenNMT-py (Klein et al., 2017). We follow the hyperparameter setting of Deep RNN from Xu and Carpuat (2018) and use a four-layer LSTM for both the encoder and decoder. The embedding and hidden layer sizes are limited to 512. We use the Adam optimizer (Kingma and Ba, 2015) with an initial learning rate of 0.0005. We apply label smoothing (Szegedy et al., 2016) and dropout (Srivastava et al., 2014) of 0.1 to avoid overfitting. We use the multi-layer perceptron (mlp) attention of Bahdanau et al. (2015). The batch size is 4096 tokens and the models are selected based on best performance on the validation set. All our models are trained on a GTX 1080Ti GPU.
¹ https://github.com/fxsjy/jieba
² We tried different filter strategies and found this criterion gives a better performance than others.
³ https://github.com/clab/fast_align

Pre-trained Embeddings
We apply pre-trained embeddings to the two submitted systems. The character level and sub-character level pre-trained embeddings are trained with CWE (Chen et al., 2015) and JWE (Yu et al., 2017) respectively. We trained the embeddings on the Common Crawl corpus provided by WMT19 and fine-tuned them on the task data when training the RNN. The preprocessing for monolingual data includes Chinese word segmentation and removal of non-Chinese characters. Apart from the pre-trained embeddings, the hyperparameters of the two submissions are the same as in the baseline system.
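One way to picture this setup is to build the initial embedding matrix of the encoder from the pre-trained vectors and leave it trainable, so that the vectors are fine-tuned together with the RNN. The sketch below is an assumption-laden illustration and not the OpenNMT-py loading code: the function name, the random initialisation for words without a pre-trained vector, and the toy vocabulary are all hypothetical.

```python
import numpy as np

def build_embedding_matrix(vocab, pretrained, dim, seed=0):
    """Build an initial embedding matrix for `vocab`.

    vocab:      list of tokens; row i of the matrix belongs to vocab[i]
    pretrained: dict mapping token -> vector of length `dim` (e.g. from CWE or JWE)
    Returns the matrix and the number of tokens covered by `pretrained`.
    """
    rng = np.random.default_rng(seed)
    # Assumption: tokens without a pre-trained vector start from small random values.
    matrix = rng.normal(scale=0.1, size=(len(vocab), dim))
    found = 0
    for row, token in enumerate(vocab):
        vector = pretrained.get(token)
        if vector is not None:
            matrix[row] = vector
            found += 1
    return matrix, found

# Toy example (illustrative only).
vocab = ["<unk>", "翻译", "机器", "新闻"]
pretrained = {"翻译": np.ones(4), "机器": np.full(4, 2.0)}
matrix, found = build_embedding_matrix(vocab, pretrained, dim=4)
```

The matrix would then serve as the starting value of the source embedding layer, with gradients still flowing into it during training.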

Result and Analysis
We use the CharacTER.py⁴ script for CharacTER score calculation and multeval⁵ (Clark et al., 2011) to calculate BLEU, METEOR and TER scores. The evaluation results for models on word, subword and character level are presented in Table 1.
The model with BPE applied on both source and target languages (bpe2bpe) achieves a higher score than the other single models, with an increase of +1.18 BLEU over the baseline system. The two models (baseline+cwe, baseline+jwe) utilizing character and sub-character information are based on embeddings pre-trained with CWE and JWE as described in Section 2. We use the source training text for the pre-trained embeddings to prevent the introduction of noise. As we can see from the BLEU scores, the model with JWE pre-trained embeddings shows similar performance to the baseline system, while the model with CWE embeddings on character level shows a marginal decrease. The METEOR and TER scores present similar trends to BLEU, whereas according to the CharacTER scores the introduction of pre-trained embeddings on both character and sub-character levels shows better performance than the baseline.
It can also be seen from the comparison of BPE-based models that the model with CWE embeddings performs slightly worse than the bpe2bpe model, which applies BPE to both source and target languages. The results according to CharacTER show that finer granularity embeddings can benefit the model in character level evaluations. The char2bpe model shows the worst performance according to BLEU scores, whereas the CharacTER score of this model is higher than that of other word level models. Finally, when we ensemble the baseline and four models with JWE embeddings pre-trained for different numbers of iterations, we obtain an increase of +1.26 BLEU over the baseline.
The two models with stars (apprentice-c and apprentice-g) are our official shared task submissions, with the first one operating on character level and the second on glyph (sub-character) level. The apprentice-c model uses the CWE pre-trained embeddings while the apprentice-g model uses JWE embeddings. For the first, we train the pre-trained embeddings on the monolingual data (Common Crawl) and then fine-tune them on filtered parallel data during the training of the RNN models. Note that we did not use back-translation to augment the training data and, due to time limits, we applied a relatively larger learning rate than previous work to boost training speed; therefore our systems achieve relatively lower scores than previous work (Xu and Carpuat, 2018). The CWE-based model shows a better BLEU score than the baseline model. The lower performance of the apprentice-g model might have resulted from insufficient training epochs for the JWE embeddings. Due to time restrictions, we did not submit the system with the best word embeddings. In the additional experiments after the task deadline, we fine-tuned the models on the best word embeddings version and achieved a higher BLEU score of 17.43 for the apprentice-g(best) model. The CharacTER score for the fine-tuned model is lower than that of other models except the two with BPE. Generally, the sub-character level models perform better than the word level and character level models.
⁴ https://github.com/rwth-i6/CharacTER
⁵ https://github.com/jhclark/multeval
Additional Experiments

Evaluating Embeddings
We ran additional experiments to evaluate the effect of character and sub-character level pre-trained embeddings. Table 2 presents the model performance alongside the embeddings' performance on traditional word similarity and analogy tasks. We use the wordsim-240 and wordsim-297 datasets and the analogy dataset from Chen et al. (2015) for word similarity and analogy evaluation respectively. We use the evaluation script in JWE⁶ for both evaluations.
From Table 2, we can see that among all models with JWE pre-trained embeddings, the one with 10 iterations performs the best. When the embeddings are trained over 20 iterations, the BLEU score starts to decrease. The same pattern can be found on the CWE-based models. However, the model with 5-iteration embeddings achieves the highest BLEU score among all CWE-based models. From the embeddings performance on the analogy task, excluding the cwe5 model, we find that the embeddings performance correlates with BLEU scores. When comparing the CWE-based models with the JWE-based models, we see that on both translation quality and word embeddings evaluations, the model on finer granularity performs best.
Table 2: Comparison of model performance and word embeddings performance. The evaluation on the wordsim-240 and wordsim-297 test sets shows the Spearman correlation between the pre-trained embeddings and human judgements. The performance on analogy indicates accuracy on analogy reasoning in "a:b::c:?" format. The number after the embeddings type represents the number of training iterations.
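The two intrinsic evaluations reported in Table 2 can be sketched as follows. This is an illustrative reimplementation under assumptions, not the JWE evaluation script: word pairs with an out-of-vocabulary word are skipped, analogies are solved with the common vector-offset method (b - a + c), and the toy vectors and human scores are made up.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)
    ranks = np.empty(len(values))
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    return float(np.corrcoef(average_ranks(xs), average_ranks(ys))[0, 1])

def similarity_score(emb, pairs):
    """pairs: list of (word1, word2, human_score). Assumption: OOV pairs are skipped."""
    model, human = [], []
    for w1, w2, score in pairs:
        if w1 in emb and w2 in emb:
            model.append(cosine(emb[w1], emb[w2]))
            human.append(score)
    return spearman(model, human)

def solve_analogy(emb, a, b, c):
    """Answer "a:b::c:?" with the word closest to b - a + c, excluding a, b and c."""
    target = emb[b] - emb[a] + emb[c]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

# Toy embeddings and judgements (made up for illustration).
emb = {
    "国王": np.array([1.0, 1.0]),
    "王后": np.array([1.0, -1.0]),
    "男人": np.array([2.0, 1.0]),
    "女人": np.array([2.0, -1.0]),
    "苹果": np.array([-1.0, 0.2]),
}
pairs = [("国王", "王后", 6.0), ("男人", "女人", 8.0), ("国王", "苹果", 1.0)]
score = similarity_score(emb, pairs)
answer = solve_analogy(emb, "男人", "女人", "国王")
```

Analogy accuracy is then the fraction of questions for which the returned word matches the gold answer.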

Effect of Corpus Size
Another experiment was done to compare the effect of pre-trained embeddings on different corpus sizes. We train the word embeddings with the best iteration setting and train the RNN model on different corpus sizes. Smaller corpora are created by taking 25% and 50% of the original corpus.
It can be seen from Table 3 that with smaller parallel training corpora the introduction of the pre-trained word embeddings has a more marked positive influence. When the dataset is reduced to half, all three models show a decrease in BLEU score. However, the gap between the baseline and the CWE-based model is smaller. When the dataset is further limited to 25%, both models with pre-trained embeddings perform better than the baseline, whose score does not change. Although it seems that the pre-trained embeddings, even with sub-character level semantic information involved, bring only marginal benefit on the whole training data, the introduction of extra semantic information might play a more important role when the parallel training resources are limited.
Here we measure the performance of models with varying sentence lengths, as shown in Figure 2. The test set is separated into 8 subsets based on sentence length and models are evaluated on each subset; the x-axis in Figure 2 represents sentence length intervals. We see that the two models with embeddings trained on a larger monolingual corpus perform better than the other models on medium-length sentences (between 30 and 50). The apprentice-c model, which uses CWE embeddings operating on character level, greatly outperforms the other models on short sentences with length less than 10. Since the sentence length is short, the tokens in the sentence are mostly composed of one or two characters, thus the model with character-based embeddings has an advantage.
Regarding the two models with embeddings trained without extra monolingual data, both show good performance on medium-length sentences but perform poorly on long sentences. The introduction of pre-trained embeddings can increase the models' preference to generate shorter sentences, resulting in lower BLEU scores on long sentences.
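The length-based split of the test set can be sketched as below. The interval boundaries are not given in the text, so this sketch assumes eight buckets of width 10 with an open-ended last bucket, and assumes length is counted in whitespace-separated source tokens. The function names are hypothetical.

```python
from collections import defaultdict

BUCKET_WIDTH = 10  # hypothetical interval width
NUM_BUCKETS = 8    # the test set is split into 8 subsets

def bucket_index(length):
    """Map a sentence length to a bucket; the last bucket is open-ended."""
    return min(length // BUCKET_WIDTH, NUM_BUCKETS - 1)

def bucket_label(index):
    """Readable label for a bucket, e.g. '30-39' or '70+'."""
    low = index * BUCKET_WIDTH
    if index == NUM_BUCKETS - 1:
        return f"{low}+"
    return f"{low}-{low + BUCKET_WIDTH - 1}"

def split_by_length(sources, hypotheses, references):
    """Group (hypothesis, reference) pairs by the token length of the source sentence."""
    buckets = defaultdict(list)
    for src, hyp, ref in zip(sources, hypotheses, references):
        label = bucket_label(bucket_index(len(src.split())))
        buckets[label].append((hyp, ref))
    return dict(buckets)

# Toy example (illustrative only): one 3-token and one 35-token source sentence.
sources = ["a b c", " ".join(["w"] * 35)]
groups = split_by_length(sources, ["h1", "h2"], ["r1", "r2"])
```

Each group can then be scored separately with the same metric to obtain one point per interval.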

Analysis of Model Perplexity
In order to understand the effect of pre-trained embeddings on the target language model, we calculate the model perplexity on the test data for models trained on different corpus sizes. The results are presented in Table 4. The model with JWE pre-trained embeddings performs better on all corpus sizes, having a lower perplexity, though the difference is marginal. As in the BLEU evaluation, the results show that the pre-trained embeddings benefit model performance on smaller corpus sizes.

Table 4 (columns: Model, Perplexity)
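For reference, perplexity is conventionally the exponential of the average negative log-likelihood per target token. The sketch below assumes natural-log token probabilities are available from the model; the function name and the toy values are hypothetical, and the exact normalisation used by the toolkit may differ.

```python
import math

def perplexity(token_log_probs):
    """Perplexity from the natural-log probabilities assigned to each reference token."""
    total = sum(token_log_probs)
    count = len(token_log_probs)
    return math.exp(-total / count)

# Toy check: a model that gives every token probability 1/4 has perplexity 4.
ppl = perplexity([math.log(0.25)] * 8)
```

Lower values mean the model assigns higher probability to the reference translations.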

Transformer Models
Besides the RNN model, we also experimented with pre-trained embeddings and the transformer architecture. We follow the hyperparameter setting from Vaswani et al. (2017), limiting the embeddings to 512 dimensions. We compare the transformer models with and without pre-trained embeddings. The results are presented in Table 5.
From the evaluation results on BLEU and CharacTER, the transformer models without pre-trained embeddings show better performance. We find it interesting that the embeddings pre-trained with CWE decrease the performance severely, leading to a reduction of 3.85 BLEU relative to the model without them. The introduction of finer granularity embeddings might not benefit the transformer performance. We hypothesize that the pre-trained embeddings enhanced by character and sub-character information might conflict with the fixed positional encoding used in the transformer.
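To make the hypothesis concrete, the sketch below shows the fixed sinusoidal positional encoding of Vaswani et al. (2017) and how it is added to the (scaled) token embeddings before the first encoder layer. It only illustrates where pre-trained vectors and position signals meet and does not test the hypothesis. The function names are hypothetical and an even embedding dimension is assumed.

```python
import numpy as np

def sinusoidal_positions(max_len, dim):
    """Fixed positional encoding table of shape (max_len, dim); dim must be even."""
    positions = np.arange(max_len)[:, None]                # (max_len, 1)
    div = np.power(10000.0, np.arange(0, dim, 2) / dim)    # (dim / 2,)
    table = np.zeros((max_len, dim))
    table[:, 0::2] = np.sin(positions / div)  # even dimensions
    table[:, 1::2] = np.cos(positions / div)  # odd dimensions
    return table

def encoder_input(embeddings):
    """Scale token embeddings by sqrt(dim) and add the fixed position signal."""
    max_len, dim = embeddings.shape
    return embeddings * np.sqrt(dim) + sinusoidal_positions(max_len, dim)

# Toy example with the 512-dimensional setting used here.
table = sinusoidal_positions(6, 512)
inputs = encoder_input(np.zeros((6, 512)))
```

Since the position table is not learned, any mismatch in scale or geometry between it and the pre-trained vectors has to be absorbed by fine-tuning the embeddings themselves.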

Conclusion
This paper describes our NMT models with pre-trained embeddings operating on character and sub-character levels. We participated in the WMT19 zh-en news translation shared task and submitted two systems with embeddings trained on a monolingual corpus. We experimented with the effect of using fine-grained pre-trained embeddings and showed the potential benefit of using them. In additional experiments, we find that pre-trained embeddings are more beneficial to the translation models when the parallel training data is limited.