Robust Neural Machine Translation with Joint Textual and Phonetic Embedding

Neural machine translation (NMT) is notoriously sensitive to noises, but noises are almost inevitable in practice. One special kind of noise is homophone noise, where words are replaced by other words with similar pronunciations. We propose to improve the robustness of NMT to homophone noises by 1) jointly embedding both textual and phonetic information of source sentences, and 2) augmenting the training dataset with homophone noises. Interestingly, we found that, to achieve both better translation quality and more robustness, most (though not all) of the weight should be put on the phonetic rather than the textual information. Experiments show that our method not only significantly improves the robustness of NMT to homophone noises, but also, surprisingly, improves the translation quality on some clean test sets.


Introduction
In recent years we have witnessed rapid progress in the field of neural machine translation (NMT). Sutskever et al. (2014) proposed a general end-to-end approach for sequence learning and demonstrated promising results on the task of machine translation. The widely used encoder-decoder architecture consists of two components: an encoder network, which encodes an input sequence into a fixed-length vector representation, and a decoder network, which generates the output sequence based on that fixed-length vector representation. To overcome the limitation of the fixed-length vector representation, the attention mechanism (Luong et al., 2015) has been proposed and extensively studied, resulting in significant improvements in NMT. Further improvements, such as replacing recurrent units with convolutional units (Gehring et al., 2017) or self-attention (Vaswani et al., 2017), have pushed the boundary of NMT; in particular, the Transformer network proposed by Vaswani et al. (2017) achieves state-of-the-art (SOTA) results on many tasks, including machine translation.
However, these NMT models are very sensitive to noises in the input sentences. The causes of such vulnerability are complicated and include: 1) neural networks are inherently sensitive to noises, as illustrated by adversarial examples (Goodfellow et al., 2014; Szegedy et al., 2013); 2) the global effect of attention, where every input word can affect every output word generated by the decoder; and 3) the embedding of input words is very sensitive to noises. To improve the robustness of NMT, Cheng et al. (2018) recently proposed adversarial stability training, which improves the robustness of both the encoder and the decoder with perturbed inputs.
In this paper, we focus on one special kind of noises, which we call homophone noises, where a word is replaced by another word with the same (or very similar) pronunciation. Homophone noises are common in many real-world systems. One such example is speech translation, where the ASR system may output correct phoneme sequences but transcribe some words into their homophones. Another example is phonetic input systems for non-phonetic writing systems, such as Pinyin for Chinese or Katakana/Hiragana for Japanese, where it is very common for a user to choose a homophone instead of the correct word. Existing NMT systems are very sensitive to homophone noises; Table 1 gives an example in which, after only one character, '有', has been replaced by one of its homophones, '又', the Transformer generates a strange and irrelevant translation. The method proposed in this paper generates correct results under such kind of noises, since it mainly uses phonetic information.

Output of Transformer (noisy input): the hpv has been found dead so far and 57 have been saved
Output of Our Method: so far, 109 people have been found dead and 57 others have been rescued

Table 1: Translation results on a Mandarin sentence without and with homophone noises. The word '有' (yǒu, "have") in the clean input is replaced by one of its homophones, '又' (yòu, "again"), to form a noisy input. This seemingly minor change completely fools the Transformer into generating something irrelevant ("hpv"). Our method, by contrast, is very robust to homophone noises thanks to the phonetic information.
Since words are discrete signals, a common practice for feeding them into a neural network is to encode them as real-valued vectors through embedding. However, if a word a is replaced by a word b due to noises, the embedding usually changes dramatically, which inevitably results in changes to the output. For homophone noises, the correct phonetic information still exists, and we can make use of it to make translation much more robust.
In this paper, we propose to improve the robustness of NMT models to homophone noises by jointly embedding both textual and phonetic information. In our approach, the input consists of the source sentence together with the pronunciation of each of its words, and the output is still a sentence in the target language. Both the words and their corresponding pronunciations are embedded and then combined before being fed into the neural network. This approach has the following advantages:

• First, it is a simple but general approach that is easy to implement. It can be applied to many NMT models, since it only modifies the embedding layer, where phonetic information is embedded and combined with the word embeddings.
• Second, it dramatically improves the robustness of NMT models to homophone noises. This is because the final embedding in our approach is a combination of the word embedding and the phonetic embedding; when most of the weight is placed on the phonetic embedding, the final embeddings remain robust even if some word embeddings are incorrect.
• Third, it also improves translation performance on clean test sets. This seems surprising, since no additional information is provided. Although the real reason is unknown, we suspect it is due to some regularization effect of the phonetic embeddings.
To further improve the robustness of NMT models to homophone noises, we use data augmentation to expand the training datasets, by randomly selecting training instances and adding homophone noises to them. The experimental results clearly show that data augmentation improves the robustness of NMT models to homophone noises.

Background
In theory, the goal of NMT is to optimize the translation probability of a target sentence y = y_1, ..., y_N conditioned on the corresponding source sentence x = x_1, ..., x_M:

P(y | x; θ) = ∏_{n=1}^{N} P(y_n | y_{<n}, x; θ),   (1)

where θ represents the model parameters and y_{<n} represents the partial translation consisting of the first n − 1 target words. NMT uses neural networks to model the objective function (1), usually with two core components, namely an encoder and a decoder. The encoder encodes the source sentence x into a sequence of hidden representations, and the decoder generates the target sequence y based on this sequence of hidden representations. Both the encoder and the decoder can be modeled by various neural networks, such as RNNs, CNNs, and self-attention.
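Written out explicitly, and assuming the standard maximum-likelihood training objective (the text above refers to it only as "the objective function (1)"), the model parameters θ are trained to maximize the log-probability of the target sentence:

```latex
\log P(y \mid x;\ \theta) \;=\; \sum_{n=1}^{N} \log P\!\left(y_n \mid y_{<n},\, x;\ \theta\right)
```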
Both the source sequence x and the target sequence y are sequences of words, which are discrete signals; however, neural networks need continuous signals as inputs. A common practice is to first embed words into a real coordinate space, denoted by R^d, where d is the dimension of the space. After embedding, each word is represented by a d-dimensional real vector, which can be fed into a neural network.
Embedding is very sensitive to noises, which is almost self-evident: when a word a is replaced by a word b due to noises, the resulting embedding vector usually changes dramatically (from the embedding vector of a to the embedding vector of b). It is hard to make embeddings robust to arbitrary noises; however, for homophone noises, thanks to the existence of correct phonetic information, the embedding can be made extremely robust by putting most of the weight on the phonetic information and using the textual information only as auxiliary information. This is reasonable, since humans can communicate with each other with no knowledge of text.

Joint Embedding
For a word a in the source language, suppose its pronunciation can be expressed by a sequence of pronunciation units, such as phonemes or syllables, denoted by Ψ(a) = {s_1, ..., s_l}. Note that we use the term "word" loosely here; in fact, a may be a word or a subword, or even a character in languages where each character has a pronunciation, such as Mandarin.
We embed both pronunciation units and words. For a pronunciation unit s, its embedding is denoted by π(s), and for a word a, its embedding is denoted by π(a). For the pair of a word a and its pronunciation sequence Ψ(a) = {s_1, ..., s_l}, we therefore have l + 1 embedding vectors, namely π(a), π(s_1), ..., π(s_l). Since l differs across words, and even across different pronunciations of the same word, we need to combine them into a single vector whose length does not depend on l. There are obviously multiple ways to do this; in this paper we proceed as follows (a code sketch is given after this list):

• First, both words and pronunciation units are embedded into the same space R^d.

• Second, the l pronunciation-unit embeddings π(s_1), ..., π(s_l) are collapsed into a single d-dimensional phonetic vector, and the final source embedding is a weighted combination of this phonetic vector and the word embedding π(a), with weight β on the phonetic vector and 1 − β on the word embedding.
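To make this concrete, below is a minimal PyTorch sketch of one plausible realization of the joint embedding, in which the phonetic vector is taken to be the average of the pronunciation-unit embeddings; the module name, the averaging choice, and the default hyperparameters are our illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn


class JointEmbedding(nn.Module):
    """Weighted combination of word and phonetic embeddings.

    The phonetic vector here is the average of the pronunciation-unit
    embeddings (an assumption; the text above only requires a combination
    whose length does not depend on the number of units l).
    """

    def __init__(self, word_vocab, unit_vocab, dim=512, beta=0.95, pad_idx=0):
        super().__init__()
        self.word_emb = nn.Embedding(word_vocab, dim, padding_idx=pad_idx)
        self.unit_emb = nn.Embedding(unit_vocab, dim, padding_idx=pad_idx)
        self.beta = beta  # weight on the phonetic information

    def forward(self, words, units, unit_mask):
        # words:     (batch, src_len)          word / subword ids
        # units:     (batch, src_len, max_l)   pronunciation-unit ids, padded
        # unit_mask: (batch, src_len, max_l)   1 for real units, 0 for padding
        w = self.word_emb(words)                              # (B, T, d)
        u = self.unit_emb(units)                              # (B, T, L, d)
        mask = unit_mask.unsqueeze(-1).float()
        # Average the unit embeddings of each word (avoiding division by zero).
        phon = (u * mask).sum(dim=2) / mask.sum(dim=2).clamp(min=1.0)
        return self.beta * phon + (1.0 - self.beta) * w       # (B, T, d)
```

Feeding the output of such a module into the encoder in place of the usual source word embedding is, as noted in the introduction, the only change required to the underlying NMT model.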

Data Augmentation
Data augmentation is widely known to be extremely useful in training models (Shotton et al., 2011; Krizhevsky et al., 2012), especially in training data-hungry neural networks. The homophone noises tackled in this paper are easy to simulate, and they can also be collected from the outputs of ASR systems and pronunciation-based input methods.
In this paper, we augment the training datasets in a very simple way (a code sketch follows this list):

• First, for each word a, build a set Φ(a) containing the homophones of a that real systems may realistically confuse with a.

• Second, randomly pick training pairs from the training datasets, and revise their source sentences by randomly replacing some words with their homophones.
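A minimal Python sketch of both steps is given below, written under simple assumptions: `pronunciation_of` is a hypothetical mapping from vocabulary words to their pronunciation-unit sequences, and the sampling ratio and replacement probability are only illustrative defaults.

```python
import random
from collections import defaultdict


def build_homophone_sets(pronunciation_of):
    """Group words that share exactly the same pronunciation-unit sequence."""
    by_pron = defaultdict(list)
    for word, units in pronunciation_of.items():
        by_pron[tuple(units)].append(word)
    # Phi(a): the homophones of a, excluding a itself.
    return {w: [o for o in group if o != w]
            for group in by_pron.values() if len(group) > 1 for w in group}


def add_homophone_noise(tokens, phi, p=0.1):
    """Replace each word that has homophones with probability p."""
    return [random.choice(phi[t]) if t in phi and random.random() < p else t
            for t in tokens]


def augment(corpus, phi, ratio=0.4, p=0.1):
    """Append noisy copies of randomly sampled (source, target) pairs."""
    extra = [(add_homophone_noise(src, phi, p), tgt)
             for src, tgt in random.sample(corpus, int(len(corpus) * ratio))]
    return corpus + extra
```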
After augmentation, the robustness of the models is significantly improved. Together with the embedding of pronunciation information, the resulting NMT models are very robust to homophone noises.

Models
In our experiments, we use the Transformer as the baseline. This is because the Transformer achieves SOTA results on many machine translation datasets and has well-maintained open-source codebases, such as Tensor2Tensor and OpenNMT. Specifically, we use the PyTorch version (PyTorch 0.4.0) of OpenNMT. All models are trained with 8 GPUs. Unless otherwise specified, the hyperparameters are: 6 layers, 8 attention heads, 2048 neurons in the feed-forward layers, and 512 neurons in the other layers; the dropout rate is 0.1, the label smoothing rate is 0.1, the optimizer is Adam, and the learning rate is 2 with NOAM decay.
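For reference, the NOAM decay mentioned above is the learning-rate schedule from the original Transformer paper; a small sketch follows, where the warmup of 8000 steps is our assumption (it is not stated above) and the factor of 2 corresponds to the learning rate of 2.

```python
def noam_lr(step, d_model=512, factor=2.0, warmup=8000):
    """NOAM schedule: factor * d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)."""
    step = max(step, 1)  # avoid division by zero at step 0
    return factor * (d_model ** -0.5) * min(step ** -0.5, step * (warmup ** -1.5))


# The learning rate rises roughly linearly during warmup, then decays with step^-0.5.
for s in (100, 8000, 40000):
    print(s, round(noam_lr(s), 6))
```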

Translation Tasks
We evaluated our method on Mandarin-English translation tasks, and report the 4-gram BLEU score (Papineni et al., 2002) as calculated by the multi-bleu.perl script.
We used an extended NIST corpus which consists of 2M sentence pairs with about 51M Mandarin words and 62M English words, respectively. The NIST 2006 set is used as the dev set to select the best model, and the NIST 2002, 2003, 2004, 2005, and 2008 sets are used as test sets.
We apply byte-pair encoding (BPE) (Sennrich et al., 2016) on both the Chinese and English sides to reduce the vocabulary sizes to 18K and 10K, respectively. We exclude sentences longer than 256 subwords. For Mandarin, we use pinyin as the pronunciation units, and there are 404 pinyins in total. For symbols or entries in the Mandarin dictionary without a pronunciation or with an unknown pronunciation, a special pronunciation unit, unk, is used.

In Figure 1, we compare the performance, measured by BLEU scores against multiple references, of the baseline model and our models with β = 0.2, 0.4, 0.6, 0.8, 0.95, and 1.0. We report the results every 10000 iterations, from iteration 10000 to iteration 90000. Note that our model is almost exactly the same as the baseline model, differing only in the source embeddings. In theory, when β = 0, our model is identical to the baseline model. In practice, however, there is a slight difference: when β = 0, the phonetic embedding parameters are still present, which affects the optimization procedure even though no gradients flow back to them. When β = 1, only phonetic information is used. There are some interesting observations from Figure 1. First, combining textual and phonetic information improves translation performance. Compared with the baseline, the BLEU scores improve by 1-2 points when β = 0.2, and by 2-3 points when β = 0.4, 0.6, 0.8, or 0.95. Second, the phonetic information plays a very important role in translation. Even when β = 0.95, that is, when the weight of the phonetic embedding is 0.95 and the weight of the word embedding is only 0.05, the performance is still very good; in fact, our best BLEU score (48.91) is achieved at β = 0.95. However, the word embedding is still important: when we use only phonetic information (β = 1.0), the performance becomes worse, almost the same as the baseline (which uses only textual information). As mentioned in Section 2, humans seem to need only phonetic information to communicate with each other; this is probably because we are better at understanding context than machines and thus do not need the help of textual information.
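As a concrete illustration of the pinyin preprocessing described above, the sketch below uses the open-source pypinyin package to map Mandarin tokens to pinyin units, falling back to a special unk unit for tokens without a known pronunciation; this particular library and the exact fallback handling are our assumptions, not necessarily the toolchain used in the paper.

```python
# pip install pypinyin
from pypinyin import lazy_pinyin, Style

UNK_UNIT = "<unk>"


def pinyin_units(token):
    """Return the pinyin syllables (without tones) of a token, or <unk>."""
    units = lazy_pinyin(token, style=Style.NORMAL,
                        errors=lambda chars: [UNK_UNIT])
    return units if units else [UNK_UNIT]


print(pinyin_units("中国"))  # e.g. ['zhong', 'guo']
print(pinyin_units("NBA"))   # e.g. ['<unk>']
```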

Translation Results
We use NIST 06 as the dev set to select the best models for both the baseline and our model with different βs, and test them on the NIST 02, 03, 04, and 08 test sets. The results are reported in Table 2. These results corroborate our findings from Figure 1, and here we emphasize the most important one: although we can get good translations from either the textual or the phonetic information of the source sentence, combining them yields much better results.
To understand why phonetic information helps the translation, it is helpful to visualize the embeddings of the pronunciation units. We project the whole pinyin embedding space into a 2-dimensional space using the t-SNE technique (Maaten and Hinton, 2008), and illustrate a small region of it in Figure 2. An intriguing property of the embedding is that pinyins with similar pronunciations tend to lie close to each other in this space.

Homophones are very common in Mandarin; Table 3 lists five groups of homophones. To test the robustness of NMT models to homophone noises, we created two noisy test sets, NoisySet1 and NoisySet2, based on the NIST06 Mandarin-English test set. The creation procedure is as follows: for each source sentence in NIST06, we scan it from left to right, and if a word has homophones, it is replaced by one of its homophones with a certain probability (10% for NoisySet1 and 20% for NoisySet2).
In Figure 3, we compare the performance, measured by BLEU scores, of the baseline model and our models with β = 0.2, 0.4, 0.6, 0.8, 0.95, and 1.0 on the NIST06 test set and the two noisy test sets created from it. The models are chosen based on their performance (BLEU scores) on the NIST06 test set. As Figure 3 shows, as β grows, that is, as more and more weight is put on the phonetic information, the performance on both noisy test sets improves almost steadily. When β = 1.0, as expected, homophone noises do not affect the results at all, since the model is trained solely on phonetic information. However, this is not our best choice, since the performance on the clean test set gets much worse. In fact, the best choice of β seems to be a value smaller than but close to 1, which mainly focuses on phonetic information but still utilizes some textual information.

Table 4 demonstrates the effects of homophone noises on two sentences. The baseline model can translate both sentences correctly; however, when only one word (a preposition) is replaced by one of its homophones, the baseline model generates incorrect, redundant, and strange translations. This shows the vulnerability of the baseline model. Note that since the replaced words are prepositions, the meaning of the noisy source sentences is still very clear, and the noise does not affect a human's understanding at all. For our method, we use the model with β = 0.95, and it generates reasonable translations.
Figure 3: BLEU scores on datasets without and with homophone noises. On both noisy test sets, as more weight is put on the phonetic embedding, that is, as β grows, the translation quality improves.

Clean Input: 古巴是第一个与新中国建交的拉美国家
Output of Transformer: cuba was the first latin american country to establish diplomatic relations with new china
Noisy Input: 古巴是第一个于新中国建交的拉美国家
Output of Transformer: cuba was the first latin american country to discovering the establishment of diplomatic relations between china and new Zealand
Output of Our Method: cuba is the first latin american country to establish diplomatic relations with new china

Clean Input: 他认为, 格方对俄方的指责是荒谬的
Output of Transformer: he believes that georgia's accusation against russia is absurd
Noisy Input: 他认为, 格方憝俄方的指责是荒谬的
Output of Transformer: he believes that the accusations by the russian side villains are absurd
Output of Our Method: he maintained that georgia's accusation against russia is absurd

Table 4: Two examples of homophone noises in source sentences. Textual-only embedding is very sensitive to homophone noises and thus generates weird outputs. However, when both textual and phonetic information of the source sentences are jointly embedded, the model is very robust.

To further improve the robustness of NMT models, we augment the training dataset using the method described in Section 4. There are about 2M sentence pairs in the original training dataset, and we add 40% noisy sentence pairs, resulting in a training dataset with about 2.8M sentence pairs.
In Table 5, we report the performance of the baseline model and our model with β = 0.95, with and without data augmentation. Not surprisingly, data augmentation significantly improves the robustness of NMT models to homophone noises. However, the noises in the training data seem to hurt the performance of the baseline model (from 45.97 to 43.94), while their effect on our model is much smaller, probably because our model mainly uses the phonetic information.

Table 5: Comparison of models trained with and without data augmentation. Clearly, data augmentation significantly improves the robustness of the models to homophone noises.
In Figure 4, we illustrate the embeddings of some pinyins before and after data augmentation. An interesting observation is that if two pinyins are pronunciations of some common words, such as BAO and PU, TUO and DUO, or YE and XIE, their embeddings become much closer after data augmentation; this helps explain why the robustness of NMT models to homophone noises improves.
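A visualization of this kind can be produced, for example, with scikit-learn's t-SNE implementation; the sketch below is an illustration under our own assumptions (variable names, perplexity, and plotting details), not the exact script behind Figures 2 and 4.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE


def plot_pinyin_embeddings(embeddings, labels, out_path="pinyin_tsne.png"):
    """Project pinyin embedding vectors to 2-D with t-SNE and plot them.

    embeddings: (num_pinyins, d) array of learned pinyin embeddings.
    labels:     list of pinyin strings, e.g. ["bao", "pu", "tuo", "duo"].
    """
    coords = TSNE(n_components=2, perplexity=30,
                  random_state=0).fit_transform(np.asarray(embeddings))
    plt.figure(figsize=(8, 8))
    plt.scatter(coords[:, 0], coords[:, 1], s=5)
    for (x, y), lab in zip(coords, labels):
        plt.annotate(lab, (x, y), fontsize=6)
    plt.savefig(out_path, dpi=200)
```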

Conclusion
In this paper, we propose to use both textual and phonetic information in neural machine translation by combining them in the embedding layer of the neural network. Such a combination not only makes NMT models much more robust to homophone noises, but also improves their performance on datasets without homophone noises. Our experimental results clearly show that both textual and phonetic information are important in neural machine translation, although the best balance is to rely mostly on the phonetic information. Since homophone noises are easy to simulate, we also augment the training dataset by adding homophone noises, which our experiments show to be very useful in improving the robustness of NMT models to homophone noises.