Paraphrases as Foreign Languages in Multilingual Neural Machine Translation

Paraphrases, rewordings of the same semantic meaning, are useful for improving generalization and translation. Unlike previous works that only explore paraphrases at the word or phrase level, we use different translations of the whole training data that are consistent in structure as paraphrases at the corpus level. We treat paraphrases as foreign languages, tag source sentences with paraphrase labels, and train on parallel paraphrases in the style of multilingual Neural Machine Translation (NMT). Our multi-paraphrase NMT, which trains on only two languages, outperforms the multilingual baselines. Adding paraphrases improves rare word translation and increases entropy and diversity in lexical choice. Adding the source paraphrases boosts performance more than adding the target ones, while adding both lifts performance further. We achieve a BLEU score of 57.2 for French-to-English translation using 24 corpus-level paraphrases of the Bible, which outperforms the multilingual baselines and is +34.7 above the single-source single-target NMT baseline.


Introduction
Paraphrases, rewordings of texts with preserved semantics, are often used to improve generalization and to address the sparsity issue in translation (Callison-Burch et al., 2006; Fader et al., 2013; Ganitkevitch et al., 2013; Narayan et al., 2017; Sekizawa et al., 2017). Unlike previous works that use paraphrases at the word/phrase level, we study different translations of the whole corpus that are consistent in structure as paraphrases at the corpus level; we refer to the different translation versions of the same corpus as paraphrases. We train paraphrases in the style of multilingual NMT (Johnson et al., 2017; Ha et al., 2016). Implicit parameter sharing enables multilingual NMT to learn across languages and achieve better generalization (Johnson et al., 2017). Training on closely related languages has been shown to improve translation (Zhou et al., 2018). We view paraphrases as an extreme case of closely related languages and view multilingual data as paraphrases in different languages. Paraphrases can differ randomly or systematically, as each carries the translator's unique style.
We treat paraphrases as foreign languages, and train a unified NMT model on paraphrase-labeled data with a shared attention in the style of multilingual NMT. Similar to multilingual NMT's objective of translating from any of the N input languages to any of the M output languages (Firat et al., 2016), multi-paraphrase NMT aims to translate from any of the N input paraphrases to any of the M output paraphrases in Figure 1. In Figure 1, we see different expressions of a host showing courtesy to a guest by asking whether the sake (an alcoholic drink normally served warm in Asia) needs to be warmed. In Table 6, we show a few examples of parallel paraphrasing data in the Bible corpus. Different translators' styles give rise to rich parallel paraphrasing data, covering a wide range of domains. In Table 7, we also show some paraphrasing examples from the modern poetry dataset, which we are considering for future research.
Indeed, we go beyond the traditional NMT learning of a one-to-one mapping between the source and the target text; instead, we exploit the many-to-many mappings between the source and target text by training on paraphrases that are consistent with each other at the corpus level. Our method achieves high translation performance and yields interesting findings. The differences between our work and prior works are mainly the following.
Unlike previous works that use paraphrases at the word or phrase level, we use paraphrases at the entire corpus level to improve translation performance. We use different translations of the whole training data that are consistent in structure as paraphrases of the full training data. Unlike most multilingual NMT works that use data from multiple languages, we use paraphrases as foreign languages in a single-source single-target NMT system trained only on data from the source and the target languages.
Our main findings in harnessing paraphrases in NMT are the following.
1. Our multi-paraphrase NMT results show significant improvements in BLEU scores over all baselines.
2. Our paraphrase-exploiting NMT uses only two languages, the source and the target languages, and achieves higher BLEUs than the multi-source and multi-target NMT that incorporates more languages.
3. We find that adding the source paraphrases helps better than adding the target paraphrases.
4. We find that adding paraphrases at both the source and the target sides is better than adding them at either side alone.
5. We also find that adding paraphrases together with additional multilingual data yields mixed performance; it is better than training on language families alone, but worse than training on both the source and target paraphrases without language families.
6. Adding paraphrases improves the sparsity issue of rare word translation and diversity in lexical choice.
In this paper, we begin with the introduction and related work in Sections 1 and 2. We introduce our models in Section 3. Finally, we present our results in Section 4 and conclude in Section 5.
Our work is different in that we exploit paraphrases at the corpus level, rather than at the word or phrase level.

Multilingual Attentional NMT
Machine polyglotism, which trains machines to translate from any of the N input languages to any of the M output languages, is a new paradigm in multilingual NMT (Firat et al., 2016; Zoph and Knight, 2016; Dong et al., 2015; Gillick et al., 2016; Al-Rfou et al., 2013; Tsvetkov et al., 2016). The objective is to translate from any of the N input languages to any of the M output languages (Firat et al., 2016). Many multilingual NMT systems involve multiple encoders and decoders (Ha et al., 2016), and it is hard to combine attention across quadratically many language pairs without quadratically many attention mechanisms (Firat et al., 2016). An interesting line of work trains a universal model with a shared attention mechanism, source and target language labels, and Byte-Pair Encoding (BPE) (Johnson et al., 2017; Ha et al., 2016). This method is elegant in its simplicity and its advancement of low-resource language translation and zero-shot translation using a pivot-based translation mechanism (Johnson et al., 2017; Firat et al., 2016).
Unlike previous works, our parallelism is across paraphrases, not across languages. Indeed, our single-source single-target paraphrase-exploiting NMT achieves higher translation performance than the multilingual NMT.

Models
We have four baseline models. Two are single-source single-target attentional NMT models, and the other two are multilingual NMT models with a shared attention (Johnson et al., 2017; Ha et al., 2016). In Figure 1, we show an example of multilingual attentional NMT. Translating from all 4 languages to each other, we have 12 translation paths. For each translation path, we label the source sentence with the source and target language tags. Translating from "你 的 清 酒 凉 了 吗?" to "Has your sake turned cold?", we label the source sentence with opt src zh opt tgt en. More details are in Section 4.
In the multi-paraphrase model, all source sentences are labeled with paraphrase tags. For example, in French-to-English translation, a source sentence may be tagged with opt src f1 opt tgt e0, denoting that it is translated from version "f1" of the French data to version "e0" of the English data. In Figure 1, we show 2 Japanese and 2 English paraphrases. Translating from all 4 paraphrases to each other (N = M = 4), we have 12 translation paths, as N × (N − 1) = 12. For each translation path, we label the source sentence with the source and target paraphrase tags. For the translation path from "お酒冷めましたよね?" to "Has your sake turned cold?", we label the source sentence with opt src j1 opt tgt e0 in Figure 1. Paraphrases on the same translation path carry the same labels. Our paraphrasing data is at the corpus level, and we train a unified NMT model with a shared attention. Figure 1 shows this example with only one sentence; the setup is the same when the training data contains many sentences. All sentences on the same paraphrase path share the same labels.
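The tagging scheme above can be sketched in a few lines of Python. This is an illustrative sketch, not our actual preprocessing code: the helper name make_multiparaphrase_pairs and the underscore-joined tag tokens are assumptions for illustration, modeled on the opt src / opt tgt labels described in the text.

```python
from itertools import permutations

def make_multiparaphrase_pairs(corpora):
    """Build tagged training pairs for multi-paraphrase NMT.

    corpora maps a paraphrase name (e.g. "f0", "e0") to a list of
    sentences aligned across all versions. Every ordered pair of
    distinct paraphrases is one translation path, and each source
    sentence is prefixed with its source/target paraphrase tags.
    """
    pairs = []
    for src, tgt in permutations(corpora, 2):  # N * (N - 1) paths
        tag = f"opt_src_{src} opt_tgt_{tgt}"
        for s, t in zip(corpora[src], corpora[tgt]):
            pairs.append((f"{tag} {s}", t))
    return pairs

# With N = 4 paraphrases there are 4 * 3 = 12 translation paths,
# matching the example in Figure 1.
toy = {
    "j0": ["お酒冷めましたよね?"],
    "j1": ["お酒冷めましたよね?"],
    "e0": ["Has your sake turned cold?"],
    "e1": ["Is your sake warm enough?"],
}
print(len(make_multiparaphrase_pairs(toy)))  # 12
```

The same counting argument gives 24 × 23 = 552 paths for the full 24-paraphrase setup described in Section 4.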

Data
Our main data is the French-to-English Bible corpus (Mayer and Cysouw, 2014), containing 12 versions of the English Bible and 12 versions of the French Bible 1 . We translate from French to English. Since these 24 translation versions are consistent in structure, we refer to them as paraphrases at the corpus level. In our paper, each paraphrase refers to one translation version of the whole Bible corpus. To understand our setup, if we use all 12 French paraphrases and all 12 English paraphrases, so that there are 24 paraphrases in total, i.e., N = M = 24, we have 552 translation paths between them.

Source Sentence / Machine Translation / Correct Target Translation

Source: Comme de l'eau fraîche pour une personne fatigué, Ainsi est une bonne nouvelle venant d'une terre lointaine.
Machine Translation: As cold waters to a thirsty soul, so is good news from a distant land.
Correct Target: Like cold waters to a weary soul, so is a good report from a far country.

Source: Lorsque tu seras invité par quelqu'un à des noces, ne te mets pas à la première place, de peur qu'il n'y ait parmi les invités une personne plus considérable que toi,
Machine Translation: When you are invited to one to the wedding, do not be to the first place, lest any one be called greater than you.
Correct Target: When you are invited by anyone to wedding feasts, do not recline at the chief seat lest one more honorable than you be invited by him,

Source: Car chaque arbre se connaît à son fruit. On ne cueille pas des figues sur des épines, et l'on ne vendange pas des raisins sur des ronces.
Machine Translation: For each tree is known by its own fruit. For from thorns they do not gather figs, nor do they gather grapes from a bramble bush.
Correct Target: For each tree is known from its own fruit. For they do not gather figs from thorns, nor do they gather grapes from a bramble bush.

Source: Vous tous qui avez soif, venez aux eaux, Même celui qui n'a pas d'argent! Venez, achetez et mangez, Venez, achetez du vin et du lait, sans argent, sans rien payer!
Machine Translation: Come, all you thirsty ones, come to the waters; come, buy and eat. Come, buy for wine, and for nothing, for without money.
Correct Target: Ho, everyone who thirsts, come to the water; and he who has no silver, come buy grain and eat. Yes, come buy grain, wine and milk without silver and with no price.

Source: Oui, vous sortirez avec joie, Et vous serez conduits en paix; Les montagnes et les collines éclateront d'allégresse devant vous, Et tous les arbres de la campagne battront des mains.
Machine Translation: When you go out with joy, you shall go in peace; the mountains shall rejoice before you, and the trees of the field shall strike all the trees of the field.
Correct Target: For you shall go out with joy and be led out with peace. The mountains and the hills shall break out into song before you, and all the trees of the field shall clap the palm.

For all experiments, we choose a specific English corpus as e0 and a specific French corpus as f0, which we evaluate across all experiments to ensure consistency in comparison; we evaluate all translation performance from f0 to e0.

Training Parameters
In all our experiments, we use a minibatch size of 64, a dropout rate of 0.3, 4 RNN layers of size 1000, a word vector size of 600, 13 epochs, and a learning rate of 0.8 that decays by a factor of 0.7 whenever the validation score is not improving or the epoch is past 9, across all LSTM-based experiments. Byte-Pair Encoding (BPE) is applied at the preprocessing stage (Ha et al., 2016). Our code is built on OpenNMT (Klein et al., 2017), and we evaluate our models using BLEU scores (Papineni et al., 2002), entropy (Shannon, 1951), F-measure, and qualitative evaluation.
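As a concrete restatement of the learning-rate schedule above, the following Python sketch applies the decay rule; the helper name decayed_lr and its call pattern are our own illustration, not the training framework's API.

```python
def decayed_lr(lr, epoch, val_improved, decay=0.7, decay_after_epoch=9):
    """Learning-rate schedule: multiply the rate by 0.7 whenever the
    validation score has stopped improving, or once training is past
    epoch 9; otherwise keep it unchanged."""
    if not val_improved or epoch > decay_after_epoch:
        return lr * decay
    return lr

lr = 0.8                                          # initial learning rate
lr = decayed_lr(lr, epoch=5, val_improved=True)   # unchanged: still improving, epoch <= 9
lr = decayed_lr(lr, epoch=10, val_improved=True)  # decayed: past epoch 9
```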

Baselines
We introduce a few acronyms for our four baselines to describe the experiments in Table 1, Table 2, and Figure 3. Firstly, we have two single-source single-target attentional NMT models, Single and WMT. Single trains on f0 and e0 and gives a BLEU score of 22.5, the starting point for all curves in Figure 3. WMT adds the out-of-domain WMT'14 French-to-English data on top of f0 and e0; it serves as a weak baseline that helps us to evaluate all experiments' performance discounting the effect of increasing data.
Moreover, we have two multilingual baselines 2 built on multilingual attentional NMT, Family and Span (Zhou et al., 2018). Family refers to the multilingual baseline built by adding one language family at a time, where, on top of the French corpus f0 and the English corpus e0, we add up to 20 other European languages. Span refers to the multilingual baseline built by adding one span at a time, where a span is a set of languages that contains at least one language from each of the families in the data; in other words, a span is a sparse representation of all the families. Both Family and Span train on the Bible in 22 European languages using multilingual NMT. Since Span is always suboptimal to Family in our results, we only show numerical results for Family in Tables 1 and 2, and we plot both Family and Span in Figure 3. The two multilingual baselines are strong baselines, while the WMT baseline is a weak baseline that helps us to evaluate all experiments' performance discounting the effect of increasing data. All baseline results are taken from a prior work that uses the grid of (1, 6, 11, 16, 22) for the number of languages or the equivalent number of unique sentences, and we follow the same grid in Figure 3 (Zhou et al., 2018). All experiments at each grid point carry the same number of unique sentences. Furthermore, Vsrc refers to adding more source (French) paraphrases, and Vtgt refers to adding more target (English) paraphrases. Vmix refers to adding both the source and the target paraphrases. Vmf refers to combining Vmix with additional multilingual data; note that only Vmf, Family, and Span use languages other than French and English, while all other experiments use only English and French. For the x-axis, data refers to the number of paraphrase corpora for Vsrc, Vtgt, and Vmix; to the number of languages for Family; and to the equivalent number of unique training sentences compared to the other training curves for WMT and Vmf.

Results
Training on paraphrases gives better performance than all baselines: The translation performance of training on 22 paraphrases, i.e., 11 English paraphrases and 11 French paraphrases, achieves a BLEU score of 55.4, which is +32.9 above the Single baseline, +8.8 above the Family baseline, and +26.1 above the WMT baseline. Note that the Family baseline uses the grid of (1, 6, 11, 16, 22) for the number of languages; we continue to use this grid for our results on the number of paraphrases, which explains why we pick 22 as an example here. The highest BLEU score of 57.2 is achieved when we train on 24 paraphrases, i.e., 12 English paraphrases and 12 French paraphrases.
Adding the source paraphrases boosts translation performance more than adding the target paraphrases: The translation performance of adding the source paraphrases is higher than that of adding the target paraphrases. Adding the source paraphrases diversifies the data, exposes the model to more rare words, and enables better generalization. Taking the experiments training on 13 paraphrases as an example, training on the source side (i.e., 12 French paraphrases and the English paraphrase e0) gives a BLEU score of 48.8, a gain of +1.4 over 47.4, the BLEU score of training on the target side (i.e., 12 English paraphrases and the French paraphrase f0). This suggests that adding the source paraphrases is more effective than adding the target paraphrases.
Adding paraphrases from both sides is better than adding paraphrases from either side: The curve of adding paraphrases from both the source and the target sides is higher than both the curve of adding the target paraphrases and the curve of adding the source paraphrases. Training on 11 paraphrases from each side, i.e., a total of 22 paraphrases, achieves a BLEU score of 50.8, which is +3.8 higher than training on the target side only and +1.9 higher than training on the source side only. The advantage of combining both sides is that we can combine paraphrases from both the source and the target to reach 24 paraphrases in total and achieve a BLEU score of 57.2.
Adding both paraphrases and language families yields mixed performance: We conduct one more experiment combining the source and target paraphrases with additional multilingual data. This is the only paraphrase experiment in which we use multilingual data beyond French and English. The BLEU score is 49.3, higher than training on families alone; in fact, it is higher than training on all eight European families together. However, it is lower than training on English and French paraphrases alone. Indeed, adding paraphrases as foreign languages is effective; when data is lacking, however, mixing the paraphrases with multilingual data is helpful.
Adding paraphrases increases entropy and diversity in lexical choice, and improves the sparsity issue of rare words: We use bootstrap resampling to construct 95% confidence intervals for the entropies (Shannon, 1951) of all Vmix models, i.e., models adding paraphrases at both the source and the target sides. We find that the more paraphrases, the higher the entropy and the more diversity in lexical choice, as shown in Table 4. From the word F-measure shown in Table 5, we find that the more paraphrases, the better the model handles the rare word sparsity issue. Adding paraphrases not only achieves a much higher BLEU score than the WMT baseline, but also handles the sparsity issue much better.
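The entropy measurement with a bootstrap confidence interval can be sketched as follows. This is a minimal illustration of the general technique (unigram Shannon entropy with a percentile bootstrap), not our exact evaluation script; the function names unigram_entropy and bootstrap_entropy_ci are assumptions for this sketch.

```python
import math
import random
from collections import Counter

def unigram_entropy(tokens):
    """Shannon entropy (in bits) of the unigram distribution of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def bootstrap_entropy_ci(tokens, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the unigram entropy:
    resample the tokens with replacement, recompute the entropy for each
    resample, and take the alpha/2 and 1 - alpha/2 percentiles."""
    rng = random.Random(seed)
    stats = sorted(
        unigram_entropy(rng.choices(tokens, k=len(tokens)))
        for _ in range(n_boot)
    )
    return stats[int(alpha / 2 * n_boot)], stats[int((1 - alpha / 2) * n_boot) - 1]

tokens = "the more paraphrases the higher the entropy".split()
lo, hi = bootstrap_entropy_ci(tokens)  # 95% interval for this sample
```

A wider lexical distribution in the model output raises the entropy estimate, which is how the diversity trend in Table 4 is quantified.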
Adding paraphrases helps rhetoric translation and increases expressiveness: Qualitative evaluation shows many cases where rhetoric translation is improved by training on diverse sets of paraphrases. In Table 3, paraphrases help NMT to use "money", a more contemporary synonym of "silver", which is more direct and easier to understand. Paraphrases simplify rhetorical or subtle expressions; for example, our model uses "rejoice" to replace "break out into song", a personification device attributing joy to mountains, which captures the essence of the meaning being conveyed. However, we also observe that NMT wrongly translates "clap the palm" to "strike". We find that the quality of rhetorical translation is closely tied to the diversity of the parallel paraphrase data. Indeed, the use of paraphrases to improve rhetoric translation is a good future research question. Please refer to Table 3 for more qualitative examples.

Conclusion
We train on paraphrases as foreign languages in the style of multilingual NMT. Adding paraphrases improves translation quality, the rare word issue, and diversity in lexical choice. Adding the source paraphrases helps more than adding the target ones, while combining both boosts performance further. Adding multilingual data to paraphrases yields mixed performance. We would like to explore the common structure and terminology consistency across different paraphrases. Since structure and terminology are shared across paraphrases, we are interested in building an explicit representation of the paraphrases and extending our work toward better translation, or translation with more explicit and more explainable hidden states, which is very important in all neural systems.
We are interested in broadening our dataset in future experiments. We hope to use other parallel paraphrasing corpora, such as the poetry dataset shown in Table 7. Since very few poems are translated multiple times into the same language, we would need to train on an extremely small dataset. Rhetoric in paraphrasing is important in the poetry dataset, which again depends on the training paraphrases. The limited-data issue is also relevant to the low-resource setting.
We would like to effectively train on extremely small low-resource paraphrasing data. As discussed above for the poetry dataset, datasets with multiple paraphrases are typically small and yet valuable. If we can train on an extremely small amount of data, especially in the low-resource scenario, we can exploit the power of multi-paraphrase NMT further.
Culture-aware paraphrasing and subtle expressions are vital (Levin et al., 1998; Larson, 1984). Rhetoric in paraphrasing is very important too. In Figure 1, "is your sake warm enough?" is, in Asian culture, an implicit way of saying "would you like me to warm the sake for you?". We would like to model such culture-specific subtlety through multi-paraphrase training.