Handling Syntactic Divergence in Low-resource Machine Translation

Despite impressive empirical successes of neural machine translation (NMT) on standard benchmarks, limited parallel data impedes the application of NMT models to many language pairs. Data augmentation methods such as back-translation make it possible to use monolingual data to help alleviate these issues, but back-translation itself fails in extreme low-resource scenarios, especially for syntactically divergent languages. In this paper, we propose a simple yet effective solution, whereby target-language sentences are re-ordered to match the order of the source and used as an additional source of training-time supervision. Experiments with simulated low-resource Japanese-to-English and real low-resource Uyghur-to-English scenarios find significant improvements over other semi-supervised alternatives.


Introduction
While neural machine translation (NMT; Bahdanau et al. (2015); Vaswani et al. (2017)) now represents the state of the art in the majority of large-scale MT benchmarks (Bojar et al., 2017), it is highly dependent on the availability of copious parallel resources; NMT under-performs previous phrase-based methods when the training data is small (Koehn and Knowles, 2017). Unfortunately, million-sentence parallel corpora are often unavailable for many language pairs. Conversely, monolingual sentences, particularly in English, are often much easier to find, making semi-supervised approaches that can use monolingual data a desirable solution to this problem.

Figure 1: An English sentence re-ordered into Japanese order using the rule-based method of Isozaki et al. (2010b), and its reference Japanese translation.
Reference Japanese: 私 は 新しい 車 を 買った 。
English: I bought a new car .
Japanese-ordered English: I var_1 a new car var_2 bought .
Semi-supervised approaches for NMT are often based on automatically creating pseudo-parallel sentences through methods such as back-translation (Irvine and Callison-Burch, 2013; Sennrich et al., 2016) or adding an auxiliary autoencoding task on monolingual data (Cheng et al., 2016; Currey et al., 2017). However, both methods have problems with low-resource and syntactically divergent language pairs. Back-translation assumes enough data to create a functional NMT system, an unrealistic requirement in low-resource scenarios, while autoencoding target sentences by definition cannot learn source-target word reordering. Fully unsupervised methods have shown some success on high-resource languages, but limited success on real low-resource settings and syntactically divergent language pairs (Neubig and Hu, 2018; Guzmán et al., 2019); hence we focus on semi-supervised methods in this paper. This paper proposes a method to create pseudo-parallel sentences for NMT for language pairs with divergent syntactic structures. Prior to NMT, word reordering was a major challenge for statistical machine translation (SMT), and many techniques emerged over the years to address this challenge (Xia and McCord, 2004; Collins et al., 2005). Importantly, even simple heuristic reordering methods with a few hand-created rules have been shown to be highly effective in closing syntactic gaps (Collins et al. (2005); Isozaki et al. (2010b); Fig. 1). Because these rules usually operate solely on high-resourced languages such as English, for which high-quality syntactic analysis tools exist, a linguist with rudimentary knowledge of the structure of the target language can create them in short order using these tools.
However, similar pre-ordering methods have not proven useful in NMT (Du and Way, 2017), largely because in high-resource scenarios NMT is much more effective at learning reordering than previous SMT methods were (Bentivogli et al., 2016). In low-resource scenarios, however, it is less realistic to expect that NMT could learn this reordering from scratch on its own.
Here we ask "how can we efficiently leverage the monolingual target data to improve the performance of the NMT system in low-resource, syntactically divergent language pairs?" We tackle this problem via a simple two-step data augmentation method: (1) we first reorder monolingual target sentences to create source-ordered target sentences as shown in Fig. 1, (2) we then replace the words in the reordered sentences with source words using a bilingual dictionary, and add them as the source side of a pseudo-parallel corpus. Experiments demonstrate the effectiveness of our approach on translation from Japanese and Uyghur to English, with a simple, linguistically motivated method of head finalization (HF; Isozaki et al. (2010b)) as our reordering method.

The Proposed Method
Training Framework We assume that there are two types of available resources: a small parallel corpus P = {(s, t)} and a large monolingual target corpus Q. The goal of our method is to create a pseudo-parallel corpus Q̂ = {(ŝ, t)}, where ŝ is a pseudo-source sentence automatically created in two steps: (1) word reordering and (2) word-by-word translation.
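The two-step construction of the pseudo-parallel corpus can be sketched as follows. This is a minimal illustration, not the authors' code; `reorder` and `translate_word` stand in for the reordering and dictionary components described in the following subsections.

```python
def build_pseudo_corpus(mono_targets, reorder, translate_word):
    """Create pseudo-parallel pairs (s_hat, t) from monolingual target sentences.

    mono_targets: tokenized target sentences t from the monolingual corpus Q
    reorder: maps a target sentence t to its source-ordered version t_s
    translate_word: maps a single target word to a source word
    """
    pseudo = []
    for t in mono_targets:
        t_s = reorder(t)                            # step 1: source-order the target
        s_hat = [translate_word(w) for w in t_s]    # step 2: word-by-word translation
        pseudo.append((s_hat, t))                   # pseudo-source paired with original t
    return pseudo
```

With a toy reordering function and a two-entry dictionary, the function simply chains the two steps and keeps the original target sentence as the reference side.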
Word Reordering The first step reorders monolingual target sentences t ∈ Q into the source order, yielding t_s. Instead of devising an entirely new word-ordering method, we can simply rely on methods that have already been widely studied and proven useful in SMT. Reordering can be done either using rules based on linguistic knowledge (Isozaki et al., 2010b; Collins et al., 2005) or by learning from aligned parallel data (Xia and McCord, 2004; Habash, 2007), and in principle our pseudo-corpus creation paradigm is compatible with any of these methods.
Specifically, in this work we utilize rule-based methods, as our goal is to improve translation of low-resource languages, where large quantities of high-quality parallel data do not exist and we posit that current data-driven reordering methods are unlikely to function well. Examples of rule-based methods include those to reorder English into German (Navratil et al., 2012), Arabic (Badr et al., 2009), or Japanese (Isozaki et al., 2010b). In experiments we use Isozaki et al. (2010b)'s method of reordering SVO languages (e.g. English) into the order of SOV languages (e.g. Japanese) by simply (1) applying a syntactic parser to English (Tsuruoka et al., 2004), (2) identifying the head constituent of each phrase and moving it to the end of the phrase, and (3) inserting special tokens after subjects and objects of predicates to mimic Japanese case markers.
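As a rough illustration of step (2) above, the core "move the head to the end of its phrase" operation can be sketched over a toy parse representation. This is not the authors' implementation, which operates on the output of an English parser and additionally inserts the pseudo case-marker tokens of step (3); the `(head, children)` encoding here is an assumption made purely for illustration.

```python
def head_finalize(node):
    """Recursively emit a phrase's words with its head constituent moved to the end.

    A leaf is a plain string; an internal phrase is a (head, children) tuple,
    where head is itself a phrase and children are the non-head constituents
    in their original order.
    """
    if isinstance(node, str):              # leaf word
        return [node]
    head, children = node
    words = []
    for child in children:                 # non-head material first, in order
        words.extend(head_finalize(child))
    words.extend(head_finalize(head))      # head constituent moved to the end
    return words
```

On a toy parse of "I bought a new car" (verb heads the VP, noun heads the NP), this produces the SOV-like order "I a new car bought", matching the reordering shown in Fig. 1 up to the inserted case-marker tokens.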
Word-by-word Translation To generate data for training MT models, we next perform word-by-word translation of t_s into a pseudo-source sentence ŝ using a bilingual dictionary (Xie et al., 2018). 3 There are many ways we can obtain this dictionary: even for many low-resource languages with a paucity of bilingual text, we can obtain manually-curated lexicons with reasonable coverage, or run unsupervised word alignment on whatever parallel data we have available. In addition, we can induce word translations for more words in the target language using methods for bilingual lexicon induction over pre-trained word embeddings (e.g. Grave et al. (2018)).
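As one sketch of the embedding-based route, bilingual lexicon induction can be approximated by nearest-neighbour search in a shared embedding space. Real induction systems typically use more robust retrieval criteria (e.g. CSLS) and a learned cross-lingual mapping; here the two matrices are simply assumed to already live in a shared space.

```python
import numpy as np

def induce_translations(tgt_vecs, src_vecs, src_words):
    """For each target-word vector, return the source word whose embedding is
    closest by cosine similarity. Assumes both matrices are already mapped
    into a shared bilingual space."""
    tgt = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    src = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    sims = tgt @ src.T                      # cosine similarity of unit vectors
    return [src_words[i] for i in sims.argmax(axis=1)]
```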

Experiments
We evaluate our method on two language pairs: Japanese-to-English (ja-en) and Uyghur-to-English (ug-en). Japanese and Uyghur are phylogenetically distant languages, but they share a similar SOV syntactic structure, which diverges greatly from the SVO structure of English.

Experimental Setup
For both language pairs, we use an attention-based encoder-decoder NMT model with a one-layer bidirectional LSTM as the encoder and a one-layer uni-directional LSTM as the decoder. 4 Embeddings and LSTM states are set to 300 and 256 dimensions respectively. Target word embeddings are shared with the softmax weight matrix in the decoder. As noted above, we use HF (Isozaki et al., 2010b) as our re-ordering rule. HF was designed for transforming English into Japanese order, but we use it as-is for the Uyghur-English pair as well to demonstrate that simple, linguistically motivated rules can generalize across pairs with similar syntax with little or no modification. Further details regarding the experimental settings are in the supplementary material.

Simulated Japanese to English Experiments
We first evaluate on a simulated low-resource ja-en translation task using the ASPEC dataset (Nakazawa et al., 2016). We randomly select 400k ja-en parallel sentence pairs to use as our full training data. We then randomly sub-sample low-resource datasets of 3k, 6k, 10k, and 20k parallel sentences, and use the English sides of the remaining sentence pairs as monolingual data. When augmenting the training data with the reordered pairs, we upsample the parallel sentences by a factor of 5. For the settings with 3k, 6k, 10k, and 20k supervised parallel sentences, we set the maximum vocabulary size of both Japanese and English to 10k, 10k, 15k, and 20k respectively.
To automatically learn a high-precision dictionary from the small amount of parallel data available for training, we use GIZA++ (Och and Ney, 2003) to learn alignments in both directions and then take the intersection of the alignments. We then learn bilingual word embeddings with DeMa-BWE (Zhou et al., 2019), an unsupervised method that has shown strong results on syntactically divergent language pairs. We give the more reliable alignments extracted from GIZA++ high priority by querying the alignment dictionary first, falling back to the embedding-induced dictionary. When an English word is in neither dictionary, we copy it as-is into the pseudo-source sentence.
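The dictionary-extraction step can be sketched as follows, assuming the bidirectional GIZA++ alignments have already been parsed into lists of (source-position, target-position) pairs per sentence; the tie-breaking by frequency is an illustrative simplification.

```python
from collections import Counter, defaultdict

def intersect_dictionary(fwd_aligns, bwd_aligns, src_sents, tgt_sents):
    """Build a target-to-source dictionary from the intersection of forward
    and backward word alignments. Each alignment is a list of (i, j) pairs
    linking source position i to target position j in one sentence pair."""
    counts = defaultdict(Counter)
    for fwd, bwd, src, tgt in zip(fwd_aligns, bwd_aligns, src_sents, tgt_sents):
        for i, j in set(fwd) & set(bwd):   # keep only links both directions agree on
            counts[tgt[j]][src[i]] += 1
    # for each target word, keep its most frequently aligned source word
    return {t: c.most_common(1)[0][0] for t, c in counts.items()}
```

Keeping only the intersected links trades recall for precision, which suits its use here as the high-priority dictionary ahead of the embedding-induced one.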

Real Uyghur to English Experiments
We also consider the harder case of Uyghur, a truly low-resource language. We create test and validation sets using the test data from the DARPA LORELEI corpus (Christianson et al., 2018), which contains 2,275 sentence pairs (after filtering out noisy ones) related to incidents that happened in the Uyghur area. We hold out 300 pairs as the validation data and use the rest as the test set. The LORELEI language pack also contains bilingual lexicons between Uyghur and English, and thousands of in-domain English sentences. We also use a large monolingual English corpus, collected from ReliefWeb, containing sentences related to various incidents occurring all over the world. 5 To sub-select a relevant subset of this corpus, we use cross-entropy filtering (Moore and Lewis, 2010) to select the 400k sentences that are most similar to the in-domain English data.
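For illustration, Moore-Lewis selection can be sketched with unigram language models; the original method uses higher-order n-gram LMs, and the add-one smoothing here is a simplification.

```python
import math
from collections import Counter

def unigram_logprob(sentence, counts, total, vocab_size):
    """Add-one-smoothed unigram log-probability, averaged per token."""
    lp = sum(math.log((counts[w] + 1) / (total + vocab_size)) for w in sentence)
    return lp / len(sentence)

def moore_lewis_select(candidates, in_domain, general, k):
    """Rank candidate sentences by the difference between their per-token
    log-probability under an in-domain LM and a general-domain LM, and keep
    the top k (higher in-domain probability relative to general = better)."""
    in_counts, gen_counts = Counter(), Counter()
    for s in in_domain:
        in_counts.update(s)
    for s in general:
        gen_counts.update(s)
    vocab = len(set(in_counts) | set(gen_counts))
    n_in, n_gen = sum(in_counts.values()), sum(gen_counts.values())

    def score(s):
        return (unigram_logprob(s, in_counts, n_in, vocab)
                - unigram_logprob(s, gen_counts, n_gen, vocab))

    return sorted(candidates, key=score, reverse=True)[:k]
```

In the setup above, `in_domain` would be the LORELEI in-domain English sentences, `general` the full ReliefWeb pool, and k = 400k.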
For parallel data, as is the case for many low-resource languages, we only have access to data from the Bible 6 and Wikipedia language links (3,088 parallel Uyghur-English Wikipedia titles in total), but no other in-domain parallel data. We run GIZA++ on this parallel data to obtain an alignment dictionary. We learn bilingual word embeddings via the supervised geometric approach (Jawanpuria et al., 2019) on FastText (Grave et al., 2018) pre-trained Uyghur and English monolingual embeddings.
Table 2: Translation examples on ja-en (reorder with 6000 supervised pairs) and ug-en (reorder) from our model and the supervised counterpart.

ja-en:
reference: the too high rotation speed produces the reverse deformation
supervised: however , the deformation of <unk> and the deformation of <unk> is caused by the dc rate
ours: however , the deformation of <unk> is generated when the rotation rate is large

ug-en:
source: 8000 3 3 29 3 12 2
reference: a 3.3 magnitude earthquake with the depth of 8000 meters hit feb 12 at 3:29 urumqi time
supervised: 2 , on february 12 -12 of darkness , urumqi time hit a 3.3 earthquake , the earthquake hit .
ours: 2 -on february 12 -on , urumqi time 3:29minus 3.3 magnitude earthquake hit , the earthquake under depth of 8000 meters .

Results and Comparison
Baselines In Tab. 1, we compare our models with baselines including regular supervised training (sup) and back-translation (Sennrich et al., 2016) (back). 7 To demonstrate the effectiveness of the reordering, we also compare our method against a copy-based data-augmentation method (No-reorder) in which the original English sentences t ∈ Q, rather than the reordered ones t_s, are translated via the bilingual lexicon. 8 For each of the above settings, we also experimented with a phrase-based statistical machine translation (SMT) system (Dyer et al., 2010). In Tab. 1, we only show the SMT results with supervised data and back-translation, since we observed that the data augmentation method performs poorly with SMT (complete results are presented in the Appendix).
Main Results In Tab. 1, we observe consistent improvements on both ja-en and ug-en translation tasks over the other baseline methods. First, comparing our results with the NMT models trained using the same amount of parallel data, our word reordering-based semi-supervised models consistently outperform standard NMT models by a large margin. Even with no access to in-domain parallel data at all, our method can still achieve some success on ug-en translation. Second, comparing our Reorder method with the No-Reorder one, reordering English sentences into the source-language order consistently brings large performance gains, which demonstrates the importance of reordering. These results are notable given previous reports that explicit reordering is not beneficial for NMT (Du and Way, 2017). Third, for ja-en translation, as we gradually decrease the amount of parallel data, the improvements of our model over the supervised NMT models become more significant, demonstrating the effectiveness of our approach in low-resource settings. Fourth, back-translation is not very beneficial, and is even harmful in some settings, likely because a back-translation system trained on limited supervised data cannot provide translations of high enough quality to train the model. Finally, we also notice that although SMT performs better than NMT with less supervised training data (3k and 6k supervised pairs, and Uyghur), its performance improves less markedly than NMT's as the amount of supervised data increases. Moreover, even with less supervised data, our data augmentation method with reordering still outperforms SMT.

We give two examples of the translation outputs of our model and a supervised NMT model for ug-en and ja-en (trained with 6k supervised pairs) in Tab. 2. In the first example, from ja-en, our model is able to output terminology such as "rotation rate" thanks to the enlarged vocabulary, while the supervised model cannot. In the example from ug-en, our model produces a more fluent sentence with better information coverage.
Analysis To investigate the effects of reordering, we compare our method with the "No-Reorder" baseline described above. First, we bucket the test data by sentence length and compute the BLEU score for each bucket. The comparison results in Fig. 2 show that "Reorder" outperforms "No-Reorder" consistently across sentence-length buckets, and that the improvement is larger for longer sentences.
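The length-bucketed evaluation can be sketched as follows; `bleu_fn` is a placeholder for any corpus-level BLEU implementation (e.g. a wrapper around sacrebleu), and the bucket edges are illustrative.

```python
def bleu_by_length(hyps, refs, bleu_fn, edges=(10, 20, 30, 40)):
    """Bucket test sentences by reference length (in tokens) and score each
    bucket separately with a corpus-level metric `bleu_fn(hyps, refs)`."""
    buckets = {e: ([], []) for e in edges + (float("inf"),)}
    for h, r in zip(hyps, refs):
        # smallest edge that the reference length fits under
        edge = next(e for e in sorted(buckets) if len(r.split()) <= e)
        buckets[edge][0].append(h)
        buckets[edge][1].append(r)
    # skip empty buckets so the metric never sees an empty corpus
    return {e: bleu_fn(hs, rs) for e, (hs, rs) in buckets.items() if hs}
```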
Second, we also evaluate the model outputs on the test data with RIBES (Isozaki et al., 2010a), an automatic evaluation metric of translation quality designed for distant language pairs that is especially sensitive to word order. From Fig. 3, we can see that "Reorder" consistently outperforms "No-Reorder" on ja-en translation, especially as the amount of supervised data decreases. This suggests that with reordered pairs as the augmented training data, the model is able to output more syntactically correct sentences.

Conclusion
This paper proposed a simple yet effective semi-supervised learning framework for low-resource machine translation that artificially creates source-ordered target sentences for data augmentation. Experimental results on ja-en and ug-en translation show that our approach achieves significant improvements over baseline systems, demonstrating its effectiveness on syntactically divergent language pairs.