IITP-MT System for Gujarati-English News Translation Task at WMT 2019

We describe our submission to WMT 2019 News translation shared task for Gujarati-English language pair. We submit constrained systems, i.e, we rely on the data provided for this language pair and do not use any external data. We train Transformer based subword-level neural machine translation (NMT) system using original parallel corpus along with synthetic parallel corpus obtained through back-translation of monolingual data. Our primary systems achieve BLEU scores of 10.4 and 8.1 for Gujarati→English and English→Gujarati, respectively. We observe that incorporating monolingual data through back-translation improves the BLEU score significantly over baseline NMT and SMT systems for this language pair.


Introduction
In this paper, we describe the system that we submit to the WMT 2019 1 news translation shared task (Bojar et al., 2019).
We participate in Gujarati-English language pair and submit two systems: English→Gujarati and Gujarati→English. Gujarati language belongs to Indo-Aryan language family and is spoken predominantly in the Indian state of Gujarat. It is a low-resource language as only a few thousands parallel sentences are available, which are not enough to train a neural machine translation (NMT) system as well statistical machine translation (SMT) system. Gujarati-English is a distant language pair and they have different linguistic properties including syntax, morphology, word order etc. English follows subject-verb-object order while Gujarati follows subject-object-verb order. 1 http://www.statmt.org/wmt19/ translation-task.html NMT (Kalchbrenner and Blunsom, 2013;Cho et al., 2014;Sutskever et al., 2014;Bahdanau et al., 2015) has recently become dominant paradigm for machine translation (MT) achieving state-ofthe-art on standard benchmark data sets for many language pairs. As opposed to SMT, NMT systems are trained in an end-to-end manner. Training an effective NMT requires a huge amount of high-quality parallel corpus and in absence of that, an NMT system tends to perform poorly (Koehn and Knowles, 2017). However, back-translation (Sennrich et al., 2016) has been shown to improve NMT systems in such a situation. In this work, we train a SMT system and an NMT system for both English→Gujarati and Gujarati→English using the original training data. SMT systems are also used to generate synthetic parallel corpora through back-translation of monolingual data from English news crawl and Gujarati Wikipedia dumps. These corpora along with the original training corpora are used to improve the baseline NMT systems. All the SMT and NMT systems are trained at subword level.
Our SMT systems are standard phrase-based SMT systems (Koehn et al., 2003), and NMT systems are based on Transformer (Vaswani et al., 2017) architecture. Experiments show that NMT systems achieve BLEU (Papineni et al., 2002) scores of 10.4 and 8.1 for Gujarati→English and English→Gujarati, respectively, outperforming the baseline SMT systems even in the absence of enough-sized parallel data.
Rest of the paper is arranged in following manner: Section 2 gives brief introduction of the Transformer architecture that we used for NMT training, Section 3 describes the task, Section 4 describes the submitted systems, Section 5 gives various evaluation scores for English-Gujarati translation pair, and finally, Section 6 concludes the work.

408
2 Transformer Architecture Recurrent neural network based encoder-decoder NMT architecture (Cho et al., 2014;Sutskever et al., 2014;Bahdanau et al., 2015) deals with input/output sentences word-by-word sequentially, which prevents the model from parallel computation. Vaswani et al. (2017) came up with a highly parallelizable architecture called Transformer which uses the self-attention to better encode a sequences. Self-attention is used in the architecture to calculate attention between a word and the other words in the sentence itself. Encoder and decoder both are stack of 6 identical layers. Each layer in encoder has two sub-layers: i. multi-head self attention mechanism and ii. position wise feed forward network. Each sub-layer is associated with residual connections, followed by layer normalization. Multi-head attention computes the attention multiple times for each word. Since their is no sequence to sequence encoding, positional encoding is used to encode the sequence information.

Task Description
This task focuses on translating news domain corpus and this year, Gujarati language is introduced for the first time in a WMT shared task. Gujarati is a low-resource language and not many results have been reported in machine translation involving this language. Also, there was no standard test set for this language pair. So introduction of this language pair will help in further research for this language pair.
As Gujarati does not have enough parallel data, the data that are provided for this shared task are mainly from WikiTitles which consists of only 11,671 parallel titles. Apart from that, few publicly available domain specific parallel data that are provided are: Bible corpus (Christodouloupoulos and Steedman, 2015); a localization extracted from OPUS 2 ; parallel corpus extracted from Wikipedia; crawled corpus produced for this task; and monolingual Wikipedia dumps.

System Description
We participated in Gujarati-English pair only and we submit for both directions: English→Gujarati and Gujarati→English. As Gujarati is a lowresource language and only a little amount of parallel data is available, we explore the backtranslation technique for this pair. Also our models are based on Transformer as it has become state of the art for machine translation for many language pairs. We train systems at subword level. For back-translation, we train a phrase-based SMT (Koehn et al., 2003) system for each system in reverse direction. Using these SMT systems, monolingual sentences (for both Gujarati and English) are translated to create synthetic parallel data having original monolingual sentences at target and translated sentences at source side. These synthetic parallel data, along with the original parallel data are used to train a transformer based NMT system for each direction. The datasets that we use for training are shown in the Table 1, which combine to a total of 155,798 parallel sentences. These parallel data are compiled from different sources. The compiled datasets are Bible 3 , govin-clean.guen.tsv 4 , opus.gu-en.tsv 5 , wikipedia.gu-en.tsv 6 and wikititles-v1.gu-en.tsv 7 . We use newsdev2019 for tuning the model, which has 1,998 parallel sentences.  Apart from these parallel data, we use monolingual English (news crawl) and Gujarati (Wikipedia dumps) sentences for synthetic parallel data creation. After training two models i.e. English→Gujarati and Gujarati→English using the parallel data mentioned in Table 1, English and Gujarati monolingual sentences are back translated respectively.

Experimental Setup
We train phrase based statistical system (PB-SMT) (Koehn et al., 2003) as well as Transformer (Vaswani et al., 2017) based neural system for comparing their performance under low-resource conditions. In addition to that, PBSMT are used to genrate synthetic parallel data. PBSMT systems are trained only on original training data, while neural based models are trained on original training data (Transfomer in Table 2), and also with synthetic parallel data in addition to original data (Transfomer+Synth in Table 2). Synthetic parallel data are obtained through back-translation of a target monolingual corpus into source using PB-SMT system. We use Moses (Koehn et al., 2007) toolkit for PBSMT training and Sockeye (Hieber et al., 2017) toolkit for NMT training. Some preprocessing of data is required before using it for experiment. English data is tokenized using moses tokenizer, and truecased. For tokenizing Gujarati data, we use indic nlp library 8 . After tokeninzation and truecasing, we subword (Sennrich et al., 2015) all original data. We apply 10,000 BPE merge operations over English and Gujarati data independently.
For back-translation of monolingual data, two PBSMT models English→Gujarati and Gujarati→English are trained over original available parallel subworded corpora. 4-gram lan-8 https://github.com/anoopkunchukuttan/indic nlp library guage model is trained using KenLM (Heafield, 2011). For word alignment, we use GIZA++ (Och and Ney, 2003) with grow-diag-final-and heuristics. Model is tuned with Minimum Error Rate Training (Och, 2003). After these two models are trained, monolingual subworded data from both English and Gujarati are back-translated using English→Gujarati and Gujarati→English PB-SMT model, respectively. We merge the back translated data with original parallel data to have larger parallel corpora for Gujarati→English and English→Gujarati translation directions.
Finally, with the augmented parallel corpora, we train one Transformer based NMT model for each direction. We use the following hyperparameters values of Sockeye toolkit: 6 layers in both encoder and decoder, word embedding size of 512, hidden size of 512, maximum input length of 50 tokens, Adam optimizer, word batch size 1000, attention type is dot, learning rate of 0.0002. The rest of the hyper-parameters are set to the default values in Sockeye. We use early stopping criteria for terminating the training on the validation set of 1,998 parallel sentences.

Results
The official automatic evaluation uses the following metrics: BLEU (Papineni et al., 2002), TER (Snover et al., 2006), CharactTER (Wang et al., 2016). The official scores are shown in the Table 2. Phrase-base SMT (PBSMT) obtains BLEU scores of 5.2 and 7.3 for English→Gujarati and Gujarati→Englsih, respectively. Whereas, baseline NMT (Transformer) obtains lower BLEU scores of 4.0 and 5.5 for the same directions. Though, SMT systems outperforms baseline NMT systems trained using small amount of original parallel data only. We observe from the Table 2 that Transformer with synthetic (Transformer +   Table 8: Preliminary results of WMT19 News Translation Task. Systems ordered by DA score z-score cluster are considered tied; lines indicate clusters according to Wilcoxon rank-sum test p < 0.05; gray resources that fall outside the constraints provided. Table 3: Preliminary official results of WMT 2019 news translation task for Gujarati-English pair. Systems ordered by DA score z-score; systems within a cluster are considered tied; lines indicate clusters according to Wilcoxon rank-sum test p < 0.05; grayed entry indicates resources that fall outside the constraints provided. Synth) data obtained through back-translation of monolingual data, outperforms the baseline SMT systems with a margin of 2.9 and 3.1 BELU points. Also, as a result of augmenting backtranslated data with original training data, we obtain improvement of of 4.7 and 5.3 BLEU points over baseline NMT for English→Gujarati and Gujarati→English, respectively. The official preliminary human evaluation results are shown in the Table 3.

Conclusion
In this paper, we described our submission to the WMT 2019 News translation shared task for Gujarati-English language pair. This is the first time Gujarati language is introduced in a WMT shared task. We submit Transformer based NMT systems for English-Gujarati language pair. Since the number of parallel sentences in training set are very less and many sentences have length of only 2-3 tokens, BLEU scores for English-Gujarati pair using only available parallel corpus are very low (4.0 and 5.1 for English→Gujarati and Gujarati→English, respectively). So we use monolingual sentences for both languages to create synthetic parallel data through backtranslation, and merged them with original parallel data. We obtained improved BLEU scores of 8.1 and 10.4, respectively.