JU-Saarland Submission to the WMT2019 English–Gujarati Translation Shared Task

In this paper we describe our joint submission (JU-Saarland) from Jadavpur University and Saarland University to the WMT 2019 news translation shared task for the English–Gujarati language pair within the translation task sub-track. Our baseline and primary submissions are built using recurrent neural network (RNN) based neural machine translation (NMT) systems with an attention mechanism. Since the two languages belong to different language families and there is not enough parallel data for this pair, building a high-quality NMT system for it is a difficult task. We produced synthetic data through back-translation from available monolingual data. We report the translation quality of our English–Gujarati and Gujarati–English NMT systems trained at the word, byte-pair and character encoding levels, with the word-level RNN serving as the baseline for comparison. Our English–Gujarati system ranked second in the shared task.


Introduction
Neural machine translation (NMT) is an approach to machine translation (MT) that uses artificial neural networks to directly model the conditional probability p(y|x) of translating a source sentence (x_1, x_2, ..., x_n) into a target sentence (y_1, y_2, ..., y_m). NMT has consistently performed better than phrase-based statistical MT (PB-SMT) approaches and has provided state-of-the-art results in the last few years. However, one of the major constraints of supervised NMT is that it is not suitable for low-resource language pairs. Thus, to use supervised NMT, low-resource pairs need to resort to other techniques to increase the size of the parallel training dataset. In the WMT 2019 news translation shared task, one such resource-scarce language pair is English-Gujarati. Due to the insufficient volume of parallel corpora available to train an NMT system for this language pair, creating more genuine or synthetic parallel data for low-resource languages such as Gujarati is an important issue.

* These three authors have contributed equally.
In this paper, we describe the joint participation of Jadavpur University and Saarland University in the WMT 2019 news translation task for English-Gujarati and Gujarati-English. The released training data set comes from a completely different domain than the development set, and its size is nowhere close to the sizable amount of training data typically required for NMT systems to succeed. We use additional synthetic data produced through back-translation from the monolingual corpus. This provides significant improvements in translation performance for both our English-Gujarati and Gujarati-English NMT systems. Our English-Gujarati system was ranked second in terms of BLEU (Papineni et al., 2002) and TER (Snover et al., 2006) in the shared task.

Related Work
Dungarwal et al. (2014) developed statistical methods for machine translation, using a phrase-based method for their Hindi-English SMT system and a factored method for English-Hindi. They showed improvements over existing SMT systems using pre-processing and post-processing components that generated morphological inflections correctly. Imankulova et al. (2017) showed how back-translation and filtering of monolingual data can be used to build an effective translation system for a low-resource language pair like Japanese-Russian. Sennrich et al. (2016a) showed how back-translation of monolingual data can improve an NMT system. Ramesh and Sankaranarayanan (2018) demonstrated how an existing model such as a bidirectional recurrent neural network can be used to generate parallel sentences for low-resource, non-English language pairs like English-Tamil and English-Hindi to improve both SMT and NMT systems. Choudhary et al. (2018) showed how to build an NMT system for a low-resource language pair like English-Tamil using techniques such as word embeddings and Byte-Pair Encoding (Sennrich et al., 2016b) to handle out-of-vocabulary words.

Data Preparation
For our experiments we used both the parallel and the monolingual corpora released by the WMT 2019 organizers. We back-translate the monolingual corpus and use the result as additional synthetic parallel data to train our NMT system. Detailed corpus statistics are given in Table 1.
We performed our experiments on two datasets: one using the parallel corpus provided by WMT 2019 for the Gujarati-English news translation shared task, and the other combining that parallel corpus with sentences back-translated from the provided monolingual corpus (only the News crawl corpus was used for back-translation).
Since the released parallel corpus was very noisy and contained redundant sentences, we cleaned it; the procedure is described in Section 3.1.
In the next step we shuffle the whole corpus, which reduces variance and helps the model overfit less. We then split the dataset into three parts: training, validation and test sets. Shuffling matters for the split as well, since the validation and test sets must be drawn randomly from the same distribution as the available data. The test set was also shuffled, as this dataset was used for our internal assessment. After cleaning, we randomly selected 64,346 sentence pairs for training, 1,500 sentence pairs for validation and 1,500 sentence pairs as test data. Note that our validation and test corpora were taken from the released parallel data in order to set up a baseline model. Later, when the WMT19 organizers released the official development set, we continued training our models treating the WMT19 development set as our test set, and used as our new development set the 3,000 sentences obtained by combining the 1,500 validation and 1,500 test sentences described above (both drawn from the parallel corpus). While training our final model, the released development set was used. After cleaning, it was obvious that the amount of training data was not enough to train a neural system for such a low-resource language pair. Therefore, a large volume of additional parallel data is required, which can be produced either by manual translation by professional translators or by scraping parallel data from the internet. However, these processes are costly, tedious and sometimes inefficient (in the case of scraping from the internet).
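The shuffle-then-split step above can be sketched as follows. This is a minimal illustration, not our actual script; the fixed random seed and the function name are our own choices, while the split sizes follow the counts reported in the text.

```python
import random

def shuffle_and_split(pairs, n_valid=1500, n_test=1500, seed=42):
    """Shuffle parallel sentence pairs, then carve out validation and
    test sets so that all three splits come from the same distribution."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    valid = pairs[:n_valid]
    test = pairs[n_valid:n_valid + n_test]
    train = pairs[n_valid + n_test:]
    return train, valid, test

# Toy usage with dummy pairs:
corpus = [("en sentence %d" % i, "gu sentence %d" % i) for i in range(100)]
train, valid, test = shuffle_and_split(corpus, n_valid=10, n_test=10)
```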
As the released data was insufficient, we used back-translation to generate more training data. For back-translation we applied two methods: first, unsupervised statistical machine translation as described in (Artetxe et al., 2018), and second, the Doc translation API 1 (the API uses Google Translate as of April 2019). We explain the extraction of sentences and the corresponding results of these methods in Section 4.2. The synthetic dataset we generated can be found here. 2
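The core of back-translation, independent of which MT system produces the translations, can be sketched as below. The `translate_to_source` function is a hypothetical stand-in for whichever external system is used (unsupervised SMT or the Doc Translator API in our case); the point is that each monolingual target-side sentence is paired with its machine translation to form a synthetic (source, target) pair.

```python
def back_translate(monolingual_target, translate_to_source):
    """Create synthetic parallel data: pair each monolingual target-side
    sentence with its machine translation into the source language.
    The resulting pairs augment the genuine parallel training data."""
    synthetic = []
    for tgt_sentence in monolingual_target:
        src_sentence = translate_to_source(tgt_sentence)  # external MT call
        synthetic.append((src_sentence, tgt_sentence))
    return synthetic

# Illustration only: a stub "translator" standing in for a real MT system.
stub = lambda s: "MT: " + s
pairs = back_translate(["monolingual sentence"], stub)
```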

Data Preprocessing
To train an efficient machine translation system, the available raw parallel corpus must be cleaned so that the system produces consistent and reliable translations. The released version of the raw parallel corpus contained redundant pairs, which need to be removed to obtain better results, as demonstrated in previous work (Johnson et al., 2017). These redundancies are of the following types:
• The source is the same for different targets.
• The source is different for the same target.
• The sentence pair is repeated identically.
Redundancy in the translation pairs makes the model prone to overfitting and hence prevents it from recognizing new features. Thus, one sentence pair is kept while the other redundant pairs are removed. Some sentence pairs contained a mixture of both languages; these were also marked as redundant. Such pairs strictly need elimination, since otherwise the vocabularies of the individual languages contain alphanumeric characters of the other language, which results in inconsistent encoding and decoding during the encoder-decoder steps on the considered language pair. We tokenize the English side using the Moses (Koehn et al., 2007) tokenizer, and for Gujarati we use the Indic NLP library tokenization tool 3. Punctuation normalization was also applied.
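The three redundancy criteria above can be filtered in a single pass, keeping the first occurrence of each source and each target. This is a sketch of the idea, not our exact cleaning script; in particular, how ties are broken (which of the redundant pairs survives) is our assumption.

```python
def deduplicate(pairs):
    """Remove redundant pairs: repeated identical pairs, one source with
    several targets, and one target with several sources.  The first
    occurrence of each source and of each target is kept."""
    seen_src, seen_tgt = set(), set()
    cleaned = []
    for src, tgt in pairs:
        if src in seen_src or tgt in seen_tgt:
            continue  # redundant under one of the three criteria above
        seen_src.add(src)
        seen_tgt.add(tgt)
        cleaned.append((src, tgt))
    return cleaned
```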

Data Postprocessing
Postprocessing, namely detokenization (Klein et al., 2017) and punctuation normalization 4 (Koehn et al., 2007), was performed on our translated output (on the test set) to produce the final translations.

Experiment Setup
We explain our experimental setups in the next two sections: the first describes the setup used for our final submission, and the second describes all other supporting experimental setups. We use the OpenNMT toolkit (Klein et al., 2017) for our experiments. We performed several experiments in which the parallel corpus was fed to the model in space-separated character format, space-separated word format, or space-separated Byte-Pair Encoding (BPE) format (Sennrich et al., 2016b). For our final (i.e., primary) submission for the English-Gujarati task, the source (English) words were converted to BPE whereas the Gujarati words were kept as they were. For our Gujarati-English submission, both the source and the target were in plain word-level format.
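To make the BPE input format concrete, the following toy sketch shows the merge-learning step of byte-pair encoding: repeatedly count adjacent symbol pairs over a word-frequency vocabulary and merge the most frequent pair. This is an illustration of the algorithm only, not the subword-nmt implementation we actually used (which we ran with 32K merges on the English side).

```python
import collections
import re

def learn_bpe(vocab, num_merges):
    """Learn BPE merge operations from a word-frequency dict whose keys
    are words written as space-separated symbols (e.g. 'l o w')."""
    merges = []
    for _ in range(num_merges):
        pair_counts = collections.Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Merge the best pair wherever it occurs as whole symbols.
        pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(best)) + r"(?!\S)")
        vocab = {pattern.sub("".join(best), w): f for w, f in vocab.items()}
    return merges, vocab

merges, merged_vocab = learn_bpe({"l o w": 5, "l o w e r": 2}, 2)
```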

Primary System description
Our primary NMT systems are based on an attention-based uni-directional RNN (Cho et al., 2014) for Gujarati-English and a bi-directional RNN (Cheng et al., 2016) for English-Gujarati. The learning rate was initially set to 1.0. Table 2 shows the hyper-parameter configurations for our Gujarati-English translation system. We initially trained our model on the cleaned parallel corpus provided by WMT 2019 for up to 100K training steps. Thereafter, we fine-tuned this generic model on a domain-specific corpus (containing 219K sentences back-translated using the Doc Translator API), changing the learning rate to 0.5, with decay starting from 130K training steps at a decay factor of 0.5, keeping the other hyper-parameters as in Table 2. To build our English-Gujarati translation system, we likewise first trained a generic model, but with the different hyper-parameter configuration given in Table 3. Additionally, here we apply byte-pair encoding on the English side with 32K merge operations; we do not apply BPE to the Gujarati corpus, keeping the original word format for Gujarati. The generic model was trained for up to 100K training steps and then fine-tuned on a domain-specific parallel corpus with BPE on the English side and word-level format on the Gujarati side. During fine-tuning, we reduced the learning rate from 1.0 to 0.25, with decay starting from 120K training steps at a decay factor of 0.5; the other hyper-parameter configurations remained unchanged. The hyper-parameters used for the English-Gujarati task in our primary submission were also tested in the reverse direction; however, this did not perform as well as the primary system, and the final system was chosen accordingly.
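The fine-tuning learning-rate schedule for the Gujarati-English system (base rate 0.5, decay from 130K steps, factor 0.5) can be sketched as a stepwise decay. The decay interval `decay_every` is our assumption for illustration; the text specifies only the start step and the factor.

```python
def learning_rate(step, base_lr=0.5, decay_start=130_000,
                  decay_factor=0.5, decay_every=10_000):
    """Stepwise learning-rate decay: hold base_lr until decay_start,
    then multiply by decay_factor every decay_every training steps.
    decay_every is an assumed value for illustration."""
    if step < decay_start:
        return base_lr
    n_decays = (step - decay_start) // decay_every + 1
    return base_lr * decay_factor ** n_decays
```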

Other Supporting Experiments
In this section we describe all the supporting experiments that we performed for this shared task starting from Statistical MT to NMT with both supervised and unsupervised settings.
All the results and experiments discussed below were evaluated on the released development set (treated as the test set). These models were not evaluated on the released test set because they yielded poor BLEU scores on the development set.
We used a uni-directional RNN with LSTM units trained on 64,346 pre-processed sentence pairs (cf. Section 3) for 120K training steps with a learning rate of 1.0. For English-Gujarati, with space-separated words on both sides, we achieved the highest BLEU score of 4.15 after fine-tuning on 10K sentences selected from the cleaned parallel corpus whose total number of tokens (words) exceeded 8. The BLEU score dropped to 3.56 when applying BPE on both sides. For the other direction (Gujarati-English), we obtained highest BLEU scores of 5.13 and 5.09 at the word level and BPE level, respectively.
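The length-based selection of fine-tuning sentences can be sketched as below. We read "total number of tokens" as source plus target under whitespace tokenization; that reading, and taking the first qualifying pairs up to the 10K limit, are our assumptions for this illustration.

```python
def select_long_pairs(pairs, min_tokens=9, limit=10_000):
    """Select up to `limit` sentence pairs whose combined token count
    (source plus target, whitespace-tokenized) exceeds 8."""
    selected = []
    for src, tgt in pairs:
        if len(src.split()) + len(tgt.split()) >= min_tokens:
            selected.append((src, tgt))
        if len(selected) == limit:
            break
    return selected
```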
We also tried a transformer-based NMT model (Vaswani et al., 2017), which however gave extremely poor results under similar experimental settings. The highest BLEU we achieved was 0.74 for Gujarati-English and 0.96 for English-Gujarati. The transformer model was trained for 100K training steps with a batch size of 64 on a single GPU, and the positional encoding layer size was set to 2.
Since the training data size was insufficient, we used back-translation to generate additional synthetic sentence pairs from the monolingual corpus released in WMT 2019. We initially used Monoses (Artetxe et al., 2018), which is based on unsupervised statistical phrase-based machine translation, to translate the monolingual sentences from English to Gujarati. We used 2M English sentences to train the Monoses system. The training process took around 6 days on our modest 64 GB server. However, the results were extremely poor, with a BLEU score of 0.24 for English-Gujarati and 0.01 for the opposite direction, without using the preprocessed parallel corpus. Moreover, after adding the preprocessed parallel corpus, the BLEU score dropped significantly. This motivated us to use an online document translator, in our case the Google translation API, for back-translating sentences from the released monolingual dataset. The back-translated data was later combined with our preprocessed parallel corpus for our final model. Additionally, we also tried a simple uni-directional RNN model at the character level; however, this also failed to improve performance. All the results are compiled in Table 4.

Primary System Results
Our primary submission for English-Gujarati, using a bidirectional RNN model with BPE on the English side (see Section 4.1) and word format on the Gujarati side, gave the best result. For Gujarati-English, the primary submission based on a uni-directional RNN model with both English and Gujarati in word format gave the best result. Before submission, we performed punctuation normalization, Unicode normalization, and detokenization for each run. Table 5 shows the published results of our primary submissions on the WMT 2019 test set. Table 6 shows our own experimental results on the development set.

Conclusion and Future Work
In this paper, we applied NMT to one of the most challenging language pairs, English-Gujarati, for which the availability of parallel corpora is extremely scarce. In this scenario, collecting and preprocessing data plays a crucial role both in increasing the dataset and in obtaining quality results with NMT. We show how increasing the parallel data through back-translation via the Google translation API can increase overall performance. Our primary result also exceeded Google Translate (which gave a BLEU of 13.7) by a margin of around 8.0 absolute BLEU points. Our method is not limited to the English-Gujarati translation task; it can also be useful for various resource-scarce language pairs and domains. We did not make use of any ensembling mechanism in this task, which might have yielded higher BLEU scores. Therefore, in future work we will try to ensemble several models and to generate more useful back-translated data using existing state-of-the-art models. We would also like to explore cross-lingual BERT (Devlin et al., 2018) to enhance performance.