The University of Sydney’s Machine Translation System for WMT19

This paper describes the University of Sydney’s submission to the WMT 2019 shared news translation task. We participated in the Finnish→English direction and obtained the best BLEU score (33.0) among all participants. Our system is based on the self-attentional Transformer network, into which we integrated effective recent strategies from academic research (e.g., BPE, back translation, multi-feature data selection, data augmentation, greedy model ensembling, reranking, ConMBR system combination, and post-processing). Furthermore, we propose a novel augmentation method, Cycle Translation, and a data mixture strategy, Big/Small parallel construction, to fully exploit the synthetic corpus. Extensive experiments show that adding these techniques yields continuous improvements in BLEU, and the best result outperforms the baseline (a Transformer ensemble model trained on the original parallel corpus) by approximately 5.3 BLEU points, achieving state-of-the-art performance.


Introduction
Neural machine translation (NMT), as a succinct end-to-end paradigm, has produced a massive leap in state-of-the-art performance for many language pairs (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Bahdanau et al., 2015; Gehring et al., 2017; Wu et al., 2016; Vaswani et al., 2017). Among these encoder-decoder networks, the Transformer (Vaswani et al., 2017), which relies solely on attention mechanisms and eschews recurrent and convolutional networks, delivers state-of-the-art translation quality and fast convergence (Ahmed et al., 2017). Although many Transformer-based variants have been proposed (e.g., DynamicConv (Wu et al., 2019), Sparse Transformer (Child et al., 2019)), our preliminary experiments show that their performance is unstable compared to the traditional Transformer. We therefore employed the traditional Transformer as our baseline system. In this paper, we summarize the USYD NMT systems for the WMT 2019 Finnish→English (FI→EN) translation task.
Table 1: Cycle translated sample sentence pair.
1 She stuck to her principles even when some suggest that in an environment often considered devoid of such thing there are little point.
2 She insists on her own principles, even if some people think that it doesn't make sense in an environment that is often considered to be absent.
Owing to limited time and computational resources, we participated only in the challenging FI→EN task, which lags behind other language pairs in translation performance (Bojar et al., 2018). We introduce our system in three parts.
First, at the data level, we find that the quality of both the parallel and the monolingual data is uneven (i.e., both contain a large number of low-quality sentences). We therefore apply several features, such as language model and alignment scores, to select data after pre-processing. Meanwhile, to fully utilize the monolingual corpus, we not only adopt back translation (Sennrich et al., 2015) to translate the high-quality monolingual sentences with a target-to-source (T2S) model, but also propose Cycle Translation to improve the low-quality sentences, which in turn yields correspondingly high-quality back-translation results. Note that unlike the text style transfer task (Shen et al., 2017; Fu et al., 2018; Prabhumoye et al., 2018), which transfers text to a specific style (e.g., political slant, gender), we aim to improve the fluency of sentences. For instance, through cycle translation, the low-quality sentence in Table 1 becomes more fluent in terms of language model score. The top diagram of Figure 1 depicts the data preparation process.
For model training (the middle part of Figure 1), we empirically introduce a Big/Small parallel construction strategy to build training data for different models. The intuition is that all the data are useful and can be fully exploited by different models. We thus train 8 Transformer base models (M small × 8) on different small-scale corpora built with the small parallel construction method, and one Transformer big model (M big × 1) on the corpus built with the big parallel construction method. In addition, a right-to-left model (M r2l) is trained.
In addition, in the inference phase, we comprehensively consider ensemble strategies at the model level, sentence level and word level. For the model-level ensemble, while brute-force ensembling of the top-N or last-M models may improve translation performance, it is difficult to obtain the optimal result this way. Hence we employ Greedy Model Selection based Ensembling (GMSE) (Partalas et al., 2008; Deng et al., 2018). For the sentence-level ensemble, we keep the top n-best hypotheses for multi-feature reranking. At the word level, we adopt confusion network decoding (Bangalore et al., 2001; Matusov et al., 2006; Sim et al., 2007) using the consensus network minimum Bayes risk (ConMBR) criterion (Sim et al., 2007). After combination, a post-processing algorithm is employed to correct inconsistent numbers and years between the source and target sentences. The bottom part of Figure 1 shows the inference process.
Our combined system achieved the best BLEU (Papineni et al., 2002) score among the submitted systems, demonstrating the effectiveness of the proposed approach. In principle, our approach is not specific to the Finnish→English language pair and should be applicable to other language pairs. The remainder of this article is organized as follows: Section 2 describes each component of the system. Section 3 introduces the data preparation details. The experimental results are shown in Section 4. Finally, we conclude in Section 5.

Neural Machine Translation Models
Given a source sentence $X = x_1, \dots, x_{T'}$, an NMT model factors the distribution over the target sentence $Y = y_1, \dots, y_T$ into a product of conditional probabilities:
$$p(Y \mid X; \theta) = \prod_{t=1}^{T} p(y_t \mid y_{<t}, X; \theta)$$
where the conditional probabilities are parameterized by neural networks. The NMT model consists of two units: an encoder and a decoder. The encoder is assumed to adequately represent the source sentence, and the decoder then recursively predicts each target word. The parameters of the encoder, decoder and attention mechanism are trained to maximize the likelihood with a cross-entropy loss:
$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p(y_t \mid y_{<t}, X; \theta)$$
Concretely, a self-attentional encoder-decoder architecture (Vaswani et al., 2017) was selected to capture the causal structure. To train on corpora of different sizes, we employ the Transformer base (M base) and Transformer big (M big) configurations; see Table 2.

Data Selection Features
Inspired by prior work showing that data selection can yield substantial gains, we deliberately design selection criteria for the parallel and monolingual corpora. Both employ rule-based features, count features and language model features. For the parallel data, word alignment-based features and T2S translation model score features are also applied. The feature types are described in Table 3. The BERT language model used here is trained from scratch on target-side data with an open-source tool. According to our observations, these multiple data selection filters significantly reduce issues such as misalignment, translation errors, illegal characters, and over-translation and under-translation in terms of length.
Table 3: Feature types used for data selection.
NMT features: T2S score (Sennrich et al., 2016)
LM features: BERT LM (Devlin et al., 2018), Transformer LM, N-gram LM (Stolcke, 2002)
Alignment features: IBM model 2 (Dyer et al., 2013)
Rule-based features: Illegal characters
Count features: Word count, word count ratio

Cycle Translation for Low-quality Data
Although the data selection procedure preserves relatively high-quality monolingual data, a large portion of the data is still incomplete or grammatically incorrect. To address this problem, we propose Cycle Translation (denoted as CT(·); see Figure 2) to improve the monolingual data that falls below the quality threshold. According to our empirical ablation study in Section 4, the lower-scoring 50% is cycle translated in our submitted system.
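The cycle translation step can be sketched as follows. This is a minimal illustration rather than the submitted implementation: `lm_score`, `t2s` and `s2t` are hypothetical callables standing in for the language model scorer and the two translation models, and ranking by a single LM score is an assumption.

```python
from typing import Callable, List


def cycle_translate(
    sentences: List[str],
    lm_score: Callable[[str], float],  # hypothetical: higher means more fluent
    t2s: Callable[[str], str],         # hypothetical target-to-source translator
    s2t: Callable[[str], str],         # hypothetical source-to-target translator
    low_fraction: float = 0.5,         # share of the corpus to cycle translate
) -> List[str]:
    """Replace the lowest-scoring fraction of sentences x with CT(x) = S2T(T2S(x))."""
    # Rank sentence indices from least to most fluent.
    ranked = sorted(range(len(sentences)), key=lambda i: lm_score(sentences[i]))
    low_ids = set(ranked[: int(len(sentences) * low_fraction)])
    # Keep high-quality sentences untouched; cycle translate the rest.
    return [s2t(t2s(s)) if i in low_ids else s for i, s in enumerate(sentences)]
```

With `low_fraction=0.5` this mirrors the 50% threshold reported for the submitted system; the other thresholds in the ablation correspond to 0.25 and 0.75.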

Back Translation for monolingual corpus
Back-translation (Sennrich et al., 2015; Bojar et al., 2018) translates a large-scale monolingual corpus with a pretrained target-to-source model to generate synthetic parallel data. It has been widely used to improve translation quality, since adding the synthetic data to the parallel data enhances the in-domain information over the original corpus distribution, making the translation model more robust and deterministic.

Greedy Model Selection Based Ensemble
Figure 2: The Cycle Translation process, into which we feed the low-quality monolingual data x and obtain the improved data CT(x) (denoted as S2T(T2S(x)) in the figure). The models marked in red and green are the T2S and S2T models, trained with the M small configuration on the processed parallel corpus; the red arrows indicate data flows in the opposite language to the input. The dotted double-headed arrow between the input x and the final output CT(x) means that they share the same semantics but differ in fluency.
Model ensembling is a typical technique for boosting performance: it combines multiple models to reduce stochastic differences in the output that may not be avoided in a single run. An ensemble model also normally outperforms the best single model.
In neural machine translation, we generally ensemble several checkpoints saved during the training of a single model. However, our preliminary experiments show that both top-N and last-M ensembling bring only marginal improvements while consuming substantial GPU resources.
To overcome this issue, we adopt greedy model selection based ensembling (GMSE), which follows the procedure of Deng et al. (2018).
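As a rough illustration of greedy selection, the sketch below implements generic forward selection over candidate checkpoints. It is not taken from Deng et al. (2018): the `evaluate` callable (for example, validation BLEU of an ensemble), the stopping rule and the `max_models` cap are all assumptions.

```python
from typing import Callable, List, Sequence


def greedy_model_selection(
    candidates: Sequence[str],
    evaluate: Callable[[List[str]], float],  # hypothetical: validation score of an ensemble
    max_models: int = 8,                     # assumed cap on ensemble size
) -> List[str]:
    """Grow the ensemble one model at a time while the validation score improves."""
    selected: List[str] = []
    remaining = list(candidates)
    best_score = float("-inf")
    while remaining and len(selected) < max_models:
        # Score every one-model extension of the current ensemble.
        scored = [(evaluate(selected + [c]), c) for c in remaining]
        score, choice = max(scored, key=lambda pair: pair[0])
        if score <= best_score:
            break  # no candidate improves the ensemble any further
        selected.append(choice)
        remaining.remove(choice)
        best_score = score
    return selected
```

Compared with ensembling a fixed top-N or last-M set, this evaluates only the extensions that are actually considered, at the cost of one validation pass per candidate per round.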

Reranking n-best Hypotheses
Because NMT decoding generally proceeds from left to right, it suffers from the label bias problem (Lafferty et al., 2001). To alleviate this problem, we rerank the n-best hypotheses by training a k-best batch MIRA ranker (Cherry and Foster, 2012) with multiple features on the validation set. The feature pool includes the left-to-right (L2R) translation model, the right-to-left (R2L) translation model, the target-to-source (T2S) translation model, a language model, the IBM model 2 alignment score, and the word count ratio. After multi-feature reranking, the best hypothesis of each model (M big × 1, M small × 8 and the R2L model) is retained for system combination.
Figure 3: The System Combination process. We feed each system/model with the source sentence x and obtain the corresponding 1-best results M big(x), M small1(x), ..., M small8(x), M R2L(x) (note that the 1-best result of each system has already been reranked). After pooling all system results, we perform ConMBR system combination decoding to obtain the final target-side results.

Left-to-right NMT model
The L2R feature refers to the original translation model that generates the n-best list. During reranker training, we keep the original perplexity score assigned by this L2R model as the L2R feature.

Right-to-Left NMT Model
The R2L NMT model is trained on the same training data but with inverted target sentences (i.e., the target side is reversed, "a b c d"→"d c b a"). We then invert each hypothesis in the n-best list so that it can be given a perplexity score by the R2L model.
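The inversion used for R2L training and scoring can be illustrated with a short sketch. It assumes that reversal operates on whitespace-separated tokens, as the "a b c d"→"d c b a" example suggests; the function names are illustrative only.

```python
from typing import List


def reverse_tokens(sentence: str) -> str:
    """Reverse the order of whitespace-separated tokens in a sentence."""
    return " ".join(reversed(sentence.split()))


def prepare_for_r2l_scoring(nbest: List[str]) -> List[str]:
    """Invert every hypothesis in an n-best list before scoring it with the R2L model."""
    return [reverse_tokens(hyp) for hyp in nbest]
```

Applying `reverse_tokens` twice returns the original token sequence, so the same helper can restore R2L outputs to normal reading order.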

Target-to-Source NMT Model
The T2S model was initially trained for back-translation. We can also employ it to assess translation adequacy by adding the T2S score to the reranking feature pool.

Language Model
Besides the above features, we employ language models as an auxiliary feature that gives fluent sentences better scores, so that the results are easier for humans to understand.

Word Count Ratio
To alleviate over-translation or under-translation in terms of length, we set an optimal ratio between source and target word counts (cf. the best ratio L_fi : L_en = 0.76 used for data selection). We use the deviation between the ratio of each sentence pair and this optimal ratio as the score.
Table 4: Example of post-processing for an inconsistent number translation.
src: Siltalan edellinen kausi liigassa oli 2006-07
pred: Siltala's previous season in the league was 2006 at 07
+post: Siltala's previous season in the league was 2006-07
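The word count ratio feature described in this subsection can be sketched as below. The default of 0.76 is the L_fi : L_en ratio quoted for data selection; using it here, counting whitespace tokens, and negating the deviation so that higher is better are assumptions.

```python
def word_count_ratio_score(src: str, hyp: str, optimal_ratio: float = 0.76) -> float:
    """Negative absolute deviation of the source/target length ratio from the optimum."""
    src_len = len(src.split())
    hyp_len = len(hyp.split())
    if hyp_len == 0:
        return float("-inf")  # an empty hypothesis is maximally penalized
    return -abs(src_len / hyp_len - optimal_ratio)
```

A hypothesis whose length ratio matches the optimum scores 0, and the score decreases as the hypothesis becomes too long or too short relative to the source.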

System Combination
As shown in Figure 3, in order to take full advantage of the different models (M big × 1, M small × 8 and the R2L model), we adopt word-level combination, in which a confusion network is built. Concretely, our method follows Consensus Network Minimum Bayes Risk (ConMBR) (Sim et al., 2007), which can be modeled as
$$E_{ConMBR} = \operatorname*{argmin}_{E'} L(E', E_{con})$$
where $L$ is the loss between two hypotheses and $E_{con}$ is the backbone obtained by performing consensus network decoding.
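The final selection step can be illustrated with a small sketch, under the assumption that ConMBR picks the pooled hypothesis with the lowest loss against the consensus backbone. Word-level edit distance is used here as a stand-in loss, and building the confusion network itself is out of scope; the backbone is simply passed in.

```python
from typing import List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Word-level Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def conmbr_select(hypotheses: List[str], backbone: str) -> str:
    """Return the hypothesis closest to the consensus backbone (assumed loss: edit distance)."""
    ref = backbone.split()
    return min(hypotheses, key=lambda h: edit_distance(h.split(), ref))
```

In practice the loss would more likely be a translation metric such as TER or sentence-level BLEU; edit distance keeps the example self-contained.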

Post-processing
In addition to general post-processing steps (i.e., de-BPE, detokenization and detruecasing), we also employ a post-processing algorithm for inconsistent number and date translations. For example, "2006-07" might be segmented as "2006 -@@ 07" by BPE, resulting in the wrong translation "2006 at 07". Our post-processing algorithm searches the source sentence for the best matching number string to replace these types of errors; see Table 4.
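One possible form of this number repair is sketched below. It is an illustration under assumptions, not the submitted algorithm: it only handles digit groups joined by `-`, `.`, `/` or `:` in the source, and it treats "the same digit groups in order, with at most one intervening token" as the match criterion.

```python
import re

# Number strings made of digit groups joined by a separator, e.g. "2006-07".
NUMBER = re.compile(r"\d+(?:[-./:]\d+)+")


def fix_numbers(src: str, hyp: str) -> str:
    """Restore compound numbers from the source that were split apart in the hypothesis."""
    for match in NUMBER.finditer(src):
        number = match.group(0)
        if number in hyp:
            continue  # already translated consistently
        parts = re.split(r"[-./:]", number)
        # Same digit groups in order, separated by whitespace and at most one extra token.
        pattern = r"\b" + r"\s+(?:\S+\s+)?".join(re.escape(p) for p in parts) + r"\b"
        hyp = re.sub(pattern, lambda _m, n=number: n, hyp, count=1)
    return hyp
```

On the example from Table 4, the source string "2006-07" is used to repair the prediction "2006 at 07".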

Data Preparation
We used all available parallel corpora for Finnish→English except "Wiki Headlines", which contains a large number of incomplete sentences. For the monolingual target-side English data, we selected all corpora except "Common Crawl" and "News Discussions". These choices are inspired by Marie et al. (2018), who won first place in this direction at WMT18. Table 5 shows the final corpus statistics. More details are as follows.
Parallel Data: We apply the criteria from Section 2.2. The overall criteria are as follows:
• Remove duplicate sentence pairs.
• Remove sentence pairs containing illegal characters.
• Retain sentence pairs between 3 and 80 in length.
• Remove sentence pairs that are too far from the best ratio (L_fi : L_en = 0.76).
• Remove pairs containing disfluent English sentences according to a series of LM features.
• Remove inadequate translation pairs according to the M T2S score.
• Remove sentence pairs with poor alignment quality according to IBM model 2.
After data selection, there are approximately 5.8M parallel sentences.
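The rule-based and count-based filters in the list above can be sketched as a single pass over the corpus. The LM, T2S and alignment filters are omitted because they need trained models. The definition of "illegal characters" (control characters and the Unicode replacement character), measuring length in whitespace tokens, and the tolerance `max_ratio_dev` are assumptions introduced for illustration.

```python
import unicodedata
from typing import Iterable, Iterator, Tuple

BEST_RATIO = 0.76  # L_fi : L_en, as stated in the text


def has_illegal_chars(text: str) -> bool:
    """Assumed definition: control characters or the Unicode replacement character."""
    return any(ch == "\ufffd" or unicodedata.category(ch) == "Cc" for ch in text)


def filter_pairs(
    pairs: Iterable[Tuple[str, str]],
    min_len: int = 3,
    max_len: int = 80,
    max_ratio_dev: float = 0.5,  # hypothetical tolerance around BEST_RATIO
) -> Iterator[Tuple[str, str]]:
    """Yield (fi, en) pairs that pass the dedup, character, length and ratio checks."""
    seen = set()
    for fi, en in pairs:
        if (fi, en) in seen:
            continue  # duplicate pair
        seen.add((fi, en))
        if has_illegal_chars(fi) or has_illegal_chars(en):
            continue
        n_fi, n_en = len(fi.split()), len(en.split())
        if not (min_len <= n_fi <= max_len and min_len <= n_en <= max_len):
            continue
        if abs(n_fi / n_en - BEST_RATIO) > max_ratio_dev:
            continue
        yield fi, en
```

Running the cheap filters first keeps the more expensive model-based scores (LM, T2S, IBM model 2) for the pairs that survive.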
Monolingual Data: For our Finnish→English system, back translation was performed on monolingual English data. Before back-translation, we filter the data according to the criteria in Section 2.2 and, at the same time, obtain a score for each sentence. After monolingual selection, 82M sentences remain, which is still a very large corpus. We cycle translate the lowest-scoring 25%, 50% and 75% of it in terms of LM score to empirically identify the optimal threshold and improve the fluency of the monolingual corpus. In doing so, the whole monolingual corpus is kept at relatively high quality.
Synthetic Parallel Data: The synthetic parallel data also needs to be filtered by alignment score and word count ratio to remove poor translations. After this further filtering, 75M synthetic sentence pairs are retained.
On the other hand, previous work has shown that the maximum gain is obtained by mixing the sampled synthetic corpus and the original corpus in a ratio of 1:1 (Sennrich et al., 2015, 2016). The synthetic corpus is generally larger than the parallel corpus, so partial sampling is required to satisfy the 1:1 ratio. However, such sampling wastes an enormous amount of synthetic data. To address this issue, we argue that a better construction strategy can make full use of the synthetic corpus and subsequently lead to better translation quality.
Small Parallel Construction: We randomly sample approximately 5.8M sentence pairs from the shuffled synthetic data 8 times and mix each sample with the parallel data.
Big Parallel Construction: The aim of the big construction is to fully utilize the synthetic data. To achieve this, we repeat the parallel corpus 13 times and then mix it with all of the synthetic corpus.
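The two construction strategies can be sketched as follows. This is a minimal illustration: sampling each small set independently (so different sets may overlap) and deriving the repeat count by rounding the size ratio are assumptions, although rounding 75M / 5.8M does give the 13 repetitions stated above.

```python
import random
from typing import List, Tuple

Pair = Tuple[str, str]


def small_construction(
    parallel: List[Pair], synthetic: List[Pair], n_sets: int = 8, seed: int = 0
) -> List[List[Pair]]:
    """Build n_sets corpora, each mixing the parallel data 1:1 with a synthetic sample."""
    rng = random.Random(seed)
    k = min(len(parallel), len(synthetic))
    return [parallel + rng.sample(synthetic, k) for _ in range(n_sets)]


def big_construction(parallel: List[Pair], synthetic: List[Pair]) -> Tuple[List[Pair], int]:
    """Upsample the parallel data to roughly match all synthetic data, then mix them."""
    repeats = max(1, round(len(synthetic) / len(parallel)))
    return parallel * repeats + synthetic, repeats
```

Each small corpus would train one M small model, while the single big corpus would train the M big model, so that no synthetic data is discarded.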

Experiments
The metric we employ is the detokenized case-sensitive BLEU score. news-test2018 is used as the validation set, and the officially released news-test2019 is the test set. The training, validation and test sets are processed consistently. Both Finnish and English sentences are tokenized and truecased with Moses scripts (Koehn et al., 2007). To limit the vocabulary size of the NMT models, we adopt byte pair encoding (BPE) (Sennrich et al., 2016) with 50k operations for each side. All the models we trained are optimized with Adam (Kingma and Ba, 2014). A larger beam size may worsen translation quality (Koehn and Knowles, 2017), so we set the beam size to 10 for each model. All models were trained on 4 NVIDIA V100 GPUs.
To find the optimal threshold for the cycle translation procedure, we first report experimental results on the validation set with different thresholds: 0%, 25%, 50% and 75%. Intuitively, the quality improvement of monolingual sentences afforded by cycle translation should yield better synthetic parallel data and, subsequently, a more accurate translation model. This ablation experiment therefore trains a Transformer base model on the synthetic parallel corpus only, with different cycle translation ratios. As shown in Table 7, the model achieves its best performance when the cycle translation threshold is 50%. We therefore set the cycle translation ratio to 50% in the following main experiment.
Our main experiment is shown in Table 6. Our baseline system is developed with the M small configuration using the original parallel corpus and the last-20 ensemble strategy. Unsurprisingly, the baseline system performs the worst in Table 6. The M small configuration trained on the selected parallel data improves BLEU by +0.7 points. According to exp. 3-6, adding these components leads to continuous improvements. Notably, with Cycle Translation and the Big/Small parallel construction strategy, our system obtains a significant improvement of +3.55 BLEU. Exp. 8-11 show that with GMSE, multi-feature reranking, ConMBR system combination and post-processing, our system further improves the BLEU score from 30.9 to 33.0 on the official news-test2019 data set, which substantially outperforms the baseline by 5.3 BLEU points.

Conclusion and Future Work
This paper presents the University of Sydney's NMT systems for the WMT 2019 Finnish→English news translation task.
We leveraged multi-dimensional strategies to improve translation quality at three levels: 1) At the data level, in addition to using various data selection criteria, we proposed cycle translation to improve monolingual sentence fluency. 2) For model training, we trained multiple models with the R2L corpus and the big/small parallel construction corpora, respectively. 3) For inference, we demonstrated the effectiveness of multi-feature rescoring, ConMBR system combination and post-processing. We find that cycle translation and the Big/Small construction approach bring the most significant improvements for our system.
In future work, we will apply the beam+noise method (Edunov et al., 2018) to generate robust synthetic data during back translation; we expect that this method, combined with our proposed cycle translation strategy, can bring greater improvement. We would also like to investigate hyperparameter optimization for neural machine translation to avoid empirical settings.