Checkpoint Reranking: An Approach to Select Better Hypothesis for Neural Machine Translation Systems

In this paper, we propose a method for re-ranking the outputs of Neural Machine Translation (NMT) systems. After training an NMT baseline system, we select the outputs of the last few training iterations as the N-best list. We observe that these iteration outputs have an oracle score up to 1.01 BLEU points higher than the last iteration of the trained system. We propose a ranking mechanism that relies solely on the decoder's ability to generate distinct tokens, without the use of any language model or additional data. With this method, we achieve a translation improvement of up to +0.16 BLEU points over the baseline. We also evaluate our approach by applying a coverage penalty during training. With a moderate coverage penalty, the oracle scores exceed the final iteration by up to +0.99 BLEU points, and our algorithm yields an improvement of up to +0.17 BLEU points. With an excessive penalty, translation quality decreases compared to the baseline system; still, oracle scores increase by up to +1.30 BLEU points, and the re-ranking algorithm yields an improvement of up to +0.15 BLEU points. The proposed re-ranking method is generic and can be extended to other language pairs as well.


Introduction
Neural Machine Translation (NMT) has brought excellent results in the field of Machine Translation Sutskever et al. (2014) due to the generation of high-quality translations for different language pairs. Yet even higher quality can be achieved by combining multiple models through techniques like ensembles Hansen and Salamon (1990) and reranking Shen et al. (2004). Our work deals with how NMT can achieve better results, specifically with reranking methods.
Neural Machine Translation has an encoder-decoder architecture that is jointly trained to maximize the probability of the target given the source sentence. The encoder first encodes the source sentence into a single vector, from which the decoder predicts the target sentence. The Attention Mechanism applies weights over the input sentence at each decoding time step. Recent approaches like the Transformer model Vaswani et al. (2017) have achieved state-of-the-art results for Machine Translation.
Neural Machine Translation (NMT), however, tends to ignore past alignment information, which leads to over-translation and under-translation; this is effectively tackled by introducing a coverage vector Tu et al. (2016). Other approaches such as Mi et al. (2016a) and Mi et al. (2016b) also address the coverage problem in NMT. Without the coverage vector, translation quality can decrease.
We propose a method that selects a better hypothesis by giving high importance to the distinct words generated by the decoder, without the use of any language model or data. After applying the proposed reranking method, an overall improvement in translation quality is observed compared to the baseline system. The rest of the paper is organized as follows: Section 2 discusses work related to re-utilizing existing models for Machine Translation. Section 3 describes our approach to Checkpoint-based Reranking. Section 4 presents our Reranking Algorithm. Section 5 demonstrates our experiments along with the results obtained, and Section 6 concludes the paper with future directions.

Related Work
The work of Imamura and Sumita (2017) explains the concepts of reranking and ensembling in detail. It introduces a bidirectional reranking method that combines hypotheses from l2r and r2l decoding, following earlier work that proposes an agreement model to solve the unbalanced outputs of recurrent neural networks. Marie and Fujita (2018) introduced a reranking system that uses a smorgasbord of informative features in tasks where PBSMT and NMT produce translations of different quality.
The work by Shen et al. (2004) shows how to apply perceptron-like reranking algorithms to improve overall translation quality, and Olteanu et al. (2006) show the use of Language Models (LMs) for reranking hypotheses generated by phrase-based Statistical Machine Translation systems. Wang et al. (2007) presented linguistically motivated and computationally efficient structured language models for reranking in SMT systems.
The concept of checkpoint ensembles was introduced by Sennrich et al. (2016) and later extended to independent ensembling Sennrich et al. (2017). Vaswani et al. (2017) included a checkpoint averaging method for their model. Liu et al. (2018) focused on decoding techniques that utilize existing models at the parameter, word, and sentence level, corresponding to checkpoint averaging, model ensembling, and candidate reranking, and found that all of these improve translation quality without retraining the model.

Checkpoint Based Reranking
In our approach, the iteration outputs are selected as the N-best list: for the last K iterations, we have a corresponding K-best list for each sentence. We take as the oracle the hypothesis from this K-best list with the largest BLEU score Papineni et al. (2002) against the test reference. After obtaining the oracle scores from this K-best list, we observe that they are higher than the baseline system's, indicating that there is scope for further improvement in translation quality. We therefore propose a reranking method that improves translation quality over the baseline system without any language model or data.
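The oracle selection described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `unigram_precision` is a hypothetical stand-in for a sentence-level BLEU scorer (in practice one would use a proper BLEU implementation), and the `scorer` parameter makes that swap explicit.

```python
def unigram_precision(hypothesis, reference):
    """Toy stand-in for sentence-level BLEU: fraction of hypothesis
    tokens that appear in the reference."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    return sum(1 for tok in hyp if tok in ref) / len(hyp)

def oracle_hypothesis(k_best, reference, scorer=unigram_precision):
    """Return the hypothesis from the K-best list (one output per
    checkpoint) that scores highest against the reference."""
    return max(k_best, key=lambda hyp: scorer(hyp, reference))

k_best = [
    "the cat sat on the mat",   # checkpoint n-2
    "the cat sat sat the mat",  # checkpoint n-1
    "a cat is on mat",          # checkpoint n
]
reference = "the cat sat on the mat"
print(oracle_hypothesis(k_best, reference))  # → "the cat sat on the mat"
```

The gap between the oracle score computed this way and the last checkpoint's score is what motivates reranking: the oracle uses the reference, which is unavailable at test time, so a reference-free scoring rule is needed.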
We focus on the nature of the translations that the decoder generates with and without the coverage penalty. In the initial step, we keep track of the number of distinct words in the generated hypothesis, and subsequently of the words that are repeated more than once. A higher score is given to sentences with a larger number of Distinct Tokens (D), and a lower score to those with a larger number of repetitive words (F).
For each sentence in the N-best list, these scores are sorted, and the sentence with the highest score is selected. This process is repeated for the entire test set, and the top-scoring hypotheses are chosen as the reranked output, as shown in Section 4.

Reranking Procedure
Algorithm 1: Reranking Method
Input: Translated target-language sentences H = (h(n−k), ..., h(n)) from the last k epochs for a given source sentence
Output: The sentence with the highest number of distinct words and the lowest number of repetitive words
For a sentence, FREQ is the count of each word and DISTINCT is the total count of unique words. For each hypothesis in the K-best list, we divide DISTINCT by FREQ and select the highest scorer.
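One reading of Algorithm 1 can be sketched as below. Note this is an interpretation, not the authors' code: the algorithm's "divide DISTINCT with FREQ" is taken here, per Section 3, as D (distinct tokens) divided by F (tokens repeated more than once, floored at 1 to avoid division by zero).

```python
from collections import Counter

def rerank_score(hypothesis):
    """Score = D / max(F, 1), rewarding distinct tokens (D) and
    penalizing tokens that repeat more than once (F)."""
    counts = Counter(hypothesis.split())            # FREQ: per-word counts
    distinct = len(counts)                          # D: unique tokens
    repeated = sum(1 for c in counts.values() if c > 1)  # F: repeated tokens
    return distinct / max(repeated, 1)

def rerank(k_best):
    """Select the highest-scoring hypothesis from the K-best list."""
    return max(k_best, key=rerank_score)

k_best = [
    "the cat sat on the mat",   # D=5, F=1 ('the' twice) → 5.0
    "the the cat sat the mat",  # D=4, F=1 ('the' thrice) → 4.0
]
print(rerank(k_best))  # → "the cat sat on the mat"
```

This scoring needs no reference, language model, or extra data, which is the point of the method: it ranks checkpoint outputs purely on the decoder's tendency toward distinct versus repetitive tokens.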

DataSet
We used the ILCI Jha (2010) corpus, which has eleven language pairs, from which we chose Telugu and Hindi as our parallel data for training. The entire corpus was manually cleaned to remove misalignments. Table 1 shows the split of sentences used in the process.

Data         Size
Training     45000
Validation    4000
Test           990

Table 1: Data split

Experiments
We adopt the Keras implementation of Álvaro Peris and Casacuberta (2018) for our experiments. We use a two-layer encoder-decoder model with 500-dimensional source and target embeddings and 500 units in each of the layers. The encoder layers are LSTMs Hochreiter and Schmidhuber (1997) and the decoder layers are ConditionalLSTMs with Bahdanau-style attention; the optimizer is Adam Kingma and Ba (2014), and the model is trained for 15 iterations with a batch size of 512 sentences. The remaining parameters in the configuration file were set to their default values. We evaluate our experiments both with and without the coverage penalty. The hypotheses are collected for the last k = 3, 5, 7 checkpoints during decoding, and the generated hypotheses are evaluated with BLEU Papineni et al. (2002). The scores obtained after each iteration are shown in Table 2. We then apply our proposed reranking method to the last few iteration outputs, which are selected as the N-best list. The proposed reranking method leads to an overall improvement in translation quality of +0.07, +0.15, and +0.16 BLEU over the baseline, with oracle improvements of up to +0.55, +0.90, and +1.01 on the three systems. The scores obtained for each of them are shown in Tables 3, 4, and 5.

With Coverage Penalty
We also evaluate our work by adding the coverage penalty Wu et al. (2016) during training, to ensure that the algorithm works when both under-translation and over-translation are adequately addressed. All hyperparameters are kept the same as in the baseline system, except for the coverage penalty. With an excessive coverage penalty, there is a decline in translation quality compared to the baseline system without the coverage penalty, as shown in Tables 2 and 10. Still, the proposed method gives increases of +0.12, +0.15, and +0.15 over the baseline, with oracle improvements of up to +0.91, +1.30, and +1.30 for the last 3, 5, and 7 checkpoints respectively, as shown in Tables 11, 12, and 13.
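For reference, the coverage penalty of Wu et al. (2016) scores how fully the attention distribution covers the source: cp = β · Σᵢ log(min(Σⱼ pᵢⱼ, 1.0)), summed over source positions i and target steps j. A minimal sketch, with β = 0.2 as an illustrative value rather than the setting used in our experiments:

```python
import math

def coverage_penalty(attention, beta=0.2):
    """Coverage penalty of Wu et al. (2016).

    attention[j][i] is the attention weight from target step j to
    source token i. For each source token, total attention received
    is capped at 1.0; under-attended tokens contribute log(<1) < 0,
    so the penalty is 0 for full coverage and negative otherwise.
    beta scales the penalty strength (illustrative value here).
    """
    num_src = len(attention[0])
    cp = 0.0
    for i in range(num_src):
        total = sum(step[i] for step in attention)
        cp += math.log(min(total, 1.0))
    return beta * cp

# Fully covered source: penalty is 0.
print(coverage_penalty([[1.0, 0.0], [0.0, 1.0]]))   # → 0.0
# Second source token under-attended: negative penalty.
print(coverage_penalty([[0.9, 0.1], [0.8, 0.2]]) < 0.0)  # → True
```

Larger β penalizes under-covered sources more aggressively, which matches the behavior reported above: a moderate penalty helps, while an excessive one degrades overall quality.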
One can also observe that the improvements and the oracle scores increase with the size of the N-best list. The variation with respect to the baseline is shown in Figure 1.

Conclusion

In this paper, we introduce a method of selecting an N-best list for NMT systems and propose a way of reranking the hypotheses generated by the system. We observe that our approach gives better results than the baseline model when following the proposed reranking method, and we also evaluate it with the coverage penalty. One can investigate our approach with varying beam sizes, analyzing the effect of the length penalty Wu et al. (2016) and comparing it with related methods. We also look forward to devising better reranking methods that come closer to the oracle scores, and to investigating the efficacy of the approach in low-resource data conditions.
Language models are used to estimate the likelihood of sentences and are a widely used tool for reranking hypotheses. Introducing language models during reranking could establish a trade-off between perplexity and the scores of the generated hypotheses. We also plan to explore the work of Gülçehre et al. (2017) and Gülçehre et al. (2015), which introduces language models into existing neural architectures with methods such as Shallow Fusion and Deep Fusion. This is another promising direction for reranking.