Multi-Reference Training with Pseudo-References for Neural Translation and Text Generation

Neural text generation, including neural machine translation, image captioning, and summarization, has been quite successful recently. However, during training time, typically only one reference is considered for each example, even though there are often multiple references available, e.g., 4 references in NIST MT evaluations, and 5 references in image captioning data. We first investigate several different ways of utilizing multiple human references during training. But more importantly, we then propose an algorithm to generate exponentially many pseudo-references by first compressing existing human references into lattices and then traversing them to generate new pseudo-references. These approaches lead to substantial improvements over strong baselines in both machine translation (+1.5 BLEU) and image captioning (+3.1 BLEU / +11.7 CIDEr).


Introduction
Neural text generation has attracted much attention in recent years thanks to its impressive generation accuracy and wide applicability. In addition to demonstrating compelling results for machine translation (MT) (Sutskever et al., 2014; Bahdanau et al., 2014), the same or very similar models have, with simple adaptation, also proven successful for summarization (Rush et al., 2015; Nallapati et al., 2016) and image or video captioning (Venugopalan et al., 2015; Xu et al., 2015a).
The most common neural text generation model is based on the encoder-decoder framework (Sutskever et al., 2014), which generates a variable-length output sequence using an RNN-based decoder with attention mechanisms (Bahdanau et al., 2014; Xu et al., 2015b). There have been many recent efforts to improve generation accuracy, e.g., ConvS2S (Gehring et al., 2017) and Transformer (Vaswani et al., 2017). However, all these efforts are limited to training with a single reference even when multiple references are available.
Multiple references are essential for evaluation because, unlike in classification tasks, translations and generated texts are not unique. In MT, even though the training sets usually come with a single reference (bitext), the evaluation sets often come with multiple references. For example, the NIST Chinese-to-English and Arabic-to-English MT evaluation datasets (2003-2008) have in total around 10,000 Chinese sentences and 10,000 Arabic sentences, each with 4 different English translations. On the other hand, for image captioning datasets, multiple references are more common not only for evaluation, but also for training, e.g., the MSCOCO (Lin et al., 2014) dataset provides 5 references per image, and PASCAL-50S and ABSTRACT-50S (Vedantam et al., 2015) even provide 50 references per image. Can we use the extra references during training? How much can we benefit from training with multiple references?
We therefore first investigate several different ways of utilizing existing human-annotated references, which include Sample One (Karpathy and Fei-Fei, 2015), Uniform, and Shuffle methods (explained in Sec. 2). Although Sample One has been explored in image captioning, to the best of our knowledge, this is the first time that an MT system is trained with multiple references.
In fact, four or five references still cover only a tiny fraction of the exponentially large space of potential references (Dreyer and Marcu, 2012). More importantly, encouraged by the success of training with multiple human references, we further propose a framework to generate many more pseudo-references automatically. In particular, we design a neural multiple-sequence alignment algorithm to compress all existing human references into a lattice by merging similar words across different references (see examples in Fig. 1); this can be viewed as a modern, neural version of paraphrasing with multiple-sequence alignment (Barzilay and Lee, 2003, 2002). In theory, we can then generate exponentially many more references from the lattice.
We make the following main contributions:
• Firstly, we investigate three different methods for multi-reference training on both MT and image captioning tasks (Section 2).
• Secondly, we propose a novel neural network-based multiple sequence alignment model to compress the existing references into lattices. By traversing these lattices, we generate exponentially many new pseudo-references (Section 3).
• We report substantial improvements over strong baselines in both MT (+1.5 BLEU) and image captioning (+3.1 BLEU / +11.7 CIDEr) by training on the newly generated pseudo-references (Section 4).

Using Multiple References
To make multiple-reference training easy to adapt to any framework, we leave the existing models themselves unchanged. Instead, we convert a multiple-reference dataset into a single-reference dataset without losing any information. Consider a multiple-reference dataset $D$, where the $i$-th training example, $(x_i, Y_i)$, includes one source input $x_i$, which is a source sentence in MT or an image vector in image captioning, and a reference set $Y_i = \{y_i^1, \dots, y_i^K\}$ of $K$ references. We have the following methods to convert the multiple-reference dataset to a single-reference dataset $D'$ (note that the following $D'_{\text{sample one}}$, $D'_{\text{uniform}}$ and $D'_{\text{shuffle}}$ are ordered sets):
Sample One: The most straightforward way is to use a different reference in different epochs during training to explore the variance between references. For each example, we randomly pick one of the $K$ references in each training epoch (note that the random choice is redrawn in every epoch). This method is commonly used in the existing image captioning literature, e.g., Karpathy and Fei-Fei (2015), but has never been used in MT. This approach can be formalized as:
\[ D'_{\text{sample one}} = \bigcup_{i=1}^{|D|} \{(x_i, y_i^{k_i})\}, \quad k_i = \mathrm{rand}(1, \dots, K) \]
Uniform: Although all references are accessible with Sample One, it is not guaranteed that all references are actually used during training. We therefore introduce Uniform, which copies each training example $x_i$ $K$ times, each time with a different reference. This approach can be formalized as:
\[ D'_{\text{uniform}} = \bigcup_{i=1}^{|D|} \bigcup_{k=1}^{K} \{(x_i, y_i^{k})\} \]
Shuffle: This method is based on Uniform, but shuffles all the source and reference pairs into random order before each epoch. Formally:
\[ D'_{\text{shuffle}} = \mathrm{Shuffle}(D'_{\text{uniform}}) \]
Sample One is supervised by different training signals in different epochs, while both Uniform and Shuffle include all the references at once. Note that we use mini-batches during training. When the batch size equals the entire training set size, Uniform and Shuffle become equivalent.
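As a rough illustration of the three conversions, the following Python sketch assumes that the dataset is a list of (source, references) pairs. The function names and the toy data are illustrative assumptions and are not taken from our implementation.

```python
import random

def sample_one(dataset, rng):
    """Pick one random reference per example; call again in every epoch."""
    return [(x, rng.choice(refs)) for x, refs in dataset]

def uniform(dataset):
    """Copy each source once per reference, keeping the copies adjacent."""
    return [(x, y) for x, refs in dataset for y in refs]

def shuffle(dataset, rng):
    """Uniform, followed by a random permutation of all (source, reference) pairs."""
    pairs = uniform(dataset)
    rng.shuffle(pairs)
    return pairs

# Toy multi-reference dataset (hypothetical): two sources with two references each.
toy = [("x1", ["a", "b"]), ("x2", ["c", "d"])]
rng = random.Random(0)
epoch_sample = sample_one(toy, rng)   # one pair per source, redrawn each epoch
epoch_uniform = uniform(toy)          # all pairs, grouped by source
epoch_shuffle = shuffle(toy, rng)     # all pairs, in random order
```

Sample One yields as many pairs as there are sources, whereas Uniform and Shuffle yield K pairs per source and differ only in the order in which the pairs reach the mini-batches.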

Pseudo-References Generation
In text generation tasks, the given multiple references are only a small portion of the whole space of potential references. To cover a larger number of references during training, we want to generate more pseudo-references that are similar to the existing ones.
Our basic idea is to compress different references y 0 , y 1 , ..., y K into a lattice. We achieve this by merging similar words in the references. Finally, we generate more pseudo-references by simply traversing the compressed lattice and selecting those of high quality according to their BLEU scores.
Take the following three references from the NIST Chinese-to-English machine translation dataset as an example:

Naive Idea: Hard Word Alignment
The simplest way to compress different references into a lattice is to do pairwise reference compression iteratively. At each step, we select two references and merge the identical words in them.
Considering the previous example, we can derive an initial lattice from the three references as shown in Fig. 1(a). Suppose that we first do a pairwise reference compression on the first two references; we can then merge at four shared words: Indonesia, its, opposition and foreign, and the lattice turns into Fig. 1(b). If we further compress the first and third references, we can merge at Indonesia, opposition, to and foreign, which gives the lattice in Fig. 1 [...] lattice can align these words, we can generate the lattice shown in Fig. 1 [...]
Following the previously described algorithm, we can merge the two references at "two elephants", at "to" and at "a". However, the word "to" plays very different roles in the two references (it is a preposition in the first reference and an infinitive marker in the second) and should not be merged. Thus, the lattice in Fig. 2(b) will generate the following wrong pseudo-references:
1. Two elephants try to a small entry
2. Two elephants in an enclosure next to fit through a brick building
Therefore, we need to investigate a better method to compress the lattice.
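The failure mode of hard alignment can be reproduced with a small Python sketch. It is only an approximation under stated assumptions: the two input references are inferred from the wrong pseudo-references listed above, `difflib.SequenceMatcher` stands in for the pairwise compression step, and `hard_merge_paths` is a hypothetical helper name.

```python
from difflib import SequenceMatcher
from itertools import product

def hard_merge_paths(ref_a, ref_b):
    """Merge two references at identical words and enumerate every path."""
    a, b = ref_a.lower().split(), ref_b.lower().split()
    blocks = SequenceMatcher(None, a, b, autojunk=False).get_matching_blocks()
    segments = []          # each segment lists the alternative token spans
    pos_a = pos_b = 0
    for i, j, n in blocks:  # the last block is a zero-length sentinel
        # Unshared spans before the shared block: either branch may be taken.
        segments.append(sorted({tuple(a[pos_a:i]), tuple(b[pos_b:j])}))
        # The shared span is a single merged lattice segment.
        segments.append([tuple(a[i:i + n])])
        pos_a, pos_b = i + n, j + n
    return {
        " ".join(tok for span in choice for tok in span)
        for choice in product(*segments)
    }

# References inferred from the wrong pseudo-references in the text (assumption).
ref_1 = "Two elephants in an enclosure next to a brick building"
ref_2 = "Two elephants try to fit through a small entry"
paths = hard_merge_paths(ref_1, ref_2)
```

Because "to" and "a" are merged purely on surface identity, the enumerated paths include both ungrammatical sentences mentioned above alongside the two original references.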

Measuring Word Similarity in Context
To tackle the two problems of hard alignment listed above, we need to identify synonyms and words with similar meanings. Barzilay and Lee (2002) utilize an external synonym dictionary to get the similarity score between words. However, this method ignores the given context of each word. For example, in Fig. 1(a), the word Indonesia occurs twice in the second reference path. If we use a synonym dictionary, both Indonesia tokens will be aligned to the Indonesia in the first or third sentence with the same score. This incorrect alignment would lead to a meaningless lattice.
Thus, we introduce the semantic substitution matrix, which measures the semantic similarity of each word pair in context. Formally, given a sentence pair $y_i$ and $y_j$, we build a semantic substitution matrix $M \in \mathbb{R}^{|y_i| \times |y_j|}$, whose cell $M_{u,v}$ represents the similarity score between word $y_{i,u}$ and word $y_{j,v}$.
We propose a new neural network-based multiple sequence alignment algorithm that takes context into consideration. We first build a language model (LM) to obtain the semantic representation of each word; these word representations are then used to construct the semantic substitution matrix between sentences. Fig. 3 shows the architecture of the bidirectional LM (Mousa and Schuller, 2017). The optimization goal of our LM is to minimize the prediction error of the i-th word given the hidden states of the surrounding words. For any new sentence $y_i$, we concatenate the forward and backward hidden states to represent each word $y_{i,u}$. We then calculate the normalized cosine similarity score of words $y_{i,u}$ and $y_{j,v}$ from these representations, which gives the entry $M_{u,v}$. Fig. 4 shows an example of the semantic substitution matrix of the first two sentences in the example references of Fig. 1(a).
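A minimal Python sketch of building such a matrix from contextual word vectors is given below. It assumes that each word is already represented by its concatenated forward and backward hidden states, and that "normalized" means mapping the cosine from [-1, 1] to [0, 1] via (cos + 1) / 2; this particular normalization and the function names are assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def substitution_matrix(h_i, h_j):
    """M[u][v] = normalized cosine similarity of word u in y_i and word v in y_j.

    h_i and h_j are lists of contextual word vectors, one per word.
    The (cos + 1) / 2 rescaling is an assumed normalization.
    """
    return [[(cosine(u, v) + 1.0) / 2.0 for v in h_j] for u in h_i]

# Toy 2-dimensional "hidden states" (hypothetical values).
h_first = [[1.0, 0.0], [0.0, 1.0]]
h_second = [[1.0, 0.0], [-1.0, 0.0], [0.0, 2.0]]
M = substitution_matrix(h_first, h_second)
```

The matrix has one row per word of the first sentence and one column per word of the second, so identical contextual vectors score 1, orthogonal ones 0.5, and opposite ones 0.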

Iterative Pairwise Word Alignment using Dynamic Programming
With the help of the semantic substitution matrix M u,v , which measures pairwise word similarity, we need to find the optimal word alignment to compress the references into a lattice. Unfortunately, the cost of computing the optimal alignment is exponential in the number of sequences. Thus, we use iterative pairwise alignment, which greedily merges sentence pairs (Durbin et al., 1998).
Based on the pairwise substitution matrix, we can define an optimal pairwise sequence alignment as an optimal path from $M_{0,0}$ to $M_{|y_i|,|y_j|}$. This is a dynamic programming problem with the state transition function described in Equation (3):
\[ \mathrm{opt}(u, v) = \max \{\, \mathrm{opt}(u-1, v-1) + M_{u,v},\ \mathrm{opt}(u-1, v),\ \mathrm{opt}(u, v-1) \,\} \quad (3) \]
with $\mathrm{opt}(u, 0) = \mathrm{opt}(0, v) = 0$. Fig. 5 shows the optimal path according to the semantic substitution matrix in Fig. 4. A vertical or horizontal step on the path corresponds to a gap, and a diagonal step corresponds to an alignment.
In what order should we perform the iterative pairwise word alignment? Intuitively, we should compress the most similar reference pair first, since this compression will lead to more aligned words. Following this intuition, we order reference pairs by the maximum alignment score opt(|y i |, |y j |) (i.e., the score of the bottom-right cell in Fig. 5), which is the sum of the scores of all aligned words. Using this order, we iteratively merge the sentence pairs in descending order of score, skipping a pair if both of its sentences have already been merged (this prevents generating a cyclic lattice).
Since the semantic substitution matrix $M_{u,v}$, defined as a normalized cosine similarity, takes values in (0, 1), every alignment increases the total score, so the DP algorithm is very likely to align unrelated words. To tackle this problem, we deduct a global penalty $p$ from each cell of $M_{u,v}$. With the global penalty $p$, the DP algorithm will not align a word pair $(y_{i,u}, y_{j,v})$ unless $M_{u,v} \ge p$.
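The penalized pairwise alignment can be sketched in Python as follows. This is a minimal sketch under assumptions rather than our exact implementation: gaps are taken to cost nothing, a diagonal step contributes M[u][v] - p, and the function name `align` and the toy matrix are illustrative.

```python
def align(M, p):
    """Return (alignment score, aligned index pairs) for matrix M and penalty p."""
    n = len(M)
    m = len(M[0]) if n else 0
    # opt[u][v]: best score for aligning the first u and the first v words.
    opt = [[0.0] * (m + 1) for _ in range(n + 1)]
    for u in range(1, n + 1):
        for v in range(1, m + 1):
            diag = opt[u - 1][v - 1] + M[u - 1][v - 1] - p
            opt[u][v] = max(diag, opt[u - 1][v], opt[u][v - 1])
    # Backtrack from the bottom-right cell to recover the aligned pairs.
    pairs = []
    u, v = n, m
    while u > 0 and v > 0:
        diag = opt[u - 1][v - 1] + M[u - 1][v - 1] - p
        if opt[u][v] == diag:            # diagonal step: alignment
            pairs.append((u - 1, v - 1))
            u -= 1
            v -= 1
        elif opt[u][v] == opt[u - 1][v]:  # vertical step: gap
            u -= 1
        else:                             # horizontal step: gap
            v -= 1
    pairs.reverse()
    return opt[n][m], pairs

# Toy substitution matrix for a 2-word and a 3-word sentence (hypothetical values).
M = [[0.95, 0.20, 0.30],
     [0.10, 0.40, 0.92]]
score, pairs = align(M, p=0.9)
```

With p = 0.9, only the two cells above the penalty are aligned; raising p above every entry of M leaves the two sentences entirely unaligned.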
After the pairwise reference alignment, we merge the aligned words. For example, in Fig. 1, after we generate an initial lattice as shown in Fig. 1(a), we calculate the maximum alignment score of all sentence pairs. The lattice then turns into Fig. 1(d) by merging the first two references (assuming they have the highest score) according to the pairwise alignment shown in Fig. 5. Next, we pick the sentence pair with the next highest alignment score (assuming it is the last two sentences). As in the previous step, we find the alignments by dynamic programming and merge them to obtain the final lattice (see Fig. 1(e)).
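The greedy ordering of sentence pairs described above can be sketched as follows. The sketch assumes that the maximum alignment score of every pair has already been computed; the function name `merge_order` and the scores are hypothetical.

```python
def merge_order(pair_scores):
    """Order sentence pairs for merging by descending alignment score.

    pair_scores maps (i, j) sentence-index pairs to their maximum alignment
    score. A pair is skipped when both of its sentences were merged earlier.
    """
    merged = set()
    order = []
    ranked = sorted(pair_scores.items(), key=lambda kv: kv[1], reverse=True)
    for (i, j), _score in ranked:
        if i in merged and j in merged:
            continue  # merging again would create a cycle in the lattice
        order.append((i, j))
        merged.update((i, j))
    return order

# Hypothetical scores for three references, indexed 0, 1 and 2.
scores = {(0, 1): 5.2, (1, 2): 4.1, (0, 2): 3.0}
order = merge_order(scores)
```

For these scores the first two references are merged first, then the last two, and the remaining pair is skipped because both of its sentences are already part of the lattice.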

Traverse Lattice and Pseudo-References Selection by BLEU
We generate pseudo-references by simply traversing the generated lattice. For example, if we traverse the final lattice shown in Fig. 1(e), we can generate 213 pseudo-references in total. We then add the generated pseudo-references to the training dataset. To balance the number of generated pseudo-references for each example, we force the total number of pseudo-references from each example to be
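Lattice traversal and top-k selection can be illustrated with the Python sketch below. It assumes the lattice is an acyclic graph stored as a mapping from a node to its outgoing (word, next node) edges. The scoring function is a parameter; in our setting it would be BLEU against the human references, but the toy lattice and the length-based stand-in score here are purely hypothetical.

```python
import heapq

def traverse(lattice, node, end):
    """Yield every word sequence on a path from node to end in an acyclic lattice."""
    if node == end:
        yield []
        return
    for word, nxt in lattice[node]:
        for rest in traverse(lattice, nxt, end):
            yield [word] + rest

def select_top_k(lattice, start, end, score, k):
    """Keep the k highest-scoring pseudo-references according to score."""
    candidates = (" ".join(path) for path in traverse(lattice, start, end))
    return heapq.nlargest(k, candidates, key=score)

# Toy lattice (hypothetical): two positions each offer two alternative words.
lattice = {
    0: [("two", 1)],
    1: [("elephants", 2)],
    2: [("try", 3), ("attempt", 3)],
    3: [("to", 4)],
    4: [("fit", 5), ("pass", 5)],
    5: [("through", 6)],
}
all_paths = [" ".join(p) for p in traverse(lattice, 0, 6)]
# Stand-in score preferring shorter strings; replace with BLEU in practice.
top_two = select_top_k(lattice, 0, 6, score=lambda s: -len(s), k=2)
```

Each independent choice point doubles the number of paths, which is why the number of pseudo-references grows exponentially with the number of alternatives in the lattice.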

Experiments
To investigate the empirical performances of our proposed algorithm, we conduct experiments on machine translation and image captioning.

Machine Translation
We evaluate our approach on the NIST Chinese-to-English translation dataset, which consists of 1M pairs of single-reference data and 5974 pairs of 4-reference data (NIST 2002, 2003, 2004, 2005, 2006, 2008). Table 1 shows the statistics of this dataset. We first pre-train our model on the 1M-pair single-reference dataset and then train on NIST 2002, 2003, 2004 and 2005. We use the NIST 2006 dataset as the validation set and NIST 2008 as the test set.
Fig. 6(a) analyzes the number and quality of the references generated by our proposed approach. We set the global penalty to 0.9 and only use the top 50 generated references for the average BLEU analysis. The figure shows that as the sentence length grows, the number of generated references grows exponentially. To generate enough references for the following experiments, we set an initial global penalty of 0.9 and gradually decrease it by 0.05 until we collect no fewer than 100 references. We train a bidirectional language model on the pre-training dataset and the training dataset, with GloVe (Pennington et al., 2014) word embeddings of 300 dimensions, for 20 epochs to minimize perplexity.
We employ byte-pair encoding (BPE) (Sennrich et al., 2015), which reduces the source and target language vocabulary sizes to 18k and 10k. We adopt length reward (Huang et al., 2017) to find the optimal sentence length. We use a two-layer bidirectional LSTM as the encoder and a two-layer LSTM as the decoder. We perform pre-training for 20 epochs to minimize perplexity on the 1M dataset, with a batch size of 64, word embedding size of 500, beam size of 15, learning rate of 0.1, learning rate decay of 0.5 and dropout rate of 0.3. We then train the model for 30 epochs and use the best batch size among 100, 200 and 400 for each update method. These batch sizes are multiples of the number of references used in the experiments, so it is guaranteed that all the references of a single example are in one batch for the Uniform method. The learning rate is set to 0.01 and the learning rate decay to 0.75. We run each experiment three times and report the average result.
Table 2 shows the translation quality on the dev set of the machine translation task. Besides the original 4 references in the training set, we generate four more datasets with 10, 20, 50 and 100 references, including pseudo-references, using hard word alignment and soft word alignment. We compare the three update methods (Sample One, Uniform, Shuffle) with always using the first reference (First). All results of soft word alignment are better than the corresponding hard word alignment results, and the best result is achieved with 50 references using Uniform and soft word alignment. According to Table 3, Shuffle with the original 4 references yields a +0.7 BLEU improvement and Uniform with 50 references yields a +1.5 BLEU improvement. From Fig. 7(b), we can see that with the Sample One method, the translation quality drops dramatically with more than 10 references. This may be due to the higher variance of the reference used in each epoch.
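The penalty schedule used to collect enough references (start at 0.9 and decrease by 0.05 until at least 100 references are found) can be sketched as below. The callable `generate`, the lower bound `floor` and the fake generator are assumptions introduced only for this illustration.

```python
def collect_references(generate, start=0.9, step=0.05, target=100, floor=0.0):
    """Lower the global penalty until generate(p) returns at least target references.

    generate is a hypothetical callable that builds the lattice with penalty p
    and returns the list of pseudo-references. floor is an assumed lower bound
    that stops the loop if the target is never reached.
    """
    p = start
    refs = generate(p)
    while len(refs) < target and p - step >= floor:
        p = round(p - step, 10)  # rounding avoids floating-point drift
        refs = generate(p)
    return p, refs

# Fake generator (hypothetical): few references until the penalty reaches 0.8.
def fake_generate(p):
    return list(range(150)) if p <= 0.8 + 1e-9 else list(range(10))

final_p, refs = collect_references(fake_generate)
```

A lower penalty allows more word pairs to be aligned, which produces a denser lattice and therefore more paths, at the price of admitting less similar word pairs.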

Image Captioning
For the image captioning task, we use the widely used MSCOCO image captioning dataset. Following prior work, we use the Karpathy split (Karpathy and Fei-Fei, 2015). Table 1 shows the statistics of this dataset. We use ResNet (He et al., 2016) [...]. We train every model for 100 epochs, calculate the BLEU score on the validation set and select the best model. For every update method, we find the optimal batch size among 50, 250, 500 and 1000, and we use a beam size of 5.
Fig. 6(b) analyzes how the number and quality of generated references correlate with the average reference length. We set the global penalty to 0.6 (which is also adopted for the generated references in the following experiments) and use the top 50 generated references for the average BLEU analysis. Since the original references are much shorter than those in the machine translation dataset, the generated references are fewer and of lower quality.
Table 4 shows that the best result is achieved with 20 references using Shuffle. This result differs from that of the machine translation task, where the Uniform method is the best. This may be because the references in the image captioning dataset are much more diverse than those in the machine translation dataset. Different captions of one image may even talk about different aspects. When using the Uniform method, the high variance of references in one batch may harm the model and lead to worse text generation quality. Table 5 shows that this configuration outperforms Sample One with the 4 original references, which is adopted in previous work (Karpathy and Fei-Fei, 2015), by +3.1 BLEU and +11.7 CIDEr.
Fig. 6 shows a training example in the COCO dataset together with its generated lattice and pseudo-references, which are sorted by BLEU score. Our proposed algorithm generates 73724 pseudo-references in total. The BLEU scores of all the top 50 pseudo-references are above 97.1, and the top three even achieve a BLEU score of 100.0 although they are not identical to any original reference. Although the BLEU score of the last two sentences is 0.0, they are still valid descriptions of the picture.

Conclusions
We introduce several multiple-reference training methods and a neural-based lattice compression framework, which can generate more training references based on existing ones. Our proposed framework outperforms the baseline models on both MT and image captioning tasks.