Bag-of-Words as Target for Neural Machine Translation

A sentence can be translated into more than one correct sentence. However, most existing neural machine translation models use only one of the correct translations as the target, and the other correct sentences are punished as incorrect sentences in the training stage. Since most of the correct translations of one sentence share a similar bag-of-words, it is possible to distinguish the correct translations from the incorrect ones by the bag-of-words. In this paper, we propose an approach that uses both the sentences and the bag-of-words as targets in the training stage, in order to encourage the model to generate potentially correct sentences that do not appear in the training set. We evaluate our model on a Chinese-English translation dataset, and experiments show that our model outperforms the strong baselines by a BLEU score of 4.55.


Introduction
Neural Machine Translation (NMT) has achieved success in generating coherent and reasonable translations. Most of the existing neural machine translation systems are based on the sequence-to-sequence model (Sutskever et al., 2014). The sequence-to-sequence model (Seq2Seq) regards the translation problem as a mapping from source sequences to target sequences. The encoder of Seq2Seq compresses the source sentences into a latent representation, and the decoder of Seq2Seq generates the target sentences from the source representations. The cross-entropy loss, which measures the distance between the generated distribution and the target distribution, is minimized in the training stage, so that the generated sentences are as similar as possible to the target sentences.
The code is available at https://github.com/lancopku/bag-of-words
Source: 今年前两月广东高新技术产品出口37.6亿美元。
Reference: Export of high -tech products in guangdong in first two months this year reached 3.76 billion us dollars .
Translation 1: Guangdong 's export of new high technology products amounts to us $3.76 billion in first two months of this year .
Translation 2: Export of high -tech products has frequently been in the spotlight , making a significant contribution to the growth of foreign trade in guangdong .
Table 1: An example of two generated translations. Although Translation 1 is much more reasonable, it is punished more severely than Translation 2 by Seq2Seq.
Due to the limitation of the training set, most of the existing neural machine translation models have only one reference sentence as the target. However, a sentence can be translated into more than one correct sentence, and these translations have different syntactic structures and expressions but share the same meaning. The correct translations that do not appear in the training set are punished as incorrect translations by Seq2Seq, which is potentially harmful to the model. Table 1 shows an example of two generated translations from Chinese to English. Translation 1 is clearly a more appropriate translation of the source sentence than Translation 2, but it is punished even more severely than Translation 2 by Seq2Seq.
Because most of the correct translations of one source sentence share a similar bag-of-words, it is possible to distinguish the correct translations from the incorrect ones by the bag-of-words in most cases. In this paper, we propose an approach that uses both sentences and bag-of-words as the targets. In this way, the generated sentences that cover more words in the bag-of-words (e.g. Translation 1 in Table 1) are encouraged, while the incorrect sentences (e.g. Translation 2) are punished more severely. We perform experiments on a popular Chinese-English translation dataset. Experiments show that our model outperforms the strong baselines by a BLEU score of 4.55.
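The intuition can be checked on the sentences of Table 1 with a minimal sketch. It assumes a bag-of-words is the set of lowercased, whitespace-separated tokens, and it measures the fraction of distinct reference words that a candidate covers; the helper names `bag_of_words` and `coverage` are illustrative and not taken from the released code.

```python
def bag_of_words(sentence):
    """Return the set of lowercased, whitespace-separated tokens (an assumption)."""
    return set(sentence.lower().split())


def coverage(reference, candidate):
    """Fraction of distinct reference words that also occur in the candidate."""
    ref_bag = bag_of_words(reference)
    cand_bag = bag_of_words(candidate)
    return len(ref_bag & cand_bag) / len(ref_bag)


# Sentences copied from Table 1.
reference = ("Export of high -tech products in guangdong in first two months "
             "this year reached 3.76 billion us dollars .")
translation_1 = ("Guangdong 's export of new high technology products amounts to "
                 "us $3.76 billion in first two months of this year .")
translation_2 = ("Export of high -tech products has frequently been in the "
                 "spotlight , making a significant contribution to the growth "
                 "of foreign trade in guangdong .")

cov_1 = coverage(reference, translation_1)
cov_2 = coverage(reference, translation_2)
print(f"Translation 1 coverage: {cov_1:.2f}")
print(f"Translation 2 coverage: {cov_2:.2f}")
```

Under this tokenization, Translation 1 covers 14 of the 18 distinct reference tokens, while Translation 2 covers only 8, which matches the claim that the bag-of-words separates the two candidates.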

Bag-of-Words as Target
In this section, we describe the proposed approach in detail.

Notation
Given a translation dataset that consists of $N$ data samples, the $i$-th data sample $(x^i, y^i)$ contains a source sentence $x^i$ and a target sentence $y^i$. The bag-of-words of $y^i$ is denoted as $b^i$. The source sentence $x^i$, the target sentence $y^i$, and the bag-of-words $b^i$ are all sequences of words:
$$x^i = \{x^i_1, x^i_2, \ldots, x^i_{L_i}\}$$
$$y^i = \{y^i_1, y^i_2, \ldots, y^i_{M_i}\}$$
$$b^i = \{b^i_1, b^i_2, \ldots, b^i_{K_i}\}$$
where $L_i$, $M_i$, and $K_i$ denote the number of words in $x^i$, $y^i$, and $b^i$, respectively. The target of our model is to generate both the target sequence $y^i$ and the corresponding bag-of-words $b^i$. For simplicity, $(x, y, b)$ is used to denote each data sample in the rest of this section.
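As a small illustration of the notation, the sketch below derives a bag-of-words b from a target sentence y. It assumes that repeated words are collapsed, which fits the multi-label view used later in this section but is not stated explicitly here; `make_bag_of_words` is a hypothetical helper.

```python
def make_bag_of_words(target_tokens):
    """Collect the distinct words of a target sentence, in order of first appearance.

    Assumption: duplicates are collapsed, so the bag behaves like a set.
    """
    seen = set()
    bag = []
    for word in target_tokens:
        if word not in seen:
            seen.add(word)
            bag.append(word)
    return bag


# Target sentence y (a reference from this paper, lowercased and tokenized).
y = "humans have a total of 23 pairs of chromosomes .".split()
b = make_bag_of_words(y)

M = len(y)  # number of words in the target sentence
K = len(b)  # number of words in the bag-of-words
print(M, K, b)
```

Here the word "of" occurs twice in y but only once in b, so K is smaller than M.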

Bag-of-Words Generation
We regard the bag-of-words generation as a multi-label classification problem. We first perform the encoding and decoding to obtain the scores of words at each position of the generated sentence. Then, we sum the scores of all positions as the sentence-level score. Finally, the sentence-level score is used for multi-label classification, which identifies whether each word appears in the translation.
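In our model, the encoder is a bi-directional Long Short-Term Memory network (BiLSTM), which produces the representation of the source sequence from the forward and the backward hidden outputs:
$$\overrightarrow{h}_t = \overrightarrow{f}(x_t, \overrightarrow{h}_{t-1})$$
$$\overleftarrow{h}_t = \overleftarrow{f}(x_t, \overleftarrow{h}_{t+1})$$
where $\overrightarrow{f}$ and $\overleftarrow{f}$ are the forward and the backward functions of the LSTM for one time step, $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ are the forward and the backward hidden outputs respectively, $x_t$ is the input at the $t$-th time step, and $L$ is the number of words in the sequence $x$.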
The decoder consists of a uni-directional LSTM with attention and a word generator. The LSTM generates the hidden output $q_t$:
$$q_t = f(y_{t-1}, q_{t-1})$$
where $f$ is the function of the LSTM for one time step, and $y_{t-1}$ is the word generated at the previous time step. The attention mechanism, which has a trainable parameter matrix $W_t$, is used to capture the source information. Then, the word generator, with parameters $W_g$ and $b_g$, computes a score vector $s_t$ over the vocabulary and the probability of each output word at the $t$-th time step:
$$p_{w_t} = \mathrm{softmax}(s_t)$$
To get a sentence-level score for the generated sentence, we generate a sequence of word-level score vectors $s_t$ at all positions with the output layer of the decoder, and then we sum up the word-level score vectors to obtain a sentence-level score vector. Each value in the vector represents the sentence-level score of the corresponding word, and the index of the value is the index of the word in the dictionary. After normalizing the sentence-level score with the sigmoid function, we get a probability for each word, which represents how likely the word is to appear in the generated sentence regardless of its position. In contrast to the word-level probability $p_{w_t}$, the sentence-level probability $p_b$ of each word is independent of the position in the sentence.
More specifically, the sentence-level probability of the generated bag-of-words $p_b$ can be written as:
$$p_b = \mathrm{sigmoid}\left(\sum_{t=1}^{M} s_t\right)$$
where $M$ is the number of words in the target sentence.
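The difference between the word-level and the sentence-level probabilities can be sketched in plain Python. The score values and the vocabulary size of 4 are made up for illustration, and the functions are a simplified stand-in for the decoder output layer, not the released implementation.

```python
import math


def softmax(scores):
    """Normalize one score vector into a distribution over the vocabulary."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sentence_level_probs(word_scores):
    """Sum the score vectors s_t over all positions, then apply a sigmoid per word."""
    vocab_size = len(word_scores[0])
    summed = [sum(s_t[v] for s_t in word_scores) for v in range(vocab_size)]
    return [sigmoid(z) for z in summed]


# Toy scores: M = 3 positions, vocabulary of 4 words (hypothetical values).
scores = [
    [2.0, -1.0, -3.0, -2.0],
    [-1.0, 3.0, -2.0, -2.0],
    [1.5, -0.5, -3.0, -1.0],
]

p_w = [softmax(s_t) for s_t in scores]  # one distribution per position
p_b = sentence_level_probs(scores)      # one independent probability per word
print(p_b)
```

Each row of `p_w` sums to one because the words compete at a given position, whereas the entries of `p_b` are independent: words 0 and 1 both receive a high probability of appearing somewhere in the sentence.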

Targets and Loss Function
We have two targets at the training stage: the reference translation (which appears in the training set) and the bag-of-words. The bag-of-words is used as an approximate representation of the correct translations that do not appear in the training set. For the two targets, we have two loss terms:
$$l_1 = -\sum_{t=1}^{M} \log p_{w_t}(y_t)$$
$$l_2 = -\sum_{j=1}^{K} \log p_b(b_j)$$
The total loss function can be written as:
$$l = l_1 + \lambda_i l_2$$
where $\lambda_i$ is the coefficient that balances the two loss terms at the $i$-th epoch. Since the bag-of-words generation module is built on top of the word generation, we assign a small weight to the bag-of-words training at the beginning, and gradually increase the weight until it reaches a certain value $\lambda$:
$$\lambda_i = \min(\lambda, k + \alpha i)$$
In our experiments, we set $\lambda = 1.0$, $k = 0.1$, and $\alpha = 0.1$, based on the performance on the validation set.
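The weighting scheme can be sketched as follows. The sketch assumes a linear warm-up of the bag-of-words weight that starts at k, grows by α per epoch, and is capped at λ, and it assumes both loss terms are negative log-likelihoods; the epoch index starting at 0 and all function names are illustrative choices rather than details confirmed by the text.

```python
import math


def bow_weight(epoch, lam=1.0, k=0.1, alpha=0.1):
    """Weight of the bag-of-words loss: linear warm-up from k, capped at lam (assumed form)."""
    return min(lam, k + alpha * epoch)


def word_loss(p_w, target_ids):
    """Negative log-likelihood of the reference words under the per-position distributions."""
    return -sum(math.log(p_t[y_t]) for p_t, y_t in zip(p_w, target_ids))


def bow_loss(p_b, bag_ids):
    """Negative log-probability of the bag-of-words under the sentence-level probabilities."""
    return -sum(math.log(p_b[b]) for b in bag_ids)


def total_loss(p_w, target_ids, p_b, bag_ids, epoch):
    """Sentence loss plus the epoch-dependent weight times the bag-of-words loss."""
    return word_loss(p_w, target_ids) + bow_weight(epoch) * bow_loss(p_b, bag_ids)


# Toy example with a 2-word vocabulary and a 2-word target (hypothetical values).
p_w = [[0.5, 0.5], [0.25, 0.75]]
p_b = [0.9, 0.8]
target_ids = [0, 1]
bag_ids = [0, 1]

for epoch in (0, 4, 20):
    print(epoch, bow_weight(epoch), total_loss(p_w, target_ids, p_b, bag_ids, epoch))
```

With λ = 1.0, k = 0.1, and α = 0.1, the bag-of-words term contributes little in the first epochs, when the word generator is still poorly trained, and reaches full weight after about nine epochs.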

Experiments
This section introduces the details of our experiments, including datasets, setups, baseline models as well as results.

Datasets
We evaluated our proposed model on the NIST translation task for Chinese-English translation and provided the analysis on the same task. We trained our model on 1.25M sentence pairs extracted from LDC corpora, with 27.9M Chinese words and 34.5M English words. We validated our model on the dataset for the NIST 2002 translation task and tested our model on those for the NIST 2003, 2004, 2005, 2006, and 2008 translation tasks. We used the most frequent 50,000 words for both the Chinese vocabulary and the English vocabulary. The evaluation metric is BLEU (Papineni et al., 2002).
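Restricting the vocabulary to the most frequent words can be sketched as below. The handling of out-of-vocabulary words with an `<unk>` token, and the choice to reserve an extra index for it on top of the frequency-based entries, are assumptions for illustration; `build_vocab` and `encode` are hypothetical helpers.

```python
from collections import Counter

UNK = "<unk>"


def build_vocab(tokenized_sentences, max_size=50000):
    """Map the max_size most frequent words to indices; index 0 is reserved for <unk>."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    words = [w for w, _ in counts.most_common(max_size) if w != UNK]
    return {w: i for i, w in enumerate([UNK] + words)}


def encode(tokens, vocab):
    """Replace each token by its index, falling back to the <unk> index."""
    return [vocab.get(t, vocab[UNK]) for t in tokens]


# Toy corpus; max_size is set to 2 so that the rare word "c" falls out of the vocabulary.
corpus = [["a", "b", "a"], ["a", "c", "b"]]
vocab = build_vocab(corpus, max_size=2)
ids = encode(["a", "c", "d"], vocab)
print(vocab, ids)
```

Words outside the vocabulary, such as "c" and "d" in the toy corpus, all share the `<unk>` index, which is why `<unk>` can surface in generated translations.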

Setting
We implement the models using PyTorch, and the experiments are conducted on an NVIDIA 1080Ti GPU. Both the word embedding size and the hidden size are 512, and the batch size is 64. We use the Adam optimizer (Kingma and Ba, 2014) to train the model with the default setting $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 1 \times 10^{-8}$, and we initialize the learning rate to 0.0003. Based on the performance on the development sets, we use a 3-layer LSTM as the encoder and a 2-layer LSTM as the decoder. We clip the gradients (Pascanu et al., 2013) to the maximum norm of 10.0. Dropout is used with the dropout rate set to 0.2. Following previous work, we use beam search with a beam width of 10 to generate translations for the evaluation and test, and we normalize the log-likelihood scores by sentence length.
Table 2: Results of our model and the baselines (directly reported in the referred articles) on the Chinese-English translation. "-" means that the studies did not test the models on the corresponding datasets. Columns: Model, MT-02, MT-03, MT-04, MT-05, MT-06, MT-08, All.
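Two of these settings, gradient clipping by norm and length normalization of beam scores, can be sketched in plain Python. The sketch assumes that normalization means dividing the summed log-likelihood by the number of generated tokens, and it treats the gradient as a flat list of floats; both function names are illustrative.

```python
import math


def clip_by_global_norm(grads, max_norm=10.0):
    """Rescale the gradient so that its L2 norm does not exceed max_norm."""
    total = math.sqrt(sum(g * g for g in grads))
    if total <= max_norm:
        return list(grads)
    scale = max_norm / total
    return [g * scale for g in grads]


def length_normalized_score(token_log_probs):
    """Average log-likelihood per token (assumed form of the normalization)."""
    return sum(token_log_probs) / len(token_log_probs)


def pick_best(hypotheses):
    """Choose the finished beam hypothesis with the highest normalized score."""
    return max(hypotheses, key=lambda h: length_normalized_score(h[1]))


# A gradient of norm 50 is scaled down to norm 10.
clipped = clip_by_global_norm([30.0, 40.0])

# Two finished hypotheses: (tokens, per-token log-probabilities), hypothetical values.
short = (["a", "."], [-1.0, -1.0])
long_ = (["a", "b", "c", "."], [-0.6, -0.6, -0.6, -0.6])
best = pick_best([short, long_])
print(clipped, best[0])
```

Without normalization the shorter hypothesis would win, since its summed log-likelihood (-2.0) is higher than that of the longer one (-2.4); dividing by length removes this bias toward short outputs.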

Baselines
We compare our model with several NMT systems, and the results are directly reported in their articles.
• Moses is an open source phrase-based translation system with default configurations and a 4-gram language model trained on the training data for the target language.
• RNNSearch is a bidirectional GRU-based model with the attention mechanism. The results of Moses and RNNSearch come from Su et al. (2016).
• Lattice (Su et al., 2016) is a Seq2Seq model which encodes the sentences with multiple tokenizations.
• Mixed RNN (Li et al., 2017) extends RNNSearch with a mixed RNN as the encoder.
• POSTREG (Ganchev et al., 2010) extends RNNSearch with posterior regularization with a constrained posterior set. The results of CPR, and POSTREG come from .
• PKI extends RNNSearch with posterior regularization to integrate prior knowledge.
Table 2 shows the overall results of the systems on the Chinese-English translation task. We compare our model with our implementation of the Seq2Seq+Attention model. For a fair comparison, the experimental setting of Seq2Seq+Attention is the same as that of our model (BAT), so that it can be regarded as our proposed model without the bag-of-words target. The results show that our model achieves a BLEU score of 36.51 on the total test sets, which outperforms the Seq2Seq baseline by 4.55 BLEU.

Results
To further evaluate the performance of our model, we compare it with recent NMT systems that have been evaluated on the same training set and test sets as ours. Their results are directly reported in the referred articles. As shown in Table 2, our model achieves high BLEU scores on all of the NIST Machine Translation test sets, which demonstrates the effectiveness of our model.
We also give two translation examples of our model. As shown in Table 3, the translations of Seq2Seq+Attn omit some words, such as "of", "committee", and "protection", and contain some redundant words, like "human chromosome" and "<unk>". Compared with Seq2Seq, the translations of our model are more informative and adequate.
Source: 人类共有二十三对染色体。
Reference: Humans have a total of 23 pairs of chromosomes .
Seq2Seq+Attn: Humans have 23 pairs chromosomes in human chromosome .
+Bag-of-Words: There are 23 pairs of chromosomes in mankind .
Reference: An official from the olympics organization committee said : " this proposal represents the committee 's sensitivity to environmental protection . "
Seq2Seq+Attn: An official of the olympic preparatory committee said : " this proposal represents the <unk> of environmental sensitivity . "
+Bag-of-Words: An official of the olympic preparatory committee said : " this proposal represents the sensitivity of the preparatory committee on environmental protection . "
Table 3: Two translation examples.

Related Work
The studies of the encoder-decoder framework (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014) launched Neural Machine Translation. To improve the focus on the information in the encoder, the attention mechanism was proposed, which greatly improved the performance of the Seq2Seq model on NMT. Most of the existing NMT systems are based on the Seq2Seq model and the attention mechanism. Some of them have variant architectures to capture more information from the inputs (Su et al., 2016; Tu et al., 2016), and some improve the attention mechanism (Meng et al., 2016; Mi et al., 2016; Jean et al., 2015; Feng et al., 2016; Calixto et al., 2017), which also enhanced the performance of the NMT model.
There are also some effective neural networks other than RNNs. Gehring et al. (2017) turned the RNN-based model into a CNN-based model, which greatly improves the computation speed. Vaswani et al. (2017) used only the attention mechanism to build the model and showed outstanding performance. In addition, some studies incorporated external knowledge and achieved clear improvements (Li et al., 2017; Chen et al., 2017).
A previous study (Zhao et al., 2017) uses a similarly named component, the bag-of-word loss, but our work differs significantly from it. First, the methods are very different. The previous work uses the bag-of-word to constrain the latent variable, and the latent variable is the output of the encoder. In contrast, we use the bag-of-words to supervise the distribution of the generated words, which is the output of the decoder. Compared with the previous work, our method directly supervises the predicted distribution to improve the whole model, including the encoder, the decoder, and the output layer, whereas the previous work only supervises the output of the encoder, so only the encoder is trained by this loss. Second, the motivations are quite different. The bag-of-word loss in the previous work is an auxiliary component, while the bag-of-words in this paper is a direct target. Specifically, in the previous work, the bag-of-word loss is a component of a variational autoencoder that tackles the vanishing latent variable problem. In our paper, the bag-of-words is the representation of the unseen correct translations and tackles the data sparseness problem.

Conclusions and Future Work
We propose a method that regards both the reference translation (which appears in the training set) and the bag-of-words as the targets of Seq2Seq at the training stage. Experimental results show that our model obtains better performance than the strong baseline models on a popular Chinese-English translation dataset. In the future, we will explore how to apply our method to other language pairs, especially languages that are morphologically richer than English, and low-resource languages.