Vocabulary Manipulation for Neural Machine Translation

In order to capture rich language phenomena, neural machine translation models have to use a large vocabulary size, which requires high computing time and large memory usage. In this paper, we alleviate this issue by introducing a sentence-level or batch-level vocabulary, which is only a very small sub-set of the full output vocabulary. For each sentence or batch, we only predict the target words in its sentence-level or batch-level vocabulary. Thus, we reduce both the computing time and the memory usage. Our method simply takes into account the translation options of each word or phrase in the source sentence, and picks a very small target vocabulary for each sentence based on a word-to-word translation model or a bilingual phrase library learned from a traditional machine translation model. Experimental results on the large-scale English-to-French task show that our method achieves better translation performance by 1 BLEU point over the large vocabulary neural machine translation system of Jean et al. (2015).


Introduction
Neural machine translation (NMT) has gained popularity in the past two years. But it can only handle a small vocabulary size due to the computational complexity. In order to capture rich language phenomena and have better word coverage, neural machine translation models have to use a large vocabulary. Jean et al. (2015) alleviated the large vocabulary issue by proposing an approach that partitions the training corpus and defines a subset of the full target vocabulary for each partition. Thus, they only use a subset vocabulary for each partition in the training procedure, without increasing computational complexity. However, the method of Jean et al. (2015) still has some drawbacks. First, the importance sampling is simply based on the sequence of training sentences, which is not linguistically motivated; thus, translation ambiguity may not be captured in training. Second, the target vocabulary for each training batch is fixed during the whole training procedure. Third, the target vocabulary size for each batch during training still needs to be as large as 30k, so the computing time is still high.
* Accepted as a short paper in ACL 2016.
In this paper, we alleviate the above issues by introducing a sentence-level vocabulary, which is very small compared with the full target vocabulary. In order to capture the translation ambiguity, we generate those sentence-level vocabularies by utilizing word-to-word and phrase-to-phrase translation models which are learned from a traditional phrase-based machine translation system (SMT). Another motivation of this work is to combine the merits of both traditional SMT and NMT, since training an NMT system usually takes several weeks, while the word alignment and rule extraction for SMT are much faster (they can be done in one day). Thus, for each training sentence, we build a separate target vocabulary which is the union of the following three parts:
• target vocabularies of word and phrase translations that can be applied to the current sentence (to capture the translation ambiguity);
• the top 2k most frequent target words (to cover the unaligned target words);
• target words in the reference of the current sentence (to make the reference reachable).
As we use mini-batches in the training procedure, we merge the target vocabularies of all the sentences in each batch, and update only the related parameters for each batch. In addition, we shuffle the training sentences at the beginning of each epoch, so the target vocabulary for a specific sentence varies in each epoch. In the beam search for the development or test set, we apply a similar procedure for each source sentence, except for the third bullet (as we do not have the reference) and the mini-batch merging. Experimental results on the large-scale English-to-French task (Section 5) show that our method achieves significant improvements over the large vocabulary neural machine translation system.

Figure 1: The attention-based NMT architecture. $\overleftarrow{h}_i$ and $\overrightarrow{h}_i$ are bi-directional encoder states. $\alpha_{tj}$ is the attention probability at time $t$, position $j$. $H_t$ is the weighted sum of encoding states. $s_t$ is the hidden state. $o_t$ is an intermediate output state. A single feedforward layer projects $o_t$ to a target vocabulary $V_o$, and applies softmax to predict the probability distribution over the output vocabulary.

Neural Machine Translation
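As shown in Figure 1, neural machine translation is an encoder-decoder network. The encoder employs a bi-directional recurrent neural network to encode the source sentence $x = (x_1, ..., x_l)$, where $l$ is the sentence length, into a sequence of hidden states $h = (h_1, ..., h_l)$, each of which concatenates a right-to-left state $\overleftarrow{h}_i$ and a left-to-right state $\overrightarrow{h}_i$:
$$h_i = [\overleftarrow{h}_i; \overrightarrow{h}_i] = [\overleftarrow{f}(x_i, \overleftarrow{h}_{i+1}); \overrightarrow{f}(x_i, \overrightarrow{h}_{i-1})],$$
where $\overleftarrow{f}$ and $\overrightarrow{f}$ are two gated recurrent units (GRU).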
Given $h$, the decoder predicts the target translation by maximizing the conditional log-probability of the correct translation $y^* = (y^*_1, ..., y^*_m)$, where $m$ is the length of the target sentence. At each time $t$, the probability of each word $y_t$ from a target vocabulary $V_y$ is:
$$p(y_t \mid h, y^*_{t-1} .. y^*_1) = g(y^*_{t-1}, s_t, H_t),$$
where $g$ is a multi-layer feed-forward neural network, which takes the embedding of the previous word $y^*_{t-1}$, the hidden state $s_t$, and the context state $H_t$ as input. The output layer of $g$ is a target vocabulary $V_o$, with $y_t \in V_o$ in the training procedure. $V_o$ is originally defined as the full target vocabulary $V_y$. We apply the softmax function over the output layer, and get the probability $p(y_t \mid h, y^*_{t-1} .. y^*_1)$. In Section 3, we differentiate $V_o$ from $V_y$ by adding a separate and sentence-dependent $V_o$ for each source sentence. In this way, we are able to maintain a large $V_y$ and use a small $V_o$ for each sentence.
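The state $s_t$ is computed as:
$$s_t = q(s_{t-1}, y^*_{t-1}, c_t),$$
$$c_t = \sum_{j=1}^{l} \alpha_{tj} h_j,$$
where $q$ is a GRU and $c_t$ is a weighted sum of $h$; the weights $\alpha$ are computed with a feed-forward neural network $r$:
$$\alpha_{tj} = \frac{\exp\{r(s_{t-1}, h_j, y^*_{t-1})\}}{\sum_{k=1}^{l} \exp\{r(s_{t-1}, h_k, y^*_{t-1})\}}$$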
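As a minimal sketch of the attention step, the following Python code normalizes a list of alignment scores into weights and forms the weighted sum of encoder states. The scores are assumed to be already produced by the network $r$; the function names and the toy values are illustrative only and do not come from our implementation.

```python
import math

def attention_weights(scores):
    """Softmax over the alignment scores of one decoder step."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def context_vector(weights, encoder_states):
    """Weighted sum of the encoder states h_j with weights alpha_tj."""
    dim = len(encoder_states[0])
    return [
        sum(w * h[d] for w, h in zip(weights, encoder_states))
        for d in range(dim)
    ]

# Toy example: three source positions, two-dimensional encoder states.
scores = [0.0, 0.0, 0.0]                  # assumed outputs of r
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # assumed encoder states
alpha = attention_weights(scores)
c_t = context_vector(alpha, h)
```

With equal scores, every source position receives the same weight, so the context vector is the plain average of the encoder states.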

Target Vocabulary
The output of function $g$ is the probability distribution over the target vocabulary $V_o$. As $V_o$ is defined as $V_y$ in the standard model, the softmax function over $V_o$ requires computing the scores of all words in $V_o$, which results in high computational complexity. Thus, the standard model only uses the top 30k most frequent words for both $V_o$ and $V_y$, and replaces all other words with an unknown-word token (UNK).

Target Vocabulary Manipulation
In this section, we aim to use a large vocabulary $V_y$ (e.g. 500k, to have better word coverage) and, at the same time, to make $V_o$ as small as possible (in order to reduce the computing time). Our basic idea is to maintain a separate and small vocabulary $V_o$ for each sentence, so that we only need to compute the probability distribution of $g$ over a small vocabulary for each sentence. Thus, we introduce a sentence-level vocabulary $V_x$ to be our $V_o$, which depends on the sentence $x$. In the following, we show how we generate the sentence-dependent $V_x$. The first objective of our method is to capture the real translation ambiguity of each word, so the target vocabulary of a sentence, $V_o = V_x$, is supposed to cover as many of the possible translation candidates as it can. Take English-to-Chinese translation for example: the target vocabulary for the English word bank should contain both yínháng (a financial institution) and héàn (sloping land) in Chinese.
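To illustrate why a small output vocabulary saves computation, the sketch below scores and normalizes only the words of a sentence-level vocabulary, so the number of dot products equals the size of that vocabulary rather than the size of the full one. The data layout (one output weight row per word id) and all names are assumptions made for this example, not a description of our code base.

```python
import math

def restricted_softmax(o_t, output_rows, vocab_ids):
    """Score and normalize only the words in vocab_ids (the small V_o)."""
    scores = {}
    for w in vocab_ids:
        row = output_rows[w]  # only these rows of the output layer are touched
        scores[w] = sum(a * b for a, b in zip(row, o_t))
    m = max(scores.values())
    exps = {w: math.exp(s - m) for w, s in scores.items()}
    z = sum(exps.values())
    return {w: e / z for w, e in exps.items()}

# Toy full vocabulary with 5 word ids and 2-dimensional output rows.
output_rows = {
    0: [0.0, 1.0],
    1: [2.0, 0.0],
    2: [0.5, 0.5],
    3: [0.0, 2.0],
    4: [1.0, 1.0],
}
o_t = [1.0, 0.0]  # assumed intermediate output state
probs = restricted_softmax(o_t, output_rows, vocab_ids=[0, 3])
```

Here only two of the five rows are scored, and the resulting distribution is defined over those two words alone.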
So we first use a word-to-word translation dictionary to generate part of the target vocabulary for $x$. Given a dictionary $D(x) = [y_1, y_2, ...]$, where $x$ is a source word and $[y_1, y_2, ...]$ is a sorted list of candidate translations, we generate a target vocabulary $V^D_x$ for a sentence $x = (x_1, ..., x_l)$ by merging the candidates of all words $x_i$ in $x$:
$$V^D_x = \bigcup_{i=1}^{l} D(x_i)$$
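The dictionary-based vocabulary can be sketched in a few lines of Python. The dictionary is assumed to map each source word to a list of candidates already sorted by translation score; the cutoff `top_k` and the toy entries are illustrative assumptions.

```python
def dictionary_vocab(sentence, dictionary, top_k=10):
    """Union of the top_k dictionary candidates of every source word."""
    vocab = set()
    for word in sentence:
        # Source words missing from the dictionary contribute nothing.
        vocab.update(dictionary.get(word, [])[:top_k])
    return vocab

# Toy dictionary; candidate lists are assumed to be sorted best-first.
toy_dict = {
    "bank": ["yínháng", "héàn"],
    "river": ["hé"],
}
v_d = dictionary_vocab(["river", "bank"], toy_dict)
```

Lowering `top_k` shrinks the vocabulary at the cost of dropping less likely translations, such as the second sense of bank in this toy example.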
As the word-to-word translation dictionary only focuses on the source words, it cannot cover the unaligned target function or content words, which is what traditional phrases are designed for. Thus, in addition to the word dictionary, given a word-aligned training corpus, we also extract phrases $P(x_1 ... x_i) = [y_1, ..., y_j]$, where $x_1 ... x_i$ is a sequence of consecutive source words and $[y_1, ..., y_j]$ is a list of target words.¹ For each sentence $x$, we collect all the phrases that can be applied to $x$, i.e. those whose source side $x_i ... x_j$ is a sub-sequence of $x$:
$$V^P_x = \bigcup_{x_i ... x_j \in subseq(x)} P(x_i ... x_j),$$
where $subseq(x)$ is the set of all sub-sequences of $x$ within a length limit. In order to cover the unaligned target function words, we also need the top $n$ most common target words, which we denote $V^T_x$.
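The phrase-based vocabulary and the most-common-word list can be sketched as follows. The phrase table is assumed to map a tuple of consecutive source words to a list of target words; the length limit `max_len`, the cutoff `top_k`, and the toy data are assumptions for illustration.

```python
from collections import Counter

def subsequences(sentence, max_len):
    """All runs of consecutive source words up to max_len words long."""
    for i in range(len(sentence)):
        for j in range(i + 1, min(i + max_len, len(sentence)) + 1):
            yield tuple(sentence[i:j])

def phrase_vocab(sentence, phrase_table, max_len=3, top_k=10):
    """Union of the target words of every phrase that applies to the sentence."""
    vocab = set()
    for span in subsequences(sentence, max_len):
        vocab.update(phrase_table.get(span, [])[:top_k])
    return vocab

def top_n_vocab(target_corpus, n):
    """The n most frequent target words in a tokenized target corpus."""
    counts = Counter(w for sent in target_corpus for w in sent)
    return {w for w, _ in counts.most_common(n)}

toy_phrases = {("of", "the"): ["du", "des"]}
v_p = phrase_vocab(["one", "of", "the", "banks"], toy_phrases, max_len=2)
v_t = top_n_vocab([["de", "la"], ["de", "le"], ["de", "la"]], n=2)
```

Because only the set of target words matters here, the order of words inside a phrase entry plays no role.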
Training: in our training procedure, the optimization objective is to maximize the log-likelihood over the whole training set. In order to make the reference reachable, besides $V^D_x$, $V^P_x$ and $V^T_x$, we also need to include the target words in the reference $y$, denoted $V^R_x$, where $x$ and $y$ are a translation pair. So for each sentence $x$, we have a target vocabulary $V_x$:
$$V_x = V^D_x \cup V^P_x \cup V^T_x \cup V^R_x$$
Then, we start our mini-batch training by randomly shuffling the training sentences before each epoch. For simplicity, we use the union of all $V_x$ in a batch,
$$V_b = V_{x_1} \cup V_{x_2} \cup ... \cup V_{x_b},$$
where $b$ is the batch size. This merge has the advantage that $V_b$ changes dynamically in each epoch, which leads to a better coverage of parameters.
¹ Here we change the definition of a phrase in traditional SMT, where $[y_1, ..., y_j]$ should also be consecutive target words. But our task in this paper is to get the target vocabulary, so we only care about the set of target words, not their order.
Decoding: different from training, the target vocabulary for a sentence $x$ does not include the reference words,
$$V_o = V_x = V^D_x \cup V^P_x \cup V^T_x,$$
and we do not use mini-batches in decoding.
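The difference between the training and decoding vocabularies can be sketched as below. The component sets are assumed to be computed beforehand; the function names and toy words are hypothetical.

```python
def sentence_vocab(v_dict, v_phrase, v_top, reference=None):
    """Sentence-level vocabulary: the reference is added in training only."""
    vocab = set(v_dict) | set(v_phrase) | set(v_top)
    if reference is not None:  # training: make the reference reachable
        vocab |= set(reference)
    return vocab

def batch_vocab(sentence_vocabs):
    """Batch-level vocabulary: union over the sentences of a mini-batch."""
    return set().union(*sentence_vocabs)

train_1 = sentence_vocab({"banque"}, {"la", "banque"}, {"de", "la"},
                         reference=["la", "rive"])
train_2 = sentence_vocab({"maison"}, set(), {"de", "la"},
                         reference=["la", "maison"])
decode_1 = sentence_vocab({"banque"}, {"la", "banque"}, {"de", "la"})
v_b = batch_vocab([train_1, train_2])
```

Since the sentences are reshuffled before each epoch, the same sentence lands in different batches and therefore sees a different batch-level vocabulary each time.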

Related Work
To address the large vocabulary issue in NMT, Jean et al. (2015) propose a method that uses different but small sub-vocabularies for different partitions of the training corpus. They first partition the training set. Then, for each partition, they create a sub-vocabulary $V_p$, and only predict and apply softmax over the words in $V_p$ in the training procedure. When the training moves to the next partition, they change the sub-vocabulary accordingly. Noise-contrastive estimation (Gutmann and Hyvärinen, 2010; Mnih and Teh, 2012; Mikolov et al., 2013; Mnih and Kavukcuoglu, 2013) and hierarchical classes (Mnih and Hinton, 2009) are introduced to stochastically approximate the target word probability. But, as suggested by Jean et al. (2015), those methods are only designed to reduce the time complexity in training, not in decoding.

Data Preparation
We run our experiments on the English-to-French (En-Fr) task. The training corpus consists of approximately 12 million sentences, which is identical to the set of Jean et al. (2015) and Sutskever et al. (2014). Our development set is the concatenation of news-test-2012 and news-test-2013, which has 6003 sentences in total. Our test set has 3003 sentences from WMT news-test 2014. We evaluate the translation quality using the case-sensitive BLEU-4 metric (Papineni et al., 2002) with the multi-bleu.perl script. Following Jean et al. (2015), our full vocabulary size is 500k, we train with AdaDelta (Zeiler, 2012), and the mini-batch size is 80. Given the training set, we first run fast_align (Dyer et al., 2013) in one direction, and use the translation table as our word-to-word dictionary. Then we run the reverse direction and apply the 'grow-diag-final-and' heuristic to get the alignment. The phrase table is extracted with the standard algorithm in Moses (Koehn et al., 2007).
In the decoding procedure, our method is very similar to the 'candidate list' of Jean et al. (2015), except that we also use bilingual phrases and only include the top 2k most frequent target words. Following Jean et al. (2015), we dump the alignments for each sentence, and replace UNKs using the word-to-word dictionary or the source word.
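One plausible reading of the UNK replacement step is sketched below: each UNK is replaced by the best dictionary translation of its aligned source word, and the source word is copied when the dictionary has no entry. The alignment format (target position to source position) and the fallback order are assumptions for this example.

```python
def replace_unks(target, source, alignment, dictionary, unk="UNK"):
    """Replace each UNK via the dictionary, or copy the aligned source word."""
    output = []
    for t, word in enumerate(target):
        if word == unk and t in alignment:
            src_word = source[alignment[t]]
            candidates = dictionary.get(src_word)
            # Use the best dictionary candidate, else copy the source word.
            output.append(candidates[0] if candidates else src_word)
        else:
            output.append(word)
    return output

source = ["Obama", "visits", "London"]
target = ["UNK", "visite", "UNK"]
alignment = {0: 0, 1: 1, 2: 2}       # assumed target-to-source positions
toy_dict = {"London": ["Londres"]}   # toy dictionary entry
fixed = replace_unks(target, source, alignment, toy_dict)
```

Copying the source word is mainly useful for names and numbers, which are often identical in both languages.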

Reference Reachability
The reference coverage, or reachability ratio, is very important when we limit the target vocabulary for each source sentence: we do not have the reference at decoding time, and we do not want to narrow the search into a bad space. Table 1 shows the average reference coverage ratios (at the word level) on the training and development sets. For each source sentence $x$, $V^*_x$ here is a set of target word indexes (the vocabulary size is 500k; other words are mapped to UNK). The average reference vocabulary size $|V^R_x|$ per sentence is 23.7 on the training set (22.6 on the dev. set). The word-to-word dictionary $V^D_x$ has better coverage than the phrases $V^P_x$, and when we combine the three sets we get better coverage ratios. These statistics suggest that we cannot use any of them alone, due to the low reference coverage ratios. The last three columns show three combinations, all of which have coverage ratios higher than 90%. There are many possible combinations, but training an NMT system is time consuming, and we also want to keep the output vocabulary small (the setting in the last column of Table 1 results in an average vocabulary size of 11k for a mini-batch of 80). Thus, in the following, we only run one combination (top 10 candidates for both $V^P_x$ and $V^D_x$, and top 2k for $V^T_x$), for which the full-sentence coverage ratio is 20.7% on the development set.
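The two coverage figures discussed above can be computed along the following lines. This sketch assumes that word-level coverage is measured over the distinct reference words of each sentence and then averaged over sentences; the exact averaging used for Table 1 may differ.

```python
def coverage_stats(references, vocabs):
    """Return (average word-level coverage, full-sentence coverage)."""
    ratios = []
    fully_covered = 0
    for ref, vocab in zip(references, vocabs):
        ref_types = set(ref)            # distinct reference words
        hits = len(ref_types & vocab)   # those reachable in the vocabulary
        ratios.append(hits / len(ref_types))
        fully_covered += hits == len(ref_types)
    return sum(ratios) / len(ratios), fully_covered / len(ratios)

refs = [["a", "b"], ["c", "d", "c"]]
vocabs = [{"a", "b", "x"}, {"c"}]
word_cov, sent_cov = coverage_stats(refs, vocabs)
```

A sentence counts toward full-sentence coverage only when every one of its reference words is in the vocabulary, which is why that ratio is much lower than the word-level one.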

Average Size of $V_o$
With the setting shown in the bold column of Table 1, we list the average vocabulary sizes of Jean et al. (2015) and ours in Table 2. Jean et al. (2015) fix the vocabulary size to 30k for each sentence and mini-batch, while our approach reduces the vocabulary size to 2080 for each sentence and 6153 for each mini-batch. In particular, at decoding time our vocabulary size for each sentence is about 14.5 times smaller than 30k.

Translation Results
The red solid line in Figure 2 shows the learning curve of our method on the development set, which peaks at epoch 7 with a BLEU score of 30.72. We also fix the word embeddings at epoch 5 and continue for several more epochs. The corresponding blue dashed line suggests that there is no significant difference between the two.
We also run two more experiments: [...] and 34.23 separately. Those results suggest that we should use both the translation dictionary and phrases in order to get better translation quality.

Average size of $V_o$: 202, 324, 605, 1089, 2067, 10029
Table 3: Given a trained NMT model, we decode the development set with various top $n$ most common target words. For the En-Fr task, the results suggest that we can reduce $n$ to 50 without losing much in terms of BLEU score. The average size of $V_o$ is reduced to as small as 202, which is significantly lower than 2067 (the default setting we use in our training).

single system            dev.    test
Moses                    N/A     33.30
Jean (2015)
Durrani et al. (2014)    N/A     37.03
Table 4: Single system results on the En-Fr task.

Table 4 shows the single system results on the En-Fr task. The standard Moses baseline scores 33.3 on the test set. Our target vocabulary manipulation achieves a BLEU score of 34.45 on the test set, and 35.11 after the UNK replacement. Our approach improves the translation quality by 1.0 BLEU point on the test set over the method of Jean et al. (2015). But our single system is still about 2 points behind the best phrase-based system (Durrani et al., 2014).

Decoding with Different Top n Most Common Target Words
Another interesting question is how the performance changes if we vary the number $n$ of most common target words in $V^T_x$. As training an NMT system is time consuming, we vary $n$ only at decoding time. Table 3 shows the BLEU scores on the development set. When we reduce $n$ from 2000 to 50, we only lose 0.1 points, and the average size of the sentence-level $V_o$ is reduced to 202, which is significantly smaller than 2067 (shown in Table 2). But we should note that we train our NMT model under the condition of the bold column in Table 2, and only vary $n$ in our decoding procedure. Thus there is a mismatch between training and testing when $n$ is not 2000.

Speed
In terms of speed, as Jean et al. (2015) and we use different code bases, it is hard to conduct an apples-to-apples comparison. So, for simplicity, we run another experiment with our code base, and increase the size of $V_b$ to 30k for each batch (the same size as in Jean et al. (2015)). Results show that increasing $V_b$ to 30k slows down training by 1.5 times.

Conclusion
In this paper, we address the large vocabulary issue in neural machine translation by proposing to use a sentence-level target vocabulary $V_o$, which is much smaller than the full target vocabulary $V_y$. The small size of $V_o$ reduces the computing time of the softmax function in each prediction step, while the large vocabulary $V_y$ enables us to model rich language phenomena. The sentence-level vocabulary $V_o$ is generated with the traditional word-to-word and phrase-to-phrase translation libraries. In this way, we decrease the size of the output vocabulary $V_o$ to under 3k for each sentence, and we both speed up and improve the large-vocabulary NMT system.