Neural Machine Translation via Binary Code Prediction

In this paper, we propose a new method for calculating the output layer in neural machine translation systems. The method is based on predicting a binary code for each word and can reduce computation time/memory requirements of the output layer to be logarithmic in vocabulary size in the best case. In addition, we also introduce two advanced approaches to improve the robustness of the proposed model: using error-correcting codes and combining softmax and binary codes. Experiments on two English-Japanese bidirectional translation tasks show proposed models achieve BLEU scores that approach the softmax, while reducing memory usage to the order of less than 1/10 and improving decoding speed on CPUs by x5 to x10.


Introduction
When handling broad or open domains, machine translation systems usually have to handle a large vocabulary as their inputs and outputs. This is particularly a problem in neural machine translation (NMT) models , such as the attention-based models (Bahdanau et al., 2014;Luong et al., 2015) shown in Figure 1. In these models, the output layer is required to generate a specific word from an internal vector, and a large vocabulary size tends to require a large amount of computation to predict each of the candidate word probabilities.
Because this is a significant problem for neural language and translation models, there are a number of methods proposed to resolve this problem, which we detail in Section 2.2. However, none of these previous methods simultaneously satisfies the following desiderata, all of which, we argue, are desirable for practical use in NMT systems: Memory efficiency: The method should not require large memory to store the parameters and calculated vectors to maintain scalability in resource-constrained environments.
Time efficiency: The method should be able to train the parameters efficiently, and possible to perform decoding efficiently with choosing the candidate words from the full probability distribution. In particular, the method should be performed fast on general CPUs to suppress physical costs of computational resources for actual production systems.
Compatibility with parallel computation: It should be easy for the method to be minibatched and optimized to run efficiently on GPUs, which are essential for training large NMT models.
In this paper, we propose a method that satisfies all of these conditions: requires significantly less memory, fast, and is easy to implement minibatched on GPUs. The method works by not predicting a softmax over the entire output vocab-ulary, but instead by encoding each vocabulary word as a vector of binary variables, then independently predicting the bits of this binary representation. In order to represent a vocabulary size of 2 n , the binary representation need only be at least n bits long, and thus the amount of computation and size of parameters required to select an output word is only O(log V ) in the size of the vocabulary V , a great reduction from the standard linear increase of O(V ) seen in the original softmax.
While this idea is simple and intuitive, we found that it alone was not enough to achieve competitive accuracy with real NMT models. Thus we make two improvements: First, we propose a hybrid model, where the high frequency words are predicted by a standard softmax, and low frequency words are predicted by the proposed binary codes separately. Second, we propose the use of convolutional error correcting codes with Viterbi decoding (Viterbi, 1967), which add redundancy to the binary representation, and even in the face of localized mistakes in the calculation of the representation, are able to recover the correct word.
In experiments on two translation tasks, we find that the proposed hybrid method with error correction is able to achieve results that are competitive with standard softmax-based models while reducing the output layer to a fraction of its original size.

Formulation and Standard Softmax
Most of current NMT models use one-hot representations to represent the words in the output vocabulary -each word w is represented by a unique sparse vector e id(w) ∈ R V , in which only one element at the position corresponding to the word ID id(w) ∈ {x ∈ N | 1 ≤ x ≤ V } is 1, while others are 0. V represents the vocabulary size of the target language. NMT models optimize network parameters by treating the one-hot representation e id(w) as the true probability distribution, and minimizing the cross entropy between it and the softmax probability v: where sum x represents the sum of all elements in x, x i represents the i-th element of x, W hu ∈ R V ×H and β u ∈ R V are trainable parameters and H is the total size of hidden layers directly connected to the output layer. According to Equation (4), this model clearly requires time/space computation in proportion to O(HV ), and the actual load of the computation of the output layer is directly affected by the size of vocabulary V , which is typically set around tens of thousands .

Prior Work on Suppressing Complexity of NMT Models
Several previous works have proposed methods to reduce computation in the output layer. The hierarchical softmax (Morin and Bengio, 2005) predicts each word based on binary decision and reduces computation time to O(H log V ). However, this method still requires O(HV ) space for the parameters, and requires calculation much more complicated than the standard softmax, particularly at test time.
The differentiated softmax (Chen et al., 2016) divides words into clusters, and predicts words using separate part of the hidden layer for each word clusters. This method make the conversion matrix of the output layer sparser than a fully-connected softmax, and can reduce time/space computation amount by ignoring zero part of the matrix. However, this method restricts the usage of hidden layer, and the size of the matrix is still in proportion to V .
Sampling-based approximations (Mnih and Teh, 2012;Mikolov et al., 2013) to the denominator of the softmax have also been proposed to reduce calculation at training. However, these methods are basically not able to be applied at test time, still require heavy computation like the standard softmax.
Vocabulary selection approaches (Mi et al., 2016;L'Hostis et al., 2016) can also reduce the vocabulary size at testing, but these methods abandon full search over the target space and the quality of picked vocabularies directly affects the translation quality.
Other methods using characters (Ling et al., 2015) or subwords (Sennrich et al., 2016;Chitnis and DeNero, 2015) can be applied to suppress the vocabulary size, but these methods also make for longer sequences, and thus are not a direct solution to problems of computational efficiency.

Representing Words using Bit Arrays
Figure 2(a) shows the conventional softmax prediction, and Figure 2(b) shows the binary code prediction model proposed in this study. Unlike the conventional softmax, the proposed method predicts each output word indirectly using dense bit arrays that correspond to each word. Let b(w) := [b 1 (w), b 2 (w), · · · , b B (w)] ∈ {0, 1} B be the target bit array obtained for word w, where each b i (w) ∈ {0, 1} is an independent binary function given w, and B is the number of bits in whole array. For convenience, we introduce some constraints on b. First, a word w is mapped to only one bit array b(w). Second, all unique words can be discriminated by b, i.e., all bit arrays satisfy that: 1 Third, multiple bit arrays can be mapped to the same word as described in Section 3.5. By considering second constraint, we can also constrain B ≥ log 2 V , because b should have at least V unique representations to distinguish each word. The output layer of the network independently predicts B probability values q := [q 1 (h), q 2 (h), · · · , q B (h)] ∈ [0, 1] B using the 1 We designed this injective condition using the id(·) function to ignore task-specific sensitivities between different word surfaces (e.g. cases, ligatures, etc.). current hidden values h by logistic regressions: where W hq ∈ R B×H and β q ∈ R B are trainable parameters. When we assume that each q i is the probability that "the i-th bit becomes 1," the joint probability of generating word w can be represented as: wherex := 1 − x. We can easily obtain the maximum-probability bit array from q by simply assuming the i-th bit is 1 if q i ≥ 1/2, or 0 otherwise. However, this calculation may generate invalid bit arrays which do not correspond to actual words according to the mapping between words and bit arrays. For now, we simply assume that w = UNK (unknown) when such bit arrays are obtained, and discuss alternatives later in Section 3.5.
The constraints described here are very general requirements for bit arrays, which still allows us to choose between a wide variety of mapping functions. However, designing the most appropriate mapping method for NMT models is not a trivial problem. In this study, we use a simple mapping method described in Algorithm 1, which was empirically effective in preliminary experiments. 2 Here, V is the set of V target words including 3 extra markers: UNK, BOS (begin-of-sentence), and EOS (end-of-sentence), and rank(w) ∈ N >0 is the rank of the word according to their frequencies in the training corpus. Algorithm 1 is one of the minimal mapping methods (i.e., satisfying B = log 2 V ), and generated bit arrays have the characteristics that their higher bits roughly represents the frequency of corresponding words (e.g., if w is frequently appeared in the training corpus, higher bits in b(w) tend to become 0).

Loss Functions
For learning correct binary representations, we can use any loss functions that is (sub-)differentiable and satisfies a constraint that: 2 Other methods examined included random codes, Huffman codes (Huffman, 1952) and Brown clustering (Brown et al., 1992) with zero-padding to adjust code lengths, and some original allocation methods based on the word2vec embeddings (Mikolov et al., 2013).
Algorithm 1 Mapping words to bit arrays.
where L is the minimum value of the loss function which typically does not affect the gradient descent methods. For example, the squareddistance: or the cross-entropy: are candidates for the loss function. We also examined both loss functions in the preliminary experiments, and in this paper, we only used the squared-distance function (Equation (10)), because this function achieved higher translation accuracies than Equation (11). 3

Efficiency of the Binary Code Prediction
The computational complexity for the parameters W hq and β q is O(HB). This is equal to O(H log V ) when using a minimal mapping method like that shown in Algorithm 1, and is significantly smaller than O(HV ) when using standard softmax prediction. For example, if we chose V = 65536 = 2 16 and use Algorithm 1's mapping method, then B = 16 and total amount of computation in the output layer could be suppressed to 1/4096 of its original size. On a different note, the binary code prediction model proposed in this study shares some ideas with the hierarchical softmax (Morin and Bengio, 2005) approach. Actually, when we used a binarytree based mapping function for b, our model can be interpreted as the hierarchical softmax with two strong constraints for guaranteeing independence between all bits: all nodes in the same level of the hierarchy share their parameters, and all levels of the hierarchy are predicted independently of each other. By these constraints, all bits in b can be calculated in parallel. This is particularly important because it makes the model conducive to being calculated on parallel computation backends such as GPUs.
However, the binary code prediction model also introduces problems of robustness due to these strong constraints. As the experimental results show, the simplest prediction model which directly maps words into bit arrays seriously decreases translation quality. In Sections 3.4 and 3.5, we introduce two additional techniques to prevent reductions of translation quality and improve robustness of the binary code prediction model.

Hybrid Softmax/Binary Model
According to the Zipf's law (Zipf, 1949), the distribution of word appearances in an actual corpus is biased to a small subset of the vocabulary. As a result, the proposed model mostly learns characteristics for frequent words and cannot obtain enough opportunities to learn for rare words. To alleviate this problem, we introduce a hybrid model using both softmax prediction and binary code prediction as shown in Figure 2(c). In this model, the output layer calculates a standard softmax for the N − 1 most frequent words and an OTHER marker which indicates all rare words. When the softmax layer predicts OTHER, then the binary code layer is used to predict the representation of rare words. In this case, the actual probability of generating a particular word can be separated into two equations according to the frequency of words: π(w, h) := Pr(b(w)|q(h)), where W hu ∈ R N ×H and β u ∈ R N are trainable parameters, and id(w) assumes that the value corresponds to the rank of frequency of each word. We also define the loss function for the hybrid Figure 3: Example of the classification problem using redundant bit array mapping. model using both softmax and binary code losses: where λ H and λ B are hyper-parameters to determine strength of both softmax/binary code losses. These also can be adjusted according to the training data, but in this study, we only used λ H = λ B = 1 for simplicity. The computational complexity of the hybrid model is O(H(N + log V )), which is larger than the original binary code model O(H log V ). However, N can be chosen as N V because the softmax prediction is only required for a few frequent words. As a result, we can control the actual computation for the hybrid model to be much smaller than the standard softmax complexity O(HV ), The idea of separated prediction of frequent words and rare words comes from the differentiated softmax (Chen et al., 2016) approach. However, our output layer can be configured as a fullyconnected network, unlike the differentiated softmax, because the actual size of the output layer is still small after applying the hybrid model.

Applying Error-correcting Codes
The 2 methods proposed in previous sections impose constraints for all bits in q, and the value of each bit must be estimated correctly for the correct word to be chosen. As a result, these models may generate incorrect words due to even a single bit error. This problem is the result of dense mapping between words and bit arrays, and can be avoided by creating redundancy in the bit array. Figure 3 shows a simple example of how this idea works when discriminating 2 words using 3 bits. In this case, the actual words are obtained by estimating the nearest centroid bit array according to the Hamming distance between each centroid and the predicted bit array. This approach can predict correct words as long as the predicted bit arrays are in the set of neighbors for the correct centroid (gray regions in the Figure 3), i.e., up to a 1-bit error in the predicted bits can be corrected. This ability to be robust to errors is a central idea behind error-correcting codes (Shannon, 1948). In general, an error-correcting code has the ability to correct up to (d − 1)/2 bit errors when all centroids differ d bits from each other (Golay, 1949). d is known as the free distance determined by the design of error-correcting codes. Errorcorrecting codes have been examined in some previous work on multi-class classification tasks, and have reported advantages from the raw classification (Dietterich and Bakiri, 1995; Klautau et al., 2003;Liu, 2006;Kouzani and Nasireding, 2009;Kouzani, 2010;Lin, 2011, 2013). In this study, we applied an error-correcting algorithm to the bit array obtained from Algorithm 1 to improve robustness of the output layer in an NMT system. A challenge in this study is trying a large classification (#classes > 10,000) with error-correction, unlike previous studies focused on solving comparatively small tasks (#classes < 100). And this study also tries to solve a generation task unlike previous studies. As shown in the experiments, we found that this approach is highly effective in these tasks. Figure 4 (a) and (b) illustrate the training and generation processes for the model with errorcorrecting codes. In the training, we first convert the original bit arrays b(w) to a center bit array b in the space of error-correcting code: where B (B) ≥ B is the number of bits in the error-correcting code. The NMT model learns its Algorithm 2 Encoding into a convolutional code.
parameters based on the loss between predicted probabilities q and b . Note that typical errorcorrecting codes satisfy O(B /B) = O(1), and this characteristic efficiently suppresses the increase of actual computation cost in the output layer due to the application of the error-correcting code. In the generation of actual words, the decoding method of the error-correcting code converts the redundant predicted bits q into a dense representationq := [q 1 (q),q 2 (q), · · · ,q B (q)], and usesq as the bits to restore the word, as is done in the method described in the previous sections.
It should be noted that the method for performing error correction directly affects the quality of the whole NMT model. For example, the mapping shown in Figure 3 has only 3 bits and it is clear that these bits represent exactly the same information as each other. In this case, all bits can be estimated using exactly the same parameters, and we can not expect that we will benefit significantly from applying this redundant representation. Therefore, we need to choose an error correction method in which the characteristics of original bits should be distributed in various positions of the resulting bit arrays so that errors in bits are not highly correlated with each-other. In addition, it is desirable that the decoding method of the applied error-correcting code can directly utilize the probabilities of each bit, because q generated by the network will be a continuous probabilities between zero and one.
In this study, we applied convolutional codes (Viterbi, 1967) to convert between original and redundant bits. Convolutional codes perform a set of bit-wise convolutions between original bits and weight bits (which are hyper-parameters). They are well-suited to our setting here because they distribute the information of original bits in different places in the resulting bits, work robustly for random bit errors, and can be decoded using Algorithm 3 Decoding from a convolutional code.
Algorithm 2 describes the particular convolutional code that we applied in this study, with two convolution weights [1001111] and [1101101] as fixed hyper-parameters. 4 Where x[i .. j] := [x i , · · · , x j ] and x · y := i x i y i . On the other hand, there are various algorithms to decode convolutional codes with the same format which are based on different criteria. In this study, we use the decoding method described in Algorithm 3, where x • y represents the concatenation of vectors x and y. This method is based on the Viterbi algorithm (Viterbi, 1967) and estimates original bits by directly using probability of redundant bits. Although Algorithm 3 looks complicated, this algorithm can be performed efficiently on CPUs at test time, and is not necessary at training time when we are simply performing calculation of Equation (6). Algorithm 2 increases the number of bits from B into B = 2(B +6), but does not restrict the actual value of B. We examined the performance of the proposed methods on two English-Japanese bidirectional translation tasks which have different translation difficulties: ASPEC (Nakazawa et al., 2016) and BTEC (Takezawa, 1999). Table 1 describes details of two corpora. To prepare inputs for training, we used tokenizer.perl in Moses (Koehn et al., 2007) and KyTea (Neubig et al., 2011) for English/Japanese tokenizations respectively, applied lowercase.perl from Moses, and replaced out-of-vocabulary words such that rank(w) > V − 3 into the UNK marker.
We implemented each NMT model using C++ in the DyNet framework (Neubig et al., 2017) and trained/tested on 1 GPU (GeForce GTX TITAN X). Each test is also performed on CPUs to compare its processing time. We used a bidirectional RNN-based encoder applied in Bahdanau et al. (2014), unidirectional decoder with the same style of (Luong et al., 2015), and the concat global attention model also proposed in Luong et al. (2015). Each recurrent unit is constructed using a 1-layer LSTM (input/forget/output gates and nonpeepholes) (Gers et al., 2000) with 30% dropout (Srivastava et al., 2014) for the input/output vectors of the LSTMs. All word embeddings, recurrent states and model-specific hidden states are designed with 512-dimentional vectors. Only output layers and loss functions are replaced, and other network architectures are identical for the conventional/proposed models. We used the Adam optimizer (Kingma and Ba, 2014) with fixed hyperparameters α = 0.001, β 1 = 0.9 β 2 = 0.999, ε = 10 −8 , and mini-batches with 64 sentences sorted according to their sequence lengths. For evaluating the quality of each model, we calculated case-insensitive BLEU (Papineni et al., 2002) every 1000 mini-batches. Table 2 lists summaries of all methods we examined in experiments. Softmax prediction (Fig. 2(a)) Binary Fig. 2 Table 3 shows the BLEU on the test set (bold and italic faces indicate the best and second places in each task), number of bits B (or B ) for the binary code, actual size of the output layer # out , number of parameters in the output layer # W,β , as well as the ratio of # W,β or amount of whole parameters compared with Softmax, and averaged processing time at training (per mini-batch on GPUs) and test (per sentence on GPUs/CPUs), respectively. Figure 5(a) and 5(b) shows training curves up to 180,000 epochs about some English→Japanese settings. To relax instabilities of translation qualities while training (as shown in Figure 5(a) and 5(b)), each BLEU in Table 3 is calculated by averaging actual test BLEU of 5 consecutive results  Figure 6: BLEU changes in the Hybrid-N methods according to the softmax size (En→Ja).

Results and Discussion
around the epoch that has the highest dev BLEU.
First, we can see that each proposed method largely suppresses the actual size of the output layer from ten to one thousand times compared with the standard softmax. By looking at the total number of parameters, we can see that the proposed models require only 70% of the actual memory, and the proposed model reduces the total number of parameters for the output layers to a practically negligible level. Note that most of remaining parameters are used for the embedding lookup at the input layer in both encoder/decoder. These still occupy O(EV ) memory, where E represents the size of each embedding layer and usually O(E/H) = O(1). These are not targets to be reduced in this study because these values rarely are accessed at test time because we only need to access them for input words, and do not need them to always be in the physical memory. It might be possible to apply a similar binary representation as that of output layers to the input layers as well, then express the word embedding by multiplying this binary vector by a word embedding matrix. This is one potential avenue of future work.
Taking a look at the BLEU for the simple Binary method, we can see that it is far lower than other models for all tasks. This is expected, as described in Section 3, because using raw bit arrays causes many one-off estimation errors at the output layer due to the lack of robustness of the output representation. In contrast, Hybrid-N and Binary-EC models clearly improve BLEU from Binary, and they approach that of Softmax. This demonstrates that these two methods effectively improve the robustness of binary code prediction models. Especially, Binary-EC generally achieves higher quality than Hybrid-512 despite the fact that it suppress the number of parameters by about 1/10. These results show that introducing redundancy to target bit arrays is more effective than incremental prediction. In addition, the Hybrid-N-EC model achieves the highest BLEU in all proposed methods, and in particular, comparative or higher BLEU than Softmax in BTEC. This behavior clearly demonstrates that these two methods are orthogonal, and combining them together can be effective. We hypothesize that the lower quality of Softmax in BTEC is caused by an over-fitting due to the large number of parameters required in the softmax prediction.
The proposed methods also improve actual computation time in both training and test. In particular on CPU, where the computation speed is directly affected by the size of the output layer, the proposed methods translate significantly faster than Softmax by x5 to x20. In addition, we can also see that applying error-correcting code is also effictive with respect to the decoding speed. Figure 6 shows the trade-off between the translation quality and the size of softmax layers in the hybrid prediction model (Figure 2(c)) without error-correction. According to the model definition in Section 3.4, the softmax prediction and raw binary code prediction can be assumed to be the upper/lower-bound of the hybrid prediction model. The curves in Figure 6 move between Softmax and Binary models, and this behavior intuitively explains the characteristics of the hybrid prediction. In addition, we can see that the BLEU score in BTEC quickly improves, and saturates at N = 1024 in contrast to the ASPEC model, which is still improving at N = 2048. We presume that the shape of curves in Figure 6 is also affected by the difficulty of the corpus, i.e., when we train the hybrid model for easy datasets (e.g., BTEC is easier than ASPEC), it is enough to use a small softmax layer (e.g. N ≤ 1024).

Conclusion
In this study, we proposed neural machine translation models which indirectly predict output words via binary codes, and two model improvements: a hybrid prediction model using both softmax and binary codes, and introducing error-correcting codes to introduce robustness of binary code prediction. Experiments show that the proposed model can achieve comparative translation qualities to standard softmax prediction, while significantly suppressing the amount of parameters in the output layer, and improving calculation speeds while training and especially testing.
One interesting avenue of future work is to automatically learn encodings and error correcting codes that are well-suited for the type of binary code prediction we are performing here. In Algorithms 2 and 3 we use convolutions that were determined heuristically, and it is likely that learning these along with the model could result in improved accuracy or better compression capability.