Unfolding and Shrinking Neural Machine Translation Ensembles

Ensembling is a well-known technique in neural machine translation (NMT) to improve system performance. Instead of a single neural net, multiple neural nets with the same topology are trained separately, and the decoder generates predictions by averaging over the individual models. Ensembling often improves the quality of the generated translations drastically. However, it is not suitable for production systems because it is cumbersome and slow. This work aims to reduce the runtime to be on par with a single system without compromising the translation quality. First, we show that the ensemble can be unfolded into a single large neural network which imitates the output of the ensemble system. We show that unfolding can already improve the runtime in practice since more work can be done on the GPU. We proceed by describing a set of techniques to shrink the unfolded network by reducing the dimensionality of layers. On Japanese-English we report that the resulting network has the size and decoding speed of a single NMT network but performs on the level of a 3-ensemble system.


Introduction
The top systems in recent machine translation evaluation campaigns on various language pairs use ensembles of a number of NMT systems (Bojar et al., 2016;Sennrich et al., 2016a;Chung et al., 2016;Neubig, 2016;Wu et al., 2016;Cromieres et al., 2016;Durrani et al., 2017). Ensembling (Dietterich, 2000;Hansen and Salamon, 1990) of neural networks is a simple yet very effective technique to improve the accuracy of NMT.
The decoder makes use of K NMT networks which are either trained independently (Sutskever et al., 2014;Chung et al., 2016;Neubig, 2016;Wu et al., 2016) or share some amount of training iterations (Sennrich et al., 2016b,a;Cromieres et al., 2016;Durrani et al., 2017). The ensemble decoder computes predictions from each of the individual models which are then combined using the arithmetic average (Sutskever et al., 2014) or the geometric average (Cromieres et al., 2016).
Ensembling consistently outperforms single NMT by a large margin. However, the decoding speed is significantly worse since the decoder needs to apply K NMT models rather than only one. Therefore, a recent line of research transfers the idea of knowledge distillation (Bucilu et al., 2006;Hinton et al., 2014) to NMT and trains a smaller network (the student) by minimizing the cross-entropy to the output of the ensemble system (the teacher) (Kim and Rush, 2016;Freitag et al., 2017). This paper presents an alternative to knowledge distillation as we aim to speed up decoding to be comparable to single NMT while retaining the boost in translation accuracy from the ensemble. In a first step, we describe how to construct a single large neural network which imitates the output of an ensemble of multiple networks with the same topology. We will refer to this process as unfolding. GPU-based decoding with the unfolded network is often much faster than ensemble decoding since more work can be done on the GPU. In a second step, we explore methods to reduce the size of the unfolded network. This idea is justified by the fact that ensembled neural networks are often over-parameterized and have a large degree of redundancy (LeCun et al., 1989;Hassibi et al., 1993;Srinivas and Babu, 2015). Shrinking the unfolded network leads to a smaller model which consumes less space on the disk and in the memory, a crucial factor on mobile devices. More importantly, the decoding speed on all platforms benefits greatly from the reduced number of neurons. We find that the dimensionality of linear embedding layers in the NMT network can be reduced heavily by low-rank matrix approximation based on singular value decomposition (SVD). This suggests that high dimensional embedding layers may be needed for training, but do not play an important role for decoding.
The NMT network, however, also consists of complex layers like gated recurrent units (Cho et al., 2014, GRUs) and attention (Bahdanau et al., 2015). Therefore, we introduce a novel algorithm based on linear combinations of neurons which can be applied either during training (data-bound) or directly on the weight matrices without using training data (data-free). We report that with a mix of the presented shrinking methods we are able to reduce the size of the unfolded network to the size of the single NMT network while keeping the boost in BLEU score from the ensemble. Depending on the aggressiveness of shrinking, we report either a gain of 2.2 BLEU at the same decoding speed, or a 3.4× CPU decoding speed up with only a minor drop in BLEU compared to the original single NMT system. Furthermore, it is often much easier to stage a single NMT system than an ensemble in a commercial MT workflow, and it is crucial to be able to optimize quality at specific speed and memory constraints. Unfolding and shrinking address these problems directly.

Unfolding K Networks into a Single Large Neural Network
The first concept of our approach is called unfolding. Unfolding is an alternative to ensembling of multiple neural networks with the same topology. Rather than averaging their predictions, unfolding constructs a single large neural net out of the individual models which has the same number of input and output neurons but larger inner layers. Our main motivation for unfolding is to obtain a single network with ensemble level performance which can be shrunk with the techniques in Sec. 3.
Suppose we ensemble two single-layer feedforward neural nets as shown in Fig. 1. Normally, ensembling is implemented by performing an isolated forward pass through the first network (Fig. 1(a)), another isolated forward pass through the second network (Fig. 1(b)), and averaging the activities in the output layers of both networks. This can be simulated by merging both networks into a single large network as shown in Fig. 1(c). The first neurons in the hidden layer of the combined network correspond to the hidden layer in the first single network, and the others to the hidden layer of the second network. A single pass through the combined network yields the same output as the ensemble if the output layer is linear (up to a factor of 2). The weight matrices in the unfolded network can be constructed by stacking the corresponding weight matrices (either horizontally or vertically) in network 1 and 2. This kind of aggregation of multiple networks with the same topology is not only possible for single-layer feedforward architectures but also for complex networks consisting of multiple GRU layers and attention.
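The two-network case described above can be checked numerically. The following is a minimal sketch rather than the implementation used in the experiments: it assumes NumPy, omits bias terms, uses tanh as the hidden activation, and all sizes and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m_in, m, m_out = 4, 3, 2  # illustrative layer sizes (assumption)

# Two independently initialised single-layer networks (biases omitted).
U1, V1 = rng.normal(size=(m_in, m)), rng.normal(size=(m, m_out))
U2, V2 = rng.normal(size=(m_in, m)), rng.normal(size=(m, m_out))

def forward(x, U, V):
    """Hidden layer with tanh activation followed by a linear output layer."""
    return np.tanh(x @ U) @ V

# Unfolding: stack incoming weights horizontally, outgoing weights vertically.
U_unfolded = np.hstack([U1, U2])   # shape (m_in, 2 * m)
V_unfolded = np.vstack([V1, V2])   # shape (2 * m, m_out)

x = rng.normal(size=(5, m_in))     # a batch of 5 inputs
ensemble_avg = 0.5 * (forward(x, U1, V1) + forward(x, U2, V2))
unfolded_out = forward(x, U_unfolded, V_unfolded)

# The unfolded output matches the ensemble average up to the factor K = 2.
print(np.allclose(unfolded_out, 2.0 * ensemble_avg))
```

The input and output layers keep their size while the hidden layer doubles, which is the layout the shrinking methods later operate on.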
For a formal description of unfolding we address layers with indices $d = 0, 1, \dots, D$. The special layer 0 has a single neuron for modelling bias vectors. Layer 1 holds the input neurons and layer $D$ is the output layer. We denote the size of a layer in the individual models as $s(d)$. When combining $K$ networks, the layer size $s'(d)$ in the unfolded network is increased by factor $K$ if $d$ is an inner layer, and equal to $s(d)$ if $d$ is the input or output layer. We denote the weight matrix between two layers $d_1, d_2 \in [0, D]$ in the $k$-th individual model as $W_k(d_1, d_2) \in \mathbb{R}^{s(d_1) \times s(d_2)}$, and the corresponding weight matrix in the unfolded network as $W'(d_1, d_2) \in \mathbb{R}^{s'(d_1) \times s'(d_2)}$. We explicitly allow $d_1$ and $d_2$ to be non-consecutive or reversed to be able to model recurrent networks. We use the zero-matrix if layers $d_1$ and $d_2$ are not connected. The construction of the unfolded weight matrix $W'(d_1, d_2)$ from the individual matrices $W_k(d_1, d_2)$ depends on whether the connected layers are inner layers or not. The complete formula is listed in Fig. 2.
Unfolded NMT networks approximate but do not exactly match the output of the ensemble for two reasons. First, the unfolded network synchronizes the attentions of the individual models. Each decoding step in the unfolded network computes a single attention weight vector. In contrast, ensemble decoding would compute one attention weight vector for each of the $K$ input models. A second difference is that the ensemble decoder first applies the softmax at the output layer, and then averages the prediction probabilities. The unfolded network averages the neuron activities (i.e. the logits) first, and then applies the softmax function. Interestingly, as shown in Sec. 4, these differences do not have any impact on the BLEU score but yield potential speed advantages of unfolding since the computationally expensive softmax layer is only applied once.
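The second difference can be illustrated on a toy example. Applying the softmax to averaged logits is equivalent to taking the normalized geometric mean of the individual model distributions, whereas the ensemble decoder described here takes their arithmetic mean. The sketch below assumes NumPy; the logit values are made up for illustration.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Made-up output-layer activities (logits) of K = 2 models over 3 tokens.
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 1.5, 0.3]])

probs = np.array([softmax(z) for z in logits])

# Ensemble decoder: softmax per model, then arithmetic average.
arithmetic = probs.mean(axis=0)

# Unfolded network: average the logits, then a single softmax.
logit_avg = softmax(logits.mean(axis=0))

# Normalized geometric mean of the individual distributions.
geometric = np.exp(np.log(probs).mean(axis=0))
geometric /= geometric.sum()

print(np.allclose(logit_avg, geometric))    # same distribution
print(np.allclose(logit_avg, arithmetic))   # differs from the arithmetic mean
```

Both variants produce valid distributions, so the ranking of hypotheses can differ slightly between ensemble and unfolded decoding even before any shrinking is applied.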

Shrinking the Unfolded Network
After constructing the weight matrices of the unfolded network we reduce its size by iteratively shrinking layer sizes. In this section we denote the incoming weight matrix of the layer to shrink as $U \in \mathbb{R}^{m_{in} \times m}$ and the outgoing weight matrix as $V \in \mathbb{R}^{m \times m_{out}}$. Our procedure is inspired by the method of Srinivas and Babu (2015). They propose a criterion for removing neurons in inner layers of the network based on two intuitions. First, similarly to Hebb's learning rule, they detect redundancy by the principle "neurons which fire together, wire together". If the incoming weight vectors $U_{:,i}$ and $U_{:,j}$ are exactly the same for two neurons $i$ and $j$, we can remove the neuron $j$ and add its outgoing connections to neuron $i$ ($V_{i,:} \leftarrow V_{i,:} + V_{j,:}$) without changing the output. This holds since the activity in neuron $j$ will always be equal to the activity in neuron $i$. In practice, Srinivas and Babu use a distance measure based on the difference of the incoming weight vectors to search for similar neurons as exact matches are very rare.
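The exact-duplicate case can be verified with a small experiment. This is an illustrative sketch assuming NumPy, a tanh hidden layer without biases, and arbitrarily chosen neuron indices; it is not taken from Srinivas and Babu (2015).

```python
import numpy as np

rng = np.random.default_rng(1)
m_in, m, m_out = 4, 5, 3           # illustrative sizes (assumption)
U = rng.normal(size=(m_in, m))     # incoming weights, one column per neuron
V = rng.normal(size=(m, m_out))    # outgoing weights, one row per neuron

i, j = 0, 3
U[:, j] = U[:, i]                  # make neuron j an exact duplicate of neuron i

def forward(x, U, V):
    return np.tanh(x @ U) @ V

# Add the outgoing weights of neuron j to neuron i, then delete neuron j.
V_merged = V.copy()
V_merged[i, :] += V_merged[j, :]
U_small = np.delete(U, j, axis=1)
V_small = np.delete(V_merged, j, axis=0)

x = rng.normal(size=(6, m_in))
print(np.allclose(forward(x, U, V), forward(x, U_small, V_small)))
```

Because both neurons always produce the same activity, the merge is exact regardless of the activation function.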
The second intuition of the criterion used by Srinivas and Babu (2015) is that neurons with small outgoing weights contribute very little overall. Therefore, they search for a pair of neurons $i, j \in [1, m]$ according to the following term and remove the $j$-th neuron:

$$\operatorname*{arg\,min}_{i,j \in [1,m]} \|U_{:,i} - U_{:,j}\|_2 \, \|V_{j,:}\|_2 \qquad (1)$$

Neuron $j$ is selected for removal if (1) there is another neuron $i$ which has a very similar set of incoming weights and if (2) $j$ has a small outgoing weight vector. Their criterion is data-free since it does not require any training data. For further details we refer to Srinivas and Babu (2015).

Data-Free Neuron Removal
Srinivas and Babu (2015) propose to add the outgoing weights of $j$ to the weights of a similar neuron $i$ to compensate for the removal of $j$. However, we have found that this approach does not work well on NMT networks. We propose instead to compensate for the removal of a neuron by a linear combination of the remaining neurons in the layer. Data-free shrinking assumes for the sake of deriving the update rule that the neuron activation function is linear. We now ask the following question: How can we compensate as well as possible for the loss of neuron $j$ such that the impact on the output of the whole network is minimized? Data-free shrinking represents the incoming weight vector of neuron $j$ ($U_{:,j}$) as a linear combination of the incoming weight vectors of the other neurons. The linear factors can be found by satisfying the following linear system:

$$U_{:,\neg j} \, \lambda = U_{:,j} \qquad (2)$$

where $U_{:,\neg j}$ is matrix $U$ without the $j$-th column. In practice, we use the method of ordinary least squares to find $\lambda$ because the system may be overdetermined. The idea is that if we mix the outputs of all neurons in the layer by the $\lambda$-weights, we get the output of the $j$-th neuron. The row vector $V_{j,:}$ contains the contributions of the $j$-th neuron to each of the neurons in the next layer. Rather than using these connections, we approximate their effect by adding some weight to the outgoing connections of the other neurons. How much weight depends on $\lambda$ and the outgoing weights $V_{j,:}$. The factor $D_{k,l}$ which we need to add to the outgoing connection of the $k$-th neuron to compensate for the loss of the $j$-th neuron on the $l$-th neuron in the next layer is:

$$D_{k,l} = \lambda_k V_{j,l} \qquad (3)$$

Therefore, the update rule for $V$ is:

$$V_{k,l} \leftarrow V_{k,l} + D_{k,l} \quad \text{for all } k \neq j \qquad (4)$$

In the remainder we will refer to this method as data-free shrinking. Note that we recover the update rule of Srinivas and Babu (2015) by setting $\lambda$ to the $i$-th unit vector. Also note that the error introduced by our shrinking method is due to the fact that we ignore the non-linearity, and that the solution for $\lambda$ may not be exact. The method is error-free on linear layers as long as the residuals of the least-squares analysis in Eq. 2 are zero.
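The data-free update can be sketched in a few lines. The function name `data_free_remove` and all sizes are hypothetical, and NumPy's `lstsq` stands in for the ordinary least squares step. The toy layer is linear and has more neurons than inputs, so the system is solvable with zero residual and the removal is error-free; in the overdetermined case mentioned above the result would only be approximate.

```python
import numpy as np

def data_free_remove(U, V, j):
    """Remove neuron j and compensate with a linear combination of the rest.

    U: incoming weights, shape (m_in, m); V: outgoing weights, shape (m, m_out).
    """
    U_rest = np.delete(U, j, axis=1)                 # U without column j
    # Least-squares solution of U_rest @ lam = U[:, j] (Eq. 2).
    lam, *_ = np.linalg.lstsq(U_rest, U[:, j], rcond=None)
    V_rest = np.delete(V, j, axis=0)
    # D[k, l] = lam[k] * V[j, l], added to the remaining rows (Eqs. 3 and 4).
    V_new = V_rest + np.outer(lam, V[j, :])
    return U_rest, V_new, lam

rng = np.random.default_rng(2)
m_in, m, m_out = 3, 5, 4           # illustrative sizes with m > m_in (assumption)
U = rng.normal(size=(m_in, m))
V = rng.normal(size=(m, m_out))

U_small, V_small, lam = data_free_remove(U, V, j=2)

# Linear layer: the output before and after removal should coincide.
x = rng.normal(size=(7, m_in))
print(np.allclose(x @ U @ V, x @ U_small @ V_small))
```

Setting `lam` to a unit vector instead of the least-squares solution would reproduce the pairwise merge of Srinivas and Babu (2015).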
GRU layers The terminology of neurons needs some further elaboration for GRU layers which rather consist of update and reset gates and states (Cho et al., 2014). On GRU layers, we treat the states as neurons, i.e. the j-th neuron refers to the j-th entry in the GRU state vector. Input connections to the gates are included in the incoming weight matrix U for estimating λ in Eq. 2. Removing neuron j in a GRU layer means deleting the j-th entry in the states and both gate vectors.

Data-Bound Neuron Removal
Although we find our data-free approach to be a substantial improvement over the methods of Srinivas and Babu (2015) on NMT networks, it still leads to a non-negligible decline in BLEU score when applied to recurrent GRU layers. Our data-free method uses the incoming weights to identify similar neurons, i.e. neurons expected to have similar activities. This works well enough for simple layers, but the interdependencies between the states and the gates inside gated layers like GRUs or LSTMs are complex enough that redundancies cannot be found simply by looking for similar weights. In the spirit of Babaeizadeh et al. (2016), our data-bound version records neuron activities during training to estimate $\lambda$. We compensate for the removal of the $j$-th neuron by using a linear combination of the output of remaining neurons with similar activity patterns. In each layer, we prune 40 neurons every 450 training iterations until the target layer size is reached. Let $A$ be the matrix which holds the records of neuron activities in the layer since the last removal. For example, for the decoder GRU layer, a batch size of 80, and target sentence lengths of 20, $A$ has 20 · 80 · 450 = 720K rows and $m$ (the number of neurons in the layer) columns. Similarly to Eq. 2 we find interpolation weights $\lambda$ using the method of least squares on the following linear system:

$$A_{:,\neg j} \, \lambda = A_{:,j} \qquad (5)$$
The update rule for the outgoing weight matrix is the same as for our data-free method (Eq. 4).
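A toy version of the data-bound estimate is sketched below. It assumes NumPy, a feedforward tanh layer in place of a GRU, randomly generated inputs in place of recorded training activities, and an arbitrarily chosen neuron `j`; the gradient-based fine-tuning that accompanies the method is not part of this sketch.

```python
import numpy as np

rng = np.random.default_rng(3)
m_in, m, m_out, n_rows = 6, 8, 4, 2000   # toy sizes, far below 720K rows
U = rng.normal(size=(m_in, m))
V = rng.normal(size=(m, m_out))
j = 5                                     # hypothetical neuron picked for removal

# Activity matrix A: one row per recorded input, one column per neuron.
X = rng.normal(size=(n_rows, m_in))
A = np.tanh(X @ U)

# Least-squares fit of neuron j's activity from the other neurons' activities.
A_rest = np.delete(A, j, axis=1)
lam, *_ = np.linalg.lstsq(A_rest, A[:, j], rcond=None)

# Same compensation as in the data-free case: add lam[k] * V[j, :] to row k.
V_rest = np.delete(V, j, axis=0)
V_comp = V_rest + np.outer(lam, V[j, :])

full_out = A @ V
err_plain = np.linalg.norm(full_out - A_rest @ V_rest)   # drop j, no compensation
err_comp = np.linalg.norm(full_out - A_rest @ V_comp)    # drop j, compensated
print(err_comp <= err_plain)
```

On the recorded activities, the compensated error can never exceed the uncompensated one, because setting all interpolation weights to zero is one of the candidates the least-squares fit considers.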
The key difference between data-free and data-bound shrinking is the way λ is estimated. Data-free shrinking uses the similarities between incoming weights, and data-bound shrinking uses neuron activities recorded during training. Once we select a neuron to remove, we estimate λ, compensate for the removal, and proceed with the shrunk network. Both methods are applied prior to any decoding and result in shrunk parameter files which are then loaded into the decoder. Both methods remove neurons rather than single weights.
The data-bound algorithm runs gradient-based optimization on the unfolded network. We use the AdaGrad (Duchi et al., 2011) step rule, a small learning rate of 0.0001, and aggressive step clipping at 0.05 to avoid destroying useful weights which were learned in the individual networks prior to the construction of the unfolded network.
Our data-bound algorithm uses a data-bound version of the neuron selection criterion in Eq. 1 which operates on the activity matrix A. We search for the pair i, j ∈ [1, m] according to the following term and remove neuron j.

Shrinking Embedding Layers with SVD
The standard attention-based NMT network architecture (Bahdanau et al., 2015) includes three linear layers: the embedding layer in the encoder, and the output and feedback embedding layers in the decoder. We have found that linear layers are particularly easy to shrink using low-rank matrix approximation. As before we denote the incoming weight matrix as $U \in \mathbb{R}^{m_{in} \times m}$ and the outgoing weight matrix as $V \in \mathbb{R}^{m \times m_{out}}$. Since the layer is linear, we could directly connect the previous layer with the next layer using the product of both weight matrices $X = U \cdot V$. However, $X$ may be very large. Therefore, we approximate $X$ as a product of two low-rank matrices $Y \in \mathbb{R}^{m_{in} \times m'}$ and $Z \in \mathbb{R}^{m' \times m_{out}}$ ($X \approx YZ$) where $m' \ll m$ is the desired layer size. A very common way to find such a matrix factorization is using truncated singular value decomposition (SVD). The layer is eventually shrunk by replacing $U$ with $Y$ and $V$ with $Z$.
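The factorization can be sketched with NumPy's SVD routine. The function name `shrink_linear_layer` and the sizes are hypothetical, and folding the singular values into Y rather than Z (or splitting them between both) is an arbitrary choice made for this sketch.

```python
import numpy as np

def shrink_linear_layer(U, V, m_small):
    """Replace a linear layer of size m by one of size m_small via truncated SVD."""
    X = U @ V                                        # shape (m_in, m_out)
    P, s, Qt = np.linalg.svd(X, full_matrices=False)
    Y = P[:, :m_small] * s[:m_small]                 # shape (m_in, m_small)
    Z = Qt[:m_small, :]                              # shape (m_small, m_out)
    return Y, Z

rng = np.random.default_rng(4)
m_in, m, m_out = 50, 20, 40                          # illustrative sizes (assumption)
U = rng.normal(size=(m_in, m))
V = rng.normal(size=(m, m_out))
X = U @ V

Y, Z = shrink_linear_layer(U, V, m_small=10)

# The Frobenius error of a truncated SVD equals the norm of the dropped
# singular values.
s = np.linalg.svd(X, compute_uv=False)
err = np.linalg.norm(X - Y @ Z)
print(np.isclose(err, np.sqrt(np.sum(s[10:] ** 2))))
```

Inspecting the decay of the singular values `s` gives a quick indication of how far a particular embedding layer can be shrunk before the approximation error becomes noticeable.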

Results
The individual NMT systems we use as source for constructing the unfolded networks are trained using AdaDelta (Zeiler, 2012) on the Blocks/Theano implementation (van Merriënboer et al., 2015;Bastien et al., 2012) of the standard attention-based NMT model (Bahdanau et al., 2015) with: 1000 dimensional GRU layers (Cho et al., 2014) in both the decoder and bidirectional encoder; a single maxout output layer (Goodfellow et al., 2013); and 620 dimensional embedding layers. We use subword units based on byte pair encoding rather than words as modelling units. Our SGNMT decoder (Stahlberg et al., 2017) with a beam size of 12 is used in all experiments. Our primary corpus is the Japanese-English (Ja-En) ASPEC data set. We select a subset of 500K sentence pairs to train our models as suggested by Neubig et al. (2015). We report cased BLEU scores calculated with Moses' multi-bleu.pl to be strictly comparable to the evaluation done in the Workshop on Asian Translation (WAT). We also apply our method to the WMT data set for English-German (En-De), using the news-test2014 as a development set, and keeping news-test2015 and news-test2016 as test sets. En-De BLEU scores are computed using mteval-v13a.pl as in the WMT evaluation. We set the vocabulary sizes to 30K for Ja-En and 50K for En-De. We also report the size factor for each model which is the total number of model parameters (sum of all weight matrix sizes) divided by the number of parameters in the original NMT network (86M for Ja-En and 120M for En-De). We choose a widely used, simple ensembling method (prediction averaging) as our baseline. We feel that the prevalence of this method makes it a reasonable baseline for our experiments.
Shrinking the Unfolded Network First, we investigate which shrinking methods are effective for which layers. Tab. 1 summarizes our results on a 2-unfold network for Ja-En, i.e. two separate NMT networks are combined in a single large network as described in Sec. 2. The layers in the combined network are shrunk to the size of the original networks using the methods discussed in Sec. 3.
Shrinking the linear embedding layers with SVD (Sec. 3.3) is very effective. The unfolded model with shrunk embedding layers performs at the same level as the ensemble (compare rows (b) and (c)). In our initial experiments, we applied the method of Srinivas and Babu (2015) to shrink the other layers, but their approach performed very poorly on this kind of network: the BLEU score dropped down to 15.5 on the development set when shrinking all layers except the decoder maxout and embedding layers, and to 9.9 BLEU when applying their method only to embedding layers (results with the original method of Srinivas and Babu (2015) are not included in Tab. 1). Row (e) in Tab. 1 shows that our data-free algorithm from Sec. 3.1 is better suited for shrinking the GRU and attention layers, leading to a drop of only 1 BLEU point compared to the ensemble (b) (i.e. 0.8 BLEU better than the single system (a)). However, using the data-bound version of our shrinking algorithm (Sec. 3.2) for the GRU layers performs best. If we apply different methods to different layers of the same network, we first apply SVD-based shrinking, then the data-free method, and finally the data-bound method. The shrunk model yields about the same BLEU score as the ensemble on the test set (25.2 in (b) and 25.3 in (f)). Shrinking the maxout layer remains more of a challenge (rows (g) and (h)), but the number of parameters in this layer is small. Therefore, shrinking all layers except the maxout layer leads to almost the same number of parameters (factor 1.05 in row (f)) as the original NMT network (a), and thus to a similar storage size, memory consumption, and decoding speed, but with a 1.8 BLEU gain. Based on these results we fix the shrinking method used for each layer for all remaining experiments as follows: We shrink linear embedding layers with our SVD-based method, GRU layers with our data-bound method, the attention layer with our data-free method, and do not shrink the maxout layer.
Our data-bound algorithm from Sec. 3.2 has two mechanisms to compensate for the removal of a neuron. First, we use a linear combination of the remaining neurons to update the outgoing weight matrix by imitating its activations (Eq. 4). Second, stochastic gradient descent (SGD) fine-tunes all
Our results in Tab. 3 show that decoding with ensembles (rows (b) and (e)) is slow: combining the predictions of the individual models on the CPU is computationally expensive, and ensemble decoding requires K passes through the softmax layer which is also computationally expensive. Unfolding the ensemble into a single network and shrinking the embedding and attention layers improves the runtimes on the GPU significantly without noticeable impact on BLEU (rows (c) and (f)). This can be attributed to the fact that unfolding can reduce the communication overhead between CPU and GPU. Comparing rows (d) and (g) with row (a) reveals that shrinking the unfolded networks even further speeds up CPU and GPU decoding almost to the level of single system decoding. However, more aggressive shrinking yields a BLEU score of 25.3 when combining three systems (row (g)), which is 1.8 BLEU better than the single system but 0.6 BLEU worse than the 3-ensemble. Therefore, we will investigate the impact of shrinking on the different layers in the next sections more thoroughly.
Table 4: Layer sizes of our setups for Ja-En.

Degrees of Redundancy in Different Layers
We applied our shrinking methods to isolated layers in the 2-Unfold network of Tab. 1 (f). Fig. 3 plots the BLEU score when isolated layers are shrunk even below their size in the original NMT network. The attention layer is very robust against shrinking and can be reduced to 100 neurons (10% of the original size) without impacting the BLEU score. The embedding layers can be reduced to 60% but are sensitive to more aggressive pruning. Shrinking the GRU layers affects the BLEU score the most but still outperforms the single system when the GRU layers are shrunk to 30%.
Adjusting the Target Sizes of Layers Based on our previous experiments we revise our approach to shrink the 3-Unfold system in Tab. 3. Instead of shrinking all layers except the maxout layer to the same degree, we adjust the aggressiveness of shrinking for each layer. We suggest three different setups (Normal, Small, and Tiny) with the layer sizes specified in Tab. 4. 3-Unfold-Normal has the same number of parameters as the original NMT networks (size factor: 1.0), 3-Unfold-Small is only half their size (size factor: 0.5), and 3-Unfold-Tiny reduces the size by two thirds (size factor: 0.33). When comparing rows (a) and (c) in Tab. 5 we observe that 3-Unfold-Normal yields a gain of 2.2 BLEU with respect to the original single system and a slight improvement in decoding speed at the same time. Networks with the size factor 1.0 like 3-Unfold-Normal are very likely to yield about the same decoding speed as the Single network regardless of the decoder implementation, machine learning framework, and hardware. Therefore, we think that similar results are possible on other platforms as well. CPU decoding speed benefits even more from smaller setups: 3-Unfold-Tiny is only 0.3 BLEU worse than Single but decoding on a single CPU is 3.4 times faster (row (a) vs. row (e) in Tab. 5). This is of great practical use: batch decoding with only two CPU threads surpasses production speed which is often set to 2000 words per minute (Beck et al., 2016). Our initial experiments in Tab. 6 suggest that the Normal setup is applicable to En-De as well, with substantial improve-

Related Work
The idea of pruning neural networks to improve the compactness of the models dates back more than 25 years (LeCun et al., 1989). The literature is therefore vast (Augasta and Kathirvalavakumar, 2013). One line of research aims to remove unimportant network connections. The connections can be selected for deletion based on the second derivative of the training error with respect to the weight (LeCun et al., 1989;Hassibi et al., 1993), or by a threshold criterion on its magnitude (Han et al., 2015). See et al. (2016) confirmed a high degree of weight redundancy in NMT networks.
In this work we are interested in removing neurons rather than single connections since we strive to shrink the unfolded network such that it resembles the layout of an individual model. We argued in Sec. 4 that removing neurons rather than connections does not only improve the model size but also the memory footprint and decoding speed. As explained in Sec. 3.1, our data-free method is an extension of the approach by Srinivas and Babu (2015); our extension performs significantly better on NMT networks. Our data-bound method (Sec. 3.2) is inspired by Babaeizadeh et al. (2016) as we combine neurons with similar activities during training, but we use linear combinations of multiple neurons to compensate for the loss of a neuron rather than merging pairs of neurons.
Using low rank matrices for neural network compression, particularly approximations via SVD, has been studied widely in the literature (Denil et al., 2013;Denton et al., 2014;Xue et al., 2013;Prabhavalkar et al., 2016;Lu et al., 2016). These approaches often use low rank matrices to approximate a full rank weight matrix in the original network. In contrast, we shrink an entire linear layer by applying SVD on the product of the incoming and outgoing weight matrices (Sec. 3.3).
In this paper we mimicked the output of the high performing but cumbersome ensemble by constructing a large unfolded network, and shrank this network afterwards. Another approach, known as knowledge distillation, uses the large model (the teacher) to generate soft training labels for the smaller student network (Bucilu et al., 2006;Hinton et al., 2014). The student network is trained by minimizing the cross-entropy to the teacher. This idea has been applied to sequence modelling tasks such as machine translation and speech recognition (Wong and Gales, 2016; Kim and Rush, 2016;Freitag et al., 2017). Our approach can be computationally more efficient as the training set does not have to be decoded by the large teacher network.
Junczys-Dowmunt et al. (2016a; 2016b) reported gains from averaging the weight matrices of multiple checkpoints of the same training run. However, our attempts to replicate their approach were not successful. Averaging might work well when the behaviour of corresponding units is similar across networks, but that cannot be guaranteed when networks are trained independently.

Conclusion
We have described a generic method for improving the decoding speed and BLEU score of single system NMT. Our approach involves unfolding an ensemble of multiple systems into a single large neural network and shrinking this network by removing redundant neurons. Our best results on Japanese-English either yield a gain of 2.2 BLEU compared to the original single NMT network at about the same decoding speed, or a 3.4× CPU decoding speed up with only a minor drop in BLEU.
The current formulation of unfolding works for networks of the same topology as the concatenation of layers is only possible for analogous layers in different networks. Unfolding and shrinking diverse networks could be possible, for example by applying the technique only to the input and output layers or by some other scheme of finding associations between units in different models, but we leave this investigation to future work as models in NMT ensembles in current research usually have the same topology (Bojar et al., 2016;Sennrich et al., 2016a;Chung et al., 2016;Neubig, 2016;Wu et al., 2016;Durrani et al., 2017).

Appendix: Probabilistic Interpretation of Data-Free and Data-Bound Shrinking
Data-free and data-bound shrinking can be interpreted as setting the expected difference between network outputs before and after a removal operation to zero under different assumptions.
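For simplicity, we focus our probabilistic treatment of shrinking on single layer feedforward networks. Such a network maps an input $x \in \mathbb{R}^{m_{in}}$ to an output $y \in \mathbb{R}^{m_{out}}$. The $l$-th output $y_l$ is computed according to the following equation:

$$y_l = \sum_{k \in [1,m]} \sigma(x u_k^T) V_{k,l} \qquad (7)$$

where $u_k \in \mathbb{R}^{m_{in}}$ is the incoming weight vector of the $k$-th hidden neuron (denoted as $U_{:,k}$ in the main paper) and $V \in \mathbb{R}^{m \times m_{out}}$ the outgoing weight matrix of the $m$-dimensional hidden layer. We now remove the $j$-th neuron in the hidden layer and modify the outgoing weights to compensate for the removal:

$$y'_l = \sum_{k \in [1,m] \setminus \{j\}} \sigma(x u_k^T) V'_{k,l} \qquad (8)$$

where $y'_l$ is the output after the removal operation and $V' \in \mathbb{R}^{m \times m_{out}}$ are the modified outgoing weights. Our goal is to choose $V'$ such that the expected error introduced by removing neuron $j$ is zero:

$$\mathbb{E}_x(y_l - y'_l) = 0 \qquad (9)$$

Data-free shrinking Data-free shrinking makes two assumptions to satisfy Eq. 9. First, we assume that the incoming weight vector $u_j$ can be represented as a linear combination of the other weight vectors:

$$u_j = \sum_{k \in [1,m] \setminus \{j\}} \lambda_k u_k \qquad (10)$$

Second, it assumes that the neuron activation function $\sigma(\cdot)$ is linear. Starting with Eqs. 7 and 8 we can write $\mathbb{E}_x(y_l - y'_l)$ as

$$\mathbb{E}_x\Big(\sigma(x u_j^T) V_{j,l} + \sum_{k \in [1,m] \setminus \{j\}} \sigma(x u_k^T)(V_{k,l} - V'_{k,l})\Big) \overset{\text{Eq. 10}}{=} \sum_{k \in [1,m] \setminus \{j\}} \mathbb{E}_x(\sigma(x u_k^T))(V_{k,l} - V'_{k,l} + \lambda_k V_{j,l})$$

We set this term to zero (and thus satisfy Eq. 9) by setting each component of the sum to zero:

$$V'_{k,l} = V_{k,l} + \lambda_k V_{j,l} \qquad (11)$$

Data-bound shrinking Data-bound shrinking does not require linearity in $\sigma(\cdot)$. It rather assumes that the expected value of the neuron activity $j$ is a linear combination of the expected values of the other activities:

$$\mathbb{E}_x(\sigma(x u_j^T)) = \sum_{k \in [1,m] \setminus \{j\}} \lambda_k \mathbb{E}_x(\sigma(x u_k^T)) \qquad (12)$$

$\mathbb{E}_x(\cdot)$ is estimated using importance sampling:

$$\mathbb{E}_x(\sigma(x u_k^T)) \approx \frac{1}{|X|} \sum_{x' \in X} \sigma(x' u_k^T) \qquad (13)$$

In practice, the samples in $X$ are collected in the activity matrix $A$ from Sec. 3.2. We can satisfy Eq. 9 by using the $\lambda$-values from Eq. 12, so that $\mathbb{E}_x(y_l - y'_l)$ becomes

$$\mathbb{E}_x(\sigma(x u_j^T)) V_{j,l} + \sum_{k \in [1,m] \setminus \{j\}} \mathbb{E}_x(\sigma(x u_k^T))(V_{k,l} - V'_{k,l}) \overset{\text{Eq. 12}}{=} \sum_{k \in [1,m] \setminus \{j\}} \mathbb{E}_x(\sigma(x u_k^T))(V_{k,l} - V'_{k,l} + \lambda_k V_{j,l})$$

Again, we set this to zero using Eq. 11.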