Training Flexible Depth Model by Multi-Task Learning for Neural Machine Translation

The standard neural machine translation model can only decode with the same depth configuration as used in training. Restricted by this property, we have to deploy models of various sizes to maintain the same translation latency, because the hardware conditions of different terminal devices (e.g., mobile phones) vary greatly. Such individual training leads to increased model maintenance costs and slower model iteration, especially in industry. In this work, we propose to use multi-task learning to train a flexible depth model that can adapt to different depth configurations during inference. Experimental results show that our approach can simultaneously support decoding in 24 depth configurations and is superior to both individual training and LayerDrop, another training method for flexible depth models.


Introduction
As neural machine translation models become heavier and heavier (Vaswani et al., 2017), we have to resort to model compression techniques (e.g., knowledge distillation (Hinton et al., 2015; Kim and Rush, 2016)) to deploy smaller models on devices with limited resources, such as mobile phones. However, a practical challenge is that the hardware conditions of different devices vary greatly. To ensure the same computation latency, customizing distinct model sizes (e.g., depth, width) for different devices is necessary, which leads to huge model training and maintenance costs (Yu et al., 2019). For example, we need to distill the pretrained large model into N individual small models. The situation becomes worse for industry when considering more translation directions and more frequent model iterations.
An ideal solution is to train a single model that can run in different model sizes. Such attempts have been explored in SlimNet (Yu et al., 2019) and LayerDrop (Fan et al., 2020). SlimNet supports running in four width configurations by jointly training networks of these widths, while LayerDrop can decode with any depth configuration by applying dropout (Srivastava et al., 2014) to whole layers during training.
In this work, we take a further step along the line of flexible depth networks like LayerDrop. As shown in Figure 1, we first demonstrate that LayerDrop performs poorly when there is a large gap between the layer dropout rate predefined during training and the actual pruning ratio used at inference. To solve this problem, we propose to use multi-task learning to train a flexible depth model by treating each supported depth configuration as a task. We reduce the supported depth space for an aggressive model compression rate and propose an effective deterministic sub-network assignment method to eliminate the mismatch between training and inference in LayerDrop. Experimental results on deep Transformer (Wang et al., 2019) show that our approach can simultaneously support decoding in 24 depth configurations and outperforms both individual training and LayerDrop.

[Figure 1: BLEU score heatmaps of a model with a 12-layer encoder and a 6-layer decoder trained by LayerDrop with different layer dropout rates p. p_enc and p_dec denote the layer-pruning ratios at inference on the encoder and decoder, respectively. For example, p_enc = 11/12 means decoding with one encoder layer, without the other 11 encoder layers. The red star marks the training layer dropout, i.e., p_enc = p_dec = p.]
Flexible Depth Model

We call a single model that can decode under multiple depth configurations a flexible depth model (FDM); an FDM that supports k depth configurations has a capacity of k. We notice that although a pretrained vanilla Transformer can be forced to decode with any depth, its performance is far behind that of an independently trained model of the same depth. Therefore, the vanilla Transformer is not an FDM.

LayerDrop
In NMT, both the encoder and the decoder are generally composed of multiple layers with residual connections, which can be formally described as:

$x_i = x_{i-1} + F_i(x_{i-1})$   (1)

where $x_i$ is the output of the $i$-th layer and $F_i$ is the $i$-th layer's transformation. To make the model robust to pruned layers (shallower networks), LayerDrop, proposed by Fan et al. (2020), applies structured dropout over layers during training. A Bernoulli random variable $Q_i$ with a pre-defined parameter $p \in [0, 1]$ controls the drop rate, modifying Eq. 1 as:

$x_i = x_{i-1} + Q_i \cdot F_i(x_{i-1})$   (2)

where $\Pr(Q_i = 0) = p$ and $\Pr(Q_i = 1) = 1 - p$.
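To make Eq. 2 concrete, here is a minimal PyTorch sketch of a layer stack trained with LayerDrop; the `make_layer` factory and the independent per-layer coin flip are our assumptions for illustration, not the authors' implementation:

```python
import torch
import torch.nn as nn

class LayerDropStack(nn.Module):
    """A stack of residual layers trained with LayerDrop (Eq. 2)."""

    def __init__(self, make_layer, num_layers, p):
        super().__init__()
        # make_layer() is assumed to return the residual branch F_i,
        # e.g. a Transformer self-attention + feed-forward block.
        self.layers = nn.ModuleList([make_layer() for _ in range(num_layers)])
        self.p = p  # layer drop rate: Pr(Q_i = 0) = p

    def forward(self, x):
        for layer in self.layers:
            if self.training and torch.rand(()).item() < self.p:
                continue  # Q_i = 0: skip the layer, so x_i = x_{i-1}
            x = x + layer(x)  # Q_i = 1: x_i = x_{i-1} + F_i(x_{i-1})
        return x
```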
In this way, the $l$-th layer can in principle take the output of any preceding layer as input, rather than only that of the immediately previous ($l{-}1$-th) layer. At runtime, given the desired layer-pruning ratio $p = 1 - D_{inf}/D$, where $D_{inf}$ is the number of layers actually used in decoding and $D$ is the total number of layers, LayerDrop removes the $d$-th layer such that:

$d \equiv 0 \ (\mathrm{mod}\ \lfloor 1/p \rfloor)$   (3)

[Algorithm 1: Training Flexible Depth Model by Multi-Task Learning. For each batch B and each supported depth pair, B is fed into the sub-network pair (SN_e, SN_d).]

Depth space reduction. We restrict the supported depths to $\phi(D)$, the set of divisors of $D$, e.g., $\phi(12) = \{1, 2, 3, 4, 6, 12\}$. (For the diversity of depth configurations, we assume that $D$ is not a prime number in this work.) The physical meaning of $\phi(D)$ is that, for each $d \in \phi(D)$, every $D/d$ layers are compressed into one layer.
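For concreteness, a small Python sketch of the depth space and the chunk-based "Left" rule; the 1-based layer indexing is our convention:

```python
def divisors(D):
    """Supported depth space phi(D): the divisors of D, e.g.
    divisors(12) == [1, 2, 3, 4, 6, 12]. The paper assumes D is
    not prime so that this space is diverse."""
    return [d for d in range(1, D + 1) if D % d == 0]

def left(D, d):
    """LayerDrop's 'Left' assignment: split the D layers into d chunks
    of c = D // d layers and keep the leftmost layer of each chunk
    (cf. the pruning rule of Eq. 3). Returns 1-based layer indices."""
    c = D // d
    return [chunk * c + 1 for chunk in range(d)]
```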
Guideline for deterministic sub-network assignment. The use of deterministic sub-networks is critical to maintaining consistency between training and inference. However, for each $d \in \phi(D)$, it is not trivial to decide which $d$ layers should be selected to construct the $d$-layer sub-network. Here we propose two metrics to guide the procedure. The first is task balance (TB), whose motivation is to distribute the tasks across layers as uniformly as possible. We measure it quantitatively by the standard deviation of the number of tasks per layer:

$\mathrm{TB} = \sqrt{\frac{1}{D-1} \sum_{i=1}^{D} \big(t(i) - \bar{t}\,\big)^2}$   (4)
where $t(i)$ is the number of tasks in which the $i$-th layer is used and $\bar{t}$ is the mean of $t(i)$. The second is average layer distance (ALD), which requires that the distance between adjacent layers in the sub-network $SN(d) = \{L_{a_1}, L_{a_2}, \ldots, L_{a_d}\}$ be large. For example, for a 6-layer network, if we want to build a 2-layer sub-network, it is unreasonable to select $\{L_1, L_2\}$ directly, because the features extracted by adjacent layers are semantically similar (Peters et al., 2018; Raganato and Tiedemann, 2018). Therefore, we use the average distance between adjacent layers across all sub-networks as the metric:

$\mathrm{ALD} = \dfrac{\sum_{d \in \phi(D)} \sum_{j=1}^{d-1} (a_{j+1} - a_j)}{\sum_{d \in \phi(D)} (d - 1)}$   (5)

Proposed method. Guided by these two metrics, we design an effective sub-network assignment method, Optimal. We record a usage state $s_i$ for each layer to avoid putting too many tasks on the same layer; at initialization, every $s_i$ is set to Alive. Optimal processes the depths $d \in \phi(D)$ in descending order. To keep ALD high, it uniformly assigns one layer for each chunk of $c = D/d$ layers; within each chunk, it picks the layer at 0-based offset $\lceil c/2 \rceil - 1$ (called MiddleLeft). Note that LayerDrop uses the leftmost layer in each chunk (called Left), as in Eq. 3. Although Left and MiddleLeft have the same ALD, there is a large gap in TB: for $D = 12$, Left's TB is 1.5, much higher than MiddleLeft's 0.78 (lower is better). Optimal then records which layers have been used and prefers the less-used layers as much as possible, marking each used layer as Dead. If the currently alive layers cannot accommodate the picked depth $d$, we skip it and try a smaller $d$ until the alive layers are sufficient, or reset all layers to Alive.
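Both metrics and the MiddleLeft rule are easy to check in code. The sketch below reuses the `divisors` and `left` helpers from the previous snippet; the sample standard deviation reproduces the TB values quoted above (1.5 vs. 0.78 for D = 12):

```python
import statistics

def middle_left(D, d):
    """'MiddleLeft': keep the layer at 0-based offset ceil(c/2) - 1
    within each chunk of c = D // d layers. Returns 1-based indices."""
    c = D // d
    offset = (c + 1) // 2 - 1  # ceil(c/2) - 1
    return [chunk * c + offset + 1 for chunk in range(d)]

def task_balance(D, assign):
    """TB (Eq. 4): sample standard deviation of t(i), the number of
    tasks (sub-networks) that use layer i. Lower is better."""
    t = [0] * (D + 1)
    for d in divisors(D):
        for i in assign(D, d):
            t[i] += 1
    return statistics.stdev(t[1:])

def avg_layer_distance(D, assign):
    """ALD (Eq. 5): mean distance between adjacent kept layers, pooled
    over all sub-networks with d > 1. Higher is better."""
    dists = []
    for d in divisors(D):
        layers = assign(D, d)
        dists += [b - a for a, b in zip(layers, layers[1:])]
    return sum(dists) / len(dists)

# task_balance(12, left)        -> ~1.50   (matches the text)
# task_balance(12, middle_left) -> ~0.78
# avg_layer_distance(12, left) == avg_layer_distance(12, middle_left) == 2.0
```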
Training. Algorithm 1 describes the training process of our method. Compared with individual training and LayerDrop trained from scratch, our FDM finetunes on the individually pretrained model $\mathcal{M}_{M\text{-}N}$ and uses sequence-level knowledge distillation (Seq-KD) (Kim and Rush, 2016) to help train the shallower networks. We note that in conventional Seq-KD the student model cannot finetune on the teacher model directly, because the student typically differs in size from the teacher, whereas our FDM shares the teacher's full architecture.

Results. When only a few layers are pruned, MT is the winner in most tasks (20/24). This indicates that our method is superior to LayerDrop for FDM training and demonstrates the potential to replace a dozen models of different depths with just one model. Besides, in line with Fan et al. (2020), it is interesting to see that the FDM without any pruning outperforms the individually trained model (see M=12, N=6), which is clear evidence that jointly training models of various depths has a good regularization effect.
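A compact sketch of the multi-task update in Algorithm 1, reusing `divisors` from above; `model` is a hypothetical callable that runs only the given encoder/decoder sub-networks on a batch of Seq-KD data and returns the loss, and `assign` can be any deterministic assignment such as `middle_left`:

```python
import itertools

def train_fdm(model, batches, optimizer, assign, D_enc=12, D_dec=6):
    """One FDM training pass: for every batch, accumulate gradients over
    all supported (encoder depth, decoder depth) tasks, then update once.
    With D_enc=12 and D_dec=6 this enumerates 6 x 4 = 24 tasks."""
    tasks = list(itertools.product(divisors(D_enc), divisors(D_dec)))
    for batch in batches:              # batches of Seq-KD distilled data
        optimizer.zero_grad()
        for m, n in tasks:
            sn_e = assign(D_enc, m)    # deterministic encoder sub-network
            sn_d = assign(D_dec, n)    # deterministic decoder sub-network
            loss = model(batch, sn_e, sn_d) / len(tasks)
            loss.backward()            # accumulate gradients across tasks
        optimizer.step()
```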
Knowledge distillation. Table 2 shows the average BLEU scores over the 24 tasks when training a flexible depth model with and without Seq-KD. It is clear that using distilled data helps FDM training in all systems, in line with previous single-model compression studies (Kim and Rush, 2016). According to Zhou et al. (2020), Seq-KD makes the training data distribution smoother, so we suspect that FDM benefits from Seq-KD because this eases the difficulty of multi-task learning.
Sub-network assignment strategy. Besides the proposed Optimal, the Left strategy used by LayerDrop, and its improved version MiddleLeft, we also compared two other strategies, Head and Seq, to check the consistency between BLEU and the proposed guidelines (TB and ALD). Head is the simplest method: it always picks the first $d$ layers as the sub-network. However, it makes the bottom layers carry more tasks than the top layers. Seq avoids this problem by sequentially skipping previously used layers. For example, for $D = 6$ and $d = 1$, Seq first uses $L_1$ as the sub-network; next, for $d = 2$, Seq selects $L_2$ and $L_3$. This ensures a minimal burden on all layers, but it violates the ALD metric. Table 3 shows the average BLEU scores on all tasks for these sub-network strategies. While MiddleLeft already has good TB and ALD, we argue that it is not the best, because it treats each depth $d$ independently, regardless of which layers were used for the other depths. The proposed policy, with lower TB and higher ALD, obtains the best result, which indicates that our proposed metrics are helpful for determining which strategy is sound.
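Head is direct to write down; Seq below is one plausible reading of the description (each depth takes the next unused block of layers, wrapping around when the remaining layers run out, which is our assumption). Both reuse `divisors` from above:

```python
def head(D, d):
    """'Head': always keep the first d layers (1-based indices)."""
    return list(range(1, d + 1))

def seq_assign(D):
    """'Seq': assign each depth d the next block of d previously unused
    layers, e.g. for D=6: d=1 -> [1], d=2 -> [2, 3], d=3 -> [4, 5, 6].
    Returns a dict mapping each supported depth to its sub-network."""
    used, table = 0, {}
    for d in divisors(D):      # ascending depths
        if used + d > D:
            used = 0           # wrap around when layers run out (assumed)
        table[d] = list(range(used + 1, used + d + 1))
        used += d
    return table
```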
Reduce the number of tasks. Intuitively, the number of tasks determines the learning difficulty of our method. To verify this assumption, we tested two more baselines: (1) training only a flexible-depth encoder (depth from {1, 2, 3, 4, 6, 12}) with the decoder depth fixed to 6, denoted by MT (only encoder); and (2) training only a flexible-depth decoder (depth from {1, 2, 3, 6}) with the encoder depth fixed to 12, denoted by MT (only decoder). We then compared the average BLEU scores with the decoder depth fixed at 6 (BLEU_{N=6}) and with the encoder depth fixed at 12 (BLEU_{M=12}). As shown in Table 4, reducing the number of tasks generally yields better performance. This indicates that, by removing unnecessary tasks, our FDM has the potential for further improvement.
Training efficiency. Our multi-task learning needs to accumulate gradients over all tasks, so its cost grows linearly with the number of tasks. In practice, we can sample fewer tasks instead of enumerating them all, e.g., randomly sampling 3 of the 6 encoder depth candidates (denoted by #Enc.=3). Another way to reduce training costs is to use smaller batches. We compared different strategies at {100%, 50%, 25%} of the full training cost, as shown in Table 5. First of all, more training cost yields better performance. Between reducing tasks and reducing batch size, we found the former to be the better choice. In particular, sampling more depths on the encoder side matters more than on the decoder side, which is consistent with the recent observation of Wang et al. (2019) that the encoder is more important than the decoder for translation performance.
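A minimal sketch of the task-sampling idea, again reusing `divisors`; the exact sampling scheme (e.g., whether the full depths are always included) is not specified in the text, so uniform sampling here is an assumption:

```python
import random

def sample_tasks(D_enc, D_dec, n_enc, n_dec):
    """Cheaper update: randomly sample n_enc encoder depths and n_dec
    decoder depths per step instead of enumerating all of them,
    e.g. n_enc=3 corresponds to the #Enc.=3 setting in Table 5."""
    enc = random.sample(divisors(D_enc), n_enc)
    dec = random.sample(divisors(D_dec), n_dec)
    return [(m, n) for m in enc for n in dec]
```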

Conclusion
We demonstrated that LayerDrop is not suitable for FDM training because of (1) the huge sub-network space during training and (2) the mismatch between training and inference. We then proposed multi-task learning to mitigate these problems. Experimental results show that our approach can decode with up to 24 depth configurations and obtains comparable or better performance than individual training and LayerDrop. In the future, we plan to explore more effective FDM training methods; combining flexible depth and width is also an attractive direction.