Dynamically Adjusting Transformer Batch Size by Monitoring Gradient Direction Change

The choice of hyper-parameters affects the performance of neural models. While much previous research (Sutskever et al., 2013; Duchi et al., 2011; Kingma and Ba, 2015) focuses on accelerating convergence and reducing the effects of the learning rate, comparatively few papers concentrate on the effect of batch size. In this paper, we analyze how increasing the batch size affects gradient direction, and propose to evaluate the stability of gradients with their angle change. Based on our observations, the angle change of the gradient direction first tends to stabilize (i.e., gradually decrease) while accumulating mini-batches, and then starts to fluctuate. We propose to automatically and dynamically determine batch sizes by accumulating gradients of mini-batches and performing an optimization step at just the moment when the direction of gradients starts to fluctuate. To improve the efficiency of our approach for large models, we propose a sampling approach to select gradients of parameters sensitive to the batch size. Our approach dynamically determines proper and efficient batch sizes during training. In our experiments on the WMT 14 English to German and English to French tasks, our approach improves over the Transformer with a fixed 25k batch size by +0.73 and +0.82 BLEU respectively.


Introduction
The performance of neural models is likely to be affected by the choice of hyper-parameters. While much previous research (Sutskever et al., 2013; Duchi et al., 2011; Kingma and Ba, 2015) focuses on accelerating convergence and reducing the effects of the learning rate, comparatively few papers concentrate on the effect of batch size.
However, batch size is also an important hyper-parameter, and some batch sizes empirically lead to better performance than others. Specifically, it has been shown that the performance of the Transformer model (Vaswani et al., 2017) for Neural Machine Translation (NMT) (Bahdanau et al., 2015; Gehring et al., 2017; Vaswani et al., 2017) relies heavily on the batch size (Popel and Bojar, 2018; Ott et al., 2018; Abdou et al., 2017; Zhang et al., 2019a).
The influence of batch size on performance raises the question of how to dynamically find proper and efficient batch sizes during training. In this paper, we investigate the relationship between the batch size and gradients, and propose a dynamic batch size approach based on monitoring gradient direction changes. Our contributions are as follows:
• We observe the effects of increasing batch size on gradients, and find that a large batch size stabilizes the direction of gradients;
• We propose to automatically determine dynamic batch sizes in training by monitoring the gradient direction change while accumulating gradients of small batches;
• To measure gradient direction change efficiently with large models, we propose an approach to dynamically select those gradients of parameters/layers which are sensitive to the batch size;
• In machine translation experiments, our approach improves the training efficiency and the performance of the Transformer model.

Gradient Direction Change and Automated Batch Size
Gradients indicate the direction and size of parameter updates to minimize the loss function in training. To reveal the effects of the batch size in optimization, we evaluate its influence on the direction change of gradients.

Gradient Direction Change with Increasing Batch Size
To investigate the influence of batch size on gradient direction, we gradually accumulate gradients of small mini-batches as the gradients of a large batch that consists of those mini-batches, and observe how the direction of gradients varies. Let $d_i^j : (x_i^j, y_i^j)$ stand for the large batch concatenated from the $i$th mini-batch to the $j$th mini-batch, where $x_i^j$ and $y_i^j$ are inputs and targets. Then the gradients $g_i^j$ of model parameters $\theta$ on $d_i^j$ are:

$$g_i^j = \frac{\partial L(x_i^j, y_i^j; \theta)}{\partial \theta}$$

In gradient accumulation, the gradients $g_0^k$ are the sum of $g_0^{k-1}$ and $g_k^k$:

$$g_0^k = g_0^{k-1} + g_k^k$$

To measure the change of gradient direction during accumulation, we regard the two gradients $g_0^{k-1}$ and $g_0^k$ as two vectors, and compute the angle $a(g_0^{k-1}, g_0^k)$ between them:

$$a(g_0^{k-1}, g_0^k) = \arccos \frac{g_0^{k-1} \cdot g_0^k}{|g_0^{k-1}| \, |g_0^k|}$$

where "$\cdot$" indicates the inner product of vectors. We use the angle of the two vectors rather than cosine similarity because:
• The angle directly indicates the change between gradient directions;
• When the angle is small, a significant change in the angle only results in a subtle difference in cosine similarity.
We observe how the gradient direction varies while accumulating gradients of a Transformer model trained on the WMT 14 English-German task following the setting of Vaswani et al. (2017) with a batch size of around 50k target tokens. To achieve the gradient of the large batch size, we gradually accumulate gradients of mini-batches of around 4k target tokens.
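The accumulation and angle computation above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: `grad_angle` is a hypothetical helper name, and the random vectors stand in for per-mini-batch gradients flattened to 1-D.

```python
import numpy as np

def grad_angle(g_prev, g_curr):
    """Angle in degrees between two gradient vectors (flattened to 1-D)."""
    cos = np.dot(g_prev, g_curr) / (np.linalg.norm(g_prev) * np.linalg.norm(g_curr))
    # clip guards against floating-point drift just outside [-1, 1]
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# accumulate mini-batch "gradients" and track the direction change
rng = np.random.default_rng(0)
g_acc = rng.normal(size=1000)          # g_0^0: gradient of the first mini-batch
for k in range(1, 5):
    g_k = rng.normal(size=1000)        # g_k^k: gradient of the k-th mini-batch
    g_new = g_acc + g_k                # g_0^k = g_0^{k-1} + g_k^k
    angle = grad_angle(g_acc, g_new)   # a(g_0^{k-1}, g_0^k)
    g_acc = g_new
```

With real training gradients, the computed angle tends to shrink as the accumulated batch grows, which is the trend reported in Table 1.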
Table 1 shows a typical example: (i) gradient change is high at the beginning, (ii) gradient change reduces with increasing batch size, and (iii) eventually it starts fluctuating (here at k = 10). Intuitively, the less the direction of the accumulated gradients is moved by the gradients of a new mini-batch, the more certainty there is about the gradient direction. Thus we propose that the magnitude of the angle fluctuation relates to the certainty of the model parameter optimization direction, and may therefore serve as a measure of optimization difficulty.

Automated Batch Size with Gradient Direction Change
Table 1 shows that the optimization direction is less stable with a small batch than with a large one. However, after the direction of the gradients has stabilized, accumulating more mini-batches brings little benefit, as the gradient direction starts to fluctuate.
Thus, we suggest computing dynamic and efficient batch sizes by accumulating gradients of mini-batches and evaluating the gradient direction change with each new mini-batch, stopping the accumulation and performing an optimization step as soon as the gradient direction fluctuates.
In practice, we only monitor $a(g_0^{k-1}, g_0^k)$ for efficiency. We record the minimum angle change $a_{min}$ while accumulating gradients, and assume the gradient direction has started to fluctuate, stopping the accumulation of further mini-batches, when $a(g_0^{k-1}, g_0^k) > a_{min} * \alpha$, where $\alpha$ is a pre-specified hyper-parameter. In this way we obtain a dynamic batch size (the size of $d_0^k$).

Efficiently Monitoring Gradient Direction Change
In practice, a model may have a large number of parameters, and the cost of computing the cosine similarity between two corresponding gradient vectors is relatively high. To tackle this issue, we propose to divide model parameters into groups, and monitor the gradient direction change only on a selected group in each optimization step. For a multi-layer model like the Transformer, a group may consist of the parameters of one or several layers.
To select the parameter group which is sensitive to the batch size, we record the angles of gradient direction change $a(g_0^0, g_0^1), ..., a(g_0^{k-1}, g_0^k)$ during gradient accumulation, and define $a_{max}$ and $a_{min}$ as the maximum and minimum direction change:

$$a_{max} = \max(a(g_0^0, g_0^1), ..., a(g_0^{k-1}, g_0^k))$$
$$a_{min} = \min(a(g_0^0, g_0^1), ..., a(g_0^{k-1}, g_0^k))$$

We then use $\Delta a$ to measure the uncertainty reduction in the optimization direction:

$$\Delta a = a_{max} - a_{min}$$

Intuitively, the optimization direction of a parameter group with a larger $\Delta a$ profits more from the batch size, and such a group should be sampled more frequently.
We average the recent history of $\Delta a_k$ of the $k$th parameter group into $\overline{\Delta a}_k$. Inspired by Gumbel (1954), Maddison et al. (2014), and Zhang et al. (2019b), we first add Gumbel noise to each $\overline{\Delta a}_k$ to prevent the selection from collapsing onto a fixed group:

$$\Delta a^*_k = \overline{\Delta a}_k - \log(-\log u)$$

where $u \in (0, 1)$ is drawn from a uniform distribution. We then zero the negative values in $\Delta a^*_1, ..., \Delta a^*_n$ and normalize them into a probability distribution:

$$p_k = \frac{(\Delta a^*_k)^{\beta}}{\sum_{i=1}^{n} (\Delta a^*_i)^{\beta}}$$

We use $p_k$ as the probability to sample the $k$th group, where $\beta$ is a hyper-parameter to sharpen the probability distribution. We do not use softmax because it would heavily sharpen the distribution when the gap between values is large, making it almost impossible to select and evaluate groups other than the one with the highest $\Delta a^*_k$.
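The group-sampling step can be sketched as below. This is a minimal NumPy sketch under our reading of the text: `uncertainty_reduction` and `group_probs` are illustrative names, and the power-of-β sharpening is our interpretation of "β sharpens the probability distribution".

```python
import numpy as np

def uncertainty_reduction(angle_history):
    """Delta a = a_max - a_min over the recorded direction changes of one group."""
    return max(angle_history) - min(angle_history)

def group_probs(delta_a_means, beta=3.0, rng=np.random.default_rng(0)):
    """Sampling distribution over parameter groups: perturb each averaged
    Delta a with Gumbel noise, zero out negatives, sharpen with exponent
    beta, and normalize."""
    d = np.asarray(delta_a_means, dtype=float)
    u = rng.uniform(1e-10, 1.0, size=d.shape)   # u ~ U(0, 1)
    noisy = d - np.log(-np.log(u))              # add Gumbel noise
    noisy = np.maximum(noisy, 0.0)              # zero negative values
    sharpened = noisy ** beta                   # beta sharpens the distribution
    return sharpened / sharpened.sum()
```

One group is then drawn per optimization step according to these probabilities, so that direction-change monitoring touches only a fraction of the model's parameters.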

Experiments
We implemented our approach based on the Neutron implementation (Xu and Liu, 2019) of the Transformer translation model. We applied our approach to the training of the Transformer and, to compare with Vaswani et al. (2017), conducted our experiments on the WMT 14 English to German and English to French news translation tasks on 2 GTX 1080Ti GPUs. Hyper-parameters were tuned on the development set (newstest 2012 and 2013). We followed all settings of Vaswani et al. (2017) except for the batch size. We used a beam size of 4 for decoding, and evaluated case-sensitive tokenized BLEU with a significance test (Koehn, 2004).
We used an α of 1.1 to determine the fluctuation of gradient direction by default.We regarded each encoder/decoder layer as a parameter group, and used a β of 3 for the parameter group selection.

Performance
We compared the results of our dynamic batch size approach to two fixed batch size baselines (25k and 50k target tokens). Results are shown in Table 2, with the statistics of batch sizes of our approach shown in Table 3 and the detailed distribution of batch sizes for the En-De task shown in Figure 1.

4: For example, the result of softmax over [22, 31, 60] is [3.13e-17, 2.54e-13, 1.00]; the last element takes almost all probability mass. But we later found that if ∆a is normalized (∆a = (a_max − a_min)/a_max) in Equation 6, softmax works comparably well, which avoids the hyper-parameter β in Equation 8.

5: https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl
Tables 2 and 3 show that our approach outperforms both the fixed 25k and 50k batch size settings with an average batch size of around 26k, and our approach is slightly faster than the 25k setting despite the additional cost of monitoring the gradient direction change. Figure 1 shows the interesting fact that the most frequently used automated batch sizes were close to the fixed value (25k) of Vaswani et al. (2017).

Analysis of Minimum Gradient Direction Change
In order to observe how the minimum gradient direction change varies during training, we averaged the minimum angle over every 2.5k training steps. Results are shown in Figure 2.

6: It is hard to accumulate exactly 25k target tokens in a batch; in fact, the fixed 25k setting results in an average batch size of 26729.79.
Figure 2 shows that the minimum direction change of gradients was small at the beginning of training and gradually increased as training progressed. Given that a small angle change indicates more certainty in the gradient direction, this observation is consistent with the fact that finding the optimization direction becomes harder and harder as training proceeds.

Effects of α
We studied the effects of different α values on the En-De task, and the results are shown in Table 4. Table 4 shows that with increasing α, the average batch size and the time cost increase along with the performance. A wide range of values works relatively well, indicating that the selection of α is robust, and 1.1 seems to be a good trade-off between cost and performance in our experiments. It is also worth noting that α = 1 outperforms the 25k baseline while being 1.42 times faster (Table 2).
Related Work
Popel and Bojar (2018) demonstrate that the batch size affects the performance of the Transformer and that a large batch size tends to benefit performance, but they use fixed batch sizes during training. Abdou et al. (2017) propose to use a batch size linearly increasing from 65 to 100, which slightly outperforms their baseline. Smith et al. (2018) show that the same learning curve on both training and test sets can be obtained by increasing the batch size during training instead of decaying the learning rate.
For fast convergence, Balles et al. (2017) propose to approximately estimate the mean batch size for the next batch by maximizing the expected gain with a sample gradient variance ($||g||^2$) computed on the current batch, while our approach compares the gradient direction change ($a(g_0^{k-1}, g_0^k)$) during the accumulation of mini-batches into a large batch.
We suggest our approach is complementary to Sutskever et al. (2013), Duchi et al. (2011), and Kingma and Ba (2015), as their approaches decide the magnitude of the move in the optimization direction, while our approach provides a reliable gradient direction.

Conclusion
In this paper, we analyze the effects of accumulated batches on the gradient direction, and propose to achieve efficient automated batch sizes by monitoring the gradient direction change during gradient accumulation and performing an optimization step when the accumulated gradient direction is almost stable. To improve the efficiency of our approach with large models, we propose a sampling approach to select gradients of parameters sensitive to the batch size.
Our approach improves the Transformer with a fixed 25k batch size by +0.73 and +0.82 BLEU on the WMT 14 English to German and English to French tasks respectively while preserving efficiency.

Figure 1: Distribution of Dynamic Batch Sizes. Values on the y-axis are percentages.

Table 1: The direction change of gradients while accumulating mini-batches.

Table 2: Performance. Time is the training time on the WMT 14 En-De task for 100k training steps. † indicates p < 0.01 in the significance test.

Table 3: Statistics of Batch Size.