Making Asynchronous Stochastic Gradient Descent Work for Transformers

Asynchronous stochastic gradient descent (SGD) converges poorly for Transformer models, so synchronous SGD has become the norm for Transformer training. This is unfortunate because asynchronous SGD offers higher raw training speed, since workers do not wait for synchronization. Moreover, the Transformer is the basis for state-of-the-art models on several tasks, including machine translation, so training speed matters. To understand why asynchronous SGD under-performs, we blur the lines between asynchronous and synchronous methods. We find that summing several asynchronous updates, rather than applying them immediately, restores convergence behavior. With this method, the Transformer attains the same BLEU score 1.36 times as fast.


Introduction
Models based on Transformers (Vaswani et al., 2017) achieve state-of-the-art results on various machine translation tasks (Bojar et al., 2018). Distributed training is crucial to training these models in a reasonable amount of time, with the dominant paradigms being asynchronous or synchronous stochastic gradient descent (SGD). Prior work (Chen et al., 2016, 2018; Ott et al., 2018) found that asynchronous SGD yields low-quality models, but did not elaborate further. We confirm this experimentally in Section 2.1. Then we conduct ablation studies to understand what makes asynchronous SGD under-perform, leading to a hybrid that trains high-quality models without waiting for synchronization barriers.
In synchronous SGD, gradients are collected from all workers and summed before updating, equivalent to one large batch. These accumulation and waiting processes are absent in asynchronous SGD, where updates are applied immediately after they are computed by any processor. Since each update comes from one processor, the batch size per update in asynchronous SGD is smaller. Prior work has shown that smaller batches degrade the final quality of Transformers (Smith et al., 2017; Popel and Bojar, 2018). Moreover, the model has typically been updated several times while a gradient was being computed, so gradients are stale. Stale gradients potentially degrade final model quality (Zhang et al., 2016; Srinivasan et al., 2018).
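The two update rules can be contrasted with a toy single-parameter sketch (illustrative only, not the actual implementation; the function names and scalar gradients are ours):

```python
def sgd_step(params, grad, lr=0.1):
    """Plain SGD on a single scalar parameter: params - lr * grad."""
    return params - lr * grad

def synchronous_round(params, worker_grads, lr=0.1):
    """Synchronous SGD: wait for every worker, sum their gradients,
    then apply one update -- equivalent to one large batch."""
    return sgd_step(params, sum(worker_grads), lr)

def asynchronous_round(params, worker_grads, lr=0.1):
    """Asynchronous SGD (idealized, staleness ignored): each gradient
    is applied the moment it arrives, so every update sees only one
    worker's smaller batch."""
    for g in worker_grads:
        params = sgd_step(params, g, lr)
    return params
```

For plain SGD with a fixed learning rate the two coincide when all gradients are computed from the same parameters; the differences studied in this paper come from staleness and from Adam's gradient statistics.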
We investigate the effect of batch size and stale gradients on Transformer training, comparing with recurrent neural network (RNN) training. All of these experiments use the Adam optimizer, which has been shown to perform well on a variety of tasks (Kingma and Ba, 2014) and was used in the original Transformer paper (Vaswani et al., 2017). We find that small batch sizes slightly degrade quality while stale gradients substantially degrade quality.
We adapt prior work that summed gradients in various contexts (Dean et al., 2012; Lian et al., 2015; Ott et al., 2018; Bogoychev et al., 2018) to increase the batch size while reducing staleness. Empirically, summing gradients globally in the parameter server performs as well as synchronous SGD in terms of BLEU score, while also maintaining the speed benefit of asynchronous SGD.

Exploring Asynchronous SGD
In this section, we analyze the causes of poor performance in asynchronous SGD through a series of experiments.

Baseline: The Problem
To motivate this paper and set baselines, we first measure how poorly Transformers perform when trained with baseline asynchronous SGD (Chen et al., 2016, 2018; Ott et al., 2018). We train a Transformer model under both synchronous and asynchronous SGD, contrasting the results with an RNN model. Moreover, we sweep learning rates to verify this effect is not an artifact of choosing hyperparameters that favor one scenario.
Our experiments use systems for the WMT 2017 English-to-German news translation task. The Transformer is standard, with six encoder and six decoder layers. The RNN model (Barone et al., 2017) is based on the University of Edinburgh's winning WMT17 submission (Sennrich et al., 2017) and has 8 layers. Both models use backtranslated monolingual corpora (Sennrich et al., 2016a) and byte-pair encoding (Sennrich et al., 2016b). Quality is measured on newstest2016 using sacreBLEU (Post, 2018); we preserve newstest2017 as a test set for later experiments. We follow the remaining hyperparameter settings for both the Transformer and RNN models as suggested in the respective papers (Vaswani et al., 2017; Sennrich et al., 2017). We trained our models on four GPUs with a dynamic batch size of 10 GB per GPU using the Marian toolkit (Junczys-Dowmunt et al., 2018). Models are trained for 8 epochs or until five consecutive validations show no improvement in loss. The Transformer's learning rate is warmed up linearly for 16k updates. We apply an inverse square root learning rate decay following Vaswani et al. (2017) for both models. Parameters are optimized with Adam (Kingma and Ba, 2014), with β1 = 0.9 and β2 = 0.98.
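One common way to write the warmup-then-inverse-square-root schedule is the following (a sketch only; the exact scaling constant in the toolkit may differ, and the default `peak_lr` of 0.0003 here is the Transformer learning rate we use later):

```python
def lr_schedule(step, warmup=16000, peak_lr=0.0003):
    """Linear warmup for `warmup` updates, then inverse square root
    decay, normalized so the learning rate peaks at `peak_lr`
    exactly when warmup ends."""
    step = max(step, 1)  # avoid division by zero at step 0
    return peak_lr * min(step / warmup, (warmup / step) ** 0.5)
```

At step 16000 the schedule reaches its 0.0003 peak; before that it grows linearly, and afterwards it decays proportionally to step^(-1/2).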
Results in Table 1 confirm that asynchronous SGD generally yields lower-quality systems than synchronous SGD. For Transformers, the asynchronous results are catastrophic, often yielding 0 BLEU. We also see that Transformers and asynchronous SGD are more sensitive to the learning rate than RNNs and synchronous SGD.
For subsequent experiments, we will use a learning rate of 0.0003 for Transformers and 0.0006 for RNNs.These were near the top in both asynchronous and synchronous settings (Table 1).

Batch Size
Synchronous SGD has a larger effective batch size because it sums gradients from all workers, and hence approximates the full gradient better. This section investigates the extent to which batch size is the cause of poor convergence.
We use dynamic batching, fitting as many sentences as possible into a fixed amount of memory (so, e.g., a batch contains more sentences if all of them are short); batch sizes are therefore denominated in memory. Our GPUs each have 10 GB available for batches, which on average corresponds to 250 sentences.
Since there are 4 GPUs, baseline synchronous SGD has an effective batch size of 40 GB, compared to 10 GB in asynchronous SGD. We fill in the two missing scenarios: synchronous SGD with a total effective batch size of 10 GB and asynchronous SGD with a batch size of 40 GB. Because GPU memory is limited, we simulate a larger batch size in asynchronous SGD by locally accumulating gradients in each processor four times before sending the summed gradient to the parameter server (Ott et al., 2018; Bogoychev et al., 2018).
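Local accumulation can be sketched as follows (a toy over scalar gradients; in practice the running sums are gradient tensors held on the GPU):

```python
def local_accumulation(gradient_stream, accum=4):
    """Sum every `accum` consecutive gradients from one worker and
    emit the sums; each sum is pushed to the parameter server as a
    single gradient.  This emulates an `accum`-times larger batch
    without extra GPU memory, at the cost of `accum`-times fewer
    pushes (Ott et al., 2018; Bogoychev et al., 2018)."""
    pushes, running = [], 0.0
    for i, g in enumerate(gradient_stream, 1):
        running += g
        if i % accum == 0:
            pushes.append(running)  # one push to the parameter server
            running = 0.0
    return pushes
```

Note that the parameter server still applies each pushed sum immediately, so staleness is unaffected; only the effective batch size changes.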
Using a larger batch size reduces noise in estimating the overall gradient (Wang et al., 2013). Accordingly, both models achieved slightly better BLEU per update in the early stage of training, as shown in Figure 1. However, serially computing batches is time-consuming, so asynchronous training with a 40 GB batch size performs worse in terms of BLEU by time. Goyal et al. (2017) suggested that the learning rate can be increased in proportion to the batch size to compensate for the slower per-update processing. We can scale up the learning rate in RNN training from 0.0006 to 0.0012 without reducing final quality. Unfortunately, increasing the learning rate this way degrades quality in the Transformer model. From this experiment, we conclude that batch size is not the primary driver of the poor performance of asynchronously trained Transformers, though it does have some lingering impact on final model quality. For RNNs, batch size and the distributed training algorithm had little impact beyond the early stages of training, continuing the theme that Transformers are more sensitive to noisy gradients.

Gradient Staleness
A stale gradient occurs when parameters have been updated while a processor was computing its gradient. Staleness can be defined as the number of updates that occurred between the processor pulling parameters and pushing its gradient. In the ideal case where every processor spends equal time processing a batch, asynchronous SGD with N processors produces gradients with staleness N − 1. Empirically, we can also expect an average staleness of N − 1 with normally distributed computation time (Zhang et al., 2016).
To isolate the impact of staleness, we introduce staleness into synchronous SGD. Workers pull the latest parameters only once every U updates, yielding an average staleness of (U − 1)/2. Since asynchronous SGD has average staleness 3 with N = 4 GPUs, we set U = 7 to achieve the same average staleness of 3. Additionally, we tried a lower average staleness of 2 by setting U = 5.
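The synthetic-staleness schedule can be checked with a short simulation (a hypothetical helper, not taken from the experimental code):

```python
def average_staleness(num_updates, refresh_every):
    """Workers refresh parameters only every `refresh_every` updates,
    so the gradient applied at update t was computed from parameters
    that are (t mod refresh_every) updates old.  Staleness therefore
    cycles through 0, 1, ..., refresh_every - 1, with mean
    (refresh_every - 1) / 2."""
    staleness = [t % refresh_every for t in range(num_updates)]
    return sum(staleness) / len(staleness)
```

With U = 7 this gives an average staleness of 3, matching asynchronous SGD on four GPUs; U = 5 gives an average of 2.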
To focus on the impact of staleness, we set the batch size to 40 GB of total RAM consumption, whether as 4 GPUs with 10 GB each in synchronous SGD or as emulated 40 GB batches on each GPU in asynchronous SGD.
Results are shown in Figure 2. Staleness 3 substantially degrades Transformer convergence and final quality (Figure 2a). However, the impact of staleness 2 is relatively minor. We also continue to see that Transformers are more sensitive than RNNs to training conditions.
An alternative way to interpret staleness is as the distance between the parameters with which the gradient was computed and the parameters being updated by the gradient. To see this effect, we run another set of experiments with double the learning rate, so that parameters move faster.
Results for the Transformer worsen when we double the learning rate (Figure 3). With staleness 3, the model stayed at 0 BLEU under both synchronous and asynchronous SGD, consistent with our earlier result (Table 1).
We conclude that staleness is primarily, but not wholly, responsible for the poor performance of asynchronous SGD in training Transformers. However, asynchronous SGD still underperforms synchronous SGD with artificial staleness of 3 and the same batch size (40 GB). Our synchronous SGD training has consistent parameters across processors, whereas processors might have different parameters in asynchronous training. The staleness distribution might also play a role: staleness in asynchronous SGD follows a normal distribution (Zhang et al., 2016), while our synthetic staleness in synchronous SGD follows a uniform distribution.

Incremental Updates in Adam
Investigating the effect of batch size and staleness further, we analyze why it makes a difference that gradients computed from the same parameters are applied one at a time (incurring staleness) instead of summed and applied once (as in synchronous SGD). As seen in Section 2.3, our artificial staleness was damaging to convergence even though gradients were synchronously computed with respect to the same parameters. In standard stochastic gradient descent there is no difference: gradients are multiplied by the learning rate then subtracted from the parameters in either case. The Adam optimizer, however, handles incremental updates and sums differently.
Adam is scale invariant. For example, suppose that two processors generate gradients 0.5 and 0.5 with respect to the same parameter in the first iteration. Incrementally updating with 0.5 and 0.5 is the same as updating with 1 and 1 due to scale invariance. Updating with the summed gradient, 1, will only move parameters half as far. This is the theory underlying the rule of thumb that the learning rate should scale with batch size (Ott et al., 2018).
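Both points can be checked numerically with a minimal scalar Adam (a self-contained sketch using the paper's β1 = 0.9, β2 = 0.98; `adam_run` and `sgd_run` are illustrative helpers, not the toolkit's implementation):

```python
class Adam:
    """Minimal scalar Adam (Kingma and Ba, 2014) with bias correction."""
    def __init__(self, lr=0.001, b1=0.9, b2=0.98, eps=1e-16):
        self.lr, self.b1, self.b2, self.eps = lr, b1, b2, eps
        self.m = self.v = 0.0
        self.t = 0

    def step(self, theta, g):
        self.t += 1
        self.m = self.b1 * self.m + (1 - self.b1) * g
        self.v = self.b2 * self.v + (1 - self.b2) * g * g
        m_hat = self.m / (1 - self.b1 ** self.t)   # bias-corrected mean
        v_hat = self.v / (1 - self.b2 ** self.t)   # bias-corrected variance
        return theta - self.lr * m_hat / (v_hat ** 0.5 + self.eps)

def adam_run(grads):
    opt, theta = Adam(), 0.0
    for g in grads:
        theta = opt.step(theta, g)
    return theta

def sgd_run(grads, lr=0.001):
    theta = 0.0
    for g in grads:
        theta -= lr * g
    return theta

# Plain SGD: two updates of 0.5 equal one update of their sum, 1.
assert abs(sgd_run([0.5, 0.5]) - sgd_run([1.0])) < 1e-12

# Adam is scale invariant: updating with 0.5, 0.5 matches 1, 1 ...
assert abs(adam_run([0.5, 0.5]) - adam_run([1.0, 1.0])) < 1e-9
# ... so the single summed update, 1, moves only half as far.
assert abs(adam_run([1.0]) - 0.5 * adam_run([0.5, 0.5])) < 1e-9
```

The tiny ε means scale invariance holds only approximately, but to well within the tolerances above.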
In practice, gradients reported by processors are usually not the same: they are noisy estimates of the true gradient. In Table 2, we show examples where noise causes Adam to slow down. Summing gradients smooths out some of the noise. Next, we examine the formal basis for this effect.
Formally, Adam estimates the full gradient with an exponentially decaying average m_t of gradients g_t. The term Var(g_t)/(E[g_t])^2 is the statistical efficiency: the square of the coefficient of variation. In other words, Adam gives higher weight to gradients if historical samples have a lower coefficient of variation. The coefficient of variation of a sum of N independent samples decreases as 1/√N. Hence sums (despite having less frequent updates) may actually cause Adam to move faster because they have a smaller coefficient of variation. An example appears in Table 2: updating with 1 moves faster than individually applying −1 and 2.
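The slowdown in Table 2 can be reproduced qualitatively with a minimal scalar Adam (a self-contained sketch; the exact trajectory depends on initialization and hyperparameters, so only the qualitative effect is asserted):

```python
class Adam:
    """Minimal scalar Adam with bias correction (β2 = 0.98 as in our setup)."""
    def __init__(self, lr=0.001, b1=0.9, b2=0.98, eps=1e-16):
        self.lr, self.b1, self.b2, self.eps = lr, b1, b2, eps
        self.m = self.v = 0.0
        self.t = 0

    def step(self, theta, g):
        self.t += 1
        self.m = self.b1 * self.m + (1 - self.b1) * g
        self.v = self.b2 * self.v + (1 - self.b2) * g * g
        m_hat = self.m / (1 - self.b1 ** self.t)
        v_hat = self.v / (1 - self.b2 ** self.t)
        return theta - self.lr * m_hat / (v_hat ** 0.5 + self.eps)

def run(grads):
    opt, theta = Adam(), 0.0
    for g in grads:
        theta = opt.step(theta, g)
    return theta

# 20 noisy incremental updates (mean 0.5) versus the same gradients
# summed pairwise into 10 updates of 1: the noisy sequence has a
# larger coefficient of variation, so Adam moves the parameter less.
noisy = run([-1.0, 2.0] * 10)
summed = run([1.0] * 10)
assert noisy < 0 and summed < 0   # both eventually move the right way
assert abs(noisy) < abs(summed)   # but the noisy run covers less distance
```

With these settings the noisy run also spends its first few updates moving in the wrong direction before the sign corrects itself, consistent with the six steps reported in Table 2.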
With these results in mind, we focus on how best to sum gradients.

Gradient Summing
Several papers wait for and sum P gradients from different workers as a way to reduce staleness. In Chen et al. (2016), gradients are accumulated from different processors; whenever P gradients have been pushed, the other processors cancel their work and restart from the beginning. This is relatively wasteful since some computation is thrown out, and P − 1 processors still idle at the synchronization point. Gupta et al. (2016) suggest that restarting is not necessary, but processors still idle while waiting for P gradients. Our proposed method follows Lian et al. (2015), in which an update happens every time P gradients have arrived and processors continually generate gradients without synchronization.
Another direction for dealing with stale gradients is to reduce their effect on the model update. McMahan and Streeter (2014) dynamically adjust the learning rate depending on staleness. Dutta et al. (2018) suggest ignoring stale gradient pushes entirely.

Increasing Staleness
In the opposite direction, some work has added noise to gradients or increased staleness, typically to cut computational costs. Recht et al. (2011) propose a lock-free asynchronous gradient update. Lossy gradient compression by bit quantization (Seide et al., 2014; Alistarh et al., 2017) or threshold-based sparsification (Aji and Heafield, 2017; Lin et al., 2017) also introduces noisy gradient updates. On top of that, these techniques store unsent gradients to be added into the next gradient, increasing staleness for small gradients. Dean et al. (2012) mention that communication overhead can be reduced by lowering the frequency of gradient pushes and parameter synchronization. In McMahan et al. (2017), each processor independently updates its own local model and periodically synchronizes parameters by averaging across processors. Ott et al. (2018) accumulate gradients locally before sending them to the parameter server. Bogoychev et al. (2018) also accumulate gradients locally, but additionally update local parameters in between.
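How sparsification delays small gradients can be made concrete with a pure-Python sketch of threshold-based sparsification with a residual buffer, in the spirit of Aji and Heafield (2017) (the function name and scalar lists are ours, for illustration):

```python
def sparse_push(grad, residual, keep_ratio=0.5):
    """Send only the largest-magnitude entries of grad + residual;
    everything unsent stays in the residual and is added to the
    next gradient.  Small entries may wait several pushes before
    being applied, so compression is itself a source of staleness."""
    total = [g + r for g, r in zip(grad, residual)]
    k = max(1, int(keep_ratio * len(total)))
    # threshold = k-th largest magnitude; ties may send a few extra entries
    threshold = sorted((abs(x) for x in total), reverse=True)[k - 1]
    sent = [x if abs(x) >= threshold else 0.0 for x in total]
    new_residual = [t - s for t, s in zip(total, sent)]
    return sent, new_residual
```

A single push with gradients [0.001, -5.0, 0.002, 3.0] sends only the two large entries; the small ones remain in the residual, to be retried on the next push.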

Accumulated Asynchronous SGD
Previous experiments have shown that increasing the batch size and reducing staleness improve the final quality of asynchronous training. Increasing the batch size can be achieved by accumulating gradients before updating. We experiment with variations on three ways to accumulate gradients:
Local Accumulation: Gradients can be accumulated locally in each processor before being sent to the parameter server (Ott et al., 2018; Bogoychev et al., 2018). This approach scales the effective batch size and reduces communication costs, as the workers communicate less often. However, it does not reduce staleness, since the parameter server updates immediately after receiving a gradient. We experiment with accumulating four gradients locally, resulting in a 40 GB effective batch size.
Global Accumulation: Each processor sends its computed gradient to the parameter server normally. However, the parameter server holds the gradients and only updates the model after it receives multiple gradients (Dean et al., 2012; Lian et al., 2015). This approach scales the effective batch size. On top of that, it decreases staleness because the parameter server updates less often. However, it does not reduce communication costs. We experiment with accumulating four gradients globally, resulting in a 40 GB effective batch size and 0.75 average staleness.
Combined Accumulation: Local and global accumulation can be combined to gain the benefits of both: reduced communication cost and reduced average staleness. In this approach, gradients are accumulated locally in each processor before being sent. The parameter server also waits and accumulates gradients before running the optimizer. We accumulate two gradients both locally and globally. This yields a 40 GB effective batch size and 1.5 average staleness.
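The global-accumulation server can be sketched as follows (a single-threaded toy with plain SGD standing in for Adam; in reality pushes arrive concurrently and the buffered sum is a tensor):

```python
class AccumulatingParameterServer:
    """Parameter server with global accumulation (Dean et al., 2012;
    Lian et al., 2015): gradients are summed as they arrive and the
    optimizer runs only once per `accum` pushes.  Workers never wait;
    pull() always returns whatever parameters are current."""
    def __init__(self, params, lr=0.0003, accum=4):
        self.params, self.lr, self.accum = params, lr, accum
        self.buffer, self.pending, self.updates = 0.0, 0, 0

    def push(self, grad):
        self.buffer += grad
        self.pending += 1
        if self.pending == self.accum:
            # One update with the summed gradient (our experiments use
            # Adam here).  With 4 workers and accum=4, updates happen
            # 4x less often, cutting average staleness from 3 to 0.75.
            self.params -= self.lr * self.buffer
            self.buffer, self.pending = 0.0, 0
            self.updates += 1

    def pull(self):
        return self.params
```

For example, eight pushes with accum=4 trigger exactly two optimizer updates, each seeing the sum of four workers' gradients.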
We tested the three gradient accumulation flavors on the English-to-German task with both Transformer and RNN models. Synchronous SGD also appears as a baseline. To compare results, we report best BLEU, raw training speed, and the time needed to reach several BLEU checkpoints. Results are shown in Table 3.
Asynchronous SGD with global accumulation actually improves the final quality of the model over synchronous SGD, albeit not by a meaningful margin. This one change, accumulating every 4 gradients (the number of GPUs), restores quality in asynchronous methods. It also achieves the fastest time to reach near-convergence BLEU for both the Transformer and the RNN.
While local accumulation provides even faster raw speed, it produces the worst quality among the accumulation techniques. Asynchronous SGD with 4x local accumulation is essentially ordinary asynchronous SGD with a 4x larger batch size and 4x lower update frequency. In particular, gradient staleness is unchanged, so this does not help convergence per update.
Combined accumulation performs somewhere in the middle. It does not converge as fast as asynchronous SGD with full global accumulation, but not as poorly as asynchronous SGD with full local accumulation. Its speed is also in between, reflecting communication costs.

Generalization Across Learning Rates
Earlier, Table 1 showed that asynchronous Transformer training is very sensitive to the learning rate. In this experiment, we use asynchronous SGD with global gradient accumulation to train English-to-German models at different learning rates. We compare our results with vanilla synchronous and vanilla asynchronous SGD.
Our findings empirically show that asynchronous Transformer training with globally accumulated gradients is significantly more robust. As shown in Table 5, the model is now capable of learning at higher learning rates and yields results comparable to its synchronous variant.

Generalization Across Languages
To test whether our findings on English-to-German generalize, we train two more translation systems using globally accumulated gradients. Specifically, we train English-to-Finnish (EN→FI) and English-to-Russian (EN→RU) models for the WMT 2018 task (Bojar et al., 2018). We validate our models on newstest2015 for EN→FI and newstest2017 for EN→RU. We then test on newstest2017 for EN→DE and newstest2018 for both EN→FI and EN→RU. The same network structures and hyperparameters are used as before.
The results shown in Table 4 empirically confirm that accumulating gradients to obtain a larger batch size and lower staleness massively improves Transformer results compared to basic asynchronous SGD (+6 BLEU on average). The improvement is smaller in the RNN experiments, but still substantial (+1 BLEU on average). We also have further confirmation that training a Transformer model with plain asynchronous SGD is impractical.

Conclusion
We evaluated the behavior of Transformer and RNN models under asynchronous training. We divide our analysis along the two main aspects in which asynchronous training differs: batch size and stale gradients. Our experimental results show that:
• In general, asynchronous training damages the final BLEU of the NMT model. However, the damage to the Transformer is significantly more severe. In addition, asynchronous training also requires a smaller learning rate to perform well.
• With the same number of processors, asynchronous SGD has a smaller effective batch size. We empirically show that training with a larger batch size slightly improves convergence, but the improvement is minimal. Results for the asynchronous Transformer remain subpar even with a larger batch size.
• Stale gradients play a bigger role in the poor training performance of the asynchronous Transformer. We have shown that the Transformer performs poorly when synthetic staleness is added, even under synchronous SGD.
Based on these findings, we suggest modifying asynchronous training to accumulate a few gradients (for example, as many as the number of processors) in the server before applying an update. This approach increases the batch size while also reducing the average staleness. We empirically show that it combines the high training quality of synchronous SGD with the high training speed of asynchronous SGD.
Figure 1: The effect of batch sizes on convergence of Transformer and RNN models.

Figure 2: Artificial staleness in synchronous SGD compared to synchronous and asynchronous baselines, all with our usual learning rate for each model.

Figure 3: The same comparison with doubled learning rates.

Table 2: The Adam optimizer slows down when gradients have larger variance even if they have the same average, in this case 1. When alternating between −1 and 2, Adam takes 6 steps before the parameter has the correct sign. Updates can even slow down if gradients point in the same direction but have different scales. The learning rate is α = 0.001.

Table 3: Quality and convergence of asynchronous SGD with accumulated gradients on the English-to-German dataset. Dashes indicate that the model never reached the target BLEU.

Table 4: The effect of global accumulation on translation quality for different language pairs on development and test sets, measured in BLEU.


Table 5: Performance of the asynchronous Transformer on English-to-German with 4x global accumulation (GA) across different learning rates, on the development set, measured in BLEU.