Accelerating Asynchronous Stochastic Gradient Descent for Neural Machine Translation

In order to extract the best possible performance from asynchronous stochastic gradient descent one must increase the mini-batch size and scale the learning rate accordingly. In order to achieve further speedup we introduce a technique that delays gradient updates effectively increasing the mini-batch size. Unfortunately with the increase of mini-batch size we worsen the stale gradient problem in asynchronous stochastic gradient descent (SGD) which makes the model convergence poor. We introduce local optimizers which mitigate the stale gradient problem and together with fine tuning our momentum we are able to train a shallow machine translation system 27% faster than an optimized baseline with negligible penalty in BLEU.


Introduction
With training times measured in days, parallelizing stochastic gradient descent (SGD) is valuable for making experimental progress and scaling data sizes.Synchronous SGD sums gradients computed by multiple GPUs into one update, equivalent to a larger batch size.But GPUs sit idle unless workloads are balanced, which is difficult in machine translation and other natural language tasks because sentences have different lengths.Asynchronous SGD avoids waiting, which is faster in terms of words processed per second.However asynchronous SGD suffers from stale gradients (Abadi et al., 2016) that degrade convergence, resulting in an almost no improvement in time to convergence (Hadjis et al., 2016).This paper makes asynchronous SGD even faster and deploys a series of convergence optimizations.
In order to achieve fastest training (and inspired by Goyal et al. (2017) we increase the mini-batch size, making the matrix operations more efficient and reducing the frequency of gradient communica-tion for the optimizer step.Unlike their task (image classification), text training consumes a lot of GPU Memory (Table 1) for word embedding activations making it impossible to fit mini-batches of similar magnitude as Goyal et al. (2017).
Our main contributions are as follows: 1. We introduce a delayed gradient updates which allow us to work with much larger minibatches which would otherwise not be possible due to limited GPU memory.2. We introduce local optimizers which run on each worker to mitigate the extra staleness and convergence issues (Dekel et al., 2010;Keskar et al., 2017) caused by large mini-batches.3. We highlight the importance of tuning the optimizer momentum and show how it can be used as a cooldown strategy.

Experiments
This section introduces each optimization along with an intrinsic experiment on the WMT 2016 Romanian→English task (Bojar et al., 2016).
The translation system is equivalent to Sennrich et al. (2016), which was the first place constrained system (and tied for first overall in the WMT16 shared task.).The model is a shallow bidirectional GRU (Bahdanau et al., 2014) encoder-decoder trained on 2.6 million parallel sentences.Due to variable-length sentences, machine translation systems commonly fix a memory budget then pack as many sentences as possible into a dynamicallysized batch.The memory allowance for minibatches in our system is 3 GB (for an average batch size of 2633 words).Adam (Kingma and Ba, 2015) is used to perform asynchronous SGD with learning rate of 0.0001.This is our baseline system.We also compare with a synchronous baseline which uses modified Adam parameters, warmup of 16000 mini-batches and inverse square root cooldown following Vaswani et al. (2017).We used 4 Tesla P 100 GPUs in a single node with the Marian NMT framework for training (Junczys-Dowmunt et al., 2018).Since we apply optimizations over asynchronous SGD we performed a learning rate and mini-batch-size parameter sweep over the baseline system and settled on a learning rate of 0.00045 and 10 GB memory allowance for mini-batches (average batch size of 10449 words).This is the fastest system we could train without sacrificing performance before adding our improvements.In our experiments on Table 2 we refer to this system as "Optimized asynchronous".All systems were trained until 5 consecutive stalls in the crossentropy metric of the validation set.Note that some systems require more epochs to reach this criteria which indicates poor model convergence.

Larger Batches and delayed updates
This experiment aims to increase speed, in wordsper-second (WPS), by increasing the batch size.Larger batches have two well-known impacts on speed: making more use of GPU parallelism and communicating less often.
After raising the batch size to the maximum that fits on the GPU,1 we emulate even larger batches by processing multiple mini-batches and summing their gradients locally without sending them to the optimizer.This still increases speed because communication is reduced (Table 1).We introduce parameter τ , which is the number of iterations a GPU performs locally before communicating externally as if it had run one large batch.The Words-per-second (WPS) column on Table 1 shows the effect on corpora processing speed when applying delayed gradients updates for different values of τ .While we reduce the overall training time if we just apply delayed gradient updates we worsen the overall convergence (Table 2).
When increasing the mini-batch size τ times without touching the learning rate, we effectively do τ times less updates per epoch.On the surface, it might seem that these less frequent updates are counterbalanced by the fact that each update is accumulated over a larger batch.But practical optimization heuristics like gradient clipping mean that in effect we end up updating the model less often, resulting in slower convergence.Goyal et al. (2017) recommend scaling the learning rate linearly with the mini-batch size in order to maintain convergence speed.

Warmup
Goyal et al. ( 2017) point out that just increasing the learning rate performs poorly for very large batch sizes, because when the model is initialized at a random point, the training error is large.Large error and large learning rate result in bad "jerky" updates to the model and it can't recover from those.Goyal et al. (2017) suggest that initially model updates should be small so that the model will not be pushed in a suboptimal state.Afterwards we no longer need to be so careful with our updates.

Lowering initial learning rate
Goyal et al. ( 2017) lower the initial learning rate and gradually increase it over a number of minibatches until it reaches a predefined maximum.This technique is also adopted in the work of Vaswani et al. (2017).This is the canonical way to perform warmup for neural network training.

Local optimizers
We propose an alternative warm up strategy and compare it with the canonical method.Since we emulate large batches by running multiple smaller batches, it makes sense to consider whether to optimize locally between each batch by adapting the concept of local per-worker optimizers from Zhang et al. (2014).In asynchronous SGD setting each GPU has a full copy of the model as well as the master copy of 1/N th of the parameters in its capacity as parameter server.We use the local optimizers to update the local model shard in between delayed gradient updates, which helps mitigate staleness.
Unlike prior work, we also update the shard of the global model that happens to be on the same GPU.Local updates are almost free because we avoid remote device communication.
Updating the parameter shard of the global model bears some resemblance to the Hogwild method (Recht et al., 2011) as we don't synchronize the updates to the shard, however, global updates are still synchronised.As before, once every τ iterations we run a global optimizer that updates the sharded parameter set and then distributes the updated model across all devices.Any local model divergences are lost at this point.We found that this strategy improves model convergence in the early epochs but tends to be harmful later on.We hypothesize that initially partial model updates reduce staleness, but when the model starts to converge, local optimizers introduce extra noise in the training, which is harmful.We use local optimizers purely as a warmup strategy, turning them off after the initial phase of the training.Empirically, we found that we can get the best convergence by using them for the first 4000 mini-batches that each device sees.On Table 2 we compare and contrast the two warmup strategies.By itself learning-rate warmup offers slower convergence but to a better point compared to local optimizers.The reader may notice that if we apply delayed gradient updates, the effective batch size that the global optimizer deals with is τ times larger than the mini-batch size on which the local optimizers runs.Therefore we use τ times lower learning rate for the local optimizers compared to the global optimizers.Momentum tuning is not a well explored area in deep learning.Most researchers simply use the default values for momentum for a chosen optimizer (Hadjis et al., 2016) (in the case of NMT, this is usually Adam).Hadjis et al. (2016) argue that this is an oversight especially when it comes to asynchronous SGD, because the asynchronisity adds extra implicit momentum to the training which is not accounted for.Because of this, asynchronous SGD has been deemed ineffective, as without mo-mentum tuning, the observed increase in training speed is negated by the lower convergence rate, resulting in near-zero net gain (Abadi et al., 2016).However, Hadjis et al. (2016) show that after performing a grid search over momentum values, it is possible to achieve convergence rates typical for synchronous SGD even when working with many asynchronous workers.The downside of momentum tuning is that we can't offer rule-of-thumb values, as they are individually dependent on the optimizer used, the neural model, the number of workers and the batch size.In our experiments, we lowered the overall momentum and in addition performed momentum cooldown where we reduced the momentum of our optimizer (Adam) after the first few thousand batches.

Results
Table 2 shows the effect of modifying momentum values.When using just delayed gradient updates, training is noticeably faster, but there are significant regressions in BLEU and CE (system 2).In order to mitigate those, when using delayed gradient updates, we tune the momentum and apply momentum cooldown on top of either of our warmup strategies.By doing this we not only further reduce training time, but also recover the loss of accuracy.Compared to the optimized baseline system (1), our best system (4) reduces the training time by 27%.Progression of the training can be seen on figures 1 and 2. Our system starts poorly compared to the baselines in terms of epoch-for-epoch convergence, but catches up in the later epochs.Due to faster training speed however, the desired BLEU score is achieved faster (Figure 2).
Local optimizers as a warmup strategy show faster convergence compared to learning rate warmup at almost no penalty to BLEU or crossentropy (System 4 vs system 6).Against the system used in WMT 16 (Sennrich et al., 2016), we achieve nearly 4 times faster training time with no discernible penalty in BLEU or CE.In contrast, the other communication reducing method tested, the work of Aji and Heafield (2017), is slower than our work and achieves worse BLEU and CE.

Using even larger mini-batches
We can achieve even greater processing speed by further increasing τ but we were unable to maintain the same convergence with the Romanian-English shallow model.We found that larger τ values are useful when dealing with the larger deep RNN mod- Table 2: Romanian-English results from our exploration and optimizations.We also compare our methods against the work of Aji and Heafield (2017) which also reduces communication.We use system (1) as our reference baseline upon which we improve.The system that achieved the best training time is bolded.els.With deep RNN models the parameters take the majority of the available VRAM leaving very little for mini-batches.In this scenario we can apply τ = 4 without negative effect towards convergence.We demonstrate the effectiveness of larger τ on Table 3.The baseline system is equivalent to the winning system for English-German at the WMT 2017 competition (Sennrich et al., 2017).The baseline is trained with synchronous SGD and our system uses asynchronous SGD, delayed gradient updates by a factor of 4, local optimizers and the momentum is tuned and further reduced after the first 16000 mini-batches.We found learning rate of 0.0007 to work the best.We do not report the numbers for asynchronous baseline because we were unable to achieve competitive BLEU scores without using delayed gradient updates.We speculate this is because with this type of deep model, our mini-batch size is very small leading to very jerky and unstable training updates.Larger mini-batches ensure the gradients produced by different workers are going to be closer to one another.Our training progression can be seen on figures 3 and 4. We show that even though we use 4 times larger mini-batches we actually manage to get lower Cross-Entropy epoch for epoch compared to the baseline (Figure 3).This coupled with out higher training speed makes our method reach the best BLEU score 1.6 times faster than the baseline (Figure 4).

Related work
We use larger mini-batches and delay gradient updates in order to increase the speed at which the dataset is processed.The principal reason why this works is because when mini-batch size is increased n (also includes delayed updates) times, communication is reduced by the same amount.This as-  pect of our work is similar to the work of Aji and Heafield (2017) where they drop the lower 99% of the gradient updates based on absolute value thus reducing the memory traffic.Compared with them we achieve faster dataset processing speed and also better model convergence as shown on Table 2.
Independently from us Mao et al. (2018) extend the work of Aji and Heafield (2017) aiming to reduce gradient communication without suffering any of the negative effects we have noted.In process they independently arrive to some of the methods that we use, notably tuning the momentum and applying warmup to achieve better convergence.
Independently from us Shazeer and Stern (2018) have done further exploratory work on ADAM's momentum parameters using the Transformer model (Vaswani et al., 2017)

Conclusion and Future work
We show that we can increase speed and maintain convergence rate for very large mini-batch asynchronous SGD by carefully adjusting momentum and applying warmup and cooldown strategies.While we have demonstrated our methods on GPUs, they are hardware agnostic and can be applied to neural network training on any multidevice hardware such as TPUs or Xeon Phis.We were able to achieve end-to-end training on multiple tasks a lot faster than the baseline systems.For our Romanian-English model, we train nearly 3X faster than the commonly used baseline and 1.5X faster over a specifically optimised baseline.When experimenting with English-German we are able to train our model 1.3X faster than the baseline model, achieving practically the same BLEU score and much better model cross-entropy.
In the future we would like to apply local optimizers in distributed setting where the communication latency between local and remote devices varies significantly we could use local optimizers to synchronize remote models less often.
Goyal et al. (2017) andVaswani et al. (2017) both employ cooldown strategies that lower the learning rate towards the end of training.Inspired by the work ofHadjis et al. (2016) however we decided to pursue a different cooldown strategy by modifying the momentum inside Adam's parameters.
Figure 1: Cross-entropy training progression per epoch for our ro-en systems.
Figure 2: BLEU scores for our ro-en systems.

Figure 3 :
Figure 3: CE scores for our en-de systems.

Figure 4 :
Figure 4: BLEU scores for our en-de systems.

Table 1 :
Relationship between the GPU Memory (VRAM) budget for batches (* means emulated by summing τ smaller batches), number of source words processed in each batch and wordsper-second (WPS) measured on a shallow model.

Table 3 :
Training times for English-German deep RNN system trained on WMT17 data.Our asynchronous system includes the optimizations of system (4) from Table2.
This work was published on 11.04.2018. 3This work was published on 01.05.2018. 2