Combining Global Sparse Gradients with Local Gradients in Distributed Neural Network Training

One way to reduce network traffic in multi-node data-parallel stochastic gradient descent is to only exchange the largest gradients. However, doing so damages the gradient and degrades the model’s performance. Transformer models degrade dramatically while the impact on RNNs is smaller. We restore gradient quality by combining the compressed global gradient with the node’s locally computed uncompressed gradient. Neural machine translation experiments show that Transformer convergence is restored while RNNs converge faster. With our method, training on 4 nodes converges up to 1.5x as fast as with uncompressed gradients and scales 3.5x relative to single-node training.


Introduction
In recent years, neural network models have grown dramatically in the number of parameters (Wen et al., 2017), so exchanging gradients during data-parallel training is costly in terms of both bandwidth and time, especially in a distributed setting. Communication can be reduced (possibly at the expense of convergence) by sending only the top 1% of gradients by absolute value, a method known as gradient dropping (Strom, 2015; Dryden et al., 2016; Aji and Heafield, 2017; Lin et al., 2018). Related methods are synchronizing less often (McMahan et al., 2017; Ott et al., 2018; Bogoychev et al., 2018) and quantization (Seide et al., 2014; Alistarh et al., 2016).
As these compression methods are lossy, each node's locally computed gradient is not immediately reflected in the global gradient. Our experiments show that gradient compression damages the model's performance, especially in the case of a Transformer model (Vaswani et al., 2017), which is known to be sensitive to noisy gradients (Chen et al., 2018; Ott et al., 2018; Aji and Heafield, 2019). We aim to repair the compressed gradient by combining it with the local gradients to improve the trade-off between convergence and compression rates.
In this paper, we apply gradient dropping to reduce the inter-node communication during distributed neural network training, which leads to faster training speed but reduced model convergence rate. We find that combining the sparse global gradient with the dense local gradient improves convergence. However, adding local information means that nodes' parameters will diverge over time. We address this by periodically averaging the model (McMahan et al., 2017), achieving faster end-to-end training time.

Sparse Gradient Compression
Gradients are skewed: most values are near zero while very few have large absolute value (Aji and Heafield, 2017). Formally, Pearson's skewness coefficient is typically 2–4, but up to 262 in embedding matrices, where many of the parameters lie. Sparse gradient compression exploits this by rounding gradients below a threshold to zero and sending only a sparse matrix of large gradients (Strom, 2015). The threshold can be set dynamically to keep the top 1% of gradients, achieving a constant compression ratio (Dryden et al., 2016). Unsent gradients are added to the next gradient prior to compression (Seide et al., 2014).
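As an illustration of the dynamic threshold, the top-1% selection can be sketched in a few lines of NumPy. This is a minimal sketch under our own naming, not Marian's implementation:

```python
import numpy as np

def sparsify_top_fraction(grad, fraction=0.01):
    """Keep only the largest `fraction` of gradient values by magnitude.

    Returns the sparse gradient (a dense array with small entries zeroed)
    and the residual of dropped values, which would be carried over to the
    next step. Names are illustrative.
    """
    flat = np.abs(grad).ravel()
    k = max(1, int(fraction * flat.size))
    # Dynamic threshold: the magnitude of the k-th largest entry.
    threshold = np.partition(flat, -k)[-k]
    mask = np.abs(grad) >= threshold
    sparse = np.where(mask, grad, 0.0)
    residual = grad - sparse  # unsent values, added to the next gradient
    return sparse, residual
```

Since `sparse + residual` always reconstructs the original gradient, no information is permanently lost; it is merely delayed by the error-feedback mechanism.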
Gradient dropping is outlined in Algorithm 1. At each time step t, each node n computes a local gradient L_t^n on its data. The error-feedback mechanism adds the unsent gradients from the previous step, E_{t-1}, to the local gradient L_t^n. The combined gradient is then split into a sparse gradient S_t^n and a residual E_t.

[Algorithm 1: Gradient dropping on node n. procedure SPARSESGD(L_t^n), where L_t^n is the local gradient of node n at step t; the sparse gradients are all-reduced into a global gradient G_t, followed by ApplyOptimizer(G_t).]

Although the gradient is sparse, all parameters are updated because Adam (Kingma and Ba, 2015) has momentum terms. Parameter updates run redundantly on all nodes so that only gradients are sent over the network.
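A minimal single-node sketch of this step, with the all-reduce simulated as a plain sum over nodes (function and variable names are ours, not the paper's released code):

```python
import numpy as np

def sparse_sgd_step(local_grad, error, fraction=0.01):
    """One gradient-dropping step on a single node (sketch of Algorithm 1).

    `error` holds the unsent gradient values from the previous step
    (the error-feedback term E_{t-1}).
    """
    combined = local_grad + error            # add unsent gradients back in
    flat = np.abs(combined).ravel()
    k = max(1, int(fraction * flat.size))
    threshold = np.partition(flat, -k)[-k]
    sparse = np.where(np.abs(combined) >= threshold, combined, 0.0)
    new_error = combined - sparse            # E_t: values held for later
    return sparse, new_error

def all_reduce(sparse_grads):
    # Sum of all nodes' sparse gradients; in a real system this is an
    # all-reduce over the network rather than a local sum.
    return np.sum(sparse_grads, axis=0)
```

Each node would then call the optimizer on the all-reduced result, so parameter updates stay identical across nodes.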
The sum of sparse gradients is less sparse. We can send the summed gradient by itself (Lin et al., 2018) or again take the top 1% of summed gradients. Our cluster of 4 nodes is small enough that there was little speed difference, so we did not recompress the summed gradients.

Federated Averaging
Another way to reduce the bandwidth cost in multi-node training is to reduce the communication frequency (McMahan et al., 2017). In federated averaging, workers do not exchange gradients. Instead, each worker uses its local gradient to update its own local parameters, and every few steps all workers synchronize by averaging their parameters across nodes.
In contrast with gradient dropping, federated averaging mainly uses the worker's local gradients for parameter updates. Gradients from other workers are not directly communicated and are therefore not taken into account by the optimizer.
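A round of federated averaging can be sketched as follows, assuming plain SGD local updates for simplicity (the function name and `sync_every` parameter are illustrative):

```python
import numpy as np

def federated_averaging_round(params_per_node, local_grads, lr=0.1,
                              step=0, sync_every=500):
    """One step of federated averaging across nodes (a sketch).

    Each node takes a purely local gradient step; no gradients are
    exchanged. Every `sync_every` steps, all nodes replace their
    parameters with the cross-node average.
    """
    updated = [p - lr * g for p, g in zip(params_per_node, local_grads)]
    if (step + 1) % sync_every == 0:
        avg = np.mean(updated, axis=0)      # the only communication
        updated = [avg.copy() for _ in updated]
    return updated
```

Between synchronizations the nodes' parameters drift apart, which is exactly the staleness discussed in the next section.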

Combining With Local Gradients
Recent work suggests that the Transformer is sensitive to noisy gradients, resulting in substantially worse models (Chen et al., 2018; Ott et al., 2018; Aji and Heafield, 2019). Consistent with these findings, both gradient sparsification and federated averaging yield low-quality Transformer models in our experiments. In gradient sparsification, noise comes from both thresholding and the error-feedback mechanism, which causes stale gradients. Federated averaging also introduces stale updates, as this approach delays model synchronization. Previous work has shown that both noisy and stale gradients damage the model's quality (McMahan and Streeter, 2014; Ott et al., 2018; Dutta et al., 2018).
To address noisy updates in gradient sparsification, we combine the compressed global gradient and the uncompressed locally computed gradient in an effort to better approximate the true global gradient. Formally, let G_t be the compressed global gradient at time t and L_t^n be the gradient computed locally on node n. These are combined into C_t^n, which is used to update the parameters.
An arguably naïve method sums the two gradients: C_t^n = G_t + L_t^n. With a scale-invariant optimizer like Adam, this is equivalent to averaging them.
However, some of the locally computed gradients were sent out and became part of the global gradient, so the sum double-counts them. To compensate, we can subtract out the gradients S_t^n sent by node n: C_t^n = G_t + L_t^n − S_t^n.
The term G_t − S_t^n equals the sum of all sparse gradients from the other nodes (or approximates it when the all-reduce compresses the result). The local gradient L_t^n used for updating does not include the error-feedback term E_t^n, to prevent applying gradients multiple times while they are pending in error feedback.
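The combination described above is a single elementwise expression; a minimal sketch (variable names follow the text's notation, not a released implementation):

```python
import numpy as np

def combine_gradients(global_sparse, local_grad, sent_sparse):
    """Combine the compressed global gradient G_t with the node's dense
    local gradient L_t^n, subtracting the locally sent sparse values
    S_t^n to avoid double counting:

        C_t^n = G_t + L_t^n - S_t^n
    """
    return global_sparse + local_grad - sent_sparse
```

The result C_t^n is dense, so the optimizer on each node sees a full-rank approximation of the true global gradient rather than only its largest entries.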

Periodic Synchronization
Nodes will diverge because local gradients differ. Therefore, models are averaged periodically. We average parameters (McMahan et al., 2017) every 500 steps with minimal impact on speed.
In the limit, a gradient can be applied twice. First, a local update eventually reaches the other nodes via periodic averaging. Second, it accumulates with enough other gradients to be selected for inclusion in the compressed gradient and is applied as part of a global update.

Experimental Setup
We use Marian (Junczys-Dowmunt et al., 2018) to train on nodes with four P100 GPUs each. Multi-node experiments use 4 of these nodes, connected with 40 Gb/s Mellanox InfiniBand. These scenarios are abbreviated as 1x4 (one node with four GPUs) and 4x4 (four nodes with four GPUs each).

Model and Dataset
We perform our neural machine translation experiments on the following architectures.
Transformer: We train a Transformer model with six encoder and six decoder layers with tied embeddings. The model has 62M parameters. We train the model on the WMT 2017 English to German dataset with back-translated monolingual corpora (Sennrich et al., 2016b) and byte-pair encoding (Sennrich et al., 2016c), consisting of 19.1M sentence pairs. Model performance is validated on newstest2016 and tested on newstest2017.
Deep RNN: We also train a deep RNN model (Sennrich et al., 2017) with eight layers of bidirectional LSTM, consisting of 225M parameters. We train the model with the same English to German dataset from the Transformer experiment.
Shallow RNN: Our shallow RNN model is based on the winning system by Sennrich et al. (2016a) and is a single-layer bidirectional encoder-decoder LSTM with attention, consisting of 119M parameters. We train this model on the WMT 2016 Romanian to English dataset, consisting of 2.5M sentence pairs. We also apply byte-pair encoding to this dataset. Model performance is validated on newsdev2016 and tested on newstest2016.
We apply layer normalization (Lei Ba et al., 2016) and exponential smoothing, training each model for 8 epochs.

Scaling Hyperparameters
In all our experiments, we use a memory budget of 10GB per GPU to dynamically fit as many sentences as possible, corresponding to an average of 450 and 250 sentences per batch per GPU for Ro-En and En-De, respectively. Hence, we make several adjustments to the hyperparameters to accommodate the larger effective batch size of multi-node synchronous SGD.

Learning rate: The Adam optimizer is scale-invariant, so the parameters move by the same magnitude regardless of the gradient size. Therefore, we linearly scale the learning rate in all multi-node experiments, as suggested by Goyal et al. (2017). On one node, we use a learning rate of 0.0003 for the Transformer and deep RNN models, and 0.001 for the shallow RNN model. These values are multiplied by 4 in the 4-node setting. The single-node learning rates are optimized in the sense that increasing them further damages performance.
Warm-up: Learning rate warm-up helps overcome initial model instability when training with large mini-batches (Goyal et al., 2017). We add a linear learning rate warm-up for the Transformer, deep RNN, and shallow RNN for the first 16k, 4k, and 2k steps, respectively. We apply an inverse square root cool-down following Vaswani et al. (2017) for the Transformer and deep RNN models.
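The scaled schedule above can be sketched as a single function; the values mirror the Transformer setting in the text (base rate 0.0003, 4-node scaling, 16k warm-up steps), while the function name and argument names are ours:

```python
def learning_rate(step, base_lr=0.0003, warmup_steps=16000, num_nodes=4):
    """Sketch of the multi-node learning-rate schedule: the single-node
    base rate is scaled linearly by the number of nodes (Goyal et al.,
    2017), with a linear warm-up followed by inverse square root decay
    (Vaswani et al., 2017).
    """
    peak = base_lr * num_nodes
    if step < warmup_steps:
        return peak * step / warmup_steps           # linear warm-up
    return peak * (warmup_steps / step) ** 0.5      # inverse sqrt cool-down
```

For example, the rate ramps from 0 to 0.0012 over the first 16k steps and then decays, halving by step 64k.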
We follow the remaining hyperparameter settings suggested in the original papers (Vaswani et al., 2017; Sennrich et al., 2017, 2016a).

Restoring Quality
We approximate the impact on quality by measuring the BLEU score (Papineni et al., 2002) of the model every 20 steps in the federated averaging experiment, and every 500 steps in our proposed method. Figure 1 shows the BLEU score per update. Gradient dropping and federated averaging reduce gradient quality, and improvement per update is slower. In the Transformer case, the model is incapable of training at all. Local gradient incorporation improves the sparse gradient quality and improves per-epoch convergence over gradient dropping. In all architectures, the model achieves a training curve comparable to uncompressed multi-node training.

Table 1 summarizes the models' performance in terms of BLEU. With local gradient incorporation, the models obtain better final quality, performing closer to uncompressed multi-node training. Local gradient incorporation enables the Transformer to train with a sparse gradient, albeit with slight quality degradation (0.28–0.32 BLEU). This result confirms the Transformer's sensitivity to noisy updates and the ability of local gradients to mostly repair them.

Improving Training Speed
We measure the speed improvement of our proposed method by capturing the raw processing speed and the time to reach a certain BLEU score. We compare it to both gradient dropping and federated averaging. We also measure training efficiency by comparing the results with a single-node system. For the Transformer, we exclude gradient dropping and federated averaging as the models fail to train.

Table 2 summarizes our experiments. Gradient dropping reduces network traffic 50-fold and significantly improves the raw training speed in the multi-node setting: up to 3.4x over single-node, and up to 1.6x over the uncompressed multi-node setting. Federated averaging is faster because there is no additional communication overhead at every step, and no extra computational cost for sparse gradient compression. Finally, our method incurs the combined cost of gradient dropping, occasional federated averaging, and local updates, so it is slower than gradient dropping at raw speed, but still substantially faster than uncompressed multi-node training.
While vanilla gradient dropping and federated averaging have better raw speed, there is no clear improvement in convergence speed as noisy gradients damage convergence. Local gradient incorporation restores the gradient and improves convergence speed. In our RNN experiments, the convergence speedup is closer to the raw speedup, up to 3.5x the single-node performance.
The Transformer convergence rate increases more slowly than the raw batch processing speed. While the rule of thumb is to scale the learning rate linearly with batch size (Goyal et al., 2017), the Transformer model is also sensitive to high learning rates (Aji and Heafield, 2019). We obtained a 1.6x convergence speedup instead of the expected 2.7x. Scaling the learning rate sublinearly could be explored.
Compression results are of course dependent on the ratio between computation and network bandwidth in a system, as well as on model size. Because the method reduces network load, we would expect to see even larger speed improvements with commodity hardware instead of the 40 Gb/s InfiniBand network used in our experiments.

Conclusion
We improve model convergence when training with sparse gradients by utilizing an additional locally computed gradient, while also negating the quality loss in terms of BLEU caused by gradient dropping. With gradient dropping and local gradient incorporation, we improve the raw training speed in terms of words/second by up to 3.4x over a single-node system, and up to 1.6x over an uncompressed multi-node system. We also evaluate training speed by the time needed to reach a near-convergence BLEU score. In this case, we improve training speed by up to 3.5x over a single-node system and up to 1.5x over an uncompressed multi-node system.