# A Better Alternative to Error Feedback for Communication-Efficient Distributed Learning

```bibtex
@article{Horvath2021ABA,
  title   = {A Better Alternative to Error Feedback for Communication-Efficient Distributed Learning},
  author  = {Samuel Horv\'ath and Peter Richt\'arik},
  journal = {ArXiv},
  year    = {2021},
  volume  = {abs/2006.11077}
}
```

Modern large-scale machine learning applications require stochastic optimization algorithms to be implemented on distributed compute systems. A key bottleneck of such systems is the communication overhead for exchanging information across the workers, such as stochastic gradients. Among the many techniques proposed to remedy this issue, one of the most successful is the framework of compressed communication with error feedback (EF). EF remains the only known technique that can deal with the…

#### 11 Citations

Preserved central model for faster bidirectional compression in distributed settings

- Computer Science, Mathematics
- ArXiv
- 2021

A new approach to communication constraints in distributed learning with a central server is developed, and a new algorithm is proposed that performs bidirectional compression while achieving the same convergence rate as algorithms using only uplink compression.

FedSKETCH: Communication-Efficient and Private Federated Learning via Sketching

- Computer Science
- ArXiv
- 2020

This work introduces the FedSKETCH and FedSKETCHGATE algorithms to address both challenges in federated learning jointly, intended for the homogeneous and heterogeneous data distribution settings respectively.

Optimal Client Sampling for Federated Learning

- Computer Science
- ArXiv
- 2020

A novel client subsampling scheme is proposed in which the number of clients allowed to communicate their updates back to the master node is restricted, along with a simple algorithm approximating the optimal formula for client participation that requires only secure aggregation and thus does not compromise client privacy.

FjORD: Fair and Accurate Federated Learning under heterogeneous targets with Ordered Dropout

- Computer Science
- ArXiv
- 2021

This work introduces Ordered Dropout, a mechanism that achieves an ordered, nested representation of knowledge in neural networks and enables the extraction of lower-footprint submodels without the need for retraining, applied to FL in a framework called FjORD.

Leveraging Spatial and Temporal Correlations in Sparsified Mean Estimation

- Computer Science, Mathematics
- 2021

We study the problem of estimating at a central server the mean of a set of vectors distributed across several nodes (one vector per node). When the vectors are high-dimensional, the communication…

Rethinking gradient sparsification as total error minimization

- Computer Science, Mathematics
- ArXiv
- 2021

This work identifies that the total error, the sum of the compression errors over all iterations, encapsulates sparsification throughout training, and proposes a communication complexity model that minimizes the total error under a communication budget for the entire training.

A Field Guide to Federated Optimization

- Computer Science
- ArXiv
- 2021

This paper provides recommendations and guidelines on formulating, designing, evaluating and analyzing federated optimization algorithms through concrete examples and practical implementation, with a focus on conducting effective simulations to infer real-world performance.

Compressed Communication for Distributed Training: Adaptive Methods and System

- Computer Science, Mathematics
- ArXiv
- 2021

This paper introduces a novel adaptive gradient method with gradient compression that has a convergence rate of O(1/√T) for non-convex problems, and develops a scalable system called BytePS-Compress for two-way compression, where gradients are compressed in both directions between workers and parameter servers.

Compressed Gradient Tracking Methods for Decentralized Optimization with Linear Convergence

- Mathematics, Computer Science
- 2021

This paper proposes a novel compressed gradient tracking algorithm (C-GT) that combines the gradient tracking technique with communication compression, and shows that C-GT is compatible with a general class of compression operators unifying both unbiased and biased compressors.

EF21 with Bells&Whistles: Practical Algorithmic Extensions of Modern Error Feedback

- Computer Science, Mathematics
- 2021

Six practical extensions of EF21 are proposed, all supported by strong convergence theory: partial participation, stochastic approximation, variance reduction, proximal setting, momentum and bidirectional compression.

#### References

Showing 1–10 of 60 references.

Natural Compression for Distributed Deep Learning

- Computer Science, Mathematics
- ArXiv
- 2019

This work introduces a new, simple yet theoretically and practically effective compression technique, *natural compression* (NC), which is applied individually to all entries of the to-be-compressed update vector and works by randomized rounding to the nearest (negative or positive) power of two, which can be computed in a "natural" way by ignoring the mantissa.
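The randomized power-of-two rounding described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration of the rounding rule only (function name and vectorized form are mine; the paper's compact bit-level encoding of the result is omitted):

```python
import numpy as np

def natural_compression(x, rng=None):
    """Randomized rounding of each entry to a neighboring power of two.

    An entry t with 2**k <= |t| <= 2**(k+1) is rounded up to 2**(k+1)
    with probability (|t| - 2**k) / 2**k and down to 2**k otherwise,
    which makes the compressor unbiased: E[C(t)] = t.
    """
    if rng is None:
        rng = np.random.default_rng()
    x = np.asarray(x, dtype=float)
    sign, t = np.sign(x), np.abs(x)
    out = np.zeros_like(t)
    nz = t > 0
    low = 2.0 ** np.floor(np.log2(t[nz]))   # nearest power of two below |t|
    p_up = (t[nz] - low) / low              # probability of rounding up
    out[nz] = np.where(rng.random(low.shape) < p_up, 2.0 * low, low)
    return sign * out
```

Because E[C(t)] = t, the compressed gradients remain unbiased estimates, which is why such compressors plug into standard SGD analyses.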

On Biased Compression for Distributed Learning

- Computer Science, Mathematics
- ArXiv
- 2020

It is shown for the first time that biased compressors can lead to linear convergence rates both in the single-node and distributed settings, and a new high-performing biased compressor, a combination of Top-k and natural dithering, is proposed, which in the authors' experiments outperforms all other compression techniques.

QSGD: Randomized Quantization for Communication-Optimal Stochastic Gradient Descent

- Computer Science
- ArXiv
- 2016

Quantized SGD is proposed, a family of compression schemes which allow the compression of gradient updates at each node, while guaranteeing convergence under standard assumptions, and allows the user to trade off compression and convergence time.
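The compression/convergence trade-off in QSGD is controlled by the number of quantization levels. A hedged sketch of the core unbiased stochastic rounding (names are mine; QSGD's compact Elias-style encoding of the quantized values is omitted):

```python
import numpy as np

def qsgd(x, s=4, rng=None):
    """QSGD-style quantization with s uniform levels per sign.

    Each |x_i| / ||x|| is stochastically rounded to a grid point j/s
    so that the quantizer is unbiased: E[Q(x)] = x. Larger s means
    finer grids (more bits) and lower variance.
    """
    if rng is None:
        rng = np.random.default_rng()
    x = np.asarray(x, dtype=float)
    norm = np.linalg.norm(x)
    if norm == 0.0:
        return np.zeros_like(x)
    level = np.abs(x) / norm * s            # position on the [0, s] grid
    low = np.floor(level)
    p_up = level - low                      # stochastic rounding probability
    rounded = low + (rng.random(x.shape) < p_up)
    return np.sign(x) * norm * rounded / s
```

Only the norm, signs, and small integer levels need to be transmitted, which is where the bandwidth saving comes from.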

3LC: Lightweight and Effective Traffic Compression for Distributed Machine Learning

- Computer Science, Mathematics
- MLSys
- 2019

3LC is presented, a lossy compression scheme for state change traffic that strikes a balance between multiple goals: traffic reduction, accuracy, computation overhead, and generality.

Sparsified SGD with Memory

- Computer Science, Mathematics
- NeurIPS
- 2018

This work analyzes Stochastic Gradient Descent with k-sparsification or compression (for instance top-k or random-k) and shows that this scheme converges at the same rate as vanilla SGD when equipped with error compensation.
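The memory (error-compensation) mechanism can be sketched as follows, assuming a top-k compressor; the function names and single-worker simplification are mine:

```python
import numpy as np

def top_k(v, k):
    """Keep the k largest-magnitude entries of v and zero the rest."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def mem_sgd_step(x, grad, memory, lr=0.1, k=2):
    """One step of SGD with top-k sparsification and memory: the part
    of the update dropped by the compressor is stored locally and added
    back before the next compression, so no gradient mass is lost."""
    corrected = lr * grad + memory       # re-inject previously dropped mass
    update = top_k(corrected, k)         # only k coordinates are communicated
    memory = corrected - update          # residual kept for the next step
    return x - update, memory
```

The invariant `update + memory = lr * grad + old_memory` is exactly what makes the compression error cancel over iterations.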

99% of Distributed Optimization is a Waste of Time: The Issue and How to Fix it

- Computer Science
- 2019

A new variant of parallel block coordinate descent based on independent sparsification of the local gradient estimates before communication is developed, supported by extensive numerical experiments that demonstrate an almost perfect match with the theory on a number of synthetic and real datasets.

On the Discrepancy between the Theoretical Analysis and Practical Implementations of Compressed Communication for Distributed Deep Learning

- Computer Science, Mathematics
- AAAI
- 2020

It is proved that layer-wise compression is, in theory, better, because the convergence rate is upper bounded by that of entire-model compression for a wide range of biased and unbiased compression methods.

Convex Optimization using Sparsified Stochastic Gradient Descent with Memory

- Computer Science
- 2018

A sparsification scheme for SGD where only a small constant number of coordinates are applied at each iteration, which outperforms QSGD in progress per number of bits sent and opens the path to using lock-free asynchronous parallelization on dense problems.

Error Feedback Fixes SignSGD and other Gradient Compression Schemes

- Computer Science, Mathematics
- ICML
- 2019

It is proved that the algorithm EF-SGD with an arbitrary compression operator achieves the same rate of convergence as SGD without any additional assumptions, and thus EF-SGD achieves gradient compression for free.
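A minimal single-worker sketch of the EF-SGD update with a scaled-sign compressor, one common instantiation; the names and this specific scaling are illustrative, not the paper's exact algorithm:

```python
import numpy as np

def ef_sign_step(x, grad, e, lr=0.1):
    """One EF-SGD iteration: compress the error-corrected step and
    carry the compression residual e into the next iteration."""
    p = lr * grad + e                        # error-corrected step
    delta = np.mean(np.abs(p)) * np.sign(p)  # scaled-sign compression
    e = p - delta                            # residual fed back next round
    return x - delta, e

# Usage: minimize f(x) = 0.5 * ||x||^2, whose gradient is x itself.
x = np.array([1.0, -2.0, 4.0])
e = np.zeros_like(x)
for _ in range(300):
    x, e = ef_sign_step(x, x, e)
```

The key identity is `delta + e = lr * grad + old_e`: whatever the sign compressor throws away is retained in `e` and reapplied, which is what restores SGD's rate.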

Distributed Learning with Compressed Gradient Differences

- Computer Science, Mathematics
- ArXiv
- 2019

This work proposes a new distributed learning method, DIANA, which resolves these issues via compression of gradient differences, performs a theoretical analysis in the strongly convex and nonconvex settings, and shows that its rates are superior to existing rates.