Don't Compress Gradients in Random Reshuffling: Compress Gradient Differences

Authors: Abdurakhmon Sadiev, Grigory Malinovsky, Eduard Gorbunov, Igor Sokolov, Ahmed Khaled, Konstantin Burlachenko, Peter Richtarik

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Finally, to illustrate our theoretical findings we conduct experiments on federated logistic regression tasks and on distributed training of neural networks."
Researcher Affiliation | Academia | "(1) King Abdullah University of Science and Technology, Saudi Arabia; (2) Moscow Institute of Physics and Technology, Russian Federation; (3) Mohamed bin Zayed University of Artificial Intelligence, UAE; (4) Mila, Université de Montréal, Canada; (5) Princeton University, USA"
Pseudocode | Yes | "Algorithm 1 Q-RR: Distributed Random Reshuffling with Quantization ... Algorithm 2 DIANA-RR ... Algorithm 3 Q-NASTYA ... Algorithm 4 DIANA-NASTYA" (see the gradient-difference sketch after the table)
Open Source Code | Yes | "The codes are provided in the following anonymous repository: https://anonymous.4open.science/r/diana_rr-[]B0A5."
Open Datasets | Yes | "Logistic Regression. ... The datasets were taken from open LibSVM library Chang and Lin [2011] ... Training Deep Neural Network model: ResNet-18 on CIFAR-10. ... CIFAR-10 dataset Krizhevsky and Hinton [2009]."
Dataset Splits | Yes | "The sizes of the training and validation sets are 5 × 10^4 and 10^4, respectively."
Hardware Specification | Yes | "All algorithms were written in Python 3.8. We used three different CPU cluster node types: 1. AMD EPYC 7702 64-Core; 2. Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz; 3. Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz. ... equipped with 16 cores (2 sockets by 16 cores per socket), 3.3 GHz Intel Xeon, and four NVIDIA A100 GPUs with 40 GB of GPU memory."
Software Dependencies | Yes | "All algorithms were written in Python 3.8. ... To conduct these experiments we use the FL_PyTorch simulator [Burlachenko et al., 2021]. ... The distributed environment is simulated in Python 3.9 using the software suite FL_PyTorch [Burlachenko et al., 2021]."
Experiment Setup | Yes | "In all experiments, for each method, we used the largest stepsize allowed by its theory multiplied by some individually tuned constant multiplier. ... In all algorithms, as a compression operator Q, we use Rand-k [Beznosikov et al., 2020] with a fixed compression ratio k/d = 0.02, where d is the number of features in the dataset. ... We followed a constant stepsize strategy ... For each of the considered non-local methods, we take the stepsize as the largest one predicted by the theory, premultiplied by an individually tuned constant factor from the set {0.000975, ..., 4096}." (see the Rand-k sketch after the table)
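
The experiment setup above refers to the Rand-k sparsifier of Beznosikov et al. [2020] at a fixed compression ratio k/d = 0.02. Below is a minimal NumPy sketch of such an operator, assuming the standard unbiased d/k rescaling; the dimension d = 1000 is an arbitrary illustrative value, not one of the paper's datasets.

```python
import numpy as np

def rand_k(x: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    """Rand-k sparsification: keep k uniformly chosen coordinates of x, zero the
    rest, and rescale the survivors by d/k so that E[rand_k(x)] = x (unbiased)."""
    d = x.size
    out = np.zeros_like(x)
    idx = rng.choice(d, size=k, replace=False)
    out[idx] = (d / k) * x[idx]
    return out

# Example at the reported ratio k/d = 0.02: with d = 1000 coordinates, keep k = 20.
rng = np.random.default_rng(0)
d = 1000
k = int(0.02 * d)          # k = 20
x = rng.standard_normal(d)
q = rand_k(x, k, rng)
print("nonzeros transmitted:", np.count_nonzero(q), "out of", d)
```

With this rescaling convention the compressor's variance parameter is ω = d/k − 1, which is the quantity that typically enters the stepsize conditions of such quantized methods.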
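
To make the "compress gradient differences" idea behind Algorithms 2 and 4 concrete, the following sketch runs a DIANA-RR-style loop on a toy least-squares problem: each worker reshuffles its local samples once per epoch, transmits a Rand-k-compressed difference between the fresh per-sample gradient and a learned per-sample shift, and the server rebuilds gradient estimates from its mirrored copies of the shifts. The toy problem, the hyperparameter values (gamma, alpha, k), and the per-sample indexing of the shifts are illustrative assumptions for this sketch, not the paper's implementation or tuned settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy federated least-squares problem (illustration only; the paper's experiments use
# logistic regression on LibSVM data and ResNet-18 on CIFAR-10, not this toy setup).
n_workers, n_local, d = 4, 8, 50
A = rng.standard_normal((n_workers, n_local, d))
b = rng.standard_normal((n_workers, n_local))

def grad(x, i, j):
    """Gradient of the per-sample loss 0.5 * (a_ij^T x - b_ij)^2 on worker i, sample j."""
    a = A[i, j]
    return (a @ x - b[i, j]) * a

def rand_k(v, k):
    """Unbiased Rand-k sparsifier: keep k random coordinates, rescale by d/k."""
    out = np.zeros_like(v)
    idx = rng.choice(v.size, size=k, replace=False)
    out[idx] = (v.size / k) * v[idx]
    return out

def full_loss(x):
    """Average least-squares loss over all workers and samples."""
    r = np.einsum('wnd,d->wn', A, x) - b
    return 0.5 * np.mean(r ** 2)

# Hypothetical hyperparameters; the paper derives the admissible stepsizes from its theory.
k = 5                                   # coordinates kept by Rand-k
alpha = k / d                           # shift stepsize, equal to 1/(omega + 1) for Rand-k
gamma = 0.005                           # optimization stepsize
x = np.zeros(d)
h = np.zeros((n_workers, n_local, d))   # per-sample shifts, mirrored on the server

print("initial loss:", full_loss(x))
for epoch in range(200):
    perms = [rng.permutation(n_local) for _ in range(n_workers)]
    for t in range(n_local):
        g_hats = []
        for i in range(n_workers):
            j = perms[i][t]
            g = grad(x, i, j)                # fresh per-sample gradient at the current point
            m = rand_k(g - h[i, j], k)       # transmit the compressed gradient DIFFERENCE
            g_hats.append(h[i, j] + m)       # server rebuilds an estimate of g from its shift copy
            h[i, j] = h[i, j] + alpha * m    # shift drifts toward the true per-sample gradient
        x = x - gamma * np.mean(g_hats, axis=0)
print("final loss:", full_loss(x))
```

The key point of the sketch is that what travels over the network is rand_k(g - h[i, j]), never the raw gradient, and the learned shifts shrink the norm of what has to be compressed as training proceeds.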