Don't Compress Gradients in Random Reshuffling: Compress Gradient Differences
Authors: Abdurakhmon Sadiev, Grigory Malinovsky, Eduard Gorbunov, Igor Sokolov, Ahmed Khaled, Konstantin Burlachenko, Peter Richtarik
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, to illustrate our theoretical findings we conduct experiments on federated logistic regression tasks and on distributed training of neural networks. |
| Researcher Affiliation | Academia | King Abdullah University of Science and Technology, Saudi Arabia; Moscow Institute of Physics and Technology, Russian Federation; Mohamed bin Zayed University of Artificial Intelligence, UAE; Mila, Université de Montréal, Canada; Princeton University, USA |
| Pseudocode | Yes | Algorithm 1 Q-RR: Distributed Random Reshuffling with Quantization ... Algorithm 2 DIANA-RR ... Algorithm 3 Q-NASTYA ... Algorithm 4 DIANA-NASTYA (a hedged sketch of the compressed-difference step appears after this table) |
| Open Source Code | Yes | The codes are provided in the following anonymous repository: https://anonymous.4open.science/r/diana_rr-[]B0A5. |
| Open Datasets | Yes | Logistic Regression. ... The datasets were taken from the open LibSVM library Chang and Lin [2011] ... Training Deep Neural Network model: ResNet-18 on CIFAR-10. ... CIFAR-10 dataset Krizhevsky and Hinton [2009]. |
| Dataset Splits | Yes | The sizes of training and validation set are 5 * 10^4 and 10^4 respectively. |
| Hardware Specification | Yes | All algorithms were written in Python 3.8. We used three different CPU cluster node types: 1. AMD EPYC 7702 64-Core; 2. Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz; 3. Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz. ... equipped with 16-cores (2 sockets by 16 cores per socket) 3.3 GHz Intel Xeon, and four NVIDIA A100 GPU with 40GB of GPU memory. |
| Software Dependencies | Yes | All algorithms were written in Python 3.8. ... To conduct these experiments we use the FL_PyTorch simulator [Burlachenko et al., 2021]. ... The distributed environment is simulated in Python 3.9 using the software suite FL_PyTorch [Burlachenko et al., 2021]. |
| Experiment Setup | Yes | In all experiments, for each method, we used the largest stepsize allowed by its theory multiplied by some individually tuned constant multiplier. ... In all algorithms, as a compression operator Q, we use Rand-k [Beznosikov et al., 2020] with fixed compression ratio k/d ≈ 0.02, where d is the number of features in the dataset. ... We followed a constant stepsize strategy... For each of the considered non-local methods, we take the stepsize as the largest one predicted by the theory premultiplied by the individually tuned constant factor from the set {0.000975, ..., 4096}. (See the Rand-k sketch after this table.) |
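The Experiment Setup row quotes the use of the Rand-k compressor with a compression ratio of roughly k/d ≈ 0.02. For reference, below is a minimal NumPy sketch of the standard unbiased Rand-k operator (keep k random coordinates and rescale by d/k so that E[Q(x)] = x); the function name, signature, and example values are ours for illustration, not taken from the paper's released code.

```python
import numpy as np

def rand_k(x: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    """Unbiased Rand-k sparsifier: keep k random coordinates, zero the rest,
    and rescale the survivors by d/k so that E[Q(x)] = x."""
    d = x.size
    out = np.zeros_like(x)
    idx = rng.choice(d, size=k, replace=False)
    out[idx] = x[idx] * (d / k)
    return out

# Example: a compression ratio k/d of about 0.02, as in the quoted setup.
rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
q = rand_k(x, k=20, rng=rng)   # 20 / 1000 = 0.02
print(np.count_nonzero(q))     # 20 surviving coordinates
```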
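The Pseudocode row lists Q-RR and DIANA-RR; the paper's central point (per its title) is that clients should compress gradient differences rather than raw gradients. The sketch below illustrates that DIANA-style compressed-difference step under a locally reshuffled sample order, simulated on one machine. It assumes the standard DIANA shift update with one shift per client, which is a simplification: the paper's DIANA-RR maintains shifts adapted to random reshuffling, and all names and signatures here are ours, not the authors' reference implementation.

```python
import numpy as np

def diana_style_rr_epoch(x, clients, grad, compress, gamma, alpha, rng):
    """One epoch of a DIANA-style step under random reshuffling.

    `clients` maps a client id to its list of local sample indices (assumed
    equal-sized), `grad(x, i, j)` returns the gradient of sample j on client i,
    and `compress` is an unbiased compressor, e.g. the rand_k sketch above.
    Each client transmits only the compressed difference Q(g - h_i).
    """
    n = len(clients)
    shifts = {i: np.zeros_like(x) for i in clients}              # per-client shifts h_i
    perms = {i: rng.permutation(s) for i, s in clients.items()}  # local reshuffling
    steps = len(next(iter(clients.values())))
    for t in range(steps):
        g_hat = np.zeros_like(x)
        for i in clients:
            g = grad(x, i, perms[i][t])
            m = compress(g - shifts[i])     # compress the *difference*, not g itself
            g_hat += (shifts[i] + m) / n    # server-side gradient estimate
            shifts[i] += alpha * m          # DIANA shift update
        x = x - gamma * g_hat
    return x
```

Usage with the earlier sketch would look like `compress=lambda v: rand_k(v, k, rng)`; the point of the construction is that as the shifts h_i learn the clients' gradients, the transmitted differences shrink, so compression error vanishes near the solution instead of persisting as it does when gradients are compressed directly.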