2Direction: Theoretically Faster Distributed Training with Bidirectional Communication Compression
Authors: Alexander Tyurin, Peter Richtarik
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, our theoretical findings are corroborated by experimental evidence. |
| Researcher Affiliation | Academia | Alexander Tyurin, KAUST, Saudi Arabia, alexandertiurin@gmail.com; Peter Richtárik, KAUST, Saudi Arabia, richtarik@gmail.com |
| Pseudocode | Yes | Algorithm 1 2Direction: A Fast Gradient Method Supporting Bidirectional Compression |
| Open Source Code | No | The paper does not provide any explicit statement about releasing code, nor does it include a link to a code repository. |
| Open Datasets | Yes | The experiments were implemented in Python 3.7.9. The distributed environment was emulated on machines with Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz. Each plot shows the relation between the total number of coordinates transmitted from and to the server and the function values. The parameters of the algorithms are taken as suggested by the corresponding theory, except for the stepsizes, which are fine-tuned over the set {2^i \| i ∈ [-20, 20]}. For 2Direction, the parameters from Theorem 5.2 are used and the stepsize L is fine-tuned. The logistic regression problem solved is f_i(x_1, …, x_c) := (1/m) Σ_{j=1}^{m} [ -log( exp(⟨a_{ij}, x_{y_{ij}}⟩) / Σ_{y=1}^{c} exp(⟨a_{ij}, x_y⟩) ) ], where x_1, …, x_c ∈ R^d, c is the number of unique labels, a_{ij} ∈ R^d is a feature vector of a sample on the i-th worker, y_{ij} is the corresponding label, and m is the number of samples located on the i-th worker. The RandK compressor is used to compress information from the workers to the server, and the TopK compressor is used to compress information from the server to the workers. The performance of the algorithms is compared on the CIFAR10 (Krizhevsky et al., 2009) (# of features = 3072, # of samples = 50,000) and real-sim (# of features = 20,958, # of samples = 72,309) datasets. (A compressor and loss sketch follows the table.) |
| Dataset Splits | No | The paper mentions the use of CIFAR10 and real-sim datasets but does not specify any training, validation, or test split percentages or sample counts. |
| Hardware Specification | Yes | The distributed environment was emulated on machines with Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz. |
| Software Dependencies | Yes | The experiments were implemented in Python 3.7.9. |
| Experiment Setup | Yes | The parameters of the algorithms are taken as suggested by the corresponding theory, except for the stepsizes, which are fine-tuned over the set {2^i \| i ∈ [-20, 20]}. For 2Direction, the parameters from Theorem 5.2 are used and the stepsize L is fine-tuned. (A stepsize-tuning sketch follows the table.) |
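
The quoted setup names the RandK (workers-to-server) and TopK (server-to-workers) compressors and a multiclass logistic regression objective. Below is a minimal sketch assuming the standard definitions of these compressors and a softmax cross-entropy loss; since the paper's code is not released, the function names and array shapes are illustrative, not the authors':

```python
import numpy as np

def rand_k(x: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    """RandK: keep k uniformly sampled coordinates, rescaled by d/k so the estimate is unbiased."""
    d = x.size
    out = np.zeros_like(x)
    idx = rng.choice(d, size=k, replace=False)
    out[idx] = (d / k) * x[idx]
    return out

def top_k(x: np.ndarray, k: int) -> np.ndarray:
    """TopK: keep the k coordinates of largest magnitude (biased, contractive)."""
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out[idx] = x[idx]
    return out

def worker_loss(X: np.ndarray, y: np.ndarray, W: np.ndarray) -> float:
    """Softmax cross-entropy (multiclass logistic) loss on one worker's m local samples.
    X: (m, d) features a_ij; y: (m,) integer labels y_ij; W: (d, c) with columns x_1, ..., x_c."""
    logits = X @ W                                   # (m, c) inner products <a_ij, x_y>
    logits = logits - logits.max(axis=1, keepdims=True)   # shift for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(y.size), y].mean())
```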
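
The setup also describes tuning the stepsize over the grid {2^i \| i ∈ [-20, 20]}. A hedged sketch of such a grid search follows, where `run` is a placeholder for one full training run returning the final function value (not an interface from the paper):

```python
from typing import Callable

def tune_stepsize(run: Callable[[float], float]) -> float:
    """Pick the stepsize from {2^i | i in [-20, 20]} that yields the smallest final loss."""
    grid = [2.0 ** i for i in range(-20, 21)]
    return min(grid, key=run)

# Toy usage: pretend the final loss is minimized near a stepsize of 1/8.
best = tune_stepsize(lambda gamma: abs(gamma - 0.125))
print(best)  # 0.125
```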