2Direction: Theoretically Faster Distributed Training with Bidirectional Communication Compression

Authors: Alexander Tyurin, Peter Richtarik

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, our theoretical findings are corroborated by experimental evidence.
Researcher Affiliation | Academia | Alexander Tyurin, KAUST, Saudi Arabia, alexandertiurin@gmail.com; Peter Richtárik, KAUST, Saudi Arabia, richtarik@gmail.com
Pseudocode | Yes | Algorithm 1 (2Direction: A Fast Gradient Method Supporting Bidirectional Compression)
Open Source Code | No | The paper does not provide any explicit statement about releasing code, nor does it include a link to a code repository.
Open Datasets | Yes | The experiments were implemented in Python 3.7.9. The distributed environment was emulated on machines with Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz. In each plot, we show the relation between the total number of coordinates transmitted to and from the server and the function values. The parameters of the algorithms are taken as suggested by the corresponding theory, except for the stepsizes, which we fine-tune from the set {2^i | i ∈ [−20, 20]}. For 2Direction, we use parameters from Theorem 5.2 and fine-tune the stepsize L. We solve the logistic regression problem f_i(x_1, ..., x_c) := (1/m) Σ_{j=1}^{m} [ −log(exp(⟨a_{ij}, x_{y_{ij}}⟩)) + log( Σ_{y=1}^{c} exp(⟨a_{ij}, x_y⟩) ) ], where x_1, ..., x_c ∈ R^d, c is the number of unique labels, a_{ij} ∈ R^d is the feature vector of a sample on the i-th worker, y_{ij} is the corresponding label, and m is the number of samples located on the i-th worker. The RandK compressor is used to compress information from the workers to the server, and the TopK compressor is used to compress information from the server to the workers. The performance of the algorithms is compared on the CIFAR10 (Krizhevsky et al., 2009) (# of features = 3072, # of samples = 50,000) and real-sim (# of features = 20,958, # of samples = 72,309) datasets. (Sketches of this objective and of the RandK/TopK compressors are given after the table below.)
Dataset Splits | No | The paper mentions the use of the CIFAR10 and real-sim datasets but does not specify any training, validation, or test split percentages or sample counts.
Hardware Specification | Yes | The distributed environment was emulated on machines with Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz.
Software Dependencies | Yes | The experiments were implemented in Python 3.7.9.
Experiment Setup | Yes | The parameters of the algorithms are taken as suggested by the corresponding theory, except for the stepsizes, which we fine-tune from the set {2^i | i ∈ [−20, 20]}. For 2Direction, we use parameters from Theorem 5.2 and fine-tune the stepsize L.
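
To make the quoted objective concrete, below is a minimal NumPy sketch of the per-worker loss under the standard softmax cross-entropy reading of the (partly garbled) formula above, i.e. f_i(x_1, ..., x_c) = (1/m) Σ_j [ −⟨a_{ij}, x_{y_{ij}}⟩ + log Σ_y exp(⟨a_{ij}, x_y⟩) ]. The function name local_loss and the arrays X, A, y are hypothetical names introduced here for illustration; they do not come from the paper.

```python
import numpy as np

def local_loss(X: np.ndarray, A: np.ndarray, y: np.ndarray) -> float:
    """Per-worker loss f_i under the softmax cross-entropy reading.

    X : (c, d) array of per-class parameter vectors x_1, ..., x_c
    A : (m, d) array of local feature vectors a_{i1}, ..., a_{im}
    y : (m,)   array of integer labels in {0, ..., c-1}
    """
    logits = A @ X.T                             # (m, c): inner products <a_ij, x_y>
    zmax = logits.max(axis=1, keepdims=True)     # stabilise the log-sum-exp
    lse = zmax[:, 0] + np.log(np.exp(logits - zmax).sum(axis=1))
    m = A.shape[0]
    return float(np.mean(lse - logits[np.arange(m), y]))

# Tiny sanity check on random data (not the paper's datasets):
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 8))      # m = 5 samples, d = 8 features
y = rng.integers(0, 3, size=5)       # c = 3 classes
X = np.zeros((3, 8))
print(local_loss(X, A, y))           # prints log(3) ~= 1.0986 at X = 0
```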
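
Similarly, here is a minimal sketch of the two sparsifiers named in the setup, assuming their standard definitions: RandK keeps k uniformly random coordinates rescaled by d/k (unbiased), while TopK keeps the k largest-magnitude coordinates without rescaling (biased). The value k = 64, the seed, and the function names are illustrative only.

```python
import numpy as np

def rand_k(x: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    """RandK: keep k uniformly random coordinates, scaled by d/k (unbiased)."""
    d = x.size
    out = np.zeros_like(x)
    idx = rng.choice(d, size=k, replace=False)
    out[idx] = (d / k) * x[idx]
    return out

def top_k(x: np.ndarray, k: int) -> np.ndarray:
    """TopK: keep the k largest-magnitude coordinates (biased, no scaling)."""
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out[idx] = x[idx]
    return out

# Illustrative usage mirroring the quoted setup: RandK on the
# worker-to-server message, TopK on the server-to-worker message.
rng = np.random.default_rng(0)
g = rng.standard_normal(3072)       # a CIFAR10-sized vector (3072 coordinates)
g_up = rand_k(g, k=64, rng=rng)     # workers -> server
g_down = top_k(g, k=64)             # server -> workers
```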