Secure Distributed Training at Scale

Authors: Eduard Gorbunov, Alexander Borzunov, Michael Diskin, Max Ryabinin

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We verify the effectiveness of our algorithm in controlled experiments and actual large-scale training runs. Specifically, we start with ResNet-18 for CIFAR-10 classification and follow up with pretraining ALBERT-large in a setup where almost a half of all peers are malicious.
Researcher Affiliation | Collaboration | MIPT, Mila Quebec AI Institute, Yandex, HSE University.
Pseudocode | Yes | Algorithm 1 BTARD-SGD for peer i (informal) ... Algorithm 2 BUTTERFLYCLIP for peer i ... Algorithm 3 ACCUSE(i, j), invoked on all peers. (A minimal sketch of the clipped-averaging step these algorithms build on follows the table.)
Open Source Code | Yes | Source code for the experiments is available at https://github.com/yandex-research/btard
Open Datasets | Yes | Specifically, we start with ResNet-18 for CIFAR-10 classification... The ResNet-18 test accuracy in the case of various attacks and robust aggregation techniques. Our setup is a ResNet-18 (He et al., 2015) model trained to solve the CIFAR-10 classification task (Krizhevsky et al.). [...] Our setup is pre-training ALBERT-large (Lan et al., 2019) on the WikiText-103 dataset (Merity et al., 2017) using the LAMB optimizer (You et al., 2020). (An illustrative data-loading sketch follows the table.)
Dataset Splits | No | The paper mentions using the CIFAR-10 and WikiText-103 datasets for training and reports test accuracy, but does not explicitly specify the training/validation/test dataset splits used for model training or how data was partitioned for validation purposes (beyond cross-validation for peer checking).
Hardware Specification | Yes | We run distributed training on 16 cloud instances, each equipped with a single Tesla T4 GPU. [...] Our training swarm contains 16 peers with T4 GPUs and 1 GiB/s network bandwidth. [...] running on an 8-core VM with a 3.1 GHz Intel Xeon 6148 CPU and on a single 1080 Ti GPU.
Software Dependencies | No | The paper mentions software such as PyTorch and the LAMB optimizer, but does not provide specific version numbers for these or other key software components used in the experiments.
Experiment Setup | Yes | We train the model on 16 peers (each peer processes 8 samples per batch) using SGD with Nesterov (1983) momentum and the cosine annealing learning rate (Loshchilov & Hutter, 2017). We use a tuned setup achieving 93.5% test accuracy. Our method has a hyperparameter τ responsible for clipping strength in CENTEREDCLIP. We experiment with τ = 10 (weaker clipping) and τ = 1 (stronger clipping). [...] We use the LAMB optimizer (You et al., 2020) with batches that contain 4,096 examples, training with a peak learning rate equal to 0.00176 and a warmup of 5,000 gradient descent steps. In addition, we use gradient clipping with a maximum norm of 1 and weight decay regularization with a weight of 0.01. (A hedged configuration sketch of these hyperparameters follows the table.)
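
To make the pseudocode row concrete, below is a minimal NumPy sketch of the iterative clipped-averaging rule (CENTEREDCLIP, following Karimireddy et al.) that BUTTERFLYCLIP distributes across peers. The function name, iteration count, and toy data are illustrative assumptions; the secure all-reduce, validation, and ACCUSE logic of BTARD-SGD are omitted, so this is not the paper's implementation.

```python
import numpy as np

def centered_clip(grads, tau, n_iters=20):
    """Iterative clipped averaging in the spirit of CENTEREDCLIP.

    grads: (n_peers, dim) array, one flattened gradient per peer.
    tau: clipping radius; tau = 1 clips harder than tau = 10.
    """
    v = grads.mean(axis=0)                     # start from the plain average
    for _ in range(n_iters):
        diffs = grads - v                      # (n_peers, dim)
        norms = np.linalg.norm(diffs, axis=1, keepdims=True)
        scale = np.minimum(1.0, tau / np.maximum(norms, 1e-12))
        v = v + (diffs * scale).mean(axis=0)   # clipped correction step
    return v

# Toy check: 16 peers, one of which sends a wildly wrong gradient.
rng = np.random.default_rng(0)
grads = rng.normal(size=(16, 8))
grads[0] += 100.0                              # a "Byzantine" peer
print(centered_clip(grads, tau=1.0))           # stays near the honest mean
```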
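
For the open-datasets row, a hedged sketch of how the two public datasets can be obtained with standard libraries (torchvision and the Hugging Face datasets package). The paper's actual data pipeline lives in the linked repository and may differ; the WikiText-103 config name below is an assumption.

```python
from torchvision import datasets, transforms
from datasets import load_dataset  # Hugging Face "datasets" package

# CIFAR-10 (Krizhevsky et al.) for the ResNet-18 experiments.
cifar_train = datasets.CIFAR10(root="./data", train=True, download=True,
                               transform=transforms.ToTensor())
cifar_test = datasets.CIFAR10(root="./data", train=False, download=True,
                              transform=transforms.ToTensor())

# WikiText-103 (Merity et al., 2017) for the ALBERT-large pre-training run.
# "wikitext-103-raw-v1" is an assumed variant; the paper does not state
# which split or tokenization pipeline was used.
wikitext = load_dataset("wikitext", "wikitext-103-raw-v1")
print(len(cifar_train), len(cifar_test), wikitext)
```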
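
Finally, a hedged PyTorch configuration sketch gathering the experiment-setup values quoted above. The optimizer types, Nesterov momentum, and cosine annealing schedule come from the row; the numeric learning rate, momentum, and T_max for the ResNet-18 run are placeholders, since the report does not quote them, and LAMB itself would require an external implementation.

```python
import torch
from torchvision.models import resnet18

# ResNet-18 / CIFAR-10 side of the quoted setup.
model = resnet18(num_classes=10)
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.1,          # placeholder: peak LR not quoted
                            momentum=0.9,    # placeholder momentum value
                            nesterov=True)   # "SGD with Nesterov momentum"
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

n_peers, per_peer_batch = 16, 8              # 16 peers x 8 samples per step

# ALBERT-large / WikiText-103 side: the quoted values in one place.
# LAMB is not in torch.optim; the paper cites You et al. (2020).
albert_config = dict(
    global_batch_size=4096,
    peak_learning_rate=0.00176,
    warmup_steps=5000,
    max_grad_norm=1.0,       # e.g. via torch.nn.utils.clip_grad_norm_
    weight_decay=0.01,
)
```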