Secure Distributed Training at Scale

Authors: Eduard Gorbunov, Alexander Borzunov, Michael Diskin, Max Ryabinin

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We verify the effectiveness of our algorithm in controlled experiments and actual large-scale training runs. Specifically, we start with ResNet-18 for CIFAR-10 classification and follow up with pretraining ALBERT-large in a setup where almost a half of all peers are malicious.
Researcher Affiliation | Collaboration | MIPT, Mila Quebec AI Institute, Yandex, HSE University.
Pseudocode | Yes | Algorithm 1 BTARD-SGD for peer i (informal) ... Algorithm 2 BUTTERFLYCLIP for peer i ... Algorithm 3 ACCUSE(i, j), invoked on all peers. (A minimal sketch of the clipped-averaging step these algorithms build on follows the table.)
Open Source Code | Yes | Source code for the experiments is available at https://github.com/yandex-research/btard
Open Datasets | Yes | Specifically, we start with ResNet-18 for CIFAR-10 classification... The ResNet-18 test accuracy in the case of various attacks and robust aggregation techniques. Our setup is a ResNet-18 (He et al., 2015) model trained to solve the CIFAR-10 classification task (Krizhevsky et al.). [...] Our setup is pre-training ALBERT-large (Lan et al., 2019) on the WikiText-103 dataset (Merity et al., 2017) using the LAMB optimizer (You et al., 2020). (An illustrative data-loading sketch follows the table.)
Dataset Splits | No | The paper mentions using the CIFAR-10 and WikiText-103 datasets for training and reports test accuracy, but does not explicitly specify the training/validation/test dataset splits used for model training or how data was partitioned for validation purposes (beyond cross-validation for peer checking).
Hardware Specification | Yes | We run distributed training on 16 cloud instances, each equipped with a single Tesla T4 GPU. [...] Our training swarm contains 16 peers with T4 GPUs and 1 GiB/s network bandwidth. [...] running on an 8-core VM with a 3.1 GHz Intel Xeon 6148 CPU and on a single 1080 Ti GPU.
Software Dependencies | No | The paper mentions software such as PyTorch and the LAMB optimizer, but does not provide specific version numbers for these or other key software components used in the experiments.
Experiment Setup | Yes | We train the model on 16 peers (each peer processes 8 samples per batch) using SGD with Nesterov (1983) momentum and the cosine annealing learning rate (Loshchilov & Hutter, 2017). We use a tuned setup achieving 93.5% test accuracy. Our method has a hyperparameter τ responsible for clipping strength in CENTEREDCLIP. We experiment with τ = 10 (weaker clipping) and τ = 1 (stronger clipping). [...] We use the LAMB optimizer (You et al., 2020) with batches that contain 4,096 examples, training with a peak learning rate equal to 0.00176 and a warmup of 5,000 gradient descent steps. In addition, we use gradient clipping with a maximum norm of 1 and weight decay regularization with a weight of 0.01. (A hedged configuration sketch of these hyperparameters follows the table.)
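
To make the pseudocode row concrete, below is a minimal NumPy sketch of the iterative clipped-averaging rule (CENTEREDCLIP, following Karimireddy et al.) that BUTTERFLYCLIP distributes across peers. The function name, iteration count, and toy data are illustrative assumptions; the secure all-reduce, validation, and ACCUSE logic of BTARD-SGD are omitted, so this is not the paper's implementation.

```python
import numpy as np

def centered_clip(grads, tau, n_iters=20):
    """Iterative clipped averaging in the spirit of CENTEREDCLIP.

    grads: (n_peers, dim) array, one flattened gradient per peer.
    tau: clipping radius; tau = 1 clips harder than tau = 10.
    """
    v = grads.mean(axis=0)                     # start from the plain average
    for _ in range(n_iters):
        diffs = grads - v                      # (n_peers, dim)
        norms = np.linalg.norm(diffs, axis=1, keepdims=True)
        scale = np.minimum(1.0, tau / np.maximum(norms, 1e-12))
        v = v + (diffs * scale).mean(axis=0)   # clipped correction step
    return v

# Toy check: 16 peers, one of which sends a wildly wrong gradient.
rng = np.random.default_rng(0)
grads = rng.normal(size=(16, 8))
grads[0] += 100.0                              # a "Byzantine" peer
print(centered_clip(grads, tau=1.0))           # stays near the honest mean
```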
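
For the open-datasets row, a hedged sketch of how the two public datasets can be obtained with standard libraries (torchvision and the Hugging Face datasets package). The paper's actual data pipeline lives in the linked repository and may differ; the WikiText-103 config name below is an assumption.

```python
from torchvision import datasets, transforms
from datasets import load_dataset  # Hugging Face "datasets" package

# CIFAR-10 (Krizhevsky et al.) for the ResNet-18 experiments.
cifar_train = datasets.CIFAR10(root="./data", train=True, download=True,
                               transform=transforms.ToTensor())
cifar_test = datasets.CIFAR10(root="./data", train=False, download=True,
                              transform=transforms.ToTensor())

# WikiText-103 (Merity et al., 2017) for the ALBERT-large pre-training run.
# "wikitext-103-raw-v1" is an assumed variant; the paper does not state
# which split or tokenization pipeline was used.
wikitext = load_dataset("wikitext", "wikitext-103-raw-v1")
print(len(cifar_train), len(cifar_test), wikitext)
```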
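
Finally, a hedged PyTorch configuration sketch gathering the experiment-setup values quoted above. The optimizer types, Nesterov momentum, and cosine annealing schedule come from the row; the numeric learning rate, momentum, and T_max for the ResNet-18 run are placeholders, since the report does not quote them, and LAMB itself would require an external implementation.

```python
import torch
from torchvision.models import resnet18

# ResNet-18 / CIFAR-10 side of the quoted setup.
model = resnet18(num_classes=10)
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.1,          # placeholder: peak LR not quoted
                            momentum=0.9,    # placeholder momentum value
                            nesterov=True)   # "SGD with Nesterov momentum"
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

n_peers, per_peer_batch = 16, 8              # 16 peers x 8 samples per step

# ALBERT-large / WikiText-103 side: the quoted values in one place.
# LAMB is not in torch.optim; the paper cites You et al. (2020).
albert_config = dict(
    global_batch_size=4096,
    peak_learning_rate=0.00176,
    warmup_steps=5000,
    max_grad_norm=1.0,       # e.g. via torch.nn.utils.clip_grad_norm_
    weight_decay=0.01,
)
```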