Secure Distributed Training at Scale
Authors: Eduard Gorbunov, Alexander Borzunov, Michael Diskin, Max Ryabinin
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We verify the effectiveness of our algorithm in controlled experiments and actual large-scale training runs. Specifically, we start with ResNet-18 for CIFAR-10 classification and follow up with pretraining ALBERT-large in a setup where almost half of all peers are malicious. |
| Researcher Affiliation | Collaboration | MIPT, Mila Quebec AI Institute, Yandex, HSE University. |
| Pseudocode | Yes | Algorithm 1 BTARD-SGD for peer i (informal) ... Algorithm 2 BUTTERFLYCLIP for peer i ... Algorithm 3 ACCUSE(i, j), invoked on all peers. An illustrative single-machine sketch of the underlying aggregation rule is given after the table. |
| Open Source Code | Yes | Source code for the experiments is available at https://github.com/yandex-research/btard |
| Open Datasets | Yes | Specifically, we start with ResNet-18 for CIFAR-10 classification... The ResNet-18 test accuracy in the case of various attacks and robust aggregation techniques. Our setup is a ResNet-18 (He et al., 2015) model trained to solve the CIFAR-10 classification task (Krizhevsky et al.). [...] Our setup is pre-training ALBERT-large (Lan et al., 2019) on the WikiText-103 dataset (Merity et al., 2017) using the LAMB optimizer (You et al., 2020). |
| Dataset Splits | No | The paper mentions using the CIFAR-10 and WikiText-103 datasets for training and reports test accuracy, but does not explicitly specify the training/validation/test splits used for model training or how data was partitioned for validation purposes (beyond cross-validation for peer checking). |
| Hardware Specification | Yes | We run distributed training on 16 cloud instances, each equipped with a single Tesla T4 GPU. [...] Our training swarm contains 16 peers with T4 GPUs and 1 GiB/s network bandwidth. [...] running on an 8-core VM with a 3.1 GHz Intel Xeon 6148 CPU and on a single 1080 Ti GPU. |
| Software Dependencies | No | The paper mentions software such as PyTorch and the LAMB optimizer, but does not provide specific version numbers for these or other key software components used in the experiments. |
| Experiment Setup | Yes | We train the model on 16 peers (each peer processes 8 samples per batch) using SGD with Nesterov (1983) momentum and the cosine annealing learning rate (Loshchilov & Hutter, 2017). We use a tuned setup achieving 93.5% test accuracy. Our method has a hyperparameter τ responsible for clipping strength in CENTEREDCLIP. We experiment with τ = 10 (weaker clipping) and τ = 1 (stronger clipping). [...] We use the LAMB optimizer (You et al., 2020) with batches that contain 4,096 examples, training with a peak learning rate equal to 0.00176 and a warmup of 5,000 gradient descent steps. In addition, we use gradient clipping with a maximum norm of 1 and weight decay regularization with a weight of 0.01. A hedged configuration sketch for this setup follows the table. |
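
The algorithms listed under "Pseudocode" distribute a robust aggregation rule across peers. As a rough illustration, the sketch below implements a plain, single-machine version of the CenteredClip iteration that BTARD-SGD builds on (Karimireddy et al., 2021), with `tau` as the clipping strength discussed in the experiment setup. The peer-to-peer splitting (BUTTERFLYCLIP), the verification steps, and the ACCUSE procedure of the actual protocol are omitted, and all function and variable names here are our own.

```python
import numpy as np

def centered_clip(gradients, tau, n_iters=10):
    """Aggregate peer gradients while clipping each peer's deviation
    from the running estimate to norm at most `tau`.

    Single-machine sketch only: the paper's ButterflyClip splits this
    computation across peers and adds verification, which is not shown.
    """
    gradients = np.asarray(gradients, dtype=np.float64)   # shape: (n_peers, dim)
    v = gradients.mean(axis=0)                            # starting estimate (simplification)
    for _ in range(n_iters):
        deltas = gradients - v                            # deviations from current estimate
        norms = np.linalg.norm(deltas, axis=1, keepdims=True)
        scale = np.minimum(1.0, tau / np.maximum(norms, 1e-12))
        v = v + (scale * deltas).mean(axis=0)             # average of clipped deviations
    return v

# Toy usage: 15 well-behaved peers and one peer sending an extreme gradient.
rng = np.random.default_rng(0)
honest = rng.normal(0.0, 1.0, size=(15, 4))
malicious = np.full((1, 4), 100.0)
print(centered_clip(np.vstack([honest, malicious]), tau=1.0))
```

With a small `tau` (stronger clipping), the outlier's contribution stays bounded, which mirrors the trade-off between τ = 1 and τ = 10 mentioned in the experiment setup.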
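
The ALBERT-large hyperparameters quoted in the experiment setup can be collected into a short configuration sketch. Only the peak learning rate (0.00176), warmup length (5,000 steps), maximum gradient norm (1), weight decay (0.01), and global batch size (4,096) come from the paper; the LAMB optimizer (You et al., 2020) is not part of core PyTorch, so AdamW stands in here as a placeholder, the post-warmup schedule shape is an assumption, and the tiny linear model is a stand-in for ALBERT-large.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

# Placeholder model; the paper pretrains ALBERT-large (Lan et al., 2019) on WikiText-103.
model = torch.nn.Linear(128, 128)

# Placeholder optimizer: the paper uses LAMB, which requires a third-party
# implementation; only the learning rate and weight decay values are from the paper.
optimizer = torch.optim.AdamW(model.parameters(), lr=0.00176, weight_decay=0.01)

# Linear warmup over the first 5,000 optimizer steps (the paper states the warmup
# length and peak rate; keeping the rate constant afterwards is an assumption).
warmup_steps = 5_000
scheduler = LambdaLR(optimizer, lambda step: min(1.0, (step + 1) / warmup_steps))

def training_step(loss: torch.Tensor) -> None:
    """One optimizer step; batches of 4,096 examples would come from the data loader."""
    optimizer.zero_grad()
    loss.backward()
    # Gradient clipping with a maximum global norm of 1, as described in the paper.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
```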