Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
On the Tension between Byzantine Robustness and No-Attack Accuracy in Distributed Learning
Authors: Yi-Rui Yang, Chang-Wei Shi, Wu-Jun Li
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we will empirically test the effect of using robust aggregators when there are no Byzantine workers. Specifically, we use ByzSGD with various robust aggregators to train a ResNet-20 (He et al., 2016) deep learning model on the CIFAR-10 dataset (Krizhevsky et al., 2009) for 160 epochs without attacks. All the experiments are conducted on a distributed platform with 16 Docker containers serving as workers and an extra Docker container as the server. Each Docker container is bound to an NVIDIA TITAN Xp GPU. We test the performance of each method when the training instances are randomly distributed to the workers according to the Dirichlet distribution with hyperparameter α = 0.1, 1.0, and 10.0, respectively. A smaller α will lead to a more heterogeneous data distribution. Moreover, the batch normalization (BN) layers in the ResNet-20 model are replaced with group normalization layers since BN layers have a poor performance with heterogeneous data across workers (Wu & He, 2018). All algorithms are implemented with PyTorch 1.3. |
| Researcher Affiliation | Academia | 1National Key Laboratory for Novel Software Technology, School of Computer Science, Nanjing University, Nanjing, China. Correspondence to: Wu-Jun Li <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Byzantine-Robust Gradient Descent (ByzGD). Input: iteration number T, learning rates {η_t}_{t=0}^{T−1}, robust aggregator Agg(·); Initialization: model parameter w_0; for t = 0 to T−1 do: broadcast w_t to all workers; on worker i ∈ {1, …, n} in parallel do: compute local gradient g_i = ∇F_i(w_t); send g_i to the server; end on worker; compute w_{t+1} = w_t − η_t Agg(g_1, …, g_n); end for. Output: model parameter w_T. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code or provide a link to a code repository for the methodology described. |
| Open Datasets | Yes | Specifically, we use ByzSGD with various robust aggregators to train a ResNet-20 (He et al., 2016) deep learning model on the CIFAR-10 dataset (Krizhevsky et al., 2009) for 160 epochs without attacks. |
| Dataset Splits | No | The paper mentions using the CIFAR-10 dataset but does not explicitly provide the training, testing, or validation split percentages or sample counts. It only describes how training instances are distributed among workers. |
| Hardware Specification | Yes | All the experiments are conducted on a distributed platform with 16 Docker containers serving as workers and an extra Docker container as the server. Each Docker container is bound to an NVIDIA TITAN Xp GPU. |
| Software Dependencies | Yes | All algorithms are implemented with PyTorch 1.3. |
| Experiment Setup | Yes | We use cross-entropy as the loss function, set the batch size on each worker to 16, and use the cosine annealing learning rates (Loshchilov & Hutter, 2017). Specifically, the learning rate at the p-th epoch is η_p = (1 + cos(pπ/160))/2 · η_0 for p = 0, 1, …, 159. The initial learning rate η_0 is selected from {0.1, 0.2, 0.5, 1.0}, and the best final top-1 test accuracy is used as the final metric. Local momentum is used with the momentum hyperparameter set to 0.9. |
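The Dirichlet-based heterogeneous data partitioning described in the extracted experiment text can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the function name and the per-class splitting strategy are assumptions; only the hyperparameter values (16 workers, α ∈ {0.1, 1.0, 10.0}) come from the paper.

```python
import numpy as np

def dirichlet_partition(labels, n_workers=16, alpha=0.1, seed=0):
    """Assign sample indices to workers via per-class Dirichlet draws.

    A smaller alpha yields a more heterogeneous (non-IID) distribution
    of classes across workers, matching the paper's description.
    """
    rng = np.random.default_rng(seed)
    n_classes = int(labels.max()) + 1
    worker_indices = [[] for _ in range(n_workers)]
    for c in range(n_classes):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        # Proportion of class c assigned to each worker.
        props = rng.dirichlet(alpha * np.ones(n_workers))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for w, part in enumerate(np.split(idx, cuts)):
            worker_indices[w].extend(part.tolist())
    return worker_indices

# Toy example: 10 classes (as in CIFAR-10), 1000 dummy labels.
labels = np.arange(1000) % 10
parts = dirichlet_partition(labels, n_workers=16, alpha=0.1)
```

With α = 0.1 most workers end up dominated by a few classes, while α = 10.0 gives a near-uniform class mix per worker.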
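The update rule in Algorithm 1 (ByzGD) can be illustrated with a minimal server-side step. Coordinate-wise median is used here as one example robust aggregator; the paper evaluates several aggregators, and this particular choice is an assumption for the sketch.

```python
import numpy as np

def coordinate_median(grads):
    """Example robust aggregator: coordinate-wise median of worker gradients."""
    return np.median(np.stack(grads), axis=0)

def byzgd_step(w, worker_grads, eta, agg=coordinate_median):
    """One ByzGD update: w_{t+1} = w_t - eta * Agg(g_1, ..., g_n)."""
    return w - eta * agg(worker_grads)

# Toy example: 2-D parameter, 4 workers, one of them sending an outlier
# gradient (as a Byzantine worker might).
w = np.zeros(2)
grads = [np.array([1.0, 1.0])] * 3 + [np.array([100.0, -100.0])]
w_next = byzgd_step(w, grads, eta=0.1)
```

The median suppresses the outlier, so the step follows the three honest gradients; a plain mean would instead be dragged far off by the single corrupted worker.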
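The cosine annealing schedule η_p = (1 + cos(pπ/160))/2 · η_0 from the experiment setup can be reproduced in a few lines. η_0 = 0.1 below is just one of the four candidate values the paper selects from.

```python
import math

def cosine_lr(epoch, eta0, total_epochs=160):
    """Cosine-annealed learning rate: eta_p = (1 + cos(p*pi/T)) / 2 * eta0."""
    return (1 + math.cos(epoch * math.pi / total_epochs)) / 2 * eta0

# Full 160-epoch schedule for one candidate initial rate.
rates = [cosine_lr(p, 0.1) for p in range(160)]
```

The schedule starts at η_0, halves by the midpoint (epoch 80), and decays smoothly toward zero by the final epoch.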