Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Sketched Adaptive Distributed Deep Learning: A Sharp Convergence Analysis

Authors: Zhijie Chen, Qiaobo Li, Arindam Banerjee

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, the SADL algorithms are shown to be competitive with and often outperform baselines on both vision and language tasks, in both supervised fine-tuning and training-from-scratch regimes. Further, the SADL algorithms are also competitive with the state-of-the-art communication-efficient distributed learning algorithms based on error feedback.
Researcher Affiliation | Academia | Zhijie Chen, Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign, EMAIL; Qiaobo Li, Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign, EMAIL; Arindam Banerjee, Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign, EMAIL
Pseudocode | Yes | Algorithm 1: Sketched Adaptive Distributed Deep Learning (SADL); Algorithm 2: ADA_OPT (AMSGrad)
Open Source Code | Yes | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: The code is public.
Open Datasets | Yes | For the vision task, we train a ResNet101 (Wu and He, 2018) with a total of 42M parameters from scratch and finetune a ViT-Base (Dosovitskiy et al., 2020) with 86M parameters on CIFAR10 (Krizhevsky et al., 2009). For the language task, we adopt SST2, a text classification task, from the GLUE benchmark (Wang et al., 2018).
Dataset Splits | Yes | For all experiments we split the training dataset uniformly over 5 clients. Our baselines include FetchSGD (Rothchild et al., 2020), MARINA (Gorbunov et al., 2021), CocktailSGD (Wang et al., 2023), CDAdam (Wang et al., 2022a), 1bit-Adam (Tang et al., 2021), FedCAMS (Wang et al., 2022b) and FedPAQ (Reisizadeh et al., 2020). A comparison of the theoretical guarantees of the baselines can be found in Table 12 in the Appendix. We define the compression rate as the ratio between the transmission size and the full model size (both in bytes).
Hardware Specification | Yes | Our experiments were conducted on a computing cluster with AMD EPYC 7713 64-Core Processor and NVIDIA A100 Tensor Core GPU.
Software Dependencies | No | The paper mentions "We used stochastic lanczos algorithm implemented by the pyHessian library Yao et al. (2020)" but does not provide specific version numbers for this or any other software dependencies.
Experiment Setup | Yes | We use cross entropy with label smoothing as the loss function. The parameter for label smoothing is 0.1. We use a cosine learning rate scheduler on the server optimizer, with a minimal learning rate of 1e-5. Client batch size is 128, and weight decay is 1e-4. We sweep a wide range of server learning rates over the set [1e-4, 5e-4, 1e-3, 5e-3, 1e-2, 5e-2, 0.1, 0.5, 1.0, 2.0]. The algorithms share all the other hyper-parameters within each set of experiments. The only exception is CocktailSGD, since it's unclear how to incorporate server-side momentum in their framework. To ensure a fair comparison, we tune both the server learning rate and the client-side momentum for CocktailSGD. The selected hyperparameters are displayed in Tables 2-7.
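
The Dataset Splits and Experiment Setup entries can be made concrete with a short sketch. This is not the authors' code. It is a minimal plain-Python illustration of four ingredients quoted above: the uniform split over 5 clients, the compression rate, the cosine learning rate schedule with a floor of 1e-5, and cross entropy with label smoothing 0.1. The function names, the shuffling before the split, the exact half-cosine formula without warmup, and the smoothing convention (mass eps/K spread over all K classes) are assumptions, because the quoted text does not specify them.

```python
import math
import random

# Values quoted in the report above.
NUM_CLIENTS = 5
LABEL_SMOOTHING = 0.1
MIN_LR = 1e-5
SERVER_LR_GRID = [1e-4, 5e-4, 1e-3, 5e-3, 1e-2, 5e-2, 0.1, 0.5, 1.0, 2.0]


def split_uniform(num_examples, num_clients=NUM_CLIENTS, seed=0):
    """Shuffle example indices and deal them out evenly to the clients.

    Assumption: "uniformly" means a random, equal-sized partition.
    """
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)
    return [indices[c::num_clients] for c in range(num_clients)]


def compression_rate(transmitted_bytes, model_bytes):
    """Transmission size divided by full model size, both in bytes."""
    return transmitted_bytes / model_bytes


def cosine_lr(step, total_steps, base_lr, min_lr=MIN_LR):
    """Cosine decay from base_lr at step 0 to min_lr at total_steps.

    Assumption: the standard half-cosine form with no warmup phase.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def smoothed_cross_entropy(logits, target, eps=LABEL_SMOOTHING):
    """Cross entropy of one example against a label-smoothed target.

    Assumption: the true class gets weight 1 - eps + eps/K and every
    other class gets eps/K, where K is the number of classes.
    """
    k = len(logits)
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]
    weights = [eps / k] * k
    weights[target] += 1.0 - eps
    return -sum(w * lp for w, lp in zip(weights, log_probs))
```

With CIFAR10's 50,000 training images, `split_uniform(50000)` gives each of the 5 clients 10,000 examples. A sweep would call `cosine_lr` once per server step for each candidate in `SERVER_LR_GRID`; the total number of steps is left as a parameter because the report does not state it.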