Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Efficient Adaptive Federated Optimization

Authors: Su Hyeong Lee, Sidharth Sharma, Manzil Zaheer, Tian Li

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive empirical evaluations on image and text datasets demonstrate both the advantages of joint adaptivity and the effectiveness and efficiency of FedAda2/FedAda2++. Empirically, we show that FedAda2/FedAda2++, without transmitting preconditioners and employing on-device preconditioner compression, matches the performance of its more expensive counterparts, and outperforms baselines without joint adaptivity on both image and text datasets (Section 5).
Researcher Affiliation Collaboration Su Hyeong Lee (Department of Statistics, University of Chicago); Sidharth Sharma (Department of Computer Science, Columbia University); Manzil Zaheer (Google DeepMind); Tian Li (Department of Computer Science, University of Chicago)
Pseudocode Yes Algorithm 1: FedAda2: Efficient Jointly Adaptive Optimization Framework (Simplified); Algorithm 2: Adaptive server and client-side ADAGRAD with SM3 (FedAda2++); Algorithm 3: Delayed preconditioner SM3-I; Algorithm 4: Delayed preconditioner SM3-II; Algorithm 5: Server-side ADAGRAD and client-side optimizer mixture (FedAda2); Algorithm 6: Adam with Delayed Moment Updates (ADMU); Algorithm 7: Adaptive server-side ADAGRAD and client-side ADAM (FedAdaAdam); Algorithm 8: AdaGrad with Delayed Updates (AGDU); Algorithm 9: Adaptive server and client-side ADAGRAD (FedAdaAdagrad)
Open Source Code Yes We provide a very detailed explanation of the setup in Appendix H and I, and release the code to reproduce our experiments.
Open Datasets Yes Evaluation Setup. We explore the impact of adaptivity on both text and image datasets, i.e., Stack Overflow [27], CIFAR-100 [28], FEMNIST [29], and GLD-23K [30].
Dataset Splits Yes For images, we explore vision transformer models (ViT-S [31]) which are pretrained on ImageNet-21K [32], and finetune them on the Google Landmarks dataset [30]. This represents a domain shift onto natural user-split pictorial data. We use the same model on the CIFAR-100 dataset [28], where we partition the data using LDA [33] with α = 0.001, a non-IID statistical topic modeling algorithm. To assess the performance of all algorithms in an additional realistic heterogeneous federated learning scenario, we further utilize FEMNIST [29] where each client is an individual writer. This setup evaluates federated learning algorithms under non-IID conditions, highlighting challenges such as personalization and robustness to client heterogeneity. Details for federated dataset statistics, learning tasks, and hyperparameter tuning are provided in Appendix I. (H.2) The client participation fraction for all GLD-23K experiments is set to 0.01. (I.1) We use a subsampling rate of 0.1, for a total of 400 clients and 500 communication rounds. (H.3) The CIFAR-10/100 datasets [28] consist of 32 × 32 × 3 images. In the smaller variant CIFAR-10, there are 10 labels, with 50,000 training images and 10,000 test images. The 10 classes represent common objects: airplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships, and trucks. CIFAR-100 is meant to be an extension of CIFAR-10, consisting of 60,000 color images, but with 100 classes instead of 10. Each class in CIFAR-100 contains 600 images, and the dataset is similarly split into 50,000 training images and 10,000 test images.
Hardware Specification Yes Experiments were performed on a computing cluster managed by Slurm, consisting of nodes with various configurations. The cluster includes nodes with multiple GPU types, including NVIDIA RTX 2080 Ti, A40, and H100 GPUs.
Software Dependencies No For example, PyTorch's implementation of Adam adopts ε = 10⁻⁸ as its default value.
Experiment Setup Yes For all ViT experiments, images were resized to 224 × 224 pixels, and the client optimizer employed a linear learning rate warm-up, increasing from 0 to the final value over the first 10 local backpropagation steps. The local batch size was consistently set to 32 across all datasets used in this paper. Due to better empirical performance, Adam was selected as the main optimizer strategy for ViT fine-tuning against the image datasets. We utilized prior work [4] as well as small-scale experiments regarding server-only adaptivity to guide the selection of the momentum parameters β1 = 0.9, β2 = 0.999 for server Adam. The identical parameters were selected for client Adam, and better choices may exist for either the server or client. In order to determine suitable learning rates and adaptivity parameters, we conduct extensive hyperparameter sweeps using a two-step procedure.
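
The linear warm-up quoted in the Experiment Setup row can be illustrated with a minimal sketch. The function name `warmup_lr`, the 0-indexed step convention, and the example base rate of 0.1 are assumptions made here for illustration; the quoted text only states that the rate rises linearly from 0 to its final value over the first 10 local steps, so the authors' exact schedule may differ.

```python
def warmup_lr(step: int, base_lr: float, warmup_steps: int = 10) -> float:
    """Learning rate at a given local step under linear warm-up.

    Assumption: steps are 0-indexed, so step 0 uses a rate of 0 and
    step `warmup_steps` is the first step that uses the full base_lr.
    """
    if warmup_steps <= 0:
        return base_lr
    # Fraction of the warm-up completed, capped at 1 once warm-up ends.
    fraction = min(1.0, step / warmup_steps)
    return base_lr * fraction


if __name__ == "__main__":
    # Hypothetical base rate, chosen only to make the ramp easy to read.
    for s in (0, 1, 5, 10, 20):
        print(s, warmup_lr(s, base_lr=0.1))
```

A stateless function like this can be called once per local step, which suits a setting where each client restarts its local optimization every communication round.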
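
The Dataset Splits row mentions partitioning CIFAR-100 with LDA at α = 0.001. As a rough sketch of what a Dirichlet-based label partition can look like, the code below draws, for each class, a distribution over clients from a symmetric Dirichlet(α) and assigns that class's examples accordingly. This is one common variant and not necessarily the procedure of [33] or of the authors; the names `dirichlet_sample` and `partition_by_label`, the toy labels, and the client count are all hypothetical.

```python
import random
from collections import defaultdict


def dirichlet_sample(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k entries."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    if total == 0.0:
        # With very small alpha every gamma draw can underflow to 0.
        # Fall back to a one-hot vector, the limiting case as alpha -> 0.
        probs = [0.0] * k
        probs[rng.randrange(k)] = 1.0
        return probs
    return [d / total for d in draws]


def partition_by_label(labels, num_clients, alpha, seed=0):
    """Split example indices across clients, one Dirichlet draw per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    clients = [[] for _ in range(num_clients)]
    for y in sorted(by_class):
        idxs = by_class[y]
        probs = dirichlet_sample(alpha, num_clients, rng)
        # Each example of class y goes to a client sampled from probs.
        owners = rng.choices(range(num_clients), weights=probs, k=len(idxs))
        for idx, c in zip(idxs, owners):
            clients[c].append(idx)
    return clients


if __name__ == "__main__":
    toy_labels = [i % 10 for i in range(1000)]  # toy data, not CIFAR-100
    parts = partition_by_label(toy_labels, num_clients=20, alpha=0.001)
    print([len(p) for p in parts])
```

With α this small, nearly all of a class's mass tends to land on a single client, which is what makes the resulting split strongly non-IID; some clients may receive no data at all in this toy version.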