Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Adaptive Latent-Space Constraints in Personalized Federated Learning

Authors: Sana Ayromlou, David B. Emerson

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	The contributions of this work are three-fold. First, it proposes using theoretically supported statistical distance measures as penalty constraints for dealing with heterogeneity in pFL, moving beyond previously considered paired feature-based constraints. Second, it demonstrates that leveraging the adaptability of MK-MMD or MMD-D measures through iterative re-optimization provides notable performance improvements in settings with high feature heterogeneity where existing methods, such as Ditto, under-perform. Finally, experiments with either natural or controllable levels of label and feature heterogeneity highlight the strengths and weaknesses of the types of drift penalties investigated. Additional experiments demonstrate that such measures are directly applicable to other pFL techniques and yield similar improvements across a number of datasets.
Researcher Affiliation	Collaboration	Sana Ayromlou, Vector Institute & Google, Toronto, Ontario, CA; D. B. Emerson, Vector Institute, Toronto, Ontario, CA
Pseudocode	Yes	Algorithm 1: Ditto algorithm with FedAvg aggregation and batch SGD for local optimization. Input: $N, T, s, \lambda, \eta, w$. Set $w_L^{(i)} = w$ for each client $i$. for $t = 0, \ldots, T-1$ do: for each client $i$ in parallel do: Set $w_G^{(i)} = w$. for $s$ iterations, draw batch $b$ do: $w_G^{(i)} = w_G^{(i)} - \eta \nabla \ell_i(b; w_G^{(i)})$; $w_L^{(i)} = w_L^{(i)} - \eta \nabla \left( \ell_i(b; w_L^{(i)}) + \frac{\lambda}{2} d(w_L^{(i)}, w) \right)$. end. Send $w_G^{(i)}$ to server for aggregation. end. $w = \frac{1}{n} \sum_{i=1}^{N} n_i w_G^{(i)}$. end
Open Source Code	Yes	All code is found at: https://github.com/VectorInstitute/FL4Health/tree/main/research
Open Datasets	Yes	To quantify the utility of the proposed measures, four datasets are used. Each poses unique challenges in the FL setting. The first set of experiments considers several variants of CIFAR-10... The second dataset, referred to as Synthetic, is a generated dataset... The final two datasets focus on real clinical tasks with natural client splits and heterogeneity. Fed-ISIC2019 is drawn from the FLamby benchmark [8] and consists of 2D dermatological images... The RxRx1 dataset [36] is composed of 6-channel fluorescent microscopy images...
Dataset Splits	Yes	Values of {5.0, 0.5, 0.1} are used to split data between five clients. The second dataset, referred to as Synthetic, is a generated dataset with controllable levels of feature heterogeneity across eight clients... The data is split across six clients... The data is partitioned into four clients based on the hospitals where the images were collected... Within each FL training run, the final metric is the uniform average of each client's performance on their respective test sets.
Hardware Specification	Yes	All experiments were performed on a high-performance computing cluster. For the CIFAR-10 and Synthetic datasets, an NVIDIA T4V2 GPU with 32GB of CPU memory was used. ... For the Fed-ISIC2019 and RxRx1 datasets, we used an NVIDIA A100 GPU with 64GB and 100GB of CPU memory, respectively.
Software Dependencies	No	The paper mentions various optimizers like SGD with momentum and AdamW, and refers to implementations in pFL-bench, but does not specify version numbers for general software libraries such as Python, PyTorch, TensorFlow, or scikit-learn.
Experiment Setup	Yes	In the experiments, hyperparameter sweeps are conducted to calibrate items such as learning rate for all methods. A full list of the hyperparameters considered and their optimal values appears in Appendix B. Other parameters, such as batch size, are detailed in Appendix C. The metrics reported in the results are the average of values across three training runs. ... For all variants of CIFAR-10, there are five clients and FL proceeds for 10 server rounds. Within each round, clients perform five epochs of local training using a standard SGD optimizer with momentum set to 0.9 and a batch size of 32.
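
The Ditto loop quoted in the Pseudocode row can be illustrated with a minimal pure-Python sketch. This is not the authors' FL4Health implementation. It assumes a linear least-squares loss for each client, takes d to be the squared Euclidean distance (so the gradient of the penalty term is λ(w_L − w)), and takes n to be the total number of samples across clients. The names `grad_loss` and `ditto`, and the `batch_size` and `seed` parameters, are hypothetical and chosen only for illustration.

```python
import random


def grad_loss(w, batch):
    """Gradient of 0.5 * mean((x . w - y)^2) over the batch, w.r.t. w."""
    g = [0.0] * len(w)
    for x, y in batch:
        err = sum(wj * xj for wj, xj in zip(w, x)) - y
        for j, xj in enumerate(x):
            g[j] += err * xj / len(batch)
    return g


def ditto(clients, T, s, lam, eta, w, batch_size=8, seed=0):
    """Sketch of Ditto with FedAvg aggregation (assumed loss and distance).

    clients: list of per-client datasets, each a list of (x, y) pairs.
    Returns the final global weights and the list of personal weights.
    """
    rng = random.Random(seed)
    w = list(w)
    # Personal (local) models start from the initial global weights.
    w_local = [list(w) for _ in clients]
    n = sum(len(data) for data in clients)  # assumed: total sample count
    for _ in range(T):
        w_globals = []
        for i, data in enumerate(clients):
            w_g = list(w)  # each client starts the round from the server model
            for _ in range(s):
                batch = rng.sample(data, min(batch_size, len(data)))
                # Plain SGD step on the client's copy of the global model.
                g_glob = grad_loss(w_g, batch)
                w_g = [a - eta * g for a, g in zip(w_g, g_glob)]
                # Penalized SGD step on the personal model; with
                # d = squared L2 distance, grad of (lam/2) d is lam * (w_L - w).
                g_loc = grad_loss(w_local[i], batch)
                w_local[i] = [
                    a - eta * (g + lam * (a - c))
                    for a, g, c in zip(w_local[i], g_loc, w)
                ]
            w_globals.append(w_g)
        # FedAvg: sample-size-weighted average of the client global models.
        w = [
            sum(len(data) * wg[j] for data, wg in zip(clients, w_globals)) / n
            for j in range(len(w))
        ]
    return w, w_local
```

Setting `lam=0.0` decouples the personal models from the server model, while larger values pull each personal model toward the aggregated weights. The paper's proposed penalties (MK-MMD, MMD-D) would replace the squared-distance term used here.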