Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Personalized Bayesian Federated Learning with Wasserstein Barycenter Aggregation

Authors: Ting Wei, Biao Mei, Junliang Lyu, Renquan Zhang, Feng Zhou, Yifan Sun

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirically, experiments show that FedWBA outperforms baselines in prediction accuracy, uncertainty calibration, and convergence rate, with ablation studies confirming its robustness. Comprehensive experiments demonstrate the superiority of FedWBA in prediction accuracy, uncertainty calibration, and convergence rate compared to baselines. Additionally, ablation studies evaluate the robustness of our approach w.r.t. different components. In this section, we utilize four real-world datasets to evaluate the performance of FedWBA in terms of prediction accuracy, uncertainty calibration, and convergence rate.
Researcher Affiliation	Academia	Ting Wei School of Statistics, Renmin University of China Beijing, China EMAIL; Biao Mei School of Statistics, Renmin University of China Beijing, China EMAIL; Junliang Lyu Guanghua School of Management, Peking University Beijing, China EMAIL; Renquan Zhang School of Mathematics Science, Dalian University of Technology Dalian, China EMAIL; Feng Zhou Center for Applied Statistics and School of Statistics, Renmin University of China Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing Beijing, China EMAIL; Yifan Sun Center for Applied Statistics and School of Statistics, Renmin University of China Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing Beijing, China EMAIL
Pseudocode	Yes	C Pseudocode for Algorithm. Algorithm 1: FedWBA. Input: size $Z$; kernels $k(\cdot,\cdot)$ for SVGD and $k(\cdot,\cdot)$ for KDE. Initialize: local posterior particles $\{\{\theta_{k,i}\}_{i=1}^{N}\}_{k=1}^{K}$; global prior particles $\{\theta_i\}_{i=1}^{N}$. For each communication round: the server randomly samples a subset of clients of size $Z$; the server broadcasts $\{\theta_i\}_{i=1}^{N}$ to each sampled client $k$. [Local updates] For each sampled client $k$ do: rebuild the continuous prior by Equation (6); update local posterior particles by Equation (2); end for. Each sampled client $k$ uploads $\{\theta_{k,i}\}_{i=1}^{N}$ to the server. [Server aggregates] Update global prior particles by Equation (5).
Open Source Code	Yes	Our code is publicly available at https://github.com/TingWei1006/FedWBA
Open Datasets	Yes	Datasets: To benchmark under realistic non-i.i.d. conditions with label skew, we evaluate on four vision datasets: MNIST, Fashion MNIST (FMNIST) [42], CIFAR-10, and CIFAR-100 [24]. We adopt the setup from [45, 46, 1] in which each client receives 5 unique labels for MNIST, FMNIST, and CIFAR-10, and 10 labels sampled from distinct superclasses for CIFAR-100.
Dataset Splits	Yes	We adopt the setup from [45, 46, 1] in which each client receives 5 unique labels for MNIST, FMNIST, and CIFAR-10, and 10 labels sampled from distinct superclasses for CIFAR-100. Setup: We conduct all experiments with 100 communication rounds and 20% client participation per round, sufficient for algorithm convergence. We evaluate performance with client counts $K \in \{50, 100, 200\}$, considering that more clients disperse the training data.
Hardware Specification	Yes	In this section, we utilize four real-world datasets to evaluate the performance of FedWBA in terms of prediction accuracy, uncertainty calibration, and convergence rate. We perform all experiments using a server with GPU (NVIDIA GeForce RTX 4090).
Software Dependencies	No	The paper mentions "AdaGrad" and refers to "Pot: Python optimal transport" [15], but it does not specify versions for Python itself or for common machine learning libraries such as PyTorch or TensorFlow, which would be essential for reproduction. Therefore, it does not provide specific software dependencies with version numbers.
Experiment Setup	Yes	Setup: We conduct all experiments with 100 communication rounds and 20% client participation per round, sufficient for algorithm convergence. We evaluate performance with client counts $K \in \{50, 100, 200\}$, considering that more clients disperse the training data. Following established architectures in prior work, we implement lightweight models: for MNIST and FMNIST, adopting the single-hidden-layer MLP from [45, 46], and for CIFAR-10/100, deploying the LeNet-style CNN in [1] to accommodate resource constraints. Hyperparameter: Similar to [29], in all SVGD experiments, we employ the radial basis function (RBF) kernel $k(\theta, \theta') = \exp(-\|\theta - \theta'\|_2^2 / h)$. The bandwidth $h$ is set to $h = \mathrm{med}^2 / \log N$, with med being the median of pairwise particle distances in the current iteration. The bandwidth of the Gaussian kernel $k(\cdot,\cdot)$ used for KDE is set to 0.55. We use AdaGrad with momentum to set the learning rate $\epsilon$. Considering communication constraints from uploaded data size, we set the number of particles to 10, balancing computational accuracy and communication overhead in SVGD, mitigating communication bottlenecks without sacrificing much variational inference performance.
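
The RBF kernel and median-heuristic bandwidth quoted in the Experiment Setup row can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the authors' implementation: the function name `rbf_kernel_median` is hypothetical, and taking the median over distinct particle pairs (excluding the zero self-distances) is an assumption, since the quoted text does not say which pairs are included.

```python
import numpy as np


def rbf_kernel_median(particles):
    """Return (K, h) for k(a, b) = exp(-||a - b||_2^2 / h), h = med^2 / log N.

    `particles` is an (N, d) array with N >= 2 distinct particles.
    Assumption: `med` is the median distance over distinct pairs only.
    """
    n = particles.shape[0]
    # Pairwise squared Euclidean distances via broadcasting, shape (N, N).
    diffs = particles[:, None, :] - particles[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=-1)
    # Median of the pairwise distances, using the upper triangle (i < j).
    upper = np.triu_indices(n, k=1)
    med = np.median(np.sqrt(sq_dists[upper]))
    # Median-heuristic bandwidth as quoted in the setup.
    h = med ** 2 / np.log(n)
    return np.exp(-sq_dists / h), h


# Usage with 10 particles, the particle count reported in the setup.
# The parameter dimension (3) is an arbitrary choice for illustration.
rng = np.random.default_rng(0)
theta = rng.normal(size=(10, 3))
K, h = rbf_kernel_median(theta)
```

With this bandwidth, two particles separated by exactly the median distance get kernel value exp(-log N) = 1/N, so the kernel scale adapts to the current particle spread at each iteration.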