Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Diffusion Federated Dataset

Authors: Seok-Ju Hahn, Junghye Lee

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We empirically validate fidelity and utility of synthetic dataset from DfD under non-IID conditions, optionally with formal privacy guarantees, addressing key needs in cross-silo FL scenarios. 5 Experimental Results. Datasets. We use three benchmark datasets: MNIST [76], CIFAR-10 [77], and CelebA [78]... Evaluation Metrics. We evaluate both fidelity and utility of the generated synthetic dataset. To evaluate the fidelity of synthetic data, we use the widely-used metrics for generative modeling: Fréchet Inception Distance (FID [81]), Precision & Recall (P&R [82]), and Density & Coverage (D&C [83]). To evaluate utility, we use an accuracy evaluated from a classifier trained at the central server using class-labeled synthetic dataset. We defer the specific experimental setup to Appendix D. Table 1: Results on synthetic dataset quality. Table 2: Results on synthetic dataset utility.
Researcher Affiliation	Collaboration	Seok-Ju Hahn¹, Argonne National Laboratory, EMAIL. Junghye Lee², Seoul National University, EMAIL
Pseudocode	Yes	Figure 1: Overview of DfD. (A) Clients independently train diffusion models to be well-trained with Eq. (3). (B) The server randomly initializes synthetic dataset per Eq. (12). (C) The server requests inference on synthetic dataset to all clients, receives predictions ϵ_θi(x_t^(j), t), i ∈ [K], j ∈ [N], transforms them into energies and scores using Eq. (11), composes into global scores using Eq. (9), and refines synthetic dataset using ULA in Eq. (8) over T steps. Algorithm 1 DfD: Cooperative Diffusion Models Inference Framework for Synthetic Dataset
Open Source Code	Yes	Code is available at: https://github.com/vaseline555/DfD
Open Datasets	Yes	Datasets. We use three benchmark datasets: MNIST [76], CIFAR-10 [77], and CelebA [78], after resizing all inputs to have spatial dimension of 32 × 32. As each dataset has separate train & test folds, we use the train fold to split into client datasets, and set the test fold aside for server-side evaluation.
Dataset Splits	Yes	We distribute the train fold of each dataset into K = 10 clients with three different non-IID conditions: i) Dirichlet distribution-based non-IID [79] for MNIST, ii) power-law distribution-based non-IID [21] for CIFAR-10, and iii) pathological non-IID [1] for CelebA. To further simulate a convincing scenario in which a synthetic dataset should be procured (i.e., data-limited settings), we randomly sample local dataset to have a size of 300 on average, following the sample size configurations of the curated benchmark for the cross-silo FL setting [80]. For MNIST dataset, we use Dirichlet distribution with concentration parameter α = 0.1, following the setting of [79]. For CIFAR-10 dataset, we follow the setting of [21] using log-normal distribution with location=0 and scale=2. For CelebA dataset, which has 40 different attributes, we first construct classes by combining gender (male/female), smiling (0/1), and eyeglasses (0/1) attributes, i.e., 8 classes as a result. We randomly distribute samples to clients so that they have only three distinct classes.
Hardware Specification	Yes	Specification. We conduct all experiments in a single server with Intel Xeon Gold 6226R CPU (@ 2.90GHz) and a single NVIDIA Ampere A100 GPU (w/ 40GB VRAM).
Software Dependencies	No	For the implementation of diffusion models, we resort to diffusers [112] library using PyTorch [113]. Table A1: Model and Training Configurations. Optimizer Adam [114].
Experiment Setup	Yes	All clients are taking 10K steps in total for T = 1000 rounds: E = 10 local updates for all comparison methods, and E = 10 × 1,000 = 10,000 local updates for DfD as it requires no update during communication rounds. The mini-batch size is set to B = 32, and the learning rates are tuned for all methods, and set to c(1 αt)p for c > 0, p 1 for DfD. We defer the specific experimental setup to Appendix D. Table A1: Model and Training Configurations. Optimizer Adam [114] Learning rate 2 × 10^-4.
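
Illustration (not taken from the paper or its repository): the Dataset Splits row above mentions a Dirichlet distribution-based non-IID split over K = 10 clients with concentration parameter α = 0.1. The sketch below shows one common way such a split can be built, using only the Python standard library. The function name `dirichlet_partition`, the toy labels, and the seed are hypothetical choices made for this example; the authors' actual procedure, including the subsampling to about 300 samples per client, may differ.

```python
import random
from collections import defaultdict


def dirichlet_partition(labels, num_clients=10, alpha=0.1, seed=0):
    """Assign sample indices to clients using per-class Dirichlet proportions.

    Hypothetical helper for illustration only; not the paper's implementation.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    clients = [[] for _ in range(num_clients)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)

        # A Dirichlet(alpha, ..., alpha) draw equals independent
        # Gamma(alpha, 1) draws divided by their sum.
        weights = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
        total = sum(weights)
        if total == 0.0:
            # Guard against numerical underflow at small alpha.
            weights, total = [1.0] * num_clients, float(num_clients)

        # Turn the proportions into cut points over this class's indices.
        start, cumulative = 0, 0.0
        for client_id, weight in enumerate(weights):
            cumulative += weight / total
            if client_id == num_clients - 1:
                end = len(indices)  # the last client takes the remainder
            else:
                end = min(len(indices), max(start, round(cumulative * len(indices))))
            clients[client_id].extend(indices[start:end])
            start = end
    return clients


if __name__ == "__main__":
    toy_labels = [i % 10 for i in range(3000)]  # 10 balanced toy classes
    parts = dirichlet_partition(toy_labels, num_clients=10, alpha=0.1, seed=0)
    for client_id, part in enumerate(parts):
        classes = sorted({toy_labels[i] for i in part})
        print(f"client {client_id}: {len(part)} samples, classes {classes}")
```

With a small α such as 0.1, most of each class's mass tends to land on one or two clients, which is what makes the split strongly non-IID; larger α values give more uniform client distributions.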