Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Diffusing DeBias: Synthetic Bias Amplification for Model Debiasing

Authors: Massimiliano Ciranni, Vito Paolo Pastore, Roberto Di Via, Enzo Tartaglione, Francesca Odone, Vittorio Murino

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our experiments utilize six distinct datasets: three typical benchmark datasets, including Waterbirds [43], Biased Flickr-Faces-HQ (BFFHQ) [22], and Biased Action Recognition (BAR), taken from the original version as introduced in [39]. On top of these benchmark datasets, we are interested in evaluating our approach in challenging scenarios involving real-world images with natural biases and datasets where multiple shortcuts are present. For this reason, we include ImageNet9 [2]/ImageNet-A [14], and UrbanCars [32] in our experiments. More dataset details are in Sec. A.1.
Researcher Affiliation	Academia	(1) MaLGa-DIBRIS, University of Genoa, Italy; (2) AI for Good (AIGO), Istituto Italiano di Tecnologia, Genova, Italy; (3) Telècom-Paris, Ecole Polytechnique Superior, France; (4) Department of Computer Science, University of Verona, Italy
Pseudocode	No	The paper describes methods and processes using schematic representations in figures (Figure 1, Figure 2) and detailed textual descriptions in Section 3 and its subsections, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	Code is available at https://github.com/Malga-Vision/DiffusingDeBias.
Open Datasets	Yes	Our experiments utilize six distinct datasets: three typical benchmark datasets, including Waterbirds [43], Biased Flickr-Faces-HQ (BFFHQ) [22], and Biased Action Recognition (BAR), taken from the original version as introduced in [39]. On top of these benchmark datasets, we are interested in evaluating our approach in challenging scenarios involving real-world images with natural biases and datasets where multiple shortcuts are present. For this reason, we include ImageNet9 [2]/ImageNet-A [14], and UrbanCars [32] in our experiments. More dataset details are in Sec. A.1.
Dataset Splits	Yes	Waterbirds [43] is an image dataset exhibiting strong correlations (ρ = 0.950) between bird species and background environments (e.g., water vs. land). It has 4,795 training images, 1,199 images for validation, and 5,794 for testing. BFFHQ has a total of 19,200 training images where the semantic classes y ∈ {young, old}, the bias attribute b ∈ {female, male}, and the bias ratio ρ = 0.995. Then, it provides 1,000 validation images (all bias-aligned) and 1,000 testing images with uniformly distributed labels and bias attributes. We utilize the original BAR version as introduced in [39]. The dataset presents 6 different classes, but does not provide explicit bias annotations or a validation set, with 1,941 images for training and 654 for testing.
Hardware Specification	Yes	DDB's main limitations stem from the high computational cost of training diffusion models. As a trade-off between efficiency and generation quality, we reduce the training image resolution to 64 × 64; nonetheless, training our CDPM still requires 14 hours on an NVIDIA A30 GPU with 24 GB of VRAM for the largest considered dataset (BFFHQ, 19,200 images).
Software Dependencies	No	The paper mentions specific optimizers and models like "AdamW optimizer" [37], "CosineAnnealingLR" [36], and "Densenet-121" [17], but it does not specify software library versions (e.g., PyTorch 1.x, Python 3.x, CUDA version) for these components.
Experiment Setup	Yes	In our experiments, we generate 1000 synthetic images for each target class using a CDPM... The CDPM is trained from scratch for 70,000, 150,000, 200,000, 100,000, and 100,000 iterations for the datasets BAR, Waterbirds, BFFHQ, ImageNet-9, and UrbanCars, respectively. Batch size is set to 32, and the diffusion process is executed over 1,000 steps with a linear noise schedule (where β1 = 1e-4 and βT = 0.028). Optimization is performed using MSE loss and AdamW optimizer, with an initial learning rate of 1e-4, adjusted by a CosineAnnealingLR [36] scheduler with a warm-up period for the first 10% of the total iterations. As per the Bias Amplifier training, we use synthetic images from CDPM. A Densenet-121 [17] with ImageNet pre-training is employed across all datasets, featuring a single-layer linear classification head. The training uses regular Cross-Entropy loss function and the AdamW optimizer [37]. For BFFHQ, BAR, and UrbanCars, the learning rate is 1e-4, and 5e-4 for Waterbirds and ImageNet-9. Training spans 50 epochs with weight decay λ = 0.01 except for BFFHQ, which uses λ = 1.0 for 100 epochs.
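
The CDPM schedules quoted in the Experiment Setup row can be sketched in a few lines of plain Python. This is an illustrative reading of the quoted text, not the authors' code: the function names are hypothetical, the warm-up is assumed to be linear, and the cosine phase is assumed to decay to zero, since the excerpt specifies neither.

```python
import math

# Iteration counts per dataset, as quoted in the Experiment Setup row.
CDPM_ITERATIONS = {
    "BAR": 70_000,
    "Waterbirds": 150_000,
    "BFFHQ": 200_000,
    "ImageNet-9": 100_000,
    "UrbanCars": 100_000,
}


def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.028):
    """Return num_steps noise variances spaced linearly from beta_start to beta_end."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def warmup_cosine_lr(iteration, total_iters, base_lr=1e-4, warmup_frac=0.10):
    """Learning rate at a given iteration.

    Assumption: linear warm-up over the first warmup_frac of training,
    followed by cosine annealing from base_lr down to zero.
    """
    warmup_iters = int(warmup_frac * total_iters)
    if iteration < warmup_iters:
        # Linear ramp that reaches base_lr on the last warm-up iteration.
        return base_lr * (iteration + 1) / warmup_iters
    progress = (iteration - warmup_iters) / max(1, total_iters - warmup_iters)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    betas = linear_beta_schedule()
    print(f"beta_1 = {betas[0]:.6f}, beta_T = {betas[-1]:.6f}")
    for name, total in CDPM_ITERATIONS.items():
        mid = warmup_cosine_lr(total // 2, total)
        print(f"{name}: {total} iterations, lr at midpoint = {mid:.2e}")
```

Under these assumptions, the learning rate peaks at 1e-4 once the first 10% of iterations has elapsed (for example, after 7,000 of BAR's 70,000 iterations) and falls to zero at the final iteration.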