Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

DiffBreak: Is Diffusion-Based Purification Robust?

Authors: Andre Kassis, Urs Hengartner, Yaoliang Yu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	4 Experiments We reevaluate DBP, demonstrating its degraded performance when evaluation and backpropagation issues are addressed, cementing our theoretical findings from §3.1. Setup. We evaluate on CIFAR-10 [22] and ImageNet [11] similar to previous work [41, 20, 30, 40].
Researcher Affiliation	Academia	Andre Kassis, Urs Hengartner, Yaoliang Yu Cheriton School of Computer Science, University of Waterloo Waterloo, Ontario, Canada EMAIL
Pseudocode	Yes	We discover and fix issues in all previous implementations, introducing DiffGrad, the first reliable module for exact backpropagation through DBP (see Fig. 1). In §3.2 and §3.3, we revisit DBP's previous robustness, attributing it to backpropagation issues and improper evaluation protocols. ... (see Appendix D for detailed analysis, DiffGrad's pseudo-code, and empirical evaluations of backpropagation issues)
Open Source Code	Yes	Availability: aside from scalability and backpropagation issues, existing DBP implementations and attacks lack generalizability. We provide DiffBreak¹, the first toolkit for evaluating any classifier with DBP under various optimization methods, including our novel LF, using our reliable DiffGrad module for backpropagation. ¹ https://github.com/andrekassis/DiffBreak
Open Datasets	Yes	Setup. We evaluate on CIFAR-10 [22] and ImageNet [11] similar to previous work [41, 20, 30, 40].
Dataset Splits	Yes	We use 256 random test samples per dataset, consistent with prior DBP work [41, 7].
Hardware Specification	Yes	Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: Information about compute resources is given in Section 4.2. Experiments are conducted on a 40GB NVIDIA A100 GPU.
Software Dependencies	No	Time Consistency: torchsde is the de facto standard library for SDE solvers. Hence, several prior checkpointing approaches likely use torchsde as their backend. Yet, torchsde internally converts the integration interval into a PyTorch tensor, which causes a discrepancy in the time steps on which the score model is invoked during both propagation phases due to rounding issues if the checkpointing module is oblivious to this detail.
Experiment Setup	Yes	Setup. We evaluate on CIFAR-10 [22] and ImageNet [11] similar to previous work [41, 20, 30, 40]. We consider two foundational DBP defenses: The VP-SDE DBP (DiffPure) [30] and the Guided DDPM (see §3.2), GDMP [40]. We use the DMs [12, 19, 36] studied in the original works, adopting the same purification settings. See Appendix E. ... As in previous work [40, 30, 20, 41, 26, 24], we focus on the white-box setting and use AutoAttack-ℓ∞ (AA-ℓ∞) [10] with ε∞ = 8/255 for CIFAR-10 and ε∞ = 4/255 for ImageNet.
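
The evaluation settings quoted in the Dataset Splits and Experiment Setup rows (256 random test samples per dataset, ℓ∞ budgets of 8/255 and 4/255) can be written down as a small, seedable configuration. The following is only a minimal sketch and not code from the DiffBreak toolkit: the names `EvalConfig`, `CONFIGS`, `sample_eval_indices`, and `within_budget` are hypothetical, the seed is an assumption (the excerpts do not state one), and the pool sizes assume the standard 10,000-image CIFAR-10 test set and 50,000-image ImageNet validation set.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalConfig:
    """Hypothetical container for the settings quoted in the table."""
    dataset: str
    pool_size: int          # number of images to sample from (assumed)
    epsilon: float          # l_inf budget, for pixels scaled to [0, 1]
    num_samples: int = 256  # "256 random test samples per dataset"


# Budgets and sample count come from the quoted excerpts;
# pool sizes are the standard split sizes and are an assumption here.
CONFIGS = {
    "cifar10": EvalConfig("cifar10", 10_000, 8 / 255),
    "imagenet": EvalConfig("imagenet", 50_000, 4 / 255),
}


def sample_eval_indices(cfg, seed=0):
    """Draw a reproducible random subset of sample indices.

    The seed value is an assumption made for illustration only.
    """
    rng = random.Random(seed)
    return sorted(rng.sample(range(cfg.pool_size), cfg.num_samples))


def within_budget(clean, adversarial, epsilon, tol=1e-9):
    """Check that every pixel moved by at most epsilon (l_inf constraint)."""
    return all(abs(a - c) <= epsilon + tol for c, a in zip(clean, adversarial))


if __name__ == "__main__":
    cfg = CONFIGS["cifar10"]
    indices = sample_eval_indices(cfg, seed=0)
    print(cfg.dataset, len(indices), "samples, epsilon =", cfg.epsilon)
```

Fixing the seed makes the 256-sample subset identical across runs, which is what allows a robustness number computed on such a subset to be compared between reruns.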