Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Graph Diffusion that can Insert and Delete

Authors: Matteo Ninniri, Marco Podda, Davide Bacciu

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We test GrIDDD on property targeting in two widely used benchmarks (QM9 and ZINC-250k), where it consistently performs on par or better than the state of the art in terms of approximating the target property while keeping high chemical validity, despite having been trained on a more difficult problem. When applied to molecular optimization, GrIDDD convincingly outperforms other molecular optimizers, achieving a higher average improvement and optimization success rate.
Researcher Affiliation	Academia	Matteo Ninniri Department of Computer Science University of Pisa 56127 Pisa (Italy) EMAIL Marco Podda Department of Computer Science University of Pisa 56127 Pisa (Italy) EMAIL Davide Bacciu Department of Computer Science University of Pisa 56127 Pisa (Italy) EMAIL
Pseudocode	Yes	Algorithm 1 in Appendix A.3 describes the new training process. Algorithm 2 in Appendix A.3 describes the sampling process in detail.
Open Source Code	Yes	Our code is available at https://github.com/mninniri/GrIDDD.
Open Datasets	Yes	Datasets. Following previous works [Ninniri et al., 2024, Vignac et al., 2023a], we used QM9 [Ramakrishnan et al., 2014], a dataset of 133k molecules made by up to 9 non-hydrogen atoms, and ZINC-250k, a collection of 250k drug-like molecules selected from the ZINC dataset [Irwin and Shoichet, 2005].
Dataset Splits	Yes	QM9. On QM9, the training set is made of the first 100000 samples in the dataset. The test set is made of 10% of the overall data, and the remaining data is used to make the validation set. ZINC-250k. The training set uses the first 80% of the data. The remainder is equally split between the validation set and the test set.
Hardware Specification	Yes	All experiments have been performed on an NVIDIA A100 GPU with 80 GB of VRAM (two on ZINC-250k).
Software Dependencies	No	The paper mentions: "Our code is based on MiDi [Vignac et al., 2023b] which, in turn, is based on DiGress." and "instructions on how to set up a Conda environment to run it". However, it does not specify concrete version numbers for any software, libraries, or programming languages used in the experiments.
Experiment Setup	Yes	In all our experiments, we have T = 500 diffusion timesteps. The function ζ(t) is parameterized with w = 0.05 and D = T/2 = 250. Similarly to MiDi, we use ν = 1 for the node matrices noise schedulers and ν = 1.5 for the edge matrices. λ_X and λ_E are set, respectively, to 1 and 2. We set the hyperparameters p_min and p_max in Equation 2 respectively as 0.2 and 1. In QM9, we set the guidance scale to λ = 3, while in ZINC-250k we set λ = 2.
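The Dataset Splits entry can be checked with a little index arithmetic. The sketch below is illustrative only and is not taken from the GrIDDD repository: the function names are hypothetical, and it assumes contiguous splits in stored dataset order, rounding down for the 10% and 80% cuts, validation placed before test within the remainder, and any odd leftover sample going to the test set. None of these details is stated in the entry.

```python
def qm9_split(n_samples: int, n_train: int = 100_000):
    """Hypothetical helper: contiguous (train, val, test) index ranges for QM9.

    Assumptions (not stated in the entry): the test set is floor(10% of
    n_samples) and is placed after the validation set.
    """
    n_test = n_samples // 10               # 10% of the overall data, rounded down
    n_val = n_samples - n_train - n_test   # whatever remains after train and test
    if n_val < 0:
        raise ValueError("dataset too small for the requested training set")
    return (range(0, n_train),
            range(n_train, n_train + n_val),
            range(n_train + n_val, n_samples))


def zinc250k_split(n_samples: int):
    """Hypothetical helper: contiguous (train, val, test) index ranges for ZINC-250k.

    Assumptions: the 80% cut is rounded down, and when the remainder is odd
    the extra sample goes to the test set.
    """
    n_train = n_samples * 8 // 10          # first 80% of the data
    n_val = (n_samples - n_train) // 2     # remainder split equally
    return (range(0, n_train),
            range(n_train, n_train + n_val),
            range(n_train + n_val, n_samples))


if __name__ == "__main__":
    # Made-up dataset sizes, used only to illustrate the arithmetic.
    train, val, test = qm9_split(130_000)
    print(len(train), len(val), len(test))  # 100000 17000 13000
    train, val, test = zinc250k_split(1_001)
    print(len(train), len(val), len(test))  # 800 100 101
```

Returning `range` objects keeps the sketch independent of any data-loading library. Which molecules count as "first" still depends on how the dataset file is ordered, which this code does not model.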
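The values in the Experiment Setup entry can be collected into a single configuration object. This is a minimal sketch: the class and field names are hypothetical and do not reflect how the GrIDDD code base stores its settings. The functional form of ζ(t) and Equation 2 are not given here, so only the reported numbers are recorded. Reading λ_X and λ_E as per-component (node/edge) weights is an interpretation based on the X/E subscripts, not something the entry states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GrIDDDConfig:
    """Hypothetical container for the hyperparameters reported in the entry."""
    T: int = 500                 # number of diffusion timesteps
    zeta_w: float = 0.05         # w parameter of zeta(t)
    nu_nodes: float = 1.0        # nu for the node-matrix noise scheduler
    nu_edges: float = 1.5        # nu for the edge-matrix noise scheduler
    lambda_X: float = 1.0        # lambda_X (assumed node-component weight)
    lambda_E: float = 2.0        # lambda_E (assumed edge-component weight)
    p_min: float = 0.2           # p_min in Equation 2
    p_max: float = 1.0           # p_max in Equation 2
    guidance_scale: float = 3.0  # guidance scale lambda (dataset-dependent)

    @property
    def zeta_D(self) -> int:
        # D = T/2 as reported; integer division assumes T is even.
        return self.T // 2


# Only the guidance scale differs between the two benchmarks in the entry.
CONFIGS = {
    "qm9": GrIDDDConfig(guidance_scale=3.0),
    "zinc250k": GrIDDDConfig(guidance_scale=2.0),
}

if __name__ == "__main__":
    cfg = CONFIGS["zinc250k"]
    print(cfg.T, cfg.zeta_D, cfg.guidance_scale)  # 500 250 2.0
```

Deriving `zeta_D` from `T` rather than storing it separately keeps the stated relation D = T/2 from drifting if `T` is changed.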