Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Private Hyperparameter Tuning with Ex-Post Guarantee

Authors: Badih Ghazi, Pritish Kamath, Alexander Knop, Ravi Kumar, Pasin Manurangsi, Chiyuan Zhang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	5 Experiments We present two sets of experiments: In the first, we evaluate the performance of our algorithm on analytical tasks and in the second, we focus on the performance on a machine learning problem. ... The detailed results can be seen in Table 1.
Researcher Affiliation	Industry	Badih Ghazi Google Research EMAIL Pritish Kamath Google Research EMAIL Alexander Knop Google Research EMAIL Ravi Kumar Google Research EMAIL Pasin Manurangsi Google Research EMAIL Chiyuan Zhang Google Research EMAIL
Pseudocode	Yes	Algorithm 1: Hyperparameter Tuning Mechanism with Random Dropping. Parameters: distribution E, mechanisms M_i : D → O and budget parameters ε_i for i ∈ [d]. Input: dataset D. S ← {⊥}; sample k ∼ E; for i = 1, ..., d do: sample y_i ∼ Ber(e^{−ε_i k}) {random drop}; if y_i = 1 then o_i ← M_i(D_i), S ← S ∪ {(o_i, i)}; return the maximum element in S {as per the total order on (O × [d]) ∪ {⊥}}.
Open Source Code	Yes	Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: The dataset used in the paper are standard and the paper has the code in the supplemental materials.
Open Datasets	Yes	Reddit: We use the webis/tldr-17 dataset [Völske et al., 2017]... train a linear regression model on a dataset of timeseries generated by Twitter usage [The AMA Team at Laboratoire d'Informatique de Grenoble]... train a classifier for the MNIST dataset [LeCun et al., 2010]... train a classifier for the Gisette [Guyon et al., 2004] dataset
Dataset Splits	No	The paper mentions using a test set in ML settings (Footnote 5, "If the test set is considered sensitive..."). The ML experiments describe training models on datasets (Twitter, MNIST, Gisette). While it mentions training, it doesn't explicitly state the exact splits (e.g., "80/10/10 split") for these datasets in the provided text. It mentions using "test set" and "training" but no explicit split percentages or counts.
Hardware Specification	Yes	Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: all the experiments are performed on a personal laptop within 10 minutes each.
Software Dependencies	No	In both cases we use Opacus [Yousefpour et al., 2021] for training DP-SGD... The example of MNIST written using PyTorch. The paper mentions software tools like Opacus and PyTorch but does not provide specific version numbers for these dependencies, which is required for a reproducible description of ancillary software.
Experiment Setup	Yes	(b) Algorithm 1 with the DP-SGD [Abadi et al., 2016] mechanism, learning linear models with ε = 0.01, possible values of ε in {0.1, 0.2, ..., 1}, learning rate in {0.01, 0.1, 1}, epochs in {1, 5, 10}, batch sizes in {32, 64, 128, 256, 512, 1000}, and clipping norms in {0.1, 1, 10}.
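
The Pseudocode and Experiment Setup rows can be read together as a rough illustration: the quoted grid defines the candidate set, and Algorithm 1 keeps each candidate only if a random-drop coin comes up heads. The Python sketch below is not the authors' code. It assumes that the drop coin is Bernoulli with probability exp(-ε_i · k), which is one reading of the extracted pseudocode, that E is an exponential distribution (a placeholder choice), and that each mechanism returns a single comparable score. The names `enumerate_candidates`, `random_dropping_select`, and `toy_mechanism` are hypothetical, and no DP-SGD training is performed.

```python
import itertools
import math
import random

# Grid quoted in the Experiment Setup row (setting (b)).
GRID = {
    "epsilon": [round(0.1 * j, 1) for j in range(1, 11)],  # {0.1, 0.2, ..., 1}
    "learning_rate": [0.01, 0.1, 1],
    "epochs": [1, 5, 10],
    "batch_size": [32, 64, 128, 256, 512, 1000],
    "clipping_norm": [0.1, 1, 10],
}


def enumerate_candidates(grid):
    """Return one dict per hyperparameter combination in the grid."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in itertools.product(*grid.values())]


def random_dropping_select(candidates, run_mechanism, sample_k, rng):
    """Sketch of the random-dropping selection step.

    ASSUMPTION: candidate i is kept with probability exp(-eps_i * k); this
    is a reading of the extracted pseudocode and should be checked against
    the paper. Returns (score, index) of the best surviving candidate, or
    None if every candidate was dropped.
    """
    k = sample_k(rng)  # k ~ E
    survivors = []
    for i, cand in enumerate(candidates):
        keep_prob = math.exp(-cand["epsilon"] * k)
        if rng.random() < keep_prob:  # y_i ~ Ber(keep_prob)
            survivors.append((run_mechanism(cand, rng), i))
    # Tuples compare lexicographically: by score first, then by index.
    return max(survivors) if survivors else None


def toy_mechanism(cand, rng):
    """HYPOTHETICAL stand-in for a private training run: a noisy score."""
    return -abs(math.log10(cand["learning_rate"]) + 1) + rng.gauss(0, 0.1)


if __name__ == "__main__":
    rng = random.Random(0)
    candidates = enumerate_candidates(GRID)
    print("number of candidates:", len(candidates))
    # ASSUMPTION: E is Exponential(1); the paper's choice of E may differ.
    result = random_dropping_select(
        candidates, toy_mechanism, lambda r: r.expovariate(1.0), rng
    )
    print("selected (score, index):", result)
```

Under the assumed keep probability, candidates with a larger ε are dropped more often for the same draw of k, and a very large k can drop every candidate, which is why the sketch returns `None` in that case.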