Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Instance-Optimality for Private KL Distribution Estimation

Authors: Jiayuan Ye, Vitaly Feldman, Kunal Talwar

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We validate the performance of our instance-optimal estimators via experiments (Section 4), and show the reward from studying instance-optimality: while the Add-constant (DP) algorithm is already minimax-optimal, our instance-optimal algorithms achieve significantly better performance on many instances of practical interest, such as power-law distributions and real-world token distributions. All experiments are performed on a Mac OS integrated CPU ( 30 minutes) with 18GB RAM. All reported performance numbers are averaged across five random trials of data sampling and estimators.
Researcher Affiliation	Collaboration	Jiayuan Ye (National University of Singapore); Vitaly Feldman (Apple); Kunal Talwar (Apple)
Pseudocode	Yes	Algorithm 1: Non-DP Sampling Twice; Algorithm 2: ε-DP Sampling Twice (Instance-Optimal); Algorithm 3: (ε, δ)-DP Sampling Twice (Instance-Optimal)
Open Source Code	No	Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: The paper does not provide access to the code, but only uses open-source datasets in Section 4, and provides algorithm pseudocode in Algorithm 1 and 2.
Open Datasets	Yes	For real-world data, we experiment on randomly drawn tokens from Reddit [60], Enron-email [41] and MMLU [36, 35], where each token is one (sensitive) record. We chose Reddit and Enron-email datasets because they are user-specific and thus bear a natural notion of privacy risk (compared to, e.g., Wikipedia), and are standard and widely used text datasets in the private learning literature [58, 43, 45]. We additionally evaluate on MMLU to simulate diverse text domains.
Dataset Splits	Yes	One challenge in evaluating on real-world datasets is the unknown ground-truth distribution p (as only empirical samples are given). To address this, we independently sample two equal-size datasets x and x′, thus ensuring E[x′_i / ‖x′‖_1] = p_i for any i ∈ [d].
Hardware Specification	Yes	All experiments are performed on a Mac OS integrated CPU ( 30 minutes) with 18GB RAM.
Software Dependencies	No	The paper does not explicitly mention specific software dependencies with version numbers for the experiments. It mentions using tokenizers such as GPT-2 and BERT, but not the software environment or library versions.
Experiment Setup	Yes	To address this limitation, we perform grid search for the optimal hyperparameters over α ∈ {0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.99} and τ ∈ {0, 0.0625, 0.125, 0.25, 0.5, 1, 2, 4} · ln d, and use the tuned hyperparameters τ = 0, α = 0.5 for Algorithm 1, and τ = min(1/ε, 1.0) · ln d, α = 0.9 for Algorithm 2 in all experiments.
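
Because no code is released, the evaluation protocol summarized in the Dataset Splits and Experiment Setup rows can only be sketched. The Python below is a rough illustration, not the authors' implementation: all function names are hypothetical, tokens are assumed to be integer IDs in range(d), sampling with replacement is an assumption, and a plain non-private add-constant estimator stands in for Algorithms 1 to 3, whose use of α and τ is not described here. Only the two grids and the five-trial averaging come from the rows above.

```python
import math
import random
from collections import Counter
from itertools import product

# Grids quoted in the Experiment Setup row; tau values are multiples of ln d.
ALPHA_GRID = [0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.99]
TAU_MULTIPLIERS = [0, 0.0625, 0.125, 0.25, 0.5, 1, 2, 4]


def two_sample_histograms(tokens, n, d, rng):
    """Draw two independent size-n samples and return their count vectors.

    Assumption: sampling is with replacement from the token pool.
    """
    x = Counter(rng.choices(tokens, k=n))
    x_ref = Counter(rng.choices(tokens, k=n))
    return [x[i] for i in range(d)], [x_ref[i] for i in range(d)]


def add_constant_estimate(counts, c=1.0):
    """Non-private add-constant smoothing (stand-in estimator, an assumption)."""
    total = sum(counts) + c * len(counts)
    return [(k + c) / total for k in counts]


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i = 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def averaged_loss(tokens, n, d, trials=5, seed=0):
    """Average KL between the held-out empirical distribution and the estimate."""
    rng = random.Random(seed)
    losses = []
    for _ in range(trials):
        x, x_ref = two_sample_histograms(tokens, n, d, rng)
        p_ref = [k / n for k in x_ref]  # unbiased proxy for the unknown p
        losses.append(kl_divergence(p_ref, add_constant_estimate(x)))
    return sum(losses) / trials


def grid_search(loss_fn, d):
    """Return the (alpha, tau) pair minimizing a caller-supplied loss_fn."""
    taus = [m * math.log(d) for m in TAU_MULTIPLIERS]
    return min(product(ALPHA_GRID, taus), key=lambda pair: loss_fn(*pair))
```

In this sketch, grid_search takes the loss as a callable so that any estimator parameterized by (α, τ) can be plugged in; the second sample serves only as a reference distribution and is never shown to the estimator.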