Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Private Zeroth-Order Optimization with Public Data

Authors: Xuchen Gong, Tian Li

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirically, we demonstrate that PAZO achieves superior privacy/utility tradeoffs across vision and text tasks in both pre-training and fine-tuning settings, outperforming the best first-order baselines (with public data) especially in highly private regimes, while offering up to 16× runtime speedup. ... In this section, we present the empirical performance of PAZO-{M,P,S} across both vision and language domains, and pre-training, fine-tuning, and prompt tuning tasks.
Researcher Affiliation	Academia	Xuchen Gong Tian Li University of Chicago EMAIL
Pseudocode	Yes	Algorithm 1: PAZO-M; Algorithm 2: PAZO-P; Algorithm 3: PAZO-S
Open Source Code	Yes	Our code is publicly available at github.com/xuchengong/pazo.
Open Datasets	Yes	The settings of our experiments cover and follow the experiments in the existing DP literature, including (1) Training NFResNet18 on CIFAR-10 [35] from scratch, (2) fine-tuning Places365 pre-trained ViT-S on Tiny-ImageNet [36], (3) training LSTM on IMDB [37] from scratch, and (4) fine-tuning RoBERTa-base with prompts on MNLI [38]. ... We randomly sample 100 training examples per class from SNLI [48] as the OOD public data.
Dataset Splits	Yes	For CIFAR-10, we use non-overlapped training samples with small class imbalance as ID public data and those with big class imbalance as OOD public data. ... For MNLI, we use non-overlapped MNLI training samples as ID public data and SNLI training samples as OOD public data. ... We follow previous work [41] that uses 4% of the training samples as public data... We randomly sample 4% of the Tiny-ImageNet training samples as public data... We build the vocabulary based on the top 10K tokens in the IMDB training set and construct the Amazon Polarity public dataset with a size 4% of the IMDB training size, which gives us 2,000 public samples. ... We follow the few-shot setting in the past work [6, 8] and sample 512 MNLI training examples per class.
Hardware Specification	Yes	Each experiment is conducted on one 48GB L40S GPU.
Software Dependencies	No	The paper mentions software components like 'vmap' and 'optimized implementations' (for DP algorithms), and refers to using 'the codebase from Malladi et al. [6]' for MNLI experiments, but does not provide specific version numbers for key software dependencies like Python, PyTorch, or CUDA.
Experiment Setup	Yes	The settings of our experiments cover and follow the experiments in the existing DP literature, including (1) Training NFResNet18 on CIFAR-10 [35] from scratch, (2) fine-tuning Places365 pre-trained ViT-S on Tiny-ImageNet [36], (3) training LSTM on IMDB [37] from scratch, and (4) fine-tuning RoBERTa-base with prompts on MNLI [38]. ... We set the number of epochs to 100. ... We thus train for 200 epochs in all DPZero experiments. The values of the smoothing parameter λ are presented in Table 10. We also report the hyperparameter search grid for each method in Tables 12 and 13, where the batch size b is only tuned for non-private methods (SGD and MeZO); We fix the private batch size to 64 for all private methods, including zeroth-order and first-order, with and without public data.
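
The Dataset Splits row quotes a recurring pattern: a small random fraction (4%) of the training samples is set aside as public data and does not overlap with the private samples. The sketch below illustrates one way such a split could be drawn. It is not taken from the authors' code; the function name `split_public_private`, the seed handling, and the example size of 1,000 are assumptions made purely for illustration.

```python
import random


def split_public_private(num_examples, public_fraction=0.04, seed=0):
    """Return (public_idx, private_idx), two disjoint index lists.

    Hypothetical helper: the paper's actual sampling code may differ.
    """
    rng = random.Random(seed)          # local RNG so the split is repeatable
    indices = list(range(num_examples))
    rng.shuffle(indices)               # random order, then cut at the fraction
    num_public = round(num_examples * public_fraction)
    public_idx = sorted(indices[:num_public])
    private_idx = sorted(indices[num_public:])
    return public_idx, private_idx


# Illustrative size only; not a dataset size reported on this page.
public_idx, private_idx = split_public_private(1000)
print(len(public_idx), len(private_idx))
```

Fixing the seed keeps the public subset identical across runs, which matters when comparing methods that are given the same public data.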