Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

PseuZO: Pseudo-Zeroth-Order Algorithm for Training Deep Neural Networks

Authors: Pengyun Yue, Xuanlin Yang, Mingqing Xiao, Zhouchen Lin

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experimental results demonstrate that PseuZO outperforms MeZO and MeZO-SVRG in classification, multiple choice and generation tasks in both full-parameter and PEFT fine-tuning settings by boosting convergence in the early stages of training. For instance, under the same computation time, with respect to SST2 task, PseuZO gets 9.8% higher accuracy than MeZO (91.2% v.s. 82.4%).
Researcher Affiliation	Collaboration	1 State Key Lab of General AI, School of Intelligence Science and Technology, Peking University; 2 Institute for Artificial Intelligence, Peking University; 3 Pazhou Laboratory (Huangpu), Guangzhou, Guangdong, China; 4 Microsoft Research Asia; 5 Zhongguancun Academy
Pseudocode	Yes	Algorithm 1: Matrix-based PseuZO Algorithm; Algorithm 2: Sliding Window-based PseuZO Algorithm
Open Source Code	Yes	The code is available at https://github.com/YangBigMn/PseuZO.
Open Datasets	Yes	We conduct comprehensive experiments in various tasks on large auto-regressive language models like opt-1.3B [58] and the same prompt design as MeZO is utilized which is effective and fair for comparison for various datasets including GLUE [52] and SuperGLUE [51] benchmarks. Table 7: Training from scratch on typical computer vision classification datasets for various feedback methods. We do not use local loss for MNIST as there are only two hidden layers.
Dataset Splits	Yes	We choose K = 16 as the batch size and randomly select 1024 samples for training and 512 samples for evaluation. All experiments are run on a single Nvidia A800 40GiB GPU. When training, for WSC, CB and COPA, they have much less total samples and thus we set aside 100 evaluation samples and use the rest for training.
Hardware Specification	Yes	All experiments are run on a single Nvidia A800 40GiB GPU.
Software Dependencies	No	The paper does not explicitly mention specific software dependencies with version numbers in the provided text.
Experiment Setup	Yes	Setup. We implement PseuZO, MeZO-SVRG and HiZOO-L in the MeZO framework with appropriate adjustment for fair comparison. We conduct comprehensive experiments in various tasks on large auto-regressive language models like opt-1.3B [58] and the same prompt design as MeZO is utilized which is effective and fair for comparison for various datasets including GLUE [52] and SuperGLUE [51] benchmarks. We run all experiments for 10K steps and evaluate performance of the model every 2K steps for HiZOO-L and MeZO-SVRG. In order to ensure that MeZO and PseuZO are sufficiently convergent, we run PseuZO and MeZO for 10K and 20K steps, respectively. We choose K = 16 as the batch size and randomly select 1024 samples for training and 512 samples for evaluation. All experiments are run on a single Nvidia A800 40GiB GPU.
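
The sampling scheme quoted in the Dataset Splits and Experiment Setup rows (1024 training and 512 evaluation samples, or 100 held-out evaluation samples for small datasets such as WSC, CB and COPA) could be sketched roughly as follows. This is only an illustration and is not taken from the authors' code: the function name `split_indices`, the seed, the single shared pool of examples, and the rule for deciding when a dataset counts as "small" are all assumptions.

```python
import random


def split_indices(n_total, n_train=1024, n_eval=512, small_eval=100, seed=0):
    """Return (train_idx, eval_idx) as disjoint lists of example indices.

    Hypothetical helper: the paper states only the sample counts, not how
    the samples are drawn or which seed is used.
    """
    rng = random.Random(seed)      # assumed fixed seed for repeatability
    idx = list(range(n_total))
    rng.shuffle(idx)
    if n_total < n_train + n_eval:
        # Assumed rule for small datasets (e.g. WSC, CB, COPA):
        # hold out `small_eval` examples and train on the rest.
        return idx[small_eval:], idx[:small_eval]
    # Otherwise take disjoint slices of the shuffled pool.
    return idx[:n_train], idx[n_train:n_train + n_eval]


if __name__ == "__main__":
    train_idx, eval_idx = split_indices(5000)
    print(len(train_idx), len(eval_idx))
```

The batch size K = 16 from the setup would then apply when iterating over `train_idx`; the paper may instead draw training and evaluation samples from separate official splits, in which case the same shuffling idea would be applied to each split independently.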