DeepZero: Scaling Up Zeroth-Order Optimization for Deep Model Training

Authors: Aochuan Chen, Yimeng Zhang, Jinghan Jia, James Diffenderfer, Konstantinos Parasyris, Jiancheng Liu, Yihua Zhang, Zheng Zhang, Bhavya Kailkhura, Sijia Liu

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our extensive experiments show that DeepZero achieves state-of-the-art (SOTA) accuracy on ResNet-20 trained on CIFAR-10, approaching FO training performance for the first time.
Researcher Affiliation | Collaboration | Michigan State University, Lawrence Livermore National Laboratory, UC Santa Barbara (equal contributions noted)
Pseudocode | Yes | Algorithm 1: ZO-GraSP-oriented LPR-guided ZO training
Open Source Code | Yes | Codes are available at https://github.com/OPTML-Group/DeepZero.
Open Datasets | Yes | training a ResNet-20 on CIFAR-10
Dataset Splits | No | No. The paper mentions training on CIFAR-10 for image classification and evaluating test accuracy, but it does not explicitly provide the specific training, validation, and test splits, or a reference to how these splits were defined for reproduction. (A sketch of the standard CIFAR-10 split follows the table.)
Hardware Specification | Yes | Experiments are run on 4 NVIDIA V100 GPUs if not specified otherwise.
Software Dependencies | No | No. The paper mentions various optimizers (SGD, Adam) and a simulation code (PhiFlow) but does not provide specific version numbers for these software components or for other libraries such as PyTorch or TensorFlow, which are essential for reproducibility.
Experiment Setup | Yes | We adopt SGD (stochastic gradient descent) as the FO training recipe, with a weight decay of 5 * 10^-4 and a momentum of 0.9. The learning rate is 0.1, governed by a cosine decay scheduler. In the ZO training scenario, we replace the FO gradient by Sparse-CGE with a smoothing parameter µ = 5 * 10^-3. When implementing ZO-GraSP (3), we set the query budget q = 192 and use the same µ as CGE. Unless specified otherwise, the weight sparsity ratio is chosen to be 90% and the specific sparsity patterns are determined by SR (Smart Ratio). When implementing DeepZero (Algorithm 2), we choose the number of epochs T = 50. (Illustrative FO-recipe and Sparse-CGE sketches follow the table.)
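
The Dataset Splits row notes that the paper does not state its splits. The sketch below loads CIFAR-10 with the standard torchvision partition (50,000 training / 10,000 test images); the augmentations, normalization statistics, and batch sizes are common defaults assumed here, not values reported in the paper, and no validation split is carved out because none is described.

```python
# Standard CIFAR-10 train/test partition via torchvision (assumed, not from the paper).
import torch
from torchvision import datasets, transforms

normalize = transforms.Normalize((0.4914, 0.4822, 0.4465),   # standard CIFAR-10 channel means
                                 (0.2470, 0.2435, 0.2616))   # and standard deviations (assumed)
train_tf = transforms.Compose([
    transforms.RandomCrop(32, padding=4),       # common CIFAR-10 augmentation (assumed)
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    normalize,
])
test_tf = transforms.Compose([transforms.ToTensor(), normalize])

# 50,000 training images and 10,000 test images; no validation split is used here.
train_set = datasets.CIFAR10(root="./data", train=True, download=True, transform=train_tf)
test_set = datasets.CIFAR10(root="./data", train=False, download=True, transform=test_tf)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True, num_workers=4)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=256, shuffle=False, num_workers=4)
```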
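
The Experiment Setup row specifies the FO baseline recipe: SGD with learning rate 0.1, momentum 0.9, weight decay 5 * 10^-4, a cosine decay schedule, and T = 50 epochs. A minimal PyTorch sketch of that configuration is below; `resnet20()` is a hypothetical placeholder for any CIFAR-10 ResNet-20 implementation, and `train_loader` is taken from the previous sketch.

```python
# FO training recipe from the Experiment Setup row: SGD (lr 0.1, momentum 0.9,
# weight decay 5e-4) with cosine decay over T = 50 epochs.
import torch
import torch.nn as nn

model = resnet20().cuda()                      # hypothetical ResNet-20 constructor
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

for epoch in range(50):                        # T = 50 epochs
    model.train()
    for images, labels in train_loader:        # train_loader from the previous sketch
        images, labels = images.cuda(), labels.cuda()
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()                        # first-order (FO) gradient
        optimizer.step()
    scheduler.step()
```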
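
For the ZO side of the same row, the sketch below illustrates the idea behind Sparse-CGE: a forward-difference, coordinate-wise gradient estimate restricted to the coordinates kept by a sparsity mask, using the stated smoothing parameter µ = 5 * 10^-3 and a 90% sparsity ratio. It is a simplified, loop-based NumPy illustration on a toy loss, not the paper's implementation (which derives layer-wise pruning ratios from ZO-GraSP and parallelizes the forward passes); all names here are assumptions.

```python
# Simplified sparse coordinate-wise gradient estimator (Sparse-CGE) on a toy problem.
import numpy as np

def sparse_cge(loss_fn, theta, mask, mu=5e-3):
    """Forward-difference CGE over the nonzero coordinates of `mask`:
    grad[i] ≈ (f(theta + mu * e_i) - f(theta)) / mu, and 0 elsewhere."""
    grad = np.zeros_like(theta)
    base = loss_fn(theta)                      # one query for the unperturbed loss
    for i in np.flatnonzero(mask):             # only ~10% of coordinates at 90% sparsity
        e_i = np.zeros_like(theta)
        e_i[i] = mu
        grad[i] = (loss_fn(theta + e_i) - base) / mu   # one extra query per kept coordinate
    return grad

# Toy usage: quadratic loss and a random 90%-sparse mask (the paper instead derives
# layer-wise pruning ratios from ZO-GraSP via Smart Ratio).
rng = np.random.default_rng(0)
theta = rng.standard_normal(100)
mask = rng.random(100) > 0.9                   # keep roughly 10% of coordinates
loss_fn = lambda w: 0.5 * np.sum(w ** 2)
theta = theta - 0.1 * sparse_cge(loss_fn, theta, mask)   # one ZO gradient-descent step
```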