DeepZero: Scaling Up Zeroth-Order Optimization for Deep Model Training

Authors: Aochuan Chen, Yimeng Zhang, Jinghan Jia, James Diffenderfer, Konstantinos Parasyris, Jiancheng Liu, Yihua Zhang, Zheng Zhang, Bhavya Kailkhura, Sijia Liu

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our extensive experiments show that DeepZero achieves state-of-the-art (SOTA) accuracy on ResNet-20 trained on CIFAR-10, approaching FO training performance for the first time.
Researcher Affiliation | Collaboration | Michigan State University, Lawrence Livermore National Laboratory, UC Santa Barbara (equal contributions noted)
Pseudocode | Yes | Algorithm 1: ZO-GraSP-oriented LPR-guided ZO training
Open Source Code | Yes | Codes are available at https://github.com/OPTML-Group/DeepZero.
Open Datasets | Yes | training a ResNet-20 on CIFAR-10
Dataset Splits | No | No. The paper mentions training on CIFAR-10 for image classification and evaluating test accuracy, but it does not explicitly provide the specific training, validation, and test splits, or a reference to how these splits were defined for reproduction. (A sketch of the standard CIFAR-10 split follows the table.)
Hardware Specification | Yes | Experiments are run on 4 NVIDIA V100 GPUs if not specified otherwise.
Software Dependencies | No | No. The paper mentions various optimizers (SGD, Adam) and a simulation code (PhiFlow) but does not provide specific version numbers for these software components or for other libraries such as PyTorch or TensorFlow, which are essential for reproducibility.
Experiment Setup | Yes | We adopt SGD (stochastic gradient descent) as the FO training recipe, with a weight decay of 5 * 10^-4 and a momentum of 0.9. The learning rate is 0.1, governed by a cosine decay scheduler. In the ZO training scenario, we replace the FO gradient by Sparse-CGE with a smoothing parameter µ = 5 * 10^-3. When implementing ZO-GraSP (3), we set the query budget q = 192 and use the same µ as CGE. Unless specified otherwise, the weight sparsity ratio is chosen to be 90% and the specific sparsity patterns are determined by SR (Smart Ratio). When implementing DeepZero (Algorithm 2), we choose the number of epochs T = 50. (Illustrative FO-recipe and Sparse-CGE sketches follow the table.)
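
The Dataset Splits row notes that the paper does not state its splits. The sketch below loads CIFAR-10 with the standard torchvision partition (50,000 training / 10,000 test images); the augmentations, normalization statistics, and batch sizes are common defaults assumed here, not values reported in the paper, and no validation split is carved out because none is described.

```python
# Standard CIFAR-10 train/test partition via torchvision (assumed, not from the paper).
import torch
from torchvision import datasets, transforms

normalize = transforms.Normalize((0.4914, 0.4822, 0.4465),   # standard CIFAR-10 channel means
                                 (0.2470, 0.2435, 0.2616))   # and standard deviations (assumed)
train_tf = transforms.Compose([
    transforms.RandomCrop(32, padding=4),       # common CIFAR-10 augmentation (assumed)
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    normalize,
])
test_tf = transforms.Compose([transforms.ToTensor(), normalize])

# 50,000 training images and 10,000 test images; no validation split is used here.
train_set = datasets.CIFAR10(root="./data", train=True, download=True, transform=train_tf)
test_set = datasets.CIFAR10(root="./data", train=False, download=True, transform=test_tf)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True, num_workers=4)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=256, shuffle=False, num_workers=4)
```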
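
The Experiment Setup row specifies the FO baseline recipe: SGD with learning rate 0.1, momentum 0.9, weight decay 5 * 10^-4, a cosine decay schedule, and T = 50 epochs. A minimal PyTorch sketch of that configuration is below; `resnet20()` is a hypothetical placeholder for any CIFAR-10 ResNet-20 implementation, and `train_loader` is taken from the previous sketch.

```python
# FO training recipe from the Experiment Setup row: SGD (lr 0.1, momentum 0.9,
# weight decay 5e-4) with cosine decay over T = 50 epochs.
import torch
import torch.nn as nn

model = resnet20().cuda()                      # hypothetical ResNet-20 constructor
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

for epoch in range(50):                        # T = 50 epochs
    model.train()
    for images, labels in train_loader:        # train_loader from the previous sketch
        images, labels = images.cuda(), labels.cuda()
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()                        # first-order (FO) gradient
        optimizer.step()
    scheduler.step()
```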
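
For the ZO side of the same row, the sketch below illustrates the idea behind Sparse-CGE: a forward-difference, coordinate-wise gradient estimate restricted to the coordinates kept by a sparsity mask, using the stated smoothing parameter µ = 5 * 10^-3 and a 90% sparsity ratio. It is a simplified, loop-based NumPy illustration on a toy loss, not the paper's implementation (which derives layer-wise pruning ratios from ZO-GraSP and parallelizes the forward passes); all names here are assumptions.

```python
# Simplified sparse coordinate-wise gradient estimator (Sparse-CGE) on a toy problem.
import numpy as np

def sparse_cge(loss_fn, theta, mask, mu=5e-3):
    """Forward-difference CGE over the nonzero coordinates of `mask`:
    grad[i] ≈ (f(theta + mu * e_i) - f(theta)) / mu, and 0 elsewhere."""
    grad = np.zeros_like(theta)
    base = loss_fn(theta)                      # one query for the unperturbed loss
    for i in np.flatnonzero(mask):             # only ~10% of coordinates at 90% sparsity
        e_i = np.zeros_like(theta)
        e_i[i] = mu
        grad[i] = (loss_fn(theta + e_i) - base) / mu   # one extra query per kept coordinate
    return grad

# Toy usage: quadratic loss and a random 90%-sparse mask (the paper instead derives
# layer-wise pruning ratios from ZO-GraSP via Smart Ratio).
rng = np.random.default_rng(0)
theta = rng.standard_normal(100)
mask = rng.random(100) > 0.9                   # keep roughly 10% of coordinates
loss_fn = lambda w: 0.5 * np.sum(w ** 2)
theta = theta - 0.1 * sparse_cge(loss_fn, theta, mask)   # one ZO gradient-descent step
```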