DeepZero: Scaling Up Zeroth-Order Optimization for Deep Model Training
Authors: Aochuan Chen, Yimeng Zhang, Jinghan Jia, James Diffenderfer, Konstantinos Parasyris, Jiancheng Liu, Yihua Zhang, Zheng Zhang, Bhavya Kailkhura, Sijia Liu
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experiments show that DeepZero achieves state-of-the-art (SOTA) accuracy on ResNet-20 trained on CIFAR-10, approaching FO training performance for the first time. |
| Researcher Affiliation | Collaboration | Michigan State University, Lawrence Livermore National Laboratory, UC Santa Barbara |
| Pseudocode | Yes | Algorithm 1 ZO-GraSP-oriented-LPR-guided ZO training |
| Open Source Code | Yes | Codes are available at https://github.com/OPTML-Group/DeepZero. |
| Open Datasets | Yes | training a ResNet-20 on CIFAR-10 |
| Dataset Splits | No | No. The paper mentions training on CIFAR-10 for image classification and evaluating test accuracy, but it does not explicitly specify the training/validation/test splits or reference how those splits were defined for reproduction. |
| Hardware Specification | Yes | Experiments are run on 4 NVIDIA V100 GPUs if not specified otherwise. |
| Software Dependencies | No | No. The paper mentions various optimizers (SGD, Adam) and a simulation code (PhiFlow) but does not provide version numbers for these software components or for other libraries such as PyTorch or TensorFlow, which are essential for reproducibility. |
| Experiment Setup | Yes | We adopt SGD (stochastic gradient descent) as the FO training recipe, with a weight decay of 5 * 10^-4 and a momentum of 0.9. The learning rate is 0.1, governed by a cosine decay scheduler. In the ZO training scenario, we replace the FO gradient by (Sparse-CGE) with a smoothing parameter µ = 5 * 10^-3. When implementing ZO-GraSP (3), we set the query budget q = 192 and use the same µ as CGE. Unless specified otherwise, the weight sparsity ratio is chosen to be 90% and the specific sparsity patterns are determined by SR (Smart Ratio). When implementing DeepZero (Algorithm 2), we choose the number of epochs T = 50. (A hedged code sketch of this recipe follows the table.) |
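
To make the quoted setup concrete, below is a minimal, hedged PyTorch sketch. The optimizer and scheduler configuration follows the numbers stated above; the model is a stand-in for the ResNet-20 used in the paper, and `sparse_cge` is an illustrative forward-difference coordinate-wise gradient estimator restricted to unpruned coordinates. Its name, signature, and sequential loop are assumptions for exposition, not the paper's implementation, which parallelizes and batches these queries.

```python
import torch
import torch.nn as nn

# Stand-in model; the paper trains a ResNet-20 on CIFAR-10.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))

# FO recipe quoted above: SGD with weight decay 5e-4, momentum 0.9,
# learning rate 0.1 under a cosine decay scheduler, T = 50 epochs.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)


def sparse_cge(loss_fn, params, coord_idx, mu=5e-3):
    """Forward-difference coordinate-wise gradient estimate (CGE) over a sparse
    coordinate set (illustrative sketch; names and structure are assumptions).

    loss_fn   -- callable mapping a flat parameter tensor to a scalar loss
    params    -- 1-D tensor of flattened model weights
    coord_idx -- indices kept by the sparsity pattern (e.g., from ZO-GraSP / SR)
    mu        -- smoothing parameter; the paper uses mu = 5e-3
    """
    grad = torch.zeros_like(params)
    base = loss_fn(params)
    for i in coord_idx:
        perturbed = params.clone()
        perturbed[i] = perturbed[i] + mu
        # (loss(theta + mu * e_i) - loss(theta)) / mu along coordinate i
        grad[i] = (loss_fn(perturbed) - base) / mu
    return grad
```

In the ZO scenario described above, such an estimate would stand in for the FO gradient in the SGD step; at the quoted 90% weight sparsity, roughly 10% of the coordinates are queried per iteration.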