Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark
Authors: Yihua Zhang, Pingzhi Li, Junyuan Hong, Jiaxiang Li, Yimeng Zhang, Wenqing Zheng, Pin-Yu Chen, Jason D. Lee, Wotao Yin, Mingyi Hong, Zhangyang Wang, Sijia Liu, Tianlong Chen
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct a detailed comparative analysis of different ZO optimization methods, shedding light on the often-overlooked forward gradient method (Ren et al., 2022) and other ZO optimization techniques in LLM fine-tuning. This benchmarking study helps reveal the pros and cons of these methods in accuracy and efficiency. |
| Researcher Affiliation | Collaboration | ¹Michigan State University, ²The University of North Carolina at Chapel Hill, ³UT Austin, ⁴University of Minnesota Twin Cities, ⁵IBM Research, ⁶Princeton University, ⁷DAMO Academy, Alibaba Group US, ⁸MIT, ⁹Harvard University. |
| Pseudocode | Yes | Algorithm A1: A General Pipeline for a FO/ZO Optimizer (a minimal sketch of such a pipeline follows the table). |
| Open Source Code | Yes | Codes to reproduce all our experiments are at https://github.com/ZO-Bench/ZO-LLM. |
| Open Datasets | Yes | We focus on three tasks, considering their complexity from low to high, which include (1) the simplest binary classification task, Stanford Sentiment Treebank v2 (SST2) (Socher et al., 2013), (2) the question-answering task, Choice Of Plausible Alternatives (COPA) (Roemmele et al., 2011), (3) the commonsense reasoning task, WinoGrande (Sakaguchi et al., 2021), and (4) the multi-sentence reading comprehension task, MultiRC (Khashabi et al., 2018) (for efficiency evaluation only). |
| Dataset Splits | No | The paper mentions datasets like SST2, COPA, WinoGrande, and MultiRC, and discusses "test accuracy," but does not explicitly provide details about specific train/validation/test dataset splits (e.g., percentages, counts, or explicit methodology for splitting) needed to reproduce the partitioning. |
| Hardware Specification | Yes | Table 4. The peak memory cost (in GB), the required GPU resources, and the runtime cost (in seconds) of each optimizer when fine-tuning the full OPT-13B model on MultiRC with an averaged 400 context length... 1 A100 ... 2 A100 ... 4 A100 |
| Software Dependencies | No | The paper mentions using the 'PyTorch framework' but does not specify exact version numbers for PyTorch or other software dependencies. |
| Experiment Setup | Yes | We run ZO (or BP-free) optimizers and FO optimizers for 20000 and 625 iterations respectively, as outlined in (2)... When implementing (RGE), unless specified otherwise, we set the query budget per gradient estimation to q = 1. We determine the values of other hyperparameters, such as the smoothing parameter and learning rate, through a grid search for each method. (An illustrative RGE sketch follows the table.) |
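
The "Pseudocode" row points to Algorithm A1, a shared pipeline in which first-order (FO) and zeroth-order (ZO) optimizers differ only in how the gradient is obtained before the same parameter update is applied. The following is a minimal PyTorch-style sketch of that idea, not the repository's implementation; the helper names `optimizer_step`, `loss_fn`, and `grad_estimator` are hypothetical.

```python
import torch

def optimizer_step(model, loss_fn, batch, lr, mode="fo", grad_estimator=None):
    """One update of a generic FO/ZO pipeline (illustrative sketch only)."""
    params = [p for p in model.parameters() if p.requires_grad]
    if mode == "fo":
        # First-order branch: gradients via backpropagation.
        loss = loss_fn(model, batch)
        grads = torch.autograd.grad(loss, params)
    else:
        # Zeroth-order / BP-free branch: gradients estimated from loss values only
        # (e.g., the RGE sketch below), so no backward pass or activation storage.
        grads = grad_estimator(model, loss_fn, batch, params)
    # Shared descent step (plain SGD here; other FO/ZO optimizers swap this rule).
    with torch.no_grad():
        for p, g in zip(params, grads):
            p -= lr * g
```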
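
The "Experiment Setup" row references the randomized gradient estimator (RGE) with a query budget q and a smoothing parameter. Below is a hedged sketch of a central-difference RGE, assuming a scalar-loss callable `loss_fn(model, batch)`; the name `rge_estimate`, the default `mu`, and the per-tensor perturbation scheme are illustrative choices, not the benchmark's exact code.

```python
import torch

def rge_estimate(model, loss_fn, batch, params, q=1, mu=1e-3):
    """Central-difference randomized gradient estimate over q random directions."""
    grads = [torch.zeros_like(p) for p in params]
    with torch.no_grad():
        for _ in range(q):
            # Sample one Gaussian perturbation direction per parameter tensor.
            us = [torch.randn_like(p) for p in params]
            # Evaluate loss at theta + mu * u.
            for p, u in zip(params, us):
                p += mu * u
            loss_plus = loss_fn(model, batch)
            # Evaluate loss at theta - mu * u.
            for p, u in zip(params, us):
                p -= 2 * mu * u
            loss_minus = loss_fn(model, batch)
            # Restore the original parameters.
            for p, u in zip(params, us):
                p += mu * u
            # Directional finite difference, averaged over the q queries.
            coeff = (loss_plus - loss_minus) / (2 * mu)
            for g, u in zip(grads, us):
                g += (coeff / q) * u
    return grads
```

With q = 1, as in the quoted setup, each update costs two forward passes and no backward pass, which is the source of the memory savings the benchmark studies.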