Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark

Authors: Yihua Zhang, Pingzhi Li, Junyuan Hong, Jiaxiang Li, Yimeng Zhang, Wenqing Zheng, Pin-Yu Chen, Jason D. Lee, Wotao Yin, Mingyi Hong, Zhangyang Wang, Sijia Liu, Tianlong Chen

ICML 2024

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct a detailed comparative analysis of different ZO optimization methods, shedding light on the often-overlooked forward gradient method (Ren et al., 2022) and other ZO optimization techniques in LLM fine-tuning. This benchmarking study helps reveal the pros and cons of these methods in accuracy and efficiency. (See the forward-gradient sketch below the table.) |
| Researcher Affiliation | Collaboration | 1 Michigan State University, 2 The University of North Carolina at Chapel Hill, 3 UT Austin, 4 University of Minnesota Twin Cities, 5 IBM Research, 6 Princeton University, 7 DAMO Academy, Alibaba Group US, 8 MIT, 9 Harvard University |
| Pseudocode | Yes | Algorithm A1: A General Pipeline for A FO/ZO Optimizer |
| Open Source Code | Yes | Codes to reproduce all our experiments are at https://github.com/ZO-Bench/ZO-LLM. |
| Open Datasets | Yes | We focus on three tasks, considering their complexity from low to high, which include (1) the simplest binary classification task, Stanford Sentiment Treebank v2 (SST2) (Socher et al., 2013), (2) the question-answering task, Choice Of Plausible Alternatives (COPA) (Roemmele et al., 2011), (3) the commonsense reasoning task, WinoGrande (Sakaguchi et al., 2021), and (4) the multi-sentence reading comprehension (MultiRC) (Khashabi et al., 2018) (for efficiency evaluation only). |
| Dataset Splits | No | The paper mentions datasets like SST2, COPA, WinoGrande, and MultiRC, and discusses "test accuracy," but does not explicitly provide details about specific train/validation/test dataset splits (e.g., percentages, counts, or explicit methodology for splitting) needed to reproduce the partitioning. |
| Hardware Specification | Yes | Table 4. The peak memory cost (in GB), the required GPU resources, and the runtime cost (in seconds) of each optimizer when fine-tuning the full OPT-13B model on MultiRC with an averaged 400 context length... 1 A100 ... 2 A100 ... 4 A100 |
| Software Dependencies | No | The paper mentions using the 'PyTorch framework' but does not specify exact version numbers for PyTorch or other software dependencies. |
| Experiment Setup | Yes | We run ZO (or BP-free) optimizers and FO optimizers for 20000 and 625 iterations respectively, as outlined in (2)... When implementing (RGE), unless specified otherwise, we set the query budget per gradient estimation to q = 1. We determine the values of other hyperparameters, such as the smoothing parameter and learning rate, through a grid search for each method. (See the RGE/ZO-SGD sketch below the table.) |
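
The "Research Type" row highlights the forward gradient method (Ren et al., 2022), which estimates gradients with forward-mode automatic differentiation only, so no backward pass or activation storage is needed. Below is a minimal PyTorch sketch of that idea, assuming a scalar loss closure over a flat parameter tensor; `forward_gradient`, its arguments, and the toy loss are illustrative names and not the benchmark repository's API.

```python
import torch
from torch.func import jvp

def forward_gradient(loss_fn, params, num_probes=1):
    """Forward-gradient estimate: sample a random direction v, compute the
    Jacobian-vector product (the directional derivative of the loss along v)
    via forward-mode autodiff, and accumulate (d_v loss) * v over the probes."""
    grad_est = torch.zeros_like(params)
    for _ in range(num_probes):
        v = torch.randn_like(params)                  # random tangent direction
        _, dir_deriv = jvp(loss_fn, (params,), (v,))  # scalar directional derivative
        grad_est += dir_deriv * v
    return grad_est / num_probes

# Hypothetical usage with a toy quadratic standing in for an LLM loss closure:
theta = torch.randn(10)
g = forward_gradient(lambda p: (p ** 2).sum(), theta, num_probes=4)
```
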
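The "Experiment Setup" row refers to the randomized gradient estimator (RGE) with a per-estimate query budget of q = 1 and a smoothing parameter tuned by grid search. The sketch below shows a two-point RGE and the resulting ZO-SGD step, assuming a flattened parameter vector and a forward-only loss closure; `rge_gradient`, `zo_sgd_step`, and the default hyperparameter values are assumptions for illustration, not the paper's exact implementation.

```python
import torch

def rge_gradient(loss_fn, params, q=1, mu=1e-3):
    """Two-point randomized gradient estimate (RGE), averaged over q random probes.
    loss_fn: maps a flat parameter tensor to a scalar loss (forward passes only).
    q:       query budget per gradient estimate (the benchmark's default is q = 1).
    mu:      smoothing parameter (tuned via grid search in the paper)."""
    grad_est = torch.zeros_like(params)
    for _ in range(q):
        u = torch.randn_like(params)                   # random probe direction
        f_plus = loss_fn(params + mu * u)
        f_minus = loss_fn(params - mu * u)
        grad_est += (f_plus - f_minus) / (2 * mu) * u  # finite-difference estimate
    return grad_est / q

def zo_sgd_step(loss_fn, params, lr=1e-6, q=1, mu=1e-3):
    """One ZO-SGD update: plug the RGE estimate into a plain SGD step."""
    return params - lr * rge_gradient(loss_fn, params, q=q, mu=mu)
```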