Fine-Tuning Language Models with Just Forward Passes
Authors: Sadhika Malladi, Tianyu Gao, Eshaan Nichani, Alex Damian, Jason D. Lee, Danqi Chen, Sanjeev Arora
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct comprehensive experiments across model types (masked and autoregressive LMs), model scales (up to 66B), and downstream tasks (classification, multiple-choice, and generation). Our results demonstrate that (1) MeZO significantly outperforms in-context learning and linear probing; (2) MeZO achieves comparable performance to fine-tuning with backpropagation across multiple tasks, with up to 12× memory reduction and up to 2× GPU-hour reduction in our implementation; (3) MeZO is compatible with both full-parameter and parameter-efficient tuning techniques such as LoRA and prefix tuning; (4) MeZO can effectively optimize non-differentiable objectives (e.g., maximizing accuracy or F1). |
| Researcher Affiliation | Academia | Princeton University {smalladi, tianyug, eshnich, ad27, jasonlee, danqic, arora}@princeton.edu |
| Pseudocode | Yes | Algorithm 1: MeZO (see the sketch after this table) |
| Open Source Code | Yes | Our code is available at https://github.com/princeton-nlp/MeZO. |
| Open Datasets | Yes | For RoBERTa-large, we consider classification datasets: SST-2 [86], SST-5 [86], TREC [97], MNLI [103], SNLI [12], and RTE [22, 8, 37, 10]. ... For OPT experiments, we consider the SuperGLUE dataset collection [98], including: BoolQ [21], CB [24], COPA [81], MultiRC [51], ReCoRD [111], RTE [22, 8, 37, 10], WiC [77], and WSC [55]. We also include SST-2 [86] and two question answering (QA) datasets, SQuAD [80] and DROP [31]. |
| Dataset Splits | Yes | For OPT experiments... We randomly sample 1,000 examples for training, 500 examples for validation, and 1,000 examples for testing, respectively, for each dataset. ... For RoBERTa experiments, we follow Malladi et al. [67] in studying the few-shot and many-shot settings, sampling k examples per class for k = 16 and k = 512 (details in Appendix E). We run MeZO for 100K steps and fine-tuning for 1,000 steps, noting that one MeZO step is substantially faster than one fine-tuning step (see Appendix F.6 for a comparison). |
| Hardware Specification | Yes | For example, with a single A100 80GB GPU, MeZO can train a 30-billion-parameter model, whereas fine-tuning with backpropagation can train only a 2.7B LM with the same budget. ... We test OPT models of various sizes with Nvidia A100 GPUs (80GB memory) on MultiRC (average #tokens=400)... We conduct our experiments with 80GB A100s connected by NVLink and InfiniBand, which are state-of-the-art solutions for distributed training. |
| Software Dependencies | No | In memory profiling, we use the standard implementation with Hugging Face's transformers [104] package. ... For multi-GPU backpropagation, we use fully sharded data parallel (FSDP) [33] provided by PyTorch [76]. |
| Experiment Setup | Yes | We use the hyperparameters in Table 15 for MeZO experiments on RoBERTa-large (Table 18 and Figure 2). ... We use the hyperparameters in Table 16 for MeZO experiments on OPT. |
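The pseudocode referenced above (Algorithm 1) estimates the gradient from two forward passes via SPSA and regenerates the Gaussian perturbation from a stored random seed, so no activations or gradients need to be kept in memory. Below is a minimal PyTorch sketch of one such step. The helper names (`perturb_parameters`, `mezo_step`) and the `closure` interface are illustrative assumptions, not the authors' implementation; see https://github.com/princeton-nlp/MeZO for the official code.

```python
# Minimal sketch of a single MeZO (zeroth-order SPSA) step, following Algorithm 1.
import torch


def perturb_parameters(model, eps, seed):
    """Add eps * z (z ~ N(0, 1)) to every trainable parameter in place.

    Re-seeding with the same `seed` reproduces the same z later without
    storing it, which keeps the memory footprint at inference level.
    """
    torch.manual_seed(seed)
    for p in model.parameters():
        if not p.requires_grad:
            continue
        z = torch.randn_like(p)
        p.data.add_(eps * z)


@torch.no_grad()
def mezo_step(model, closure, eps=1e-3, lr=1e-6):
    """One MeZO update. `closure()` runs a forward pass on a fixed minibatch
    and returns the scalar loss."""
    seed = torch.randint(0, 2**31 - 1, (1,)).item()

    perturb_parameters(model, +eps, seed)       # theta + eps * z
    loss_plus = closure().item()

    perturb_parameters(model, -2 * eps, seed)   # theta - eps * z
    loss_minus = closure().item()

    perturb_parameters(model, +eps, seed)       # restore theta

    # Finite-difference estimate of the directional derivative along z.
    projected_grad = (loss_plus - loss_minus) / (2 * eps)

    # Update: theta <- theta - lr * projected_grad * z, regenerating z from the seed.
    torch.manual_seed(seed)
    for p in model.parameters():
        if not p.requires_grad:
            continue
        z = torch.randn_like(p)
        p.data.add_(-lr * projected_grad * z)

    return loss_plus
```

In use, `closure` would evaluate the loss on the same minibatch for both perturbed forward passes (e.g. `closure = lambda: model(**batch).loss` for a Hugging Face model), with `eps` and the learning rate set per the hyperparameters reported in Tables 15 and 16 of the paper.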