Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Variance-reduced Zeroth-Order Methods for Fine-Tuning Language Models
Authors: Tanmay Gautam, Youngsuk Park, Hao Zhou, Parameswaran Raman, Wooseok Ha
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluated across a range of both masked and autoregressive LMs (up to 7B parameters) on benchmark downstream tasks, MeZO-SVRG outperforms MeZO with up to 20% increase in test accuracies in both full- and partial-parameter fine-tuning settings. |
| Researcher Affiliation | Collaboration | 1University of California, Berkeley, USA 2Amazon AI Research & Education, Santa Clara, USA 3Amazon AI Labs, Santa Clara, USA. |
| Pseudocode | Yes | The method is summarized in Algorithm 1. |
| Open Source Code | Yes | The code for the experiments is available at https://github.com/amazon-science/mezo_svrg. |
| Open Datasets | Yes | We fine-tune on tasks from the NLP GLUE and SuperGLUE benchmarks: Multi-Genre Natural Language Inference Corpus (MNLI), Stanford Question Answering Dataset (QNLI), Stanford Sentiment Treebank (SST-2), Corpus of Linguistic Acceptability (CoLA), and BoolQ (Williams et al., 2018; Wang et al., 2018; Socher et al., 2013; Warstadt et al., 2018; Wang et al., 2019). |
| Dataset Splits | Yes | Similar to Malladi et al. (2023), for each task, our experiments are conducted in a many-shot fine-tuning setting: 512 training examples, 256 validation examples and 256 test samples are randomly sampled from the dataset. |
| Hardware Specification | Yes | All experiments are run on a single GPU; specifically, we consider Nvidia A100 40GB or H100 80GB GPUs. |
| Software Dependencies | No | The paper mentions using "Huggingface datasets library" and "Huggingface transformers package" but does not specify exact version numbers for any software dependencies. |
| Experiment Setup | Yes | Setup. We evaluate on both full (FP32) and half (BF16) precision. We detail the experiment results for the BF16 setting in Appendix J. We mainly consider a prompt-free fine-tuning setting (more challenging loss landscape) but include prompted results for RoBERTa-large (Liu et al., 2019) in Appendix G. ... Further details of the experiment setup and implementation are provided in Appendices D and E. |
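
The Dataset Splits row describes a many-shot setting with 512 training, 256 validation, and 256 test examples sampled at random per task. A minimal sketch of one way such splits could be drawn is shown below. The function name `sample_splits`, the fixed seed, and the choice to draw three disjoint subsets from a single pool of examples are assumptions made for illustration; the quoted text does not state how the authors implemented the sampling.

```python
import random


def sample_splits(n_examples, n_train=512, n_val=256, n_test=256, seed=0):
    """Return three disjoint lists of example indices (train, val, test).

    Hypothetical helper: sizes default to those quoted in the table; the
    seed and the disjoint-sampling scheme are illustrative assumptions.
    """
    total = n_train + n_val + n_test
    if n_examples < total:
        raise ValueError(f"need at least {total} examples, got {n_examples}")
    rng = random.Random(seed)  # local RNG so the draw is repeatable
    # Sample without replacement, then slice into the three splits.
    idx = rng.sample(range(n_examples), total)
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]
    return train, val, test


train_idx, val_idx, test_idx = sample_splits(10_000)
print(len(train_idx), len(val_idx), len(test_idx))  # prints: 512 256 256
```

Recording the seed alongside the split sizes is what would make a split like this repeatable across runs.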