Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Bilevel ZOFO: Efficient LLM Fine-Tuning and Meta-Training

Authors: Reza Shirkavand, Peiran Yu, Qi He, Heng Huang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We conduct extensive experiments on various LLMs of different scales to demonstrate the effectiveness of bilevel-ZOFO in improving current zeroth order methods and PEFT. We also conduct experiments in testing its potential in meta training.
Researcher Affiliation	Academia	Reza Shirkavand, Department of Computer Science, University of Maryland, College Park; Peiran Yu, Department of Computer Science and Engineering, University of Texas at Arlington; Qi He, Department of Computer Science, University of Maryland, College Park; Heng Huang, Department of Computer Science, University of Maryland, College Park
Pseudocode	Yes	Algorithm 1: Bilevel first-order method. Algorithm 2: Bilevel Zeroth-order-first-order Method (Bilevel ZOFO).
Open Source Code	No	We will provide the code for our experiments after paper decision is available.
Open Datasets	Yes	Following MeZO [35], we evaluate our approach on a range of classification and multiple-choice tasks: BoolQ [3], CB [54], CB [54], COPA [44], ReCoRD [61], RTE [53], SST2 [53], WiC [39], WinoGrande [45]. In this setting, training and testing are conducted on the same task. ... The sub-tasks are sourced from CROSSFIT [57] and UNIFIEDQA [19], comprising a total of 142 unique sub-tasks.
Dataset Splits	Yes	For each task, we randomly sample 1000 examples for training, 500 examples for validation, and 1000 examples for testing. For bilevel-ZOFO, the training set is split into upper-level and lower-level subsets with a 1:2 ratio. ... To train our method, we split the training dataset of each sub-task to two subsets, 256 samples as the development dataset for upper-level updates and 512 samples for lower-level training.
Hardware Specification	Yes	All experiments used a batch size of 8 and were conducted in bfloat16 precision on a single A6000 Ada 48GB GPU. ... We use A6000ada 48GPUs in our experiments. ... We employ a batch size of 4 and train on a single rtx6000ada GPU.
Software Dependencies	No	The paper mentions "Adam optimizer [20]" but does not specify a version number for this or any other software library or environment.
Experiment Setup	Yes	For MeZO and first-order PEFT experiments, we explore learning rates from the set {1e-2, 1e-3, 1e-4, 1e-5, 1e-6}. For Bilevel-ZOFO, we sweep both the upper-level and lower-level learning rates: lr_upper ∈ {1e-4, 1e-5, 1e-6} and lr_lower ∈ {1e-2, 1e-3, 1e-4, 1e-5}. ... We perform 10 lower-level updates between each pair of upper-level updates. ... All experiments use the Adam optimizer [20], including baselines and both lower-level and upper-level optimizers. No weight decay was applied, and the models were trained with a constant learning rate schedule. Batch size is set to 16 for all experiments. ... We set λ = 10000
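
The split sizes and learning-rate sweep quoted in the Dataset Splits and Experiment Setup rows can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code (which the table reports as not yet released). The names split_indices and sweep_grid, the seed, the single shared example pool, and the rounding of the 1:2 ratio (1000 is not divisible by 3) are all assumptions made for this example.

```python
import itertools
import random

# Learning-rate sets as quoted in the Experiment Setup row.
LR_UPPER = [1e-4, 1e-5, 1e-6]
LR_LOWER = [1e-2, 1e-3, 1e-4, 1e-5]


def split_indices(n_total, n_train=1000, n_val=500, n_test=1000, seed=0):
    """Sample disjoint train/val/test index sets, then split train 1:2.

    Assumption: all three sets are drawn from one pool of n_total examples.
    The quoted text does not say which source split each set comes from.
    """
    needed = n_train + n_val + n_test
    if n_total < needed:
        raise ValueError(f"need at least {needed} examples, got {n_total}")
    rng = random.Random(seed)  # hypothetical seed; none is quoted in the table
    idx = rng.sample(range(n_total), needed)  # sampling without replacement
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]
    # 1:2 upper:lower ratio; integer division is an assumed rounding rule.
    n_upper = n_train // 3
    return {
        "upper": train[:n_upper],
        "lower": train[n_upper:],
        "val": val,
        "test": test,
    }


def sweep_grid():
    """All (lr_upper, lr_lower) pairs in the quoted sweep."""
    return list(itertools.product(LR_UPPER, LR_LOWER))


if __name__ == "__main__":
    splits = split_indices(n_total=5000)
    print({name: len(ids) for name, ids in splits.items()})
    print(len(sweep_grid()), "learning-rate configurations")
```

Under these assumptions the sweep covers 3 × 4 = 12 learning-rate pairs per task. The meta-training split quoted above (256 upper-level and 512 lower-level samples) is the same 1:2 ratio with sizes that divide evenly.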