Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Beyond Value Functions: Single-Loop Bilevel Optimization under Flatness Conditions

Authors: Liuyuan Jiang, Quan Xiao, Lisha Chen, Tianyi Chen

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we empirically validate our theoretical results through experiments on real-world tasks. In the main paper, we will focus on the LLM PEFT problem (3). Additional experiments, including fair representation learning problem on the NLSY-7k dataset [73, 80], and BiDoRA fine-tuning [68], are provided in Appendix C.
Researcher Affiliation | Academia | Liuyuan Jiang, Quan Xiao, Lisha Chen, Tianyi Chen. Rensselaer Polytechnic Institute, Troy, NY; University of Rochester, Rochester, NY; Cornell Tech, Cornell University, New York, NY.
Pseudocode | Yes | Algorithm 1: PBGD Free from value function (PBGD-Free) algorithm
  1: Inputs: initial point $x_0, y_0$; step sizes $\eta, \eta^\gamma$; counters $T, K$ ($K = 1$ is a common choice)
  2: for $t = 0, 1, \ldots, T-1$ do
  3:   for $k = 0, 1, \ldots, K-1$ do
  4:     $y^\gamma_{t,k+1} = y^\gamma_{t,k} - \eta^\gamma \big( \gamma^{-1} \nabla_y f(x_t, y^\gamma_{t,k}) + \nabla_y g(x_t, y^\gamma_{t,k}) \big)$ (set $y^\gamma_{t,0} = y^\gamma_{t-1}$)
  5:   end for
  6:   $x_{t+1} = x_t - \eta g_t$, where $g_t = \nabla_x f(x_t, y^\gamma_t)$ (set $y^\gamma_t = y^\gamma_{t,K}$)
  7: end for
  8: Outputs: $(x_T, y^\gamma_T)$
Open Source Code | No | Our code is adapted from the bilevel LLM post-training library https://github.com/Post-LLM/BIPOST and experiment details are referred to Appendix C.3.
Open Datasets | Yes | We evaluate our PEFT framework (3) using the Dahoas/rm-hh-rlhf dataset for DPO loss and the OpenOrca dataset [55] for SFT loss.
Dataset Splits | Yes | We conduct experiments on Microsoft Research Paraphrase Corpus (MRPC) dataset [20], and Internet Movie Database (IMDb) in Hugging Face by fine-tuning Bert model [59]. We apply fully-single-loop versions of PBGD and PBGD-Free in Algorithm 1 to solve the problem in Section 74 and compare it with training using DARTS [48], the algorithm used in the original BiDoRA algorithm [68], and the naive results trained on $\min_{m,v} l_{tr}(m, v)$ where $l_{tr}$ is the combined loss for training dataset including the ones used for both $l^{l}_{tr}$ and $l^{s}_{tr}$ for DoRA [53]. The experiment is conducted on a single NVIDIA RTX A5000 GPU (24GB) using CUDA 12.2 and NVIDIA driver version 535.113.01.
Hardware Specification | Yes | All experiments were conducted on a cluster of NVIDIA A6000 GPUs, each with 40 GB of memory.
Software Dependencies | Yes | We conduct experiments on Microsoft Research Paraphrase Corpus (MRPC) dataset [20], and Internet Movie Database (IMDb) in Hugging Face by fine-tuning Bert model [59]. We apply fully-single-loop versions of PBGD and PBGD-Free in Algorithm 1 to solve the problem in Section 74 and compare it with training using DARTS [48], the algorithm used in the original BiDoRA algorithm [68], and the naive results trained on $\min_{m,v} l_{tr}(m, v)$ where $l_{tr}$ is the combined loss for training dataset including the ones used for both $l^{l}_{tr}$ and $l^{s}_{tr}$ for DoRA [53]. The experiment is conducted on a single NVIDIA RTX A5000 GPU (24GB) using CUDA 12.2 and NVIDIA driver version 535.113.01.
Experiment Setup | Yes | The learning rate is set to $1 \times 10^{-5}$, using Adam [43] as the optimizer. All experiments were conducted on a cluster of NVIDIA A6000 GPUs, each with 40 GB of memory. Training was performed using PyTorch with the DeepSpeed library https://github.com/deepspeedai/DeepSpeed to optimize memory usage and distributed training efficiency. We consider a time-limited experiment under a consistent computational budget, reflecting real-world constraints where training time is often a critical factor. Algorithm hyperparameter. We use a penalty constant of $\gamma = 10$ for our proposed PBGD-Free algorithm (Algorithm 1) with a single inner loop ($K = 1$). For the baseline F2SA algorithm [11, 45], we set $\gamma = 10$ with $K = 3$ inner updates for training LLAMA-3-3B [31], and $K = 5$ for PYTHIA1b [6]. For the BOME algorithm, we similarly use $K = 3$ and $K = 5$ inner loops, adopting its hyperparameter $\eta = 0.5$ for calculating the penalty constant, as suggested in [105]. For the ALRIGHT algorithm [24], we use its default setting of $\lambda = 0.5$ as suggested in literature [24]. Since the ALRIGHT algorithm in [24] is a bi-objective learning algorithm that does not have the representation learning capability, we examine it on an alternative formulation $\min_{x,y} [f_{\mathrm{DPO}}(x, y), g_{\mathrm{SFT}}(y)]$.
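
The update rule of Algorithm 1 quoted in the Pseudocode row can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation. The function name `pbgd_free`, the scalar toy objectives, the starting point, and the step sizes are all assumptions made for this example. Only the loop structure, the penalty constant γ = 10, and the single inner step K = 1 come from the quoted text. The toy upper-level objective is deliberately chosen so that its gradient in y vanishes at the solution. That choice is loosely motivated by the "flatness" in the paper's title and is not claimed to match the paper's formal condition.

```python
def pbgd_free(grad_x_f, grad_y_f, grad_y_g, x0, y0,
              eta, eta_gamma, gamma, T, K=1):
    """Sketch of the PBGD-Free loop for scalar x and y (hypothetical helper)."""
    x, y = x0, y0
    for _ in range(T):
        # Inner loop: K gradient steps on the penalized lower-level
        # objective gamma^{-1} f(x, .) + g(x, .), warm-started from the
        # previous outer iterate's y.
        for _ in range(K):
            y = y - eta_gamma * (grad_y_f(x, y) / gamma + grad_y_g(x, y))
        # Outer step: uses only the x-gradient of f at the updated y;
        # no value-function term is evaluated.
        x = x - eta * grad_x_f(x, y)
    return x, y


# Toy problem (an assumption for illustration only):
#   f(x, y) = 0.5 * (x - 2)^2 + 0.05 * (y - 2)^2   (upper level)
#   g(x, y) = 0.5 * (y - x)^2                      (lower level)
# The lower-level solution is y*(x) = x, so the bilevel solution is (2, 2),
# where the y-gradient of f is zero.
grad_x_f = lambda x, y: x - 2.0
grad_y_f = lambda x, y: 0.1 * (y - 2.0)
grad_y_g = lambda x, y: y - x

x_T, y_T = pbgd_free(grad_x_f, grad_y_f, grad_y_g,
                     x0=0.0, y0=0.0,
                     eta=0.1, eta_gamma=0.5,  # step sizes are assumed values
                     gamma=10.0, T=500, K=1)
print(x_T, y_T)
```

On this toy problem the iterates approach (2, 2). If the upper-level objective had a nonzero y-gradient at the solution, dropping the value-function term would in general move the fixed point away from the bilevel solution, so the sketch should not be read as evidence about the method's behavior on the paper's tasks.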