Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

SeedLoRA: A Fusion Approach to Efficient LLM Fine-Tuning

Authors: Yong Liu, Di Fu, Shenggan Cheng, Zirui Zhu, Yang Luo, Minhao Cheng, Cho-Jui Hsieh, Yang You

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Comprehensive experiments on LLaMA2-7B and Mistral-7B demonstrate that SeedLoRA significantly improves performance over individual LoRA models by 4.9% on GSM8K and 6.6% on HumanEval, effectively matching or exceeding full fine-tuning performance while maintaining the efficiency benefits of LoRA.
Researcher Affiliation Academia 1Department of Computer Science, National University of Singapore 2College of Information Sciences and Technology, Pennsylvania State University 3Department of Computer Science, University of California, Los Angeles. Correspondence to: Yong Liu <EMAIL>, Yang You <EMAIL>.
Pseudocode No The paper describes the method in text and mathematical equations in Section 3, Proposed Method, but does not include a dedicated pseudocode or algorithm block.
Open Source Code No The paper does not contain an explicit statement about releasing source code, nor does it provide a link to a code repository.
Open Datasets Yes For code generation, we use Code-Feedback (Zheng et al., 2024) as training data; LLaMA2-7B (Touvron et al., 2023) and Mistral-7B-v0.1 (Jiang et al., 2023) serve as base models. We evaluate using HumanEval (Chen et al., 2021), an established benchmark for Python text-to-code generation. For comprehensive assessment, we incorporate HumanEval+ from EvalPlus (Liu et al., 2024). For math reasoning, the MetaMathQA (Yu et al., 2023) dataset is employed to fine-tune the LLaMA2-7B and Mistral-7B models. The evaluation is conducted using the GSM8k (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021) benchmarks, which are specifically constructed to test the model's capacity for mathematical reasoning and problem-solving. For the general domain, the TÜLU V2 (Wang et al., 2023a) dataset is utilized in training on LLaMA2-7B and Mistral-7B-v0.1. Following the setting of Open-Instruct (Ivison et al., 2023), we evaluate models on MMLU (Hendrycks et al., 2020), GSM8k, BBH (Suzgun et al., 2022), TyDiQA (Clark et al., 2020), TruthfulQA (Lin et al., 2021), and HumanEval.
Dataset Splits No The paper states which datasets are used for training (e.g., Code-Feedback, MetaMathQA) and which for evaluation (e.g., HumanEval), but it does not specify the training/validation/test partitioning itself (percentages, sample counts, or predefined split files), which would be needed to reproduce the data splits exactly.
Hardware Specification Yes Training is conducted on Nvidia A100 and H100 GPUs using BFloat16 precision.
Software Dependencies No For evaluation, we utilize vLLM (Kwon et al., 2023) to conduct our tests, ensuring efficient and scalable inference. No specific version number for vLLM or other key software components is provided.
Experiment Setup Yes Table 5. LLaMA-2-7B model with LoRA on Tulu-v2. For the results of LoRA and its variants, we report the best performance of 3 LoRA models, each trained with a different seed.
Model | Dataset | Method | r | α | LR | LR Scheduler | Warmup | Epochs | Batch Size | σ
LLaMA2-7B | MetaMathQA | LoRA | 8 | 16 | 3e-5 | cosine | 300 | 3 | 128 | median
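The σ = "median" column, together with the report that multiple LoRA models are trained with different seeds, suggests an element-wise aggregation of per-seed weight deltas. The paper's exact fusion rule is not reproduced in this report, so the sketch below is only an illustration of that idea under our own assumptions; `fuse_seed_deltas` and all variable names are hypothetical, not the authors' code.

```python
from statistics import median

def fuse_seed_deltas(seed_deltas, sigma=median):
    """Combine per-seed LoRA weight deltas element-wise with `sigma`.

    seed_deltas: list of flat weight-delta vectors, one per seed model,
    all of equal length. Returns a single fused delta vector.
    (Illustrative sketch only; not the paper's implementation.)
    """
    assert seed_deltas and all(len(d) == len(seed_deltas[0]) for d in seed_deltas)
    # Apply the aggregation statistic position-by-position across seeds.
    return [sigma(vals) for vals in zip(*seed_deltas)]

# Three toy "seed models", each contributing a 4-element delta.
deltas = [
    [0.10, -0.20, 0.05, 0.00],
    [0.12, -0.18, 0.50, 0.01],   # outlier at index 2
    [0.09, -0.22, 0.04, -0.01],
]
fused = fuse_seed_deltas(deltas)
print(fused)  # the median damps the outlier at index 2
```

Using the median rather than the mean makes the fused weights robust to a single seed whose delta drifts far from the others, which is one plausible reading of why σ is reported per configuration.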