Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Fine-tuning with Reserved Majority for Noise Reduction

Authors: Shuyang Jiang, Yusheng Liao, Ya Zhang, Yanfeng Wang, Yu Wang

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We conduct comprehensive experiments to validate the effectiveness of NORM, covering tasks such as general instruction tuning, mathematical reasoning, and code generation, using three strong pre-trained models. NORM consistently outperforms LoRA and other PREFT methods, achieving an average gain of +4.67 over the best PEFT methods and +1.63 over the strong PREFT method TAIA, when applied to Llama3-8B. Additional analysis confirms the robustness of NORM, and shows that Sim-Search outperforms alternative similarity-based search methods. Further experiments demonstrate that NORM significantly improves the utilization of the fine-tuning corpus while maintaining the retention of pre-trained knowledge.
Researcher Affiliation Academia Fudan University; School of Artificial Intelligence, Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory
Pseudocode No The paper describes methods textually and with a high-level diagram (Figure 3) but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code Yes Code is available at https://github.com/pixas/NoRM.
Open Datasets Yes We choose a 100K subset of TULU V2 as the general instruction tuning dataset and evaluate each fine-tuning method across various tasks, including symbolic reasoning, commonsense reasoning, knowledge understanding and multi-lingual understanding. Apart from general tuning, we also choose math reasoning and code generation as specific fine-tuning tasks and utilize Llama3-8B as the pretrained model. Specifically, we employ MetaMathQA (Yu et al., 2024b) to fine-tune the base model for math reasoning, which consists of 395K training samples evolved from GSM-8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021b). The evaluation sets are the corresponding test sets of GSM-8K and MATH to test models' capabilities for solving math word problems. For code generation, we utilize Magicoder-Evol-Instruct-110K (Wei et al., 2024) as the training data. All fine-tuned models are assessed on HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021) benchmarks, which contain 164 and 378 high-quality Python text-to-code problems, respectively. For more rigorous evaluation of programming-oriented models, we also test models on HumanEval+ and MBPP+ of the EvalPlus (Liu et al., 2024) benchmark.
Dataset Splits Yes We choose a 100K subset of TULU V2 as the general instruction tuning dataset... Specifically, we employ MetaMathQA (Yu et al., 2024b) to fine-tune the base model for math reasoning, which consists of 395K training samples... The evaluation sets are the corresponding test sets of GSM-8K and MATH... For the code generation, we utilize Magicoder-Evol-Instruct-110K (Wei et al., 2024) as the training data... All fine-tuned models are assessed on HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021) benchmarks... we compare NORM to LoRA and TAIA for fine-tuning Llama3-8B with a range of instruction-tuning sample sizes, specifically [1K, 10K, 50K, 100K, 330K], with 330K being the full size of TULU V2.
Hardware Specification Yes All experiments are conducted on 4 NVIDIA A100 GPUs.
Software Dependencies No The paper mentions BFloat16 precision and using large language models and specific PEFT methods, but does not list specific software libraries with version numbers (e.g., 'PyTorch 1.x' or 'Python 3.x').
Experiment Setup Yes We adopt vanilla LoRA (Hu et al., 2022) as the fine-tuning method and choose three configurations of LoRA rank and α values: {(16, 32), (32, 64), (64, 128)}. The learning rate is set to 2e-4 and the total batch size is set to 128. After fine-tuning, we set up three parameter drop strategies... We use BFloat16 precision and fine-tune on the full training corpus for 1 epoch. The learning rate is set to 2e-4 and the LoRA rank is set to 64. We use a linear warmup strategy with a 0.03 warmup ratio and a cosine learning rate scheduler. For NORM's setting, the search step τ is set to 0.1 and the search range starts at 1.
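The search-step setting above (τ = 0.1, range starting at 1) can be illustrated with a small numpy sketch: truncate a LoRA update ΔW = BA to its top singular components at a candidate keep ratio, then grid-search the ratio downward in steps of τ. This is a minimal, assumed simplification for illustration only; `truncate_delta`, `sim_search`, and the cosine-similarity scoring criterion are hypothetical stand-ins, not the paper's exact NoRM/Sim-Search procedure.

```python
import numpy as np

def truncate_delta(delta_w: np.ndarray, keep_ratio: float) -> np.ndarray:
    """Reconstruct delta_w from its largest singular components only."""
    u, s, vt = np.linalg.svd(delta_w, full_matrices=False)
    k = max(1, int(round(keep_ratio * len(s))))
    return (u[:, :k] * s[:k]) @ vt[:k]

def sim_search(delta_w: np.ndarray, w_pre: np.ndarray,
               step: float = 0.1, start: float = 1.0) -> float:
    """Grid-search the keep ratio from `start` downward in steps of `step`
    (mirroring tau = 0.1 above), scoring each candidate by the cosine
    similarity between the merged weight and the pre-trained weight.
    The scoring criterion is an assumed stand-in for illustration."""
    best_ratio, best_score = start, -np.inf
    ratio = start
    while ratio > 0:
        merged = w_pre + truncate_delta(delta_w, ratio)
        score = np.vdot(merged, w_pre) / (
            np.linalg.norm(merged) * np.linalg.norm(w_pre))
        if score > best_score:
            best_ratio, best_score = ratio, score
        ratio = round(ratio - step, 10)  # round to avoid float drift
    return best_ratio

# Toy example: a rank-8 LoRA update on a 64x64 weight.
rng = np.random.default_rng(0)
B = rng.standard_normal((64, 8))
A = rng.standard_normal((8, 64))
delta = B @ A
# Keeping every component reconstructs the update exactly.
assert np.allclose(truncate_delta(delta, 1.0), delta)
```

A keep ratio of 1 corresponds to vanilla LoRA merging; smaller ratios drop the minority singular directions, which is the intuition behind reserving the majority of the update.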