Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Fine-tuning with Reserved Majority for Noise Reduction
Authors: Shuyang Jiang, Yusheng Liao, Ya Zhang, Yanfeng Wang, Yu Wang
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct comprehensive experiments to validate the effectiveness of NoRM, covering tasks such as general instruction tuning, mathematical reasoning, and code generation, using three strong pre-trained models. NoRM consistently outperforms LoRA and other PREFT methods, achieving an average gain of +4.67 over the best PEFT methods and +1.63 over the strong PREFT method TAIA, when applied to Llama3-8B. Additional analysis confirms the robustness of NoRM, and shows that Sim-Search outperforms alternative similarity-based search methods. Further experiments demonstrate that NoRM significantly improves the utilization of the fine-tuning corpus while maintaining the retention of pre-trained knowledge. |
| Researcher Affiliation | Academia | Fudan University; School of Artificial Intelligence, Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory |
| Pseudocode | No | The paper describes methods textually and with a high-level diagram (Figure 3) but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Code is available at https://github.com/pixas/NoRM. |
| Open Datasets | Yes | We choose a 100K subset of TULU V2 as the general instruction tuning dataset and evaluate each fine-tuning method across various tasks, including symbolic reasoning, commonsense reasoning, knowledge understanding and multi-lingual understanding. Apart from general tuning, we also choose math reasoning and code generation as specific fine-tuning tasks and utilize Llama3-8B as the pretrained model. Specifically, we employ MetaMathQA (Yu et al., 2024b) to fine-tune the base model for math reasoning, which consists of 395K training samples evolved from GSM-8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021b). The evaluation sets are the corresponding test sets of GSM-8K and MATH to test models' solving capabilities for math word problems. For code generation, we utilize Magicoder-Evol-Instruct-110K (Wei et al., 2024) as the training data. All fine-tuned models are assessed on the HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021) benchmarks, which contain 164 and 378 high-quality Python text-to-code problems, respectively. For more rigorous evaluation of programming-oriented models, we also test models on HumanEval+ and MBPP+ of the EvalPlus (Liu et al., 2024) benchmark. |
| Dataset Splits | Yes | We choose a 100K subset of TULU V2 as the general instruction tuning dataset... Specifically, we employ MetaMathQA (Yu et al., 2024b) to fine-tune the base model for math reasoning, which consists of 395K training samples... The evaluation sets are the corresponding test sets of GSM-8K and MATH... For the code generation, we utilize Magicoder-Evol-Instruct-110K (Wei et al., 2024) as the training data... All fine-tuned models are assessed on HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021) benchmarks... we compare NoRM to LoRA and TAIA for fine-tuning Llama3-8B with a range of instruction-tuning sample sizes, specifically [1K, 10K, 50K, 100K, 330K], with 330K being the full size of TULU V2. |
| Hardware Specification | Yes | All experiments are conducted on 4 NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions BFloat16 precision and using large language models and specific PEFT methods, but does not list specific software libraries with version numbers (e.g., 'PyTorch 1.x' or 'Python 3.x'). |
| Experiment Setup | Yes | We adopt vanilla LoRA (Hu et al., 2022) as the fine-tuning method and choose three configurations of LoRA rank and α values: {(16, 32), (32, 64), (64, 128)}. The learning rate is set to 2e-4 and the total batch size is set to 128. After fine-tuning, we set up three parameter drop strategies... We use BFloat16 precision and fine-tune on each training corpus for 1 epoch. The learning rate is set to 2e-4 and the LoRA rank is set to 64. We use a linear warmup strategy with a 0.03 warmup ratio and a cosine learning rate scheduler. For NoRM's setting, the search step τ is set to 0.1 and the search range starts at 1. |
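The learning-rate schedule described in the setup row (base rate 2e-4, linear warmup over a 0.03 ratio of total steps, then cosine decay) can be sketched in plain Python. This is a minimal sketch, not the authors' code: the function name `lr_at` and the assumption that the cosine decays to zero are illustrative choices not stated in the paper.

```python
import math

# Hyperparameters as reported in the experiment setup (quoted above).
BASE_LR = 2e-4
WARMUP_RATIO = 0.03

def lr_at(step: int, total_steps: int) -> float:
    """Linear warmup for the first 3% of steps, then cosine decay.

    Assumes (not stated in the paper) that the schedule decays to zero
    by the final step.
    """
    warmup_steps = max(1, int(total_steps * WARMUP_RATIO))
    if step < warmup_steps:
        # Linear ramp from 0 up to BASE_LR.
        return BASE_LR * step / warmup_steps
    # Cosine decay from BASE_LR down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * BASE_LR * (1.0 + math.cos(math.pi * progress))
```

In practice this combination is what Hugging Face Transformers' `get_cosine_schedule_with_warmup` provides, which is the likely (but unconfirmed) implementation route for a setup like this one.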