Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Reinforcement Learning Finetunes Small Subnetworks in Large Language Models
Authors: Sagnik Mukherjee, Lifan Yuan, Dilek Hakkani-Tur, Hao Peng
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Reinforcement learning (RL) yields substantial improvements in large language models (LLMs) downstream task performance and alignment with human values. Surprisingly, such large gains result from updating only a small subnetwork comprising just 5%-30% of the parameters, with the rest effectively unchanged. We refer to this phenomenon as parameter update sparsity induced by RL. It is observed across all 7 widely-used RL algorithms (e.g., PPO, GRPO, DPO) and all 10 LLMs from different families in our experiments. |
| Researcher Affiliation | Academia | Sagnik Mukherjee Lifan Yuan Dilek Hakkani-Tür Hao Peng University of Illinois Urbana-Champaign EMAIL |
| Pseudocode | No | The paper describes methodologies and processes in paragraph form and through figures, but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/SagnikMukherjee/sparsity_in_rl. |
| Open Datasets | Yes | We analyze publicly released model checkpoints on Hugging Face released by the authors. With the exception of models where RL is applied directly to the pretrained base model (e.g., DeepSeek-R1-Zero), most models follow a conventional three-stage pipeline: pretraining, supervised fine-tuning (SFT), and RL. We analyze both the RL and SFT stages by measuring the update sparsity between model checkpoints before and after RL or SFT fine-tuning. Our experiments cover Tulu 8B/70B (Lambert et al., 2025), Eurus 7B (Yuan et al., 2025; Cui et al., 2025), DeepSeek Math 7B (Shao et al., 2024), and KTO/SimPO models (Meng et al., 2024). ... For DPO we choose the LSAT (Wang et al., 2022) and LogiQA (Liu et al., 2021) splits from AGIEval (Zhong et al., 2024), and the Math split of MMLU Pro (Wang et al., 2024b). For PRIME, we report results on the MATH500 (Hendrycks et al., 2021) benchmark across difficulty levels. ... The model is trained for one epoch on the allenai/llama-3.1-tulu-3-8b-preference-mixture dataset. PRIME: For PRIME, we fine-tune Qwen2.5-Math-7B on a mixture of GSM8K and MATH datasets. |
| Dataset Splits | No | The paper mentions specific datasets and benchmarks like AGIEval, MMLU Pro Math, and MATH500. It also mentions training DPO on the 'allenai/llama-3.1-tulu-3-8b-preference-mixture dataset' for one epoch, and PRIME on 'a mixture of GSM8K and MATH datasets' for 15 epochs. However, it does not provide explicit training, validation, or test split percentages or sample counts for these datasets, often referring to original papers or using standard benchmark setups without detailing the splits within this paper. |
| Hardware Specification | No | The paper mentions "bfloat16 mixed-precision and DeepSpeed Stage 3 for memory and compute efficiency across 8 processes" and the use of the "Delta advanced computing and data resource". However, it does not specify actual hardware components like GPU models, CPU types, or explicit memory configurations used for the experiments. |
| Software Dependencies | No | The paper mentions "DPO with Open-Instruct and PRIME with verl" as well as "DeepSpeed Stage 3" and "PyTorch". However, it does not provide specific version numbers for any of these software dependencies. |
| Experiment Setup | Yes | B Hyperparameter choices for Gradient Masking experiments: DPO: For DPO, we fine-tuned the LLaMA-3.1-Tulu-3-8B model using Direct Preference Optimization (DPO) with bfloat16 mixed-precision and DeepSpeed Stage 3 for memory and compute efficiency across 8 processes. Training uses a sequence length of 2048 tokens with an effective batch size of 128, achieved by setting the per-device batch size to 1 with 16 gradient accumulation steps. A linear learning rate schedule is applied with a peak learning rate of $5 \times 10^{-7}$ and a warmup ratio of 0.1, without weight decay. The model is trained for one epoch on the allenai/llama-3.1-tulu-3-8b-preference-mixture dataset. PRIME: For PRIME, we fine-tune Qwen2.5-Math-7B on a mixture of GSM8K and MATH datasets. The training batch size is set to 64. The actor is optimized with a learning rate of $5 \times 10^{-7}$, while the reward model is trained with a learning rate of $1 \times 10^{-6}$. Four rollouts are performed per sample. We use gradient clipping of 10.0 and a temperature β of 0.05. Training is conducted for 15 epochs. |
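
The DPO settings quoted in the Experiment Setup row can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only and is not taken from the paper's code: the function names are hypothetical, it assumes one process per device, it assumes the linear schedule decays to zero after warmup, and `total_steps = 1000` is a placeholder because the number of optimizer steps is not reported.

```python
def effective_batch_size(num_processes: int, per_device_batch: int, grad_accum_steps: int) -> int:
    """Samples contributing to one optimizer step (assumes one process per device)."""
    return num_processes * per_device_batch * grad_accum_steps


def linear_lr(step: int, total_steps: int, peak_lr: float, warmup_ratio: float) -> float:
    """Linear warmup to peak_lr, then linear decay (decay to zero is an assumption)."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup phase: ramp linearly from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Decay phase: ramp linearly from the peak down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Reported DPO values: 8 processes, per-device batch size 1, 16 accumulation steps.
batch = effective_batch_size(num_processes=8, per_device_batch=1, grad_accum_steps=16)
print(batch)  # 128

# Hypothetical step count, used only to illustrate the shape of the schedule.
total_steps = 1000
for step in (0, 50, 100, 550, 1000):
    print(step, linear_lr(step, total_steps, peak_lr=5e-7, warmup_ratio=0.1))
```

The product 8 × 1 × 16 matches the reported effective batch size of 128, which suggests the three quoted numbers are mutually consistent under the one-process-per-device assumption.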