Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
MISA: Memory-Efficient LLMs Optimization with Module-wise Importance Sampling
Authors: Yuxi Liu, Renjia Deng, Yutong He, Xue Wang, Tao Yao, Kun Yuan
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on diverse learning tasks validate the effectiveness of MISA. Source code is available at: https://github.com/pkumelon/MISA. 4 Experiments This section evaluates the performance of MISA across various fine-tuning and pre-training tasks. All our experiments are conducted on RTX 4090 24GB. We also conducted detailed ablation experiments, as presented in Appendix D. |
| Researcher Affiliation | Collaboration | Yuxi Liu Peking University EMAIL Renjia Deng Peking University EMAIL Yutong He Peking University EMAIL Xue Wang Alibaba DAMO Academy EMAIL Tao Yao Shanghai Jiao Tong University EMAIL Kun Yuan Peking University EMAIL |
| Pseudocode | Yes | Algorithm 1: Module-wise Importance Sampling (MISA). Require: $\theta^0$, $N$, $T$, $B$, $\eta$, $\alpha$, $\delta > 0$, $\beta_1, \beta_2 \in (0, 1)$, and the sampling block $\tau_n$ at outer loop $n$. 1: Partition the model into $B$ modules (not layers); ▷ Partition weights into modules. 2: Initialize probability weights $P^1 = (\tfrac{1}{B}, \dots, \tfrac{1}{B})$; 3: Initialize the module gradient estimate $G^0_b = 0$ for $b \in [B]$ and let $G^0 = (G^0_1, \dots, G^0_B)$; 4: for $n = 1, \dots, N$ do 5: Sample $L_n$ modules (labeled with index $\tau_n$) according to $P^n$ such that the ratio of trainable parameters is less than $\delta$ (see Algorithm 2 for more details); ▷ Importance sampling. 6: Initialize $m^{n,0}_{\tau_n} = 0$, $v^{n,0}_{\tau_n} = 0$; 7: for $t = 1, \dots, T$ do 8: Sample a batch of data and calculate the block stochastic gradient $g^{n,t}_{\tau_n}$ for the selected module $\tau_n$; 9: Update the corresponding first-order and second-order momentum as follows: 10: $m^{n,t}_{\tau_n} \leftarrow \beta_1 m^{n,t-1}_{\tau_n} + (1 - \beta_1)\, g^{n,t}_{\tau_n}$, $v^{n,t}_{\tau_n} \leftarrow \beta_2 v^{n,t-1}_{\tau_n} + (1 - \beta_2)\, (g^{n,t}_{\tau_n})^2$; 11: Update the corresponding module as follows: 12: $\theta^{n,t}_{\tau_n} \leftarrow \theta^{n,t-1}_{\tau_n} - \alpha\, m^{n,t}_{\tau_n} / (\sqrt{v^{n,t}_{\tau_n}} + \varepsilon)$; 13: end for 14: Update $G^n_b$ for each $b \in [B]$ according to (4); ▷ Track block gradient norm. 15: Update $p^{n+1}_b \leftarrow \exp(\eta G^n_b) / \sum_{j=1}^{B} \exp(\eta G^n_j)$ for each $b \in [B]$; ▷ Update sampling probability. 16: $\theta^{n+1,0}_{\tau_n} \leftarrow \theta^{n,T}_{\tau_n} - \alpha\, \frac{\beta_1}{1 - \beta_1}\, \frac{m^{n,T}_{\tau_n}}{\sqrt{v^{n,T}_{\tau_n}} + \varepsilon}$; 17: $g_{\tau_n}, m_{\tau_n}, v_{\tau_n} \leftarrow$ None; ▷ Clear optimizer states. 18: end for 19: Return $\theta^N, P^N$ |
| Open Source Code | Yes | Source code is available at: https://github.com/pkumelon/MISA. |
| Open Datasets | Yes | We evaluated MISA on different LLMs across three benchmarks: Commonsense Reasoning [25], Math Reasoning [25], and Instruction Following, encompassing a total of 16 datasets. We trained the LLaMA2 130M and 350M variant [32] on the C4 dataset [46]. To evaluate MISA on math reasoning tasks, we tested LLaMA3-8B and Qwen2.5-7B on four arithmetic benchmarks: GSM8K [8], SVAMP [42], AQUA [34], and MAWPS [28], following [25] for dataset settings. The models were fine-tuned on MATH10K [25], which combines training data from AQUA, GSM8K, and MAWPS...to demonstrate MISA's efficiency in instruction-following fine-tuning, we fine-tuned TinyLLaMA [67], LLaMA2-7B [54] and Mistral-7B [26] on the Alpaca GPT-4 dataset...Model performance was evaluated on MMLU [21], MMLU-pro [57] and MT-Bench [72] |
| Dataset Splits | Yes | Following the settings in [25], we combined the training data from all eight tasks into a single training set for fine-tuning and then evaluated each model separately on the eight test sets. The models were fine-tuned on MATH10K [25], which combines training data from AQUA, GSM8K, and MAWPS, incorporating LM-generated chain-of-thought [58] steps. Figure 3: Validation loss of LISA, BAdam, and MISA across three epochs of fine-tuning Mistral-7B (left), LLaMA2-7B (middle) and TinyLLaMA (right) on the Alpaca-GPT4 dataset. The x-axis represents training time (minutes). Table 6: Validation perplexity of pre-training LLaMA 350M model on C4 dataset. |
| Hardware Specification | Yes | All our experiments are conducted on RTX 4090 24GB. |
| Software Dependencies | No | The paper mentions the use of the Adam optimizer, but no specific software versions for libraries like PyTorch, TensorFlow, or Python are provided. |
| Experiment Setup | Yes | We conducted extensive hyperparameter searches. The learning rate was searched in {2e-4, 1e-4, 5e-5, 1e-5, 5e-6, 3e-6, 1e-6}. For LoRA and GaLore, we explored ranks in {8, 16, 32}, and we found that a rank of 16 or 32 consistently yielded better performance than 8. For MISA, η was searched in {0.1, 0.5, 1}. The table below presents the optimal hyperparameter settings. Tables 17-24 in Appendix I provide detailed hyperparameters for each experiment, including learning rate, batch size, warmup steps, epochs, dropout, optimizer, α, δ, η, T, target modules, and activated parameters. |
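
For readers who want a concrete picture of the sampling steps quoted in the Pseudocode row, the following is a minimal sketch, not the authors' implementation (see the linked repository for that). It shows the softmax update of the sampling probabilities (step 15) and one possible way to draw modules under the trainable-parameter budget δ (step 5). The function names `update_probabilities` and `sample_modules`, the module sizes, and the draw-until-budget rule are assumptions made for illustration; the paper's Algorithm 2 is not quoted here and may differ.

```python
import math
import random


def update_probabilities(grad_estimates, eta):
    """Softmax over eta * G_b, as in step 15 of the quoted Algorithm 1."""
    # Subtracting the maximum leaves the softmax unchanged but avoids overflow.
    shift = max(eta * g for g in grad_estimates)
    weights = [math.exp(eta * g - shift) for g in grad_estimates]
    total = sum(weights)
    return [w / total for w in weights]


def sample_modules(probs, module_sizes, delta, rng):
    """Hypothetical stand-in for Algorithm 2 (an assumption, not the paper's rule).

    Draws modules without replacement, weighted by probs, and stops as soon
    as adding the drawn module would bring the trainable-parameter ratio
    to delta or above.
    """
    total_params = sum(module_sizes)
    remaining = list(range(len(probs)))
    chosen, used = [], 0
    while remaining:
        weights = [probs[b] for b in remaining]
        if sum(weights) <= 0.0:
            break  # all remaining probabilities underflowed to zero
        b = rng.choices(remaining, weights=weights, k=1)[0]
        if (used + module_sizes[b]) / total_params >= delta:
            break
        chosen.append(b)
        used += module_sizes[b]
        remaining.remove(b)
    return chosen


# Illustrative values only: four equally sized modules and made-up gradient norms.
rng = random.Random(0)
probs = update_probabilities([0.2, 1.5, 0.7, 0.1], eta=1.0)
chosen = sample_modules(probs, [100, 100, 100, 100], delta=0.6, rng=rng)
print(probs, chosen)
```

With a larger η the distribution concentrates on modules with larger tracked gradient norms, while η close to 0 approaches uniform sampling; this is why the setup row reports η as a searched hyperparameter.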