Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Efficient Hybrid Language Model Compression through Group-Aware SSM Pruning

Authors: Ali Taghibakhshi, Sharath Turuvekere Sreenivas, Saurav Muralidharan, Marcin Chochowski, Yashaswi Karnati, Raviraj Joshi, Ameya Mahabaleshwarkar, Zijia Chen, Yoshi Suhara, Oluwatobi Olabiyi, Daniel Korzekwa, Mostofa Patwary, Mohammad Shoeybi, Jan Kautz, Bryan Catanzaro, Ashwath Aithal, Nima Tajbakhsh, Pavlo Molchanov

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Using this recipe, we compress the Nemotron-H 8B Hybrid model down to 4B parameters with up to 40× fewer training tokens compared to similarly-sized models. The resulting model surpasses the accuracy of similarly-sized models while achieving 2× faster inference throughput, significantly advancing the Pareto frontier. ... We conduct several ablation studies evaluating the impact of pruning different components on accuracy and runtime performance. Our experiments, summarized in Section 3.1, reveal key insights and highlight differences from Transformer-only compression 3. We then describe our main results for the NEMOTRON-H 4B model in Section 3.2. ... Tables 2 to 5 present accuracy comparisons between our compressed 4B hybrid model, other similar-sized community models, and the parent 8B hybrid model.
Researcher Affiliation Industry Ali Taghibakhshi, Sharath Turuvekere Sreenivas, Saurav Muralidharan, Marcin Chochowski, Yashaswi Karnati, Raviraj Joshi, Ameya Sunil Mahabaleshwarkar, Zijia Chen, Yoshi Suhara, Oluwatobi Olabiyi, Daniel Korzekwa, Mostofa Patwary, Mohammad Shoeybi, Jan Kautz, Bryan Catanzaro, Ashwath Aithal, Nima Tajbakhsh, Pavlo Molchanov
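Pseudocode Yes The following algorithm provides a concise walkthrough of Mamba head and head channel ranking: Require: Activation scores $s \in \mathbb{R}^{m_h \times m_d}$, target channels $k_d$, target heads per group $\{k_g\}_{g=1}^{G}$. Ensure: Head ranking $R$, channel ranking $D_{\text{top}}$. 1: Compute channel scores: $s_d \leftarrow \lVert s_{:,d} \rVert_2 \;\; \forall d$. 2: $D_{\text{top}} \leftarrow$ top-$k_d$ indices of $\{s_d\}$. 3: Compute head scores: $f_h \leftarrow \lVert s_{h,D_{\text{top}}} \rVert_2 \;\; \forall h$. 4: for $g \leftarrow 1$ to $G$ do. 5: $R_g \leftarrow$ argsort-descending$(\{f_h \mid h \in \mathcal{G}_g\})$. 6: $R_g^{\text{sel}} \leftarrow$ first $k_g$ elements of $R_g$. 7: end for. 8: $R \leftarrow \bigoplus_{g=1}^{G} R_g^{\text{sel}}$.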
Open Source Code No We intend to release the model weights and code pending internal review.
Open Datasets No We use a random sample from the Phase 3 data mixture employed for training Nemotron-H models [5] for both importance estimation and knowledge distillation (KD). ... The instruction-tuned model is then further aligned with two rounds of RPO. ... To extend the context length of the aligned NEMOTRON-H 4B model, we perform SFT using data designed for long-context understanding: this training data is derived by manipulating the general domain chat dataset from the second SFT-KD round during alignment.
Dataset Splits No I could not find specific training/test/validation dataset splits with exact percentages, sample counts, or citations to predefined splits. The paper mentions using a random sample from a data mixture and 1024 calibration samples, but does not provide explicit split information for the main evaluation or training.
Hardware Specification Yes All experiments were performed on 16-32 NVIDIA DGX H100 nodes (8× H100 80GB) for short turnaround times.
Software Dependencies No The paper does not provide specific version numbers for software dependencies like programming languages or libraries. It only mentions the use of general frameworks implicitly.
Experiment Setup Yes Our search space includes depth reduction (removing 4-26 layers from the original 52-layer architecture) combined with width pruning of embedding channels (3072-4096), FFN dimension (9984-21504), Mamba heads (64-128), and Mamba head channels (32-64). ... For KD, the batch size is 768, with a sequence length of 8192, a cosine decay learning rate schedule (starting at 1.6e-4 and decaying to 8e-4), with a 60-step linear warmup.
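
The head and channel ranking quoted in the Pseudocode row can be sketched in plain Python as follows. This is an illustrative reading of the quoted listing, not the authors' code (which, per the Open Source Code row, had not been released). The function name `rank_heads_and_channels`, the nested-list layout of the scores, the `groups` argument (head indices per Mamba group), and the reading of the final step as an ordered concatenation are all assumptions.

```python
import math


def rank_heads_and_channels(scores, k_d, groups, k_g):
    """Rank head channels globally, then rank heads within each group.

    scores: m_h x m_d nested list of activation scores (assumed layout).
    k_d:    number of head channels to keep.
    groups: list of lists of head indices, one list per group (assumed).
    k_g:    number of heads to keep in each group, aligned with `groups`.
    """
    m_h, m_d = len(scores), len(scores[0])

    # Step 1: channel score = L2 norm of each channel column across heads.
    channel_scores = [
        math.sqrt(sum(scores[h][d] ** 2 for h in range(m_h)))
        for d in range(m_d)
    ]

    # Step 2: indices of the k_d highest-scoring channels.
    d_top = sorted(range(m_d), key=lambda d: -channel_scores[d])[:k_d]

    # Step 3: head score = L2 norm of each head over the kept channels only.
    head_scores = [
        math.sqrt(sum(scores[h][d] ** 2 for d in d_top))
        for h in range(m_h)
    ]

    # Steps 4-8: sort heads inside each group, keep the first k_g of each,
    # and concatenate the per-group selections in group order.
    ranking = []
    for heads, k in zip(groups, k_g):
        ordered = sorted(heads, key=lambda h: -head_scores[h])
        ranking.extend(ordered[:k])

    return ranking, d_top


# Toy input: 4 heads, 3 channels, two groups of two heads each.
toy_scores = [
    [1.0, 0.0, 3.0],
    [0.0, 2.0, 0.0],
    [4.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_ranking, toy_channels = rank_heads_and_channels(
    toy_scores, k_d=2, groups=[[0, 1], [2, 3]], k_g=[1, 1]
)
print(toy_ranking, toy_channels)
```

Selecting heads per group rather than globally keeps the number of surviving heads fixed for every group, which is the constraint the `{k_g}` input in the quoted listing expresses. Ties are broken by original index order here because Python's `sorted` is stable; the quoted listing does not specify tie-breaking.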