Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].
Nemotron-Flash: Towards Latency-Optimal Hybrid Small Language Models
Authors: Yonggan Fu, Xin Dong, Shizhe Diao, Matthijs Van keirsbilck, Hanrong Ye, Wonmin Byeon, Yashaswi Karnati, Lucas Liebenwein, Maksim Khadkevich, Alexander Keller, Jan Kautz, Yingyan (Celine) Lin, Pavlo Molchanov
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Combining these methods, we introduce a new family of hybrid SLMs, called Nemotron-Flash, which significantly advances the accuracy-efficiency frontier of state-of-the-art SLMs, e.g., achieving over +5.5% average accuracy, 1.3×/1.9× lower latency, and 18.7×/45.6× higher throughput compared to Qwen3-1.7B/0.6B, respectively. As shown in Fig. 1, Nemotron-Flash markedly pushes forward the accuracy-efficiency frontier compared to state-of-the-art (SOTA) SLMs. For example, with all models accelerated using TensorRT-LLM's AutoDeploy kernels [10] and CUDA Graph, Nemotron-Flash-3B achieves +2.0%/+5.5% higher average accuracy, 1.7×/1.3× lower latency, and 6.4×/18.7× higher throughput compared to Qwen2.5-3B/Qwen3-1.7B, respectively. |
| Researcher Affiliation | Collaboration | NVIDIA Research, Georgia Institute of Technology |
| Pseudocode | Yes | Algorithm 1: Aging Evolutionary Search for Hybrid Operator Combinations |
| Open Source Code | No | The paper does not provide an explicit link to a code repository for the Nemotron-Flash implementation or an unambiguous statement of its release. Mentions of TensorRT-LLM and Flash Linear Attention refer to third-party tools used by the authors. |
| Open Datasets | Yes | We train a series of Llama models... on 100B tokens from the Smollm-corpus [12]. We first train the models on Zyda2 [31], then switch to higher-quality datasets, including commonsense reasoning datasets (Climb-Mix [32] and Smollm-corpus [12]), a proprietary high-quality dataset with high proportions of math and code, and MegaMath [33]. |
| Dataset Splits | No | The paper mentions evaluation protocols like '5-shot evaluation for GSM8K and MMLU, 3-shot for MBPP and MBPP-Plus, and 0-shot for all remaining tasks,' which describe how benchmarks are used. However, it does not provide specific train/test/validation splits (e.g., percentages or counts) for the main training datasets (e.g., Smollm-corpus or Zyda2) used to develop the models. |
| Hardware Specification | Yes | Latency is measured on an NVIDIA H100 GPU for decoding 8k tokens with a batch size of 1 using CUDA Graph. All models are trained on 100B tokens from the Smollm-corpus... measured by decoding 8k tokens with a batch size of 1 on an NVIDIA A100 GPU with CUDA Graph enabled. Both models are trained for 4.5T tokens using 256 NVIDIA H100 GPUs. |
| Software Dependencies | No | The paper mentions several software components, such as 'TensorRT-LLM's AutoDeploy kernels [10]', 'CUDA Graph', 'FlashAttention [20]', 'Flash Linear Attention [21]', and 'Adam optimizer'. However, it does not give version numbers for these dependencies as used in the authors' implementation, which reproducibility requires. |
| Experiment Setup | Yes | We train a series of Llama models... on 100B tokens from the Smollm-corpus [12] using the AdamW optimizer and a cosine learning rate schedule with an initial learning rate of 5e-4. Both models are trained using the Adam optimizer (without weight decay, due to the use of weight normalization) and a cosine learning rate schedule with an initial learning rate of 1e-3. Both models are trained for 4.5T tokens using 256 NVIDIA H100 GPUs, with a batch size of 2M tokens and a context length of 4096, except for the final 25B tokens, where we extend the context length to 29000. The learning rates for the first and second stages are set to 8e-6 and 5e-6, respectively. Each stage is trained for one epoch using a cosine learning rate scheduler and a global batch size of 384. |
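
The Experiment Setup excerpt quotes a cosine learning rate schedule with an initial learning rate of 1e-3, 4.5T training tokens, and a batch size of 2M tokens. The sketch below shows what those figures could imply for step count and learning rate decay. It is an illustration under stated assumptions, not the authors' implementation: the excerpt does not mention warmup or a minimum learning rate, so the sketch uses no warmup and decays to zero, and it reads "2M" as exactly 2,000,000 tokens. The function name `cosine_lr` and the constants are hypothetical.

```python
import math


def cosine_lr(step: int, total_steps: int, peak_lr: float, min_lr: float = 0.0) -> float:
    """Cosine decay from peak_lr at step 0 to min_lr at total_steps.

    Assumption: no warmup phase and min_lr = 0.0 by default; the quoted
    excerpt does not specify either.
    """
    # Clamp progress to [0, 1] so out-of-range steps stay well defined.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Figures quoted in the Experiment Setup row.
TOTAL_TOKENS = 4.5e12    # 4.5T training tokens
PEAK_LR = 1e-3           # initial learning rate of the final models
# Assumption: "2M tokens" taken as exactly 2,000,000 tokens per batch.
TOKENS_PER_BATCH = 2e6

# 2,250,000 optimizer steps under the assumption above.
total_steps = round(TOTAL_TOKENS / TOKENS_PER_BATCH)

if __name__ == "__main__":
    for fraction in (0.0, 0.25, 0.5, 0.75, 1.0):
        step = int(fraction * total_steps)
        print(f"step {step:>9,d}: lr = {cosine_lr(step, total_steps, PEAK_LR):.2e}")
```

If "2M" instead means 2^21 tokens, the step count changes accordingly, so the figure above should be read as an order-of-magnitude estimate.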