Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search

Authors: Yuxian Gu, Qinghao Hu, Haocheng Xi, Junyu Chen, Shang Yang, Song Han, Han Cai

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate Jet-Nemotron across a comprehensive suite of benchmarks, including MMLU(-Pro) [18, 19], commonsense reasoning [33, 34, 35, 36, 37, 38], mathematical reasoning [20, 21, 22, 39], retrieval [23, 24, 25], coding [26, 27, 28, 40], and long-context tasks [29]. Our Jet-Nemotron-2B model matches or surpasses SOTA full-attention models, such as Qwen2.5 [4], Qwen3 [5], Gemma3 [41, 42] and Llama3.2 [2], across all benchmarks, while achieving significantly higher generation throughput.
Researcher Affiliation	Industry	Yuxian Gu, Qinghao Hu, Shang Yang, Haocheng Xi, Junyu Chen, Song Han, Han Cai NVIDIA https://github.com/NVlabs/Jet-Nemotron
Pseudocode	No	The paper describes methods and processes (e.g., Post NAS Roadmap in Figure 2) but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or figures.
Open Source Code	Yes	NVIDIA https://github.com/NVlabs/Jet-Nemotron
Open Datasets	Yes	At the first stage, we use a combination of Nemotron-CC [63] and Redstone-QA [64] as our pre-training corpus and train Jet-Nemotron models for 50B tokens. ... We evaluate Jet-Nemotron across a comprehensive suite of benchmarks, including MMLU(-Pro) [18, 19], mathematical reasoning [18, 20, 21, 22], commonsense reasoning [33, 34, 35, 36, 37, 38], retrieval [23, 24, 25], coding [26, 27, 28, 40], and long-context tasks [29].
Dataset Splits	Yes	We adopt 4-shot evaluation for GSM8K [22] and MATH [18] and 5-shot evaluation for GPQA [20] and MMLU-Pro [19]. We use the official implementation of EvalPlus [40] and CRUXEval [28] for coding tasks. For all other tasks, we use the zero-shot setting.
Hardware Specification	Yes	Our throughput evaluation was performed on a DGX H100 server, featuring 8 NVIDIA H100 GPUs, 2 Intel Xeon Platinum 8480C (112 cores) CPUs, and 2TB of RAM.
Software Dependencies	Yes	Specifically, our environment includes PyTorch 2.7.0 and Triton 3.3.0. We implement the full-attention block with Flash Attention 2.7.4 [69] and linear attention blocks with Flash-Linear-Attention 0.2.1 [70]. Model inference is based on the Transformers 4.52.0 implementation [71].
Experiment Setup	Yes	Training Details. The training consists of two stages... At the first stage, we use a combination of Nemotron-CC [63] and Redstone-QA [64] as our pre-training corpus and train Jet-Nemotron models for 50B tokens. ... At the second stage, we include more high-quality data from math [65] and coding [66, 67] domains into our data mixture. The models are then trained on 350B tokens. We summarize the experimental costs in Appendix A.2. ... The final Jet-Nemotron models are composed of a stack of blocks, each containing a Multi-Layer Perceptron (MLP) layer and an attention layer. The attention layer is selected from one of three types: full attention, sliding window attention, or Jet Block. The detailed architecture configurations are presented in Table 7. (Tables 7, 8, 9 provide specific hyperparameters like Vocabulary Size, Hidden Size, Attention Head Number, Convolution Kernel Size, etc.)
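
The versions quoted in the Software Dependencies row can be compared against a local environment before attempting a reproduction. The following is a minimal sketch, not part of the paper or its repository: the distribution names used as dictionary keys (`torch`, `triton`, `flash-attn`, `flash-linear-attention`, `transformers`) are assumptions about the corresponding PyPI package names, and the helper functions `installed_version` and `compare_environment` are hypothetical names introduced here for illustration.

```python
from importlib import metadata

# Versions quoted in the Software Dependencies row above.
# ASSUMPTION: the keys are guesses at the PyPI distribution names;
# they are not stated in the source.
REPORTED_VERSIONS = {
    "torch": "2.7.0",
    "triton": "3.3.0",
    "flash-attn": "2.7.4",
    "flash-linear-attention": "0.2.1",
    "transformers": "4.52.0",
}


def installed_version(dist_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def compare_environment(reported, lookup=installed_version):
    """Map each package name to 'match', 'mismatch', or 'missing'."""
    report = {}
    for name, wanted in reported.items():
        found = lookup(name)
        if found is None:
            report[name] = "missing"
            continue
        # Drop local build tags such as '+cu126' before comparing.
        base = found.split("+")[0]
        # Also accept extra trailing segments such as '.post1'.
        if base == wanted or base.startswith(wanted + "."):
            report[name] = "match"
        else:
            report[name] = "mismatch"
    return report


if __name__ == "__main__":
    for name, status in compare_environment(REPORTED_VERSIONS).items():
        print(f"{name}: {status}")
```

Running the script prints one status line per package. A "mismatch" does not necessarily prevent reproduction; it only flags a difference from the environment the paper reports.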