Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
FFN Fusion: Rethinking Sequential Computation in Large Language Models
Authors: Akhiad Bercovich, Mohammed Dabbah, Omri Puny, Ido Galil, Amnon Geifman, Yonatan Geifman, Izik Golan, Ehud Karpas, Itay Levy, Zach Moshe, Najeeb Nabwani, Tomer Ronen, Itamar Schen, Ido Shahaf, Oren Tropp, Ran Zilberstein, Ran El-Yaniv
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Applying these techniques to Llama-3.1-405B-Instruct, we create Llama-Nemotron-Ultra-253B-Base (Ultra-253B-Base), an efficient model that achieves a 1.71× speedup in inference latency and 35× lower per-token cost while maintaining strong performance across benchmarks. Through analysis and empirical validation, we argue for the conditions under which this fusion preserves model behavior, showing that these conditions are commonly satisfied in practice, especially in larger models where the potential efficiency gains are most significant. |
| Researcher Affiliation | Industry | NVIDIA, EMAIL |
| Pseudocode | No | The paper describes a "simple greedy algorithm" in Appendix B but it is presented in prose, not as structured pseudocode or an algorithm block. |
| Open Source Code | No | The creation of Ultra-253B-Base, a powerful 253B parameter model that matches or exceeds Llama-405B's capabilities while offering substantially improved efficiency, to be publicly released upon acceptance. |
| Open Datasets | Yes | The dataset comprises 224 billion tokens collected from three public datasets: FineWeb [34], Dolma [40], and Buzz-V1.2 [21]. |
| Dataset Splits | No | For applying Puzzle throughout our experiments, we used the same dataset mixture used in [3], termed Distillation Mix. ... For Ultra-253B-Base KD training we used the same data reinforced with synthetic data generated with Llama-405B, following the [48] approach... This involved a multistage distillation process: 54B tokens at 8k context, followed by 5B tokens each at 16k and 32k, and finally 0.8B tokens at 128k. The KD process improved MMLU and MT-bench scores to 85.17 and 9.10, respectively. ... This involved 73B tokens at a context length of 8k and another 15B tokens at a context length of 258k, and also yielded strong performance, even before instruction-based tuning (Figure 4). The text describes the data used for training and its properties (token counts, context length), but not how the entire dataset was split into explicit training, validation, and testing sets for model evaluation. |
| Hardware Specification | Yes | specifying that the derivative model must obtain a 1.5× latency speedup and fit within a single NVIDIA 8×H100 node (640 GB total), and in a single B100 GPU (192 GB). ... Table 2 details the user latency (tokens/second) achieved by Llama-405B, by 253B model (with and without FFN Fusion), and by Llama-3.3-70B, all under identical tensor parallel settings on a single 8×H100 node. Notably, Ultra-253B-Base is 1.71× faster than the parent for single-user decoding. On NVIDIA H200, its rate increases to 90.05 tokens/second. |
| Software Dependencies | No | Running entire blocks completely in parallel is not currently natively supported by heavily optimized inference frameworks such as TensorRT-LLM or vLLM [24]. Nevertheless, one can envision assigning each full block to a different GPU, thereby maximizing parallelism and potentially achieving significant speedups. We use our block-wise parallelization strategies using a more flexible environment (e.g., Hugging Face Transformers [46]). The paper mentions software tools used, but does not specify their version numbers. |
| Experiment Setup | Yes | We first run the standard Puzzle search on Llama-405B, specifying that the derivative model must obtain a 1.5× latency speedup and fit within a single NVIDIA 8×H100 node (640 GB total), and in a single B100 GPU (192 GB). ... To recover performance, we used KD as described in [3]. This involved a multistage distillation process: 54B tokens at 8k context, followed by 5B tokens each at 16k and 32k, and finally 0.8B tokens at 128k. ... This involved 73B tokens at a context length of 8k and another 15B tokens at a context length of 258k. ... a 35× lower per-token cost at batch size 32. |
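
The token budgets and memory figures quoted in the Experiment Setup and Hardware Specification rows can be checked with simple arithmetic. The sketch below is illustrative only: the names `KD_STAGES`, `SECOND_SCHEDULE`, `total_tokens_b`, and `per_gpu_memory_gb` are hypothetical and do not come from the paper. The label "second schedule" is a placeholder for the training run elided by "..." in the quoted text, and the memory check assumes the 640 GB is split evenly across the node's 8 GPUs.

```python
# Token budgets in billions, copied from the quoted Experiment Setup text.
# Stage names are the quoted context lengths; the structure is an assumption.
KD_STAGES = [("8k", 54.0), ("16k", 5.0), ("32k", 5.0), ("128k", 0.8)]

# The second quoted schedule (its name is elided in the source excerpt).
SECOND_SCHEDULE = [("8k", 73.0), ("258k", 15.0)]


def total_tokens_b(stages):
    """Sum the token budget (in billions) over all stages."""
    return round(sum(tokens for _, tokens in stages), 3)


def per_gpu_memory_gb(node_total_gb, gpus):
    """Memory per GPU, assuming the node total is split evenly."""
    return node_total_gb / gpus


kd_total = total_tokens_b(KD_STAGES)
second_total = total_tokens_b(SECOND_SCHEDULE)
h100_gb = per_gpu_memory_gb(640, 8)

print(f"KD distillation total: {kd_total}B tokens")
print(f"Second schedule total: {second_total}B tokens")
print(f"Implied memory per H100: {h100_gb} GB")
```

Under these assumptions, the distillation stages sum to 64.8B tokens, the second schedule to 88B tokens, and the 640 GB node total corresponds to 80 GB per GPU.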