Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation

Authors: Liliang Ren, Congcong Chen, Haoran Xu, Young Jin Kim, Adam Atkinson, Zheng Zhan, Jiankai Sun, Baolin Peng, Liyuan Liu, Shuohang Wang, Hao Cheng, Jianfeng Gao, Weizhu Chen, Yelong Shen

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Through extensive scaling experiments, we demonstrate that our model exhibits a significantly lower irreducible loss compared to a strong YOCO baseline, indicating superior performance scalability under large-scale compute regimes.
Researcher Affiliation	Collaboration	Microsoft; Stanford University
Pseudocode	No	The paper describes methods using mathematical formulas and text, such as the GMU equation and token mixing as a matrix operator. However, it does not contain any clearly labeled pseudocode or algorithm blocks with structured, step-by-step instructions in a code-like format.
Open Source Code	Yes	We release our training codebase on open-source data at https://github.com/microsoft/ArchScale.
Open Datasets	Yes	We use a 4K training sequence length and the SlimPajama [SAKM+23] dataset for all our scaling experiments. We evaluate the long-context retrieval capabilities of the models using a difficult Phonebook benchmark [JBKM24] with a 32K context length (containing 1,850 name-number pairs). Using the optimal sliding window size from the Phonebook benchmark, we evaluate our architectures on both long-context retrieval tasks (Table 1) and traditional downstream benchmarks (Table 2). Across both contexts, hybrid models with SSMs consistently outperform pure Transformer architectures. ... Table 1: Retrieval accuracy on Needle-In-A-Haystack (NIAH) tasks with 32K context from the RULER [HSK+24] long context benchmark. ... Our model achieves significantly better performance than the strong Phi4-mini-Reasoning baseline on challenging reasoning benchmarks such as Math500, AIME24/25, and GPQA Diamond.
Dataset Splits	Yes	We first study the data scaling behavior across architectures through fixing the model size at 1B parameters with d = 16 and scaling the number of training tokens T from 100B to 600B. We also study the FLOPs scaling behaviors of the model architectures with up to 3.4B parameters and 342B tokens through varying the model depth d = {8, 12, 16, 20, 24}. We use a 4K training sequence length and the SlimPajama [SAKM+23] dataset for all our scaling experiments. ... We evaluate the long-context retrieval capabilities of the models using a difficult Phonebook benchmark [JBKM24] with a 32K context length... Table 4: Pass@1 performance of models on reasoning benchmarks measured with a maximum generation length of 32K. We report Pass@1 accuracy averaged over 64 samples for AIME24/25 and 8 samples for Math500 and GPQA Diamond to ensure evaluation robustness.
Hardware Specification	Yes	We pre-train our model on 5T tokens from the data corpus used by Phi4-mini [MAA+25] on 1K A100-80GB GPUs for 14 days. ... delivers up to 10× higher decoding throughput on 2K-length prompts with 32K generation length under the vLLM [KLZ+23] inference framework. ... Throughput and latency of text generation with various architectures under the vLLM inference framework (using one A100-80GB GPU and no Tensor Parallelism). ... The training speed is measured in MTPS (Million Tokens Per Second) with 64 A100-80GB GPUs.
Software Dependencies	Yes	Our largest model enhanced with Differential Attention, Phi4-mini-Flash-Reasoning, achieves significantly better performance than Phi4-mini-Reasoning on reasoning tasks such as Math500, AIME24/25, and GPQA Diamond without any reinforcement learning, while delivering up to 10× higher decoding throughput on 2K-length prompts with 32K generation length under the vLLM [KLZ+23] inference framework. We customize the official vLLM framework with the version 0.7.3 to support our Phi4-mini-Flash architecture. ... We leverage the Math-Verify library (version 0.7.0) and Lighteval (version 0.10.0) to enable efficient and robust evaluation on reasoning tasks.
Experiment Setup	Yes	We use a simple linear rule from the previous works on Transformer models [KMH+20, TJY+24] for scaling the architectural shape of our Transformer++ baseline, including model width w, model depth d, number of attention query heads h_q and the MLP inner dimension w_mlp, i.e., w = αd, α = α_0 = 128, h_q = d, h_kv = d/4, w_mlp = 4w. ... Except for the learning rate, we fix other hyper-parameters of the AdamW optimizer with β_1 = 0.9, β_2 = 0.95, ϵ = 10^-8 and a weight decay of 0.1. A learning rate schedule is applied with 1B warm-up tokens linearly increasing to the peak learning rate η, followed by a linear decay to zero. The learning rate is further scaled as η ∝ 1/√d following Depth-µP. For studying the FLOPs scaling behavior across model architectures, we adopt the Chinchilla scaling law [HBM+22] to scale the number of training tokens T linearly with the number of model parameters. ... we mitigate by introducing label smoothing of 0.1 and attention dropout of 0.05.
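
The shape rule and learning rate schedule quoted in the Experiment Setup row can be illustrated with a short Python sketch. This is an illustration only, not code from the authors' repository. The names `Shape`, `transformer_shape`, and `learning_rate` are hypothetical. The parameters `base_lr` and `base_depth` are also assumptions: the quoted text gives only the proportionality η ∝ 1/√d, so the sketch pins the constant by choosing a reference depth at which the peak rate equals `base_lr`.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Shape:
    depth: int        # d
    width: int        # w = alpha * d
    query_heads: int  # h_q = d
    kv_heads: int     # h_kv = d / 4
    mlp_width: int    # w_mlp = 4 * w


def transformer_shape(depth: int, alpha: int = 128) -> Shape:
    """Linear shape rule quoted above: w = alpha*d, h_q = d, h_kv = d/4, w_mlp = 4w."""
    if depth % 4 != 0:
        raise ValueError("depth must be divisible by 4 so that h_kv = d/4 is an integer")
    width = alpha * depth
    return Shape(depth, width, depth, depth // 4, 4 * width)


def learning_rate(tokens_seen: float, total_tokens: float, depth: int,
                  base_lr: float, base_depth: int,
                  warmup_tokens: float = 1e9) -> float:
    """Linear warm-up over `warmup_tokens`, then linear decay to zero.

    The peak rate scales as 1/sqrt(depth). `base_lr` and `base_depth` are
    hypothetical reference values, not taken from the quoted text.
    """
    peak = base_lr * math.sqrt(base_depth / depth)
    if tokens_seen < warmup_tokens:
        return peak * tokens_seen / warmup_tokens
    remaining = max(total_tokens - tokens_seen, 0.0)
    return peak * remaining / (total_tokens - warmup_tokens)
```

For example, `transformer_shape(16)` gives a width of 2048 with 16 query heads, 4 key-value heads, and an MLP inner dimension of 8192. All depths listed in the Dataset Splits row (8, 12, 16, 20, 24) are divisible by 4, so the integer check never triggers for them.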