Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

Authors: Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, Tom Goldstein

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We train a proof-of-concept model from scratch with 3.5 billion parameters and 800 billion tokens. We show that this model can effortlessly use varying levels of compute, significantly improving with additional compute especially on reasoning tasks, such as math and coding. Further, this architecture naturally reduces compute costs via zero-shot per-token adaptive compute, KV-cache sharing and speculative decoding.
Researcher Affiliation Academia Jonas Geiping (1), Sean McLeish (2), Neel Jain (2), John Kirchenbauer (2), Siddharth Singh (2), Brian R. Bartoldson (3), Bhavya Kailkhura (3), Abhinav Bhatele (2), Tom Goldstein (2). (1) ELLIS Institute Tübingen, Max-Planck Institute for Intelligent Systems, Tübingen AI Center; (2) University of Maryland, College Park; (3) Lawrence Livermore National Laboratory.
Pseudocode No The paper describes the architecture in Section 3 and includes diagrams (Figure 2), but does not provide a formal pseudocode block or algorithm in a structured format.
Open Source Code Yes Justification: We provide our code with the supplementary material, which also includes the data processing scripts. We will also publish the processed training dataset and final model to ease reproduction and analysis.
Open Datasets Yes We list all data sources in Appendix F. Table 12: Datasets used for model pre-training (Part 1: Standard sources) ... smollm-fineweb-edu | HuggingFaceTB/smollm-corpus | odc-by | generic-text | 1.0 | Ben Allal et al. (2024)
Dataset Splits No In our material, we refer to the final checkpoint of this run as our "main model". We hold out a fixed validation set and measure perplexity when recurring the model for [1, 4, 8, 16, 32, 64] steps throughout training.
Hardware Specification Yes We train this model using compute time allocated on an HPE Cray EX supercomputer containing compute nodes with AMD MI250X GPUs, connected using a Slingshot dragonfly network. ... We execute a controlled vLLM throughput benchmark using random data with input length 512 and output length 512 using the V1 engine. We run the benchmark for 128 prompts and report output tokens/s measured on a single NVIDIA RTX 6000 Ada GPU, which we show in Table 11.
Software Dependencies Yes We train in bfloat16 mixed precision using a PyTorch-based implementation (Zamirai et al., 2021). ... ROCM 6.2.0, PyTorch 2.6 pre-release 11/02
Experiment Setup Yes Optimizer and Learning Rate Schedule. We train using the Adam optimizer with decoupled weight regularization (β1 = 0.9, β2 = 0.95, η = 5 × 10⁻⁴) (Kingma and Ba, 2015; Loshchilov and Hutter, 2017), modified to include update clipping (Wortsman et al., 2023b) and removal of the ε constant as in Everett et al. (2024). We clip gradients above 1. We train with warm-up and a constant learning rate (Zhai et al., 2022; Geiping and Goldstein, 2023), warming up to our maximal learning rate within the first 4096 steps.
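
The warm-up-then-constant schedule and the gradient clipping quoted in the Experiment Setup row can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names `learning_rate` and `clip_by_global_norm` are hypothetical, the linear shape of the warm-up is an assumption (the quoted text only says the rate warms up within the first 4096 steps), and reading "clip gradients above 1" as global-norm clipping with a threshold of 1 is also an assumption. The update clipping and ε removal mentioned in the quote are not modeled here.

```python
import math
from typing import List, Sequence


def learning_rate(step: int, peak_lr: float = 5e-4, warmup_steps: int = 4096) -> float:
    """Learning rate at a 0-indexed optimizer step.

    Assumption: the warm-up is linear. The source only states that the
    maximal rate (5e-4) is reached within the first 4096 steps.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    if step < warmup_steps:
        # Ramp up so that the last warm-up step reaches peak_lr exactly.
        return peak_lr * (step + 1) / warmup_steps
    # After warm-up the rate stays constant (no decay phase).
    return peak_lr


def clip_by_global_norm(grads: Sequence[float], max_norm: float = 1.0) -> List[float]:
    """Rescale a flat gradient vector so its L2 norm is at most max_norm.

    Assumption: "clip gradients above 1" refers to the global L2 norm.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)
    scale = max_norm / total_norm
    return [g * scale for g in grads]
```

Under these assumptions, `learning_rate(4095)` and every later step return the peak rate of 5e-4, and a gradient whose norm already lies below 1 passes through `clip_by_global_norm` unchanged.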