Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Multi-Token Prediction Needs Registers

Authors: Anastasios Gerontopoulos, Spyridon Gidaris, Nikos Komodakis

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We demonstrate the effectiveness and versatility of MuToR across a range of use cases, including supervised fine-tuning, parameter-efficient fine-tuning (PEFT), and pretraining, on challenging generative tasks in both language and vision domains. Our code is available at https://github.com/nasosger/MuToR.
Researcher Affiliation	Collaboration	Anastasios Gerontopoulos (1,3), Spyros Gidaris (2), Nikos Komodakis (1,3,4). Affiliations: (1) Archimedes, Athena Research Center; (2) valeo.ai; (3) University of Crete; (4) IACM Forth
Pseudocode	No	The paper describes the methodology using prose and mathematical equations (e.g., Equation 1, 2, 4, 6) and does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	Our code is available at https://github.com/nasosger/MuToR.
Open Datasets	Yes	Our experiments target three widely used mathematical reasoning benchmarks: GSM8K [Cobbe et al., 2021], MATH500 [Lightman et al., 2023], and AQUA-RAT [Ling et al., 2017]. For fine-tuning, we use curated subsets from OpenMathInstruct-2 [Toshniwal et al., 2025]... As for summarization, we target the following benchmarks: SAMSum [Gliwa et al., 2019] and DialogSum [Chen et al., 2021]... We train LlamaGen-B (111M parameters; Sun et al. 2024) on ImageNet [Deng et al., 2009]...
Dataset Splits	Yes	SAMSum consists of approximately 16K messenger-like conversations with summaries. They are split between the training set (14K samples), the validation set (818 samples) and the test set (819 samples). On the other hand, DialogSum contains approximately 14K conversation-summary pairs... These pairs are split between the training set (12.5K samples), the validation set (500 samples) and the test set (1.5K samples). GSM8K comprises approximately 8.7K grade-school level math problems, with around 1.3K of them forming the test set.
Hardware Specification	Yes	All experiments with Gemma 2B model are run using a single A100 GPU and gradient accumulation. For the experiments with the Llama 3 8B model, we utilize 5 A100 and Fully Sharded Data Parallelism (FSDP). All experiments are run using 8 H100 GPUs and the Distributed Data Parallel (DDP) framework.
Software Dependencies	No	The paper mentions using 'AdamW optimizer' and 'bfloat16 precision' but does not specify version numbers for key software components such as programming languages, deep learning frameworks (e.g., PyTorch, TensorFlow), or other relevant libraries.
Experiment Setup	Yes	We finetune all models for 5 epochs, using AdamW optimizer [Loshchilov and Hutter, 2017] without weight decay and a batch size of 10. We also employ a learning rate scheduler with linear decay and warmup, setting the peak learning rate to be 5e-5 for Gemma 2B and 2e-5 for Llama 3 8B... Both Next-Token baseline and MuToR are trained for 360K update steps, using AdamW optimizer with β1 = 0.9, β2 = 0.95 and weight decay = 0.05. We employ a constant learning rate, equal to 0.0004, and a batch size of 1024.
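
The fine-tuning schedule quoted in the Experiment Setup row (5 epochs, batch size 10, linear warmup followed by linear decay, peak learning rate 5e-5 for Gemma 2B) can be sketched in plain Python as below. This is an illustrative sketch only, not the authors' code: the row does not state the warmup length, the exact training-set size, or whether the batch size of 10 is reached through gradient accumulation, so `HYPOTHETICAL_TRAIN_SAMPLES`, `HYPOTHETICAL_WARMUP_STEPS`, and the helper names are assumptions introduced here.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FinetuneSetup:
    """Values quoted in the Experiment Setup row for the fine-tuning runs."""
    epochs: int = 5
    batch_size: int = 10
    weight_decay: float = 0.0  # "without weight decay"
    peak_lr: float = 5e-5      # Gemma 2B; the row gives 2e-5 for Llama 3 8B


def total_update_steps(num_train_samples: int, setup: FinetuneSetup) -> int:
    """Optimizer steps over all epochs, assuming one update per batch."""
    return math.ceil(num_train_samples / setup.batch_size) * setup.epochs


def linear_warmup_decay_lr(step: int, total_steps: int,
                           peak_lr: float, warmup_steps: int) -> float:
    """Learning rate at `step`: linear ramp from 0 to peak_lr over the
    warmup steps, then linear decay back to 0 at total_steps."""
    if step < 0 or total_steps <= 0 or not 0 <= warmup_steps < total_steps:
        raise ValueError("need step >= 0 and 0 <= warmup_steps < total_steps")
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


# Hypothetical values, NOT taken from the paper: a 14K-sample training set
# and a warmup covering the first 10% of the updates.
HYPOTHETICAL_TRAIN_SAMPLES = 14_000
HYPOTHETICAL_WARMUP_STEPS = 700

setup = FinetuneSetup()
steps = total_update_steps(HYPOTHETICAL_TRAIN_SAMPLES, setup)
for s in (0, HYPOTHETICAL_WARMUP_STEPS, steps // 2, steps):
    lr = linear_warmup_decay_lr(s, steps, setup.peak_lr, HYPOTHETICAL_WARMUP_STEPS)
    print(f"step {s:>5}: lr = {lr:.2e}")
```

The pretraining runs quoted in the same row use a constant learning rate of 0.0004 instead, so no schedule function is needed there.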