Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation

Authors: Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Aaron C. Courville, Se-Young Yun

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive empirical validation: Across models from 135M to 1.7B parameters under equal compute budgets, MoR establishes a new Pareto frontier by improving validation loss and few-shot accuracy relative to vanilla and recursive baselines (§3.1, §3.2). 3 Experiments; 4 Ablation Studies
Researcher Affiliation | Collaboration | Sangmin Bae¹, Yujin Kim¹, Reza Bayat², Sungnyun Kim¹, Jiyoun Ha³, Tal Schuster⁴, Adam Fisch⁴, Hrayr Harutyunyan⁵, Ziwei Ji⁴, Aaron Courville²,⁶, Se-Young Yun¹. ¹KAIST AI, ²Mila, ³Google Cloud, ⁴Google DeepMind, ⁵Google Research, ⁶Université de Montréal
Pseudocode | No | The paper describes algorithms in text and provides figures (e.g., Figure 1, Figure 2) illustrating the architecture and routing mechanisms, but it does not contain a clearly labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code | Yes | https://github.com/raymin0223/mixture_of_recursions
Open Datasets | Yes | We pretrain our models from scratch using a Llama-based Transformer architecture [65], referring to the configurations of SmolLM open-source models [4], on a deduplicated subset of the FineWeb-Edu dataset [75] in SmolLM-Corpus [7].
Dataset Splits | Yes | We pretrain our models from scratch... on a deduplicated subset of the FineWeb-Edu dataset [75] in SmolLM-Corpus [7]. We evaluate the models on validation set of FineWeb-edu and six few-shot benchmarks [26]. ... We adhered to the standard number of shots for each dataset, and used the continuation task specifically for MMLU for simplicity.
Hardware Specification | Yes | Pretraining was conducted using four H100 or A100 GPUs. All evaluation performance measurements were conducted using a single H100 or A100 GPU.
Software Dependencies | No | FlashAttention 2 [17] to support variable-length KV caches within a batch. We adopt a static-sized cache where each position is updated over time, since this is compatible with torch.compile [73]. A specific version number for PyTorch is not explicitly stated.
Experiment Setup | Yes | We utilized a Llama-based Transformer architecture [65], referring to the configurations of the open-source SmolLM models [4]... Pretraining was conducted using four H100 or A100 GPUs. In our main and isoFLOPs analysis experiments, we utilized a Trapezoid learning rate scheduler, which consists of warmup (about 5%), stable, and cooldown (20%) phases. ... In contrast, for all other experiments, we used a simple cosine annealing scheduler. (and Table 6 for model architecture details)
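
The Experiment Setup entry names two learning rate schedules: a trapezoid schedule with warmup (about 5%), stable, and cooldown (20%) phases, and a simple cosine annealing schedule. The following is a minimal sketch of what such schedules might look like, not the authors' implementation. The function names, the linear shape of the warmup and cooldown ramps, the zero floor, and the absence of warmup in the cosine variant are all assumptions made for illustration; only the phase fractions come from the text above.

```python
import math


def trapezoid_lr(step, total_steps, peak_lr,
                 warmup_frac=0.05, cooldown_frac=0.20, min_lr=0.0):
    """Hypothetical trapezoid schedule: ramp up, hold, ramp down.

    The 5% / 20% fractions follow the report; linear ramps are an assumption.
    """
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")
    warmup_steps = round(total_steps * warmup_frac)
    cooldown_steps = round(total_steps * cooldown_frac)
    cooldown_start = total_steps - cooldown_steps

    # Warmup phase: linear increase from 0 to peak_lr.
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Cooldown phase: linear decrease from peak_lr to min_lr.
    if cooldown_steps > 0 and step > cooldown_start:
        progress = (step - cooldown_start) / cooldown_steps
        return peak_lr + (min_lr - peak_lr) * progress
    # Stable phase: constant peak learning rate.
    return peak_lr


def cosine_lr(step, total_steps, peak_lr, min_lr=0.0):
    """Plain cosine annealing from peak_lr to min_lr (no warmup assumed)."""
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")
    cosine = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return min_lr + (peak_lr - min_lr) * cosine


if __name__ == "__main__":
    # Print the learning rate at a few points of a 1000-step run.
    for s in (0, 25, 50, 500, 800, 900, 1000):
        print(s, trapezoid_lr(s, 1000, 3e-3), cosine_lr(s, 1000, 3e-3))
```

With 1000 total steps, this sketch spends steps 0 to 50 in warmup, holds the peak rate until step 800, and decays over the final 200 steps. Either function could be wrapped in a framework scheduler that accepts a per-step multiplier by calling it with `peak_lr=1.0`.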