Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Nested Learning: The Illusion of Deep Learning Architectures

Authors: Ali Behrouz, Meisam Razaviyayn, Peilin Zhong, Vahab Mirrokni

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Table 1: Performance of HOPE and baselines on language modeling and common-sense reasoning tasks. 4 Experiments. For the sake of space, in the main paper, we report the results of HOPE's evaluation on language modeling and common-sense reasoning tasks. Language Modeling and Common-sense Reasoning. We follow recent sequence modeling studies [28, 67, 68] and report the results of HOPE and baselines with sizes of 760M and 1.3B on language modeling and common-sense reasoning downstream tasks [69-75]. These results are reported in Table 1. HOPE demonstrates very good performance across all scales and benchmark tasks, outperforming both Transformers and recent modern recurrent neural networks, including DeltaNet [63] and Titans [28].
Researcher Affiliation	Industry	Ali Behrouz, Google Research, USA, EMAIL; Meisam Razaviyayn, Google Research, USA, EMAIL; Peilin Zhong, Google Research, USA, EMAIL; Vahab Mirrokni, Google Research, USA, EMAIL
Pseudocode	No	The paper contains mathematical formulations and equations, but no explicitly labeled pseudocode or algorithm blocks with structured steps like code.
Open Source Code	No	Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: In the paper, we provide all the details needed to produce the results, including the details of the implementation. All the datasets used in this paper are publicly available.
Open Datasets	Yes	Datasets: We evaluate HOPE and baselines on Wikitext [69], LMB [70], PIQA [71], HellaSwag [72], WinoGrande [73], ARC-easy (ARC-e) and ARC-challenge (ARC-c) [74], SIQA [75], and BoolQ [117] benchmarks. Baselines: As for the baselines, we use RetNet [61] and DeltaNet [63] as the representatives of the models that are purely based on the Hebbian or Delta rule, and two modern matrix-valued recurrent models with the best performance compared to others, i.e., RWKV-7 [118] and Comba [67]. As another group of baselines, we compare with attention-free deep memory modules with diverse internal attentional bias of dot-product, L2, and Lp regression, i.e., TTT [65], Miras [58], DLA [68], and Titans [28]. Finally, we also compare with Transformers [27] as well as the hybrid of attention and linear RNN, Samba [119]. Training: We train models with about 760M and 1.3B parameters, trained with 30B and 100B tokens, respectively, from a mixture of FineWeb-Edu [120] and long-context documents with a vocabulary size of 32K to train all the models from scratch.
Dataset Splits	No	The paper lists several datasets used for evaluation but does not explicitly state the training, validation, and test splits (e.g., percentages or sample counts) within the provided text. While it refers to 'common-sense reasoning downstream tasks' and 'standard next-token prediction for language modeling', it does not specify how the data was partitioned for these tasks.
Hardware Specification	Yes	Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: We have used TPUv5 to perform all the experiments in the paper.
Software Dependencies	No	All models are trained with standard next-token prediction for language modeling, optimized using AdamW with tuned learning rate for each model, and with the default optimizer configuration as in Behrouz et al. [28]. The paper mentions 'AdamW' and refers to a prior work for default optimizer configuration but does not specify any software versions (e.g., Python, PyTorch, TensorFlow versions, or specific versions of AdamW).
Experiment Setup	Yes	Table 2: Architectural Details. Columns: Model, Block, Dim, Head, Peak LR, Token. 170M: 12, 768, 16, 3e-3, 15B; 340M: 24, 1024, 16, 1.5e-3, 15B; 760M: 24, 1536, 16, 1.25e-3, 30B; 1.3B: 18, 2048, 8, 7e-4, 100B. Training: We train models with about 760M and 1.3B parameters, trained with 30B and 100B tokens, respectively, from a mixture of FineWeb-Edu [120] and long-context documents with a vocabulary size of 32K to train all the models from scratch. All models are trained with standard next-token prediction for language modeling, optimized using AdamW with tuned learning rate for each model, and with the default optimizer configuration as in Behrouz et al. [28].
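
For readers who want the Experiment Setup values in machine-readable form, the following is a minimal sketch that transcribes the Table 2 figures quoted in the row above into a Python structure. The names `ModelConfig`, `TABLE_2`, `get_config`, and `head_dim` are hypothetical and do not come from the paper or any released code. The per-head dimension is computed under the assumption that the model dimension is split evenly across heads, which the quoted text does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    """One row of Table 2 as quoted in the Experiment Setup entry."""
    name: str            # model size label, e.g. "760M"
    blocks: int          # "Block" column
    dim: int             # "Dim" column
    heads: int           # "Head" column
    peak_lr: float       # "Peak LR" column
    train_tokens: float  # "Token" column, in tokens (15B -> 15e9)


# Values transcribed from the quoted Table 2; nothing here is inferred.
TABLE_2 = [
    ModelConfig("170M", 12, 768, 16, 3e-3, 15e9),
    ModelConfig("340M", 24, 1024, 16, 1.5e-3, 15e9),
    ModelConfig("760M", 24, 1536, 16, 1.25e-3, 30e9),
    ModelConfig("1.3B", 18, 2048, 8, 7e-4, 100e9),
]


def get_config(name: str) -> ModelConfig:
    """Look up a Table 2 row by its model size label."""
    for cfg in TABLE_2:
        if cfg.name == name:
            return cfg
    raise KeyError(f"no Table 2 entry named {name!r}")


def head_dim(cfg: ModelConfig) -> int:
    """Per-head width, ASSUMING dim is divided evenly across heads.

    This split is an illustrative assumption, not a detail from the paper.
    """
    if cfg.dim % cfg.heads != 0:
        raise ValueError("dim is not divisible by heads")
    return cfg.dim // cfg.heads


if __name__ == "__main__":
    for cfg in TABLE_2:
        print(cfg.name, cfg.blocks, cfg.dim, cfg.heads, cfg.peak_lr, cfg.train_tokens)
```

Only the 760M and 1.3B rows correspond to the training runs described in the quoted Training paragraph; the optimizer settings beyond the peak learning rate are deferred to Behrouz et al. [28] and are therefore left out of the sketch.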