Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Blending Complementary Memory Systems in Hybrid Quadratic-Linear Transformers

Authors: Kazuki Irie, Morris Yau, Samuel J Gershman

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We conduct experiments on general language modeling and retrieval tasks by training 340M- and 1.3B-parameter models from scratch, as well as on synthetic algorithmic tasks designed to precisely illustrate the benefits of certain hybrid methods over others. We also evaluate our hybrid memory systems on reinforcement learning in partially observable environments.
Researcher Affiliation	Academia	Kazuki Irie (1), Morris Yau (2), Samuel J. Gershman (1, 3). (1) Department of Psychology and Center for Brain Science, Harvard University, Cambridge, MA, USA; (2) MIT CSAIL, Cambridge, MA, USA; (3) Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University, Cambridge, MA, USA.
Pseudocode	No	The paper describes algorithms and models using mathematical equations (e.g., Eq. 1-17) but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	Our code is public: https://github.com/kazuki-irie/hybrid-memory.
Open Datasets	Yes	We conduct experiments to test general language modeling and in-context retrieval abilities (using the standard lm-evaluation-harness [19]), by training 340M- and 1.3B-parameter language models from scratch using 15B tokens of the Hugging Face FineWeb-Edu dataset [20]. We evaluate the trained models through two perplexity evaluation settings on WikiText-2 [39] (Wiki.) and LAMBADA (LMB.) [40], and six zero-shot common sense reasoning tasks: PiQA [41], HellaSwag (Hella.) [42], WinoGrande [43] (Wino.), ARC-easy (ARC-e) and ARC-challenge (Arc-c) [44]. ... FDA [45], SWDE [46], and SQuAD [47] tasks... The training data we used can be easily downloaded from Hugging Face: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu. Evaluation was also done using the publicly accessible lm-evaluation-harness [19].
Dataset Splits	Yes	We train with sequence lengths from 3 to 40, and validate on sequences of lengths from 40 to 256. Each training run on a single H100 takes about 70 min.
Hardware Specification	Yes	Training of 340M models using 4 H100-80GB GPUs take about 8 hours for the baseline transformer and 10 hours for DeltaNet and all the HQLT models with the window size from 64 to 1024 tokens. ... Each training run on a single H100 takes about 70 min.
Software Dependencies	No	The paper mentions software tools like `fla` [24], `flame` [67], `lm-evaluation-harness` [19], and `Adam optimizer` [72], but it does not specify concrete version numbers for these software components.
Experiment Setup	Yes	Table 5: Hyper-parameters of language models. Model 340M 1.3B Number of layers 24 Feedforward block multiplier 4 Total hidden size 1024 2048 Number of heads 8 16 Sequence length 2048 2240 Effective Batch size 64 Learning rate 1e-3 Warmup steps 1024 Minimum learning rate 0.1 Max norm clipping 1.0 Std. of weight initializers 0.02. ... We train with an effective batch size of 64 per GPU with a sequence length of 2048... We search for the best learning rate among {5e-3, 1e-3, 5e-4, 1e-4}... We use a batch size of 64 and a learning rate of 3e-4 for all the system components, using the Adam optimizer [72]. We apply a scale of 0.1 on the entropy term in the loss. Importantly, we use dropout with a dropping rate of 0.1 inside the sequence models, including on the observation embeddings.
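
The Table 5 values quoted in the Experiment Setup row can be written down as a small configuration object, which makes the two model sizes easier to compare. The following is only a sketch under stated assumptions, not the authors' configuration format: the class name `LMConfig`, its field names, and the `CONFIGS` dictionary are hypothetical; rows that list a single value are read as shared by both models; and the per-head and feedforward sizes are derived arithmetic (hidden size split evenly across heads, multiplier applied to the hidden size), not values stated in the table. Sequence length and batch size are left out because the flattened row does not make clear which value belongs to which model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LMConfig:
    """Subset of the quoted Table 5 hyper-parameters (hypothetical field names)."""

    name: str
    num_layers: int
    ff_multiplier: int
    hidden_size: int
    num_heads: int
    # Assumption: rows with a single value apply to both model sizes.
    learning_rate: float = 1e-3
    warmup_steps: int = 1024
    max_grad_norm: float = 1.0
    init_std: float = 0.02

    @property
    def head_dim(self) -> int:
        # Assumption: the total hidden size is split evenly across heads.
        if self.hidden_size % self.num_heads != 0:
            raise ValueError("hidden_size must be divisible by num_heads")
        return self.hidden_size // self.num_heads

    @property
    def ff_size(self) -> int:
        # Assumption: the feedforward multiplier scales the hidden size.
        return self.ff_multiplier * self.hidden_size


CONFIGS = {
    "340M": LMConfig("340M", num_layers=24, ff_multiplier=4,
                     hidden_size=1024, num_heads=8),
    "1.3B": LMConfig("1.3B", num_layers=24, ff_multiplier=4,
                     hidden_size=2048, num_heads=16),
}

if __name__ == "__main__":
    for cfg in CONFIGS.values():
        print(cfg.name, "head_dim:", cfg.head_dim, "ff_size:", cfg.ff_size)
```

Under these assumptions both sizes end up with the same per-head dimension, since the hidden size and the number of heads double together.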