Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].

Titans: Learning to Memorize at Test Time

Authors: Ali Behrouz, Peilin Zhong, Vahab Mirrokni

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our experimental results on language modeling, common-sense reasoning, and time series tasks show that Titans are effective compared to baselines, while they can effectively scale to larger context window in needle-in-haystack tasks.
Researcher Affiliation	Industry	Ali Behrouz, Google Research, USA; Peilin Zhong, Google Research, USA; Vahab Mirrokni, Google Research, USA
Pseudocode	No	The paper describes methods and formulas but does not include explicit pseudocode or algorithm blocks. For example, it provides mathematical equations like Equation 3 for memory update and Equation 15 for attention calculation, but these are not structured as pseudocode.
Open Source Code	No	We plan to provide data and code via Github after the paper is made publicly available online.
Open Datasets	Yes	While the first three are trained on 15B tokens sampled from FineWeb-Edu dataset [47], the last two are trained on 30B and 100B tokens from the same dataset... Following recent studies on linear recurrent models [35, 30, 38], we use Wikitext [123], LMB [124], PIQA [125], HellaSwag [126], WinoGrande [127], ARC-easy (ARC-e) and ARC-challenge (ARC-c) [128], SIQA [129], and BoolQ [130]... In this part, we use Single NIAH (S-NIAH) task from RULER benchmark [62]... train them on a subset of the Pile dataset [63]... on DNA modeling tasks... Genomics Benchmarks [136]... time series forecasting benchmark datasets ETT, ECL, Traffic, and Weather [6].
Dataset Splits	Yes	We follow the original experimental setup and training process in the benchmark [53]... In this part, we use Single NIAH (S-NIAH) task from RULER benchmark [62] and evaluate Titans and baselines on sequences with length 2K, 4K, 8K, and 16K... use training length of 4K tokens (2K for SWA).
Hardware Specification	No	To fully take advantage of hardware accelerators (e.g., TPUs, GPUs), we need to tensorize the process and use more matmuls.
Software Dependencies	No	In the training, we follow the training procedure of Yang et al. [35], and use Llama 2 tokenizer with a vocabulary size of 32K and use training length of 4K tokens (2K for SWA). We fixed the persistent memory size (# tokens) to 128, and use 256 memory tokens to encode the past data (i.e., output of the long-term memory). We employ AdamW optimizer with learning rate of 4e-4 with cosine annealing schedule with batch size of 0.5M tokens, and weight decay of 0.1.
Experiment Setup	Yes	In the training, we follow the training procedure of Yang et al. [35], and use Llama 2 tokenizer with a vocabulary size of 32K and use training length of 4K tokens (2K for SWA). We fixed the persistent memory size (# tokens) to 128, and use 256 memory tokens to encode the past data (i.e., output of the long-term memory). We employ AdamW optimizer with learning rate of 4e-4 with cosine annealing schedule with batch size of 0.5M tokens, and weight decay of 0.1. We use: (1) chunk size: which is 16; (2) segment size: in which, we follow previous studies and use 2048 as sliding window (when exists) and 512 as segment size... For the memory architecture, we use an MLP with L_M layers (default is L_M = 2) with expansion factor of 4 and GELU activation function [42]... We report the number of blocks, heads, size of hidden dimension, and peak of learning rate in Table 5.
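
The hyperparameters quoted in the Experiment Setup row can be gathered into a small configuration object, which may help when checking a reimplementation against the reported setup. The sketch below is illustrative only and is not the authors' code: the names `TrainConfig` and `cosine_lr` are hypothetical, the excerpt does not state the total number of steps, the minimum learning rate, or any warmup, so `total_steps` and `min_lr` are placeholder assumptions, and the exact integer meanings of "0.5M" and "4K" are also assumed.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    # Values quoted in the Experiment Setup row.
    peak_lr: float = 4e-4
    weight_decay: float = 0.1
    persistent_memory_tokens: int = 128
    memory_tokens: int = 256
    chunk_size: int = 16
    segment_size: int = 512
    sliding_window: int = 2048
    memory_layers: int = 2        # L_M, the depth of the memory MLP
    expansion_factor: int = 4
    # Assumed integer readings of "0.5M tokens" and "4K tokens".
    batch_tokens: int = 500_000
    train_length: int = 4096
    # NOT stated in the quoted text: hypothetical placeholders.
    total_steps: int = 1000
    min_lr: float = 0.0


def cosine_lr(step: int, cfg: TrainConfig) -> float:
    """Cosine annealing from peak_lr down to min_lr over total_steps (no warmup)."""
    clamped = min(max(step, 0), cfg.total_steps)
    progress = clamped / cfg.total_steps
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return cfg.min_lr + (cfg.peak_lr - cfg.min_lr) * cosine


if __name__ == "__main__":
    cfg = TrainConfig()
    for step in (0, cfg.total_steps // 2, cfg.total_steps):
        print(step, cosine_lr(step, cfg))
```

The schedule starts at the quoted peak learning rate and decays to `min_lr` at the final step. Per-model values such as the number of blocks, heads, and hidden dimension are reported in Table 5 of the paper and are deliberately left out here.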