Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Titans: Learning to Memorize at Test Time
Authors: Ali Behrouz, Peilin Zhong, Vahab Mirrokni
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results on language modeling, common-sense reasoning, and time series tasks show that Titans are effective compared to baselines, while they can effectively scale to larger context window in needle-in-haystack tasks. |
| Researcher Affiliation | Industry | Ali Behrouz (Google Research, USA; EMAIL), Peilin Zhong (Google Research, USA; EMAIL), Vahab Mirrokni (Google Research, USA; EMAIL) |
| Pseudocode | No | The paper describes methods and formulas but does not include explicit pseudocode or algorithm blocks. For example, it provides mathematical equations like Equation 3 for memory update and Equation 15 for attention calculation, but these are not structured as pseudocode. |
| Open Source Code | No | We plan to provide data and code via Github after the paper is made publicly available online. |
| Open Datasets | Yes | While the first three are trained on 15B tokens sampled from FineWeb-Edu dataset [47], the last two are trained on 30B and 100B tokens from the same dataset... Following recent studies on linear recurrent models [35, 30, 38], we use Wikitext [123], LMB [124], PIQA [125], HellaSwag [126], WinoGrande [127], ARC-easy (ARC-e) and ARC-challenge (ARC-c) [128], SIQA [129], and BoolQ [130]... In this part, we use Single NIAH (S-NIAH) task from RULER benchmark [62]... train them on a subset of the Pile dataset [63]... on DNA modeling tasks... Genomics Benchmarks [136]... time series forecasting benchmark datasets ETT, ECL, Traffic, and Weather [6]. |
| Dataset Splits | Yes | We follow the original experimental setup and training process in the benchmark [53]... In this part, we use Single NIAH (S-NIAH) task from RULER benchmark [62] and evaluate Titans and baselines on sequences with length 2K, 4K, 8K, and 16K... use training length of 4K tokens (2K for SWA). |
| Hardware Specification | No | To fully take advantage of hardware accelerators (e.g., TPUs, GPUs), we need to tensorize the process and use more matmuls. |
| Software Dependencies | No | In the training, we follow the training procedure of Yang et al. [35], and use Llama 2 tokenizer with a vocabulary size of 32K and use training length of 4K tokens (2K for SWA). We fixed the persistent memory size (# tokens) to 128, and use 256 memory tokens to encode the past data (i.e., output of the long-term memory). We employ AdamW optimizer with learning rate of 4e-4 with cosine annealing schedule with batch size of 0.5M tokens, and weight decay of 0.1. |
| Experiment Setup | Yes | In the training, we follow the training procedure of Yang et al. [35], and use Llama 2 tokenizer with a vocabulary size of 32K and use training length of 4K tokens (2K for SWA). We fixed the persistent memory size (# tokens) to 128, and use 256 memory tokens to encode the past data (i.e., output of the long-term memory). We employ AdamW optimizer with learning rate of 4e-4 with cosine annealing schedule with batch size of 0.5M tokens, and weight decay of 0.1. We use: (1) chunk size: which is 16; (2) segment size: in which, we follow previous studies and use 2048 as sliding window (when exists) and 512 as segment size... For the memory architecture, we use an MLP with L_M layers (default is L_M = 2) with expansion factor of 4 and GELU activation function [42]... We report the number of blocks, heads, size of hidden dimension, and peak of learning rate in Table 5. |
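
The training hyperparameters quoted in the Experiment Setup row can be turned into a rough step count and learning-rate curve. The following is a minimal sketch, not the authors' training code: it assumes a single pass over the 15B-token budget, no warmup, and a final learning rate of zero, none of which are stated in the quoted text. The function names `total_steps` and `cosine_lr` are hypothetical.

```python
import math

# Values quoted in the table above.
PEAK_LR = 4e-4                   # "learning rate of 4e-4"
BATCH_TOKENS = 500_000           # "batch size of 0.5M tokens"
TRAIN_TOKENS = 15_000_000_000    # "15B tokens" (smallest training budget quoted)


def total_steps(train_tokens: int, batch_tokens: int) -> int:
    """Optimizer steps needed to consume the token budget once (assumed single pass)."""
    return train_tokens // batch_tokens


def cosine_lr(step: int, num_steps: int,
              peak_lr: float = PEAK_LR, min_lr: float = 0.0) -> float:
    """Cosine-annealed learning rate.

    Assumptions: no warmup phase and min_lr = 0; the quoted text specifies neither.
    """
    # Clamp progress to [0, 1] so out-of-range steps do not leave the schedule.
    progress = min(max(step / num_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    steps = total_steps(TRAIN_TOKENS, BATCH_TOKENS)
    for s in (0, steps // 2, steps):
        print(f"step {s:>6}: lr = {cosine_lr(s, steps):.2e}")
```

Under these assumptions the 15B-token runs correspond to 30,000 optimizer steps; the 30B and 100B budgets would scale the step count proportionally if the batch size is unchanged.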