Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Learning to (Learn at Test Time): RNNs with Expressive Hidden States
Authors: Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, Tatsunori Hashimoto, Carlos Guestrin
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our instantiations at the scale of 125M to 1.3B parameters, comparing with a strong Transformer and Mamba, a modern RNN. Similar to Transformer, TTT-Linear and TTT-MLP can keep reducing perplexity by conditioning on more tokens, while Mamba cannot after 16k context. TTT-MLP still faces challenges in memory I/O, but shows larger potential in long context, pointing to a promising direction for future research. [...] We perform standard experiments with 2k and 8k context lengths on the Pile (Gao et al., 2020)... |
| Researcher Affiliation | Collaboration | 1Stanford University 2UC San Diego 3UC Berkeley 4Meta AI. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. Figure 2 and Figure 3 show conceptual diagrams of sequence modeling layers and computation graphs, respectively, and Appendix A provides mathematical derivations for the dual form, but none are presented as pseudocode or an algorithm. |
| Open Source Code | No | The paper states: "Our main codebase is based on EasyLM (Geng, 2023), an open-source project for training and serving LLMs in JAX." This indicates the authors *used* an open-source project, not that their specific implementation for TTT-Linear and TTT-MLP is open-source. While "Menghao Guo for help with code release" is mentioned in acknowledgments, it is not an explicit statement that the code for *this paper's methodology* has been released or is available. |
| Open Datasets | Yes | Following the Mamba paper (Gu & Dao, 2023), we perform standard experiments with 2k and 8k context lengths on the Pile (Gao et al., 2020), a popular dataset of documents for training open-source LLMs (Black et al., 2022). However, the Pile contains few sequences of length greater than 8k (de Vries, 2023). To evaluate capabilities in long context, we also experiment with context lengths ranging from 1k to 32k in ×2 increments, on a subset of the Pile called Books3, which has been widely used to train LLMs in long context (Liu et al., 2024). |
| Dataset Splits | Yes | Following the Mamba paper (Gu & Dao, 2023), we perform standard experiments with 2k and 8k context lengths on the Pile (Gao et al., 2020)... All models are trained with the Chinchilla recipe described in the Mamba paper and reproduced in our Appendix C. [...] Transformer finetuning. Finetuning starts a new cosine schedule with the same optimization hyper-parameters as training from scratch... This baseline starts from the model trained (according to the Chinchilla recipe) on Books 2k, then uses 20% more tokens to finetune at the designated context length, following the Llama Long paper (Xiong et al., 2023). |
| Hardware Specification | Yes | Due to resource constraints, our experiments are written in JAX and run on TPUs. On a v5e-256 TPU pod, the Transformer baseline takes 0.30s per iteration of training at context 2k, while TTT-Linear takes 0.27s per iteration, already 10% faster without any systems optimization. However, Mamba (implemented in PyTorch, Triton, and CUDA) can only run on GPUs, so for fair comparison, we also rewrite our method into GPU kernels. [...] Figure 8 shows the latency of our inference kernel for forward (prefill) and generate (decode). All models are 1.3B (1.4B for Mamba). As expected, time per token grows linearly for Transformer as the context length increases, but stays roughly constant for the other methods. Note that our Transformer baseline is significantly faster than in the Mamba paper, because we use vLLM (Kwon et al., 2023), a state-of-the-art serving system, instead of Hugging Face Transformers (Wolf et al., 2019). Figure 8. Latency on an NVIDIA A100 GPU with 80G HBM and PCIe connections. |
| Software Dependencies | No | Our main codebase is based on EasyLM (Geng, 2023), an open-source project for training and serving LLMs in JAX. [...] Mamba (implemented in PyTorch, Triton, and CUDA) can only run on GPUs... [...] we use vLLM (Kwon et al., 2023)... |
| Experiment Setup | Yes | Our training configurations are in Table 2, which simply reproduces Table 12 in the Mamba paper. All models are trained with a batch size of 0.5M tokens regardless of context length. All of our optimization hyper-parameters follow the improved recipe in Appendix E.2 of the Mamba paper, reproduced below: AdamW optimizer: β = (0.9, 0.95); cosine schedule: decay to end learning rate 1e-5; linear learning rate warmup over 10% of the training steps; weight decay: 0.1; gradient clipping: 1.0; mixed precision. [...] The inner-loop base learning rate ηbase is set to 1 for TTT-Linear and 0.1 for TTT-MLP. Our heuristic for setting ηbase is similar to how people set the outer-loop learning rate for regular training: We tried ηbase ∈ {0.01, 0.1, 1, 10} and used the largest value that does not cause instabilities. For TTT-MLP, we use linear warmup for ηbase over 10% of the training steps, similar to regular training. |
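The outer-loop recipe quoted in the Experiment Setup row (linear warmup over the first 10% of steps, then cosine decay to an end learning rate of 1e-5) can be sketched as a standalone schedule function. This is a minimal illustrative sketch, not the authors' code: the peak learning rate and total step count passed in are hypothetical placeholders, since the per-model values live in the paper's Table 2 and are not quoted above.

```python
import math

def learning_rate(step: int, total_steps: int, peak_lr: float,
                  end_lr: float = 1e-5) -> float:
    """Cosine schedule with linear warmup, per the quoted recipe:
    warmup over the first 10% of training steps, then cosine decay
    from peak_lr down to an end learning rate of 1e-5."""
    warmup_steps = max(1, int(0.1 * total_steps))
    if step < warmup_steps:
        # Linear warmup from 0 to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Cosine decay from peak_lr to end_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return end_lr + 0.5 * (peak_lr - end_lr) * (1 + math.cos(math.pi * progress))

# Hypothetical values for illustration only (not from Table 2):
if __name__ == "__main__":
    for step in (0, 100, 500, 1000):
        print(step, learning_rate(step, total_steps=1000, peak_lr=3e-3))
```

Weight decay of 0.1, gradient clipping at 1.0, and the AdamW betas (0.9, 0.95) would be applied by the optimizer itself rather than by the schedule.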