Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
RNNs are not Transformers (Yet): The Key Bottleneck on In-Context Retrieval
Authors: Kaiyue Wen, Xingyu Dang, Kaifeng Lyu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our theoretical analysis reveals that CoT improves RNNs but is insufficient to close the gap with Transformers. A key bottleneck lies in the inability of RNNs to perfectly retrieve information from the context, even with CoT... We validate our theory on synthetic and natural language experiments. |
| Researcher Affiliation | Academia | Kaiyue Wen (Stanford University), Xingyu Dang (Tsinghua University), Kaifeng Lyu (University of California, Berkeley) |
| Pseudocode | Yes | Algorithm 1: Depth-First Search Algorithm; Algorithm 2: Depth-First Search Algorithm with Retrieving |
| Open Source Code | No | The paper does not contain an explicit statement about releasing code or a link to a code repository for the described methodology. |
| Open Datasets | Yes | We validate our theoretical findings through synthetic and natural language experiments on IsTree and HotPot-QA... We use the Hotpot-QA (Yang et al., 2018) dataset. |
| Dataset Splits | No | The reported accuracy is calculated over a validation set of 5000 samples using the last iteration of the model... We only test on a subset of 350 samples of the validation set where all the models can answer correctly given the correct paragraphs. |
| Hardware Specification | Yes | We run all the experiments on a server with 8 A100s and the estimated time to reproduce the results is within 2 days. |
| Software Dependencies | No | The paper mentions models like LLaMA and Mamba architectures, and Python's `re` library, but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | We train three different architectures... We train every model with at least 1M samples to guarantee convergence using Adam with a cosine learning rate... we train all the Transformer models with learning rates 1e-3 and the rest of the models with learning rates 3e-4... three different model sizes (0.5M, 1M, 2M) on IsTree with three different sizes of graph (16, 32, 64) under three different setups... We test our models under a 4-shot setting with Chain-of-Thought. |
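
The Experiment Setup row mentions Adam with a cosine learning rate and two peak values (1e-3 for Transformers, 3e-4 for the other models). As a rough illustration only, a cosine schedule could be sketched as below. The helper name `cosine_lr`, the absence of warmup, and the decay to a minimum of zero are assumptions made for this sketch; the quoted text does not specify them.

```python
import math


def cosine_lr(step: int, total_steps: int, peak_lr: float, min_lr: float = 0.0) -> float:
    """Cosine-annealed learning rate at `step`.

    Hypothetical helper: no warmup, and decay to `min_lr` (default 0.0),
    are assumptions that the quoted setup does not confirm.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so out-of-range steps stay well defined.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Peak learning rates quoted in the Experiment Setup row.
TRANSFORMER_PEAK_LR = 1e-3
OTHER_PEAK_LR = 3e-4

if __name__ == "__main__":
    total = 1000  # illustrative step count, not taken from the paper
    for step in (0, 250, 500, 750, 1000):
        print(step, cosine_lr(step, total, TRANSFORMER_PEAK_LR), cosine_lr(step, total, OTHER_PEAK_LR))
```

The value returned for each step could be passed to an optimizer's learning-rate setting; in practice a framework's built-in cosine scheduler would likely serve the same purpose.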