Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Extrapolation by Association: Length Generalization Transfer In Transformers
Authors: Ziyang Cai, Nayoung Lee, Avi Schwarzschild, Samet Oymak, Dimitris Papailiopoulos
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate this length generalization transfer across diverse algorithmic tasks, including arithmetic operations, string transformations, and maze navigation. Our results show that transformer models can inherit generalization capabilities from similar tasks when trained jointly. |
| Researcher Affiliation | Collaboration | Ziyang Cai (University of Wisconsin-Madison); Nayoung Lee (University of Wisconsin-Madison); Avi Schwarzschild (Carnegie Mellon University); Samet Oymak (University of Michigan); Dimitris Papailiopoulos (University of Wisconsin-Madison, Microsoft Research) |
| Pseudocode | No | The paper describes experimental settings and results, but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is included in the supplementary materials section of the submission. |
| Open Datasets | No | All training data is generated on-the-fly during training. The paper describes the task definitions and generation procedures, and states that code is included in supplementary materials, which implies the data generation process is reproducible, but does not provide concrete access information (link, DOI, etc.) for a pre-existing public dataset. |
| Dataset Splits | Yes | At test time, we evaluate using exact match accuracy on a fixed test set of 1024 examples. For each configuration, we report results across 5 random initialization seeds but the dataset is kept the same. |
| Hardware Specification | Yes | For all experiments in the paper, we run on a single machine with two NVIDIA GeForce RTX 3090 graphics cards. |
| Software Dependencies | No | The paper mentions using transformer models with a Llama architecture and AdamW optimizer, but does not specify version numbers for programming languages, libraries, or frameworks like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | Table 3 lists the hyperparameters used for training across different task domains and model types. From-scratch models are trained with a higher learning rate and larger batch sizes, while pretrained models (SmolLM-360M) use lower learning rates and shorter training schedules. All models are optimized using AdamW with a learning rate schedule that includes a warm-up phase, a constant phase, and a cosine decay phase. |
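
The Experiment Setup row describes a learning rate schedule with a warm-up phase, a constant phase, and a cosine decay phase. The sketch below shows one common way such a three-phase schedule can be written. It is an illustration only, not the authors' code: the function name `lr_at_step`, the linear shape of the warm-up, the `min_lr` floor, and all phase lengths are assumptions, and the actual values from Table 3 of the paper are not reproduced here.

```python
import math


def lr_at_step(step, peak_lr, warmup_steps, constant_steps, decay_steps, min_lr=0.0):
    """Learning rate at a 0-indexed training step for a three-phase schedule.

    All arguments are hypothetical; the paper's Table 3 holds the real settings.
    """
    if step < warmup_steps:
        # Phase 1 (assumed linear): ramp up so the last warm-up step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    if step < warmup_steps + constant_steps:
        # Phase 2: hold the peak learning rate.
        return peak_lr
    # Phase 3: cosine decay from peak_lr down to min_lr over decay_steps.
    steps_into_decay = step - warmup_steps - constant_steps
    progress = min(1.0, steps_into_decay / max(1, decay_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Print a few sample points with made-up phase lengths.
    for s in (0, 9, 15, 30, 45, 60):
        print(s, lr_at_step(s, peak_lr=1e-3, warmup_steps=10, constant_steps=20, decay_steps=30))
```

In a PyTorch training loop, a function like this could be wrapped in `torch.optim.lr_scheduler.LambdaLR` by returning the ratio to the optimizer's base learning rate; whether the authors did so is not stated in the excerpt above.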