Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Extrapolation by Association: Length Generalization Transfer In Transformers

Authors: Ziyang Cai, Nayoung Lee, Avi Schwarzschild, Samet Oymak, Dimitris Papailiopoulos

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We demonstrate this length generalization transfer across diverse algorithmic tasks, including arithmetic operations, string transformations, and maze navigation. Our results show that transformer models can inherit generalization capabilities from similar tasks when trained jointly.
Researcher Affiliation	Collaboration	Ziyang Cai University of Wisconsin-Madison Nayoung Lee University of Wisconsin-Madison Avi Schwarzschild Carnegie Mellon University Samet Oymak University of Michigan Dimitris Papailiopoulos University of Wisconsin-Madison Microsoft Research
Pseudocode	No	The paper describes experimental settings and results, but does not include any structured pseudocode or algorithm blocks.
Open Source Code	Yes	Our code is included in the supplementary materials section of the submission.
Open Datasets	No	All training data is generated on-the-fly during training. The paper describes the task definitions and generation procedures, and states that code is included in supplementary materials, which implies the data generation process is reproducible, but does not provide concrete access information (link, DOI, etc.) for a pre-existing public dataset.
Dataset Splits	Yes	At test time, we evaluate using exact match accuracy on a fixed test set of 1024 examples. For each configuration, we report results across 5 random initialization seeds but the dataset is kept the same.
Hardware Specification	Yes	For all experiments in the paper, we run on a single machine with two NVIDIA GeForce RTX 3090 graphics cards.
Software Dependencies	No	The paper mentions using transformer models with a Llama architecture and AdamW optimizer, but does not specify version numbers for programming languages, libraries, or frameworks like Python, PyTorch, or CUDA.
Experiment Setup	Yes	Table 3 lists the hyperparameters used for training across different task domains and model types. From-scratch models are trained with a higher learning rate and larger batch sizes, while pretrained models (SmolLM-360M) use lower learning rates and shorter training schedules. All models are optimized using AdamW with a learning rate schedule that includes a warm-up phase, a constant phase, and a cosine decay phase.
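
The Experiment Setup row names a three-phase learning rate schedule (warm-up, constant, cosine decay) but this page does not give the phase lengths or learning rate values. The following is a minimal sketch of such a schedule, not the authors' implementation: the function name `lr_at_step`, the linear shape of the warm-up, the phase fractions, and the peak and minimum learning rates are all assumptions chosen for illustration.

```python
import math


def lr_at_step(step, total_steps, peak_lr,
               warmup_frac=0.1, decay_frac=0.2, min_lr=0.0):
    """Learning rate at a given step for a warm-up / constant / cosine-decay schedule.

    All defaults are illustrative assumptions, not values from the paper.
    """
    warmup_steps = int(total_steps * warmup_frac)
    decay_steps = int(total_steps * decay_frac)
    decay_start = total_steps - decay_steps

    # Phase 1: linear warm-up from peak_lr / warmup_steps up to peak_lr (assumed linear).
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps

    # Phase 2: hold the learning rate constant at its peak.
    if step < decay_start:
        return peak_lr

    # Phase 3: cosine decay from peak_lr down to min_lr.
    progress = min(1.0, (step - decay_start) / max(1, decay_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Hypothetical run of 1000 steps with a peak learning rate of 1e-3.
    for s in (0, 99, 500, 900, 1000):
        print(s, lr_at_step(s, total_steps=1000, peak_lr=1e-3))
```

A function of this form can be wrapped in a framework scheduler that accepts a per-step multiplier; the actual values used in the paper are those listed in its Table 3.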