Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Length Generalization via Auxiliary Tasks

Authors: Pranjal Awasthi, Anupam Gupta, Ravi Kumar

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirical evaluations on a variety of synthetic benchmarks known to be challenging for length generalization, including sequence sorting and reversal, demonstrate that our proposed method yields significant improvements in generalization to substantially longer sequences.
Researcher Affiliation	Collaboration	Pranjal Awasthi (Google), Anupam Gupta (NYU), Ravi Kumar (Google)
Pseudocode	No	The paper describes theoretical concepts and experimental results but does not include any clearly labeled pseudocode or algorithm blocks. The methods are described in prose.
Open Source Code	No	Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: The code needs to go through an internal approval process for release. We will submit it post paper acceptance.
Open Datasets	No	D Dataset Details: Below we describe our example generation process for each of the synthetic tasks, for both main and auxiliary tasks. For the main task, we generate 100M examples, except for SLiM, where we generate 1M examples. For all synthetic tasks, we generate 100M examples for the auxiliary task.
Dataset Splits	No	In order to generate the training data for the main task, we sample a length n from [4, 20] at random and choose a random sequence of length n. ... Crucially, the training data for the auxiliary tasks is generated from (N, n)-perturbations: i.e., first drawing an input sequence of a larger length and then randomly subsampling the irrelevant tokens down to length n. (As mentioned above, we subsample these tokens without replacement to get a fixed length, instead of sampling each one independently.)
Hardware Specification	No	The paper mentions training was conducted using the Jax Flaxformer codebase but does not specify any particular GPU models, CPU types, or other hardware components used for the experiments.
Software Dependencies	No	All our experiments are conducted using the Jax Flaxformer codebase and involve training decoder-only transformer models from scratch. ... Optimizer AdamW
Experiment Setup	Yes	Table 1: Hyperparameters for the experiments. Embedding size (d): 1024; Vocabulary size (q): 103 (for sorting, SLiM), 200 (for reversal, increment, copy, and parity); Position embedding type: None; # Attention heads (h): 16; MLP inner dimensionality (d ): 2048; Sequence length: 512; Base learning rate: 1e-5; Optimizer: AdamW; LR warmup: Linear for 10 epochs; LR decay schedule: Cosine, one cycle with default parameters; Dropout: None; Activation: GELU; Depth: 2 (for sorting), 4 (for reversal, copy, increment, parity, and SLiM). We train all our models for 200,000 gradient steps.
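
The data generation quoted in the Dataset Splits row can be illustrated with a rough Python sketch. This is not the authors' code, which has not been released. It assumes the sorting task, a uniform and inclusive length range [4, 20], tokens drawn uniformly from the full vocabulary of 103, and a hypothetical larger length N = 64. The quoted text does not say which tokens count as "irrelevant" for each task, so the sketch treats every position as eligible for subsampling; all function names are made up for illustration.

```python
import random

MIN_LEN, MAX_LEN = 4, 20   # training length range quoted in the table
VOCAB_SIZE = 103           # vocabulary size quoted for sorting (assumed to be all usable tokens)


def sample_main_example(rng):
    """Sample one sorting example: a random sequence and its sorted copy."""
    n = rng.randint(MIN_LEN, MAX_LEN)  # assumption: uniform, both ends inclusive
    seq = [rng.randrange(VOCAB_SIZE) for _ in range(n)]
    return seq, sorted(seq)


def subsample_to_length(seq, n, rng):
    """Keep n positions of seq, chosen without replacement, preserving order."""
    keep = sorted(rng.sample(range(len(seq)), n))
    return [seq[i] for i in keep]


def sample_perturbed_input(big_n, rng):
    """Draw a length-big_n sequence, then subsample it down to a training length n.

    Assumption: every position is eligible for removal; the paper restricts
    this to task-specific "irrelevant" tokens.
    """
    n = rng.randint(MIN_LEN, MAX_LEN)
    long_seq = [rng.randrange(VOCAB_SIZE) for _ in range(big_n)]
    return long_seq, subsample_to_length(long_seq, n, rng)


rng = random.Random(0)
seq, target = sample_main_example(rng)
long_seq, short_seq = sample_perturbed_input(64, rng)  # 64 is a hypothetical N
print(len(seq), len(long_seq), len(short_seq))
```

Choosing the kept positions with a single rng.sample call gives an output of exactly length n, which mirrors the quoted remark about subsampling without replacement rather than keeping each token independently.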