Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Knee-Deep in C-RASP: A Transformer Depth Hierarchy

Authors: Andy J Yang, Michaël Cadilhac, David Chiang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Finally, we provide empirical evidence that our theory predicts the depth required for transformers without positional encodings to length-generalize on a family of sequential dependency tasks. We find experimentally that the C-RASP depth hierarchy closely predicts the depth that transformers require to solve problems with particular sequential dependencies (Fig. 2). 5 Experiments: Our depth hierarchy result suggests that transformers will require greater depth in order to model deeper sequential dependencies. We empirically validate this by training future-masked transformers with no positional encodings and varying depths to learn the 𝐿𝑘 language, for varying 𝑘.
Researcher Affiliation	Academia	Andy Yang (University of Notre Dame), Michaël Cadilhac (DePaul University), David Chiang (University of Notre Dame)
Pseudocode	No	The paper defines the architecture and processes mathematically (e.g., Definition B.2 for fixed-precision transformer operations) but does not include any distinct pseudocode or algorithm blocks with structured, step-by-step instructions in a code-like format.
Open Source Code	Yes	The code used for our experiments is provided at https://github.com/pentagonalize/CRASP_depth. LLMs were used to assist in writing code and debugging.
Open Datasets	No	We generated samples of 𝐿𝑘 to place into bins [201, 250], [251, 300], [301, 350], [351, 400] by uniformly sampling a length 𝑛 from the bin and uniformly sampling 𝑘 − 1 positions at which to switch between 𝑎 and 𝑏. For each 𝑘 and each bin, 1000 strings were generated.
Dataset Splits	Yes	The [201, 250] bin of 1000 examples was split into a training set of 800 examples and a validation set of 200 examples. The other bins were reserved for evaluation.
Hardware Specification	No	The experiments were run on an internal cluster of GPUs. Performing the training loop for a given number of layers over all 𝐿𝑘 required an average of 9.37 × 10⁴ TFLOPs and 936.8 MiB of memory.
Software Dependencies	No	Adam was used as the optimizer (Kingma and Ba, 2015).
Experiment Setup	Yes	We trained future-masked transformers without positional encodings. Because the sets of next tokens are mutually exclusive, we trained the transformer to perform multi-class classification with cross-entropy as the loss function. Adam was used as the optimizer (Kingma and Ba, 2015). The dimension 𝑑 and learning rate 𝜂 were tuned by searching over 𝑑 ∈ [256, 512] and 𝜂 ∈ [10⁻⁴, 10⁻⁵]. Each hyperparameter configuration was trained for 25 epochs or until 100% accuracy was achieved on the validation set.
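
The sampling and split procedure quoted under Open Datasets and Dataset Splits can be sketched in a few lines of Python. This is an illustrative reading, not the authors' code (which is in the linked repository). It assumes that 𝐿𝑘 strings consist of 𝑘 non-empty alternating blocks starting with 𝑎, that the 𝑘 − 1 switch positions are distinct, and that the 800/200 split is taken in order. The function names `sample_lk_string`, `make_bin`, and `split_train_val` are hypothetical.

```python
import random


def sample_lk_string(k, lo, hi, rng):
    """Sample one string of k alternating blocks of 'a' and 'b'.

    Assumption: blocks are non-empty and the first block is 'a'.
    """
    n = rng.randint(lo, hi)  # length drawn uniformly from the bin (inclusive)
    # Assumption: the k - 1 switch positions are distinct, chosen from 1..n-1.
    switches = sorted(rng.sample(range(1, n), k - 1))
    bounds = [0] + switches + [n]
    blocks = []
    for i in range(k):
        symbol = "a" if i % 2 == 0 else "b"
        blocks.append(symbol * (bounds[i + 1] - bounds[i]))
    return "".join(blocks)


def make_bin(k, lo, hi, count=1000, seed=0):
    """Generate `count` strings for one length bin [lo, hi]."""
    rng = random.Random(seed)
    return [sample_lk_string(k, lo, hi, rng) for _ in range(count)]


def split_train_val(examples, n_train=800):
    """Split one bin into training and validation sets (in order, by assumption)."""
    return examples[:n_train], examples[n_train:]
```

For example, `make_bin(3, 201, 250)` followed by `split_train_val` yields 800 training and 200 validation strings for 𝑘 = 3, while the bins [251, 300], [301, 350], and [351, 400] would be generated the same way and held out for evaluation.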