Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Knee-Deep in C-RASP: A Transformer Depth Hierarchy
Authors: Andy J Yang, Michaël Cadilhac, David Chiang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we provide empirical evidence that our theory predicts the depth required for transformers without positional encodings to length-generalize on a family of sequential dependency tasks. We find experimentally that the C-RASP depth hierarchy closely predicts the depth that transformers require to solve problems with particular sequential dependencies (Fig. 2). Section 5 (Experiments): Our depth hierarchy result suggests that transformers will require greater depth in order to model deeper sequential dependencies. We empirically validate this by training future-masked transformers with no positional encodings and varying depths to learn the $L_k$ language, for varying $k$. |
| Researcher Affiliation | Academia | Andy Yang, University of Notre Dame; Michaël Cadilhac, DePaul University; David Chiang, University of Notre Dame |
| Pseudocode | No | The paper defines the architecture and processes mathematically (e.g., Definition B.2 for fixed-precision transformer operations) but does not include any distinct pseudocode or algorithm blocks with structured, step-by-step instructions in a code-like format. |
| Open Source Code | Yes | The code used for our experiments is provided at https://github.com/pentagonalize/CRASP_depth. LLMs were used to assist in writing code and debugging. |
| Open Datasets | No | We generated samples of $L_k$ to place into bins [201, 250], [251, 300], [301, 350], [351, 400] by uniformly sampling a length $n$ from the bin and uniformly sampling $k - 1$ positions at which to switch between $a$ and $b$. For each $k$ and each bin, 1000 strings were generated. |
| Dataset Splits | Yes | The [201, 250] bin of 1000 examples was split into a training set of 800 examples and a validation set of 200 examples. The other bins were reserved for evaluation. |
| Hardware Specification | No | The experiments were run on an internal cluster of GPUs. Performing the training loop for a given number of layers over all $L_k$ required an average of $9.37 \times 10^{4}$ TFLOPs and 936.8 MiB of memory. |
| Software Dependencies | No | Adam was used as the optimizer (Kingma and Ba, 2015). |
| Experiment Setup | Yes | We trained future-masked transformers without positional encodings. Because the sets of next tokens are mutually exclusive, we trained the transformer to perform multi-class classification with cross-entropy as the loss function. Adam was used as the optimizer (Kingma and Ba, 2015). The dimension $d$ and learning rate $\eta$ were tuned by searching over $d \in [256, 512]$ and $\eta \in [10^{-4}, 10^{-5}]$. Each hyperparameter configuration was trained for 25 epochs or until 100% accuracy was achieved on the validation set. |
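
The sampling and splitting procedure quoted in the Open Datasets and Dataset Splits rows can be sketched roughly as follows. This is a minimal illustration, not the authors' code (which is in the linked repository). It assumes that every string starts with `a`, that the switch positions are distinct so each of the blocks is non-empty, and that the 800/200 split is a random shuffle; the function names, the seed handling, and the helper `count_blocks` are hypothetical.

```python
import random

# Length bins quoted in the table; the first bin is used for training/validation.
BINS = [(201, 250), (251, 300), (301, 350), (351, 400)]


def sample_string(k, lo, hi, rng):
    """Sample one string made of k alternating blocks of 'a' and 'b'.

    Assumption: the string starts with 'a' and the k - 1 switch
    positions are distinct, so every block is non-empty.
    """
    n = rng.randint(lo, hi)  # length drawn uniformly from the bin (inclusive)
    # A switch at index i means character i differs from character i - 1.
    switches = set(rng.sample(range(1, n), k - 1))
    chars = []
    current = "a"
    for i in range(n):
        if i in switches:
            current = "b" if current == "a" else "a"
        chars.append(current)
    return "".join(chars)


def count_blocks(s):
    """Count maximal runs of identical characters in s."""
    return 1 + sum(1 for i in range(1, len(s)) if s[i] != s[i - 1])


def generate_bins(k, per_bin=1000, seed=0):
    """Generate per_bin strings for each length bin, keyed by (lo, hi)."""
    rng = random.Random(seed)
    return {
        (lo, hi): [sample_string(k, lo, hi, rng) for _ in range(per_bin)]
        for lo, hi in BINS
    }


def split_train_val(examples, n_train=800, seed=0):
    """Shuffle a copy of the examples and split into train/validation lists."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]
```

Under these assumptions, `generate_bins(3)` yields four lists of 1000 strings, and `split_train_val` applied to the `(201, 250)` list gives the 800/200 training and validation sets, leaving the longer bins for evaluation.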