Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Do Language Models Use Their Depth Efficiently?

Authors: Róbert Csordás, Christopher D Manning, Chris Potts

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Most of the experiments presented in the main paper are performed with Llama 3.1 70B [20], using NDIF and NNsight [21]. Unless noted otherwise, the results are computed on 10 random examples from GSM8K [22]. In bar plots, each bar starts from 0 (no stacking). The main results are also shown in the appendix on different models, including Llama, Qwen [23], and OLMo 2 [24]. In Sec. 3.1, we measure how the layers and sublayers contribute to the residual stream. In Sec. 3.2, we use causal interventions to measure the effect of layers on downstream computations. In Sec. 3.3 we show that deeper or otherwise more complex computations do not influence the number of layers that have a causal effect on the prediction model. In Sec. 3.4 we train linear projections to find the correspondence between the layers of an independently trained shallow and deep Qwen model.
Researcher Affiliation | Academia | Róbert Csordás, Christopher D. Manning, Christopher Potts; Stanford University, Stanford, CA, USA
Pseudocode | No | The paper defines the Transformer layer mathematically (Eq. 1-4) and describes experimental methodologies in prose, but it does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is public: https://github.com/robertcsordas/llm_effective_depth
Open Datasets | Yes | Most of the experiments presented in the main paper are performed with Llama 3.1 70B [20], using NDIF and NNsight [21]. Unless noted otherwise, the results are computed on 10 random examples from GSM8K [22]. ... We analyze two datasets: MQuAKE [31], which consists of multi-hop questions with a known number of hops, and the MATH dataset [30], which consists of complex math problems with different difficulty levels.
Dataset Splits | Yes | Unless noted otherwise, the results are computed on 10 random examples from GSM8K [22]. ... We finetune all parameters of Llama 3.2 3B on the arithmetic splits of the DeepMind Math dataset [32], with batch size 64, for 10k steps... Max over 20 random examples from the validation set.
Hardware Specification | Yes | The experiments on the Qwen models and the Llama 3.1 70B Instruct models, which are not available on NDIF, are done on 4 Nvidia A6000 48GB GPUs, with a rough duration of a day for the 70B experiment, and another day for all the Qwen experiments. For Sec. 3.5, we trained each model on 2 Nvidia A100 80GB GPUs for 2 days. Full-finetuning Llama 3.2 3B on the DeepMind Math dataset (Sec. D.5) was done on 4 Nvidia H200 GPUs for 10 hours. Training the linear maps between the pair of layers of the Qwen models (Sec. 3.4) was done on A6000 GPUs, taking 80 GPU-days in total.
Software Dependencies | No | The paper mentions using "NDIF and NNsight" as tools, but it does not provide specific version numbers for these or any other software libraries, frameworks, or programming languages used in the experiments.
Experiment Setup | Yes | We finetune all parameters of Llama 3.2 3B on the arithmetic splits of the DeepMind Math dataset [32], with batch size 64, for 10k steps, with a warmup of 100 steps followed by a constant learning rate of 2 × 10⁻⁵.
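
The fine-tuning schedule quoted in the Experiment Setup row (100 warmup steps, then a constant learning rate of 2 × 10⁻⁵, batch size 64, 10k steps) can be sketched in a few lines of plain Python. This is only an illustration, not the authors' code: the quoted text does not state the shape of the warmup, so a linear ramp is assumed here, and the function and constant names are hypothetical.

```python
# Hypothetical sketch of the quoted schedule; names are illustrative only.
PEAK_LR = 2e-5        # constant learning rate after warmup (from the quoted text)
WARMUP_STEPS = 100    # warmup length (from the quoted text)
TOTAL_STEPS = 10_000  # "10k steps" (from the quoted text)
BATCH_SIZE = 64       # batch size (from the quoted text)


def learning_rate(step: int) -> float:
    """Return the learning rate for a 0-indexed optimizer step.

    ASSUMPTION: the warmup is linear; the source only says "a warmup of 100 steps".
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    if step < WARMUP_STEPS:
        # Ramp linearly so that the last warmup step reaches the peak rate.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    return PEAK_LR


# Full schedule, plus the number of training examples processed if every
# step consumes exactly one batch (no gradient accumulation assumed).
schedule = [learning_rate(s) for s in range(TOTAL_STEPS)]
examples_seen = TOTAL_STEPS * BATCH_SIZE

if __name__ == "__main__":
    print(f"first step lr: {schedule[0]:.2e}")
    print(f"last step lr:  {schedule[-1]:.2e}")
    print(f"examples seen: {examples_seen}")
```

Under these assumptions the warmup covers only 1% of training, so almost all of the 640,000 processed examples are seen at the constant rate.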