Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Ladder-Residual: Parallelism-Aware Architecture for Accelerating Large Model Inference with Communication Overlapping
Authors: Muru Zhang, Mayank Mishra, Zhongzhu Zhou, William Brandon, Jue Wang, Yoon Kim, Jonathan Ragan-Kelley, Shuaiwen Leon Song, Ben Athiwaratkun, Tri Dao
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We train a 1.2B and 3.5B Ladder Residual based Transformer models from scratch and observe comparable performance to a standard dense transformer baseline. We also show that it is possible to convert parts of the Llama-3.1 8B model to our Ladder Residual architecture with minimal accuracy degradation by only retraining for 3B tokens. For a Transformer model with 70B parameters, applying Ladder Residual to all its layers can achieve 29% end-to-end wall clock speedup at inference time with sharding over 8 devices. In Table 1, we provide the inference speedup on Transformers of different sizes. We conduct experiments under two scenarios to verify if we can maintain the same performance as standard Transformer |
| Researcher Affiliation | Collaboration | ¹Together AI, ²University of Southern California, ³MIT-IBM Watson Lab, ⁴University of Sydney, ⁵Massachusetts Institute of Technology, ⁶Princeton University. |
| Pseudocode | Yes | Algorithm 1 Ladder Transformer Layer with Tensor Parallelism. Note that the Async All Reduce (ARR) returns a handle which is passed to the next layer. |
| Open Source Code | No | The text is ambiguous or lacks a clear, affirmative statement of release. The paper mentions building upon gpt-fast (PyTorch Labs, 2024) and using Axolotl and Open LM Engine, but does not provide a direct link or explicit statement that the code for the Ladder Residual method is open-source or released by the authors. |
| Open Datasets | Yes | We train a 1.2B and 3.5B Ladder Transformer model with 100B tokens on the FineWeb-edu dataset (Lozhkov et al., 2024)... We use EleutherAI's LM eval harness (Gao et al., 2024) to evaluate models on ARC (Clark et al., 2018), HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2020), SciQ (Welbl et al., 2017) and Winogrande (Trinh & Le, 2018). We also evaluate perplexity on Wikitext (Merity et al., 2017). Infinity-Instruct dataset, which contains 3B tokens (https://huggingface.co/datasets/BAAI/Infinity-Instruct). |
| Dataset Splits | No | The paper describes total tokens used for pretraining (100B) and fine-tuning (3B), context length (2048), and batch sizes (4M tokens, 32). It also mentions evaluation on benchmarks using N-shots (e.g., 5-shots for MMLU), which indicates evaluation strategy rather than data splitting. It does not provide explicit train/test/validation split percentages or sample counts for any of the datasets used in evaluation. |
| Hardware Specification | Yes | All benchmarks are done on NVIDIA H100 GPUs. We use HSDP (Hybrid Sharded Data Parallel) (Zhao et al., 2023; Rajbhandari et al., 2020) to train the 3.5B models. For HSDP, we shard the model within 1 node (equipped with 8x H100 GPUs). |
| Software Dependencies | No | The paper mentions several software components, including PyTorch, JAX, NCCL, CUDA graphs, PyTorch compile, EleutherAI's LM eval harness, the StarCoder tokenizer, Axolotl, and Open LM Engine. However, it does not provide specific version numbers for any of these dependencies, which is required for a reproducible description. |
| Experiment Setup | Yes | The prompt length and generation length is fixed to 1024 and 512 respectively, while we vary the tensor parallel world sizes among 1, 2, 4 and 8, and batch sizes among 1, 4, 16 and 64 to understand performance under different generation settings. All our models are trained on 100B tokens of FineWeb-edu dataset... We train all our models with 2048 context length with a batch size of 4M tokens in a batch. The models are trained with cosine scheduler with a warmup of 8B tokens to a peak learning rate of 3 × 10⁻⁴. The learning rate is then decayed over 92B tokens to 3 × 10⁻⁵. We conduct supervised fine-tuning (SFT)... We train for 2 epochs with AdamW optimizer with a batch size of 32. We use 5 × 10⁻⁶ learning rate with 200 steps of linear warmup, followed by cosine annealing to the end. |
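
The pretraining learning-rate schedule quoted in the Experiment Setup row can be sketched as a function of tokens seen. This is a minimal illustration, not the authors' code: the function name `lr_at` and the constants are hypothetical, and the warmup is assumed to be linear from zero, which the quoted text states only for the SFT stage.

```python
import math

# Values taken from the Experiment Setup row above.
PEAK_LR = 3e-4        # peak learning rate reached at the end of warmup
FINAL_LR = 3e-5       # learning rate at the end of the cosine decay
WARMUP_TOKENS = 8e9   # 8B tokens of warmup
DECAY_TOKENS = 92e9   # 92B tokens of cosine decay


def lr_at(tokens_seen: float) -> float:
    """Return the learning rate after `tokens_seen` training tokens."""
    if tokens_seen < WARMUP_TOKENS:
        # Assumption: linear warmup starting from zero.
        return PEAK_LR * tokens_seen / WARMUP_TOKENS
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min((tokens_seen - WARMUP_TOKENS) / DECAY_TOKENS, 1.0)
    # Cosine factor goes from 1 (start of decay) to 0 (end of decay).
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return FINAL_LR + (PEAK_LR - FINAL_LR) * cosine


if __name__ == "__main__":
    for billions in (0, 4, 8, 54, 100):
        print(f"{billions:>3}B tokens: lr = {lr_at(billions * 1e9):.3e}")
```

Expressing the schedule in tokens rather than optimizer steps avoids assuming whether "4M tokens" per batch means 4 × 10⁶ or 2²² tokens.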