Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Ladder-Residual: Parallelism-Aware Architecture for Accelerating Large Model Inference with Communication Overlapping

Authors: Muru Zhang, Mayank Mishra, Zhongzhu Zhou, William Brandon, Jue Wang, Yoon Kim, Jonathan Ragan-Kelley, Shuaiwen Leon Song, Ben Athiwaratkun, Tri Dao

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We train 1.2B and 3.5B Ladder-Residual-based Transformer models from scratch and observe comparable performance to a standard dense Transformer baseline. We also show that it is possible to convert parts of the Llama-3.1 8B model to our Ladder Residual architecture with minimal accuracy degradation by retraining on only 3B tokens. For a Transformer model with 70B parameters, applying Ladder Residual to all its layers achieves a 29% end-to-end wall-clock speedup at inference time with sharding over 8 devices. In Table 1, we provide the inference speedup on Transformers of different sizes. We conduct experiments under two scenarios to verify that we can maintain the same performance as the standard Transformer.
Researcher Affiliation Collaboration 1Together AI, 2University of Southern California, 3MIT-IBM Watson AI Lab, 4University of Sydney, 5Massachusetts Institute of Technology, 6Princeton University.
Pseudocode Yes Algorithm 1 Ladder Transformer Layer with Tensor Parallelism. Note that the Async All Reduce (ARR) returns a handle which is passed to the next layer.
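The handle-passing pattern the algorithm describes (each layer starts a non-blocking all-reduce and only the *next* layer waits on it, so communication overlaps with compute) can be sketched in plain Python. This is an illustrative mock, not the authors' code: a thread-pool future stands in for a real async all-reduce handle (e.g. `dist.all_reduce(..., async_op=True)` in PyTorch), and the identity "reduce" and `x * 2` compute are placeholder operations.

```python
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=1)

def fake_async_all_reduce(x):
    # Stand-in for an async all-reduce: kicks off "communication"
    # and immediately returns a handle (here, a Future).
    return pool.submit(lambda: x)  # identity reduce: single "device"

def ladder_layer(x, prev_handle):
    # A Ladder layer waits only on the PREVIOUS layer's communication,
    # so the previous all-reduce overlaps with this layer's compute.
    if prev_handle is not None:
        x = x + prev_handle.result()
    y = x * 2                       # placeholder for attention/MLP compute
    return y, fake_async_all_reduce(y)

def run(x, num_layers=3):
    handle = None
    for _ in range(num_layers):
        x, handle = ladder_layer(x, handle)
    return x + handle.result()      # drain the final handle
```

In a real tensor-parallel setting the handle would come from the collective-communication library, but the control flow — return the handle, pass it forward, wait one layer later — is the same.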
Open Source Code No The text is ambiguous or lacks a clear, affirmative statement of release. The paper mentions building upon gpt-fast (PyTorch Labs, 2024) and using Axolotl and Open LM Engine, but does not provide a direct link or explicit statement that the code for their Ladder Residual method is open-source or released by them.
Open Datasets Yes We train a 1.2B and 3.5B Ladder Transformer model with 100B tokens on the FineWeb-Edu dataset (Lozhkov et al., 2024)... We use EleutherAI's LM eval harness (Gao et al., 2024) to evaluate models on ARC (Clark et al., 2018), HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2020), SciQ (Welbl et al., 2017) and Winogrande (Trinh & Le, 2018). We also evaluate perplexity on Wikitext (Merity et al., 2017). Infinity-Instruct dataset (https://huggingface.co/datasets/BAAI/Infinity-Instruct), which contains 3B tokens.
Dataset Splits No The paper describes total tokens used for pretraining (100B) and fine-tuning (3B), context length (2048), and batch sizes (4M tokens, 32). It also mentions evaluation on benchmarks using N-shots (e.g., 5-shots for MMLU), which indicates evaluation strategy rather than data splitting. It does not provide explicit train/test/validation split percentages or sample counts for any of the datasets used in evaluation.
Hardware Specification Yes All benchmarks are done on NVIDIA H100 GPUs. We use HSDP (Hybrid Sharded Data Parallel) (Zhao et al., 2023; Rajbhandari et al., 2020) to train the 3.5B models. For HSDP, we shard the model within 1 node (equipped with 8x H100 GPUs).
Software Dependencies No The paper mentions several software components like PyTorch, JAX, NCCL, CUDA graphs, PyTorch compile, EleutherAI's LM eval harness, the StarCoder tokenizer, Axolotl, and Open LM Engine. However, it does not provide specific version numbers for any of these dependencies, which is required for a reproducible description.
Experiment Setup Yes The prompt length and generation length are fixed to 1024 and 512 respectively, while we vary the tensor parallel world sizes among 1, 2, 4 and 8, and batch sizes among 1, 4, 16 and 64 to understand performance under different generation settings. All our models are trained on 100B tokens of the FineWeb-Edu dataset... We train all our models with 2048 context length and a batch size of 4M tokens. The models are trained with a cosine scheduler with a warmup of 8B tokens to a peak learning rate of 3 × 10⁻⁴. The learning rate is then decayed over 92B tokens to 3 × 10⁻⁵. We conduct supervised fine-tuning (SFT)... We train for 2 epochs with the AdamW optimizer with a batch size of 32. We use a 5 × 10⁻⁶ learning rate with 200 steps of linear warmup, followed by cosine annealing to the end.
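The pretraining schedule quoted above (linear warmup over 8B tokens to a 3e-4 peak, then cosine decay over 92B tokens down to 3e-5) can be written out explicitly. The function below is a sketch reconstructed from those numbers, assuming warmup starts from 0 and the standard cosine-decay formula; it is not the authors' implementation.

```python
import math

WARMUP_TOKENS = 8e9     # linear warmup over 8B tokens
DECAY_TOKENS = 92e9     # cosine decay over the remaining 92B tokens
PEAK_LR = 3e-4
FLOOR_LR = 3e-5

def lr_at(tokens):
    # Learning rate as a function of tokens seen (assumed formula).
    if tokens < WARMUP_TOKENS:
        return PEAK_LR * tokens / WARMUP_TOKENS          # linear warmup
    progress = (tokens - WARMUP_TOKENS) / DECAY_TOKENS   # 0 -> 1 over 92B
    return FLOOR_LR + 0.5 * (PEAK_LR - FLOOR_LR) * (1 + math.cos(math.pi * progress))
```

By construction the rate is exactly 3e-4 at the 8B-token mark and 3e-5 at the 100B-token end of training.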