Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Ladder-Residual: Parallelism-Aware Architecture for Accelerating Large Model Inference with Communication Overlapping
Authors: Muru Zhang, Mayank Mishra, Zhongzhu Zhou, William Brandon, Jue Wang, Yoon Kim, Jonathan Ragan-Kelley, Shuaiwen Leon Song, Ben Athiwaratkun, Tri Dao
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We train 1.2B and 3.5B Ladder-Residual-based Transformer models from scratch and observe performance comparable to a standard dense Transformer baseline. We also show that parts of the Llama-3.1 8B model can be converted to our Ladder Residual architecture with minimal accuracy degradation by retraining on only 3B tokens. For a Transformer model with 70B parameters, applying Ladder Residual to all its layers can achieve a 29% end-to-end wall-clock speedup at inference time with sharding over 8 devices. In Table 1, we provide the inference speedup on Transformers of different sizes. We conduct experiments under two scenarios to verify that we can maintain the same performance as the standard Transformer. |
| Researcher Affiliation | Collaboration | Together AI; University of Southern California; MIT-IBM Watson Lab; University of Sydney; Massachusetts Institute of Technology; Princeton University. |
| Pseudocode | Yes | Algorithm 1 Ladder Transformer Layer with Tensor Parallelism. Note that the Async All Reduce (ARR) returns a handle which is passed to the next layer. |
| Open Source Code | No | The text is ambiguous or lacks a clear, affirmative statement of release. The paper mentions building upon gpt-fast (PyTorch Labs, 2024) and using Axolotl and Open LM Engine, but does not provide a direct link or an explicit statement that the code for their Ladder Residual method is open-source or released by them. |
| Open Datasets | Yes | We train a 1.2B and 3.5B Ladder Transformer model with 100B tokens on the FineWeb-Edu dataset (Lozhkov et al., 2024)... We use EleutherAI's LM eval harness (Gao et al., 2024) to evaluate models on ARC (Clark et al., 2018), HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2020), SciQ (Welbl et al., 2017) and Winogrande (Trinh & Le, 2018). We also evaluate perplexity on WikiText (Merity et al., 2017). Infinity-Instruct dataset (https://huggingface.co/datasets/BAAI/Infinity-Instruct), which contains 3B tokens. |
| Dataset Splits | No | The paper describes total tokens used for pretraining (100B) and fine-tuning (3B), context length (2048), and batch sizes (4M tokens, 32). It also mentions evaluation on benchmarks using N-shots (e.g., 5-shots for MMLU), which indicates evaluation strategy rather than data splitting. It does not provide explicit train/test/validation split percentages or sample counts for any of the datasets used in evaluation. |
| Hardware Specification | Yes | All benchmarks are done on NVIDIA H100 GPUs. We use HSDP (Hybrid Sharded Data Parallel) (Zhao et al., 2023; Rajbhandari et al., 2020) to train the 3.5B models. For HSDP, we shard the model within 1 node (equipped with 8x H100 GPUs). |
| Software Dependencies | No | The paper mentions several software components like PyTorch, JAX, NCCL, CUDA graphs, PyTorch compile, EleutherAI's LM eval harness, the StarCoder tokenizer, Axolotl, and Open LM Engine. However, it does not provide specific version numbers for any of these dependencies, which is required for a reproducible description. |
| Experiment Setup | Yes | The prompt length and generation length are fixed to 1024 and 512 respectively, while we vary the tensor parallel world sizes among 1, 2, 4 and 8, and batch sizes among 1, 4, 16 and 64 to understand performance under different generation settings. All our models are trained on 100B tokens of the FineWeb-Edu dataset... We train all our models with 2048 context length with a batch size of 4M tokens. The models are trained with a cosine scheduler with a warmup of 8B tokens to a peak learning rate of 3×10⁻⁴. The learning rate is then decayed over 92B tokens to 3×10⁻⁵. We conduct supervised fine-tuning (SFT)... We train for 2 epochs with the AdamW optimizer with a batch size of 32. We use a 5×10⁻⁶ learning rate with 200 steps of linear warmup, followed by cosine annealing to the end. |
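The pseudocode row notes that Algorithm 1's async all-reduce returns a handle that is passed to the next layer, which is the mechanism that lets communication overlap with computation. The following is a minimal runnable sketch of that handle-passing pattern, not the authors' implementation: a `concurrent.futures` future stands in for the `Work` handle that `torch.distributed.all_reduce(tensor, async_op=True)` would return in a real tensor-parallel setup, and integer arithmetic stands in for the attention/MLP math; `async_all_reduce` and `ladder_layer` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

# Background "communication" worker: in a real setup this role is played
# by NCCL; the returned future stands in for the Work handle that
# torch.distributed.all_reduce(tensor, async_op=True) would return.
_comm = ThreadPoolExecutor(max_workers=1)

def async_all_reduce(partials):
    # Kick off the reduction and return immediately with a handle.
    return _comm.submit(sum, partials)

def ladder_layer(block_input, prev_handle):
    """One ladder step: run this layer's local computation while the
    previous layer's all-reduce is still in flight, and block on the
    handle only at the point where its result is consumed."""
    local_out = block_input + 1            # stand-in for attention/MLP math
    prev_reduced = prev_handle.result()    # handle.wait() happens here
    out = prev_reduced + local_out         # fold reduced result into residual
    return out, async_all_reduce([local_out, prev_reduced])

# Chain two layers: each returns its output plus a handle for the next.
h0 = async_all_reduce([0, 0])              # dummy initial handle
out1, h1 = ladder_layer(1, h0)             # out1 == 2
out2, h2 = ladder_layer(out1, h1)          # out2 == 5
```

The design point the sketch captures is that each layer waits only on the handle from the *previous* layer, so its own local computation can proceed while that collective is still in flight.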
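The pretraining schedule in the setup row (warmup over 8B tokens to a peak of 3×10⁻⁴, then cosine decay over the remaining 92B tokens to 3×10⁻⁵) can be written down directly. A sketch assuming linear warmup, which the excerpt does not specify:

```python
import math

PEAK_LR, FINAL_LR = 3e-4, 3e-5   # peak and final learning rates from the paper
WARMUP_TOKENS = 8e9              # 8B-token warmup
TOTAL_TOKENS = 100e9             # 100B pretraining tokens (8B warmup + 92B decay)

def lr_at(tokens):
    """Learning rate after `tokens` tokens: linear warmup (an assumption;
    the excerpt only says 'warmup'), then cosine decay to FINAL_LR."""
    if tokens < WARMUP_TOKENS:
        return PEAK_LR * tokens / WARMUP_TOKENS
    progress = (tokens - WARMUP_TOKENS) / (TOTAL_TOKENS - WARMUP_TOKENS)
    return FINAL_LR + 0.5 * (PEAK_LR - FINAL_LR) * (1 + math.cos(math.pi * progress))
```

At the warmup boundary (8B tokens) this yields the peak rate 3×10⁻⁴, and at 100B tokens it has annealed to 3×10⁻⁵, matching the figures quoted in the table.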