Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Synergistic Tensor and Pipeline Parallelism

Authors: Mengshi Qi, Jiaxuan Peng, Jie M. Zhang, Juan Zhu, Yong Li, Huadong Ma

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate that our approach improves training throughput by up to 12% for LLMs and 16% for MLLMs compared to existing scheduling methods.
Researcher Affiliation | Academia | 1 State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, China EMAIL
Pseudocode | No | The paper describes the proposed synergistic tensor and pipeline parallel schedule in Section 4 and illustrates it in Figure 5. However, it presents the schedule as a diagram of execution blocks and pipeline stages rather than as structured pseudocode or an algorithm block.
Open Source Code | Yes | Our source code is available at https://github.com/MICLAB-BUPT/STP.
Open Datasets | Yes | We evaluated our proposed schedule on the series of Qwen2 (LLM) [39] and Qwen2-VL (MLLM) [36] models, as detailed in Table 2.
Dataset Splits | No | The paper discusses how the model layers are split across pipeline stages (e.g., 'uniformly splitting the model while ensuring that the last stage contains two fewer layers than the other stages'), but it does not specify any training, validation, or test dataset splits for the data used to train these models. The details provided are about model partitioning, not data partitioning for evaluation.
Hardware Specification | Yes | Our implementation is built upon the open-source Megatron-Core project [33] and tested on up to 32 NVIDIA A800 SXM4 80G GPUs distributed across 4 nodes. We conduct experiments on 16 NVIDIA H20 GPUs, which are equipped with PCIe Gen 5 interconnection. We conduct experiments on 16 NVIDIA H20 96GB GPUs.
Software Dependencies | No | Our implementation is built upon the open-source Megatron-Core project [33] and tested on up to 32 NVIDIA A800 SXM4 80G GPUs distributed across 4 nodes. Meanwhile, Flash Attention 2 [3] is leveraged in all models for efficiency. The paper mentions specific software (Megatron-Core, Flash Attention 2) but does not provide version numbers for these dependencies.
Experiment Setup | Yes | We evaluated our proposed schedule on the series of Qwen2 (LLM) [39] and Qwen2-VL (MLLM) [36] models, as detailed in Table 2. For consistency, all schedules are configured with two virtual stages per device. We conduct the experiment on Qwen2-12.1B using activation checkpointing (AC) with a batch size of 128 and a sequence length of 6k. [...] adjusting the microbatch size across various parallel configurations to maximize memory utilization and achieve optimal throughput and Model FLOPs Utilization (MFU).
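
The model partitioning quoted under Dataset Splits ('uniformly splitting the model while ensuring that the last stage contains two fewer layers than the other stages') can be illustrated with a short sketch. This is not taken from the paper's code: the function name split_layers, its parameters, and the example sizes (30 layers, 4 stages) are hypothetical, and the sketch assumes the layer count plus two divides evenly by the number of stages.

```python
def split_layers(num_layers: int, num_stages: int, last_stage_deficit: int = 2) -> list[int]:
    """Return the number of layers per pipeline stage.

    Hypothetical helper: every stage gets the same count except the last,
    which gets `last_stage_deficit` fewer layers (2 in the quoted description).
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    # Pretend the last stage also holds the missing layers, then split evenly.
    padded_total = num_layers + last_stage_deficit
    if padded_total % num_stages != 0:
        raise ValueError("layers cannot be split uniformly under this rule")
    per_stage = padded_total // num_stages
    if per_stage < last_stage_deficit:
        raise ValueError("last stage would have a negative layer count")
    counts = [per_stage] * num_stages
    counts[-1] -= last_stage_deficit  # the last stage is lighter than the rest
    return counts


# Illustrative sizes only, not a configuration from the paper.
print(split_layers(30, 4))
```

How a "stage" maps to devices and virtual stages in the authors' schedule is not specified here, so the sketch treats a stage as a generic partition unit.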