Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
ZeCO: Zero-Communication Overhead Sequence Parallelism for Linear Attention
Authors: Yuhong CHOU, Zehao Liu, Rui-Jie Zhu, Xinyi Wan, Tianjian Li, Congying Chu, Qian Liu, Jibin Wu, Zejun MA
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive multi-level experiments (collective communication, operator, and model) demonstrate the significant performance gains of ZeCO. As shown in Figure 1, the All-Scan collective achieves up to 3.9× communication speedup, the fastest existing sequence parallelism method, while the ZeCO sequence parallel operator delivers up to 9.3× overall speedup. At the model level, ZeCO boosts throughput by over 60% and demonstrates near-linear scalability from 8 to 256 devices, even with context lengths up to 8M tokens... 4 Experiments: We evaluate the efficiency and scalability of the proposed ZeCO SP Algorithm and All-Scan Communication Operator on 1B-GLA models. |
| Researcher Affiliation | Collaboration | Yuhong Chou¹, Zehao Liu¹, Ruijie Zhu³, Xinyi Wan⁴, Tianjian Li², Congying Chu⁵, Qian Liu², Jibin Wu¹, Zejun Ma². ¹The Hong Kong Polytechnic University, ²TikTok, ³UC Santa Cruz, ⁴National University of Singapore, ⁵Institute of Automation, Chinese Academy of Sciences |
| Pseudocode | Yes | Algorithm 1 Forward pass for ZeCO with All-Scan communication... Algorithm 2 All-Scan Algorithm... Algorithm 3 Backward pass for ZeCO with All-Scan communication |
| Open Source Code | No | Justification: We plan to open source the complete code in the near future, and this article provides all the conditions for reproduction. |
| Open Datasets | No | The paper describes training 1B-GLA models and evaluating performance on various sequence lengths (e.g., 8K, 16K, 32K, 8M tokens) but does not provide specific access information for any publicly available or open datasets used for this purpose. |
| Dataset Splits | No | The paper describes training 1B-GLA models and evaluating performance on various sequence lengths (e.g., 8K, 16K, 32K, 8M tokens) but does not provide details on specific dataset splits (e.g., training, validation, test splits for a dataset). |
| Hardware Specification | Yes | All experiments are conducted on a GPU cluster equipped with 256 H100 80GB GPUs. |
| Software Dependencies | No | Model is trained in Lingua [38], a PyTorch-based distributed training. We implement the All-Scan communication algorithm using the Triton-Distributed framework, which integrates OpenSHMEM into the Triton compiler to enable distributed communication within operator implementations [39, 40]. |
| Experiment Setup | Yes | In experiment Section 4.1, H is 32, the tensor size of each segmentation chunk is 16384, the hidden dimension d is 4096, and the sequence length per device L is 8192. Results are averaged over 50 rounds after 5 warm-up rounds (see Table 2). In experiment Section 4.2, for the algorithm run-time experiment, we test the GLA-attention algorithm equipped with different SP methods and record the time of one iteration of FWD and BWD. H is 16, the tensor size of each segmentation chunk is 16384, the hidden dimension d is 2048, and the sequence length per device L is 16384 and 32768. Results are averaged over 50 rounds after 5 warm-up rounds (see Tables 3 and 4). For the model throughput experiment, we test the GLA-1B model equipped with different SP methods and record the throughput in the training stage. H is 32, the tensor size of each segmentation chunk is 16384, the number of model layers is 20, the hidden dimension d is 2048, and the sequence length per device L is 16384 and 32768. Results are averaged over 100 steps after 5 warm-up rounds (see Tables 5 and 6). |
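
The Experiment Setup row describes a timing protocol of 5 warm-up rounds followed by 50 measured rounds whose times are averaged. A minimal sketch of that protocol is shown below. It is not the authors' benchmarking code: the `benchmark` helper and its parameter names are hypothetical, and it measures CPU wall-clock time only. Timing GPU kernels would also require synchronizing the device before each clock read, which this sketch does not attempt.

```python
import time
from statistics import mean
from typing import Callable


def benchmark(fn: Callable[[], object], warmup: int = 5, rounds: int = 50) -> float:
    """Return the mean wall-clock time of fn, in seconds, over `rounds` timed calls.

    Hypothetical helper: the defaults mirror the 5 warm-up / 50 measured
    rounds quoted in the table, nothing else is taken from the paper.
    """
    for _ in range(warmup):
        fn()  # warm-up calls are executed but not timed
    timings = []
    for _ in range(rounds):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return mean(timings)


if __name__ == "__main__":
    calls = []
    # Stand-in workload; a real run would call one FWD + BWD iteration here.
    avg = benchmark(lambda: calls.append(sum(range(10_000))))
    print(f"calls made: {len(calls)}, mean time per call: {avg:.6f} s")
```

Passing `rounds=100` would correspond to the 100-step averaging quoted for the throughput experiment, although throughput itself (tokens per second) would need the token count per step, which this sketch does not model.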