Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Mixtures of Subspaces for Bandwidth Efficient Context Parallel Training

Authors: Sameera Ramasinghe, Thalaiyasingam Ajanthan, Hadi Mohaghegh Dolatabadi, Gil Avraham, Violetta Shevchenko, Yan Zuo, Chamin P Hewa Koneputugodage, Alexander Long

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental 5 Experiments. [Figure 2 plots omitted; x-axis: Hours; legend: Dec. CP Comp. (ours) 300Mbps, Cen. CP 100Gbps, Dec. CP 300Mbps.] Figure 2: Convergence in low-bandwidth settings. From left to right: FineWeb, C4, and BookCorpus. The training curves are presented against wall-clock time for an 8-layer (800M) model trained with a 132K context window parallelized across 8 GPUs. Decentralized models utilize 300Mbps connections while the centralized model has datacenter-grade 100Gbps links. Our compressed model achieves on-par convergence to the centralized model, even under a 300Mbps bandwidth budget. In contrast, the non-compressed decentralized model with 300Mbps links suffers from significantly slower convergence. ... 5.1 Experimental Setup We evaluate decoder-only models on three large-scale corpora: FineWeb (FW) [30], C4 [35], and BookCorpus (BC) [56]. For each dataset, we reserve 10% of the training split for validation. All model backbones follow LLAMA 3 [11]; exact model specifications are given in the corresponding sections. We use a base-learning-rate = 3 × 10⁻⁴ with linear warm-up and decay, and apply a weight-decay = 0.01. Every transformer layer is compressed except for the final block, where K and V projections are compressed by 98% and 95% (overall 96.5%), respectively, by choosing r w.r.t. d appropriately. We use the GPT2 tokenizer for all models.
Researcher Affiliation Industry Sameera Ramasinghe, Ajanthan Thalaiyasingam, Hadi Mohaghegh Dolatabadi, Gil Avraham, Violetta Shevchenko, Yan Zuo, Chamin Hewa Koneputugodage, Alexander Long; Pluralis Research
Pseudocode Yes Algorithm 1 Compression-aware context parallel attention (per node, per head)
Require: Input X ∈ ℝ^(n_i × d), attention weight W ∈ ℝ^(d × d), warm-started basis U ∈ ℝ^(d × r), learnable linear head ψ : ℝ^d → ℝ^m, sync interval c, current step t
1: Compute local keys and values: Z ← XW
2: Z_avg ← MeanToken(Z)
3: θ ← ψ(Z_avg) (compute rotation param from local chunk)
4: Ũ ← R(θ) U (construct data-dependent subspace)
5: Compress: Z_comp ← Z Ũ
6: Broadcast (Z_comp, θ) to all other nodes
7: Receive (Z_(comp,j), θ_j) from all other nodes j
8: for all received (Z_(comp,j), θ_j) do
9:   U_j ← R(θ_j) U
10:   Z_j ← Z_(comp,j) U_j^⊤ (decompress)
11: end for
12: Aggregate global Z_g ← {K_j, V_j}, ∀j from all nodes
13: Compute blockwise attention: A ← Softmax(QK^⊤ / √d) V
14: if t mod c = 0 then
15:   W ← AllReduceAvg(W) (sparse sync of attention weights)
16: end if
Open Source Code Yes 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: Provided in supplementary materials
Open Datasets Yes 5.1 Experimental Setup We evaluate decoder-only models on three large-scale corpora: FineWeb (FW) [30], C4 [35], and BookCorpus (BC) [56].
Dataset Splits Yes For each dataset, we reserve 10% of the training split for validation.
Hardware Specification Yes 5.2 Bandwidth efficiency in decentralized settings We train an 8-layer, 800M-parameter model (embedding-size = 2048, attention-heads = 8) under two network settings: a centralized 100Gbps fabric and decentralized 300Mbps internet-grade links. Using CP, we process a sequence length of 132K tokens across eight A100 GPUs connected at the respective bandwidths. ... Figure 3: Scaling across parallelism strategies. Our compression-based CP scheme can be seamlessly fused with other parallel training strategies. We train a 3B-parameter model (32 layers) with both pipeline parallel and CP enabled across 32 A100s. Our compressed approach yields substantial throughput gains over uncompressed decentralized CP and nearly matches the performance of centralized CP.
Software Dependencies No The paper does not explicitly state specific software dependencies with version numbers in the main text. While it mentions the use of 'GPT2 tokenizer' in Section 5.1, no version is provided. The NeurIPS checklist indicates code is in supplementary materials, where dependencies might be listed, but this information is not in the main paper.
Experiment Setup Yes 5.1 Experimental Setup We evaluate decoder-only models on three large-scale corpora: FineWeb (FW) [30], C4 [35], and BookCorpus (BC) [56]. For each dataset, we reserve 10% of the training split for validation. All model backbones follow LLAMA 3 [11]; exact model specifications are given in the corresponding sections. We use a base-learning-rate = 3 × 10⁻⁴ with linear warm-up and decay, and apply a weight-decay = 0.01. Every transformer layer is compressed except for the final block, where K and V projections are compressed by 98% and 95% (overall 96.5%), respectively, by choosing r w.r.t. d appropriately. We use the GPT2 tokenizer for all models. ... Table 3: Effect of warmup steps (val. perplexity). All models are trained for 10K steps with a 132K context. The method is not highly sensitive to the number of warmup steps. ... 5.3 Ablations ... In practice, we default to 500 steps to provide a safe and stable baseline.
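
The compress, broadcast, and decompress steps of Algorithm 1, quoted in the Pseudocode row, can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the form of R(θ) (Givens rotations on disjoint index pairs), the random orthonormal U standing in for the warm-started basis, ψ represented as a plain (m, d) matrix, and all sizes are hypothetical. The blockwise attention and the periodic weight sync are omitted.

```python
import numpy as np

def rotation(theta, d):
    """Hypothetical R(theta): Givens rotations on disjoint index pairs (2k, 2k+1)."""
    assert 2 * len(theta) <= d, "need one disjoint index pair per angle"
    R = np.eye(d)
    for k, angle in enumerate(theta):
        i, j = 2 * k, 2 * k + 1
        c, s = np.cos(angle), np.sin(angle)
        R[i, i], R[i, j] = c, -s
        R[j, i], R[j, j] = s, c
    return R

def compress(X, W, U, psi):
    """Steps 1-5: project local keys or values into a data-dependent subspace."""
    Z = X @ W                                 # local keys or values, shape (n_i, d)
    theta = psi @ Z.mean(axis=0)              # rotation parameters from the mean token, shape (m,)
    U_rot = rotation(theta, U.shape[0]) @ U   # rotated basis, shape (d, r)
    return Z @ U_rot, theta                   # payload to send: (n_i, r) values plus m angles

def decompress(Z_comp, theta, U):
    """Steps 9-10: rebuild the sender's basis from theta and lift back to d dimensions."""
    U_j = rotation(theta, U.shape[0]) @ U
    return Z_comp @ U_j.T                     # shape (n_j, d)

# Toy sizes, chosen only for illustration (assumption, not from the paper).
rng = np.random.default_rng(0)
d, r, m, n_i, nodes = 8, 2, 3, 4, 2
W = rng.normal(size=(d, d))
psi = rng.normal(size=(m, d))                  # linear head as an (m, d) matrix
U, _ = np.linalg.qr(rng.normal(size=(d, r)))   # orthonormal stand-in for the warm-started basis

chunks = [rng.normal(size=(n_i, d)) for _ in range(nodes)]
payloads = [compress(X, W, U, psi) for X in chunks]                    # simulated broadcast
Z_global = np.vstack([decompress(Zc, th, U) for Zc, th in payloads])   # steps 8-12

print(Z_global.shape)                      # (nodes * n_i, d)
print(payloads[0][0].size / (n_i * d))     # fraction of values sent per chunk, r / d
```

Because each node sends only the (n_i, r) projection and the m angles, the receiver can rebuild the sender's basis from the shared U without the basis itself crossing the network. The reconstruction is the projection of Z onto the rotated subspace, so it is lossy whenever r < d.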
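
The learning-rate settings quoted in the Experiment Setup row (base rate 3 × 10⁻⁴, linear warm-up and decay, 500 warm-up steps by default) could be expressed as a schedule like the one below. This is a hedged sketch: the quoted text does not give the decay shape or its end point, so linear decay to zero is an assumption, as is reusing the 10K-step horizon from the Table 3 ablation. The function name and constants are illustrative.

```python
BASE_LR = 3e-4        # base learning rate from Section 5.1
WARMUP_STEPS = 500    # default warm-up length from Section 5.3
TOTAL_STEPS = 10_000  # assumption: horizon borrowed from the Table 3 ablation
WEIGHT_DECAY = 0.01   # would be passed to the optimizer; unused in the schedule itself

def learning_rate(step, base_lr=BASE_LR, warmup=WARMUP_STEPS, total=TOTAL_STEPS):
    """Linear warm-up to base_lr, then linear decay to zero (decay shape assumed)."""
    if step < warmup:
        return base_lr * (step + 1) / warmup
    remaining = max(total - step, 0)
    return base_lr * remaining / (total - warmup)

for step in (0, 499, 500, 5_000, 10_000):
    print(step, learning_rate(step))
```

The schedule reaches the base rate on the last warm-up step and holds it for one more step before decaying, so there is no discontinuity at the boundary.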