Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].

Subspace Networks: Scaling Decentralized Training with Communication-Efficient Model Parallelism

Authors: Sameera Ramasinghe, Thalaiyasingam Ajanthan, Gil Avraham, Yan Zuo, Alexander Long

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 6 Experiments. We evaluate decoder-only models (based on Llama 3 [14]) across four large-scale datasets: WikiText (WT) [33], BookCorpus (BC) [63], OpenWebText (OWT) [15], and C4 [37]. For WT, we use the standard splits; for BC and OWT, we randomly select 10% of training data as validation; for C4, due to computational constraints, we report training loss only. The base model has a context length of 1024, embedding dimension 4096, 24 heads, and 8 layers (~2B parameters); larger models (up to 8B parameters) are noted explicitly in ablation sections. We use a base learning rate η = 3e-4 (with warmup and linear decay), weight decay 0.01, and batch size 32, unless otherwise specified.
Researcher Affiliation | Industry | Sameera Ramasinghe, Ajanthan Thalaiyasingam, Gil Avraham, Yan Zuo, Alexander Long; Pluralis Research
Pseudocode | No | The paper contains mathematical formulations and derivations but does not include a clearly labeled pseudocode or algorithm block.
Open Source Code | Yes | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: Provided with supplementary materials.
Open Datasets | Yes | We evaluate decoder-only models (based on Llama 3 [14]) across four large-scale datasets: WikiText (WT) [33], BookCorpus (BC) [63], OpenWebText (OWT) [15], and C4 [37].
Dataset Splits | Yes | For WT, we use the standard splits; for BC and OWT, we randomly select 10% of training data as validation; for C4, due to computational constraints, we report training loss only.
Hardware Specification | Yes | Experiments (except the 8B Llama run on L4 GPUs with internet-based decentralized connections) use A10g GPUs (24GB VRAM) with one layer per GPU.
Software Dependencies | No | The paper mentions "torch.distributed.pipelining" and "TorchTitan [29]" but does not specify their version numbers or the versions of other key software components.
Experiment Setup | Yes | The base model has a context length of 1024, embedding dimension 4096, 24 heads, and 8 layers (~2B parameters); larger models (up to 8B parameters) are noted explicitly in ablation sections. We use a base learning rate η = 3e-4 (with warmup and linear decay), weight decay 0.01, and batch size 32, unless otherwise specified. We use GPipe [18] via torch.distributed.pipelining, integrating our compression into all but the final transformer layer. We initialize U_k with isotropic Gaussian noise and set k = 40, achieving 100× compression.
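
The hyperparameters quoted in the Experiment Setup row can be gathered into a small config, and the stated compression level is roughly what one gets by dividing the embedding dimension by the subspace rank k. The following is a minimal sketch under stated assumptions, not the authors' code: the names `BaseSetup`, `compression_ratio`, and `lr_at` are hypothetical, reading compression as d / k is an assumption, and the warmup and total step counts are placeholders because the page reports neither.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BaseSetup:
    """Hyperparameters quoted in the Experiment Setup row."""
    context_length: int = 1024
    embed_dim: int = 4096
    num_heads: int = 24
    num_layers: int = 8
    base_lr: float = 3e-4
    weight_decay: float = 0.01
    batch_size: int = 32
    subspace_rank: int = 40  # k in the quoted text


def compression_ratio(cfg: BaseSetup) -> float:
    # Assumption: compression is measured as embedding dimension over rank (d / k).
    return cfg.embed_dim / cfg.subspace_rank


def lr_at(step: int, cfg: BaseSetup, warmup_steps: int, total_steps: int) -> float:
    # Linear warmup up to base_lr, then linear decay to zero at total_steps.
    # warmup_steps and total_steps are hypothetical; the page gives neither.
    if total_steps <= warmup_steps:
        raise ValueError("total_steps must exceed warmup_steps")
    if step < warmup_steps:
        return cfg.base_lr * (step + 1) / warmup_steps
    remaining = max(total_steps - step, 0)
    return cfg.base_lr * remaining / (total_steps - warmup_steps)


cfg = BaseSetup()
print(f"compression ~ {compression_ratio(cfg):.1f}x")  # 4096 / 40 = 102.4
```

Under this reading, k = 40 against an embedding dimension of 4096 gives a ratio of about 102, consistent with the roughly 100× figure in the row above.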