Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

ACCO: Accumulate While You Communicate for Communication-Overlapped Sharded LLM Training

Authors: Adel Nabli, Louis Fournier, Pierre Erbacher, Louis Serrano, Eugene Belilovsky, Edouard Oyallon

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental In this section, we present our experiments. Section 4.2 details the shared experimental setup. In Sec. 4.3, we demonstrate the shortcomings of DPU and WP initially discussed in Sec. 3, which motivate the design of ACCO. This initial analysis focuses on small language models and datasets, using TinyStories [16] as a testbed. Sec. 4.4 shows that ACCO scales effectively by training a 125M-parameter GPT-Neo [6] on OpenWebText [21]. Sec. 4.5 pushes further with instruction tuning of a 2.7B GPT-Neo model, emphasizing communication bottlenecks and the benefits of ACCO. Finally, Sec. 4.6 compares ACCO and DDP on heterogeneous hardware, where ACCO, unlike DDP, lets faster GPUs accumulate updates while waiting, resulting in faster gradient computation.
Researcher Affiliation Academia Adel Nabli (1,2), Louis Fournier (1), Pierre Erbacher (1), Louis Serrano (1), Eugene Belilovsky (2), Edouard Oyallon (1). (1) Sorbonne Université, CNRS, ISIR, Paris, France; (2) Mila Quebec AI Institute, Concordia University, Montréal, Québec.
Pseudocode Yes B.2 Algorithm Pseudo-Code: We present our algorithm for time-varying batch size N_i^{(t)}. Algorithm 1: Training with ACCO in parallel for a worker i.
Open Source Code Yes The code to reproduce all our experiments is available at https://github.com/edouardoyallon/acco.
Open Datasets Yes We experiment with small language models on the TinyStories dataset [16]... We pre-trained a model equivalent to GPT-2 [54]... and the OpenWebText dataset [21]... and finetuned it on the Alpaca dataset [70]... We used the GPT-Neo tokenizer, pre-trained on the Pile dataset [20]... TinyStories: available at https://huggingface.co/datasets/roneneldan/TinyStories ... OpenWebText dataset: available at https://huggingface.co/datasets/Skylion007/openwebtext ... Alpaca dataset: available at https://huggingface.co/datasets/tatsu-lab/alpaca
Dataset Splits No The paper mentions using specific datasets like TinyStories, OpenWebText, and Alpaca. It mentions evaluating on 'a test split of OpenWebText' and 'finetuned it on the Alpaca dataset [70] containing 52k pairs of instruction/answer.' However, it does not provide explicit details on how the dataset splits (e.g., percentages for training, validation, and test sets) were created or used for reproducibility in the main text or appendix.
Hardware Specification Yes We experiment on our local cluster of NVIDIA A100-80GB GPUs with 8 GPUs per node and an Omni-Path interconnection network at 100 Gb/s for inter-node connections, intra-node connections being done with NVLink 300 GB/s... The first was conducted on 8 H100-PCIe 80GB on a single node. The second was on 32 A100-80G GPU distributed on 4 nodes.
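To give a rough sense of why the quoted bandwidths matter, the following back-of-the-envelope sketch estimates how long it would take to push one full set of bfloat16 gradients over a single link. It uses only figures quoted on this page (100 Gb/s inter-node, 300 GB/s NVLink, 125M and 2.7B parameters, bfloat16 communication buffers). The function name and the single-pass model are assumptions made for illustration: real collectives such as all-reduce or reduce-scatter move a different amount of data depending on the algorithm and the number of workers, so these numbers are an order-of-magnitude guide, not measurements from the paper.

```python
# Rough, illustrative estimate only: one pass of all gradients over one link.
BYTES_PER_BF16 = 2                      # bfloat16 uses 2 bytes per value
INTER_NODE_BYTES_PER_S = 100e9 / 8      # 100 Gb/s expressed in bytes per second
NVLINK_BYTES_PER_S = 300e9              # 300 GB/s


def transfer_seconds(n_params, bandwidth_bytes_per_s, bytes_per_param=BYTES_PER_BF16):
    """Time to move n_params values once over a link (hypothetical helper)."""
    return n_params * bytes_per_param / bandwidth_bytes_per_s


# Model sizes quoted in the Research Type and Experiment Setup rows.
MODELS = {"GPT-Neo 125M": 125e6, "GPT-Neo 2.7B": 2.7e9}

estimates = {
    name: {
        "inter_node_s": transfer_seconds(n, INTER_NODE_BYTES_PER_S),
        "nvlink_s": transfer_seconds(n, NVLINK_BYTES_PER_S),
    }
    for name, n in MODELS.items()
}

for name, est in estimates.items():
    print(f"{name}: inter-node ~{est['inter_node_s']:.3f} s, NVLink ~{est['nvlink_s']:.4f} s")
```

Under these simplifying assumptions, the inter-node link is 24 times slower than NVLink, which is consistent with the page's emphasis on communication bottlenecks for the 2.7B model.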
Software Dependencies No Our code is in PyTorch [52]... No specific version number for PyTorch or other software dependencies is provided, which is required for reproducibility.
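As an illustration of what a version record for this variable could look like, here is a minimal sketch that captures installed package versions with the Python standard library. It is not taken from the paper or its repository. The package list is an assumption: only `torch` is implied by the row above, and the other names are hypothetical placeholders.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


# "torch" comes from the row above; the remaining names are illustrative guesses.
PACKAGES = ["torch", "transformers", "datasets"]

if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name}=={version}" if version else f"{name}: not installed")
```

Saving this output next to experiment logs is one simple way to record the dependency versions that this variable checks for.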
Experiment Setup Yes We trained all our models with AdamW [36], using mixed precision: our model parameters, gradient accumulation and communication buffers are in bfloat16 [24] while our sharded optimizer states are in single precision... We use a 36M-parameter GPT-Neo-based [6] decoder-only transformer and train a BPE tokenizer on TinyStories to match their 10k vocabulary. All experiments are run with 8 workers on a single node... we pre-trained a model equivalent to GPT-2 [54] with both ACCO and DDP with a ZeRO optimizer. Specifically, we used the GPT-Neo architecture [6] with 125 million parameters... We maxed out the memory of our GPUs with a local mini-batch size of 24. To reach a sufficiently large overall batch size, we used 1 step of gradient accumulation for DDP, and none for ACCO... In Tab. 3, we report additional experimental details...

Table 4: Training hyperparameters for ACCO and DDP configurations.
Hyperparameter | 8 H100 | 32 A100
mini-batch_size | 24 | 24
n_grad_accumulation | ACCO: -, DDP: 1 | ACCO: -, DDP: 1
sequence_len | 1024 | 1024
#tokens_batch | 400K | 1.5M
optimizer | AdamW | AdamW
learning_rate | 6e-4 | 6e-4
weight_decay | 0.1 | 0.1
adam_beta1 | 0.9 | 0.9
adam_beta2 | 0.95 | 0.95
nb_steps_tot | 50000 | 50000
scheduler | cosine | cosine
n_warmup_steps | 0 | 0

Table 5: Finetuning hyperparameters for ACCO, DDP and DPU configurations.
Hyperparameter | ACCO | DDP | DPU
mini-batch_size | 4 | 4 | 4
n_grad_accumulation | 2 | 4 | 4
total batch_size | 128 | 128 | 128
optimizer | AdamW | AdamW | AdamW
learning_rate | 2e-5 | 2e-5 | 2e-5
weight_decay | 0.0 | 0.0 | 0.0
adam_beta1 | 0.9 | 0.9 | 0.9
adam_beta2 | 0.95 | 0.95 | 0.95
nb_steps_tot | 50000 | 50000 | 50000
scheduler | cosine | cosine | cosine
n_warmup_steps | 0 | 0 | 50
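The pretraining batch sizes quoted in the Experiment Setup row can be cross-checked with a little arithmetic. The sketch below multiplies the quoted mini-batch size (24), sequence length (1024) and GPU counts (8 and 32). The factor of two micro-batches per optimizer update is an assumption, not something the page states: it is one reading of "1 step of gradient accumulation for DDP" (one extra micro-batch), and of ACCO processing a comparable amount of data per update. It is chosen because it brings the totals close to the reported 400K and 1.5M tokens. The function and parameter names are hypothetical.

```python
def tokens_per_update(mini_batch_size, sequence_len, n_workers, micro_batches_per_update):
    """Tokens consumed by one optimizer update across all workers."""
    return mini_batch_size * sequence_len * n_workers * micro_batches_per_update


MINI_BATCH_SIZE = 24     # quoted local mini-batch size
SEQUENCE_LEN = 1024      # quoted sequence_len
MICRO_BATCHES = 2        # ASSUMPTION: two micro-batches per update (see lead-in)

tokens_8_h100 = tokens_per_update(MINI_BATCH_SIZE, SEQUENCE_LEN, 8, MICRO_BATCHES)
tokens_32_a100 = tokens_per_update(MINI_BATCH_SIZE, SEQUENCE_LEN, 32, MICRO_BATCHES)

print(f"8 H100:  {tokens_8_h100:,} tokens per update (reported: 400K)")
print(f"32 A100: {tokens_32_a100:,} tokens per update (reported: 1.5M)")
```

Under this assumption the totals come to 393,216 and 1,572,864 tokens, in line with the rounded figures quoted above. The finetuning total of 128 sequences is not recomputed here because the number of workers used for that experiment is not stated on this page.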