Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
DUO: No Compromise to Accuracy Degradation
Authors: Jinda Jia, Cong Xie, Hanlin Lu, Fanjiang Ye, Hao Feng, Daoce Wang, Haibin Lin, Zhi Zhang, Xin Liu
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments are organized into two parts: accuracy and performance. Together, these evaluations show that DUO improves training accuracy while maintaining nearly identical training speed in Sharded Data Parallelism with gradient compression and can even increase throughput by enabling more aggressive compression. |
| Researcher Affiliation | Collaboration | ¹Indiana University, ²ByteDance Inc. |
| Pseudocode | Yes | Algorithm 1 Distributed training with DUO Algorithm 2 SGD + Fast-Slow Update |
| Open Source Code | Yes | We also provide our source code along with detailed instructions to reproduce the main training results. |
| Open Datasets | Yes | We use The Pile [12] as our training dataset due to its open-source availability and general applicability. |
| Dataset Splits | No | All pre-training tasks are run for 80 000 iterations, processing 80 billion tokens in total. Accuracy is evaluated via validation-loss comparison across all settings. While validation loss is mentioned, the specific split percentages or sample counts for training, validation, and testing are not explicitly detailed in the paper. |
| Hardware Specification | Yes | 4-node setup with A100 GPUs: Each node is equipped with 4 NVIDIA A100-SXM4-40GB GPUs connected via NVLink. Nodes are connected by 100 Gbps Slingshot links. 8-node setup with H20 GPUs: Each node is equipped with 8 NVIDIA H20 GPUs connected via NVLink. Nodes are connected by 400 Gbps InfiniBand links. 4-node setup with single A100 GPU: Each node has a single NVIDIA A100-SXM4-40GB GPU. Nodes are connected by 100 Gbps Ethernet links. |
| Software Dependencies | No | To effectively evaluate the proposed method, we integrate DUO into Megatron-LM, one of the most commonly used open-source LLM training frameworks. No specific version numbers for Megatron-LM or other software dependencies are provided in the paper's text. |
| Experiment Setup | Yes | The hyperparameter settings follow those of the OPT model [35], ensuring that models of the same size use identical configurations. For reproducibility, we provide the detailed configurations in Table 6 (Appendix B). Additionally, Table 7 (Appendix C) provides training configuration for the throughput test. |
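
The figures quoted in the table imply a few derived quantities that may help when sizing a reproduction attempt. The sketch below is only arithmetic on the numbers quoted above: it assumes the 80 billion tokens are spread evenly over the 80 000 iterations, and the variable names are illustrative rather than taken from the paper or its code.

```python
# Figures quoted in the table above; names are illustrative, not from the paper.
iterations = 80_000
total_tokens = 80_000_000_000

# Assumption: tokens are spread evenly across iterations.
tokens_per_iteration = total_tokens // iterations

# (nodes, GPUs per node) for each hardware setup listed in the table.
setups = {
    "A100, 4 nodes x 4 GPUs": (4, 4),
    "H20, 8 nodes x 8 GPUs": (8, 8),
    "A100, 4 nodes x 1 GPU": (4, 1),
}
gpu_counts = {name: nodes * per_node for name, (nodes, per_node) in setups.items()}

print(tokens_per_iteration)
print(gpu_counts)
```

Under that assumption, each iteration processes one million tokens, and the three setups use 16, 64, and 4 GPUs respectively.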