1-bit Adam: Communication Efficient Large-Scale Training with Adam’s Convergence Speed

Authors: Hanlin Tang, Shaoduo Gan, Ammar Ahmad Awan, Samyam Rajbhandari, Conglong Li, Xiangru Lian, Ji Liu, Ce Zhang, Yuxiong He

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on up to 256 GPUs show that 1-bit Adam enables up to 3.3× higher throughput for BERT-Large pre-training and up to 2.9× higher throughput for SQuAD fine-tuning. In addition, we provide theoretical analysis for our proposed work.
Researcher Affiliation | Collaboration | (1) Microsoft; (2) Department of Computer Science, University of Rochester; (3) Department of Computer Science, ETH Zurich.
Pseudocode | Yes | Algorithm 1: 1-bit Adam. (A minimal sketch of the algorithm's two-stage idea follows the table.)
Open Source Code | Yes | The 1-bit Adam optimizer and the communication primitive backend have been open sourced in a deep learning optimization library called DeepSpeed. https://github.com/microsoft/DeepSpeed
Open Datasets | Yes | We use the same dataset as Devlin et al. (2019), which is a concatenation of Wikipedia and BooksCorpus with 2.5B and 800M words respectively. We use the GLUE fine-tuning benchmark (Wang et al., 2018) to evaluate the convergence of the BERT models trained by Adam and 1-bit Adam. In addition, we also evaluate the convergence and performance of 1-bit Adam for the SQuAD 1.1 fine-tuning task (https://rajpurkar.github.io/SQuAD-explorer/) using a pre-trained BERT model checkpoint from Hugging Face. ... We train CIFAR10 using ResNet-18 (He et al., 2016). ... ImageNet (Russakovsky et al., 2015) ... CelebFaces Attributes Dataset (CelebA) (Liu et al., 2015)
Dataset Splits | Yes | For GLUE benchmarks we use the original Adam optimizer and perform single-task training on the dev set.
Hardware Specification | Yes | The first cluster has 4 NVIDIA Tesla V100 GPUs per node ... the second cluster has 8 V100 GPUs per node ... We run the experiments on 8 1080Ti GPUs where each GPU is used as one worker.
Software Dependencies | No | NVIDIA NCCL is an efficient and widely used communication library that has been tightly integrated in DL frameworks like PyTorch and TensorFlow. ... We design a custom collective primitive using Message Passing Interface (MPI). ... The CUDA-Aware version works only on systems with InfiniBand whereas the basic version can run on any system with Ethernet interconnect. ... MVAPICH2-GDR. (A rough sketch of such a compressed collective follows the table.)
Experiment Setup | Yes | For BERT pre-training, the learning rate linearly increases to 4×10⁻⁴ as a warmup in the first 12.5K steps, then decays to 0.99 of the original after every 520 steps. We set the two parameters in Algorithm 1 as β1 = 0.9 and β2 = 0.999 for 1-bit Adam and Adam. For the convergence test, we set the total batch size to 4K for BERT-Base and BERT-Large. For SQuAD fine-tuning we use the same parameters as published by Hugging Face (batch size = 24, learning rate = 3e-5, dropout = 0.1, 2 epochs), except that we increase the batch size to 96 (using 32 GPUs). (The stated learning-rate schedule is sketched in code after the table.)
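
The Pseudocode row points to Algorithm 1 (1-bit Adam). As a rough illustration of the two-stage idea the paper describes, here is a minimal single-process sketch, assuming NumPy: the warmup stage runs vanilla Adam, while the compression stage freezes the variance term and applies error-compensated 1-bit quantization to the momentum (in distributed training the quantized momentum is what gets communicated). All names and the freeze_step default are illustrative, not taken from the paper's code.

```python
import numpy as np

def onebit_adam_step(param, grad, m, v, error, step, *,
                     lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
                     freeze_step=16_000):
    """One step of a 1-bit-Adam-style update (illustrative sketch only)."""
    m[:] = beta1 * m + (1 - beta1) * grad
    if step < freeze_step:
        # Warmup stage: plain Adam, the variance term keeps being updated.
        v[:] = beta2 * v + (1 - beta2) * grad ** 2
    else:
        # Compression stage: v is frozen; quantize the momentum to 1 bit
        # (sign plus a single scale) with an error-feedback buffer. In the
        # distributed setting this compressed tensor is what workers exchange.
        compensated = m + error
        scale = np.abs(compensated).mean()
        quantized = scale * np.sign(compensated)
        error[:] = compensated - quantized   # remember what compression lost
        m[:] = quantized
    param -= lr * m / (np.sqrt(v) + eps)
```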
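The Software Dependencies row mentions a custom collective primitive built on MPI for exchanging the compressed momentum. Below is a loose sketch, assuming mpi4py and NumPy, of an error-compensated 1-bit allreduce; it uses a plain allgather for clarity, whereas the paper's backend (including its CUDA-Aware variant) uses a more scalable scheme. The function name and buffer layout are ours, not DeepSpeed's.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def onebit_allreduce(tensor, error):
    """Average `tensor` across ranks using sign + scale compression
    with a per-rank error-compensation buffer (illustrative sketch)."""
    world = comm.Get_size()
    compensated = tensor + error
    scale = np.abs(compensated).mean()              # one scalar per message
    signs = np.sign(compensated).astype(np.int8)    # 1 bit per element (stored as int8 here)
    error[:] = compensated - scale * signs          # error feedback for the next call

    # Exchange the compressed payloads and reconstruct the average.
    all_signs = np.empty((world, signs.size), dtype=np.int8)
    all_scales = np.empty(world, dtype=np.float64)
    comm.Allgather(signs, all_signs)
    comm.Allgather(np.array([scale], dtype=np.float64), all_scales)
    return (all_scales[:, None] * all_signs).mean(axis=0)
```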
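Finally, the learning-rate schedule quoted in the Experiment Setup row can be written down directly. One natural reading is a linear warmup to 4×10⁻⁴ over the first 12.5K steps followed by a compounding ×0.99 decay every 520 steps; the defaults below simply restate those numbers and may differ from the released training scripts.

```python
def bert_pretrain_lr(step, peak_lr=4e-4, warmup_steps=12_500,
                     decay_rate=0.99, decay_interval=520):
    """Learning rate at a given optimizer step for the schedule quoted above."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps              # linear warmup
    decay_periods = (step - warmup_steps) // decay_interval
    return peak_lr * decay_rate ** decay_periods          # multiply by 0.99 every 520 steps
```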