Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Private Training Large-scale Models with Efficient DP-SGD

Authors: Liangyu Wang, Junxiao Wang, Jie Ren, Zihang Xiang, David Keyes, Di Wang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our experimental suite is methodically designed to assess the robustness and efficiency of FlashDP across a range of training paradigms and hardware configurations. We explore FlashDP's performance in terms of memory efficiency and throughput under varying batch sizes, its adaptability to Automatic Mixed Precision (AMP) training (Appendix Section E.2), its scalability when employing Distributed Data Parallel (DDP) and Pipeline Parallel (PP) techniques (Appendix Section E.3), and utility evaluation (Appendix Section E.4). Table 1: Differential Batch-size Analysis.
Researcher Affiliation	Academia	Liangyu Wang KAUST Junxiao Wang Guangzhou University Jie Ren KAUST Zihang Xiang KAUST David E. Keyes KAUST Di Wang* KAUST
Pseudocode	Yes	Algorithm 1: FlashDP with Block-wise All-Reduce on GPUs
Open Source Code	Yes	FlashDP's code has been open-sourced in https://github.com/kaustpradalab/flashdp.
Open Datasets	Yes	Our experiments utilize the Wikitext dataset (Merity, 2016) and are conducted on NVIDIA A100 (80GB) GPUs using the PyTorch framework (Paszke et al., 2019). ... The model is trained on the Fineweb-edu (Lozhkov et al., 2024) dataset.
Dataset Splits	No	The paper mentions using the Wikitext dataset and Fineweb-edu dataset and assesses 'validation loss' (Table 3), implying a validation split. However, it does not explicitly provide specific percentages, sample counts, or detailed methodologies for how the datasets were split into training, validation, or test sets.
Hardware Specification	Yes	On a computational platform equipped with four NVIDIA A100 GPUs... Our experiments utilize the Wikitext dataset (Merity, 2016) and are conducted on NVIDIA A100 (80GB) GPUs... using DDP on four A100 GPUs (80GB each).
Software Dependencies	No	Our experiments utilize the Wikitext dataset (Merity, 2016) and are conducted on NVIDIA A100 (80GB) GPUs using the PyTorch framework (Paszke et al., 2019). ... Python is mentioned in the context of implementing the algorithm and CUDA is mentioned, but specific version numbers for PyTorch, Python, or CUDA are not provided.
Experiment Setup	Yes	Key hyperparameters include a total batch size of 524,288 tokens, a micro batch size per device of 32, and a sequence length of 1024. We use a maximum learning rate of 6 × 10⁻⁴ and a minimum learning rate of 6 × 10⁻⁵, with weight decay set at 0.1 and gradient clipping at 1.0. ... enabling differential privacy with delta set at 1 × 10⁻⁵ and a clipping threshold of 100.
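
To make the batch-size figures in the Experiment Setup row concrete, the sketch below collects the quoted hyperparameters in one place and derives how many micro-batches each device would need to process per optimizer step. It is an illustration only, not the authors' configuration: the class name `TrainConfig`, the function `grad_accum_steps`, the use of gradient accumulation, and the choice of four devices (taken from the Hardware Specification row) are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters quoted in the Experiment Setup row (field names are hypothetical)."""
    total_batch_tokens: int = 524_288   # total batch size, in tokens
    micro_batch_size: int = 32          # sequences per device per micro-step
    seq_len: int = 1024                 # tokens per sequence
    max_lr: float = 6e-4
    min_lr: float = 6e-5
    weight_decay: float = 0.1
    grad_clip: float = 1.0
    dp_delta: float = 1e-5              # differential-privacy delta
    dp_clip_threshold: float = 100.0    # differential-privacy clipping threshold


def grad_accum_steps(cfg: TrainConfig, num_devices: int) -> int:
    """Micro-steps per optimizer step, assuming the total batch is reached by accumulation."""
    tokens_per_micro_step = cfg.micro_batch_size * cfg.seq_len * num_devices
    if cfg.total_batch_tokens % tokens_per_micro_step != 0:
        raise ValueError("total batch size is not a multiple of the tokens processed per micro-step")
    return cfg.total_batch_tokens // tokens_per_micro_step


if __name__ == "__main__":
    cfg = TrainConfig()
    # Assumption: four devices, as in the four-GPU DDP setup quoted above.
    print(grad_accum_steps(cfg, num_devices=4))
```

Under these assumptions, each micro-step covers 32 × 1024 × 4 = 131,072 tokens, so four micro-steps make up the 524,288-token batch. The excerpt does not say how many devices were used for this particular run, so the device count should be treated as a free parameter.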