Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Removing Data Heterogeneity Influence Enhances Network Topology Dependence of Decentralized SGD

Authors: Kun Yuan, Sulaiman A. Alghunaim, Xinmeng Huang

JMLR 2023 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Numerical simulations are conducted to validate our theories. Keywords: Decentralized optimization, stochastic optimization, transient stage... 7. Numerical Simulation In this section, we validate the established theoretical results with numerical simulations. 7.1 Strongly-Convex Scenario Problem. We consider the following decentralized least-square problem... Simulation settings. In our simulations, we set d = 10 and M = 1000. To control the data heterogeneity across the nodes, we first let each node i be associated with a local solution x_i⋆, generated as x_i⋆ = x⋆ + v_i, where x⋆ ~ N(0, I_d) is a randomly generated vector and v_i ~ N(0, σ_h² I_d) controls the similarity between the local solutions. Generally speaking, a large σ_h² results in local solutions {x_i⋆} that are vastly different from each other. With x_i⋆ at hand, we can generate local data that follow distinct distributions. At node i, we generate each element of A_i from the standard normal distribution. The measurement b_i is generated by b_i = A_i x_i⋆ + s_i, where s_i ~ N(0, σ_s² I) is white noise.
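The data-generation recipe quoted above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code; the number of nodes n and the noise variance σ_s² are assumed values (the excerpt fixes only d = 10 and M = 1000).

```python
import numpy as np

rng = np.random.default_rng(0)

d, M, n = 10, 1000, 8        # feature dim and samples per node (from the paper); n is assumed
sigma_h2 = 0.2               # heterogeneity variance sigma_h^2 (value used in the paper)
sigma_s2 = 0.1               # measurement-noise variance sigma_s^2 (assumed)

x_star = rng.standard_normal(d)                        # shared solution x*
A, b, x_local = [], [], []
for i in range(n):
    v_i = np.sqrt(sigma_h2) * rng.standard_normal(d)   # v_i ~ N(0, sigma_h^2 I_d)
    x_i = x_star + v_i                                 # local solution x_i* = x* + v_i
    A_i = rng.standard_normal((M, d))                  # entries of A_i ~ N(0, 1)
    s_i = np.sqrt(sigma_s2) * rng.standard_normal(M)   # white noise s_i ~ N(0, sigma_s^2 I)
    b_i = A_i @ x_i + s_i                              # measurements b_i = A_i x_i* + s_i
    A.append(A_i)
    b.append(b_i)
    x_local.append(x_i)
```

Setting sigma_h2 = 0 makes all local solutions coincide, which is how the paper removes data heterogeneity in its homogeneous baseline.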
Researcher Affiliation | Academia | Kun Yuan EMAIL Center for Machine Learning Research, Peking University AI for Science Institute Beijing 100871, P. R. China; Sulaiman A. Alghunaim EMAIL Department of Electrical Engineering Kuwait University Safat 13060, Kuwait; Xinmeng Huang EMAIL Graduate Group in Applied Mathematics and Computational Science University of Pennsylvania Philadelphia, PA 19104, USA
Pseudocode | Yes | Algorithm 1: D2/Exact-Diffusion... Algorithm 2: D2/Exact-Diffusion with multiple gossip steps... Algorithm 3: x_i = Fast Gossip Average
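For readers unfamiliar with the algorithm named in the pseudocode row, the D2/Exact-Diffusion recursion x^(k+1) = W(2x^(k) − x^(k−1) − γ(g^(k) − g^(k−1))) can be sketched as a single NumPy step. This is a hedged illustration of the standard D2/Exact-Diffusion update from the literature, not a reproduction of the paper's Algorithm 1; the function name and stacked-matrix representation are choices made here.

```python
import numpy as np

def d2_step(W, X_cur, X_prev, G_cur, G_prev, lr):
    """One D2/Exact-Diffusion iteration (illustrative sketch).

    W      : (n, n) doubly-stochastic gossip (mixing) matrix
    X_cur  : (n, d) stacked node iterates x_i^(k)
    X_prev : (n, d) stacked node iterates x_i^(k-1)
    G_cur  : (n, d) stochastic gradients evaluated at X_cur
    G_prev : (n, d) stochastic gradients evaluated at X_prev
    lr     : step size gamma
    """
    # x^(k+1) = W (2 x^(k) - x^(k-1) - lr * (g^(k) - g^(k-1)))
    return W @ (2 * X_cur - X_prev - lr * (G_cur - G_prev))
```

The gradient-difference correction is what removes the data-heterogeneity term from the error bound, which is the paper's central theme.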
Open Source Code | No | The paper does not provide an explicit statement or link to the source code for the methodology described in this specific paper. While it mentions a related project 'Blue Fog' in the related works section, it does not confirm that the code for *this* paper's work is available there.
Open Datasets | Yes | 7.3 Simulation with Real Datasets This subsection examines the performances of P-SGD, D-SGD, D2/ED, and MG-D2/ED with real datasets. We run experiments for the regularized logistic regression problem with... We consider two real datasets: MNIST (Deng, 2012) and COVTYPE.binary (Rossi and Ahmed, 2015).
Dataset Splits | No | The paper describes how the datasets were used for training and distributed among nodes to create heterogeneity (e.g., 'In COVTYPE.binary, we use 50,000 samples as training data...', 'half of the nodes maintain 54% positive samples...'), but it does not specify explicit train/test/validation splits for evaluating model generalization performance.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU, GPU models, memory) used for running the simulations. It only mentions simulation settings like 'd = 10 and M = 1000', which are problem parameters.
Software Dependencies | No | The paper does not mention any specific software dependencies with version numbers (e.g., programming languages, libraries, frameworks, or solvers).
Experiment Setup | Yes | In our simulations, we set d = 10 and M = 1000... To control the data heterogeneity across the nodes, we first let each node i be associated with a local solution x_i⋆, generated as x_i⋆ = x⋆ + v_i, where x⋆ ~ N(0, I_d) is a randomly generated vector and v_i ~ N(0, σ_h² I_d) controls the similarity between the local solutions... At each iteration k, each node will randomly sample a row in A_i and the corresponding element in b_i and use them to evaluate the stochastic gradient. The metric for all simulations in this subsection is (1/n) Σ_{i=1}^n ‖x_i^(k) − x⋆‖²... The left plot in Fig. 1 lists the performances of all algorithms. Each algorithm utilizes the same learning rate, which decays by half every 2,000 gossip communications... To this end, we let σ_h² = 0.2... we let σ_h² = 0... The regularization coefficient ρ = 0.001 for all simulations.
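The row-sampling stochastic gradient and the consensus-error metric quoted above can both be written in a few lines. This is an illustrative sketch under the least-squares loss (1/2)(aᵀx − b)²; the helper names are choices made here, not the paper's.

```python
import numpy as np

def sgd_grad(A_i, b_i, x, rng):
    """Stochastic gradient of (1/2)(a^T x - b)^2 from one uniformly sampled row of (A_i, b_i)."""
    k = rng.integers(len(b_i))          # sample one row index
    a, bk = A_i[k], b_i[k]
    return (a @ x - bk) * a             # gradient of the single-sample least-squares loss

def consensus_error(X, x_star):
    """The paper's metric: (1/n) * sum_i ||x_i^(k) - x*||^2, for X of shape (n, d)."""
    return np.mean(np.sum((X - x_star) ** 2, axis=1))
```

In the experiment loop, every node would call `sgd_grad` on its local data at each iteration, and `consensus_error` would be logged against the number of gossip communications (with the step size halved every 2,000 of them).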