Distributed Machine Learning through Heterogeneous Edge Systems

Authors: Hanpeng Hu, Dan Wang, Chuan Wu (pp. 7179-7186)

AAAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our testbed implementation and experiments show that ADSP outperforms existing parameter synchronization models significantly in terms of ML model convergence time, scalability and adaptability to large heterogeneity." "We implement ADSP as a ready-to-use Python library based on TensorFlow (Abadi et al. 2016), and evaluate its performance with testbed experiments." (Section 5, Performance Evaluation)
Researcher Affiliation | Academia | Hanpeng Hu¹, Dan Wang², Chuan Wu¹ (¹The University of Hong Kong, ²The Hong Kong Polytechnic University)
Pseudocode | Yes | The paper presents Algorithm 1, "Commit Rate Adjustment at the Scheduler". (An illustrative, hedged sketch of a commit-rate adjustment loop is given after the table.)
Open Source Code | No | The paper states "We implement ADSP as a ready-to-use Python library based on TensorFlow", but does not provide a link or any explicit statement that the library is open-source or publicly available.
Open Datasets | Yes | "(i) image classification on Cifar-10 (Krizhevsky and Hinton 2010) using a CNN model from the TensorFlow tutorial (Tensorflow 2019)"
Dataset Splits | No | The paper mentions using the Cifar-10 dataset and training with mini-batches, but it does not specify how the data were split into training, validation, and test sets (e.g., percentages or sample counts). (The standard Cifar-10 split is sketched after the table for reference.)
Hardware Specification | Yes | "Testbed. We emulate heterogeneous edge systems following the distribution of hardware configurations of edge devices in a survey (Jkielty 2019), using 19 Amazon EC2 instances (Wang and Ng 2010): 7 t2.large instances, 5 t2.xlarge instances, 4 t2.2xlarge instances and 2 t3.xlarge instances as workers, and 1 t3.2xlarge instance as the PS."
Software Dependencies | No | The paper states that the implementation is "based on TensorFlow", but does not give a TensorFlow version number or versions for any other software dependencies.
Experiment Setup | Yes | "Default Settings. By default, each mini-batch in our model training includes 128 examples. The check period of ADSP is 60 seconds, and each epoch is 20 minutes long. The global learning rate is 1/M (which we find works well through experiments). The local learning rate is initialized to 0.1 and decays exponentially over time." (These defaults are collected into a hedged configuration sketch after the table.)
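The paper's Algorithm 1 is not reproduced in this card, so the following is only an illustrative sketch of what a commit-rate adjustment loop at a scheduler could look like, inferred from the details the card does report (a 60-second check period, heterogeneous workers, a parameter server). The names (WorkerState, TARGET_COMMITS_PER_PERIOD, steps_per_commit) and the target value are hypothetical and not taken from the paper.

```python
import time
from dataclasses import dataclass

# Illustrative sketch only -- NOT the paper's Algorithm 1. It assumes each
# worker reports its measured local training speed (steps/second) and the
# scheduler chooses how many local steps each worker runs between commits so
# that all workers commit to the parameter server at a common target rate.

CHECK_PERIOD_S = 60            # check period reported in the card
TARGET_COMMITS_PER_PERIOD = 4  # hypothetical target, not from the paper

@dataclass
class WorkerState:
    worker_id: str
    steps_per_second: float    # measured locally, reported to the scheduler
    steps_per_commit: int = 1  # local steps between pushes to the PS

def adjust_commit_rates(workers: list[WorkerState]) -> None:
    """Set each worker's steps-per-commit so all commit at the same rate."""
    for w in workers:
        steps_in_period = w.steps_per_second * CHECK_PERIOD_S
        # Faster workers do more local steps per commit; slower ones fewer.
        w.steps_per_commit = max(1, round(steps_in_period / TARGET_COMMITS_PER_PERIOD))

def scheduler_loop(workers: list[WorkerState]) -> None:
    while True:
        adjust_commit_rates(workers)
        # In a real system the new steps_per_commit values would be pushed
        # to the workers here (e.g., over RPC); omitted in this sketch.
        time.sleep(CHECK_PERIOD_S)
```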
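Because the card notes that no explicit train/validation/test split is reported, the sketch below shows only the standard Cifar-10 split as shipped with TensorFlow/Keras (50,000 training and 10,000 test images), plus a hypothetical 10% validation carve-out; the paper may have split the data differently.

```python
import tensorflow as tf

# Standard Cifar-10 split as distributed with Keras: 50,000 train / 10,000 test.
# The 10% validation carve-out below is a hypothetical choice, not the paper's.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()

VAL_FRACTION = 0.1
n_val = int(len(x_train) * VAL_FRACTION)          # 5,000 validation images
x_val, y_val = x_train[:n_val], y_train[:n_val]
x_train, y_train = x_train[n_val:], y_train[n_val:]

# Mini-batches of 128 examples, matching the default reported in the card.
train_ds = (tf.data.Dataset.from_tensor_slices((x_train, y_train))
            .shuffle(10_000)
            .batch(128))
```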
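To make the reported defaults easier to scan, here is a minimal sketch that collects them into a training configuration with an exponentially decaying local learning rate. The decay_steps and decay_rate values are assumptions (the paper only says the local rate "decays exponentially over time"), and M is taken here to be the number of worker instances in the testbed.

```python
import tensorflow as tf

NUM_WORKERS_M = 18            # 18 of the 19 EC2 instances are workers; the 19th is the PS
BATCH_SIZE = 128              # examples per mini-batch (paper default)
CHECK_PERIOD_S = 60           # ADSP check period (paper default)
EPOCH_LENGTH_S = 20 * 60      # each epoch is 20 minutes long (paper default)
GLOBAL_LEARNING_RATE = 1.0 / NUM_WORKERS_M  # global learning rate = 1/M

# The paper states the local rate starts at 0.1 and decays exponentially;
# decay_steps and decay_rate below are assumed values for illustration only.
local_lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.1,
    decay_steps=1000,
    decay_rate=0.96,
)
local_optimizer = tf.keras.optimizers.SGD(learning_rate=local_lr_schedule)
```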