Distributed Machine Learning through Heterogeneous Edge Systems
Authors: Hanpeng Hu, Dan Wang, Chuan Wu (pp. 7179–7186)
AAAI 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our testbed implementation and experiments show that ADSP outperforms existing parameter synchronization models significantly in terms of ML model convergence time, scalability and adaptability to large heterogeneity. (Sec. 5, Performance Evaluation) We implement ADSP as a ready-to-use Python library based on TensorFlow (Abadi et al. 2016), and evaluate its performance with testbed experiments. |
| Researcher Affiliation | Academia | Hanpeng Hu,1 Dan Wang,2 Chuan Wu1 1The University of Hong Kong, 2The Hong Kong Polytechnic University |
| Pseudocode | Yes | Algorithm 1 Commit Rate Adjustment at the Scheduler |
| Open Source Code | No | The paper states 'We implement ADSP as a ready-to-use Python library based on TensorFlow', but does not provide any link or explicit statement about making this library open-source or publicly available. |
| Open Datasets | Yes | (i) image classification on Cifar-10 (Krizhevsky and Hinton 2010) using a CNN model from the TensorFlow tutorial (Tensorflow 2019) |
| Dataset Splits | No | The paper mentions using the Cifar-10 dataset and training with mini-batches, but it does not provide specific details on how the dataset was split into training, validation, and test sets (e.g., percentages or sample counts). |
| Hardware Specification | Yes | Testbed. We emulate heterogeneous edge systems following the distribution of hardware configurations of edge devices in a survey (Jkielty 2019), using 19 Amazon EC2 instances (Wang and Ng 2010): 7 t2.large instances, 5 t2.xlarge instances, 4 t2.2xlarge instances and 2 t3.xlarge instances as workers, and 1 t3.2xlarge instance as the PS. |
| Software Dependencies | No | The paper states it is 'based on TensorFlow', but does not provide a specific version number for TensorFlow or any other software dependencies with their versions. |
| Experiment Setup | Yes | Default Settings. By default, each mini-batch in our model training includes 128 examples. The check period of ADSP is 60 seconds, and each epoch is 20 minutes long. The global learning rate is 1/M (which we find works well through experiments). The local learning rate is initialized to 0.1 and decays exponentially over time. |
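
For reference, a minimal Python sketch of the defaults quoted in the Experiment Setup row. The class and field names are illustrative (not part of the ADSP library), M is assumed to be the number of worker instances from the testbed description, and the exponential-decay constants are placeholders the paper does not report.

```python
# Sketch of the default settings listed in the Experiment Setup row.
# Names are illustrative, not from the ADSP library; decay constants are placeholders.
from dataclasses import dataclass


@dataclass
class AdspDefaults:
    batch_size: int = 128           # examples per mini-batch
    check_period_s: int = 60        # ADSP check period, in seconds
    epoch_length_s: int = 20 * 60   # each epoch is 20 minutes long
    num_workers: int = 18           # M: 7 + 5 + 4 + 2 worker instances in the testbed
    initial_local_lr: float = 0.1   # local learning rate, decays exponentially over time

    @property
    def global_lr(self) -> float:
        # Global learning rate is 1/M (assuming M is the number of workers).
        return 1.0 / self.num_workers

    def local_lr(self, step: int, decay_rate: float = 0.96,
                 decay_steps: int = 1000) -> float:
        # Exponential decay of the local learning rate; the rate and step
        # constants are placeholder values, not reported in the paper.
        return self.initial_local_lr * decay_rate ** (step / decay_steps)


cfg = AdspDefaults()
print(f"global lr = {cfg.global_lr:.4f}, local lr @5k steps = {cfg.local_lr(5000):.4f}")
```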
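On the Open Datasets and Dataset Splits rows: the standard CIFAR-10 distribution ships with a fixed 50,000/10,000 train/test split, which TensorFlow's built-in loader reproduces; whether the authors carved a validation set out of it is not stated in the paper. A minimal loading sketch under that assumption:

```python
# Hedged sketch: loading CIFAR-10 with TensorFlow's built-in loader, which uses
# the standard 50,000/10,000 train/test split. The paper does not state which
# split or validation scheme ADSP actually used.
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()

# Normalize pixel values and build mini-batches of 128 examples, matching the
# paper's default mini-batch size.
train_ds = (
    tf.data.Dataset.from_tensor_slices((x_train / 255.0, y_train))
    .shuffle(10_000)
    .batch(128)
)

print(len(x_train), len(x_test))  # 50000 10000 under the standard split
```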