Accelerated On-Device Forward Neural Network Training with Module-Wise Descending Asynchronism

Authors: Xiaohan Zhao, Hualin Zhang, Zhouyuan Huo, Bin Gu

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical evaluations conducted on NVIDIA's AGX Orin, a popular embedded device, show that AsyncFGD reduces memory consumption and enhances hardware efficiency, offering a novel approach to on-device learning.
Researcher Affiliation | Academia | (1) Nanjing University of Information Science and Technology, China; (2) Mohamed bin Zayed University of Artificial Intelligence, UAE; (3) School of Artificial Intelligence, Jilin University, China. Author Zhouyuan Huo is currently at Google; no work was performed at Google.
Pseudocode | Yes | Algorithm 1: AsyncFGD-SGD (a generic forward-gradient update step is sketched after this table).
Open Source Code | No | The paper does not provide any concrete statement or link regarding the public availability of its source code.
Open Datasets | Yes | Table 1 mentions the 'CIFAR-10' and 'SVHN' datasets. Section 6.3 states 'The models are fine-tuned with weights pre-trained on Imagenet'.
Dataset Splits | No | The paper mentions 'validation performance' for learning-rate selection but does not provide specific data-split information (percentages, counts, or an explicit citation for predefined splits) for the training, validation, or test sets.
Hardware Specification | Yes | Experiments use Python 3.8 and PyTorch, primarily on NVIDIA's AGX Orin; additional results on alternate platforms appear in the appendix. Section I.2 lists the specific hardware: NVIDIA AGX Orin, four NVIDIA 1080 Ti GPUs, an NVIDIA A100, and an Intel(R) Xeon(R) CPU E5-2678 v3 @ 2.50GHz.
Software Dependencies | No | The paper states 'Experiments utilize Python 3.8 and Pytorch', but does not give a specific version number for PyTorch or for other key software dependencies.
Experiment Setup | Yes | Batch size is 64 unless otherwise noted. The optimal learning rate (chosen from {1e-5, 1e-4, 1e-3, 1e-2} with the Adam optimizer [18]) is selected by validation performance. The parameter α is initially set to 1 for the classifier and 0 for all other modules during the first 10 epochs; afterwards, α is gradually increased to 0.15 for specific layers. The seeds for all experiments are fixed to 0. Details on network architecture and model splitting are provided in Appendix H. (A hedged configuration sketch follows the table.)
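
For context on the "FGD" in AsyncFGD-SGD, the following is a minimal, hypothetical sketch of a plain forward-gradient SGD step in PyTorch using forward-mode AD (`torch.func.jvp`). It is an illustration of generic forward-gradient descent only; the paper's Algorithm 1 additionally splits the network into modules and updates them with module-wise descending asynchronism, which is not reproduced here.

```python
# Hypothetical sketch of a plain forward-gradient (FGD) SGD step.
# NOT the paper's AsyncFGD-SGD: no module splitting, no asynchronous pipeline.
import torch
from torch.func import functional_call, jvp

def fgd_sgd_step(model, loss_fn, x, y, lr=1e-3):
    # Treat the parameters as a pytree of primals.
    params = {k: v.detach() for k, v in model.named_parameters()}
    # Sample one random tangent (perturbation direction) per parameter.
    tangents = {k: torch.randn_like(v) for k, v in params.items()}

    def loss_of(p):
        out = functional_call(model, p, (x,))
        return loss_fn(out, y)

    # A single forward pass yields the loss and its directional derivative
    # along the sampled tangent.
    loss, dir_deriv = jvp(loss_of, (params,), (tangents,))

    # Forward-gradient estimate: g_hat = dir_deriv * tangent.
    with torch.no_grad():
        for k, v in model.named_parameters():
            v.add_(tangents[k], alpha=-(lr * dir_deriv.item()))
    return loss.item()
```

A typical loop would call `fgd_sgd_step` once per mini-batch; the module-wise asynchronous pipelining that gives the paper its memory and speed advantages is deliberately omitted from this sketch.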
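
The Experiment Setup row can be made concrete with the hedged sketch below. The model construction, the exact shape of the α ramp, and the module name "classifier" are assumptions inferred from the description, not the authors' code.

```python
# Hedged sketch of the reported setup: seed 0, batch size 64, Adam with a
# learning rate chosen from {1e-5, 1e-4, 1e-3, 1e-2} by validation score,
# and a staged alpha schedule (1.0 for the classifier, 0 elsewhere for the
# first 10 epochs, then ramped toward 0.15). Ramp length is an assumption.
import torch

torch.manual_seed(0)  # the paper fixes all seeds to 0

BATCH_SIZE = 64
LR_GRID = [1e-5, 1e-4, 1e-3, 1e-2]

def alpha_schedule(module_name: str, epoch: int) -> float:
    """Assumed staged schedule for the mixing parameter alpha."""
    if module_name == "classifier":
        return 1.0
    if epoch < 10:
        return 0.0
    # Gradual increase toward 0.15; the 10-epoch ramp length is a guess.
    return min(0.15, 0.15 * (epoch - 9) / 10)

def select_learning_rate(build_model, train_eval_fn):
    """Train once per grid value and keep the lr with the best validation
    score. `build_model` and `train_eval_fn` are user-supplied placeholders."""
    best_lr, best_score = None, float("-inf")
    for lr in LR_GRID:
        model = build_model()
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        score = train_eval_fn(model, optimizer, BATCH_SIZE)
        if score > best_score:
            best_lr, best_score = lr, score
    return best_lr
```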