Distributed Distillation for On-Device Learning

Authors: Ilai Bistritz, Ariana Mann, Nicholas Bambos

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Simulations support our theoretical findings and show that even a naive implementation of our algorithm significantly reduces the communication overhead while achieving an overall comparable accuracy to the state-of-the-art." and, from Section 5 (Simulation Results): "We conduct DNN simulations to evaluate the performance of Distributed Distillation (D-Distillation) compared to two baselines: Distributed-SGD (D-SGD) and Silo-SGD (where each device trains its DNN with only its private data and no communication)."
Researcher Affiliation | Academia | "Ilai Bistritz, Ariana J. Mann, Nicholas Bambos, Stanford University, {bistritz,ajmann,bambos}@stanford.edu"
Pseudocode | Yes | "Algorithm 1 Distributed Distillation" (see the illustrative sketch after this table).
Open Source Code | No | The paper does not provide an unambiguous statement or a direct link indicating that the source code for the described methodology is publicly available.
Open Datasets | Yes | "training LeNet-5 on MNIST [40] and ResNet-8 on CIFAR-10 [41]." Citations [40] and [41] refer to the papers introducing these standard public datasets.
Dataset Splits | No | The paper mentions distributing "MNIST training data" to devices and reports a "test accuracy", but does not give explicit train/validation/test splits, percentages, or per-split sample counts.
Hardware Specification | No | The paper discusses edge devices such as smartphones and IoT devices as the target environment for on-device learning, but does not specify the hardware (e.g., GPU models, CPU types, memory) used to run the reported simulations.
Software Dependencies | No | The paper mentions machine learning models (DNNs, LeNet-5, ResNet) and datasets (MNIST, CIFAR-10) but does not list specific software dependencies with version numbers, such as deep learning frameworks or libraries.
Experiment Setup | Yes | "We selected the best hyperparameters for each algorithm from a limited search as detailed in Appendix 12.1." From Appendix 12.1: "For the MNIST dataset, we used a LeNet-5 architecture with a learning rate of 0.01 for the SGD optimizer. The batch size was 32. For the CIFAR-10 dataset, we used a ResNet-8 architecture with a learning rate of 0.01 for the SGD optimizer. The batch size was 128." (See the configuration sketch after this table.)
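
The Pseudocode row above points to the paper's Algorithm 1, which is not reproduced in this summary. As a rough illustration of the general idea behind distillation-based distributed training (devices exchange soft predictions rather than model weights, which is where the quoted communication savings would come from), the sketch below shows what one device's local update could look like. It is a hedged illustration, not the authors' algorithm: the mixing weight `alpha`, the shared unlabeled batch `public_x`, and the averaged neighbor predictions `neighbor_soft` are assumptions introduced here, and PyTorch itself is an assumption since the paper does not name its framework.

```python
import torch
import torch.nn.functional as F

def local_distillation_step(model, optimizer, private_batch, public_x,
                            neighbor_soft, alpha=0.5):
    """One hypothetical device update mixing a private-data loss with a
    distillation loss toward the neighbors' averaged soft predictions.

    Illustrative sketch only, NOT the paper's Algorithm 1: `public_x`
    (a shared unlabeled batch), `neighbor_soft` (averaged soft labels
    received from neighbors), and `alpha` are assumptions.
    """
    x, y = private_batch
    optimizer.zero_grad()

    # Standard supervised loss on the device's private data.
    ce_loss = F.cross_entropy(model(x), y)

    # Distillation loss: match the neighbors' averaged soft predictions
    # on the shared unlabeled inputs (soft targets are detached).
    log_p = F.log_softmax(model(public_x), dim=1)
    distill_loss = F.kl_div(log_p, neighbor_soft.detach(),
                            reduction="batchmean")

    loss = ce_loss + alpha * distill_loss
    loss.backward()
    optimizer.step()

    # The device would then broadcast its own updated soft predictions
    # (one small probability vector per shared sample) to its neighbors
    # instead of full model weights.
    with torch.no_grad():
        return F.softmax(model(public_x), dim=1)
```

Exchanging per-sample probability vectors over `public_x` is far cheaper than exchanging full DNN parameter tensors, which is consistent with the communication-overhead claim quoted in the Research Type row.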
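
The Experiment Setup row quotes concrete hyperparameters from Appendix 12.1. One minimal way to encode those quoted values for a reproduction attempt is sketched below; only the architectures, learning rates, batch sizes, and the SGD optimizer choice come from the paper, while the dictionary layout, the model names as strings, and the `torch.optim.SGD` call are assumptions for illustration (the paper does not name its framework, per the Software Dependencies row).

```python
import torch

# Values quoted from Appendix 12.1 of the paper; the layout and the
# choice of PyTorch are assumptions made for this illustrative sketch.
EXPERIMENTS = {
    "mnist":   {"architecture": "LeNet-5",  "lr": 0.01, "batch_size": 32},
    "cifar10": {"architecture": "ResNet-8", "lr": 0.01, "batch_size": 128},
}

def make_optimizer(model: torch.nn.Module, dataset: str) -> torch.optim.Optimizer:
    # Plain SGD at the quoted learning rate; no momentum or weight decay
    # is stated in the quoted text, so none is assumed here.
    return torch.optim.SGD(model.parameters(), lr=EXPERIMENTS[dataset]["lr"])
```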