Group Knowledge Transfer: Federated Learning of Large CNNs at the Edge

Authors: Chaoyang He, Murali Annavaram, Salman Avestimehr

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We train CNNs designed based on ResNet-56 and ResNet-110 using three distinct datasets (CIFAR-10, CIFAR-100, and CINIC-10) and their non-I.I.D. variants. Our results show that FedGKT can obtain comparable or even slightly higher accuracy than FedAvg. More importantly, FedGKT makes edge training affordable. Compared to the edge training using FedAvg, FedGKT demands 9 to 17 times less computational power (FLOPs) on edge devices and requires 54 to 105 times fewer parameters in the edge CNN. Our source code is released at FedML (https://fedml.ai).
Researcher Affiliation | Academia | Chaoyang He, Murali Annavaram, Salman Avestimehr; University of Southern California, Los Angeles, CA 90007; chaoyang.he@usc.edu, annavara@usc.edu, avestime@usc.edu
Pseudocode | Yes | Algorithm 1: Group Knowledge Transfer.
Open Source Code | Yes | Our source code is released at FedML (https://fedml.ai).
Open Datasets | Yes | Our training task is image classification on CIFAR-10 [24], CIFAR-100 [24], and CINIC-10 [25].
Dataset Splits | Yes | We also generate their non-I.I.D. variants by splitting training samples into K unbalanced partitions. Details of these three datasets are introduced in Appendix A.1.
Hardware Specification | Yes | Our server node has 4 NVIDIA RTX 2080Ti GPUs with sufficient GPU memory for large model training. We use several CPU-based nodes as clients training small CNNs.
Software Dependencies | No | The paper mentions developing the framework based on FedML [23], but does not provide specific version numbers for FedML or other software dependencies like Python, PyTorch, or TensorFlow.
Experiment Setup | Yes | There are four important hyper-parameters in our FedGKT framework: the communication round, as stated in line #2 of Algorithm 1, the edge-side epoch number, the server-side epoch number, and the server-side learning rate. After a tuning effort, we find that the edge-side epoch number can simply be 1. The server epoch number depends on the data distribution. For I.I.D. data, the value is 20, and for non-I.I.D., the value depends on the level of data bias. For I.I.D., Adam optimizer [65] works better than SGD with momentum [64], while for non-I.I.D., SGD with momentum works better. During training, we reduce the learning rate once the accuracy has plateaued [68, 69]. We use the same data augmentation techniques for fair comparison (random crop, random horizontal flip, and normalization). More details of hyper-parameters are described in Appendix B.4.
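
The Pseudocode row cites Algorithm 1 (Group Knowledge Transfer), which alternates local training on the edge with knowledge distillation on the server: the edge uploads extracted features and logits instead of model weights, and the server returns its own logits as soft labels. Below is a minimal PyTorch-style sketch of one communication round; the toy models, KD temperature `T`, and loss weight `alpha` are assumptions for illustration, not the authors' released FedML implementation.

```python
# Hypothetical sketch of one Group Knowledge Transfer communication round.
import torch
import torch.nn as nn
import torch.nn.functional as F


def kd_loss(student_logits, teacher_logits, T=3.0):
    """Knowledge-distillation term (KL divergence at temperature T; T is assumed)."""
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)


def edge_local_step(extractor, classifier, optimizer, batch, server_logits, alpha=1.0):
    """One edge-side update: cross-entropy on local labels plus KD against server logits."""
    x, y = batch
    features = extractor(x)            # compact feature representation H_k
    logits = classifier(features)
    loss = F.cross_entropy(logits, y)
    if server_logits is not None:      # no distillation signal before the first exchange
        loss = loss + alpha * kd_loss(logits, server_logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # The edge then uploads (features, logits, labels) instead of model weights.
    return features.detach(), logits.detach(), y


def server_step(server_model, optimizer, features, client_logits, labels, alpha=1.0):
    """One server-side update on transferred features: cross-entropy plus KD against client logits."""
    out = server_model(features)
    loss = F.cross_entropy(out, labels) + alpha * kd_loss(out, client_logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return out.detach()                # sent back to the edge as soft labels


if __name__ == "__main__":
    # Toy shapes only: 8 CIFAR-sized images, 10 classes, tiny stand-in models.
    x = torch.randn(8, 3, 32, 32)
    y = torch.randint(0, 10, (8,))
    edge_extractor = nn.Sequential(
        nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten())
    edge_classifier = nn.Linear(16, 10)
    server_model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 10))
    edge_opt = torch.optim.SGD(
        list(edge_extractor.parameters()) + list(edge_classifier.parameters()),
        lr=0.1, momentum=0.9)
    server_opt = torch.optim.Adam(server_model.parameters(), lr=1e-3)

    server_logits = None
    for rnd in range(2):               # two toy communication rounds
        feats, logits, labels = edge_local_step(
            edge_extractor, edge_classifier, edge_opt, (x, y), server_logits)
        server_logits = server_step(server_model, server_opt, feats, logits, labels)
```

In the paper's setting the edge runs a compact CNN and the server trains the large remainder of ResNet-56/110 on the transferred features; the tiny `nn.Sequential` modules above only stand in for those shapes.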
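The Dataset Splits row reports K unbalanced non-I.I.D. partitions but defers the exact scheme to Appendix A.1. One common way to produce such label-skewed shards, shown here purely as an assumption about what "unbalanced partitions" could look like, is Dirichlet sampling over per-class proportions:

```python
# Hypothetical non-I.I.D. split via Dirichlet label skew. The paper defers the
# exact partitioning scheme to its Appendix A.1; the concentration parameter
# `alpha` and this sampling strategy are illustrative assumptions.
import numpy as np


def dirichlet_partition(labels, num_clients, alpha=0.5, seed=0):
    """Split sample indices into `num_clients` unbalanced, label-skewed shards."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        # Proportion of class c assigned to each client.
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(proportions) * len(idx)).astype(int)[:-1]
        for client, shard in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(shard.tolist())
    return client_indices


if __name__ == "__main__":
    # Example with a CIFAR-10-sized fake label array and 16 clients.
    fake_labels = np.random.randint(0, 10, size=50_000)
    parts = dirichlet_partition(fake_labels, num_clients=16)
    print([len(p) for p in parts])   # unbalanced partition sizes
```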
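The Experiment Setup row pins down several concrete choices: one edge-side epoch, about 20 server-side epochs for I.I.D. data, Adam for I.I.D. versus SGD with momentum for non-I.I.D., learning-rate reduction once accuracy plateaus, and standard augmentation (random crop, random horizontal flip, normalization). The sketch below wires those reported choices together in PyTorch; the crop padding, normalization statistics, learning rates, momentum, and scheduler settings are not stated in this excerpt and are assumptions.

```python
# Hypothetical wiring of the reported training choices in PyTorch.
import torch
from torchvision import transforms

# Augmentation reported in the paper: random crop, random horizontal flip,
# and normalization (CIFAR-10 mean/std and padding=4 are assumptions).
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])


def make_server_optimizer(model, iid=True):
    """Adam for I.I.D. data, SGD with momentum for non-I.I.D., as reported."""
    if iid:
        return torch.optim.Adam(model.parameters(), lr=1e-3)              # lr assumed
    return torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)     # values assumed


def make_scheduler(optimizer):
    """Reduce the learning rate once accuracy plateaus, as the paper describes."""
    return torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="max", factor=0.1, patience=10)                   # factor/patience assumed


EDGE_EPOCHS = 1          # reported: the edge-side epoch number can simply be 1
SERVER_EPOCHS_IID = 20   # reported for I.I.D. data; non-I.I.D. depends on the level of bias
```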