Ensemble Distillation for Robust Model Fusion in Federated Learning

Authors: Tao Lin, Lingjing Kong, Sebastian U. Stich, Martin Jaggi

NeurIPS 2020

Reproducibility assessment: each entry lists the variable, the result, and the supporting LLM response (quoted or summarized from the paper).
Research Type: Experimental. "We show in extensive empirical experiments on various CV/NLP datasets (CIFAR-10/100, ImageNet, AG News, SST2) and settings (heterogeneous models/data) that the server model can be trained much faster, requiring fewer communication rounds than any existing FL technique so far."
Researcher Affiliation: Academia. "Tao Lin, Lingjing Kong, Sebastian U. Stich, Martin Jaggi. MLO, EPFL, Switzerland. {tao.lin, lingjing.kong, sebastian.stich, martin.jaggi}@epfl.ch"
Pseudocode: Yes. "Algorithm 1: Illustration of FedDF on K homogeneous clients (indexed by k) for T rounds; n_k denotes the number of data points per client and C the fraction of clients participating in each round. The server model is initialized as x_0. While FEDAVG just uses the averaged models x_{t,0}, we perform N iterations of server-side model fusion on top (lines 7 to 10). 1: procedure SERVER ..." (See the server-side fusion sketch after these entries.)
Open Source Code: No. The paper does not provide any explicit statement about releasing source code or a link to a code repository for the methodology described.
Open Datasets: Yes. "We evaluate the learning of different SOTA FL methods on both CV and NLP tasks, on architectures of ResNet [20], VGG [63], ShuffleNetV2 [48] and DistilBERT [60]. We consider federated learning CIFAR-10/100 [38] and ImageNet [39] (down-sampled to image resolution 32 for computational feasibility [11]) from scratch for CV tasks; while for NLP tasks, we perform federated fine-tuning on a 4-class news classification dataset (AG News [80]) and a 2-class classification task (Stanford Sentiment Treebank, SST2 [66])."
Dataset Splits: Yes. "The validation dataset is created for CIFAR-10/100, ImageNet, and SST2, by holding out 10%, 1% and 1% of the original training samples respectively; the remaining training samples are used as the training dataset (before partitioning client data) and the whole procedure is controlled by random seeds. We use validation/test datasets on the server and report the test accuracy over three different random seeds." (See the validation-split sketch after these entries.)
Hardware Specification: No. The paper does not provide specific details about the hardware (e.g., CPU, GPU models, memory) used for running the experiments. It mentions 'commodity mobile devices' in a general context but not as the experimental hardware.
Software Dependencies: No. The paper mentions various models and libraries (e.g., ResNet, VGG, DistilBERT, Adam optimizer), but it does not specify the version numbers for any software dependencies or programming languages used in the experiments.
Experiment Setup: Yes. "Unless mentioned otherwise, the learning rate is set to 0.1 for ResNet-like nets, 0.05 for VGG, and 1e-5 for DistilBERT. The local training in our experiments uses a constant learning rate (no decay), no Nesterov momentum acceleration, and no weight decay. Adam with learning rate 1e-3 (w/ cosine annealing) is used to distill knowledge from the ensemble of received local models. We employ early-stopping to stop distillation after the validation performance plateaus for 1e3 steps (total 1e4 update steps). We perform 100 communication rounds, and active clients are sampled with ratio C = 0.4 from a total of 20 clients." (See the configuration sketch after these entries.)
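
The Pseudocode entry above describes FedDF's server step: average the received client models, then run N iterations of ensemble distillation on top. Below is a minimal PyTorch-style sketch of that fusion step, written only from the quoted description; the function name feddf_server_round, the uniform (rather than n_k-weighted) averaging, the unlabeled_loader transfer-data source, and the KL-based distillation loss are illustrative assumptions, not code released by the authors.

import copy
import torch
import torch.nn.functional as F

def feddf_server_round(client_models, server_template, unlabeled_loader,
                       n_steps=1000, lr=1e-3, device="cpu"):
    """One fusion round in the spirit of Algorithm 1: average the client
    weights (the FedAvg initialization x_{t,0}), then distill the averaged
    client logits into the server model on unlabeled transfer data."""
    # 1) Average the client weights. The paper weights by n_k; a uniform
    #    average is used here as a simplification.
    avg_state = copy.deepcopy(client_models[0].state_dict())
    for key, value in avg_state.items():
        if value.is_floating_point():
            avg_state[key] = torch.stack(
                [m.state_dict()[key] for m in client_models]).mean(dim=0)
    server_model = copy.deepcopy(server_template)
    server_model.load_state_dict(avg_state)
    server_model.to(device).train()
    client_models = [m.to(device).eval() for m in client_models]

    # 2) N iterations of server-side model fusion (Algorithm 1, lines 7 to 10):
    #    match the ensemble's averaged logits via KL divergence.
    opt = torch.optim.Adam(server_model.parameters(), lr=lr)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=n_steps)
    step = 0
    while step < n_steps:
        for x, _ in unlabeled_loader:  # labels of the transfer data are unused
            x = x.to(device)
            with torch.no_grad():
                ensemble_logits = torch.stack(
                    [m(x) for m in client_models]).mean(dim=0)
            loss = F.kl_div(F.log_softmax(server_model(x), dim=1),
                            F.softmax(ensemble_logits, dim=1),
                            reduction="batchmean")
            opt.zero_grad()
            loss.backward()
            opt.step()
            sched.step()
            step += 1
            if step >= n_steps:
                break
    return server_model

The paper additionally stops distillation early once validation performance plateaus rather than always running a fixed number of steps; that early-stopping loop is omitted here for brevity.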
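
The Dataset Splits entry reports validation sets held out from the original training data (10% for CIFAR-10/100, 1% for ImageNet and SST2) under a controlled random seed. The snippet below is a hedged sketch of such a seeded hold-out using torchvision's CIFAR-10 as an example; the helper name train_val_split and the use of NumPy's RandomState are illustrative choices, not the authors' exact procedure.

import numpy as np
from torch.utils.data import Subset
from torchvision import datasets, transforms

def train_val_split(dataset, val_fraction, seed=0):
    """Hold out `val_fraction` of the training samples as a validation set;
    the permutation is controlled by a random seed."""
    rng = np.random.RandomState(seed)
    indices = rng.permutation(len(dataset)).tolist()
    n_val = int(val_fraction * len(dataset))
    return Subset(dataset, indices[n_val:]), Subset(dataset, indices[:n_val])

# Example: CIFAR-10 with a 10% hold-out (ImageNet/SST2 would use 0.01).
# The remaining training samples would then be partitioned across clients.
cifar_train = datasets.CIFAR10("./data", train=True, download=True,
                               transform=transforms.ToTensor())
train_set, val_set = train_val_split(cifar_train, val_fraction=0.10, seed=0)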
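
Finally, the Experiment Setup entry can be summarized as a configuration sketch. The dictionary below only collects the quoted hyperparameters; all key names are illustrative and no such config file accompanies the paper.

# Hyperparameters quoted in the "Experiment Setup" entry, gathered into a
# single config dict. Key names are illustrative only.
FEDDF_CONFIG = {
    # local client training (constant learning rate, no Nesterov momentum,
    # no weight decay)
    "local_lr": {"resnet_like": 0.1, "vgg": 0.05, "distilbert": 1e-5},
    "local_momentum": 0.0,
    "local_weight_decay": 0.0,
    # server-side ensemble distillation
    "distill_optimizer": "adam",
    "distill_lr": 1e-3,
    "distill_lr_schedule": "cosine_annealing",
    "distill_max_steps": 10_000,
    "early_stop_plateau_steps": 1_000,  # stop once validation plateaus this long
    # federation
    "communication_rounds": 100,
    "num_clients": 20,
    "client_sampling_fraction": 0.4,    # C = 0.4 -> 8 active clients per round
}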