Ensemble Distillation for Robust Model Fusion in Federated Learning

Authors: Tao Lin, Lingjing Kong, Sebastian U. Stich, Martin Jaggi

NeurIPS 2020

Reproducibility assessment: each entry lists the variable, the result, and the supporting LLM response (quoted or summarized from the paper).
Research Type: Experimental. "We show in extensive empirical experiments on various CV/NLP datasets (CIFAR-10/100, ImageNet, AG News, SST2) and settings (heterogeneous models/data) that the server model can be trained much faster, requiring fewer communication rounds than any existing FL technique so far."
Researcher Affiliation: Academia. "Tao Lin, Lingjing Kong, Sebastian U. Stich, Martin Jaggi. MLO, EPFL, Switzerland. {tao.lin, lingjing.kong, sebastian.stich, martin.jaggi}@epfl.ch"
Pseudocode: Yes. "Algorithm 1: Illustration of FedDF on K homogeneous clients (indexed by k) for T rounds; n_k denotes the number of data points per client and C the fraction of clients participating in each round. The server model is initialized as x_0. While FEDAVG just uses the averaged models x_{t,0}, we perform N iterations of server-side model fusion on top (lines 7 to 10). 1: procedure SERVER ..." (See the server-side fusion sketch after these entries.)
Open Source Code: No. The paper does not provide any explicit statement about releasing source code or a link to a code repository for the methodology described.
Open Datasets: Yes. "We evaluate the learning of different SOTA FL methods on both CV and NLP tasks, on architectures of ResNet [20], VGG [63], ShuffleNetV2 [48] and DistilBERT [60]. We consider federated learning CIFAR-10/100 [38] and ImageNet [39] (down-sampled to image resolution 32 for computational feasibility [11]) from scratch for CV tasks; while for NLP tasks, we perform federated fine-tuning on a 4-class news classification dataset (AG News [80]) and a 2-class classification task (Stanford Sentiment Treebank, SST2 [66])."
Dataset Splits: Yes. "The validation dataset is created for CIFAR-10/100, ImageNet, and SST2, by holding out 10%, 1% and 1% of the original training samples respectively; the remaining training samples are used as the training dataset (before partitioning client data) and the whole procedure is controlled by random seeds. We use validation/test datasets on the server and report the test accuracy over three different random seeds." (See the validation-split sketch after these entries.)
Hardware Specification: No. The paper does not provide specific details about the hardware (e.g., CPU, GPU models, memory) used for running the experiments. It mentions 'commodity mobile devices' in a general context but not as the experimental hardware.
Software Dependencies: No. The paper mentions various models and libraries (e.g., ResNet, VGG, DistilBERT, Adam optimizer), but it does not specify the version numbers for any software dependencies or programming languages used in the experiments.
Experiment Setup: Yes. "Unless mentioned otherwise, the learning rate is set to 0.1 for ResNet-like nets, 0.05 for VGG, and 1e-5 for DistilBERT. The local training in our experiments uses a constant learning rate (no decay), no Nesterov momentum acceleration, and no weight decay. Adam with learning rate 1e-3 (w/ cosine annealing) is used to distill knowledge from the ensemble of received local models. We employ early-stopping to stop distillation after the validation performance plateaus for 1e3 steps (total 1e4 update steps). We perform 100 communication rounds, and active clients are sampled with ratio C = 0.4 from a total of 20 clients." (See the configuration sketch after these entries.)
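
The Pseudocode entry above describes FedDF's server step: average the received client models, then run N iterations of ensemble distillation on top. Below is a minimal PyTorch-style sketch of that fusion step, written only from the quoted description; the function name feddf_server_round, the uniform (rather than n_k-weighted) averaging, the unlabeled_loader transfer-data source, and the KL-based distillation loss are illustrative assumptions, not code released by the authors.

import copy
import torch
import torch.nn.functional as F

def feddf_server_round(client_models, server_template, unlabeled_loader,
                       n_steps=1000, lr=1e-3, device="cpu"):
    """One fusion round in the spirit of Algorithm 1: average the client
    weights (the FedAvg initialization x_{t,0}), then distill the averaged
    client logits into the server model on unlabeled transfer data."""
    # 1) Average the client weights. The paper weights by n_k; a uniform
    #    average is used here as a simplification.
    avg_state = copy.deepcopy(client_models[0].state_dict())
    for key, value in avg_state.items():
        if value.is_floating_point():
            avg_state[key] = torch.stack(
                [m.state_dict()[key] for m in client_models]).mean(dim=0)
    server_model = copy.deepcopy(server_template)
    server_model.load_state_dict(avg_state)
    server_model.to(device).train()
    client_models = [m.to(device).eval() for m in client_models]

    # 2) N iterations of server-side model fusion (Algorithm 1, lines 7 to 10):
    #    match the ensemble's averaged logits via KL divergence.
    opt = torch.optim.Adam(server_model.parameters(), lr=lr)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=n_steps)
    step = 0
    while step < n_steps:
        for x, _ in unlabeled_loader:  # labels of the transfer data are unused
            x = x.to(device)
            with torch.no_grad():
                ensemble_logits = torch.stack(
                    [m(x) for m in client_models]).mean(dim=0)
            loss = F.kl_div(F.log_softmax(server_model(x), dim=1),
                            F.softmax(ensemble_logits, dim=1),
                            reduction="batchmean")
            opt.zero_grad()
            loss.backward()
            opt.step()
            sched.step()
            step += 1
            if step >= n_steps:
                break
    return server_model

The paper additionally stops distillation early once validation performance plateaus rather than always running a fixed number of steps; that early-stopping loop is omitted here for brevity.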
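
The Dataset Splits entry reports validation sets held out from the original training data (10% for CIFAR-10/100, 1% for ImageNet and SST2) under a controlled random seed. The snippet below is a hedged sketch of such a seeded hold-out using torchvision's CIFAR-10 as an example; the helper name train_val_split and the use of NumPy's RandomState are illustrative choices, not the authors' exact procedure.

import numpy as np
from torch.utils.data import Subset
from torchvision import datasets, transforms

def train_val_split(dataset, val_fraction, seed=0):
    """Hold out `val_fraction` of the training samples as a validation set;
    the permutation is controlled by a random seed."""
    rng = np.random.RandomState(seed)
    indices = rng.permutation(len(dataset)).tolist()
    n_val = int(val_fraction * len(dataset))
    return Subset(dataset, indices[n_val:]), Subset(dataset, indices[:n_val])

# Example: CIFAR-10 with a 10% hold-out (ImageNet/SST2 would use 0.01).
# The remaining training samples would then be partitioned across clients.
cifar_train = datasets.CIFAR10("./data", train=True, download=True,
                               transform=transforms.ToTensor())
train_set, val_set = train_val_split(cifar_train, val_fraction=0.10, seed=0)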
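
Finally, the Experiment Setup entry can be summarized as a configuration sketch. The dictionary below only collects the quoted hyperparameters; all key names are illustrative and no such config file accompanies the paper.

# Hyperparameters quoted in the "Experiment Setup" entry, gathered into a
# single config dict. Key names are illustrative only.
FEDDF_CONFIG = {
    # local client training (constant learning rate, no Nesterov momentum,
    # no weight decay)
    "local_lr": {"resnet_like": 0.1, "vgg": 0.05, "distilbert": 1e-5},
    "local_momentum": 0.0,
    "local_weight_decay": 0.0,
    # server-side ensemble distillation
    "distill_optimizer": "adam",
    "distill_lr": 1e-3,
    "distill_lr_schedule": "cosine_annealing",
    "distill_max_steps": 10_000,
    "early_stop_plateau_steps": 1_000,  # stop once validation plateaus this long
    # federation
    "communication_rounds": 100,
    "num_clients": 20,
    "client_sampling_fraction": 0.4,    # C = 0.4 -> 8 active clients per round
}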