Benchmarking Algorithms for Federated Domain Generalization

Authors: Ruqi Bai, Saurabh Bagchi, David I. Inouye

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | While prior federated learning (FL) methods mainly consider client heterogeneity, we focus on the Federated Domain Generalization (DG) task, which introduces train-test heterogeneity in the FL context. Existing evaluations in this field are limited in terms of the scale of the clients and dataset diversity. Thus, we propose a Federated DG benchmark that aims to test the limits of current methods with high client heterogeneity, large numbers of clients, and diverse datasets. Towards this objective, we introduce a novel data partition method that allows us to distribute any domain dataset among few or many clients while controlling client heterogeneity. We then introduce and apply our methodology to evaluate 14 DG methods, which include centralized DG methods adapted to the FL context, FL methods that handle client heterogeneity, and methods designed specifically for Federated DG, on 7 datasets. Our results suggest that, despite some progress, significant performance gaps remain in Federated DG, especially when evaluating with a large number of clients, high client heterogeneity, or more realistic datasets. Furthermore, our extendable benchmark code will be publicly released to aid in benchmarking future Federated DG approaches.
Researcher Affiliation | Academia | Ruqi Bai, Saurabh Bagchi & David I. Inouye, Elmore Family School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47907, USA. {bai116,sbagchi,dinouye}@purdue.edu
Pseudocode | Yes | Algorithm 1: Domain Partition Algorithm (a hedged partition sketch appears after the table)
Open Source Code | Yes | Code for reproducing the results is available at the following link: https://github.com/inouye-lab/FedDG_Benchmark.
Open Datasets | Yes | Specifically, our study includes five image datasets and two text datasets. Additionally, within these 7 datasets, we include one subpopulation shift within the image datasets (CelebA) and another within the text datasets (CivilComments). Furthermore, our dataset selections span a diverse range of subjects including general objects, wild camera traps, medical images, human faces, online comments, and programming code.
Dataset Splits | Yes | For each dataset, we first partition the data into 5 categories: training dataset, in-domain validation dataset, in-domain test dataset, held-out validation dataset, and test domain dataset. For the FEMNIST and PACS datasets, we use Cartoon and Sketch as training domains, Art-painting as the held-out domain, and Photo as the test domain. From the training domains, we hold out 10% and 10% of the data as the in-domain validation dataset and in-domain test dataset, respectively. For iWildCam, CivilComments, and Py150, we directly apply the official WILDS partitions. (See the split sketch after the table.)
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or cloud computing instance specifications used for the experiments.
Software Dependencies | Yes | We include the requirements.txt file generated by conda to help with environment setup.
Experiment Setup | Yes | In this section, we present the hyperparameters selected for the evaluation. We run a grid search of 8 runs, starting with the same learning rate as ERM and the other hyperparameters taken from each method's original settings. We then select the hyperparameters based on the best performance on the held-out domain validation set. Please refer to Table 7 to review all hyperparameters. (See the selection sketch after the table.)
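
The three flagged rows above are illustrated below with hedged Python sketches; none of them reproduce the released benchmark code, and all function names and defaults are assumptions. First, the Algorithm 1 domain partition (Pseudocode row): a minimal version that distributes a multi-domain dataset over few or many clients while a single heterogeneity parameter controls how concentrated each domain is on its owning clients. The name partition_domains, the round-robin domain ownership, and the exact parameterization are illustrative; the paper's Algorithm 1 may interpolate differently.

```python
import numpy as np


def partition_domains(domain_indices, num_clients, heterogeneity=1.0, seed=0):
    """Split a multi-domain training set across federated clients.

    domain_indices : dict mapping a domain id to a sequence of sample indices.
    heterogeneity  : float in [0, 1]; 0 spreads every domain evenly over all
                     clients (homogeneous), 1 concentrates each domain on the
                     client(s) that "own" it (maximally heterogeneous).
    Returns a list with one index list per client.
    """
    rng = np.random.default_rng(seed)
    num_domains = len(domain_indices)
    clients = [[] for _ in range(num_clients)]

    for d_pos, idx in enumerate(domain_indices.values()):
        idx = rng.permutation(np.asarray(list(idx)))
        n_hetero = int(round(heterogeneity * len(idx)))

        # Heterogeneous share: only clients assigned to this domain
        # (round-robin: client c owns domain c mod num_domains) receive it.
        owners = [c for c in range(num_clients) if c % num_domains == d_pos]
        if not owners:  # more domains than clients -> fall back to one owner
            owners = [d_pos % num_clients]
        for c, chunk in zip(owners, np.array_split(idx[:n_hetero], len(owners))):
            clients[c].extend(chunk.tolist())

        # Homogeneous share: the remaining samples are spread over all clients.
        for c, chunk in enumerate(np.array_split(idx[n_hetero:], num_clients)):
            clients[c].extend(chunk.tolist())

    return clients


# Example: two PACS-style training domains split across 10 clients, half-way
# between fully homogeneous and fully heterogeneous (values are hypothetical).
parts = partition_domains({"cartoon": range(400), "sketch": range(400, 800)},
                          num_clients=10, heterogeneity=0.5)
```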
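Next, the in-domain splits from the Dataset Splits row: a minimal sketch that carves 10% validation and 10% test subsets out of the training-domain data. Only the 10%/10% fractions come from the quoted text; the random-split logic and names are assumptions.

```python
import numpy as np


def split_in_domain(indices, val_frac=0.10, test_frac=0.10, seed=0):
    """Hold out in-domain validation and test splits from the training domains.

    Matches the quoted 10% / 10% split; everything else is illustrative.
    Returns (train_idx, val_idx, test_idx) as NumPy arrays.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(np.asarray(list(indices)))
    n_val = int(round(val_frac * len(idx)))
    n_test = int(round(test_frac * len(idx)))
    return idx[n_val + n_test:], idx[:n_val], idx[n_val:n_val + n_test]
```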
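Finally, the selection procedure from the Experiment Setup row: a sketch of an 8-run search that keeps the configuration scoring best on the held-out-domain validation set. The grid contents and the train_and_eval callable are placeholders, not the benchmark's actual API.

```python
import itertools
import random


def select_hyperparameters(train_and_eval, grid, num_trials=8, seed=0):
    """Try a small number of configurations and keep the best one.

    train_and_eval : callable(config dict) -> held-out validation score
                     (higher is better).
    grid           : dict mapping hyperparameter name -> candidate values.
    num_trials     : 8, matching the quoted setup.
    """
    random.seed(seed)
    candidates = [dict(zip(grid, values))
                  for values in itertools.product(*grid.values())]
    random.shuffle(candidates)

    # Train each sampled configuration and score it on held-out validation.
    scored = [(train_and_eval(cfg), cfg) for cfg in candidates[:num_trials]]
    best_score, best_cfg = max(scored, key=lambda pair: pair[0])
    return best_cfg, best_score
```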