Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Federated Learning under Covariate Shifts with Generalization Guarantees

Authors: Ali Ramezani-Kebrya, Fanghui Liu, Thomas Pethick, Grigorios Chrysos, Volkan Cevher

TMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate the superiority of FTW-ERM over existing FL baselines in challenging imbalanced federated settings with data distribution shifts across clients. The authors experimentally demonstrate more than 16% overall test accuracy improvement over existing FL baselines when training ResNet-18 (He et al., 2016) on CIFAR10 (Krizhevsky) in these settings. In conclusion, they expand the concept and application scope of FL to a general setting under intra/inter-client covariate shifts, provide an in-depth theoretical understanding of learning with FTW-ERM via a general DRM, and experimentally validate the utility of the proposed framework.
Researcher Affiliation | Academia | Ali Ramezani-Kebrya EMAIL Department of Informatics, University of Oslo and Visual Intelligence Centre. Fanghui Liu EMAIL Laboratory for Information and Inference Systems (LIONS), EPFL. Thomas Pethick EMAIL Laboratory for Information and Inference Systems (LIONS), EPFL. Grigorios Chrysos EMAIL Laboratory for Information and Inference Systems (LIONS), EPFL. Volkan Cevher EMAIL Laboratory for Information and Inference Systems (LIONS), EPFL.
Pseudocode | Yes | Algorithm 1: Histogram-based density ratio matching.
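For intuition, the core idea behind histogram-based density ratio matching can be sketched in a few lines of numpy. This is a minimal one-dimensional illustration of the general technique, not a reproduction of the paper's Algorithm 1, which operates in the federated setting; the function name and binning choices here are assumptions for illustration.

```python
import numpy as np

def histogram_density_ratio(source, target, bins=10):
    """Estimate w(x) = p_target(x) / p_source(x) via shared histogram bins.

    Illustrative 1-D sketch: normalized bin frequencies serve as
    piecewise-constant density estimates, and each source sample is
    assigned the ratio of its bin.
    """
    lo = min(source.min(), target.min())
    hi = max(source.max(), target.max())
    edges = np.linspace(lo, hi, bins + 1)
    p_s, _ = np.histogram(source, bins=edges, density=True)
    p_t, _ = np.histogram(target, bins=edges, density=True)
    eps = 1e-12  # guard against empty source bins
    ratio = p_t / (p_s + eps)
    # Map each source sample to its bin's ratio.
    idx = np.clip(np.digitize(source, edges) - 1, 0, bins - 1)
    return ratio[idx]
```

When source and target are drawn from the same distribution, the estimated weights concentrate around 1, which is the expected behavior under no covariate shift.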
Open Source Code | No | The paper does not provide an explicit statement about releasing code, a direct link to a code repository, or mention of code in supplementary materials.
Open Datasets | Yes | For MNIST-based experiments we use a LeNet (LeCun et al., 1989) with cross entropy loss and compute standard deviations over 5 independent executions. For CIFAR10-based experiments we use the larger ResNet-18 (He et al., 2016). We make use of three datasets in the experiments: MNIST (LeCun et al., 1998), FashionMNIST (Xiao et al., 2017), and CIFAR10 (Krizhevsky).
Dataset Splits | Yes | We split the 10-class FashionMNIST dataset between 5 clients and simulate a target shift by including different fractions of examples from each class across the training data and test data. Table 1: FashionMNIST with label shift across five clients, where each client receives different fractions of examples from each class. Table 6: CIFAR10 target shift distribution across 100 clients, where groups of 10 clients share the same distribution. Table 9: FashionMNIST target shift distribution.
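The label-shift simulation described above (each client receiving different fractions of each class) can be sketched as follows. The helper `split_with_label_shift` is hypothetical, written for illustration under the assumption that per-client class fractions are given as a matrix; the paper's exact split procedure and fractions are reported in its tables.

```python
import numpy as np

def split_with_label_shift(labels, client_class_fracs, rng=None):
    """Partition example indices across clients with per-client class fractions.

    client_class_fracs[k][c] is the fraction of class c's examples assigned
    to client k; each column should sum to at most 1.
    """
    rng = np.random.default_rng(rng)
    labels = np.asarray(labels)
    fracs = np.asarray(client_class_fracs, dtype=float)
    n_clients, n_classes = fracs.shape
    clients = [[] for _ in range(n_clients)]
    for c in range(n_classes):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        start = 0
        for k in range(n_clients):
            take = int(round(fracs[k, c] * len(idx)))
            clients[k].extend(idx[start:start + take])
            start += take
    return [np.array(sorted(ix)) for ix in clients]
```

For example, with two classes and fractions `[[0.8, 0.2], [0.2, 0.8]]`, client 0 receives mostly class 0 and client 1 mostly class 1, producing the kind of imbalanced per-client distributions the experiments target.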
Hardware Specification | No | All experiments are carried out on an internal cluster using one GPU. This statement is too general and does not provide specific hardware models (e.g., GPU model, CPU type, memory details).
Software Dependencies | No | The stochastic gradients for each of the clients are computed with a batch size of 64 and aggregated on the server, which uses the Adam optimizer. Experiments on MNIST and FashionMNIST use a LeNet (LeCun et al., 1998), a learning rate of 0.001, no weight decay, and run for 5,000 iterations. For CIFAR10 experiments we use the larger ResNet-18 (He et al., 2016). While specific models and optimizers are mentioned, no version numbers for programming languages, libraries (e.g., PyTorch, TensorFlow), or other software are provided.
Experiment Setup | Yes | The stochastic gradients for each of the clients are computed with a batch size of 64 and aggregated on the server, which uses the Adam optimizer. Experiments on MNIST and FashionMNIST use a LeNet (LeCun et al., 1998), a learning rate of 0.001, no weight decay, and run for 5,000 iterations. For CIFAR10 experiments we use the larger ResNet-18 (He et al., 2016). Batch normalization in ResNet-18 is treated by averaging the statistics on the server and subsequently broadcasting to the workers. A learning rate of 0.0001 and weight decay of 0.0001 are used. We report the best iterate in terms of average test accuracy after 20,000 iterations in Table 7. The partial client participation experiment in Table 2 uses 200,000 iterations.
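The batch-normalization treatment quoted above (averaging statistics on the server, then broadcasting) can be sketched as a simple server-side aggregation. This is a minimal numpy illustration under the assumption that each client reports its running mean and variance arrays; the paper may weight clients differently (e.g., by sample count), which is not shown here.

```python
import numpy as np

def average_bn_statistics(client_stats):
    """Average per-client batch-norm running statistics on the server.

    client_stats is a list of (running_mean, running_var) array pairs,
    one per client. Returns the unweighted elementwise averages, which
    the server would broadcast back to all workers.
    """
    means = np.stack([m for m, _ in client_stats])
    vars_ = np.stack([v for _, v in client_stats])
    return means.mean(axis=0), vars_.mean(axis=0)
```

An unweighted average is the simplest choice; with heterogeneous client dataset sizes, a sample-count-weighted average would be a natural refinement.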