Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Flick: Empowering Federated Learning with Commonsense Knowledge

Authors: Ran Zhu, Mingkun Yang, Shiqiang Wang, Jie Yang, Qing Wang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive results on three datasets demonstrate that Flick improves the global model accuracy by up to 11.43%, and accelerates convergence by up to 12.9×, validating its effectiveness in addressing data heterogeneity.
Researcher Affiliation	Collaboration	¹Delft University of Technology, ²IBM Research
Pseudocode	Yes	Algorithm 1: Data generation and usage.
Open Source Code	Yes	The code can be found at https://github.com/Ran-ZHU/Flick.
Open Datasets	Yes	We use three datasets: (1) PACS [41], consisting of 9,991 images in 7 classes across the following four domains: Photo, Art Painting, Cartoon, and Sketch; (2) Office-Caltech [42], containing 10 overlapping classes between the Office dataset [43] and Caltech256 dataset [44], with data from four domains: Amazon, Caltech, DSLR, and Webcam; (3) DomainNet [45], a large-scale benchmark covering six domains: Clipart, Infograph, Painting, Quickdraw, Real, and Sketch, each originally containing 345 object classes.
Dataset Splits	Yes	For the PACS dataset, we use 20 clients, and for Office-Caltech, we adopt 8 clients. Data from each domain is partitioned into 5 (PACS) or 2 (Office-Caltech) subsets using a Dirichlet distribution with concentration parameter α = 0.1 and α = 0.05, respectively. To evaluate the scalability of Flick in large-scale federated settings, we further conduct experiments on the DomainNet dataset with 100 clients, where 20% are randomly selected to participate in each communication round. Each domain is split into 15 or 17 subsets using a Dirichlet distribution with α = 0.1. Across all three datasets, each client receives data from a single domain with a skewed label distribution, effectively simulating real-world scenarios characterized by both domain shift and label skew.
Hardware Specification	Yes	Figure 14(a) shows the learning curves of Flick and its counterparts, integrated into the baseline method FedAvg, with time-to-accuracy performance evaluated on an NVIDIA A40 GPU; (b) Wall-clock local latency of a specific client, running on an NVIDIA Jetson AGX Orin, across participating rounds, including training on both original and synthetic data points, as well as local captioning.
Software Dependencies	No	All methods are implemented in Python, with neural networks developed using PyTorch. We utilize Salesforce/blip-image-captioning-large [24] for the image captioning and sd-legacy/stable-diffusion-v1-5 [51] for the image generation, sourced from Hugging Face. We also use gpt-4o-mini from OpenAI API [52] to analyze offloaded captions. More details are given in Appendix B.1 and B.2.
Experiment Setup	Yes	Implementation Details. For fair comparisons, all methods are implemented using the same settings. We use SGD as an optimizer with a learning rate of 0.01; the weight decay is 4e-5 and the momentum is 0.9. The batch size for local training is 64 and 32 for the two datasets, respectively, with four clients participating in each round. The communication rounds are set to 150.
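
As a rough illustration of the Dirichlet-based split quoted in the Dataset Splits row, the Python sketch below partitions the sample indices of one domain into label-skewed subsets. It is not the authors' implementation (the linked repository is the reference for that); the function name `dirichlet_partition`, its parameters, and the fixed seed are assumptions made for this example only.

```python
import random
from collections import defaultdict


def dirichlet_partition(labels, num_subsets, alpha, seed=0):
    """Split sample indices into `num_subsets` label-skewed subsets.

    Hypothetical helper, not taken from the Flick code base. For each class,
    proportions over subsets are drawn from a symmetric Dirichlet(alpha);
    a smaller alpha gives a more skewed label distribution per subset.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    subsets = [[] for _ in range(num_subsets)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)

        # A Dirichlet sample is a vector of Gamma(alpha, 1) draws, normalized.
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_subsets)]
        total = sum(draws)
        if total <= 0.0:
            # Guard against numerical underflow for very small alpha:
            # fall back to giving the whole class to one random subset.
            draws = [0.0] * num_subsets
            draws[rng.randrange(num_subsets)] = 1.0
            total = 1.0

        # Cut the shuffled indices at the cumulative proportions.
        start, cumulative = 0, 0.0
        for k, draw in enumerate(draws):
            cumulative += draw / total
            if k == num_subsets - 1:
                end = len(indices)
            else:
                end = min(len(indices), round(cumulative * len(indices)))
            end = max(end, start)
            subsets[k].extend(indices[start:end])
            start = end
    return subsets


# Example: 700 samples over 7 classes, split into 5 subsets with alpha = 0.1,
# mirroring the PACS numbers quoted above (one domain -> 5 client subsets).
example_labels = [i % 7 for i in range(700)]
example_parts = dirichlet_partition(example_labels, num_subsets=5, alpha=0.1)
```

Each returned subset would then be assigned to one client, so that every client holds data from a single domain with a skewed label distribution. With α this small, some subsets may receive few or no samples of a given class, which is the intended label skew.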