Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Feature Distillation is the Better Choice for Model-Heterogeneous Federated Learning

Authors: Yichen Li, Xiuying Wang, Wenchao Xu, Haozhao Wang, Yining Qi, Jiahua Dong, Ruixuan Li

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we evaluate our proposed method using three datasets and various baselines. We investigate the relationship between data heterogeneity and training efficiency. Additionally, we conduct ablation studies to examine each module in FedFD. Finally, we conduct a sensitivity analysis to verify the effectiveness of our method. Table 1 shows the test accuracy of various methods with heterogeneous data across three datasets.
Researcher Affiliation | Academia | 1 School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China; 2 Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, United Arab Emirates; 3 International School, Beijing University of Posts and Telecommunications, Beijing, China; 4 Department of Computing, The Hong Kong Polytechnic University, Hong Kong, China
Pseudocode | Yes | The workflow of FedFD is shown in Algorithm 1 and Figure 2 illustrates the FedFD framework. Algorithm 1: FedFD. Input: T: communication round; K: client number; Dk: local dataset for the client k; w: global model; wk: local model; Mk: projection layer; Sk: the k-th model architecture; Wk: matrix for orthogonality. Output: w, {w1, ..., wk}: global and local models. 1: for c = 1 to T do // communication round; 2: Server randomly selects a subset of devices St; 3: Server sends the global model w to devices; 4: for each selected client k ∈ St in parallel do; 5: Train the local model wk with (1); 6: Send the local model wk back to the server; 8: w ← ServerAggregation({wk}, k ∈ St) with (3); 9: Get the aggregated feature representation ed with (5); 10: Orthogonalize WD to obtain projection layer Md with (7); 11: Distill the feature knowledge to the global model with (9).
Open Source Code | No | Answer: [Yes] Justification: We could provide our code if required.
Open Datasets | Yes | Dataset: We conduct our experiments with heterogeneously partitioned datasets over three datasets: CIFAR-10, CIFAR-100 [19], and Tiny-ImageNet [20].
Dataset Splits | Yes | We apply all the training samples and distribute them to user models, and we use all the testing samples for the performance evaluation.
Hardware Specification | No | The computation is completed in the HPC Platform of Huazhong University of Science and Technology.
Software Dependencies | No | No specific software dependencies with version numbers were explicitly mentioned in the paper's main content or supplementary materials.
Experiment Setup | Yes | Unless otherwise mentioned, we set the number of local training epoch E = 10, communication round T = 200, and the client number K = 20 with an active ratio r = 0.4. For local training, the batch size is 64 and the weight decay is 1e-4. The learning rate is 0.01 for distillation and 0.001 for training the local model. For the model on the server, we employ ResNet-18 [11] as the basic backbone.
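
For readers who want to reuse the reported defaults, the sketch below collects the hyperparameters quoted in the Experiment Setup row and illustrates the client-selection step of Algorithm 1 (step 2). It is not the authors' code, which has not been released. The names `FedFDConfig`, `clients_per_round`, and `select_clients` are hypothetical, and both the rounding of r × K and the use of uniform sampling without replacement are assumptions, since the summary only says the server "randomly selects a subset of devices".

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class FedFDConfig:
    """Defaults quoted in the Experiment Setup row (hypothetical container)."""
    local_epochs: int = 10              # E
    rounds: int = 200                   # T
    num_clients: int = 20               # K
    active_ratio: float = 0.4           # r
    batch_size: int = 64
    weight_decay: float = 1e-4
    distill_lr: float = 0.01            # learning rate for distillation
    local_lr: float = 0.001             # learning rate for local training
    server_backbone: str = "ResNet-18"

    @property
    def clients_per_round(self) -> int:
        # Assumption: r * K is rounded to the nearest integer, with at least one client.
        return max(1, round(self.active_ratio * self.num_clients))


def select_clients(cfg: FedFDConfig, rng: random.Random) -> list[int]:
    """Sample the active subset for one round.

    Assumption: uniform sampling without replacement.
    """
    return sorted(rng.sample(range(cfg.num_clients), cfg.clients_per_round))


cfg = FedFDConfig()
rng = random.Random(0)                  # fixed seed so repeated runs match
active = select_clients(cfg, rng)
print(f"{cfg.clients_per_round} of {cfg.num_clients} clients active:", active)
```

With the reported K = 20 and r = 0.4, this gives 8 active clients per round, so over T = 200 rounds each client is expected to participate in roughly 80 rounds under uniform sampling.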