FedPFT: Federated Proxy Fine-Tuning of Foundation Models
Authors: Zhaopeng Peng, Xiaoliang Fan, Yufan Chen, Zheng Wang, Shirui Pan, Chenglu Wen, Ruisheng Zhang, Cheng Wang
IJCAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on seven commonly used datasets (i.e., four text and three vision) demonstrate the superiority of FedPFT. Our code is available at https://github.com/pzpdzd/FedPFT. Extensive experiments on three FMs and seven commonly used datasets demonstrate that FedPFT outperforms existing baselines that fine-tune FMs without using the full model. |
| Researcher Affiliation | Academia | 1Fujian Key Laboratory of Sensing and Computing for Smart Cities, School of Informatics, Xiamen University 2School of Information and Communication Technology, Griffith University 3School of Information Science and Engineering, Lanzhou University |
| Pseudocode | No | The paper does not contain a clearly labeled section for pseudocode or algorithms. |
| Open Source Code | Yes | Our code is available at https://github.com/pzpdzd/FedPFT. |
| Open Datasets | Yes | Our NLP FM evaluations encompass four text datasets: SST2 [Socher et al., 2013], QNLI [Socher et al., 2013], MNLI [Williams et al., 2017], and QQP. The CV FM is evaluated on three image datasets: CIFAR-10 [Krizhevsky et al., 2009], CIFAR-100 [Krizhevsky et al., 2009] and Flowers [Nilsback and Zisserman, 2008]. In addition, we employ the Bookcorpus [Zhu et al., 2015] and Wikipedia datasets for distillation of NLP sub-FMs, and the ImageNet-1k [Russakovsky et al., 2015] for the distillation of CV sub-FM, respectively. |
| Dataset Splits | No | The paper states: "The evaluation metric is the accuracy on the given validation set." However, it does not provide specific details on how this validation set was created (e.g., explicit percentages or sample counts for the train/validation/test splits) for reproducibility. |
| Hardware Specification | Yes | all experiments are conducted in PyTorch 2.1 and NVIDIA 3090 GPUs. |
| Software Dependencies | Yes | all experiments are conducted in PyTorch 2.1 and NVIDIA 3090 GPUs. |
| Experiment Setup | Yes | In the FL scenario, we set up 100 clients with 500 total communication rounds and employ the Dirichlet data partition method [Hsu et al., 2019] to construct different label-skew data heterogeneity scenarios. In each communication round, we randomly select 10 clients for local fine-tuning, using a linear decay of the global learning rate over rounds and AdamW as the local fine-tuning optimizer. We use FedAvg [McMahan et al., 2017] for global model aggregation. For FedOT, following [Xiao et al., 2023], we use 2 layers at the bottom and 2 layers at the top as Adapter, and compress the intermediate 8 layers into 3 layers as Emulator. For FedPFT, we construct sub-FMs by performing layer-wise compression on the intermediate 10 layers. Three FMs share consistent hyper-parameters for the number of layers L, attention heads h, hidden dimension dmodel, and FFN dimension dff, all set at L = 12, h = 12, dmodel = 768, and dff = 3072. We conduct a study to investigate the impact of two hyper-parameters in the sub-FM alignment module: the interval t between two alignments during FL fine-tuning and the proportion p of neurons that need to be updated for each alignment. The effects of two hyper-parameters on the QNLI dataset are presented in Table 5, suggesting that both t and p should be chosen moderately. |
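The Experiment Setup row describes Dirichlet label-skew partitioning [Hsu et al., 2019] across 100 clients and FedAvg [McMahan et al., 2017] aggregation. The sketch below illustrates both mechanics; it is not the authors' code, and names such as `partition_dirichlet` and `fedavg` are illustrative assumptions.

```python
# Hedged sketch of the FL setup quoted above: Dirichlet(alpha) label-skew
# partitioning and size-weighted FedAvg aggregation. Illustrative only.
import numpy as np

def partition_dirichlet(labels, num_clients=100, alpha=0.5, seed=0):
    """Split sample indices across clients with Dirichlet(alpha) label skew;
    smaller alpha yields more heterogeneous (label-skewed) clients."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        # Per-client proportion of class c, drawn from a Dirichlet prior.
        props = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for client, shard in zip(client_indices, np.split(idx, cuts)):
            client.extend(shard.tolist())
    return client_indices

def fedavg(client_weights, client_sizes):
    """FedAvg: average client parameter dicts weighted by local data size."""
    total = sum(client_sizes)
    return {k: sum(w[k] * (n / total)
                   for w, n in zip(client_weights, client_sizes))
            for k in client_weights[0]}

# Usage: partition 1000 samples of 10 classes over 10 clients.
labels = np.repeat(np.arange(10), 100)
parts = partition_dirichlet(labels, num_clients=10, alpha=0.5)
assert sum(len(p) for p in parts) == len(labels)
```

In each round one would sample 10 of the 100 clients, fine-tune locally with AdamW on their shards, and merge the returned weights with `fedavg`; the linear learning-rate decay is applied to the global rate across rounds.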