Breaking Barriers of System Heterogeneity: Straggler-Tolerant Multimodal Federated Learning via Knowledge Distillation

Authors: Jinqian Chen, Haoyu Tang, Junhao Cheng, Ming Yan, Ji Zhang, Mingzhu Xu, Yupeng Hu, Liqiang Nie

IJCAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on two datasets for video moment retrieval and two datasets for image-text retrieval demonstrate that our method achieves superior results with high straggler robustness.
Researcher Affiliation | Collaboration | Jinqian Chen (1,3), Haoyu Tang (1), Junhao Cheng (1), Ming Yan (2), Ji Zhang (2), Mingzhu Xu (1), Yupeng Hu (1), Liqiang Nie (4). Affiliations: 1 School of Software, Shandong University; 2 Alibaba Group; 3 School of Software Engineering, Xi'an Jiaotong University; 4 Harbin Institute of Technology (Shenzhen)
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any concrete access information (e.g., a repository link or an explicit statement of release) for open-source code.
Open Datasets | Yes |
1. Flickr30k [Young et al., 2014]: This image-text retrieval dataset consists of 31,784 images, each of which is manually annotated with five different sentence descriptions. As in [Qu et al., 2021], 29,784, 1,000, and 1,000 images with paired sentences are adopted for training, validation, and testing, respectively.
2. MSCOCO [Lin et al., 2014]: This image-text retrieval dataset contains 123,287 images, each of which is paired with five annotated sentences. For fair comparisons, the public dataset split is adopted [Qu et al., 2021], i.e., 113,287, 5,000, and 5,000 images for training, validation, and testing, respectively.
3. Charades-STA [Gao et al., 2017]: This video moment retrieval dataset is manually annotated by [Gao et al., 2017] and contains 6,672 videos with an average length of 29.76 seconds. The number of sentence-video pairs is 16,127 in total. Following the common settings [Gao et al., 2017], we divide those pairs into two parts, i.e., 12,408 pairs for training and 3,720 pairs for testing.
4. ActivityNet Captions (Anet) [Krishna et al., 2017]: This video moment retrieval dataset contains 14,926 videos with an average duration of 120 seconds. The sentence-video pairs are 71,957 in total, where the corresponding sentences are longer with more complicated semantics. Following [Gao et al., 2017], we adopt 37,417, 17,505, and 17,031 sentence-video pairs for training, validation, and testing, respectively.
Dataset Splits | Yes | As in [Qu et al., 2021], 29,784, 1,000, and 1,000 images with paired sentences are adopted for training, validation, and testing, respectively. ... i.e., 113,287, 5,000, and 5,000 images for training, validation, and testing, respectively. ... we adopt 37,417, 17,505, and 17,031 sentence-video pairs for training, validation, and testing, respectively. (The reported splits are summarized in the sketch after the table.)
Hardware Specification | Yes | All experiments were performed on a cluster of 4 heterogeneous devices with different configurations. We implemented our framework using PyTorch 1.7.1. For the server side, we used a high-performance computing node equipped with four Intel Xeon processors and 128GB of memory. For the vision-language knowledge distillation, the ViT-B/32 version of CLIP [Radford et al., 2021] is adopted as the teacher model. Each client device was equipped with a GPU of a different model and memory size.
Software Dependencies | Yes | We implemented our framework using PyTorch 1.7.1. (A minimal environment check appears after the table.)
Experiment Setup | Yes | Based on our MFL-AKD framework, we conducted experiments with 40 to 60 communication rounds for the federated learning process. ... For the text-image retrieval task, the model on each client is trained locally for 10 rounds and 30 rounds on the Flickr30k and MSCOCO datasets, respectively. For the video moment retrieval task, the model on each client is trained for 40 rounds and 60 rounds on the Charades-STA and Anet datasets, respectively. The stochastic gradient descent (SGD) optimizer with a learning rate of 0.001 is adopted. During the federated learning process, we applied knowledge distillation with a temperature of 5 and a weight of 0.5 to encourage model convergence. (A training-step sketch under these hyperparameters appears after the table.)
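For quick reference, the splits reported in the Open Datasets and Dataset Splits rows can be collected into a small configuration dict. This is a convenience sketch only; the dict name and layout are illustrative and not taken from the paper:

```python
# Train/validation/test splits exactly as reported in the table above.
# Units: images with paired sentences (image-text retrieval) or
# sentence-video pairs (video moment retrieval).
# The name DATASET_SPLITS and the layout are illustrative, not from the paper.
DATASET_SPLITS = {
    "Flickr30k":            {"train": 29_784,  "val": 1_000,  "test": 1_000},
    "MSCOCO":               {"train": 113_287, "val": 5_000,  "test": 5_000},
    "Charades-STA":         {"train": 12_408,  "val": None,   "test": 3_720},   # no validation split reported
    "ActivityNet Captions": {"train": 37_417,  "val": 17_505, "test": 17_031},
}
```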
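The Software Dependencies row pins only PyTorch 1.7.1; Python and CUDA versions are not reported. A minimal check of the single stated dependency, under that assumption:

```python
# The paper pins only PyTorch 1.7.1; no other versions are reported,
# so this check covers the single stated dependency.
import torch

assert torch.__version__.startswith("1.7.1"), (
    f"paper reports PyTorch 1.7.1, found {torch.__version__}"
)
```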
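The Experiment Setup row reports SGD with a learning rate of 0.001 and knowledge distillation with a temperature of 5 and a weight of 0.5, with CLIP ViT-B/32 as the frozen teacher (Hardware Specification row). Below is a minimal sketch of one client-side training step under those hyperparameters, assuming a standard softened-logits distillation loss in the style of Hinton et al.; the paper does not spell out its exact loss formulation, and all function and variable names here are illustrative:

```python
import torch
import torch.nn.functional as F

TEMPERATURE = 5.0  # distillation temperature reported in the paper
KD_WEIGHT = 0.5    # distillation loss weight reported in the paper

def distillation_loss(student_logits, teacher_logits, t=TEMPERATURE):
    """Softened-logits KD term (Hinton et al.); an assumed formulation,
    since the paper does not give its exact distillation loss."""
    soft_targets = F.softmax(teacher_logits / t, dim=-1)
    log_probs = F.log_softmax(student_logits / t, dim=-1)
    # The t**2 factor keeps the KD gradient magnitude comparable to the
    # task loss when the temperature is raised.
    return F.kl_div(log_probs, soft_targets, reduction="batchmean") * t ** 2

def local_train_step(student, teacher, batch, task_loss_fn, optimizer):
    """One local training step: task loss plus weighted KD loss.
    Model, batch, and loss-function names are illustrative."""
    optimizer.zero_grad()
    student_logits = student(batch)
    with torch.no_grad():  # the CLIP ViT-B/32 teacher is kept frozen
        teacher_logits = teacher(batch)
    loss = task_loss_fn(student_logits, batch) \
        + KD_WEIGHT * distillation_loss(student_logits, teacher_logits)
    loss.backward()
    optimizer.step()
    return loss.item()

# Optimizer as reported in the paper: SGD with a learning rate of 0.001.
# optimizer = torch.optim.SGD(student.parameters(), lr=0.001)
```

The weighted sum here reflects the reported 0.5 distillation weight; how the task loss and distillation term are combined in the authors' MFL-AKD framework may differ in detail.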