Federated Learning for Vision-and-Language Grounding Problems

Authors: Fenglin Liu, Xian Wu, Shen Ge, Wei Fan, Yuexian Zou (pp. 11572-11579)

AAAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments of aimNet-based federated learning framework on two representative tasks, i.e., image captioning and VQA, demonstrate the effective and universal improvements of all metrics over the baselines.
Researcher Affiliation | Collaboration | Fenglin Liu (1), Xian Wu (3), Shen Ge (3), Wei Fan (3), Yuexian Zou (1,2); affiliations: 1 ADSPLAB, School of ECE, Peking University, Shenzhen, China; 2 Peng Cheng Laboratory, Shenzhen, China; 3 Tencent, Beijing, China
Pseudocode | No | The paper does not include any explicit pseudocode blocks or algorithms labeled as such.
Open Source Code | No | The paper does not contain an explicit statement or link indicating that the source code for the described methodology is publicly available.
Open Datasets | Yes | We evaluate our framework on image captioning and VQA. In image captioning, our reported results are evaluated on the popular MSCOCO image captioning dataset (Chen et al. 2015) and the Flickr30k image captioning dataset (Young et al. 2014). The datasets contain 123,287 images and 31,783 images, respectively, with 5 sentences paired to each image... In VQA, we evaluate the framework on VQA v2.0 dataset, where the images are collected from the MSCOCO dataset (Lin et al. 2014).
Dataset Splits | Yes | To make fair comparisons, we use the widely-used splits (Karpathy and Li 2015) to report our results. There are 5,000 images each in the validation set and the test set for MSCOCO, and 1,000 images each for Flickr30k... VQA 2.0 is split into train, validation and test-standard sets. There are 82,783, 40,504 and 81,434 images (443,757, 214,354 and 447,793 corresponding questions) in the training, validation and test set, respectively. (See the first sketch after this table.)
Hardware Specification | No | The paper does not provide specific details about the hardware used for experiments, such as GPU models, CPU types, or memory.
Software Dependencies | No | The paper mentions using components like Faster R-CNN and the MSCOCO captioning evaluation toolkit, and references papers for some modules (e.g., Vaswani et al. 2017 for Multi-Head Attention), but it does not specify version numbers for any software dependencies (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | For our proposal, d stands for the hidden/model size of the baseline decoder. The number of both extracted visual and textual features is 36, which means N = M = 36. Following Vaswani et al. (2017), we set the number of attention heads to 8 and the feed-forward network dimension to 2048. When equipping baseline models with our aimNet, i.e., using the fine-grained image representations learned by aimNet in baseline models, we replace the original features with the refined features directly, since our features are considered to be more powerful. Our aimNet also does not change the number or the size of the original feature vectors (each of them can be seen as a weighted average of the original features). We preserve the original settings for all baselines, and our framework is end-to-end trainable. (See the second sketch after this table.)
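
The split sizes quoted in the Dataset Splits row can be written out as a small sanity check. The snippet below is an illustrative Python sketch; the variable names are ours and nothing here comes from the authors' code. It simply echoes the Karpathy captioning splits and the official VQA v2.0 splits as quoted above.

```python
# Sanity check on the split sizes quoted above (illustrative only;
# all names are ours, not from the paper or its code).

CAPTION_SPLITS = {
    # total images, plus validation/test image counts (Karpathy splits)
    "MSCOCO":    {"total_images": 123_287, "val": 5_000, "test": 5_000},
    "Flickr30k": {"total_images": 31_783,  "val": 1_000, "test": 1_000},
}

VQA_V2_SPLITS = {
    #                  images   questions
    "train":          (82_783,  443_757),
    "validation":     (40_504,  214_354),
    "test-standard":  (81_434,  447_793),
}

for name, s in CAPTION_SPLITS.items():
    # Approximate train size = total - val - test; this assumes no images
    # are excluded, which the paper does not state explicitly.
    approx_train = s["total_images"] - s["val"] - s["test"]
    print(f"{name}: ~{approx_train} train / {s['val']} val / {s['test']} test images")

images = sum(i for i, _ in VQA_V2_SPLITS.values())
questions = sum(q for _, q in VQA_V2_SPLITS.values())
print(f"VQA v2.0: {images} images and {questions} questions across the splits")
```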
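
The Experiment Setup row describes a concrete configuration: hidden size d, N = M = 36 feature vectors, 8 attention heads, a feed-forward dimension of 2048, and refined features that keep the number and size of the originals so they can replace a baseline's features directly. The sketch below illustrates only that shape contract; it is not the authors' aimNet, the module name and the value of d are assumptions, and PyTorch is used purely for illustration.

```python
import torch
import torch.nn as nn

# Minimal sketch of the configuration described above (hypothetical names;
# PyTorch assumed for illustration, not taken from the paper).
D_MODEL = 512      # "d": hidden/model size of the baseline decoder (value assumed)
N_HEADS = 8        # attention heads, following Vaswani et al. (2017)
D_FF = 2048        # feed-forward network dimension
N_FEATURES = 36    # N = M = 36 extracted visual / textual features

class FeatureRefiner(nn.Module):
    """Refines a set of feature vectors with multi-head attention and an FFN.

    The output keeps the original number and size of the feature vectors, so
    it can stand in for the baseline's features, as the quoted setup requires.
    """
    def __init__(self, d_model=D_MODEL, n_heads=N_HEADS, d_ff=D_FF):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, feats):                # feats: (batch, 36, d)
        # Self-attention yields weighted averages of the input features;
        # residual connections and the FFN refine them further.
        attended, _ = self.attn(feats, feats, feats)
        x = self.norm1(feats + attended)
        return self.norm2(x + self.ffn(x))

visual_feats = torch.randn(2, N_FEATURES, D_MODEL)   # e.g. region features
refined = FeatureRefiner()(visual_feats)
assert refined.shape == visual_feats.shape            # same number and size
```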