FreeBind: Free Lunch in Unified Multimodal Space via Knowledge Fusion

Authors: Zehan Wang, Ziang Zhang, Xize Cheng, Rongjie Huang, Luping Liu, Zhenhui Ye, Haifeng Huang, Yang Zhao, Tao Jin, Peng Gao, Zhou Zhao

ICML 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Experimentally, we bind ImageBind with extra image-text and audio-text expert spaces, resulting in three main variants: ImageBind++, InternVL-IB, and InternVL-IB++. These resulting spaces outperform ImageBind on 5 audio-image-text downstream tasks across 9 datasets. |
| Researcher Affiliation | Collaboration | 1Zhejiang University 2Shanghai AI Lab 3ByteDance |
| Pseudocode | No | The paper describes procedures in paragraph text and equations but does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code and checkpoints are released at https://github.com/zehanwang01/FreeBind |
| Open Datasets | Yes | Unimodal data: Following (Wang et al., 2023d), we employ the texts of COCO (Lin et al., 2014), CC3M (Changpinyo et al., 2021; Sharma et al., 2018), MSRVTT (Xu et al., 2016), MAD (Soldan et al., 2022), AudioCaps (Kim et al., 2019), and Clotho (Drossos et al., 2020) as the unimodal source text. There are 2.33M text samples in total (only 1M texts are selected from CC3M). All the unpaired image data are from the ImageNet (Deng et al., 2009) training set, which consists of 1.3M images without any annotations. The audios are sourced from the AudioSet (Gemmeke et al., 2017) training set, totaling 2M audio clips. |
| Dataset Splits | No | The paper uses "unpaired texts, images, and audios" and "sampled subsets" for training but does not specify explicit training, validation, and test splits with percentages, counts, or references to predefined splits for the overall experimental setup. |
| Hardware Specification | Yes | All our experiments are conducted on a single 4090 GPU. |
| Software Dependencies | No | The paper mentions using the Adam optimizer but does not specify version numbers for key software components or libraries (e.g., Python, PyTorch, TensorFlow). |
| Experiment Setup | Yes | For both kinds of basic bond, the temperature of softmax in data collection is 1/100, and the temperature of the InfoNCE loss is 1/50. We use the Adam (Kingma & Ba, 2014) optimizer with a learning rate of 1e-3 and batch size of 4096 for both bonds. The displacement bond is trained for 5 epochs, while the combination bond is trained for 20 epochs. |
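To make the reported contrastive objective concrete, below is a minimal NumPy sketch of a symmetric InfoNCE loss with the paper's temperature of 1/50. This is an illustration only: the function name `info_nce_loss`, the symmetric two-direction formulation, and the assumption of L2-normalized embeddings are our own choices, not taken from the FreeBind codebase.

```python
import numpy as np

def info_nce_loss(z_a, z_b, temperature=1 / 50):
    """Symmetric InfoNCE loss between two batches of L2-normalized embeddings.

    Hypothetical sketch of the contrastive objective described in the paper;
    the actual FreeBind implementation may differ. Matching pairs are assumed
    to share the same row index in z_a and z_b.
    """
    # Cosine-similarity logits scaled by the temperature (paper reports 1/50).
    logits = (z_a @ z_b.T) / temperature
    n = logits.shape[0]

    def nll(l):
        # Numerically stable log-softmax over each row.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        # Negative log-likelihood of the diagonal (matching) entries.
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Average the a->b and b->a directions.
    return 0.5 * (nll(logits) + nll(logits.T))
```

With perfectly aligned orthonormal embeddings the loss is near zero, while mismatched pairs drive it up sharply, which is the behavior the low temperature (sharp softmax) is meant to produce.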