FreeBind: Free Lunch in Unified Multimodal Space via Knowledge Fusion
Authors: Zehan Wang, Ziang Zhang, Xize Cheng, Rongjie Huang, Luping Liu, Zhenhui Ye, Haifeng Huang, Yang Zhao, Tao Jin, Peng Gao, Zhou Zhao
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimentally, we bind ImageBind with extra image-text and audio-text expert spaces, resulting in three main variants: ImageBind++, InternVL_IB and InternVL_IB++. These resulting spaces outperform ImageBind on 5 audio-image-text downstream tasks across 9 datasets. |
| Researcher Affiliation | Collaboration | 1Zhejiang University 2Shanghai AI Lab 3ByteDance. |
| Pseudocode | No | The paper describes procedures in paragraph text and equations but does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code and checkpoints are released at https://github.com/zehanwang01/FreeBind |
| Open Datasets | Yes | Unimodal data Following (Wang et al., 2023d), we employ the texts of COCO (Lin et al., 2014), CC3M (Changpinyo et al., 2021; Sharma et al., 2018), MSRVTT (Xu et al., 2016), MAD (Soldan et al., 2022), AudioCaps (Kim et al., 2019) and Clotho (Drossos et al., 2020) as the unimodal source text. There are 2.33M text samples in total (only 1M texts are selected from CC3M). All the unpaired image data are from the ImageNet (Deng et al., 2009) training set, which consists of 1.3M images without any annotations. The audios are sourced from the AudioSet (Gemmeke et al., 2017) training set, totaling 2M audio clips. |
| Dataset Splits | No | The paper uses "unpaired texts, images, and audios" and "sampled subsets" for training but does not specify explicit training, validation, and test splits with percentages, counts, or references to predefined splits for the overall experimental setup. |
| Hardware Specification | Yes | All our experiments are conducted on a single 4090 GPU. |
| Software Dependencies | No | The paper mentions using Adam optimizer but does not specify version numbers for key software components or libraries (e.g., Python, PyTorch, TensorFlow). |
| Experiment Setup | Yes | For both kinds of basic bond, the temperature of the softmax in data collection is 1/100, and the temperature of the InfoNCE loss is 1/50. We use the Adam (Kingma & Ba, 2014) optimizer with a learning rate of 1e-3 and batch size of 4096 for both bonds. The displacement bond is trained for 5 epochs, while the combination bond is trained for 20 epochs. |
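The reported hyperparameters (InfoNCE temperature 1/50, softmax temperature 1/100 for pseudo-pair collection) can be made concrete with a minimal sketch of a symmetric InfoNCE loss over paired embeddings. This is an illustrative implementation only: the paper does not show code, so the function names and the numpy formulation here are assumptions, with only the temperature value taken from the reported setup.

```python
import numpy as np

def info_nce_loss(emb_a, emb_b, temperature=1 / 50):
    """Symmetric InfoNCE loss between two batches of paired embeddings.

    Matched pairs sit on the diagonal of the similarity matrix; the
    temperature 1/50 follows the paper's reported setting. Everything
    else (normalization, symmetric averaging) is a standard sketch,
    not the authors' exact implementation.
    """
    # L2-normalize so the dot product is cosine similarity
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    logits = (a @ b.T) / temperature

    def cross_entropy_diag(l):
        # log-softmax over each row; the positive is the diagonal entry
        l = l - l.max(axis=1, keepdims=True)          # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the a->b and b->a directions, as in CLIP-style training
    return (cross_entropy_diag(logits) + cross_entropy_diag(logits.T)) / 2

# Toy usage: 8 paired 16-d embeddings
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
loss_aligned = info_nce_loss(x, x)         # identical pairs -> near-zero loss
loss_random = info_nce_loss(x, rng.normal(size=(8, 16)))
```

With perfectly aligned embeddings and such a low temperature, the diagonal dominates and the loss approaches zero, while random pairs yield a loss near log(batch_size); this sanity check is a common way to validate a contrastive-loss implementation.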