LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

Authors: Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, Hongfa Wang, Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li, Cai Wan Zhang, Zhifeng Li, Wei Liu, Li Yuan

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | LanguageBind achieves superior performance on a wide range of 15 benchmarks covering video, audio, depth, and infrared. Moreover, multiple experiments provide evidence for the effectiveness of LanguageBind in achieving indirect alignment and complementarity among diverse modalities.
Researcher Affiliation | Collaboration | (1) Peking University, (2) Pengcheng Lab, (3) Tencent Data Platform, (4) National University of Singapore, (5) Nari Technology Development Limited Company, (6) Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ)
Pseudocode | No | The paper describes its methods in narrative text and uses figures to illustrate the architecture (e.g., Figure 3), but it does not provide any formal pseudocode or algorithm blocks (see the alignment sketch after the table).
Open Source Code | No | The paper states 'We promise to release the VIDAL-10M dataset upon publication' but makes no explicit statement about releasing the source code for the LanguageBind method.
Open Datasets | Yes | 'We thus propose VIDAL-10M with 10 Million data with Video, Infrared, Depth, Audio and their corresponding Language. In our VIDAL-10M, all videos are from short video platforms with complete semantics rather than truncated segments from long videos, and all the video, depth, infrared, and audio modalities are aligned to their textual descriptions. ... We promise to release the VIDAL-10M dataset upon publication.'
Dataset Splits | Yes | 'For a fair comparison of dataset validity, we use the ViT-B/32 model of CLIP4Clip to conduct validation experiments using the 100K subset of VIDAL-10M and the 380K subset of HowTo100M.'
Hardware Specification | No | The paper does not explicitly state the hardware used for its experiments (e.g., GPU models, CPU types).
Software Dependencies | No | The paper mentions software components such as 'OpenCLIP-large', 'BPE tokenizer', 'ChatGPT', 'OFA', 'mPLUG-owl', 'sRGB-TIR', 'GLPN', the 'NLTK toolkit', and 'CLIP4Clip', but it does not specify version numbers for these dependencies, which reproducibility requires (see the version-recording snippet after the table).
Experiment Setup | Yes | 'In this section, we introduce our training configuration.' ... Table 12 (Training setting) compares CLIP4Clip (Video) with LanguageBind (Video / Infrared / Depth / Audio): Vision encoder ViT-Base/32 vs. ViT-Large/14; Optimizer BertAdam vs. AdamW; ... Epochs 1 vs. 16 / 1 / 1 / 8; Learning rate 1e-4 vs. 1e-4 / 1e-4 / 5e-4 / 5e-4; ... Batch size 512 vs. 640 / 1024 / 1024 / 512; ... Mask ratio – vs. 0.3 / 0.5 / 0.5 / 0.3; LoRA rank – vs. 16 / 2 / 2 / 16 (restated as a configuration sketch after the table).
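
Since the paper provides no pseudocode, the sketch below illustrates what language-based semantic alignment typically amounts to: each non-language modality encoder (video, infrared, depth, audio) is trained against a frozen language encoder with a symmetric contrastive loss, so any two modalities become indirectly aligned through the shared language space. This is a minimal sketch under that assumption, not the authors' released code; the function name, the InfoNCE form, and the temperature value are placeholders.

```python
import torch
import torch.nn.functional as F

def info_nce(modality_emb: torch.Tensor, text_emb: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss between a batch of modality embeddings
    (video/infrared/depth/audio) and their paired text embeddings.
    Assumed objective; the paper does not spell this out as pseudocode."""
    modality_emb = F.normalize(modality_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = modality_emb @ text_emb.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_m2t = F.cross_entropy(logits, targets)               # modality -> text
    loss_t2m = F.cross_entropy(logits.t(), targets)           # text -> modality
    return 0.5 * (loss_m2t + loss_t2m)

# Because every modality is contrasted against the SAME frozen language
# encoder, any two modalities end up indirectly aligned through language,
# which is the binding effect the Research Type row refers to.
```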
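The Software Dependencies row notes that no version numbers are reported. A minimal way to record them for a reproduction attempt is sketched below; the package names are guesses at the PyPI distributions behind the tools the paper mentions and may not match the authors' actual environment.

```python
from importlib import metadata

# Hypothetical list of distributions behind the tools the paper names
# (OpenCLIP, NLTK, the PyTorch stack); adjust to the real environment.
packages = ["open_clip_torch", "nltk", "torch", "transformers"]

for name in packages:
    try:
        print(f"{name}=={metadata.version(name)}")  # pin the exact installed version
    except metadata.PackageNotFoundError:
        print(f"{name}: not installed")
```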
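For quick reference, the Table 12 numbers quoted in the Experiment Setup row can be restated as a configuration sketch. The dictionary layout and key names below are assumptions made for readability; only the values come from the quoted table.

```python
# CLIP4Clip baseline column of Table 12 (video only), for comparison.
CLIP4CLIP_BASELINE = {
    "vision_encoder": "ViT-Base/32",
    "optimizer": "BertAdam",
    "epochs": 1,
    "learning_rate": 1e-4,
    "batch_size": 512,
}

# LanguageBind per-modality settings from the same table; field names are placeholders.
LANGUAGEBIND_CONFIG = {
    "video":    dict(epochs=16, learning_rate=1e-4, batch_size=640,  mask_ratio=0.3, lora_rank=16),
    "infrared": dict(epochs=1,  learning_rate=1e-4, batch_size=1024, mask_ratio=0.5, lora_rank=2),
    "depth":    dict(epochs=1,  learning_rate=5e-4, batch_size=1024, mask_ratio=0.5, lora_rank=2),
    "audio":    dict(epochs=8,  learning_rate=5e-4, batch_size=512,  mask_ratio=0.3, lora_rank=16),
}

# Settings shared across LanguageBind modalities in Table 12.
LANGUAGEBIND_SHARED = {"vision_encoder": "ViT-Large/14", "optimizer": "AdamW"}
```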