LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

Authors: Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, Hongfa Wang, Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li, Cai Wan Zhang, Zhifeng Li, Wei Liu, Li Yuan

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | LanguageBind achieves superior performance on a wide range of 15 benchmarks covering video, audio, depth, and infrared. Moreover, multiple experiments provide evidence for the effectiveness of LanguageBind in achieving indirect alignment and complementarity among diverse modalities.
Researcher Affiliation | Collaboration | (1) Peking University, (2) Pengcheng Lab, (3) Tencent Data Platform, (4) National University of Singapore, (5) Nari Technology Development Limited Company, (6) Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ)
Pseudocode | No | The paper describes its methods in narrative text and uses figures to illustrate the architecture (e.g., Figure 3), but it does not provide any formal pseudocode or algorithm blocks (see the alignment sketch after the table).
Open Source Code | No | The paper states 'We promise to release the VIDAL-10M dataset upon publication' but makes no explicit statement about releasing the source code for the LanguageBind method.
Open Datasets | Yes | 'We thus propose VIDAL-10M with 10 Million data with Video, Infrared, Depth, Audio and their corresponding Language. In our VIDAL-10M, all videos are from short video platforms with complete semantics rather than truncated segments from long videos, and all the video, depth, infrared, and audio modalities are aligned to their textual descriptions. ... We promise to release the VIDAL-10M dataset upon publication.'
Dataset Splits | Yes | 'For a fair comparison of dataset validity, we use the ViT-B/32 model of CLIP4Clip to conduct validation experiments using the 100K subset of VIDAL-10M and the 380K subset of HowTo100M.'
Hardware Specification | No | The paper does not explicitly state the hardware used for its experiments (e.g., GPU models, CPU types).
Software Dependencies | No | The paper mentions software components such as 'OpenCLIP-large', 'BPE tokenizer', 'ChatGPT', 'OFA', 'mPLUG-owl', 'sRGB-TIR', 'GLPN', the 'NLTK toolkit', and 'CLIP4Clip', but it does not specify version numbers for these dependencies, which reproducibility requires (see the version-recording snippet after the table).
Experiment Setup | Yes | 'In this section, we introduce our training configuration.' ... Table 12 (Training setting) compares CLIP4Clip (Video) with LanguageBind (Video / Infrared / Depth / Audio): Vision encoder ViT-Base/32 vs. ViT-Large/14; Optimizer BertAdam vs. AdamW; ... Epochs 1 vs. 16 / 1 / 1 / 8; Learning rate 1e-4 vs. 1e-4 / 1e-4 / 5e-4 / 5e-4; ... Batch size 512 vs. 640 / 1024 / 1024 / 512; ... Mask ratio – vs. 0.3 / 0.5 / 0.5 / 0.3; LoRA rank – vs. 16 / 2 / 2 / 16 (restated as a configuration sketch after the table).
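
Since the paper provides no pseudocode, the sketch below illustrates what language-based semantic alignment typically amounts to: each non-language modality encoder (video, infrared, depth, audio) is trained against a frozen language encoder with a symmetric contrastive loss, so any two modalities become indirectly aligned through the shared language space. This is a minimal sketch under that assumption, not the authors' released code; the function name, the InfoNCE form, and the temperature value are placeholders.

```python
import torch
import torch.nn.functional as F

def info_nce(modality_emb: torch.Tensor, text_emb: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss between a batch of modality embeddings
    (video/infrared/depth/audio) and their paired text embeddings.
    Assumed objective; the paper does not spell this out as pseudocode."""
    modality_emb = F.normalize(modality_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = modality_emb @ text_emb.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_m2t = F.cross_entropy(logits, targets)               # modality -> text
    loss_t2m = F.cross_entropy(logits.t(), targets)           # text -> modality
    return 0.5 * (loss_m2t + loss_t2m)

# Because every modality is contrasted against the SAME frozen language
# encoder, any two modalities end up indirectly aligned through language,
# which is the binding effect the Research Type row refers to.
```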
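The Software Dependencies row notes that no version numbers are reported. A minimal way to record them for a reproduction attempt is sketched below; the package names are guesses at the PyPI distributions behind the tools the paper mentions and may not match the authors' actual environment.

```python
from importlib import metadata

# Hypothetical list of distributions behind the tools the paper names
# (OpenCLIP, NLTK, the PyTorch stack); adjust to the real environment.
packages = ["open_clip_torch", "nltk", "torch", "transformers"]

for name in packages:
    try:
        print(f"{name}=={metadata.version(name)}")  # pin the exact installed version
    except metadata.PackageNotFoundError:
        print(f"{name}: not installed")
```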
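For quick reference, the Table 12 numbers quoted in the Experiment Setup row can be restated as a configuration sketch. The dictionary layout and key names below are assumptions made for readability; only the values come from the quoted table.

```python
# CLIP4Clip baseline column of Table 12 (video only), for comparison.
CLIP4CLIP_BASELINE = {
    "vision_encoder": "ViT-Base/32",
    "optimizer": "BertAdam",
    "epochs": 1,
    "learning_rate": 1e-4,
    "batch_size": 512,
}

# LanguageBind per-modality settings from the same table; field names are placeholders.
LANGUAGEBIND_CONFIG = {
    "video":    dict(epochs=16, learning_rate=1e-4, batch_size=640,  mask_ratio=0.3, lora_rank=16),
    "infrared": dict(epochs=1,  learning_rate=1e-4, batch_size=1024, mask_ratio=0.5, lora_rank=2),
    "depth":    dict(epochs=1,  learning_rate=5e-4, batch_size=1024, mask_ratio=0.5, lora_rank=2),
    "audio":    dict(epochs=8,  learning_rate=5e-4, batch_size=512,  mask_ratio=0.3, lora_rank=16),
}

# Settings shared across LanguageBind modalities in Table 12.
LANGUAGEBIND_SHARED = {"vision_encoder": "ViT-Large/14", "optimizer": "AdamW"}
```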