VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset

Authors: Sihan Chen, Handong Li, Qunbo Wang, Zijia Zhao, Mingzhen Sun, Xinxin Zhu, Jing Liu

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments have been conducted to demonstrate the effectiveness of our proposed VAST-27M corpus and VAST foundation model. VAST achieves 22 new state-of-the-art results on various cross-modality benchmarks.
Researcher Affiliation | Collaboration | Sihan Chen 1,2, Handong Li 1,2, Qunbo Wang 2, Zijia Zhao 2,1, Mingzhen Sun 2,1, Xinxin Zhu 2, Jing Liu 1,2; 1 School of Artificial Intelligence, University of Chinese Academy of Sciences; 2 Institute of Automation, Chinese Academy of Sciences
Pseudocode | No | The paper provides diagrams and textual descriptions of the methods but does not include formal pseudocode or algorithm blocks.
Open Source Code | Yes | Code, model and dataset will be released at https://github.com/TXH-mercury/VAST.
Open Datasets | Yes | Code, model and dataset will be released at https://github.com/TXH-mercury/VAST. The training is conducted on a combination corpus consisting of VAST-27M, VALOR-1M, WavCaps, CC14M, and 110M randomly sampled pairs from LAION-400M.
Dataset Splits | Yes | Specific train/val/test splits of those benchmarks can be found in Table 9.
Hardware Specification | Yes | VAST is trained using the PyTorch framework on 64 Tesla V100 cards.
Software Dependencies | No | The paper mentions 'PyTorch framework' but does not specify a version number for PyTorch or any other software dependencies.
Experiment Setup | Yes | The training is conducted... for a total of 200K training steps. At each training step, one corpus is sampled for training. ... The initial learning rate is set to 1e-4, and a linear decay schedule is used. The batch size is set to 1024. Specific finetuning hyperparameters of VAST for different benchmarks are presented in Table 10.
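For reference, the experiment-setup and open-datasets details quoted above can be read as the following training-loop sketch. This is a minimal illustration assuming a toy model and dummy per-corpus loaders: only the quoted numbers (200K steps, batch size 1024, initial learning rate 1e-4 with linear decay, one corpus sampled per step) and the corpus names come from the paper; the uniform sampling, placeholder model, and loss are hypothetical and are not the authors' released code.

```python
# Sketch of the multi-corpus schedule described in the paper's setup.
# Assumptions (not from the paper): uniform corpus sampling, toy model,
# dummy data, MSE loss.
import random
import torch
from torch import nn
from torch.optim.lr_scheduler import LambdaLR

TOTAL_STEPS = 200_000   # "a total of 200K training steps"
BATCH_SIZE = 1024       # "The batch size is set to 1024"
BASE_LR = 1e-4          # "The initial learning rate is set to 1e-4"

def dummy_loader(feat_dim=64):
    """Stand-in for a real DataLoader over one corpus."""
    while True:
        yield torch.randn(BATCH_SIZE, feat_dim), torch.randn(BATCH_SIZE, 1)

# One stream per corpus in the combination corpus (names from the paper).
corpora = {name: dummy_loader() for name in
           ["VAST-27M", "VALOR-1M", "WavCaps", "CC14M", "LAION-400M subset"]}

model = nn.Linear(64, 1)  # placeholder, not the VAST architecture
optimizer = torch.optim.AdamW(model.parameters(), lr=BASE_LR)
# Linear decay: learning rate falls from BASE_LR toward 0 over TOTAL_STEPS.
scheduler = LambdaLR(optimizer, lambda step: 1.0 - step / TOTAL_STEPS)

for step in range(TOTAL_STEPS):
    name = random.choice(list(corpora))   # "one corpus is sampled" per step
    inputs, targets = next(corpora[name])
    loss = nn.functional.mse_loss(model(inputs), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```

The per-benchmark finetuning hyperparameters referenced above (Table 10 of the paper) are not reproduced in this sketch.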