VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
Authors: Sihan Chen, Handong Li, Qunbo Wang, Zijia Zhao, Mingzhen Sun, Xinxin Zhu, Jing Liu
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments have been conducted to demonstrate the effectiveness of our proposed VAST-27M corpus and VAST foundation model. VAST achieves 22 new state-of-the-art results on various cross-modality benchmarks. |
| Researcher Affiliation | Collaboration | Sihan Chen1,2, Handong Li1,2, Qunbo Wang2, Zijia Zhao1,2, Mingzhen Sun1,2, Xinxin Zhu2, Jing Liu1,2. 1 School of Artificial Intelligence, University of Chinese Academy of Sciences; 2 Institute of Automation, Chinese Academy of Sciences |
| Pseudocode | No | The paper provides diagrams and textual descriptions of the methods but does not include formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code, model and dataset will be released at https://github.com/TXH-mercury/VAST. |
| Open Datasets | Yes | Code, model and dataset will be released at https://github.com/TXH-mercury/VAST. The training is conducted on a combination corpus consisting of VAST-27M, VALOR-1M, WavCaps, CC14M, and 110M randomly sampled pairs from LAION-400M |
| Dataset Splits | Yes | Specific train/val/test splits of those benchmarks can be found in Table 9 |
| Hardware Specification | Yes | VAST is trained using the PyTorch framework on 64 Tesla V100 cards. |
| Software Dependencies | No | The paper mentions the 'PyTorch framework' but does not specify a version number for PyTorch or any other software dependencies. |
| Experiment Setup | Yes | The training is conducted... for a total of 200K training steps. At each training step, one corpus is sampled for training. ... The initial learning rate is set to 1e-4, and a linear decay schedule is used. The batch size is set to 1024. Specific finetuning hyperparameters of VAST for different benchmarks are presented in Table 10. |
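The reported training schedule (200K steps, base learning rate 1e-4 with linear decay, batch size 1024, one corpus sampled per step) can be sketched as below. This is a minimal illustration, not the authors' code: the corpus list follows the paper, but the uniform sampling weights and the decay-to-zero endpoint are assumptions, since the paper does not specify them.

```python
import random

# Values reported in the paper's experiment setup.
CORPORA = ["VAST-27M", "VALOR-1M", "WavCaps", "CC14M", "LAION-400M (110M subset)"]
TOTAL_STEPS = 200_000
BASE_LR = 1e-4
BATCH_SIZE = 1024


def linear_decay_lr(step: int, total_steps: int = TOTAL_STEPS,
                    base_lr: float = BASE_LR) -> float:
    """Linearly decay the learning rate from base_lr toward 0 over total_steps.

    Assumption: the paper says 'a linear decay schedule is used' but does not
    give the final LR, so this sketch decays to zero.
    """
    return base_lr * max(0.0, 1.0 - step / total_steps)


def sample_corpus(rng: random.Random) -> str:
    """Sample one corpus for the current step (uniform weights are an assumption)."""
    return rng.choice(CORPORA)


rng = random.Random(0)
for step in range(3):
    lr = linear_decay_lr(step)
    corpus = sample_corpus(rng)
    # train_one_step(corpus, batch_size=BATCH_SIZE, lr=lr)  # hypothetical hook
```

Per-benchmark finetuning would then swap in the hyperparameters listed in the paper's Table 10.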