Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization

Authors: Yang Jin, Zhicheng Sun, Kun Xu, Kun Xu, Liwei Chen, Hao Jiang, Quzhe Huang, Chengru Song, Yuliang Liu, Di Zhang, Yang Song, Kun Gai, Yadong Mu

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our proposed framework is capable of both comprehending and generating image and video content, as demonstrated by its competitive performance across 13 multimodal benchmarks in image and video understanding and generation.
Researcher Affiliation | Collaboration | 1 Peking University, China; 2 Kuaishou Technology, China.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. It provides diagrams for the model architecture and training pipeline but no algorithmic steps in pseudocode format.
Open Source Code | Yes | Our code and models are available at https://video-lavit.github.io.
Open Datasets | Yes | The training dataset used by Video-LaVIT only consists of publicly available image and video datasets. In the following, we present a detailed elaboration of the dataset usage at each training stage. Stage 1: The video tokenizer and detokenizer are trained on WebVid-10M (Bain et al., 2021)... Stage 2: The language model is pre-trained on a mixture of video, image and text data, including WebVid-10M (Bain et al., 2021); 93M samples from Conceptual Caption (Sharma et al., 2018; Changpinyo et al., 2021), SBU (Ordonez et al., 2011), and BLIP-Capfilt (Li et al., 2022). Moreover, we also employ the English text corpus from RedPajama (Together Computer, 2023)...
Dataset Splits | Yes | We adopt the validation set of MS-COCO (Lin et al., 2014) and randomly select 30K samples.
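The quoted split is described only in prose; as a rough illustration, the subset selection could look like the sketch below. The annotation file name, random seed, and helper function are illustrative assumptions, not the authors' actual evaluation protocol; only "validation set of MS-COCO" and "30K samples" come from the excerpt.

```python
import json
import random

# Sketch: draw a 30K evaluation subset from the MS-COCO validation captions.
# File name, seed, and function name are hypothetical; only the dataset
# ("MS-COCO validation set") and the subset size (30K) are from the excerpt.
def sample_coco_val_subset(annotation_path="captions_val2014.json",
                           num_samples=30_000, seed=0):
    with open(annotation_path) as f:
        annotations = json.load(f)["annotations"]
    random.seed(seed)
    return random.sample(annotations, num_samples)
```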
Hardware Specification | Yes | GPU usage: 128 NVIDIA A100; 64 NVIDIA A100; 64 NVIDIA A100.
Software Dependencies | No | The paper mentions software frameworks like 'Megatron' and 'DeepSpeed' and optimizers like 'AdamW', but does not provide specific version numbers for these or other key software components such as programming languages or libraries.
Experiment Setup | Yes | The detailed training hyper-parameter settings for the video tokenizer, detokenizer, and language model in Video-LaVIT are reported in Table 8. Table 8 explicitly lists hyperparameters such as 'Global batch size 2048', 'Peak learning rate of LLM 2e-5', 'Training Steps 30K', and 'Weight decay 0.1'.
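For quick reference, the quoted hyper-parameters can be collected into a small configuration sketch. The dictionary layout and field names below are illustrative assumptions; only the numeric values (and the AdamW optimizer mentioned under Software Dependencies) come from the excerpts above.

```python
# Hypothetical summary of the reported language-model training settings.
# Field names are made up for readability; values are as quoted from Table 8.
llm_training_config = {
    "global_batch_size": 2048,
    "peak_learning_rate": 2e-5,   # peak learning rate of the LLM
    "training_steps": 30_000,
    "weight_decay": 0.1,
    "optimizer": "AdamW",         # optimizer named in the paper; betas/eps not reported
}
```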