Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization
Authors: Yang Jin, Zhicheng Sun, Kun Xu, Kun Xu, Liwei Chen, Hao Jiang, Quzhe Huang, Chengru Song, Yuliang Liu, Di Zhang, Yang Song, Kun Gai, Yadong Mu
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our proposed framework is capable of both comprehending and generating image and video content, as demonstrated by its competitive performance across 13 multimodal benchmarks covering image and video understanding and generation. |
| Researcher Affiliation | Collaboration | Peking University, China; Kuaishou Technology, China. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. It provides diagrams for the model architecture and training pipeline but no algorithmic steps in pseudocode format. |
| Open Source Code | Yes | Our code and models are available at https://video-lavit.github.io. |
| Open Datasets | Yes | The training dataset used by Video-LaVIT only consists of publicly available image and video datasets. In the following, we present a detailed elaboration of the dataset usage at each training stage. Stage 1: The video tokenizer and detokenizer are trained on the WebVid-10M (Bain et al., 2021)... Stage 2: The language model is pre-trained on a mixture of video, image and text data, including WebVid-10M (Bain et al., 2021); 93M samples from Conceptual Captions (Sharma et al., 2018; Changpinyo et al., 2021), SBU (Ordonez et al., 2011), and BLIP-CapFilt (Li et al., 2022). Moreover, we also employ the English text corpus from RedPajama (Together Computer, 2023)... |
| Dataset Splits | Yes | We adopt the validation set of MS-COCO (Lin et al., 2014) and randomly select 30K samples. (A hedged sampling sketch follows the table.) |
| Hardware Specification | Yes | GPU usage: 128 NVIDIA A100; 64 NVIDIA A100; 64 NVIDIA A100 (one figure per training component). |
| Software Dependencies | No | The paper mentions software frameworks like 'Megatron' and 'DeepSpeed' and optimizers like 'AdamW', but does not provide specific version numbers for these or for other key software components such as programming languages or libraries. |
| Experiment Setup | Yes | The detailed training hyper-parameter settings for the video tokenizer, detokenizer, and language model in Video-LaVIT are reported in Table 8. Table 8 explicitly lists hyperparameters such as 'Global batch size 2048', 'Peak learning rate of LLM 2e-5', 'Training Steps 30K', and 'Weight decay 0.1'. (A hedged configuration sketch based on these values follows the table.) |
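
As a reading aid for the Experiment Setup row, the following is a minimal sketch of how the hyperparameters quoted from Table 8 (global batch size 2048, peak learning rate 2e-5, 30K training steps, weight decay 0.1) could be wired into an AdamW optimizer in PyTorch. AdamW is mentioned in the Software Dependencies row, but the warmup fraction, schedule shape, and all helper names here are illustrative assumptions, not the authors' released training configuration.

```python
# Hedged sketch: plugging the Table 8 hyperparameters into a PyTorch AdamW
# optimizer with linear warmup and cosine decay. The warmup length and the
# schedule shape are assumptions for illustration, not values from the paper.
import math
from dataclasses import dataclass

import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR


@dataclass
class LLMTrainConfig:
    global_batch_size: int = 2048   # "Global batch size 2048" (Table 8)
    peak_lr: float = 2e-5           # "Peak learning rate of LLM 2e-5"
    train_steps: int = 30_000       # "Training Steps 30K"
    weight_decay: float = 0.1       # "Weight decay 0.1"
    warmup_steps: int = 1_000       # assumption, not stated in the quoted text


def build_optimizer_and_scheduler(model: torch.nn.Module, cfg: LLMTrainConfig):
    """Return an AdamW optimizer and a linear-warmup + cosine-decay scheduler."""
    optimizer = AdamW(model.parameters(), lr=cfg.peak_lr,
                      weight_decay=cfg.weight_decay)

    def lr_lambda(step: int) -> float:
        # Linear warmup to the peak LR, then cosine decay toward zero.
        if step < cfg.warmup_steps:
            return step / max(1, cfg.warmup_steps)
        progress = (step - cfg.warmup_steps) / max(1, cfg.train_steps - cfg.warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))

    scheduler = LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```

A usage example would call `build_optimizer_and_scheduler(model, LLMTrainConfig())` and step the scheduler once after each optimizer step.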
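
For the Dataset Splits row, the following is a minimal sketch of randomly drawing 30K image-caption pairs from the MS-COCO validation annotations with pycocotools. The annotation file path, the fixed random seed, and the choice of one caption per image are placeholders assumed for illustration; the paper only states that 30K samples are randomly selected from the validation set.

```python
# Hedged sketch: sampling 30K image-caption pairs from the MS-COCO validation
# split, as described in the Dataset Splits row. The annotation file path and
# the random seed are illustrative placeholders, not details from the paper.
import random

from pycocotools.coco import COCO

ANNOTATION_FILE = "annotations/captions_val2014.json"  # placeholder path

coco = COCO(ANNOTATION_FILE)
random.seed(0)  # assumed fixed seed for reproducibility

image_ids = sorted(coco.getImgIds())
sampled_ids = random.sample(image_ids, 30_000)

# Collect one caption per sampled image for text-conditioned evaluation.
pairs = []
for image_id in sampled_ids:
    ann_ids = coco.getAnnIds(imgIds=image_id)
    captions = [ann["caption"] for ann in coco.loadAnns(ann_ids)]
    file_name = coco.loadImgs(image_id)[0]["file_name"]
    pairs.append((file_name, captions[0]))
```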