Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization

Authors: Yang Jin, Zhicheng Sun, Kun Xu, Kun Xu, Liwei Chen, Hao Jiang, Quzhe Huang, Chengru Song, Yuliang Liu, Di Zhang, Yang Song, Kun Gai, Yadong Mu

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our proposed framework is capable of both comprehending and generating image and video content, as demonstrated by its competitive performance across 13 multimodal benchmarks in image and video understanding and generation.
Researcher Affiliation | Collaboration | 1 Peking University, China; 2 Kuaishou Technology, China.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. It provides diagrams for the model architecture and training pipeline but no algorithmic steps in pseudocode format.
Open Source Code | Yes | Our code and models are available at https://video-lavit.github.io.
Open Datasets | Yes | The training dataset used by Video-LaVIT only consists of publicly available image and video datasets. In the following, we present a detailed elaboration of the dataset usage at each training stage. Stage 1: The video tokenizer and detokenizer are trained on WebVid-10M (Bain et al., 2021)... Stage 2: The language model is pre-trained on a mixture of video, image and text data, including WebVid-10M (Bain et al., 2021); 93M samples from Conceptual Caption (Sharma et al., 2018; Changpinyo et al., 2021), SBU (Ordonez et al., 2011), and BLIP-Capfilt (Li et al., 2022). Moreover, we also employ the English text corpus from RedPajama (Together Computer, 2023)...
Dataset Splits | Yes | We adopt the validation set of MS-COCO (Lin et al., 2014) and randomly select 30K samples.
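The quoted split is described only in prose; as a rough illustration, the subset selection could look like the sketch below. The annotation file name, random seed, and helper function are illustrative assumptions, not the authors' actual evaluation protocol; only "validation set of MS-COCO" and "30K samples" come from the excerpt.

```python
import json
import random

# Sketch: draw a 30K evaluation subset from the MS-COCO validation captions.
# File name, seed, and function name are hypothetical; only the dataset
# ("MS-COCO validation set") and the subset size (30K) are from the excerpt.
def sample_coco_val_subset(annotation_path="captions_val2014.json",
                           num_samples=30_000, seed=0):
    with open(annotation_path) as f:
        annotations = json.load(f)["annotations"]
    random.seed(seed)
    return random.sample(annotations, num_samples)
```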
Hardware Specification | Yes | GPU usage: 128 NVIDIA A100; 64 NVIDIA A100; 64 NVIDIA A100.
Software Dependencies | No | The paper mentions software frameworks like 'Megatron' and 'DeepSpeed' and optimizers like 'AdamW', but does not provide specific version numbers for these or other key software components such as programming languages or libraries.
Experiment Setup | Yes | The detailed training hyper-parameter settings for the video tokenizer, detokenizer, and language model in Video-LaVIT are reported in Table 8. Table 8 explicitly lists hyperparameters such as 'Global batch size 2048', 'Peak learning rate of LLM 2e-5', 'Training Steps 30K', and 'Weight decay 0.1'.
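For quick reference, the quoted hyper-parameters can be collected into a small configuration sketch. The dictionary layout and field names below are illustrative assumptions; only the numeric values (and the AdamW optimizer mentioned under Software Dependencies) come from the excerpts above.

```python
# Hypothetical summary of the reported language-model training settings.
# Field names are made up for readability; values are as quoted from Table 8.
llm_training_config = {
    "global_batch_size": 2048,
    "peak_learning_rate": 2e-5,   # peak learning rate of the LLM
    "training_steps": 30_000,
    "weight_decay": 0.1,
    "optimizer": "AdamW",         # optimizer named in the paper; betas/eps not reported
}
```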