CALVIN: Improved Contextual Video Captioning via Instruction Tuning

Authors: Gowthami Somepalli, Arkabandhu Chowdhury, Jonas Geiping, Ronen Basri, Tom Goldstein, David Jacobs

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We propose a model, CALVIN, a specialized video LLM that leverages previous movie context to generate fully contextual scene descriptions. To achieve this, we train our model on a suite of tasks that integrate both image-based question-answering and video captioning within a unified framework, before applying instruction tuning to refine the model's ability to provide scene captions.
Researcher Affiliation | Collaboration | University of Maryland, College Park; Meta AI; ELLIS Institute, MPI for Intelligent Systems; Weizmann Institute of Science
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | We will release the code upon acceptance.
Open Datasets | Yes | To accomplish this, we train our model using the Movie Audio Descriptions (MAD) dataset [80], which includes human-generated annotations for movie scenes, complete with timestamps. Additionally, we incorporate the WebVid-3M [8] video caption dataset and the MAD [80] dataset train split into this stage. As such, this data mix is of higher quality as the majority of it is human-annotated.
Dataset Splits | Yes | In Stage 1, we train only the projection module (Q-Former and linear layer) for 400,000 iterations, with gradient accumulation over 4 steps and a per-GPU batch size of 32. The learning rate has a cosine schedule and a warmup phase of 2,500 steps with min LR 1e-6 and max LR 1e-4. In Stage 2 we train the Q-Former, linear projection, and the LLM. We train each model for 120,000 iterations with a cosine learning rate with min LR of 1e-6 and max LR of 1e-4. The per-GPU batch size is 12 for image datasets and 6 for video datasets.
Hardware Specification | Yes | All models are trained on a single A100 node with 8 GPUs.
Software Dependencies | No | The paper mentions using specific models such as Llama-2 7B and CLIP ViT-H/14, but does not specify software dependencies such as Python, PyTorch, or CUDA versions.
Experiment Setup | Yes | In Stage 1, we train only the projection module (Q-Former and linear layer) for 400,000 iterations, with gradient accumulation over 4 steps and a per-GPU batch size of 32. The learning rate has a cosine schedule and a warmup phase of 2,500 steps with min LR 1e-6 and max LR 1e-4. In Stage 2 we train the Q-Former, linear projection, and the LLM. We train each model for 120,000 iterations with a cosine learning rate with min LR of 1e-6 and max LR of 1e-4. The per-GPU batch size is 12 for image datasets and 6 for video datasets. Across all stages, a weight decay of 0.05 was applied. (A hedged PyTorch sketch of this schedule follows the table.)
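
The Dataset Splits and Experiment Setup rows quote the same two-stage optimization recipe: a cosine learning-rate schedule with a 2,500-step warmup between a max LR of 1e-4 and a min LR of 1e-6, gradient accumulation over 4 steps, a per-GPU batch size of 32 in Stage 1, and a weight decay of 0.05. The sketch below shows one way to realize that Stage 1 schedule in PyTorch; the AdamW optimizer, the projection-module feature sizes, and the dummy data are illustrative assumptions, not details taken from the paper.

import math
import torch

# Schedule constants quoted in the table (Stage 1).
MAX_ITERS    = 400_000   # total Stage 1 iterations
WARMUP_STEPS = 2_500     # warmup phase
MIN_LR       = 1e-6
MAX_LR       = 1e-4
ACCUM_STEPS  = 4         # gradient accumulation steps
BATCH_SIZE   = 32        # per-GPU batch size in Stage 1

# Stand-in for the trainable projection module (Q-Former + linear layer in the paper);
# the feature sizes here are placeholders, not values from the paper.
projection_module = torch.nn.Linear(1024, 4096)

# Optimizer choice is an assumption; the paper only states the 0.05 weight decay.
optimizer = torch.optim.AdamW(projection_module.parameters(), lr=MAX_LR, weight_decay=0.05)

def lr_at(step: int) -> float:
    """Linear warmup to MAX_LR, then cosine decay toward MIN_LR."""
    if step < WARMUP_STEPS:
        return MAX_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, MAX_ITERS - WARMUP_STEPS)
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))

# Tiny demo loop; the paper trains for MAX_ITERS iterations on real image/video data.
for step in range(10):
    for group in optimizer.param_groups:
        group["lr"] = lr_at(step)
    for _ in range(ACCUM_STEPS):                   # accumulate over 4 micro-batches
        features = torch.randn(BATCH_SIZE, 1024)   # placeholder inputs
        loss = projection_module(features).mean()  # placeholder loss
        (loss / ACCUM_STEPS).backward()
    optimizer.step()
    optimizer.zero_grad()

Stage 2 would follow the same pattern with 120,000 iterations, the Q-Former, linear projection, and LLM parameters added to the optimizer, and per-GPU batch sizes of 12 (image datasets) or 6 (video datasets).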