CALVIN: Improved Contextual Video Captioning via Instruction Tuning

Authors: Gowthami Somepalli, Arkabandhu Chowdhury, Jonas Geiping, Ronen Basri, Tom Goldstein, David Jacobs

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We propose a model, CALVIN, a specialized video LLM that leverages previous movie context to generate fully contextual scene descriptions. To achieve this, we train our model on a suite of tasks that integrate both image-based question-answering and video captioning within a unified framework, before applying instruction tuning to refine the model's ability to provide scene captions.
Researcher Affiliation | Collaboration | University of Maryland, College Park; Meta AI; ELLIS Institute, MPI for Intelligent Systems; Weizmann Institute of Science
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | We will release the code upon acceptance.
Open Datasets | Yes | To accomplish this, we train our model using the Movie Audio Descriptions (MAD) dataset [80], which includes human-generated annotations for movie scenes, complete with timestamps. Additionally, we incorporate the WebVid-3M [8] video caption dataset and the MAD [80] dataset train split into this stage. As such, this data mix is of higher quality as the majority of it is human-annotated.
Dataset Splits | Yes | In Stage 1, we train only the projection module (Q-Former and linear layer) for 400,000 iterations, with gradient accumulation over 4 steps and a per-GPU batch size of 32. The learning rate has a cosine schedule and a warmup phase of 2,500 steps with min LR 1e-6 and max LR 1e-4. In Stage 2 we train the Q-Former, linear projection, and the LLM. We train each model for 120,000 iterations with a cosine learning rate with min LR of 1e-6 and max LR of 1e-4. The per-GPU batch size is 12 for image datasets and 6 for video datasets.
Hardware Specification | Yes | All models are trained on a single A100 node with 8 GPUs.
Software Dependencies | No | The paper mentions using specific models such as Llama-2 7B and CLIP ViT-H/14, but does not specify software dependencies such as Python, PyTorch, or CUDA versions.
Experiment Setup | Yes | In Stage 1, we train only the projection module (Q-Former and linear layer) for 400,000 iterations, with gradient accumulation over 4 steps and a per-GPU batch size of 32. The learning rate has a cosine schedule and a warmup phase of 2,500 steps with min LR 1e-6 and max LR 1e-4. In Stage 2 we train the Q-Former, linear projection, and the LLM. We train each model for 120,000 iterations with a cosine learning rate with min LR of 1e-6 and max LR of 1e-4. The per-GPU batch size is 12 for image datasets and 6 for video datasets. Across all stages, a weight decay of 0.05 was applied. (A hedged PyTorch sketch of this schedule follows the table.)
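
The Dataset Splits and Experiment Setup rows quote the same two-stage optimization recipe: a cosine learning-rate schedule with a 2,500-step warmup between a max LR of 1e-4 and a min LR of 1e-6, gradient accumulation over 4 steps, a per-GPU batch size of 32 in Stage 1, and a weight decay of 0.05. The sketch below shows one way to realize that Stage 1 schedule in PyTorch; the AdamW optimizer, the projection-module feature sizes, and the dummy data are illustrative assumptions, not details taken from the paper.

import math
import torch

# Schedule constants quoted in the table (Stage 1).
MAX_ITERS    = 400_000   # total Stage 1 iterations
WARMUP_STEPS = 2_500     # warmup phase
MIN_LR       = 1e-6
MAX_LR       = 1e-4
ACCUM_STEPS  = 4         # gradient accumulation steps
BATCH_SIZE   = 32        # per-GPU batch size in Stage 1

# Stand-in for the trainable projection module (Q-Former + linear layer in the paper);
# the feature sizes here are placeholders, not values from the paper.
projection_module = torch.nn.Linear(1024, 4096)

# Optimizer choice is an assumption; the paper only states the 0.05 weight decay.
optimizer = torch.optim.AdamW(projection_module.parameters(), lr=MAX_LR, weight_decay=0.05)

def lr_at(step: int) -> float:
    """Linear warmup to MAX_LR, then cosine decay toward MIN_LR."""
    if step < WARMUP_STEPS:
        return MAX_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, MAX_ITERS - WARMUP_STEPS)
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))

# Tiny demo loop; the paper trains for MAX_ITERS iterations on real image/video data.
for step in range(10):
    for group in optimizer.param_groups:
        group["lr"] = lr_at(step)
    for _ in range(ACCUM_STEPS):                   # accumulate over 4 micro-batches
        features = torch.randn(BATCH_SIZE, 1024)   # placeholder inputs
        loss = projection_module(features).mean()  # placeholder loss
        (loss / ACCUM_STEPS).backward()
    optimizer.step()
    optimizer.zero_grad()

Stage 2 would follow the same pattern with 120,000 iterations, the Q-Former, linear projection, and LLM parameters added to the optimizer, and per-GPU batch sizes of 12 (image datasets) or 6 (video datasets).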