CALVIN: Improved Contextual Video Captioning via Instruction Tuning
Authors: Gowthami Somepalli, Arkabandhu Chowdhury, Jonas Geiping, Ronen Basri, Tom Goldstein, David Jacobs
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We propose a model, CALVIN, a specialized video LLM that leverages previous movie context to generate fully contextual scene descriptions. To achieve this, we train our model on a suite of tasks that integrate both image-based question-answering and video captioning within a unified framework, before applying instruction tuning to refine the model's ability to provide scene captions. |
| Researcher Affiliation | Collaboration | University of Maryland, College Park; Meta AI; ELLIS Institute, MPI for Intelligent Systems; Weizmann Institute of Science |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | We will release the code upon acceptance. |
| Open Datasets | Yes | To accomplish this, we train our model using the Movie Audio Descriptions (MAD) dataset [80], which includes human-generated annotations for movie scenes, complete with timestamps. Additionally, we incorporate the WebVid-3M [8] video caption dataset and the MAD [80] dataset train split into this stage. As such, this data mix is of higher quality as the majority of it is human-annotated. |
| Dataset Splits | Yes | In Stage 1, we train only the projection module (Q-Former and linear layer) for 400,000 iterations, with gradient accumulation over 4 steps and a per-GPU batch size of 32. The learning rate has a cosine schedule and a warmup phase of 2,500 steps with min LR 1e-6 and max LR 1e-4. In Stage 2, we train the Q-Former, linear projection, and the LLM. We train each model for 120,000 iterations with a cosine learning rate schedule with min LR of 1e-6 and max LR of 1e-4. The per-GPU batch size is 12 for image datasets and 6 for video datasets. |
| Hardware Specification | Yes | All models are trained on a single A100 node with 8 GPUs. |
| Software Dependencies | No | The paper mentions using specific models like Llama-2 7b and CLIP ViT-h/14, but does not specify software dependencies like Python, PyTorch, or CUDA versions. |
| Experiment Setup | Yes | In Stage 1, we train only the projection module (Q-Former and linear layer) for 400,000 iterations, with gradient accumulation over 4 steps and a per-GPU batch size of 32. The learning rate has a cosine schedule and a warmup phase of 2,500 steps with min LR 1e-6 and max LR 1e-4. In Stage 2, we train the Q-Former, linear projection, and the LLM. We train each model for 120,000 iterations with a cosine learning rate schedule with min LR of 1e-6 and max LR of 1e-4. The per-GPU batch size is 12 for image datasets and 6 for video datasets. Across all stages, a weight decay of 0.05 was applied. |
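
The hyperparameters quoted in the Dataset Splits and Experiment Setup rows (cosine schedule with a 2,500-step warmup, min LR 1e-6, max LR 1e-4, gradient accumulation over 4 steps, weight decay 0.05) are enough to reconstruct the optimization loop in outline. The PyTorch sketch below is an illustrative reconstruction under those numbers only, not the authors' released code: the `projection` module, the dummy loss, and the choice of AdamW are assumptions, since the paper does not name the optimizer or publish an implementation.

```python
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Stage 1 values quoted in the table above; Stage 2 would use TOTAL_STEPS = 120_000.
MAX_LR, MIN_LR = 1e-4, 1e-6
WARMUP_STEPS = 2_500
TOTAL_STEPS = 400_000
GRAD_ACCUM = 4            # gradient accumulation over 4 steps
WEIGHT_DECAY = 0.05
PER_GPU_BATCH = 32        # roughly 32 x 4 accum x 8 GPUs = 1024 samples per update,
                          # if accumulation is counted on top of the per-GPU batch (assumption)

# Placeholder for the trainable projection module (Q-Former + linear layer);
# a single linear layer stands in, since the real module is not released.
projection = torch.nn.Linear(1024, 4096)

# AdamW is an assumption; only the weight decay value is taken from the paper.
optimizer = AdamW(projection.parameters(), lr=MAX_LR, weight_decay=WEIGHT_DECAY)

def warmup_cosine(step: int) -> float:
    """LR multiplier on MAX_LR: linear warmup, then cosine decay down to MIN_LR."""
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return (MIN_LR + (MAX_LR - MIN_LR) * cosine) / MAX_LR

scheduler = LambdaLR(optimizer, lr_lambda=warmup_cosine)

# Toy loop on random features, only to demonstrate accumulation + schedule stepping;
# the loss below is a dummy objective, not CALVIN's captioning loss.
for step in range(12 * GRAD_ACCUM):
    x = torch.randn(PER_GPU_BATCH, 1024)
    loss = projection(x).pow(2).mean()
    (loss / GRAD_ACCUM).backward()
    if (step + 1) % GRAD_ACCUM == 0:
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
        print(f"update {(step + 1) // GRAD_ACCUM}: lr={scheduler.get_last_lr()[0]:.2e}")
```

For Stage 2, the same pattern would presumably apply with 120,000 iterations and with the Q-Former, linear projection, and LLM parameters all passed to the optimizer; whether the reported iteration counts are counted before or after gradient accumulation is not stated in the quoted text.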