Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
CALVIN: Improved Contextual Video Captioning via Instruction Tuning
Authors: Gowthami Somepalli, Arkabandhu Chowdhury, Jonas Geiping, Ronen Basri, Tom Goldstein, David Jacobs
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We propose a model, CALVIN, a specialized video LLM that leverages previous movie context to generate fully contextual scene descriptions. To achieve this, we train our model on a suite of tasks that integrate both image-based question-answering and video captioning within a unified framework, before applying instruction tuning to refine the model's ability to provide scene captions. |
| Researcher Affiliation | Collaboration | University of Maryland, College Park; Meta AI; ELLIS Institute, MPI for Intelligent Systems; Weizmann Institute of Science |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | We will release the code upon acceptance. |
| Open Datasets | Yes | To accomplish this, we train our model using the Movie Audio Descriptions (MAD) dataset [80], which includes human-generated annotations for movie scenes, complete with timestamps. Additionally, we incorporate the WebVid-3M [8] video caption dataset and MAD [80] dataset train split into this stage. As such, this data mix is of higher quality as the majority of it is human-annotated. |
| Dataset Splits | Yes | In Stage 1, we train only the projection module (Q-Former and linear layer) for 400,000 iterations, with gradient accumulation over 4 steps and per-GPU batch size of 32. The learning rate has a cosine schedule and a warmup phase of 2,500 steps with min LR 1e-6 and max LR 1e-4. In Stage-2 we train Q-Former, linear projection, and the LLM. We train each model for 120,000 iterations with a cosine learning rate with min LR of 1e-6 and max LR of 1e-4. The per-GPU batch size is 12 for image datasets and 6 for video datasets. |
| Hardware Specification | Yes | All models are trained on a single A100 node with 8 GPUs. |
| Software Dependencies | No | The paper mentions using specific models like Llama-2 7b and CLIP ViT-h/14, but does not specify software dependencies like Python, PyTorch, or CUDA versions. |
| Experiment Setup | Yes | In Stage 1, we train only the projection module (Q-Former and linear layer) for 400,000 iterations, with gradient accumulation over 4 steps and per-GPU batch size of 32. The learning rate has a cosine schedule and a warmup phase of 2,500 steps with min LR 1e-6 and max LR 1e-4. In Stage-2 we train Q-Former, linear projection, and the LLM. We train each model for 120,000 iterations with a cosine learning rate with min LR of 1e-6 and max LR of 1e-4. The per-GPU batch size is 12 for image datasets and 6 for video datasets. Across all stages, a weight decay of 0.05 was applied. |
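
The Stage 1 hyperparameters quoted in the Experiment Setup row can be made concrete with a small sketch. It is not the authors' code, which has not been released. It assumes a linear warmup (the paper excerpt only says "warmup phase"), a cosine decay from the max LR to the min LR over the remaining iterations, and data-parallel training across all 8 GPUs of the node. The function names `lr_at_step` and `effective_batch_size` are hypothetical.

```python
import math


def lr_at_step(step, total_steps=400_000, warmup_steps=2_500,
               max_lr=1e-4, min_lr=1e-6):
    """Learning rate at a given iteration (Stage 1 values as defaults).

    Assumption: warmup is linear from near zero up to max_lr; the quoted
    text does not state the warmup shape or its starting value.
    """
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup schedule completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    # Cosine decay from max_lr (progress = 0) to min_lr (progress = 1).
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))


def effective_batch_size(per_gpu_batch=32, num_gpus=8, grad_accum_steps=4):
    """Samples per optimizer step, assuming all GPUs are data-parallel."""
    return per_gpu_batch * num_gpus * grad_accum_steps


if __name__ == "__main__":
    for s in (0, 2_500, 200_000, 400_000):
        print(s, lr_at_step(s))
    print("effective batch size:", effective_batch_size())
```

Under these assumptions, Stage 1 would process 32 × 8 × 4 = 1024 samples per optimizer step. For Stage 2, the same schedule function could be called with `total_steps=120_000`, although the excerpt does not say whether Stage 2 uses a warmup phase or gradient accumulation.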