Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
CLIP-It! Language-Guided Video Summarization
Authors: Medhini Narasimhan, Anna Rohrbach, Trevor Darrell
NeurIPS 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We outperform baselines and prior work by a significant margin on both standard video summarization datasets (TVSum and SumMe) and a query-focused video summarization dataset (QFVS). Particularly, we achieve large improvements in the transfer setting, attesting to our method's strong generalization capabilities. In this section, we describe the experimental setup and evaluation of our method on two tasks: generic video summarization and query-focused video summarization. |
| Researcher Affiliation | Academia | Medhini Narasimhan, Anna Rohrbach, Trevor Darrell; University of California, Berkeley; EMAIL |
| Pseudocode | No | The paper describes the system architecture and components but does not provide structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper provides a project website link (https://medhini.github.io/clip_it) but does not explicitly state that the source code for the methodology is released there or provide a direct link to a code repository. |
| Open Datasets | Yes | We evaluate our approach on two standard video summarization datasets (TVSum [36] and SumMe [7]) and on the generic summaries for UT Egocentric videos [16] provided by the QFVS dataset [33]. TVSum [36] consists of 50 videos... SumMe [7] consists of 25 videos... augment training data with 39 videos from the YouTube dataset [2] and 50 videos from the Open Video Project (OVP) dataset [24]. |
| Dataset Splits | Yes | Following [33], we run four rounds of experiments leaving out one video for testing and one for validation, while keeping the remaining two for training. |
| Hardware Specification | No | The paper describes the training of models and experiments but does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used for these experiments. |
| Software Dependencies | No | The paper mentions using various pre-trained networks and models (e.g., GoogLeNet, ResNet, CLIP, Bi-Modal Transformer) but does not specify version numbers for these or other software dependencies like deep learning frameworks or libraries. |
| Experiment Setup | Yes | We employ 3 loss functions (classification, diversity, and reconstruction) to train our model. The supervised setting uses all 3 and the unsupervised setting uses only diversity and reconstruction losses. ... The final loss function for supervised learning is then, $L_{sup} = \alpha L_c + \beta L_d + \lambda L_r$, where $\alpha$, $\beta$, and $\lambda$ control the trade-off between the three loss functions. |
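
The loss combination quoted in the Experiment Setup row can be illustrated with a short sketch. This is not the authors' code: the function name `combined_loss`, the `supervised` flag, and the default weights of 1.0 are assumptions made for illustration, and the quoted excerpt does not state the weight values actually used. The unsupervised branch assumes the same β and λ weights are applied to the diversity and reconstruction terms.

```python
def combined_loss(l_c, l_d, l_r, alpha=1.0, beta=1.0, lam=1.0, supervised=True):
    """Weighted sum of the three loss terms named in the quoted excerpt.

    l_c: classification loss value
    l_d: diversity loss value
    l_r: reconstruction loss value
    alpha, beta, lam: trade-off weights (defaults are placeholders,
        not values reported in the paper)
    supervised: if False, the classification term is dropped
        (assumed form of the unsupervised objective)
    """
    # Terms shared by both settings: diversity and reconstruction.
    unsupervised_part = beta * l_d + lam * l_r
    if supervised:
        # L_sup = alpha * L_c + beta * L_d + lambda * L_r
        return alpha * l_c + unsupervised_part
    return unsupervised_part


# Example with made-up loss values and weights.
sup = combined_loss(0.5, 0.2, 0.1, alpha=2.0, beta=1.0, lam=3.0)
unsup = combined_loss(0.5, 0.2, 0.1, alpha=2.0, beta=1.0, lam=3.0, supervised=False)
print(sup, unsup)
```

The function works on plain floats, but the same arithmetic applies unchanged to tensor-valued losses in a deep learning framework, since it uses only multiplication and addition.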