CLIP-It! Language-Guided Video Summarization

Authors: Medhini Narasimhan, Anna Rohrbach, Trevor Darrell

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We outperform baselines and prior work by a significant margin on both standard video summarization datasets (TVSum and SumMe) and a query-focused video summarization dataset (QFVS). Particularly, we achieve large improvements in the transfer setting, attesting to our method's strong generalization capabilities. In this section, we describe the experimental setup and evaluation of our method on two tasks: generic video summarization and query-focused video summarization."
Researcher Affiliation | Academia | Medhini Narasimhan, Anna Rohrbach, Trevor Darrell. University of California, Berkeley. {medhini, anna.rohrbach, trevordarrell}@berkeley.edu
Pseudocode | No | The paper describes the system architecture and its components but does not provide structured pseudocode or algorithm blocks.
Open Source Code | No | The paper provides a project website link (https://medhini.github.io/clip_it) but neither states that the source code is released there nor links to a code repository.
Open Datasets | Yes | "We evaluate our approach on two standard video summarization datasets (TVSum [36] and SumMe [7]) and on the generic summaries for UT Egocentric videos [16] provided by the QFVS dataset [33]. TVSum [36] consists of 50 videos... SumMe [7] consists of 25 videos... augment training data with 39 videos from the YouTube dataset [2] and 50 videos from the Open Video Project (OVP) dataset [24]."
Dataset Splits | Yes | "Following [33], we run four rounds of experiments leaving out one video for testing and one for validation, while keeping the remaining two for training."
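
The quoted protocol amounts to a rotating leave-two-out split over the four UT Egocentric videos in the QFVS benchmark. Below is a minimal Python sketch of one way to generate those four rounds; the specific pairing of validation and test videos in each round is an assumption, since the quote does not specify it.

```python
# Sketch of the four-round evaluation protocol described above, assuming
# the four UT Egocentric videos are indexed 0-3. In each round one video
# is held out for testing, one for validation, and the remaining two are
# used for training. The choice of which video validates in each round
# is an assumption; the paper's quote leaves it unspecified.

def four_round_splits(video_ids=(0, 1, 2, 3)):
    """Yield (train, val, test) index tuples, one per round."""
    n = len(video_ids)
    for round_idx in range(n):
        test = video_ids[round_idx]
        val = video_ids[(round_idx + 1) % n]  # next video serves as validation
        train = tuple(v for v in video_ids if v not in (test, val))
        yield train, val, test

for train, val, test in four_round_splits():
    print(f"train={train} val={val} test={test}")
```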
Hardware Specification | No | The paper describes model training and experiments but does not specify the hardware used (e.g., GPU models, CPU types, memory).
Software Dependencies | No | The paper mentions several pre-trained networks and models (e.g., GoogleNet, ResNet, CLIP, the Bi-Modal Transformer) but does not give version numbers for these or for other software dependencies such as deep learning frameworks or libraries.
Experiment Setup | Yes | "We employ 3 loss functions (classification, diversity, and reconstruction) to train our model. The supervised setting uses all 3 and the unsupervised setting uses only diversity and reconstruction losses. ... The final loss function for supervised learning is then $L_{\text{sup}} = \alpha L_c + \beta L_d + \lambda L_r$, where $\alpha$, $\beta$, and $\lambda$ control the trade-off between the three loss functions."
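
To make the quoted loss combination concrete, here is a minimal PyTorch sketch. The individual loss terms (classification $L_c$, diversity $L_d$, reconstruction $L_r$) are stand-ins; the paper defines their exact forms, and the weight values for $\alpha$, $\beta$, and $\lambda$ are hyperparameters the quote leaves unspecified.

```python
# Minimal sketch of the weighted loss combination quoted above, assuming
# the three loss terms are already computed elsewhere as scalar tensors.
import torch

def total_loss(l_c, l_d, l_r, alpha=1.0, beta=1.0, lam=1.0, supervised=True):
    """Combine the three CLIP-It training losses.

    Supervised setting: L_sup = alpha*L_c + beta*L_d + lambda*L_r.
    Unsupervised setting: the classification term L_c is dropped.
    Default weights of 1.0 are placeholders, not values from the paper.
    """
    loss = beta * l_d + lam * l_r
    if supervised:
        loss = loss + alpha * l_c
    return loss

# Example with dummy scalar losses:
l_c, l_d, l_r = torch.tensor(0.7), torch.tensor(0.3), torch.tensor(0.5)
print(total_loss(l_c, l_d, l_r))                    # supervised
print(total_loss(l_c, l_d, l_r, supervised=False))  # unsupervised
```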