CLIP-It! Language-Guided Video Summarization
Authors: Medhini Narasimhan, Anna Rohrbach, Trevor Darrell
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We outperform baselines and prior work by a significant margin on both standard video summarization datasets (TVSum and SumMe) and a query-focused video summarization dataset (QFVS). Particularly, we achieve large improvements in the transfer setting, attesting to our method's strong generalization capabilities. In this section, we describe the experimental setup and evaluation of our method on two tasks: generic video summarization and query-focused video summarization. |
| Researcher Affiliation | Academia | Medhini Narasimhan, Anna Rohrbach, Trevor Darrell; University of California, Berkeley; {medhini, anna.rohrbach, trevordarrell}@berkeley.edu |
| Pseudocode | No | The paper describes the system architecture and components but does not provide structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper provides a project website link (https://medhini.github.io/clip_it) but does not explicitly state that the source code for the methodology is released there or provide a direct link to a code repository. |
| Open Datasets | Yes | We evaluate our approach on two standard video summarization datasets (TVSum [36] and SumMe [7]) and on the generic summaries for UT Egocentric videos [16] provided by the QFVS dataset [33]. TVSum [36] consists of 50 videos... SumMe [7] consists of 25 videos... augment training data with 39 videos from the YouTube dataset [2] and 50 videos from the Open Video Project (OVP) dataset [24]. |
| Dataset Splits | Yes | Following [33], we run four rounds of experiments leaving out one video for testing and one for validation, while keeping the remaining two for training. (A sketch of this rotation appears after the table.) |
| Hardware Specification | No | The paper describes the training of models and experiments but does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used for these experiments. |
| Software Dependencies | No | The paper mentions using various pre-trained networks and models (e.g., GoogleNet, ResNet, CLIP, Bi-Modal Transformer) but does not specify version numbers for these or other software dependencies such as deep learning frameworks or libraries. |
| Experiment Setup | Yes | We employ 3 loss functions (classification, diversity, and reconstruction) to train our model. The supervised setting uses all 3 and the unsupervised setting uses only diversity and reconstruction losses. ... The final loss function for supervised learning is then L_sup = α·L_c + β·L_d + λ·L_r, where α, β, and λ control the trade-off between the three loss functions. (A sketch of this objective follows the table.) |
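
The Experiment Setup row quotes the paper's combined objective. Below is a minimal sketch of that weighted sum, assuming unit default weights and scalar loss inputs; the function names and example values are illustrative, not the authors' released implementation.

```python
# Minimal sketch of the training objective quoted in the Experiment Setup row:
# a weighted sum of classification (L_c), diversity (L_d), and reconstruction
# (L_r) losses. Weight defaults and function names are illustrative
# assumptions, not the authors' code.

def supervised_loss(l_c, l_d, l_r, alpha=1.0, beta=1.0, lam=1.0):
    """L_sup = alpha * L_c + beta * L_d + lambda * L_r (supervised setting)."""
    return alpha * l_c + beta * l_d + lam * l_r

def unsupervised_loss(l_d, l_r, beta=1.0, lam=1.0):
    """The unsupervised setting drops the classification term."""
    return beta * l_d + lam * l_r

# Example with scalar loss values (in practice these would be tensors):
print(supervised_loss(0.5, 0.25, 0.125))  # 0.875
print(unsupervised_loss(0.25, 0.125))     # 0.375
```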
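
Similarly, the Dataset Splits row describes a four-round leave-out rotation over the QFVS / UT Egocentric videos. A hypothetical sketch of that protocol follows; the video IDs are placeholders, and the paper does not specify the rotation order, so a simple cyclic assignment is assumed here.

```python
# Hypothetical sketch of the four-round evaluation protocol: each round holds
# out one video for testing, one for validation, and trains on the remaining
# two. Video IDs are placeholders (assumption); the cyclic assignment below
# is one possible rotation, not necessarily the one used in the paper.
videos = ["vid1", "vid2", "vid3", "vid4"]  # placeholder IDs for the 4 UTE videos

for i in range(len(videos)):
    test = videos[i]
    val = videos[(i + 1) % len(videos)]
    train = [v for v in videos if v not in (test, val)]
    print(f"Round {i + 1}: train={train}, val={val}, test={test}")
```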