
CLIP-It! Language-Guided Video Summarization

Authors: Medhini Narasimhan, Anna Rohrbach, Trevor Darrell

NeurIPS 2021

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We outperform baselines and prior work by a significant margin on both standard video summarization datasets (TVSum and SumMe) and a query-focused video summarization dataset (QFVS). Particularly, we achieve large improvements in the transfer setting, attesting to our method's strong generalization capabilities. In this section, we describe the experimental setup and evaluation of our method on two tasks: generic video summarization and query-focused video summarization.
Researcher Affiliation	Academia	Medhini Narasimhan, Anna Rohrbach, Trevor Darrell; University of California, Berkeley; {medhini, anna.rohrbach, trevordarrell}@berkeley.edu
Pseudocode	No	The paper describes the system architecture and components but does not provide structured pseudocode or algorithm blocks.
Open Source Code	No	The paper provides a project website link (https://medhini.github.io/clip_it) but does not explicitly state that the source code for the methodology is released there or provide a direct link to a code repository.
Open Datasets	Yes	We evaluate our approach on two standard video summarization datasets (TVSum [36] and SumMe [7]) and on the generic summaries for UT Egocentric videos [16] provided by the QFVS dataset [33]. TVSum [36] consists of 50 videos... SumMe [7] consists of 25 videos... augment training data with 39 videos from the YouTube dataset [2] and 50 videos from the Open Video Project (OVP) dataset [24].
Dataset Splits	Yes	Following [33], we run four rounds of experiments leaving out one video for testing and one for validation, while keeping the remaining two for training.
Hardware Specification	No	The paper describes the training of models and experiments but does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used for these experiments.
Software Dependencies	No	The paper mentions using various pre-trained networks and models (e.g., GoogLeNet, ResNet, CLIP, Bi-Modal Transformer) but does not specify version numbers for these or other software dependencies like deep learning frameworks or libraries.
Experiment Setup	Yes	We employ 3 loss functions (classification, diversity, and reconstruction) to train our model. The supervised setting uses all 3 and the unsupervised setting uses only diversity and reconstruction losses. ... The final loss function for supervised learning is then $L_{sup} = \alpha L_c + \beta L_d + \lambda L_r$, where $\alpha$, $\beta$, and $\lambda$ control the trade-off between the three loss functions.
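
The loss combination quoted under Experiment Setup can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `combined_loss`, the `supervised` flag, and the default weights of 1.0 are hypothetical, and reusing the same β and λ in the unsupervised setting is an assumption, since the quoted text only says that setting uses the diversity and reconstruction losses.

```python
def combined_loss(l_c, l_d, l_r, alpha=1.0, beta=1.0, lam=1.0, supervised=True):
    """Weighted sum of the three losses named in the Experiment Setup row.

    l_c: classification loss, l_d: diversity loss, l_r: reconstruction loss.
    alpha, beta, lam: trade-off weights (default values are placeholders,
    not values reported in the paper).
    """
    # Diversity and reconstruction terms are used in both settings.
    # Assumption: the unsupervised setting keeps the same beta and lam.
    unsupervised_part = beta * l_d + lam * l_r
    if supervised:
        # Supervised: L_sup = alpha * L_c + beta * L_d + lam * L_r
        return alpha * l_c + unsupervised_part
    # Unsupervised: the classification term is dropped.
    return unsupervised_part
```

Calling `combined_loss(l_c, l_d, l_r, supervised=False)` drops the classification term, which mirrors the quoted statement that the unsupervised setting trains on the diversity and reconstruction losses only.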