Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens

Authors: Elad Ben Avraham, Roei Herzig, Karttikeya Mangalam, Amir Bar, Anna Rohrbach, Leonid Karlinsky, Trevor Darrell, Amir Globerson

NeurIPS 2022 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | SViT shows strong performance improvements on multiple video understanding tasks and datasets, including first place in the Ego4D CVPR'22 Point of No Return Temporal Localization Challenge.
Researcher Affiliation | Collaboration | Elad Ben-Avraham (Tel Aviv University), Roei Herzig (Tel Aviv University, IBM Research), Karttikeya Mangalam (UC Berkeley), Amir Bar (Tel Aviv University), Anna Rohrbach (UC Berkeley), Leonid Karlinsky (MIT-IBM Lab), Trevor Darrell (UC Berkeley), Amir Globerson (Tel Aviv University, Google Research)
Pseudocode | No | The paper describes models and processes using text and mathematical equations but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | For code and pretrained models, visit the project page at https://eladb3.github.io/SViT/. SViT is implemented in PyTorch, and the code is available on our project page.
Open Datasets | Yes | We use the following video datasets: (1) Something-Something v2 (SSv2) [30]... (2) Something Else [59]... (3) Ego4D [31]... (4) Diving48 [54]... (5) Atomic Visual Actions (AVA) [32]. For auxiliary image datasets we use the Ego4D [31] and the 100 Days of Hands (100DOH) [72] datasets.
Dataset Splits | Yes | The split contains 174 classes with 54,919/54,876 videos for training/validation. We pretrain the SViT model on the K400 [45] video dataset with our auxiliary image datasets. Then, we finetune on the target video understanding task (detailed in Section 3.1) together with the auxiliary image datasets and the SViT loss.
Hardware Specification | No | The provided paper does not specify the exact hardware (e.g., GPU models, CPU types) used for running the experiments in the main text. While the reproducibility checklist indicates that this information is provided, it is not present in the given document portion.
Software Dependencies | No | The paper states, 'SViT is implemented in PyTorch,' but it does not provide specific version numbers for PyTorch or any other software dependencies needed for reproducibility.
Experiment Setup | Yes | Each of the three terms in the loss is multiplied by a hyper-parameter (λ_Con, λ_HAOG, λ_Vid), and the total loss is the weighted combination of the three terms: L_Total := λ_Con L_Con + λ_HAOG L_HAOG + λ_Vid L_Vid. Each training batch contains 64 images and 64 videos in order to minimize the overall loss in Equation 8.
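To make the quoted objective concrete, the sketch below shows the weighted three-term combination in PyTorch. This is a minimal illustration, not the authors' implementation: the weight values and the scalar stand-ins for L_Con, L_HAOG, and L_Vid are assumptions; only the weighted-sum structure follows the quote above.

```python
import torch

# Assumed example values for the loss weights; the paper treats
# λ_Con, λ_HAOG, λ_Vid as tunable hyper-parameters.
LAMBDA_CON, LAMBDA_HAOG, LAMBDA_VID = 1.0, 0.5, 1.0

def total_loss(l_con: torch.Tensor, l_haog: torch.Tensor, l_vid: torch.Tensor) -> torch.Tensor:
    """Weighted combination: L_Total = λ_Con*L_Con + λ_HAOG*L_HAOG + λ_Vid*L_Vid."""
    return LAMBDA_CON * l_con + LAMBDA_HAOG * l_haog + LAMBDA_VID * l_vid

# Toy scalar stand-ins for the three loss terms; in the paper each term is
# computed from a mixed training batch of 64 images and 64 videos.
l_con = torch.tensor(0.8)
l_haog = torch.tensor(1.2)
l_vid = torch.tensor(0.5)
print(total_loss(l_con, l_haog, l_vid))  # tensor(1.9000)
```

The three weights simply set the relative importance of the consistency, hand-object-graph, and video-task terms; how each individual term is computed is described in the paper, not in this sketch.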