Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens

Authors: Elad Ben Avraham, Roei Herzig, Karttikeya Mangalam, Amir Bar, Anna Rohrbach, Leonid Karlinsky, Trevor Darrell, Amir Globerson

NeurIPS 2022 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | SViT shows strong performance improvements on multiple video understanding tasks and datasets, including first place in the Ego4D CVPR'22 Point of No Return Temporal Localization Challenge.
Researcher Affiliation | Collaboration | Elad Ben-Avraham (Tel Aviv University), Roei Herzig (Tel Aviv University, IBM Research), Karttikeya Mangalam (UC Berkeley), Amir Bar (Tel Aviv University), Anna Rohrbach (UC Berkeley), Leonid Karlinsky (MIT-IBM Lab), Trevor Darrell (UC Berkeley), Amir Globerson (Tel Aviv University, Google Research)
Pseudocode | No | The paper describes models and processes using text and mathematical equations but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | For code and pretrained models, visit the project page at https://eladb3.github.io/SViT/. SViT is implemented in PyTorch, and the code is available on our project page.
Open Datasets | Yes | We use the following video datasets: (1) Something-Something v2 (SSv2) [30]... (2) Something Else [59]... (3) Ego4D [31]... (4) Diving48 [54]... (5) Atomic Visual Actions (AVA) [32]. For auxiliary image datasets we use the Ego4D [31] and the 100 Days of Hands (100DOH) [72] datasets.
Dataset Splits | Yes | The split contains 174 classes with 54,919/54,876 videos for training/validation. We pretrain the SViT model on the K400 [45] video dataset with our auxiliary image datasets. Then, we finetune on the target video understanding task (detailed in Section 3.1) together with the auxiliary image datasets and the SViT loss.
Hardware Specification | No | The provided paper does not specify the exact hardware (e.g., GPU models, CPU types) used for running the experiments in the main text. While the reproducibility checklist indicates that this information is provided, it is not present in the given document portion.
Software Dependencies | No | The paper states, 'SViT is implemented in PyTorch,' but it does not provide specific version numbers for PyTorch or any other software dependencies needed for reproducibility.
Experiment Setup | Yes | Each of the three terms in the loss is multiplied by a hyper-parameter (λ_Con, λ_HAOG, λ_Vid), and the total loss is the weighted combination of the three terms: L_Total := λ_Con L_Con + λ_HAOG L_HAOG + λ_Vid L_Vid. Each training batch contains 64 images and 64 videos in order to minimize the overall loss in Equation 8.
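To make the quoted objective concrete, the sketch below shows the weighted three-term combination in PyTorch. This is a minimal illustration, not the authors' implementation: the weight values and the scalar stand-ins for L_Con, L_HAOG, and L_Vid are assumptions; only the weighted-sum structure follows the quote above.

```python
import torch

# Assumed example values for the loss weights; the paper treats
# λ_Con, λ_HAOG, λ_Vid as tunable hyper-parameters.
LAMBDA_CON, LAMBDA_HAOG, LAMBDA_VID = 1.0, 0.5, 1.0

def total_loss(l_con: torch.Tensor, l_haog: torch.Tensor, l_vid: torch.Tensor) -> torch.Tensor:
    """Weighted combination: L_Total = λ_Con*L_Con + λ_HAOG*L_HAOG + λ_Vid*L_Vid."""
    return LAMBDA_CON * l_con + LAMBDA_HAOG * l_haog + LAMBDA_VID * l_vid

# Toy scalar stand-ins for the three loss terms; in the paper each term is
# computed from a mixed training batch of 64 images and 64 videos.
l_con = torch.tensor(0.8)
l_haog = torch.tensor(1.2)
l_vid = torch.tensor(0.5)
print(total_loss(l_con, l_haog, l_vid))  # tensor(1.9000)
```

The three weights simply set the relative importance of the consistency, hand-object-graph, and video-task terms; how each individual term is computed is described in the paper, not in this sketch.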