Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens
Authors: Elad Ben Avraham, Roei Herzig, Karttikeya Mangalam, Amir Bar, Anna Rohrbach, Leonid Karlinsky, Trevor Darrell, Amir Globerson
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | SViT shows strong performance improvements on multiple video understanding tasks and datasets, including first place in the Ego4D CVPR 2022 Point of No Return Temporal Localization Challenge. |
| Researcher Affiliation | Collaboration | Elad Ben-Avraham (Tel Aviv University, eladba4@gmail.com); Roei Herzig (Tel Aviv University and IBM Research, roeiherz@gmail.com); Karttikeya Mangalam (UC Berkeley, mangalam@cs.berkeley.edu); Amir Bar (Tel Aviv University, amirb4r@gmail.com); Anna Rohrbach (UC Berkeley, anna.rohrbach@berkeley.edu); Leonid Karlinsky (MIT-IBM Lab, leonidka@ibm.com); Trevor Darrell (UC Berkeley, trevordarrell@berkeley.edu); Amir Globerson (Tel Aviv University and Google Research, gamir@tauex.tau.ac.il) |
| Pseudocode | No | The paper describes models and processes using text and mathematical equations but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | For code and pretrained models, visit the project page at https://eladb3.github.io/SViT/. SViT is implemented in PyTorch, and the code is available on our project page. |
| Open Datasets | Yes | We use the following video datasets: (1) Something-Something v2 (SSv2) [30]... (2) Something Else [59]... (3) Ego4D [31]... (4) Diving48 [54]... (5) Atomic Visual Actions (AVA) [32]. For auxiliary image datasets we use the Ego4D [31], and the 100 Days of Hands (100DOH) [72] datasets. |
| Dataset Splits | Yes | The split contains 174 classes with 54,919/54,876 videos for training/validation. We pretrain the SViT model on the K400 [45] video dataset with our auxiliary image datasets. Then, we finetune on the target video understanding task (detailed in Section 3.1) together with the auxiliary image datasets and the SViT loss. |
| Hardware Specification | No | The provided paper does not specify the exact hardware (e.g., GPU models, CPU types) used for running the experiments in the main text. While the reproducibility checklist indicates that this information is provided, it is not present in the given document portion. |
| Software Dependencies | No | The paper states, 'SViT is implemented in PyTorch,' but it does not provide specific version numbers for PyTorch or any other software dependencies needed for reproducibility. |
| Experiment Setup | Yes | Each of the three terms in the loss is multiplied by a hyper-parameter ($\lambda_{Con}$, $\lambda_{HAOG}$, $\lambda_{Vid}$), and the total loss is the weighted combination of the three terms: $\mathcal{L}_{Total} := \lambda_{Con}\,\mathcal{L}_{Con} + \lambda_{HAOG}\,\mathcal{L}_{HAOG} + \lambda_{Vid}\,\mathcal{L}_{Vid}$. Each training batch contains 64 images and 64 videos in order to minimize the overall loss in Equation 8 (see the sketch below the table). |
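
A minimal PyTorch-style sketch of the weighted loss combination described in the Experiment Setup row. The three component losses and the lambda values are placeholders, not the paper's implementation or tuned hyper-parameters; only the weighted sum of Equation 8 is reproduced.

```python
import torch

# Illustrative weights; the paper treats these as tuned hyper-parameters.
lambda_con, lambda_haog, lambda_vid = 1.0, 1.0, 1.0

def total_loss(loss_con: torch.Tensor,
               loss_haog: torch.Tensor,
               loss_vid: torch.Tensor) -> torch.Tensor:
    """L_Total = lambda_Con * L_Con + lambda_HAOG * L_HAOG + lambda_Vid * L_Vid."""
    return lambda_con * loss_con + lambda_haog * loss_haog + lambda_vid * loss_vid

# Example: each training batch mixes 64 images and 64 videos, so the
# image-based term (L_HAOG) and the video-based terms (L_Con, L_Vid)
# would be computed on their respective parts of the batch and combined here.
loss = total_loss(torch.tensor(0.3), torch.tensor(0.5), torch.tensor(1.2))
```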