Generative Video Transformer: Can Objects be the Words?

Authors: Yi-Fu Wu, Jaesik Yoon, Sungjin Ahn

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We compare our model with previous RNN-based approaches as well as other possible video transformer baselines. We demonstrate OCVT performs well when compared to baselines in generating future frames. OCVT also develops useful representations for video reasoning, achieving state-of-the-art performance on the CATER task."
Researcher Affiliation | Collaboration | ¹Department of Computer Science, Rutgers University; ²SAP Labs; ³Rutgers Center for Cognitive Science.
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any concrete access information (such as a specific repository link or an explicit code-release statement) for its methodology.
Open Datasets | Yes | "We also evaluate on the CATER dataset (Girdhar & Ramanan, 2020), a video-understanding benchmark that requires long-term temporal reasoning."
Dataset Splits | No | The paper specifies training lengths for the bouncing-balls dataset (e.g., "For Mod1, we train on 20 frames"), but it does not provide explicit training, validation, and test splits with percentages, sample counts, or references to predefined splits.
Hardware Specification | No | The paper mentions "a single 48GB GPU" but does not specify the exact model (e.g., NVIDIA A100) or other hardware details such as CPU, memory, or cloud instance types.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers, such as programming-language, library, or solver versions.
Experiment Setup | Yes | "We then apply the following formula to obtain the predicted bounding box: ẑ_{t+1}^where = z_t^where + c·tanh(ẑ_{t+1}^where), where c is a hyperparameter between 0 and 1 controlling the maximum update in one timestep. ... β_where, β_depth, and β_pres are hyperparameters used to control the contribution of each loss term."
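The bounding-box update quoted in the Experiment Setup row can be sketched as a tanh-bounded residual step. The sketch below is illustrative, not the authors' implementation: the function name and the assumption that `raw_delta` is the unbounded network output fed through tanh are mine; only the update rule z + c·tanh(·) with c in (0, 1) comes from the paper.

```python
import numpy as np

def bounded_bbox_update(z_where, raw_delta, c=0.1):
    """Advance a bounding box by a tanh-bounded delta.

    z_where:   current box parameters, e.g. [x, y, w, h]
    raw_delta: unbounded prediction for the next step (assumed network output)
    c:         hyperparameter in (0, 1) capping the per-step change per coordinate
    """
    return z_where + c * np.tanh(raw_delta)

# Even a very large raw prediction moves each coordinate by at most c,
# while a zero prediction leaves the box unchanged.
z = np.array([0.5, 0.5, 0.2, 0.2])
z_next = bounded_bbox_update(z, np.array([10.0, -10.0, 0.0, 1.0]), c=0.1)
```

Because tanh saturates at ±1, the maximum displacement per timestep is exactly c, which keeps the predicted boxes temporally smooth.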