Parallelized Spatiotemporal Slot Binding for Videos

Authors: Gautam Singh, Yue Wang, Jiawei Yang, Boris Ivanovic, Sungjin Ahn, Marco Pavone, Tong Che

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In experiments, we test PSB extensively as an encoder within an autoencoding framework paired with a wide variety of decoder options. Compared to the state-of-the-art, our architecture demonstrates stable training on longer sequences, achieves parallelization that results in a 60% increase in training speed, and yields performance that is on par with or better on unsupervised 2D and 3D object-centric scene decomposition and understanding.
Researcher Affiliation | Collaboration | ¹Rutgers University, ²NVIDIA Research, ³University of Southern California, ⁴KAIST, ⁵Stanford University.
Pseudocode | Yes | The operation of a PSB block is also summarized in Algorithm 1.
Open Source Code | Yes | See our project page at this link. (The linked project page contains a GitHub repository link: https://github.com/NVlabs/PSB)
Open Datasets | Yes | In this setting, we evaluate on the MOVi benchmark (Greff et al., 2022) comprising five datasets: MOVi-A, MOVi-B, MOVi-C, MOVi-D, and MOVi-E. ... We synthesized these datasets as extensions of the CLEVR dataset to incorporate physical dynamics, multiple cameras, 3D camera pose information, moving ego observer, and visual complexity. ... We extended the codebase of the original CLEVR dataset to generate this dataset.
Dataset Splits | No | The paper does not provide explicit details on how the datasets were split into training, validation, and test sets (e.g., percentages or counts for each split).
Hardware Specification | No | The paper makes general statements about hardware, such as "on hardware such as a GPU", but does not provide specific details like GPU models, CPU types, or memory amounts used for experiments.
Software Dependencies | No | The paper describes software components but does not provide specific version numbers for reproducibility (e.g., "Python 3.x", "PyTorch x.x").
Experiment Setup | Yes | To train the models, we use a linear learning rate warm-up in the first 30000 training steps and use exponential decay thereafter with a half-life of 1M steps. We use a peak learning rate of 3e-4. All models are trained to 300K steps. We used a batch size of 24 episodes with 6 time-steps per episode. We use the AdamW optimizer (Loshchilov & Hutter, 2017) with β1 = 0.9 and β2 = 0.95.
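
The Experiment Setup row above is specific enough to reconstruct the optimizer and learning-rate schedule. Below is a minimal PyTorch sketch of that configuration (linear warm-up over 30,000 steps, exponential decay with a 1M-step half-life, peak learning rate 3e-4, AdamW with β1 = 0.9 and β2 = 0.95); the placeholder model and the assumption that the decay is measured from the end of warm-up are illustrative choices, not details confirmed by the paper.

```python
import torch
from torch import nn

PEAK_LR = 3e-4
WARMUP_STEPS = 30_000
HALF_LIFE = 1_000_000   # steps for the learning rate to halve during decay
TOTAL_STEPS = 300_000   # the paper trains all models to 300K steps

# Placeholder module standing in for the PSB autoencoder (not reproduced here).
model = nn.Linear(8, 8)

# AdamW with the betas quoted in the table.
optimizer = torch.optim.AdamW(model.parameters(), lr=PEAK_LR, betas=(0.9, 0.95))

def lr_lambda(step: int) -> float:
    """Linear warm-up to the peak LR, then exponential decay with a fixed half-life."""
    if step < WARMUP_STEPS:
        return step / WARMUP_STEPS
    # Assumption: the half-life decay starts counting at the end of warm-up.
    return 0.5 ** ((step - WARMUP_STEPS) / HALF_LIFE)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Sanity check: learning rate at a few milestones.
for step in (0, WARMUP_STEPS, WARMUP_STEPS + HALF_LIFE):
    print(step, PEAK_LR * lr_lambda(step))
```

In an actual run, `scheduler.step()` would be called once per training step on batches of 24 episodes with 6 time-steps each.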
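
For the Open Datasets row, the MOVi-A through MOVi-E records are publicly distributed through TensorFlow Datasets as part of the Kubric release. The snippet below is a loading sketch under that assumption; the bucket path, dataset/config name, and the "video" feature key come from the public Kubric documentation rather than from this report, and the authors' CLEVR-extension datasets are not covered here.

```python
import tensorflow_datasets as tfds

# Load one MOVi dataset from the public Kubric TFDS bucket (assumed location).
ds = tfds.load(
    "movi_a/128x128:1.0.0",      # movi_b ... movi_e follow the same naming pattern
    data_dir="gs://kubric-public/tfds",
    split="train",
)

for example in ds.take(1):
    video = example["video"]     # (num_frames, H, W, 3) uint8 frames
    print(video.shape)
```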