Parallelized Spatiotemporal Slot Binding for Videos
Authors: Gautam Singh, Yue Wang, Jiawei Yang, Boris Ivanovic, Sungjin Ahn, Marco Pavone, Tong Che
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In experiments, we test PSB extensively as an encoder within an autoencoding framework paired with a wide variety of decoder options. Compared to the state of the art, our architecture demonstrates stable training on longer sequences, achieves parallelization that results in a 60% increase in training speed, and yields performance that is on par with or better than the state of the art on unsupervised 2D and 3D object-centric scene decomposition and understanding. |
| Researcher Affiliation | Collaboration | Rutgers University, NVIDIA Research, University of Southern California, KAIST, Stanford University. |
| Pseudocode | Yes | The operation of a PSB block is also summarized in Algorithm 1. |
| Open Source Code | Yes | See our project page at this link. (The linked project page contains a GitHub repository link: https://github.com/NVlabs/PSB) |
| Open Datasets | Yes | In this setting, we evaluate on the MOVi benchmark (Greff et al., 2022) comprising five datasets: MOVi-A, MOVi-B, MOVi-C, MOVi-D, and MOVi-E. ... We synthesized these datasets as extensions of the CLEVR dataset to incorporate physical dynamics, multiple cameras, 3D camera pose information, moving ego observer, and visual complexity. ... We extended the codebase of the original CLEVR dataset to generate this dataset. |
| Dataset Splits | No | The paper does not provide explicit details on how the datasets were split into training, validation, and test sets (e.g., percentages or counts for each split). |
| Hardware Specification | No | The paper makes general statements about hardware, such as "on hardware such as a GPU", but does not provide specific details like GPU models, CPU types, or memory amounts used for experiments. |
| Software Dependencies | No | The paper describes software components but does not provide specific version numbers for reproducibility (e.g., "Python 3.x", "PyTorch x.x"). |
| Experiment Setup | Yes | To train the models, we use a linear learning rate warm-up in the first 30,000 training steps and use exponential decay thereafter with a half-life of 1M steps. We use a peak learning rate of 3e-4. All models are trained to 300K steps. We used a batch size of 24 episodes with 6 time-steps per episode. We use the AdamW optimizer (Loshchilov & Hutter, 2017) with β1 = 0.9 and β2 = 0.95. |
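
The optimizer and schedule quoted above are specific enough to sketch in code. The snippet below is a minimal, hypothetical PyTorch rendering of that configuration (AdamW with β1 = 0.9, β2 = 0.95, peak learning rate 3e-4, 30K-step linear warm-up, exponential decay with a 1M-step half-life, 300K total steps, batches of 24 six-step episodes); the model, inputs, and loss are placeholders, not the paper's PSB architecture, and whether the decay clock starts at step 0 or after warm-up is an assumption.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Hyperparameters quoted in the experiment setup above.
PEAK_LR = 3e-4
WARMUP_STEPS = 30_000
HALF_LIFE_STEPS = 1_000_000
TOTAL_STEPS = 300_000
BATCH_SIZE = 24    # episodes per batch
EPISODE_LEN = 6    # time-steps per episode

def lr_lambda(step: int) -> float:
    """Linear warm-up to the peak LR, then exponential decay with a 1M-step half-life.

    The paper does not state whether the decay counts from step 0 or from the end
    of warm-up; here it counts from the end of warm-up (an assumption).
    """
    if step < WARMUP_STEPS:
        return step / WARMUP_STEPS
    return 0.5 ** ((step - WARMUP_STEPS) / HALF_LIFE_STEPS)

# `model` is a stand-in for the PSB autoencoder, whose definition is not given here.
model = torch.nn.Linear(16, 16)
optimizer = AdamW(model.parameters(), lr=PEAK_LR, betas=(0.9, 0.95))
scheduler = LambdaLR(optimizer, lr_lambda)

for step in range(TOTAL_STEPS):
    batch = torch.randn(BATCH_SIZE * EPISODE_LEN, 16)  # placeholder input
    loss = model(batch).pow(2).mean()                  # placeholder loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```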