SlotFormer: Unsupervised Visual Dynamics Simulation with Object-Centric Models

Authors: Ziyi Wu, Nikita Dvornik, Klaus Greff, Thomas Kipf, Animesh Garg

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate SlotFormer on four video datasets consisting of diverse object dynamics. Our method not only presents competitive results on standard video prediction metrics, but also achieves significant gains when evaluating on object-aware metrics in the long range. Crucially, we demonstrate that SlotFormer's unsupervised dynamics knowledge can be successfully transferred to downstream supervised tasks (e.g., VQA and goal-conditional planning) to improve their performance for free."
Researcher Affiliation | Collaboration | 1 University of Toronto, 2 Vector Institute, 3 Samsung AI Centre Toronto, 4 Google Research
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks; Figure 1 is an architecture diagram.
Open Source Code | No | "To facilitate future research, we will release the code of our work and the pre-trained model weights alongside the camera ready version of this paper."
Open Datasets | Yes | "We evaluate our method's capability in video prediction on two datasets, OBJ3D (Lin et al., 2020) and CLEVRER (Yi et al., 2019), and demonstrate its ability for downstream reasoning and planning tasks on three datasets, CLEVRER, Physion (Bear et al., 2021) and PHYRE (Bakhtin et al., 2019)."
Dataset Splits | Yes | "OBJ3D consists of CLEVR-like (Johnson et al., 2017) dynamic scenes, where a sphere is launched from the front of the scene to collide with other still objects. There are 2,920 videos for training and 200 videos for testing."
Hardware Specification | Yes | "All of our methods are implemented in PyTorch (Paszke et al., 2019), and can be trained on servers with 4 modern GPUs in less than 5 days, enabling both industrial and academic researchers."
Software Dependencies | Yes | "All of our methods are implemented in PyTorch (Paszke et al., 2019)... We follow BERT (Kenton & Toutanova, 2019) to implement our model by stacking multiple transformer encoder blocks." (A minimal sketch of such an encoder stack appears after this table.)
Experiment Setup | Yes | Table 8 of the paper ("Variations in model architectures and training settings on different datasets") covers: image resolution, number of slots, slot size, batch size, training steps, burn-in steps T, rollout steps K, latent size D_e, number of transformer layers N_T, and loss weight λ. (A hedged configuration and rollout sketch appears after this table.)
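
The quoted dependency note says the dynamics model follows BERT by stacking transformer encoder blocks over slot representations. Below is a minimal PyTorch sketch of that idea; the class name, default sizes, learned temporal position embedding, and next-frame readout are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class SlotDynamicsTransformer(nn.Module):
    """Hypothetical BERT-style encoder stack over slot tokens.

    A sketch of the quoted design ("stacking multiple transformer
    encoder blocks"); names and defaults are assumptions.
    """

    def __init__(self, slot_size=128, latent_size=256, num_layers=4,
                 num_heads=8, burn_in_steps=6):
        super().__init__()
        self.proj_in = nn.Linear(slot_size, latent_size)
        # Learned temporal position embedding, one per burn-in frame (assumption).
        self.pos_emb = nn.Parameter(torch.zeros(1, burn_in_steps, 1, latent_size))
        layer = nn.TransformerEncoderLayer(
            d_model=latent_size, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.proj_out = nn.Linear(latent_size, slot_size)

    def forward(self, slots):
        # slots: (batch, T, num_slots, slot_size), e.g. from a pre-trained
        # object-centric encoder such as Slot Attention / SAVi.
        b, t, n, _ = slots.shape
        x = self.proj_in(slots) + self.pos_emb[:, :t]
        x = x.reshape(b, t * n, -1)   # flatten time and slots into one token sequence
        x = self.encoder(x)
        x = x.reshape(b, t, n, -1)
        # Read out next-frame slots from the last time step's tokens.
        return self.proj_out(x[:, -1])
```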
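Table 8's settings (burn-in steps T, rollout steps K, slot size, latent size D_e, layers N_T, loss weight λ) imply a loop that observes T frames and unrolls K predictions autoregressively. The sketch below shows one plausible shape for that loop, reusing the hypothetical module above; all concrete values are placeholders, not the paper's Table 8 numbers.

```python
import torch

# Hypothetical per-dataset configuration in the spirit of Table 8;
# every value here is a placeholder, not the paper's number.
config = dict(
    image_resolution=64, num_slots=7, slot_size=128,
    batch_size=64, training_steps=200_000,
    burn_in_steps=6,   # T: observed frames fed to the model
    rollout_steps=10,  # K: future frames predicted autoregressively
    latent_size=256, num_layers=4, loss_weight=1.0,
)

@torch.no_grad()
def rollout(model, slots, rollout_steps):
    """Autoregressive rollout: feed each predicted slot set back in.

    slots: (batch, T, num_slots, slot_size) burn-in slots.
    Returns: (batch, K, num_slots, slot_size) predicted future slots.
    """
    history, preds = slots, []
    for _ in range(rollout_steps):
        next_slots = model(history)          # (batch, num_slots, slot_size)
        preds.append(next_slots)
        # Slide the window so the input length stays at T burn-in frames.
        history = torch.cat([history[:, 1:], next_slots.unsqueeze(1)], dim=1)
    return torch.stack(preds, dim=1)
```

Keeping the input window at a fixed length T matches the fixed-size temporal position embedding in the sketch above; it is one common choice for this kind of rollout, assumed here rather than taken from the paper.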