SlotDiffusion: Object-Centric Generative Modeling with Diffusion Models

Authors: Ziyi Wu, Jingyu Hu, Wuyue Lu, Igor Gilitschenski, Animesh Garg

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments to answer the following questions: (i) Can we learn object-oriented scene decomposition supervised by the denoising loss of DMs? (Section 4.2) (ii) Will our LDM-based decoder improve the visual generation quality of slot models? (Section 4.3) (iii) Is the object-centric representation learned by SlotDiffusion useful for downstream dynamics modeling tasks? (Section 4.4) (iv) Can we extend our method to handle real-world data? (Section 4.5) (v) Can SlotDiffusion benefit from other recent improvements in object-centric learning? (Section 4.6) (vi) What is the impact of each design choice on SlotDiffusion? (Section 4.7)
Researcher Affiliation | Academia | Ziyi Wu (1,2), Jingyu Hu (1), Wuyue Lu (1), Igor Gilitschenski (1,2), Animesh Garg (1,2); 1: University of Toronto, 2: Vector Institute
Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as "Pseudocode" or "Algorithm".
Open Source Code | No | The paper mentions "Additional results and details are available at our website" but does not explicitly state that the source code for their method is released there. It also refers to the "online official implementation" of SlotFormer (a baseline), but not their own code.
Open Datasets | Yes | We evaluate our method in unsupervised object discovery and slot-based visual generation on six datasets, namely, the two most complex image datasets CLEVRTex [43] and CelebA [59] from SLATE [83], and four video datasets MOVi-D/E/Solid/Tex [25] from STEVE [84]. Then, we show SlotDiffusion's capability for downstream video prediction and reasoning tasks on Physion [4]. Finally, we scale our method to unconstrained real-world images on PASCAL VOC 2012 [21] and MS COCO 2017 [54].
Dataset Splits | Yes | Physion splits the videos into three sets, namely, Training, Readout Fitting, and Testing. [...] A linear readout model is trained on observed and rollout scene representations from the Readout Fitting set to classify whether the two cued objects (one in red and one in yellow) contact. (A minimal sketch of such a linear readout probe is given after the table.)
Hardware Specification | Yes | For training, we report the default settings (batch size 32 of length-3 video clips, frame resolution 128×128) on NVIDIA A40 GPUs. For testing, we report the inference time on NVIDIA T4 GPUs (each video contains 24 frames).
Software Dependencies | No | The paper mentions using a pre-trained VQ-VAE, an LDM, and the Hugging Face transformers library for the DINO pre-trained encoder, but it does not specify explicit version numbers for these or other software components. (A hedged sketch of loading a DINO encoder with a pinned revision is given after the table.)
Experiment Setup | Yes | We first pre-train VQ-VAE for 100 epochs on each dataset with a cosine learning rate schedule decaying from 1e-3, and fix it during the object-centric model training. [...] See Table 5 for detailed slot configurations and training settings. (Table 5 includes: Max Learning Rate, Gradient Clipping, Batch Size, Training Epochs.) A sketch of the reported learning-rate schedule follows the table.
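
The Physion readout protocol quoted under "Dataset Splits" trains a linear classifier on observed and rollout scene representations to predict whether the two cued objects contact. The sketch below is a minimal illustration of such a probe, assuming per-video feature vectors have already been extracted from the slot representations; the feature extraction, the logistic-regression solver, and all function names are assumptions rather than the authors' released code.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def fit_contact_readout(readout_features, readout_labels):
        # readout_features: (num_videos, feature_dim) vectors from the Readout Fitting set,
        # built from observed and rollout slot representations (extraction not shown).
        # readout_labels: (num_videos,) binary labels, 1 if the two cued objects contact.
        probe = LogisticRegression(max_iter=1000)
        probe.fit(readout_features, readout_labels)
        return probe

    def contact_accuracy(probe, test_features, test_labels):
        # Accuracy of the frozen linear probe on the Testing set.
        return probe.score(test_features, test_labels)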
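
Because the "Software Dependencies" row notes that no version numbers are given, the following sketch shows one way to load a DINO pre-trained ViT encoder through the Hugging Face transformers library while pinning a model revision for reproducibility; the checkpoint name and the revision handling are assumptions, not settings reported in the paper.

    from transformers import ViTModel

    # Hypothetical checkpoint choice; the paper only states that a DINO
    # pre-trained encoder from the Hugging Face transformers library is used.
    encoder = ViTModel.from_pretrained(
        "facebook/dino-vitb16",
        revision="main",  # pin a specific commit hash here to fix the weights
    )
    encoder.eval()  # typically kept frozen as a feature extractor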
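
The "Experiment Setup" row reports pre-training the VQ-VAE for 100 epochs with a cosine learning rate schedule decaying from 1e-3. A minimal PyTorch sketch of such a schedule follows; the optimizer choice, the per-batch (rather than per-epoch) stepping, and the final learning rate of 0 are assumptions beyond what the paper states.

    import torch

    def vqvae_optim_and_schedule(model, steps_per_epoch, epochs=100, peak_lr=1e-3):
        # Cosine decay from peak_lr over the full pre-training run.
        optimizer = torch.optim.Adam(model.parameters(), lr=peak_lr)
        scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
            optimizer, T_max=epochs * steps_per_epoch, eta_min=0.0
        )
        return optimizer, scheduler

    # In the training loop, call optimizer.step() and then scheduler.step() once per batch.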