Learning to Compose: Improving Object Centric Learning by Injecting Compositionality

Authors: Whie Jung, Jaehoon Yoo, Sungjin Ahn, Seunghoon Hong

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate our framework on four datasets and verify that our model consistently surpasses auto-encoding based baselines by a substantial margin. (3) We show that our objective enhances the robustness of object-centric learning on three major factors, such as number of latents, encoder and decoder architectures." (Section 1) "Implementation Details: We base our implementation on existing frameworks (Singh et al., 2022a; Jiang et al., 2023)." (Section 5) Cited section headings: 5 Experiment; 5.1 Unsupervised Object Segmentation; 5.2 Robustness of Compositional Objective; 5.3 Internal Analysis.
Researcher Affiliation | Academia | Whie Jung, Jaehoon Yoo, Sungjin Ahn, Seunghoon Hong; School of Computing, KAIST; {whieya, wogns98, sungjin.ahn, seunghoon.hong}@kaist.ac.kr
Pseudocode | No | The paper describes algorithms and mathematical formulations in text and equations but does not include any explicitly labeled pseudocode or algorithm blocks/figures.
Open Source Code | No | "We employ the features from the pre-trained auto-encoder to represent an image." (Section 5; footnote 2: https://huggingface.co/stabilityai/sd-vae-ft-ema-original) The provided URL links to a pre-trained model (an auto-encoder) used in the research, not to the authors' own source code for the methodology described in the paper. The paper also mentions basing its implementation on existing frameworks, but it does not provide access to the authors' own code.
Open Datasets | Yes | "CLEVRTex (Karazija et al., 2021) consists of various rigid objects with homogeneous textures. Multi Shape Net (Stelzner et al., 2021) includes more complex and realistic furniture objects. PTR (Hong et al., 2021) and Super-CLEVR (Li et al., 2023) contain objects composed of multi-colored parts and textures. All of the datasets are center-cropped and resized to 128x128 resolution images." (Section 5, Datasets) A preprocessing sketch consistent with this crop/resize step follows the table.
Dataset Splits | Yes | "All results are evaluated on held-out validation set." (Table 1 caption) "To explore the scalability of our novel objective in a complex real-world dataset, we examine our framework in BDD100k dataset Yu et al. (2020), which consists of diverse driving scenes. Since the images captured on night or rainy days often produce blurry and dark images, we filter the data to collect only sunny and daytime images using metadata, which leaves about 12k, 1.7k images in the training/validation set, respectively." (Appendix B.7) A sketch of such a metadata filter is given after the table.
Hardware Specification | No | The paper details the model architectures and training hyperparameters, but it does not specify the exact hardware components used for running the experiments, such as specific GPU models (e.g., NVIDIA A100), CPU models, or memory configurations.
Software Dependencies | Yes | "We employ the features from the pre-trained auto-encoder to represent an image." (Section 5; footnote 2: https://huggingface.co/stabilityai/sd-vae-ft-ema-original) "We employ pretrained DINOv2 Oquab et al. (2023) and Stable Diffusion Rombach et al. (2022) for the image encoder and slot decoder in our auto-encoding path, respectively." (Appendix B.7) These statements name distinct, identifiable pre-trained models that serve as concrete software dependencies for the experiments; a loading sketch is given after the table.
Experiment Setup | Yes | "Table 3 provides details of hyperparameters used in experiments." (Appendix A) Table 3 (Hyperparameters used in our experiments): General: Batch Size 64, Training Steps 200K, Learning Rate 0.0001; CNN Backbone: Input Resolution 128, Output Resolution 64; Slot Attention: Input Resolution 64, # Iterations 7, Slot Size 192; Auto-Encoder: Model KL-8, Input Resolution 128, Output Resolution 16, Output Channels 4; Diffusion Decoder: Input Resolution 16, Input Channels 4, β scheduler Linear, ...; Surrogate Decoder: Layers 8, # Heads 8, Hidden Dim 384. These values are also collected into a config sketch below.
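
The Table 3 values quoted in the Experiment Setup entry, collected into a plain Python dictionary for quick reference. The key names are ours; entries elided in the excerpt (the "...") are not reconstructed.

```python
# Hyperparameters as quoted from Table 3 of the paper; key names are ours.
# Entries truncated in the excerpt are not reconstructed here.
TABLE3_HYPERPARAMS = {
    "general": {"batch_size": 64, "training_steps": 200_000, "learning_rate": 1e-4},
    "cnn_backbone": {"input_resolution": 128, "output_resolution": 64},
    "slot_attention": {"input_resolution": 64, "num_iterations": 7, "slot_size": 192},
    "auto_encoder": {"model": "KL-8", "input_resolution": 128,
                     "output_resolution": 16, "output_channels": 4},
    "diffusion_decoder": {"input_resolution": 16, "input_channels": 4,
                          "beta_schedule": "linear"},
    "surrogate_decoder": {"layers": 8, "num_heads": 8, "hidden_dim": 384},
}
```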
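For the Open Datasets entry, a minimal preprocessing sketch consistent with the stated "center-cropped and resized to 128x128" step. The square crop on the shorter side and the use of torchvision are assumptions; the paper does not give per-dataset crop details.

```python
# Minimal sketch: square center crop on the shorter side, then resize to 128x128.
# The crop strategy is an assumption; the paper only states "center-cropped and
# resized to 128x128 resolution images".
from PIL import Image
from torchvision.transforms import functional as TF

def preprocess(img: Image.Image, resolution: int = 128):
    img = TF.center_crop(img, min(img.size))        # square crop on the shorter side
    img = TF.resize(img, [resolution, resolution])  # 128x128
    return TF.to_tensor(img)                        # float tensor in [0, 1]
```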
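For the Dataset Splits entry, a hedged sketch of filtering BDD100k to sunny, daytime images via metadata. The attribute values ("clear", "daytime") follow the public BDD100k label schema; the authors' exact filter is not specified.

```python
# Hedged sketch: select sunny (weather == "clear") daytime images from a
# BDD100k label file. The authors' exact filtering criteria are not given.
import json

def sunny_daytime_names(label_json_path: str) -> list[str]:
    with open(label_json_path) as f:
        labels = json.load(f)
    return [
        item["name"]  # image file name
        for item in labels
        if item.get("attributes", {}).get("weather") == "clear"
        and item.get("attributes", {}).get("timeofday") == "daytime"
    ]
```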
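For the Software Dependencies entry, an illustrative loading sketch for the named pretrained components. The specific checkpoints (the diffusers-compatible sd-vae-ft-ema VAE, DINOv2 ViT-B/14, and a Stable Diffusion v1 pipeline) are assumptions; the paper links the original-format VAE weights and names DINOv2 and Stable Diffusion only generically.

```python
# Illustrative loading of the stated pretrained components. Checkpoint names
# are assumptions; the paper links the original-format VAE weights
# (sd-vae-ft-ema-original) and cites DINOv2 and Stable Diffusion by family.
import torch
from diffusers import AutoencoderKL, StableDiffusionPipeline

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-ema").eval()
dino = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()
sd = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")

# A 128x128 input encodes to a 4-channel, 16x16 latent, matching Table 3.
x = torch.randn(1, 3, 128, 128)
with torch.no_grad():
    z = vae.encode(x).latent_dist.sample()
print(z.shape)  # torch.Size([1, 4, 16, 16])
```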