Learning to Compose: Improving Object Centric Learning by Injecting Compositionality
Authors: Whie Jung, Jaehoon Yoo, Sungjin Ahn, Seunghoon Hong
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "We evaluate our framework on four datasets and verify that our model consistently surpasses auto-encoding based baselines by a substantial margin. (3) We show that our objective enhances the robustness of object-centric learning on three major factors, such as number of latents, encoder and decoder architectures." (Section 1) "Implementation Details: We base our implementation on existing frameworks (Singh et al., 2022a; Jiang et al., 2023)." (Section 5, Experiment) The experiments span Sections 5.1 (Unsupervised Object Segmentation), 5.2 (Robustness of Compositional Objective), and 5.3 (Internal Analysis). |
| Researcher Affiliation | Academia | Whie Jung, Jaehoon Yoo, Sungjin Ahn, Seunghoon Hong School of Computing, KAIST {whieya, wogns98, sungjin.ahn, seunghoon.hong}@kaist.ac.kr |
| Pseudocode | No | The paper describes algorithms and mathematical formulations in text and equations but does not include any explicitly labeled pseudocode or algorithm blocks/figures. |
| Open Source Code | No | "We employ the features from the pre-trained auto-encoder to represent an image." (Section 5, footnote 2: https://huggingface.co/stabilityai/sd-vae-ft-ema-original) The provided URL links to a pre-trained model (an auto-encoder) used in the research, not the authors' own source code for the methodology described in the paper. The paper also mentions basing its implementation on existing frameworks, but does not provide access to the authors' own code. |
| Open Datasets | Yes | CLEVRTex (Karazija et al., 2021) consists of various rigid objects with homogeneous textures. Multi Shape Net (Stelzner et al., 2021) includes more complex and realistic furniture objects. PTR (Hong et al., 2021) and Super-CLEVR (Li et al., 2023) contain objects composed of multi-colored parts and textures. All of the datasets are center-cropped and resized to 128x128 resolution images. (Section 5 Datasets) |
| Dataset Splits | Yes | "All results are evaluated on held-out validation set." (Table 1 caption) "To explore the scalability of our novel objective in a complex real-world dataset, we examine our framework in BDD100k dataset Yu et al. (2020), which consists of diverse driving scenes. Since the images captured on night or rainy days often produce blurry and dark images, we filter the data to collect only sunny and daytime images using metadata, which leaves about 12k, 1.7k images in the training/validation set, respectively." (Appendix B.7) A sketch of this metadata filter appears after the table. |
| Hardware Specification | No | The paper details the model architectures and training hyperparameters, but it does not specify the exact hardware components used for running the experiments, such as specific GPU models (e.g., NVIDIA A100), CPU models, or memory configurations. |
| Software Dependencies | Yes | "We employ the features from the pre-trained auto-encoder to represent an image." (Section 5, footnote 2: https://huggingface.co/stabilityai/sd-vae-ft-ema-original) "We employ pretrained DINOv2 Oquab et al. (2023) and Stable Diffusion Rombach et al. (2022) for the image encoder and slot decoder in our auto-encoding path, respectively." (Appendix B.7) These statements identify specific pre-trained models that serve as software dependencies for the experiments; a hedged loading sketch appears after the table. |
| Experiment Setup | Yes | "Table 3 provides details of hyperparameters used in experiments." (Appendix A) Table 3 (Hyperparameters used in our experiments) reports, among others: General: batch size 64, training steps 200K, learning rate 0.0001; CNN Backbone: input resolution 128, output resolution 64; Slot Attention: input resolution 64, 7 iterations, slot size 192; Auto-Encoder: model KL-8, input resolution 128, output resolution 16, output channels 4; Diffusion Decoder: input resolution 16, input channels 4, linear β scheduler; ...; Surrogate Decoder: 8 layers, 8 heads, hidden dim 384. A transcription of these values as a config dict appears after the table. |
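
To ground the Software Dependencies row, below is a minimal loading sketch for the named pretrained components. It is a sketch under assumptions, not the authors' code: the linked Hugging Face repo ships original-format checkpoints, so the sketch loads the diffusers-format counterpart `stabilityai/sd-vae-ft-ema`, and `dinov2_vitb14` is an assumed DINOv2 variant since the paper does not state which ViT size is used.

```python
# Hedged sketch: load the pretrained components named in the paper.
# Assumptions: diffusers-format VAE repo "stabilityai/sd-vae-ft-ema"
# (the linked "-original" repo carries raw checkpoints) and the ViT-B/14
# DINOv2 variant; the exact variant is not reported.
import torch
from diffusers import AutoencoderKL

# KL-8 auto-encoder: 128x128x3 images -> 16x16x4 latents (cf. Table 3).
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-ema").eval()

# DINOv2 image encoder via torch.hub (variant is an assumption).
dino = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()

with torch.no_grad():
    img = torch.randn(1, 3, 128, 128)               # dummy image batch
    latents = vae.encode(img).latent_dist.sample()  # -> [1, 4, 16, 16]
    print(latents.shape)
```

The latent shape matches the Table 3 entries for the auto-encoder (output resolution 16, output channels 4), which is a useful sanity check that the KL-8 model is the one being loaded.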
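The BDD100k split described in the Dataset Splits row can be approximated from the dataset's per-image attribute labels. This is a sketch under assumptions: it reads the standard BDD100K label JSON, whose `attributes` field carries `weather` and `timeofday` tags, and it maps the paper's "sunny" to the `clear` weather tag; the authors' exact filtering criteria are not stated.

```python
# Hedged sketch: filter BDD100K to clear-weather, daytime images using the
# official label JSON. Assumption: the paper's "sunny" corresponds to the
# "clear" weather tag in the BDD100K metadata.
import json

def filter_sunny_daytime(label_path: str) -> list[str]:
    """Return names of images tagged as clear weather and daytime."""
    with open(label_path) as f:
        labels = json.load(f)
    return [
        item["name"]
        for item in labels
        if item.get("attributes", {}).get("weather") == "clear"
        and item.get("attributes", {}).get("timeofday") == "daytime"
    ]

# Usage (file names are illustrative):
# train_keep = filter_sunny_daytime("bdd100k_labels_images_train.json")
# val_keep = filter_sunny_daytime("bdd100k_labels_images_val.json")
```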
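Finally, the Table 3 values quoted in the Experiment Setup row are transcribed below as a plain config dict. This is a transcription of the reported numbers for readability, not the authors' configuration file; rows elided in the extraction are omitted rather than guessed.

```python
# Transcription of the reported Table 3 hyperparameters (elided rows omitted).
CONFIG = {
    "general": {"batch_size": 64, "training_steps": 200_000, "learning_rate": 1e-4},
    "cnn_backbone": {"input_resolution": 128, "output_resolution": 64},
    "slot_attention": {"input_resolution": 64, "num_iterations": 7, "slot_size": 192},
    "auto_encoder": {
        "model": "KL-8",
        "input_resolution": 128,
        "output_resolution": 16,
        "output_channels": 4,
    },
    "diffusion_decoder": {
        "input_resolution": 16,
        "input_channels": 4,
        "beta_scheduler": "linear",
    },
    "surrogate_decoder": {"layers": 8, "num_heads": 8, "hidden_dim": 384},
}
```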