Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Disentangled Representation Learning via Modular Compositional Bias
Authors: Whie Jung, Dong Hoon Lee, Seunghoon Hong
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that our method shows competitive performance in both attribute and object disentanglement, and uniquely achieves joint disentanglement of global style and objects. Code is available at https://github.com/whieya/Compositional-DRL. |
| Researcher Affiliation | Academia | Whie Jung Dong Hoon Lee Seunghoon Hong KAIST EMAIL |
| Pseudocode | Yes | Algorithm 1 Matching Technique |
| Open Source Code | No | Our code is not cleaned and prepared enough for sharing. Our dataset and codes will be released in future. |
| Open Datasets | Yes | Datasets. For attribute disentanglement, we evaluate our method on Shapes3D [22], Cars3D [36], MPI3D [13], which are standard datasets in attribute DRL. For object disentanglement, we use three multi-object datasets, including CLEVR-Easy [42], CLEVR [20], and CLEVR-Tex [42]. To evaluate joint disentanglement of attributes and objects, we introduce the CLEVR-Style dataset, a new variant of the CLEVR dataset augmented with four distinct artistic styles (see Appendix A.7). To further assess the scalability and robustness of our method on more complex datasets, we also conduct experiments on the MSN-Style, an augmentation of the MultiShapeNet (MSN) [45] dataset. |
| Dataset Splits | Yes | To construct the CLEVR-Style dataset, we first sample 25K images from the original CLEVR dataset and then augment each with three additional styles. Including the unmodified images, this produces a total of 80K/10K/10K images for the train/val/test splits, respectively. Similarly, we construct the MSN-Style dataset by first sampling 15k images from the original MSN dataset [45] and augmenting with three identical styles used in CLEVR-Style. It produces a total of 40k/10k/10k images for the train/val/test splits, respectively. |
| Hardware Specification | Yes | We conduct all our experiments on a GPU Server that consists of an Intel Xeon Gold 6230 CPU, 256GB RAM, and 8 NVIDIA RTX 3090 GPUs (with 24GB VRAM), or 8 NVIDIA RTX 6000 GPUs (with 48GB VRAM). |
| Software Dependencies | No | The paper mentions using a pre-trained diffusion model Gψ and a pre-trained VAE, but does not provide specific version numbers for these or other software libraries (e.g., Python, PyTorch versions). |
| Experiment Setup | Yes | We implement the decoder Dϕ as a latent diffusion model built on a pre-trained VAE following [21, 52]. We use a fixed batch size of 64 and a learning rate of 0.0001 across all of the experiments. We use λPrior = 1 and λCon = 0.01 for all experiments. We set the number of latents to k = 10 in attribute disentanglement and use K = 4, 11, 11, 12, 6 for CLEVR-Easy, CLEVR, CLEVR-Tex, Clevr Tex-Style, MSN-Style datasets in object disentanglement, respectively. When training the diffusion model, we use a v-prediction [40] loss to ensure reliable few-step generation. See Appendix A.6 for additional implementation details. |
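
The hyperparameters and split sizes quoted in the table can be collected into one place for anyone attempting a reproduction. The following is a minimal sketch, not the authors' code: the class name `ReportedSetup`, its field names, and the helper `total_images` are hypothetical, and only the numeric values come from the table above. It also checks that the reported split sizes match the stated construction (sampled images plus three additional styles per image).

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted in the 'Experiment Setup' row; field names are hypothetical."""
    batch_size: int = 64
    learning_rate: float = 1e-4
    lambda_prior: float = 1.0
    lambda_con: float = 0.01
    num_latents_attribute: int = 10  # k for attribute disentanglement
    # K per dataset for object disentanglement; the fourth dataset name
    # is reproduced verbatim from the table.
    num_latents_object: dict = field(default_factory=lambda: {
        "CLEVR-Easy": 4,
        "CLEVR": 11,
        "CLEVR-Tex": 11,
        "Clevr Tex-Style": 12,
        "MSN-Style": 6,
    })


def total_images(num_sampled: int, extra_styles: int) -> int:
    """Each sampled image is kept unmodified and rendered in `extra_styles` more styles."""
    return num_sampled * (1 + extra_styles)


# Train/val/test sizes as reported in the 'Dataset Splits' row.
CLEVR_STYLE_SPLITS = (80_000, 10_000, 10_000)
MSN_STYLE_SPLITS = (40_000, 10_000, 10_000)

if __name__ == "__main__":
    setup = ReportedSetup()
    print(setup)
    # 25K sampled CLEVR images and 15k sampled MSN images, three extra styles each.
    print(total_images(25_000, 3) == sum(CLEVR_STYLE_SPLITS))
    print(total_images(15_000, 3) == sum(MSN_STYLE_SPLITS))
```

Software versions are left out deliberately, since the table records that the paper does not report them.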