Graph-based Unsupervised Disentangled Representation Learning via Multimodal Large Language Models

Authors: Baao Xie, Qiuyu Chen, Yunnan Wang, Zequn Zhang, Xin Jin, Wenjun Zeng

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments demonstrate our method's superior performance in disentanglement and reconstruction. Furthermore, the model inherits enhanced interpretability and generalizability from MLLMs.
Researcher Affiliation | Academia | Baao Xie (1), Qiuyu Chen (1,2), Yunnan Wang (1,2), Zequn Zhang (1,3), Xin Jin (1,*), Wenjun Zeng (1); (1) Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo, China; (2) Shanghai Jiao Tong University, Shanghai, China; (3) University of Science and Technology of China, Hefei, China
Pseudocode | Yes | Figure 5: Overall training algorithm of GEM.
Open Source Code | No | "Code is available at here." (in footnote) and "The codes will be released in the camera-ready version, with detailed instructions for user to reproduce." (in checklist).
Open Datasets | Yes | We evaluate the GEM on two datasets: 1) CelebA [75] contains over 200,000 high-quality facial images. ... 2) LSUN [76] consists of about one million images across various object categories such as cars, buildings, animals, etc.
Dataset Splits | Yes | Commonly, we employ the entirety of the CelebA dataset, which includes 162,770 images for training, 19,867 for validation, and 19,962 for testing.
Hardware Specification | Yes | All the experiments are processed using the Adam optimizer with a learning rate of 1e-4, and conducted on the Nvidia Tesla A100 GPUs, with a batch size of 32.
Software Dependencies | No | The paper mentions 'PyTorch [77]' and 'GPT-4o [62]' but does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | For every experiment, the latent dimension size is set to 6. Concentrating on the disentanglement capacity of the framework, all experimental images are resized to a resolution of 64 × 64 to minimize computational resources. ... All the experiments are processed using the Adam optimizer with a learning rate of 1e-4, and conducted on the Nvidia Tesla A100 GPUs, with a batch size of 32. ... λ_adv, λ_dis and λ_gem serve as hyperparameters to balance the disentanglement capability and reconstruction quality, with default values set to 0.8, 0.6 and 0.6, respectively. (A configuration sketch based on these reported values follows the table.)
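
For orientation, here is a minimal sketch of the training configuration the table quotes: CelebA at 64 × 64, a 6-dimensional latent space, the Adam optimizer with learning rate 1e-4, batch size 32, and loss weights λ_adv = 0.8, λ_dis = 0.6, λ_gem = 0.6. The paper's actual GEM architecture and its adversarial, disentanglement, and GEM loss terms are not described in this excerpt, so `PlaceholderGEM`, the MSE reconstruction term, and the zeroed loss placeholders below are assumptions for illustration only, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Placeholder model: NOT the paper's GEM architecture, just a small
# autoencoder stand-in so the reported configuration runs end to end.
class PlaceholderGEM(nn.Module):
    def __init__(self, latent_dim=6):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),   # 64x64 -> 32x32
            nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),  # 32x32 -> 16x16
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (64, 16, 16)),
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),  # 16x16 -> 32x32
            nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Sigmoid(),  # 32x32 -> 64x64
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

# Images resized to 64x64, per the paper's experiment setup.
transform = transforms.Compose([transforms.Resize((64, 64)), transforms.ToTensor()])

# torchvision's CelebA uses the standard official partition, which matches the
# split quoted above: 162,770 train / 19,867 validation / 19,962 test images.
train_set = datasets.CelebA(root="data", split="train", transform=transform, download=True)
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)  # batch size 32

device = "cuda" if torch.cuda.is_available() else "cpu"
model = PlaceholderGEM(latent_dim=6).to(device)            # latent dimension 6
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # Adam, learning rate 1e-4

# Default loss weights reported in the paper; the adversarial, disentanglement,
# and GEM loss terms are not specified in this excerpt, so zeros stand in here.
lambda_adv, lambda_dis, lambda_gem = 0.8, 0.6, 0.6

for images, _ in train_loader:
    images = images.to(device)
    recon, z = model(images)
    l_adv = l_dis = l_gem = torch.zeros((), device=device)  # placeholder loss terms
    loss = (F.mse_loss(recon, images)      # assumed reconstruction term
            + lambda_adv * l_adv + lambda_dis * l_dis + lambda_gem * l_gem)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The only values taken from the paper are the dataset, resolution, split sizes, latent dimension, optimizer, learning rate, batch size, and the three λ weights; everything else in the sketch is a stand-in.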