MIDGArD: Modular Interpretable Diffusion over Graphs for Articulated Designs

Authors: Quentin Leboutet, Nina Wiedemann, Zhipeng Cai, Michael Paulitsch, Kai Yuan

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show the superiority of MIDGArD on the quality, consistency, and interpretability of the generated assets. Importantly, the generated models are fully simulatable, i.e., can be seamlessly integrated into standard physics engines such as MuJoCo, broadening MIDGArD’s applicability to fields such as digital content creation, meta realities, and robotics.
Researcher Affiliation | Industry | Quentin Leboutet, Nina Wiedemann, Zhipeng Cai, Michael Paulitsch, Kai Yuan; Intel Labs, XRL (eXtended Reality Laboratory); {firstname.lastname}@intel.com
Pseudocode | No | The paper includes figures illustrating pipelines, but no structured pseudocode or algorithm blocks.
Open Source Code | No | Code and models are available at https://quentin-leboutet.github.io/MIDGArD. Open-source code will be provided upon acceptance.
Open Datasets | Yes | Dataset: All experiments were conducted using the PartNet-Mobility dataset [97], which contains a diverse set of articulated 3D objects with detailed geometric and kinematic annotations.
Dataset Splits | No | The paper mentions a 'train-test split' but does not explicitly provide details of a separate validation split, nor the split percentages or counts.
Hardware Specification | Yes | We trained the structure generator and the image VQ-VAE on an NVIDIA RTX 3090 GPU, while the shape generator was trained on an NVIDIA RTX 6000 GPU. Evaluation took place on a single NVIDIA RTX 3090 GPU.
Software Dependencies | No | The paper mentions various software components (e.g., SDFusion, BERT, ResNet-18, VQ-VAE, MuJoCo) and Python as the programming language, but does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | The denoising model used in the structure generator contains six graph attention blocks, with a latent embedding size of 512 and 32 attention heads. We set the maximum number of nodes in the graph to N = 8. Our training parameters closely follow those in NAP, with the key difference being the use of an implicit denoising diffusion pipeline [80] over 100 time steps, as opposed to a DDPM with 1,000 time steps. Our shape generator is adapted from SDFusion [8] and trained on the PartNet-Mobility dataset. We used the same hyperparameters as the multimodal model in SDFusion and utilized their pre-trained VQ-VAE checkpoint. We excluded 10 categories from training due to their objects containing numerous equally-shaped parts (e.g., keyboards with over 30 keys). For encoding the part dimensions, we use an MLP with three hidden layers (of size 16, 64, and 256, respectively).
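
To make the quoted hyperparameters concrete, here is a minimal sketch, assuming PyTorch, of how the structure-generator settings and the part-dimension MLP encoder could be declared. The constant names, the class PartDimensionEncoder, and its input/output dimensions are hypothetical illustrations rather than the authors' (unreleased) implementation; the paper only specifies the hidden-layer sizes 16, 64, and 256 and the configuration values recorded in the constants.

```python
import torch
import torch.nn as nn

# Structure-generator hyperparameters as quoted in the paper.
NUM_ATTENTION_BLOCKS = 6   # six graph attention blocks in the denoising model
EMBED_DIM = 512            # latent embedding size
NUM_HEADS = 32             # attention heads per block
MAX_NODES = 8              # maximum number of graph nodes N
DDIM_STEPS = 100           # implicit (DDIM-style) diffusion over 100 steps, vs. 1,000 DDPM steps


class PartDimensionEncoder(nn.Module):
    """Hypothetical MLP encoding per-part dimensions with three hidden layers (16, 64, 256).

    The input size (3, e.g. a bounding-box extent) and the output size (EMBED_DIM)
    are assumptions; only the hidden-layer sizes come from the paper.
    """

    def __init__(self, in_dim: int = 3, out_dim: int = EMBED_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 16), nn.ReLU(),
            nn.Linear(16, 64), nn.ReLU(),
            nn.Linear(64, 256), nn.ReLU(),
            nn.Linear(256, out_dim),
        )

    def forward(self, dims: torch.Tensor) -> torch.Tensor:
        return self.net(dims)


# Usage example: encode dummy dimensions for a graph with MAX_NODES parts.
encoder = PartDimensionEncoder()
part_dims = torch.rand(MAX_NODES, 3)   # (N, 3) placeholder part dimensions
embeddings = encoder(part_dims)        # (N, EMBED_DIM)
print(embeddings.shape)                # torch.Size([8, 512])
```

The graph attention blocks and the DDIM sampler themselves are not reproduced here; the constants above only record the configuration quoted in the table.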