Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
AutoPartGen: Autoregressive 3D Part Generation and Discovery
Authors: Minghao Chen, Jianyuan Wang, Roman Shapovalov, Tom Monnier, Hyunyoung Jung, Dilin Wang, Rakesh Ranjan, Iro Laina, Andrea Vedaldi
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate both the overall 3D generation capabilities and the part-level generation quality of AutoPartGen, demonstrating that it achieves state-of-the-art performance in 3D part generation. ... 4 Experiments We first give the implementation details of AutoPartGen, including network architectures, training procedures, and datasets. We then demonstrate its performance under various conditions, highlighting its versatility for different applications. Next, we compare our approach with state-of-the-art 3D completion methods and provide ablation studies to analyze key design choices. |
| Researcher Affiliation | Collaboration | ¹Visual Geometry Group, University of Oxford; ²Meta AI |
| Pseudocode | No | The paper describes the model and its components in sections 3.1, 3.2, and 3.3, detailing the latent 3D shape space, latent 3D diffusion, and autoregressive 3D part generation, but does not present the steps in a formal pseudocode or algorithm block. |
| Open Source Code | No | Answer: [No] Justification: We will strive to release the code and model weights as open source. However, our data license does not allow us to release the data for training the model. Other authors should be able to obtain a similar dataset using resources like Objaverse. |
| Open Datasets | Yes | Object, Image and Masks to 3D Parts Generation ... meshes sourced from Google Scanned Objects [11]; and (3) masks-to-parts generation with user-provided 2D part masks, where the masks are taken from PartObjaverse-Tiny [61]. |
| Dataset Splits | No | The paper describes using "approximately 1.7M 3D assets" for training the VAE and pretraining the diffusion model, and "PartObjaverse-Tiny [61]" for evaluation. However, it does not provide numerical splits (e.g., percentages or counts) for training, validation, and testing for these datasets in the main text or supplementary materials, nor does it specify how the "part dataset" used for fine-tuning is split. |
| Hardware Specification | Yes | Training. We use the AdamW optimizer with a learning rate of 1e-4 and train the model for 500K iterations on 256 NVIDIA H100 GPUs. Training the full model takes approximately 4 days. |
| Software Dependencies | No | The paper mentions several components like "AdamW optimizer", "DINOv2 [40]", "DiT [41]", and "DDIM scheduler [46]", but it does not specify version numbers for programming languages or specific software libraries (e.g., Python, PyTorch, CUDA). |
| Experiment Setup | Yes | 4.1 Implementation Details Architecture. Our architecture builds upon the 3DShape2VecSet [65] framework, with some modifications. Specifically, we increase the input points of the VAE encoder to 32K, and utilize both point coordinates and normals as input features to better capture fine-grained geometric details. The diffusion model is implemented as a DiT [41] with a width of 2048 and 24 layers. For image-conditioned generation, we use DINOv2 [40] to encode the input image I and part-masked images J^(k) = I ⊙ M^(k) independently. The resulting features are concatenated along the channel dimension and passed through a small MLP to match the diffusion transformer input. We provide more details in the supplementary material. Training. We use the AdamW optimizer with a learning rate of 1e-4 and train the model for 500K iterations on 256 NVIDIA H100 GPUs. Training the full model takes approximately 4 days. More details on hyperparameters and data preprocessing are provided in the supplementary material. During training, we randomly drop the image condition, the geometry condition, or both with probabilities of 0.05 each. For CFG, we use w_img = 7 and w_geom = 4 as the default setting. ... A.1 Training: Our VAE architecture comprises an 8-layer encoder with a dimension of 768 and a 16-layer decoder with a dimension of 1024. ... supervise the VAE using a combination of surface normal loss, Eikonal loss, and KL divergence regularization, weighted by 10, 0.1, and 0.001, respectively... We randomly vary the number of input tokens between {512, 2048} during training. The model is optimized using AdamW with a learning rate of 1e-4, linearly warmed up from 1e-5 over the first 3 epochs. We use a batch size of 1536 and set the weight decay to 0.01. Training is conducted on 128 NVIDIA H100 GPUs for 150 epochs. ... The diffusion backbone follows DiT [41], with 24 transformer layers and hidden dimension 2048. We train with a fixed token length of 512 for 300 epochs, learning rate 1e-4, and batch size 10 per GPU on 128 GPUs. We then fine-tune the model to additionally condition on masked image and geometry tokens on the part dataset in an autoregressive manner for approximately 300k steps. ... Fine-tuning uses AdamW with weight decay 0.01 and batch size 6 per GPU on 128 GPUs. Subsequently, we increase the token length to 2048 and continue training for an additional 100k steps on 256 GPUs with batch size 1 per GPU. We adopt the DDIM scheduler [46] with 1000 steps, use v-prediction, and a zero signal-to-noise ratio [33]. |
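
The Experiment Setup row quotes several numeric settings that are easier to check when written out. The sketch below is illustrative only and is not the authors' code: the function names are hypothetical, the global batch sizes assume no gradient accumulation, the warmup is the one quoted for VAE training, and the condition-dropout sampler assumes the three drop events (image only, geometry only, both) are mutually exclusive with probability 0.05 each, which is one possible reading of the quoted sentence.

```python
import random


def global_batch_size(per_gpu: int, num_gpus: int) -> int:
    """Global batch size, assuming no gradient accumulation (an assumption)."""
    return per_gpu * num_gpus


def warmup_lr(epoch: float, start_lr: float = 1e-5, peak_lr: float = 1e-4,
              warmup_epochs: float = 3.0) -> float:
    """Linear warmup from start_lr to peak_lr over warmup_epochs, then constant.

    Mirrors the quoted VAE setting (1e-5 -> 1e-4 over the first 3 epochs).
    What happens after warmup is not stated in the excerpt; holding the
    learning rate constant is an assumption.
    """
    if epoch >= warmup_epochs:
        return peak_lr
    frac = max(epoch, 0.0) / warmup_epochs
    return start_lr + frac * (peak_lr - start_lr)


def sample_condition_mask(rng: random.Random, p: float = 0.05) -> tuple:
    """Return (keep_image, keep_geometry) for one training sample.

    Assumed reading: drop image only, geometry only, or both, each with
    probability p, and keep both conditions otherwise (probability 1 - 3p).
    """
    u = rng.random()
    if u < p:
        return (False, True)    # drop the image condition only
    if u < 2 * p:
        return (True, False)    # drop the geometry condition only
    if u < 3 * p:
        return (False, False)   # drop both conditions
    return (True, True)         # keep both conditions


if __name__ == "__main__":
    # Per-GPU batch sizes and GPU counts as quoted in the table.
    print("diffusion pretraining:", global_batch_size(10, 128))
    print("part fine-tuning:", global_batch_size(6, 128))
    print("2048-token stage:", global_batch_size(1, 256))
    print("lr at epoch 1.5:", warmup_lr(1.5))
    print("example mask:", sample_condition_mask(random.Random(0)))
```

Under these assumptions, the three diffusion stages would use global batches of 1280, 768, and 256 samples respectively, and both conditions are kept for about 85% of training samples.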