Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

ShapeCraft: LLM Agents for Structured, Textured and Interactive 3D Modeling

Authors: Shuyuan Zhang, ChenHan Jiang, Zuoou Li, Jiankang Deng

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Qualitative and quantitative experiments demonstrate ShapeCraft's superior performance in generating geometrically accurate and semantically rich 3D assets compared to existing LLM-based agents. We further show the versatility of ShapeCraft through examples of animated and user-customized editing, highlighting its potential for broader interactive applications.
Researcher Affiliation | Academia | (1) Imperial College London; (2) Hong Kong University of Science and Technology
Pseudocode | Yes | Algorithm 1: Iterative Shape Modeling with Multi-path Sampling
Open Source Code | No | Answer: [No] Justification: will be released after acceptance.
Open Datasets | Yes | We benchmark on 26 long-form functional prompts from MARVEL-40M+ [52], itself derived from Objaverse [12].
Dataset Splits | Yes | All evaluations are performed on the exported meshes. We benchmark on 26 long-form functional prompts from MARVEL-40M+ [52], itself derived from Objaverse [12].
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or cloud providers used for running experiments.
Software Dependencies | Yes | We employ the same Qwen3-235B-A22B with thinking disabled as Parser and Coder agents... And Qwen-VL-Max as the Evaluator agent.
Experiment Setup | Yes | For shape modeling, we set the number of path M = 3 and the iterative update step T = 3 for each node. More experiment settings can be found in Appendix Section B. ... we set a uniform sampling temperature of 0.5 across all LLM and VLM queries, allowing up to three retries in terms of network failure; the visual evaluation score is ranged from 0 to 10 and an early-stopping threshold of 9 is applied; we allow up to one update of the GPS representation G during representation bootstrapping, effectively setting N = 1.
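
The settings quoted under Experiment Setup can be collected into a small configuration, sketched below in Python together with one possible reading of how they interact. This is an illustration under stated assumptions, not the paper's Algorithm 1: the names `SetupConfig`, `call_with_retries`, `model_node`, `propose`, and `evaluate` are hypothetical, network failure is modeled as `ConnectionError`, and the sketch assumes that reaching the early-stopping threshold ends the search for the node. The paper's code had not been released at the time of this summary, so the real control flow may differ.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass(frozen=True)
class SetupConfig:
    """Hyperparameter values quoted in the Experiment Setup row."""
    num_paths: int = 3             # M: sampled paths per node
    num_steps: int = 3             # T: iterative update steps per path
    temperature: float = 0.5       # uniform across LLM and VLM queries
    max_retries: int = 3           # retries allowed on network failure
    early_stop_score: float = 9.0  # threshold on the 0-10 visual score
    max_gps_updates: int = 1       # N: recorded only, unused in this sketch


def call_with_retries(fn: Callable[[], Any], max_retries: int) -> Any:
    """Call fn once, then retry up to max_retries times on ConnectionError."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except ConnectionError:
            if attempt == max_retries:
                raise


def model_node(
    propose: Callable[[Any, float], Any],
    evaluate: Callable[[Any], float],
    cfg: SetupConfig = SetupConfig(),
) -> Tuple[Any, float]:
    """Hypothetical per-node search: M paths, up to T updates each.

    propose(previous_candidate, temperature) returns a new candidate;
    evaluate(candidate) returns a score in [0, 10]. Both are assumed
    stand-ins for the Coder and Evaluator agents.
    """
    best_candidate, best_score = None, float("-inf")
    for _ in range(cfg.num_paths):
        candidate = None  # each path starts from scratch (assumption)
        for _ in range(cfg.num_steps):
            candidate = call_with_retries(
                lambda: propose(candidate, cfg.temperature), cfg.max_retries
            )
            score = call_with_retries(
                lambda: evaluate(candidate), cfg.max_retries
            )
            if score > best_score:
                best_candidate, best_score = candidate, score
            if score >= cfg.early_stop_score:
                # Assumed behavior: a good-enough result ends the search.
                return best_candidate, best_score
    return best_candidate, best_score
```

With these defaults, a node costs at most M × T = 9 propose/evaluate rounds, and fewer whenever a candidate scores 9 or higher.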