Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

SceneWeaver: All-in-One 3D Scene Synthesis with an Extensible and Self-Reflective Agent

Authors: Yandan Yang, Baoxiong Jia, Shujie Zhang, Siyuan Huang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In our experiments, we aim to answer the following questions: Q1: How does SCENEWEAVER perform compared to existing data-driven and open-vocabulary scene synthesis methods? Q2: How does the reflective agentic framework behave during the iterative scene refinement? Q3: How effective is each module in SCENEWEAVER, and how critical are they to overall performance? Settings: We quantitatively evaluate SCENEWEAVER against existing methods under two primary settings: 1) common room types, where large-scale human-designed datasets support direct data-driven learning, and 2) open-vocabulary scene generation, following [8], which evaluates generation across diverse room type descriptions.
Researcher Affiliation	Academia	1 State Key Laboratory of General Artificial Intelligence, BIGAI; 2 Tsinghua University
Pseudocode	No	The paper only describes processes and tool metadata, but does not contain a dedicated section or figure explicitly labeled 'Pseudocode' or 'Algorithm', nor structured code-like blocks.
Open Source Code	No	We will open-source the code and dataset upon acceptance.
Open Datasets	Yes	Data-driven generative models [15, 2, 3], trained on datasets like 3D-FRONT [16], learns realistic but coarse scene layouts, constrained by the limited variety and level of detail of scenes in the dataset.
Dataset Splits	No	In the common setting, models are evaluated based on the average score over 10 scenes each for the living room and bedroom categories. In the open-vocabulary setting, evaluation is based on the average score over 3 scenes for each of 8 room types, using the prompt "Design me a <room_type>" as the user query.
Hardware Specification	Yes	All reported experiments are conducted on a machine equipped with an NVIDIA GeForce RTX 4090 GPU.
Software Dependencies	Yes	We use Blender 3.6 to record and render the scene.
Experiment Setup	Yes	For all settings, we set the maximum number of iterations in SCENEWEAVER to 10. The memory length is set to 1 to avoid hallucination. The maximum number of steps is set to 10. However, the procedure may terminate earlier if the intermediate results already meet the user's requirements with a high score. The reflection module determines whether to continue optimizing or stop.
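
The Experiment Setup row describes a bounded refinement loop: at most 10 steps, a memory length of 1, and early termination when a reflection step judges the result good enough. The sketch below is one possible reading of that description, not the authors' implementation. The function `refine`, the callables `step_fn` and `score_fn`, and the value of `SCORE_THRESHOLD` are hypothetical; the quoted text gives only the step limit and the memory length.

```python
from collections import deque
from typing import Callable, List, Tuple, TypeVar

Scene = TypeVar("Scene")

MAX_STEPS = 10          # stated in the Experiment Setup row
MEMORY_LENGTH = 1       # stated in the Experiment Setup row
SCORE_THRESHOLD = 0.9   # hypothetical: the quoted text only says "a high score"


def refine(
    initial_scene: Scene,
    step_fn: Callable[[Scene, List[Scene]], Scene],
    score_fn: Callable[[Scene], float],
    max_steps: int = MAX_STEPS,
    memory_length: int = MEMORY_LENGTH,
    threshold: float = SCORE_THRESHOLD,
) -> Tuple[Scene, int]:
    """Refine a scene until it scores at least `threshold` or the step budget runs out.

    `step_fn` and `score_fn` are placeholders for the planner/tool call and the
    reflection module; their real interfaces are not given in the quoted text.
    """
    scene = initial_scene
    # A bounded deque keeps only the most recent `memory_length` results.
    memory: deque = deque(maxlen=memory_length)
    steps_taken = 0
    for _ in range(max_steps):
        # Reflection check: stop early once the current scene is good enough.
        if score_fn(scene) >= threshold:
            break
        scene = step_fn(scene, list(memory))
        memory.append(scene)
        steps_taken += 1
    return scene, steps_taken
```

With toy stand-ins (an integer as the "scene", a step that adds one, and a score of `scene / 5`), the loop stops after five steps; with a score that never reaches the threshold, it runs for the full ten.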