Architect: Generating Vivid and Interactive 3D Scenes with Hierarchical 2D Inpainting

Authors: Yian Wang, Xiaowen Qiu, Jiageng Liu, Zhehuan Chen, Jiting Cai, Yufei Wang, Tsun-Hsuan Johnson Wang, Zhou Xian, Chuang Gan

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate that ARCHITECT outperforms existing methods in producing realistic and complex environments, making it highly suitable for Embodied AI and robotics applications.
Researcher Affiliation | Academia | Yian Wang (UMass Amherst, yianwang@umass.edu); Xiaowen Qiu (UMass Amherst, xiaowenqiu@umass.edu); Jiageng Liu (UMass Amherst, jiagengliu@umass.edu); Zhehuan Chen (UMass Amherst, zhehuanchen@umass.edu); Jiting Cai (Shanghai Jiao Tong University, caijiting@sjtu.edu.cn); Yufei Wang (Carnegie Mellon University, yufeiw2@andrew.cmu.edu); Tsun-Hsuan Wang (MIT, tsunw@mit.edu); Zhou Xian (Carnegie Mellon University, zhouxian@cmu.edu); Chuang Gan (UMass Amherst, chuanggan@umass.edu)
Pseudocode | No | The paper provides a high-level pipeline diagram (Figure 2) and describes procedural steps in text, but it does not include a clearly labeled 'Pseudocode' or 'Algorithm' block, nor does it present structured steps formatted like code.
Open Source Code | Yes | Our code will be made publicly available.
Open Datasets | Yes | We retrieve objects from Objaverse [Deitke et al., 2023b,a]. Objaverse is a dataset containing a massive collection of annotated 3D objects, including manually designed objects, everyday items, historical and antique items, etc. In the process of generating indoor scene objects, we retrieve suitable furniture from the Objaverse dataset and place it in the scene. We also retrieve articulated objects from PartNet-Mobility [Xiang et al., 2020]. (A minimal retrieval sketch is given after the table.)
Dataset Splits | No | The paper mentions the datasets used (Objaverse, PartNet-Mobility) but does not specify how these datasets were split into training, validation, or test sets with percentages, sample counts, or references to predefined standard splits.
Hardware Specification | Yes | All experiments, including qualitative evaluation, quantitative evaluation, and robotics tasks, are conducted on an A100 GPU.
Software Dependencies | Yes | We use Marigold [Ke et al., 2024] as the depth estimation model. We use Grounded Segment-Anything as our segmentation model. We use the SD-XL [Podell et al., 2023] inpainting model provided by diffusers as the image inpainting diffusion model. We use Luisa Render [Zheng et al., 2022] as our renderer. For text-to-3D generation, we first use MVDream [Shi et al., 2023] to generate an image and then feed the image to InstantMesh [Xu et al., 2024] to generate the 3D asset. (A minimal inpainting-pipeline sketch is given after the table.)
Experiment Setup | Yes | Additionally, we use an 84-degree FOV for our camera during rendering, a standard parameter for real-world cameras. Consequently, for a square room, this setup results in approximately 95 percent of the room being visible from a single corner-to-corner view. For small object placement, we first ask LLMs to determine which objects can accommodate small objects on or in them, and then inpaint each of them with heuristic relative views. For objects like tables or desks on which we are placing items, we use a top-down view. For shelves or cabinets in which we are placing objects, we use a front view. The distance of the camera from the object is adjusted according to the scale of the object and the camera’s FOV, ensuring the full object is visible during inpainting. The prompt above is fed into GPT-4V along with an image generated by a 2D inpainting model. (A camera-distance sketch is given after the table.)
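
The retrieval sketch referenced in the Open Datasets row: the paper states that furniture is retrieved from Objaverse, but the excerpt shows no retrieval code, so this is a minimal sketch using the public `objaverse` Python package; the keyword filter and asset count are illustrative assumptions, not the paper's actual retrieval logic.

```python
# Minimal sketch (assumption): keyword-based retrieval of Objaverse assets
# with the public `objaverse` package (pip install objaverse).
import objaverse

# Load metadata for Objaverse assets (uid -> annotation dict).
annotations = objaverse.load_annotations()

# Illustrative filter: keep a handful of assets whose name mentions "sofa".
sofa_uids = [
    uid for uid, ann in annotations.items()
    if "sofa" in (ann.get("name") or "").lower()
][:5]

# Download the selected assets; returns a dict mapping uid -> local .glb path.
local_paths = objaverse.load_objects(uids=sofa_uids)
print(local_paths)
```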
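
The inpainting sketch referenced in the Software Dependencies row: the paper uses the SD-XL inpainting model provided by diffusers, and the snippet below is a minimal sketch of how such a pipeline is typically loaded and called; the checkpoint id, prompt, image paths, and sampler settings are assumptions rather than values taken from the paper.

```python
# Minimal sketch (assumptions noted): SD-XL inpainting through diffusers.
import torch
from diffusers import AutoPipelineForInpainting
from diffusers.utils import load_image

pipe = AutoPipelineForInpainting.from_pretrained(
    "diffusers/stable-diffusion-xl-1.0-inpainting-0.1",  # assumed checkpoint id
    torch_dtype=torch.float16,
).to("cuda")

image = load_image("room_view.png")         # current rendered view (placeholder path)
mask = load_image("empty_region_mask.png")  # region to fill with new content (placeholder path)

result = pipe(
    prompt="a cozy living room with a sofa and a coffee table",  # illustrative prompt
    image=image,
    mask_image=mask,
    num_inference_steps=30,
    strength=0.99,
).images[0]
result.save("inpainted_view.png")
```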
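
The camera-distance sketch referenced in the Experiment Setup row: the paper says the camera distance is adjusted according to the object's scale and the camera's FOV so the full object stays visible during inpainting, and the snippet below works through one simple way to compute such a distance; the 10% margin and the function name are assumptions.

```python
# Minimal sketch: distance at which an object of a given extent fits inside an
# 84-degree field of view. The 10% safety margin is an assumed heuristic.
import math

def camera_distance(object_extent_m: float, fov_deg: float = 84.0, margin: float = 1.1) -> float:
    """Return the camera-to-object distance so `object_extent_m` spans the FOV."""
    half_fov = math.radians(fov_deg) / 2.0
    return margin * (object_extent_m / 2.0) / math.tan(half_fov)

# Example: a 1.6 m wide tabletop viewed top-down with the paper's 84-degree FOV.
print(f"{camera_distance(1.6):.2f} m")  # -> about 0.98 m
```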