Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

MetaFind: Scene-Aware 3D Asset Retrieval for Coherent Metaverse Scene Generation

Authors: Zhenyu Pan, Yucheng Lu, Han Liu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical evaluations demonstrate the improved spatial and stylistic consistency of MetaFind in various retrieval tasks compared to baseline methods. We conduct comprehensive experiments to evaluate MetaFind across multiple dimensions, including object-level retrieval, scene-level layout-aware retrieval, and robustness under varying design choices. We then present quantitative results on the Objaverse-LVIS dataset to assess retrieval performance under different modality combinations. Next, we evaluate scene-level quality on the ProcTHOR dataset, highlighting the benefits of layout-aware retrieval using our ESSGNN context encoder. We further perform extensive ablation studies to analyze the contribution of core architectural components and training strategies.
Researcher Affiliation | Academia | Zhenyu Pan, Northwestern University, EMAIL; Yucheng Lu, New York University, EMAIL; Han Liu, Northwestern University, EMAIL
Pseudocode | Yes | Algorithm 1 Iterative Layout-Aware Scene Composition
Open Source Code | No | Answer: [No] Justification: We have already released a simplified version of the framework. For the final version, we plan to build a startup based on it; therefore, we do not intend to provide open access at this time.
Open Datasets | Yes | For object-level representation learning, we utilize the Objaverse-LVIS dataset, which comprises approximately 48,000 distinct 3D assets. For scene-level data, we leverage the ProcTHOR, which includes over 10,000 generated houses constructed from a curated collection of more than 3,000 unique assets.
Dataset Splits | Yes | In both datasets, we allocate 80% of the data for training and reserve the remaining 20% for testing.
Hardware Specification | No | We gratefully acknowledge support from the NVIDIA Academic Grant (Interactive Spatial Reasoning and 3D Scene Generation with RL-Enhanced VLMs) and the provision of cloud computing resources, which enabled systematic training and evaluation of our MetaFind and other baselines.
Software Dependencies | No | MetaFind builds upon ULIP2 [30], a tri-modal learning framework that aligns text, image, and point cloud into a shared embedding space. For layout-level reasoning, we introduce the Equivariant Spatial-Semantic Graph Neural Network (ESSGNN), an EGNN-based encoder... These sentences are then encoded into dense vectors using a frozen text encoder (e.g., CLIP or BERT).
Experiment Setup | Yes | In the first stage, both query and gallery encoders are trained on large-scale object-level data from Objaverse-LVIS, where each asset has full modality inputs (text, images, and point clouds). We introduce stochastic modality masking to simulate partial-modality queries: each modality in the query has a 30% probability of being independently masked... The temperature is 0.5 for all experiments. The gallery encoder is trained to be modality-complete, and both towers share the contrastive retrieval objective: $L_{\text{pre}} = -\log \frac{\exp(\mathrm{sim}(f_{\text{query}}(Q), f_{\text{gallery}}(A))/\tau)}{\sum_{A' \in B} \exp(\mathrm{sim}(f_{\text{query}}(Q), f_{\text{gallery}}(A'))/\tau)}$ (5), where $\tau$ is a temperature hyperparameter and $B$ denotes the gallery batch. In the second training stage, we enhance the query encoder with spatial context derived from the current scene layout... This residual design allows layout reasoning to enhance retrieval without disrupting the original embedding space. To ensure robustness in real-world settings where scene layouts may not always be available, we introduce stochastic scene dropout during training: the layout vector $e_{\text{layout}}$ is omitted in 30% of batches... We adopt a bidirectional contrastive learning objective to symmetrically align query and gallery embeddings.
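
The training recipe quoted in the Experiment Setup row can be illustrated with a small NumPy sketch. It is not the authors' code, which is not publicly available. Only the 30% masking probability, the 30% scene dropout rate, the temperature of 0.5, and the general shape of Eq. (5) come from the quoted text. Everything else is an assumption made for illustration: the function names, masking by zeroing a modality's features, cosine similarity as sim, adding the layout vector directly to the query embedding as the residual, and averaging the two directions of the bidirectional loss.

```python
import numpy as np


def mask_modalities(features, p_mask=0.3, rng=None):
    """Independently mask each modality of each query with probability p_mask.

    features maps a modality name to a (batch, dim) array. Masking is done by
    zeroing the row (an assumption; the excerpt does not say how a masked
    modality is represented). A query may lose all modalities in this sketch.
    """
    if rng is None:
        rng = np.random.default_rng()
    masked = {}
    for name, x in features.items():
        keep = rng.random(x.shape[0]) >= p_mask  # True means the modality is kept
        masked[name] = x * keep[:, None]
    return masked


def add_layout_context(query_emb, layout_emb, p_drop=0.3, rng=None):
    """Residually add the layout vector, omitting it for a fraction of batches.

    Plain addition is an assumed form of the residual design.
    """
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < p_drop:  # scene dropout: skip layout for this batch
        return query_emb
    return query_emb + layout_emb


def info_nce(query_emb, gallery_emb, tau=0.5):
    """Mean of -log softmax over the gallery batch, matched pairs on the diagonal."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    logits = (q @ g.T) / tau  # cosine similarity scaled by temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


def bidirectional_info_nce(query_emb, gallery_emb, tau=0.5):
    """Average of query-to-gallery and gallery-to-query losses (assumed weighting)."""
    return 0.5 * (info_nce(query_emb, gallery_emb, tau)
                  + info_nce(gallery_emb, query_emb, tau))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = {m: rng.normal(size=(8, 16)) for m in ("text", "image", "point_cloud")}
    masked = mask_modalities(feats, rng=rng)
    query = sum(masked.values())      # toy fusion by summation (hypothetical)
    gallery = sum(feats.values())     # gallery side sees all modalities
    query = add_layout_context(query, rng.normal(size=(8, 16)), rng=rng)
    print(bidirectional_info_nce(query, gallery))
```

The loss is lowest when each query is most similar to its own gallery item and dissimilar to the rest of the batch, which is the behavior the contrastive retrieval objective is meant to encourage.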