Multiview Scene Graph
Authors: Juexiao Zhang, Gao Zhu, Sihang Li, Xinhao Liu, Haorui Song, Xinran Tang, Chen Feng
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate any method tackling this task, we developed an MSG dataset based on a public 3D dataset. We also propose an evaluation metric based on the intersection-over-union score of MSG edges. Moreover, we develop a novel baseline method built on mainstream pretrained vision models, combining visual place recognition and object association into one Transformer decoder architecture. Experiments demonstrate that our method has superior performance compared to existing relevant baselines. |
| Researcher Affiliation | Academia | Juexiao Zhang Gao Zhu Sihang Li Xinhao Liu Haorui Song Xinran Tang Chen Feng New York University {juexiao.zhang, cfeng}@nyu.edu |
| Pseudocode | No | The paper provides a diagram of the AoMSG model in Figure 2, but it does not include formal pseudocode or an algorithm block describing the steps of the method. |
| Open Source Code | Yes | All codes and resources are open-source at https://ai4ce.github.io/MSG/. |
| Open Datasets | Yes | To facilitate the research of MSG, we curated a dataset from a publicly available 3D scene-level dataset ARKit Scenes [8] and designed a set of evaluation metrics based on the intersection-over-union of the graph adjacency matrix. ... [8] Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, and Elad Shulman. ARKitScenes: a diverse real-world dataset for 3D indoor scene understanding using mobile RGB-D data. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021. URL https://openreview.net/forum?id=tjZjv_qh_CE. |
| Dataset Splits | No | The paper states: "4492 scenes are used for training and 200 scenes are used for testing." (Section 5.1). It does not explicitly mention a separate validation split or how it was derived, only training and testing sets. |
| Hardware Specification | Yes | All the models are trained on a single H100 or RTX 3090 graphics card for 30 epochs or until convergence. |
| Software Dependencies | No | The paper mentions using "DINOv2 [49]" as the encoder and "Grounding DINO [39]" as the detector, but it does not specify version numbers for these or other software dependencies (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | For AoMSG, we experimented with different choices of backbones, sizes of the Transformer decoder, and dimensions of the final linear projector heads. Their results are discussed in Section 5.4. All the models are trained on a single H100 or RTX 3090 graphics card for 30 epochs or until convergence. We provide detailed hyperparameters in the appendix. ... Table 3: Hyperparameters used in the AoMSG main experiments. (Includes: Batch size, Learning rate, Epochs, Optimizer, Weight decay, Loss Functions, Thresholds, etc.) |
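The evaluation metric quoted above, the intersection-over-union of MSG edges computed on graph adjacency matrices, can be sketched as follows. This is an illustrative reconstruction, not the authors' released implementation; the function name `edge_iou` and the boolean-matrix representation are assumptions for the sake of the example.

```python
import numpy as np

def edge_iou(pred_adj, gt_adj):
    """IoU between the edge sets of a predicted and a ground-truth graph.

    Both inputs are square 0/1 adjacency matrices of the same shape.
    The score is |pred edges AND gt edges| / |pred edges OR gt edges|.
    """
    pred = np.asarray(pred_adj, dtype=bool)
    gt = np.asarray(gt_adj, dtype=bool)
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    # Two empty graphs agree perfectly; avoid division by zero.
    return float(intersection) / union if union else 1.0

# Toy example: a 3-node graph where only edge (0, 1) is shared.
pred = np.array([[0, 1, 0],
                 [1, 0, 1],
                 [0, 1, 0]])
gt = np.array([[0, 1, 1],
               [1, 0, 0],
               [1, 0, 0]])
print(edge_iou(pred, gt))  # 2 shared entries / 6 total entries = 0.333...
```

In practice an MSG mixes place (image) nodes and object nodes, so a full evaluation would compute this score per edge type; the sketch above only shows the core IoU computation on a single adjacency matrix.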