SpatialRGPT: Grounded Spatial Reasoning in Vision-Language Models

Authors: An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, Sifei Liu

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our results demonstrate that SpatialRGPT significantly enhances performance in spatial reasoning tasks, both with and without local region prompts. The model also exhibits strong generalization capabilities, effectively reasoning about complex spatial relations and functioning as a region-aware dense reward annotator for robotic tasks. Code, dataset, and benchmark are released at https://www.anjiecheng.me/SpatialRGPT. We evaluate the effectiveness of our proposed SpatialRGPT in three aspects: (1) spatial reasoning benchmarks (Section 4.1), (2) standard vision-language benchmarks (Section 4.2), and (3) real-world applications (Section 4.3).
Researcher Affiliation | Collaboration | An-Chieh Cheng [1], Hongxu Yin [2], Yang Fu [1], Qiushan Guo [2], Ruihan Yang [1], Jan Kautz [2], Xiaolong Wang [1,2], Sifei Liu [2]; [1] UC San Diego, [2] NVIDIA
Pseudocode | Yes | The pseudocode for our denoising process is as in Listing 2.
Open Source Code | Yes | Code, dataset, and benchmark are released at https://www.anjiecheng.me/SpatialRGPT. ... The data pipeline, data, model weights, and benchmark will be publicly available upon paper publication.
Open Datasets | Yes | We use our automated annotation pipeline to annotate images from the Open Images [49] dataset, which covers a wide range of subjects and is of high resolution. ... [49] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al. The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale. IJCV, 2020.
Dataset Splits | Yes | All the samples come from the validation or test splits of the original datasets and are unseen by SpatialRGPT during the training phase.
Hardware Specification | Yes | The first two stages of SpatialRGPT are inherited from VILA [50], which is trained on 16 A100 GPU nodes, with each node having 8 GPUs. ... The depth connector is further pre-trained using 2 A100 GPU nodes, taking 4 hours. The final visual instruction-tuning is also experimented on 2 A100 GPU nodes, taking 12 hours.
Software Dependencies | No | The paper mentions specific models (e.g., Metric3Dv2, Wild Camera, SAM, Llama3-70B, LLaMA2-7B, CLIP-L, SigLIP) and frameworks (e.g., PyTorch) but does not provide specific version numbers for software dependencies or libraries required for reproduction.
Experiment Setup | Yes | In the instruction fine-tuning stage, the maximum learning rate is reduced to 2e-5, and the batch size is adjusted to 16. All other hyperparameters remain the same as in the pre-training stage.
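
Notes for reproducers (sketches, not the authors' code):

The Open Datasets and Software Dependencies rows point to an automated annotation pipeline that combines region masks (SAM), metric depth (Metric3Dv2), and estimated camera intrinsics (Wild Camera). As a minimal sketch of the kind of region-level metric computation such a pipeline implies (all function and variable names below are illustrative, not taken from the released code):

```python
import numpy as np

def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift a pixel (u, v) with metric depth into camera coordinates."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])

def horizontal_relation(center_a, center_b, depth_map, intrinsics):
    """Illustrative only: relate two region centroids in metric space.

    center_a, center_b: (u, v) pixel centroids of two segmented regions.
    depth_map: per-pixel metric depth (e.g., from a model such as Metric3Dv2).
    intrinsics: (fx, fy, cx, cy) camera parameters (e.g., from a model such as Wild Camera).
    """
    fx, fy, cx, cy = intrinsics
    pa = backproject(*center_a, depth_map[center_a[1], center_a[0]], fx, fy, cx, cy)
    pb = backproject(*center_b, depth_map[center_b[1], center_b[0]], fx, fy, cx, cy)
    gap = float(np.linalg.norm(pa - pb))  # metric distance between the two centroids
    relation = "left of" if pa[0] < pb[0] else "right of"
    return relation, gap
```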
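
Since the Software Dependencies row notes that no library versions are pinned, a reproducer can at least record the versions present in their own environment when re-running the released code. A small sketch (the package list is an example, not a requirements list from the paper):

```python
from importlib.metadata import version, PackageNotFoundError

# Example packages a reproduction might depend on; adjust to the released code's requirements.
packages = ["torch", "torchvision", "transformers"]

for name in packages:
    try:
        print(f"{name}=={version(name)}")
    except PackageNotFoundError:
        print(f"{name}: not installed")
```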
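
The Experiment Setup row quotes a maximum learning rate of 2e-5 and a batch size of 16 for instruction fine-tuning. A minimal sketch of where those two values would sit in a standard PyTorch fine-tuning setup (the optimizer choice and the model/dataset objects are placeholders, not details confirmed by the paper):

```python
import torch
from torch.utils.data import DataLoader

def build_finetuning_setup(model, train_dataset):
    """Hypothetical helper wiring up the two quoted hyperparameters."""
    loader = DataLoader(train_dataset, batch_size=16, shuffle=True)  # batch size from the quote
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)       # max learning rate from the quote; AdamW is an assumption
    return loader, optimizer
```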