POP-3D: Open-Vocabulary 3D Occupancy Prediction from Images

Authors: Antonin Vobecky, Oriane Siméoni, David Hurych, Spyridon Gidaris, Andrei Bursuc, Patrick Pérez, Josef Sivic

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate quantitatively the strengths of the proposed model on several open-vocabulary tasks: zero-shot 3D semantic segmentation using existing datasets; 3D grounding and retrieval of free-form language queries, using a small dataset that we propose as an extension of nuScenes.
Researcher Affiliation | Collaboration | Antonin Vobecky (1,2,3), Oriane Siméoni (1), David Hurych (1), Spyros Gidaris (1), Andrei Bursuc (1), Patrick Pérez (1), Josef Sivic (2); 1: valeo.ai, Paris, France; 2: CIIRC CTU in Prague; 3: FEE CTU in Prague
Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as "Pseudocode" or "Algorithm".
Open Source Code | Yes | You can find the project page here: https://vobecant.github.io/POP3D.
Open Datasets | Yes | We use the nuScenes [10] dataset composed of 1000 sequences in total, divided into 700/150/150 scenes for train/val/test splits.
Dataset Splits | Yes | We use the nuScenes [10] dataset composed of 1000 sequences in total, divided into 700/150/150 scenes for train/val/test splits. Each sequence consists of 30-40 scenes, resulting in 28,130 training and 6,019 validation scenes.
Hardware Specification | Yes | We train our models on 8 A100 GPUs.
Software Dependencies | No | The paper mentions software components like "Adam optimizer", "ResNet-101", "MaskCLIP+", and "TPVFormer" but does not provide specific version numbers for any of them.
Experiment Setup | Yes | If not mentioned otherwise, we use the default learning rate of 2e-4, the Adam [30] optimizer, and a cosine learning rate scheduler with a final learning rate of 1e-6 and a linear warmup from a learning rate of 1e-5 for the first 500 iterations. Both prediction heads have two layers, i.e., N_occ = N_ft = 2, with C_occ = 512 and C_ft = 1024 feature channels. We put the same weight on the occupancy and feature losses, i.e., we set λ = 1 in Eq. 8.
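
The training configuration reported above maps onto standard PyTorch components. The following is a minimal sketch, assuming PyTorch: Adam at 2e-4, a 500-iteration linear warmup from 1e-5 followed by cosine decay to 1e-6, two-layer prediction heads with 512 and 1024 channels, and an equally weighted (λ = 1) sum of the occupancy and feature losses from the paper's Eq. 8. The head definitions, feature dimensions, total iteration count, and loss terms below are placeholders for illustration, not the released POP-3D code.

```python
# Hypothetical sketch of the reported training setup (not the official POP-3D code).
import torch
import torch.nn as nn

def make_head(in_ch: int, hidden_ch: int, out_ch: int) -> nn.Sequential:
    """Two-layer prediction head (N = 2), as stated in the experiment setup."""
    return nn.Sequential(
        nn.Linear(in_ch, hidden_ch),
        nn.ReLU(inplace=True),
        nn.Linear(hidden_ch, out_ch),
    )

# Assumed dimensions: the voxel feature width and output sizes are illustrative.
voxel_feat_dim = 256
occ_head = make_head(voxel_feat_dim, 512, 2)    # C_occ = 512, binary occupancy
ft_head = make_head(voxel_feat_dim, 1024, 512)  # C_ft = 1024, language-aligned features

params = list(occ_head.parameters()) + list(ft_head.parameters())
optimizer = torch.optim.Adam(params, lr=2e-4)

# Linear warmup from 1e-5 to 2e-4 over the first 500 iterations,
# then cosine decay down to a final learning rate of 1e-6.
total_iters = 100_000  # placeholder for the actual training length
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1e-5 / 2e-4, total_iters=500)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=total_iters - 500, eta_min=1e-6)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[500])

lam = 1.0  # equal weighting of occupancy and feature losses (λ = 1 in Eq. 8)

def training_step(voxel_feats, occ_target, ft_target):
    occ_logits = occ_head(voxel_feats)
    ft_pred = ft_head(voxel_feats)
    # Placeholder loss terms standing in for the paper's Eq. 8.
    loss_occ = nn.functional.cross_entropy(occ_logits, occ_target)
    loss_ft = (1.0 - nn.functional.cosine_similarity(ft_pred, ft_target, dim=-1)).mean()
    loss = loss_occ + lam * loss_ft
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
    return loss.item()
```

In this sketch, SequentialLR switches from the warmup schedule to the cosine schedule after 500 calls to scheduler.step(), matching the reported warmup length; the loss is the plain sum of the two terms since λ = 1.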