Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Promptable 3-D Object Localization with Latent Diffusion Models

Authors: Cheng-Yao Hong, Li-Heng Wang, Tyng-Luh Liu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental "We conduct thorough evaluations on three key 3-D object recognition tasks: general 3-D object detection, few-shot detection, and grounding-based detection. Experimental results demonstrate that our framework achieves competitive performance relative to state-of-the-art methods, validating its effectiveness, versatility, and broad applicability in 3-D computer vision." (from Abstract) and "4 Experiments In this section, we present experimental results aimed at demonstrating the effectiveness of the proposed method."
Researcher Affiliation Academia ¹Institute of Information Science, Academia Sinica; ²University of Southern California; EMAIL
Pseudocode Yes Algorithm 1: Training
    def train(pc, gt_b, gt_l, clsn, cond, T):
        # Extract 3-D scene features
        pts, zv = foundation.encoder.v(pc)
        zt = foundation.encoder.t(clsn)
        # Compute conditional embeddings
        cz = prompt.encoder(cond)
        # Generate object anchor features
        bo = cross_attention(zv, zt)
        # Initialize bounding boxes
        bb_init = init_boxes(bo)
        # Encode to latent space
        bb_latent = box_vae.encoder(bb_init, bo)
        # Sample random diffusion timestep
        t = randint(1, T)
        # Add noise to latent
        eps = normal(mean=0, std=1)
        bb_noisy = corrupt(bb_latent, t, eps)
        # Predict noise with diffusion model
        eps_pred = ldm(bb_noisy, cz, t)
        # Compute diffusion loss
        L_diff = mse(eps_pred, eps)
        # Decode latent to boxes
        bb_pred = box_vae.decoder(bb_latent)
        # Compute detection loss
        L_det = detection_loss(bb_pred, gt_b, gt_l)
        loss = L_diff + L_det
        update(model, loss)
        return loss
    corrupt(x, t, eps): sqrt(alpha_cumprod(t)) * x + sqrt(1 - alpha_cumprod(t)) * eps
    alpha_cumprod(t): ∏_{i=1}^{t} α_i
(Partial text from Algorithm 1)
Open Source Code No Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: The paper discloses all the implementation details and information needed to reproduce the main experimental results.
Open Datasets Yes We evaluate the proposed method on two standard indoor benchmarks: SUN RGB-D and ScanNet. SUN RGB-D includes 5,285 training scenes along with corresponding validation scenes, while ScanNet comprises 1,201 training and 312 validation scenes reconstructed from 2.5 million RGB-D frames. Following prior works [17, 10, 50, 61], we evaluate on the 10 most common object classes for SUN RGB-D and 18 semantic classes for ScanNet. Performance is measured using mean Average Precision (mAP) at IoU thresholds of 0.25 and 0.5. All results are averaged over three random splits, and we report both the mean and standard deviation. (from Section 4.1, paragraph 1) and citations for these datasets [9, 65, 90, 5, 89, 72, 28].
Dataset Splits Yes SUN RGB-D includes 5,285 training scenes along with corresponding validation scenes, while ScanNet comprises 1,201 training and 312 validation scenes reconstructed from 2.5 million RGB-D frames. ... All results are averaged over three random splits, and we report both the mean and standard deviation. (from Section 4.1) and "The base/novel splits are 6/4 for FS-SUNRGBD and 12/6 for FS-ScanNet." (from Section 4.2).
Hardware Specification Yes All experiments utilize eight NVIDIA RTX A6000 Ada GPUs.
Software Dependencies No The paper mentions and uses "Adam optimizer", "cosine annealing schedule", "DDIM sampling steps", "focal loss", "asymmetric classification losses", "pretrained video-based LDM from Stable Diffusion", and "CLIP text encoder". However, it does not specify explicit version numbers for these software libraries, frameworks (like PyTorch or TensorFlow), or the programming language used (e.g., Python 3.x).
Experiment Setup Yes Optimization employs the Adam optimizer with an initial learning rate of 5 × 10⁻⁴, a cosine annealing schedule featuring a 500-step linear warm-up, and a minimum learning rate of 1 × 10⁻⁶ to ensure stable convergence. ... The model trains for 18K iterations with a batch size of 8, accumulating gradients over 16 steps. ... the loss coefficients initially set as λdiff = 1.0, λdet = 0.2 gradually adjust to λdiff = 0.5, λdet = 1.0. (from Section 4, "Training and loss functions" paragraph)
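
The corrupt step quoted in the Pseudocode row is the standard forward-noising formula used by diffusion models. The following is a minimal Python sketch of that step only, not the authors' implementation. It assumes a linear beta schedule with α_i = 1 − β_i; the schedule, its endpoints, the helper name `make_alpha_cumprod`, and the use of plain Python lists for the latent are all illustrative assumptions that do not appear in the quoted text.

```python
import math
import random


def make_alpha_cumprod(T, beta_start=1e-4, beta_end=2e-2):
    """Return [abar_1, ..., abar_T], where abar_t is the product of alpha_1..alpha_t.

    Assumption: linear beta schedule with alpha_i = 1 - beta_i.
    The quoted pseudocode does not state which schedule is used.
    """
    abar, running = [], 1.0
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / max(T - 1, 1)
        running *= 1.0 - beta
        abar.append(running)
    return abar


def corrupt(x, t, eps, abar):
    """Noise a latent x at timestep t (1-indexed), elementwise:

    sqrt(abar_t) * x + sqrt(1 - abar_t) * eps
    """
    a = abar[t - 1]
    return [math.sqrt(a) * xi + math.sqrt(1.0 - a) * ei for xi, ei in zip(x, eps)]


if __name__ == "__main__":
    T = 1000
    abar = make_alpha_cumprod(T)
    rng = random.Random(0)
    latent = [0.5, -1.0, 2.0]              # hypothetical box latent
    t = rng.randint(1, T)                  # random diffusion timestep
    eps = [rng.gauss(0.0, 1.0) for _ in latent]
    print(t, corrupt(latent, t, eps, abar))
```

Because the two coefficients satisfy sqrt(abar_t)² + sqrt(1 − abar_t)² = 1, a unit-variance latent stays at unit variance after noising, which is why the formula needs the minus sign inside the second square root.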
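
The schedule described in the Experiment Setup row can be sketched as below. This is a reading of the quoted text under stated assumptions rather than the paper's code: the warm-up is assumed to rise linearly to the peak rate over the first 500 steps, the cosine decay is assumed to span the remaining steps of the 18K iterations, and the loss coefficients are assumed to be interpolated linearly over training (the quoted text only says they "gradually adjust"). The function names `learning_rate` and `loss_weights` are hypothetical.

```python
import math

PEAK_LR = 5e-4        # initial learning rate from the quoted setup
MIN_LR = 1e-6         # minimum learning rate from the quoted setup
WARMUP_STEPS = 500    # linear warm-up length from the quoted setup
TOTAL_STEPS = 18_000  # 18K training iterations from the quoted setup


def learning_rate(step):
    """Learning rate at a 0-indexed step.

    Assumption: linear warm-up to PEAK_LR, then cosine annealing to MIN_LR.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(TOTAL_STEPS - WARMUP_STEPS, 1)
    progress = min(progress, 1.0)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


def loss_weights(step):
    """Return (lambda_diff, lambda_det) at a 0-indexed step.

    Assumption: linear interpolation from (1.0, 0.2) to (0.5, 1.0).
    """
    frac = min(max(step / (TOTAL_STEPS - 1), 0.0), 1.0)
    lam_diff = 1.0 + (0.5 - 1.0) * frac
    lam_det = 0.2 + (1.0 - 0.2) * frac
    return lam_diff, lam_det


if __name__ == "__main__":
    for step in (0, 499, 500, 9_000, 17_999):
        print(step, learning_rate(step), loss_weights(step))
```

If the batch size of 8 is the global batch size, accumulating gradients over 16 steps corresponds to 8 × 16 = 128 samples per optimizer update; the quoted text does not say whether 8 is per GPU or global.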
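
The Open Datasets row reports mAP at IoU thresholds of 0.25 and 0.5. As a rough illustration of what those thresholds test, here is a minimal Python sketch of 3-D IoU. It assumes axis-aligned boxes given as (xmin, ymin, zmin, xmax, ymax, zmax); the benchmarks may use oriented boxes, so this is a simplification, and the function names are hypothetical.

```python
def box_volume(box):
    """Volume of an axis-aligned box (xmin, ymin, zmin, xmax, ymax, zmax)."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def iou_3d(box_a, box_b):
    """Intersection over union of two axis-aligned 3-D boxes.

    Assumption: boxes are axis-aligned with positive extent on every axis.
    """
    inter = 1.0
    for axis in range(3):
        lo = max(box_a[axis], box_b[axis])
        hi = min(box_a[axis + 3], box_b[axis + 3])
        if hi <= lo:
            return 0.0  # no overlap along this axis
        inter *= hi - lo
    union = box_volume(box_a) + box_volume(box_b) - inter
    return inter / union


def is_match(pred, gt, threshold):
    """A prediction counts as a true-positive candidate when IoU >= threshold."""
    return iou_3d(pred, gt) >= threshold


if __name__ == "__main__":
    gt = (0.0, 0.0, 0.0, 2.0, 2.0, 2.0)
    pred = (1.0, 0.0, 0.0, 3.0, 2.0, 2.0)
    print(iou_3d(pred, gt), is_match(pred, gt, 0.25), is_match(pred, gt, 0.5))
```

In this example the two boxes overlap in a volume of 4 out of a union of 12, giving an IoU of 1/3, so the prediction would count as a match at the 0.25 threshold but not at 0.5.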