Vision Foundation Model Enables Generalizable Object Pose Estimation

Authors: Kai Chen, Yiyao Ma, Xingyu Lin, Stephen James, Jianshu Zhou, Yun-Hui Liu, Pieter Abbeel, Qi Dou

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive evaluations on the representative benchmark datasets Wild6D, CO3D, and LINEMOD demonstrate the effectiveness, versatility, and superior generalization capability of VFM-6D in various real-world scenarios.
Researcher Affiliation | Collaboration | Kai Chen1, Yiyao Ma1, Xingyu Lin2, Stephen James3, Jianshu Zhou1, Yun-Hui Liu1, Pieter Abbeel2, Qi Dou1 (1The Chinese University of Hong Kong, 2UC Berkeley, 3Dyson Robot Learning Lab). Pieter Abbeel holds concurrent appointments as a Professor at UC Berkeley and as an Amazon Scholar.
Pseudocode | No | The paper describes the method's steps in text and figures, but does not include a formal pseudocode block or algorithm listing.
Open Source Code | Yes | Project website: https://vfm-6d.github.io/. The relevant data, code, and instructions will be publicly available via the project homepage.
Open Datasets | Yes | We mainly evaluated VFM-6D on two category-level benchmark datasets, Wild6D [62] and CO3D [63], to investigate its generalization capability for category-level object pose estimation on novel object categories. ... We therefore further evaluated VFM-6D on LINEMOD [64]. ... We collected 20 categories of objects from ShapeNet [60].
Dataset Splits | No | For each category, one object instance was randomly selected to synthesize a total of 10K RGBD images. Concurrently, the object mask and ground-truth object coordinate map were generated for each image. In total, the synthetic training dataset comprises 200K RGBD images. ... Wild6D testing data consists of 162 different object instances. ... CO3D is more comprehensive. ... We therefore further evaluated VFM-6D on LINEMOD. No explicit mention of a validation split.
Hardware Specification | Yes | All experiments were conducted on a server with two NVIDIA A40 48GB GPUs.
Software Dependencies | No | We developed VFM-6D based on DINOv2-s. We customized the synthetic data generation pipeline from [59] to utilize Blender for generating photorealistic synthetic data. ... No specific version number is provided for Blender or other libraries such as PyTorch; only the DINOv2-s model name is given.
Experiment Setup | Yes | We set n = m = 1024. ... For the proposed 2D-to-3D foundation feature lifting module, we set N1 = 1, and the 3D position embedding layer adopts a 2-layer MLP. For the shape representation module, we set N2 = 4. We set τ = 0.05 for the pose-aware contrastive loss and β = 0.1 for the smooth L1 loss. ... During training, we use the Adam optimizer with a base learning rate of 1 × 10^-4, halved every 3 epochs. The total number of training epochs is 15 with a batch size of 8.
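The step-decay schedule quoted in the Experiment Setup row (base learning rate 1 × 10^-4, halved every 3 epochs, 15 epochs total) can be sketched as follows. This is a minimal illustration of the reported hyperparameters; the function name and structure are our own assumptions, not taken from the authors' (not-yet-released) code.

```python
def lr_at_epoch(epoch: int, base_lr: float = 1e-4, halve_every: int = 3) -> float:
    """Step-decay learning rate: halved once every `halve_every` epochs."""
    return base_lr * 0.5 ** (epoch // halve_every)

# Learning-rate schedule over the 15 reported training epochs.
schedule = [lr_at_epoch(e) for e in range(15)]
```

In PyTorch, assuming that framework was used (no versions are reported), this would correspond to `torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.5)` on an Adam optimizer, and the β = 0.1 smooth L1 loss to `torch.nn.SmoothL1Loss(beta=0.1)`.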