Learning to Estimate Object Poses without Real Image Annotations

Authors: Haotong Lin, Sida Peng, Zhize Zhou, Xiaowei Zhou

IJCAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our method on the LINEMOD [Hinterstoisser et al., 2012], Occluded LINEMOD [Brachmann et al., 2014] and YCB-Video [Xiang et al., 2018] datasets, which are widely used benchmark datasets for object pose estimation.
Researcher Affiliation | Academia | Haotong Lin, Sida Peng, Zhize Zhou and Xiaowei Zhou, State Key Lab of CAD and CG, Zhejiang University. {haotongl, pengsida, zhouzhize, xwzhou}@zju.edu.cn
Pseudocode | No | The paper describes its method in prose and uses diagrams but does not contain any pseudocode or algorithm blocks.
Open Source Code | Yes | The code is available at https://github.com/zju3dv/pvnet-depth-sup.
Open Datasets | Yes | We train the pose estimator on the real training data without using annotations. Specifically, we follow Self6D [Wang et al., 2020] to use 15% of the real data on the LINEMOD dataset and 10% of the real training data on the YCB-Video dataset. The LINEMOD and YCB-Video datasets provide depth maps collected by RGBD sensors. Our synthetic data is rendered by BlenderProc. (A sketch of this labelled-data budget follows the table.)
Dataset Splits | No | The paper mentions using a percentage of the real training data from the LINEMOD and YCB-Video datasets, but does not explicitly specify a validation split.
Hardware Specification | Yes | All networks are trained until they converge; it takes about 4 hours to train a pose estimation network and 2 hours to finetune it on 4 TITAN Xp GPUs.
Software Dependencies | No | The paper mentions tools and models like BlenderProc, PVNet, CenterNet, ICP, ResNet-34, and ResNet-18, but does not specify software versions for any libraries or frameworks used in the implementation (e.g., Python, PyTorch, TensorFlow, CUDA).
Experiment Setup | Yes | Specifically, for M classes of objects and K keypoints, our network takes the H × W × 3 image as input and outputs an H × W × (K × 2 × M + M + 1) tensor. ... We take the initial learning rate as 1e-3 and halve it every 20 epochs. After pretraining, we set the initial learning rate as 5e-4 and halve it every 10 epochs to finetune the networks to employ learning with unannotated RGBD data. Every 5 epochs, we update the supervision poses of real training data using the pose estimator at the time and the pose refiner. (A sketch of this setup follows the table.)
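
The Open Datasets row quotes the labelled-data budget taken from Self6D: only 15% of the real LINEMOD images and 10% of the real YCB-Video images are used during training, without reading their pose annotations. Below is a minimal sketch of what such a deterministic subsampling could look like; the helper name `sample_real_subset` and the commented usage lines are illustrative assumptions, not the authors' code.

```python
# Minimal sketch (assumption, not the authors' code) of selecting a fixed
# fraction of the real training images, as quoted in the Open Datasets row.
import random

def sample_real_subset(image_paths, fraction, seed=0):
    """Deterministically pick a fixed fraction of the real training images."""
    rng = random.Random(seed)
    n = max(1, int(len(image_paths) * fraction))
    return rng.sample(image_paths, n)

# Hypothetical usage: per-dataset image lists, with depth maps loaded alongside RGB.
# linemod_subset = sample_real_subset(linemod_real_images, fraction=0.15)
# ycbv_subset    = sample_real_subset(ycbv_real_images,    fraction=0.10)
```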
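
The Experiment Setup row quotes the PVNet-style output-tensor layout and the learning-rate schedule. The sketch below shows how those numbers fit together; the example values for H, W, M, K and the helper `lr_at_epoch` are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the quoted output layout and learning-rate schedule.
import torch

H, W = 480, 640   # example input resolution
M, K = 13, 9      # example values: M object classes, K keypoints per object

# The network maps an H x W x 3 image to an H x W x (K*2*M + M + 1) tensor:
#   K*2*M channels -> per-class 2D vertex fields pointing at the K keypoints
#   M + 1 channels -> per-pixel segmentation logits (M objects + background)
out = torch.zeros(H, W, K * 2 * M + M + 1)
vertex_fields = out[..., : K * 2 * M].reshape(H, W, M, K, 2)
seg_logits = out[..., K * 2 * M:]          # shape (H, W, M + 1)

# Learning-rate schedule as quoted:
#   pretraining: start at 1e-3, halve every 20 epochs
#   finetuning on unannotated RGBD data: start at 5e-4, halve every 10 epochs
def lr_at_epoch(epoch, pretrain=True):
    base, step = (1e-3, 20) if pretrain else (5e-4, 10)
    return base * (0.5 ** (epoch // step))

# Every 5 epochs of finetuning, the supervision poses of the real training data
# are re-estimated with the current pose estimator and the pose refiner.
```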