Learning to Estimate Object Poses without Real Image Annotations
Authors: Haotong Lin, Sida Peng, Zhize Zhou, Xiaowei Zhou
IJCAI 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our method on the LINEMOD [Hinterstoisser et al., 2012], Occluded LINEMOD [Brachmann et al., 2014] and YCB-Video [Xiang et al., 2018] datasets, which are widely used benchmarks for object pose estimation. |
| Researcher Affiliation | Academia | Haotong Lin, Sida Peng, Zhize Zhou and Xiaowei Zhou, State Key Lab of CAD and CG, Zhejiang University. {haotongl, pengsida, zhouzhize, xwzhou}@zju.edu.cn |
| Pseudocode | No | The paper describes its method in prose and uses diagrams but does not contain any pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is available at https://github.com/zju3dv/pvnet-depth-sup. |
| Open Datasets | Yes | We train the pose estimator on the real training data without using annotations. Specifically, we follow Self6D [Wang et al., 2020] to use 15% of the real data on the LINEMOD dataset and 10% of the real training data on the YCB-Video dataset. The LINEMOD and YCB-Video datasets provide depth maps collected by RGBD sensors. Our synthetic data is rendered by BlenderProc. |
| Dataset Splits | No | The paper mentions using a percentage of the real training data from the LINEMOD and YCB-Video datasets, but does not explicitly specify a validation split or held-out validation set. |
| Hardware Specification | Yes | All networks are trained until they converge; it takes about 4 hours to train a pose estimation network and 2 hours to finetune it on 4 TITAN Xp GPUs. |
| Software Dependencies | No | The paper mentions tools and models such as BlenderProc, PVNet, CenterNet, ICP, ResNet-34, and ResNet-18, but does not specify software versions for any libraries or frameworks used in the implementation (e.g., Python, PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | Specifically, for M classes of objects and K keypoints, our network takes the H × W × 3 image as input and outputs an H × W × (K · 2 · M + M + 1) tensor. ... We take the initial learning rate as 1e-3 and halve it every 20 epochs. After pretraining, we set the initial learning rate as 5e-4 and halve it every 10 epochs to finetune the networks for learning with unannotated RGBD data. Every 5 epochs, we update the supervision poses of the real training data using the current pose estimator and the pose refiner. (A sketch of this setup follows the table.) |
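
The Experiment Setup row quotes an output tensor shape and two learning-rate schedules in prose. Below is a minimal PyTorch sketch of both, useful as a sanity check when reproducing the setup. The 1×1-conv head, the `M`/`K` values, and the input resolution are illustrative assumptions, not the authors' released code (PVNet's actual backbone is a ResNet variant).

```python
import torch
import torch.nn as nn
import torch.optim as optim

M, K = 13, 9        # assumed: number of LINEMOD object classes and keypoints per object
H, W = 480, 640     # assumed input resolution

# Per-pixel output: K*2*M vertex-field channels (2D offsets toward K keypoints
# for each of M classes), M segmentation channels, and 1 background channel.
out_channels = K * 2 * M + M + 1

# Hypothetical stand-in for the pose estimation network's backbone.
head = nn.Conv2d(3, out_channels, kernel_size=1)

x = torch.randn(1, 3, H, W)
assert head(x).shape == (1, out_channels, H, W)

# Pretraining schedule quoted above: initial lr 1e-3, halved every 20 epochs.
optimizer = optim.Adam(head.parameters(), lr=1e-3)
pretrain_sched = optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)

# Finetuning schedule: reset lr to 5e-4 and halve every 10 epochs. Every 5
# epochs, the supervision poses of the real data would be re-estimated with
# the current pose estimator and the pose refiner (pseudo-label update, not shown).
for group in optimizer.param_groups:
    group["lr"] = 5e-4
finetune_sched = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
```

With the assumed LINEMOD values (M = 13, K = 9), the head outputs 9·2·13 + 13 + 1 = 248 channels per pixel.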