DeepSimHO: Stable Pose Estimation for Hand-Object Interaction via Physics Simulation
Authors: Rong Wang, Wei Mao, Hongdong Li
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that our method noticeably improves the stability of the estimation and achieves superior efficiency over test-time optimization. |
| Researcher Affiliation | Academia | Rong Wang, Wei Mao, Hongdong Li, The Australian National University; {rong.wang, wei.mao, hongdong.li}@anu.edu.au |
| Pseudocode | No | No explicit pseudocode or algorithm blocks found. Figure 2 provides a schematic overview, not a step-by-step algorithm. |
| Open Source Code | Yes | The code is available at https://github.com/rongakowang/DeepSimHO. |
| Open Datasets | Yes | We evaluate our method and state-of-the-art methods on two datasets: DexYCB [6] and HO3D [18]. |
| Dataset Splits | Yes | We use the official "S0" train-test split for the training and evaluation. Following [52], we evaluate on right hand poses and filter out samples in which the hand or object is not within the field of view of the camera. To ensure consistent comparison in physics metrics, we remove test samples where the hand does not interact with the object and only select those that remain stable after simulation (see the GT results in Table 1), resulting in a total of 6348 samples. For training, we do not perform this selection of stability in order to include more data. However, we mask out the stability loss on unstable training samples, e.g. no hand-object interaction, to avoid misleading supervision. The HO3D dataset [18] consists of 66K frames featuring 10 different objects. We select the "v2" version that is mostly evaluated by previous works [56, 20, 52, 34]. Since its ground truth hand poses for the test set are not released, we follow [56] to evaluate on a subset named "v2 ", whose physics plausibility is manually verified by [56]. For training data, we use the official HO3D v2 training split and follow the same practice as the DexYCB dataset to perform sample selection and loss masking. The total HO3Dv2 test set consists of 6076 samples. (A sketch of this sample selection and loss masking appears after the table.) |
| Hardware Specification | Yes | We implement the model in PyTorch [42] and train it using the Adam [29] optimizer on a single NVIDIA RTX 3090 GPU. |
| Software Dependencies | No | The paper mentions PyTorch and MuJoCo but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | We set the learning rate to 5e-5. When training on both datasets, we follow [34, 52] to crop input images to 224×224 pixels using the provided hand-object bounding boxes. In addition, we follow [34] and perform data augmentation with random translation and rescaling by a factor of 0.1. However, we exclude rotation augmentation as it can affect the ground truth stability. Finally, we set λ_h = 0.5, λ_d = 0.1, λ_s = 0.1, and follow [34] to set λ_o1 = 0, λ_o2 = 0.2 on the DexYCB dataset and λ_o1 = 0.2, λ_o2 = 0 on the HO3D dataset for a fair comparison. For physics simulation, we use the MuJoCo [48] simulator. ... We set the gravity acceleration as 9.8 m/s² in the y direction of the camera frame. For the adhesion force, we empirically set the gain as 100 and the maximum control range as 10... Finally, we set the simulation step to be T = 100 and the time duration in each step as t = 0.02. (See the training-configuration and simulation sketches after the table.) |
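
The Dataset Splits row describes a two-part scheme: test samples are kept only when the hand contacts the object and the ground-truth grasp stays stable under simulation, while training keeps all samples but masks the stability loss on unstable ones. A minimal sketch of that logic; the helpers `has_contact` and `is_stable_after_sim` are hypothetical stand-ins for the paper's contact test and MuJoCo stability check:

```python
import torch

def select_test_samples(samples, has_contact, is_stable_after_sim):
    """Keep only test samples where the hand touches the object and the
    ground-truth grasp remains stable after simulation (cf. GT results
    in the paper's Table 1)."""
    return [s for s in samples if has_contact(s) and is_stable_after_sim(s)]

def masked_stability_loss(stability_loss, stable_mask):
    """Zero out the stability loss on unstable training samples (e.g. no
    hand-object interaction) so they provide no misleading supervision.

    stability_loss: (B,) per-sample losses; stable_mask: (B,) bool.
    """
    masked = stability_loss * stable_mask.float()
    return masked.sum() / stable_mask.float().sum().clamp(min=1.0)
```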
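
The Experiment Setup and Hardware rows fix the optimizer and the loss weighting. A minimal training-configuration sketch under those quoted values; the network and the individual loss terms here are placeholders, not the paper's actual modules:

```python
import torch

# Placeholder network standing in for the pose estimator (not the paper's model).
model = torch.nn.Linear(512, 61)

# Adam with the quoted learning rate of 5e-5.
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)

# Quoted loss weights for DexYCB; on HO3D, swap to lambda_o1 = 0.2, lambda_o2 = 0.
lambda_h, lambda_d, lambda_s = 0.5, 0.1, 0.1
lambda_o1, lambda_o2 = 0.0, 0.2

def total_loss(L_h, L_d, L_s, L_o1, L_o2):
    # Weighted sum of per-term losses; the term names mirror the λ subscripts
    # quoted above, and the caller supplies each term.
    return (lambda_h * L_h + lambda_d * L_d + lambda_s * L_s
            + lambda_o1 * L_o1 + lambda_o2 * L_o2)
```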
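
The quoted MuJoCo settings (gravity 9.8 m/s² along camera y, adhesion gain 100, control range 10, T = 100 steps of t = 0.02 s each) can be written directly in an MJCF model. A sketch using the official `mujoco` Python bindings; the hand and object geometry is a placeholder, since the paper builds these bodies from the estimated meshes:

```python
import mujoco
import numpy as np

# Minimal MJCF carrying only the solver settings quoted in the paper.
# The hand/object geoms are placeholder primitives.
MODEL_XML = """
<mujoco>
  <option timestep="0.02" gravity="0 9.8 0"/>  <!-- 9.8 m/s^2 along camera y -->
  <worldbody>
    <body name="object" pos="0 0 0">
      <freejoint/>
      <geom type="box" size="0.03 0.03 0.03" mass="0.1"/>
    </body>
    <body name="hand" pos="0 -0.05 0">
      <geom type="sphere" size="0.02"/>
    </body>
  </worldbody>
  <actuator>
    <!-- Adhesion force with the quoted gain 100 and control range 10. -->
    <adhesion name="grasp" body="hand" gain="100" ctrlrange="0 10"/>
  </actuator>
</mujoco>
"""

def simulate_displacement(xml=MODEL_XML, steps=100):
    """Run T = 100 simulation steps of t = 0.02 s each and return how far
    the object moved, a simple proxy for the paper's stability measurement."""
    model = mujoco.MjModel.from_xml_string(xml)
    data = mujoco.MjData(model)
    mujoco.mj_forward(model, data)  # populate initial body positions
    start = data.body("object").xpos.copy()
    for _ in range(steps):
        mujoco.mj_step(model, data)
    return float(np.linalg.norm(data.body("object").xpos - start))
```

In this sketch the object is judged against its displacement after the rollout; how the paper thresholds displacement into a binary stability label is not specified in the quoted text, so the function returns the raw distance.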