$\text{Di}^2\text{Pose}$: Discrete Diffusion Model for Occluded 3D Human Pose Estimation

Authors: Weiquan Wang, Jun Xiao, Chunping Wang, Wei Liu, Zhao Wang, Long Chen

NeurIPS 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Extensive evaluations conducted on various benchmarks (e.g., Human3.6M, 3DPW, and 3DPW-Occ) have demonstrated its effectiveness. |
| Researcher Affiliation | Collaboration | Weiquan Wang¹, Jun Xiao¹, Chunping Wang², Wei Liu³, Zhao Wang¹, Long Chen⁴. ¹Zhejiang University, ²Finvolution Group, ³Tencent, ⁴Hong Kong University of Science and Technology |
| Pseudocode | Yes | In this section, we provide complete training and inference algorithms for discrete diffusion process. Algorithm 1: Training Algorithm for the discrete diffusion process. Algorithm 2: Inference Algorithm for the discrete diffusion process. |
| Open Source Code | No | We will release code upon paper acceptance. |
| Open Datasets | Yes | Human3.6M [34] is the most extensive benchmark for 3D HPE... 3DPW [72] is the first dataset... Additionally, to further verify the occlusion-robustness, we evaluate $\text{Di}^2\text{Pose}$ on the 3DPW-Occ [83], which is a subset of the 3DPW. |
| Dataset Splits | Yes | We follow [22] with same protocol, which involves training on subjects S1, S5, S6, S7, and S8, and testing on subjects S9 and S11. |
| Hardware Specification | Yes | All experiments are carried out on one NVIDIA A100 PCIe GPU. |
| Software Dependencies | No | The proposed $\text{Di}^2\text{Pose}$ is completely implemented in PyTorch [53]. However, no specific version number for PyTorch or other software dependencies is provided. |
| Experiment Setup | Yes | Pose Quantization Step. The pose encoder is constructed with four Local-MLP blocks, while the pose decoder incorporates a single block. Within these Local-MLP blocks, the embedding dimensions $D$ for the pose encoder and decoder are configured to 2048 and 512, respectively. For the quantization process, the projected vector $q_i$ features the channel $d = 5$. The levels per channel, denoted as $[L_1, \dots, L_d]$, are specified as [7, 5, 5, 5, 5]. The number of quantized tokens $N$ is set to 100. Discrete Diffusion Process. For the occlude and replace transition matrix, we linearly increase $\beta_s$ and $\gamma_s$ from 0 to 0.1 and 0.9, respectively, and decrease $\alpha_s$ from 1 to 0. For the discrete diffusion model, we use off-the-shelf image encoder [79] to extract feature sequence of conditional 2D image. As for the pose denoiser, we build a 21-layer 16-head transformer with the dimension of 1024. We set steps $S$ as 100 and loss weight $\lambda$ is set to 5e-4. |
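
The numbers in the Experiment Setup row can be checked with a short sketch. It is illustrative only and is not the authors' code, which had not been released at the time of this summary. The function names (`codebook_size`, `linear_schedule`, `occlude_replace_schedule`) are hypothetical. Two assumptions go beyond the quoted text: that the per-channel levels define an implicit product codebook, and that the three schedules are sampled at $S$ evenly spaced points with both endpoints included.

```python
from math import prod

# Values quoted in the Experiment Setup row.
LEVELS = [7, 5, 5, 5, 5]  # levels per channel [L_1, ..., L_d], with d = 5
NUM_STEPS = 100           # diffusion steps S


def codebook_size(levels):
    """Number of distinct codes if each channel is quantized independently.

    Assumption: the codebook is the Cartesian product of the per-channel
    levels. The quoted text lists the levels but does not state this.
    """
    return prod(levels)


def linear_schedule(start, end, steps):
    """Return `steps` evenly spaced values from `start` to `end`, inclusive."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [start + (end - start) * s / (steps - 1) for s in range(steps)]


def occlude_replace_schedule(steps=NUM_STEPS, beta_max=0.1, gamma_max=0.9):
    """Linear schedules for alpha_s, beta_s, gamma_s as described in the row.

    Assumption: the endpoints are reached at the first and last step.
    The exact indexing used in the paper is not given in the excerpt.
    """
    alpha = linear_schedule(1.0, 0.0, steps)
    beta = linear_schedule(0.0, beta_max, steps)
    gamma = linear_schedule(0.0, gamma_max, steps)
    return alpha, beta, gamma


if __name__ == "__main__":
    print(codebook_size(LEVELS))  # 4375
    alpha, beta, gamma = occlude_replace_schedule()
    # Largest deviation of alpha_s + beta_s + gamma_s from 1 over all steps.
    print(max(abs(a + b + g - 1.0) for a, b, g in zip(alpha, beta, gamma)))
```

Under these assumptions the three schedules sum to 1 at every step, because the quoted endpoints satisfy 1 + 0 + 0 = 1 and 0 + 0.1 + 0.9 = 1. This is consistent with reading them as the probabilities of three mutually exclusive outcomes for a token, though the excerpt does not spell out that interpretation.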