3D Visibility-Aware Generalizable Neural Radiance Fields for Interacting Hands

Authors: Xuan Huang, Hanhui Li, Zejun Yang, Zhisheng Wang, Xiaodan Liang

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on the InterHand2.6M dataset demonstrate that our proposed VA-NeRF outperforms conventional NeRFs significantly. Project Page: https://github.com/XuanHuang0/VANeRF.
Researcher Affiliation | Collaboration | 1 Shenzhen Campus of Sun Yat-sen University, Shenzhen, China; 2 Tencent, Shenzhen, China; 3 Dark Matter AI Research, Guangzhou, China
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Project Page: https://github.com/XuanHuang0/VANeRF.
Open Datasets | Yes | Our experiments are conducted on the large-scale InterHand2.6M (Moon et al. 2020) dataset that consists of single and interacting hand images with various subjects, poses, and views.
Dataset Splits | No | The paper states '143,893 training images and 9,475 test images in total' but does not specify a separate validation split.
Hardware Specification | Yes | The whole training process takes about 40 hours on four NVIDIA RTX 3090 GPUs.
Software Dependencies | No | The paper mentions 'PyTorch' but does not specify its version number or any other software dependencies with version details.
Experiment Setup | Yes | Our network is implemented using PyTorch and trained with the Adam optimizer (Kingma and Ba 2014) with a batch size of 4. For both the VA-NeRF and the discriminator, their initial learning rates are set to 1 × 10⁻³ and decay by half four times (at the 2nd, 5th, 10th, and 20th epochs respectively) during training. The whole training process takes about 40 hours on four NVIDIA RTX 3090 GPUs. Loss weights in Eq. (4) are set as λrgb = 10.0, λVGG = 1.0, λadv = 0.1, λvis = 0.1. The total number of training epochs is 30. As in (Mihajlovic et al. 2022), we adopt a coarse-to-fine rendering strategy during training that first renders patches by accumulating color and density values of 64 sampled points along a camera ray, and then 128 sampled points for fine-grained rendering.
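The reported optimizer and schedule map directly onto standard PyTorch components. The sketch below is a hedged illustration, not the authors' code: the tiny `Linear` modules stand in for the actual VA-NeRF network and discriminator (whose architectures are not specified in this excerpt), while the learning-rate schedule and Eq. (4) loss weights follow the quoted setup.

```python
# Sketch of the reported training setup: Adam, initial lr 1e-3 halved at
# epochs 2, 5, 10, and 20, 30 epochs total, with the Eq. (4) loss weights.
# The modules below are placeholders, not the real VA-NeRF architecture.
import torch

model = torch.nn.Linear(3, 3)   # placeholder for VA-NeRF
disc = torch.nn.Linear(3, 1)    # placeholder for the discriminator

opt = torch.optim.Adam(list(model.parameters()) + list(disc.parameters()), lr=1e-3)
# "decay by half four times (at the 2nd, 5th, 10th, and 20th epochs)"
sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[2, 5, 10, 20], gamma=0.5)

# Loss weights from Eq. (4): lambda_rgb, lambda_VGG, lambda_adv, lambda_vis
WEIGHTS = {"rgb": 10.0, "vgg": 1.0, "adv": 0.1, "vis": 0.1}

def total_loss(l_rgb, l_vgg, l_adv, l_vis):
    """Weighted sum of the four loss terms in Eq. (4)."""
    return (WEIGHTS["rgb"] * l_rgb + WEIGHTS["vgg"] * l_vgg
            + WEIGHTS["adv"] * l_adv + WEIGHTS["vis"] * l_vis)

for epoch in range(30):
    # ... one epoch of training with batch size 4 would go here ...
    sched.step()

print(opt.param_groups[0]["lr"])  # 1e-3 halved four times -> 6.25e-05
```

After 30 epochs the learning rate has been halved exactly four times, consistent with the paper's description of the decay schedule.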