Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Generalizable Hand-Object Modeling from Monocular RGB Images via 3D Gaussians

Authors: Xingyu Liu, Pengfei Ren, Qi Qi, Haifeng Sun, Zirui Zhuang, Jing Wang, Jianxin Liao, Jingyu Wang

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on two challenging datasets demonstrate that HOGS outperforms state-of-the-art methods in monocular hand-object reconstruction and photo-realistic rendering. Experimental results on two challenging datasets show that our method outperforms the state-of-the-art (SOTA) methods in monocular hand-object reconstruction and photo-realistic rendering. We conduct comparisons with existing hand-object and hand-held object reconstruction methods, including 3D dense supervised baselines (typically SDF-based) and 2D photometric supervised baselines (e.g., MOHO [52] and HOLD [7]). Additionally, we re-implement 3DGS-Avatar [37] and GOF [51], which originally utilize 3DGS to model animable human bodies and reconstruct static unbounded scenes, extending them to a hand-object baseline and a rigid object baseline (marked by ). We perform ablation studies to verify the effectiveness of two proposed critical components, V-PM and G-PM.
Researcher Affiliation | Academia | Xingyu Liu*, Pengfei Ren*, Qi Qi, Haifeng Sun, Zirui Zhuang, Jing Wang, Jianxin Liao, Jingyu Wang. State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications.
Pseudocode | No | The paper describes methods in figures (e.g., Figure 2) and in structured text. However, it does not contain any explicitly labeled 'Pseudocode', 'Algorithm', or 'Algorithm X' blocks or figures that present a procedure in a code-like format.
Open Source Code | Yes | Code is available at https://github.com/ru1ven/HOGS.
Open Datasets | Yes | DexYCB [1] is a hand-object dataset containing 582K RGB-D frames over 1,000 sequences of 10 subjects grasping 20 different objects. HO3D_v3 [13] is an RGB-D hand-object interaction dataset with 10 subjects manipulating 10 objects from the YCB dataset.
Dataset Splits | Yes | We follow the dataset split in [35], filtering samples without interactions, obtaining 147,526 training samples. For reconstruction evaluation, we follow [2, 52, 27] to downsample the video data to 6 frames per second, resulting in 5,928 testing samples. Following the evaluation protocol of [7], we select 18 sequences for training and evaluate the quality of reconstructed hand-held object meshes.
Hardware Specification | Yes | On DexYCB, we train the model for a total of 360k iterations, which takes approximately 10 hours on an NVIDIA RTX 4090 GPU.
Software Dependencies | No | The paper mentions several models and optimizers (e.g., Vision Transformer [6], PointNet++ [14], ULIP [46], AdamW optimizer [24]) but does not provide specific version numbers for the underlying software environment or libraries like Python, PyTorch, TensorFlow, or CUDA.
Experiment Setup | Yes | We initialize the hand Gaussians and object Gaussians by randomly sampling K = 5,000 points within the canonical MANO surface and the 3D bounding box, respectively. For optimization, we follow [21] to employ cloning or splitting and pruning to adaptively control the density of the 3D Gaussians during optimization. We employ an individual set of Gaussians for each subject or object. For the input of the visual encoder, we crop the hand-object region of the RGB image and resize it to 224 × 224. We use an AdamW optimizer [24] for training. On DexYCB, we train the model for a total of 360k iterations, which takes approximately 10 hours on an NVIDIA RTX 4090 GPU. On HO3D, we train for 200k iterations. After 360k iterations on DexYCB, we fix the parameters of the 3D Gaussians and continue training for an additional 10 epochs, focusing solely on optimizing the pose parameters to prevent underfitting. The overall loss can be formulated as: $L_{\text{total}} = \lambda_{\text{rgb}} L_{\text{rgb}} + \lambda_{\text{mask}} L_{\text{mask}} + \lambda_{\text{perc}} L_{\text{perc}} + \lambda_{\text{pose}} L_{\text{pose}} + \lambda_{\text{cont}} L_{\text{cont}} + \lambda_{\text{pen}} L_{\text{pen}} + \lambda_{\text{skin}} L_{\text{skin}} + \lambda_{\text{iso-pos}} L_{\text{iso-pos}} + \lambda_{\text{iso-cov}} L_{\text{iso-cov}}$ (Eq. 24), where $\lambda_{\text{rgb}}$, $\lambda_{\text{mask}}$, $\lambda_{\text{perc}}$, $\lambda_{\text{pose}}$, $\lambda_{\text{cont}}$, $\lambda_{\text{pen}}$, $\lambda_{\text{skin}}$, $\lambda_{\text{iso-pos}}$, and $\lambda_{\text{iso-cov}}$ are set to 1, 0.1, 0.01, 1, 20, 10, 0.1, 1, and 100, respectively. The weight factors $\lambda_{\text{rot}}$, $\lambda_{\text{trans}}$, and $\lambda_{\text{corner}}$ are set as 10, 1e4, and 1e3, respectively.
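
The loss weighting quoted in the Experiment Setup row can be sketched as a plain weighted sum. This is a minimal illustration, not the authors' implementation: the weight values come from the quoted text, while the dictionary keys, the `LOSS_WEIGHTS` name, and the `total_loss` helper are hypothetical and assume each individual loss term has already been computed as a scalar.

```python
# Loss weights as quoted in the Experiment Setup row (Eq. 24 of the paper).
# The key names are illustrative assumptions, not identifiers from the HOGS code.
LOSS_WEIGHTS = {
    "rgb": 1.0,
    "mask": 0.1,
    "perc": 0.01,
    "pose": 1.0,
    "cont": 20.0,
    "pen": 10.0,
    "skin": 0.1,
    "iso_pos": 1.0,
    "iso_cov": 100.0,
}


def total_loss(terms, weights=LOSS_WEIGHTS):
    """Return the weighted sum of already computed scalar loss terms.

    Hypothetical helper for illustration only. `terms` must provide a value
    for every key in `weights`.
    """
    missing = set(weights) - set(terms)
    if missing:
        raise KeyError(f"missing loss terms: {sorted(missing)}")
    return sum(weights[name] * terms[name] for name in weights)


if __name__ == "__main__":
    # Placeholder values, only to show the call pattern.
    example_terms = {name: 1.0 for name in LOSS_WEIGHTS}
    print(total_loss(example_terms))
```

With these weights, the isotropy-covariance and contact terms dominate the sum whenever all terms have similar magnitude, which is worth keeping in mind when reading the reported setup.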