Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

HumanoidGen: Data Generation for Bimanual Dexterous Manipulation via LLM Reasoning

Authors: Zhi Jing, Siyuan Yang, Jicong Ao, Ting Xiao, Yu-Gang Jiang, Chenjia Bai

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental In experiments, we create a novel benchmark with augmented scenarios to evaluate the quality of the collected data. The results show that the performance of the 2D and 3D diffusion policies can scale with the generated dataset. We conducted the following three groups of experiments: (1) a comprehensive evaluation of demonstration generation and execution performance of our framework compared to Robotwin; (2) an effectiveness evaluation of MCTS in enhancing the demonstration generation process; and (3) a validity evaluation of the collected demonstration data in training bimanual dexterous manipulation policies. Experimental Results. As shown in Tab. 1, MCTS can significantly improve the reasoning ability of LLM with minimal additional token consumption. Experimental Results. As shown in Tab. 2, for some relatively easy tasks, DP3 demonstrates few-shot learning capability. We trained both DP and DP3 using 100, 50, and 20 trajectories generated by our method. During each training session, we record results from three checkpoints. For DP, we evaluate checkpoints at epochs 150, 200, and 250, while for DP3, we evaluate at epochs 1500, 2000, and 2500.
Researcher Affiliation Collaboration Zhi Jing (1,2), Siyuan Yang (3,2), Jicong Ao (2), Ting Xiao (4), Yu-Gang Jiang (1), Chenjia Bai (2). (1) Fudan University; (2) Institute of Artificial Intelligence (TeleAI), China Telecom; (3) University of Science and Technology of China; (4) East China University of Science and Technology. Correspondence to: Chenjia Bai (EMAIL)
Pseudocode No The paper describes a framework and its components, detailing steps and mechanisms in prose and using figures (like Figure 1: The overview of HumanoidGen), but it does not present any formal pseudocode or algorithm blocks with numbered or bulleted steps labeled as such.
Open Source Code Yes Project page is https://openhumanoidgen.github.io.
Open Datasets Yes Project page is https://openhumanoidgen.github.io. Building on HumanoidGen, we develop a comprehensive benchmark called HGen-Bench for bimanual dexterous manipulation. In our setup, the Unitree H1-2 humanoid robot equipped with Inspire hands serves as the robotic platform, and SAPIEN [14] as the simulation engine for data collection.
Dataset Splits No We collected 100 data samples for each task using our method. Each sample includes RGB and depth images captured from six cameras, the joint states, and the action ground truth of the robot. We train both DP and DP3 using 100, 50, and 20 trajectories generated by our method. During each training session, we record results from three checkpoints. For DP, we evaluate checkpoints at epochs 150, 200, and 250, while for DP3, we evaluate at epochs 1500, 2000, and 2500. For evaluation, we test each of the three checkpoints using seeds 0, 1, and 2, and report the mean and standard deviation of the success rates.
Hardware Specification No The paper mentions the robot hardware used for data collection and real-world experiments ("the Unitree H1-2 humanoid robot equipped with Inspire hands", "D435i depth camera"), but does not specify the computational hardware (e.g., GPU models, CPU types, memory) used for training the Diffusion Policies (DP and DP3) or running the simulations for the main experiments.
Software Dependencies No Our benchmark builds on ManiSkill3 [52] physics engine. In our setup, the Unitree H1-2 humanoid robot equipped with Inspire hands serves as the robotic platform, and SAPIEN [14] as the simulation engine for data collection. We employed the SNOPT solver from the pydrake library to solve this optimization problem. We utilize the constrained motion planner from the mplib library to minimize the cost function Cost(θ_t) while guaranteeing the satisfaction of the constraints during the movement process. Using the DeepSeek-R1 [55], we generate and execute the demonstration scripts with both frameworks and compare the final success rates.
Experiment Setup Yes We train both DP and DP3 using 100, 50, and 20 trajectories generated by our method. During each training session, we record results from three checkpoints. For DP, we evaluate checkpoints at epochs 150, 200, and 250, while for DP3, we evaluate at epochs 1500, 2000, and 2500. DP3 is trained for 3000 epochs with checkpoints saved every 500 epochs. In contrast, DP is trained for 300 epochs with checkpoints saved every 50 epochs. The batch size is set to 256 for DP3 and 64 for DP, reflecting the difference in memory requirements between point cloud and image inputs. Both methods adopt a noise schedule with β_start = 0.0001, β_end = 0.02, and a squared cosine schedule (squaredcos_cap_v2) over 100 diffusion training steps. However, the sampling strategies differ: DP3 employs DDIM (prediction_type=sample), while DP uses DDPM with ϵ-prediction (prediction_type=epsilon) and enables clip_sample to stabilize training. Both policies use the AdamW optimizer with a learning rate of 1e-4, β = [0.95, 0.999], and a weight decay of 1e-6. A cosine learning rate scheduler with a linear warm-up of 500 steps is applied to improve convergence during the initial training phase. In our experiments, we set H = 8 and n_obs = 3 for both DP3 and DP.
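
The training and evaluation settings quoted in the Experiment Setup row can be collected into a small configuration sketch. This is only an illustration, not the authors' code: the names PolicyConfig, lr_at, and eval_grid are hypothetical, reading H as the prediction horizon is an assumption, the total number of optimizer steps is not reported and is therefore left as a parameter, and the cosine decay to zero after warm-up is assumed rather than stated. Values the excerpt does not report (such as clip_sample for DP3) are left as None.

```python
import math
from dataclasses import dataclass
from itertools import product
from typing import Optional


@dataclass(frozen=True)
class PolicyConfig:
    """Hyperparameters as quoted in the Experiment Setup row (hypothetical container)."""
    name: str
    epochs: int
    checkpoint_every: int
    batch_size: int
    sampler: str                  # "ddpm" or "ddim"
    prediction_type: str          # "epsilon" or "sample"
    clip_sample: Optional[bool]   # None means "not reported in the excerpt"
    eval_epochs: tuple            # checkpoints that are evaluated
    # Settings the excerpt reports as shared by both policies.
    learning_rate: float = 1e-4
    adam_betas: tuple = (0.95, 0.999)
    weight_decay: float = 1e-6
    warmup_steps: int = 500
    diffusion_steps: int = 100
    beta_schedule: str = "squaredcos_cap_v2"
    horizon: int = 8              # H = 8 (assumed to mean the horizon)
    n_obs: int = 3                # n_obs = 3


DP = PolicyConfig("DP", 300, 50, 64, "ddpm", "epsilon", True, (150, 200, 250))
DP3 = PolicyConfig("DP3", 3000, 500, 256, "ddim", "sample", None, (1500, 2000, 2500))
EVAL_SEEDS = (0, 1, 2)


def lr_at(step: int, total_steps: int, cfg: PolicyConfig) -> float:
    """Linear warm-up followed by cosine decay to zero (the decay target is assumed)."""
    if step < cfg.warmup_steps:
        return cfg.learning_rate * step / cfg.warmup_steps
    span = max(1, total_steps - cfg.warmup_steps)
    progress = min(1.0, (step - cfg.warmup_steps) / span)
    return cfg.learning_rate * 0.5 * (1.0 + math.cos(math.pi * progress))


def eval_grid(cfg: PolicyConfig) -> list:
    """All (checkpoint epoch, seed) pairs evaluated for one policy."""
    return list(product(cfg.eval_epochs, EVAL_SEEDS))


if __name__ == "__main__":
    for cfg in (DP, DP3):
        print(cfg.name, "evaluation runs:", len(eval_grid(cfg)))
```

Each policy yields nine evaluation runs (three checkpoints times three seeds) per training-set size. The excerpt says the mean and standard deviation of the success rates are reported, but it does not say how the runs are pooled, so that step is left out of the sketch.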