Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Adversarial Locomotion and Motion Imitation for Humanoid Policy Learning

Authors: Jiyuan Shi, Xinzhe Liu, Dewei Wang, Ouyang Lu, Sören Schwertfeger, Chi Zhang, Fuchun Sun, Chenjia Bai, Xuelong Li

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that our method achieves robust locomotion and precise motion tracking in both simulation and on the full-size Unitree H1-2 robot. In this section, we evaluate the performance of ALMI using the Unitree H1-2 robot in both simulated and real-world environments.
Researcher Affiliation | Collaboration | Jiyuan Shi (1), Xinzhe Liu (1, 2), Dewei Wang (1, 3), Ouyang Lu (1, 4), Sören Schwertfeger (2), Chi Zhang (1), Fuchun Sun (5), Chenjia Bai (1), Xuelong Li (1). Affiliations: (1) Institute of Artificial Intelligence (TeleAI), China Telecom; (2) ShanghaiTech University; (3) University of Science and Technology of China; (4) Northwestern Polytechnical University; (5) Tsinghua University. Correspondence to: Chenjia Bai (EMAIL)
Pseudocode | Yes | Algorithm 1 Training process of the locomotion policy
Open Source Code | Yes | Please see: (1) https://drive.google.com/file/d/12hK8wajdeDG3wN0_WWCt0NY0N9p1HVlA/view?usp=sharing for our code;
Open Datasets | Yes | Additionally, we release a large-scale whole-body motion control dataset featuring high-quality episodic trajectories from MuJoCo simulations. The project page is https://almi-humanoid.github.io. The dataset is available at https://almi-humanoid.github.io/. We adopt the high-quality CMU MoCap dataset [64] with 1122 motion clips (denoted as Dcmu) to evaluate different metrics in Isaac Gym [26]. reference motion from the AMASS dataset [8].
Dataset Splits | No | Using the ALMI-X dataset, we give preliminary attempts to train a whole-body foundation model with supervised learning that can execute various motions in response to text commands. We adopt the high-quality CMU MoCap dataset [64] with 1122 motion clips (denoted as Dcmu) to evaluate different metrics in Isaac Gym [26]. The text refers to datasets but does not specify explicit training/test/validation splits (e.g., percentages or exact counts) for the ALMI-X dataset, nor how the CMU MoCap dataset is split for the experiments.
Hardware Specification | No | Policy training is conducted within the Isaac Gym simulator utilizing 4,096 parallel environments. We deploy ALMI on the Unitree H1-2 robot to evaluate real-world performance. our policy does not require additional sim-to-real techniques and can be deployed directly on the NVIDIA Jetson Orin NX onboard the robot for inference. The paper mentions that policy training is conducted in Isaac Gym, which is GPU-based, but does not specify the type or model of GPUs used for training. It only specifies the NVIDIA Jetson Orin NX for inference on the robot.
Software Dependencies | No | We use PPO to train the policy and employ an asymmetric actor-critic architecture. Policy training is conducted within the Isaac Gym simulator. implemented with Pinocchio [77]. The paper mentions using PPO (an algorithm) and Isaac Gym (a simulator), but does not provide specific version numbers for these or any other software libraries or dependencies. Pinocchio is mentioned but without a version.
Experiment Setup | Yes | Policy training is conducted within the Isaac Gym simulator utilizing 4,096 parallel environments. We perform three rounds of adversarial iterations and evaluate the policies of the final round. Notably, during the adversarial iterative training process, the initial lower-body policy converges after approximately 104 steps. As iterations progress, the number of steps required for convergence decreases significantly. The total duration of the three training iterations is approximately 17 hours. The hyperparameters of the PPO algorithm and the information of the network backbone are listed in Table 10. Table 10: Hyperparameters related to PPO: Actor LSTM size [64], Actor MLP size [64, 32], Critic MLP size [64, 32], Optimizer Adam, Batch size 4096, Mini batches 4, Learning epochs 8, Activation ELU, Entropy coef (w_s) 0.01, Value loss coef (w_v) 1.0, Clip param 0.2, Max grad norm 1.0, Init noise std 0.8, Learning rate 1e-3, Desired KL 0.01, GAE decay factor (λ) 0.95, GAE discount factor (γ) 0.998, Curriculum window size (w) 40.
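To make the Table 10 values quoted in the Experiment Setup row easier to reuse, the sketch below collects them in a small Python configuration object and shows how three of them (γ, λ, and the clip parameter) would typically enter a standard PPO update. This is an illustration only, not the authors' code: the class name `PPOConfig`, its field names, and the helper functions `gae_advantages` and `clipped_surrogate` are hypothetical, and the formulas are the textbook generalized advantage estimation and PPO clipped objective rather than anything confirmed by the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PPOConfig:
    """Values transcribed from Table 10 as quoted above; field names are illustrative."""
    actor_lstm_size: tuple = (64,)
    actor_mlp_size: tuple = (64, 32)
    critic_mlp_size: tuple = (64, 32)
    batch_size: int = 4096
    mini_batches: int = 4
    learning_epochs: int = 8
    entropy_coef: float = 0.01
    value_loss_coef: float = 1.0
    clip_param: float = 0.2
    max_grad_norm: float = 1.0
    init_noise_std: float = 0.8
    learning_rate: float = 1e-3
    desired_kl: float = 0.01
    gae_lambda: float = 0.95
    gamma: float = 0.998
    curriculum_window: int = 40


def gae_advantages(rewards, values, last_value, gamma, lam):
    """Textbook GAE over one rollout segment with no episode termination inside it."""
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    # Walk backwards so each step can reuse the accumulated future advantage.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def clipped_surrogate(ratio, advantage, clip_param):
    """Standard PPO clipped objective for a single sample (a quantity to maximize)."""
    clipped_ratio = max(1.0 - clip_param, min(1.0 + clip_param, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    cfg = PPOConfig()
    adv = gae_advantages([1.0, 1.0], [0.0, 0.0], 0.0, cfg.gamma, cfg.gae_lambda)
    print(adv)
    print(clipped_surrogate(1.5, 2.0, cfg.clip_param))
```

With γ = 0.998 and λ = 0.95, each step's advantage carries about 94.8% of the next step's advantage, so credit propagates over long horizons. Anyone attempting a reproduction would still need the simulator and library versions that the scorecard marks as unspecified.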