Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Hierarchical Equivariant Policy via Frame Transfer
Authors: Haibo Zhao, Dian Wang, Yizhe Zhu, Xupeng Zhu, Owen Lewis Howell, Linfeng Zhao, Yaoyao Qian, Robin Walters, Robert Platt
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | HEP achieves state-of-the-art performance in complex robotic manipulation tasks, demonstrating significant improvements in both simulation and real-world settings. (Code and videos are available at project page.) |
| Researcher Affiliation | Academia | ¹Northeastern University, ²Robotics and AI Institute. Correspondence to: Dian Wang <EMAIL>. |
| Pseudocode | No | The paper describes the methodology using prose and mathematical equations but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | (Code and videos are available at project page.) |
| Open Datasets | Yes | To evaluate our policy, we first perform experiments in simulated environments in the RLBench (James et al., 2020) benchmark implemented using CoppeliaSim (Rohmer et al., 2013) and PyRep (James et al., 2019). |
| Dataset Splits | Yes | Each task is trained using 100 demonstrations; more detailed task descriptions and visualizations are provided in Appendix F. We experiment in three tasks as shown in Figure 6. These tasks are challenging due to their extremely long horizon (each can be divided into 6 to 9 sub-tasks) and the diverse types of manipulation involved. Evaluations are conducted in 20 trials: 10 with object placements similar to the training dataset and 10 with unseen placements. To evaluate the generalizability of our model, we perform a one-shot experiment where the model is trained to finish a pick-place task with only one demonstration. During testing, the object is placed in unseen poses, as shown in Figure 7. The results in Table 4 demonstrate the strong generalizability of our model, achieving an 80% success rate over 20 trials. |
| Hardware Specification | No | Our real-world experiments are conducted on a UR5e robotic arm equipped with a Robotiq 2F-85 gripper and three Intel RealSense D455 cameras as shown in Figure 10. Demonstrations are collected using a 6-DoF 3Dconnexion SpaceMouse at a 10 Hz rate, logging both the visual observations (from all three cameras) and the robot's end-effector actions (position, orientation, and gripper states). |
| Software Dependencies | No | We train our models with the AdamW (Loshchilov & Hutter, 2019) optimizer (with a learning rate of 10⁻⁴ and weight decay of 5×10⁻⁴). We use DDPM (Ho et al., 2020) with 100 denoising steps for both training and evaluation. We train each task for 100,000 iterations. In practice, we implement the T(3)-invariance in the PointNet by using the relative position to the center of each voxel, and implement the SO(2)-equivariance using escnn (Cesa et al., 2022). |
| Experiment Setup | Yes | In the simulation experiments, we use a batch size of 16 for training. Specifically, the observation contains one step of history observation and 3 steps of history action, and the output of the denoising process is a sequence of 18 action steps. In open-loop control we use all 18 steps for both training and execution, similar to prior work (Xian et al., 2023). In closed-loop control, 18 steps are used for training and 9 steps for execution, similar to the setting of Wang et al. (2024a). We train our models with the AdamW (Loshchilov & Hutter, 2019) optimizer (with a learning rate of 10⁻⁴ and weight decay of 5×10⁻⁴). We use DDPM (Ho et al., 2020) with 100 denoising steps for both training and evaluation. We train each task for 100,000 iterations. |
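The quoted setup names a concrete training configuration: AdamW with learning rate 10⁻⁴ and weight decay 5×10⁻⁴, plus DDPM with 100 denoising steps. The sketch below shows what that hyperparameter combination looks like in plain PyTorch; the linear model, linear beta schedule, and data shapes are illustrative placeholders, not the paper's actual policy network or noise schedule.

```python
import torch

# Toy noise-prediction network standing in for the paper's policy model.
model = torch.nn.Linear(8, 8)

# AdamW with the quoted learning rate (1e-4) and weight decay (5e-4).
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=5e-4)

# DDPM-style schedule with the quoted 100 denoising steps
# (linear betas are an assumption; the paper does not specify the schedule).
T = 100
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

# One DDPM training step: noise clean samples at a random timestep,
# then regress the added noise with an MSE loss.
x0 = torch.randn(16, 8)            # "clean" action chunks, batch size 16
t = torch.randint(0, T, (16,))     # random timestep per sample
noise = torch.randn_like(x0)
a_bar = alphas_cumprod[t].unsqueeze(-1)
xt = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
loss = torch.nn.functional.mse_loss(model(xt), noise)
loss.backward()
optimizer.step()
```

This only demonstrates the quoted optimizer and diffusion hyperparameters; the actual HEP architecture (equivariant PointNet encoder, frame transfer) is described in the paper itself.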