Going Beyond Heuristics by Imposing Policy Improvement as a Constraint

Authors: Chi-Chang Lee, Zhang-Wei Hong, Pulkit Agrawal

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our experiments on robotic locomotion, helicopter control, and manipulation tasks demonstrate that our method consistently outperforms the heuristic policy, regardless of the heuristic rewards' quality. Code is available at https://github.com/Improbable-AI/hepo."
Researcher Affiliation | Academia | Chi-Chang Lee¹, Zhang-Wei Hong², Pulkit Agrawal² (Improbable AI Lab, Massachusetts Institute of Technology). ¹National Taiwan University, Taiwan. ²Improbable AI Lab, MIT, Cambridge, USA.
Pseudocode | Yes | Algorithm 1 (Heuristic-Enhanced Policy Optimization, HEPO) and Algorithm 2 (Detailed Heuristic-Enhanced Policy Optimization) are provided.
Open Source Code | Yes | "Code is available at https://github.com/Improbable-AI/hepo."
Open Datasets | Yes | "We conduct experiments on 9 tasks from Isaac Gym (ISAAC) [19] and 20 tasks from the Bidexterous Manipulation (BI-DEX) benchmark [24]."
Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits (e.g., percentages or specific counts). It mentions a 'finite data regime' and evaluates with the interquartile mean (IQM) of the normalized return, but it does not detail any partitioning of data into training, validation, and test sets.
Hardware Specification | Yes | "Each training procedure can be performed on a single GeForce RTX 2080 Ti device."
Software Dependencies | No | "Our experiments are based on a continuous action actor-critic algorithm implemented in rl_games [25]." The paper refers to the rl_games GitHub repository (May 2021) but does not give specific version numbers for rl_games or for other key software dependencies such as the deep learning framework (e.g., PyTorch or TensorFlow).
Experiment Setup | Yes | The hyperparameters for updating the Lagrangian multiplier α in HEPO are: initial α = 0.0; step size (learning rate) η of α = 0.01; clipping range of δα = (−ε_α, ε_α) with ε_α = 1.0; range of α = [0, ∞). For PPO, the same policy and value network architecture and the same hyperparameters as in Isaac Gym Envs [19] were used.
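
To make the reported α-update hyperparameters concrete, below is a minimal sketch of a clipped dual-gradient step consistent with them. Only the constants (initial α, step size η, clipping bound ε_α, and the non-negativity of α) come from the paper's table; the function name, the sign convention of the step, and the `delta_alpha` signal are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

ALPHA_INIT = 0.0   # Initial α
ETA = 0.01         # Step size (learning rate) η of α
EPS_ALPHA = 1.0    # Clipping range of δα: (-ε_α, ε_α)

def update_alpha(alpha: float, delta_alpha: float) -> float:
    """One clipped update of the Lagrangian multiplier α.

    `delta_alpha` is assumed to be the raw update signal (e.g. an estimate
    of the policy-improvement constraint violation). It is clipped to
    (-ε_α, ε_α) before the step, and α is projected back onto [0, ∞).
    """
    step = np.clip(delta_alpha, -EPS_ALPHA, EPS_ALPHA)
    return max(0.0, alpha + ETA * step)

# Example usage with hypothetical update signals.
alpha = ALPHA_INIT
alpha = update_alpha(alpha, delta_alpha=0.5)    # small signal: α moves by η * 0.5
alpha = update_alpha(alpha, delta_alpha=-2.0)   # large negative signal is clipped to -ε_α
```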