Going Beyond Heuristics by Imposing Policy Improvement as a Constraint

Authors: Chi-Chang Lee, Zhang-Wei Hong, Pulkit Agrawal

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our experiments on robotic locomotion, helicopter control, and manipulation tasks demonstrate that our method consistently outperforms the heuristic policy, regardless of the heuristic rewards' quality. Code is available at https://github.com/Improbable-AI/hepo."
Researcher Affiliation | Academia | Chi-Chang Lee¹, Zhang-Wei Hong², Pulkit Agrawal² (Improbable AI Lab, Massachusetts Institute of Technology). ¹National Taiwan University, Taiwan. ²Improbable AI Lab, MIT, Cambridge, USA.
Pseudocode | Yes | Algorithm 1 (Heuristic-Enhanced Policy Optimization, HEPO) and Algorithm 2 (Detailed Heuristic-Enhanced Policy Optimization) are provided.
Open Source Code | Yes | "Code is available at https://github.com/Improbable-AI/hepo."
Open Datasets | Yes | "We conduct experiments on 9 tasks from Isaac Gym (ISAAC) [19] and 20 tasks from the Bidexterous Manipulation (BI-DEX) benchmark [24]."
Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits (e.g., percentages or specific counts). It mentions a 'finite data regime' and evaluates with the interquartile mean (IQM) of the normalized return, but it does not detail any partitioning of data into training, validation, and test sets.
Hardware Specification | Yes | "Each training procedure can be performed on a single GeForce RTX 2080 Ti device."
Software Dependencies | No | "Our experiments are based on a continuous action actor-critic algorithm implemented in rl_games [25]." The paper refers to the rl_games GitHub repository (May 2021) but does not give specific version numbers for rl_games or for other key software dependencies such as the deep learning framework (e.g., PyTorch or TensorFlow).
Experiment Setup | Yes | The hyperparameters for updating the Lagrangian multiplier α in HEPO are: initial α = 0.0; step size (learning rate) η of α = 0.01; clipping range of δα = (−ε_α, ε_α) with ε_α = 1.0; range of α = [0, ∞). For PPO, the same policy and value network architecture and the same hyperparameters as in Isaac Gym Envs [19] were used.
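
To make the reported α-update hyperparameters concrete, below is a minimal sketch of a clipped dual-gradient step consistent with them. Only the constants (initial α, step size η, clipping bound ε_α, and the non-negativity of α) come from the paper's table; the function name, the sign convention of the step, and the `delta_alpha` signal are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

ALPHA_INIT = 0.0   # Initial α
ETA = 0.01         # Step size (learning rate) η of α
EPS_ALPHA = 1.0    # Clipping range of δα: (-ε_α, ε_α)

def update_alpha(alpha: float, delta_alpha: float) -> float:
    """One clipped update of the Lagrangian multiplier α.

    `delta_alpha` is assumed to be the raw update signal (e.g. an estimate
    of the policy-improvement constraint violation). It is clipped to
    (-ε_α, ε_α) before the step, and α is projected back onto [0, ∞).
    """
    step = np.clip(delta_alpha, -EPS_ALPHA, EPS_ALPHA)
    return max(0.0, alpha + ETA * step)

# Example usage with hypothetical update signals.
alpha = ALPHA_INIT
alpha = update_alpha(alpha, delta_alpha=0.5)    # small signal: α moves by η * 0.5
alpha = update_alpha(alpha, delta_alpha=-2.0)   # large negative signal is clipped to -ε_α
```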