Going Beyond Heuristics by Imposing Policy Improvement as a Constraint
Authors: Chi-Chang Lee, Zhang-Wei Hong, Pulkit Agrawal
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on robotic locomotion, helicopter control, and manipulation tasks demonstrate that our method consistently outperforms the heuristic policy, regardless of the heuristic rewards' quality. Code is available at https://github.com/Improbable-AI/hepo. |
| Researcher Affiliation | Academia | Chi-Chang Lee1, Zhang-Wei Hong2, Pulkit Agrawal2 (Improbable AI Lab, Massachusetts Institute of Technology). 1 National Taiwan University, Taiwan. 2 Improbable AI Lab, MIT, Cambridge, USA. |
| Pseudocode | Yes | Algorithm 1 (Heuristic-Enhanced Policy Optimization, HEPO) and Algorithm 2 (Detailed Heuristic-Enhanced Policy Optimization) are provided. |
| Open Source Code | Yes | Code is available at https://github.com/Improbable-AI/hepo. |
| Open Datasets | Yes | We conduct experiments on 9 tasks from Isaac Gym (ISAAC) [19] and 20 tasks from the Bidexterous Manipulation (BI-DEX) benchmark [24]. |
| Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits (e.g., percentages or specific counts). It mentions 'finite data regime' and uses metrics like 'interquartile mean (IQM) of the normalized return' for evaluation, but the partitioning of data into specific training, validation, and testing sets is not detailed. |
| Hardware Specification | Yes | Each training procedure can be performed on a single GeForce RTX 2080 Ti device. |
| Software Dependencies | No | Our experiments are based on a continuous action actor-critic algorithm implemented in rl_games [25]. The paper refers to the rl_games GitHub repository (May 2021) but does not provide specific version numbers for rl_games or other key software dependencies like deep learning frameworks (e.g., PyTorch, TensorFlow). |
| Experiment Setup | Yes | The hyperparameters for updating the Lagrangian multiplier α in HEPO are: initial α = 0.0; step size (learning rate) η of α = 0.01; clipping range of δα = (−ϵα, ϵα) with ϵα = 1.0; range of α = [0, ∞). For PPO, we employed the same policy network and value network architecture and the same hyperparameters used in Isaac Gym Envs [19]. |
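
To make the "Experiment Setup" row concrete, below is a minimal, illustrative sketch of a projected dual-ascent update for the Lagrangian multiplier α using the reported hyperparameters (initial α = 0.0, step size η = 0.01, δα clipped to (−ϵα, ϵα) with ϵα = 1.0, α kept in [0, ∞)). The scalar `constraint_estimate` and the function name `update_alpha` are assumptions for illustration; the authors' exact constraint and update rule are given in Algorithms 1 and 2 of the paper and the hepo repository.

```python
import numpy as np

# Hyperparameters reported in the "Experiment Setup" row above.
ALPHA_INIT = 0.0   # initial Lagrangian multiplier alpha
ETA = 0.01         # step size (learning rate) eta for alpha
EPS_ALPHA = 1.0    # clipping range for the per-update change delta_alpha


def update_alpha(alpha: float, constraint_estimate: float) -> float:
    """One projected dual-ascent step on the Lagrangian multiplier.

    `constraint_estimate` is a hypothetical scalar that is positive when the
    policy-improvement constraint is violated and negative when it is
    satisfied; HEPO's exact definition is given in Algorithm 1 of the paper.
    """
    delta_alpha = float(np.clip(ETA * constraint_estimate, -EPS_ALPHA, EPS_ALPHA))
    return max(0.0, alpha + delta_alpha)  # project alpha back onto [0, inf)


# Toy usage: alpha grows while the constraint is violated and shrinks otherwise.
alpha = ALPHA_INIT
for violation in (0.5, 0.5, -0.2, -0.2):
    alpha = update_alpha(alpha, violation)
    print(f"alpha = {alpha:.3f}")
```

The clipping of δα and the projection onto [0, ∞) mirror the ranges listed in the table; everything else (how α is then used to weight heuristic versus task rewards in the PPO objective) is deferred to the paper's pseudocode and the rl_games-based implementation.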