Eureka: Human-Level Reward Design via Coding Large Language Models
Authors: Yecheng Jason Ma, William Liang, Guanzhi Wang, De-An Huang, Osbert Bastani, Dinesh Jayaraman, Yuke Zhu, Linxi Fan, Anima Anandkumar
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In a diverse suite of 29 open-source RL environments that include 10 distinct robot morphologies, EUREKA outperforms human experts on 83% of the tasks, leading to an average normalized improvement of 52%. |
| Researcher Affiliation | Collaboration | NVIDIA, UPenn, Caltech, UT Austin |
| Pseudocode | Yes | See Alg. 1 for pseudocode; all prompts are included in App. A. (A hedged sketch of this loop appears after the table.) |
| Open Source Code | No | We are committed to open-sourcing all prompts, environments, and generated reward functions to promote further research on LLM-based reward design. |
| Open Datasets | Yes | Our environments consist of 10 distinct robots and 29 tasks implemented using the Isaac Gym simulator (Makoviychuk et al., 2021). ... It is worth noting that both benchmarks are publicly released concurrently or after the GPT-4 knowledge cut-off date (September 2021), so GPT-4 is unlikely to have accumulated extensive internet knowledge about these tasks, making them ideal testbeds for assessing EUREKA's reward generation capability compared to measurable human-engineered reward functions. |
| Dataset Splits | No | The paper describes the training and evaluation procedures for RL policies (e.g., '5 independent PPO training runs' and '10 policy checkpoints sampled at fixed intervals'), but it does not define traditional train/validation/test dataset splits of the kind used with a fixed supervised-learning dataset. |
| Hardware Specification | Yes | All our experiments took place on a single 8×A100 GPU station. ... Eureka can be run on 4 V100 GPUs, which is readily accessible on an academic compute budget. |
| Software Dependencies | No | We use GPT-4 (OpenAI, 2023), in particular the gpt-4-0314 variant, as the backbone LLM for all LLM-based reward-design algorithms unless specified otherwise. |
| Experiment Setup | Yes | In all our experiments, EUREKA conducts 5 independent runs per environment, and for each run, searches for 5 iterations with K = 16 samples per iteration. |
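The Pseudocode and Experiment Setup rows describe the core EUREKA loop (Alg. 1 in the paper): the coding LLM samples K executable reward candidates from the task description and environment source code, each candidate is scored by a full RL training run, and textual reward reflection on the best candidate is fed back into the next query. Below is a minimal Python sketch of that loop under these assumptions; the helper functions (`sample_reward_candidates`, `train_policy`, `reflect_on_training`) and `eureka_run` are hypothetical placeholders rather than the authors' released code, and only the constants (5 iterations, K = 16) come from the quoted setup.

```python
# A minimal, hedged sketch of the EUREKA search loop (Alg. 1), not the authors' code.
# All helpers are hypothetical stand-ins; constants mirror the reported setup.

import random

NUM_ITERATIONS = 5   # search iterations per run (paper: "searches for 5 iterations")
K_SAMPLES = 16       # reward candidates sampled per iteration (paper: K = 16)


def sample_reward_candidates(task_description, env_source, feedback, k):
    """Hypothetical stand-in for querying the coding LLM (gpt-4-0314 in the paper)
    for k executable reward-function candidates."""
    return [f"reward_candidate_{i}" for i in range(k)]


def train_policy(reward_code):
    """Hypothetical stand-in for a full PPO training run in Isaac Gym that
    returns a task fitness score for the candidate reward."""
    return random.random()


def reflect_on_training(reward_code, score):
    """Hypothetical stand-in for reward reflection: summarize training
    statistics as textual feedback for the next LLM query."""
    return f"best candidate scored {score:.3f}"


def eureka_run(task_description, env_source):
    """One EUREKA run: iteratively sample, evaluate, and reflect on rewards."""
    best_reward, best_score, feedback = None, float("-inf"), ""
    for _ in range(NUM_ITERATIONS):
        candidates = sample_reward_candidates(
            task_description, env_source, feedback, K_SAMPLES)
        scores = [train_policy(code) for code in candidates]
        top_idx = max(range(len(scores)), key=scores.__getitem__)
        if scores[top_idx] > best_score:
            best_reward, best_score = candidates[top_idx], scores[top_idx]
        feedback = reflect_on_training(candidates[top_idx], scores[top_idx])
    return best_reward, best_score


if __name__ == "__main__":
    print(eureka_run("make the pen spin", "class ShadowHandEnv: ..."))
```

Per the quoted setup, this procedure would be repeated for 5 independent runs per environment, with the best reward and policy reported across runs.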