Eureka: Human-Level Reward Design via Coding Large Language Models

Authors: Yecheng Jason Ma, William Liang, Guanzhi Wang, De-An Huang, Osbert Bastani, Dinesh Jayaraman, Yuke Zhu, Linxi Fan, Anima Anandkumar

ICLR 2024

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "In a diverse suite of 29 open-source RL environments that include 10 distinct robot morphologies, EUREKA outperforms human experts on 83% of the tasks, leading to an average normalized improvement of 52%." |
| Researcher Affiliation | Collaboration | NVIDIA, UPenn, Caltech, UT Austin |
| Pseudocode | Yes | "See Alg. 1 for pseudocode; all prompts are included in App. A." |
| Open Source Code | No | "We are committed to open-sourcing all prompts, environments, and generated reward functions to promote further research on LLM-based reward design." |
| Open Datasets | Yes | "Our environments consist of 10 distinct robots and 29 tasks implemented using the Isaac Gym simulator (Makoviychuk et al., 2021). ... It is worth noting that both benchmarks are publicly released concurrently with, or after, the GPT-4 knowledge cut-off date (September 2021), so GPT-4 is unlikely to have accumulated extensive internet knowledge about these tasks, making them ideal testbeds for assessing EUREKA's reward generation capability compared to measurable human-engineered reward functions." |
| Dataset Splits | No | The paper describes the training and evaluation procedures for RL policies (e.g., '5 independent PPO training runs' and '10 policy checkpoints sampled at fixed intervals'), but does not define traditional train/validation/test splits, as would apply to a fixed dataset in supervised learning. |
| Hardware Specification | Yes | "All our experiments took place on a single 8 A100 GPU station. ... Eureka can be run on 4 V100 GPUs, which is readily accessible on an academic compute budget." |
| Software Dependencies | No | "We use GPT-4 (OpenAI, 2023), in particular the gpt-4-0314 variant, as the backbone LLM for all LLM-based reward-design algorithms unless specified otherwise." |
| Experiment Setup | Yes | "In all our experiments, EUREKA conducts 5 independent runs per environment, and for each run, searches for 5 iterations with K = 16 samples per iteration." |
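The experiment setup describes Eureka's outer evolutionary search: per environment, 5 independent runs, each searching for 5 iterations with K = 16 reward candidates sampled per iteration, where each iteration's best-scoring candidate seeds the next. A minimal sketch of that loop is below; the `propose` and `evaluate` helpers are hypothetical stand-ins (not from the paper) for, respectively, the LLM generating reward-function candidates and the RL training plus task-metric evaluation step.

```python
import random

def eureka_search(evaluate, propose, runs=5, iterations=5, K=16, seed=0):
    """Sketch of the outer search loop from the paper's setup:
    `runs` independent runs, each doing `iterations` iterations with
    K candidates per iteration; the best candidate so far seeds the
    next iteration (standing in for Eureka's reward reflection)."""
    rng = random.Random(seed)
    best_overall = None  # (score, candidate)
    for _ in range(runs):
        best = None  # each run starts fresh
        for _ in range(iterations):
            candidates = propose(best, K, rng)          # LLM stand-in
            scored = [(evaluate(c), c) for c in candidates]  # RL stand-in
            top = max(scored, key=lambda sc: sc[0])
            if best is None or top[0] > best[0]:
                best = top  # keep best-so-far as the next seed
        if best_overall is None or best[0] > best_overall[0]:
            best_overall = best
    return best_overall

# Toy stand-ins: a "reward candidate" is just a number, and its task
# score is its (negated) distance to a hidden optimum at 3.0.
def propose(best, K, rng):
    center = best[1] if best is not None else 0.0
    return [center + rng.gauss(0, 1) for _ in range(K)]

def evaluate(c):
    return -abs(c - 3.0)

score, cand = eureka_search(evaluate, propose)
```

In the real system the evaluate step is far more expensive (full PPO training per candidate), which is why the candidates within an iteration are trained in parallel on the multi-GPU station noted in the hardware row.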