RL-GPT: Integrating Reinforcement Learning and Code-as-policy
Authors: Shaoteng Liu, Haoqi Yuan, Minda Hu, Yanwei Li, Yukang Chen, Shu Liu, Zongqing Lu, Jiaya Jia
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our approach outperforms traditional RL methods and existing GPT agents, demonstrating superior efficiency. In the Minecraft game, it rapidly obtains diamonds within a single day on an RTX3090. Additionally, it achieves good performance on designated MineDojo tasks. |
| Researcher Affiliation | Collaboration | 1 The Chinese University of Hong Kong; 2 SmartMore; 3 The Hong Kong University of Science and Technology; 4 School of Computer Science, Peking University; 5 Beijing Academy of Artificial Intelligence |
| Pseudocode | Yes | Algorithm 1: RL-GPT's Two-loop Iteration |
| Open Source Code | No | We will release all the code upon acceptance. |
| Open Datasets | Yes | MineDojo [13] stands out as a pioneering framework developed within the renowned Minecraft game, tailored specifically for research involving embodied agents. |
| Dataset Splits | No | The paper describes training and evaluation processes and total samples (e.g., '10 million samples') but does not specify explicit training/validation/test dataset splits (e.g., percentages or exact counts for each split). |
| Hardware Specification | Yes | Specifically, within the MineDojo environment, it attains good performance on the majority of selected tasks and adeptly locates diamonds within a single day, utilizing only an RTX3090 GPU. |
| Software Dependencies | No | The paper mentions software like GPT-4 API and PPO, and refers to Python for coding, but it does not specify version numbers for these or other key software dependencies. |
| Experiment Setup | Yes | The training and evaluation are the same as MineAgent or other RL pipelines, as discussed in Appendix C. The difference is that our RL action space contains high-level coded actions generated by LLMs. Our method doesn't depend on any video pretraining. It can work with only environment interaction. Similar to MineAgent [13], we employ Proximal Policy Optimization (PPO) [67] as the RL baseline. This approach alternates between sampling data through interactions with the environment and optimizing a "surrogate" objective function using stochastic gradient ascent. For the slow agents and fast agents, we design special templates, responding formats, and examples. We design some special prompts, such as "assume you are an experienced RL researcher that is designing the RL training job for Minecraft". Details can be found in Appendix A. (An illustrative sketch of this setup appears after the table.) |
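The Pseudocode and Experiment Setup rows describe a two-loop iteration in which a slow LLM agent decides which sub-tasks should be implemented as code, a fast LLM agent writes that code, and PPO is then trained over an action space augmented with the resulting coded actions. Since the authors' code is not released, the snippet below is only a minimal illustrative sketch of that structure under stated assumptions; every name (`CodedAction`, `query_llm`, `slow_agent_decompose`, `fast_agent_write_code`, `build_action_space`) and all string values are hypothetical placeholders, not the paper's implementation.

```python
# Minimal sketch (not the authors' code) of the setup described above:
# a slow LLM agent decides which sub-actions to code, a fast LLM agent
# emits executable actions for them, and the resulting callables are
# appended to the RL agent's discrete action space before PPO training.
# All names and returned values here are hypothetical placeholders.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CodedAction:
    """A high-level action implemented as code rather than learned by RL."""
    name: str
    fn: Callable[[dict], dict]  # maps an observation to an env command


def query_llm(prompt: str) -> str:
    """Placeholder for a GPT-4 API call; returns canned text here."""
    return "craft_wooden_pickaxe"  # hypothetical response


def slow_agent_decompose(task: str) -> List[str]:
    """Slow agent: decide which sub-tasks are easier to code than to learn."""
    # Per the table, this is an LLM call with a specially designed prompt
    # ("assume you are an experienced RL researcher ...", Appendix A).
    return [query_llm(f"Which parts of '{task}' should be coded?")]


def fast_agent_write_code(subtask: str) -> CodedAction:
    """Fast agent: turn a sub-task description into an executable action."""
    # The generated "code" is reduced to a stub command for illustration.
    return CodedAction(name=subtask, fn=lambda obs: {"command": subtask})


def build_action_space(task: str, primitive_actions: List[str]) -> List[object]:
    """Augment low-level actions with LLM-coded high-level actions."""
    coded = [fast_agent_write_code(s) for s in slow_agent_decompose(task)]
    return primitive_actions + coded


if __name__ == "__main__":
    actions = build_action_space(
        task="obtain a diamond",
        primitive_actions=["move", "turn", "attack", "use"],
    )
    # A PPO learner (e.g., the MineAgent-style PPO baseline cited above)
    # would now be trained over this augmented discrete action space; the
    # two-loop iteration repeats decomposition and coding as feedback from
    # RL training arrives.
    print([a if isinstance(a, str) else a.name for a in actions])
```

In practice the coded actions would wrap MineDojo environment calls and the PPO learner would come from an existing RL library; both are stubbed here so the snippet runs standalone.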