RL-GPT: Integrating Reinforcement Learning and Code-as-policy

Authors: Shaoteng Liu, Haoqi Yuan, Minda Hu, Yanwei Li, Yukang Chen, Shu Liu, Zongqing Lu, Jiaya Jia

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our approach outperforms traditional RL methods and existing GPT agents, demonstrating superior efficiency. In the Minecraft game, it can obtain diamonds within a single day on an RTX3090. Additionally, it achieves good performance on the designated MineDojo tasks.
Researcher Affiliation | Collaboration | 1 The Chinese University of Hong Kong; 2 SmartMore; 3 The Hong Kong University of Science and Technology; 4 School of Computer Science, Peking University; 5 Beijing Academy of Artificial Intelligence
Pseudocode | Yes | Algorithm 1: RL-GPT's Two-loop Iteration
Open Source Code | No | We will release all the code upon acceptance.
Open Datasets | Yes | MineDojo [13] stands out as a pioneering framework developed within the renowned Minecraft game, tailored specifically for research involving embodied agents.
Dataset Splits | No | The paper describes training and evaluation processes and total sample counts (e.g., '10 million samples') but does not specify explicit training/validation/test dataset splits (e.g., percentages or exact counts for each split).
Hardware Specification | Yes | Specifically, within the MineDojo environment, it attains good performance on the majority of selected tasks and adeptly locates diamonds within a single day, utilizing only an RTX3090 GPU.
Software Dependencies | No | The paper mentions software such as the GPT-4 API and PPO, and refers to Python for coding, but it does not specify version numbers for these or other key software dependencies.
Experiment Setup | Yes | The training and evaluation are the same as MineAgent or other RL pipelines, as discussed in Appendix C. The difference is that our RL action space contains high-level coded actions generated by LLMs. Our method doesn't depend on any video pretraining; it can work with environment interaction alone. Similar to MineAgent [13], we employ Proximal Policy Optimization (PPO) [67] as the RL baseline. This approach alternates between sampling data through interactions with the environment and optimizing a 'surrogate' objective function using stochastic gradient ascent. For the slow agents and fast agents, we design special templates, response formats, and examples. We design special prompts such as 'assume you are an experienced RL researcher that is designing the RL training job for Minecraft'. Details can be found in Appendix A.
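
The Experiment Setup row above describes an RL action space augmented with high-level coded actions written by LLMs, trained with a standard PPO loop that alternates between sampling environment interactions and optimizing a surrogate objective. The following is a minimal, self-contained sketch of that action-space idea only; it is not the authors' released code. The toy environment, the example coded actions (chop_nearest_tree, craft_planks), the primitive-action list, and the random policy are hypothetical placeholders; the real experiments use MineDojo tasks and a PPO [67] policy fit to the sampled trajectories.

```python
"""Minimal sketch: folding LLM-written "coded actions" into a discrete RL action space.

Everything here is a hypothetical placeholder for illustration only (ToyEnv,
chop_nearest_tree, craft_planks, the primitive-action list, the random policy);
the paper's experiments instead use MineDojo tasks and a PPO policy.
"""
import random

# Hypothetical "code-as-policy" actions an LLM might write: each one unrolls into
# a short sequence of primitive environment actions.
def chop_nearest_tree(obs):
    return ["move_forward", "attack", "attack", "attack"]

def craft_planks(obs):
    return ["open_inventory", "craft"]

CODED_ACTIONS = [chop_nearest_tree, craft_planks]
PRIMITIVE_ACTIONS = ["move_forward", "turn_left", "turn_right", "attack", "jump"]


class ToyEnv:
    """Stand-in environment with a toy reward; MineDojo would be used in practice."""

    def reset(self):
        self.t = 0
        return {"t": self.t}

    def step(self, primitive):
        self.t += 1
        reward = 1.0 if primitive == "attack" else 0.0
        done = self.t >= 50
        return {"t": self.t}, reward, done


def execute(env, obs, action_index):
    """Map one index of the augmented action space onto primitive environment steps."""
    if action_index < len(PRIMITIVE_ACTIONS):
        return env.step(PRIMITIVE_ACTIONS[action_index])
    coded = CODED_ACTIONS[action_index - len(PRIMITIVE_ACTIONS)]
    total_reward, done = 0.0, False
    for primitive in coded(obs):  # a coded action expands into several primitive steps
        obs, r, done = env.step(primitive)
        total_reward += r
        if done:
            break
    return obs, total_reward, done


# Rollout over the augmented action space; PPO would alternate between collecting
# such trajectories and optimizing its surrogate objective on them.
env = ToyEnv()
obs = env.reset()
done, episode_return = False, 0.0
n_actions = len(PRIMITIVE_ACTIONS) + len(CODED_ACTIONS)
while not done:
    a = random.randrange(n_actions)  # placeholder for sampling from the PPO policy
    obs, r, done = execute(env, obs, a)
    episode_return += r
print("episode return:", episode_return)
```

The point of the sketch is the augmented action space: the policy chooses among both primitive actions and LLM-coded macro-actions, and each coded action is unrolled into primitive environment steps during sampling, so the surrounding PPO machinery is unchanged.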