RL-GPT: Integrating Reinforcement Learning and Code-as-policy
Authors: Shaoteng Liu, Haoqi Yuan, Minda Hu, Yanwei Li, Yukang Chen, Shu Liu, Zongqing Lu, Jiaya Jia
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our approach outperforms traditional RL methods and existing GPT agents, demonstrating superior efficiency. In the Minecraft game, it rapidly obtains diamonds within a single day on an RTX3090. Additionally, it achieves good performance on designated MineDojo tasks. |
| Researcher Affiliation | Collaboration | 1 The Chinese University of Hong Kong; 2 SmartMore; 3 The Hong Kong University of Science and Technology; 4 School of Computer Science, Peking University; 5 Beijing Academy of Artificial Intelligence |
| Pseudocode | Yes | Algorithm 1: RL-GPT's Two-loop Iteration |
| Open Source Code | No | We will release all the code upon acceptance. |
| Open Datasets | Yes | MineDojo [13] stands out as a pioneering framework developed within the renowned Minecraft game, tailored specifically for research involving embodied agents. |
| Dataset Splits | No | The paper describes training and evaluation processes and total samples (e.g., '10 million samples') but does not specify explicit training/validation/test dataset splits (e.g., percentages or exact counts for each split). |
| Hardware Specification | Yes | Specifically, within the MineDojo environment, it attains good performance on the majority of selected tasks and adeptly locates diamonds within a single day, utilizing only an RTX3090 GPU. |
| Software Dependencies | No | The paper mentions software like GPT-4 API and PPO, and refers to Python for coding, but it does not specify version numbers for these or other key software dependencies. |
| Experiment Setup | Yes | The training and evaluation are the same as MineAgent or other RL pipelines, as discussed in Appendix C. The difference is that our RL action space contains high-level coded actions generated by LLMs. Our method doesn't depend on any video pretraining. It can work with only environment interaction. Similar to MineAgent [13], we employ Proximal Policy Optimization (PPO) [67] as the RL baseline. This approach alternates between sampling data through interactions with the environment and optimizing a "surrogate" objective function using stochastic gradient ascent. For the slow agents and fast agents, we design special templates, responding formats, and examples. We design some special prompts, such as "assume you are an experienced RL researcher that is designing the RL training job for Minecraft". Details can be found in Appendix A. (An illustrative sketch of this setup appears after the table.) |
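The Pseudocode and Experiment Setup rows describe a two-loop iteration in which a slow LLM agent decides which sub-tasks should be implemented as code, a fast LLM agent writes that code, and PPO is then trained over an action space augmented with the resulting coded actions. Since the authors' code is not released, the snippet below is only a minimal illustrative sketch of that structure under stated assumptions; every name (`CodedAction`, `query_llm`, `slow_agent_decompose`, `fast_agent_write_code`, `build_action_space`) and all string values are hypothetical placeholders, not the paper's implementation.

```python
# Minimal sketch (not the authors' code) of the setup described above:
# a slow LLM agent decides which sub-actions to code, a fast LLM agent
# emits executable actions for them, and the resulting callables are
# appended to the RL agent's discrete action space before PPO training.
# All names and returned values here are hypothetical placeholders.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CodedAction:
    """A high-level action implemented as code rather than learned by RL."""
    name: str
    fn: Callable[[dict], dict]  # maps an observation to an env command


def query_llm(prompt: str) -> str:
    """Placeholder for a GPT-4 API call; returns canned text here."""
    return "craft_wooden_pickaxe"  # hypothetical response


def slow_agent_decompose(task: str) -> List[str]:
    """Slow agent: decide which sub-tasks are easier to code than to learn."""
    # Per the table, this is an LLM call with a specially designed prompt
    # ("assume you are an experienced RL researcher ...", Appendix A).
    return [query_llm(f"Which parts of '{task}' should be coded?")]


def fast_agent_write_code(subtask: str) -> CodedAction:
    """Fast agent: turn a sub-task description into an executable action."""
    # The generated "code" is reduced to a stub command for illustration.
    return CodedAction(name=subtask, fn=lambda obs: {"command": subtask})


def build_action_space(task: str, primitive_actions: List[str]) -> List[object]:
    """Augment low-level actions with LLM-coded high-level actions."""
    coded = [fast_agent_write_code(s) for s in slow_agent_decompose(task)]
    return primitive_actions + coded


if __name__ == "__main__":
    actions = build_action_space(
        task="obtain a diamond",
        primitive_actions=["move", "turn", "attack", "use"],
    )
    # A PPO learner (e.g., the MineAgent-style PPO baseline cited above)
    # would now be trained over this augmented discrete action space; the
    # two-loop iteration repeats decomposition and coding as feedback from
    # RL training arrives.
    print([a if isinstance(a, str) else a.name for a in actions])
```

In practice the coded actions would wrap MineDojo environment calls and the PPO learner would come from an existing RL library; both are stubbed here so the snippet runs standalone.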