Grounding Large Language Models in Interactive Environments with Online Reinforcement Learning

Authors: Thomas Carta, Clément Romac, Thomas Wolf, Sylvain Lamprier, Olivier Sigaud, Pierre-Yves Oudeyer

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We design a set of experiments in BabyAI-Text aiming to provide answers for the scientific questions introduced in Section 1. We plot the mean and standard deviation of the success rate (i.e. 1 if the goal has been reached, 0 otherwise) over 2 seeds of GFlan-T5, NPAE-Flan-T5, DRRN and Symbolic-PPO in Figure 2.
Researcher Affiliation | Collaboration | Inria (Flowers), University of Bordeaux, France; Hugging Face; Univ Angers, LERIA, SFR MATHSTIC, F-49000 Angers, France; Sorbonne University, ISIR, Paris, France.
Pseudocode | No | The paper describes its methods but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | by releasing, in addition to the code of this paper (https://github.com/flowersteam/Grounding_LLMs_with_online_RL), a Python library named Lamorel (https://github.com/flowersteam/lamorel) facilitating the use of LLMs at scale for RL practitioners.
Open Datasets | Yes | and transpose the BabyAI environment (Chevalier-Boisvert et al., 2019) into a textual version.
Dataset Splits | No | The paper mentions training and evaluating on test episodes, but it does not explicitly define or refer to a distinct validation split for hyperparameter tuning or early stopping.
Hardware Specification | Yes | When using Flan-T5 780M, each LLM instance is distributed (Vertical Model Parallelism) over 2 Nvidia A100 80GB GPUs, thus requiring a total of 8 Nvidia A100 80GB GPUs to run an experiment (2 GPUs × 4 LLM instances). For Flan-T5 80M and Flan-T5 3B, we respectively use 1 Nvidia V100 32GB and 4 Nvidia A100 80GB per LLM instance.
Software Dependencies | No | The paper mentions using the 'Hugging Face Transformers Python library' and 'Pytorch Distributed', as well as the 'Adam' optimizer, but it does not provide specific version numbers for these software components.
Experiment Setup | Yes | We reused PPO's hyperparameters from Ramamurthy et al. (2022) and did not perform any further tuning (see Table 7). We used an Adam (Kingma & Ba, 2014) optimizer with the hyperparameters listed in Table 8.
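The success-rate metric quoted under Research Type (1 if the goal is reached, 0 otherwise, with mean and standard deviation computed over seeds) can be sketched as follows. The per-seed episode outcomes below are made-up illustrative data, not results from the paper.

```python
import statistics

def success_rate(outcomes):
    """Mean success rate over binary episode outcomes
    (1 = goal reached, 0 = goal not reached)."""
    return sum(outcomes) / len(outcomes)

# Hypothetical per-seed episode outcomes (NOT the paper's data).
seed_outcomes = {
    0: [1, 0, 1, 1],  # seed 0: 3 of 4 episodes succeeded
    1: [1, 1, 0, 1],  # seed 1: 3 of 4 episodes succeeded
}

# One success rate per seed, then mean and std across seeds,
# as plotted in the paper's Figure 2.
per_seed = [success_rate(o) for o in seed_outcomes.values()]
mean_sr = statistics.mean(per_seed)   # 0.75
std_sr = statistics.stdev(per_seed)   # 0.0 (both seeds identical here)
print(mean_sr, std_sr)
```

With only 2 seeds the standard deviation is a coarse spread estimate, which is why the paper reports it alongside the mean rather than on its own.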
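The Experiment Setup row defers the Adam hyperparameters to the paper's Table 8, which is not reproduced here. As a reminder of what those hyperparameters control, below is a minimal pure-Python sketch of a single Adam update step (Kingma & Ba, 2014) for a scalar parameter, using the algorithm's common default values; the learning rate and betas shown are illustrative defaults, not the paper's settings.

```python
import math

def adam_step(theta, grad, m, v, t,
              lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter theta.
    m, v: running first/second moment estimates; t: 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimize f(theta) = theta^2 (gradient 2*theta) from theta = 1.0.
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 201):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
print(theta)  # theta has moved toward the minimum at 0
```

In practice one would use a framework optimizer (e.g. PyTorch's `torch.optim.Adam`) rather than hand-rolling the update; the sketch only makes explicit what the Table 8 hyperparameters (learning rate, betas, epsilon) parameterize.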