Grounding Large Language Models in Interactive Environments with Online Reinforcement Learning

Authors: Thomas Carta, Clément Romac, Thomas Wolf, Sylvain Lamprier, Olivier Sigaud, Pierre-Yves Oudeyer

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We design a set of experiments in BabyAI-Text aiming to provide answers for the scientific questions introduced in Section 1. We plot the mean and standard deviation of the success rate (i.e. 1 if the goal has been reached, 0 otherwise) over 2 seeds of GFlan-T5, NPAE-Flan-T5, DRRN and Symbolic-PPO in Figure 2.
Researcher Affiliation | Collaboration | Inria (Flowers), University of Bordeaux, France; Hugging Face; Univ Angers, LERIA, SFR MATHSTIC, F-49000 Angers, France; Sorbonne University, ISIR, Paris, France.
Pseudocode | No | The paper describes its methods but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | by releasing, in addition to the code of this paper (https://github.com/flowersteam/Grounding_LLMs_with_online_RL), a Python library named Lamorel (https://github.com/flowersteam/lamorel) facilitating the use of LLMs at scale for RL practitioners.
Open Datasets | Yes | and transpose the BabyAI environment (Chevalier-Boisvert et al., 2019) into a textual version.
Dataset Splits | No | The paper mentions training and evaluating on test episodes, but it does not explicitly define or refer to a distinct validation split for hyperparameter tuning or early stopping.
Hardware Specification | Yes | When using Flan-T5 780M, each LLM instance is distributed (Vertical Model Parallelism) over 2 Nvidia A100 80GB GPUs, thus requiring a total of 8 Nvidia A100 80GB GPUs to run an experiment (2 GPUs × 4 LLM instances). For Flan-T5 80M and Flan-T5 3B, we respectively use 1 Nvidia V100 32GB and 4 Nvidia A100 80GB per LLM instance.
Software Dependencies | No | The paper mentions using the 'Hugging Face Transformers Python library' and 'Pytorch Distributed', as well as the 'Adam' optimizer, but it does not provide specific version numbers for these software components.
Experiment Setup | Yes | We reused PPO's hyperparameters from Ramamurthy et al. (2022) and did not perform any further tuning (see Table 7). We used an Adam (Kingma & Ba, 2014) optimizer with the hyperparameters listed in Table 8.
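The success-rate metric quoted under Research Type (1 if the goal is reached, 0 otherwise, with mean and standard deviation computed over seeds) can be sketched as follows. The per-seed episode outcomes below are made-up illustrative data, not results from the paper.

```python
import statistics

def success_rate(outcomes):
    """Mean success rate over binary episode outcomes
    (1 = goal reached, 0 = goal not reached)."""
    return sum(outcomes) / len(outcomes)

# Hypothetical per-seed episode outcomes (NOT the paper's data).
seed_outcomes = {
    0: [1, 0, 1, 1],  # seed 0: 3 of 4 episodes succeeded
    1: [1, 1, 0, 1],  # seed 1: 3 of 4 episodes succeeded
}

# One success rate per seed, then mean and std across seeds,
# as plotted in the paper's Figure 2.
per_seed = [success_rate(o) for o in seed_outcomes.values()]
mean_sr = statistics.mean(per_seed)   # 0.75
std_sr = statistics.stdev(per_seed)   # 0.0 (both seeds identical here)
print(mean_sr, std_sr)
```

With only 2 seeds the standard deviation is a coarse spread estimate, which is why the paper reports it alongside the mean rather than on its own.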
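The Experiment Setup row defers the Adam hyperparameters to the paper's Table 8, which is not reproduced here. As a reminder of what those hyperparameters control, below is a minimal pure-Python sketch of a single Adam update step (Kingma & Ba, 2014) for a scalar parameter, using the algorithm's common default values; the learning rate and betas shown are illustrative defaults, not the paper's settings.

```python
import math

def adam_step(theta, grad, m, v, t,
              lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter theta.
    m, v: running first/second moment estimates; t: 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimize f(theta) = theta^2 (gradient 2*theta) from theta = 1.0.
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 201):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
print(theta)  # theta has moved toward the minimum at 0
```

In practice one would use a framework optimizer (e.g. PyTorch's `torch.optim.Adam`) rather than hand-rolling the update; the sketch only makes explicit what the Table 8 hyperparameters (learning rate, betas, epsilon) parameterize.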