Grounding Large Language Models in Interactive Environments with Online Reinforcement Learning
Authors: Thomas Carta, Clément Romac, Thomas Wolf, Sylvain Lamprier, Olivier Sigaud, Pierre-Yves Oudeyer
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We design a set of experiments in BabyAI-Text aiming to provide answers for the scientific questions introduced in Section 1. We plot the mean and standard deviation of the success rate (i.e. 1 if the goal has been reached, 0 otherwise) over 2 seeds of GFlan-T5, NPAE-Flan-T5, DRRN and Symbolic-PPO in Figure 2. |
| Researcher Affiliation | Collaboration | 1 Inria (Flowers), University of Bordeaux, France; 2 Hugging Face; 3 Univ Angers, LERIA, SFR MATHSTIC, F-49000 Angers, France; 4 Sorbonne University, ISIR, Paris, France. |
| Pseudocode | No | The paper describes its methods but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | by releasing, in addition of the code of this paper [1], a Python library named Lamorel [2] facilitating the use of LLMs at scale for RL practitioners. ... [1] https://github.com/flowersteam/Grounding_LLMs_with_online_RL [2] https://github.com/flowersteam/lamorel |
| Open Datasets | Yes | and transpose the BabyAI environment (Chevalier-Boisvert et al., 2019) into a textual version. |
| Dataset Splits | No | The paper mentions training and evaluating on test episodes, but it does not explicitly define or refer to a distinct validation split for hyperparameter tuning or early stopping. |
| Hardware Specification | Yes | When using Flan-T5 780M, each LLM instance is distributed (Vertical Model Parallelism) over 2 Nvidia A100 80GB GPUs, requiring thus a total of 8 Nvidia A100 80GB GPUs to run an experiment (2 GPUs × 4 LLM instances). For Flan-T5 80M and Flan-T5 3B, we respectively use 1 Nvidia V100 32GB and 4 Nvidia A100 80GB per LLM instance. |
| Software Dependencies | No | The paper mentions using the 'Hugging Face Transformers Python library' and 'PyTorch Distributed', as well as the 'Adam' optimizer, but it does not provide specific version numbers for these software components. |
| Experiment Setup | Yes (see the sketch after the table) | We reused PPO's hyperparameters from Ramamurthy et al. (2022) and did not perform any further tuning (see Table 7). We used an Adam (Kingma & Ba, 2014) optimizer with the hyperparameters listed in Table 8. |
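
The hardware and experiment-setup rows above reference a Flan-T5 780M instance split across two A100 GPUs and an Adam optimizer whose hyperparameters live in the paper's Tables 7-8. The following is a minimal sketch of such a setup using the Hugging Face Transformers and PyTorch APIs rather than the authors' Lamorel library; the model checkpoint, prompt, action string, and optimizer values are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch, not the authors' Lamorel-based setup: load Flan-T5 780M
# ("google/flan-t5-large") sharded across the available devices and attach
# an Adam optimizer. Learning rate and epsilon below are placeholders, not
# the values from Tables 7-8 of the paper.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "google/flan-t5-large"  # ~780M-parameter Flan-T5 checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# device_map="auto" (via the accelerate package) spreads the model's layers
# over the visible GPUs, a rough stand-in for the vertical model parallelism
# the paper applies over 2 A100s per LLM instance.
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME, device_map="auto")

# Adam optimizer as referenced in the paper; lr and eps are illustrative only.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5, eps=1e-8)

# Sanity check: compute the log-likelihood of a candidate action given a
# BabyAI-Text-style textual observation (prompt and action are made up here).
prompt = "Goal of the agent: go to the red ball. Observation: you see a wall 2 steps forward."
action = "turn left"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
labels = tokenizer(action, return_tensors="pt").input_ids.to(model.device)
loss = model(**inputs, labels=labels).loss  # mean negative log-likelihood per action token
print(f"Per-token NLL of the action: {loss.item():.3f}")
```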