Policy Improvement using Language Feedback Models
Authors: Victor Zhong, Dipendra Misra, Xingdi Yuan, Marc-Alexandre Côté
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce Language Feedback Models (LFMs) that identify desirable behaviour (actions that help achieve tasks specified in the instruction) for imitation learning in instruction following. To train LFMs, we obtain feedback from Large Language Models (LLMs) on visual trajectories verbalized to language descriptions. First, by using LFMs to identify desirable behaviour to imitate, we improve in task-completion rate over strong behavioural cloning baselines on three distinct language grounding environments (Touchdown, ScienceWorld, and ALFWorld). |
| Researcher Affiliation | Collaboration | Victor Zhong (University of Waterloo; Microsoft Research) victor.zhong@uwaterloo.ca; Dipendra Misra (Microsoft Research); Xingdi Yuan (Microsoft Research); Marc-Alexandre Côté (Microsoft Research) |
| Pseudocode | Yes | Appendix E provides pseudo-code for the entire procedure for policy improvement using LFMs. Algorithm 1 TRAINFEEDBACKMODEL: Training a Language Feedback Model using LLM feedback. Algorithm 2 IMITATEUSINGFEEDBACK: Imitation learning using desirable behaviour identified by a feedback model. Algorithm 3 Policy improvement using Language Feedback Models. A structural sketch of how these three algorithms fit together is given after the table. |
| Open Source Code | Yes | Source code for our environments and experiments is available at github.com/vzhong/language_feedback_models. |
| Open Datasets | Yes | We evaluate using LFMs for policy improvement on three distinct language grounding benchmarks. Formally, the environments from a benchmark are distinct partially-observed Markov Decision Processes that share some (or all) of the environment dynamics but have different instructions, observations, and/or action spaces. ... ALFWorld is a verbalization of ALFRED [36]... ScienceWorld is a textual simulation benchmark for basic science experiments [42]... Touchdown is a navigation benchmark where the agent navigates Google Street View images to follow long, compositional instructions [12]. |
| Dataset Splits | Yes | This data is split into an 80% train / 20% validation dataset to train the LFM. |
| Hardware Specification | Yes | We train feedback models and policies using 80GB A100 GPUs. To produce rollouts in parallel, we use a cluster of 200 32GB V100 GPUs. |
| Software Dependencies | Yes | We use GPT-4 (2023-03-15) for action prediction and feedback, and finetune a 770M FLAN-T5 [13] for policy and feedback models. Verbalized observations v contain the most recent 20 steps. ... Touchdown verbalization uses vit-large-patch14. |
| Experiment Setup | Yes | We train models for 10k steps with batch 20, learning rate 5e-5, and early stopping over validation demos. For ACTPRED and LFM, we limit the amount of LLM usage to 100k GPT-2 tokens. A hedged fine-tuning sketch based on these settings follows the table. |
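
The Pseudocode row above summarises three algorithms from Appendix E of the paper. The following is a minimal structural sketch of how they could fit together; it is not the authors' implementation, and every callable passed in (`rollout`, `llm_feedback`, `fit`, `behaviour_clone`) is a placeholder standing in for environment rollouts, GPT-4 feedback, and FLAN-T5 fine-tuning respectively.

```python
from typing import Callable, List, Tuple

# One verbalized step: (text description of the observation so far, action taken).
Step = Tuple[str, str]
Trajectory = List[Step]
FeedbackModel = Callable[[Step], bool]  # True if the step is judged desirable


def train_feedback_model(
    rollout: Callable[[], Trajectory],                 # runs the base policy in an environment
    llm_feedback: Callable[[Trajectory], List[bool]],  # LLM labels each verbalized step
    fit: Callable[[List[Tuple[Step, bool]]], FeedbackModel],  # trains the small LFM
    n_rollouts: int,
) -> FeedbackModel:
    """Algorithm 1 (TRAINFEEDBACKMODEL): label base-policy rollouts with LLM
    feedback, then train a compact feedback model on those labels."""
    labelled: List[Tuple[Step, bool]] = []
    for _ in range(n_rollouts):
        traj = rollout()
        labelled.extend(zip(traj, llm_feedback(traj)))
    return fit(labelled)


def imitate_using_feedback(
    rollout: Callable[[], Trajectory],                 # runs the *current* policy
    lfm: FeedbackModel,
    demos: List[Step],
    behaviour_clone: Callable[[List[Step]], None],     # one round of imitation learning
    n_rollouts: int,
) -> List[Step]:
    """Algorithm 2 (IMITATEUSINGFEEDBACK): keep only the steps the LFM marks
    desirable and add them to the imitation-learning dataset."""
    kept = [step for _ in range(n_rollouts) for step in rollout() if lfm(step)]
    data = demos + kept
    behaviour_clone(data)
    return data


def policy_improvement(rollout, llm_feedback, fit, demos, behaviour_clone,
                       n_rollouts: int = 100, rounds: int = 1) -> List[Step]:
    """Algorithm 3: train the LFM once from LLM feedback, then run one or more
    rounds of imitation on the behaviour it identifies as desirable."""
    lfm = train_feedback_model(rollout, llm_feedback, fit, n_rollouts)
    data = demos
    for _ in range(rounds):
        data = imitate_using_feedback(rollout, lfm, data, behaviour_clone, n_rollouts)
    return data
```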
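
The Software Dependencies and Experiment Setup rows quote a 770M FLAN-T5 fine-tuned for 10k steps with batch size 20, learning rate 5e-5, and early stopping on validation demos. Below is a hedged sketch of that configuration using Hugging Face Transformers. The checkpoint `google/flan-t5-large` (~780M parameters), the evaluation interval, the early-stopping patience, and the toy examples are assumptions; argument names such as `eval_strategy` vary slightly across Transformers versions.

```python
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer, DataCollatorForSeq2Seq,
                          EarlyStoppingCallback, Seq2SeqTrainer, Seq2SeqTrainingArguments)

# Assumed checkpoint: google/flan-t5-large is the closest public match to "770M FLAN-T5".
checkpoint = "google/flan-t5-large"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)


def to_dataset(pairs):
    """pairs: list of (verbalized observation history, target action) strings."""
    enc = tokenizer([src for src, _ in pairs], truncation=True)
    enc["labels"] = tokenizer(text_target=[tgt for _, tgt in pairs], truncation=True)["input_ids"]
    return Dataset.from_dict(dict(enc))


# Toy placeholders; the paper trains on verbalized demonstrations split 80%/20%.
train_dataset = to_dataset([("you are in the kitchen. instruction: heat the mug.", "take mug")])
eval_dataset = to_dataset([("you are at the sink. instruction: clean the fork.", "take fork")])

args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-policy",
    max_steps=10_000,                # "10k steps"
    per_device_train_batch_size=20,  # "batch 20"
    learning_rate=5e-5,              # "learning rate 5e-5"
    eval_strategy="steps",
    eval_steps=500,                  # assumed evaluation interval
    save_strategy="steps",
    save_steps=500,
    load_best_model_at_end=True,     # keep the best checkpoint for early stopping
    metric_for_best_model="eval_loss",
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # assumed patience
)
trainer.train()
```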