Policy Improvement using Language Feedback Models

Authors: Victor Zhong, Dipendra Misra, Xingdi Yuan, Marc-Alexandre Côté

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We introduce Language Feedback Models (LFMs) that identify desirable behaviour (actions that help achieve tasks specified in the instruction) for imitation learning in instruction following. To train LFMs, we obtain feedback from Large Language Models (LLMs) on visual trajectories verbalized to language descriptions. First, by using LFMs to identify desirable behaviour to imitate, we improve in task-completion rate over strong behavioural cloning baselines on three distinct language grounding environments (Touchdown, ScienceWorld, and ALFWorld).
Researcher Affiliation | Collaboration | Victor Zhong (University of Waterloo; Microsoft Research), victor.zhong@uwaterloo.ca; Dipendra Misra (Microsoft Research); Xingdi Yuan (Microsoft Research); Marc-Alexandre Côté (Microsoft Research)
Pseudocode | Yes | Appendix E provides pseudo-code for the entire procedure for policy improvement using LFMs. Algorithm 1 TRAINFEEDBACKMODEL: training a Language Feedback Model using LLM feedback. Algorithm 2 IMITATEUSINGFEEDBACK: imitation learning using desirable behaviour identified by a feedback model. Algorithm 3: policy improvement using Language Feedback Models. (A hedged sketch of these three procedures follows the table.)
Open Source Code | Yes | Source code for our environments and experiments is available at github.com/vzhong/language_feedback_models.
Open Datasets | Yes | We evaluate using LFMs for policy improvement on three distinct language grounding benchmarks. Formally, the environments from a benchmark are distinct partially-observed Markov Decision Processes that share some (or all) of the environment dynamics but have different instructions, observations, and/or action spaces. ... ALFWorld is a verbalization of ALFRED [36]... ScienceWorld is a textual simulation benchmark for basic science experiments [42]... Touchdown is a navigation benchmark where the agent navigates Google Street View images to follow long, compositional instructions [12].
Dataset Splits | Yes | This data is split into an 80% train / 20% validation dataset to train the LFM.
Hardware Specification | Yes | We train feedback models and policies using 80GB A100 GPUs. To produce rollouts in parallel, we use a cluster of 200 32GB V100 GPUs.
Software Dependencies | Yes | We use GPT-4 (2023-03-15) for action prediction and feedback, and finetune 770M FLAN-T5 [13] for policy and feedback models. Verbalized observations v contain the most recent 20 steps. ... Touchdown verbalization uses vit-large-patch14. (A sketch of the 20-step verbalization window follows the table.)
Experiment Setup | Yes | We train models for 10k steps with batch size 20, learning rate 5e-5, and early stopping over validation demos. For ACTPRED and LFM, we limit the amount of LLM usage to 100k GPT-2 tokens. (A sketch of this training configuration follows the table.)
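
The pseudocode row above references Algorithms 1-3 from Appendix E. The following is a minimal Python sketch of how those three procedures fit together, assuming toy stand-ins for the policy, the environment rollout, and the LLM feedback call. Every class and helper here is illustrative, not the authors' interface; the actual implementation is in the linked repository.

```python
# Hedged sketch of TRAINFEEDBACKMODEL (Alg. 1), IMITATEUSINGFEEDBACK (Alg. 2),
# and the outer policy-improvement loop (Alg. 3). All components are toy stand-ins.
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, str]          # (verbalized observation, action taken)
Trajectory = List[Step]

class Policy:
    """Toy stand-in for the behavioural-cloning policy (a finetuned FLAN-T5 in the paper)."""
    def __init__(self, demonstrations: List[Step]):
        self.demonstrations = list(demonstrations)

    def act(self, observation: str) -> str:
        return f"act on: {observation}"             # placeholder action prediction

    def fit(self, demonstrations: List[Step]) -> None:
        self.demonstrations = list(demonstrations)  # imitation-learning stub

def collect_rollouts(policy: Policy, observations: List[str]) -> List[Trajectory]:
    """Roll out the policy and record verbalized (observation, action) steps."""
    return [[(obs, policy.act(obs)) for obs in observations]]

def train_feedback_model(
    rollouts: List[Trajectory],
    llm_feedback: Callable[[Trajectory], List[bool]],
) -> Callable[[Step], bool]:
    """Algorithm 1 (TRAINFEEDBACKMODEL): ask an LLM which steps of each verbalized
    rollout are productive, split the feedback 80/20 into train/validation, and fit
    a small feedback model. Here the "model" is a memorising lookup table; the paper
    finetunes FLAN-T5 and uses the validation split for early stopping."""
    labelled: List[Tuple[Step, bool]] = []
    for traj in rollouts:
        labelled.extend(zip(traj, llm_feedback(traj)))
    split = int(0.8 * len(labelled))
    train, valid = labelled[:split], labelled[split:]   # 80% / 20% split
    table: Dict[Step, bool] = dict(train)               # valid unused in this toy
    return lambda step: table.get(step, False)

def imitate_using_feedback(
    policy: Policy, rollouts: List[Trajectory], lfm: Callable[[Step], bool]
) -> Policy:
    """Algorithm 2 (IMITATEUSINGFEEDBACK): keep only steps the feedback model marks
    desirable and add them to the policy's imitation dataset."""
    desirable = [step for traj in rollouts for step in traj if lfm(step)]
    policy.fit(policy.demonstrations + desirable)
    return policy

def policy_improvement(
    policy: Policy,
    observations: List[str],
    llm_feedback: Callable[[Trajectory], List[bool]],
    rounds: int = 1,
) -> Policy:
    """Algorithm 3: train the LFM once from LLM feedback on base-policy rollouts,
    then use it to identify desirable behaviour to imitate."""
    rollouts = collect_rollouts(policy, observations)
    lfm = train_feedback_model(rollouts, llm_feedback)
    for _ in range(rounds):
        rollouts = collect_rollouts(policy, observations)
        policy = imitate_using_feedback(policy, rollouts, lfm)
    return policy

# Toy usage with a hard-coded "LLM" that approves every step.
base = Policy(demonstrations=[("you see a desk", "go to desk")])
rooms = [f"you are in room {i}" for i in range(5)]
improved = policy_improvement(base, rooms, lambda traj: [True] * len(traj))
print(len(improved.demonstrations))   # 1 original demo + 4 LFM-approved steps
```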
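
The software-dependencies row notes that verbalized observations contain only the most recent 20 steps. The snippet below sketches how such a rolling window might be assembled into a text prompt; the template and the `verbalize` function are assumptions for illustration, not the paper's exact per-benchmark verbalization format.

```python
# Minimal sketch: keep the last 20 (observation, action) steps when building a prompt.
from typing import List, Tuple

def verbalize(instruction: str, history: List[Tuple[str, str]], window: int = 20) -> str:
    """Truncate the trajectory to the most recent `window` steps and format it
    as a text prompt for a seq2seq policy or feedback model."""
    recent = history[-window:]
    lines = [f"Instruction: {instruction}"]
    for i, (obs, act) in enumerate(recent, start=1):
        lines.append(f"Step {i}: you observe: {obs}. you did: {act}.")
    lines.append("Next action:")
    return "\n".join(lines)

history = [(f"room {i}", f"move {i}") for i in range(30)]   # 30 toy steps
prompt = verbalize("find the mug", history)
print(prompt.count("Step"))   # 20: only the most recent steps are kept
```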
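
The experiment-setup row lists the reported fine-tuning hyperparameters: 10k steps, batch size 20, learning rate 5e-5, and early stopping on validation data. The sketch below shows one way to reproduce that configuration for a 770M-class FLAN-T5 model in plain PyTorch; the placeholder data, the flan-t5-large checkpoint, the evaluation interval, and the patience value are assumptions not stated in the excerpt.

```python
# Hedged sketch of the reported fine-tuning configuration (10k steps, batch size 20,
# lr 5e-5, early stopping on a validation set). Data and prompt format are placeholders.
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

device = "cuda" if torch.cuda.is_available() else "cpu"
# flan-t5-large (~780M parameters) stands in for the paper's 770M FLAN-T5.
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large").to(device)

def collate(batch):
    """Tokenize (prompt, target) string pairs into seq2seq training tensors."""
    prompts, targets = zip(*batch)
    enc = tokenizer(list(prompts), padding=True, truncation=True, return_tensors="pt")
    labels = tokenizer(list(targets), padding=True, truncation=True, return_tensors="pt").input_ids
    labels[labels == tokenizer.pad_token_id] = -100   # ignore padding in the loss
    enc["labels"] = labels
    return {k: v.to(device) for k, v in enc.items()}

# Placeholder pairs: verbalized observation/instruction -> action or feedback label.
train_pairs = [("Instruction: go to the desk. Observation: ...", "go to desk")] * 100
val_pairs = [("Instruction: open the drawer. Observation: ...", "open drawer")] * 20
train_loader = DataLoader(train_pairs, batch_size=20, shuffle=True, collate_fn=collate)
val_loader = DataLoader(val_pairs, batch_size=20, collate_fn=collate)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
best_val, patience, bad_evals, step = float("inf"), 3, 0, 0

while step < 10_000 and bad_evals < patience:
    for batch in train_loader:
        model.train()
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        step += 1
        if step % 500 == 0:   # periodic validation for early stopping
            model.eval()
            with torch.no_grad():
                val_loss = sum(model(**b).loss.item() for b in val_loader) / len(val_loader)
            if val_loss < best_val:
                best_val, bad_evals = val_loss, 0
                model.save_pretrained("checkpoint_best")
            else:
                bad_evals += 1
        if step >= 10_000 or bad_evals >= patience:
            break
```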