Policy Improvement using Language Feedback Models

Authors: Victor Zhong, Dipendra Misra, Xingdi Yuan, Marc-Alexandre Côté

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We introduce Language Feedback Models (LFMs) that identify desirable behaviour (actions that help achieve tasks specified in the instruction) for imitation learning in instruction following. To train LFMs, we obtain feedback from Large Language Models (LLMs) on visual trajectories verbalized to language descriptions. First, by using LFMs to identify desirable behaviour to imitate, we improve in task-completion rate over strong behavioural cloning baselines on three distinct language grounding environments (Touchdown, ScienceWorld, and ALFWorld).
Researcher Affiliation | Collaboration | Victor Zhong (University of Waterloo; Microsoft Research), victor.zhong@uwaterloo.ca; Dipendra Misra (Microsoft Research); Xingdi Yuan (Microsoft Research); Marc-Alexandre Côté (Microsoft Research)
Pseudocode | Yes | Appendix E provides pseudo-code for the entire procedure for policy improvement using LFMs. Algorithm 1 TRAINFEEDBACKMODEL: training a Language Feedback Model using LLM feedback. Algorithm 2 IMITATEUSINGFEEDBACK: imitation learning using desirable behaviour identified by a feedback model. Algorithm 3: policy improvement using Language Feedback Models. (A hedged sketch of these three procedures follows the table.)
Open Source Code | Yes | Source code for our environments and experiments is available at github.com/vzhong/language_feedback_models.
Open Datasets | Yes | We evaluate using LFMs for policy improvement on three distinct language grounding benchmarks. Formally, the environments from a benchmark are distinct partially-observed Markov Decision Processes that share some (or all) of the environment dynamics but have different instructions, observations, and/or action spaces. ... ALFWorld is a verbalization of ALFRED [36]... ScienceWorld is a textual simulation benchmark for basic science experiments [42]... Touchdown is a navigation benchmark where the agent navigates Google Street View images to follow long, compositional instructions [12].
Dataset Splits | Yes | This data is split into an 80% train / 20% validation dataset to train the LFM.
Hardware Specification | Yes | We train feedback models and policies using 80GB A100 GPUs. To produce rollouts in parallel, we use a cluster of 200 32GB V100 GPUs.
Software Dependencies | Yes | We use GPT-4 (2023-03-15) for action prediction and feedback, and finetune 770M FLAN-T5 [13] for policy and feedback models. Verbalized observations v contain the most recent 20 steps. ... Touchdown verbalization uses vit-large-patch14. (A sketch of the 20-step verbalization window follows the table.)
Experiment Setup | Yes | We train models for 10k steps with batch size 20, learning rate 5e-5, and early stopping over validation demos. For ACTPRED and LFM, we limit the amount of LLM usage to 100k GPT-2 tokens. (A sketch of this training configuration follows the table.)
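
The pseudocode row above references Algorithms 1-3 from Appendix E. The following is a minimal Python sketch of how those three procedures fit together, assuming toy stand-ins for the policy, the environment rollout, and the LLM feedback call. Every class and helper here is illustrative, not the authors' interface; the actual implementation is in the linked repository.

```python
# Hedged sketch of TRAINFEEDBACKMODEL (Alg. 1), IMITATEUSINGFEEDBACK (Alg. 2),
# and the outer policy-improvement loop (Alg. 3). All components are toy stand-ins.
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, str]          # (verbalized observation, action taken)
Trajectory = List[Step]

class Policy:
    """Toy stand-in for the behavioural-cloning policy (a finetuned FLAN-T5 in the paper)."""
    def __init__(self, demonstrations: List[Step]):
        self.demonstrations = list(demonstrations)

    def act(self, observation: str) -> str:
        return f"act on: {observation}"             # placeholder action prediction

    def fit(self, demonstrations: List[Step]) -> None:
        self.demonstrations = list(demonstrations)  # imitation-learning stub

def collect_rollouts(policy: Policy, observations: List[str]) -> List[Trajectory]:
    """Roll out the policy and record verbalized (observation, action) steps."""
    return [[(obs, policy.act(obs)) for obs in observations]]

def train_feedback_model(
    rollouts: List[Trajectory],
    llm_feedback: Callable[[Trajectory], List[bool]],
) -> Callable[[Step], bool]:
    """Algorithm 1 (TRAINFEEDBACKMODEL): ask an LLM which steps of each verbalized
    rollout are productive, split the feedback 80/20 into train/validation, and fit
    a small feedback model. Here the "model" is a memorising lookup table; the paper
    finetunes FLAN-T5 and uses the validation split for early stopping."""
    labelled: List[Tuple[Step, bool]] = []
    for traj in rollouts:
        labelled.extend(zip(traj, llm_feedback(traj)))
    split = int(0.8 * len(labelled))
    train, valid = labelled[:split], labelled[split:]   # 80% / 20% split
    table: Dict[Step, bool] = dict(train)               # valid unused in this toy
    return lambda step: table.get(step, False)

def imitate_using_feedback(
    policy: Policy, rollouts: List[Trajectory], lfm: Callable[[Step], bool]
) -> Policy:
    """Algorithm 2 (IMITATEUSINGFEEDBACK): keep only steps the feedback model marks
    desirable and add them to the policy's imitation dataset."""
    desirable = [step for traj in rollouts for step in traj if lfm(step)]
    policy.fit(policy.demonstrations + desirable)
    return policy

def policy_improvement(
    policy: Policy,
    observations: List[str],
    llm_feedback: Callable[[Trajectory], List[bool]],
    rounds: int = 1,
) -> Policy:
    """Algorithm 3: train the LFM once from LLM feedback on base-policy rollouts,
    then use it to identify desirable behaviour to imitate."""
    rollouts = collect_rollouts(policy, observations)
    lfm = train_feedback_model(rollouts, llm_feedback)
    for _ in range(rounds):
        rollouts = collect_rollouts(policy, observations)
        policy = imitate_using_feedback(policy, rollouts, lfm)
    return policy

# Toy usage with a hard-coded "LLM" that approves every step.
base = Policy(demonstrations=[("you see a desk", "go to desk")])
rooms = [f"you are in room {i}" for i in range(5)]
improved = policy_improvement(base, rooms, lambda traj: [True] * len(traj))
print(len(improved.demonstrations))   # 1 original demo + 4 LFM-approved steps
```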
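
The software-dependencies row notes that verbalized observations contain only the most recent 20 steps. The snippet below sketches how such a rolling window might be assembled into a text prompt; the template and the `verbalize` function are assumptions for illustration, not the paper's exact per-benchmark verbalization format.

```python
# Minimal sketch: keep the last 20 (observation, action) steps when building a prompt.
from typing import List, Tuple

def verbalize(instruction: str, history: List[Tuple[str, str]], window: int = 20) -> str:
    """Truncate the trajectory to the most recent `window` steps and format it
    as a text prompt for a seq2seq policy or feedback model."""
    recent = history[-window:]
    lines = [f"Instruction: {instruction}"]
    for i, (obs, act) in enumerate(recent, start=1):
        lines.append(f"Step {i}: you observe: {obs}. you did: {act}.")
    lines.append("Next action:")
    return "\n".join(lines)

history = [(f"room {i}", f"move {i}") for i in range(30)]   # 30 toy steps
prompt = verbalize("find the mug", history)
print(prompt.count("Step"))   # 20: only the most recent steps are kept
```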
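
The experiment-setup row lists the reported fine-tuning hyperparameters: 10k steps, batch size 20, learning rate 5e-5, and early stopping on validation data. The sketch below shows one way to reproduce that configuration for a 770M-class FLAN-T5 model in plain PyTorch; the placeholder data, the flan-t5-large checkpoint, the evaluation interval, and the patience value are assumptions not stated in the excerpt.

```python
# Hedged sketch of the reported fine-tuning configuration (10k steps, batch size 20,
# lr 5e-5, early stopping on a validation set). Data and prompt format are placeholders.
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

device = "cuda" if torch.cuda.is_available() else "cpu"
# flan-t5-large (~780M parameters) stands in for the paper's 770M FLAN-T5.
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large").to(device)

def collate(batch):
    """Tokenize (prompt, target) string pairs into seq2seq training tensors."""
    prompts, targets = zip(*batch)
    enc = tokenizer(list(prompts), padding=True, truncation=True, return_tensors="pt")
    labels = tokenizer(list(targets), padding=True, truncation=True, return_tensors="pt").input_ids
    labels[labels == tokenizer.pad_token_id] = -100   # ignore padding in the loss
    enc["labels"] = labels
    return {k: v.to(device) for k, v in enc.items()}

# Placeholder pairs: verbalized observation/instruction -> action or feedback label.
train_pairs = [("Instruction: go to the desk. Observation: ...", "go to desk")] * 100
val_pairs = [("Instruction: open the drawer. Observation: ...", "open drawer")] * 20
train_loader = DataLoader(train_pairs, batch_size=20, shuffle=True, collate_fn=collate)
val_loader = DataLoader(val_pairs, batch_size=20, collate_fn=collate)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
best_val, patience, bad_evals, step = float("inf"), 3, 0, 0

while step < 10_000 and bad_evals < patience:
    for batch in train_loader:
        model.train()
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        step += 1
        if step % 500 == 0:   # periodic validation for early stopping
            model.eval()
            with torch.no_grad():
                val_loss = sum(model(**b).loss.item() for b in val_loader) / len(val_loader)
            if val_loss < best_val:
                best_val, bad_evals = val_loss, 0
                model.save_pretrained("checkpoint_best")
            else:
                bad_evals += 1
        if step >= 10_000 or bad_evals >= patience:
            break
```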