Distilling Internet-Scale Vision-Language Models into Embodied Agents

Authors: Theodore Sumers, Kenneth Marino, Arun Ahuja, Rob Fergus, Ishita Dasgupta

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments using our method to flexibly teach new language groundings, including object names (Sec. 5.1), attributes (Sec. 5.2), category membership (Sec. 5.3), and even ad-hoc user preferences (Sec. 5.4). An analysis of this imperfect supervision signal, including transferable insight into how different types of noise affect downstream task performance (Sec. 5.5).
Researcher Affiliation | Collaboration | Department of Computer Science, Princeton University, Princeton, New Jersey; DeepMind, New York City, United States.
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement about, or a link to, its own open-source code for the described methodology.
Open Datasets | Yes | We use human-human data to learn a task-agnostic motor policy: e.g., an agent that knows how to lift something, but not what a plane is. We refer to this as the original agent, and train it via behavioral cloning (BC) on the human-human dataset described by Interactive Agents Team (2021); for details on the dataset, agent architecture, and BC implementation, please refer to that work. (See the behavioral-cloning sketch after this table.)
Dataset Splits | No | The paper mentions generating initial trajectories for training and evaluation trajectories for testing, but does not explicitly define or use a separate validation split.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments.
Software Dependencies | No | The paper mentions using the 'Flamingo' VLM and the 'Playhouse' environment but does not specify any software libraries or dependencies with version numbers.
Experiment Setup | Yes | For each experiment, we generate an initial set of approximately 10,000 trajectories with a generic "Lift an object" instruction (due to implementation details, the actual number varied from 10,000 to 11,500). Episodes end when the agent lifts an object, or after 120 seconds. We use the 80B parameter model described by Alayrac et al. (2022) with greedy sampling. We use a simple QA-style zero-shot prompt: [IMG 0] Q: What is this object? A:. (See the VLM relabeling sketch after this table.)
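
The Open Datasets row above quotes the paper's description of the original agent: a task-agnostic motor policy trained with behavioral cloning on the human-human dataset of Interactive Agents Team (2021). The paper defers the dataset, agent architecture, and BC implementation to that work, so the snippet below is only a minimal, generic behavioral-cloning sketch in PyTorch; the PlaceholderPolicy network, observation and action dimensions, and optimizer settings are illustrative assumptions, not the paper's agent.

```python
# Minimal, generic behavioral-cloning sketch (illustrative only; the paper's
# agent architecture and BC setup are described in Interactive Agents Team, 2021).
import torch
import torch.nn as nn


class PlaceholderPolicy(nn.Module):
    """Toy stand-in policy: maps a flat observation to logits over discrete actions."""

    def __init__(self, obs_dim: int, num_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)


def bc_step(policy, optimizer, obs, expert_actions):
    """One BC update: maximize the likelihood of the human (expert) actions."""
    logits = policy(obs)
    loss = nn.functional.cross_entropy(logits, expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Random tensors stand in for a batch from the human-human dataset.
policy = PlaceholderPolicy(obs_dim=128, num_actions=16)
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)
obs = torch.randn(32, 128)
expert_actions = torch.randint(0, 16, (32,))
bc_step(policy, optimizer, obs, expert_actions)
```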
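
The Experiment Setup row quotes the collection and labeling procedure: roughly 10,000 trajectories are gathered under a generic "Lift an object" instruction, and the VLM (the 80B Flamingo model of Alayrac et al., 2022) is queried with greedy sampling and the zero-shot prompt [IMG 0] Q: What is this object? A:. Flamingo is not publicly released, so the sketch below assumes a hypothetical vlm_generate callable, a frame taken at the moment the object is lifted, and an illustrative "Lift a <label>" relabeling template; it shows only the shape of this step, not the paper's implementation.

```python
# Hypothetical sketch of the zero-shot VLM relabeling step. The 80B Flamingo
# model is not publicly available; `vlm_generate` stands in for any
# interleaved image-text VLM queried with greedy decoding.
PROMPT = "Q: What is this object? A:"  # QA-style zero-shot prompt from the paper


def label_object(frame, vlm_generate, max_new_tokens: int = 8) -> str:
    """Ask the VLM to name the object shown in `frame` (greedy decoding)."""
    answer = vlm_generate(
        image=frame,              # corresponds to the [IMG 0] token in the prompt
        prompt=PROMPT,
        do_sample=False,          # greedy sampling, as stated in the paper
        max_new_tokens=max_new_tokens,
    )
    return answer.strip().rstrip(".").lower()


def relabel_episode(episode: dict, vlm_generate) -> dict:
    """Swap the generic 'Lift an object' instruction for a grounded one.

    The 'Lift a <label>' template is an illustrative assumption, not the
    paper's exact relabeling format.
    """
    label = label_object(episode["lift_frame"], vlm_generate)
    relabeled = dict(episode)
    relabeled["instruction"] = f"Lift a {label}"
    return relabeled


# Toy stand-in VLM for demonstration; always answers " plane".
def dummy_vlm(image, prompt, do_sample, max_new_tokens):
    return " plane"


episode = {"lift_frame": None, "instruction": "Lift an object"}
print(relabel_episode(episode, dummy_vlm)["instruction"])  # -> Lift a plane
```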