Grounded Decoding: Guiding Text Generation with Grounded Models for Embodied Agents

Authors: Wenlong Huang, Fei Xia, Dhruv Shah, Danny Driess, Andy Zeng, Yao Lu, Pete Florence, Igor Mordatch, Sergey Levine, Karol Hausman, Brian Ichter

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate how such grounded models can be obtained across three simulation and real-world domains, and that the proposed decoding strategy is able to solve complex, long-horizon embodiment tasks in a robotic setting by leveraging the knowledge of both models. ... Our contributions are as follows: ... 3) we show empirical evidence, across three simulation and real-world domains, that the proposed method performs strongly on a wide range of tasks while also significantly outperforming prior methods in efficiency. ... 4 Experiments
Researcher Affiliation | Collaboration | Wenlong Huang1, Fei Xia2, Dhruv Shah3, Danny Driess2, Andy Zeng2, Yao Lu2, Pete Florence2, Igor Mordatch2, Sergey Levine2,3, Karol Hausman2, Brian Ichter2. 1Stanford University, 2Google DeepMind, 3UC Berkeley
Pseudocode | Yes | Algorithm 1 Grounded Decoding (GD) w/ Greedy Search; a minimal sketch of this greedy search follows the table.
Open Source Code | Yes | grounded-decoding.github.io
Open Datasets | Yes | Herein we experiment with a simulated tabletop manipulation environment based on RAVENS [95]. We create a custom set of 20 tasks, with 10 seen tasks and 10 unseen tasks. Seen tasks are used for training (for supervised baseline) or for few-shot prompting. ... We further evaluate the long-horizon reasoning of Grounded Decoding for 2D maze-solving on Minigrid [10].
Dataset Splits | No | The paper mentions 'seen tasks' for training and 'unseen tasks' for testing, but it does not describe a separate validation split or explain how the splits were used beyond stating that 'seen tasks are used for training'.
Hardware Specification | No | The paper does not specify the hardware used to run the experiments, such as GPU models, CPU specifications, or memory. It discusses training and environments without hardware specifics.
Software Dependencies | No | The paper mentions several software components, including InstructGPT, PaLM, CLIPort, PPO, CLIP, and OWL-ViT, but it does not specify their version numbers, which are needed for reproducibility.
Experiment Setup | No | The paper mentions some aspects of the experimental setup, such as training on '50,000 pre-collected demonstrations' with a 'sparse outcome reward', but it does not report hyperparameters such as learning rates, batch sizes, or optimizer settings, which are needed for a faithful reproduction.
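
For context on the Pseudocode row above: Algorithm 1 performs greedy token-level decoding in which each candidate token is scored jointly, by the product of the language model's probability and the grounded model's probability given the current environment state. The sketch below is a minimal illustration of that idea, not the authors' released implementation; the function and parameter names (lm_next_token_probs, grounded_token_probs, and so on) are hypothetical stand-ins for the two black-box models the paper assumes.

```python
# Minimal sketch of greedy Grounded Decoding. Assumes two callables:
#   lm_next_token_probs(prefix)          -> dict: token -> p_LM(token | prefix)
#   grounded_token_probs(prefix, state)  -> dict: token -> p_G(token | state, prefix)
# Both names are illustrative assumptions, not the paper's released code.

def grounded_decode_greedy(prompt_tokens, state,
                           lm_next_token_probs, grounded_token_probs,
                           eos_token, max_steps=64):
    """Greedily pick, at each step, the token most probable under BOTH models:
    score(w) = p_LM(w | prefix) * p_G(w | state, prefix)."""
    prefix = list(prompt_tokens)
    for _ in range(max_steps):
        p_lm = lm_next_token_probs(prefix)
        p_g = grounded_token_probs(prefix, state)
        # Joint score: product of the two per-token probabilities.
        # Tokens the grounded model never scores are treated as infeasible (0.0).
        scores = {tok: p_lm[tok] * p_g.get(tok, 0.0) for tok in p_lm}
        next_tok = max(scores, key=scores.get)
        prefix.append(next_tok)
        if next_tok == eos_token:
            break
    return prefix
```

Taking the product (equivalently, summing log-probabilities) restricts decoding to tokens that are both likely under the language model and feasible in the scene, which is the joint-decoding idea the Pseudocode row refers to.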