PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs

Authors: Soroush Nasiriany, Fei Xia, Wenhao Yu, Ted Xiao, Jacky Liang, Ishita Dasgupta, Annie Xie, Danny Driess, Ayzaan Wahid, Zhuo Xu, Quan Vuong, Tingnan Zhang, Tsang-Wei Edward Lee, Kuang-Huei Lee, Peng Xu, Sean Kirmani, Yuke Zhu, Andy Zeng, Karol Hausman, Nicolas Heess, Chelsea Finn, Sergey Levine, Brian Ichter

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We investigate PIVOT on real-world robotic navigation, real-world manipulation from images, instruction following in simulation, and additional spatial inference tasks such as localization. We find, perhaps surprisingly, that our approach enables zero-shot control of robotic systems without any robot training data, navigation in a variety of environments, and other capabilities.
Researcher Affiliation | Collaboration | ¹Google DeepMind, ²The University of Texas at Austin, ³Stanford University.
Pseudocode | Yes | Algorithm 1: Prompting with Iterative Visual Optimization (hedged sketches of this loop and its annotation step follow the table).
Open Source Code | No | The paper mentions a 'Website and Hugging Face demo' and states 'We provide an interactive demo on Hugging Face with a few demonstrative images as well as the ability to upload new images and questions; available here.' However, it does not explicitly state that the source code for the methodology is open source, nor does it provide a link to a code repository.
Open Datasets | Yes | To this end, we evaluate GPT-4V with 3 rounds of PIVOT on a random subset of 1000 examples from the RefCOCO testA split. We find strong performance even in the first iteration, with modest improvement over further iterations. Prompts used are in Appendix H; results are in Figure 5 and examples in Figure 3.
Dataset Splits | No | The paper mentions evaluating on a 'random subset of 1000 examples from the RefCOCO testA split' and using 'demonstration data from the RT-X mobile manipulator dataset'. While it discusses fine-tuning a VLM, it does not explicitly provide the training, validation, and test splits used for these experiments in a way that would allow the data partitioning to be reproduced.
Hardware Specification | No | The paper mentions using 'GPT-4V (OpenAI, 2023)' and 'Gemini (Google, 2023)' models for its experiments, but it does not provide specific details about the hardware (e.g., GPU models, CPU types, or cloud compute specifications) used to run these models or the experiments.
Software Dependencies | No | The paper mentions 'CV2 (Itseez, 2015)' in Appendix G but does not specify a version number for this or any other software dependency, such as Python, PyTorch, or specific ML frameworks used.
Experiment Setup | Yes | Unless otherwise noted, the VLM used herein was GPT-4V (OpenAI, 2023). For creating the text prompt w_p, we prompt the VLM to use chain of thought to reason through the problem and then summarize the top few labels. The distributions P_A in Algorithm 1 are approximated as isotropic Gaussians.
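For readers who want the shape of Algorithm 1 in code, below is a minimal sketch of the Prompting with Iterative Visual Optimization loop, reconstructed only from the details quoted above: a few rounds of candidate sampling, visual annotation, VLM selection, and refitting of an isotropic Gaussian. The helpers `query_vlm`, `annotate`, and `project`, along with all hyperparameter values, are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def pivot(image, task_prompt, query_vlm, annotate, project,
          num_iters=3, num_samples=10, num_select=3):
    """Hedged sketch of PIVOT's iterative visual optimization loop
    (Algorithm 1). `query_vlm`, `annotate`, and `project` are
    hypothetical callables standing in for the paper's VLM query,
    image-annotation, and action-to-pixel projection steps."""
    # P_A is approximated as an isotropic Gaussian over the action
    # space, per the Experiment Setup row above. A 2-D action is
    # assumed here purely for illustration.
    mean = np.zeros(2)
    std = 1.0

    for _ in range(num_iters):
        # 1. Sample candidate actions from the current distribution.
        actions = mean + std * np.random.randn(num_samples, mean.size)

        # 2. Project each candidate into the image and draw a numbered
        #    marker on it (the "visual prompt").
        pixels = [project(image, a) for a in actions]
        annotated = annotate(image, pixels)

        # 3. Ask the VLM (e.g. GPT-4V) to reason via chain of thought
        #    and return the indices of the best-looking candidates.
        chosen = query_vlm(annotated, task_prompt, k=num_select)

        # 4. Refit the isotropic Gaussian to the selected candidates
        #    (a CEM-style update, assumed here) so the search contracts
        #    around promising actions over iterations.
        selected = actions[chosen]
        mean = selected.mean(axis=0)
        std = max(0.1, float(selected.std()))

    return mean  # final action estimate
```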
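The annotation step is where the "visual prompting" happens: candidate actions become numbered markers drawn directly on the image. The paper credits CV2 (OpenCV) for its annotations in Appendix G; the helper below is a hypothetical stand-in compatible with the `annotate` callable in the loop sketch, with all styling choices (radius, colors, font) assumed for illustration.

```python
import cv2

def annotate(image, pixels):
    """Draw a numbered circular marker at each pixel coordinate, one
    per candidate action. Marker style is an assumption; the paper's
    Appendix G describes its own CV2-based annotations."""
    out = image.copy()
    for idx, (x, y) in enumerate(pixels):
        center = (int(x), int(y))
        cv2.circle(out, center, 14, (255, 255, 255), thickness=-1)  # filled disk
        cv2.circle(out, center, 14, (0, 0, 0), thickness=2)         # outline
        cv2.putText(out, str(idx), (int(x) - 7, int(y) + 6),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 0), 2)    # label
    return out
```

Numbered markers let the VLM answer with a small integer rather than raw coordinates, which is the core trick that turns textual VLM output into a choice over continuous actions.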