PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs
Authors: Soroush Nasiriany, Fei Xia, Wenhao Yu, Ted Xiao, Jacky Liang, Ishita Dasgupta, Annie Xie, Danny Driess, Ayzaan Wahid, Zhuo Xu, Quan Vuong, Tingnan Zhang, Tsang-Wei Edward Lee, Kuang-Huei Lee, Peng Xu, Sean Kirmani, Yuke Zhu, Andy Zeng, Karol Hausman, Nicolas Heess, Chelsea Finn, Sergey Levine, Brian Ichter
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We investigate PIVOT on real-world robotic navigation, real-world manipulation from images, instruction following in simulation, and additional spatial inference tasks such as localization. We find, perhaps surprisingly, that our approach enables zero-shot control of robotic systems without any robot training data, navigation in a variety of environments, and other capabilities. |
| Researcher Affiliation | Collaboration | ¹Google DeepMind, ²The University of Texas at Austin, ³Stanford University. |
| Pseudocode | Yes | Algorithm 1 Prompting with Iterative Visual Optimization |
| Open Source Code | No | The paper mentions a 'Website and Hugging Face demo' and states 'We provide an interactive demo on Hugging Face with a few demonstrative images as well as the ability to upload new images and questions; available here.' However, it does not explicitly state that the source code for their methodology is open-source or provide a link to a code repository. |
| Open Datasets | Yes | To this end, we evaluate GPT-4V with 3 rounds of PIVOT on a random subset of 1000 examples from the RefCOCO testA split. We find strong performance even in the first iteration with modest improvement over further iterations. Prompts used are in Appendix H and results are in Figure 5 and examples in Figure 3. |
| Dataset Splits | No | The paper mentions evaluating on a 'random subset of 1000 examples from the RefCOCO testA split' and using 'demonstration data from the RT-X mobile manipulator dataset'. While it discusses fine-tuning a VLM, it does not explicitly provide the training, validation, and test dataset splits used for these experiments in a way that allows reproduction of the data partitioning. |
| Hardware Specification | No | The paper mentions using 'GPT-4V (OpenAI, 2023)' and 'Gemini (Google, 2023)' models for its experiments, but it does not provide specific details about the hardware (e.g., GPU models, CPU types, or cloud compute specifications) used to run these models or their experiments. |
| Software Dependencies | No | The paper mentions 'CV2 (Itseez, 2015)' in Appendix G but does not specify a version number for this or any other software dependency, such as Python, PyTorch, or specific ML frameworks used. |
| Experiment Setup | Yes | Unless otherwise noted, the VLM used herein was GPT-4V (OpenAI, 2023). For creating the text prompt w_p, we prompt the VLM to use chain of thought to reason through the problem and then summarize the top few labels. The distributions P_A in Algorithm 1 are approximated as isotropic Gaussians. (A hedged code sketch of this loop appears after the table.) |
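
The pseudocode row (Algorithm 1) and the experiment-setup row describe the same iterative loop: sample candidate actions from an isotropic Gaussian, annotate the image with numbered markers (the paper cites CV2 for this), ask the VLM for the best labels, and refit the distribution. Below is a minimal Python sketch of one such round. The names `pivot_step` and `query_vlm`, the marker-drawing parameters, and the 0.8 shrink schedule are our illustrative assumptions, not the authors' implementation (no source code is released, per the Open Source Code row); only the overall structure follows the paper.

```python
import numpy as np
import cv2  # the paper cites CV2 (Itseez, 2015) for image annotation


def query_vlm(annotated_image, candidates, prompt, n_keep):
    """Stand-in for the GPT-4V / Gemini call in Algorithm 1.

    The real system sends the annotated image with a chain-of-thought
    prompt and parses the top labels from the reply; here we fake a
    preference for candidates near the image center so the sketch runs.
    """
    h, w = annotated_image.shape[:2]
    dists = np.linalg.norm(candidates - np.array([w / 2, h / 2]), axis=1)
    return np.argsort(dists)[:n_keep]


def pivot_step(image, mean, std, prompt, n_samples=10, n_keep=3, rng=None):
    """One round of Prompting with Iterative Visual Optimization:
    sample from an isotropic Gaussian P_A, annotate, query, refit."""
    rng = rng if rng is not None else np.random.default_rng()
    # 1. Sample candidate 2D image-space actions from isotropic Gaussian P_A.
    candidates = rng.normal(mean, std, size=(n_samples, 2))
    # 2. Draw a numbered marker for each candidate on a copy of the image.
    annotated = image.copy()
    for i, (x, y) in enumerate(candidates):
        center = (int(x), int(y))
        cv2.circle(annotated, center, 12, (255, 255, 255), -1)
        cv2.putText(annotated, str(i), center,
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 0), 2)
    # 3. Ask the VLM which labels best satisfy the instruction.
    best = query_vlm(annotated, candidates, prompt, n_keep)
    # 4. Refit the isotropic Gaussian to the selected candidates,
    #    shrinking it each round (an assumed annealing schedule).
    selected = candidates[best]
    return selected.mean(axis=0), float(selected.std()) * 0.8 + 1e-3


# Example: three rounds, matching the 3-round RefCOCO evaluation above.
image = np.zeros((480, 640, 3), dtype=np.uint8)
mean, std = np.array([320.0, 240.0]), 120.0
for _ in range(3):
    mean, std = pivot_step(image, mean, std, "point at the object")
```

Iterating `pivot_step` concentrates the distribution around the VLM's preferred action, which is consistent with the table's observation of strong first-iteration performance and modest gains from further rounds on RefCOCO.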