PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs

Authors: Soroush Nasiriany, Fei Xia, Wenhao Yu, Ted Xiao, Jacky Liang, Ishita Dasgupta, Annie Xie, Danny Driess, Ayzaan Wahid, Zhuo Xu, Quan Vuong, Tingnan Zhang, Tsang-Wei Edward Lee, Kuang-Huei Lee, Peng Xu, Sean Kirmani, Yuke Zhu, Andy Zeng, Karol Hausman, Nicolas Heess, Chelsea Finn, Sergey Levine, Brian Ichter

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We investigate PIVOT on real-world robotic navigation, real-world manipulation from images, instruction following in simulation, and additional spatial inference tasks such as localization. We find, perhaps surprisingly, that our approach enables zero-shot control of robotic systems without any robot training data, navigation in a variety of environments, and other capabilities.
Researcher Affiliation | Collaboration | ¹Google DeepMind, ²The University of Texas at Austin, ³Stanford University.
Pseudocode | Yes | Algorithm 1: Prompting with Iterative Visual Optimization (hedged sketches of this loop and its annotation step follow the table).
Open Source Code | No | The paper mentions a 'Website and Hugging Face demo' and states 'We provide an interactive demo on Hugging Face with a few demonstrative images as well as the ability to upload new images and questions; available here.' However, it does not explicitly state that the source code for the methodology is open source, nor does it provide a link to a code repository.
Open Datasets | Yes | To this end, we evaluate GPT-4V with 3 rounds of PIVOT on a random subset of 1000 examples from the RefCOCO testA split. We find strong performance even in the first iteration, with modest improvement over further iterations. Prompts used are in Appendix H; results are in Figure 5 and examples in Figure 3.
Dataset Splits | No | The paper mentions evaluating on a 'random subset of 1000 examples from the RefCOCO testA split' and using 'demonstration data from the RT-X mobile manipulator dataset'. While it discusses fine-tuning a VLM, it does not explicitly provide the training, validation, and test splits used for these experiments in a way that would allow the data partitioning to be reproduced.
Hardware Specification | No | The paper mentions using 'GPT-4V (OpenAI, 2023)' and 'Gemini (Google, 2023)' models for its experiments, but it does not provide specific details about the hardware (e.g., GPU models, CPU types, or cloud compute specifications) used to run these models or the experiments.
Software Dependencies | No | The paper mentions 'CV2 (Itseez, 2015)' in Appendix G but does not specify a version number for this or any other software dependency, such as Python, PyTorch, or specific ML frameworks used.
Experiment Setup | Yes | Unless otherwise noted, the VLM used herein was GPT-4V (OpenAI, 2023). For creating the text prompt w_p, we prompt the VLM to use chain of thought to reason through the problem and then summarize the top few labels. The distributions P_A in Algorithm 1 are approximated as isotropic Gaussians.
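For readers who want the shape of Algorithm 1 in code, below is a minimal sketch of the Prompting with Iterative Visual Optimization loop, reconstructed only from the details quoted above: a few rounds of candidate sampling, visual annotation, VLM selection, and refitting of an isotropic Gaussian. The helpers `query_vlm`, `annotate`, and `project`, along with all hyperparameter values, are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def pivot(image, task_prompt, query_vlm, annotate, project,
          num_iters=3, num_samples=10, num_select=3):
    """Hedged sketch of PIVOT's iterative visual optimization loop
    (Algorithm 1). `query_vlm`, `annotate`, and `project` are
    hypothetical callables standing in for the paper's VLM query,
    image-annotation, and action-to-pixel projection steps."""
    # P_A is approximated as an isotropic Gaussian over the action
    # space, per the Experiment Setup row above. A 2-D action is
    # assumed here purely for illustration.
    mean = np.zeros(2)
    std = 1.0

    for _ in range(num_iters):
        # 1. Sample candidate actions from the current distribution.
        actions = mean + std * np.random.randn(num_samples, mean.size)

        # 2. Project each candidate into the image and draw a numbered
        #    marker on it (the "visual prompt").
        pixels = [project(image, a) for a in actions]
        annotated = annotate(image, pixels)

        # 3. Ask the VLM (e.g. GPT-4V) to reason via chain of thought
        #    and return the indices of the best-looking candidates.
        chosen = query_vlm(annotated, task_prompt, k=num_select)

        # 4. Refit the isotropic Gaussian to the selected candidates
        #    (a CEM-style update, assumed here) so the search contracts
        #    around promising actions over iterations.
        selected = actions[chosen]
        mean = selected.mean(axis=0)
        std = max(0.1, float(selected.std()))

    return mean  # final action estimate
```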
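The annotation step is where the "visual prompting" happens: candidate actions become numbered markers drawn directly on the image. The paper credits CV2 (OpenCV) for its annotations in Appendix G; the helper below is a hypothetical stand-in compatible with the `annotate` callable in the loop sketch, with all styling choices (radius, colors, font) assumed for illustration.

```python
import cv2

def annotate(image, pixels):
    """Draw a numbered circular marker at each pixel coordinate, one
    per candidate action. Marker style is an assumption; the paper's
    Appendix G describes its own CV2-based annotations."""
    out = image.copy()
    for idx, (x, y) in enumerate(pixels):
        center = (int(x), int(y))
        cv2.circle(out, center, 14, (255, 255, 255), thickness=-1)  # filled disk
        cv2.circle(out, center, 14, (0, 0, 0), thickness=2)         # outline
        cv2.putText(out, str(idx), (int(x) - 7, int(y) + 6),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 0), 2)    # label
    return out
```

Numbered markers let the VLM answer with a small integer rather than raw coordinates, which is the core trick that turns textual VLM output into a choice over continuous actions.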