Steve-Eye: Equipping LLM-based Embodied Agents with Visual Perception in Open Worlds

Authors: Sipeng Zheng, Jiazheng Liu, Yicheng Feng, Zongqing Lu

ICLR 2024

Reproducibility assessment: each entry below gives the variable, the assessed result, and the supporting LLM response.
Research Type (Experimental): We carry out extensive experiments from a wide range of perspectives to validate our model's capability to strategically act and plan. The project's website and code can be found at https://sites.google.com/view/steve-eye.
Researcher Affiliation (Collaboration): Sipeng Zheng [1], Jiazheng Liu [2], Yicheng Feng [2], Zongqing Lu [2,1]; [1] Beijing Academy of Artificial Intelligence, [2] School of Computer Science, Peking University.
Pseudocode (No): The paper does not contain explicit pseudocode or algorithm blocks. It describes the model architecture and training procedure in narrative text and figures.
Open Source Code (Yes): The project's website and code can be found at https://sites.google.com/view/steve-eye.
Open Datasets (Yes): We choose MineDojo (Fan et al., 2022) as the Minecraft platform to collect our instruction data and conduct experiments. ... In Minecraft, such knowledge should contain item recipes, details of item attributes, their associated numerical value, etc. We access this vital information from Minecraft-Wiki (Fandom, 2023), which comprises an extensive collection of over 9,000 HTML pages.
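
The paper's data-collection pipeline is not reproduced in this report, but the sketch below shows one way to pull RGB observations from a MineDojo environment as raw material for instruction pairs. The task id, image size, episode length, and output path are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch: collecting RGB frames from a MineDojo task as raw material
# for visual instruction data. Task id, image size, episode length, and output
# path are illustrative assumptions, not settings reported in the paper.
import minedojo
import numpy as np

env = minedojo.make(
    task_id="harvest_wool_with_shears_and_sheep",  # illustrative task id
    image_size=(160, 256),
)

frames = []
obs = env.reset()
for _ in range(100):  # assumed episode length, for illustration only
    obs, reward, done, info = env.step(env.action_space.no_op())
    frames.append(obs["rgb"])  # (3, H, W) uint8 frame
    if done:
        break
env.close()

np.save("minedojo_frames.npy", np.stack(frames))  # hypothetical output path
```
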
Dataset Splits (No): The paper mentions collecting 850K instruction-answer pairs for model training and refers to 'test sets' and a 'validation dataset' (the FK-QA test set) for evaluation. However, it does not specify exact percentages or counts for a train/validation/test split of the overall 850K dataset, so the data partitioning cannot be reproduced exactly from the paper alone.
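
Since the paper does not state a split, the sketch below shows one conventional way to partition an 850K-pair instruction corpus reproducibly. The 90/5/5 ratio, the random seed, and the file names are assumptions for illustration, not the authors' protocol.

```python
# Sketch of a reproducible split for an instruction corpus; the 90/5/5 ratio,
# seed, and file names are illustrative assumptions, not the paper's protocol.
import json
import random

with open("instruction_pairs.json") as f:  # hypothetical file of ~850K pairs
    pairs = json.load(f)

random.Random(42).shuffle(pairs)  # fixed seed so the split can be re-created

n = len(pairs)
n_train, n_val = int(0.90 * n), int(0.05 * n)
splits = {
    "train": pairs[:n_train],
    "val": pairs[n_train:n_train + n_val],
    "test": pairs[n_train + n_val:],
}

for name, subset in splits.items():
    with open(f"{name}.json", "w") as f:
        json.dump(subset, f)
```
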
Hardware Specification (No): The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. It mentions using models like LLaMA-2 but not the underlying hardware.
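
The hardware used cannot be recovered from the paper, but a small snippet like the following, assuming a PyTorch environment, is a common way to capture the missing details alongside experiment logs in a reproduction attempt.

```python
# Sketch for recording hardware details next to experiment logs (assumes PyTorch).
import json
import os
import platform

import torch

hardware = {
    "cpu": platform.processor() or platform.machine(),
    "cpu_count": os.cpu_count(),
    "cuda_available": torch.cuda.is_available(),
    "gpus": [
        {
            "name": torch.cuda.get_device_name(i),
            "total_memory_gb": round(
                torch.cuda.get_device_properties(i).total_memory / 1024**3, 1
            ),
        }
        for i in range(torch.cuda.device_count())
    ],
}

with open("hardware.json", "w") as f:  # hypothetical log location
    json.dump(hardware, f, indent=2)
```
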
Software Dependencies (No): The paper mentions software components and models such as LLaMA-2 (Touvron et al., 2023b), CLIP (Radford et al., 2021), VQ-GAN (Esser et al., 2021), LoRA (Hu et al., 2021), MineDojo (Fan et al., 2022), and ChatGPT (OpenAI, 2022). However, it does not specify version numbers for these dependencies, which are necessary for a reproducible description of the environment.
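
Because no versions are pinned in the paper, a snippet like the one below can at least record the versions present in a given reproduction environment. The package names listed are assumptions about the likely Python stack behind the cited components (e.g., transformers for LLaMA-2, peft for LoRA), not ones confirmed by the paper.

```python
# Sketch: record versions of the packages likely behind the cited components.
# The package list is an assumption about the stack, not confirmed by the paper.
from importlib.metadata import PackageNotFoundError, version

packages = ["torch", "transformers", "peft", "clip", "minedojo"]

for name in packages:
    try:
        print(f"{name}=={version(name)}")
    except PackageNotFoundError:
        print(f"{name}: not installed")
```
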
Experiment Setup (No): The paper describes the overall two-stage instruction-tuning strategy, the size of the visual codebook and language vocabulary, and the use of LoRA. However, it lacks specific numerical details for hyperparameters such as learning rates, batch sizes, number of epochs, or optimizer configurations, which are essential for fully reproducing the experimental setup.
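
Since the paper reports the use of LoRA on a LLaMA-2 backbone but not the tuning hyperparameters, the sketch below shows a typical PEFT-style LoRA configuration. Every numeric value (rank, alpha, dropout) and the choice of target modules are assumed placeholders, not values from the paper.

```python
# Sketch of a LoRA fine-tuning configuration on a LLaMA-2 backbone using PEFT.
# All hyperparameters here are assumed placeholders; the paper does not report
# its actual values.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=16,                                  # assumed rank
    lora_alpha=32,                         # assumed scaling factor
    lora_dropout=0.05,                     # assumed dropout
    target_modules=["q_proj", "v_proj"],   # common choice for LLaMA-style models
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # shows the small fraction of tuned weights
```
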