STEVE-1: A Generative Model for Text-to-Behavior in Minecraft

Authors: Shalev Lifshitz, Keiran Paster, Harris Chan, Jimmy Ba, Sheila McIlraith

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | STEVE-1 sets a new bar for open-ended instruction following in Minecraft with low-level controls (mouse and keyboard) and raw pixel inputs, far outperforming previous baselines and robustly completing 12 of 13 tasks in our early-game evaluation suite. We provide experimental evidence highlighting key factors for downstream performance, including pretraining, classifier-free guidance, and data scaling. (A classifier-free guidance sketch follows the table.)
Researcher Affiliation | Academia | Shalev Lifshitz1,2 (shalev.lifshitz@mail.utoronto.ca), Keiran Paster1,2 (keirp@cs.toronto.edu), Harris Chan1,2 (hchan@cs.toronto.edu), Jimmy Ba1,2 (jba@cs.toronto.edu), Sheila McIlraith1,2 (sheila@cs.toronto.edu). 1Department of Computer Science, University of Toronto, Toronto, Canada. 2Vector Institute for Artificial Intelligence, Toronto, Canada.
Pseudocode | Yes | Algorithm 1: Sampling Episode Segments with Packed Hindsight Relabeling, via the function sample_episode_segment(T, min_btwn_goals, max_btwn_goals). (A hedged reconstruction follows the table.)
Open Source Code | Yes | All resources, including our model weights, training scripts, and evaluation tools, are made available for further research. ... Model weights, training code, videos, and an interactive demo script are hosted on our project webpage at https://sites.google.com/view/steve-1.
Open Datasets | Yes | Our gameplay dataset consists of two types of episodes: 7,854 episodes (38.94M frames) of a contractor dataset made available by Baker et al. [5] and 2,267 episodes (14.96M frames) of gameplay generated by running various pretrained VPT agents.
Dataset Splits | No | We trained our prior model for 50 epochs on this dataset and used early stopping with a small validation set. No specific percentages or sample counts for the validation set were provided.
Hardware Specification | Yes | The main STEVE-1 model is trained using PyTorch [45] distributed data parallel on four A40 GPUs for 160M frames, or just under three epochs of our gameplay dataset. (A DDP setup sketch follows the table.)
Software Dependencies | No | The paper mentions PyTorch [45] and AdamW [37] but does not provide specific version numbers for these software components; the cited papers are from 2019 and do not pin particular releases.
Experiment Setup | Yes | Hyperparameters are selected to match those in Baker et al. [5] with the exception of the learning rate, which we set to 4e-5. Our models are optimized using AdamW [37]. See Table 1 for a full list of hyperparameters. (Table 1: trunc_t 64, T 640, batch_size 12, num_workers 4, weight_decay 0.039428, n_frames 160M, learning_rate 4e-5, optimizer AdamW [37], warmup_frames 10M, p_uncond 0.1, min_btwn_goals 15, max_btwn_goals 200, vpt_architecture 2x.) (A configuration sketch follows the table.)
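
The evaluation summary above singles out classifier-free guidance as a key factor for downstream performance. As a minimal sketch of that technique, assuming a lambda-style mixing coefficient and dummy action-logit shapes (neither is taken from the paper's released code), guidance extrapolates from an unconditional prediction toward the goal-conditioned one:

```python
import torch

def cfg_logits(cond_logits: torch.Tensor,
               uncond_logits: torch.Tensor,
               lam: float) -> torch.Tensor:
    # Classifier-free guidance: move away from the unconditional prediction
    # in the direction of the goal-conditioned one. lam = 0 recovers the
    # purely conditional policy; larger lam sharpens goal-following.
    return (1 + lam) * cond_logits - lam * uncond_logits

# Toy usage with dummy action logits.
cond = torch.randn(1, 10)    # conditioned on a goal embedding
uncond = torch.randn(1, 10)  # conditioned on a learned "no goal" input
guided = cfg_logits(cond, uncond, lam=3.0)
```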
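
The pseudocode row names Algorithm 1 and its function signature. The body below is a hedged reconstruction of what that signature suggests, not the paper's algorithm: it samples a length-T window and places hindsight goals at random gaps within it (boundary handling and the pairing of frames with goal embeddings are assumptions):

```python
import random

def sample_episode_segment(episode_len: int, T: int,
                           min_btwn_goals: int, max_btwn_goals: int):
    # Choose a length-T window of the episode.
    start = random.randint(0, episode_len - T)
    end = start + T

    # Walk forward through the window, dropping a goal every
    # [min_btwn_goals, max_btwn_goals] steps. In packed hindsight
    # relabeling, each frame is then paired with the embedding of the
    # next goal frame, so one segment trains on many hindsight goals.
    goal_timesteps = []
    t = start
    while t < end - 1:
        t = min(t + random.randint(min_btwn_goals, max_btwn_goals), end - 1)
        goal_timesteps.append(t)
    return start, end, goal_timesteps

# Example with Table 1's values: T = 640, goal gaps in [15, 200].
print(sample_episode_segment(10_000, 640, 15, 200))
```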
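
The hardware row states that training used PyTorch distributed data parallel on four A40 GPUs. A minimal sketch of that setup, assuming a torchrun launch and a stand-in module (the real model is a VPT-style policy):

```python
# Hypothetical launch: torchrun --nproc_per_node=4 train.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

# Stand-in for the actual policy network.
model = torch.nn.Linear(1024, 1024).cuda(local_rank)
model = DDP(model, device_ids=[local_rank])
# Each process now holds an identical replica; gradients are all-reduced
# automatically during backward().
```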
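
The experiment-setup row reproduces Table 1. The snippet below collects those values and shows one plausible way to wire the AdamW optimizer with linear warmup over the first 10M frames; the scheduler arithmetic and the stand-in model are assumptions, not the released training code:

```python
import torch

# Values from Table 1 (names follow the paper).
hparams = dict(
    trunc_t=64, T=640, batch_size=12, num_workers=4,
    weight_decay=0.039428, n_frames=160_000_000,
    learning_rate=4e-5, warmup_frames=10_000_000,
    p_uncond=0.1, min_btwn_goals=15, max_btwn_goals=200,
    vpt_architecture="2x",
)

model = torch.nn.Linear(1024, 1024)  # stand-in for the VPT 2x policy
optimizer = torch.optim.AdamW(model.parameters(),
                              lr=hparams["learning_rate"],
                              weight_decay=hparams["weight_decay"])

# Linear warmup, stepping once per batch of batch_size * T frames.
frames_per_step = hparams["batch_size"] * hparams["T"]
warmup_steps = hparams["warmup_frames"] // frames_per_step
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: min(1.0, (step + 1) / warmup_steps))
```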