Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Zero-shot World Models via Search in Memory

Authors: Federico Malato, Ville Hautamäki

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate the models on the quality of latent reconstruction and on the perceived similarity of the reconstructed image, on both next-step and long horizon dynamics prediction. The results of our study demonstrate that a search-based world model is comparable to a training based one in both cases. Notably, our model show stronger performance in long-horizon prediction with respect to the baseline on a range of visually different environments. [...] 4 Experiments: We test our approach against PlaNet [8].
Researcher Affiliation	Academia	Federico Malato School of Computing University of Eastern Finland Joensuu, FI 80101 EMAIL Ville Hautamäki School of Computing University of Eastern Finland Joensuu, FI 80101 EMAIL
Pseudocode	No	The paper describes the methods in narrative text and with visual diagrams, but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	A working implementation of our code is provided at https://github.com/fmalato/zero_shot_world_models.
Open Datasets	Yes	Specifically, we use five tracks from SuperTuxKart... two tasks from Minecraft [6]... and two tasks from Atari [14]... [6] William H. Guss, Brandon Houghton, Nicholay Topin, Phillip Wang, Cayden Codel, Manuela Veloso, and Ruslan Salakhutdinov. MineRL: A large-scale dataset of Minecraft demonstrations, 2019. [14] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning, 2013.
Dataset Splits	No	The paper mentions extracting "20 random starting samples from a separate batch of unseen trajectories" and "20 random samples from a disjoint set of test trajectories" for evaluation, but does not provide specific percentages or counts for the overall training, validation, and test dataset splits.
Hardware Specification	Yes	All our models are trained on consumer hardware, consisting of a single RTX 4080 GPU. To fit all the models in one run, the test is run on CPU, using an Intel i7 12650HX.
Software Dependencies	No	The paper mentions using "torch.distributions package" and that the "PyTorch implementation uses https://github.com/abhayraw1/planet-torch as reference", but it does not specify explicit version numbers for these or any other software components.
Experiment Setup	Yes	Table 4: Hyperparameters for VAE models used in this study. Name | SuperTuxKart | MineRL | Atari; image size | 64x64x3 | 64x64x3 | 64x64x3; latent size | 128 | 512 | 128; learning rate | 5×10^-5 | 3×10^-4 | 5×10^-5; epochs | 250 | 50 | 250; batch size | 128 | 128 | 128; beta | 0.0, 5×10^-8 | 0.0, 5×10^-8 | 0.0, 5×10^-8; beta interval (epochs) | 25, 225 | 5, 45 | 25, 225. Table 5: Hyperparameters used to train the PlaNet baseline. Name | SuperTuxKart | MineRL | Atari; latent size | 128 | 512 | 128; hidden size | 256 | 256 | 256; learning rate | 1×10^-3 | 1×10^-3 | 1×10^-3; epochs | 250 | 250 | 250; batch size | 64 | 64 | 64; beta | 0.1 | 0.1 | 0.1.