Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
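Validating an automated classifier against manual labels usually comes down to agreement statistics. The sketch below is a minimal illustration, assuming binary "Yes"/"No" labels per reproducibility variable; it computes raw accuracy and Cohen's kappa (chance-corrected agreement). These two metrics are an assumption chosen for illustration; the metrics actually reported are those in Coakley et al.

```python
from collections import Counter


def accuracy(llm_labels, manual_labels):
    """Fraction of items where the LLM label matches the manual label."""
    if len(llm_labels) != len(manual_labels) or not llm_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(llm_labels, manual_labels))
    return matches / len(llm_labels)


def cohens_kappa(llm_labels, manual_labels):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), agreement corrected for chance."""
    n = len(llm_labels)
    p_o = accuracy(llm_labels, manual_labels)  # observed agreement
    llm_counts = Counter(llm_labels)
    manual_counts = Counter(manual_labels)
    categories = set(llm_counts) | set(manual_counts)
    # Expected agreement if both raters labeled independently at their own base rates.
    p_e = sum(llm_counts[c] * manual_counts[c] for c in categories) / (n * n)
    if p_e == 1.0:
        return 1.0  # degenerate case: both raters used one identical category throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels for six papers on one variable (e.g. "Open Source Code").
llm = ["Yes", "Yes", "No", "No", "Yes", "No"]
manual = ["Yes", "No", "No", "No", "Yes", "Yes"]
print(accuracy(llm, manual), cohens_kappa(llm, manual))
```

Kappa is included alongside accuracy because a variable with a skewed base rate (say, most papers marked "No") can yield high accuracy even from a classifier that ignores its input.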

Learning Interactive World Model for Object-Centric Reinforcement Learning

Authors: Fan Feng, Phillip Lippe, Sara Magliacane

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	To evaluate the effectiveness of our proposed interactive world model and policy learning framework, we aim to address the following questions: (i) How accurately does the model learn the state disentanglement and interaction models? (ii) How well does it perform in long-horizon task learning? and (iii) How well does the framework achieve compositional generalization? To answer these questions, we evaluate our method on a range of simulated control, robotic manipulation, and embodied AI benchmarks, including Sprites World [77], OpenAI Gym Fetch [78], iGibson [79], and Libero [80].
Researcher Affiliation	Academia	Fan Feng (1, 2), Phillip Lippe (3), Sara Magliacane (3). 1: University of California San Diego; 2: Mohamed bin Zayed University of Artificial Intelligence; 3: University of Amsterdam
Pseudocode	Yes	Algorithm 1 FIOC-WM: Offline World Model and Online Policy Learning (Simplified)
Open Source Code	No	The code and data will be publicly available after acceptance.
Open Datasets	Yes	To answer these questions, we evaluate our method on a range of simulated control, robotic manipulation, and embodied AI benchmarks, including Sprites World [77], OpenAI Gym Fetch [78], iGibson [79], and Libero [80].
Dataset Splits	No	The paper describes data collection (e.g., "collect 3000 episodes with random actions") and training steps, but does not provide explicit dataset splits (e.g., percentages or counts for training, validation, and test sets) from a single collected dataset.
Hardware Specification	Yes	Compute used for training the FIOC-WM: For Sprites-World, we use 3 hours on 1x NVIDIA A100; For Fetch, we use 8 hours on 6x NVIDIA 4090; For iGibson, we use 9 hours on 6x NVIDIA 4090; For Libero-object, we use 8 hours on 1x NVIDIA A100; For Kitchen, we use 6 hours on 1x NVIDIA A100.
Software Dependencies	No	The paper mentions models like DINO-v2 and R3M and framework components like MLP and GRU layers, but it does not specify version numbers for any underlying software dependencies like Python, PyTorch, TensorFlow, or CUDA.
Experiment Setup	Yes	For the VAE used to learn latent states, we employ a two-layer MLP with a hidden size of 256. The specific hyperparameters for different environments are detailed in Table A7. The hyperparameters for the loss terms are set as {α, β, γ, η} = {1, 0.05, 0.1, 0.2}, and the learning rate is set to 3 × 10−4. For MPC, we use gradient descent with a learning rate of 5 × 10−5. For environments trained with PPO, we set the learning rate to 3 × 10−4 with a clip ratio of 0.1. The MLP architecture consists of hidden sizes [256, 256] for Gym-Fetch, while for other environments, we use [512, 512]. The Generalized Advantage Estimation (GAE) parameter λ is set to 0.95 for all environments, and the entropy coefficient is 0.1.
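To make the PPO settings in the Experiment Setup row concrete, the following is a minimal sketch (plain Python, no RL library) of how the reported values would typically be used: GAE with λ = 0.95 and the PPO clipped surrogate with clip ratio 0.1. It is not the authors' implementation. The discount factor is not reported in the row, so `discount = 0.99` is an assumption; it is named `discount` rather than γ because γ in the row above denotes a loss-term weight. The `PPOConfig` class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PPOConfig:
    """Hypothetical container for the PPO values reported in the row above."""
    learning_rate: float = 3e-4
    clip_ratio: float = 0.1
    gae_lambda: float = 0.95
    entropy_coef: float = 0.1
    hidden_sizes: tuple = (512, 512)  # (256, 256) for Gym-Fetch
    discount: float = 0.99  # ASSUMPTION: discount factor is not reported


def compute_gae(rewards, values, dones, last_value, discount, gae_lambda):
    """Generalized Advantage Estimation over one rollout, computed backwards in time.

    delta_t = r_t + discount * V(s_{t+1}) * (1 - done_t) - V(s_t)
    A_t     = delta_t + discount * gae_lambda * (1 - done_t) * A_{t+1}
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value  # bootstrap value for the state after the last step
    next_adv = 0.0
    for t in reversed(range(len(rewards))):
        mask = 0.0 if dones[t] else 1.0  # stop bootstrapping across episode ends
        delta = rewards[t] + discount * next_value * mask - values[t]
        next_adv = delta + discount * gae_lambda * mask * next_adv
        advantages[t] = next_adv
        next_value = values[t]
    return advantages


def ppo_clipped_term(ratio, advantage, clip_ratio):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = min(max(ratio, 1.0 - clip_ratio), 1.0 + clip_ratio)
    return min(ratio * advantage, clipped * advantage)


cfg = PPOConfig()
adv = compute_gae([1.0, 1.0], [0.0, 0.0], [False, True], last_value=5.0,
                  discount=cfg.discount, gae_lambda=cfg.gae_lambda)
print(adv, ppo_clipped_term(1.5, 2.0, cfg.clip_ratio))
```

In the usage example the episode ends at the second step, so `last_value` is ignored and that step's advantage is just its reward. The small clip ratio of 0.1 (tighter than the common 0.2) limits how far each update can move the policy ratio from 1.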