Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Dr. Strategy: Model-Based Generalist Agents with Strategic Dreaming

Authors: Hany Hamed, Subin Kim, Dongyeong Kim, Jaesik Yoon, Sungjin Ahn

ICML 2024

Reproducibility Variable Result LLM Response
Research Type Experimental In experiments, we show that the proposed model outperforms prior pixel-based MBRL methods in various visually complex and partially observable navigation tasks.
Researcher Affiliation Collaboration 1KAIST 2SAP. Correspondence to: Sungjin Ahn <EMAIL>.
Pseudocode Yes Algorithm 1: Dr. Strategy. Initialize: World Model M, Replay buffer D, landmark auto-encoder (enc_ϕ(s), {l_1, ..., l_N}, dec_ϕ(l)), Highway policy π_l(a_t|s_t, l), Explorer π_e(a_t|s_t), Achiever π_g(a_t|s_t, g)
Open Source Code No The paper does not include an explicit statement about releasing code or a link to a code repository for the methodology described.
Open Datasets Yes To empirically investigate the proposed agent, we evaluate it in two types of navigation environments and a robot manipulation environment. One type of navigation environment is 2D navigation... We have designed a 3D-Maze navigation... Additionally, our evaluation extends to a robot manipulation environment, the RoboKitchen benchmark introduced in a prior work (Mendonca et al., 2021).
Dataset Splits No The paper mentions 'zero-shot evaluation' where goals are unseen during training and are user-defined at test time, which defines a test set. However, it does not explicitly describe a separate validation set or specific validation split percentages/counts. It states: 'For the evaluations, we trained all baselines for 3 seeds per environment', which refers to evaluation runs, not a validation set.
Hardware Specification Yes The training of our agent took 2 to 6 days based on the environment using 24GB VRAM GPU.
Software Dependencies No The paper mentions using 'DreamerV2 (Hafner et al., 2020)', 'Adam optimizer (Kingma & Ba, 2014)', and 'VQ-VAE', but does not provide specific version numbers for these or other software libraries/frameworks (e.g., PyTorch, TensorFlow, Python version).
Experiment Setup Yes Appendix E. Hyperparameters Table 4. We made minor changes only in a few hyper-parameters such as the learning rates of world model, actor, and critic by following the hyperparameters of Choreographer (Mazzaglia et al., 2022b) as it is also utilizing VQ-VAE like our method. ... Batch size B: 50; Trajectory length TS: 50; Discrete latent dimensions: 32; Discrete latent classes: 32; ... Learning rate: 3e-4; Imagination horizon H: 15; Discount: 0.99; Lambda-target parameter: 0.95; Actor learning rate: 8e-5; Critic learning rate: 8e-5; ...
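For readers who want the quoted hyperparameters in machine-readable form, the following is a minimal sketch, not the authors' code (the summary above reports that no code release was found). The class name, field names, and the `lambda_returns` helper are hypothetical. The sketch also assumes that the "Lambda-target parameter" is used in the standard TD(λ) return that DreamerV2-style agents compute over imagined rollouts; the excerpt itself does not spell this out. Values elided with "..." in the excerpt are left out rather than guessed.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class QuotedHyperparameters:
    """Values copied from the excerpt above; field names are hypothetical."""
    batch_size: int = 50
    trajectory_length: int = 50
    discrete_latent_dimensions: int = 32
    discrete_latent_classes: int = 32
    learning_rate: float = 3e-4  # the excerpt does not say which component this applies to
    imagination_horizon: int = 15
    discount: float = 0.99
    lambda_target: float = 0.95
    actor_learning_rate: float = 8e-5
    critic_learning_rate: float = 8e-5


def lambda_returns(rewards: Sequence[float], values: Sequence[float],
                   discount: float, lam: float) -> List[float]:
    """Standard TD(lambda) targets (an assumption about how the parameter is used).

    rewards has length H; values has length H + 1, where values[H] is the
    bootstrap value at the end of the rollout. The recursion is
        R_t = r_t + discount * ((1 - lam) * values[t + 1] + lam * R_{t + 1}),
    with R_H = values[H].
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must contain exactly one more entry than rewards")
    returns = [0.0] * len(rewards)
    next_return = values[-1]  # R_H: bootstrap from the final value estimate
    for t in reversed(range(len(rewards))):
        next_return = rewards[t] + discount * (
            (1.0 - lam) * values[t + 1] + lam * next_return
        )
        returns[t] = next_return
    return returns


if __name__ == "__main__":
    cfg = QuotedHyperparameters()
    horizon = cfg.imagination_horizon
    # Placeholder rollout: zero rewards and a constant value estimate of 1.0.
    targets = lambda_returns([0.0] * horizon, [1.0] * (horizon + 1),
                             cfg.discount, cfg.lambda_target)
    print(len(targets), targets[0])
```

With `lam = 0` the helper reduces to one-step targets, and with `lam = 1` it reduces to the discounted reward sum plus the bootstrap value, which is a quick way to sanity-check any reimplementation against the quoted discount and lambda settings.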