Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Strategic Planning: A Top-Down Approach to Option Generation
Authors: Max Ruiz Luyten, Antonin Berthon, Mihaela van der Schaar
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we demonstrate that our framework significantly enhances the performance of different underlying RL algorithms, leading to faster convergence and the discovery of more complex behaviors. Taken together, our findings highlight that top-down strategic exploration opens new avenues to improve RL in real-world decision problems. ... 5. Experiments We evaluate the Strategist framework on different versions of Crafter (Hafner, 2021) and with different RL backbones: Proximal Policy Optimization (PPO) (Schulman et al., 2017), and Exploration via Distributional Ensemble (EDE) (Jiang et al., 2022). |
| Researcher Affiliation | Academia | Max Ruiz Luyten * 1 Antonin Berthon * 1 Mihaela van der Schaar 1 1University of Cambridge. Correspondence to: Max Ruiz Luyten <EMAIL>. |
| Pseudocode | No | The paper describes the Strategist agent's phases and tree construction process in natural language, but it does not include a clearly labeled pseudocode or algorithm block with structured steps formatted like code. |
| Open Source Code | Yes | We instantiate these ideas in the Strategist agent1, which uses the LLM-based tree search to encode domain knowledge into actionable top-down strategies without prespecifying their components. ... 1https://github.com/antoninbrthn/strategist |
| Open Datasets | Yes | We evaluate the Strategist framework on different versions of Crafter (Hafner, 2021) ... We utilize two main environment configurations: Modified Crafter (Easy & Medium) ... Original Crafter ... 1000 from the expert human demonstration dataset provided with the original Crafter environment (Hafner, 2021). |
| Dataset Splits | No | The paper states: "We collect 5000 frames: 4000 from 200 trajectories of PPO agents pretrained on the environment, and 1000 from the expert human demonstration dataset provided with the original Crafter environment (Hafner, 2021)." This describes the data collection for the reward shaping network but does not specify training/test/validation splits for the main RL agent experiments on the Crafter environment itself. |
| Hardware Specification | Yes | Experiments were run on two NVIDIA RTX 6000 ADA GPUs with 48GB VRAM and 120GB RAM. |
| Software Dependencies | No | The paper mentions "standard PPO (Schulman et al., 2017) implementation from stable baselines 3 (Raffin et al., 2021)", "GPT-4o (Open AI et al., 2024)", "GPT-4o-mini model", "Adam optimizer (Kingma & Ba, 2017)", and "Smartplay (Wu et al., 2024)". However, it does not provide specific version numbers for `stable-baselines3` or any other software libraries or frameworks used in their implementation. |
| Experiment Setup | Yes | PPO Hyperparameters. In our experiments, we use the standard PPO (Schulman et al., 2017) implementation from stable baselines 3 (Raffin et al., 2021) with the default hyperparameters: Policy/Value Network Architecture: CnnPolicy; Rollout Length per Update: 2048; Learning Rate: 3 * 10^-4; Batch Size: 64; Clip Range: 0.2; Discount Factor: 0.99. Training Schedules. We train for a total of 2M steps, logging performance metrics every 20k steps and saving checkpoints every 500k steps. We evaluate each performance metric over 10 episodes. |
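For reference, the reported experiment setup can be collected into a single configuration. This is a minimal sketch, not the authors' code: it assumes the stable-baselines3 keyword-argument names (`n_steps`, `learning_rate`, `batch_size`, `clip_range`, `gamma`) that correspond to the quoted hyperparameters, and the commented-out `PPO(...)` call shows where such a config would plug in.

```python
# Hedged sketch of the reported PPO setup (hyperparameter names assume the
# stable-baselines3 API; values are taken verbatim from the paper's table).
ppo_kwargs = {
    "policy": "CnnPolicy",      # Policy/Value Network Architecture
    "n_steps": 2048,            # Rollout Length per Update
    "learning_rate": 3e-4,      # Learning Rate: 3 * 10^-4
    "batch_size": 64,           # Batch Size
    "clip_range": 0.2,          # Clip Range
    "gamma": 0.99,              # Discount Factor
}

# Training schedule as described in the paper.
total_timesteps = 2_000_000     # 2M environment steps
log_interval_steps = 20_000     # performance metrics logged every 20k steps
checkpoint_interval = 500_000   # checkpoints saved every 500k steps
eval_episodes = 10              # each metric evaluated over 10 episodes

# With stable-baselines3 and a Crafter env available, training would look like:
# from stable_baselines3 import PPO
# model = PPO(env=env, **ppo_kwargs)
# model.learn(total_timesteps=total_timesteps)
```

The dictionary form makes it easy to diff this setup against the stable-baselines3 defaults when attempting a reproduction.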