Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Strategic Planning: A Top-Down Approach to Option Generation
Authors: Max Ruiz Luyten, Antonin Berthon, Mihaela van der Schaar
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we demonstrate that our framework significantly enhances the performance of different underlying RL algorithms, leading to faster convergence and the discovery of more complex behaviors. Taken together, our findings highlight that top-down strategic exploration opens new avenues to improve RL in real-world decision problems. ... 5. Experiments We evaluate the Strategist framework on different versions of Crafter (Hafner, 2021) and with different RL backbones: Proximal Policy Optimization (PPO) (Schulman et al., 2017), and Exploration via Distributional Ensemble (EDE) (Jiang et al., 2022). |
| Researcher Affiliation | Academia | Max Ruiz Luyten * 1 Antonin Berthon * 1 Mihaela van der Schaar 1 1University of Cambridge. Correspondence to: Max Ruiz Luyten <EMAIL>. |
| Pseudocode | No | The paper describes the Strategist agent's phases and tree construction process in natural language, but it does not include a clearly labeled pseudocode or algorithm block with structured steps formatted like code. |
| Open Source Code | Yes | We instantiate these ideas in the Strategist agent1, which uses the LLM-based tree search to encode domain knowledge into actionable top-down strategies without prespecifying their components. ... 1https://github.com/antoninbrthn/strategist |
| Open Datasets | Yes | We evaluate the Strategist framework on different versions of Crafter (Hafner, 2021) ... We utilize two main environment configurations: Modified Crafter (Easy & Medium) ... Original Crafter ... 1000 from the expert human demonstration dataset provided with the original Crafter environment (Hafner, 2021). |
| Dataset Splits | No | The paper states: "We collect 5000 frames: 4000 from 200 trajectories of PPO agents pretrained on the environment, and 1000 from the expert human demonstration dataset provided with the original Crafter environment (Hafner, 2021)." This describes the data collection for the reward shaping network but does not specify training/test/validation splits for the main RL agent experiments on the Crafter environment itself. |
| Hardware Specification | Yes | Experiments were run on two NVIDIA RTX 6000 ADA GPUs with 48GB VRAM and 120GB RAM. |
| Software Dependencies | No | The paper mentions "standard PPO (Schulman et al., 2017) implementation from stable baselines 3 (Raffin et al., 2021)", "GPT-4o (Open AI et al., 2024)", "GPT-4o-mini model", "Adam optimizer (Kingma & Ba, 2017)", and "Smartplay (Wu et al., 2024)". However, it does not provide specific version numbers for `stable-baselines3` or any other software libraries or frameworks used in their implementation. |
| Experiment Setup | Yes | PPO Hyperparameters. In our experiments, we use the standard PPO (Schulman et al., 2017) implementation from stable baselines 3 (Raffin et al., 2021) with the default hyperparameters: Policy/Value Network Architecture: CnnPolicy; Rollout Length per Update: 2048; Learning Rate: 3 * 10^-4; Batch Size: 64; Clip Range: 0.2; Discount Factor: 0.99. Training Schedules. We train for a total of 2M steps, logging performance metrics every 20k steps and saving checkpoints every 500k steps. We evaluate each performance metric over 10 episodes. |
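For reference, the reported experiment setup can be collected into a single configuration. This is a minimal sketch, not the authors' code: it assumes the stable-baselines3 keyword-argument names (`n_steps`, `learning_rate`, `batch_size`, `clip_range`, `gamma`) that correspond to the quoted hyperparameters, and the commented-out `PPO(...)` call shows where such a config would plug in.

```python
# Hedged sketch of the reported PPO setup (hyperparameter names assume the
# stable-baselines3 API; values are taken verbatim from the paper's table).
ppo_kwargs = {
    "policy": "CnnPolicy",      # Policy/Value Network Architecture
    "n_steps": 2048,            # Rollout Length per Update
    "learning_rate": 3e-4,      # Learning Rate: 3 * 10^-4
    "batch_size": 64,           # Batch Size
    "clip_range": 0.2,          # Clip Range
    "gamma": 0.99,              # Discount Factor
}

# Training schedule as described in the paper.
total_timesteps = 2_000_000     # 2M environment steps
log_interval_steps = 20_000     # performance metrics logged every 20k steps
checkpoint_interval = 500_000   # checkpoints saved every 500k steps
eval_episodes = 10              # each metric evaluated over 10 episodes

# With stable-baselines3 and a Crafter env available, training would look like:
# from stable_baselines3 import PPO
# model = PPO(env=env, **ppo_kwargs)
# model.learn(total_timesteps=total_timesteps)
```

The dictionary form makes it easy to diff this setup against the stable-baselines3 defaults when attempting a reproduction.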