Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Policy Mirror Descent with Lookahead

Authors: Kimon Protopapas, Anas Barakat

NeurIPS 2024 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We perform simulations to verify our theoretical findings empirically on the standard DeepSea RL environment from DeepMind's bsuite [34].
Researcher Affiliation	Academia	Department of Computer Science, ETH Zürich, Switzerland. Contact: EMAIL, EMAIL. Most of this work was completed when both authors were affiliated with ETH Zürich, K.P. as a Master student and A.B. as a postdoctoral fellow. A.B. is currently affiliated with Singapore University of Technology and Design as a research fellow.
Pseudocode	Yes	Algorithm 1 Lookahead Q-function Estimation via Monte Carlo Planning
Open Source Code	Yes	Our codebase where all our experiments can be replicated is available here: https://gitlab.com/kimon.protopapa/pmd-lookahead.
Open Datasets	Yes	on the standard DeepSea RL environment from DeepMind's bsuite [34]
Dataset Splits	No	The paper does not explicitly provide training/validation/test dataset splits. For RL environments, the training is through interaction, and explicit dataset splits are not always presented in the same way as in supervised learning.
Hardware Specification	No	The paper does not explicitly describe the specific hardware (CPU/GPU models, memory, etc.) used to run the experiments.
Software Dependencies	No	The paper mentions software such as DeepMind's bsuite, MCTX, JAX, and gymnax, but does not provide specific version numbers for these dependencies.
Experiment Setup	Yes	We run the exact h-PMD algorithm for 100 iterations for increasing values of h using the KL divergence. Similar results were observed for the Euclidean divergence. We tested two different stepsize schedules: (a) in dotted lines in Fig. 1 (left), η_k equal to its lower bound in Sec. 4, with the choice c_k := γ^{2h(k+1)} (note the dependence on h); and (b) in solid lines, an identical stepsize schedule η_k across all values of h, with c_k := γ^{2(k+1)}, to isolate the effect of the lookahead. We run the h-PMD algorithm for different values of h in both exact and inexact settings on the DeepSea environment from DeepMind's bsuite [34] using a grid size of 64 by 64, and a discount factor γ = 0.99.
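
As a rough illustration of the two c_k schedules quoted in the Experiment Setup row, the sketch below tabulates c_k for the stated discount factor and iteration count. It reads the flattened exponents as c_k := γ^{2h(k+1)} and c_k := γ^{2(k+1)}, which is an assumption about the extracted text. The function name `c_schedule` and its parameters are hypothetical and are not taken from the authors' codebase. The stepsize η_k itself is not computed, because its lower bound (Sec. 4 of the paper) is not reproduced on this page.

```python
GAMMA = 0.99          # discount factor quoted in the Experiment Setup row
NUM_ITERATIONS = 100  # number of h-PMD iterations quoted in the same row


def c_schedule(h, gamma=GAMMA, num_iterations=NUM_ITERATIONS, depends_on_h=True):
    """Return [c_0, ..., c_{K-1}] for lookahead depth h (hypothetical helper).

    depends_on_h=True  -> schedule (a): c_k = gamma ** (2 * h * (k + 1))
    depends_on_h=False -> schedule (b): c_k = gamma ** (2 * (k + 1)), shared by all h
    """
    if h < 1:
        raise ValueError("lookahead depth h must be a positive integer")
    exponent_scale = 2 * h if depends_on_h else 2
    return [gamma ** (exponent_scale * (k + 1)) for k in range(num_iterations)]


if __name__ == "__main__":
    # Compare the first three values of each schedule for a few depths.
    for h in (1, 5, 10):
        a = c_schedule(h, depends_on_h=True)
        b = c_schedule(h, depends_on_h=False)
        print(f"h={h}: (a) {a[:3]}  (b) {b[:3]}")
```

Under this reading, schedule (a) decays faster as h grows, while schedule (b) is the same for every h, which matches the stated purpose of isolating the effect of the lookahead.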