Policy Mirror Descent with Lookahead

Authors: Kimon Protopapas, Anas Barakat

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform simulations to verify our theoretical findings empirically on the standard DeepSea RL environment from DeepMind's bsuite [34].
Researcher Affiliation | Academia | Department of Computer Science, ETH Zürich, Switzerland. Contact: kprotopapas@student.ethz.ch, barakat9anas@gmail.com. Most of this work was completed when both authors were affiliated with ETH Zürich, K.P. as a Master student and A.B. as a postdoctoral fellow. A.B. is currently affiliated with Singapore University of Technology and Design as a research fellow.
Pseudocode | Yes | Algorithm 1: Lookahead Q-function Estimation via Monte Carlo Planning
Open Source Code | Yes | Our codebase where all our experiments can be replicated is available here: https://gitlab.com/kimon.protopapa/pmd-lookahead.
Open Datasets | Yes | on the standard DeepSea RL environment from DeepMind's bsuite [34]
Dataset Splits | No | The paper does not provide training/validation/test dataset splits. In RL environments, training happens through interaction with the environment, so splits are usually not reported the way they are in supervised learning.
Hardware Specification | No | The paper does not describe the hardware (CPU/GPU models, memory, etc.) used to run the experiments.
Software Dependencies | No | The paper mentions DeepMind's bsuite, MCTX, JAX, and gymnax, but does not provide version numbers for these dependencies.
Experiment Setup | Yes | We run the exact h-PMD algorithm for 100 iterations for increasing values of h using the KL divergence. Similar results were observed for the Euclidean divergence. We tested two different stepsize schedules: (a) in dotted lines in Fig. 1 (left), $\eta_k$ equal to its lower bound in Sec. 4, with the choice $c_k := \gamma^{2h(k+1)}$ (note the dependence on h); and (b) in solid lines, an identical stepsize schedule $\eta_k$ across all values of h, with $c_k := \gamma^{2(k+1)}$, to isolate the effect of the lookahead. We run the h-PMD algorithm for different values of h in both exact and inexact settings on the DeepSea environment from DeepMind's bsuite [34] using a grid size of 64 by 64 and a discount factor γ = 0.99.
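To make the experiment setup more concrete, here is a minimal sketch of exact h-PMD with the KL divergence on a small tabular MDP. It is not the authors' codebase (which the summary above says relies on JAX, MCTX, and gymnax), and several things are assumptions made only for illustration: the function names (`bellman_backup`, `policy_value`, `lookahead_q`, `kl_mirror_step`, `h_pmd`, `random_mdp`) are hypothetical, the environment is a tiny random MDP rather than DeepSea, the stepsize `eta` is a constant rather than either schedule from the paper, and policy evaluation uses a fixed number of sweeps that suits γ = 0.9 but would be too few for γ = 0.99. The sketch uses the standard closed form of the KL mirror step, in which the new policy is proportional to the old policy times exp(η · Q_h).

```python
import math
import random


def bellman_backup(P, r, V, gamma):
    """Q(s, a) = r(s, a) + gamma * sum_s' P(s' | s, a) * V(s')."""
    return [[r[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))
             for a in range(len(r[s]))] for s in range(len(r))]


def policy_value(P, r, pi, gamma, sweeps=500):
    """Approximate V^pi by iterating the Bellman expectation operator."""
    V = [0.0] * len(r)
    for _ in range(sweeps):
        Q = bellman_backup(P, r, V, gamma)
        V = [sum(p * q for p, q in zip(pi[s], Q[s])) for s in range(len(r))]
    return V


def lookahead_q(P, r, V, gamma, h):
    """Depth-h lookahead: apply the Bellman optimality operator h - 1 times
    to V, then do one final backup to obtain state-action values."""
    for _ in range(h - 1):
        V = [max(q_s) for q_s in bellman_backup(P, r, V, gamma)]
    return bellman_backup(P, r, V, gamma)


def kl_mirror_step(log_pi_s, q_s, eta):
    """KL mirror step in log space: pi_new(a) is proportional to
    pi(a) * exp(eta * q(a)). Returns the normalized log-probabilities."""
    z = [lp + eta * q for lp, q in zip(log_pi_s, q_s)]
    m = max(z)
    log_norm = m + math.log(sum(math.exp(x - m) for x in z))
    return [x - log_norm for x in z]


def h_pmd(P, r, gamma, h, eta, iters):
    """Exact h-PMD with KL divergence and a constant stepsize (an assumption)."""
    S, A = len(r), len(r[0])
    log_pi = [[-math.log(A)] * A for _ in range(S)]  # start from uniform policy
    for _ in range(iters):
        pi = [[math.exp(lp) for lp in row] for row in log_pi]
        V = policy_value(P, r, pi, gamma)
        Q = lookahead_q(P, r, V, gamma, h)
        log_pi = [kl_mirror_step(log_pi[s], Q[s], eta) for s in range(S)]
    return [[math.exp(lp) for lp in row] for row in log_pi]


def random_mdp(S, A, seed=0):
    """Hypothetical toy MDP: P[s][a] is a distribution over next states,
    r[s][a] is a reward in [0, 1)."""
    rng = random.Random(seed)
    P, r = [], []
    for _ in range(S):
        rows = []
        for _ in range(A):
            w = [rng.random() for _ in range(S)]
            total = sum(w)
            rows.append([x / total for x in w])
        P.append(rows)
        r.append([rng.random() for _ in range(A)])
    return P, r
```

A possible usage is `P, r = random_mdp(4, 2)` followed by `pi = h_pmd(P, r, gamma=0.9, h=3, eta=1.0, iters=20)`. Setting `h=1` recovers plain PMD, so running the same call for several values of `h` with a shared `eta` loosely mirrors schedule (b) above, where the stepsize is held fixed across `h` to isolate the effect of the lookahead.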