Policy Mirror Descent with Lookahead
Authors: Kimon Protopapas, Anas Barakat
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform simulations to verify our theoretical findings empirically on the standard Deep Sea RL environment from DeepMind's bsuite [34]. |
| Researcher Affiliation | Academia | Department of Computer Science, ETH Zürich, Switzerland. Contact: kprotopapas@student.ethz.ch, barakat9anas@gmail.com. Most of this work was completed when both authors were affiliated with ETH Zürich, K.P. as a Master's student and A.B. as a postdoctoral fellow. A.B. is currently affiliated with Singapore University of Technology and Design as a research fellow. |
| Pseudocode | Yes | Algorithm 1 Lookahead Q-function Estimation via Monte Carlo Planning |
| Open Source Code | Yes | Our codebase where all our experiments can be replicated is available here: https://gitlab.com/kimon.protopapa/pmd-lookahead. |
| Open Datasets | Yes | on the standard Deep Sea RL environment from DeepMind's bsuite [34] |
| Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits. In RL environments, training data is generated through interaction with the environment, so explicit splits are not always presented as they are in supervised learning. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (CPU/GPU models, memory, etc.) used to run the experiments. |
| Software Dependencies | No | The paper mentions software such as DeepMind's bsuite, MCTX, JAX, and gymnax, but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | We run the exact h-PMD algorithm for 100 iterations for increasing values of $h$ using the KL divergence. Similar results were observed for the Euclidean divergence. We tested two different stepsize schedules: (a) in dotted lines in Fig. 1 (left), $\eta_k$ equal to its lower bound in sec. 4, with the choice $c_k := \gamma^{2h(k+1)}$ (note the dependence on $h$); and (b) in solid lines, $\eta_k$ following an identical stepsize schedule across all values of $h$ with $c_k := \gamma^{2(k+1)}$ to isolate the effect of the lookahead. We run the h-PMD algorithm for different values of $h$ in both exact and inexact settings on the Deep Sea environment from DeepMind's bsuite [34] using a grid size of 64 by 64, and a discount factor $\gamma = 0.99$. |
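
The two schedules in the Experiment Setup row differ only in the sequence c_k. The following is a minimal sketch of that difference, assuming the exponents flattened in the extracted text are superscripts, i.e. c_k = γ^(2h(k+1)) for schedule (a) and c_k = γ^(2(k+1)) for schedule (b). The function names are hypothetical and are not taken from the authors' codebase. The stepsize η_k itself is not computed here, because its lower bound (sec. 4 of the paper) involves terms that this summary does not reproduce.

```python
GAMMA = 0.99  # discount factor reported in the experiment setup


def c_k_lookahead(k: int, h: int, gamma: float = GAMMA) -> float:
    """Schedule (a), assumed form: c_k = gamma^(2h(k+1)); depends on the lookahead depth h."""
    return gamma ** (2 * h * (k + 1))


def c_k_shared(k: int, gamma: float = GAMMA) -> float:
    """Schedule (b), assumed form: c_k = gamma^(2(k+1)); identical for every h."""
    return gamma ** (2 * (k + 1))


if __name__ == "__main__":
    # Illustrative depths only; the summary does not list the values of h used.
    for h in (1, 5, 10):
        sample = [c_k_lookahead(k, h) for k in (0, 9, 99)]
        print(f"h={h}: c_k at k=0, 9, 99 -> {sample}")
    print("shared:", [c_k_shared(k) for k in (0, 9, 99)])
```

Under these assumed forms, the two schedules coincide at h = 1, and for a fixed k schedule (a) shrinks geometrically as h grows, which is the dependence on h that schedule (b) removes.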