Provably sample-efficient RL with side information about latent dynamics
Authors: Yao Liu, Dipendra Misra, Miro Dudik, Robert E. Schapire
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In synthetic experiments, we verify various properties of our algorithm and compare it with several transfer RL algorithms that require access to full simulators (i.e., those that also simulate observations). |
| Researcher Affiliation | Industry | Yao Liu (Amazon Web Services, yaoliuai@amazon.com); Dipendra Misra (Microsoft Research, dipendra.misra@microsoft.com); Miroslav Dudík (Microsoft Research, mdudik@microsoft.com); Robert E. Schapire (Microsoft Research, schapire@microsoft.com) |
| Pseudocode | Yes | Algorithm 1 Robust Dynamic Programming. RDP(M , η) ... Algorithm 2 Transfer from Abstract Simulator using Inverse Dynamics. TASID(M , M , F, η, ϵ, δ) |
| Open Source Code | No | The paper does not provide any explicit statements about releasing source code for the described methodology or links to a code repository. |
| Open Datasets | Yes | We evaluate TASID in the visual MiniGrid environment [Chevalier-Boisvert et al., 2018] with noisy observations. |
| Dataset Splits | No | The paper mentions running a grid search over hyperparameters and evaluating in simulation environments, but does not specify explicit train/validation/test dataset splits or cross-validation setup for its data. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory used for running its experiments. |
| Software Dependencies | No | The paper mentions algorithms like PPO and environments like MiniGrid, but it does not provide specific version numbers for any software dependencies or libraries required for reproduction. |
| Experiment Setup | Yes | For baseline algorithms, we run grid search over hyperparameters listed in Table 3 in Appendix D, separately for each environment specification (each value of H), and report the best results of PPO(+RND)(+DR). For TASID, we consider only one hyperparameter, the number of training episodes per time step n_D, and search over three possible values: 1000, 2500, 10000. |
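
The selection procedure quoted in the Experiment Setup row can be sketched as a small grid search. This is only an illustrative sketch, not the authors' code: the function `select_best_n_d` and the `evaluate` callback are hypothetical names, the score is assumed to be "higher is better" (for example, average return), and picking the best value separately for each horizon H is an assumption carried over from the description of the baselines. Only the three candidate values come from the quoted text.

```python
from typing import Callable, Dict, Iterable, Tuple

# The three candidate values for n_D quoted in the Experiment Setup row.
N_D_CANDIDATES = (1000, 2500, 10000)


def select_best_n_d(
    horizons: Iterable[int],
    evaluate: Callable[[int, int], float],
    candidates: Iterable[int] = N_D_CANDIDATES,
) -> Dict[int, Tuple[int, float]]:
    """Return {H: (best n_D, its score)} for every horizon H.

    `evaluate(H, n_d)` is a hypothetical, user-supplied callback that
    trains and evaluates an agent and returns a score where higher is
    better. No real training happens inside this function.
    """
    candidates = tuple(candidates)
    best: Dict[int, Tuple[int, float]] = {}
    for H in horizons:
        # Score every candidate for this environment specification.
        scores = {n_d: evaluate(H, n_d) for n_d in candidates}
        # Keep the candidate with the highest score (first one wins ties).
        best_n_d = max(scores, key=scores.get)
        best[H] = (best_n_d, scores[best_n_d])
    return best
```

To use the sketch, pass the horizons under study and a callback that wraps an actual training and evaluation run; the returned dictionary then records, per horizon, which n_D value would be reported.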