RL for Latent MDPs: Regret Guarantees and a Lower Bound
Authors: Jeongyeol Kwon, Yonathan Efroni, Constantine Caramanis, Shie Mannor
NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we perform an empirical evaluation of the suggested algorithms on toy problems (Section 4), while focusing on the importance of the made assumptions. |
| Researcher Affiliation | Collaboration | Jeongyeol Kwon, The University of Texas at Austin (kwonchungli@utexas.edu); Yonathan Efroni, Microsoft Research, NYC (jonathan.efroni@gmail.com); Constantine Caramanis, The University of Texas at Austin (constantine@utexas.edu); Shie Mannor, Technion / NVIDIA (shie@ee.technion.ac.il, smannor@nvidia.com) |
| Pseudocode | Yes | Algorithm 1 Latent Upper Confidence Reinforcement Learning (L-UCRL); Algorithm 2 Access to True Contexts; Algorithm 3 Inference of Contexts; Algorithm 4 (Informal) Recovery of LMDP parameters |
| Open Source Code | No | The paper does not provide any statement about releasing source code for the methodology described. |
| Open Datasets | No | The paper states: "We generate random instances of LMDPs of size M = 7, S = 15, A = 3 and set the time-horizon H = 30." This indicates that synthetic data was generated for the experiments, but no public dataset is released and no access to the generated instances is provided. |
| Dataset Splits | No | The paper mentions generating synthetic data instances for experiments but does not specify any training, validation, or test dataset splits. The experiments are run on these generated instances without explicit data partitioning. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments. |
| Software Dependencies | No | The paper mentions using the "Q-MDP heuristic [32]" for planning, but it does not specify version numbers for this or any other software component, which are needed for reproducibility. |
| Experiment Setup | Yes | We generate random instances of LMDPs of size M = 7, S = 15, A = 3 and set the time-horizon H = 30. The reward distribution is set to be 0 for most state-action pairs. ... For various levels of δ, we generate the parameters for transition probabilities randomly while keeping the distance between different MDPs to satisfy δ: ‖(T_{m1} − T_{m2})(s′\|s, a)‖_1 ≥ 2δ for m1 ≠ m2. ... To learn the parameters of PSR, we run 10^6 episodes with H = 4. ... In the clustering step, we run an additional 5 * 10^3 episodes to obtain longer trajectories of length H = 20, 40 and 80. |
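
The experiment-setup evidence above describes how the synthetic LMDP instances are generated, but the paper releases no code, so the following Python sketch is only an illustration of that procedure under stated assumptions: M = 7 transition kernels over S = 15 states and A = 3 actions, drawn at random and re-sampled until every pair of MDPs is at least 2δ apart in L1 distance at each (s, a). The function names (`generate_lmdp`, `separated`) and the rejection-sampling scheme are hypothetical, not taken from the authors' implementation.

```python
import numpy as np

# Hypothetical reconstruction of the instance generation described in Section 4.
# Names and the rejection-sampling scheme are illustrative; the paper does not
# release its generation code.

def random_kernel(S, rng):
    """Draw a random probability distribution over S next states."""
    p = rng.random(S)
    return p / p.sum()

def separated(T, m, s, a, delta):
    """True if MDP m's kernel at (s, a) is >= 2*delta (L1) from all earlier MDPs."""
    return all(np.abs(T[k, s, a] - T[m, s, a]).sum() >= 2 * delta for k in range(m))

def generate_lmdp(M=7, S=15, A=3, delta=0.1, seed=0, max_tries=1000):
    """Return M random transition kernels of shape (S, A, S) obeying the
    pairwise L1-separation constraint from the paper's setup."""
    rng = np.random.default_rng(seed)
    T = np.zeros((M, S, A, S))
    for m in range(M):
        for s in range(S):
            for a in range(A):
                for _ in range(max_tries):
                    T[m, s, a] = random_kernel(S, rng)
                    if separated(T, m, s, a, delta):
                        break
                else:
                    raise RuntimeError("separation constraint not met; lower delta")
    return T

if __name__ == "__main__":
    T = generate_lmdp()
    print(T.shape)  # (7, 15, 3, 15): one (S, A, S) kernel per latent MDP
```

Rejection sampling is one simple way to enforce the separation constraint; the paper does not say how the authors enforced it, nor how the sparse rewards were placed, so those details are left out of the sketch.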