RL for Latent MDPs: Regret Guarantees and a Lower Bound
Authors: Jeongyeol Kwon, Yonathan Efroni, Constantine Caramanis, Shie Mannor
NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we perform an empirical evaluation of the suggested algorithms on toy problems (Section 4), while focusing on the importance of the made assumptions. |
| Researcher Affiliation | Collaboration | Jeongyeol Kwon, The University of Texas at Austin (kwonchungli@utexas.edu); Yonathan Efroni, Microsoft Research, NYC (jonathan.efroni@gmail.com); Constantine Caramanis, The University of Texas at Austin (constantine@utexas.edu); Shie Mannor, Technion / NVIDIA (shie@ee.technion.ac.il, smannor@nvidia.com) |
| Pseudocode | Yes | Algorithm 1 Latent Upper Confidence Reinforcement Learning (L-UCRL); Algorithm 2 Access to True Contexts; Algorithm 3 Inference of Contexts; Algorithm 4 (Informal) Recovery of LMDP parameters |
| Open Source Code | No | The paper does not provide any statement about releasing source code for the methodology described. |
| Open Datasets | No | The paper states: "We generate random instances of LMDPs of size M = 7, S = 15, A = 3 and set the time-horizon H = 30." This indicates that synthetic data was generated for the experiments, but no public dataset is released and no access to the generated instances is provided. |
| Dataset Splits | No | The paper mentions generating synthetic data instances for experiments but does not specify any training, validation, or test dataset splits. The experiments are run on these generated instances without explicit data partitioning. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments. |
| Software Dependencies | No | The paper mentions using the "Q-MDP heuristic [32]" for planning, but it does not specify version numbers for this or any other software component, which are needed for reproducibility. |
| Experiment Setup | Yes | We generate random instances of LMDPs of size M = 7, S = 15, A = 3 and set the time-horizon H = 30. The reward distribution is set to be 0 for most state-action pairs. ... For various levels of δ, we generate the parameters for transition probabilities randomly while keeping the distance between different MDPs to satisfy δ: ‖(T_{m1} − T_{m2})(s′\|s, a)‖_1 ≥ 2δ for m1 ≠ m2. ... To learn the parameters of PSR, we run 10^6 episodes with H = 4. ... In the clustering step, we run an additional 5 * 10^3 episodes to obtain longer trajectories of length H = 20, 40 and 80. |
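
The experiment-setup evidence above describes how the synthetic LMDP instances are generated, but the paper releases no code, so the following Python sketch is only an illustration of that procedure under stated assumptions: M = 7 transition kernels over S = 15 states and A = 3 actions, drawn at random and re-sampled until every pair of MDPs is at least 2δ apart in L1 distance at each (s, a). The function names (`generate_lmdp`, `separated`) and the rejection-sampling scheme are hypothetical, not taken from the authors' implementation.

```python
import numpy as np

# Hypothetical reconstruction of the instance generation described in Section 4.
# Names and the rejection-sampling scheme are illustrative; the paper does not
# release its generation code.

def random_kernel(S, rng):
    """Draw a random probability distribution over S next states."""
    p = rng.random(S)
    return p / p.sum()

def separated(T, m, s, a, delta):
    """True if MDP m's kernel at (s, a) is >= 2*delta (L1) from all earlier MDPs."""
    return all(np.abs(T[k, s, a] - T[m, s, a]).sum() >= 2 * delta for k in range(m))

def generate_lmdp(M=7, S=15, A=3, delta=0.1, seed=0, max_tries=1000):
    """Return M random transition kernels of shape (S, A, S) obeying the
    pairwise L1-separation constraint from the paper's setup."""
    rng = np.random.default_rng(seed)
    T = np.zeros((M, S, A, S))
    for m in range(M):
        for s in range(S):
            for a in range(A):
                for _ in range(max_tries):
                    T[m, s, a] = random_kernel(S, rng)
                    if separated(T, m, s, a, delta):
                        break
                else:
                    raise RuntimeError("separation constraint not met; lower delta")
    return T

if __name__ == "__main__":
    T = generate_lmdp()
    print(T.shape)  # (7, 15, 3, 15): one (S, A, S) kernel per latent MDP
```

Rejection sampling is one simple way to enforce the separation constraint; the paper does not say how the authors enforced it, nor how the sparse rewards were placed, so those details are left out of the sketch.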