Efficient Reinforcement Learning in Block MDPs: A Model-free Representation Learning Approach

Authors: Xuezhou Zhang, Yuda Song, Masatoshi Uehara, Mengdi Wang, Alekh Agarwal, Wen Sun

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we show that BRIEE is more sample efficient than the state-of-the-art Block MDP algorithm HOMER and other empirical RL baselines on challenging rich-observation combination lock problems which require deep exploration.
Researcher Affiliation | Collaboration | Princeton University, Carnegie Mellon University, Cornell University, Google Research.
Pseudocode | Yes | Algorithm 1 Block-structured Representation learning with Interleaved Explore Exploit (BRIEE) ... Algorithm 2 Representation Learning Oracle (REPLEARN) ... Algorithm 3 Least Square Value Iteration (LSVI) (see the sketch below the table for how these routines interleave).
Open Source Code | Yes | Our code can be found at https://github.com/yudasong/briee.
Open Datasets | No | The paper describes a custom 'diabolical combination lock (comblock)' environment used for evaluation. While it notes that this setting is motivated by a prior benchmark (Misra et al., 2019), it does not provide a direct link, DOI, or formal citation for accessing a publicly available dataset used for training.
Dataset Splits | No | The paper discusses per-timestep replay buffers D_h and D'_h for data collection and learning, as well as 'evaluation runs', but does not specify a formal validation split (e.g., 80/10/10 percentages or absolute counts) from a static dataset.
Hardware Specification | No | The paper does not provide any specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running experiments.
Software Dependencies | No | The paper mentions baseline methods such as PPO, RND, and LSVI-UCB, but does not provide specific version numbers for their implementations or for any ancillary libraries.
Experiment Setup | Yes | We provide the full list of hyperparameters in Table 2. ... We provide the hyperparameters of BRIEE for the dense reward environment in Table 6 ... We provide the hyperparameters of PPO for the dense reward environment in Table 7.
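
The three algorithms quoted in the Pseudocode row interact in a single training loop: a representation learning oracle (REPLEARN) fits per-timestep features from replay buffers, and least-squares value iteration (LSVI) with exploration bonuses on those features produces the next exploration policies. Below is a minimal Python sketch of that interleaving, assuming hypothetical helpers collect_at_step, rep_learn, and lsvi_with_bonus; it is an illustrative reading of the algorithm structure, not the authors' implementation (their code is in the linked repository).

```python
# Minimal sketch of the interleaved explore-exploit loop (Algorithms 1-3).
# All helper names and the data layout are illustrative assumptions.
def briee_sketch(env, horizon, num_iterations, latent_dim):
    D = [[] for _ in range(horizon)]        # replay buffer D_h per timestep h
    D_prime = [[] for _ in range(horizon)]  # second replay buffer D'_h per timestep h
    policies = [None] * horizon             # current exploration policies
    phis = [None] * horizon                 # learned feature maps phi_h

    for _ in range(num_iterations):
        # Roll in with the current exploration policies (random behaviour on
        # the first pass) and append the observed transitions to the buffers.
        for h in range(horizon):
            d, d_prime = collect_at_step(env, policies, h)
            D[h] += d
            D_prime[h] += d_prime

        # Representation learning oracle (Algorithm 2, REPLEARN): fit a
        # feature map phi_h for each timestep from the replay buffers.
        phis = [rep_learn(D[h], D_prime[h], latent_dim) for h in range(horizon)]

        # Least-squares value iteration with exploration bonuses
        # (Algorithm 3, LSVI) on the learned features yields new policies.
        policies = lsvi_with_bonus(phis, D, horizon)

    return phis, policies
```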