Provably Efficient RL with Rich Observations via Latent State Decoding

Authors: Simon Du, Akshay Krishnamurthy, Nan Jiang, Alekh Agarwal, Miroslav Dudik, John Langford

ICML 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We provide finite-sample guarantees on the quality of the learned state decoding function and exploration policies, and complement our theory with an empirical evaluation on a class of hard exploration problems.
Researcher Affiliation | Collaboration | (1) Carnegie Mellon University, (2) Microsoft Research, New York, (3) University of Illinois at Urbana-Champaign, (4) Microsoft Research, Redmond. Correspondence to: Simon S. Du <ssdu@cs.cmu.edu>.
Pseudocode | Yes | Algorithm 1 PCID (Policy Cover via Inductive Decoding) ... Algorithm 2 Clustering to Find Latent-state Embeddings. ... Algorithm 3 Dynamic Programming for Reaching a State
Open Source Code | Yes | Our code is available at https://github.com/Microsoft/StateDecoding.
Open Datasets | No | The paper describes custom-designed environments ('a form of a combination lock') for the experiments, but it does not provide concrete access information (e.g., links, citations) to a publicly available or open dataset.
Dataset Splits | No | The paper describes running algorithms for a certain number of episodes and using replicates with different randomizations, but it does not specify explicit train, validation, or test dataset splits in terms of percentages or sample counts for the data used in experiments.
Hardware Specification | No | The paper mentions that 'a single replicate of the above protocol for the two-layer neural net model and H = 50 takes less than 10 minutes on a standard laptop.' However, 'standard laptop' is too general and does not provide specific hardware details (e.g., CPU/GPU model, memory).
Software Dependencies | No | The paper mentions using AdaGrad for training neural networks but does not specify version numbers for any software components, libraries, or programming languages used in the experiments.
Experiment Setup | Yes | For both Lock-Bernoulli and Lock-Gaussian, we experiment with linear decoding functions, which we fit via ordinary least squares. For Lock-Gaussian only, we also use two-layer neural networks... trained using AdaGrad with a fixed learning rate of 0.1, for a maximum of 5K iterations. Each algorithm has two hyperparameters that we tune. In our algorithm (PCID), we use k-means clustering instead of Algorithm 2, so one of the hyperparameters is the number of clusters k. The second one is the number of trajectories n to collect in each outer iteration. For OracleQ, these are the learning rate and a confidence parameter c. For QLearning, these are the learning rate and frac ∈ [0, 1], a fraction of the 100K episodes over which to anneal the exploration probability linearly from 1 down to 0.01. Each algorithm runs for 100K episodes and we say that it has solved the lock by episode t if at round t its running-average reward is at least 0.25 = 0.5 V*. (Sketches illustrating this setup follow the table.)
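
The Experiment Setup row states that the decoding function is linear, fit by ordinary least squares, and that k-means clustering (with a tuned cluster count k) replaces Algorithm 2 in the experiments. The sketch below is a minimal, non-authoritative illustration of such a pipeline: the array shapes, the choice of regression targets (one-hot previous-state/action pairs), and the function name decode_latent_states are assumptions made for illustration, not the authors' exact implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def decode_latent_states(X, prev_state_action, k):
    """Illustrative decoding step (assumed details, not the paper's exact code).

    X                : (n, d) array of rich observations collected at one level.
    prev_state_action: (n,) integer ids of the (previous latent state, action)
                       pair behind each observation (assumed regression target).
    k                : number of clusters, a tuned hyperparameter in the paper.
    """
    n, d = X.shape
    m = int(prev_state_action.max()) + 1

    # One-hot targets for the regression problem.
    Y = np.zeros((n, m))
    Y[np.arange(n), prev_state_action] = 1.0

    # Linear decoding function fit via ordinary least squares, as stated in the paper.
    W, _, _, _ = np.linalg.lstsq(X, Y, rcond=None)
    embeddings = X @ W  # predicted (previous state, action) profile per observation

    # k-means on the embeddings stands in for Algorithm 2 in the experiments.
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(embeddings)
    return W, labels
```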
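
The same row also specifies two concrete rules: QLearning anneals its exploration probability linearly from 1 down to 0.01 over a tuned fraction frac of the 100K episodes, and an algorithm is said to have solved the lock by episode t once its running-average reward reaches 0.25 = 0.5 V*. A small sketch of those two rules, with hypothetical helper names:

```python
def exploration_probability(episode, total_episodes=100_000, frac=0.5):
    """Exploration probability annealed linearly from 1.0 down to 0.01 over the
    first `frac` fraction of episodes; frac is a tuned hyperparameter."""
    anneal_episodes = max(1, int(frac * total_episodes))
    progress = min(1.0, episode / anneal_episodes)
    return 1.0 + progress * (0.01 - 1.0)

def solved_by(rewards, threshold=0.25):
    """Return the first episode t at which the running-average reward reaches
    the threshold 0.25 = 0.5 * V*, or None if it never does."""
    total = 0.0
    for t, r in enumerate(rewards, start=1):
        total += r
        if total / t >= threshold:
            return t
    return None
```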