Provably Efficient RL with Rich Observations via Latent State Decoding
Authors: Simon Du, Akshay Krishnamurthy, Nan Jiang, Alekh Agarwal, Miroslav Dudik, John Langford
ICML 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide finite-sample guarantees on the quality of the learned state decoding function and exploration policies, and complement our theory with an empirical evaluation on a class of hard exploration problems. |
| Researcher Affiliation | Collaboration | 1 Carnegie Mellon University, 2 Microsoft Research, New York, 3 University of Illinois at Urbana-Champaign, 4 Microsoft Research, Redmond. Correspondence to: Simon S. Du <ssdu@cs.cmu.edu>. |
| Pseudocode | Yes | Algorithm 1 PCID (Policy Cover via Inductive Decoding) ... Algorithm 2 Clustering to Find Latent-state Embeddings. ... Algorithm 3 Dynamic Programming for Reaching a State |
| Open Source Code | Yes | Our code is available at https://github.com/Microsoft/StateDecoding. |
| Open Datasets | No | The paper describes custom-designed environments ('a form of a combination lock') for the experiments, but it does not provide concrete access information (e.g., links, citations) to a publicly available or open dataset. |
| Dataset Splits | No | The paper describes running algorithms for a certain number of episodes and using replicates with different randomizations, but it does not specify explicit train, validation, or test dataset splits in terms of percentages or sample counts for the data used in experiments. |
| Hardware Specification | No | The paper mentions that 'a single replicate of the above protocol for the two-layer neural net model and H = 50 takes less than 10 minutes on a standard laptop.' However, 'standard laptop' is too general and does not provide specific hardware details (e.g., CPU/GPU model, memory). |
| Software Dependencies | No | The paper mentions using 'AdaGrad' for training neural networks but does not specify version numbers for any software components, libraries, or programming languages used in the experiments. |
| Experiment Setup | Yes | For both Lock-Bernoulli and Lock-Gaussian, we experiment with linear decoding functions, which we fit via ordinary least squares. For Lock-Gaussian only, we also use two-layer neural networks... trained using AdaGrad with a fixed learning rate of 0.1, for a maximum of 5K iterations. Each algorithm has two hyperparameters that we tune. In our algorithm (PCID), we use k-means clustering instead of Algorithm 2, so one of the hyperparameters is the number of clusters k. The second one is the number of trajectories n to collect in each outer iteration. For ORACLEQ, these are the learning rate and a confidence parameter c. For QLEARNING, these are the learning rate and frac ∈ [0, 1], a fraction of the 100K episodes over which to anneal the exploration probability linearly from 1 down to 0.01. Each algorithm runs for 100K episodes and we say that it has solved the lock by episode t if at round t its running-average reward is 0.25 = 0.5V*. |
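As a rough, self-contained illustration of two ingredients in the quoted setup, the sketch below (Python, assuming NumPy and scikit-learn) implements the "solved by episode t" criterion with threshold 0.25 = 0.5V* and the k-means substitution for Algorithm 2 on toy data. The function names and toy data are ours for illustration only, not taken from the paper or its code release.

```python
import numpy as np
from sklearn.cluster import KMeans

def solved_by_episode(rewards, threshold=0.25):
    """First episode t at which the running-average reward reaches the
    paper's threshold 0.25 = 0.5 * V*; None if it never does."""
    running_avg = np.cumsum(rewards) / np.arange(1, len(rewards) + 1)
    hits = np.flatnonzero(running_avg >= threshold)
    return int(hits[0]) + 1 if hits.size else None  # 1-indexed episode count

def decode_latent_states(embeddings, k):
    """Cluster predicted observation embeddings into k latent states,
    standing in for Algorithm 2 as the paper's experiments do with k-means."""
    km = KMeans(n_clusters=k, n_init=10).fit(embeddings)
    return km.labels_, km.cluster_centers_

# Toy reward trace: zero reward for 200 episodes, then 0.5 per episode.
rewards = np.concatenate([np.zeros(200), np.full(800, 0.5)])
print(solved_by_episode(rewards))  # -> 400

# Toy "decoder outputs": three well-separated clusters in R^4, clustered with k = 3.
rng = np.random.default_rng(0)
embeddings = np.vstack([rng.normal(c, 0.1, size=(50, 4)) for c in (0.0, 1.0, 2.0)])
labels, centers = decode_latent_states(embeddings, k=3)
```

In the actual protocol, the number of clusters k and the per-iteration trajectory count n are the two hyperparameters tuned for PCID; the sketch above only shows how the clustering and success check would plug into such a loop.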