Provable Rich Observation Reinforcement Learning with Combinatorial Latent States
Authors: Dipendra Misra, Qinghua Liu, Chi Jin, John Langford
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Proof of Concept Experiments. We empirically evaluate FactoRL to support our theoretical results and to provide implementation details. We consider a problem with d factors, each emitting 2 atoms. We implement model classes F and G using feed-forward neural networks. Specifically, for G we apply the Gumbel-softmax trick to model the bottleneck, following Misra et al. (2020). We train the models using cross-entropy loss instead of the squared loss used in our theoretical analysis. For the independence test task, we declare two atoms to be independent if the best log-loss on the validation set is greater than c. We train the model using Adam optimization and perform model selection using a held-out set. We defer the full model and training details to Appendix F. For each time step, we collect 20,000 samples and share them across all routines. This gives a sample complexity of 20,000H. We repeated the experiment 3 times and found that each time, the model was able to perfectly detect the latent child function, learn a 1/2-policy cover, and estimate the model with error < 0.01. This is in sync with our theoretical findings and demonstrates the empirical use of FactoRL. (Minimal sketches of the Gumbel-softmax bottleneck and the independence test appear after this table.) |
| Researcher Affiliation | Collaboration | Dipendra Misra (Microsoft Research); Qinghua Liu (Princeton University); Chi Jin (Princeton University); John Langford (Microsoft Research) |
| Pseudocode | Yes | Algorithm 1 FactoRL(F, G, δ, σ, η_min, β_min, d, κ). Algorithm 2 FactorizeEmission(Ψ_{h-1}, φ̂_{h-1}, F). Algorithm 3 LearnDecoder(G, Ψ_{h-1}, ĉh_h). Algorithm 4 EstModel(Ψ_{h-1}, φ̂_{h-1}, φ̂_h). Algorithm 5 IndTest(F, D, u, v, β). |
| Open Source Code | Yes | We will make the code available at: https://github.com/cereb-rl. |
| Open Datasets | No | The paper describes a synthetic problem setup for its "Proof of Concept Experiments" but does not provide details about a publicly available dataset or instructions to access one. |
| Dataset Splits | Yes | We remove 0.2% of the training data and use it as a validation set. We evaluate on the validation set after every epoch, and use the model with the best performance on the validation set. |
| Hardware Specification | No | The paper states "We used PyTorch 1.6 to develop the code" but does not specify any hardware details such as CPU models, GPU models, or memory for running the experiments. |
| Software Dependencies | Yes | We used PyTorch 1.6 to develop the code and used the default initialization scheme for all parameters. |
| Experiment Setup | Yes | We train the model using Adam optimization and perform model selection using a held-out set... with a learning rate of 0.001 and a batch size of 32. (A training-loop sketch combining these reported settings appears after this table.) |
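
The decoder class G with a Gumbel-softmax bottleneck is only described at a high level. Below is a minimal PyTorch sketch of one plausible instantiation, following the Gumbel-softmax trick as used in Misra et al. (2020); the class name `GumbelBottleneckDecoder` and the parameters `obs_dim`, `hidden_dim`, `num_states`, and `temperature` are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GumbelBottleneckDecoder(nn.Module):
    """A sketch of a decoder in class G: a feed-forward encoder whose
    output is discretized with the Gumbel-softmax trick, mapping each
    observation to one of `num_states` latent factor values."""

    def __init__(self, obs_dim, hidden_dim=64, num_states=2, temperature=1.0):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_states),
        )
        self.temperature = temperature

    def forward(self, obs):
        logits = self.encoder(obs)
        # Differentiable discrete sample: `hard=True` returns a one-hot
        # vector in the forward pass with straight-through gradients,
        # which is what makes the bottleneck discrete yet trainable.
        return F.gumbel_softmax(logits, tau=self.temperature, hard=True)
```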
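The independence test (Algorithm 5, IndTest) is summarized only as thresholding the best validation log-loss. The sketch below shows one standard way to realize such a test: train a contrastive classifier to distinguish jointly sampled atom pairs from pairs in which one atom is shuffled (approximating the product of marginals), and declare independence when the classifier's best validation log-loss stays above the threshold c. The function name `ind_test`, the 80/20 split, and the epoch count are assumptions, not details from the paper.

```python
import torch
import torch.nn as nn


def ind_test(model, optimizer, u_vals, v_vals, threshold, epochs=10):
    """Sketch of a log-loss independence test for two atoms u and v.

    Positives: real joint samples of (u, v). Negatives: v shuffled,
    approximating the product of the two marginal distributions.
    """
    n = u_vals.shape[0]
    perm = torch.randperm(n)
    pairs = torch.cat([torch.stack([u_vals, v_vals], dim=1),
                       torch.stack([u_vals, v_vals[perm]], dim=1)])
    labels = torch.cat([torch.ones(n), torch.zeros(n)])
    idx = torch.randperm(len(pairs))          # shuffle before splitting
    pairs, labels = pairs[idx], labels[idx]
    split = int(0.8 * len(pairs))
    loss_fn = nn.BCEWithLogitsLoss()
    best_val_loss = float("inf")
    for _ in range(epochs):
        model.train()
        optimizer.zero_grad()
        loss_fn(model(pairs[:split]).squeeze(-1), labels[:split]).backward()
        optimizer.step()
        model.eval()
        with torch.no_grad():
            val_loss = loss_fn(model(pairs[split:]).squeeze(-1),
                               labels[split:]).item()
        best_val_loss = min(best_val_loss, val_loss)
    # A high best log-loss means the classifier cannot beat chance,
    # so the two atoms are declared independent.
    return best_val_loss > threshold
```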
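The reported training configuration (Adam with learning rate 0.001, batch size 32, a 0.2% held-out validation split, and keeping the model with the best validation performance after each epoch) maps directly onto a standard PyTorch training loop. The sketch below is a generic reconstruction under those reported values; the function name, epoch count, and dataset interface are assumptions.

```python
import copy

import torch
from torch.utils.data import DataLoader, random_split


def train_with_model_selection(model, dataset, loss_fn, epochs=20,
                               val_frac=0.002):
    """Train with Adam (lr 0.001, batch size 32), holding out `val_frac`
    of the data and keeping the checkpoint with the best validation loss."""
    n_val = max(1, int(val_frac * len(dataset)))
    train_set, val_set = random_split(dataset, [len(dataset) - n_val, n_val])
    train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
    val_loader = DataLoader(val_set, batch_size=32)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    best_loss = float("inf")
    best_state = copy.deepcopy(model.state_dict())
    for _ in range(epochs):
        model.train()
        for x, y in train_loader:
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()
        # Evaluate after every epoch and remember the best checkpoint.
        model.eval()
        with torch.no_grad():
            val_loss = sum(loss_fn(model(x), y).item() for x, y in val_loader)
        if val_loss < best_loss:
            best_loss = val_loss
            best_state = copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)   # model with best validation loss
    return model
```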