Provable Rich Observation Reinforcement Learning with Combinatorial Latent States

Authors: Dipendra Misra, Qinghua Liu, Chi Jin, John Langford

ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Proof of Concept Experiments. We empirically evaluate FactoRL to support our theoretical results and to provide implementation details. We consider a problem with d factors, each emitting 2 atoms. We implement model classes F and G using feed-forward neural networks. Specifically, for G we apply the Gumbel-softmax trick to model the bottleneck, following Misra et al. (2020). We train the models using cross-entropy loss instead of the squared loss that we use for theoretical analysis. For the independence test task, we declare two atoms to be independent if the best log-loss on the validation set is greater than c. We train the model using Adam optimization and perform model selection using a held-out set. We defer the full model and training details to Appendix F. For each time step, we collect 20,000 samples and share them across all routines. This gives a sample complexity of 20,000H. We repeat the experiment 3 times and found that each time the model was able to perfectly detect the latent child function, learn a 1/2-policy cover, and estimate the model with error < 0.01. This is in sync with our theoretical findings and demonstrates the empirical use of FactoRL.
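To make the reported setup concrete, the following is a minimal PyTorch sketch of two implementation details quoted above: a Gumbel-softmax bottleneck for the decoder class G and the threshold-based independence test. The class name, layer sizes, temperature, prediction target, and the helper `declare_independent` are illustrative assumptions, not the authors' code (which they defer to Appendix F and their repository).

```python
import torch.nn as nn
import torch.nn.functional as F

class BottleneckDecoder(nn.Module):
    """Hypothetical decoder for G: encodes an atom block into a discrete
    latent state via a Gumbel-softmax bottleneck, then predicts a target
    from that state. Architecture and temperature are illustrative only."""

    def __init__(self, in_dim, num_states=2, hidden=64, out_dim=2, tau=1.0):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_states),
        )
        self.predictor = nn.Linear(num_states, out_dim)
        self.tau = tau

    def forward(self, x):
        logits = self.encoder(x)
        # Gumbel-softmax trick: (near) one-hot latent state with gradients
        # flowing through the softmax relaxation, following Misra et al. (2020).
        z = F.gumbel_softmax(logits, tau=self.tau, hard=True)
        # Trained with cross-entropy loss (the paper's squared loss is only
        # used for the theoretical analysis).
        return self.predictor(z)


def declare_independent(best_val_log_loss, c):
    """Decision rule quoted above: declare two atoms independent if the best
    validation log-loss exceeds the threshold c."""
    return best_val_log_loss > c
```

With `hard=True`, PyTorch's `gumbel_softmax` returns a one-hot sample in the forward pass while using the relaxed softmax gradient in the backward pass, which is what makes the discrete bottleneck trainable end to end.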
Researcher Affiliation | Collaboration | Dipendra Misra (Microsoft Research), Qinghua Liu (Princeton University), Chi Jin (Princeton University), John Langford (Microsoft Research)
Pseudocode | Yes | Algorithm 1: FactoRL(F, G, δ, σ, η_min, β_min, d, κ). Algorithm 2: FactorizeEmission(Ψ_{h-1}, φ̂_{h-1}, F). Algorithm 3: LearnDecoder(G, Ψ_{h-1}, ĉh_h). Algorithm 4: EstModel(Ψ_{h-1}, φ̂_{h-1}, φ̂_h). Algorithm 5: IndTest(F, D, u, v, β).
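The report lists only the routine names and argument signatures. The sketch below shows one plausible way the subroutines could compose into a top-level FactoRL loop; the control flow, the data-collection and planning callables, and the packaging of hyperparameters are assumptions for illustration, not the authors' pseudocode. The 20,000 samples per time step figure is taken from the experiments excerpt above.

```python
def facto_rl(F_class, G_class, H, collect_samples, factorize_emission,
             learn_decoder, est_model, plan_cover, n_samples=20_000):
    """Hypothetical top-level driver composing Algorithms 2-4.

    All subroutines are passed in as callables; only the argument flow is
    suggested by the signatures listed above. Hyperparameters such as
    delta, sigma, eta_min, beta_min, d, and kappa are assumed to be baked
    into the callables for brevity.
    """
    policy_cover = {0: []}   # trivial cover before the first time step
    decoders, models = {}, {}
    for h in range(1, H + 1):
        # 20,000 samples per time step, shared across all routines (per the paper).
        data = collect_samples(policy_cover[h - 1], h, n=n_samples)
        # Algorithm 2: group atoms emitted by the same latent factor.
        child_fn = factorize_emission(policy_cover[h - 1], decoders.get(h - 1), F_class, data)
        # Algorithm 3: learn a decoder from atom blocks to latent factor states.
        decoders[h] = learn_decoder(G_class, policy_cover[h - 1], child_fn, data)
        # Algorithm 4: estimate the factored latent transition model.
        models[h] = est_model(policy_cover[h - 1], decoders.get(h - 1), decoders[h], data)
        # Extend the policy cover to time h using the estimated model.
        policy_cover[h] = plan_cover(models, decoders, h)
    return decoders, models, policy_cover
```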
Open Source Code | Yes | We will make the code available at: https://github.com/cereb-rl.
Open Datasets | No | The paper describes a synthetic problem setup for its "Proof of Concept Experiments" but does not provide details about a publicly available dataset or instructions to access one.
Dataset Splits | Yes | We remove 0.2% of the training data and use it as a validation set. We evaluate on the validation set after every epoch, and use the model with the best performance on the validation set.
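As a concrete reading of this split-and-select protocol, here is a short sketch. The helpers `train_one_epoch` and `evaluate` are hypothetical, and the use of `state_dict` assumes a PyTorch module; only the 0.2% hold-out fraction and the keep-the-best-epoch rule come from the report above.

```python
import copy
import random

def split_train_val(dataset, val_fraction=0.002):
    """Hold out 0.2% of the training data as a validation set (per the paper)."""
    data = list(dataset)
    random.shuffle(data)
    n_val = max(1, int(len(data) * val_fraction))
    return data[n_val:], data[:n_val]

def select_best_model(model, train_set, val_set, train_one_epoch, evaluate, num_epochs):
    """Evaluate on the validation set after every epoch and keep the model
    with the best (lowest) validation loss."""
    best_loss = float("inf")
    best_state = copy.deepcopy(model.state_dict())
    for _ in range(num_epochs):
        train_one_epoch(model, train_set)
        val_loss = evaluate(model, val_set)
        if val_loss < best_loss:
            best_loss = val_loss
            best_state = copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)
    return model, best_loss
```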
Hardware Specification | No | The paper states "We used PyTorch 1.6 to develop the code" but does not specify any hardware details like CPU models, GPU models, or memory for running the experiments.
Software Dependencies | Yes | We used PyTorch 1.6 to develop the code and used the default initialization scheme for all parameters.
Experiment Setup | Yes | We train the model using Adam optimization and perform model selection using a held-out set... with a learning rate of 0.001 and a batch size of 32.
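The reported optimizer settings fit in a few lines. The sketch below assumes a PyTorch model and a dataset yielding (input, label) pairs, with cross-entropy loss as stated in the experiments excerpt; the `make_trainer` helper and the epoch structure are illustrative, not from the paper.

```python
import torch
from torch.utils.data import DataLoader

def make_trainer(model, train_dataset):
    """Training configuration reported in the paper: Adam with learning rate
    0.001 and a batch size of 32, trained with cross-entropy loss. The
    dataset format and number of epochs are placeholder assumptions."""
    loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    loss_fn = torch.nn.CrossEntropyLoss()

    def train_one_epoch():
        model.train()
        for x, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()

    return train_one_epoch
```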