Simplifying Model-based RL: Learning Representations, Latent-space Models, and Policies with One Objective
Authors: Raj Ghugare, Homanga Bharadhwaj, Benjamin Eysenbach, Sergey Levine, Russ Salakhutdinov
ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that the resulting algorithm matches or improves upon the sample-efficiency of the best prior model-based and model-free RL methods. |
| Researcher Affiliation | Academia | VNIT Nagpur; Carnegie Mellon University; UC Berkeley |
| Pseudocode | Yes | Algorithm 1: The ALM objective can be optimized with any RL algorithm. We present an implementation based on DDPG (Lillicrap et al., 2015). (A hedged sketch of such a DDPG-style update appears below the table.) |
| Open Source Code | Yes | Project website with code: https://alignedlatentmodels.github.io/ |
| Open Datasets | Yes | We start by comparing ALM with the baselines on the locomotion benchmark proposed by Wang et al. (2019). |
| Dataset Splits | No | The paper uses 'validation' only in the context of learned Q-functions (in Appendix A.6 it refers to a Q-function update), not to describe explicit train/validation/test splits for the environments used. |
| Hardware Specification | No | The paper acknowledges assistance in 'setting up the compute necessary for running the experiments' but does not provide specific details on the hardware used (e.g., CPU, GPU models, memory). |
| Software Dependencies | No | The paper mentions various algorithms and network components (e.g., DDPG, SAC-SVG, layer normalization, ReLU/ELU activations) but does not list specific software dependencies with version numbers (e.g., PyTorch, TensorFlow, Python versions). |
| Experiment Setup | Yes | Table 3 (default hyper-parameters): Discount (γ) = 0.99; Warmup steps = 5000; Soft update rate (τ) = 0.005; Weighted target parameter (λ) = 0.95; Replay buffer = 10^6 for Humanoid, 10^5 otherwise; Batch size = 512; Learning rate = 1e-4; Max grad norm = 100.0; Latent dimension = 50; Coefficient of classifier rewards = 0.1; Exploration stddev. clip = 0.3; Exploration stddev. schedule = linear(1.0, 0.1, 100000). (Transcribed into a config sketch below the table.) |
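
The paper states only that ALM's objective is optimized with a DDPG-based actor-critic (Algorithm 1). The snippet below is a minimal, hypothetical sketch of such a DDPG-style update loop in PyTorch, not the authors' released code: the network sizes, the `ddpg_update` helper, and the dummy batch are illustrative assumptions, while the discount, soft-update rate, learning rate, grad-norm clip, and batch size are taken from Table 3.

```python
# Minimal sketch of a DDPG-style update loop, assuming PyTorch.
# Network sizes, names, and the dummy batch are hypothetical; only the
# discount, soft-update rate, learning rate, grad-norm clip, and batch
# size come from Table 3 of the paper.
import copy

import torch
import torch.nn as nn

obs_dim, act_dim = 17, 6                                   # hypothetical env dims
gamma, tau, lr, max_grad_norm = 0.99, 0.005, 1e-4, 100.0   # Table 3 values

actor = nn.Sequential(nn.Linear(obs_dim, 256), nn.ELU(),
                      nn.Linear(256, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 256), nn.ELU(),
                       nn.Linear(256, 1))
actor_tgt, critic_tgt = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=lr)
critic_opt = torch.optim.Adam(critic.parameters(), lr=lr)


def ddpg_update(batch):
    """One DDPG step: TD critic regression, deterministic actor ascent,
    then Polyak averaging of the target networks."""
    obs, act, rew, next_obs, done = batch

    # Critic: fit Q(s, a) to r + gamma * (1 - done) * Q_tgt(s', pi_tgt(s')).
    with torch.no_grad():
        next_q = critic_tgt(torch.cat([next_obs, actor_tgt(next_obs)], -1))
        target = rew + gamma * (1.0 - done) * next_q
    q = critic(torch.cat([obs, act], -1))
    critic_loss = nn.functional.mse_loss(q, target)
    critic_opt.zero_grad()
    critic_loss.backward()
    nn.utils.clip_grad_norm_(critic.parameters(), max_grad_norm)
    critic_opt.step()

    # Actor: maximize Q(s, pi(s)) by minimizing its negation.
    actor_loss = -critic(torch.cat([obs, actor(obs)], -1)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft target update: tgt <- (1 - tau) * tgt + tau * online.
    for net, tgt in ((actor, actor_tgt), (critic, critic_tgt)):
        for p, p_tgt in zip(net.parameters(), tgt.parameters()):
            p_tgt.data.lerp_(p.data, tau)


# Exercise the update once with a random batch (batch size from Table 3).
B = 512
batch = (torch.randn(B, obs_dim), torch.rand(B, act_dim) * 2 - 1,
         torch.randn(B, 1), torch.randn(B, obs_dim), torch.zeros(B, 1))
ddpg_update(batch)
```

Note that ALM additionally trains an encoder, latent-space model, and classifier under its single objective; this sketch covers only the generic DDPG optimizer loop the paper says it is built on.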
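
For reference, the Table 3 defaults transcribe directly into a config dict. The dict keys and the `linear_schedule` helper below are my own naming (the paper writes only linear(1.0, 0.1, 100000)); the values themselves are from Table 3.

```python
# Table 3 defaults transcribed into a config dict; key names and the
# linear_schedule helper are my own, not the authors' code.
ALM_DEFAULTS = {
    "discount": 0.99,             # gamma
    "warmup_steps": 5000,
    "soft_update_rate": 0.005,    # tau
    "weighted_target_lambda": 0.95,
    "replay_buffer_size": 10**5,  # Table 3: 10**6 for Humanoid, 10**5 otherwise
    "batch_size": 512,
    "learning_rate": 1e-4,
    "max_grad_norm": 100.0,
    "latent_dim": 50,
    "classifier_reward_coef": 0.1,
    "explore_stddev_clip": 0.3,
}


def linear_schedule(start: float, end: float, duration: int, step: int) -> float:
    """Anneal linearly from `start` to `end` over `duration` steps, then hold;
    matches the shape of Table 3's linear(1.0, 0.1, 100000) entry."""
    frac = min(max(step / duration, 0.0), 1.0)
    return start + frac * (end - start)


# Exploration stddev halfway through the schedule:
assert abs(linear_schedule(1.0, 0.1, 100_000, 50_000) - 0.55) < 1e-9
```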