Simplifying Model-based RL: Learning Representations, Latent-space Models, and Policies with One Objective

Authors: Raj Ghugare, Homanga Bharadhwaj, Benjamin Eysenbach, Sergey Levine, Russ Salakhutdinov

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We demonstrate that the resulting algorithm matches or improves the sample-efficiency of the best prior model-based and model-free RL methods."
Researcher Affiliation | Academia | VNIT Nagpur, Carnegie Mellon University, UC Berkeley
Pseudocode | Yes | Algorithm 1. "The ALM objective can be optimized with any RL algorithm. We present an implementation based on DDPG (Lillicrap et al., 2015)."
Open Source Code | Yes | Project website with code: https://alignedlatentmodels.github.io/
Open Datasets | Yes | "We start by comparing ALM with the baselines on the locomotion benchmark proposed by Wang et al. (2019)."
Dataset Splits | No | The paper mentions 'validation' only in the context of learned Q-functions (Appendix A.6 refers to a Q-function update), not as explicit dataset splits for the environments used.
Hardware Specification | No | The paper acknowledges assistance in 'setting up the compute necessary for running the experiments' but does not specify the hardware used (e.g., CPU/GPU models, memory).
Software Dependencies | No | The paper names various algorithms and network components (e.g., DDPG, SAC-SVG, layer normalization, ReLU/ELU activations) but does not list software dependencies with version numbers (e.g., PyTorch, TensorFlow, Python).
Experiment Setup | Yes | Table 3 gives a default set of hyper-parameters: discount (γ) 0.99; warmup steps 5000; soft update rate (τ) 0.005; weighted target parameter (λ) 0.95; replay buffer 10^6 for Humanoid, 10^5 otherwise; batch size 512; learning rate 1e-4; max grad norm 100.0; latent dimension 50; coefficient of classifier rewards 0.1; exploration stddev clip 0.3; exploration stddev schedule linear(1.0, 0.1, 100000).
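The Table 3 defaults above can be collected into a minimal configuration sketch. This is illustrative only: the dictionary keys, the `linear_schedule` helper, and its hold-at-end behavior are assumptions, not code from the paper; only the numeric values come from Table 3. The `linear(1.0, 0.1, 100000)` entry is read here as a standard deviation annealed linearly from 1.0 to 0.1 over the first 100,000 environment steps.

```python
def linear_schedule(start: float, end: float, duration: int, step: int) -> float:
    """Linearly anneal from `start` to `end` over `duration` steps, then hold.

    Hypothetical reading of the paper's `linear(1.0, 0.1, 100000)` notation.
    """
    frac = min(step / duration, 1.0)
    return start + frac * (end - start)


# Illustrative config mirroring Table 3; key names are invented for this sketch.
ALM_DEFAULTS = {
    "discount": 0.99,                 # γ
    "warmup_steps": 5000,
    "soft_update_rate": 0.005,        # τ
    "weighted_target_lambda": 0.95,   # λ
    "replay_buffer_size": 10**5,      # 10**6 for Humanoid
    "batch_size": 512,
    "learning_rate": 1e-4,
    "max_grad_norm": 100.0,
    "latent_dim": 50,
    "classifier_reward_coef": 0.1,
    "expl_stddev_clip": 0.3,
    # Exploration stddev: linear(1.0, 0.1, 100000)
    "expl_stddev": lambda step: linear_schedule(1.0, 0.1, 100_000, step),
}
```

For example, `ALM_DEFAULTS["expl_stddev"](0)` yields 1.0, and after 100,000 steps the schedule holds at 0.1.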