Learning What To Do by Simulating the Past

Authors: David Lindner, Rohin Shah, Pieter Abbeel, Anca Dragan

Venue: ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate Deep RLSP on MuJoCo environments and show that it can recover fairly good performance on the task reward given access to a small number of states sampled from a policy optimized for that reward.
Researcher Affiliation | Academia | David Lindner, Department of Computer Science, ETH Zurich (david.lindner@inf.ethz.ch); Rohin Shah, Pieter Abbeel & Anca Dragan, Center for Human-Compatible AI, UC Berkeley ({rohinmshah,pabbeel,anca}@berkeley.edu)
Pseudocode | Yes | Algorithm 1: The DEEP RLSP algorithm (an illustrative sketch of this loop is given after the table).
Open Source Code | Yes | We provide code to replicate our experiments at https://github.com/HumanCompatibleAI/deep-rlsp.
Open Datasets | No | The paper generates its own data through "random rollouts" or "environment interactions" but does not provide access information (link, citation, or repository) for a publicly available, pre-existing dataset used for training.
Dataset Splits | No | The paper does not explicitly report training/validation/test splits (e.g., percentages or sample counts) for its experiments. Data is generated through rollouts and used directly for training models rather than being split from a fixed dataset.
Hardware Specification | No | The paper mentions using the MuJoCo physics simulator and its environments, but does not report hardware details such as GPU or CPU models, processor types, or memory used to run the experiments.
Software Dependencies | No | The paper mentions several software components, including the TensorFlow framework, OpenAI Gym, Soft Actor-Critic (SAC), and stable-baselines, but does not give version numbers for these dependencies, which are needed for full reproducibility.
Experiment Setup | Yes | The hyperparameters of our experiments are described in detail in Appendix B. For example, B.1 Feature Function: "The latent space has dimension 30"; "trained for 100 epochs on 100 rollouts of a random policy in the environment. During training we use a batch size of 500 and a learning rate of 10^-5." B.5 DEEP RLSP HYPERPARAMETERS: "learning rate of 0.01", "200 forward and backward trajectories", "algorithm until T = 10" (these values are consolidated in a configuration sketch after the table).
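The Pseudocode row refers to Algorithm 1, which alternates backward simulation of the past from observed states with forward rollouts of the current policy. The Python sketch below is only an illustration of that loop under simplifying assumptions: the reward is taken to be linear in learned features, the likelihood gradient is replaced by a plain feature-matching update, and every model interface (phi, inverse_policy, inverse_dynamics, policy_rollout, train_policy) is a hypothetical callable supplied by the caller, not the paper's actual implementation.

```python
import numpy as np

def deep_rlsp_sketch(observed_states, phi, inverse_policy, inverse_dynamics,
                     policy_rollout, train_policy, horizon=10,
                     n_trajectories=200, learning_rate=0.01, iterations=50):
    """Illustrative sketch of the Deep RLSP loop (not the paper's exact algorithm).

    All arguments after `observed_states` are hypothetical callables standing in
    for the learned models (feature function, inverse policy, inverse dynamics,
    forward policy rollout, and SAC-style policy training).
    """
    feature_dim = phi(observed_states[0]).shape[0]
    theta = np.zeros(feature_dim)  # linear reward weights: r(s) = theta . phi(s)

    for _ in range(iterations):
        backward_feats, forward_feats = [], []
        for _ in range(n_trajectories):
            # Simulate the past: walk backwards from an observed state using
            # the learned inverse policy and inverse dynamics models.
            s = observed_states[np.random.randint(len(observed_states))]
            backward_traj = [s]
            for _ in range(horizon):
                a = inverse_policy(s)        # action that likely led to s
                s = inverse_dynamics(s, a)   # predecessor state
                backward_traj.append(s)
            backward_feats.append(np.mean([phi(x) for x in backward_traj], axis=0))

            # Simulate the future: roll the current policy forward from the
            # start of the simulated past trajectory.
            forward_traj = policy_rollout(backward_traj[-1], horizon)
            forward_feats.append(np.mean([phi(x) for x in forward_traj], axis=0))

        # Simplified feature-matching update: move theta toward features seen
        # along the simulated past and away from features the current policy
        # produces (the paper derives a proper likelihood gradient instead).
        grad = np.mean(backward_feats, axis=0) - np.mean(forward_feats, axis=0)
        theta += learning_rate * grad

        # Re-optimize the policy for the current reward (the paper uses SAC).
        train_policy(lambda s: float(theta @ phi(s)))

    return theta
```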
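For reference, the Appendix B values quoted in the Experiment Setup row can be collected into a single configuration record. The dictionary below only restates those numbers; the key names are our own and do not come from the paper or its codebase.

```python
# Consolidation of the Appendix B hyperparameters quoted above; key names are
# illustrative and not taken from the paper or its released code.
DEEP_RLSP_HYPERPARAMS = {
    "feature_function": {      # Appendix B.1 (VAE feature function)
        "latent_dim": 30,
        "train_epochs": 100,
        "num_random_rollouts": 100,
        "batch_size": 500,
        "learning_rate": 1e-5,
    },
    "deep_rlsp": {             # Appendix B.5 (main algorithm)
        "learning_rate": 0.01,
        "num_forward_backward_trajectories": 200,
        "horizon_T": 10,
    },
}
```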