Learning What To Do by Simulating the Past
Authors: David Lindner, Rohin Shah, Pieter Abbeel, Anca Dragan
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate Deep RLSP on MuJoCo environments and show that it can recover fairly good performance on the task reward given access to a small number of states sampled from a policy optimized for that reward. |
| Researcher Affiliation | Academia | David Lindner Department of Computer Science ETH Zurich david.lindner@inf.ethz.ch Rohin Shah, Pieter Abbeel & Anca Dragan Center for Human-Compatible AI UC Berkeley {rohinmshah,pabbeel,anca}@berkeley.edu |
| Pseudocode | Yes | Algorithm 1 The DEEP RLSP algorithm. |
| Open Source Code | Yes | We provide code to replicate our experiments at https://github.com/HumanCompatibleAI/deep-rlsp. |
| Open Datasets | No | The paper mentions generating its own data through "random rollouts" or "environment interactions" but does not provide access information (link, citation, repository) for a publicly available, pre-existing dataset that was used for training. |
| Dataset Splits | No | The paper does not explicitly provide specific training/test/validation dataset splits (e.g., percentages or sample counts) for its experiments. Data is primarily generated through rollouts and used for training models, rather than being split from a fixed dataset. |
| Hardware Specification | No | The paper mentions using the "MuJoCo physics simulator" and environments, but does not provide specific hardware details such as exact GPU or CPU models, processor types, or memory amounts used for running the experiments. |
| Software Dependencies | No | The paper mentions several software components like "TensorFlow framework", "OpenAI Gym", "Soft Actor-Critic (SAC)", and "stable-baselines". However, it does not provide specific version numbers for these software dependencies, which are necessary for full reproducibility. |
| Experiment Setup | Yes | The hyperparameters of our experiments are described in detail in Appendix B. For example, B.1 Feature Function: "The latent space has dimension 30." "trained for 100 epochs on 100 rollouts of a random policy in the environment. During training we use a batch size of 500 and a learning rate of 10^-5." B.5 DEEP RLSP HYPERPARAMETERS: "learning rate of 0.01", "200 forward and backward trajectories", "algorithm until T = 10". |
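The hyperparameters quoted above can be collected into a single configuration sketch. This is a hypothetical structure for illustration only; the key names are not the authors' actual config schema, and only the numeric values are taken from the paper's Appendix B quotes.

```python
# Hedged sketch of the Deep RLSP hyperparameters reported in Appendix B.
# Key names are illustrative assumptions; values come from the quoted text.
DEEP_RLSP_CONFIG = {
    "feature_function": {            # Appendix B.1
        "latent_dim": 30,            # "The latent space has dimension 30."
        "epochs": 100,               # trained for 100 epochs
        "random_rollouts": 100,      # on 100 rollouts of a random policy
        "batch_size": 500,
        "learning_rate": 1e-5,       # learning rate of 10^-5
    },
    "deep_rlsp": {                   # Appendix B.5
        "learning_rate": 0.01,
        "num_trajectories": 200,     # 200 forward and backward trajectories
        "horizon_T": 10,             # algorithm run until T = 10
    },
}
```

A structure like this makes it easy to check at a glance which settings the paper does and does not report, which is the question the Experiment Setup row assesses.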