Trajectory-Based Off-Policy Deep Reinforcement Learning
Authors: Andreas Doerr, Michael Volpp, Marc Toussaint, Sebastian Trimpe, Christian Daniel
ICML 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the proposed approach on a series of continuous control benchmark tasks. The results show that the proposed algorithm is able to successfully and reliably learn solutions using fewer system interactions than standard policy gradient methods. The experimental evaluation of the proposed DD-OPG method is threefold. In Sec. 6.1, the resulting surrogate return model is visualized, highlighting different modeling options. A benchmark against state-of-the-art PG methods is shown in Sec. 6.2 to highlight fast and data-efficient learning. Finally, important parts of the proposed algorithms and their effects on the final learning performance are highlighted in an ablation study in Sec. 6.3. |
| Researcher Affiliation | Collaboration | ¹Bosch Center for Artificial Intelligence, Renningen, Germany. ²Max Planck Institute for Intelligent Systems, Stuttgart/Tübingen, Germany. ³Machine Learning and Robotics Lab, University of Stuttgart, Germany. |
| Pseudocode | Yes | Algorithm 1 Model-free DD-OPG (an illustrative sketch of the trajectory-level importance sampling it builds on appears after the table) |
| Open Source Code | Yes | https://github.com/boschresearch/DD_OPG |
| Open Datasets | Yes | The proposed DD-OPG method is evaluated... on a series of continuous control benchmark tasks. The resulting learning performances are visualized in Fig. 2 for the cartpole, mountaincar and swimmer environment (left to right) (Duan et al., 2016). |
| Dataset Splits | No | The paper evaluates learning performance over 'system interaction steps' and refers to continuous control benchmark environments, but does not specify explicit train/validation/test dataset splits with percentages or sample counts needed to reproduce data partitioning. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running its experiments. |
| Software Dependencies | No | The paper mentions using Adam as an optimizer but does not specify any software dependencies (e.g., libraries, frameworks, or languages) with version numbers that would be needed to replicate the experiments. |
| Experiment Setup | Yes | For all methods, hyper-parameters are selected to achieve maximal accumulated average return, i.e. fast and stable policy optimization. Details about the individual methods configuration and the employed environments can be found in Appendix B. |
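
The rows above only quote the paper's own text. As a hedged illustration of the kind of trajectory-based off-policy return estimate that Algorithm 1 (Model-free DD-OPG) builds on, the sketch below implements a generic self-normalized importance-sampling surrogate of the expected return from stored trajectories. The linear-Gaussian policy parameterization, the `Trajectory` container, and all constants are assumptions chosen for illustration; they are not the authors' implementation, which is available in the repository linked in the Open Source Code row.

```python
import numpy as np
from dataclasses import dataclass


@dataclass
class Trajectory:
    """Stored rollout: states, actions, rewards, and the summed log-probability
    of the actions under the behavior policy that generated the data."""
    states: np.ndarray      # shape (T, state_dim)
    actions: np.ndarray     # shape (T, action_dim)
    rewards: np.ndarray     # shape (T,)
    behavior_logp: float    # sum_t log q(a_t | s_t)


def gaussian_policy_logp(theta, states, actions, sigma=0.1):
    """Trajectory log-likelihood under a linear-Gaussian policy
    a_t ~ N(theta @ s_t, sigma^2 I).  Illustrative parameterization only."""
    means = states @ theta.T                                   # (T, action_dim)
    logp = -0.5 * np.sum(((actions - means) / sigma) ** 2)
    logp -= actions.size * np.log(sigma * np.sqrt(2.0 * np.pi))
    return logp


def surrogate_return(theta, trajectories, gamma=0.99):
    """Self-normalized importance-sampling estimate of the expected discounted
    return J(theta) from off-policy trajectories.  Dynamics terms cancel in the
    trajectory-level importance weights, so only policy likelihoods appear."""
    returns, log_weights = [], []
    for traj in trajectories:
        discounts = gamma ** np.arange(len(traj.rewards))
        returns.append(np.sum(discounts * traj.rewards))
        log_weights.append(
            gaussian_policy_logp(theta, traj.states, traj.actions) - traj.behavior_logp
        )
    log_weights = np.asarray(log_weights)
    # Normalize the weights in log-space for numerical stability.
    weights = np.exp(log_weights - np.max(log_weights))
    weights /= np.sum(weights)
    return float(np.dot(weights, np.asarray(returns)))
```

Given a collection of such `Trajectory` objects gathered under earlier policies, one would maximize `surrogate_return` over `theta` with a gradient-based optimizer such as Adam, the optimizer mentioned in the Software Dependencies row. The paper's actual surrogate model, its regularization, and the policy architecture differ; this sketch only illustrates the underlying off-policy, trajectory-level estimator.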