Trajectory-Based Off-Policy Deep Reinforcement Learning
Authors: Andreas Doerr, Michael Volpp, Marc Toussaint, Sebastian Trimpe, Christian Daniel
ICML 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the proposed approach on a series of continuous control benchmark tasks. The results show that the proposed algorithm is able to successfully and reliably learn solutions using fewer system interactions than standard policy gradient methods. The experimental evaluation of the proposed DD-OPG method is threefold. In Sec. 6.1, the resulting surrogate return model is visualized, highlighting different modeling options. A benchmark against state-of-the-art PG methods is shown in Sec. 6.2 to highlight fast and data-efficient learning. Finally, important parts of the proposed algorithms and their effects on the final learning performance are highlighted in an ablation study in Sec. 6.3. |
| Researcher Affiliation | Collaboration | ¹Bosch Center for Artificial Intelligence, Renningen, Germany. ²Max Planck Institute for Intelligent Systems, Stuttgart/Tübingen, Germany. ³Machine Learning and Robotics Lab, University of Stuttgart, Germany. |
| Pseudocode | Yes | Algorithm 1 Model-free DD-OPG (an illustrative sketch of the trajectory-level importance sampling it builds on appears after the table) |
| Open Source Code | Yes | https://github.com/boschresearch/DD_OPG |
| Open Datasets | Yes | The proposed DD-OPG method is evaluated... on a series of continuous control benchmark tasks. The resulting learning performances are visualized in Fig. 2 for the cartpole, mountaincar and swimmer environment (left to right) (Duan et al., 2016). |
| Dataset Splits | No | The paper evaluates learning performance over 'system interaction steps' and refers to continuous control benchmark environments, but does not specify explicit train/validation/test dataset splits with percentages or sample counts needed to reproduce data partitioning. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running its experiments. |
| Software Dependencies | No | The paper mentions using Adam as an optimizer but does not specify any software dependencies (e.g., libraries, frameworks, or languages) with version numbers that would be needed to replicate the experiments. |
| Experiment Setup | Yes | For all methods, hyper-parameters are selected to achieve maximal accumulated average return, i.e. fast and stable policy optimization. Details about the individual methods configuration and the employed environments can be found in Appendix B. |
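
The rows above only quote the paper's own text. As a hedged illustration of the kind of trajectory-based off-policy return estimate that Algorithm 1 (Model-free DD-OPG) builds on, the sketch below implements a generic self-normalized importance-sampling surrogate of the expected return from stored trajectories. The linear-Gaussian policy parameterization, the `Trajectory` container, and all constants are assumptions chosen for illustration; they are not the authors' implementation, which is available in the repository linked in the Open Source Code row.

```python
import numpy as np
from dataclasses import dataclass


@dataclass
class Trajectory:
    """Stored rollout: states, actions, rewards, and the summed log-probability
    of the actions under the behavior policy that generated the data."""
    states: np.ndarray      # shape (T, state_dim)
    actions: np.ndarray     # shape (T, action_dim)
    rewards: np.ndarray     # shape (T,)
    behavior_logp: float    # sum_t log q(a_t | s_t)


def gaussian_policy_logp(theta, states, actions, sigma=0.1):
    """Trajectory log-likelihood under a linear-Gaussian policy
    a_t ~ N(theta @ s_t, sigma^2 I).  Illustrative parameterization only."""
    means = states @ theta.T                                   # (T, action_dim)
    logp = -0.5 * np.sum(((actions - means) / sigma) ** 2)
    logp -= actions.size * np.log(sigma * np.sqrt(2.0 * np.pi))
    return logp


def surrogate_return(theta, trajectories, gamma=0.99):
    """Self-normalized importance-sampling estimate of the expected discounted
    return J(theta) from off-policy trajectories.  Dynamics terms cancel in the
    trajectory-level importance weights, so only policy likelihoods appear."""
    returns, log_weights = [], []
    for traj in trajectories:
        discounts = gamma ** np.arange(len(traj.rewards))
        returns.append(np.sum(discounts * traj.rewards))
        log_weights.append(
            gaussian_policy_logp(theta, traj.states, traj.actions) - traj.behavior_logp
        )
    log_weights = np.asarray(log_weights)
    # Normalize the weights in log-space for numerical stability.
    weights = np.exp(log_weights - np.max(log_weights))
    weights /= np.sum(weights)
    return float(np.dot(weights, np.asarray(returns)))
```

Given a collection of such `Trajectory` objects gathered under earlier policies, one would maximize `surrogate_return` over `theta` with a gradient-based optimizer such as Adam, the optimizer mentioned in the Software Dependencies row. The paper's actual surrogate model, its regularization, and the policy architecture differ; this sketch only illustrates the underlying off-policy, trajectory-level estimator.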