Self-Paced Deep Reinforcement Learning
Authors: Pascal Klink, Carlo D'Eramo, Jan R. Peters, Joni Pajarinen
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In the conducted experiments, the curricula generated with the proposed algorithm significantly improve learning performance across several environments and deep RL algorithms, matching or outperforming existing state-of-the-art CRL algorithms. |
| Researcher Affiliation | Academia | Pascal Klink¹, Carlo D'Eramo¹, Jan Peters¹, Joni Pajarinen¹,². ¹ Intelligent Autonomous Systems, Technische Universität Darmstadt, Germany; ² Department of Electrical Engineering and Automation, Aalto University, Finland |
| Pseudocode | Yes | Algorithm 1 Self-Paced Deep Reinforcement Learning |
| Open Source Code | Yes | Code for running the experiments can be found at https://github.com/psclklnk/spdl |
| Open Datasets | Yes | We use the OpenAI Gym simulation environment [53]... We use the Nvidia Isaac Gym simulator [54] for this experiment. |
| Dataset Splits | No | The paper describes training and evaluation in continuous environments but does not specify explicit training, validation, or test dataset splits (e.g., percentages or sample counts). |
| Hardware Specification | No | The paper mentions 'on our hardware' but does not provide specific details such as GPU models, CPU types, or memory used for the experiments. |
| Software Dependencies | No | The paper mentions software like 'Stable Baselines library' and 'SciPy library' but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | We evaluate the performance using TRPO [16], PPO [17] and SAC [18]. For all DRL algorithms, we use the implementations provided in the Stable Baselines library [52]. ... In each iteration, the parameter $\alpha_i$ is chosen such that the KL divergence penalty w.r.t. the current context distribution is in constant proportion $\zeta$ to the average reward obtained during the last iteration of policy optimization: $\alpha_i = B(\nu_i, D_i) = \zeta \left( \frac{1}{K} \sum_{k=1}^{K} R(\tau_k, c_k) \big/ D_{\mathrm{KL}}\left( p_{\nu_i}(c) \,\Vert\, \mu(c) \right) \right)$ ... For the experiments, we restrict $p_\nu(c)$ to be Gaussian. (A minimal sketch of this $\alpha_i$ computation follows the table.) |
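
The Experiment Setup row quotes the paper's rule for setting the KL penalty weight $\alpha_i$ as a fixed proportion $\zeta$ of the average episodic reward divided by the KL divergence between the current context distribution $p_{\nu_i}(c)$ and the target distribution $\mu(c)$. The snippet below is a minimal sketch of that scalar computation, assuming both distributions are multivariate Gaussians (the paper restricts $p_\nu(c)$ to be Gaussian) and using the closed-form Gaussian KL divergence; all function and variable names are illustrative and are not taken from the authors' released code at https://github.com/psclklnk/spdl.

```python
import numpy as np

def gaussian_kl(mean_p, cov_p, mean_q, cov_q):
    """Closed-form KL(p || q) between two multivariate Gaussians."""
    d = mean_p.shape[0]
    cov_q_inv = np.linalg.inv(cov_q)
    diff = mean_q - mean_p
    return 0.5 * (np.trace(cov_q_inv @ cov_p)
                  + diff @ cov_q_inv @ diff
                  - d
                  + np.log(np.linalg.det(cov_q) / np.linalg.det(cov_p)))

def compute_alpha(episode_rewards, mean_nu, cov_nu, mean_mu, cov_mu, zeta):
    """Sketch of alpha_i = zeta * (average episodic reward) / KL(p_nu_i || mu)."""
    avg_reward = np.mean(episode_rewards)  # (1/K) * sum_k R(tau_k, c_k)
    kl = gaussian_kl(mean_nu, cov_nu, mean_mu, cov_mu)
    return zeta * avg_reward / kl

# Illustrative usage with made-up numbers (not values from the paper).
alpha = compute_alpha(episode_rewards=np.array([12.3, 8.7, 10.1]),
                      mean_nu=np.zeros(2), cov_nu=np.eye(2),
                      mean_mu=np.ones(2), cov_mu=2.0 * np.eye(2),
                      zeta=1.6)
```

As the quoted rule suggests, higher average rewards yield a larger $\alpha_i$ and hence a stronger KL penalty toward the target $\mu(c)$; the sketch only reproduces this bookkeeping, not the full self-paced update of the context distribution.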