CEM-RL: Combining evolutionary and gradient-based methods for policy search
Authors: Pourchot, Sigaud
ICLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the resulting method, CEM-RL, on a set of benchmarks classically used in deep RL. We show that CEM-RL benefits from several advantages over its competitors and offers a satisfactory trade-off between performance and sample efficiency. |
| Researcher Affiliation | Collaboration | Aloïs Pourchot (1,2), Olivier Sigaud (2); (1) Gleamer, 96bis Boulevard Raspail, 75006 Paris, France, alois.pourchot@gleamer.ai; (2) Sorbonne Université, CNRS UMR 7222, Institut des Systèmes Intelligents et de Robotique, F-75005 Paris, France, olivier.sigaud@upmc.fr |
| Pseudocode | Yes | A pseudo-code of CEM-RL is provided in Algorithm 1. (A hedged sketch of a CEM update step is given below the table.) |
| Open Source Code | Yes | The code for reproducing the experiments is available at https://github.com/apourchot/CEM-RL. |
| Open Datasets | Yes | We evaluate the corresponding algorithms in several continuous control tasks simulated with the MUJOCO physics engine and commonly used as policy search benchmarks: HALF-CHEETAH-V2, HOPPER-V2, WALKER2D-V2, SWIMMER-V2 and ANT-V2 (Brockman et al., 2016). |
| Dataset Splits | No | The paper evaluates policies in simulated environments (MUJOCO) where data is generated dynamically through interaction, rather than using a static dataset that is split into training, validation, and test sets. Therefore, explicit dataset splits are not applicable or provided. |
| Hardware Specification | No | No specific hardware details (e.g., GPU models, CPU models, or cloud computing specifications) used for running experiments are provided in the paper. |
| Software Dependencies | No | The paper mentions software libraries such as PYTORCH and the MUJOCO physics engine, but it does not provide specific version numbers for any of these software dependencies. |
| Experiment Setup | Yes | Architectures of the networks are described in Appendix A. Most TD3 and DDPG hyper-parameters were reused from Fujimoto et al. (2018). The only notable difference is the use of tanh non-linearities instead of ReLU in the actor network, after we spotted that the latter performs better than the former on several environments. We trained the networks with the Adam optimizer (Kingma & Ba, 2014), with a learning rate of 1e-3 for both the actor and the critic. The discount rate γ was set to 0.99, and the target weight τ to 5e-3. All populations contained 10 actors, and the standard deviations σ_init, σ_end and the constant τ_cem of the CEM algorithm were respectively set to 1e-3, 1e-5 and 0.95. Finally, the size of the replay buffer was set to 1e6, and the batch size to 100. (These values are collected into the configuration sketch below.) |
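
For readability, the hyper-parameters quoted in the Experiment Setup row can be collected into a single configuration object. The sketch below is ours: the class and field names (`CEMRLConfig`, `actor_lr`, etc.) are illustrative and need not match the variable names in the authors' repository; the values are exactly those quoted above.

```python
from dataclasses import dataclass


@dataclass
class CEMRLConfig:
    """Hyper-parameters quoted in the paper (field names are ours)."""
    actor_lr: float = 1e-3        # Adam learning rate for the actor
    critic_lr: float = 1e-3       # Adam learning rate for the critic
    gamma: float = 0.99           # discount factor
    tau: float = 5e-3             # target-network soft-update weight
    pop_size: int = 10            # number of actors in the CEM population
    sigma_init: float = 1e-3      # initial CEM exploration noise
    sigma_end: float = 1e-5       # final CEM exploration noise
    tau_cem: float = 0.95         # decay constant of the CEM noise
    replay_size: int = 1_000_000  # replay buffer capacity
    batch_size: int = 100         # mini-batch size for TD3/DDPG updates
```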
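
The σ_init, σ_end and τ_cem constants suggest a CEM variant whose added exploration noise decays geometrically between the two bounds. The snippet below is a minimal diagonal-covariance CEM step written under that assumption; it is not taken from the authors' code (Algorithm 1 in the paper is the authoritative reference), and the elite fraction (here half of the population) and the noise-decay rule are our reading of the quoted constants.

```python
import numpy as np


def cem_step(mean, var, noise, scores, params, cfg, elite_frac=0.5):
    """One diagonal-covariance CEM update (sketch, not the authors' code).

    mean, var : current search distribution over flattened actor parameters
    noise     : current additive exploration noise (scalar)
    scores    : fitness of each sampled actor (higher is better)
    params    : array of shape (pop_size, n_params) with the sampled actors
    cfg       : a CEMRLConfig-like object providing tau_cem and sigma_end
    """
    n_elite = max(1, int(elite_frac * len(scores)))
    elite_idx = np.argsort(scores)[-n_elite:]   # indices of the best actors
    elites = params[elite_idx]

    new_mean = elites.mean(axis=0)              # recentre the distribution
    new_var = elites.var(axis=0) + noise        # add the extra exploration noise
    # Geometric decay of the extra noise from sigma_init towards sigma_end
    new_noise = cfg.tau_cem * noise + (1.0 - cfg.tau_cem) * cfg.sigma_end
    return new_mean, new_var, new_noise
```

New candidate actors would then be sampled as `mean + np.sqrt(new_var) * np.random.randn(n_params)`; in CEM-RL, half of each new population is additionally improved by TD3 gradient steps before evaluation.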