CEM-RL: Combining evolutionary and gradient-based methods for policy search

Authors: Pourchot, Sigaud

ICLR 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the resulting method, CEM-RL, on a set of benchmarks classically used in deep RL. We show that CEM-RL benefits from several advantages over its competitors and offers a satisfactory trade-off between performance and sample efficiency.
Researcher Affiliation | Collaboration | Aloïs Pourchot (1,2), Olivier Sigaud (2). (1) Gleamer, 96bis Boulevard Raspail, 75006 Paris, France, alois.pourchot@gleamer.ai. (2) Sorbonne Université, CNRS UMR 7222, Institut des Systèmes Intelligents et de Robotique, F-75005 Paris, France, olivier.sigaud@upmc.fr
Pseudocode | Yes | A pseudo-code of CEM-RL is provided in Algorithm 1.
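For orientation, the core of that pseudo-code is a CEM loop over actor parameters, with half of each population additionally refined by TD3 gradient steps from a shared critic before evaluation. The snippet below is only a minimal sketch of the CEM-style update, under assumed names (cem_step, score_fn) that do not come from the paper or its repository; the gradient half-update is indicated only as a comment.

```python
import numpy as np

def cem_step(mu, sigma, score_fn, pop_size=10, elite_frac=0.5, noise=1e-3):
    """One iteration of a diagonal-covariance CEM update (sketch only)."""
    # Sample a population of candidate actor parameter vectors.
    pop = mu + sigma * np.random.randn(pop_size, mu.size)
    # In CEM-RL, half of these candidates would additionally be refined
    # with TD3 gradient steps from a shared critic before evaluation.
    scores = np.array([score_fn(p) for p in pop])
    # Keep the best-scoring half of the population (the elites).
    elites = pop[np.argsort(scores)[-int(elite_frac * pop_size):]]
    # Refit the mean and (diagonal) variance to the elites, adding a small
    # noise floor to avoid premature convergence.
    new_mu = elites.mean(axis=0)
    new_sigma = np.sqrt(((elites - new_mu) ** 2).mean(axis=0) + noise)
    return new_mu, new_sigma

# Toy usage on an 8-dimensional quadratic "return" function.
mu, sigma = np.zeros(8), np.full(8, 1e-3)
for _ in range(20):
    mu, sigma = cem_step(mu, sigma, lambda p: -np.sum((p - 1.0) ** 2))
```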
Open Source Code | Yes | The code for reproducing the experiments is available at https://github.com/apourchot/CEM-RL.
Open Datasets | Yes | We evaluate the corresponding algorithms in several continuous control tasks simulated with the MUJOCO physics engine and commonly used as policy search benchmarks: HALF-CHEETAH-V2, HOPPER-V2, WALKER2D-V2, SWIMMER-V2 and ANT-V2 (Brockman et al., 2016).
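The benchmarks above correspond to standard Gym environment IDs. The sketch below simply instantiates them and is not taken from the paper; it assumes a Gym installation with mujoco-py available.

```python
import gym

# The five MuJoCo benchmarks listed above, by their Gym IDs.
ENV_IDS = ["HalfCheetah-v2", "Hopper-v2", "Walker2d-v2", "Swimmer-v2", "Ant-v2"]

for env_id in ENV_IDS:
    env = gym.make(env_id)
    # Print state/action dimensionalities to sanity-check the installation.
    print(env_id, env.observation_space.shape, env.action_space.shape)
    env.close()
```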
Dataset Splits | No | The paper evaluates policies in simulated environments (MUJOCO) where data is generated dynamically through interaction, rather than using a static dataset split into training, validation, and test sets. Explicit dataset splits are therefore not applicable and are not provided.
Hardware Specification | No | No specific hardware details (e.g., GPU models, CPU models, or cloud computing specifications) used for running the experiments are provided in the paper.
Software Dependencies | No | The paper mentions software such as PYTORCH and the MUJOCO physics engine, but it does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | Architectures of the networks are described in Appendix A. Most TD3 and DDPG hyper-parameters were reused from Fujimoto et al. (2018). The only notable difference is the use of tanh non-linearities instead of ReLU in the actor network, after we spotted that the former performs better than the latter on several environments. We trained the networks with the Adam optimizer (Kingma & Ba, 2014), with a learning rate of 1e-3 for both the actor and the critic. The discount rate γ was set to 0.99, and the target weight τ to 5e-3. All populations contained 10 actors, and the standard deviations σ_init, σ_end and the constant τ_cem of the CEM algorithm were respectively set to 1e-3, 1e-5 and 0.95. Finally, the size of the replay buffer was set to 1e6, and the batch size to 100.
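For convenience, the hyper-parameters quoted above are collected below as a Python dictionary. The key names are illustrative and do not necessarily match the variable names used in the authors' repository.

```python
# Hyper-parameters reported in the experiment setup above.
CEM_RL_HYPERPARAMS = {
    "actor_lr": 1e-3,            # Adam learning rate, actor
    "critic_lr": 1e-3,           # Adam learning rate, critic
    "gamma": 0.99,               # discount rate
    "tau": 5e-3,                 # target-network soft-update weight
    "pop_size": 10,              # actors per CEM population
    "sigma_init": 1e-3,          # initial CEM standard deviation
    "sigma_end": 1e-5,           # final CEM standard deviation
    "tau_cem": 0.95,             # CEM decay constant
    "replay_buffer_size": int(1e6),
    "batch_size": 100,
    "actor_activation": "tanh",  # used instead of ReLU in the actor
}
```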