Population-Guided Parallel Policy Search for Reinforcement Learning

Authors: Whiyoung Jung, Giseung Park, Youngchul Sung

ICLR 2020

Reproducibility assessment. Each entry below gives the variable, the result, and the supporting LLM response:
Research Type: Experimental. LLM response: "Numerical results show that the constructed algorithm outperforms most of the current state-of-the-art RL algorithms, and the gain is significant in the case of sparse reward environment." and "4 EXPERIMENTS: In this section, we provide numerical results on performance comparison between the proposed P3S-TD3 algorithm and current state-of-the-art on-policy and off-policy baseline algorithms on several MuJoCo environments (Todorov et al. (2012))."
Researcher Affiliation: Academia. LLM response: "Whiyoung Jung, Giseung Park, Youngchul Sung, School of Electrical Engineering, Korea Advanced Institute of Science and Technology, {wy.jung, gs.park, ycsung}@kaist.ac.kr"
Pseudocode: Yes. LLM response: "The pseudocode of the P3S-TD3 is given in Appendix H." and "APPENDIX H: PSEUDOCODE OF THE P3S-TD3 ALGORITHM", which contains "Algorithm 1: The Population-Guided Parallel Policy Search TD3 (P3S-TD3) Algorithm".
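For orientation, here is a minimal, hypothetical sketch of the kind of population-guided actor update the pseudocode describes: each learner optimizes a TD3-style actor loss augmented with a soft penalty that pulls its policy toward the current best learner's policy. The module names, toy dimensions, and the `beta` weight are illustrative placeholders, not taken from the authors' Algorithm 1 or their repository.

```python
# Hypothetical sketch of a population-guided actor update in the spirit of
# Algorithm 1: learner i minimizes a TD3-style actor loss plus a soft penalty
# toward the current best learner's policy pi_b. All modules, shapes, and the
# beta weight are illustrative placeholders, not the authors' implementation.
import torch
import torch.nn as nn

obs_dim, act_dim = 4, 2                                             # toy dimensions
actor_i = nn.Sequential(nn.Linear(obs_dim, act_dim), nn.Tanh())     # learner i's policy
actor_b = nn.Sequential(nn.Linear(obs_dim, act_dim), nn.Tanh())     # best learner's policy
critic = nn.Linear(obs_dim + act_dim, 1)                            # stand-in for Q(s, a)

def augmented_actor_loss(states, beta):
    """TD3 actor objective plus a guidance term toward the best policy."""
    a_i = actor_i(states)
    q = critic(torch.cat([states, a_i], dim=-1))       # Q(s, pi_i(s))
    with torch.no_grad():
        a_b = actor_b(states)                          # pi_b(s), no gradient
    guidance = ((a_i - a_b) ** 2).sum(dim=-1).mean()   # squared action distance
    return -q.mean() + beta * guidance                 # maximize Q, stay near pi_b

loss = augmented_actor_loss(torch.randn(8, obs_dim), beta=0.1)
loss.backward()
```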
Open Source Code: Yes. LLM response: "The implementation code for P3S-TD3 is available at https://github.com/wyjung0625/p3s."
Open Datasets: Yes. LLM response: "In this section, we provide numerical results on performance comparison between the proposed P3S-TD3 algorithm and current state-of-the-art on-policy and off-policy baseline algorithms on several MuJoCo environments (Todorov et al. (2012))."
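The benchmark tasks are standard Gym/MuJoCo environments rather than a downloadable dataset. A minimal sketch of instantiating them is below, assuming a Gym installation with the MuJoCo tasks registered; the environment names come from the paper, everything else is illustrative.

```python
# Minimal sketch of instantiating the MuJoCo benchmark tasks named in the paper.
# The experiments use the older -v1 task versions; a current Gym install may only
# register newer suffixes, so unavailable names are reported rather than crashing.
import gym

env_names = ["Hopper-v1", "Walker2d-v1", "HalfCheetah-v1", "Ant-v1"]
envs = {}
for name in env_names:
    try:
        envs[name] = gym.make(name)
    except gym.error.Error as exc:
        print(f"Could not create {name}: {exc} (try a newer version suffix, e.g. -v4)")
```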
Dataset Splits: No. LLM response: The paper describes its evaluation method (e.g., "Evaluation of the policies is conducted every R_eval = 4000 time steps. At each evaluation instant, the agent (or learner) fixes its policy... to obtain 10 episodic rewards.") and uses an experience replay buffer, but it does not specify explicit train/validation/test splits (percentages, sample counts, or a defined validation set) that could be reproduced as a fixed dataset partitioning.
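As a reading aid, here is a hedged sketch of the quoted evaluation protocol: every R_eval = 4000 training steps the current policy is frozen and rolled out for 10 episodes, and the episodic returns are recorded. `policy` and `env` are placeholders using the old Gym step API; none of these names come from the authors' code.

```python
# Hedged sketch of the quoted evaluation protocol: every R_eval = 4000 training
# steps the policy is frozen and rolled out for 10 episodes, and the 10 episodic
# returns are recorded. `policy` and `env` are placeholders using the old Gym
# step API (obs, reward, done, info); none of these names come from the repo.
import numpy as np

R_EVAL = 4000            # evaluation period in environment time steps
N_EVAL_EPISODES = 10     # episodes per evaluation instant

def evaluate(policy, env, n_episodes=N_EVAL_EPISODES):
    returns = []
    for _ in range(n_episodes):
        obs, done, ep_ret = env.reset(), False, 0.0
        while not done:
            obs, reward, done, _ = env.step(policy(obs))   # deterministic action
            ep_ret += reward
        returns.append(ep_ret)
    return float(np.mean(returns)), returns
```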
Hardware Specification: No. LLM response: The paper does not provide specific details on the hardware used for the experiments, such as GPU or CPU models.
Software Dependencies: No. LLM response: The paper mentions software components such as the Adam optimizer and the MuJoCo environments, but it does not specify version numbers for any software dependencies or libraries.
Experiment Setup: Yes. LLM response: "APPENDIX I: HYPER-PARAMETERS. TD3: The networks for two Q-functions and the policy have 2 hidden layers. The first and second layers have sizes 400 and 300, respectively. ... We used the Adam optimizer with learning rate 10^-3, discount factor γ = 0.99, target smoothing factor τ = 5×10^-3, the period d = 2 for updating the policy. The experience replay buffer size is 10^6, and the mini-batch size B is 100. The standard deviation for exploration noise σ and target noise σ̃ are 0.1 and 0.2, respectively, and the noise clipping factor c is 0.5. P3S-TD3: In addition to the hyper-parameters for TD3, we used N = 4 learners, the period M = 250 of updating the best policy and β, the number of recent episodes E_r = 10 for determining the best learner b. The parameter d_min was chosen among {0.02, 0.05} for each environment... The time steps for initial exploration T_initial was set as 250 for Hopper-v1 and Walker2d-v1 and as 2500 for HalfCheetah-v1 and Ant-v1."
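For convenience, the Appendix I hyper-parameters quoted above are gathered below as plain Python dictionaries. The values are transcribed from the quote; the dictionary layout and key names are just an illustrative organization, not the authors' configuration format.

```python
# Appendix I hyper-parameters transcribed from the quote above (illustrative layout).
td3_hparams = {
    "hidden_layer_sizes": (400, 300),   # two hidden layers for Q-networks and policy
    "optimizer": "Adam",
    "learning_rate": 1e-3,
    "discount_gamma": 0.99,
    "target_smoothing_tau": 5e-3,
    "policy_update_period_d": 2,
    "replay_buffer_size": int(1e6),
    "mini_batch_size_B": 100,
    "exploration_noise_sigma": 0.1,
    "target_noise_sigma": 0.2,
    "noise_clip_c": 0.5,
}
p3s_hparams = {
    "num_learners_N": 4,
    "best_policy_update_period_M": 250,  # also the update period for beta
    "recent_episodes_E_r": 10,           # episodes used to pick the best learner b
    "d_min_candidates": (0.02, 0.05),    # chosen per environment
    "T_initial": {"Hopper-v1": 250, "Walker2d-v1": 250,
                  "HalfCheetah-v1": 2500, "Ant-v1": 2500},
}
```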