Genetic Soft Updates for Policy Evolution in Deep Reinforcement Learning

Authors: Enrico Marchesini, Davide Corsi, Alessandro Farinelli

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our evaluation focuses on mapless navigation, a well-known problem in robotics and recent DRL (Zhang et al., 2017; Wahid et al., 2019; Marchesini & Farinelli, 2020b). In particular, we consider two tasks developed with Unity (Juliani et al., 2018): (i) a discrete action space indoor scenario with obstacles for a mobile robot and (ii) a continuous task for aquatic drones, with dynamic waves and physically realistic water. Besides considering standard metrics related to performance (success rate and reward), we also consider safety properties that are particularly important in these domains (e.g., the agent does not collide with obstacles).
Researcher Affiliation | Academia | Enrico Marchesini, Davide Corsi, Alessandro Farinelli (University of Verona, Department of Computer Science)
Pseudocode | Yes | Appendix B provides general pseudocode: Algorithm 1 (Supe-RL) and Algorithm 2 (Function Generate Children). A hedged sketch of this loop is given after the table.
Open Source Code | No | No explicit statement or link providing open-source code for the methodology described in this paper.
Open Datasets | No | No concrete access information (specific link, DOI, repository name, formal citation with authors/year, or reference to established benchmark datasets) is provided for a publicly available or open dataset for the main robotic navigation tasks. For MuJoCo, the paper references the benchmark, but it only names the environments; no specific access details for a dataset are given.
Dataset Splits | No | The paper does not explicitly provide the training/test/validation dataset splits needed to reproduce the experiments for its main tasks. It describes how the environment works and how data are collected, but not the splitting strategy for training, validation, and test sets.
Hardware Specification | Yes | Data are collected on an i7-9700k, using the implementation of Section 3.1.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9).
Experiment Setup | Yes | We considered the same set of hyper-parameters (reported in Appendix D) for the baselines and Supe-RL based approaches. ... Rainbow, GRainbow, and SGRainbow: here we discuss only the hyper-parameters of the algorithmic implementations which are relevant to GRainbow and SGRainbow, referring to the original papers for further details (notice that the three approaches share the same hyper-parameters). Based on our experiments, we decided to use a Priority Experience Replay (Schaul et al., 2016) buffer of size 30000 with a batch size of 64, α = 0.6, β_start = 0.4 and β_increment = 0.0005. The soft update of the target network (Lillicrap et al., 2015) for Rainbow is performed with τ = 0.01. Furthermore, given the well-structured reward function, we notice that a simple ϵ-greedy exploration strategy with decay = 0.99 and ϵ_min = 0.02 led to a faster training phase compared to the introduction of noisy exploration in the network (Fortunato et al., 2017). Finally, we noticed that a genetic evaluation collects on average 30000 total transitions, and adding a random 10% of these transitions into the priority buffer (i.e., trans_p = 0.1) showed the best results. (A configuration sketch collecting these values follows the table.)
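
The pseudocode itself (Algorithms 1 and 2) appears only in Appendix B of the paper and is not reproduced in this report. As a reading aid, below is a minimal Python sketch of the genetic soft-update loop implied by the algorithm names: periodically generate mutated children of the trained policy, evaluate them, and softly blend the best performer back into the policy. The function names, the Gaussian mutation scheme, the selection rule, and the tau value are assumptions for illustration, not the authors' implementation.

```python
import copy
import torch

def generate_children(policy, n_children=10, mutation_std=0.02):
    """Create mutated copies of the current policy.
    Assumption: mutation is additive Gaussian noise on every parameter."""
    children = []
    for _ in range(n_children):
        child = copy.deepcopy(policy)
        with torch.no_grad():
            for p in child.parameters():
                p.add_(torch.randn_like(p) * mutation_std)
        children.append(child)
    return children

def genetic_soft_update(policy, evaluate, n_children=10, tau=0.01):
    """Evaluate the parent and its mutated children (assumed elitist selection),
    then softly move the trained policy toward the best child, in the style of
    the soft target update of Lillicrap et al. (2015)."""
    candidates = [policy] + generate_children(policy, n_children)
    scores = [evaluate(c) for c in candidates]   # e.g., average episode reward
    best = candidates[scores.index(max(scores))]
    if best is not policy:                       # update only if a child wins
        with torch.no_grad():
            for p, b in zip(policy.parameters(), best.parameters()):
                p.mul_(1.0 - tau).add_(tau * b)
    return policy
```

In this reading, the gradient-based agent keeps training as usual and the routine above is invoked periodically; the transitions collected while evaluating children are the ones partially injected into the replay buffer (trans_p in the Experiment Setup row).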
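The hyper-parameters quoted in the Experiment Setup row can be gathered into a single configuration block together with the Polyak-style soft update they mention. The dictionary keys and the soft_update helper below are illustrative names, not code from the paper.

```python
# Hyper-parameters reported for Rainbow / GRainbow / SGRainbow in the excerpt above.
CONFIG = {
    "per_buffer_size": 30000,    # Prioritized Experience Replay buffer (Schaul et al., 2016)
    "batch_size": 64,
    "per_alpha": 0.6,
    "per_beta_start": 0.4,
    "per_beta_increment": 0.0005,
    "target_tau": 0.01,          # soft update of the target network (Lillicrap et al., 2015)
    "epsilon_decay": 0.99,       # epsilon-greedy exploration
    "epsilon_min": 0.02,
    "trans_p": 0.1,              # fraction of genetic-evaluation transitions added to the buffer
}

def soft_update(target_net, online_net, tau=CONFIG["target_tau"]):
    """Polyak averaging: target <- tau * online + (1 - tau) * target."""
    for t, o in zip(target_net.parameters(), online_net.parameters()):
        t.data.mul_(1.0 - tau).add_(tau * o.data)
```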