Understanding the Evolution of Linear Regions in Deep Reinforcement Learning

Authors: Setareh Cohan, Nam Hee Kim, David Rolnick, Michiel van de Panne

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We seek to understand how observed region counts and their densities evolve during deep reinforcement learning using empirical results that span a range of continuous control tasks and policy network dimensions.
Researcher Affiliation | Academia | Setareh Cohan, Department of Computer Science, University of British Columbia (setarehc@cs.ubc.ca); Nam Hee Kim, Department of Computer Science, Aalto University (namhee.kim@aalto.fi); David Rolnick, School of Computer Science, McGill University (drolnick@cs.mcgill.ca); Michiel van de Panne, Department of Computer Science, University of British Columbia (van@cs.ubc.ca)
Pseudocode | No | The paper describes the region-counting method in prose but does not include structured pseudocode or an algorithm block (an illustrative counting sketch appears below this table).
Open Source Code | Yes | Our code is available at https://github.com/setarehc/deep_rl_regions.
Open Datasets | Yes | We conduct our experiments on four continuous control tasks including the HalfCheetah-v2, Walker-v2, Ant-v2, and Swimmer-v2 environments from the OpenAI Gym benchmark suite [Brockman et al., 2016] (see the environment sketch below this table).
Dataset Splits | No | The paper does not provide specific training/validation/test dataset splits with percentages, sample counts, or references to predefined static splits, as data is generated through interaction with RL environments.
Hardware Specification | No | The paper mentions 'computational resources provided by Compute Canada' and states that 'Our experiments require modest compute resources,' but it does not provide specific hardware details such as GPU/CPU models or memory.
Software Dependencies | No | The paper mentions using 'Stable-Baselines3 implementations of the PPO algorithm' but does not provide specific version numbers for Stable-Baselines3 or other software dependencies.
Experiment Setup | Yes | We train 18 policy network configurations with N ∈ {32, 48, 64, 96, 128, 192} neurons, widths w ∈ {8, 16, 32, 64}, and depths d ∈ {1, 2, 3, 4}. We use a fixed value function network structure of (64, 64) in all of our experiments. We adopt the network initialization and hyperparameters of PPO from Stable-Baselines3 [Raffin et al., 2021] and train our policy networks on 2M samples (i.e. 2M timesteps in the environment). (A hedged training-setup sketch follows this table.)
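
Since the paper gives its region-counting procedure only in prose, the following is a minimal, purely illustrative sketch of one common way to estimate how many linear regions a small ReLU network crosses along a straight line between two inputs: sample the segment densely and count changes in the layer-wise activation pattern. The random weights, segment endpoints, sampling density, and the 17-dimensional observation size (HalfCheetah-v2) are assumptions for illustration, not the authors' implementation.

```python
# Illustrative only: approximate the number of linear regions a ReLU MLP crosses
# along a straight line between two inputs by densely sampling the segment and
# counting changes in the binary activation pattern of its hidden units.
import numpy as np

rng = np.random.default_rng(0)

def random_relu_mlp(in_dim, widths):
    """Random weights for a ReLU MLP with the given hidden widths (hypothetical stand-in for a policy net)."""
    dims = [in_dim] + list(widths)
    return [(rng.standard_normal((dims[i + 1], dims[i])) / np.sqrt(dims[i]),
             rng.standard_normal(dims[i + 1])) for i in range(len(widths))]

def activation_pattern(layers, x):
    """Binary on/off pattern of every hidden ReLU unit at input x."""
    pattern, h = [], x
    for W, b in layers:
        z = W @ h + b
        pattern.append(z > 0)
        h = np.maximum(z, 0.0)
    return np.concatenate(pattern)

def regions_along_segment(layers, x0, x1, n_samples=10_000):
    """Count distinct consecutive activation patterns along the segment from x0 to x1."""
    prev, count = None, 0
    for t in np.linspace(0.0, 1.0, n_samples):
        pat = activation_pattern(layers, (1 - t) * x0 + t * x1)
        if prev is None or not np.array_equal(pat, prev):
            count += 1
        prev = pat
    return count

if __name__ == "__main__":
    obs_dim = 17                                   # e.g. HalfCheetah-v2 observation size
    layers = random_relu_mlp(obs_dim, widths=(64, 64))
    x0, x1 = rng.standard_normal(obs_dim), rng.standard_normal(obs_dim)
    print("regions crossed:", regions_along_segment(layers, x0, x1))
```

Dense sampling can miss very thin regions; an exact count would instead locate every unit's hyperplane crossing along the segment, but the sampling version is enough to convey what a "region count along a transect" measures.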
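For the Open Datasets row, the sketch below shows how the four named MuJoCo tasks could be instantiated with the legacy gym API (gym 0.21-style with mujoco-py; later gym/gymnasium releases rename or retire the v2 IDs). The registered ID corresponding to "Walker-v2" in the excerpt is assumed to be Walker2d-v2.

```python
# Hedged sketch: instantiating the four continuous control tasks named in the table.
# Assumes the legacy gym 0.21-style API with mujoco-py installed for the v2 MuJoCo tasks.
import gym

# "Walker-v2" in the paper excerpt is assumed to correspond to gym's registered Walker2d-v2.
ENV_IDS = ["HalfCheetah-v2", "Walker2d-v2", "Ant-v2", "Swimmer-v2"]

for env_id in ENV_IDS:
    env = gym.make(env_id)
    print(env_id, "obs:", env.observation_space.shape, "act:", env.action_space.shape)
    env.close()
```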
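Finally, the Experiment Setup row fixes the policy width/depth grid, the (64, 64) value network, and the 2M-timestep budget, while the Software Dependencies row notes that no library versions are pinned. The sketch below shows how one such configuration could be written against Stable-Baselines3's PPO under a 1.x-style API; the specific (width, depth) pair, the ReLU activation, and the saved file name are assumptions, not values confirmed by the paper excerpt.

```python
# Hedged sketch: one assumed (width, depth) policy configuration trained with
# Stable-Baselines3 PPO. Assumes an SB3 1.x-style API (versions are not pinned by the paper).
import gym
import torch
from stable_baselines3 import PPO

width, depth = 64, 2                      # assumption: one (w, d) pair from the grid in the table
env = gym.make("HalfCheetah-v2")          # requires mujoco-py for the v2 MuJoCo tasks

policy_kwargs = dict(
    # Assumption: ReLU activations, since linear regions are defined for piecewise-linear networks.
    activation_fn=torch.nn.ReLU,
    # Separate policy/value architectures: `depth` hidden layers of `width` units for the policy,
    # and the fixed (64, 64) value function stated in the paper.
    net_arch=[dict(pi=[width] * depth, vf=[64, 64])],
)

model = PPO("MlpPolicy", env, policy_kwargs=policy_kwargs, verbose=1)
model.learn(total_timesteps=2_000_000)    # "2M samples (i.e. 2M timesteps in the environment)"
model.save("ppo_halfcheetah_w64_d2")      # hypothetical output name
```

PPO hyperparameters not listed in the excerpt are left at the library defaults here, which is consistent with the excerpt's statement that the authors adopt Stable-Baselines3's initialization and hyperparameters.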