Parameter Space Noise for Exploration

Authors: Matthias Plappert, Rein Houthooft, Prafulla Dhariwal, Szymon Sidor, Richard Y. Chen, Xi Chen, Tamim Asfour, Pieter Abbeel, Marcin Andrychowicz

ICLR 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that both off- and on-policy methods benefit from this approach through experimental comparison of DQN, DDPG, and TRPO on high-dimensional discrete action environments as well as continuous control tasks. (A sketch of the adaptive noise scaling the paper describes follows the table.)
Researcher Affiliation | Collaboration | OpenAI; Karlsruhe Institute of Technology (KIT); University of California, Berkeley
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | Reference implementations of DQN and DDPG with adaptive parameter space noise are available online (footnote 4: https://github.com/openai/baselines).
Open Datasets | Yes | For discrete-action environments, we use the Arcade Learning Environment (ALE, Bellemare et al. (2013)) benchmark... We use DDPG (Lillicrap et al., 2015) as the RL algorithm for all environments with similar hyperparameters... The following environments from OpenAI Gym (Brockman et al., 2016) are used... we use the following environments from rllab (Duan et al., 2016), modified according to Houthooft et al. (2016).
Dataset Splits | No | The paper does not provide specific dataset splits (e.g., percentages or counts) for training, validation, and testing as typically done with static datasets. Instead, it describes training durations (e.g., "trained for 40M frames", "trained for 1M timesteps") and evaluation frequency/method (e.g., "evaluate the performance of the agent every 10 thousand steps by using no noise for 20 episodes"). (A sketch of this evaluation protocol follows the table.)
Hardware Specification | No | The paper does not specify the hardware used for experiments, such as specific CPU or GPU models, or details about cloud computing instances.
Software Dependencies | No | The paper mentions software like the Adam optimizer, OpenAI Gym, and rllab, but does not provide specific version numbers for any of them.
Experiment Setup | Yes | For ALE... target networks are updated every 10K timesteps. The Q-value network is trained using the Adam optimizer... with a learning rate of 10^-4 and a batch size of 32. The replay buffer can hold 1M state transitions. ... For parameter space noise... ϵ = 0.01. ... We set γ = 0.99, clip rewards to be in [-1, 1], and clip gradients for the output layer of Q to be within [-1, 1]. For DDPG, we use a similar network architecture... both the actor and critic use 2 hidden layers with 64 ReLU units each. ... The target networks are soft-updated with τ = 0.001. The critic is trained with a learning rate of 10^-3 while the actor uses a learning rate of 10^-4. ... batch sizes of 128. The critic is regularized using an L2 penalty with 10^-2. The replay buffer holds 100K state transitions and γ = 0.99 is used. ... TRPO uses a step size of δ_KL = 0.01, a policy network of 2 hidden layers with 32 tanh units for the non-locomotion tasks, and 2 hidden layers of 64 tanh units for the locomotion tasks. The Hessian calculation is subsampled with a factor of 0.1, γ = 0.99, and the batch size per epoch is set to 5K timesteps. (The DDPG settings are collected into a configuration sketch after the table.)
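
For context on the method the table entries refer to, here is a minimal sketch of the adaptive parameter-space noise scaling the paper describes: the policy's parameters are perturbed with Gaussian noise, and the noise scale σ is increased when the perturbed policy's actions stay close to the unperturbed policy's actions and decreased otherwise. The class name, the threshold `delta`, and the adaptation factor `alpha` below are illustrative placeholders, not identifiers from the openai/baselines implementation.

```python
import numpy as np

class AdaptiveParamNoise:
    """Sketch of adaptive parameter-space noise scaling (illustrative API)."""

    def __init__(self, initial_sigma=0.1, delta=0.1, alpha=1.01):
        self.sigma = initial_sigma  # current std of the Gaussian parameter noise
        self.delta = delta          # target distance between perturbed and unperturbed policy
        self.alpha = alpha          # multiplicative adaptation factor

    def perturb(self, flat_params):
        # Add Gaussian noise directly to (flattened) policy parameters.
        return flat_params + np.random.normal(0.0, self.sigma, size=flat_params.shape)

    def adapt(self, distance):
        # Grow sigma while the perturbation has little effect in action space,
        # shrink it when the perturbed policy drifts too far.
        if distance <= self.delta:
            self.sigma *= self.alpha
        else:
            self.sigma /= self.alpha

def ddpg_action_distance(actions, perturbed_actions):
    # One possible distance measure for DDPG: root-mean-square difference
    # between the actions of the two policies on a batch of states.
    return float(np.sqrt(np.mean(np.square(actions - perturbed_actions))))
```

In this sketch, `perturb` would be applied to a copy of the policy used only for action selection, and `adapt` would be called periodically with the distance measured on a batch of states from the replay buffer.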
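
The evaluation protocol quoted in the Dataset Splits row (every 10 thousand steps, 20 episodes with no noise) can be summarized as a short loop. The `agent.act(obs, apply_noise=...)` interface and the pre-0.26 Gym reset/step signatures are assumptions for this sketch, not something specified by the paper.

```python
def evaluate(agent, env, episodes=20):
    """Average undiscounted return over `episodes` rollouts without exploration noise."""
    returns = []
    for _ in range(episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            action = agent.act(obs, apply_noise=False)  # noise disabled at evaluation time
            obs, reward, done, _ = env.step(action)
            total += reward
        returns.append(total)
    return sum(returns) / len(returns)

# Called every 10 thousand training steps, per the paper:
# if step % 10_000 == 0:
#     eval_return = evaluate(agent, env)
```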
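
As a reading aid, the DDPG settings listed in the Experiment Setup row are gathered into a single configuration dictionary below; the key names are made up for this sketch and do not correspond to any particular codebase.

```python
ddpg_config = {
    # actor and critic: 2 hidden layers with 64 ReLU units each
    "hidden_sizes": (64, 64),
    "activation": "relu",
    # Adam learning rates and batch size
    "actor_lr": 1e-4,
    "critic_lr": 1e-3,
    "batch_size": 128,
    "critic_l2_coeff": 1e-2,   # L2 penalty on the critic
    # target networks, replay buffer, discount
    "tau": 1e-3,               # soft target-update coefficient
    "replay_buffer_size": 100_000,
    "gamma": 0.99,
}
```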