Continuous control with deep reinforcement learning

Authors: Timothy Lillicrap, Jonathan Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, Daan Wierstra

ICLR 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We adapt the ideas underlying the success of Deep Q-Learning to the continuous action domain. We present an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces. Using the same learning algorithm, network architecture and hyper-parameters, our algorithm robustly solves more than 20 simulated physics tasks, including classic problems such as cartpole swing-up, dexterous manipulation, legged locomotion and car driving. Our algorithm is able to find policies whose performance is competitive with those found by a planning algorithm with full access to the dynamics of the domain and its derivatives. We further demonstrate that for many of the tasks the algorithm can learn policies end-to-end: directly from raw pixel inputs.
Researcher Affiliation | Industry | Google Deepmind, London, UK. {countzero, jjhunt, apritzel, heess, etom, tassa, davidsilver, wierstra}@google.com
Pseudocode | Yes | Algorithm 1: DDPG algorithm
Open Source Code | No | The paper mentions 'You can view a movie of some of the learned policies at https://goo.gl/J4PIAz', but this link is for demonstration videos, not for the source code of the methodology itself. There is no explicit statement or link providing access to the source code.
Open Datasets | Yes | These environments were simulated using MuJoCo (Todorov et al., 2012). Figure 1 shows renderings of some of the environments used in the task (the supplementary contains details of the environments and you can view some of the learned policies at https://goo.gl/J4PIAz).
Dataset Splits | No | The paper does not explicitly provide details about a validation dataset split (e.g., specific percentages or sample counts for validation data).
Hardware Specification | No | The paper does not explicitly describe the specific hardware used to run its experiments, such as CPU/GPU models or other detailed machine specifications.
Software Dependencies | No | The paper mentions software like 'Adam' and the 'MuJoCo' physics engine, but it does not provide specific version numbers for any software components.
Experiment Setup | Yes | We used Adam (Kingma & Ba, 2014) for learning the neural network parameters with a learning rate of 10^-4 and 10^-3 for the actor and critic respectively. For Q we included L2 weight decay of 10^-2 and used a discount factor of γ = 0.99. For the soft target updates we used τ = 0.001. The neural networks used the rectified non-linearity (Glorot et al., 2011) for all hidden layers. The final output layer of the actor was a tanh layer, to bound the actions. The low-dimensional networks had 2 hidden layers with 400 and 300 units respectively (≈130,000 parameters). Actions were not included until the 2nd hidden layer of Q. When learning from pixels we used 3 convolutional layers (no pooling) with 32 filters at each layer. This was followed by two fully connected layers with 200 units (≈430,000 parameters). The final layer weights and biases of both the actor and critic were initialized from a uniform distribution [-3 × 10^-3, 3 × 10^-3] and [-3 × 10^-4, 3 × 10^-4] for the low dimensional and pixel cases respectively. This was to ensure the initial outputs for the policy and value estimates were near zero. The other layers were initialized from uniform distributions [-1/√f, 1/√f] where f is the fan-in of the layer. The actions were not included until the fully-connected layers. We trained with minibatch sizes of 64 for the low dimensional problems and 16 on pixels. We used a replay buffer size of 10^6. For the exploration noise process we used temporally correlated noise in order to explore well in physical environments that have momentum. We used an Ornstein-Uhlenbeck process (Uhlenbeck & Ornstein, 1930) with θ = 0.15 and σ = 0.2.
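For concreteness, the quoted low-dimensional setup can be sketched in code: 400/300-unit ReLU hidden layers, fan-in uniform initialization, ±3 × 10^-3 final-layer initialization, actions joined at the critic's second hidden layer, Adam with the stated learning rates and critic weight decay, τ-weighted soft target updates, and Ornstein-Uhlenbeck exploration noise. This is a minimal illustrative sketch, not the authors' code; PyTorch is an assumption (the paper names no framework), and all class names, the `dt` term in the noise process, and the example state/action dimensions are placeholders.

```python
# Illustrative sketch of the quoted DDPG experiment setup (assumptions noted in comments).
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F


def fan_in_init(layer):
    """Uniform init in [-1/sqrt(f), 1/sqrt(f)], where f is the layer's fan-in."""
    f = layer.weight.data.size(1)
    bound = 1.0 / np.sqrt(f)
    nn.init.uniform_(layer.weight, -bound, bound)
    nn.init.uniform_(layer.bias, -bound, bound)


class Actor(nn.Module):
    """Low-dimensional actor: 400 and 300 ReLU units, tanh output to bound actions."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 400)
        self.fc2 = nn.Linear(400, 300)
        self.out = nn.Linear(300, action_dim)
        fan_in_init(self.fc1)
        fan_in_init(self.fc2)
        nn.init.uniform_(self.out.weight, -3e-3, 3e-3)  # final layer in [-3e-3, 3e-3]
        nn.init.uniform_(self.out.bias, -3e-3, 3e-3)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        return torch.tanh(self.out(x))


class Critic(nn.Module):
    """Q network; actions are not included until the 2nd hidden layer."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 400)
        self.fc2 = nn.Linear(400 + action_dim, 300)
        self.out = nn.Linear(300, 1)
        fan_in_init(self.fc1)
        fan_in_init(self.fc2)
        nn.init.uniform_(self.out.weight, -3e-3, 3e-3)
        nn.init.uniform_(self.out.bias, -3e-3, 3e-3)

    def forward(self, state, action):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(torch.cat([x, action], dim=1)))
        return self.out(x)


class OUNoise:
    """Temporally correlated exploration noise with theta = 0.15, sigma = 0.2."""
    def __init__(self, action_dim, theta=0.15, sigma=0.2, mu=0.0, dt=1.0):
        self.theta, self.sigma, self.mu, self.dt = theta, sigma, mu, dt
        self.x = np.zeros(action_dim)

    def sample(self):
        dx = self.theta * (self.mu - self.x) * self.dt \
             + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.x.shape)
        self.x = self.x + dx
        return self.x


def soft_update(target, source, tau=1e-3):
    """Soft target update: theta_target <- tau * theta + (1 - tau) * theta_target."""
    for t_param, param in zip(target.parameters(), source.parameters()):
        t_param.data.copy_(tau * param.data + (1.0 - tau) * t_param.data)


# Optimizers with the quoted learning rates and L2 weight decay on the critic.
# State/action dimensions below are placeholders for whichever MuJoCo task is used.
actor = Actor(state_dim=17, action_dim=6)
critic = Critic(state_dim=17, action_dim=6)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3, weight_decay=1e-2)
```

A full training loop in the shape of Algorithm 1 would then sample minibatches of 64 transitions from a replay buffer of size 10^6, regress the critic toward targets computed with target copies of both networks, update the actor along the deterministic policy gradient, and finish each step by calling `soft_update` on the target actor and critic.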