Guide Actor-Critic for Continuous Control

Authors: Voot Tangkaratt, Abbas Abdolmaleki, Masashi Sugiyama

ICLR 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through experiments, we show that our method is a promising reinforcement learning method for continuous controls. We evaluate GAC on the OpenAI Gym platform (Brockman et al., 2016) with the MuJoCo physics simulator (Todorov et al., 2012). Figure 1 shows the learning performance on 9 continuous control tasks.
Researcher Affiliation | Academia | Voot Tangkaratt, RIKEN AIP, Tokyo, Japan (voot.tangkaratt@riken.jp); Abbas Abdolmaleki, The University of Aveiro, Aveiro, Portugal (abbas.a@ua.pt); Masashi Sugiyama, RIKEN AIP, Tokyo, Japan, and The University of Tokyo, Tokyo, Japan (masashi.sugiyama@riken.jp)
Pseudocode | Yes | The pseudo-code of our method is provided in Appendix B and the source code is available at https://github.com/voot-t/guide-actor-critic.
Open Source Code | Yes | The pseudo-code of our method is provided in Appendix B and the source code is available at https://github.com/voot-t/guide-actor-critic.
Open Datasets | Yes | We evaluate GAC on the OpenAI Gym platform (Brockman et al., 2016) with the MuJoCo physics simulator (Todorov et al., 2012). These are well-known, publicly available platforms.
Dataset Splits | No | The paper does not explicitly describe train/validation/test dataset splits in the traditional supervised learning sense. It describes evaluation protocols in terms of 'training time steps' and 'test episodes'.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running its experiments. It mentions the use of the MuJoCo physics simulator but not the underlying hardware.
Software Dependencies | No | The paper mentions the OpenAI Gym platform and the MuJoCo physics simulator, and states that all environments are v1. It also mentions the Adam optimizer (Kingma & Ba, 2014). However, it does not specify version numbers for software libraries such as Python, PyTorch, or TensorFlow, which are essential for reproducibility.
Experiment Setup | Yes | The actor and critic are neural networks with two hidden layers of 400 and 300 units, as described in Appendix C. We use the Adam optimizer (Kingma & Ba, 2014) with learning rates 0.001 and 0.0001 for the critic network and the actor network, respectively. The moving-average step for the target networks is set to τ = 0.001. The maximum size of the replay buffer is set to 1000000. The mini-batch size is set to N = 256. The weights of the actor and critic networks are initialized as described by Glorot & Bengio (2010), except for the output layers, where the initial weights are drawn uniformly from U(-0.003, 0.003), as described by Lillicrap et al. (2015). The initial covariance Σ in GAC is set to the identity matrix. DDPG and QNAF use the OU process with noise parameters θ = 0.15 and σ = 0.2 for exploration. For GAC, the KL upper bound is fixed to ϵ = 0.0001. The entropy lower bound κ is adjusted heuristically by κ = max(0.99(E - E0) + E0, E0). We apply this heuristic update once every 5000 training steps. The dual function is minimized by the sequential least-squares quadratic programming (SLSQP) method with initial values η = 0.05 and ω = 0.05. The number of samples for computing the target critic value is M = 10. (See the code sketches after the table.)
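
The quoted setup is concrete enough to sketch in code. Since no deep learning framework or version is named in the quoted text, the choice of PyTorch below, the class names GACActor and GACCritic, and the helper soft_update are assumptions for illustration only; the numerical settings (400/300 hidden units, Glorot initialization with U(-0.003, 0.003) output layers, Adam learning rates of 0.001 for the critic and 0.0001 for the actor, τ = 0.001 target updates, a 10^6-transition replay buffer, mini-batches of 256) are the ones reported above.

```python
# Hedged sketch of the reported experiment setup. The framework (PyTorch) and
# all class/helper names are assumptions; the hyperparameters are as quoted.
import torch
import torch.nn as nn


def init_hidden(layer):
    # Hidden layers: Glorot/Xavier initialization (Glorot & Bengio, 2010).
    nn.init.xavier_uniform_(layer.weight)
    nn.init.zeros_(layer.bias)


def init_output(layer, bound=0.003):
    # Output layers: weights drawn uniformly from U(-0.003, 0.003)
    # (Lillicrap et al., 2015).
    nn.init.uniform_(layer.weight, -bound, bound)
    nn.init.uniform_(layer.bias, -bound, bound)


class GACActor(nn.Module):
    """Actor network: two hidden layers of 400 and 300 units."""

    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.h1 = nn.Linear(state_dim, 400)
        self.h2 = nn.Linear(400, 300)
        self.out = nn.Linear(300, action_dim)
        init_hidden(self.h1)
        init_hidden(self.h2)
        init_output(self.out)

    def forward(self, state):
        x = torch.relu(self.h1(state))
        x = torch.relu(self.h2(x))
        return self.out(x)  # mean of the Gaussian policy; action squashing omitted


class GACCritic(nn.Module):
    """Critic network Q(s, a): two hidden layers of 400 and 300 units."""

    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.h1 = nn.Linear(state_dim + action_dim, 400)
        self.h2 = nn.Linear(400, 300)
        self.out = nn.Linear(300, 1)
        init_hidden(self.h1)
        init_hidden(self.h2)
        init_output(self.out)

    def forward(self, state, action):
        x = torch.relu(self.h1(torch.cat([state, action], dim=-1)))
        return self.out(torch.relu(self.h2(x)))


def soft_update(target, source, tau=0.001):
    # Moving-average (Polyak) update of the target networks, tau = 0.001.
    for t, s in zip(target.parameters(), source.parameters()):
        t.data.mul_(1.0 - tau).add_(tau * s.data)


# Reported hyperparameters.
REPLAY_BUFFER_SIZE = 1_000_000  # maximum replay buffer size
BATCH_SIZE = 256                # mini-batch size N
KL_EPSILON = 1e-4               # KL upper bound epsilon for GAC
TARGET_SAMPLES = 10             # M samples for the target critic value

# Dimensions would normally come from a v1 Gym/MuJoCo task, e.g.
# gym.make("HalfCheetah-v1"); they are hard-coded here for illustration.
state_dim, action_dim = 17, 6
actor = GACActor(state_dim, action_dim)
critic = GACCritic(state_dim, action_dim)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)    # actor learning rate 0.0001
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)  # critic learning rate 0.001
```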
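
The remaining pieces of the quoted setup, the SLSQP minimization of the dual over (η, ω) starting from 0.05 and the periodic update of the entropy lower bound κ, could be organized roughly as follows. This is only a sketch: gac_dual is a hypothetical placeholder (the actual dual expression is defined in the paper's appendix, not in the quoted text), and e0 stands for the baseline entropy E0 whose value the quote does not give.

```python
# Hedged sketch of the dual minimization and the entropy-bound heuristic.
# `gac_dual` is a hypothetical placeholder, not the paper's actual dual.
import numpy as np
from scipy.optimize import minimize


def gac_dual(params):
    eta, omega = params
    # Placeholder objective standing in for the GAC dual g(eta, omega);
    # a dummy expression is used so the sketch runs end to end.
    return eta + omega


# SLSQP with initial values eta = 0.05 and omega = 0.05, as reported;
# the multipliers are kept positive via simple bounds.
result = minimize(gac_dual, x0=np.array([0.05, 0.05]), method="SLSQP",
                  bounds=[(1e-6, None), (1e-6, None)])
eta_star, omega_star = result.x


def update_kappa(current_entropy, e0):
    # Heuristic applied once every 5000 training steps:
    # kappa = max(0.99 * (E - E0) + E0, E0), with E the current policy
    # entropy and E0 a baseline entropy (value not given in the quote).
    return max(0.99 * (current_entropy - e0) + e0, e0)
```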