Guide Actor-Critic for Continuous Control
Authors: Voot Tangkaratt, Abbas Abdolmaleki, Masashi Sugiyama
ICLR 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through experiments, we show that our method is a promising reinforcement learning method for continuous control. We evaluate GAC on the OpenAI Gym platform (Brockman et al., 2016) with the MuJoCo physics simulator (Todorov et al., 2012). Figure 1 shows the learning performance on 9 continuous control tasks. |
| Researcher Affiliation | Academia | Voot Tangkaratt, RIKEN AIP, Tokyo, Japan (voot.tangkaratt@riken.jp); Abbas Abdolmaleki, The University of Aveiro, Aveiro, Portugal (abbas.a@ua.pt); Masashi Sugiyama, RIKEN AIP, Tokyo, Japan and The University of Tokyo, Tokyo, Japan (masashi.sugiyama@riken.jp) |
| Pseudocode | Yes | The pseudo-code of our method is provided in Appendix B and the source code is available at https://github.com/voot-t/guide-actor-critic. |
| Open Source Code | Yes | The pseudo-code of our method is provided in Appendix B and the source code is available at https://github.com/voot-t/guide-actor-critic. |
| Open Datasets | Yes | We evaluate GAC on the OpenAI Gym platform (Brockman et al., 2016) with the MuJoCo physics simulator (Todorov et al., 2012). These are well-known, publicly available platforms. |
| Dataset Splits | No | The paper does not explicitly describe train/validation/test dataset splits in the traditional supervised learning sense. It describes evaluation protocols in terms of 'training time steps' and 'test episodes'. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running its experiments. It mentions the use of 'Mujoco Physics simulator' but not the underlying hardware. |
| Software Dependencies | No | The paper mentions 'Open AI gym platform' and 'Mujoco Physics simulator' and states 'all environments are v1'. It also mentions the 'Adam optimizer (Kingma & Ba, 2014)'. However, it does not specify explicit version numbers for software libraries like Python, PyTorch, or TensorFlow, which are essential for reproducibility. |
| Experiment Setup | Yes | The actor and critic are neural networks with two hidden layers of 400 and 300 units, as described in Appendix C. We use the Adam optimizer (Kingma & Ba, 2014) with learning rates 0.001 and 0.0001 for the critic network and the actor network, respectively. The moving-average step for the target networks is set to τ = 0.001. The maximum size of the replay buffer is set to 1,000,000. The mini-batch size is set to N = 256. The weights of the actor and critic networks are initialized as described by Glorot & Bengio (2010), except for the output layers, where the initial weights are drawn uniformly from U(−0.003, 0.003), as described by Lillicrap et al. (2015). The initial covariance Σ in GAC is set to the identity matrix. DDPG and QNAF use the OU process with noise parameters θ = 0.15 and σ = 0.2 for exploration. For GAC, the KL upper bound is fixed to ϵ = 0.0001. The entropy lower bound κ is adjusted heuristically by κ ← max(0.99(E − E0) + E0, E0); this heuristic update is applied once every 5,000 training steps. The dual function is minimized by the sequential least-squares quadratic programming (SLSQP) method with initial values η = 0.05 and ω = 0.05. The number of samples for computing the target critic value is M = 10. (Illustrative code sketches of this setup follow the table.) |
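
To make the reported architecture and hyperparameters concrete, here is a minimal sketch of the 400/300-unit actor and critic with the quoted initialization and optimizer settings. The paper does not state its deep learning framework, so PyTorch, the `Actor`/`Critic` class names, the tanh output activation, and the HalfCheetah-v1 dimensions are all assumptions made for illustration; this is not the authors' code.

```python
# Hedged reconstruction of the experiment setup row above (framework choice is an assumption).
import torch
import torch.nn as nn


def small_uniform_(layer):
    # Output-layer weights drawn from U(-0.003, 0.003), as in Lillicrap et al. (2015).
    nn.init.uniform_(layer.weight, -0.003, 0.003)
    nn.init.uniform_(layer.bias, -0.003, 0.003)


class Actor(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.h1, self.h2, self.out = nn.Linear(state_dim, 400), nn.Linear(400, 300), nn.Linear(300, action_dim)
        nn.init.xavier_uniform_(self.h1.weight)   # Glorot & Bengio (2010) init for hidden layers
        nn.init.xavier_uniform_(self.h2.weight)
        small_uniform_(self.out)

    def forward(self, s):
        x = torch.relu(self.h1(s))
        x = torch.relu(self.h2(x))
        return torch.tanh(self.out(x))            # bounded continuous action (assumed activation)


class Critic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.h1, self.h2, self.out = nn.Linear(state_dim + action_dim, 400), nn.Linear(400, 300), nn.Linear(300, 1)
        nn.init.xavier_uniform_(self.h1.weight)
        nn.init.xavier_uniform_(self.h2.weight)
        small_uniform_(self.out)

    def forward(self, s, a):
        x = torch.relu(self.h1(torch.cat([s, a], dim=-1)))
        x = torch.relu(self.h2(x))
        return self.out(x)                        # Q(s, a)


# Hyperparameters quoted in the table row (dimensions are an example, e.g. HalfCheetah-v1).
actor, critic = Actor(17, 6), Critic(17, 6)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
TAU, BUFFER_SIZE, BATCH_SIZE, KL_EPSILON, M_TARGET_SAMPLES = 1e-3, 1_000_000, 256, 1e-4, 10
```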
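The row also mentions minimizing a dual function over (η, ω) with SLSQP and a heuristic update of the entropy lower bound κ. The sketch below shows how that optimization could be set up with SciPy, starting from η = ω = 0.05 as reported. The `dual` objective here is a stand-in placeholder, not the paper's closed-form dual, and the nonnegativity bounds are an assumption about how the constraints on η and ω are enforced.

```python
# Hedged sketch of the dual-variable optimization and the entropy-bound heuristic.
import numpy as np
from scipy.optimize import minimize

KL_EPSILON = 1e-4   # KL upper bound from the paper
KAPPA = 1.0         # current entropy lower bound (example value)


def dual(x):
    eta, omega = x
    # Placeholder objective with the generic shape eta*epsilon - omega*kappa + ...;
    # the paper's actual dual function should be substituted here.
    return eta * KL_EPSILON - omega * KAPPA + (eta + omega) * np.log(1.0 + eta + omega)


# SLSQP from the reported initial values, with nonnegativity bounds (assumed).
result = minimize(dual, x0=np.array([0.05, 0.05]), method="SLSQP",
                  bounds=[(1e-6, None), (1e-6, None)])
eta_star, omega_star = result.x


def update_entropy_bound(current_entropy, e0):
    # Heuristic quoted in the table row: kappa <- max(0.99*(E - E0) + E0, E0),
    # applied once every 5,000 training steps.
    return max(0.99 * (current_entropy - e0) + e0, e0)
```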