Guide Actor-Critic for Continuous Control

Authors: Voot Tangkaratt, Abbas Abdolmaleki, Masashi Sugiyama

ICLR 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through experiments, we show that our method is a promising reinforcement learning method for continuous controls. We evaluate GAC on the OpenAI Gym platform (Brockman et al., 2016) with the MuJoCo physics simulator (Todorov et al., 2012). Figure 1 shows the learning performance on 9 continuous control tasks.
Researcher Affiliation | Academia | Voot Tangkaratt, RIKEN AIP, Tokyo, Japan (voot.tangkaratt@riken.jp); Abbas Abdolmaleki, The University of Aveiro, Aveiro, Portugal (abbas.a@ua.pt); Masashi Sugiyama, RIKEN AIP, Tokyo, Japan, and The University of Tokyo, Tokyo, Japan (masashi.sugiyama@riken.jp)
Pseudocode | Yes | The pseudo-code of our method is provided in Appendix B and the source code is available at https://github.com/voot-t/guide-actor-critic.
Open Source Code | Yes | The pseudo-code of our method is provided in Appendix B and the source code is available at https://github.com/voot-t/guide-actor-critic.
Open Datasets | Yes | We evaluate GAC on the OpenAI Gym platform (Brockman et al., 2016) with the MuJoCo physics simulator (Todorov et al., 2012). These are well-known, publicly available platforms.
Dataset Splits | No | The paper does not explicitly describe train/validation/test dataset splits in the traditional supervised learning sense. It describes evaluation protocols in terms of 'training time steps' and 'test episodes'.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running its experiments. It mentions the use of the MuJoCo physics simulator but not the underlying hardware.
Software Dependencies | No | The paper mentions the OpenAI Gym platform and the MuJoCo physics simulator, and states that all environments are v1. It also mentions the Adam optimizer (Kingma & Ba, 2014). However, it does not specify version numbers for software libraries such as Python, PyTorch, or TensorFlow, which are essential for reproducibility.
Experiment Setup | Yes | The actor and critic are neural networks with two hidden layers of 400 and 300 units, as described in Appendix C. We use the Adam optimizer (Kingma & Ba, 2014) with learning rates 0.001 and 0.0001 for the critic network and the actor network, respectively. The moving-average step for the target networks is set to τ = 0.001. The maximum size of the replay buffer is set to 1000000. The mini-batch size is set to N = 256. The weights of the actor and critic networks are initialized as described by Glorot & Bengio (2010), except for the output layers, where the initial weights are drawn uniformly from U(-0.003, 0.003), as described by Lillicrap et al. (2015). The initial covariance Σ in GAC is set to the identity matrix. DDPG and QNAF use the OU process with noise parameters θ = 0.15 and σ = 0.2 for exploration. For GAC, the KL upper bound is fixed to ϵ = 0.0001. The entropy lower bound κ is adjusted heuristically by κ = max(0.99(E - E0) + E0, E0). We apply this heuristic update once every 5000 training steps. The dual function is minimized by the sequential least-squares quadratic programming (SLSQP) method with initial values η = 0.05 and ω = 0.05. The number of samples for computing the target critic value is M = 10. (See the code sketches after the table.)
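
The quoted setup is concrete enough to sketch in code. Since no deep learning framework or version is named in the quoted text, the choice of PyTorch below, the class names GACActor and GACCritic, and the helper soft_update are assumptions for illustration only; the numerical settings (400/300 hidden units, Glorot initialization with U(-0.003, 0.003) output layers, Adam learning rates of 0.001 for the critic and 0.0001 for the actor, τ = 0.001 target updates, a 10^6-transition replay buffer, mini-batches of 256) are the ones reported above.

```python
# Hedged sketch of the reported experiment setup. The framework (PyTorch) and
# all class/helper names are assumptions; the hyperparameters are as quoted.
import torch
import torch.nn as nn


def init_hidden(layer):
    # Hidden layers: Glorot/Xavier initialization (Glorot & Bengio, 2010).
    nn.init.xavier_uniform_(layer.weight)
    nn.init.zeros_(layer.bias)


def init_output(layer, bound=0.003):
    # Output layers: weights drawn uniformly from U(-0.003, 0.003)
    # (Lillicrap et al., 2015).
    nn.init.uniform_(layer.weight, -bound, bound)
    nn.init.uniform_(layer.bias, -bound, bound)


class GACActor(nn.Module):
    """Actor network: two hidden layers of 400 and 300 units."""

    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.h1 = nn.Linear(state_dim, 400)
        self.h2 = nn.Linear(400, 300)
        self.out = nn.Linear(300, action_dim)
        init_hidden(self.h1)
        init_hidden(self.h2)
        init_output(self.out)

    def forward(self, state):
        x = torch.relu(self.h1(state))
        x = torch.relu(self.h2(x))
        return self.out(x)  # mean of the Gaussian policy; action squashing omitted


class GACCritic(nn.Module):
    """Critic network Q(s, a): two hidden layers of 400 and 300 units."""

    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.h1 = nn.Linear(state_dim + action_dim, 400)
        self.h2 = nn.Linear(400, 300)
        self.out = nn.Linear(300, 1)
        init_hidden(self.h1)
        init_hidden(self.h2)
        init_output(self.out)

    def forward(self, state, action):
        x = torch.relu(self.h1(torch.cat([state, action], dim=-1)))
        return self.out(torch.relu(self.h2(x)))


def soft_update(target, source, tau=0.001):
    # Moving-average (Polyak) update of the target networks, tau = 0.001.
    for t, s in zip(target.parameters(), source.parameters()):
        t.data.mul_(1.0 - tau).add_(tau * s.data)


# Reported hyperparameters.
REPLAY_BUFFER_SIZE = 1_000_000  # maximum replay buffer size
BATCH_SIZE = 256                # mini-batch size N
KL_EPSILON = 1e-4               # KL upper bound epsilon for GAC
TARGET_SAMPLES = 10             # M samples for the target critic value

# Dimensions would normally come from a v1 Gym/MuJoCo task, e.g.
# gym.make("HalfCheetah-v1"); they are hard-coded here for illustration.
state_dim, action_dim = 17, 6
actor = GACActor(state_dim, action_dim)
critic = GACCritic(state_dim, action_dim)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)    # actor learning rate 0.0001
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)  # critic learning rate 0.001
```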
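
The remaining pieces of the quoted setup, the SLSQP minimization of the dual over (η, ω) starting from 0.05 and the periodic update of the entropy lower bound κ, could be organized roughly as follows. This is only a sketch: gac_dual is a hypothetical placeholder (the actual dual expression is defined in the paper's appendix, not in the quoted text), and e0 stands for the baseline entropy E0 whose value the quote does not give.

```python
# Hedged sketch of the dual minimization and the entropy-bound heuristic.
# `gac_dual` is a hypothetical placeholder, not the paper's actual dual.
import numpy as np
from scipy.optimize import minimize


def gac_dual(params):
    eta, omega = params
    # Placeholder objective standing in for the GAC dual g(eta, omega);
    # a dummy expression is used so the sketch runs end to end.
    return eta + omega


# SLSQP with initial values eta = 0.05 and omega = 0.05, as reported;
# the multipliers are kept positive via simple bounds.
result = minimize(gac_dual, x0=np.array([0.05, 0.05]), method="SLSQP",
                  bounds=[(1e-6, None), (1e-6, None)])
eta_star, omega_star = result.x


def update_kappa(current_entropy, e0):
    # Heuristic applied once every 5000 training steps:
    # kappa = max(0.99 * (E - E0) + E0, E0), with E the current policy
    # entropy and E0 a baseline entropy (value not given in the quote).
    return max(0.99 * (current_entropy - e0) + e0, e0)
```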