Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Guide Actor-Critic for Continuous Control
Authors: Voot Tangkaratt, Abbas Abdolmaleki, Masashi Sugiyama
ICLR 2018 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through experiments, we show that our method is a promising reinforcement learning method for continuous controls. We evaluate GAC on the Open AI gym platform (Brockman et al., 2016) with the Mujoco Physics simulator (Todorov et al., 2012). Figure 1 shows the learning performance on 9 continuous control tasks. |
| Researcher Affiliation | Academia | Voot Tangkaratt RIKEN AIP, Tokyo, Japan EMAIL Abbas Abdolmaleki The University of Aveiro, Aveiro, Portugal EMAIL Masashi Sugiyama RIKEN AIP, Tokyo, Japan The University of Tokyo, Tokyo, Japan EMAIL |
| Pseudocode | Yes | The pseudo-code of our method is provided in Appendix B and the source code is available at https://github.com/voot-t/guide-actor-critic. |
| Open Source Code | Yes | The pseudo-code of our method is provided in Appendix B and the source code is available at https://github.com/voot-t/guide-actor-critic. |
| Open Datasets | Yes | We evaluate GAC on the Open AI gym platform (Brockman et al., 2016) with the Mujoco Physics simulator (Todorov et al., 2012). These are well-known, publicly available platforms. |
| Dataset Splits | No | The paper does not explicitly describe train/validation/test dataset splits in the traditional supervised learning sense. It describes evaluation protocols in terms of 'training time steps' and 'test episodes'. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running its experiments. It mentions the use of 'Mujoco Physics simulator' but not the underlying hardware. |
| Software Dependencies | No | The paper mentions 'Open AI gym platform' and 'Mujoco Physics simulator' and states 'all environments are v1'. It also mentions the 'Adam optimizer (Kingma & Ba, 2014)'. However, it does not specify explicit version numbers for software libraries like Python, PyTorch, or TensorFlow, which are essential for reproducibility. |
| Experiment Setup | Yes | The actor and critic are neural networks with two hidden layers of 400 and 300 units, as described in Appendix C. We use the Adam optimizer (Kingma & Ba, 2014) with learning rates 0.001 and 0.0001 for the critic network and the actor network, respectively. The moving average step for target networks is set to τ = 0.001. The maximum size of the replay buffer is set to 1000000. The mini-batch size is set to N = 256. The weights of the actor and critic networks are initialized as described by Glorot & Bengio (2010), except for the output layers where the initial weights are drawn uniformly from U(−0.003, 0.003), as described by Lillicrap et al. (2015). The initial covariance Σ in GAC is set to be an identity matrix. DDPG and QNAF use the OU-process with noise parameters θ = 0.15 and σ = 0.2 for exploration. For GAC, the KL upper-bound is fixed to ϵ = 0.0001. The entropy lower-bound κ is adjusted heuristically by κ = max(0.99(E − E0) + E0, E0). We apply this heuristic update once every 5000 training steps. The dual function is minimized by the sequential least-squares quadratic programming (SLSQP) method with initial values η = 0.05 and ω = 0.05. The number of samples for computing the target critic value is M = 10. |
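
The values quoted in the Experiment Setup row can be gathered into a single configuration object, which may help when checking a re-implementation against the reported settings. The following is a minimal sketch, not the authors' code: the class name `GACSetup`, its field names, and the function `entropy_lower_bound` are hypothetical, and the heuristic assumes E is the current policy entropy and E0 a base entropy, which the row itself does not define.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GACSetup:
    """Settings quoted in the Experiment Setup row (field names are hypothetical)."""
    hidden_units: tuple = (400, 300)   # two hidden layers for actor and critic
    critic_lr: float = 1e-3            # Adam learning rate, critic network
    actor_lr: float = 1e-4             # Adam learning rate, actor network
    tau: float = 1e-3                  # moving average step for target networks
    replay_size: int = 1_000_000       # maximum replay buffer size
    batch_size: int = 256              # mini-batch size N
    output_init_range: float = 3e-3    # output layers drawn from U(-0.003, 0.003)
    kl_upper_bound: float = 1e-4       # epsilon
    kappa_update_every: int = 5000     # training steps between kappa updates
    eta_init: float = 0.05             # initial dual variable for SLSQP
    omega_init: float = 0.05           # initial dual variable for SLSQP
    num_target_samples: int = 10       # M, samples for the target critic value


def entropy_lower_bound(entropy, base_entropy, decay=0.99):
    """kappa = max(decay * (E - E0) + E0, E0), as quoted in the row.

    Assumption: `entropy` is E and `base_entropy` is E0.
    """
    return max(decay * (entropy - base_entropy) + base_entropy, base_entropy)
```

With this form, κ sits slightly below the current entropy whenever that entropy exceeds E0, and it never drops below E0.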