Boosting the Actor with Dual Critic
Authors: Bo Dai, Albert Shaw, Niao He, Lihong Li, Le Song
ICLR 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, our algorithm is evaluated on several locomotion tasks in the MuJoCo benchmark (Todorov et al., 2012), and compares favorably to state-of-the-art algorithms across the board. We evaluated the dual actor-critic (Dual-AC) algorithm on several continuous control environments from the OpenAI Gym (Brockman et al., 2016) with MuJoCo physics simulator (Todorov et al., 2012). We compared Dual-AC with several representative actor-critic algorithms, including trust region policy optimization (TRPO) (Schulman et al., 2015a) and proximal policy optimization (PPO) (Schulman et al., 2017). We ran the algorithms with 5 random seeds and reported the average rewards with 50% confidence interval. Details of the tasks and setups of these experiments including the policy/value function architectures and the hyperparameters values, are provided in Appendix C. |
| Researcher Affiliation | Collaboration | Bo Dai*¹, Albert Shaw*¹, Niao He², Lihong Li³, Le Song¹,⁴; ¹Georgia Institute of Technology, ²University of Illinois at Urbana-Champaign, ³Google AI, ⁴Ant Financial Services Group |
| Pseudocode | Yes | Algorithm 1 Dual Actor-Critic (Dual-AC) |
| Open Source Code | No | The paper references code used for baselines: "For a fair comparison, we use the codes from https://github.com/joschu/modular_rl reported to have achieved the best scores in Henderson et al. (2018)." It does not provide a link or statement about open-sourcing the code for the Dual-AC algorithm developed in this paper. |
| Open Datasets | Yes | We evaluated the dual actor-critic (Dual-AC) algorithm on several continuous control environments from the OpenAI Gym (Brockman et al., 2016) with MuJoCo physics simulator (Todorov et al., 2012). |
| Dataset Splits | No | The paper mentions batch sizes for training but does not specify how the datasets were split into training, validation, or test sets, nor does it refer to standard splits for the environments used. |
| Hardware Specification | No | The paper does not specify any hardware details such as GPU/CPU models, memory, or cloud computing instance types used for running the experiments. It only mentions the use of the MuJoCo physics simulator. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies, libraries, or programming languages used in the implementation or experiments. |
| Experiment Setup | Yes | We use the γ = 0.995 for all the algorithms. We keep constant stepsize and tuned for TRPO, PPO and Dual-AC in {0.001, 0.01, 0.1}. The batchsize are set to be 52 trajectories for comparison to the competitors in Section 6.2. For the Ablation study, we set batchsize to be 24 trajectories for faster runtime. The CG damping parameter for TRPO is set to be 10^-4. We iterate 20 steps for the Fisher information matrix computation. For the ηV, ηµ, 1/ηα in Dual-AC from {0.001, 0.01, 0.1, 1}. |
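
For anyone attempting a reproduction, the values quoted in the Experiment Setup row can be collected into a small search space. The sketch below is only an illustration: the constant names, the dictionary keys, and the `dual_ac_grid` helper are hypothetical, and treating the search as a full Cartesian grid is an assumption, since the paper only lists the candidate sets.

```python
from itertools import product

# Values quoted in the Experiment Setup row; all names here are hypothetical.
GAMMA = 0.995                            # discount factor used for all algorithms
STEPSIZES = (0.001, 0.01, 0.1)           # constant stepsize candidates
DUAL_AC_VALUES = (0.001, 0.01, 0.1, 1)   # candidates for eta_V, eta_mu, 1/eta_alpha
BATCH_TRAJECTORIES = {"comparison": 52, "ablation": 24}
NUM_SEEDS = 5                            # random seeds per reported result


def dual_ac_grid(study="comparison"):
    """Yield one config dict per combination of the candidate values.

    Assumption: a full grid over all four tuned quantities. The paper
    does not state how the candidate sets were actually searched.
    """
    for step, eta_v, eta_mu, inv_eta_alpha in product(
        STEPSIZES, DUAL_AC_VALUES, DUAL_AC_VALUES, DUAL_AC_VALUES
    ):
        yield {
            "gamma": GAMMA,
            "stepsize": step,
            "eta_V": eta_v,
            "eta_mu": eta_mu,
            "inv_eta_alpha": inv_eta_alpha,
            "batch_trajectories": BATCH_TRAJECTORIES[study],
            "num_seeds": NUM_SEEDS,
        }


configs = list(dual_ac_grid())
print(len(configs))  # 3 stepsizes * 4 * 4 * 4 = 192 combinations
```

Under the full-grid assumption, each of these configurations would be run with 5 seeds, so a coarser or sequential search may be closer to what is practical.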