Deep Coherent Exploration for Continuous Control
Authors: Yijie Zhang, Herke van Hoof
ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the coherent versions of A2C (Mnih et al., 2016), PPO (Schulman et al., 2017), and SAC (Haarnoja et al., 2018), where the experiments on OpenAI MuJoCo (Todorov et al., 2012; Brockman et al., 2016) tasks show that deep coherent exploration outperforms other exploration strategies in terms of both learning speed and stability. |
| Researcher Affiliation | Academia | University of Copenhagen, Copenhagen, Denmark (work done while YZ was a master's student at the University of Amsterdam); University of Amsterdam, Amsterdam, the Netherlands. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | For more details, please refer to documents and source code from OpenAI Spinning Up (Achiam, 2018) and our implementation: https://github.com/pyijiezhang/deep-coherent-exploration-for-continuous-control |
| Open Datasets | Yes | This comparison is evaluated in combination of A2C (Mnih et al., 2016), PPO (Schulman et al., 2017), and SAC (Haarnoja et al., 2018) on OpenAI Gym MuJoCo (Todorov et al., 2012; Brockman et al., 2016) continuous control tasks. |
| Dataset Splits | No | The paper describes training steps and evaluation frequency but does not provide dataset split information (e.g., percentages, sample counts); this is expected, since the experiments use continuous-control reinforcement learning environments rather than a static supervised learning dataset. |
| Hardware Specification | No | The paper thanks SURFsara for providing "computational resources" but does not specify any exact GPU/CPU models, processor types, or memory details used for running its experiments. |
| Software Dependencies | No | The paper mentions software frameworks like "OpenAI Baselines" and "OpenAI Spinning Up", and algorithms like "Adam", but does not provide specific version numbers for any libraries or solvers required for replication. |
| Experiment Setup | Yes | For exploration in parameter space, we use a fixed action noise with a standard deviation of 0.1. For A2C and PPO, their standard deviations of parameter noise are all initialized at 0.017... For SAC, we initialize the standard deviation of parameter noise at 0.034... We consider five values of β (0.0, 0.01, 0.1, 0.5, and 1.0) for deep coherent exploration... In all experiments, agents are trained with a total of 10^6 environmental steps... A2C and PPO use four parallel workers, where each worker collects a trajectory of 1000 steps for each epoch... After each epoch, both A2C and PPO update their value functions for 80 gradient steps... A2C updates its policy for one gradient step, while PPO updates its policy for up to 80 gradient steps... SAC uses a single worker, with a step size of 4000 for each epoch. After every 50 environmental steps, both the policy and the value function are updated for 50 gradient steps... All three algorithms use two-layer feedforward neural networks with the same network architectures... A2C and PPO use a learning rate of 3x10^-4 for the policies and a learning rate of 10^-3 for the value functions. SAC uses a single learning rate of 10^-3 for both policy and value function. |
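For reference, the hyperparameters quoted in the Experiment Setup row can be collected into a single configuration. The grouping and key names below are an illustrative sketch of my own; they do not come from the paper or its Spinning Up based implementation, and only the numeric values are taken from the excerpt above.

```python
# Sketch of the reported hyperparameters as plain Python dicts.
# Key names and structure are hypothetical; values are quoted from the paper excerpt.

COMMON = {
    "total_env_steps": int(1e6),                 # total environment interaction budget
    "action_noise_std": 0.1,                     # fixed action noise for parameter-space exploration
    "beta_values": [0.0, 0.01, 0.1, 0.5, 1.0],   # coherence parameter sweep for deep coherent exploration
    "policy_net": "two-layer feedforward",       # same architecture for all three algorithms
}

A2C_PPO = {
    "param_noise_std_init": 0.017,
    "num_parallel_workers": 4,
    "steps_per_worker_per_epoch": 1000,
    "value_gradient_steps_per_epoch": 80,
    "policy_gradient_steps_per_epoch": {"A2C": 1, "PPO": "up to 80"},
    "policy_lr": 3e-4,
    "value_lr": 1e-3,
}

SAC = {
    "param_noise_std_init": 0.034,
    "num_parallel_workers": 1,
    "steps_per_epoch": 4000,
    "update_every_env_steps": 50,
    "gradient_steps_per_update": 50,
    "lr": 1e-3,   # single learning rate for both policy and value function
}
```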
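The β sweep in the setup above controls how temporally coherent the parameter-space noise is. As a hypothetical illustration only, the snippet below uses a variance-preserving AR(1) process to interpolate between resampling the perturbation every step (β = 0) and keeping it fixed for a whole trajectory (β = 1); this is an assumed stand-in for the idea, not the paper's exact update rule.

```python
import numpy as np

def coherent_noise_sequence(num_steps, dim, beta, sigma, rng=None):
    """Hypothetical temporally coherent Gaussian noise for parameter-space exploration.

    beta = 0 resamples independent noise at every step; beta = 1 keeps the
    initial perturbation fixed; intermediate values interpolate between the two.
    The AR(1) form used here is an illustrative assumption, not the paper's rule.
    """
    rng = np.random.default_rng() if rng is None else rng
    z = sigma * rng.standard_normal(dim)              # initial perturbation
    noise = [z.copy()]
    for _ in range(num_steps - 1):
        eps = sigma * rng.standard_normal(dim)
        # Variance-preserving autoregressive update with coherence beta.
        z = beta * z + np.sqrt(1.0 - beta ** 2) * eps
        noise.append(z.copy())
    return np.stack(noise)

# Example: sweep the beta values reported in the table, using the SAC
# initialization of the parameter-noise standard deviation (0.034).
for beta in [0.0, 0.01, 0.1, 0.5, 1.0]:
    seq = coherent_noise_sequence(num_steps=1000, dim=8, beta=beta, sigma=0.034)
    print(beta, seq.std())  # marginal std stays roughly sigma for every beta
```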