Distributed Distributional Deterministic Policy Gradients
Authors: Gabriel Barth-Maron, Matthew W. Hoffman, David Budden, Will Dabney, Dan Horgan, Dhruva TB, Alistair Muldal, Nicolas Heess, Timothy Lillicrap
ICLR 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimentally we examine the contribution of each of these individual components, and show how they interact, as well as their combined contributions. Our results show that across a wide variety of simple control tasks, difficult manipulation tasks, and a set of hard obstacle-based locomotion tasks the D4PG algorithm achieves state of the art performance. |
| Researcher Affiliation | Industry | DeepMind, London, UK {gabrielbm, mwhoffman, budden, wdabney, horgan, dhruvat, alimuldal, heess, countzero}@google.com |
| Pseudocode | Yes | Algorithm pseudocode for the D4PG algorithm which includes all the above-mentioned modifications can be found in Algorithm 1. Here the actor and critic parameters are updated using stochastic gradient descent with learning rates αt and βt respectively, which are adjusted online using ADAM (Kingma & Ba, 2015). While this pseudocode focuses on the learning process, also shown is pseudocode for actor processes which in parallel fill the replay table with data. (A hedged structural sketch of this learner/actor split appears below the table.) |
| Open Source Code | No | The paper does not provide any explicit statement about releasing source code or a link to a code repository for the methodology described. |
| Open Datasets | Yes | We first consider evaluating performance on a number of simple, physical control tasks by utilizing a suite of benchmark tasks (Tassa et al., 2018) developed in the MuJoCo physics simulator (Todorov et al., 2012). (A small usage sketch of this suite appears below the table.) |
| Dataset Splits | No | The paper does not explicitly provide specific training/validation/test dataset splits (e.g., percentages, sample counts, or predefined split citations) needed to reproduce the experiment. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models or memory used for running the experiments. |
| Software Dependencies | No | The paper mentions software like MuJoCo physics simulator, ApeX framework, and the ADAM optimizer, but does not provide specific version numbers for these components. |
| Experiment Setup | Yes | In all experiments we use a replay table of size R = 1 × 10^6 and only consider behavior policies which add fixed Gaussian noise ϵN(0, 1) to the current online policy; in all experiments we use a value of ϵ = 0.3. For all algorithms we initialize the learning rates for both actor and critic updates to the same value. In the next section we will present a suite of simple control problems for which this value corresponds to α0 = β0 = 1 × 10^−4; for the following, harder problems we set this to a smaller value of α0 = β0 = 5 × 10^−5. Similarly for the control suite we utilize a batch size of M = 256 and for all subsequent problems we will increase this to M = 512. (These values are collected into a configuration sketch below the table.) |
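
The Pseudocode row above describes a learner that updates the actor and critic with Adam while separate actor processes fill the replay table in parallel. Below is a minimal, single-process sketch of that structure, not the authors' implementation: the toy dimensions, the placeholder environment transitions, and the plain (non-distributional, non-prioritized, 1-step) critic loss are simplifying assumptions made only to show the shape of the learner/actor loop.

```python
# Minimal learner/actor sketch in the spirit of D4PG's Algorithm 1 (illustrative only).
import random
from collections import deque

import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2  # assumed toy dimensions, not from the paper
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))

# Both parameter sets are adjusted online with Adam, as the paper states.
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-4)

replay = deque(maxlen=1_000_000)  # replay table of size R = 1e6

def actor_step(obs):
    """Actor process: act with fixed Gaussian exploration noise, store the transition."""
    with torch.no_grad():
        action = actor(obs) + 0.3 * torch.randn(act_dim)  # epsilon = 0.3
    # A real actor would step the environment here; random placeholders keep this runnable.
    next_obs, reward = torch.randn(obs_dim), torch.randn(())
    replay.append((obs, action, reward, next_obs))
    return next_obs

def learner_step(batch_size=256, gamma=0.99):
    """Learner process: sample a minibatch and update critic then actor with Adam."""
    batch = random.sample(list(replay), batch_size)
    obs, act, rew, nxt = map(torch.stack, zip(*batch))
    with torch.no_grad():
        target_q = rew.unsqueeze(-1) + gamma * critic(torch.cat([nxt, actor(nxt)], dim=-1))
    critic_loss = (critic(torch.cat([obs, act], dim=-1)) - target_q).pow(2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()
    actor_loss = -critic(torch.cat([obs, actor(obs)], dim=-1)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

obs = torch.randn(obs_dim)
for _ in range(300):  # fill the replay table a little before learning starts
    obs = actor_step(obs)
for _ in range(10):
    learner_step()
```

The actual D4PG learner replaces the squared-error target above with a categorical distributional critic loss, uses N-step returns and prioritized sampling, and runs many actor processes in parallel rather than interleaving the two loops in one process.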
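
The Open Datasets row refers to the publicly released control-suite benchmark (Tassa et al., 2018) built on the MuJoCo simulator. Here is a small usage sketch, assuming the benchmark is installed via the dm_control package; the cartpole/swingup choice is illustrative rather than a task singled out by the paper.

```python
# Load one control-suite task and run a random-action episode (illustration only).
import numpy as np
from dm_control import suite

env = suite.load(domain_name="cartpole", task_name="swingup")
action_spec = env.action_spec()

time_step = env.reset()
while not time_step.last():
    # Sample actions uniformly within the bounds; D4PG would query its actor network here.
    action = np.random.uniform(action_spec.minimum, action_spec.maximum,
                               size=action_spec.shape)
    time_step = env.step(action)
```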
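
The Experiment Setup row reports the main hyperparameters used across the experiments. They are gathered below into a single configuration object for reference; the field names are invented here, and only the numeric values come from the quoted text.

```python
# Hyperparameters quoted in the Experiment Setup row, collected in one place.
from dataclasses import dataclass

@dataclass
class D4PGConfig:
    replay_size: int = 1_000_000      # replay table of size R = 1e6
    exploration_sigma: float = 0.3    # fixed Gaussian noise, epsilon = 0.3
    lr_control_suite: float = 1e-4    # alpha_0 = beta_0 for the simple control problems
    lr_hard_tasks: float = 5e-5       # alpha_0 = beta_0 for the harder problems
    batch_control_suite: int = 256    # minibatch size M for the control suite
    batch_hard_tasks: int = 512       # minibatch size M for subsequent problems

print(D4PGConfig())
```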