Stochastic Variance-Reduced Policy Gradient
Authors: Matteo Papini, Damiano Binaghi, Giuseppe Canonaco, Matteo Pirotta, Marcello Restelli
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we suggest practical variants of SVRPG, and we empirically evaluate them on continuous MDPs. In this section, we evaluate the performance of SVRPG and compare it with policy gradient (PG) on well known continuous RL tasks: Cart-pole balancing and Swimmer (e.g., Duan et al., 2016). |
| Researcher Affiliation | Academia | 1Politecnico di Milano, Milano, Italy 2Inria, Lille, France. |
| Pseudocode | Yes | Algorithm 1 SVRG. Input: a dataset D_N, number of epochs S, epoch size m, step size α, initial parameter θ_m^0 := θ̃^0. Algorithm 2 SVRPG. Input: number of epochs S, epoch size m, step size α, batch size N, mini-batch size B, gradient estimator g, initial parameter θ_m^0 := θ̃^0 := θ_0 (a code sketch follows the table). |
| Open Source Code | Yes | Task implementations are from the rllab library (Duan et al., 2016), on which our agents are also based. Code available at github.com/Dam930/rllab. |
| Open Datasets | Yes | In this section, we evaluate the performance of SVRPG and compare it with policy gradient (PG) on well known continuous RL tasks: Cart-pole balancing and Swimmer (e.g., Duan et al., 2016). Task implementations are from the rllab library (Duan et al., 2016), on which our agents are also based. |
| Dataset Splits | No | The paper mentions evaluating performance using 'test-trajectories' but does not provide specific percentages or counts for training, validation, or test dataset splits. The environments used are continuous control tasks where data is generated by interaction, and traditional dataset splits are not explicitly defined in the text. |
| Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running its experiments. |
| Software Dependencies | No | The paper mentions using the 'rllab library' and 'Adam' for optimization but does not provide specific version numbers for these software components or any other dependencies. |
| Experiment Setup | Yes | For our algorithm, we use a batch size N = 100, a mini-batch size B = 10, and the jointly adaptive step size α and epoch length m proposed in Section 5.2. In all the experiments, we use deep Gaussian policies with adaptive standard deviation (details on network architecture in Appendix E). |
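
To make the Algorithm 2 inputs quoted in the Pseudocode row concrete, below is a minimal, hypothetical Python sketch of the SVRPG epoch structure (S epochs, epoch size m, step size α, batch size N, mini-batch size B, gradient estimator g). The helpers `sample_trajectories`, `grad_estimator`, and `importance_weight` are placeholders, not the authors' rllab-based implementation (available at github.com/Dam930/rllab), and the step size is fixed here rather than the jointly adaptive α and m the paper proposes in its Section 5.2.

```python
import numpy as np

def svrpg(theta0, sample_trajectories, grad_estimator, importance_weight,
          S=50, m=10, alpha=0.01, N=100, B=10):
    """Sketch of the SVRPG epoch structure.

    sample_trajectories(theta, n) -> list of n trajectories from policy theta
    grad_estimator(traj, theta)   -> policy-gradient estimate g(traj | theta), e.g. GPOMDP
    importance_weight(traj, theta_behav, theta_target) -> scalar importance weight
    All three callables are placeholders for the rllab-based components used in the paper.
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(S):
        # Take a snapshot policy and compute a full-batch gradient on N trajectories.
        theta_snap = theta.copy()
        snap_trajs = sample_trajectories(theta_snap, N)
        full_grad = np.mean([grad_estimator(t, theta_snap) for t in snap_trajs], axis=0)

        for _ in range(m):
            # Mini-batch of B trajectories from the *current* policy.
            trajs = sample_trajectories(theta, B)
            current_term = np.mean([grad_estimator(t, theta) for t in trajs], axis=0)
            # The snapshot term is corrected with importance weights because the
            # trajectories were sampled from the current policy, not the snapshot.
            snap_term = np.mean(
                [importance_weight(t, theta, theta_snap) * grad_estimator(t, theta_snap)
                 for t in trajs], axis=0)
            v = current_term - snap_term + full_grad  # variance-reduced direction
            theta = theta + alpha * v  # gradient ascent on expected return
    return theta
```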
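
And a hypothetical invocation matching the Experiment Setup row (batch size N = 100, mini-batch size B = 10). The paper additionally uses Adam and deep Gaussian policies with adaptive standard deviation, which the toy stand-ins below do not reproduce.

```python
import numpy as np

dim = 4  # toy parameter dimension; the paper uses deep Gaussian policy networks

# Dummy stand-ins so the sketch runs end to end; real versions would roll out
# rllab environments and compute GPOMDP gradients and importance weights.
dummy_sample = lambda theta, n: [None] * n
dummy_grad = lambda traj, theta: np.zeros(dim)
dummy_weight = lambda traj, theta_behav, theta_target: 1.0

theta_final = svrpg(np.zeros(dim), dummy_sample, dummy_grad, dummy_weight,
                    S=5, m=10, alpha=0.05, N=100, B=10)
```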