Stochastic Variance-Reduced Policy Gradient

Authors: Matteo Papini, Damiano Binaghi, Giuseppe Canonaco, Matteo Pirotta, Marcello Restelli

ICML 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Finally, we suggest practical variants of SVRPG, and we empirically evaluate them on continuous MDPs. In this section, we evaluate the performance of SVRPG and compare it with policy gradient (PG) on well known continuous RL tasks: Cart-pole balancing and Swimmer (e.g., Duan et al., 2016)."
Researcher Affiliation | Academia | Politecnico di Milano, Milano, Italy; Inria, Lille, France.
Pseudocode | Yes | "Algorithm 1 SVRG. Input: a dataset D_N, number of epochs S, epoch size m, step size α, initial parameter θ_m^0 := θ̃^0. Algorithm 2 SVRPG. Input: number of epochs S, epoch size m, step size α, batch size N, mini-batch size B, gradient estimator g, initial parameter θ_m^0 := θ̃^0 := θ_0." (A minimal code sketch of this loop is given after the table.)
Open Source Code | Yes | "Task implementations are from the rllab library (Duan et al., 2016), on which our agents are also based. Code available at github.com/Dam930/rllab."
Open Datasets | Yes | "In this section, we evaluate the performance of SVRPG and compare it with policy gradient (PG) on well known continuous RL tasks: Cart-pole balancing and Swimmer (e.g., Duan et al., 2016). Task implementations are from the rllab library (Duan et al., 2016), on which our agents are also based."
Dataset Splits | No | The paper mentions evaluating performance on "test-trajectories" but does not provide counts or percentages for training, validation, or test splits. The environments are continuous control tasks in which data is generated by interaction, so traditional dataset splits are not explicitly defined in the text.
Hardware Specification | No | The paper does not report hardware details such as GPU/CPU models, processor types, or memory amounts used for its experiments.
Software Dependencies | No | The paper mentions the rllab library and the Adam optimizer but does not provide version numbers for these or any other software dependencies.
Experiment Setup | Yes | "For our algorithm, we use a batch size N = 100, a mini-batch size B = 10, and the jointly adaptive step size α and epoch length m proposed in Section 5.2. In all the experiments, we use deep Gaussian policies with adaptive standard deviation (details on network architecture in Appendix E)." (A policy sketch follows the table.)
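
As a reading aid for the pseudocode row, here is a minimal sketch of the SVRPG epoch structure (Algorithm 2). It is not the authors' implementation: the helpers `sample_trajectories`, `grad_estimate` (e.g., a GPOMDP-style estimator), and `importance_weight` are hypothetical stand-ins for routines the paper builds on top of rllab.

```python
import numpy as np

def svrpg(theta0, S, m, alpha, N, B,
          sample_trajectories, grad_estimate, importance_weight):
    """Sketch of the SVRPG outer/inner loop, written as gradient ASCENT
    on the expected return J(theta)."""
    theta_snapshot = np.asarray(theta0, dtype=float)   # snapshot parameter (theta tilde)
    for _ in range(S):                                  # S epochs
        # Full-gradient estimate from N trajectories under the snapshot policy.
        snap_trajs = sample_trajectories(theta_snapshot, N)
        mu = np.mean([grad_estimate(tau, theta_snapshot) for tau in snap_trajs], axis=0)

        theta = theta_snapshot.copy()
        for _ in range(m):                              # m sub-iterations per epoch
            trajs = sample_trajectories(theta, B)       # mini-batch under current iterate
            # Correction term: importance weights let the same trajectories be
            # reused to estimate the gradient at the snapshot parameter.
            corr = np.mean(
                [grad_estimate(tau, theta)
                 - importance_weight(tau, theta, theta_snapshot)
                 * grad_estimate(tau, theta_snapshot)
                 for tau in trajs], axis=0)
            v = corr + mu                               # semi-stochastic gradient
            theta = theta + alpha * v
        theta_snapshot = theta                          # next snapshot
    return theta_snapshot
```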
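
The experiment-setup row reports deep Gaussian policies with adaptive standard deviation and the batch sizes N = 100 and B = 10. Below is a minimal, self-contained sketch of such a policy; the single hidden layer and its width are assumptions, since the paper defers architecture details to its Appendix E.

```python
import numpy as np

class GaussianPolicy:
    """Sketch of a Gaussian policy: the mean is a small neural network and the
    (state-independent) log standard deviation is a learned parameter."""
    def __init__(self, obs_dim, act_dim, hidden=32, seed=0):
        self.rng = np.random.default_rng(seed)
        self.W1 = self.rng.normal(scale=0.1, size=(hidden, obs_dim))
        self.b1 = np.zeros(hidden)
        self.W2 = self.rng.normal(scale=0.1, size=(act_dim, hidden))
        self.b2 = np.zeros(act_dim)
        self.log_std = np.zeros(act_dim)   # "adaptive standard deviation"

    def mean(self, obs):
        h = np.tanh(self.W1 @ obs + self.b1)
        return self.W2 @ h + self.b2

    def act(self, obs):
        # a ~ N(mean(obs), diag(exp(log_std))^2)
        noise = self.rng.standard_normal(self.log_std.shape)
        return self.mean(obs) + np.exp(self.log_std) * noise

# Batch sizes quoted from the paper's setup; the step size alpha and epoch
# length m follow the adaptive rule of its Section 5.2.
N, B = 100, 10
```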