Stochastic Variance-Reduced Policy Gradient

Authors: Matteo Papini, Damiano Binaghi, Giuseppe Canonaco, Matteo Pirotta, Marcello Restelli

ICML 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Finally, we suggest practical variants of SVRPG, and we empirically evaluate them on continuous MDPs. In this section, we evaluate the performance of SVRPG and compare it with policy gradient (PG) on well known continuous RL tasks: Cart-pole balancing and Swimmer (e.g., Duan et al., 2016)."
Researcher Affiliation | Academia | Politecnico di Milano, Milano, Italy; Inria, Lille, France.
Pseudocode | Yes | "Algorithm 1 SVRG. Input: a dataset D_N, number of epochs S, epoch size m, step size α, initial parameter θ_m^0 := θ̃^0. Algorithm 2 SVRPG. Input: number of epochs S, epoch size m, step size α, batch size N, mini-batch size B, gradient estimator g, initial parameter θ_m^0 := θ̃^0 := θ_0." (A minimal code sketch of this loop is given after the table.)
Open Source Code | Yes | "Task implementations are from the rllab library (Duan et al., 2016), on which our agents are also based. Code available at github.com/Dam930/rllab."
Open Datasets | Yes | "In this section, we evaluate the performance of SVRPG and compare it with policy gradient (PG) on well known continuous RL tasks: Cart-pole balancing and Swimmer (e.g., Duan et al., 2016). Task implementations are from the rllab library (Duan et al., 2016), on which our agents are also based."
Dataset Splits | No | The paper mentions evaluating performance on "test-trajectories" but does not provide counts or percentages for training, validation, or test splits. The environments are continuous control tasks in which data is generated by interaction, so traditional dataset splits are not explicitly defined in the text.
Hardware Specification | No | The paper does not report hardware details such as GPU/CPU models, processor types, or memory amounts used for its experiments.
Software Dependencies | No | The paper mentions the rllab library and the Adam optimizer but does not provide version numbers for these or any other software dependencies.
Experiment Setup | Yes | "For our algorithm, we use a batch size N = 100, a mini-batch size B = 10, and the jointly adaptive step size α and epoch length m proposed in Section 5.2. In all the experiments, we use deep Gaussian policies with adaptive standard deviation (details on network architecture in Appendix E)." (A policy sketch follows the table.)
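
As a reading aid for the pseudocode row, here is a minimal sketch of the SVRPG epoch structure (Algorithm 2). It is not the authors' implementation: the helpers `sample_trajectories`, `grad_estimate` (e.g., a GPOMDP-style estimator), and `importance_weight` are hypothetical stand-ins for routines the paper builds on top of rllab.

```python
import numpy as np

def svrpg(theta0, S, m, alpha, N, B,
          sample_trajectories, grad_estimate, importance_weight):
    """Sketch of the SVRPG outer/inner loop, written as gradient ASCENT
    on the expected return J(theta)."""
    theta_snapshot = np.asarray(theta0, dtype=float)   # snapshot parameter (theta tilde)
    for _ in range(S):                                  # S epochs
        # Full-gradient estimate from N trajectories under the snapshot policy.
        snap_trajs = sample_trajectories(theta_snapshot, N)
        mu = np.mean([grad_estimate(tau, theta_snapshot) for tau in snap_trajs], axis=0)

        theta = theta_snapshot.copy()
        for _ in range(m):                              # m sub-iterations per epoch
            trajs = sample_trajectories(theta, B)       # mini-batch under current iterate
            # Correction term: importance weights let the same trajectories be
            # reused to estimate the gradient at the snapshot parameter.
            corr = np.mean(
                [grad_estimate(tau, theta)
                 - importance_weight(tau, theta, theta_snapshot)
                 * grad_estimate(tau, theta_snapshot)
                 for tau in trajs], axis=0)
            v = corr + mu                               # semi-stochastic gradient
            theta = theta + alpha * v
        theta_snapshot = theta                          # next snapshot
    return theta_snapshot
```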
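
The experiment-setup row reports deep Gaussian policies with adaptive standard deviation and the batch sizes N = 100 and B = 10. Below is a minimal, self-contained sketch of such a policy; the single hidden layer and its width are assumptions, since the paper defers architecture details to its Appendix E.

```python
import numpy as np

class GaussianPolicy:
    """Sketch of a Gaussian policy: the mean is a small neural network and the
    (state-independent) log standard deviation is a learned parameter."""
    def __init__(self, obs_dim, act_dim, hidden=32, seed=0):
        self.rng = np.random.default_rng(seed)
        self.W1 = self.rng.normal(scale=0.1, size=(hidden, obs_dim))
        self.b1 = np.zeros(hidden)
        self.W2 = self.rng.normal(scale=0.1, size=(act_dim, hidden))
        self.b2 = np.zeros(act_dim)
        self.log_std = np.zeros(act_dim)   # "adaptive standard deviation"

    def mean(self, obs):
        h = np.tanh(self.W1 @ obs + self.b1)
        return self.W2 @ h + self.b2

    def act(self, obs):
        # a ~ N(mean(obs), diag(exp(log_std))^2)
        noise = self.rng.standard_normal(self.log_std.shape)
        return self.mean(obs) + np.exp(self.log_std) * noise

# Batch sizes quoted from the paper's setup; the step size alpha and epoch
# length m follow the adaptive rule of its Section 5.2.
N, B = 100, 10
```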