Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Stochastic Variance-Reduced Policy Gradient
Authors: Matteo Papini, Damiano Binaghi, Giuseppe Canonaco, Matteo Pirotta, Marcello Restelli
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we suggest practical variants of SVRPG, and we empirically evaluate them on continuous MDPs. In this section, we evaluate the performance of SVRPG and compare it with policy gradient (PG) on well known continuous RL tasks: Cart-pole balancing and Swimmer (e.g., Duan et al., 2016). |
| Researcher Affiliation | Academia | 1Politecnico di Milano, Milano, Italy 2Inria, Lille, France. |
| Pseudocode | Yes | Algorithm 1 SVRG. Input: a dataset D_N, number of epochs S, epoch size m, step size α, initial parameter θ̃_0; θ^0_m := θ̃_0. Algorithm 2 SVRPG. Input: number of epochs S, epoch size m, step size α, batch size N, mini-batch size B, gradient estimator g, initial parameter θ_0; θ^0_m := θ̃_0 := θ_0. |
| Open Source Code | Yes | Task implementations are from the rllab library (Duan et al., 2016), on which our agents are also based. Code available at github.com/Dam930/rllab. |
| Open Datasets | Yes | In this section, we evaluate the performance of SVRPG and compare it with policy gradient (PG) on well known continuous RL tasks: Cart-pole balancing and Swimmer (e.g., Duan et al., 2016). Task implementations are from the rllab library (Duan et al., 2016), on which our agents are also based. |
| Dataset Splits | No | The paper mentions evaluating performance using 'test-trajectories' but does not provide specific percentages or counts for training, validation, or test dataset splits. The environments used are continuous control tasks where data is generated by interaction, and traditional dataset splits are not explicitly defined in the text. |
| Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running its experiments. |
| Software Dependencies | No | The paper mentions using the 'rllab library' and 'Adam' for optimization but does not provide specific version numbers for these software components or any other dependencies. |
| Experiment Setup | Yes | For our algorithm, we use a batch size N = 100, a mini-batch size B = 10, and the jointly adaptive step size α and epoch length m proposed in Section 5.2. In all the experiments, we use deep Gaussian policies with adaptive standard deviation (details on network architecture in Appendix E). |
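The pseudocode row above refers to the SVRG-style epoch structure that SVRPG builds on: a full gradient is computed at a snapshot parameter once per epoch, and inner updates use a semi-stochastic gradient that corrects a mini-batch estimate with the snapshot anchor. The sketch below illustrates only that variance-reduced update structure on a toy finite-sum quadratic; it is not the authors' implementation (which maximises expected return over sampled trajectories and requires importance weights for off-snapshot samples), and all names here are hypothetical.

```python
import numpy as np

def svrg(grad_i, n, theta0, epochs=5, epoch_size=20, alpha=0.1, seed=0):
    """Minimal SVRG sketch: grad_i(theta, i) is the gradient of summand i."""
    rng = np.random.default_rng(seed)
    snapshot = np.asarray(theta0, dtype=float)
    for _ in range(epochs):
        # Full ("batch") gradient at the snapshot, the anchor term mu.
        mu = np.mean([grad_i(snapshot, i) for i in range(n)], axis=0)
        theta = snapshot.copy()
        for _ in range(epoch_size):
            i = rng.integers(n)  # mini-batch of size 1 for simplicity
            # Semi-stochastic gradient: stochastic part + snapshot correction.
            v = grad_i(theta, i) - grad_i(snapshot, i) + mu
            theta = theta - alpha * v
        snapshot = theta  # the new snapshot for the next epoch
    return snapshot

# Toy objective: f(theta) = (1/n) * sum_i (theta - c_i)^2, minimiser mean(c).
c = np.array([1.0, 2.0, 3.0, 4.0])
grad_i = lambda theta, i: 2.0 * (theta - c[i])
theta_star = svrg(grad_i, n=len(c), theta0=np.array([0.0]))
```

On this quadratic the correction term cancels the sampling noise exactly, so the inner loop contracts deterministically toward the minimiser; in SVRPG the same anchoring idea reduces, rather than eliminates, the variance of the policy-gradient estimate.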