Identifying Policy Gradient Subspaces

Authors: Jan Schneider, Pierre Schumacher, Simon Guist, Le Chen, Daniel Haeufle, Bernhard Schölkopf, Dieter Büchler

ICLR 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | This paper conducts a comprehensive empirical evaluation of gradient subspaces in the context of PG algorithms, assessing their properties across various simulated RL benchmarks. Our experiments reveal several key findings: (i) there exist parameter-space directions that exhibit significantly larger curvature compared to other parameter-space directions, (ii) the gradients live in the subspace spanned by these directions, and (iii) the subspace remains relatively stable throughout the RL training. |
| Researcher Affiliation | Academia | (1) Max Planck Institute for Intelligent Systems, Tübingen, Germany; (2) Hertie Institute for Clinical Brain Research, Tübingen, Germany; (3) Institute for Computer Engineering, University of Heidelberg, Germany |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code for our experiments is available on the project website. To facilitate reproducing our results, we make our code, as well as the raw analysis data, including hyperparameter settings and model checkpoints, publicly available on the project website. |
| Open Datasets | Yes | We apply the algorithms to twelve benchmark tasks from OpenAI Gym (Brockman et al., 2016), Gym Robotics (Plappert et al., 2018a), and the DeepMind Control Suite (Tunyasuvunakool et al., 2020). |
| Dataset Splits | No | The paper discusses training phases (initial, training, convergence) based on performance-improvement criteria, but it does not provide the dataset split information (exact percentages, sample counts, or citations to predefined splits) needed to reproduce data partitioning in typical machine learning experiments. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed machine specifications) used to run its experiments. |
| Software Dependencies | No | Our code builds upon the algorithm implementations of Stable Baselines3 (Raffin et al., 2021). The paper names a software library, Stable Baselines3, but does not give a specific version number for it or for other dependencies. |
| Experiment Setup | Yes | With the tuned hyperparameters from RL Baselines3 Zoo that we use for training, the PPO actor and critic usually contain around 5,000 parameters, and the SAC actor and critic around 70,000 and 140,000 parameters (two Q-networks of 70,000 parameters each), respectively. Appendix C (Table 1) provides detailed hyperparameter values and the ranges from which random configurations were drawn, including learning_rate, batch_size, n_steps, n_epochs, gamma, gae_lambda, clip_range, ent_coef, net_arch, train_freq, tau, and learning_starts. |
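The subspace phenomenon summarized in findings (i)–(iii) above can be illustrated on a toy quadratic loss: when a few Hessian eigenvalues dominate, the gradient concentrates in the subspace spanned by the corresponding eigenvectors. The following NumPy sketch is purely illustrative (the dimensions, eigenvalue gap, and construction are assumptions for the demo, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, k = 50, 5  # illustrative parameter dimension and subspace size

# Construct a Hessian with k dominant eigenvalues (high-curvature directions).
Q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
eigvals = np.concatenate([np.full(k, 100.0), np.full(dim - k, 0.1)])
H = Q @ np.diag(eigvals) @ Q.T

# Gradient of the quadratic loss L(theta) = 0.5 * theta^T H theta at a random point.
theta = rng.standard_normal(dim)
grad = H @ theta

# Project the gradient onto the subspace spanned by the top-k eigenvectors.
w, V = np.linalg.eigh(H)           # ascending eigenvalues
top_k = V[:, np.argsort(w)[-k:]]   # columns: k highest-curvature directions
proj = top_k @ (top_k.T @ grad)

# Fraction of the gradient norm captured by the high-curvature subspace;
# with this eigenvalue gap it should be very close to 1.
frac = np.linalg.norm(proj) / np.linalg.norm(grad)
print(f"gradient fraction in top-{k} subspace: {frac:.4f}")
```

Because each gradient component scales with its eigenvalue, the low-curvature residual is suppressed by the factor 0.1/100 here, which is why nearly all of the gradient norm lies in the 5-dimensional subspace.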