Adaptive-Gradient Policy Optimization: Enhancing Policy Learning in Non-Smooth Differentiable Simulations

Authors: Feng Gao, Liangzhi Shi, Shenao Zhang, Zhaoran Wang, Yi Wu

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On the theoretical side, we demonstrate AGPO's convergence, emphasizing its stable performance under non-smooth dynamics due to low variance. On the empirical side, our results show that AGPO effectively mitigates the challenges posed by non-smoothness in policy learning through differentiable simulation.
Researcher Affiliation | Academia | (1) Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China; (2) Northwestern University, Illinois, United States; (3) Shanghai Qi Zhi Institute, Shanghai, China.
Pseudocode | Yes | Algorithm 1: Adaptive-Gradient Policy Optimization. (The algorithm is not reproduced here; a hypothetical sketch of such an adaptive update appears after this table.)
Open Source Code | No | The paper states "We implemented our algorithm using JAX (Bradbury et al., 2018) for empirical analysis." but does not state that the code is released or provide a link to a repository for the described methodology.
Open Datasets | Yes | We employed the canonical Ant task from Brax (Freeman et al., 2021). (A minimal Brax setup sketch follows this table.)
Dataset Splits | No | The paper details episode lengths, number of runs, and parallel environments (e.g., "num envs = 64", "num eval envs = 128" in Table 1), but it does not report explicit training/validation/test splits. Such splits are standard in supervised learning but less common in RL, where data is generated by interacting with the environment.
Hardware Specification | Yes | We conducted our experiments on one NVIDIA GeForce RTX 3090 GPU with 24 GB GDDR6X memory.
Software Dependencies | Yes | We implemented our codes on the JAX framework, supporting XLA and automatic differentiation. ... We utilize the implementations provided by Stable Baselines3 (Raffin et al., 2021) and add a custom wrapper for our simulation environments. ... By leveraging auto-differentiation tools like PyTorch (Paszke et al., 2019) and JAX (Bradbury et al., 2018) or specially crafted differentiable kernels (Xu et al., 2022)... (A hypothetical minimal wrapper is sketched after this table.)
Experiment Setup | Yes | Table 1. Training hyper-parameters for AGPO. (Includes specific values for learning rate, batch size, hidden sizes, discount factor, etc. A placeholder configuration sketch follows this table.)
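
Algorithm 1 itself is not reproduced in this summary. As a rough illustration only, the JAX sketch below assumes an AGPO-style update that adaptively blends a first-order gradient obtained by differentiating through the simulator with a zeroth-order (evolution-strategies-style) estimate whenever the analytic gradient looks unreliable, e.g., when its empirical variance is large under non-smooth dynamics. The toy dynamics, the variance criterion, and the mixing rule are all hypothetical and are not the paper's Algorithm 1.

```python
import jax
import jax.numpy as jnp

def simulate_return(theta, x0, key, horizon=16):
    """Toy differentiable rollout: linear policy on 1-D dynamics with a
    non-smooth (clipped) transition; returns the cumulative reward."""
    def step(carry, _):
        x, key = carry
        key, sub = jax.random.split(key)
        u = theta * x + 0.01 * jax.random.normal(sub)   # stochastic policy
        x_next = jnp.clip(x + u, -1.0, 1.0)              # non-smooth dynamics
        reward = -(x_next ** 2)
        return (x_next, key), reward
    _, rewards = jax.lax.scan(step, (x0, key), None, length=horizon)
    return rewards.sum()

def first_order_grad(theta, x0, keys):
    """Analytic gradient through the simulator, averaged over rollouts."""
    g = jax.vmap(jax.grad(simulate_return), in_axes=(None, None, 0))(theta, x0, keys)
    return g.mean(), g.var()

def zeroth_order_grad(theta, x0, keys, sigma=0.05):
    """Simple ES-style estimate that requires no simulator gradient."""
    eps = jax.random.normal(keys[0], (keys.shape[0],))
    f = jax.vmap(lambda e, k: simulate_return(theta + sigma * e, x0, k))(eps, keys)
    return (f * eps).mean() / sigma

def adaptive_update(theta, x0, key, lr=1e-2, var_threshold=1.0):
    """Hypothetical adaptive rule: prefer the first-order gradient, fall back
    to the zeroth-order estimate when its empirical variance is too large."""
    keys = jax.random.split(key, 32)
    g1, v1 = first_order_grad(theta, x0, keys)
    g = jnp.where(v1 < var_threshold, g1, zeroth_order_grad(theta, x0, keys))
    return theta + lr * g
```

For example, `adaptive_update(0.1, jnp.float32(0.5), jax.random.PRNGKey(0))` performs one such update on the toy problem.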
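
The Ant task referenced above ships with Brax, so no separate dataset download is needed. The snippet below is a minimal setup sketch (not the authors' code) showing how the reported 64 training and 128 evaluation environments could be instantiated as batched, JIT-compiled resets and steps; the `episode_length` value here is illustrative.

```python
import jax
import jax.numpy as jnp
from brax import envs

# Canonical Ant task from Brax (Freeman et al., 2021).
env = envs.create(env_name="ant", episode_length=1000)

# Batched resets: 64 parallel training envs and 128 evaluation envs,
# matching the num_envs / num_eval_envs values quoted from Table 1.
reset_batch = jax.jit(jax.vmap(env.reset))
train_keys = jax.random.split(jax.random.PRNGKey(0), 64)
eval_keys = jax.random.split(jax.random.PRNGKey(1), 128)
train_states = reset_batch(train_keys)
eval_states = reset_batch(eval_keys)

# One batched simulation step with zero actions, just to show the interface.
step_batch = jax.jit(jax.vmap(env.step))
actions = jnp.zeros((64, env.action_size))
train_states = step_batch(train_states, actions)
```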
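
For the Stable Baselines3 baselines, the paper mentions a custom wrapper around the simulation environments but does not publish it. The class below is a hypothetical minimal version (single environment, gymnasium-style API), not the authors' implementation; Brax itself also provides Gym-style wrappers, which would be the more direct route.

```python
import gymnasium as gym
import jax
import jax.numpy as jnp
import numpy as np
from brax import envs
from stable_baselines3 import PPO

class BraxToGymEnv(gym.Env):
    """Hypothetical minimal wrapper exposing a single Brax env to Stable Baselines3."""

    def __init__(self, env_name="ant", seed=0):
        self._env = envs.create(env_name=env_name, episode_length=1000)
        self._reset_fn = jax.jit(self._env.reset)
        self._step_fn = jax.jit(self._env.step)
        self._key = jax.random.PRNGKey(seed)
        obs_dim, act_dim = self._env.observation_size, self._env.action_size
        self.observation_space = gym.spaces.Box(-np.inf, np.inf, (obs_dim,), np.float32)
        self.action_space = gym.spaces.Box(-1.0, 1.0, (act_dim,), np.float32)

    def reset(self, *, seed=None, options=None):
        self._key, sub = jax.random.split(self._key)
        self._state = self._reset_fn(sub)
        return np.asarray(self._state.obs, dtype=np.float32), {}

    def step(self, action):
        self._state = self._step_fn(self._state, jnp.asarray(action, dtype=jnp.float32))
        obs = np.asarray(self._state.obs, dtype=np.float32)
        # Brax folds time limits into `done`, so truncation is not separated here.
        return obs, float(self._state.reward), bool(self._state.done), False, {}

# Train a PPO baseline on the wrapped environment.
model = PPO("MlpPolicy", BraxToGymEnv(), verbose=1)
model.learn(total_timesteps=10_000)
```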
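
Table 1 is not reproduced here. The configuration sketch below only illustrates how such a hyper-parameter table is typically encoded in code; apart from num_envs = 64 and num_eval_envs = 128, which are quoted above, every value is a placeholder rather than the paper's setting.

```python
from dataclasses import dataclass

@dataclass
class AGPOConfig:
    # Quoted in the paper's Table 1.
    num_envs: int = 64
    num_eval_envs: int = 128
    # Placeholder values; consult Table 1 of the paper for the actual settings.
    learning_rate: float = 3e-4
    batch_size: int = 1024
    hidden_sizes: tuple = (256, 256)
    discount: float = 0.99
```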