Time Discretization-Invariant Safe Action Repetition for Policy Gradient Methods

Authors: Seohong Park, Jaekyeom Kim, Gunhee Kim

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically show that our method is not only δ-invariant but also robust to stochasticity, outperforming previous δ-invariant approaches on eight MuJoCo environments with both deterministic and stochastic settings. Our code is available at https://vision.snu.ac.kr/projects/sar.
Researcher Affiliation | Academia | Seohong Park, Seoul National University, artberryx@snu.ac.kr; Jaekyeom Kim, Seoul National University, jaekyeom@snu.ac.kr; Gunhee Kim, Seoul National University, gunhee@snu.ac.kr
Pseudocode | No | The paper describes the proposed method, Safe Action Repetition (SAR), using textual descriptions and mathematical formulations (e.g., Equation 6). However, it does not include any explicitly labeled pseudocode or algorithm blocks to detail the steps of the method.
Open Source Code | Yes | Our code is available at https://vision.snu.ac.kr/projects/sar.
Open Datasets | Yes | We test SAR on eight continuous control environments from MuJoCo [37]: InvertedPendulum-v2, InvertedDoublePendulum-v2, Hopper-v2, Walker2d-v2, HalfCheetah-v2, Ant-v2, Reacher-v2 and Swimmer-v2.
Dataset Splits | No | The paper conducts experiments on MuJoCo environments for reinforcement learning, involving training and evaluation. While it details the training process and varying time scales, it does not specify explicit training/validation/test dataset splits with percentages, sample counts, or predefined split citations in the manner typically found in supervised learning datasets.
Hardware Specification | No | The paper describes the experimental setup and evaluation environments (MuJoCo) but does not provide specific details about the hardware used, such as exact CPU or GPU models, memory specifications, or types of computing clusters.
Software Dependencies | No | The paper mentions the use of PyTorch [24] and Stable Baselines3 [25] in its references, implying their use in implementation. However, it does not provide specific version numbers for these or any other critical software dependencies (e.g., Python version, CUDA version) required for reproducibility.
Experiment Setup | Yes | We test SAR on eight continuous control environments from MuJoCo [37]: InvertedPendulum-v2, InvertedDoublePendulum-v2, Hopper-v2, Walker2d-v2, HalfCheetah-v2, Ant-v2, Reacher-v2 and Swimmer-v2. We mainly compare our method to FiGAR-C described in Section 4.2 because it is the only prior method that is δ-invariant and does not always require infinite decision steps even if δ → 0, but we also make additional comparisons with other baselines such as DAU [36], ARP [14] and modified PPO in Section 5.1 and Appendices F.2 and G. For SAR's distance function in Equation (6), we use D(s, sᵢ) = ‖s̄ − s̄ᵢ‖₁ / dim(S), where ‖·‖₁ is the ℓ1 norm and s̄ is the state normalized by its moving average. This distance function corresponds to the average difference in each normalized state dimension, where the normalization permits sharing the hyperparameter dmax for all MuJoCo tasks. We also share tmax in FiGAR-C for all environments. Finally, we impose an upper limit of tmax on the maximum duration of actions in SAR for two reasons: (1) to further stabilize training and (2) to ensure a fair comparison with FiGAR-C by setting the same limit on time duration.
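
The Experiment Setup row above quotes SAR's normalized ℓ1 distance and the tmax cap on action duration, but (as noted in the Pseudocode row) the paper gives no algorithm block. Below is a minimal, hedged Python sketch of that distance and of one SAR decision step, assuming the policy outputs an action together with a safe-region radius d (bounded by dmax), that the state is normalized with running mean/std statistics, and that the environment follows the classic Gym step API. The names normalized_l1_distance, sar_rollout_step, running_mean, and running_std are illustrative and not taken from the paper or its released code.

```python
import numpy as np

def normalized_l1_distance(s, s_i, running_mean, running_std, eps=1e-8):
    """Average per-dimension difference between moving-average-normalized states,
    mirroring the quoted distance D(s, s_i) = ||s_bar - s_bar_i||_1 / dim(S)."""
    s_bar = (s - running_mean) / (running_std + eps)
    s_i_bar = (s_i - running_mean) / (running_std + eps)
    return np.abs(s_bar - s_i_bar).sum() / s.shape[0]

def sar_rollout_step(env, policy, s0, running_mean, running_std, t_max):
    """One SAR decision step (illustrative sketch): repeat the chosen action until the
    state leaves the safe region of radius d around the decision state s0, or until
    t_max low-level steps have elapsed. `policy` is assumed to return (action, d)."""
    action, d = policy(s0)                 # d is assumed to be capped by d_max inside the policy
    s, total_reward, t = s0, 0.0, 0
    done = False
    while t < t_max and not done:
        s, r, done, _ = env.step(action)   # classic Gym step signature (assumption)
        total_reward += r
        t += 1
        if normalized_l1_distance(s, s0, running_mean, running_std) > d:
            break                          # left the safe region: hand control back to the policy
    return s, total_reward, t, done
```

Dividing the ℓ1 norm by dim(S) is what makes a single dmax plausible across tasks with different state dimensionalities, which is the sharing argument made in the quoted setup.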